
jEdit - a plain text editor


Written by: Rasmus Wernersson

Background: data in plain text format


In bioinformatics it is very common to store data in a simple plain text format. For example:

>pigeon_alpha-globin-D
ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTG
GAGCCGAGGCCCTGGAGAGGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTT
GCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAG
AGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACC
CTGTCAACTTCAAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTGGCCACACACCTGGGCAACGACTACAC
CCCGGAGGCACATGCTGCCTTCGACAAGTTCCTGTCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGA
TAA

The same approach is usually also used for other kinds of data - lists of gene names, statistics on DNA patterns etc. The main idea is to keep everything simple and open. That way it will be easy to use the data as input for different kinds of programs, and to write simple scripts (small programs) that read some kind of input, perform some sort of analysis and output the result in a readable manner.
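To make the idea of a "simple script" concrete, here is a minimal sketch in Python (the language is my choice for illustration, not something this exercise requires). It reads FASTA-style text, performs a tiny analysis (sequence length and GC fraction) and prints the result. The record "toy_seq" and the function names are made up for the example.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header line: emit the previous record, if any.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gc_fraction(seq):
    """Fraction of G and C letters in a DNA sequence (0.0 if empty)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical toy input; a real script would use open("some_file.fsa").
example = StringIO(">toy_seq\nATGC\nGGTA\n")
records = list(read_fasta(example))
for name, seq in records:
    print(f"{name}\tlength={len(seq)}\tGC={gc_fraction(seq):.2f}")
```

Because the input is plain text, the whole "parser" is a loop over lines. That is exactly the simplicity the plain text approach is meant to buy us.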

How difficult can it be? Text is text, right?

There are two main concerns when speaking about text files:

1) Plain text vs. Rich text / MS Word / Word Perfect / etc.

There exist a number of file formats that can contain text - usually in a nicely formatted manner, with embedded graphics and other fancy features. The problem here is twofold:

  • A lot of irrelevant information is added (visualized below): we simply don't care if the DNA sequence is in BOLD or a fancy font.

  • Even worse, there is no standard way to ignore this extra information, meaning that an MS Word file CANNOT be used as input to our sequence analysis programs.

Plain text vs. Word Processor file


2) Different interpretations of "plain text".
In the most widely used type of text files ("old school" text) each letter is represented by one byte (8 bits), which gives 256 possible symbols. How each numerical value is interpreted can potentially differ, and this mapping is known as the encoding.

Normally a derivative of the ASCII/ANSI encoding is used - see the table below. As can be seen from the table, the text "DNA" would be represented by the three numbers 68, 78, 65. If we wanted lower-case it would be 100, 110, 97.
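The mapping between letters and numbers is easy to check for yourself. The small Python sketch below (Python is my choice for illustration) uses the built-in ord() and chr() functions to convert between characters and their numerical values; the variable names are arbitrary.

```python
text = "DNA"

# ord() gives the numerical value of a single character.
codes = [ord(ch) for ch in text]
lower_codes = [ord(ch) for ch in text.lower()]
print(codes)        # [68, 78, 65]
print(lower_codes)  # [100, 110, 97]

# Encoding the text as ASCII produces exactly these byte values.
raw = text.encode("ascii")
print(list(raw))    # [68, 78, 65]

# And the reverse: from numbers back to text.
decoded = "".join(chr(n) for n in [68, 78, 65])
print(decoded)      # DNA
```

Notice that upper-case and lower-case letters differ by exactly 32 in this table.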

Notice that the values 0-31 are reserved for special purpose "letters" that have no visual representation (more on this later):

ASCII table 0-127

Since ASCII is an American standard, national characters like "Æ", "Ø" and "Å" are NOT represented in the standard part of the alphabet. Some of these characters are found in the range 128-255 (see the full set here: The Extended ASCII Chart - the table above also originates from this page). I will not go further into how the full range of national characters is handled (nor into the UNICODE standard), but rather give a short bit of advice: when creating sequence files, always stick to the English letters. While it might be tempting to name your sequence "Æsel_Insulin" or "ØrneDNA", there is no guarantee that it will work in all programs.
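A short illustration of why this advice matters, sketched in Python. Latin-1 (ISO 8859-1) and UTF-8 are used here as two common encodings; neither is named in the text above, so treat them as my examples. The helper is_ascii_safe is a made-up name.

```python
name = "Æsel_Insulin"

# The same 12-character name becomes different bytes
# depending on the encoding used when saving the file.
latin1_bytes = name.encode("latin-1")
utf8_bytes = name.encode("utf-8")
print(list(latin1_bytes[:1]))  # one byte for "Æ" in Latin-1
print(list(utf8_bytes[:2]))    # two bytes for "Æ" in UTF-8


def is_ascii_safe(text):
    """True if every character lies in the standard ASCII range 0-127."""
    return text.isascii()


print(is_ascii_safe(name))             # False
print(is_ascii_safe("Aesel_Insulin"))  # True
```

A program that assumes one encoding while the file was saved in another will see a garbled name, whereas names made of plain English letters look the same in all of these encodings.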

A second issue is that of Line Endings ("newlines").

Since a text file is basically just a long string of values between 0 and 255, a special symbol must be reserved to split the text into individual lines. This is done by appending an invisible (value 0-31) "newline" character at the end of each line. Unfortunately three standards exist for this:
  • UNIX standard:
    • 10 - LF ("Line feed" char).
  • Old Mac (Mac OS 9 and earlier):
    • 13 - CR ("Carriage Return" char).
  • DOS/Windows:
    • 13, 10 - both CR and LF.
Any text editor worth its salt can handle all three standards transparently. However, the most commonly used plain text editor in Windows ("Notepad") could NOT handle this issue before Windows 10 version 1809:

FAILS: Notepad trying to open a file with UNIX newlines.
WORKS: The same file, now with DOS/Windows style newlines.



(Wikipedia has a very long description of the newline issue here: newline).
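The three newline conventions listed above can also be told apart by a small script. The following Python sketch counts the CR (13) and LF (10) bytes in a chunk of data and reports the dominant style. The function names are made up for the example, and it assumes the data uses mostly one style.

```python
def detect_newline_style(data: bytes) -> str:
    """Guess the newline convention from raw bytes."""
    crlf = data.count(b"\r\n")       # 13, 10 pairs
    cr = data.count(b"\r") - crlf    # lone 13
    lf = data.count(b"\n") - crlf    # lone 10
    counts = {
        "Windows (CR+LF)": crlf,
        "Unix (LF)": lf,
        "Old Mac (CR)": cr,
    }
    style, n = max(counts.items(), key=lambda item: item[1])
    return style if n > 0 else "no newlines found"


def to_unix(data: bytes) -> bytes:
    """Convert Windows and Old Mac newlines to Unix newlines."""
    # Replace the two-byte pairs first, then any remaining lone CR.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


print(detect_newline_style(b">seq\nACGT\n"))
print(detect_newline_style(b">seq\r\nACGT\r\n"))
print(detect_newline_style(b">seq\rACGT\r"))
```

To test a real file, read it in binary mode, e.g. open(path, "rb").read(). In its default text mode Python translates all three newline styles to a single one while reading, which hides the difference.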


Installing and using jEdit

A large number of good plain text editors exist for various operating systems - for example NEdit for UNIX type systems, BBEdit for the Mac and UltraEdit for Windows. Some editors exist for multiple platforms, like the jEdit program we'll install and test in a moment.

Many such text editors were originally developed with programming in mind, and contain a number of features that make programming easier, such as syntax highlighting, which shows the various parts of the program being developed in different colors.

For our purpose we will just make use of the most basic functionality for viewing and editing DNA/protein sequence files: the ability to handle all kinds of newlines, a guarantee that files are saved in plain text format, and possibly advanced search-and-replace when creating/cleaning our own sequence files.

jEdit Screenshot

Download and Install jEdit

Obviously the first task will be to install jEdit: go to the jEdit website, www.jedit.org, and locate the latest "stable" release of jEdit for your platform of choice (for Windows pick the "Windows installer" - for Mac pick the "Mac OS X package"). Download & install the program package.

Make sure you know where the program has been installed, and where to find the short-cut to start it.

Taking jEdit for a test run

Download and unpack the following Zip archive which contains three different versions of the same sequence file: SeqExamplesNewlines.zip.

Contents of the archive:
alpha_globin_OldMac.fsa
alpha_globin_Unix.fsa
alpha_globin_Windows.fsa

In this case the files are in FASTA format (much more about FASTA in the later exercises) and have the extension ".fsa" - NOTICE: You can open any file with any extension in jEdit - as long as it contains text.

Open the files one by one in jEdit. They should look the same, and the line endings used in each file are indicated by the letter "U", "W" or "M" in the lower right hand corner (you can click the letter to change the format). If you are on the Windows platform, you can also try to open the files in "Notepad" and see what happens.

QUESTION 1:

  • Note down the FILE SIZE (in bytes) of each of the three files (just use the Windows Explorer - right click -> properties / Mac Finder + CMD i / Linux "ls -l" command).
  • Are they all the same size? Why/Why not?

On file extensions and default programs.

Download and unpack the following Zip archive which contains the SAME sequence information embedded in various popular document formats: SeqExamplesFormats.zip

Contents of the archive:
alpha_globin.doc
alpha_globin.html
alpha_globin.rtf

Open each of the files by clicking on it to launch the program associated with the file extension (typically Word for the .doc file, a browser for the .html file etc).

QUESTION 2:
  • Can we still find the same information (the DNA sequence) in each of the files?
  • Note down the size of the files - do they differ much?
Now try to open each of the files in jEdit - to see what's really in there.

QUESTION 3:
  • What kind of extra information has been added to the HTML and RTF files? (Is it "Human readable"?).
  • What kind of extra information has been added to the DOC file? Any surprises here?

Search and Replace & Block selection

Normal (line-based) selection in jEdit
Block selection in jEdit

From time to time it will be necessary to do a slight bit of editing in order to clean up the data we want to work with. In the following example we will be working with the DNA sequence listed below. The task is to clean it up - get rid of the numbers and spaces - and we want to do as little work as possible.

        1 AACGGGCACG GGACGCATGT AGCTGGAACA GTGGCAGCCG TAAATAATAA TGGTATCGGA
       61 GTTGCCGGGG TTGCAGGAGG AAACGGCTCT ACCAATAGTG GAGCAAGGTT AATGTCCACA
      121 CAAATTTTTA ATAGTGATGG GGATTATACA AATAGCGAAA CTCTTGTGTA CAGAGCCATT
      181 GTTTATGGTG CAGATAACGG AGCTGTGATC TCGCAAAATA GCTGGGGTAG TCAGTCTCTG
      241 ACTATTAAGG AGTTGCAGAA AGCTGCGATC GACTATTTCA TTGATTATGC AGGAATGGAC
      301 GAAACAGGAG AAATACAGAC AGGCCCTATG AGGGGAGGTA TATTTATAGC TGCCGCCGGA
      361 AACGATAACG TTTCCACTCC AAATATGCCT TCAGCTTATG AACGGGTTTT AGCTGTGGCC
      421 TCAATGGGAC CAGATTTTAC TAAGGCAAGC TATAGCACTT TTGGAACATG GACTGATATT
      481 ACTGCTCCTG GCGGAGATAT TGACAAATTT GATTTGTCAG AATACGGAGT TCTCAGCACT
      541 TATGCCGATA ATTATTATGC TTATGGAGAG GGAACATCCA TGGCTTGTCC ACATGTCGCC
      601 GGCGCCGCC

Open a new jEdit window and paste in the entire block of text. In order to get rid of the numbers we can use a handy feature of jEdit called Block Selection (the difference between "normal" line selection and block selection is illustrated above) - simply hold down Control (Windows+Linux) / CMD (Mac) while dragging the pointer to select a block. Select the block containing the numbers and hit delete.

Next we want to remove the spaces: Open the find dialog (Control F / CMD F). Notice that there are a ton of advanced options - we can safely ignore them for this simple purpose. Make sure that "Search in" is set to "Current buffer" (alternatively you can just select all the text and search in the selection). In the "Search for" field simply enter a single space, leave the "Replace with" field empty, and hit "Replace all" to see all the spaces disappear in a puff of smoke.
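For comparison, the same clean-up can be scripted. The Python sketch below is an alternative to the jEdit steps, not part of the exercise. It removes every digit and every whitespace character (spaces and newlines) with a single regular expression. Only the first two lines of the sequence are used as input to keep the example short, so the result is one unbroken line of letters.

```python
import re

# First two lines of the numbered sequence shown earlier.
raw = """
        1 AACGGGCACG GGACGCATGT AGCTGGAACA GTGGCAGCCG TAAATAATAA TGGTATCGGA
       61 GTTGCCGGGG TTGCAGGAGG AAACGGCTCT ACCAATAGTG GAGCAAGGTT AATGTCCACA
"""

# [\d\s]+ matches any run of digits and whitespace; replace it with nothing.
cleaned = re.sub(r"[\d\s]+", "", raw)

print(cleaned)
print(len(cleaned), "bases")
```

A script like this pays off when many files need the same treatment. For a single sequence the editor approach described above is usually quicker.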

QUESTION 4: Paste the cleaned-up DNA sequence into your report.

Conclusion

This concludes the short introduction to text editors. Whenever you work with "strange" sequence files during the course, remember that you can always inspect them using jEdit to find out what's really in there. The same holds true for other text based formats, such as the ones used for phylogenetic trees, as we will see later.