jEdit - a plain text editor
Written by: Rasmus Wernersson
Background: data in plain text format
In bioinformatics it is very common to store data in a simple plain text format. For example:
>pigeon_alpha-globin-D
ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTG
GAGCCGAGGCCCTGGAGAGGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTT
GCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAG
AGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACC
CTGTCAACTTCAAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTGGCCACACACCTGGGCAACGACTACAC
CCCGGAGGCACATGCTGCCTTCGACAAGTTCCTGTCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGA
TAA
The same approach is usually also used for other kinds of data - lists
of gene names, statistics on DNA patterns etc. The main idea is to keep
everything simple and open. That way it will be easy to use
the data as input for different kinds of programs, and to write simple
scripts (small programs) that read some kind of input, perform some
sort of analysis and output the result in a readable manner.
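To make the idea of a "simple script" concrete, here is a minimal sketch in Python. It assumes a FASTA-style text containing a single record (one ">" header line followed by sequence lines); the function name summarize_fasta and the toy record are made up for illustration and are not part of the course material.

```python
from collections import Counter


def summarize_fasta(text):
    """Return (name, length, letter counts) for a single-record FASTA string."""
    lines = text.strip().splitlines()
    name = lines[0].lstrip(">")  # header line without the leading ">"
    sequence = "".join(line.strip() for line in lines[1:])  # join the sequence lines
    return name, len(sequence), Counter(sequence)


# Toy record (hypothetical data): read input, analyse it, print a readable result.
name, length, counts = summarize_fasta(">toy_example\nATGC\nTAA\n")
print(name, length, dict(counts))  # toy_example 7 {'A': 3, 'T': 2, 'G': 1, 'C': 1}
```

Because the input is plain text, the whole "parser" is a handful of string operations; no special library is needed to get at the data.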
How difficult can it be? Text is text, right?
There are two main concerns when speaking about text files:
1) Plain text vs. Rich text / MS Word / Word Perfect / etc.
A number of file formats exist that can contain text - usually in a nicely
formatted manner, with embedded graphics and other fancy features. The
problem here is twofold:
- A lot of irrelevant information is added (visualized below): We
simply don't care if the DNA sequence is in BOLD or a fancy font.
- Even worse, there is no standard way to ignore this extra
information, meaning that an MS Word file
CANNOT be used as input to our sequence analysis programs.
2) Different interpretations of "plain text".
In the most widely used type of text files ("old school" text) each
letter is represented by one byte (8 bits) = 256 possible symbols. How
each numerical value is interpreted can potentially differ, and
this is known as the encoding.
Normally a derivative of ASCII/ANSI encoding is used - see the table
below. As can be seen from the table, the text "DNA" would be
represented by the three numbers: 68, 78, 65. If we wanted lower-case it would be 100, 110, 97.
Notice that the values 0-31 are reserved for special purpose "letters" that have no visual
representation (more on this later).
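The letter-to-number mapping described above can be checked with a few lines of Python. This is only an illustrative sketch; the helper name ascii_codes is made up here, while ord() and chr() are Python built-ins.

```python
def ascii_codes(text):
    """Return the numeric code of each character in text."""
    return [ord(ch) for ch in text]


print(ascii_codes("DNA"))  # [68, 78, 65]
print(ascii_codes("dna"))  # [100, 110, 97]

# Values in the range 0-31 have no visual representation;
# value 10, for example, is the "Line feed" character.
print(repr(chr(10)))  # '\n'
```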
Since ASCII is an American standard, national characters like "Æ",
"Ø" and "Å" are NOT represented in the standard part of
the alphabet - some of these characters are found in the range 128-255
(see the full set here: The
Extended ASCII Chart - the table above also originates from this
page). I will not go further into how the full
range of national characters is handled (nor the UNICODE standard) -
but rather give a short bit of advice: When
creating sequence files always stick to the English letters.
While it might be tempting to name your sequence "Æsel_Insulin" or "ØrneDNA", there is no
guarantee that it will work in all programs.
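A small Python sketch shows why such names are risky: the same letter "Æ" becomes different byte values depending on the encoding used when the file is saved. Latin-1 and UTF-8 are chosen here purely as two common examples, and the helper is_safe_name is a made-up name for illustration.

```python
def is_safe_name(name):
    """True if the name uses only characters from the standard ASCII range (0-127)."""
    return name.isascii()


name = "Æsel_Insulin"

# The first letter is stored as one byte in Latin-1 but as two bytes in UTF-8.
print(list(name.encode("latin-1")[:2]))  # [198, 115]
print(list(name.encode("utf-8")[:3]))    # [195, 134, 115]

print(is_safe_name(name))                     # False
print(is_safe_name("pigeon_alpha-globin-D"))  # True
```

A program that expects one encoding but receives the other will show the wrong character or fail outright, which is why sticking to the English letters is the safe choice.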
A second issue is that of line endings ("newlines").
Since a text file is basically just a long string of values
between 0-255, a special symbol must be reserved to split the text into
individual lines. This is done by appending an invisible (value 0-31) "newline" character at the end of
each line. Unfortunately three standards exist for this:
- UNIX standard:
- 10 - LF ("Line feed" char).
- Old Mac (System 9 and before):
- 13 - CR ("Carriage Return" char).
- DOS/Windows:
- 13 + 10 - CR followed by LF (both chars).
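As a rough sketch, the newline convention of a file can be guessed by looking for these byte values in its raw content. On DOS/Windows each line ends with the two-character pair 13 + 10 (CR followed by LF), so that pair is checked first. The function name newline_style is made up for illustration, and files mixing several conventions are not handled.

```python
def newline_style(data: bytes) -> str:
    """Guess which newline convention a chunk of raw file bytes uses."""
    if b"\r\n" in data:  # 13 followed by 10
        return "DOS/Windows"
    if b"\r" in data:    # 13 alone
        return "Old Mac"
    if b"\n" in data:    # 10 alone
        return "UNIX"
    return "no newlines"


print(newline_style(b">seq\nATG\n"))      # UNIX
print(newline_style(b">seq\rATG\r"))      # Old Mac
print(newline_style(b">seq\r\nATG\r\n"))  # DOS/Windows
```

To try this on a real file, read it in binary mode (for example open(path, "rb").read()) so that Python does not translate the line endings before you can look at them.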
Any text editor worth its salt can handle all three standards
transparently. However, the most commonly used plain text editor in
Windows ("Notepad") CANNOT handle this issue:

FAILS: NotePad trying to open a file with UNIX newlines
WORKS: Same file, now with DOS/Windows style newlines.
(Wikipedia has a very long description of the newline issue here: newline).
Installing and using jEdit
A large number of good plain text editors exist for various operating
systems - for example NEdit for UNIX type systems, BBEdit for the Mac
and UltraEdit for Windows - and some editors exist for multiple platforms,
like the jEdit program we'll install and test in a moment.
Many such text editors were originally developed with programming in
mind, and contain a number of features that make programming
easier, such as syntax highlighting, which shows various parts of the
program being developed in different colors.
For our purpose we will just make use of the most basic functionality
for viewing and editing DNA/Protein sequence files: The ability to
handle all kinds of newlines, a guarantee of saving the files in
plain text format and possibly advanced search-and-replace when
creating/cleaning our own sequence files.

Download and Install jEdit
Obviously the first task will be to install jEdit: Go to the jEdit
website: www.jedit.org and locate
the latest "stable" release of jEdit for your platform of choice (for
Windows pick the "Windows installer"
- for Mac pick the "Mac OS X package").
Download & install the program package.
Make sure you know where the program
has been installed, and where to find the short-cut to start it.
Taking jEdit for a test run
Download and unpack the following Zip archive which contains three
different versions of the same sequence file: SeqExamplesNewlines.zip.
Contents of the archive:
alpha_globin_OldMac.fsa
alpha_globin_Unix.fsa
alpha_globin_Windows.fsa
In this case the files are in FASTA format (much more about FASTA in
the later exercises) and have the extension ".fsa" - NOTICE: You can open
any file with any extension in jEdit - as long as it contains text.
Open the files one by one in jEdit - they should look the same, and
which line endings are used will be indicated by the letters "U", "W"
or "M" in the lower right hand corner (you can click the letter to
change the format) - if you are on the Windows platform, you can also
try to open the files in "Notepad"
and see what happens.
QUESTION 1:
- Note down the FILE SIZE (in bytes) of each of the three files (just use Windows Explorer:
right click -> Properties / Mac Finder: CMD + I / Linux: the "ls -l" command).
- Are they all the same size? Why/Why not?
On file extensions and default programs.
Download and unpack the following Zip archive which contains the SAME
sequence information embedded in various popular document formats: SeqExamplesFormats.zip
Contents of the archive:
alpha_globin.doc
alpha_globin.html
alpha_globin.rtf
Open each of the files by clicking on them to launch the program
associated with the file extension (typically Word for a .doc file, a
browser for an .html file etc).
QUESTION 2:
- Can we still find the same information (the DNA sequence) in each
of the files?
- Note down the size of the files - do they differ much?
Now try to open each of the files in
jEdit - to see what's really in there.
QUESTION 3:
- What kind of extra information has been added to the HTML and RTF
files? (Is it "Human readable"?).
- What kind of extra information has been added to the DOC file?
Any surprises here?
Search and Replace & Block selection

Normal - line based - selection
Block selection
From time to time it will be necessary to do a slight bit of editing in
order to clean up the data we want to work with. In the following
example we will be working with the DNA sequence listed below. The task
is to clean it up - get rid of the numbers and spaces - and we want to
do as little work as possible.
1 AACGGGCACG GGACGCATGT AGCTGGAACA GTGGCAGCCG TAAATAATAA TGGTATCGGA
61 GTTGCCGGGG TTGCAGGAGG AAACGGCTCT ACCAATAGTG GAGCAAGGTT AATGTCCACA
121 CAAATTTTTA ATAGTGATGG GGATTATACA AATAGCGAAA CTCTTGTGTA CAGAGCCATT
181 GTTTATGGTG CAGATAACGG AGCTGTGATC TCGCAAAATA GCTGGGGTAG TCAGTCTCTG
241 ACTATTAAGG AGTTGCAGAA AGCTGCGATC GACTATTTCA TTGATTATGC AGGAATGGAC
301 GAAACAGGAG AAATACAGAC AGGCCCTATG AGGGGAGGTA TATTTATAGC TGCCGCCGGA
361 AACGATAACG TTTCCACTCC AAATATGCCT TCAGCTTATG AACGGGTTTT AGCTGTGGCC
421 TCAATGGGAC CAGATTTTAC TAAGGCAAGC TATAGCACTT TTGGAACATG GACTGATATT
481 ACTGCTCCTG GCGGAGATAT TGACAAATTT GATTTGTCAG AATACGGAGT TCTCAGCACT
541 TATGCCGATA ATTATTATGC TTATGGAGAG GGAACATCCA TGGCTTGTCC ACATGTCGCC
601 GGCGCCGCC
Open a new jEdit window and paste in the entire block of text. In order
to get rid of the numbers we can use a handy feature of jEdit called Block Selection (the difference
between "normal" line selection and block selection is illustrated
above) - simply hold down Control (Windows+Linux) / CMD (Mac) while
dragging the pointer to select a block. Select the block containing the
numbers and hit delete.
Next we want to remove the spaces: Open the find dialog (Control F /
CMD F). Notice that there are a ton of advanced options - we can safely
ignore them for this simple purpose. Make sure that "Search in" is set to "Current buffer" (alternatively you
can just select all the text and search in the selection). In the "Search for" field simply
enter a single space - and hit "Replace all" to see all the
spaces disappear in a puff of smoke.
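For comparison, the same clean-up can be sketched as a small Python script. This is an alternative to the jEdit steps, not a replacement for them; the function names are made up for illustration, and the input is a shortened two-line excerpt of the sequence above.

```python
import re


def clean_line(line):
    """Remove position numbers and spaces from one line, keeping the sequence letters."""
    return re.sub(r"[\d ]+", "", line)


def clean_sequence(raw):
    """Clean every line of a numbered sequence listing, keeping the line breaks."""
    return "\n".join(clean_line(line) for line in raw.splitlines())


# Shortened excerpt: only the first two blocks of the first two lines.
excerpt = "1 AACGGGCACG GGACGCATGT\n61 GTTGCCGGGG TTGCAGGAGG"
print(clean_sequence(excerpt))
# AACGGGCACGGGACGCATGT
# GTTGCCGGGGTTGCAGGAGG
```

The regular expression removes any run of digits and spaces in one pass, which corresponds to doing the block deletion and the "Replace all" step together.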
QUESTION 4: Paste the cleaned-up DNA sequence into your report.
Conclusion
This concludes the short introduction to text editors. Whenever you
work with "strange" sequence files during the course, remember that you
can always inspect them using jEdit to find out what's really in there. The same
holds true for other text-based formats such as the ones used for
phylogenetic trees, as we will see later.