Required reading
Learning Perl, ed. 4:
Chapter 7; p. 100-106
Chapter 8; p. 107-115
Chapter 9; p. 121-123, 127-128
or
Learning Perl, ed. 5:
Chapter 7; p. 107-115 mid
Chapter 8; p. 117-126 mid, 129 mid-131
Chapter 9; p. 135-137, 140-142
or
Learning Perl, ed. 6:
Chapter 7; p. 121-131 mid
Chapter 8; p. 133-135 mid, 138-144
Chapter 9; p. 155-158
A very useful pdf on regular expressions
Notes on metasymbol tables.
Subjects covered
Regular expressions, pattern matching, substitution, translation.
Necessary files to complete this exercise
To download the files to your system, hold down the Shift key while
you left-click on the blue link, then follow the instructions.
dna7.fsa
data1.gb
data2.gb
data3.gb
data4.gb
You can play around with these files as much as you like. If you change or
destroy them, just download them again.
Remember to write #!/usr/bin/perl -w on the first line of
your programs.
All the following exercises have to be done in Perl
- Let us improve on exercise 2.6. Use regular expressions (RE) to
check that the data you get when asking for numbers are actually numbers.
Also check that the operation is valid.
These should all be considered as
numbers: "4" "-7" "0.656" "-67.35555"
These are not numbers: "5." "56F" ".32" "-.04"
- Improve exercise 5.6 by using regular expressions to find the ID,
accession number and amino acid sequence. Note: This exercise also covers
verification and printing of a FASTA file.
- Improve exercise 4.7 using all you have learned.
The program shall now take a DNA FASTA file (getting the file name from the
command line or asking interactively for it; both methods shall work) and
reverse and complement all entries in the file. There can be more than one entry;
study dna7.fsa. Hint: Use substitution or transliteration (translation)
for complementing the DNA.
- The last exercises all deal with the files data1-4.gb, which
are various GenBank entries of genes. First, study the files and notice the
structure of the data. In all exercises you will have to parse the files (read them and find
the wanted data) using REs, which are well suited for that purpose.
Every exercise adds to the previous ones, so the final program can do a lot.
Remember: your program should be able to handle all the files, but just one at a time.
- Extract the accession number, the definition and the organism
(and print it).
- Extract and print all MEDLINE article numbers which are mentioned in the entries.
- Extract and print the translated gene (the amino acid sequence).
Look for the line starting with /translation=.
Generalize: an amino acid sequence can be short,
i.e. only one line in the feature table, or long,
i.e. more than one line in the feature table.
- Extract and print the DNA (the whole base sequence at the end of the file).
- Extract and print ONLY the coding DNA. It is described in FEATURES - CDS (Coding DNA Sequence).
As an example, the line in data1.gb says 'join(2424..2610,3397..3542)', which means that the coding sequence
is bases 2424-2610 followed by bases 3397-3542. The bases in between are
an intron and not part of the coding DNA. Remember to generalize; there
can be more (or fewer) than two exons, and the 'join' line can continue on the next line.
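
For the number check in the first exercise, the accepted and rejected examples suggest one possible pattern: an optional minus sign, one or more digits, and optionally a decimal point followed by one or more digits. The sketch below is only one way to express that. The helper name is_number is made up for illustration, and the sketch assumes that a leading plus sign and exponent notation are not wanted, which the exercise text does not state.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper (name chosen for this sketch): returns 1 if the
# string matches the number rules listed in the exercise, 0 otherwise.
sub is_number {
    my ($text) = @_;
    # ^ and \z anchor the pattern so the WHOLE string must match:
    #   -?          optional minus sign
    #   \d+         at least one digit before the decimal point
    #   (?:\.\d+)?  optional decimal point followed by at least one digit
    return $text =~ /^-?\d+(?:\.\d+)?\z/ ? 1 : 0;
}

# Try the examples from the exercise text.
for my $candidate ('4', '-7', '0.656', '-67.35555', '5.', '56F', '.32', '-.04') {
    printf "%-10s %s\n", $candidate,
        is_number($candidate) ? 'number' : 'not a number';
}
```

Remember to chomp the input read from STDIN before testing it, or the trailing newline will make every input fail against the \z anchor.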
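
For the reverse-complement exercise, the hint about transliteration can be sketched as follows. The helper name reverse_complement and the test sequence are made up for illustration. The sketch only handles the bases A, C, G and T in upper or lower case (any other character is left unchanged), and it leaves out reading the FASTA file and keeping the header lines.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper (name chosen for this sketch): reverse-complement
# a single DNA string that contains no header line and no newlines.
sub reverse_complement {
    my ($dna) = @_;
    # Assigning to a scalar puts reverse in scalar context,
    # so it reverses the characters of the string.
    my $revcomp = reverse $dna;
    # tr replaces each base with its complement in one pass.
    $revcomp =~ tr/ACGTacgt/TGCAtgca/;
    return $revcomp;
}

# Made-up test sequence, not taken from dna7.fsa.
print reverse_complement('ATGCAAGT'), "\n";
```

A chain of s/A/T/g, s/T/A/g substitutions would not work directly, because the second substitution undoes the first; tr avoids this by mapping all characters in a single pass.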
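
For the coding DNA exercise, the 'join' description can be treated as a list of start..end pairs. The sketch below shows one possible approach, assuming that the CDS location has already been collected into a single string (continuation lines included) and that the DNA is available as one string without numbers or spaces. The helper names parse_location and coding_dna and the toy sequence are made up for illustration. Locations wrapped in complement(...) or marked with < or > are not handled here.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: turn a CDS location string into a list of
# [start, end] pairs, one pair per exon.
sub parse_location {
    my ($location) = @_;
    my @exons;
    # With /g in a while loop, every start..end pair is found in turn,
    # no matter how many exons there are.
    while ($location =~ /(\d+)\.\.(\d+)/g) {
        push @exons, [$1, $2];
    }
    return @exons;
}

# Hypothetical helper: cut the exons out of the full sequence and
# concatenate them in the order given.
sub coding_dna {
    my ($dna, @exons) = @_;
    my $coding = '';
    for my $exon (@exons) {
        my ($start, $end) = @$exon;
        # GenBank positions are 1-based and inclusive; substr is 0-based.
        $coding .= substr($dna, $start - 1, $end - $start + 1);
    }
    return $coding;
}

# The location quoted in the exercise text.
printf "exon: %d-%d\n", @$_ for parse_location('join(2424..2610,3397..3542)');

# Toy data (made up, not from the data files): exons in upper case.
print coding_dna('aaaCCCCttttGGG', parse_location('join(4..7,12..14)')), "\n";
```

Because the pattern only looks for digit pairs separated by two dots, the same code also accepts a CDS with a single range and no 'join' at all.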