A program can only be executed when it has execute permission:
chmod 755 <filename>
Remember to write #!/usr/bin/perl on the first line of
your programs.
Necessary files to complete these exercises
To download the files to your system, just press the Shift key while
you left click on the blue link. Follow the instructions.
ex1_1.dat
sprot1.dat
sprot2.dat
sprot3.dat
sprot4.dat
dna.fsa
ex5.acc
data1.gb
data2.gb
data3.gb
data4.gb
start10.dat
res10.dat
dna7.fsa
FastaParse.pm
- Write 'Hello World' on the screen.
- Ask for a positive integer and calculate the factorial (n!) of that number.
Display the result. If input is negative, display an error message.
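One possible shape for the calculation is sketched below. The sub name factorial is only an illustration, and the sketch uses a fixed value instead of asking the user, so reading from the keyboard is left out. It treats 0 as valid input (0! = 1) and rejects anything that is not a non-negative integer.

```perl
use strict;
use warnings;

# Hypothetical helper: returns n! or dies with an error message.
sub factorial {
    my ($n) = @_;
    die "Error: input must be a non-negative integer\n"
        unless defined $n && $n =~ /\A\d+\z/;
    my $result = 1;
    $result *= $_ for 2 .. $n;   # empty range for 0 and 1, so the result stays 1
    return $result;
}

print factorial(5), "\n";   # prints 120
```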
- Make a program that calculates
the mean value of the numbers in the file ex1_1.dat.
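A minimal sketch of the averaging step follows. It assumes the numbers are separated by whitespace, which may not match the actual layout of ex1_1.dat, and it uses a few made-up lines in place of the file.

```perl
use strict;
use warnings;

# Hypothetical helper: arithmetic mean of a list, undef for an empty list.
sub mean {
    my @numbers = @_;
    return unless @numbers;
    my $sum = 0;
    $sum += $_ for @numbers;
    return $sum / @numbers;
}

# Made-up lines standing in for the content of ex1_1.dat.
my @lines   = ("1 2", "3 6");
my @numbers = map { split ' ', $_ } @lines;
print mean(@numbers), "\n";   # prints 3
```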
- The file dna.fsa contains the same human DNA in FASTA format.
This format is VERY often used in bioinformatics. Look at it using less
and get used to the format. Observe the first line which starts with a >
and identifies the sequence. The name (AB000410 in this case) MUST
uniquely identify a sequence in the file. This is a DNA (actually mRNA) sequence taken from
the GenBank database. Now make a program that reverse complements the sequence
and writes it into the file revdna.fsa just like you did in previous
assignments. This time you have to keep the first identifying line, so the
sequence can be identified. You must add 'ComplementStrand' at the end
of that line, though, so you later know that it is the complement strand.
Summary: Keep the first line and reverse complement the sequence.
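The core of the reverse complement can be sketched as below. The header text and the short sequence are made up for illustration, and reading dna.fsa and writing revdna.fsa are left out.

```perl
use strict;
use warnings;

# Reverse the string, then swap A<->T and C<->G (both cases).
sub reverse_complement {
    my ($dna) = @_;
    my $rc = reverse $dna;          # scalar context reverses the characters
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# Hypothetical header and sequence, standing in for the content of dna.fsa.
my $header   = '>AB000410 example header';
my $sequence = 'ATGCCG';

print "$header ComplementStrand\n";
print reverse_complement($sequence), "\n";   # prints CGGCAT
```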
- This exercise deals with SwissProt.
The file sprot1.dat is a SwissProt database entry. Study it with
less. Locate the SwissProt ID (SP96_DICDI),
the accession number (P14328) and the amino
acid sequence (MRVLLVLVAC....TTTATTTATS). There are other entries
(sprot2.dat, sprot3.dat, sprot4.dat). Your program should work on
those, too. Also, your program must solve all the problems in ONE
reading of the file.
Read the ID, the accession number and the amino acid sequence.
Save the data to a
file sprot.fsa in FASTA format. Look in the file dna.fsa
for an example of FASTA. Notice that the first line starts with > and
is immediately followed by a unique identifier, like an accession number
or a SwissProt ID. Any other data must be on the header line only, but in free format.
Sequence data is on the following lines.
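Once the three pieces of data are in variables, building the FASTA record could look like the sketch below. The sub name format_fasta and the default line width of 60 are assumptions made for illustration, and parsing the SwissProt entry is not shown.

```perl
use strict;
use warnings;

# Hypothetical helper: build a FASTA record with a fixed sequence line width.
sub format_fasta {
    my ($id, $description, $sequence, $width) = @_;
    $width ||= 60;                         # assumed default line width
    my $record = ">$id $description\n";
    for (my $pos = 0; $pos < length($sequence); $pos += $width) {
        $record .= substr($sequence, $pos, $width) . "\n";
    }
    return $record;
}

# Width 4 only to show the line wrapping on a short example.
print format_fasta('P14328', 'SP96_DICDI', 'MRVLLVLVAC', 4);
```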
- In the file ex5.acc there are 6461 unique GenBank accession numbers
(taken from HU6800 DNA array chip).
An inexperienced bioinformatician unfortunately fouled up the list, so many
of the accession numbers appear more than once. It is your job to clean
the list, so that each accession number appears only once, and in alphabetical order.
Save the new list in clean.acc. Hint: After sorting a list, duplicates
are "next" to each other, thereby making them easy to find and
eliminate. Another way to go about this is a hash - say no more :-)
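The hash approach hinted at above can be sketched like this, using a few made-up accession numbers in place of the content of ex5.acc. The sub name unique_sorted is hypothetical.

```perl
use strict;
use warnings;

# Keep the first occurrence of each item, then sort alphabetically.
sub unique_sorted {
    my @items = @_;
    my %seen;
    my @unique = grep { !$seen{$_}++ } @items;   # true only the first time an item is seen
    my @sorted = sort @unique;
    return @sorted;
}

# Made-up accession numbers standing in for ex5.acc.
my @accessions = qw(X02345 AB000410 X02345 M12345 AB000410);
print "$_\n" for unique_sorted(@accessions);
```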
- What regular expression would you use to check whether the data in a string
really is a number?
These should all be considered as numbers: "4" "-7" "0.656" "-67.35555"
These are not numbers: "5." "56F" ".32" "-.04"
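One pattern that agrees with the examples above is sketched below. It is one possibility, not the only acceptable answer: an optional minus sign, at least one digit, and an optional fraction that must contain digits after the dot.

```perl
use strict;
use warnings;

# Returns 1 if the text looks like a number under the rules above, else 0.
sub is_number {
    my ($text) = @_;
    return $text =~ /\A-?\d+(?:\.\d+)?\z/ ? 1 : 0;
}

for my $candidate ('4', '-7', '0.656', '-67.35555', '5.', '56F', '.32', '-.04') {
    printf "%-10s %s\n", $candidate, is_number($candidate) ? 'number' : 'not a number';
}
```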
- The next exercises (8-15) all deal with the files data1-4.gb, which
are various GenBank entries of genes. First you should study the files and notice the
structure of the data. In all exercises you will have to parse the files (read them and find
the wanted data) using regular expressions (REs), which are well suited to that purpose.
Every exercise adds to the previous ones, so the final program can do a lot.
Remember: your program should be able to handle all the files, but just one at a time.
- Extract the accession number, the definition and the organism
(and print it).
- Extract and print all MEDLINE article numbers which are mentioned in the entries.
- Extract and print the translated gene (the amino acid sequence).
Look for the line starting with /translation=.
Generalize: an amino acid sequence can be short,
i.e. only one line in the feature table, or long,
i.e. more than one line in the feature table.
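A sketch of one way to handle both the short and the long case: if the whole entry is held in a single string, the text between the quotes after /translation= can be captured in one match and the whitespace removed afterwards. Reading the whole entry into one string is an assumption of this sketch, and the sample text is made up.

```perl
use strict;
use warnings;

# Capture everything between the quotes, even across line breaks.
sub extract_translation {
    my ($entry) = @_;
    if ($entry =~ m{/translation="([^"]*)"}) {
        my $protein = $1;
        $protein =~ s/\s+//g;   # drop line breaks and indentation
        return $protein;
    }
    return;                     # no translation found
}

# Made-up fragment in the style of a GenBank feature table.
my $entry = join "\n",
    '     CDS             1..30',
    '                     /translation="MKV',
    '                     LLA"',
    'ORIGIN';
print extract_translation($entry), "\n";   # prints MKVLLA
```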
- Extract and print the DNA in the entries data1-4.gb (the whole base sequence at the end of the file).
- Extract and print ONLY the coding DNA. That is described in FEATURES - CDS (Coding DNA Sequence).
As an example, the line in data1.gb says 'join(2424..2610,3397..3542)' and means that the coding sequence
is bases 2424-2610 followed by bases 3397-3542. The bases in between are
an intron and not a part of the coding DNA. Remember to generalize; there
can be more (or less) than two exons, and the 'join' line can continue on the next line.
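The step from a location string to the coding DNA can be sketched as follows, assuming the location text and the full DNA sequence have already been extracted into strings. Positions in GenBank are 1-based, while substr is 0-based. The sketch does not handle locations such as complement(...), and the sample data is made up.

```perl
use strict;
use warnings;

# Collect every start..end range in the location and join the pieces.
sub coding_dna {
    my ($location, $dna) = @_;
    my $coding = '';
    while ($location =~ /(\d+)\.\.(\d+)/g) {
        my ($start, $end) = ($1, $2);
        $coding .= substr($dna, $start - 1, $end - $start + 1);
    }
    return $coding;
}

# Made-up sequence: exons in upper case, the rest in lower case.
my $dna = 'aaaCCCgggTTTaaa';
print coding_dna('join(4..6,10..12)', $dna), "\n";   # prints CCCTTT
```

Because the loop simply collects every start..end pair, it works for one exon, for many, and for a location that was joined from several lines.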
- In the data1.gb file there are 6 references (to articles). Make
a program that extracts all authors from the references, eliminates those
that are duplicates and prints the list of persons who had anything to
do with this GenBank entry. This should also work for the other GenBank entries.
Beware: there are traps in this exercise, so check your output properly.
- In the GenBank files data?.gb you should extract the coding DNA
sequence as you already have done. Next you have to display a list of
codons USED in the coding sequence and the number of times they are used.
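Counting the codons could look like the sketch below, assuming the coding DNA is already in one string. The sub name count_codons and the sample sequence are made up, and any incomplete codon at the end is ignored.

```perl
use strict;
use warnings;

# Walk through the sequence three bases at a time and count each codon.
sub count_codons {
    my ($coding) = @_;
    my %count;
    for (my $pos = 0; $pos + 3 <= length($coding); $pos += 3) {
        $count{ uc substr($coding, $pos, 3) }++;
    }
    return %count;
}

my %codons = count_codons('atggccatgtaa');
print "$_\t$codons{$_}\n" for sort keys %codons;
```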
- You have made a program (let's call it the X-program),
which as input takes a file of accession numbers, start10.dat
and produces some output, which is in res10.dat.
Now you count the lines in your input file and your
output file and you discover that the line counts do not match. Horror -
your program does not produce output for some input. Now the assignment is
to discover which accession numbers did not produce output. This can be done
in various ways, but now you have to use a hash (as look-up table). Print the
results.
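The look-up idea can be sketched as below. It assumes the accession numbers have already been pulled out of both files into two lists. How they appear in res10.dat is not shown here, and the sample values are made up.

```perl
use strict;
use warnings;

# Return the inputs that never show up among the outputs.
sub missing_from_output {
    my ($input_ref, $output_ref) = @_;
    my %produced = map { $_ => 1 } @$output_ref;    # look-up table
    return grep { !$produced{$_} } @$input_ref;
}

# Made-up accession numbers standing in for start10.dat and res10.dat.
my @input  = qw(A1 B2 C3 D4);
my @output = qw(A1 C3);
print "$_\n" for missing_from_output(\@input, \@output);
```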
- Now we should use some object-oriented (OO) techniques. OO programming is
very often used in modules. A module is a collection of subroutines which
somebody benevolent has made available for your use. You can find many Perl
modules at
http://www.cpan.org/.
For now start by saving the file FastaParse.pm in the directory where
your program will be. This is an OO module, which I made for easy reading of
fasta files. The first thing you
should do would be reading the file. There is first a description of the
module, then comes the code. You should not worry about the code,
although it is good to learn from when you make your own modules.
The important part is the synopsis (first in the file), which tells you
how to use the module.
First you should make a small program that proves that you have downloaded
and placed the module in the correct place. It could be the program in the
synopsis of the module. If it runs without errors, you are set.
Your first Perl statement in a program that uses the module should be:
use FastaParse; which loads the module.
After that you can use the module as described. Notice the use of '->'
to refer to methods and/or data encapsulated in the module.
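To illustrate the '->' notation in general, here is a tiny self-contained sketch. The class TinyCounter and its methods are invented for this illustration and have nothing to do with the actual interface of FastaParse.pm. For that, rely on the synopsis in the module.

```perl
use strict;
use warnings;

package TinyCounter;   # hypothetical class, not part of FastaParse.pm

# Constructor: called as TinyCounter->new(start => 5).
sub new {
    my ($class, %args) = @_;
    my $self = { count => $args{start} // 0 };
    return bless $self, $class;
}

# Methods receive the object as their first argument.
sub increment { my ($self) = @_; $self->{count}++; return $self; }
sub value     { my ($self) = @_; return $self->{count}; }

package main;

my $counter = TinyCounter->new(start => 5);   # class method call
$counter->increment;                          # object method call
print $counter->value, "\n";                  # prints 6
```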
- Use this module to parse/read the fasta file
dna7.fsa and solve ex. 6. I repeat the text of the
exercise for convenience: Now make a program that reverse complements the sequence
and writes it into the file revdna.fsa in fasta format. This time
you have to keep the first identifying line, so the
sequence can be identified. You must add 'ReverseComplement' at the end
of that line, though, so you later know that it is the reverse complement.