Exercises


A program can only be executed when it has execute permission: chmod 755 <filename>
Remember to write #!/usr/bin/perl on the first line of your programs.

Necessary files to complete these exercises
To download the files to your system, just press the Shift key while you left click on the blue link. Follow the instructions.
ex1_1.dat
sprot1.dat
sprot2.dat
sprot3.dat
sprot4.dat
dna.fsa
ex5.acc
data1.gb
data2.gb
data3.gb
data4.gb
start10.dat
res10.dat
dna7.fsa
FastaParse.pm


  1. Write 'Hello World' on the screen.
  2. Ask for a positive integer and calculate the factorial (n!) of that number. Display the result. If input is negative, display an error message.
  3. Make a program that calculates the mean value of the numbers in the file ex1_1.dat.
  4. The file dna.fsa contains the same human DNA in FASTA format. This format is VERY often used in bioinformatics. Look at it using less and get used to the format. Observe the first line, which starts with a > and identifies the sequence. The name (AB000410 in this case) MUST uniquely identify a sequence in the file. This is a DNA (actually mRNA) sequence taken from the GenBank database. Now make a program that reverse complements the sequence and writes it into the file revdna.fsa, just like you did in previous assignments. This time you have to keep the first identifying line, so the sequence can be identified. You must add 'ComplementStrand' at the end of that line, though, so you later know that it is the complement strand.
    Summary: Keep the first line and reverse complement the sequence.
  5. This exercise deals with SwissProt. The file sprot1.dat is a SwissProt database entry. Study it with less. Locate the SwissProt ID (SP96_DICDI), the accession number (P14328) and the amino acid sequence (MRVLLVLVAC....TTTATTTATS). There are other entries (sprot2.dat, sprot3.dat, sprot4.dat). Your program should work on those, too. Also, your program must solve all the problems in ONE reading of the file.
    Read the ID, the accession number and the amino acid sequence. Save the data to a file sprot.fsa in FASTA format. Look in the file dna.fsa for an example of FASTA. Notice that the first line starts with > and is immediately followed by a unique identifier, like an accession number or a SwissProt ID. Any other data must be on the header line only, but in free format. Sequence data is on the following lines.
  6. In the file ex5.acc there are 6461 unique GenBank accession numbers (taken from HU6800 DNA array chip). An inexperienced bioinformatician unfortunately fouled up the list, so many of the accession numbers appear more than once. It is your job to clean the list, so that each accession number appears only once and the list is in alphabetical order. Save the new list in clean.acc. Hint: After sorting a list, duplicates are "next" to each other, thereby making them easy to find and eliminate. Another way to go about this is a hash - say no more :-)
  7. What regular expression would you use to check if data in a string really is a number? These should all be considered as numbers: "4" "-7" "0.656" "-67.35555" These are not numbers: "5." "56F" ".32" "-.04"
  8. The next exercises (8-15) all have to do with the files data1-4.gb, which are various GenBank entries of genes. First you should study the files and notice the structure of the data. In all exercises you will have to parse the files (read them and find the wanted data) using regular expressions (REs), which are very well suited for that purpose. Every exercise adds to the previous ones, so the final program can do a lot. Remember: your program should be able to handle all files, but just one at a time.
  9. Extract the accession number, the definition and the organism (and print it).
  10. Extract and print all MEDLINE article numbers which are mentioned in the entries.
  11. Extract and print the translated gene (the amino acid sequence). Look for the line starting with /translation=. Generalize: an amino acid sequence can be short, i.e. only one line in the feature table, or long, i.e. more than one line in the feature table.
  12. Extract and print the DNA in the entries data1-4.gb (the whole base sequence at the end of the file).
  13. Extract and print ONLY the coding DNA. That is described in FEATURES - CDS (Coding DNA Sequence). As an example, the line in data1.gb says 'join(2424..2610,3397..3542)', which means that the coding sequence consists of bases 2424-2610 followed by bases 3397-3542. The bases in between are an intron and not a part of the coding DNA. Remember to generalize: there can be more (or fewer) than two exons, and the 'join' line can continue on the next line.
  14. In the data1.gb file there are 6 references (to articles). Make a program that extracts all authors from the references, eliminates those that are duplicates and prints the list of persons who had anything to do with this GenBank entry. This should also work for the other GenBank entries. Beware: there are traps in this exercise, so check your output properly.
  15. In the GenBank files data?.gb you should extract the coding DNA sequence as you already have done. Next you have to display a list of codons USED in the coding sequence and the number of times they are used.
  16. You have made a program (let's call it the X-program), which takes a file of accession numbers, start10.dat, as input and produces some output, which is in res10.dat. Now you count the lines in your input file and your output file and you discover that the line counts do not match. Horror - your program does not produce output for some input. Now the assignment is to discover which accession numbers did not produce output. This can be done in various ways, but now you have to use a hash (as look-up table). Print the results.
  17. Now we should use some object-oriented techniques. OO programming is very often used in modules. A module is a collection of subroutines which somebody benevolent has made available for your use. You can find many Perl modules at http://www.cpan.org/.
    For now start by saving the file FastaParse.pm in the directory where your program will be. This is an OO module, which I made for easy reading of fasta files. The first thing you should do is read the file. It starts with a description of the module, followed by the code. You should not worry about the code, although it is good to learn from when you make your own modules. The important part is the synopsis (first in the file), which tells you how to use the module.
    First you should make a small program that proves that you have downloaded and placed the module in the correct place. It could be the program in the synopsis of the module. If it runs without errors, you are set.
    Your first Perl statement in a program that uses the module should be: use FastaParse; which loads the module. After that you can use the module as described. Notice the use of '->' to refer to methods and/or data encapsulated in the module.
  18. Use this module to parse/read the fasta file dna7.fsa and solve ex. 4. I repeat the text of the exercise for convenience: Now make a program that reverse complements the sequence and writes it into the file revdna.fsa in fasta format. This time you have to keep the first identifying line, so the sequence can be identified. You must add 'ReverseComplement' at the end of that line, though, so you later know that it is the reverse complement.
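
The hint in exercise 6 about using a hash can be illustrated with a small sketch. It is not a full solution: it works on an in-memory list rather than on ex5.acc, and the subroutine name unique_sorted and the sample values are made up for illustration. The idea is that a hash key can exist only once, so storing every item as a key removes duplicates.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the distinct items of a list in alphabetical order.
# unique_sorted is a hypothetical helper name, not part of the course material.
sub unique_sorted {
    my @items = @_;
    my %seen;
    # A key can only exist once in a hash, so duplicates collapse here.
    $seen{$_} = 1 for @items;
    return sort keys %seen;
}

# Made-up sample values, NOT taken from ex5.acc.
my @sample = qw(X00002 A00001 X00002 M00003 A00001);
print "$_\n" for unique_sorted(@sample);
# prints A00001, M00003 and X00002, one per line
```

The same look-up idea applies to exercise 16: put the items from one file into a hash, then test each item from the other file with exists.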
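
The 'join(2424..2610,3397..3542)' notation from exercise 13 can be turned into start/end pairs with a global regular expression match. The sketch below assumes the location text has already been collected into one string. The names parse_ranges and extract_coding and the toy sequence are hypothetical. It only handles plain start..end ranges, not other location operators such as complement(...).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect every "start..end" pair found in a location string.
sub parse_ranges {
    my ($location) = @_;
    my @ranges;
    while ($location =~ /(\d+)\.\.(\d+)/g) {
        push @ranges, [$1, $2];
    }
    return @ranges;
}

# Concatenate the stretches of $dna described by the ranges.
sub extract_coding {
    my ($dna, @ranges) = @_;
    my $coding = '';
    for my $range (@ranges) {
        my ($start, $end) = @$range;
        # GenBank positions are 1-based and inclusive; substr is 0-based.
        $coding .= substr($dna, $start - 1, $end - $start + 1);
    }
    return $coding;
}

# The example line quoted in the exercise text.
print "$_->[0]-$_->[1]\n" for parse_ranges('join(2424..2610,3397..3542)');
# prints 2424-2610 and 3397-3542

# Toy sequence and toy location, made up for illustration.
my $toy = 'aaaCCCtttGGGaaa';
print extract_coding($toy, parse_ranges('join(4..6,10..12)')), "\n";
# prints CCCGGG
```

Because the pattern is applied with /g in a loop, it copes with any number of exons without changes.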
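
Exercises 4 and 18 both rely on reverse complementing a sequence. One common way to do this in Perl, shown here as a sketch on a toy string rather than on dna.fsa or dna7.fsa, is to reverse the string and then swap the bases with tr. The subroutine name reverse_complement is an assumption for illustration. Characters other than A, C, G and T (in either case) are left unchanged.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reverse the sequence, then replace each base by its complement.
sub reverse_complement {
    my ($seq) = @_;
    my $rc = reverse $seq;            # scalar context: reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;     # A<->T and C<->G, case preserved
    return $rc;
}

# Toy sequence, made up for illustration.
print reverse_complement('AACGT'), "\n";
# prints ACGTT
```

Reading the header line, appending the required label and writing the result to revdna.fsa are left to the exercise itself.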


This page was last updated         by Peter Wad Sackett