Lesson 5: Program Structure and Perl Bug Finding


Required reading
Learning Perl, ed. 4:
Chapter 2; p. 18-36
Chapter 5; p. 76-77,82-84

or
Learning Perl, ed. 5:
Chapter 5; p. 79 mid-80

or
Learning Perl, ed. 6:
Nothing to read, really. Perhaps the entire chapter 5, where you can see a lot of what you can do, and a lot that you should not do if you want to write clear code.
Notes (from pws); functions: printf, sprintf, uc, ucfirst, lc, lcfirst.

Subjects covered
How to structure your code in smaller parts.
Finding bugs in your program with use strict; and -w.
Formatting output: printf prints according to a format string; sprintf is similar to printf, except that the result is returned as a string; uc returns a string uppercased; ucfirst uppercases just the first letter; lc returns a string lowercased; lcfirst lowercases just the first letter.
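
A minimal sketch of the formatting functions listed above. The variable names and values are made up for illustration only; they do not come from the exercise files.

```perl
#!/usr/bin/perl -w
use strict;

# Made-up values, used only to illustrate the functions.
my $name  = "swissprot";
my $count = 629;
my $ratio = 0.123456;

# printf prints directly, following the format string:
# %-10s = left-justified string in 10 columns, %5d = integer in 5 columns,
# %.2f  = floating point number with 2 decimals.
printf("%-10s %5d %.2f\n", $name, $count, $ratio);

# sprintf takes the same arguments but returns the string instead of printing it.
my $line = sprintf("%s has %d residues", ucfirst($name), $count);
print "$line\n";

# The case functions return a new string; the original variable is unchanged.
print uc($name), "\n";         # SWISSPROT
print ucfirst($name), "\n";    # Swissprot
print lc("HUMAN"), "\n";       # human
print lcfirst("HUMAN"), "\n";  # hUMAN
```

Since sprintf returns its result, it is handy when the formatted text has to be stored or written to a file later rather than printed at once.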

From now on, 2 points will be subtracted for each solution that does not "use strict;" or does not use proper, consistent indentation (to a maximum of 4 points per exercise).

Necessary files to complete this exercise
To download the files to your system, just press the Shift key while you left click on the blue link. Follow the instructions.
sprot.dat
sprot2.dat
sprot3.dat
sprot4.dat
dna.fsa
orphans.sp

You can play around with these files as much as you like. If you change or destroy them, just download them again.

Remember to write #!/usr/bin/perl -w on the first line of your programs.
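
As an illustration of how a program can be split into smaller parts, here is a possible skeleton. The subroutine names clean_line and count_nonempty and the input data are hypothetical; they are only meant to show the layout: first line, use strict;, subroutines, then the main part.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: remove the line ending and surrounding whitespace.
sub clean_line {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;
    return $line;
}

# Hypothetical helper: count the lines that still contain something
# after cleaning.
sub count_nonempty {
    my @lines = @_;
    my $count = 0;
    foreach my $line (@lines) {
        $count++ if clean_line($line) ne "";
    }
    return $count;
}

# Main part: made-up input instead of a real file.
my @input = ("first line\n", "   \n", "  second line  \n");
my $total = count_nonempty(@input);
print "Non-empty lines: $total\n";

# With "use strict;" every variable must be declared with my, so a
# misspelled name such as $totl is reported when the program is compiled,
# instead of silently being treated as a new, empty variable.
```

Each subroutine does one small job and can be checked on its own, which makes it easier to locate a bug than in one long block of code.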


All the following exercises have to be done in Perl

  1. This and the following 5 exercises deal with SwissProt. The file sprot.dat is a SwissProt database entry. Study it with less. Locate the SwissProt ID (SP96_DICDI), the accession number (P14328) and the amino acid sequence (MRVLLVLVAC....TTTATTTATS). There are other entries (sprot2.dat, sprot3.dat, sprot4.dat). Your programs should work on those, too. Also, your programs must solve all the problems in ONE reading of the file.
  2. Make a program that reads the ID and prints it.
  3. Add the following functionality to the program: Read the accession number and print it.
  4. Add the following functionality to the program: Read the amino acid sequence and print it.
  5. Add the following functionality to the program: Verification of the number of amino acids. This means extracting the number from the SQ line (example: SQ SEQUENCE 629 AA;) and checking that the amino acid sequence has that number of residues. It should be the program that determines if something is wrong, not the user. Imagine that before you go home, you set the computer to run through a million SwissProt entries. The next day, you must be able to see what failed. In a sense you don't care about what succeeded, as that is the common case. You care about what failed, because that is where you must take action.
  6. Now that you have the ID, accession number and AA sequence, save them to a file sprot.fsa in FASTA format. Look in the file dna.fsa for an example of FASTA. Notice that the first line starts with > and immediately after comes a unique identifier, like an accession number or a SwissProt ID. Any other data must be on the header line only, but in free format. Sequence data is on the following lines.
    Notice that this exercise incorporates the previous 5.
  7. In the file dna.fsa is some DNA. Construct a program that finds possible translation starts :-)
    All proteins start with the amino acid methionine (at least when translating; Met might be removed in later processing stages). Methionine is coded by ATG. The exercise is therefore: find the position of all ATGs in the sequence. The first position is 83, as humans count.
    In some organisms different start codons are possible. If you really want to, you can make the program handle those cases too.
  8. Assuming that the first Met at position 83 is the translation start, find the corresponding translation stop (which is the first one in frame). A stop codon is coded by TAA, TAG, or TGA. Remember that the stop codon has to be in the same reading frame as the ATG. Notice: There are two ways of solving this exercise. The primitive way is to start at the position given. The more general and better way is to find the first ATG and then find the corresponding stop codon. See here for explanation
  9. Make a program that asks for an organism, like 'HUMAN' or 'RAT'. The program should then count the number of lines in the file orphans.sp where a SwissProt identifier is present with said organism, i.e. PARG_HUMAN and LUM_HUMAN are the first two (but not the last) for HUMAN.
  10. Playing time again. Make the guessing program from last week count how many attempts it needed to guess the number and print the count when done guessing. It must be able to detect if you lie (and say so, of course). Also, if you haven't done it before, make the program guess in the fewest possible guesses (binary search for you experts out there).
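
The two ideas behind exercises 7 and 8, counting positions "as humans count" and staying in the same reading frame, can be sketched on a made-up sequence. This is only one possible approach, not a model solution. The subroutine names find_starts and find_stop and the toy DNA string are hypothetical, and the sketch assumes the sequence is already in one uppercase string without the header line or line breaks.

```perl
#!/usr/bin/perl -w
use strict;

# Return the positions of all ATGs in a sequence, counting from 1.
sub find_starts {
    my ($seq) = @_;
    my @positions;
    my $from = 0;
    while ((my $hit = index($seq, "ATG", $from)) != -1) {
        push @positions, $hit + 1;   # index() counts from 0, humans from 1
        $from = $hit + 1;            # continue one base later
    }
    return @positions;
}

# Return the position (counting from 1) of the first stop codon that is
# in frame with the start codon at $start, or 0 if there is none.
sub find_stop {
    my ($seq, $start) = @_;
    # Begin at the codon after ATG and move 3 bases at a time,
    # which keeps the reading frame.
    for (my $i = $start - 1 + 3; $i + 3 <= length($seq); $i += 3) {
        my $codon = substr($seq, $i, 3);
        return $i + 1 if $codon =~ /^(?:TAA|TAG|TGA)$/;
    }
    return 0;
}

# Toy sequence, not taken from dna.fsa.
my $dna = "CCATGAAATGACCTAGGG";
my @starts = find_starts($dna);
print "ATG found at: @starts\n";

foreach my $start (@starts) {
    my $stop = find_stop($dna, $start);
    if ($stop) {
        print "Start $start has its in-frame stop at $stop\n";
    } else {
        print "Start $start has no in-frame stop\n";
    }
}
```

In the toy sequence the TGA that overlaps the first ATG is ignored, because it is not in the same reading frame as that ATG. Stepping 3 bases at a time from the start codon is what enforces this.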

This page was last updated by Peter Wad Sackett, pws@cbs.dtu.dk