Solutions to exam in 27013, May 28, 2003

 

Assignment 1:

 

a) To find the amino acid sequence I would look for the line starting with "SQ ". The sequence is then given on the following lines, until the line starting with "//". I would use the green line/red line approach to extract the sequence. To find the variants, I would look for lines that start with "FT " and have "VARIANT" or "MUTAGEN" as the second word. In such a line the first number is the start position of the variation and the second number is the end position. After that comes e.g. W -> R, which means that W is changed to R in this variant.
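A minimal sketch of the FT parsing described above (the spacing of the line and the example itself are hypothetical; the real extraction is in c)):

# Hypothetical variant line: "FT   VARIANT      54     54       W -> R (in some variant)."
if ($line =~ m/^FT\s+(VARIANT|MUTAGEN)\s+(\d+)\s+(\d+)\s+(\w)\s?->\s?(\w)/) {
   ($start, $end, $new) = ($2, $3, $5); } # start position, end position, new amino acid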

 

b) The procedure is described in pseudocode. Pseudocode is not "real" code; it just illustrates the intended logic.

$flag = 0;

$seq = '';

while (still_unread_lines_in_file) {

   read_line;

   $ID = get_ID if line_starts_with('ID');

   # Get the sequence

   $flag=0 if line_starts_with('//');

   $seq .= $line if $flag == 1;

   $flag=1 if line_starts_with('SQ');

   # Get variations

   if (line_starts_with('FT') and (second_word_is('VARIANT') or

       second_word_is('MUTAGEN'))) {

      push(@position, get_first_number);

      push(@change, get_changed_AA); }

} # end while

 

output_fasta("$ID original", $seq);

while (@position) {

   $posi = shift @position;

   $AA = shift @change;

   $variant = substitute_in_sequence($seq, $posi, $AA); # work on a copy so the changes do not accumulate in $seq

   output_fasta("$ID Variation $posi -> $AA", $variant); }

 

sub output_fasta {

   my ($headline, $seq) = @_;

   my $i;

   write_to_file ">$headline\n";

   for ($i = 0; $i < length($seq); $i += 60) {

      write_to_file substr($seq, $i, 60), "\n"; }

}

 


c) Program in Perl.

$flag = 0;

$seq = '';

@position = (); # Not necessary

@change = ();  # Not necessary

open(IN, "Appendix1") || die "Can't read file\n";

while (defined ($line = <IN>)) {

   $ID = $1 if $line =~ m/^ID\s+(\w+)/;

   # get sequence

   if ($flag == 1) {

      if ($line =~ m/^\/\//) {

         $flag = 0; }

      else {

         chomp $line;

         $line =~ s/\s//g; # remove spaces

         $seq .= $line; } }

   elsif ($line =~ m/^SQ\s/) {

      $flag = 1; }

   # get variations

   if ($line =~ m/^FT\s+(VARIANT|MUTAGEN)\s/) {

      if ($line =~ m/^FT\s+\w+\s+(\d+)\s+\d+\s+\w\s?->\s?(\w)/) {

         push(@position, $1);

         push(@change, $2); }

      else {

         die "Error in line: $line"; } }

}

close IN;

 

open(OUT, ">fasta") or die "Can't write fasta\n";

&output_fasta("$ID original", $seq);

for ($i = 0; $i <= $#position; $i++) {

   $variant = substr($seq, 0, $position[$i] - 1);

   $variant .=  $change[$i] . substr($seq, $position[$i]);

   &output_fasta("$ID Variation $position[$i] -> $change[$i]", $variant); }

close OUT;

 

sub output_fasta {

   my ($headline, $seq) = @_;

   my $i;

   print OUT ">$headline\n";

   for ($i = 0; $i < length($seq); $i += 60) {

      print OUT substr($seq, $i, 60), "\n"; }

}

 

d) The program checks for one error already: variants have to follow a certain strict pattern, or else the program stops. It is difficult to improve on that. The program also has the standard error checks when opening files. I could add a check that the flag is 0 at the end (ensuring proper termination of the SwissProt entry), that $ID has been set to something meaningful, and likewise for the sequence. The variant lines in the feature table (FT) also contain the original amino acid; we could use that to double check that we are changing the right amino acid at the right position.
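A minimal sketch of such checks (here @original is an assumed extra array holding the original amino acid captured from each FT line, alongside @position and @change from the program in c)):

warn "Entry not properly terminated by //\n" if $flag != 0;
warn "No ID line found\n" unless defined $ID;
warn "No sequence found\n" if $seq eq '';
# Double check that the sequence really has the expected amino acid at each position
for ($i = 0; $i <= $#position; $i++) {
   $found = substr($seq, $position[$i] - 1, 1);
   warn "Position $position[$i]: expected $original[$i], found $found\n" unless $found eq $original[$i];
}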

 


e) It seems quite obvious from the variant lines that a variation can change not just one amino acid but 2, 3 or more in a row. This can be seen from the fact that each variant has both a begin AND an end position, which would not be necessary if only one position could change. The format also seems to have room for something like WR -> AG, meaning that two amino acids are changed.
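A sketch of how the regex and the substitution from c) might be generalized to handle such ranges (the exact layout of multi-residue FT lines is an assumption here):

# Capture start, end and the new stretch of amino acids, e.g. from "WR -> AG"
if ($line =~ m/^FT\s+(VARIANT|MUTAGEN)\s+(\d+)\s+(\d+)\s+(\w+)\s?->\s?(\w+)/) {
   ($from, $to, $new) = ($2, $3, $5);
   $variant = $seq;
   substr($variant, $from - 1, $to - $from + 1) = $new; } # replace the whole stretch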

 

 

Assignment 2:

 

a) In order to exclude the sequences on the negative list from the calculations, these have to be found and translated first. There are several ways of doing this, but here it is done with a hash. Notice that the chosen method requires no more memory than the negative list itself.

# Read negative list

while (lines_in_negative_list) {

   read_line; # line is SwissProtID

   $neq{$line} = 1; } # creates the negative hash, keyed by SwissProtID

# Read translation list and translate

while (lines_in_translation_file) {

   read_line;

   extract_SwissProtID_and_AccessionNumber;

   $neq{$SwissProtID} = $AccessionNumber if exists $neq{$SwissProtID}; }

# Now the hash has as keys SwissProtIDs and values AccessionNumbers

# This is not immediately useful, as we need the other way around.

# That can be done as both values and keys are unique. Several ways to

# do this, but the easiest is to reverse the hash.

%neg = reverse %neq;

 

Now we have a useful negative list. To find the accession numbers in the data file with the highest and lowest average, we just run through the whole file, discard accessions on the negative list, calculate the average of the rest, and keep an eye on the highest/lowest encountered so far. Observation: there is actually no need to calculate the average; the sum is sufficient for finding the highest and lowest.

while (lines_in_data_file) {

   read_line;

   get_accession_from_line;

   unless (AccessionNumber_in_negative_list) {

       get_numbers_from_line;

       calculate_sum;

       if ($sum > $highsum) {

          $highsum = $sum;

          $highacc = $AccessionNumber; }

       if ($sum < $lowsum) {

          $lowsum = $sum;

          $lowacc = $AccessionNumber; } }

}

print "Higest average: $highacc $highsum/6\n";

print "Lowest average: $lowacc $lowsum/6\n";

 


b) Perl code.

# Reading negative list

open(IN, "appendix4") or die "Can't open negative list\n";

while (defined ($line = <IN>)) {

   chomp $line;

   $neq{$line} = 1; }

close IN;

# Reading translation file

open(IN, "appendix5") or die "Can't open translation file\n";

while (defined ($line = <IN>)) {

   @tab = split(' ', $line);

   $neq{$tab[0]} = $tab[2] if exists $neq{$tab[0]}; }

close IN;

# reverse the hash, i.e. swap keys and values

%neg = reverse %neq;

# Read datafile and calculate underway

$highsum = 0;

$highacc = '';

$lowsum = 6;

$lowacc = '';

open(IN, "appendix3") or die "Can't open data file\n";

while (defined ($line = <IN>)) {

   @tab = split(' ', $line);

   unless (exists $neg{$tab[0]}) {

      $sum = $tab[1] + $tab[2] + $tab[3] + $tab[4] + $tab[5] + $tab[6];

      if ($sum > $highsum) {

         $highsum = $sum;

         $highacc = $tab[0]; }

      if ($sum < $lowsum) {

         $lowsum = $sum;

         $lowacc = $tab[0]; } }

}

close IN;

print "Higest average: $highacc ", $highsum/6, "\n";

print "Lowest average: $lowacc ", $lowsum/6, "\n";

 

c) The major assumption here is that I expect the file with the translation list to completely cover all possible accession numbers/SwissProt IDs. This can clearly not be relied upon, and the program ought to have a warning built in. Otherwise I have made some minor assumptions, the first being that there are 6 scores/numbers on each line of data in the data file. It is a major limitation, but since I wrote the code for the earlier program which generated the data file, I know there are 6 and only 6 scores on each line. If I want to use this code on other, similar data, I have to do away with this limitation. Like this:

 

$sum = $tab[1]; # at least one score;

for ($i = 2; $i <= $#tab; $i++) {

   $sum += $tab[$i]; }
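The warning for the first assumption (an incomplete translation list) could be as simple as this sketch, placed right after the translation file has been read, using the %neq hash from b):

foreach $id (keys %neq) {
   warn "No accession number found for SwissProt ID $id\n" if $neq{$id} eq '1'; }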

 

The second minor assumption is that all scores are between 0 and 1. This is used when I initialize $highsum to 0 and $lowsum to 6. Again a major limitation, but the same reasoning applies. To remove this limitation I need to restructure a bit, and if I change anything I should remove both limitations at once.

$highsum = ''; # empty string means "nothing seen yet"

$highacc = '';

$lowsum = '';

$lowacc = '';

open(IN, "appendix3") or die "Can't open data file\n";

while (defined ($line = <IN>)) {

   @tab = split(' ', $line);

   unless (exists $neg{$tab[0]}) {

      $sum = $tab[1]; # at least one score;

      for ($i = 2; $i <= $#tab; $i++) {

         $sum += $tab[$i]; }

      if ($highsum eq '' or $sum > $highsum) {

         $highsum = $sum;

         $highacc = $tab[0]; }

      if ($lowsum eq '' or $sum < $lowsum) {

         $lowsum = $sum;

         $lowacc = $tab[0]; } }

}

close IN;

 

It is good to remove these limitations, but not strictly necessary if the program will never be used for any other purpose than the original. A last assumption I have made is that the data is actually complete, i.e. that every line really contains all 6 scores, not 5 or 7.
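A simple guard for this could go inside the reading loop (a sketch for the original 6-score format, where $tab[0] is the accession number):

   warn "Unexpected number of fields in line: $line" unless $#tab == 6;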

 

 

d) If you want the 10 highest and the 10 lowest averages, you need to put everything into a table that can be sorted according to the average, i.e. numerically. Then it is easy to pick out any number of high- and low-scoring averages. It should not change any of the assumptions from c) for this implementation. However, it adds the assumption that there are at least 10 accession numbers not on the negative list in the data file. Quite reasonable, but a limitation nonetheless.

I would do it like this:

 

# Read datafile and calculate underway

open(IN, "appendix3") or die "Can't open data file\n";

while (defined ($line = <IN>)) {

   @tab = split(' ', $line);

   unless (exists $neg{$tab[0]}) {

      $avg = ($tab[1] + $tab[2] + $tab[3] + $tab[4] + $tab[5] + $tab[6]) / 6;

      push(@list, "$avg $tab[0]); }

}

close IN;

@list = sort {$a <=> $b} @list; # numeric sort on the leading average in each string

 

print "Higest averages:\n";

for ($i = $#list; $i > $#list - 10; $i--) {

   @tab = split(' ', $list[$i]);

   print "$tab[1]: $tab[0]\n"; }

print "Lowest averages:\n";

for ($i = 0; $i < 10; $i++) {

   @tab = split(' ', $list[$i]);

   print "$tab[1]: $tab[0]\n"; }