Solutions to exam in 27013, May 28, 2003
Assignment 1:
a) For finding the amino acid sequence I would look
for the line starting with "SQ ". Then the sequence is the following
lines until the line starting with "//". I would
use the green line/red line approach to extract the
sequence. For finding the variants, I would look for lines that starts with
"FT " and has as the second word "VARIANT" or
"MUTAGEN". When I find such a line then the first number is the start
position of the variation and the second number is the end position. After that
there is f.ex. W -> R which means that W is changed to R in this variant.
b) Using pseudo code to describe the procedure. Pseudo
code is not "real" code, it just illustrates the meaning.
$flag = 0;
$seq = '';
while (still_unread_lines_in_file) {
  read_line;
  $id = get_ID
if line_starts_with('ID');
  # Get the
sequence
  $flag=0 if
line_starts_with('//');
  $seq .=
$line if $flag == 1;
  $flag=1 if
line_starts_with('SQ');
  # Get
variations
  if
(line_starts_with('FT') and (second_word_is('VARIANT') or
     Â
second_word_is('MUTAGEN'))) {
    Â
push(@position, get_first_number);
    Â
push(@change, get_changed_AA); }
} # end while
output_fasta("$ID original", $seq);
while (@position) {
  $posi =
shift @position;
  $AA = shift
@changed;
 Â
substitute_in_sequence($seq, $posi, $AA);
 Â
output_fasta("$ID Variation $posi -> $AA", $seq); }
sub output_fasta {
  my
($headline, $seq) = @_;
  my $i;
 Â
write_to_file ">$headline\n";
  for ($i = 0;
$i < length($seq); $i += 60) {
    Â
write_to_file substr($seq, $i, 60), "\n"; }
}
c) Program in perl.
$flag = 0;
$seq = '';
@position = (); # Not necessary
@change = ();Â
# Not necessary
open(IN, "Appendix1") || die "Can't
read file\n";
while (defined ($line = <IN>)) {
  $ID = $1 if
$line =~ m/^ID\s+(\w+)/;
  # get
sequence
  if ($flag ==
1) {
     if ($line
=~ m/^\/\//) {
        $flag
= 0; }
     else {
        chomp
$line;
        $line
=~ s/\s//g; # remove spaces
        $seq
.= $line; } }
  elsif ($line
=~ m/^SQ\s/) {
     $flag =
1; }
  # get
variations
  if ($line =~
m/^FT\s+(VARIANT|MUTAGEN)\s/) {
     if ($line
=~ m/^FT\s+\w+\s+(\d+)\s+\d+\s+\w\s?->\s?(\w)/) {
       Â
push(@position, $1);
       Â
push(@change, $2); }
     else {
        die
"Error in line: $line"; } }
}
close IN;
open(OUT, ">fasta") or die "Can't
write fasta\n";
&output_fasta("$ID original", $seq);
for ($i
= 0; $i <= $#position; $i++) {
  $variant = substr($seq, 0, $position[$i] -
1);
  $variant
.=Â $change[$i] . substr($seq, $position[$i]);
 Â
&output_fasta("$ID Variation $position[$i] ->
$change[$i]", $variant); }
close OUT;
sub output_fasta {
  my
($headline, $seq) = @_;
  my $i;
  print OUT
">$headline\n";
  for ($i = 0;
$i < length($seq); $i += 60) {
     print OUT
substr($seq, $i, 60), "\n"; }
}
d) This program checks for one error already; variants
have to follow a certain strict pattern or else the program stops. It can be
difficult to improve on that. The program also have the standard file opening
error checks. I could add check for the flag being 0 (ensuring proper
termination of the SwissProt entry), for the $ID being set to something
meaningful, likewise with the sequence. In the variant lines in the feature
table (FT) we have the information of the original amino acid. We could use
that to double check that we are changing the right amino acid at the right
position.
e) It seems quite obvious from the variant lines, that
you can have variations that are not just one amino acid changed but 2, 3 or
more in a row that is changed. This can be seen in that there is a begin AND an
end position in the variant. This would not be necessary if only one position
could change. Also the format seems to have room for something like WR ->
AG, meaing two amino acids are changed.
Assignment 2:
a) In order to be able to exclude the sequences on the
negative list from the calculations, these has to be found and translated
first. There are several ways of doing this, but it is done with a hash here.
Notice that the chosen method requires no more data space than the negative
list.
# Read negative list
while (lines_in_negative_list) {
  read_line; #
line is SwissProtID
  $neg{$line}
= 1; } # creates the negative hash
# Read translation list and translate
while (lines_in_translation_file) {
  read_line;
 Â
extract_SwisProtID_and_AccessionNumber;
 Â
$neq{$SwissProtID} = $AccessionNumber if exists $neq{$SwissProtID}; }
# Now the hash has as keys SwissProtIDs and values
AccessionNumbers
# This is not immediately useful, as we need the other
way around.
# That can be done as both values and keys are unique.
Several ways to
# do this but the easiest is to revert the hash.
%neg = reverse %neq;
Now we have a useful negative list. To find the
accession numbers in
the datafile with the highest and lowest average, we
just run through
the whole file, discarding accessions on the negative
list, calculate
the average of the rest and keep an eye on the
higest/lowest encountered
so far. Observation: There is no need to calculate the
average, the sum
is sufficient for finding the higest and lowest.
while (lines_in_data_file) {
  read_line;
 Â
get_accession_from_line;
  unless
(AccessionNumber_in_negative_list) {
     Â
get_numbers_from_line;
     Â
calucate_sum;
      if ($sum
> $highsum) {
         $highsum = $sum;
        Â
$highacc = $AccessionNumber; }
      if ($sum
< $lowsum) {
        Â
$lowsum = $sum;
        Â
$lowacc = $AccessionNumber; } }
}
print "Higest average: $highacc
$highsum/6\n";
print "Lowest average: $lowacc $lowsum/6\n";
b)Â Perl code.
# Reading negative list
open(IN, "appendix4") or die "Can't
open negative list\n";
while (defined ($line = <IN>)) {
  chomp $line;
  $neq{$line}
= 1; }
close IN;
# Reading translation file
open(IN, "appendix5") or die "Can't
open translation file\n";
while (defined ($line = <IN>)) {
  @tab =
split(' ', $line);
 Â
$neq{$tab[0]} = $tab[2] if exists $neg{$tab[0]}; }
close IN;
# revert hash
%neg = reverse %neq;
# Read datafile and calculate underway
$highsum = 0;
$highacc = '';
$lowsum = 6;
$lowacc = '';
open(IN, "appendix3") or die "Can't
open data file\n";
while (defined ($line = <IN>)) {
  @tab =
split(' ', $line);
  unless
(exists $neg{$tab[0]}) {
     $sum =
$tab[1] + $tab[2] + $tab[3] + $tab[4] + $tab[5] + $tab[6];
     if ($sum
> $highsum) {
       Â
$highsum = $sum;
       Â
$highacc = $tab[0]; }
     if ($sum
< $lowsum) {
       Â
$lowsum = $sum;
       Â
$lowacc = $tab[0]; } }
}
close IN;
print "Higest average: $highacc ",
$highsum/6, "\n";
print "Lowest average: $lowacc ", $lowsum/6,
"\n";
c) The major assumption here is, that I expect the
file with the translation list to completely cover all possible accession
numbers/swissprot IDs. This is clearly impossible, and ought to have a warning
built-in. Otherwise I have made some minor assumptions, the first being that
there are 6 scores/numbers on each line of data in the datafile. It is a major
limitation, but since I made the code for the earlier program which generated
the data file, I know there are 6 and only 6 scores on each line. If I want to
use this code on other similar data, I have to do away with this limitation.
Like this:
$sum = $tab[1]; # at least one score;
for ($i
= 2; $i <= $#tab; $i++) {
  $sum += $tab[$i]; }
The second minor assumption is that all scores are
between 0 and 1. This is used when I initialize $highsum to 0 and $lowsum to 6.
Major limitation, but the same reasoning apply. To remove this limitation I
need to restructure at bit. Also if I change anything I should remove both
limitations.
$highacc = '';
$lowacc = '';
open(IN, "appendix3") or die "Can't
open data file\n";
while (defined ($line = <IN>)) {
  @tab =
split(' ', $line);
  unless
(exists $neg{$tab[0]}) {
     $sum =
$tab[1]; # at least one score;
     for ($i = 2; $i <= $#tab; $i++)
{
       Â
$sum += $tab[$i]; }
     if
($highsum eq '' or $sum > $highsum) {
       Â
$highsum = $sum;
       Â
$highacc = $tab[0]; }
     if
($lowsum eq '' or $sum < $lowsum) {
       Â
$lowsum = $sum;
       Â
$lowacc = $tab[0]; } }
}
close IN;
It is good to remove these limitations, but not
necessary if the program is not to be used for any other purpose than the
original. A last assumption I have made is that I expect the data to be there,
i.e. 6 scores every time, not 5 or 7.
d) If you want the 10 highest and lowest scores, you
need to put it all into a table that can be sorted according to the score, i.e
numerically. Then it is easy to pick out any number of high and low scoring
averages. It should not change any assumptions in c) for this implementation.
However it adds the assumption, that there is at least 10 accession numbers not
on the negative list in the data file. Quite reasonable, but a limitation none
the less.
I would do like this:
# Read datafile and calculate underway
open(IN, "appendix3") or die "Can't
open data file\n";
while (defined ($line = <IN>)) {
  @tab =
split(' ', $line);
  unless
(exists $neg{$tab[0]}) {
     $avg = ($tab[1] + $tab[2] + $tab[3]
+ $tab[4] + $tab[5] + $tab[6]) / 6;
     push(@list, "$avg $tab[0]); }
}
close IN;
@list = sort {$a <=> $b} @list;
print "Higest averages:\n";
for ($i
= $#list; $i > $#list - 10; $i--) {
  @tab = split(' ', $list[$i]);
  print
"$tab[1]: $tab[0]\n"; }
print "Lowest averages:\n";
for ($i
= 0; $i < 10; $i++) {
  @tab = split(' ', $list[$i]);
  print "$tab[1]: $tab[0]\n"; }