Lesson 8: Hashes


Required reading
Learning Perl, ed. 4:
Chapter 6; p. 88-99
or
Learning Perl, ed. 5:
Chapter 6; p. 93-104 mid
or
Learning Perl, ed. 6:
Chapter 6; p. 107-119 mid
and
Notes about the functions keys, values, exists, delete, and each

Subjects covered

  • Hashes, which are unordered tables of key/value pairs.
  • Functions relevant to hashes:
    • keys, which returns a list of the keys in the hash,
    • values, which returns a list of the values in the hash,
    • exists, a predicate that tests whether a key exists in the hash,
    • delete, which deletes an element,
    • each, which iterates over all key/value pairs in the hash.
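
As a minimal sketch of the five functions listed above, the following uses a small made-up hash (the name %aa and its contents are illustrative assumptions, not part of the exercises):

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical toy data: three-letter to one-letter amino acid codes.
my %aa = (Ala => 'A', Gly => 'G', Trp => 'W');

# keys and values return lists in no guaranteed order; sort for stable output.
my @names   = sort keys %aa;      # Ala, Gly, Trp
my @letters = sort values %aa;    # A, G, W
print "Keys: @names\n";
print "Values: @letters\n";

# exists tests for a key without creating it.
print "Trp is present\n" if exists $aa{Trp};
print "Cys is absent\n" unless exists $aa{Cys};

# delete removes the key and returns the value it had.
my $removed = delete $aa{Gly};
print "Removed Gly => $removed\n";

# each returns one key/value pair per call, in no particular order.
while (my ($name, $letter) = each %aa) {
    print "$name => $letter\n";
}
```

Because a hash is unordered, the pairs printed by the each loop may come out in a different order from run to run; sort the keys first whenever the order matters.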

Files needed to complete these exercises:
To download the files to your system, just press the Shift key while you left click on the blue link. Follow the instructions.
start10.dat
res10.dat
ex5.acc
data1.gb
data2.gb
data3.gb
data4.gb
You can play around with these files as much as you like. If you change or destroy them, just download them again.

Remember to write #!/usr/bin/perl -w on the first line of your programs.


All the following exercises have to be done in Perl.

  1. Create a hash where the keys are codons and the values are the one-letter codes for the amino acids. The hash will function as a look-up table. You can find a list here.
  2. Use the hash from the previous exercise in a program that translates all the nucleotide fasta entries in dna7.fsa to amino acid sequences. Save the results in a file aa7.fsa in fasta format. Remember to keep the 'headlines' for each entry and add 'Amino Acid Sequence' to each of them. The STOP codon is NOT a part of the amino acid sequence.
  3. You have made a program (let's call it the X-program) that takes a file of accession numbers, start10.dat, as input and writes its output to res10.dat. When you count the lines in the input file and the output file, you discover that the line counts do not match. Horror: your program does not produce output for some of the input. The assignment is to discover which accession numbers did not produce output. This can be done in various ways, but here you have to use a hash (as a look-up table). Print the results.
  4. The file ex5.acc contains many accession numbers, some of which are duplicates. Earlier we just removed the duplicates; now we should count them. Make a program that reads the file once and prints a list (or writes a file) with the unique accession numbers and the number of times each occurs in the file. A line should look like this: AC24677 2, if this accession number occurs twice in ex5.acc.
  5. Building upon the previous exercise, make the program print the list ordered by number of occurrences. That means the accession numbers with the most occurrences should be first, and accession numbers that occur only once should be last in the list.
  6. From the GenBank files data?.gb, extract the coding DNA sequence as you already did in 7.9. Then display a list of the codons USED in the coding sequence and the number of times each is used.
  7. In the data1.gb file there are 6 references (to articles). Make a program that extracts all authors from the references, eliminates duplicates, and prints the list of persons who had anything to do with this GenBank entry. This should also work for the other GenBank entries. Beware: there are traps in this exercise, so check your output carefully. You are free to use hashes or not in this exercise.
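
Several of the exercises above rely on the same hash idioms: counting, ordering by count, and using a hash as a look-up table. The sketch below shows them on made-up colour names, not on the exercise files; the variable names and data are assumptions chosen for illustration, and reading the real input files is left to you.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical toy input; the exercises read their data from files instead.
my @seen = qw(red blue red green blue red);

# Counting: a new key starts as undef, which ++ treats as 0.
my %count;
$count{$_}++ for @seen;

# Sort keys by descending count, breaking ties alphabetically.
my @ordered = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;
print "$_ $count{$_}\n" for @ordered;

# Look-up table: report the candidates that never appeared in the input.
my @candidates = qw(red yellow green black);
my @missing = grep { !exists $count{$_} } @candidates;
print "Missing: @missing\n";
```

Using exists for the look-up, rather than testing the value, avoids creating keys by accident and still works when a stored value happens to be 0 or the empty string.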

This page was last updated by Peter Wad Sackett, pws@cbs.dtu.dk