Lesson 7: Perl Regular Expressions


Required reading
Learning Perl, ed. 4:
Chapter 7; p. 100-106
Chapter 8; p. 107-115
Chapter 9; p. 121-123, 127-128

Learning Perl, ed. 5:
Chapter 7; p. 107-115 mid
Chapter 8; p. 117-126 mid, 129 mid-131
Chapter 9; p. 135-137, 140-142

Learning Perl, ed. 6:
Chapter 7; p. 121-131 mid
Chapter 8; p. 133-135 mid, 138-144
Chapter 9; p. 155-158

A very useful pdf on regular expressions
A home page with tutorials on regex, test your regex

Notes on metasymbol tables.

Subjects covered
Regular expressions, pattern matching, substitution, translation.

Necessary files to complete this exercise
To download the files to your system, just press the Shift key while you left click on the blue link. Follow the instructions.
You can play around with these files as much as you like. If you change or destroy them, just download them again.

Remember to write #!/usr/bin/perl -w on the first line of your programs.

All the following exercises have to be done in Perl

  1. Let us improve on exercise 2.6. Use regular expressions (RE) to check that the data you get when asking for numbers are actually numbers. Also check that the operation is valid.
    These should all be considered numbers: "4"   "-7"   "0.656"   "-67.35555"
    These are not numbers: "5."  "56F"  ".32"  "-.04"
    Note: This is likely the most difficult regular expression you will have to make in this set of exercises. Perhaps you should do this one later.
  2. Improve exercise 5.6 by using regular expressions to find the ID, accession number and amino acid sequence. Note: This exercise also covers verification and printing of a FASTA file.
  3. Improve exercise 4.7 using all you have learned. The program shall now take a DNA FASTA file (getting the file name from the command line or asking for it interactively; both methods shall work), and reverse and complement all entries in the file. There can be more than one entry; study dna7.fsa. Hint: Use substitution or transliteration (translation) for complementing the DNA.
  4. The last exercises all deal with the files data1-4.gb, which are various Genbank entries of genes. First study the files and notice the structure of the data. In all exercises you will have to parse the files (read them and find the wanted data) using REs, which are well suited for that purpose. Every exercise adds to the previous ones, so the final program can do a lot. Remember: your program should be able to handle all the files, but just one at a time.
  5. Extract the accession number, the definition and the organism (and print it).
  6. Extract and print all MEDLINE article numbers which are mentioned in the entries.
  7. Extract and print the translated gene (the amino acid sequence). Look for the line starting with /translation=. Generalize: an amino acid sequence can be short, i.e. only one line in the feature table, or long, i.e. more than one line in the feature table.
  8. Extract and print the DNA (the whole base sequence at the end of the file).
  9. Extract and print ONLY the coding DNA. It is described in FEATURES - CDS (Coding DNA Sequence). As an example, the line in data1.gb says 'join(2424..2610,3397..3542)', which means that the coding sequence is bases 2424-2610 followed by bases 3397-3542. The bases in between are an intron and not part of the coding DNA. Remember to generalize: there can be more (or fewer) than two exons, and the 'join' line can continue on the next line.
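
Two of the hints in the list above can be sketched in a few lines of Perl. This is only an illustration under stated assumptions, not a reference solution: the subroutine names revcomp and coding_dna are hypothetical, the test sequence is toy data rather than anything from the data files, and the location string is assumed to have been collected into a single string already (the 'join' line may span several lines in the file). Locations using complement(...) or the < and > markers are not handled.

```perl
#!/usr/bin/perl -w
use strict;

# Reverse and complement a DNA string (the hint in exercise 3).
# revcomp is a hypothetical name chosen for this sketch.
sub revcomp {
    my ($dna) = @_;
    my $rc = scalar reverse $dna;      # reverse the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;      # transliterate each base to its complement
    return $rc;
}

# Collect the bases named in a CDS location such as
# 'join(2424..2610,3397..3542)' (exercise 9).
# coding_dna is a hypothetical name; $location is assumed to be
# the complete location text joined into one string.
sub coding_dna {
    my ($location, $dna) = @_;
    my $coding = '';
    # Each start..end pair is one exon; positions are 1-based and inclusive.
    while ($location =~ /(\d+)\.\.(\d+)/g) {
        my ($start, $end) = ($1, $2);
        $coding .= substr($dna, $start - 1, $end - $start + 1);
    }
    return $coding;
}

# Toy data (an assumption for illustration, not from data1.gb).
my $dna = 'AAACCCGGGTTT';
print coding_dna('join(1..3,7..9)', $dna), "\n";   # bases 1-3 followed by bases 7-9
print revcomp('AACG'), "\n";
```

Because the loop matches every start..end pair it finds, the same code copes with any number of exons, including a plain single range with no 'join' at all.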

This page was last updated by Peter Wad Sackett, pws@cbs.dtu.dk