
Exercises in Pairwise Alignment and Database Searching


In this exercise, you will be introduced to some standard pairwise alignment and database search programs. Since this is the first exercise of the course, you will be guided in a very detailed way through the first part of the exercise. The exercises will be performed on a computer running a UNIX-type operating system.


Working directory and example data

In order to carry out today's exercise, you need some example data. To avoid messing up your home directory, you should create a directory for today's data and results:

mkdir alignment
and then change directory to the new directory:
cd alignment
Next, you must copy the example data:
cp /home/projects/hnielsen/teaching/phd/exercises/pwalign+dbsearch/data/* .
(the character "*" means "all files", and the character "." means "this directory").

Now you can see the contents of the directory with the command ls or ll. Some of the files contain DNA or protein sequences: they have the extension .aa for amino acid sequences and .nuc for nucleotide sequences. You may want to inspect the content of the sequence files; for example, the command "cat GLBE_CHITH.aa" prints the file GLBE_CHITH.aa to the terminal window. See the UNIX guide for an explanation of commands like ls and cat.
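If you want a quick overview of all the sequence files instead of printing them one by one, the following is a minimal sketch. It assumes the files are in FASTA format (header lines starting with ">"), which is not stated above, and the helper name seq_length is made up for this illustration.

```shell
# Hypothetical helper: count the residues in one sequence file.
# Assumption: FASTA format, i.e. header lines start with ">".
seq_length() {
    grep -v '^>' "$1" | tr -d ' \t\r\n' | wc -c | tr -d ' '
}

# Print the name and sequence length of every .aa and .nuc file
# in the current directory; patterns that match nothing are skipped.
for f in *.aa *.nuc; do
    [ -e "$f" ] || continue
    printf '%s\t%s\n' "$f" "$(seq_length "$f")"
done
```

Run it from inside the alignment directory after copying the example data. If the files turn out not to be in FASTA format, the reported lengths will include any header text.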

All the amino acid sequence files are taken from the database UniProt, and all the nucleotide sequence files are taken from the database GenBank. If you want to see the whole database entries (not just the sequence), you can search for the full names at NCBI (for GenBank) or UniProt.


This exercise has two parts:

Pairwise alignments

  • global and local alignments
  • dot-plots
  • various examples of protein and nucleotide sequences
  • different substitution matrices
  • different gap penalties

Database searches

  • three methods: ssearch, fasta, and blast
  • low complexity filtering
  • searches of protein vs. protein, DNA vs. DNA, and protein vs. DNA
  • different substitution matrices

Evaluation

You must answer all the questions marked "Qxx" in the two parts of the exercise. Please write down your answers in a file and mail them to Anders Gorm Pedersen after the exercise.

Important: Remember to include your full name (both names if you are a team) and your user name (studnnn) in your answer.

It is not necessary to answer all questions correctly to pass the course - but we need to be able to check that you have made an honest attempt!

At the end of the session, we will discuss the results.


Online resources

The tasks we did in the exercise today can also be performed online, although you will not have the same choices regarding substitution matrices, gap penalties, etc. Here is a list of useful links:

  • EMBOSS align - a tool for performing pairwise alignment, both local (Smith-Waterman) and global (Needleman-Wunsch).
  • LALIGN - another local alignment tool, includes PLALIGN for making dotplots.
  • SIM - yet another local alignment tool.
  • Dotlet - on-line dotplotting using a Java applet.
  • JDotter - on-line dotplotting using Java Web Start.
  • FASTA - fast database search tool.
  • BLAST - faster database search tool.
  • CD-BLAST - Fast search of sequence against profile databases.

Documentation:

  • The programs of the FASTA package have manual pages. You can read them with the commands
    man align
    man lalign   (also covers plalign)
    man fasta3   (also covers ssearch3)
    Further documentation can be found in fasta20.doc (concerning align, lalign and plalign) and fasta3x.doc (concerning ssearch3 and fasta3).

  • The programs of the BLAST package version 2 have no man pages. You can get a short summary of the command line options with the command
    blastall -
    and you can also check the BLAST 2.0 release notes.
  • getsprot and getgene are local CBS software. There are manual pages.
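
Before starting, it can be useful to check which of the programs mentioned above are actually available in your session. This is a minimal sketch: the helper name check_tool is made up for this illustration, and it assumes the executables carry the same names as those used in the text, which may not hold on every installation.

```shell
# Hypothetical helper: report whether a program can be found on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        printf '%s: found\n' "$1"
    else
        printf '%s: not found\n' "$1"
    fi
}

# Program names are taken from the documentation list above;
# the installed names on your machine are an assumption.
for prog in align lalign plalign fasta3 ssearch3 blastall getsprot getgene; do
    check_tool "$prog"
done
```

A program reported as "not found" may still be installed under a different name or outside your PATH, so treat the output as a first hint only.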