Exercises in Pairwise Alignment and Database Searching
In this exercise, you will be introduced to some standard pairwise
alignment and database search programs. Since this is the first
exercise of the course, you will be guided in a very detailed way
through the first part of the exercise. The exercises will be performed
on a computer running a UNIX-type operating system.
Working directory and example data
In order to carry out today's exercise, you need some example data. To
avoid messing up your home directory, you should create a directory for
today's data and results:
mkdir alignment
and then change directory to the new directory:
cd alignment
Next, you must copy the example data:
cp /home/projects/hnielsen/teaching/phd/exercises/pwalign+dbsearch/data/* .
(the character "*" means "all files", and the character
"." means "this directory").
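The three steps above can be tried out safely in a scratch directory first. The following is only a sketch: the directory layout and the two demo files are made up for illustration and are not the course data. Note that "*" expands to every file whose name does not begin with a dot.

```shell
# Hypothetical scratch layout; none of these paths or files come from the course.
scratch=$(mktemp -d)
mkdir "$scratch/data" "$scratch/alignment"
printf '>demo1\nMKV\n' > "$scratch/data/demo1.aa"
printf '>demo2\nACGT\n' > "$scratch/data/demo2.nuc"

# Change to the new working directory, as in the exercise.
cd "$scratch/alignment"

# "*" expands to every (non-hidden) file in ../data; "." is the current directory.
cp ../data/* .

# Count the files that arrived in the working directory.
copied=$(ls | wc -l | tr -d ' ')
echo "copied $copied files"
```

Running ls afterwards should list the two demo files in the working directory, just as it will list the real example data after the cp command above.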
Now, you can see the contents of the directory with the command
ls or ll.
Some of the files contain DNA or protein sequences; they have the
extension .aa for amino acid sequences and .nuc for nucleotide
sequences. You may want to inspect the content of the sequence files
(e.g., the command "cat GLBE_CHITH.aa" will print the file
GLBE_CHITH.aa to the terminal window). See the UNIX guide for an
explanation of commands like ls and cat.
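Besides cat, a few other standard commands are handy for a quick look at a sequence file. The sketch below assumes the files are in FASTA format (a header line starting with ">" followed by sequence lines), which the text above does not state; the file name and sequence are invented for the demonstration.

```shell
# Hypothetical demo file; the name and sequence are made up and FASTA format is assumed.
tmp=$(mktemp -d)
printf '>DEMO_SEQ hypothetical protein\nMKVLAAGIVG\nLLLAQ\n' > "$tmp/demo.aa"

cat "$tmp/demo.aa"          # print the whole file
head -n 1 "$tmp/demo.aa"    # show only the first (header) line

# Count sequence entries by counting header lines.
nseq=$(grep -c '^>' "$tmp/demo.aa")

# Count residues: drop header lines, remove newlines, count the remaining characters.
nres=$(grep -v '^>' "$tmp/demo.aa" | tr -d '\n' | wc -c | tr -d ' ')

echo "$nseq sequence(s), $nres residues"
```

The same commands can be pointed at one of the example files, such as GLBE_CHITH.aa, provided the FASTA assumption holds for it.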
All the amino acid sequence files are taken from the database UniProt,
and all the nucleotide sequence files are taken from the database GenBank.
If you want to see the whole database entries (not just the sequence),
you can search for the full names at
NCBI
(for GenBank) or
UniProt.
This exercise has two parts:
- Pairwise alignment
  - global and local alignments
  - dot-plots
  - various examples of protein and nucleotide sequences
  - different substitution matrices
  - different gap penalties
- Database searching
  - three methods: ssearch, fasta, and blast
  - low complexity filtering
  - searching protein vs. protein, DNA vs. DNA, and protein vs. DNA
  - different substitution matrices
Evaluation
You must answer all the questions marked "Qxx" in the two
parts of the exercise. Please write down your answers in a file
and mail them to Anders Gorm Pedersen after the exercise.
Important: Remember to include your full name (both names if you are a
team) and your user name (studnnn) in your answer.
It is not necessary to answer all questions correctly to pass the
course - but we need to be able to check that you have made an honest
attempt!
At the end of the session, we will discuss the results.
Online resources
The tasks we did in the exercise today can also be performed online,
although you will not have the same choices regarding substitution
matrices, gap penalties, etc. Here is a list of useful links:
- EMBOSS align - a tool for performing pairwise alignment, both local
  (Smith-Waterman) and global (Needleman-Wunsch).
- LALIGN - another local alignment tool; includes PLALIGN for making
  dotplots.
- SIM - yet another local alignment tool.
- Dotlet - on-line dotplotting using a Java applet.
- JDotter - on-line dotplotting using Java Web Start.
- FASTA - fast database search tool.
- BLAST - faster database search tool.
- CD-BLAST - fast search of a sequence against profile databases.
Documentation:
- The programs of the FASTA package have manual pages; you can read them
with the commands
man align
man lalign (includes plalign)
man fasta3 (includes ssearch3)
Further documentation can be found in fasta20.doc (concerning align,
lalign and plalign) and fasta3x.doc (concerning ssearch3 and fasta3).
- The programs of the BLAST package version 2 have no man pages.
You can get a short summary of command line options by giving the
command
blastall -
at a genome prompt, and you can also check the BLAST 2.0 release notes.
- getsprot and getgene are
local CBS software. There are manual pages.