Phylogeny of Immunological Proteins
Peder Worning (peder@cbs.dtu.dk)
Overview
During this exercise you will:
- Retrieve information from the protein database SwissProt
- Perform a BLAST search of immunological protein sequences against SwissProt
- Read the BLAST report and select entries of special interest
- Get protein sequences of the selected proteins from SwissProt
- Perform a multiple alignment of the selected protein sequences (using ClustalX)
- Construct unrooted trees from the alignment of the selected sequences
(using the "neighbor joining" algorithm in ClustalX)
- Visualize the trees using the programs unrooted and njplot
- Consider the evolutionary implications of the trees
Background: Immunoglobulins as a protein family
The concept of protein families is based on the observation that, while there is a huge
number of different proteins,
most of them can be grouped, on the basis of similarities in their sequences, into a limited
number of families. Proteins or protein domains belonging to a particular family generally
share functional attributes and are derived from a common ancestor.
It is apparent, when studying protein sequence families, that some regions have been better
conserved than others during evolution. These regions are generally important for the function
of a protein and/or for the maintenance of its three-dimensional structure. By analyzing
the constant and variable properties of such groups of similar sequences, it is possible to
derive a signature for a protein family or domain, which distinguishes its members from all
other unrelated proteins.
The immunoglobulin-like domain is probably one of the most widespread protein modules
in the animal kingdom.
This module has been observed in a large group of related proteins that function mainly
in the immune system, in cell-cell recognition, or in the structural organization and regulation
of muscles. It is a domain of approximately 100 residues with a fold that consists of seven
to nine antiparallel beta strands. The proteins in the immunoglobulin-like family consist of
one or more of these domains.
Purpose of exercise, description of data
In this exercise you are going to
- make a BLAST search of immunological protein sequences against SwissProt.
- select a subset of the entries and retrieve the protein sequences from SwissProt.
- use ClustalX to make a multiple alignment and a phylogenetic tree of the selected proteins.
- inspect the tree and look up the proteins on the SwissProt homepage.
Finally, the exercise
Open a browser and find the homepage of the protein database SwissProt. Use
the right mouse button to open it in a separate window.
http://www.expasy.ch/sprot/
Change to this afternoon's working directory and have a look at what it contains:
cd exercise2
ls
BLAST search against SwissProt
First, you will use the file 1A02_HUMAN.fasta to run a BLAST search against
the SwissProt database. 1A02_HUMAN.fasta contains the protein sequence
of an MHC class I molecule with the allele name A0201, and the SwissProt
database is in the file /usr/cbs/databases/blastdb/sp.
blastpgp -b 500 -v 500 -i 1A02_HUMAN.fasta -d /usr/cbs/databases/blastdb/sp > 1A02_HUMAN.blast
The command must be written on one line. The file 1A02_HUMAN.blast
contains the BLAST report; take a look:
less 1A02_HUMAN.blast
After the introduction, all the matches are listed, one per line. The first column is
a common identifier and the second, in parentheses, is the SwissProt entry. Then follows a short
description of the protein and two numbers describing the match (the bit score and the E-value).
The BLAST report can also be viewed in an editor:
nedit 1A02_HUMAN.blast &
(be sure to include the "&" which will make the program run in the background,
so we can start more programs from the same shell window.)
We have made a small Perl script that reads only the matches from the BLAST report.
Run it and take a look at the result:
blastread.pl 1A02_HUMAN.blast > 1A02_HUMAN.blastread
less 1A02_HUMAN.blastread
Beta-2-microglobulin sequences from different species
Now we will select the same protein from different species and make a phylogenetic tree.
Beta-2-microglobulin is a single immunoglobulin domain; it can be viewed as
the basic building block of the proteins in this protein family.
Select beta-2-microglobulin sequences from the matches in the BLAST report and see
how many you have got:
grep "Beta-2-microglobulin precursor" 1A02_HUMAN.blastread > 1A02_HUMAN.blastread.B2MG
wc 1A02_HUMAN.blastread.B2MG
(grep is a command that writes out the lines matching a pattern, here
Beta-2-microglobulin precursor, and wc is a command that counts the number of lines,
words and bytes in a file. The first number is the line count, i.e. the number of matches.)
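For readers who want to see what the grep and wc step does, here is a minimal Python sketch of the same idea. The example lines are made up for illustration; the real layout of the blastread file may differ. The match is a plain substring test, which behaves like grep here only because the pattern contains no regular-expression metacharacters.

```python
def grep_lines(lines, pattern):
    """Return the lines that contain `pattern` as a plain substring."""
    return [line for line in lines if pattern in line]


# Hypothetical stand-in for the contents of 1A02_HUMAN.blastread.
example = [
    "B2MG_HUMAN Beta-2-microglobulin precursor",
    "1A02_HUMAN HLA class I histocompatibility antigen",
    "B2MG_MOUSE Beta-2-microglobulin precursor",
]

hits = grep_lines(example, "Beta-2-microglobulin precursor")
# len(hits) plays the role of the line count reported by wc.
print(len(hits), "matching lines")
```

On the real file you would pass an open file handle instead of the example list.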
Retrieve the protein sequences from SwissProt:
Now we have a collection of beta-2-microglobulin entries from SwissProt, which were
found because they match an MHC class I protein sequence. We will use this collection to
make a phylogenetic tree to illustrate the relations between these proteins.
But first we shall retrieve the protein sequences from SwissProt and collect them
in one fasta file. To do this we will use another small Perl script, getfasta.pl:
getfasta.pl 1A02_HUMAN.blastread.B2MG > 1A02_HUMAN.blastread.B2MG.fasta
We will make a multiple alignment and a phylogenetic tree using this fasta file.
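Before loading the fasta file into an alignment program, it can be useful to check how many sequences it contains. The following is a minimal sketch of a FASTA reader, not the getfasta.pl script itself. The two toy records are invented for illustration and are not real SwissProt sequences.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a dict mapping name -> sequence.

    The name is the first word after '>' on each header line.
    """
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Invented toy input; sequence lines may be wrapped over several lines.
toy_fasta = [
    ">SEQ_A first toy protein",
    "MSRSVALAVL",
    "ALLSLSG",
    ">SEQ_B second toy protein",
    "MARSVTLVFL",
]

sequences = read_fasta(toy_fasta)
for name, seq in sequences.items():
    print(name, len(seq))

# On the real data (assuming the file exists in the working directory):
# with open("1A02_HUMAN.blastread.B2MG.fasta") as handle:
#     sequences = read_fasta(handle)
```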
Multiple alignment
We will use the program ClustalX to make a multiple alignment. (ClustalX
is actually a graphical front-end to the command-line program ClustalW.)
The program is interactive with a typical, relatively user friendly,
windows-style interface. ClustalX has online help available from the pull-down
menu "Help".
Start ClustalX with the following command (be sure to include the
"&" which will make the program run in the background, so we can start
more programs from the same shell window.)
clustalx &
This opens the program in a separate window.
The first thing you have to do is load the sequences. In the "File" menu
choose "Load sequences", and select 1A02_HUMAN.blastread.B2MG.fasta from the
file list that appears.
The sequences are displayed on the screen with names on the left
hand side, and the sequences themselves on the right. Residues are
colored according to amino acid characteristics. It is possible to resize
the window by "pulling" at the edges, so you can fit all the lines in one
window. A scroll-bar at the bottom of the window allows you to move along the
alignment (the sequences are too long to fit in the horizontal direction).
Beneath the sequences there is a ruler starting at 1 for the
first residue position. Below this is a graphical indication of the degree of
conservation in each column of the alignment. A high score indicates a
well-conserved column; a low score indicates low conservation. Since the
sequences have not yet been aligned (they are all just lined up starting with
their first residues), most values are quite low.
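To get a feel for what a per-column conservation score measures, here is a deliberately simplified sketch. It is not the score ClustalX computes. It only reports, for each column, the fraction of sequences that share the most common residue. The three toy sequences are invented.

```python
from collections import Counter


def column_conservation(seqs):
    """Return, per column, the fraction of sequences sharing the most common residue.

    Gap characters ('-') are never counted as the shared residue.
    """
    length = min(len(s) for s in seqs)
    scores = []
    for i in range(length):
        column = [s[i] for s in seqs]
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


toy = ["MSRSV", "MARSV", "MSRFV"]  # invented, already the same length
scores = column_conservation(toy)
print([round(s, 2) for s in scores])
```

Columns where all sequences agree score 1.0; columns with disagreement score lower, which mirrors the high and low bars in the ClustalX display.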
Have a look at the alignment parameters:
From the "Alignment" pull-down menu choose "Alignment parameters", and then
"Pairwise alignment parameters". In this window you are able to change
gap-penalties and substitution matrix for the initial, pairwise part of the
multiple alignment. You can also
specify whether you want the pairwise alignments performed using a slower but
accurate method, or a faster but approximate method.
For now just leave everything as is.
Exit the window by clicking the tab labeled "Close".
There is a similar window for changing the multiple alignment parameters
("Alignment", "Alignment parameters", "Multiple alignment parameters".) Have a
look, but keep the default values for now.
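To illustrate how a gap penalty influences a pairwise alignment, here is a minimal sketch of global alignment scoring by dynamic programming (the Needleman-Wunsch scheme). It is a simplification and makes several assumptions: it uses a single linear gap penalty and a plain match/mismatch score instead of the separate gap penalties and the substitution matrix offered in the ClustalX dialog, and the numeric values are arbitrary, not ClustalX defaults.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score of strings a and b.

    Uses a linear gap penalty: every gap position costs `gap`.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


print(global_alignment_score("ACGT", "AGT"))
print(global_alignment_score("ACGT", "AGT", gap=-5))
```

Making the gap penalty more negative lowers the score of any alignment that needs a gap, which is why changing the penalties in ClustalX can change where gaps end up.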
Start the multiple alignment:
From the "Alignment" menu choose "Do complete alignment". This opens a
window giving you the opportunity to rename output files. Accept the suggested
names and start the alignment by clicking the tab labeled "Align".
You may be able to get a glimpse of how ClustalX is working in the bottom
part of the window: first, it does all the pairwise alignments.
These alignments are then used to construct a "guide
tree". (The guide tree should not be confused with the phylogenetic tree we
will construct later. The guide tree is entirely based on the pairwise
alignments, and is used to guide the construction of the multiple alignment).
Finally, ClustalX constructs the multiple alignment by progressively "aligning
alignments" following the guide tree.
Depending on system load, this alignment step may take anywhere between 20
seconds and 5 minutes. If waiting gets to be too painful you can change the
pairwise alignment parameters to "fast but approximate", but then you'll also
get a poorer alignment.
Computing an unrooted tree:
From the "Trees" menu choose "Draw N-J tree". This gives you a
window where you
can change the name of the tree-file. Accept the default and click "OK" to
calculate the tree. ClustalX uses the multiple alignment to calculate a distance
matrix with all pairwise distances between the sequences, and then constructs a
tree by progressively clustering sequences that are close to each other (using the
neighbor-joining algorithm). The calculated tree is in the file:
1A02_HUMAN.blastread.B2MG.ph.
(Note: the construction of the NJ-tree is so fast that
it's practically finished the second you have clicked "Draw N-J tree" - so no need
to sit around and wait for the result).
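The two ingredients described above, a matrix of pairwise distances and the neighbor-joining choice of which pair to cluster first, can be sketched as follows. This is a simplified illustration, not ClustalX's implementation: the distance used is the plain fraction of differing positions, only the first joining step is shown, and the 5x5 matrix is a toy example, not data from this exercise.

```python
def p_distance(s1, s2):
    """Fraction of differing residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(s1, s2) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(aligned):
    """All pairwise p-distances between aligned sequences."""
    n = len(aligned)
    return [[p_distance(aligned[i], aligned[j]) for j in range(n)] for i in range(n)]


def nj_pick_pair(d):
    """Return ((i, j), q): the pair minimising the neighbor-joining criterion

    Q(i, j) = (n - 2) * d[i][j] - sum(d[i]) - sum(d[j]).
    """
    n = len(d)
    totals = [sum(row) for row in d]
    best, best_q = None, None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - totals[i] - totals[j]
            if best_q is None or q < best_q:
                best, best_q = (i, j), q
    return best, best_q


# Toy distance matrix for five taxa (illustrative only).
toy_d = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
pair, q = nj_pick_pair(toy_d)
print(pair, q)
```

Note that the pair chosen is not simply the one with the smallest distance: in the toy matrix, taxa 3 and 4 are closest (distance 3), but the criterion also accounts for how far each taxon is from all the others.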
Inspect the phylogenetic tree
We will use two different programs to display phylogenetic trees: unrooted and
njplot. unrooted gives a more realistic two-dimensional view of the
distances between the protein sequences in an unrooted tree, but it can be difficult to read
the names of the proteins on the plot.
njplot is actually made to look at rooted trees, and only the horizontal distance
matters. The program chooses an outgroup by itself, which need not be the right one.
An outgroup is one or more sequences that are assumed to be distantly related to the rest of
the sequences in the tree. This picture is simpler but the names are more readable.
unrooted 1A02_HUMAN.blastread.B2MG.ph &
njplot 1A02_HUMAN.blastread.B2MG.ph &
The windows that are opened by these programs can be resized by dragging them by the
edges. Compare the two representations of the tree.
In the njplot window you can choose a new outgroup by clicking on new outgroup
and then clicking on a # sign. Try a few and compare the picture with the
unrooted plot. When you are satisfied, click on show tree. The tree can be saved
as a PostScript file and printed: click on file and save plot, where you can
choose a name for the PostScript file.
Do the species group together as you expected them to do?
For some of the identifiers in the tree, the species are not obvious. Search SwissProt's
homepage with the unknown identifiers to get further information about the entries.
Type the identifier and click Quick Search. When you have found out the species,
try to root the njplot tree at a probable outgroup.
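To collect the identifiers you need to look up, the leaf names can be pulled out of the tree file. The sketch below assumes the .ph file is in the parenthesised Newick (PHYLIP) tree format; the example tree string, including its identifiers and branch lengths, is made up for illustration.

```python
import re


def leaf_names(newick):
    """Return leaf labels from a Newick string.

    A leaf label is a name that directly follows '(' or ','. Branch lengths
    (after ':') and internal-node labels (after ')') are not captured.
    """
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)


# Hypothetical tree; not the output of the exercise.
example_tree = (
    "(\nB2MG_HUMAN:0.10,\n(B2MG_MOUSE:0.20,B2MG_RAT:0.15):0.05,\nB2MG_CHICK:0.40);"
)

names = leaf_names(example_tree)
print(names)

# On the real file (assuming it exists in the working directory):
# with open("1A02_HUMAN.blastread.B2MG.ph") as handle:
#     names = leaf_names(handle.read())
```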
Human immunoglobulin proteins
Select the human proteins from the matches in the BLAST report, and see how many entries are
left:
grep "_HUMAN" 1A02_HUMAN.blastread > 1A02_HUMAN.blastread.human
wc 1A02_HUMAN.blastread.human
We can remove the proteins that share the same description line using another Perl script:
blastextract.pl 4 1A02_HUMAN.blastread.human > 1A02_HUMAN.blastread.human.uniq
wc 1A02_HUMAN.blastread.human.uniq
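The idea of keeping only one entry per description can be sketched as below. This is not the blastextract.pl script, and what its argument 4 means is not shown here. The sketch assumes, purely for illustration, that each line consists of an identifier followed by the description; the example lines are made up.

```python
def unique_by_description(lines):
    """Keep the first line seen for each distinct description.

    Assumes (hypothetically) that the description is everything after the
    first whitespace-separated field on the line.
    """
    seen = set()
    kept = []
    for line in lines:
        parts = line.split(maxsplit=1)
        description = parts[1].strip() if len(parts) > 1 else ""
        if description not in seen:
            seen.add(description)
            kept.append(line)
    return kept


# Invented example lines; the real blastread layout may differ.
example_hits = [
    "1A02_HUMAN HLA class I histocompatibility antigen",
    "1A03_HUMAN HLA class I histocompatibility antigen",
    "B2MG_HUMAN Beta-2-microglobulin precursor",
]

unique_hits = unique_by_description(example_hits)
print(len(example_hits), "->", len(unique_hits))
```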
Retrieve the protein sequences from SwissProt
Now we have a collection of human SwissProt entries with different description lines,
which are all members of the immunoglobulin family. We will use this collection to make
a phylogenetic tree to illustrate relations in this protein family.
But first we shall retrieve the protein sequences from SwissProt and collect them
in one fasta file.
getfasta.pl 1A02_HUMAN.blastread.human.uniq > 1A02_HUMAN.blastread.human.uniq.fasta
Now we use ClustalX as before to make a multiple alignment and a phylogenetic tree of
the human protein sequences. Start ClustalX and read the file
1A02_HUMAN.blastread.human.uniq.fasta into the program and proceed as before. Use
njplot and unrooted to inspect the tree.
Take a closer look at the tree. Type the identifiers from the tree into the search window
on SwissProt's homepage. This will tell you which proteins are hidden behind the identifiers,
and you can get an idea of how the immunoglobulin-like proteins are related.
The Toll receptor
The Toll-like receptors are a family of receptors that recognize molecular
structures specific to micro-organisms. The Toll-like receptors induce inflammatory
responses as a part of the nonspecific innate immune response against invading microbes.
The Toll-like receptors are all homologs of the Drosophila Toll protein, which is a
transmembrane receptor involved in the control of dorso-ventral pattern formation in
fly embryos. Flies with defective Toll receptors are also unable to recognize fungal infections.
Flies have no adaptive immune response, only a nonspecific innate immune response.
Use the sequence of the Drosophila Toll receptor TOLL_DROME.fasta to make a
BLAST search against SwissProt:
blastpgp -b 500 -v 500 -i TOLL_DROME.fasta -d /usr/cbs/databases/blastdb/sp > TOLL_DROME.blast
blastread.pl TOLL_DROME.blast > TOLL_DROME.blastread
Select the Toll-like proteins from the BLAST report, and get the proteins from
SwissProt:
grep "Toll" TOLL_DROME.blastread > TOLL_DROME.blastread.Toll
getfasta.pl TOLL_DROME.blastread.Toll > TOLL_DROME.blastread.Toll.fasta
The Toll-like receptors are not members of the immunoglobulin family; they are proteins
of a different origin. You can see this by looking for Toll or histocompatibility in our new
and our original BLAST search and counting the matches:
grep "Toll" 1A02_HUMAN.blastread | wc
grep "Toll" TOLL_DROME.blastread | wc
grep "histocompatibility" 1A02_HUMAN.blastread | wc
grep "histocompatibility" TOLL_DROME.blastread | wc
Now make an alignment and a phylogenetic tree as before using
TOLL_DROME.blastread.Toll.fasta as the input file to clustalx.
Use both unrooted and njplot to inspect the tree.
What are the evolutionary implications of this tree?