Phylogeny of Immunological Proteins

Peder Worning (peder@cbs.dtu.dk)


Overview

During this exercise you will:

  1. Retrieve information from the protein database SwissProt
  2. Perform a BLAST search of immunological protein sequences against SwissProt
  3. Read the BLAST report and select entries of special interest
  4. Get protein sequences of the selected proteins from SwissProt
  5. Perform a multiple alignment of the selected protein sequences (using ClustalX)
  6. Construct unrooted trees from the alignment of the selected sequences (using the "neighbor joining" algorithm in ClustalX)
  7. Visualize the trees using the programs unrooted and njplot
  8. Consider the evolutionary implications of the trees


Background: Immunoglobulins as a protein family

The concept of protein families is based on the observation that, while there is a huge number of different proteins, most of them can be grouped, on the basis of similarities in their sequences, into a limited number of families. Proteins or protein domains belonging to a particular family generally share functional attributes and are derived from a common ancestor.

It is apparent, when studying protein sequence families, that some regions have been better conserved than others during evolution. These regions are generally important for the function of a protein and/or for the maintenance of its three-dimensional structure. By analyzing the constant and variable properties of such groups of similar sequences, it is possible to derive a signature for a protein family or domain, which distinguishes its members from all other unrelated proteins.

The immunoglobulin-like domain is probably one of the most widespread protein modules in the animal kingdom. This module has been observed in a large group of related proteins that function mainly in the immune system, in cell-cell recognition, or in the structural organization and regulation of muscles. It is a domain of approximately 100 residues with a fold that consists of seven to nine antiparallel beta strands. The proteins in the immunoglobulin-like family consist of one or more of these domains.

Purpose of exercise, description of data

In this exercise you are going to

  1. make a BLAST search of immunological protein sequences against SwissProt.
  2. select a subset of the entries and retrieve the protein sequences from SwissProt.
  3. use clustalx to make a multiple alignment and a phylogenetic tree of the selected proteins.
  4. inspect the tree and look up the proteins on the SwissProt homepage.


Finally, the exercise

Open a browser and find the homepage of the protein database SwissProt. Use the right mouse button to open it in a separate window.

http://www.expasy.ch/sprot/

Change to this afternoon's working directory and have a look at what it contains:

cd exercise2

ls

BLAST search against SwissProt

First, you will use the file 1A02_HUMAN.fasta to make a BLAST search against the SwissProt database. 1A02_HUMAN.fasta contains the protein sequence of an MHC class I molecule with allele name A0201, and the SwissProt database is in the file /usr/cbs/databases/blastdb/sp.

blastpgp -b 500 -v 500 -i 1A02_HUMAN.fasta -d /usr/cbs/databases/blastdb/sp > 1A02_HUMAN.blast

The command must be written on one line. The file 1A02_HUMAN.blast contains the BLAST report. Take a look:

less 1A02_HUMAN.blast

After the introduction, all the matches are listed, one per line. The first column is a common identifier, and the second, in parentheses, is the SwissProt entry. Then follows a short description of the protein and two numbers describing the match (the bit score and the E-value). The BLAST report can also be viewed in an editor:

nedit 1A02_HUMAN.blast &

(Be sure to include the "&", which makes the program run in the background so that we can start more programs from the same shell window.)

We have made a small Perl script that reads only the matches from the BLAST report. Run it and take a look at the result:

blastread.pl 1A02_HUMAN.blast > 1A02_HUMAN.blastread

less 1A02_HUMAN.blastread

Beta-2-microglobulin sequences from different species

Now we will select the same protein from different species and make a phylogenetic tree. Beta-2-microglobulin is a single immunoglobulin domain; it can be viewed as the basic building block of the proteins in this protein family.

Select beta-2-microglobulin sequences from the matches in the BLAST report and see how many you have got:

grep "Beta-2-microglobulin precursor" 1A02_HUMAN.blastread > 1A02_HUMAN.blastread.B2MG

wc 1A02_HUMAN.blastread.B2MG

(grep is a command that writes out the lines matching a pattern, here Beta-2-microglobulin precursor, and wc is a command that counts the number of lines, words, and characters in a file)
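
To see what grep and wc do before applying them to the real report, the following minimal sketch runs them on a small made-up file. The file name demo.blastread and its three lines are invented for illustration and are not meant to reproduce the exact layout of the blastread.pl output.

```shell
# Toy illustration of grep and wc; the file and its contents are made up.
tmpdir=$(mktemp -d)
cat > "$tmpdir/demo.blastread" <<'EOF'
B2MG_HUMAN Beta-2-microglobulin precursor
1A02_HUMAN HLA class I histocompatibility antigen
B2MG_MOUSE Beta-2-microglobulin precursor
EOF

# grep writes out only the lines that contain the pattern
grep "Beta-2-microglobulin precursor" "$tmpdir/demo.blastread" > "$tmpdir/demo.B2MG"

# wc -l reports only the line count (plain wc also reports words and characters);
# tr strips the padding that some wc versions add
n_hits=$(wc -l < "$tmpdir/demo.B2MG" | tr -d ' ')
echo "matching lines: $n_hits"

rm -rf "$tmpdir"
```

Since every match occupies one line, the line count is the number of selected entries.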

Retrieve the protein sequences from SwissProt:

Now we have a collection of beta-2-microglobulin entries from SwissProt, which were found because they match an MHC class I protein sequence. We will use this collection to make a phylogenetic tree that illustrates the relations between these proteins. But first we shall retrieve the protein sequences from SwissProt and collect them in one fasta file. To do this we will use another small Perl script, getfasta.pl:

getfasta.pl 1A02_HUMAN.blastread.B2MG > 1A02_HUMAN.blastread.B2MG.fasta

We will make a multiple alignment and a phylogenetic tree using this fasta file.
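
Before moving on, it can be worth checking that the fasta file holds as many sequences as you selected. The sketch below counts fasta records by counting header lines (lines starting with ">"). The function name count_fasta and the toy sequences are made up for illustration.

```shell
# Count fasta records: every record starts with a header line beginning with ">".
count_fasta() {
    grep -c '^>' "$1"
}

# Toy fasta file with made-up sequences, only to show the function working.
demo_fasta=$(mktemp)
cat > "$demo_fasta" <<'EOF'
>SEQ_A toy sequence
MSRSVALAVL
ALLSLSGLEA
>SEQ_B toy sequence
MARSVTLVFL
EOF

n_seqs=$(count_fasta "$demo_fasta")
echo "sequences: $n_seqs"
rm -f "$demo_fasta"
```

In the exercise directory, count_fasta 1A02_HUMAN.blastread.B2MG.fasta should report the same number as the line count of 1A02_HUMAN.blastread.B2MG, assuming getfasta.pl writes one record per selected entry.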


Multiple alignment

We will use the program ClustalX to make a multiple alignment. (ClustalX is actually a graphical front-end to the command-line program CLUSTALW.) The program is interactive with a typical, relatively user-friendly, windows-style interface. ClustalX has online help available from the pull-down menu "Help".
  1. Start ClustalX with the following command (be sure to include the "&" which will make the program run in the background, so we can start more programs from the same shell window.)

    clustalx &

    This opens the program in a separate window. The first thing you have to do is load the sequences. In the "File" menu choose "Load sequences", and select 1A02_HUMAN.blastread.B2MG.fasta from the file list that appears.

    The sequences are displayed on the screen with names on the left hand side, and the sequences themselves on the right. Residues are colored according to amino acid characteristics. It is possible to resize the window by "pulling" at the edges, so you can fit all the lines in one window. A scroll-bar at the bottom of the window allows you to move along the alignment (the sequences are too long to fit in the horizontal direction).

    Beneath the sequences there is a ruler starting at 1 for the first residue position. Below this is a graphical indication of the degree of conservation in each column of the alignment. A high score indicates a well-conserved column; a low score indicates low conservation. Since the sequences have not yet been aligned (they are all just lined up starting with their first residues), most values are quite low.

  2. Have a look at the alignment parameters:

    From the "Alignment" pull-down menu choose "Alignment parameters", and then "Pairwise alignment parameters". In this window you are able to change gap-penalties and substitution matrix for the initial, pairwise part of the multiple alignment. You can also specify whether you want the pairwise alignments performed using a slower but accurate method, or a faster but approximate method. For now just leave everything as is. Exit the window by clicking the tab labeled "Close".

    There is a similar window for changing the multiple alignment parameters ("Alignment", "Alignment parameters", "Multiple alignment parameters".) Have a look, but keep the default values for now.

  3. Start the multiple alignment:

    From the "Alignment" menu choose "Do complete alignment". This opens a window giving you the opportunity to rename output files. Accept the suggested names and start the alignment by clicking the tab labeled "Align".

    You may be able to get a glimpse of how ClustalX is working in the bottom part of the window: first, it does all the pairwise alignments. These alignments are then used to construct a "guide tree". (The guide tree should not be confused with the phylogenetic tree we will construct later. The guide tree is entirely based on the pairwise alignments, and is used to guide the construction of the multiple alignment.) Finally, ClustalX constructs the multiple alignment by progressively "aligning alignments" following the guide tree.

    Depending on system load, this alignment step may take anywhere between 20 seconds and 5 minutes. If waiting gets to be too painful you can change the pairwise alignment parameters to "fast but approximate", but then you'll also get a poorer alignment.

  4. Computing an unrooted tree:

    From the "Trees" menu choose "Draw N-J tree". This gives you a window where you can change the name of the tree file. Accept the default and click "OK" to calculate the tree. ClustalX uses the multiple alignment to calculate a distance matrix with all pairwise distances between the sequences, and then constructs a tree by progressively clustering sequences that are close to each other (using the neighbor-joining algorithm). The calculated tree is in the file 1A02_HUMAN.blastread.B2MG.ph. (Note: the construction of the NJ tree is so fast that it is practically finished the second you have clicked "OK", so there is no need to sit around and wait for the result.)
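
To make the idea of a pairwise distance concrete, here is a minimal sketch that computes the simplest such measure for two aligned sequences: the fraction of compared columns in which the residues differ, skipping columns with a gap. This is an illustration only; the exact calculation in ClustalX depends on its tree options and is not reproduced here. The function name p_distance and the two toy sequences are made up, and both sequences are assumed to have the same aligned length.

```shell
# p-distance between two aligned sequences of equal length:
# the fraction of compared columns where the residues differ.
# Columns with a gap ("-") in either sequence are skipped.
p_distance() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = length(a); compared = 0; diff = 0
        for (i = 1; i <= n; i++) {
            x = substr(a, i, 1); y = substr(b, i, 1)
            if (x == "-" || y == "-") continue
            compared++
            if (x != y) diff++
        }
        if (compared == 0) { print "NA"; exit }
        printf "%.3f\n", diff / compared
    }'
}

# Made-up aligned toy sequences (not real beta-2-microglobulin data):
# 9 gap-free columns are compared and 2 of them differ.
d=$(p_distance "MSRSV-LAVL" "MARSVALVVL")
echo "distance: $d"
```

Computing this value for every pair of sequences gives a distance matrix of the kind that the neighbor-joining algorithm takes as input.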

Inspect the phylogenetic tree

We will use two different programs to display phylogenetic trees: unrooted and njplot. unrooted gives a more realistic two-dimensional view of the distances between the protein sequences in an unrooted tree, but it can be difficult to read the names of the proteins on the plot. njplot is actually made to look at rooted trees, in which only the horizontal distance matters. Its picture is simpler and the names are more readable. The program chooses an outgroup by itself, which need not be the right one. An outgroup is one or more sequences that are assumed to be distantly related to the rest of the sequences in the tree.

unrooted 1A02_HUMAN.blastread.B2MG.ph &

njplot 1A02_HUMAN.blastread.B2MG.ph &

The windows opened by these programs can be resized by dragging their edges. Compare the two representations of the tree. In the njplot window you can choose a new outgroup by clicking on new outgroup and then clicking on a # sign. Try a few and compare the picture with the unrooted plot. When you are satisfied, click on show tree. The tree can be saved as a PostScript file and printed: click on file and save plot, where you can choose a name for the PostScript file.

Do the species group together as you expected them to? For some of the identifiers in the tree, the species are not obvious. Search SwissProt's homepage with the unknown identifiers to get further information about the entries. Type the identifier and click Quick Search. When you have found out the species, try to root the njplot tree at a probable outgroup.
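
If the names are hard to read in the plots, the identifiers can also be listed directly from the tree file. The sketch below assumes that the .ph file is a plain Newick (PHYLIP-style) tree, as ClustalX tree files normally are, and that your grep supports the -o option (GNU and BSD grep do). The function name tree_leaves and the toy tree are made up for illustration.

```shell
# List the leaf names of a Newick tree: a leaf name follows "(" or ","
# and ends at the ":" that introduces its branch length.
tree_leaves() {
    tr -d '\n\r ' < "$1" | grep -oE '[(,][^(),:;]+:' | tr -d '(,:'
}

# Toy tree with made-up names, laid out over several lines.
demo_tree=$(mktemp)
cat > "$demo_tree" <<'EOF'
(
(
SEQ_A:0.10,
SEQ_B:0.12)
:0.05,
SEQ_C:0.30,
SEQ_D:0.28);
EOF

leaves=$(tree_leaves "$demo_tree")
echo "$leaves"
rm -f "$demo_tree"
```

In the exercise directory, tree_leaves 1A02_HUMAN.blastread.B2MG.ph would give the identifiers to type into the SwissProt search field.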

Human immunoglobulin proteins

Select the human proteins from the matches in the BLAST report, and see how many entries are left:

grep "_HUMAN" 1A02_HUMAN.blastread > 1A02_HUMAN.blastread.human

wc 1A02_HUMAN.blastread.human

We can remove the proteins with the same description line using another Perl script:

blastextract.pl 4 1A02_HUMAN.blastread.human > 1A02_HUMAN.blastread.human.uniq

wc 1A02_HUMAN.blastread.human.uniq

Retrieve the protein sequences from SwissProt

Now we have a collection of human SwissProt entries with different description lines, all of which are members of the immunoglobulin family. We will use this collection to make a phylogenetic tree that illustrates the relations in this protein family. But first we shall retrieve the protein sequences from SwissProt and collect them in one fasta file.

getfasta.pl 1A02_HUMAN.blastread.human.uniq > 1A02_HUMAN.blastread.human.uniq.fasta

Now we use ClustalX as before to make a multiple alignment and a phylogenetic tree of the human protein sequences. Start ClustalX and read the file 1A02_HUMAN.blastread.human.uniq.fasta into the program and proceed as before. Use njplot and unrooted to inspect the tree.

Take a closer look at the tree. Type the identifiers from the tree into the search window on SwissProt's homepage. This will tell you which proteins are hidden behind the identifiers, and you can get an idea of how the immunoglobulin-like proteins are related.


The Toll receptor

The Toll-like receptors are a family of receptors that recognize molecular structures specific to micro-organisms. The Toll-like receptors induce inflammatory responses as part of the non-specific innate immune response against invading microbes. The Toll-like receptors are all homologs of the Drosophila Toll protein, which is a transmembrane receptor involved in the control of dorso-ventral pattern formation in the fly embryo. Flies with defective Toll receptors are also unable to recognize fungal infections. Flies have no adaptive immune response, only a non-specific innate immune response.

Use the sequence of the Drosophila Toll receptor TOLL_DROME.fasta to make a BLAST search against SwissProt:

blastpgp -b 500 -v 500 -i TOLL_DROME.fasta -d /usr/cbs/databases/blastdb/sp > TOLL_DROME.blast

blastread.pl TOLL_DROME.blast > TOLL_DROME.blastread

Select the Toll-like proteins from the BLAST report, and get the proteins from SwissProt:

grep "Toll" TOLL_DROME.blastread > TOLL_DROME.blastread.Toll

getfasta.pl TOLL_DROME.blastread.Toll > TOLL_DROME.blastread.Toll.fasta

The Toll-like receptors are not members of the immunoglobulin family; they are proteins of a different origin. You can see this by searching for Toll and histocompatibility in both the new and the original BLAST results and counting the matches:

grep "Toll" 1A02_HUMAN.blastread | wc

grep "Toll" TOLL_DROME.blastread | wc

grep "histocompatibility" 1A02_HUMAN.blastread | wc

grep "histocompatibility" TOLL_DROME.blastread | wc
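
The four counts above can also be collected in one small table. The following is a minimal sketch using grep -c, which prints the number of matching lines directly. The function name count_table and the two toy files are made up for illustration.

```shell
# Print one tab-separated line per keyword and file: keyword, file, match count.
count_table() {
    for pattern in "Toll" "histocompatibility"; do
        for file in "$@"; do
            printf '%s\t%s\t%s\n' "$pattern" "$file" "$(grep -c "$pattern" "$file")"
        done
    done
}

# Toy files with made-up description lines, only to show the function working.
tmpdir=$(mktemp -d)
printf '%s\n' "HLA class I histocompatibility antigen" \
              "Beta-2-microglobulin precursor" > "$tmpdir/mhc.demo"
printf '%s\n' "Toll protein precursor" \
              "Toll-like receptor 4 precursor" > "$tmpdir/toll.demo"

table=$(cd "$tmpdir" && count_table mhc.demo toll.demo)
echo "$table"
rm -rf "$tmpdir"
```

In the exercise directory the call would be count_table 1A02_HUMAN.blastread TOLL_DROME.blastread. If the two protein families are unrelated, each keyword should show up in only one of the two reports.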

Now make an alignment and a phylogenetic tree as before using TOLL_DROME.blastread.Toll.fasta as the input file to clustalx. Use both unrooted and njplot to inspect the tree. What are the evolutionary implications of this tree?