
Exercise: PSI-BLAST


Exercise written by: Morten Nielsen


Introduction

In last week's exercise you used the BLAST program to perform fast alignments of DNA and protein sequences. As shown in today's lecture, BLAST will often fail to recognize relationships between proteins with low sequence similarity. In today's exercise, you shall use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and see how such profiles can be used to
  • Identify relationships between proteins with low sequence similarity
  • Identify conserved residues in protein sequences (residues important for the structural stability or function of the protein)

Links:

  • NCBI BLAST: http://www.ncbi.nlm.nih.gov/BLAST/
    • You shall use the Position-specific iterated and pattern-hit initiated BLAST (PSI- and PHI-BLAST) link
  • NCBI PSI-BLAST Tutorial
  • HHpred is a powerful tool for remote protein homology modeling. Here you shall use it only to visualize protein sequence profiles and identify conserved residues.

First part. When BLAST fails

Say you have a sequence Query and you want to make predictions about its function and structure. As seen in last week's exercise, you will most often use BLAST to do this. However, what happens when BLAST fails?

You shall use the PSI- and PHI-BLAST version of BLAST. Go to the BLAST web-site. Now select the Position-specific iterated and pattern-hit initiated BLAST (PSI- and PHI-BLAST) option. Paste in the query sequence Query. Set the database to pdb, and press Blast. Remember to press FORMAT on the formatting BLAST page.

  • Q1 How many significant hits does BLAST find (E-value < 0.005)?
  • Answer: Zero
Now go back to the BLAST web-site. Select the Position-specific iterated and pattern-hit initiated BLAST (PSI- and PHI-BLAST) option. Paste in the query sequence Query. Set the database to nr, and press Blast. Remember to press FORMAT on the formatting BLAST page.

  • Q2 How many significant hits does BLAST find (E-value < 0.005)?
  • Answer: 15 (excluding the identical match)
  • Q3 How large a fraction of the query sequence do the significant hits match (excluding the identical matches)?
  • Answer: About 50%
  • Q4 Do you find any PDB hits among the significant hits (search for pdb in the hit list or look for the colored S to the right of the E-value)?
  • Answer: No
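
Questions Q1 to Q3 can also be answered programmatically. The sketch below assumes that the search was run with the command-line BLAST+ programs rather than the web page, and that the result was saved in the default 12-column tabular format (-outfmt 6). The function names, the hit names, and the query length of 440 are invented for illustration and are not taken from the exercise.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    subject: str
    q_start: int   # first aligned query position (1-based)
    q_end: int     # last aligned query position (inclusive)
    evalue: float


def parse_tabular(text: str) -> List[Hit]:
    """Parse BLAST+ tabular output with the default 12 columns of -outfmt 6."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        # columns: qseqid sseqid pident length mismatch gapopen
        #          qstart qend sstart send evalue bitscore
        hits.append(Hit(cols[1], int(cols[6]), int(cols[7]), float(cols[10])))
    return hits


def significant_hits(hits: List[Hit], threshold: float = 0.005) -> List[Hit]:
    """Keep hits with an E-value below the threshold used in the exercise."""
    return [h for h in hits if h.evalue < threshold]


def query_coverage(hits: List[Hit], query_length: int) -> float:
    """Fraction of query positions covered by at least one hit."""
    covered = set()
    for h in hits:
        covered.update(range(h.q_start, h.q_end + 1))
    return len(covered) / query_length


# Invented toy data: three hits against a hypothetical 440-residue query.
EXAMPLE = "\n".join([
    "query\thitA\t35.0\t220\t140\t3\t221\t440\t1\t218\t1e-20\t95.0",
    "query\thitB\t28.0\t150\t105\t3\t250\t399\t10\t158\t0.003\t40.0",
    "query\thitC\t25.0\t60\t45\t0\t1\t60\t5\t64\t2.1\t25.0",
])

hits = parse_tabular(EXAMPLE)
sig = significant_hits(hits)
print(len(sig))                  # 2
print(query_coverage(sig, 440))  # 0.5
```

In the toy data both significant hits fall in the second half of the query, which is the kind of pattern Q3 asks you to look for.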

Now run a second BLAST iteration. Press Run PSI-Blast iteration 2, go to the formatting BLAST window and press FORMAT.

  • Q5 How many significant hits does BLAST find (E-value < 0.005)?
  • Answer: More than 100
  • Q6 How large a fraction of the query sequence do the significant hits match?
  • Answer: About 50%

Make sure you understand what is going on.

  • Q7 Why does BLAST come up with more significant hits in the second iteration?
  • Answer: In the first iteration BLAST uses the BLOSUM scoring matrix to align and identify significant hits. Before running the second iteration, the sequences of the significant hits are aligned and a sequence profile is estimated. That is, at each position the frequency of each of the 20 amino acids is estimated. For the second BLAST iteration, this sequence profile is used as the scoring matrix, making the search specific for the query sequence.
  • Q8 Do you find any PDB hits among the significant hits (search for pdb in the hit list or look for the red colored S to the right of the E-value)?
  • Answer: Yes. The hit gi|2981658|pdb|1A0P|.
  • Q9 What is the PDB identifier for the best PDB hit?
  • Answer: 1A0P
  • Q10 What is the sequence similarity between the query and this PDB hit?
  • Answer: 18%
  • Q11 What is the function of this protein?
  • Answer: Site-Specific Recombinase. DNA Recombination.
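
The profile estimation described in the answer to Q7 can be sketched in a few lines of Python. This is a simplified illustration and not the PSI-BLAST implementation: it assumes a flat pseudocount and a uniform background distribution, whereas PSI-BLAST also weights the sequences and uses more elaborate pseudocounts. The function names and the four toy sequences are invented.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(alignment, pseudocount=1.0):
    """Estimate amino acid frequencies per alignment column (gaps ignored)."""
    profile = []
    for i in range(len(alignment[0])):
        counts = Counter(seq[i] for seq in alignment if seq[i] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def log_odds(profile):
    """Turn frequencies into log-odds scores against a uniform background.

    Assumption: background frequency 1/20 for every amino acid.
    """
    background = 1.0 / len(AMINO_ACIDS)
    return [
        {aa: math.log2(col[aa] / background) for aa in AMINO_ACIDS}
        for col in profile
    ]


def score(pssm, sequence):
    """Score an ungapped sequence of the same length against the profile."""
    return sum(col[aa] for col, aa in zip(pssm, sequence))


# Invented toy alignment of four sequences; column 2 is always R.
alignment = ["ARND", "ARNE", "ARQD", "GRND"]
pssm = log_odds(column_frequencies(alignment))

print(pssm[1]["R"] > 0)  # True: R is favored at the conserved column
print(pssm[1]["W"] < 0)  # True: W was never observed there
print(score(pssm, "ARND") > score(pssm, "WWWW"))  # True
```

Unlike BLOSUM, where a given substitution gets the same score everywhere, the profile scores the same amino acid differently at each position. This is what makes the second iteration specific for the query.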


Identifying conserved residues

You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. One could do this by mutating (substituting) essential residues in the query sequence and testing whether the protein function (or structure) is affected by these mutations.

The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.

Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and HHpred programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).

  • (a): H271
  • (b): R287
  • (c): E290
  • (d): Y334
  • (e): F371
  • (f): R379
  • (g): R400
  • (h): Y436

You shall use the HHpred program to identify which residues are conserved in the Query protein sequence. Go to the HHpred web-site and upload the Query sequence and press submit (note it might take some minutes before your job is completed).

Find the PDB hit identified by PSI-BLAST.

  • Q12 Does HHpred agree that this hit is significant?
  • Answer: Yes. The hit 1a0p is ranked second with an E-value of e-34.

Scroll down to the alignment of the Query and the hit identified by PSI-BLAST (hit number 2). Next click on the Show histograms icon. Now, from the histogram bars you can identify which residues are conserved in the query/template alignment.

  • Q13 Which of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?
  • Answer: R287, E290, R400, and Y436 (Note: R287 refers to amino acid 287, which is an R)
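
Reading conservation off histogram bars can also be expressed as a number per alignment column. The sketch below uses the information content (log2(20) minus the Shannon entropy of the column) as one common conservation measure. It is an assumption made for illustration and not necessarily what HHpred computes for its histograms. The function names and the toy alignment are invented.

```python
import math
from collections import Counter


def information_content(column):
    """log2(20) minus the Shannon entropy of the residues in one column.

    Gaps ('-') are ignored. A fully conserved column gives log2(20) bits.
    """
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    entropy = -sum(
        (k / n) * math.log2(k / n) for k in Counter(residues).values()
    )
    return math.log2(20) - entropy


def rank_positions(alignment, candidates):
    """Rank 1-based candidate positions by decreasing information content."""
    scored = [
        (pos, information_content([seq[pos - 1] for seq in alignment]))
        for pos in candidates
    ]
    return sorted(scored, key=lambda item: -item[1])


# Invented toy alignment; position 2 is fully conserved, position 3 is not.
alignment = ["HRAY", "HRGF", "KRSY", "HRTW"]
ranking = rank_positions(alignment, [1, 2, 3, 4])
print([pos for pos, _ in ranking])  # [2, 1, 4, 3]
```

With a real alignment you would pass the eight candidate positions and pick the four with the highest scores.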

Go to the PDB database and download the structure of 1A0P (use Google to find the link to the PDB database if you do not know it).

  • Q14 From the HHpred alignment identify the four essential residues from Q13 in the 1A0P sequence. Use PyMol to visualize the 1A0P structure and show the location of the four essential residues.
  • Answer: The hard part here is to identify the four residues in the sequence of 1A0P. The four residues in the 1A0P sequence are R129, E132, R228, and Y258. In the presentation you will find an image showing the location of the four residues on the 1A0P structure.
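
The bookkeeping behind Q14 is to walk along the pairwise alignment while counting residues separately in the query and in the template, skipping gaps. A minimal sketch is given below. The function name and the short toy alignment are invented and are not the real Query/1A0P alignment. The start positions are those shown at the beginning of the aligned block.

```python
def map_positions(query_aln, template_aln, query_start=1, template_start=1):
    """Map query residue numbers to template residue numbers.

    Both strings are rows of the same pairwise alignment with '-' for gaps.
    Returns {query_pos: (query_residue, template_residue, template_pos)}
    for every column where both sequences have a residue.
    """
    if len(query_aln) != len(template_aln):
        raise ValueError("aligned sequences must have the same length")
    mapping = {}
    q_pos, t_pos = query_start, template_start
    for q_res, t_res in zip(query_aln, template_aln):
        if q_res != "-" and t_res != "-":
            mapping[q_pos] = (q_res, t_res, t_pos)
        if q_res != "-":
            q_pos += 1  # advance only when the query has a residue
        if t_res != "-":
            t_pos += 1  # advance only when the template has a residue
    return mapping


# Invented toy alignment: the query block starts at residue 10,
# the template block at residue 3.
mapping = map_positions("MKR-LEY", "M-RGLDY", query_start=10, template_start=3)
print(mapping[12])    # ('R', 'R', 4)
print(11 in mapping)  # False: query K11 is aligned to a gap
```

Note that the numbers obtained this way are positions in the template sequence as used in the alignment. The residue numbering in the PDB file may be offset from these, so check the residue type in PyMol before trusting a position.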

Now you have seen the power of sequence profiles in general and the PSI-BLAST program in particular. Using sequence profiles you have been able to identify a relationship between protein sequences far below 30% sequence similarity. Further, you have made qualified predictions on the protein function and selected a set of essential amino acids suitable for experimental validation of the structural and functional predictions.