Fold recognition using web-servers
Morten Nielsen (mniel@cbs.dtu.dk)
Claus Lundegaard (lunde@cbs.dtu.dk)
Overview
Fold recognition (FR) is the name given to the process of assigning
a known structure (a template) to a sequence of unknown structure (the query).
The methodologies used include:
- Straight sequence methods (like BLAST and PSI-BLAST)
- Sequence methods incorporating structure, including:
- predicted and known secondary structure
- structural environments
- structural alignments
- 'True' threading approaches
The difference between "sequence" based methods and threading methods is
not always clear. In principle, a sequence based method defines the "fitness"
of the query onto the template from the primary structures of the query and template
sequences. Threading methods, on the other hand, define the "fitness"
of the query from the structural environment of the template structure. However,
as you saw from the list above, some sequence based methods also incorporate
structural information from the template in the alignment, so the borderline is
blurred. The most powerful methods are neither "true" sequence based
nor "true" threading methods, but a mixture of the two.
Many fold recognition programs are available over the web, and today we're
going to get some experience of how to use them most effectively. It should
be quite fun because you will probably be finding out things that no-one
knew before.
Finding information
Note: If links are not given on this page, it's assumed that you'll use
Google to find them.
The exercise
Please mail the answers to all the questions asked in the exercise to:
lund@cbs.dtu.dk
Below we give you a list of three protein sequences. You shall now
try to use some tools to assign a fold and structure to each sequence.
The sequences are placed in the
categories CM (comparative modeling), CM/FR (comparative modeling/fold
recognition), and NF (new fold). As the categories suggest, the CM
is the easy class, the CM/FR the hard, and the NF the difficult (close to
impossible) class. (Why is this?)
In the exercise you shall find out which of the three sequences belongs to
which of the three categories, and for the two sequences belonging to the
CM and the CM/FR categories you shall find which template you should use to
build a homology model (template recognition).
Go to the BLAST web-site. Select
Protein-protein BLAST (blastp). Blast the
three sequences Query1, Query2,
and Query3 against PDB. You do this by
clicking on Protein-protein BLAST, choosing PDB from
Choose database, and pasting in the sequences one by one. Remember to press
FORMAT in the formatting BLAST window.
- Q1 What are the E-values for the three searches?
- Q2 Are any of the hits significant (E-value < 0.001)?
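The significance check in Q2 amounts to a simple threshold on the E-value. A minimal Python sketch, assuming the hits have been copied by hand into a list; the hit names and E-values below are invented placeholders, not real BLAST output, and the function name is hypothetical:

```python
# Hypothetical (identifier, E-value) pairs typed in from a BLAST hit list.
# These values are placeholders for illustration only.
hits = [("hitA", 2e-35), ("hitB", 0.0004), ("hitC", 0.8), ("hitD", 3.1)]

E_VALUE_CUTOFF = 0.001  # significance threshold used in this exercise


def significant_hits(hits, cutoff=E_VALUE_CUTOFF):
    """Return the hits whose E-value is strictly below the cutoff."""
    return [(name, evalue) for name, evalue in hits if evalue < cutoff]


for name, evalue in significant_hits(hits):
    print(f"{name}\t{evalue:.1e}")
```

The comparison is strict, so a hit with an E-value of exactly 0.001 is not counted as significant.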
Next use the PSI- and PHI-BLAST version of BLAST. Go to the
BLAST web-site once more.
Now select the Position-specific iterated and pattern-hit initiated BLAST
(PSI- and PHI-BLAST) option. Paste in the query sequence
Query1. Leave the database set to nr, and press
Blast. Again remember to press FORMAT on the formatting BLAST
page.
- Q3 How many significant hits does BLAST find (E-value < 0.001)?
- Q4 How large a fraction of the query sequence do the significant
hits match?
- Q5 Do you find any PDB hits among the significant hits (search for pdb in the
hit list or look for the colored S to the right of the E-value)?
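The fraction asked for in Q4 can be estimated from the query start and end positions reported for each significant alignment. A small Python sketch, assuming 1-based inclusive positions read off the "Query" lines of the alignments; the query length and ranges in the example are invented, and the function name is hypothetical:

```python
def query_coverage(query_length, aligned_ranges):
    """Fraction of query residues covered by at least one aligned range.

    aligned_ranges holds (start, end) query positions, 1-based and
    inclusive. Overlapping ranges are counted only once.
    """
    covered = set()
    for start, end in aligned_ranges:
        covered.update(range(start, end + 1))
    return len(covered) / query_length


# Hypothetical example: a 200-residue query with two overlapping hits.
print(query_coverage(200, [(10, 60), (50, 120)]))
```

Collecting the covered positions in a set keeps overlapping hits from being counted twice.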
Now run a second Blast iteration. Press Run PSI-Blast iteration 2,
go to the formatting BLAST window and press FORMAT.
- Q6 How many significant hits does BLAST find (E-value < 0.001)?
- Q7 How large a fraction of the query sequence do the significant
hits match?
- Make sure you understand what is going on.
- Q8 Why does BLAST come
up with more significant hits in the second iteration?
- Q9 Do you find any PDB hits among the significant hits (search for pdb in the
hit list or look for the red colored S to the right of the E-value)?
- Q10 What is the PDB identifier for the best PDB hit?
Repeat the PSI-BLAST search for the other two query sequences.
Now you have probably found that one of the three protein targets could
be modeled using sequence searches only, and this query is hence the easy one
(the CM query).
You shall now use some more advanced tools to try to model the last two
sequences. A large number of web-based protein modeling programs exist,
and we cannot go through them all here. We normally focus on two
servers. The first is Phyre (formerly 3D-PSSM), not because it performs better than the other
programs (it is actually far from being the best), but because it has a very
nice and informative web-interface.
Since these are currently down, we have chosen to use two other servers:
HHpred
and
FFAS (choose servers, then FFAS03).
To save you time, we have submitted the 3 sequences to the
HHpred
server. To see
the output click on the following links
Query1,
Query2,
Query3.
We have also precomputed the results for Query 1-3 using the FFAS server.
Click here to see the results for Query1-3.
Spend some time looking at the results and make sure you understand
what is going on.
- Q11 Try to classify the two remaining queries into CM/FR (hard) and NF (difficult).
- Q12 What template does the Phyre server find for the hard (CM/FR) query?
You probably found that one of the two sequences could be modeled with high
certainty using the Phyre server. As in all other prediction games, you can often get a better
answer by putting the same question to many different prediction
servers. The list of public protein modeling servers is long. Using the META-server,
you can submit a query sequence to a list of protein model prediction
servers simultaneously. The calculation takes some time, so we have submitted the
sequences to the server beforehand. You find the output by clicking on the links
Query2 and
Query3.
Check the Meta-server output for each of the two targets (we have left out Query 1).
A powerful way to combine the output from many prediction
servers is to extract a consensus prediction. The 3D-Jury program does this. 3D-Jury
calculates a score for the models predicted by the different servers, and reports a jury
score. A value above 50 means a significant model. On the Meta-server output web page you can select
the set of servers you would like to display (use the right selection window, and use Ctrl to select
multiple methods). Select for instance FFAS03, 3D-PSSM, Blast, and PDB-Blast and compare the
predictions.
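The idea of a consensus score can be sketched in a few lines of Python. This is a simplified illustration, not the 3D-Jury implementation: it assumes that a pairwise similarity between models has already been computed by some structural comparison, and the server names, similarity values, function name, and pair threshold of 40 are all assumptions made for the example. The resulting numbers are therefore not comparable with the jury score cutoff of 50 used on the Meta-server.

```python
def consensus_scores(similarity, pair_threshold=40):
    """Score each model by its agreement with all other models.

    similarity[a][b] is a precomputed similarity between models a and b
    (hypothetical values here). Pairs scoring below pair_threshold are
    treated as unrelated and contribute nothing.
    """
    scores = {}
    for model, row in similarity.items():
        total = sum(s for other, s in row.items()
                    if other != model and s >= pair_threshold)
        # Normalise by the number of models compared against.
        scores[model] = total / max(len(similarity) - 1, 1)
    return scores


# Hypothetical similarities between the top models of three servers.
similarity = {
    "serverA": {"serverB": 80, "serverC": 10},
    "serverB": {"serverA": 80, "serverC": 15},
    "serverC": {"serverA": 10, "serverB": 15},
}
scores = consensus_scores(similarity)
best = max(scores, key=scores.get)
print(best, scores[best])
```

A model that many other servers independently agree with gets a high score, while an isolated prediction scores low even if its own server rated it highly.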
- Q13 Can any of the sequences be modeled with a significant hit?
- Q14 Do the best hits found by the 3D-Jury and Phyre methods agree (i.e. same name or SCOP class)?
- Q15 How do the top scoring jury hits differ in SCOP class?
- Q16 Can you classify the two sequences into hard (CM/FR) and difficult (NF)?
- Q17 Can you come up with a good guess for a template and a SCOP class for the CM/FR query?
Now you probably found that the fold of one of the difficult query sequences could be
found using the jury approach, where many different protein structure prediction servers
are combined. The significance of the hit you found and of the corresponding SCOP class was very
high, even though the individual prediction servers could come up with (significant) hits
belonging to very different structure classes!
If you have more time, you shall
as a final exercise show that a postdoc in our group
could have saved 3 years of hard work during her PhD if it had been possible for her to
use the advanced techniques for remote fold recognition to predict the structure and function of her
protein sequence. You shall use the HHpred server to make a homology model of the protein sequence.
The calculation takes some time, so we have submitted the
sequence to the server beforehand. You find the output by clicking on the link
HHpred Model
Look at the output from the prediction and see for instance if the active site
(S9, G42, N74, and H195) is conserved in the alignment to template 1esc.
Look at the
histogram for the profile-profile
alignment for 1esc. Can you confirm the conservation of the active site here?
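Checking conservation at given query positions means walking along the pairwise alignment and counting only the non-gap query residues. A minimal Python sketch, assuming the two alignment rows have been pasted in as equal-length strings with '-' for gaps and that the query row starts at query residue 1 (alignments from a server may start later, in which case the offset must be added). The toy sequences are invented and are not the real query or 1esc sequences; the function name is hypothetical.

```python
def template_residues_at(query_aln, template_aln, query_positions):
    """Map 1-based query residue numbers to the aligned template residues.

    query_aln and template_aln are equal-length rows of a pairwise
    alignment with '-' for gaps. Returns {position: (query, template)}.
    """
    if len(query_aln) != len(template_aln):
        raise ValueError("alignment rows must have equal length")
    wanted = set(query_positions)
    result = {}
    pos = 0
    for q, t in zip(query_aln, template_aln):
        if q == "-":
            continue  # gap in the query: no query residue to count
        pos += 1
        if pos in wanted:
            result[pos] = (q, t)
    return result


# Toy alignment (invented sequences, for illustration only).
query_aln = "MKS-GLTNH"
template_aln = "MRSAG-TDH"
found = template_residues_at(query_aln, template_aln, [3, 7, 8])
for pos, (q, t) in found.items():
    status = "conserved" if q == t else "differs"
    print(f"{q}{pos} -> {t} ({status})")
```

For the exercise, the positions of interest would be 9, 42, 74, and 195, applied to the alignment rows taken from the HHpred output.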
And now a final philosophical question. What do you do if you cannot find a template?
Livebench
Now you have seen a real-life protein fold recognition experiment. All the
servers included in the Meta-server believe themselves to be among the best in the world.
Read about the Livebench
project. Why is it important to assess fold recognition servers in this
way? Click on Set 9, and see which methods performed best.
Further reading
For more practical advice on FR and structure prediction in general, including
the issue of domain assignment and non-globular regions which we haven't
had time to cover here, see Rob Russell's comprehensive Guide
to Structure Prediction.
That's it for now!