Fold recognition using web-servers
Morten Nielsen (mniel@cbs.dtu.dk)
Overview
Fold recognition (FR) is the name given to the process of assigning
a known structure (a template) to a sequence of unknown structure (the query).
The methodologies used include:
- Straight sequence methods (like BLAST and PSI-BLAST)
- Sequence methods incorporating structure, including:
- predicted and known secondary structure
- structural environments
- structural alignments
- 'True' threading approaches
The difference between "sequence" based methods and methods using threading is
not always clear. In principle, a sequence based method defines the "fitness"
of the query onto the template from the primary structure of the query and template
sequences. Threading methods, on the other hand, define the "fitness"
of the query from the structural environment of the template structure. However,
as you saw from the list above, some sequence based methods also incorporate
structural information of the template in the alignment, so the borderline is
blurred. The most powerful methods are neither "true" sequence based
nor "true" threading methods, but some mixture of the two.
Many fold recognition programs are available over the web, and today we're
going to get some experience of how to use them most effectively.
Finding information
Note: If a link is not given on this page, it is assumed that you will use
Google to find it.
The exercise
Below we give you a list of three protein sequences. You shall now
try to use some tools to assign a fold and structure to each sequence.
The sequences are placed in the
categories CM (comparative modeling), CM/FR (comparative modeling/fold
recognition), and NF (new fold). As the categories suggest, the CM
is the easy class, the CM/FR the hard, and the NF the difficult (close to
impossible) class. (Why is this?)
In the exercise you shall find out which of the three sequences belongs to
which of the three categories, and for the two sequences belonging to the
CM and the CM/FR categories you shall find which template you should use to
build a homology model (template recognition).
Go to the BLAST web-site. Select
Protein BLAST. Blast the
three sequences Query1, Query2,
and Query3 against PDB. Note you do this by
pasting your sequence (including FASTA header) into the Query Sequence window.
Then under Database select Protein Data Bank proteins (pdb).
Then press BLAST.
- Q1 What are the E-values for the three searches?
- Q2 Are any of the hits significant (E-value < 0.001)?
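The significance check in Q2 can also be done on an exported hit table. The sketch below is only an illustration: it assumes you have saved the hits in BLAST's 12-column tabular layout (query id, subject id, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score). The subject names and all numbers are made up and are not results for Query1.

```python
# Sketch: filter hits from a BLAST tabular hit table by E-value.
# All rows below are hypothetical toy data, not real search results.
E_VALUE_CUTOFF = 0.001  # threshold used in this exercise

HIT_TABLE = "\n".join([
    "Query1\thit_A\t35.2\t310\t190\t6\t120\t430\t5\t312\t2e-45\t160",
    "Query1\thit_B\t24.1\t95\t66\t4\t200\t290\t10\t100\t0.57\t30.1",
    "Query1\thit_C\t28.0\t150\t100\t5\t130\t275\t1\t148\t4e-04\t42.0",
])


def parse_hits(text):
    """Return a list of dicts with the fields needed in this exercise."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        hits.append({
            "subject": fields[1],
            "qstart": int(fields[6]),
            "qend": int(fields[7]),
            "evalue": float(fields[10]),
        })
    return hits


def significant_hits(hits, cutoff=E_VALUE_CUTOFF):
    """Keep only hits whose E-value is below the cutoff."""
    return [h for h in hits if h["evalue"] < cutoff]


hits = parse_hits(HIT_TABLE)
significant = significant_hits(hits)
for hit in significant:
    print(hit["subject"], hit["evalue"])
```

With the toy table, hit_A and hit_C pass the cutoff and hit_B does not.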
Next use the PSI-BLAST version of BLAST. Go to the
BLAST web-site and choose Protein BLAST once more.
Paste in the Query1 sequence. This time under Algorithm select PSI-BLAST. Set the database to Non-redundant protein sequences (nr). Under Algorithm parameters set Max target sequences to 1000. Then press BLAST.
- Q3 How many significant hits does blast find (E-value < 0.001)?
- Q4 How large a fraction of the query sequence do the significant
hits match?
- Q5 Do you find any PDB hits among the significant hits (search
for the colored S to the right of the E-value)?
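For Q4, the fraction of the query matched by the significant hits can be estimated by merging the query ranges of those hits and dividing the covered length by the query length. The following is a minimal sketch; the query length and the ranges are hypothetical numbers chosen for illustration.

```python
# Sketch: fraction of a query covered by a set of hit ranges.
# Ranges are (query_start, query_end), 1-based and inclusive, as reported by BLAST.

def covered_fraction(query_length, ranges):
    """Merge overlapping or adjacent ranges and return covered length / query length."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(ranges):
        if current_end is None or start > current_end + 1:
            # A new, separate block starts: close the previous one first.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return covered / query_length


# Hypothetical example: a 400-residue query with two overlapping hits.
fraction = covered_fraction(400, [(101, 300), (250, 400)])
print(fraction)
```

In the toy example the two hits merge into residues 101-400, so three quarters of the query are covered and the first 100 residues are not matched by any hit.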
Now run a second Blast iteration. Press Run PSI-Blast iteration 2.
Note: You might get a BLAST error
for one of the sequences. Can you make sense of this error? If you get an error, it just means that PSI-BLAST
has failed for this query sequence, and you cannot answer questions Q6-Q10 for this query.
- Q6 How many significant hits does Blast find (E-value < 0.001)?
- Q7 How large a fraction of the query sequence do the significant
hits match?
- Q8 Make sure you understand what is going on. Why does Blast come
up with more significant hits in the second iteration?
- Q9 Do you find any PDB hits among the significant hits (search for
the red colored S to the right of the E-value)?
- Q10 What is the PDB identifier for the best PDB hit?
If you did not find any PDB hits, try a third iteration.
Repeat the PSI-BLAST search for the other two query sequences (up to three iterations).
Now you have probably found that one of the three protein targets could
be modeled using sequence searches only, and this query is hence the easy one
(the CM query).
Identifying conserved residues
You have now (hopefully) identified a structural relationship between the Query sequence and
a protein sequence in the PDB database of protein structures. Say you would like to validate this
relationship. One could do this by mutating (substituting) essential residues in the query sequence
and testing whether the protein function (or structure) is affected by these mutations.
The protein sequence of the CM query (Query1) is large (more than 400 amino acids) and a complete mutation study
including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles
to identify conserved residues that are likely to be essential for the protein structure and/or
protein function.
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo
programs to select four of the eight residues for a mutagenesis study (you shall select the four
residues based on sequence conservation only).
- (a): H271
- (b): R287
- (c): E290
- (d): Y334
- (e): F371
- (f): R379
- (g): R400
- (h): Y436
You shall use the Blast2logo server to identify
which residues are conserved in the Query protein sequence. Go to the Blast2logo server and upload
the Query sequence. The program allows you
to select the sequence database for the Blast search. You shall NOT submit the query here. To save time, we have submitted the job,
and you can find the output by following this link:
Blast2logo output.
When the job is completed you should see the logo-plot on the website. If the logo does not display, you can download
the image file (click on the Download logo file) and open it from your desktop.
- Q10.1 Spend a little time looking at the logo plot. Can you understand why the logo is so flat for
the first 100 residues (how large a fraction of the query sequence did the Blast search cover)?
- Q10.2 Which four of the eight residues listed above are most conserved and hence most likely
to be essential for the protein stability and/or function?
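The height of a logo column reflects how conserved that position is. A common way to quantify this is the information content of the column: log2(20) minus the Shannon entropy of the observed amino acid frequencies. The sketch below illustrates that calculation on a made-up toy alignment. It is a simplification and an assumption about the general idea only; the Blast2logo server's own calculation may differ (for example through sequence weighting and pseudocounts).

```python
# Sketch: per-column information content (in bits) of a toy alignment.
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content = log2(alphabet size) - Shannon entropy of the column."""
    residues = [r for r in column if r != "-"]  # ignore gaps
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


# Hypothetical alignment of four short sequences (not taken from Query1).
ALIGNMENT = ["HKRA", "HRRG", "HKRS", "HEKT"]

columns = ["".join(seq[i] for seq in ALIGNMENT) for i in range(len(ALIGNMENT[0]))]
information = [column_information(col) for col in columns]
for position, (col, bits) in enumerate(zip(columns, information), start=1):
    print(position, col, round(bits, 2))
```

A fully conserved column (the first one in the toy alignment) reaches the maximum of log2(20), about 4.32 bits, while a column with four different residues scores much lower. A flat logo region corresponds to columns with little or no information.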
You shall use the Phyre protein homology program to check whether
the four most conserved residues from question Q10.2 could indeed form an active site in the structure.
Go to the Phyre web-site and upload
the Query sequence.
Note that it might take some time (10-20 minutes) before your job is completed. To save you time, we have
run the calculation for you. You can find the output here: Phyre output.
Find the PDB hit identified by PSI-BLAST (you can click on the SCOP code to get to the PDB template for each model).
- Q10.3 Does Phyre agree that this hit is significant?
Download the highest scoring Phyre model (click on the View Model image for the first model), and open the model file in Pymol.
If you do not have Pymol installed on your computer,
you can find a free download here Pymol 099 downloads.
Show the location of the four essential residues from question Q10.2 on the structure.
- Q10.4 Could the residues form an active site?
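One way to make Q10.4 more quantitative is to measure how far apart the selected residues are in the model: residues forming an active site are expected to lie close together in space even if they are far apart in sequence. The sketch below reads C-alpha coordinates from fixed-column PDB ATOM records and prints pairwise distances. The residue numbers and coordinates are made-up toy values, and the helper make_ca_line exists only to generate that toy data; with a real model you would pass the lines of the downloaded model file instead.

```python
# Sketch: pairwise C-alpha distances between selected residues of a PDB model.
import math
from itertools import combinations


def make_ca_line(serial, resname, chain, resseq, x, y, z):
    """Toy-data helper: format one CA atom as a fixed-column PDB ATOM record."""
    return "ATOM  %5d  CA  %3s %1s%4d    %8.3f%8.3f%8.3f" % (
        serial, resname, chain, resseq, x, y, z)


def read_ca_coords(lines):
    """Map residue number -> (x, y, z) of its CA atom, using PDB column positions."""
    coords = {}
    for line in lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            resseq = int(line[22:26])
            coords[resseq] = (float(line[30:38]),
                              float(line[38:46]),
                              float(line[46:54]))
    return coords


def pairwise_distances(coords, residues):
    """Distance in Angstrom between the CA atoms of every pair of residues."""
    return {(a, b): math.dist(coords[a], coords[b])
            for a, b in combinations(residues, 2)}


# Hypothetical model with three residues (numbers and coordinates are made up).
MODEL_LINES = [
    make_ca_line(1, "HIS", "A", 10, 0.0, 0.0, 0.0),
    make_ca_line(2, "ARG", "A", 25, 3.0, 4.0, 0.0),
    make_ca_line(3, "TYR", "A", 40, 0.0, 0.0, 12.0),
]

coords = read_ca_coords(MODEL_LINES)
distances = pairwise_distances(coords, [10, 25, 40])
for pair, dist in distances.items():
    print(pair, round(dist, 1))
```

C-alpha distances are only a rough guide, since side chains can reach several Angstrom further; inspecting the residues in Pymol remains the main check.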
Remote homology modeling
You shall now use some more advanced tools to try to model the last two
sequences. There exists a large number of web-based protein modeling
programs. We cannot go through them all here, so we will focus on just three
servers. First Phyre, not because
it performs better than the other programs, but because it has a very
nice and informative web-interface. Next
HHpred and
CPHmodels.
To save you time, we have submitted the two sequences (Query2 and Query3) to the three servers.
To see the output, click on the following links:
Phyre
CPHmodels
HHpred
Spend some time looking at the results and make sure you understand what is going on.
The output from CPHmodels is far from intuitive, so
you might want to focus on the two other methods when trying to understand what is going on.
Note that the CPHmodels and Phyre servers provide full atom model(s) for the query sequence based on single template
modeling. HHpred only provides templates and alignments.
- Q11 Try to classify the queries into CM/FR (hard), and NF (difficult/close to impossible)?
As a rule of thumb, a
CPHmodels Z-score above 10 indicates a correct fold; the other servers provide E-values and P-values.
- Q12 What template do the Phyre and CPHmodels servers find for the hard (CM/FR) query?
Save the top scoring model from the CPHmodels (it only gives you one model) and Phyre servers.
Try to superimpose the two models using Pymol.
You can do this by loading the two files into Pymol and
using the align command
align CPHmodel, Phyremodel
where CPHmodel and Phyremodel are the Pymol object names of the models predicted by CPHmodels and Phyre,
respectively (by default, Pymol names a loaded object after its file name without the .pdb extension).
As you can see, the two models are different but clearly structurally superimposable.
- Q12.1 Where on the HHpred template list do you find the best scoring Phyre template?
As you probably found, the HHpred server does not agree with the Phyre and CPHmodels servers on which is the
best scoring template. This leads to the concept of meta-servers and multiple template modeling.
Like in all other prediction games, you can often get a better
idea of an answer to a question by asking the question to many different prediction
servers. The list of public protein modeling servers is long. You can
submit a query to a list of servers using the META-server.
On the server you can submit a query sequence to a list of protein model prediction
servers simultaneously. The calculation takes some time, so we have submitted the
sequences to the server beforehand. You find the output by clicking on the links
Query2 and
Query3.
Check the Meta-server output for each of the two targets (we have left out Query 1).
A powerful way to combine the output from many prediction
servers is to extract a consensus prediction. The 3D-Jury program does this. 3D-Jury
calculates a score for the models predicted by the different servers, and reports a jury
score. A value above 50 means a significant model. On the Meta-server output website you can select
the set of servers you would like to display (use the right selection window, and use Ctrl to select
multiple methods). If you have time, play a bit with the settings and compare the predictive results.
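The idea behind a consensus (jury) prediction can be illustrated with a toy calculation: a model that is structurally similar to many models from other servers receives a high score, while an outlier that no other server supports receives a low one. The sketch below is a simplified illustration of that idea only and is not the 3D-Jury implementation. The server names, the similarity values, the min_similarity parameter and the averaging are all assumptions made for the example.

```python
# Sketch: toy consensus scoring of models from several servers.
# similarity[a][b] is a made-up structural similarity between models a and b
# (higher means more similar); it is symmetric in this toy example.

def consensus_scores(similarity, min_similarity=40):
    """Score each model by its average similarity to the other models.

    Pairs below min_similarity (a hypothetical noise cutoff) contribute zero.
    """
    scores = {}
    for name, others in similarity.items():
        supported = [s for other, s in others.items()
                     if other != name and s >= min_similarity]
        scores[name] = sum(supported) / len(others)
    return scores


# Hypothetical models from four servers.
SIMILARITY = {
    "serverA": {"serverB": 80, "serverC": 70, "serverD": 10},
    "serverB": {"serverA": 80, "serverC": 60, "serverD": 20},
    "serverC": {"serverA": 70, "serverB": 60, "serverD": 15},
    "serverD": {"serverA": 10, "serverB": 20, "serverC": 15},
}

scores = consensus_scores(SIMILARITY)
best_model = max(scores, key=scores.get)
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 1))
```

In the toy data, the models from serverA, serverB and serverC agree with each other and score well, while the serverD model is supported by no other model and scores zero. This shows how a consensus can be significant even when each individual server reports only weak hits.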
- Q13 Can any of the two sequences (Query2 and Query3) be modeled with a significant hit (Jscore > 40)?
- Q14 Does the best hit found by the 3D-JURY, CPHModels, Phyre and
HHpred methods agree (i.e. same name or function)?
- Q15 Do the top scoring jury hits share SCOP class?
- Q16 Can you classify the two sequences into hard (CM/FR) and difficult (NF)?
- Q17 Can you come up with a good guess for a template, SCOP class and function for the CM/FR query?
Now you probably found that the fold of one of the difficult query sequences could be
found using the jury approach, where many different protein structure prediction servers
are combined. The significance of the hit you found and of the corresponding SCOP class was very
high, even though none of the individual prediction servers could come up with (significant) hits.
And now a final philosophical question. What do you do if you cannot find a template?
Livebench
Now you have seen a real life protein fold recognition experiment. All the
servers included in the Meta-server believe themselves to be among the best in the world.
Read about the Livebench
project. Why is it important to assess fold recognition servers in this
way? Click on the Set Livebench-2008.2, and see which methods performed the best.
Further reading
For more practical advice on FR and structure prediction in general, including
the issue of domain assignment and non-globular regions which we have not
had time to cover here, see Rob Russell's comprehensive Guide
to Structure Prediction.
That's it for now!