Homology Modeling

 

Introduction

Predicting protein 3D structure directly from sequence – although theoretically possible – is not yet feasible. Thus, predictions are generally limited to those cases where the 3D structures of related sequences are available. Using this kind of information to derive a 3D model for a sequence of interest is known as homology modeling. The individual steps involved in homology modeling are outlined below.

 

A.      Template search
Any homology modeling project needs a template to act as a scaffold for the model. Template structures are found by sequence similarity searches such as BLAST or by more sensitive methods.

B.      Alignment
Once a template has been identified, the query and template sequences are aligned. This is the single most important step in homology modeling and correspondingly also the most common source of serious errors in the final model.

C.       Modeling
Most homology modeling programs use a single template to derive the 3D model. More advanced methods (such as HHpred) allow for the use of multiple templates.

D.      Consensus modeling
On average, the best homology modeling programs perform comparably. However, results vary more among the different programs as the similarity between query sequence and template structures decreases. Therefore, it is important to compare the output from different methods to obtain the most reliable model.

E.       Model validation
As we generally do not have the real 3D structure of the query sequence (that’s why we do modeling), we cannot measure directly how well the model corresponds to the true structure. We are therefore limited to assessing the correctness using measures such as model compactness and stereochemical properties (bond lengths, angles, dihedrals). More advanced methods have been developed to distinguish native-like from non-native models (e.g. ProQ).

 

Purpose of exercise

This exercise illustrates some of the choices and compromises involved in building a 3D model from a protein sequence using homology modeling. We will use three common web servers: HHpred, CPHmodels and Phyre.

 

Sequence

You are going to work with the protein Y462_TREPA from Treponema pallidum, the Gram-negative spirochaete bacterium that causes syphilis. The full Swiss-Prot entry for the protein can be found here.

 

>Y462_TREPA
MRRIVCPPVLFLSASLLTGCDFSGIFASIQSEVPLKIPSIRGVVTGLVKCNNKLYACAGQ
LWEKDASKSEGKWTAVNFLPGKKITSIVSKGACVYACVSGEGVYTYTSNGAGRTGGTTTP
STVLGKTNGAIRIGGSDNPFLQMPCELSSGSSGGGGGGSGSSSDGGIKNGSDENVLGSGT
GYVVTTKAVYTKSNSSGTSCTYTKDGTFTATTSPILGCTSDGKGCFYVLDGTDVHCRTVQ
ASGGGNGAHCAVASGSATSCKVAHTVTNPLCIAHVKNGNTEFLLIGGSQGYKEIKLETGS
GSGTGCLKAENVRGPEQWGEDSVTPKDRVSQYEGTIGRFAISDIYTVESTSGAGGTNGGT
NKPDVYVVVGDSQDGYTGLWRFDAQKKEWNRE
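Before submitting a sequence to a server, it can help to check that it was copied completely. The following is a minimal Python sketch, not part of the original exercise, that parses FASTA-formatted text and reports the sequence length. The function name `read_fasta` and the shortened toy record (only the first ten residues of the query) are illustrative assumptions.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between sequence lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


# Toy input: only the first ten residues of the query, split over two lines.
toy_fasta = """>Y462_TREPA_first10
MRRIV

CPPVL
"""

sequences = read_fasta(toy_fasta)
for name, seq in sequences.items():
    print(name, len(seq))
```

Pasting the full FASTA record in place of the toy text gives the length of the query, which is the denominator needed when judging how much of the sequence a template covers.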

 

Exercise

Step 1: Make models with each of the three servers and download the resulting models.

 

HHpred – single template model

1)     Go to the HHpred homepage and submit the sequence using default settings (typical runtime: approximately 5 minutes).

2)     On the result page you will see a number of suggested modeling templates. Press the Create model button.

3)     First, select the option Create model from manually selected template(s). This will create a model based on only one template. The default is the top hit.

4)     To create a 3D model (pdb file) you need a license key for Modeller. Ask the teacher. Type in the key under Options and press the Submit job button at the bottom of the page.

5)     From the results page, save the 3D model on your computer as HHpred_1.pdb.

 

Questions:

a)      How much of the query sequence is included in the alignment made by HHpred?

b)     Does the model look ok? (Look at the structure using the View 3D structure button.)
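Question a) comes down to simple arithmetic on the alignment boundaries shown on the result page. As a hedged sketch (the function name and the example numbers are made up for illustration and are not HHpred output), the coverage can be computed as follows:

```python
def query_coverage(query_start, query_end, query_length):
    """Percentage of the query spanned by an alignment.

    Positions are 1-based and inclusive. The span includes any gapped
    positions inside the alignment, so this is an upper bound on the
    number of residues that are actually aligned to the template.
    """
    if not (1 <= query_start <= query_end <= query_length):
        raise ValueError("expected 1 <= start <= end <= length")
    aligned_span = query_end - query_start + 1
    return 100.0 * aligned_span / query_length


# Hypothetical numbers: an alignment covering query residues 51-150
# of a 400-residue sequence.
print(query_coverage(51, 150, 400))  # 25.0
```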

 

HHpred – multiple templates

1)     From the list of recent jobs (left panel on HHpred page) click the “yellow” HHMM button.

2)     Under select templates choose user-defined and select the template with the longest alignment (Cols).

3)     Press the Generate alignment for Modeller button, insert the Modeller license key on the resulting page and press Submit job as above.

4)     Save the resulting model as HHpred_2.pdb.

5)     Repeat steps 1)-3) above, this time choosing optimal multiple templates in step 2), and save the resulting model as HHpred_3.pdb.

 

Questions:

a)      For each model, which template covers most of the query sequence and how much is included in the alignment made by HHpred?

b)     Do the models look ok? (Look at the structures using the View 3D structure button.)

 

CPHmodels and Phyre

1)     To save time, we have submitted the query sequence to CPHmodels-3.0 and Phyre. Download the models here: CPHmodels_1.pdb and Phyre_1.pdb.

 

Questions:

Load the CPHmodels_1 and Phyre_1 models into PyMOL and have a look at them.

a)      What do you see?

b)     What are the most important similarities and differences between the two?

c)      How do they look compared to the models from HHpred? (Either have another look in HHpred or load them into PyMOL.)
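In addition to visual inspection, it can be useful to check which residues each model actually contains. The sketch below is an illustration, not part of the original exercise. It reads the C-alpha ATOM records of a PDB file using the fixed column positions of the PDB format and reports the first residue number, the last residue number and the residue count. It assumes that the residue numbers in the model follow the numbering of the query sequence, which should be checked for each server. The helper `make_atom_line` exists only to build a small toy input.

```python
def ca_residues(pdb_text):
    """Return (chain, residue number) for every C-alpha ATOM record."""
    residues = []
    for line in pdb_text.splitlines():
        # Columns 13-16 hold the atom name, 22 the chain, 23-26 the residue number.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            residues.append((line[21], int(line[22:26])))
    return residues


def residue_span(pdb_text):
    """Return (first, last, count) over all C-alpha atoms, or None if there are none."""
    numbers = [number for _, number in ca_residues(pdb_text)]
    if not numbers:
        return None
    return min(numbers), max(numbers), len(numbers)


def make_atom_line(serial, name, res_name, chain, res_num, x, y, z):
    """Build one fixed-column ATOM record (toy helper for the example only)."""
    return "ATOM  %5d %-4s %3s %s%4d    %8.3f%8.3f%8.3f" % (
        serial, name, res_name, chain, res_num, x, y, z)


# Toy model with made-up coordinates: residues 5, 6 and 9 of chain A.
toy_pdb = "\n".join([
    make_atom_line(1, " N  ", "VAL", "A", 5, 0.0, 0.0, 0.0),
    make_atom_line(2, " CA ", "VAL", "A", 5, 1.5, 0.0, 0.0),
    make_atom_line(3, " CA ", "CYS", "A", 6, 5.3, 0.0, 0.0),
    make_atom_line(4, " CA ", "VAL", "A", 9, 9.1, 0.0, 0.0),
])

print(residue_span(toy_pdb))  # (5, 9, 3)
```

To apply this to a downloaded model, read the file contents first, for example with `open("Phyre_1.pdb").read()`, and pass the text to `residue_span`.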

 

Model validation

You are now in a position to evaluate and compare the five models.

1)     Go to the ProQ homepage and submit the models one by one.

2)     Note down the MaxSub and LGscore in a table like the one below.

3)     Now go to the STAN server at the Uppsala Software Factory. Again, submit the models one by one and download their Ramachandran plots (scroll down a bit on the results page).

4)     Note down the percentage of outliers in a table like the one below.

 

If you need a quick reminder on how to read a Ramachandran plot, have a look here. There are many other programs available to generate Ramachandran plots, each using slightly different definitions of the allowed/disallowed regions including different subcategories/regions. The Ramachandran plot generated by the STAN server has the advantage of being simple to interpret as it only has two output categories: allowed and disallowed. The STAN server also outputs various geometrical analyses, which are mostly relevant for experimental structures.
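A Ramachandran plot shows the backbone dihedral angles phi and psi of each residue. Phi is the dihedral defined by the atoms C(i-1), N(i), CA(i), C(i), and psi is defined by N(i), CA(i), C(i), N(i+1). The following is a minimal sketch of how a dihedral angle can be computed from four points; it is not the STAN server's implementation, and the coordinates in the example are made-up test points rather than atoms from a real model.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees around the p1-p2 bond, in (-180, 180]."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    # Unit vector along the central bond.
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Components of the outer bonds perpendicular to the central bond.
    v = tuple(x - _dot(b0, b1) * y for x, y in zip(b0, b1))
    w = tuple(x - _dot(b2, b1) * y for x, y in zip(b2, b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Made-up test points: the last point is rotated a quarter turn about the z axis.
angle = dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0))
print(round(angle, 1))  # 90.0
```

Passing the coordinates of C(i-1), N(i), CA(i) and C(i) gives phi for residue i; passing N(i), CA(i), C(i) and N(i+1) gives psi.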

 

Questions:

a)      Do the different measures of quality agree?

b)     Based on the LGscore, MaxSub and Ramachandran statistics, which model do you think is the best?

c)      What does this teach us about homology modeling of difficult cases?

 

 

Table 1. Model quality scores.

Model          LGscore (ProQ)    MaxSub (ProQ)    % Outliers (Ramachandran plot)
HHpred_1
HHpred_2
HHpred_3
CPHmodels_1
Phyre_1