Events News Research CBS CBS Publications Bioinformatics
Staff Contact About Internal CBS CBS Other

Construction of sequence logos and weight matrices

Morten Nielsen (mniel@cbs.dtu.dk)


Overview

You shall use Bioinformatics tools to predict peptide-MHC binding and select potential epitope vaccine candidates. The exercise has four parts:

  1. Identification of MHC binding motif
  2. Visualize the motif using sequence logos
  3. Training of MHC binding prediction methods
  4. Use the MHC binding prediction method to select potential epitope vaccine candidates


Background: Peptide MHC binding

The most selective step in identifying potential peptide immunogens is the binding of the peptide to the MHC complex. Only one in about 200 peptides will bind to a given MHC complex. A very large number of different MHC alleles exist each with a highly selective peptide binding specificity.

The binding motif for a given MHC class I complex is in most cases 9 amino acids long. The motif is characterized by a strong amino acid preference at specific positions in the motif. These position are called anchor positions. For many MHC complexes the anchor position are placed at P2 and P9 in the motif. However this is not always the case.

Large number of peptide data exist describing this MHC specificity variation. One important source of data is the SYFPEITHI MHC database (http://www.syfpeithi.de). This database contains information on MHC ligands and binding motifs.

Once an accurate and reliable prediction method of peptide MHC binding has been developed it can be applied in the context of effective vaccine design to search for potential epitope candidates. On a whole genome scale one can search for peptides with high MHC binding affinity.

Purpose of exercise

In this exercise you are going to

  1. Visualize the binding motif using sequence logos.
  2. Use the Easypred web-interface to train Bioinformatics predictor for MHC-peptide binding.
  3. Apply a MHC binding prediction method to select peptides in the Sars genome useful for vaccine design.


Prediction performance

We shall use two performance measures to evaluate the predictive performance of a prediction method. The two measures are AUC (the area under the ROC curve), and the Pearson correlation. A short description of different performance methods is given here Performance measures.

The exercise

Identification of MHC binding motifs

Go to the SYFPEITHI MHC database (click on the right mouse button, and select open in a new window). Use the Find motif, Ligand or epitope option. Have a look at the peptide characteristics for HLA-A*0201 MHC allele. This you do by selecting HLA-A*0201 and press do Query. What are the characteristics of the peptides that bind HLA-A*0201?

Q1: Which positions are anchor position and what amino acids are found at the anchor positions?

Answer: Anchor positions are P2 and P9. Prefered amino acids are P2: LM, P9: VL.

Sequence logos

A powerful way to visualize the peptide characteristics of the binding motif of an MHC complex, is to plot a sequence logo. In a logo the information content at each position in the sequence motif corresponds to the height of a column of letters. The Information content Ip on each position p is defined as log2(20) + Σa(pap log2(pap)). The information content is a measure of the degree of conservation and lies within the range of 0 (no conservation-all amino acids are equally probable) and log2(20) = 4.3 (full conservation-only a single amino acid is observed at that position). The height of each letter within the columns is proportional to the frequency pap of the corresponding amino acid a at that position p. The amino acids are colored according to their properties:

  • Acidic [DE] red
  • Basic [HKR]: blue
  • Hydrophobic: [ACFILMPVW] black
  • Neutral [GNQSTY]: green

In the exercise, you shall use logo plots to visualize the MHC binding motif contained in different weight matrix predictors. Below you see an example of a logo visualization of the binding motif of the MHC allele HLA-A*0201.
.


Construction of weightmatrices

First you shall use the EasyPred web-server (click on the right mouse button, and select open in a new window) for confirm the results from the lecture calculating the weight matrix from the multiple alignment

VFAAA
VHYWW
VLQPK
LREWQ
LPYIH

Go to the EasyPred web-server and past the five peptide sequences from above into the Paste in training examples window. Select No clustering, and set Weight on prior to 10000. Next, press Submit query.

Have a look at the sequence logo.

  • Q2: How many different amino acids are present at the P1 position in the logo?
  • Answer: More than 10!
  • Q3: How many different amino acids are present at the P1 position in the binding data?
  • Answer: 2.

    Click on the link Parameters for prediction method. This will open a window with the weightmatrix estimated from the input peptide sequences.

  • Q3.1:Try to reproduce the matrix values for P1(I), and P1(K). Remember, you shall use the blosum substitution matrix to estimate the pseudo frequency for the two amino acids as we did in the lecture. Here, you find a link to a file containing the Blosum substitution matrix. Here, is a link to a file with the background frequences for the different amino acids Background frequency tables. Also, remember that the values in the weightmatrix are calculated as

    2*log(p/q)/log(2)

    where p is the frequency of a given amino acid (calculated using pseudo counts), and q is the background frequence. You might not get exactly identical numbers to those of the EasyPred program. So, +/- 20% is fine!

    Answer:. The pseudo frequences for I and K are

    g(I) = 0.4*0.12 + 0.6*0.16 = 0.144
    g(K) = 0.4*0.03 + 0.06*0.03 = 0.03

    Since weight on prior (or weight on pseudo count) is much greater than the number of sequences, the final frequences p(I) = g(I), and p(K) = g(K). Using the formula for the values in the weight matrix with q(I)=0.068, and q(K)=0.058 (remember q is the back ground frequeces), we find the weight matrix value to be

    Score(I) = 2.17, and Score(K) = -1.90

    These values compare fine with the values calculated y the Easypred program (2.34, and -2.16).


    Prediction of MHC-peptide binding

    Data

    You shall now use the EasyPred web-interface to train and evaluate a series of different MHC-peptide binding predictors. The EasyPred web-server is an interface that allows for easy training and performance evaluation of weight matrix and artificial neural prediction methods. In this exercise, you shall only use the part of the web-server that trains weight-matrices.

    You shall use three files (eval.set, small.train.set and large.train.set) that contain peptides and binding affinity to the MHC alleles HLA-A*0201. In the two train.sets you find peptide with binding affinity of either 0.1 (this value has absolutely no meaning, it is set to 0.1 for practical reasons) or 1, where 0.1 indicates a non-binder and 1 that the peptide binds to the MHC complex. The small.train.set contains 110 peptides, whereas the large.train.set contains 232 peptides.

    For the evaluation set (eval.set) the affinities are given as real values. A high value indicates strong binding (a value of 0.5 corresponds to a binding affinity of approximately 200 nM). The values are calculated from the actual nM binding affinities using the relation x = 1 - log(aff nM)/log(50000). A peptide that binds with an affinity stronger than 500 nM is said to be an intermediate binder, and a peptide that binds stronger than 50 nM is a high binder. Note that low affinity means strong binding. As a rule of thumb a peptide must be at least an intermediate binder in order to induce an immune-response. Using the above transformation an intermediate binder (500 nM) will have a value og 0.426, and a strong binding a value of 0.638, respectively. Again note that a high transformed value corresponds to a low affinity value, and hence to a strong binder. The eval.set contains 1266, Click on the filenames to view the content of the files.

    During the exercise you shall use the files small.train.set and eval.set often. It might therefore be smart if you save these files on the Desktop on your lab-top. You do that by clicking on the files names (eval.set, small.train.set) and saving the files as text files on the Desktop. You can also just copy and paste the files every time you need them.

    The accuracy of a prediction method can be measured in many ways. Here we use two measures. The area under the receiver operating curve (ROC) curve (Aroc), and the Pearson correlation. Both measures have a value of 1 for the perfect method. For a method that is random the Aroc measure has a value of 0.5 whereas the Pearson correlation is 0. When you evaluate the performance of a prediction method you hence rank a method with high Aroc and Pearson correlation highest (if you what to know more, use Goggle to find information on how the two measures are calculated).


    Weight Matrix generation

    You shall now use EasyPred web-server to train a series of methods to predict peptide-MHC binding. Go to the EasyPred web-server (again click on the right mouse button, and select open in a new window). When you go through the exercise please read carefully the instructions (all of them). Many of the examples only make sense if you do exactly as described in the text.

    In the upload training examples window browse and select the small.train.set file from the Desktop, in the upload evaluation window browse and select the eval.set file from the Desktop. Important! In the window "Cutoff for counting an example as a positive example" type 0. Now press Submit query.

  • Q4 What is the predictive performance of the matrix method?
  • Answer:: Pearson coefficient for N= 1266 data: 0.07628 Aroc value: 0.56979
  • View the logo plot of the calculated matrix. Can you understand why the matrix performs so poorly?
  • Answer:: The logo shows very low information at all positions. We have trained a method for peptide:MHC binding on a mixture of peptide binders and peptide non-binders. This is clearly wrong. We can only include binders when estimating a binding motif.
  • Q5 How many of the 110 peptides in the small.train.set are included in the matrix construction?
  • Answer:: All 110 peptides were included

    Go back to the EasyPred server window (use the Back button). Set the "Cutoff for counting an example as a positive example" to 0.5. Set clustering method to No clustering and the weight on prior to 0.0 and redo calculation.

  • Q6 What is the predictive performance of the matrix method now?
  • Answer:: Pearson coefficient for N= 1266 data: 0.29529 Aroc value: 0.71191
  • Q7 How many of the 110 peptides in the small.train.set are included in the matrix construction?
  • Answer:: 10
  • View the logo plot of the calculated matrix. Does the logo resemble the logo for the HLA-A*0201 binding motif shown in the beginning of the exercise?

    Return to the EasyPred server window (use the Back button). Set the clustering method to Clustering at 62% identity. Keep weight on prior on 0.0. Redo calculation.

  • Q8 What is the predictive performance of the matrix method now?
  • Answer:: Pearson coefficient for N= 1266 data: 0.45328 Aroc value: 0.81865
  • Again view the logo plot of the calculated matrix. Has it changed compared to the previous calculation?

    Return to the EasyPred server window (use the Back button). Keep the clustering method to Clustering at 62% identity. Set the weight on prior to 200.0, and redo calculation.

  • Q9 What is the predictive performance of the matrix method now?
  • Answer:: Pearson coefficient for N= 1266 data: 0.49684 Aroc value: 0.84838
  • View the logo plot of the calculated matrix. What is the big difference between this logo and the two previous ones? (how many different amino acids are present at each position in the binding motif?)
  • Answer:: In the two previous logo plots, only four amino acids where present at for instance P2. In this last logo all amino acids are present.
  • What are reasons for these differences?
  • Answer:: Using pseudo counts will give non-zero frequency values also for amino acids not observed.
  • Does the logo begin to resemble the logo for the HLA-A*0201 binding motif shown in the beginning of the exercise?

    Note, that you using as few as 10 binding peptides in the small.train.set was able to derive a weight-matrix with a reasonable predictive performance, and a corresponding logo plot that captured many the features described in the both the SYFPHETHI database, and in the logo plot calculated from the large training data set.


    Now you shall do the last matrix training. Return to the EasyPred server window. Now you shall train a weight matrix using a larger set of data. Press Clear fields. Upload the large.train.set in the "Paste in training examples" window., and the eval.set in the "Paste in evaluation examples". Leave all other option unchanged (Cluster at 62% identity, and weight on prior at 200). Select Sort output on predicted values and submit query.

  • Q10 What is the predictive performance of the matrix method now?
  • Answer:: Pearson coefficient for N= 1266 data: 0.71798 Aroc value: 0.96651
  • View the logo plot of the calculated matrix. Does the logo compare to the logo for the HLA-A*0201 binding motif shown in the beginning of the exercise?
  • Q11 Look at the prediction list. How many false positive hits do you find among the top 20 highest scoring peptides (Assignment score < 0.426)?
  • Answer:: One false positive

    Before you continue click on the "Parameters for prediction method" link and save the content to a file on you Desktop (para.dat for instance).


    Finding epitopes in real proteins

    As the last part you shall use the weight matrix to find potential epitopes in the Sars virus. In the EasyPred web-interface clear field to reset all parameter fields. Go to the Swiss-prot home-page Swiss-prot. Search for a Sars entry by typing Sars in the search window. Click you way to the FASTA format for one of the proteins. Here is a link if you are lazy. Paste in FASTA file into the Paste in evaluation examples. Upload the weight matrix parameter file (para.dat) from before into the Load saved prediction method window. Make sure that the option for sorting the output is set to Sort output on predicted values, and press Submit query.

  • Now the top scoring peptides should be the peptides that resemble the binding motif (i.e. your weight matrix) the most. Selecting the top 10 peptides you should have a high change of including a set of potential HLA-A*0201 epitopes. These you could sell to a big pharma company and become rich and famous!!

    Now you are done!!