
Exercise: Modeling of epitopes

Claus Lundegaard (


In this exercise you will use bioinformatics tools to predict the location of B-cell epitopes. B-cell epitopes are in general structural epitopes, formed by amino acids that are not adjacent to each other in the sequence. Predicting them is therefore a much more complicated task than predicting T-cell epitopes, and most prediction methods are limited to linear epitopes only.

The exercise has two parts:

  1. Prediction of B-cell epitopes using protein homology modeling
  2. Prediction of linear B-cell epitopes using composition scales

For this exercise you will need to have PyMOL installed on your computer.

Prediction of B-cell epitopes using protein homology modeling

Most B-cell epitopes are structural epitopes formed by amino acids that are distant in the sequence but physically in close contact. This property makes accurate prediction of B-cell epitopes a complicated task. In many situations, however, the three-dimensional protein structure can be predicted from the amino acid sequence by homology modeling. Once the three-dimensional structure is available, prediction of B-cell epitopes becomes a much easier task.

Homology modeling

Go to the NCBI BLAST website. Click on the protein BLAST (blastp) link. Select pdb as the database and BLAST each of the three sequences CHO, PIL, and HBV.
  • What is the E value and the percentage identity for the best match for each of the 3 sequences?

    Go to the CPHmodels server. You can read about the method in the abstract. Paste in the fasta files for the three entries CHO, PIL, and HBV, one at a time, and see if the server can create a model. Save the page with the logfile (right-click on the page and choose "save page as"). Download each model (right-click on the link model.pdb at the bottom of the page) and save it on the desktop as CHO.pdb, PIL.pdb, or HBV.pdb, respectively.

  • Which pdb entry and which chain was used as template? Each entry in the PDB database has a 4 character alphanumeric identifier also called the "PDB ID" identifying the protein. The protein may contain one or more chains which are identified by a single character identifier.
  • What was the statistical significance of the match (The E value)?
  • How much of the sequence was modeled?
  • Are parts of the template structure not used (deletions)?
  • Are parts of the query sequence not found in the template (insertions)?
  • What is the percent identity in the alignment?
  • Rank the models using your results from answers to the questions above. Which model is the best?
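
    The alignment questions above can be checked by hand, but a small script makes the counting explicit. The following is a minimal sketch, assuming the query/template alignment is available as two gapped strings of equal length with "-" for gaps. The function name and the example sequences are hypothetical, and percent identity is computed here over aligned (non-gap) columns only; servers may use a different denominator.

```python
def alignment_stats(query: str, template: str) -> dict:
    """Summarise a pairwise alignment given as two equal-length gapped strings."""
    if len(query) != len(template):
        raise ValueError("aligned sequences must have equal length")
    identical = aligned = insertions = deletions = 0
    for q, t in zip(query, template):
        if q == "-" and t == "-":
            continue              # empty column, ignore
        if q == "-":
            deletions += 1        # template residue not used by the query
        elif t == "-":
            insertions += 1       # query residue not found in the template
        else:
            aligned += 1
            if q == t:
                identical += 1
    percent_identity = 100.0 * identical / aligned if aligned else 0.0
    return {
        "aligned": aligned,
        "identical": identical,
        "insertions": insertions,
        "deletions": deletions,
        "percent_identity": percent_identity,
    }


# Hypothetical toy alignment, not taken from CHO, PIL, or HBV.
stats = alignment_stats("MKT-AYIAK", "MRTLAY-AK")
print(stats)
```

    A model built from a long alignment with high identity, few insertions and deletions, and a low E value would normally be ranked above one with the opposite properties.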

    The measured antigenicity of various regions in Cholera enterotoxin has been reported in several papers (Pellequer et al., 1993; Kazemi and Finkelstein, 1991; Jacob et al., 1983). Read the abstract of the Kazemi and Finkelstein paper.

  • Which techniques are used to map the epitopes experimentally?
  • Do you think one method is better than the other?
  • Note the "strongly reactive tetramer".

    Now it is time to look at antigenic regions in the Cholera enterotoxin model (CHO.pdb):
    Start Pymol and open CHO.pdb.

    Enter these commands one at a time on the command line.
    You can copy-paste or type them. Remember that arrow-up takes you to previously entered commands, which can be re-edited.
    Watch the PyMOL Viewer window.

    hide everything
    show cartoon
    select pellequer, resi 33-42
    color red, pellequer
    select pellequer, resi 53-61
    color red, pellequer
    select pellequer, resi 69-84
    color red, pellequer
    select pellequer, resi 104-118
    color red, pellequer
    create tetramer, resi 75-79
    show sticks, tetramer
    color blue, tetramer
    center CHO
    zoom all

    The regions which are shown with a red backbone in your model are regions annotated as epitopes. The side chains of the "strongly reactive tetramer" residues are shown in blue.

  • Where are the antigenic regions found in the structure?
  • Look at the region with the "strongly reactive tetramer". Can you guess the secondary structure of this region?

    Prediction of antigenicity from the amino acid sequence

    We have made the propensity scale predictions of CHO, PIL, and HBV using the following scales:

  • ag: antigenicity (Welling et al., 1985. FEBS Lett, 188(2):215-8.)
  • fl: flexibility (Karplus and Schulz)
  • parker: hydrophilicity (Parker et al., 1986. Biochemistry 25:5425-5432)
  • hp: hydrophilicity (inverted hydrophobicity scale) (Kyte and Doolittle, 1982. J. Mol. Biol., 157:105-132.)
  • and the in-house method BepiPred, which can also be used through the BepiPred server.

    To make an unbiased and threshold-independent evaluation of the predictions, roc-curves have also been constructed by the B-pred script. Those files end with .roc.
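
    To illustrate what a propensity scale prediction does, here is a minimal sketch for the hp scale only. It assigns each residue the negated Kyte and Doolittle hydropathy value and averages over a sliding window. The window size of 7, the handling of the sequence ends, and the function name are assumptions made for illustration; they are not necessarily the settings used to produce the files for this exercise.

```python
# Kyte and Doolittle (1982) hydropathy values per amino acid.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophilicity_profile(sequence: str, window: int = 7) -> list:
    """Return one smoothed hydrophilicity score per residue.

    The score is the mean of the inverted (negated) hydropathy values in a
    window centred on the residue. Near the ends the window is simply
    truncated (an assumption of this sketch).
    """
    half = window // 2
    scores = []
    for i in range(len(sequence)):
        lo = max(0, i - half)
        hi = min(len(sequence), i + half + 1)
        values = [-KYTE_DOOLITTLE[aa] for aa in sequence[lo:hi]]
        scores.append(sum(values) / len(values))
    return scores


# Hypothetical toy peptide: a hydrophobic stretch followed by a charged one.
profile = hydrophilicity_profile("IIVLLKKDEK", window=3)
print([round(s, 2) for s in profile])
```

    Residues whose smoothed score lies above a chosen threshold would be predicted as epitope residues; the threshold is exactly what the roc-curves vary.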

    A combined graph of roc-curves for the CHO sequence can be seen here.

    Which method performs best here?

    The threshold varies along a roc-curve as you follow it from (0,0) to (1,1). At a high threshold few residues are predicted positive, so both FP and TP are low and the point lies close to (0,0). At a low threshold the opposite holds, and the point lies close to (1,1).

    For the roc-curve, the False Positive Proportion (FP/AN) is equal to 1 - the specificity. The True Positive Proportion (TP/AP) is equal to the sensitivity. AN = Actual (or Annotated) Negatives = FP + TN. AP = Actual (or Annotated) Positives = TP + FN.
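
    The definitions above can be turned into a few lines of code. The following is a minimal sketch, not the B-pred script itself: it takes per-residue scores and annotated labels (1 for epitope, 0 for non-epitope), lowers the threshold step by step, and records (FP/AN, TP/AP) at each step. The function names and the example data are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FP/AN, TP/AP) pairs, starting at (0, 0) for the highest threshold."""
    ap = sum(1 for lab in labels if lab)   # actual positives = TP + FN
    an = len(labels) - ap                  # actual negatives = FP + TN
    if ap == 0 or an == 0:
        raise ValueError("need at least one positive and one negative")
    points = [(0.0, 0.0)]
    # Lower the threshold through every distinct score value.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, lab in zip(scores, labels) if s >= threshold and lab)
        fp = sum(1 for s, lab in zip(scores, labels) if s >= threshold and not lab)
        points.append((fp / an, tp / ap))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points ordered by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scores for six residues; three are annotated as epitope.
example_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
example_labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(example_scores, example_labels)
print(curve)
print(area_under_curve(curve))
```

    A curve that rises quickly towards the top-left corner, and therefore has an area close to 1, indicates a method that ranks annotated epitope residues above the rest; a random predictor follows the diagonal with an area of about 0.5.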

    Approximately where on the graphs does each method perform best? Where would you like to be on the graph if you were to predict the location of linear B-cell epitopes? Would you apply a high or a low threshold?

    Take a look at the roc-curves for PIL and HBV. Which methods perform best for those sequences?

    Now finish by plotting BepiPred predictions on the Cholera Enterotoxin model (CHO.pdb): Go to the BepiPred server. Paste in the fasta file for CHO and run the prediction server with the default options.

    Look at the output. Here, an "E" represents a residue that is predicted to be part of an epitope, whereas an "E" in a how-file represents a residue that is experimentally measured to be part of an epitope.

    Note that the residue numbers of the predicted epitopes match the numbers in the lines below called "create BepiPred_x, resi x-x..." (except for residues 18-21, which have not been modelled by CPHmodels).
    Start PyMOL and load CHO.pdb. Plot the BepiPred predictions by entering the commands below into the PyMOL Tcl/Tk GUI.

    hide everything
    show cartoon
    select pellequer, resi 33-42
    color red, pellequer
    select pellequer, resi 53-61
    color red, pellequer
    select pellequer, resi 69-84
    color red, pellequer
    select pellequer, resi 104-118
    color red, pellequer
    create Bepipred_1, resi 22-26
    create Bepipred_2, resi 51-54
    create Bepipred_3, resi 73-84
    create Bepipred_4, resi 112-115
    show sticks, Bepipred_1
    show sticks, Bepipred_2
    show sticks, Bepipred_3
    show sticks, Bepipred_4
    color blue, Bepipred_1
    color blue, Bepipred_2
    color blue, Bepipred_3
    color blue, Bepipred_4
    center CHO
    zoom all

    Now you see the annotated epitopes with a red backbone and the BepiPred predictions with blue side chains.
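
    The create, show, and color lines above follow directly from the stretches of residues predicted as "E". As a minimal sketch, assuming the per-residue prediction has been reduced to a plain string with "E" for predicted epitope residues and any other character elsewhere (this string format, the function names, and the example string are assumptions, not the actual BepiPred output format), the residue ranges and the corresponding PyMOL commands could be generated like this:

```python
def epitope_ranges(labels: str, first_residue: int = 1) -> list:
    """Return (start, end) residue numbers for each consecutive run of 'E'."""
    ranges = []
    start = None
    for offset, label in enumerate(labels):
        if label == "E":
            if start is None:
                start = offset                      # a new run begins here
        elif start is not None:
            ranges.append((start + first_residue, offset - 1 + first_residue))
            start = None
    if start is not None:                           # run reaches the last residue
        ranges.append((start + first_residue, len(labels) - 1 + first_residue))
    return ranges


def pymol_commands(ranges, prefix: str = "Bepipred") -> list:
    """Build create/show/color commands in the style used in this exercise."""
    lines = []
    for number, (start, end) in enumerate(ranges, start=1):
        name = f"{prefix}_{number}"
        lines.append(f"create {name}, resi {start}-{end}")
        lines.append(f"show sticks, {name}")
        lines.append(f"color blue, {name}")
    return lines


# Hypothetical label string for a ten-residue peptide.
for command in pymol_commands(epitope_ranges("..EEE..EE.")):
    print(command)
```

    Ranges that fall outside the modelled part of the structure, such as residues 18-21 in CHO.pdb, would have to be dropped by hand.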

    Look at the epitopes predicted by BepiPred. Where are they located in the structure? Look at the overlapping regions between annotated epitopes and predicted epitopes (red and blue). Are the predicted epitopes found on the surface of the model?

    Finally, you can also try to use the DiscoTope server to predict B-cell epitopes in Cholera toxin. The structure for this protein can be found in the file CHO.pdb. Unfortunately, DiscoTope is quite strict and requires that the input has a chain identifier, as specified in the PDB file format. The entry in the file CHO.pdb lacks this identifier. You can add it either by hand in a text editor (VERY CUMBERSOME) or in unix.
    Log in to organism and upload your CHO.pdb (on a Mac you can do this on your own machine). We now have to change the 22nd character in the lines starting with "ATOM" to A. In gawk this looks like:
    gawk '{if (/^ATOM/){print substr($0,1,21)"A"substr($0,23,100)}else{print $0}}' CHO.pdb > CHO2.pdb
    You need to save CHO2.pdb on your PC/Mac before you can upload it to DiscoTope. You can do that in the SSH program by choosing Window>New File Transfer in Current Directory and then dragging the file to the desktop.
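
    If gawk is not available, the same edit can be sketched in Python. This is an illustrative alternative to the gawk one-liner above, not part of the original exercise: it writes the chain identifier into column 22 of every line that starts with "ATOM" and leaves all other lines unchanged. The function name is hypothetical.

```python
def set_chain_id(pdb_text: str, chain: str = "A") -> str:
    """Return PDB text with `chain` written into column 22 of ATOM lines."""
    if len(chain) != 1:
        raise ValueError("a chain identifier is a single character")
    out = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and len(line) >= 22:
            # Column 22 is index 21 when counting from zero.
            line = line[:21] + chain + line[22:]
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical two-line example; a real file would be read from disk, e.g.
#   with open("CHO.pdb") as handle:
#       fixed = set_chain_id(handle.read())
#   with open("CHO2.pdb", "w") as handle:
#       handle.write(fixed)
example = "ATOM      1  N   MET     1      27.340\nTER\n"
print(set_chain_id(example), end="")
```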

    Alternatively, you can try to find cholera toxin in the PDB database, or search for it using the amino acid sequence, and then paste the entry name and the chain identifier into the DiscoTope server.

    By clicking on the link "View results in Jmol (please be patient...requires Jmol applet download)" you can see a graphical representation of the predicted epitopes.

    How well do the epitopes predicted by DiscoTope correspond to the annotated epitopes?