
NOTE: This exercise was held on Jan 25, 2007 at 9:00 am.

Exercise 1


Task

  1. Locate on the WWW and download the proteome of Mycobacterium tuberculosis.
  2. Identify the secretome part of the proteome.
  3. Predict the presence of CTL epitopes in the proteins of the secretome. Identify the 50 potential MHC ligands with the highest prediction scores.
The task is to be performed using the traditional paste-and-click services on the WWW. In addition, some simple reformatting and selection of data will need to be done on the command line, for which you will need typical UNIX tools such as grep, sort and gawk. We therefore recommend that you log in to your CBS account and perform the necessary actions on 'sbiology'. It also makes sense to run the WWW browser on 'sbiology' rather than on the local host, to avoid file transfers.

Step 1: downloading the proteome

A bacterial proteome can be located on the WWW in many ways; if you do not have a favourite download site, you may consider the FTP server at NCBI or the SRS server at EBI. It is not important which strain you choose. You should expect around 4,000 proteins, depending on the strain. Make sure to save the proteome in FASTA format, as the next step of the exercise requires FASTA input.
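As a quick sanity check on the download, the number of entries in the FASTA file can be compared with the expected figure of around 4,000. The following is a minimal Python sketch, assuming a standard FASTA file in which every entry starts with a '>' header line. The function name count_fasta_entries and the in-memory example are illustrative only.

```python
import io

def count_fasta_entries(handle):
    """Count FASTA records by counting the header lines that start with '>'."""
    return sum(1 for line in handle if line.startswith(">"))

# Tiny in-memory example with made-up sequences.
# For the real proteome, pass an open file instead, e.g. open("file.fsa").
example = io.StringIO(">prot1\nMKT\nAAL\n>prot2\nMSS\n")
n_entries = count_fasta_entries(example)
print(n_entries)
```

On the command line, counting the lines that begin with '>' using grep gives the same number.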

Step 2: identifying the secretome

The subcellular location has not been verified experimentally for all the Mycobacterium tuberculosis proteins. Therefore, you will employ a prediction method to identify the subset of proteins predicted to be secreted.

The prediction method to be used is SignalP. Load the SignalP page and investigate the operation of the service. You might like to submit a few proteins and observe the server behaviour. Specifically, consider the following:

  • What are the suitable parameter settings (organism type, method etc.)?
  • What are the input limits compared to the size of your data?
  • What seems to be the output format suitable for further investigation?
You will discover a few problems:
  • The entry names in the SignalP output are truncated, which makes it impossible to match them to the original names.
    Hint: shorten the entry names with the program goodname (on the command line) before you submit to SignalP. goodname is a simple CBS-made script for manipulating entry names in FASTA files.
  • Your FASTA file is too large to be submitted to SignalP in one go.
    Hint: split the input file in the shell, in an editor or with a simple script, e.g.
        gawk '/^>/{i++;}{out=("file.fsa." int(i/1500)+1);print $0>out;}' file.fsa
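The idea behind the first hint (replace long entry names with short unique ones and keep a record of the originals) can be illustrated with a small Python sketch. This is not the goodname implementation: its actual naming scheme and the format of the files it writes are not described here. The function shorten_names, the "seq" prefix and the example headers are all hypothetical.

```python
def shorten_names(fasta_lines, prefix="seq"):
    """Replace each FASTA header with a short unique name.

    Returns (renamed_lines, mapping), where mapping translates each
    short name back to the original header text (without the '>').
    """
    renamed = []
    mapping = {}
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Hypothetical naming scheme: prefix plus a zero-padded counter.
            short = f"{prefix}{len(mapping) + 1:05d}"
            mapping[short] = line[1:]
            renamed.append(">" + short)
        else:
            renamed.append(line)
    return renamed, mapping

# Made-up headers and sequences, for illustration only.
lines = [
    ">example_long_name_1 first made-up protein",
    "MKTAYIAK",
    ">example_long_name_2 second made-up protein",
    "MSSLLA",
]
renamed, mapping = shorten_names(lines)
print(renamed)
print(mapping)
```

Keeping the mapping is what makes it possible to translate the short names in the prediction output back to the original entry names later.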
When you are ready, submit the proteins to the SignalP server and save the results in a file.

Step 3: predicting MHC ligands

You will employ the prediction server NetCTL. As before, load the NetCTL page and investigate the operation of the service. You might like to submit a few proteins and observe the server behaviour. Specifically, consider the following:
  • What are the suitable parameter settings (HLA supertype, various thresholds etc.)?
  • What are the input limits compared to the size of your data?
  • What are the output format options suitable for the final presentation of the results?
Remember that only the secretome proteins should be submitted to NetCTL, so you need to prepare an input file containing only those proteins. From the SignalP output, extract the names of the entries predicted as secreted and use the program getfrag (on the command line) to generate a secretome FASTA file. getfrag is a CBS-made script for extracting entries, or fragments of entries, from FASTA files.
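The selection step can be pictured as filtering the FASTA file by a set of entry names. The Python sketch below is a stand-in for that idea, not a description of how getfrag works. It assumes that the names of the secreted entries have already been extracted from the SignalP output into a set, and that the entry name is the first word after '>'. The function select_entries and the example data are hypothetical.

```python
def select_entries(fasta_lines, wanted):
    """Yield the lines of the FASTA records whose entry name is in `wanted`.

    The entry name is assumed to be the first whitespace-delimited
    word after the '>' on the header line.
    """
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in wanted
        if keep:
            yield line.rstrip("\n")

# Made-up proteome and list of secreted entries, for illustration only.
proteome = [">seq00001", "MKT", "AAL", ">seq00002", "MSS", ">seq00003", "MGG"]
secreted = {"seq00001", "seq00003"}
secretome = list(select_entries(proteome, secreted))
print("\n".join(secretome))
```

How the set of secreted names is obtained depends on the SignalP output format you chose, so that part is left to grep, gawk or a similar tool.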

Submit the secretome file to NetCTL and save the results in a file.

In the final output you can get the original entry names back using the script 'restore.sed' that 'goodname' created for you. Sort the output to identify the 50 top-scoring MHC ligand predictions.
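The final selection amounts to ranking the prediction rows by score and keeping the first 50. A minimal Python sketch follows, assuming the predictions have been reduced to whitespace-separated rows with the score in a known column. The column layout, peptides and scores below are made up and do not reproduce the actual NetCTL output format; top_predictions is a hypothetical helper.

```python
import heapq

def top_predictions(rows, score_index, n=50):
    """Return the n rows with the highest numeric value in column score_index."""
    return heapq.nlargest(n, rows, key=lambda row: float(row[score_index]))

# Made-up rows: entry name, peptide, score.
raw = [
    "seq00001 PEPTIDEAA 0.41",
    "seq00002 PEPTIDEBB 1.25",
    "seq00003 PEPTIDECC 0.87",
]
rows = [line.split() for line in raw]
best = top_predictions(rows, score_index=2, n=2)
for row in best:
    print(" ".join(row))
```

Any header or comment lines in the real output would have to be removed before the scores are converted to numbers. The same ranking can be done with sort on the command line by sorting numerically on the score column in reverse order.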




CONTACT

Ole Lund,