NOTE: This exercise was held on Jan 26, 2007 at 9:30 am.

Exercise 3


  1. Locate on the WWW and download the proteome of Mycobacterium tuberculosis.
  2. Identify the secretome part of the proteome.
  3. Predict the presence of CTL epitopes in the proteins of the secretome. Identify the 50 potential MHC ligands with the highest prediction scores.

The task is to be performed using Taverna. Before starting the actual task, you will need to get acquainted with the Taverna workbench.
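To make the goal of step 3 concrete, here is a minimal Python sketch of picking the highest-scoring predictions from a list. The record layout (protein id, peptide, score) and the example values are assumptions made for illustration only; they are not the output format of any prediction service used in this exercise.

```python
import heapq

def top_ligands(predictions, n=50):
    """Return the n records with the highest prediction score.

    Each record is assumed to be a (protein_id, peptide, score) tuple.
    """
    return heapq.nlargest(n, predictions, key=lambda rec: rec[2])

# Hypothetical records, made up for illustration.
example = [
    ("protA", "AAAAAAAAL", 0.91),
    ("protA", "CCCCCCCCV", 0.12),
    ("protB", "DDDDDDDDL", 0.77),
]
best = top_ligands(example, n=2)
```

With the default n=50 the same call would return the 50 best records, sorted from highest to lowest score.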

Note: the instructions below should be read in the context of the lecture "Preparing for Taverna, using SOAP - graphical interface to SOAP based Web Services" given yesterday (Wed, Jan 25, 16:00).

Getting acquainted with Taverna

Taverna can be downloaded to your computer free of charge. It needs Java 1.5 to run.

Alternatively, you can log in to 'sbiology' and issue the command 'taverna'. The program takes some time to start. The CBS services used in this exercise have already been loaded for you, and are available for use in the Advanced Model Explorer.

Note that if you run the program from your own computer, you have to load the WSDL files by right-clicking "Available Processors" and selecting "Add new WSDL scavenger...":

Let's start by opening the workflow from the RNAmmer example in the lecture. Click on "File" and then "Open workflow location". Load the XML file:

The workflow calls two services, getSeq and RNAmmer. getSeq will be used to retrieve the complete genome of an organism given its accession number. RNAmmer predicts the location of 5s/8s, 16s/18s, and 23s/28s ribosomal RNA in full genome sequences.

The graphical representation should give you an idea of the flow of execution. Notice the XML splitters (shown as purple boxes) between the inputs and outputs. Each splitter unpacks the complex structure of the input or output data by a single level. Because our services have a very complex data structure, two layers of XML splitters were used.
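If it helps to think of the splitters in programming terms, the following Python sketch unpacks a nested XML document one level at a time. The element names and the short sequence are invented for illustration; they are not the actual schema of the CBS services.

```python
import xml.etree.ElementTree as ET

# Hypothetical response; element names and content are illustrative only.
response = """
<getSeqResponse>
  <entry>
    <accession>AL123456</accession>
    <sequence>TTGACCGATG</sequence>
  </entry>
</getSeqResponse>
"""

root = ET.fromstring(response)            # the complete, nested structure
entry = root.find("entry")                # first "splitter": one level down
sequence = entry.find("sequence").text    # second "splitter": the plain value
```

Each `find` call plays the role of one splitter layer: it exposes the children of one element so the next step can pick out the part it needs.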

Now that you have loaded the workflow, try running it using "File", "Run workflow...". Click "accession" in the input list and "New Input". Run the service using "<accession>AL123456</accession>". 'AL123456' is the accession number for Mycobacterium tuberculosis H37Rv.
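The input is simply the value enclosed in a pair of tags. As a small sketch, the Python helper below builds such a string; the function name `wrap` is made up for this example and is not part of Taverna.

```python
from xml.sax.saxutils import escape

def wrap(tag, value):
    """Return value enclosed in <tag>...</tag>, escaping &, < and >."""
    return f"<{tag}>{escape(value)}</{tag}>"

accession_input = wrap("accession", "AL123456")
```

Here `accession_input` holds exactly the string you type into the input dialog.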

Observe the Status window while the different processes complete. They will change color depending on their state. When everything has finished, the Results tab will show up, and we can examine the results in the Result Browser. For now, and because the job is asynchronous, only a job id will be output. Go back to the Status tab and click on the getSeq processor, and then on 'Intermediate output'. The complete genomic sequence for the bacterium was the output from this processor.

Now it is time to retrieve our results. Because Taverna does not allow while loops to poll the queue for job status, we have to use another workflow. Copy the job id string from the Results Browser and open the workflow for fetching the result:

Run it with the job id from the previous workflow as the input. Don't forget to enclose the string in '<jobid></jobid>' tags. After processing, examine the output in the Results window. Notice where each one of the RNA subunits was predicted to be.
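For comparison, this is roughly what polling a job queue looks like in a language that does have loops. It is only a sketch: `fetch` is a hypothetical stand-in for the call to the fetch service, assumed to return None while the job is still waiting or running, and the job id "JOB123" is made up.

```python
import time

def wait_for_result(fetch, job_id, interval=30.0, max_tries=20, sleep=time.sleep):
    """Call fetch(job_id) repeatedly until it returns a result.

    fetch is a hypothetical callable that returns None while the job is
    unfinished and the result once it is done.
    """
    for _ in range(max_tries):
        result = fetch(job_id)
        if result is not None:
            return result
        sleep(interval)  # wait before asking the queue again
    raise TimeoutError(f"job {job_id} not finished after {max_tries} tries")

# Stand-in for a real service call: the job finishes on the third request.
answers = iter([None, None, "rRNA predictions"])
outcome = wait_for_result(lambda job_id: next(answers), "JOB123",
                          sleep=lambda seconds: None)
```

In Taverna you perform the same loop by hand: rerun the fetch workflow until the result is available.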

By now you should be familiar with how workflows run, so let's go on to build your own workflow! Close both results, and the RNAmmer workflow. Leave the fetch workflow open, as you will need it later.

Identifying the secretome using SignalP

Load the WSDL for SignalP (not needed if working in sbiology):

Press "File", and then "New workflow...". Start by getting the proteome of Mycobacterium tuberculosis. Use the GenomeAtlas 'getProt' service to achieve this. Don't forget to split the complex input and output data using XML splitters, by right-clicking on the node and selecting 'Add XML splitter with name'.

1. Try to see the output of this service when inputting the accession number AL123456.
2. Connect the result from getProt with SignalP (don't forget the 2 layers of XML splitters both in the input and output). SignalP needs another input, the organism. Create a local string constant with the value "<organism>gram+</organism>" and feed it together with the proteome to SignalP.
3. Get the job identifier and fetch the result using the other workflow. (Note: you will have to wait a few minutes for the results to be processed.)
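Once the SignalP result is available, the secretome is the subset of the proteome that was predicted to carry a signal peptide. The Python sketch below shows that selection step only. The protein ids, the sequences, and the idea that the prediction arrives as a set of ids are all assumptions for illustration; the real SignalP output has to be parsed first.

```python
def select_secretome(proteome, predicted_secreted):
    """Keep only the proteins whose id is in predicted_secreted."""
    return {pid: seq for pid, seq in proteome.items() if pid in predicted_secreted}

# Hypothetical data, made up for illustration.
proteome = {"prot1": "MKRLLAAL", "prot2": "MTEYKLVV", "prot3": "MKKTAIAI"}
predicted_secreted = {"prot1", "prot3"}  # ids assumed to be flagged by SignalP
secretome = select_secretome(proteome, predicted_secreted)
```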

If time allows, try to input two accession numbers into the workflow, as separate inputs, for example BX293980 and AL123456. Notice how Taverna iterates over your inputs. Examine intermediate outputs.
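The iteration has the same effect as running the workflow once per input value. As a rough Python analogy, with a made-up stand-in function in place of the real workflow:

```python
def run_workflow(accession):
    """Stand-in for the getProt + SignalP workflow; returns a placeholder job id."""
    return f"job-for-{accession}"

accessions = ["BX293980", "AL123456"]
job_ids = [run_workflow(acc) for acc in accessions]  # one run per input
```

Each accession number yields its own job id, so each result has to be fetched separately.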

Please note that the actions described above are intended as a preliminary investigation of Taverna. You should continue exploring the possibilities of the software yourself. Taverna is still in development, and it is very likely that new features will be added in the near future.


Ole Lund,