|
Comparative Genomics using Web Services
David Ussery (dave@cbs.dtu.dk)
Peter Fischer Hallin (pfh@cbs.dtu.dk)
For those of you who did not get the table right, you are not likely to have the rRNA gene predictions right
Please copy the *.ph file to your local directory:
cp /home/people/pfh/ws/all.ph .
INTRODUCTION
During the last few decades, the web has played an essential role for sharing, accessing, and combining biological data.
On a number of occassions during this course you have used web interfaces for accessing various resources
on the internet, for example when predicting genes or accessing online sequence databases. In most cases such online resources
are being offered as HTML based web pages which presents a
graphical view of its functionality. Although these interfaces in some cases
appear intuitive and user-friendly, they make it impossible to combine data from different sources and it is impossible to readily
automate larger tasks if you have longer workflows. Could you imagine completing the gene-prediction exercise and manually
paste in DNA sequnces for the 487 complete microbial genomes, currently available?
We call this the lack of 'interoperabilty' - the servers do not operate in a standarized and systematic fashion - it takes
a human to move the mouse around, paste in the DNA sequence, press the submit botton, and decypher the result.
For adressing this exact problem, many bioinformatic institutes, including CBS, apply Web Services (Wikipedia provides
a brief decription of the term) to share its tools and database.
In the coming years we will see this technology widely implemented and we offer you the chance for a first glimpse
to ask some small questions by the use of a Web Services client.
Web Services are desribed by WSDL-files (WSDL stands for Web Service Description Language) You can check one out if you are sufficently geeky: http://www.cbs.dtu.dk/ws/RNAmmer/RNAmmer_1_1e.wsdl). Once this is acquired and loaded
by your programming language, all aspects of the service is known: Which functions (known as 'operations') does it have? What input
is required for which functions? How does the output look like? Who provides the actual service (known as the 'endpoint'). It is beyound
the scope of this exercise to go in depth with the technology, but as you will see how its simplicity demonstrates the power.
OUTLINE
We will download the genome sequence of the following species and compare some essential properties. During the exercise you will see questions marked in green. Your do not have to write it down - just make sure you are able to find the result.
| Accession | Species and strain |
| AE016828 | Coxiella burnetii RSA 493 |
| CP000247 | Escherichia coli 536 |
| CP000468 | Escherichia coli APEC O1 |
| AE014075 | Escherichia coli CFT073 |
| U00096 | Escherichia coli K12 |
| AE005174 | Escherichia coli O157:H7 EDL933 |
| BA000007 | Escherichia coli O157:H7 str. Sakai |
| CP000243 | Escherichia coli UTI89 |
| AP009048 | Escherichia coli W3110 |
| AE009439 | Methanopyrus kandleri AV19 |
| CP000479 | Mycobacterium avium 104 B |
| AL954747 | Nitrosomonas europaea ATCC 19718 |
| AE005674 | Shigella flexneri 2a |
| AL591688 | Sinorhizobium meliloti 1021 |
| AP008232 | Sodalis glossinidius |
| AL645882 | Streptomyces coelicolor A3(2) |
| AE003852 | Vibrio cholerae |
| AE009952 | Yersinia pestis KIM |
Create todays directory and start the programming language Python.
Log on to organism/cell with your username and password like you normally do. You will now need to log on to a
different server, 'sbiology'. During this login, you will be asked to authenticate the server and you must reply
by typing 'yes':
ssh -X sbiology.cbs.dtu.dk
mkdir ws
cd ws
python
You are now in what is called interactive mode of Python: this means, that all commands you write will be recognised
and executed by Python.
Load the WSDL files and build a proxy to the WSDL
import sys, time, base64
from SOAPpy import WSDL
AtlasProxy = WSDL.Proxy('http://www.cbs.dtu.dk/ws/GenomeAtlas/GenomeAtlas_3_2.wsdl')
RNAmmerProxy = WSDL.Proxy('http://www.cbs.dtu.dk/ws/RNAmmer/RNAmmer_1_1e.wsdl')
You have now loaded the modules which allows us to connect to the two Web Services RNAmmer (details) (will allow os to
predict rRNA genes using hidden markov models) and Genome Atlas (details) which is a repository for complete microbial genomes.
Download the genome sequence of the first accession number
Log on to organism/cell with your username and password like you normally do. But you will now need to log on further to a
different server, 'sbiology'. During this login, you will be asked to authenticate the server and you must reply
by typing 'yes':
getSeqIn = {'accession' : 'AE009439' }
getSeqOut = AtlasProxy.getSeq(getSeqIn)
This is where you should begin to see the whole point: The function called 'getSeq' is NOT recognised by our programming language per default: but because we have connected to the Web Service, the remote operations have become local function we can. This can in other words be done anywhere - even without a login to CBS: As long as you have what is called a SOAP compliant programming interface, like Python
Submit your genome sequence to RNAmmer, to predict the number of rRNA genes. Note, that this task takes
a few seconds, and that we have to include a step where we ask when the result will be ready (we 'poll' for the status). Be aware,
that this is a loop (using 'while') and you must not forget the tab character before the lines starting with "print", "job", and "sleep"
RNAmmerIn = {'kingdom' : 'bac', 'sequences' : getSeqOut.sequences}
jobid = RNAmmerProxy.runService(RNAmmerIn)
job = RNAmmerProxy.pollQueue(jobid)
while job.status != 'FINISHED':
print 'The job is: %s' % job.status
job = RNAmmerProxy.pollQueue(jobid)
time.sleep(5)
# Press [RETURN] and wait a bit
out = RNAmmerProxy.fetchResult(jobid)
for gene in out.entries.entry:
print '%s\t%s\t%s\t%s\t%s\t%s\t%s' % (gene.sequenceEntry, gene.mol, gene.start, gene.stop, gene.direction, gene.feature, gene.score)
# Press [RETURN] to close the loop
Q1: How many rRNA operons to you think are present in the genome?
We will now complete a similar procedure, this time predicting the number of tRNA genes by a
program called tRNA-scanSE
tRNAscanIn = {'kingdom' : 'bac','sequences' : getSeqOut.sequences }
job = AtlasProxy.trnascanRun(tRNAscanIn)
while job.status != 'FINISHED':
job = AtlasProxy.pollQueue(job)
time.sleep(1)
# Press [RETURN] and wait a bit
tRNAscanResult = AtlasProxy.trnascanFetchResult(job)
len(tRNAscanResult.entries.entry)
Q2: How many tRNA genes does the genome posses?
Now we will let python summarize a few small properties of the sequence
(getSeqOut.sequences.entry.seq.count("A") + getSeqOut.sequences.entry.seq.count("T")) / float(len(getSeqOut.sequences.entry.seq)) * 100
Q3: What is the A+T content (in per cent)
Q4: How long is the sequence (in nucleotides) - you should use part of the code from above
We will now download all the protein sequences of the genome, see how many coding genes it possess, and visualize the
frequency of the amino acids. The code will generate a PNG image which you will view using 'xv'. You should open a fresh SSH-session
to 'sbiology'
getProtOut = AtlasProxy.getProt({'accession' : 'AE009439' })
result = AtlasProxy.aaUsage({'sequences' : getProtOut.sequences })
f = open('AE009439.aa.png' , 'w' )
f.write (base64.decodestring(result.image.content))
f.close()
# In a new shell, in the working directory type 'xv AE009439.aa.png' to view the image
We will now calculate the length of the genome and list the amino acid frequences of the proteome
len(getProtOut.sequences.entry)
for aa in result.aaUsage.entry:
print "AA=%s, usage=%s" % (aa.name,aa.freq)
You will now download all open reading frames of the genome, and compose a visualization of the codon usage
getOrfsOut = AtlasProxy.getOrfs({'accession' : 'AE009439' })
result = AtlasProxy.codonUsage({'sequences' : getOrfsOut.sequences })
f = open('AE009439.codon.png' , 'w' )
f.write (base64.decodestring(result.image.content))
f.close()
for codon in result.codonUsage.entry:
print "Codon=%s, usage=%s" % (codon.codon,codon.freq)
# In a new shell, in the working directory type 'xv AE009439.codon.png' to view the image
Q7: By looking at the list just produced and the graphical output, can you identify the 4 most
frequently used codon, starting with the most frequent?
Now, we could ask you to repeat all of these steps for the remaining genomes - but instead we will demonstrate
how all this can readily be build into a script as one continous workflow. A Python script is provided which contains almost
the same code, except for a few improvements: it now allows you to specify an alternative accession number as a argument and codon and
amino acid usages are now sorted. Also, the 16S rRNA gene predictions are now stored file.
# Leave Python by pressing CTRL-D
cp /usr/opt/www/pub/CBS/phdcourse/cookbooks/pfh/workflow.py .
cp /usr/opt/www/pub/CBS/phdcourse/cookbooks/pfh/ACCESSION.list .
rm -f report
foreach accession ( `cat ACCESSION.list` )
python workflow.py $accession > $accession.out &
end
wait
# The wait command will ensure all processes are finished before we continue. Now is a good time for coffee!
Now concatenate your results and use nedit to view
cat *.out > report
nedit report &
We will now construct trees of the rRNA genes you just predicted. In your directory, you should have
the 16s rRNA genes in tab-format, which we will concatenate into a large fasta file and produce a multiple alignment. After this, we log of sbiology in order to build and view the trees
head -n 1 *rnammer.tab* | grep -v = | saco_convert -I tab -O fasta > all.fsa
clustalw all.fsa
exit
We will now construct a tree based on the aligment.
cd
cd ws
clustalx all.aln
Produce NJ tree:
Choose Trees->Draw NJ tree and accept the filename all.ph
Close ClustalX
njplot all.ph
Q8: What does the plot show? How does the plot compare with the information you
compiled for the table?
|