Events News Research CBS CBS Publications Bioinformatics
Staff Contact About Internal CBS CBS Other

Comparative Genomics using Web Services

http://www.cbs.dtu.dk/~pfh/report

David Ussery (dave@cbs.dtu.dk)
Peter Fischer Hallin (pfh@cbs.dtu.dk)

For those of you who did not get the table right, you are not likely to have the rRNA gene predictions right

Please copy the *.ph file to your local directory:
cp /home/people/pfh/ws/all.ph .

INTRODUCTION

During the last few decades, the web has played an essential role for sharing, accessing, and combining biological data. On a number of occassions during this course you have used web interfaces for accessing various resources on the internet, for example when predicting genes or accessing online sequence databases. In most cases such online resources are being offered as HTML based web pages which presents a graphical view of its functionality. Although these interfaces in some cases appear intuitive and user-friendly, they make it impossible to combine data from different sources and it is impossible to readily automate larger tasks if you have longer workflows. Could you imagine completing the gene-prediction exercise and manually paste in DNA sequnces for the 487 complete microbial genomes, currently available?

We call this the lack of 'interoperabilty' - the servers do not operate in a standarized and systematic fashion - it takes a human to move the mouse around, paste in the DNA sequence, press the submit botton, and decypher the result.

For adressing this exact problem, many bioinformatic institutes, including CBS, apply Web Services (Wikipedia provides a brief decription of the term) to share its tools and database. In the coming years we will see this technology widely implemented and we offer you the chance for a first glimpse to ask some small questions by the use of a Web Services client.

Web Services are desribed by WSDL-files (WSDL stands for Web Service Description Language) You can check one out if you are sufficently geeky: http://www.cbs.dtu.dk/ws/RNAmmer/RNAmmer_1_1e.wsdl). Once this is acquired and loaded by your programming language, all aspects of the service is known: Which functions (known as 'operations') does it have? What input is required for which functions? How does the output look like? Who provides the actual service (known as the 'endpoint'). It is beyound the scope of this exercise to go in depth with the technology, but as you will see how its simplicity demonstrates the power.

OUTLINE

We will download the genome sequence of the following species and compare some essential properties. During the exercise you will see questions marked in green. Your do not have to write it down - just make sure you are able to find the result.

AccessionSpecies and strain
AE016828Coxiella burnetii RSA 493
CP000247Escherichia coli 536
CP000468Escherichia coli APEC O1
AE014075Escherichia coli CFT073
U00096Escherichia coli K12
AE005174Escherichia coli O157:H7 EDL933
BA000007Escherichia coli O157:H7 str. Sakai
CP000243Escherichia coli UTI89
AP009048Escherichia coli W3110
AE009439Methanopyrus kandleri AV19
CP000479Mycobacterium avium 104 B
AL954747Nitrosomonas europaea ATCC 19718
AE005674Shigella flexneri 2a
AL591688Sinorhizobium meliloti 1021
AP008232Sodalis glossinidius
AL645882Streptomyces coelicolor A3(2)
AE003852Vibrio cholerae
AE009952Yersinia pestis KIM


  • Create todays directory and start the programming language Python.
    Log on to organism/cell with your username and password like you normally do. You will now need to log on to a different server, 'sbiology'. During this login, you will be asked to authenticate the server and you must reply by typing 'yes':
    ssh -X sbiology.cbs.dtu.dk
    mkdir ws
    cd ws
    python
    You are now in what is called interactive mode of Python: this means, that all commands you write will be recognised and executed by Python.

  • Load the WSDL files and build a proxy to the WSDL
    import sys, time, base64
    from SOAPpy import WSDL
    AtlasProxy = WSDL.Proxy('http://www.cbs.dtu.dk/ws/GenomeAtlas/GenomeAtlas_3_2.wsdl')
    RNAmmerProxy = WSDL.Proxy('http://www.cbs.dtu.dk/ws/RNAmmer/RNAmmer_1_1e.wsdl')
    
    You have now loaded the modules which allows us to connect to the two Web Services RNAmmer (details) (will allow os to predict rRNA genes using hidden markov models) and Genome Atlas (details) which is a repository for complete microbial genomes.

  • Download the genome sequence of the first accession number
    Log on to organism/cell with your username and password like you normally do. But you will now need to log on further to a different server, 'sbiology'. During this login, you will be asked to authenticate the server and you must reply by typing 'yes':
    getSeqIn = {'accession' : 'AE009439' }
    getSeqOut = AtlasProxy.getSeq(getSeqIn)
    
    This is where you should begin to see the whole point: The function called 'getSeq' is NOT recognised by our programming language per default: but because we have connected to the Web Service, the remote operations have become local function we can. This can in other words be done anywhere - even without a login to CBS: As long as you have what is called a SOAP compliant programming interface, like Python

  • Submit your genome sequence to RNAmmer, to predict the number of rRNA genes. Note, that this task takes a few seconds, and that we have to include a step where we ask when the result will be ready (we 'poll' for the status). Be aware, that this is a loop (using 'while') and you must not forget the tab character before the lines starting with "print", "job", and "sleep"
    RNAmmerIn = {'kingdom' : 'bac', 'sequences' : getSeqOut.sequences}
    jobid = RNAmmerProxy.runService(RNAmmerIn)
    job = RNAmmerProxy.pollQueue(jobid)
    while job.status != 'FINISHED':
    	print 'The job is: %s' % job.status
    	job = RNAmmerProxy.pollQueue(jobid)
    	time.sleep(5)
    
    # Press [RETURN] and wait a bit
    out = RNAmmerProxy.fetchResult(jobid)
    for gene in out.entries.entry:
    	print '%s\t%s\t%s\t%s\t%s\t%s\t%s' % (gene.sequenceEntry, gene.mol, gene.start, gene.stop, gene.direction, gene.feature, gene.score)
    # Press [RETURN] to close the loop
    
    Q1: How many rRNA operons to you think are present in the genome?
  • We will now complete a similar procedure, this time predicting the number of tRNA genes by a program called tRNA-scanSE
    tRNAscanIn = {'kingdom' : 'bac','sequences' :  getSeqOut.sequences }
    job = AtlasProxy.trnascanRun(tRNAscanIn)
    while job.status != 'FINISHED': 
    	job = AtlasProxy.pollQueue(job)
    	time.sleep(1)
    
    # Press [RETURN] and wait a bit
    tRNAscanResult = AtlasProxy.trnascanFetchResult(job)
    len(tRNAscanResult.entries.entry)
    
    Q2: How many tRNA genes does the genome posses?
  • Now we will let python summarize a few small properties of the sequence
    (getSeqOut.sequences.entry.seq.count("A") + getSeqOut.sequences.entry.seq.count("T"))  / float(len(getSeqOut.sequences.entry.seq)) * 100
    
    Q3: What is the A+T content (in per cent)
    Q4: How long is the sequence (in nucleotides) - you should use part of the code from above

  • We will now download all the protein sequences of the genome, see how many coding genes it possess, and visualize the frequency of the amino acids. The code will generate a PNG image which you will view using 'xv'. You should open a fresh SSH-session to 'sbiology'
    getProtOut = AtlasProxy.getProt({'accession' : 'AE009439' }) 
    result = AtlasProxy.aaUsage({'sequences' : getProtOut.sequences })
    f = open('AE009439.aa.png' , 'w' )
    f.write (base64.decodestring(result.image.content))
    f.close()
    # In a new shell, in the working directory type 'xv AE009439.aa.png' to view the image
    
  • We will now calculate the length of the genome and list the amino acid frequences of the proteome
    len(getProtOut.sequences.entry)
    for aa in result.aaUsage.entry:
    	print "AA=%s, usage=%s" % (aa.name,aa.freq)
    
    
  • You will now download all open reading frames of the genome, and compose a visualization of the codon usage
    getOrfsOut = AtlasProxy.getOrfs({'accession' : 'AE009439' }) 
    result = AtlasProxy.codonUsage({'sequences' : getOrfsOut.sequences })
    f = open('AE009439.codon.png' , 'w' )
    f.write (base64.decodestring(result.image.content))
    f.close()
    for codon in result.codonUsage.entry:
    	print "Codon=%s, usage=%s" % (codon.codon,codon.freq)
    
    # In a new shell, in the working directory type 'xv AE009439.codon.png' to view the image
    
    Q7: By looking at the list just produced and the graphical output, can you identify the 4 most frequently used codon, starting with the most frequent?
  • Now, we could ask you to repeat all of these steps for the remaining genomes - but instead we will demonstrate how all this can readily be build into a script as one continous workflow. A Python script is provided which contains almost the same code, except for a few improvements: it now allows you to specify an alternative accession number as a argument and codon and amino acid usages are now sorted. Also, the 16S rRNA gene predictions are now stored file.
    # Leave Python by pressing CTRL-D
    cp /usr/opt/www/pub/CBS/phdcourse/cookbooks/pfh/workflow.py .
    cp /usr/opt/www/pub/CBS/phdcourse/cookbooks/pfh/ACCESSION.list .
    rm -f report
    foreach accession ( `cat ACCESSION.list`  )
    python workflow.py $accession > $accession.out &
    end
    wait
    # The wait command will ensure all processes are finished before we continue. Now is a good time for coffee!
    
    
  • Now concatenate your results and use nedit to view
    cat *.out > report
    nedit report &
    
  • We will now construct trees of the rRNA genes you just predicted. In your directory, you should have the 16s rRNA genes in tab-format, which we will concatenate into a large fasta file and produce a multiple alignment. After this, we log of sbiology in order to build and view the trees
    head -n 1 *rnammer.tab* | grep -v = | saco_convert -I tab -O fasta > all.fsa
    clustalw all.fsa
    exit
    
  • We will now construct a tree based on the aligment.
    cd
    cd ws
    clustalx all.aln
    
  • Produce NJ tree:
    Choose Trees->Draw NJ tree and accept the filename all.ph Close ClustalX
    njplot all.ph
    
    Q8: What does the plot show? How does the plot compare with the information you compiled for the table?