PhD Lecture by Peter Fischer Hallin, CBS
Computational tools and Interoperability in Comparative Genomics
Monday, December 7, 2009 at 13.00
CBS, DTU Systems Biology, Lyngby, Building 208, Auditorium 062
The scientific community is witnessing an explosion in both the number and the complexity of DNA sequencing projects. As sequencing equipment becomes more reliable, faster and less expensive, new possibilities for applying the technology are opening up. The early genome sequencing projects, dating back almost 15 years, covered only individual microbial strains, and the effort and scientific achievement involved at that time merited publication in high-ranking journals. Today, however, projects like the Human
Microbiome Project (HMP), the Human Gut Microbiome Initiative (HGMI) and the Genomic Encyclopedia of Bacteria and Archaea (GEBA) take sequencing into a new era: the study of the genomes and ecological niches of entire populations consisting of thousands of microorganisms. These initiatives create a demand for new analysis tools to process and derive knowledge from the wealth of genomic information. This thesis describes the development of new tools and methods to study these types of data. When the genomes of characterized strains and environmental samples are sequenced, the ribosomal RNA genes are commonly chosen as a starting point to describe phylogeny and diversity. The rRNA genes are often interpreted as an 'evolutionary chronometer', and the RNAmmer software was developed as a tool to quickly and consistently identify the rRNA genes, allowing for large-scale phylogenetic analysis of complex data sets. RNAmmer solved earlier problems with gene boundary accuracy that are observed when BLAST-based approaches are used to map rRNA genes. The ability to accurately map the start of rRNA transcripts has allowed the investigation of promoter structures of these highly expressed operons, and a promoter analysis in E. coli K12 is demonstrated by applying a mathematical model of the energetics involved in DNA helix opening.
But a single gene, such as the 16S rRNA, cannot by its nature describe the phenotype or the full coding potential of an organism. This thesis describes the development of the BLASTatlas tool, a visualization tool that gives an overview of similarities and differences between any number of genomes, metagenomic samples or sequence databases from the viewpoint of a reference genome. This software has proved to be a powerful tool to study the localization and gain/loss of gene clusters, such as pathogenicity islands in virulent organisms.
The tool has been used in several research projects and collaborations, was described in a cover article in Molecular BioSystems in 2008, and was highlighted in the journal Chemical Biology. Despite the usefulness of this tool, it became obvious that a web-based version, more "biologist friendly" and with zooming capability, was needed. This led to the GeneWiz browser, which was developed in a joint effort with the IT staff at CBS. The tool enables the user to interactively zoom from a global chromosomal scale down to the nucleotide level, while maintaining the overview of all data being presented in the plot. It features disproportional zooming as known from Google Maps. At the time of writing this thesis, the work is just being published in the second issue of the journal Standards in Genomic Sciences (SIGS).
Since I started my Ph.D. project, a total of 630 prokaryotic genomes have been sequenced and published. This represents on average about four genomes per week! As we gain knowledge from this vast amount of data, new prediction methods become available, allowing for the generation of even more data; examples include predicting sigma factor genes, chromosomal replication starts, and secretion systems. This combination of new sequence data and new predictions squares the problem: How do we deal with the challenge that more and more genomic material must be processed through more and more bioinformatic tools? And how is this flow of information formalized and automated, allowing bioinformaticians to programmatically submit comparisons of any genome to any prediction method anywhere in the world? The need for interoperable and programmable interfaces for these resources is now widely recognized, and machine-to-machine communication through Web Services has gained acceptance. But challenges lie ahead in the transition from web-browser-centric thinking towards interoperability and service-oriented architecture (SOA). During my Ph.D.
work, a number of significant contributions to both implementations and server infrastructure have provided remote users with access to CBS prediction servers and databases. This work has been presented both at the general meetings of the EU project (EMBRACE) that initiated these efforts and at various workshops teaching the use of Web Services and comparative genomics.
Everybody is welcome. Registration is not necessary.