Exercise - Module M4 - Fetching genomes and predicting protein coding genes
Introduction to the computer system: some technicalities All of the
exercises will take place on a UNIX computer system, since this is by far the
most efficient way to handle large amounts of data. We are aware that this
interface may not seem intuitive to most biologists, however it is our goal that
you become familiar with such a system but there are no requirements for you
being experienced users! The content of the colored boxes within all the
exercises represent UNIX commands which you shall execute from within your
shell. It is not the idea you should type these commands yourselves - you simply use the
copy-paste function of your computer. However, we strongly recommend that you
take one line at a time, and try to figure our what goes on. Bottom line: Don't panic!
Please be sure you have read the pedestrians guide to unix, before starting the
exercises. In the gray boxes below, we will now briefly go through some of the things
that you will see repeatedly in the exercises. You do not have
to execute the commands - just read and understand.
- What does '|', '>', and '<' mean?
Both of the redirects '>' and '>' are very common and you should
understand them. The '>' simply means that whatever data is generated by a
program is redirected (written) to a file. Example:
ls > myfile.txt
The program ls simply lists the content of the directory and this output is here
written to the file myfile.txt. The other redirect '<' is used to redirect
files into program. Example:
wc < myfile.txt
Here, we redirect the file just created into the program wc (word
count) which simply counts the number of lines, words and bytes in the input.
Finally, pipe ('|') is just as important. Instead of writing data to a file (or
reading data from a file) the pipe sends output to another program. Let's do the
same as above in just one line:
ls | wc
This simply means that we execute ls to list the files in the
directory, and whatever output this creates gets send directly into program wc
The advantages of the latter is clear: you don't have to access the filesystem, which in
many cases are relatively slow. Also, ls and wc will actually run in parallel
on their own CPU, and wc will read data in chunks whenever generated by ls.
Admitted, these things will only affect the performance of larger problems - but in bioinformatics,
you often deal with large problems.
- What does less mean?
The program less is a fast and
convenient way to show the content of files. The following keyboard shortcuts
apply to the program:
SPACE : one screenful forward
b : one screenful backward
RETURN : one line forward
k : one line backward
g : top (beginning) of file
G : bottom (end) of file
q : leave the session
- What does set ... mean?
The command set is used to declare a variable in the shell environment.
In these exercises
we often set the accession variable like this:
set accession = L43967
This tells your shell that whenever you write $accession you really mean
L43967. Example: Assume you have a file called L43967.gbk, and you want
to use less to view it's content. After having set the
$accession variable, the two sets of commands will do exactly the
getgene L43967 > L43967.gbk
Of course, the following would work just as well:
set accession = L43967
getgene $accession > $accession.gbk
- How should I treat the indentions?
The above was examples of very very simple commands. You will see, that commands
often grow long and ugly, and will barely fit your screen! For matter of
appearance, we have truncated the lines by adding a backslash ('\') at the and of
the line. This instructs the shell to wait executing the command, even if you
press return. This will allow us to break a long line into many. Example: The command below
will get the genbank accession U00096 and extract all translations and count the
lines/words/bytes of the output. The example below:
getgene U00096 | saco_extract -I genbank -O tab -t | wc
... is exactly the same as this:
getgene U00096 | saco_extract \
-I genbank -O tab -t | \
For easier reading, we have made indentions after the line-breaks. Feel free to
copy-paste all lines within these lines breaks, when it occur.
- What does & mean?
The ampersand (&) is put at the end of a command if you want to run the command
in the background. This means that the shell will be available for new input
immediately after. The process then runs in the background
- What does wait mean?
When you have placed many jobs in the background that runs in parallel, it is
often required that you wait for all processes to finish, before you continue using the results.
This is what the wait command helps you with. So, when you see this command in
the command-box, it must be copied as well. Example: The script below will fetch two
genbank records (U00096 and AL111168) in parallel. Once they have
both finished (ensured by using wait) we will do a wc
getgene U00096 > U00096.gbk &
getgene AL111169 > AL111169.gbk &
wc U00096.gbk AL111169.gbk
rm U00096.gbk AL111169.gbk
- What does umask 022 mean?
When you see the command umask this is to inform the
shell, that whenever you create new files, they will be readable to other
people. This is neccesary, since the results you are creating will be used by
other groups, and since we monitor your progress ;-)
Good luck with your exercises - and remember, we are here to assist, so don't hesitate to ask questions!
Comparing single genomic properties
We have prepared this table containing some
common genomic properties of all currently sequenced prokaryotic genomes.
The table includes the number of replicons, total length of all replicons,
number of coding sequences, number of r- and tRNA genes, and the overall AT
content. Take a few seconds to look at the content.
box-and-whiskers plots You will start out by producing a set of box-and-whiskers
plots to visualize how the genomic properties vary among the prokaryotic
phyla. We have prepared a Perl script (boxplot) which parses this table and produces a
PostScript document containing a box-and-whiskers plot
- Log in to the CBS server organism.cbs.dtu.dk, using the
SSH client PuTTY, installed on the computers. We have created a profile for you
to use, which is placed on the desktop and called "CBS-DTU". Simply double-click
in this icon when you start a new exercise. By now, you should have been
informed about which username and password to use.
- Enter the directory for this exercise and copy the files needed
We will now copy all the files you need for the exercise plus. Also, we need to set
an environmental variable MAKEFILES - don't worry about this.
cp -rL ~www/pub/CBS/courses/thaiworkshop08/m4 ~/
setenv MAKEFILES /home/databases/genomeatlasdb-3.0_cur/make/Makefile
Now we will use the boxplot program to produce a box-and-whiskers plot comparing
the genome size and the number of predicted coding sequences.
- Length distribution
perl boxplot -f 2 -x 'length' -t "Distribution of genomic properties in 690 complete prokarytic genomes" < genomestat.tab > length.ps
Open the PDF file:length.pdf
- Distribution of number of coding sequences
perl boxplot -f 3 -x '# CDS' -t "Distribution of genomic properties in 690 complete prokarytic genomes" < genomestat.tab > cds.ps
Open the PDF file:cds.pdf
- Distribution of coding density
perl boxplot -f 6 -x 'Coding density' -t "Distribution of genomic properties in 690 complete prokarytic genomes" < genomestat.tab > coding_density.ps
Open the PDF file:coding_density.pdf
QUESTION How does the box-and-whiskers plot assist you
to detect (significant) differences? Do you find any trends among these data -
and are there phyla which posses different properties?
Predicting coding Sequences
In this exericse we will use the software EasyGene to predicing the coding
sequences in full prokaryotic genomes. Since running the software on larger
chromosomes takes time, we will start out by looking at a smaller plasmid while
you become familiar with the system.
Predicting genes with traditional web application
- Download E. coli pO157
Download the sequence of the E. coli pO157 plasmid, accession AB011549.
We have prepared a link to NCBI for this. Copy the entire fasta
sequence into your clip board (CTRL-A, CTRL-C).
- Go to EasyGene
Go to the EasyGene web site and
familiarize yourself with the page. Paste in (CTRL-V) the genome sequence in the
sequence input field, leave the R-value cutoff at 2.0 but remember to
change the model to E. coli K12
Submit your job to the server and wait - it will take a few minutes only.
Predicting coding sequences with Web Services (SoapUI)
Now, we will do exactly the same task, but this time using Web Services. More
specifically, we will use a graphical SOAP client called SoapUI. This tool is
not intended for multiple jobs and programmatic access to the server, rather it
provides a good interface for making initial tests when connecting to a web
service for the first time, in a user friendly way.
- Download and open SoapUI (downloadble
SoapUI is a Java program that runs on any
platform, as long as Java is installed.
- Create new project
Once you have opened SoapUI, right click on "Projects" and select "Create WSDL project". Label it "Exercise M4"
- Add Genome Atlas Web Service
Right-click on "Exercise M4", and add the Genome Atlas WSDL: http://www.
cbs.dtu.dk/ws/GenomeAtlas/GenomeAtlas_3_0_ws0.wsdl. We do not expect
you to understand the WSDL file itself, you should just be aware of the main
principle of such a Web Service Description that it contains all the definitions
of objects and operations needed for the client to submit to the service and
recieve results from the service. All that in a single file!
- Add EasyGene Web Service
Again, right-click on the "Exercise M4" project and now add the EasyGene WSDL:
- Download E. coli pO157
Locate the operation "getSeq" in GenomeAtlas service, and write AB011549 in the
accession field and press and press the green submit button, shaped like a
- Copy the genome sequence
When you receive the response, locate the genome sequence and copy this to your
- Paste the genome sequence
Locate the operation "runService" in the EasyGene service, and in default
request created by SoapUI, paste in the genome sequence. Write any identifier
for the sequence, use the 'EC03' as model (E. coli K12), and use an R-value
cutoff of 2.0 (see picture below)
- Submit your request and obtain result
Submit the request, copy the job identifier to the pollQueue operation, and poll
at reasonable intervals until the job has finished. Once the jobs has finished,
paste in the job identifier in the operation 'fetchResult'
Predicting coding sequences with web services (Perl)
You may not see the big advantage of web services yet, but to convince you we
have implemented the getSeq and runService operations you just
executed, in Perl. You will now do exactly the same set of tasks as you did
above, this time using small Web Services client programs. By installing Perl
and the proper SOAP modules, these script will run on most computers/platforms.
That enables advanced users to integrate foreign software into local programs
- Download genome and predict genes
perl getseq.pl AB011549 > AB011549.fsa
perl easygene.pl 2.0 EC03 < AB011549.fsa > AB011549.orfs.fsa
- Inspect the result
- Counting the ORFs
grep '>' AB011549.orfs.fsa | wc -l
QUESTION What do you think 'grep' does? to learn more about the 'grep’ command, type 'man grep’
QUESTION What do you think 'wc' does? to learn more about the 'wc’ command, type 'man wc’
QUESTION How many genes do you find? What advantages do you see of this method compared to using a web
One final step is, that you will now report to a central MySQL database, how many
genes you found. This technique will be used throughout the exercises.
set nprot = `grep '>' AB011549.orfs.fsa | wc -l`
mysql -B -e "replace into MicrobialGenomicsGroup.thaiworkshop08 (accession,nprot) values ('AB011549-$user',$nprot)"
Coding sequences in your bacterium
The members of group stud115 have been assigned the following sequence:
Oceanobacillus iheyensis HTE831 , accession BA000028
This is the accession number for your own bacterial strain which you will study
in the following days. On friday we will collect results from the different
groups and make a larger comparison. As the first part of these exercises, we
need the prediction of coding sequences from EasyGene - just like you did above.
For this you will be using the Web Services script exactly like you did above.
Since you will now deal with full chromosomes, this run will take much longer
time, and we will therefor let your jobs run over night. The amphersand (&) at
the end of the command indicate, that the jobs will run the background, even
though we close the SSH connection.
- Download genome and predict genes
set accession = BA000028
set model = OI02
perl getseq.pl $accession > $accession.fsa
perl easygene.pl 2.0 $model < $accession.fsa > $accession.orfs.fsa &