Introduction to the computer system: some technicalities

All of the exercises will take place on a UNIX computer system, since this is by far the most efficient way to handle large amounts of data. We are aware that this interface may not seem intuitive to most biologists, however it is our goal that you become familiar with such a system but there are no requirements for you being experienced users! The content of the colored boxes within all the exercises represent UNIX commands which you shall execute from within your shell. It is not the idea you should type these commands yourselves - you simply use the copy-paste function of your computer. However, we strongly recommend that you take one line at a time, and try to figure our what goes on. Bottom line: Don't panic!

Please be sure you have read the pedestrians guide to unix, before starting the exercises. In the gray boxes below, we will now briefly go through some of the things that you will see repeatedly in the exercises. You do not have to execute the commands - just read and understand.
  1. What does '|', '>', and '<' mean?
    Both of the redirects '>' and '>' are very common and you should understand them. The '>' simply means that whatever data is generated by a program is redirected (written) to a file. Example:
    ls > myfile.txt
    The program ls simply lists the content of the directory and this output is here written to the file myfile.txt. The other redirect '<' is used to redirect files into program. Example:
    wc < myfile.txt
    Here, we redirect the file just created into the program wc (word count) which simply counts the number of lines, words and bytes in the input.
    Finally, pipe ('|') is just as important. Instead of writing data to a file (or reading data from a file) the pipe sends output to another program. Let's do the same as above in just one line:
    ls | wc
    This simply means that we execute ls to list the files in the directory, and whatever output this creates gets send directly into program wc

    The advantages of the latter is clear: you don't have to access the filesystem, which in many cases are relatively slow. Also, ls and wc will actually run in parallel on their own CPU, and wc will read data in chunks whenever generated by ls. Admitted, these things will only affect the performance of larger problems - but in bioinformatics, you often deal with large problems.

  2. What does less mean?
    The program less is a fast and convenient way to show the content of files. The following keyboard shortcuts apply to the program:

    SPACE : one screenful forward
    b : one screenful backward
    RETURN : one line forward
    k : one line backward
    g : top (beginning) of file
    G : bottom (end) of file
    q : leave the session

    less /usr/cbs/examples/

  3. What does set ... mean?
    The command set is used to declare a variable in the shell environment. In these exercises we often set the accession variable like this:
    set accession = L43967
    This tells your shell that whenever you write $accession you really mean L43967. Example: Assume you have a file called L43967.gbk, and you want to use less to view it's content. After having set the $accession variable, the two sets of commands will do exactly the same:
    getgene L43967 > L43967.gbk
    less L43967.gbk

    Of course, the following would work just as well:
    set accession = L43967
    getgene $accession > $accession.gbk
    less $accession.gbk

  4. How should I treat the indentions?
    The above was examples of very very simple commands. You will see, that commands often grow long and ugly, and will barely fit your screen! For matter of appearance, we have truncated the lines by adding a backslash ('\') at the and of the line. This instructs the shell to wait executing the command, even if you press return. This will allow us to break a long line into many. Example: The command below will get the genbank accession U00096 and extract all translations and count the lines/words/bytes of the output. The example below:
    getgene U00096 | saco_extract -I genbank -O tab -t | wc
    ... is exactly the same as this:
    getgene U00096 | saco_extract \
      -I genbank -O tab -t | \
    For easier reading, we have made indentions after the line-breaks. Feel free to copy-paste all lines within these lines breaks, when it occur.

  5. What does & mean?
    The ampersand (&) is put at the end of a command if you want to run the command in the background. This means that the shell will be available for new input immediately after. The process then runs in the background

  6. What does wait mean?
    When you have placed many jobs in the background that runs in parallel, it is often required that you wait for all processes to finish, before you continue using the results. This is what the wait command helps you with. So, when you see this command in the command-box, it must be copied as well. Example: The script below will fetch two genbank records (U00096 and AL111168) in parallel. Once they have both finished (ensured by using wait) we will do a wc (word count):
    getgene U00096 > U00096.gbk &
    getgene AL111169 > AL111169.gbk &
    wc U00096.gbk AL111169.gbk
    rm U00096.gbk AL111169.gbk

  7. What does umask 022 mean?
    When you see the command umask this is to inform the shell, that whenever you create new files, they will be readable to other people. This is neccesary, since the results you are creating will be used by other groups, and since we monitor your progress ;-)
    umask 022

Good luck with your exercises - and remember, we are here to assist, so don't hesitate to ask questions!

Comparing single genomic properties

We have prepared this table containing some common genomic properties of all currently sequenced prokaryotic genomes. The table includes the number of replicons, total length of all replicons, number of coding sequences, number of r- and tRNA genes, and the overall AT content. Take a few seconds to look at the content.

Produce box-and-whiskers plots

You will start out by producing a set of box-and-whiskers plots to visualize how the genomic properties vary among the prokaryotic phyla. We have prepared a Perl script (boxplot) which parses this table and produces a PostScript document containing a box-and-whiskers plot

  1. Log in to the CBS server, using the SSH client PuTTY, installed on the computers. We have created a profile for you to use, which is placed on the desktop and called "CBS-DTU". Simply double-click in this icon when you start a new exercise. By now, you should have been informed about which username and password to use.

  2. Enter the directory for this exercise and copy the files needed
    We will now copy all the files you need for the exercise plus. Also, we need to set an environmental variable MAKEFILES - don't worry about this.
    umask 022
    cp -rL ~www/pub/CBS/courses/thaiworkshop08/m4 ~/
    cd ~/m4
    setenv MAKEFILES  /home/databases/genomeatlasdb-3.0_cur/make/Makefile
    Now we will use the boxplot program to produce a box-and-whiskers plot comparing the genome size and the number of predicted coding sequences.

  3. Length distribution
    perl boxplot -f 2 -x 'length' -t "Distribution of genomic properties in 690 complete prokarytic genomes" < >
    gmake length.pdf
    Open the PDF file:length.pdf

  4. Distribution of number of coding sequences
    perl boxplot -f 3 -x '# CDS' -t "Distribution of genomic properties in 690 complete prokarytic genomes" < >
    gmake cds.pdf
    Open the PDF file:cds.pdf

  5. Distribution of coding density
    perl boxplot -f 6 -x 'Coding density' -t "Distribution of genomic properties in 690 complete prokarytic genomes" < >
    gmake coding_density.pdf
    Open the PDF file:coding_density.pdf
    QUESTION How does the box-and-whiskers plot assist you to detect (significant) differences? Do you find any trends among these data - and are there phyla which posses different properties?

Predicting coding Sequences

In this exericse we will use the software EasyGene to predicing the coding sequences in full prokaryotic genomes. Since running the software on larger chromosomes takes time, we will start out by looking at a smaller plasmid while you become familiar with the system.

Predicting genes with traditional web application

  1. Download E. coli pO157
    Download the sequence of the E. coli pO157 plasmid, accession AB011549. We have prepared a link to NCBI for this. Copy the entire fasta sequence into your clip board (CTRL-A, CTRL-C).

  2. Go to EasyGene
    Go to the EasyGene web site and familiarize yourself with the page. Paste in (CTRL-V) the genome sequence in the sequence input field, leave the R-value cutoff at 2.0 but remember to change the model to E. coli K12

  3. Submit
    Submit your job to the server and wait - it will take a few minutes only.

Predicting coding sequences with Web Services (SoapUI)

Now, we will do exactly the same task, but this time using Web Services. More specifically, we will use a graphical SOAP client called SoapUI. This tool is not intended for multiple jobs and programmatic access to the server, rather it provides a good interface for making initial tests when connecting to a web service for the first time, in a user friendly way.
  1. Download and open SoapUI (downloadble from
    SoapUI is a Java program that runs on any platform, as long as Java is installed.

  2. Create new project
    Once you have opened SoapUI, right click on "Projects" and select "Create WSDL project". Label it "Exercise M4"

  3. Add Genome Atlas Web Service
    Right-click on "Exercise M4", and add the Genome Atlas WSDL: http://www. We do not expect you to understand the WSDL file itself, you should just be aware of the main principle of such a Web Service Description that it contains all the definitions of objects and operations needed for the client to submit to the service and recieve results from the service. All that in a single file!

  4. Add EasyGene Web Service
    Again, right-click on the "Exercise M4" project and now add the EasyGene WSDL:

  5. Download E. coli pO157
    Locate the operation "getSeq" in GenomeAtlas service, and write AB011549 in the accession field and press and press the green submit button, shaped like a play-icon).

  6. Copy the genome sequence
    When you receive the response, locate the genome sequence and copy this to your clipboard (CTRL-C).

  7. Paste the genome sequence
    Locate the operation "runService" in the EasyGene service, and in default request created by SoapUI, paste in the genome sequence. Write any identifier for the sequence, use the 'EC03' as model (E. coli K12), and use an R-value cutoff of 2.0 (see picture below)

  8. Submit your request and obtain result
    Submit the request, copy the job identifier to the pollQueue operation, and poll at reasonable intervals until the job has finished. Once the jobs has finished, paste in the job identifier in the operation 'fetchResult'

Predicting coding sequences with web services (Perl)

You may not see the big advantage of web services yet, but to convince you we have implemented the getSeq and runService operations you just executed, in Perl. You will now do exactly the same set of tasks as you did above, this time using small Web Services client programs. By installing Perl and the proper SOAP modules, these script will run on most computers/platforms. That enables advanced users to integrate foreign software into local programs environments.

  1. Download genome and predict genes
    cd ~/m4
    perl AB011549 > AB011549.fsa
    perl 2.0 EC03 < AB011549.fsa > AB011549.orfs.fsa
  2. Inspect the result
    less AB011549.orfs.fsa
  3. Counting the ORFs
    grep '>' AB011549.orfs.fsa | wc -l 
    QUESTION What do you think 'grep' does? to learn more about the 'grepÕ command, type 'man grepÕ

    QUESTION What do you think 'wc' does? to learn more about the 'wcÕ command, type 'man wcÕ

    QUESTION How many genes do you find? What advantages do you see of this method compared to using a web application?

  4. One final step is, that you will now report to a central MySQL database, how many genes you found. This technique will be used throughout the exercises.
    set nprot = `grep '>' AB011549.orfs.fsa | wc -l` 
    mysql -B -e "replace into MicrobialGenomicsGroup.thaiworkshop08 (accession,nprot) values ('AB011549-$user',$nprot)"

Coding sequences in your bacterium

The members of group stud109 have been assigned the following sequence:

Escherichia coli str. K12 substr. DH10B , accession CP000948

This is the accession number for your own bacterial strain which you will study in the following days. On friday we will collect results from the different groups and make a larger comparison. As the first part of these exercises, we need the prediction of coding sequences from EasyGene - just like you did above. For this you will be using the Web Services script exactly like you did above. Since you will now deal with full chromosomes, this run will take much longer time, and we will therefor let your jobs run over night. The amphersand (&) at the end of the command indicate, that the jobs will run the background, even though we close the SSH connection.
  1. Download genome and predict genes
    cd ~/m4
    set accession = CP000948
    set model = EC03
    perl $accession > $accession.fsa
    perl 2.0 $model < $accession.fsa  > $accession.orfs.fsa &