Events News Research CBS CBS Publications Bioinformatics
Staff Contact About Internal CBS CBS Other
CBS >> CBS Courses >> Brasil Workshop 2009 >> Course Programme >> Day 1

Day 1: Introduction

Web sites

EasyGene webserver
RNAmmer webserver

Where is the data, and how does it work?

Goal: to introduce you to the number and diversity of nucleotide sequences in the NCBI database.

Begin by linking to the NCBI homepage. If you ever get lost, always return to that page as a starting point. Select TaxBrowser at the top right.

Select Taxonomy at the top right.

The NCBI Taxonomy database contains the names of those organisms whose sequences have been deposited. Only a small fraction of the millions of species estimated to exist on earth is represented! Select the option Taxonomy Statistics in the middle of the left-side navigation bar.

  1. For the year 2005, how many new Bacterial species were added to the sequence database? 

  2. For the year 1999, how many new Bacterial species were added to the sequence database? Wow, what a difference six years makes!

  3. Let's look for a gene sequence from an E. coli strain. Go back and select Bacteria. This organism belongs to the Enterobacteriales within the group of Gammaproteobacteria. Look for one of the pathogenic O157:H7 strains (strain Sakai).

  4. How many nucleotide sequences have been deposited into the Entrez Records for this organism? How many proteins ?

  5. How many nucleotide base pairs does this genome sequence contain? Take a look at the sequence itself by clicking on Genbank. Go back.

    A lot of information may seem confusing, but it is all there to provide scientists with as much information as possible about this sequence. At the bottom of the screen, you will find the nucleotide sequence (all of the A,T,G,C base pairs) of this gene. Click on the PubMed 11258796 to directly link to the title, authors, and abstract of the published paper! Amazing, now you can read the research article that discovered this nucleotide sequence.

    Select the NCBI link in the top left corner of the screen (next to the DNA symbol) to return to the NCBI home page. Great!

Getting data

Goal: get proteins and rRNAs from various databases plus what to do when there are none.

Introduction to the computer system: some technicalities

The exercises will take place on a UNIX computer system, since this is by far the most efficient way to handle large amounts of data. We are aware that this interface may not seem intuitive to most biologists, however it is our goal that you become familiar with such a system but there are no requirements for you being experienced users! The content of the blue boxes within all the exercises represent UNIX commands which you shall execute from within your shell. A shell is a small program that lets you execute commands. In most cases you will not need to type the text in yourself, you can just copy-paste it. Do not copy the text with the # in from of it. However, in some cases there is text marked <like this> which you will need to replace with your own options. We strongly recommend that you take one line at a time, and try to figure our what goes on. Bottom line: Don't panic!

Please be sure you have read the pedestrians guide to unix, before starting the exercises. In the grey boxes below, we will now briefly go through some of the things that you will see repeatedly in the exercises. You do not have to execute the commands - just read and understand.

  • What does '|', '>', and '<' mean?

  • Both of the redirects '>' and '>' are very common and you should understand them. The '>' simply means that whatever data is generated by a program is redirected (written) to a file. Example:

    ls > myfile.txt

    The program ls simply lists the content of the directory and this output is here written to the file myfile.txt. The other redirect '<' is used to redirect files into program. Example:

    wc < myfile.txt

    Here, we redirect the file just created into the program wc (word count) which simply counts the number of lines, words and bytes in the input.

    Finally, pipe ('|') is just as important. Instead of writing data to a file (or reading data from a file) the pipe sends output to another program. Let's do the same as above in just one line:

    ls | wc

    This simply means that we execute ls to list the files in the directory, and whatever output this creates gets send directly into program wc.

    The advantages of the latter is clear: you don't have to access the filesystem, which in many cases are relatively slow. Also, ls and wc will actually run in parallel on their own CPU, and wc will read data in chunks whenever generated by ls. Admitted, these things will only affect the performance of larger problems - but in bioinformatics, you often deal with large problems.

  • What does less mean?

  • The program less is a fast and convenient way to show the content of files. The following keyboard shortcuts apply to the program:

    SPACE : one screenful forward 
    b : one screenful backward
    RETURN : one line forward
    k : one line backward
    g : top (beginning) of file
    G : bottom (end) of file
    q : leave the session

    Less is used in the following way:

    less /usr/cbs/examples/
  • What does set ... mean?

  • The command set is used to declare a variable in the shell environment. In these exercises we often set the accession variable like this

    set accession = L43967

    This tells your shell that whenever you write $accession you really mean L43967. Example: Assume you have a file called L43967.gbk, and you want to use less to view it's content. After having set the $accession variable, the two sets of commands will do exactly the same:

    # version 1
    getgene L43967 > L43967.gbk
    less L43967.gbk
    # version 2
    set accession = L43967
    getgene $accession > $accession.gbk
    less $accession.gbk
  • 4. How should I treat the indentions?

  • The above was examples of very very simple commands. You will see, that commands often grow long and ugly, and will barely fit your screen! For matter of appearance, we have truncated the lines by adding a backslash ('\') at the and of the line. This instructs the shell to wait executing the command, even if you press return. This will allow us to break a long line into many. Example: The command below will get the genbank accession U00096 and extract all translations and count the lines/words/bytes of the output. The example below:

    getgene U00096 | saco_extract -I genbank -O tab -t | wc

          ... is exactly the same as this:

    getgene U00096 | saco_extract \ 
      -I genbank -O tab -t | \

    For easier reading, we have made indentions after the line-breaks (which would not be there if it was written all on one line). 

  • What does & mean?

  • The ampersand (&) is put at the end of a command if you want to run the command in the background. This means that the shell will be available for new input immediately after. The process then runs in the background.

  • What does wait mean?

  • When you have placed many jobs in the background that runs in parallel, it is often required that you wait for all processes to finish, before you continue using the results. This is what the wait command helps you with. So, when you see this command in the command-box, it must be copied as well. Example: The script below will fetch two genbank records (U00096 and AL111168) in parallel. Once they have both finished (ensured by using wait) we will do a wc (word count):

    getgene U00096 > U00096.gbk &
     getgene AL111169 > AL111169.gbk &
    wc U00096.gbk AL111169.gbk
    rm U00096.gbk AL111169.gbk
  •  What does umask 022 mean?

  • When you see the command umask this is to inform the shell, that whenever you create new files, they will be readable to other people. This is neccesary, since the results you are creating will be used by other groups, and since we monitor your progress ;-) 

Downloading genomes and creating protein fasta files

Find your group number and your genomes by looking at the Exercise Groups webpage.

Go to this webpage:

Find the genomes that you have been assigned. Copy the information you find for your genomes into a text file.

The GPID is the number that is used to uniquely identify a genome at NCBI. With this, you can access all known information for that genome, also if it has several chromosomes or contains plasmids. Click on one of your projects, and look around to see what information you can find.

At NCBI, there are two databases which contain sequence data – GENBANK and REFSEQ.

The Reference Sequence (RefSeq) database is a non-redundant collection of richly annotated DNA, RNA, and protein sequences from diverse taxa. The collection includes sequences from plasmids, organelles, viruses, archaea, bacteria, and eukaryotes. Each RefSeq represents a single, naturally occurring molecule from one organism. The goal is to provide a comprehensive, standard dataset that represents sequence information for a species. It should be noted, though, that RefSeq has been built using data from public archival databases only.

RefSeq biological sequences (also known as RefSeqs) are derived from GenBank records but differ in that each RefSeq is a synthesis of information, not an archived unit of primary research data. Similar to a review article in the literature, a RefSeq represents the consolidation of information by a particular group at a particular time. This means that a genome can have several sequences, one from refseq, and one or more from genbank. The proteins found within these sequences can vary, both in number and length.

Log in to the CBS computers

On the menu find Systems, and then find the Console, which contains a socalled "shell" you can perform commands in.

# log into CBS
ssh <your_username>
ssh -Y sbiology
setenv MAKEFILES /home/people/pfh/bin/Makefile

Familiarize yourself with the system.

# create directory that you can store your data in
mkdir data
# change into that director (cd = change directory)
cd data

In unix there is a a command structure which will let you perform one command repeatedly on something that varies from each time, without having to type in each one repeatedly. This is the foreach structure, and shown below is the structure of the foreach command:

foreach <variable> ( <list of things that the command should be performed on> )
the command you will perform on $i

Use this command structure to download your sets of genomes onto the computers at CBS.

# downloading all the genbank files associated with GPID list
foreach i ( <your GPIDS with spaces in> )
~pfh/scripts/fetchgbk/fetchgbk -d genbank -p $i > $i.genbank.gbk
# downloading all the refseq files associated with the GPID list
foreach i ( <your GPIDS with spaces in> )
~pfh/scripts/fetchgbk/fetchgbk -d refseq -p $i > $i.refseq.gbk

Examples for how this looks for downloading some Campylobacter genomes:

# downloading all the genbank files for these file project ids
foreach i (17159 17161 16293 20083 303)
~pfh/scripts/fetchgbk/fetchgbk -d genbank -p $i > $i.genbank.gbk
# downloading all the refseq files for these file project ids
foreach i (17159 17161 16293 20083 303)
~pfh/scripts/fetchgbk/fetchgbk -d refseq -p $i > $i.refseq.gbk

There are now several files within your directory. Look at what files there are, and have a look at one of them. Try to make some sense of what a genbank formatted file contains.

# Examine files you have in your system 
# Look at one file using less
less <filename>.gbk
# look at files and move them to sensible names:
head filename
mv fileanme <sensible_name_only_use_underscores>.gbk

Both genbank files and refseq files contain the proteins that are found in the genome. However, these are somewhat inaccessible within this format. It is much more useful to have them available as fasta files. There are several web servers out there which will extract protein sequences from genbank files, but these are tedious to use for multiple sequences. The foreach structure can combined with conversion programs be used for this.

# Make protein files: 
foreach i (*.gbk)
gmake $i:r.proteins.fsa

As mentioned there might be a different number of genes that are listed in the genbank file and the refseq file. The files ending in proteins.fsa in your directory now contains these proteins. They are in fasta format, which means a > followed by a description line, a newline character (return) and then the sequence. By counting the number of greater than signs (>) the number of genes in that file is found.

# count genes in each: 
grep -c “>” *.proteins.fsa

Examine: are there any differences? Are there more or less in any of the files?

# last, but not least, upload the proteins file to a central directory 
cp *.proteins.fsa ~karinl/<genomeGroup>/proteins

Predicting genes

Sometimes the genome you are working with is not to be found at NCBI or other genome databases. Then you might have to annotate proteins in your genome yourself.

There are several methods for finding the proteins within an unannotated sequence, and giving a full presentation of this subject would be to extensive for this course. We will show you one of these methods, the EasyGene program. This program can be accesed in two different ways, via the web as a form where you submit your sequence and get the annotation back. Another version is through something called Web services. This means that a script, which is a very small program, submits the data you want to have processed by that website, and transfers the data back to you directly.  We will predict genes for one of your genomes through the EasyGene website. The script needed to do this is mentioned above. 

  1. Open the webpage which contained the genome list, as mentioned above.
  2. Find one of your genomes, and click on the accession number for the REFSEQ version of your genome. Find the column with the heading Genome Info and click on the REFSEQ accession number. You will now have a genbank formattet file on your screen.
  3. Click on the link FASTA, this will give you the whole genomic sequence. Click Download and save the genome file as a fasta file.
  4. Open a new browser window and go to the Easygene Server above. Click Browse and find your fasta file. In the organism field, find a genome that is within the same genus as yours, and click Submit.

Examine your results.

Predicting 16S rRNA genes

In some cases the annotations of rRNAs can be inaccurate. Ribosomal RNAs can be predicted using a program called RNAmmer. This program can be either used on the web, through the website mentioned above, or through a web servics script that uses this program via the internet. 

# create directory for holding the rRNA predictions, and go into that directory
cd ~/
mkdir rRNA
cd ~/rRNA
# get just the fasta sequence for these genomes
foreach i (../data/$.gbk)
gmake $i:r.fsa
# run rrnammer for this genome
perl < $i:r.fsa > $i.rrna.fsa
# look at one of the files
less <filename>.rrna.fsa 
Can you see which ones of these genes in the file are 5S, 16S and 23S?

# count the number of rRNAs within each of these
foreach i (*.rrna.fsa)
echo $i
echo `grep -c ">" $i

Genome Atlases 

The last thing you will do today is to look at a Genome Atlas for one of your genomes. Precalculated atlases can be found here: