Events News Research CBS CBS Publications Bioinformatics
Staff Contact About Internal CBS CBS Other
CBS >> CBS Courses >> Comparative Microbial Genomics Analysis >> Day 1

Exercise 1: Introduction



Accessing data

Goal: get proteins from various databases plus what to do when there are none.

Today's Web sites

NCBI http://www.ncbi.nlm.nih.gov
Unix guide the pedestrians guide to unix
Prodigal website http://compbio.ornl.gov/prodigal/


Introduction to the computer system: some technicalities

The exercises will take place on a UNIX computer system, since this is by far the most efficient way to handle large amounts of data. We are aware that this interface may not seem intuitive to most biologists, however it is our goal that you become familiar with such a system but there are no requirements for you being experienced users! The content of the blue boxes within all the exercises represent UNIX commands which you shall execute from within your shell. A shell is a small program that lets you execute commands. In most cases you will not need to type the text in yourself, you can just copy-paste it. Do not copy the text with the # in from of it. However, in some cases there is text marked<like this> which you will need to replace with your own options. We strongly recommend that you take one line at a time, and try to figure our what goes on. Bottom line: Don't panic!

Please be sure you have read the pedestrians guide to unix, before starting the exercises. In the grey boxes below, we will now briefly go through some of the things that you will see repeatedly in the exercises. You do not have to execute the commands - just read and understand.

  • What does '|', '>', and '<' mean?

  • Both of the redirects '>' and '<' are very common and you should understand them. The '>' simply means that whatever data is generated by a program is redirected (written) to a file. Example:

ls > myfile.txt

The program ls simply lists the content of the directory and this output is here written to the file myfile.txt. The other redirect '<' is used to redirect files into program. Example:

wc < myfile.txt

Here, we redirect the file just created into the program wc (word count) which simply counts the number of lines, words and bytes in the input.

Finally, pipe ('|') is just as important. Instead of writing data to a file (or reading data from a file) the pipe sends output to another program. Let's do the same as above in just one line:

ls | wc

This simply means that we execute ls to list the files in the directory, and whatever output this creates gets send directly into program wc.

The advantages of the latter is clear: you don't have to access the filesystem, which in many cases is relatively slow. Also, ls and wc will actually run in parallel on their own CPU, and wc will read data in chunks whenever generated by ls. Admitted, these things will only affect the performance of larger problems - but in bioinformatics, you often deal with large problems.

  • What does less mean?

The program less is a fast and convenient way to show the content of files. The following keyboard shortcuts apply to the program:

SPACE : one screenful forward 
b : one screenful backward
RETURN : one line forward
k : one line backward
g : top (beginning) of file
G : bottom (end) of file
q : leave the session

Less is used in the following way:

less /usr/cbs/examples/gene.gb
  • How should I treat the indentions?

The above was examples of very very simple commands. You will see, that commands often grow long and ugly, and will barely fit your screen! For matter of appearance, we have truncated the lines by adding a backslash ('\') at the and of the line. This instructs the shell to wait executing the command, even if you press return. This will allow us to break a long line into many. Example: The command below will get the genbank accession U00096 and extract all translations and count the lines/words/bytes of the output. The example below:

getgene U00096 | saco_extract -I genbank -O tab -t | wc

      ... is exactly the same as this:

getgene U00096 | saco_extract \ 
  -I genbank -O tab -t | \
 wc

When you do this yourself, remove the \ and just glue it together on one line.

  • What does & mean?

  • The ampersand (&) is put at the end of a command if you want to run the command in the background. This means that the shell will be available for new input immediately after. The process then runs in the background.

  •  What does umask 022 mean?

  • This ensures that the files that you create are can be read by other people.

    umask 022 
    When you see the command umask this is to inform the shell, that whenever you create new files, they will be readable to other people. This is neccesary, since the results you are creating will be used by other groups, and since we monitor your progress ;-) 

    A note on data hygiene and reproducibility

    As this course progresses, you will have several directories with several files in them, and you will have done a lot to your data. You cannot expect to be able to remember what you did and what you named your files as you progress in the course. Therefore we strongly recommend that you take notes along the way, so that you know what data you used to make which figures.

    You will also be creating several figures, and working with many different types of data files. In the exercises we will be assigning traceable names to the figures you create using the examples we show. However, when you make your won figures, we strongly recommend that you give files names that will enable you later to figure out why and when you created just that file, and what the input for it was. We also recommend that files of the same type, for instance fasta files, gbk files and protein files have the same tag in the name (fasta files end in fsa, genbank files end in gbk, files containing proteins in fasta format being named proteins.fsa). Knowing what, why and when a file or a figure was created will help you a lot when you go on to write your paper.

    Select your organisms


    Go to this webpage:http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi

    Find the genomes that you have been given. Copy the information you find for your genomes into a text file. Having this, will let you evaluate results better.

    The GPID is the number that is used to uniquely identify a genome at NCBI. With this, you can access all known information for that genome, also if it has several chromosomes or contains plasmids. Click on one of your projects, and look around to see what information you can find. Gather this together with the information from the genome table.

    Genome databases - GENBANK and REFSEQ

    At NCBI, there are two databases which contain sequence data – GENBANK and REFSEQ. Both the two databases store genomic data in a set format, the Genbank format, however, there are differences between the two. Genbank records are created by individual researchers around the world, and are submitted as-is to the NCBI data repository. No curation of these files are done by the NCBI. Genbank files are an example of a primary data source.

    The Reference Sequence (RefSeq) database is a also a database containing Genbank-formatted files. RefSeq is a non-redundant collection of richly annotated DNA, RNA, and protein sequences from diverse taxa. The collection includes sequences from plasmids, organelles, viruses, archaea, bacteria, and eukaryotes. Each RefSeq represents a single, naturally occurring molecule from one organism. The goal is to provide a comprehensive, standard dataset that represents sequence information for a species. It should be noted, though, that RefSeq has been built using data from public archival databases only.

    RefSeq biological sequences (also known as RefSeqs) are derived from the primary GenBank records but differ in that each RefSeq is a synthesis of information, not an archived unit of primary research data. Similar to a review article in the literature, a RefSeq represents the consolidation of information by a particular group at a particular time. This means that a genome can have several sequences, one from RefSeq, and one or more from Genbank. This means that there can be differences between the two databases, and you might be able to see examples of this in your data files. 

    Downloading genomes and creating protein fasta files

    • Log in to the CBS computers

      Find a program which will let you log into the CBS computers.

      Computer name: login.cbs.dtu.dk

      User name: studXXX

      You will get your password from the teachers. 

      The password should not, under any circumstances, be passed to somebody else, electronically, written, or orally. If you do, you will lose access to the computers.

      After that you need to log into the computer where we will do the exercises, which is named ibiology.

      # log into CBS
      ssh -Y ibiology
      setenv MAKEFILES /home/people/pfh/bin/Makefile
      umask 022

    • Create a directory where you can store your data.

      # create directory that you can store your data in
      mkdir data
      # change into that director (cd = change directory)
      cd data
      mkdir genbank
      mkdir refseq

    • How to do many things at once.

      In unix there is a a command structure which will let you perform one command repeatedly on something that varies from each time, without having to type in each one repeatedly. This is the foreach structure, and shown below is the structure of the foreach command:
      foreach <variable> ( <list of things that the command should be performed on> )
      the command you will perform on $i
      end


      Note: when you see something in red, you are supposed to replace it with something else.


    • Download your genomes to the computers at CBS.

      Use the foreach command structure to download your sets of genomes onto the computers at CBS.
      Examples for how this looks for downloading some Campylobacter genomes:

      # downloading all the genbank files for these file project ids
      cd genbank
      foreach i (17159 17161 16293 20083 303)
      ~pfh/scripts/fetchgbk/fetchgbk -d genbank -p $i > $i.genbank.gbk
      end
      # downloading all the refseq files for these file project ids
      cd refseq
      foreach i (17159 17161 16293 20083 303)
      ~pfh/scripts/fetchgbk/fetchgbk -d refseq -p $i > $i.refseq.gbk
      end


    • Examine the files in your directory.

      There are now several files within your two directories. Look at what files there are, and have a look at one of them. Try to make some sense of what a genbank formatted file contains.

      # Examine files you have in your system 
      ls
      # Look at one file using less
      less <filename>.gbk

    • Renaming your files.

      You will rename your input files. This is done so that you will know which file contains information for which organism.

      # First, go to the genbank directory
      cd ~/data/genbank
      # Rename the files
      foreach i (*.gbk)
      ~karinl/scripts/utils/extractname.sh $i
      end
      # look at the files in the directory now
      ls

      # Then, go to the refseq directory
      cd ~/data/refseq
      # Rename the files there
      foreach i (*.gbk)
      ~karinl/scripts/utils/extractname.sh $i
      end

    • Extract protein sequences.

    Both genbank files and refseq files contain the proteins that are found in the genome. However, these are somewhat inaccessible within this format. It is much more useful to have them available as fasta files. There are several web servers out there which will extract protein sequences from genbank files, but these are tedious to use for multiple sequences. The foreach structure can combined with conversion programs be used to extract the proteins.

    # First, go to the genbank directory
    cd ~/data/genbank
    # Make protein files:
    foreach i ([A-Z]*.gbk)
    make $i:r.proteins.fsa
    end

    # Then, go to the refseq directory
    cd ~/data/refseq
    # Make protein files:
    foreach i ([A-Z]*.gbk)
    make $i:r.proteins.fsa
    end

    • Count number of genes.

    As mentioned there might be a different number of genes that are listed in the genbank file and the refseq file. The files ending in proteins.fsa in your directory now contains these proteins. They are in fasta format, which means a > followed by a description line, a newline character (return) and then the sequence. By counting the number of greater than signs (>) the number of genes in that file is found.

    # count genes genbank files
    cd ~/data/genbank
    grep ">" [A-Z]*.proteins.fsa |wc -l

    # count genes in refseq files
    cd ~/data/refseq
    grep “>” [A-Z]*.proteins.fsa |wc -l

    Examine: are there any differences? Are there more or less in any of the files? How does this compare with the number of genes listed on NCBI's genome information pages?

    Gene prediction

    You have downloaded two different versions of your genomes. In both cases you extracted the proteins found within the genomes.

    Now you will be predicting your own proteins using a gene predictor called PRODIGAL. This predictor is also available for download through the PRODIGAL website, however, we will here be using the version of the program that is installed at the CBS computers.

    After you have predicted your own genes, we will do some comparisons of your protein datasets.

    • Extract the DNA sequence from your genbank files
      # go to your genbank directory
      cd ~/data/genbank
      # get the fasta sequences you need
      foreach i ( [A-Z]*.gbk )
      make $i:r.fsa
      mv $i:r.fsa $i:r.fasta
      end

      #move these files into a new directory
      cd ~/data
      mkdir fsafiles
      cd fsafiles
      mv ~/data/genbank/*.fasta ~/data/fsafiles
      # look at the files you have here now - remember the commands for less from last time!
      ls
      less <fastafile>.fasta

      Question: can you tell the difference between genbank files and fasta files?

    • Predict proteins from your DNA sequences

      Predicting proteins can take a bit of time. For each genome, you should get a line saying something with "ACTIVATED" and a line saying "FINISHED"
      # create directory for containing the predicted proteins
      mkdir ~/data/prodigal
      cd ~/data/prodigal
      # predicting the proteins
      foreach i ( ../fsafiles/*.fasta )
      perl ~karinl/scripts/ga3/prodigal.pl -t 11 -fasta $i > $i:r.orfs.fsa
      cat $i:r.orfs.fsa | 6pack -1 > $i:r.proteins.fsa
      mv $i:r.proteins.fsa .
      end


    Protein statistics

    You will now do some basic statistic on the protein sets that you have found.

    • How many genes are there in each of the three data sets? HINT: do this for each of the three directories of files you have, ~/data/genbank, ~/data/refseq, ~/data/prodigal

      # go to the directory you want to be in, either ~/data/genbank, ~/data/refseq or ~/data/prodigal
      cd <directory>
      # count the number of genes
      grep “>” *.proteins.fsa |wc -l

      Question: which data set contains the fewest and which one contains the most genes?

    • How long are the genes in each data set?

      # go to the directory you want to be in, either ~/data/genbank, ~/data/refseq or ~/data/prodigal
      cd <directory>
      # calculate the statistics for the gene sets in your directory:
      foreach i ( *.proteins.fsa)
      python ~karinl/scripts/utils/GeneLength.py $i
      end
    Question: can you see any differences between the data sets? Are genbank/refseq/prodigal proteins noticably longer or shorter than the others? Do some genomes have noticably longer or shorter genes than the others?