Events News Research CBS CBS Publications Bioinformatics
Staff Contact About Internal CBS CBS Other
CBS >> CBS Courses >> Scientific Communication of Comparative Genomics >> Course Programme >> Day 1

Exercise 1: Introduction

Accessing data

Goal: get proteins from various databases plus what to do when there are none.

Today's Web sites

Unix guide the pedestrians guide to unix

Introduction to the computer system: some technicalities

The exercises will take place on a UNIX computer system, since this is by far the most efficient way to handle large amounts of data. We are aware that this interface may not seem intuitive to most biologists, however it is our goal that you become familiar with such a system but there are no requirements for you being experienced users! The content of the blue boxes within all the exercises represent UNIX commands which you shall execute from within your shell. A shell is a small program that lets you execute commands. In most cases you will not need to type the text in yourself, you can just copy-paste it. Do not copy the text with the # in from of it. However, in some cases there is text marked <like this> which you will need to replace with your own options. We strongly recommend that you take one line at a time, and try to figure our what goes on. Bottom line: Don't panic!

Please be sure you have read the pedestrians guide to unix, before starting the exercises. In the grey boxes below, we will now briefly go through some of the things that you will see repeatedly in the exercises. You do not have to execute the commands - just read and understand.

  • What does '|', '>', and '<' mean?

  • Both of the redirects '>' and '>' are very common and you should understand them. The '>' simply means that whatever data is generated by a program is redirected (written) to a file. Example:

    ls > myfile.txt

    The program ls simply lists the content of the directory and this output is here written to the file myfile.txt. The other redirect '<' is used to redirect files into program. Example:

    wc < myfile.txt

    Here, we redirect the file just created into the program wc (word count) which simply counts the number of lines, words and bytes in the input.

    Finally, pipe ('|') is just as important. Instead of writing data to a file (or reading data from a file) the pipe sends output to another program. Let's do the same as above in just one line:

    ls | wc

    This simply means that we execute ls to list the files in the directory, and whatever output this creates gets send directly into program wc.

    The advantages of the latter is clear: you don't have to access the filesystem, which in many cases are relatively slow. Also, ls and wc will actually run in parallel on their own CPU, and wc will read data in chunks whenever generated by ls. Admitted, these things will only affect the performance of larger problems - but in bioinformatics, you often deal with large problems.

  • What does less mean?

  • The program less is a fast and convenient way to show the content of files. The following keyboard shortcuts apply to the program:

    SPACE : one screenful forward 
    b : one screenful backward
    RETURN : one line forward
    k : one line backward
    g : top (beginning) of file
    G : bottom (end) of file
    q : leave the session

    Less is used in the following way:

    less /usr/cbs/examples/
  • What does set ... mean?

  • The command set is used to declare a variable in the shell environment. In these exercises we often set the accession variable like this

    set accession = L43967

    This tells your shell that whenever you write $accession you really mean L43967. Example: Assume you have a file called L43967.gbk, and you want to use less to view it's content. After having set the $accession variable, the two sets of commands will do exactly the same:

    # version 1
    getgene L43967 > L43967.gbk
    less L43967.gbk
    # version 2
    set accession = L43967
    getgene $accession > $accession.gbk
    less $accession.gbk
  • 4. How should I treat the indentions?

  • The above was examples of very very simple commands. You will see, that commands often grow long and ugly, and will barely fit your screen! For matter of appearance, we have truncated the lines by adding a backslash ('\') at the and of the line. This instructs the shell to wait executing the command, even if you press return. This will allow us to break a long line into many. Example: The command below will get the genbank accession U00096 and extract all translations and count the lines/words/bytes of the output. The example below:

    getgene U00096 | saco_extract -I genbank -O tab -t | wc

          ... is exactly the same as this:

    getgene U00096 | saco_extract \ 
      -I genbank -O tab -t | \

    For easier reading, we have made indentions after the line-breaks (which would not be there if it was written all on one line). 

  • What does & mean?

  • The ampersand (&) is put at the end of a command if you want to run the command in the background. This means that the shell will be available for new input immediately after. The process then runs in the background.

  • What does wait mean?

  • When you have placed many jobs in the background that runs in parallel, it is often required that you wait for all processes to finish, before you continue using the results. This is what the wait command helps you with. So, when you see this command in the command-box, it must be copied as well. Example: The script below will fetch two genbank records (U00096 and AL111168) in parallel. Once they have both finished (ensured by using wait) we will do a wc (word count):

    getgene U00096 > U00096.gbk &
    getgene AL111169 > AL111169.gbk &
    wc U00096.gbk AL111169.gbk
    rm U00096.gbk AL111169.gbk
  •  What does umask 022 mean?

  • When you see the command umask this is to inform the shell, that whenever you create new files, they will be readable to other people. This is neccesary, since the results you are creating will be used by other groups, and since we monitor your progress ;-) 

A note on data hygiene and reproducibility

As this course progresses, you will have several directories with several files in them, and you will have done a lot to your data. You cannot expect to be able to remember what you did and what you named your files as you progress in the course. Therefore we strongly recommend that you take notes along the way, so that you know what data you used to make which figures.

You will also be creating several figures, and working with many different types of data files. In the exercises we will be assigning traceable names to the figures you create using the examples we show. However, when you make your won figures, we strongly recommend that you give files names that will enable you later to figure out why and when you created just that file, and what the input for it was. We also recommend that files of the same type, for instance fasta files, gbk files and protein files have the same tag in the name (fasta files end in fsa, genbank files end in gbk, files containing proteins in fasta format being named proteins.fsa). Knowing what, why and when a file or a figure was created will help you a lot when you go on to write your paper.

Select your organisms

Go to this webpage:

Find the genomes that you have chosen. Copy the information you find for your genomes into a text file.

The GPID is the number that is used to uniquely identify a genome at NCBI. With this, you can access all known information for that genome, also if it has several chromosomes or contains plasmids. Click on one of your projects, and look around to see what information you can find. Gather this together with the information from the genome table.

Genome databases - GENBANK and REFSEQ

At NCBI, there are two databases which contain sequence data – GENBANK and REFSEQ. Both the two databases store genomic data in a set format, the Genbank format, however, there are differences between the two. Genbank records are created by individual researchers around the world, and are submitted as-is to the NCBI data repository. No curation of these files are done by the NCBI. Genbank files are an example of a primary data source.

The Reference Sequence (RefSeq) database is a also a database containing Genbank-formatted files. RefSeq is a non-redundant collection of richly annotated DNA, RNA, and protein sequences from diverse taxa. The collection includes sequences from plasmids, organelles, viruses, archaea, bacteria, and eukaryotes. Each RefSeq represents a single, naturally occurring molecule from one organism. The goal is to provide a comprehensive, standard dataset that represents sequence information for a species. It should be noted, though, that RefSeq has been built using data from public archival databases only.

RefSeq biological sequences (also known as RefSeqs) are derived from the primary GenBank records but differ in that each RefSeq is a synthesis of information, not an archived unit of primary research data. Similar to a review article in the literature, a RefSeq represents the consolidation of information by a particular group at a particular time. This means that a genome can have several sequences, one from RefSeq, and one or more from Genbank. This means that there can be differences between the two databases, and you might be able to see examples of this in your data files. 

Downloading genomes and creating protein fasta files

  • Log in to the CBS computers

    Find a program which will let you log into the CBS computers. 

    Computer name:
    User name: studXXX

    You will get your password from the teachers.

    After that you need to log into the computer where we will do the exercises, which is named sbiology.

    # log into CBS
    ssh -Y sbiology
    setenv MAKEFILES /home/people/pfh/bin/Makefile
    umask 022

  • Create a directory where you can store your data.

    # create directory that you can store your data in
    mkdir data
    # change into that director (cd = change directory)
    cd data

  • How to do many things at once.

    In unix there is a a command structure which will let you perform one command repeatedly on something that varies from each time, without having to type in each one repeatedly. This is the foreach structure, and shown below is the structure of the foreach command:
    foreach <variable> ( <list of things that the command should be performed on> )
    the command you will perform on $i

  • Download your genomes to the computers at CBS.

    Use the foreach command structure to download your sets of genomes onto the computers at CBS.
    # make directory for genbank files
    mkdir genbank
    cd genbank
    # downloading all the genbank files associated with GPID list
    foreach i ( <your GPIDS with spaces in> )
    ~pfh/scripts/fetchgbk/fetchgbk -d genbank -p $i > $i.genbank.gbk
    # make directory for refseq files
    cd ~/data
    mkdir refseq
    cd refseq
    # downloading all the refseq files associated with the GPID list
    foreach i ( <your GPIDS with spaces in> )
    ~pfh/scripts/fetchgbk/fetchgbk -d refseq -p $i > $i.refseq.gbk

    Examples for how this looks for downloading some Campylobacter genomes:

    # downloading all the genbank files for these file project ids
    foreach i (17159 17161 16293 20083 303)
    ~pfh/scripts/fetchgbk/fetchgbk -d genbank -p $i > $i.genbank.gbk
    # downloading all the refseq files for these file project ids
    foreach i (17159 17161 16293 20083 303)
    ~pfh/scripts/fetchgbk/fetchgbk -d refseq -p $i > $i.refseq.gbk

  • Examine the files in your directory.

    There are now several files within your two directories. Look at what files there are, and have a look at one of them. Try to make some sense of what a genbank formatted file contains.

    # Examine files you have in your system 
    # Look at one file using less
    less <filename>.gbk

  • Renaming your files.

    You will rename your input files. This is done so that you will know which file contains information for which organism.

    # First, go to the genbank directory
    cd ~/data/genbank
    # Rename the files
    foreach i (*.gbk)
    ~karinl/scripts/utils/ $i
    # look at the files in the directory now

    # Then, go to the refseq directory
    cd ~/data/refseq
    # Rename the files there
    foreach i (*.gbk)
    ~karinl/scripts/utils/ $i

  • Extract protein sequences.

Both genbank files and refseq files contain the proteins that are found in the genome. However, these are somewhat inaccessible within this format. It is much more useful to have them available as fasta files. There are several web servers out there which will extract protein sequences from genbank files, but these are tedious to use for multiple sequences. The foreach structure can combined with conversion programs be used to extract the proteins.

# First, go to the genbank directory
cd ~/data/genbank
# Make protein files:
foreach i (*.gbk)
gmake $i:r.proteins.fsa

# Then, go to the refseq directory
cd ~/data/refseq
# Make protein files:
foreach i (*.gbk)
gmake $i:r.proteins.fsa

  • Count number of genes.

As mentioned there might be a different number of genes that are listed in the genbank file and the refseq file. The files ending in proteins.fsa in your directory now contains these proteins. They are in fasta format, which means a > followed by a description line, a newline character (return) and then the sequence. By counting the number of greater than signs (>) the number of genes in that file is found.

# count genes genbank files
cd ~/data/genbank
grep -c “>” *.proteins.fsa

# count genes in refseq files
cd ~/data/refseq
grep -c “>” *.proteins.fsa

Examine: are there any differences? Are there more or less in any of the files? How does this compare with the number of genes listed on NCBI's genome information pages?