
Exercise 2: Gene prediction and genome properties

IMPORTANT: You must have completed last week's exercises before you start on today's work.

IMPORTANT: Due to constraints on the servers, we would like to split the class in two: groups 1 through 6 will start with predicting proteins, while the remaining groups will start with examining atlases for their genomes.

Some tools used today

Some of the tools we are using today deserve a closer explanation.

  • Perl and Python: Perl and Python are programming languages that can be used to write anything from large programs to small utilities. In this course we will often use Perl and Python scripts, which are small programs, to calculate things and to convert files.

  • gmake: gmake is a system that lets us automate tasks. Doing something often involves several steps, such as converting files, calculating something, and making a graphic file of it all. With gmake, you specify what kind of file you want to end up with, and the computer figures out which steps are needed to get there. For instance, suppose you have a GenBank file named "a.gbk" and you only want the FASTA sequence of the genome. To get it with gmake, you type "gmake a.fsa", i.e., you replace the file ending with the ending of the file you want to end up with.
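The idea of naming the file you want and letting the computer work out the steps can be sketched in a few lines of Python. This is only an illustration of the principle, not how gmake is implemented. The rule table below is an assumption: only the ".gbk" to ".fsa" step comes from the text above, and the ".stats" and ".png" endings are made up for the example.

```python
from pathlib import Path

# Illustrative rules: file ending we want -> file ending it is made from.
# Only ".fsa" from ".gbk" is taken from the exercise text; the other two
# entries are hypothetical and exist only to show a multi-step chain.
RULES = {
    ".fsa": ".gbk",
    ".stats": ".fsa",
    ".png": ".stats",
}

def plan(target, existing):
    """Return the files that must be built, in order, to end up with target."""
    steps = []
    current = target
    while current not in existing:
        suffix = Path(current).suffix
        if suffix not in RULES:
            raise ValueError(f"no rule to make {current}")
        steps.append(current)
        # Step back to the file this one is made from.
        current = str(Path(current).with_suffix(RULES[suffix]))
    return list(reversed(steps))

print(plan("a.fsa", {"a.gbk"}))
print(plan("a.png", {"a.gbk"}))
```

Starting from only "a.gbk", asking for "a.fsa" needs one step, while asking for the hypothetical "a.png" needs three steps that are worked out automatically.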

Web sites

Prodigal Website
Pre-made Atlases

Gene prediction

Last week you downloaded two different versions of your genomes. In both cases you extracted the proteins found within the genomes.

Today you will be predicting your own proteins using a gene predictor called PRODIGAL. This predictor is also available for download through the PRODIGAL website. Here, however, we will use the version of the program that is installed on the CBS computers.

After you have predicted your own genes, we will do some comparisons of your protein datasets.
  • Log in to the CBS computers

    Do what you did last week in order to log in to the CBS computers.

    Computer name:
    User name: studXXX

    The password is the same as was used last week.

    After that you need to log into the computer where we will do the exercises, which is named sbiology.

    IMPORTANT: this step must be done every time you log into the CBS computers.

    # log into CBS
    ssh -Y sbiology
    setenv MAKEFILES /home/people/pfh/bin/Makefile
    umask 022
  • Extract the DNA sequence from your REFSEQ files
    # go to your genbank directory
    cd ~/data/genbank
    # get the fasta sequences you need
    foreach i ( *.gbk )
        gmake $i:r.fsa
        mv $i:r.fsa $i:r.fasta
    end

    #move these files into a new directory
    cd ~/data
    mkdir fsafiles
    cd fsafiles
    mv ~/data/genbank/*.fasta ~/data/fsafiles
    # look at the files you have here now - remember the commands for less from last time!
    less <fastafile>.fasta
    Question: can you tell the difference between genbank files and fasta files?
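To make the difference between the two formats concrete, here is a minimal Python sketch of the kind of conversion involved. It is not the script that gmake runs on the CBS computers. It assumes a GenBank file with a single record, takes the name from the LOCUS line, and reads the sequence between the ORIGIN line and the closing "//". The example record is made up.

```python
def genbank_to_fasta(genbank_text, width=60):
    """Convert the text of a single-record GenBank file to FASTA text."""
    name = "unknown"
    pieces = []
    in_origin = False
    for line in genbank_text.splitlines():
        if line.startswith("LOCUS"):
            parts = line.split()
            if len(parts) > 1:
                name = parts[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            in_origin = False
        elif in_origin:
            # Sequence lines hold a position number and blocks of bases;
            # keep only the letters.
            pieces.append("".join(ch for ch in line if ch.isalpha()))
    seq = "".join(pieces).upper()
    wrapped = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
    return f">{name}\n{wrapped}\n"

# Made-up toy record, only to show the shape of the two formats.
example = """LOCUS       DEMO1        20 bp    DNA     circular BCT
FEATURES             Location/Qualifiers
ORIGIN
        1 atgcatgcat gcatgcatgc
//
"""
print(genbank_to_fasta(example))
```

The GenBank text carries annotation (LOCUS, FEATURES and so on) in addition to the sequence, while the FASTA output keeps only a header line starting with ">" followed by the sequence.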

  • Predict proteins from your DNA sequences

    Predicting proteins can take a bit of time. For each genome, you should get one line containing "ACTIVATED" and one line containing "FINISHED".
    # create a directory to hold the predicted proteins
    mkdir ~/data/prodigal
    cd ~/data/prodigal
    # predict the proteins
    foreach i ( ../fsafiles/*.fasta )
        perl ~karinl/scripts/ga3/ -t 11 -fasta $i > $i:r.proteins.fsa
        mv $i:r.proteins.fsa .
    end

Protein statistics

You will now calculate some basic statistics on the protein sets that you have found.

  • How many genes are there in each of the three data sets? HINT: do this for each of the three directories of files you have, ~/data/genbank, ~/data/refseq, ~/data/prodigal

    # go to the directory you want to be in, either ~/data/genbank, ~/data/refseq or ~/data/prodigal
    cd <directory>
    # count the number of genes
    grep -c ">" *.proteins.fsa
    Question: which data set contains the fewest and which one contains the most genes?
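The grep command above counts the lines that contain ">", and in a FASTA file each such line is the header of one sequence. The same count can be sketched in Python. This is an illustrative equivalent rather than part of the course scripts, and the toy data is made up. One small difference is that it only counts lines that start with ">", which gives the same result for well-formed FASTA files.

```python
def count_fasta_records(fasta_text):
    """Count the sequences in FASTA text by counting header lines."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))

# Made-up toy protein set with three entries.
toy = ">p1\nMKT\nLL\n>p2\nMA\n>p3\nMKKKKKKK\n"
print(count_fasta_records(toy))
```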

  • How long are the genes in each data set?

    # go to the directory you want to be in, either ~/data/genbank, ~/data/refseq or ~/data/prodigal
    cd <directory>
    # calculate the statistics for the gene sets in your directory:
    foreach i ( *.proteins.fsa )
        python ~karinl/scripts/utils/ $i
    end
    Question: can you see any differences between the data sets? Are genbank/refseq/prodigal proteins noticeably longer or shorter than the others? Do some genomes have noticeably longer or shorter genes than the others?
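If you want to see what a length calculation of this kind involves, the following Python sketch computes a few summary values from FASTA text. It is an assumption about what such a script might report and not the course script itself. The function names and the toy data are made up for the example.

```python
import statistics

def read_fasta_lengths(fasta_text):
    """Return the length of every sequence in FASTA text, in file order."""
    lengths = []
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths

def summarize(lengths):
    """Summarize a non-empty list of sequence lengths."""
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
    }

# Made-up toy protein set; the first sequence spans two lines.
toy = ">p1\nMKT\nLL\n>p2\nMA\n>p3\nMKKKKKKK\n"
print(summarize(read_fasta_lengths(toy)))
```

Comparing the mean and median between the genbank, refseq and prodigal sets is one simple way to approach the question above, since a few very long or very short proteins shift the mean more than the median.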

Genome Atlas

Many of the properties that we have shown you today can be presented in a genome atlas, a circular map of the genome. You have seen several of them in the lectures up till now.

We have prepared atlases for many genomes already.

Access the webpages with the prepared atlases. Select ONE of your genomes to examine. 

Things to look at:

BASE Atlas
  • Is your organism AT or GC rich? Does that correspond to what you found earlier?
  • Where is the replication origin in your organism? (HINT: look at the distribution of Gs and As).
  • Are there any regions in your genome that are more AT or GC rich than the rest of the genome?
  • Can you identify the leading and the lagging strand of your genome?
  • Which strand are the genes on? Are there any tendencies for the genes to be on either the leading or the lagging strand, or are they randomly distributed?
  • Can you find any regions that might be highly expressed? (HINT: very flexible regions.) Can you tell why such a region could be highly expressed? What happens if the region also melts easily: would this help or hinder expression?
  • Can you find any regions that can mutate easily? (HINT: AT rich regions that can melt easily.)
  • Can you find any regions that might be protected against mutation? (HINT: rigid regions that won't melt easily.)
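Two of the properties asked about above can also be calculated directly from a sequence. The Python sketch below computes the overall AT content and a windowed GC skew, (G - C) / (G + C). GC skew is a commonly used measure that often changes sign near the origin and terminus of replication in bacterial genomes. It is used here as an illustration and is not necessarily what the atlas lanes plot. The function names, window size and toy sequence are made up.

```python
def at_content(seq):
    """Fraction of A and T among the A, C, G and T bases in seq."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    total = at + seq.count("G") + seq.count("C")
    return at / total if total else 0.0

def gc_skew_windows(seq, window):
    """Return (G - C) / (G + C) for consecutive non-overlapping windows."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews

# Made-up toy sequence: a G-rich half followed by a C-rich half.
toy_genome = "GGGGAT" + "CCCCAT"
print(at_content(toy_genome))
print(gc_skew_windows(toy_genome, 6))
```

On a real genome you would use a much larger window, for example several thousand bases, and look for the positions where the skew changes sign.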

  • Can you find any globally repeated sequences, either direct or inverted, in this genome? How are these repeats located in relation to the genes in the genome?
  • Are there more local than global repeats, in your opinion?

  • How many rRNA genes does it have? Where are they located (close or far away from the origin of replication, randomly distributed or something else?).
  • Do any of the rRNAs have tRNAs in them?
  • Do the rRNA genes have any special features in the structural parameters? Are there repeats in this region, is the DNA especially flexible, or something else?