Exercise 2: Gene prediction and genome properties
IMPORTANT: You must have completed last week's exercises before you start on today's work.
IMPORTANT: Due to constraints on the servers, we would like to split the
class in two: groups 1 through 6 will start with predicting proteins,
while the remaining groups will start with examining atlases for their
genomes.
Some tools used today
Some of the tools that we are using today deserve a closer
explanation.
- Perl and Python: Perl and Python are programming languages that can be
used to make everything from large computer programs to small utility
programs. In this course we will often use Perl and Python scripts,
which are small programs, to calculate things and to convert files.
- gmake: gmake is a system that allows us to automate tasks. In many
cases doing something involves several steps, such as converting files,
calculating something and making a graphic file of it all. When using
gmake, you specify what kind of file you want to end up with, and the
computer figures out what steps need to be done to get to that point.
For instance, suppose you have a GenBank file named "a.gbk" and you only
want the FASTA sequence of the genome. To get it with gmake, you type
"gmake a.fsa", i.e., you replace the file ending with the ending of the
file you want to end up with.
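The idea of "name the file you want and let the computer work out the steps" can be sketched in a few lines of Python. This is only an illustration of the principle, not how gmake is implemented. The single rule in the `RULES` table is an assumption made up for this example; the real rules live in the course Makefile.

```python
import os

# Hypothetical conversion rules: target ending -> (source ending, description).
# Illustrative only; the actual rules are defined in the course Makefile.
RULES = {
    ".fsa": (".gbk", "extract the FASTA sequence from the GenBank file"),
}


def plan(target, existing_files):
    """Return the list of steps needed to produce `target`, make-style."""
    if target in existing_files:
        return []  # the file is already there, nothing to do
    root, ending = os.path.splitext(target)
    if ending not in RULES:
        raise ValueError(f"no rule to make {target}")
    source_ending, description = RULES[ending]
    source = root + source_ending
    # Make sure the source can be produced first, then add this step.
    return plan(source, existing_files) + [f"{source} -> {target}: {description}"]


# Asking for a.fsa when only a.gbk exists yields one conversion step.
print(plan("a.fsa", {"a.gbk"}))
```

If "a.gbk" were missing as well, the sketch would stop with an error, much as gmake complains when it has no rule to make a target.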
Gene prediction
Last week you downloaded two different versions of your genomes. In
both cases you extracted the proteins found within the genomes.
Today you will be predicting your own proteins using a gene predictor
called PRODIGAL. This predictor is also available for download through
the PRODIGAL website; however, here we will use the version of the program that is installed on the CBS computers.
After you have predicted your own genes, we will do some comparisons of
your protein datasets.
- Log in to the CBS computers
Do what you did last week in order to log in to the CBS computers.
Computer name: login.cbs.dtu.dk
User name: studXXX
The password is the same as was used last week.
After that you need to log into the computer
where we will do the exercises, which is named sbiology.
IMPORTANT: this step must be done every time you log into the CBS computers.
# log into CBS
ssh -Y sbiology
setenv MAKEFILES /home/people/pfh/bin/Makefile
umask 022
# create directory for containing the predicted proteins
mkdir ~/data/prodigal
cd ~/data/prodigal
# predicting the proteins
foreach i ( ../fsafiles/*.fasta )
    perl ~karinl/scripts/ga3/prodigal.pl -t 11 -fasta $i > $i:r.proteins.fsa
    mv $i:r.proteins.fsa .
end
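In the loop above, `$i:r` is the tcsh way of writing "the file name in `$i` without its last ending". The output name therefore still points into ../fsafiles, which is why the `mv` step is needed to bring the result into the current directory. A small Python sketch of the same renaming is shown below; the file name `genomeA.fasta` is a made-up example.

```python
import os


def output_name(fasta_path):
    """Mimic tcsh's $i:r followed by '.proteins.fsa'."""
    root, _ending = os.path.splitext(fasta_path)  # drop the last ending only
    return root + ".proteins.fsa"


# Hypothetical input file; note that the directory part is kept.
print(output_name("../fsafiles/genomeA.fasta"))
# ../fsafiles/genomeA.proteins.fsa
```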
Protein statistics
You will now do some basic statistics on the protein sets that you have
found.
- How many genes are there in each of the three data sets? HINT: do this for each of the three directories of files you have: ~/data/genbank, ~/data/refseq and ~/data/prodigal.
# go to the directory you want to be in, either ~/data/genbank, ~/data/refseq or ~/data/prodigal
cd <directory>
# count the number of genes (use plain double quotes around the >)
grep -c ">" *.proteins.fsa
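The count works because every sequence in a FASTA file starts with one header line beginning with ">". The same count can be sketched in Python as below. The sketch only counts lines that start with ">", which is slightly stricter than `grep -c ">"`, since grep counts any line containing the character. The example records are made up.

```python
def count_fasta_records(lines):
    """Count FASTA header lines, i.e. lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


# Made-up example with two records.
example = [">gene1 hypothetical protein", "MKTAYIAK", ">gene2", "MAL"]
print(count_fasta_records(example))  # 2

# For a real file, pass the open file handle instead of a list:
# with open("some_genome.proteins.fsa") as handle:
#     print(count_fasta_records(handle))
```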
Question: which data set contains the fewest and which one contains the most genes?
- How long are the genes in each data set?
# go to the directory you want to be in, either ~/data/genbank, ~/data/refseq or ~/data/prodigal
cd <directory>
# calculate the statistics for the gene sets in your directory:
foreach i ( *.proteins.fsa )
    python ~karinl/scripts/utils/GeneLength.py $i
end
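For those curious about what such a length calculation involves, here is a minimal sketch. It is not the GeneLength.py script used above, whose contents and output format are not shown here. It assumes plain FASTA input and reports lengths in the units of the file, which for the .proteins.fsa files means amino acids.

```python
import statistics


def read_fasta_lengths(lines):
    """Return the length of each sequence in a FASTA file, in file order."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def summarize(lengths):
    """Basic statistics for a non-empty list of sequence lengths."""
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
    }


# Made-up example: one sequence split over two lines, one short sequence.
example = [">p1", "MKTAY", "IAK", ">p2", "MA"]
print(summarize(read_fasta_lengths(example)))
```

Comparing the mean and the median between data sets gives a quick first impression; a few very long proteins pull the mean up much more than the median.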
Question: can you see any differences between the data sets? Are
genbank/refseq/prodigal proteins noticeably longer or shorter than the
others? Do some genomes have noticeably longer or shorter genes than the
others?
Genome Atlas
Many of the properties that we have shown you today can be presented in
a genome atlas, a circular map of the genome. You have seen several of
them in the lectures up till now.
We have prepared atlases for many genomes already.
Access the webpages with the prepared atlases. Select ONE of your genomes to examine.
Things to look at:
BASE Atlas
- Is your organism AT or GC rich? Does that correspond to what you found earlier?
- Where is the replication origin in your organism? (HINT: look at the distribution of Gs and As).
- Are there any regions in your genome that are more AT or GC rich than the rest of the genome?
- Can you identify the leading and the lagging strand of your genome?
- Which strand are the genes on? Are there any
tendencies for the genes to be on either the leading or the lagging
strand, or are they randomly distributed?
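The base-composition questions above can also be explored numerically. The sketch below computes the AT content of a sequence and the GC skew, (G - C) / (G + C), in consecutive windows. GC skew is one common way to quantify the strand bias in G content; using it here is an assumption for illustration, not a description of how the prepared atlases were calculated. The window size and the toy sequence are made up.

```python
def at_content(seq):
    """Fraction of A and T among the unambiguous bases in `seq`."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("A") + seq.count("T")) / counted


def gc_skew_windows(seq, window):
    """GC skew, (G - C) / (G + C), for consecutive non-overlapping windows."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


# Toy sequence: a G-rich half followed by a C-rich half.
toy = "GGGGAAAT" + "CCCCTTTA"
print(at_content(toy))          # 0.5
print(gc_skew_windows(toy, 8))  # [1.0, -1.0]
```

On a real genome one would use windows of several thousand bases. A change in the sign of the skew along the chromosome is the kind of signal to compare with the pattern seen in the atlas.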
STRUCTURE Atlas
- Can you find any regions that might be highly
expressed? (HINT: very flexible regions.) Can you tell why this region
could be highly expressed? What happens if this region also melts
easily: would this help or hinder expression?
- Can you find any regions that can mutate easily? (HINT: AT-rich regions that can melt easily.)
- Can you find any regions that might be protected against mutation? (HINT: rigid regions that won't melt easily.)
REPEAT Atlas
- Can you find any globally repeated sequences, either
direct or inverted, in this genome? How are these repeats located in
relation to the genes in the genome?
- Are there more local than global repeats, in your opinion?
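To make the difference between direct and inverted repeats concrete: a direct repeat is the same sequence occurring again on the same strand, while an inverted repeat is a sequence whose reverse complement occurs elsewhere. The sketch below finds exact repeats of a fixed length `k` in a short sequence. It is a simplified illustration and an assumption about the general idea only; the atlases are not claimed to be computed this way, and the toy sequence is made up.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence containing only A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def find_repeats(seq, k):
    """Return (direct, inverted) exact repeats of length k.

    direct:   k-mer -> start positions, for k-mers seen more than once
    inverted: k-mer -> (its positions, positions of its reverse complement)
    Each inverted pair is reported once; palindromic k-mers are skipped.
    """
    seq = seq.upper()
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    direct = {kmer: pos for kmer, pos in positions.items() if len(pos) > 1}
    inverted = {}
    for kmer, pos in positions.items():
        rc = reverse_complement(kmer)
        if rc in positions and kmer < rc:
            inverted[kmer] = (pos, positions[rc])
    return direct, inverted


# Toy sequence: AAACCC occurs twice, and its reverse complement GGGTTT once.
toy = "AAACCC" + "T" + "AAACCC" + "T" + "GGGTTT"
direct, inverted = find_repeats(toy, 6)
print(direct)
print(inverted)
```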
GENOME Atlas
- How many rRNA genes does it have? Where are they
located (close to or far away from the origin of replication, randomly
distributed, or something else)?
- Do any of the rRNAs have tRNAs in them?
- Do the rRNA genes have any special features in the
structural parameters? Are there repeats in this region, is the DNA
especially flexible, or something else?