Exercise 1: Introduction
Accessing data
Goal: get proteins from various databases plus
what to do when there are none.
Today's Web sites
Introduction to the computer system: some
technicalities
The exercises will take place on a UNIX computer system, since this is
by far the most efficient way to handle large amounts of data. We are
aware that this interface may not seem intuitive to most biologists,
however it is our goal that you become familiar with such a system but
there are no requirements for you being experienced users! The content
of the blue boxes within all the exercises represent UNIX commands
which you shall execute from within your shell. A shell is a small
program that lets you execute commands. In most cases you will not need
to type the text in yourself, you can just copy-paste it. Do not copy
the text with the # in from of it. However, in some cases there is text
marked<like this>
which you will need to replace with your own options. We strongly
recommend that you take one line at a time, and try to figure our what
goes on. Bottom line: Don't panic!
Please be sure you have read the
pedestrians guide to unix, before starting the exercises. In
the grey boxes below, we will now briefly go through some of the things
that you will see repeatedly in the exercises. You do not have to
execute the commands - just read and understand.
-
What does '|', '>', and '<'
mean?
Both of the redirects '>' and '<'
are very common and you should understand them. The '>' simply
means that whatever data is generated by a program is redirected
(written) to a file. Example:
ls > myfile.txt
The program ls simply lists the content of the
directory and this output is here written to the file myfile.txt. The
other redirect '<' is used to redirect files into program.
Example:
wc < myfile.txt
Here, we redirect the file just created into the
program wc (word count) which simply counts the number of lines, words
and bytes in the input.
Finally, pipe ('|') is just as important. Instead
of writing data to a file (or reading data from a file) the pipe sends
output to another program. Let's do the same as above in just one line:
ls | wc
This simply means that we execute ls to list the
files in the directory, and whatever output this creates gets send
directly into program wc.
The advantages of the latter is clear: you don't
have to access the filesystem, which in many cases is relatively slow.
Also, ls and wc will actually run in parallel on their own CPU, and wc
will read data in chunks whenever generated by ls. Admitted, these
things will only affect the performance of larger problems - but in
bioinformatics, you often deal with large problems.
The program less is a fast and convenient way to
show the content of files. The following keyboard shortcuts apply to
the program:
SPACE : one screenful forward b : one screenful backward RETURN : one line forward k : one line backward g : top (beginning) of file G : bottom (end) of file q : leave the session
Less is used in the following way:
less /usr/cbs/examples/gene.gb
-
How should I treat the indentions?
The above was examples of very very simple
commands. You will see, that commands often grow long and ugly, and
will barely fit your screen! For matter of appearance, we have
truncated the lines by adding a backslash ('\') at the and of the line.
This instructs the shell to wait executing the command, even if you
press return. This will allow us to break a long line into many.
Example: The command below will get the genbank accession U00096 and
extract all translations and count the lines/words/bytes of the output.
The example below:
getgene U00096 | saco_extract -I genbank -O tab -t | wc
... is exactly the
same as this:
getgene U00096 | saco_extract \ -I genbank -O tab -t | \ wc
When you do this yourself, remove the \ and just glue it together on
one line.
What does & mean?
The ampersand (&) is put at the end of a
command if you want to run the command in the background. This means
that the shell will be available for new input immediately after. The
process then runs in the background.
What does umask 022 mean?
This ensures that the files that you create
are can be read by other people.
umask 022
When you see the command umask this is to inform the shell, that
whenever you create new files, they will be readable to other people.
This is neccesary, since the results you are creating will be used by
other groups, and since we monitor your progress ;-)
A note on data hygiene and reproducibility
As this course progresses, you will have several directories with
several files in them, and you will have done a lot to your data. You
cannot expect to be able to remember what you did and what you named
your files as you progress in the course. Therefore we
strongly recommend that you take notes along the way, so
that you know what data you used to make which figures.
You will also be creating several figures, and working with many
different types of data files. In the exercises we will be assigning
traceable names to the figures you create using the examples we show.
However, when you make your won figures, we strongly recommend that you
give files names that will enable you later to figure out why and when
you created just that file, and what the input for it was. We also
recommend that files of the same type, for instance fasta files, gbk
files and protein files have the same tag in the name (fasta files end
in fsa, genbank files end in gbk, files containing proteins in fasta
format being named proteins.fsa). Knowing what, why and when a file or
a figure was created will help you a lot when you go on to write your
paper.
Select your organisms
Go to this webpage:http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi
Find the genomes that you have been given. Copy
the information you find for your genomes into a text file. Having
this, will let you evaluate results better.
The GPID is the number that is used to uniquely
identify a genome at NCBI. With this, you can access all known
information for that genome, also if it has several chromosomes or
contains plasmids. Click on one of your projects, and look around to
see what information you can find. Gather this together with the
information from the genome table.
Genome databases - GENBANK and REFSEQ
At NCBI, there are two databases which contain sequence data
– GENBANK and REFSEQ. Both the two databases store genomic
data in a set format, the Genbank format, however, there are
differences between the two. Genbank records are created by individual
researchers around the world, and are submitted as-is to the NCBI data
repository. No curation of these files are done by the NCBI. Genbank
files are an example of a primary data source.
The Reference Sequence (RefSeq) database is a also
a database containing Genbank-formatted files. RefSeq is a
non-redundant collection of richly annotated DNA, RNA, and protein
sequences from diverse taxa. The collection includes sequences from
plasmids, organelles, viruses, archaea, bacteria, and eukaryotes. Each
RefSeq represents a single, naturally occurring molecule from one
organism. The goal is to provide a comprehensive, standard dataset that
represents sequence information for a species. It should be noted,
though, that RefSeq has been built using data from public archival
databases only.
RefSeq biological sequences (also known as
RefSeqs) are derived from the primary GenBank records but differ in
that each RefSeq is a synthesis of information, not an archived unit of
primary research data. Similar to a review article in the literature, a
RefSeq represents the consolidation of information by a particular
group at a particular time. This means that a genome can have several
sequences, one from RefSeq, and one or more from Genbank. This means
that there can be differences between the two databases, and you might
be able to see examples of this in your data files.
Downloading genomes and creating protein fasta
files
-
Log in to the CBS computers
Find a program which will let you log into the CBS computers.
Computer name: login.cbs.dtu.dk
User name: studXXX
You will get your password from the
teachers.
The
password should not, under any circumstances, be passed to somebody
else, electronically, written, or orally. If you do, you will lose
access to the computers.
After that you need to log into the computer
where we will do the exercises, which is named ibiology.
# log into CBS ssh -Y ibiology setenv MAKEFILES /home/people/pfh/bin/Makefile umask 022
- How
to do many things at once.
In unix there is a a command structure which will let you perform one
command repeatedly on something that varies from each time, without
having to type in each one repeatedly. This is the foreach structure,
and shown below is the structure of the foreach command:
foreach <variable> ( <list of things that the command should be performed on> ) the command you will perform on $i end
Note: when you see something in red,
you are supposed to replace it with something else.
-
Download
your genomes to the computers at CBS.
Use the foreach command structure to download your sets of genomes onto
the computers at CBS.
Examples for how this looks for downloading some Campylobacter genomes:
# downloading all the genbank files for these file project ids cd genbank foreach i (17159 17161 16293 20083 303) ~pfh/scripts/fetchgbk/fetchgbk -d genbank -p $i > $i.genbank.gbk end
# downloading all the refseq files for these file project ids cd refseq foreach i (17159 17161 16293 20083 303) ~pfh/scripts/fetchgbk/fetchgbk -d refseq -p $i > $i.refseq.gbk end
-
Examine
the files in your directory.
There are now several files within your two
directories. Look at what files there are, and have a look at one of
them. Try to make some sense of what a genbank formatted file contains.
# Examine files you have in your system ls # Look at one file using less less <filename>.gbk
- Renaming
your files.
You will rename your input files. This is done
so that you will know which file contains information for which
organism.
# First, go to the genbank directory cd ~/data/genbank # Rename the files foreach i (*.gbk) ~karinl/scripts/utils/extractname.sh $i end # look at the files in the directory now ls
# Then, go to the refseq directory cd ~/data/refseq # Rename the files there foreach i (*.gbk) ~karinl/scripts/utils/extractname.sh $i end
- Extract
protein sequences.
Both genbank
files and refseq files contain the proteins that are found in the
genome. However, these are somewhat inaccessible within this format. It
is much more useful to have them available as fasta files. There are
several web servers out there which will extract protein sequences from
genbank files, but these are tedious to use for multiple sequences. The
foreach structure can combined with conversion programs be used to
extract the proteins.
# First, go to the genbank directory cd ~/data/genbank # Make protein files: foreach i ([A-Z]*.gbk) make $i:r.proteins.fsa end
# Then, go to the refseq directory cd ~/data/refseq # Make protein files: foreach i ([A-Z]*.gbk) make $i:r.proteins.fsa end
As mentioned
there might be a different number of genes that are listed in the
genbank file and the refseq file. The files ending in proteins.fsa in
your directory now contains these proteins. They are in fasta format,
which means a > followed by a description line, a newline
character (return) and then the sequence. By counting the number of
greater than signs (>) the number of genes in that file is found.
# count genes genbank files cd ~/data/genbank grep ">" [A-Z]*.proteins.fsa |wc -l
# count genes in refseq files cd ~/data/refseq grep “>” [A-Z]*.proteins.fsa |wc -l
Examine: are
there any differences? Are there more or less in any of the files? How
does this compare with the number of genes listed on NCBI's genome
information pages?
Gene prediction
You have downloaded two different versions of your
genomes. In both cases you extracted the proteins found within the
genomes.
Now you will be predicting your own proteins using a gene predictor
called PRODIGAL. This predictor is also available for download through the PRODIGAL website,
however, we will here be using the version of the program that is
installed at the CBS computers.
After you have predicted your own genes, we will do some comparisons of
your protein datasets.
- Extract the
DNA sequence from your genbank files
# go to your genbank directory cd ~/data/genbank # get the fasta sequences you need foreach i ( [A-Z]*.gbk ) make $i:r.fsa mv $i:r.fsa $i:r.fasta end
#move these files into a new directory cd ~/data mkdir fsafiles cd fsafiles mv ~/data/genbank/*.fasta ~/data/fsafiles # look at the files you have here now - remember the commands for less from last time! ls less <fastafile>.fasta
Question: can you tell the difference between
genbank files and fasta files?
- Predict
proteins from your DNA sequences
Predicting
proteins can take a bit of time. For each genome, you should get a line
saying something with "ACTIVATED" and a line saying "FINISHED"
# create directory for containing the predicted proteins mkdir ~/data/prodigal cd ~/data/prodigal # predicting the proteins foreach i ( ../fsafiles/*.fasta ) perl ~karinl/scripts/ga3/prodigal.pl -t 11 -fasta $i > $i:r.orfs.fsa cat $i:r.orfs.fsa | 6pack -1 > $i:r.proteins.fsa mv $i:r.proteins.fsa . end
Protein statistics
You will now do some basic statistic on the
protein sets that you have found.
- How many genes are there in each of the three
data sets? HINT: do this for each of the three directories of files you
have, ~/data/genbank, ~/data/refseq, ~/data/prodigal
# go to the directory you want to be in, either ~/data/genbank, ~/data/refseq or ~/data/prodigal cd <directory> # count the number of genes grep “>” *.proteins.fsa |wc -l
Question: which data set contains the fewest
and which one contains the most genes?
- How long are the genes in each data set?
# go to the directory you want to be in, either ~/data/genbank, ~/data/refseq or ~/data/prodigal cd <directory> # calculate the statistics for the gene sets in your directory: foreach i ( *.proteins.fsa) python ~karinl/scripts/utils/GeneLength.py $i end
Question: can
you see any differences between the data sets? Are
genbank/refseq/prodigal proteins noticably longer or shorter than the
others? Do some genomes have noticably longer or shorter genes than the
others?
|