Exercise: Translation and protein-databases
Exercise written by: Rasmus Wernersson, Henrik Nielsen and Morten Nielsen
QUESTIONS: This exercise contains a number of brief questions that should be addressed in your report.
Please refer to the bullet-point numbers when answering the questions.
Translation - Virtual Ribosome
In this part of the exercise, we shall use
Virtual Ribosome, a
software tool that provides a series of functions for translating DNA into protein sequences.
Besides using the simple functions to translate DNA in a known
reading frame, we shall work on computer-based analysis of possible reading frames, the location of START and STOP codons, etc.
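As a rough illustration of what a translation tool does when the reading frame is known, the following Python sketch translates a DNA string codon by codon using the standard genetic code. The names `CODON_TABLE` and `translate` are made up for this example and are not part of Virtual Ribosome. Rendering unknown codons (for instance those containing N) as X is an assumption of the sketch.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1): one letter per codon,
# with the 64 codons enumerated in TCAG order (first base changes slowest).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=1):
    """Translate dna in forward reading frame 1, 2 or 3; '*' marks a STOP codon."""
    # Remove whitespace, normalise case and accept RNA input (U instead of T).
    seq = "".join(dna.split()).upper().replace("U", "T")
    start = frame - 1
    codons = (seq[i:i + 3] for i in range(start, len(seq) - 2, 3))
    # Codons that are not in the table (e.g. containing N) become 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


# First 27 nucleotides of the Yeast_ACT1 sequence used in Step 1.
print(translate("ATGGATTCTGAGGTTGCTGCTTTGGTT"))  # prints MDSEVAALV
```

The sketch only produces the protein string. It does not mark START codons or lay out the DNA and protein in parallel the way the web server does.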
Step 1
- Open Virtual Ribosome (in a new window): www.cbs.dtu.dk/services/VirtualRibosome/.
Spend a few minutes to get familiar with the website - where do you upload
the input data, and what types of options are available.
- If you only have one sequence, it can be pasted directly into the input
window. Alternatively, Virtual Ribosome can handle a number of different input
formats that allow for multiple sequences (e.g. FASTA).
- Let's first do a simple example and translate a known gene,
Actin (from yeast). Copy the sequence below into the sequence field and press "submit".
>Yeast_ACT1 ATGGATTCTGAGGTTGCTGCTTTGGTTATTGATAACGGTTCTGGTATGTGTAAAGCCGGT TTTGCCGGTGACGACGCTCCTCGTGCTGTCTTCCCATCTATCGTCGGTAGACCAAGACAC CAAGGTATCATGGTCGGTATGGGTCAAAAAGACTCCTACGTTGGTGATGAAGCTCAATCC AAGAGAGGTATCTTGACTTTACGTTACCCAATTGAACACGGTATTGTCACCAACTGGGAC GATATGGAAAAGATCTGGCATCATACCTTCTACAACGAATTGAGAGTTGCCCCAGAAGAA CACCCTGTTCTTTTGACTGAAGCTCCAATGAACCCTAAATCAAACAGAGAAAAGATGACT CAAATTATGTTTGAAACTTTCAACGTTCCAGCCTTCTACGTTTCCATCCAAGCCGTTTTG TCCTTGTACTCTTCCGGTAGAACTACTGGTATTGTTTTGGATTCCGGTGATGGTGTTACT CACGTCGTTCCAATTTACGCTGGTTTCTCTCTACCTCACGCCATTTTGAGAATCGATTTG GCCGGTAGAGATTTGACTGACTACTTGATGAAGATCTTGAGTGAACGTGGTTACTCTTTC TCCACCACTGCTGAAAGAGAAATTGTCCGTGACATCAAGGAAAAACTATGTTACGTCGCC TTGGACTTCGAACAAGAAATGCAAACCGCTGCTCAATCTTCTTCAATTGAAAAATCCTAC GAACTTCCAGATGGTCAAGTCATCACTATTGGTAACGAAAGATTCAGAGCCCCAGAAGCT TTGTTCCATCCTTCTGTTTTGGGTTTGGAATCTGCCGGTATTGACCAAACTACTTACAAC TCCATCATGAAGTGTGATGTCGATGTCCGTAAGGAATTATACGGTAACATCGTTATGTCC GGTGGTACCACCATGTTCCCAGGTATTGCCGAAAGAATGCAAAAGGAAATCACCGCTTTG GCTCCATCTTCCATGAAGGTCAAGATCATTGCTCCTCCAGAAAGAAAGTACTCCGTCTGG ATTGGTGGTTCTATCTTGGCTTCTTTGACTACCTTCCAACAAATGTGGATCTCAAAACAA GAATACGACGAAAGTGGTCCATCTATCGTTCACCACAAGTGTTTCTAA
- Look at the result. Note that the output shows both the DNA and protein sequences as well as information
on START and STOP codons. You can click on
"instructions" on both the
main page and the results page for details on what is displayed. Note also that the "raw" protein
sequence can be downloaded in FASTA format.
- Now please answer the following questions
- How is a STOP codon displayed?
- How is a START codon displayed?
- Does a start codon always code for Methionine (M)?
- What is the difference between the two types of start codons?
Step 2: Genetic code
- We are now going to work with yet another gene from yeast. This time it is
COX1, which codes for Cytochrome
C OXidase, subunit 1 (for more information, see: COX1 -
Saccharomyces Genome Database). Note that it is a mitochondrial gene.
Translate this gene using default settings.
>Yeast_COX1 ATGGTACAAAGATGATTATATTCAACAAATGCAAAAGATATTGCAGTATTATATTTTATG TTAGCTATTTTTAGTGGTATGGCAGGAACAGCAATGTCTTTAATCATTAGATTAGAATTA GCTGCACCTGGTTCACAATATTTACATGGTAATTCACAATTATTTAATGTTTTAGTAGTT GGTCATGCTGTATTAATGATTTTCTTCTTAGTAATGCCTGCTTTAATTGGAGGTTTTGGT AACTATTTATTACCATTAATAATTGGAGCTACAGATACAGCATTTCCAAGAATTAATAAC ATTGCTTTTTGAGTATTACCTATGGGGTTAGTATGTTTAGTTACATCAACTTTAGTAGAA TCAGGTGCTGGTACAGGGTGAACTGTCTATCCACCATTATCATCTATTCAGGCACATTCA GGACCTAGTGTAGATTTAGCAATTTTTGCATTACATTTAACATCAATTTCATCATTATTA GGTGCTATTAATTTCATTGTAACAACATTAAATATGAGAACAAATGGTATGACAATGCAT AAATTACCATTATTTGTATGATCAATTTTCATTACAGCGTTCTTATTATTATTATCATTA CCTGTATTATCTGCTGGTATTACAATGTTATTATTAGATAGAAACTTCAATACTTCATTC TTTGAAGTATCAGGAGGTGGTGACCCAATCTTATACGAGCATTTATTTTGATTCTTTGGT CACCCTGAAGTATATATTTTAATTATTCCTGGATTTGGTATTATTTCACATGTAGTATCA ACATATTCTAAAAAACCTGTATTTGGTGAAATTTCAATGGTATATGCTATGGCTTCAATT GGATTATTAGGATTCTTAGTATGATCACATCATATGTATATTGTAGGATTAGATGCAGAT CTTAGAGCATATTTCCTATCTGCACTAATGATTATTGCAATTCCAACAGGAATTAAAATT TTCTCATGATTAGCTCTAATCCATGGTGGTTCAATTAGATTAGCACTACCTATGTTATAT GCAATTGCATTCTTATTCTTATTCACAATGGGTGGTTTAACTGGTGTTGCCTTAGCTAAC GCCTCATTAGATGTAGCATTCCACGATACTTACTACGTGGTGGGACATTTTCACTATGTA TTATCAATGGGTGCTATTTTCTCTTTATTTGCAGGATACTATTATTGAAGTCCTCAAATT TTAGGTTTAAACTATAATGAAAAATTAGCTCAAATTCAATTCTGATTAATTTTCATTGGG GCTAATGTTATTTTCTTCCCAATGCATTTTTTAGGTATTAATGGTATGCCTAGAAGAATT CCTGATTATCCTGATGCTTTCGCAGGATGAAATTATGTCGCTTCTATTGGTTCATTCATT GCACTATTATCATTATTCTTATTTATCTATATTTTATATGATCAATTAGTTAATGGATTA AACAATAAAGTTAATAATAAATCAGTTATTTATAATAAAGCACCTGATTTTGTAGAATCT AATCTTATCTTTAATTTAAATACAGTTAAATCTTCATCTATCGAATTCTTATTAACTTCT CCACCAGCTGTACACTCATTTAATACACCAGCTGTACAATCTTAA
- How well did the translation go? Nothing is wrong with the DNA sequence. Can you
come up with some good reasons for the result?
- Keep the result of the translation in a window (we need it again in a while), and open
a new window with Virtual Ribosome. Translate the DNA sequence once more using a different
translation table (see options). Make an educated guess on what table to select.
- If you have chosen the right translation table, the DNA sequence can be
translated without any problems. Compare the two results and answer the following questions:
- What is the difference in the use of STOP codons?
- What is the difference in the use of START codons?
- Are some codons coding for completely new amino acids?
- More information on the definition of the different translation tables is found here: The
Genetic Codes - NCBI. The tables are shown in a "compressed" format, but can be shown in a more
comprehensible format by using the "Click here to change
format" option. Note:
- The use of START codons is described in detail for all genetic codes.
- The difference between the standard-code and other codes is summarized in each
section.
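In the compressed format, each table is written as a 64-letter string with one amino-acid letter per codon, the codons being enumerated in TCAG order at each position. The sketch below shows one way such strings could be expanded and compared. It uses the standard code (table 1) and, purely as an example, the vertebrate mitochondrial code (table 2). The function names `expand` and `differences` are illustrative only.

```python
from itertools import product

# "AAs" lines of two translation tables, written as four blocks of 16 codons
# (first base T, C, A, G). '*' denotes a STOP codon.
STANDARD = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
VERTEBRATE_MITO = (
    "FFLLSSSSYY**CCWW"
    "LLLLPPPPHHQQRRRR"
    "IIMMTTTTNNKKSS**"
    "VVVVAAAADDEEGGGG"
)


def expand(aas):
    """Turn a 64-letter compressed table into a codon -> amino acid dict."""
    codons = ("".join(c) for c in product("TCAG", repeat=3))
    return dict(zip(codons, aas))


def differences(aas_a, aas_b):
    """Return {codon: (meaning in a, meaning in b)} for codons that differ."""
    a, b = expand(aas_a), expand(aas_b)
    return {codon: (a[codon], b[codon]) for codon in a if a[codon] != b[codon]}


for codon, (std, alt) in sorted(differences(STANDARD, VERTEBRATE_MITO).items()):
    print(codon, std, "->", alt)
```

The same comparison can be repeated for any other table by pasting in its "AAs" line from the NCBI page.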
Step 3: Reading frames
Remember to reset all options (in particular make sure that you now use the standard genetic code) before
continuing the exercise.
- Up to now, we have assumed that the reading frame of the DNA sequence was known and that
it always started at the first nucleotide. In the following, we shall examine how it is often
possible to identify the most likely reading frame using computational translation tools. We shall use
the sequence below, which is the complete mRNA sequence of a yeast gene (profilin).
Use your biological knowledge to answer the following questions:
- Yeast has introns in some genes. Could this be a major problem in this case?
- Can a mature eukaryotic mRNA molecule contain more sequence than the coding region of the gene in question?
>gi|4226|emb|Y00469.1| Yeast mRNA for profilin GGCAAATTATGTCTTGGCAAGCATACACTGATAACTTAATAGGAACCGGTAAAGTCGACAAAGCTGTCAT CTACTCGAGAGCAGGTGACGCTGTTTGGGCTACTTCTGGTGGCCTATCTTTGCAACCAAACGAAATTGGT GAAATTGTTCAAGGCTTCGACAATCCAGCTGGTTTGCAAAGCAATGGTTTGCATATTCAAGGCCAAAAGT TCATGTTGTTGAGAGCTGACGATAGAAGTATCTACGGTAGACATGATGCTGAGGGTGTTGTTTGTGTAAG AACTAAGCAAACCGTTATTATTGCTCATTATCCACCAACCGTACAAGCCGGTGAGGCCACCAAGATTGTC GAGCAATTGGCTGACTACTTGATTGGTGTTCAATACTAATTTATGCAGGTAAAGTTTTCTTGCCTTATAC ACCACCTATTCTGGCATCTGCGGGATTTCGCTTCCTATTTTACAAATATTTTATTGATTGACGCTAATTA TCACTGTAAAAGGCGCACTTTTTATATGTAGTCACATCCGGTATTTAACATATTTACGAAACAGTCTTAA GAATATCGACATTTGATATACTTATGTTTAATTTATCTACATATTACAATCA
- Six reading frames exist: 1, 2, 3 (on the positive strand, i.e. the sequence as you read it), and
-1, -2, -3 (on the negative strand, i.e. the complementary DNA strand). Since we are working with an
mRNA sequence, we do not need to consider the reading frames on the complementary strand.
- Translate the mRNA sequence in the three positive reading frames (1, 2, 3). The easiest way to do this is to use
a separate window for each translation, so that you can compare the different results.
- Which reading frame is most likely the right one?
- NB: remember that START and STOP codons are only shown for the selected reading frame.
- Note also that the DNA sequence is shown identically in all three reading frames, whereas the protein sequence is shifted. Why is
this?
- It is possible to show multiple reading frames simultaneously. Select Plus
(1,2,3) as the reading frame, and translate the sequence again.
- Note that the amino acid letter is centered above each codon (that is, M is placed over the "T" in ATG).
- The translation from reading frame 1 is shown just above the DNA sequence, followed by reading frames 2 and 3.
- START and STOP codons for all
three reading frames are shown at once.
- For the sake of illustration, we shall try to translate the sequence on the negative strand. Select reading frame -1, and
redo the translation.
- How does the DNA sequence look? In what direction should it be read?
- In what direction should the protein sequence be read? Try to compare it to the protein sequence in FASTA format.
- Now, let's try to do it all in one go. Select All (6 reading frames) and translate the sequence again.
- How many DNA strands are displayed? Why is this?
- Note the large number of possibilities a single DNA sequence contains with respect to translation to protein sequence.
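The six-frame translation described in this step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of how Virtual Ribosome is implemented: the names `reverse_complement` and `six_frames` are made up, the standard genetic code is used, and frames -1, -2 and -3 are counted from the 5' end of the reverse-complemented sequence, which may differ from the numbering convention of a given tool.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the complementary strand, written 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate_from(seq, offset):
    """Translate seq starting at the given 0-based offset; 'X' for unknown codons."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(offset, len(seq) - 2, 3)
    )


def six_frames(dna):
    """Return a dict mapping frame number (1, 2, 3, -1, -2, -3) to protein string."""
    plus = "".join(dna.split()).upper()
    minus = reverse_complement(plus)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate_from(plus, offset)
        frames[-(offset + 1)] = translate_from(minus, offset)
    return frames


for frame, protein in six_frames("ATGAAATAG").items():
    print(frame, protein)
```

Each of the six strings is a different protein candidate, which is why a criterion such as ORF length is needed to pick the most plausible one.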
Step 4: ORF finder
- We have now made a manual screening for possible reading frames. Such a procedure might work fine if you have
only one DNA sequence, but this is in general not the case, and often you need to use computer-based ORF
finders. An ORF (Open Reading Frame) is a DNA sequence that is not interrupted by a STOP
codon. Often one will be looking for the longest ORF starting with a START codon and ending at a STOP codon.
- The longest ORF is found by translating the sequence in all six reading frames, and then selecting
the longest protein sequence.
- We shall now use the built-in ORF finder with the most stringent criteria. Under ORF finder, select Start codon:
strict (this forces the ORF to start at ATG), select "All (6
reading frames)" and translate the sequence again.
- Does the result agree with what you found earlier?
- Would it make any difference to the result if we had only a partial sequence where
the last part of the sequence with the STOP codon is missing?
- What would happen if the first 50 nucleotides (with the START codon) were missing?
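The strict ORF search described in this step (start at ATG, end at a STOP codon, consider all six reading frames, keep the longest) could be sketched as follows. The function `longest_orf` is a hypothetical helper written for this illustration. It uses the standard genetic code, counts negative frames from the 5' end of the reverse complement, and ignores candidates that lack either the START or the STOP codon. Virtual Ribosome's own ORF finder may treat such cases differently.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def longest_orf(dna):
    """Return (frame, protein) for the longest ATG...STOP ORF, or None if there is none."""
    plus = "".join(dna.split()).upper()
    minus = plus.translate(COMPLEMENT)[::-1]
    best = None
    for sign, seq in ((1, plus), (-1, minus)):
        for offset in range(3):
            start = None  # position of the ATG that opened the current ORF
            for i in range(offset, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None:
                    if codon == "ATG":
                        start = i
                elif CODON_TABLE.get(codon) == "*":
                    # Translate from the ATG up to (not including) the STOP codon.
                    protein = "".join(
                        CODON_TABLE.get(seq[j:j + 3], "X")
                        for j in range(start, i, 3)
                    )
                    if best is None or len(protein) > len(best[1]):
                        best = (sign * (offset + 1), protein)
                    start = None
    return best


print(longest_orf("CCATGAAATTTTAGGG"))  # prints (3, 'MKF')
```

On ties, the sketch keeps the first ORF it encounters. A real tool would normally also report the coordinates of the ORF.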
Protein-databases
In this part of the exercise, we shall extract information from the protein database UniProt. This
database is maintained in collaboration between
Swiss Institute of Bioinformatics (SIB),
European Bioinformatics Institute (EBI), and
Georgetown University.
UniProt, http://www.uniprot.org/,
consists of three parts:
- UniProt Knowledge-base (UniProtKB)
protein sequences with annotation and references
- UniProt Reference Clusters (UniRef)
homology-reduced database, where similar sequences are merged into single entries
- UniProt Archive (UniParc)
an archive containing all versions of UniProt sequences, without annotations
Of these databases, UniProtKB is the most useful, and this is the database we shall be using today. UniProtKB consists
of two parts:
- UniProtKB/Swiss-Prot
a manually annotated protein-database.
- UniProtKB/TrEMBL
a computer-annotated supplement to Swiss-Prot, that contains all translations of EMBL nucleotide sequences
not yet included in Swiss-Prot.
Here, we shall concentrate on the Swiss-Prot database
http://www.uniprot.org/.
Simple text mining
First, we shall find some Swiss-Prot entries using simple text mining. You shall find entries for human insulin.
Note that the syntax for UniProt searches is different from the one you used when searching GenBank.
Open the UniProt home-page http://www.uniprot.org/
Type "human insulin" in the search field at the top of the page. Leave the search menu on "Protein
Knowledge-base (UniProtKB)", which is the default. How many hits do you find?
How many hits are from Swiss-Prot? (Tip: click on "Show only reviewed".)
Can you identify the correct hit?
If you do not identify the correct hit immediately, it often helps to narrow down the search. We can do this by
searching for proteins that actually come from human and have a name containing the word insulin, as opposed to
just containing the words human and insulin somewhere in the description. This can be done very easily:
At 'Restrict term "human" to', click on "organism".
How many hits are now left (still only in Swiss-Prot)?
At 'Restrict term "insulin" to', click
on "protein name". How many hits are now left (still only in Swiss-Prot)?
Note that all selections made with the mouse are shown in text format in the Query box at the top of the page.
It is possible to edit the search criteria in this box to make them broader or narrower. Try for instance to exclude
proteins that are not insulin, but only insulin-like. You do this by adding the following text in the Query box:
NOT name:insulin-like and clicking on the search button. Note also that this
syntax is different from the one used when searching GenBank.
How many hits are now left?
Try now to exclude proteins that are insulin-receptors or substrates for insulin-receptors.
How many hits are now left?
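Because every mouse selection ends up as plain text in the Query box, a query can also be put together programmatically. The sketch below only builds the text and does not contact UniProt. The `name` field and the NOT operator are taken from the example above, whereas the `organism` field name and the helper `build_query` are assumptions made for illustration, so the result should be checked against what the Query box actually shows.

```python
def build_query(include, exclude=()):
    """Join (field, value) pairs with AND and append a NOT clause per exclusion."""
    query = " AND ".join(f"{field}:{value}" for field, value in include)
    for field, value in exclude:
        query += f" NOT {field}:{value}"
    return query


# Field names are assumptions (see above); only 'name' appears in the exercise text.
print(build_query(
    [("organism", "human"), ("name", "insulin")],
    exclude=[("name", "insulin-like")],
))
# prints organism:human AND name:insulin NOT name:insulin-like
```

Writing the restrictions as a list makes it easy to add or remove one term at a time and see how the number of hits changes.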
The content of Swiss-Prot
We shall now see what information is contained in a Swiss-Prot entry, and what further information is available as links
in each entry.
Click on the accession number for insulin (the blue code in the field "Accession"). This will take you to the
insulin entry in the Swiss-Prot database. Spend some time getting an overview of the page and what information it
contains.
Scroll down to the references - how many are there? (Insulin is a highly investigated protein.) Note what each
reference has contributed ("Cited for").
You can get to the PubMed literature database at NCBI by clicking on the link
"PubMed" for a reference - try this. The abstract of a publication can be read there (or directly at UniProt using the
"Abstract" link), if the work is an actual published article and not a "direct submission".
Read the
"General annotation (Comments)". Here you find some of the general functional and structural annotation of the protein
(the rest is placed in "Features"). One of the most important types of comments is naturally "Function".
Another type of comment is "Subcellular location" - where do you find insulin? Why is it found here?
Scroll down to "Sequence annotation (Features)". Note the following:
- Insulin has both a signal peptide and a propeptide. Both are cleaved off before secretion. The mature insulin
(the A and B chains) is hence much smaller than what was shown under
"Sequence information".
- Secondary structure is specified as "HELIX"
(alpha-helix), "STRAND" (part of a beta-pleated sheet) or "TURN".
Try clicking on "Details...".
- Some variants (mutations) of insulin have been described. In some cases it is known which phenotype (variant of
diabetes) is associated with each variant.
To look at the three-dimensional structure of a protein you must go to yet
another database, RCSB
PDB, under "3D structure databases". We will be working with 3D structures on Wednesday, but
let's just have a quick look here as well. As you can see, the 3D structure of insulin has been
determined several times. Select one such structure marked "X-ray"
under "Method" and click on the Entry link. Besides a lot of information on the molecule and the
experimental procedure used to solve the structure, the page also contains a nice picture of the insulin
molecule.
Under "Family and domain databases" you find a long list of databases that, using
different techniques, have collected proteins that are similar (protein families).
In some cases, the proteins are similar only in smaller parts (domains) but not in other parts, and
in some cases the databases can tell which parts of the actual protein are known in other species. Some large proteins
can contain several different parts (domains), each with its own evolutionary history.
The most important of these databases is InterPro, because it
collects the information from most of the other databases. Try to click on the InterPro link. This will take you to the
InterPro page with information about the insulin family, and the long reference list.
Advanced search
UniProt's interface allows you to search on most of the fields in the database: not only fields like name and
organism, as we did previously, but also the functional and structural annotations. We shall now try a few of these.
Go to the UniProt website http://www.uniprot.org/.
Click on "Fields" on the left side of the search field.
Now we shall find out how many proteins are secreted from the cell just like insulin. Select
"Subcellular location" in the menu "Field". Next, type "secreted" in the field "Term" and click "Add & Search".
How many proteins do you find?
Combining fields:
How many secreted proteins are found in humans? Click on "Fields" again, leave the menu to the left
on "AND", select "Organism [OS]" under
"Field", type "human" in the field "Term" and click "Add
& Search". How many proteins do you find now? (Note again how you can perform the search by editing the
text in the Query box; however, to do this you need to know the names of the fields.)
Numerical field: What extremely short proteins are
present in UniProt?
Clear the previous search by clicking the "Clear" button. Click on "Fields" again and select "Sequence length".
Two new fields now appear where you can define the lower and upper limits for the search. Type 1 and 10 and search.
How many proteins do you find?
Extremely short proteins in TrEMBL are most likely mistakes with no evidence for the
sequences being protein coding. Limit your search to only Swiss-Prot. How many proteins are now left?
A large fraction of the proteins identified are fragments. Try to exclude
fragments from the search. Click on "Fields" again, leave the menu on the left on "AND",
set "Field" to "Fragment (yes/no)", select "no" and
search. How many proteins are now left?
And as the final question, can you select only the proteins found in humans? How many are there? (The answer is 6!)
- Finally, you can save the results of your search. Click on the orange "Download..." button. You can now save the
search results in the format you prefer.
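Once the results are saved, they can be filtered further on your own computer. The following is a minimal sketch, assuming the results were downloaded in FASTA format with UniProt-style header lines of the form `>sp|accession|entry_name description OS=organism ...`, where `sp` denotes Swiss-Prot, `tr` denotes TrEMBL, and fragments carry `(Fragment)` in the description. The function names and all records in `EXAMPLE` are made up for illustration and are not real database entries.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def short_reviewed(text, max_length=10, organism=None):
    """Return accessions of Swiss-Prot, non-fragment entries of at most max_length residues."""
    hits = []
    for header, seq in read_fasta(text):
        if not header.startswith("sp|"):
            continue  # skip TrEMBL (tr|...) entries
        if "(Fragment)" in header or len(seq) > max_length:
            continue
        if organism and f"OS={organism}" not in header:
            continue
        hits.append(header.split("|")[1])
    return hits


# Invented records, used only to exercise the filter.
EXAMPLE = """\
>sp|X00001|FAKE1_HUMAN Made-up peptide OS=Homo sapiens OX=9606 PE=1 SV=1
MKFLV
>sp|X00002|FAKE2_HUMAN Made-up peptide (Fragment) OS=Homo sapiens OX=9606 PE=1 SV=1
MKF
>tr|X00003|X00003_HUMAN Made-up peptide OS=Homo sapiens OX=9606 PE=4 SV=1
MK
>sp|X00004|FAKE4_YEAST Made-up peptide OS=Saccharomyces cerevisiae OX=4932 PE=1 SV=1
MKFLVAA
"""

print(short_reviewed(EXAMPLE, organism="Homo sapiens"))  # prints ['X00001']
```

This mirrors the successive restrictions applied in the advanced search (sequence length, Swiss-Prot only, no fragments, organism), applied to a local file instead of the web interface.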
That's it, .....