
Exercise: Translation and protein-databases


Exercise written by: Rasmus Wernersson, Henrik Nielsen and Morten Nielsen

QUESTIONS: This exercise contains a number of brief questions, which should be addressed in your report. Please refer to the bullet-point numbers when answering the questions.

Translation - Virtual Ribosome

In this part of the exercise, we shall use Virtual Ribosome, a tool that provides a series of functions for translating DNA into protein sequences. Besides using the simple functions to translate DNA in a known reading frame, we shall work on computer-based analysis of possible reading frames, the location of START and STOP codons, etc.

Step 1

  1. Open Virtual Ribosome (in a new window): www.cbs.dtu.dk/services/VirtualRibosome/. Spend a few minutes to get familiar with the website - where do you upload the input data, and what types of options are available.

  2. If you only have one sequence, it can be pasted directly into the input window. Alternatively, Virtual Ribosome can handle a number of different input formats that allow for multiple input sequences (e.g. FASTA).

  3. Let's first do a simple example and translate a known gene, actin (from yeast). Copy the sequence below into the sequence field and press "submit".
>Yeast_ACT1
ATGGATTCTGAGGTTGCTGCTTTGGTTATTGATAACGGTTCTGGTATGTGTAAAGCCGGT
TTTGCCGGTGACGACGCTCCTCGTGCTGTCTTCCCATCTATCGTCGGTAGACCAAGACAC
CAAGGTATCATGGTCGGTATGGGTCAAAAAGACTCCTACGTTGGTGATGAAGCTCAATCC
AAGAGAGGTATCTTGACTTTACGTTACCCAATTGAACACGGTATTGTCACCAACTGGGAC
GATATGGAAAAGATCTGGCATCATACCTTCTACAACGAATTGAGAGTTGCCCCAGAAGAA
CACCCTGTTCTTTTGACTGAAGCTCCAATGAACCCTAAATCAAACAGAGAAAAGATGACT
CAAATTATGTTTGAAACTTTCAACGTTCCAGCCTTCTACGTTTCCATCCAAGCCGTTTTG
TCCTTGTACTCTTCCGGTAGAACTACTGGTATTGTTTTGGATTCCGGTGATGGTGTTACT
CACGTCGTTCCAATTTACGCTGGTTTCTCTCTACCTCACGCCATTTTGAGAATCGATTTG
GCCGGTAGAGATTTGACTGACTACTTGATGAAGATCTTGAGTGAACGTGGTTACTCTTTC
TCCACCACTGCTGAAAGAGAAATTGTCCGTGACATCAAGGAAAAACTATGTTACGTCGCC
TTGGACTTCGAACAAGAAATGCAAACCGCTGCTCAATCTTCTTCAATTGAAAAATCCTAC
GAACTTCCAGATGGTCAAGTCATCACTATTGGTAACGAAAGATTCAGAGCCCCAGAAGCT
TTGTTCCATCCTTCTGTTTTGGGTTTGGAATCTGCCGGTATTGACCAAACTACTTACAAC
TCCATCATGAAGTGTGATGTCGATGTCCGTAAGGAATTATACGGTAACATCGTTATGTCC
GGTGGTACCACCATGTTCCCAGGTATTGCCGAAAGAATGCAAAAGGAAATCACCGCTTTG
GCTCCATCTTCCATGAAGGTCAAGATCATTGCTCCTCCAGAAAGAAAGTACTCCGTCTGG
ATTGGTGGTTCTATCTTGGCTTCTTTGACTACCTTCCAACAAATGTGGATCTCAAAACAA
GAATACGACGAAAGTGGTCCATCTATCGTTCACCACAAGTGTTTCTAA
  1. Look at the result. Note that the output shows both the DNA and protein sequences, as well as information on START and STOP codons. You can click on "instructions" on both the main page and the results page for details on what is displayed. Note also that the "raw" protein sequence can be downloaded in FASTA format.

  2. Now please answer the following questions
    • How is a STOP codon displayed?
    • How is a START codon displayed?
    • Does a START codon always code for Methionine (M)?
    • What is the difference between the two types of START codons?
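
For readers who want to see what a basic translation does behind the scenes, here is a minimal sketch in Python. It is not how Virtual Ribosome is implemented. It assumes the standard genetic code (NCBI translation table 1), written in the compressed layout with codons in T, C, A, G order, and it translates reading frame 1 only. The names `CODON_TABLE` and `translate` are chosen for this illustration.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard genetic code (NCBI translation table 1),
# one letter per codon, with codons enumerated in TCAG order.
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every codon (e.g. "ATG") to its one-letter amino acid code.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), STANDARD_AAS)
}


def translate(dna: str) -> str:
    """Translate DNA in reading frame 1; '*' marks a STOP codon."""
    dna = dna.upper().replace("U", "T")
    protein = []
    # Step through complete codons only; trailing 1-2 bases are ignored.
    for i in range(0, len(dna) - 2, 3):
        # Codons with ambiguous bases (e.g. N) are reported as 'X'.
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


# The first and the last codons of the Yeast_ACT1 sequence above.
print(translate("ATGGATTCTGAGGTTGCTGCT"))  # MDSEVAA
print(translate("CACAAGTGTTTCTAA"))        # HKCF*
```

The sketch does not distinguish START codons from internal methionines; it only maps codons to letters.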

Step 2: Genetic code

  1. We are now going to work with another gene from yeast. This time it is COX1, which codes for Cytochrome C OXidase, subunit 1 (for more information, see COX1 - Saccharomyces Genome Database). Note that it is a mitochondrial gene. Translate this gene using the default settings.
>Yeast_COX1 
ATGGTACAAAGATGATTATATTCAACAAATGCAAAAGATATTGCAGTATTATATTTTATG
TTAGCTATTTTTAGTGGTATGGCAGGAACAGCAATGTCTTTAATCATTAGATTAGAATTA
GCTGCACCTGGTTCACAATATTTACATGGTAATTCACAATTATTTAATGTTTTAGTAGTT
GGTCATGCTGTATTAATGATTTTCTTCTTAGTAATGCCTGCTTTAATTGGAGGTTTTGGT
AACTATTTATTACCATTAATAATTGGAGCTACAGATACAGCATTTCCAAGAATTAATAAC
ATTGCTTTTTGAGTATTACCTATGGGGTTAGTATGTTTAGTTACATCAACTTTAGTAGAA
TCAGGTGCTGGTACAGGGTGAACTGTCTATCCACCATTATCATCTATTCAGGCACATTCA
GGACCTAGTGTAGATTTAGCAATTTTTGCATTACATTTAACATCAATTTCATCATTATTA
GGTGCTATTAATTTCATTGTAACAACATTAAATATGAGAACAAATGGTATGACAATGCAT
AAATTACCATTATTTGTATGATCAATTTTCATTACAGCGTTCTTATTATTATTATCATTA
CCTGTATTATCTGCTGGTATTACAATGTTATTATTAGATAGAAACTTCAATACTTCATTC
TTTGAAGTATCAGGAGGTGGTGACCCAATCTTATACGAGCATTTATTTTGATTCTTTGGT
CACCCTGAAGTATATATTTTAATTATTCCTGGATTTGGTATTATTTCACATGTAGTATCA
ACATATTCTAAAAAACCTGTATTTGGTGAAATTTCAATGGTATATGCTATGGCTTCAATT
GGATTATTAGGATTCTTAGTATGATCACATCATATGTATATTGTAGGATTAGATGCAGAT
CTTAGAGCATATTTCCTATCTGCACTAATGATTATTGCAATTCCAACAGGAATTAAAATT
TTCTCATGATTAGCTCTAATCCATGGTGGTTCAATTAGATTAGCACTACCTATGTTATAT
GCAATTGCATTCTTATTCTTATTCACAATGGGTGGTTTAACTGGTGTTGCCTTAGCTAAC
GCCTCATTAGATGTAGCATTCCACGATACTTACTACGTGGTGGGACATTTTCACTATGTA
TTATCAATGGGTGCTATTTTCTCTTTATTTGCAGGATACTATTATTGAAGTCCTCAAATT
TTAGGTTTAAACTATAATGAAAAATTAGCTCAAATTCAATTCTGATTAATTTTCATTGGG
GCTAATGTTATTTTCTTCCCAATGCATTTTTTAGGTATTAATGGTATGCCTAGAAGAATT
CCTGATTATCCTGATGCTTTCGCAGGATGAAATTATGTCGCTTCTATTGGTTCATTCATT
GCACTATTATCATTATTCTTATTTATCTATATTTTATATGATCAATTAGTTAATGGATTA
AACAATAAAGTTAATAATAAATCAGTTATTTATAATAAAGCACCTGATTTTGTAGAATCT
AATCTTATCTTTAATTTAAATACAGTTAAATCTTCATCTATCGAATTCTTATTAACTTCT
CCACCAGCTGTACACTCATTTAATACACCAGCTGTACAATCTTAA

  1. How did the translation succeed? Nothing is wrong with the DNA sequence. Can you come up with some good reasons for the result?

  2. Keep the result of the translation in a window (we need it again in a while), and open a new window with Virtual Ribosome. Translate the DNA sequence once more using a different translation table (see options). Make an educated guess on what table to select.

  3. If you have chosen the right translation table, the DNA sequence can be translated without any problems. Compare the two results and answer the following questions:
    • What is the difference in the use of STOP codons?
    • What is the difference in the use of START codons?
    • Are some codons coding for completely new amino acids?

  4. More information on the definition of the different translation tables is found here: The Genetic Codes - NCBI. The tables are shown in a "compressed" format, but can be shown in a more comprehensible format by using the "Click here to change format" option. Note:
    • The use of START codons is described in detail for all genetic codes.
    • The difference between the standard-code and other codes is summarized in each section.
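
The compressed table format can also be compared programmatically. The sketch below lists the codons that differ between two tables given as 64-letter amino-acid strings (codons in T, C, A, G order). As an illustration it compares the standard code with the vertebrate mitochondrial code (table 2), which is not the table asked for in this exercise; paste in the string of the table you selected to check your own answer. The function name `table_differences` is made up for this example, and START codon usage is not covered.

```python
from itertools import product

BASES = "TCAG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]

# Amino-acid strings in the compressed layout, codons in TCAG order.
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"   # table 1
VERT_MITO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"  # table 2


def table_differences(aas_a: str, aas_b: str) -> dict:
    """Return {codon: (aa_in_a, aa_in_b)} for codons translated differently."""
    return {
        codon: (a, b)
        for codon, a, b in zip(CODONS, aas_a, aas_b)
        if a != b
    }


for codon, (std, alt) in table_differences(STANDARD, VERT_MITO).items():
    print(f"{codon}: {std} -> {alt}")
```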

Step 3: Reading frames

Remember to reset all options (in particular make sure that you now use the standard genetic code) before continuing the exercise.

  1. Up to now we have assumed that the reading frame of the DNA sequence was known and that it always started at the first nucleotide. In the following, we shall examine how it is often possible to identify the most likely reading frame using computational translation tools. We shall use the sequence below, which is the complete mRNA sequence for a yeast gene (profilin). Use your biological knowledge to answer the following questions:
    • Yeast has introns in some genes, could this be a major problem in this case?
    • Can an mRNA molecule contain more sequence than the gene in question?
>gi|4226|emb|Y00469.1| Yeast mRNA for profilin
GGCAAATTATGTCTTGGCAAGCATACACTGATAACTTAATAGGAACCGGTAAAGTCGACAAAGCTGTCAT
CTACTCGAGAGCAGGTGACGCTGTTTGGGCTACTTCTGGTGGCCTATCTTTGCAACCAAACGAAATTGGT
GAAATTGTTCAAGGCTTCGACAATCCAGCTGGTTTGCAAAGCAATGGTTTGCATATTCAAGGCCAAAAGT
TCATGTTGTTGAGAGCTGACGATAGAAGTATCTACGGTAGACATGATGCTGAGGGTGTTGTTTGTGTAAG
AACTAAGCAAACCGTTATTATTGCTCATTATCCACCAACCGTACAAGCCGGTGAGGCCACCAAGATTGTC
GAGCAATTGGCTGACTACTTGATTGGTGTTCAATACTAATTTATGCAGGTAAAGTTTTCTTGCCTTATAC
ACCACCTATTCTGGCATCTGCGGGATTTCGCTTCCTATTTTACAAATATTTTATTGATTGACGCTAATTA
TCACTGTAAAAGGCGCACTTTTTATATGTAGTCACATCCGGTATTTAACATATTTACGAAACAGTCTTAA
GAATATCGACATTTGATATACTTATGTTTAATTTATCTACATATTACAATCA
  1. Six reading frames exist: 1, 2, 3 (on the positive strand, i.e. the sequence as you read it), and -1, -2, -3 (on the negative strand, i.e. the complementary DNA strand). Since we are working with an mRNA sequence, we do not need to consider the reading frames on the complementary strand.
    • Why is this?

  2. Translate the mRNA sequence in the three positive reading frames (1, 2, 3). The easiest way to do this, is to use a window for each translation to be able to compare the different results.
    • What reading frame is most likely the right one?
    • NB: remember that START and STOP codons are only shown for the selected reading frame.
    • Note also that the DNA sequence is shown identically in all three reading frames, whereas the protein sequence is shifted. Why is this?

  3. It is possible to show multiple reading frames simultaneously. Use the Plus (1,2,3) as reading frame, and translate the sequence again.
    • Note that the amino acid letter is centered above each codon (that is, M is placed over the "T" in ATG).
    • The translation from reading frame 1 is shown just above the DNA sequence, followed by reading frames 2 and 3.
    • START and STOP codons for all three reading frames are shown at once.

  4. For the sake of illustration, we shall try to translate the sequence on the negative strand. Select reading frame -1 and redo the translation.
    • How does the DNA sequence look? In what direction should it be read?
    • In what direction should the protein sequence be read? Try comparing it to the protein sequence in FASTA format.

  5. Now, let's try to do it all in one go. Select All (6 reading frames) and translate the sequence again.
    • How many DNA strands are displayed? Why is this?
    • Note the large number of possibilities a single DNA sequence contains with respect to translation into protein sequence.
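
The six-frame translation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Virtual Ribosome's own code: it uses the standard genetic code, and it numbers the negative frames so that frame -1 starts at the first base of the reverse complement (other tools may number them differently). The helper names are made up for this example.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(c): aa for c, aa in zip(product("TCAG", repeat=3), STANDARD_AAS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse, so the result reads 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate from the first base; '*' is STOP, 'X' an unknown codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna: str) -> dict:
    """Return {frame: protein} for frames 1, 2, 3, -1, -2, -3."""
    forward = dna.upper()
    reverse = reverse_complement(forward)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(forward[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


# The first 35 nucleotides of the profilin mRNA above.
for frame, protein in six_frames("GGCAAATTATGTCTTGGCAAGCATACACTGATAAC").items():
    print(f"{frame:+d}: {protein}")
```

Running it on the full mRNA and comparing the six lines with the Virtual Ribosome output is a quick way to check how the two number the negative frames.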

Step 4: ORF finder

  1. We have now made a manual screening for possible reading frames. Such a procedure might work fine if you have only one DNA sequence, but this is in general not the case, and often you need to use computer-based ORF finders. An ORF (Open Reading Frame) is a DNA sequence that is not interrupted by a STOP codon. Often one will be looking for the longest ORF starting with a START codon and ending at a STOP codon.
    • The longest ORF is found by translating the sequence in all six reading frames, and then selecting the longest protein sequence.

  2. We shall now use the built-in ORF finder with the most stringent criteria. Under ORF finder, select Start codon: strict (this forces the ORF to start at ATG), select "All (6 reading frames)" and translate the sequence again.
    • Does the result fit to what you found earlier?
    • Would it make any difference to the result if we had only a partial sequence where the last part of the sequence with the STOP codon is missing?
    • What would happen if the first 50 nucleotides (with the START codon) were missing?
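
The idea of a strict ORF finder can be sketched as follows: translate all six reading frames and keep the longest stretch that starts with M (ATG) and ends at a STOP. This is a simplified illustration, not the algorithm used by Virtual Ribosome. It assumes the standard genetic code and ATG as the only START codon, and the function names are made up for this example.

```python
import re
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(c): aa for c, aa in zip(product("TCAG", repeat=3), STANDARD_AAS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate from the first base; '*' is STOP, 'X' an unknown codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna: str) -> dict:
    """Return {frame: protein} for frames 1, 2, 3, -1, -2, -3."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(forward[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


def longest_orf(dna: str) -> tuple:
    """Return (frame, protein) of the longest M...* stretch, or (0, '')."""
    best_frame, best_protein = 0, ""
    for frame, protein in six_frames(dna).items():
        # Each match runs from the first M after a STOP to the next STOP.
        for match in re.finditer(r"M[^*]*\*", protein):
            if len(match.group()) > len(best_protein):
                best_frame, best_protein = frame, match.group()
    return best_frame, best_protein


# A toy sequence: the start of the profilin mRNA with an artificial STOP.
print(longest_orf("GGCAAATTATGTCTTGGCAAGCATACTAATTT"))  # (3, 'MSWQAY*')
```

The stringency lives entirely in the pattern `M[^*]*\*`. Dropping the leading `M` or the trailing `\*` gives less strict variants, which is a convenient way to experiment with truncated input sequences.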

Protein-databases

In this part of the exercise, we shall extract information from the protein database UniProt. This database is maintained jointly by the Swiss Institute of Bioinformatics (SIB), the European Bioinformatics Institute (EBI), and Georgetown University.

UniProt, http://www.uniprot.org/, consists of three parts:

  • UniProt Knowledge-base (UniProtKB)
    protein sequences with annotation and references
  • UniProt Reference Clusters (UniRef)
    homology-reduced database, where similar sequences are merged into single entries
  • UniProt Archive (UniParc)
    an archive containing all versions of UniProt, without annotations
Of these databases, UniProtKB is the most useful, and this is the database we shall be using today. UniProtKB consists of two parts:
  • UniProtKB/Swiss-Prot
    a manually annotated protein-database.
  • UniProtKB/TrEMBL
    a computer-annotated supplement to Swiss-Prot, that contains all translations of EMBL nucleotide sequences not yet included in Swiss-Prot.

Here, we shall concentrate on the Swiss-Prot database http://www.uniprot.org/.

Simple text mining

First, we shall find some Swiss-Prot entries using simple text mining. You shall find the entries for human insulin. Note that the syntax for UniProt searches is different from the one you used when searching GenBank.

  1. Open the UniProt home-page http://www.uniprot.org/

  2. Type "human insulin" in the search field at the top of the page. Leave the search menu on "Protein Knowledge-base (UniProtKB)", which is the default. How many hits do you find?

  3. How many hits are from Swiss-Prot? (Tip: click on "Show only reviewed".)

  4. Can you identify the correct hit?

  5. If you do not identify the correct hit immediately, it often helps to narrow down the search. We can do this by searching for proteins that actually come from human and have a name containing the word insulin, as opposed to entries that merely contain the words human and insulin somewhere in the description. This can be done very easily:

    1. At 'Restrict term "human" to', click on "organism". How many hits are now left (still only in Swiss-Prot)?

    2. At 'Restrict term "insulin" to', click on "protein name". How many hits are now left (still only in Swiss-Prot)?
  6. Note that all selections made with the mouse are shown in text format in the Query box at the top of the page. It is possible to edit the search criteria in this box to make them broader or narrower. Try, for instance, to exclude proteins that are not insulin but only insulin-like. You do this by adding the following text in the Query box: NOT name:insulin-like and clicking on the search button. Note again that this syntax is different from the one used when searching GenBank. How many hits are now left?

  7. Try now to exclude proteins that are insulin-receptors or substrates for insulin-receptors. How many hits are now left?
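
When a search is built up in several steps, it can help to see the Query box text as a plain string assembled from field:term pairs. The sketch below does only that; it does not contact UniProt. The `name` field is taken from the example above, while `organism` and the helper `build_query` are assumptions made for this illustration. Check the Query box for the exact field names the site generates.

```python
def build_query(include, exclude=()):
    """Join field:term pairs with AND and append one NOT clause per exclusion.

    `include` and `exclude` are sequences of (field, term) tuples.
    """
    query = " AND ".join(f"{field}:{term}" for field, term in include)
    for field, term in exclude:
        query += f" NOT {field}:{term}"
    return query


# Field names here are illustrative; 'organism' is an assumption.
query = build_query(
    include=[("organism", "human"), ("name", "insulin")],
    exclude=[("name", "insulin-like")],
)
print(query)  # organism:human AND name:insulin NOT name:insulin-like
```

Further exclusions are added by appending more tuples to `exclude`.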

The content of Swiss-Prot

We shall now see what information is contained in a Swiss-Prot entry, and what further information is available as links in each entry.

  1. Click on the accession-number for insulin (the blue code in the field "Accession"). This will take you to the insulin-entry in the Swiss-Prot database. Spend some time to get an overview on the page and what information it contains.

  2. Scroll down to the references. How many are there? (Insulin is a highly investigated protein.) Note what each reference has contributed ("Cited for"). You can get to the PubMed literature database at NCBI by clicking on the "PubMed" link for a reference; try this. The abstract of a publication can be read there (or directly at UniProt using the "Abstract" link), provided the work is an actual published article and not a "direct submission".

  3. Read the "General annotation (Comments)". Here you find some of the general functional and structural annotation of the protein (the rest is placed in "Features"). One of the most important types of comments is naturally "Function". Another type of comment is "Subcellular location": where do you find insulin? Why is it found there?

  4. Scroll down to "Sequence annotation (Features)". Note the following:

    1. Insulin has both a signal peptide and a propeptide. These are both cleaved off before secretion. The mature insulin (the A and B chains) is hence much smaller than what was shown under "Sequence information".
    2. Secondary structure is specified as "HELIX" (alpha-helix), "STRAND" (part of a beta-pleated sheet) or "TURN". Try clicking on "Details...".
    3. Some variants (mutations) of insulin have been described. In some cases it is known what phenotype (variants of diabetes) is associated with each variant.
  5. To look at the three-dimensional structure of a protein you must go to yet another database, RCSB PDB, listed under "3D structure databases". We will be working with 3D structures on Wednesday, but let's just have a quick look here as well. As you can see, the 3D structure of insulin has been determined several times. Select one structure marked "X-ray" under "Method" and click on the Entry link. Besides a lot of information on the molecule and the experimental procedure used to solve the structure, the page also contains a nice picture of the insulin molecule.

  6. Under "Family and domain databases" you find a long list of databases that, using different techniques, have collected proteins that are similar (protein families). In some cases the proteins are similar only in smaller parts (domains) and not in others, and in some cases the databases can tell which parts of the actual protein are known in other species. Some large proteins contain several different parts (domains), each with its own evolutionary history. The most important of these databases is InterPro, because it collects the information from most of the other databases. Try clicking on the InterPro link. This will take you to the InterPro page with information about the insulin family, and its long reference list.

Advanced search

UniProt's interface allows you to search on most of the fields in the database: not only fields like name and organism, as we did previously, but also the functional and structural annotations. We shall now try a few of these.
  1. Go to the UniProt website http://www.uniprot.org/. Click on "Fields" on the left side of the search field.

  2. Now we shall find out how many proteins are secreted from the cell, just like insulin. Select "Subcellular location" in the menu "Field". Next, type "secreted" in the field "Term" and click "Add & Search". How many proteins do you find?

  3. Combining fields: How many secreted proteins are found in humans? Click on "Fields" again, leave the menu to the left on "AND", select "Organism [OS]" under "Field", type "human" in the field "Term" and click "Add & Search". How many proteins do you find now? (Note again that you can perform the search by editing the text in the Query box; however, to do this you need to know the names of the fields.)

  4. Numerical field: Which extremely short proteins are present in UniProt? Clear the previous search by clicking the "Clear" button. Click on "Fields" again and select "Sequence length". Two new fields now appear where you can define the lower and upper limits for the search. Type 1 and 10 and search. How many proteins do you find?

  5. Extremely short proteins in TrEMBL are most likely mistakes with no evidence for the sequences being protein coding. Limit your search to only Swiss-Prot. How many proteins are now left?

  6. A large fraction of the proteins identified are fragments. Try to exclude fragments from the search. Click on "Fields" again, leave the menu on the left on "AND", set "Field" to "Fragment (yes/no)", select "no" and search. How many proteins are now left?

  7. And as the final question: can you select only the proteins found in humans? How many are there? (The answer is 6!)

  8. Finally, you can save the results of your search. Click on the orange "Download..." button. You can now save the search results in the format you prefer.
That's it, .....