Exercise: Protein-databases

Exercise written by: Rasmus Wernersson, Henrik Nielsen and Morten Nielsen


In this part of the exercise, we shall extract information from the protein database UniProt. This database is administered in collaboration between the Swiss Institute of Bioinformatics (SIB), the European Bioinformatics Institute (EBI), and Georgetown University.

UniProt, http://www.uniprot.org/, consists of three parts:

  • UniProt Knowledge-base (UniProtKB)
    protein sequences with annotation and references
  • UniProt Reference Clusters (UniRef)
    homology-reduced database, where similar sequences are merged into single entries
  • UniProt Archive (UniParc)
    an archive containing all versions of UniProt without annotations

Of these databases, UniProtKB is the most useful, and this is the database we shall be using today. UniProtKB consists of two parts:
  • UniProtKB/Swiss-Prot
    a manually annotated protein-database.
  • UniProtKB/TrEMBL
    a computer-annotated supplement to Swiss-Prot, which contains all translations of EMBL nucleotide sequences not yet included in Swiss-Prot.

Here, we shall concentrate on the Swiss-Prot database http://www.uniprot.org/.

Simple text mining

First, we shall find some Swiss-Prot entries using simple text mining. You shall find entries for human insulin. Note that the syntax for UniProt searches is different from the one you used when searching GenBank.

  1. Open the UniProt home-page http://www.uniprot.org/

  2. Type "human insulin" in the search field at the top of the page. Leave the search menu on "Protein Knowledge-base (UniProtKB)", which is the default. How many hits do you find?

  3. How many hits are from Swiss-Prot? (Tip: click on "Show only reviewed".)

  4. Can you identify the correct hit?

  5. If you do not identify the correct hit immediately, it often helps to narrow down the search. We can do this by searching for proteins that actually come from human and whose name contains the word insulin, as opposed to entries that just contain the words human and insulin somewhere in the description. This can be done very easily:

    1. At 'Restrict term "human" to', click on "organism". How many hits are now left (still only in Swiss-Prot)?

    2. At 'Restrict term "insulin" to', click on "protein name". How many hits are now left (still only in Swiss-Prot)?

  6. Note that all selections made with the mouse are shown in text format in the Query box at the top of the page. It is possible to edit the search criteria in this box to make them broader or narrower. Try, for instance, to exclude proteins that are not insulin but only insulin-like. You do this by adding the following text in the Query box: NOT name:insulin-like and clicking on the search button. Note also that this syntax is different from the one used when searching GenBank. How many hits are now left?

  7. Try now to exclude proteins that are insulin-receptors or substrates for insulin-receptors. How many hits are now left?
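
The Query box edits in steps 5 to 7 amount to joining field:value terms with AND and appending NOT clauses. The following Python sketch only builds such a query string; it does not contact UniProt. Only the clause NOT name:insulin-like is taken from the text above. The field names organism and reviewed, and the use of name:receptor for step 7, are assumptions made for illustration, so compare the result with what the Query box actually shows.

```python
def build_query(required, excluded):
    """Join required field:value terms with AND, then append NOT clauses."""
    def term(field, value):
        # Quote values containing spaces so they stay a single term.
        if " " in value:
            value = f'"{value}"'
        return f"{field}:{value}"

    query = " AND ".join(term(f, v) for f, v in required)
    for f, v in excluded:
        query += f" NOT {term(f, v)}"
    return query


# Assumed field names: "organism", "name" and "reviewed".
required = [("organism", "human"), ("name", "insulin"), ("reviewed", "yes")]
# "insulin-like" comes from step 6; "receptor" is one guess for step 7.
excluded = [("name", "insulin-like"), ("name", "receptor")]

query = build_query(required, excluded)
print(query)
# organism:human AND name:insulin AND reviewed:yes NOT name:insulin-like NOT name:receptor
```

Keeping the required and excluded terms in separate lists makes it easy to add or drop one restriction at a time and see how the hit count changes.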

The content of Swiss-Prot

We shall now see what information is contained in a Swiss-Prot entry, and what further information is available as links in each entry.

  1. Click on the accession number for insulin (the blue code in the field "Accession"). This will take you to the insulin entry in the Swiss-Prot database. Spend some time getting an overview of the page and the information it contains.

  2. Scroll down to the references. How many are there? (Insulin is a highly investigated protein.) Note what each reference has contributed ("Cited for"). You can get to the PubMed literature database at NCBI by clicking on the link "PubMed" for a reference; try this. The abstract of a publication can be read here (or directly at UniProt using the "Abstract" link), if the work is an actual published article and not a "direct submission".

  3. Read the "General annotation (Comments)". Here you find some of the general functional and structural annotation of the protein (the rest is placed in "Features"). One of the most important types of comment is naturally "Function". Another type of comment is "Subcellular location". Where do you find insulin? Why is it found here?

  4. Scroll down to "Sequence annotation (Features)". Note the following:

    1. Insulin has both a signal peptide and a propeptide. These are both cleaved off before secretion. The mature insulin (the A and B chains) is hence much smaller than what was shown under "Sequence information".
    2. Secondary structure is specified as "HELIX" (alpha-helix), "STRAND" (part of a beta-pleated sheet) or "TURN". Try to click on "Details...".
    3. Some variants (mutations) of insulin have been described. In some cases it is known what phenotype (variants of diabetes) is associated with each variant.
  5. To look at the three-dimensional structure of a protein, you must go to yet another database, RCSB PDB, listed under "3D structure databases". We will be working with 3D structures later, but let us have a quick look here as well. As you can see, the 3D structure of insulin has been determined several times. Select one structure marked "X-ray" under "Method" and click on the Entry link. Besides a lot of information on the molecule and the experimental procedure used to solve the structure, the page also contains a nice picture of the insulin molecule.

  6. Under "Family and domain databases" you find a long list of databases that, using different techniques, have collected proteins that are similar (protein families). In some cases the proteins are similar only in smaller parts (domains) but not in other parts, and in some cases the databases can tell which parts of the actual protein are known in other species. Some large proteins can contain several different parts (domains), each with its own evolutionary history. The most important of these databases is InterPro, because it collects the information from most of the other databases. Try to click on the InterPro link. This will take you to the InterPro page with information about the insulin family and a long reference list.
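
The size difference noted under "Sequence annotation (Features)" can be checked with a little arithmetic. The sketch below is only an illustration: the feature positions are typed in by hand and are assumed to match the human insulin entry, so compare them with the positions shown in the entry before relying on them. The tuple layout and the names used are hypothetical and not a UniProt format.

```python
# (feature key, description, start, end); positions are 1-based and inclusive.
# Assumed values for human insulin; check them against the Features table.
features = [
    ("SIGNAL", "signal peptide", 1, 24),
    ("CHAIN", "insulin B chain", 25, 54),
    ("PROPEP", "C peptide", 57, 87),
    ("CHAIN", "insulin A chain", 90, 110),
]
precursor_length = 110  # assumed length of the full precursor sequence


def feature_length(start, end):
    """Number of residues in an inclusive range."""
    return end - start + 1


# Only the CHAIN features remain in the mature hormone.
mature_length = sum(
    feature_length(start, end)
    for key, _description, start, end in features
    if key == "CHAIN"
)

print(f"precursor: {precursor_length} residues, mature insulin: {mature_length} residues")
```

With these assumed positions, the two chains together account for fewer than half of the residues in the precursor.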

Advanced search

UniProt's interface allows you to search on most of the fields in the database: not only fields like name and organism, as we did previously, but also the functional and structural annotations. We shall now try a few of these.

  1. Go to the UniProt website http://www.uniprot.org/. Click on "Fields" on the left side of the search field.

  2. Now we shall find out how many proteins are secreted from the cell, just like insulin. Select "Subcellular location" in the menu "Field". Next, type "secreted" in the field "Term" and click "Add & Search". How many proteins do you find?

  3. Combining fields: How many secreted proteins are found in humans? Click on "Fields" again, leave the menu to the left on "AND", select "Organism [OS]" under "Field", type "human" in the field "Term" and click "Add & Search". How many proteins do you find now? (Note again that you can perform the search by editing the text in the Query box; to do this, however, you need to know the names of the fields.)

  4. Numerical field: What extremely short proteins are present in UniProt? Clear the previous search by clicking the "Clear" button. Click on "Fields" again and select "Sequence length". Two new fields now appear where you can define the lower and upper limits for the search. Type 1 and 10 and search. How many proteins do you find?

  5. Extremely short proteins in TrEMBL are most likely mistakes with no evidence for the sequences being protein coding. Limit your search to only Swiss-Prot. How many proteins are now left?

  6. A large fraction of the proteins identified are fragments. Try to exclude fragments from the search. Click on "Fields" again, leave the menu on the left on "AND", set "Field" to "Fragment (yes/no)", select "no" and search. How many proteins are now left?

  7. As the final question, can you select only the proteins found in humans? How many are there? (The answer is 8!)

  8. Finally, you can save the results of your search. Click on the orange "Download..." button. You can now (but don't do it) save the search results in the format you prefer.
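
The restrictions added in steps 4 to 7 can also be written directly as Query box text. The Python sketch below composes such a text and percent-encodes it the way a web address would carry it. It does not contact UniProt. All field names (length, reviewed, fragment, organism) and the range form [1 TO 10] are assumptions, since the exercise does not list them; check them against the Query box after clicking through the steps. Which address and which download formats the site accepts is not covered here, so the sketch stops at the encoded query parameter.

```python
from urllib.parse import parse_qs, urlencode


def length_range(low, high):
    """Range term in the assumed form length:[low TO high]."""
    return f"length:[{low} TO {high}]"


def all_of(*terms):
    """Require every term by joining them with AND."""
    return " AND ".join(terms)


# Steps 4 to 7: short, reviewed, non-fragment, human proteins.
short_query = all_of(
    length_range(1, 10),
    "reviewed:yes",    # assumed name for the Swiss-Prot restriction
    "fragment:no",     # assumed name for "Fragment (yes/no)"
    "organism:human",  # assumed name for "Organism [OS]"
)
print(short_query)
# length:[1 TO 10] AND reviewed:yes AND fragment:no AND organism:human

# Percent-encode the text so it can travel as a single URL parameter.
encoded = urlencode({"query": short_query})
print(encoded)

# Decoding the parameter gives the original query text back.
decoded = parse_qs(encoded)["query"][0]
```

Writing the range as a small helper keeps the lower and upper limits visible as numbers, which mirrors the two input fields that appear for "Sequence length".
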
That's it, .....