Exercise written by: Rasmus
Nielsen and Morten
In this part of the exercise, we shall extract information from the protein database UniProt. This
database is administered in collaboration between the
Swiss Institute of Bioinformatics (SIB), the
European Bioinformatics Institute (EBI), and the
Protein Information Resource (PIR), and consists of three parts:
- UniProt Knowledge-base (UniProtKB)
protein sequences with annotation and references
- UniProt Reference Clusters (UniRef)
homology-reduced database, where similar sequences are merged into single entries
- UniProt Archive (UniParc)
an archive containing all versions of UniProt without annotations
Of these databases, UniProtKB is the most useful, and this is the database we shall be using today. UniProtKB consists
of two parts:
- Swiss-Prot
a manually annotated protein database.
- TrEMBL
a computer-annotated supplement to Swiss-Prot that contains all translations of EMBL nucleotide sequences
not yet included in Swiss-Prot.
Here, we shall concentrate on the Swiss-Prot database.
Simple text mining
First, we shall find some Swiss-Prot entries using simple text mining. You shall find entries for human insulin.
Note that the syntax for UniProt searches is different from the one you used when searching GenBank.
Open the UniProt home page http://www.uniprot.org/
Type "human insulin" in the search field at the top of the page. Leave the search menu on "Protein
Knowledge-base (UniProtKB)", which is the default. How many hits do you find?
How many hits are from Swiss-Prot? (Tip: click on "Show only reviewed".)
Can you identify the correct hit?
If you do not identify the correct hit immediately, it often helps to narrow down the search. We can do this by
searching for proteins that actually come from human and have a name containing the word insulin, as opposed to
just containing the words human and insulin somewhere in the description. This can be done very easily:
At 'Restrict term "human" to', click on "organism".
How many hits are now left (still only in Swiss-Prot)?
At 'Restrict term "insulin" to', click on "protein name".
How many hits are now left (still only in Swiss-Prot)?
Note that all selections made with the mouse are shown in text format in the Query box at the top of the page.
It is possible to edit the search criteria in this box to make them broader or narrower. Try, for instance, to exclude
proteins that are not insulin, but only insulin-like. You do this by adding the following text in the Query box:
NOT name:insulin-like and then clicking on the search button. Note also that this
syntax is different from the one used when searching GenBank.
How many hits are now left?
Try now to exclude proteins that are insulin-receptors or substrates for insulin-receptors.
How many hits are now left?
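The way the Query box combines restricted terms and exclusions can be sketched in a few lines of Python. This is only an illustration of how such a query string is put together: the helper names `quote` and `build_query` are made up for this example, and the field names (`organism`, `name`, `reviewed`) are assumptions based on the text shown in the Query box, so check them against what the web page actually displays.

```python
def quote(value):
    """Wrap a value in double quotes if it contains whitespace."""
    return f'"{value}"' if any(ch.isspace() for ch in value) else value


def build_query(required, excluded=()):
    """Join required field:value pairs with AND, then append NOT clauses."""
    query = " AND ".join(f"{field}:{quote(value)}" for field, value in required)
    for field, value in excluded:
        query += f" NOT {field}:{quote(value)}"
    return query.strip()


# Field names are assumptions; compare with the text in the Query box.
query = build_query(
    required=[("organism", "human"), ("name", "insulin"), ("reviewed", "yes")],
    excluded=[("name", "insulin-like")],
)
print(query)
# organism:human AND name:insulin AND reviewed:yes NOT name:insulin-like
```

Further exclusions are added by appending more pairs to `excluded`; values containing spaces are quoted so that they are treated as one phrase.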
The content of Swiss-Prot
We shall now see what information is contained in a Swiss-Prot entry, and what further information is available as links
in each entry.
Click on the accession number for insulin (the blue code in the field "Accession"). This will take you to the
insulin entry in the Swiss-Prot database. Spend some time getting an overview of the page and what information it contains.
Scroll down to the references. How many are there? (Insulin is a highly investigated protein.) Note what each
reference has contributed ("Cited for").
You can get to the PubMed literature database at NCBI by clicking on the
"PubMed" link for a reference. Try this. The abstract of a publication can be read there (or directly at UniProt using the
"Abstract" link), if the work is an actual published article and not a "direct submission".
Look at "General annotation (Comments)". Here you find some of the general functional and structural annotation of the protein
(the rest is placed in "Features"). One of the most important types of comments is naturally "Function".
Another type of comment is "Subcellular location". Where do you find insulin? Why is it found here?
Scroll down to "Sequence annotation (Features)". Note the following:
- Insulin has both a signal peptide and a propeptide. These are both cleaved off before secretion. The mature insulin
(the A and B chains) is hence much smaller than what was shown under
- Secondary structure is specified as "HELIX"
(alpha-helix), "STRAND" (part of a beta-pleated sheet) or "TURN".
Try clicking on "Details...".
- Some variants (mutations) of insulin have been described. In some cases it is known what phenotype (variant of
diabetes) is associated with each variant.
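The size difference between the precursor and the mature insulin can be worked out from the feature coordinates. The sketch below assumes the positions listed for the human insulin entry (signal peptide 1-24, B chain 25-54, propeptide 57-87, A chain 90-110); treat these numbers as assumptions and verify them against the "Sequence annotation (Features)" table you have open. The feature keys and the helper `feature_length` are illustrative only.

```python
# (feature key, start, end); coordinates are assumed, check them in the entry.
features = [
    ("SIGNAL", 1, 24),    # signal peptide
    ("CHAIN", 25, 54),    # insulin B chain
    ("PROPEP", 57, 87),   # propeptide
    ("CHAIN", 90, 110),   # insulin A chain
]


def feature_length(start, end):
    """Length of a feature given 1-based, inclusive coordinates."""
    return end - start + 1


# The mature protein consists of the CHAIN features only.
mature_length = sum(
    feature_length(start, end) for key, start, end in features if key == "CHAIN"
)
# The last annotated position gives the precursor length in this example.
precursor_length = max(end for _, _, end in features)

print(f"mature insulin: {mature_length} of {precursor_length} residues")
# mature insulin: 51 of 110 residues
```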
To look at the three-dimensional structure of a protein, you must go to yet
another database, RCSB
PDB, listed under "3D structure databases". We will be working with 3D structures later, but
let's just have a quick look here as well. As you can see, the 3D structure of insulin has been
determined several times. Select one such structure marked "X-ray"
under "Method" and click on the Entry link. Besides a lot of information on the molecule and the
experimental procedure used to solve the structure, the page also contains a nice picture of the insulin structure.
Under "Family and domain databases" you find a long list of databases that, using
different techniques, have collected proteins that are similar (protein families).
In some cases, the proteins are similar only in smaller parts (domains) but not in other parts, and
in some cases the databases can tell which parts of the actual protein are known in other species. Some large proteins
can contain several different parts (domains), each with its own evolutionary history.
The most important of these databases is InterPro, because it
collects the information from most of the other databases. Try clicking on the InterPro link. This will take you to the
InterPro page with information about the insulin family, and the long reference list.
UniProt's interface allows you to search on most of the fields in the database: not only fields like name and
organism, as we did previously, but also the functional and structural annotations. We shall now try a few of these.
Go to the UniProt website http://www.uniprot.org/.
Click on "Fields" on the left side of the search field.
Now we shall find how many proteins are secreted from the cell, just like insulin. Select
"Subcellular location" in the menu "Field". Next, type "secreted" in the field "Term" and click "Add & Search".
How many proteins do you find?
How many secreted proteins are found in humans? Click on "Fields" again, leave the menu to the left
on "AND", select "Organism [OS]" under
"Field", type "human" in the field "Term" and click "Add
& Search". How many proteins do you find now? (Note again how you can perform the search by editing the
text in the Query box. However, to do this you need to know the names of the fields.)
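Once the field names are known, the same query text can also be placed in a URL instead of the Query box. The sketch below only builds and inspects such a URL; it does not contact any server. The endpoint `https://rest.uniprot.org/uniprotkb/search` with the parameters `query`, `format` and `size` is assumed here, and the field names `location` and `organism` in the example query are hypothetical placeholders for whatever the Query box shows after your search.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed search endpoint; check the UniProt help pages before relying on it.
BASE_URL = "https://rest.uniprot.org/uniprotkb/search"


def search_url(query, fmt="tsv", size=25):
    """Build a search URL with the query text safely percent-encoded."""
    params = {"query": query, "format": fmt, "size": size}
    return f"{BASE_URL}?{urlencode(params)}"


# Hypothetical field names; copy the real ones from the Query box.
url = search_url("location:secreted AND organism:human")
print(url)

# Decode the URL again to confirm the query text survived the encoding.
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"][0])
```

Encoding the query with `urlencode` matters because spaces and colons in the query text are not allowed unescaped in a URL.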
Numerical field: What extremely short proteins are
present in UniProt?
Clear the previous search by clicking the "Clear" button. Click on "Fields" again and select "Sequence length".
Now two new fields appear where you can define the lower and upper limits for the search. Type 1 and 10 and search.
How many proteins do you find?
Extremely short proteins in TrEMBL are most likely mistakes with no evidence for the
sequences being protein coding. Limit your search to only Swiss-Prot. How many proteins are now left?
A large fraction of the proteins identified are fragments. Try to exclude
fragments from the search. Click on "Fields" again, leave the menu on the left on "AND",
set "Field" to "Fragment (yes/no)", select "no" and
search. How many proteins are now left?
And as the final question: can you select only the proteins found in humans? How many are there? (The answer is 8!)
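The successive restrictions in this search (length between 1 and 10, Swiss-Prot only, no fragments, human only) amount to a chain of filters. A minimal sketch of that logic on a handful of made-up records is shown below; the `Entry` class, the accession codes and all values are invented for illustration and do not correspond to real UniProt entries.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    accession: str   # made-up identifier, not a real UniProt accession
    organism: str
    length: int      # sequence length in residues
    reviewed: bool   # True for Swiss-Prot, False for TrEMBL
    fragment: bool   # True if the sequence is only a fragment


def short_complete_entries(entries, min_len=1, max_len=10, organism=None):
    """Keep reviewed, non-fragment entries within the given length range."""
    return [
        e for e in entries
        if e.reviewed
        and not e.fragment
        and min_len <= e.length <= max_len
        and (organism is None or e.organism == organism)
    ]


entries = [
    Entry("X0001", "human", 9, True, False),
    Entry("X0002", "human", 8, True, True),     # fragment, excluded
    Entry("X0003", "mouse", 10, True, False),
    Entry("X0004", "human", 5, False, False),   # unreviewed, excluded
    Entry("X0005", "human", 11, True, False),   # too long, excluded
]

print([e.accession for e in short_complete_entries(entries)])
# ['X0001', 'X0003']
print([e.accession for e in short_complete_entries(entries, organism="human")])
# ['X0001']
```

Each condition in the list comprehension corresponds to one field added with "AND" in the search form, so every added field can only reduce the number of hits.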
Finally, you can save the results of your search. Click on the orange "Download..." button. You can now (but don't do it) save the
search results in the format you prefer.