|
Exercise 7
OBJECTIVES
In the exercises today we will look at protein localization. This is useful in
a variety of applications, both at the level of the individual protein and on
the organism level as the fraction of proteins which are retained, bound in the
membrane or gets exported. To predict this, we will use three different
programs.
- TMHMM - Predicts transmembrane helices
- SignalP - Predicts the presence of signal peptides and their cleavage sites.
- LipoP - Predicts peptidase I & II cleavage sites sites.
These programs are all accessible through a web interface at the CBS servers.
Yet we will use them from the commandline so as to make downstream data processing easier.
Also we will try something new today. This text exists in two versions.
One which is `cut n' paste compliant' and this one,
which includes only descriptions of what to do.
You can swich to the other version by appending `?help' to the url of this file.
|
-
LOG IN
Open a session on 'organism.cbs.dtu.dk', and from this login
to ibiology and create the Ex7 and source directories for you
to work in. Then copy the FASTA files with the predicted
protein sequences from Exercise 3 into the source directory.
-
Trans membrane helix predictions
|
The program TMHMM uses hidden markov models to look for transmembrane helices.
Its filename is tmhmm, and for a standard run it takes a fasta file of protein
sequences as input.
|
-
To get a feeling of the output, try to supply it with one protein sequence
as an example. Just open an editor, and copy the first protein sequence of the
BA000021 genome into a new file.
Save this file.
Then call tmhmm on the file.
-
Now we can proceed to automate it for the complete proteome of each genome.
This is done with the foreach loop, running the tmhmm program on the entire
proteins.fsa file for each genome, and redirecting the output into a
separate file for each genome.
-
The next step is to load the data into the database. In contrast to
the data created in the previous exercises, this feature, and the
other features today, relate themselves to a protein and not directly to the
genome. This leaves a couple of different options.
The most appropriate might be to create a new table for protein
features. A simpler, if not as elegant, solution is to keep the
current database structure and simply add the data to the comment field.
This is what we also did for the sigma factor and TCS predictions.
The former is the solution we will use. Thus the table localization
has been created
mysql> describe localization;
+--------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------+-------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| user | varchar(50) | YES | | NULL | |
| protid | int(11) | YES | | NULL | |
| type | varchar(20) | YES | | NULL | |
+--------+-------------+------+-----+---------+----------------+
4 rows in set (0.00 sec)
The idea is that we will make one entry in this table for each
positive prediction, and link it to the features table by setting
protid to the id value of the relevant entry in the features
table. You can give writing the code a try, but unless you feel quite
comfortable with perl regular expressions and sql syntax, this is not
quite easy. Thus I have included the code in both versions.
mysql -e "delete from localization where user=USER() and type='tmhmm'" cmp_genomics
foreach accession (AE017042 AE016879 AL111168 AL645882 AP008232 AP009048 BA000021 CP000034)
perl -ne 'if(/# CDS_(\d+)-(\d+)_DIR([+-]) Number of predicted TMHs:\s+(\d+)/ and $4>0) \
{print "insert into localization (user, protid, type) \
select USER() as user, ft.id as protid, \"tmhmm\" as type \
from features as ft \
where ft.user=USER() and \
ft.accession = \"'$accession'\" and \
ft.start = $1 and ft.stop = $2 and \
ft.dir = \"$3\";\n";}' $accession.tmhmm \
| mysql -D cmp_genomics
end
-
Now have a look at the number of predicted transmembrane helices.
mysql -e 'select count(*) from localization where type = "tmhmm";' cmp_genomics
At the end of todays exercise we will look at the results in a little
more detail.
|
-
SignalP - Signal peptide predictions
-
SignalP runs at the same time a neural network and a hidden markov
model to predict the signal peptides for protein export. It works for
both bacteria and eukaryotes, and for this reason it is a widely used
program. First try to run it on the test sequence you created in the
beginning of the previous step. It is also reccommended to run it with
-f summary and -f short flags to get
summary and one line output respectively. If you are in doubt what all
this output means, it is a good idea to take a look at
man signalp This manual page will explain the output,
but also the usage and possible parameters and options.
-
Now that you have seen what the output looks like it is time to make
the predictions for all the proteins in all the genomes. Remember that
you have to select the appropriate model/organism type for each
genome.
This will take a long time, so it will make sense to open a new session and start working on the next exercise until this has finished.
-
Now that the result files have been created, what is left is to make a
small script that will read each line and print an SQL statement that
will insert the data into the MySQL database. The result of this
script should then be piped into the mysql program to be executed.
The script will be very similar to the former, only the regular
expression needs to be updated to fit the new output, and the
requirement for what we think is a hit must be updated as well.
mysql -e "delete from localization where user=USER() and type='signalp'" cmp_genomics
foreach accession (AE017042 AE016879 AL111168 AL645882 AP008232 AP009048 BA000021 CP000034)
perl -ne 'if(/^CDS_(\d+)-(\d+)_DIR([+-])\s+0.\d+\s+\d+\s+[YN]\s+0.\d+\s+\d+\s+[YN]\s+[10].\d+\s+\d+\s+Y\s+[10].\d+\s+Y\s+[10].\s+\s+Y\s+CDS_\d+-\d+_DIR[+-]\s+Q\s+[10].\d+\s+[10]\s+Y\s+[10].\d+\s+Y/) \
{print "insert into localization (user, protid, type) \
select USER() as user, ft.id as protid, \"signalp\" as type \
from features as ft \
where ft.user=USER() and \
ft.accession = \"'$accession'\" and \
ft.start = $1 and ft.stop = $2 and \
ft.dir = \"$3\";\n";}' $accession.signalp \
| mysql -D cmp_genomics
end
-
Now have a look at the number of predicted signal peptides.
mysql -e 'select count(*) from localization where type = "signalp";' cmp_genomics
Would it make sense to change the regular expression so it is more lax
with how many confirming predictions it needs? The Y's in the regular
expression can be changed to '[YN]' to set the prediction free.
|
-
LipoP - Lipoprotein signal peptide predictions
LipoP uses a hidden markov model to search for cleavage sites for
peptidase I & II cleavage sites in gram negative bacteria. The
program name is lipop, and the program will only run on organism. Like
the other programs it takes a fasta file with protein sequences as
input. The output of lipoP is in so called .gff format.
-
By now you should know how to make example output. You can repeat this
to get a feeling for how lipop is working.
-
Now run lipop on each of the proteomes. We will run it not only on the
gram negatives, but also on the gram positives to see what happens.
-
And again we will load the data into the database
mysql -e "delete from localization where user=USER() and type='lipop'" cmp_genomics
foreach accession (AE017042 AE016879 AL111168 AL645882 AP008232 AP009048 BA000021 CP000034)
perl -ne 'if(/^# CDS_(\d+)-(\d+)_DIR([+-]).*cleavage=/) \
{print "insert into localization (user, protid, type) \
select USER() as user, ft.id as protid, \"lipop\" as type \
from features as ft \
where ft.user=USER() and \
ft.accession = \"'$accession'\" and \
ft.start = $1 and ft.stop = $2 and \
ft.dir = \"$3\";\n";}' $accession.lipop.gff \
| mysql -D cmp_genomics
end
-
Again take a look at the number of predicted lipoprotein cleavage
sites in the database.
|
-
Taking a look at the predictions
Now let us take a look at how many predictions we have of each type
for each organism.
mysql
mysql> use cmp_genomics;
mysql> select accession, count(*) from features as ft, localization as lo \
where ft.user = user() and \
ft.id = lo.protid and \
ft.user = lo.user and \
lo.type = "tmhmm" \
group by accession;
Repeat this for all the different types.
Does this results correspond with the biological knowledge you have of
the organisms?
|
|