Events News Research CBS CBS Publications Bioinformatics
Staff Contact About Internal CBS CBS Other

Exercise 7


OBJECTIVES

In the exercises today we will look at protein localization. This is useful in a variety of applications, both at the level of the individual protein and on the organism level as the fraction of proteins which are retained, bound in the membrane or gets exported. To predict this, we will use three different programs.
  1. TMHMM - Predicts transmembrane helices
  2. SignalP - Predicts the presence of signal peptides and their cleavage sites.
  3. LipoP - Predicts peptidase I & II cleavage sites sites.
These programs are all accessible through a web interface at the CBS servers. Yet we will use them from the commandline so as to make downstream data processing easier.

Also we will try something new today. This text exists in two versions. One which is `cut n' paste compliant' and this one, which includes only descriptions of what to do. You can swich to the other version by appending `?help' to the url of this file.

  1. LOG IN

    Open a session on 'organism.cbs.dtu.dk', and from this login to ibiology and create the Ex7 and source directories for you to work in. Then copy the FASTA files with the predicted protein sequences from Exercise 3 into the source directory.
  2. Trans membrane helix predictions

    The program TMHMM uses hidden markov models to look for transmembrane helices. Its filename is tmhmm, and for a standard run it takes a fasta file of protein sequences as input.


    1. To get a feeling of the output, try to supply it with one protein sequence as an example. Just open an editor, and copy the first protein sequence of the BA000021 genome into a new file.
      Save this file. Then call tmhmm on the file.

    2. Now we can proceed to automate it for the complete proteome of each genome. This is done with the foreach loop, running the tmhmm program on the entire proteins.fsa file for each genome, and redirecting the output into a separate file for each genome.

    3. The next step is to load the data into the database. In contrast to the data created in the previous exercises, this feature, and the other features today, relate themselves to a protein and not directly to the genome. This leaves a couple of different options. The most appropriate might be to create a new table for protein features. A simpler, if not as elegant, solution is to keep the current database structure and simply add the data to the comment field. This is what we also did for the sigma factor and TCS predictions.
      The former is the solution we will use. Thus the table localization has been created
      mysql> describe localization;
      +--------+-------------+------+-----+---------+----------------+
      | Field  | Type        | Null | Key | Default | Extra          |
      +--------+-------------+------+-----+---------+----------------+
      | id     | int(11)     | NO   | PRI | NULL    | auto_increment |
      | user   | varchar(50) | YES  |     | NULL    |                |
      | protid | int(11)     | YES  |     | NULL    |                |
      | type   | varchar(20) | YES  |     | NULL    |                |
      +--------+-------------+------+-----+---------+----------------+
      4 rows in set (0.00 sec)
      
      The idea is that we will make one entry in this table for each positive prediction, and link it to the features table by setting protid to the id value of the relevant entry in the features table. You can give writing the code a try, but unless you feel quite comfortable with perl regular expressions and sql syntax, this is not quite easy. Thus I have included the code in both versions.
      mysql -e "delete from localization where user=USER() and type='tmhmm'" cmp_genomics
      foreach accession (AE017042 AE016879 AL111168 AL645882 AP008232 AP009048 BA000021 CP000034)
        perl -ne 'if(/# CDS_(\d+)-(\d+)_DIR([+-]) Number of predicted TMHs:\s+(\d+)/ and $4>0) \
                  {print "insert into localization (user, protid, type) \
                                 select USER() as user, ft.id as protid, \"tmhmm\" as type \
                                        from features as ft \
                                        where ft.user=USER() and \
                                              ft.accession = \"'$accession'\" and \
                                              ft.start = $1 and ft.stop = $2 and \
                                              ft.dir = \"$3\";\n";}' $accession.tmhmm \
        | mysql -D cmp_genomics
      end
      

    4. Now have a look at the number of predicted transmembrane helices.
      mysql -e 'select count(*) from localization where type = "tmhmm";' cmp_genomics
      
      At the end of todays exercise we will look at the results in a little more detail.
  3. SignalP - Signal peptide predictions

    1. SignalP runs at the same time a neural network and a hidden markov model to predict the signal peptides for protein export. It works for both bacteria and eukaryotes, and for this reason it is a widely used program. First try to run it on the test sequence you created in the beginning of the previous step. It is also reccommended to run it with
      -f summary
      and
      -f short
      flags to get summary and one line output respectively. If you are in doubt what all this output means, it is a good idea to take a look at
      man signalp
      This manual page will explain the output, but also the usage and possible parameters and options.

    2. Now that you have seen what the output looks like it is time to make the predictions for all the proteins in all the genomes. Remember that you have to select the appropriate model/organism type for each genome. This will take a long time, so it will make sense to open a new session and start working on the next exercise until this has finished.

    3. Now that the result files have been created, what is left is to make a small script that will read each line and print an SQL statement that will insert the data into the MySQL database. The result of this script should then be piped into the mysql program to be executed.
      The script will be very similar to the former, only the regular expression needs to be updated to fit the new output, and the requirement for what we think is a hit must be updated as well.
      mysql -e "delete from localization where user=USER() and type='signalp'" cmp_genomics
      foreach accession (AE017042 AE016879 AL111168 AL645882 AP008232 AP009048 BA000021 CP000034)
        perl -ne 'if(/^CDS_(\d+)-(\d+)_DIR([+-])\s+0.\d+\s+\d+\s+[YN]\s+0.\d+\s+\d+\s+[YN]\s+[10].\d+\s+\d+\s+Y\s+[10].\d+\s+Y\s+[10].\s+\s+Y\s+CDS_\d+-\d+_DIR[+-]\s+Q\s+[10].\d+\s+[10]\s+Y\s+[10].\d+\s+Y/) \
                  {print "insert into localization (user, protid, type) \
                                 select USER() as user, ft.id as protid, \"signalp\" as type \
                                        from features as ft \
                                        where ft.user=USER() and \
                                              ft.accession = \"'$accession'\" and \
                                              ft.start = $1 and ft.stop = $2 and \
                                              ft.dir = \"$3\";\n";}' $accession.signalp \
        | mysql -D cmp_genomics
      end
      
    4. Now have a look at the number of predicted signal peptides.
      mysql -e 'select count(*) from localization where type = "signalp";' cmp_genomics
      
      Would it make sense to change the regular expression so it is more lax with how many confirming predictions it needs? The Y's in the regular expression can be changed to '[YN]' to set the prediction free.
  4. LipoP - Lipoprotein signal peptide predictions

      LipoP uses a hidden markov model to search for cleavage sites for peptidase I & II cleavage sites in gram negative bacteria. The program name is lipop, and the program will only run on organism. Like the other programs it takes a fasta file with protein sequences as input. The output of lipoP is in so called .gff format.
    1. By now you should know how to make example output. You can repeat this to get a feeling for how lipop is working.

    2. Now run lipop on each of the proteomes. We will run it not only on the gram negatives, but also on the gram positives to see what happens.

    3. And again we will load the data into the database
      mysql -e "delete from localization where user=USER() and type='lipop'" cmp_genomics
      foreach accession (AE017042 AE016879 AL111168 AL645882 AP008232 AP009048 BA000021 CP000034)
        perl -ne 'if(/^# CDS_(\d+)-(\d+)_DIR([+-]).*cleavage=/) \
                  {print "insert into localization (user, protid, type) \
                                 select USER() as user, ft.id as protid, \"lipop\" as type \
                                        from features as ft \
                                        where ft.user=USER() and \
                                              ft.accession = \"'$accession'\" and \
                                              ft.start = $1 and ft.stop = $2 and \
                                              ft.dir = \"$3\";\n";}' $accession.lipop.gff \
        | mysql -D cmp_genomics
      end
      
    4. Again take a look at the number of predicted lipoprotein cleavage sites in the database.
  5. Taking a look at the predictions

      Now let us take a look at how many predictions we have of each type for each organism.
      mysql
      mysql> use cmp_genomics;
      mysql> select accession, count(*) from features as ft, localization as lo \
                    where ft.user = user() and \
                          ft.id = lo.protid and \
                          ft.user = lo.user and \
                          lo.type = "tmhmm" \
                    group by accession;
      
      Repeat this for all the different types.

      Does this results correspond with the biological knowledge you have of the organisms?