Events News Research CBS CBS Publications Bioinformatics
Staff Contact About Internal CBS CBS Other

Exercise M16


OBJECTIVES

To get familiar with methods for assigning protein function by using Hidden Markov Models.


  1. LOG IN

    Open a session on 'genome.cbs.dtu.dk', and from this login to ibiology and create directories for you to work in:
    ssh -X ibiology
    mkdir M16
    cd M16
    mkdir source
    
  2. PREDICTING SIMGA FACTORS

    We will start out by using an HMM to query the proteins that you have predicted in Exercise M2 (and extracted in M11). We have prepared three HMMs from alignments of three Sigma factors, σ 54, σ 70, and ECF. The HMMs you are going to use have been calibrated in such a way, that alignment scores greater than zero are considered significant (or at least better than the NULL model)

    You will be using the database table brazilworkshop06.features to store your category asignments (by using the 'note' field.

    But we will start out by generating hmmsearch reports of the protein files you generated in Exercise Module M11
    1. foreach accession (AE017042 AP008232 AP009048 BA000021 CP000034 CP000036 CP000243 CP000247)
         foreach type (s54 s70 ecf)
           /usr/cbs/tools/lib/make/bin/hmmsearch.Linux /usr/cbs/tools/lib/make/pfh/HMMsigma/$type.hmm \
               ../M11/source/$accession.proteins.fsa > source/$accession.proteins.$type.sigmas.hmmsearch &
         end
      end
      wait
       
    2. How many HMM searches have you initiated? Try first and view the content of a hmmsearch report ( use spacebar to scroll and 'q' to quit ). In this report, you should find one protein with a score better than zero:
    3. less source/AE017042.proteins.ecf.sigmas.hmmsearch
       
    4. Ok. Now you have the hmmsearch reports, but you need to extract the feature names. The following command will filter the report and extract the gene names of the significant results and place a comment in the database table.
    5. mysql -B -e "update brazilworkshop06.features set note = '' where user = USER() and note not like 'tcs%' or note is null"
      foreach accession (AE017042 AP008232 AP009048 BA000021 CP000034 CP000036 CP000243 CP000247)
         foreach type (ecf s54 s70)
           cat source/$accession.proteins.$type.sigmas.hmmsearch | \
           perl -ne 'next unless /^CDS_(\d+)\-(\d+)_DIR([\-\+]+)\s+([0-9\-\.e]+)\s+([0-9\-\.e]+)\s+/;my ($start,$stop,$dir,$score,$evalue) = ($1,$2,$3,$4,$5);next unless $score > 0; print "update brazilworkshop06.features set note = \"Sigma Factor '$type'\" where start=$start and stop = $stop and user = user() and accession = \"'$accession'\";\n";' | mysql
         end
      end
      
       
    6. Start the mysql client and try and review all the predictions you made by querying the MySQL table:
    7. mysql
      mysql>select start,stop,dir,note from brazilworkshop06.features where user=USER() and note like 'Sigma Factor%' and accession = 'CP000036' order by start;
       
    8. And it probably does not come as a surprise that we can easily count the occurences in each genome by a single query:
    9. mysql>select features.note,genomes.accession,genomes.organism,count(*) from brazilworkshop06.features,brazilworkshop06.genomes where genomes.accession=features.accession and genomes.user = features.user and  features.user=USER() and note like 'Sigma Factor%' group by note,genomes.accession order by note,genomes.accession;
       
    10. The following query will access the MySQL table of the Genome Atlas Database. It will allow you to see the average number of σ54,σ70, and ECF Sigma Factors found in all the Bacterial species and the Bacterial phyla.
    11. Grouped by Species
      mysql>select concat(genus,' ',species),count(*) as Count,avg(SIGMA54_PROMOTERS_COUNT) as AvgS54Count, avg(SIGMA70_PROMOTERS_COUNT) as AvgS70Count,avg(SIGMAECF_PROMOTERS_COUNT) as AvgECFCount  from genomeatlas.atlasdb where kingdom='bacteria' and ( segment = 'main' or segment >= 1) group by genus,species;
      
    12. Grouped by Phyla
      mysql>select TAXLONG_PHYLA,count(*) as Count,avg(SIGMA54_PROMOTERS_COUNT) as AvgS54Count,avg(SIGMA70_PROMOTERS_COUNT) as AvgS70Count,avg(SIGMAECF_PROMOTERS_COUNT) as AvgECFCount from genomeatlas.atlasdb where kingdom='bacteria' and ( segment = 'main' or segment >= 1) group by TAXLONG_PHYLA;
      mysql>exit
        
    13. How does this compare with your results?
  3. PREDICTING TWO COMPONENT SYSTEMS

    The second part of todays exercise is about predicting two-component systems.

    We will use the same procedure like with the sigma factors.
    1. foreach accession (AE017042 AP008232 AP009048 BA000021 CP000034 CP000036 CP000243 CP000247)
         foreach type (RRreciever HisKA_1 HisKA_2 HisKA_3 HWE_HK)
           /usr/cbs/tools/lib/make/bin/hmmsearch.Linux --cut_tc /usr/cbs/tools/lib/make/pfh/TCS/$type.hmm \
             ../M11/source/$accession.proteins.fsa > source/$accession.$type.TCS.hmmsearch &
         end
      end
      wait
      				
    2. Try first and view the content of a hmmsearch report ( use spacebar to scroll and 'q' to quit ). In this report, you should find several proteins with a score better than zero:
    3. less source/AE017042.RRreciever.TCS.hmmsearch
       
    4. The following command will filter the report and extract the gene names of the significant results and place a comment in the database table:
    5. mysql -B -e "update brazilworkshop06.features set note = '' where user = USER() and note not like 'sigma%' or note is null"
      foreach accession (AE017042 AP008232 AP009048 BA000021 CP000034 CP000036 CP000243 CP000247)
         foreach type (RRreciever HisKA_1 HisKA_2 HisKA_3 HWE_HK)
           cat source/$accession.$type.TCS.hmmsearch | \
           perl -ne 'next unless /^CDS_(\d+)\-(\d+)_DIR([\-\+]+)\s+([0-9\-\.e]+)\s+([0-9\-\.e]+)\s+/;my ($start,$stop,$dir,$score,$evalue) = ($1,$2,$3,$4,$5);next unless $score > 0; print "update brazilworkshop06.features set note = \"TCS '$type'\" where start=$start and stop = $stop and user = user() and accession = \"'$accession'\";\n";' | mysql -B
         end
      end
       
    6. Exactly like we did before, listing the predicted two-component systems you have predicted of a given accession number, here CP000036)
    7. mysql
      mysql>select start,stop,dir,note from brazilworkshop06.features where user=USER() and note like 'TCS%' and accession = 'CP000036' order by start;
      
       
    8. Exactly like we did before, cross-referencing the organism names and the features you predicted
    9. mysql>select features.note,genomes.accession,organism,count(*) from brazilworkshop06.features,brazilworkshop06.genomes where  genomes.accession=features.accession and genomes.user = features.user and  features.user=USER() and note like 'TCS%'  group by note,genomes.accession order by genomes.accession,note;
       
    10. The following query will access the MySQL table of the Genome Atlas Database. It will allow you to see the average number of TCS' in all the Bacterial species and the Bacterial phyla.
    11. Grouped by Species
      mysql>select concat(genus,' ',species),count(*) as Count,avg(HISKA_1_COUNT) as HisKA1Count, avg(HISKA_2_COUNT) as HisKA2Count, avg(HISKA_3_COUNT) as HisKA3Count,avg(HWE_HK_COUNT) as HWEHKCount,avg(RRRECIEVER_COUNT) as RRrecieverCount  from genomeatlas.atlasdb where kingdom='bacteria' and ( segment = 'main' or segment >= 1) group by genus,species;
       
    12. Grouped by Phyla
      mysql>select TAXLONG_PHYLA,count(*) as Count,avg(HISKA_1_COUNT) as HisKA1Count, avg(HISKA_2_COUNT) as HisKA2Count, avg(HISKA_3_COUNT) as HisKA3Count,avg(HWE_HK_COUNT) as HWEHKCount,avg(RRRECIEVER_COUNT) as RRrecieverCount from genomeatlas.atlasdb where kingdom='bacteria' and ( segment = 'main' or segment >= 1) group by TAXLONG_PHYLA;
      mysql>exit
        
    13. How does this compare with your results?