|
Exercise M16
OBJECTIVES
To get familiar with methods for assigning protein function
by using Hidden Markov Models.
|
-
LOG IN
Open a session on 'genome.cbs.dtu.dk', and from this login to ibiology
and create directories for you to work in:
ssh -X ibiology
mkdir M16
cd M16
mkdir source
-
PREDICTING SIMGA FACTORS
We will start out by using an HMM to query the proteins
that you have predicted in Exercise M2 (and extracted in M11).
We have prepared three HMMs from alignments of
three Sigma factors, σ 54, σ 70, and ECF.
The HMMs you are going to use have been calibrated in such a way,
that alignment scores greater than zero are considered significant (or at least better than the NULL model)
You will be using the database table brazilworkshop06.features
to store your category asignments (by using the 'note' field.
But we will start out by generating hmmsearch reports of the
protein files you generated in Exercise Module M11
|
-
foreach accession (AE017042 AP008232 AP009048 BA000021 CP000034 CP000036 CP000243 CP000247)
foreach type (s54 s70 ecf)
/usr/cbs/tools/lib/make/bin/hmmsearch.Linux /usr/cbs/tools/lib/make/pfh/HMMsigma/$type.hmm \
../M11/source/$accession.proteins.fsa > source/$accession.proteins.$type.sigmas.hmmsearch &
end
end
wait
|
How many HMM searches have you initiated?
Try first and view the content of a hmmsearch report
( use spacebar to scroll and 'q' to quit ). In this
report, you should find one protein with a score better than
zero:
|
-
less source/AE017042.proteins.ecf.sigmas.hmmsearch
|
Ok. Now you have the hmmsearch reports, but you need to extract the
feature names. The following command will filter the report and extract the gene
names of the significant results and place a comment in the database
table.
|
-
mysql -B -e "update brazilworkshop06.features set note = '' where user = USER() and note not like 'tcs%' or note is null"
foreach accession (AE017042 AP008232 AP009048 BA000021 CP000034 CP000036 CP000243 CP000247)
foreach type (ecf s54 s70)
cat source/$accession.proteins.$type.sigmas.hmmsearch | \
perl -ne 'next unless /^CDS_(\d+)\-(\d+)_DIR([\-\+]+)\s+([0-9\-\.e]+)\s+([0-9\-\.e]+)\s+/;my ($start,$stop,$dir,$score,$evalue) = ($1,$2,$3,$4,$5);next unless $score > 0; print "update brazilworkshop06.features set note = \"Sigma Factor '$type'\" where start=$start and stop = $stop and user = user() and accession = \"'$accession'\";\n";' | mysql
end
end
Start the mysql client and try and review all the predictions you made by querying the MySQL table:
-
mysql
mysql>select start,stop,dir,note from brazilworkshop06.features where user=USER() and note like 'Sigma Factor%' and accession = 'CP000036' order by start;
And it probably does not come as a surprise that we can easily count the occurences in each genome by a single query:
-
mysql>select features.note,genomes.accession,genomes.organism,count(*) from brazilworkshop06.features,brazilworkshop06.genomes where genomes.accession=features.accession and genomes.user = features.user and features.user=USER() and note like 'Sigma Factor%' group by note,genomes.accession order by note,genomes.accession;
The following query will access the MySQL table of the Genome Atlas Database. It will allow
you to see the average number of σ54,σ70, and ECF Sigma Factors found in all the Bacterial species and the Bacterial phyla.
- Grouped by Species
mysql>select concat(genus,' ',species),count(*) as Count,avg(SIGMA54_PROMOTERS_COUNT) as AvgS54Count, avg(SIGMA70_PROMOTERS_COUNT) as AvgS70Count,avg(SIGMAECF_PROMOTERS_COUNT) as AvgECFCount from genomeatlas.atlasdb where kingdom='bacteria' and ( segment = 'main' or segment >= 1) group by genus,species;
- Grouped by Phyla
mysql>select TAXLONG_PHYLA,count(*) as Count,avg(SIGMA54_PROMOTERS_COUNT) as AvgS54Count,avg(SIGMA70_PROMOTERS_COUNT) as AvgS70Count,avg(SIGMAECF_PROMOTERS_COUNT) as AvgECFCount from genomeatlas.atlasdb where kingdom='bacteria' and ( segment = 'main' or segment >= 1) group by TAXLONG_PHYLA;
mysql>exit
How does this compare with your results?
-
PREDICTING TWO COMPONENT SYSTEMS
The second part of todays exercise is about predicting two-component systems.
We will use the same procedure like with the sigma factors.
|
-
foreach accession (AE017042 AP008232 AP009048 BA000021 CP000034 CP000036 CP000243 CP000247)
foreach type (RRreciever HisKA_1 HisKA_2 HisKA_3 HWE_HK)
/usr/cbs/tools/lib/make/bin/hmmsearch.Linux --cut_tc /usr/cbs/tools/lib/make/pfh/TCS/$type.hmm \
../M11/source/$accession.proteins.fsa > source/$accession.$type.TCS.hmmsearch &
end
end
wait
Try first and view the content of a hmmsearch report ( use spacebar to scroll and 'q' to quit ). In this report, you should find
several proteins with a score better than zero:
-
less source/AE017042.RRreciever.TCS.hmmsearch
The following command will filter the report and extract the gene names of the significant results and place a comment in the database table:
-
mysql -B -e "update brazilworkshop06.features set note = '' where user = USER() and note not like 'sigma%' or note is null"
foreach accession (AE017042 AP008232 AP009048 BA000021 CP000034 CP000036 CP000243 CP000247)
foreach type (RRreciever HisKA_1 HisKA_2 HisKA_3 HWE_HK)
cat source/$accession.$type.TCS.hmmsearch | \
perl -ne 'next unless /^CDS_(\d+)\-(\d+)_DIR([\-\+]+)\s+([0-9\-\.e]+)\s+([0-9\-\.e]+)\s+/;my ($start,$stop,$dir,$score,$evalue) = ($1,$2,$3,$4,$5);next unless $score > 0; print "update brazilworkshop06.features set note = \"TCS '$type'\" where start=$start and stop = $stop and user = user() and accession = \"'$accession'\";\n";' | mysql -B
end
end
Exactly like we did before, listing the predicted two-component systems you have predicted of a given accession number, here CP000036)
-
mysql
mysql>select start,stop,dir,note from brazilworkshop06.features where user=USER() and note like 'TCS%' and accession = 'CP000036' order by start;
Exactly like we did before, cross-referencing the organism names and the features you predicted
-
mysql>select features.note,genomes.accession,organism,count(*) from brazilworkshop06.features,brazilworkshop06.genomes where genomes.accession=features.accession and genomes.user = features.user and features.user=USER() and note like 'TCS%' group by note,genomes.accession order by genomes.accession,note;
The following query will access the MySQL table of the Genome Atlas Database. It will allow
you to see the average number of TCS' in all the Bacterial species and the Bacterial phyla.
- Grouped by Species
mysql>select concat(genus,' ',species),count(*) as Count,avg(HISKA_1_COUNT) as HisKA1Count, avg(HISKA_2_COUNT) as HisKA2Count, avg(HISKA_3_COUNT) as HisKA3Count,avg(HWE_HK_COUNT) as HWEHKCount,avg(RRRECIEVER_COUNT) as RRrecieverCount from genomeatlas.atlasdb where kingdom='bacteria' and ( segment = 'main' or segment >= 1) group by genus,species;
- Grouped by Phyla
mysql>select TAXLONG_PHYLA,count(*) as Count,avg(HISKA_1_COUNT) as HisKA1Count, avg(HISKA_2_COUNT) as HisKA2Count, avg(HISKA_3_COUNT) as HisKA3Count,avg(HWE_HK_COUNT) as HWEHKCount,avg(RRRECIEVER_COUNT) as RRrecieverCount from genomeatlas.atlasdb where kingdom='bacteria' and ( segment = 'main' or segment >= 1) group by TAXLONG_PHYLA;
mysql>exit
How does this compare with your results?
|