|
Score sequence data with a PSSM
Description
Posisiton specific scoring matices (PSSM) are more statistially motivated sequence motif
models that provide higher sensitivity and specificity than regular expressions or ad hoc
approaches (see project 08).
Input and output
The program is given a fasta file with DNA or amino acid sequences, a PSSM and a score threshold.
You can find weight matices that describe the DNA binding preference for transcription
factors in the database
TRANSFAC.
You will have to register with TRANSFAC to gain access but it is free.
The units for these weights are often unknown and as such should be interpreted as
un-normalized frequencies.
The output should be the sites that score above the provided threshold.
Site scores are calculated as the sum of PSSM weights selected at each position.
Statistical analysis
We want to score every possible site (sub-sequence) with our log-likelihood PSSM.
How to get a log-likelihood PSSM?, see below.
Frequencies can be interptreted as letter (nucleotide or amino acid)
probabilities for a particular letter at a particular position in the matrix. The letter background
probabilities, i.e. probability of a letter not within the pattern, are aften called prior
probabilities (priors), i.e. what we would expect before we knew about the motif.
The priors will depend on the organism and how the background is defined.
In E.coli, for example, the genomic nucleotide priors are almost equiprobable
(p_A:0.25, p_C:0.25, p_G:0.25, p_T:0.25) but in yeast they are closer to
p_A:0.31, p_C:0.19, p_G:0.19, p_T:0.31 (note that p_A==p_T and p_C==p_G).
TRANSFAC example M00204 for GCN4 (yeast, Saccharomyces cerevisiae transcription factor).
PO A C G T
01 2.69 2.46 1.74 0.81 N
02 5.18 0.00 2.52 0.00 A
03 4.30 0.00 2.43 0.97 R
04 0.00 0.00 0.00 7.70 T
05 0.00 0.00 7.70 0.00 G
06 7.70 0.00 0.00 0.00 A
07 0.00 6.01 0.00 1.68 C
08 0.00 0.00 0.00 7.70 T
09 0.00 7.70 0.00 0.00 C
10 3.40 1.65 0.00 1.88 W
You may also wish to generate an alignment counts matrix from sequence examples like these,
GGCTTACAAAC
TTATTACTAAT
TGCTTATAAAT
CGCTGACTAAT
CGCTGACTAAT
TGCTTACTAGT
TGCTTACTAGT
TGATTAATAAT
TGATTAATAAT
TGCTGACTAAA
TTTTTACTAAT
TTTTTACTAAT
But as there are potentially zeros in these matrices (not very likely reflecting reality since
we typically have a small sample and there are only 4 letters) we need to
add pseudo-counts to avoid observed frequencies (o_A, o_C, o_G, o_T) being zero.
Probability matrix of the above with 1 added pseudo-count, q=1
where, for position i,
o_iA += p_A*q
o_iC += p_C*q
...
and
f_iA = o_iA/sum_j(o_ij)
f_iC = o_iC/sum_j(o_ij)
...
So matrix F looks like this,
PO A C G T
01 0.338 0.311 0.229 0.122
02 0.624 0.029 0.318 0.029
03 0.523 0.029 0.308 0.140
04 0.029 0.029 0.029 0.914
05 0.029 0.029 0.914 0.029
06 0.914 0.029 0.029 0.029
07 0.029 0.720 0.029 0.222
08 0.029 0.029 0.029 0.914
09 0.029 0.914 0.029 0.029
10 0.460 0.240 0.032 0.269
A probability matrix is then converted to a log-likelihood matrix
h_iA = log2(f_iA/p_A)
h_iC = log2(f_iC/p_C)
...
and matrix H looks like this,
PO A C G T
01 0.435 0.317 -0.128 -1.037
02 1.320 -3.121 0.349 -3.121
03 1.065 -3.121 0.301 -0.834
04 -3.121 -3.121 -3.121 1.870
05 -3.121 -3.121 1.870 -3.121
06 1.870 -3.121 -3.121 -3.121
07 -3.119 1.527 -3.119 -0.171
08 -3.121 -3.121 -3.121 1.870
09 -3.121 1.870 -3.121 -3.121
10 0.881 -0.061 -2.987 0.104
The fasta file can contain more than one sequence/entry.
|