Saco Patterns
Reads promoter sequences (DNA) from a 'file' in FASTA format and
counts the k-tuple frequencies in each data set.
Two kinds of statistics can be done on the patterns, and the
output depends on which is chosen.
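The counting step can be sketched roughly as follows. This is a minimal illustration, not the server's actual code: the function names read_fasta and count_sequences_with_ktuple are made up here, only the given strand is scanned, and a pattern is counted once per sequence no matter how often it occurs in that sequence.

```python
from collections import Counter


def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def count_sequences_with_ktuple(sequences, k):
    """For each k-tuple, count the sequences in which it occurs at least once."""
    counts = Counter()
    for seq in sequences:
        # A set ensures each k-tuple is counted once per sequence.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(seen)
    return counts


fasta_text = ">promoter_a\nACGT\nAC\n>promoter_b\nACAC\n"
records = read_fasta(fasta_text)
counts = count_sequences_with_ktuple([seq for _, seq in records], k=2)
```

Here counts maps each 2-tuple to the number of promoters containing it, which is the quantity both statistics below work from.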
When doing
Kolmogorov-Smirnov statistics, five columns of output are
produced: pattern, number of sequences in which the pattern occurred, the
higher KS-ratio, the lower KS-ratio, and KS-direction.
The higher KS-ratio is defined as:
max over i of |xi/n - i/N|
---------------------------
L * sqrt((n + N) / (n * N))
N is the number of sequences in the set, n the number of sequences
containing the pattern, xi the number of sequences of rank up to i
that contain the pattern, and L is the length of the pattern. This is
the same value that is used for filtering when using Kolmogorov-Smirnov statistics.
The KS-direction indicates how the distribution is skewed. A '+'
means that the pattern in question is skewed towards the end of the data
set, while a '-' means the pattern occurs preferentially at the
beginning of the data set. When using the KS-direction, one should
always check the lower KS-ratio. If it is close to the higher
KS-ratio, the variance of the pattern distribution is different from
the reference.
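As an illustration, the higher KS-ratio defined above could be computed along these lines. This is a sketch under stated assumptions: the input is a list of booleans in rank order (True where the sequence at that rank contains the pattern), the normalising term is read as sqrt((n+N)/(n*N)), and the function name ks_ratio is hypothetical. The lower KS-ratio and the KS-direction are not computed, since their exact definitions are not given here.

```python
import math


def ks_ratio(contains_pattern, pattern_length):
    """Return max_i |xi/n - i/N| / (L * sqrt((n + N) / (n * N))).

    contains_pattern: booleans in rank order, one per sequence.
    pattern_length:   L, the length of the pattern.
    """
    N = len(contains_pattern)   # number of sequences in the set
    n = sum(contains_pattern)   # number of sequences containing the pattern
    if N == 0 or n == 0:
        raise ValueError("need at least one sequence containing the pattern")

    xi = 0        # sequences of rank up to i that contain the pattern
    d_max = 0.0   # largest deviation seen so far
    for i, has_pattern in enumerate(contains_pattern, start=1):
        if has_pattern:
            xi += 1
        d_max = max(d_max, abs(xi / n - i / N))

    # Assumption: normalisation read as sqrt((n + N) / (n * N)).
    return d_max / (pattern_length * math.sqrt((n + N) / (n * N)))


# Pattern of length 2 found only in the two top-ranked of four sequences.
ratio = ks_ratio([True, True, False, False], pattern_length=2)
```

In this small case the largest deviation is 0.5, reached at rank 2, and the ratio is 0.5 divided by 2*sqrt(6/8).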
Alternatively, a
hypergeometric statistic can be used. The hypergeometric statistic tests for differences between a positive and a negative set. Here we use a negative set consisting of all Arabidopsis thaliana promoters (800 bp).
Example: if we have a pool of elements of two kinds and randomly take out a number of elements from this pool, there is a certain chance of getting a certain number of each kind, depending on the composition of the original pool. If we know the composition of the original pool, we can calculate the probability of a given draw from it.
In this case, the two kinds of elements are promoters with and without a given pattern.
The output from the hypergeometric test is: pattern, -log(p), number of promoters in the positive set with the pattern, and number of promoters in the negative set with the pattern.
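The draw-from-a-pool idea can be sketched as below. Several things are assumptions made for the illustration and are not stated by this page: the positive set is treated as a draw from the negative set (the pool), the p-value is the upper tail (probability of seeing at least as many promoters with the pattern), and the logarithm is base 10. The function names are hypothetical.

```python
import math
from fractions import Fraction


def hypergeom_p_value(pos_with, pos_total, neg_with, neg_total):
    """P(X >= pos_with) when pos_total promoters are drawn from a pool of
    neg_total promoters, neg_with of which contain the pattern."""
    total = math.comb(neg_total, pos_total)
    upper = min(pos_total, neg_with)
    tail = sum(
        math.comb(neg_with, j) * math.comb(neg_total - neg_with, pos_total - j)
        for j in range(pos_with, upper + 1)
    )
    # Exact rational arithmetic avoids underflow for very small p-values.
    return Fraction(tail, total)


def neg_log10(p):
    """-log10(p) for a Fraction p; infinite when p is zero."""
    if p == 0:
        return math.inf
    return math.log10(p.denominator) - math.log10(p.numerator)


# Pool of 10 promoters, 4 with the pattern; positive set of 3, 2 with the pattern.
p = hypergeom_p_value(pos_with=2, pos_total=3, neg_with=4, neg_total=10)
score = neg_log10(p)
```

In this toy case p is (6*6 + 4*1)/120 = 1/3, so the score is log10(3). Keeping the p-value as an exact fraction until the logarithm is taken is a design choice that keeps strongly over-represented patterns from collapsing to zero.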
Gibbs Sampler
The Gibbs motif sampler stochastically examines candidate alignments
in an effort to find the best alignment as measured by the maximum
a posteriori (MAP) log-likelihood ratio. Note that, because it is
a stochastic method, there will be variations between the best
alignments found with different random seeds - especially for subtle
motifs. Consequently, it is often helpful to first run the sampler
several times to see whether it usually converges on the same
alignment each time. If it does not, as is typically the case for
very subtle motifs, it may be necessary to perform a large number
of independent searches in conjunction with a sufficient number
of "near-optimum" samples (specified by the "iterations pr. run" option).
For instance, the alignment of subtle porin repeats described in
Neuwald, Liu & Lawrence was obtained as the best found out of 1000
independent searches - 100 runs of the Gibbs program each performing
10 independent runs (the number of distinct runs is specified
using the "runs" option).
While picking the best result out of many independent runs
is no guarantee of obtaining the optimum alignment, it will
substantially increase the chance of finding the optimum (or a nearly
optimum) alignment in such cases. In any case, it should be noted
that most suboptimum alignments found by the sampler are closely
related to the optimum alignment.
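The best-of-many-runs strategy described above can be illustrated with a small sketch. The function toy_search is a made-up stand-in for one sampler run and has nothing to do with the real Gibbs program; best_of_runs is likewise a hypothetical helper. The point is only the pattern: run with several seeds, keep the highest-scoring result, and check how often the runs agree with it.

```python
import random
from collections import Counter


def toy_search(seed):
    """Hypothetical stand-in for one stochastic run: returns (score, alignment)."""
    rng = random.Random(seed)
    # Pretend an alignment is three start positions; the best choice is (2, 2, 2).
    alignment = tuple(rng.randrange(5) for _ in range(3))
    score = -sum((pos - 2) ** 2 for pos in alignment)
    return score, alignment


def best_of_runs(search, seeds):
    """Run search once per seed; return the best result and how often it recurred."""
    results = [search(seed) for seed in seeds]
    best_score, best_alignment = max(results, key=lambda r: r[0])
    # Fraction of runs that ended on exactly the best alignment.
    hits = Counter(alignment for _, alignment in results)[best_alignment]
    return best_score, best_alignment, hits / len(results)


seeds = list(range(20))
best_score, best_alignment, agreement = best_of_runs(toy_search, seeds)
```

A low agreement value suggests the runs do not converge on one alignment, which is the situation where more independent searches are worthwhile.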
GETTING HELP
Server problems: