
Description



Saco Patterns


Reads promoter sequences (DNA) from a 'file' in FASTA format and counts the k-tuple frequencies in each data set.
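As an illustration of the counting step, here is a minimal Python sketch. It is not the server's actual code: the function names `read_fasta` and `sequences_with_ktuple` are hypothetical, and it assumes that what is counted for each k-tuple is the number of sequences containing it at least once.

```python
from collections import Counter


def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def sequences_with_ktuple(sequences, k):
    """For each k-tuple, count the sequences that contain it at least once."""
    counts = Counter()
    for seq in sequences:
        # A set ensures each pattern is counted only once per sequence.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(seen)
    return counts


# Hypothetical two-promoter data set, used only for illustration.
fasta = ">p1\nACGTAC\n>p2\nTTACGA\n"
seqs = [s for _, s in read_fasta(fasta)]
counts = sequences_with_ktuple(seqs, 3)
```

With this toy input, `counts["ACG"]` is 2 because both sequences contain the pattern, while `counts["CGT"]` is 1.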

Two kinds of statistics can be computed for the patterns, and the output depends on which one is chosen.

When using Kolmogorov-Smirnov statistics, five columns of output are produced: pattern, number of sequences in which the pattern occurred, the higher KS-ratio, the lower KS-ratio, and KS-direction. The higher KS-ratio is defined as:

$$
\frac{\max_i \left| \frac{x_i}{n} - \frac{i}{N} \right|}{L \sqrt{\frac{n + N}{n N}}}
$$

N is the number of sequences in the set, n the number of sequences containing the pattern, x_i the number of sequences of rank up to i that contain the pattern, and L the length of the pattern. This is the same value that is used for filtering when using Kolmogorov-Smirnov statistics.
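The ratio can be sketched in Python as follows. This is a minimal illustration, not the server's implementation: the function name `ks_ratio` is hypothetical, and the denominator is read as L*sqrt((n+N)/(n*N)), which is an assumption about the intended formula.

```python
import math


def ks_ratio(contains, pattern_length):
    """KS-type ratio for one pattern over a ranked list of sequences.

    contains[i] is True if the sequence of rank i + 1 contains the pattern.
    """
    N = len(contains)   # number of sequences in the set
    n = sum(contains)   # number of sequences containing the pattern
    if n == 0:
        return 0.0
    x = 0               # running count x_i of hits among ranks 1..i
    d_max = 0.0
    for i, hit in enumerate(contains, start=1):
        x += hit
        d_max = max(d_max, abs(x / n - i / N))
    # Assumed normalisation: L * sqrt((n + N) / (n * N)).
    return d_max / (pattern_length * math.sqrt((n + N) / (n * N)))


# Hypothetical ranking in which the pattern occurs only in the top two sequences.
ratio = ks_ratio([True, True, False, False], pattern_length=1)
```

In this toy case the largest deviation is 0.5 (reached at rank 2), which is then divided by the normalisation term.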

The KS-direction indicates how the distribution is skewed. A '+' means that the pattern in question is skewed towards the end of the data set, while a '-' means that the pattern occurs preferentially at the beginning of the data set. When using the KS-direction, always check the lower KS-ratio as well. If it is close to the higher KS-ratio, the variance of the pattern distribution differs from that of the reference.

Alternatively, a hypergeometric statistic can be used. The hypergeometric statistic tests for differences between a positive and a negative set. Here we use a negative set consisting of all Arabidopsis thaliana promoters (800 bp). Example: if we have a pool of elements of two kinds and randomly take out a number of elements from this pool, there is a certain chance of getting a certain number of each kind, depending on the composition of the original pool. If we know the content of the original pool, we can estimate the chance of a given draw from it. In this case, the two kinds of elements are promoters with and without a given pattern. The output from the hypergeometric test is: pattern, -log(p), number of promoters in the positive set with the pattern, and number of promoters in the negative set with the pattern.
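The pool-and-draw example can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, the pool is taken to be the negative set, the draw is taken to be the positive set, p is taken as the upper-tail probability, and the logarithm is taken to base 10. The text does not specify any of these choices.

```python
import math


def hypergeom_upper_tail(pool_size, pool_hits, draw_size, draw_hits):
    """P(X >= draw_hits) when drawing draw_size items without replacement
    from a pool of pool_size items, pool_hits of which carry the pattern."""
    total = math.comb(pool_size, draw_size)
    upper = min(pool_hits, draw_size)
    tail = sum(
        math.comb(pool_hits, j) * math.comb(pool_size - pool_hits, draw_size - j)
        for j in range(draw_hits, upper + 1)
    )
    return tail / total


def minus_log_p(pool_size, pool_hits, draw_size, draw_hits):
    """-log10(p); the base of the logarithm is an assumption."""
    p = hypergeom_upper_tail(pool_size, pool_hits, draw_size, draw_hits)
    return math.inf if p == 0 else -math.log10(p)


# Hypothetical numbers: 10 promoters in the pool, 4 with the pattern;
# 3 promoters drawn, 2 of them with the pattern.
p = hypergeom_upper_tail(10, 4, 3, 2)
score = minus_log_p(10, 4, 3, 2)
```

Here 40 of the 120 possible draws contain at least two promoters with the pattern, so p is 1/3.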

Gibbs Sampler

The Gibbs motif sampler stochastically examines candidate alignments in an effort to find the best alignment as measured by the maximum a posteriori (MAP) log-likelihood ratio. Note that, because it is a stochastic method, there will be variations between the best alignments found with different random seeds - especially for subtle motifs. Consequently, it is often helpful to first run the sampler several times to see whether it usually converges on the same alignment each time. If it does not, as is typically the case for very subtle motifs, it may be necessary to perform a large number of independent searches in conjunction with a sufficient number of "near-optimum" samples (specified by the "iterations pr. run" option). For instance, the alignment of subtle porin repeats described in Neuwald, Liu & Lawrence was obtained as the best found out of 1000 independent searches - 100 runs of the Gibbs program each performing 10 independent runs (the number of distinct runs is specified using the "runs" option).
While picking the best result out of many independent runs is no guarantee of obtaining the optimum alignment, it will substantially increase the chance of finding the optimum (or a nearly optimum) alignment in such cases. In any case, it should be noted that most suboptimum alignments found by the sampler are closely related to the optimum alignment.
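The strategy of keeping the best result from many independently seeded runs can be sketched as follows. This is not the Gibbs sampler itself: the search below is a simple stochastic hill climber with a toy conservation score, swapped in only to keep the example short, and all names (`alignment_score`, `stochastic_search`, `best_of_runs`) are hypothetical.

```python
import random

BASES = "ACGT"


def alignment_score(seqs, starts, width):
    """Toy conservation score: per column, the count of the most common base."""
    score = 0
    for col in range(width):
        column = [s[p + col] for s, p in zip(seqs, starts)]
        score += max(column.count(b) for b in BASES)
    return score


def stochastic_search(seqs, width, seed, proposals=200):
    """One independent run: random single-site moves, keeping non-worsening ones."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    best = alignment_score(seqs, starts, width)
    for _ in range(proposals):
        i = rng.randrange(len(seqs))
        candidate = list(starts)
        candidate[i] = rng.randrange(len(seqs[i]) - width + 1)
        score = alignment_score(seqs, candidate, width)
        if score >= best:
            starts, best = candidate, score
    return best, starts


def best_of_runs(seqs, width, seeds):
    """Run the search once per seed and keep the highest-scoring alignment."""
    return max(stochastic_search(seqs, width, seed) for seed in seeds)


# Hypothetical sequences that share the motif GACGT.
seqs = ["TTGACGTAA", "CCGACGTCC", "AGACGTTTT"]
best, starts = best_of_runs(seqs, 5, range(10))
```

Comparing the scores returned for different seeds also gives a quick check of whether the search usually converges on the same alignment.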


GETTING HELP

Server problems: