Automatic discovery of regulatory patterns in promoter regions based on whole cell expression data and functional annotation.
Jensen LJ, Knudsen S.
Bioinformatics. 2000 Apr;16(4):326-33.
Center for Biological Sequence Analysis.
MOTIVATION: The whole genomes submitted to GenBank contain valuable information about the function of genes as well as the upstream sequences, and whole-cell expression data provide valuable information on gene regulation. To utilize these large amounts of data for a biological understanding of the regulation of gene expression, new automatic methods for pattern finding are needed. RESULTS: Two word-analysis algorithms for automatic discovery of regulatory sequence elements have been developed. We show that sequence patterns correlated with whole-cell expression data can be found using Kolmogorov-Smirnov tests on the raw data, thereby eliminating the need for clustering co-regulated genes. Regulatory elements have also been identified by systematic calculations of the significance of correlations between words found in the functional annotation of genes and DNA words occurring in their promoter regions. Application of these algorithms to the Saccharomyces cerevisiae genome and publicly available DNA array data sets revealed a highly conserved 9-mer occurring in the upstream regions of genes coding for proteasomal subunits. Several other putative and known regulatory elements were also found. AVAILABILITY: Upon request.
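The Kolmogorov-Smirnov idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `ks_statistic` and `word_ks`, the toy gene identifiers, the sequences, the expression values, and the word "GGAAA" are all hypothetical. The sketch assumes one expression value per gene and splits genes by whether a DNA word occurs in their upstream sequence, then measures the largest gap between the two empirical distributions. Significance estimation and multiple-testing correction are left out.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D.

    D is the largest absolute difference between the empirical
    cumulative distribution functions of the two samples.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The empirical CDFs only change at observed values, so it is
    # enough to compare them at every observed value.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def word_ks(upstream, expression, word):
    """KS statistic for genes with vs. without `word` upstream.

    `upstream` maps gene id -> upstream sequence (hypothetical input),
    `expression` maps gene id -> one raw expression value.
    No clustering of genes is performed.
    """
    with_word = [expression[g] for g, s in upstream.items() if word in s]
    without = [expression[g] for g, s in upstream.items() if word not in s]
    return ks_statistic(with_word, without)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    upstream = {
        "g1": "TTGGAAAT",
        "g2": "CCGGAAACC",
        "g3": "ACACACAC",
        "g4": "TTTTTTTT",
    }
    expression = {"g1": 2.0, "g2": 1.5, "g3": -0.5, "g4": 0.1}
    print(word_ks(upstream, expression, "GGAAA"))
```

In a real analysis every candidate word would be scored this way, and the resulting statistics converted to p-values and corrected for the number of words tested.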
Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment.
Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC.
Science. 1993 Oct 8;262(5131):208-14.
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894.
A wealth of protein and DNA sequence data is being generated by genome projects and other sequencing efforts. A crucial barrier to deciphering these sequences and understanding the relations among them is the difficulty of detecting subtle local residue patterns common to multiple sequences. Such patterns frequently reflect similar molecular structures and biological properties. A mathematical definition of this "local multiple alignment" problem suitable for full computer automation has been used to develop a new and sensitive algorithm, based on the statistical method of iterative sampling. This algorithm finds an optimized local alignment model for N sequences in N-linear time, requiring only seconds on current workstations, and allows the simultaneous detection and optimization of multiple patterns and pattern repeats. The method is illustrated as applied to helix-turn-helix proteins, lipocalins, and prenyltransferases.
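The iterative sampling strategy described in the abstract above can be sketched in a much reduced form. The code below is a simplified site sampler, not the published algorithm: it assumes exactly one motif occurrence per sequence and a fixed motif width, and it omits multiple patterns, pattern repeats, and any optimization step. The function name `gibbs_motif_sampler` and its parameters (`iterations`, `pseudocount`, `seed`) are assumptions made for this illustration.

```python
import random


def gibbs_motif_sampler(seqs, width, iterations=200, pseudocount=1.0, seed=0):
    """Return one sampled motif start position per sequence.

    Simplifying assumptions: one occurrence per sequence, fixed width,
    every sequence at least `width` residues long.
    """
    rng = random.Random(seed)
    alphabet = sorted(set("".join(seqs)))
    total = sum(len(s) for s in seqs)
    # Background residue frequencies estimated from all sequences.
    background = {c: sum(s.count(c) for s in seqs) / total for c in alphabet}
    # Start from random motif positions.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]

    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Build a position-specific count model from every sequence
            # except the one being resampled.
            counts = [{c: pseudocount for c in alphabet} for _ in range(width)]
            for j, other in enumerate(seqs):
                if j == i:
                    continue
                for k in range(width):
                    counts[k][other[starts[j] + k]] += 1
            denom = (len(seqs) - 1) + pseudocount * len(alphabet)

            # Weight each candidate start by the ratio of the motif model
            # probability to the background probability of the segment.
            weights = []
            for pos in range(len(seq) - width + 1):
                w = 1.0
                for k in range(width):
                    c = seq[pos + k]
                    w *= (counts[k][c] / denom) / background[c]
                weights.append(w)

            # Sample the new start in proportion to the weights.
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    toy = ["TTTACGTACTT", "GGACGTACGGG", "CACGTACCCCC"]
    print(gibbs_motif_sampler(toy, width=6))
```

Because the positions are sampled rather than chosen greedily, a run can leave a poor local alignment, which is the property that makes the approach sensitive to subtle patterns. Each sweep visits every sequence once, so the cost per sweep grows linearly with the number of sequences.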
Gibbs motif sampling: detection of bacterial outer membrane protein repeats.
Neuwald AF, Liu JS, Lawrence CE.
Protein Sci. 1995 Aug;4(8):1618-32.
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA.
The detection and alignment of locally conserved regions (motifs) in multiple sequences can provide insight into protein structure, function, and evolution. A new Gibbs sampling algorithm is described that detects motif-encoding regions in sequences and optimally partitions them into distinct motif models; this is illustrated using a set of immunoglobulin fold proteins. When applied to sequences sharing a single motif, the sampler can be used to classify motif regions into related submodels, as is illustrated using helix-turn-helix DNA-binding proteins. Other statistically based procedures are described for searching a database for sequences matching motifs found by the sampler. When applied to a set of 32 very distantly related bacterial integral outer membrane proteins, the sampler revealed that they share a subtle, repetitive motif. Although BLAST (Altschul SF et al., 1990, J Mol Biol 215:403-410) fails to detect significant pairwise similarity between any of the sequences, the repeats present in these outer membrane proteins, taken as a whole, are highly significant (based on a generally applicable statistical test for motifs described here). Analysis of bacterial porins with known trimeric beta-barrel structure and related proteins reveals a similar repetitive motif corresponding to alternating membrane-spanning beta-strands. These beta-strands occur on the membrane interface (as opposed to the trimeric interface) of the beta-barrel. The broad conservation and structural location of these repeats suggest that they play important functional roles.
CORRESPONDENCE FOR SACO PATTERNS
Lars Juhl Jensen: