NetStart: translation initiation site data sets
These data sets were used as the basis for training the NetStart prediction server described in
the paper "Neural Network Prediction of Translation Initiation Sites in Eukaryotes: Perspectives
for EST and Genome analysis", AG Pedersen and H Nielsen, Proceedings of the Fifth International
Conference on Intelligent Systems for Molecular Biology (ISMB97), 226-233, 1997 (PDF).
Please note that these data sets were constructed in 1997. It is now possible to compile much
larger data sets from current versions of GenBank and other databases. The data sets contain
3312 (vertebrates) and 523 (Arabidopsis) sequences, respectively, and have been "redundancy
reduced" to ensure unbiased statistics and validation: if one performs cross-validation, there
will be no closely related sequences in the test and training sets. Briefly, redundancy reduction
was done by performing all pairwise alignments of the sequences in the original set, comparing the
distribution of alignment scores to the one expected for a random set of sequences (which follows
an extreme value distribution), and finally removing sequences until the remaining ones are only
remotely related (meaning that their distribution of alignment scores resembles that of the
random set).
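The removal step can be illustrated with a minimal sketch. It is not the procedure used to build these data sets: it assumes the pairwise alignment scores are already computed, that a score cutoff has already been chosen (for instance from the extreme value fit, which is not shown here), and it uses a simple greedy heuristic that repeatedly drops the sequence with the most above-cutoff neighbours. The function name and the input layout are hypothetical.

```python
from typing import Dict, List, Tuple


def reduce_redundancy(n_seqs: int,
                      scores: Dict[Tuple[int, int], float],
                      cutoff: float) -> List[int]:
    """Return indices of sequences kept after greedy redundancy reduction.

    scores maps an index pair (i, j) to its pairwise alignment score.
    Two sequences count as "closely related" when their score exceeds
    cutoff (an assumed, externally chosen threshold).
    """
    # Build the neighbour sets: who is closely related to whom.
    neighbours = {i: set() for i in range(n_seqs)}
    for (i, j), score in scores.items():
        if score > cutoff:
            neighbours[i].add(j)
            neighbours[j].add(i)

    kept = set(range(n_seqs))
    while kept:
        # Pick the sequence with the most remaining neighbours
        # (ties are broken in favour of the lowest index).
        worst = max(kept, key=lambda k: (len(neighbours[k]), -k))
        if not neighbours[worst]:
            break  # no closely related pairs are left
        kept.remove(worst)
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
    return sorted(kept)
```

With four sequences where sequence 1 is closely related to both 0 and 2, removing sequence 1 alone is enough to leave only remotely related sequences.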
File Format
Files have been gzipped and need to be decompressed before
use. Data is in the so-called HOW-format used here at CBS. It's quite simple and should be easy to
convert. Briefly, the format consists of three parts (see example below):
- A line containing the sequence length, sequence ID and comments
- The sequence itself
- Annotation
The annotation consists of a sequence of annotation codes, one letter for
each nucleotide. In this data set we have used the following annotation
codes:
. : Un-annotated DNA (usually UTR in this set).
M : Untranslated exon.
i : Start codon (first nucleotide <=> the A in ATG).
E : Coding (translated) exon.
Example of entry:
299 BTLACTA.1 CAT X06366 Bos taurus
CAATGTTTCTTTGTTGGTTTTACTGGCCTCTCTTGTCATCCTCTTCCTGGATGTAAGGCTTGATGCCAGGGCCCCTAAGG 80
CTTTTTCCACAAATAAAAGGAGGTGAGCAGTGTGGTGACCCCATTTCAGAATCTTGGGGGGTCACCAAAATGATGTCCTT 160
TGTCTCTCTGCTCCTAGTAGGCATCCTATTCCATGCCACCCAGGCTGAACAGTTAACAAAATGTGAGGTGTTCCGGGAGC 240
TGAAAGACTTGAAGGGCTACGGAGGTGTCAGTTTGCCTGAATGGGTCTGTACCGCGTTT
................................................................................ 80
.....................................................................iEEEEEEEEEE 160
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
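The layout of the entry above can be read with a short parser. The following is a sketch based only on this example, not a reference implementation of the HOW format: it assumes that the first header field is the sequence length, that the second is the ID and the rest is comment, and that the trailing position counters (80, 160, ...) are whitespace-separated and can be discarded. The names HowEntry, parse_how and read_how_gz are hypothetical.

```python
import gzip
from typing import Iterable, Iterator, NamedTuple


class HowEntry(NamedTuple):
    length: int
    seq_id: str
    comment: str
    sequence: str
    annotation: str


def _read_block(lines: Iterator[str], length: int) -> str:
    """Collect wrapped lines until `length` symbols have been read."""
    chunks, total = [], 0
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # The first field holds the symbols; a trailing counter is dropped.
        chunks.append(fields[0])
        total += len(fields[0])
        if total >= length:
            break
    block = "".join(chunks)
    if len(block) != length:
        raise ValueError(f"expected {length} symbols, got {len(block)}")
    return block


def parse_how(lines: Iterable[str]) -> Iterator[HowEntry]:
    """Yield entries from lines laid out as in the example above."""
    it = iter(lines)
    for header in it:
        fields = header.split(None, 2)
        if not fields:
            continue
        length = int(fields[0])
        seq_id = fields[1] if len(fields) > 1 else ""
        comment = fields[2].strip() if len(fields) > 2 else ""
        sequence = _read_block(it, length)
        annotation = _read_block(it, length)
        yield HowEntry(length, seq_id, comment, sequence, annotation)


def read_how_gz(path: str) -> Iterator[HowEntry]:
    """Read entries directly from a gzipped file (path is caller-supplied)."""
    with gzip.open(path, "rt") as handle:
        yield from parse_how(handle)
```

Because the sequence and annotation blocks are both read up to the length given in the header, a mismatch between the header and the data raises an error instead of silently shifting the annotation.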
Downstream of the translation start we always included 150 bp. Upstream there are
up to 150 bp.
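Given the sequence and annotation strings of one entry, the start codon and its flanking regions can be recovered from the position of the i code. This is a minimal sketch that assumes exactly one i per entry and that the two strings have equal length; the function name split_at_start is hypothetical.

```python
from typing import Tuple


def split_at_start(sequence: str, annotation: str) -> Tuple[str, str, str]:
    """Split a sequence into (upstream, start codon, rest after the codon).

    The start codon begins at the position annotated with 'i'.
    """
    if len(sequence) != len(annotation):
        raise ValueError("sequence and annotation differ in length")
    start = annotation.find("i")  # 0-based index of the A in ATG
    if start < 0:
        raise ValueError("no start codon annotation ('i') found")
    upstream = sequence[:start]
    codon = sequence[start:start + 3]
    after = sequence[start + 3:]
    return upstream, codon, after
```

Checking that the returned codon equals ATG for every entry is a simple way to confirm that a conversion to another format has kept sequence and annotation aligned.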