
NetStart: translation initiation site data sets


Data sets


Release notes

These data sets were used as the basis for training the NetStart prediction server described in the paper "Neural Network Prediction of Translation Initiation Sites in Eukaryotes: Perspectives for EST and Genome analysis", AG Pedersen and H Nielsen, Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology (ISMB97), 226-233, 1997. (PDF).

Please note that these data sets were constructed in 1997; much larger data sets can now be compiled from current versions of GenBank and other databases. The data sets contain 3312 (vertebrates) and 523 (Arabidopsis) sequences, respectively, and have been "redundancy reduced" to ensure unbiased statistics and validation (if one performs cross-validation, there will be no closely related sequences in the test and training sets). Briefly, redundancy reduction was done by performing all pairwise alignments of the sequences in the original set, comparing the distribution of alignment scores to the distribution expected for a random set of sequences (which follows an extreme value distribution), and finally removing sequences until the remaining ones were only remotely related (meaning that their distribution of alignment scores resembles that of the random set).
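The removal step can be illustrated with a small sketch. It is not the procedure actually used to build these sets: the text above does not say how the score cutoff was derived from the extreme value distribution or in which order sequences were removed. The sketch assumes that pairwise scores and a cutoff are already available, and uses a simple greedy heuristic (repeatedly drop the sequence with the most too-similar partners). The names `reduce_redundancy`, `scores` and `cutoff` are made up for illustration.

```python
from typing import Dict, Iterable, Set, Tuple


def reduce_redundancy(
    ids: Iterable[str],
    scores: Dict[Tuple[str, str], float],
    cutoff: float,
) -> Set[str]:
    """Return a subset of ids in which no pair scores above cutoff.

    Assumption: `scores` maps unordered ID pairs to alignment scores and
    `cutoff` was chosen beforehand (e.g. from the random-sequence score
    distribution). Every ID in `scores` must also appear in `ids`.
    """
    # Build the "too similar" neighbour graph.
    neighbours: Dict[str, Set[str]] = {i: set() for i in ids}
    for (a, b), score in scores.items():
        if a != b and score > cutoff:
            neighbours[a].add(b)
            neighbours[b].add(a)

    kept = set(neighbours)
    while kept:
        # Pick the sequence with the most remaining neighbours
        # (sorted() makes tie-breaking deterministic).
        worst = max(sorted(kept), key=lambda s: len(neighbours[s]))
        if not neighbours[worst]:
            break  # nothing left that is closely related
        kept.remove(worst)
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
    return kept
```

Removing the most connected sequence first tends to keep more sequences than removing an arbitrary member of each similar pair, which is why it is used in this sketch.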


File Format

Files have been gzipped and need to be decompressed before use. Data is in the so-called HOW-format used here at CBS. It's quite simple and should be easy to convert. Briefly, the format consists of three parts (see example below):

  1. A line containing the sequence length, sequence ID and comments
  2. The sequence itself
  3. Annotation

The annotation consists of a sequence of annotation codes, one letter for each nucleotide. In this data set we have used the following annotation codes:

   .  :  Un-annotated DNA  (usually UTR in this set).
   M  :  Untranslated exon.
   i  :  Start codon (marks its first nucleotide, i.e. the A in ATG).
   E  :  Coding (translated) exon.

Example of entry:

   299 BTLACTA.1 CAT X06366 Bos taurus
CAATGTTTCTTTGTTGGTTTTACTGGCCTCTCTTGTCATCCTCTTCCTGGATGTAAGGCTTGATGCCAGGGCCCCTAAGG      80
CTTTTTCCACAAATAAAAGGAGGTGAGCAGTGTGGTGACCCCATTTCAGAATCTTGGGGGGTCACCAAAATGATGTCCTT     160
TGTCTCTCTGCTCCTAGTAGGCATCCTATTCCATGCCACCCAGGCTGAACAGTTAACAAAATGTGAGGTGTTCCGGGAGC     240
TGAAAGACTTGAAGGGCTACGGAGGTGTCAGTTTGCCTGAATGGGTCTGTACCGCGTTT
................................................................................      80
.....................................................................iEEEEEEEEEE     160
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE     240
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
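
For conversion to other formats, a reader along the following lines may be enough. This is a minimal sketch based only on the description and example above, not an official parser. It assumes that the length in the header line is correct, that the first whitespace-separated token on each sequence or annotation line is the data, and that anything after it (such as the running position count) can be ignored. The names `HowEntry`, `parse_how` and `open_how` are made up for illustration.

```python
import gzip
from typing import Iterable, Iterator, NamedTuple


class HowEntry(NamedTuple):
    seq_id: str
    comment: str
    sequence: str
    annotation: str


def _read_block(lines: Iterator[str], length: int) -> str:
    """Collect `length` characters, ignoring trailing position counts."""
    chunks, total = [], 0
    while total < length:
        line = next(lines, None)
        if line is None:
            raise ValueError("file ended in the middle of an entry")
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        chunks.append(fields[0])
        total += len(fields[0])
    block = "".join(chunks)
    if len(block) != length:
        raise ValueError(f"expected {length} characters, got {len(block)}")
    return block


def parse_how(lines: Iterable[str]) -> Iterator[HowEntry]:
    """Yield entries from an iterable of lines (e.g. an open text file)."""
    it = iter(lines)
    for line in it:
        if not line.strip():
            continue
        # Header: sequence length, sequence ID, then free-text comments.
        fields = line.split(None, 2)
        if len(fields) < 2:
            raise ValueError(f"malformed header line: {line!r}")
        length, seq_id = int(fields[0]), fields[1]
        comment = fields[2].strip() if len(fields) > 2 else ""
        sequence = _read_block(it, length)
        annotation = _read_block(it, length)
        yield HowEntry(seq_id, comment, sequence, annotation)


def open_how(path: str) -> Iterator[HowEntry]:
    """Read entries directly from a gzipped file, without unpacking it first."""
    with gzip.open(path, "rt") as handle:
        yield from parse_how(handle)
```

With this, `for entry in open_how(path): ...` would iterate over a downloaded file, where `path` is whatever local name the gzipped file was saved under.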

Each entry always includes 150 bp downstream of the translation start, and at most 150 bp upstream of it.
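
To check the flanking lengths of an entry, the position of the `i` code can be used. The helper below is only an illustrative sketch: `start_site_context` is a made-up name, and it assumes that each annotation string contains the `i` code and that the first occurrence marks the annotated start.

```python
from typing import NamedTuple


class StartContext(NamedTuple):
    start: int       # 0-based index of the annotated start (the "i" code)
    codon: str       # the three nucleotides beginning at `start`
    upstream: int    # number of nucleotides before the start
    downstream: int  # number of nucleotides from the start onwards


def start_site_context(sequence: str, annotation: str) -> StartContext:
    """Locate the annotated start codon and measure its flanks."""
    if len(sequence) != len(annotation):
        raise ValueError("sequence and annotation differ in length")
    start = annotation.find("i")
    if start == -1:
        raise ValueError("no start codon ('i') in annotation")
    return StartContext(
        start=start,
        codon=sequence[start:start + 3],
        upstream=start,
        downstream=len(sequence) - start,
    )
```

Note that `downstream` here counts from the first nucleotide of the start codon, so it includes the codon itself.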