Analysis and recognition of 5' UTR intron splice sites
in human pre-mRNA.
E. Eden and S. Brunak.
Nucleic Acids Research, 32:1131-1142, 2004.
Center for Biological Sequence Analysis, BioCentrum-DTU,
The Technical University of Denmark, DK-2800 Lyngby, Denmark
Prediction of splice sites in non-coding regions of genes is one of the most
challenging aspects of gene structure recognition. We perform a rigorous
analysis of such splice sites embedded in human 5' UTR regions, and investigate
correlations between this class of splice sites and other features found in the
adjacent exons and introns. By restricting the training of neural network
algorithms to 'pure' untranslated regions (not extending partially into protein
coding regions), we investigate for the first time the predictive power of the
splicing signal proper, in contrast to conventional splice site prediction,
which typically relies on the change in sequence at the transition from protein
coding to non-coding. As a result, the algorithms were able to pick up subtler
splicing signals that were otherwise masked by 'coding' noise, significantly
improving the prediction of 5' UTR splice sites. For example, the
non-coding splice site predicting networks pick up compositional and positional
bias in the 3' ends of non-coding exons and 5' non-coding intron ends, where
cytosine and guanine are overrepresented. This compositional bias at the true
UTR donor sites is also visible in the synaptic weights of the neural networks
trained to identify UTR donor sites. Conventional splice site prediction
methods perform poorly in UTR regions, because the reading frame pattern is
absent. The NetUTR method presented here performs 2-3 fold better than
NetGene2 and GenScan in 5' UTR regions. We also tested the 5' UTR trained
method on protein coding regions and found, surprisingly, that it works
quite well (although it cannot compete with NetGene2). This indicates that the
local splicing pattern in UTR and coding regions is largely the same.
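
The compositional and positional bias described above (cytosine and guanine overrepresented near UTR donor sites) can be illustrated with a per-position base frequency count over sequence windows aligned at the donor site. The following is a minimal sketch only, not the NetUTR method: the function names, the window layout (4 exon bases followed by 4 intron bases) and the toy sequences are all assumptions made for illustration and are not data from the paper.

```python
from collections import Counter


def position_frequencies(windows):
    """Return per-position base frequencies for equal-length, aligned windows.

    Each window is assumed (hypothetically) to be aligned so that the
    donor-site GT dinucleotide starts at the same index in every window.
    """
    if not windows:
        raise ValueError("need at least one window")
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("all windows must have the same length")

    freqs = []
    for i in range(length):
        counts = Counter(w[i].upper() for w in windows)
        total = sum(counts.values())
        freqs.append({base: counts.get(base, 0) / total for base in "ACGT"})
    return freqs


def gc_fraction(freqs):
    """Return the combined C+G frequency at each window position."""
    return [f["C"] + f["G"] for f in freqs]


# Toy windows (invented, not real data): 4 exon bases, then 4 intron bases
# beginning with the donor-site GT at indices 4 and 5.
toy_windows = ["CCGGGTAA", "GCAGGTGA", "CGCGGTGC"]
toy_freqs = position_frequencies(toy_windows)
toy_gc = gc_fraction(toy_freqs)
```

On real data, the per-position C+G profile from true 5' UTR donor sites would be compared with the same profile computed from background GT positions; a consistent excess on the exon side and at the start of the intron would correspond to the bias the networks are reported to pick up.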