We have developed a new method for the identification of signal peptides and their cleavage sites based on neural networks trained on separate sets of prokaryotic and eukaryotic sequences. The method performs significantly better than previous prediction schemes and can easily be applied to genome-wide data sets. Discrimination between cleaved signal peptides and uncleaved N-terminal signal-anchor sequences is also possible, though with lower precision. Predictions can be made on a publicly available WWW server.
A hidden Markov model of signal peptides has been developed. It contains submodels for the N-terminal part, the hydrophobic region, and the region around the cleavage site. For known signal peptides, the model can be used to assign objective boundaries between these three regions. Applied to our data, the length distributions for the three regions are significantly different from expectations. For instance, the assigned hydrophobic region is between 8 and 12 residues long in almost all eukaryotic signal peptides. This analysis also makes obvious the difference between eukaryotes, Gram-positive bacteria, and Gram-negative bacteria. The model can be used to predict the location of the cleavage site, which it finds correctly in nearly 70% of signal peptides in a cross-validated test, almost the same accuracy as the best previous method. One of the problems for existing prediction methods is the poor discrimination between signal peptides and uncleaved signal anchors, but this is substantially improved by the hidden Markov model when expanding it with a very simple signal anchor model.
We describe improvements to the currently most popular method for prediction of classically secreted proteins, SignalP. SignalP consists of two different predictors based on neural network and hidden Markov model algorithms, and both components have been updated. Motivated by the idea that the cleavage site position and the amino acid composition of the signal peptide are correlated, new features have been included as input to the neural network. This addition, together with a thorough error-correction of a new data set, has improved the performance of the predictor significantly over SignalP version 2. In version 3, the correctness of the cleavage site predictions has increased notably for all three organism groups: eukaryotes, Gram-negative bacteria and Gram-positive bacteria. The accuracy of cleavage site prediction has increased by 6–17% over the previous version, whereas the improvement in signal peptide discrimination is mainly due to the elimination of false positive predictions, as well as the introduction of a new discrimination score for the neural network. The new method has also been benchmarked against other available methods.
This is a Correspondence; it has no abstract.
Locating proteins in the cell using TargetP, SignalP, and related tools
Olof Emanuelsson, Søren Brunak, Gunnar von Heijne, Henrik Nielsen
Nature Protocols, 2:953-971 (2007).
Determining the subcellular localization of a protein is an important first step toward understanding its function. Here, we describe the properties of three well-known N-terminal sequence motifs directing proteins to the secretory pathway, mitochondria and chloroplasts, and sketch a brief history of methods to predict subcellular localization based on these sorting signals and other sequence properties. We then outline how to use a number of internet-accessible tools to arrive at a reliable subcellular localization prediction for eukaryotic and prokaryotic proteins. In particular, we provide detailed step-by-step instructions for the coupled use of the amino-acid sequence-based predictors TargetP, SignalP, ChloroP and TMHMM, which are all hosted at the Center for Biological Sequence Analysis, Technical University of Denmark. In addition, we describe and provide web references to other useful subcellular localization predictors. Finally, we discuss predictive performance measures in general and the performance of TargetP and SignalP in particular.
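The abstract above mentions predictive performance measures without naming them. As a minimal sketch, assuming a binary signal-peptide/non-signal-peptide classification summarized by confusion-matrix counts, the following computes sensitivity, specificity and the Matthews correlation coefficient. The function name `performance` and the choice of these three measures are illustrative assumptions, not taken from the protocol.

```python
import math


def performance(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Return common binary-classification measures from confusion counts.

    tp/tn/fp/fn are true positives, true negatives, false positives and
    false negatives. Undefined ratios (zero denominator) are reported as 0.0.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Matthews correlation coefficient: +1 is perfect, 0 is chance level.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "mcc": mcc}


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(performance(tp=90, tn=80, fp=20, fn=10))
```

The correlation coefficient is often preferred over plain accuracy when the positive and negative classes differ greatly in size, since it uses all four counts.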
Machine learning approaches to the prediction of signal peptides and other protein sorting signals.
Henrik Nielsen, Søren Brunak, and Gunnar von Heijne.
Protein Engineering, 12:3-9 (1999), Review.
Prediction of protein sorting signals from the sequence of amino acids has great importance in the field of proteomics today. Recently, the growth of protein databases, combined with machine learning approaches such as neural networks and hidden Markov models, has made it possible to achieve a level of reliability where practical use in, for example, automatic database annotation is feasible. In this review, we concentrate on the present status and future perspectives of SignalP, our neural network-based method for prediction of the most well-known sorting signal: the secretory signal peptide. We discuss the problems associated with the use of SignalP on genomic sequences, showing that signal peptide prediction will improve further if integrated with predictions of start codons and transmembrane helices. As a step towards this goal, a hidden Markov model version of SignalP has been developed, making it possible to discriminate between cleaved signal peptides and uncleaved signal anchors. Furthermore, we show how SignalP can be used to characterize putative signal peptides from an archaeon, Methanococcus jannaschii. Finally, we briefly review a few methods for predicting other protein sorting signals and discuss the future of protein sorting prediction in general.
A neural network method for identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites.
Henrik Nielsen, Jacob Engelbrecht, Søren Brunak and Gunnar von Heijne.
Int. J. Neural Sys., 8:581-599 (1997).
We have developed a new method for the identification of signal peptides and their cleavage sites based on neural networks trained on separate sets of prokaryotic and eukaryotic sequences. The method performs significantly better than previous prediction schemes, and can easily be applied to genome-wide data sets. Discrimination between cleaved signal peptides and uncleaved N-terminal signal-anchor sequences is also possible, though with lower precision. Predictions can be made on a publicly available WWW server: http://www.cbs.dtu.dk/services/SignalP/.
Defining a similarity threshold for a functional protein sequence pattern: the signal peptide cleavage site.
Henrik Nielsen, Jacob Engelbrecht, Gunnar von Heijne and Søren Brunak.
Proteins, 24:165-77 (1996).
When preparing data sets of amino acid or nucleotide sequences it is necessary to exclude redundant or homologous sequences in order to avoid overestimating the predictive performance of an algorithm. For some time methods for doing this have been available in the area of protein structure prediction. We have developed a similar procedure based on pair-wise alignments for sequences with functional sites. We show how a correlation coefficient between sequence similarity and functional homology can be used to compare the efficiency of different similarity measures and choose a nonarbitrary threshold value for excluding redundant sequences. The impact of the choice of scoring matrix used in the alignments is examined. We demonstrate that the parameter determining the quality of the correlation is the relative entropy of the matrix, rather than the assumed (PAM or identity) substitution mode. Results are presented for the case of prediction of cleavage sites in signal peptides. By inspection of the false positives, several errors in the database were found. The procedure presented may be used as a general outline for finding a problem-specific similarity measure and threshold value for analysis of other functional amino acid or nucleotide sequence patterns.
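The idea of excluding sequences above a similarity threshold can be illustrated with a minimal sketch. This is a simple greedy filter under stated assumptions, not the procedure from the paper: the similarity measure `identity_fraction` is a hypothetical stand-in (ungapped, position-by-position identity) for the alignment-based scores the abstract describes, and the threshold value in the usage example is arbitrary.

```python
from typing import Callable, List


def identity_fraction(a: str, b: str) -> float:
    """Hypothetical similarity: fraction of identical residues, compared
    position by position (no alignment) over the shorter sequence."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def reduce_redundancy(
    seqs: List[str],
    threshold: float,
    similarity: Callable[[str, str], float] = identity_fraction,
) -> List[str]:
    """Greedily keep a sequence only if its similarity to every sequence
    already kept is below the threshold."""
    kept: List[str] = []
    for s in seqs:
        if all(similarity(s, k) < threshold for k in kept):
            kept.append(s)
    return kept


if __name__ == "__main__":
    # Made-up sequences, for illustration only.
    data = ["MKTLLLAVAL", "MKTLLLAVAV", "MRSWQPGCCA"]
    print(reduce_redundancy(data, threshold=0.8))
```

Because the filter is greedy, the result depends on input order. In practice the similarity function would be replaced by a pairwise alignment score, and the threshold chosen by a criterion such as the correlation analysis described in the abstract.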
In the present age of genome sequencing, a vast number of predicted genes are initially known only by their putative nucleotide sequence. The newly established field of bioinformatics is concerned with the computational prediction of structural and functional properties of genes and the proteins they encode, based on their nucleotide and amino acid sequences.
Since one of the crucial properties of a protein is its subcellular location, prediction of protein sorting is an important question in bioinformatics. A fundamental distinction in protein sorting is that between secretory and non-secretory proteins, determined by a cleavable N-terminal sorting signal, the secretory signal peptide.
The main part of this thesis, including four of the six papers, concerns prediction of secretory signal peptides in both eukaryotic and bacterial data using two machine learning techniques: artificial neural networks and hidden Markov models. A central result is the SignalP prediction method, which has been made available as a World Wide Web server and is very widely used.
Two additional prediction methods are also included, with one paper each. ChloroP predicts chloroplast transit peptides, another cleavable N-terminal sorting signal, while NetStart predicts start codons in eukaryotic genes. For prediction of all N-terminal signals, the assignment of the correct start codon can be critical, which is why prediction of translation initiation from the nucleotide sequence is also important for protein sorting prediction.
This thesis comprises a detailed review of the molecular biology of protein secretion, a short introduction to the most important machine learning algorithms in bioinformatics, and a critical review of existing methods for protein sorting prediction. In addition, it contains a general treatment of the principles of data set construction and performance evaluation for prediction methods in bioinformatics.