Article abstractsMain references:
Henrik Nielsen's PhD thesis
Original method (SignalP v. 1.1)Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites.
Henrik Nielsen, Jacob Engelbrecht, Søren Brunak and Gunnar von Heijne.
Protein Engineering, 10:1-6 (1997).
We have developed a new method for the identification of signal peptides and their cleavage sites based on neural networks trained on separate sets of prokaryotic and eukaryotic sequence. The method performs significantly better than previous prediction schemes and can easily be applied on genome-wide data sets. Discrimination between cleaved signal peptides and uncleaved N-terminal signal-anchor sequences is also possible, though with lower precision. Predictions can be made on a publicly available WWW server.
Update to SignalP v. 2.0Prediction of signal peptides and signal anchors by a hidden Markov model.
Henrik Nielsen and Anders Krogh.
Proc Int Conf Intell Syst Mol Biol. (ISMB 6), 6:122-130 (1998).
A hidden Markov model of signal peptides has been developed. It contains submodels for the N-terminal part, the hydrophobic region, and the region around the cleavage site. For known signal peptides, the model can be used to assign objective boundaries between these three regions. Applied to our data, the length distributions for the three regions are significantly different from expectations. For instance, the assigned hydrophobic region is between 8 and 12 residues long in almost all eukaryotic signal peptides. This analysis also makes obvious the difference between eukaryotes, Gram-positive bacteria, and Gram-negative bacteria. The model can be used to predict the location of the cleavage site, which it finds correctly in nearly 70% of signal peptides in a cross-validated test — almost the same accuracy as the best previous method. One of the problems for existing prediction methods is the poor discrimination between signal peptides and uncleaved signal anchors, but this is substantially improved by the hidden Markov model when expanding it with a very simple signal anchor model.
Update to SignalP v. 3.0Improved prediction of signal peptides: SignalP 3.0.
Jannick Dyrløv Bendtsen, Henrik Nielsen, Gunnar von Heijne and Søren Brunak.
J. Mol. Biol., 340:783-795 (2004).
We describe improvements of the currently most popular method for prediction of classically secreted proteins, SignalP. SignalP consists of two different predictors based on neural network and hidden Markov model algorithms, and both components have been updated. Motivated by the idea that the cleavage site position and the amino acid composition of the signal peptide are correlated, new features have been included as input to the neural network. This addition, together with a thorough error-correction of a new data set, have improved the performance of the predictor significantly over SignalP version 2. In version 3, correctness of the cleavage site predictions have increased notably for all three organism groups, eukaryotes, Gram negative and Gram positive bacteria. The accuracy of cleavage site prediction has increased in the range from 6–17 % over the previous version, whereas the signal peptide discrimination improvement mainly is due to the elimination of false positive predictions, as well as the introduction of a new discrimination score for the neural network. The new method has also been benchmarked against other available methods.
Update to SignalP v. 4.0SignalP 4.0: discriminating signal peptides from transmembrane regions.
Thomas Nordahl Petersen, Søren Brunak, Gunnar von Heijne and Henrik Nielsen.
Nature Methods, 8:785-786 (2011).
This is a Correspondence, it has no abstract.
Determining the subcellular localization of a protein is an important first step toward understanding its function. Here, we describe the properties of three well-known N-terminal sequence motifs directing proteins to the secretory pathway, mitochondria and chloroplasts, and sketch a brief history of methods to predict subcellular localization based on these sorting signals and other sequence properties. We then outline how to use a number of internet-accessible tools to arrive at a reliable subcellular localization prediction for eukaryotic and prokaryotic proteins. In particular, we provide detailed step-by-step instructions for the coupled use of the amino-acid sequence-based predictors TargetP, SignalP, ChloroP and TMHMM, which are all hosted at the Center for Biological Sequence Analysis, Technical University of Denmark. In addition, we describe and provide web references to other useful subcellular localization predictors. Finally, we discuss predictive performance measures in general and the performance of TargetP and SignalP in particular.
Prediction of protein sorting signals from the sequence of amino acids has great importance in the field of proteomics today. Recently, the growth of protein databases, combined with machine learning approaches, such as neural networks and hidden Markov models, have made it possible to achieve a level of reliability where practical use in, for example automatic database annotation is feasible. In this review, we concentrate on the present status and future perspectives of SignalP, our neural network-based method for prediction of the most well-known sorting signal: the secretory signal peptide. We discuss the problems associated with the use of SignalP on genomic sequences, showing that signal peptide prediction will improve further if integrated with predictions of start codons and transmembrane helices. As a step towards this goal, a hidden Markov model version of SignalP has been developed, making it possible to discriminate between cleaved signal peptides and uncleaved signal anchors. Furthermore, we show how SignalP can be used to characterize putative signal peptides from an archaeon, Methanococcus jannaschii. Finally, we briefly review a few methods for predicting other protein sorting signals and discuss the future of protein sorting prediction in general.
We have developed a new method for the identification of signal peptides and their cleavage sites based on neural networks trained on separate sets of prokaryotic and eukaryotic sequences. The method performs significantly better than previous prediction schemes, and can easily be applied to genome-wide data sets. Discrimination between cleaved signal peptides and uncleaved N-terminal signal-anchor sequences is also possible, though with lower precision. Predictions can be made on a publicly available WWW server: http://www.cbs.dtu.dk/services/SignalP/.
When preparing data sets of amino acid or nucleotide sequences it is necessary to exclude redundant or homologous sequences in order to avoid overestimating the predictive performance of an algorithm. For some time methods for doing this have been available in the area of protein structure prediction. We have developed a similar procedure based on pair-wise alignments for sequences with functional sites. We show how a correlation coefficient between sequence similarity and functional homology can be used to compare the efficiency of different similarity measures and choose a nonarbitrary threshold value for excluding redundant sequences. The impact of the choice of scoring matrix used in the alignments is examined. We demonstrate that the parameter determining the quality of the correlation is the relative entropy of the matrix, rather than the assumed (PAM or identity) substitution mode. Results are presented for the case of prediction of cleavage sites in signal peptides. By inspection of the false positives, several errors in the database were found. The procedure presented may be used as a general outline for finding a problem-specific similarity measure and threshold value for analysis of other functional amino acid or nucleotide sequence patterns.
In the present age of genome sequencing, a vast number of predicted
genes are initially known only by their putative nucleotide
sequence. The newly established field of bioinformatics is concerned
with the computational prediction of structural and functional
properties of genes and the proteins they encode, based on their
nucleotide and amino acid sequences.