SignalP V1.1 World Wide Web Prediction Server, Center for Biological Sequence Analysis
SignalP is the most powerful published prediction method for signal peptides. To compare the strength of the neural network approach with that of the weight matrix method, we calculated new weight matrices from the new data and tested their performance. The weight matrix method was comparable to the neural networks when calculating the C-score, but was practically unable to solve the S-score problem and therefore could not be used to calculate the combined Y-score. The ability to distinguish signal anchors from signal peptides has not been evaluated for any of the earlier published methods for signal peptide recognition.
The best prediction of cleavage site location is provided by the position of the Y-score maximum. The best prediction of sequence type (signal peptide or non-secretory protein) is given by the mean S-score (the average of the S-score in the region between position 1 and the position immediately before the Y-score maximum): if mean S-score is larger than 0.5, the sequence is predicted to be a signal peptide (see the plot under ``Results: Identification of signal anchors''). When using these estimates, we obtain the predictive qualities given in the table below.
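The decision rule described above can be sketched in a few lines of Python. This is only an illustration, not the server's code: the function name `predict_from_scores`, the list-based input format, and the handling of a Y-score maximum at position 1 (mean S-score taken as 0.0) are assumptions made for the example. Positions are 1-based, as in the text.

```python
from typing import NamedTuple, Sequence


class Prediction(NamedTuple):
    y_max_position: int      # 1-based position of the Y-score maximum
    mean_s: float            # mean S-score over positions 1 .. y_max_position - 1
    is_signal_peptide: bool  # True if mean_s exceeds the threshold


def predict_from_scores(s_scores: Sequence[float],
                        y_scores: Sequence[float],
                        threshold: float = 0.5) -> Prediction:
    """Apply the Y-max / mean-S rule to per-position scores (hypothetical helper)."""
    if not y_scores or len(s_scores) != len(y_scores):
        raise ValueError("s_scores and y_scores must be non-empty and equally long")

    # 0-based index of the Y-score maximum (first one wins on ties).
    y_max_index = max(range(len(y_scores)), key=lambda i: y_scores[i])

    # Region from position 1 to the position immediately before the Y maximum.
    region = s_scores[:y_max_index]

    # Assumption: an empty region (Y maximum at position 1) gives mean S = 0.0.
    mean_s = sum(region) / len(region) if region else 0.0

    return Prediction(y_max_index + 1, mean_s, mean_s > threshold)
```

With made-up scores `s_scores = [0.9, 0.8, 0.7, 0.2, 0.1]` and `y_scores = [0.1, 0.2, 0.3, 0.8, 0.2]`, the Y-score maximum is at position 4, the mean S-score over positions 1 to 3 is about 0.8, and the sequence would be classified as a signal peptide.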
These performance figures are lower bounds. They are measured on the test sets (i.e. data which were not used to train the networks), and because of the redundancy reduction of the data, the sequence similarity between training and test sets is so low that the correct cleavage sites cannot be found by homology. Consequently, the prediction accuracy on sequences with some degree of homology to the sequences in the data sets will in general be higher.
| Data source | Cleavage site location (% correct) | Signal peptide discrimination (correlation) |
|---|---|---|
| Human | 68.0 (67.9) | 0.96 (0.97) |
| Euk. | 70.2 | 0.97 |
| E. coli | 83.7 (85.7) | 0.89 (0.92) |
| Gram- | 79.3 | 0.88 |
| Gram+ | 67.9 | 0.96 |
Values given in parentheses indicate the performance for the human sequences when using networks trained on all eukaryotic data, and for the E. coli sequences when using Gram-negative networks, respectively. Note that there is no gain in performance when using the networks trained on single-species data sets - in other words, we have found no evidence of species-specific features of the signal peptides of humans and E. coli.
The quality of signal peptide discrimination is measured by the correlation coefficient:

$$C = \frac{P_t N_t - P_f N_f}{\sqrt{(P_t + P_f)(P_t + N_f)(N_t + P_f)(N_t + N_f)}}$$

where $P_t$ and $P_f$ are the numbers of true and false positives (sequences with mean S-score over 0.5), while $N_t$ and $N_f$ are the numbers of true and false negatives.
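As a minimal sketch of how such a correlation coefficient can be computed from the four counts, the Python function below assumes the Matthews form of the coefficient (the product of the true counts minus the product of the false counts, divided by the square root of the product of the four marginal sums). The function name and the convention of returning 0.0 when the denominator vanishes are assumptions made for illustration.

```python
import math


def correlation_coefficient(true_pos: int, false_pos: int,
                            true_neg: int, false_neg: int) -> float:
    """Matthews-style correlation coefficient from a 2x2 confusion table."""
    numerator = true_pos * true_neg - false_pos * false_neg
    denominator = math.sqrt(
        (true_pos + false_pos) * (true_pos + false_neg)
        * (true_neg + false_pos) * (true_neg + false_neg)
    )
    # Assumption: if any marginal sum is zero the coefficient is undefined;
    # 0.0 is returned here by convention.
    if denominator == 0:
        return 0.0
    return numerator / denominator
```

A perfect classifier (no false positives or false negatives) gives 1.0, a completely inverted one gives -1.0, and a classifier no better than chance gives a value near 0.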
More data about predictive performance can be found in the long version of the SignalP article (Compressed Postscript, 140K).