World Wide Web Prediction Server
Center for Biological Sequence Analysis
The data were taken from SWISS-PROT version 29. The data sets were divided into prokaryotic and eukaryotic entries, and the prokaryotic data sets were further divided into Gram-positive eubacteria (Firmicutes) and Gram-negative eubacteria (Gracilicutes), excluding Mycoplasma and Archaebacteria. Viral, phage, and organellar proteins were not included. Additionally, two single-species data sets were selected: a human subset of the eukaryotic data and an E. coli subset of the Gram-negative data.
From secretory proteins, the sequence of the signal peptide and the first 30 amino acids of the mature protein were included in the data set. From cytoplasmic and (for the eukaryotes) nuclear proteins, the first 70 amino acids of each sequence were used. Additionally, a set of eukaryotic signal anchor sequences, i.e. N-terminal parts of type II membrane proteins, was extracted.
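The extraction rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Entry` record, its field names, and the `extract_region` function are hypothetical and are not taken from the SignalP distribution. The signal peptide length is assumed to be known from the database annotation.

```python
from dataclasses import dataclass
from typing import Optional

MATURE_LEN = 30    # residues of the mature protein kept after the signal peptide
NONSECR_LEN = 70   # N-terminal residues kept from non-secretory proteins


@dataclass
class Entry:
    """Hypothetical record for one protein (not the SWISS-PROT format)."""
    sequence: str
    # Length of the annotated signal peptide; None for non-secretory proteins.
    signal_length: Optional[int] = None


def extract_region(entry: Entry) -> str:
    """Return the N-terminal region that would go into the data set."""
    if entry.signal_length is None:
        # Cytoplasmic or nuclear protein: first 70 amino acids.
        return entry.sequence[:NONSECR_LEN]
    # Secretory protein: signal peptide plus first 30 residues of the mature protein.
    return entry.sequence[: entry.signal_length + MATURE_LEN]


secretory = Entry(sequence="M" * 20 + "A" * 100, signal_length=20)
cytoplasmic = Entry(sequence="K" * 200)
print(len(extract_region(secretory)), len(extract_region(cytoplasmic)))
```

Sequences shorter than the requested window are simply returned whole, since Python slicing does not raise on an out-of-range end index.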
Redundancy in the data sets was avoided by excluding pairs of sequences which were functionally homologous, i.e. where the cleavage site of one signal peptide could be located by simply aligning it to the other. The numbers of non-homologous sequences remaining in the data sets are shown below:
|Source|Number of sequences|||
| |signal|non-secr.|signal|
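As a rough illustration of what removing redundant sequences involves, the sketch below applies a generic greedy filter: a sequence is kept only if it is not judged redundant with any sequence already kept. This is not the procedure used for the SignalP data sets. The criterion described above (whether one cleavage site can be located by aligning to the other sequence) is replaced here by a placeholder, an ungapped identity threshold, and the names `ungapped_identity` and `reduce_redundancy` as well as the 0.8 cutoff are assumptions made for the example.

```python
from typing import Callable, List


def ungapped_identity(a: str, b: str) -> float:
    """Fraction of identical residues over the shorter sequence, without gaps.

    Placeholder similarity measure only; not the homology criterion
    described in the text.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / n


def reduce_redundancy(
    sequences: List[str],
    is_redundant: Callable[[str, str], bool],
) -> List[str]:
    """Greedy filter: keep a sequence unless it is redundant with one already kept."""
    kept: List[str] = []
    for seq in sequences:
        if not any(is_redundant(seq, other) for other in kept):
            kept.append(seq)
    return kept


THRESHOLD = 0.8  # assumed cutoff, chosen only for this example
example = ["MKTLLLA", "MKTLLLA", "MKTLLLG", "AAAAAAA"]
kept = reduce_redundancy(
    example, lambda a, b: ungapped_identity(a, b) >= THRESHOLD
)
print(kept)
```

The result of a greedy filter depends on the input order, so a real redundancy reduction would normally fix that order deliberately, for example by preferring better-annotated entries.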
The data sets are available from the SignalP ftp site.
Details of the data selection can be found in the ftp README file or in the long version of the SignalP article (Compressed Postscript, 140K).
The redundancy reduction is described in detail in