SignalP V1.1

World Wide Web Prediction Server

Center for Biological Sequence Analysis

Signal peptide data

The data were taken from SWISS-PROT version 29. The data sets were divided into prokaryotic and eukaryotic entries, and the prokaryotic data sets were further divided into Gram-positive eubacteria (Firmicutes) and Gram-negative eubacteria (Gracilicutes), excluding Mycoplasma and Archaebacteria. Viral, phage, and organellar proteins were not included. Additionally, two single-species data sets were selected: a human subset of the eukaryotic data and an E. coli subset of the Gram-negative data.

From secretory proteins, the sequence of the signal peptide and the first 30 amino acids of the mature protein were included in the data set. From cytoplasmic and (for the eukaryotes) nuclear proteins, the first 70 amino acids of each sequence were used. Additionally, a set of eukaryotic signal anchor sequences, i.e. N-terminal parts of type II membrane proteins, was extracted.
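The extraction rule above can be illustrated with a minimal Python sketch. The function name `extract_window` and its parameters are hypothetical and are not taken from the SignalP distribution. The sketch assumes the signal peptide length (the number of residues before the cleavage site) is already known from the database annotation. It does not cover signal anchors, since no window length is stated for them here.

```python
from typing import Optional


def extract_window(sequence: str,
                   signal_peptide_len: Optional[int] = None,
                   mature_len: int = 30,
                   nonsecretory_len: int = 70) -> str:
    """Return the N-terminal part of a protein used as a data set entry.

    signal_peptide_len is the number of residues before the cleavage
    site (hypothetical parameter; assumed to come from the annotation).
    Pass None for cytoplasmic or nuclear proteins.
    """
    if signal_peptide_len is None:
        # Non-secretory protein: keep the first 70 residues.
        return sequence[:nonsecretory_len]
    # Secretory protein: signal peptide plus the first 30 residues
    # of the mature protein.
    return sequence[:signal_peptide_len + mature_len]


# Toy sequences, not real SWISS-PROT entries.
secretory = "M" * 22 + "A" * 100      # 22-residue signal peptide
cytoplasmic = "G" * 100

secretory_entry = extract_window(secretory, signal_peptide_len=22)
cytoplasmic_entry = extract_window(cytoplasmic)
```

Sequences shorter than the requested window are returned whole, because Python slicing stops at the end of the string.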

Redundancy in the data sets was avoided by excluding pairs of sequences which were functionally homologous, i.e. where the cleavage site of one signal peptide could be located by simply aligning it to the other. The numbers of non-homologous sequences remaining in the data sets are shown below:

Source        Number of sequences
              signal      non-secretory   signal
              peptides    proteins        anchors
Human         416         251             97
Eukaryotes    1011        820             28
E. coli       105         119             -
Gram-         266         186             -
Gram+         141         64              -
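
As a rough illustration of the redundancy reduction, the Python sketch below keeps a sequence only if it is not homologous to any sequence already kept. This greedy strategy is an assumption made for illustration and is not necessarily the procedure used for SignalP. The names `reduce_redundancy` and `toy_same_prefix` are hypothetical, and the toy predicate (identical first 10 residues) merely stands in for the alignment-based criterion of functional homology described above.

```python
from typing import Callable, List


def reduce_redundancy(sequences: List[str],
                      is_homologous: Callable[[str, str], bool]) -> List[str]:
    """Greedily keep sequences that are not homologous to any kept one.

    is_homologous is a caller-supplied pairwise test (assumption: the
    real criterion would be based on sequence alignment).
    """
    kept: List[str] = []
    for seq in sequences:
        if not any(is_homologous(seq, other) for other in kept):
            kept.append(seq)
    return kept


def toy_same_prefix(a: str, b: str, n: int = 10) -> bool:
    """Toy stand-in for a homology test: identical first n residues."""
    return a[:n] == b[:n]


# Toy sequences, not real data set entries.
candidates = ["MKKTAIAIAVALAG", "MKKTAIAIAVGGGG", "MRVLLLAAAA"]
non_redundant = reduce_redundancy(candidates, toy_same_prefix)
```

With a greedy pass of this kind the result depends on the input order, so a real procedure would need a rule for choosing which member of a homologous pair to keep.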

The data sets are available from the SignalP ftp site.
Details of the data selection can be found in the ftp README file or in the long version of the SignalP article (compressed PostScript, 140K).

The redundancy reduction is described in detail in


Last change: December 2, 1996, Henrik Nielsen

