***************************** signalp data *****************************

This directory contains amino acid sequences of secretory signal peptides and N-terminal parts of sequences of non-secretory proteins. These data have been used in developing a neural network-based method for prediction of signal peptides and their cleavage sites.

The prediction method can be accessed via the *signalp* mail server. For more information, send a message containing only the word "help" to signalp@cbs.dtu.dk .

All data are taken from SWISS-PROT version 29.

Any questions or comments should be sent to Kristoffer Rapacki, rapacki@cbs.dtu.dk.


PAPERS TO REFERENCE:

When using the data in this directory, please cite:

   H.Nielsen, J.Engelbrecht, S.Brunak, and G.von Heijne:
   "Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites"
   To appear, Protein Engineering, vol. 10, issue 1, 1997
   [ This manuscript can be retrieved in postscript format at URL:
     http://www.cbs.dtu.dk/publications/signalp.ps.Z ]

or

   H.Nielsen, J.Engelbrecht, G.von Heijne, and S.Brunak:
   "Defining a similarity threshold for a functional protein sequence pattern: The signal peptide cleavage site"
   PROTEINS, 24, 165-177, 1996.
   [ This manuscript can be retrieved in postscript format at URL:
     http://www.cbs.dtu.dk/publications/threshold.ps.Z ]

and

   A.Bairoch and B.Boeckmann:
   "The SWISS-PROT protein sequence data bank: current status"
   Nucleic Acids Res. 22:3578-3580 (1994).


DATA SELECTION:

*Extraction of signal peptide sequences*

The signal peptide data were selected from SWISS-PROT version 29 in the following way: From a total of 38303 entries, 5995 entries contained the keyword SIGNAL in the feature table. Entries suggesting non-experimental evidence for the cleavage site were discarded, i.e.
where the signal peptide was incomplete, the cleavage site was unknown, question marks or comments such as "POTENTIAL", "PROBABLE", or "BY SIMILARITY" were present, or an alternative cleavage site was suggested. This selection procedure reduces the number of cleavage sites which are not experimentally determined, but it does not eliminate them, since many SWISS-PROT entries simply lack information about the quality of the evidence, as we have previously found (see H.Nielsen, J.Engelbrecht, G.von Heijne, and S.Brunak: "Defining a similarity threshold for a functional protein sequence pattern: The signal peptide cleavage site", PROTEINS, 24, 165-177, 1996).

Furthermore, all virus and phage genes were discarded. From the eukaryotic data set, proteins encoded by organellar (i.e. non-nuclear) genes were discarded (by excluding SWISS-PROT entries containing an "OG" line). From the prokaryotic data set, signal peptides cleaved by signal peptidase II (Lsp, a specific lipoprotein signal peptidase) were discarded, since the cleavage sites of these proteins differ considerably from those cleaved by the standard prokaryotic signal peptidase (Lep); this was done by excluding entries with a cross-reference to the PROSITE entry named "PROKAR_LIPOPROTEIN".

From each entry, the sequence of the signal peptide and the first 30 amino acids of the mature protein were included in the data set. (One entry, "AVR9_CLAFU", which had fewer than 30 amino acids after the cleavage site, was deleted.)

*Extraction of cytoplasmic and nuclear protein sequences*

As a background to the signal peptides, we extracted data sets comprising the N-terminal parts of cytoplasmic and (for the eukaryotes) nuclear proteins. This was done by searching for comment lines in SWISS-PROT specifying the subcellular location as "CYTOPLASMIC" or "NUCLEAR" without comments like "POTENTIAL" or "PROBABLE".
Entries comprising protein fragments were discarded (by searching for the word "FRAGMENT" in the description line or the keywords "NON_TER" or "NON_CONS" in the feature table), as were proteins shorter than 70 aa or lacking the initial Met. Virus and phage proteins were not included.

The first 70 amino acids of each sequence were included in the data sets. In some cases (383 eukaryotic, 48 Gram-negative, and 14 Gram-positive) where the entry contained a feature table line with the key "INIT_MET", indicating that the initiator methionine had been cleaved off, we prepended the missing "M" to the sequence.

*Extraction of signal anchor sequences*

Certain membrane proteins, known as type II membrane proteins, are attached to the membrane by an N-terminal sequence which shares many characteristics with a signal peptide but is not cleaved. Consequently, they consist of a short N-terminal cytoplasmic domain, a single transmembrane domain, and a larger C-terminal extracellular or lumenal domain.

In order to test whether the prediction method would erroneously classify these uncleaved signal peptides, also known as signal anchors, as signal peptides, a data set of signal anchors was extracted in the following way: In SWISS-PROT version 29, 157 entries contained the feature table keyword "TRANSMEM" with the qualifier "SIGNAL-ANCHOR (TYPE-II MEMBRANE PROTEIN)". From these, we selected 137 eukaryotic signal anchors with specified endpoints and without comments like "POTENTIAL" or "PROBABLE". Prokaryotic signal anchors were ignored, since only five of these were found (four of them potential).

18 entries were discarded because they contained more than one "TRANSMEM" line and therefore should be regarded as type IV (i.e. multispanning) membrane proteins, rather than type II. With one exception only, these proteins are members of the TM4 superfamily or bear similarity to it.
Furthermore, we discarded 22 entries where the suggested signal anchor region (from the N-terminus of the protein to the C-terminal end of the specified transmembrane region) was 70 aa or longer, because these would hardly be mistaken for cleavable signal peptides.

In many cases, the cytoplasmic domain preceding the signal anchor was marked "POTENTIAL" or "PROBABLE", even if the signal anchor itself was not. We did not discard these entries, however; since the signal anchor data were not going to be used as training data but only as test data, we set the demands for the quality of experimental evidence lower than for the other data sets.

*Division of data sets by systematic group*

By using the information in the SWISS-PROT "OS" line, the resulting data sets were divided into prokaryotic and eukaryotic entries, and the prokaryotic data sets were further divided into Gram-positive eubacteria ("FIRMICUTES") and Gram-negative eubacteria ("GRACILICUTES"), excluding Mycoplasma and Archaebacteria. Additionally, two single-species data sets were selected, a human subset of the eukaryotic data, and an E.coli subset of the Gram-negative data.

*Redundancy reduction*

Redundancy in the data sets was avoided by excluding pairs of sequences which had more than a certain number of identities (exact matches) in an alignment made with a protein identity matrix of high relative entropy. The cutoff value which gave the best separation between functionally homologous and non-homologous signal peptide sequences was established for eukaryotes and prokaryotes separately: 17 identities for eukaryotes and 21 for prokaryotes. In this context, a sequence pair is defined to be functionally homologous if both cleavage sites are aligned at the same position (see H.Nielsen, J.Engelbrecht, G.von Heijne, and S.Brunak: "Defining a similarity threshold for a functional protein sequence pattern: The signal peptide cleavage site", PROTEINS, 24, 165-177, 1996).
We applied the same cutoff to cytoplasmic and nuclear protein sequences, even though the cutoff has been determined for signal peptide sequences specifically, since these merely serve as background to the signal peptide sequences. Redundancy reduction was not applied to the signal anchor data, since these were not used as training data.

After computing all pairwise alignments within the five data sets, redundant sequences were removed using algorithm 2 of Hobohm et al. (see U.Hobohm, M.Scharf, R.Schneider, and C.Sander: "Selection of representative protein data sets", Protein Science 1:409-417 (1992)), which guarantees that no pairs of homologous sequences remain in the data set. This procedure removed 13-56% of the sequences (see below).

*Removing errors*

While investigating the pairwise similarities between signal peptide sequences, we found a number of sequence pairs with similarity above the threshold but without aligned cleavage sites (see H.Nielsen, J.Engelbrecht, G.von Heijne, and S.Brunak: "Defining a similarity threshold for a functional protein sequence pattern: The signal peptide cleavage site", PROTEINS, 24, 165-177, 1996). By manually checking the references to these examples in the human signal peptide data set, a number of database errors were found.

Five entries were found to lack experimental evidence for their cleavage sites: "ELNE_HUMAN", "FCG3_HUMAN", "FCGA_HUMAN", "FCGB_HUMAN", and "FCGC_HUMAN". These have been discarded.

Three entries were found to have the cleavage site indicated at a wrong position: "HA22_HUMAN", "SOMV_HUMAN", and "SOMW_HUMAN". The cleavage sites of these have been changed accordingly.

Note: The other data sets have not been through this type of error checking. Therefore, the human signal peptide data set is probably more error-free than the other signal peptide data sets.

A few examples of disagreement between signal peptide and subcellular location information were found in the data: The entry "MURF_ECOLI" (E.coli UDP-MurNAc-pentapeptide synthetase) had both a signal peptide and a comment stating that it was located in the cytoplasm. The entry "NO27_SOYBN" (soybean nodulin-27) which was cytoplasmic according to the comment was very similar to two other nodulins ("NO20_SOYBN" and "NO22_SOYBN") which both had signal peptides. (A comment in "NO27_SOYBN" said that "Despite the similarity of their structures, the nodulins are located in different subcellular compartments.") These four entries were deleted from the data sets. According to our finished prediction method, "MURF_ECOLI" certainly does not look like a signal peptide, while the three nodulins look like typical signal peptides.


SIZE OF THE DATA SETS:

            Signal         Cytoplasmic    Nuclear        Signal
            peptides       proteins       proteins       anchors
            SIG            CYT            NUC            ANC
            tot. / red.    tot. / red.    tot. / red.    tot.

   EUK      2275 / 1011     854 /  269    1007 /  551      97
   HUMAN     614 /  416     138 /   97     188 /  154      28
   GRAM-     383 /  266     293 /  186
   ECOLI     119 /  105     128 /  119
   GRAM+     187 /  141     123 /   64

The number of sequences in the data sets before ("tot.") and after ("red.") redundancy reduction. The organism groups are: Eukaryotes ("EUK"), H.sapiens ("HUMAN"), Gram-negative bacteria ("GRAM-"), E.coli ("ECOLI"), and Gram-positive bacteria ("GRAM+"). The human data are subsets of the eukaryotic data, and the E.coli data are subsets of the Gram-negative data.


FILE FORMAT:

Below is shown an example of one entry (one sequence) in the data files:

   52 CA11_HUMAN 22 PROCOLLAGEN ALPHA 1(I) CHAIN PRECURSOR.
MFSFVDLRLLLLLAATALLTHGQEEGQVEGQDEDIPPITCVQNGLRYHDRDV
SSSSSSSSSSSSSSSSSSSSSSCMMMMMMMMMMMMMMMMMMMMMMMMMMMMM

Each entry consists of three parts:

- One line containing
    - the sequence length (in the first five positions)
    - the SWISS-PROT ID
    - the length of the signal peptide
    - all or part of the information in the SWISS-PROT DE (description) line
- One or more lines (80 characters per line) showing the sequence in one letter code
- One or more lines (80 characters per line) showing the classification of each position:
    S  for positions in the signal peptide
    C  for the position immediately downstream of the cleavage site (a.k.a. the +1 position)
    M  for other positions in the mature protein

Only very few entries (5 in EUKSIG.tot, 2 in GRAM+SIG.tot, and 1 in GRAM-SIG.tot) consist of more than three lines, because their sequences are longer than 80 amino acids.


FILE NAMES:

The files are named according to a pattern equivalent to the regular expression:

   (EUK|HUMAN|GRAM-|ECOLI|GRAM\+)(SIG|CYT|NUC|ANC)(\.tot|\.red)

-rw-r--r--  1 hnielsen users   27009 Sep  4 19:06 ECOLICYT.tot
-rw-r--r--  1 hnielsen users   19187 Sep  4 19:06 ECOLISIG.red
-rw-r--r--  1 hnielsen users   21708 Sep  4 19:06 ECOLISIG.tot
-rw-r--r--  1 hnielsen users   22772 Oct 16 16:54 EUKANC.tot
-rw-r--r--  1 hnielsen users   56957 Sep  4 19:06 EUKCYT.red
-rw-r--r--  1 hnielsen users  178071 Sep  4 19:06 EUKCYT.tot
-rw-r--r--  1 hnielsen users  111586 Sep  4 19:06 EUKNUC.red
-rw-r--r--  1 hnielsen users  202515 Sep  4 19:06 EUKNUC.tot
-rw-r--r--  1 hnielsen users  180588 Sep  4 19:06 EUKSIG.red
-rw-r--r--  1 hnielsen users  405850 Sep  4 19:06 EUKSIG.tot
-rw-r--r--  1 hnielsen users   13624 Sep  4 19:06 GRAM+CYT.red
-rw-r--r--  1 hnielsen users   26024 Sep  4 19:06 GRAM+CYT.tot
-rw-r--r--  1 hnielsen users   28255 Sep  4 19:06 GRAM+SIG.red
-rw-r--r--  1 hnielsen users   37440 Sep  4 19:06 GRAM+SIG.tot
-rw-r--r--  1 hnielsen users   39393 Sep  4 19:06 GRAM-CYT.red
-rw-r--r--  1 hnielsen users   61567 Sep  4 19:06 GRAM-CYT.tot
-rw-r--r--  1 hnielsen users   48603 Sep  4 19:06 GRAM-SIG.red
-rw-r--r--  1 hnielsen users   70166 Sep  4 19:06 GRAM-SIG.tot
-rw-r--r--  1 hnielsen users    6204 Oct 16 16:54 HUMANANC.tot
-rw-r--r--  1 hnielsen users   20764 Sep  4 19:06 HUMANCYT.red
-rw-r--r--  1 hnielsen users   29488 Sep  4 19:06 HUMANCYT.tot
-rw-r--r--  1 hnielsen users   31868 Sep  4 19:06 HUMANNUC.red
-rw-r--r--  1 hnielsen users   38586 Sep  4 19:06 HUMANNUC.tot
-rw-r--r--  1 hnielsen users   75574 Sep  4 19:06 HUMANSIG.red
-rw-r--r--  1 hnielsen users  112843 Sep  4 19:06 HUMANSIG.tot

************************************************************************
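
READING THE DATA FILES (illustrative sketch):

The entry layout described under FILE FORMAT could be read with a short Python script along the following lines. This is only a sketch and not part of the original distribution: the names Entry and parse_entries are hypothetical, and the code assumes that the header fields after the five-position length are separated by whitespace and that all sequence lines of an entry precede all of its classification lines. The example header uses single spaces between fields, which may differ from the exact column layout of the real files.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Entry:
    length: int          # sequence length from the header line
    swissprot_id: str
    signal_length: int   # length of the signal peptide from the header line
    description: str     # all or part of the SWISS-PROT DE line
    sequence: str        # one-letter amino acid codes
    labels: str          # one S/C/M label per sequence position


def parse_entries(lines: Iterable[str], width: int = 80) -> Iterator[Entry]:
    """Yield entries from the lines of one data file (hypothetical helper)."""
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    i = 0
    while i < len(rows):
        header = rows[i]
        length = int(header[:5])              # length is in the first five positions
        fields = header[5:].split(None, 2)    # assumed: ID, signal length, DE text
        description = fields[2] if len(fields) > 2 else ""
        n = math.ceil(length / width)         # lines used by the sequence (and by the labels)
        sequence = "".join(r.strip() for r in rows[i + 1:i + 1 + n])
        labels = "".join(r.strip() for r in rows[i + 1 + n:i + 1 + 2 * n])
        if len(sequence) != length or len(labels) != length:
            raise ValueError(f"inconsistent entry at row {i}: {header!r}")
        yield Entry(length, fields[0], int(fields[1]), description, sequence, labels)
        i += 1 + 2 * n


# The example entry from the FILE FORMAT section.
example = "\n".join([
    "   52 CA11_HUMAN 22 PROCOLLAGEN ALPHA 1(I) CHAIN PRECURSOR.",
    "MFSFVDLRLLLLLAATALLTHGQEEGQVEGQDEDIPPITCVQNGLRYHDRDV",
    "S" * 22 + "C" + "M" * 29,
])
entry = next(parse_entries(example.splitlines()))
signal_peptide = entry.sequence[:entry.signal_length]
plus_one = entry.labels.index("C")   # 0-based index of the +1 position
print(entry.swissprot_id, signal_peptide, plus_one)
```

The length check raises an error rather than guessing when an entry does not match its header, which seems the safer choice for files whose exact column layout has not been verified.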