
Data sets used in the construction and evaluation of TargetP



Unless otherwise stated, the data sets were redundancy-reduced on the presequence plus the first amino acid of the mature protein.

Plant predictor data sets


Chloroplast (cTP) sequences
number: 141       fasta       AC numbers
Comments: SWISS-PROT release 36. Only plant proteins are included (i.e. no algae).
Mitochondrial (mTP) sequences
number: 368       fasta       AC numbers
Comments: SWISS-PROT release 36. Plant and non-plant proteins (*).
Secretory Pathway/Signal Peptide (SP) sequences
number: 269       fasta       AC numbers
Comments: SWISS-PROT release 36.
Nuclear sequences
number:   48       fasta       AC numbers         (redundancy reduced on 112 N-term. aa's)
number:   54       fasta       AC numbers         (redundancy reduced on 68 N-term. aa's)
Comments: SWISS-PROT release 36. Redundancy was reduced over two different N-terminal lengths (112 and 68 residues) because the resulting sets are used for training different first-layer networks (see article).
Cytosolic sequences
number:   87       fasta       AC numbers         (redundancy reduced on 112 N-term. aa's)
number: 108       fasta       AC numbers         (redundancy reduced on 68 N-term. aa's)
Comments: SWISS-PROT release 36. Redundancy was reduced over two different N-terminal lengths (112 and 68 residues) because the resulting sets are used for training different first-layer networks (see article).
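
The nuclear and cytosolic sets above are defined relative to fixed N-terminal windows (112 or 68 residues). As a rough illustration only, the Python sketch below shows how such windows could be cut from a FASTA file before any comparison. The function names and the example records are hypothetical and are not part of the TargetP distribution, and the redundancy reduction procedure itself is not reproduced here (see article).

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} mapping."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequences may span several lines
    return records


def n_terminal_windows(records: Dict[str, str], length: int) -> Dict[str, str]:
    """Keep only the first `length` residues of every sequence.

    Sequences shorter than `length` are returned unchanged.
    """
    return {name: seq[:length] for name, seq in records.items()}


# Hypothetical records, for illustration only (not taken from the data sets).
example = """>HYPOTHETICAL_1
MKTAYIAKQRQISFVKSHFSRQ
LEERLGLIEVQ
>HYPOTHETICAL_2
MSS
""".splitlines()

records = read_fasta(example)
windows_68 = n_terminal_windows(records, 68)   # window length used for the 68-residue sets
windows_10 = n_terminal_windows(records, 10)   # short window, just to make the cut visible
```

With real data, the `example` lines would be replaced by the lines of one of the fasta files, and `length` set to 112 or 68 as appropriate.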

The set of 940 proteins used in Tables 1 and 2 in the JMB article consists of the cTP, mTP, SP, nuclear(54), and cytosolic(108) sets. The "other" set (162 entries) is a concatenation of the nuclear and cytosolic sets.



Non-plant predictor data sets


Mitochondrial (mTP) sequences
number:   371       fasta       AC numbers        (redundancy reduced on mTP+3 aa)
Comments: SWISS-PROT release 38. Plant and non-plant proteins (*).
Secretory Pathway/Signal Peptide (SP) sequences
number:   715       fasta       AC numbers
Comments: SWISS-PROT release 37.
Nuclear sequences
number: 1214       fasta       AC numbers         (redundancy reduced on 68 N-term. aa's)
Comments: SWISS-PROT release 37.
Cytosolic sequences
number:   438       fasta       AC numbers         (redundancy reduced on 68 N-term. aa's)
Comments: SWISS-PROT release 37.

The set of 2738 proteins used in Tables 1 and 2 in the JMB article consists of the mTP, SP, nuclear, and cytosolic sets. The "other" set (1652 entries) is a concatenation of the nuclear and cytosolic sets.
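
The totals quoted for the two predictors can be cross-checked against the per-class counts listed on this page. The short Python sketch below does this arithmetic; the counts are copied from the listings above (using the 68-residue nuclear and cytosolic sets for the plant predictor), while the dictionary and function names are only illustrative.

```python
# Per-class sequence counts as listed on this page.
PLANT = {"cTP": 141, "mTP": 368, "SP": 269, "nuclear": 54, "cytosolic": 108}
NON_PLANT = {"mTP": 371, "SP": 715, "nuclear": 1214, "cytosolic": 438}


def summarize(counts):
    """Return the overall total and the size of the combined "other" set."""
    other = counts["nuclear"] + counts["cytosolic"]
    return {"total": sum(counts.values()), "other": other}


plant_summary = summarize(PLANT)
non_plant_summary = summarize(NON_PLANT)

print("plant:", plant_summary)          # total 940, other 162
print("non-plant:", non_plant_summary)  # total 2738, other 1652
```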



 (*) These data sets contain both plant and non-plant mTPs as justified by a study by Schneider et al., Proteins, 30, 49-60 (1998).


