We first constructed a database of accurate crystal structures to derive
a threshold for when amino acid sequence similarity implies structural
similarity:
SetI.
This set contains 942 protein chains with a resolution better than 1.8 Å.
The similarity threshold derived using setI was then built into the RedHom
homology reduction tool. This tool was in turn used to create the following
non-redundant databases, in which the entries have a sequence identity I < 290/L,
where L is the length of the alignment.
setII
is a set of 525 proteins used to develop the SoWhat method for
prediction of distances in proteins.
setII was divided into a training set containing 420 sequences:
train
used to train the neural networks
and a test set containing 105 sequences:
test
used to test the performance of the neural networks.
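The split described above (420 training and 105 test sequences out of 525) can be sanity-checked with a few set operations. The following is only a sketch: it assumes each set is available as a list of entry identifiers, which this document does not specify, and the function name `check_partition` is hypothetical.

```python
def check_partition(parent, parts):
    """Check a claimed split of `parent` into `parts`.

    Returns three sets: entries of `parent` found in no part (missing),
    entries found in a part but not in `parent` (extra), and entries
    found in more than one part (overlapping).
    """
    parent_set = set(parent)
    seen = set()
    overlapping = set()
    for part in parts:
        part_set = set(part)
        overlapping |= seen & part_set  # entries already seen in an earlier part
        seen |= part_set
    missing = parent_set - seen
    extra = seen - parent_set
    return missing, extra, overlapping
```

If the identifiers of setII, train and test are loaded into lists, `check_partition(setII, [train, test])` should return three empty sets when train and test are disjoint and together cover setII.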
To further test the neural networks, a second test set containing 131 sequences
was extracted from release 79 of the PDB:
set.131
The performance of the networks was also tested on a subset of 39 of these 131
sequences, which belonged to fold classes present in neither the training
nor the test set:
set.39
The threading algorithm "shredder" was tested on a subset of 35 of the 105
sequences in test:
set.35. These sequences all contain a domain belonging to a superfamily
that occurs at least twice in the set.
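The selection criterion for set.35 can be illustrated with a short filter. This is a sketch of one reading of the criterion, not the procedure actually used: it assumes a hypothetical mapping from each sequence identifier to the superfamilies of its domains, and it counts each superfamily once per sequence.

```python
from collections import Counter


def with_shared_superfamily(superfamilies_by_entry, min_count=2):
    """Return entries having a domain whose superfamily occurs in at least
    `min_count` entries of the set (input order is preserved)."""
    counts = Counter()
    for superfamilies in superfamilies_by_entry.values():
        # Count each superfamily once per entry, even if several domains share it.
        counts.update(set(superfamilies))
    return [
        entry
        for entry, superfamilies in superfamilies_by_entry.items()
        if any(counts[s] >= min_count for s in superfamilies)
    ]
```

Under this reading, every sequence that shares a superfamily with a kept sequence is itself kept, so each such superfamily still occurs at least twice in the resulting subset.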
To obtain a large set of sequences with low sequence similarity,
setII
and
set.131
can be concatenated into a large set containing 656 sequences:
set.656
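The concatenation can be sketched as follows. The file format is an assumption here (one entry per non-empty line), and the helper names `read_entries` and `concatenate_sets` are hypothetical. The sketch also reports duplicates, since the count of 656 = 525 + 131 only holds if the two sets share no entries.

```python
from pathlib import Path


def read_entries(path):
    """Return the non-empty lines of a list file, stripped, in file order.

    Assumes one entry per line; adapt this to the actual file format.
    """
    text = Path(path).read_text()
    return [line.strip() for line in text.splitlines() if line.strip()]


def concatenate_sets(*entry_lists):
    """Concatenate entry lists, keeping the first occurrence of each entry.

    Returns (merged, duplicates), where `duplicates` lists every repeated
    entry that was dropped.
    """
    seen = set()
    merged = []
    duplicates = []
    for entries in entry_lists:
        for entry in entries:
            if entry in seen:
                duplicates.append(entry)
            else:
                seen.add(entry)
                merged.append(entry)
    return merged, duplicates
```

Calling `concatenate_sets(read_entries("setII"), read_entries("set.131"))` should then give a merged list of 656 entries and an empty duplicate list, provided the files follow the assumed format.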