We first constructed a database of accurate crystal structures to derive a threshold for when amino acid sequence similarity implies structural similarity:

setI is a set of 942 protein chains with a resolution better than 1.8 Å.
The similarity threshold derived using setI was then built into the RedHom homology reduction tool. This tool was in turn used to create the following non-redundant databases, in which the entries have a sequence identity I < 290/L, where L is the length of the alignment.
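
The criterion can be written as a small check. The following is a minimal sketch that takes the formula exactly as stated above, I < 290/L. The function names are hypothetical, the units of I are not specified here and are left to the caller, and this is not RedHom's actual implementation.

```python
def identity_threshold(alignment_length: int) -> float:
    """Upper bound on the sequence identity I for an alignment of length L.

    Follows the formula as written in the text: I < 290/L.
    """
    if alignment_length <= 0:
        raise ValueError("alignment length must be positive")
    return 290.0 / alignment_length


def below_threshold(identity: float, alignment_length: int) -> bool:
    """Return True if an aligned pair satisfies I < 290/L (strict inequality)."""
    return identity < identity_threshold(alignment_length)
```

A pair that fails this check would count as redundant, so only one of the two entries would be kept in a reduced set.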
setII is a set of 525 proteins used to develop the SoWhat method for prediction of distances in proteins.
setII was divided into a training set containing 420 sequences, train, used to train the neural networks, and a test set containing 105 sequences, test, used to test the performance of the neural networks.
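
The division of setII into 420 training and 105 test sequences can be sanity-checked once the identifiers are loaded. The sketch below assumes each set is available as a list of identifier strings; the identifiers are placeholders, and the function name is hypothetical. How the original split was chosen is not described here, so the slice is only for illustration.

```python
from typing import Iterable


def is_partition(full: Iterable[str], train: Iterable[str], test: Iterable[str]) -> bool:
    """Return True if train and test are disjoint and together equal full."""
    full_set, train_set, test_set = set(full), set(train), set(test)
    return train_set.isdisjoint(test_set) and (train_set | test_set) == full_set


# Placeholder identifiers (hypothetical), sized like the sets in the text.
set_ii_ids = [f"chain_{i:03d}" for i in range(525)]
train_ids = set_ii_ids[:420]   # 420 training sequences
test_ids = set_ii_ids[420:]    # 105 test sequences

print(is_partition(set_ii_ids, train_ids, test_ids))
```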

To further test the neural networks, a second test set containing 131 sequences was extracted from release 79 of the PDB: set.131

The performance of the networks was also tested on a subset of 39 of these 131 sequences, which belonged to fold classes present in neither the training nor the test set: set.39

The threading algorithm "shredder" was tested on a subset of 35 of the 105 sequences in test: set.35. These sequences all contain a domain belonging to a superfamily that occurs at least twice in the set.

To obtain a large set of sequences with low sequence similarity, setII and set.131 can be concatenated into a set containing 656 sequences: set.656
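
The concatenation itself is simple bookkeeping: 525 + 131 = 656. As a minimal sketch, assuming both sets are held as lists of identifier strings (the identifiers and the function name below are hypothetical), the merge can refuse duplicated entries so the combined size is exactly the sum of the parts.

```python
from typing import List, Sequence


def concatenate_sets(first: Sequence[str], second: Sequence[str]) -> List[str]:
    """Concatenate two identifier lists in order, refusing duplicated identifiers."""
    combined = list(first) + list(second)
    seen = set()
    duplicates = set()
    for ident in combined:
        if ident in seen:
            duplicates.add(ident)
        seen.add(ident)
    if duplicates:
        raise ValueError(f"duplicated identifiers: {sorted(duplicates)}")
    return combined


# Placeholder identifiers (hypothetical), sized like setII and set.131.
set_ii_entries = [f"setII_{i:03d}" for i in range(525)]
set_131_entries = [f"set131_{i:03d}" for i in range(131)]
set_656_entries = concatenate_sets(set_ii_entries, set_131_entries)
```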


