
Article abstract


REFERENCE

CpHModels-3.0. Remote homology modeling using structure guided profile sequence alignment and double-sided baseline corrected scoring scheme.
Morten Nielsen, Claus Lundegaard, Ole Lund, and Thomas Nordahl Petersen
Abstract at the CASP8 conference 193, 2008.

Center for Biological Sequence Analysis, Department of Systems Biology, The Technical University of Denmark, DK-2800 Lyngby, Denmark


ABSTRACT

Summary

Sequence profiles have broad application in bioinformatics prediction algorithms, dating back to the pioneering work of Rost and Sander. The field of protein structure prediction has benefited greatly from this work, and most high-performing algorithms for protein homology modeling use sequence profiles as their main vehicle. Likewise, prediction of local protein structural features has been demonstrated to improve when sequence profiles are used to represent the protein sequences. Here, we develop a scoring scheme for remote homology modeling that builds on these findings. Two protein sequences are aligned using local sequence alignment with an amino acid scoring matrix constructed by combining sequence profiles with local protein structural features such as secondary structure and relative surface accessibility. For the query sequence, whose structure is unknown, predicted local features are used. For the template PDB structure, averages of predicted and DSSP-assigned local features are used. Secondary structure predictions are performed using the artificial neural network approach described by Petersen et al., and relative surface exposure is predicted using a doubled structure neural network approach as described by Petersen et al. Each element in the alignment function (profile, secondary structure, and relative surface exposure) was scored using a log-likelihood approach where the likelihood was estimated as (sum_a p_ia p_ja)/O, where the sum is over the different classes a of the given feature (amino acids, secondary structure elements, and exposure classes), p_ia is the probability of observing feature class a in protein i (and p_ja likewise in protein j), and O is the odds value defining a background score for the given feature. The log-likelihood odds values, the relative weights on the three parts of the alignment function, and the two affine gap-penalty values were optimized using a set of structurally superimposable sequence pairs with low mutual sequence similarity.
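As a rough illustration of the per-position scoring described above, the following Python sketch computes log((sum_a p_ia p_ja)/O) for each feature and combines the three parts with relative weights. The function names, the odds values, the weights, the toy four-letter "profile" alphabet, and the small floor that avoids log(0) are all assumptions made for this example; the optimized values used in CpHModels-3.0 are not given in the abstract.

```python
import math

FLOOR = 1e-12  # assumption of this sketch: avoids log(0) for non-overlapping distributions


def feature_log_odds(p_i, p_j, odds):
    """Return log((sum_a p_ia * p_ja) / O) for one feature at one position pair."""
    overlap = sum(a * b for a, b in zip(p_i, p_j))
    return math.log(max(overlap, FLOOR) / odds)


def position_score(pos_i, pos_j, odds, weights):
    """Weighted sum of the per-feature log-odds terms for one position pair."""
    return sum(
        weights[name] * feature_log_odds(pos_i[name], pos_j[name], odds[name])
        for name in weights
    )


# Hypothetical odds values and relative weights, for illustration only.
ODDS = {"profile": 0.30, "secondary": 0.35, "exposure": 0.50}
WEIGHTS = {"profile": 1.0, "secondary": 0.5, "exposure": 0.5}

# Toy class probabilities: a 4-letter alphabet stands in for the 20 amino acids,
# three secondary structure classes (helix, strand, coil), two exposure classes.
query_pos = {
    "profile": [0.7, 0.1, 0.1, 0.1],
    "secondary": [0.8, 0.1, 0.1],
    "exposure": [0.9, 0.1],
}
template_pos = {
    "profile": [0.6, 0.2, 0.1, 0.1],
    "secondary": [0.7, 0.2, 0.1],
    "exposure": [0.8, 0.2],
}

score = position_score(query_pos, template_pos, ODDS, WEIGHTS)
print(f"position score: {score:.3f}")
```

Evaluating position_score for every query/template position pair would give the scoring matrix that a local alignment with affine gap penalties operates on. A feature term is positive when the two distributions overlap more than the background value O and negative otherwise.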
Relating a sequence alignment score to the likelihood of the two sequences being structurally similar is not straightforward. Protein length and amino acid composition, among other things, determine how a protein sequence will score against other protein sequences. We design a double-sided baseline corrected scoring scheme to allow for a direct interpretation of the alignment scores in terms of structural similarity likelihood. Each sequence is aligned against a set of 1500 representative sequences with low internal sequence similarity and broad structural diversity. A baseline correction for the sequence is estimated from a least-squares fit of the alignment scores to the logarithm of the template query sequence. Next, a mean score and standard deviation are estimated from the baseline-corrected score distribution after removal of outliers. The baseline fit, mean score, and standard deviation values for the two sequences are then used to determine the significance of a given alignment score. This significance score is calculated as Z = (2 ZQ ZT)/(ZQ + ZT), where ZQ and ZT are the baseline-corrected Z-score values of the alignment score for the query (Q) and template (T) sequences, respectively.
A curated version of the PDB, in which the SEQRES sequence was aligned to the PDB sequence with atom coordinates, was used as the template database. Sequence profiles were generated using PSI-BLAST with default parameters for three iterations and an E-value cut-off of 0.001. Large-scale benchmarking and cross-validation demonstrate that the use of local structure predictions to guide the pairwise sequence alignment significantly improved the alignment quality beyond that obtained using sequence profiles only. Further, the use of double-sided baseline correction improved the specificity of the method for template recognition.
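The double-sided baseline correction can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the baseline is taken to be linear in the natural logarithm of the length of each background sequence, outliers are removed with a single three-standard-deviation pass, and a non-positive one-sided Z-score yields a combined score of 0. None of these details are specified in the abstract, and all function names and numbers are hypothetical.

```python
import math
import statistics


def fit_baseline(lengths, scores):
    """Least-squares fit of score = a + b * ln(length); returns (a, b)."""
    xs = [math.log(n) for n in lengths]
    mx, my = statistics.mean(xs), statistics.mean(scores)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, scores))
    b = sxy / sxx
    return my - b * mx, b


def background_stats(lengths, scores, a, b, cutoff=3.0):
    """Mean and standard deviation of baseline-corrected scores after one
    outlier-removal pass (the 3-sigma cutoff is an assumption of this sketch)."""
    residuals = [s - (a + b * math.log(n)) for n, s in zip(lengths, scores)]
    mu, sd = statistics.mean(residuals), statistics.pstdev(residuals)
    kept = [r for r in residuals if abs(r - mu) <= cutoff * sd]
    return statistics.mean(kept), statistics.stdev(kept)


def one_sided_z(score, length, a, b, mu, sd):
    """Baseline-corrected Z-score of one alignment score."""
    return (score - (a + b * math.log(length)) - mu) / sd


def double_sided_z(zq, zt):
    """Z = (2 ZQ ZT)/(ZQ + ZT); the guard for non-positive values is an assumption."""
    if zq <= 0 or zt <= 0:
        return 0.0
    return 2.0 * zq * zt / (zq + zt)


# Hypothetical background alignments: lengths of the background sequences and
# the scores obtained against them by the query (Q) and by the template (T).
bg_lengths = [60, 90, 120, 180, 250, 400]
bg_scores_q = [11.0, 13.5, 13.0, 16.0, 16.5, 19.0]
bg_scores_t = [9.0, 10.0, 12.5, 12.0, 14.5, 15.0]

aq, bq = fit_baseline(bg_lengths, bg_scores_q)
at, bt = fit_baseline(bg_lengths, bg_scores_t)
mu_q, sd_q = background_stats(bg_lengths, bg_scores_q, aq, bq)
mu_t, sd_t = background_stats(bg_lengths, bg_scores_t, at, bt)

# Hypothetical query/template alignment: score 42.0, query length 150, template length 170.
zq = one_sided_z(42.0, 170, aq, bq, mu_q, sd_q)  # query model, template length
zt = one_sided_z(42.0, 150, at, bt, mu_t, sd_t)  # template model, query length
print(f"ZQ={zq:.2f} ZT={zt:.2f} Z={double_sided_z(zq, zt):.2f}")
```

Because the combined score is a harmonic mean, it is dominated by the smaller of the two one-sided Z-scores, so an alignment counts as significant only when it stands out from the background of both the query and the template.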



