Article abstract
REFERENCE
CpHModels-3.0. Remote homology modeling using structure guided profile sequence alignment and double-sided baseline corrected scoring scheme.
Morten Nielsen, Claus Lundegaard, Ole Lund, and Thomas Nordahl Petersen
Abstract at the CASP8 conference 193, 2008.
Center for Biological Sequence Analysis, Department of Systems Biology
The Technical University of Denmark, DK-2800 Lyngby, Denmark
ABSTRACT
Summary
Sequence profiles have broad application in the field of bioinformatics
prediction algorithms, dating back to the pioneering work by Rost and
Sander. The field of protein structure prediction has benefited greatly
from this work, and most high-performing algorithms for
protein homology modeling use sequence profiles as their main vehicle.
Likewise, the prediction of local protein structural features has
been shown to improve when sequence profiles are used to
represent the protein sequences. Here, we develop a scoring
scheme for remote homology modeling building on these findings. Two
protein sequences are aligned using local sequence alignment with an
amino acid scoring matrix constructed by combining sequence profiles with
local protein structural features such as secondary structure and relative
surface accessibility. For the query sequence, whose structure is
unknown, predicted local features are used. For the template PDB
structure, averages of predicted and DSSP-assigned local features are
used. Secondary structure predictions are performed using the
artificial neural network approach described by Petersen et al., and
relative surface exposure is predicted using a doubled structure neural
network approach as described by Petersen et al. Each element in
the alignment function (profile, secondary structure, and relative
surface exposure) was scored using a log-likelihood approach in which
the likelihood was estimated as $(\sum_a p_{ia} p_{ja})/O$, where the sum runs over the
classes $a$ of the given feature (amino acids, secondary structure
elements, and exposure classes), $p_{ia}$ is the probability of observing
feature class $a$ in protein $i$, and $O$ is the odds value
defining a background score for the given feature. The log-likelihood
odds values, the relative weights on the three parts of the alignment
function, and the two affine gap-penalty values were optimized
using a set of structurally superimposable sequence pairs with low
mutual sequence similarity. Relating a sequence alignment score to the
likelihood of the two sequences being structurally similar is not
straightforward. Protein length and amino acid composition,
among other things, determine how a protein sequence will
score against other protein sequences. We design a double-sided
baseline corrected scoring scheme to allow for a direct interpretation
of the alignment scoring values in terms of structural similarity
likelihood. Each sequence is aligned against a set of 1500 sequence
representatives with low internal sequence similarity and broad
structural diversity. A baseline correction for the sequence is
estimated from a least-squares fit of the alignment scores to the
logarithm of the template/query sequence length. Next, a mean score and
standard deviation are estimated from the baseline-corrected score
distribution after removal of outliers. The baseline fit, mean score
and standard deviation values for the two sequences are then used to
determine the significance of a given alignment score. This
significance score is calculated as $Z = 2 Z_Q Z_T / (Z_Q + Z_T)$, where $Z_Q$ and $Z_T$ are the baseline-corrected
Z-score values for the alignment score for the query (Q) and
template (T) sequences, respectively. A curated version of the PDB,
in which the SEQRES sequence was aligned to the PDB sequence with atom
coordinates, was used as the template database. Sequence profiles were
generated using PSI-BLAST with default parameters for three iterations
and an E-value cutoff of 0.001. Large-scale benchmarking and
cross-validation demonstrate that the use of local structure predictions to
guide the pairwise sequence alignment significantly improved the
alignment quality beyond that obtained using sequence profiles only.
Further, the use of double-sided baseline correction improved the
specificity of the method for template recognition.
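The two scoring ingredients described above can be illustrated with a minimal Python sketch. It is not the CpHModels implementation: the function names, the use of the natural logarithm, and the reading of the baseline as a straight-line fit of alignment score against the logarithm of sequence length are all assumptions made for illustration. Outlier removal, the optimized feature weights, and the gap penalties are omitted.

```python
import math


def feature_log_odds(p_i, p_j, odds):
    """Score one feature as log((sum_a p_ia * p_ja) / O).

    p_i and p_j are class-probability vectors for the two proteins
    (for example, 3 secondary structure classes); odds is the
    background value O. Natural log is an assumption.
    """
    overlap = sum(a * b for a, b in zip(p_i, p_j))
    return math.log(overlap / odds)


def fit_baseline(lengths, scores):
    """Least-squares fit of score = slope * log(length) + intercept.

    Assumes the baseline depends on the logarithm of the length of
    the sequences in the reference set.
    """
    xs = [math.log(n) for n in lengths]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def baseline_z(score, length, slope, intercept, mean, std):
    """Z-score of an alignment score after subtracting the baseline.

    mean and std describe the baseline-corrected score distribution
    of the sequence against the reference set.
    """
    corrected = score - (slope * math.log(length) + intercept)
    return (corrected - mean) / std


def combined_z(z_q, z_t):
    """Double-sided score Z = 2 * ZQ * ZT / (ZQ + ZT).

    Undefined when z_q + z_t == 0; callers must handle that case.
    """
    return 2.0 * z_q * z_t / (z_q + z_t)
```

For example, `combined_z(2.0, 6.0)` evaluates to 3.0. The combined value is pulled toward the smaller of the two Z-scores, so a hit scores highly only when it is significant from both the query and the template side.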
CORRESPONDENCE
CBS Webmaster,