FeatureP is a prediction aggregator for differential annotation of
protein sequence variants.
It operates on one or more sets of homologous protein sequences.
A set can be
a wild type plus any number of mutant alleles,
or
a set of isoforms
derived from the same transcript by alternative splicing, or
a protein
family from one or more organisms.
The input proteins are processed by a number of analysis and prediction methods;
the results are then remapped onto a multiple alignment and compared.
All the differences found between the predicted features of the variants are
reported in detail.
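The remapping step can be pictured with a minimal Python sketch. It assumes that each predictor returns one label per residue of the ungapped sequence and that "-" marks alignment gaps. The function names, and the choice to treat gap columns as unannotated, are illustrative assumptions and not FeatureP's actual implementation.

```python
GAP = "-"

def remap_to_alignment(aligned_seq, residue_labels, gap_label=""):
    """Spread per-residue labels over alignment columns.

    Gap columns receive gap_label (assumed here to mean "no annotation").
    """
    n_residues = sum(1 for ch in aligned_seq if ch != GAP)
    if n_residues != len(residue_labels):
        raise ValueError("need exactly one label per non-gap residue")
    labels = iter(residue_labels)
    return [gap_label if ch == GAP else next(labels) for ch in aligned_seq]

def differing_columns(remapped_rows):
    """Return 0-based alignment columns where the rows do not all agree."""
    return [col for col, column in enumerate(zip(*remapped_rows))
            if len(set(column)) > 1]

# Toy data: the reference has a predicted site "P" on its third residue.
ref = remap_to_alignment("MKS-LV", ["", "", "P", "", ""])
var = remap_to_alignment("MKAALV", ["", "", "", "", "", ""])
diff = differing_columns([ref, var])
```

Once every sequence has been remapped to the same column coordinates, a comparison is a simple column-by-column check.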
NOTE 1: This service is under development; some planned elements are
not operational yet. Documentation is in preparation.
NOTE 2: FeatureP is dependent on a number of other programs
that have to be run on the input sequences prior to the comparative analysis.
Therefore, the processing of many sequences may be time-consuming, especially with
many features enabled. Particularly CPU-intensive features are hidden by default.
In the case of a prolonged wait, you are advised to use the e-mail option: at
any time during the wait you may enter your e-mail address and simply leave the window.
Your job will continue; you will be notified by e-mail when it has terminated.
The e-mail message will contain the URL under which the results are stored;
they will remain on the server for 24 hours for you to collect them.
NOTE 3: The webserver instance of FeatureP will only handle a maximum of 100 protein sets. If you wish to run more than this, please contact us for details of how to do so.
You can input one or more sequence sets, each set containing at least
two sequences. There are four possible formats for input sequences:
FASTA:
the sequence format used in most
bioinformatics servers and applications. It has two line types:
header lines (one per sequence) beginning with a greater-than sign
(">") followed by a sequence name and optional comments,
and sequence lines containing only one-letter amino acid codes
(A C D E F G H I K L M N P Q R S T V W Y).
When using FASTA, only one set at a time can be used as input.
dFASTA (delimited FASTA):
an extension of the basic FASTA format where an added line
type—a double slash ("//")—is used to separate
independent sets of sequences.
vFASTA (variant FASTA):
FASTA format where the delineation of sets is done via the sequence
names. Each sequence name must comprise a set identifier and a
variant identifier separated by an underscore ("_").
FASTA+ (FASTA plus variants):
an extension of the basic FASTA format with one added line type,
the short variant format line, describing one or more changes
in the sequence. This line follows the
nomenclature
for description of sequence variations suggested by
den Dunnen and Antonarakis 2001. Multiple substitutions, deletions,
or insertions in the same variant sequence are denoted by using a
comma between each variant designator. Each FASTA sequence
followed by one or more short variant format lines comprises a set.
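As a rough illustration of how a short variant format line expands into a variant sequence, the Python sketch below applies comma-separated substitutions to a reference sequence. It assumes designators of the form "K2R" (original residue, 1-based position, new residue). Deletions and insertions are not handled, and the exact syntax accepted by FeatureP is not specified here, so the pattern and function name are hypothetical.

```python
import re

# Assumed substitution pattern: old residue, 1-based position, new residue.
SUBSTITUTION = re.compile(r"([A-Z])(\d+)([A-Z])")

def apply_substitutions(reference, variant_line):
    """Return the variant sequence described by comma-separated substitutions."""
    residues = list(reference)
    for designator in variant_line.split(","):
        designator = designator.strip()
        match = SUBSTITUTION.fullmatch(designator)
        if match is None:
            raise ValueError(f"unsupported designator: {designator!r}")
        old, position, new = match.group(1), int(match.group(2)), match.group(3)
        if not 1 <= position <= len(reference):
            raise ValueError(f"position out of range in {designator!r}")
        if reference[position - 1] != old:
            raise ValueError(f"reference residue mismatch in {designator!r}")
        residues[position - 1] = new
    return "".join(residues)

variant = apply_substitutions("MKTLV", "K2R, V5I")
```

Checking the stated original residue against the reference catches off-by-one position errors before any predictor is run.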
A small example of the chosen input format is shown to the right of the sequence
input window.
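The three FASTA-based layouts described above can be sketched as a small Python reader. This is an illustration under stated assumptions, not the server's parser: the function names are hypothetical, lower-case residues are upper-cased, the sequence name is taken as the first whitespace-delimited word of the header, and vFASTA names are split at the first underscore.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def parse_fasta(text, allow_gaps=False):
    """Return a list of (name, sequence) tuples from plain FASTA text."""
    allowed = AMINO_ACIDS | ({"-"} if allow_gaps else set())
    records, name, chunks = [], None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            fields = line[1:].split(None, 1)  # name, then optional comments
            name, chunks = (fields[0] if fields else ""), []
        else:
            if name is None:
                raise ValueError("sequence line found before any header line")
            residues = line.upper()
            if set(residues) - allowed:
                raise ValueError(f"unexpected characters in sequence {name!r}")
            chunks.append(residues)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def parse_dfasta(text, allow_gaps=False):
    """Split on '//' lines and parse each block as one set."""
    sets, block = [], []
    for line in text.splitlines():
        if line.strip() == "//":
            if block:
                sets.append(parse_fasta("\n".join(block), allow_gaps))
            block = []
        elif line.strip():
            block.append(line)
    if block:
        sets.append(parse_fasta("\n".join(block), allow_gaps))
    return sets

def group_vfasta(records):
    """Group (name, sequence) records by the set identifier before '_'."""
    sets = {}
    for name, sequence in records:
        set_id, sep, variant_id = name.partition("_")
        if not (sep and set_id and variant_id):
            raise ValueError(f"expected <set>_<variant> in name {name!r}")
        sets.setdefault(set_id, []).append((variant_id, sequence))
    return sets
```

Passing allow_gaps=True accepts pre-aligned input that uses "-" as the gap character.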
The taxonomic group from which the sequences are
derived should be chosen from the drop-down menu. This ensures that, e.g.,
fungal-specific motifs are not applied to human sequences. However, it is also
possible to disable taxonomic filtering by choosing the option "Any".
In order to be able to compare the predicted features of the sequences
in each set, the sequences need to be aligned.
If your sequences within one or more sets have
different lengths (i.e. if there are insertions and/or deletions),
choose the Yes option.
If your sequences within each set have the same length (i.e. they differ
only by substitutions) or if you already have aligned them (using "-"
as the gap character), choose the No option.
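The Yes/No rule above reduces to a length check, sketched here in Python. The function name is hypothetical; the check relies on the fact that pre-aligned sequences, padded with "-", all share the same length.

```python
def needs_alignment(sequences):
    """Return True if the sequences in a set differ in length."""
    return len({len(seq) for seq in sequences}) > 1

same_length = needs_alignment(["MKTLV", "MKALV"])   # substitutions only
with_indel = needs_alignment(["MKTLV", "MKLV"])     # one residue deleted
pre_aligned = needs_alignment(["MKTLV", "MK-LV"])   # already gapped
```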
Alignment method: If "Multiple alignment" is set to Yes,
you can choose between several alignment methods in the drop-down menu.
Alignment parameters: Some of the alignment methods have parameters
that can be specified by an advanced user. This option is shown only if one of the alignment methods
with adjustable parameters is chosen.
FeatureP contains a large number of predictable features.
The user should specify which of these are relevant to the sequences under study.
A scrollable list of all features is shown by default, and
a number of filters are provided in order to assist the user in the selection process
by limiting the number of features shown in the list.
When the word Filters is clicked, a section containing a number of
tag buttons appears.
Each tag button limits the number of shown features to those matching the tag.
In the list of features, the tags for each feature are indicated after the
description.
Note: Features can have multiple tags; e.g.
the signal peptide predictor
SignalP is tagged with both "Peptide cleavage" and "Protein localisation",
since the signal peptide is a cleaved sorting signal.
In addition to the tag buttons, there is an input field (the
free-text filter) where any text can be entered, and the list
of features will be limited to those matching the text.
The matching applies both to the names and the descriptions of the features.
The tag buttons (and the free-text filter) operate via AND logic;
clicking two buttons applies two filters and limits the list further.
However, it is perfectly possible to enable groups of features according to
all logical combinations:
To select features having A AND B
(where A and B are tag labels):
click A, click B, then click "Enable shown".
To select features having A OR B:
click A, click "Enable shown",
click A again to remove the filter,
then click B and click "Enable shown" again.
To select features having A but NOT B:
click A, click "Enable shown",
then click B and click "Disable shown".
To see the full range of features you have selected, click the button
"Clear all filters".
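The three click recipes above can be checked with a small Python model of the feature list. The class and method names are hypothetical and the four features are toy data; the model only assumes what the text states, namely that tag buttons toggle, that active filters combine with AND logic, and that "Enable shown" and "Disable shown" act on the currently visible features.

```python
class FeatureList:
    def __init__(self, feature_tags):
        self.feature_tags = feature_tags  # feature name -> set of tags
        self.active_tags = set()
        self.enabled = set()

    def click(self, tag):
        """Toggle a tag button on or off."""
        self.active_tags ^= {tag}

    def shown(self):
        """Features carrying every active tag (AND logic)."""
        return {name for name, tags in self.feature_tags.items()
                if self.active_tags <= tags}

    def enable_shown(self):
        self.enabled |= self.shown()

    def disable_shown(self):
        self.enabled -= self.shown()

TOY = {"f1": {"A"}, "f2": {"B"}, "f3": {"A", "B"}, "f4": {"C"}}

def select_and():
    fl = FeatureList(TOY)
    fl.click("A"); fl.click("B"); fl.enable_shown()
    return fl.enabled

def select_or():
    fl = FeatureList(TOY)
    fl.click("A"); fl.enable_shown()
    fl.click("A")  # remove the A filter
    fl.click("B"); fl.enable_shown()
    return fl.enabled

def select_a_not_b():
    fl = FeatureList(TOY)
    fl.click("A"); fl.enable_shown()
    fl.click("B"); fl.disable_shown()  # visible now: features with A AND B
    return fl.enabled
```

In set terms, "Enable shown" is a union and "Disable shown" is a difference, which is why the OR and NOT recipes work even though the filters themselves only combine with AND.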
Special attention should be given to the tag buttons
"Secretory pathway" and "Cytosol/nucleus".
Since many predicted features, including most of the ELM classes, are specific
to one of these two broad categories of subcellular locations, one of these
filters should always be applied, unless your proteins are transmembrane
proteins with both "inside" and "outside" loops.
Most FeatureP features are positional, i.e.
the predictions pertain to specific sites or regions in the sequences.
However, a few features (tagged with "Non-positional feature")
pertain to the entire sequence, e.g. the WolFPsort predictor for
multi-category subcellular localisation.
A number of particularly CPU intensive features are hidden by default. While the
Filters section is shown, you can click on the
button labelled "Include CPU intensive features"
to have them included in the scrollable list, so they are ready to be
enabled. The button will turn orange, and clicking it a second time will hide the
CPU intensive features again.
While included in the list, the CPU intensive features are marked with an
orange sign at the end of their description.
In the scrollable list shown here, it is possible to
enable or disable individual features simply by clicking the checkbox
in the column marked "Enabled".
Clicking the name of a feature opens a new page with further information about the
feature.
When features have been selected, a "Quick feature selection string"
is shown near the bottom of the input page close to the "Submit"
button. This string can be saved for future re-use, if you want to run the exact
same selection of features on another dataset or with another alignment algorithm,
or it can be used as a starting point for modifying the selection of features.
The "Quick feature selection string" is repeated in the output for
easy access.
Submit the job
When you are done entering or uploading sequences and selecting the
features for analysis,
click on the "Submit" button. The status of your job
(either 'queued' or 'running') will be displayed and constantly updated
until it terminates and the server output appears in the browser window.
At any time during the wait you may enter your e-mail address and simply
leave the window. Your job will continue; you will be notified by e-mail
when it has terminated. The e-mail message will contain the URL under
which the results are stored; they will remain on the server for 24 hours
for you to collect them.
The FeatureP output has two levels.
The page displayed to the user when the run has finished (the "level 1" output)
displays a list of the sets in the input. Clicking on a set number will take
the user to a page (the "level 2" output) with details about the predicted
features for the specific set.
Level 1 output (for all sets)
For each set, the following information is shown:
Set #: The number of the set in the input file.
In case of FASTA input, there will be only one set.
Set ID: The name of the first sequence in the set
for identification purposes.
Total diff: The total number of alignment positions
with differences in predicted features within the set.
By default, the sets are ordered according to this number.
Features w. diff: The number of features that showed
at least one difference between the sequences in the set.
Feature w. largest diff: The name of the feature that
showed the maximal number of positions with differences.
Largest diff: The number of positions with differences
shown by the aforementioned feature.
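The level 1 columns can be derived from per-feature difference positions, as in the Python sketch below. Two points are assumptions rather than documented behaviour: "Total diff" is computed as the number of distinct alignment positions across all features, and sets are ordered with the largest total first. The function and feature names are hypothetical.

```python
def summarise_set(diff_positions):
    """diff_positions maps feature name -> set of differing alignment positions."""
    with_diff = {name: pos for name, pos in diff_positions.items() if pos}
    largest = max(with_diff, key=lambda name: len(with_diff[name]), default=None)
    return {
        "total_diff": len(set().union(*with_diff.values())),
        "features_with_diff": len(with_diff),
        "feature_with_largest_diff": largest,
        "largest_diff": len(with_diff[largest]) if largest is not None else 0,
    }

# Toy data for two sets.
summaries = [
    summarise_set({"feature_x": {7}, "feature_y": set()}),
    summarise_set({"feature_x": {1, 2, 3}, "feature_y": {3, 4}, "feature_z": set()}),
]
# Assumed default ordering: most differences first.
ordered = sorted(summaries, key=lambda s: s["total_diff"], reverse=True)
```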
Below the list, the "Quick feature selection string" is repeated from the
input page, and a citation list is shown. In addition to the citation for
FeatureP itself, the list shows the relevant citations for all selected
feature predictors.
Level 2 output (for a single set)
Clicking on a set number in the level 1 output will
take you to a page (the "level 2" output) with details
about the predicted features for the specific set.
This page has a tabbed interface with the following tab labels:
Sequence Identifiers
A list of the sequence identifiers used in the level 2
output together with their FASTA names. Since FASTA names
may be very long, we have decided to use short identifiers
in the graphical outputs, and this list serves as a key
to these identifiers. The first sequence in the set,
with identifier "seq00000", is regarded as the
reference sequence to which the other sequences
(the variant sequences) should be compared.
Differential Table
A table showing, for each selected feature and each variant
sequence, how many positions were predicted differently from
the reference sequence. In addition to the total
number of difference positions, the numbers of gain and loss
positions are also given. A gain is defined as a change from
no annotation to something, while a loss is a change in the
opposite direction. The total number of differences may be
higher than the sum of gains and losses if the feature
has several output categories and some positions have changed
from one category to another.
Differential plot
A plot showing the sequence alignment and, below it, a
line for each feature that showed at least one difference
between the sequences in the set. Each line shows where
in the alignment the differences were found. Positions with
differences in the sequences are marked with yellow background
in the alignment.
Positional
A plot showing the detailed prediction of each selected
positional feature for each sequence; below a colour-coded
alignment, it is shown which category was predicted in each
position of each sequence (see Figure 6). One particular use
of the positional plot is to compare the locations of predicted
ELM classes with predictions of intrinsically disordered regions
(DisEMBL and IUpred)—this could help filter out possible
false positive ELM hits, since linear motifs are known to
preferentially occur in disordered regions.
Non-positional
A table showing the detailed prediction of each selected
non-positional feature for each sequence.
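The gain, loss and total counts reported in the Differential Table can be illustrated with a short Python sketch. It assumes the predictions of the reference and the variant have already been remapped to the same alignment columns and that an empty string stands for "no annotation"; the function name and the category letters are hypothetical.

```python
def count_differences(reference, variant, unannotated=""):
    """Compare two equally long per-column label lists for one feature."""
    if len(reference) != len(variant):
        raise ValueError("label lists must cover the same alignment columns")
    total = gains = losses = 0
    for ref_label, var_label in zip(reference, variant):
        if ref_label == var_label:
            continue
        total += 1
        if ref_label == unannotated:
            gains += 1    # nothing -> something
        elif var_label == unannotated:
            losses += 1   # something -> nothing
        # otherwise: a change from one category to another
    return {"total": total, "gain": gains, "loss": losses}

# Toy labels: one gain, one loss, one change of category.
counts = count_differences(["", "H", "H", "E", ""],
                           ["H", "H", "", "C", ""])
```

The category change in the fourth column is counted in the total but is neither a gain nor a loss, which is how the total can exceed the sum of the two.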
Reference
FeatureP: a prediction aggregator for differential annotation of protein sequence
variants
Christian Simon, Peter Wad Sackett, David Flores, Arcadio Rubio García,
Jose M. G. Izarzugaza, Valborg Gudmundsdottir, Ramneek Gupta, Kristoffer Rapacki,
Thomas Holberg Blicher, Peter Fischer Hallin, Thomas Skøt Jensen,
Agnieszka Sierakowska Juncker, Eleonora Kulberkyte, Thomas Sicheritz-Pontén,
Håkan Svensson, Kai Wang, Rasmus Wernersson, Søren Brunak, and Henrik Nielsen
Manuscript in preparation
Abstract
Protein sequence variants, such as amino acid substitutions, insertions,
deletions, or splice isoforms, often have functional consequences for the proteins.
In many cases, these functional consequences can be identified by sequence-based
bioinformatics methods predicting features such as structural elements,
binding sites, or post-translational modifications. We present here FeatureP,
a web server which launches a selection of such predictors and mines their
outputs for differential predictions, i.e. features which are predicted to
be modified as a consequence of the differences between the input sequences.
Through a number of case studies, it is shown that FeatureP can be useful in
the prioritization of protein sequence variants and in the formulation of
hypotheses concerning the mode of action of mutations in diseases.
CITATION
For publication of results before the FeatureP paper is published, please cite: