
FeatureP 0.983 Server (BETA version)

FeatureP is a prediction aggregator for differential annotation of protein sequence variants. It operates on one or more sets of homologous protein sequences. A set can be

  • a wild type plus any number of mutant alleles, or
  • a set of isoforms derived from the same transcript by alternative splicing, or
  • a protein family from one or more organisms.
The input proteins are processed by a number of analysis and prediction methods. The results are then remapped onto a multiple alignment and compared. All differences found between the predicted features of the variants are reported in detail.

NOTE 1:   This service is under development; some planned elements are not operational yet. Documentation is in preparation.

NOTE 2:   FeatureP depends on a number of other programs that must be run on the input sequences before the comparative analysis. Processing many sequences may therefore be time-consuming, especially with many features enabled. Particularly CPU-intensive features are hidden by default. If the wait is prolonged, you are advised to use the e-mail option: at any time during the wait you may enter your e-mail address and simply leave the window. Your job will continue; you will be notified by e-mail when it has terminated. The e-mail message will contain the URL under which the results are stored; they will remain on the server for 24 hours for you to collect them.

NOTE 3:   The webserver instance of FeatureP will only handle a maximum of 100 protein sets. If you wish to run more than this, please contact us for details of how to do so.

 Specify the input


Input sequences

Select the format of your input: (example is shown to the right)

Paste one set of homologous protein sequences into the field below:

 or submit a file directly from your local disk:

Taxonomy

Multiple Seq. Alignment

 Yes, let FeatureP align each set of input sequences
 No, the input sequences are already aligned

 Select features for analysis


In this section you select which features you want to enable in your analysis. You can filter the features in FeatureP by selecting the tag filters (blue boxes below); they are combined with AND logic. You can also add a keyword phrase to match the tool name or description. Use the button "Clear all filters" to reset all filters and show all features.


Please note: The keyword filter is always applied (also when adjusting the filter buttons above).

Feature selection

To enable a feature simply mark the checkbox to the left of the tool name in the table.
You can also enable and disable the whole view in the table by clicking the buttons "Enable/Disable shown" below the table.
Please note: All features are alphabetically ordered with the enabled features (in the current filter view selected above) at the top of the table.



You can also paste in a quick feature selection string from a previous analysis in the field below and press apply to enable all the features from that analysis (including any you might also have chosen manually before).


Finally, you can save the current quick feature selection string for all the features currently enabled by copying the string in the grey field below.




Restrictions:
At most 2,000 sequences and 200,000 amino acids per submission.
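
Before submitting, it may be convenient to check a dataset against these limits locally. The following is a minimal sketch, not part of FeatureP: the function name `check_submission` is hypothetical, the limits are taken from this page (2,000 sequences, 200,000 amino acids, 100 sets, at least two sequences per set), and it assumes that gap characters in pre-aligned input do not count as amino acids.

```python
# Limits as stated on this page; the server may enforce them differently.
MAX_SEQUENCES = 2000
MAX_RESIDUES = 200_000
MAX_SETS = 100
MIN_SEQUENCES_PER_SET = 2


def check_submission(sets, gap="-"):
    """Return a list of problems found in `sets` (a list of lists of sequences).

    An empty list means no limit stated on this page is exceeded.
    Assumption: gap characters are not counted as amino acids.
    """
    problems = []
    if len(sets) > MAX_SETS:
        problems.append(f"{len(sets)} sets exceeds the limit of {MAX_SETS}")
    for number, seqs in enumerate(sets, start=1):
        if len(seqs) < MIN_SEQUENCES_PER_SET:
            problems.append(
                f"set {number} has fewer than {MIN_SEQUENCES_PER_SET} sequences"
            )
    n_sequences = sum(len(seqs) for seqs in sets)
    if n_sequences > MAX_SEQUENCES:
        problems.append(
            f"{n_sequences} sequences exceeds the limit of {MAX_SEQUENCES}"
        )
    n_residues = sum(len(seq) - seq.count(gap) for seqs in sets for seq in seqs)
    if n_residues > MAX_RESIDUES:
        problems.append(
            f"{n_residues} amino acids exceeds the limit of {MAX_RESIDUES}"
        )
    return problems
```

The function collects all problems instead of stopping at the first one, so a single pass reports everything that needs fixing.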

Confidentiality:
The sequences are kept confidential and will be deleted after processing.

Specify the input

Input sequences

You can input one or more sequence sets, each set containing at least two sequences. There are four possible formats for input sequences:

  • FASTA: the sequence format used in most bioinformatics servers and applications. It has two line types: header lines (one per sequence) beginning with a greater-than sign (">") followed by a sequence name and optional comments, and sequence lines containing only one-letter amino acid codes (A C D E F G H I K L M N P Q R S T V W Y). When using FASTA, only one set at a time can be used as input.
  • dFASTA (delimited FASTA): an extension of the basic FASTA format where an added line type—a double slash ("//")—is used to separate independent sets of sequences.
  • vFASTA (variant FASTA): FASTA format where the delineation of sets is done via the sequence names. Each sequence name must comprise a set identifier and a variant identifier separated by an underscore ("_").
  • FASTA+ (FASTA plus variants): an extension of the basic FASTA format with one added line type, the short variant format line, describing one or more changes in the sequence. This line follows the nomenclature for description of sequence variations suggested by den Dunnen and Antonarakis 2001. Multiple substitutions, deletions, or insertions in the same variant sequence are denoted by using a comma between each variant designator. Each FASTA sequence followed by one or more short variant format lines comprises a set.
A small example of the chosen input format is shown to the right of the sequence input window.
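
To illustrate how the multi-set formats delineate sets, here is a rough Python sketch that splits dFASTA text at "//" lines and groups vFASTA records by set identifier. It is not FeatureP's own parser. The function names are hypothetical, and for vFASTA it assumes that the set identifier is everything before the last underscore in the sequence name, which the description above does not specify.

```python
def read_fasta_records(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # The name is the first word after ">"; the rest is a comment.
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def split_dfasta(text):
    """Split dFASTA text into a list of sets at lines consisting of '//'."""
    sets, current = [], []
    for line in text.splitlines():
        if line.strip() == "//":
            sets.append(list(read_fasta_records(current)))
            current = []
        else:
            current.append(line)
    sets.append(list(read_fasta_records(current)))
    return [records for records in sets if records]  # drop empty sets


def group_vfasta(text):
    """Group vFASTA records into {set_id: [(variant_id, sequence), ...]}.

    Assumption: the LAST underscore separates set and variant identifiers.
    """
    groups = {}
    for name, seq in read_fasta_records(text.splitlines()):
        set_id, sep, variant_id = name.rpartition("_")
        if not sep:
            raise ValueError(f"no underscore in sequence name: {name!r}")
        groups.setdefault(set_id, []).append((variant_id, seq))
    return groups
```

For example, `split_dfasta(">a1\nMKT\n>a2\nMKS\n//\n>b1\nAAA\n>b2\nAAG\n")` returns two sets of two records each.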

Taxonomy

The taxonomic group from which the sequences are derived should be chosen from the drop-down menu. This ensures that, e.g., fungal-specific motifs are not applied to human sequences. However, it is also possible to disable taxonomic filtering by choosing the option "Any".

Multiple alignment

In order to be able to compare the predicted features of the sequences in each set, the sequences need to be aligned.

  • If your sequences within one or more sets have different lengths (i.e. if there are insertions and/or deletions), choose the Yes option.
  • If your sequences within each set have the same length (i.e. they differ only by substitutions) or if you already have aligned them (using "-" as the gap character), choose the No option.
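
The rule above can be expressed as a simple length check. The sketch below uses a hypothetical helper name, `alignment_option`, and only looks at sequence lengths: equal lengths covers both the substitution-only case and input that was aligned beforehand with "-" gaps.

```python
def alignment_option(sequences):
    """Suggest 'Yes' or 'No' for the Multiple alignment setting of one set."""
    if len(sequences) < 2:
        raise ValueError("a set needs at least two sequences")
    lengths = {len(seq) for seq in sequences}
    # One distinct length: substitutions only, or already aligned with gaps.
    return "No" if len(lengths) == 1 else "Yes"
```

Note that equal lengths do not prove that unaligned sequences differ only by substitutions (an insertion and a deletion may cancel out), so the result is a suggestion rather than a guarantee.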

Alignment method: If "Multiple alignment" is set to Yes, you can choose between several alignment methods in the drop-down menu.

Alignment parameters: Some of the alignment methods have parameters that can be specified by an advanced user. This point will only be shown if one of the alignment methods with adjustable parameters is chosen.
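
Once the sequences in a set are aligned, per-residue predictions can be placed on alignment columns so that the same column can be compared across sequences. The following sketch shows one way to do this remapping; it is an illustration under stated assumptions, not FeatureP's implementation. The function names are hypothetical, each prediction is assumed to be one label per residue, and gap columns are assumed to carry no annotation (`None`).

```python
def residue_to_column(aligned_seq, gap="-"):
    """Return a list whose entry i is the 0-based alignment column of residue i."""
    return [col for col, char in enumerate(aligned_seq) if char != gap]


def remap_annotation(aligned_seq, annotation, gap="-", empty=None):
    """Spread per-residue labels over alignment columns.

    `annotation` holds one label per residue of the ungapped sequence.
    Gap columns receive `empty` (an assumption about how gaps are treated).
    """
    columns = residue_to_column(aligned_seq, gap)
    if len(columns) != len(annotation):
        raise ValueError("annotation length does not match ungapped sequence")
    remapped = [empty] * len(aligned_seq)
    for col, label in zip(columns, annotation):
        remapped[col] = label
    return remapped
```

For the aligned sequence `"MK-TL"` with residue labels `["H", "H", None, "S"]`, `remap_annotation` returns `["H", "H", None, None, "S"]`: the gap in column 2 shifts the last two labels one column to the right.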

Select features for analysis

FeatureP contains a large number of predictable features. The user should specify which of these are relevant to the sequences under study. A scrollable list of all features is shown by default, and a number of filters are provided to assist the user in the selection process by limiting the number of features shown in the list.

Filters

When the word Filters is clicked, a section containing a number of tag buttons appears. Each tag button limits the number of shown features to those matching the tag. In the list of features, the tags for each feature are indicated after the description.

Note: Features can have multiple tags; e.g. the signal peptide predictor SignalP is tagged with both "Peptide cleavage" and "Protein localisation", since the signal peptide is a cleaved sorting signal.

In addition to the tag buttons, there is an input field (the free-text filter) where any text can be entered, and the list of features will be limited to those matching the text. The matching applies both to the names and the descriptions of the features.

The tag buttons (and the free-text filter) operate via AND logic; clicking two buttons applies two filters and limits the list further. However, it is perfectly possible to enable groups of features according to all logical combinations:

  • To select features having A AND B (where A and B are tag labels): click A, click B, then click "Enable shown".
  • To select features having A OR B: click A, click "Enable shown", click A again to remove the filter, then click B and click "Enable shown" again.
  • To select features having A but NOT B: click A, click "Enable shown", then click B and click "Disable shown".
To see the full range of features you have selected, click the button "Clear all filters".
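
The three click sequences above map onto ordinary set operations. The small model below makes that explicit. It is a sketch of the described behaviour only: the class `FeatureSelector`, the feature names `f1` to `f4`, and the tags `A`, `B`, `C` are hypothetical, and the free-text filter is left out.

```python
class FeatureSelector:
    """Toy model of tag filters combined with AND logic."""

    def __init__(self, feature_tags):
        self.feature_tags = feature_tags  # feature name -> set of tags
        self.active = set()               # currently selected tag filters
        self.enabled = set()              # currently enabled features

    def click(self, tag):
        """Toggle a tag filter on or off."""
        self.active ^= {tag}

    def shown(self):
        """Features carrying ALL active tags (every feature if none is active)."""
        return {name for name, tags in self.feature_tags.items()
                if self.active <= tags}

    def enable_shown(self):
        self.enabled |= self.shown()

    def disable_shown(self):
        self.enabled -= self.shown()


tags = {"f1": {"A"}, "f2": {"B"}, "f3": {"A", "B"}, "f4": {"C"}}

# A but NOT B: click A, "Enable shown", click B, "Disable shown".
selector = FeatureSelector(tags)
selector.click("A")
selector.enable_shown()   # enables f1 and f3
selector.click("B")       # filters are now A AND B, so only f3 is shown
selector.disable_shown()  # removes f3
print(sorted(selector.enabled))  # prints ['f1']
```

"Enable shown" acts as a union with the enabled set and "Disable shown" as a set difference, which is why OR and NOT can be built from filters that themselves only combine with AND.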

Special attention should be given to the tag buttons "Secretory pathway" and "Cytosol/nucleus". Since many predicted features, including most of the ELM classes, are specific to one of these two broad categories of subcellular locations, one of these filters should always be applied, unless your proteins are transmembrane proteins with both "inside" and "outside" loops.

Most FeatureP features are positional, i.e. the predictions pertain to specific sites or regions in the sequences. However, a few features (tagged with "Non-positional feature") pertain to the entire sequence, e.g. the WolFPsort predictor for multi-category subcellular localisation.

CPU intensive features

A number of particularly CPU intensive features are hidden by default. While the Filters section is shown, you can click on the button labelled "Include CPU intensive features" to have them included in the scrollable list, so they are ready to be enabled. The button will turn orange, and clicking it a second time will hide the CPU intensive features again.

While included in the list, the CPU intensive features are marked with an orange sign at the end of their description.

Feature selection

In the scrollable list shown here, it is possible to enable or disable individual features simply by clicking the checkbox in the column marked "Enabled". Clicking the name of a feature opens a new page with further information about the feature.

When features have been selected, a "Quick feature selection string" is shown near the bottom of the input page close to the "Submit" button. This string can be saved for future re-use, if you want to run the exact same selection of features on another dataset or with another alignment algorithm, or it can be used as a starting point for modifying the selection of features. The "Quick feature selection string" is repeated in the output for easy access.

Submit the job

When you are done entering or uploading sequences and selecting the features for analysis, click on the "Submit" button. The status of your job (either 'queued' or 'running') will be displayed and constantly updated until it terminates and the server output appears in the browser window.
At any time during the wait you may enter your e-mail address and simply leave the window. Your job will continue; you will be notified by e-mail when it has terminated. The e-mail message will contain the URL under which the results are stored; they will remain on the server for 24 hours for you to collect them.

The FeatureP output has two levels. The page displayed when the run has finished (the "level 1" output) lists the sets in the input. Clicking on a set number takes you to a page (the "level 2" output) with details about the predicted features for that set.

Level 1 output (for all sets)

For each set, the following information is shown:

  • Set #: The number of the set in the input file. In case of FASTA input, there will be only one set.
  • Set ID: The name of the first sequence in the set for identification purposes.
  • Total diff: The total number of alignment positions with differences in predicted features within the set. By default, the sets are ordered according to this number.
  • Features w. diff: The number of features that showed at least one difference between the sequences in the set.
  • Feature w. largest diff: The name of the feature that showed the maximal number of positions with differences.
  • Largest diff: The number of positions with differences shown by the aforementioned feature.
Below the list, the "Quick feature selection string" is repeated from the input page, and a citation list is shown. In addition to the citation for FeatureP itself, the list shows the relevant citations for all selected feature predictors.
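
As an illustration of how the level 1 columns relate to each other, the sketch below derives them from a per-feature record of difference positions. The function name `summarize_set` and the feature names are hypothetical. It also assumes that "Total diff" counts distinct alignment positions across all features; the description above could be read as a sum over features instead.

```python
def summarize_set(diff_positions):
    """Summarize one set.

    `diff_positions` maps a feature name to the set of alignment columns
    where that feature differs between the sequences of the set.
    """
    with_diff = {name: cols for name, cols in diff_positions.items() if cols}
    # Assumption: a column counts once even if several features differ there.
    all_columns = set().union(*with_diff.values())
    largest = max(with_diff, key=lambda name: len(with_diff[name]), default=None)
    return {
        "total_diff": len(all_columns),
        "features_with_diff": len(with_diff),
        "feature_with_largest_diff": largest,
        "largest_diff": len(with_diff[largest]) if largest is not None else 0,
    }


summary = summarize_set({"featA": {1, 2, 3}, "featB": {3, 4}, "featC": set()})
```

In this example `summary` reports 4 positions in total, 2 features with differences, and `featA` as the feature with the largest number of difference positions (3).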

Level 2 output (for a single set)

Clicking on a set number in the level 1 output will take you to a page (the "level 2" output) with details about the predicted features for the specific set. This page has a tabbed interface with the following tab labels:

Sequence Identifiers

A list of the sequence identifiers used in the level 2 output together with their FASTA names. Since FASTA names may be very long, we have decided to use short identifiers in the graphical outputs, and this list serves as a key to these identifiers. The first sequence in the set, with identifier "seq00000", is regarded as the reference sequence to which the other sequences (the variant sequences) should be compared.

Differential Table

A table showing, for each selected feature and each variant sequence, how many positions were predicted differently from the reference sequence. In addition to the total number of difference positions, the numbers of gain and loss positions are also given. A gain is defined as a change from no annotation to something, while a loss is a change in the opposite direction. The total number of differences may be higher than the sum of gains and losses if the feature has several output categories and some positions have changed from one category to another.
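
The definitions of gain, loss, and total can be made concrete with a short counting routine. This is a sketch only: the function name `count_differences` is hypothetical, annotations are assumed to be lists with one label per alignment column, and `None` stands for "no annotation".

```python
def count_differences(reference, variant, no_annotation=None):
    """Count total, gain, and loss positions between two aligned annotations."""
    if len(reference) != len(variant):
        raise ValueError("annotations must cover the same alignment columns")
    total = gain = loss = 0
    for ref_label, var_label in zip(reference, variant):
        if ref_label == var_label:
            continue
        total += 1
        if ref_label == no_annotation:
            gain += 1   # nothing in the reference, something in the variant
        elif var_label == no_annotation:
            loss += 1   # something in the reference, nothing in the variant
        # otherwise: a change between two categories, counted in total only
    return {"total": total, "gain": gain, "loss": loss}


reference = [None, "H", "H", "E", None]
variant = ["H", "H", None, "C", None]
print(count_differences(reference, variant))
# prints {'total': 3, 'gain': 1, 'loss': 1}
```

The third difference (column 3, "E" to "C") is a category change, which is why the total exceeds the sum of gains and losses here.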

Differential plot

A plot showing the sequence alignment and, below it, a line for each feature that showed at least one difference between the sequences in the set. Each line shows where in the alignment the differences were found. Positions with differences in the sequences are marked with yellow background in the alignment.

Positional

A plot showing the detailed prediction of each selected positional feature for each sequence; below a colour-coded alignment, it is shown which category was predicted in each position of each sequence (see Figure 6). One particular use of the positional plot is to compare the locations of predicted ELM classes with predictions of intrinsically disordered regions (DisEMBL and IUpred). This could help filter out possible false positive ELM hits, since linear motifs are known to occur preferentially in disordered regions.

Non-positional

A table showing the detailed prediction of each selected non-positional feature for each sequence.

Reference

FeatureP: a prediction aggregator for differential annotation of protein sequence variants
Christian Simon, Peter Wad Sackett, David Flores, Arcadio Rubio García, Jose M. G. Izarzugaza, Valborg Gudmundsdottir, Ramneek Gupta, Kristoffer Rapacki, Thomas Holberg Blicher, Peter Fischer Hallin, Thomas Skøt Jensen, Agnieszka Sierakowska Juncker, Eleonora Kulberkyte, Thomas Sicheritz-Pontén, Håkan Svensson, Kai Wang, Rasmus Wernersson, Søren Brunak, and Henrik Nielsen
Manuscript in preparation

Abstract

Protein sequence variants, such as amino acid substitutions, insertions, deletions, or splice isoforms, often have functional consequences for the proteins. In many cases, these functional consequences can be identified by sequence-based bioinformatics methods predicting features such as structural elements, binding sites, or post-translational modifications. We present here FeatureP, a web server which launches a selection of such predictors and mines their outputs for differential predictions, i.e. features which are predicted to be modified as a consequence of the differences between the input sequences. Through a number of case studies, it is shown that FeatureP can be useful in the prioritization of protein sequence variants and in the formulation of hypotheses concerning the mode of action of mutations in diseases.


CITATION

For publication of results before the FeatureP paper is published, please cite:

Protein annotation in the era of personal genomics
Blicher T, Gupta R, Wesolowska A, Jensen LJ, Brunak S.
Curr Opin Struct Biol., 20:335-41, 2010.

The pipeline also relates to the work described in:

The implications of alternative splicing in the ENCODE protein complement.
Tress, ML et al.
Proc Natl Acad Sci U S A, 104:5495-500, 2007.

