NNAlign is a server for the discovery of sequence motifs in quantitative peptide data. By quantitative peptide data we mean a set of amino acid sequences, each associated with a numerical value. This value (the quantitative measure) defines negative, positive and intermediate examples across some numerical spectrum. It could be, for example, the binding strength of each peptide to a certain molecule, or the signals measured on a peptide array. For effective training of NNAlign it is important that not only positive instances but also negative (and possibly intermediate) examples are included in the training set. The neural networks can then attempt to correlate the amino acid sequences they are presented with to the associated quantitative values, and learn what differentiates positives from negatives. The generated model can then be used to produce quantitative predictions on new data.
On this page we introduce the data formats and the options available on the server, and give some guidelines for its use. Users are welcome to contact the authors with any questions.
1. Specify the training sequences
All the input sequences must be in one-letter amino acid
code. The allowed alphabet (not case sensitive) is as follows:
A C D E F G H I K L M N P Q R S T V W Y X
Paste a set of peptides, one sequence per line, into the upper left window, or upload a
file from your local disk.
Look here to see an example of the format.
The program accepts a continuous spectrum of signal intensities associated with the sequences,
and it assumes that positive examples (e.g. binders) have higher values, while negative
examples lie at the lower end of the spectrum.
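As an illustration, each line of the training file holds one peptide followed by its value, e.g. "ILKEPVHGV 0.87" (this layout is inferred from the description above; see the linked example for the authoritative format). A minimal Python sketch, not the server's own code, of reading and validating such a file:

    ALPHABET = set("ACDEFGHIKLMNPQRSTVWYX")

    def read_training_data(path):
        """Return (peptide, value) pairs from a whitespace-separated file."""
        data = []
        with open(path) as handle:
            for line in handle:
                fields = line.split()
                if not fields:
                    continue  # skip blank lines
                peptide, value = fields[0].upper(), float(fields[1])
                if not set(peptide) <= ALPHABET:
                    raise ValueError("illegal residue in " + peptide)
                data.append((peptide, value))
        return data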
2. Select evaluation examples (Optional)
After the training phase, you will obtain a model that has learned a common motif from the training data. If you wish to use the neural networks to discover
occurrences of the motif in new data, paste in an evaluation set or load it from your disk. Two formats are accepted:
- A list of peptides, one sequence per line. If values are provided together with the peptides (in a format similar to the training data), the method will calculate statistical measures of correlation between observed and predicted values for the evaluation set. Example
- A set of amino acid sequences in FASTA format. The sequences are chopped into peptides of the length of the learned motif (plus any flanks) and then run through the neural networks, as sketched below. Example
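The chopping of FASTA sequences can be pictured as a sliding window; the following Python sketch (function name and window handling are illustrative, not the server's actual code) shows the idea:

    def chop(sequence, core_len, flank=0):
        """Yield all overlapping peptides of length core_len + 2*flank."""
        window = core_len + 2 * flank
        for i in range(len(sequence) - window + 1):
            yield sequence[i:i + window]

    # list(chop("MKTAYIAKQR", core_len=7, flank=1)) yields two 9-mers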
3. Customize your run by changing some of the advanced options (Optional)
Toggle the
Expert options checkbox to display all the available options. They are briefly explained here, in the order they appear on the form:
3.1 Motif length
The length of the alignment core can be specified as:
- A single value, e.g. 7
- An interval, e.g. 6-8
- An interval with step size, e.g. 6-10/2
The algorithm will attempt to align the sequences for all specified motif lengths, and
choose the one that maximizes performance (in terms of correlation between observed and
predicted values).
Note that sequences shorter than the given motif length will be removed from the dataset.
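As an illustration, the specification "6-10/2" expands to the lengths 6, 8 and 10. A minimal sketch of the expansion (assumed behaviour inferred from the three formats above, not the server's code):

    def expand_lengths(spec):
        """Expand "7", "6-8" or "6-10/2" into a list of motif lengths."""
        spec, _, step = spec.partition("/")
        step = int(step) if step else 1
        lo, _, hi = spec.partition("-")
        hi = hi or lo
        return list(range(int(lo), int(hi) + 1, step))

    # expand_lengths("6-10/2") -> [6, 8, 10]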
3.2 Flanking amino acids for NN training
In some instances, the amino acid composition of the regions surrounding the motif core (peptide flanking region, PFR) can have an influence on the response. This was proven, for example, in
this paper, where the composition of a PFR of at least two amino acids around the core was shown to influence peptide binding strength. With this option you can specify the length of the regions flanking the alignment core. The length and amino acid composition of the PFRs are also encoded in the neural networks.
3.3 Encode flanking region length / Encode peptide length
Additional neurons are added to the input layer of the neural networks to account for flanking region length and peptide length.
3.4 Neural network encoding
With Blosum encoding, similarity between amino acids is taken into account, allowing
substitutions between residues with similar characteristics. Sparse encoding treats
all amino acids in the same way. With the "Combined" option, networks are trained
with both Blosum and Sparse encoding, and the predictions of the two strategies are combined.
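The contrast between the two encodings can be sketched as follows, assuming that sparse means a one-hot vector and Blosum means the residue's row of a substitution matrix (e.g. BLOSUM62); an illustration, not the server's code:

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    def sparse_encode(residue):
        """One-hot vector: all distinct amino acid pairs are equally far apart."""
        return [1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS]

    def blosum_encode(residue, blosum):
        """Substitution-matrix row: similar residues get similar vectors."""
        return [blosum[residue][aa] for aa in AMINO_ACIDS]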
3.5 Processing of input data
The optimal data distribution for NN training is between 0 and 1, with the bulk of the data in the middle region of the spectrum. With the default option the program linearly rescales the data between 0 and 1, but it is also possible to apply a logarithmic transformation if the data appear squashed towards lower values.
You can also inspect the data distribution before and after the transformation in the output panel,
following the link "View data distribution".
Example
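The two preprocessing options can be sketched as follows (the server's exact transformation may differ; this only illustrates a linear rescaling to [0, 1] and a log transform that spreads out values squashed towards the low end):

    import math

    def rescale_linear(values):
        lo, hi = min(values), max(values)  # assumes hi > lo
        return [(v - lo) / (hi - lo) for v in values]

    def rescale_log(values):
        lo = min(values)
        return rescale_linear([math.log1p(v - lo) for v in values])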
Some experiments (e.g. peptide microarrays with linker sequences) might generate raw data with repeated flanking amino acids at either or both ends of all sequences. The program excludes these amino acids by default, but if you think they might contain biological signal you can choose to keep these repeated flanks.
3.6 Cross-validation method
The predictive performance of the method is estimated by cross-validation on the supplied
training set.
If "Stop training on best test-set performance" is ticked, a completely unbiased
evaluation of the performance requires an additional independent test set. This is done by
selecting "Exhaustive cross-validation". However, for large datasets an accurate (and much
faster) estimate of the predictive performance can be obtained on the same subsets used for
training.
Leaving "Stop training on best test-set performance" unticked will let
the training continue until the maximum number of training cycles specified
in the "Number of training cycles" option.
3.7 Subsets for cross-validation
The data can be prepared for cross-validation in three ways:
- Random subsets: the data is simply split at random into subsets of equal size.
- Homology clustering: a Hobohm 1 algorithm is used to group homologous sequences and limit overlap between subsets. The similarity threshold is specified with the corresponding option ("Threshold for homology clustering"), which expresses the percentage of matches in an ungapped alignment.
- Common motif clustering: two sequences are considered similar if they share a stretch of N identical amino acids, where N is the (minimum) motif length specified by the user (a sketch of this criterion is given below).
With both Homology and Common motif clustering, homologous sequences are grouped together but are all kept for training. The option "Remove homologous sequences from training set" allows keeping only one sequence per group of homologs, to further minimize possible overlap between subsets.
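The common motif criterion can be sketched as follows (an illustrative Python implementation, not the server's own code):

    def share_motif(a, b, n):
        """True if peptides a and b share a stretch of n identical residues."""
        kmers = {a[i:i + n] for i in range(len(a) - n + 1)}
        return any(b[i:i + n] in kmers for i in range(len(b) - n + 1))

    # share_motif("ILKEPVHGV", "KEPVHGAAA", 5) -> True (shared stretch "KEPVH")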
3.8 Folds for cross-validation
Specify the number of subsets to be created for the estimation of performance by
cross-validation. It is also possible to skip cross-validation by ticking the 'NO' button. In this case all data are used for training and execution will be faster, but it won't be possible to calculate performance measures.
3.9 Number of training cycles
This option specifies how many times each example in the training set is
presented to the neural network. If training is stopped on best test-set performance,
this value represents the maximum number of training cycles.
3.10 Number of hidden neurons
A higher number of hidden neurons in the NNs potentially allows detecting higher order correlations,
but increases the number of parameters of the model. Different hidden layer sizes can
be specified in a comma separated list (e.g.
3,7,12,20), in which case an
ensemble of networks with different architectures is constructed.
3.11 Number of initial seeds per iteration
It is possible to train the model from different initial random NN configurations. An
ensemble of several neural networks has been shown to perform better than a single NN.
However, note that the time required to train a model increases linearly with this
parameter.
3.12 Number of networks per fold in the final ensemble
When training with cross-validation, each neural network's performance is evaluated
in terms of squared error between predicted and observed values on the test set. With
this parameter you can select the top N networks for each cross-validation fold;
together they constitute the final model.
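How the final ensemble might be assembled can be sketched as follows (the data layout here is hypothetical; the idea is simply to rank each fold's networks by test-set squared error and keep the best N):

    def select_networks(fold_results, n):
        """fold_results maps fold -> [(network, test_mse), ...]."""
        return {fold: [net for net, _ in sorted(nets, key=lambda p: p[1])[:n]]
                for fold, nets in fold_results.items()}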
3.13 Sequence logo height (in bits)
Sets the scale of the y axis in the motif logo figure. For a 20-letter alphabet the information content can vary between zero, for an uninformative position, and log2(20) (approximately 4.32 bits) for a completely conserved position.
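The information content behind this scale can be computed, per alignment position, as log2(20) minus the Shannon entropy of the amino acid frequencies at that position; a minimal sketch (ignoring small-sample corrections):

    import math

    def information_content(frequencies):
        """frequencies: amino acid -> probability at one alignment position."""
        entropy = -sum(p * math.log2(p) for p in frequencies.values() if p > 0)
        return math.log2(20) - entropy  # 0 (uninformative) up to ~4.32 bits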
It is possible to create logos for all the single networks in the final ensemble, and (recommended) to use offset correction to re-align the networks to a common core. The latter aims at maximizing the information content of a combined core across all networks, and generally produces a better representation of the sequence motif. This topic is discussed in a dedicated section of the
NNAlign paper (Section: Improving the LOGO sequence motif representation by an offset correction).
3.14 Sort results by prediction value
If ticked, the peptides in the prediction file are ordered according to the NNAlign predicted value. If left unticked, they are presented in their original order.
3.15 Threshold on evaluation set predictions
Use this parameter to limit the size of the evaluation set results. It should be given as
a number between 0 and 1 (if set to 0, all results are displayed). This is mostly
relevant for large FASTA file submissions, to show only the best scoring sequences
detected by the method.
3.16 Optional prefix for results files
This prefix is added to the names of all files generated by the current run. If left empty, a
system-generated number will be assigned as prefix.
4. Submit the job
Click on the
"Submit" button. The status of your job (either 'queued'
or 'running') will be displayed and constantly updated until it terminates and
the server output appears in the browser window.
At any time during the wait you may enter your e-mail address and leave
the window. Your job will continue and you will be notified by e-mail when it has
terminated. The e-mail message will contain the URL under which the results are
stored; they will remain on the server for 24 hours for you to collect them.
Loading a trained model
Once your job is completed, you will have the possibility of downloading the trained
method to your computer. You may then upload this model at any time to the NNAlign
submission page and use it for new predictions on evaluation sets.
Simply upload the model from your disk to the right-hand box in the submission form.
GETTING HELP
Scientific problems:
Technical problems: