Instructions & Guidelines
NNAlign is a server for the discovery of sequence motifs in quantitative peptide data, i.e. a set of amino acid sequences, each associated with some numerical value. This value (the quantitative measure) defines negative, positive and intermediate examples across a numerical spectrum. It could be, for example, the binding strength of each peptide to a certain molecule, or the signals measured on a peptide array.
For effective training of NNAlign, it is important that the training set includes not only positive instances but also negative (and possibly intermediate) examples. The neural networks can then attempt to correlate the amino acid sequences with their relative quantitative values, and learn what differentiates positives from negatives. The generated model can then be used to produce quantitative predictions on new data.
This page introduces the data formats, the parameters available to customize the analysis and some guidelines for the use of version 2.1. Users are welcome to contact the authors for any questions.
1. Specify TRAINING sequences
Paste a set of peptides, one sequence per line into the upper left window, or upload a
file from your local disk. Training data should be in two columns. The first column is composed
of peptide sequences, the second column of a target value for each sequence.
- If you provide receptor pseudo-sequences with the "Load receptor pseudo-sequences" option, add a third column to the input data
with the name of the receptor associated with each training point.
- To specify your own partitioning of the data, add the partition number of each data point
(an integer from 0 to N) in the last column of the training data, and check "User-defined partitions" in the Method to create subsets option.
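The column layout described above can be sketched as a small parser. This is only an illustration: it assumes whitespace-separated columns, and the names `TrainingPoint` and `parse_training_line` are hypothetical and not part of NNAlign.

```python
from typing import NamedTuple, Optional


class TrainingPoint(NamedTuple):
    peptide: str
    target: float
    receptor: Optional[str] = None   # third column, only with pseudo-sequences
    partition: Optional[int] = None  # last column, only with user-defined partitions


def parse_training_line(line, has_receptor=False, has_partition=False):
    """Split one training line into its columns (assumed whitespace-separated)."""
    fields = line.split()
    expected = 2 + int(has_receptor) + int(has_partition)
    if len(fields) != expected:
        raise ValueError(f"expected {expected} columns, found {len(fields)}")
    receptor = fields[2] if has_receptor else None
    partition = int(fields[-1]) if has_partition else None
    return TrainingPoint(fields[0], float(fields[1]), receptor, partition)
```

For example, `parse_training_line("AAAKLVFFA 0.85 REC1 2", True, True)` yields a point with receptor `REC1` (a made-up name) and partition 2.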
The program accepts a continuous spectrum of signal intensities associated with the sequences,
and by default it assumes that positive examples (e.g. binders) have higher values, whereas negative examples lie at the lower end of the spectrum.
Note that by default the program assumes that the data is expressed in the standard one-letter 20 amino acid alphabet:
A C D E F G H I K L M N P Q R S T V W Y
plus the X symbol for "unknown" amino acid, treated as a wildcard. If you wish to use a different alphabet (e.g. modified amino acids, DNA/RNA, etc) you must specify the list of symbols with the Alphabet option (see below). In all cases, X is a reserved wildcard symbol with neutral value.
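Before submitting, it can help to check that every sequence uses only permitted symbols. The following is a minimal sketch of such a check; the function name `invalid_symbols` is hypothetical, and the server performs its own validation independently of this.

```python
DEFAULT_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
WILDCARD = "X"  # reserved wildcard, always accepted


def invalid_symbols(sequence, alphabet=DEFAULT_ALPHABET):
    """Return the sorted list of symbols that are neither in the alphabet nor the wildcard."""
    allowed = set(alphabet) | {WILDCARD}
    return sorted(set(sequence) - allowed)
```

An empty list means the sequence is compatible with the chosen alphabet; for nucleic acid data one might pass, say, `alphabet="ACGU"`.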
2. Select EVALUATION examples (Optional)
NNAlign creates an ensemble of neural networks trained to recognize sequence motifs contained in the training data. If you wish to use the neural networks to discover occurrences of the motif on new data, paste in an evaluation set or load it from your disk. Two formats are accepted:
- A list of peptides, one sequence per line. If values are provided together with the
peptides (in a format similar to the training data), the method will calculate statistical
measures of correlation between observed and predicted values for the evaluation set.
- A set of amino acid sequences in FASTA format. The sequences are digested into peptides
with the length of the motif (plus any flanks) and then run through the neural networks.
3. Set OPTIONS to customize your analysis
This prefix is prepended to the names of all files generated by the current run. If left empty, a
system-generated number will be assigned as prefix.
The length of the alignment core can be specified as:
- Single value. e.g. 7
- Interval. e.g. 6-8
- Interval with step size. e.g. 6-10/2
The algorithm will align the sequences for all specified motif lengths, and select the solution that maximizes cross-validated performance in terms of correlation between observed and predicted values.
Note that sequences shorter than the maximum
motif length (+ insertions, if enabled) will be removed from the dataset.
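The three ways of specifying the motif length can be expanded into an explicit list of lengths. This is a sketch of one plausible reading of the syntax (inclusive interval, step after the slash); the function name `parse_motif_lengths` is hypothetical.

```python
def parse_motif_lengths(spec):
    """Expand '7', '6-8' or '6-10/2' into the list of motif lengths to try."""
    spec = spec.strip()
    step = 1
    if "/" in spec:
        spec, step_text = spec.split("/")
        step = int(step_text)
    if "-" in spec:
        low_text, high_text = spec.split("-")
        low, high = int(low_text), int(high_text)
    else:
        low = high = int(spec)
    if step < 1 or low < 1 or high < low:
        raise ValueError(f"invalid motif length specification: {spec!r}")
    # The interval is treated as inclusive at both ends.
    return list(range(low, high + 1, step))
```

With this reading, `6-10/2` stands for the lengths 6, 8 and 10.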
DATA PROCESSING options
Order of the data:
By default, peptides with high values are positive instances and high prediction scores are used to derive the sequence logo. You can invert this behaviour with this option.
The optimal data distribution for NN training is between 0 and 1, with the bulk of the data in the middle region of the spectrum. With the default option the program linearly rescales the data between 0 and 1, but it is also possible to apply a logarithmic transformation if the data appears squashed towards lower values. If your data is already rescaled between 0 and 1, select the No rescale option.
You can also inspect the data distribution before and after the transformation in the output panel, following the link "View data distribution".
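To get a feeling for the two rescaling options, the sketch below applies a min-max rescaling either to the raw values or to their logarithms. The exact logarithmic transformation used by the server is not stated on this page, so `rescale_log` is an assumption for illustration only; both function names are hypothetical.

```python
import math


def rescale_linear(values):
    """Map values linearly so that the minimum becomes 0 and the maximum 1."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are identical; cannot rescale")
    return [(v - low) / (high - low) for v in values]


def rescale_log(values):
    """Assumed variant: take natural logs first, then rescale linearly.

    Spreads out data that is squashed towards low values.
    Requires strictly positive input.
    """
    if min(values) <= 0:
        raise ValueError("logarithmic rescaling needs strictly positive values")
    return rescale_linear([math.log(v) for v in values])
```

For data spanning several orders of magnitude (e.g. 1, 10, 100), the linear version places the middle value near 0.09, while the logarithmic version places it at the centre of the range.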
Average target values of identical sequences:
If there are duplicated sequences, by default they are all used together with their target values. Toggle this option to only use each sequence once with the average of the multiple target values.
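Averaging the target values of identical sequences amounts to a simple group-by. A minimal sketch, with the hypothetical helper name `average_duplicates`:

```python
from collections import defaultdict


def average_duplicates(pairs):
    """Collapse repeated sequences into one entry holding the mean target value."""
    grouped = defaultdict(list)
    for sequence, value in pairs:
        grouped[sequence].append(value)
    return {seq: sum(vals) / len(vals) for seq, vals in grouped.items()}
```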
Folds for cross-validation:
Specify the number of subsets to be created for the estimation of performance in
cross-validation. It is also possible to skip cross-validation by ticking the 'NO' button. In this case all data are used for training and execution will be faster, but it won't be possible to calculate performance measures.
The predictive performance of the method is estimated in cross-validation (CV) on the
training set. At each cross-validation step, one of 'n' subsets is left out as an evaluation set, where 'n' is the number of folds, rotating the evaluation set n times. Two CV methods are available:
- Simple cross-validation: uses n-1 sets to train the ANN and 1 set for evaluation.
- Nested cross-validation: uses n-2 sets to train the ANN, 1 set for early stopping (see below), and 1 set for evaluation, where n is the number of folds.
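The difference between the two schemes can be made concrete by enumerating which subsets play which role. The sketch below is an assumption about the rotation: in particular, it lets every remaining subset take a turn as the early-stopping set in the nested case, which this page does not specify. The function names are hypothetical.

```python
def simple_cv_splits(n_folds):
    """Yield (training subsets, evaluation subset) for each rotation."""
    for test in range(n_folds):
        train = [f for f in range(n_folds) if f != test]
        yield train, test


def nested_cv_splits(n_folds):
    """Yield (training subsets, early-stopping subset, evaluation subset).

    Assumption: each non-evaluation subset is used once for early stopping.
    """
    for test in range(n_folds):
        for stop in range(n_folds):
            if stop == test:
                continue
            train = [f for f in range(n_folds) if f not in (test, stop)]
            yield train, stop, test
```

With 5 folds, simple cross-validation trains on 4 subsets per rotation, while nested cross-validation trains on 3, keeping the evaluation subset untouched by early stopping.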
Stop training on best test-set performance:
If this option is selected, training of the networks will be stopped on the highest CV test-set performance (Early stopping). A completely unbiased evaluation of the performance requires an additional independent test set, by selecting Nested cross-validation. However, for large datasets an accurate and much faster estimate of the predictive performance can be done on the same subsets used for early stopping (Simple cross-validation together with Early stopping).
Leaving the Early stopping option unticked will continue the training until the maximum number of training cycles as specified in the "Number of training cycles" option.
Method to create subsets:
The data can be prepared for cross-validation in four ways:
- Random subsets: the raw data is simply split randomly into subsets of equal size
- Homology clustering: a Hobohm 1 algorithm is used to group homologous sequences and limit overlap between subsets. Also specify the maximum identity between sequences in the same subset (e.g. 0.8 means that peptides in the same subsets are no more than 80% identical).
- Common motif clustering: two sequences are considered homologous if they share a stretch of at least N identical amino acids, where N is the common motif length specified by the user.
- User-defined partitions: you may specify your own partitions for cross-validation. Specify the groups as an additional column of the input data, assigning to each data point the partition number from 0 to N.
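As an illustration of the common motif criterion, the following sketch tests whether two sequences share a stretch of at least N identical symbols by comparing their sets of N-mers. It treats the wildcard X like any other symbol, which may differ from the server's behaviour; the function name is hypothetical.

```python
def share_common_motif(seq_a, seq_b, motif_length):
    """True if the two sequences contain an identical stretch of motif_length symbols."""
    kmers_a = {
        seq_a[i:i + motif_length]
        for i in range(len(seq_a) - motif_length + 1)
    }
    return any(
        seq_b[i:i + motif_length] in kmers_a
        for i in range(len(seq_b) - motif_length + 1)
    )
```

Sequences for which this test is true would be considered homologous and kept in the same subset.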
Remove homologous sequences from training set:
Homologous sequences are by default clustered in the same subset. Check the box to keep only one instance of homologous sequences.
You may use a custom alphabet (e.g. nucleic acids, or non-standard amino acids). All upper-case letters and the symbols + and @ are permitted. The symbol X is reserved as a wildcard. Note that if you modify the alphabet, all BLOSUM options will be disabled.
NEURAL NETWORK architecture
Number of training cycles:
This option specifies how many times each example in the training set is
presented to the neural networks. If training is stopped on the best test-set performance,
this value represents the maximum number of training cycles.
Number of seeds:
It is possible to train the model from different initial random network configurations. The
ensemble of several neural networks has been shown to perform better than a single network.
However, note that the time required to train a model increases linearly with the number of seeds.
Number of hidden neurons:
A higher number of hidden neurons in the ANNs potentially allows detecting higher order correlations, but increases the number of parameters of the model. Different hidden layer sizes can be specified in a comma separated list (e.g. 3,7,12,20), in which case an
ensemble of networks with different architectures is constructed.
Amino acid encoding:
Amino acids must be converted to numbers in order to be presented to the neural networks. Sparse encoding converts an amino acid into a binary vector, whereas Blosum encoding uses the BLOSUM62 substitution scores, accounting for physicochemical similarity between amino acids. With the "Combined" option, networks are trained with both Blosum and Sparse encoding, and the predictions of the two approaches are combined. Note that Blosum encoding is only available if the training data uses the standard one-letter 20 amino acid alphabet (used as default).
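Sparse encoding can be pictured as one binary vector per residue, with a single position switched on. The sketch below is a generic one-hot encoder; representing the wildcard X as an all-zero vector is an assumption, and the numeric values actually fed to the NNAlign networks may differ.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def sparse_encode(peptide, alphabet=ALPHABET):
    """Return one binary vector per residue; X is assumed to map to all zeros."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    encoded = []
    for residue in peptide:
        vector = [0] * len(alphabet)
        if residue != "X":
            vector[index[residue]] = 1
        encoded.append(vector)
    return encoded
```

A peptide of length L is thus turned into L vectors of 20 values each under the default alphabet.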
Maximum length for Deletions:
Allow deletions in the alignment. For a description of how deletions (and insertions) are treated refer to this paper.
Maximum length for Insertions:
Allow insertions in the alignment.
Only allow insertions in sequences shorter than the motif length:
Apply insertions in any sequence, or only on sequences shorter than the motif length.
The burn-in is a number of initial iterations during which no deletions or insertions are allowed. Because gaps dramatically increase the number of possible solutions, it may be useful to use a burn-in period > 0 in order to limit the search space in the initial training phases.
Impose amino acid preference at P1 during burn-in:
In some cases, you may have some prior knowledge of the expected binding motif. For example, most HLA-DR molecules have a preference for hydrophobic residues (ILVMFYW) at P1. Such prior knowledge can help guide the networks towards finding the correct binding core, and is applied only in the very first few iterations (specified by the burn-in parameter). After the burn-in iterations, the restraint on the specified residues at P1 is removed, and the networks will consider all possible binding cores.
Preferred residues at P1:
Together with the previous option (impose amino acid preference at P1), this parameter allows specifying which subset of residues should be preferred at P1 of the binding core during the burn-in iterations.
Length of the PFR for composition encoding:
In some instances, the amino acid composition of the regions surrounding the motif core (peptide flanking region, PFR) can have an influence on the response. See for example in this paper, where the amino acid composition of a PFR of at least two amino acids around the core was shown to influence peptide-MHC binding strength. With this option you can specify the length of the regions flanking the alignment core, which will be encoded as input to NNAlign.
Encode PFR composition as sparse:
By default, the composition of the regions flanking the binding cores is encoded using the Blosum substitution matrix. When this option is turned on, the raw frequency of each amino acid in the PFR (sparse alphabet) is used for encoding instead.
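The sparse PFR encoding can be illustrated by counting amino acids in the flanks on either side of a core. In this sketch the core position is given explicitly and frequencies are normalised by the actual flank length; both choices are assumptions for illustration, and `pfr_composition` is a hypothetical name.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def pfr_composition(peptide, core_start, core_length, pfr_length, alphabet=ALPHABET):
    """Return amino acid frequencies of the left and right flanking regions."""
    core_end = core_start + core_length
    left = peptide[max(0, core_start - pfr_length):core_start]
    right = peptide[core_end:core_end + pfr_length]

    def frequencies(flank):
        if not flank:  # core sits at the very edge of the peptide
            return [0.0] * len(alphabet)
        return [flank.count(symbol) / len(flank) for symbol in alphabet]

    return frequencies(left), frequencies(right)
```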
Encode PFR length:
Encodes the length of the flanks, i.e. the number of amino acids before/after the motif core. It essentially carries information about the position of the core within the peptide, i.e. whether it is found at the extremes or in the middle. If this option is set to N > 0, the flank length is truncated to N amino acids; if N = 0, the encoding is unbounded (recommended). Setting this option to -1 disables the encoding.
Expected peptide length for encoding:
Assigns input neurons to encode the length of the input sequences if set to > 0. For an optimal encoding, give a rough estimate of the expected optimal peptide length.
Binned peptide length encoding:
Encode peptide length with individual input neurons for different lengths. For example, setting this parameter to "8,9,10,11" will create four separate input neurons, each activated only by peptides of the corresponding length (<=8, 9, 10, >=11 respectively). This encoding is different from the option "expected peptide length for encoding" which uses a continuous function that interpolates the possible peptide lengths.
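The binned encoding in the example above can be sketched as follows. The sketch assumes consecutive bin values, as in "8,9,10,11", with the first and last bins absorbing shorter and longer peptides respectively; the function name is hypothetical.

```python
def binned_length_encoding(peptide_length, bins_spec="8,9,10,11"):
    """Return one 0/1 input per bin; lengths outside the range fall into the end bins."""
    bins = sorted(int(b) for b in bins_spec.split(","))
    clamped = min(max(peptide_length, bins[0]), bins[-1])
    return [1 if clamped == b else 0 for b in bins]
```

A 7-mer and an 8-mer therefore activate the same first neuron, and any peptide of length 11 or more activates the last one.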
Load receptor pseudo-sequences:
If you have different receptors associated with your training examples, specify the receptor names as the third column in the training file. Then, upload here a file with two columns: the receptor names in the first, and the aligned pseudo-sequences in the second. Note that pseudo-sequences must all have the same length and be in the same alphabet as the training sequences (including X).
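A quick local check of the pseudo-sequence file can catch length mismatches before upload. This sketch assumes whitespace-separated columns; `parse_pseudosequences` is a hypothetical helper, not part of the server.

```python
def parse_pseudosequences(text):
    """Read 'name pseudo-sequence' lines and verify that all pseudo-sequences have equal length."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        name, pseudo = line.split()
        table[name] = pseudo
    lengths = {len(pseudo) for pseudo in table.values()}
    if len(lengths) > 1:
        raise ValueError(f"pseudo-sequences differ in length: {sorted(lengths)}")
    return table
```

Every receptor name used in the third column of the training file should then appear as a key of the returned table.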
SORTING and VISUALIZATION options
Number of networks (per fold) in the final ensemble:
When training with cross-validation, each neural network's performance is evaluated
in terms of Pearson's correlation between target and predicted values. The
top N networks (for each cross-validation fold) can be selected using this
parameter, and they will constitute the final model.
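The selection of the top N networks per fold can be sketched as a ranking by Pearson's correlation. The network names and scores below are made up for illustration, and both function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def top_networks(scores_by_fold, n):
    """Keep the n best-scoring networks of each cross-validation fold."""
    return {
        fold: sorted(scores, key=scores.get, reverse=True)[:n]
        for fold, scores in scores_by_fold.items()
    }
```

For instance, with scores `{"net_a": 0.61, "net_b": 0.74, "net_c": 0.58}` in a fold and N = 2, the networks `net_b` and `net_a` would be retained.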
Sort results by prediction value:
Predictions can be sorted by the NNAlign predicted value. If left unticked, sequences are presented in their original order.
Exclude offset correction:
Offset correction is a procedure that realigns individual networks to enhance the combined sequence motif (see the section "Improving the LOGO sequence motif representation by an offset correction" in this paper). You can disable offset correction by ticking this box.
Show all logos in the final ensemble:
Displays the sequence motif identified by each neural network in the model.
EVALUATION DATA options
Length of peptides generated from FASTA entries:
Evaluation data submitted in FASTA format will be digested into fragments of the specified length. These peptides will then be submitted to the network ensemble to scan for the presence of the learned sequence motifs.
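Digestion of FASTA entries can be pictured as a sliding window over each sequence. The sketch below assumes overlapping windows that advance one position at a time, which this page does not state explicitly; `read_fasta` and `digest` are hypothetical helpers.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping header (without '>') to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line  # sequences may span several lines
    return records


def digest(sequence, peptide_length):
    """Return all overlapping fragments of the given length (assumed step of 1)."""
    return [
        sequence[i:i + peptide_length]
        for i in range(len(sequence) - peptide_length + 1)
    ]
```

A sequence of length L yields L - k + 1 peptides of length k, so large FASTA files can produce very many fragments.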
Sort evaluation results by prediction value:
Predictions on the independent evaluation set can be sorted by the NNAlign predicted value. If left unticked, sequences are presented in their original order.
Threshold on evaluation set predictions:
For large FASTA file submissions, the results file may become very large. Use this parameter to limit the size of the evaluation set results and only show sequences with high predicted values. It should be given as a number between 0 and 1 (if set to 0, all results will be displayed).
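The combined effect of the threshold and sorting options on an evaluation result list can be sketched as below. Whether the server keeps predictions exactly equal to the threshold is not stated here, so the inclusive comparison is an assumption; the function name is hypothetical.

```python
def filter_predictions(predictions, threshold=0.0, sort_by_value=False):
    """Keep (peptide, score) pairs scoring at least threshold; optionally sort high to low."""
    kept = [(peptide, score) for peptide, score in predictions if score >= threshold]
    if sort_by_value:
        kept.sort(key=lambda item: item[1], reverse=True)
    return kept
```

With the default threshold of 0 and sorting off, the list is returned unchanged and in its original order.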
4. Submit the job
Click on the "Submit" button. The status of your job (either 'queued'
or 'running') will be displayed and constantly updated until it terminates and
the server output appears in the browser window.
At any time during the wait you may enter your e-mail address and leave
the window. Your job will continue and you will be notified by e-mail when it has
terminated. The e-mail message will contain the URL under which the results are
stored; they will remain on the server for 24 hours for you to collect them.
Loading a trained model
Once your job is completed, you will have the possibility of downloading the trained
method to your computer. You may then upload this model at any moment to the NNAlign
submission page and use it for new predictions on evaluation sets.
Simply select the option Upload a MODEL and paste your model file in the submission form.