
Instructions & Guidelines



NNAlign is a server for the discovery of sequence motifs in quantitative peptide data. By quantitative peptide data we mean a set of amino acid sequences, each associated with a numerical value. This value (the quantitative measure) defines negative, positive and intermediate examples across some numerical spectrum. It could be, for example, the binding strength of each peptide to a certain molecule, or the signals measured on a peptide array. For effective training of NNAlign, it is important that the training set contains not only positive instances but also negative (and possibly intermediate) examples. The neural networks can then attempt to correlate the amino acid sequences they are presented with to their associated quantitative values, and learn what differentiates positives from negatives. The generated model can then be used to produce quantitative predictions on new data.

On this page we introduce the data formats and the options available on the server, and give some guidelines for its use. Users are welcome to contact the authors with any questions.


1. Specify the training sequences

All the input sequences must be in one-letter amino acid code. The allowed alphabet (not case sensitive) is as follows:

A C D E F G H I K L M N P Q R S T V W Y X

Paste a set of peptides, one sequence per line into the upper left window, or upload a file from your local disk. Look here to see an example of the format.

The program accepts a continuous spectrum of signal intensities associated with the sequences, and assumes that positive examples (e.g. binders) have higher values, while negative examples lie at the lower end of the spectrum.
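As an illustration of the expected "peptide value" format, here is a minimal Python sketch that writes a training file. The peptides, the file name and the IC50-to-target transformation (a convention sometimes used for binding affinity data, with 50000 nM as an assumed upper limit) are all hypothetical; the server can also rescale raw values itself (see option 3.5):

    import math

    # Hypothetical peptide/affinity pairs (IC50 in nM); lower IC50 = stronger binder.
    measurements = [("ILKEPVHGV", 12.0), ("ALAKAAAAV", 3400.0), ("GILGFVFTL", 25000.0)]

    # One common way to map binding data onto [0,1] so that strong binders
    # score close to 1 (assumption: 50000 nM as the weakest measurable affinity).
    def ic50_to_target(ic50_nm, max_ic50=50000.0):
        return max(0.0, 1.0 - math.log(ic50_nm) / math.log(max_ic50))

    with open("training_data.txt", "w") as fh:      # hypothetical file name
        for peptide, ic50 in measurements:
            fh.write("%s %.4f\n" % (peptide, ic50_to_target(ic50)))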

2. Select evaluation examples (Optional)

After the training phase, you will obtain a model that has learned a common motif from the training data. If you wish to use the neural networks to discover occurrences of the motif in new data, paste in an evaluation set or load it from your disk. Two formats are accepted:

  • A list of peptides, one sequence per line. If values are provided together with the peptides (in a format similar to the training data), the method will calculate statistical measures of correlation between observed and predicted values for the evaluation set. Example
  • A set of amino acid sequences in FASTA format. The sequences are chopped into peptides of the length of the learned motif (plus any flanks) and then run through the neural networks (see the sketch below). Example
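To give an idea of what happens to FASTA input, here is a minimal sketch of the chopping step, assuming a learned motif length of 9 and a hypothetical input file name; the server's actual implementation may differ:

    # Chop each FASTA entry into all overlapping peptides of the motif length.
    def chop(sequence, k=9):
        return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

    def read_fasta(path):
        name, seq = None, []
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if line.startswith(">"):
                    if name is not None:
                        yield name, "".join(seq)
                    name, seq = line[1:], []
                elif line:
                    seq.append(line)
        if name is not None:
            yield name, "".join(seq)

    for name, seq in read_fasta("proteins.fasta"):  # hypothetical input file
        for peptide in chop(seq, k=9):
            print(name, peptide)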

3. Customize your run by changing some of the advanced options (Optional)

Toggle the Expert options checkbox to display all the available options. They are briefly explained here, in the order they appear on the form:

3.1 Motif length
The length of the alignment core can be specified as:
  • Single value. e.g. 7
  • Interval. e.g. 6-8
  • Interval with step size. e.g. 6-10/2
The algorithm will attempt aligning the sequences for all specified motif lengths, and choose the one that maximizes performance (in terms of correlation between observed and predicted values).
Note that sequences shorter than the given motif length will be removed from the dataset.
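The length specification is simple enough to mirror in a few lines. This sketch (not the server's code) shows how the three forms expand into a list of candidate motif lengths:

    # Parse a motif-length field: "7", "6-8" or "6-10/2".
    def parse_lengths(spec):
        step = 1
        if "/" in spec:
            spec, step_str = spec.split("/")
            step = int(step_str)
        if "-" in spec:
            lo, hi = (int(x) for x in spec.split("-"))
        else:
            lo = hi = int(spec)
        return list(range(lo, hi + 1, step))

    print(parse_lengths("6-10/2"))   # -> [6, 8, 10]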

3.2 Flanking amino acids for NN training
In some instances, the amino acid composition of the regions surrounding the motif core (peptide flanking region, PFR) can influence the response. This was shown, for example, in this paper, where the composition of a PFR of at least two amino acids around the core was found to affect peptide binding strength. With this option you can specify the length of the regions flanking the alignment core. The length and amino acid composition of the PFRs are also encoded in the neural networks.

3.3 Encode flanking region length / Encode peptide length
Additional neurons are added to the neural networks' input layer to account for flanking region length and peptide length.

3.4 Neural network encoding
In Blosum encoding, similarity between amino acids is taken into account, allowing substitutions between residues with similar characteristics. Sparse encoding treats all amino acids as equally dissimilar. With the "Combined" option, networks are trained with both Blosum and Sparse encoding, and the predictions of the two strategies are combined.
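The difference between the two encodings can be sketched as follows. This is an illustration only; the BLOSUM matrix and scaling used by the server may differ, and the sketch assumes NumPy and Biopython are available:

    import numpy as np
    from Bio.Align import substitution_matrices

    ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

    # Sparse (one-hot) encoding: every residue is equally distant from all others.
    def sparse_encode(peptide):
        mat = np.zeros((len(peptide), len(ALPHABET)))
        for i, aa in enumerate(peptide):
            mat[i, ALPHABET.index(aa)] = 1.0
        return mat.flatten()

    # Blosum encoding: each residue is represented by its row of a BLOSUM
    # matrix, so similar residues get similar input vectors.
    blosum = substitution_matrices.load("BLOSUM62")   # assumption: BLOSUM62

    def blosum_encode(peptide):
        rows = [[blosum[aa, bb] for bb in ALPHABET] for aa in peptide]
        return np.array(rows).flatten() / 10.0        # simple rescaling for NN input

    print(sparse_encode("ILK").shape, blosum_encode("ILK").shape)   # (60,) (60,)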

3.5 Processing of input data
The optimal data distribution for NN training lies between 0 and 1, with the bulk of the data in the middle region of the spectrum. With the default option the program linearly rescales the data between 0 and 1, but it is also possible to apply a logarithmic transformation if the data appear squashed towards lower values.
You can also inspect the data distribution before and after the transformation in the output panel, following the link "View data distribution". Example
Some experiments (e.g. peptide microarrays with linker sequences) might generate raw data with repeated flanking amino acids at either or both ends of all sequences. The program excludes these amino acids by default, but if you think they might contain biological signal you can choose to keep these repeated flanks.
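For intuition, here is a small sketch of the two rescaling strategies described above, applied to hypothetical raw signals:

    import numpy as np

    values = np.array([0.5, 3.0, 12.0, 250.0, 4800.0])   # hypothetical raw signals

    # Default: linear rescaling to [0, 1].
    linear = (values - values.min()) / (values.max() - values.min())

    # Alternative: log-transform first when the data are squashed towards
    # low values, then rescale; this spreads the bulk of the data mid-range.
    logged = np.log(values + 1.0)                # +1 guards against log(0); an assumption
    log_rescaled = (logged - logged.min()) / (logged.max() - logged.min())

    print(np.round(linear, 3), np.round(log_rescaled, 3))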

3.6 Cross-validation method
The predictive performance of the method is estimated in cross-validation on the supplied training set.
If "Stop training on best test-set performance" is ticked, a completely unbiased evaluation of the performance requires an additional independent test set. This is done by selecting "Exhaustive cross-validation". However, for large datasets an accurate (and much faster) estimate of the predictive performance can be done on the same subsets used for training.
Leaving "Stop training on best test-set performance" unticked will let the training continue until the maximum number of training cycles as specified in the "Number of training cycles" option.

3.7 Subsets for cross-validation
The data can be prepared for cross-validation in three ways:
  • Random subsets: the raw data is simply split randomly into subsets of equal size
  • Homology clustering: a Hobohm 1 algorithm is used to group homologous sequences and limit overlap between subsets. The similarity threshold is specified with the corresponding option ("Threshold for homology clustering"), which expresses the percentage of matches in an ungapped alignment.
  • Common motif clustering: two sequences are considered similar if they share a stretch of N identical amino acids, where N is the (minimum) motif length specified by the user.
With both Homology and Common motif clustering, homologous sequences are grouped together but are all kept for training. The option "Remove homologous sequences from training set" keeps only one sequence per group of homologs, to further minimize possible overlap between subsets.
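A sketch of the "common motif" criterion combined with a Hobohm-1-style grouping follows; it is illustrative only, and details of the actual implementation may differ:

    # Two peptides are deemed similar if they share N consecutive identical
    # amino acids (N = the minimum motif length specified by the user).
    def share_motif(a, b, n):
        kmers = {a[i:i + n] for i in range(len(a) - n + 1)}
        return any(b[i:i + n] in kmers for i in range(len(b) - n + 1))

    # Hobohm 1: scan the list once, assigning each sequence to the first
    # cluster whose representative it matches, or opening a new cluster.
    def hobohm1(peptides, n):
        clusters = []
        for pep in peptides:
            for cluster in clusters:
                if share_motif(pep, cluster[0], n):
                    cluster.append(pep)
                    break
            else:
                clusters.append([pep])
        return clusters

    print(hobohm1(["ILKEPVHGV", "LKEPVHGVT", "GILGFVFTL"], n=6))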

3.8 Folds for cross-validation
Specify the number of subsets to be created for the estimation of performance in cross-validation. It is also possible to skip cross-validation by ticking the 'NO' button. In this case all data are used for training and execution will be faster, but it won't be possible to calculate performance measures.

3.9 Number of training cycles
This option specifies how many times each example in the training set is presented to the neural network. If training is stopped on best test-set performance, this value represents the maximum number of training cycles.

3.10 Number of hidden neurons
A higher number of hidden neurons in the NNs potentially allows detecting higher order correlations, but increases the number of parameters of the model. Different hidden layer sizes can be specified in a comma separated list (e.g. 3,7,12,20), in which case an ensemble of networks with different architectures is constructed.

3.11 Number of initial seeds per iteration
It is possible to train the model from several initial random NN configurations. An ensemble of several neural networks has been shown to perform better than a single NN. Note, however, that the time required to train a model increases linearly with this parameter.

3.12 Number of networks per fold in the final ensemble
When training with cross-validation, each neural network's performance is evaluated in terms of Squared Error between predicted and observed values on the test set. The top N networks (for each cross-validation step) can be selected using this parameter, and they will constitute the final model.
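In outline, the selection amounts to ranking networks by their test-set squared error and averaging the best N. The following sketch uses toy linear predictors as stand-ins for trained networks:

    import numpy as np

    def select_top(networks, X_test, y_test, n_keep):
        # Rank networks by mean squared error on the test set, keep the top N.
        errors = [np.mean((net(X_test) - y_test) ** 2) for net in networks]
        return [networks[i] for i in np.argsort(errors)[:n_keep]]

    def ensemble_predict(selected, X):
        # The final model averages the predictions of the selected networks.
        return np.mean([net(X) for net in selected], axis=0)

    rng = np.random.default_rng(1)
    X_test = rng.normal(size=(20, 4))
    y_test = X_test @ np.ones(4)
    networks = [(lambda w: (lambda X: X @ w))(rng.normal(size=4)) for _ in range(10)]
    best = select_top(networks, X_test, y_test, n_keep=3)
    print(ensemble_predict(best, X_test)[:3])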

3.13 Sequence logo height (in bits)
Sets the scale of the y axis in the motif logo figure. For a 20-letter alphabet, the information content can vary between zero, for an uninformative position, and log2(20) ≈ 4.32 bits, for a completely conserved position.
It is possible to create logos for all the single networks in the final ensemble, and (recommended) to use offset correction to re-align networks to a common core. The latter aims at maximizing the information content of a combined core over all networks, and generally produces a better representation of the sequence motif. This topic is discussed in a dedicated section of the NNAlign paper (Section: Improving the LOGO sequence motif representation by an offset correction).
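The quantity plotted on the y axis can be computed as follows; a minimal sketch, assuming equiprobable background frequencies and a set of hypothetical aligned cores:

    import math
    from collections import Counter

    # Per-position information content (in bits) of an alignment core.
    # For a 20-letter alphabet it ranges from 0 to log2(20), about 4.32 bits.
    def information_content(column, alphabet_size=20):
        counts = Counter(column)
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        return math.log2(alphabet_size) - entropy

    cores = ["ILKEPVHGV", "ILDEPVHGT", "ILKEPVHGL"]   # hypothetical aligned cores
    for pos, column in enumerate(zip(*cores)):
        print(pos + 1, round(information_content(column), 2))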

3.14 Sort results by prediction value
If ticked, the peptides in the prediction file are ordered according to the NNAlign predicted value. If left unticked, they are presented in their original order.

3.15 Threshold on evaluation set predictions
Use this parameter to limit the size of the evaluation set results. It should be given as a number between 0 and 1 (if set to 0, all results are displayed). This is mostly relevant for large FASTA submissions, to show only the best-scoring sequences detected by the method.

3.16 Optional prefix for results files
This prefix is added to the names of all files generated by the current run. If left empty, a system-generated number will be used as the prefix.

4. Submit the job

Click on the "Submit" button. The status of your job (either 'queued' or 'running') will be displayed and constantly updated until it terminates and the server output appears in the browser window.

At any time during the wait you may enter your e-mail address and leave the window. Your job will continue and you will be notified by e-mail when it has terminated. The e-mail message will contain the URL under which the results are stored; they will remain on the server for 24 hours for you to collect them.

Loading a trained model

Once your job has completed, you will have the possibility of downloading the trained method to your computer. You may then upload this model at any time to the NNAlign submission page and use it for new predictions on evaluation sets.
Simply upload the model from your disk to the right-hand box in the submission form.



GETTING HELP

Scientific problems:        Technical problems: