
Neural Network Prediction of Signal Peptides


Please keep two browser windows open throughout the exercise. The cookbook (i.e., this page) should be loaded in one, and SignalP (a neural network for predicting signal peptides) should be open in the other.

This exercise has three parts: first, you will use the SignalP server on four examples of secretory and non-secretory proteins. In the second part, you will use the SignalP server as a "virtual laboratory", and try to destroy a signal peptide by mutating a number of amino acids. Finally, you will attempt to modify a eukaryotic signal peptide so that it works in gram-positive bacteria.


Biological background



Both prokaryotic and eukaryotic cells have protein-secretion systems which rely on "signal peptides". These are short, N-terminal amino acid sequences that cause a protein to be exported out of the cell and that are usually cleaved off after the protein has been transported across the membrane. Signal peptides are not generally homologous, but instead have a number of common properties that are recognized by the secretion system. Below are sequence logos for the signal peptides of eukaryotes, gram-negative bacteria, and gram-positive bacteria. (In these logos the sequences have been aligned on the cleavage site located between positions -1 and +1). Signal peptides can be divided into three regions:
  • n-region: a positively charged region at the N-terminus of the signal peptide (the "left" end of the sequence). In eukaryotes this charge is supplied by the free amino group of the N-terminal amino acid. In prokaryotes, where the N-terminal amino acid is formylated, an amino acid with a positively charged side chain is required.

  • h-region: a hydrophobic core in the middle of the signal peptide.

  • c-region: a more polar region at the C-terminus (the "right" end of the sequence) with small and neutral amino acids at positions -1 and -3.
In addition to slightly different sequence preferences, eukaryotic signal peptides are somewhat shorter than gram-negative signal peptides, and markedly shorter than gram-positive signal peptides. (The logos below have all been cut off at 25 residues N-terminal to the cleavage site).





Explanation of SignalP output



For each position in your sequence, the SignalP server will return three scores between 0 and 1. The scores will be given as raw numbers, and (if you select that option) also in graphical form as a plot of the scores against the sequence. The exact meaning of the scores is explained below:

C-score (raw cleavage site score)
The output score from networks trained to recognize cleavage sites vs. other sequence positions. The C-score is:
  • high at position +1 (immediately after the cleavage site)
  • low at all other positions.

S-score (signal peptide score)
The output score from networks trained to recognize signal peptide vs. non-signal-peptide positions. The S-score is:
  • high at all positions before the cleavage site
  • low at 30 positions after the cleavage site and in the N-terminals of non-secretory proteins.

Y-score (combined cleavage site score)
The prediction of cleavage site location is optimized by observing where the C-score is high and the S-score changes from a high to a low value. The Y-score formalizes this by combining the height of the C-score with the slope of the S-score. Specifically, the Y-score is a geometric average between the C-score and a smoothed derivative of the S-score (i.e., the difference between the mean S-score over d positions before and d positions after the current position, where d varies with the chosen network ensemble).
This will all become clearer when you inspect some actual output from the neural network.
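To make the Y-score definition concrete, here is a minimal Python sketch of the calculation described above. It is an illustration, not the server's actual code: the function name `y_scores` is made up for this example, and three details are assumptions, namely that the "after" window starts at the current position, that windows are simply shortened at the ends of the sequence, and that a negative slope is clipped to zero before taking the square root.

```python
import math

def y_scores(c, s, d):
    """Combine per-position C- and S-scores into Y-scores.

    c, s: lists of scores (index 0 = first residue); d: window size.
    Assumptions (not taken from SignalP itself): the "after" window
    includes the current position, windows are truncated at the
    sequence ends, and negative slopes are clipped to zero.
    """
    n = len(c)
    y = []
    for i in range(n):
        before = s[max(0, i - d):i]      # up to d positions before i
        after = s[i:min(n, i + d)]       # up to d positions from i onward
        if not before or not after:
            y.append(0.0)                # no slope can be computed here
            continue
        slope = sum(before) / len(before) - sum(after) / len(after)
        y.append(math.sqrt(c[i] * max(slope, 0.0)))  # geometric average
    return y

# Toy data: the S-score drops after the fifth residue and the C-score
# peaks at the sixth residue (index 5), i.e. at position +1.
s_toy = [0.9] * 5 + [0.1] * 5
c_toy = [0.1] * 5 + [0.8] + [0.1] * 4
y_toy = y_scores(c_toy, s_toy, d=3)
print(max(range(len(y_toy)), key=y_toy.__getitem__))
```

In the toy data the Y-score is largest where a high C-score coincides with the steepest drop in the S-score, which is the behaviour the definition is meant to capture.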


Exercise part 1: Prediction of Signal Peptides Using SignalP



In this part of the exercise you will use SignalP to analyze four different proteins and predict whether they are likely to be secreted or not. For each protein we have listed the entire amino acid sequence. However, for the purpose of predicting signal peptides you should only use the first 60 amino acids or so (corresponding to one line). Generally signal peptides are shorter than this, and in fact submitting longer sequences may sometimes lead to false positive predictions.

First, open the SignalP server if you haven't already done so. For each of the examples below, copy and paste approximately the first 60 amino acids into the sequence window on the SignalP page, making sure to select the proper organism type (eukaryote). You can return to the submission page by clicking the "back" button in your browser.
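If you prefer to prepare the input in a script rather than by hand, the short Python sketch below cuts a pasted sequence down to its first 60 residues. The helper names `first_residues` and `to_fasta` are made up for this example, and the FASTA header line is only a common convention; check the submission page for the input formats it accepts.

```python
def first_residues(sequence, n=60):
    """Remove whitespace and line breaks, then keep the first n residues."""
    cleaned = "".join(sequence.split()).upper()
    return cleaned[:n]

def to_fasta(name, sequence, n=60):
    """Format the truncated sequence as a FASTA record (assumed input format)."""
    return f">{name}\n{first_residues(sequence, n)}\n"

# First two lines of the cystatin C example, pasted with a line break.
pasted = """MAGPLRAPLLLLAILAVALAVSPAAGSSPGKPPRLVGGPMDASVEEEGVRRALDFAVGEY
NKASNDMYHSRALQVVRARKQIVAGVNYFLDVELGRTTCTKTQPNLDNCPFHDQPHLKRK"""
print(to_fasta("CYTC_HUMAN", pasted))
```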

Example 1: Epidermal growth factor receptor precursor (Swissprot ID: EGFR_HUMAN)
MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEV
VLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALA
VLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDF
QNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKIICAQQCSGRCRGKSPSDCCHNQCAAGC
TGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVKKCPRNYV
VTDHGSCVRACGADSYEMEEDGVRKCKKCEGPCRKVCNGIGIGEFKDSLSINATNIKHFK
NCTSISGDLHILPVAFRGDSFTHTPPLDPQELDILKTVKEITGFLLIQAWPENRTDLHAF
ENLEIIRGRTKQHGQFSLAVVSLNITSLGLRSLKEISDGDVIISGNKNLCYANTINWKKL
FGTSGQKTKIISNRGENSCKATGQVCHALCSPEGCWGPEPRDCVSCRNVSRGRECVDKCN
LLEGEPREFVENSECIQCHPECLPQAMNITCTGRGPDNCIQCAHYIDGPHCVKTCPAGVM
GENNTLVWKYADAGHVCHLCHPNCTYGCTGPGLEGCPTNGPKIPSIATGMVGALLLLLVV
ALGIGLFMRRRHIVRKRTLRRLLQERELVEPLTPSGEAPNQALLRILKETEFKKIKVLGS
GAFGTVYKGLWIPEGEKVKIPVAIKELREATSPKANKEILDEAYVMASVDNPHVCRLLGI
CLTSTVQLITQLMPFGCLLDYVREHKDNIGSQYLLNWCVQIAKGMNYLEDRRLVHRDLAA
RNVLVKTPQHVKITDFGLAKLLGAEEKEYHAEGGKVPIKWMALESILHRIYTHQSDVWSY
GVTVWELMTFGSKPYDGIPASEISSILEKGERLPQPPICTIDVYMIMVKCWMIDADSRPK
FRELIIEFSKMARDPQRYLVIQGDERMHLPSPTDSNFYRALMDEEDMDDVVDADEYLIPQ
QGFFSSPSTSRTPLLSSLSATSNNSTVACIDRNGLQSCPIKEDSFLQRYSSDPTGALTED
SIDDTFLPVPEYINQSVPKRPAGSVQNPVYHNQPLNPAPSRDPHYQDPHSTAVGNPEYLN
TVQPTCVNSTFDSPAHWAQKGSHQISLDNPDYQQDFFPKEAKPNGIFKGSTAENAEYLRV
APQSSEFIGA


Q1: Do you find a signal peptide based on the S-score? If so, how long is it?

Q2: Do you find a cleavage site? If so, where is it located? Do you think it is credible?



Example 2: Human cystatin C precursor (Swissprot ID: CYTC_HUMAN)
MAGPLRAPLLLLAILAVALAVSPAAGSSPGKPPRLVGGPMDASVEEEGVRRALDFAVGEY
NKASNDMYHSRALQVVRARKQIVAGVNYFLDVELGRTTCTKTQPNLDNCPFHDQPHLKRK
AFCSFQIYAVPWQGTMTLSKSTCQDA

Q3: Check the plot "SignalP - NN prediction". The C-score shows two high peaks. Can you explain why SignalP predicted only one of them as a cleavage site?


Example 3: 19 kD protein of the human signal recognition particle (Swissprot ID: SR19_HUMAN)
MACAAARSPADQDRFICIYPAYLNNKKTIAEGRRIPISKAVENPTATEIQDVCSAVGLNV
FLEKNKMYSREWNRDVQYRGRVRVQLKQEDGSLCLVQFPSRKSVMLYAAEMIPKLKTRTQ
KTGGADQSLQQGEGSKKGKGKKKK

Q4: Do you find a cleavage site this time?


Example 4: Human octamer binding transcription factor 3B (Swissprot ID: OC3B_HUMAN)
MHFYRLFLGATRRFLNPEWKGEIDNWCVYVLTSLLPFKIQSQDIKALQKELEQFAKLLKQ
KRITLGYTQADVGLTLGVLFGKVFSQTTICRFEALQLSFKNMCKLRPLLQKWVEEADNNE
NLQEICKAETLVQARKRKRTSIENRVRGNLENLFLQCPKPTLQQISHIAQQLGLEKDVVR
VWFCNRRQKGKRSSSDYAQREDFEAAGSPFSGGPVSFPLAPGPHFGTPGYGSPHFTALYS
SVPFPEGEAFPPVSVTTLGSPMHSN

Q5: Check the result. You will find that no cleavage site is predicted. Can you explain why no cleavage site was predicted even though the C-score reaches a high value?



Exercise part 2: Virtual Mutation Analysis



In this part of the exercise you will use SignalP as a "virtual laboratory" and try to predict the effect of mutating one or more residues in a functional signal peptide. This particular use is a bit artificial, but the take-home message is that experimental work and bioinformatics tools can often be used together with good results. Thus, bioinformatics tools may be useful for generating ideas for new experiments and also for selecting experiments that are likely to give interesting results.
  • First load the sequence below into SignalP and run the predictor. (These are the first 40 amino acids from the epidermal growth factor receptor precursor sequence that you also used previously):
    MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQ
    Have a look at the graphical and numerical output. As you can see, the predicted cleavage site is between the A at position 24 and the L at position 25: SRA-LE

  • Now use the back button to return to the SignalP submission page. Your sequence should still be there, ready to receive any mutation you can think of.

  • Have an extra look at the logo for eukaryotic signal peptides (see above). Now try to destroy the signal peptide with as few mutations as possible. Suggestions: try to mess with the small, neutral amino acids at positions -1 and -3. Try to destroy the hydrophobic core by adding charged residues. An overview of amino acid properties is included below.
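To keep track of your mutants, you may find a small script helpful. The Python sketch below applies point mutations to the 40-residue sequence and labels them in the usual "A24K" style. The helper names `mutate` and `describe` are made up for this example, and the two substitutions shown are purely hypothetical: whether they actually destroy the signal peptide is something only the SignalP prediction can tell you.

```python
def mutate(seq, substitutions):
    """Apply point mutations given as {1-based position: new residue}."""
    residues = list(seq)
    for pos, new in substitutions.items():
        residues[pos - 1] = new
    return "".join(residues)

def describe(seq, substitutions):
    """Return labels such as 'A24K' (old residue, position, new residue)."""
    return [f"{seq[pos - 1]}{pos}{new}"
            for pos, new in sorted(substitutions.items())]

wild_type = "MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQ"

# Cleavage is after position 24, so position 24 is -1 and position 22 is -3.
# Hypothetical example: replace both with a charged residue (lysine).
changes = {22: "K", 24: "K"}
mutant = mutate(wild_type, changes)

print(describe(wild_type, changes))
print(mutant)  # paste this into the SignalP sequence window
```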




Q6: List your mutated, nonfunctional signal peptide sequence.


Q7: Can you explain why these mutations destroyed the signal peptide?


Exercise part 3: Virtual Yield Optimization



As mentioned previously, gram-positive signal peptides are markedly longer than eukaryotic signal peptides. This length difference is to a large degree caused by much longer h-regions (hydrophobic regions) in gram-positive signal peptides. In this part of the exercise you will attempt to modify a eukaryotic signal peptide so that it is functional in a gram-positive bacterium. In the biotech industry, it is often desirable to express eukaryotic proteins in prokaryotic cells since these are much easier to grow and handle.
  • Again load the epidermal growth factor receptor precursor sequence below into SignalP.
    MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQ

  • Now run the predictor, first with the organism group set to "eukaryotic" and then with the organism group set to "gram-positive".

    As you can see, the signal peptide is functional in eukaryotes, but based on the C-score, it does not have a particularly efficient cleavage site in gram-positive bacteria.

  • Now try to modify the signal peptide so that it becomes more efficient in gram-positive bacteria as well. Use the information given above concerning signal peptide differences and amino acid properties. Your aim should be to get the C-score above the cutoff here too.

    Q8: List your mutated gram-positive signal peptide sequence.

    Q9: Can you explain why these mutations improved the signal peptide in gram-positives?
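When designing candidates for part 3, it can help to see how long the hydrophobic stretch of a sequence is before submitting it. The Python sketch below is one rough way to do that. It uses the Kyte-Doolittle hydropathy scale, which is a stand-in chosen for this example and not what SignalP uses internally; the helper names, the threshold of 0, and the four inserted leucines are likewise assumptions made for illustration. Whether a modified sequence scores better in gram-positives still has to be checked on the SignalP server.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def longest_hydrophobic_run(seq, threshold=0.0):
    """Return (start, end), 1-based inclusive, of the longest run of
    residues whose hydropathy exceeds threshold; (0, 0) if there is none."""
    best = (0, 0)
    start = None
    for i, res in enumerate(seq, start=1):
        if KD[res] > threshold:
            if start is None:
                start = i
            if best == (0, 0) or i - start > best[1] - best[0]:
                best = (start, i)
        else:
            start = None
    return best

def insert_residues(seq, after_pos, insert):
    """Insert residues after the given 1-based position."""
    return seq[:after_pos] + insert + seq[after_pos:]

wild_type = "MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQ"
print(longest_hydrophobic_run(wild_type))

# Hypothetical candidate: lengthen the hydrophobic core with four leucines.
candidate = insert_residues(wild_type, 15, "LLLL")
print(candidate, longest_hydrophobic_run(candidate))
```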