Please keep two browser windows open throughout the exercise.
The cookbook (i.e., this page) should be loaded into one, while
SignalP (a neural network for predicting signal peptides)
should be opened in a separate window.
This exercise has three parts: first, you will use the SignalP server on four
examples of secretory and non-secretory proteins. In the second part, you will use the
SignalP server as a "virtual laboratory", and try to destroy a signal peptide by
mutating a number of amino acids. Finally, you will attempt to modify a eukaryotic
signal peptide so that it works in gram-positive bacteria.
Biological background
Both prokaryotic and eukaryotic cells have protein-secretion systems which rely on
"signal peptides". These are short, N-terminal amino acid sequences that cause a
protein to be exported out of the cell and that are usually cleaved off after the protein
has been transported across the membrane. Signal peptides are not generally
homologous, but instead have a number of common properties that are recognized by the
secretion system. Below are sequence logos for the signal peptides of eukaryotes,
gram-negative bacteria, and gram-positive bacteria. (In these logos the sequences
have been aligned on the cleavage site located between positions -1 and +1).
Signal peptides can be divided into three regions:
- n-region: a positively charged region at the N-terminus of the signal peptide
(the "left" end of the sequence). In eukaryotes this charge is supplied by the free amino
group of the N-terminal amino acid. In prokaryotes, where the N-terminal amino acid is
formylated, the presence of an amino acid with a positively charged side chain is required.
- h-region: a hydrophobic core in the middle of the signal peptide.
- c-region: a more polar region at the C-terminus (the "right" end of the sequence)
with small and neutral amino acids at positions -1 and -3.
In addition to slightly different sequence preferences, eukaryotic signal peptides are
somewhat shorter than gram-negative signal peptides, and markedly shorter than
gram-positive signal peptides. (The logos below have all been cut off at 25 residues
N-terminal to the cleavage site).
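The hydrophobic h-region described above can be made visible with a simple sliding-window calculation. The sketch below is only an illustration: it uses the Kyte-Doolittle hydropathy scale, which is an assumption chosen here and is not what SignalP uses internally, and the function names and the window width of 10 are hypothetical. The test sequence is the first 40 residues of the EGFR precursor used in this exercise.

```python
# Kyte-Doolittle hydropathy values (assumption: chosen for illustration only;
# SignalP is a neural network and does not use this scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_hydropathy(seq, width=10):
    """Return the mean hydropathy of every window of `width` residues."""
    values = [KYTE_DOOLITTLE[aa] for aa in seq]
    return [
        sum(values[i:i + width]) / width
        for i in range(len(values) - width + 1)
    ]


def most_hydrophobic_window(seq, width=10):
    """Return (1-based start position, mean hydropathy) of the best window."""
    means = window_hydropathy(seq, width)
    best = max(range(len(means)), key=means.__getitem__)
    return best + 1, means[best]


egfr = "MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQ"
start, mean = most_hydrophobic_window(egfr)
print(f"Most hydrophobic window starts at position {start} (mean {mean:.2f})")
```

The most hydrophobic window should fall inside the signal peptide, between the n-region and the cleavage site.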
Explanation of SignalP output
For each position in your sequence, the SignalP server will return three scores
between 0 and 1. The scores will be given as raw numbers, and (if you select that option)
also in graphical form as a plot of the scores against the sequence.
The exact meaning of the scores is explained below:
C-score (raw cleavage site score)
The output score from networks trained to recognize cleavage sites vs. other sequence
positions.
The C-score is:
- high at position +1 (immediately after the cleavage site)
- low at all other positions.
S-score (signal peptide score)
The output score from networks trained to recognize signal peptide vs.
non-signal-peptide positions.
The S-score is:
- high at all positions before the cleavage site
- low at 30 positions after the cleavage site and in the N-terminals of
non-secretory proteins.
Y-score (combined cleavage site score)
The prediction of cleavage site location is optimized by observing where the
C-score is high and the S-score changes from a high to a low value. The Y-score
formalizes this by combining the height of the C-score with the slope of the
S-score.
Specifically, the Y-score is a geometric average between the C-score and a
smoothed derivative of the S-score (i.e., the difference between the mean S-score
over d positions before and d positions after the current position, where d varies
with the chosen network ensemble).
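The Y-score definition above can be sketched in a few lines of Python. This is a minimal illustration, not the server's implementation: the function name `y_scores`, the choice to count the current position as part of the "after" window, the clipping of negative differences to zero, and the handling of sequence ends are all assumptions made here. The scores in the example are invented toy values, not SignalP output.

```python
import math


def y_scores(c_scores, s_scores, d):
    """Geometric average of the C-score and a smoothed S-score derivative.

    Assumptions (not confirmed by the server documentation):
    - the "after" window starts at the current position;
    - a negative S-score difference is clipped to 0;
    - positions without a "before" window get a Y-score of 0.
    """
    ys = []
    for i, c in enumerate(c_scores):
        before = s_scores[max(0, i - d):i]
        after = s_scores[i:i + d]
        if not before or not after:
            ys.append(0.0)
            continue
        delta = sum(before) / len(before) - sum(after) / len(after)
        ys.append(math.sqrt(c * max(delta, 0.0)))
    return ys


# Toy example: the S-score drops from 0.9 to 0.1 at index 5,
# and the C-score has a single peak at the same index.
s = [0.9] * 5 + [0.1] * 5
c = [0.0] * 5 + [0.8] + [0.0] * 4
y = y_scores(c, s, d=3)
best = max(range(len(y)), key=y.__getitem__)
print(best, round(y[best], 3))
```

In this toy case the Y-score peaks where the C-score is high and the S-score falls, which is the behaviour described above.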
This will all become clearer when you inspect some actual output from the neural network.
Exercise part 1: Prediction of Signal Peptides Using SignalP
In this part of the exercise you will use SignalP to analyze four different proteins
and predict whether they are likely to be secreted or not. For each protein we have
listed the entire amino acid sequence. However, for the purpose of predicting signal peptides
you should only use the first 60 amino acids or so (corresponding to one line).
Generally signal peptides are shorter than this, and in fact submitting longer
sequences may sometimes lead to false positive predictions.
First, open the SignalP server if you haven't already done so. For each of the
examples below, please cut and paste approximately the first 60 amino acids into the
sequence window on the SignalP page, making sure to select the proper organism type
(eukaryote). You can return to the submission page by clicking the "back" button
on your browser.
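If you prefer to prepare the N-terminal fragment outside the browser, the truncation step can be sketched as follows. The helper name `n_terminal_fragment` is hypothetical; it simply removes line breaks from a pasted sequence and keeps the first 60 residues. The example uses the cystatin C sequence from Example 2.

```python
def n_terminal_fragment(raw_sequence, length=60):
    """Remove whitespace and line breaks, then keep the first `length` residues."""
    cleaned = "".join(raw_sequence.split()).upper()
    return cleaned[:length]


cystatin_c = """
MAGPLRAPLLLLAILAVALAVSPAAGSSPGKPPRLVGGPMDASVEEEGVRRALDFAVGEY
NKASNDMYHSRALQVVRARKQIVAGVNYFLDVELGRTTCTKTQPNLDNCPFHDQPHLKRK
AFCSFQIYAVPWQGTMTLSKSTCQDA
"""

fragment = n_terminal_fragment(cystatin_c)
print(fragment)
print(len(fragment))
```

The printed fragment is what you would paste into the SignalP sequence window.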
Example 1: Epidermal growth factor receptor precursor (Swissprot ID: EGFR_HUMAN)
MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEV
VLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALA
VLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDF
QNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKIICAQQCSGRCRGKSPSDCCHNQCAAGC
TGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVKKCPRNYV
VTDHGSCVRACGADSYEMEEDGVRKCKKCEGPCRKVCNGIGIGEFKDSLSINATNIKHFK
NCTSISGDLHILPVAFRGDSFTHTPPLDPQELDILKTVKEITGFLLIQAWPENRTDLHAF
ENLEIIRGRTKQHGQFSLAVVSLNITSLGLRSLKEISDGDVIISGNKNLCYANTINWKKL
FGTSGQKTKIISNRGENSCKATGQVCHALCSPEGCWGPEPRDCVSCRNVSRGRECVDKCN
LLEGEPREFVENSECIQCHPECLPQAMNITCTGRGPDNCIQCAHYIDGPHCVKTCPAGVM
GENNTLVWKYADAGHVCHLCHPNCTYGCTGPGLEGCPTNGPKIPSIATGMVGALLLLLVV
ALGIGLFMRRRHIVRKRTLRRLLQERELVEPLTPSGEAPNQALLRILKETEFKKIKVLGS
GAFGTVYKGLWIPEGEKVKIPVAIKELREATSPKANKEILDEAYVMASVDNPHVCRLLGI
CLTSTVQLITQLMPFGCLLDYVREHKDNIGSQYLLNWCVQIAKGMNYLEDRRLVHRDLAA
RNVLVKTPQHVKITDFGLAKLLGAEEKEYHAEGGKVPIKWMALESILHRIYTHQSDVWSY
GVTVWELMTFGSKPYDGIPASEISSILEKGERLPQPPICTIDVYMIMVKCWMIDADSRPK
FRELIIEFSKMARDPQRYLVIQGDERMHLPSPTDSNFYRALMDEEDMDDVVDADEYLIPQ
QGFFSSPSTSRTPLLSSLSATSNNSTVACIDRNGLQSCPIKEDSFLQRYSSDPTGALTED
SIDDTFLPVPEYINQSVPKRPAGSVQNPVYHNQPLNPAPSRDPHYQDPHSTAVGNPEYLN
TVQPTCVNSTFDSPAHWAQKGSHQISLDNPDYQQDFFPKEAKPNGIFKGSTAENAEYLRV
APQSSEFIGA
Q1: Do you find a signal peptide based on the S-score? If yes, how long is it?
Q2: Do you find a cleavage site? If yes, where is it located? Do you think it is credible?
Example 2: Human cystatin C precursor (Swissprot ID: CYTC_HUMAN)
MAGPLRAPLLLLAILAVALAVSPAAGSSPGKPPRLVGGPMDASVEEEGVRRALDFAVGEY
NKASNDMYHSRALQVVRARKQIVAGVNYFLDVELGRTTCTKTQPNLDNCPFHDQPHLKRK
AFCSFQIYAVPWQGTMTLSKSTCQDA
Q3: Check the plot "SignalP - NN prediction". There are two tall peaks in the C-score. Can you explain why SignalP predicted only one of them as a cleavage site?
Example 3: 19 kD protein of the human signal recognition particle (Swissprot ID: SR19_HUMAN)
MACAAARSPADQDRFICIYPAYLNNKKTIAEGRRIPISKAVENPTATEIQDVCSAVGLNV
FLEKNKMYSREWNRDVQYRGRVRVQLKQEDGSLCLVQFPSRKSVMLYAAEMIPKLKTRTQ
KTGGADQSLQQGEGSKKGKGKKKK
Q4: Do you find any cleavage site this time?
Example 4: Human octamer binding transcription factor 3B (Swissprot ID: OC3B_HUMAN)
MHFYRLFLGATRRFLNPEWKGEIDNWCVYVLTSLLPFKIQSQDIKALQKELEQFAKLLKQ
KRITLGYTQADVGLTLGVLFGKVFSQTTICRFEALQLSFKNMCKLRPLLQKWVEEADNNE
NLQEICKAETLVQARKRKRTSIENRVRGNLENLFLQCPKPTLQQISHIAQQLGLEKDVVR
VWFCNRRQKGKRSSSDYAQREDFEAAGSPFSGGPVSFPLAPGPHFGTPGYGSPHFTALYS
SVPFPEGEAFPPVSVTTLGSPMHSN
Q5: Check the result; you will find that no cleavage site is predicted. Can you explain why not, even though the C-score reaches a high value?
Exercise part 2: Virtual Mutation Analysis
In this part of the exercise you will use SignalP as a "virtual laboratory" and try
to predict the effect of mutating one or more residues in a functional signal peptide.
This particular use is a bit artificial, but the take-home message is that experimental
work and bioinformatics tools can often be used together with good results. Thus,
bioinformatics tools may be useful for generating ideas for new experiments and also
for selecting experiments that are likely to give interesting results.
- First load the sequence below into SignalP and run the predictor. (These are
the first 40 amino acids from the epidermal growth factor receptor precursor sequence
that you also used previously):
MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQ
Have a look at the graphical and numerical output. As you can see, cleavage is
between the A at position 24 and the L at position 25: SRA-LE
- Now use the back button to return to the SignalP submission page. Your
sequence should still be here, ready to receive any mutation you can think of.
- Have an extra look at the logo for eukaryotic signal peptides (see above).
Now try to destroy the signal peptide with as few mutations as possible. Suggestions:
try to mess with the small, neutral amino acids at positions -1 and -3. Try to destroy
the hydrophobic core by adding charged residues. An overview of amino acid properties is
included below.
Q6: List your mutated, nonfunctional signal peptide sequence.
Q7: Can you explain why these mutations destroyed the signal peptide?
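To keep track of your mutations, it can help to apply them programmatically before pasting the result into SignalP. The sketch below is a hypothetical helper, not part of the SignalP server: `mutate` applies point mutations written in the common "A24K" notation (original residue, 1-based position, new residue), and `cleavage_context` prints the residues around a cleavage site. The mutation shown is an arbitrary example, not a suggested answer.

```python
import re


def mutate(seq, *mutations):
    """Apply point mutations such as 'A24K' (1-based positions) to `seq`."""
    residues = list(seq)
    for mutation in mutations:
        match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
        if match is None:
            raise ValueError(f"Cannot parse mutation {mutation!r}")
        old, pos, new = match.group(1), int(match.group(2)), match.group(3)
        if not 1 <= pos <= len(residues):
            raise ValueError(f"Position {pos} is outside the sequence")
        if residues[pos - 1] != old:
            raise ValueError(
                f"Expected {old} at position {pos}, found {residues[pos - 1]}"
            )
        residues[pos - 1] = new
    return "".join(residues)


def cleavage_context(seq, last_position, flank=3):
    """Show `flank` residues before and 2 residues after a cleavage site.

    `last_position` is the 1-based position of the residue at -1.
    """
    return seq[last_position - flank:last_position] + "-" + seq[last_position:last_position + 2]


egfr = "MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQ"
print(cleavage_context(egfr, 24))   # wild type

mutant = mutate(egfr, "A24K")       # arbitrary example mutation at position -1
print(cleavage_context(mutant, 24))
print(mutant)
```

Whether a given mutant is still predicted to be a signal peptide can only be judged by running it through SignalP.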
Exercise part 3: Virtual Yield Optimization
As mentioned previously, gram-positive signal peptides are markedly longer than
eukaryotic signal peptides. This length difference is to a large degree caused
by much longer h-regions (hydrophobic regions) in gram-positive signal peptides. In
this part of the exercise you will attempt to modify a eukaryotic signal peptide so
that it is functional in a gram-positive bacterium. In the biotech industry, it is
often desirable to express eukaryotic proteins in prokaryotic cells since these are
much easier to grow and handle.
- Again load the epidermal growth factor receptor precursor sequence below
into SignalP.
MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQ
- Now run the predictor first setting organism group to "eukaryotic" and
subsequently setting the organism group to "gram-positive".
As you can see, the signal peptide is functional in eukaryotes, but based on
the C-score, it does not have a particularly efficient cleavage site in
gram-positive bacteria.
- Now try to modify the signal peptide so that it becomes more efficient in
gram-positives also. Use the information given above concerning signal peptide
differences and amino acid properties. Your aim should be to get the C score above
the cutoff also.
Q8: List your mutated gram-positive signal peptide sequence.
Q9: Can you explain why these mutations improved the signal peptide in gram-positives?
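Since the text above attributes the length difference mainly to the h-region, one kind of modification to try is inserting residues into the hydrophobic core. The helper below is a hypothetical sketch for bookkeeping only: the insertion point (after position 12) and the inserted residues ("LLL") are arbitrary choices for illustration, and whether any particular insertion improves the gram-positive prediction has to be checked by rerunning SignalP.

```python
def insert_residues(seq, after_position, residues):
    """Insert `residues` after the given 1-based position (0 = at the start)."""
    if not 0 <= after_position <= len(seq):
        raise ValueError(f"Position {after_position} is outside the sequence")
    return seq[:after_position] + residues + seq[after_position:]


egfr = "MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQ"

# Arbitrary example: three extra leucines inside the hydrophobic stretch.
longer = insert_residues(egfr, 12, "LLL")
print(longer)

# The eukaryotic cleavage site (after position 24) shifts by the insert length.
shifted_site = 24 + len("LLL")
print(longer[shifted_site - 3:shifted_site] + "-" + longer[shifted_site:shifted_site + 2])
```

Keeping track of the shifted position makes it easier to compare the cleavage site in the SignalP output before and after the modification.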