
Exercise 1: Location of possible translation starts



Data-driven neural network prediction: Background Information

A Neural Network (NN) is an example of a machine learning algorithm, i.e. an algorithm capable of learning from its mistakes and successes. Another such algorithm which you may be familiar with from other exercises is the Hidden Markov Model (HMM). The topic of this exercise will be to examine the predictive power of neural networks. Many people tend to view neural networks as black boxes. You pour data in from one side, and results pour out from the other - what happens in between is magic. While this exercise is unlikely to lift that veil for you, it will hopefully illustrate that neural networks can be useful tools under the right set of circumstances.

One of the most powerful features of a neural network is its ability to capture underlying correlations in the data. This ability allows neural networks to recognize specific patterns or combinations in the data; for example, the combination 'ATG' is of great significance in this exercise. A neural network could be trained to recognize when this pattern occurs in a sequence together with an increase in the number of 'A's, which would indicate a translation start in Arabidopsis thaliana; we are going to be looking for such starts in a moment. Neural networks are mathematical models which attempt to capture (albeit primitively) the way in which our brains handle pattern recognition. If you are still confused, try reading this sentence aloud: "Tear in eye, your dress will tear". Did you pronounce the word 'tear' differently the second time? If you did, you recognized that 'tear' meant a completely different thing the second time. (And if you didn't, this link is for you). In this case your brain analysed the underlying correlations in the sentence and found that 'tear' was first mentioned together with 'eye', signifying a drop of liquid. The second time, however, the word 'dress' occurred prior to 'tear', suggesting instead a rip in a piece of cloth. Since our brains are less adept at reading DNA than English, we often turn to artificial neural networks in bioinformatics.

(For more on neural networks please turn to your textbooks).

The problem - A hypothetical case story in Bioinformatics

Imagine that you are a newly employed bioinformatician in a small biotech company. The company is currently very keen on researching a particular protein on which they hope to make some profit. This protein, which is tentatively called PROT_X, was isolated from the plant Arabidopsis thaliana. In front of you, you currently have a DNA sequence, PROM_X, which the laboratory guys believe contains the beginning (the promoter region) of the gene encoding PROT_X. Unfortunately, they are unsure exactly where the translation start is, i.e. where the DNA begins encoding the amino acid sequence of PROT_X.

Locating translation starts in higher eukaryotes like Arabidopsis thaliana can be very difficult. In this case, matters are complicated because the lab guys believe that the protein contains a signal peptide - a small amino acid sequence which is located at the beginning of the protein sequence and which is removed before the protein is completely built. For this reason, the translation start cannot be determined by reverse-engineering from the amino acid sequence, because the bit of sequence which would reveal the location of the translation start was lost along with the signal peptide. Identification of correct translation starts is extremely important, not only for biological reasons, but also because patents might stand or fall based on the accuracy of the information provided in the application.

Because translation always starts at an 'ATG' codon, which encodes the amino acid methionine ('M'), we can limit the number of possibilities to include only the places where the three bases ATG occur sequentially. This, however, will still leave us with way too many possibilities. In this exercise we will use web-based neural-network prediction tools developed here at CBS to limit the number of possible translation starts, hopefully all the way down to only one possibility. For your convenience we have marked all ATGs in the sequence in bold.
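If you would rather let a script list the ATGs than spot them by eye, the search described above can be sketched in a few lines of Python. This is only an illustration and not part of the CBS tools; the function name find_atg_positions is made up for this example, and positions are reported 1-based on the assumption that this is how you will count along the sequence.

```python
def find_atg_positions(sequence: str) -> list[int]:
    """Return the 1-based position of every 'ATG' in a DNA sequence."""
    # Remove line breaks/spaces and normalise case so that a sequence
    # pasted straight from the page can be searched as one string.
    seq = "".join(sequence.split()).upper()
    positions = []
    start = seq.find("ATG")
    while start != -1:
        positions.append(start + 1)  # convert 0-based index to 1-based position
        start = seq.find("ATG", start + 1)
    return positions


# The first line of PROM_X, used here as a small example input.
first_line = "TTGAGGGGCCAGAGACCTCAGGAGGAGGAAGAAGAAGAAGGACGACATGGACGACACGGT"
print(find_atg_positions(first_line))
```

Running the same function on the whole of PROM_X gives the full list of candidates, which is the list the predictors are meant to narrow down.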

What exactly is a signal peptide?

Biological cells have compartments, i.e. small enclosed pockets which serve different and specific functions. For example, the mitochondria and chloroplasts of higher eukaryotes serve the cell as power plants, producing energy as needed. Production of proteins, however, is centralized at the ribosomes in the cytosol and on the endoplasmic reticulum. Signal peptides serve as address markers, ensuring that proteins intended to function in a specific compartment actually reach that compartment. As part of this process, though, the signal peptides are usually discarded and they can thus be difficult to isolate and identify experimentally. As a side note, signal peptides also exist in bacteria, where they typically govern things like exporting proteins out of the cell.



The DNA sequence PROM_X containing the promoter region and translation start of PROT_X

>PROM_X Gene promoter sequence, org:Arabidopsis thaliana
TTGAGGGGCCAGAGACCTCAGGAGGAGGAAGAAGAAGAAGGACGACATGGACGACACGGT
AATGGCTTAGAGGAGACCATCTGCAGCGCCAGGTGCACCGATAACCTCGATGACCCGTCT
CGTGCTGACGTGTACAAGCCACAGCTCGGTTACATCAGCACTCTCAACAGTTACGATCTC
CCCATTCTTCGCTTCATCCGTCTCTCAGCCCTCCGTGGATCTATCCGTCAAGTAAGTAAA
CATAAATATTATGTTACTATAACCTAGTAAAATATGCATGCCTCATGCATGTTAATATGT
CCATTTCTATATTTAAACATGACTCTGGAAACGTGTGTGGGTGTAGAACGCAATGGTGCT
TCCACAGTGGAACGCAAACGCGAACGCTATTCTTTACGAGAAAATCTATTTCCGATGAAG
ACATCAAAGAAGCAATTGAAGTGAAGTGTAAATTGTACGTGGTCGATTTTGTATACCTGG
TTCTTATCTCGATCAATTTATCCCCAAAAACCCTAAACACTTTCCCGAATAAATCCCTTT
ATAAAGAGCTTCACATAAATCAAGTGAGAAACCACAAAAGTAAGAAGATAAAAATGGCTC
GAGTCTCTTCTCTTCTTTCTTTCTGCTTAACACTTTTGATCCTTTTCCATGGCTACGCGG
CTCAACAGGGTCAGCAGGGTCAGCAGTTTCCGAACGAGTGCCAGCTCGACCAGCTCAATG
CGCTCGAGCCGTCACACGTACTGAAGAGCGAGGCTGGTCGCATCGAGGTGTGGGACCACC
ACGCTCCTCAGCTCCGTTGCTCAGGTGTCTCCTTTGCACGTTACATCATCGAGTCTAAGG
GTCTCTACTTGCCCTCTTTCTTTAACACCGCGAAGCTCTCTTTCGTGGCTAAGGGACGAG
GTCTTATGGGAAAAGTGATCCCTGGATGCGCCGAAACATTCCAAGACTCATCAGAGTTCC
AACCACGCTTCGAAGGTCAAGGTCAAAGCCAGAGGTTCCGTGACATGCACCAGAAAGTGG
AGCACATTAGGAGCGGTGATACCATTGCCACAACACCCGGTGTAGCACAGTG

(In case you are wondering, this is not a real A. thaliana sequence, rather it was constructed from pieces of A. thaliana DNA. The problem with using real sequences is that they present additional complications such as introns, which would make this exercise much harder to complete.)



Exercise Part I - Using NetStart

TIP: The student is advised to keep two browser windows open throughout the exercise. The cookbook (this page) should be loaded into one while the neural network predictor should appear in the other. The new browser should open automatically when you click on the link below. If it doesn't, right click on the link and select the appropriate item (should be something like 'open in new window').

NetStart is a neural network predictor developed specifically with translation start prediction in mind, so this would seem to be a good place to start. Open NetStart using the link below, switch the predictor to 'A. thaliana' mode using the radio button next to the 'submit' button, and cut and paste the DNA sequence given above into the appropriate field. Then click 'submit'. At the bottom of the output page from NetStart you should find a link to a page explaining how to interpret the output.

Click here to open NetStart

NOTE: When you run NetStart, make absolutely sure that you remember to switch to 'A. thaliana' mode using the radio button next to the 'submit' button.

Exercise Part II - Filtering our findings with SignalP

While you'll probably agree that NetStart certainly did limit the number of possible translation starts, it still leaves us with several candidates to choose from. Remembering that the protein is supposed to have a signal peptide, we could use SignalP, a signal peptide predictor, to narrow the list of candidate translation starts. In order to do this, however, it is first necessary to translate the DNA sequence into amino acids using each of the possible translation starts in turn. We have prepared a small translator for you. It's somewhat primitive, but it should get the job done.

Click here to open the DNA to amino acid translator

Simply cut and paste the desired sequence into the translator field, and click the 'submit sequence' button. It is important that you cut and paste the sequence only from the start codon you wish to test and onward. That way you will get the amino acid sequence corresponding to the selected translation start. This means that you will have to count characters in the sequence of PROM_X above until you find the start codons identified by NetStart. To make this somewhat tolerable, each line in the sequence is limited to 60 characters, and all ATGs are marked in bold. You will need to do this once for each translation start you believe to be possible.
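Counting characters by hand is error-prone, so the arithmetic can be sketched as follows. The two helper names (line_and_column and sequence_from_start) are hypothetical, and the sketch assumes that the positions you are working with are 1-based and that the sequence is laid out with 60 characters per line as described above. Check the NetStart output explanation to confirm how it numbers positions before relying on this.

```python
LINE_WIDTH = 60  # characters per sequence line, as stated in the exercise


def line_and_column(position: int, width: int = LINE_WIDTH) -> tuple[int, int]:
    """Convert a 1-based sequence position to a 1-based (line, column) pair."""
    if position < 1:
        raise ValueError("position must be 1 or greater")
    line, column = divmod(position - 1, width)
    return line + 1, column + 1


def sequence_from_start(sequence: str, position: int) -> str:
    """Return the sequence from a 1-based start codon position onward."""
    seq = "".join(sequence.split()).upper()  # drop line breaks, normalise case
    candidate = seq[position - 1:]
    if not candidate.startswith("ATG"):
        # Guard against off-by-one mistakes when copying positions by hand.
        raise ValueError(f"no ATG at position {position}")
    return candidate
```

For example, line_and_column(127) returns (3, 7), meaning the seventh character on the third line, and sequence_from_start(prom_x, 127) would return the text to paste into the translator, provided an ATG really does start at that position (prom_x here stands for a string holding the PROM_X sequence).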

TIP: The translator will terminate translation if it encounters a stop codon, much like what would happen in biology. Since protein sequences are typically fairly long, you can safely discard any translation start which results in fewer than fifty (50) amino acids.
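The behaviour described in the tip can be sketched in Python as shown below. This is not the code behind the CBS translator, whose internals are not described here; it is a minimal illustration that assumes the standard genetic code, reads codons from the first base of the pasted sequence, and uses made-up names (translate, is_plausible_protein, MIN_LENGTH).

```python
from itertools import product

# Standard genetic code, with codons ordered by the bases T, C, A, G.
# '*' marks the three stop codons (TAA, TAG, TGA).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

MIN_LENGTH = 50  # shortest translation worth keeping, per the tip above


def translate(dna: str) -> str:
    """Translate DNA from its first base, stopping at the first stop codon."""
    seq = "".join(dna.split()).upper()
    protein = []
    for i in range(0, len(seq) - 2, 3):
        # Codons containing anything other than A, C, G or T become 'X'.
        amino_acid = CODON_TABLE.get(seq[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def is_plausible_protein(protein: str, min_length: int = MIN_LENGTH) -> bool:
    """Keep only translations with at least min_length amino acids."""
    return len(protein) >= min_length
```

As a small check, translate("ATGGCTTAGAAA") gives "MA": ATG is read as M, GCT as A, and TAG ends the translation, so this candidate would be discarded by is_plausible_protein.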

Once you have translated a sequence you can test it with SignalP. The correct translation start should result in a protein with a clear signal peptide within the first fifty amino acids or less. Cut and paste the amino acid sequence into the appropriate field on the SignalP web-server:

Click here to open SignalP

Now click submit (using the default parameters is fine for our purpose). At the top of the output page you'll find a link to an explanation of the SignalP output. Pay special attention to the graphs at the bottom of the explanation page; the figure with the clear signal peptide shows how your SignalP result should look if you have the right translation start.