
Exercise in Eukaryotic Gene Prediction



Overview
In this exercise you will:
  • Predict possible exon structures in two eukaryotic genomic sequences
  • Use gene finding programs Genscan, HMMgene and NetGene2
  • Evaluate the exon prediction scores
  • Evaluate splice site scores
  • Evaluate coding region potential
Evaluation
Please answer all questions marked Qx, where x is a number. Onsite students should fill in the forms provided in class and hand them in to the lecturer by the end of the exercise. Online students should copy the form provided at the end of this webpage into an e-mail, fill in the answers, and mail them to the course responsible (Anders Gorm Pedersen at gorm@cbs.dtu.dk).

Exercise flow
We will take a look at two genomic sequences (both shown at the bottom of this page), starting with Sequence #1.
Do all analyses on sequence #1 before you start on sequence #2.
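Before pasting a sequence into a server, it can help to check that you copied the whole FASTA record. The sketch below is optional and not part of the exercise; the function names `parse_fasta` and `non_acgt` and the toy record are made up for illustration. It reads a single FASTA record and lists any characters other than A, C, G and T, such as IUPAC ambiguity codes.

```python
def parse_fasta(text):
    """Parse a single-record FASTA string into (header, sequence)."""
    header = None
    chunks = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                raise ValueError("expected exactly one FASTA record")
            header = line[1:]
        else:
            chunks.append(line.upper())
    if header is None:
        raise ValueError("no FASTA header line (starting with '>') found")
    return header, "".join(chunks)


def non_acgt(seq):
    """Map 1-based position -> character for everything that is not A, C, G or T."""
    return {i + 1: base for i, base in enumerate(seq) if base not in "ACGT"}


# Toy record, NOT one of the exercise sequences.
header, seq = parse_fasta(">toy example\nACGT\nACRT\n")
print(header, len(seq), non_acgt(seq))
```

Positions are reported 1-based so that they line up with the coordinates printed by the prediction servers.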

Running Genscan
  • Go to the Genscan server at: http://genes.mit.edu/GENSCAN.html
  • Copy-and-paste sequence #1 (including the FASTA header line starting with >gi....) into the DNA field
  • Press Run Genscan (use default options)
  • Read the explanation at the bottom of the screen and answer the following questions:
    • Q1. How many exons are predicted in Sequence #1 ?
    • Q2. What are the begin and end positions ?
    • Q3. For the possible exons, note the probability of each
    • Q4. On which strand (+ or -) is the gene located?
    • Q5. Write down the first 6 amino acids and the total length of the predicted protein sequence
    • Click Show PDF to view a graphical representation (skip this step if your machine doesn’t have Acrobat Reader)
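As a cross-check for Q2 and Q5, the protein length can be derived from the exon coordinates: each exon contributes (end - begin + 1) nucleotides, and three coding nucleotides make one codon. The sketch below assumes that the last predicted exon includes the stop codon; check this against the server's output explanation and set `includes_stop=False` if it does not. The function name and the coordinates are hypothetical and are not the answer for either sequence.

```python
def protein_length(exons, includes_stop=True):
    """Protein length (amino acids) from 1-based, inclusive exon coordinates.

    exons: list of (begin, end) tuples on the same strand.
    includes_stop: assumption that the final exon contains the stop codon.
    """
    coding_nt = sum(end - begin + 1 for begin, end in exons)
    if coding_nt % 3 != 0:
        raise ValueError(f"coding length {coding_nt} is not a multiple of 3")
    codons = coding_nt // 3
    return codons - 1 if includes_stop else codons


# Hypothetical coordinates for illustration only.
example_exons = [(101, 250), (401, 553)]
print(protein_length(example_exons))
```

If the function raises an error for coordinates you copied from a prediction, that usually means one of the begin or end positions was mistyped.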

Running HMMgene
  • Go to the HMMgene server at: http://www.cbs.dtu.dk/services/HMMgene/
  • Copy-and-paste sequence #1 (including the header line starting with >gi....) into the field named Sequence(s) in FASTA format
  • Press Submit sequence (use default options)
    Important! - Wait for prediction to finish

  • The link named Explanation of output format will take you to a HELP/DOCUMENTATION page that will explain the output format
    (This is NOT the prediction on your sequence)
  • Go back to the prediction page and answer the following questions:
    • Q6. How many exons are predicted in Sequence #1 ?
    • Q7. What are the begin and end positions ?
    • Q8. For the possible exons, note the probability of each
    • Q9. On which strand (+ or -) is the gene located?
    • Q10. Compare the exon-intron boundaries with those obtained by Genscan. Do they agree for all exons?
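For Q10, the comparison can be done by eye, but with several exons it is easy to miss a boundary that is off by a few nucleotides. A minimal sketch, assuming you have typed both predictions in as lists of (begin, end) pairs; the function name and all coordinates below are made up for illustration.

```python
def compare_exons(pred_a, pred_b):
    """Compare two exon predictions given as lists of (begin, end) tuples."""
    set_a, set_b = set(pred_a), set(pred_b)
    return {
        "shared": sorted(set_a & set_b),   # identical begin AND end
        "only_a": sorted(set_a - set_b),   # predicted only by the first program
        "only_b": sorted(set_b - set_a),   # predicted only by the second program
    }


# Hypothetical predictions, not the real server output.
genscan = [(101, 250), (401, 553)]
hmmgene = [(95, 250), (401, 553)]
print(compare_exons(genscan, hmmgene))
```

An exon counts as shared only when both boundaries are identical, so an exon whose start differs between the programs appears once under each of the "only" keys.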

Running NetGene2
  • NetGene2 predicts potential donor and acceptor splice sites as well as protein-coding potential. It does not predict a complete exon-intron gene structure.
  • Go to the NetGene2 server at: http://www.cbs.dtu.dk/services/NetGene2/
  • Copy-and-paste sequence #1 (starting with the header line >gi....) into the field named Sequence
  • Press Send file (use default selection of human) and wait for prediction to finish.
  • Scroll down to Donor splice sites, direct strand
    (Direct = + strand; do not look at the predictions for the complement (-) strand in this exercise)
  • NetGene2 presents you with scores for many potential donor and acceptor splice sites.
    • Consult your results obtained using Genscan and HMMgene
    • Q11. Based on the predictions from Genscan/HMMgene, at which position do you expect to find a donor splice site?
    • Q12. If NetGene2 predicts a donor splice site at this position, what is the confidence score?
    • Q13. If NetGene2 predicts a donor splice site at this position, write down the 3 nucleotides on either side of the splice site
    • Scroll down to Acceptor splice sites, direct strand
    • Consult your results obtained using Genscan and HMMgene
    • Q14. Based on the predictions from Genscan/HMMgene, at which position do you expect to find an acceptor splice site?
    • Q15. If NetGene2 predicts an acceptor splice site at this position, what is the confidence score?
    • Q16. If NetGene2 predicts an acceptor splice site at this position, write down the 3 nucleotides on either side of the splice site
    • Scroll down to the Graphical Output from NetGene 2
    • Look at the upper panel "Coding" of the "Direct Strand Graphics"
    • Q17. At the position of the expected donor site (Q11) - do you see the graphical coding potential change from "Low to High" or from "High to Low" ?
    • Q18. At the position of the expected acceptor site (Q14) - do you see the graphical coding potential change from "Low to High" or from "High to Low" ?
    • Q19. What is the coding potential in the region between the donor site position (Q11) and the acceptor site position (Q14)? "Generally High" or "Generally Low"?
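For Q13 and Q16, the nucleotides on either side of a splice site can also be read directly from the sequence. In the sketch below the boundary is given as the 1-based position of the last nucleotide to the left of the splice site; this is a convention chosen here, so compare it with how the server reports positions before relying on it. Introns typically begin with GT at the donor site and end with AG at the acceptor site. The function name and the toy sequence are invented for illustration.

```python
def flank(seq, boundary, n=3):
    """Return (left, right): n nucleotides on each side of a splice boundary.

    boundary: 1-based position of the last nucleotide LEFT of the boundary
    (an assumption of this sketch, not necessarily the server's convention).
    """
    left = seq[max(0, boundary - n):boundary]
    right = seq[boundary:boundary + n]
    return left, right


# Toy gene built from parts so the boundaries are known by construction.
exon1, intron, exon2 = "CCAAG", "GTAAGTTTTTTCAG", "GCT"
toy = exon1 + intron + exon2

donor = flank(toy, len(exon1))                   # exon | intron
acceptor = flank(toy, len(exon1) + len(intron))  # intron | exon
print(donor, acceptor)
```

In the toy example the donor flank ends in exon sequence and continues with the GT that opens the intron, while the acceptor flank ends with the AG that closes the intron.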

Repeat the above sections for Sequence #2 and record your answers (not all questions can be answered for all prediction types).

Check results with GenBank entries for both sequences (see links at bottom of page)
(Look in ‘Features’ section under ‘CDS’ to find official annotations of exons)
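In a GenBank entry, the CDS feature gives the exon coordinates as a location string such as `join(10..99,200..310)`. The sketch below turns such a string into (begin, end) pairs that can be compared with the predictions. It handles only simple forward-strand locations, and the function name and the example location are made up; they are not the annotation of either exercise sequence.

```python
import re


def parse_cds_location(location):
    """Extract (begin, end) pairs from a simple GenBank location string.

    Handles forms like '10..99' and 'join(10..99,200..310)'. Partial-end
    markers ('<', '>') are ignored; complement() is not handled here.
    """
    if "complement" in location:
        raise ValueError("complement locations are not handled in this sketch")
    pairs = re.findall(r"[<>]?(\d+)\.\.[<>]?(\d+)", location)
    return [(int(begin), int(end)) for begin, end in pairs]


# Made-up location string for illustration.
print(parse_cds_location("join(10..99,200..310)"))
```

The resulting list has the same shape as the exon lists you wrote down for Genscan and HMMgene, so the three can be compared side by side.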

After finishing analysis of Sequence #1 and #2, answer the following questions
  • Q20. Did Genscan and HMMgene predict the same exons in Seq#1?
  • Q21. Did Genscan and HMMgene predict the same exons in Seq#2?

  • B1-Bonus/optional. If not, why do you think there is a difference? Is there something unusual about the predictions that differ?
  • If there is a difference, take a look at the graphical coding region prediction of NetGene2 for the Sequence where you observed differences between the predictions
  • B2-Bonus/optional. Does the NetGene coding potential for the region in question indicate that this is a clear exon or a clear intron region?
  • B3-Bonus/optional. Do you think exon prediction is dependent on the exon length?
  • B4-Bonus/optional. Use Genscan to predict on Seq.#2 again. This time change the parameter "Suboptimal exon cutoff (optional):" before you "Run Genscan". What is the highest value at which you get a predicted suboptimal first exon corresponding to your findings with HMMgene?
  • B5-Bonus/optional. Before exon 1 in both genes, which type of splice site (if any) do you expect? Explain your answer.

That's it - You are done with the exercise!


  • OPTIONAL (only if time permits): Using the Ensembl server, try to find out the chromosomal location of the mouse orthologue of human Hox-A10 (seq #1).
    • Start here: http://www.ensembl.org/Homo_sapiens/index.html
    • Search with the accession number from human Hox-A10 (AF040714)
    • Select matching Ensembl gene
    • At the "Ensembl gene report", scroll down to "Orthologue prediction"
    • Select the mouse orthologue gene
    • Look at "genomic location" to find which chromosome mouse Hox-A10 is located on
    • Take a look at the other options or repeat the above procedure for Seq #2

    Evaluation form for online students - Exercise in Gene Prediction
    Online students should copy the form below to an e-mail, fill in the answers and mail these to the course responsible (Anders Gorm Pedersen at gorm@cbs.dtu.dk)

    Name:
    Student number:

    Sequence #1
  • Q1:
  • Q2:
  • Q3:
  • Q4:
  • Q5:
  • Q6:
  • Q7:
  • Q8:
  • Q9:
  • Q10:
  • Q11:
  • Q12:
  • Q13:
  • Q14:
  • Q15:
  • Q16:
  • Q17:
  • Q18:
  • Q19:
    Sequence #2
  • Q1:
  • Q2:
  • Q3:
  • Q4:
  • Q5:
  • Q6:
  • Q7:
  • Q8:
  • Q9:
  • Q10:
  • Q11:
  • Q12:
  • Q13:
  • Q14:
  • Q15:
  • Q16:
  • Q17:
  • Q18:
  • Q19:
    General
  • Q20:
  • Q21:
  • B1-Bonus/optional.
  • B2-Bonus/optional.
  • B3-Bonus/optional.
  • B4-Bonus/optional.
  • B5-Bonus/optional.


    Sequence #1
    >gi|2789671|gb|AF040714.1|AF040714 Homo sapiens 
    ATGCCAGGCCCCCCACCAGCCACGTTGGGGCAGCCCCCACAGCTCCCGGCCTTCGGGCCAAGGTGTCGGG
    GTGCGTCTCCTGGCCCATCAATACAGATTACATATTTATATCAATCGCGGGCTCTGAGGGCGCCCTCGGA
    GAGCGGCCCCGCGCCTACGAAACCAAACTGGGAGTGGTCGCGCGGAAACTCTGGCTCGGGATTGGCTGCG
    GGCGCCCGCCGCGGTGCGGGGGGATTGCTAATCGTATTCAGCATGTTTTGCACAAGAAATGTCAGCCAGA
    AAGGGCTATCTGCTCCCTTCGCCAAATTATCCCACAACAATGTCATGCTCGGAGAGCCCCGCCGCGAACT
    CTTTTTTGGTCGACTCGCTCATCAGCTCGGGCAGAGGCGAGGCAGGCGGCGGTGGTGGTGGCGCGGGGGG
    CGGCGGCGGTGGCGGTTACTACGCCCACGGCGGGGTCTACCTGCCGCCCGCCGCCGACCTGCCATACGGG
    CTGCAGAGCTGCGGGCTCTTCCCCACGCTGGGCGGCAAGCGCAATGAGGCAGCGTCGCCGGGCAGCGGTG
    GCGGTGGCGGGGGTCTAGGTCCCGGGGCGCACGGCTACGGGCCCTCGCCCATAGACCTGTGGCTAGACGC
    GCCCCGGTCTTGCCGGATGGAGCCGCCTGACGGGCCGCCGCCGCCGCCCCAGCAGCAGCCGCCGCCCCCG
    CCGCAACCACCCCAGCCAGCGCCGCAGGCCACCTCGTGCTCTTTCGCGCAGAACATCAAAGAAGAGAGCT
    CCTACTGCCTCTACGACTCGGCGGACAAATGCCCCAAAGTCTCGGCCACCGCCGCCGAACTGGCTCCCTT
    CCCGCGGGGCCCGCCGCCCGACGGCTGCGCCCTGGGCACCTCCAGCGGGGTGCCAGTGCCTGGCTACTTC
    CGCCTTTCTCAGGCCTACGGCACCGCCAAGGGCTATGGCAGCGGCGGCGGCGGCGCGCAGCAACTCGGGG
    CTGGCCCGTTCCCCGCGCAGCCCCCGGGGCGCGGTTTCGATCTCCCGCCCGCGCTAGCCTCCGGCTCGGC
    CGATGCGGCCCGGAAGGAGCGAGCCCTCGATTCGCCGCCGCCCCCCACGCTGGCTTGCGGCAGCGGCGGG
    GGCTCGCAGGGCGACGAGGAGGCGCACGCGTCGTCCTCGGCCGCGGAGGAGCTCTCCCCGGCCCCTTCCG
    AGAGCAGCAAAGCCTCGCCGGAGAAGGATTCCCTGGGTAAGCAGGGCTGCAGAGGGCTGCAGTCAGGCGG
    GCAGACAGGCAGACACAAGGAGGAGAAGGATCAGAAAACTAGGAGCCCGCGCAGCAGCCGGCCGGCCTTG
    GCCCAAGCTGCAGGCAGGCTGACCTTGTGAACTTGCTTTTTAATATTTGGGCGTGGGGGCGCAGTAAAAT
    TCATGTCCGGCTTAGCGCCCCACAGCAAGACGTCCTCGGCGCTGGCCTCAGCTCCCCCTGACTAGGGACG
    AGGACACCAGCGAGCAGGCCCCCTCCTGTGCGCTCTTTCCTGTGGCCGGGAGGACCCAGAGCCCTGGTCC
    CTGCCCAGCCTGCGCGGCGCGGCCCACGCGGGGGGAGGGGGAGGGAGGGAAAGTAGCTCGCCCGCAGATA
    GCGCGGATGTTTGTAAGGCATCCAAAATAAGCAGCCGCCAGCGCCAATAAATAAGCCCATTAACCGGCGA
    AGTTCGAGTGTACGATCCCCCATGCTTTTTTCAAAGTTGCTGAGGGGCGGGAATCTTCGTGGCGGGAAGA
    AGAAAAGGCAAATCCGGCCTGGAAGCGGGGGGCCCTGAGCTGAGAGCCAGAGAAGGGCCATTTCCCTTCC
    CCTGGACCTCGGAATCGCCCAGCTATGTATCCTGGCTCCTGGAGAAACTTGAGGGAGGGCCCTTGACCCC
    CGAATCGGTTTTTCCTGCCTTCCCCATTGGACCAATGATGCCCTTCTTTCTCCCCTTATCGAGTCTTGGG
    CAATCAGGGCCCTGGGGTGAGACAGCCAAGCTGCCTGGCCCATCTTCCAAGTAAGCACCCCGCGCTCCTA
    GCCTGGGGGCTACAGGAAATGCTTGTCTGCCATATGGCAAGAGGCAAAGAAAAGCGTTAAGTTCAAGATG
    TACAGCCTGCCCTCCCAGGCCTTTCCTTCTGCAAGCATCTACGGCTTAGCGCTAAAACAGGTGTTTGGAA
    AAGTGGGGGAAATGTAAATTGGAAGGGTCATGTAGATTGAAGGCCCACTCAATTTTTGTCATGACTTATG
    GAGGAACTGCTTGCTCTCAGCAAGCCAAAAACGGGGGCACGACTCTCTTCTCTGTGACTTGGGACATCTC
    TCTTATGGGAGAAACGGAGGCAATTCACCCCCGCGGGCAGCCCGTGTGGCCTCGACTTAATCATCCCCTC
    TTTATTCTCTTACATGCCAGGCAATTCCAAAGGTGAAAACGCAGCCAACTGGCTCACGGCAAAGAGTGGT
    CGGAAGAAGCGCTGCCCCTACACGAAGCACCAGACACTGGAGCTGGAGAAGGAGTTTCTGTTCAATATGT
    ACCTTACTCGAGAGCGGCGCCTAGAGATTAGCCGCAGCGTCCACCTCACGGACAGACAAGTGAAAATCTG
    GTTTCAGAACCGCAGGATGAAACTGAAGAAAATGAATCGAGAAAACCGGATCCGGGAGCTCACAGCCAAC
    TTTAATTTTTCCTGATGAATCTCCAGGCGAC
    


    Sequence #2
    >gi|2739430|gb|U70368.1|MMU70368 Mus musculus hematopoietic-specific IL-2 deubiquitinating enzyme (DUB-2) gene, complete cds
    GGAAGGAAAACCAGACCTAGGCTGCTTATACTGGTTCTGTGTGGTTAGCAAGGTAACAGAAACTCTTGTA
    TGGCATGTGTAGTCATCTATTTGACATGATTTTGTAACTTTATTCCAAGTAAAACCCAAGCTTAAGACAC
    CTAGGAAATTGGAGCTAAATTCAGGGAAATGCACTCCAATAATGTGACATTTCTGAGCTGCTTTGCAGAA
    ACCACACCCAAATTGGGAGAAGCTTGTCTGGGATTGGCTGTCCTTGGAAGACTGTAGGCGTGGTCACAAG
    ACTGGAGTATAAAAGACTGAGCATTTGTCCTCACTTGCAGAGATTCTCTGGAGGGAAAGACTTCCTTCTG
    CTCCCTTAGAAGACTCCAGCAAGTTATTTGAAGAGGTCTTTGGAGACATGGTGGTTTCTCTTTCCTTCCC
    AGAAGGTAAGTCTCACTGTAAGGTCTTTATGTCTTGTGTGTCCCCCAGCAGCCTTGTCATCTCCGGCTGC
    CCTAGACCTGCATAAGGACAGATTGAGTGTGCTGGGATAGACTTTTGTTGACAAAGGGGCTGCTCTGCCC
    TTCTAAGAGGTTGAGTCTCATCATAAGGCCTTTTGCAGCTTGCATGTGTAGTGCCAGGAAAGAGTAGTCA
    TCCCCCAAAACCAGACAGGAACTGACGAGATGCAATCACTGTGTGGACTTTTTACCAGCTAGCTAGGGCA
    CTACCATGAGCCACTGTCTAGCAGGGAGGCTTTGGGGATGGTGTGCCCCGAATATCTCTCAGGGTAAGAG
    TTTACAGTAAGCAGCAAGCAGAGGGGTGTGGGTGAGTGTGCAAGTATCTAATTGGCTAGTTTTTGTGGCC
    TGTAACATATTGGTGGGTGTTGGGAGTCATAAGCTAAATGTTTGCTTTCCTCTGCATTGGTGGTCATTAG
    GGAGGGGGCAGATTATGAACCTAGGTTGCAGATCTGTTGGAGTAATAACAAGACACTGGTCTTGTTGGGG
    GTATAACCTAGAGACTCGATTTATGTTCATGTTTGGTTTGGGATGGGTTTTATGTGAGTGTTTTCTTTTT
    TGGGGAGGGGGTCGGTTAACTTGGAAAGTAATGCTAGGTACTGTCCTGTTCATTTCCCTGAGGTGAAAGT
    TAGGTCAGGTTTTCTAGAATGGAGTCTGAAGGTAAAARATTTGGCCACTGGCATGCCCTAAAGTCTTTTT
    GTGTTCTTGTCCCCTAGCAGATCCAGCCCTATCATCTCCTGGTGCCCAACAGCTGCATCAGGATGAAGCT
    CAGGTAGTGGTGGAGCTAACTGCCAATGACAAGCCCAGTCTGAGTTGGGAATGTCCCCAAGGACCAGGAT
    GCGGGCTTCAGAACACAGGCAACAGCTGCTACCTGAATGCAGCCCTGCAGTGCTTGACACACACACCACC
    TCTAGCTGACTACATGCTGTCCCAGGAGTACAGTCAAACCTGTTGTTCCCCAGAAGGCTGTAAGATGTGT
    GCTATGGAAGCCCATGTAACCCAGAGTCTCCTGCACTCTCACTCGGGGGATGTCATGAAGCCCTCCCAGA
    TTTTGACCTCTGCCTTCCACAAGCACCAGCAGGAAGATGCCCATGAGTTTCTCATGTTCACCTTGGAAAC
    AATGCATGAATCCTGCCTTCAAGTGCACAGACAATCAGAACCCACCTCTGAGGACAGCTCACCCATTCAT
    GACATATTTGGAGGCTTGTGGAGGTCTCAGATCAAGTGTCTCCATTGCCAGGGTACCTCAGATACATATG
    ATCGCTTCCTGGATGTCCCCCTGGATATCAGCTCAGCTCAGAGTGTAAATCAAGCCTTGTGGGATACAGA
    GAAGTCAGAAGAGCTACGTGGAGAGAATGCCTACTACTGTGGTAGGTGTAGACAGAAGATGCCAGCTTCC
    AAGACCCTGCATATTCATAGTGCCCCAAAGGTACTCCTGCTAGTGTTAAAGCGCTTCTCGGCCTTCATGG
    GTAACAAGTTGGACAGAAAAGTAAGCTACCCAGAGTTCCTTGACCTGAAGCCATACCTGTCCCAGCCTAC
    TGGAGGACCTTTGCCTTATGCCCTCTATGCTGTCCTGGTCCATGAAGGTGCGACTTGTCACAGTGGACAT
    TACTTCTCTTATGTCAAAGCCAGACATGGGGCATGGTACAAGATGGATGATACTAAGGTCACCAGCTGCG
    ATGTGACTTCTGTCCTGAATGAGAATGCCTATGTGCTCTTCTATGTGCAGCAGACTGACCTCAAACAGGT
    CAGTATTGACATGCCAGAGGGCAGAGTACATGAGGTTCTCGACCCTGAATACCAGCTGAAGAAATCCCGG
    AGAAAAAAGCATAAGAAGAAAAGCCCTTGCACAGAAGATGCGGGAGAGCCCTGCAAAAACAGGGAGAAGA
    GAGCAACCAAAGAAACCTCCTTAGGGGAGGGGAAAGTGCYTCAGGAAAAGAACCACAAGAAAGCTGGGCA
    GAAACATGAGAATACCAAACTTGTGCCTCAGGAACAGAACCACCAGAAACTTGGGCAGAAACACAGGATC
    AATGAAATCTTGCCTCAGGAACAGAACCACCAGAAAGCTGGGCAGAGCCTCAGGAACACGGAAGGTGAAC
    TTGATCTGCCTGCTGATGCAATTGTGATTCACCTGCTCAGATCCACAGAAAACTGGGGCAGGGATGCTCC
    AGACAAGGAGAATCAACCCTGGCACAATGCTGACAGGCTCCTCACCTCTCAGGACCCTGTGAACACTGGG
    CAGCTCTGTAGACAGGAAGGAAGACGAAGATCAAAGAAGGGGAAGAACAAGAACAAGCAAGGGCAGAGGC
    TTCTGCTTGTTTGCTAGTGTTCACTCACCCACTCACACAGGCTCCTGTGGACACCCTGCCAACCCAAGGT
    GCCTGGAACAAGAGGTTTGGACCTCTGTCCCAGGCAGGGACAATGCCTCACCCTTCATGTGGGGTCCACC
    TATCCTCTGGGCCCTTGCCTGTTTTTACTGACTGACTCTCTGAGAATGGTCATTTGAATGTGGAAAAAAA
    ATGCCCAGGGTGTTGCTACAGGTTAAAGACAGGAAAGCTGGACAGTCAGGGGAGGTCTGCATAGCCTCTC
    CTGCAACTCATGGGATCTGAGTAGCGTAGAGACTAAATCACCACACTGGAGCTTTCTTTACTTTGCTTTC
    CTTTTTTTTTAATTTATTTTTTGTTATTAGATATTTTCTTTATTTACATTTCAAATGCTATCCCAAAAGT
    TCCCTATACCCTCCCCCCCCCCGCTAACCTACCCACCCA
     
    GenBank entry for Sequence #1

    GenBank entry for Sequence #2