EXERCISE: GenBank
---------------------------
Answers by: Rasmus Wernersson (v18103)

Question 1:
-----------
a): Inspecting the FEATURE table of the entry reveals that two CDS regions are defined. As stated on the GenBank hand-out, "CDS" is the most stable definition of a protein-coding gene used in the GenBank format - sometimes "gene" will also be present, but CDS is more commonly used.

b): Columba livia (domestic pigeon)

c): The HEADER contains general information about the entry: organism, publication references, keywords, accession ID etc. The FEATURE table contains information that refers to coordinates in the DNA sequence - for example the definition of CDS regions.

Question 2:
-----------
a + b) The _entire_ "ORIGIN" block (all the DNA sequence) has been converted to FASTA format. Since the FEATURE table has been thrown away, we no longer have the coordinates for the genes. The genes are still "in there" somewhere, but we cannot find them without using external information.

Question 3:
-----------
The "join" statement defines how to extract the coding sequence from the entire length of DNA in the entry:

join(1104..1192,1306..1510,1614..1742)

This is basically a recipe stating that the three intervals should be pasted together - and we'll get the _protein coding_ part of the gene: the coding exons glued together. A complete CDS will always start with a START codon (e.g. ATG) and end with a STOP codon (e.g. TAA).

The gene thus contains three exons.

(Pedantic note: from a CDS definition we don't get any information about the UnTranslated Regions (UTRs) that are often found before and after the coding region in the mRNA.)
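The "paste the intervals together" recipe from Question 3 can be sketched in a few lines of Python. This is only an illustration, not how GenBank itself does the extraction: the function names (parse_join, extract_cds) and the toy sequence are made up for the example, and the sketch only handles a plain join of intervals on the forward strand (no "complement", no partial-location markers such as "<" or ">").

```python
import re


def parse_join(location):
    """Turn 'join(a..b,c..d,...)' into a list of (start, end) pairs."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def extract_cds(sequence, location):
    """Glue the listed intervals together.

    GenBank coordinates are 1-based and inclusive, so the interval
    start..end corresponds to the Python slice [start - 1:end].
    """
    return "".join(sequence[start - 1:end] for start, end in parse_join(location))


# Toy sequence (hypothetical): exons in upper case, everything else in lower case.
toy = "ccATGaaaTTTgggTAAcc"
print(extract_cds(toy, "join(3..5,9..11,15..17)"))  # ATGTTTTAA

# Exon lengths for the join statement from the entry in the exercise.
exons = parse_join("join(1104..1192,1306..1510,1614..1742)")
print([end - start + 1 for start, end in exons])  # [89, 205, 129]
```

The three exon lengths add up to 423 bases, i.e. a whole number of codons (141, including the STOP codon), which is a quick sanity check that the intervals were read correctly.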
Question 4:
-----------
First CDS in FASTA format:

>(gi|1943996:1104-1192, 1306-1510, 1614-1742) Columba livia DNA for alpha-D globin, alpha-A globin
ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTG
GAGCCGAGGCCCTGGAGAGGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTT
GCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAG
AGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACC
CTGTCAACTTCAAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTGGCCACACACCTGGGCAACGACTACAC
CCCGGAGGCACATGCTGCCTTCGACAAGTTCCTGTCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGA
TAA

The DNA sequence is ONLY the coding part of the original sequence - however, the original coordinates of the exons have been listed in the sequence header (the "name" of the sequence).

Question 5:
-----------
Searching "Nucleotide" for "insulin" returns 14068 results (Feb. 2009).

Question 6:
-----------
2757 - all of them from Human.

Question 7:
-----------
Now only 11 results are returned. Several of the entries _contain_ the insulin gene. I've chosen the following entry:

J00265 Human insulin gene, complete cds
gi|186429|gb|J00265.1|HUMINS01[186429]

I chose it because it ONLY contains the insulin gene and is well annotated with lots of literature references.

Question 8:
-----------
IMPORTANT NOTE: "preproinsulin" is the full-length "version" of insulin before it gets processed and cleaved into two chains. This means preproinsulin is a perfectly good entry for "insulin".

38 entries: Insulin[keyword] NOT (insulin-like OR part OR partial)
37 entries: Insulin[keyword] NOT (insulin-like OR part OR partial OR "growth factor")
26 entries: Insulin[keyword] NOT (insulin-like OR part OR partial OR "growth factor" OR synthetic)
19 entries: Insulin[keyword] NOT (insulin-like OR part OR partial OR "growth factor" OR synthetic OR flanking)

Notice that we actually lose one "good" entry by including "flanking" in the kill-word list.
The important part is to bring the number of hits down to a level where manual inspection is possible.

Question 9:
-----------

1) "Find the Rat and Mouse Insulin gene"
-------------------------------------
It's a good idea to separate the two logical parts of the search string:

One for narrowing down the species: "(rat[ORGANISM] OR mouse[ORGANISM])"
And one for actually searching for insulin: "insulin[KEYWORD]"

They can then be AND'ed together:

(rat[ORGANISM] OR mouse[ORGANISM]) AND insulin[KEYWORD]

This gives 11 hits. By manual inspection of the results, I then pick the following entries:

V01242 - Rat gene for insulin (hormone)
V01243 - Rat gene for insulin 2
X04724 - Mouse preproinsulin gene II
X04725 - Mouse preproinsulin gene I

2) "Find the alcohol-dehydrogenase gene from as many organisms as possible."
-------------------------------------------------------------------------
It will never be possible to do this query perfectly - a good attempt could be:

"alcohol dehydrogenase"[keyword] NOT (hypothetical OR partial OR pseudogene OR transposable)

This yields 390 hits - virtually all of them are from the correct gene, but some are still partial or otherwise broken. They'll need to be manually curated.

3) "Find the alpha-globin gene from Capra hircus"
----------------------------------------------
"Capra hircus"[ORGANISM] AND "alpha globin"

This gives 4 hits - I pick the following as the ones I want:

J00044 - Goat adult alpha-ii-globin gene, complete sequence
J00043 - Goat adult alpha-i-globin gene, complete sequence

4) "Find the alpha-globin gene from all ruminants"
-----------------------------------------------
From Tree of Life we find that the ruminants (Danish: "Drøvtyggere") are contained in the taxon "Ruminantia". Since we can search at any level of the taxonomy in the ORGANISM field, we can use this:

Ruminantia[ORGANISM] AND "alpha globin"

This yields 20 hits (which will need a bit of clean-up).
5) "Find the NORMAL p53 gene from human"
-------------------------------------
The main problem is that P53/TP53 is mentioned in an enormous number of entries - especially in the literature references.

2358 hits: (TP53 OR P53) AND "Human"[ORGANISM]

By inspecting "S66666" we can see that P53 is a "Tumor suppressor" - we can use this to narrow down the search space a bit:

849 hits: (TP53 OR P53) AND "Human"[ORGANISM] AND Tumor
353 hits: "Tumor suppressor" AND human[ORGANISM] AND (TP53 OR P53)

IMPORTANT NOTICE: After a bit of inspection of the results, "Tumor suppressor" appears to be a bit too specific - sometimes P53 is simply listed as a "tumor protein" - we'll therefore have to make do with "Tumor" as the search term.

On the first page of results is the following entry, which is clearly the mRNA version of the gene:

NM_000546 - Homo sapiens tumor protein p53 (TP53), transcript variant 1, mRNA

We can then try to narrow down the search a bit more:

70 hits: Tumor AND "complete cds" AND human[ORGANISM] AND (TP53 OR P53)

There are several good entries here - for example:

AY904026 - Homo sapiens tumor protein p53 binding protein, 1 (TP53BP1) gene, complete cds

If we want to get rid of the mRNA entries, we have to be a bit careful since "mRNA" is mentioned in many places. We can restrict the search to the TITLE field:

17 hits: Tumor AND "complete cds" AND human[ORGANISM] AND (TP53 OR P53) NOT mRNA[TITLE]

In this list we can easily find an entry containing the full-length p53 (for example AY904026 mentioned above) - remember to INSPECT the results, as some are bogus and only contain some of the exons.
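The search strings in Questions 8 and 9 all follow the same pattern: a set of terms that must be present (AND'ed together), followed by an optional list of kill-words (OR'ed together inside a NOT). A small sketch of how such strings could be assembled is shown here. The helper name build_query is made up for this example and is not part of any NCBI tool; the sketch only builds the text of the query and does not contact GenBank.

```python
def build_query(required, excluded=()):
    """AND together the required terms, then NOT the OR'ed kill-words.

    Each term is used as-is, so field tags such as [ORGANISM] and quotes
    around phrases must already be part of the term.
    """
    query = " AND ".join(required)
    if excluded:
        query += " NOT (" + " OR ".join(excluded) + ")"
    return query


# The first insulin query from Question 8.
print(build_query(["Insulin[keyword]"], ["insulin-like", "part", "partial"]))
# Insulin[keyword] NOT (insulin-like OR part OR partial)

# The rat/mouse query from Question 9, part 1 (no kill-words).
print(build_query(["(rat[ORGANISM] OR mouse[ORGANISM])", "insulin[KEYWORD]"]))
# (rat[ORGANISM] OR mouse[ORGANISM]) AND insulin[KEYWORD]
```

Keeping the kill-words in a separate list makes it easy to add one word at a time and note how the hit count changes, which is how the 38, 37, 26 and 19 entry counts in Question 8 were reached.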