Exam in course 27013, May 28, 2003

Perl and Unix for Bioinformaticians


Preface:


Trivial syntax errors do not count against you. However, using functions or language structures (loops, conditional statements) in a nonsensical manner does. Appendix 6 contains a short reminder (not complete coverage) of Perl structure and functions.

If you want to refer to the data in the appendices as files, then use filenames "appendix1", "appendix2" and so forth.

You may answer the questions in Danish or English.


Assignment 1 (50%):


During your research on ion channels you stumble upon the SwissProt entry (CIQ3_HUMAN) in appendix 1. You notice in the feature table (FT) that several variants and mutations of this gene exist. You want to take a closer look at this and decide that the first step is to extract the original amino acid sequence and all variations thereof (the full sequence with the appropriate amino acid changed) and put the result in a FASTA file (see appendix 2 for the FASTA file format). Since you are probably going to do this on a lot of SwissProt entries, you decide to write a program in Perl (surprise).


  1. Describe which keywords/patterns you will look for when parsing the file for the variants and the sequence.


  2. Describe a method to extract the sequences. You can use pseudo code, a diagram or whatever you find suitable in your description.


  3. Implement your method in Perl (on paper).


  4. What kind of error checking could/should you include in your program? Name every check that is relevant to the task, not every check that is possible to make.


  5. In what way could you generalize or extend the program?


Assignment 2 (50%):


Earlier in your career you made a splendid program that calculates various scores based on amino acid sequence features. The output from your program can be seen in appendix 3. Each line consists of an accession number followed by 6 numbers between 0 and 1 (tab separated). You want to find the accession numbers with the highest and lowest average scores (average of the 6 numbers). However, you want to exclude any genes on your negative list from your calculations. These genes are listed as SwissProt IDs in appendix 4. Since GenBank accession numbers and SwissProt IDs are not identical, you need to translate between them in order to solve your problem. Fortunately you have a file that does just that (see appendix 5), where the first item on each line is a SwissProt ID, the second item is irrelevant, and the third is the corresponding GenBank accession number.



a) Describe a method to find the data. You can use pseudo code, a diagram or whatever you find suitable in your description.


b) Implement your method in Perl (on paper).


c) Have you made any assumptions about the data in your algorithm? Which? Why? Are they reasonable assumptions (explain)? Could/should you do away with them (by changing the code)?


d) Usually, when you have this kind of problem, you want the 10 highest and 10 lowest average scores, not just the top and bottom average score. How would you solve this problem? Will it change any of the assumptions in c)?












Appendix 1



ID CIQ3_HUMAN STANDARD; PRT; 872 AA.

AC O43525;

DT 15-JUL-1999 (Rel. 38, Created)

DT 15-JUL-1999 (Rel. 38, Last sequence update)

DT 28-FEB-2003 (Rel. 41, Last annotation update)

DE Potassium voltage-gated channel subfamily KQT member 3 (Potassium

DE channel KQT-like 3).

GN KCNQ3.

OS Homo sapiens (Human).

OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;

OC Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.

OX NCBI_TaxID=9606;

RN [1]

RP SEQUENCE FROM N.A., AND MUTAGENESIS OF GLY-310 AND GLY-318.

RC TISSUE=Brain;

RX MEDLINE=99087323; PubMed=9872318;

RA Schroeder B.C., Kubisch C., Stein V., Jentsch T.J.;

RT "Moderate loss of function of cyclic-AMP-modulated KCNQ2/KCNQ3 K+

RT channels causes epilepsy.";

RL Nature 396:687-690(1998).

RN [2]

RP VARIANT BFNC2 ARG-309.

RX MEDLINE=20309392; PubMed=10852552;

RA Hirose S., Zenri F., Akiyoshi H., Fukuma G., Iwata H., Inoue T.,

RA Yonetani M., Tsutsumi M., Muranaka H., Kurokawa T., Hanai T., Wada K.,

RA Kaneko S., Mitsudome A.;

RT "A novel mutation of KCNQ3 (c.925T-->C) in a Japanese family with

RT benign familial neonatal convulsions.";

RL Ann. Neurol. 47:822-826(2000).

CC -!- FUNCTION: PROBABLY IMPORTANT IN THE REGULATION OF NEURONAL

CC EXCITABILITY. ASSOCIATES WITH KCNQ2 OR KCNQ5 TO FORM A POTASSIUM

CC CHANNEL WITH ESSENTIALLY IDENTICAL PROPERTIES TO THE CHANNEL

CC UNDERLYING THE NATIVE M-CURRENT, A SLOWLY ACTIVATING AND

CC DEACTIVATING POTASSIUM CONDUCTANCE WHICH PLAYS A CRITICAL ROLE IN

CC DETERMINING THE SUBTHRESHOLD ELECTRICAL EXCITABILITY OF NEURONS AS

CC WELL AS THE RESPONSIVENESS TO SYNAPTIC INPUTS.

CC -!- SUBUNIT: HETEROMULTIMER WITH KCNQ2 OR KCNQ5. MAY ASSOCIATE WITH

CC KCNE2.

CC -!- SUBCELLULAR LOCATION: INTEGRAL MEMBRANE PROTEIN.

CC -!- TISSUE SPECIFICITY: PREDOMINANTLY EXPRESSED IN BRAIN.

CC -!- DOMAIN: THE SEGMENT S4 IS PROBABLY THE VOLTAGE-SENSOR AND IS

CC CHARACTERIZED BY A SERIES OF POSITIVELY CHARGED AMINO ACIDS AT

CC EVERY THIRD POSITION (BY SIMILARITY).

CC -!- DISEASE: DEFECTS IN KCNQ3 ARE THE CAUSE OF BENIGN FAMILIAL

CC NEONATAL CONVULSIONS TYPE 2 (BFNC2); ALSO KNOWN AS EPILEPSY,

CC BENIGN NEONATAL TYPE 2 (EBN2); BFNC2 IS AN AUTOSOMAL DOMINANT FORM

CC OF EPILEPSY IN THE NEWBORN THAT CLEARS SPONTANEOUSLY AFTER A FEW

CC WEEKS AND IS FOLLOWED BY NORMAL PSYCHOMOTOR DEVELOPMENT.




CC -!- MISCELLANEOUS: MUTAGENESIS EXPERIMENTS WERE CARRIED OUT IN XENOPUS

CC OOCYTES BY CO-EXPRESSION OF EITHER KCNQ3(MUT) AND KCNQ2 AT THE

CC RATIO OF 1:1, OR OF KCNQ3(MUT), KCNQ3(WT) AND KCNQ2 AT THE RATIO

CC OF 1:1:2, TO MIMIC THE SITUATION IN A HETEROZYGOUS PATIENT WITH

CC BFNC2 DISEASE.

CC -!- SIMILARITY: BELONGS TO THE POTASSIUM CHANNEL FAMILY. KQT

CC SUBFAMILY.

CC --------------------------------------------------------------------------

CC This SWISS-PROT entry is copyright. It is produced through a collaboration

CC between the Swiss Institute of Bioinformatics and the EMBL outstation -

CC the European Bioinformatics Institute. There are no restrictions on its

CC use by non-profit institutions as long as its content is in no way

CC modified and this statement is not removed. Usage by and for commercial

CC entities requires a license agreement (See http://www.isb-sib.ch/announce/

CC or send an email to license@isb-sib.ch).

CC --------------------------------------------------------------------------

DR HSSP; Q54397; 1BL8.

DR Genew; HGNC:6297; KCNQ3.

DR MIM; 602232; -.

DR MIM; 121201; -.

DR GO; GO:0008076; C:voltage-gated potassium channel complex; TAS.

DR GO; GO:0005249; F:voltage-gated potassium channel activity; TAS.

DR GO; GO:0006813; P:potassium ion transport; TAS.

DR GO; GO:0007268; P:synaptic transmission; TAS.

DR InterPro; IPR005821; Ion_trans.

DR InterPro; IPR001622; K+channel_pore.

DR InterPro; IPR003091; K_channel.

DR InterPro; IPR003937; KCNQ_channel.

DR InterPro; IPR005820; M+channel_nlg.

DR Pfam; PF00520; ion_trans; 1.

DR Pfam; PF03520; KCNQ_channel; 1.

DR PRINTS; PR00169; KCHANNEL.

KW Transport; Ion transport; Ionic channel; Voltage-gated channel;

KW Potassium channel; Potassium transport; Potassium; Transmembrane;

KW Multigene family; Disease mutation.

FT TRANSMEM 122 142 SEGMENT S1 (POTENTIAL).

FT TRANSMEM 153 173 SEGMENT S2 (POTENTIAL).

FT TRANSMEM 197 217 SEGMENT S3 (POTENTIAL).

FT TRANSMEM 226 247 SEGMENT S4 (POTENTIAL).

FT TRANSMEM 262 282 SEGMENT S5 (POTENTIAL).

FT DOMAIN 304 324 SEGMENT H5 (PORE-FORMING) (POTENTIAL).

FT TRANSMEM 331 351 SEGMENT S6 (POTENTIAL).

FT DOMAIN 13 24 POLY-GLY.

FT VARIANT 309 309 W -> R (IN BFNC2).

FT /FTId=VAR_010935.

FT VARIANT 310 310 G -> V (IN BFNC2).

FT /FTId=VAR_001546.

FT MUTAGEN 310 310 G->V: ABOUT 50% REDUCTION OF WT

FT HETEROMERIC CURRENT; RATIO OF 1:1; OR

FT 20%; RATIO OF 1:1:2.




FT MUTAGEN 318 318 G->S: >50% REDUCTION OF WT HETEROMERIC

FT CURRENT; RATIO OF 1:1 AND 1:1:2.

SQ SEQUENCE 872 AA; 96742 MW; BB79C69EE8591A84 CRC64;

MGLKARRAAG AAGGGGDGGG GGGGAANPAG GDAAAAGDEE RKVGLAPGDV EQVTLALGAG

ADKDGTLLLE GGGRDEGQRR TPQGIGLLAK TPLSRPVKRN NAKYRRIQTL IYDALERPRG

WALLYHALVF LIVLGCLILA VLTTFKEYET VSGDWLLLLE TFAIFIFGAE FALRIWAAGC

CCRYKGWRGR LKFARKPLCM LDIFVLIASV PVVAVGNQGN VLATSLRSLR FLQILRMLRM

DRRGGTWKLL GSAICAHSKE LITAWYIGFL TLILSSFLVY LVEKDVPEVD AQGEEMKEEF

ETYADALWWG LITLATIGYG DKTPKTWEGR LIAATFSLIG VSFFALPAGI LGSGLALKVQ

EQHRQKHFEK RRKPAAELIQ AAWRYYATNP NRIDLVATWR FYESVVSFPF FRKEQLEAAS

SQKLGLLDRV RLSNPRGSNT KGKLFTPLNV DAIEESPSKE PKPVGLNNKE RFRTAFRMKA

YAFWQSSEDA GTGDPMAEDR GYGNDFPIED MIPTLKAAIR AVRILQFRLY KKKFKETLRP

YDVKDVIEQY SAGHLDMLSR IKYLQTRIDM IFTPGPPSTP KHKKSQKGSA FTFPSQQSPR

NEPYVARPST SEIEDQSMMG KFVKVERQVQ DMGKKLDFLV DMHMQHMERL QVQVTEYYPT

KGTSSPAEAE KKEDNRYSDL KTIICNYSET GPPEPPYSFH QVTIDKVSPY GFFAHDPVNL

PRGGPSSGKV QATPPSSATT YVERPTVLPI LTLLDSRVSC HSQADLQGPY SDRISPRQRR

SITRDSDTPL SLMSVNHEEL ERSPSGFSIS QDRDDYVFGP NGGSSWMREK RYLAEGETDT

DTDPFTPSGS MPLSSTGDGI SDSVWTPSNK PI

//




Appendix 2


Description of FASTA file format:

Every sequence starts with a header line, where the very first character is a > followed immediately by a unique sequence id (unique at least within the file). Optionally the id can be followed by whitespace and some relevant text, but all the text has to be on the header line only. The lines following the header line contain the sequence, which can be a nucleotide or amino acid sequence. Usually a sequence line contains 60 residues (or fewer if it is the last line), but there is no fixed limit. Whitespace in the sequence is allowed but ignored.

See example below:



>SequenceID One line of text describing the sequence

MFLRRAAVAPQRAPILRPAFVPHVLQRADSALSSAAAGPRPMALRPPHQALVGPPLPGPP

GPPMMLPPMARAPGPPLGSMAALRPPLEEPAAPRELGLGLGLGLKEKEEAVVAAAAGLEE

ASAAVAVGAGGAPAGPAVIGPSLPLALAMPLPEPEPLPLPLEVVRGLLPPLRIPELLSLR

PRPRPPRPEPPPGLMALEVPEPLGEDKKKGKPEKLKRCIRTAAG

>NewSequenceID One line of text describing the sequence

MAELKYISGFGNECSSEDPRCPGSLPEGQNNPQVCPYNLYAEQLSGSAFTCPRSTNKRSW

LYRILPSVSHKPFESIDEGHVTHNWDEVDPDPNQLRWKPFEIPKASQKKVDFVSGLHTLC

GAGDIKSNNGLAIHIFLCNTSMENRCFYNSDGDFLIVPQKGNLLIYTEFGKMLVQPNEIC

VIQRGMRFSIDVFEETRGYILEVYGVHFELPDLGPIGANGLANPRDFLIPI
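
The format rules above can be illustrated with a minimal Perl sketch. It is not part of the original exam material: the subroutine name format_fasta and its parameters are hypothetical, and the default width of 60 is simply the usual line length mentioned in the description. It shows one way to build a single record: a header line starting with > and the id, then the sequence with whitespace removed and wrapped to a fixed width.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (an illustration, not the exam's reference solution):
# build one FASTA record from an id, optional header text and a sequence.
sub format_fasta {
    my ($id, $text, $sequence, $width) = @_;
    $width = 60 unless defined $width;     # usual line length, an assumption
    $sequence =~ s/\s+//g;                 # whitespace in the sequence is ignored
    my $header = ">$id";                   # '>' must be the very first character
    $header .= " $text" if defined $text && length $text;
    my $record = "$header\n";
    # Emit the sequence in chunks of at most $width characters.
    for (my $pos = 0; $pos < length $sequence; $pos += $width) {
        $record .= substr($sequence, $pos, $width) . "\n";
    }
    return $record;
}

# Usage: a short sequence wrapped at 10 characters per line for readability.
print format_fasta('SequenceID', 'One line of text describing the sequence',
                   'MGLKARRAAG AAGGGGDGGG GGGGAANPAG', 10);
```

Calling the subroutine once per sequence and printing the results one after another yields a multi-sequence file in the layout shown in the example above.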


Appendix 3


U01120.CDS.1 0.96254 0.48773 0.91830 0.98988 0.10537 0.62475

D25328.CDS.1 0.04034 0.42409 0.43538 0.52913 0.63754 0.79602

X15573.CDS.1 0.13059 0.65310 0.63434 0.69388 0.92635 0.03285

K03515.CDS.1 0.65147 0.03256 0.01210 0.92373 0.25138 0.04894

L44140.CDS.10 0.57916 0.67875 0.64902 0.11068 0.97844 0.40458

U24183.CDS.1 0.15529 0.94098 0.89230 0.07359 0.93086 0.99767

M97347.CDS.1 0.69834 0.97120 0.42177 0.13373 0.50034 0.05931

U05259.CDS.1 0.92974 0.63092 0.71241 0.56408 0.32481 0.63875

M62486.CDS.1 0.59694 0.97628 0.67132 0.60904 0.90001 0.92270

L11244.CDS.1 0.65798 0.47916 0.60145 0.30699 0.58984 0.57989

D38293.CDS.1 0.71157 0.74513 0.52088 0.60387 0.81872 0.45174

M86400.CDS.1 0.60154 0.51706 0.42294 0.02331 0.65079 0.92327

X56468.CDS.1 0.08261 0.58053 0.55420 0.79502 0.14462 0.87900

U54778.CDS.1 0.43378 0.74155 0.85528 0.10510 0.35059 0.75528

D78577.CDS.1 0.02779 0.00857 0.23445 0.62924 0.31556 0.82429

X57346.CDS.1 0.20913 0.02713 0.56942 0.73001 0.63100 0.38814

X77567.CDS.1 0.18175 0.23254 0.90520 0.60469 0.25584 0.55599

M74161.CDS.1 0.52796 0.33846 0.13653 0.08215 0.13348 0.28114

M32313.CDS.1 0.96116 0.56726 0.02270 0.81643 0.67235 0.37329

.

.

.

(long list continues here)


Appendix 4


OGG1_HUMAN

HGD_HUMAN

CRAR_HUMAN

SN25_HUMAN

INA2_HUMAN

TBB1_HUMAN

ADT2_HUMAN

FOL2_HUMAN

CBG_HUMAN

MYCM_HUMAN

PYR5_HUMAN

GLUC_HUMAN

SY04_HUMAN

PPA5_HUMAN

FGF2_HUMAN

COXR_HUMAN

GTM3_HUMAN

SPCB_HUMAN

MM08_HUMAN


Appendix 5


OGG1_HUMAN O15527 AB000410.CDS.1

HGD_HUMAN Q93099 AF000573.CDS.1

CN37_HUMAN P09543 D13146.CDS.1

GCST_HUMAN P48728 D14686.CDS.1

CRAR_HUMAN P48740 D17525.CDS.1

SN25_HUMAN P13795 D21267.CDS.1

APM1_HUMAN Q15848 D45371.CDS.1

CNCG_HUMAN Q13956 D45399.CDS.1

INA2_HUMAN P01563 J00207.CDS.1

TBB1_HUMAN P07437 J00314.CDS.1

IF2A_HUMAN P05198 J02645.CDS.1

ADT2_HUMAN P05141 J02683.CDS.1

FOL2_HUMAN P14207 J02876.CDS.1

2AAA_HUMAN P30153 J02902.CDS.1

C2F1_HUMAN P24903 J02906.CDS.1

CBG_HUMAN P08185 J02943.CDS.1

MYCM_HUMAN P12525 J03069.CDS.1

GBAZ_HUMAN P19086 J03260.CDS.1

LKHA_HUMAN P09960 J03459.CDS.1

PYR5_HUMAN P11172 J03626.CDS.1

GLUC_HUMAN P01275 J04040.CDS.1

CALM_HUMAN P02593 J04046.CDS.1

C1S_HUMAN P09871 J04080.CDS.1

SY04_HUMAN P13236 J04130.CDS.1

PPA5_HUMAN P13686 J04430.CDS.1

.

.

.

(long list continues here)