Exam in course 27013, May 28, 2003

Perl and Unix for Bioinformaticians

 

Preface:

 

Trivial syntax errors do not count against you. However, using functions or language structures (loops, conditional statements) in a nonsensical manner does. Appendix 6 contains a short reminder (not complete coverage) of Perl structure and functions.

If you want to refer to the data in the appendices as files, use the filenames "appendix1", "appendix2" and so forth.

You can answer the questions in Danish or English.

 

Assignment 1 (50%):

 

During your research on ion channels you stumble upon the SwissProt entry (CIQ3_HUMAN) in appendix 1. You notice in the feature table (FT) that several variants and mutations of this gene exist. You want to take a closer look at this and decide that the first step is to extract the original amino acid sequence and all variations thereof (the full sequence with the appropriate amino acid changed) and put the result in a FASTA file (see appendix 2 for the FASTA file format). Since you are probably going to do this on a lot of SwissProt entries, you decide to make a program in Perl (surprise).

 

a)     Describe which keywords/patterns you will be looking for when parsing the file searching for the variants and the sequence.

 

b)    Describe a method to extract the sequences. You can use pseudo code, a diagram or whatever you find suitable in your description.

 

c)     Implement your method in perl (on paper).

 

d)    What kind of error checking could/should you include in your program? Name every check that is relevant to the task, not every check that it is possible to make.

 

e)     In what way could you generalize or extend the program ?


Assignment 2 (50%):

 

Earlier in your career you made a splendid program that calculates various scores based on amino acid sequence features. The output from your program can be seen in appendix 3; each line consists of an accession number followed by 6 numbers between 0 and 1 (tab separated). You want to find the accession numbers with the highest and lowest average scores (average of the 6 numbers). However, you want to exclude any genes on your negative list from your calculations. These genes are listed as SwissProt IDs in appendix 4. Since GenBank accession numbers and SwissProt IDs are not identical, you need to translate between them in order to solve your problem. Fortunately, you have a file that does just that (see appendix 5): the first item on each line is a SwissProt ID, the second item is irrelevant, and the third is the corresponding GenBank accession number.

 

 

a) Describe a method to find the data. You can use pseudo code, a diagram or whatever you find suitable in your description.

 

b) Implement your method in perl (on paper).

 

c) Have you made any assumptions about the data in your algorithm? Which? Why? Are they reasonable assumptions (explain)? Could/should you do away with them (by changing the code)?

 

d) Usually, when you have this kind of problem, you want the highest 10 and lowest 10 average scores, not just the top and bottom average score. How would you solve this problem? Will it change any of the assumptions in c)?

 

 

 

 

 

 

 

 

 

 


Appendix 1

 

 

ID   CIQ3_HUMAN     STANDARD;      PRT;   872 AA.

AC   O43525;

DT   15-JUL-1999 (Rel. 38, Created)

DT   15-JUL-1999 (Rel. 38, Last sequence update)

DT   28-FEB-2003 (Rel. 41, Last annotation update)

DE   Potassium voltage-gated channel subfamily KQT member 3 (Potassium

DE   channel KQT-like 3).

GN   KCNQ3.

OS   Homo sapiens (Human).

OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;

OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.

OX   NCBI_TaxID=9606;

RN   [1]

RP   SEQUENCE FROM N.A., AND MUTAGENESIS OF GLY-310 AND GLY-318.

RC   TISSUE=Brain;

RX   MEDLINE=99087323; PubMed=9872318;

RA   Schroeder B.C., Kubisch C., Stein V., Jentsch T.J.;

RT   "Moderate loss of function of cyclic-AMP-modulated KCNQ2/KCNQ3 K+

RT   channels causes epilepsy.";

RL   Nature 396:687-690(1998).

RN   [2]

RP   VARIANT BFNC2 ARG-309.

RX   MEDLINE=20309392; PubMed=10852552;

RA   Hirose S., Zenri F., Akiyoshi H., Fukuma G., Iwata H., Inoue T.,

RA   Yonetani M., Tsutsumi M., Muranaka H., Kurokawa T., Hanai T., Wada K.,

RA   Kaneko S., Mitsudome A.;

RT   "A novel mutation of KCNQ3 (c.925T-->C) in a Japanese family with

RT   benign familial neonatal convulsions.";

RL   Ann. Neurol. 47:822-826(2000).

CC   -!- FUNCTION: PROBABLY IMPORTANT IN THE REGULATION OF NEURONAL

CC       EXCITABILITY. ASSOCIATES WITH KCNQ2 OR KCNQ5 TO FORM A POTASSIUM

CC       CHANNEL WITH ESSENTIALLY IDENTICAL PROPERTIES TO THE CHANNEL

CC       UNDERLYING THE NATIVE M-CURRENT, A SLOWLY ACTIVATING AND

CC       DEACTIVATING POTASSIUM CONDUCTANCE WHICH PLAYS A CRITICAL ROLE IN

CC       DETERMINING THE SUBTHRESHOLD ELECTRICAL EXCITABILITY OF NEURONS AS

CC       WELL AS THE RESPONSIVENESS TO SYNAPTIC INPUTS.

CC   -!- SUBUNIT: HETEROMULTIMER WITH KCNQ2 OR KCNQ5. MAY ASSOCIATE WITH

CC       KCNE2.

CC   -!- SUBCELLULAR LOCATION: INTEGRAL MEMBRANE PROTEIN.

CC   -!- TISSUE SPECIFICITY: PREDOMINANTLY EXPRESSED IN BRAIN.

CC   -!- DOMAIN: THE SEGMENT S4 IS PROBABLY THE VOLTAGE-SENSOR AND IS

CC       CHARACTERIZED BY A SERIES OF POSITIVELY CHARGED AMINO ACIDS AT

CC       EVERY THIRD POSITION (BY SIMILARITY).

CC   -!- DISEASE: DEFECTS IN KCNQ3 ARE THE CAUSE OF BENIGN FAMILIAL

CC       NEONATAL CONVULSIONS TYPE 2 (BFNC2); ALSO KNOWN AS EPILEPSY,

CC       BENIGN NEONATAL TYPE 2 (EBN2); BFNC2 IS AN AUTOSOMAL DOMINANT FORM

CC       OF EPILEPSY IN THE NEWBORN THAT CLEARS SPONTANEOUSLY AFTER A FEW

CC       WEEKS AND IS FOLLOWED BY NORMAL PSYCHOMOTOR DEVELOPMENT.

 

CC   -!- MISCELLANEOUS: MUTAGENESIS EXPERIMENTS WERE CARRIED OUT IN XENOPUS

CC       OOCYTES BY CO-EXPRESSION OF EITHER KCNQ3(MUT) AND KCNQ2 AT THE

CC       RATIO OF 1:1, OR OF KCNQ3(MUT), KCNQ3(WT) AND KCNQ2 AT THE RATIO

CC       OF 1:1:2, TO MIMIC THE SITUATION IN A HETEROZYGOUS PATIENT WITH

CC       BFNC2 DISEASE.

CC   -!- SIMILARITY: BELONGS TO THE POTASSIUM CHANNEL FAMILY. KQT

CC       SUBFAMILY.

CC   --------------------------------------------------------------------------

CC   This SWISS-PROT entry is copyright. It is produced through a collaboration

CC   between  the Swiss Institute of Bioinformatics  and the  EMBL outstation -

CC   the European Bioinformatics Institute.  There are no  restrictions on  its

CC   use  by  non-profit  institutions as long  as its content  is  in  no  way

CC   modified and this statement is not removed.  Usage  by  and for commercial

CC   entities requires a license agreement (See http://www.isb-sib.ch/announce/

CC   or send an email to license@isb-sib.ch).

CC   --------------------------------------------------------------------------

DR   HSSP; Q54397; 1BL8.

DR   Genew; HGNC:6297; KCNQ3.

DR   MIM; 602232; -.

DR   MIM; 121201; -.

DR   GO; GO:0008076; C:voltage-gated potassium channel complex; TAS.

DR   GO; GO:0005249; F:voltage-gated potassium channel activity; TAS.

DR   GO; GO:0006813; P:potassium ion transport; TAS.

DR   GO; GO:0007268; P:synaptic transmission; TAS.

DR   InterPro; IPR005821; Ion_trans.

DR   InterPro; IPR001622; K+channel_pore.

DR   InterPro; IPR003091; K_channel.

DR   InterPro; IPR003937; KCNQ_channel.

DR   InterPro; IPR005820; M+channel_nlg.

DR   Pfam; PF00520; ion_trans; 1.

DR   Pfam; PF03520; KCNQ_channel; 1.

DR   PRINTS; PR00169; KCHANNEL.

KW   Transport; Ion transport; Ionic channel; Voltage-gated channel;

KW   Potassium channel; Potassium transport; Potassium; Transmembrane;

KW   Multigene family; Disease mutation.

FT   TRANSMEM    122    142       SEGMENT S1 (POTENTIAL).

FT   TRANSMEM    153    173       SEGMENT S2 (POTENTIAL).

FT   TRANSMEM    197    217       SEGMENT S3 (POTENTIAL).

FT   TRANSMEM    226    247       SEGMENT S4 (POTENTIAL).

FT   TRANSMEM    262    282       SEGMENT S5 (POTENTIAL).

FT   DOMAIN      304    324       SEGMENT H5 (PORE-FORMING) (POTENTIAL).

FT   TRANSMEM    331    351       SEGMENT S6 (POTENTIAL).

FT   DOMAIN       13     24       POLY-GLY.

FT   VARIANT     309    309       W -> R (IN BFNC2).

FT                                /FTId=VAR_010935.

FT   VARIANT     310    310       G -> V (IN BFNC2).

FT                                /FTId=VAR_001546.

FT   MUTAGEN     310    310       G->V: ABOUT 50% REDUCTION OF WT

FT                                HETEROMERIC CURRENT; RATIO OF 1:1; OR

FT                                20%; RATIO OF 1:1:2.

FT   MUTAGEN     318    318       G->S: >50% REDUCTION OF WT HETEROMERIC

FT                                CURRENT; RATIO OF 1:1 AND 1:1:2.

SQ   SEQUENCE   872 AA;  96742 MW;  BB79C69EE8591A84 CRC64;

     MGLKARRAAG AAGGGGDGGG GGGGAANPAG GDAAAAGDEE RKVGLAPGDV EQVTLALGAG

     ADKDGTLLLE GGGRDEGQRR TPQGIGLLAK TPLSRPVKRN NAKYRRIQTL IYDALERPRG

     WALLYHALVF LIVLGCLILA VLTTFKEYET VSGDWLLLLE TFAIFIFGAE FALRIWAAGC

     CCRYKGWRGR LKFARKPLCM LDIFVLIASV PVVAVGNQGN VLATSLRSLR FLQILRMLRM

     DRRGGTWKLL GSAICAHSKE LITAWYIGFL TLILSSFLVY LVEKDVPEVD AQGEEMKEEF

     ETYADALWWG LITLATIGYG DKTPKTWEGR LIAATFSLIG VSFFALPAGI LGSGLALKVQ

     EQHRQKHFEK RRKPAAELIQ AAWRYYATNP NRIDLVATWR FYESVVSFPF FRKEQLEAAS

     SQKLGLLDRV RLSNPRGSNT KGKLFTPLNV DAIEESPSKE PKPVGLNNKE RFRTAFRMKA

     YAFWQSSEDA GTGDPMAEDR GYGNDFPIED MIPTLKAAIR AVRILQFRLY KKKFKETLRP

     YDVKDVIEQY SAGHLDMLSR IKYLQTRIDM IFTPGPPSTP KHKKSQKGSA FTFPSQQSPR

     NEPYVARPST SEIEDQSMMG KFVKVERQVQ DMGKKLDFLV DMHMQHMERL QVQVTEYYPT

     KGTSSPAEAE KKEDNRYSDL KTIICNYSET GPPEPPYSFH QVTIDKVSPY GFFAHDPVNL

     PRGGPSSGKV QATPPSSATT YVERPTVLPI LTLLDSRVSC HSQADLQGPY SDRISPRQRR

     SITRDSDTPL SLMSVNHEEL ERSPSGFSIS QDRDDYVFGP NGGSSWMREK RYLAEGETDT

     DTDPFTPSGS MPLSSTGDGI SDSVWTPSNK PI

//
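As an illustration of how the feature table relates to the sequence, the following is a minimal sketch in Perl, not a complete answer to the assignment. It assumes that VARIANT lines follow the layout shown above (two positions followed by "X -> Y") and that positions are 1-based. The helper name apply_variant is hypothetical, and the short test sequence is just the first ten residues of the entry.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of $sequence with the residue at the
# 1-based $position changed from $from to $to. Dies if the position is
# outside the sequence or the original residue does not match $from.
sub apply_variant {
    my ($sequence, $position, $from, $to) = @_;
    die "Position $position is outside the sequence\n"
        if $position < 1 || $position > length $sequence;
    my $found = substr($sequence, $position - 1, 1);
    die "Expected $from at position $position, found $found\n"
        unless $found eq $from;
    substr($sequence, $position - 1, 1) = $to;   # substr is 0-based
    return $sequence;
}

# A feature line in the layout shown in the entry above.
my $line = 'FT   VARIANT     309    309       W -> R (IN BFNC2).';
if ($line =~ /^FT\s+VARIANT\s+(\d+)\s+(\d+)\s+(\w)\s*->\s*(\w)\b/) {
    my ($start, $end, $from, $to) = ($1, $2, $3, $4);
    print "position $start: $from to $to\n";
}

# Toy example on the first ten residues of the sequence.
print apply_variant('MGLKARRAAG', 3, 'L', 'V'), "\n";
```

Checking the original residue before substituting is one simple way to catch a mismatch between the feature table and the sequence.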

 




Appendix 2

 

Description of FASTA file format:

Every sequence starts with a header line, where the very first character is a > followed immediately by a unique sequence id (at least unique within the file). Optionally the id can be followed by whitespace and some relevant text, but all the text has to be on the header line only. The lines following the header line hold the sequence, which can be a nucleotide or amino acid sequence. Usually a sequence line contains 60 units (or fewer if it is the last line), but there is no fixed limit. Whitespace in the sequence is allowed but ignored.

See example below:

 

 

>SequenceID One line of text describing the sequence

MFLRRAAVAPQRAPILRPAFVPHVLQRADSALSSAAAGPRPMALRPPHQALVGPPLPGPP

GPPMMLPPMARAPGPPLGSMAALRPPLEEPAAPRELGLGLGLGLKEKEEAVVAAAAGLEE

ASAAVAVGAGGAPAGPAVIGPSLPLALAMPLPEPEPLPLPLEVVRGLLPPLRIPELLSLR

PRPRPPRPEPPPGLMALEVPEPLGEDKKKGKPEKLKRCIRTAAG

>NewSequenceID One line of text describing the sequence

MAELKYISGFGNECSSEDPRCPGSLPEGQNNPQVCPYNLYAEQLSGSAFTCPRSTNKRSW

LYRILPSVSHKPFESIDEGHVTHNWDEVDPDPNQLRWKPFEIPKASQKKVDFVSGLHTLC

GAGDIKSNNGLAIHIFLCNTSMENRCFYNSDGDFLIVPQKGNLLIYTEFGKMLVQPNEIC

VIQRGMRFSIDVFEETRGYILEVYGVHFELPDLGPIGANGLANPRDFLIPI
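The format rules above can be sketched in Perl roughly as follows. This is only an illustration under the stated description (header line starting with >, id up to the first whitespace, whitespace in the sequence ignored, 60 units per output line). The helper names fasta_record and parse_fasta are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: format one FASTA record with at most 60 units per line.
sub fasta_record {
    my ($id, $description, $sequence) = @_;
    $sequence =~ s/\s+//g;                    # whitespace in the sequence is ignored
    my $header = ">$id";
    $header .= " $description" if defined $description && length $description;
    my @lines = $sequence =~ /(.{1,60})/g;    # chunks of at most 60 units
    return join("\n", $header, @lines) . "\n";
}

# Hypothetical helper: parse FASTA text into a list of [id, description, sequence].
sub parse_fasta {
    my ($text) = @_;
    my @records;
    foreach my $line (split /\n/, $text) {
        if ($line =~ /^>(\S+)\s*(.*)$/) {     # header line: id, optional text
            push @records, [$1, $2, ''];
        } elsif (@records) {                  # sequence line of the current record
            $line =~ s/\s+//g;
            $records[-1][2] .= $line;
        }
    }
    return @records;
}

print fasta_record('SequenceID', 'One line of text', 'MFLRRAAVAP QRAPILRPAF');
```

Lines that appear before the first header are silently skipped in this sketch; a stricter parser might report them as an error.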

                                                                                                                            

 


Appendix 3

 

U01120.CDS.1            0.96254            0.48773            0.91830            0.98988            0.10537            0.62475

D25328.CDS.1            0.04034            0.42409            0.43538            0.52913            0.63754            0.79602

X15573.CDS.1            0.13059            0.65310            0.63434            0.69388            0.92635            0.03285

K03515.CDS.1            0.65147            0.03256            0.01210            0.92373            0.25138            0.04894

L44140.CDS.10            0.57916            0.67875            0.64902            0.11068            0.97844            0.40458

U24183.CDS.1            0.15529            0.94098            0.89230            0.07359            0.93086            0.99767

M97347.CDS.1            0.69834            0.97120            0.42177            0.13373            0.50034            0.05931

U05259.CDS.1            0.92974            0.63092            0.71241            0.56408            0.32481            0.63875

M62486.CDS.1            0.59694            0.97628            0.67132            0.60904            0.90001            0.92270

L11244.CDS.1            0.65798            0.47916            0.60145            0.30699            0.58984            0.57989

D38293.CDS.1            0.71157            0.74513            0.52088            0.60387            0.81872            0.45174

M86400.CDS.1            0.60154            0.51706            0.42294            0.02331            0.65079            0.92327

X56468.CDS.1            0.08261            0.58053            0.55420            0.79502            0.14462            0.87900

U54778.CDS.1            0.43378            0.74155            0.85528            0.10510            0.35059            0.75528

D78577.CDS.1            0.02779            0.00857            0.23445            0.62924            0.31556            0.82429

X57346.CDS.1            0.20913            0.02713            0.56942            0.73001            0.63100            0.38814

X77567.CDS.1            0.18175            0.23254            0.90520            0.60469            0.25584            0.55599

M74161.CDS.1            0.52796            0.33846            0.13653            0.08215            0.13348            0.28114

M32313.CDS.1            0.96116            0.56726            0.02270            0.81643            0.67235            0.37329

.

.

.

(long list continues here)
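For orientation, one line of this file could be handled in Perl along the following lines. This is a sketch only, assuming each data line holds an accession number followed by exactly 6 whitespace-separated fields. The helper name parse_score_line is hypothetical, and the sketch does not verify that the fields are numeric.

```perl
use strict;
use warnings;

# Hypothetical helper: split one score line into accession and average score.
# Returns an empty list if the line does not hold an accession plus 6 fields.
sub parse_score_line {
    my ($line) = @_;
    my ($accession, @scores) = split ' ', $line;   # split on any whitespace
    return unless defined $accession && @scores == 6;
    my $sum = 0;
    $sum += $_ foreach @scores;
    return ($accession, $sum / @scores);
}

my ($acc, $avg) = parse_score_line(
    "U01120.CDS.1\t0.96254\t0.48773\t0.91830\t0.98988\t0.10537\t0.62475");
printf "%s\t%.5f\n", $acc, $avg;
```

Returning an empty list for malformed lines lets the caller skip them with a simple test on the returned accession.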

 


Appendix 4

 

OGG1_HUMAN

HGD_HUMAN 

CRAR_HUMAN

SN25_HUMAN

INA2_HUMAN

TBB1_HUMAN

ADT2_HUMAN

FOL2_HUMAN

CBG_HUMAN 

MYCM_HUMAN

PYR5_HUMAN

GLUC_HUMAN

SY04_HUMAN

PPA5_HUMAN

FGF2_HUMAN

COXR_HUMAN

GTM3_HUMAN

SPCB_HUMAN

MM08_HUMAN                                                                                                              

 


Appendix 5

 

OGG1_HUMAN O15527     AB000410.CDS.1

HGD_HUMAN  Q93099     AF000573.CDS.1

CN37_HUMAN P09543     D13146.CDS.1

GCST_HUMAN P48728     D14686.CDS.1

CRAR_HUMAN P48740     D17525.CDS.1

SN25_HUMAN P13795     D21267.CDS.1

APM1_HUMAN Q15848     D45371.CDS.1

CNCG_HUMAN Q13956     D45399.CDS.1

INA2_HUMAN P01563     J00207.CDS.1

TBB1_HUMAN P07437     J00314.CDS.1

IF2A_HUMAN P05198     J02645.CDS.1

ADT2_HUMAN P05141     J02683.CDS.1

FOL2_HUMAN P14207     J02876.CDS.1

2AAA_HUMAN P30153     J02902.CDS.1

C2F1_HUMAN P24903     J02906.CDS.1

CBG_HUMAN  P08185     J02943.CDS.1

MYCM_HUMAN P12525     J03069.CDS.1

GBAZ_HUMAN P19086     J03260.CDS.1

LKHA_HUMAN P09960     J03459.CDS.1

PYR5_HUMAN P11172     J03626.CDS.1

GLUC_HUMAN P01275     J04040.CDS.1

CALM_HUMAN P02593     J04046.CDS.1

C1S_HUMAN  P09871     J04080.CDS.1

SY04_HUMAN P13236     J04130.CDS.1

PPA5_HUMAN P13686     J04430.CDS.1

.

.

.

(long list continues here)
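A lookup from SwissProt ID to GenBank accession number could be built from lines like these roughly as sketched below. The sketch assumes three whitespace-separated items per line, as in the listing above. The helper name build_id_map is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: build a lookup from SwissProt ID to GenBank accession
# from lines of the form "ID  ignored  accession".
sub build_id_map {
    my (@lines) = @_;
    my %genbank_for;
    foreach my $line (@lines) {
        my ($swissprot_id, undef, $accession) = split ' ', $line;
        next unless defined $accession;       # skip blank or incomplete lines
        $genbank_for{$swissprot_id} = $accession;
    }
    return %genbank_for;
}

my %genbank_for = build_id_map(
    "OGG1_HUMAN O15527     AB000410.CDS.1",
    "HGD_HUMAN  Q93099     AF000573.CDS.1",
);
print "$genbank_for{HGD_HUMAN}\n";
```

If the same SwissProt ID occurred on several lines, the last one would win in this sketch; whether that can happen in the real file is not stated here.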