
Introduction to C programming

In today's exercise you will work through a brief introduction to the C programming language. You are all expected to have some prior programming experience, so the introduction is very short. Many programming languages exist, and many are more intuitive to use than C. However, few languages can compete with C in speed, and speed is essential when programming algorithms that deal with large biological data sets.

The exercise has two parts:

  • Introduction to C
  • Making a C program to calculate a BLOSUM scoring matrix matching two fasta sequences

C programming

Connect to the Linux cluster

Use xlaunch/putty, or any other program that supports ssh, to connect to the CBS server login.cbs.dtu.dk with the username and password from the last exercise. Make sure that you have enabled X11 tunneling, so that graphics from the CBS server can be displayed on your own screen. On Linux and Mac you do this by launching ssh with the -Y option.

Now we can get started.

Go to the src directory

cd src

Here you will find the following files (provided you ran the correct cp command in the last exercise).

test.c
c-program_template.c
fasta2scoremat.c
Makefile.Darwin_x86_64
Makefile.Linux_ia64
Makefile.Linux_i686
Makefile.Linux_x86_64

The three files ending in .c are C program files, and the files called Makefile.* are used to compile C programs. Have a look at the file test.c by typing

cat test.c

As you can see, this program does nothing but print "Hello World" to standard output (the screen). You compile the program by typing

make test

This command builds an executable called test from the test.c file. You execute the program by typing

test

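Note that if you did not change the path variable in the .cshrc file, you might have to give the path to the executable, i.e. "./test". Most systems also have a standard command named test, so typing "./test" is the safe way to make sure you run your own program.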

Now we can begin to work on a more complex program. Open the file c-program_template.c in your favorite editor.

This file contains a template you can use for future C programs. The first part of the file contains the include commands that add the libraries you might need (the lines with #include). Next follows a section declaring the command-line parameters for the program. These come in different types (float, double, int, longint, string, switch, and char). Then follow the subroutines and functions, and finally the main program. Every C program must have a main program definition. In the main program, we first declare the variables needed (we also do this in each subroutine and function), next we parse the command-line parameters (with the call to pparse), and then we start to code.

Note that many of the functions used in the program are routines written by other C programmers (like myself) and are not part of the C language. All these routines are part of the utils library found in your utils directory. For instance, the command-line parser (pparse) is found in the file utils/sysutils.c.

The c-program_template.c program is simple. It reads and prints the content of either a fasta file (a file in the FASTA format) or a peptide file (a file with peptides and binding affinity data). The type of the input is specified with a command-line parameter. Fasta files are read with the routine fsalist_read, and peptide files are read with the routine linelist_read. Both routines read the input data into a linked list data structure. The fsalist_read routine reads one or many fasta entries and the linelist_read routine reads one or many input lines. Each element in the linked list is accessed using the loop structure

for ( ln = linelist; ln; ln=ln->next ) 
for the linelist, and

for ( fsa = fsalist; fsa; fsa=fsa->next )
for the fsalist. You can of course write your own code to parse input data, but the two examples shown here will be very useful. Make sure you understand what each line of the program is doing.

You can now compile the program by typing

make c-program_template

You print information on how to use the program by typing

c-program_template -h

and test the program on some data by typing

c-program_template ../data/test.pep | more
c-program_template -fsa ../data/test.fsa | more

And now it is your turn

Now you must try it yourself. You shall complete the program fasta2scoremat.c, so that it reads two fasta files and a BLOSUM substitution scoring matrix, and calculates the amino acid scoring matrix between the two sequences. The fasta2scoremat.c file already contains most of the code. You just need to complete the program and compile it. When the program is completed, it should function as follows

fasta2scoremat file1.fsa file2.fsa

When you have compiled the program successfully, place a copy of the executable in the bin directory by typing

cp fasta2scoremat ../bin/

and type

rehash

to update the shell's table of executables.

You can now access the program from any directory by simply typing

fasta2scoremat

The output from the program is a BLOSUM scoring matrix matching the two input sequences. This matrix can be visualized to see whether the two sequences can be aligned.

In the data directory I have placed two fasta files, 1PLC._.fsa and 1PLB._.fsa. Go to your home directory, make a new directory called, for instance, scoremat, go to this directory, and run the program to construct the scoring matrix between the two sequences, saving the output in a file called score.mat

fasta2scoremat ../data/1PLC._.fsa ../data/1PLB._.fsa | grep -v "#" > score.mat

You can now visualize this scoring matrix using the heatmap procedure in R. Start R (version 2.9) by typing

R-2.9

and use the following commands

peptid.data <- as.matrix(read.table("score.mat", sep="\t", as.is=T, header=T, row.names=1))
heatmap(peptid.data,scale="none", Rowv=NA, Colv=NA)
dev.print(file="scoremat.ps")
q()

It might take a little while for R to make the visualization. Can you see whether these two proteins can be aligned using the BLOSUM substitution scoring matrix?

This is all for today!