|
Introduction to C programming
In today's exercise you will work through a brief introduction to the C
programming language. You are all expected to have some prior knowledge of programming, so the introduction
is very short. Many programming languages exist (and many are more intuitive to use than C). However, few languages can
compete with C in speed, and speed is essential when programming algorithms that deal with large biological
data sets.
The exercise has two parts
- Introduction to C
- Making a C program to calculate a BLOSUM scoring matrix matching two fasta sequences
C programming
Connect to the Linux cluster
Use xlaunch/putty, or any other program that provides ssh, to connect
to the CBS server login.cbs.dtu.dk with the username and password from the last exercise.
Make sure that you have enabled X11 forwarding (tunneling), so that graphics from
the CBS server can be displayed on your screen. On Linux and Mac you do this by launching
ssh with the -Y option.
Now we can get started.
Go to the src directory
cd src
Here you will find the following files (if you ran the correct cp command in the last exercise).
test.c
c-program_template.c
fasta2scoremat.c
Makefile.Darwin_x86_64
Makefile.Linux_ia64
Makefile.Linux_i686
Makefile.Linux_x86_64
The three files ending with .c are C program files, and the files called Makefile.* are used to compile C programs.
Have a look at the file test.c by typing
cat test.c
As you can see, this program does nothing but print "Hello World" to standard output (the screen). You compile
this program by typing
make test
This command will make an executable called test from the test.c file. You execute the program by typing
test
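Note that if you did not change the path variable in the .cshrc file, you might have to give the path to the executable, i.e. "./test". Typing "./test" also makes sure you run your own program and not the system command that is also called test.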
Now we can begin to work on a more complex program. Open the file c-program_template.c in your favorite editor.
This file contains a template you can use for future C programs. The first part of the file contains the include
commands that add the different libraries you might need (the lines with #include). Next follows a section
declaring the command-line parameters for the program. These come in different types (float, double,
int, longint, string, switch, and char). Next follow the subroutines and functions, and finally the
main program. Every C program must have a main program definition. In the main program, we first
declare the variables needed (this we also do in each subroutine and function), next we parse the command-line
parameters (with the call to pparse), and then we start to code.
Note that many of the functions used in the program are routines made by other C programmers (like myself), and
are not part of the C language. All these routines are part of the utils library found in your utils
directory. For instance, the command-line parser (pparse) is found in the file utils/sysutils.c.
The function of the c-program_template.c template is simple. It will read and print the content
of either a fasta file (a file in the FASTA format) or a peptide file (a file with peptides and binding
affinity data). The type of the input is specified with a command-line parameter. Fasta files are read
with the routine fsalist_read, and peptide files are read with the routine linelist_read.
Both routines read the input data into a linked list data structure. The fsalist_read routine
reads one or many fasta entries and the linelist_read routine reads one or many input lines.
Each element in the linked list is accessed using the loop structure
for ( ln = linelist; ln; ln=ln->next )
for the linelist, and
for ( fsa = fsalist; fsa; fsa=fsa->next )
for the fsalist. You can of course write your own code to parse input data, but the two examples shown here
will be very useful. Make sure you understand what each line of the program is doing.
You can now compile the program by typing
make c-program_template
You print information on how to use the program by typing
c-program_template -h
and test the program on some data by typing
c-program_template ../data/test.pep | more
c-program_template -fsa ../data/test.fsa | more
And now it is your turn
Now you must try it yourself. You shall complete the program fasta2scoremat.c, so that it reads
two fasta files and a BLOSUM substitution scoring matrix, and calculates the amino acid
scoring matrix between the two sequences. The fasta2scoremat.c file already contains most of the
code. You just need to complete the program and compile it. When the program is completed, it
should function as follows
fasta2scoremat file1.fsa file2.fsa
When you have compiled the program successfully, place a copy of the executable in the bin directory by
typing
cp fasta2scoremat ../bin/
and type
rehash
to update the shell's table of executables.
You can now access the program from any directory by simply typing
fasta2scoremat
The output from the program is a BLOSUM scoring matrix matching the two input sequences. This
matrix can be visualized to see whether the sequences can be aligned.
In the data directory I have placed two fasta files, 1PLC._.fsa and 1PLB._.fsa.
Go to your home directory, make a new directory called, for instance, scoremat, go to this
directory, and run the program to
construct the scoring matrix between the two sequences, saving the output in a file called score.mat
fasta2scoremat ../data/1PLC._.fsa ../data/1PLB._.fsa | grep -v "#" > score.mat
You can now visualize this scoring matrix using the heatmap procedure in R. Start R (version 2.9) by typing
R-2.9
and use the following commands
peptid.data <- as.matrix(read.table("score.mat", sep="\t", as.is=T, header=T, row.names=1))
heatmap(peptid.data,scale="none", Rowv=NA, Colv=NA)
dev.print(file="scoremat.ps")
q()
It might take a little while for the program to make the visualization.
Can you see whether these two proteins can be aligned using the BLOSUM substitution scoring matrix?
This is all for today!
|