
Introduction to C programming

In this exercise you will work through a brief introduction to the C programming language. It is expected that you all have some prior knowledge of programming, so the introduction is very short. Many programming languages exist (and many are more intuitive to use than C). However, few languages can compete with C in speed, and speed is essential when programming algorithms that deal with large biological data sets.

In this exercise, the basic format of a C program is introduced, including variable declaration, use of subroutines/functions, input/output, strings, and data structures.


C programming

Connect to the Linux cluster

Use PuTTY or any other program that allows you to use ssh to connect to the IIB server with the username and password from the last exercise. Make sure that you have enabled tunneling (X11 forwarding), so that you can display graphics from the CBS server in your terminal window. On Linux and Mac you do this by launching ssh with the -Y option.

Now we can get started.

Go to the src directory

cd src

Here you will find the following files (if you ran the right cp command in the last exercise).

test.c
c-program_template.c
fasta2scoremat.c
Makefile.Darwin_x86_64
Makefile.Linux_ia64
Makefile.Linux_i686
Makefile.Linux_x86_64

The three files ending with .c are C program files, and the files called Makefile.* are used to compile C programs. Today we will work only on the test.c program.

The general form of a C program is

#include <stdio.h>

int main()

{
	printf("Hello, world!\n");
	return 0;
}

The first line is practically boilerplate; it will appear in almost all programs we write. It asks that some definitions having to do with the "Standard I/O Library" be included in our program; these definitions are needed if we are to call the library function printf correctly.

The second line says that we are defining a function named main. Most of the time, we can name our functions anything we want, but the function name main is special: it is the function that will be "called" first when our program starts running. The empty pair of parentheses indicates that our main function accepts no arguments, that is, there isn't any information which needs to be passed in when the function is called.

The braces { and } surround a list of statements in C. Here, they surround the list of statements making up the function main.

The line

	printf("Hello, world!\n");
is the first statement in the program. It asks that the function printf be called; printf is a library function which prints formatted output. The parentheses surround printf's argument list: the information which is handed to it which it should act on. The semicolon at the end of the line terminates the statement.

You can find more information about the format of C programs in "A First Example".


Now you can understand the content of the test.c program. Have a look at the file by typing

cat test.c

As you can see, this program does nothing but print "Hello World" to standard output (the screen).


Compiling C programs

You compile this program by typing

make test

This command will make an executable called test from the test.c file. You execute the program by typing

test

Note that if you did not change the path variable in the .cshrc file, you might have to give the path to the executable, i.e. "./test". Typing "./test" also avoids accidentally running the standard system command that is likewise named test.


Loops

Open the file test.c with your favorite text editor, and add some code for a for loop that prints the integers from 0 up to (but not including) 100. That is, write some code to do the following

for i=0 to 100
	print i
Again, you can find information on how to write for loops in C in the notes by Steve Summit (Example 2). Remember to compile the code to verify that it works once you have completed the implementation.

Functions/Subroutines

Next, define a subroutine called add_one that takes an integer value as input and returns this value + 1. Note that you define subroutines in C as

int	add_one( int i )

{
	int j;

	/* compute j from i here */

	return(j);
}
where the first line defines the type of the output and the input of the subroutine, and the return statement gives the value to be returned by the subroutine. In this case this value must be an integer. For more information, see the notes on functions and subroutines by Steve Summit.

Use this subroutine to add one to each output in the loop above, i.e. write some code to implement

for i=0 to 100
	print add_one(i)

Input/output

One of the most bothersome things to deal with when writing a program is input and output. Now, you shall make a program that reads the content of the file 1A68_HUMAN.sprot and prints it to standard out. That is, write code to do the following

open file called 1A68_HUMAN.sprot checking if the operation was successful

while not end of file
	read line from file
	print line to stdout

close file

The commands for opening and closing files are

fp = fopen( filename, "r/w/a" );
fclose( fp );
where fp is a file pointer (declared as FILE *fp) and "r/w/a" indicates whether you will read (r), write (w) or append (a) to the given file; use exactly one of the three. Note that fopen returns NULL if the command fails. For more information on input/output, see Input and Output by Steve Summit. We have not yet worked with strings, so for now just use the code
char	line[1024];
to declare a variable line that can hold a string of up to 1023 characters plus the terminating null character. A useful command to read one line from a file is
fgets(line, sizeof line, fp)
You can (as for all predefined C functions) use the man command to learn about the function and its syntax
man fgets
Doing this you will learn that fgets "reads at most one less than the number of characters specified by size from the given stream and stores them in the string line. Reading stops when a newline character is found, at end-of-file or error." Also, this function returns NULL if an error occurs or if end-of-file is reached before any characters are read.

The command for formatted printing to stdout in C is "printf( format, arguments )". So if you want to print out one string called line you will write

printf( "%s\n", line );
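where %s indicates that the argument to print is a string, and "\n" is a newline character. Note that fgets keeps the newline at the end of each line it reads, so printing such a line with "%s\n" gives an extra blank line; use "%s" alone to avoid this. You can print out integers (%i or %d), floats (%f) and a whole lot more. Check with Google.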

Strings

In C, strings are just vectors (arrays) of characters, ending with a null character ('\0'). You saw this before, where the string "line" was declared as

char	line[1024];
as a vector of length 1024. C comes with a set of predefined functions for working with strings. You saw one of these, fgets, earlier. Now you shall modify the code from above and write a program sp2fsa that (as we did with gawk this morning) reads the Swiss-Prot file 1A68_HUMAN.sprot and prints out the content in FASTA format. That is, make a program to do the following
open file called 1A68_HUMAN.sprot checking if the operation was successful

while not end of file
	read line from file
	if line contains ID
		extract and store ID
	if line contains sequences
		extract and store sequences
	if line == "//"
		print entry in fasta format to stdout

close file

Some functions that might come in handy when making this program are

strncmp
sscanf
strlen
isspace

For instance, you can read the first two fields from the line using the command

sscanf( line, "%s %s", dummy, id);
where dummy and id are two strings (vectors of characters). Use man or google to find out how the other functions work.

As a guide, the first lines of the code could look something like

#include <stdlib.h>
#include <stddef.h>
#include <ctype.h>
#include <string.h>
#include <stdio.h>

int main()

{

        FILE    *fp;
        char    line[1024];
        char    filename[] = "1A68_HUMAN.sprot";
        char    dummy[256], id[256], seq[1024];
        int     i,j;

        fp = fopen( filename, "r" );

        if ( fp == NULL ) {
                printf( "Error. Cannot open file %s\n", filename );
                exit( 1 );
        }

        while ( fgets(line, sizeof line, fp) != NULL ) {

Structures

Finally we shall introduce the concept of structures. In C you can define your own new structures (variable types) as

typedef struct fsa {
        char    seq[1024];
        char    name[256];
        int     len;
} FSA;

That is, here you define a new variable type called FSA. Normally such a type definition would be made at the beginning of a program, so that all subroutines/functions can make use of the variable type. Once the type is defined, you can subsequently define new variables of this type as

FSA	fsa;
You can access the elements of a structure using the syntax
fsa.seq;
fsa.name;
fsa.len;

Note that if you are accessing a structure element through a structure pointer, you must use fsa->seq, etc. We will get back to this later.

Now you shall make a program fsa2fsa that reads a FASTA file (say test.fsa), stores it in an FSA structure and prints the FASTA entry to stdout. The program shall consist of (at least) two subroutines, read_fasta and print_fasta. This is not a trivial task, as the FASTA format does not have a character defining the end of an entry. You might use the following code to check whether a new FASTA entry starts at the current position in the input file

if ( fp && ( ch = fgetc( fp ) ) != EOF && ungetc( ch, fp ) != EOF && ch == '>' ) {

}

This code reads one character from the file and puts it back again, so the read position in the file is unchanged. The code in the read_fasta subroutine might then look something like

FSA	read_fasta( FILE *fp )

{
        FSA     fsa;
        int     read;
        int     i, j, ix, k;
        int     ch;     /* int rather than char, so that EOF can be stored */
        char    line[1024];

        read = 0;
        j = 0;

        while ( ! feof( fp ) ) {

                /* Pseudocode for the body of the loop:
                 *
                 * if start of new FASTA entry
                 *         if next entry
                 *                 return fsa
                 *         else
                 *                 store id in fsa.name
                 * else
                 *         read sequence and store/append it to fsa.seq
                 */
        }

        return fsa;
}
A useful set of commands for storing the FASTA name in the fsa.name variable, without risking a buffer overflow, could be
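if ( fgets( line, sizeof( line ), fp ) != NULL ) {
	/* Remove newline from string line (safe also if there is no newline) */
	line[strcspn( line, "\n" )] = '\0';
	/* copy string to fsa.name excluding the first character '>' */
	strncpy( fsa.name, line+1, sizeof( fsa.name ) - 1 );
	/* strncpy does not add '\0' when the source is too long, so add it here */
	fsa.name[sizeof( fsa.name ) - 1] = '\0';
}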

This is all for now!