
Introduction to C programming. Part 2

Now we shall continue the introduction to C, working with

  • How to talk to a C program using argv and argc
  • Dynamic memory allocation
  • Pointers
  • Linked lists

You are expected to have downloaded all the files from the first C programming exercise, and to have configured your Linux/Unix/Mac account as described in the Unix lecture notes, before starting. If this is not the case, please do so before starting the exercise. If you did all the previous exercises, you can continue below with "How to talk to a C program using argv and argc".

If you did not complete the earlier exercises, here is a summary of the commands you need to execute

mkdir test
mkdir data
mkdir -p src
mkdir -p bin
cp /home/projects/mniel/ALGO/data/Cprog/* ./data/
cp /home/projects/mniel/ALGO/code/cprog/* ./src/
cp -R /home/projects/mniel/ALGO/code/utils ./src
cp /home/projects/mniel/ALGO/exercises/ex_unix/test.dat ./test/
cp /home/projects/mniel/ALGO/exercises/ex_unix/1A68_HUMAN.sprot ./test/
Note that if you are working on your own laptop, the "cp" commands should be replaced with "scp", i.e.
scp -r"*" ./data/
using the CBS username and password given in the lecture.

Now you have to check if your shell is "/bin/tcsh" by typing

echo $SHELL
If this is not the case, you can change the shell using the command
specifying the shell to be "/bin/tcsh". Next, open the file ".cshrc" in your favorite editor and change the line
setenv ALGOHOME /home/people/xxxxx
to have the path to your home directory. Finally type
source .cshrc
and you should be ready to go.

How to talk to a C program using argv and argc

Go to the src directory

cd src
As we saw earlier with the programs sp2fsa and fsa2fsa, we had to hardcode the names of the input files in the C code. This is clearly not optimal. C allows you to talk to the program in a very simple manner using the predefined variables argv[] and argc. If you define the main function in the C program as
main(int argc, char *argv[])


Then the variable argc will contain the number of arguments given to the program (counting the program name itself), and argv[] the values of these arguments. We can illustrate this by making a small program called test_args.c with the content
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
	int	i;

	/* Print the argument count and each argument; argv[0] is the program name */
	for ( i=0; i < argc; i++ ) 
		printf( "argc: %i argv[%i]: %s\n", argc, i, argv[i]);

	exit( 0 );
}
Now compile the program, and try to execute it without any arguments.
As you can see, the value of argc is 1 and the value of argv[0] is the program name test_args. You can try to execute the program with more arguments
test_args arg1 arg2 arg3

Dynamic memory allocation

We have also seen that it would be convenient to be able to allocate memory on the fly, rather than as a fixed size hard-coded in the C program. For instance, we defined the length of the sequence in the FSA entry to be 1024
typedef struct fsa {
        char    seq[1024];
        char    name[256];
        int     len;
} FSA;
This might be OK in most cases, but what if the sequence has a length of 1025 characters?

C allows you to allocate memory dynamically during the execution. The general command is

v = ( int * ) malloc(( unsigned ) ( N ) * sizeof(int));
where "( int * )" is an optional cast that converts the generic pointer returned by malloc into a pointer to an int (we will get back to pointers in a while), N is the number of integers you want in your array, and sizeof(int) is the amount of memory needed to store each integer.

This call to malloc will hence return a block of memory that can hold N integers. You can next access the individual elements, using v as an array, i.e v[0], v[1], etc.

If the malloc function fails to allocate the requested memory, it will return NULL.

When the memory is no longer needed, you must release it by calling the function free.

free( v );
Now go back to the program fsa2fsa that you made earlier to read and print a FASTA file, and change the code (i.e. make a second program fsa2fsa_2.c) so that it reads the filename of the FASTA file as argv[1], and allocates the memory for the sequence element of the FSA structure dynamically.

To do this you must among other things change the FSA structure to

typedef struct fsa {
        char    *seq;
        char    name[256];
        int     len;
} FSA;


Pointers

Pointers are one of the most powerful features of the C programming language. However, they are also the feature that causes the most trouble. A pointer is nothing but an address in memory. The pointers we have seen so far are
FILE	*fp;
char	*seq;
int	*v;
These are all pointers to locations in memory. fp points to a structure describing the open file, seq points to the beginning of the string, and v points to the beginning of an array of integers. You can find more information about pointers in Steve Summit's notes on pointers.

In C, arguments are passed to functions by value while other languages may pass variables by reference. This means that the receiving function gets copies of the values and has no direct way of altering the original variables. For a function to alter a variable passed from another function, the caller must pass its address (a pointer to it).

Here is an example of how an argument that is passed by value is NOT modified when returning from the function call

#include <stdio.h>

void    dummy( int i )
{
        /* Only the local copy of i is changed */
        i = i + 1;
}

int     main(int argc, char *argv[])
{
        int     i;

        i = 3;

        dummy( i );

        printf( "I: %i\n", i );

        return( 0 );
}

Try to make this program and confirm that the value of i printed is 3. If you change the program so that the variable i is passed to the subroutine by reference (i.e. as a pointer), you will get the behaviour you want.
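#include <stdio.h>

void    dummy_wpointer( int *i )
{
        /* (*i) is the variable that the pointer i points to */
        (*i) = (*i) + 1;
}

int     main(int argc, char *argv[])
{
        int     i;

        i = 3;

        dummy_wpointer( &i );

        printf( "I: %i\n", i );

        return( 0 );
}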

Here the syntax "&i" gets the address of the variable i, i.e. the pointer to the location in memory where the variable i is stored. Again, confirm that this is the case by making a program with the above code.

Now you will be able to understand the code below

char	line[1024], name[64];
char	a,b;
int	n;

sscanf( line, "%c %c %s %i", &a, &b, name, &n );
In particular can you understand why the code has "&a", "&b", and "&n", but no "&" in front of the variable name?

Linked lists

Reading input from standard input or a file is always troublesome. A simple code reading lines from a file could look as follows (as we have seen earlier)

if ( ( fp = fopen( filename, "r" )) == NULL ) {
        printf( "Error. Cannot read from file %s. Exit\n",
                filename );
        exit( 1 );
}

while ( fgets(line, sizeof line, fp) != NULL ) {

/* Write some code to work on the content of the variable line */

}

fclose( fp );

This is some code that you will need to use over and over again, so it is convenient to do this in a subroutine. I.e a function like

linelist = linelist_read( filename );
The linelist_read routine reads from the file called filename, and returns a linked list of the lines in the file. Each element in the list is defined as
typedef struct linelist {
        struct linelist *next;
        LINE    line;
        int     nw;
        char    **wvec;
} LINELIST;
Here, the essential elements are line and next. The variable line contains the text from one line of the file, and next is a pointer to the next element in the list. This structure can be shown like
            ------------       ------------       ------------
            |  line1   |       |  line2   |       |  line3   |
linelist -> ------------   /-> ------------   /-> ------------   /-> NULL
            |      ----|--/    |      ----|--/    |      ----|--/
            ------------       ------------       ------------

You can access each variable of the linelist element as, for instance, linelist->line and linelist->next.

Note that as linelist is a pointer to a linelist structure, the syntax for accessing the elements of the structure is "->" and not "." as used earlier.

Once you have read the linked list using the linelist = linelist_read( filename ); command, you can access each element in the list, and hence each line in the input file, using the for loop command

for ( ln = linelist; ln; ln=ln->next ) {

/* Do some stuff on the line stored in ln->line */

}

Here the first part of the for command (ln = linelist) sets the variable ln to point to where the variable linelist is pointing, that is, to the beginning of the list, i.e. the list element containing the first line of the file. The next part of the for tells the program to keep looping while the variable ln is true. This is a compact way of writing ln != NULL, i.e. until the end of the linked list. Finally, the last part of the for moves the ln pointer to the next element of the linked list.

The functions to read and manipulate the linelist structure are defined in the utils directory

The "lstutils.h" file contains the declarations of the functions, and the "lstutils.c" file the actual code implementing the functions. For instance, you can see that the LINELIST structure is defined as
typedef struct linelist {
        struct linelist *next;
        LINE    line;
        int     nw;
        char    **wvec;
} LINELIST;
and the linelist_read function is defined as
LINELIST        *linelist_read( char *filename )
{
        LINELIST        *first, *last, *new;
        FILE            *fp;
        int             ff, fc;
        LINE            line, text;
        int             n;

        first = NULL;
        n = 0;

        if ( ( fp = stream_input( filename, &fc, &ff )) == NULL ) {
                printf( "Error. Cannot read NAMELIST from file %s. Exit\n",
                        filename );
                exit( 1 );
        }

        while ( fgets(line, sizeof line, fp) != NULL ) {

                /* Skip comment lines */
                if ( strncmp( line, "#", 1 ) == 0 )
                        continue;

                /* Skip empty lines */
                if ( strlen( line ) < 1 )
                        continue;

                /* Copy the line without the trailing newline */
                if ( sscanf( line, "%[^\n]", text ) != 1 )
                        continue;

                if ( ( new = linelist_alloc()) == NULL ) {
                        printf( "Error. Cannot allocate linelist. Exit\n" );
                        exit( 1 );
                }

                strcpy( new->line, text );

                /* Link the new element to the end of the list */
                if ( first == NULL )
                        first = new;
                else
                        last->next = new;

                last = new;
                n++;
        }

        stream_close( fp, fc, filename );

        if ( list_verbose )
                printf( "# Read %i elements on linelist %s\n", n, filename );

        return( first );
}

Most of the code should be familiar to you by now. Only the calls
fp = stream_input( filename, &fc, &ff );
stream_close( fp, fc, filename );
have not been described. These functions are defined in the utils library file


Similar linked list structures are available in the utils library for dealing with FASTA files.
fsalist = fsalist_read( filename );
The fsalist_read routine reads from the file called filename, and returns a linked list of the FASTA entries in the file. Each FASTA entry in the list is defined as
typedef struct fsalist  {
        struct  fsalist *next;
        char    *seq;
        char    name[256];
        int     len;
        float   score;
        int     *i;
} FSALIST;

Here the essential parts are the variables seq containing the FASTA sequence, name containing the FASTA name, len containing the length of the FASTA sequence, and next a pointer to the next element in the list.

There are several routines available for dealing with FASTA format files. Here follow a few (they are documented in fsalist.c in the utils directory).

void    fsalist_free( FSALIST *fsa );
FSALIST *fsalist_alloc();
FSALIST *fsalist_read_single( FILE *fp );
FSALIST *fsalist_find( char *name, FSALIST *list, char *pattern );
void    fsa_print( FSALIST *l );
void    fsalist_iassign_profile_order( FSALIST *fsalist );

In particular the last routine fsalist_iassign_profile_order will be useful. It defines the vector element i of the fsalist structure to contain the position of each amino acid in the FASTA sequence in the BLOSUM alphabet. The routine is implemented as

for ( fsa = fsalist; fsa; fsa = fsa->next) {

        n = fsa->len;

        /* One integer per residue in the sequence */
        fsa->i = ivector(0, n - 1);

        for (i = 0; i < n; i++)
                fsa->i[i] = strpos(PROFILE_ORDER, fsa->seq[i]);
}

What a function does

This brings us to the last part before solving today's programming exercise. As you have seen, some functions used in the C programs are standard C language functions (like malloc, strlen, sscanf, fopen, etc.), and others are functions implemented in the utils library. If you see a command, say strlen, whose function you do not know, try to type

man strlen

If this is a standard C routine, you will get a detailed description. If not, look in the utils directory. You do this by typing (say you want to find out how the routine strpos works)

grep strpos utils/*.h

This will return

strutils.h:extern int strpos(char *s, char c);

to the screen. The routine strpos is hence defined in the strutils.c file. Open this file in your favorite editor and search for the definition of strpos

int strpos(char *s, char c)
{
        int     i, l = strlen(s);

        /* Return the index of the first match */
        for (i = 0; i < l; i++)
                if (c == s[i])
                        return (i);

        /* The character was not found */
        return (-1);
}

You see that the routine strpos takes two arguments as input. The first argument is a pointer to a char (that is an efficient way to pass a string), and the second argument is a character. The routine returns the position of the first occurrence of the character c in the string s, or the value -1 if the character is not found in s. Note that the first character in s is at position 0.

C-programming template

Now we can begin to work on a more complex program. Open the file c-program_template.c in your favorite editor.

This file contains a template you can use for future C programs. The first part of the file contains the include commands adding the different libraries you might need (the lines with #include). Next follows a section declaring the command-line parameters for the program. These fall into different types (float, double, int, longint, string, switch, and char). Next follow the subroutines and functions, and finally the main program. Each C program must always have a main program definition. In the main program, we first declare the variables needed (this we also do in each subroutine and function), next we parse the command-line parameters (with the call to pparse), and then we start to code.

Note that many of the functions used in the program are routines made by other C programmers (like myself), and are not part of the C language. All these routines are part of the utils library found in your utils directory. For instance, the command-line parser (pparse) is found in the file utils/sysutils.c.

The function of the c-program_template.c template is simple. It will read and print the content of either a fasta file (a file in the FASTA format) or a peptide file (a file with peptides and binding affinity data). The type of the input is specified with a command-line parameter. Fasta files are read with the routine fsalist_read, and peptide files are read with the routine linelist_read. Both routines read the input data into a linked list data structure. The fsalist_read routine reads one or many fasta entries and the linelist_read routine reads one or many input lines. Each element in the linked list is accessed using the loop structure

for ( ln = linelist; ln; ln=ln->next )
for the linelist, and

for ( fsa = fsalist; fsa; fsa=fsa->next )
for the fsalist. You can of course write your own code to parse input data, but the two examples shown here will be very useful. Make sure you understand what each line of the program is doing.

You can now compile the program by typing

make c-program_template

You can print information on how to use the program by typing

c-program_template -h

and test the program on some data by typing

c-program_template ../data/test.pep | more
c-program_template -fsa ../data/test.fsa | more

This is all for now!