Advanced Perl 1: Advanced Data Structures

Nothing Index Next

Sparse N-dimensional matrix

If you have a situation where you need a large sparsely populated N-dimansional matrix, then a very good and easy way is to use a hash. An example could be a three dimensional matrix which could be populated in this way:
$matrix{"$x,$y,$z"} = $value;
Every time you wanted to use a value from the matrix, you can access it like this:
$value = exists $matrix{"$x,$y,$z"} ? $matrix{"$x,$y,$z"} : 0;
Notice that $x, $y, $z is not limited to numbers, they could be SwissProt IDs.

References

Symbolic references are not used often in perl. It is when you in one variable have the name of another variable, that is you use a level of indirection. Example:
$string ="Hello World !\n";
$print_string = 'string';  # name of the variable you want to print
print ${$print_string};
This can of course be used with arrays and hashes. You can also create variables in perl with symbolic references. Example:
$variablename = 'newvar';
${$variablename} = "See me\n";
print $newvar;
This is generally a bad idea, as you don't know the name of the variable (if you did then you would not resort to this), and that makes it hard to use the variable in a program. Also this kind of stuff makes "use strict;" impossible.

Hard references are used far more in perl than symbolic references, as they are more useful. A hard reference is much like a pointer in other languages. A "normal" variable holds some data, a hard reference is a variable that refers/points to the location of the data in memory. You create hard references with the backslash operator. Examples:

$variablereference = \$variable;
$arrayreference = \@array;
$hashreference = \%hash;
$codereference = \&subroutine;
$filereference = \*STDOUT;
You can create anonymous variables, that is variables that do not have names. They only exists because of the references to them.
$anon_array_ref = [1, 2, 3];
$anon_hash_ref = {'alpha' => 1, 'beta' => 2, 'gamma' => 3};
Notice the use of brackets [] and braces {}, where you normally would have parenthetis, when creating an ordinary variable. This method is used when creating anonymous variables and objects in perl. You use/dereference the data that your reference points to in this way.
print $$variablereference;
$$arrayreference[0] = 1;
print $$arrayreference[0];
@tab = @$arrayreference;
$$hashreference{'alpha'} = 1;
print $$hashreference{'alpha'};
%hash = %$hashreference;
&$codereference('parameter');
print $filereference $data;
$data = <$filereference>
If you want some extra clarity you can put braces {} around the reference like this:
print ${$hashreference}{'alpha'};
It is also possible (and recommended) to use the infix operator -> on arrays, hashes and subroutines. The examples below are doing the same.
$$hashref{$key} = 'blabla';
${$hashref}{$key} = 'blabla';
$hashref->{$key} = 'blabla';
Simple references are especially good to use if you want to call subroutines with more than one array/hash as parameter. Actually it is the only way to do so.

Arrays of arrays

You can create a N-dimensional matrix in perl with arrays of arrays of arrays .... Here the principle will be shown with a 2-dimensional matrix, which is an array of references to anonymous arrays. A 3-dimensional matrix is an array of references of anonymous arrays of references, that refer to anonymous arrays - get the picture?
# Simple assignment of an array of arrays
my @AoA = ([1,2,3], ['John', 'Joe', 'Ib'], ['Eat', 2]);
print $AoA[1][2]; # prints Ib

# Suppose you want to read a matrix in from a file 
while (defined (my $line = <IN>)) {
   my @tmp = split(' ', $line); 
   push(@AoA, [@tmp]); } # Add anonumous array (row) to @AoA

# Suppose you want to add a column to a matrix
for (my $i = 0; $i <= $#AoA; $i++) {
   push(@{$AoA[$i]}, "Some value"); }

# You could also just assign the values
for (my $x = 0; $x <= 10; $x++) {
   for (my $y = 0; $y <= 10; $y++) {
      $AoA[$x][$y] = &func($x, $y); } }

# Printing/accessing the AoA
for (my $x = 0; $x <= $#AoA; $x++) {
   for (my $y = 0; $y <= $#{$AoA[$x]}; $y++) {
      print "At X=$x, Y=$y is ", $AoA[$x][$y], "\n"; } }

# A common mistake
print @AoA; # Simply prints a list of array references

Hashes of arrays

# Simple assignment
%HoA = ('Numbers' => [1, 2, 3], 'Names' => ['John', 'Joe', 'Ib']);

# Adding an array to the hash
my @tmp = split(' ', $line);
$HoA{'NewKey'} = [@tmp];

# Appending a new element to one the the arrays
push(@{$HoA{'NewKey'}}, 'SomeValue');

# Two ways of accessing the structure
print $HoA{'Numbers'}[1]; # prints 2
print $HoA{'Names'}->[1]; # prints Joe

Arrays of hashes

# Simple assignment
@AoH = ({'key1' => 'value1', 'key2' => 'value2},
        {'newhashkey1' => 'value1', 'key2' => 'value2},
        {'anotherhashkey1' => 'value1', 'key2' => 'value2});

# Adding anonymous hash to the array
push(@AoH, {'key' => 'val', 'xkey' => 'xval'});
$AoH[2] = {'key' => 'val', 'xkey' => 'xval'};

# Adding single key/value pair in one of the hashes
$AoH[1]{'NewKey'} = 'NewValue';

# Accessing the structure
for (my $i = 0; $i <= $#AoH; $i++) {
   foreach my $key (keys %{$AoH[$i]}) {
      print $AoH[$i]{$key}, "\n"; } }

Hashes of hashes

This is the most flexible (and unordered) datastructure in perl.
# Simple assignment
%HoH = ('Masterkey1' => {'Key1' => 'Value1', 'Key2' => 'Value2' }, 
        'Masterkey2' => {'Key1' => 'Value1', 'Key2again' => 'Value2again' } );

# Adding an anonymous hash to the hash
$HoH{'NewMasterKey'} = {'NewKey1' => 'NewValue1', 'NewKey2' => 'NewValue2'}; 

# Or if you have a hash you want to add
$HoH{'NewMasterKey'} = { %MyOldHash };

# Adding a key/value pair in the "second" level. Autovivification trap.
$HoH{'MasterKey1'}{'NewKey'} = 'NewValue';

# Printing/using a single value
print $HoH{'MasterKey1'}{'Key1'};

# Accessing the structure
foreach my $masterkey (keys %HoH) {
   print "First level: $masterkey\n";
   foreach my $key (keys %{$HoH{$masterkey}}) {
      print "$key => $HoH{$masterkey}{$key}\n"; } }

Exercises

It is fairly difficult to come up with good exercises that are short and involves complicated data structures, so if they seem contrieved or outright silly, then that's the reason.

Arrays of arrays:
Make a program that calculates the product of two matrices and prints it on STDOUT (the screen). The matrices are in the files mat1.dat and mat2.dat. Numbers in the files are tab separated.
Advice: The program should have a subroutine that reads a matrix from a given file (to be used twice), a subroutine that calculates the product, and a sub that prints a matrix. This way ensures that your program is easy to changes to other forms of matrix calculations. Here are two links to the definition of matrix multiplication.
http://www.mai.liu.se/~halun/matrix/matrix.html
http://mathworld.wolfram.com/MatrixMultiplication.html

Arrays of hashes:
I haven't yet figured out a good exercise.

Hashes of arrays:
In the file test1.dat is results from an experiment in the form

AccessionNumber   Number Number Number ....
.
.
In the files test2.dat and test3.dat are results from similar experiments but with a slightly different gene set. You want to average the numbers from all experiments for each acccession number. The output this therefore.
AccessionNumber SingleAverageNumberOfAll3Experiments
.
.
Advice: Construct a sub that adds to a HaA from a filename. It will be used trice.

Hashes of hashes:
In the file protdom.dat is a list of proteins with their PFAM domains, like this

SwisProtID domain1 domain2 ...
.
.
From this list you want to construct the opposite list, ie. a list of which proteins that contains a certain domain, like this
Domain SwisProtID SwisProtID SwisProtID ....
.
.
Save the list in domprot.dat.
This goal can be achieved in various ways, but today you should do it by using hash of hashes. This will also create an extremely efficient lookup table, should you ever need it.

This page was last updated         by Peter Wad Sackett, pws@cbs.dtu.dk