
QT clustering

Description
The program reads a number of data points (multi-dimensional vectors) from a file and partitions them into clusters. Clustering is important for discovering patterns or modes in multi-dimensional data sets; it is a way of organizing data examples into similar groups (clusters). In this particular case, QT clustering partitions the data set such that each example (data point) is assigned to exactly one cluster. QT clustering is superior to K-means clustering in that the number of clusters is not given beforehand, and it yields the same result in repeated runs. It requires more CPU time, though.

Input and output
The input is a tab-separated file containing one data point on each line. Each data point is a vector consisting of a number of numbers :-) The program should handle any given vector size, but the vector size is constant within any one data file. Input file example:

ex01	8.76	3.29	1.05
ex02	12.3	2.33	3.53
ex03	-0.54	-3.56	1.45
.
.
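A minimal sketch of how this format could be read in Perl (the variable names, such as %points, are my own choices, not part of the assignment):

#!/usr/bin/perl
use strict;
use warnings;

my ($file, $arg) = @ARGV;    # e.g. vectors.txt and 500 or 30%
open(my $fh, '<', $file) or die "Cannot open $file: $!";
my %points;                  # point name => reference to coordinate array
while (my $line = <$fh>) {
    chomp $line;
    next if $line =~ /^\s*$/;    # skip blank lines
    my ($name, @coords) = split /\t/, $line;
    $points{$name} = \@coords;
}
close $fh;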
The output is all data points, partitioned into the clusters they belong to. Output example, where each cluster starts with the cluster name and is followed by the members of that cluster:
Cluster-1
ex10	1.04	2.98	1.34
ex12	1.23	2.34	1.69
.
.
Cluster-2
ex04	-0.34	3.51	9.02
ex07	-8.56	5.12	12.5
.
.
Examples of program execution:
cluster.pl vectors.txt 500
The 500 is the maximum diameter for a cluster. An interesting twist could be to automatically decide the cluster diameter like this: X % of the distance between the two data points furthest away from each other. Called like this:
cluster.pl vectors.txt 30%
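The assignment does not fix a distance measure; assuming the usual Euclidean distance, the threshold argument could be handled with a sketch like this (dist and percent_to_diameter are my own helper names):

# Euclidean distance between two points (array references).
sub dist {
    my ($p, $q) = @_;
    my $sum = 0;
    $sum += ($p->[$_] - $q->[$_]) ** 2 for 0 .. $#$p;
    return sqrt $sum;
}

# Turn e.g. 30 (from "30%") into an absolute diameter:
# X % of the largest pairwise distance in the data set.
sub percent_to_diameter {
    my ($percent, $points) = @_;    # $points: hash ref of name => vector
    my @names = keys %$points;
    my $max   = 0;
    for my $i (0 .. $#names - 1) {
        for my $j ($i + 1 .. $#names) {
            my $d = dist($points->{$names[$i]}, $points->{$names[$j]});
            $max = $d if $d > $max;
        }
    }
    return $max * $percent / 100;
}

The second command line argument would then be interpreted as, for example:

my $diameter = $arg =~ /^([\d.]+)%$/ ? percent_to_diameter($1, \%points) : $arg;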

Details
Here is a small list of data points and a big list you can use for your program. Wikipedia has a page on clustering algorithms, where QT clustering is no longer described.
The algorithm works like this.
1) For each point in the data set, calculate a candidate cluster. Choose the candidate cluster that contains the most points as the "real" cluster and remove those points from the data set.
2) Repeat step 1 until there are no points left in the data set (or a set limit has been reached; for example, clusters with fewer than, say, 10 points are not true clusters, but noise).
3) Print the resulting clusters.
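In outline, this loop could look like the following sketch; candidate_cluster is a helper of my own naming (defined after the construction rules below), and the tie-breaking on diameter described next is left out for brevity:

my $min_size   = 1;    # raise to e.g. 10 to discard small "noise" clusters
my $cluster_no = 0;
while (keys %points) {
    # Step 1: build a candidate cluster for every remaining point
    # and keep the biggest one.
    my @best;
    for my $name (sort keys %points) {
        my @cand = candidate_cluster($name, \%points, $diameter);
        @best = @cand if @cand > @best;
    }
    last if @best < $min_size;    # step 2: stop when only noise remains
    # Step 3: print the winning cluster and remove it from the data set.
    $cluster_no++;
    print "Cluster-$cluster_no\n";
    print join("\t", $_, @{$points{$_}}), "\n" for @best;
    delete $points{$_} for @best;
}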

If two candidate clusters contain an equal number of points, choose the one with the smallest diameter. If the diameters are also equal, just pick the first one you find.
A candidate cluster for a point is calculated using "complete linkage" like this: Consider the point as the beginning of the candidate cluster for that point; this is trivially a subset of your data set. Now add one point from your data set at a time, in such a way that you extend the candidate cluster diameter the least. Again, if two points would extend the diameter the least, pick the first one. Continue adding points to your candidate cluster until the diameter exceeds the Quality Threshold (hence the name QT clustering). The point that makes the diameter exceed the QT is not part of the candidate cluster. The diameter of a data set (candidate cluster) is the distance between the two points furthest from each other.
It is important to note that building a candidate cluster this way is NOT the same as adding the point nearest to the starting point, or nearest to any point in the growing candidate cluster.
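A direct, unoptimized sketch of this construction (using the dist helper from above; again, the names are mine):

sub candidate_cluster {
    my ($start, $points, $max_diameter) = @_;
    my @cluster  = ($start);
    my $diameter = 0;
    my %left     = map { $_ => 1 } grep { $_ ne $start } keys %$points;
    while (%left) {
        # Find the point that extends the cluster diameter the least.
        my ($best, $best_diam);
        for my $cand (sort keys %left) {
            # Complete linkage: the new diameter is the largest distance
            # from the candidate to any member, or the old diameter.
            my $d = $diameter;
            for my $member (@cluster) {
                my $dd = dist($points->{$cand}, $points->{$member});
                $d = $dd if $dd > $d;
            }
            ($best, $best_diam) = ($cand, $d)
                if !defined $best or $d < $best_diam;
        }
        last if $best_diam > $max_diameter;    # adding it would break the QT
        push @cluster, $best;
        $diameter = $best_diam;
        delete $left{$best};
    }
    return @cluster;
}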

A fairly large part of this project is optimizing the algorithm just described. This is done by gaining insight into the algorithm: not calculating what does not need to be calculated, and not calculating the same thing again and again.
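One example of such a saving, as a sketch: every pairwise distance is needed many times across candidate clusters, so the distances could be computed once up front and looked up afterwards (at the obvious memory cost of storing all pairs):

# Precompute all pairwise distances once; later lookups are O(1).
my %dist;
my @names = sort keys %points;
for my $i (0 .. $#names - 1) {
    for my $j ($i + 1 .. $#names) {
        my $d = dist($points{$names[$i]}, $points{$names[$j]});
        $dist{$names[$i]}{$names[$j]} = $d;
        $dist{$names[$j]}{$names[$i]} = $d;
    }
}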
Here is the result of clustering the small list with a diameter of 30%, which you can use to check the correctness of your program, since the algorithm is deterministic.