Events News Research CBS CBS Publications Bioinformatics
Staff Contact About Internal CBS CBS Other

K-means clustering

Description
The program reads a number of data points (multi-dimensional vectors) from a file and partitions those into K clusters. Clustering is important in discovering patterns or modes in multi-dimensional data sets. It is also a method of organizing data examples into similar groups (clusters). In this particular case, K-means partitions the data set such that each example (data point) is assigned to exactly one cluster.

Input and output
The input is a tab separated file containing one data point on each line. Each data point is a vector consisting of a number of numbers :-) The program should handle any given vector size, but the vector size is constant in any data file. Input file example:

ex01	8.76	3.29	1.05
ex02	12.3	2.33	3.53
ex03	-0.54	-3.56	1.45
.
.
The output is all data points, partitioned in the clusters they belong to. Output example where each cluster starts with the cluster centroid and is proceeded by the the members of that cluster:
Cluster-1	2.13	3.24	1.54
ex10	1.04	2.98	1.34
ex12	1.23	2.34	1.69
.
.
Cluster-2	-4.13	1.25	6.34
ex04	-0.34	3.51	9.02
ex07	-8.56	5.12	12.5
.
.
Examples of program execution:
cluster.pl vectors.txt 5
The 5 is the number of clusters the data point should be partioned in.

Details
Here is a small list of data points and a big list you can use for your program. Wikipedia has a page on K-means clustering.