The program reads a number of data points (multi-dimensional vectors) from a file and partitions those into K clusters.
Clustering is important in discovering patterns or modes in multi-dimensional data sets.
It is also a method of organizing data examples into similar groups (clusters). In this particular case, K-means partitions
the data set such that each example (data point) is assigned to exactly one cluster.
Input and output
The input is a tab separated file containing one data point on each line. Each data point is a vector
consisting of a number of numbers :-) The program should handle any given vector size, but the vector
size is constant in any data file. Input file example:
ex01 8.76 3.29 1.05
ex02 12.3 2.33 3.53
ex03 -0.54 -3.56 1.45
The output is all data points, partitioned in the clusters they belong to.
Output example where each cluster starts with the cluster centroid and is
proceeded by the the members of that cluster:
Cluster-1 2.13 3.24 1.54
ex10 1.04 2.98 1.34
ex12 1.23 2.34 1.69
Cluster-2 -4.13 1.25 6.34
ex04 -0.34 3.51 9.02
ex07 -8.56 5.12 12.5
Examples of program execution:
cluster.pl vectors.txt 5
The 5 is the number of clusters the data point should be partioned in.
Here is a small list of data points and
a big list
you can use for your program.
Wikipedia has a page on K-means clustering.