Events News Research CBS CBS Publications Bioinformatics
Staff Contact About Internal CBS CBS Other

k-nearest neighbor (k-NN) continuous variable estimation

Description

This scripts read a matrix-styled datafile, containing missing values, and infers these values by finding the k-nearest neighbors. An application of this can be seen in Microarray experiments, in which the observed signal is not always significantly different from the background signal. Imputing these values are a cheaper solution rather than redoing the whole experiment. This method has been shown to perform better than e.g. rowmeans, and faster than SVL approaches (Source #1). An important step in validating new methods in science is to asses the error rates and accuracy feeding already-known results to a program, and verifying the output is in accordance. It is expected that the final submission of this project contains:
  • Two scripts: One implementing the k-NN algorithm, the other removing datapoints and validating accuracy
  • Within the report should be graphs presenting the accuracy of your implementation, for a variety of missing field percentages, as well as different k's

Input and Output

The input should be a matrix-styled file, containing an identifier for each field M(i,j), in which a number of entries is listed as 'NA', as can be seen below.
Probe_ID	ref1	ref2	ref3	week1	week2	week3
100002	68.42	89.72	76.86	NA	NA	103.03
100037	20.79	23.86	25.65	20.6	NA	20.26
100039	3.95	5.94	5.51	6.49	NA	6.21
.
.
.
100058	64.29	80.97	64.92	42.49	34.54	39.08
In this example, the rows represent probes, and columns individual samples. The datafile provided contains a FULL dataset, with no NAs, from an actual experiment (E-GEOD-10590).
It is up to YOU to create the missing entries in this datafile, and this should be part of your project. The output of this script should be a matrix in which all NAs are filled with numbers, using the kNN algorithm. You must yourself find a method of evaluating your results, when comparing to the original datafile. It is a good idea to look into the literature for this (Both source #1 and #2 presents methods for this).

Examples of execution could be:

# Generation of test-set missing 5% of all values in the data.txt using the 10 nearest neighbors 
generate_test.pl -missing 5% -file data.txt -k 10
The above script will then call the script containing the kNN algorithm, and pass onto it the generated test set, as well as the k parameter. To get input from commandline, it is suggested to use the Getopt::Long module from CPAN. This is, however, no requirement.

Literature

  1. Olga Troyanskaya, Michael Cantor, Gavin Sherlock et al. Missing value estimation methods for DNA. Bioinformatics 2001, 6:17, p. 520-525 Link
  2. Ki-Yeol Kim, Byoung-Jin Kim and Gwan-Su Yi. Reuse of imputed data in microarray analysis increases imputation efficiency. BMC Bioinformatics 2004, 5:160, p. 160 Link