|
k-nearest neighbor (k-NN) continuous variable estimation
Description
This scripts read a matrix-styled datafile, containing missing values, and infers these values by finding the k-nearest neighbors. An application of this can be seen in Microarray experiments, in which the observed signal is not always significantly different from the background signal. Imputing these values are a cheaper solution rather than redoing the whole experiment. This method has been shown to perform better than e.g. rowmeans, and faster than SVL approaches (Source #1). An important step in validating new methods in science is to asses the error rates and accuracy feeding already-known results to a program, and verifying the output is in accordance.
It is expected that the final submission of this project contains:
- Two scripts: One implementing the k-NN algorithm, the other removing datapoints and validating accuracy
- Within the report should be graphs presenting the accuracy of your implementation, for a variety of missing field percentages, as well as different k's
Input and Output
The input should be a matrix-styled file, containing an identifier for each field M(i,j), in which a number of
entries is listed as 'NA', as can be seen below.
Probe_ID ref1 ref2 ref3 week1 week2 week3
100002 68.42 89.72 76.86 NA NA 103.03
100037 20.79 23.86 25.65 20.6 NA 20.26
100039 3.95 5.94 5.51 6.49 NA 6.21
.
.
.
100058 64.29 80.97 64.92 42.49 34.54 39.08
In this example, the rows represent probes, and columns individual samples. The datafile provided contains a FULL dataset,
with no NAs, from an actual experiment (E-GEOD-10590).
It is up to YOU to create the missing entries in this datafile, and this should be part of your project.
The output of this script should be a matrix in which all NAs are filled with numbers, using the kNN algorithm.
You must yourself find a method of evaluating your results, when comparing to the original datafile. It is a good
idea to look into the literature for this (Both source #1 and #2 presents methods for this).
Examples of execution could be:
# Generation of test-set missing 5% of all values in the data.txt using the 10 nearest neighbors
generate_test.pl -missing 5% -file data.txt -k 10
The above script will then call the script containing the kNN algorithm, and pass onto it the generated test set,
as well as the k parameter.
To get input from commandline, it is suggested to use the
Getopt::Long module from CPAN.
This is, however, no requirement.
Literature
- Olga Troyanskaya, Michael Cantor, Gavin Sherlock et al. Missing value
estimation methods for DNA. Bioinformatics 2001, 6:17, p. 520-525
Link
- Ki-Yeol Kim, Byoung-Jin Kim and Gwan-Su Yi. Reuse of imputed data in
microarray analysis increases imputation efficiency. BMC
Bioinformatics 2004, 5:160, p. 160
Link
|