
A Comparison Study: Applying Segmentation to Array CGH Data for Downstream Analyses

Please see paper for details:
Hanni Willenbrock and Jane Fridlyand
A Comparison Study: Applying Segmentation to Array CGH Data for Downstream Analyses
[link]



Supplement Data and Discussions


Methods

GLAD Modifications


Optimization of Parameters for MergeLevels



Breakpoint Detection

Figure S 1 (warning: large file ~300 MB) [Postscript] [PDF]
HMM predicted log2 ratios before merging (upper), after merging with MergeLevels (middle) and after merging with GLADmerge (lower). Black: original log2 ratios, red: "true" log2 ratios, blue: predicted or merged log2 ratio levels.

Figure S 2 (warning: large file ~300 MB) [Postscript] [PDF]
GLAD predicted log2 ratios before merging (upper), after merging with MergeLevels (middle) and after merging with GLADmerge (lower). Black: original log2 ratios, red: "true" log2 ratios, blue: predicted or merged log2 ratio levels.

Figure S 3 (warning: large file ~300 MB) [Postscript] [PDF]
DNAcopy predicted log2 ratios before merging (upper), after merging with MergeLevels (middle) and after merging with GLADmerge (lower). Black: original log2 ratios, red: “true” log2 ratios, blue: predicted or merged log2 ratio levels.

Signal/noise Dependent Performance of the Segmentation Methods

Figure S 4 [Postscript] [PDF]
Alternative use of HMM length distribution for simulation. Simulation results for breakpoint detection by HMM, DNAcopy or GLAD, either directly after segmentation or after removal of excess breakpoints by MergeLevels or GLADmerge. The plots show the median sensitivity and corresponding average number of false negatives (FN) (left plot) and the false discovery rate (FDR) (right plot), with error bars depicting the interquartile range. A breakpoint was classified as correctly identified if it was predicted at its exact location (w=0) or within an offset of 1-2 clones (w=1-2) of a true breakpoint.

Figure S 5 [Postscript] [PDF]
Alternative use of GLAD length distribution for simulation. Simulation results for breakpoint detection by HMM, DNAcopy or GLAD, either directly after segmentation or after removal of excess breakpoints by MergeLevels or GLADmerge. The plots show the median sensitivity and corresponding average number of false negatives (FN) (left plot) and the false discovery rate (FDR) (right plot), with error bars depicting the interquartile range. A breakpoint was classified as correctly identified if it was predicted at its exact location (w=0) or within an offset of 1-2 clones (w=1-2) of a true breakpoint.
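The breakpoint classification with offset w used in these figures can be illustrated with a short sketch. This is only an illustration under stated assumptions: the function name `score_breakpoints` is hypothetical, breakpoints are given as clone indices, and a simple "any prediction within w clones" rule is assumed, which may differ from the matching rule in the paper's R scripts.

```python
def score_breakpoints(true_bps, predicted_bps, w=0):
    """Return (sensitivity, fdr) for predicted breakpoints.

    true_bps, predicted_bps: clone indices of breakpoints (assumed format).
    w: allowed offset in clones between a prediction and a true breakpoint.
    """
    true_bps = sorted(set(true_bps))
    predicted_bps = sorted(set(predicted_bps))

    # A true breakpoint counts as detected if any prediction lies within w clones.
    detected = sum(
        any(abs(t - p) <= w for p in predicted_bps) for t in true_bps
    )
    # A prediction counts as a false positive if no true breakpoint lies within w clones.
    false_pos = sum(
        all(abs(p - t) > w for t in true_bps) for p in predicted_bps
    )

    sensitivity = detected / len(true_bps) if true_bps else float("nan")
    fdr = false_pos / len(predicted_bps) if predicted_bps else 0.0
    return sensitivity, fdr


if __name__ == "__main__":
    truth = [10, 40, 70]
    predicted = [10, 41, 90]
    for w in (0, 1, 2):
        print(w, score_breakpoints(truth, predicted, w=w))
```

Increasing w can only raise sensitivity and lower the FDR under this rule, which is why the figures report w=0 separately from w=1-2.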

Spatial Resolution of the Segmentation Methods

Table S 1 and S 2 [PDF]
Median of each performance measure for original log2 ratios, DNAcopy or GLAD predicted log2 ratios, or log2 ratios merged by MergeLevels or GLADmerge. Medians are based on 500 simulated samples.

Discussion: Alternative Simulation Model

Figure S 6 [Postscript] [PDF]
DNAcopy segmentation, followed by merging with MergeLevels and GLADmerge, for data simulated according to the model in the GLAD paper. Black: original log2 ratios. Red: "true" log2 ratios. Blue: predicted or merged log2 ratio levels.

Figure S 7 [Postscript] [PDF]
Sensitivity and specificity of breakpoint detection by HMM, DNAcopy and GLAD for predicted breakpoints (Predicted) and for breakpoints after merging by MergeLevels or GLADmerge. Breakpoints were counted as correctly identified at the exact location (w=0) or within 1-2 clones (w=1 or w=2).

Tables S 3, S 4 and S 5 [PDF]


Copy Number Association Study: Testing

Figure S 8 [Postscript] [PDF]
Results for copy number association power study. For each approach, the median sensitivity and specificity are shown with error bars depicting the interquartile range. Results are based on 500 datasets each consisting of 20 samples. From left: Original log2-ratios, segmented log2-ratios and segmented T-statistics. For the HMM, smoothed values were used; for DNAcopy, the segment means were used; and for GLAD, segment medians were used.
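The difference between using segment means (as for DNAcopy) and segment medians (as for GLAD) can be shown with a minimal sketch. It assumes the segmentation is already available as a list of indices at which a new segment starts; the function name `smooth_by_segment` and this input format are illustrative assumptions, not part of any of the segmentation packages.

```python
from statistics import mean, median


def smooth_by_segment(values, segment_starts, summary=mean):
    """Replace each value by the summary (mean or median) of its segment.

    values: log2 ratios (or T-statistics) ordered along the chromosome.
    segment_starts: indices where a new segment begins (index 0 is implied).
    """
    edges = [0] + sorted(segment_starts) + [len(values)]
    smoothed = []
    for start, end in zip(edges[:-1], edges[1:]):
        segment = values[start:end]
        if segment:  # skip empty segments from duplicated boundaries
            smoothed.extend([summary(segment)] * len(segment))
    return smoothed


if __name__ == "__main__":
    ratios = [0.0, 0.2, 0.1, 1.0, 1.2, 0.8]
    print(smooth_by_segment(ratios, [3], summary=mean))
    print(smooth_by_segment(ratios, [3], summary=median))
```

The same helper can be applied to per-clone T-statistics instead of log2 ratios, which corresponds to the "segmented T-statistics" panel in spirit only.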

Discussion: gFWER(k)-controlling single-step common-cutoff augmentation procedure

Figure S 9 [Postscript] [PDF]
Examples of T-statistics smoothed by HMM for simulated heterogeneous data. Black: clones with no true copy number difference. Red: clones with a true copy number difference. Left: sample 8, where the maxT (k=0) cutoff is too conservative and only a few clones are correctly classified. Using the 6th most extreme (k=5) T-statistic from the permutation distribution, most copy number differences were identified correctly. Right: sample 80. Using k=0, most copy number differences are already identified correctly. Using k=5, only a few more are identified, at the cost of an additional false positive (type I error).

Figure S 10 [Postscript] [PDF]
Sensitivity and specificity for heterogeneous data using the gFWER(k)-controlling single-step common-cutoff augmentation procedure with k=5. Medians are shown with error bars depicting the interquartile range. From left: original log2 ratios, smoothed log2 ratios and smoothed T-statistics. Smoothing was done by HMM (smoothed values), DNAcopy, and GLAD (median for each region).
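The idea of taking the (k+1)-th most extreme T-statistic from the permutation distribution, as described for Figure S 9, can be sketched as follows. This is a simplified common-cutoff rule written from the caption alone: the function name `common_cutoff`, the input layout (one list of per-clone T-statistics per permutation) and the quantile convention are assumptions, and the paper's actual augmentation procedure may differ in detail.

```python
import math


def common_cutoff(perm_stats, k=0, alpha=0.05):
    """Common cutoff from permutation T-statistics.

    perm_stats: list of permutations, each a list of per-clone T-statistics
                (assumed layout).
    k: number of tolerated false positives; k=0 gives a maxT-style cutoff.
    alpha: nominal error level.
    """
    per_permutation = []
    for stats in perm_stats:
        ranked = sorted((abs(t) for t in stats), reverse=True)
        # (k+1)-th most extreme absolute statistic within this permutation
        per_permutation.append(ranked[k])

    per_permutation.sort()
    # Empirical (1 - alpha) quantile over permutations
    idx = math.ceil((1 - alpha) * len(per_permutation)) - 1
    return per_permutation[idx]


if __name__ == "__main__":
    perms = [
        [1, -5, 3],
        [2, 2.5, -0.5],
        [4, -1, 0],
        [0.2, 6, -3],
    ]
    print(common_cutoff(perms, k=0, alpha=0.25))
    print(common_cutoff(perms, k=1, alpha=0.25))
```

Clones whose observed absolute T-statistic exceeds the returned cutoff would be called significant. A larger k gives a cutoff that is never higher than the k=0 cutoff, which is consistent with the caption's observation that k=5 recovers more true differences at the risk of extra false positives.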


Real data: Oral Squamous Cell Carcinoma

Figure S 11 [PDF]
Frequency plots for the two p53 subtypes calculated with the two procedures. The height of the vertical bars indicates the proportion of samples in a given subtype where the clone was gained (positive) or lost (negative). A: threshold-based procedure applied clone by clone. B: segmentation/merging-based procedure.
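A threshold-based, clone-by-clone frequency calculation of the kind shown in panel A could look like the sketch below. The threshold value of 0.25 and the function name `gain_loss_frequency` are purely illustrative assumptions; the thresholds actually used for the oral squamous cell carcinoma data are given in the paper, not on this page.

```python
def gain_loss_frequency(log2_ratios, threshold=0.25):
    """Per-clone proportion of samples with a gain or a loss.

    log2_ratios: list of samples, each a list of per-clone log2 ratios of
                 equal length (assumed layout).
    threshold: hypothetical cutoff; a clone is called gained above
               +threshold and lost below -threshold.
    Returns (gain, loss); loss proportions are negative so that they plot
    below the axis, as in the frequency plots.
    """
    n_samples = len(log2_ratios)
    n_clones = len(log2_ratios[0])
    gain = [
        sum(sample[j] > threshold for sample in log2_ratios) / n_samples
        for j in range(n_clones)
    ]
    loss = [
        -sum(sample[j] < -threshold for sample in log2_ratios) / n_samples
        for j in range(n_clones)
    ]
    return gain, loss


if __name__ == "__main__":
    samples = [
        [0.5, -0.4, 0.0],
        [0.3, 0.1, -0.1],
        [0.0, -0.6, 0.05],
        [0.4, -0.3, 0.0],
    ]
    print(gain_loss_frequency(samples))
```

For the segmentation/merging-based variant in panel B, the same counting would be applied to merged segment levels rather than to raw log2 ratios.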

Figure S 12 [Postscript] [PDF]
Plot of T-statistics and cutoffs. Black: T-statistics based on original unsmoothed log2 ratios and the maxT cutoff at alpha=0.05. Red: T-statistics based on segmented log2 ratios (A) and DNAcopy segmented T-statistics (B) with maxT cutoff at alpha=0.05.



Simulation data and R scripts

Empirical length distribution for simulations
length.distr.RData

For breakpoint detection and merging
500 samples with 20 chromosomes of 100 clones for comparison of breakpoint detection and merging:
Data: 20chromosome.simulated.data.RData
Script: Generate.merge.data.R

For spatial resolution study
4 different datasets with varying length distributions:
<5: simulated.below5.data.RData
>=5 & <10: simulated.5to10.data.RData
>=10 & <20: simulated.10to20.data.RData
>20: simulated.above20.data.RData

For testing
500 datasets with 20 samples each with 500 clones on 1 chromosome:
Data: Heterogeneous.simulated.data.RData
Script: Generate.heterogeneous.data.R

MergeLevels function for R
Description
Script


Oral Squamous Cell Carcinoma array CGH Data

Array CGH data
Phenotype data





For questions, please e-mail:
Hanni Willenbrock or Jane Fridlyand.





Last changed: March 19th, 2007 by Hanni Willenbrock