Detailed copy number (CN) variation data can be obtained from Illumina 450k or EPIC methylation assays. However, the effects of different preprocessing strategies (normalization, transformation and selection of gain/loss cutoff values) on variant calling have not been evaluated systematically.
Results:
We provide an R package that allows direct comparison of any preprocessed CN data. It also provides its own methodology for detecting CN alterations: segments are identified by detecting changes in the variance of the CN data and are subsequently filtered for significance. Meaningful cutoffs for defining gains and losses can be identified automatically by analyzing the resulting ΔCN distributions of all analyzed samples. Three exemplary datasets (2x450k, 1xEPIC) were selected for comparative analyses of the Raw, Illumina, SWAN, Quantile, Noob, Funnorm and Dasen normalizations. Importantly, all CN data distributions were skewed (skewness -0.66 to -1.2) and therefore required different gain/loss cutoffs. Depending on the normalization method, prominent baseline differences between samples were observed. We present a workflow that alleviates both issues: Z-transformation removes baseline differences between samples, and automatic cutoff selection circumvents the problems that accompany skewed distributions. Additional filtering of candidates by significance yields comparable results for all enumerated normalization methods except SWAN. In contrast, manual cutoff determination results in highly variable numbers of variant calls that depend strongly on the selected normalization method. Taken together, we present a workflow that allows robust identification of copy number alterations in methylation array data, largely independent of the applied normalization.
Availability and Implementation:
The cnAnalysis450k package is available on GitHub (https://github.com/mknoll/cnAnalysis450k).
Supplementary information:
Supplementary data are available at Bioinformatics online.
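The two statistical ideas named in the Results, Z-transformation to remove baseline differences between samples and the skewness of the CN distributions, can be illustrated with a minimal sketch. It is written in Python for illustration only and does not reproduce the cnAnalysis450k code or its API. The function names `z_transform` and `skewness`, the toy values, and the assumption that the transformation is applied per sample across all probes are hypothetical.

```python
from statistics import mean, pstdev


def z_transform(values):
    """Center one sample's CN values to mean 0 and scale to unit SD.

    Assumption: the transformation is applied per sample across probes.
    """
    mu = mean(values)
    sd = pstdev(values)  # population standard deviation
    if sd == 0:
        raise ValueError("cannot Z-transform a constant sample")
    return [(v - mu) / sd for v in values]


def skewness(values):
    """Moment-based sample skewness: the mean of the cubed z-scores."""
    z = z_transform(values)
    return sum(x ** 3 for x in z) / len(z)


# Hypothetical toy data: the same CN profile measured with two baselines.
sample_a = [0.0, 0.1, -0.1, -0.8]
sample_b = [v + 0.5 for v in sample_a]  # constant baseline shift

if __name__ == "__main__":
    # After Z-transformation the two profiles coincide (up to rounding).
    print(z_transform(sample_a))
    print(z_transform(sample_b))
    # A negative value indicates a longer tail towards losses.
    print(skewness(sample_a))
```

Because a negatively skewed distribution has a longer lower tail, a single symmetric threshold would treat gains and losses unequally, which is consistent with the need for separate gain and loss cutoffs described above.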