Motivation: In a typical gene expression profiling study, our prime objective is to identify the genes that are differentially expressed between the samples from two different tissue types. Commonly, standard analysis of variance (ANOVA)/regression is implemented to identify the relative effects of these genes over the two types of samples from their respective arrays of expression levels. But, this technique becomes fundamentally flawed when there are unaccounted sources of variability in these arrays (latent variables attributable to different biological, environmental or other factors relevant in the context). These factors distort the true picture of differential gene expression between the two tissue types and introduce spurious signals of expression heterogeneity. As a result, many genes which are actually differentially expressed are not detected, whereas many others are falsely identified as positives. Moreover, these distortions can be different for different genes. Thus, it is also not possible to get rid of these variations by simple array normalizations. This both-way error can lead to a serious loss in sensitivity and specificity, thereby causing a severe inefficiency in the underlying multiple testing problem. In this work, we attempt to identify the hidden effects of the underlying latent factors in a gene expression profiling study by partial least squares (PLS) and apply ANCOVA technique with the PLS-identified signatures of these hidden effects as covariates, in order to identify the genes that are truly differentially expressed between the two concerned tissue types.
Results: We compare the performance of our method SVA-PLS with standard ANOVA and a relatively recent technique of surrogate variable analysis (SVA), on a wide variety of simulation settings (incorporating different effects of the hidden variable, under situations with varying signal intensities and gene groupings). In all settings, our method yields the highest sensitivity while maintaining relatively reasonable values for the specificity, false discovery rate and false non-discovery rate. Application of our method to gene expression profiling for acute megakaryoblastic leukemia shows that our method detects an additional six genes, that are missed by both the standard ANOVA method as well as SVA, but may be relevant to this disease, as can be seen from mining the existing literature.
Availability: The R code for our method, SVA-PLS, is freely available on the Supplementary website http://www.somnathdatta.org/Supp/SVPLS/
Supplementary information: Supplementary data are available at Bioinformatics online.