Incorporating auxiliary information for improved prediction in high-dimensional datasets: an ensemble of shrinkage approaches

    loading  Checking for direct PDF access through Ovid

Abstract

With advancement in genomic technologies, it is common that two high-dimensional datasets are available, both measuring the same underlying biological phenomenon with different techniques. We consider predicting a continuous outcome Y using X, a set of p markers which is the best available measure of the underlying biological process. This same biological process may also be measured by W, coming from a prior technology but correlated with X. On a moderately sized sample, we have (Y, X, W), and on a larger sample we have (Y, W). We utilize the data on W to boost the prediction of Y by X. When p is large and the subsample containing X is small, this is a p>n situation. When p is small, this is akin to the classical measurement error problem; however, ours is not the typical goal of calibrating W for use in future studies. We propose to shrink the regression coefficients β of Y on X toward different targets that use information derived from W in the larger dataset. We compare these proposals with the classical ridge regression of Y on X, which does not use W. We also unify all of these methods as targeted ridge estimators. Finally, we propose a hybrid estimator which is a linear combination of multiple estimators of β. With an optimal choice of weights, the hybrid estimator balances efficiency and robustness in a data-adaptive way to theoretically yield a smaller prediction error than any of its constituents. The methods, including a fully Bayesian alternative, are evaluated via simulation studies. We also apply them to a gene-expression dataset. mRNA expression measured via quantitative real-time polymerase chain reaction is used to predict survival time in lung cancer patients, with auxiliary information from microarray technology available on a larger sample.

Related Topics

    loading  Loading Related Articles