Discovery of clinically relevant disease subtypes is of prime importance in personalized medicine. Disease subtype identification has often been explored in an unsupervised machine learning paradigm, which involves clustering patients based on available omics data, such as gene expression. A follow-up analysis then determines the clinical relevance of the molecular subtypes, for example by comparing their disease progression. This methodology, however, fails to guarantee that the subtypes are separable by their subtype-specific survival curves.
Results
We propose a new algorithm, Survival-based Bayesian Clustering (SBC), which simultaneously clusters heterogeneous omics and clinical end point data (time to event) in order to discover clinically relevant disease subtypes. For this purpose we formulate a novel Hierarchical Bayesian Graphical Model which combines a Dirichlet Process Gaussian Mixture Model with an Accelerated Failure Time model. In this way we ensure that patients are grouped in the same cluster only when they show similar characteristics with respect to both molecular features across data types (e.g. gene expression, miRNA) and survival times. We extensively test our model in simulation studies and apply it to cancer patient data from the Breast Cancer dataset and The Cancer Genome Atlas repository. Notably, our method not only finds clinically relevant subgroups, but also predicts cluster membership and survival on test data better than competing methods.
Availability and implementation
Our R code can be accessed at https://github.com/ashar799/SBC.
Contact
Supplementary data are available at Bioinformatics online.
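To make the model structure described under Results more concrete, the following is a minimal generative sketch in Python, not the authors' R implementation. It assumes a Chinese restaurant process as the Dirichlet process prior over cluster assignments, spherical Gaussian molecular features per cluster, and a log-normal Accelerated Failure Time model with cluster-specific coefficients. The function name `simulate_sbc_like`, all parameter names, and all prior settings are hypothetical choices made for illustration; censoring and posterior inference are omitted.

```python
import numpy as np


def simulate_sbc_like(n_patients=200, n_features=5, alpha=1.0, seed=0):
    """Draw synthetic data from a DP Gaussian mixture coupled with an AFT model.

    Illustrative assumptions (not taken from the paper): unit-variance
    spherical Gaussians, a log-normal AFT with fixed noise scale, no censoring.
    """
    rng = np.random.default_rng(seed)

    # Chinese restaurant process: patient i joins an existing cluster with
    # probability proportional to its size, or opens a new one with weight alpha.
    labels = np.zeros(n_patients, dtype=int)
    counts = [1]  # the first patient starts cluster 0
    for i in range(1, n_patients):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        labels[i] = k
    n_clusters = len(counts)

    # Cluster-specific parameters: Gaussian means for the molecular features,
    # and AFT intercepts and regression coefficients for survival.
    means = rng.normal(0.0, 3.0, size=(n_clusters, n_features))
    intercepts = rng.normal(3.0, 1.0, size=n_clusters)
    betas = rng.normal(0.0, 0.5, size=(n_clusters, n_features))
    sigma = 0.3  # assumed AFT noise scale, shared across clusters

    # Molecular features: x_i ~ N(mean of patient i's cluster, I).
    X = means[labels] + rng.normal(size=(n_patients, n_features))

    # AFT model: log t_i = intercept_k + x_i . beta_k + sigma * eps_i.
    linear = np.einsum("ij,ij->i", X, betas[labels])
    log_t = intercepts[labels] + linear + sigma * rng.normal(size=n_patients)

    return X, np.exp(log_t), labels
```

Because both the features and the survival times depend on the same cluster label, two patients can share a cluster in this sketch only if they are similar in both respects, which is the coupling the abstract describes.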