Motivation: High-throughput sequencing platforms are increasingly used to screen patients with genetic disease for pathogenic mutations, but prediction of the effects of mutations remains challenging. Previously we developed SAAPdap (Single Amino Acid Polymorphism Data Analysis Pipeline) and SAAPpred (Single Amino Acid Polymorphism Predictor) that use a combination of rule-based structural measures to predict whether a missense genetic variant is pathogenic. Here we investigate whether the same methodology can be used to develop a differential phenotype predictor, which, once a mutation has been predicted as pathogenic, is able to distinguish between phenotypes—in this case the two major clinical phenotypes (hypertrophic cardiomyopathy, HCM and dilated cardiomyopathy, DCM) associated with mutations in the beta-myosin heavy chain (MYH7) gene product (Myosin-7).
Results: A random forest predictor trained on rule-based structural analyses together with structural clustering data gave a Matthews’ correlation coefficient (MCC) of 0.53 (accuracy, 75%). A post hoc removal of machine learning models that performed particularly badly, increased the performance (MCC = 0.61, Acc = 79%). This proof of concept suggests that methods used for pathogenicity prediction can be extended for use in differential phenotype prediction.
Availability and Implementation: Analyses were implemented in Perl and C and used the Java-based Weka machine learning environment. Please contact the authors for availability.
Contacts: email@example.com or firstname.lastname@example.org
Supplementary information: Supplementary data are available at Bioinformatics online.