1 Biomedicine Discovery Institute and Department of Microbiology, Monash University, Clayton, VIC, Australia
2 Bioinformatics Group, School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, China
3 Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
4 National Institute of Technology, Matsue College, Matsue, Shimane, Japan
5 Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto, Japan
6 Gordon Life Science Institute, Boston, MA, USA
7 Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
8 Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah, Saudi Arabia
9 Department of Microbiology and Immunology and Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Parkville, VIC, Australia
10 Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology
11 Monash Centre for Data Science, Faculty of Information Technology, Monash University, Clayton, VIC, Australia
12 ARC Centre of Excellence for Advanced Molecular Imaging, Monash University, Clayton, VIC, Australia
Motivation: Many Gram-negative bacteria use type VI secretion systems (T6SS) to export effector proteins into adjacent target cells. These secreted effectors (T6SEs) play vital roles in competitive survival within bacterial populations, as well as in bacterial pathogenesis. Although various computational analyses have previously been applied to identify effectors secreted by certain bacterial species, no universal method is available to accurately predict T6SS effector proteins from the growing tide of bacterial genome sequence data.

Results: We extracted a wide range of features from T6SE protein sequences and comprehensively analyzed the predictive performance of these features through unsupervised and supervised learning. By integrating these features, we then developed a two-layer SVM-based ensemble model with finely optimized parameters to identify potential T6SEs. We further validated the predictive model on an independent dataset, on which it achieved an ACC of 0.943, an F-value of 0.946, an MCC of 0.892 and an AUC of 0.976. To demonstrate applicability, we used this method to correctly identify two very recently validated T6SE proteins, which represent challenging prediction targets because they differ significantly from previously known T6SEs in sequence similarity and cellular function. Furthermore, a genome-wide prediction across 12 bacterial species, covering 54 212 protein sequences in total, identified 94 putative T6SE candidates. We envisage that both this information and our publicly accessible web server will facilitate future discoveries of novel T6SEs.

Availability and implementation: http://bastion6.erc.monash.edu/

Supplementary information: Supplementary data are available at Bioinformatics online.
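For readers unfamiliar with the threshold-based measures reported in the Results (ACC, F-value and MCC), the following is a minimal sketch of how they are conventionally computed from a binary confusion matrix. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper or its datasets. AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return ACC, F-value and MCC for a binary confusion matrix.

    tp, fp, tn, fn are counts of true positives, false positives,
    true negatives and false negatives. Undefined ratios fall back to 0.0.
    """
    total = tp + fp + tn + fn
    acc = (tp + tn) / total if total else 0.0

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-value: harmonic mean of precision and recall
    f_value = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )

    # Matthews correlation coefficient
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"ACC": acc, "F-value": f_value, "MCC": mcc}


# Hypothetical counts for illustration only (not the paper's data)
example = binary_metrics(tp=45, fp=5, tn=45, fn=5)
print(example)
```

Unlike ACC, the MCC uses all four cells of the confusion matrix, which makes it informative when positive and negative classes are unbalanced.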