Protein domain classification is an important step in functional annotation for next-generation sequencing data. For RNA-Seq data of non-model organisms that lack quality or complete reference genomes, existing protein domain analysis pipelines are applied to short reads directly or to contigs that are generated using de novo sequence assembly tools. However, these strategies do not provide satisfactory performance in classifying short reads into their native domain families.Results:
We introduce SALT, a protein domain classification tool based on profile hidden Markov models and graph algorithms. SALT carefully incorporates the characteristics of reads that are sequenced from the domain regions and assembles them into contigs based on a supervised graph construction algorithm. We applied SALT to two RNA-Seq datasets of different read lengths and quantified its performance using the available protein domain annotations and the reference genomes. Compared with existing strategies, SALT showed better sensitivity and accuracy. In the third experiment, we applied SALT to a non-model organism. The experimental results demonstrated that it identified more transcribed protein domain families than other tested classifiers.Availability:
The source code and supplementary data are available at https://sourceforge.net/projects/salt1/Contact:
Supplementary data are available at Bioinformatics online.