Motivation: Predicted residue–residue contacts can help in predicting protein 3D structures, but current contact prediction algorithms leave room for improvement.
Results: We develop ProC_S3, a set of Random Forest-based models for predicting residue–residue contact maps. The models are built from a collection of 1490 non-redundant, high-resolution protein structures using >1280 sequence-based features. ProC_S3 also uses a new amino acid residue contact propensity matrix and a new set of seven amino acid groups based on contact preference, both developed in this work. ProC_S3 delivers a 3-fold cross-validated accuracy of 26.9% with coverage of 4.7% for the top L/5 predictions (L is the number of residues in a protein) of long-range contacts (sequence separation ≥24). A further benchmark test on an independent set of 329 proteins yields an accuracy of 29.7% and coverage of 5.6%. In the recently completed Ninth Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP9), ProC_S3 ranks No. 1, No. 3 and No. 2 in accuracy for the top L/5, top L/10 and best 5 predictions of long-range contacts, respectively, among 18 automatic prediction servers.
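As an illustration of how the accuracy and coverage figures above can be computed, the following is a minimal sketch and not the evaluation code used for ProC_S3. It assumes the commonly used definitions: accuracy is the fraction of selected predictions that are true contacts, and coverage is the fraction of true long-range contacts recovered by the selection. The function name `top_l_over_k_metrics`, its parameters and the input layout (each residue pair listed once with a score) are hypothetical.

```python
def top_l_over_k_metrics(scored_pairs, true_contacts, seq_len, k=5, min_sep=24):
    """Return (accuracy, coverage) for the top L/k long-range predictions.

    scored_pairs  : iterable of (i, j, score), each residue pair listed once
    true_contacts : iterable of (i, j) pairs observed in the native structure
    seq_len       : L, the number of residues in the protein
    k             : divisor, so that L/k predictions are kept (5 for top L/5)
    min_sep       : minimum sequence separation for a long-range contact
    """
    # Keep only long-range candidates (sequence separation >= min_sep).
    long_range = [(i, j, s) for i, j, s in scored_pairs if abs(i - j) >= min_sep]

    # Rank by predicted score, highest first, and keep the top L/k.
    long_range.sort(key=lambda t: t[2], reverse=True)
    n_top = max(1, seq_len // k)
    selected = {(min(i, j), max(i, j)) for i, j, _ in long_range[:n_top]}

    # True long-range contacts, stored in the same (low, high) order.
    true_long = {(min(i, j), max(i, j))
                 for i, j in true_contacts if abs(i - j) >= min_sep}

    hits = len(selected & true_long)
    accuracy = hits / len(selected) if selected else 0.0
    coverage = hits / len(true_long) if true_long else 0.0
    return accuracy, coverage
```

Because only L/5 predictions are kept while a protein typically has many more long-range contacts, coverage under these definitions is expected to be much lower than accuracy, which is consistent with the figures reported above.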
Supplementary Information: Supplementary data are available at Bioinformatics online.