Motivation: Measuring evolutionary conservation is a routine step in the identification of functional elements in genome sequences. Although a number of studies have proposed methods that use continuous-time Markov models (CTMMs) to find evolutionarily constrained elements, the probabilistic structure of these models has been investigated less frequently.
Results: In this article, we investigate a sufficient statistic for CTMMs. The statistic is composed of the fractional duration of nucleotide characters over evolutionary time, Fd, and the number of substitutions occurring in phylogenetic trees, Ns. We first derive basic properties of the sufficient statistic. We then derive an expectation-maximization (EM) algorithm for estimating the parameters of a phylogenetic model, which iteratively computes the expected values of the sufficient statistic. We show that the EM algorithm converges much faster than other optimization methods that use numerical gradient descent algorithms. Finally, we investigate the genome-wide distribution of the fractional duration Fd, which, unlike the number of substitutions Ns, has rarely been investigated. We show that Fd carries evolutionary information distinct from that in Ns, which may be useful for detecting novel types of evolutionary constraints in the human genome.
Availability: The C++ source code of the ‘Fdur’ software is available at http://www.ncrna.org/software/fdur/
Supplementary information: Supplementary data are available at Bioinformatics online.
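Illustration: The role of Fd and Ns as a sufficient statistic can be seen in a minimal sketch for a single branch with a fully observed substitution history. The function names, the toy path and the rate values below are hypothetical choices made for illustration; they are not taken from the 'Fdur' software. In the phylogenetic setting of this article the history is unobserved, so the EM algorithm would use the expected values of these quantities instead.

```python
import math

NUCS = "ACGT"

def path_statistics(path, total_time):
    """Return (Fd, Ns) for one fully observed path.

    path: list of (start_time, nucleotide), sorted by start_time,
          with the first entry starting at time 0 (assumed input format).
    Fd[a]     : fraction of total_time spent in nucleotide a.
    Ns[(a,b)] : number of a -> b substitutions along the path.
    """
    dwell = {a: 0.0 for a in NUCS}
    ns = {(a, b): 0 for a in NUCS for b in NUCS if a != b}
    for i, (start, state) in enumerate(path):
        is_last = i + 1 == len(path)
        end = total_time if is_last else path[i + 1][0]
        dwell[state] += end - start
        if not is_last:
            ns[(state, path[i + 1][1])] += 1
    fd = {a: d / total_time for a, d in dwell.items()}
    return fd, ns

def path_log_likelihood(fd, ns, rates, total_time):
    """Log-likelihood of a path given its initial state.

    Depends on the path only through Fd and Ns:
      sum_{a != b} Ns[a,b] * log(rate[a,b]) - sum_a exit_rate[a] * Fd[a] * T
    """
    ll = 0.0
    for a in NUCS:
        exit_rate = sum(rates[(a, b)] for b in NUCS if b != a)
        ll -= exit_rate * fd[a] * total_time
    for pair, n in ns.items():
        if n > 0:
            ll += n * math.log(rates[pair])
    return ll

# Toy history on a branch of length 1.0: A until 0.3, G until 0.8, then A.
toy_path = [(0.0, "A"), (0.3, "G"), (0.8, "A")]
# Hypothetical rate matrix with every off-diagonal rate equal to 0.5.
toy_rates = {(a, b): 0.5 for a in NUCS for b in NUCS if a != b}

fd, ns = path_statistics(toy_path, 1.0)
ll = path_log_likelihood(fd, ns, toy_rates, 1.0)
```

Any two histories with the same Fd and Ns receive the same likelihood under this model, which is why expectations of these two quantities are all that an EM update of the rate parameters needs.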