Misconceptions on Missing Data in RAD-seq Phylogenetics with a Deep-scale Example from Flowering Plants
Restriction-site associated DNA (RAD) sequencing and related methods rely on the conservation of enzyme recognition sites to isolate homologous DNA fragments for sequencing, with the consequence that mutations disrupting these sites lead to missing information. There is thus a clear expectation for how missing data should be distributed, with fewer loci shared between more distantly related samples. This observation has led to a related expectation: that RAD-seq data are insufficiently informative for resolving deeper-scale phylogenetic relationships. Here we investigate the relationship between missing information among samples at the tips of a tree and information at edges within it. We re-analyze and review the distribution of missing data across ten RAD-seq data sets and carry out simulations to determine expected patterns of missing information. We also present new empirical results for the angiosperm clade Viburnum (Adoxaceae, with a crown age >50 Ma), for which we examine phylogenetic information at different depths in the tree and with varied sequencing effort. The total number of loci, the proportion that are shared, and phylogenetic informativeness varied dramatically across the examined RAD-seq data sets. Insufficient or uneven sequencing coverage accounted for a proportion of missing data similar to that caused by dropout from mutation-disruption. Simulations reveal that mutation-disruption, which results in phylogenetically distributed missing data, can be distinguished from the more stochastic patterns of missing data caused by low sequencing coverage. In Viburnum, doubling sequencing coverage nearly doubled the number of parsimony-informative sites and increased the number of loci with data shared across >40 taxa more than 10-fold. Our analysis leads to a set of practical recommendations for maximizing phylogenetic information in RAD-seq studies.
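The contrast between phylogenetically distributed and stochastic missing data can be illustrated with a toy simulation. The sketch below is not the simulation framework used in this study; it is a minimal illustration under stated assumptions. The balanced eight-tip tree, the per-edge disruption probability, the per-sample dropout probability, and all function names are hypothetical choices made for the example. Under mutation-disruption, a locus is lost for every tip descended from a disrupted edge, so sister taxa should share missing loci more often than distant taxa; under low coverage, each sample loses loci independently, so no such difference is expected.

```python
import random

# Hypothetical balanced 8-tip tree written as nested tuples (an assumption
# for illustration, not a tree from the study).
TREE = ((("A", "B"), ("C", "D")), (("E", "F"), ("G", "H")))


def tips(node):
    """Return the tip labels below a node."""
    if isinstance(node, str):
        return [node]
    return [t for child in node for t in tips(child)]


def disruption_dropout(node, p_edge, rng, missing=None):
    """Mutation-disruption: a disrupted edge removes the locus from all tips below it."""
    if missing is None:
        missing = set()
    for child in node:
        if rng.random() < p_edge:
            missing.update(tips(child))
        elif not isinstance(child, str):
            disruption_dropout(child, p_edge, rng, missing)
    return missing


def coverage_dropout(p_tip, rng):
    """Low coverage: each sample loses the locus independently of the tree."""
    return {t for t in tips(TREE) if rng.random() < p_tip}


def shared_missing(missing_sets, x, y):
    """Jaccard index of missingness: loci missing in both / loci missing in either."""
    both = sum(x in m and y in m for m in missing_sets)
    either = sum(x in m or y in m for m in missing_sets)
    return both / either if either else 0.0


rng = random.Random(42)
n_loci = 2000
p_edge = 0.1
# Match the per-tip missing rate of the disruption model (3 edges from root to tip).
p_tip = 1 - (1 - p_edge) ** 3

disrupted = [disruption_dropout(TREE, p_edge, rng) for _ in range(n_loci)]
low_cov = [coverage_dropout(p_tip, rng) for _ in range(n_loci)]

dis_sister = shared_missing(disrupted, "A", "B")   # sister taxa
dis_distant = shared_missing(disrupted, "A", "H")  # taxa separated by the root
cov_sister = shared_missing(low_cov, "A", "B")
cov_distant = shared_missing(low_cov, "A", "H")

print(f"disruption: sister={dis_sister:.2f} distant={dis_distant:.2f}")
print(f"coverage:   sister={cov_sister:.2f} distant={cov_distant:.2f}")
```

In this sketch the two dropout models are given the same expected per-sample missing rate, so any difference in shared missingness between sister and distant pairs reflects the structure of the missing data rather than its amount.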