Our new paper, out now in Zoologica Scripta, looks at the causes and impacts of missing data using a RADseq dataset for an African frog (Afrixalus fornasini). You can read the full work here, but below is the abstract:
Restriction site‐associated DNA sequencing (RADseq) has emerged as a useful tool in systematics and population genomics. A common feature of RADseq data sets is that they contain missing data that arise from multiple sources including genealogical sampling bias, assembly methodology and sequencing error. Many RADseq studies have demonstrated that allowing sites (single nucleotide polymorphisms, SNPs) with missing data can increase support for phylogenetic hypotheses. Two non‐mutually exclusive explanations for this observation are that (a) larger data sets contain more phylogenetic information; and (b) excluding missing data disproportionally removes sites with the highest mutation rates, causing the exclusion of characters that are likely variable and informative. Using a RADseq data set derived from the East African banana frog, Afrixalus fornasini (up to 1.1 million SNPs), we found that missing data thresholds were positively correlated with the proportion of parsimony‐informative sites and mean branch support. Using three proxies for estimating site‐specific rate, we found that the most conservative missing data strategies excluded rapidly evolving sites, with four‐state sites present only when allowing ≥60% missing data per SNP. Topological similarity among estimated phylogenies was highest for the data sets with ≥60% missing data per SNP. Our results suggest that several desirable phylogenetic qualities were observed when allowing ≥60% missing data per SNP. However, at the highest missing data thresholds (80% and 90% missing data per SNP), we observed differences in performance between high‐ and mixed‐weight DNA extraction samples, which may indicate there are trade‐offs to consider when using degraded genomic template with RADseq protocols.
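To make the idea of a per-SNP missing data threshold concrete, here is a minimal Python sketch on a made-up toy alignment. It is not the pipeline used in the paper: the function names, the toy sequences and the choice to treat N, - and ? as missing are all assumptions for illustration. It shows how raising the allowed proportion of missing data per site changes how many sites, parsimony-informative sites and four-state sites are retained.

```python
from collections import Counter

# Assumption: these characters mark missing data in the alignment.
MISSING = {"N", "-", "?"}


def missing_fraction(column):
    """Proportion of samples with no base call at this site."""
    return sum(base in MISSING for base in column) / len(column)


def n_states(column):
    """Number of distinct observed bases at this site (1 to 4)."""
    return len({base for base in column if base not in MISSING})


def is_parsimony_informative(column):
    """True if at least two bases each occur in at least two samples."""
    counts = Counter(base for base in column if base not in MISSING)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def filter_sites(alignment, max_missing):
    """Keep sites whose missing fraction is at most max_missing."""
    columns = zip(*alignment.values())
    return [col for col in columns if missing_fraction(col) <= max_missing]


# Hypothetical toy alignment: six samples, four SNP sites.
alignment = {
    "s1": "AACA",
    "s2": "ACCN",
    "s3": "GGTN",
    "s4": "GTTN",
    "s5": "ANNN",
    "s6": "ANNT",
}

for threshold in (0.0, 0.5, 0.9):
    kept = filter_sites(alignment, threshold)
    informative = sum(is_parsimony_informative(col) for col in kept)
    four_state = sum(n_states(col) == 4 for col in kept)
    print(threshold, len(kept), informative, four_state)
```

In this toy case the strictest threshold keeps a single site, while allowing up to 50% missing data brings in a second parsimony-informative site and the only four-state site. Real RADseq assemblers apply this kind of filter through their own settings, so treat the sketch as a way to reason about the threshold rather than as a replacement for those tools.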