An empirical pipeline for choosing the optimal clustering threshold in RADseq studies

(1)

This is the author manuscript accepted for publication and has undergone full peer review but has not been through the copyediting, typesetting, pagination and proofreading process, which may lead to differences between this version and the Version of Record. Please cite this article as doi:

1

2 MR. EVAN MCCARTNEY-MELSTAD (Orcid ID : 0000-0002-1803-1126) 3 DR. MÜGE GIDIŞ (Orcid ID : 0000-0002-5623-1698)

4 5

6 Article type : Resource Article 7

8

9 An empirical pipeline for choosing the optimal clustering threshold in RADseq studies 10 11 12 13 14 15 16 17 18 19 20

21 Evan McCartney-Melstad1*_{, Müge}Gidiş2_{, H. Bradley Shaffer}1

22

23 1: Department of Ecology and Evolutionary Biology, La Kretz Center for California 24 Conservation Science, and Institute of the Environment and Sustainability, University of 25 California, Los Angeles, California, USA

26

27 2: Kütahya School of Health, Dumlupınar University, Kütahya, Turkey 28

29 *Corresponding author: evanmelstad@ucla.edu

(2)

30

31 Keywords: RADseq, parameter choice, population genomics, Rana boylii 32

33 Running title: RADseq clustering threshold optimization 34 35 36 37 38 39 40 41 Abstract

42 Genomic data increasingly are used for high resolution population genetic studies including 43 those at the forefront of biological conservation. A key methodological challenge is determining 44 sequence similarity clustering thresholds for RADseq data when no reference genome is

45 available. These thresholds define the maximum permitted divergence among allelic variants and 46 the minimum divergence among putative paralogues and are central to downstream population 47 genomic analyses. Here we develop a novel set of metrics to determine sequence similarity 48 thresholds that maximize the correct separation of paralogous regions and minimize oversplitting 49 naturally occurring allelic variation within loci. These metrics empirically identify the threshold 50 value at which true alleles at opposite ends of several major axes of genetic variation begin to 51 incorrectly separate into distinct clusters, allowing researchers to choose thresholds just below 52 this value. We test our approach on a recently published dataset for the protected foothill yellow-53 legged frog (Rana boylii). The metrics recover a consistent pattern of roughly 96% similarity as a 54 threshold above which genetic divergence and data missingness become increasingly correlated. 55 We provide scripts for assessing different clustering thresholds and discuss how this approach can 56 be applied across a wide range of empirical datasets.

57

58 Introduction

(3)

59 Restriction site associated DNA sequencing, or RADseq, has become the technique of choice 60 for population genetic analyses of non-model organisms (Andrews, Good, Miller, Luikart, & 61 Hohenlohe, 2016; Davey & Blaxter, 2010). RADseq is inexpensive, technically simple and 62 reliable, and requires no genomic resources for implementation. However, this simplicity comes 63 with costs. The principle liability of the RADseq approach is the anonymous nature of putative 64 loci when no reference genome is available, and therefore the difficulty of firmly establishing 65 homologous loci (Harvey et al., 2015; Ilut, Nydam, & Hare, 2014). Although establishing 66 homology is a challenge for any technique or taxon, it may be particularly difficult in RADseq 67 studies of polyploid or large-genome species, because highly repetitive (i.e., large) genomes are 68 susceptible to the improper clustering of recently duplicated paralogs inappropriately into single 69 presumptive loci (Harvey et al., 2015; Ilut et al., 2014; McKinney, Waples, Seeb, & Seeb, 2017; 70 Waples, Seeb, & Seeb, 2017).

71 A key challenge in establishing homology in RADseq analyses is determining a consistent, 72 empirically-justifiable protocol to determine how divergent two reads representing alleles from 73 the same homologous locus can be, and not assigning those true homologues to paralogous loci. 74 This value is known as the clustering threshold in the RADseq analysis pipelines pyRAD/iPyrad 75 (Eaton, 2014). Its analogue in the popular RADseq pipeline Stacks (Catchen, Hohenlohe,

76 Bassham, Amores, & Cresko, 2013) is “distance allowed between stacks” (-M) divided by the 77 length of sequencing reads (for instance, an -M of 4 with 100bp reads is analogous to a clustering 78 threshold of 0.96 in pyRAD). If the clustering threshold is set too low, then paralogous genomic 79 regions may be inappropriately clustered together within and between samples. Additional, 80 pipeline-specific paralog filters may identify and discard these loci (leading to an unnecessary 81 loss of data), or they may be improperly included in the final datasets incorrectly as orthologues. 82 Setting the clustering threshold too high has the opposite effect, wherein true allelic variants of 83 orthologous loci may be split into putatively separate loci, effectively setting a cap on how 84 divergent a locus can be within a population or species. In both cases, downstream summary 85 statistics and inferences, ranging from overall heterozygosity to outlier-based fitness assays will 86 suffer. It is important to recognize that neither thresholds that are too low nor too high can be 87 considered “conservative” with respect to their biological impact. Rather, both extremes can lead 88 to the loss of the most biologically divergent, and in some cases informative, loci in a system.

(4)

89 Because the level of genetic differentiation among samples depends on many factors, 90 including population structure, effective population size, and mutation rate, no single default 91 value for clustering threshold should be expected to be accurate for all studies or species. Rather, 92 a reasonable approach that we explore in this paper is to process RADseq reads using multiple 93 clustering thresholds and choose the threshold that returns the optimal results for any empirical 94 system. While this approach is intuitively appealing, exactly how to implement such a strategy is 95 far from obvious. Ilut et al. (2014) suggested progressively lowering the clustering threshold until 96 the number of 2-haplotype clusters stops increasing sharply and the number of single haplotype 97 clusters plateaus, with the reasoning that most allelic variation within an individual will tend to 98 cluster together before paralogous reads do so. While reasonable, the Ilut et al. (2014) approach 99 focused on a single individual, and it is unclear how this strategy might scale up to clustering or 100 splitting divergent alleles among individuals in a population genomic framework. Harvey et al. 101 (2015) showed that the use of different clustering thresholds reduced comparability among 102 studies, and recommended evaluating the impact of threshold choice on the numbers of 1, 2, and 103 3+ haplotype loci, reasoning that an increase in 2-haplotype clusters may indicate increased 104 proper clustering of divergent alleles, while more 3+-haplotype clusters is strong evidence of 105 clustering paralogs in diploid species. Finally, Mastretta-Yanes et al. (2015) suggested optimizing 106 based on sequenced technical replicates, choosing parameter values that most consistently

107 recovered the same genotypes among replicates. When replicate samples are not available, 108 Mastretta-Yanes et al. (2015) recommended choosing a threshold that maximizes the number of 109 recovered SNPs and minimizes the genetic distance between individuals from the same locality, 110 an approach that attempts to increase the amount of variation available to analyze within a dataset 111 while reducing the presumed genealogical error derived from clustering paralogous loci.

112 Although these approaches are all significantly better than simply using the default values

113 supplied with RADseq pipelines, they do not directly address the question of how to minimize the 114 splitting of the most divergent orthologous alleles among genetically distant individuals while 115 avoiding the homology misinterpretation and data loss that may result from incorrectly clustering 116 paralogues.

117 To quantitatively address this problem in population genomics, we identified a set of seven 118 metrics for assessing optimal clustering thresholds in RADseq. We designed these metrics to

(5)

119 optimize the threshold clustering value at which natural allelic variation in the population starts to 120 be clustered, incorrectly, into separate genomic loci. The metrics, which are described in detail in 121 the Materials and Methods, include: a) the fraction of inferred paralogous clusters, b) the slope of 122 genetic divergence vs. geographic distance, c) the correlation of pairwise divergence with

123 pairwise missingness, and d) mean bootstrap values in a maximum likelihood tree building 124 framework. We also test three previously suggested metrics: e) the total number of SNPs 125 recovered and f) the fraction of variance explained by the main principal components of genetic 126 variation (both advocated by Mastretta-Yanes et al. (2015)), and g) heterozygosity (suggested by 127 Ilut et al. (2014). These seven metrics, each based on population genetic theory and landscape 128 genetic expectations that are applicable to any species, can then be examined to establish 129 clustering thresholds for data assembly that are presumably accurate estimates of true homolog 130 allelic variation for a set of individuals. We provide easily usable scripts to plot and evaluate 131 these metrics using VCF files output by several different RADseq processing pipelines.

132 We focus our empirical test case on the foothill yellow-legged frog (Rana boylii). Recent work 133 showed that the species is comprised of five deeply divergent clades, with fine-scale hierarchical 134 divergence present to a greater or lesser extent within each clade (McCartney-Melstad, Gidiş, & 135 Shaffer, 2018). Expectations on appropriate clustering thresholds in this species are mixed, 136 making it an interesting test case for our methodology. Early work on R. boylii showed that intron 137 5 of the tropomyosin gene contained only 2 variable sites across 517 bp (0.39% divergence) for a 138 similar geographic sampling (Lind, Spinks, Fellers, & Shaffer, 2011), which might suggest a 139 decreased risk of oversplitting loci at higher clustering thresholds. Alternatively, the larger 140 genome sizes in amphibians generally, and R. boylii in particular (estimated at about 8-10 GB 141 based on close relatives) (Olmo, 1973; Vinogradov, 1998) with their associated increase of 142 paralogous regions (Smith et al., 2009) suggests an increased risk of undersplitting at low 143 clustering thresholds. Using our newly-derived metrics, our results indicate that we can identify 144 biologically reasonable clustering thresholds directly from these data, and we discuss how these 145 metrics can be applied to other biological systems.

146 Materials and Methods

147 The current implementation of iPyrad uses a default clustering threshold of 0.85. Stacks, 148 another popular tool to process RADseq data, allows for two mismatches between sequences

(6)

149 within an individual and one additional mismatch between individuals by default, regardless of 150 read length. This translates to an effective 0.97 clustering threshold given 100 bp reads, a 0.94 151 threshold given 50 bp reads, and a 0.98 threshold for 150 bp reads. To establish biologically 152 reasonable thresholds for an empirical system, we developed data exploration strategies that 153 should return a small, reasonable range of threshold values for any given system. We motivate 154 and describe these strategies below. We also describe a recently published data set on our test 155 case, the candidate listed species Rana boylii.

156 1) Fraction of inferred paralogues

157 RADseq processing pipelines such as pyRAD/iPyrad and Stacks flag paralogous loci based on 158 several filters, including excessive heterozygosity in a locus within and among samples and the 159 presence of more than two haplotypes within individuals in a diploid species. In general, as the 160 clustering threshold increases, there should be a decrease in the number of flagged paralogs, 161 because the increased stringency in clustering eliminates true paralogs that would be flagged and 162 removed later by other filters. However, when the threshold is too high, both true paralogous loci 163 and loci that contain modest amounts of true allelic variation become split into separate clusters, 164 whereupon filters such as excessive heterozygosity within a sample have a reduced capacity to 165 identify paralogues. Plotting the percentage of flagged paralogs as a function of clustering 166 threshold provides a visualization of this relationship. While this is a primarily qualitative metric 167 that can gauge the importance of properly setting the clustering threshold (very low paralog 168 fractions indicate that there is decreased risk in setting a low clustering threshold), we expect that 169 the optimal threshold that best reflects clusters composed of only true homologues should be 170 somewhere in the initial, relatively shallow slope of the declining relationship between flagged 171 paralogs and clustering threshold. We used the output from step 5 of pyRAD to plot the fraction 172 of flagged paralogous loci ([f1loci-f2loci] / f1loci from s5.consens.txt).

173 2 & 3) Genetic variation

174 Choosing a clustering threshold that is much too low will result in distantly related paralogues 175 clustering together, often then being flagged and discarded by bioinformatic processing pipelines 176 from filters such as mandating no more than 2 haplotypes in a diploid organism (Ilut et al., 2014). 177 This can lead to the counterintuitive result of increasing heterozygosity with higher clustering 178 thresholds, until a threshold is reached that begins to split high levels of allelic variation. A

(7)

179 similar pattern is plausible for the total number of SNPs recovered in a dataset—clustering 180 divergent alleles via lower clustering thresholds will increase the number of inferred SNPs, but 181 when paralogues begin to be incorporated these highly polymorphic alleles will often be 182 discarded through standard paralogy filters. We therefore plotted boxplots for individual 183 heterozygosity (“poly” from s5.consens.txt in pyRAD) and the total number of SNPs recovered 184 across different clustering thresholds. We expect that for both heterozygosity and the total 185 number of SNPs the optimal clustering threshold should be at or near the value that returns the 186 maximum value for each.

187 4) Cumulative variance explained by major principal components

188 Increasing the amount of relevant, accurate data should generally decrease the noise in the 189 main patterns of genetic differentiation, leading to an increase in the proportion of variance 190 explained by the first several principal components. However, when true orthologues are

191 oversplit, there should be a reduction in the divergence of the most different groups, leading to a 192 decrease in the variance accounted for in those principal components. The signal-to-noise ratio in 193 the genetic datasets under different clustering thresholds was therefore investigated by calculating 194 the proportion of genetic variance explained by the first several principal components, as was also 195 done previously by Mastretta-Yanes et al. (2015). SNPRelate v1.12.2 (Zheng et al., 2012) was 196 used to perform principal component analysis for each clustering threshold VCF. To capture this 197 metric, we summed and plotted the cumulative variance explained by the first eight principal 198 components for each clustering threshold. We expect the fraction of variance explained by the 199 first eight principal components to increase with clustering threshold (as the amount of data 200 generally increases with clustering threshold) until a certain threshold when bias from 201 oversplitting more strongly influences the results, after which the proportion of variance

202 explained by those principal components should decrease. We choose eight principal components 203 because of the known presence of strong hierarchical population structure in this species

204 (McCartney-Melstad et al., 2018), which suggests that more than two principal components 205 should be strongly informative, but any small number of principal components should provide 206 roughly the same information.

Author Manuscript

(8)

207 5) Relationship between missingness and genetic divergence

208 Oversplitting of homologous loci is more likely between genetically distant individuals, 209 leading to an increase in missingness when a single locus with divergent alleles is incorrectly 210 divided into two loci, each of which is present in some individuals and absent in others. We 211 therefore investigated the relationship between individual pairwise data missingness and genetic 212 similarity to determine the degree to which similarity is correlated with missingness across 213 clustering thresholds. Allelic polymorphisms in restriction sites will also lead to missing data 214 among samples, the prevalence of which should scale with genetic divergence. This phenomenon 215 may impart a positive slope in the relationship between pairwise genetic divergence and data 216 missingness, but the accumulated polymorphisms in restriction sites are not dependent on 217 clustering threshold, and so the differences observed among clustering thresholds may be 218 attributed to bioinformatic parameter values. Genetic similarity (fractional identity by state) was 219 calculated from each clustering threshold VCF file using the snpgdsIBS function in SNPRelate 220 v.1.12.2 (Zheng et al., 2012). Pairwise data missingness among samples, calculated as the 221 fraction of SNPs with missing data for one of two sample genotypes divided by the total number 222 of SNPs genotyped for at least one of the samples, was calculated from the output VCF files 223 using the accompanying script vcf2pairwiseMissingness.pl

224 (https://www.github.com/atcg/clustOpt/). Then, snpgdsHCluster was used to create a distance-225 based dendrogram of the samples, and heatmaps showing the degree of pairwise data missingness 226 among samples were plotted using pheatmap (Kolde, 2018). Pearson’s correlation coefficients 227 were then calculated between genetic dissimilarity and pairwise missingness at different 228 clustering thresholds. We expect that the optimal clustering threshold should be at or near the 229 point when this correlation begins to steeply increase.

230 6) Slope of isolation by distance

231 Under a model of isolation by distance (Wright, 1943), a very general expectation is that 232 genetic distance should increase with geographic distance within species. However, if the most 233 divergent alleles of orthologues are unintentionally oversplit by setting a clustering threshold that 234 is too high, this would lead to a reduction in the observed genetic distance among geographically 235 distant comparisons. As a result, clustering thresholds that are too high will lead to a smaller 236 increase in genetic divergence per km than will clustering thresholds that do not split the most

(9)

237 divergent alleles. We measured this effect by calculating the slopes of the lines of best fit for 238 genetic divergence as a function of geographic distance for each clustering threshold using the lm 239 function in R. We expect that oversplitting will lead to a reduction in the positive slope of genetic 240 divergence as a function of geographic distance at the clustering threshold that begins to split the 241 most divergent homologue alleles.

242 7) Phylogenetic resolution

243 The consistency of recovered relationships among polymorphisms in a dataset serve as a 244 measure of relative noise. Inconsistency can be biological, as in the case of discordant gene trees 245 associated with some demographic histories, or technical, as in the case of improper shared 246 polymorphism inferred from the clustering of paralogues. To test the phylogenetic resolution of 247 the data resulting from each clustering threshold, and to quantify the general effects of different 248 parameter sets on downstream biological inferences, RAxML v8.2.12 (Stamatakis, 2014) was run 249 10 times with different parsimony and bootstrap seeds for each clustering threshold. Two separate 250 sets of analyses were run: one on the full alignments, which contain all passing loci that have data 251 for at least 4 samples, and one on alignments containing no more than 50% missing data. The full 252 alignments were used to minimize the bias introduced by filtering for missingess, which can mask 253 the contributions of certain classes of divergent loci (Huang & Knowles, 2016), while the 50% 254 missingness dataset was analyzed to mirror filtering for a relatively permissive amount of missing 255 data as often done in phylogenetic studies (Wielstra, McCartney-Melstad, Arntzen, Butlin, & 256 Shaffer, 2019; Wiens & Morrill, 2011). The GTRGAMMA model was used, along with 100 rapid 257 bootstrap replicates and 20 thorough maximum likelihood searches for each run. Then, mean 258 bootstrap values for each run were calculated across all branches using ape 5.2 (Paradis, Claude, 259 & Strimmer, 2004). Boxplots representing the median, 25% and 75% quartiles, and outliers of the 260 25 mean bootstrap values per clustering threshold were then plotted using R v3.4.4 (R Core 261 Team, 2017). We expect that the optimal clustering threshold should be the level that produces 262 the highest overall bootstrap levels for a given data set.

263 Empirical sequence data and clustering

264 Sequence reads from a double digest restriction-site associated DNA (Glenn et al., 2016; 265 Peterson, Weber, Kay, Fisher, & Hoekstra, 2012) experiment on R. boylii were downloaded from 266 NCBI (accession # PRJNA401430). These reads represent 93 samples from 50 localities from

(10)

267 across California and Oregon (Figure 1), sequenced across two 100 bp single-end Illumina HiSeq 268 2000 lanes. For full sample information and laboratory procedures, see McCartney-Melstad et al. 269 (2018).

270 The sequence clustering pipeline pyRAD v3.0.66 (Eaton, 2014) was run with the following 271 parameters: #8/Mindepth=10, #11/Datatype=ddrad, #12/MinCov=4, #13/MaxSH=55, and 272 #24/MaxH=2, with separate runs at each #10/Wclust value set to each hundredth between 0.88 273 and 0.99 (12 runs total). These correspond to a minimum sequencing depth of 10, a minimum of 274 four samples having data to keep a locus, a maximum of 55 heterozygous individuals for any 275 base, and a maximum of two heterozygous sites per locus within individuals. The parameters 276 Mindepth and MinCov filter for data quality and missingness, respectively, whereas MaxSH and 277 MaxH filter against combining paralogs. Step 7 of pyRAD was also rerun for each clustering 278 threshold with MinCov set to 47, enforcing a maximum of 50% missing data to build

279 supplementary alignments for the RAxML bootstrapping analysis. 280

281 Results

282 Sequence data and initial pyRAD statistics

283 We downloaded 394,983,404 100 bp sequence reads from 93 samples sequenced on two 284 HiSeq 2000 single-end lanes. Samples received between 2,353,591 and 6,263,131 reads each 285 (mean=4,247,133, stdev=840,333). After discarding reads with more than four low-quality bases 286 (phred < 20), our data set contained an average of 3,056,113 reads per individual

287 (min=1,698,025, max=4,560,049, SD=606,497). Total loci recovered by pyRAD ranged from 288 72,111 at a clustering threshold of 0.88 to 126,124 at a clustering threshold of 0.99 (Table 1). Full 289 RAxML alignments contained between 7.24 million bp and 12.61 million bp, and after removing 290 loci that contained more than 50% missing data, alignments contained between 1.85 million bp 291 and 2.51 million bp (Table 1).

292 Fraction of inferred paralogues and diversity measures (Metrics 1, 2, 3, 4)

293 The percentage of loci flagged as paralogous was reduced by roughly half at a clustering 294 threshold of 0.96 compared to 0.88, and began to fall sharply at 0.97 (Figure 2). Sample-level 295 heterozygosity increased from 0.88 to 0.96, where it leveled off until a sharp decrease at 0.99 296 (Figure 3A). Similarly, the total number of polymorphic sites recovered by pyRAD increased

(11)

297 steadily from 0.88 to 0.95, and then decreased slowly to 0.97 and sharply to 0.98 and 0.99 (Figure 298 3B). When viewed in a multivariate context, the cumulative variance among all biallelic SNPs 299 explained by the first eight principal components was greatest when using a clustering threshold 300 of 0.96 (Figure 3C).

301 Relationship between missingness and genetic divergence, and slope of isolation by distance

302 (Metrics 5, 6)

303 The relationship between pairwise data missingness and phylogenetic distance was obvious in 304 the missingness heat maps (Figures 4 and S2-S11). At a clustering threshold of 0.88 (Figure 4A), 305 the lowest missingness was apparent among the most closely related individuals, but the highest 306 missingness values were generally unstructured phylogenetically. This pattern remained visually 307 stable until reaching a clustering threshold of 0.98 (Figure S11), when high missingness values 308 began to cluster more clearly among comparisons between a single clade of the nine most 309 divergent samples with most of the remaining samples. This pattern became more pronounced at 310 a clustering threshold of 0.99 (Figure 4B), with phylogenetic divergence clearly driving the 311 patterns of highest data missingness. This suggests that these most divergent samples likely had a 312 higher pairwise missing data rate with other samples because their alleles were divergent enough 313 from the remaining samples that a clustering threshold of 0.99 (and, to a lesser degree, 0.98) 314 oversplit many of its alleles into separate loci, which necessarily contain missing data when 315 compared to more divergent samples.

316 Quantitatively, Pearson’s correlation coefficients between genetic divergence and missingness 317 were similar for clustering thresholds of 0.88 to 0.93. At higher thresholds, the correlation

318 became stronger, rising at 0.96 and then rapidly rising at thresholds of 0.97 to 0.99 (Figure 5A). 319 Similarly, the slope of genetic isolation by geographic distance began to drop at 0.95, and

320 precipitously so at 0.96 or 0.97 (Figure 5B), again suggesting that the most divergent homologous 321 alleles began to be split and homogenize the most distant samples at approximately 0.96 or 0.97. 322 Genetic divergence exhibited higher correlations with geographic distance at thresholds between 323 0.88 and 0.97, and lower ones at thresholds of 0.98 and 0.99 (Figure S1).

324 Phylogenetic resolution (Metric 7)

325 The results of the RAxML analyses on the full alignments demonstrated that not all clustering 326 thresholds produced the same degree of phylogenetic resolution (Figure 6). Thresholds of 0.88 to

(12)

327 0.94 produced similar and generally increasing resolution as measured by nodal support. Mean 328 bootstrap values were highest for a clustering threshold of 0.95 and then decreased at thresholds 329 of 0.96, 0.97, 0.98 and 0.99. We recovered a similar pattern when using alignments containing no 330 more than 50% missing data, with bootstrap values increasing roughly linearly from 0.88 to 0.97, 331 then decreasing sharply at 0.98 and 0.99 (Figure S12).

332 Discussion

333 The choice of clustering threshold in RADseq experiments can influence downstream analyses 334 of population assignment (Rodríguez‐Ezpeleta et al., 2016), molecular diversity (Figure 3) and 335 effective population size (Harvey et al., 2015; Shafer et al., 2017), and phylogenetic resolution 336 (Huang & Knowles, 2016; Leaché et al., 2015). For instance, the resolution of the genetic 337 datasets generated here clearly differed as evidenced by the different mean RAxML bootstrap 338 values among clustering thresholds with both full data matrices (Figure 6) and those filtered for a 339 maximum of 50% missingness (Figure S12). Properly split paralogues have been demonstrated to 340 show concordant patterns of population structure when treated as separate loci, contributing 341 valuable resolution to population genomic studies (Waples et al., 2017). The difference in power 342 gained from increasing the size of data matrices can, and does, lead to fundamentally different 343 biological interpretations in population genomics (McCartney‐Melstad, Vu, & Shaffer, 2018), 344 emphasizing that it is worth the extra work to extract as many true orthologs as possible from a 345 given data set.

346 We uncovered clear biological patterns that were evident in our proposed sequence clustering 347 threshold evaluation metrics, allowing us to identify an empirically-motivated, species-specific 348 clustering threshold approach that is broadly applicable to other species. Evaluating multiple 349 metrics is important because there may be minor differences among them. For instance, sequence 350 clustering thresholds exhibited mostly concordant patterns across the metrics for this R. boylii 351 data set, although some metrics supported 95% similarity (total SNPs, PCC between missing and 352 divergence, IBD slopes, and RAxML bootstraps; metrics 3, 5, 6, and 7), while others supported 353 96% (heterozygosity, cumulative PC variance; metrics 2 and 4). This indicates that the great 354 majority of allelic variation in this system is less than 4% divergent, and that 95% and 96% 355 represent reasonable clustering thresholds that serve to maximize the splitting of paralogous loci 356 while minimizing the splitting of allelic variants for R. boylii.

(13)

357 Absolute levels of paralogy

358 The fraction of loci flagged by pyRAD as paralogs through its ploidy and locus heterozygosity 359 filters (Figure 2) is informative, as this gives a sense of the fraction of loci that have been

360 improperly lumped, and therefore the distribution of within-individual paralog genetic distances. 361 However, these filters are imperfect and will not catch all true biological paralogs. They also may 362 improperly score some allelic variation as paralogous. Although one could verify our approach in 363 a species with a near-perfectly assembled reference genome by independent verification of 364 paralogous loci using a multiple mapping criterion, we were primarily interested in the

365 performance of our metrics in a system where we anticipated a relatively high fraction of paralogs 366 and a high level of allelic divergence. Given that, we selected a frog species with a large genome 367 and deep population structure (McCartney-Melstad et al., 2018). Approximately 20% of the 368 recovered loci were flagged as paralogs at lower clustering thresholds (roughly 0.88 to 0.92), 369 suggesting that RADseq locus paralogy is prevalent in R. boylii.

370 Paralog fractions in RADseq datasets vary widely, and these fractions may be difficult or 371 impossible to predict a priori, as they do not appear to scale with genome size. For instance, the 372 dusky parrotfish has a genome size on the order of 1.22 to 1.86 gigabases (Hinegardner, 1968; 373 Mirsky & Ris, 1951) and the mountain barberry has a genome size of approximately 0.5 to 1.83 374 gigabases (Mastretta-Yanes et al., 2015; Rounsaville & Ranney, 2010). Despite their similar 375 genome sizes, their fraction of paralogous RAD loci differ dramatically, with dusky parrotfish 376 exhibiting 1% paralogues and mountain barberry yielding 35% (Mastretta-Yanes et al., 2015; 377 McKinney et al., 2017). Similarly, the Chinook salmon has roughly the same fraction of 378 paralogous RAD loci (17%) as R. boylii (McKinney et al., 2017), despite R. boylii having a 379 genome roughly 2 to 4 times larger (7.33-9.13 vs. 2.40-3.23 gigabases) (Hardie & Hebert, 2003; 380 Hinegardner & Rosen, 1972; Olmo, 1973; Vinogradov, 1998).

381 Alternative approaches

382 Several other methods have been proposed to limit the effects of paralogy in RADseq analyses 383 and to choose the best parameters in bioinformatic pipelines. One suggested approach is to 384 sequence technical replicates to measure error rate across different parameter sets (Mastretta-385 Yanes et al., 2015). Although this is an excellent approach for evaluating the effects of e.g. 386 sequencing depth, laboratory batch effects and other sources of technical error, it does not take

(14)

387 into account the maximum observed level of divergence in the study system. Paris et al. (2017) 388 suggested choosing parameters that maximize the number of polymorphic loci that are present in 389 at least 80% of the samples present. In effect, this gets at the tradeoff between over- and under-390 splitting by trying to increase the number of loci (e.g. by increasing the clustering threshold) until 391 alleles begin to be split, which will lead to increased missing data (thus removing loci beneath the 392 80% genotyping threshold) and will also reduce the fraction of loci that are polymorphic (thus 393 failing the polymorphic locus test). While this is an interesting approach, it is unclear how it will 394 be affected by the presence of strong population structure, where absolute genotyping rates can be 395 low due to locus fallout and where much higher missing data rates may be informative (Huang & 396 Knowles, 2016).

397 The software pyRAD filters out paralogs using multiple approaches, including flagging loci 398 with more than two haplotypes per individual (biologically impossible for a single locus in a 399 diploid organism) and identifying loci with a higher than expected number of heterozygous sites 400 per individual (combined paralogous loci will appear heterozygous at sites with fixed differences 401 between those loci). Another approach to paralogue detection that is separate from clustering 402 threshold relies on the fraction of heterozygotes in a sample and the relative imbalance of allelic 403 read depths at a locus (McKinney et al., 2017). This approach is promising, but it is again unclear 404 how it may be affected by highly structured populations. Further empirical testing is needed.

405 Additional considerations

406 In addition to clustering threshold, other pipeline parameters are important for maximizing 407 orthology and minimizing paralogy. For instance, in pyRAD, MaxH is a paralogy filtering 408 parameter that represents the maximum number of heterozygous sites within a locus for an 409 individual. This parameter operates under the assumption that incorrectly clustered paralogues 410 will present as loci with unusually high levels of heterozygosity. Different values for parameters 411 such as these and their interactions with clustering threshold can also be tested using the metrics 412 and scripts presented here. Further, variance in sequencing depth among samples can be as or 413 more important than sequence divergence with respect to pairwise data missingness (Eaton, 414 Spriggs, Park, & Donoghue, 2017), particularly for studies sequenced at low depth. Fortunately, 415 low depth can usually be mitigated experimentally by sequencing samples more deeply.

(15)

416 Finally, we note that, depending on the specific research questions, choosing a clustering 417 threshold lower than the value that optimizes our seven metrics (which would cluster together 418 more of the rare allelic variants at the extreme tail of the divergence distribution at the cost of 419 improperly clustering more paralogous loci) may be appropriate. This may be particularly true for 420 studies that focus on identifying candidate loci under strong divergent selection, where the most 421 interesting homologous variants are likely to be the most genetically divergent. Further, in 422 systems where isolation by distance is not a good null hypothesis for relationships among 423 samples, metric 6 (the slope of genetic distance as a function of geographic distance) should be 424 discounted in favor of the other metrics.

425 Application to other systems

426 The reference manual for the software ipyrad recommends using thresholds of 95 % similarity 427 or lower to avoid oversplitting, but as levels of genetic variation and paralogy are inherently 428 specific to the system under study, we argue that this parameter does not lend itself well to default 429 values. Instead, we suggest that, where possible, researchers should investigate the emergent 430 relationships of divergence and missingness along known or inferred axes of genetic variation to 431 set an optimal threshold. Using the seven metrics described here, the accompanying scripts can be 432 used to visualize the relationship clustering thresholds and pairwise divergence and missingness. 433 Several previously suggested measures, including analyses of heterozygosity (metric 2) , 434 number of recovered SNPs (metric 3), and variance explained (metric 4) (Harvey et al., 2015; Ilut 435 et al., 2014; Mastretta-Yanes et al., 2015) are useful, but they do not directly measure the degree 436 to which the divergence among samples in the dataset affect the optimal choice of clustering 437 threshold. We suggest that researchers process their raw RADseq data using several different 438 clustering thresholds and evaluate multiple metrics described herein, including at least

439 heterozygosity or total number of SNPs (metrics 2 or 3), correlations between missingness and 440 divergence (metric 5), and cumulative variance explained by the first 3-10 principal components 441 (metric 4). The accompanying scripts at www.github.com/atcg/clustOpt can be used to calculate 442 these values directly from standard VCF files, which are produced by most RADseq pipelines, 443 including pyRAD and iPyrad (Eaton, 2014), Stacks (Catchen et al., 2013), and dDocent (Puritz, 444 Hollenbeck, & Gold, 2014). Specifically, the script vcfMissingness.pl can be used to generate 445 heatmaps of pairwise missingness that are clustered by genetic similarity similar to Figure 4 and

(16)

446 calculate pairwise missingness correlations as a function of genetic distance (metric 5). For 447 population genomic studies where isolation by distance is likely to play a role in the partitioning 448 of genetic variation (metric 6), ibdSlope.R can be used to calculate isolation by distance slopes 449 across different clustering thresholds for plotting figures similar to Figures 5B. And

450 vcfToPCAvarExplained.R can be used to calculate the cumulative variance explained by the most 451 important principal components starting from a collection of VCF files (metric 4).

452 453

454 Acknowledgements

455 This work used the XSEDE computing clusters Comet and Stampede2 clusters at the San 456 Diego Supercomputing Center (Towns et al., 2014), supported by NSF ACI-1548562. EMM and 457 HBS are supported by NSF-DEB 1257648 and grants from the US Fish and Wildlife Service, US 458 Bureau of Reclamation, and California Department of Transportation. MG was supported by a 459 grant from The Scientific and Technological Council of Turkey (TUBITAK).

460 Data Accessibility

461 Scripts that can be used to generate similar plots to those this manuscript from VCF files are 462 available at http://www.github.com/atcg/clustOpt/. Output and statistics files from the pyRAD 463 runs are available at Zenodo (doi: 10.5281/zenodo.2540263).

464 Author Contributions

465 EMM developed the metrics, analyzed the data, developed the accompanying scripts, and 466 wrote the initial draft manuscript, MG collected the Rana boylii data, and HBS assisted with 467 development of the metrics and edited the manuscript.

468 References

469 Andrews, K. R., Good, J. M., Miller, M. R., Luikart, G., & Hohenlohe, P. A. (2016). Harnessing 470 the power of RADseq for ecological and evolutionary genomics. Nature Reviews

471 Genetics, 17(2), 81–92. doi: 10.1038/nrg.2015.28

472 Catchen, J., Hohenlohe, P. A., Bassham, S., Amores, A., & Cresko, W. A. (2013). Stacks: an 473 analysis tool set for population genomics. Molecular Ecology, 22(11), 3124–3140. doi:

474 10.1111/mec.12354

(17)

475 Davey, J. W., & Blaxter, M. L. (2010). RADSeq: next-generation population genetics. Briefings

476 in Functional Genomics, 9(5–6), 416–423. doi: 10.1093/bfgp/elq031

477 Eaton, D. A. R. (2014). PyRAD: assembly of de novo RADseq loci for phylogenetic analyses.

478 Bioinformatics, 30(13), 1844–1849. doi: 10.1093/bioinformatics/btu121

479 Eaton, D. A. R., Spriggs, E. L., Park, B., & Donoghue, M. J. (2017). Misconceptions on Missing 480 Data in RAD-seq Phylogenetics with a Deep-scale Example from Flowering Plants.

481 Systematic Biology, 66(3), 399–412. doi: 10.1093/sysbio/syw092

482 Glenn, T. C., Nilsen, R., Kieran, T. J., Finger, J. W., Pierson, T. W., Bentley, K. E., … Faircloth, 483 B. C. (2016). Adapterama I: Universal stubs and primers for thousands of dual-indexed 484 Illumina libraries (iTru & iNext). BioRxiv, 049114. doi: 10.1101/049114

485 Hardie, D. C., & Hebert, P. D. . (2003). The nucleotypic effects of cellular DNA content in 486 cartilaginous and ray-finned fishes. Genome, 46(4), 683–706. doi: 10.1139/g03-040 487 Harvey, M. G., Judy, C. D., Seeholzer, G. F., Maley, J. M., Graves, G. R., & Brumfield, R. T. 488 (2015). Similarity thresholds used in DNA sequence assembly from short reads can 489 reduce the comparability of population histories across species. PeerJ, 3. doi: 490 10.7717/peerj.895

491 Hinegardner, R. (1968). Evolution of Cellular DNA Content in Teleost Fishes. The American

492 Naturalist, 102(928), 517–523.

493 Hinegardner, R., & Rosen, D. E. (1972). Cellular DNA Content and the Evolution of Teleostean 494 Fishes. The American Naturalist, 106(951), 621–644.

495 Huang, H., & Knowles, L. L. (2016). Unforeseen Consequences of Excluding Missing Data from 496 Next-Generation Sequences: Simulation Study of RAD Sequences. Systematic Biology,

497 65(3), 357–365. doi: 10.1093/sysbio/syu046

498 Ilut, D. C., Nydam, M. L., & Hare, M. P. (2014). Defining Loci in Restriction-Based Reduced 499 Representation Genomic Data from Nonmodel Species: Sources of Bias and Diagnostics 500 for Optimal Clustering. BioMed Research International, 2014, e675158. doi:

501 10.1155/2014/675158

502 Kolde, R. (2018). pheatmap: Pretty Heatmaps. https://CRAN.R-project.org/package=pheatmap. 503 (Version 1.0.10). Retrieved from https://CRAN.R-project.org/package=pheatmap

(18)

504 Leaché, A. D., Chavez, A. S., Jones, L. N., Grummer, J. A., Gottscho, A. D., & Linkem, C. W. 505 (2015). Phylogenomics of Phrynosomatid Lizards: Conflicting Signals from Sequence 506 Capture versus Restriction Site Associated DNA Sequencing. Genome Biology and

507 Evolution, 7(3), 706–719. doi: 10.1093/gbe/evv026

508 Lind, A. J., Spinks, P. Q., Fellers, G. M., & Shaffer, H. B. (2011). Rangewide phylogeography 509 and landscape genetics of the Western US endemic frog Rana boylii (Ranidae):

510 implications for the conservation of frogs and rivers. Conservation Genetics, 12(1), 269–

511 284.

512 Mastretta-Yanes, A., Arrigo, N., Alvarez, N., Jorgensen, T. H., Piñero, D., & Emerson, B. C. 513 (2015). Restriction site-associated DNA sequencing, genotyping error estimation and de 514 novo assembly optimization for population genetic inference. Molecular Ecology

515 Resources, 15(1), 28–41. doi: 10.1111/1755-0998.12291

516 McCartney-Melstad, E., Gidiş, M., & Shaffer, H. B. (2018). Population genomic data reveal 517 extreme geographic subdivision and novel conservation actions for the declining foothill 518 yellow-legged frog. Heredity, 121(2), 112–125. doi: 10.1038/s41437-018-0097-7

519 McCartney‐Melstad, E., Vu, J. K., & Shaffer, H. B. (2018). Genomic data recover previously 520 undetectable fragmentation effects in an endangered amphibian. Molecular Ecology,

521 27(22), 4430–4443. doi: 10.1111/mec.14892

522 McKinney, G. J., Waples, R. K., Seeb, L. W., & Seeb, J. E. (2017). Paralogs are revealed by 523 proportion of heterozygotes and deviations in read ratios in genotyping-by-sequencing 524 data from natural populations. Molecular Ecology Resources, 17(4), 656–669. doi: 525 10.1111/1755-0998.12613

526 Mirsky, A. E., & Ris, H. (1951). THE DESOXYRIBONUCLEIC ACID CONTENT OF 527 ANIMAL CELLS AND ITS EVOLUTIONARY SIGNIFICANCE. The Journal of

528 General Physiology, 34(4), 451–462.

529 Olmo, E. (1973). Quantitative variations in the nuclear DNA and phylogenesis of the amphibia.

530 Caryologia, 26(1), 43–68. doi: 10.1080/00087114.1973.10796525

531 Paradis, E., Claude, J., & Strimmer, K. (2004). APE: Analyses of Phylogenetics and Evolution in 532 R language. Bioinformatics, 20(2), 289–290. doi: 10.1093/bioinformatics/btg412

(19)

533 Paris, J. R., Stevens, J. R., & Catchen, J. M. (2017). Lost in parameter space: a road map for 534 stacks. Methods in Ecology and Evolution, 8(10), 1360–1373. doi:

10.1111/2041-535 210X.12775

536 Peterson, B. K., Weber, J. N., Kay, E. H., Fisher, H. S., & Hoekstra, H. E. (2012). Double Digest 537 RADseq: An Inexpensive Method for De Novo SNP Discovery and Genotyping in Model 538 and Non-Model Species. PLoS ONE, 7(5), e37135. doi: 10.1371/journal.pone.0037135 539 Puritz, J. B., Hollenbeck, C. M., & Gold, J. R. (2014). dDocent: a RADseq, variant-calling 540 pipeline designed for population genomics of non-model organisms. PeerJ, 2, e431. doi: 541 10.7717/peerj.431

542 R Core Team. (2017). R: A language and environment for statistical computing. R Foundation for 543 Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL

https://www.R-544 project.org/. Retrieved September 30, 2016, from http://www.R-project.org

545 Rodríguez‐Ezpeleta, N., Bradbury, I. R., Mendibil, I., Álvarez, P., Cotano, U., & Irigoien, X. 546 (2016). Population structure of Atlantic mackerel inferred from RAD-seq-derived SNP 547 markers: effects of sequence clustering parameters and hierarchical SNP selection.

548 Molecular Ecology Resources, 16(4), 991–1001. doi: 10.1111/1755-0998.12518

549 Rounsaville, T. J., & Ranney, T. G. (2010). Ploidy Levels and Genome Sizes of Berberis L. and 550 Mahonia Nutt. Species, Hybrids, and Cultivars. HortScience, 45(7), 1029–1033.

551 Shafer, A. B. A., Peart, C. R., Tusso, S., Maayan, I., Brelsford, A., Wheat, C. W., & Wolf, J. B. 552 W. (2017). Bioinformatic processing of RAD-seq data dramatically impacts downstream 553 population genetic inference. Methods in Ecology and Evolution, 8(8), 907–917. doi: 554 10.1111/2041-210X.12700

555 Smith, J. J., Putta, S., Zhu, W., Pao, G. M., Verma, I. M., Hunter, T., … Voss, S. R. (2009). 556 Genic regions of a large salamander genome contain long introns and novel genes. BMC

557 Genomics, 10(1), 19. doi: 10.1186/1471-2164-10-19

558 Stamatakis, A. (2014). RAxML version 8: a tool for phylogenetic analysis and post-analysis of 559 large phylogenies. Bioinformatics, 30(9), 1312–1313. doi: 10.1093/bioinformatics/btu033 560 Towns, J., Cockerill, T., Dahan, M., Foster, I., Gaither, K., Grimshaw, A., … Wilkins-Diehr, N. 561 (2014). XSEDE: Accelerating Scientific Discovery. Computing in Science & Engineering,

562 16(5), 62–74. doi: 10.1109/MCSE.2014.80

(20)

563 Vinogradov, A. E. (1998). Genome size and GC-percent in vertebrates as determined by flow 564 cytometry: The triangular relationship. Cytometry, 31(2), 100–109.

565 Waples, R. K., Seeb, J. E., & Seeb, L. W. (2017). Congruent population structure across 566 paralogous and nonparalogous loci in Salish Sea chum salmon (Oncorhynchus keta).

567 Molecular Ecology, 26(16), 4131–4144. doi: 10.1111/mec.14163

568 Wielstra, B., McCartney-Melstad, E., Arntzen, J. W., Butlin, R. K., & Shaffer, H. B. (2019). 569 Phylogenomics of the adaptive radiation of Triturus newts supports gradual ecological 570 niche expansion towards an incrementally aquatic lifestyle. Molecular Phylogenetics and

571 Evolution. doi: 10.1016/j.ympev.2018.12.032

572 Wiens, J. J., & Morrill, M. C. (2011). Missing Data in Phylogenetic Analysis: Reconciling 573 Results from Simulations and Empirical Data. Systematic Biology, 60(5), 719–731. doi: 574 10.1093/sysbio/syr025

575 Wright, S. (1943). Isolation by Distance. Genetics, 28(2), 114–138.

576 Zheng, X., Levine, D., Shen, J., Gogarten, S. M., Laurie, C., & Weir, B. S. (2012). A high-577 performance computing toolset for relatedness and principal component analysis of SNP 578 data. Bioinformatics (Oxford, England), 28(24), 3326–3328. doi:

579 10.1093/bioinformatics/bts606 580

581 Tables

582 Table 1. Loci = pyRAD loci with at least four samples containing data and paralogs removed; 583 RAxML alignment length = length of phylip alignment containing no more than 50% missing 584 data per site; variable sites = total SNPs; pars. sites = total parsimony-informative SNPs;

585 Unlinked SNPs = number of RAD loci that contain at least one SNP; Mean (SD) missingness rate 586 = average (standard deviation) fraction of genotype missingness across the 93 samples.

Threshold Loci RAxML alignment length—no missingness filter RAxML alignment length— max 50% missing Variable sites Pars. sites Unlinked SNPs Mean (SD) missingness rate 88% 72,111 7,240,654 1,854,097 185,727 113,977 53,023 0.589 (0.347)

Author Manuscript

(21)

89% 74,290 7,456,704 1,895,202 187,626 114,858 54,622 0.588 (0.348) 90% 77,057 7,731,870 1,945,213 191,023 116,092 56,640 0.590 (0.349) 91% 80,496 8,072,868 2,000,714 195,232 118,353 59,128 0.591 (0.350) 92% 84,362 8,457,019 2,064,330 199,556 121,403 61,845 0.592 (0.351) 93% 89,035 8,921,864 2,141,708 204,232 124,133 65,006 0.592 (0.353) 94% 94,159 9,431,576 2,232,926 207,335 126,454 68,221 0.590 (0.355) 95% 99,785 9,991,956 2,331,546 209,460 128,313 71,670 0.589 (3.356) 96% 105,798 10,590,415 2,433,922 207,339 128,472 74,759 0.584 (0.359) 97% 111,964 11,202,927 2,514,312 197,729 124,844 77,092 0.578 (0.360) 98% 118,105 11,811,076 2,454,473 175,579 111,020 76,965 0.589 (0.354) 99% 126,124 12,608,497 2,409,155 108,998 69,182 66,119 0.568 (0.347) 587 588 Figures

589 Figure 1: A) Map of Rana boylii sample localities, shown in grey diamonds. County outlines are 590 shown within California and Oregon.

591 124 W 122 W 120 W 118 W 116 W 114 W 3 4 N 3 6 N 3 8 N 4 0 N 4 2 N 4 4 N 4 6 N 0 100 200 km N

Author Manuscript

(22)

592

593 Figure 2: Fraction of loci inferred as paralogs and discarded by pyRAD. Box boundaries

594 represent the 25% to 75% interval, while the notched region represents a 95% confidence interval 595 around the median of each group of 93 sample observations. Whiskers represent the most extreme 596 values within 1.5 times the interquartile range from the upper and lower quartiles, and dots

597 represent outlier values more extreme than 1.5 times the interquartile range.

598 88 90 92 94 96 98 0 5 1 0 1 5 2 0 2 5

Clustering threshold (% similarity)

% fl a g g e d p a ra lo g s 599 600

601 Figure 3: A) Heterozygosity, B) total SNPs recovered across different clustering thresholds, and 602 C) cumulative variance of all biallelic SNPs recovered by pyRAD explained by the first eight 603 principal components across different clustering thresholds. Box boundaries represent the 25% to 604 75% interval, while the notched region represents a 95% confidence interval around the median 605 of each group of 93 sample observations. Whiskers represent the most extreme values within 1.5 606 times the interquartile range from the upper and lower quartiles, and dots represent outlier values 607 more extreme than 1.5 times the interquartile range.

(23)

608 88 90 92 94 96 98 0 .0 5 0 .1 0 0 .1 5 0 .2 0 0 .2 5 0 .3 0

Clustering threshold (% similar ity)

% h e te ro zyg o u s si te s

A

● ● ● ● ● ● ● ● ● _● ● ● 88 90 92 94 96 98 1 2 0 0 0 0 1 6 0 0 0 0 2 0 0 0 0 0

T o ta l S N P s

B

● ● ● ● ● ● ● ● ● ● ● ● 88 90 92 94 96 98 0 .3 7 0 0 .3 8 0 0 .3 9 0

C u m u la ti v e v a ri a n c e in fi rs t 8 P C s

C

609 Figure 4: Heat maps showing pairwise data missingness at clustering thresholds of 88% (A) and 610 99% (B). Rows/columns represent individual samples. Dendrograms represent samples clustered 611 by genetic similarity (SNP identity by state). To evenly distribute the membership among colors 612 in the heat maps, the color ramp is scaled to the quantiles of pairwise missingness across all 12 613 clustering thresholds. Pairwise data missingness heat maps for clustering thresholds between 89% 614 and 98% are shown in Figures S2-S11. Colors indicate missingness fractions as shown in the 615 center legend, ranging from 0.161 (16.1% missing data) to 0.606 (60.6% missing data).

616

A

B

617

(24)

618 Figure 5: A) Pearson’s correlation coefficient between pairwise genetic dissimilarity and data 619 missingness at different clustering thresholds and B) slopes of genetic isolation by geographic 620 distance across all 93 samples.

621 622 ● ● ● ● ● ● ● ● ● ● ● ● 88 90 92 94 96 98 0 .6 5 0 .7 0 0 .7 5 0 .8 0 0 .8 5

P C C b e tw e e n g e n e ti c d ist a n ce a n d m issi n g n e ss

A

● ● ● ● ● ● ● ● ● ● ● ● 88 90 92 94 96 98 1 .1 5 1 .2 5 1 .3 5 1 .4 5

% in cr e a se d S N P d iv e rg e n ce p e r 1 0 0 km

B

623 Figure 6: RAxML bootstrap values for analyses on the full data matrix at different clustering 624 thresholds. Each boxplot represents 10 mean bootstrap values from 10 different RAxML random 625 seed runs. Box boundaries represent the 25% to 75% interval of RAxML run means. Whiskers 626 represent the most extreme values within 1.5 times the interquartile range from the upper and 627 lower quartiles, and dots represent outlier values more extreme than 1.5 times the interquartile 628 range.

(25)

629 88 90 92 94 96 98 7 5 8 0 8 5

RAxML mean bootstrap v alues

with increasing clustering threshold

Clustering threshold (% similarity)

M e a n b o o ts tr a p v a lu e s (1 0 R A x M L ru n s ) 630

Author Manuscript

(26)

Author Manuscript

(27)

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 5 10 15 20 25 % flagged par alogs men_13029_f2.pdf

Author Manuscript

(28)

0.10 0.15 0.20 0.25 0.30 % heterozygous sites

A

● ●● ●● ● ●● ● ● ● 160000 200000 T otal SNPs

B

● ● ● ● ● ● ● ●● ● ● 0.380 0.390 Cum ulativ e v ar iance in first 8 PCs

C

(29)

(30)

● ● ● ● ● ● ● ● ● ● ● ● 0.65 0.70 0.75 0.80 0.85 PCC betw een genetic

distance and missingness

A

● ● ● ● ● ● ● ● ● ● ● ● 1.15 1.25 1.35 1.45 % increased SNP div ergence per 100 km

B

men_13029_f5.pdf

Author Manuscript

(31)

● ● ● ● ● ● ● ● 75 80 85

RAxML mean bootstrap values

with increasing clustering threshold

Mean bootstr

ap v

alues (10 RAxML r

uns)

men_13029_f6.pdf