Thursday, January 9, 2014

Trawling through genomic datasets to find alternative phylogenetic histories

It's the first week at Pierroton, and I've been working on a revision of our RAD phylogenetics paper for PLoS ONE. Here's the question: what does a history of introgression do to a whole lot of anonymous markers? Can you use the pattern of support levels, marker by marker, to identify alternative topologies that are suboptimal but important? I'm not sure how straightforward this is. My idea for dealing with this in 2010 came from a conversation I had in graduate school with David Baum, in which he pointed out that identifying suites of markers that support different topologies might be an effective way of identifying the different histories imprinted in a cross-species AFLP dataset. In fact, this idea was used by Trueman (1998) in his work on reverse successive weighting, which was implemented as a method of identifying alternative topologies encoded by a dataset. The idea is that if you peel back the loci that have a consistency index (CI) of 1.0 on the globally most parsimonious (MP) tree, you may find a second- or third-best topology buried beneath.

In the current paper, we're doing something like this, but we're using likelihood instead of parsimony, and generating the topologies to compare with one another by nearest-neighbor interchange (NNI). When I first tried this in 2010 for a talk, I plotted tree likelihood on the x-axis and, on the y-axis, the number of loci for which that tree has the highest log-likelihood (lnL) of all the trees compared. We were using a very sparse matrix that had not been error-checked, and we got a result that was easy to interpret: the optimal tree was in the upper right-hand corner of the plot (highest likelihood and lots of loci supporting it), and the others scattered off toward the origin.
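The two axes of that first plot can be sketched in a few lines. This is only an illustration, not the code used for the paper: it assumes the per-locus log-likelihoods are already available as a loci-by-trees matrix (the name `locus_lnl` and the toy numbers are made up), and it takes the tree's overall lnL to be the sum of its per-locus values.

```python
import numpy as np

def best_tree_counts(locus_lnl):
    """Return (total lnL per tree, number of loci for which each tree is best).

    locus_lnl: array of shape (n_loci, n_trees); entry [i, j] is the
    log-likelihood of locus i on tree j (hypothetical input layout).
    """
    lnl = np.asarray(locus_lnl, dtype=float)
    # x-axis: overall tree lnL, assumed here to be the sum over loci
    total_lnl = lnl.sum(axis=0)
    # y-axis: for each locus find its single best tree, then tally per tree
    best_tree = np.argmax(lnl, axis=1)
    n_best = np.bincount(best_tree, minlength=lnl.shape[1])
    return total_lnl, n_best

# Toy data: 3 loci (rows) scored on 4 candidate trees (columns)
toy = [[-12, -10, -15, -11],
       [-19, -20, -22, -25],
       [-6, -5, -8, -7]]
total_lnl, n_best = best_tree_counts(toy)
```

Note that `argmax` awards a tied locus to the first of the tied trees, which may matter when many of the trees are nearly identical.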

Figure 1
With more work on the dataset, though, and a lot more confidence in our clustering, we have a very different finding. Take, for example, the case in which we plot the number of loci that favor a tree at the 0.975 level (placing it in the top 2.5% of trees by likelihood) against the tree's likelihood (Fig. 1). The globally optimal tree (the yellow dot) is way down at the lower right-hand corner of the plot. Perhaps because there are so many nearly identical trees, every tree competes with many others for the favor of every locus, and it is difficult for any tree to stand out as exceptionally well supported in terms of the number of loci that favor it at a stringent cutoff.

Figure 2
But with a more relaxed cutoff, the story changes. In the extreme case in which we chop the distribution right in half and ask what trees are in the top 50% of the likelihood distribution for each locus, we find the globally optimal tree near (though not at) the upper right-hand corner of the plot (Fig. 2).
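For concreteness, here is a rough sketch of the tally behind both plots, under the assumption that "favoring at a cutoff" means the tree's lnL for a locus sits at or above that quantile of the locus's lnL values across all trees. The function name, the matrix layout, and the toy numbers are invented for illustration; the actual analysis may have defined the cutoff differently.

```python
import numpy as np

def loci_favoring(locus_lnl, cutoff):
    """Count, for each tree, the loci that place it at or above `cutoff`.

    locus_lnl: array of shape (n_loci, n_trees) of per-locus lnL
    (hypothetical layout). cutoff: quantile in [0, 1], e.g. 0.975 or 0.5.
    """
    lnl = np.asarray(locus_lnl, dtype=float)
    # One threshold per locus: the requested quantile across all trees
    thresholds = np.quantile(lnl, cutoff, axis=1, keepdims=True)
    # A locus favors a tree if the tree's lnL reaches that threshold
    return (lnl >= thresholds).sum(axis=0)

# Toy data: 3 loci (rows) scored on 4 candidate trees (columns)
toy = [[-12, -10, -15, -11],
       [-19, -20, -22, -25],
       [-6, -5, -8, -7]]
strict = loci_favoring(toy, 0.975)   # stringent cutoff, as in Fig. 1
relaxed = loci_favoring(toy, 0.5)    # 50-50 split, as in Fig. 2
```

Even in this toy case the stringent cutoff hands each locus to a single tree, while the relaxed cutoff lets each locus vote for half the trees, which spreads the counts out.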

So what would this plot look like if there were two very distinct histories, strongly supported, due to hybridization? It would be nice if we found two high points corresponding to the two histories (e.g., one in which Quercus macrocarpa is sister to Q. bicolor, one in which it is sister to Q. alba). We certainly don't see anything like that here, no matter what likelihood threshold we use. Rather, what we're seeing is that within a given likelihood stratum, there is a wide range in the number of loci that support a given tree, no matter what the cutoff.

Because of the competition among trees for the best loci, the 50-50 visualization may give the most information about support among trees, and in fact this analysis makes even more sense when you compare the number of loci favoring each tree with the number of loci disfavoring it (Fig. 3). The globally optimal tree is right near the top of the pile when we rank trees by the difference between the number of loci favoring and the number disfavoring each tree. It turns out, in fact, that the five best trees are mostly topologically indistinguishable from one another, differing primarily in branch lengths. There is, in other words, no obvious signal of phylogenetic discordance in our dataset, due perhaps to our small and phylogenetically very sparse sample (20 sampled of ca. 225 in the clade).
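The favoring-minus-disfavoring ranking might be sketched as follows. The assumption here (not stated above) is that a locus "favors" a tree when the tree is in the top half of that locus's lnL distribution and "disfavors" it when the tree is in the bottom half; the names and toy values are illustrative only.

```python
import numpy as np

def rank_by_net_support(locus_lnl):
    """Rank trees by (# loci favoring) - (# loci disfavoring).

    locus_lnl: array of shape (n_loci, n_trees) of per-locus lnL
    (hypothetical layout). Returns (tree indices best-first, net scores).
    """
    lnl = np.asarray(locus_lnl, dtype=float)
    medians = np.median(lnl, axis=1, keepdims=True)
    favoring = (lnl > medians).sum(axis=0)      # top half for that locus
    disfavoring = (lnl < medians).sum(axis=0)   # bottom half for that locus
    net = favoring - disfavoring
    # Stable sort on the negated score keeps ties in input order
    order = np.argsort(-net, kind="stable")
    return order, net

# Toy data: 3 loci (rows) scored on 4 candidate trees (columns)
toy = [[-12, -10, -15, -11],
       [-19, -20, -22, -25],
       [-6, -5, -8, -7]]
order, net = rank_by_net_support(toy)
```

A tree sitting exactly on a locus's median counts neither for nor against it in this version, which is one of several reasonable choices.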

Figure 3
I have one misgiving about this analysis: I have not tried to tailor trees to the loci voting on them. For example, a locus that only tells us about relationships among four species should really only vote on trees after they have been pruned to those four taxa, and I suspect that the fact that the pruned trees differ from locus to locus should be reflected in how the rankings are established.
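To make the pruning idea concrete, here is a small self-contained sketch. It is not part of the analysis above: it treats trees as rooted nested tuples of taxon names, which is a simplification, and "taxonD" and "taxonE" are placeholder names. It shows how two trees that differ across the full taxon set can collapse to the same topology once pruned to the taxa a locus actually covers, so that the locus cannot really choose between them.

```python
def prune(tree, keep):
    """Prune a rooted tree (nested tuples of taxon names) to the taxa in `keep`."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [c for c in (prune(child, keep) for child in tree) if c is not None]
    if not kept:
        return None
    # Suppress nodes left with a single child after pruning
    return kept[0] if len(kept) == 1 else tuple(kept)

def canonical(tree):
    """Order children consistently so equal topologies compare equal."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canonical(c) for c in tree), key=repr))

# Two candidate trees differing in the placement of macrocarpa
tree_a = ((("macrocarpa", "bicolor"), "alba"), ("taxonD", "taxonE"))
tree_b = ((("macrocarpa", "alba"), "bicolor"), ("taxonD", "taxonE"))

# A locus scored in only four taxa (bicolor missing)
locus_taxa = {"macrocarpa", "alba", "taxonD", "taxonE"}
same_for_locus = (canonical(prune(tree_a, locus_taxa))
                  == canonical(prune(tree_b, locus_taxa)))

# A locus that includes bicolor can tell the two trees apart
informative_taxa = {"macrocarpa", "bicolor", "alba"}
differ_for_locus = (canonical(prune(tree_a, informative_taxa))
                    != canonical(prune(tree_b, informative_taxa)))
```

One way to use this would be to let each locus vote only among the distinct pruned topologies, rather than among all of the full trees.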

