Thursday, January 23, 2014

The RAD partitioning works and is much cleaner, and tells us something meaningful

I was beginning to think this was a waste of time, but in fact the analysis is now much cleaner and nicer. The workflow is explained in the last few posts. The way I'm now deciding whether a locus votes for or against a tree is by a likelihood threshold. In the plots below, I've filtered out all loci with a log-likelihood spread of less than 5 points, and a locus only votes for the trees that fall within the top 2 lnL points of its range, and against the trees that fall within the bottom 2 lnL points. Narrowing these windows doesn't change the story, but it does make it noisier.
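A minimal base-R sketch of this voting rule, on a hypothetical loci-by-trees matrix of log-likelihoods (the function and argument names here are invented, not the actual analysis code):

```r
# Sketch only: 'lnL' is a loci-by-trees matrix of per-locus log-likelihoods.
# A locus is dropped if its lnL spread is under min.spread; otherwise it votes
# for every tree within vote.window points of its best tree, and against every
# tree within vote.window points of its worst.
vote.by.threshold <- function(lnL, min.spread = 5, vote.window = 2) {
  spread <- apply(lnL, 1, function(x) max(x) - min(x))
  lnL <- lnL[spread >= min.spread, , drop = FALSE]   # filter uninformative loci
  votes.for     <- t(apply(lnL, 1, function(x) x >= max(x) - vote.window))
  votes.against <- t(apply(lnL, 1, function(x) x <= min(x) + vote.window))
  # per-tree tallies: loci voting for, loci voting against, and the difference
  list(favor    = colSums(votes.for),
       disfavor = colSums(votes.against),
       net      = colSums(votes.for) - colSums(votes.against))
}
```

With the defaults this reproduces the rule above: a 5-point spread filter and 2-point voting windows at the top and bottom of each locus's likelihood range.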

The result is striking: there is no tree favored by more loci than the best tree. There should be an outlier test associated with this, but just eyeballing it, there seems to be no obvious second story lurking among the suboptimal trees that is supported by a lot of loci. What is needed now is a simulation of what the data look like from a clean, single-story dataset. It's hard to imagine it looking much cleaner than what we get here.

Loci that favor the optimal topology and each of 200 permuted topologies

Loci that disfavor the optimal topology and each of 200 permuted topologies

Tuesday, January 21, 2014

Updating the partitioned RAD analysis

I ended up rewriting the RAD scripts so they do split out a new dataset and set of trees for each locus... the workflow goes like this:
  1. Generate a set of topologies using NNI. Use: genTrees [fixed today to return unique trees, which it wasn't necessarily doing before]
  2. Export each locus and its unique set of trees, and a file with the tree indices for each locus. Use: gen.RAD.loci.datasets
  3. Make the resulting sh files executable (use chmod u+rwx raxml.batch.*), then execute them using sh. Note that the export is currently written for Linux and defaults to my directory structure.
  4. Read in the info files to get the likelihood for each tree and the index files to get the original tree index, then apply those likelihoods to the trees: use match.lnL.to.trees. I see now that the way the index is written is a bit awkward. What I want from the index is to know which new tree points to which old tree, but what the index actually gives me is, for each old tree, which other old tree it is identical to. I guess I can get what I want by just taking unique(treeIndex)... okay, that works.
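The index gymnastics in step 4 can be sketched like this (the names below are invented, not the internals of match.lnL.to.trees). If treeIndex gives, for each original tree, the index of the first original tree it is identical to, then unique(treeIndex) lists the original index of each unique tree, and match() maps the per-unique-tree likelihoods back onto all the original trees:

```r
# Hedged sketch: treeIndex[i] is, for original tree i, the index of the first
# original tree it is identical to; uniq.lnL holds one likelihood per unique
# tree, in the order the unique trees were exported.
lnL.for.original.trees <- function(treeIndex, uniq.lnL) {
  first.of.each <- unique(treeIndex)         # original index of each unique tree
  uniq.lnL[match(treeIndex, first.of.each)]  # look up each original tree's lnL
}
```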
Okay! That all runs fine... I'm back to the problem of deciding which trees to favor and disfavor. I've scrapped all the old SWUL code, which was convoluted, hard to follow, and inappropriately named (it's no longer successive weighting, but locus-partitioned data exploration).

Figuring this part out now...

Plot at the end: how many loci favor each tree, how many disfavor each tree, how many don't vote, and favoring - disfavoring

Monday, January 20, 2014

Frost, thrushes, hazelnuts

This morning was the first frost we've had since our arrival about a month ago. Two days ago, I started noticing thrushes singing: mistle thrush? song thrush? I haven't seen one. There's a big, beautiful hazelnut bush near the entrance here whose catkins have descended. Bright yellow Ulex are in bloom in the sandy fields behind the building here. Days are getting longer and everything is greening up.

Friday, January 17, 2014

Making unique trees -- updated in ape 3.1

Emmanuel Paradis is including an updated version of unique.multiPhylo in ape 3.1, which will go onto CRAN soon. The old trees index in that function is accessed using attr(x, 'old.index') and is ordered the same as in the version I posted earlier this week.

Monday, January 13, 2014

Making unique treesets

unique.multiPhylo doesn't toss out duplicate trees with my dataset. The problem isn't all.identical.phylo: my function compare.all.trees is slow, but it does seem to detect pairwise tree identity properly using all.equal.phylo (there are 16 in the treeset I have). The issue is with the indexing: replacing 'for (s in x[keep])' with 'for (s in 1:length(keep))' and then comparing with x[[s]] solves it.
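A minimal sketch of the corrected loop pattern, with base R's identical() standing in for ape's all.equal.phylo so the example runs without ape (the function name here is invented):

```r
# Keep only the first occurrence of each element, comparing with an arbitrary
# pairwise-equality function (identical() here; all.equal.phylo in the real case).
unique.by.compare <- function(x, same = identical) {
  keep <- integer(0)
  for (i in seq_along(x)) {
    dup <- FALSE
    # index the kept elements by position -- x[[keep[s]]] -- rather than
    # looping over x[keep] directly, which is where the original bug lived
    for (s in seq_along(keep)) {
      if (same(x[[i]], x[[keep[s]]])) { dup <- TRUE; break }
    }
    if (!dup) keep <- c(keep, i)
  }
  x[keep]
}
```

Looping over positions rather than over the subset itself keeps the bookkeeping between the kept indices and the original list straight.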

I also added to the output an attribute that indexes the original trees to the new (unique) trees. The updated function is at 

http://mor-systematics.googlecode.com/svn/trunk/rads/unique.multiPhylo.R

This finishes step 2 of the analysis workflow from last week.

Friday, January 10, 2014

Refining the RAD incongruence method

Building on last night's post, the results of these analyses may be cleaner when you filter for just the loci that are especially conclusive. Compare the two figures below:

Output from plot.swulLikelihood, including all 9,179 loci
Output from plot.swulLikelihood, filtering to the 587 loci with lnL range ≥ 1.5

In the lower panels, only loci that exhibit a log-likelihood range of 1.5 or greater were included. The result is a tighter distribution of points in all three panels, suggesting that we are getting rid of some noise. Better!

Now, what if we just let loci talk about trees they care about? Can we do this with the likelihood calculations as they were done? The way it's done now, each locus is forced to vote on every tree, unpruned. I think in the best implementation, I would take the 201 trees and, for each unique taxon set (not each locus, because many loci will have identical sets of taxa), spit out a treeset pruned to just that taxon set, then calculate the site likelihoods based on those treesets. Then every locus would be evaluating every tree, but many trees would be topologically identical because of the pruning... and which trees were topologically identical would differ among loci. Does this make a problem for interpretation? We are still investigating the balance of loci (i.e., for each tree, we are asking how many loci rank the tree in the top x% and the bottom x% of trees by lnL), but we are blurring (at best) the search for cliques of loci that favor alternative topologies, because the loci are actually voting on different topologies. Maybe this method really makes the most sense applied node by node, asking how the loci that have anything to say about a given node are voting.

But back to the analysis at hand, which is really focused on identifying alternative topologies supported by different suites of loci. Perhaps the best thing to do is to:
  1. Prune the trees for each locus. Output: a treeset for each locus that has identical tips in all trees and in the locus
  2. Identify the unique groups of topologies post-pruning. Output: a vector for each locus indexing all trees by the unique topology to which they belong; and a vector of unique topologies (a number is fine, as the topologies can be looked up from the treeset and its corresponding locus vector)
  3. Average the likelihood across loci for all unique topologies. Output: a vector of locus-likelihoods that corresponds with the length of the unique topologies vector.
  4. Give a "favored" or "disfavored" assignment to the unique topologies for each locus. Here is where a problem may come in for thinking about these trees as a whole, rather than working node by node: what about when a locus is only meaningful for a handful of topologies? Ties among topologies will not have an effect on the ranking. So if there are 3 topologies, we have a 33rd percentile and a 67th percentile. If we are doing a 50-50 analysis, we say the top topology is supported at the 50th percentile, the bottom topology is rejected at the 50th percentile, and the middle one is both supported and rejected. Every topology will be counted at least once by every locus... but hold it! This will inflate the weight of loci with the poorest taxon sampling, because they will impact the most trees. Perhaps there is a good solution: use the 33rd/67th percentiles to avoid just this problem, and toss out loci that have only two topologies. Output: a vector of 1, -1, or 0 for the unique topologies of each locus, indicating whether each is favored, disfavored, or neutral.
  5. Assign the rankings to all topologies. Output: a vector of 1, -1, or 0 for every topology, using the index created in step 2.
Okay. I think this is a lot cleaner. And in fact, this could be done just fine on a set of topologies uniquely created for each locus, and that would be cleaner still, but it would be more time spent chopping up datasets and analyzing trees on each dataset, probably for no or negligible difference in the result. For now, I'll work with the likelihoods we got by analyzing the full trees.
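The voting rule in step 4 might look something like this for a single locus (a sketch under the assumptions above; the function and argument names are invented). Loci with only two unique topologies are tossed, and the 33rd/67th percentile cutoffs are the defaults:

```r
# Sketch of step 4: given the vector of likelihoods for one locus's unique
# topologies, return 1 (favored), -1 (disfavored), or 0 (neutral) per topology.
# Loci with two or fewer unique topologies vote for nothing.
vote.by.percentile <- function(lnL, lower = 1/3, upper = 2/3) {
  if (length(lnL) <= 2) return(rep(0L, length(lnL)))
  q <- quantile(lnL, c(lower, upper))    # the 33rd / 67th percentile cutoffs
  ifelse(lnL >= q[2], 1L,                # at or above the upper cutoff: favored
         ifelse(lnL <= q[1], -1L, 0L))   # at or below the lower: disfavored
}
```

Step 5 would then just scatter these votes back over all topologies with the step-2 index, the same unique/match trick used in the RAD export workflow.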

Time to rewrite.

Thursday, January 9, 2014

Trawling through genomic datasets to find alternative phylogenetic histories

It's the first week at Pierroton, and I've been working on a revision of our RAD phylogenetics paper for PLoS ONE. Here's the question: what does a history of introgression do to a whole lot of anonymous markers? Can you use the pattern of support levels, marker by marker, to identify alternative topologies that are suboptimal but important? I'm not sure how straightforward this is. My idea for dealing with this in 2010 came from a conversation I had in graduate school with David Baum, in which he pointed out that identifying suites of markers that support different topologies might be an effective way of uncovering the different histories imprinted in a cross-species AFLP dataset. In fact, this idea was used by Trueman (1998) in his work on reverse successive weighting, which was implemented as a method of identifying alternative topologies encoded by a dataset. The idea is that if you peel back the loci that have a CI of 1.0 on the globally MP tree, you may find a second- or third-best topology buried beneath.

In the current paper, we're doing something like this, but we're using likelihood instead of parsimony, and creating topologies to compare with each other using NNI. When I first tried this in 2010 for a talk, I plotted tree likelihood on the x-axis and the number of loci for which that tree is the best of the trees (in lnL) on the y-axis. We were using a very sparse matrix that had not been error-checked, and we got a result that was easy to interpret: the optimal tree was in the upper right-hand corner of the plot (highest likelihood and lots of loci supporting it), the others scattered off toward the origin.

Figure 1
With more work on the dataset, though, and a lot more confidence in our clustering, we have a very different finding. Take, for example, the case in which we plot the number of loci that favor a tree at the 0.975 level (placing it in the top 2.5% of trees by likelihood) against the tree's likelihood (Fig. 1). The globally optimal tree (the yellow dot) is way down at the lower right-hand corner of the plot. Perhaps because there are so many nearly identical trees, every tree competes with many others for the favor of every locus, and it is difficult for any tree to stand out as exceptionally well supported in terms of the number of loci that favor it at a stringent cutoff.

Figure 2
But with a more relaxed cutoff, the story changes. In the extreme case in which we chop the distribution right in half and ask what trees are in the top 50% of the likelihood distribution for each locus, we find the globally optimal tree near (though not at) the upper right-hand corner of the plot (Fig. 2).

So what would this plot look like if there were two very distinct histories, strongly supported, due to hybridization? It would be nice if we found two nice high points corresponding to the two histories (e.g., one in which Quercus macrocarpa is sister to Q. bicolor, one in which it is sister to Q. alba). We certainly don't see anything like that here, no matter what likelihood threshold we use. Rather, what we're seeing is that at a given likelihood stratum, there is a wide range in the number of loci that will support a given tree, no matter what the cutoff. Because of the competition among trees for the best loci, the 50-50 visualization may give the most information about support among trees, and in fact this analysis makes even more sense when you compare the number of loci favoring trees with the number of loci disfavoring trees (Fig. 3). The globally optimal tree is right near the top of the pile when we rank trees by the difference between the number of loci favoring and the number disfavoring each tree. It turns out, in fact, that the five best trees are mostly topologically indistinguishable from one another, differing primarily in slight branch-length differences. There is, in other words, no obvious signal of phylogenetic discordance in our dataset, due perhaps to our small and phylogenetically very sparse sample (20 sampled of ca. 225 in the clade).

Figure 3
I have one misgiving about this analysis: I have not tried to tailor trees to the loci voting on them. For example, a locus that only tells us about relationships among four species should really only vote on trees after they have been pruned to those four taxa, and I suspect that the fact that these pruned trees would differ from locus to locus should be reflected in how the rankings are established.
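For what it's worth, the favor/disfavor tallies behind Figs. 1-3 can be sketched in a few lines of base R (invented names; the real code works from the RAxML per-tree likelihoods). Each locus ranks the trees by lnL; a tree counts as favored by a locus if it falls in that locus's top q fraction and disfavored if it falls in the bottom q fraction; Fig. 3 then ranks trees by favor minus disfavor with q = 0.5:

```r
# Sketch only: 'lnL' is a loci-by-trees matrix of log-likelihoods.
tally.votes <- function(lnL, q = 0.5) {
  # percentile rank of each tree within each locus (ties get average rank)
  ranks <- t(apply(lnL, 1, function(x) rank(x) / length(x)))
  favor    <- colSums(ranks > 1 - q)   # loci placing the tree in their top q
  disfavor <- colSums(ranks <= q)      # loci placing it in their bottom q
  list(favor = favor, disfavor = disfavor, net = favor - disfavor)
}
```

At q = 0.025 this is the stringent 0.975-level count of Fig. 1; at q = 0.5 it is the 50-50 split, where a tree at a locus's exact median is counted both for and against.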