Friday, January 10, 2014

Refining the RAD incongruence method

To build on last night's post: the results of these analyses may be cleaner if you filter for just the loci that are especially conclusive. Compare the two figures below:

Output from plot.swulLikelihood, including all 9,179 loci
Output from plot.swulLikelihood, filtering to the 587 loci with lnL range ≥ 1.5

In the lower panels, only loci that exhibit a log-likelihood range of 1.5 or greater were included. The result is a tighter distribution of points in all three panels, suggesting that we are getting rid of some noise. Better!
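As a concrete sketch of that filtering step (in R, since the plots come from plot.swulLikelihood): assume the per-locus, per-tree log-likelihoods sit in a matrix called locusLnL, with loci in rows and the 201 trees in columns. That object name and layout are placeholders of mine, not the actual output format of plot.swulLikelihood.

    ## per-locus spread of log-likelihoods across the trees
    lnL.range <- apply(locusLnL, 1, function(x) diff(range(x, na.rm = TRUE)))

    ## keep only the conclusive loci; 1.5 is the cutoff used in the second figure
    keep <- lnL.range >= 1.5
    locusLnL.conclusive <- locusLnL[keep, ]   # 587 of the 9,179 loci at this cutoff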

Now, what if we just let loci talk about the trees they care about? Can we do that with the likelihood calculations as they were originally run? The way it's done now, each locus is forced to vote on every tree, unpruned. I think in the best implementation, I would take the 201 trees and, for each unique taxon set (not each locus, because many loci will have identical sets of taxa), spit out a treeset pruned to just that taxon set, then calculate the site likelihoods based on those pruned treesets. Every locus would still be evaluating every tree, but many trees would be topologically identical because of the pruning... and which trees were topologically identical would differ among loci. Does this make a problem for interpretation? We are still investigating the balance of loci (for each tree, we are asking how many loci rank the tree in the top x% and bottom x% of trees, using lnL), but we are blurring, at best, the search for cliques of loci that favor alternative topologies, because the loci are actually voting on different topologies. Maybe this method really makes the most sense to apply node-by-node, asking how the loci that have anything to say about a given node are voting.
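For what it's worth, the pruning-by-taxon-set idea could look something like the sketch below, assuming trees is a multiPhylo object holding the 201 trees (read in with ape) and locus.taxa is a list giving the taxa sampled in each locus; both object names are hypothetical.

    library(ape)

    ## unique taxon sets, not one treeset per locus
    taxon.sets <- unique(lapply(locus.taxa, sort))

    ## one pruned treeset per unique taxon set; site likelihoods would then be
    ## calculated against these rather than against the unpruned trees
    pruned.treesets <- lapply(taxon.sets, function(taxa) {
      out <- lapply(trees, function(tr) drop.tip(tr, setdiff(tr$tip.label, taxa)))
      class(out) <- "multiPhylo"
      out
    })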

But back to the analysis at hand, which is really focused on identifying alternative topologies supported by different suites of loci. Perhaps the best approach is to:
  1. Prune the trees for each locus. Output: a treeset for each locus in which the tips of every tree match the taxa sampled in that locus
  2. Identify the unique topologies post-pruning. Output: for each locus, a vector indexing all trees by the unique topology to which they belong; and a vector of unique topologies (a count is fine, as the topologies can be looked up from the treeset and the locus's corresponding index vector)
  3. Average the likelihood, for each locus, across the trees assigned to each unique topology. Output: for each locus, a vector of averaged likelihoods whose length matches the unique-topologies vector.
  4. Give a "favored" or "disfavored" assignment to the unique topologies for each locus. Here is where a problem may come in if we think about these trees as a whole rather than working node-by-node: what about when a locus is only meaningful for a handful of topologies? Ties among topologies will not affect the ranking. So if there are 3 topologies, we have a 33rd percentile and a 67th percentile. If we are doing a 50-50 analysis, we say the top topology is supported at the 50th percentile, the bottom topology is rejected at the 50th percentile, and the middle is both supported and rejected. Every topology will be counted at least once by every locus... but hold it! This will increase the weight of loci that have the poorest taxon sampling, because they will impact the most trees. Perhaps there is a good solution: use the 33rd / 67th percentiles to avoid just this problem, and toss out loci that have only two topologies. Output: a vector of 1, -1, or 0 for the unique topologies of each locus, indicating whether each is favored, disfavored, or neutral.
  5. Assign the rankings back to the full set of trees. Output: a vector of 1, -1, or 0 for every tree, using the index created in step 2.
Okay. I think this is a lot cleaner. And in fact, this could be done just fine on a set of topologies created uniquely for each locus, and that would be cleaner still, but it would mean more time spent chopping up datasets and analyzing trees on each dataset, probably for no or negligible difference in the result. For now, I'll work with the likelihoods we got by analyzing the full trees; a rough sketch of steps 1 through 5 follows.
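Here is that sketch, under the same placeholder assumptions as above: trees is the multiPhylo of 201 trees, locus.taxa lists the taxa sampled in each locus, and locusLnL is the loci x trees matrix of log-likelihoods from the unpruned analysis. Treat it as a sketch of the bookkeeping, not a finished implementation.

    library(ape)

    votes <- matrix(0, nrow = length(locus.taxa), ncol = length(trees))

    for (i in seq_along(locus.taxa)) {
      ## 1. prune every tree down to the taxa present in this locus
      pruned <- lapply(trees, function(tr)
        drop.tip(tr, setdiff(tr$tip.label, locus.taxa[[i]])))

      ## 2. index each tree by the unique topology it collapses to
      topo <- integer(length(pruned))   # tree -> unique-topology index
      reps <- list()                    # one representative tree per unique topology
      for (j in seq_along(pruned)) {
        hit <- 0
        for (k in seq_along(reps)) {
          if (isTRUE(all.equal(pruned[[j]], reps[[k]], use.edge.length = FALSE))) {
            hit <- k
            break
          }
        }
        if (hit == 0) {
          reps[[length(reps) + 1]] <- pruned[[j]]
          hit <- length(reps)
        }
        topo[j] <- hit
      }

      ## toss out loci that distinguish fewer than three topologies (see step 4)
      if (length(reps) < 3) next

      ## 3. average the locus lnL across the trees within each unique topology
      topo.lnL <- tapply(locusLnL[i, ], topo, mean)

      ## 4. favored / disfavored / neutral at the 33rd and 67th percentiles
      cuts <- quantile(topo.lnL, probs = c(1/3, 2/3))
      topo.vote <- ifelse(topo.lnL >= cuts[2], 1, ifelse(topo.lnL <= cuts[1], -1, 0))

      ## 5. map the votes back onto the full set of trees via the step-2 index
      votes[i, ] <- topo.vote[topo]
    }

    ## for each tree: how many loci favor it, and how many disfavor it
    favored    <- colSums(votes == 1)
    disfavored <- colSums(votes == -1)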

Time to rewrite.
