Adapting cluster graphs for inference of continuous trait evolution on phylogenetic networks
Episode

Adapting cluster graphs for inference of continuous trait evolution on phylogenetic networks

Dec 19, 20259:47
q-bio.PEstat.CO
No ratings yet

Abstract

Dynamic programming approaches have long been applied to fit models of univariate and multivariate trait evolution on phylogenetic trees for discrete and continuous traits, and more recently adapted to phylogenetic networks with reticulation. We previously showed that various trait evolution models on a network can be readily cast as probabilistic graphical models, so that likelihood-based estimation can proceed efficiently via belief propagation on an associated clique tree. Even so, exact likelihood inference can grow computationally prohibitive for large complex networks. Loopy belief propagation can similarly be applied to these settings, using non-tree cluster graphs to optimize a factored energy approximation to the log-likelihood, and may provide a more practical trade-off between estimation accuracy and runtime. However, the influence of cluster graph structure on this trade-off is not precisely understood. We conduct a simulation study using the Julia package PhyloGaussianBeliefProp to investigate how varying maximum cluster size affects this trade-off for Gaussian trait evolution models on networks. We discuss recommended choices for maximum cluster size, and prove the equivalence of likelihood-based and factored-energy-based parameter estimates for the homogeneous Brownian motion model.

Summary

This paper addresses the computational challenges of inferring continuous trait evolution on phylogenetic networks, particularly when using Gaussian trait models. While exact likelihood inference via belief propagation on clique trees is possible, it becomes computationally prohibitive for large and complex networks. The authors explore the use of loopy belief propagation (LBP) on non-tree cluster graphs as a more scalable approximate inference method. The main research question revolves around understanding how the structure of the cluster graph, specifically the maximum cluster size (k), affects the trade-off between estimation accuracy and runtime. The authors conduct a simulation study using their Julia package, PhyloGaussianBeliefProp, to investigate the impact of varying 'k' on Gaussian trait evolution models on phylogenetic networks. They focus on maximum likelihood estimation for the Brownian motion model on networks of varying topological complexity. They generate cluster graphs using the join-graph structuring algorithm and assess the accuracy of parameter estimation by maximizing the factored energy, an approximation of the log-likelihood. They also prove the equivalence of likelihood-based and factored-energy-based parameter estimates for the homogeneous Brownian motion model under certain conditions. The findings provide recommendations for choosing the maximum cluster size to balance computational cost and estimation accuracy, demonstrating that accurate approximations can be achieved with cluster sizes smaller than those required for exact inference.

Key Insights

  • Factored energy can closely approximate the log-likelihood using cluster graphs with a maximum cluster size (k) much smaller than what is needed for exact inference using a clique tree (k*), especially for networks with high complexity.
  • For networks with complexity as high as k* = 54 and small trait dimensions (p ≤ 4), good accuracy may be attained for k ≈ 10, suggesting a potential for significant computational savings.
  • There exists a phase transition in approximation error, where the error decreases rapidly up to a certain 'k' value, after which the rate of decrease slows down. In the M ̈uller network, this transition occurred around k = 11, corresponding to a threshold for achieving a reasonable approximation with at most 10% error.
  • The runtime behavior depends on the trade-off between the cost of computing messages and the number of messages required for calibration. For simpler networks, runtime generally improves with increasing 'k', while for more complex networks, the relationship is more nuanced.
  • Theorem 1 proves that the factored energy differs from the log-likelihood by an additive constant across all parameters under the Brownian motion model on bicombining networks, assuming perfect calibration. This provides a theoretical justification for using factored energy maximization as a proxy for maximum likelihood estimation.
  • Numerical errors during belief propagation can lead to issues with positive-definite belief precisions, particularly when candidate parameter values are far from the optimal value. The authors' implementation addresses this by returning a large value for the factored energy when encountering infinite messages or ill-defined states.
  • An analysis of theoretical bounds suggests that LBP could be useful for large phylogenetic studies with thousands of leaves involving reticulate evolution and phylodynamic studies with large, highly-reticulated networks, especially when using multi-regime Gaussian trait models.

Practical Implications

  • This research provides a more scalable approach to inferring continuous trait evolution on complex phylogenetic networks, which is crucial for analyzing large datasets in phylogenomics and phylodynamics.
  • Researchers can use the findings to select appropriate maximum cluster sizes for loopy belief propagation, balancing computational cost and estimation accuracy. The blunt rule provided (k = k* if pk* ≲ 60 and p ≤ 2, and k = min(cp,k ∗ ) otherwise, where c≈ 3) offers a practical starting point for choosing 'k'.
  • The PhyloGaussianBeliefProp package, developed in Julia, provides a tool for implementing loopy belief propagation for Gaussian trait models on phylogenetic networks.
  • Future research can focus on improving the robustness of belief propagation to numerical errors, developing adaptive message schedules, and exploring alternative cluster graph construction methods to further enhance the accuracy and scalability of the approach.

Links & Resources

Authors