A Communication-Efficient Distributed Algorithm for Learning with Heterogeneous and Structurally Incomplete Multi-Site Data
Abstract
In multicenter biomedical research, integrating data from multiple decentralized sites yields more robust and generalizable findings because of the larger sample size and the ability to account for between-site heterogeneity. However, sharing individual-level data across sites is often difficult due to patient privacy concerns and regulatory restrictions. To overcome this challenge, many distributed algorithms that fit a global model by communicating only aggregated information across sites have been proposed. A major challenge in applying existing distributed algorithms to real-world data is that their validity often relies on the assumption that data across sites are independently and identically distributed, which is frequently violated in practice. In biomedical applications, data distributions across clinical sites can be heterogeneous. Additionally, the set of covariates available at each site may vary due to different data collection protocols. We propose a distributed inference framework for data integration in the presence of both distribution heterogeneity and data structural heterogeneity. By modeling heterogeneous and structurally missing data using a density-tilted generalized method of moments, we develop a general aggregated data-based distributed algorithm that is communication-efficient and heterogeneity-aware. We establish the asymptotic properties of our estimator and demonstrate the validity of our method via simulation studies.
Summary
This paper addresses the challenge of integrating data from multiple decentralized sites in biomedical research when individual-level data sharing is restricted by privacy concerns and regulatory limitations. The authors focus on scenarios where data across sites are both heterogeneous (i.e., not identically distributed across sites) and structurally incomplete (i.e., different covariates are available at different sites). They propose a distributed inference framework that uses a density-tilted generalized method of moments (GMM) to address these challenges. The key idea is to use reference samples at each site to estimate and communicate the covariate density, which allows the analysis to adjust for between-site heterogeneity. The proposed algorithm is communication-efficient, requiring only one round of aggregated data exchange. The authors establish the asymptotic properties of their estimator and demonstrate its validity through simulation studies that compare it against existing methods that ignore data heterogeneity.
The framework is significant because it offers a practical way to integrate multi-site biomedical data, where heterogeneity and structural missingness are common. By accounting for both issues, the method improves the accuracy and reliability of inferences drawn from the integrated data. Its communication efficiency makes it particularly appealing for large-scale multi-center studies where minimizing data transfer is crucial. Density ratio tilting combined with the GMM framework provides a flexible and theoretically grounded approach, and the simulation results show that the proposed method outperforms existing approaches, particularly when data heterogeneity is present.
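The density-ratio adjustment described in this summary can be illustrated with a generic importance-weighting sketch. This is not the paper's estimator: the function names (`normal_pdf`, `tilted_mean`), the one-dimensional normal densities, and the sample size are all assumptions chosen for illustration, and the densities are taken as known rather than estimated from reference samples. The sketch only shows why reweighting a site's sample moments by the ratio of a target density to the site density can remove the bias caused by a shifted covariate distribution.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution (illustrative stand-in for an estimated density)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def tilted_mean(values, xs, target_pdf, site_pdf):
    """Self-normalized average of `values`, weighted by target_pdf(x) / site_pdf(x)."""
    weights = [target_pdf(x) / site_pdf(x) for x in xs]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical setting: the site observes X ~ N(1, 1), the target population has X ~ N(0, 1).
rng = random.Random(0)
x_site = [rng.gauss(1.0, 1.0) for _ in range(20000)]

# Moment of interest: E[X^2], which equals 1 under the target and 2 under the site.
g = [x * x for x in x_site]

naive = sum(g) / len(g)  # ignores the distribution shift
tilted = tilted_mean(
    g,
    x_site,
    target_pdf=lambda x: normal_pdf(x, 0.0, 1.0),
    site_pdf=lambda x: normal_pdf(x, 1.0, 1.0),
)

print(f"naive estimate of E[X^2]:  {naive:.3f}")
print(f"tilted estimate of E[X^2]: {tilted:.3f}")
```

In a GMM setting the same kind of weight would multiply each observation's contribution to the moment conditions rather than a single sample mean. How the paper constructs and combines those conditions is not specified in this summary.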
Key Insights
- The paper introduces a novel communication-efficient distributed algorithm, "dist-GMM-C," based on density-tilted generalized method of moments (GMM) to handle both distribution heterogeneity and structural missingness in multi-site data.
- The method uses reference samples at each site to estimate covariate densities, which are then communicated to a central site to adjust for heterogeneity using density ratio tilting. This avoids the need to share individual-level patient data.
- A copula-based density estimation approach is employed to reduce the communication cost of transmitting high-dimensional density information. Each site needs to transmit only marginal distribution information and correlation parameters.
- Simulation studies demonstrate that the proposed method ("dist-GMM-C") outperforms existing methods such as GENMETA (which assumes homogeneity) and local analysis, especially in the presence of distribution heterogeneity. GENMETA's biases increase significantly when data are heterogeneous.
- Using a synthetic data approach ("dist-GMM-S") to estimate the density ratios leads to noticeable biases and higher variability than the copula-based approach ("dist-GMM-C"), due to the bias in multivariate density estimation and the extra sampling variation from generating synthetic data.
- The simulation studies suggest that a reference sample size of around 300 is sufficient for satisfactory density estimation, even when the study sample size within each site is 1000. A grid density of 100 is likewise sufficient and does not greatly increase computational cost.
- The theoretical results establish the asymptotic properties of the proposed estimator, providing a rigorous foundation for its use in practice.
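The copula-based communication step in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes a Gaussian copula, summarizes each marginal by empirical quantiles on an evenly spaced grid, and uses hypothetical helper names (`marginal_grid`, `normal_scores`, `correlation`, `site_summary`). The reference sample size of 300 and the grid density of 100 are the values mentioned in the simulation findings. The data-generating model is invented for the example.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the inverse CDF


def marginal_grid(values, n_grid=100):
    """Empirical quantiles of one covariate at n_grid evenly spaced probability levels."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[min(n - 1, int((k + 0.5) / n_grid * n))] for k in range(n_grid)]


def normal_scores(values):
    """Map each value to a standard normal score through its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def correlation(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def site_summary(columns, n_grid=100):
    """Aggregated summary a site could send: marginal grids plus a score correlation matrix."""
    scores = [normal_scores(col) for col in columns]
    p = len(columns)
    corr = [[correlation(scores[i], scores[j]) for j in range(p)] for i in range(p)]
    return {"grids": [marginal_grid(col, n_grid) for col in columns], "corr": corr}


# Hypothetical reference sample of 300 records with two dependent covariates.
rng = random.Random(1)
x1, x2 = [], []
for _ in range(300):
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    x1.append(z1)
    x2.append(math.exp(0.6 * z1 + 0.8 * z2))  # skewed marginal, correlated with x1

summary = site_summary([x1, x2], n_grid=100)
print("values per marginal grid:", len(summary["grids"][0]))
print("normal-score correlation:", round(summary["corr"][0][1], 3))
```

Under these assumptions the site sends two grids of 100 numbers and one correlation parameter instead of 300 individual records, and the payload does not grow with the reference sample size. A central site could rebuild an approximate joint density from such a summary, although the exact reconstruction used by dist-GMM-C is not described here.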
Practical Implications
- The proposed algorithm can be applied in various real-world biomedical research settings where data are distributed across multiple sites, heterogeneous, and structurally incomplete, such as studies using electronic health records (EHRs) from different hospitals or research centers.
- Researchers and practitioners in biomedical informatics, biostatistics, and data science can benefit from this method to integrate data from multiple sources while respecting patient privacy and regulatory constraints.
- The algorithm can be implemented using standard statistical software packages and can be readily adapted to different types of data and models.
- Future research directions include extending the method to handle more complex data structures, such as longitudinal data or time series data, and exploring alternative density estimation techniques to further improve communication efficiency and accuracy.
- The growing availability of large-scale health datasets, such as the All of Us initiative, enhances the practicality of applying this methodology to leverage real-world data for research and clinical decision-making.