TabGraphSyn: Graph-Guided Latent Diffusion for High-Fidelity and Privacy-Conscious Clinical Data Generation

Dec 29, 2025
12:40
Health Informatics

Abstract

The critical need for accessible patient data in clinical research is often hindered by privacy regulations and data scarcity. While synthetic data generation offers a promising solution, existing generative models face key limitations. GANs can suffer from training instability, while diffusion models typically process records independently and often neglect the local neighborhood structure of the data manifold. To address this gap, we introduce TabGraphSyn, a two-stage generative framework for synthesizing patient data. Our approach constructs a patient similarity graph (k-NN) to encode local neighborhood geometry and density in feature space. The resulting relational embeddings guide a latent diffusion model, ensuring the generative process preserves both single-record feature distributions and the intricate joint feature structure and local density structure of the original dataset. Evaluations on TCGA, AIDS and WBCD clinical datasets show TabGraphSyn outperforms tested baselines, achieving up to a 4.29% reduction in marginal distribution error and a 2.92% decrease in pairwise correlation error while maintaining 100% data validity. For downstream utility, classifiers trained on synthetic data matched real-data performance in a classification task, achieving an AUC of 99.96%. In a survival analysis, the synthetic data identified significant covariates with a high F1-score of 0.857. An ablation study confirms that leveraging similarity-based neighborhood embeddings via the GNN module is crucial for the observed improvements in fidelity and utility. Privacy audits (best DCR) confirmed robust deidentification. Embedding caching yielded up to 11% improvement in ten-fold augmentation, enabling large-cohort synthesis. By integrating similarity-based neighborhood structure into the generative process, TabGraphSyn offers a robust method for generating high-fidelity synthetic clinical data.
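The abstract does not specify how the patient similarity graph is built or how the DCR audit is computed, so the following is only a minimal sketch under stated assumptions: features are already numeric and scaled, similarity is plain Euclidean distance, and DCR is read as "distance to closest record" (the distance from each synthetic row to its nearest real row). The function names `knn_graph` and `distance_to_closest_record` are hypothetical and do not come from the paper.

```python
import numpy as np


def knn_graph(X, k):
    """Return an (n, k) array of neighbor indices for each row of X.

    Assumption: Euclidean distance on numeric, pre-scaled features.
    The paper's actual metric and choice of k are not given in the abstract.
    """
    X = np.asarray(X, dtype=float)
    # Pairwise squared Euclidean distances via broadcasting: shape (n, n).
    diff = X[:, None, :] - X[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)
    # A record must not be listed as its own neighbor.
    np.fill_diagonal(d2, np.inf)
    # Indices of the k smallest distances per row.
    return np.argsort(d2, axis=1)[:, :k]


def distance_to_closest_record(synthetic, real):
    """For each synthetic row, return the Euclidean distance to its nearest real row.

    Assumption: this is one common reading of DCR; the paper's exact
    variant ("best DCR") is not defined in the abstract.
    """
    synthetic = np.asarray(synthetic, dtype=float)
    real = np.asarray(real, dtype=float)
    diff = synthetic[:, None, :] - real[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)


# Toy data standing in for patient feature vectors (illustrative only).
patients = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [10.0, 10.0]])
neighbors = knn_graph(patients, k=1)

real_rows = np.array([[0.0, 0.0], [0.0, 4.0]])
synthetic_rows = np.array([[0.0, 0.0], [3.0, 4.0]])
dcr = distance_to_closest_record(synthetic_rows, real_rows)
```

In this sketch, a DCR of zero means a synthetic row exactly reproduces a real record, which is the case a privacy audit would want to flag. The brute-force pairwise distance matrix is only practical for small cohorts; a tree-based or approximate nearest-neighbor index would be the usual substitute at scale.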

Authors

Wu, Z., Chen, H., and Chen, J. Y.

Cite This Paper

Year: 2025
Category: Health Informatics
APA

Wu, Z., Chen, H., & Chen, J. Y. (2025). TabGraphSyn: Graph-Guided Latent Diffusion for High-Fidelity and Privacy-Conscious Clinical Data Generation. arXiv preprint arXiv:10.64898/2025.12.28.25342851.

MLA

Wu, Z., Chen, H., and Chen, J. Y. "TabGraphSyn: Graph-Guided Latent Diffusion for High-Fidelity and Privacy-Conscious Clinical Data Generation." arXiv preprint arXiv:10.64898/2025.12.28.25342851 (2025).