Efficient and scalable clustering of survival curves

Dec 18, 2025 · 12:06
Methodology · stat.CO · Machine Learning

Abstract

Survival analysis encompasses a broad range of methods for analyzing time-to-event data, with one key objective being the comparison of survival curves across groups. Traditional approaches for identifying clusters of survival curves often rely on computationally intensive bootstrap techniques to approximate the null hypothesis distribution. While effective, these methods impose significant computational burdens. In this work, we propose a novel approach that leverages k-means clustering and the log-rank test to efficiently identify and cluster survival curves. Our method eliminates the need for computationally expensive resampling, significantly reducing processing time while maintaining statistical reliability. By systematically evaluating survival curves and determining optimal clusters, the proposed method ensures a practical and scalable alternative for large-scale survival data analysis. Through simulation studies, we demonstrate that our approach achieves results comparable to existing bootstrap-based clustering methods while dramatically improving computational efficiency. These findings suggest that the log-rank-based clustering procedure offers a viable and time-efficient solution for researchers working with multiple survival curves in medical and epidemiological studies.

Summary

The paper addresses the computational burden associated with clustering survival curves, a common task in medical and epidemiological studies. Traditional methods often rely on computationally intensive bootstrap resampling techniques to approximate the null hypothesis distribution, which becomes impractical for large datasets. The authors propose a novel, efficient, and scalable clustering method called fastSCC (Fast and Scalable Clustering of Survival Curves) that combines k-means clustering with the log-rank test. This approach eliminates the need for computationally expensive resampling by iteratively merging survival curves based on the results of log-rank tests, ensuring an optimal partitioning of survival groups. The fastSCC method involves estimating survival curves using the Kaplan-Meier estimator, clustering them using k-means, and then performing log-rank tests within each cluster to determine if survival functions are significantly different. P-values are adjusted for multiple testing. The method is evaluated through simulation studies and real-world datasets, demonstrating comparable accuracy to existing bootstrap-based methods, while achieving a significant reduction in computational time. The authors show that fastSCC can reduce execution time by approximately 98% compared to bootstrap resampling while maintaining a Type I error rate close to the nominal level and a success rate of around 95%.
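The first two steps described above (Kaplan-Meier estimation, then k-means on the estimated curves) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation or the `clustcurv` API: the function names `kaplan_meier` and `kmeans_curves`, the shared time grid, the squared Euclidean distance, and the random initialization are all assumptions made for the example.

```python
import random


def kaplan_meier(times, events, grid):
    """Kaplan-Meier survival estimate evaluated at each point of `grid`.

    times  -- observed follow-up times
    events -- 1 if the event occurred, 0 if the observation was censored
    grid   -- increasing time points at which to read off the curve
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    steps = []  # (event time, survival estimate from that time onward)
    surv = 1.0
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        surv *= 1.0 - deaths / at_risk
        steps.append((t, surv))
    curve = []
    for g in grid:
        value = 1.0
        for t, s in steps:
            if t <= g:
                value = s
            else:
                break
        curve.append(value)
    return curve


def kmeans_curves(curves, k, n_iter=100, seed=0):
    """Plain k-means (Lloyd's algorithm) on curves given as equal-length lists."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(curves, k)]  # assumed initialization
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest center in squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(c, centers[j])))
            for c in curves
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each center becomes the pointwise mean of its members.
        for j in range(k):
            members = [c for c, lab in zip(curves, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

In this sketch, each group's data would be passed through `kaplan_meier` on a common grid, and the resulting list of curves handed to `kmeans_curves`. Curves that share a label would then be compared with a log-rank test, as the summary describes.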

Key Insights

  • Novel Approach: The paper introduces a novel approach that combines k-means clustering and the log-rank test to efficiently cluster survival curves, eliminating the need for computationally intensive bootstrap resampling.
  • Significant Speedup: Simulation studies show execution times reduced by approximately 98% compared to traditional bootstrap-based methods. Table 8 reports speedups of roughly 80 to 100 times, which corresponds to runtime reductions of about 98.8% to 99%.
  • Comparable Accuracy: The proposed method achieves results comparable to existing bootstrap-based clustering methods in terms of Type I error rate and statistical power, maintaining a success rate of approximately 95%.
  • Type I Error Control: The fastSCC method effectively controls the Type I error rate, remaining close to the nominal level across different sample sizes and censoring rates.
  • Impact of Multiple Testing Correction: When more than one cluster contains several survival curves, a log-rank test is run in each of those clusters, so the resulting p-values need a multiple testing correction. Experiment Ib demonstrates that omitting the correction can fail to keep the Type I error at the nominal significance level.
  • Correction Method Choice: Experiment III shows that the Bonferroni, Holm, Benjamini-Hochberg (BH), and Hommel corrections yield very similar Type I error control and statistical power across a wide range of sample sizes and censoring conditions. Within the proposed method, the choice among them therefore has little effect on the final clustering.
  • Real-world Validation: The method is validated using two real-world datasets ("rotterdam" and "flchain"), demonstrating its applicability in complex and challenging scenarios. Table 10 shows that the fastSCC method is significantly faster than the bootstrap method, especially for large datasets.
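To make the multiple testing step concrete, here is a minimal sketch of three of the corrections named above (Bonferroni, Holm, and BH), written from their standard definitions. The function names are chosen for this example and are not taken from the paper's code. Hommel is omitted because its adjustment is more involved.

```python
def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, m * p) for p in pvals]


def holm(pvals):
    """Holm step-down adjustment; results are returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # ascending p-values
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The (rank+1)-th smallest p-value is scaled by the remaining test count.
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up adjustment (controls the false discovery rate)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)  # descending
    adjusted = [0.0] * m
    running_min = 1.0
    for k, i in enumerate(order):
        rank = m - k  # 1-based rank of this p-value in ascending order
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

With one p-value per multi-curve cluster, a cluster would be flagged as heterogeneous when its adjusted p-value falls below the chosen significance level.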

Practical Implications

  • Real-world Applications: The fastSCC method offers a practical and time-efficient solution for researchers working with multiple survival curves in medical, biological, and epidemiological studies.
  • Beneficiaries: Researchers and data scientists analyzing large-scale survival data will benefit from the reduced computational cost and improved scalability of the fastSCC method.
  • Practitioner Usage: Practitioners can use the fastSCC method to efficiently identify clusters of survival curves, enabling them to gain insights into different subpopulations and their respective risk patterns. The R code is available on GitHub, and the `clustcurv` package is available on CRAN.
  • Future Research: Future research directions include exploring alternative weighting schemes for the log-rank test, extending the theoretical properties of the method, and evaluating its applicability in broader contexts of survival analysis. Investigating its behavior under different censoring mechanisms and alternative distance metrics is also a promising avenue.
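As an illustration of what an alternative weighting scheme for the log-rank test could look like, the sketch below implements a two-sample weighted log-rank statistic in plain Python. It is an assumption-laden example, not the authors' code: the function name `weighted_logrank` and the `weight` callback are hypothetical, only two groups are compared (the method itself may compare several curves within a cluster), and the Gehan-Breslow weight is one standard choice, not necessarily one the authors intend to study.

```python
import math


def weighted_logrank(times1, events1, times2, events2, weight=lambda n_at_risk: 1.0):
    """Two-sample weighted log-rank test.

    `weight` maps the pooled number at risk at an event time to a weight.
    The default (constant 1) gives the standard log-rank test; passing
    `lambda n: n` gives the Gehan-Breslow (generalized Wilcoxon) test.
    Returns (chi-square statistic with 1 degree of freedom, p-value).
    """
    pooled = [(t, e, 0) for t, e in zip(times1, events1)]
    pooled += [(t, e, 1) for t, e in zip(times2, events2)]
    event_times = sorted({t for t, e, _ in pooled if e == 1})

    score, variance = 0.0, 0.0
    for t in event_times:
        n = sum(1 for u, _, _ in pooled if u >= t)              # at risk, pooled
        n1 = sum(1 for u, _, g in pooled if u >= t and g == 0)  # at risk, group 1
        d = sum(1 for u, e, _ in pooled if u == t and e == 1)   # events, pooled
        d1 = sum(1 for u, e, g in pooled if u == t and e == 1 and g == 0)
        w = weight(n)
        score += w * (d1 - d * n1 / n)  # observed minus expected in group 1
        if n > 1:
            # Hypergeometric variance of d1 given the margins at time t.
            variance += w * w * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)

    if variance == 0.0:
        return 0.0, 1.0  # no information to separate the groups
    chi2 = score * score / variance
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))
```

Weighting by the number at risk emphasizes early differences between curves, while the unweighted version treats all event times equally. That is the kind of trade-off a study of alternative weights would examine.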
