Improving optimal subsampling through stratification


Dec 23, 2025 · 9:13
Methodology

Abstract

Recent works have proposed optimal subsampling algorithms to improve computational efficiency in large datasets and to design validation studies in the presence of measurement error. Existing approaches generally fall into two categories: (i) designs that optimize individualized sampling rules, where unit-specific probabilities are assigned and applied independently, and (ii) designs based on stratified sampling with simple random sampling within strata. Focusing on the logistic regression setting, we derive the asymptotic variances of estimators under both approaches and compare them numerically through extensive simulations and an application to data from the Vanderbilt Comprehensive Care Clinic cohort. Our results reinforce that stratified sampling is not merely an approximation to individualized sampling, showing instead that optimal stratified designs are often more efficient than optimal individualized designs through their elimination of between-stratum contributions to variance. These findings suggest that optimizing over the class of individualized sampling rules overlooks highly efficient sampling designs and highlight the often underappreciated advantages of stratified sampling.

Summary

This paper investigates optimal subsampling strategies for logistic regression in large datasets, a common problem in fields like medicine where computational limitations or measurement errors necessitate analyzing subsets of the data. The authors compare two primary approaches: individualized sampling, where each unit has a unique probability of being selected, and stratified sampling, where the population is divided into subgroups (strata) and samples are drawn independently from each. The research focuses on minimizing the asymptotic variance of the estimator of the regression coefficients.

The authors analytically derive the asymptotic variances of estimators under both individualized and stratified sampling in the logistic regression setting. They then compare these approaches through extensive simulations across various data-generating scenarios and apply them to a real-world dataset from the Vanderbilt Comprehensive Care Clinic cohort.

The key finding is that optimal stratified designs often outperform optimal individualized designs. This is because stratified sampling eliminates between-stratum variance, leading to more efficient estimators, especially when the stratification is informative (i.e., strata are homogeneous). This contradicts the implicit assumption in much of the recent literature that individualized sampling is inherently superior.

The paper's contribution lies in demonstrating the often-underappreciated advantages of stratified sampling in optimal subsampling. It challenges the focus on individualized sampling rules and provides a theoretical and empirical basis for prioritizing stratified approaches, particularly when dealing with large, complex datasets and error-prone measurements. This matters to the field because it offers a potentially more efficient and practical alternative for researchers and practitioners working with big data and limited resources.
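The two designs compared above can be sketched in code. The following Python example (a minimal illustration, not the authors' implementation) contrasts individualized Poisson sampling, where each unit gets its own inclusion probability, with stratified sampling that draws a simple random sample within quantile-based strata; both estimators reweight by inverse inclusion probabilities. The sampling score used to drive both designs is a crude proxy for the influence-function norm and is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic_mle(X, y, w=None, iters=50):
    """Weighted logistic MLE via Newton-Raphson (IRLS)."""
    if w is None:
        w = np.ones(len(y))
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1 - p))[:, None]).T @ X
        beta += np.linalg.solve(hess, grad)
    return beta

# Synthetic population (stand-in for a large dataset)
N = 20_000
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
beta_true = np.array([-1.0, 0.5, -0.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))
beta_full = logistic_mle(X, y)           # full-data MLE (benchmark)

n = 2_000                                # subsample budget
# Crude proxy for the influence-function norm (assumption of this sketch)
score = np.abs(y - y.mean()) * np.linalg.norm(X, axis=1)

# (i) Individualized (Poisson) sampling: unit-specific probabilities,
#     applied independently to each unit
pi = np.minimum(1.0, n * score / score.sum())
take = rng.random(N) < pi
beta_ind = logistic_mle(X[take], y[take], w=1.0 / pi[take])

# (ii) Stratified sampling: strata from score quantiles, SRS within
#      each stratum, inverse-probability weights N_h / n_h
K = 4
edges = np.quantile(score, np.linspace(0, 1, K + 1))
strata = np.clip(np.searchsorted(edges, score, side="right") - 1, 0, K - 1)
idx, w = [], []
for h in range(K):
    members = np.flatnonzero(strata == h)
    n_h = max(1, round(n * len(members) / N))   # proportional allocation
    chosen = rng.choice(members, size=min(n_h, len(members)), replace=False)
    idx.append(chosen)
    w.append(np.full(len(chosen), len(members) / len(chosen)))
idx, w = np.concatenate(idx), np.concatenate(w)
beta_str = logistic_mle(X[idx], y[idx], w=w)
```

The stratified design fixes the number of draws per stratum, which is what removes the between-stratum contribution to the variance that Poisson sampling retains.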

Key Insights

  • Optimal stratified sampling can achieve lower variance than optimal individualized sampling in logistic regression, particularly when stratification is informative and eliminates between-stratum variance.
  • The advantage of stratified sampling is more pronounced when the number of covariates is small enough to allow stratification based on the quantiles of their influence functions.
  • In a simulation with discrete covariates (treating X as fixed), optimal stratified sampling with known outcomes achieved zero variance, recovering the exact MLE from the full data and demonstrating its efficiency.
  • Pilot studies are crucial for implementing stratified sampling when the true outcome is unknown; a pilot sample size of roughly half the total sample size is recommended for good performance, while small pilot studies in high-measurement-error settings can yield the highest MSE.
  • The analytical results are based on influence functions and can be applied to any asymptotically linear estimator, extending the applicability beyond logistic regression to other models like Cox regression.
  • The authors derive the asymptotic variances under both approaches, highlighting the trade-off between individualized selection and the elimination of between-stratum variation in stratified sampling.
  • The paper challenges the implicit assumption in existing literature that individualized sampling is always superior, revealing contexts where stratified sampling is more efficient.
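The zero-variance insight for discrete covariates is easy to verify numerically: when covariates are discrete and the outcome is known, stratifying on the distinct (x, y) cells makes every unit within a stratum identical, so a weighted subsample reproduces the full-data score equations exactly and hence the exact full-data MLE. A minimal Python check (illustrative setup, not the paper's simulation):

```python
import numpy as np

rng = np.random.default_rng(1)

def logistic_mle(X, y, w=None, iters=50):
    """Weighted logistic MLE via Newton-Raphson."""
    if w is None:
        w = np.ones(len(y))
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
        beta += np.linalg.solve(
            (X * (w * p * (1 - p))[:, None]).T @ X, X.T @ (w * (y - p))
        )
    return beta

# Discrete covariate with four levels
N = 10_000
x = rng.integers(0, 4, size=N)
X = np.column_stack([np.ones(N), x.astype(float)])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-0.5 + 0.4 * x))))
beta_full = logistic_mle(X, y)

# Strata = distinct (x, y) cells; tiny SRS per cell, weights N_h / n_h
idx, w = [], []
for xv in np.unique(x):
    for yv in (0, 1):
        cell = np.flatnonzero((x == xv) & (y == yv))
        if len(cell) == 0:
            continue
        chosen = rng.choice(cell, size=min(2, len(cell)), replace=False)
        idx.append(chosen)
        w.append(np.full(len(chosen), len(cell) / len(chosen)))
idx, w = np.concatenate(idx), np.concatenate(w)
beta_sub = logistic_mle(X[idx], y[idx], w=w)
# Within each cell all units share the same (x, y), so the weighted score
# equations match the full-data score equations term by term, and the
# subsample MLE coincides with the full-data MLE.
```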

Practical Implications

  • Researchers and practitioners working with large datasets and logistic regression can improve the efficiency of their analyses by considering stratified subsampling designs.
  • The findings are particularly relevant for fields like medical research, where electronic health records often contain measurement errors and computational limitations necessitate subsampling.
  • Engineers and data scientists can use the derived formulas and algorithms to implement optimal stratified sampling in their applications. The R package "optimall" can be used for this purpose.
  • Future research should explore combining individualized sampling within strata to further improve stratified sampling designs.
  • The research opens avenues for exploring the application of stratified sampling in other statistical models and settings, such as survival analysis and semi-parametric estimation.
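The pilot-based workflow mentioned in the insights can be sketched as a two-phase design: fit the model on a pilot simple random sample, score every unit with a pilot-based proxy for its influence-function norm, form strata from score quantiles, and allocate the phase-two sample by Neyman allocation (n_h proportional to N_h times the within-stratum score spread S_h). This Python sketch assumes the outcome is available for scoring (the computational-efficiency setting); the score proxy and the use of Neyman allocation are assumptions of the illustration, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)

def logistic_mle(X, y, w=None, iters=50):
    """Weighted logistic MLE via Newton-Raphson."""
    if w is None:
        w = np.ones(len(y))
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
        beta += np.linalg.solve(
            (X * (w * p * (1 - p))[:, None]).T @ X, X.T @ (w * (y - p))
        )
    return beta

N, n, n_pilot, K = 50_000, 2_000, 1_000, 5
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta0 = np.array([-1.0, 0.8])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta0)))

# Phase 1: pilot SRS and pilot model fit
pilot = rng.choice(N, size=n_pilot, replace=False)
beta_pilot = logistic_mle(X[pilot], y[pilot])

# Score every unit with a pilot-based proxy for its influence-function norm
p_hat = 1.0 / (1.0 + np.exp(-X @ beta_pilot))
score = np.abs(y - p_hat) * np.linalg.norm(X, axis=1)

# Strata from score quantiles; Neyman allocation n_h proportional to N_h * S_h
edges = np.quantile(score, np.linspace(0, 1, K + 1))
strata = np.clip(np.searchsorted(edges, score, side="right") - 1, 0, K - 1)
N_h = np.array([np.sum(strata == h) for h in range(K)])
S_h = np.array([score[strata == h].std() for h in range(K)])
nh = np.maximum(2, np.round(n * N_h * S_h / (N_h * S_h).sum()).astype(int))

# Phase 2: SRS within strata, inverse-probability weights N_h / n_h
idx, w = [], []
for h in range(K):
    members = np.flatnonzero(strata == h)
    chosen = rng.choice(members, size=min(nh[h], len(members)), replace=False)
    idx.append(chosen)
    w.append(np.full(len(chosen), len(members) / len(chosen)))
idx, w = np.concatenate(idx), np.concatenate(w)
beta_hat = logistic_mle(X[idx], y[idx], w=w)
```

In a measurement-error application, the scoring step would use an error-prone surrogate outcome rather than the true outcome, which is where the pilot size recommendations above become important.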
