Abstract
Adjusting for (baseline) covariates with working regression models has become standard practice in the analysis of randomized clinical trials (RCTs). When the dimension $p$ of the covariates is large relative to the sample size $n$, specifically $p = o(n)$, adjusting for covariates even in a linear working model by ordinary least squares can yield overly large bias, defeating the purpose of improving efficiency. This issue arises when no structural assumptions are imposed on the outcome model, a scenario that we refer to as the assumption-lean setting. Several new estimators have been proposed to address this issue. However, they focus mainly on simple randomization under the finite-population model and do not cover covariate-adaptive randomization (CAR) schemes under the superpopulation model. Because it improves covariate balance between treatment groups, CAR is more widely adopted in RCTs, and the superpopulation model fits better when subjects are enrolled sequentially or when generalizing to a larger population is of interest. Thus, there is an urgent need to develop procedures in these settings, as the current regulatory guidance provides little concrete direction. In this paper, we fill this gap by demonstrating that an adjusted estimator based on second-order $U$-statistics can almost unbiasedly estimate the average treatment effect and enjoy a guaranteed efficiency gain if $p = o(n)$. In our analysis, we generalize the coupling technique commonly used in the CAR literature to $U$-statistics and also obtain several useful results for analyzing inverse sample Gram matrices by a delicate leave-$m$-out analysis, which may be of independent interest. Both synthetic and semi-synthetic experiments are conducted to demonstrate the superior finite-sample performance of our new estimator compared to popular benchmarks.
Summary
This paper addresses covariate adjustment in randomized clinical trials (RCTs) when the number of covariates, *p*, is large relative to the sample size, *n*, specifically *p = o(n)*, in an "assumption-lean" setting (i.e., without structural assumptions on the outcome model). In this regime, adjustment by ordinary least squares (OLS) can introduce substantial bias, negating the efficiency gains that covariate adjustment is meant to deliver. Previous work has focused mainly on simple randomization under the finite-population model; this paper instead treats covariate-adaptive randomization (CAR) schemes under the superpopulation model. The authors propose an adjusted estimator of the average treatment effect (ATE) based on second-order U-statistics. They generalize the coupling technique from the CAR literature to U-statistics and develop new results for analyzing inverse sample Gram matrices using a leave-m-out analysis. They prove that their estimator is root-n consistent and asymptotically normal (√n-CAN) and enjoys a guaranteed efficiency gain over the unadjusted estimator when *p = o(n)*, under mild tail assumptions. They also provide a consistent variance estimator for the adjusted estimator, enabling valid statistical inference. The authors validate their theoretical findings through synthetic and semi-synthetic experiments, which show better finite-sample performance than existing benchmark methods. The work fills a gap in the literature by supplying a theoretically grounded method for covariate adjustment in RCTs with many covariates under CAR. The FDA guidelines acknowledge the limitations of existing methods when the number of covariates is large, which underscores the need for such methods. The authors also provide practical guidance and an R package to facilitate adoption of their method.
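For readers unfamiliar with the term, a second-order U-statistic averages a symmetric kernel over all unordered pairs of observations. The sketch below is a generic textbook illustration, not the paper's estimator: the function names `second_order_u_statistic` and `variance_kernel` are hypothetical, and the kernel is chosen only because its U-statistic is the familiar unbiased sample variance.

```python
from itertools import combinations

import numpy as np


def second_order_u_statistic(data, kernel):
    """Average a symmetric kernel over all unordered pairs i < j."""
    pairs = list(combinations(range(len(data)), 2))
    return sum(kernel(data[i], data[j]) for i, j in pairs) / len(pairs)


def variance_kernel(x, y):
    """Kernel whose U-statistic equals the unbiased sample variance."""
    return 0.5 * (x - y) ** 2


sample = np.array([1.0, 2.0, 4.0, 7.0])
u_stat = second_order_u_statistic(sample, variance_kernel)

# The U-statistic agrees with NumPy's unbiased variance (ddof=1).
print(u_stat, np.var(sample, ddof=1))
```

The explicit loop over pairs costs O(n^2) kernel evaluations, which is acceptable for illustration but would usually be vectorized in practice.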
Key Insights
- Novel Estimator: The paper introduces a new ATE estimator based on second-order U-statistics, designed for assumption-lean settings with CAR and *p = o(n)*.
- Theoretical Guarantees: The proposed estimator is proven to be √n-CAN and to have a guaranteed efficiency gain over the unadjusted estimator under mild conditions.
- Technical Contributions: The paper extends the coupling technique for CAR to U-statistics and develops new results for analyzing inverse sample Gram matrices using leave-m-out analysis, which may be of independent interest.
- Variance Estimation: A consistent variance estimator is provided, enabling valid statistical inference with the proposed ATE estimator.
- Empirical Validation: Simulation studies and semi-synthetic data analysis show that the proposed estimator outperforms existing methods across a range of data-generating processes.
- OLS Bias: The paper demonstrates, analytically and empirically, the bias of OLS-based estimators when *p* is moderately large, even when *p = o(n)*, which motivates the U-statistic approach. Figure 1(a) shows that the analytical bias of the OLS estimator tracks the empirical bias closely.
- Efficiency Gain: The paper shows that when *p = o(n)*, the proposed estimator achieves the same asymptotic variance as the OLS estimator (when the OLS model is correctly specified with *p* fixed) and is thus at least as efficient as the unadjusted estimator.
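To make the two baselines named in these points concrete, the following minimal sketch computes the unadjusted difference in means and an OLS-adjusted estimate from a linear working model fit separately in each arm. It is an assumption-laden illustration rather than the paper's method: the function names, the simulated data, and the use of simple (not covariate-adaptive) randomization are all choices made here for brevity, and the proposed U-statistic estimator is not implemented.

```python
import numpy as np


def unadjusted_ate(y, a):
    """Difference in mean outcomes between treated (a == 1) and control (a == 0)."""
    return float(y[a == 1].mean() - y[a == 0].mean())


def ols_adjusted_ate(y, a, x):
    """Fit an OLS working model per arm, then average the predicted contrasts."""
    design = np.column_stack([np.ones(len(y)), x])  # intercept plus covariates
    predictions = []
    for arm in (1, 0):
        coef, *_ = np.linalg.lstsq(design[a == arm], y[a == arm], rcond=None)
        predictions.append(design @ coef)  # predicted outcome for every subject
    return float(np.mean(predictions[0] - predictions[1]))


# Hypothetical simulated trial: simple randomization, linear outcome, true ATE = 2.
rng = np.random.default_rng(0)
n, p = 200, 5
x = rng.normal(size=(n, p))
a = rng.permutation(np.repeat([0, 1], n // 2))
y = 2.0 * a + x @ np.ones(p) + rng.normal(size=n)

print("unadjusted:", unadjusted_ate(y, a))
print("OLS-adjusted:", ols_adjusted_ate(y, a, x))
```

Rerunning the simulation with `p` increased toward `n` is one informal way to explore the regime the paper is concerned with, where the OLS adjustment can become biased.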
Practical Implications
- Improved RCT Analysis: The proposed estimator allows researchers to more effectively adjust for covariates in RCTs, particularly when dealing with high-dimensional data and CAR schemes.
- Application to Clinical Trials: The research directly benefits practitioners involved in clinical trials by providing a robust and efficient method for estimating treatment effects.
- Regulatory Compliance: The paper addresses a gap identified in FDA guidelines regarding covariate adjustment with a large number of covariates, offering a method that sponsors can potentially use to comply with regulatory requirements.
- Software Implementation: The authors provide an R package that incorporates the new adjusted estimator under CAR, making it readily available for practical use.
- Future Research: The paper opens up avenues for future research, such as exploring data-adaptive variable selection methods for further efficiency gains and investigating the performance of other estimators of the inverse Gram matrix.
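As background on the CAR schemes referred to above, stratified permuted-block randomization is one commonly used example: subjects are grouped by a discrete baseline covariate, and treatment is assigned in small balanced blocks within each group. The sketch below is a generic illustration under that assumption; the function name and parameters are hypothetical and are not taken from the authors' R package.

```python
import numpy as np


def stratified_block_randomization(strata, block_size=4, seed=0):
    """Assign treatment (1) or control (0) within each stratum using permuted blocks."""
    if block_size % 2 != 0:
        raise ValueError("block_size must be even for 1:1 allocation")
    rng = np.random.default_rng(seed)
    strata = np.asarray(strata)
    assignment = np.empty(len(strata), dtype=int)
    for s in np.unique(strata):
        idx = np.flatnonzero(strata == s)  # subjects in this stratum, in enrollment order
        n_blocks = -(-len(idx) // block_size)  # ceiling division
        # Each block contains equal numbers of 0s and 1s in random order.
        blocks = [
            rng.permutation(np.repeat([0, 1], block_size // 2))
            for _ in range(n_blocks)
        ]
        assignment[idx] = np.concatenate(blocks)[: len(idx)]
    return assignment


# Hypothetical example: two strata with 20 subjects each.
strata = np.array([0, 1] * 20)
assignment = stratified_block_randomization(strata, block_size=4, seed=0)
print(assignment[strata == 0].sum(), assignment[strata == 1].sum())
```

Because every complete block is balanced, the treated and control counts within a stratum can differ by at most half the block size, which is the improved covariate balance that motivates CAR.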