Semiparametric Efficiency in Policy Learning with General Treatments

Dec 22, 2025 · 8:36
Econometrics

Abstract

Recent literature on policy learning has primarily focused on regret bounds of the learned policy. We provide a new perspective by developing a unified semiparametric efficiency framework for policy learning, allowing for general treatments that are discrete, continuous, or mixed. We provide a characterization of the failure of pathwise differentiability for parameters arising from deterministic policies. We then establish efficiency bounds for pathwise differentiable parameters in randomized policies, both when the propensity score is known and when it must be estimated. Building on the convolution theorem, we introduce a notion of efficiency for the asymptotic distribution of welfare regret, showing that inefficient policy estimators not only inflate the variance of the asymptotic regret but also shift its mean upward. We derive the asymptotic theory of several common policy estimators, with a key contribution being a policy-learning analogue of the Hirano-Imbens-Ridder (HIR) phenomenon: the inverse propensity weighting estimator with an estimated propensity is efficient, whereas the same estimator using the true propensity is not. We illustrate the theoretical results with an empirically calibrated simulation study based on data from a job training program and an empirical application to a commitment savings program.
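To fix ideas, here is a schematic version of the IPW welfare criterion this comparison concerns. The notation is illustrative and simplified relative to the paper, which covers general treatments and randomized policies: with covariates $X$, treatment $A$ with propensity score $e(a \mid x)$, outcome $Y$, and a randomized policy $\pi(a \mid x)$, the IPW estimate of the welfare of $\pi$ is

$$
\hat{W}_{\mathrm{IPW}}(\pi) \;=\; \frac{1}{n}\sum_{i=1}^{n} \frac{\pi(A_i \mid X_i)}{\hat{e}(A_i \mid X_i)}\, Y_i .
$$

The learned policy maximizes this criterion over a policy class, and its welfare regret is its welfare shortfall relative to the best policy in that class. The HIR-type question is whether using the estimated propensity $\hat{e}$ or the true propensity $e$ in the weight yields an efficient learned policy.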

Summary

This paper addresses the problem of efficient policy learning with general treatments (discrete, continuous, or mixed). The authors develop a unified semiparametric efficiency framework to analyze policy estimators, shifting the focus from regret *bounds* to the *asymptotic distribution* of welfare regret. They demonstrate that deterministic policies are not pathwise differentiable, motivating the use of randomized policies. A key finding is a policy-learning analogue of the Hirano-Imbens-Ridder (HIR) phenomenon: the inverse propensity weighting (IPW) estimator with an *estimated* propensity score is efficient, while the IPW estimator with the *true* propensity score is *not*. They also show that inefficient policy estimators not only increase the variance of asymptotic regret but also shift its mean upward.

The authors derive the asymptotic theory for several common policy estimators, including IPW and doubly robust (DR) estimators, and they extend the discussion beyond randomized policies by analyzing regret bounds through the supremum of the welfare process, where the HIR phenomenon appears again. The theoretical results are illustrated with an empirically calibrated simulation based on job training data and an empirical application to a commitment savings program.

The paper's contribution lies in providing a rigorous semiparametric framework for policy learning, offering a new perspective beyond traditional regret bounds. The HIR phenomenon in policy learning is a significant and counterintuitive finding. It matters to the field because it challenges the common practice of directly plugging in the true propensity score when it is known, especially in experimental settings, and it yields actionable guidance: use estimated propensities or doubly robust estimators for better efficiency and lower expected regret.

Key Insights

  • Parameters arising from deterministic policies are generally *not* pathwise differentiable, which rules out regular root-n consistent estimation and motivates the use of randomized policies.
  • Inefficient policy estimators not only inflate the *variance* of asymptotic regret but also shift its *mean* upward, a consequence not captured by traditional regret bounding approaches.
  • The IPW estimator with an *estimated* propensity score achieves efficient regret, while the IPW estimator with the *true* propensity score does not (the HIR phenomenon). This counterintuitive finding has direct practical implications; a minimal simulation sketch follows this list.
  • The doubly robust (DR) estimator also achieves efficient regret, providing an alternative to IPW with estimated propensity.
  • The HIR phenomenon extends beyond randomized policies, also appearing in the regret-bound analysis for deterministic policies via the supremum of the welfare process.
  • Simulations calibrated to the JTPA dataset confirm that IPW with the true propensity score yields a larger mean and standard deviation of regret than the efficient estimators.
  • The empirical application to a commitment savings program shows that policies learned using IPW with the true propensity score can exhibit substantially larger standard errors than those obtained from efficient estimators.
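As a concrete, entirely hypothetical illustration of the HIR bullet above: the sketch below simulates a randomized experiment with a known assignment probability, then learns a simple threshold policy by maximizing the IPW welfare criterion, once with the true propensity and once with a crude estimated propensity. The data-generating process, variable names, and the deterministic threshold class are convenience choices, not the paper's JTPA calibration (the paper's efficiency results concern randomized policies), and seeing the regret difference would require repeating the exercise over many Monte Carlo draws.

```python
# Illustrative sketch only: hypothetical data-generating process, not the paper's calibration.
import numpy as np

rng = np.random.default_rng(0)
n, e0 = 5_000, 0.5                                   # sample size, true assignment probability
x = rng.uniform(-1.0, 1.0, size=n)                   # single covariate
a = rng.binomial(1, e0, size=n).astype(float)        # randomized binary treatment
y = 1.0 + 0.5 * x + a * x + rng.normal(size=n)       # treatment effect equals x (helps when x > 0)

# Crude estimated propensity: treatment frequency within coarse covariate bins.
bin_ids = np.digitize(x, np.linspace(-1.0, 1.0, 11))
bin_freq = {b: a[bin_ids == b].mean() for b in np.unique(bin_ids)}
e_hat = np.clip(np.array([bin_freq[b] for b in bin_ids]), 0.05, 0.95)

def ipw_welfare(policy, propensity):
    """IPW estimate of mean outcome if treatment followed `policy` (0/1 array)."""
    weight = np.where(policy == 1.0, a / propensity, (1.0 - a) / (1.0 - propensity))
    return np.mean(weight * y)

def learn_threshold(propensity):
    """Grid-search the threshold t maximizing IPW welfare of the rule 'treat if x > t'."""
    grid = np.linspace(-1.0, 1.0, 81)
    return max(grid, key=lambda t: ipw_welfare((x > t).astype(float), propensity))

print("threshold learned with true propensity:     ", learn_threshold(np.full(n, e0)))
print("threshold learned with estimated propensity:", learn_threshold(e_hat))
```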

Practical Implications

  • In experimental settings where the propensity score is known, researchers should *not* plug the true propensity score directly into IPW estimators. They should instead use IPW with an *estimated* propensity score or a doubly robust estimator to achieve efficient regret (a schematic DR criterion is sketched after this list).
  • Practitioners can use the derived asymptotic distributions of policy estimators to conduct inference on policy parameters and quantify the uncertainty of learned policies.
  • The findings provide guidance for selecting efficient policy learning methods in various real-world applications, such as targeted social programs, personalized medicine, and resource allocation.
  • Future research could explore the finite-sample properties of different policy estimators and develop methods for robust propensity score estimation in policy learning settings.
  • Further investigation is needed to understand the implications of the HIR phenomenon in more complex policy learning settings, such as those with high-dimensional covariates or non-smooth policy classes.
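For reference, a schematic doubly robust (AIPW-type) welfare criterion, written here for a deterministic policy $\pi$ with discrete treatment (again illustrative; the paper's construction handles randomized policies and general treatments), is

$$
\hat{W}_{\mathrm{DR}}(\pi) \;=\; \frac{1}{n}\sum_{i=1}^{n}\left[ \hat{\mu}\big(\pi(X_i), X_i\big) + \frac{\mathbf{1}\{A_i = \pi(X_i)\}}{\hat{e}\big(\pi(X_i) \mid X_i\big)}\,\big(Y_i - \hat{\mu}(A_i, X_i)\big) \right],
$$

where $\hat{\mu}(a, x)$ estimates $E[Y \mid A = a, X = x]$ and $\hat{e}$ is the estimated propensity score. The criterion remains consistent if either $\hat{\mu}$ or $\hat{e}$ is consistently estimated, which is the usual motivation for preferring it in practice.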
