A Comedy of Estimators: On KL Regularization in RL Training of LLMs
Episode

Dec 26, 2025 · 9:36
Machine Learning · Artificial Intelligence

Abstract

The reasoning performance of large language models (LLMs) can be substantially improved by training them with reinforcement learning (RL). The RL objective for LLM training involves a regularization term, which is the reverse Kullback-Leibler (KL) divergence between the trained policy and the reference policy. Since computing the KL divergence exactly is intractable, various estimators are used in practice to estimate it from on-policy samples. Despite the wide adoption of this practice, including in several open-source libraries, there is no systematic study analyzing the numerous ways of incorporating KL estimators in the objective and their effect on the downstream performance of RL-trained models. Recent works show that prevailing practices for incorporating KL regularization do not provide correct gradients for the stated objectives, creating a discrepancy between the objective and its implementation. In this paper, we further analyze these practices and study the gradients of several estimator configurations, revealing how design choices shape gradient bias. We substantiate these findings with empirical observations by RL fine-tuning Qwen2.5-7B, Llama-3.1-8B-Instruct and Qwen3-4B-Instruct-2507 with different configurations and evaluating their performance on both in- and out-of-distribution tasks. Through our analysis, we observe that, in on-policy settings: (1) estimator configurations with biased gradients can result in training instabilities; and (2) estimator configurations with unbiased gradients lead to better performance on in-domain as well as out-of-domain tasks. We also investigate the performance of different KL configurations in off-policy settings and observe that KL regularization can help stabilize the off-policy RL training that arises in asynchronous setups.
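
For context, the regularized objective described above is commonly written as follows (standard notation for the KL-regularized RL objective; this is not an equation quoted from the paper):

```latex
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\!\left[ r(x, y) \right]
\;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}}\!\left[
  D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid x) \,\middle\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)
\right]
```

Here π_θ is the trained policy, π_ref the frozen reference policy, r the task reward, and β the KL coefficient.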

Summary

This paper addresses a critical but often overlooked aspect of reinforcement learning (RL) fine-tuning of large language models (LLMs): the impact of different Kullback-Leibler (KL) divergence estimators and their integration methods on training stability and performance. The authors highlight that while KL regularization is widely used to prevent language drift and catastrophic forgetting, various estimators are employed to approximate the intractable KL divergence, and these estimators are incorporated into the RL objective in different ways. Recent studies suggest that common practices for KL regularization might not provide accurate gradients, leading to discrepancies between the intended objective and its implementation. The authors analyze the gradients of several estimator configurations, revealing how design choices shape gradient bias. They conduct empirical evaluations by RL fine-tuning Qwen2.5-7B, Llama-3.1-8B-Instruct and Qwen3-4B-Instruct-2507 models using different configurations, assessing their performance on in- and out-of-distribution tasks. Key findings include that biased gradient estimators can lead to training instabilities, while unbiased estimators result in better performance on both in-domain and out-of-domain tasks. They also investigate the role of KL regularization in stabilizing off-policy RL training in asynchronous setups. This research matters to the field because it provides a systematic understanding of KL regularization in RL training of LLMs, offering practical guidance for researchers and practitioners to avoid common pitfalls and achieve better results.
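
As a reference for the points below, the K1 and K3 estimators are the standard per-token Monte Carlo approximations of the reverse KL D_KL(π_θ ‖ π_ref), evaluated on tokens sampled from the trained policy (Schulman's k1/k3 estimators). A minimal PyTorch sketch, with tensor names chosen for illustration rather than taken from the paper's code:

```python
import torch

def kl_estimates(logp_theta: torch.Tensor, logp_ref: torch.Tensor):
    """Per-token K1 and K3 estimates of the reverse KL D_KL(pi_theta || pi_ref).

    Both inputs are log-probabilities of tokens sampled from pi_theta
    (on-policy samples), so each is an unbiased Monte Carlo estimate of the
    KL *value*; the paper's question is what happens to the *gradient* once
    an estimate is placed in the loss or in the reward.
    """
    log_ratio = logp_ref - logp_theta            # log(pi_ref / pi_theta), per token
    k1 = -log_ratio                              # can be negative; higher variance
    k3 = torch.exp(log_ratio) - log_ratio - 1.0  # always >= 0; lower variance
    return k1, k3

# Toy usage with illustrative shapes (batch of 4 sequences, 16 tokens each).
logp_theta = torch.randn(4, 16)
logp_ref = logp_theta + 0.1 * torch.randn(4, 16)
k1, k3 = kl_estimates(logp_theta, logp_ref)
print(k1.mean().item(), k3.mean().item())
```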

Key Insights

  • Using the K1 estimator (a Monte Carlo estimate of the log-ratio of likelihoods) in the loss function, as popularized by GRPO, results in a gradient that is zero in expectation, which can destabilize or even collapse training. For example, using K1 in the loss with Qwen2.5-7B led to training instabilities at β = 0.3 and β = 1.
  • Adding the K3 estimator to the reward function also results in a biased gradient estimate, leading to unpredictable behavior and, in some cases, complete or partial collapse of training.
  • Configurations leading to unbiased gradient estimates, such as K1 added to the reward, consistently outperform biased configurations, like K3 added to the loss, on both in-domain and out-of-domain evaluation tasks. For example, K1 in reward shows an average relative improvement of 19.06% across MMLU college-physics, college-chemistry and college-biology, as compared to an average relative improvement of only 6.21% on the in-domain tasks MATH500 and MATH², for β = 0.05 when using Qwen2.5-7B.
  • In highly asynchronous RL settings, both K1 in reward and K3 in loss help to stabilize training compared to not using KL regularization at all.
  • The paper demonstrates that adding the KL penalty to both the reward and the loss in on-policy settings results in unbiased gradient estimates, regardless of the specific estimator used (K1 or K3), and this approach consistently outperforms the biased K3-in-loss configuration (see the sketch contrasting the reward and loss placements after this list).
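
To make the placement distinction concrete, here is a hedged sketch of the two configurations as they typically appear in GRPO/PPO-style training code. The toy tensors, the batch-level advantage normalization, and the variable names are illustrative assumptions rather than the paper's implementation; only the placement of the KL term reflects the configurations discussed above:

```python
import torch

torch.manual_seed(0)
beta = 0.05  # example KL coefficient

# Toy stand-ins: per-token log-probs of sampled tokens and one scalar reward per sequence.
logp_theta = torch.randn(4, 16, requires_grad=True)        # log pi_theta(y_t | x, y_<t)
logp_ref = logp_theta.detach() + 0.1 * torch.randn(4, 16)  # log pi_ref(y_t | x, y_<t)
rewards = torch.randn(4)

def normalize(x):
    # Batch-level advantage normalization, standing in for group-relative advantages.
    return (x - x.mean()) / (x.std() + 1e-8)

# --- K1 in the reward (unbiased gradient in the on-policy case) ---------------
# The K1 estimate is folded into the reward with a stop-gradient, so the KL
# penalty reaches the parameters only through the policy-gradient term.
k1 = (logp_theta - logp_ref).detach()
shaped_rewards = rewards - beta * k1.sum(dim=-1)
adv = normalize(shaped_rewards).unsqueeze(-1)
loss_k1_in_reward = -(adv * logp_theta).mean()

# --- K3 in the loss (GRPO-style; the gradient shown to be biased) --------------
# The K3 estimate is added to the loss as a differentiable term, so the gradient
# flows through the estimator itself rather than through the reward.
log_ratio = logp_ref - logp_theta
k3 = torch.exp(log_ratio) - log_ratio - 1.0
pg_loss = -(normalize(rewards).unsqueeze(-1) * logp_theta).mean()
loss_k3_in_loss = pg_loss + beta * k3.mean()
```

In the first placement the KL estimate acts like an extra (negative) reward; in the second it is a differentiable regularizer. The paper's analysis concerns which of these arrangements recovers the true gradient of the reverse-KL term.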

Practical Implications

  • The research provides clear guidelines for selecting and implementing KL estimators in RL fine-tuning of LLMs to ensure stable training and improved performance. Practitioners should favor configurations that yield unbiased gradients, such as adding K1 to the reward.
  • Researchers and engineers can use the findings to debug and improve existing RL fine-tuning pipelines that might be using biased KL estimators, potentially leading to suboptimal results.
  • This work opens up future research directions in developing better and more stable KL estimators for RL training of LLMs, particularly in off-policy settings, where obtaining unbiased gradient estimates remains a challenge. Further work could focus on implementing and analyzing an unbiased sequence-level reverse-KL gradient estimate in off-policy settings.
  • The findings can be applied to various real-world applications of LLMs, such as improving reasoning abilities in math, coding, and other complex tasks, where RL fine-tuning is crucial. This can improve the reliability and accuracy of LLMs in these applications.
