From REINFORCE to Dr. GRPO: A Unified Perspective on LLM Post-Training
Qingfeng Lan
Abstract
Recently, many reinforcement learning (RL) algorithms have been applied to improve the post-training of large language models (LLMs). In this article, we aim to provide a unified perspective on the objectives of these RL algorithms, exploring how they relate to each other through the Policy Gradient Theorem — the fundamental theorem of policy gradient methods.
Successful Page Load