Blog Track Poster

From REINFORCE to Dr. GRPO: A Unified Perspective on LLM Post-Training

Qingfeng Lan

[ OpenReview]

Abstract

Recently, many reinforcement learning (RL) algorithms have been applied to improve the post-training of large language models (LLMs). In this article, we aim to provide a unified perspective on the objectives of these RL algorithms, exploring how they relate to each other through the Policy Gradient Theorem — the fundamental theorem of policy gradient methods.

Video

Chat is not available.