
Poster

How to Evaluate Reward Models for RLHF

Evan Frick · Tianle Li · Connor Chen · Wei-Lin Chiang · Anastasios Angelopoulos · Jiantao Jiao · Banghua Zhu · Joseph E Gonzalez · Ion Stoica

Hall 3 + Hall 2B #610
Fri 25 Apr midnight PDT — 2:30 a.m. PDT

Abstract:

We introduce a new benchmark for reward models that quantifies their ability to produce strong language models through RLHF (Reinforcement Learning from Human Feedback). The gold-standard approach is to run a full RLHF training pipeline and directly probe downstream LLM performance. However, this process is prohibitively expensive. To address this, we build a predictive model of downstream LLM performance by evaluating the reward model on proxy tasks. These proxy tasks consist of a large-scale human preference dataset and a verifiable correctness preference dataset, on which we measure 12 metrics across 12 domains. To investigate which reward model metrics are most correlated with gold-standard RLHF outcomes, we launch an end-to-end RLHF experiment on a large-scale crowd-sourced human preference platform to obtain ground-truth measurements of real reward model downstream performance. Ultimately, we compile our data and findings into Preference Proxy Evaluations (PPE), the first reward model benchmark explicitly linked to post-RLHF real-world human preference performance, which we open-source for public use and further development at https://github.com/lmarena/PPE.
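
To make the proxy-task idea concrete, the following is a minimal, hypothetical Python sketch, not the PPE implementation or API. It assumes a reward model exposed as a `score_fn(prompt, response)` callable, computes per-domain pairwise preference accuracy (one common proxy metric over labeled preference pairs), and then rank-correlates that proxy metric with assumed post-RLHF ground-truth scores across several reward models. All names and the toy data are illustrative assumptions.

```python
# Hypothetical sketch of proxy-metric evaluation for reward models.
# Not the PPE codebase: score_fn, PreferencePair, and the toy data below
# are illustrative assumptions only.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str      # response labeled as preferred (human or verifier)
    rejected: str    # response labeled as not preferred
    domain: str      # e.g. "math", "coding", "safety"


def pairwise_accuracy(score_fn: Callable[[str, str], float],
                      pairs: List[PreferencePair]) -> Dict[str, float]:
    """Per-domain fraction of pairs where the reward model scores the
    preferred response above the rejected one (a typical proxy metric)."""
    hits: Dict[str, int] = {}
    totals: Dict[str, int] = {}
    for p in pairs:
        totals[p.domain] = totals.get(p.domain, 0) + 1
        if score_fn(p.prompt, p.chosen) > score_fn(p.prompt, p.rejected):
            hits[p.domain] = hits.get(p.domain, 0) + 1
    return {d: hits.get(d, 0) / n for d, n in totals.items()}


def spearman(xs: List[float], ys: List[float]) -> float:
    """Simplified Spearman rank correlation (no tie handling) between a
    proxy metric (xs) and downstream ground-truth scores (ys)."""
    def ranks(vals: List[float]) -> List[float]:
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)


if __name__ == "__main__":
    # Toy preference data and two toy "reward models".
    pairs = [
        PreferencePair("Q1", "a detailed, correct answer", "ans", "general"),
        PreferencePair("Q2", "x = 4 because 2 + 2 = 4", "x = 5", "math"),
    ]
    rm_a = lambda prompt, resp: float(len(resp))    # prefers longer responses
    rm_b = lambda prompt, resp: -float(len(resp))   # prefers shorter responses

    proxy = []
    for rm in (rm_a, rm_b):
        acc = pairwise_accuracy(rm, pairs)
        proxy.append(sum(acc.values()) / len(acc))  # mean accuracy over domains

    downstream = [0.62, 0.48]  # made-up post-RLHF human preference win rates
    print("proxy metrics:", proxy)
    print("rank correlation with downstream:", spearman(proxy, downstream))
```

In this toy setup, the proxy metric ranks the two reward models in the same order as the assumed downstream scores, so the correlation is 1.0; the benchmark's question is which proxy metrics preserve that ordering for real reward models and real post-RLHF outcomes.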
