Is the evidence in 'Language Models Learn to Mislead Humans via RLHF' valid?
Aaryan Chandna · Lukas Fluri · Micah Carroll
Abstract
Language Models Learn to Mislead Humans via RLHF (published at ICLR 2025) argues that RLHF can unintentionally train models to mislead humans, a phenomenon the authors term U-SOPHISTRY (unintended sophistry). However, our review of the paper's code and experiments suggests that a significant portion of its empirical findings may stem largely from major bugs that make the RLHF setup both unrealistic and highly prone to reward hacking. Beyond raising these high-level concerns, we correct the issues for one of the paper's experiments and fail to find evidence supporting the original paper's claims.