Blog Track Poster Sat, Apr 25, 2026 • 11:15 AM – 1:45 PM PDT Pavilion 3 P3-#1806

Misalignments and RL Failure Modes in the Early Stage of Superintelligence

Shu Yang ⋅ Hanqi Yan ⋅ Di Wang

[OpenReview]

Abstract

With the rapid capability gains of frontier Large Models (LMs), a growing body of research focuses on aligning them with human values and intent through large-scale reinforcement learning and other techniques. However, as LMs become stronger and more agentic, misaligned and deceptive behaviors are also emerging, and they are increasingly difficult for humans to detect in advance and to track. This blog post discusses current misalignment patterns, deceptive behaviors, RL failure modes, and emergent traits in modern large models, with the aim of furthering AI safety discussions and advancing the development of mitigation strategies for LM misbehavior.
