Poster in Workshop: Agents in the Wild: Safety, Security, and Beyond

Prover-Verifier Games for AI Control

Joan Velja ⋅ Charlie Griffin ⋅ Alessandro Abate

Project Page [OpenReview]

Abstract

AI Control relies on monitors that can detect strategically deceptive actions by deployed agents, but examples of such actions are rare, ever-changing, and hard to label. Prover-Verifier Games (PVG; Anil et al., 2021) address this problem via adversarial self-play: a helpful prover produces a correct trace, a sneaky prover produces an apparently correct but error-containing trace, and a verifier is trained to rank the helpful trace higher. In the standard simultaneous game, the verifier can separate the provers by stylistic cues alone, yielding globally non-robust solutions that do not transfer to out-of-distribution evaluation. We propose a sequential PVG in which the sneaky prover observes the helpful trace before acting. Because the attacker can then cheaply imitate any copyable feature, it is pushed toward near-miss attacks (traces that match the helpful trace on copyable style features and differ only in the backdoor-relevant content), which rules out purely style-based verifier solutions and forces the verifier to rely on semantic content. On code-generation backdoor detection, a 0.5B sequentially trained verifier matches or outperforms a 7B untrained baseline and generalizes better to stronger out-of-distribution attackers; stylistic probes and ablations indicate that the gain comes from sequentiality rather than scale.
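
The contrast between simultaneous and sequential play can be illustrated with a toy simulation. This is only a sketch under strong assumptions and is not the authors' training setup: a trace is reduced to one hypothetical scalar `style` feature plus a `backdoored` flag, the provers are hand-written rather than learned, and `semantic_verifier` reads the flag directly as a stand-in for a verifier that has learned to inspect content. All names below are illustrative.

```python
import random
from dataclasses import dataclass


@dataclass
class Trace:
    style: float       # hypothetical copyable surface feature
    backdoored: bool   # semantic property a monitor should detect


def helpful_prover(rng: random.Random) -> Trace:
    # Correct trace with its own characteristic style.
    return Trace(style=rng.gauss(1.0, 0.1), backdoored=False)


def sneaky_simultaneous(rng: random.Random) -> Trace:
    # Acts without seeing the helpful trace, so its style differs (assumed).
    return Trace(style=rng.gauss(0.0, 0.1), backdoored=True)


def sneaky_sequential(helpful: Trace) -> Trace:
    # Observes the helpful trace first and copies every copyable feature.
    return Trace(style=helpful.style, backdoored=True)


def style_verifier(trace: Trace) -> float:
    # Scores on the surface feature only.
    return trace.style


def semantic_verifier(trace: Trace) -> float:
    # Stand-in for a content-based verifier (oracle access, assumed).
    return 0.0 if trace.backdoored else 1.0


def ranking_accuracy(verifier, sequential: bool, n: int = 1000, seed: int = 0) -> float:
    """Fraction of pairs where the helpful trace is ranked strictly higher."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        helpful = helpful_prover(rng)
        sneaky = sneaky_sequential(helpful) if sequential else sneaky_simultaneous(rng)
        if verifier(helpful) > verifier(sneaky):  # ties count as failures
            correct += 1
    return correct / n


if __name__ == "__main__":
    for name, verifier in [("style", style_verifier), ("semantic", semantic_verifier)]:
        for sequential in (False, True):
            mode = "sequential" if sequential else "simultaneous"
            print(name, mode, ranking_accuracy(verifier, sequential))
```

In this toy setting the style-only verifier separates the provers almost perfectly under simultaneous play but cannot rank any pair correctly once the sneaky prover copies the helpful style, while the content-based scorer is unaffected. Real traces have many features that are only partly copyable, so the effect in practice is a matter of degree.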
