Prover-Verifier Games for AI Control
Abstract
AI Control relies on monitors that can detect strategically deceptive actions by deployed agents, but examples of such actions are rare, ever-changing, and hard to label. Prover-Verifier Games (PVG; Anil et al., 2021) address this problem via adversarial self-play: a helpful prover produces a correct trace, a sneaky prover produces an apparently correct trace that contains an error, and a verifier is trained to rank the helpful trace higher. In the standard simultaneous game, the verifier can separate the two provers by stylistic cues alone, yielding non-robust solutions that fail to transfer out of distribution. We propose a sequential PVG in which the sneaky prover observes the helpful trace before acting. Because the attacker can then cheaply imitate any copyable feature, it is pushed toward near-miss attacks: traces that match the helpful trace on copyable style features and differ only in the backdoor-relevant content. This rules out purely style-based verifier solutions and forces the verifier to rely on semantic content. On code-generation backdoor detection, a 0.5B sequentially trained verifier matches or outperforms a 7B untrained baseline and generalizes better to stronger out-of-distribution attackers; stylistic probes and ablations indicate that the gain comes from sequentiality rather than scale.
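The intuition behind the sequential game can be illustrated with a toy sketch. It is not the training setup used in this work: traces are reduced to two hypothetical features (a copyable style cue and a backdoor flag), the provers and verifiers are hand-written stand-ins rather than learned models, and the content verifier is an oracle included only for contrast. All names are assumptions made for illustration.

```python
import random

def helpful_trace(rng):
    # Hypothetical trace: one copyable style cue plus a backdoor flag.
    return {"style": rng.choice(["terse", "verbose"]), "backdoor": False}

def sneaky_simultaneous(rng):
    # Acts without seeing the helpful trace, so it keeps its own habitual style.
    return {"style": "ornate", "backdoor": True}

def sneaky_sequential(helpful):
    # Observes the helpful trace first and copies every copyable feature.
    return {"style": helpful["style"], "backdoor": True}

def style_verifier(trace):
    # Style-only shortcut: penalizes the attacker's habitual style.
    return 0.0 if trace["style"] == "ornate" else 1.0

def content_verifier(trace):
    # Oracle stand-in for a verifier that reads the semantic content.
    return 0.0 if trace["backdoor"] else 1.0

def ranking_accuracy(verifier, sequential, n=1000, seed=0):
    # Fraction of pairs where the helpful trace is ranked higher; ties count 0.5.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        h = helpful_trace(rng)
        s = sneaky_sequential(h) if sequential else sneaky_simultaneous(rng)
        vh, vs = verifier(h), verifier(s)
        total += 1.0 if vh > vs else 0.5 if vh == vs else 0.0
    return total / n

for name, verifier in [("style", style_verifier), ("content", content_verifier)]:
    for sequential in (False, True):
        mode = "sequential" if sequential else "simultaneous"
        print(name, mode, ranking_accuracy(verifier, sequential))
```

In this toy, the style-only verifier ranks every simultaneous pair correctly but drops to chance once the attacker copies the helpful style, while the content-based verifier is unaffected by the change in move order.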