Paranoid Monitors: How Long Context Breaks LLM Agent Supervision
Abstract
Monitoring LLM agents in long-context environments remains an unsolved challenge. We demonstrate that current monitoring approaches rely on suspicion heuristics that degrade as context length increases. Through evaluation of synthetically augmented trajectories from the ShadeArena bank transfer task, we show that standard monitors suffer from paranoia, assigning higher suspicion scores to benign trajectories than to malicious ones in contexts exceeding 30k tokens. We quantify this failure using the suspicion gap, defined as the difference in mean suspicion scores assigned to malicious versus benign trajectories. To address this, we propose ledger-based monitoring, a two-stage framework grounded in factored cognition that decouples fact extraction from judgment. Our method restores calibration across all tested models, improving the suspicion gap from approximately 0 to greater than 1.5 in contexts exceeding 50k tokens.
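The suspicion gap defined above is the difference between the mean suspicion score a monitor assigns to malicious trajectories and the mean it assigns to benign ones. A minimal sketch of the computation, using illustrative scores on an assumed 0-10 scale (the numbers below are hypothetical, not results from the paper):

```python
from statistics import mean

def suspicion_gap(malicious_scores: list[float], benign_scores: list[float]) -> float:
    """Suspicion gap: mean suspicion on malicious runs minus mean on benign runs.

    A clearly positive gap means the monitor is calibrated; a gap near zero
    or negative is the 'paranoia' failure mode, where benign trajectories
    are rated as suspicious as (or more suspicious than) malicious ones.
    """
    return mean(malicious_scores) - mean(benign_scores)

# Hypothetical monitor outputs on a 0-10 suspicion scale:
calibrated = suspicion_gap([7.0, 8.0, 6.5], [2.0, 3.0, 1.5])  # large positive gap
paranoid = suspicion_gap([4.0, 3.5, 4.5], [5.0, 4.0, 5.5])    # negative gap
```

Under this definition, the paper's headline result is a gap improving from roughly 0 to above 1.5 on long contexts.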
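The two-stage structure of ledger-based monitoring can be sketched as follows. This is an assumed shape inferred from the abstract's description (stage 1 extracts facts from trajectory chunks into a compact ledger; stage 2 judges suspicion from the ledger alone, never seeing the full context); `extract_facts` and `judge` stand in for LLM calls and are hypothetical names:

```python
from typing import Callable

def ledger_monitor(
    trajectory_chunks: list[str],
    extract_facts: Callable[[str], list[str]],  # stage 1: chunk -> salient facts
    judge: Callable[[list[str]], float],        # stage 2: ledger -> suspicion score
) -> float:
    """Two-stage monitor: build a fact ledger, then score it.

    Decoupling extraction from judgment keeps the judge's input short
    regardless of trajectory length, which is the mechanism the abstract
    credits for avoiding long-context paranoia.
    """
    ledger: list[str] = []
    for chunk in trajectory_chunks:
        ledger.extend(extract_facts(chunk))
    return judge(ledger)

# Toy stubs illustrating the control flow (real stages would be model calls):
facts = lambda chunk: [f for f in chunk.split(". ") if "transfer" in f]
score = lambda ledger: min(10.0, 2.0 * len(ledger))
suspicion = ledger_monitor(["user asked for help. agent made a transfer",
                            "agent made a second transfer to unknown account"],
                           facts, score)
```

The design choice worth noting is that only the ledger, not the raw trajectory, reaches the judging stage.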