Agent-as-a-Coach: Towards Fully Agentic, Stateful, and Tool-Augmented Process Rewards
Abstract
Training multi-agent LLM systems with reinforcement learning requires solving credit assignment: determining which agent’s actions led to success or failure in long, multi-turn trajectories. Process rewards from external LLM judges address this by scoring each agent’s actions rather than only the final outcome. However, existing judges are stateless: each evaluation is an independent LLM call with no memory of prior scores, no awareness of scoring drift, and no mechanism for self-calibration, even as thousands of evaluations accumulate over a training run. We propose upgrading the stateless LLM-as-a-Judge to Agent-as-a-Coach: a tool-augmented LLM agent with persistent per-coach memory, rolling evaluation statistics, and a multi-turn reasoning loop. The coach queries its own scoring history, detects task-type biases, writes qualitative notes for self-calibration, and passes messages via external memory. In preliminary experiments on DSBench with a three-agent data science pipeline (Qwen3-4B), our Agent-as-a-Coach approach produces adaptive rebalancing between classification and regression tasks: anti-correlated performance oscillations that a stateless judge, lacking cross-evaluation awareness, cannot generate.
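To make the contrast with a stateless judge concrete, the following is a minimal sketch of the kind of state a coach could keep between evaluations: a bounded score history per task type, a rolling mean, and a simple gap between task types that the coach might read as a bias signal. The class name `CoachMemory`, its methods, the window size, and the 0.1 threshold are all hypothetical choices made for illustration; they are not taken from the paper's implementation, and the LLM reasoning loop itself is omitted.

```python
from collections import defaultdict, deque
from statistics import mean


class CoachMemory:
    """Hypothetical per-coach store of past scores and qualitative notes."""

    def __init__(self, window=100):
        # One bounded history per task type, so the statistics are rolling.
        self.scores = defaultdict(lambda: deque(maxlen=window))
        self.notes = []

    def record(self, task_type, score):
        """Append one evaluation score; the oldest score drops out when full."""
        self.scores[task_type].append(score)

    def rolling_mean(self, task_type):
        """Mean of the scores currently in the window, or None if empty."""
        history = self.scores[task_type]
        return mean(history) if history else None

    def task_type_bias(self, type_a, type_b):
        """Difference in rolling mean score between two task types."""
        mean_a = self.rolling_mean(type_a)
        mean_b = self.rolling_mean(type_b)
        if mean_a is None or mean_b is None:
            return None
        return mean_a - mean_b

    def add_note(self, text):
        """Store a qualitative note the coach can reread in later evaluations."""
        self.notes.append(text)


# Usage: scores accumulate across evaluations instead of being discarded.
memory = CoachMemory(window=3)
for score in (0.9, 0.8, 0.7, 0.6):
    memory.record("classification", score)
for score in (0.4, 0.5):
    memory.record("regression", score)

bias = memory.task_type_bias("classification", "regression")
# The 0.1 threshold is an arbitrary illustrative value, not from the paper.
if bias is not None and abs(bias) > 0.1:
    memory.add_note("Classification is scored higher than regression; recheck.")
```

In a full system, these methods would be exposed to the coach as tools it can call during its multi-turn reasoning loop; a stateless judge has no equivalent of `task_type_bias` because it never sees more than one evaluation at a time.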