PeerCoT: Structured Multi-Agent Chain-of-Thought Collaboration for Error Localization in LLM Reasoning
Abstract
Large Language Model (LLM) agents exhibit emergent reasoning abilities through debate, critique, and self-reflection. However, most multi-agent systems exchange only final outputs between agents, which limits transparency and makes reasoning processes hard to diagnose and improve. PeerCoT introduces a structured, symmetric Chain-of-Thought (CoT) exchange protocol in which peer agents share their reasoning traces, provide labeled critiques, and perform minimal-edit revisions before aggregation. This structure enables explicit measurement of process-level error-type identification within an agent’s reasoning, a metric we call Error Localization Success (ELS). We introduce and release AQUA-RAT-Corrupted and GSM8K-Corrupted, structured, synthetically constructed benchmarks for evaluating error localization and correction in multi-agent reasoning. PeerCoT achieves 64.1% accuracy on AQUA-RAT-Corrupted and 53.15% on GSM8K-Corrupted. It maintains accuracy and transparency competitive with the baseline models while additionally providing an explicit error taxonomy, critiques, and ELS. Beyond outcome-level performance, the structured critique protocol corrects 30.43% of initially incorrect solutions in merged outputs. By aligning cooperative critique with fine-grained reasoning supervision, PeerCoT brings explicit error identification to collaborative reasoning.
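To make the two process-level quantities mentioned in the abstract concrete, the following is a minimal sketch of how they might be computed. It rests on assumptions the abstract does not state: ELS is taken to be the fraction of examples where a peer's critique label matches the injected error type, and the correction rate is taken to be the share of initially incorrect solutions that are correct after merging. The `Critique` class, its fields, the label strings, and both function names are hypothetical and for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Critique:
    """A labeled critique of one reasoning step (hypothetical structure)."""
    step_index: int  # which step of the peer's trace is flagged
    error_type: str  # label from an assumed taxonomy, e.g. "arithmetic"


def error_localization_success(predicted, injected):
    """Assumed ELS: fraction of critiques whose error-type label
    matches the error type injected into the corrupted trace."""
    if len(predicted) != len(injected):
        raise ValueError("predicted and injected must have equal length")
    if not injected:
        return 0.0
    hits = sum(p.error_type == g.error_type for p, g in zip(predicted, injected))
    return hits / len(injected)


def correction_rate(initial_correct, merged_correct):
    """Assumed correction rate: among solutions that were initially
    incorrect, the share that are correct in the merged output."""
    merged_for_wrong = [m for i, m in zip(initial_correct, merged_correct) if not i]
    if not merged_for_wrong:
        return 0.0
    return sum(merged_for_wrong) / len(merged_for_wrong)


# Toy data, not taken from the paper.
predicted = [
    Critique(2, "arithmetic"),
    Critique(1, "logic"),
    Critique(3, "arithmetic"),
    Critique(0, "misread"),
]
injected = [
    Critique(2, "arithmetic"),
    Critique(1, "arithmetic"),
    Critique(3, "arithmetic"),
    Critique(0, "misread"),
]
els = error_localization_success(predicted, injected)
rate = correction_rate([True, False, False, False], [True, True, False, False])
```

In this toy run, three of the four critique labels match the injected error type, and one of the three initially incorrect solutions is fixed after merging. A stricter variant could also require `step_index` to match; the abstract describes ELS as error-type identification, so the sketch compares labels only.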