Beyond Self-Checking: Fragment-Level Verification Across Diverse LLMs
Abstract
Large language models struggle to verify their own reasoning, a limitation documented across planning, logic, and mathematical tasks. We exploit the resulting asymmetry between checking one's own output and checking another model's in Sup, a multi-agent framework in which independent models verify each other's outputs using logprob-derived confidence scores, and a constructive synthesis mechanism then extracts valid reasoning fragments from individually incorrect responses and assembles them into correct solutions. Across 9 frontier models evaluated on Humanity's Last Exam (HLE, n = 1,369), Sup achieves 52.15% accuracy: 7.41 points above the best individual model (p < 0.0001, McNemar's test) and 2.92 points above confidence-weighted selection. Cross-model verification enables two capabilities absent in selection-based ensembles: (1) correlation-aware routing, which pairs generators with low-correlation verifiers (ρ = 0.54 cross-family vs. ρ = 0.77 within-family) to exploit complementary failure modes; and (2) constructive synthesis, which produces 3 correct answers on questions where every individual model fails by recovering valid fragments from globally incorrect responses via cross-model corroboration. Our results suggest that distributing verification across diverse models yields qualitatively different error correction than scaling any single model's self-checking, and we identify concrete open problems at the intersection of soft (LLM-based) and hard (formal) verification.
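As an illustration of the correlation-aware routing idea, the following minimal sketch pairs a generator with the candidate verifier whose per-question correctness is least correlated with its own. It is not Sup's implementation: the abstract does not specify how ρ is computed, so the use of Pearson correlation over binary correctness vectors is an assumption, and the model names, the toy data, and the helpers `pearson` and `pick_verifier` are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant vector has no defined correlation; treat it as 0.
        return 0.0
    return cov / sqrt(var_x * var_y)


def pick_verifier(generator, correctness):
    """Return the model whose correctness pattern is least correlated
    with the generator's (assumed routing criterion, not the paper's)."""
    candidates = [m for m in correctness if m != generator]
    return min(
        candidates,
        key=lambda m: pearson(correctness[generator], correctness[m]),
    )


# Hypothetical per-question correctness (1 = correct, 0 = incorrect).
correctness = {
    "gen_a": [1, 1, 0, 0, 1, 0],
    "same_family_b": [1, 1, 0, 0, 1, 1],
    "other_family_c": [1, 0, 1, 0, 1, 0],
}

verifier = pick_verifier("gen_a", correctness)
```

In this toy data the same-family model fails on nearly the same questions as the generator, so the routing rule prefers the cross-family model, whose errors overlap less.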