Owl: Separating Generation from Evaluation to Detect Plausible Failures in Lifecycle Inventory Mapping
Abstract
Deploying AI at scale requires detecting plausible-but-incorrect outputs before they compound into aggregate harm, especially in domains where wrong answers look defensible. We present a multi-agent architecture that separates generation from quality assessment, applied to measuring corporate environmental impacts through lifecycle inventory mapping. Our system, Owl, consists of a domain-specialized mapper that generates candidates using tool-augmented retrieval and an independent judge that evaluates quality without access to the mapper's reasoning. Effective climate action requires companies to accurately measure supply chain emissions, which often constitute over 40% of corporate carbon footprints. Yet doing so means mapping tens to hundreds of thousands of purchased items to emission factor databases, a task that is intractable to perform manually. The problem is entity matching under uncertainty: inputs are variably noisy, database coverage is incomplete, and incorrect mappings produce realistic emissions values without obvious error signals. Across 1,039 items, the domain-specialized mapper achieves 90.7% defensible accuracy versus 68.3% for non-agentic baselines and 81.7% without domain-specific prompting. On a well-specified public benchmark with less noisy inputs, it achieves 98.9%. The judge enables efficient human oversight: at a 20% review budget, it captures 67% of errors versus 37% to 40% for heuristic baselines. Our results demonstrate that multi-agent architectures that separate generation from evaluation can improve reliability in domains where errors are plausible but consequential.
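To make the review-budget figure concrete, the following is a minimal sketch of how error capture at a fixed budget could be computed. It assumes the judge emits a per-item risk score where higher means more likely wrong; the function name `error_capture_at_budget`, its parameters, and the toy data are all hypothetical illustrations and are not taken from Owl's implementation.

```python
from typing import Sequence


def error_capture_at_budget(
    risk_scores: Sequence[float],
    is_error: Sequence[bool],
    budget: float,
) -> float:
    """Share of all errors found when reviewing the riskiest `budget` fraction of items.

    Assumption: higher risk_scores mean the judge considers the mapping
    more likely to be wrong. All names here are hypothetical.
    """
    if len(risk_scores) != len(is_error):
        raise ValueError("risk_scores and is_error must have the same length")
    if not 0.0 <= budget <= 1.0:
        raise ValueError("budget must be a fraction between 0 and 1")

    total_errors = sum(is_error)
    if total_errors == 0:
        return 0.0  # nothing to capture

    # Number of items a human reviewer can inspect under the budget.
    n_review = round(len(risk_scores) * budget)

    # Rank item indices from highest to lowest risk and take the top slice.
    ranked = sorted(range(len(risk_scores)), key=lambda i: risk_scores[i], reverse=True)
    captured = sum(is_error[i] for i in ranked[:n_review])
    return captured / total_errors


# Toy data: 10 items, 3 of which are wrong (indices 0, 2, and 7).
scores = [0.9, 0.1, 0.8, 0.2, 0.3, 0.4, 0.7, 0.05, 0.6, 0.5]
errors = [True, False, True, False, False, False, False, True, False, False]
capture = error_capture_at_budget(scores, errors, budget=0.20)
```

In this toy case, reviewing the two riskiest items finds two of the three errors. The same function applied to a heuristic score in place of the judge's score would give the kind of baseline comparison the abstract reports.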