Poster in Workshop: Workshop on Logical Reasoning of Large Language Models

The AI Barrister Flight Simulator: A Neuro-Symbolic Benchmark for Structured Legal Reasoning

David Lewis ⋅ Enrique Zueco ⋅ Haley Yi

Abstract

Large Language Models (LLMs) deployed in legal settings produce fluent but structurally unreliable reasoning: they hallucinate authorities, violate jurisdictional boundaries, and ignore temporal precedent chains. We introduce the AI Barrister Flight Simulator, a neuro-symbolic benchmark that evaluates how an LLM reasons over legal structure rather than merely whether it reaches the correct answer. The benchmark couples a Legal Knowledge Graph (LKG), which encodes statutes, case law, doctrinal tests, and citation networks, with a symbolic controller that orchestrates retrieval, generation, and post-hoc consistency checking. Five task families (multi-hop citation, jurisdiction-constrained, temporal validity, doctrine-structure, and multi-query consistency) and four structure-aware metrics, namely Constraint Violation Rate (CVR), Hallucination Rate (HAR), Path Alignment (PA), and Node Coverage (NC), expose failure modes invisible to accuracy alone. On a 50-scenario suite evaluated across three seeds, our KG-RAG pipeline achieves 98.0% accuracy with HAR = 0.005 and PA = 0.830, versus 77.3% accuracy and HAR = 0.138 for a baseline LLM. The full KG-RAG+Controller further reduces HAR to 0.003 and CVR to 0.289. Correlation analysis reveals that PA and NC are significant predictors of correctness (r = 0.259 and r = 0.302, respectively); a logistic model combining CVR, PA, and NC predicts answer correctness with 98.0% accuracy. Code, LKG, scenario library, and evaluation scripts will be released upon acceptance.
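
The abstract names the four structure-aware metrics without defining them. As a rough illustration only, the sketch below computes per-answer versions under assumed definitions: HAR as the share of cited authorities absent from the LKG, CVR as the share of constraint checks that fail, PA as the Jaccard overlap between the edges of the predicted and gold reasoning paths, and NC as the share of gold nodes that the answer touches. These definitions, the function names, and the toy data are hypothetical and may differ from the benchmark's actual formulas.

```python
from typing import Iterable, Sequence, Set


def hallucination_rate(cited: Sequence[str], known: Set[str]) -> float:
    """Assumed HAR: fraction of cited authorities not present in the graph."""
    if not cited:
        return 0.0
    return sum(1 for c in cited if c not in known) / len(cited)


def constraint_violation_rate(checks: Sequence[bool]) -> float:
    """Assumed CVR: fraction of constraint checks that failed (False = violated)."""
    if not checks:
        return 0.0
    return sum(1 for ok in checks if not ok) / len(checks)


def path_alignment(pred: Sequence[str], gold: Sequence[str]) -> float:
    """Assumed PA: Jaccard overlap of consecutive-node edges in the two paths."""
    pred_edges = set(zip(pred, pred[1:]))
    gold_edges = set(zip(gold, gold[1:]))
    if not pred_edges and not gold_edges:
        return 1.0
    return len(pred_edges & gold_edges) / len(pred_edges | gold_edges)


def node_coverage(pred_nodes: Iterable[str], gold_nodes: Iterable[str]) -> float:
    """Assumed NC: fraction of gold nodes that appear in the predicted answer."""
    gold = set(gold_nodes)
    if not gold:
        return 1.0
    return len(gold & set(pred_nodes)) / len(gold)


# Toy scenario with hypothetical node names (not from the paper's LKG).
known_nodes = {"Statute A", "Case B", "Case C", "Case D"}
cited = ["Statute A", "Case B", "Case X", "Case C"]  # "Case X" is not in the graph
checks = [True, False, True, True]                   # one jurisdiction/temporal check fails
pred_path = ["Statute A", "Case B", "Case C"]
gold_path = ["Statute A", "Case B", "Case D"]

har = hallucination_rate(cited, known_nodes)
cvr = constraint_violation_rate(checks)
pa = path_alignment(pred_path, gold_path)
nc = node_coverage(pred_path, gold_path)
print(f"HAR={har:.3f} CVR={cvr:.3f} PA={pa:.3f} NC={nc:.3f}")
```

Under these assumed definitions, an answer can be correct while still citing an authority outside the graph or following a path that diverges from the gold chain, which is the kind of failure that accuracy alone would not reveal.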
