When Long Contexts Break Logic: Separating Evidence Use and Decision Bias in Instruction-Tuned LLMs
Pravish Sainath
Abstract
Large language models (LLMs) increasingly operate over long contexts, yet their logical reasoning remains brittle when many irrelevant tokens intervene between the premises and the query. A recurring challenge is \emph{diagnosis}: when an LLM answers incorrectly in a long context, is the failure due to (i) not using the relevant premises, (ii) failing to compose them into a valid inference, or (iii) a biased decision rule at the final Yes/No readout? We present a compact suite of probes that disentangle these failure modes using \emph{matched-prior subtraction}, which relies on a distractor-conditioned control prompt that preserves formatting and length while removing the content of the evidence. Across three open instruction-tuned models (Qwen2.5, Llama-3.2, Gemma-2), we find that, on a ``needle-in-a-haystack'' variant of LogicBench, evidence influence on the final decision is near zero in early layers and rises sharply only in late layers. For synthetic multi-premise rules (modus tollens, disjunctive syllogism, etc.), we show that many ``oracle'' failures under naive scoring are actually decision-level miscalibration: simple calibrated decision rules raise oracle accuracy to $0.83$--$0.93$ on several rules. Finally, a \emph{local calibratability} analysis reveals that the required decision correction depends systematically on evidence placement (front/middle/end/interleaved), indicating that long-context bias falls into multiple regimes rather than being fixable by a single global calibration.
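The decision-level idea in the abstract can be illustrated with a minimal sketch. It assumes the readout is a Yes-minus-No logit margin, that the matched control prompt yields a margin of the same kind, and that a calibrated decision rule is a single threshold fitted on held-out examples. The function names and all numbers below are hypothetical illustrations, not the paper's implementation or results.

```python
from typing import List


def corrected_margin(evidence_margin: float, control_margin: float) -> float:
    """Subtract the matched control's Yes-minus-No margin (assumed form of the correction)."""
    return evidence_margin - control_margin


def decide(margin: float, threshold: float = 0.0) -> str:
    """Answer Yes when the margin exceeds the threshold, otherwise No."""
    return "Yes" if margin > threshold else "No"


def accuracy(margins: List[float], labels: List[str], threshold: float = 0.0) -> float:
    """Fraction of examples whose thresholded decision matches the label."""
    hits = sum(decide(m, threshold) == y for m, y in zip(margins, labels))
    return hits / len(labels)


def fit_threshold(margins: List[float], labels: List[str]) -> float:
    """Pick the cut point with the best accuracy on a calibration set.

    Candidate cuts are midpoints between sorted margins, plus one cut
    below the minimum and one above the maximum. Ties go to the lowest cut.
    """
    ordered = sorted(set(margins))
    cuts = [ordered[0] - 1.0]
    cuts += [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    cuts.append(ordered[-1] + 1.0)
    return max(cuts, key=lambda t: accuracy(margins, labels, t))


# Made-up margins: every raw margin is positive, so a zero threshold always says Yes.
raw = [2.0, 1.5, 3.0, 2.5]
control = [1.8, 1.8, 1.8, 1.8]  # hypothetical margins from the matched control prompt
labels = ["Yes", "No", "Yes", "Yes"]

corrected = [corrected_margin(r, c) for r, c in zip(raw, control)]
naive_acc = accuracy(raw, labels)            # zero threshold on raw margins
corrected_acc = accuracy(corrected, labels)  # zero threshold after subtraction
fitted = fit_threshold(raw, labels)          # threshold fitted on raw margins
```

In this toy data the raw margins all lean toward Yes, so the error on the No example comes from the decision rule rather than from missing evidence. Either subtracting the control margin or fitting a threshold removes it, which is the kind of distinction the probes are meant to expose.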