Poster in Workshop: Workshop on Logical Reasoning of Large Language Models

When Long Contexts Break Logic: Separating Evidence Use and Decision Bias in Instruction-Tuned LLMs

Pravish Sainath


Abstract

Large language models (LLMs) increasingly operate over long contexts, yet their logical reasoning remains brittle when many irrelevant tokens intervene between premises and query. A recurring challenge is diagnosis: when an LLM answers incorrectly in a long context, is the failure due to (i) not using the relevant premises, (ii) failing to compose them into a valid inference, or (iii) a biased decision rule at the final Yes/No readout? We present a compact suite of probes that disentangle these failure modes using matched-prior subtraction, which relies on a distractor-conditioned control prompt that preserves formatting and length while removing the content of the evidence. Across three open instruction-tuned models (Qwen2.5, Llama-3.2, Gemma-2), on a "needle-in-a-haystack" variant of LogicBench, we find that the influence of evidence on the final decision is near zero in early layers and rises sharply only in late layers. For synthetic multi-premise rules (modus tollens, disjunctive syllogism, etc.), we show that many "oracle" failures under naive scoring are actually decision-level miscalibration: simple calibrated decision rules raise oracle accuracy to 0.83 to 0.93 on several rules. Finally, a local calibratability analysis reveals that the required decision correction depends systematically on evidence placement (front/middle/end/interleaved), indicating multiple long-context bias regimes rather than a single global calibration.
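The abstract does not spell out how matched-prior subtraction or the calibrated decision rules are computed. As a rough illustration only, the sketch below assumes the subtraction is done on Yes/No log-odds (evidence prompt minus matched control prompt) and that a "simple calibrated decision rule" is a single threshold fitted on held-out examples. All function names and the toy numbers are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple

Logits = Tuple[float, float]  # (yes_logit, no_logit) at the final readout


def log_odds(logits: Logits) -> float:
    """Log-odds of answering Yes versus No."""
    yes_logit, no_logit = logits
    return yes_logit - no_logit


def matched_prior_score(evidence: Logits, control: Logits) -> float:
    """Assumed form of matched-prior subtraction.

    `control` comes from a prompt with the same format and length but
    with the evidence content removed, so the difference isolates what
    the evidence itself contributes to the decision.
    """
    return log_odds(evidence) - log_odds(control)


def fit_threshold(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Pick the threshold that maximizes accuracy on a calibration set."""
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and aligned")
    ordered = sorted(set(scores))
    # Candidates: below all scores, between neighbours, above all scores.
    candidates = [ordered[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    candidates.append(ordered[-1] + 1.0)

    def accuracy(threshold: float) -> float:
        hits = sum((s > threshold) == y for s, y in zip(scores, labels))
        return hits / len(scores)

    return max(candidates, key=accuracy)


# Toy data (made up): raw log-odds that all lean towards "Yes".
toy_scores = [0.5, 1.5, 2.0, 3.0]
toy_labels = [False, False, True, True]
threshold = fit_threshold(toy_scores, toy_labels)
predictions = [s > threshold for s in toy_scores]
```

On these toy numbers a naive rule (answer Yes whenever the log-odds are positive) gets half the examples wrong, while the fitted threshold separates them. Under this reading, fitting a separate threshold per evidence placement (front, middle, end, interleaved) would be one way to examine whether a single global correction is enough.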
