Benchmarking Logical Reasoning Inconsistencies in Local Large Language Models: Evidence from Multi-Domain Evaluation
Abstract
We present systematic evidence of logical reasoning limitations in local large language models using MREB (Multimodal Reasoning and Ethics Benchmark), focusing on deduction, induction, and consistency across related questions. Our evaluation of four prominent local models reveals significant logical reasoning deficits: performance ranges from 48% to 60% on logical tasks, compared with 84% to 92% on ethics questions that require similar reasoning patterns. We identify three critical failure modes: (1) inconsistent logical deduction across semantically equivalent problems, (2) failure to maintain logical consistency when reasoning about related scenarios, and (3) systematic bias toward pattern matching over genuine logical inference. Our findings demonstrate that current local LLMs exhibit fundamental logical reasoning limitations that are masked by strong performance in other cognitive domains, highlighting the need for targeted improvements in logical reasoning and for more rigorous consistency evaluation frameworks.