Not All Time Is Gregorian: Evaluating LLMs on Cultural Calendar Systems
Abstract
Large Language Models (LLMs) demonstrate strong temporal reasoning and historical fact retrieval, yet existing benchmarks rely almost exclusively on the Gregorian calendar, implicitly treating Western temporal standards as universal. This Gregorian-centric framing obscures a critical limitation: current foundation models fail to reason reliably within culturally diverse, non-Gregorian calendar systems used by billions worldwide. We introduce a diagnostic benchmark for temporal reasoning across five major cultural calendars: Vikram Samvat, Persian (Jalali), Hijri, Chinese Lunar, and Hebrew. The benchmark evaluates two core capabilities: Event Date Retrieval, measuring factual grounding in indigenous timelines, and Date Arithmetic, probing structural reasoning over non-linear temporal constructs such as intercalary months and lunar cycles. Evaluating several open-weight models, including Gemma-3, DeepSeek-V3, and Qwen-32B, reveals pronounced performance disparities. While reasoning-optimized models such as DeepSeek-R1 show localized competence in solar calendars (e.g., Persian), performance collapses for lunisolar and purely lunar systems. Models consistently exhibit a Gregorian anchoring effect, defaulting to linear offsets or Western mathematical heuristics even when prompted within alternative calendar frameworks. These findings expose a deep-seated Gregorian bias in foundation models, suggesting that temporal reasoning is often memorized rather than structurally learned. Our work identifies a key bottleneck in cultural alignment and provides a rigorous framework for developing more inclusive and robust temporal reasoning systems.
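To make the two failure modes named above concrete, the following minimal Python sketch illustrates (a) an intercalary-month rule and (b) why a constant year offset from the Gregorian calendar breaks down for a purely lunar calendar. It is an illustration only and is not taken from the benchmark itself: the function names are hypothetical, the Hebrew leap-year test is the standard arithmetic rule for the 19-year cycle, and the Hijri conversion is a well-known rule-of-thumb approximation rather than an exact calendar conversion.

```python
def hebrew_is_leap_year(year: int) -> bool:
    """Return True if the Hebrew year contains an intercalary month.

    Standard arithmetic rule: 7 of every 19 years are leap years.
    """
    return (7 * year + 1) % 19 < 7


def months_in_hebrew_year(year: int) -> int:
    """A leap year has 13 months, a common year has 12."""
    return 13 if hebrew_is_leap_year(year) else 12


def naive_hijri_year(gregorian_year: int) -> int:
    """Constant-offset shortcut (hypothetical example of Gregorian anchoring).

    Treats the Hijri year as if it had the same length as a solar year.
    """
    return gregorian_year - 622


def approx_hijri_year(gregorian_year: int) -> float:
    """Rule-of-thumb estimate: about 33 lunar years fit into 32 solar years.

    This is an approximation, not an exact conversion.
    """
    return (gregorian_year - 622) * 33 / 32


if __name__ == "__main__":
    # Month counts vary from year to year, so "add N months" is not uniform.
    for y in (5782, 5783, 5784, 5785):
        print(y, months_in_hebrew_year(y))

    # The constant offset drifts further from the lunar count every year.
    g = 2024
    print(naive_hijri_year(g), round(approx_hijri_year(g), 1))
```

Under these assumptions, Hebrew year 5784 has 13 months while 5785 has 12, so month arithmetic depends on which year the date falls in. For Gregorian 2024, the constant offset lands more than 40 years away from the lunar-aware estimate, which falls in the AH 1445 to 1446 range that 2024 actually overlaps.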