Hyperbolic Geometry of Reasoning: Probing LLM Hidden States
Abstract
Large language models that use chain-of-thought reasoning exhibit hierarchical dependencies, yet the geometric structure of their hidden representations remains underexplored. We probe DeepSeek-R1 (reasoning-specialized) and Qwen2.5 (standard instruction-tuned) on PrOntoQA logical reasoning tasks, comparing Euclidean and hyperbolic probe geometries. Hyperbolic probes maintain robust performance across all layers, whereas Euclidean probes exhibit a late-layer degradation specific to reasoning models: they are stable at early layers but degrade substantially at the final layer. Standard instruction-tuned models show no such degradation. We further show that probing "thinking tokens" (reasoning-critical tokens identified via linguistic markers) concentrates hierarchical information far more effectively than uniform pooling at the compressed final layer. Layer-wise activation statistics provide evidence linking representational compression to the geometry-dependent performance gap. These findings suggest that hyperbolic geometry offers important robustness advantages for probing reasoning representations, conditional on model architecture.
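To make the contrast between the two probe geometries concrete, the following is a minimal sketch of the two distance functions such probes could be built on. It assumes the Poincaré ball model with curvature -1, which the abstract does not specify; the function names and the toy points `root`, `leaf_a`, and `leaf_b` are hypothetical and are not taken from the experiments.

```python
import numpy as np

def euclidean_distance(u, v):
    """Straight-line distance between two vectors."""
    return float(np.linalg.norm(u - v))

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance in the Poincare ball (assumed model, curvature -1).

    Both inputs must have Euclidean norm strictly less than 1.
    """
    sq_diff = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    # eps guards against division by zero for points very close to the boundary.
    return float(np.arccosh(1.0 + 2.0 * sq_diff / max(denom, eps)))

# Hypothetical toy points: a "root" at the origin and two "leaves" near the boundary.
root = np.array([0.0, 0.0])
leaf_a = np.array([0.9, 0.0])
leaf_b = np.array([0.0, 0.9])

print("Euclidean leaf-leaf:", euclidean_distance(leaf_a, leaf_b))
print("Poincare  leaf-leaf:", poincare_distance(leaf_a, leaf_b))
print("Poincare  root-leaf:", poincare_distance(root, leaf_a))
```

In this toy setting the hyperbolic distance between the two leaves is close to the sum of their distances to the root, which is the tree-like behavior that motivates hyperbolic probes for hierarchical structure.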