MathNet: A Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
Abstract
Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a large-scale, high-quality, multilingual, and multimodal dataset of Olympiad-level problems. MathNet spans 40 countries, 10 languages, and two decades of competitions, comprising 17,512 expert-authored problems with solutions across diverse domains. MathNet supports three tasks: (i) mathematical comprehension, (ii) mathematical retrieval, an underexplored but essential capability, and (iii) Math RAG, which evaluates how retrieval-augmented generation improves problem solving. For retrieval, we construct 39K pairs of mathematically equivalent problems to enable equivalence-based evaluation, in addition to 70 expert-curated pairs from real competitions. Experimental results show that even state-of-the-art reasoning models are challenged (GPT-5 scores 76.8% and Claude 4.5 Opus 46.8%), while embedding models struggle to retrieve equivalent problems. Finally, we show that LLM performance in RAG-based math problem solving is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark. MathNet provides the largest high-quality Olympiad dataset and the first retrieval benchmark for problem equivalence. We publicly release both the dataset and benchmark at http://mathnet.netlify.app/.