MathEval: A Comprehensive Benchmark for Evaluating Large Language Models on Mathematical Reasoning Capabilities

04.12.2025 Frontiers Journals

This study introduces MathEval, a comprehensive benchmarking framework designed to systematically evaluate the mathematical reasoning capabilities of large language models (LLMs). Addressing key limitations of existing assessments—inconsistency, narrow scope, and inadequate adaptation to diverse models and datasets—MathEval consolidates 22 datasets spanning arithmetic, math word problems (MWPs), and competition mathematics in English and Chinese, with difficulty levels ranging from elementary to advanced.

The study’s methodological innovation lies in its three core components: (1) Diverse Math Scenarios, featuring novel datasets like Arith3K and dynamically updated Gaokao exam problems to prevent test contamination; (2) Adaptive Prompt Engineering, tailoring zero-shot and few-shot prompts to optimize model performance across varied problem types; and (3) Robust Evaluation Protocols, employing GPT-4 as an automated judge for answer extraction and comparison, supplemented by a fine-tuned DeepSeek-7B model for researchers without GPT-4 access.
Key findings reveal that closed-source models (e.g., Claude-3.5-Sonnet) outperform open-source counterparts, achieving 77% average accuracy, while domain-specific fine-tuning (e.g., DeepSeek-Math-7B) significantly enhances arithmetic and MWP-solving abilities. The benchmark also uncovers critical insights: models perform markedly better on English than on Chinese MWPs (84.7% vs. 67.2% accuracy) and struggle with high-school-level problems owing to their complexity. Notably, the inclusion of fresh Gaokao problems exposed potential data contamination in models such as the Qwen series, whose performance dropped on unseen exam questions.
The study acknowledges limitations, such as sparse middle-school-level MWPs and the absence of multimodal (visual) reasoning tasks, proposing future expansions to include geometry problems and finer-grained difficulty classifications. By establishing a standardized, scalable evaluation framework, MathEval advances the rigorous assessment of LLMs’ mathematical reasoning, offering actionable insights for model improvement and educational applications. Its integration of dynamic datasets and adaptive evaluation methods sets a new precedent for benchmarking in AI-driven mathematical problem-solving.
DOI:10.1007/s44366-025-0053-z
Attached documents
  • Figure 1. Three core components of MathEval addressing key challenges. LLM: large language model.
  • Figure 2. MathEval evaluation results. (a) Overall average for different model categories; (b) model performance by parameter size; (c) improvements of math-domain models; (d) comparison of models' capabilities in solving arithmetic problems and math word problems. Panels (a), (b), and (c) present the findings for closed-source, open-source, and math-domain models, respectively; panel (d) compares model-level capabilities across problem types.
Regions: Asia, China
Keywords: Humanities, Education


