This study introduces MathEval, a comprehensive benchmarking framework designed to systematically evaluate the mathematical reasoning capabilities of large language models (LLMs). Addressing key limitations of existing assessments, including inconsistency, narrow scope, and inadequate adaptation to diverse models and datasets, MathEval consolidates 22 datasets spanning arithmetic, math word problems (MWPs), and competition mathematics in both English and Chinese, with difficulty levels ranging from elementary to advanced.
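A collection of this breadth is most easily handled through a uniform dataset registry, so that the evaluation harness can iterate over arithmetic, MWP, and competition problems in either language without special-casing each source. The sketch below is an illustrative Python registry written under that assumption; the DatasetSpec fields, file paths, and the select helper are hypothetical and are not taken from MathEval's released code.

```python
# Illustrative registry for heterogeneous math benchmark datasets.
# All field names and paths are assumptions for this sketch, not MathEval's schema.
from dataclasses import dataclass
from typing import Literal

@dataclass(frozen=True)
class DatasetSpec:
    name: str
    category: Literal["arithmetic", "mwp", "competition"]
    language: Literal["en", "zh"]
    difficulty: Literal["elementary", "middle_school", "high_school", "advanced"]
    path: str  # local path (or URL) of the dataset split

# Keying the registry by name lets the harness loop uniformly over all datasets.
REGISTRY: dict[str, DatasetSpec] = {
    "arith3k": DatasetSpec("arith3k", "arithmetic", "en", "elementary", "data/arith3k.jsonl"),
    "gaokao_fresh": DatasetSpec("gaokao_fresh", "competition", "zh", "high_school", "data/gaokao_fresh.jsonl"),
}

def select(category: str | None = None, language: str | None = None) -> list[DatasetSpec]:
    """Filter registered datasets by category and/or language."""
    return [
        spec for spec in REGISTRY.values()
        if (category is None or spec.category == category)
        and (language is None or spec.language == language)
    ]
```

For example, select(category="mwp", language="zh") would return only the Chinese math-word-problem datasets, which is the kind of slicing needed to report per-language and per-difficulty accuracy.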
The study’s methodological innovation lies in its three core components: (1) Diverse Math Scenarios, featuring novel datasets like Arith3K and dynamically updated Gaokao exam problems to prevent test contamination; (2) Adaptive Prompt Engineering, tailoring zero-shot and few-shot prompts to optimize model performance across varied problem types; and (3) Robust Evaluation Protocols, employing GPT-4 as an automated judge for answer extraction and comparison, supplemented by a fine-tuned DeepSeek-7B model for researchers without GPT-4 access.
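As a concrete illustration of the LLM-as-judge step, the sketch below uses the OpenAI chat-completions client to ask a judge model to extract the final answer from a model response and compare it with the reference. The prompt wording, the judge function, and the CORRECT/INCORRECT labels are assumptions made for this sketch, not MathEval's exact protocol.

```python
# Minimal sketch of automated answer extraction and comparison with an LLM judge.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical judging prompt; the real protocol may differ.
JUDGE_PROMPT = (
    "You are grading a math answer. Extract the final answer from the model "
    "response and decide whether it is mathematically equivalent to the "
    "reference answer. Reply with exactly 'CORRECT' or 'INCORRECT'.\n\n"
    "Question: {question}\nModel response: {response}\nReference answer: {reference}"
)

def judge(question: str, response: str, reference: str, model: str = "gpt-4") -> bool:
    """Return True if the judge model deems the response equivalent to the reference."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic grading
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, response=response, reference=reference)}],
    )
    verdict = (completion.choices[0].message.content or "").strip().upper()
    return verdict.startswith("CORRECT")
```

The same function could be pointed at a locally served fine-tuned judge (such as the DeepSeek-7B-based model mentioned above) by swapping the client's base URL and model name, which is the design choice that makes the protocol usable without GPT-4 access.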
Key findings reveal that closed-source models (e.g., Claude-3.5-Sonnet) outperform their open-source counterparts, achieving 77% average accuracy, while domain-specific fine-tuning (e.g., DeepSeek-Math-7B) significantly enhances arithmetic and MWP-solving abilities. The benchmark also uncovers critical insights: models perform markedly better on English than on Chinese MWPs (84.7% vs. 67.2% accuracy) and struggle with high-school-level problems because of their greater complexity. Notably, the inclusion of fresh Gaokao problems exposed potential data contamination in models such as the Qwen series, whose performance dropped on unseen exam questions.
The study acknowledges limitations, such as sparse coverage of middle-school-level MWPs and the absence of multimodal (visual) reasoning tasks, and proposes future expansions to include geometry problems and finer-grained difficulty classifications. By establishing a standardized, scalable evaluation framework, MathEval advances the rigorous assessment of LLMs' mathematical reasoning, offering actionable insights for model improvement and educational applications. Its integration of dynamic datasets and adaptive evaluation methods sets a new precedent for benchmarking AI-driven mathematical problem-solving.
DOI: 10.1007/s44366-025-0053-z