Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems

05/02/2026 Frontiers Journals

Chain-of-Thought (CoT) prompting has enhanced the performance of Large Language Models (LLMs) across various reasoning tasks. However, CoT still falls short on complex math word problems, where it typically suffers from three kinds of error: semantic misunderstanding errors, calculation errors, and step-missing errors. Prior studies have addressed calculation errors and step-missing errors but neglected semantic misunderstanding errors, which are the major factor limiting the reasoning performance of LLMs.
To address this gap, a research team led by Qihuang ZHONG published new research on 15 January 2026 in Frontiers of Computer Science, a journal co-published by Higher Education Press and Springer Nature.
The team proposed a simple yet effective method, Deeply Understanding the Problems (DUP), which improves LLMs' math problem-solving ability by addressing semantic misunderstanding errors. The core of the DUP method is to encourage LLMs to deeply understand the problem and extract the key problem-solving information needed for better reasoning.
The principle of the DUP method is akin to the human learning process: a student given a complex math word problem will read and comprehend the text, identify the core question that needs to be answered, and finally solve it using the relevant problem-solving information. Specifically, DUP consists of three stages: 1) revealing the core question of the input problem; 2) extracting the problem-solving information relevant to the core question; 3) generating and extracting the final answer by combining the core question with the problem-solving information. In this way, LLMs can filter out irrelevant information and achieve better math reasoning performance.
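The three stages can be sketched as a short prompting pipeline. The sketch below is illustrative only and is not the team's implementation: the `llm` callable (prompt text in, completion text out), the function names, and the prompt wording are all assumptions, and the regular-expression answer extraction is a simplification of the final stage.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical interface: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


def extract_last_number(text: str) -> Optional[str]:
    """Return the last number mentioned in the text, or None if there is none.

    Assumption: the final answer is the last number in the model's reasoning.
    Thousands separators such as "1,250" are removed before matching.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None


def dup_solve(problem: str, llm: LLM) -> Dict[str, Optional[str]]:
    """Run a three-stage, DUP-style pipeline (prompt wording is illustrative)."""
    # Stage 1: reveal the core question of the input problem.
    core_question = llm(
        f"{problem}\n\nState the core question this problem asks, and nothing else."
    ).strip()

    # Stage 2: extract the problem-solving information relevant to that question.
    info = llm(
        f"{problem}\n\nList only the information needed to answer "
        f"this question: {core_question}"
    ).strip()

    # Stage 3: combine the core question with the extracted information,
    # generate the reasoning, then extract the final answer from it.
    reasoning = llm(
        f"{problem}\n\nHint: {info}\nQuestion: {core_question}\n"
        "Solve the problem step by step and state the final answer as a number."
    )

    return {
        "core_question": core_question,
        "info": info,
        "reasoning": reasoning,
        "answer": extract_last_number(reasoning),
    }
```

Passing the model as a plain callable keeps the sketch independent of any particular API client; each stage's output is fed into the next prompt, which is how the irrelevant details of the original problem get filtered out before the final reasoning step.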
The team conducted a series of experiments on 11 reasoning datasets spanning math, commonsense, and symbolic reasoning benchmarks. The experimental results with GPT-3.5-Turbo and GPT-4 show that: 1) DUP consistently outperforms its counterparts across all datasets by a large margin; 2) zero-shot DUP can even outperform few-shot methods on most reasoning datasets; 3) more encouragingly, DUP achieves new state-of-the-art (SOTA) results on the popular GSM8K (97.1%) and SVAMP (94.2%) benchmarks.
Future work can focus on exploring more efficient methods for boosting LLMs' reasoning abilities and extending the DUP method to more fields.
DOI:10.1007/s11704-025-41102-z
Attached files
  • Error analysis of GSM8K problems with incorrect answers
  • Illustration of DUP strategy
Regions: Asia, China
Keywords: Applied science, Computing
