Chain-of-Thought (CoT) prompting has enhanced the performance of Large Language Models (LLMs) across various reasoning tasks. However, CoT still falls short on complex math word problems, where it typically suffers from three pitfalls: semantic misunderstanding errors, calculation errors, and step-missing errors. Prior studies have addressed calculation errors and step-missing errors but neglect semantic misunderstanding errors, which are the major factor limiting the reasoning performance of LLMs.
To address this problem, a research team led by Qihuang ZHONG published their new research on 15 January 2026 in Frontiers of Computer Science, co-published by Higher Education Press and Springer Nature.
The team proposed a simple yet effective method, namely Deeply Understanding the Problems (DUP), to improve LLMs' math problem-solving ability by addressing semantic misunderstanding errors. The core of the DUP method is to encourage LLMs to deeply understand the problems and to extract the key problem-solving information needed for better reasoning.
The principle of the DUP method is akin to the human learning process: when a student receives a complex math word problem, they read and comprehend the text of the problem, identify the core question that needs to be answered, and finally solve it using the relevant problem-solving information. Specifically, DUP consists of three stages: 1) revealing the core question of the input problem; 2) extracting the problem-solving information relevant to solving the core question; 3) generating and extracting the final answer by combining the core question with the problem-solving information. By doing so, LLMs can filter out irrelevant information and achieve better math reasoning performance.
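As an illustration of how such a three-stage prompting pipeline could be wired up, the minimal Python sketch below chains three LLM calls, one per stage. The complete helper and the prompt wording are assumptions for demonstration only; they are not the exact prompts or implementation from the paper.

def complete(prompt: str) -> str:
    """Placeholder for a call to an LLM (e.g., a chat-completions API).
    Replace this stub with your provider of choice; it is an assumption,
    not part of the original work."""
    raise NotImplementedError("Connect this helper to an LLM API.")

def dup_solve(problem: str) -> str:
    # Stage 1: reveal the core question of the input problem.
    core_question = complete(
        f"{problem}\n\nPlease extract the core question that must be answered."
    )

    # Stage 2: extract the problem-solving information relevant to the core question.
    solving_info = complete(
        f"{problem}\n\nThe core question is: {core_question}\n"
        "List only the information from the problem that is needed to answer it."
    )

    # Stage 3: generate the final answer by combining the core question
    # with the extracted problem-solving information.
    return complete(
        f"{problem}\n\nCore question: {core_question}\n"
        f"Relevant information: {solving_info}\n"
        "Using the information above, solve the core question step by step "
        "and state the final answer."
    )

Because the later prompts carry only the distilled core question and the extracted facts, the final reasoning step is less exposed to distracting or irrelevant details in the original problem text.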
They conducted a series of experiments on 11 reasoning datasets spanning math, commonsense, and symbolic reasoning benchmarks. The experimental results on GPT-3.5-Turbo and GPT-4 show that: 1) DUP consistently outperforms its counterparts across all datasets by a large margin; 2) zero-shot DUP can even outperform few-shot methods on most reasoning datasets; 3) more encouragingly, DUP achieves new state-of-the-art (SOTA) results on the popular GSM8K (97.1%) and SVAMP (94.2%) benchmarks.
Future work can focus on exploring more efficient methods for boosting LLMs’ reasoning abilities and expanding the DUP method to more fields.
DOI: 10.1007/s11704-025-41102-z