PediaBench: a comprehensive Chinese pediatric dataset for benchmarking large language models

Existing datasets for medical QA cannot comprehensively assess the proficiency of LLMs in pediatrics. To address this gap, a research team led by Hui LI and Yanhao WANG published new research on benchmarking LLMs for pediatric QA in Frontiers of Computer Science, co-published by Higher Education Press and Springer Nature.
The team introduced PediaBench, the first Chinese pediatric dataset encompassing 5 question types and 12 disease groups, and devised an integrated scoring scheme to thoroughly assess each LLM's proficiency across all question types in a unified manner. Finally, they validated the effectiveness of PediaBench with extensive experiments on 20 open-source and commercial LLMs.
In the research, they first describe the construction process of the PediaBench dataset. The questions in PediaBench are collected from various public sources, including the Chinese national medical licensing examination, university final exams in medicine, pediatric disease diagnosis and treatment standards, and clinical guidelines. The questions are classified into five types: true-or-false (ToF), multiple choice (MC), pairing (PA), essay-type short answer (ES), and case analysis (CA). The team uses GLM to classify the questions into disease groups according to the International Classification of Diseases (ICD-11) standard issued by the WHO. They then devise an integrated scoring criterion to evaluate the performance of each LLM. For ToF and MC questions, accuracy is the basic measure, and each question is assigned a weight based on its difficulty level. PA questions all carry an equal weight of 3, and a partially correct result receives a score of 1. For ES and CA questions, GPT-4o is used to score each LLM's answers. Finally, a fixed proportion is assigned to each question type, and the integrated score is calculated from these proportions.
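The scoring scheme described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the difficulty weights, the per-type proportions, the 0-10 judge scale, and all function names are hypothetical placeholders, since the release does not give these values. Only the PA weight of 3 and the partial-credit score of 1 come from the text, and reading them as "3 points for a fully correct answer, 1 point for a partially correct one" is itself an interpretation.

```python
# Hypothetical values: the release does not state the actual weights or proportions.
DIFFICULTY_WEIGHTS = {"easy": 1.0, "medium": 2.0, "hard": 3.0}
TYPE_PROPORTIONS = {"ToF": 0.15, "MC": 0.25, "PA": 0.15, "ES": 0.20, "CA": 0.25}

# From the release: PA questions carry a weight of 3, partial answers score 1.
PA_FULL, PA_PARTIAL = 3.0, 1.0


def weighted_accuracy(results):
    """Difficulty-weighted accuracy for ToF or MC questions.

    results: list of (is_correct, difficulty) tuples.
    Returns a value in [0, 1].
    """
    total = sum(DIFFICULTY_WEIGHTS[d] for _, d in results)
    earned = sum(DIFFICULTY_WEIGHTS[d] for ok, d in results if ok)
    return earned / total if total else 0.0


def pairing_score(outcomes):
    """Normalized score for PA questions.

    outcomes: list of "correct", "partial", or "wrong".
    Returns a value in [0, 1].
    """
    points = {"correct": PA_FULL, "partial": PA_PARTIAL, "wrong": 0.0}
    total = PA_FULL * len(outcomes)
    return sum(points[o] for o in outcomes) / total if total else 0.0


def judged_score(scores, max_score=10.0):
    """Normalized score for ES or CA questions graded by an LLM judge.

    scores: list of judge scores, each assumed to lie in [0, max_score].
    Returns a value in [0, 1].
    """
    return sum(scores) / (max_score * len(scores)) if scores else 0.0


def integrated_score(per_type):
    """Combine normalized per-type scores into one score on a 0-100 scale."""
    return 100.0 * sum(TYPE_PROPORTIONS[t] * per_type[t] for t in TYPE_PROPORTIONS)
```

With this layout, a model's five normalized per-type scores are passed as a dictionary keyed by "ToF", "MC", "PA", "ES", and "CA", and the result can be compared against a passing threshold such as 60.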
The experimental results show that only a few LLMs achieve a passing score of at least 60. Given the high requirement for factuality in medical applications, there is still a significant gap to close before LLMs can be deployed as assistants for pediatricians.
DOI: 10.1007/s11704-025-41345-w

https://dx.doi.org/10.1007/s11704-025-41345-w

Attached files

The dataset construction and evaluation process.
The dataset statistics in terms of question types, difficulty levels for ToF and MC questions, and disease groups.
Scores of different LLMs on the five question types and their total scores.

02/04/2026 Frontiers Journals

Regions: Asia, China

Keywords: Applied science, Computing

