PediaBench: a comprehensive Chinese pediatric dataset for benchmarking large language models

Existing datasets for medical question answering (QA) cannot comprehensively assess the proficiency of LLMs in pediatrics. To fill this gap, a research team led by Hui LI and Yanhao WANG published new research on benchmarking LLMs for pediatric QA in Frontiers of Computer Science, co-published by Higher Education Press and Springer Nature.
The team introduced PediaBench, the first Chinese pediatric dataset encompassing 5 question types and 12 disease groups, and devised an integrated scoring scheme to thoroughly assess each LLM's proficiency across all types of questions in a unified manner. Finally, they validated the effectiveness of PediaBench with extensive experiments on 20 open-source and commercial LLMs.
In the research, they first described the construction process of the PediaBench dataset. The questions of PediaBench were collected from various public sources, including the Chinese national medical licensing examination, final exams of medical universities, pediatric disease diagnosis and treatment standards, and clinical guidelines. The questions are classified into five types: true-or-false (ToF), multiple choice (MC), pairing (PA), essay-type short answer (ES), and case analysis (CA). The team used GLM to classify the questions into disease groups according to the International Classification of Diseases (ICD-11) standard issued by the WHO. They then devised an integrated scoring criterion to evaluate the performance of each LLM. For ToF and MC questions, they used accuracy as the basic measure and assigned a weight to each question based on its difficulty level. For PA questions, they used an equal weight of 3 and gave a score of 1 for a partially correct result. For ES and CA questions, they used GPT-4o to score each LLM's answers. Finally, they assigned a fixed proportion to each type of question and calculated the integrated score.
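The scoring scheme described above can be illustrated with a minimal Python sketch. It is not the team's implementation: the difficulty weights, the per-type proportions, the maximum judge score, and all function names are hypothetical placeholders, since the release does not state the actual values. The sketch also assumes that a fully correct PA answer earns the full weight of 3, a partially correct one earns 1, and a wrong one earns 0.

```python
from typing import Dict, List, Tuple

# Hypothetical difficulty weights for ToF and MC questions (not from the release).
DIFFICULTY_WEIGHTS: Dict[str, float] = {"easy": 1.0, "medium": 2.0, "hard": 3.0}

# Hypothetical share of a 100-point total given to each question type.
TYPE_PROPORTIONS: Dict[str, float] = {
    "ToF": 10.0, "MC": 30.0, "PA": 10.0, "ES": 20.0, "CA": 30.0,
}

# Assumed reading of the PA rule: weight 3 per question, 1 point if partially correct.
PA_POINTS: Dict[str, int] = {"full": 3, "partial": 1, "wrong": 0}


def weighted_accuracy(results: List[Tuple[bool, str]]) -> float:
    """Difficulty-weighted accuracy for ToF or MC; each item is (is_correct, difficulty)."""
    total = sum(DIFFICULTY_WEIGHTS[level] for _, level in results)
    earned = sum(DIFFICULTY_WEIGHTS[level] for ok, level in results if ok)
    return earned / total if total else 0.0


def pairing_fraction(outcomes: List[str]) -> float:
    """Fraction of available PA points earned; each outcome is 'full', 'partial', or 'wrong'."""
    if not outcomes:
        return 0.0
    return sum(PA_POINTS[o] for o in outcomes) / (PA_POINTS["full"] * len(outcomes))


def judged_fraction(scores: List[float], max_score: float) -> float:
    """Fraction of available points for ES or CA, given scores from a judge model."""
    if not scores:
        return 0.0
    return sum(scores) / (max_score * len(scores))


def integrated_score(fractions: Dict[str, float]) -> float:
    """Combine per-type fractions (each in [0, 1]) using the fixed proportions."""
    return sum(TYPE_PROPORTIONS[t] * fractions[t] for t in TYPE_PROPORTIONS)


if __name__ == "__main__":
    fractions = {
        "ToF": weighted_accuracy([(True, "easy"), (False, "hard")]),
        "MC": weighted_accuracy([(True, "medium"), (True, "hard")]),
        "PA": pairing_fraction(["full", "partial", "wrong"]),
        "ES": judged_fraction([7.0, 9.0], max_score=10.0),
        "CA": judged_fraction([6.0], max_score=10.0),
    }
    print(f"Integrated score: {integrated_score(fractions):.2f}")
```

Normalizing each question type to a fraction before applying its proportion keeps the total on a 100-point scale, so a passing threshold such as 60 can be read directly from the result.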
The experimental results show that only a few LLMs achieve a passing score of at least 60. Given the high requirement for factuality in medical applications, there is still a significant gap to close before LLMs can be deployed as assistants for pediatricians.
DOI: 10.1007/s11704-025-41345-w

https://dx.doi.org/10.1007/s11704-025-41345-w

Attachments

The dataset construction and evaluation process.
The dataset statistics in terms of question types, difficulty levels for ToF and MC questions, and disease groups.
Scores of different LLMs on the five question types, together with their total scores.

02/04/2026 Frontiers Journals

Regions: Asia, China

Keywords: Applied science, Computing

