PediaBench: a comprehensive Chinese pediatric dataset for benchmarking large language models

02/04/2026 Frontiers Journals

Existing datasets for medical question answering (QA) cannot comprehensively assess the proficiency of large language models (LLMs) in pediatrics. To address this gap, a research team led by Hui LI and Yanhao WANG published new research on benchmarking LLMs for pediatric QA in Frontiers of Computer Science, co-published by Higher Education Press and Springer Nature.
The team introduced PediaBench, the first Chinese pediatric dataset encompassing 5 question types and 12 disease groups, and devised an integrated scoring scheme to assess each LLM's proficiency across all question types in a unified manner. Finally, they validated the effectiveness of PediaBench with extensive experiments on 20 open-source and commercial LLMs.
In the research, they first described how the PediaBench dataset was constructed. The questions were collected from various public sources, including the Chinese national medical licensing examination, final exams of medical universities, pediatric disease diagnosis and treatment standards, and clinical guidelines. The questions are classified into five types: true-or-false (ToF), multiple choice (MC), pairing (PA), essay-type short answer (ES), and case analysis (CA). The team used GLM to classify the questions into disease groups according to the International Classification of Diseases (ICD-11) standard issued by the WHO. They then devised an integrated scoring criterion to evaluate the performance of each LLM. For ToF and MC questions, accuracy is the basic measure, and each question is assigned a weight based on its difficulty level. For PA questions, every question carries an equal weight of 3, and a partially correct result earns a score of 1. For ES and CA questions, GPT-4o is used to score each LLM's answers. Finally, they assigned a fixed proportion to each question type and calculated the integrated score.
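The scoring scheme described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the difficulty weights, the per-type proportions, the 0 to 10 judge scale, and all function names are hypothetical placeholders, since the release does not state them. Only the PA weight of 3 and the partial score of 1 come from the text.

```python
# Hypothetical values: the release does not give the actual difficulty
# weights or the fixed proportion assigned to each question type.
DIFFICULTY_WEIGHT = {"easy": 1.0, "medium": 2.0, "hard": 3.0}
TYPE_PROPORTION = {"ToF": 0.15, "MC": 0.30, "PA": 0.15, "ES": 0.20, "CA": 0.20}

# From the text: each PA question has weight 3; a partial match earns 1.
PA_FULL, PA_PARTIAL = 3.0, 1.0


def weighted_accuracy(results):
    """ToF/MC: results is a list of (difficulty, is_correct) pairs.

    Returns earned weight divided by total weight, in [0, 1].
    """
    total = sum(DIFFICULTY_WEIGHT[d] for d, _ in results)
    earned = sum(DIFFICULTY_WEIGHT[d] for d, ok in results if ok)
    return earned / total if total else 0.0


def pairing_score(outcomes):
    """PA: outcomes is a list of 'correct', 'partial', or 'wrong'."""
    points = {"correct": PA_FULL, "partial": PA_PARTIAL, "wrong": 0.0}
    total = PA_FULL * len(outcomes)
    return sum(points[o] for o in outcomes) / total if total else 0.0


def judged_score(marks, max_mark=10.0):
    """ES/CA: marks come from an LLM judge (assumed 0..max_mark scale)."""
    return sum(marks) / (max_mark * len(marks)) if marks else 0.0


def integrated_score(per_type):
    """Combine normalized per-type scores into a single 0..100 score."""
    return 100.0 * sum(TYPE_PROPORTION[t] * per_type[t] for t in TYPE_PROPORTION)
```

Under these assumed numbers, a model's five normalized per-type scores are passed to `integrated_score` as a dictionary keyed by question type, and the result can be compared against the passing threshold of 60 mentioned in the release.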
The experimental results show that only a few LLMs achieve a passing score of at least 60. Given the high requirement for factuality in medical applications, there is still a significant gap to close before LLMs can be deployed as assistants for pediatricians.
DOI:10.1007/s11704-025-41345-w
Attachments
  • The dataset construction and evaluation process.
  • The dataset statistics in terms of question types, difficulty levels for ToF and MC questions, and disease groups.
  • Scores of different LLMs on the five question types and their total scores.
Regions: Asia, China
Keywords: Applied science, Computing
