An Open-Source Large Language Model for Chinese Education Research

05/12/2025 Frontiers Journals

Research on open-source large language models (LLMs) has made significant progress, but most studies focus on general-purpose English data, which poses challenges for LLM research in Chinese education. To address this, the study first reviews and synthesizes the core technologies of representative open-source LLMs, then designs an advanced 1.5B-parameter LLM tailored to the Chinese education field. The Chinese education large language model (CELLM) is trained from scratch in two stages: pre-training and instruction fine-tuning. The pre-training stage uses an open-source dataset for the Chinese education domain. For the instruction fine-tuning stage, a Chinese instruction dataset comprising over 258,000 entries is developed and open-sourced. Finally, the results and analysis of CELLM across multiple evaluation datasets are presented, providing a reference performance baseline for future research. All models, data, and code are open-source to foster community research on LLMs in the Chinese education domain.

This study introduces CELLM (Chinese Education Large Language Model), a 1.5B-parameter open-source LLM designed specifically for Chinese educational applications. The research addresses two critical gaps in current LLM development: (1) the lack of transparency in the training processes of existing open-source models, and (2) the scarcity of high-quality Chinese educational datasets compared with their English counterparts.

The core innovation is a fully transparent training pipeline with two key components. First, pre-training uses Chinese-fineweb-edu-v2, an open-source domain-specific corpus that combines multiple Chinese educational resources (25.4% industry corpus, 18.6% safety corpus, etc.). Second, the authors built a multi-turn dialogue translation framework that converted 258,000 English instructional entries into Chinese with 97.7% accuracy, significantly expanding the available Chinese educational data.
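
The summary does not describe how the translation framework works internally (its prompt is shown only in Figure 1). As a rough illustration of what translating multi-turn instruction data can involve, the sketch below translates each turn in order, passes earlier translated turns as context, and keeps only entries that pass a simple sanity check. Everything here is an assumption made for illustration: the `role`/`content` entry layout, the `translate` callback (a stand-in for whatever model performs the translation), and the validity check, which is not the paper's accuracy metric.

```python
from typing import Callable, Dict, List, Tuple

Turn = Dict[str, str]          # hypothetical layout: {"role": ..., "content": ...}
Translator = Callable[[str, List[Turn]], str]  # (text, translated context) -> Chinese text


def contains_chinese(text: str) -> bool:
    """Return True if the text has at least one CJK Unified Ideograph."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def translate_entry(turns: List[Turn], translate: Translator) -> List[Turn]:
    """Translate a dialogue turn by turn, giving the translator prior context."""
    translated: List[Turn] = []
    for turn in turns:
        # Pass a copy so the callback cannot mutate the running result.
        zh = translate(turn["content"], list(translated))
        translated.append({"role": turn["role"], "content": zh})
    return translated


def is_valid(source: List[Turn], target: List[Turn]) -> bool:
    """Illustrative check: same turn structure, and every turn contains Chinese."""
    return (
        len(source) == len(target)
        and all(s["role"] == t["role"] for s, t in zip(source, target))
        and all(contains_chinese(t["content"]) for t in target)
    )


def translate_dataset(
    entries: List[List[Turn]], translate: Translator
) -> Tuple[List[List[Turn]], float]:
    """Translate all entries; return the valid ones and the pass rate."""
    kept = []
    for entry in entries:
        result = translate_entry(entry, translate)
        if is_valid(entry, result):
            kept.append(result)
    pass_rate = len(kept) / len(entries) if entries else 0.0
    return kept, pass_rate
```

In practice `translate` would wrap a call to a translation model; filtering on a structural check like this is one simple way to keep malformed outputs out of a fine-tuning set.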

The technical implementation adopts a causal-decoder architecture with grouped-query attention (GQA) and rotary positional encoding (RoPE), optimized for educational contexts. The model is strongest in humanities (26.77% accuracy on C-Eval-humanities) and social sciences (26.35% on C-Eval-social-science), though it shows limitations in STEM domains (21.48% on C-Eval-stem) and programming tasks (a score of 0.6 on the MBPP benchmark).
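
For readers unfamiliar with the two attention techniques named above, the following is a minimal pure-Python sketch of the general ideas, not CELLM's actual implementation. The head counts, head dimension, RoPE base of 10000, and the adjacent-pair rotation layout are assumptions chosen for illustration; the summary does not state the values CELLM uses. RoPE rotates each pair of dimensions by an angle proportional to the token position, and GQA lets several query heads share one key/value head.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate adjacent dimension pairs of a head vector by position-dependent angles."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("head dimension must be even")
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower frequency for later pairs
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out


def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Map a query head to the key/value head shared by its group."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gqa_causal_attention(queries, keys, values):
    """Causal attention with shared KV heads.

    queries: [num_q_heads][seq_len][dim]; keys, values: [num_kv_heads][seq_len][dim].
    Returns outputs shaped [num_q_heads][seq_len][dim].
    """
    num_q_heads, num_kv_heads = len(queries), len(keys)
    outputs = []
    for h in range(num_q_heads):
        g = kv_head_for_query_head(h, num_q_heads, num_kv_heads)
        head_out = []
        for t, q in enumerate(queries[h]):
            q_rot = rope_rotate(q, t)
            scale = 1.0 / math.sqrt(len(q))
            # Causal mask: position t attends only to positions 0..t.
            scores = [dot(q_rot, rope_rotate(keys[g][s], s)) * scale for s in range(t + 1)]
            top = max(scores)
            weights = [math.exp(sc - top) for sc in scores]  # stable softmax
            total = sum(weights)
            dim_v = len(values[g][0])
            head_out.append(
                [sum(w / total * values[g][s][d] for s, w in enumerate(weights))
                 for d in range(dim_v)]
            )
        outputs.append(head_out)
    return outputs
```

Because the rotation depends only on position, the dot product between a rotated query and a rotated key depends on their relative offset rather than their absolute positions. Sharing key/value heads across a group of query heads reduces the size of the key/value cache at inference time, which matters for small models meant to run on modest hardware.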

Notably, the paper provides complete architectural transparency, detailing everything from vocabulary size (151,936 tokens) to training parameters (33.6B pretraining tokens, 16B fine-tuning tokens). This open approach, combined with the release of all models, data, and code, establishes CELLM as a foundational resource for Chinese educational LLM research, while setting performance baselines across 11 evaluation datasets including C-Eval, CMMLU and MMLU.

The work represents a significant step toward democratizing educational LLM development in non-English contexts, though the authors acknowledge the current limitation in model scale (1.5B parameters) compared with commercial counterparts. Future directions include expanding pretraining data and exploring alignment techniques to enhance STEM performance.
DOI: 10.1007/s44366-025-0060-0
Attachments
  • Figure 1. Prompt used in translation framework.
Regions: Asia, China
Keywords: Humanities, Education

