Research on open-source large language models (LLMs) has made significant progress, but most studies focus predominantly on general-purpose English data, which poses challenges for LLM research in Chinese education. To address this, this study first reviews and synthesizes the core technologies of representative open-source LLMs and then designs an advanced 1.5B-parameter LLM tailored to the Chinese education field. The Chinese education large language model (CELLM) is trained from scratch in two stages: pre-training and instruction fine-tuning. In the pre-training stage, an open-source dataset for the Chinese education domain is utilized. For the instruction fine-tuning stage, a Chinese instruction dataset comprising over 258,000 entries is developed and open-sourced. Finally, the results and analysis of CELLM across multiple evaluation datasets are presented, providing a reference baseline for future research. All models, data, and code are open-sourced to foster community research on LLMs in the Chinese education domain.
This study introduces CELLM (Chinese Education Large Language Model), a 1.5B-parameter open-source LLM designed specifically for Chinese educational applications. The research addresses two critical gaps in current LLM development: (1) the lack of transparency in training processes among existing open-source models, and (2) the scarcity of high-quality Chinese educational datasets compared to English counterparts.
The core innovation lies in developing a fully transparent training pipeline with two key components. First, the authors curated Chinese-fineweb-edu-v2, a domain-specific pretraining corpus combining multiple Chinese educational resources (25.4% industry corpus, 18.6% safety corpus, etc.). Second, they created a novel multi-turn dialogue translation framework that converted 258,000 English instruction entries into Chinese with 97.7% accuracy, significantly expanding the available Chinese educational instruction data.
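The summary does not detail how the translation framework is implemented. The Python sketch below illustrates one plausible shape of such a pipeline, assuming entries are stored as role-tagged multi-turn dialogues and using a hypothetical `translate_to_chinese` callable as a stand-in for whatever translation model the authors actually used; the quality check that underlies the reported 97.7% accuracy is likewise represented only by a structural placeholder.

```python
from typing import Callable, Dict, List

# Hypothetical sketch: translate role-tagged multi-turn dialogues into Chinese
# while preserving the conversational structure. `translate_to_chinese` stands
# in for the actual translation model or API, which the paper summary does not
# specify.
def translate_dialogue(
    dialogue: List[Dict[str, str]],
    translate_to_chinese: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Translate every turn of an English instruction dialogue, keeping the
    role labels (e.g., "user" / "assistant") untouched."""
    return [
        {"role": turn["role"], "content": translate_to_chinese(turn["content"])}
        for turn in dialogue
    ]

def looks_valid(
    original: List[Dict[str, str]], translated: List[Dict[str, str]]
) -> bool:
    """Placeholder check: same number of turns, same roles, non-empty text.
    The paper's own validation (behind the reported 97.7% accuracy) is not
    described here."""
    return len(original) == len(translated) and all(
        o["role"] == t["role"] and t["content"].strip()
        for o, t in zip(original, translated)
    )

if __name__ == "__main__":
    entry = [
        {"role": "user", "content": "Explain photosynthesis to a middle-school student."},
        {"role": "assistant", "content": "Photosynthesis is how plants turn sunlight into food..."},
    ]
    # Identity "translator" used purely as a stand-in for demonstration.
    translated = translate_dialogue(entry, translate_to_chinese=lambda text: text)
    assert looks_valid(entry, translated)
```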
The technical implementation adopts a causal-decoder architecture with grouped-query attention (GQA) and rotary position embedding (RoPE), optimized for educational contexts. The model demonstrates particular strength in humanities (26.77% accuracy on C-Eval-humanities) and social sciences (26.35% on C-Eval-social-science), but remains limited in STEM domains (21.48% on C-Eval-stem) and programming tasks (a score of 0.6 on the MBPP benchmark).
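The summary names GQA and RoPE but not the full layer configuration. The sketch below shows how such a causal-decoder setup can be expressed with the Hugging Face transformers library's Qwen2Config, chosen here only because the 151,936-token vocabulary matches the Qwen2 tokenizer; every hyperparameter other than the vocabulary size is an illustrative assumption, not CELLM's reported configuration.

```python
from transformers import Qwen2Config, Qwen2ForCausalLM

# Illustrative causal-decoder configuration with grouped-query attention and
# rotary position embedding. Only vocab_size comes from the paper; the other
# values are assumptions typical of a ~1.5B-parameter decoder, not CELLM's
# published settings.
config = Qwen2Config(
    vocab_size=151_936,           # reported vocabulary size
    hidden_size=1536,             # assumed
    intermediate_size=8960,       # assumed
    num_hidden_layers=28,         # assumed
    num_attention_heads=12,       # query heads
    num_key_value_heads=2,        # < num_attention_heads => grouped-query attention
    max_position_embeddings=32_768,
    rope_theta=1_000_000.0,       # RoPE base frequency (assumed)
)

model = Qwen2ForCausalLM(config)
total = sum(p.numel() for p in model.parameters())
print(f"parameters: {total / 1e9:.2f}B")  # roughly 1.5B with these values
```

Setting num_key_value_heads below num_attention_heads is what turns standard multi-head attention into GQA: several query heads share each key/value head, shrinking the KV cache at inference time.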
Notably, the paper provides complete architectural transparency, detailing everything from the vocabulary size (151,936 tokens) to the training budget (33.6B pretraining tokens and 16B fine-tuning tokens). This open approach, combined with the release of all models, data, and code, establishes CELLM as a foundational resource for Chinese educational LLM research while setting performance baselines across 11 evaluation datasets, including C-Eval, CMMLU, and MMLU.
The work represents a significant step toward democratizing educational LLM development in non-English contexts, though it acknowledges current limitations in model scale (1.5B parameters) compared with commercial counterparts. Future directions include expanding the pretraining data and exploring alignment techniques to enhance STEM performance.
DOI:10.1007/s44366-025-0060-0