Evaluating Open-Ended High-Stakes Exams with LLMs: Alignment Between ChatGPT-4o and Human Grading in High- and Low-Resource Languages


03/06/2026 HEP Journals

Large language models (LLMs) are increasingly used for grading written responses, yet large-scale benchmarks against human expert evaluation remain scarce, especially across languages with different resource levels. This study evaluates ChatGPT-4o, used within a reranked retrieval-augmented generation (RAG) framework, on grading open-ended responses from 1,016 students in Finland’s national high-stakes matriculation examination. We examine GPT-4o’s alignment with official grades, its recognition of grading-relevant keywords, and the effect of translating responses from a low-resource language (Finnish) into a high-resource language (English). Based on descriptive statistics and correlation analyses, GPT-4o’s grades on a 0–15 scale closely matched human evaluations: 75% of scores were within ±2 points of official grades, with only 3% severe outliers. Translating responses into English improved alignment to 85%. While the model generally identified relevant keywords effectively, occasional misinterpretations of contextual usage reduced grading reliability in a few cases. Overall, the findings demonstrate both the promise and the current limitations of LLM-based assessment. LLMs have substantial potential as supplementary grading tools, particularly in high-resource languages, but they do not yet match the consistency or interpretative depth of human expert evaluators. The study underlines the need for human oversight, rigorous validation, and careful consideration of language effects when deploying LLMs in high-stakes educational assessment.
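For readers who want to see what an alignment figure such as "within ±2 points" involves, the following Python sketch computes that share, a share of severe outliers, and a Pearson correlation for paired grades. It is an illustration only, not the study's code: the function name `alignment_summary`, the sample grades, and in particular the severe-outlier cutoff of 5 points are assumptions, since the abstract does not state how severe outliers were defined.

```python
import math


def alignment_summary(model_grades, human_grades, tolerance=2, severe=5):
    """Summarise agreement between model and human grades.

    `tolerance` mirrors the ±2-point band mentioned in the abstract.
    `severe` is a hypothetical cutoff chosen for illustration; the
    abstract does not define what counts as a severe outlier.
    """
    if not model_grades or len(model_grades) != len(human_grades):
        raise ValueError("grade lists must be non-empty and of equal length")

    n = len(model_grades)
    diffs = [abs(m - h) for m, h in zip(model_grades, human_grades)]

    # Pearson correlation computed by hand to keep the sketch dependency-free.
    mean_m = sum(model_grades) / n
    mean_h = sum(human_grades) / n
    cov = sum((m - mean_m) * (h - mean_h)
              for m, h in zip(model_grades, human_grades))
    var_m = sum((m - mean_m) ** 2 for m in model_grades)
    var_h = sum((h - mean_h) ** 2 for h in human_grades)
    # Correlation is undefined when either list has zero variance.
    r = cov / math.sqrt(var_m * var_h) if var_m > 0 and var_h > 0 else float("nan")

    return {
        "within_tolerance": sum(d <= tolerance for d in diffs) / n,
        "severe_outliers": sum(d >= severe for d in diffs) / n,
        "pearson_r": r,
    }


# Made-up grades on the 0-15 scale, for demonstration only.
human = [10, 12, 7, 15, 3, 9, 11, 6]
model = [11, 12, 9, 10, 3, 6, 11, 13]
summary = alignment_summary(model, human)
```

Running the same function on grades produced from the original Finnish responses and from their English translations would allow the two alignment shares to be compared side by side.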
DOI:10.1007/s44366-026-0091-1
Attached files
  • Figure 1 Evaluation process of students’ open-ended responses with GPT-4o. RAG: retrieval-augmented generation.
Regions: Asia, China, Europe, Finland
Keywords: Humanities, Education, Applied science, Artificial Intelligence
