First-ever dataset to improve English-to-Malayalam machine translation fills critical gap for low-resource languages 

The world’s first dataset aimed at improving the quality of machine translation from English into Malayalam – a long-overlooked language spoken by more than 38 million people in India – has been developed by researchers at the University of Surrey.

Malayalam is considered a low-resource language in the world of machine translation, and until now, there has been almost no data available to evaluate the accuracy of machine-translated text from English, limiting progress for communities that rely on digital translation tools.

Funded by the European Association for Machine Translation (EAMT), the Surrey-led research, published in the ACL Anthology, focused on two key areas – Quality Estimation (QE), which predicts how good a translation is without the need for a reference text, and Automatic Post-Editing (APE), which automatically corrects errors.

The team curated 8,000 English-to-Malayalam translation segments across finance, legal and news – domains where accuracy is essential. Each segment was reviewed by professional annotators at TechLiebe, an industry partner, who provided three human quality scores and a corrected “post-edited” version of the machine-translated text.

Dr Diptesh Kanojia, Senior Lecturer at the Surrey Institute for People-Centred AI, and project co-lead, said:

“Low-resource languages like Malayalam are often left behind simply because we don’t have the datasets needed to improve machine translation. Our work provides a strong foundation for both assessing and correcting translations – supporting Malayalam speakers while also opening the door to similar resources for many other underserved languages.”

An additional layer of annotation known as ‘Weak Error Remarks’ was also introduced, allowing human annotators to quickly note and describe the types of errors they spotted, such as mistranslations, missing words or added phrases. Early findings show that when these remarks are combined with large language models, systems can better identify where a translation went wrong – a method that is already outperforming current approaches.

Postgraduate Researcher and project lead at Surrey, Archchana Sindhujan, who introduced this novel idea, said:

“Malayalam is one of India’s classical languages, spoken by millions, yet it remains severely under-resourced for reference-free machine translation evaluation. By introducing Weak Error Remarks, we offer a lightweight and interpretable form of human-annotated supervision that captures translation errors beyond scalar scores. This added context enables learning signals that help large language models reason more effectively about translation quality, demonstrating how simple, human-centric annotations can significantly strengthen MT evaluation in low-resource settings.”

The research team have completed the majority of annotations, with a public release of the dataset planned for April 2026. The methodology could serve as a blueprint for other low-resource languages, including many across India, Africa and Creole-speaking regions, where high-quality translation data is urgently needed.

[ENDS]

Archchana Sindhujan, Diptesh Kanojia, and Constantin Orăsan. 2025. Prompt-based Explainable Quality Estimation for English-Malayalam. In Proceedings of Machine Translation Summit XX: Volume 2, pages 105–106, Geneva, Switzerland. European Association for Machine Translation.
Regions: Europe, United Kingdom, Asia, India
Keywords: Applied science, Artificial Intelligence, Humanities, Linguistics

