Researchers at the University of Surrey have developed the world’s first dataset aimed at improving the quality of machine translation from English into Malayalam – a long-overlooked language spoken by more than 38 million people in India.
Malayalam is considered a low-resource language in the world of machine translation, and until now, there has been almost no data available to evaluate the accuracy of machine-translated text from English, limiting progress for communities that rely on digital translation tools.
Funded by the European Association for Machine Translation (EAMT), the Surrey-led research, published in the ACL Anthology, focused on two key areas – Quality Estimation (QE), which predicts how good a translation is without the need for a reference text, and Automatic Post-Editing (APE), which automatically corrects errors in machine-translated output.
The team curated 8,000 English-to-Malayalam translation segments across finance, legal and news – domains where accuracy is essential. Each segment was reviewed by professional annotators at TechLiebe, an industry partner, who provided three human quality scores and a corrected “post-edited” version of the machine-translated text.
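To illustrate what such a resource contains, here is a minimal sketch of how one annotated segment might be represented in code – the field names and schema below are illustrative assumptions, not the dataset’s actual format:

```python
from dataclasses import dataclass

@dataclass
class AnnotatedSegment:
    """One English-to-Malayalam segment with the annotations described above.
    Hypothetical schema: the released dataset's actual fields may differ."""
    source_en: str               # original English sentence
    mt_output_ml: str            # raw machine-translated Malayalam
    domain: str                  # "finance", "legal" or "news"
    quality_scores: list[float]  # three independent human quality scores
    post_edited_ml: str          # annotator-corrected Malayalam translation

    def mean_quality(self) -> float:
        """Average the three annotator scores into one training target."""
        return sum(self.quality_scores) / len(self.quality_scores)

segment = AnnotatedSegment(
    source_en="The interest rate was raised by 0.5 percentage points.",
    mt_output_ml="...",    # Malayalam MT output would appear here
    domain="finance",
    quality_scores=[72.0, 68.0, 75.0],
    post_edited_ml="...",  # corrected Malayalam would appear here
)
print(round(segment.mean_quality(), 2))  # 71.67 – a usable QE label
```

A record like this supports both tasks at once: the quality scores can train Quality Estimation models, while the source/MT/post-edit triples can train Automatic Post-Editing systems.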
Dr Diptesh Kanojia, Senior Lecturer at the Surrey Institute for People-Centred AI, and project co-lead, said:
“Low-resource languages like Malayalam are often left behind simply because we don’t have the datasets needed to improve machine translation. Our work provides a strong foundation for both assessing and correcting translations – supporting Malayalam speakers while also opening the door to similar resources for many other underserved languages.”
An additional layer of annotation known as ‘Weak Error Remarks’ was also introduced, allowing human annotators to quickly note and describe the types of errors they spotted, such as mistranslations, missing words or added phrases. Early findings show that when these notes are combined with large language models, systems can better interpret where a translation went wrong – a method that is already outperforming current approaches.
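As a rough sketch of how such remarks might be supplied to a large language model as extra context – the prompt wording and remark phrasing below are illustrative assumptions, not the study’s exact setup:

```python
def build_qe_prompt(source_en: str, mt_output_ml: str, remarks: list[str]) -> str:
    """Assemble a quality-estimation prompt that folds in Weak Error Remarks.
    Illustrative only: the actual prompt format used in the study may differ."""
    remark_lines = "\n".join(f"- {r}" for r in remarks) or "- none noted"
    return (
        "Rate this English-to-Malayalam translation on a 0-100 quality scale, "
        "using the annotator remarks as hints about likely errors.\n\n"
        f"Source (English): {source_en}\n"
        f"Translation (Malayalam): {mt_output_ml}\n"
        f"Annotator remarks:\n{remark_lines}\n\n"
        "Quality score:"
    )

prompt = build_qe_prompt(
    source_en="The court adjourned the hearing until Monday.",
    mt_output_ml="...",  # Malayalam MT output would appear here
    remarks=["mistranslation of 'adjourned'", "missing word: 'Monday'"],
)
# `prompt` can then be sent to any instruction-tuned LLM for scoring.
```

The appeal of this design is that the remarks are cheap to collect – a few free-text notes per segment – yet give the model concrete hints that a bare numeric score cannot.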
Archchana Sindhujan, Postgraduate Researcher and project lead at Surrey, who introduced this novel idea, said:
"Malayalam is one of India’s classical languages, spoken by millions, yet it remains severely under-resourced for reference-free machine translation evaluation. By introducing Weak Error Remarks, we offer a lightweight and interpretable form of human-annotated supervision that captures translation errors beyond scalar scores. This added context enables learning signals that help large language models reason more effectively about translation quality, demonstrating how simple, human-centric annotations can significantly strengthen MT evaluation in low-resource settings.”
The research team have completed the majority of annotations, with a public release of the dataset planned for April 2026. The methodology could serve as a blueprint for other low-resource languages, including many across India, Africa and Creole-speaking regions, where high-quality translation data is urgently needed.
[ENDS]