A Novel, Multimodal Approach to Automated Speaking Skill Assessment

Researchers integrate acoustic, turn-taking, linguistic, and visual components to enhance automated spoken English evaluation

Ishikawa, Japan -- Proficiency in spoken English, that is, the ability to communicate effectively when speaking the language, is a key determinant of both academic and professional success. Traditionally, mastery of English grammar, vocabulary, pronunciation, and communication skills has been assessed through tedious and expensive human-administered tests. However, with the advent of artificial intelligence (AI) and machine learning in recent years, automated spoken English assessment has attracted considerable interest from researchers worldwide.

While monologue-based speaking assessments are prevalent, they lack real-world relevance, particularly in settings where dialogue or group interaction is crucial. Moreover, research on automated assessment of spoken English skills in interactive settings remains limited and often focuses on a single modality, such as text or audio. Against this backdrop, a team of researchers led by Professor Shogo Okada and including Assistant Professor Candy Olivia Mawalim from the Japan Advanced Institute of Science and Technology (JAIST) has developed a multioutput learning framework that can simultaneously assess multiple aspects of spoken English proficiency. Their findings were published online in Computers and Education: Artificial Intelligence on March 20, 2025.

The researchers utilized a novel spoken English evaluation (SEE) dataset comprising synchronized audio, video, and text transcripts from open-ended, high-stakes interviews with adolescents (9-16 years old) applying to high schools and universities. The dataset was collected through Vericant's real interview service and is particularly notable for incorporating expert-assigned scores, supervised by researchers from the Educational Testing Service (ETS), across a range of speaking skill dimensions, enabling a rich, multimodal analysis of English proficiency.

Dr. Mawalim shares, “Our framework allows for the modeling and integration of different aspects of speaking proficiency, thereby improving our understanding of the various underlying factors. Also, by incorporating open-ended interview settings in our assessment framework, we can gauge an individual’s ability to engage in spontaneous and creative communication and their overall sociolinguistic competence.”

The multioutput learning framework developed by the team integrates acoustic features such as prosody, visual cues such as facial action units, and linguistic and turn-taking patterns drawn from the interview transcripts. Compared with unimodal approaches, this multimodal strategy significantly enhanced prediction performance, achieving approximately 83% accuracy for the overall SEE score using the Light Gradient Boosting Machine (LightGBM) algorithm.
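For illustration, the Python sketch below shows one way such multi-output score prediction could be set up with LightGBM through scikit-learn's multi-output wrapper. It is a minimal sketch under stated assumptions, not the authors' pipeline: the feature vectors, score ranges, and hyperparameters are hypothetical placeholders.

import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)

# Placeholder multimodal features: e.g., prosody statistics, facial action
# unit activations, turn-taking counts, and transcript-derived measures,
# concatenated into one vector per interviewee (all values here are synthetic).
X = rng.normal(size=(300, 64))

# Placeholder targets: several speaking-skill indices on a 1-5 scale,
# predicted jointly (one column per skill dimension).
y = rng.integers(1, 6, size=(300, 4))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One gradient-boosted tree classifier per skill index.
model = MultiOutputClassifier(LGBMClassifier(n_estimators=200, learning_rate=0.05))
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("per-index accuracy:", (pred == y_test).mean(axis=0))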

“The findings of our study have broad implications, offering diverse applications for stakeholders across various fields,” states Prof. Okada. “Besides providing direct actionable insights for students to improve their spoken English proficiency, our approach can help teachers to tailor their instructions to address individual student needs. Moreover, our multi-output learning framework can aid the development of more transparent and interpretable models for assessment of spoken language skills.”

The scientists also studied the importance of the utterance sequence in spoken English proficiency. An analysis using Bidirectional Encoder Representations from Transformers (BERT), a pre-trained deep learning model, revealed that the initial utterances were highly informative for predicting speaking proficiency. The team further assessed the influence of external factors, such as interviewer behavior and the interview setting, on spoken English proficiency. Their analyses showed that specific features, including interviewer speech, gender, and whether the interview was conducted in person or remotely, significantly affected the coherence of the interviewees' responses.
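As an illustration of how individual utterances can be encoded with a pre-trained BERT model for this kind of sequence analysis, the Python sketch below mean-pools token embeddings per utterance. It is a sketch under assumptions: the sample utterances, the base model, and the pooling choice are illustrative, not the study's exact procedure.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

# Hypothetical interview turns; the first utterance is of particular interest.
utterances = [
    "Hello, my name is Alex, and I am very happy to be here today.",
    "In my free time I enjoy reading and playing football with my friends.",
]

with torch.no_grad():
    for i, text in enumerate(utterances):
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        outputs = model(**inputs)
        # Mean-pool token embeddings into a single utterance representation,
        # which could then feed a downstream proficiency predictor.
        embedding = outputs.last_hidden_state.mean(dim=1)
        print(f"utterance {i}: embedding shape {tuple(embedding.shape)}")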

“With the rapid growth of AI-driven technologies and their expanding integration into our daily lives, multimodal assessments could become standard in educational settings in the near future. This can enable students to receive highly personalized feedback on their communication skills, not just language proficiency. This could lead to tailored curricula and teaching methods, helping students to hone and develop crucial soft skills like public speaking, presentation, and interpersonal communication more effectively,” says Dr. Mawalim, the lead author of the present study.

Taken together, the research offers a more nuanced and interpretable approach to automated spoken English assessment and lays the groundwork for developing intelligent, student-centered tools in educational and professional contexts.


###

Reference
Title of original paper: Beyond accuracy: Multimodal modeling of structured speaking skill indices in young adolescents
Authors: Candy Olivia Mawalim*, Chee Wee Leong, Guy Sivan, Hung-Hsuan Huang, and Shogo Okada
Journal: Computers and Education: Artificial Intelligence
DOI: 10.1016/j.caeai.2025.100386




About Japan Advanced Institute of Science and Technology, Japan
Founded in 1990 in Ishikawa prefecture, the Japan Advanced Institute of Science and Technology (JAIST) was the first independent national graduate university in Japan to have its own campus. Now, after 30 years of steady progress, JAIST has become one of Japan’s top-ranking universities. JAIST strives to foster capable leaders with a state-of-the-art education system where diversity is key; about 40% of its alumni are international students. The university has a unique style of graduate education based on a carefully designed coursework-oriented curriculum to ensure that its students have a solid foundation on which to carry out cutting-edge research. JAIST also works closely with both local and overseas communities by promoting industry–academia collaborative research.


About Professor Shogo Okada from Japan Advanced Institute of Science and Technology, Japan
Dr. Shogo Okada serves as a Professor at Japan Advanced Institute of Science and Technology (JAIST), Japan. He received his PhD from Tokyo Institute of Technology in 2008. Dr. Okada has been an active researcher in the fields of computational intelligence and systems science and has 130 publications to his credit. His research interests include multimodal interaction, machine learning, and social signal modeling. He currently heads the Social Signal and Interaction Group at JAIST, Japan.

About Dr. Candy Olivia Mawalim from Japan Advanced Institute of Science and Technology, Japan
Dr. Mawalim serves as an assistant professor at JAIST, Japan. She was selected as a JSPS Research Fellow for Young Scientists (DC1). Her research has been published in several top conferences and Q1 journals, including ACM Transactions on Multimedia Computing, Communications, and Applications; Applied Acoustics; and Computer Speech & Language. She serves as a member of the appointed team for ISCA SIG-SPSC (Security & Privacy in Speech Communication), where her responsibilities encompass the educational aspects of the group’s activities, including organizing the monthly SPSC webinar and serving on the technical committee of the SPSC Symposium (2023 and 2024).


Funding information
This work was partially supported by JSPS KAKENHI (22H00536, 23H03506).
Attached files
  • Image title: A proposed framework for simultaneously estimating multifaceted English communication skills.  Image caption: Previously developed systems for the automated assessment of speaking proficiency focus on limited assessment criteria. However, the use of a novel multimodal spoken English evaluation dataset, comprising synchronized audio, video, and text transcripts, permits a more comprehensive and interpretable assessment.  Image credit: Candy Olivia Mawalim of JAIST.
