Researchers Create Multimodal Sentiment Analysis Method that Improves Detection of Human Emotions While Reducing Computational Cost

A novel approach to multimodal sentiment analysis, called R3DG, offers improved digital detection of human emotions at reduced computational cost.

Multimodal sentiment analysis (MSA) is an information processing technique that attempts to predict human emotional states from multiple modalities such as text, audio, and video. Because aligning multiple modalities is challenging, existing methods are limited to analysis at coarse or fine granularity, which risks missing nuances in human emotional expression. Researchers have now developed an innovative approach to MSA that reduces the computational time required for sentiment prediction while offering improved performance.

Multimodal sentiment analysis (MSA) is an emerging technology that seeks to digitally automate the extraction and prediction of human sentiments from text, audio, and video. With advances in deep learning and human-computer interaction, research in MSA is receiving significant attention. However, when training MSA models or making predictions, aligning different modalities such as text, audio, and video for analysis can pose significant challenges.

There are several ways of aligning modalities in MSA. Most methods align either at the ‘coarse-grained’ level, by grouping representations over all time steps, or at the ‘fine-grained’ level, by matching modalities at each individual time step (step-by-step alignment). However, both approaches can fail to capture individual variations in emotional expression or differences in the contexts in which sentiments are expressed. To overcome this crucial limitation, researchers have now developed a framework for analyzing inputs of different modalities at multiple granularities. Their study, recently published in Research, shows that the framework, ‘Retrieve, Rank, and Reconstruction with Different Granularities (R3DG)’, outperforms existing analysis methods while reducing the computational time required for analysis.
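To see the contrast in code terms, a minimal PyTorch sketch of the two alignment styles is shown below; the tensor shapes, mean pooling, and dot-product attention are illustrative assumptions, not details taken from the paper.

    import torch
    import torch.nn.functional as F

    T_text, T_audio, d = 20, 50, 64        # illustrative sequence lengths and feature size
    text  = torch.randn(T_text, d)         # per-time-step text representations
    audio = torch.randn(T_audio, d)        # per-time-step audio representations

    # Coarse-grained alignment: collapse each modality into a single vector,
    # losing track of when in the sequence a brief cue (a nod, a pitch rise) occurred.
    coarse = torch.cat([text.mean(dim=0), audio.mean(dim=0)])    # shape: (2*d,)

    # Fine-grained alignment: attend the audio to every text time step,
    # which keeps detail but costs O(T_text * T_audio) attention weights.
    attn = F.softmax(text @ audio.T / d**0.5, dim=-1)            # (T_text, T_audio)
    fine = attn @ audio                                          # audio re-expressed per text step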

“Coarse-grained methods may miss subtle emotional cues like a ‘head nod’, ‘frown’, or ‘high pitch’, especially in long videos. On the other hand, fine-grained alignment can lead to fragmented representations, where emotional events are divided into multiple time steps, creating data redundancy. Furthermore, these methods are computationally expensive due to the need for extensive attention-based alignment,” explains Professor Fuji Ren of the University of Electronic Science and Technology of China, the lead researcher of the study.

Existing MSA approaches either average features over all time steps or align features at each individual step, achieving at most one granularity of alignment. In contrast, R3DG analyzes representations at varying granularities, thus preserving potentially critical information and capturing emotional nuances across modalities. By aligning the audio and video modalities to the text modality using representations at varying granularities, R3DG reduces computational complexity while enhancing the model’s ability to capture nuanced emotional fluctuations. Its segmentation and selection of the most relevant audio and video features, combined with reconstruction to preserve critical information, contribute to more accurate and efficient sentiment prediction.
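As a rough illustration of that segment-and-select idea, the toy sketch below chunks an audio stream into segments, ranks them against a pooled text query, and rebuilds a compact sequence from the top-scoring segments; the segment length, scoring function, and reconstruction step here are assumptions for illustration, not the published R3DG procedure.

    import torch

    def retrieve_rank_reconstruct(text, audio, seg_len=5, top_k=4):
        # Retrieve: chunk the audio into fixed-length segments (hypothetical choice).
        d = audio.shape[1]
        T = (audio.shape[0] // seg_len) * seg_len
        segments = audio[:T].reshape(-1, seg_len, d)          # (n_seg, seg_len, d)
        # Rank: score each segment against a pooled text query.
        query = text.mean(dim=0)                              # (d,)
        scores = segments.mean(dim=1) @ query                 # (n_seg,)
        top = scores.topk(min(top_k, scores.shape[0])).indices
        # Reconstruct: keep only the most relevant segments, in their original order,
        # so critical emotional content survives while the sequence length shrinks.
        return segments[top.sort().values].reshape(-1, d)

    text  = torch.randn(20, 64)
    audio = torch.randn(50, 64)
    compact_audio = retrieve_rank_reconstruct(text, audio)    # (top_k * seg_len, d)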

The researchers critically assessed the comparative performance of R3DG using five benchmark MSA datasets. R3DG outperformed existing methods across these datasets while substantially reducing computational time. The findings suggest that R3DG may be among the most efficient MSA methods available.

“Experimental results demonstrate that R3DG achieves state-of-the-art performance in multiple multimodal tasks, including sentiment analysis, emotion recognition, and humor detection, outperforming existing methods. Ablation studies further confirm R3DG’s superiority, highlighting its robust performance despite the reduced computational cost,” says Dr. Jiawen Deng, the co-corresponding author, summarizing the main findings of the study.

R3DG achieves modality alignment in just two steps—first between video and audio modalities, and then between their fused representation and text. This streamlined approach significantly reduces computational cost compared to most existing models. With its enhanced efficiency, R3DG demonstrates strong potential to drive the next generation of MSA.
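A minimal sketch of that two-step layout, assuming one standard cross-attention layer per step (the layer sizes and the use of nn.MultiheadAttention are illustrative choices, not the authors’ architecture):

    import torch
    import torch.nn as nn

    d = 64
    cross_av = nn.MultiheadAttention(d, num_heads=4, batch_first=True)  # step 1
    cross_t  = nn.MultiheadAttention(d, num_heads=4, batch_first=True)  # step 2

    video = torch.randn(1, 40, d)   # (batch, time, features)
    audio = torch.randn(1, 50, d)
    text  = torch.randn(1, 20, d)

    # Step 1: align video with audio and fuse them into one non-verbal stream.
    fused_av, _ = cross_av(query=video, key=audio, value=audio)

    # Step 2: align the fused stream with text. Two alignment passes in total,
    # rather than pairwise attention between every modality at every time step.
    aligned, _ = cross_t(query=text, key=fused_av, value=fused_av)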

“Looking ahead, future work will focus on automating the selection of modality importance and granularity, further enhancing R3DG’s adaptability to diverse real-world applications,” states Professor Ren, anticipating exciting future improvements to the MSA approach.

The complete study is accessible via DOI: 10.34133/research.0729

Title: R3DG: Retrieve, Rank, and Reconstruction with Different Granularities for Multimodal Sentiment Analysis
Authors: Yan Zhuang, Yanru Zhang, Jiawen Deng, and Fuji Ren
Journal: Research, 2 Jul 2025, Vol 8, Article ID: 0729
DOI: 10.34133/research.0729
Attachments
  • Researchers have developed a novel MSA method that improves sentiment detection while reducing computational cost
Regions: Asia, China
Keywords: Applied science, Artificial Intelligence, Computing, Technology

