Researchers from Nazarbayev University and Nanyang Technological University present a new adaptive transformer for multimodal EEG, audio, and vision fusion
The study was authored by Sabina Bralina, Adnan Yazici, Cuntai Guan, and Min-Ho Lee; public author profiles link the team to Nazarbayev University and Nanyang Technological University.
A new study introduces the Adaptive Multimodal Bottleneck Transformer (AMBT), a framework designed to combine EEG, audio, and vision signals for more accurate emotion recognition. The model enables efficient cross-modal interaction through lightweight adaptive layers and bottleneck tokens, preserving strong unimodal representations while reducing training cost. Across benchmark datasets, AMBT achieved strong results, including 85.1% accuracy on five-class emotion classification on the EAV dataset.
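To make the bottleneck-token idea concrete, below is a minimal PyTorch sketch of this style of fusion: each modality stream attends jointly over its own tokens and a small set of shared bottleneck tokens, and the bottlenecks are then merged across modalities so they act as the only channel for cross-modal exchange. The class names, layer counts, dimensions, and the averaging step are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class BottleneckFusionLayer(nn.Module):
    """One fusion step: each modality attends over its own tokens plus the
    shared bottleneck tokens, then the bottleneck updates are averaged across
    modalities so information flows between streams only through them."""

    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        # one transformer block per modality stream: EEG, audio, vision
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, n_heads, dim * 4, batch_first=True)
             for _ in range(3)]
        )

    def forward(self, streams, bottleneck):
        # streams: list of (B, T_m, dim) tensors; bottleneck: (B, K, dim)
        new_streams, new_bottlenecks = [], []
        for x, block in zip(streams, self.blocks):
            z = block(torch.cat([x, bottleneck], dim=1))  # joint attention
            new_streams.append(z[:, : x.size(1)])         # updated modality tokens
            new_bottlenecks.append(z[:, x.size(1):])      # updated bottleneck tokens
        # fuse by averaging the per-modality bottleneck updates (an assumption)
        return new_streams, torch.stack(new_bottlenecks).mean(dim=0)


class BottleneckFusionSketch(nn.Module):
    """Hypothetical end-to-end sketch: learned bottleneck tokens, a stack of
    fusion layers, and a linear head for five-class emotion classification."""

    def __init__(self, dim=64, n_bottleneck=4, n_layers=2, n_classes=5):
        super().__init__()
        self.bottleneck = nn.Parameter(torch.randn(1, n_bottleneck, dim) * 0.02)
        self.layers = nn.ModuleList(
            [BottleneckFusionLayer(dim) for _ in range(n_layers)]
        )
        self.head = nn.Linear(dim, n_classes)

    def forward(self, eeg, audio, vision):
        streams = [eeg, audio, vision]  # each already embedded to (B, T_m, dim)
        bottleneck = self.bottleneck.expand(eeg.size(0), -1, -1)
        for layer in self.layers:
            streams, bottleneck = layer(streams, bottleneck)
        return self.head(bottleneck.mean(dim=1))  # class logits


if __name__ == "__main__":
    model = BottleneckFusionSketch()
    logits = model(torch.randn(2, 32, 64), torch.randn(2, 50, 64), torch.randn(2, 16, 64))
    print(logits.shape)  # torch.Size([2, 5])
```

Because the three streams never attend to each other directly, the bottleneck tokens keep cross-modal traffic narrow, which is what allows the unimodal representations to stay largely intact.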
Highlights
- Introduces a multimodal fusion framework for EEG, audio, and vision signals.
- Proposes Adaptive Multimodal Bottleneck Transformer for efficient fusion.
- Reduces training cost by updating only lightweight adaptive layers (see the training sketch after this list).
- Achieves 85.1% accuracy on five-class emotion classification (vs. 68% unimodal).
- Shows that integrating EEG with audio-vision fusion yields richer affective cues.
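The cost saving comes from an adapter-style training scheme: the large pretrained unimodal encoders stay frozen and only small inserted layers receive gradient updates. The sketch below, with the hypothetical `Adapter` module and `freeze_backbone_except_adapters` helper, illustrates that pattern under the assumption of standard residual bottleneck adapters; the paper's actual adaptive layers may differ.

```python
import torch.nn as nn


class Adapter(nn.Module):
    """Small residual bottleneck MLP inserted after a frozen transformer block;
    only these parameters are trained during multimodal fine-tuning."""

    def __init__(self, dim: int, reduction: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)
        self.up = nn.Linear(dim // reduction, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual adapter update


def freeze_backbone_except_adapters(model: nn.Module) -> None:
    """Freeze every parameter, then re-enable gradients only inside Adapter
    modules, so the optimizer updates a small fraction of the network."""
    for p in model.parameters():
        p.requires_grad = False
    for module in model.modules():
        if isinstance(module, Adapter):
            for p in module.parameters():
                p.requires_grad = True


if __name__ == "__main__":
    # stand-in for a pretrained unimodal encoder with adapters interleaved
    encoder = nn.Sequential(
        nn.TransformerEncoderLayer(64, 4, 256, batch_first=True), Adapter(64),
        nn.TransformerEncoderLayer(64, 4, 256, batch_first=True), Adapter(64),
    )
    freeze_backbone_except_adapters(encoder)
    trainable = sum(p.numel() for p in encoder.parameters() if p.requires_grad)
    total = sum(p.numel() for p in encoder.parameters())
    print(f"trainable parameters: {trainable}/{total}")
```

In this setup the optimizer only ever sees the adapter weights, which is why fusing a new modality does not require retraining the full model.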
This work advances multimodal affective computing by showing that neural, vocal, and visual signals can be fused more effectively without the heavy computational cost of full-model retraining. The approach has promising implications for healthcare, mental health monitoring, human-computer interaction, social robotics, and adaptive learning systems, where understanding emotion from multiple complementary signals is essential.