AI model predicts human attention in 360-degree videos using both sound and vision

22/04/2026 Koç University

360-degree videos and virtual reality (VR) experiences are transforming viewers from passive observers into active participants immersed within a scene. Yet this shift raises an important question: where do people direct their attention in such environments, and what shapes that attention?

A new study led by Assoc. Prof. Dr. Aykut Erdem from Koç University’s Department of Computer Engineering, published in IEEE Transactions on Pattern Analysis and Machine Intelligence, offers an innovative answer. The study was carried out in collaboration with researchers from the Vision Laboratory at Boğaziçi University’s Department of Psychology, Hacettepe University, and the National Institute of Advanced Industrial Science and Technology (AIST) in Japan. The most distinctive aspect of the research is its ability to predict viewer attention by jointly analyzing both visual and auditory information, rather than relying on visual cues alone.

In conventional videos, the viewer’s gaze is largely guided by the camera’s framing. In contrast, 360-degree videos present the entire scene, allowing viewers to look in any direction at any moment. This makes it significantly more challenging to determine where attention is directed.

At this point, sound becomes a key factor. As in everyday life, when we hear a sound, we instinctively turn our attention toward its source. However, many previous studies have addressed this phenomenon only to a limited extent, focusing primarily on visual data.

To address this gap, the research team developed a comprehensive dataset to examine how visual and auditory cues interact. The dataset includes 81 videos featuring diverse scenes, presented under different audio conditions: silent, mono, and spatial sound – a technology that creates the sensation of sound arriving from a specific direction, just as in real life. By tracking the eye movements of more than 100 participants, the researchers were able to analyze in detail how attention shifts under varying auditory conditions.
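In eye-tracking studies like this one, recorded gaze points are commonly turned into ground-truth attention ("saliency") maps by placing a Gaussian blob at each fixation and normalizing the result. The sketch below illustrates that standard technique only; the function name, map resolution, and Gaussian width are illustrative choices, not details from the paper:

```python
import numpy as np

def fixations_to_saliency(fixations, height=64, width=128, sigma=3.0):
    """Build a ground-truth saliency map from (row, col) fixation points
    by accumulating an isotropic Gaussian at each fixation, then
    normalizing so the map sums to 1 (a probability distribution)."""
    sal = np.zeros((height, width), dtype=np.float64)
    ys, xs = np.mgrid[0:height, 0:width]
    for fy, fx in fixations:
        sal += np.exp(-((ys - fy) ** 2 + (xs - fx) ** 2) / (2 * sigma ** 2))
    total = sal.sum()
    return sal / total if total > 0 else sal

# Two viewers fixating near the same spot produce a single hotspot.
smap = fixations_to_saliency([(32, 60), (33, 62)])
```

Aggregating maps like this per audio condition (silent, mono, spatial) is one way such a dataset can reveal how sound shifts the distribution of attention.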

The study also introduces two AI models tailored to the unique structure of 360-degree video data. The first model relies solely on visual information, while the second integrates audio into the analysis, enabling a more comprehensive understanding of attention. As a result, the model can capture not only visually salient elements but also areas that attract attention due to sound.
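A common, generic way to combine the two modalities is late fusion: predict a visual attention map and an audio-driven attention map separately, then merge them. The toy sketch below shows that general idea only and is not the architecture proposed in the paper; the function and the mixing weight `alpha` are hypothetical:

```python
import numpy as np

def fuse_saliency(visual_map, audio_map, alpha=0.5):
    """Late-fuse visual and audio attention maps as a convex
    combination, then renormalize so the result sums to 1."""
    assert visual_map.shape == audio_map.shape
    fused = alpha * visual_map + (1 - alpha) * audio_map
    return fused / fused.sum()

# A region that is visually unremarkable but emits sound gains saliency
# once the audio map is mixed in.
vis = np.full((4, 8), 1 / 32)             # uniform visual prior
aud = np.zeros((4, 8)); aud[1, 6] = 1.0   # sound source at one location
fused = fuse_saliency(vis, aud)
```

In the fused map, the cell containing the sound source stands out even though the visual map alone treated every region equally.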

The findings are striking: incorporating audio significantly improves the model’s ability to predict viewer attention. In particular, when spatial sound is included, the model can accurately identify not only visually prominent regions but also areas that may appear less salient visually yet draw attention because of sound.
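Saliency models are conventionally scored by comparing the predicted map against the ground-truth map from eye tracking; one standard metric is the linear correlation coefficient (CC). The sketch below shows that widely used metric in general form; it is not necessarily the exact evaluation protocol of the paper:

```python
import numpy as np

def saliency_cc(pred, gt):
    """Pearson correlation (CC) between a predicted and a ground-truth
    saliency map, computed over all pixels. 1.0 = perfect agreement,
    0.0 = no linear relationship."""
    p = (pred - pred.mean()) / (pred.std() + 1e-12)
    g = (gt - gt.mean()) / (gt.std() + 1e-12)
    return float((p * g).mean())
```

Under such a metric, a reported improvement from adding audio means the audio-visual model's predicted maps correlate more strongly with where viewers actually looked.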

Overall, the research demonstrates that it is possible to model human attention more accurately by considering how people distribute their focus across both visual and auditory stimuli. Beyond its scientific contribution, this approach has strong potential to enhance a wide range of applications—from video compression and content creation to quality assessment and user experience design in immersive environments.

Spherical Vision Transformers for Audio-Visual Saliency Prediction in 360° Videos
Mert Cokelek; Halit Ozsoy; Nevrez Imamoglu; Cagri Ozcinar; Inci Ayhan; Erkut Erdem
Published in: IEEE Transactions on Pattern Analysis and Machine Intelligence (Volume: 48, Issue: 1, January 2026)
Page(s): 329 - 345
Date of Publication: 29 August 2025
PubMed ID: 40880339
DOI: 10.1109/TPAMI.2025.3604091
Attached files
  • Aykut Erdem, Associate Professor of Computer Engineering at Koç University
  • Audio-visual saliency in 360° videos. Illustration of how spatial audio cues influence visual attention in omnidirectional videos. In this example, spatial audio highlights salient regions by directing viewer attention towards audio-emitting objects such as a passing car and birds singing in the trees, emphasizing the necessity of integrating audio modalities into saliency prediction models.
Regions: Europe, Turkey, Asia, Japan
Keywords: Applied science, Artificial Intelligence, Computing, Engineering, Technology

