360-degree videos and virtual reality (VR) experiences are transforming viewers from passive observers into active participants immersed within a scene. Yet this shift raises an important question: where do people direct their attention in such environments, and what shapes that attention?
A new study led by Assoc. Prof. Dr. Aykut Erdem from Koç University’s Department of Computer Engineering, published in IEEE Transactions on Pattern Analysis and Machine Intelligence, offers an innovative answer. The study was carried out in collaboration with researchers from the Vision Laboratory at Boğaziçi University’s Department of Psychology, Hacettepe University, and the National Institute of Advanced Industrial Science and Technology (AIST) in Japan. The most distinctive aspect of the research is its ability to predict viewer attention by jointly analyzing both visual and auditory information, rather than relying on visual cues alone.
In conventional videos, the viewer’s gaze is largely guided by the camera’s framing. In contrast, 360-degree videos present the entire scene, allowing viewers to look in any direction at any moment. This makes it significantly more challenging to determine where attention is directed.
This is where sound becomes a key factor. As in everyday life, when we hear a sound we instinctively turn our attention toward its source. Yet previous studies have addressed this phenomenon only to a limited extent, focusing primarily on visual data.
To address this gap, the research team built a comprehensive dataset for examining how visual and auditory cues interact. It contains 81 videos spanning diverse scenes, each presented under three audio conditions: silent, mono, and spatial sound – a technology that creates the sensation of sound arriving from a specific direction, just as in real life. By tracking the eye movements of more than 100 participants, the researchers could analyze in detail how attention shifts across these conditions.
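In eye-tracking studies of this kind, the recorded gaze points are commonly aggregated into "saliency maps" that a model can learn to predict. The sketch below illustrates that common practice only; it is not taken from the study's own pipeline, and the function name and parameters are illustrative assumptions.

```python
# A minimal sketch of turning fixation points into a ground-truth saliency map,
# following common practice in saliency research (not the study's own code).
import numpy as np
from scipy.ndimage import gaussian_filter

def fixations_to_saliency_map(fixations, height, width, sigma=20.0):
    """Accumulate (x, y) fixation points and blur them into a saliency map.

    Note: this simple version ignores the distortion of equirectangular
    360-degree frames, which stretches regions near the poles.
    """
    fixation_map = np.zeros((height, width), dtype=np.float64)
    for x, y in fixations:
        if 0 <= int(y) < height and 0 <= int(x) < width:
            fixation_map[int(y), int(x)] += 1.0
    saliency = gaussian_filter(fixation_map, sigma=sigma)
    if saliency.max() > 0:
        saliency /= saliency.max()  # normalize to [0, 1]
    return saliency
```

Building such maps separately for the silent, mono, and spatial-sound conditions makes it possible to see directly how sound shifts where viewers look.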
The study also introduces two AI models tailored to the distinctive structure of 360-degree video. The first relies solely on visual information, while the second integrates audio into the analysis, giving it a more complete picture of what guides attention.
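The article does not detail the architectures, but the core idea behind the second model can be sketched as follows: visual features extracted from each frame are fused with an audio embedding, so sound can influence the predicted saliency at every spatial location. Everything in this PyTorch sketch, from layer sizes to names, is an illustrative assumption rather than the authors' design.

```python
# A minimal conceptual sketch of an audio-visual saliency predictor.
# This is NOT the authors' architecture; all choices below are illustrative.
import torch
import torch.nn as nn

class AudioVisualSaliencyNet(nn.Module):
    def __init__(self, audio_dim=128):
        super().__init__()
        # Visual branch: a small conv encoder over an equirectangular frame.
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Audio branch: embeds a precomputed audio feature vector
        # (e.g., a spectrogram summary of the sound channels).
        self.audio_encoder = nn.Sequential(
            nn.Linear(audio_dim, 64), nn.ReLU(),
        )
        # Decoder: fused features back up to a single-channel saliency map.
        self.decoder = nn.Sequential(
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Sigmoid(),
        )

    def forward(self, frame, audio_feat):
        v = self.visual_encoder(frame)      # (B, 64, H/4, W/4)
        a = self.audio_encoder(audio_feat)  # (B, 64)
        # Broadcast the audio embedding over spatial positions and concatenate,
        # so sound can modulate every location of the visual feature map.
        a = a[:, :, None, None].expand(-1, -1, v.shape[2], v.shape[3])
        return self.decoder(torch.cat([v, a], dim=1))  # (B, 1, H, W)
```

A vision-only counterpart of this sketch would simply drop the audio branch and decode from the visual features alone.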
The findings are striking: incorporating audio significantly improves the model's ability to predict viewer attention. In particular, with spatial sound the model accurately identifies not only visually prominent regions but also areas that appear less salient visually yet draw attention because of sound.
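The article does not name the study's evaluation measures, but predictions like these are conventionally scored against human fixation data with metrics such as NSS and CC, sketched below purely for illustration.

```python
# Two standard saliency-evaluation metrics, shown as a hypothetical example
# of how such predictions are commonly scored (not the study's stated method).
import numpy as np

def normalized_scanpath_saliency(pred, fixation_points):
    """NSS: mean of the z-scored prediction at human fixation locations."""
    z = (pred - pred.mean()) / (pred.std() + 1e-8)
    return float(np.mean([z[y, x] for x, y in fixation_points]))

def correlation_coefficient(pred, gt):
    """CC: Pearson correlation between predicted and ground-truth maps."""
    p = (pred - pred.mean()) / (pred.std() + 1e-8)
    g = (gt - gt.mean()) / (gt.std() + 1e-8)
    return float(np.mean(p * g))
```

Under scores like these, an audio-visual model outperforming its vision-only counterpart would reflect exactly the kind of improvement the study reports.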
Overall, the research demonstrates that it is possible to model human attention more accurately by considering how people distribute their focus across both visual and auditory stimuli. Beyond its scientific contribution, this approach has strong potential to enhance a wide range of applications—from video compression and content creation to quality assessment and user experience design in immersive environments.