AI model predicts human attention in 360-degree videos using both sound and vision

22/04/2026 Koç University

360-degree videos and virtual reality (VR) experiences are transforming viewers from passive observers into active participants immersed within a scene. Yet this shift raises an important question: where do people direct their attention in such environments, and what shapes that attention?

A new study led by Assoc. Prof. Dr. Aykut Erdem from Koç University’s Department of Computer Engineering, published in IEEE Transactions on Pattern Analysis and Machine Intelligence, offers an innovative answer. The study was carried out in collaboration with researchers from the Vision Laboratory at Boğaziçi University’s Department of Psychology, Hacettepe University, and the National Institute of Advanced Industrial Science and Technology (AIST) in Japan. The most distinctive aspect of the research is its ability to predict viewer attention by jointly analyzing both visual and auditory information, rather than relying on visual cues alone.

In conventional videos, the viewer’s gaze is largely guided by the camera’s framing. In contrast, 360-degree videos present the entire scene, allowing viewers to look in any direction at any moment. This makes it significantly more challenging to determine where attention is directed.

At this point, sound becomes a key factor. As in everyday life, when we hear a sound, we instinctively turn our attention toward its source. However, many previous studies have addressed this phenomenon only to a limited extent, focusing primarily on visual data.

To address this gap, the research team developed a comprehensive dataset to examine how visual and auditory cues interact. The dataset includes 81 videos featuring diverse scenes, presented under different audio conditions: silent, mono, and spatial sound – a technology that creates the sensation of sound arriving from a specific direction, just as in real life. By tracking the eye movements of more than 100 participants, the researchers were able to analyze in detail how attention shifts under varying auditory conditions.
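In eye-tracking studies like this one, recorded gaze points are commonly turned into ground-truth attention ("saliency") maps by placing a Gaussian blob at each fixation and normalizing the result. The sketch below illustrates that standard technique only; the function name, map resolution, and Gaussian width are illustrative choices, not details from the paper:

```python
import numpy as np

def fixations_to_saliency(fixations, height=64, width=128, sigma=3.0):
    """Build a ground-truth saliency map from (row, col) fixation points
    by accumulating an isotropic Gaussian at each fixation, then
    normalizing so the map sums to 1 (a probability distribution)."""
    sal = np.zeros((height, width), dtype=np.float64)
    ys, xs = np.mgrid[0:height, 0:width]
    for fy, fx in fixations:
        sal += np.exp(-((ys - fy) ** 2 + (xs - fx) ** 2) / (2 * sigma ** 2))
    total = sal.sum()
    return sal / total if total > 0 else sal

# Two viewers fixating near the same spot produce a single hotspot.
smap = fixations_to_saliency([(32, 60), (33, 62)])
```

Aggregating maps like this per audio condition (silent, mono, spatial) is one way such a dataset can reveal how sound shifts the distribution of attention.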

The study also introduces two AI models tailored to the unique structure of 360-degree video data. The first model relies solely on visual information, while the second integrates audio into the analysis, enabling a more comprehensive understanding of attention. As a result, the model can capture not only visually salient elements but also areas that attract attention due to sound.
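A common, generic way to combine the two modalities is late fusion: predict a visual attention map and an audio-driven attention map separately, then merge them. The toy sketch below shows that general idea only and is not the architecture proposed in the paper; the function and the mixing weight `alpha` are hypothetical:

```python
import numpy as np

def fuse_saliency(visual_map, audio_map, alpha=0.5):
    """Late-fuse visual and audio attention maps as a convex
    combination, then renormalize so the result sums to 1."""
    assert visual_map.shape == audio_map.shape
    fused = alpha * visual_map + (1 - alpha) * audio_map
    return fused / fused.sum()

# A region that is visually unremarkable but emits sound gains saliency
# once the audio map is mixed in.
vis = np.full((4, 8), 1 / 32)             # uniform visual prior
aud = np.zeros((4, 8)); aud[1, 6] = 1.0   # sound source at one location
fused = fuse_saliency(vis, aud)
```

In the fused map, the cell containing the sound source stands out even though the visual map alone treated every region equally.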

The findings are striking: incorporating audio significantly improves the model’s ability to predict viewer attention. In particular, when spatial sound is included, the model can accurately identify not only visually prominent regions but also areas that may appear less salient visually yet draw attention because of sound.
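Saliency models are conventionally scored by comparing the predicted map against the ground-truth map from eye tracking; one standard metric is the linear correlation coefficient (CC). The sketch below shows that widely used metric in general form; it is not necessarily the exact evaluation protocol of the paper:

```python
import numpy as np

def saliency_cc(pred, gt):
    """Pearson correlation (CC) between a predicted and a ground-truth
    saliency map, computed over all pixels. 1.0 = perfect agreement,
    0.0 = no linear relationship."""
    p = (pred - pred.mean()) / (pred.std() + 1e-12)
    g = (gt - gt.mean()) / (gt.std() + 1e-12)
    return float((p * g).mean())
```

Under such a metric, a reported improvement from adding audio means the audio-visual model's predicted maps correlate more strongly with where viewers actually looked.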

Overall, the research demonstrates that it is possible to model human attention more accurately by considering how people distribute their focus across both visual and auditory stimuli. Beyond its scientific contribution, this approach has strong potential to enhance a wide range of applications—from video compression and content creation to quality assessment and user experience design in immersive environments.

Spherical Vision Transformers for Audio-Visual Saliency Prediction in 360° Videos
Mert Cokelek; Halit Ozsoy; Nevrez Imamoglu; Cagri Ozcinar; Inci Ayhan; Erkut Erdem
Published in: IEEE Transactions on Pattern Analysis and Machine Intelligence (Volume: 48, Issue: 1, January 2026)
Page(s): 329 - 345
Date of Publication: 29 August 2025
PubMed ID: 40880339
DOI: 10.1109/TPAMI.2025.3604091
Attached files
  • Aykut Erdem, Associate Professor of Computer Engineering at Koç University
  • Audio-visual saliency in 360° videos. Illustration of how spatial audio cues influence visual attention in omnidirectional videos. In this example, spatial audio highlights salient regions by directing viewer attention towards audio-emitting objects such as a passing car and birds singing in the trees, emphasizing the necessity of integrating audio modalities into saliency prediction models.
Regions: Europe, Turkey, Asia, Japan
Keywords: Applied science, Artificial Intelligence, Computing, Engineering, Technology

