360-degree videos and virtual reality (VR) experiences are transforming viewers from passive observers into active participants immersed within a scene. Yet this shift raises an important question: where do people direct their attention in such environments, and what shapes that attention?
A new study led by Assoc. Prof. Dr. Aykut Erdem from Koç University’s Department of Computer Engineering, published in IEEE Transactions on Pattern Analysis and Machine Intelligence, offers an innovative answer. The study was carried out in collaboration with researchers from the Vision Laboratory at Boğaziçi University’s Department of Psychology, Hacettepe University, and the National Institute of Advanced Industrial Science and Technology (AIST) in Japan. The most distinctive aspect of the research is its ability to predict viewer attention by jointly analyzing both visual and auditory information, rather than relying on visual cues alone.
In conventional videos, the viewer’s gaze is largely guided by the camera’s framing. In contrast, 360-degree videos present the entire scene, allowing viewers to look in any direction at any moment. This makes it significantly more challenging to determine where attention is directed.
This is where sound becomes a key factor. As in everyday life, when we hear a sound we instinctively turn our attention toward its source. Yet previous studies have addressed this phenomenon only to a limited extent, focusing primarily on visual data.
To address this gap, the research team built a comprehensive dataset for examining how visual and auditory cues interact. It contains 81 videos spanning diverse scenes, each presented under three audio conditions: silent, mono, and spatial sound – a technology that creates the sensation of sound arriving from a specific direction, just as in real life. By tracking the eye movements of more than 100 participants, the researchers could analyze in detail how attention shifts across these conditions.
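In eye-tracking studies of this kind, the recorded gaze points are commonly aggregated into "saliency maps" that a model can learn to predict. The sketch below illustrates that common practice only; it is not taken from the study's own pipeline, and the function name and parameters are illustrative assumptions.

```python
# A minimal sketch of turning fixation points into a ground-truth saliency map,
# following common practice in saliency research (not the study's own code).
import numpy as np
from scipy.ndimage import gaussian_filter

def fixations_to_saliency_map(fixations, height, width, sigma=20.0):
    """Accumulate (x, y) fixation points and blur them into a saliency map.

    Note: this simple version ignores the distortion of equirectangular
    360-degree frames, which stretches regions near the poles.
    """
    fixation_map = np.zeros((height, width), dtype=np.float64)
    for x, y in fixations:
        if 0 <= int(y) < height and 0 <= int(x) < width:
            fixation_map[int(y), int(x)] += 1.0
    saliency = gaussian_filter(fixation_map, sigma=sigma)
    if saliency.max() > 0:
        saliency /= saliency.max()  # normalize to [0, 1]
    return saliency
```

Building such maps separately for the silent, mono, and spatial-sound conditions makes it possible to see directly how sound shifts where viewers look.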
The study also introduces two AI models tailored to the distinctive structure of 360-degree video. The first relies solely on visual information, while the second integrates audio into the analysis, giving it a more complete picture of what guides attention.
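The article does not detail the architectures, but the core idea behind the second model can be sketched as follows: visual features extracted from each frame are fused with an audio embedding, so sound can influence the predicted saliency at every spatial location. Everything in this PyTorch sketch, from layer sizes to names, is an illustrative assumption rather than the authors' design.

```python
# A minimal conceptual sketch of an audio-visual saliency predictor.
# This is NOT the authors' architecture; all choices below are illustrative.
import torch
import torch.nn as nn

class AudioVisualSaliencyNet(nn.Module):
    def __init__(self, audio_dim=128):
        super().__init__()
        # Visual branch: a small conv encoder over an equirectangular frame.
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Audio branch: embeds a precomputed audio feature vector
        # (e.g., a spectrogram summary of the sound channels).
        self.audio_encoder = nn.Sequential(
            nn.Linear(audio_dim, 64), nn.ReLU(),
        )
        # Decoder: fused features back up to a single-channel saliency map.
        self.decoder = nn.Sequential(
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Sigmoid(),
        )

    def forward(self, frame, audio_feat):
        v = self.visual_encoder(frame)      # (B, 64, H/4, W/4)
        a = self.audio_encoder(audio_feat)  # (B, 64)
        # Broadcast the audio embedding over spatial positions and concatenate,
        # so sound can modulate every location of the visual feature map.
        a = a[:, :, None, None].expand(-1, -1, v.shape[2], v.shape[3])
        return self.decoder(torch.cat([v, a], dim=1))  # (B, 1, H, W)
```

A vision-only counterpart of this sketch would simply drop the audio branch and decode from the visual features alone.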
The findings are striking: incorporating audio significantly improves the model's ability to predict viewer attention. In particular, with spatial sound the model accurately identifies not only visually prominent regions but also areas that appear less salient visually yet draw attention because of sound.
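The article does not name the study's evaluation measures, but predictions like these are conventionally scored against human fixation data with metrics such as NSS and CC, sketched below purely for illustration.

```python
# Two standard saliency-evaluation metrics, shown as a hypothetical example
# of how such predictions are commonly scored (not the study's stated method).
import numpy as np

def normalized_scanpath_saliency(pred, fixation_points):
    """NSS: mean of the z-scored prediction at human fixation locations."""
    z = (pred - pred.mean()) / (pred.std() + 1e-8)
    return float(np.mean([z[y, x] for x, y in fixation_points]))

def correlation_coefficient(pred, gt):
    """CC: Pearson correlation between predicted and ground-truth maps."""
    p = (pred - pred.mean()) / (pred.std() + 1e-8)
    g = (gt - gt.mean()) / (gt.std() + 1e-8)
    return float(np.mean(p * g))
```

Under scores like these, an audio-visual model outperforming its vision-only counterpart would reflect exactly the kind of improvement the study reports.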
Overall, the research demonstrates that it is possible to model human attention more accurately by considering how people distribute their focus across both visual and auditory stimuli. Beyond its scientific contribution, this approach has strong potential to enhance a wide range of applications—from video compression and content creation to quality assessment and user experience design in immersive environments.