A research team led by The University of Osaka has demonstrated that vision transformers using self-attention mechanisms can spontaneously develop visual attention patterns similar to those of humans, without specific training.
Osaka, Japan – Can machines ever see the world as we see it? Researchers have uncovered compelling evidence that vision transformers (ViTs), a type of deep-learning model specialized for image analysis, can spontaneously develop human-like visual attention patterns when trained without labeled data.
Visual attention is the mechanism by which organisms, or artificial intelligence (AI), filter out ‘visual noise’ to focus on the most relevant parts of an image or scene. While this ability comes naturally to humans, acquiring it spontaneously has proven difficult for AI. However, in a recent publication in Neural Networks, researchers have shown that, given the right training experience, AI can acquire human-like visual attention without being explicitly taught to do so.
The research team, from The University of Osaka, compared human eye-tracking data to attention patterns generated by ViTs trained using DINO (‘self-distillation with no labels’), a method of self-supervised learning that allows models to organize visual information without annotated datasets. Remarkably, the DINO-trained ViTs exhibited gaze behavior that closely mirrored that of typically developing adults when viewing dynamic video clips. In contrast, ViTs trained with conventional supervised learning showed unnatural visual attention.
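As a rough illustration of the kind of attention maps involved (a minimal sketch, not the authors' analysis code; the checkpoint name, frame size, and preprocessing below are assumptions), per-head attention from a publicly released DINO-trained ViT can be extracted in PyTorch roughly as follows:

# Illustrative sketch: extract per-head [CLS] attention maps from a
# publicly released DINO-trained ViT. Checkpoint and preprocessing are assumptions.
import torch
from PIL import Image
from torchvision import transforms

model = torch.hub.load('facebookresearch/dino:main', 'dino_vits8')  # ViT-S/8 trained with DINO
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((480, 480)),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

img = preprocess(Image.open('frame.jpg').convert('RGB')).unsqueeze(0)  # one video frame

with torch.no_grad():
    # Attention of the last block: shape (1, num_heads, tokens, tokens)
    attn = model.get_last_selfattention(img)

num_heads = attn.shape[1]
patch = 8                        # patch size of ViT-S/8
h_feat = img.shape[-2] // patch  # patches per column
w_feat = img.shape[-1] // patch  # patches per row

# How strongly each head's [CLS] token attends to every image patch,
# reshaped into 2-D maps that can be compared with human gaze heatmaps.
cls_attn = attn[0, :, 0, 1:].reshape(num_heads, h_feat, w_feat)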
“Our models didn’t just attend to visual scenes randomly; they spontaneously developed specialized functions,” says Takuto Yamamoto, lead author of the study. “One subset of the model consistently focused on faces, another captured the outlines of entire figures, and a third attended primarily to background features. This closely reflects how human visual systems segment and interpret scenes.”
Through detailed analyses, the team demonstrated that these attention clusters emerged naturally in the DINO-trained ViTs. The resulting attention patterns were not only qualitatively similar to human gaze, but also quantitatively aligned with established eye-tracking data, particularly in scenes involving human figures. The findings suggest extending the traditional two-part figure–ground model of perception in psychology into a three-part model in which faces form a category of their own alongside figure and ground.
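To give a sense of what such a quantitative alignment could look like (a hypothetical sketch; the study's actual metrics may differ), one simple option is to correlate a model attention map with a gaze-density map built from eye-tracking fixations on the same frame:

# Hypothetical comparison metric, not necessarily the one used in the study.
import numpy as np
from scipy.stats import pearsonr

def attention_gaze_correlation(attn_map: np.ndarray, gaze_map: np.ndarray) -> float:
    # Both maps are assumed to have been resampled to the same 2-D resolution
    # (e.g., the ViT's patch grid) before comparison.
    r, _ = pearsonr(attn_map.ravel(), gaze_map.ravel())
    return float(r)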
“What makes this result remarkable is that these models were never told what a face is,” explains senior author Shigeru Kitazawa. “Yet they learned to prioritize faces, probably because doing so maximized the information gained from their environment. It is a compelling demonstration that self-supervised learning may capture something fundamental about how intelligent systems, including humans, learn from the world.”
The study underscores the potential of self-supervised learning not only for advancing AI applications, but also for modeling aspects of biological vision. By aligning artificial systems more closely with human perception, self-supervised ViTs offer a new lens for interpreting both machine learning and human cognition. The findings could be applied in a variety of ways, such as developing human-friendly robots or enhancing support during early childhood development.
###
The article “Emergence of Human-Like Attention and Distinct Head Clusters in Self-Supervised Vision Transformers: A Comparative Eye-Tracking Study” has been published in Neural Networks at DOI: https://doi.org/10.1016/j.neunet.2025.107595