A research team led by The University of Osaka has demonstrated that vision transformers using self-attention mechanisms can spontaneously develop visual attention patterns similar to those of humans, without specific training.
Osaka, Japan – Can machines ever see the world as we see it? Researchers have uncovered compelling evidence that vision transformers (ViTs), a type of deep-learning model specialized for image analysis, can spontaneously develop human-like visual attention patterns when trained without labeled data.
Visual attention is the mechanism by which organisms, or artificial intelligence (AI), filter out ‘visual noise’ to focus on the most relevant parts of an image or scene. While this ability comes naturally to humans, acquiring it spontaneously has proven difficult for AI. However, in a recent publication in Neural Networks, researchers have shown that, given the right training experience, AI can acquire human-like visual attention without being explicitly taught to do so.
The research team, from The University of Osaka, compared human eye-tracking data to attention patterns generated by ViTs trained using DINO (‘self-distillation with no labels’), a method of self-supervised learning that allows models to organize visual information without annotated datasets. Remarkably, the DINO-trained ViTs exhibited gaze behavior that closely mirrored that of typically developing adults when viewing dynamic video clips. In contrast, ViTs trained with conventional supervised learning showed unnatural visual attention.
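As a rough illustration of the kind of attention maps involved (a minimal sketch, not the authors' analysis code; the checkpoint name, frame size, and preprocessing below are assumptions), per-head attention from a publicly released DINO-trained ViT can be extracted in PyTorch roughly as follows:

# Illustrative sketch: extract per-head [CLS] attention maps from a
# publicly released DINO-trained ViT. Checkpoint and preprocessing are assumptions.
import torch
from PIL import Image
from torchvision import transforms

model = torch.hub.load('facebookresearch/dino:main', 'dino_vits8')  # ViT-S/8 trained with DINO
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((480, 480)),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

img = preprocess(Image.open('frame.jpg').convert('RGB')).unsqueeze(0)  # one video frame

with torch.no_grad():
    # Attention of the last block: shape (1, num_heads, tokens, tokens)
    attn = model.get_last_selfattention(img)

num_heads = attn.shape[1]
patch = 8                        # patch size of ViT-S/8
h_feat = img.shape[-2] // patch  # patches per column
w_feat = img.shape[-1] // patch  # patches per row

# How strongly each head's [CLS] token attends to every image patch,
# reshaped into 2-D maps that can be compared with human gaze heatmaps.
cls_attn = attn[0, :, 0, 1:].reshape(num_heads, h_feat, w_feat)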
“Our models didn’t just attend to visual scenes randomly; they spontaneously developed specialized functions,” says Takuto Yamamoto, lead author of the study. “One subset of the model consistently focused on faces, another captured the outlines of entire figures, and a third attended primarily to background features. This closely reflects how human visual systems segment and interpret scenes.”
Through detailed analyses, the team demonstrated that these attention clusters emerged naturally in the DINO-trained ViTs. The resulting attention patterns were not only qualitatively similar to human gaze, but also quantitatively aligned with established eye-tracking data, particularly in scenes involving human figures. The findings suggest extending the traditional two-part figure–ground model of perception in psychology into a three-part model in which faces form a category of their own alongside figure and ground.
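To give a sense of what such a quantitative alignment could look like (a hypothetical sketch; the study's actual metrics may differ), one simple option is to correlate a model attention map with a gaze-density map built from eye-tracking fixations on the same frame:

# Hypothetical comparison metric, not necessarily the one used in the study.
import numpy as np
from scipy.stats import pearsonr

def attention_gaze_correlation(attn_map: np.ndarray, gaze_map: np.ndarray) -> float:
    # Both maps are assumed to have been resampled to the same 2-D resolution
    # (e.g., the ViT's patch grid) before comparison.
    r, _ = pearsonr(attn_map.ravel(), gaze_map.ravel())
    return float(r)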
“What makes this result remarkable is that these models were never told what a face is,” explains senior author Shigeru Kitazawa. “Yet they learned to prioritize faces, probably because doing so maximized the information gained from their environment. It is a compelling demonstration that self-supervised learning may capture something fundamental about how intelligent systems, including humans, learn from the world.”
The study underscores the potential of self-supervised learning not only for advancing AI applications, but also for modeling aspects of biological vision. By aligning artificial systems more closely with human perception, self-supervised ViTs offer a new lens for interpreting both machine learning and human cognition. The findings could be applied in a variety of ways, such as developing human-friendly robots or enhancing support during early childhood development.
###
The article “Emergence of Human-Like Attention and Distinct Head Clusters in Self-Supervised Vision Transformers: A Comparative Eye-Tracking Study” has been published in Neural Networks at DOI: https://doi.org/10.1016/j.neunet.2025.107595