Unlike traditional computer vision, egocentric vision records scenes from a first-person perspective, allowing machines to perceive actions, interactions, and surroundings in ways that more closely resemble human experience. This makes it highly relevant to applications such as augmented reality, virtual reality, robotics, intelligent surveillance, and human-computer interaction. However, first-person video is far more difficult to interpret than standard third-person imagery: it often contains rapid viewpoint shifts, severe motion blur, object occlusion, and complex interactions unfolding over time. A new survey also highlights a critical data gap: compared with large exocentric datasets, egocentric datasets remain limited in both scale and annotation quality. These challenges call for deeper research into egocentric vision.
Researchers from the Department of Information and Communication Engineering at the University of Electronic Science and Technology of China reported this review (DOI: 10.1007/s11633-025-1599-4) in Machine Intelligence Research (Vol. 23, No. 1, February 2026). The paper systematically examines the architecture of egocentric vision research, classifies its major tasks, summarizes representative methods and datasets, and highlights the central challenges and future trends shaping first-person AI.
A major contribution of the survey is its scene-centered task taxonomy. Instead of grouping studies only by method, the authors decompose egocentric scenes into three core elements—subject, interacting objects, and environment—and then extend this into four research categories: subject understanding, object understanding, environment understanding, and hybrid understanding. Under this structure, the paper reviews 11 sub-tasks, including gaze understanding, pose estimation, action understanding, social perception, human identity and trajectory recognition, object recognition, environment modeling, scene localization, content summarization, multi-view joint understanding, and video question answering. The authors present this as the first hierarchical analysis of egocentric scenarios, giving the field a clearer conceptual map.

The survey also pinpoints three dominant barriers: limited specialized datasets and benchmarks, the highly dynamic nature of first-person video, and the challenge of representing information across multiple layers and granularities. To support future work, the authors further compile 21 egocentric datasets and discuss five major trends that may help the field move toward more robust, multimodal, and embodied intelligence systems.
Rather than presenting egocentric vision as a collection of isolated benchmarks, the authors position it as a foundational capability for machine intelligence. They emphasize that understanding first-person data requires models that can connect attention, motion, objects, context, memory, and reasoning over time. Their conclusion is clear: progress will depend not only on better architectures, but also on stronger datasets, clearer task definitions, and deeper integration across modalities and scene elements.
The implications of this roadmap extend well beyond academic computer vision. More capable egocentric systems could support wearable assistants that understand what users are doing, AR and VR platforms that respond naturally to gaze and action, robots that learn from human demonstrations, and embodied agents that reason within real environments. The survey suggests that as sensing hardware improves and large multimodal models mature, first-person AI may become a key bridge between perception and action. By organizing the field’s knowledge base and clarifying its next steps, this work helps prepare egocentric vision for broader real-world impact.
###
References
DOI: 10.1007/s11633-025-1599-4
Original Source URL: https://doi.org/10.1007/s11633-025-1599-4
Funding information
This work was supported by the National Natural Science Foundation of China (Nos. U23A20286 and 62301121) and the Postdoctoral Fellowship Program (Grade B) of the China Postdoctoral Science Foundation (No. GZB20240120).
About Machine Intelligence Research
Machine Intelligence Research (original title: International Journal of Automation and Computing) is published by Springer and sponsored by the Institute of Automation, Chinese Academy of Sciences. The journal publishes high-quality papers on original theoretical and experimental research, targets special issues on emerging topics, and strives to bridge the gap between theoretical research and practical applications.