A new era of intelligent factories: How VLMs enable smarter, safer human–robot partnerships

26/12/2025 TranSpread

Human–robot collaboration has long been promised as a cornerstone of next-generation manufacturing, yet conventional robots often fall short—constrained by brittle programming, limited perception, and minimal understanding of human intent. Industrial lines are dynamic, and robots that cannot adapt struggle to perform reliably. Meanwhile, advances in artificial intelligence, especially large language models and multimodal learning, have begun to show how machines could communicate and reason in more human-like ways. But the integration of these capabilities into factory environments remains fragmented. Because of these challenges, deeper investigation into vision-language-model-based human–robot collaboration is urgently needed.

A team from The Hong Kong Polytechnic University and KTH Royal Institute of Technology has published a new survey in Frontiers of Engineering Management (March 2025; DOI: 10.1007/s42524-025-4136-9) that delivers the first comprehensive mapping of how vision-language models (VLMs) are reshaping human–robot collaboration in smart manufacturing. Drawing on 109 studies from 2020 to 2024, the authors examine how VLMs, AI systems that jointly process images and language, enable robots to plan tasks, navigate complex environments, perform manipulation, and learn new skills directly from multimodal demonstrations.

The survey traces how VLMs add a powerful cognitive layer to robots, beginning with core architectures based on transformers and dual-encoder designs. It outlines how VLMs learn to align images and text through contrastive objectives, generative modeling, and cross-modal matching, producing shared semantic spaces that robots can use to understand both environments and instructions. In task planning, VLMs help robots interpret human commands, analyze real-time scenes, break down multi-step instructions, and generate executable action sequences; systems built on CLIP, GPT-4V, BERT, and ResNet achieve success rates above 90% in collaborative assembly and tabletop manipulation tasks.

In navigation, VLMs allow robots to translate natural-language goals into movement, mapping visual cues to spatial decisions. These models can follow detailed step-by-step instructions or reason from higher-level intent, enabling robust autonomy in domestic, industrial, and embodied settings. In manipulation, VLMs help robots recognize objects, evaluate affordances, and adjust to human motion, capabilities that are essential for safety-critical collaboration on factory floors. The review also highlights emerging work in multimodal skill transfer, in which robots learn directly from visual-language demonstrations rather than labor-intensive coding.
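
To make the alignment step concrete, the following minimal Python sketch (not drawn from the survey) shows how a publicly available CLIP checkpoint can embed a snapshot of a shared workcell and several candidate operator instructions into one semantic space, then rank the instructions by how well they match the scene. The checkpoint is the public "openai/clip-vit-base-patch32" release on Hugging Face; the image path and instruction strings are illustrative placeholders, and a deployed system would pair this grounding step with the planning and safety layers discussed above.

    # Minimal sketch of CLIP-style image-text alignment (illustrative; not the survey's code).
    # Requires the `transformers`, `torch`, and `Pillow` packages.
    from PIL import Image
    import torch
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # Placeholder snapshot of the shared workspace and candidate operator instructions.
    image = Image.open("workcell_snapshot.jpg")
    instructions = [
        "pick up the torque wrench from the left tray",
        "hand the operator the small gearbox housing",
        "move clear of the conveyor while parts are loaded",
    ]

    # Encode image and text into the shared embedding space and score every pairing.
    inputs = processor(text=instructions, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image holds scaled image-text similarities; softmax turns them into a
    # distribution over the candidate instructions for this scene.
    probs = outputs.logits_per_image.softmax(dim=-1)[0]
    for instruction, p in zip(instructions, probs):
        print(f"{p.item():.2f}  {instruction}")

In the survey's framing, this shared embedding space is the substrate on which the higher-level planning, navigation, and manipulation capabilities described above are built.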

The authors emphasize that VLMs mark a turning point for industrial robotics because they enable a shift from scripted automation to contextual understanding. “Robots equipped with VLMs can comprehend both what they see and what they are told,” they explain, highlighting that this dual-modality reasoning makes interaction more intuitive and safer for human workers. At the same time, they caution that achieving large-scale deployment will require addressing challenges in model efficiency, robustness, and data collection, as well as developing industrial-grade multimodal benchmarks for reliable evaluation.

The authors envision VLM-enabled robots becoming central to future smart factories—capable of adjusting to changing tasks, assisting workers in assembly, retrieving tools, managing logistics, conducting equipment inspections, and coordinating multi-robot systems. As VLMs mature, robots could learn new procedures from video-and-language demonstrations, reason through long-horizon plans, and collaborate fluidly with humans without extensive reprogramming. The authors conclude that breakthroughs in efficient VLM architectures, high-quality multimodal datasets, and dependable real-time processing will be key to unlocking their full industrial impact, potentially ushering in a new era of safe, adaptive, and human-centric manufacturing.

###

References

DOI: 10.1007/s42524-025-4136-9
Original Source URL: https://doi.org/10.1007/s42524-025-4136-9

Funding Information

This work was mainly supported by the Research Institute for Advanced Manufacturing (RIAM) of The Hong Kong Polytechnic University (1-CDJT); the Intra-Faculty Interdisciplinary Project 2023/24 (1-WZ4N) from the Research Committee of The Hong Kong Polytechnic University; the State Key Laboratory of Intelligent Manufacturing Equipment and Technology, Huazhong University of Science and Technology (IMETKF2024010); the Guangdong–Hong Kong Technology Cooperation Funding Scheme (GHX/075/22GD); the Innovation and Technology Commission (ITC); the COMAC International Collaborative Research Project (COMAC-SFGS-2023-3148); and the General Research Fund from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project Nos. PolyU15210222 and PolyU15206723). Open access funding was provided by The Hong Kong Polytechnic University.

About Frontiers of Engineering Management

Frontiers of Engineering Management (FEM) is an international academic journal supervised by the Chinese Academy of Engineering, focusing on cutting-edge management issues across all fields of engineering. The journal publishes research articles, reviews, and perspectives that advance theoretical and practical understanding in areas such as manufacturing, construction, energy, transportation, environmental systems, and logistics. FEM emphasizes methodologies in systems engineering, information management, technology and innovation management, as well as the management of large-scale engineering projects. Serving both scholars and industry leaders, the journal aims to promote knowledge exchange and support innovation in global engineering management.

Paper title: Vision-language model-based human-robot collaboration for smart manufacturing: A state-of-the-art survey
Attached files
  • Overview of vision–language model-driven human–robot collaboration in smart manufacturing.
Regions: North America, United States, Asia, Hong Kong, Europe, Sweden
Keywords: Applied science, Technology
