Vision-Language Models Revolutionize Mobile Robot Navigation in Smart Manufacturing

25/03/2026 Frontiers Journals

In the evolving landscape of smart manufacturing, the integration of advanced technologies to enhance efficiency and adaptability is crucial. A recent study published in Engineering presents a novel approach to mobile robot navigation in unstructured environments, leveraging vision-language models (VLMs) and large language models (LLMs) to achieve human-guided navigation. This research, led by Tian Wang, Junming Fan, Pai Zheng, Ruqiang Yan, and Lihui Wang, aims to address the limitations of current autonomous mobile robots in dynamic and unpredictable manufacturing settings.

The study introduces a VLM-based human-guided navigation system designed specifically for human-centric smart manufacturing (HSM). This system integrates three core components: robust three-dimensional (3D) scene reconstruction, VLM-based semantic segmentation and integration, and LLM-driven spatial goal navigation. The system's ability to process natural language instructions and integrate them with visual data enables mobile robots to navigate complex environments more effectively.

The researchers utilized advanced point cloud techniques to reconstruct 3D scenes from RGB-depth (RGB-D) video frames, mitigating the impact of sensor noise. This robust reconstruction process involves multiple stages, including colored point cloud registration, fast global registration, and Truncated Signed Distance Function (TSDF) volume integration. The resulting 3D scene mesh provides a detailed and accurate representation of the environment, essential for effective robot navigation.
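The article does not reproduce the authors' implementation, but the reconstruction pipeline it describes maps naturally onto open-source tooling. The sketch below shows how registered RGB-D frames could be fused into a scene mesh via TSDF volume integration using the Open3D library; the library choice, file paths, voxel sizes, and the precomputed camera poses are illustrative assumptions, not details taken from the paper.

import numpy as np
import open3d as o3d

# Camera intrinsics: a stock preset is used here purely as a placeholder.
intrinsic = o3d.camera.PinholeCameraIntrinsic(
    o3d.camera.PinholeCameraIntrinsicParameters.PrimeSenseDefault)

# 4x4 camera-to-world poses, assumed to come from the registration stage
# (fast global registration for coarse alignment, colored ICP for refinement).
camera_poses = np.load("camera_poses.npy")          # shape (N, 4, 4); hypothetical file

# TSDF volume: each integrated frame updates a truncated signed distance field.
volume = o3d.pipelines.integration.ScalableTSDFVolume(
    voxel_length=0.01,                              # 1 cm voxels, illustrative
    sdf_trunc=0.04,                                 # truncation distance in metres
    color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)

for i, pose in enumerate(camera_poses):
    color = o3d.io.read_image(f"frames/color_{i:04d}.png")   # hypothetical paths
    depth = o3d.io.read_image(f"frames/depth_{i:04d}.png")
    rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
        color, depth, depth_trunc=3.0, convert_rgb_to_intensity=False)
    volume.integrate(rgbd, intrinsic, np.linalg.inv(pose))   # integrate expects world-to-camera

mesh = volume.extract_triangle_mesh()               # the 3D scene mesh used downstream
mesh.compute_vertex_normals()
o3d.io.write_triangle_mesh("scene_mesh.ply", mesh)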

In addition to scene reconstruction, the system employs a VLM for semantic segmentation. The LSeg model, which uses the CLIP text encoder, allows for zero-shot semantic segmentation of RGB images. This means the system can understand and segment objects based on textual descriptions it has not encountered during training. The semantic information is then integrated into the 3D scene mesh using the TSDF algorithm, creating a comprehensive semantic map that guides the robot's navigation.
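The key idea behind this zero-shot behaviour is that dense per-pixel image embeddings are compared against CLIP text embeddings of whatever label set is supplied at inference time, so new categories require only new text prompts. The following sketch illustrates just that matching step with the OpenAI CLIP package; the per-pixel embeddings are stubbed with random tensors, and the label list is an invented example rather than the categories used in the study.

import torch
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Candidate labels can be changed freely at inference time -- this is what makes
# the segmentation "zero-shot": no retraining is needed for new categories.
labels = ["floor", "workbench", "robot arm", "toolbox", "person"]   # illustrative
with torch.no_grad():
    text_emb = model.encode_text(clip.tokenize(labels).to(device))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)       # (L, D)

# pixel_emb stands in for the dense per-pixel embeddings produced by the LSeg
# image encoder (H, W, D); random values here just keep the sketch runnable.
H, W, D = 240, 320, text_emb.shape[-1]
pixel_emb = torch.randn(H, W, D, device=device)
pixel_emb = pixel_emb / pixel_emb.norm(dim=-1, keepdim=True)

# Cosine similarity between every pixel embedding and every label embedding;
# argmax over labels gives the predicted class per pixel.
logits = torch.einsum("hwd,ld->hwl", pixel_emb, text_emb.float())
segmentation = logits.argmax(dim=-1)    # (H, W) map of label indices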

The spatial goal navigation component leverages the GPT-3.5 model to translate natural language instructions into executable Python code. This enables the robot to understand and follow human commands, such as moving to specific locations or inspecting objects, by generating and executing the necessary control code.
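The article does not show the prompt or the robot-control interface, so the following is only a plausible shape for such a pipeline using the OpenAI Python SDK: the system prompt, the goto/inspect primitives, and the example instruction are hypothetical stand-ins for whatever API the authors expose to the generated code.

from openai import OpenAI   # OpenAI Python SDK (>= 1.0)

client = OpenAI()           # reads OPENAI_API_KEY from the environment

# Hypothetical robot-control primitives the generated code may call; the
# paper's actual control API is not published in this article.
SYSTEM_PROMPT = (
    "You translate navigation instructions into Python code. "
    "You may only call goto(label: str) and inspect(label: str), "
    "where label is a semantic class present in the scene map. "
    "Return only code, no explanations."
)

def instruction_to_code(instruction: str) -> str:
    """Ask the LLM to turn a natural-language command into control code."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": instruction},
        ],
        temperature=0,      # deterministic output for repeatable behaviour
    )
    return response.choices[0].message.content

code = instruction_to_code("Go to the workbench, then inspect the toolbox.")
print(code)                 # e.g. goto("workbench"); inspect("toolbox")

In practice the returned string would be validated against the whitelisted primitives before being executed on the robot, rather than passed straight to exec().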
The efficacy of the proposed system was validated through extensive experiments. The semantic segmentation experiments demonstrated that the LSeg model achieved a pixel accuracy of 96.16% on the validation set and 87.55% on the test set, which is comparable to fully supervised models such as U-Net and DeepLabV3+. The zero-shot capabilities of LSeg were further highlighted by its ability to accurately segment objects using new labels not seen during training. The 3D reconstruction experiments showed an average Chamfer distance of 6.9 cm between the reconstructed point clouds and the ground truth, indicating high accuracy in scene reconstruction.
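For context, the Chamfer distance quoted above measures the average nearest-neighbour distance between the reconstructed and ground-truth point clouds. Below is a minimal sketch of one common symmetric variant using NumPy and SciPy; the paper may use a slightly different formulation (e.g. squared or averaged terms), so treat this as illustrative only.

import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(points_a: np.ndarray, points_b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point clouds of shape (N, 3) and (M, 3).

    One common definition: mean nearest-neighbour distance from A to B plus the
    mean from B to A. Other papers average or square the terms, so the exact
    variant used by the authors is an assumption here.
    """
    d_ab, _ = cKDTree(points_b).query(points_a)   # nearest B point for each A point
    d_ba, _ = cKDTree(points_a).query(points_b)   # nearest A point for each B point
    return d_ab.mean() + d_ba.mean()

# Toy usage with random clouds (units follow the inputs, e.g. metres).
rng = np.random.default_rng(0)
print(chamfer_distance(rng.random((1000, 3)), rng.random((1200, 3))))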

The spatial goal navigation experiments conducted in the AI Habitat simulator revealed an average success rate of 92.5% across different numbers of navigation subgoals. The results indicate that the system can effectively interpret and execute complex natural language instructions, guiding the robot to complete tasks in unstructured environments.

This research represents a significant step forward in the development of human-robot interaction in smart manufacturing. By combining VLMs and LLMs, the proposed system enhances the adaptability and resilience of mobile robots, enabling them to navigate complex, unstructured environments guided by human instructions. Future work may focus on extending the system's capabilities to include dynamic scene updates and more granular robotic manipulation tasks, further advancing the integration of artificial intelligence in manufacturing processes.

The paper “Vision-Language Model-Based Human-Guided Mobile Robot Navigation in an Unstructured Environment for Human-Centric Smart Manufacturing” is authored by Tian Wang, Junming Fan, Pai Zheng, Ruqiang Yan, and Lihui Wang. Full text of the open-access paper: https://doi.org/10.1016/j.eng.2025.04.028. For more information about Engineering, visit the website at https://www.sciencedirect.com/journal/engineering.
Vision-Language Model-Based Human-Guided Mobile Robot Navigation in an Unstructured Environment for Human-Centric Smart Manufacturing
Authors: Tian Wang, Junming Fan, Pai Zheng, Ruqiang Yan, Lihui Wang
Publication: Engineering
Publisher: Elsevier
Date: Available online 15 July 2025
Regions: Asia, China
Keywords: Applied science, Artificial Intelligence, Engineering
