AI model brings affordable 3D vision to automated fruit harvesting

20/09/2025 TranSpread

The innovation combines a new wax gourd dataset with a specialized neural network, TPDNet, which captures depth and spatial information from a standard RGB image.

Meeting the global demand for fruits and vegetables is increasingly challenged by rising labor costs, with harvesting alone accounting for up to half of total production expenses. Automated harvesting technologies are central to this shift, with object detection playing a critical role in enabling robots to identify and pick crops accurately. Conventional 2D object detection has made progress in tasks such as apple and passion fruit recognition, yet it is limited to flat image data. By contrast, 3D object detection supplies vital information on size, depth, and spatial coordinates, crucial for automation in complex orchard or field conditions. While point cloud-based methods deliver strong performance, they require expensive sensors unsuitable for most farms. Monocular 3D detection, using only a single camera, offers a low-cost alternative—but has been hindered by the lack of agricultural datasets and tailored algorithms.
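To illustrate the extra information a 3D detector supplies over a 2D one, the sketch below contrasts a flat image-plane box with a metric camera-space box. The class names and fields are illustrative, not the paper's actual data structures:

```python
from dataclasses import dataclass

@dataclass
class Box2D:
    # Axis-aligned image-plane box: pixel coordinates only,
    # no depth or physical size.
    x1: float
    y1: float
    x2: float
    y2: float

@dataclass
class Box3D:
    # Metric box in camera coordinates: adds the depth, physical
    # size, and orientation that 2D detection cannot provide.
    x: float    # center, lateral offset (m)
    y: float    # center, vertical offset (m)
    z: float    # center, depth from the camera (m)
    h: float    # height (m)
    w: float    # width (m)
    l: float    # length (m)
    yaw: float  # rotation about the vertical axis (rad)

fruit = Box3D(x=0.2, y=-0.1, z=3.5, h=0.4, w=0.3, l=0.3, yaw=0.0)
print(f"fruit center is {fruit.z} m from the camera")
```

A harvesting robot needs exactly these metric quantities (reach distance, grasp size, approach angle), which is why 3D detection matters in the field.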

A study (DOI: 10.1016/j.plaphe.2025.100048) published online on 30 May in Plant Phenomics by Qi Wang’s team at Guizhou University opens the door to more affordable and precise automated harvesting across diverse crop systems.

In this study, the researchers evaluated their proposed monocular 3D object detection model, TPDNet, through a systematic experimental design incorporating multiple evaluation metrics, implementation strategies, and robustness analyses. Detection performance was assessed across three categories, namely 2D object detection, Bird’s Eye View (BEV) detection, and full 3D object detection, using Average Precision at 40 recall points (AP40) as the primary metric. Intersection over Union (IoU) thresholds were set at 0.75 for 2D detection and 0.5 for both BEV and 3D tasks to ensure fair comparison.

The model was trained on an NVIDIA A40 GPU for 300 epochs using the Adam optimizer with a batch size of three and an initial learning rate of 0.0001, gradually reduced through a cosine annealing schedule. To strengthen detection precision, the model applied 48 anchors per pixel, covering multiple aspect ratios and height scales, and employed Non-Maximum Suppression during inference to remove redundant bounding boxes.

Results showed that TPDNet consistently outperformed leading monocular 3D detection frameworks, such as MonoDETR, MonoDistill, and MonoDTR, by up to 16.9% in AP3D and over 12% in APBEV. Visual comparisons showed that its predicted bounding boxes aligned more closely with ground truth, capturing object centers and sizes more accurately, while also detecting occluded and unlabeled objects. Attention map visualizations confirmed that TPDNet concentrated on key crop regions rather than background noise, validating the effectiveness of its depth enhancement and phenotype aggregation modules.

Ablation studies highlighted the synergistic importance of all three core modules, Depth Enhance, Phenotype Aggregation, and Phenotype Intensify, since performance declined markedly when any one was removed. Additional experiments showed that optimal training stability and accuracy were achieved with a loss function weighting ratio of 1:3.5, emphasizing the role of depth estimation. Finally, cross-validation confirmed the model’s robustness across different data partitions, and hardware analyses showed that the network, while resource-intensive to train, can be optimized for deployment in real-world, resource-limited agricultural environments.
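The IoU thresholds above determine when a predicted box counts as a true positive. A minimal sketch of the overlap measure for axis-aligned 2D boxes follows; the BEV and 3D variants use rotated or volumetric overlap, which is omitted here, and the function name is illustrative:

```python
def iou_2d(a, b):
    """Intersection over Union for axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A prediction is a true positive only if IoU with a ground-truth box
# exceeds the task's threshold: 0.75 for 2D, 0.5 for BEV and 3D.
print(iou_2d((0, 0, 10, 10), (5, 0, 15, 10)))  # 50 / 150 = 0.333...
```

The stricter 0.75 threshold for 2D reflects how much easier image-plane localization is than metric 3D localization.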
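The cosine annealing schedule mentioned above decays the learning rate smoothly from its initial value over training. A standalone sketch of the per-epoch rule, assuming decay from the paper's 0.0001 down to zero over 300 epochs (the function name and the zero floor are assumptions):

```python
import math

def cosine_annealed_lr(epoch, total_epochs=300, lr0=1e-4, lr_min=0.0):
    """Cosine annealing: follow half a cosine wave from lr0 to lr_min."""
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

print(cosine_annealed_lr(0))    # 1e-4 at the start
print(cosine_annealed_lr(150))  # ~5e-5 half-way through
print(cosine_annealed_lr(300))  # ~0 once fully annealed
```

Compared with step decay, the smooth curve avoids abrupt learning-rate drops, which is consistent with the training stability the authors report.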
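Non-Maximum Suppression, used at inference to prune the redundant boxes produced by the 48 anchors per pixel, is a greedy score-ordered filter. A sketch on 2D boxes for simplicity; the helper, threshold, and inputs are illustrative, not the paper's implementation:

```python
def _iou(a, b):
    # Overlap ratio for axis-aligned boxes (x1, y1, x2, y2).
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring box, drop overlapping rivals, repeat."""
    order = sorted(range(len(boxes)), key=scores.__getitem__, reverse=True)
    keep = []
    while order:
        best, order = order[0], order[1:]
        keep.append(best)
        order = [i for i in order if _iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the near-duplicate second box is suppressed
```

With dozens of anchors per pixel, most raw detections are near-duplicates of the same fruit, so this pruning step is essential before a harvester acts on the output.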
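The 1:3.5 weighting ratio corresponds to a combined objective of the form sketched below. This is a schematic only; the individual loss terms in the paper are not reproduced here, and the function and argument names are assumptions:

```python
def total_loss(det_loss, depth_loss, w_det=1.0, w_depth=3.5):
    """Weighted sum with the 1:3.5 ratio the authors found most stable,
    emphasizing the depth-estimation term over the detection term."""
    return w_det * det_loss + w_depth * depth_loss

print(total_loss(0.8, 0.4))  # 1.0 * 0.8 + 3.5 * 0.4 = 2.2
```

Weighting depth 3.5 times more heavily is consistent with the paper's finding that accurate depth estimation is the bottleneck for monocular 3D detection.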

This research carries direct benefits for agricultural automation. By requiring only low-cost cameras, TPDNet reduces barriers to adoption and offers scalability for smallholder farms. Automated harvesters powered by this technology could lower labor costs, improve harvesting efficiency, and minimize crop loss. Beyond wax gourds, the system shows potential to adapt to other crops, including melons, apples, and kiwifruit, supporting a new generation of intelligent farm machinery.

###

References

DOI

10.1016/j.plaphe.2025.100048

Original URL

https://doi.org/10.1016/j.plaphe.2025.100048

Funding information

This research was supported by the National Key R&D Program of China (2024YFE0214300), Guizhou Provincial Science and Technology Projects ([2024]002, CXTD[2023]027), Guizhou Province Youth Science and Technology Talent Project ([2024]317), Guiyang Guian Science and Technology Talent Training Project ([2024] 2-15), Academic Innovation Exploration and Emerging Scholars Program of Guizhou University of Finance and Economics: 2024XSXMB08.

About Plant Phenomics

Plant Phenomics is dedicated to publishing novel research that advances all aspects of plant phenotyping, from the cell to the plant population level, using innovative combinations of sensor systems and data analytics. The journal also aims to connect phenomics to other scientific domains, such as genomics, genetics, physiology, molecular biology, bioinformatics, statistics, mathematics, and computer science. Plant Phenomics thereby contributes to advancing plant sciences and agriculture, forestry, and horticulture by addressing key scientific challenges in plant phenomics.

Title of original paper: TPDNet: Triple phenotype deepen networks for monocular 3D object detection of melons and fruits in fields
Authors: Yazhou Wang, Tianhan Zhang, Xingcai Wu, Qinglei Li, Yuquan Li, Qi Wang
Journal: Plant Phenomics
Original Source URL: https://doi.org/10.1016/j.plaphe.2025.100048
DOI: 10.1016/j.plaphe.2025.100048
Latest article publication date: 30 May 2025
Subject of research: Not applicable
COI statement: The authors declare that they have no competing interests.
Attached files
  • Figure 3. The overall architecture of the proposed TPDNet. Our method consists of the Backbone, Depth Estimation, Depth Enhance, Phenotype Aggregation, and Phenotype Intensify modules. The Depth Estimation module uses depth-assistive tasks to estimate the depth features corresponding to image features. The Depth Enhance module employs multi-dimensional feature enhancement to improve the depth representation of features. The Phenotype Aggregation module captures the latent correspondence between image features and depth features through two fusion strategies. Finally, the Phenotype Intensify module utilizes a linear self-attention mechanism to strengthen the degree of feature fusion.
Regions: North America, United States, Asia, China
Keywords: Applied science, Engineering


