Language-Driven Closed-Loop Grasping with Model-Predictive Trajectory Optimization

Huy Hoang Nguyen1       Minh Nhat Vu1,2*       Florian Beck2       Gerald Ebmer2       Anh Nguyen3       Wolfgang Kemmetmüller2       Andreas Kugi1,2      
1Austrian Institute of Technology   2ACIN - TU Wien   3University of Liverpool  
*Corresponding Author

Abstract

Integrating a vision module into a closed-loop control system for seamless robot motion in a manipulation task is challenging due to the inconsistent update rates of the individual modules. This becomes even more difficult in dynamic environments, e.g., when objects are moving. This paper presents a modular zero-shot framework for language-driven manipulation of (dynamic) objects, built on a closed-loop control system with real-time trajectory replanning and online 6D object pose localization. Leveraging a vision-language model, the target object specified by a natural language command is segmented within 0.5 s. A closed-loop system, comprising unified pose estimation and tracking together with online trajectory planning, then continuously tracks the object and computes the optimal trajectory in real time. This yields a smooth trajectory that avoids jerky movements and enables the robot to grasp a non-stationary object. Experimental results demonstrate the real-time capability of the proposed zero-shot modular framework to grasp moving objects accurately and efficiently, with update rates of up to 30 Hz for the online 6D pose localization module and 10 Hz for the receding-horizon trajectory optimization. These results highlight the framework's potential applications in robotics and human-robot interaction.
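As a minimal sketch of how such a multi-rate closed loop can be structured, the Python snippet below runs the pose tracker and the trajectory optimizer in separate threads at their own rates, sharing only the latest pose. The functions track_object_pose and replan_trajectory are hypothetical placeholders, not the actual implementation.

import threading
import time

# Shared state: the most recent 6D object pose estimate.
latest_pose = None
pose_lock = threading.Lock()

def track_object_pose():
    # Hypothetical placeholder for one 6D pose localization step.
    return (0.0, 0.0, 1.0, 0.0, 0.0, 0.0)

def replan_trajectory(pose):
    # Hypothetical placeholder for one receding-horizon replanning step.
    pass

def run_at_rate(step, rate_hz):
    # Call `step` periodically at approximately `rate_hz`.
    period = 1.0 / rate_hz
    while True:
        t0 = time.monotonic()
        step()
        time.sleep(max(0.0, period - (time.monotonic() - t0)))

def pose_step():
    global latest_pose
    pose = track_object_pose()
    with pose_lock:
        latest_pose = pose

def replan_step():
    with pose_lock:
        pose = latest_pose
    if pose is not None:
        replan_trajectory(pose)

# Pose localization at ~30 Hz, trajectory optimization at ~10 Hz.
threading.Thread(target=run_at_rate, args=(pose_step, 30.0), daemon=True).start()
run_at_rate(replan_step, 10.0)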

Video

Method


Our method consists of three main modules: language-driven object detection, real-time object pose localization, and online trajectory optimization (MP-TrajOpt). Consider a workshop scenario in which a human instructs a robot to grasp a tool with a known CAD model and place it at a predefined location. First, the language-driven object detection module processes the image to determine the 2D location of the tool and generates a binary mask. This mask, combined with the CAD model, is used by the real-time object pose localization module to estimate the initial pose of the tool. The pose estimate is then smoothed with a linear Kalman filter, as sketched below. Finally, the online trajectory optimization module plans and executes the grasp and placement actions, ensuring smooth, collision-free motion.
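As a minimal illustration of the filtering step, the sketch below implements a linear Kalman filter with a constant-velocity model on the 3D position. The state layout and noise parameters are illustrative assumptions; the filter used in the paper may differ.

import numpy as np

class PoseKalmanFilter:
    # Constant-velocity Kalman filter on the object's 3D position.
    # State: [x, y, z, vx, vy, vz]; measurement: [x, y, z].
    def __init__(self, dt=1.0 / 30.0, q=1e-3, r=1e-2):
        self.x = np.zeros(6)
        self.P = np.eye(6)
        self.F = np.eye(6)
        self.F[:3, 3:] = dt * np.eye(3)   # position integrates velocity
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])
        self.Q = q * np.eye(6)            # process noise (assumed)
        self.R = r * np.eye(3)            # measurement noise (assumed)

    def step(self, z):
        # Predict.
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Update with the raw position measurement z = [x, y, z].
        y = z - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(6) - K @ self.H) @ self.P
        return self.x[:3]                 # smoothed position

# Example: feed raw pose-estimator positions frame by frame.
kf = PoseKalmanFilter()
smoothed = kf.step(np.array([0.10, 0.20, 1.00]))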

Experiment

Prompt:"Grasp the metal part"

Prompt:"Grasp the orange block"

Prompt:"Grasp the plier"

Prompt:"Grasp the scissor"

Prompt:"Grasp the line stripper"

Prompt:"Grasp the timer"

Prompt:"Grasp the eraser"

Prompt:"Grasp the black tool"

Prompt:"Grasp the drill"

Prompt:"Grasp the silver block"

Results


The 3D position trajectory of the object estimated by our method closely matches the ground-truth measurements (OptiTrack). Note that these measurements are transformed into the camera frame, whose origin is approximately 1 m from the object. The maximum positional error is approximately 0.02 m, i.e., 2% of the distance to the camera. The orientation error is bounded as well, with a maximum error of 0.15 rad in the roll angle.
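For reference, errors of this kind can be computed as sketched below; the array names est_xyz, gt_xyz, est_rpy, and gt_rpy are illustrative, not taken from the paper's code.

import numpy as np

def position_error(est_xyz, gt_xyz):
    # Per-frame Euclidean distance between estimated and ground-truth positions.
    return np.linalg.norm(est_xyz - gt_xyz, axis=1)

def roll_error(est_rpy, gt_rpy):
    # Per-frame absolute roll difference, wrapped to [-pi, pi].
    d = est_rpy[:, 0] - gt_rpy[:, 0]
    return np.abs(np.arctan2(np.sin(d), np.cos(d)))

# Example with dummy (N, 3) trajectories:
N = 100
est_xyz, gt_xyz = np.random.rand(N, 3), np.random.rand(N, 3)
print("max position error [m]:", position_error(est_xyz, gt_xyz).max())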



BibTeX

@article{NGUYEN2025103335,
  title    = {Language-driven closed-loop grasping with model-predictive trajectory optimization},
  journal  = {Mechatronics},
  volume   = {109},
  pages    = {103335},
  year     = {2025},
  issn     = {0957-4158},
  doi      = {10.1016/j.mechatronics.2025.103335},
  url      = {https://www.sciencedirect.com/science/article/pii/S0957415825000443},
  author   = {H.H. Nguyen and M.N. Vu and F. Beck and G. Ebmer and A. Nguyen and W. Kemmetmueller and A. Kugi},
  keywords = {Language-driven object detection, Pose estimation, Grasping, Trajectory optimization},
}

Acknowledgements

We borrow the page template from the Nerfies project page. Special thanks to them!
This website is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.