A Training-Free Framework for Precise Mobile Manipulation of Small Everyday Objects

Authors: Arjun Gupta, Rishik Sathua, Saurabh Gupta

Abstract: Many everyday mobile manipulation tasks require precise interaction with
small objects, such as grasping a knob to open a cabinet or pressing a light
switch. In this paper, we develop Servoing with Vision Models (SVM), a
closed-loop training-free framework that enables a mobile manipulator to tackle
such precise tasks involving the manipulation of small objects. SVM employs an
RGB-D wrist camera and uses visual servoing for control. Our novelty lies in
the use of state-of-the-art vision models to reliably compute 3D targets from
the wrist image for diverse tasks and under occlusion due to the end-effector.
To mitigate occlusion artifacts, we employ vision models to out-paint the
end-effector thereby significantly enhancing target localization. We
demonstrate that aided by out-painting methods, open-vocabulary object
detectors can serve as a drop-in module to identify semantic targets (e.g.
knobs) and point tracking methods can reliably track interaction sites
indicated by user clicks. This training-free method obtains an 85% zero-shot
success rate on manipulating unseen objects in novel environments in the real
world, outperforming an open-loop control method and an imitation learning
baseline trained on 1000+ demonstrations by an absolute success rate of 50%.

Source: http://arxiv.org/abs/2502.13964v1

About the Author

Leave a Reply

Your email address will not be published. Required fields are marked *

You may also like these