Authors: Shivansh Patel, Xinchen Yin, Wenlong Huang, Shubham Garg, Hooshang Nayyeri, Li Fei-Fei, Svetlana Lazebnik, Yunzhu Li
Abstract: Task specification for robotic manipulation in open-world environments is
challenging, requiring flexible and adaptive objectives that align with human
intentions and can evolve through iterative feedback. We introduce Iterative
Keypoint Reward (IKER), a visually grounded, Python-based reward function that
serves as a dynamic task specification. Our framework leverages vision-language
models (VLMs) to generate and refine these reward functions for multi-step
manipulation tasks.
Given RGB-D observations and free-form language instructions, we sample
keypoints in the scene and generate a reward function conditioned on these
keypoints. IKER operates on the spatial relationships between keypoints,
leveraging commonsense priors about desired behaviors and enabling precise
SE(3) control. We reconstruct real-world scenes in simulation and use the
generated rewards to train reinforcement learning (RL) policies, which are then
deployed in the real world, forming a real-to-sim-to-real loop. Our approach
demonstrates notable capabilities across diverse scenarios, including both
prehensile and non-prehensile tasks, showcasing multi-step task execution,
spontaneous error recovery, and on-the-fly strategy adjustments. The results
highlight IKER’s effectiveness in enabling robots to perform multi-step tasks
in dynamic environments through iterative reward shaping.
Source: http://arxiv.org/abs/2502.08643v1
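To make the core idea concrete, below is a minimal sketch (not the authors' code) of what a Python-based, keypoint-conditioned reward in the style of IKER might look like. The keypoint names, the target spatial relationship, and the 5 cm offset are all hypothetical illustrations; in the actual framework, a VLM generates and iteratively refines such functions from RGB-D observations and language instructions.

```python
import numpy as np

def reward(keypoints: dict[str, np.ndarray]) -> float:
    """Hypothetical dense reward for placing an object above a target.

    `keypoints` maps names to 3D positions (in meters) tracked from
    RGB-D observations of the scene.
    """
    obj = keypoints["object_top"]       # hypothetical keypoint name
    target = keypoints["shelf_center"]  # hypothetical keypoint name

    # Encourage horizontal alignment between the object and the target.
    xy_dist = np.linalg.norm(obj[:2] - target[:2])

    # Encourage the object to rest slightly above the target surface
    # (the 0.05 m offset is an illustrative assumption).
    z_err = abs(obj[2] - (target[2] + 0.05))

    # Dense shaped reward: increases as both errors shrink.
    return -(xy_dist + z_err)
```

In the real-to-sim-to-real loop the paper describes, a reward like this would be evaluated at each simulation step on keypoints tracked in the reconstructed scene, and an RL policy would be trained to maximize it before deployment on the real robot.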