Grounding Video Models to Actions through Goal Conditioned Exploration

Authors: Yunhao Luo, Yilun Du

Abstract: Large video models, pretrained on massive amounts of Internet video, provide
a rich source of physical knowledge about the dynamics and motions of objects
and tasks. However, video models are not grounded in the embodiment of an
agent, and do not describe how to actuate the world to reach the visual states
depicted in a video. To tackle this problem, current methods use a separate
vision-based inverse dynamics model trained on embodiment-specific data to map
image states to actions. Gathering data to train such a model is often
expensive and challenging, and this model is limited to visual settings similar
to the ones in which data are available. In this paper, we investigate how to
directly ground video models to continuous actions through self-exploration in
the embodied environment — using generated video states as visual goals for
exploration. We propose a framework that uses trajectory-level action
generation in combination with video guidance to enable an agent to solve
complex tasks without any external supervision, e.g., rewards, action labels,
or segmentation masks. We validate the proposed approach on 8 tasks in Libero,
6 tasks in MetaWorld, 4 tasks in Calvin, and 12 tasks in iThor Visual
Navigation. We show that our approach is on par with, or even surpasses, multiple
behavior cloning baselines trained on expert demonstrations, while requiring no
action annotations.

Source: http://arxiv.org/abs/2411.07223v1
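
To make the abstract's description concrete, below is a minimal, self-contained sketch (not the paper's implementation) of goal-conditioned exploration with video-generated goals. Every name here (`VideoModel`, `ToyEnv`, `explore_toward_goal`, the pixel-space distance) is an illustrative stand-in: the video model returns random frames instead of generated video, the environment is a toy image that drifts under additive actions, and candidate action trajectories are scored by distance to the goal frame. The two ideas it mirrors are trajectory-level action sampling and the fact that every executed transition carries its own action label, so no external rewards or annotations are needed.

```python
import copy
import numpy as np

rng = np.random.default_rng(0)


class VideoModel:
    """Stand-in for a pretrained video model; here it returns random frames
    instead of generated future video states."""

    def generate_goals(self, first_frame, num_goals=4):
        return [rng.random(first_frame.shape) for _ in range(num_goals)]


class ToyEnv:
    """Toy environment whose 'frame' drifts under additive actions; a stand-in
    for a simulator such as Libero or MetaWorld."""

    def __init__(self, frame_shape=(8, 8)):
        self.frame = rng.random(frame_shape)

    def observe(self):
        return self.frame

    def step(self, action):
        self.frame = np.clip(self.frame + action, 0.0, 1.0)
        return self.frame


def visual_distance(frame_a, frame_b):
    # Pixel-space L2 distance; a learned visual metric would be used in practice.
    return float(np.linalg.norm(frame_a - frame_b))


def sample_trajectories(num_traj, horizon, frame_shape, scale=0.05):
    # Trajectory-level generation: sample whole action sequences at once.
    return rng.normal(0.0, scale, size=(num_traj, horizon) + frame_shape)


def explore_toward_goal(env, goal, num_traj=32, horizon=5):
    """Evaluate sampled action trajectories against the video-generated goal
    frame, execute the best one, and return self-labeled transitions."""
    start = env.observe().copy()
    candidates = sample_trajectories(num_traj, horizon, start.shape)
    best_traj, best_dist = None, np.inf
    for traj in candidates:
        sim = copy.deepcopy(env)          # evaluate each candidate on a copy
        frame = sim.observe()
        for action in traj:
            frame = sim.step(action)
        dist = visual_distance(frame, goal)
        if dist < best_dist:
            best_traj, best_dist = traj, dist
    # Execute the best trajectory for real, logging (obs, action, next_obs).
    data = []
    for action in best_traj:
        obs = env.observe().copy()
        next_obs = env.step(action).copy()
        data.append((obs, action, next_obs))
    return data, best_dist


if __name__ == "__main__":
    env = ToyEnv()
    video_model = VideoModel()
    goals = video_model.generate_goals(env.observe())
    replay = []
    for i, goal in enumerate(goals):
        data, dist = explore_toward_goal(env, goal)
        replay.extend(data)
        print(f"goal {i}: final visual distance {dist:.3f}")
    print(f"collected {len(replay)} action-labeled transitions without supervision")
```

The collected transitions could then train a goal-conditioned policy, since the agent's own exploration supplies the action labels that the video model alone cannot provide.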
