Authors: Yue Yang, Linfeng Zhao, Mingyu Ding, Gedas Bertasius, Daniel Szafir
Abstract: Robotics has long sought to develop visual-servoing robots capable of
completing previously unseen long-horizon tasks. Hierarchical approaches offer
a pathway for achieving this goal by executing skill combinations arranged by a
task planner, with each visuomotor skill pre-trained using a specific imitation
learning (IL) algorithm. However, even in simple long-horizon tasks like skill
chaining, hierarchical approaches often struggle due to a problem we identify
as Observation Space Shift (OSS), where the sequential execution of preceding
skills causes shifts in the observation space, disrupting the performance of
subsequent individually trained skill policies. To validate OSS and evaluate
its impact on long-horizon tasks, we introduce BOSS (a Benchmark for
Observation Space Shift). BOSS comprises three distinct challenges: “Single
Predicate Shift”, “Accumulated Predicate Shift”, and “Skill Chaining”, each
designed to assess a different aspect of OSS’s negative effect. We evaluate
several popular recent IL algorithms on BOSS, including three Behavioral
Cloning methods and the Vision-Language-Action model OpenVLA. Even on the
simplest challenge, we observe average performance drops of 67%, 35%, 34%, and
54%, respectively, when comparing skill performance with and without OSS.
Additionally, we investigate a potential solution to OSS: scaling up the
training data for each skill with a larger and more visually diverse set of
demonstrations. Our results show that this alone is not sufficient to resolve
OSS.
The project page is: https://boss-benchmark.github.io/
Source: http://arxiv.org/abs/2502.15679v1