Authors: Anxing Xiao, Nuwan Janaka, Tianrun Hu, Anshul Gupta, Kaixin Li, Cunjun Yu, David Hsu
Abstract: In this paper, we introduce Robi Butler, a novel household robotic system
that enables multimodal interactions with remote users. Building on advanced
communication interfaces, Robi Butler allows users to monitor the
robot’s status, send text or voice instructions, and select target objects by
hand pointing. At the core of our system is a high-level behavior module,
powered by Large Language Models (LLMs), that interprets multimodal
instructions to generate action plans. These plans are composed of
open-vocabulary primitives supported by Vision Language Models (VLMs) that
handle both text and pointing queries. Integrating these components
allows Robi Butler to ground remote multimodal instructions in real-world
home environments in a zero-shot manner. We demonstrate the effectiveness and
efficiency of this system using a variety of daily household tasks that involve
remote users giving multimodal instructions. Additionally, we conduct a user
study to analyze how multimodal interactions affect efficiency and user
experience in remote human-robot interaction, and we discuss potential
improvements.
Source: http://arxiv.org/abs/2409.20548v1
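The abstract describes an LLM-based behavior module that turns a remote user's multimodal instruction into a plan of open-vocabulary primitives, each resolved by a VLM from either a text query or a pointing gesture. The sketch below is purely illustrative and not the authors' code: the primitive names (`go_to`, `pick`, `place`), the `Primitive` structure, and the example plan are all assumptions showing one plausible shape such a plan could take.

```python
# Hypothetical sketch (not from the paper): how an LLM-generated plan might be
# represented as open-vocabulary primitives with text or pointing queries.
from dataclasses import dataclass
from typing import Optional, Tuple, List


@dataclass
class Primitive:
    """One open-vocabulary action step: a skill name plus a text or pointing query."""
    skill: str                                              # e.g. "go_to", "pick", "place" (assumed names)
    text_query: Optional[str] = None                        # open-vocabulary object/location description
    pointing_target: Optional[Tuple[float, float]] = None   # normalized image coords from a hand-pointing gesture


def plan_from_instruction(instruction: str) -> List[Primitive]:
    """Stand-in for the LLM behavior module: maps a remote user's instruction
    to a sequence of primitives. A real system would prompt an LLM here and
    let a VLM resolve each query against the robot's camera view."""
    # Hard-coded example plan for a "bring me the mug I'm pointing at" request.
    return [
        Primitive(skill="go_to", text_query="kitchen counter"),
        Primitive(skill="pick", pointing_target=(0.62, 0.41)),   # target chosen via the pointing gesture
        Primitive(skill="place", text_query="living room table"),
    ]


if __name__ == "__main__":
    for step in plan_from_instruction("Bring me the mug I'm pointing at."):
        print(step)
```

In this reading, the zero-shot grounding comes from the fact that neither the skill arguments nor the object categories are fixed in advance: the LLM emits free-form queries and the VLM grounds them (or the pointed-at region) in the current scene at execution time.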