Authors: Yunzhe Xu, Yiyuan Pan, Zhe Liu, Hesheng Wang
Abstract: Large Language Models (LLMs) have demonstrated potential in
Vision-and-Language Navigation (VLN) tasks, yet current applications face
challenges. While LLMs excel at general conversation, they struggle with
specialized navigation tasks and yield suboptimal performance compared to
purpose-built VLN models. We introduce FLAME (FLAMingo-Architected Embodied
Agent), a novel Multimodal LLM-based agent and architecture designed for urban
VLN tasks that efficiently handles multiple observations. Our approach
implements a three-phase tuning technique for effective adaptation to
navigation tasks, comprising single perception tuning for street view
description, multiple perception tuning for trajectory summarization, and
end-to-end training on VLN datasets. The augmented datasets are synthesized
automatically. Experimental results demonstrate FLAME’s superiority over
existing methods, surpassing the state of the art by a 7.3% increase in
task completion rate on the Touchdown dataset. This work showcases the potential of
Multimodal LLMs (MLLMs) in complex navigation tasks, representing an
advancement towards practical applications of MLLMs in embodied AI. Project
page: https://flame-sjtu.github.io
Source: http://arxiv.org/abs/2408.11051v1
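
The abstract outlines a sequential three-phase tuning curriculum (single perception, multiple perception, end-to-end VLN training). The sketch below is only an illustrative rendering of that ordering; the dataset identifiers, the fine_tune callable, and all other names are assumptions for illustration, not FLAME's actual implementation.

    # Hypothetical sketch of the three-phase tuning curriculum described in the
    # abstract. Dataset names and the fine_tune helper are illustrative only.
    from dataclasses import dataclass
    from typing import Callable, Sequence


    @dataclass
    class TuningPhase:
        name: str       # human-readable phase label
        dataset: str    # identifier of the (synthesized) training set
        objective: str  # what the model is trained to produce in this phase


    # The three phases, in the order the abstract lists them.
    PHASES: Sequence[TuningPhase] = (
        TuningPhase("single_perception", "streetview_captions", "street view description"),
        TuningPhase("multiple_perception", "trajectory_summaries", "trajectory summarization"),
        TuningPhase("end_to_end", "vln_dataset", "instruction-following navigation"),
    )


    def run_curriculum(fine_tune: Callable[[str, str], None]) -> None:
        """Run each phase in sequence, reusing the weights from the previous one.

        fine_tune stands in for whatever routine updates the multimodal LLM
        on the given dataset with the given objective.
        """
        for phase in PHASES:
            print(f"[{phase.name}] tuning on {phase.dataset} for {phase.objective}")
            fine_tune(phase.dataset, phase.objective)


    if __name__ == "__main__":
        # Dummy trainer so the sketch runs standalone; a real pipeline would
        # load the Flamingo-style model and optimize it here.
        run_curriculum(lambda dataset, objective: None)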