Authors: Yanan Chen, Ali Pesaranghader, Tanmana Sadhu, Dong Hoon Yi
Abstract: Large language models (LLMs) have brought autonomous agents closer to
artificial general intelligence (AGI) due to their promising generalization and
emergent capabilities. There is, however, a lack of studies on how LLM-based
agents behave, why they may fail, and how to improve them, particularly in
demanding real-world planning tasks. In this paper, in an effort to fill this
gap, we present a study using a realistic benchmark,
TravelPlanner, where an agent must meet multiple constraints to generate
accurate plans. We leverage this benchmark to address four key research
questions: (1) are LLM agents robust to lengthy and noisy contexts when
reasoning and planning? (2) can few-shot prompting adversely impact the
performance of LLM agents in long-context scenarios? (3) can we rely on
refinement to improve plans? and (4) can fine-tuning LLMs with both positive
and negative feedback lead to further improvement? Our comprehensive
experiments indicate that, first, LLMs often fail to attend to crucial parts
of a long context despite being able to handle extensive reference
information and few-shot examples; second, they still struggle to analyze
long plans and cannot provide accurate feedback for refinement; and third,
our proposed Feedback-Aware Fine-Tuning (FAFT), which leverages both positive
and negative feedback, yields substantial gains over Supervised Fine-Tuning
(SFT). Our findings offer the community in-depth insights into various
aspects of real-world planning applications.
Source: http://arxiv.org/abs/2408.06318v1
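To make the idea of fine-tuning on both positive and negative feedback concrete, below is a minimal, hypothetical sketch; it is not the paper's FAFT implementation. It assumes a HuggingFace-style causal LM (outputs with .loss and .logits) and combines a standard likelihood loss on feedback-approved plans with an unlikelihood-style penalty on feedback-rejected plans. The function name feedback_aware_loss, the batch format, and the neg_weight hyperparameter are illustrative assumptions, not taken from the paper.

    # Hypothetical sketch: combine positive and negative feedback in fine-tuning.
    # Positive plans get the usual cross-entropy loss; negative plans get an
    # unlikelihood-style penalty that pushes probability away from their tokens.
    import torch
    import torch.nn.functional as F

    def feedback_aware_loss(model, pos_batch, neg_batch, neg_weight=0.5):
        """pos_batch/neg_batch: dicts with input_ids, attention_mask, labels
        (labels set to -100 on prompt and padding tokens)."""
        # Positive feedback: maximize likelihood of approved plan tokens.
        pos_out = model(**pos_batch)
        pos_loss = pos_out.loss  # cross-entropy over labeled tokens

        # Negative feedback: penalize likelihood of rejected plan tokens.
        neg_out = model(input_ids=neg_batch["input_ids"],
                        attention_mask=neg_batch["attention_mask"])
        logits = neg_out.logits[:, :-1, :]   # predict token t+1 from prefix
        labels = neg_batch["labels"][:, 1:]  # align targets with predictions
        mask = labels != -100                # keep only plan tokens
        log_probs = F.log_softmax(logits, dim=-1)
        tok_logp = log_probs.gather(-1, labels.clamp(min=0).unsqueeze(-1)).squeeze(-1)
        # Unlikelihood term: -log(1 - p(token)) for tokens marked as wrong.
        unlikelihood = -torch.log1p(-tok_logp.exp() + 1e-6)
        neg_loss = (unlikelihood * mask).sum() / mask.sum().clamp(min=1)

        return pos_loss + neg_weight * neg_loss

This contrasts with plain SFT, which would use only the positive-feedback term; the neg_weight knob controls how strongly rejected plans are discouraged.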