Authors: Junting Lu, Zhiyang Zhang, Fangkai Yang, Jue Zhang, Lu Wang, Chao Du, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
Abstract: Multimodal large language models (MLLMs) have enabled LLM-based agents to
directly interact with application user interfaces (UIs), enhancing agents’
performance in complex tasks. However, these agents often suffer from high
latency and low reliability due to the extensive sequential UI interactions. To
address this issue, we propose AXIS, a novel LLM-based agents framework
prioritize actions through application programming interfaces (APIs) over UI
actions. This framework also facilitates the creation and expansion of APIs
through automated exploration of applications. Our experiments on Office Word
demonstrate that AXIS reduces task completion time by 65%-70% and cognitive
workload by 38%-53%, while maintaining accuracy of 97%-98% compare to humans.
Our work contributes to a new human-agent-computer interaction (HACI) framework
and a fresh UI design principle for application providers in the era of LLMs.
It also explores the possibility of turning every applications into agents,
paving the way towards an agent-centric operating system (Agent OS).
Source: http://arxiv.org/abs/2409.17140v1