Authors: Han Bao, Yue Huang, Yanbo Wang, Jiayi Ye, Xiangqi Wang, Xiuyin Chen, Mohamed Elhoseiny, Xiangliang Zhang
Abstract: Large Vision-Language Models (LVLMs) have become essential for advancing the
integration of visual and linguistic information, facilitating a wide range of
complex applications and tasks. However, evaluating LVLMs presents
significant challenges: constructing an evaluation benchmark typically demands
substantial human effort, and a benchmark remains static and inflexible once
built. Although automatic evaluation has been explored in the textual
modality, the visual modality remains under-explored. In this work, we
therefore address the question: “Can LVLMs serve as a path to automatic
benchmarking?”. We introduce AutoBench-V, an automated framework for serving
evaluation on demand, i.e., benchmarking LVLMs based on specific aspects of
model capability. Upon receiving an evaluation capability, AutoBench-V
leverages text-to-image models to generate relevant image samples and then
utilizes LVLMs to orchestrate visual question-answering (VQA) tasks, completing
the evaluation process efficiently and flexibly. Through an extensive
evaluation of seven popular LVLMs across five user-specified inputs (i.e.,
evaluation capabilities), the framework demonstrates its effectiveness and reliability. We
observe the following: (1) Our constructed benchmark accurately reflects
varying task difficulties; (2) As task difficulty rises, the performance gap
between models widens; (3) While models exhibit strong performance in
abstract-level understanding, they underperform on detail-oriented reasoning tasks; and (4)
Constructing a dataset with varying levels of difficulty is critical for a
comprehensive and exhaustive evaluation. Overall, AutoBench-V not only
successfully utilizes LVLMs for automated benchmarking but also reveals that
LVLMs as judges have significant potential in various domains.
Source: http://arxiv.org/abs/2410.21259v1
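
The abstract describes an on-demand pipeline: given a user-specified evaluation capability, generate images with a text-to-image model, derive VQA items with an LVLM, then score candidate LVLMs. The following is a minimal sketch of that flow under assumed interfaces; every name (`build_benchmark`, `evaluate`, `VQAItem`, the callables passed in) is a hypothetical placeholder, not the paper's actual implementation or API.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumed interfaces (placeholders, not the paper's API):
# a text-to-image generator maps a prompt to an image path;
# an LVLM maps (image_path, question) -> answer text.
TextToImage = Callable[[str], str]
LVLM = Callable[[str, str], str]


@dataclass
class VQAItem:
    image_path: str
    question: str
    reference_answer: str


def build_benchmark(capability: str,
                    prompt_writer: Callable[[str, int], List[str]],
                    t2i: TextToImage,
                    question_writer: Callable[[str, str], VQAItem]) -> List[VQAItem]:
    """Build a small on-demand benchmark for one user-specified capability."""
    prompts = prompt_writer(capability, 5)      # textual image descriptions
    images = [t2i(p) for p in prompts]          # synthesize image samples
    return [question_writer(img, capability)    # derive a VQA pair per image
            for img in images]


def evaluate(model: LVLM,
             judge: Callable[[str, str], bool],
             items: List[VQAItem]) -> float:
    """Score a candidate LVLM: fraction of answers the judge accepts."""
    correct = sum(
        judge(model(item.image_path, item.question), item.reference_answer)
        for item in items
    )
    return correct / len(items) if items else 0.0


if __name__ == "__main__":
    # Dummy components so the sketch runs offline; real use would wire in a
    # text-to-image API and LVLM endpoints here.
    items = build_benchmark(
        "spatial understanding",
        prompt_writer=lambda cap, n: [f"scene {i} testing {cap}" for i in range(n)],
        t2i=lambda prompt: f"/tmp/{abs(hash(prompt))}.png",
        question_writer=lambda img, cap: VQAItem(
            img, f"What does the image show about {cap}?", "yes"),
    )
    score = evaluate(model=lambda img, q: "yes",
                     judge=lambda ans, ref: ans.strip().lower() == ref,
                     items=items)
    print(f"accuracy: {score:.2f}")
```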