Authors: Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Weidi Xie
Abstract: In this paper, we introduce WorldSense, the first benchmark to assess multi-modal video understanding that simultaneously encompasses visual, audio, and text inputs. In contrast to existing benchmarks, our WorldSense has several features: (i) collaboration of omni-modality: we design the evaluation tasks to feature a strong coupling of audio and video, requiring models to effectively exploit the synergistic perception of omni-modality; (ii) diversity of videos and tasks: WorldSense encompasses a diverse collection of 1,662 audio-visually synchronised videos, systematically categorized into 8 primary domains and 67 fine-grained subcategories to cover broad scenarios, together with 3,172 multi-choice QA pairs across 26 distinct tasks to enable comprehensive evaluation; (iii) high-quality annotations: all QA pairs are manually labeled by 80 expert annotators, with multiple rounds of correction to ensure quality. Based on our WorldSense, we extensively evaluate various state-of-the-art models. The experimental results indicate that existing models face significant challenges in understanding real-world scenarios (best accuracy of 48.0%). We hope our WorldSense can provide a platform for evaluating the ability to construct and understand coherent contexts from omni-modality.
Source: http://arxiv.org/abs/2502.04326v1
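
The abstract reports overall accuracy on multi-choice QA across tasks. The sketch below shows how such a metric might be computed for a benchmark of this shape; it is a minimal, hypothetical example, and the item field names ("question", "options", "answer", "domain", "task") and the `predict` callback are illustrative assumptions, not the authors' released data format or evaluation script.

```python
from collections import defaultdict
from typing import Callable, Dict, List

def evaluate(items: List[Dict[str, str]], predict: Callable[[Dict[str, str]], str]) -> dict:
    """Compute overall and per-task accuracy for multi-choice QA.

    `predict` is any model wrapper that maps a QA item to an option letter
    such as "A", "B", "C", or "D". Field names here are assumptions for
    illustration, not the official WorldSense schema.
    """
    correct = 0
    per_task = defaultdict(lambda: [0, 0])  # task -> [num_correct, num_total]
    for item in items:
        hit = predict(item).strip().upper() == item["answer"].strip().upper()
        correct += hit
        per_task[item["task"]][0] += hit
        per_task[item["task"]][1] += 1
    return {
        "overall_acc": correct / max(len(items), 1),
        "per_task_acc": {t: c / n for t, (c, n) in per_task.items()},
    }

if __name__ == "__main__":
    # Toy items and a constant-answer baseline, purely for demonstration.
    toy = [
        {"video": "v1.mp4", "question": "What instrument is playing?",
         "options": "A. piano  B. drums  C. violin  D. guitar",
         "answer": "A", "domain": "music", "task": "audio recognition"},
        {"video": "v2.mp4", "question": "Why does the crowd cheer?",
         "options": "A. a goal  B. a speech  C. a joke  D. rain",
         "answer": "A", "domain": "sports", "task": "audio-visual reasoning"},
    ]
    print(evaluate(toy, predict=lambda item: "A"))
```

Per-task accuracy alongside the overall score mirrors how benchmarks with many fine-grained task categories are typically reported, making it easy to see which of the 26 task types a model struggles with.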