Authors: Keshigeyan Chandrasegaran, Agrim Gupta, Lea M. Hadzic, Taran Kota, Jimming He, Cristóbal Eyzaguirre, Zane Durante, Manling Li, Jiajun Wu, Li Fei-Fei
Abstract: We present HourVideo, a benchmark dataset for hour-long video-language
understanding. Our dataset consists of a novel task suite comprising
summarization, perception (recall, tracking), visual reasoning (spatial,
temporal, predictive, causal, counterfactual), and navigation (room-to-room,
object retrieval) tasks. HourVideo includes 500 manually curated egocentric
videos from the Ego4D dataset, spanning durations of 20 to 120 minutes, and
features 12,976 high-quality, five-way multiple-choice questions. Benchmarking
results reveal that multimodal models, including GPT-4 and LLaVA-NeXT, achieve
marginal improvements over random chance. In stark contrast, human experts
significantly outperform the state-of-the-art long-context multimodal model,
Gemini Pro 1.5 (85.0% vs. 37.3%), highlighting a substantial gap in multimodal
capabilities. Our benchmark, evaluation toolkit, prompts, and documentation are
available at https://hourvideo.stanford.edu
Source: http://arxiv.org/abs/2411.04998v1