Authors: Gueter Josmy Faure, Jia-Fong Yeh, Min-Hung Chen, Hung-Ting Su, Winston H. Hsu, Shang-Hong Lai
Abstract: While existing research often treats long-form videos as extended short
videos, we propose a novel approach that more accurately reflects human
cognition. This paper introduces BREASE: BRidging Episodes And SEmantics for
Long-Form Video Understanding, a model that simulates episodic memory
accumulation to capture action sequences and reinforces them with semantic
knowledge dispersed throughout the video. Our work makes two key contributions.
First, we develop an Episodic COmpressor (ECO) that efficiently aggregates
crucial representations from micro to semi-macro levels. Second, we propose a
Semantics reTRiever (SeTR) that enhances these aggregated representations with
semantic information by focusing on the broader context, dramatically reducing
feature dimensionality while preserving relevant macro-level information.
Extensive experiments demonstrate that BREASE achieves state-of-the-art
performance across multiple long video understanding benchmarks in both
zero-shot and fully-supervised settings. The project page and code are at:
https://joslefaure.github.io/assets/html/hermes.html.
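
As a rough illustration of the episodic compression idea described above, the sketch below keeps a fixed-capacity memory of frame features and, while the memory is over capacity, merges the most similar pair of neighbouring entries into their weighted mean. This is only a minimal sketch under stated assumptions: the abstract does not specify how ECO aggregates representations, so the function name `compress_episodes`, the `capacity` parameter, the cosine similarity criterion, and the adjacent-pair merge rule are all hypothetical and are not taken from the authors' code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def compress_episodes(frames, capacity):
    """Hypothetical compressor: shrink `frames` to at most `capacity` entries.

    Assumption (not from the paper): the most similar pair of neighbouring
    entries is merged into a mean weighted by how many original frames
    each entry already represents.
    """
    memory = [list(f) for f in frames]
    counts = [1] * len(memory)  # number of original frames behind each entry
    while len(memory) > capacity and len(memory) > 1:
        # Index of the left element of the most similar adjacent pair.
        best = max(
            range(len(memory) - 1),
            key=lambda i: cosine(memory[i], memory[i + 1]),
        )
        ca, cb = counts[best], counts[best + 1]
        merged = [
            (ca * x + cb * y) / (ca + cb)
            for x, y in zip(memory[best], memory[best + 1])
        ]
        # Replace the pair with the single merged entry.
        memory[best:best + 2] = [merged]
        counts[best:best + 2] = [ca + cb]
    return memory


if __name__ == "__main__":
    toy_frames = [[1, 0], [1, 0], [0, 1], [0, 1]]
    print(compress_episodes(toy_frames, capacity=2))
```

On the toy input, the four frame features collapse into two entries, one per run of identical frames. A real system would operate on high-dimensional visual features, and a semantic retrieval stage such as SeTR would need its own, separately defined reduction step.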
Source: http://arxiv.org/abs/2408.17443v1