TimeRefine: Temporal Grounding with Time Refining Video LLM

Authors: Xizi Wang, Feng Cheng, Ziyang Wang, Huiyu Wang, Md Mohaiminul Islam, Lorenzo Torresani, Mohit Bansal, Gedas Bertasius, David Crandall

Abstract: Video temporal grounding aims to localize relevant temporal boundaries in a
video given a textual prompt. Recent work has focused on enabling Video LLMs to
perform video temporal grounding via next-token prediction of temporal
timestamps. However, accurately localizing timestamps in videos remains
challenging for Video LLMs when relying solely on temporal token prediction.
Our proposed TimeRefine addresses this challenge in two ways. First, instead of
directly predicting the start and end timestamps, we reformulate the temporal
grounding task as a temporal refining task: the model first makes rough
predictions and then refines them by predicting offsets to the target segment.
This refining process is repeated multiple times, through which the model
progressively self-improves its temporal localization accuracy. Second, to
enhance the model’s temporal perception capabilities, we incorporate an
auxiliary prediction head that penalizes the model more if a predicted segment
deviates further from the ground truth, thus encouraging the model to make
closer and more accurate predictions. Our plug-and-play method can be
integrated into most LLM-based temporal grounding approaches. The experimental
results demonstrate that TimeRefine achieves 3.6% and 5.0% mIoU improvements on
the ActivityNet and Charades-STA datasets, respectively. Code and pretrained
models will be released.

Source: http://arxiv.org/abs/2412.09601v1
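To make the two ideas in the abstract concrete, here is a minimal, hypothetical sketch of an iterative temporal-refinement module with an auxiliary regression head. In the paper the refinement is carried out through the Video LLM's own next-token predictions; the sketch below approximates that process with small regression heads purely for illustration. All names (TemporalRefiner, num_refine_steps, refinement_loss, the 768-dim feature size, the 0.5 auxiliary weight) are assumptions for this example, not the authors' released code.

```python
import torch
import torch.nn as nn


class TemporalRefiner(nn.Module):
    """Illustrative sketch: coarse (start, end) prediction refined by offsets."""

    def __init__(self, hidden_dim: int = 768, num_refine_steps: int = 3):
        super().__init__()
        self.num_refine_steps = num_refine_steps
        # Coarse head: an initial (start, end) guess, normalized to [0, 1].
        self.coarse_head = nn.Linear(hidden_dim, 2)
        # Refinement head: predicts (d_start, d_end) offsets at each step.
        self.refine_head = nn.Linear(hidden_dim, 2)
        # Auxiliary head: regresses (start, end) directly; its loss grows with
        # the distance to the ground truth, encouraging closer predictions.
        self.aux_head = nn.Linear(hidden_dim, 2)

    def forward(self, feats: torch.Tensor):
        # feats: (batch, hidden_dim) pooled video-text features.
        segment = self.coarse_head(feats).sigmoid()      # rough prediction
        segments = [segment]
        for _ in range(self.num_refine_steps):
            offset = self.refine_head(feats)             # predicted correction
            segment = (segment + offset).clamp(0.0, 1.0) # refined prediction
            segments.append(segment)
        aux_segment = self.aux_head(feats).sigmoid()
        return segments, aux_segment


def refinement_loss(segments, aux_segment, gt, aux_weight: float = 0.5):
    """Supervise every refinement step plus the auxiliary regression head."""
    step_loss = torch.stack([(s - gt).abs().mean() for s in segments]).mean()
    aux_loss = (aux_segment - gt).abs().mean()  # larger deviation -> larger penalty
    return step_loss + aux_weight * aux_loss


# Example usage with dummy features and a normalized ground-truth segment.
model = TemporalRefiner()
feats = torch.randn(4, 768)
gt = torch.tensor([[0.2, 0.5]]).repeat(4, 1)
segments, aux = model(feats)
loss = refinement_loss(segments, aux, gt)
```

The key design choice this sketch tries to capture is that every refinement step is supervised, so the model is rewarded for progressively shrinking the gap to the target segment rather than for a single one-shot timestamp prediction.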
