Open-vocabulary Temporal Action Localization using VLMs

Authors: Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi

Abstract: Video action localization aims to find the timing of a specific action within a long video. Although existing learning-based approaches have been successful, they require annotating videos, which comes with a considerable labor cost. This paper proposes a learning-free, open-vocabulary approach based on emerging vision-language models (VLMs). The challenge stems from the fact that VLMs are neither designed to process long videos nor tailored for finding actions. We overcome these problems by extending an iterative visual prompting technique. Specifically, we sample video frames into a concatenated image with frame index labels and ask a VLM to identify the frame closest to the start or end of the action. Iterating this process while narrowing the sampling time window converges on specific start and end frames of the action. We demonstrate that this sampling technique yields reasonable results, illustrating a practical extension of VLMs for understanding videos.

Source: http://arxiv.org/abs/2408.17422v1
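
To make the iterative visual prompting idea concrete, here is a minimal sketch of one plausible implementation: sample frames evenly from a time window, tile them into one labeled image, ask a VLM which label is closest to the action boundary, then narrow the window around that frame and repeat. The `query_vlm` helper, the number of samples, the tiling layout, and the stopping criterion are all illustrative assumptions, not the authors' exact settings.

```python
import cv2
import numpy as np


def query_vlm(image, prompt):
    """Placeholder for a VLM call (e.g., an image+text chat API).
    Assumed to return the index label of the frame the model picks."""
    raise NotImplementedError


def localize_boundary(video_path, action, boundary="start",
                      num_samples=8, iterations=3):
    """Iteratively narrow a sampling window to estimate a boundary frame."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    lo, hi = 0, total - 1

    for _ in range(iterations):
        # Sample frames evenly from the current window and label each
        # one with its index so the VLM can refer to it by number.
        indices = np.linspace(lo, hi, num_samples, dtype=int)
        frames = []
        for i, idx in enumerate(indices):
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
            ok, frame = cap.read()
            if not ok:
                continue
            frame = cv2.resize(frame, (256, 256))
            cv2.putText(frame, str(i), (10, 40),
                        cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 255, 0), 3)
            frames.append(frame)

        # Tile the labeled frames into a single concatenated image.
        tiled = np.concatenate(frames, axis=1)

        prompt = (f"The frames are labeled 0 to {len(frames) - 1} from left "
                  f"to right. Which label is closest to the {boundary} of "
                  f"the action '{action}'? Answer with the label only.")
        picked = int(query_vlm(tiled, prompt))

        # Narrow the window to the neighborhood of the chosen frame.
        step = (hi - lo) / (num_samples - 1)
        center = lo + picked * step
        lo = max(0, int(center - step))
        hi = min(total - 1, int(center + step))

    cap.release()
    return (lo + hi) // 2
```

Running the same procedure twice, once with `boundary="start"` and once with `boundary="end"`, would yield an estimated action interval without any training or annotation.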
