Authors: Wei-Jhe Huang, Min-Hung Chen, Shang-Hong Lai
Abstract: Spatio-temporal action detection encompasses the tasks of localizing and
classifying individual actions within a video. Recent works aim to enhance this
process by incorporating interaction modeling, which captures the relationship
between people and their surrounding context. However, these approaches have
primarily focused on fully supervised learning and lack the generalization
capability needed to recognize unseen action categories. In this paper, we aim
to adapt pretrained image-language models to detect
unseen actions. To this end, we propose a method that effectively leverages
the rich knowledge of visual-language models to perform Person-Context
Interaction. Meanwhile, our Context Prompting module uses contextual
information to prompt labels, thereby generating more
representative text features. Moreover, to address the challenge of recognizing
distinct actions by multiple people at the same timestamp, we design the
Interest Token Spotting mechanism, which employs pretrained visual knowledge to
find each person’s interest context tokens; these tokens are then used
for prompting to generate text features tailored to each individual. To
evaluate the ability to detect unseen actions, we propose a comprehensive
benchmark on J-HMDB, UCF101-24, and AVA datasets. The experiments show that our
method achieves superior results compared to previous approaches and can be
further extended to multi-action videos, bringing it closer to real-world
applications. The code and data can be found at
https://webber2933.github.io/ST-CLIP-project-page.
Source: http://arxiv.org/abs/2408.15996v1