Enhancing Action Recognition by Leveraging the Hierarchical Structure of Actions and Textual Context

Authors: Manuel Benavent-Lledo, David Mulero-Pérez, David Ortiz-Perez, Jose Garcia-Rodriguez, Antonis Argyros

Abstract: The sequential execution of actions and their hierarchical structure,
consisting of different levels of abstraction, provide features that remain
unexplored in the task of action recognition. In this study, we present a novel
approach to improve action recognition by exploiting the hierarchical
organization of actions and by incorporating contextualized textual
information, including location and prior actions to reflect the sequential
context. To achieve this goal, we introduce a novel transformer architecture
tailored for action recognition that utilizes both visual and textual features.
Visual features are obtained from RGB and optical flow data, while text
embeddings represent contextual information. Furthermore, we define a joint
loss function to simultaneously train the model for both coarse and
fine-grained action recognition, thereby exploiting the hierarchical nature of
actions. To demonstrate the effectiveness of our method, we extend the Toyota
Smarthome Untrimmed (TSU) dataset with action hierarchies, creating
the Hierarchical TSU dataset. We also conduct an ablation study to assess the
impact of different methods for integrating contextual and hierarchical data on
action recognition performance. Results show that the proposed approach
outperforms pre-trained state-of-the-art (SOTA) methods when trained with the
same hyperparameters. The results also show a 17.12% improvement in top-1
accuracy over the equivalent fine-grained RGB version when using ground-truth
contextual information, and a 5.33% improvement when contextual information is
obtained from actual predictions.
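
The abstract does not state the form of the joint loss. As an illustration
only, the sketch below assumes one common choice: a weighted sum of two
cross-entropy terms, one for the coarse label and one for the fine-grained
label. The function names and the weight `alpha` are hypothetical and are not
taken from the paper.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one sample, computed from raw logits.

    Uses the log-sum-exp trick for numerical stability.
    """
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def joint_loss(coarse_logits, fine_logits, coarse_target, fine_target, alpha=0.5):
    """Weighted sum of coarse and fine-grained cross-entropy terms.

    `alpha` is an assumed weighting between the two levels of the
    hierarchy; the paper may combine the terms differently.
    """
    coarse_term = cross_entropy(coarse_logits, coarse_target)
    fine_term = cross_entropy(fine_logits, fine_target)
    return alpha * coarse_term + (1.0 - alpha) * fine_term


# Example: 2 coarse classes and 4 fine-grained classes for one clip.
loss = joint_loss([2.0, 0.5], [0.1, 1.5, 0.3, 0.2], coarse_target=0, fine_target=1)
```

Under this assumption, a single backward pass through the summed loss updates
shared features using both levels of supervision, which is one way to exploit
the hierarchy the abstract describes.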