Multi-Modal Video Feature Extraction for Popularity Prediction

Authors: Haixu Liu, Wenning Wang, Haoxiang Zheng, Penghao Jiang, Qirui Wang, Ruiqing Yan, Qiuzhuang Sun

Abstract: This work aims to predict the popularity of short videos using the videos
themselves and their related features. Popularity is measured by four key
engagement metrics: view count, like count, comment count, and share count.
This study employs video classification models with different architectures and
training methods as backbone networks to extract video modality features.
In parallel, the cleaned video captions are inserted into a carefully
designed prompt framework and passed, together with the video, to video-to-text
generation models, which produce detailed textual descriptions of the video
content. These texts are then encoded into vectors using a pre-trained
BERT model. Based on the six sets of vectors mentioned above, a neural network
is trained for each of the four prediction metrics. Moreover, the study
conducts data mining and feature engineering based on the video and tabular
data, constructing practical features such as the total frequency of hashtag
appearances, the total frequency of mention appearances, video duration, frame
count, frame rate, and total time online. Multiple machine learning models are
trained, and the most stable model, XGBoost, is selected. Finally, the
predictions from the neural network and XGBoost models are averaged to obtain
the final result.
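
The caption-based features and the final averaging step can be illustrated with a minimal sketch. This is not the authors' code: the function names, the regular expressions, and the reading of "total frequency" as a per-caption count are assumptions made for illustration, and the prediction lists stand in for the outputs of the neural network and XGBoost models.

```python
import re

# Assumed patterns: a hashtag or mention is "#" or "@" followed by word characters.
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")


def caption_counts(caption):
    """Count hashtags and mentions in one caption (hypothetical helper)."""
    return {
        "hashtag_count": len(HASHTAG_RE.findall(caption)),
        "mention_count": len(MENTION_RE.findall(caption)),
    }


def average_predictions(nn_preds, xgb_preds):
    """Average two equally long lists of per-video predictions element-wise."""
    if len(nn_preds) != len(xgb_preds):
        raise ValueError("prediction lists must have the same length")
    return [(a + b) / 2 for a, b in zip(nn_preds, xgb_preds)]


if __name__ == "__main__":
    print(caption_counts("Fun day #beach #sun with @alice"))
    print(average_predictions([1.0, 3.0], [3.0, 5.0]))
```

In a full pipeline, one such averaged prediction would be produced for each of the four engagement metrics.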

Source: http://arxiv.org/abs/2501.01422v1
