VaViM and VaVAM: Autonomous Driving through Video Generative Modeling

Authors: Florent Bartoccioni, Elias Ramzi, Victor Besnier, Shashanka Venkataramanan, Tuan-Hung Vu, Yihong Xu, Loick Chambon, Spyros Gidaris, Serkan Odabas, David Hurych, Renaud Marlet, Alexandre Boulch, Mickael Chen, Éloi Zablocki, Andrei Bursuc, Eduardo Valle, Matthieu Cord

Abstract: We explore the potential of large-scale generative video models for
autonomous driving, introducing an open-source auto-regressive video model
(VaViM) and its companion video-action model (VaVAM) to investigate how video
pre-training transfers to real-world driving. VaViM is a simple auto-regressive
video model that predicts frames using spatio-temporal token sequences. We show
that it captures the semantics and dynamics of driving scenes. VaVAM, the
video-action model, leverages the learned representations of VaViM to generate
driving trajectories through imitation learning. Together, the models form a
complete perception-to-action pipeline. We evaluate our models in open- and
closed-loop driving scenarios, revealing that video-based pre-training holds
promise for autonomous driving. Key insights include the semantic richness of
the learned representations, the benefits of scaling for video synthesis, and
the complex relationship between model size, data, and safety metrics in
closed-loop evaluations. We release code and model weights at
https://github.com/valeoai/VideoActionModel

Source: http://arxiv.org/abs/2502.15672v1
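The abstract describes a two-stage design: an auto-regressive transformer trained by next-token prediction over spatio-temporal video tokens (VaViM), followed by an action head trained by imitation learning on the learned representations (VaVAM). Below is a minimal PyTorch sketch of that general idea only; all class names, dimensions, the tokenization step, and the waypoint parameterization are illustrative assumptions rather than the authors' architecture, which is available in the linked repository.

```python
# Sketch of the two-stage idea from the abstract, NOT the VaViM/VaVAM implementation.
# Assumptions (hypothetical): frames are already mapped to discrete tokens by a
# VQ-style tokenizer, and the action model regresses future (x, y) waypoints.
import torch
import torch.nn as nn


class TinyVideoGPT(nn.Module):
    """Decoder-only transformer over flattened spatio-temporal token sequences."""

    def __init__(self, vocab_size=1024, dim=256, depth=4, heads=4, max_len=2048):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):  # tokens: (B, T) int64
        B, T = tokens.shape
        pos = torch.arange(T, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # Causal mask: True entries are blocked, so each token sees only the past.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=tokens.device), diagonal=1)
        h = self.blocks(x, mask=mask)
        return self.lm_head(h), h  # next-token logits + features for the action head


class WaypointHead(nn.Module):
    """Imitation-learning head: pools video features and regresses a short trajectory."""

    def __init__(self, dim=256, horizon=6):
        super().__init__()
        self.horizon = horizon
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, horizon * 2))

    def forward(self, feats):  # feats: (B, T, dim)
        return self.mlp(feats.mean(dim=1)).view(-1, self.horizon, 2)


# Stage 1: video pre-training with next-token prediction (cross-entropy).
video_model = TinyVideoGPT()
tokens = torch.randint(0, 1024, (2, 128))  # dummy spatio-temporal token sequence
logits, feats = video_model(tokens)
lm_loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1)
)

# Stage 2: imitation learning, fitting expert trajectories from the learned features.
action_head = WaypointHead()
expert_traj = torch.randn(2, 6, 2)  # dummy expert waypoints
il_loss = nn.functional.l1_loss(action_head(feats.detach()), expert_traj)
```

The two losses mirror the pipeline described in the abstract: large-scale video pre-training first, then a comparatively lightweight imitation-learning stage that turns the learned scene representation into driving trajectories.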
