Authors: Xin Fei, Wenzhao Zheng, Yueqi Duan, Wei Zhan, Masayoshi Tomizuka, Kurt Keutzer, Jiwen Lu
Abstract: Real-time 4D reconstruction of dynamic scenes remains a crucial challenge for
autonomous driving perception. Most existing methods rely on depth estimation
through self-supervision or multi-modal sensor fusion. In this paper, we
propose Driv3R, a DUSt3R-based framework that directly regresses per-frame
pointmaps from multi-view image sequences. To achieve streaming dense
reconstruction, we maintain a memory pool that reasons about both spatial
relationships across sensors and temporal dynamics, enhancing multi-view 3D
consistency and temporal integration. Furthermore, we employ a 4D flow
predictor to identify moving objects within the scene, so that the network
focuses more on reconstructing these dynamic regions. Finally, we align all
per-frame pointmaps to a consistent world coordinate system in an
optimization-free manner. We conduct extensive experiments on the large-scale
nuScenes dataset to evaluate the effectiveness of our method. Driv3R
outperforms previous frameworks in 4D dynamic scene reconstruction, achieving
a 15x faster inference speed than methods that require global alignment.
Code: https://github.com/Barrybarry-Smith/Driv3R.
Source: http://arxiv.org/abs/2412.06777v1
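
The abstract's memory pool for spatio-temporal reasoning could be realized, at its simplest, as a fixed-size buffer of past multi-view features that the decoder cross-attends to. The sketch below is a minimal illustration under that assumption; the class name, capacity, and layout are hypothetical, not taken from the Driv3R codebase.

    from collections import deque
    import torch

    class FeatureMemoryPool:
        """Hypothetical sketch: fixed-size pool of past multi-view features
        that a decoder could cross-attend to for spatio-temporal reasoning."""

        def __init__(self, max_frames=8):
            # Bounded deque: appending beyond capacity drops the oldest frame.
            self.pool = deque(maxlen=max_frames)

        def update(self, frame_feats):
            # frame_feats: (num_cams, tokens, dim) features for one timestep.
            self.pool.append(frame_feats)

        def context(self):
            # Flatten all stored sensors and timesteps into one key/value
            # sequence of shape (num_frames * num_cams * tokens, dim).
            return torch.cat([f.flatten(0, 1) for f in self.pool], dim=0)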
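
One plausible way to make the network "focus more on dynamic regions," as the abstract describes, is to upweight the pointmap regression loss on pixels the flow predictor flags as moving. This is a hedged sketch of that idea only; the function name, mask format, and weighting scheme are assumptions, not Driv3R's actual loss.

    import torch

    def masked_pointmap_loss(pred, target, dynamic_mask, dynamic_weight=2.0):
        """Hypothetical sketch: pointmap regression loss that emphasizes
        dynamic regions flagged by a flow predictor.

        pred, target: (B, H, W, 3) pointmaps; dynamic_mask: (B, H, W) bool.
        """
        # Per-pixel Euclidean error between predicted and target 3D points.
        err = (pred - target).norm(dim=-1)
        # Upweight pixels marked as moving (assumed weighting scheme).
        weights = 1.0 + (dynamic_weight - 1.0) * dynamic_mask.float()
        return (weights * err).mean()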
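
The "optimization-free" alignment can be read as a closed-form rigid transform of each per-frame pointmap into the world frame, e.g. using known sensor poses rather than iterative global optimization. A minimal NumPy sketch under that assumption (the function name and pose source are hypothetical):

    import numpy as np

    def align_pointmaps_to_world(pointmaps, cam_to_world):
        """Hypothetical sketch: map per-frame pointmaps (H, W, 3), predicted
        in each camera's frame, into a shared world frame via known 4x4
        camera-to-world poses, with no iterative optimization."""
        aligned = []
        for pts, pose in zip(pointmaps, cam_to_world):
            h, w, _ = pts.shape
            # Lift points to homogeneous coordinates and apply the pose.
            homog = np.concatenate([pts.reshape(-1, 3), np.ones((h * w, 1))], axis=1)
            aligned.append((homog @ pose.T)[:, :3].reshape(h, w, 3))
        return aligned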