Authors: Xin Fei, Wenzhao Zheng, Yueqi Duan, Wei Zhan, Masayoshi Tomizuka, Kurt Keutzer, Jiwen Lu
Abstract: Real-time 4D reconstruction for dynamic scenes remains a crucial challenge for
autonomous driving perception. Most existing methods rely on depth estimation
through self-supervision or multi-modality sensor fusion. In this paper, we
propose Driv3R, a DUSt3R-based framework that directly regresses per-frame
point maps from multi-view image sequences. To achieve streaming dense
reconstruction, we maintain a memory pool to reason about both spatial relationships
across sensors and dynamic temporal contexts, enhancing multi-view 3D
consistency and temporal integration. Furthermore, we employ a 4D flow
predictor to identify moving objects within the scene, directing our network
to focus more on reconstructing these dynamic regions. Finally, we align all
per-frame point maps consistently to the world coordinate system in an
optimization-free manner. We conduct extensive experiments on the large-scale
nuScenes dataset to evaluate the effectiveness of our method. Driv3R
outperforms previous frameworks in 4D dynamic scene reconstruction, achieving
15x faster inference than methods requiring global alignment.
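
As a rough illustration of what placing per-frame point maps in a world coordinate system without an optimization step can look like, the sketch below applies one rigid transform per frame and concatenates the results. It is not the Driv3R procedure: it assumes each frame's points are expressed in that frame's camera coordinates and that a 4x4 camera-to-world pose is already available, and the names `to_world` and `merge_frames` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Matrix4 = Sequence[Sequence[float]]  # 4x4 homogeneous pose, row-major


def to_world(points: Sequence[Point], cam_to_world: Matrix4) -> List[Point]:
    """Apply a rigid camera-to-world transform to every point.

    Assumption: `cam_to_world` is a known 4x4 pose [R | t; 0 0 0 1].
    Only its top three rows are needed for 3D points.
    """
    world = []
    for x, y, z in points:
        world.append(tuple(
            row[0] * x + row[1] * y + row[2] * z + row[3]
            for row in cam_to_world[:3]
        ))
    return world


def merge_frames(frames: Sequence[Tuple[Sequence[Point], Matrix4]]) -> List[Point]:
    """Concatenate all frames in world coordinates.

    Each frame is transformed independently in closed form, so no
    iterative global alignment over the whole sequence is run.
    """
    merged: List[Point] = []
    for points, pose in frames:
        merged.extend(to_world(points, pose))
    return merged
```

Because each frame is handled with a single closed-form transform, the cost grows linearly with the number of points; this is the generic reason a feed-forward alignment can be cheaper than a global optimization over all frames.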