Authors: Chun-Hao Paul Huang, Jae Shin Yoon, Hyeonho Jeong, Niloy Mitra, Duygu Ceylan
Abstract: Inspired by the emergent 3D capabilities in image generators, we explore
whether video generators similarly exhibit 3D awareness. Using
structure-from-motion (SfM) as a benchmark for 3D tasks, we investigate whether
intermediate features from OpenSora, a video generation model, can support
camera pose estimation. We first examine native 3D awareness in video
generation features by routing raw intermediate outputs to SfM-prediction
modules like DUSt3R. Then, we explore whether fine-tuning on camera pose
estimation enhances 3D awareness. Results indicate that while video generator
features have limited inherent 3D awareness, task-specific supervision
significantly boosts their accuracy for camera pose estimation, resulting in
competitive performance. The proposed unified model, named JOG3R, achieves this
without degrading video generation quality.
Source: http://arxiv.org/abs/2501.01409v1