Authors: Alexander Pondaven, Aliaksandr Siarohin, Sergey Tulyakov, Philip Torr, Fabio Pizzati
Abstract: We propose DiTFlow, a method for transferring the motion of a reference video
to a newly synthesized one, designed specifically for Diffusion Transformers
(DiT). We first process the reference video with a pre-trained DiT to analyze
cross-frame attention maps and extract a patch-wise motion signal called the
Attention Motion Flow (AMF). We guide the latent denoising process in an
optimization-based, training-free, manner by optimizing latents with our AMF
loss to generate videos reproducing the motion of the reference one. We also
apply our optimization strategy to transformer positional embeddings, granting
us a boost in zero-shot motion transfer capabilities. We evaluate DiTFlow
against recently published methods, outperforming all across multiple metrics
and human evaluation.
Source: http://arxiv.org/abs/2412.07776v1