Authors: Yilun Xu, Weili Nie, Arash Vahdat
Abstract: Sampling from diffusion models involves a slow iterative process that hinders
their practical deployment, especially for interactive applications. To
accelerate generation speed, recent approaches distill a multi-step diffusion
model into a single-step student generator via variational score distillation,
which matches the distribution of samples generated by the student to the
teacher’s distribution. However, these approaches use the reverse
Kullback-Leibler (KL) divergence for distribution matching, which is known to
be mode-seeking. In this paper, we generalize the distribution matching
approach
using a novel $f$-divergence minimization framework, termed $f$-distill, that
covers different divergences with different trade-offs in terms of mode
coverage and training variance. We derive the gradient of the $f$-divergence
between the teacher and student distributions and show that it is expressed as
the product of their score differences and a weighting function determined by
their density ratio. This weighting function naturally emphasizes samples with
higher density in the teacher distribution when a less mode-seeking divergence
is used. We observe that the popular variational score distillation approach
using the reverse-KL divergence is a special case within our framework.
Empirically, we demonstrate that alternative $f$-divergences, such as
forward-KL and Jensen-Shannon divergences, outperform the current best
variational score distillation methods across image generation tasks. In
particular, when using Jensen-Shannon divergence, $f$-distill achieves current
state-of-the-art one-step generation performance on ImageNet64 and zero-shot
text-to-image generation on MS-COCO. Project page:
https://research.nvidia.com/labs/genair/f-distill
Source: http://arxiv.org/abs/2502.15681v1
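
Sketch (not quoted from the paper): to make the gradient structure described in the abstract concrete, assume the convention $D_f(p \,\|\, q_\theta) = \mathbb{E}_{x \sim q_\theta}\!\left[ f\!\big(p(x)/q_\theta(x)\big) \right]$ for a convex $f$, with student samples $x = G_\theta(z)$, teacher density $p$, and student density $q_\theta$. A gradient of the form the abstract describes, a density-ratio-dependent weight multiplying the teacher–student score difference, can then be written as

\[
\nabla_\theta D_f(p \,\|\, q_\theta)
\;\propto\;
-\,\mathbb{E}_{z}\!\left[
h\big(r(x)\big)\,
\big(\nabla_x \log p(x) - \nabla_x \log q_\theta(x)\big)^{\top}
\frac{\partial G_\theta(z)}{\partial \theta}
\right],
\qquad
r(x) = \frac{p(x)}{q_\theta(x)},
\]

where one illustrative choice of weighting (taken here as an assumption, not quoted from the paper) is $h(r) = r^2 f''(r)$. Under this assumed form, reverse KL ($f(r) = -\log r$) yields $h \equiv 1$, i.e. a constant-weight update consistent with variational score distillation being a special case, while forward KL ($f(r) = r \log r$) yields $h(r) = r$, which upweights samples where the teacher density exceeds the student's, matching the mode-coverage intuition stated in the abstract.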