Authors: Ekaterina Iakovleva, Fabio Pizzati, Philip Torr, Stéphane Lathuilière
Abstract: Text-based editing diffusion models exhibit limited performance when the
user’s input instruction is ambiguous. To solve this problem, we propose
$\textit{Specify ANd Edit}$ (SANE), a zero-shot inference pipeline for
diffusion-based editing systems. We use a large language model (LLM) to
decompose the input instruction into specific instructions, i.e., well-defined
interventions to apply to the input image to satisfy the user’s request. We
leverage the LLM-derived instructions alongside the original one through a
novel denoising guidance strategy specifically designed for the task. Our
experiments with three baselines and on two datasets demonstrate the benefits
of SANE in all setups. Moreover, our pipeline improves the interpretability of
editing models and increases output diversity. We also demonstrate that our
approach can be applied to any edit, whether ambiguous or not. Our code is
publicly available at https://github.com/fabvio/SANE.
Source: http://arxiv.org/abs/2407.20232v1