Authors: Gaurav Parmar, Or Patashnik, Kuan-Chieh Wang, Daniil Ostashev, Srinivasa Narasimhan, Jun-Yan Zhu, Daniel Cohen-Or, Kfir Aberman
Abstract: We introduce a method for composing object-level visual prompts within a
text-to-image diffusion model. Our approach addresses the task of generating
semantically coherent compositions across diverse scenes and styles, similar to
the versatility and expressiveness offered by text prompts. A key challenge in
this task is to preserve the identity of the objects depicted in the input
visual prompts, while also generating diverse compositions across different
images. To address this challenge, we introduce a new KV-mixed cross-attention
mechanism, in which keys and values are learned from distinct visual
representations. The keys are derived from an encoder with a small bottleneck
for layout control, whereas the values come from a larger-bottleneck encoder
that captures fine-grained appearance details. By mixing keys and values from
these complementary sources, our model preserves the identity of the visual
prompts while supporting flexible variations in object arrangement, pose, and
composition. During inference, we further propose object-level compositional
guidance to improve the method’s identity preservation and layout correctness.
Results show that our technique produces diverse scene compositions that
preserve the unique characteristics of each visual prompt, expanding the
creative potential of text-to-image generation.
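
The KV-mixed cross-attention described in the abstract can be illustrated with a minimal sketch. It assumes a single attention head, one key and one value per visual-prompt token (paired by index), and plain scaled dot-product attention; the function name `kv_mixed_attention` and the toy vectors are hypothetical and are not taken from the paper. The point it shows is that the attention weights (where each prompt token lands) depend only on the keys from the small-bottleneck encoder, while the content that gets written comes only from the values of the larger-bottleneck encoder.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kv_mixed_attention(queries, keys, values):
    """Cross-attention whose keys and values come from different encoders.

    queries: image-token vectors, shape [n_query][d]
    keys:    coarse tokens (assumed: small-bottleneck encoder), shape [n_prompt][d]
    values:  detailed tokens (assumed: larger-bottleneck encoder), shape [n_prompt][d_v]
    Key i and value i are assumed to describe the same visual-prompt token.
    """
    if len(keys) != len(values):
        raise ValueError("each key needs a paired value")
    scale = 1.0 / math.sqrt(len(keys[0]))
    value_dim = len(values[0])
    outputs = []
    for q in queries:
        # Layout: attention weights are computed from the keys only.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Appearance: the output is a weighted average of the values only.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(value_dim)]
        )
    return outputs


# Toy example: two image tokens, two visual-prompt tokens.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[10.0, 0.0], [0.0, 10.0]]            # low-dimensional, layout-oriented
values = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # higher-dimensional, appearance-oriented
mixed = kv_mixed_attention(queries, keys, values)
```

In this toy setup the first query aligns with the first key, so its output is dominated by the first value vector; changing the values alters what is rendered without changing where attention goes.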
Source: http://arxiv.org/abs/2501.01424v1