Authors: Qiyao Liang, Ziming Liu, Mitchell Ostrow, Ila Fiete
Abstract: Diffusion models can generate photo-realistic images that combine
elements unlikely to appear together in the training set, demonstrating an
ability to generalize compositionally. Nonetheless, the precise mechanism of
compositionality and how it is acquired through training remain elusive.
Inspired by approaches from cognitive neuroscience, we consider
a highly reduced setting to examine whether and when diffusion models learn
semantically meaningful and factorized representations of composable features.
We performed extensive controlled experiments on conditional Denoising
Diffusion Probabilistic Models (DDPMs) trained to generate various forms of 2D
Gaussian data. We found that the models learn factorized but not fully
continuous manifold representations for encoding continuous features of
variation underlying the data. With such representations, the models demonstrate
superior feature compositionality but limited ability to interpolate over
unseen values of a given feature. Our experimental results further demonstrate
that diffusion models can attain compositionality with few compositional
examples, suggesting a more efficient way to train DDPMs. Finally, we connect
manifold formation in diffusion models to percolation theory in physics,
offering insight into the sudden onset of factorized representation learning.
These toy experiments thus contribute to a deeper understanding of how
diffusion models capture compositional structure in data.
Source: http://arxiv.org/abs/2408.13256v1
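
To make the toy setup concrete, below is a minimal sketch (not the authors' code) of a conditional DDPM trained on images of a single 2D Gaussian bump, where the bump's (x, y) center plays the role of the continuous, composable features the abstract describes. The architecture, noise schedule, image size, and all hyperparameters here are illustrative assumptions rather than details taken from the paper.

```python
# Hedged sketch: conditional DDPM on 2D Gaussian bump images.
# All names and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

SIZE, T = 32, 200  # image resolution and number of diffusion steps (assumed)

def gaussian_bump(cx, cy, sigma=2.0, size=SIZE):
    """Render one 2D Gaussian bump centered at (cx, cy) as a size x size image."""
    coords = torch.arange(size, dtype=torch.float32)
    ys, xs = torch.meshgrid(coords, coords, indexing="ij")
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

# Linear noise schedule for the DDPM forward process q(x_t | x_0).
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

class CondDenoiser(nn.Module):
    """Tiny MLP noise predictor, conditioned on the timestep and bump center."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(SIZE * SIZE + 3, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, SIZE * SIZE),
        )

    def forward(self, x_t, t, cond):
        # Concatenate flattened noisy image, normalized timestep, and (cx, cy).
        h = torch.cat([x_t.flatten(1), t[:, None].float() / T, cond], dim=1)
        return self.net(h)

model = CondDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(1000):
    # Bump centers drawn uniformly over the image: the composable features.
    cond = torch.rand(64, 2) * (SIZE - 1)
    x0 = torch.stack([gaussian_bump(cx, cy) for cx, cy in cond])
    t = torch.randint(0, T, (64,))
    eps = torch.randn_like(x0)
    ab = alpha_bars[t][:, None, None]
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps  # closed-form forward noising
    loss = ((model(x_t, t, cond) - eps.flatten(1)) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Compositional generalization can then be probed by sampling with (cx, cy)
# combinations that were held out of the training set.
```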