Authors: Zhihang Yuan, Yuzhang Shang, Hanling Zhang, Tongcheng Fang, Rui Xie, Bingxin Xu, Yan Yan, Shengen Yan, Guohao Dai, Yu Wang
Abstract: Recent advances in autoregressive (AR) models with continuous tokens for
image generation show promising results by eliminating the need for discrete
tokenization. However, these models face efficiency challenges due to their
sequential token generation nature and reliance on computationally intensive
diffusion-based sampling. We present ECAR (Efficient Continuous Auto-Regressive
Image Generation via Multistage Modeling), an approach that addresses these
limitations through two intertwined innovations: (1) a stage-wise continuous
token generation strategy that reduces computational complexity and provides
progressively refined token maps as hierarchical conditions, and (2) a
multistage flow-based distribution modeling method that transforms only
partially denoised distributions at each stage, in contrast to the complete denoising in
standard diffusion models. Holistically, ECAR operates by generating tokens at
increasing resolutions while simultaneously denoising the image at each stage.
This design not only reduces the token-to-image transformation cost by a factor equal to
the number of stages but also enables parallel processing at the token level. Our
approach enhances computational efficiency and aligns naturally
with image generation principles by operating in continuous token space and
following a hierarchical generation process from coarse to fine details.
Experimental results demonstrate that ECAR achieves image quality comparable to
DiT (Peebles & Xie, 2023) while requiring 10$\times$ fewer FLOPs and running
5$\times$ faster when generating a 256$\times$256 image.
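
The cost argument in the abstract can be illustrated with a toy count of token-level denoising work. This is only a sketch under stated assumptions, not the paper's implementation: the function names `stage_schedule` and `denoising_cost` are hypothetical, the token-map side length is assumed to double at every stage, the denoising steps are assumed to be split evenly across stages, and cost is measured in abstract "token-steps" rather than FLOPs.

```python
def stage_schedule(num_stages, final_side, total_steps):
    """Toy coarse-to-fine schedule: one (side, tokens, steps) tuple per stage.

    Assumptions (not from the paper): the token-map side doubles at each
    stage, and every stage performs an equal share of the denoising steps.
    """
    schedule = []
    for s in range(num_stages):
        side = final_side // 2 ** (num_stages - 1 - s)
        steps = total_steps // num_stages  # partial denoising per stage
        schedule.append((side, side * side, steps))
    return schedule


def denoising_cost(schedule):
    """Total token-steps: tokens at each stage times that stage's steps."""
    return sum(tokens * steps for _, tokens, steps in schedule)


# Hypothetical setting: 4 stages, a 16x16 final token map, 100 steps in total.
schedule = stage_schedule(num_stages=4, final_side=16, total_steps=100)
multistage = denoising_cost(schedule)

# Baseline: every token of the final map is fully denoised.
single_stage = 16 * 16 * 100

print(schedule)
print(multistage, single_stage, single_stage / multistage)
```

Under these assumptions most of the work falls on the final stage, which runs only a fraction of the full denoising schedule, so the total drops well below the single-stage baseline. The exact ratio depends on the resolution schedule and step allocation actually used.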
Source: http://arxiv.org/abs/2412.14170v1