Authors: Lei Zhu, Xinjiang Wang, Wayne Zhang, Rynson W. H. Lau
Abstract: Convolutions (Convs) and multi-head self-attentions (MHSAs) are typically
considered alternatives to each other for building vision backbones. Although
some works try to integrate both, they apply the two operators simultaneously
at the finest pixel granularity. With Convs responsible for per-pixel feature
extraction already, the question is whether we still need to include the heavy
MHSAs at such a fine-grained level. In fact, fine-grained MHSA is the root
cause of the scalability issue of vision transformers w.r.t. the input
resolution. To address this problem, we propose to use MHSAs and Convs
in parallel \textbf{at different granularity levels} instead. Specifically, in
each layer, we use two different ways to represent an image: a fine-grained
regular grid and a coarse-grained set of semantic slots. We apply different
operations to these two representations: Convs to the grid for local features,
and MHSAs to the slots for global features. A pair of fully differentiable soft
clustering and dispatching modules is introduced to bridge the grid and set
representations, thus enabling local-global fusion. Through extensive
experiments on various vision tasks, we empirically verify the potential of the
proposed integration scheme, named \textit{GLMix}: by offloading the burden of
fine-grained feature extraction to lightweight Convs, applying MHSAs to only a
few (e.g., 64) semantic slots is sufficient to match the performance of recent
state-of-the-art backbones while being more efficient. Our visualization
results also demonstrate that the soft clustering module produces a meaningful
semantic grouping effect with only IN1k classification supervision, which may
lead to better interpretability and inspire new weakly-supervised semantic
segmentation approaches. Code will be available at
\url{https://github.com/rayleizhu/GLMix}.
Source: http://arxiv.org/abs/2411.14429v1
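To make the scheme above concrete, below is a minimal PyTorch-style sketch of one GLMix-style layer: a lightweight depthwise Conv handles the fine-grained pixel grid, soft clustering pools pixels into a small set of slots, MHSA mixes the slots globally, and soft dispatching scatters slot features back to the grid. The class name, the learnable slot prototypes, and the softmax-based clustering/dispatching formulation are illustrative assumptions, not the authors' implementation; see the linked repository for the official code.

```python
# Sketch of a GLMix-style layer (assumed formulation, not the official code).
import torch
import torch.nn as nn


class GLMixBlock(nn.Module):
    def __init__(self, dim: int, num_slots: int = 64, num_heads: int = 8):
        super().__init__()
        # Local branch: lightweight depthwise Conv over the pixel grid.
        self.local_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        # Learnable slot prototypes used to softly cluster pixels (assumption).
        self.slot_proto = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        # Global branch: MHSA over the small set of semantic slots.
        self.slot_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Fuse local (grid) and global (dispatched slot) features.
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local = self.local_conv(x)                      # (B, C, H, W)

        # Soft clustering: softmax affinities between pixels and slot prototypes.
        feats = x.flatten(2).transpose(1, 2)            # (B, HW, C)
        affinity = feats @ self.slot_proto.t()          # (B, HW, K)
        assign = affinity.softmax(dim=-1)               # soft assignment over K slots
        slots = assign.transpose(1, 2) @ feats          # (B, K, C) weighted pooling
        slots = slots / (assign.sum(dim=1).unsqueeze(-1) + 1e-6)

        # Global mixing with MHSA on the coarse slot set (cost independent of HW).
        slots, _ = self.slot_attn(slots, slots, slots)  # (B, K, C)

        # Soft dispatching: send slot features back to the pixel grid.
        dispatched = assign @ slots                     # (B, HW, C)
        global_feat = dispatched.transpose(1, 2).reshape(b, c, h, w)

        return self.fuse(torch.cat([local, global_feat], dim=1))
```

The design choice the sketch illustrates: with a fixed, small number of slots (e.g., K = 64), the MHSA cost scales with K rather than with the number of pixels, so the quadratic attention term stays constant as the input resolution grows, while the per-pixel work is handled by the linear-cost Conv branch.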