Authors: Zhengang Li, Alec Lu, Yanyue Xie, Zhenglun Kong, Mengshu Sun, Hao Tang, Zhong Jia Xue, Peiyan Dong, Caiwen Ding, Yanzhi Wang, Xue Lin, Zhenman Fang
Abstract: Vision transformers (ViTs) have demonstrated their superior accuracy for
computer vision tasks compared to convolutional neural networks (CNNs).
However, ViT models are often too computation-intensive for efficient deployment on
resource-limited edge devices. This work proposes Quasar-ViT, a
hardware-oriented quantization-aware architecture search framework for ViTs, to
design efficient ViT models for hardware implementation while preserving
accuracy. First, Quasar-ViT trains a supernet using our row-wise flexible
mixed-precision quantization scheme, mixed-precision weight entanglement, and
supernet layer scaling techniques. Then, it applies an efficient
hardware-oriented search algorithm, integrated with hardware latency and
resource modeling, to determine a series of optimal subnets from the supernet under
different inference latency targets. Finally, we propose a series of
model-adaptive designs on the FPGA platform to support the architecture search
and mitigate the gap between the theoretical computation reduction and the
practical inference speedup. Our searched models achieve 101.5, 159.6, and
251.6 frames-per-second (FPS) inference speed on the AMD/Xilinx ZCU102 FPGA
with 80.4%, 78.6%, and 74.9% top-1 accuracy, respectively, on the ImageNet
dataset, consistently outperforming prior works.
Source: http://arxiv.org/abs/2407.18175v1
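
As a rough illustration of the latency-targeted subnet search summarized in the abstract, the sketch below samples per-layer architecture and bit-width choices and keeps the best candidate under a latency budget. The search space, analytical latency model, and accuracy proxy here are placeholder assumptions for illustration only, not Quasar-ViT's actual search algorithm, hardware model, or FPGA design.

```python
import random

# Hypothetical illustration only: a generic latency-constrained subnet search
# loop in the spirit of a hardware-oriented search. The candidate space,
# latency model, and accuracy proxy are placeholders, not the paper's method.

# Each layer chooses an embedding width, MLP ratio, and weight bit-width.
SEARCH_SPACE = {
    "embed_dim": [192, 256, 320],
    "mlp_ratio": [3, 4],
    "weight_bits": [4, 8],   # mixed-precision choices per layer (assumed)
}
NUM_LAYERS = 12


def sample_subnet():
    """Randomly sample one architecture configuration per layer."""
    return [
        {k: random.choice(v) for k, v in SEARCH_SPACE.items()}
        for _ in range(NUM_LAYERS)
    ]


def estimated_latency_ms(subnet):
    """Toy analytical latency model standing in for FPGA latency modeling."""
    cost = 0.0
    for layer in subnet:
        macs = layer["embed_dim"] ** 2 * layer["mlp_ratio"]
        # Assume lower bit-widths scale cost down proportionally.
        cost += macs * layer["weight_bits"] / 8 * 1e-6
    return cost


def proxy_accuracy(subnet):
    """Placeholder for evaluating a subnet with inherited supernet weights."""
    return sum(l["embed_dim"] * l["weight_bits"] for l in subnet)


def search(latency_target_ms, num_candidates=500):
    """Keep the best-scoring sampled subnet that meets the latency target."""
    best, best_score = None, float("-inf")
    for _ in range(num_candidates):
        subnet = sample_subnet()
        if estimated_latency_ms(subnet) > latency_target_ms:
            continue
        score = proxy_accuracy(subnet)
        if score > best_score:
            best, best_score = subnet, score
    return best


if __name__ == "__main__":
    # Latency targets roughly corresponding to ~251.6, ~159.6, and ~101.5 FPS.
    for target in (4.0, 6.3, 9.9):
        subnet = search(target)
        print(target, "ms ->", subnet[0] if subnet else "no feasible subnet")
```

In practice, the accuracy proxy would be replaced by evaluating each candidate with weights inherited from the trained supernet, and the latency model by the paper's hardware latency and resource modeling for the target FPGA.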