Authors: Md Kaykobad Reza, Niki Nezakati, Ameya Patil, Mashhour Solh, M. Salman Asif
Abstract: Multimodal learning often relies on designing new models and complex training
strategies to achieve optimal performance. We present Unified Unimodal
Adaptation (U2A), which jointly fine-tunes pretrained unimodal encoders using
low-rank adaptation (LoRA) for various multimodal tasks. Our method
significantly reduces the number of learnable parameters and eliminates the
need for complex training strategies, such as alternating training, gradient
modifications, or unimodal fine-tuning. To address missing modalities during
both training and testing, we introduce Mask Tokens (MT), which generate
missing-modality features from the available modalities using a single learnable token per
modality. This simplifies handling of missing modalities, removing the need for
specialized feature estimation or prompt-tuning methods. Our evaluation demonstrates that
U2A matches or outperforms state-of-the-art methods in both complete and
missing modality settings, showcasing strong performance and robustness across
various modalities, tasks, and datasets. We also analyze and report the
effectiveness of Mask Tokens in different missing modality scenarios. Overall,
our method provides a robust, flexible, and efficient solution for multimodal
learning, with minimal computational overhead.
Source: http://arxiv.org/abs/2501.17823v1
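To make the two components named in the abstract concrete, below is a minimal PyTorch sketch, not the authors' implementation: `LoRALinear`, `U2ASketch`, the toy linear encoders, dimensions, and fusion head are all hypothetical placeholders. In the paper the mask token generates missing-modality features from the available modalities; this simplified sketch only substitutes a learned per-modality token when an input is absent.

```python
# Minimal sketch of LoRA-adapted unimodal encoders plus per-modality mask tokens.
# All module names, shapes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen pretrained linear layer with a trainable low-rank (LoRA) update."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        # Low-rank factors; B starts at zero so the update is initially a no-op.
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)


class U2ASketch(nn.Module):
    """Hypothetical U2A-style model: LoRA-adapted unimodal encoders,
    one learnable mask token per modality, and a shared fusion head."""

    def __init__(self, modality_dims: dict, embed_dim: int = 256, num_classes: int = 10):
        super().__init__()
        # Stand-ins for pretrained unimodal encoders, each wrapped with LoRA.
        self.encoders = nn.ModuleDict({
            name: LoRALinear(nn.Linear(dim, embed_dim))
            for name, dim in modality_dims.items()
        })
        # One learnable mask token per modality, used when that modality is missing.
        self.mask_tokens = nn.ParameterDict({
            name: nn.Parameter(torch.zeros(embed_dim))
            for name in modality_dims
        })
        self.fusion = nn.Linear(embed_dim * len(modality_dims), num_classes)

    def forward(self, inputs: dict):
        feats = []
        for name, encoder in self.encoders.items():
            if inputs.get(name) is not None:
                feats.append(encoder(inputs[name]))
            else:
                # Substitute the learned mask token for the missing modality.
                batch = next(v.shape[0] for v in inputs.values() if v is not None)
                feats.append(self.mask_tokens[name].expand(batch, -1))
        return self.fusion(torch.cat(feats, dim=-1))


# Example: the audio modality is missing at test time.
model = U2ASketch({"image": 512, "audio": 128})
logits = model({"image": torch.randn(4, 512), "audio": None})
print(logits.shape)  # torch.Size([4, 10])
```

Only the LoRA factors, the mask tokens, and the fusion head are trainable here, which mirrors the abstract's claim of a small learnable-parameter budget over frozen unimodal encoders.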