Authors: Zhi Zhang, Srishti Yadav, Fengze Han, Ekaterina Shutova
Abstract: Recent advances in auto-regressive multimodal large language models
(MLLMs) have demonstrated promising progress on vision-language tasks. While
a variety of studies have investigated the processing of linguistic
information within large language models, little is currently known about the
inner workings of MLLMs and how linguistic and visual information interact
within these models. In this study, we aim to fill this gap by examining the
information flow between the two modalities, language and vision, in MLLMs,
focusing on visual question answering. Specifically, given an image-question
pair as input, we investigate where in the model and how visual and
linguistic information are combined to generate the final
prediction. Conducting experiments with models from the LLaVA series, we find
that the two modalities are integrated in two distinct stages. In the lower
layers, the model first transfers the more general visual features of the
whole image into the representations of the (linguistic) question tokens. In
the middle layers, it once again transfers visual information, now about the
specific objects relevant to the question, to the respective question token
positions. Finally, in the higher layers, the resulting multimodal
representation is propagated to the last position of the input sequence for
the final prediction. Overall, our findings provide a new and comprehensive
perspective on the spatial and functional aspects of image and language
processing in MLLMs, thereby facilitating future research into multimodal
information localization and editing.
Source: http://arxiv.org/abs/2411.18620v1
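The abstract does not spell out the probing procedure, but a common way to
trace this kind of cross-modal information flow is attention knockout:
cutting attention edges between chosen token positions at chosen layers and
measuring how much the prediction degrades. The sketch below illustrates that
idea on a toy attention stack; the dimensions, token layout, and the
attention_with_knockout / run_layers helpers are illustrative assumptions,
not the paper's code or the LLaVA API.

```python
# Minimal sketch of attention knockout on a toy attention stack (assumed
# setup, not the paper's implementation).
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def attention_with_knockout(q, k, v, blocked_edges=()):
    """Single-head causal self-attention over one sequence.

    blocked_edges: (query_pos, key_pos) pairs whose attention weight is
    forced to zero, i.e. the flow key_pos -> query_pos is cut in this layer.
    """
    T, d = q.shape
    scores = (q @ k.T) / d ** 0.5
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))
    for qi, ki in blocked_edges:
        scores[qi, ki] = float("-inf")
    return F.softmax(scores, dim=-1) @ v

def run_layers(h, layer_weights, blocked_layers=(), blocked_edges=()):
    """Run a toy stack of attention layers with residual connections;
    the given attention edges are knocked out only in `blocked_layers`."""
    for layer, (Wq, Wk, Wv) in enumerate(layer_weights):
        edges = blocked_edges if layer in blocked_layers else ()
        h = h + attention_with_knockout(h @ Wq, h @ Wk, h @ Wv, edges)
    return h

# Toy token layout: image tokens first, then question tokens, then the last
# position where the answer is produced (LLaVA-style prompt order).
image_pos, question_pos = range(0, 4), range(4, 8)
seq_len, d_model, n_layers = 9, 16, 8

h0 = torch.randn(seq_len, d_model)
layer_weights = [tuple(0.1 * torch.randn(d_model, d_model) for _ in range(3))
                 for _ in range(n_layers)]

# Cut every image -> question attention edge, but only in the lower layers,
# and compare the representation at the last position with an unblocked run.
blocked = [(qp, ip) for qp in question_pos for ip in image_pos]
base = run_layers(h0, layer_weights)
cut_low = run_layers(h0, layer_weights,
                     blocked_layers=range(0, 4), blocked_edges=blocked)

# In a real study this comparison would be the drop in the probability of the
# correct answer at the last position, computed per layer window.
print("effect of cutting image->question flow in lower layers:",
      (base[-1] - cut_low[-1]).norm().item())
```

Sweeping the blocked layer window across depth and repeating the comparison
for image-to-question, question-to-last, and image-to-last edges is what
would localize the layer ranges where each transfer described in the
abstract takes place.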