Authors: Muntasir Wahed, Kiet A. Nguyen, Adheesh Sunil Juvekar, Xinzhuo Li, Xiaona Zhou, Vedant Shah, Tianjiao Yu, Pinar Yanardag, Ismini Lourentzou
Abstract: Despite significant advancements in Large Vision-Language Models (LVLMs),
existing pixel-grounding models operate in single-image settings, limiting
their ability to perform detailed, fine-grained comparisons across multiple
images. Conversely, current multi-image understanding models lack pixel-level
grounding. Our work addresses this gap by introducing the task of multi-image
pixel-grounded reasoning segmentation, and PRIMA, a novel LVLM that integrates
pixel-level grounding with robust multi-image reasoning capabilities to produce
contextually rich, pixel-grounded explanations. Central to PRIMA is an
efficient vision module that queries fine-grained visual representations across
multiple images, reducing TFLOPs by $25.3\%$. To support training and
evaluation, we curate $M^4Seg$, a new reasoning segmentation benchmark
consisting of $\sim$224K question-answer pairs that require fine-grained visual
understanding across multiple images. Experimental results demonstrate that PRIMA
outperforms state-of-the-art baselines.
Source: http://arxiv.org/abs/2412.15209v1