Authors: Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li
Abstract: We present LLaVA-OneVision, a family of open large multimodal models (LMMs)
developed by consolidating our insights into data, models, and visual
representations in the LLaVA-NeXT blog series. Our experimental results
demonstrate that LLaVA-OneVision is the first single model that can
simultaneously push the performance boundaries of open LMMs in three important
computer vision scenarios: single-image, multi-image, and video scenarios.
Importantly, the design of LLaVA-OneVision allows strong transfer learning
across different modalities/scenarios, yielding new emerging capabilities. In
particular, strong video understanding and cross-scenario capabilities are
demonstrated through task transfer from images to videos.
Source: http://arxiv.org/abs/2408.03326v1