Authors: Yuanze Lin, Yunsheng Li, Dongdong Chen, Weijian Xu, Ronald Clark, Philip H. S. Torr
Abstract: We introduce Olympus, a new approach that transforms Multimodal Large
Language Models (MLLMs) into a unified framework capable of handling a wide
array of computer vision tasks. Utilizing a controller MLLM, Olympus delegates
over 20 specialized tasks across images, videos, and 3D objects to dedicated
modules. This instruction-based routing enables complex workflows through
chained actions, without training heavy generative models. Olympus
easily integrates with existing MLLMs, expanding their capabilities with
comparable performance. Experimental results demonstrate that Olympus achieves
an average routing accuracy of 94.75% across 20 tasks and precision of 91.82%
in chained action scenarios, showcasing its effectiveness as a universal task
router that can solve a diverse range of computer vision tasks. Project page:
https://github.com/yuanze-lin/Olympus_page
Source: http://arxiv.org/abs/2412.09612v1
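The controller-based routing the abstract describes can be sketched in miniature: a controller maps a user instruction to one or more routing tokens, and each token dispatches to a dedicated task module. The token names, keyword matching, and module registry below are illustrative assumptions for the sketch, not Olympus's actual tokens or implementation (the paper's controller is itself an MLLM, not a keyword matcher).

```python
# Hypothetical sketch of instruction-based routing: a controller turns an
# instruction into an ordered chain of routing tokens, each handled by a
# dedicated module. All names here are illustrative, not from the paper.

ROUTING_TOKENS = {
    "generate an image": "<image_gen>",
    "edit the image": "<image_edit>",
    "generate a video": "<video_gen>",
    "create a 3d object": "<3d_gen>",
}

def route(instruction: str) -> list[str]:
    """Return the ordered chain of routing tokens for an instruction."""
    instruction = instruction.lower()
    return [tok for phrase, tok in ROUTING_TOKENS.items()
            if phrase in instruction]

def dispatch(tokens: list[str], modules: dict) -> list[str]:
    """Invoke the dedicated module registered for each routing token."""
    return [modules[tok]() for tok in tokens]

# Chained-action example: one instruction triggers two specialist modules,
# mirroring the paper's "complex workflows through chained actions".
chain = route("Generate an image of a cat, then edit the image to add a hat")
print(chain)  # ['<image_gen>', '<image_edit>']
```

In the actual system the controller MLLM emits such routing decisions from the instruction itself, so new tasks can be added by registering a module rather than retraining a monolithic generative model.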