LLaVA-Octopus

Introduction

We present LLaVA-Octopus, a novel video multimodal large language model. LLaVA-Octopus adaptively weights features from different visual projectors based on user instructions, allowing it to leverage the complementary strengths of each projector. We observe that different visual projectors exhibit distinct characteristics when handling specific tasks: some excel at capturing static details, others are more effective at processing temporal information, and still others are better suited for tasks requiring temporal coherence. By dynamically adjusting feature weights according to user instructions, LLaVA-Octopus selects and combines the most suitable features, significantly enhancing performance on multimodal tasks. LLaVA-Octopus achieves excellent results across multiple benchmarks, especially in multimodal understanding, visual question answering, and video understanding, highlighting its broad application potential.

Why LLaVA-Octopus?

Because different MLLMs are designed for different video understanding scenarios, the projectors tailored to them exhibit distinct forms and characteristics.

We have observed that each kind of projector demonstrates unique advantages within its specialized domain. As shown in the figure above, we present three representative video understanding tasks, offering an intuitive illustration of three typical approaches that employ different specially designed visual projectors: LLaVA-OneVision uses an image-based projector, while VideoLLaMA2 and LLaMA-VID use a spatial-temporal projector and a token-compress projector, respectively. The results indicate that different visual projectors perform well in their respective domains while performing worse in other scenarios. We therefore present LLaVA-Octopus, an instruction-driven projector fusion paradigm. **LLaVA-Octopus** introduces a projector fusion gate that integrates the strengths of different visual projectors based on user instructions. In summary, LLaVA-Octopus adaptively adjusts the feature weights of the various visual projectors according to user instructions, thereby capitalizing on the complementary advantages of each projector.

What's the difference between LLaVA-Octopus and other paradigms?

In the classical paradigm, user instructions are fed into the LLM solely as text tokens. While the instruction-involved paradigm facilitates interaction between instructions and visual features, it is constrained by a single projector. Our proposed instruction-driven projector fusion paradigm introduces a projector fusion gate, which dynamically adjusts the weights of different types of visual projectors based on user instructions to produce the fused visual tokens.
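To make the paradigm concrete, below is a minimal PyTorch sketch of such an instruction-driven fusion gate: a pooled instruction embedding is mapped to softmax weights over the projector branches, and the fused visual tokens are the weighted sum of the branch outputs. This is an illustrative sketch only, not the official implementation; the module names, dimensions, mean pooling, and the assumption that all branches emit the same number of visual tokens are our own simplifications.

```python
# Illustrative sketch of an instruction-driven projector fusion gate.
# All names, shapes, and the pooling scheme are assumptions for exposition,
# not taken from the LLaVA-Octopus code base.
import torch
import torch.nn as nn


class ProjectorFusionGate(nn.Module):
    def __init__(self, instr_dim: int, hidden_dim: int, num_projectors: int):
        super().__init__()
        # Small MLP mapping a pooled instruction embedding to one logit per projector.
        self.gate = nn.Sequential(
            nn.Linear(instr_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, num_projectors),
        )

    def forward(self, instr_emb: torch.Tensor, projector_tokens: list[torch.Tensor]) -> torch.Tensor:
        # instr_emb: (B, L_text, instr_dim) token embeddings of the user instruction.
        # projector_tokens: list of (B, L_vis, hidden_dim) outputs, one per projector,
        # assumed here to be aligned to the same number of visual tokens.
        pooled = instr_emb.mean(dim=1)                              # (B, instr_dim)
        weights = self.gate(pooled).softmax(dim=-1)                 # (B, num_projectors)
        stacked = torch.stack(projector_tokens, dim=1)              # (B, P, L_vis, hidden_dim)
        fused = (weights[:, :, None, None] * stacked).sum(dim=1)    # (B, L_vis, hidden_dim)
        return fused  # fused visual tokens passed to the LLM alongside the text tokens


if __name__ == "__main__":
    B, L_text, L_vis, instr_dim, hidden_dim = 2, 16, 576, 4096, 4096
    gate = ProjectorFusionGate(instr_dim, hidden_dim, num_projectors=3)
    instr_emb = torch.randn(B, L_text, instr_dim)
    # Stand-ins for image-based, spatial-temporal, and token-compress projector outputs.
    branches = [torch.randn(B, L_vis, hidden_dim) for _ in range(3)]
    print(gate(instr_emb, branches).shape)  # torch.Size([2, 576, 4096])
```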

Performance

| Model Name | MSVD | ActivityNet | VideoChatGPT | MVBench | EgoSchema | MLVU | VideoMME |
|---|---|---|---|---|---|---|---|
| GPT4-V | - | 59.5 | 4.06 | 43.5 | 55.6 | - | 60.7 |
| LLaMA-Adapter | 54.9 | 34.2 | 2.7 | 31.7 | - | - | - |
| Video-LLaMA | 65.3 | 48.3 | 2.57 | 34.1 | - | - | - |
| VideoLLaMA2 | 70.9 | 50.2 | 3.13 | 54.6 | 51.7 | 48.5 | 46.6 |
| LLaMA-VID | 69.7 | 47.4 | 2.90 | - | 38.5 | 33.2 | - |
| VideoChat | 56.3 | 26.5 | 2.31 | 35.5 | - | - | - |
| VideoChat2 | 70.0 | 49.1 | 3.02 | 51.1 | 54.4 | 47.9 | 54.6 |
| LLaVA-Octopus | 74.3 | 53.4 | 3.19 | 66.9 | 59.2 | 57.5 | 54.7 |

Citation

@article{zhao2025llavaoctopus,
  title={LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding},
  author={Jiaxing Zhao and Boyuan Sun and Xiang Chen and Xihan Wei and Qibin Hou},
  journal={arXiv preprint arXiv:2501.05067},
  year={2025}
}