Small Vision-Language Models are Smart Compressors for Long Video Understanding

Meta AI · KAUST


Abstract

Adapting Multimodal Large Language Models (MLLMs) for hour-long video understanding is severely bottlenecked by context window limits. Dense visual streams quickly saturate input token budgets and exacerbate the lost-in-the-middle phenomenon. Existing query-agnostic heuristics—such as sparse sampling or uniform pooling—blindly sacrifice fidelity. They frequently discard transient decisive moments, blur fine-grained evidence, and waste representational bandwidth on irrelevant backgrounds.

To resolve this, we introduce Tempo, an efficient, query-aware framework that natively learns to compress long videos for downstream understanding. As its name suggests, Tempo acts as an intelligent temporal compressor that dynamically sets the rhythm at which the video is consumed. It leverages a Small Vision-Language Model (SVLM) to perform an early cross-modal distillation process, generating compact, intent-aligned video representations in a single forward pass.

To enforce strict inference budgets without breaking causality, we propose Adaptive Token Allocation (ATA). Exploiting the SVLM's inherent zero-shot relevance prior and empirical semantic front-loading, ATA acts as a training-free, O(1) dynamic router. It allocates dense, high-fidelity bandwidth to query-critical semantic beats while rapidly fast-forwarding redundancies into minimal temporal anchors to maintain the global storyline.

Extensive experiments demonstrate that our compact 6B architecture achieves state-of-the-art performance with aggressive dynamic compression (0.5–16 tokens/frame). On the extreme-long LVBench (4101s), Tempo scores 52.3 under a strict 8K visual token budget, outperforming proprietary baselines such as GPT-4o and Gemini 1.5 Pro. Crucially, empirical profiling reveals that Tempo frequently compresses hour-long videos to token counts substantially below theoretical limits, proving that true long-form multimodal understanding is best achieved not by greedily padding expansive context windows, but through intent-driven, dynamic allocation based on semantic necessity.

Methodology: The Tempo Framework

Tempo Architecture Diagram

Tempo resolves the structural mismatch between continuous visual streams and bounded LLM context windows by casting long video understanding as an end-to-end, query-aware cross-modal distillation process. Our framework consists of three core phases:

  • The Local Compressor (Left): We uniformly partition the video into temporal segments. For each segment, a Small Vision-Language Model (SVLM) processes the visual frames alongside the user query. Under causal attention, a fixed number of learnable memory tokens inherently distill the preceding visual semantics, discarding query-irrelevant backgrounds early on.
  • Inference-Only Bypass via ATA (Middle): To enforce strict global token budgets (e.g., 8K) during inference, we introduce Adaptive Token Allocation (ATA). In a single forward pass, ATA intercepts a zero-shot relevance score directly from the SVLM. It dynamically dictates an O(1) head truncation—allocating dense bandwidth to query-critical segments while aggressively compressing redundancies down to as few as 4 tokens for minimal temporal anchors.
  • The Global Decoder (Right): The compressed, highly filtered memory tokens are assembled into a sparse sequence using explicit temporal tags (e.g., <t=2.0s>). A large global LLM synthesizes this condensed multimodal context to generate the final, accurate response without suffering from attention dilution.
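The budget-constrained routing behind ATA can be pictured with a short sketch. This is a minimal illustration under our own assumptions, not Tempo's actual implementation: the function name `allocate_tokens` and the proportional-allocation rule are hypothetical, standing in for the paper's O(1) head-truncation mechanism, while the 4-token floor and 16-token ceiling follow the numbers quoted above.

```python
def allocate_tokens(relevance, budget, min_toks=4, max_toks=16):
    """Distribute a global visual token budget across segments in
    proportion to their zero-shot relevance scores. Every segment keeps
    at least min_toks as a minimal temporal anchor; query-critical
    segments are capped at max_toks of dense bandwidth."""
    n = len(relevance)
    alloc = [min_toks] * n                 # minimal anchors preserve the storyline
    remaining = budget - min_toks * n
    if remaining <= 0:
        return alloc                       # budget too tight: anchors only
    total = sum(relevance)
    for i, r in enumerate(relevance):
        # hand out the leftover budget proportionally to relevance
        extra = int(remaining * r / total) if total > 0 else 0
        alloc[i] = min(max_toks, min_toks + extra)
    return alloc

# 5 segments under a 40-token budget: the two query-critical segments
# receive dense allocations, the rest stay near the 4-token anchor
print(allocate_tokens([0.9, 0.1, 0.05, 0.8, 0.15], budget=40))
```

Because the allocation for each segment depends only on its own score and pre-computed totals, a rule of this shape adds negligible overhead on top of the SVLM forward pass.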

Video Demonstrations

Qualitative Analysis: Visualizing Intent-Driven Sparsity

To demonstrate how Tempo dynamically allocates token bandwidth, we visualize the zero-shot relevance scores alongside the generated memory tokens. Tempo natively assigns dense anchors to query-critical moments while aggressively compressing redundant backgrounds.
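For intuition, the sparse sequence that this allocation produces for the global decoder can be sketched as follows. The `<t=…s>` tag format follows the example given in the methodology; the function name `assemble_sequence` and the placeholder token strings are our illustrative assumptions, not Tempo's internal representation:

```python
def assemble_sequence(segments):
    """Interleave explicit temporal tags with each segment's compressed
    memory tokens, yielding the sparse context the global decoder reads."""
    parts = []
    for start_s, mem_tokens in segments:
        parts.append(f"<t={start_s:.1f}s>")  # explicit temporal tag
        parts.extend(mem_tokens)             # that segment's memory tokens
    return parts

# a query-critical segment keeps many tokens; a redundant one keeps
# only the minimal 4-token temporal anchor
seq = assemble_sequence([
    (0.0, [f"m0_{i}" for i in range(16)]),   # dense: decisive moment
    (2.0, [f"m1_{i}" for i in range(4)]),    # sparse: background
])
```

The explicit tags let the decoder recover the global timeline even though most of the raw frames between anchors have been compressed away.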

Quantitative Results

Comparison with state-of-the-art MLLMs on long video benchmarks, highlighting Tempo's superior accuracy and extreme token efficiency. Notably, while we set a theoretical dynamic range of 0.5–16 tokens per frame, empirical profiling reveals that Tempo operates substantially below the maximum budget in practice (shown in gray rows).

| Model | Size | Tokens / Frame | LongVideoBench (473s) | MLVU (651s) | Video-MME Overall (1010s) | Video-MME Long (2386s) | LVBench (4101s) |
|---|---|---|---|---|---|---|---|
| *Proprietary Models* | | | | | | | |
| GPT-4o | - | - | 66.7 | 64.6 | 71.9 | 65.3 | 30.8 |
| Gemini 1.5 Pro | - | - | 64.0 | - | 75.0 | 67.4 | 33.1 |
| *General Open-Source MLLMs* | | | | | | | |
| VideoChat2-HD | 7B | 72 | - | 47.9 | 45.3 | 39.8 | - |
| LLaVA-OneVision | 7B | 196 | 56.4 | 64.7 | 58.2 | - | - |
| LLaVA-Video | 7B | 676 | 58.2 | 70.8 | 63.3 | - | - |
| VideoLLaMA3* | 7B | ≤ 91 | 59.8 | 73.0 | 66.2 | 54.9 | 45.3 |
| InternVL3.5 | 8B | 256 | 62.1 | 70.2 | 66.0 | - | - |
| Molmo2 | 8B | 83 | 67.5 | - | 69.9 | - | 52.8 |
| Qwen2.5-VL | 7B | 1924 | 56.0 | 70.2 | 65.1 | - | 45.3 |
| Qwen3-VL* | 2B | ≤ 640 | - | 68.3 | 61.9 | - | 47.4 |
| Qwen3-VL* | 8B | ≤ 640 | - | 78.1 | 71.4 | - | 58.0 |
| *Specialized Long Video MLLMs* | | | | | | | |
| LLaMA-VID | 7B | 2 | - | 33.2 | 25.9 | - | 23.9 (13B) |
| LongVA | 7B | 144 | - | 56.3 | 52.6 | 46.2 | - |
| Kangaroo | 8B | 256 | 54.8 | 61.0 | 56.0 | 46.7 | 39.4 |
| LongLLaVA | A13B | 144 | 53.5 | - | 53.8 | 46.4 | - |
| LongVILA | 7B | 196 | 57.1 | - | 60.1 | 47.0 | - |
| LongVU | 7B | 64 | - | 65.4 | 60.6 | 59.5 | - |
| Storm | 7B | 64 | 60.5 | 72.9 | 63.4 | 53.4 | - |
| BIMBA | 7B | 36 | 59.5 | 71.4 | 64.7 | - | - |
| VideoChat-Flash | 7B | 16 | 64.7 | 74.7 | 65.3 | 55.4 | 48.2 |
| **Tempo\* (4K Budget)** | 6B | 0.5–16 | 64.5 | 75.6 | 67.8 | 57.8 | 52.7 |
| ↳ actual avg. toks/frame | | | 2.8 | 2.8 | 3.6 | 3.4 | 2.9 |
| **Tempo\* (8K Budget)** | 6B | 0.5–16 | 65.1 | 75.2 | 67.7 | 57.0 | 52.3 |
| ↳ actual avg. toks/frame | | | 3.1 | 3.3 | 4.3 | 4.1 | 3.5 |

BibTeX

If you find our work useful for your research and applications, please consider citing our paper:

@article{fei2026small,
  title={Small Vision-Language Models are Smart Compressors for Long Video Understanding},
  author={Fei, Junjie and Chen, Jun and Liu, Zechun and Xiong, Yunyang and Zhou, Chong and Wen, Wei and Han, Junlin and Zhuge, Mingchen and Suri, Saksham and Qian, Qi and others},
  journal={arXiv preprint arXiv:2604.08120},
  year={2026}
}

Acknowledgements

Junjie Fei, Mingchen Zhuge, Shuming Liu, and Mohamed Elhoseiny were supported by funding from the King Abdullah University of Science and Technology (KAUST) Center of Excellence for Generative AI.

We extend our sincere gratitude to the authors of LVBench, Video-MME, MLVU, and LongVideoBench for providing the invaluable long-video evaluation benchmarks that made this research possible.

Our codebase is built upon the excellent foundations of LongVU, VideoChat-Flash, LLaVA, and Qwen3-VL. Furthermore, our models are initialized using the powerful pre-trained weights from Qwen3-VL and Qwen3-LM. We also thank the Nerfies team for open-sourcing their beautiful project page template.

License & Disclaimer

Framework & Weights: The Tempo framework's source code is open-sourced under the Apache-2.0 License to foster community research and development. However, the pre-trained model weights and checkpoints are distributed strictly for academic and non-commercial research purposes. Any commercial use is explicitly prohibited without prior written consent.

Video & Data Assets: All visual media, including video clips and images showcased on this project page and within our evaluation pipelines, are utilized under the doctrine of Fair Use exclusively for non-profit academic research and scientific illustration. We claim no ownership over the original, copyrighted media assets.

Take-down Notice: We deeply respect the intellectual property rights of creators. If you are a copyright holder and believe that any content hosted here infringes upon your rights, please contact us at junjiefei@outlook.com or open an issue on our GitHub repository. We will promptly investigate and remove the identified content upon verification.