Welcome to Junjie Fei’s Homepage!

I am currently a second-year PhD student at King Abdullah University of Science and Technology (KAUST), advised by Prof. Mohamed Elhoseiny. My research interests lie in vision-language learning and long visual context modeling. Recently, I completed a Research Scientist Internship at Meta AI, hosted by Dr. Chenchen Zhu, where my work focused on long video understanding.

Prior to KAUST, I received my BS and MS degrees from Chongqing University and Xiamen University, respectively. I also gained research experience as a research assistant and visiting scholar at SUSTech VIP Lab and KAUST Vision CAIR. For more details, please refer to my CV.

💡 Open to collaboration: I am actively seeking research internship opportunities! I also welcome talks, discussions, and collaborations within the multimodal learning community. Feel free to reach out at junjiefei@outlook.com or junjie.fei@kaust.edu.sa.

News

  • [2026/04] Project Tempo from my Meta AI internship is publicly released!
  • [2025/09] One paper has been accepted by NeurIPS 2025!
  • [2025/09] Joined Meta AI as a Research Scientist Intern!
  • [2025/06] Two papers have been accepted by ICCV 2025!
  • [2025/02] One paper has been accepted by CVPR 2025!
  • [2024/08] Joined KAUST as a PhD student!
  • [2023/07] One paper has been accepted by ICCV 2023!
  • [2023/04] Project Caption Anything is publicly released!

Experience

Meta AI
Research Scientist Intern | Sep. 2025 - Feb. 2026

Vision CAIR Research Group, KAUST
Visiting Scholar | Jan. 2024 - May 2024

VIP (Visual Intelligence & Perception) Lab, SUSTech
Visiting Scholar / Research Assistant | Oct. 2022 - Jan. 2024

Research

(* equal contribution)

Small Vision-Language Models are Smart Compressors for Long Video Understanding
Junjie Fei, Jun Chen, Zechun Liu, Yunyang Xiong, Chong Zhou, Wei Wen, Junlin Han, Mingchen Zhuge, Saksham Suri, Qi Qian, Shuming Liu, Lemeng Wu, Raghuraman Krishnamoorthi, Vikas Chandra, Mohamed Elhoseiny, Chenchen Zhu
arXiv, 2026
project / code / paper / demo

Tempo is a query-aware framework that uses a small vision-language model (SVLM) as a temporal compressor: in a single forward pass, it adaptively distills long-form video into a compact set of semantically rich tokens. Tempo achieves state-of-the-art performance on LVBench while substantially reducing the computational cost of processing hour-long videos.
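
For intuition, a minimal Python sketch of the two-stage, query-aware compression idea follows; every name here (toy_relevance, compress, answer) is a hypothetical placeholder for illustration, not Tempo's actual interface, and the scoring function merely stands in for the SVLM's learned relevance model.

    from typing import List

    def toy_relevance(token: float, query: str) -> float:
        # Stand-in for the SVLM scoring a visual token against the text query;
        # in practice tokens are embeddings, not floats.
        return token * (1 + len(query) % 3)

    def compress(frames: List[List[float]], query: str, budget: int) -> List[float]:
        # Stage 1: the small VLM distills all frame tokens down to `budget`
        # query-relevant tokens in one pass over the whole video.
        flat = [t for frame in frames for t in frame]
        return sorted(flat, key=lambda t: toy_relevance(t, query), reverse=True)[:budget]

    def answer(compressed: List[float], query: str) -> str:
        # Stage 2: the large model reasons over the compact token set only,
        # so its cost no longer grows with video length.
        return f"answer to {query!r} from {len(compressed)} tokens"

    # An "hour-long video": 10,000 frames of 4 tokens each, reduced to 256.
    video = [[j + 0.1 * i for i in range(4)] for j in range(10_000)]
    query = "when does the goal happen?"
    print(answer(compress(video, query, budget=256), query))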

Kestrel: 3D Multimodal LLM for Part-Aware Grounded Description
Mahmoud Ahmed*, Junjie Fei*, Jian Ding, Eslam Mohamed Bakr, Mohamed Elhoseiny
ICCV, 2025
project / paper

Kestrel is a part-aware 3D multimodal LLM for point grounding: it generates grounded descriptions of an object and localizes the object and its materials at the part level.

WikiAutoGen: Towards Multi-Modal Wikipedia-Style Article Generation
Zhongyu Yang*, Jun Chen*, Dannong Xu, Junjie Fei, Xiaoqian Shen, Liangbing Zhao, Chun-Mei Feng, Mohamed Elhoseiny
ICCV, 2025
project / code / paper

WikiAutoGen is a system for automated multimodal Wikipedia-style article generation: it retrieves and integrates relevant images alongside text to enhance both the depth and visual appeal of the generated articles.

Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents
Jun Chen*, Dannong Xu*, Junjie Fei*, Chun-Mei Feng, Mohamed Elhoseiny
CVPR, 2025
code / paper / benchmark

The Document Haystack benchmarks evaluate how well VLMs retrieve and understand information across large-scale collections of 1000+ visual documents.

Transferable Decoding with Visual Entities for Zero-Shot Image Captioning
Junjie Fei*, Teng Wang*, Jinrui Zhang, Zhenyu He, Chengjie Wang, Feng Zheng
ICCV, 2023
code / paper

This work improves the transferability of zero-shot captioning to out-of-domain images by mitigating the modality bias and object hallucination that arise when pre-trained vision-language models are paired with large language models.

Caption Anything: Interactive Image Description with Diverse Multimodal Controls
Teng Wang*, Jinrui Zhang*, Junjie Fei*, Hao Zheng, Yunlong Tang, Zhe Li, Mingqi Gao, Shanshan Zhao
arXiv, 2023
code / paper / demo

Caption Anything is an interactive image-to-text tool that generates descriptions for any user-specified object within an image, offering a range of language styles and visual controls to match diverse user preferences.

Academic Services

Conference Reviewer

CVPR, ECCV, NeurIPS, ICML, ICLR

Journal Reviewer

IEEE TMM, Neurocomputing, CVIU