Junjie Fei’s Homepage

I am currently a PhD student at King Abdullah University of Science and Technology (KAUST), under the supervision of Prof. Mohamed Elhoseiny. Before that, I obtained my BS and MS degrees from Chongqing University and Xiamen University, respectively. I also gained valuable research experience as a visiting student / research assistant at the SUSTech VIP Lab and the KAUST Vision-CAIR group. Please refer to my CV for more details.

My recent research focuses on vision-language multimodal learning. Feel free to drop me an email at junjiefei@outlook.com / junjie.fei@kaust.edu.sa if you are interested in collaborating.

News

  • [2025/06] 2 papers have been accepted by ICCV 2025!
  • [2025/02] 1 paper has been accepted by CVPR 2025!
  • [2024/08] Joined KAUST as a PhD student!
  • [2023/07] 1 paper has been accepted by ICCV 2023!
  • [2023/04] Project Caption Anything is publicly released!

Research

(* equal contribution)

Kestrel: Point Grounding Multimodal LLM for Part-Aware 3D Vision-Language Understanding
Junjie Fei*, Mahmoud Ahmed*, Jian Ding, Eslam Mohamed Bakr, Mohamed Elhoseiny
ICCV, 2025
project / paper

Kestrel is a part-aware, point-grounding 3D MLLM that comprehends and generates language while localizing objects and their materials at the part level.

WikiAutoGen: Towards Multi-Modal Wikipedia-Style Article Generation
Zhongyu Yang*, Jun Chen*, Dannong Xu, Junjie Fei, Xiaoqian Shen, Liangbing Zhao, Chun-Mei Feng, Mohamed Elhoseiny
ICCV, 2025
project / code / paper

WikiAutoGen is a novel system for automated multimodal Wikipedia-style article generation, retrieving and integrating relevant images alongside text to enhance both the depth and visual appeal of the generated content.

Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents
Jun Chen*, Dannong Xu*, Junjie Fei*, Chun-Mei Feng, Mohamed Elhoseiny
CVPR, 2025
code / paper / benchmark

The Document Haystacks benchmarks evaluate the performance of VLMs on large-scale visual document retrieval and understanding.

Transferable Decoding with Visual Entities for Zero-Shot Image Captioning
Junjie Fei*, Teng Wang*, Jinrui Zhang, Zhenyu He, Chengjie Wang, Feng Zheng
ICCV, 2023
code / paper

This work improves the transferability of zero-shot captioning to out-of-domain images by addressing the modality bias and object hallucination that arise when adapting pre-trained vision-language models and large language models.

Caption Anything: Interactive Image Description with Diverse Multimodal Controls
Teng Wang*, Jinrui Zhang*, Junjie Fei*, Hao Zheng, Yunlong Tang, Zhe Li, Mingqi Gao, Shanshan Zhao
arXiv, 2023
code / paper / demo

Caption Anything is an interactive image-to-text tool that generates diverse descriptions for any user-specified object within an image, offering a variety of language styles and visual controls to cater to different user preferences.

Academic Services

Conference reviewer for NeurIPS, ICLR, ICML

Journal reviewer for IEEE TMM, Neurocomputing