While 3D multimodal large language models (MLLMs) have achieved significant progress, they remain restricted to object- and scene-level understanding and struggle to interpret 3D spatial structures at the part level. In this paper, we introduce Kestrel, a part-aware point grounding MLLM that empowers 3D MLLMs with part-aware understanding, enabling better interpretation and segmentation grounding of 3D objects at the part level. Despite its significance, the current landscape lacks tasks and datasets that endow and assess the part-aware understanding ability of 3D MLLMs. To address this, we propose two novel tasks: Part-Aware Point Grounding and Part-Aware Point Grounded Captioning. In Part-Aware Point Grounding, the model directly predicts a part-level segmentation mask based on user instructions. In Part-Aware Point Grounded Captioning, the model produces a detailed caption that includes part-level descriptions, where each part-level description in the answer corresponds to a segmentation mask. To support learning and evaluation of the proposed tasks, we introduce two versions of the 3DCoMPaT Grounded Instructions Dataset (3DCoMPaT-GRIN). 3DCoMPaT-GRIN Vanilla, comprising 789k part-aware point cloud-instruction-segmentation mask triplets, evaluates MLLMs' ability to perform part-aware segmentation grounding based on user instructions. 3DCoMPaT-GRIN Grounded Caption, containing 107k part-aware point cloud-instruction-grounded caption triplets, assesses both part-aware language comprehension and segmentation grounding capabilities. Our introduced tasks, dataset, and Kestrel represent a preliminary effort to bridge the gap between human cognition and 3D MLLMs, i.e., the ability to perceive and engage with the environment at both global and part levels. Extensive experiments on 3DCoMPaT-GRIN show that Kestrel can accurately generate user-specified segmentation masks, a capability not present in any existing 3D MLLM.
Kestrel thus establishes a benchmark for evaluating part-aware language comprehension and segmentation grounding of 3D objects.
Kestrel comprises a 3D vision-language module and a 3D segmentation grounding module. The vision-language module projects the input point cloud and text instruction into language hidden states; decoding these hidden states yields a detailed caption with part-level descriptions. Each grounded part-level description in the answer (e.g., backrest, legs, ...) emits a [SEG] token, and a projection layer maps the hidden states of the [SEG] tokens to the queries of the segmentation grounding decoder. The decoder also takes as input the point features extracted by the segmentation grounding encoder and predicts the segmentation mask indicated by each [SEG] token.
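The grounding path described above can be sketched as follows. This is a minimal illustration, not the released implementation: the module names, hidden sizes, the [SEG] token id, and the single-layer stand-in decoder are all assumptions for clarity.

```python
import torch
import torch.nn as nn

SEG_TOKEN_ID = 32000  # assumed id of the special [SEG] token
HIDDEN, QUERY_DIM, N_POINTS, N_FEAT = 4096, 256, 2048, 256

class GroundingHead(nn.Module):
    """Maps [SEG] hidden states to decoder queries and predicts per-point masks."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(HIDDEN, QUERY_DIM)  # projection layer for [SEG] states
        # Stand-in for the segmentation grounding decoder.
        self.decoder = nn.TransformerDecoderLayer(
            d_model=QUERY_DIM, nhead=8, batch_first=True)

    def forward(self, hidden_states, token_ids, point_feats):
        # Gather the LLM hidden state at every [SEG] position in the answer.
        seg_hidden = hidden_states[token_ids == SEG_TOKEN_ID]   # (n_seg, HIDDEN)
        queries = self.proj(seg_hidden).unsqueeze(0)            # (1, n_seg, QUERY_DIM)
        # Cross-attend the queries over the encoder's point features.
        queries = self.decoder(queries, point_feats.unsqueeze(0))
        # Dot-product each query with every point feature -> one mask per [SEG].
        masks = torch.einsum("bqd,bpd->bqp", queries, point_feats.unsqueeze(0))
        return masks.sigmoid()                                  # (1, n_seg, N_POINTS)

head = GroundingHead()
hidden = torch.randn(12, HIDDEN)                  # LLM hidden states for 12 tokens
ids = torch.randint(0, 100, (12,))
ids[3] = SEG_TOKEN_ID; ids[7] = SEG_TOKEN_ID      # two grounded parts in the answer
pts = torch.randn(N_POINTS, N_FEAT)               # segmentation-encoder point features
masks = head(hidden, ids, pts)                    # one per-point mask per [SEG] token
```

The key design point is that the language model never predicts mask geometry directly: it only emits [SEG] tokens, whose hidden states are projected into queries that the segmentation decoder resolves against the point features.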
Here, we provide some examples from 3DCoMPaT-GRIN Grounded Caption. Each label in this dataset is a grounded caption: a multimodal caption comprising a detailed description and segmentation masks (part and material masks). The positional tokens <p> and </p> mark the part-level information in the caption that needs to be grounded.
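A grounded caption record might look like the following sketch. The field names, file names, and caption text are hypothetical placeholders; the released data format may differ.

```python
import re

# Hypothetical 3DCoMPaT-GRIN Grounded Caption record (fields assumed for illustration).
record = {
    "point_cloud": "chair_0042.npy",
    "instruction": "Describe this object in detail and ground each part.",
    "grounded_caption": (
        "This is a wooden chair. It has a curved <p>backrest</p> [SEG], "
        "four slender <p>legs</p> [SEG], and a padded <p>seat</p> [SEG]."
    ),
    # One segmentation mask per grounded part-level phrase.
    "part_masks": ["mask_backrest.npy", "mask_legs.npy", "mask_seat.npy"],
}

# Each <p>...</p> span marks a part-level phrase; the [SEG] token that follows
# it is tied to one segmentation mask over the input point cloud.
parts = re.findall(r"<p>(.*?)</p>", record["grounded_caption"])
assert len(parts) == len(record["part_masks"])  # one mask per grounded phrase
```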
@article{fei2024kestrel,
  title={Kestrel: Point Grounding Multimodal LLM for Part-Aware 3D Vision-Language Understanding},
  author={Fei, Junjie and Ahmed, Mahmoud and Ding, Jian and Bakr, Eslam Mohamed and Elhoseiny, Mohamed},
  journal={arXiv preprint arXiv:2405.18937},
  year={2024}
}