Kestrel: A Point Grounding Multimodal LLM for Part-Aware 3D Vision-Language Understanding

King Abdullah University of Science and Technology
Preprint

*Indicates Equal Contribution

Grounded 3D Descriptions with Kestrel. Kestrel is a part-aware point grounding multimodal large language model (MLLM) capable of comprehending natural language and grounding the position of an object's parts and materials. (a) Kestrel responds to user instructions accurately even at the part level, an ability that none of the previous 3D MLLMs possess. (b) Kestrel can generate detailed descriptions and ground the object parts mentioned in the response. (c) Kestrel enables dialogue and reasoning over part-level information.

Abstract

While 3D multimodal large language models (MLLMs) have achieved significant progress, they are restricted to object- and scene-level understanding and struggle to understand 3D spatial structures at the part level. In this paper, we introduce Kestrel, a part-aware point grounding MLLM, representing a novel approach that empowers 3D MLLMs with part-aware understanding, enabling better interpretation and segmentation grounding of 3D objects at the part level. Despite its significance, the current landscape lacks tasks and datasets that endow and assess the part-aware understanding ability of 3D MLLMs. To address this, we propose two novel tasks: Part-Aware Point Grounding and Part-Aware Point Grounded Captioning. In Part-Aware Point Grounding, the model is tasked with directly predicting a part-level segmentation mask based on user instructions. In Part-Aware Point Grounded Captioning, the model provides a detailed caption that includes part-level descriptions, where each part-level description in the answer corresponds to a segmentation mask. To support learning and evaluation of the proposed tasks, we introduce two versions of the 3DCoMPaT Grounded Instructions Dataset (3DCoMPaT-GRIN). 3DCoMPaT-GRIN Vanilla, comprising 789k part-aware point cloud-instruction-segmentation mask triplets, is used to evaluate MLLMs' ability to perform part-aware segmentation grounding based on user instructions. 3DCoMPaT-GRIN Grounded Caption, containing 107k part-aware point cloud-instruction-grounded caption triplets, assesses both MLLMs' part-aware language comprehension and segmentation grounding capabilities. Our introduced tasks, dataset, and Kestrel represent a preliminary effort to bridge the gap between human cognition and 3D MLLMs, i.e., the ability to perceive and engage with the environment at both global and part levels. Extensive experiments on 3DCoMPaT-GRIN show that Kestrel can accurately generate user-specified segmentation masks, a capability not present in any existing 3D MLLM. Kestrel thus establishes a benchmark for evaluating part-aware language comprehension and segmentation grounding of 3D objects.
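Since 3DCoMPaT-GRIN Vanilla is used to score segmentation grounding, a natural metric is point-wise intersection-over-union between a predicted part mask and the ground truth. The snippet below is a minimal Python sketch of such a metric; it illustrates one common scoring choice and is not necessarily the exact evaluation protocol used in the paper.

        import numpy as np

        def point_mask_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
            """IoU between two binary per-point masks of shape (num_points,)."""
            pred = pred_mask.astype(bool)
            gt = gt_mask.astype(bool)
            intersection = np.logical_and(pred, gt).sum()
            union = np.logical_or(pred, gt).sum()
            return float(intersection) / float(union) if union > 0 else 1.0

        # Toy example: 6 points, grounding a "backrest" part.
        pred = np.array([1, 1, 0, 0, 1, 0])
        gt = np.array([1, 1, 1, 0, 0, 0])
        print(point_mask_iou(pred, gt))  # 0.5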

Model

Kestrel consists of a 3D vision-language module and 3D segmentation grounding modules. The vision-language module fVL projects the input point cloud and text instruction into language hidden states. Decoding these hidden states yields a detailed caption with part-level descriptions. Each grounded part-level description in the answer (e.g., backrest, legs, ...) emits a [SEG] token, and the projection layer fP maps the hidden states of the [SEG] tokens to queries for the segmentation grounding decoder fD. The decoder also takes as input the point features extracted by the segmentation grounding encoder fE and predicts the segmentation mask corresponding to each [SEG] token.
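To make this data flow concrete, the following is a minimal PyTorch-style sketch of the forward pass described above. The module interfaces, tensor shapes, and the way [SEG] positions are gathered are assumptions made for illustration; this is not the released Kestrel implementation.

        # Illustrative sketch of Kestrel's forward pass (assumed interfaces).
        import torch.nn as nn

        class KestrelSketch(nn.Module):
            def __init__(self, f_vl, f_p, f_e, f_d, seg_token_id):
                super().__init__()
                self.f_vl = f_vl  # vision-language module: (points, text) -> (token ids, hidden states)
                self.f_p = f_p    # projection layer for [SEG] hidden states
                self.f_e = f_e    # segmentation grounding encoder: points -> point features
                self.f_d = f_d    # segmentation grounding decoder: (queries, features) -> masks
                self.seg_token_id = seg_token_id

            def forward(self, points, instruction_ids):
                # 1) The vision-language module produces output token ids and hidden states.
                output_ids, hidden_states = self.f_vl(points, instruction_ids)
                # 2) Gather the hidden state at every [SEG] position, one per grounded part.
                seg_hidden = hidden_states[output_ids == self.seg_token_id]  # (num_parts, d_model)
                # 3) Project the [SEG] hidden states into decoder queries.
                queries = self.f_p(seg_hidden)                               # (num_parts, d_query)
                # 4) Encode the point cloud and decode one mask per query.
                point_features = self.f_e(points)                            # (num_points, d_feat)
                masks = self.f_d(queries, point_features)                    # (num_parts, num_points)
                # The caption is decoded from output_ids; each [SEG] aligns with one mask.
                return output_ids, masks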


3DCoMPaT-GRIN

Here, we provide some examples from 3DCoMPaT-GRIN Grounded Caption. Each label in this dataset is a grounded caption: a multimodal caption comprising a detailed description and segmentation masks (part and material masks). Positional tokens <p> and </p> mark the part-level information that needs to be grounded in the caption.
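To make the format concrete, a single grounded-caption sample might be organized as in the sketch below. The field names, file names, and the positional alignment between <p>...</p> spans and mask files are illustrative assumptions, not the dataset's official schema.

        # Hypothetical layout of one 3DCoMPaT-GRIN Grounded Caption sample
        # (field and file names are assumed for illustration).
        sample = {
            "point_cloud": "chair_0231.npy",  # N x 6 array: xyz + rgb
            "instruction": "Describe this object in detail and ground each part.",
            "grounded_caption": (
                "This is a wooden chair. It has a curved <p>backrest</p>, "
                "a padded <p>seat</p> made of <p>fabric</p>, and four straight <p>legs</p>."
            ),
            # One binary per-point mask for each <p>...</p> span, in order of appearance.
            "masks": ["backrest.npy", "seat.npy", "fabric.npy", "legs.npy"],
        }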


Demos

BibTeX


        @article{fei2024kestrel,
          title={Kestrel: Point Grounding Multimodal LLM for Part-Aware 3D Vision-Language Understanding},
          author={Fei, Junjie and Ahmed, Mahmoud and Ding, Jian and Bakr, Eslam Mohamed and Elhoseiny, Mohamed},
          journal={arXiv preprint arXiv:2405.18937},
          year={2024}
        }