InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning

Robotics Institute, Carnegie Mellon University
ACL 2025 Main

Highlights

  • To the best of our knowledge, we introduce the first dataset that bridges task-oriented interactions with part segmentation for common household tasks.
  • We rigorously evaluate various vision-language models on our dataset, revealing their limitations in fine-grained recognition with language reasoning.
  • We fine-tune a simple baseline based on a state-of-the-art model, achieving a more than twofold performance gain and highlighting the quality and training potential of our dataset.

Abstract

Large multimodal foundation models, particularly in the domains of language and vision, have significantly advanced various tasks, including robotics, autonomous driving, information retrieval, and grounding. However, many of these models perceive objects as indivisible, overlooking the components that constitute them. Understanding these components and their associated affordances provides valuable insights into an object's functionality, which is fundamental for performing a wide range of tasks. In this work, we introduce a novel real-world benchmark, InstructPart, comprising hand-labeled part segmentation annotations and task-oriented instructions to evaluate the performance of current models in understanding and executing part-level tasks within everyday contexts. Through our experiments, we demonstrate that task-oriented part segmentation remains a challenging problem, even for state-of-the-art Vision-Language Models (VLMs). In addition to our benchmark, we introduce a simple baseline that achieves a twofold performance improvement through fine-tuning with our dataset. With our dataset and benchmark, we aim to facilitate research on task-oriented part segmentation and enhance the applicability of VLMs across various domains, including robotics, virtual reality, information retrieval, and other related fields.

InstructPart Dataset and Tasks

InstructPart dataset: The InstructPart dataset comprises 2,400 images spanning 48 object classes and 44 part classes, each with a hand-labeled segmentation mask, together with 9,600 hand-labeled task instructions, 2,400 part queries, and 2,400 affordance labels. We split the dataset into 1,800 images for training and 600 for evaluation.
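As a concrete illustration of these statistics, the following minimal sketch tallies object classes, part classes, and instruction queries from annotation files in the JSON format shown in the annotation examples below. The directory layout (annotations/*.json) is a hypothetical assumption, not the released structure.

import glob
import json

# Hypothetical layout: one JSON annotation file per image, following the
# format shown under "More Annotation Examples". The released layout may differ.
objects, parts, num_instructions = set(), set(), 0
for ann_path in glob.glob("annotations/*.json"):
    with open(ann_path) as f:
        sample = json.load(f)
    for entry in sample["part_list"]:
        objects.add(entry["object"])
        parts.add(entry["part"])
        num_instructions += len(entry["instruction"])

print(f"{len(objects)} object classes, {len(parts)} part classes, "
      f"{num_instructions} instruction queries")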

Task Reasoning Part Segmentation (TRPS): To probe the reasoning and part-grounding ability of current models, we propose the TRPS task, shown in the first row of Figure 2. The model receives only an instruction-image pair as input and must predict the segmentation mask of the part the instruction refers to.

Oracle Referring Part Segmentation (ORPS): In the ORPS setting, the part name is used to refer to the part directly, which guarantees the model receives a correct text query. The task is shown in the second row of Figure 2.
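The two settings differ only in the text query paired with the image. The sketch below shows one plausible way to build both query types from a single annotation entry in the format shown in the annotation examples below; the segment function is a placeholder for any VLM segmentation interface rather than an API from this work, and taking the first listed instruction as the TRPS query is an assumption based on the examples.

# Minimal sketch of how TRPS and ORPS queries could be derived from one
# annotation entry. `segment(image, text) -> mask` is a placeholder for any
# vision-language segmentation model, not an interface defined by this work.

def trps_query(entry):
    # Task-reasoning query: only a task instruction, e.g.
    # "If I want to use the scissors, which part in the picture should I put my fingers in?"
    # (Assumes the first listed instruction is the task-reasoning one.)
    return entry["instruction"][0]

def orps_query(entry):
    # Oracle referring query: the part is named explicitly, e.g. "handle of the scissors".
    return f'{entry["part"]} of the {entry["object"]}'

def run_both_settings(segment, image, entry):
    # Returns the predicted masks for the TRPS and ORPS settings, respectively.
    return segment(image, trps_query(entry)), segment(image, orps_query(entry))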

Figure 2. InstructPart dataset examples. Instruction queries are denoted in red text, while object and part names are indicated in blue. Each example pairs an observation image (left) with the corresponding ground-truth part segments (right), highlighted with a green mask.

Experimental Results

Benchmarking Quantitative Results
Table 1. Results on oracle referring part segmentation task (left) and task reasoning part segmentation task (right). We divide the methods into three categories, namely, open-vocabulary segmentation (OVS), referring expression segmentation (RES), and reasoning segmentation (RS).
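The page does not restate the metric definitions used in Table 1. The sketch below illustrates two intersection-over-union scores commonly reported for referring and reasoning segmentation (per-image gIoU and cumulative cIoU), assuming binary NumPy masks; treat it as a generic illustration rather than the paper's exact evaluation code.

import numpy as np

def mask_iou(pred, gt):
    # Intersection over union of two binary masks.
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0

def giou_ciou(preds, gts):
    # gIoU: mean of per-image IoUs.
    # cIoU: ratio of summed intersections to summed unions over the whole set.
    giou = float(np.mean([mask_iou(p, g) for p, g in zip(preds, gts)]))
    inter_total = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    union_total = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    ciou = inter_total / union_total if union_total > 0 else 0.0
    return giou, ciou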

Qualitative Results
Fine-tuned PISA performs better. (Training potential of our proposed method)
Qualitative comparison grid (columns, left to right): Ground Truth, X-Decoder, SEEM, TRIS, G-SAM, MiniGPT-v2, LISA-Pretrain, LISA-Finetune, PISA-Finetune.

Both fine-tuned LISA and PISA perform better. (Training usage of our dataset)
Qualitative comparison grid (columns, left to right): Ground Truth, X-Decoder, SEEM, TRIS, G-SAM, MiniGPT-v2, LISA-Pretrain, LISA-Finetune, PISA-Finetune.

Pre-trained LISA already performs well. (Benchmarking usage of our dataset)
Qualitative comparison grid (columns, left to right): Ground Truth, X-Decoder, SEEM, TRIS, G-SAM, MiniGPT-v2, LISA-Pretrain, LISA-Finetune, PISA-Finetune.

More Annotation Examples
The images and JSON files below are listed in corresponding order. Each row of the grid shows one sample along with the models' predictions.
Prediction grid (columns, left to right): Ground Truth, X-Decoder, SEEM, TRIS, G-SAM, MiniGPT-v2, LISA-Pretrain, LISA-Finetune, PISA-Finetune.

{
  "image_path": "538210619_c4def94c9b_o.jpg",
  "part_list": [
      {
          "object": "scissors",
          "part": "handle",
          "affordance": "hold",
          "action": "hold",
          "instruction": [
              "If I want to use the scissors, which part in the picture should I put my fingers in?",
              "Describe the part of the scissors in the picture where fingers should be placed.",
              "Where is the handle of the scissors in this image?",
              "Where is the handle of the scissors that can be held in this image?",
              "handle of the scissors",
              "handle of the scissors that can be held"
          ]
      }
  ]
}

{
  "image_path": "knife_002845.jpg",
  "part_list": [
      {
          "object": "knife",
          "part": "handle",
          "affordance": "hold",
          "action": "pick up",
          "instruction": [
              "If I want to pick up the knife, which part in the picture can be used?",
              "Which part of the knife is safe to hold when picking it up?",
              "Where is the handle of the knife in this image?",
              "Where is the handle of the knife that can be held in this image?",
              "handle of the knife",
              "handle of the knife that can be held"
          ]
      }
  ]
}

{
  "image_path": "2329134125_8a71be7470_o.jpg",
  "part_list": [
      {
          "object": "kettle",
          "part": "handle",
          "affordance": "hold",
          "action": "hold",
          "instruction": [
              "Which part in the picture can be utilized to hold the kettle?",
              "In the image, identify the part of the kettle that's meant to be held.",
              "Where is the handle of the kettle in this image?",
              "Where is the handle of the kettle that can be held in this image?",
              "handle of the kettle",
              "handle of the kettle that can be held"
          ]
      }
  ]
}

{
  "image_path": "bottle_002805.jpg",
  "part_list": [
      {
          "object": "bottle",
          "part": "body",
          "affordance": "hold",
          "action": "hold",
          "instruction": [
              "If I want to hold the bottles, which parts in the picture can be utilized?",
              "To hold the bottles, which parts are designed for grip?",
              "Where is the body of the bottle in this image?",
              "Where is the body of the bottle that can be held in this image?",
              "body of the bottle",
              "body of the bottle that can be held"
          ]
      }
  ]
}

{
  "image_path": "knife_000953.jpg",
  "part_list": [
      {
          "object": "knife",
          "part": "blade",
          "affordance": "cut",
          "action": "cut",
          "instruction": [
              "If I want to use the knife to cut the carrots, which part in the picture should be used?",
              "Identify the part of the knife ideal for slicing the carrots.",
              "Where is the blade of the knife in this image?",
              "Where is the blade of the knife that can cut in this image?",
              "blade of the knife",
              "blade of the knife that can cut"
          ]
      }
  ]
}
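To make the annotation format above concrete, here is a minimal PyTorch-style reader that expands each annotation into (image, query) pairs. It is a sketch under assumptions: the image directory argument is hypothetical, and segmentation-mask loading is omitted because the mask storage format is not shown on this page.

import json
import os
from PIL import Image
from torch.utils.data import Dataset

class InstructPartSamples(Dataset):
    # Minimal reader for annotation files in the JSON format shown above.
    # Mask loading is intentionally left out: the mask storage format is not
    # documented on this page, so only (image, query, object, part) is returned.

    def __init__(self, ann_files, image_dir):
        self.image_dir = image_dir
        self.items = []
        for ann_file in ann_files:
            with open(ann_file) as f:
                ann = json.load(f)
            for entry in ann["part_list"]:
                for query in entry["instruction"]:
                    self.items.append(
                        (ann["image_path"], query, entry["object"], entry["part"])
                    )

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        image_path, query, obj, part = self.items[idx]
        image = Image.open(os.path.join(self.image_dir, image_path)).convert("RGB")
        return image, query, obj, part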
              

Video Demonstration

Video demonstration coming soon.

BibTeX

@inproceedings{wan2024instructpart,
  title={InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning},
  author={Wan, Zifu and Xie, Yaqi and Zhang, Ce and Lin, Zhiqiu and Wang, Zihan and Stepputtis, Simon and Ramanan, Deva and Sycara, Katia},
  booktitle={The 63rd Annual Meeting of the Association for Computational Linguistics},
  year={2025},
  url={https://openreview.net/forum?id=IMEr4XgJSZ}
}