SceneComplete

Open-World 3D Scene Completion in Complex Real World Environments for Robot Manipulation

Aditya Agarwal¹, Gaurav Singh², Bipasha Sen¹, Tomás Lozano-Pérez¹, Leslie Pack Kaelbling¹

¹MIT CSAIL, ²IIIT Hyderabad

TL;DR: SceneComplete is an open-world 3D scene completion system that constructs a complete, segmented 3D model of a scene from a single RGB-D image.

Abstract

Careful robot manipulation in everyday cluttered environments requires an accurate understanding of the 3D scene, in order to grasp and place objects stably and reliably and to avoid mistakenly colliding with other objects. In general, we must construct such a 3D interpretation of a complex scene based on limited input, such as a single RGB-D image. We describe SceneComplete, a system for constructing a complete, segmented 3D model of a scene from a single view. It provides a novel pipeline for composing general-purpose pretrained perception modules (vision-language, segmentation, image-inpainting, image-to-3D, and pose-estimation) to obtain high-accuracy results. We demonstrate its accuracy and effectiveness with respect to ground-truth models in a large benchmark dataset and show that its accurate whole-object reconstruction enables robust grasp proposal generation, including for a dexterous hand.

Architecture

The figure illustrates the overall design of SceneComplete. It takes a single RGB-D image as input and produces a set of object meshes that are registered with the input 3D scan. The objective is to provide an accurate 3D reconstruction of the scene, both in its segmentation into rigid components and in the shape of each component, expressed as a mesh.
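The data flow described above can be sketched as a chain of interchangeable stages. This is a minimal illustration, not the released implementation: the `Stages` container, the `ObjectModel` record, the `complete_scene` function, and every stage signature are hypothetical names, and the stage ordering is an assumption that simply follows the module list given in the abstract.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class ObjectModel:
    """One reconstructed rigid object (hypothetical record type)."""
    label: str   # text description of the object
    mesh: Any    # completed object mesh
    pose: Any    # pose that registers the mesh with the input scan


@dataclass
class Stages:
    """Pretrained modules, passed in as plain callables (signatures are assumptions)."""
    describe: Callable[[Any], List[str]]            # vision-language: RGB -> object labels
    segment: Callable[[Any, str], Any]              # segmentation: RGB, label -> mask
    inpaint: Callable[[Any, Any], Any]              # inpainting: RGB, mask -> object image
    image_to_3d: Callable[[Any], Any]               # image-to-3D: object image -> mesh
    estimate_pose: Callable[[Any, Any, Any], Any]   # pose estimation: mesh, depth, mask -> pose


def complete_scene(rgb: Any, depth: Any, stages: Stages) -> List[ObjectModel]:
    """Run each stage once per detected object and collect the registered meshes."""
    objects = []
    for label in stages.describe(rgb):
        mask = stages.segment(rgb, label)
        object_image = stages.inpaint(rgb, mask)
        mesh = stages.image_to_3d(object_image)
        pose = stages.estimate_pose(mesh, depth, mask)
        objects.append(ObjectModel(label=label, mesh=mesh, pose=pose))
    return objects
```

Passing the modules in as callables keeps the sketch independent of any particular pretrained model, so a stage can be swapped without touching the loop.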

Results

Evaluating SceneComplete

We evaluate SceneComplete on the challenging GraspNet-1Billion dataset and observe that parallel-jaw grasps sampled using antipodal grasp sampling enable more robust predictions with fewer collisions on the ground-truth scene. We also demonstrate dexterous grasps using both Shadow Hands and Allegro Hands, highlighting improved manipulation of complex objects with complete 3D reconstructions.
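For readers unfamiliar with antipodal grasp sampling, the core test applied to a candidate contact pair can be sketched as follows. This is the standard friction-cone condition for a two-contact grasp, not code from the SceneComplete evaluation: the function name `is_antipodal`, the friction coefficient `mu`, and the gripper opening `max_width` are illustrative assumptions.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _norm(a):
    return math.sqrt(_dot(a, a))


def is_antipodal(p1, n1, p2, n2, mu=0.5, max_width=0.08):
    """Check whether two surface contacts form an antipodal parallel-jaw grasp.

    p1, p2: contact points; n1, n2: outward surface normals at those points.
    mu and max_width (metres) are assumed values, not taken from the paper.
    """
    axis = _sub(p2, p1)
    width = _norm(axis)
    if width == 0.0 or width > max_width:
        return False  # degenerate pair, or wider than the gripper can open
    if _norm(n1) == 0.0 or _norm(n2) == 0.0:
        return False  # undefined normal
    axis = tuple(x / width for x in axis)

    # Each jaw pushes along the grasp axis into the object. The push must lie
    # inside the friction cone around the inward normal at its contact.
    cos_limit = math.cos(math.atan(mu))
    cos1 = -_dot(axis, n1) / _norm(n1)  # angle between +axis and -n1
    cos2 = _dot(axis, n2) / _norm(n2)   # angle between -axis and -n2
    return cos1 >= cos_limit and cos2 >= cos_limit
```

A sampler would typically draw many contact pairs from a mesh surface and keep those that pass this check. A complete mesh matters here because the second contact usually lies on the side of the object that a single view cannot see.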

SceneComplete enables dexterous grasping

A significant test of the utility of our approach is whether the individual object reconstructions support the computation of good dexterous grasps for a multi-fingered hand. We show that they do: SceneComplete enables dexterous grasping by generating stable grasp proposals. To check a grasp, we instantiate Isaac Gym (using PhysX as the physics engine) with the selected hand and the ground-truth object mesh, then lift the hand within the simulation and detect whether the object is dropped.
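One simple way to decide whether the object was dropped during the lift is to compare how far the object falls behind the hand. The sketch below is simulator-agnostic and makes no Isaac Gym calls. It assumes the hand and object heights have already been logged at each simulation step. The function name `object_dropped` and the `max_slip` tolerance are hypothetical, and the page does not state the exact criterion used.

```python
def object_dropped(hand_heights, object_heights, max_slip=0.05):
    """Return True if the object fell behind the hand by more than max_slip.

    hand_heights, object_heights: per-step heights (metres) logged during the lift.
    max_slip: allowed growth of the hand-to-object vertical offset (assumed value).
    """
    if not hand_heights or len(hand_heights) != len(object_heights):
        raise ValueError("height logs must be non-empty and of equal length")
    initial_offset = hand_heights[0] - object_heights[0]
    final_offset = hand_heights[-1] - object_heights[-1]
    return (final_offset - initial_offset) > max_slip
```

For example, if the hand rises by 20 cm while the object stays on the table, the offset grows by 0.2 m and the grasp counts as a failure.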

Citation

@misc{agarwal2024scenecompleteopenworld3dscene,
 title={SceneComplete: Open-World 3D Scene Completion in Complex Real World Environments for Robot Manipulation}, 
 author={Aditya Agarwal and Gaurav Singh and Bipasha Sen and Tomás Lozano-Pérez and Leslie Pack Kaelbling},
 year={2024},
 eprint={2410.23643},
 archivePrefix={arXiv},
 primaryClass={cs.RO},
 url={https://arxiv.org/abs/2410.23643}, 
}