Theses Doctoral

Vision-based Manipulation In-the-Wild

Chi, Cheng

Deploying robots in real-world environments involves immense engineering complexity, potentially surpassing the resources required for autonomous vehicles due to the increased dimensionality and task variety. To maximize the chances of successful real-world deployment, finding a simple solution that minimizes engineering complexity at every level, from hardware to algorithm to operations, is crucial.

In this dissertation, we consider a vision-based manipulation system that can be deployed in-the-wild when trained to imitate sufficient quantity and diversity of human demonstration data on the desired task. At deployment time, the robot is driven by a single diffusion-based visuomotor policy, with raw RGB images as input and robot end-effector pose as output. Compared to existing policy representations, Diffusion Policy handles multimodal action distributions gracefully, being scalable to high-dimensional action spaces and exhibiting impressive training stability. These properties allow a single software system to be used for multiple tasks, with data collected by multiple demonstrators, deployed to multiple robot embodiments, and without significant hyper-parameter tuning.

We developed a Universal Manipulation Interface (UMI), a portable, low-cost, and information-rich data collection system to enable direct manipulation skill learning from in-the-wild human demonstrations. UMI provides an intuitive interface for non-expert users by using hand-held grippers with mounted GoPro cameras. Compared to existing robotic data collection systems, UMI enables robotic data collection without needing a robot, drastically reducing the engineering and operational complexity. Trained with UMI data, the resulting diffusion policies can be deployed across multiple robot platforms in unseen environments for novel objects and to complete dynamic, bimanual, precise, and long-horizon tasks.

The Diffusion Policy and UMI combination provides a simple full-stack solution to many manipulation problems. The turn-around time of building a single-task manipulation system (such as object tossing and cloth folding) can be reduced from a few months to a few days.

Files

  • thumnail for Chi_columbia_0054D_18645.pdf Chi_columbia_0054D_18645.pdf application/pdf 6.98 MB Download File

More About This Work

Academic Units
Computer Science
Thesis Advisors
Song, Shuran
Vondrick, Carl M.
Degree
Ph.D., Columbia University
Published Here
August 7, 2024