2025 Theses Doctoral
Learning Robot Manipulation Through Hands of Humans
Intelligent robots should possess the capability to utilize diverse skills to complete a wide range of manipulation tasks. Moreover, they should be able to acquire new skills in a scalable manner. While imitation learning has shown promise for robot skill acquisition, it heavily relies on expensive robot data collection through human teleoperation, making it challenging to scale.
This dissertation aims to develop approaches for robot manipulation skill learning from human video, which is easy to obtain and widely available. The goal is to reduce robot learning's dependence on teleoperation data and develop methods to learn from cross-embodiment data, enabling more scalable skill acquisition. However, the embodiment gap between humans and robots prevents robots from learning manipulation directly from human video. Specifically, both the visual gap (e.g., appearance) and morphology gap (e.g., kinematics) pose significant challenges to transferring human knowledge to robots.
To address this, we propose using different interfaces to minimize the embodiment gap. Here, an interface is defined as a function that maps human knowledge into the robot domain.
This dissertation presents a series of interfaces and associated systems that allow robots to learn directly from human video. First, we propose skills as an interface that implicitly closes the visual and morphology gaps by enabling robots to identify skills demonstrated in human video and recompose them to complete unseen tasks; we present XSkill and ASPiRe for tasks requiring skill composition. Second, we introduce object flow as an interface to explicitly overcome the visual gap. Our approach, Im2Flow2Act, distills task knowledge by learning an object flow generator from human video while acquiring flow-conditioned manipulation policies from simulation. This system can complete a wide range of manipulation tasks in the real world without requiring real-world robot data.
Finally, we introduce the human hand as an interface to enable humans to teach robots dexterous hand manipulation skills. Our system, DexUMI, is equipped with an exoskeleton as a hardware adaptation layer to minimize the morphology gap and a data processing pipeline as a software adaptation layer to minimize the visual gap.
This dissertation demonstrates that leveraging human video enables robots to learn manipulation skills in a more scalable manner. Furthermore, the proposed interface-based approaches provide a systematic framework for addressing embodiment gaps and offer practical solutions for real-world robot deployment.
Files
- Xu_columbia_0054D_19478.pdf (application/pdf, 3.3 MB)
More About This Work
- Academic Units: Computer Science
- Thesis Advisors: Song, Shuran
- Degree: Ph.D., Columbia University
- Published Here: October 8, 2025