Doctoral Thesis

Learning Robot Manipulation Through Hands of Humans

Xu, Mengda

Intelligent robots should be able to use diverse skills to complete a wide range of manipulation tasks, and they should be able to acquire new skills in a scalable manner. While imitation learning has shown promise for robot skill acquisition, it relies heavily on expensive robot data collected through human teleoperation, making it difficult to scale.

This dissertation aims to develop approaches for robot manipulation skill learning from human video, which is easy to obtain and widely available. The goal is to reduce robot learning's dependence on teleoperation data and develop methods to learn from cross-embodiment data, enabling more scalable skill acquisition. However, the embodiment gap between humans and robots prevents robots from learning manipulation directly from human video. Specifically, both the visual gap (e.g., appearance) and morphology gap (e.g., kinematics) pose significant challenges to transferring human knowledge to robots.

To address this, we propose using different interfaces to minimize the embodiment gap. Here, an interface is defined as a function that maps human knowledge to robot domain knowledge.

This dissertation presents a series of interfaces and associated systems that allow robots to learn directly from human video. First, we propose skills as an interface that implicitly closes the visual and morphology gaps by enabling robots to identify skills demonstrated in human video and recompose these skills to complete unseen tasks. We present XSkill and ASPiRe for tasks requiring skill composition. Second, we introduce object flow as an interface to explicitly overcome the visual gap. Our approach, Im2Flow2Act, distills task knowledge by learning an object flow generator from human video while acquiring flow-conditioned manipulation policies from simulation. This system can complete a wide range of manipulation tasks in the real world without requiring real-world robot data.

Finally, we introduce the human hand as an interface to enable humans to teach robots dexterous hand manipulation skills. Our system, DexUMI, is equipped with an exoskeleton as a hardware adaptation layer to minimize the morphology gap and a data processing pipeline as a software adaptation layer to minimize the visual gap.

This dissertation demonstrates that leveraging human video enables robots to learn manipulation skills in a more scalable manner. Furthermore, the proposed interface-based approaches provide a systematic framework for addressing embodiment gaps and offer practical solutions for real-world robot deployment.

Files

  • Xu_columbia_0054D_19478.pdf (application/pdf, 3.3 MB)

More About This Work

Academic Units
Computer Science
Thesis Advisors
Song, Shuran
Degree
Ph.D., Columbia University
Published Here
October 8, 2025