2024 Theses Doctoral
Multi-modal Models for Goal-oriented Planning
The rise of artificial intelligence (AI) has unlocked exciting opportunities in multi-modal goal-oriented planning, enabling systems to interpret diverse inputs (such as video, text, and audio) and generate outputs tailored to specific task objectives. However, achieving meaningful and practical task planning in real-world applications remains a significant challenge. A capable AI system must comprehend multi-modal inputs, infer both explicit and implicit events, capture their temporal dynamics and contextual nuances, and align these insights with task-specific objectives to produce actionable, goal-oriented outcomes. Goal-oriented planning involves creating detailed, practical action plans to guide users through tasks. This dissertation focuses on the problem of Procedure Planning in Instructional Videos (PPIV), which aims to predict action sequences that transition a task from an initial visual state to a target state; a schematic formulation is sketched after the list below. Despite advancements, existing frameworks for PPIV face several critical limitations:
1. Generalizability: Current planners often struggle to generalize effectively to real-world problems. Many rely on simplifying assumptions, limiting their applicability in complex, uncontrolled environments.
2. Dataset Scarcity: Annotating instructional videos with precise timestamps for actions and states is costly, labor-intensive, and time-consuming. This scarcity of high-quality datasets constrains the development of high-performance models.
3. Multi-modal Outputs: Current frameworks predominantly generate text-only plans, neglecting multi-modal plans that include visual representations of task state transformations, which are crucial for real-world applicability.
4. Customization: Existing models fail to address user-specific needs, often producing generic plans that do not accommodate individual preferences or requirements.
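For concreteness, the standard PPIV objective that these limitations refer to can be written as follows. This is a minimal sketch using common notation from the procedure-planning literature; the symbols (start observation, goal observation, plan horizon) are illustrative, not taken verbatim from the dissertation:

```latex
% PPIV, schematically: given visual observations of the start state (o_s)
% and the goal state (o_g), predict the action sequence that transforms
% one into the other.
\[
  \hat{a}_{1:T} \;=\; \operatorname*{arg\,max}_{a_{1:T}} \; p\left(a_1, \dots, a_T \mid o_s, o_g\right),
\]
% where the horizon T is fixed in most prior work; setting (a) below
% relaxes it to variable length.
```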
This dissertation aims to empower AI systems to overcome these limitations, delivering a robust and practical framework for multi-modal goal-oriented planning. To this end, three novel settings are proposed to enhance the practicality and coherence of PPIV:
(a) Adaptive Procedure Planning (APP): This setting introduces the flexible generation of variable-length plans, addressing the unrealistic fixed-length sequence assumption in prior models. To enable APP, we propose the Retrieval-Augmented Planner (RAP), an autoregressive framework that integrates external memory to improve temporal understanding of action relationships. RAP employs weakly-supervised learning, facilitating scalable training on unannotated data to mitigate dataset scarcity. Experimental results on benchmark datasets demonstrate RAP’s superiority over fixed-length models, establishing it as a robust baseline for adaptive procedure planning.
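A minimal sketch of what retrieval-augmented, variable-length autoregressive decoding can look like is given below. All names (`decoder`, `memory_keys`, `memory_values`, the `eos_id` stop token) and tensor shapes are illustrative assumptions, not RAP’s actual interface:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieval_augmented_decode(decoder, memory_keys, memory_values,
                               start_feat, goal_feat, eos_id, max_steps=12):
    """Illustrative retrieval-augmented autoregressive planner loop.

    Hypothetical interface: `decoder.init_hidden(context)` returns an (H,)
    state; `decoder.step(hidden, retrieved, context)` returns (logits, hidden).
    `memory_keys` is (M, H) and `memory_values` is (M, D).
    """
    plan = []
    # Condition every step on the observed start state and the desired goal.
    context = torch.cat([start_feat, goal_feat], dim=-1)
    hidden = decoder.init_hidden(context)
    for _ in range(max_steps):
        # Soft-retrieve the memory entries most similar to the current state
        # and fold them into the next decoding step.
        weights = F.softmax(memory_keys @ hidden, dim=0)   # (M,)
        retrieved = weights @ memory_values                # (D,)
        logits, hidden = decoder.step(hidden, retrieved, context)
        action = int(logits.argmax())
        if action == eos_id:  # end-of-plan token => variable-length plans
            break
        plan.append(action)
    return plan
```

The key design point this sketch captures is that the stopping decision is made by the model itself (via an end-of-plan token) rather than by a fixed horizon imposed at training time.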
(b) Multi-modal Procedure Planning in Instructional Videos (MPPIV): This setting extends beyond text-only plan generation by producing coherent multi-modal plans that combine action sequences with intermediate visual states. The approach enhances clarity and actionability by using a conditioned diffusion model to predict plausible visual state transformations, as sketched below. We use a weakly-supervised learning framework to further leverage ground-truth video frames of procedural states, ensuring robust and scalable training. Experimental evaluations validate MPPIV’s effectiveness, establishing a strong baseline for multi-modal procedure planning.
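To illustrate the conditioned diffusion component, the sketch below shows a standard DDPM-style reverse process conditioned on start/goal features and an action embedding. The noise-prediction network `eps_model`, the conditioning scheme, and all shapes are assumptions for exposition, not the dissertation’s architecture:

```python
import torch

@torch.no_grad()
def sample_intermediate_state(eps_model, start_feat, goal_feat, action_emb,
                              betas, shape):
    """Standard DDPM reverse loop, conditioned on the start/goal observations
    and the action between them. `eps_model(x, t, cond)` is a hypothetical
    noise-prediction network; `betas` is the (T,) noise schedule."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    cond = torch.cat([start_feat, goal_feat, action_emb], dim=-1)
    x = torch.randn(shape)                       # start from pure noise
    for t in reversed(range(len(betas))):
        eps = eps_model(x, t, cond)              # predict the added noise
        # DDPM posterior mean for x_{t-1}.
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bar[t]) * eps) \
               / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x                                     # predicted intermediate visual state
```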
(c) Customized Procedure Planning in Instructional Videos (CPPIV): To address the customization problem, we introduce the Customized Procedure Planner (CPP), which generates action sequences tailored to user-specific requirements and the initial visual state of the task. CPP leverages multi-modal foundation models to produce highly detailed, user-centric plans, enhancing practical planning capabilities. To address the lack of annotated datasets with customization information, CPP employs a weakly-supervised training approach for large-scale training. Additionally, we propose novel LLM-based metrics to evaluate customization and open-vocabulary planning accuracy, validated through human evaluation. Experimental results, including human evaluations, demonstrate the effectiveness of the CPP framework, paving the way for advanced, user-centric AI planning applications. This work lays a foundation for future advancements in goal-oriented AI systems, enhancing their ability to interact with and assist users in diverse, real-world scenarios.
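As an illustration of what an LLM-based customization metric can look like, the sketch below scores how well a predicted plan satisfies a user requirement via an LLM-as-judge prompt. The `llm_judge` callable and the 1-5 rubric are hypothetical; the dissertation’s actual metrics may be defined differently:

```python
def customization_score(llm_judge, user_requirement, plan_steps):
    """Score plan/requirement fit with an LLM judge.

    `llm_judge` is a hypothetical callable mapping a prompt string to a
    text reply (e.g., a thin wrapper around any chat-completion API).
    """
    steps = "\n".join(f"{i + 1}. {step}" for i, step in enumerate(plan_steps))
    prompt = (
        "You are grading a procedural plan.\n"
        f"User requirement: {user_requirement}\n"
        f"Plan steps:\n{steps}\n"
        "On a scale of 1-5, how well does the plan satisfy the requirement? "
        "Reply with a single integer."
    )
    reply = llm_judge(prompt)
    # Parse the first digit in the reply; return None if the judge answered
    # in an unexpected format.
    for char in reply:
        if char.isdigit():
            return int(char)
    return None
```

Averaging such scores over a test set, and checking them against human ratings as the dissertation does, gives a scalable proxy for customization quality.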
Files
- Zare_columbia_0054D_18979.pdf (application/pdf, 3.45 MB)
More About This Work
- Academic Units: Electrical Engineering
- Thesis Advisors: Chang, Shih-Fu
- Degree: Ph.D., Columbia University
- Published Here: January 15, 2025