Theses Doctoral

Multimodal Reasoning with Fine-grained Knowledge Representation

Wang, Zhecan

Multimodal reasoning, particularly reasoning that involves common sense, is a vital human capability with a wide range of practical applications, from understanding visual cues while driving, to interpreting emotions and intentions in social interactions, to efficiently planning and executing household chores. Multimodal (common sense) reasoning is therefore an important step toward advanced AI systems that aim to imitate human-level capabilities. However, existing methods struggle to achieve it because of several factors, including the under-utilization of fine-grained multimodal information, a lack of transparency, and the unexplainable and unreliable behavior of AI models.

Our research addresses these challenges by focusing on improving AI models' multimodal (common sense) reasoning through the utilization of fine-grained knowledge representation. We begin by developing transformer-based models to extract fine-grained knowledge across various modalities. We then propose novel solutions to leverage this extracted knowledge to enhance AI models' learning of multimodal reasoning, particularly in downstream vision-language understanding tasks such as visual question answering and visual entailment.

Beyond the focus on high performance, we further propose approaches that exploit fine-grained multimodal knowledge to enhance our understanding of AI models, thereby improving the explainability of how vision-language models work during multimodal (common sense) reasoning. Finally, we develop new methods to utilize fine-grained knowledge to create generalized and challenging multimodal benchmarks, designed specifically to evaluate future AI models on their multimodal reasoning capabilities.

Throughout our research, we conduct extensive experiments to demonstrate the effectiveness of utilizing fine-grained knowledge in improving AI models for multimodal reasoning. Our work spans four key perspectives (knowledge extraction, model learning, explainability, and evaluation) and together these provide a comprehensive approach to advancing the field of AI multimodal reasoning.

Files

  • Wang_columbia_0054D_18822.pdf (application/pdf, 2.21 MB)

More About This Work

Academic Units: Computer Science
Thesis Advisors: Chang, Shih-Fu
Degree: Ph.D., Columbia University
Published Here: November 6, 2024