Theses Doctoral

Aiding Complex Multimodal Reasoning with Contextual and Structural Information

Ayyubi, Hammad Abdullah

Multimodal reasoning involves integrating information from multiple data modalities, such as text, images, and videos, to perform a wide range of tasks, including Visual Question Answering, Visual Entailment, Captioning, Retrieval, and more. While current multimodal models have shown strong performance on these standard tasks, they often struggle with more complex scenarios that demand external knowledge, richer context, or the ability to process large and intricate data inputs. This thesis explores these challenging aspects of multimodal reasoning and proposes targeted solutions to help models overcome these limitations.

We focus on two broad categories of complex tasks: those that benefit from external context and those that benefit from structured representations of data. External context refers to additional information necessary for solving a task, beyond what is explicitly available in the input: for instance, knowledge from external sources or temporal context derived from surrounding data. Structured representations, on the other hand, involve organizing raw inputs into meaningful forms, such as graphs or hierarchies, which help simplify reasoning over complex or large-scale data. To address these challenges, we propose a set of complementary solutions. For tasks requiring external knowledge, we introduce methods that integrate information from sources such as Google Search and Large Language Models.

To address temporal context gaps, we develop techniques to extract relevant contextual cues from the source video surrounding the multimodal task instance. Additionally, to handle large or densely structured data, we propose methods that convert raw inputs into compact, structured representations — such as graphs or hierarchies — which make the data more accessible for models to interpret and reason over.

Collectively, the solutions proposed in this thesis aim to enhance the reasoning capabilities of multimodal models, equipping them to handle a broader spectrum of real-world scenarios. These advancements mark a step forward in improving the robustness and applicability of multimodal reasoning systems in domains such as general-purpose robotics, digital agents, autonomous driving, and beyond.

Files

  • Ayyubi_columbia_0054D_19309.pdf (application/pdf, 2.37 MB)

More About This Work

Academic Units
Computer Science
Thesis Advisors
Chang, Shih-Fu
Degree
Ph.D., Columbia University
Published Here
July 30, 2025