Doctoral Thesis

Multimodal Representations for Video

Suris Coll-Vinent, Didac

My thesis explores multimodal and video analysis in computer vision, aiming to bridge the gap between human perception and machine understanding. Recognizing the interplay among signals such as text, audio, and visual data, my research develops novel frameworks that integrate these diverse modalities to achieve a deeper understanding of complex scenes, with a particular emphasis on video.

As part of this exploration, I study diverse tasks such as translation, future prediction, and visual question answering, all connected through the lens of multimodal and video representations. I present novel approaches for each of these challenges, contributing across different facets of computer vision, from dataset creation to algorithmic innovations, and from achieving state-of-the-art results on established benchmarks to introducing new tasks.

Methodologically, my thesis embraces two key approaches: self-supervised learning and the integration of structured representations. Self-supervised learning, a technique that allows computers to learn from unlabeled data, helps uncover inherent connections within multimodal and video inputs. Structured representations, in turn, serve as a means to capture the complex temporal patterns and uncertainties inherent in video analysis. By employing these techniques, I offer novel insights into modeling multimodal representations for video analysis, showing improved performance over prior work in all studied scenarios.
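To make the self-supervised idea concrete, a common way to learn from unlabeled paired data is a contrastive (InfoNCE-style) objective, where co-occurring video and text embeddings are pulled together and all other pairings in a batch serve as negatives. The sketch below is illustrative only and is not taken from the thesis; the function name, shapes, and temperature value are assumptions.

```python
import numpy as np

def info_nce_loss(video_emb, text_emb, temperature=0.07):
    """Illustrative contrastive loss: row i of video_emb is assumed to be
    paired with row i of text_emb. No human labels are needed; the pairing
    itself is the supervisory signal (hypothetical sketch)."""
    # L2-normalize so dot products become cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (batch, batch) similarity matrix
    # Cross-entropy with the diagonal (true pairs) as the targets.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Matched pairs should yield a lower loss than mismatched ones.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
matched = info_nce_loss(emb, emb)
mismatched = info_nce_loss(emb, emb[::-1])
```

Here the loss is small when each clip is most similar to its own caption, which is exactly the "inherent connection" the abstract refers to: alignment emerges from co-occurrence rather than annotation.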


More About This Work

Academic Units: Computer Science
Thesis Advisor: Vondrick, Carl M.
Degree: Ph.D., Columbia University
Published Here: June 12, 2024