Theses Doctoral

"Seeing Red" or "Tickled Pink"? Investigating the Power of Language and Vision Models through Color, Emotion, and Metaphor

Winn, Olivia

Multimodal NLP is an approach to language understanding that incorporates data from nontextual sources in order to enhance our linguistic understanding through additional contextual information. In particular, incorporating visual data has allowed for great strides in our ability to model language related to physical phenomena. The performance of these models has so far been contingent upon access to large datasets, focusing on classification problems without relative information, and constraining the problem space to literal descriptions and interpretations.

In this thesis, we examine these limitations by investigating how types of data previously unused in these models can be reconfigured and worked with intelligently and on a small scale to enhance our understanding of the pragmatics of language. We contribute to comparative language grounding, emotional interpretation, and metaphoric understanding by releasing multiple annotated datasets, developing a new paradigm for modeling relative data, creating a new task in examining the generation of emotional descriptions for image information, and demonstrating a novel approach to working with figurative text for image generation.

We start by examining how traditional grounding models could be adapted to incorporate relative information. As no previous work has ever utilized relative textual description for image understanding, we first constrain the problem by focusing on the language of color. We create a new dataset of comparative color terms with associated RGB datapoints, and use this data to develop a novel paradigm of grounding comparative color terms in RGB space, providing the first avenue towards utilizing relative information in a multimodal setting.

Continuing our study of color, we then turn to examining the relationship between color and emotion. In order to further our understanding of this relationship, we define a new task, called Justified Affect Transformation, in which an image is recolored specifically to alter its emotional evocation and text is generated to explain the recoloring from an emotional perspective. We create a dataset of abstract art with contiguous emotion labels and textual rationales for the emotional evocation of multiple images, and using our new dataset for training, introduce a new unified model that recolors an image and provides a textual rationale explaining the recoloring with respect to the specified emotion. We use this model to examine the relationship between color and emotion devoid of confounding factors.

Finally, we turn to figurative language as a resource, examining the pragmatics of visualizing metaphoric phrases. We demonstrate a novel approach to generating visual metaphor through the collaboration of large language models and diffusion-based text-to-image models, and in doing so create a novel dataset of visual metaphor with both literal and figurative captions. We then develop an evaluation framework using human-AI collaboration to examine the efficacy of the model collaboration, and choose a downstream task of visual entailment to evaluate the human-AI collaboration.


More About This Work

Academic Units: Computer Science
Thesis Advisors: Muresan, Smaranda
Degree: Ph.D., Columbia University
Published Here: August 9, 2023