2021 Theses Doctoral
Salience Estimation and Faithful Generation: Modeling Methods for Text Summarization and Generation
This thesis is focused on a particular text-to-text generation problem, automatic summarization, where the goal is to map a large input text to a much shorter summary text. The research presented aims to both understand and tame existing machine learning models, hopefully paving the way for more reliable text-to-text generation algorithms. Somewhat against the prevailing trends, we eschew end-to-end training of an abstractive summarization model, and instead break down the text summarization problem into its constituent tasks. At a high level, we divide these tasks into two categories: content selection, or “what to say” and content realization, or “how to say it” (McKeown, 1985). Within these categories we propose models and learning algorithms for the problems of salience estimation and faithful generation.
Salience estimation, that is, determining the importance of a piece of text relative to some context, falls into a problem of the former category, determining what should be selected for a summary. In particular, we experiment with a variety of popular or novel deep learning models for salience estimation in a single document summarization setting, and design several ablation experiments to gain some insight into which input signals are most important for making predictions. Understanding these signals is critical for designing reliable summarization models.
We then consider a more difficult problem of estimating salience in a large document stream, and propose two alternative approaches using classical machine learning techniques from both unsupervised clustering and structured prediction. These models incorporate salience estimates into larger text extraction algorithms that also consider redundancy and previous extraction decisions.
Overall, we find that when simple, position based heuristics are available, as in single document news or research summarization, deep learning models of salience often exploit them to make predictions, while ignoring the arguably more important content features of the input. In more demanding environments, like stream summarization, where heuristics are unreliable, more semantically relevant features become key to identifying salience content.
In part two, content realization, we assume content selection has already been performed and focus on methods for faithful generation (i.e., ensuring that output text utterances respect the semantics of the input content). Since they can generate very fluent and natural text, deep learning- based natural language generation models are a popular approach to this problem. However, they often omit, misconstrue, or otherwise generate text that is not semantically correct given the input content. In this section, we develop a data augmentation and self-training technique to mitigate this problem. Additionally, we propose a training method for making deep learning-based natural language generation models capable of following a content plan, allowing for more control over the output utterances generated by the model. Under a stress test evaluation protocol, we demonstrate some empirical limits on several neural natural language generation models’ ability to encode and properly realize a content plan.
Finally, we conclude with some remarks on future directions for abstractive summarization outside of the end-to-end deep learning paradigm. Our aim here is to suggest avenues for constructing abstractive summarization systems with transparent, controllable, and reliable behavior when it comes to text understanding, compression, and generation. Our hope is that this thesis inspires more research in this direction, and, ultimately, real tools that are broadly useful outside of the natural language processing community.
- Kedzie_columbia_0054D_16369.pdf application/pdf 2.06 MB Download File
More About This Work
- Academic Units
- Computer Science
- Thesis Advisors
- McKeown, Kathleen
- Ph.D., Columbia University
- Published Here
- February 22, 2021