Data-Driven Solutions to Bottlenecks in Natural Language Generation
- Biran, Or
- Thesis Advisor(s):
- McKeown, Kathleen
- Ph.D., Columbia University
- Computer Science
- Concept-to-text generation suffers from what can be called generation bottlenecks: aspects of the generated text that should change across subject domains and that are usually hard to obtain or require manual work. Examples include domain-specific content, a type system, a dictionary, discourse style, and lexical style. These bottlenecks have stifled attempts to create generation systems that are generic, or that at least apply to a wide range of domains in non-trivial applications.
This thesis has two parts. In the first, we propose data-driven solutions that automate obtaining the information and models required to resolve some of these bottlenecks. Specifically, we present an approach to mining domain-specific paraphrasal templates from a simple text corpus; an approach to extracting a domain-specific taxonomic thesaurus from Wikipedia; and a novel document planning model that determines both ordering and discourse relations and that can be extracted from a domain corpus. We evaluate each solution individually, independently of its ultimate use in generation, and show significant improvements in each.
In the second part of the thesis, we describe a framework for creating generation systems that rely on these solutions, as well as on hybrid concept-to-text and text-to-text generation, and that can be automatically adapted to any domain using only a domain-specific corpus. We illustrate the breadth of applications this framework supports with three examples: biography generation and company description generation, which we use to evaluate the framework itself and the contribution of our solutions; and justification of machine learning predictions, a novel application that we evaluate in a task-based study to show its importance to users.
- Computer science
Natural language processing (Computer science)
- Suggested Citation:
- Or Biran, 2016, Data-Driven Solutions to Bottlenecks in Natural Language Generation, Columbia University Academic Commons, https://doi.org/10.7916/D8K93819.