Academic Commons

Theses Doctoral

Multi-Structured Models for Transforming and Aligning Text

Thadani, Kapil

Structured representations are ubiquitous in natural language processing, both as the product of text analysis tools and as a source of features for higher-level problems such as text generation. This dissertation explores the notion that different structured abstractions offer distinct but incomplete perspectives on the meaning encoded within a piece of text. We focus largely on monolingual text-to-text generation problems such as sentence compression and fusion, which present an opportunity to work toward general-purpose statistical models for text generation without strong assumptions about a domain or semantic representation. Systems that address these problems typically rely on a single structured representation of text to assemble a sentence; in contrast, we examine joint inference approaches which leverage the expressive power of heterogeneous representations for these tasks.
These ideas are introduced in the context of supervised sentence compression through a compact integer program to simultaneously recover ordered n-grams and dependency trees that specify an output sentence. Our inference approach avoids cyclic and disconnected structures through flow networks, generalizing over several established compression techniques and yielding significant performance gains on standard corpora. We then consider the tradeoff between optimal solutions, model flexibility and runtime efficiency by targeting the same objective with approximate inference techniques as well as polynomial-time variants which rely on mildly constrained interpretations of the compression task.
While improving runtime is a matter of both theoretical and practical interest, the flexibility of our initial technique can be further exploited to examine the multi-structured hypothesis under new structured representations and tasks. We therefore investigate extensions to recover directed acyclic graphs which can represent various notions of predicate-argument structure and use this to experiment with frame-semantic formalisms in the context of sentence compression. In addition, we generalize the compression approach to accommodate multiple input sentences for the sentence fusion problem and construct a new dataset of natural sentence fusions which permits an examination of challenges in automated content selection. Finally, the notion of multi-structured inference is considered in a different context -- that of monolingual phrase-based alignment -- where we find additional support for a holistic approach to structured text representation.

Files

  • Thadani_columbia_0054D_12631.pdf (1.11 MB)

More About This Work

Academic Units: Computer Science
Thesis Advisors: McKeown, Kathleen
Degree: Ph.D., Columbia University
Published Here: April 28, 2015
Academic Commons provides global access to research and scholarship produced at Columbia University, Barnard College, Teachers College, Union Theological Seminary and Jewish Theological Seminary. Academic Commons is managed by the Columbia University Libraries.