Academic Commons

Theses Doctoral

Ranking for Scalable Information Extraction

Barrio Gonzalez, Pablo Javier

Information extraction systems are complex software tools that discover structured information in natural language text. For instance, an information extraction system trained to extract tuples for an Occurs-in(Natural Disaster, Location) relation may extract the tuple <tsunami, Hawaii> from the sentence: "A tsunami swept the coast of Hawaii." Having information in structured form enables more sophisticated querying and data mining than what is possible over the natural language text. Unfortunately, information extraction is a time-consuming task. For example, a state-of-the-art information extraction system to extract Occurs-in tuples may take up to two hours to process only 1,000 text documents. Since document collections routinely contain millions of documents or more, improving the efficiency and scalability of the information extraction process over these collections is critical. As a significant step towards this goal, this dissertation presents approaches for (i) enabling the deployment of efficient information extraction systems and (ii) scaling the information extraction process to large volumes of text.
To enable the deployment of efficient information extraction systems, we have developed two crucial building blocks for this task. As a first contribution, we have created REEL, a toolkit to easily implement, evaluate, and deploy full-fledged relation extraction systems. REEL, in contrast to existing toolkits, effectively modularizes the key components involved in relation extraction systems and can integrate other long-established text processing and machine learning toolkits. To define a relation extraction system for a new relation and text collection, users only need to specify the desired configuration, which makes REEL a powerful framework for both research and application building. As a second contribution, we have addressed the problem of building representative extraction task-specific document samples from collections, a step often required by approaches for efficient information extraction. Specifically, we devised fully automatic document sampling techniques for information extraction that can produce better-quality document samples than the state-of-the-art sampling strategies; furthermore, our techniques are substantially more efficient than the existing alternative approaches.
To scale the information extraction process to large volumes of text, we have developed approaches that address the efficiency and scalability of the extraction process by focusing the extraction effort on the collections, documents, and sentences worth processing for a given extraction task. For collections, we have studied both (adaptations of) state-of-the art approaches for estimating the number of documents in a collection that lead to the extraction of tuples as well as information extraction-specific approaches. Using these estimations we can identify the collections worth processing and ignore the rest, for efficiency. For documents, we have developed an adaptive document ranking approach that relies on learning-to-rank techniques to prioritize the documents that are likely to produce tuples for an extraction task of choice. Our approach revises the (learned) ranking decisions periodically as the extraction process progresses and new characteristics of the useful documents are revealed. Finally, for sentences, we have developed an approach based on the sparse group selection problem that identifies sentences|modeled as groups of words|that best characterize the extraction task. Beyond identifying sentences worth processing, our approach aims at selecting sentences that lead to the extraction of unseen, novel tuples. Our approaches are lightweight and efficient, and dramatically improve the efficiency and scalability of the information extraction process. We can often complete the extraction task by focusing on just a very small fraction of the available text, namely, the text that contains relevant information for the extraction task at hand. Our approaches therefore constitute a substantial step towards efficient and scalable information extraction over large volumes of text.

Geographic Areas

Files

More About This Work

Academic Units
Computer Science
Thesis Advisors
Gravano, Luis
Degree
Ph.D., Columbia University
Published Here
October 16, 2015
Academic Commons provides global access to research and scholarship produced at Columbia University, Barnard College, Teachers College, Union Theological Seminary and Jewish Theological Seminary. Academic Commons is managed by the Columbia University Libraries.