Reports

Querying Large Text Databases for Efficient Information Extraction

Agichtein, Eugene; Gravano, Luis

A wealth of data is hidden within unstructured text. This data is often best exploited in structured or relational form, which is suited for sophisticated query processing, for integration with relational databases, and for data mining. Current information extraction techniques extract relations from a text database by examining every document in the database. This exhaustive approach is not practical, or sometimes even feasible, for large databases. In this paper, we develop an efficient query-based technique to identify documents that are potentially useful for the extraction of a target relation. We start by sampling the database to characterize the documents from which an information extraction system manages to extract relevant tuples. Then, we apply machine learning and information retrieval techniques to derive queries likely to match additional useful documents in the database. Finally, we issue these queries to the database to retrieve documents from which the information extraction system can extract the final relation. Our technique requires that databases support only a minimal boolean query interface, and is independent of the choice of the underlying information extraction system. We report a thorough experimental evaluation over more than one million documents that shows that we significantly improve the efficiency of the extraction process by focusing only on promising documents. Our proposed technique could be used to query a standard web search engine, hence providing a building block for efficient information extraction over the web at large.

Subjects

Files

More About This Work

Academic Units
Computer Science
Publisher
Department of Computer Science, Columbia University
Series
Columbia University Computer Science Technical Reports, CUCS-008-02
Published Here
April 21, 2011