Generating Natural Language Summaries from Multiple On-Line Sources: Language Reuse and Regeneration

Radev, Dragomir R.

The abundance of news wire on the World-Wide Web has resulted in at least four major problems, which seem to present the most interesting challenges to users and researchers alike: size, heterogeneity, change, and conflicting information. Size: several hundred newspapers and news agencies maintain Web sites with thousands of news stories each. Heterogeneity: some of the data related to news is in structured format (e.g., tables); more exists in semi-structured format (e.g., Web pages, encyclopedias, textual databases); the rest is in textual form (e.g., newswire). Change: most Web sites, and certainly all news sources, change on a daily basis. Conflicting information: different sources present conflicting or at least different views of the same event. We have approached the second, third, and fourth of these problems from the point of view of text generation. We have developed a system, SUMMONS, which, when coupled with appropriate information extraction technology, generates a specific genre of natural language summaries of a particular event (which we call briefings) in a restricted domain. The briefings are concise, contain facts from multiple heterogeneous sources, and incorporate evolving information, highlighting agreements and contradictions among sources on the same topic. We have developed novel techniques and algorithms for combining data from multiple sources at the conceptual level (using natural language understanding), for identifying new information on a given topic, and for presenting the information in natural language form to the user. We named the framework that we have developed for these problems language reuse and regeneration (LRR). Its novelty lies in the ability to produce text by collating text already written by humans on the Web.
The main features of LRR are increased robustness through a simplified parsing/generation component, reuse of text already written by humans, and facilities for the inclusion of structured data in computer-generated text. The present thesis contains an introduction to LRR and its use in multi-document summarization. We have paid special attention to the techniques for producing conceptual summaries of multiple sources, to the creation and use of an LRR-based lexicon for text generation, to a methodology used to identify new and old information in threads of documents, and to the generation of fluent natural language text using all the components above. The thesis contains evaluations of the different components of SUMMONS as well as certain aspects of LRR as a methodology. A review of the relevant literature is included as a separate chapter.



More About This Work

Academic Units
Computer Science
Department of Computer Science, Columbia University
Columbia University Computer Science Technical Reports, CUCS-025-99
Published Here
April 25, 2011