Technical reports:
Microcoding the Lexicon with Co-occurrence Knowledge
Frank A. Smadja
- Title:
- Microcoding the Lexicon with Co-occurrence Knowledge
- Author(s):
- Smadja, Frank A.
- Date:
- 1989
- Type:
- Technical reports
- Department:
- Computer Science
- Permanent URL:
- http://hdl.handle.net/10022/AC:P:12114
- Series:
- Columbia University Computer Science Technical Reports
- Part Number:
- CUCS-448-89
- Publisher:
- Department of Computer Science, Columbia University
- Publisher Location:
- New York
- Abstract:
- Neither syntax nor semantics can justify the use of a certain class of English word combinations. This class contains word pairs that often appear together in a given context of meaning. Such pairs are called co-occurrence relations or idiosyncratic collocations [3]. To correctly understand or produce natural language, such lexical relations need to be specifically encoded in lexicons [6], [10], [1]. In this paper, we show how word-based lexicons can be enriched with automatically acquired lexical relations. We call this process microcoding the lexicon, since it corresponds to the addition of lexical associations in a regular lexicon. We are using our enriched lexicon for language generation. Co-occurrence knowledge is particularly important for language generation; without it, awkward or incorrect sentences could be produced. In previous natural language work, co-occurrence knowledge was either ignored or hand-encoded. In contrast, we acquire it automatically from the analysis of large textual corpora. We describe the acquisition method, based on EXTRACT [12], a co-occurrence compiler that retrieves lexical relations from the statistical analysis of a large corpus. We indicate how these lexical associations are entered in a word-based lexicon in a way that is useful and coherent for language generators. We then show how this information is used in COOK, a functional-unification-based language generator that correctly handles collocationally restricted sentences. Whenever possible, we use examples taken from the bank and stock market domains.
- Subject(s):
- Computer science