1995 Presentations (Communicative Events)
Compiling Bilingual Lexicon Entries From a Non-Parallel English-Chinese Corpus
We propose a novel similarity measure, context heterogeneity, between words and their translations to help compile bilingual lexicon entries from a non-parallel English-Chinese corpus. Current algorithms for bilingual lexicon compilation rely on occurrence frequencies, lengths, or positional statistics derived from parallel texts. In non-parallel corpora, however, there is little correlation between such statistics of a word and those of its translation. We suggest instead that words with productive contexts in one language translate to words with productive contexts in another language, and that words with rigid contexts translate into words with rigid contexts. Context heterogeneity measures how productive the context of a word is in a given domain, independent of its absolute occurrence frequency in the text. Based on this information, we derive statistics of bilingual word pairs from a non-parallel corpus. These statistics can be used to bootstrap a bilingual dictionary compilation algorithm.
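The abstract does not give a formula for context heterogeneity, so the following is only a minimal sketch under an assumed definition: for each word, the number of distinct tokens seen immediately to its left (and to its right) divided by the number of times the word occurs. Dividing by the occurrence count is one way to make the measure independent of absolute frequency, as the abstract requires. The function names `context_heterogeneity` and `heterogeneity_distance`, and the use of Euclidean distance to compare two words, are illustrative assumptions rather than the paper's stated method.

```python
from collections import defaultdict


def context_heterogeneity(tokens):
    """Map each word to an assumed (left, right) heterogeneity pair.

    left  = distinct tokens immediately preceding the word / occurrences
    right = distinct tokens immediately following the word / occurrences
    This definition is an assumption, not taken from the abstract.
    """
    counts = defaultdict(int)
    left_neighbors = defaultdict(set)
    right_neighbors = defaultdict(set)

    for i, word in enumerate(tokens):
        counts[word] += 1
        if i > 0:
            left_neighbors[word].add(tokens[i - 1])
        if i + 1 < len(tokens):
            right_neighbors[word].add(tokens[i + 1])

    return {
        word: (
            len(left_neighbors[word]) / counts[word],
            len(right_neighbors[word]) / counts[word],
        )
        for word in counts
    }


def heterogeneity_distance(a, b):
    """Euclidean distance between two (left, right) pairs (assumed metric)."""
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5


if __name__ == "__main__":
    text = "the cat sat on the mat and the dog sat on the rug".split()
    scores = context_heterogeneity(text)
    print(scores["the"], scores["on"])
    print(heterogeneity_distance(scores["the"], scores["on"]))
```

Under this sketch, a word with a rigid context (always flanked by the same neighbors) scores close to zero on both sides, while a word with a productive context scores closer to one. Candidate translation pairs could then be ranked by how small the distance is between their pairs, each computed on its own monolingual half of the corpus.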
Files
- fung_95a.pdf application/pdf 185 KB
More About This Work
- Academic Units
- Computer Science
- Publisher
- Proceedings of the 3rd Annual Workshop on Very Large Corpora
- Published Here
- April 26, 2013