Reports

Pattern Matching for Translating Domain-Specific Terms from Large Corpora

Fung, Pascale

Translating domain-specific terms is one significant component of machine translation and Machine-aided translation systems. These terms are often not found in standard dictionaries. Human translators, not being experts in every technical or regional domain, cannot produce their translations effectively. Automatic translation of domain-specific terms is therefore highly desirable. Most other work on automatic term translation uses statistical information of words from parallel corpora. Parallel corpora of clean- translated texts are hard to come by whereas there are more noisy- translated texts and many more monolingual texts in various domains. We propose using noisy parallel texts and same-domain texts of a pair of languages to translate terms. In our work, we propose using a novel paradigm of pattern matching of statistical signals of word features. These features are robust to the syntactic structure, character sets, language of the text, and to the domain. We obtain statistical information which is related to the lexical properties of a word and its translation in any other language of the same domain. These lexical properties are extracted from the corpora and represented in vector form. We propose using signal processing techniques for matching these features vectors of a word to those of its translation. Another matching technique we propose is applying discriminative analysis of the word features. For each word, the various features are combined into a single vector which is then transformed into a smaller dimension eigenvector for matching. Since most domain specific terms are nouns and noun phrases, we concentrate on translating English nouns and noun phrases into other languages. We study the relationship between English noun phrases and their translations in Chinese, Japanese and French in parallel corpora. The result of this study is used in our system for translation of English noun phrases into these other languages from noisy parallel and non-parallel corpora.

Subjects

Files

More About This Work

Academic Units
Computer Science
Publisher
Department of Computer Science, Columbia University
Series
Columbia University Computer Science Technical Reports, CUCS-015-95
Published Here
February 7, 2012