2006 Presentations (Communicative Events)
Arabic Preprocessing Schemes for Statistical Machine Translation
In this paper, we study the effect of different word-level preprocessing decisions for Arabic on SMT quality. Our results show that given large amounts of training data, splitting off only proclitics performs best. However, for small amounts of training data, it is best to apply English-like tokenization using part-of-speech tags, and sophisticated morphological analysis and disambiguation. Moreover, choosing the appropriate preprocessing produces a significant increase in BLEU score if there is a change in genre between training and test data.
Files
- habash_sadat_06.pdf (application/pdf, 76.1 KB)
More About This Work
- Academic Units
- Computer Science
- Publisher
- Proceedings of the HLT-NAACL
- Published Here
- July 5, 2013