Academic Commons

Presentations (Communicative Events)

Arabic Preprocessing Schemes for Statistical Machine Translation

Habash, Nizar Y.; Sadat, Fatiha

In this paper, we study the effect of different word-level preprocessing decisions for Arabic on SMT quality. Our results show that, given large amounts of training data, splitting off only proclitics performs best. For small amounts of training data, however, it is best to apply English-like tokenization using part-of-speech tags and sophisticated morphological analysis and disambiguation. Moreover, choosing the appropriate preprocessing produces a significant increase in BLEU score if there is a change in genre between training and test data.

More About This Work

Academic Units
Computer Science
Publisher
Proceedings of the HLT-NAACL
Published Here
July 5, 2013