Automatic Segmentation and Part-Of-Speech Tagging For Tibetan: A First Step Towards Machine Translation

Hackett, Paul G.

This paper presents what we believe to be the first reported work on Tibetan machine translation (MT). Of the three conceptually distinct components of an MT system (analysis, transfer, and generation), the first phase, consisting of POS tagging, has been successfully completed. The combined POS tagger and word segmenter was manually constructed as a rule-based multi-tagger relying on the Wilson formulation of Tibetan grammar. Partial parsing was also performed in combination with POS-tag sequence disambiguation. The component was evaluated on the task of document indexing for Information Retrieval (IR). Preliminary analysis indicated performance slightly better than (though statistically comparable to) that of n-gram-based approaches on a known-item IR task. Although segmentation is application-specific, error analysis placed segmentation accuracy at 99%; the accuracy of the POS tagger is also estimated at 99%, based on IR error analysis and random sampling.


More About This Work

Academic Units: American Institute of Buddhist Studies
Published Here: May 31, 2011


Proceedings of the 9th Seminar of the International Association for Tibetan Studies, Leiden, The Netherlands, June 24-30, 2000 — Information Technology Panel.