Academic Commons

Automatic Segmentation and Part-Of-Speech Tagging For Tibetan: A First Step Towards Machine Translation

Hackett, Paul G.

This paper presents what we believe to be the first reported work on Tibetan machine translation (MT). Of the three conceptually distinct components of an MT system (analysis, transfer, and generation), the first phase, consisting of part-of-speech (POS) tagging, has been successfully completed. The combined POS tagger and word segmenter was manually constructed as a rule-based multi-tagger relying on the Wilson formulation of Tibetan grammar. Partial parsing was also performed in combination with POS-tag sequence disambiguation. The component was evaluated on the task of document indexing for Information Retrieval (IR). Preliminary analysis indicated slightly better (though statistically comparable) performance than n-gram-based approaches on a known-item IR task. Although segmentation is application-specific, error analysis placed segmentation accuracy at 99%; the accuracy of the POS tagger is also estimated at 99%, based on IR error analysis and random sampling.

More About This Work

Academic Units
American Institute of Buddhist Studies

Published Here
May 31, 2011

Notes

Proceedings of the 9th Seminar of the International Association for Tibetan Studies, Leiden, The Netherlands, June 24-30, 2000 — Information Technology Panel.
