An Entropy-based Assessment of the Unicode Encoding for Tibetan

Hackett, Paul G.

This paper analyzes the Unicode encoding scheme for Tibetan from the standpoint of morpheme entropy. Two levels of entropy can be distinguished in Tibetan: syllable entropy, a measure of the probability of the sequential occurrence of syllables, and morpheme entropy, a measure of the probability of the sequential occurrence of characters or morphemes. The latter also measures the redundancy of the language. Syllable entropy is a purely statistical quantity that depends on the domain of the literature sampled, whereas morpheme entropy, we show, is relatively domain-independent given a statistically significant sample. Morpheme entropy can be estimated statistically, and a theoretical upper bound can also be postulated from language-dependent morphological rules. The paper presents both theoretical and statistical estimates of the morpheme entropy of Tibetan and examines the Tibetan Unicode encoding scheme in relation to data compression and other issues that arise in entropy-based language modeling.
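
As a rough illustration of what a statistical entropy estimate involves, the following minimal sketch computes two quantities over the Unicode code points of a text: the zeroth-order (unigram) Shannon entropy and the first-order conditional entropy of each code point given its predecessor. It is not the paper's method. The function names, the choice of code points as the unit of analysis, and the sample string are all assumptions made for illustration, and a meaningful estimate would need a far larger corpus than one phrase.

```python
import math
from collections import Counter


def unigram_entropy(text):
    """Shannon entropy, in bits per symbol, of the code points in text."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def conditional_entropy(text):
    """Estimate H(X_i | X_{i-1}) in bits from adjacent code-point pairs."""
    pairs = Counter(zip(text, text[1:]))   # counts of (previous, current)
    firsts = Counter(text[:-1])            # counts of each symbol as 'previous'
    total = sum(pairs.values())
    h = 0.0
    for (prev, _cur), n in pairs.items():
        # p(prev, cur) * log2 p(cur | prev)
        h -= (n / total) * math.log2(n / firsts[prev])
    return h


if __name__ == "__main__":
    # Hypothetical sample: a short Tibetan phrase in Unicode (assumption,
    # far too small for a real estimate).
    sample = "བཀྲ་ཤིས་བདེ་ལེགས།"
    print("unigram bits/code point:    ", unigram_entropy(sample))
    print("conditional bits/code point:", conditional_entropy(sample))
```

The conditional estimate is never larger than the unigram estimate on the same text, and the gap between them gives a crude indication of how much sequential structure, and therefore redundancy, the encoding carries.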



Academic Units: American Institute of Buddhist Studies
Published Here: May 31, 2011


Proceedings of the 10th Seminar of the International Association for Tibetan Studies, Oxford, United Kingdom, Sept. 6-12, 2003 — Information Technology Panel.