2016 Theses Doctoral
Spectral Methods for Natural Language Processing
Many state-of-the-art results in natural language processing (NLP) are achieved with statistical models involving latent variables. Unfortunately, computational problems associated with such models (for instance, finding the optimal parameter values) are typically intractable, forcing practitioners to rely on heuristic methods without strong guarantees. While heuristics are often sufficient for empirical purposes, their de-emphasis on theoretical aspects has certain negative ramifications. First, it can impede the development of rigorous theoretical understanding which can generate new ideas and algorithms. Second, it can lead to black art solutions that are unreliable and difficult to reproduce.
In this thesis, we argue that spectral methods---that is, methods that use singular value decomposition or other similar matrix or tensor factorization---can effectively remedy these negative ramifications. To this end, we develop spectral methods for two unsupervised language processing tasks. The first task is learning lexical representations from unannotated text (e.g., hierarchical clustering of a vocabulary). The second task is estimating parameters of latent-variable models used in NLP applications (e.g., for unsupervised part-of-speech tagging). We show that our spectral algorithms have the following advantages over previous methods:
1. The algorithms provide a new theoretical framework that is amenable to rigorous analysis. In particular, they are shown to be statistically consistent.
2. The algorithms are simple to implement, efficient, and scalable to large amounts of data. They also yield results that are competitive with the state-of-the-art.
Files
- Lee_columbia_0054D_13440.pdf application/pdf 1.19 MB Download File
More About This Work
- Academic Units
- Computer Science
- Thesis Advisors
- Collins, Michael
- Degree
- Ph.D., Columbia University
- Published Here
- July 11, 2016
Notes
The author is also known as Jang Sun Lee. Karl Stratos is preferred for citations.