Academic Commons

Theses Doctoral

Dimension Reduction for Short Text Similarity and its Applications

Guo, Weiwei

Recently, due to the burst of online text data, much of the focus of natural language processing (NLP) research has shifted from long documents to shorter ones such as sentences and utterances. However, short texts pose significant challenges from an NLP perspective, especially when the goal is to get at sentence-level semantics in the absence of larger contexts. Motivated by this challenge, this thesis focuses on the problem of predicting the similarity between two short text samples by extracting the latent representation of the text data, and we apply the resulting models to various NLP tasks that involve short text similarity computation.
The major challenge of computing short text similarity is the insufficient information in the text snippets. In a sentence similarity benchmark [Agirre et al., 2012], a sentence has on average only 10.8 words. Hence, there are very few overlapping words in a text pair even when the two texts are semantically related, and the widely used bag-of-words representation fails to capture the semantic relatedness. To this end, we propose several weighted matrix factorization models for learning latent representations of texts, which induce meaningful similarity scores:
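As an illustration of this sparsity problem (the sentences below are invented examples, not drawn from the benchmark), the bag-of-words cosine score between two semantically related short sentences can be driven almost entirely by function words:

```python
# Sketch: bag-of-words cosine similarity on two related short sentences.
# The only token the pair shares is the stopword "the", so the score is
# low despite the clear semantic relatedness.
from collections import Counter
import math

def bow_cosine(a, b):
    """Cosine similarity between bag-of-words term-count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

s1 = "the surgeon operated on the patient"
s2 = "a doctor performed the procedure"
print(round(bow_cosine(s1, s2), 3))  # prints 0.316, all of it from "the"
```

Removing stopwords would drive this score to exactly zero, which is the failure mode the latent-representation models below are designed to avoid.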
1. Modeling Missing Words: To address the word sparsity issue, we propose to model the missing words (words that are not in the short text), a feature that is typically overlooked in the literature. We define the missing words of a text as the whole vocabulary of a corpus minus the observed words in the text. The model handles the missing words carefully by assigning them a small weight in the matrix factorization framework. In the experiments, the new model, weighted matrix factorization (WMF), achieves superior performance to Latent Dirichlet Allocation (LDA) [Blei et al., 2003], which does not use missing words, and latent semantic analysis (LSA) [Deerwester et al., 1990], which uses missing words but does not distinguish them from observed words.
2. Modeling Lexical Semantics: We improve the previous WMF model in terms of lexical semantics. For short text similarity, it is crucial to robustly model each word in the text to capture the complete semantic picture, since there is very little redundant information in the short context. To this end, we incorporate both corpus-based (bigrams) and knowledge-based (similar words extracted from a dictionary) lexical semantics into the WMF model. The experiments show that the two sources of additional information are both helpful and complementary to each other.
3. Similarity Computation for Large-scale Data Sets: We tackle the short text similarity problem in a large-scale setting, i.e., given a query tweet, compute the similarity/distance to every other data point in a database, and rank the points by similarity/distance score. To reduce computation time, we exploit binary coding to transform each data sample into a compact binary code, which enables highly efficient similarity computation via Hamming distances between the generated codes. In order to preserve as much of the original data as possible in the binary bits, we restrict the projection directions to be nearly orthogonal, thereby reducing redundant information. The resulting model demonstrates better performance on both a short text similarity task and a tweet retrieval task.
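The binary coding step can be sketched as follows, with the caveat that random orthogonal directions obtained from a QR factorization stand in here for the nearly orthogonal directions the thesis learns from data:

```python
# Sketch of binary coding for fast similarity search: project each latent
# vector onto orthogonal directions, keep only the signs, and rank a
# database by Hamming distance to a query code.
import numpy as np

def make_codes(V, bits=16, seed=0):
    """Sign-binarize rows of V after an orthogonal projection (needs dim >= bits)."""
    rng = np.random.default_rng(seed)
    Rq, _ = np.linalg.qr(rng.normal(size=(V.shape[1], bits)))
    return V @ Rq > 0                        # boolean codes, one row per sample

def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return int(np.count_nonzero(a != b))

rng = np.random.default_rng(1)
V = rng.normal(size=(100, 32))               # stand-in latent text vectors
codes = make_codes(V)
query = codes[0]
ranked = sorted(range(len(codes)), key=lambda i: hamming(query, codes[i]))
```

With b-bit codes, comparing a query against the whole database costs one XOR/popcount per pair instead of a full floating-point dot product in the original dimension; note also that sign codes are invariant to positive rescaling of the input vectors.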
We are interested not only in the short text similarity task itself, but also in how much the model can contribute to other NLP tasks. Accordingly, we adapt the short text similarity models to several NLP tasks closely associated with semantics, all of which involve intensive similarity computation:
4. Text Summarization Evaluation: The pyramid method is one of the most popular methods for evaluating content selection in summarization, but it requires manual inspection during evaluation. Recently some efforts have been made to automate the evaluation process: [Harnly et al., 2005] searched for the key facts/concepts covered in the summaries based on surface word matching. We apply the WMF model to this task to enable more accurate identification of key facts in summaries. The resulting automated pyramid scores correlate very well with manual pyramid scores.
5. Word Sense Disambiguation: Unsupervised Word Sense Disambiguation (WSD) systems rely heavily on a sense similarity module that returns a similarity score given two senses. Currently the most popular sense similarity measure is Extended Lesk [Banerjee and Pedersen, 2003], which calculates the similarity score based on the number of overlapping words and phrases between two extended dictionary definitions. We propose a new sense similarity measure, wmfvec, by running WMF on the sense definition data and integrating WordNet [Fellbaum, 1998] features. The WSD system using wmfvec significantly outperforms traditional surface-form-based WSD algorithms as well as LDA-based systems.
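For intuition, a heavily simplified word-overlap version of the Lesk idea, using invented glosses (the actual Extended Lesk measure also scores multi-word phrase overlaps and pools glosses of related WordNet senses), looks like:

```python
# Simplified Lesk-style gloss overlap: count the word types shared by two
# sense definitions.  The glosses below are illustrative, not WordNet's.
def gloss_overlap(gloss_a, gloss_b):
    """Count word types shared by two sense definitions."""
    return len(set(gloss_a.lower().split()) & set(gloss_b.lower().split()))

bank_river = "sloping land beside a body of water"
bank_money = "a financial institution that accepts deposits"
shore = "the land along the edge of a body of water"
print(gloss_overlap(bank_river, shore), gloss_overlap(bank_money, shore))  # prints: 5 1
```

Because such overlap counts are again limited by surface forms (and inflated by stopwords like "a" and "of"), wmfvec instead compares the latent WMF vectors of the two definitions, e.g. via cosine similarity.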
6. Linking Tweets to News: In this task we target social media data and news data. We propose a new task of linking a tweet to the news article that is most relevant to it. The motivation of the task is to augment the context of a tweet with a news article. We extend the WMF model to incorporate multiple Twitter/news-specific features, i.e., hashtags, named entities, and timestamps. Our experiments show significant improvements of the new model over baselines on various evaluation metrics.



  • Guo_columbia_0054D_12757.pdf (1.26 MB)

More About This Work

Academic Units
Computer Science
Thesis Advisors
Diab, Mona
Ph.D., Columbia University
Published Here
May 12, 2015
Academic Commons provides global access to research and scholarship produced at Columbia University, Barnard College, Teachers College, Union Theological Seminary and Jewish Theological Seminary. Academic Commons is managed by the Columbia University Libraries.