Academic Commons

Theses Doctoral

From Language to the Real World: Entity-Driven Text Analytics

Xie, Boyi

This study focuses on the modeling of the underlying structured semantic information in natural language text to predict real-world phenomena. The thesis of this work is that a general and uniform representation of linguistic information that combines multiple levels, such as semantic frames and roles, syntactic dependency structure, lexical items and their sentiment values, can support challenging classification tasks for NLP problems. The hypothesis behind this work is that it is possible to generate a document representation using more complex data structures, such as trees and graphs, to distinguish the depicted scenarios and semantic roles of the entity mentions in text, which can facilitate text mining tasks by exploiting the deeper semantic information. The testbed for the document representation is entity-driven text analytics, a recent area of active research in which large collections of documents are analyzed to study and make predictions about real-world outcomes of the entities mentioned in text. The hypothesis is that prediction will be more successful if the representation captures not only the actual words and grammatical structures but also the underlying semantic generalizations encoded in frame semantics and the dependency relations among frames and words.
The main contributions of this study include a demonstration of the benefits of frame semantic features and of how to use them in document representation. Novel tree and graph structured representations are proposed to model mentioned entities by incorporating different levels of linguistic information, such as lexical items, syntactic dependencies, and semantic frames and roles. For machine learning on graphs, we propose a Node Edge Weighting graph kernel that allows a recursive computation on the substructures of graphs, which explores an exponential number of subgraphs for fine-grained feature engineering. We demonstrate the effectiveness of our model in predicting the price movement of companies in different market sectors based solely on financial news. Based on a comprehensive comparison of different structures of document representation and their corresponding learning methods, e.g. vector, tree and graph space models, we find that rich semantic feature learning on trees and graphs can lead to high prediction accuracy and to interpretable features for problem understanding.
Two key questions motivate this study: (1) Can semantic parsing based on frame semantics, a lexical conceptual representation that captures underlying semantic similarities (scenarios) across different forms, be exploited for prediction tasks where information is derived from large-scale document collections? (2) Given alternative data structures to represent the underlying meaning captured in frame semantics, which data structure will be most effective? To address (1), sentences that have dependency parses and frame semantic parses, together with specialized lexicons that incorporate aspects of sentiment in words, are used to generate representations that include individual lexical items, sentiment of lexical items, semantic frames and roles, syntactic dependency information and other structural relations among words and phrases within the sentence. To address (2), we incorporate the information derived from semantic frame parsing, dependency parsing, and specialized lexicons into vector space, tree space and graph space representations, and use kernel methods for the corresponding data structures in SVM (support vector machine) learning to compare their predictive power.
A vector space model beyond bag-of-words is first presented. It is based on a combination of semantic frame attributes, n-gram lexical items, and part-of-speech specific words weighted by a psycholinguistic dictionary. The second model encompasses a semantic tree representation that encodes the relations among semantic frame features and, in particular, the roles of the entity mentions in text. It depends on tree kernel functions for machine learning. The third is a semantic graph model that provides a concise and convenient representation of linguistic semantic information. It subsumes the vector space model and the semantic tree model by using a graph data structure for a unified representation for semantic frames, lexical items, and syntactic dependency relations derived from frame parses and dependency parses of sentences.
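As a rough illustration of the first model only, the combination of n-gram lexical items, semantic frame attributes, and lexicon-weighted part-of-speech specific words can be sketched as a sparse feature map. This is a minimal sketch under stated assumptions, not the thesis implementation: the function names, the feature-name prefixes, the input format (tokens, POS tags, and `(frame, role)` pairs assumed to come from external parsers), and the toy lexicon weights are all hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def featurize(tokens, pos_tags, frames, lexicon, max_n=2):
    """Build a sparse feature dict for one sentence (hypothetical feature scheme).

    tokens   : list of words
    pos_tags : list of part-of-speech tags, aligned with tokens
    frames   : list of (frame_name, role_of_target_entity) pairs,
               assumed to be produced by an external frame-semantic parser
    lexicon  : dict mapping a lowercased word to a sentiment-style weight
    """
    feats = Counter()
    # n-gram lexical items (unigrams up to max_n-grams), counted
    for n in range(1, max_n + 1):
        for gram in ngrams(tokens, n):
            feats["ng:" + gram] += 1.0
    # semantic frame attributes: the frame itself and the entity's role in it
    for frame, role in frames:
        feats["frame:" + frame] += 1.0
        feats["frame_role:" + frame + "/" + role] += 1.0
    # POS-specific words, weighted by the lexicon when the word is listed
    for tok, pos in zip(tokens, pos_tags):
        weight = lexicon.get(tok.lower())
        if weight is not None:
            feats["lex:" + pos + ":" + tok.lower()] += weight
    return dict(feats)


def linear_kernel(a, b):
    """Dot product of two sparse feature dicts."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(k, 0.0) for k, v in a.items())
```

A linear kernel over such dictionaries is the simplest point of comparison; the tree and graph models described next replace it with kernels defined on structured objects.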
The general goal of this study is to ground information derived by applying NLP techniques to textual datasets in real-world observations, where natural language semantics is used as a means to learn the semantic relations that are important in the domain and to understand what is relevant to the practitioner's objectives. Experiments are conducted in the financial domain to investigate whether our computational linguistic methodologies applied to large-scale analysis of financial news can improve the understanding of a company's fundamental market value, and whether linguistic information derived from news produces a consistent enough result to benefit more comprehensive financial models. Stock price data is aligned with news articles. Two kinds of labels are assigned: the existence of a price change and the direction of change. The change in price and polarity tasks are formulated as binary classification problems and bipartite ranking problems. Using the bag-of-words model and the proposed vector space model as benchmarks, the experiments show a significant improvement from the use of the semantic tree model. The semantic graph model, with more expressive power, outperforms both the vector space model and the tree space model. At best, there may be a weak predictive effect of news on price for a particular data instance (for example, a company on a given date), due to the fluctuating uncertainty of financial markets and the efficient market hypothesis. However, the proposed models and their outputs can provide useful information to guide financial market price prediction and to help business analysts discover potential investment opportunities.
These advantages come from the rich expressive power of the semantic tree model and the semantic graph space model, since the models are able to learn the semantic relations that are important in the problem domain, and effectively discover the useful underlying structured semantic information from large-scale textual data.
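The two kinds of labels mentioned in the abstract, existence of a price change and direction of change, can be sketched for a single data instance (a company on a date) as follows. This is a hedged illustration only: the abstract does not say how a change is defined, so the relative-return formula, the `threshold` value of 2%, and the choice to leave polarity undefined when there is no change are all assumptions, and `label_instance` is a hypothetical name.

```python
def label_instance(price_before, price_after, threshold=0.02):
    """Return (change, polarity) labels for one company-date instance.

    change   : 1 if the absolute relative price move reaches the threshold, else 0
    polarity : +1 for an upward move, -1 for a downward move,
               None when no change is registered (an assumption of this sketch)
    """
    relative_move = (price_after - price_before) / price_before
    change = 1 if abs(relative_move) >= threshold else 0
    polarity = None
    if change:
        polarity = 1 if relative_move > 0 else -1
    return change, polarity
```

Each label then defines a separate binary task: one classifier separates change from no change, and another separates upward from downward moves among the instances that changed.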



  • Xie_columbia_0054D_13024.pdf (6.45 MB)

More About This Work

Academic Units
Computer Science
Thesis Advisors
Passonneau, Rebecca
Ph.D., Columbia University
Published Here
October 16, 2015