Theses Doctoral

Beyond Discourse: Computational Text Analysis and Material Historical Processes

Atria, Jose Tomas

This dissertation proposes a general methodological framework for the application of computational text analysis to the study of long duration material processes of transformation, beyond their traditional application to the study of discourse and rhetorical action. Over a thin theory of the linguistic nature of social facts, the proposed methodology revolves around the compilation of term co-occurrence matrices and their projection into different representations of an hypothetical semantic space. These representations offer solutions to two problems inherent to social scientific research: that of "mapping" features in a given representation to theoretical entities and that of "alignment" of the features seen in models built from different sources in order to enable their comparison.
The data requirements of the exercise are discussed through the introduction of the notion of a "narrative horizon", the extent to which a given source incorporates a narrative account in its rendering of the context that produces it. Useful primary data will consist of text with short narrative horizons, such that the ideal source will correspond to a continuous archive of institutional, ideally bureaucratic text produced as mere documentation of a definite population of more or less stable and comparable social facts across a couple of centuries. Such a primary source is available in the Proceedings of the Old Bailey (POB), a collection of transcriptions of 197,752 criminal trials seen by the Old Bailey and the Central Criminal Court of London and Middlesex between 1674 and 1913 that includes verbatim transcriptions of witness testimony. The POB is used to demonstrate the proposed framework, starting with the analysis of the evolution of an historical corpus to illustrate the procedure by which provenance data is used to construct longitudinal and cross-sectional comparisons of different corpus segments.
The co-occurrence matrices obtained from the POB corpus are used to demonstrate two different projections: semantic networks that model different notions of similarity between the terms in a corpus' lexicon as an adjacency matrix describing a graph and semantic vector spaces that approximate a lower-dimensional representation of an hypothetical semantic space from its empirical effects on the co-occurrence matrix.
Semantic networks are presented as discrete mathematical objects that offer a solution to the mapping problem through operation that allow for the construction of sets of terms over which an order can be induced using any measure of significance of the strength of association between a term set and its elements. Alignment can then be solved through different similarity measures computed over the intersection and union of the sets under comparison.
Semantic vector spaces are presented as continuous mathematical objects that offer a solution to the mapping problem in the linear structures contained in them. This include, in all cases, a meaningful metric that makes it possible to define neighbourhoods and regions in the semantic space and, in some cases, a meaningful orientation that makes it possible to trace dimensions across them. Alignment can then proceed endogenously in the case of oriented vector spaces for relative comparisons, or through the construction of common basis sets for non-oriented semantic spaces for absolute comparisons.
The dissertation concludes with the proposition of a general research program for the systematic compilation of text distributional patterns in order to facilitate a much needed process of calibration required by the techniques discussed in the previous chapters. Two specific avenues for further research are identified. First, the development of incremental methods of projection that allow a semantic model to be updated as new observations come along, an area that has received considerable attention from the field of electronic finance and the pervasive use of Gentleman's algorithm for matrix factorisation. Second, the development of additively decomposable models that may be combined or disaggregated to obtain a similar result to the one that would have been obtained had the model being computed from the union or difference of their inputs. This is established to be dependent on whether the functions that actualise a given model are associative under addition or not.


  • thumnail for Atria_columbia_0054D_14862.pdf Atria_columbia_0054D_14862.pdf application/pdf 5.92 MB Download File

More About This Work

Academic Units
Thesis Advisors
Bearman, Peter Shawn
Ph.D., Columbia University
Published Here
September 7, 2018