Theses Doctoral

Modern Latent Variable Analysis: High-dimensional Grade of Membership Models and Process Data

Chen, Ling

Uncovering interpretable latent patterns in data has been a key focus in statistical learning, an interest that has steadily grown with the advent of modern data. This dissertation presents theory and applications of latent variable analysis in two modern data regimes: high-dimensional multivariate categorical data from various applications and process data from computer-based assessments.

The first part of this dissertation focuses on the grade of membership (GoM) model, a type of mixed membership model for multivariate categorical data. The GoM model offers rich modeling power but presents significant challenges in model identifiability and estimation, especially with high-dimensional data. In Chapter 2, we present a new notion of expectation identifiability. Based on this identifiability notion, we propose a spectral method that exploits the singular subspace geometry and consistently estimates model parameters for binary responses. In Chapter 3, we extend the GoM models to encompass a broad range of data distributions and arbitrarily locally dependent noise, which we formalize as the generalized-GoM models. Specifically, we extend the proposed spectral method to polytomous data and count data. We establish finite-sample entrywise error bounds for the estimated model parameters. These bounds are supported by a new sharp two-to-infinity singular subspace perturbation theory for locally dependent and flexibly distributed noise, a contribution of independent interest.
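As a rough illustration of the singular subspace geometry mentioned above, the following sketch simulates binary data under a commonly used GoM parametrization and checks two facts about the expected response matrix. It is not the dissertation's estimator. The names `Pi`, `Theta`, `R0`, the Dirichlet memberships, and the presence of one "pure" subject per extreme profile are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, J, K = 200, 30, 3  # subjects, items, extreme profiles (illustrative sizes)

# Assumed parametrization: each row of Pi is a membership vector on the simplex.
Pi = rng.dirichlet(np.ones(K), size=N)
Pi[:K] = np.eye(K)  # assumption: the first K subjects are "pure" members

# Theta[j, k]: probability of a positive response to item j under profile k.
Theta = rng.uniform(0.1, 0.9, size=(J, K))

# Expected response matrix and one simulated binary data set.
R0 = Pi @ Theta.T
R = rng.binomial(1, R0)

# Top-K left singular vectors of the expected response matrix.
U, s, Vt = np.linalg.svd(R0, full_matrices=False)
U_K = U[:, :K]

# R0 has rank K, and every row of U_K is the convex combination of the
# pure subjects' rows with weights given by that subject's membership vector.
rank = np.linalg.matrix_rank(R0)
simplex_gap = np.max(np.abs(U_K - Pi @ U_K[:K]))
print(rank, simplex_gap)
```

In this noiseless setting the rows of `U_K` lie in a simplex whose vertices are the pure subjects' rows. A spectral estimator would work with the observed matrix `R` instead of `R0`, which is where perturbation bounds for the singular subspace become relevant.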

The second part of the dissertation focuses on process data, a modern data type from computer-based assessments. Process data consist of time-stamped action sequences recorded in computer log files, capturing respondents’ interactions with computer-based items. Unlike traditional outcome data, process data provide rich information on response processes and offer deeper insights into test-taking behaviors. However, the complex data structure poses significant challenges for integrating process data with established assessment models. In Chapter 4, we introduce a supervised variant of multidimensional scaling (MDS) to extract matrix-formatted features from process data, facilitating their use in subsequent statistical analyses. In Chapter 5, we apply the latent features extracted from process data to differential item functioning (DIF) analysis for assessing testing fairness. By utilizing process data features as proxies for nuisance latent attributes, we introduce a new scoring rule that incorporates respondents’ behaviors. This novel framework effectively reduces DIF and provides deeper insights into the interpretation of DIF.
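To make the idea of turning action sequences into matrix-formatted features concrete, here is a minimal sketch of plain classical (unsupervised) MDS applied to pairwise edit distances between sequences. The supervised variant introduced in the dissertation is not reproduced here. The toy action names, the choice of edit distance as the dissimilarity, and the helper names `edit_distance` and `classical_mds` are assumptions for illustration only.

```python
import numpy as np

def edit_distance(a, b):
    """Levenshtein distance between two sequences of actions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute or match
        prev = curr
    return prev[-1]

def classical_mds(D, dim):
    """Embed n objects in `dim` dimensions from an n x n dissimilarity matrix."""
    n = D.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * C @ (D ** 2) @ C              # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:dim]       # indices of the largest eigenvalues
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))

# Hypothetical action sequences, one per respondent.
sequences = [
    ["start", "click_a", "submit"],
    ["start", "click_a", "submit"],
    ["start", "click_b", "click_a", "submit"],
    ["start", "reset", "click_b", "submit"],
]

D = np.array([[edit_distance(s, t) for t in sequences] for s in sequences],
             dtype=float)
features = classical_mds(D, dim=2)  # one row of features per respondent
print(features.shape)
```

Each respondent ends up with a fixed-length feature vector, so sequences of different lengths can enter standard models as ordinary covariates. Respondents with identical action sequences receive identical coordinates.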

Files

  • Chen_columbia_0054D_19133.pdf (application/pdf, 4.47 MB)

More About This Work

Academic Units
Statistics
Thesis Advisors
Liu, Jingchen
Gu, Yuqi
Degree
Ph.D., Columbia University
Published Here
July 2, 2025