Theses Doctoral

Low dimensional structure in single cell data

Kunes, Russell Allen Zhang

This thesis presents three methods, each of which concerns the estimation of interpretable low dimensional representations of high dimensional data. The first two chapters consider methods for fitting low dimensional nonlinear representations. In Chapter 1, we discuss the deterministic input, noisy "and" gate (DINA) model, and in Chapter 2, binary variational autoencoders. We present an application to single cell assay for transposase accessible chromatin sequencing (single cell ATACseq) data, where the DINA model uncovers meaningful discrete representations of cell state. In scientific applications, practitioners often have substantial prior knowledge of the latent components driving variation in the data. Chapter 3 develops a supervised matrix factorization method, Spectra, that leverages annotations from experts and previous biological experiments to uncover latent representations of single cell RNAseq data.

Variational inference for the DINA model:
The deterministic input, noisy "and" gate (DINA) model allows for matrix decomposition in which latent factors interact via an "and" relationship. We develop a variational inference approach for estimating the parameters of the DINA model. Previous approaches based on variational inference enumerate the space of latent binary parameters (requiring exponentially many parameters) and cannot fit an unknown number of latent components. Here, we report that a practical mean field variational inference approach, relying on a nonparametric cumulative shrinkage process prior and stochastic coordinate ascent updates, achieves results competitive with existing methods while simultaneously determining the number of latent components. This approach allows exploratory Q-matrix estimation to scale to datasets of practical size with minimal hyperparameter tuning.
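To make the "and" relationship concrete, the following is a minimal sketch of the DINA response probability in its standard parameterization from the cognitive diagnosis literature. It is not the variational inference procedure developed in the thesis, and the names `alpha`, `Q`, `slip`, and `guess` are illustrative assumptions that may differ from the thesis's notation.

```python
import numpy as np

def dina_response_prob(alpha, Q, slip, guess):
    """P(X[i, j] = 1) under the standard DINA parameterization (a sketch).

    alpha : (n, K) binary latent attributes, one row per observation
    Q     : (J, K) binary Q-matrix, one row per observed feature
    slip  : (J,) probability of a 0 when all required attributes are present
    guess : (J,) probability of a 1 when some required attribute is missing
    """
    alpha = np.asarray(alpha, dtype=int)
    Q = np.asarray(Q, dtype=int)
    slip = np.asarray(slip, dtype=float)
    guess = np.asarray(guess, dtype=float)
    # "And" gate: eta[i, j] = 1 only if row i has every attribute
    # that feature j requires according to Q.
    eta = (alpha @ Q.T == Q.sum(axis=1)).astype(float)
    # Noisy output: 1 - slip when the gate is on, guess when it is off.
    return eta * (1.0 - slip) + (1.0 - eta) * guess
```

In this sketch, a feature that requires two latent attributes is "on" only for observations that possess both, which is what distinguishes the model from an additive factorization.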

Gradient estimation for binary latent variable models:
To fit binary variational autoencoders, the gradient of the objective function must be estimated. More generally, gradient estimation is often necessary for fitting generative models with discrete latent variables, for example in reinforcement learning and variational autoencoder (VAE) training. The DisARM estimator (Yin et al. 2020; Dong, Mnih, and Tucker 2020) achieves state of the art gradient variance for Bernoulli latent variable models in many contexts. However, DisARM and other estimators have potentially exploding variance near the boundary of the parameter space, where solutions tend to lie. To ameliorate this issue, we propose a new gradient estimator, bitflip-1, that has lower variance at the boundaries of the parameter space. As bitflip-1 has complementary properties to existing estimators, we introduce an aggregated estimator, unbiased gradient variance clipping (UGC), that uses either a bitflip-1 or a DisARM gradient update for each coordinate. We prove that UGC has uniformly lower variance than DisARM. Empirically, we observe that UGC achieves the optimal value of the optimization objectives in toy experiments, discrete VAE training, and a best subset selection problem.
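As background for the baseline mentioned above, the following is a minimal sketch of the DisARM estimator for independent Bernoulli variables with logits `theta`, written from the form published by Dong, Mnih, and Tucker (2020). It does not implement bitflip-1 or UGC, whose definitions are not given in this abstract. The helper `exact_gradient` and the argument names are assumptions introduced only for checking the estimate on small problems.

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def disarm_gradient(f, theta, rng, n_samples=1):
    """Monte Carlo estimate of d/d(theta) E_{b ~ Bern(sigmoid(theta))}[f(b)]."""
    theta = np.asarray(theta, dtype=float)
    p = sigmoid(theta)
    u = rng.uniform(size=(n_samples, theta.size))
    # Antithetic pair of samples driven by the same uniforms.
    b = (1.0 - u < p).astype(float)
    b_tilde = (u < p).astype(float)
    f_diff = np.array([f(x) - f(y) for x, y in zip(b, b_tilde)])
    sign = 1.0 - 2.0 * b_tilde                 # (-1) ** b_tilde
    differs = (b != b_tilde).astype(float)     # zero where the pair agrees
    per_sample = 0.5 * f_diff[:, None] * sign * differs * sigmoid(np.abs(theta))
    return per_sample.mean(axis=0)

def exact_gradient(f, theta):
    """Exact gradient by enumerating all 2**d binary vectors (small d only)."""
    theta = np.asarray(theta, dtype=float)
    p = sigmoid(theta)
    grad = np.zeros_like(theta)
    for bits in itertools.product([0.0, 1.0], repeat=theta.size):
        b = np.array(bits)
        prob = np.prod(np.where(b == 1.0, p, 1.0 - p))
        grad += prob * f(b) * (b - p)          # d log prob / d theta = b - p
    return grad
```

On a small problem the Monte Carlo average can be compared against `exact_gradient`; the per-coordinate factor `sigmoid(np.abs(theta))` approaches 1 as a logit grows in magnitude, while the antithetic pair disagrees less and less often.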

The Spectra model for supervised matrix decomposition:
Factor analysis decomposes single-cell gene expression data into a minimal set of gene programs that correspond to processes executed by cells in a sample. However, matrix factorization methods are prone to technical artifacts and poor factor interpretability. We address these concerns with Spectra, an algorithm that combines user-provided gene programs with the detection of novel programs that together best explain expression covariation. Spectra incorporates existing gene sets and cell type labels as prior biological information. It explicitly models cell type and represents input gene sets as a gene-gene knowledge graph, using a penalty function to guide factorization towards the input graph. We show that Spectra outperforms existing approaches in challenging tumor immune contexts: it finds factors that change under immune checkpoint therapy, disentangles the highly correlated features of CD8+ T-cell tumor reactivity and exhaustion, finds a program that explains continuous macrophage state changes under therapy, and identifies cell-type-specific immune metabolic programs.
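The idea of guiding a factorization towards a gene-gene graph can be illustrated with a generic graph-regularized objective. This is a sketch under stated assumptions and not Spectra's actual likelihood or penalty, which are not specified in this abstract; the function name `graph_guided_loss`, the squared-error terms, and the weight `lam` are all hypothetical choices made for illustration.

```python
import numpy as np

def graph_guided_loss(X, loadings, factors, A, lam=1.0):
    """Reconstruction error plus a penalty tying factors to a gene-gene graph.

    X        : (cells, genes) expression matrix
    loadings : (cells, K) per-cell factor loadings
    factors  : (K, genes) per-factor gene weights
    A        : (genes, genes) symmetric 0/1 adjacency built from input gene sets
    lam      : weight of the graph penalty (hypothetical hyperparameter)
    """
    recon = X - loadings @ factors
    # Gene-gene similarity implied by the factors: large when two genes
    # carry weight in the same factor.
    implied = factors.T @ factors
    off_diag = ~np.eye(A.shape[0], dtype=bool)  # ignore self-edges
    graph_penalty = np.sum((A - implied)[off_diag] ** 2)
    return np.sum(recon ** 2) + lam * graph_penalty
```

Under this illustrative penalty, factors that group genes connected in the input graph are rewarded, while a small `lam` leaves room for factors that the graph does not anticipate.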

Files

  • Kunes_columbia_0054D_18823.pdf (application/pdf, 14.6 MB)

More About This Work

Academic Units
Statistics
Thesis Advisors
Tavare, Simon
Degree
Ph.D., Columbia University
Published Here
November 6, 2024