Theses Doctoral

Representation Learning for CRISPR-Cas13d Efficacy and Single-Cell RNA Sequencing Data

Stirn, Andrew

This thesis develops multiple novel methods spanning computational genomics and machine learning. CRISPR-Cas13d is a programmable RNA-targeting system that can knock down specific RNA transcripts. Our method, TIGER (Wessels et al., 2023), is the first tool to model CRISPR-Cas13d efficacy as a function of both the target RNA and the guide RNA (gRNA) sequences. This enables biologists using our model to design the most effective gRNA for a target transcript, check a gRNA library for unintended off-target effects, and engineer mismatches between a gRNA and its target to titrate CRISPR-Cas13d efficacy.

We leverage TIGER to study CRISPR-Cas13d binding affinity at splice junctions (Schertzer et al., 2023), finding that CRISPR-Cas13d can uniquely target 89% of human isoforms with high efficacy. Thereafter, we develop two methods (Stirn & Knowles, 2020; Stirn et al., 2023) for generating well-calibrated heteroscedastic variance estimates, which we integrate into TIGER to study sequence-based heteroscedasticity in CRISPR-Cas13d.

The goal of single-cell RNA sequencing (scRNA-seq) integration is to isolate technical variation from the biological signals of interest. Several popular integration methods use a semi-supervised variational autoencoding (VAE) framework (Kingma et al., 2014) with partially observed cell-type labels to learn a low-dimensional representation disentangled from technical effects.

To better handle partially observed labels in the amortized variational setting, we develop a new distribution on the simplex (Stirn et al., 2019) that mimics the Dirichlet distribution but has analytic reparameterization gradients and thus low gradient variance. Additionally, we develop a novel method for learning structured latent embeddings for the VAE (Kingma & Welling, 2014). This method outperforms existing clustering methods on benchmark datasets and, when combined with scVI (Lopez et al., 2018), a popular VAE-based scRNA-seq integration method, outperforms state-of-the-art scRNA-seq integration methods.

Our final chapter includes theoretical and empirical results on how to improve the VampPrior’s (Tomczak & Welling, 2018) marginal likelihood by decoupling the prior and posterior variances. We further increase the VampPrior’s flexibility by replacing its uniform mixture with a Dirichlet process mixture. In tandem, these changes both boost the VampPrior’s modeling performance and reduce cluster utilization.

Files

  • Stirn_columbia_0054D_19035.pdf (application/pdf, 18.3 MB)

More About This Work

Academic Units
Computer Science
Thesis Advisors
Knowles, David A.
Degree
Ph.D., Columbia University
Published Here
April 16, 2025