Doctoral Thesis, 2025
Advances in Probabilistic Machine Learning: Scalable Inference, Conditional Generation, and Invariance Modeling
A central goal of machine learning is to uncover hidden patterns in data in order to make predictions and draw insights. The probabilistic perspective accounts for uncertainty by inferring a distribution over plausible patterns, while incorporating prior beliefs. However, applying probabilistic machine learning in modern settings presents several challenges, including scalability in large-data regimes, conditional generation with complex priors, and invariance modeling of heterogeneous data. This thesis develops methodologies to address these challenges.
The first part of the thesis focuses on improving the scalability of Gaussian processes (GPs), a classical probabilistic model whose exact inference is intractable for large-scale problems. We first propose two approximate inference methods, one using structured inducing points and the other exploiting sparsity in the prior precision matrix. While these methods are computationally attractive, they introduce biases that can affect downstream performance. In a separate line of work, we investigate systematic biases of two widely used scalable GP techniques and propose randomized algorithms to achieve unbiased inference.
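To make the inducing-point idea concrete, below is a minimal sketch using the classical subset-of-regressors (SoR) approximation, which replaces the O(n³) cost of exact GP regression with O(nm²) for m inducing points. The kernel, inducing locations, and data here are illustrative stand-ins; this is not the structured method developed in the thesis.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel matrix between 1-D point sets a and b."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale**2)

def sor_gp_mean(x, y, x_star, z, noise=0.1):
    """SoR approximate GP posterior mean at x_star given m inducing points z.

    mu = K(x*,z) (noise^2 Kzz + Kzx Kzx^T)^{-1} Kzx y
    Only an m x m system is solved, versus n x n for the exact GP.
    """
    Kzz = rbf(z, z) + 1e-8 * np.eye(len(z))  # jitter for numerical stability
    Kzx = rbf(z, x)
    Ksz = rbf(x_star, z)
    A = noise**2 * Kzz + Kzx @ Kzx.T
    return Ksz @ np.linalg.solve(A, Kzx @ y)

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = np.sin(x) + 0.1 * rng.standard_normal(200)   # noisy observations of sin
z = np.linspace(-3, 3, 15)                       # m = 15 inducing locations
mu = sor_gp_mean(x, y, np.array([0.0]), z)       # predictive mean near sin(0)
```

The m × m solve is the source of the speedup; it is also where the approximation bias mentioned above enters, since the n data points are summarized through only m inducing points.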
The second part of the thesis addresses inference challenges arising in modern deep generative models, in particular, diffusion models. These models capture distributions over complex data modalities, making them suitable as powerful priors for conditional generation tasks. However, inference from their conditional distributions is intractable. While previous methods rely on expensive training or error-prone approximations, we introduce a training-free sequential Monte Carlo algorithm that is asymptotically exact in the limit of increasing compute budget. We demonstrate the effectiveness of our algorithm on image generation and protein design applications.
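The weight-and-resample mechanics underlying sequential Monte Carlo can be illustrated with a toy conditional-inference problem. This sketch is not the thesis algorithm; the Gaussian prior and likelihood are stand-ins chosen so the exact posterior is known in closed form, which makes the asymptotic-exactness claim easy to check as the particle count grows.

```python
import numpy as np

def smc_condition(n_particles=5000, y_obs=1.0, obs_std=0.5, seed=0):
    """One weight-and-resample step: draw from a prior, weight by a
    likelihood, resample, yielding approximate posterior draws.

    With prior x ~ N(0, 1) and likelihood y | x ~ N(x, obs_std^2), the
    exact posterior mean is y_obs / (1 + obs_std^2).
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_particles)              # particles from the prior
    logw = -0.5 * (y_obs - x) ** 2 / obs_std**2       # log-likelihood weights
    w = np.exp(logw - logw.max())                     # stabilized weights
    w /= w.sum()
    idx = rng.choice(n_particles, size=n_particles, p=w)  # multinomial resample
    return x[idx]

samples = smc_condition()
# Sample mean approaches the exact posterior mean 1.0 / (1 + 0.25) = 0.8
# as n_particles grows, with no training of the prior required.
```

The same structure, a fixed prior sampler reweighted toward a conditioning event, is what makes a training-free approach possible: only forward simulation and likelihood evaluation are needed.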
The third part of the thesis considers modeling challenges where data are collected from different environments. Fitting a model to pooled data may result in spurious correlations that fail to generalize to new environments. Instead, we aim to identify stable predictive relationships based on a subset of invariant features. To this end, we develop a probabilistic model for inferring invariant features with accompanying theoretical guarantees. To handle high-dimensional problems, we propose a scalable variational inference algorithm. Simulations and real-world experiments demonstrate improved inference accuracy and scalability over existing methods.
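The failure mode described above, and the intuition behind seeking invariant features, can be seen in a small simulation. This is a simple per-environment least-squares diagnostic, not the thesis's probabilistic model: a feature whose regression coefficient is stable across environments is a candidate invariant predictor, while one whose coefficient shifts with the environment is likely spurious. All variable names and the data-generating process here are illustrative assumptions.

```python
import numpy as np

def per_env_coefs(envs):
    """OLS coefficients fit separately within each environment.

    Stable columns across rows indicate invariant predictive relationships;
    columns that vary across rows indicate environment-dependent (spurious)
    correlations that would not generalize.
    """
    return np.array([np.linalg.lstsq(X, y, rcond=None)[0] for X, y in envs])

rng = np.random.default_rng(1)
envs = []
for shift in (0.0, 2.0, -2.0):            # environment-specific perturbation
    x_inv = rng.standard_normal(500)      # invariant cause of y
    e = rng.standard_normal(500)
    y = 1.5 * x_inv + e
    x_spu = shift * e + rng.standard_normal(500)  # correlated with y's noise,
    envs.append((np.column_stack([x_inv, x_spu]), y))  # differently per env

coefs = per_env_coefs(envs)   # rows: environments; columns: [x_inv, x_spu]
# Column 0 stays near 1.5 in every environment; column 1 drifts with shift.
```

A model fit to the pooled data would exploit `x_spu` wherever it helps on average, which is exactly the spurious correlation that breaks in a new environment with a different shift.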
Files
- Wu_columbia_0054D_19553.pdf (application/pdf, 3.83 MB)
More About This Work
- Academic Units: Statistics
- Thesis Advisors: Cunningham, John Patrick; Blei, David Meir
- Degree: Ph.D., Columbia University
- Published Here: October 29, 2025