Theses Doctoral

Probabilistic machine learning for predictions and causal discovery in health informatics.

Nazaret, Achille Oscar Romeo

This thesis is about probabilistic machine learning methods for predictive modeling, causal discovery, and applications in health informatics. Probabilistic models have achieved remarkable performance in text and image modeling, often predicting answers to textual or visual questions better than human experts. But can these methods succeed as well on health data questions? Health data are not limited to text or images: they include tabular datasets from electronic health records, high-dimensional genomic measurements from single-cell sequencing, and time series from wearable devices. Health questions are not only about predicting an answer: they also involve explaining those predictions and discovering unknown relations between variables. This thesis addresses these challenges. It introduces methods that improve probabilistic prediction on common health data types and scale causal discovery to the thousands of variables encountered in genomic datasets. It then develops interpretable generative models for concrete health applications like cancer research with single-cell gene expression and heart health research with wearable time series. These methods are designed for non-experts in machine learning, with software implementations that require little to no tuning.

The thesis is organized into four parts, with the first two focusing on broadly applicable methodologies and the last two on concrete health applications of those methods. The first part focuses on probabilistic predictions, the task of estimating the full distribution of a target variable given other variables. This task is critical in health applications to quantify uncertainty, compute risk, or detect anomalies. We introduce two methods: Treeffuser and unbounded depth neural networks (UDNs). Treeffuser provides probabilistic predictions for tabular data using a diffusion model parameterized by gradient-boosted trees. In contrast, UDNs are deep neural networks. They provide probabilistic predictions as a mixture of outputs from all of their hidden layers. Importantly, UDNs automatically adapt their depth to the complexity of the data during training. Both methods require minimal tuning and improve on existing methods.
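
The idea of forming a prediction as a mixture of outputs from several hidden layers can be illustrated with a toy sketch. It assumes that each layer already has an output head producing a Gaussian predictive mean and standard deviation, and that the weights over depth are given. The function names (`mixture_moments`, `mixture_density`) and all numbers are hypothetical; this is an illustration of a depth-weighted mixture, not the thesis's implementation of UDNs.

```python
import math


def _normalize(weights):
    """Rescale non-negative weights so that they sum to one."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def mixture_moments(layer_means, layer_stds, depth_weights):
    """Mean and variance of a Gaussian mixture with one component per layer."""
    w = _normalize(depth_weights)
    mean = sum(wi * mi for wi, mi in zip(w, layer_means))
    # Law of total variance: E[y^2] - (E[y])^2 under the mixture.
    second_moment = sum(
        wi * (si ** 2 + mi ** 2) for wi, mi, si in zip(w, layer_means, layer_stds)
    )
    return mean, second_moment - mean ** 2


def mixture_density(y, layer_means, layer_stds, depth_weights):
    """Predictive density at y: weighted sum of per-layer Gaussian densities."""
    w = _normalize(depth_weights)
    density = 0.0
    for wi, mi, si in zip(w, layer_means, layer_stds):
        z = (y - mi) / si
        density += wi * math.exp(-0.5 * z * z) / (si * math.sqrt(2.0 * math.pi))
    return density


# Hypothetical outputs from three layers for a single input.
means = [1.0, 1.4, 1.5]
stds = [0.8, 0.5, 0.4]
weights = [0.2, 0.5, 0.3]  # assumed weights over depth

mean, variance = mixture_moments(means, stds, weights)
print("predictive mean:", mean)
print("predictive variance:", variance)
print("density at y = 1.2:", mixture_density(1.2, means, stds, weights))
```

Because the result is a full density rather than a point estimate, quantities such as intervals or tail probabilities can be derived from it, which is the use case the abstract describes for uncertainty and risk.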

The second part of this thesis focuses on causal discovery, the task of inferring causal relationships between variables. It is a fundamental task in health science, but existing methods struggle to scale to the hundreds of variables of modern datasets. We develop two scalable methods: extreme greedy equivalence search (XGES) and stable differentiable causal discovery (SDCD). XGES is designed for linear models and has provable guarantees, whereas SDCD is designed for neural networks. Both methods improve convergence speed and accuracy, enabling causal discovery to scale to thousands of variables.

The third part designs probabilistic models in single-cell genomics. Single-cell RNA sequencing (scRNA-seq) measures gene expression across thousands of cells. But scRNA-seq data is challenging to analyze and usually requires multiple steps that can fail: batch correction, dimensionality reduction, data visualization, trajectory analysis, and gene pattern analysis. We propose Decipher, a tool for analyzing single-cell data that unifies all those steps and addresses their limitations. Decipher is a deep generative model. It learns a low-dimensional representation of each cell's state along with a two-dimensional visualization. Incorporating the visualization within Decipher's model enables new types of trajectory and gene pattern analyses. Applied to acute myeloid leukemia data, Decipher successfully maps the divergence from normal hematopoiesis and identifies transcriptional programs associated with NPM1 mutations.

The fourth and last part focuses on physiological time-series data recorded by the Apple Watch. We demonstrate through two studies how probabilistic models can uncover insights into fitness, heart rate regulation, and changes in human behavior. The first study models the subjects' heart rate given their activity intensity measured via GPS speed and step count. The study builds on existing physiological heart models based on differential equations and augments them with probabilistic machine-learning components. The resulting model forecasts heart rate responses better than standard deep learning models, learns personalized fitness indicators, and reveals how much environmental factors impact heart rate. The second study estimates the causal effect of the Apple Watch "time to stand" reminder using a regression discontinuity design specially adapted for time series. Using billions of minutes of standing data, it discovers that the nudge increases standing rates by up to 43.9% and that it remains effective over time.
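
The regression discontinuity idea in the second study can be sketched in a few lines. The example below is a plain, textbook-style design rather than the time-series adaptation developed in the thesis: it fits a line to the outcome on each side of a cutoff and reports the gap between the two fits at the cutoff. The cutoff minute, the bandwidth, and the noise-free synthetic standing rates are all assumptions made for illustration, and `fit_line` and `rdd_jump` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rdd_jump(times, outcomes, cutoff, bandwidth):
    """Estimate the discontinuity in the outcome at `cutoff`.

    Fits separate lines to points within `bandwidth` on each side of the
    cutoff (times are centered on the cutoff) and returns the difference
    between the two intercepts, i.e. the estimated jump at the cutoff.
    """
    left = [(t - cutoff, y) for t, y in zip(times, outcomes)
            if cutoff - bandwidth <= t < cutoff]
    right = [(t - cutoff, y) for t, y in zip(times, outcomes)
             if cutoff <= t <= cutoff + bandwidth]
    if len(left) < 2 or len(right) < 2:
        raise ValueError("not enough points on one side of the cutoff")
    left_intercept, _ = fit_line(*zip(*left))
    right_intercept, _ = fit_line(*zip(*right))
    return right_intercept - left_intercept


# Synthetic, noise-free data (an assumption): the standing rate drifts up
# slowly with the minute of the hour and steps up by 0.05 at a
# hypothetical reminder minute.
CUTOFF_MINUTE = 50  # hypothetical cutoff, not taken from the thesis
minutes = list(range(40, 60))
standing_rate = [
    0.10 + 0.002 * (m - 40) + (0.05 if m >= CUTOFF_MINUTE else 0.0)
    for m in minutes
]

jump = rdd_jump(minutes, standing_rate, cutoff=CUTOFF_MINUTE, bandwidth=10)
print("estimated jump in standing rate at the cutoff:", jump)
```

Fitting each side separately lets the smooth trend before and after the cutoff differ, so only the discontinuity at the cutoff is attributed to the reminder. Real data would add noise and call for uncertainty estimates, which this sketch omits.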

With this thesis, health researchers gain tools to uncover deeper insights into the human body, and machine learning practitioners gain methodologies for developing such tools on complex health data.

Files

  • Nazaret_columbia_0054D_19235.pdf (application/pdf, 21.5 MB)

More About This Work

Academic Units
Computer Science
Thesis Advisors
Blei, David Meir
Azizi, Elham
Degree
Ph.D., Columbia University
Published Here
August 27, 2025