2025 Theses Doctoral
Forecasting Tasks from Electronic Health Record Data: Model Selection, Interpretability, and Scalability
Predicting patient risk is a complex task with numerous factors to consider, such as demographic characteristics, lifestyle choices, and longitudinal medical history. The widespread use of electronic health record (EHR) data offers clinicians access to these factors, which can be analyzed to assess patient risk effectively. Leveraging EHR data, clinicians can make evidence-based decisions, proactively identify patients at elevated risk, and personalize treatment strategies by drawing on both individual trajectories and insights from similar cases. Despite its potential value, using large-scale EHR data for patient risk prediction is a complex and time-consuming task in practice. This makes machine learning essential, as it can efficiently process and analyze large-scale datasets to uncover patterns and correlations that may not be immediately apparent to human experts.
The development and deployment of machine learning systems in healthcare is not a trivial task. It requires coordinated efforts from multidisciplinary teams (e.g., clinicians, data engineers, and machine learning scientists) and involves numerous stages, including problem specification, phenotype definition, regulatory approval and data access, data acquisition and preprocessing, model construction, and evaluation. Challenges encountered at any of these stages can pose significant bottlenecks, hindering the practical implementation of machine learning in clinical settings.
In this thesis, we focus on the modeling-related challenges within this broader pipeline. We identify three aspects of model development that, if addressed, could improve the adoption of machine learning systems in clinical practice: (I) model selection, (II) model interpretability, and (III) model scalability.
This thesis unfolds in four parts, each addressing a key challenge at the intersection of machine learning methodology and clinical deployment. We begin by demonstrating that recent state-of-the-art survival analysis models are highly sensitive to hyperparameter choices and may suffer from instability during training, making model selection difficult for practitioners.
To address this, we introduce a novel, hyperparameter-efficient, and performant survival analysis approach. Next, recognizing the importance of interpretability (i.e., the ability to attribute model outputs to specific input features) in clinical machine learning, we develop an additive hazard model designed to provide clinically meaningful explanations at the population, subgroup, and individual levels. We then introduce a probabilistic framework that casts feature attributions as probability distributions, which allows us to quantify attributional uncertainty and demonstrate its practical benefits for interpretability.
Finally, we introduce a scalable inference and learning algorithm for Gaussian processes (GPs). Building on ideas from Bayesian coresets, our method reduces the parameter cost of stochastic GP learning from quadratic to linear, while achieving state-of-the-art results on root-mean-square error (RMSE) and posterior predictive log-likelihood (PPLL) metrics. Taken together, these contributions represent incremental yet important steps toward bridging methodological advances with practical deployments of machine learning systems in clinical settings.
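To make the quadratic-versus-linear contrast concrete, the following sketch counts variational parameters for a standard sparse stochastic GP (which learns a free-form Gaussian over m inducing points, including an m-by-m covariance factor) versus a coreset-style scheme that learns only one weight per retained point. This is a generic illustration of the parameter-count argument, not code from the thesis; the function names and the exact parameterization of the thesis's method are assumptions.

```python
def svgp_param_count(m: int, d: int) -> int:
    """Standard sparse variational GP (Titsias/Hensman-style):
    m inducing inputs in d dimensions (m*d), a variational mean
    (m), and a lower-triangular Cholesky factor of the m-by-m
    variational covariance (m*(m+1)/2) -- quadratic in m."""
    return m * d + m + m * (m + 1) // 2


def coreset_param_count(m: int, d: int) -> int:
    """Coreset-style scheme: m retained inputs (m*d) plus a single
    scalar weight per point (m) -- linear in m."""
    return m * d + m


# With m = 512 inducing points in d = 8 dimensions, the Cholesky
# factor dominates the standard count, while the coreset count
# grows only linearly with m.
print(svgp_param_count(512, 8))     # 135936
print(coreset_param_count(512, 8))  # 4608
```

The gap widens quadratically as m grows, which is why the covariance factor, not the inducing inputs themselves, is the bottleneck that a per-point weighting removes.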
Files
- Ketenci_columbia_0054D_19593.pdf (52.4 MB)
More About This Work
- Academic Units: Computer Science
- Thesis Advisors: Elhadad, Noémie
- Degree: Ph.D., Columbia University
- Published Here: November 12, 2025