2025 Theses Doctoral
Learning with Auxiliary Information: A Unified Framework for Robust Clinical Prediction in Healthcare
Machine learning offers unprecedented opportunities to advance predictive modeling in healthcare, driven by the vast increase in available clinical data. Despite the promise of increased performance, its applications pose many challenges due to the nature of healthcare datasets. Standard prediction models often fail in this context due to high dimensionality, data imbalance, irregular longitudinal collection, and data fragmentation across study cohorts. This dissertation addresses these challenges by introducing a novel paradigm, Learning with Auxiliary Information, which offers robust strategies to mitigate the limitations of standard predictive models. Auxiliary information is defined here as any information often discarded or underutilized in traditional machine-learning settings.
This work develops and applies novel machine-learning methodologies to diverse healthcare datasets to demonstrate this paradigm. For predicting preeclampsia in the Nulliparous Pregnancy Outcomes Study: Monitoring Mothers-to-be (nuMoM2b) cohort, ensemble methods were used to manage clinical complexity and address algorithmic fairness. To predict necrotizing enterocolitis (NEC) from neonatal stool microbiota, an attention-based Multiple Instance Learning (MIL) method was applied to ambiguously labeled longitudinal data. A novel "growing bag" analysis, applied across an infant's early life, was developed to generate a dynamic, interpretable risk score for NEC. For predicting proximal junctional kyphosis (PJK) and preterm birth (PTB), where critical post-operative and delivery information is unavailable at inference, a new Learning Using Privileged Information (LUPI) algorithm, XGBoost+, was created by integrating a distillation framework into gradient boosting. Each of the above applications demonstrates a unique way of handling auxiliary information. Finally, the overarching Learning with Auxiliary Information framework was instantiated by combining LUPI with Transfer Learning in a novel XGBoost+TL model. This demonstrated that knowledge could be successfully transferred from a large source dataset to improve prediction on a smaller, distinct clinical cohort.
The conclusions of this work confirm the viability of the proposed paradigm. The developed models consistently outperformed traditional machine learning approaches across all clinical problems. For preterm birth, the LUPI framework improved accuracy and revealed the starkly different predictability of indicated versus spontaneous preterm birth subtypes, providing key clinical insights. The successful application of XGBoost+TL confirmed that combining privileged information and transfer learning is a viable strategy for overcoming data fragmentation and scarcity. This dissertation concludes that moving beyond standard learning methods to a unified framework that strategically incorporates auxiliary data makes it possible to create significantly more powerful and reliable predictive tools to support high-stakes clinical decision-making.
Files
- Lin_columbia_0054D_19455.pdf (application/pdf, 2.85 MB)
More About This Work
- Academic Units: Computer Science
- Degree: Ph.D., Columbia University
- Published Here: September 17, 2025