Theses Doctoral

Machine Learning Methods for Personalized Medicine Using Electronic Health Records

Wu, Peng

The theme of this dissertation focuses on methods for estimating personalized treatment using machine learning algorithms leveraging information from electronic health records (EHRs). Current guidelines for medical decision making largely rely on data from randomized controlled trials (RCTs) studying average treatment effects. However, RCTs are usually conducted under specific inclusion/exclusion criteria, they may be inadequate to make individualized treatment decisions in real-world settings. Large-scale EHR provides opportunities to fulfill the goals of personalized medicine and learn individualized treatment rules (ITRs) depending on patient-specific characteristics from real-world patient data. On the other hand, since patients' electronic health records (EHRs) document treatment prescriptions in the real world, transferring information in EHRs to RCTs, if done appropriately, could potentially improve the performance of ITRs, in terms of precision and generalizability. Furthermore, EHR data domain usually consists text notes or similar structures, thus topic modeling techniques can be adapted to engineer features.
In the first part of this work, we address challenges with EHRs and propose a machine learning approach based on matching techniques (referred as M-learning) to estimate optimal ITRs from EHRs. This new learning method performs matching method instead of inverse probability weighting as commonly used in many existing methods for estimating ITRs to more accurately assess individuals' treatment responses to alternative treatments and alleviate confounding. Matching-based value functions are proposed to compare matched pairs under a unified framework, where various types of outcomes for measuring treatment response (including continuous, ordinal, and discrete outcomes) can easily be accommodated. We establish the Fisher consistency and convergence rate of M-learning. Through extensive simulation studies, we show that M-learning outperforms existing methods when propensity scores are misspecified or when unmeasured confounders are present in certain scenarios. In the end of this part, we apply M-learning to estimate optimal personalized second-line treatments for type 2 diabetes patients to achieve better glycemic control or reduce major complications using EHRs from New York Presbyterian Hospital (NYPH).
In the second part, we propose a new domain adaptation method to learn ITRs in by incorporating information from EHRs. Unless assuming no unmeasured confounding in EHRs, we cannot directly learn the optimal ITR from the combined EHR and RCT data. Instead, we first pre-train “super" features from EHRs that summarize physicians' treatment decisions and patients' observed benefits in the real world, which are likely to be informative of the optimal ITRs. We then augment the feature space of the RCT and learn the optimal ITRs stratifying by these features using RCT patients only. We adopt Q-learning and a modified matched-learning algorithm for estimation. We present theoretical justifications and conduct simulation studies to demonstrate the performance of our proposed method. Finally, we apply our method to transfer information learned from EHRs of type 2 diabetes (T2D) patients to improve learning individualized insulin therapies from an RCT.
In the last part of this work, we report M-learning proposed in the first part to learn ITRs using interpretable features extracted from EHR documentation of medications and ICD diagnoses codes. We use a latent Dirichlet allocation (LDA) model to extract latent topics and weights as features for learning ITRs. Our method achieves confounding reduction in observational studies through matching treated and untreated individuals and improves treatment optimization by augmenting feature space with clinically meaningful LDA-based features. We apply the method to extract LDA-based features in EHR data collected at NYPH clinical data warehouse in studying optimal second-line treatment for T2D patients. We use cross validation to show that ITRs outperforms uniform treatment strategies (i.e., assigning insulin or another class of oral organic compounds to all individuals), and including topic modeling features leads to more reduction of post-treatment complications.


  • thumnail for Wu_columbia_0054D_15435.pdf Wu_columbia_0054D_15435.pdf application/pdf 3.2 MB Download File

More About This Work

Academic Units
Thesis Advisors
Wang, Yuanjia
Ph.D., Columbia University
Published Here
August 30, 2019