2016 Theses Doctoral
Statistical Learning Methods for Personalized Medical Decision Making
This dissertation merges statistical modeling with medical domain knowledge and machine learning algorithms to assist in making personalized medical decisions. In its simplest form, choosing a treatment or a disease diagnosis modality for an individual can be cast as a classification or prediction problem in machine learning, where the optimal decision for an individual is given by a decision rule that yields the best future clinical outcome or maximizes diagnostic accuracy. However, challenges emerge when analyzing complex medical data. On one hand, statistical modeling is needed to deal with inherent practical complications such as missing data, patients' loss to follow-up, and ethical and resource constraints in randomized controlled clinical trials. On the other hand, new data types and larger scales of data call for innovations combining statistical modeling, domain knowledge, and information technologies. This dissertation contains three parts, addressing the estimation of optimal personalized rules for choosing treatment, the estimation of optimal individualized rules for choosing disease diagnosis modality, and methods for variable selection in the presence of missing data.
In the first part of this dissertation, we propose a method to find optimal dynamic treatment regimens (DTRs) from Sequential Multiple Assignment Randomized Trial (SMART) data. DTRs are sequential decision rules tailored at each stage of treatment by potentially time-varying patient features and intermediate outcomes observed in previous stages. The complexity, patient heterogeneity, and chronicity of many diseases and disorders call for learning optimal DTRs that best dynamically tailor treatment to each individual's response over time. We propose a robust and efficient approach, referred to as Augmented Multistage Outcome-Weighted Learning (AMOL), to identify optimal DTRs from SMARTs. We improve outcome-weighted learning (Zhao et al., 2012) to allow for negative outcomes; we propose methods to reduce the variability of weights to achieve numerical stability and higher efficiency; and finally, for multiple-stage trials, we introduce robust augmentation to improve efficiency by drawing information from Q-function regression models at each stage. The proposed AMOL remains valid even if the regression model is misspecified. We formally justify that a proper choice of augmentation guarantees smaller stochastic errors in value function estimation for AMOL; we then establish the convergence rates for AMOL. The comparative advantage of AMOL over existing methods is demonstrated in extensive simulation studies and applications to two SMART data sets: a two-stage trial for attention deficit hyperactivity disorder and the STAR*D trial for major depressive disorder.
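To make the ideas of a value function and of outcome weights that tolerate negative outcomes more concrete, the following is a minimal single-stage sketch in Python. It is not the AMOL procedure itself. The function names `ipw_value` and `residual_weights`, the -1/+1 treatment coding, and the use of a per-subject baseline prediction to center the outcomes are all assumptions made for illustration; the abstract does not specify these details.

```python
from typing import Callable, List, Sequence, Tuple


def ipw_value(
    outcomes: Sequence[float],
    treatments: Sequence[int],
    features: Sequence[Sequence[float]],
    propensities: Sequence[float],
    rule: Callable[[Sequence[float]], int],
) -> float:
    """Inverse-probability-weighted estimate of the value of a treatment rule.

    Assumed setup: treatments are coded -1/+1, and propensities[i] is the
    known randomization probability of the treatment subject i received.
    """
    total = 0.0
    for r, a, x, p in zip(outcomes, treatments, features, propensities):
        # Only subjects whose received treatment agrees with the rule contribute.
        if rule(x) == a:
            total += r / p
    return total / len(outcomes)


def residual_weights(
    outcomes: Sequence[float],
    treatments: Sequence[int],
    baseline: Sequence[float],
) -> List[Tuple[float, int]]:
    """Convert possibly negative outcomes into (nonnegative weight, label) pairs.

    Assumed device: center each outcome at a baseline prediction, use the
    absolute residual as the weight, and flip the treatment label when the
    residual is negative, so a weighted classifier is steered away from
    treatments that did worse than the baseline.
    """
    pairs = []
    for r, a, m in zip(outcomes, treatments, baseline):
        resid = r - m
        label = a if resid >= 0 else -a
        pairs.append((abs(resid), label))
    return pairs


if __name__ == "__main__":
    # Toy, made-up data: four subjects randomized with probability 0.5.
    outcomes = [2.0, -1.0, 3.0, 0.5]
    treatments = [1, -1, -1, 1]
    features = [[0.5], [-0.2], [0.8], [-0.6]]
    print(ipw_value(outcomes, treatments, features, [0.5] * 4,
                    lambda x: 1 if x[0] > 0 else -1))
    print(residual_weights(outcomes, treatments, [1.0] * 4))
```

The resulting weights and labels could be passed to any classifier that accepts per-sample weights. In a multistage trial the same construction would be applied backward from the last stage, which this sketch does not attempt.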
The second part of the dissertation introduces a machine learning algorithm to estimate personalized decision rules for medical diagnosis/screening that maximize a weighted combination of sensitivity and specificity. Using subject-specific risk factors and feature variables, such rules administer screening tests with balanced sensitivity and specificity, and thus protect low-risk subjects from unnecessary pain and stress caused by false positive tests while achieving high sensitivity for subjects at high risk. We conducted a simulation study mimicking a real breast cancer study, and we found significant improvements in sensitivity and specificity when comparing our personalized screening strategy (assigning mammography+MRI to high-risk subjects and mammography alone to low-risk subjects based on a composite score of their risk factors) to a one-size-fits-all strategy (assigning mammography+MRI or mammography alone to all subjects). When applied to Parkinson's disease (PD) FDG-PET and fMRI data, the method provided individualized modality selection that can improve AUC, and it can provide interpretable decision rules for choosing a brain imaging modality for early detection of PD. To the best of our knowledge, this is the first work in the literature to propose automatic data-driven methods and a learning algorithm for personalized diagnosis/screening strategies.
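The objective described above, a weighted combination of sensitivity and specificity evaluated under a rule that picks a modality per subject, can be illustrated with a small Python sketch. The function names, the single risk-score threshold, and the toy data are hypothetical. The sketch also assumes that test results under both modalities are available for every subject, which may not hold in a real study.

```python
from typing import List, Sequence


def apply_rule(
    risk_scores: Sequence[float],
    result_a: Sequence[int],
    result_b: Sequence[int],
    threshold: float,
) -> List[int]:
    """Use modality B's result when the risk score exceeds the threshold, else A's."""
    return [b if s > threshold else a
            for s, a, b in zip(risk_scores, result_a, result_b)]


def weighted_accuracy(
    disease: Sequence[int],
    results: Sequence[int],
    weight: float,
) -> float:
    """Return weight * sensitivity + (1 - weight) * specificity.

    disease and results are coded 0/1 (1 = diseased, 1 = test positive).
    """
    n_pos = sum(disease)
    n_neg = len(disease) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one diseased and one healthy subject")
    true_pos = sum(1 for d, t in zip(disease, results) if d == 1 and t == 1)
    true_neg = sum(1 for d, t in zip(disease, results) if d == 0 and t == 0)
    sensitivity = true_pos / n_pos
    specificity = true_neg / n_neg
    return weight * sensitivity + (1 - weight) * specificity


if __name__ == "__main__":
    # Toy, made-up data: modality A is specific but insensitive, B the reverse.
    disease = [1, 1, 0, 0, 1, 0]
    risk = [0.9, 0.8, 0.2, 0.1, 0.7, 0.3]
    test_a = [0, 1, 0, 0, 0, 0]
    test_b = [1, 1, 1, 0, 1, 1]
    personalized = apply_rule(risk, test_a, test_b, threshold=0.5)
    print(weighted_accuracy(disease, personalized, 0.5))
    print(weighted_accuracy(disease, test_a, 0.5))
    print(weighted_accuracy(disease, test_b, 0.5))
```

In this contrived example the personalized rule attains a higher weighted accuracy than either one-size-fits-all assignment, because the more sensitive modality is given only to the high-risk subjects. A learning algorithm would choose the rule from data instead of fixing a threshold by hand.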
In the last part of the dissertation, we propose a method, Multiple Imputation Random Lasso (MIRL), to select important variables and to predict the outcome for an epidemiological study of Eating and Activity in Teens. In this study, 80% of individuals have at least one variable missing. Therefore, using variable selection methods developed for complete data after list-wise deletion substantially reduces prediction power. Recent work on prediction models in the presence of incomplete data cannot adequately account for large numbers of variables with arbitrary missing patterns. MIRL combines penalized regression techniques with multiple imputation and stability selection. Extensive simulation studies are conducted to compare MIRL with several alternatives. MIRL outperforms other methods in high-dimensional scenarios in terms of both reduced prediction error and improved variable selection performance, and its advantage is greater when the correlation among variables is high and the missing proportion is high. MIRL also shows improved performance compared with other applicable methods when applied to the study of Eating and Activity in Teens for boys and girls separately, and to a subgroup of low socioeconomic status (SES) Asian boys who are at high risk of developing obesity.
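The stability selection idea mentioned above can be sketched as an aggregation step: fit a penalized regression on many imputed and resampled copies of the data, then keep the variables that are selected often enough. The Python sketch below covers only that aggregation step and takes the per-fit selected sets as given. The function names, the 0.5 threshold, and the variable names are hypothetical, and the actual MIRL procedure may weight or combine fits differently.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence


def selection_frequencies(
    selected_sets: Sequence[Iterable[str]],
    variable_names: Sequence[str],
) -> Dict[str, float]:
    """Fraction of fits (imputation x resample) in which each variable was selected."""
    if not selected_sets:
        raise ValueError("need at least one fitted model")
    counts: Counter = Counter()
    for selected in selected_sets:
        counts.update(set(selected))  # count each variable at most once per fit
    n_fits = len(selected_sets)
    return {name: counts[name] / n_fits for name in variable_names}


def stable_variables(frequencies: Dict[str, float], threshold: float) -> List[str]:
    """Variables whose selection frequency meets the threshold, in sorted order."""
    return sorted(name for name, freq in frequencies.items() if freq >= threshold)


if __name__ == "__main__":
    # Made-up selections from four hypothetical lasso fits on imputed data sets.
    fits = [{"bmi", "age"}, {"bmi"}, {"bmi", "screen_time"}, {"bmi", "age"}]
    freqs = selection_frequencies(fits, ["age", "bmi", "screen_time"])
    print(freqs)
    print(stable_variables(freqs, threshold=0.5))
```

Aggregating over imputations and resamples in this way means that a variable selected by chance in a single imputed data set is unlikely to survive, which is one motivation for pairing multiple imputation with stability selection.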
More About This Work
- Academic Units
- Thesis Advisors: Wang, Yuanjia
- Ph.D., Columbia University
- Published Here: April 19, 2016