2019 Theses Doctoral
Statistical Methods for Constructing Heterogeneous Biomarker Networks
The theme of this dissertation is to construct heterogeneous biomarker networks using graphical models for understanding disease progression and prognosis. Biomarkers may organize into networks of connected regions. Substantial heterogeneity in networks between individuals and subgroups of individuals is observed. The strengths of network connections may vary across subjects depending on subject-specific covariates (e.g., genetic variants, age). In addition, the connectivities between biomarkers, as subject-specific network features, have been found to predict disease clinical outcomes. Thus, it is important to accurately identify biomarker network structure and estimate the strength of connections.
Graphical models have been extensively used to construct complex networks. However, the estimated networks are at the population level, not accounting for subjects’ covariates. More flexible covariate-dependent graphical models are needed to capture the heterogeneity in subjects and further create new network features to improve prediction of disease clinical outcomes and stratify subjects into clinically meaningful groups. A large number of parameters are required in covariate-dependent graphical models. Regularization needs to be imposed to handle the high-dimensional parameter space. Furthermore, personalized clinical symptom networks can be constructed to investigate co-occurrence of clinical symptoms. When there are multiple biomarker modalities, the estimation of a target biomarker network can be improved by incorporating prior network information from the external modality. This dissertation contains four parts to achieve these goals: (1) An efficient l0-norm feature selection method based on augmented and penalized minimization to tackle the high-dimensional parameter space involved in covariate-dependent graphical models; (2) A two-stage approach to identify disease-associated biomarker network features; (3) An application to construct personalized symptom networks; (4) A node-wise biomarker graphical model to leverage the shared mechanism between multi-modality data when external modality data is available.
In the first part of the dissertation, we propose a two-stage procedure that approximates l0-norm regularization as closely as possible and solve it with a simple and highly efficient computational algorithm. Advances in high-throughput technologies in genomics and imaging yield unprecedentedly large numbers of prognostic biomarkers. To accommodate the scale of biomarkers and study their association with disease outcomes, penalized regression is often used to identify important biomarkers. The ideal variable selection procedure would search for the best subset of predictors, which is equivalent to imposing an l0-penalty on the regression coefficients. Since this optimization is a non-deterministic polynomial-time hard (NP-hard) problem that does not scale with the number of biomarkers, alternative methods mostly place smooth penalties on the regression parameters, which lead to computationally feasible optimization problems. However, empirical studies and theoretical analyses show that convex approximations of the l0-norm (e.g., l1) do not outperform their l0 counterpart. Progress on l0-norm feature selection has been relatively slow, and the main methods are greedy algorithms such as stepwise regression or orthogonal matching pursuit. Penalized regression based on regularizing the l0-norm remains much less explored in the literature. In this work, inspired by the recently popular augmenting and data splitting algorithms, including the alternating direction method of multipliers, we propose a two-stage procedure for l0-penalty variable selection, referred to as augmented penalized minimization-L0 (APM-L0). APM-L0 targets the l0-norm as closely as possible while keeping computation tractable, efficient, and simple, which is achieved by iterating between a convex regularized regression and a simple hard-thresholding estimation. The procedure can be viewed as arising from regularized optimization with a truncated l1 norm. Thus, we propose to treat the regularization parameter and the thresholding parameter as tuning parameters and select them by cross-validation. A one-step coordinate descent algorithm is used in the first stage to significantly improve computational efficiency. Through extensive simulation studies and a real data application, we demonstrate superior performance of the proposed method in terms of selection accuracy and computational speed as compared to existing methods. The proposed APM-L0 procedure is implemented in the R package APML0.
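The idea of pairing a convex regularized fit with hard thresholding can be illustrated with a small, simplified sketch. It is not the APML0 implementation: it makes a single pass (a lasso fit by coordinate descent, then keeping the k largest coefficients and refitting) rather than iterating, and it fixes the two tuning parameters by hand instead of choosing them by cross-validation. The function names, the toy data, and the values of `lam` and `k` are all assumptions made for illustration.

```python
import random


def soft_threshold(z, lam):
    """Soft-thresholding operator used by the lasso update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - Xb, kept up to date after every update
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual (beta_j removed).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def two_stage_l0_sketch(X, y, lam, k):
    """Stage 1: lasso fit. Stage 2: keep the k largest |coefficients|, then refit."""
    beta = lasso_cd(X, y, lam)
    ranked = sorted(range(len(beta)), key=lambda j: abs(beta[j]), reverse=True)
    support = sorted(j for j in ranked[:k] if beta[j] != 0.0)
    # Unpenalized refit on the selected columns (lam = 0 reduces to least squares).
    X_sel = [[row[j] for j in support] for row in X]
    coef = lasso_cd(X_sel, y, 0.0) if support else []
    return support, coef


def simulate(n=100, p=20, seed=0):
    """Toy data (hypothetical): only features 0, 3 and 7 have nonzero effects."""
    rng = random.Random(seed)
    truth = {0: 3.0, 3: -2.0, 7: 1.5}
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [sum(b * row[j] for j, b in truth.items()) + rng.gauss(0.0, 0.5) for row in X]
    return X, y, truth


X, y, truth = simulate()
support, coef = two_stage_l0_sketch(X, y, lam=0.1, k=3)
print("selected features:", support)
print("refitted coefficients:", [round(c, 2) for c in coef])
```

In practice `lam` and `k` would be chosen jointly over a grid by cross-validation, as the abstract describes, and the refit removes the shrinkage bias that the lasso stage leaves on the retained coefficients.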
In the second part of the dissertation, we develop a two-stage method to estimate biomarker networks that account for heterogeneity among subjects and to evaluate the networks' association with disease clinical outcomes. In the first stage, we propose a conditional Gaussian graphical model with mean and precision matrix depending on covariates to obtain subject- or subgroup-specific networks. In the second stage, we evaluate the clinical utility of the network measures (connection strengths) estimated in the first stage. The second-stage analysis provides the relative predictive power of between-region network measures on clinical impairment in the context of regional biomarkers and existing disease risk factors. We assess the performance of the proposed method through extensive simulation studies and an application to a Huntington's disease (HD) study, in which we investigate how the HD causal gene affects the rate of change in motor symptoms through its effect on brain subcortical and cortical grey matter atrophy connections. We show that cortical network connections and subcortical volumes, but not subcortical connections, are predictive of clinical motor function deterioration. We validate these findings in an independent HD study. Lastly, highly similar patterns seen in the grey matter connections and in a previous white matter connectivity study suggest a shared biological mechanism for HD and support the hypothesis that white matter loss is a direct result of neuronal loss as opposed to the loss of myelin or dysmyelination.
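One way to write down a model of this kind is sketched below. The notation is assumed for illustration and is not taken from the dissertation: X_i denotes the biomarker vector of subject i, Z_i the covariates, and the linear dependence of each precision entry on Z_i is only one possible parameterization, which would additionally need constraints to keep the precision matrix positive definite.

```latex
% Assumed illustrative parameterization (not the dissertation's notation):
% X_i = biomarker vector of subject i, Z_i = covariate vector of subject i.
X_i \mid Z_i \sim \mathcal{N}\!\left(\mu(Z_i),\, \Omega(Z_i)^{-1}\right),
\qquad
\omega_{jk}(Z_i) = \alpha_{jk} + \gamma_{jk}^{\top} Z_i ,
\qquad
\rho_{jk}(Z_i) = -\frac{\omega_{jk}(Z_i)}{\sqrt{\omega_{jj}(Z_i)\,\omega_{kk}(Z_i)}} .
```

Here \rho_{jk}(Z_i) is the partial correlation between biomarkers j and k for a subject with covariates Z_i, which is the kind of subject-specific connection strength that a second-stage analysis could carry forward as a predictor of clinical outcome.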
In the third part of the dissertation, we apply the methodology to construct heterogeneous cross-sectional symptom networks. The co-occurrence of symptoms may result from direct interactions between the symptoms, so the symptoms can be treated as a system. In addition, subject-specific risk factors (e.g., genetic variants, age) can exert external influence on the system. In this work, we develop a covariate-dependent conditional Gaussian graphical model to obtain personalized symptom networks. The strengths of network connections are modeled as a function of covariates to capture the heterogeneity among individuals and subgroups of individuals. We assess the performance of the proposed method through simulation studies and an application to a Huntington's disease study, in which we investigate the networks of symptoms in different domains (motor, cognitive, psychiatric) and identify the important brain imaging biomarkers associated with the connections. We show that symptoms in the same domain interact with each other more often than with symptoms in other domains. We validate the findings using subjects' measurements from follow-up visits.
In the fourth part of the dissertation, we propose an integrative learning approach to improve the estimation of subject-specific networks of a target modality when external modality data are available. The biomarker networks measured by different modalities of data (e.g., structural magnetic resonance imaging (sMRI), diffusion tensor imaging (DTI)) may share the same true underlying biological mechanism. In this work, we propose a node-wise biomarker graphical model that leverages the shared mechanism between multi-modality data to provide a more reliable estimate of the target modality network and to account for the heterogeneity in networks due to differences between subjects and between networks of the external modality. Latent variables are introduced to represent the shared unobserved biological network, and the information from the external modality is incorporated to model the distribution of the underlying biological network. An approximation approach is used to calculate the posterior expectations of the latent variables in order to reduce computation time. The performance of the proposed method is demonstrated by extensive simulation studies and an application that constructs the grey matter brain atrophy network of Huntington's disease using sMRI and DTI data. The estimated network measures are shown to be useful for patient stratification and for predicting follow-up clinical outcomes.
Lastly, we conclude the dissertation with comments on limitations and extensions.
- Xie_columbia_0054D_15402.pdf (application/pdf, 1.73 MB)
More About This Work
- Thesis Advisors: Wang, Yuanjia
- Degree: Ph.D., Columbia University
- Published Here: August 29, 2019