2014 Theses Doctoral
Graph structure inference for high-throughput genomic data
Recent advances in high-throughput sequencing technologies enable us to study a large number of biomarkers and use their information collectively. Many genome-wide networks have been constructed from high-throughput experiments to characterize the complex physical or functional interactions between biomarkers. To identify outcome-related biomarkers, it is often advantageous to make use of the known relational structure, because graph-structured inference introduces smoothness and reduces model complexity. In this dissertation, we propose models for high-dimensional epigenetic and genomic data that incorporate the network structure and update it based on empirical evidence.
In the first part of this dissertation, we propose a penalized conditional logistic regression model for high-dimensional DNA methylation data. DNA methylation levels of CpG sites within genes are often correlated, and the number of CpG sites typically far exceeds the sample size. The new penalty function combines the truncated lasso penalty and a graph fused-lasso penalty to induce parsimonious and consistent models, and to incorporate the CpG site network structure without introducing extra bias. An efficient minorization-maximization algorithm that utilizes difference of convex programming and the alternating direction method of multipliers is presented. Extensive simulations demonstrated superior performance of the proposed method compared to several existing methods in both model selection consistency and parameter estimation accuracy. We also applied the proposed method to a matched case-control breast invasive carcinoma methylation dataset from The Cancer Genome Atlas (TCGA), generated from both the Illumina Infinium HumanMethylation27 (HM27) and HumanMethylation450 (HM450) BeadChips. The proposed method identified several outcome-related CpG sites that were missed by the existing methods.
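To make the shape of such an objective concrete, the following is a minimal sketch of a penalized conditional logistic objective for 1:1 matched case-control pairs. It is an illustration under stated assumptions, not the dissertation's formulation: the abstract does not give the exact penalty, so the function names, the tuning parameters `lam1`, `lam2`, and `tau`, the unweighted edge list, and the choice to leave the fusion term untruncated are all hypothetical. The sketch only evaluates the objective; it does not implement the minorization-maximization, difference of convex, or ADMM steps.

```python
import math

def tlp(x, tau):
    """Truncated lasso penalty: min(|x| / tau, 1)."""
    return min(abs(x) / tau, 1.0)

def matched_pair_neg_loglik(beta, cases, controls):
    """Negative conditional log-likelihood for 1:1 matched pairs.

    Each pair contributes log(1 + exp(-(x_case - x_ctrl) . beta)).
    """
    total = 0.0
    for x_case, x_ctrl in zip(cases, controls):
        eta = sum(b * (xc - xk) for b, xc, xk in zip(beta, x_case, x_ctrl))
        # Numerically stable evaluation of log(1 + exp(-eta)).
        total += max(-eta, 0.0) + math.log1p(math.exp(-abs(eta)))
    return total

def penalized_objective(beta, cases, controls, edges, lam1, lam2, tau):
    """Loss + sparsity penalty + graph fusion penalty (assumed form)."""
    # Sparsity term: truncated lasso penalty on each coefficient.
    sparsity = sum(tlp(b, tau) for b in beta)
    # Fusion term: absolute difference across each network edge (j, k).
    fusion = sum(abs(beta[j] - beta[k]) for j, k in edges)
    loss = matched_pair_neg_loglik(beta, cases, controls)
    return loss + lam1 * sparsity + lam2 * fusion

# Hypothetical toy data: 2 matched pairs, 3 CpG sites, sites 0 and 1 linked.
cases = [[0.8, 0.7, 0.1], [0.6, 0.9, 0.3]]
controls = [[0.2, 0.3, 0.2], [0.4, 0.1, 0.3]]
edges = [(0, 1)]
value = penalized_objective([0.5, 0.5, 0.0], cases, controls, edges,
                            lam1=1.0, lam2=1.0, tau=0.1)
```

In this toy setting the two linked coefficients are equal, so the fusion term vanishes, and both exceed `tau`, so each contributes the capped penalty of 1 rather than a penalty that grows with its magnitude. That cap is the sense in which a truncated penalty avoids adding shrinkage bias to large coefficients.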
In the latter part of this dissertation, we propose a Bayesian hierarchical graph-structured model that integrates a priori network information with empirical evidence. Empirical data may suggest modifications to the given network structure, which could lead to new and interesting biological findings when the prior knowledge of the graphical structure among the variables is limited or partial. We present the full hierarchical model along with the Markov chain Monte Carlo sampling inference procedure. Using both simulations and brain aging gene pathway data, we showed that the new method can identify discrepancies between the data and an a priori known graph structure and suggest modifications and updates.
Motivated by methylation and gene expression data, the two models we propose in this thesis make use of the available structure in the data and produce better inferential results. The proposed methods can be applied to a wider range of problems.
Files
- Zhou_columbia_0054D_12374.pdf (application/pdf, 1.4 MB)
More About This Work
- Academic Units: Biostatistics
- Thesis Advisors: Wang, Shuang
- Degree: Ph.D., Columbia University
- Published Here: October 13, 2014