2024 Theses Doctoral
Computational approaches to understand mechanisms of human genetic disorders
Human genetics is one of the strongest risk factors for complex diseases. Understanding the effects of genetic variations not only serves as a fundamental approach to studying disease mechanisms but also offers unprecedented opportunities for improved clinical screening, disease diagnosis and therapeutic discoveries. Despite decades of extensive DNA sequencing and genetic research involving large cohorts, two major challenges remain. First, the majority of disease risk genes remain unidentified due to limited statistical power. Second, the functional effects of rare variants, especially missense variants, in disease risk genes are understudied. In this thesis, I describe new computational approaches to address those challenges using statistical genetics and machine learning methods that incorporate intuition about biological mechanisms. First, I worked on a statistical framework that can identify disease-related pathways from de novo coding variant data. I applied this framework to study the genetics of esophageal atresia / tracheoesophageal fistula (EA/TEF) and identified several potential disease-causing pathways that are involved in endosome trafficking.
Next, I developed a new method to identify disease risk genes by integrating genetic (rare de novo variants) and functional genomics data. Identifying risk genes using rare variants typically has low statistical power because such variants are observed in very few individuals. Functional genomics data has the potential to address this challenge because it can serve as an informative prior on disease risk. Therefore, I developed a statistical method called VBASS. VBASS is a semi-supervised algorithm that uses a neural network to encode biological priors, such as cell type-specific expression values, into a rigorous Bayesian statistical model to increase statistical power. On simulated data, VBASS demonstrated proper error rate control and better power than current state-of-the-art methods. We applied VBASS to congenital heart disease (CHD) and autism spectrum disorder (ASD), identifying several novel disease risk genes along with their associated cell types.
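The general idea that an informative prior can raise power for rare-variant gene discovery can be illustrated with a toy calculation. The sketch below is not VBASS: it assumes a simple two-component Poisson mixture for de novo variant counts, and the function names, the fixed prior values, and the `relative_risk` parameter are all hypothetical choices made for illustration. In the method described above the prior is learned by a neural network from expression data, whereas here it is just a number passed in by hand.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of observing k events under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def posterior_risk(count: int, expected: float, prior: float,
                   relative_risk: float = 10.0) -> float:
    """Posterior probability that a gene is a risk gene (toy model).

    count         -- observed de novo variants in the gene (hypothetical input)
    expected      -- expected count under the background mutation rate
    prior         -- prior probability that the gene is a risk gene
    relative_risk -- assumed fold increase of the rate in risk genes (illustrative)
    """
    like_risk = poisson_pmf(count, expected * relative_risk)
    like_null = poisson_pmf(count, expected)
    numerator = prior * like_risk
    return numerator / (numerator + (1.0 - prior) * like_null)


# Same genetic evidence (2 variants where 0.05 are expected), two different priors.
weak_prior = posterior_risk(count=2, expected=0.05, prior=0.01)
strong_prior = posterior_risk(count=2, expected=0.05, prior=0.30)
print(f"posterior with prior 0.01: {weak_prior:.3f}")
print(f"posterior with prior 0.30: {strong_prior:.3f}")
```

With identical variant counts, the gene with the higher prior receives a higher posterior, which is the sense in which functional genomics data used as a prior can help a gene with sparse genetic evidence cross a significance threshold.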
Finally, I focused on predicting the functional mechanisms of missense variants that cause diseases. Pathogenic missense variants may act through different modes of action (e.g., gain-of-function or loss-of-function) by affecting various aspects of protein function. These variants may result in distinct clinical conditions requiring different treatments, yet current computational tools cannot distinguish between them because their predictions rely heavily on evolutionary conservation data. The recent breakthrough of AI-powered protein structure prediction tools provides an opportunity to address this challenge because the functional mechanisms of variants are intrinsically embedded in their structural properties. Therefore, I developed a deep learning method called PreMode. PreMode is a pretrained SE(3)-equivariant graph neural network model designed to capture the effects of missense variants from their structural contexts and evolutionary information. I pretrained PreMode using labeled pathogenicity data to enable the model to learn a general representation of variant effects, followed by protein-specific transfer learning to predict mode-of-action effects. I applied PreMode to mode-of-action prediction in 17 genes and demonstrated that PreMode achieved state-of-the-art performance compared to existing models. PreMode has various applications, including identifying novel gain/loss-of-function variants, improving the study design of deep mutational scans, and guiding optimization in protein engineering.
Files
- Zhong_columbia_0054D_18907.pdf (application/pdf, 27.3 MB)
More About This Work
- Academic Units: Cellular, Molecular and Biomedical Studies
- Thesis Advisors: Shen, Yufeng
- Degree: Ph.D., Columbia University
- Published Here: November 13, 2024