Theses Doctoral

Statistical Methods for Modeling Progression and Learning Mechanisms of Neuropsychiatric Disorders

Wang, Qinxia

This dissertation develops statistical models to learn the progression dynamics and mechanisms of neuropsychiatric disorders using data from various domains. Because knowledge of the underlying pathological processes in neurological disorders is limited, establishing reliable diagnostic criteria and predicting disease prognosis remain challenging in the presence of substantial phenotypic heterogeneity. As a result, current diagnosis and treatment of neurological disorders often rely on late-stage clinical symptoms, which poses barriers to developing effective interventions at the premanifest stage. It is therefore crucial to characterize the temporal course of disease progression and to study the underlying mechanisms using clinical assessments, blood biomarkers, and neuroimaging biomarkers in order to evaluate disease stages, identify markers that are useful for early clinical diagnosis, compare or monitor treatment effects, and accelerate drug discovery.

We propose three projects to tackle challenges in leveraging multi-domain biomarkers and clinical symptoms to learn the dynamics and progression of neurological disorders: (1) a nonlinear mixture model with subject-specific random inflection points that jointly fits multiple longitudinal markers and estimates marker progression trajectories in a single modality; (2) a multi-layer exponential family factor model that integrates multi-domain data to learn a lower-dimensional latent space of disease impairment and fully map disease risk and progression; (3) a latent state space model that jointly analyzes multi-channel EEG signals and learns the dynamics of different sources corresponding to brain cortical activities. In addition, motivated by the ongoing COVID-19 pandemic, we propose a parsimonious survival-convolution model to predict daily new cases and estimate the time-varying reproduction number to evaluate the effects of mitigation strategies.

In the first project, we propose a nonlinear mixture model with random time shifts to jointly estimate long-term progression trajectories using multivariate discrete longitudinal outcomes. The model can identify early disease markers, their order of occurrence, and their rates of impairment. Specifically, a latent binary variable representing disease susceptibility status incorporates subject covariates (e.g., biological measures) in the mixture model to capture between-subject heterogeneity. Measures of disease impairment for susceptible patients are modeled jointly under the exponential family framework. Our model allows for subject-specific and marker-specific inflection points associated with patients' characteristics (e.g., genetic mutation) to indicate the critical time when the fastest degeneration occurs. Furthermore, it uses subject-specific latent scores shared among markers to improve efficiency. The model is estimated using an EM algorithm. Extensive simulation studies are conducted to demonstrate the validity of the proposed method and algorithm. Lastly, we apply our method to data from the Parkinson's Progression Markers Initiative (PPMI) and show its utility for identifying early disease signs and comparing clinical symptomatology between the genetic form of Parkinson's disease (PD) and idiopathic PD.
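To illustrate what an inflection point means for a marker trajectory, the following is a minimal sketch that assumes a logistic curve for the expected level of impairment. The logistic form, the function names, the marker names, and all numeric values are hypothetical choices for illustration; they are not the dissertation's specification or results.

```python
import math


def marker_trajectory(t, inflection, rate):
    """Assumed logistic trajectory in (0, 1).

    The curve is steepest at t == inflection, so the inflection point marks
    the time of fastest change in the marker.
    """
    return 1.0 / (1.0 + math.exp(-rate * (t - inflection)))


def order_markers(inflections):
    """Return marker names sorted by inflection time, earliest first."""
    return sorted(inflections, key=inflections.get)


# Hypothetical inflection ages (in years) for three unnamed markers.
inflections = {"marker_a": 62.0, "marker_b": 55.0, "marker_c": 70.0}
ordering = order_markers(inflections)
halfway = marker_trajectory(62.0, inflections["marker_a"], rate=0.3)
```

Under this assumed form, sorting markers by their inflection times gives one simple way to read off a temporal ordering of impairment, which is the kind of ordering the abstract describes.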

In the second project, we tackle challenges in leveraging multi-domain markers to learn the early disease progression of neurological disorders. We propose to integrate heterogeneous types of measures from multiple domains (e.g., discrete clinical symptoms, ordinal cognitive markers, continuous neuroimaging and blood biomarkers) using a hierarchical Multi-layer Exponential Family Factor (MEFF) model, in which the observations follow exponential family distributions with lower-dimensional latent factors. The latent factors are decomposed into factors shared across multiple domains and domain-specific factors. The shared factors provide robust information to perform behavioral phenotyping and partition patients into clinically meaningful and biologically homogeneous subgroups, while the domain-specific factors capture the remaining variation unique to each domain. The MEFF model also captures the nonlinear trajectory of disease progression and orders the critical events of neurodegeneration measured by each marker. To overcome computational challenges, we fit our model using approximate inference techniques for large-scale data. We apply the developed method to Parkinson's Progression Markers Initiative (PPMI) data to integrate biological, clinical, and cognitive markers arising from heterogeneous distributions. The model learns lower-dimensional representations of Parkinson's disease and the temporal ordering of neurodegeneration in PD.
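The decomposition into shared and domain-specific factors can be sketched as follows. This is only an illustration under simple assumptions: each marker has a linear predictor built from shared and domain-specific latent factors, a logit link is used for a binary marker, and an identity link is used for a continuous marker. All names, loadings, and factor values are hypothetical, and the sketch omits the multi-layer structure and the fitting procedure.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def linear_predictor(w_shared, z_shared, w_domain, z_domain, intercept=0.0):
    """Natural parameter: intercept + shared-factor part + domain-specific part."""
    return intercept + dot(w_shared, z_shared) + dot(w_domain, z_domain)


def mean_binary(eta):
    """Bernoulli mean under an assumed logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def mean_continuous(eta):
    """Gaussian mean under an assumed identity link."""
    return eta


# One subject: the same shared factors enter both domains,
# while each domain also has its own factor.
z_shared = [0.5, 0.5]
z_clinical, z_imaging = [0.0], [1.0]

eta_symptom = linear_predictor([1.0, -1.0], z_shared, [2.0], z_clinical)
eta_imaging = linear_predictor([0.4, 0.6], z_shared, [1.0], z_imaging)

p_symptom = mean_binary(eta_symptom)        # probability of a binary symptom
mu_imaging = mean_continuous(eta_imaging)   # expected continuous biomarker
```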

In the third project, we propose methods for analyzing multi-channel electroencephalogram (EEG) signals that are intensively measured at a high temporal resolution. Modern neuroimaging technologies have substantially advanced the measurement of brain activities. EEG, a non-invasive neuroimaging technique, measures changes in electrical voltage on the scalp induced by cortical activities. With its high temporal resolution, EEG has emerged as an increasingly useful tool to study brain connectivity. Challenges in modeling EEG signals of complex brain activities include interactions among unknown sources, a low signal-to-noise ratio, and substantial between-subject heterogeneity. In this work, we propose a state space model that jointly analyzes multi-channel EEG signals and learns the dynamics of different sources corresponding to brain cortical activities. Our model borrows strength from spatially correlated measurements and uses low-dimensional latent sources to explain all observed channels. The model can account for patient heterogeneity and quantify the effect of a subject's covariates on the latent space. The EM algorithm, Kalman filtering, and bootstrap resampling are used to fit the state space model and provide comparisons between patient diagnostic groups. We apply the developed approach to a case-control study of alcoholism and reveal significant attenuation of brain activities in response to visual stimuli in alcoholic subjects compared to healthy controls.
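As a rough illustration of the Kalman filtering step, the sketch below filters a single latent source observed through several channels. It assumes a scalar autoregressive state, known parameters, and independent observation noise with a common variance across channels, which allows channels to be processed one at a time. These simplifications, the function name, and the parameter names are assumptions for illustration; the dissertation's model has multiple sources and covariates, and its parameters are estimated by EM.

```python
def kalman_filter(observations, loadings, a, q, r, x0=0.0, p0=1.0):
    """Filter a scalar latent source observed through several channels.

    Assumed model (illustrative only):
        state:        x_t = a * x_{t-1} + w_t,   w_t ~ N(0, q)
        channel c:    y_{t,c} = b_c * x_t + v_c, v_c ~ N(0, r), independent
    Returns a list of (filtered mean, filtered variance) per time point.
    """
    x, p = x0, p0
    filtered = []
    for y_t in observations:
        # Prediction step.
        x = a * x
        p = a * a * p + q
        # Update step: with independent channel noise, sequential scalar
        # updates give the same result as one joint update.
        for y, b in zip(y_t, loadings):
            s = b * b * p + r          # innovation variance
            k = p * b / s              # Kalman gain
            x = x + k * (y - b * x)
            p = (1.0 - k * b) * p
        filtered.append((x, p))
    return filtered


# Two channels with unit loadings, one time point, toy numbers.
result = kalman_filter([[2.0, 2.0]], loadings=[1.0, 1.0], a=1.0, q=0.0, r=1.0)
```

In this toy setting the filtered mean pools the two channels with the prior, which is a small-scale version of borrowing strength across correlated measurements.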

Lastly, motivated by the ongoing COVID-19 pandemic, we propose a robust and parsimonious survival-convolution model to predict the COVID-19 disease course and compare the effectiveness of mitigation measures across countries to inform policy decision making. We account for transmission during a pre-symptomatic incubation period and use a time-varying effective reproduction number to reflect the temporal trend of transmission and its change in response to a public health intervention. We estimate the intervention effect on reducing the infection rate using a natural experiment design and quantify uncertainty by permutation. In China and South Korea, we predicted the entire epidemic using only early-phase data (two to three weeks after the outbreak). We observed a fast decline in the reproduction number, and adopting mitigation strategies early in the epidemic was effective in reducing the infection rate in these two countries. The nationwide lockdown in Italy did not accelerate the decline in the infection rate. In the United States, the reproduction number decreased significantly during the 2-week period after the declaration of national emergency, but declined at a much slower rate afterwards. If the trend continues after May 1, COVID-19 may be controlled by late July. However, a loss of temporal effect (e.g., due to relaxing mitigation measures after May 1) could lead to a long delay in controlling the epidemic.
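The general idea of a survival-convolution model can be sketched as below. The sketch assumes that new infections on a given day equal a time-varying infection rate multiplied by the number of people still infectious and in circulation, where that number is a convolution of past new infections with a survival function. The exact formulation, the function names, and the toy inputs are assumptions for illustration and may differ from the dissertation's model.

```python
def simulate_new_cases(infection_rate, survival, initial_cases, n_days):
    """Simulate daily new infections under an assumed survival-convolution form.

    infection_rate: function of day t returning the infection rate a(t)
    survival: survival[k] is the assumed probability that a person infected
              k days ago is still infectious and in circulation
    """
    new_cases = [float(initial_cases)]
    for t in range(1, n_days):
        # Convolution: people infected on day k who remain infectious on day t-1.
        at_risk = sum(
            new_cases[k] * survival[t - 1 - k]
            for k in range(t)
            if t - 1 - k < len(survival)
        )
        new_cases.append(infection_rate(t - 1) * at_risk)
    return new_cases


def reproduction_number(infection_rate, survival, t):
    """Expected secondary infections from one person infected on day t,
    under the same assumptions."""
    return sum(infection_rate(t + k) * survival[k] for k in range(len(survival)))


# Toy inputs: constant infection rate, infectious for exactly two days.
cases = simulate_new_cases(lambda t: 1.0, [1.0, 1.0], initial_cases=1, n_days=6)
r_t = reproduction_number(lambda t: 1.0, [1.0, 1.0], t=0)
```

In this form, an intervention would enter through a change in the infection rate function over time, and the reproduction number would change accordingly.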




More About This Work

Academic Units
Thesis Advisors: Wang, Yuanjia
Degree: Ph.D., Columbia University
Published Here: September 8, 2021