Theses Doctoral

Statistical Methods for Complex, Biased, and Sparse Health Record Data

Khan, Zain

Biomedical research increasingly relies on statistical and machine learning methods to extract meaningful insights from complex, high-dimensional data. Standard approaches often fail to capture the full scope of dependencies, biases, and heterogeneity inherent in biomedical datasets. Three major challenges hinder our ability to draw accurate conclusions from such data: (1) conventional statistical methods are limited in their ability to detect complex relationships, (2) nonrandom sample selection distorts what we are able to learn, and (3) limited sample sizes and sparsity make complex patterns difficult to capture. This dissertation develops novel methodological solutions to address these challenges, with applications in structure learning, causal mediation analysis, and longitudinal disease modeling.

A fundamental goal in biomedical research is to accurately learn dependencies between biomarkers in physiological systems. Graphical models may be used for this purpose; however, standard approaches rely on independence assumptions that often miss associations occurring at the tails of distributions. This is important in many biological settings where extreme values signal disease or dysfunction. We introduce Quantile Association via Conditional Concordance (QuACC), a novel measure of conditional association designed to capture quantile-specific dependencies in multivariate data. We use this statistic to construct quantile-specific graphical models that reveal dependencies overlooked by traditional methods. Applied to biobank data, these models reveal tail-dependent interactions between biomarkers in individuals with mitochondrial disease.

Beyond understanding biomarker associations, causal inference is necessary for evaluating the impact of social and medical factors on health outcomes. In clinical decision-making, sample selection bias can distort causal effect estimates. For example, in liver transplantation, only a subset of referred patients complete the evaluation process, with dropout often influenced by social determinants of health (SDOH), including race and neighborhood deprivation factors. Using causal graphical models, we develop a method to correct for selection bias in causal mediation analysis and recover unbiased direct, indirect, and path-specific effects of socioeconomic position on transplant listing status. This approach enables the study of causal questions that are affected by sample selection bias.

Finally, improving health outcomes requires not only an understanding of causal mechanisms, but also the ability to accurately model complex covariate behavior, particularly in the presence of small sample sizes and data sparsity. The long-term effects of severe illnesses, such as COVID-19 acute lung injury (ALI)/acute respiratory distress syndrome (ARDS), remain poorly understood, especially regarding persistent biomarker abnormalities and physical function impairments. Using longitudinal modeling and time series clustering, we investigate persistent elevations in inflammatory and endothelial biomarkers among COVID-19 ALI/ARDS survivors over a three-year period. These findings show the lasting physiological and physical function impacts of severe COVID-19 and emphasize the need for targeted post-recovery interventions.

Together, these contributions advance the statistical foundations of biomedical data analysis by developing new methodologies for modeling dependencies between biomarkers, drawing causal inferences under selection bias, and characterizing diseases in small, sparse datasets. These methods aim to inform both scientific discovery and clinical decision-making.

Files

  • Khan_columbia_0054D_19511.pdf (application/pdf, 2.09 MB)

More About This Work

Academic Units
Biomedical Engineering
Thesis Advisors
Sajda, Paul
Degree
Ph.D., Columbia University
Published Here
October 15, 2025