Theses Doctoral

Federated Collaboration: Addressing Challenges in Data Sharing and Learning Under Statistical Heterogeneity

Elhussein, Ahmed

Modern biomedicine, spanning research initiatives (e.g., precision medicine) and clinical deployment (e.g., decision support systems), depends on large, diverse datasets to deliver robust and generalizable outcomes. Achieving this requires collaboration, as most biomedical data remain siloed within institutions. There are two primary strategies for multi-institutional collaboration: direct data sharing and distributed analysis, each facing significant challenges. Direct data sharing is hindered by institutional data ownership, security, and privacy concerns, as well as technical barriers. Conversely, distributed learning methods such as Federated Learning (FL) avoid centralizing data but are highly sensitive to statistical heterogeneity between institutions. This heterogeneity, common in biomedical settings, can degrade FL performance and lead to unequal outcomes across participants, limiting collaborative potential.

This thesis addresses these challenges through three aims: (i) developing a scalable, secure infrastructure for multimodal data sharing that addresses institutional data ownership and security concerns while supporting complex analysis; (ii) characterizing and quantifying statistical heterogeneity in multi-site healthcare datasets using interpretable, privacy-preserving methods; and (iii) designing novel FL algorithms that mitigate the negative effects of heterogeneity while maintaining privacy and computational efficiency.

To enable collaboration through secure data sharing, we developed a blockchain-based platform that stores clinical and genetic data directly on-chain. This decentralized architecture preserves institutional data sovereignty, addressing a key limitation of centralized biobanks, and unifies multimodal data to streamline analysis. We demonstrated utility by scaling to 12,000 patients, successfully replicating a published genome-wide association study, and identifying a novel disease-associated locus in a federated rare disease cohort.

To address statistical heterogeneity in FL, we first conducted a case study on a real-world multi-institutional dataset and developed a privacy-preserving optimal transport–based dataset dissimilarity metric that predicts FL performance early in training. Building on these findings, we introduced two complementary algorithms. The first, PCBFL, uses secure patient-level similarity computation to identify clinically meaningful subgroups for targeted model training. To further improve clustering in PCBFL, we developed KEEP, a knowledge-guided embedding method that integrates medical ontologies with empirical data to produce semantically rich and robust patient representations. The second algorithm, PLayer-FL, applies a gradient-driven sensitivity analysis to dynamically balance global knowledge with local specialization across heterogeneous healthcare datasets. This architecture-agnostic method achieved consistently strong performance across diverse datasets and distributed these gains more equitably across participating sites.
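As a rough illustration of what an optimal transport-based dataset dissimilarity measures, the sketch below computes the one-dimensional Wasserstein-1 distance between the empirical distributions of a single feature at two sites. It is not the thesis's metric: it handles only one feature, includes no privacy-preserving step, and the function name `wasserstein_1d` and the site values are hypothetical, chosen only for this example.

```python
from typing import Sequence


def wasserstein_1d(a: Sequence[float], b: Sequence[float]) -> float:
    """Wasserstein-1 distance between two empirical 1-D distributions.

    Integrates |Qa(q) - Qb(q)| over q in [0, 1], where Qa and Qb are the
    empirical quantile functions. Sample sizes may differ.
    """
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(a), sorted(b)
    n, m = len(xs), len(ys)
    # Track cumulative probability in integer units of 1 / (n * m)
    # so that breakpoint comparisons are exact.
    i = j = 0
    q = 0
    total = 0.0
    while i < n and j < m:
        next_qa = (i + 1) * m  # end of the mass carried by xs[i]
        next_qb = (j + 1) * n  # end of the mass carried by ys[j]
        next_q = min(next_qa, next_qb)
        total += (next_q - q) * abs(xs[i] - ys[j])
        q = next_q
        if next_qa == next_q:
            i += 1
        if next_qb == next_q:
            j += 1
    return total / (n * m)


# Hypothetical values of one feature (e.g. patient age) at three sites.
sites = {
    "site_a": [34.0, 41.0, 47.0, 52.0, 60.0],
    "site_b": [36.0, 40.0, 49.0, 55.0, 58.0],
    "site_c": [68.0, 71.0, 75.0, 80.0],
}

# Pairwise dissimilarity: larger values indicate more heterogeneous sites.
names = list(sites)
pairwise = {
    (p, r): wasserstein_1d(sites[p], sites[r])
    for k, p in enumerate(names)
    for r in names[k + 1:]
}
```

In this toy setting, `site_c` would be expected to lie farther from the other two sites than they lie from each other, which is the kind of signal a dissimilarity metric could use to anticipate where federated training may struggle.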

Together, these contributions advance federated biomedical research and deployment by providing tools for secure data sharing and robust analysis under statistical heterogeneity, moving the field toward more collaborative, data-driven, and clinically impactful discovery.

Files

  • Elhussein_columbia_0054D_19610.pdf (application/pdf, 1.88 MB)

More About This Work

Academic Units
Biomedical Informatics
Thesis Advisors
Gürsoy, Gamze
Degree
Ph.D., Columbia University
Published Here
November 19, 2025