Doctoral Thesis, 2025
Interpretable Algorithms for Data Integration: Adaptivity, Contamination, High-dimensionality, and Privacy Constraints
This dissertation presents several methodological and theoretical contributions addressing modern challenges in data integration, with a focus on interpretable statistical approaches.
We begin by introducing the motivation for studying data integration, along with illustrative examples. We then outline key challenges encountered by existing methods, including distributional shifts, data contamination and adversarial (Byzantine) attacks, high-dimensional settings, and privacy constraints. The subsequent chapters address these challenges by introducing new methods with provable theoretical guarantees across various contexts.
Our first contribution is a transfer learning algorithm tailored to high-dimensional generalized linear models, comprising an aggregation step followed by a debiasing step. Building on this, we develop an inference procedure based on the debiased Lasso, establishing both finite-sample guarantees and asymptotic normality for the resulting estimator. In addition, we propose a transferable source detection method designed to identify and remove contaminated or anomalous source datasets.
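A minimal sketch of the two-step idea, with ridge regression standing in for the Lasso fits and all data, dimensions, and penalty levels chosen purely for illustration: a first fit pools the source and target samples, and a second fit on the target alone corrects the pooled estimate.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge regression; a simplified stand-in for the Lasso fits
    used in the actual two-step procedure."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
p = 20
beta = np.zeros(p); beta[:3] = 1.0                  # sparse target coefficients
X_t = rng.normal(size=(100, p))                     # small target sample
y_t = X_t @ beta + rng.normal(size=100)
X_s = rng.normal(size=(500, p))                     # larger, slightly shifted source
y_s = X_s @ (beta + 0.01) + rng.normal(size=500)

# Step 1 (aggregation): fit on the pooled source + target data
X_pool = np.vstack([X_s, X_t])
y_pool = np.concatenate([y_s, y_t])
w_hat = ridge(X_pool, y_pool, lam=1.0)

# Step 2 (debiasing): correct the pooled fit on the target alone,
# with a heavier penalty reflecting the smaller target sample
delta = ridge(X_t, y_t - X_t @ w_hat, lam=100.0)
beta_hat = w_hat + delta
```

The correction in step 2 targets the discrepancy between the pooled fit and the target model, so a mildly shifted source still helps rather than hurts the final estimate.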
Next, we propose a federated Gradient-EM algorithm for parameter estimation in general mixture models. The method is designed to be privacy-preserving, computationally efficient, and adaptive to model similarity. We show that, under certain conditions, the algorithm achieves nearly minimax-optimal estimation error rates within polynomial time. Our theoretical results also extend to mis-clustering error bounds in specific mixture model settings, and offer insight into the empirical success of related EM algorithms proposed in the literature.
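A toy sketch of the federated gradient-EM loop for the special case of a symmetric two-component Gaussian mixture with means ±θ and identity covariance; the number of clients, sample sizes, and step size are illustrative. Each client computes a gradient of its local EM surrogate on its own data, and the server only averages these gradients, never touching the raw samples.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_grad(X, theta):
    # Gradient of the local EM surrogate for a symmetric two-component
    # Gaussian mixture with means +/- theta and identity covariance.
    w = sigmoid(2.0 * X @ theta)            # posterior weight of the + component
    return ((2.0 * w - 1.0)[:, None] * X).mean(axis=0) - theta

rng = np.random.default_rng(1)
d, m, n = 5, 4, 300
theta_star = np.ones(d)

# m clients, each holding n local draws from the same mixture
signs = [rng.choice([-1.0, 1.0], size=n) for _ in range(m)]
data = [s[:, None] * theta_star + rng.normal(size=(n, d)) for s in signs]

theta = theta_star + rng.normal(scale=0.5, size=d)   # warm start
for _ in range(100):
    # server averages the clients' local gradients and takes one step
    g = np.mean([local_grad(X, theta) for X in data], axis=0)
    theta = theta + 0.5 * g
```

Since only d-dimensional gradients leave each client, the communication cost per round is independent of the local sample size, and standard privacy mechanisms can be layered on top of the transmitted gradients.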
Finally, we introduce a representation learning framework for multi-task learning. Unlike prior work, our setting allows each task to have a distinct linear representation. We develop two algorithms for fitting this model via data integration, accommodating the presence of contaminated sources: one based on empirical risk minimization with a novel maximum principal angle penalty, the other leveraging spectral methods. We derive finite-sample upper and lower bounds, demonstrating near-optimal performance. The proposed framework generalizes prior models of linear representation learning and contributes several new theoretical and methodological insights.
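The key geometric ingredient of the angle-based penalty is the maximum principal angle between two subspaces, which can be computed from the smallest singular value of the product of their orthonormal bases. A small self-contained sketch, with hypothetical example subspaces of R^5:

```python
import numpy as np

def max_principal_angle(U, V):
    """Largest principal angle between the column spaces of U and V,
    whose columns are assumed to be orthonormal bases; it equals
    arccos of the smallest singular value of U^T V."""
    s = np.linalg.svd(U.T @ V, compute_uv=False)
    return np.arccos(np.clip(s.min(), 0.0, 1.0))

E = np.eye(5)
U = E[:, :2]       # span{e1, e2}
V = E[:, 2:4]      # span{e3, e4}, orthogonal to U
W = E[:, 1:3]      # span{e2, e3}, shares one direction with U
```

Identical subspaces give angle 0, while any subspace containing a direction orthogonal to the other gives the maximal angle pi/2, which is why penalizing the maximum principal angle pulls task-specific representations toward a common subspace while tolerating partial overlap.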
Files
- Tian_columbia_0054D_19271.pdf (application/pdf, 3.11 MB)
More About This Work
- Academic Units: Statistics
- Thesis Advisors: Ying, Zhiliang
- Degree: Ph.D., Columbia University
- Published Here: July 30, 2025