2025 Theses Doctoral
Non-Euclidean Representation Learning with Applications to Metagenomics
Machine learning relies on high-quality data representations to enable accurate modeling and prediction. However, biological data, particularly in fields like metagenomics, presents unique challenges for machine learning due to its high dimensionality, noisiness, compositionality, and complex latent interactions and hierarchies. We explore the adaptation of non-Euclidean representation learning techniques to metagenomics.
First, we present two works on modeling metagenomic samples’ latent structure and dynamics: MiSDEED, a tool for generating realistic synthetic data based on generalized Lotka-Volterra dynamics, and a method for inferring microbial growth rates from 16S amplicon data by extending peak-to-trough ratio analysis.
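As a rough illustration of the generalized Lotka-Volterra dynamics mentioned above, the sketch below integrates the standard form dx_i/dt = x_i (r_i + sum_j A_ij x_j) with a forward-Euler step. It is not MiSDEED's implementation: the function names, the Euler integrator, the clipping at zero, and the parameter values are all assumptions chosen for brevity.

```python
def glv_step(x, r, A, dt):
    """One forward-Euler step of dx_i/dt = x_i * (r_i + sum_j A[i][j] * x_j)."""
    n = len(x)
    new_x = []
    for i in range(n):
        interaction = sum(A[i][j] * x[j] for j in range(n))
        xi = x[i] + dt * x[i] * (r[i] + interaction)
        new_x.append(max(xi, 0.0))  # assumption: clip so abundances stay non-negative
    return new_x


def simulate_glv(x0, r, A, dt=0.01, steps=1000):
    """Return the full trajectory (list of abundance vectors), starting at x0."""
    traj = [list(x0)]
    for _ in range(steps):
        traj.append(glv_step(traj[-1], r, A, dt))
    return traj


def relative_abundance(x):
    """Normalize absolute abundances to a composition that sums to 1."""
    total = sum(x)
    return [xi / total for xi in x] if total > 0 else [0.0] * len(x)


# Hypothetical two-species community: both self-limit, species 1 inhibits species 0.
r = [1.0, 0.8]
A = [[-1.0, -0.5],
     [0.0, -1.0]]
trajectory = simulate_glv([0.1, 0.1], r, A, dt=0.01, steps=2000)
final_composition = relative_abundance(trajectory[-1])
```

Normalizing the final state to relative abundances mirrors the compositional nature of sequencing data. A real generator would also need measurement noise and a sampling model, which this sketch omits.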
Next, we develop novel machine learning models that can operate on non-Euclidean data representations: first, we introduce decision tree and random forest algorithms for hyperbolic spaces, then generalize these to mixed-curvature product spaces. We further introduce two major engineering efforts to improve the efficiency and accessibility of non-Euclidean machine learning: a method for speeding up non-Euclidean decision trees by three or more orders of magnitude, and a comprehensive Python library supporting end-to-end machine learning on product manifolds.
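For readers unfamiliar with mixed-curvature product spaces, the following sketch computes a distance on a product of hyperbolic, spherical, and Euclidean components using the standard textbook formulas for unit curvature. It is an illustration only and does not reflect the API of the thesis's Python library: the function names, the `"H"`/`"S"`/`"E"` signature strings, and the fixed curvatures of -1, +1, and 0 are assumptions.

```python
import math


def hyperbolic_dist(x, y):
    """Distance on the hyperboloid model (curvature -1); x[0] is the time-like coordinate."""
    inner = -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))
    return math.acosh(max(-inner, 1.0))  # clamp guards against rounding below 1


def spherical_dist(x, y):
    """Great-circle distance between unit vectors (curvature +1)."""
    inner = sum(a * b for a, b in zip(x, y))
    return math.acos(max(-1.0, min(1.0, inner)))


def euclidean_dist(x, y):
    """Ordinary straight-line distance (curvature 0)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


DIST = {"H": hyperbolic_dist, "S": spherical_dist, "E": euclidean_dist}


def product_dist(x_parts, y_parts, signature):
    """Product-manifold distance: square root of the sum of squared component distances."""
    return math.sqrt(sum(
        DIST[kind](xp, yp) ** 2
        for kind, xp, yp in zip(signature, x_parts, y_parts)
    ))


# Hypothetical points in H^2 x S^1 x E^2.
x = ([1.0, 0.0, 0.0], [1.0, 0.0], [0.0, 0.0])
y = ([math.cosh(1.0), math.sinh(1.0), 0.0], [0.0, 1.0], [3.0, 4.0])
d = product_dist(x, y, "HSE")
```

Each component contributes its own geodesic distance, so the same code covers purely hyperbolic, spherical, or Euclidean embeddings by changing the signature.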
Finally, we present preliminary explorations into the role that non-Euclidean representations can play in metagenomic settings. We propose using weighted centroids to aggregate feature-level (that is, species-level) embeddings into sample-level embeddings, with the aim of learning biologically meaningful representations that improve downstream tasks such as sample classification.
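One way to picture the weighted-centroid aggregation is sketched below for hyperbolic embeddings. The abstract does not say which centroid definition the thesis uses. This example assumes the hyperboloid model with curvature -1 and the Lorentzian centroid (a weighted sum renormalized back onto the hyperboloid), with relative abundances as weights. All function names and values are hypothetical.

```python
import math


def minkowski_dot(u, v):
    """Minkowski inner product with signature (-, +, ..., +)."""
    return -u[0] * v[0] + sum(a * b for a, b in zip(u[1:], v[1:]))


def lift_to_hyperboloid(p):
    """Lift spatial coordinates p onto the hyperboloid <x, x> = -1."""
    return [math.sqrt(1.0 + sum(a * a for a in p))] + list(p)


def weighted_centroid(points, weights):
    """Weighted sum of hyperboloid points, rescaled so the result satisfies <mu, mu> = -1.

    Assumes non-negative weights that are not all zero, so the sum is time-like.
    """
    dim = len(points[0])
    s = [sum(w * x[k] for x, w in zip(points, weights)) for k in range(dim)]
    norm = math.sqrt(-minkowski_dot(s, s))
    return [c / norm for c in s]


# Hypothetical species embeddings and one sample's relative abundances.
species = [lift_to_hyperboloid(p) for p in ([0.5, 0.0], [-0.5, 0.0], [0.0, 2.0])]
abundances = [0.6, 0.3, 0.1]
sample_embedding = weighted_centroid(species, abundances)
```

Because the result is renormalized, the overall scale of the weights does not matter, so raw counts and relative abundances yield the same sample embedding.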
We also explore the empirical distribution of curvature measurements via Monte Carlo sampling, offering a much-needed calibration for curvature estimation in the presence of confounding variables. By adding new methods for probing and evaluating non-Euclidean embeddings, we strive to unlock the potential of representation learning to advance microbiome research and its applications in human health and beyond.
Files
Chlenski_columbia_0054D_19564.pdf
application/pdf
14.6 MB
More About This Work
- Academic Units: Computer Science
- Thesis Advisors: Pe'er, Itsik
- Degree: Ph.D., Columbia University
- Published Here: October 29, 2025