Doctoral Thesis

Non-Euclidean Representation Learning with Applications to Metagenomics

Chlenski, Philippe

Machine learning relies on high-quality data representations to enable accurate modeling and prediction. However, biological data, particularly in fields like metagenomics, poses distinct challenges for machine learning: it is high-dimensional, noisy, and compositional, and it exhibits complex latent interactions and hierarchies. We explore the adaptation of non-Euclidean representation learning techniques to metagenomics.

First, we present two works on modeling metagenomic samples’ latent structure and dynamics: MiSDEED, a tool for generating realistic synthetic data based on generalized Lotka-Volterra dynamics, and a method for inferring microbial growth rates from 16S amplicon data by extending peak-to-trough ratio analysis.
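As a rough illustration of the generalized Lotka-Volterra dynamics mentioned above, the sketch below integrates dx_i/dt = x_i (r_i + sum_j A_ij x_j) with a forward Euler step. This is not MiSDEED's interface or implementation: the function name `simulate_glv`, its parameters, the step size, and the example growth rates and interaction matrix are all hypothetical choices made for this example.

```python
def simulate_glv(x0, r, A, dt=0.01, n_steps=1000):
    """Forward-Euler integration of generalized Lotka-Volterra dynamics.

    x0: initial abundances, r: intrinsic growth rates,
    A:  interaction matrix (A[i][j] is the effect of species j on species i).
    All names and defaults here are illustrative assumptions.
    """
    x = [float(v) for v in x0]
    n = len(x)
    trajectory = [list(x)]
    for _ in range(n_steps):
        # Per-capita growth rate of each species at the current state.
        growth = [r[i] + sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        # Euler update; abundances are clamped at zero so they stay non-negative.
        x = [max(0.0, x[i] + dt * x[i] * growth[i]) for i in range(n)]
        trajectory.append(list(x))
    return trajectory


# Two hypothetical competing species with self-limitation on the diagonal.
rates = [1.0, 0.5]
interactions = [[-1.0, -0.2],
                [-0.1, -0.5]]
traj = simulate_glv([0.1, 0.1], rates, interactions)
```

A forward Euler step keeps the example short; an adaptive ODE solver would be a more robust choice for stiff interaction matrices.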

Next, we develop novel machine learning models that can operate on non-Euclidean data representations: first, we introduce decision tree and random forest algorithms for hyperbolic spaces, then generalize these to mixed-curvature product spaces. We further introduce two major engineering efforts to improve the efficiency and accessibility of non-Euclidean machine learning: a method for speeding up non-Euclidean decision trees by three or more orders of magnitude, and a comprehensive Python library supporting end-to-end machine learning on product manifolds.
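To make "mixed-curvature product spaces" concrete, the following sketch computes a geodesic distance on a product of one hyperbolic, one spherical, and one Euclidean component, using the standard rule that squared component distances add. It assumes unit curvature magnitudes and the hyperboloid model for the hyperbolic factor; the function names are hypothetical and are not taken from the thesis's Python library.

```python
import math


def hyperbolic_dist(u, v):
    """Distance on the hyperboloid model (curvature -1)."""
    inner = -u[0] * v[0] + sum(a * b for a, b in zip(u[1:], v[1:]))
    # Clamp to the domain of acosh to absorb floating-point error.
    return math.acosh(max(1.0, -inner))


def spherical_dist(u, v):
    """Great-circle distance on the unit sphere (curvature +1)."""
    inner = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, inner)))


def euclidean_dist(u, v):
    """Ordinary Euclidean distance (curvature 0)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def product_dist(x, y, component_dists):
    """Distance on a product manifold: root of summed squared component distances."""
    return math.sqrt(sum(d(a, b) ** 2 for d, a, b in zip(component_dists, x, y)))


# Each point is a tuple of per-component coordinates (hyperbolic, spherical, Euclidean).
signature = [hyperbolic_dist, spherical_dist, euclidean_dist]
p = ([1.0, 0.0, 0.0], [1.0, 0.0], [0.0, 0.0])
q = ([1.0, 0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
d = product_dist(p, q, signature)
```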

Finally, we present preliminary explorations into the role that non-Euclidean representations can play in metagenomic settings. We propose using weighted centroids to aggregate species-level (feature-level) embeddings into sample embeddings, aiming to learn biologically meaningful representations that can improve downstream tasks such as sample classification.
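One way to picture the weighted-centroid aggregation is sketched below for hyperbolic embeddings on the hyperboloid model, using the Lorentzian centroid (a weighted sum rescaled back onto the hyperboloid). The abstract does not say which centroid definition the thesis uses, so this choice, the helper names, and the toy abundances are assumptions made purely for illustration.

```python
import math


def minkowski_dot(u, v):
    """Lorentzian inner product: -u0*v0 + sum of spatial products."""
    return -u[0] * v[0] + sum(a * b for a, b in zip(u[1:], v[1:]))


def lift_to_hyperboloid(p):
    """Map spatial coordinates p to the hyperboloid point (sqrt(1 + |p|^2), p)."""
    return [math.sqrt(1.0 + sum(a * a for a in p))] + list(p)


def weighted_centroid(points, weights):
    """Lorentzian centroid of hyperboloid points.

    Weights must be non-negative and not all zero (for example, relative
    abundances of species in one sample).
    """
    dim = len(points[0])
    s = [sum(w * x[k] for x, w in zip(points, weights)) for k in range(dim)]
    # The weighted sum is timelike, so -<s, s> is positive; rescale to <mu, mu> = -1.
    norm = math.sqrt(-minkowski_dot(s, s))
    return [c / norm for c in s]


# Hypothetical species embeddings and one sample's relative abundances.
species = [lift_to_hyperboloid([0.5, 0.0]), lift_to_hyperboloid([-0.5, 0.0])]
abundances = [0.5, 0.5]
sample_embedding = weighted_centroid(species, abundances)
```

Because the result is rescaled onto the hyperboloid, the sample embedding lives in the same space as the species embeddings and can be passed to the same downstream models.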

We also explore the empirical distribution of curvature measurements via Monte Carlo sampling, offering a much-needed calibration for curvature estimation in the presence of confounding variables. By adding new methods for probing and evaluating non-Euclidean embeddings, we strive to unlock the potential of representation learning to advance microbiome research and its applications in human health and beyond.

Files

  • Chlenski_columbia_0054D_19564.pdf (application/pdf, 14.6 MB)

More About This Work

Academic Units
Computer Science
Thesis Advisors
Pe'er, Itsik
Degree
Ph.D., Columbia University
Published Here
October 29, 2025