Theses Doctoral

Advances in Machine Learning for Compositional Data

Gordon Rodriguez, Elliott

Compositional data refers to simplex-valued data, or equivalently, nonnegative vectors whose totals are uninformative. This data modality is of relevance across several scientific domains. A classical example of compositional data is the chemical composition of geological samples, e.g., major-oxide concentrations. A more modern example arises from the microbial populations recorded using high-throughput genetic sequencing technologies, e.g., the gut microbiome. This dissertation presents a set of methodological and theoretical contributions that advance the state of the art in the analysis of compositional data.

Our work can be divided along two categories: problems in which compositional data represents the input to a predictive model, and problems in which it represents the output of the model. For the first class of problems, we build on the popular log-ratio framework to develop an efficient learning algorithm for high-dimensional compositional data. Our algorithm runs orders of magnitude faster than competing alternatives, without sacrificing model quality. For the second class of problems, we define a novel exponential family of probability distributions supported on the simplex. This distribution enjoys attractive mathematical properties and provides a performant probability model for simplex-valued outcomes. Taken together, our results constitute a broad contribution to the toolkit of researchers and practitioners studying compositional data.


  • thumnail for GordonRodriguez_columbia_0054D_17153.pdf GordonRodriguez_columbia_0054D_17153.pdf application/pdf 1.64 MB Download File

More About This Work

Academic Units
Thesis Advisors
Cunningham, John Patrick
Ph.D., Columbia University
Published Here
April 27, 2022