Theses Doctoral

Learning cell states from high-dimensional single-cell data

Levine, Jacob Harrison

Recent developments in single-cell measurement technologies have yielded dramatic increases in throughput (measured cells per experiment) and dimensionality (measured features per cell). In particular, the introduction of mass cytometry has made possible the simultaneous quantification of dozens of protein species in millions of individual cells in a single experiment. The raw data produced by such high-dimensional single-cell measurements provide unprecedented potential to reveal the phenotypic heterogeneity of cellular systems. In order to realize this potential, novel computational techniques are required to extract knowledge from these complex data.
Analysis of single-cell data is a new challenge for computational biology, as early development in the field was tailored to technologies that sacrifice single-cell resolution, such as DNA microarrays. The challenges for single-cell data are quite distinct and require multidimensional modeling of complex population structure. Particular challenges include nonlinear relationships between measured features and non-convex subpopulations.
This thesis integrates methods from computational geometry and network analysis to develop a framework for identifying the population structure in high-dimensional single-cell data. At the center of this framework is PhenoGraph, an algorithmic approach to defining subpopulations, which when applied to healthy bone marrow data was shown to reconstruct known immune cell types automatically without prior information. PhenoGraph demonstrated superior accuracy, robustness, and efficiency compared to other methods.
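The abstract does not spell out the algorithm, so the following is only a minimal sketch of how geometry and network analysis can be combined to find subpopulations, not the thesis's implementation. It assumes a brute-force k-nearest-neighbor graph, edge weights given by the Jaccard overlap of neighbor sets, and, as a deliberately simple stand-in for a real community-detection step, connected components of the resulting graph. All function names and the toy data are hypothetical.

```python
import math


def knn_indices(points, k):
    """Brute-force k nearest neighbours of each point (the point itself excluded)."""
    neighbours = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbours.append({j for _, j in dists[:k]})
    return neighbours


def jaccard_graph(neighbours):
    """Weight each kNN edge (i, j) by the Jaccard overlap of the two neighbour sets."""
    edges = {}
    for i, ni in enumerate(neighbours):
        for j in ni:
            nj = neighbours[j]
            weight = len(ni & nj) / len(ni | nj)
            if weight > 0:  # drop edges between points that share no neighbours
                edges[(min(i, j), max(i, j))] = weight
    return edges


def connected_components(n, edges):
    """Stand-in for community detection: label connected components via union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)

    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Toy data (hypothetical): two well-separated groups of four "cells" in 2-D.
cells = [(0, 0), (0, 1), (1, 0), (1, 1),
         (10, 10), (10, 11), (11, 10), (11, 11)]
graph = jaccard_graph(knn_indices(cells, k=3))
labels = connected_components(len(cells), graph)
```

On real data the graph is rarely disconnected this cleanly, which is why a modularity-based or similar community-detection method would replace the connected-components step; the sketch only illustrates the pipeline shape of points, then neighbor graph, then partition.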
The data-driven approach becomes truly powerful when applied to less characterized systems, such as malignancies, in which the tissue diverges from its healthy population composition. Applying PhenoGraph to bone marrow samples from a cohort of acute myeloid leukemia (AML) patients, the thesis presents several insights into the pathophysiology of AML, which were extracted by virtue of the computational isolation of leukemic subpopulations. For example, it is shown that leukemic subpopulations diverge from healthy bone marrow but not without bound: Leukemic cells are apparently free to explore only a restricted phenotypic space that mimics normal myeloid development. Further, the phenotypic composition of a sample is associated with its cytogenetics, demonstrating a genetic influence on the population structure of leukemic bone marrow.
The thesis goes on to show that functional heterogeneity of leukemic samples can be computationally inferred from molecular perturbation data. Using a variety of methods that build on PhenoGraph's foundations, the thesis presents a characterization of leukemic subpopulations based on an inferred stem-like signaling pattern. Through this analysis, it is shown that surface phenotypes often fail to reflect the true underlying functional state of the subpopulation, and that this functional stem-like state is in fact a powerful predictor of survival in large, independent cohorts.
Altogether, the thesis takes the existence and importance of cellular heterogeneity as its starting point and presents a mathematical framework and computational toolkit for analyzing samples from this perspective. It is shown that phenotypic and functional heterogeneity are robust characteristics of acute myeloid leukemia with clinically significant ramifications.



More About This Work

Academic Units
Biological Sciences
Thesis Advisors
Pe'er, Dana
Degree
Ph.D., Columbia University
Published Here
February 1, 2016