Theses Doctoral

Phenotyping with Partially Labeled, Partially Observed Data

Rodriguez, Victor Alfonso

Identifying a group of individuals who share a common set of characteristics is a conceptually simple task that is often difficult in practice. Such phenotyping problems emerge in various settings, including the analysis of clinical data. In this setting, phenotyping is often stymied by persistent data quality issues. These include a lack of reliable labels to indicate the presence or absence of characteristics of interest, and significant missingness in observed variables.

This dissertation introduces methods for learning phenotypes when the data contain missing values (partially observed) and labels are scarce (partially labeled). Aim 1 uses an unsupervised probabilistic graphical model to learn phenotypes from partially observed data. Aim 2 introduces a related semi-supervised probabilistic graphical model for learning phenotypes from partially labeled clinical data. Finally, Aim 3 describes a method for training deep generative models when the training data contain missing values. This training method is then applied in a semi-supervised setting, where it also accounts for partially labeled data.



More About This Work

Academic Units
Biomedical Informatics
Thesis Advisors
Perotte, Adler
Degree
Ph.D., Columbia University
Published Here
October 18, 2023