Theses Doctoral

Leveraging patient-provided data to improve understanding of disease risk

da Graca Polubriaginof, Fernanda Caroline

Patient-provided data are crucial to achieving the goal of precision medicine. These data, which include family medical history, race and ethnicity, and social and behavioral determinants of health, are essential for disease risk assessment. Despite the well-established importance of patient-provided data, there are many data quality challenges that affect how this information can be used for biomedical research.
To determine how to best use patient-provided data to assess disease risk, the research reflected in this dissertation was divided into three overarching aims. In Aim 1, I focused on determining the quality of race and ethnicity, family history, and smoking status in clinical databases. In Aim 2, I assessed the impact of various interventions on the quality of these data, including policy changes such as the implementation of the requirements imposed by the Meaningful Use program, and patient-facing tools for collecting and sharing information with patients. In addition to these evaluations, I also developed and evaluated a method, “Relationship Inference from the Electronic Health Record” (RIFTEHR), that infers familial relationships from clinical datasets. In Aim 3, I used patient-provided data to assess disease risk both at a population level, by estimating disease heritability, and at an individual level, by identifying high-risk individuals eligible for additional screening for a common disease (diabetes mellitus) and a rare disease (celiac disease).
My research uncovered several data quality concerns for patient-provided data in clinical databases. When assessing the impact of interventions on the quality of these data, I found that policy interventions led to more data collection, but not necessarily to better data quality. In contrast, patient-facing tools did increase the quality of the patient-provided data. In the absence of high-quality patient-provided data for family medical history, I developed and evaluated a method for inferring this information from large clinical databases. I demonstrated that electronic health record data can be used to infer familial relationships accurately. Moreover, I showed how the use of clinical data in conjunction with the inferred familial relationships could determine disease risk in two studies. In the first study, I successfully computed disease heritability estimates for 500 conditions, some of which had not been previously studied. In the second study, I found that screening rates among family members who are considered to be at high risk of developing disease were low for both diabetes mellitus and celiac disease.
In summary, the work represented in this dissertation contributes to the understanding of the quality of patient-provided data, how interventions affect the quality of these data, and how novel methods can be applied to troves of existing clinical data to generate new knowledge to support research and clinical care.



More About This Work

Academic Units
Biomedical Informatics
Thesis Advisors
Vawdrey, David K.
Degree
Ph.D., Columbia University
Published Here
October 5, 2018