Theses Doctoral

Data Quality Assessment for the Secondary Use of Person-Generated Wearable Device Data: Assessing Self-Tracking Data for Research Purposes

Cho, Sylvia

The Quantified Self movement has led to increased routine use of consumer wearables, generating large amounts of person-generated wearable device data. This gives researchers an opportunity to conduct research with large-scale person-generated wearable device data without having to collect data in a costly and time-consuming way. However, wearable device data have known challenges, such as missing or inaccurate data, which raise the need to assess data quality before conducting research. Currently, there is a lack of in-depth understanding of the data quality challenges of using person-generated wearable device data for research purposes, and of how data quality assessment should be conducted. Data quality assessment can be especially burdensome for those without domain knowledge of a specific data type, which may be the case for emerging biomedical data sources.

The goal of this dissertation is to advance knowledge of the data quality challenges and assessment of person-generated wearable device data, and to facilitate data quality assessment for those without domain knowledge of this emerging data type. The dissertation consists of two aims: (1) identifying data quality dimensions important for assessing the quality of person-generated wearable device data for research purposes, and (2) designing and evaluating an interactive data quality characterization tool that supports researchers in assessing the fitness-for-use of fitness tracker data. In the first aim, a multi-method approach was taken, combining a literature review, a survey, and focus group discussion sessions.

We found that intrinsic data quality dimensions used for electronic health record data, such as conformance, completeness, and plausibility, are also applicable to person-generated wearable device data. In addition, contextual/fitness-for-use dimensions such as breadth completeness, density completeness, and temporal data granularity were identified, because our focus was on assessing data quality for research purposes. In the second aim, we followed an iterative design process, from understanding informational needs to designing a prototype and evaluating the usability of the final version of the tool. The tool allows users to customize the definition of data completeness (fitness-for-use measures) and provides a data summary of the cohort that meets that definition. We found that an interactive tool that incorporates fitness-for-use measures and allows customization of data completeness can support fitness-for-use assessment more accurately and in less time than a tool that only presents information on intrinsic data quality measures.



More About This Work

Academic Units
Biomedical Informatics
Thesis Advisors
Natarajan, Karthik
Degree
Ph.D., Columbia University
Published Here
July 28, 2021