Doctoral Thesis

Privacy-Aware Data Analysis: Recent Developments for Statistics and Machine Learning

Lut, Yuliia

Due to technological development, personal data has become easier to collect, store, and analyze. Companies can collect detailed browsing behavior, health-related data from smartphones and smartwatches, and voice and movement recordings from smart home devices. Analyzing such data can bring numerous advantages to society and further the development of science and technology. However, given the often sensitive nature of the collected data, people have become increasingly concerned about the data they share and how they interact with new technology.

These concerns have motivated companies and public institutions to provide services and products with privacy guarantees. Many institutions and research communities have therefore adopted the notion of differential privacy, which has emerged as a powerful technique for enabling data analysis while preventing information leakage about individuals. In simple terms, differential privacy allows us to use and analyze sensitive data while maintaining privacy guarantees for every individual data point. As a result, numerous private algorithmic tools have been developed for various applications. However, many open questions and research areas around differential privacy in machine learning, statistics, and data analysis remain unexplored in the existing literature.
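As an illustration of the idea (a textbook primitive, not a mechanism from this thesis), the classic Laplace mechanism releases a numeric query answer with epsilon-differential privacy by adding noise scaled to the query's sensitivity; a minimal Python sketch:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release true_value with epsilon-differential privacy by adding
    Laplace noise with scale sensitivity / epsilon."""
    rng = rng or np.random.default_rng()
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: privately release a count of individuals. A counting query
# changes by at most 1 when one record is added or removed, so its
# sensitivity is 1.
private_count = laplace_mechanism(true_value=1042, sensitivity=1.0, epsilon=0.5)
```

Smaller epsilon means stronger privacy but noisier answers; the released value is unbiased for the true count.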

In Chapter 1, we provide a brief discussion of the problems and the main contributions that are presented in this thesis. Additionally, we briefly recap the notion of differential privacy with some useful results and algorithms.

In Chapter 2, we study the problem of differentially private change-point detection for unknown distributions. The change-point detection problem seeks to identify distributional changes in streams of data. Non-private tools for change-point detection have been widely applied in several settings. However, in certain applications, such as identifying disease outbreaks based on hospital records or IoT devices detecting home activity, the collected data is highly sensitive, which motivates the study of privacy-preserving tools. Much of the prior work on change-point detection---including the only private algorithms for this problem---requires complete knowledge of the pre-change and post-change distributions. However, this assumption is not realistic for many practical applications of interest. In this chapter, we present differentially private algorithms for solving the change-point problem when the data distributions are unknown to the analyst. Additionally, we study the case when data may be sampled from distributions that change smoothly over time rather than fixed pre-change and post-change distributions. Furthermore, our algorithms can be applied to detect changes in linear trends of such data streams. Finally, we also provide a computational study to empirically validate the performance of our algorithms.
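The chapter's algorithms are developed in the thesis itself; as a rough illustration of how private change-point detection can work without knowing the distributions, one can combine a nonparametric mean-shift statistic with a report-noisy-max selection. A minimal sketch (the noise scale here is a crude placeholder assuming data in [0, 1], not the thesis's sensitivity analysis):

```python
import numpy as np

def private_changepoint(stream, epsilon, rng=None):
    """Illustrative report-noisy-max over a mean-shift statistic.
    Assumes each data point lies in [0, 1]; the scale 2 / epsilon is a
    placeholder, not a tight sensitivity bound."""
    rng = rng or np.random.default_rng()
    n = len(stream)
    # Candidate statistic at each split k: gap between pre- and post-means.
    stats = np.array([
        abs(np.mean(stream[k:]) - np.mean(stream[:k]))
        for k in range(1, n)
    ])
    # Add Laplace noise to every candidate and report the noisy argmax.
    noisy = stats + rng.laplace(scale=2.0 / epsilon, size=n - 1)
    return int(np.argmax(noisy)) + 1  # declared change-point index
```

With a generous privacy budget the noise is small and the estimate concentrates near the true change point; shrinking epsilon trades accuracy for stronger privacy.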

In Chapter 3, we study the problem of learning from imbalanced datasets, in which the classes are not equally represented, through the lens of differential privacy. A widely used method to address imbalanced data is resampling from the minority class instances. However, when confidential or sensitive attributes are present, data replication can lead to privacy leakage, disproportionately affecting the minority class. This challenge motivates the study of privacy-preserving pre-processing techniques for imbalanced learning. In this work, we present a differentially private synthetic minority oversampling technique (DP-SMOTE), which is based on a widely used non-private oversampling method known as SMOTE. Our algorithm generates differentially private synthetic data from the minority class. We demonstrate the impact of our pre-processing technique on the performance and privacy leakage of various classification methods in a detailed computational study.
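For background, the non-private SMOTE step that DP-SMOTE privatizes generates a synthetic minority point by interpolating between a real minority point and one of its nearest minority-class neighbors. A minimal sketch of that non-private step (the private mechanism itself is described in the chapter):

```python
import numpy as np

def smote_sample(minority, k=5, rng=None):
    """One (non-private) SMOTE step: interpolate between a random
    minority point and one of its k nearest minority neighbors.
    `minority` is an (n, d) array of minority-class feature vectors."""
    rng = rng or np.random.default_rng()
    i = rng.integers(len(minority))
    x = minority[i]
    dists = np.linalg.norm(minority - x, axis=1)
    neighbors = np.argsort(dists)[1:k + 1]  # skip the point itself
    j = rng.choice(neighbors)
    lam = rng.random()                      # interpolation weight in [0, 1)
    return x + lam * (minority[j] - x)
```

Because the synthetic point is a convex combination of two real minority points, repeated replication of sensitive records is exactly the leakage risk that motivates a differentially private variant.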

In Chapter 4, we focus on the analysis of sensitive data generated from online internet activity. Accurately analyzing and modeling online browsing behavior plays a key role in understanding how users interact with technology. Toward this goal, in this chapter we present an up-to-date measurement study of online browsing behavior. We study both self-reported and observational browsing data and analyze what underlying features can be learned from statistical analysis of this potentially sensitive data. To this end, we empirically address the following questions: (1) Do structural patterns of browsing differ across demographic groups and types of web use? (2) Do people have accurate perceptions of their behavior online? (3) Do people change their browsing behavior if they are aware of being observed?

In response to these questions, we find little difference across most demographic groups and website categories, suggesting that these features cannot be inferred solely from clickstream data. We find that users significantly overestimate the time they spend online but have relatively accurate perceptions of how they spend their time online. We find no significant changes in behavior throughout the study, which may indicate either that observation had no effect on behavior or that users remained consciously aware of being observed throughout the study.


  • Lut_columbia_0054D_17546.pdf (application/pdf, 1.31 MB)

More About This Work

Academic Units
Industrial Engineering and Operations Research
Thesis Advisors
Cummings, Rachel A.D.
Ph.D., Columbia University
Published Here
October 19, 2022