Predictive Privacy: A Framework for Quantifying Harm
Data-driven, omnipresent apps that track all aspects of our day-to-day lives send our information to corporations, third-party data brokers, and, often, the government. It feels as though this much invasion of our privacy should be illegal, but given the lack of general privacy laws in the US, it is not. Still, the discomfort of being constantly watched does not go away; indeed, as Tene and Polonetsky have noted, “creepy” is the word that best describes the unease surrounding consumer data usage and the thin protections provided by privacy law. Underlying this unease is the absence of any uniform understanding of privacy harm, and that absence has consequences. Without tangible, concrete harm, courts and regulatory agencies often do nothing. As Solove et al. have established, a company is rarely held to its privacy policy the way it would be to a legal contract, because plaintiffs fail to establish promissory estoppel and/or damages. In Spokeo v. Robins, the Supreme Court held that Robins’s complaint of harm lacked “concreteness.” In TransUnion LLC v. Ramirez, the Court awarded damages only to those who could prove that harmful information about them had been propagated, not to those who had experienced merely “an injury in law,” as discussed by Husi and Robbennolt.

How privacy harm is defined matters. The Federal Trade Commission Act, for example, gives the FTC jurisdiction only over behavior that is “likely to cause substantial injury to consumers.” But what is “injury,” let alone “substantial injury”? Citron has noted that “[f]or most courts, privacy and data security harms are too speculative and hypothetical.” Solove and Hartzog, quoting FTC documents, note that “[m]onetary, health, and safety risks are common injuries considered ‘substantial,’ but trivial, speculative, emotional, and ‘other more subjective types of harm’ are usually not considered substantial for unfairness purposes.”

Beyond the difficulty of defining privacy harm in general, today we must confront the rise of machine learning, which can identify, or purport to identify, information about an individual that has never been directly observed or collected. In Sterling v. Borough of Minersville, brought after a young man committed suicide under the threat of being outed as gay, the court held that people have a right to privacy in their sexuality. Yet with today’s predictive technology, researchers at MIT have shown that sexual identity can be deduced from Facebook “friend” patterns. Any theory of harm must account for such social and psychological impacts as well.

To address these issues, we have devised a scheme called Predictive Privacy, an experimental technique for creating a standardized system that quantifies degrees of loss and harm from the disclosure of private information. By providing an objective measure of privacy injury, our approach gives regulators and courts a concrete basis for ascertaining whether there is, in fact, actual injury and for adjudicating claims, with the ultimate result of more informed and effective privacy protection policy.
We built a database by combining existing synthetic data from the Pew Research Center with differential privacy techniques, generating several sensitive columns of synthetic data per person whose statistical distributions resemble those seen in today’s American population. We then used a machine learning algorithm to cluster the entries in our database, and recruited workers via Prolific.com to score the harm from disclosure of information about randomly selected members of each cluster. The scorers rated the harm from various scenarios in two experiments: in Part I, participants scored harm for data presented as 100% accurate; in Part II, they scored data whose stated confidence varied between 50% and 75%. The goal was to measure the impact of data accuracy on harm perception and to see whether perceptions changed across scenarios. Because the dataset contains synthetic people with different categories of sensitive attributes, these questions should yield reasonably accurate insight into how people view harm and risk across groups. Finally, we used the worker responses to train a supervised machine-learning algorithm that predicts each person’s harm score in the aforementioned scenarios. The net result will be to make informational privacy harm as concrete as, say, monetary harm.
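To make the pipeline concrete, the following is a minimal sketch of the first stage: privatizing one sensitive column with the Laplace mechanism and resampling from the noisy marginal. The function name `dp_marginal_sampler`, the toy seed column, and the choice of per-column Laplace noise are illustrative assumptions; the thesis abstract does not commit to this particular mechanism.

```python
import numpy as np

def dp_marginal_sampler(values, epsilon, n_synthetic, rng=None):
    """Draw a synthetic column whose distribution follows the seed
    column's marginal, privatized with the Laplace mechanism.

    values      -- seed column of categorical codes (1-D array)
    epsilon     -- privacy budget spent on this column
    n_synthetic -- number of synthetic records to draw
    """
    rng = rng if rng is not None else np.random.default_rng()
    categories, counts = np.unique(values, return_counts=True)
    # Each person lands in exactly one histogram bin, so the L1
    # sensitivity is 1; Laplace noise with scale 1/epsilon therefore
    # gives epsilon-differential privacy for this marginal.
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    noisy = np.clip(noisy, 0, None)   # negative counts are meaningless
    probs = noisy / noisy.sum()       # renormalize into a distribution
    return rng.choice(categories, size=n_synthetic, p=probs)

# Toy usage: a rare sensitive attribute (3% prevalence) resampled at scale.
seed_column = np.array([0] * 970 + [1] * 30)
synthetic_column = dp_marginal_sampler(seed_column, epsilon=1.0, n_synthetic=5000)
```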
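The clustering and representative-sampling step could look like the sketch below. The abstract does not name the clustering algorithm, so k-means over one-hot-encoded attributes, along with the helper names `cluster_records` and `sample_for_scoring`, are assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder

def cluster_records(df, sensitive_cols, n_clusters=8, seed=0):
    """Group synthetic people by their sensitive attributes."""
    # sparse_output requires scikit-learn >= 1.2 (older versions use sparse=).
    enc = OneHotEncoder(sparse_output=False)
    X = enc.fit_transform(df[sensitive_cols])
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X)
    return df.assign(cluster=labels)

def sample_for_scoring(df, per_cluster=3, seed=0):
    """Pick a few random members of each cluster to show to the
    Prolific scorers, as the abstract describes."""
    return (df.groupby("cluster", group_keys=False)
              .apply(lambda g: g.sample(min(per_cluster, len(g)),
                                        random_state=seed)))
```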
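Finally, one plausible shape for the supervised harm-score model: a regressor trained on the workers’ scores, with the stated confidence of the disclosed data (1.0 for Part I, 0.5 to 0.75 for Part II) included as a feature so one model covers both experiments. The gradient-boosted regressor and the feature layout are assumptions; the abstract specifies only that the model is supervised.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def fit_harm_model(X, y, seed=0):
    """X: one row per (synthetic person, scenario) pair -- one-hot
    sensitive attributes plus the stated confidence of the data.
    y: mean harm score the Prolific workers assigned to that pair."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              random_state=seed)
    model = GradientBoostingRegressor(random_state=seed).fit(X_tr, y_tr)
    # Held-out error gives a rough sense of how predictable harm scores are.
    print(f"held-out MAE: {mean_absolute_error(y_te, model.predict(X_te)):.2f}")
    return model
```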
Files
- Thesis (1).pdf (application/pdf, 569 KB)
More About This Work
- Academic Units: Computer Science
- Thesis Advisors: Bellovin, Steven Michael
- Degree: M.S., Columbia University
- Published Here: July 9, 2025
Related Items
- Supplemented by: Predictive Privacy