Doctoral Thesis

The emergence of the data science profession

Brandt, Philipp Soeren

This thesis studies the formation of a novel expert role, the data scientist, in order to ask how arcane knowledge becomes publicly salient. This question responds to a two-sided public debate, in which data science is associated with problems such as discriminatory consequences and privacy infringements, but has also become linked with opportunities related to new forms of work. A puzzle also arises, as institutional boundaries have obscured earlier instances of quantitative expertise. Even a broader perspective reveals few expert groups, other than lawyers and doctors, that have gained lay salience on the basis of arcane knowledge.
This empirical puzzle reveals a gap in the literature between two main lines of argument. An institutionalist view has developed ways of understanding expert work with respect to formal features such as licensing, associations and training. A constructivist view identifies limitations in those arguments, highlighting their failure to explain many instances in which arcane knowledge emerges through informal processes, including the integration of lay knowledge through direct collaboration. Consistent with this critique, data nerds largely define their work on an informal basis. Yet they also draw heavily on a formalized stock of knowledge. In order to reconcile the two sides, this thesis proposes viewing data science as an emerging “thought community.” Such a perspective leads to an analytical strategy that scrutinizes the contours that emerge as data nerds define arcane expertise as theirs.
The analysis unfolds across three empirical settings that complement each other. The first setting considers data nerds as they define their expertise in the context of public events in New York City’s technology scene. This part draws on observations beginning in 2012, shortly after data science’s first lay recognition, and covers three years of its early emergence. Two further studies comparatively test whether and in what ways contours of data science’s abstract knowledge are associated with its lay salience. They respectively consider economic and academic settings, which are most relevant to data nerds in part one. Both studies leverage specifically designed quantitative datasets consisting of traces of lay knowledge recognition and arcane knowledge construction.
Together the three studies reveal distinctive contours of data science. The main argument that follows suggests that data science gains lay salience because it relies on informal practices for recombining formal principles of knowledge construction and application, in a collective effort. Data nerds define their thought community on the basis of illustrative and persuasive tactics that combine formal ideas with informal interpretations. This form of improvisation leads data nerds to connect diverse substantive problems through an array of formal representations. They thereby undermine bureaucratic control that otherwise defines tasks in the context where data scientists mostly apply their arcane knowledge. Despite its name and arcane content, moreover, data science differs from scientific principles of knowledge construction.
The main contribution of this thesis is a first detailed and multifaceted analysis of data science. The results address both sides of the public debate. This thesis demonstrates that data science creates new opportunities for work, provided that data nerds are willing to embrace the uncertainty associated with a formally undefined area of problems. The first perspective, which focuses on community identification principles, also makes it possible to identify new forms of work in the ongoing technological transformation that data science is part of. At the same time, the main argument gives reason for concern, precisely because data nerds often operate on an individually anonymous basis, despite their association with formal organizations. It has remained unclear how to address the social consequences of their work because data nerds undermine conventional forms of control and oversight. The findings of this thesis suggest that although data nerds depart from scientific principles for identifying relevant problems, they coordinate those deviant activities through forms of discipline that qualitatively resemble those common in academic fields. Data nerds define their knowledge as a community. It follows that embedding public concerns in data science’s disciplinary forms of coordination, and enhancing those forms, offers the most effective mechanism for preserving the utility of data science applications while limiting their potentially harmful consequences.
Finally, the thesis also makes conceptual and methodological contributions. The focus on thought communities reveals new leverage for understanding social processes that unfold as a combination of informal activities in local settings and institutional dynamics that are largely removed from individual actors. This problem is common to many instances of skilled work. The additional leverage results from an integrated methodological design that relies as much on qualitative observations as on formal analyses. As part of this integration, this thesis has directly encoded phenomenologically salient contours into a quantitative design, effectively leading to an analysis of data science through data science.



More About This Work

Academic Units
Thesis Advisors
Bearman, Peter Shawn
Ph.D., Columbia University
Published Here
September 6, 2016