Saturday 21 March 2015

Blog post by Bettina Berendt: Big Capta, Bad Science? Hype and deconstruction

In this blog post, Bettina Berendt reviews some of the arguments made in Mayer-Schönberger and Cukier’s Big Data: A Revolution That Will Transform How We Live, Work, and Think (2013) and Rob Kitchin’s The Data Revolution (2014).

Big Capta, Bad Science? Hype and deconstruction

That there is much hype about “Big Data” (BD) is nearly a truism by now, but (a) what exactly does this hype consist of, (b) what is problematic about the hype arguments, and (c) what can we – as scientists – do about it?

In a quest to at least begin to answer these Big Questions, I have compared two recent books: Mayer-Schönberger and Cukier’s Big Data and Kitchin’s The Data Revolution, and used the insights gained to critically deepen examples and deconstruct text passages from the two. The result is described in detail in the essay Big Capta, Bad Science? and summarised here.

Big Data is, as many others have pointed out earlier, an accessible and entertaining read that touches on many of the currently discussed “promises” and “risks” of BD. However, it reiterates, in a rather uncritical way, a number of assumptions that on closer inspection appear as key pillars of the BD hype:

1. Data are objective and provide us with direct access to the true nature of things.
2. BD are exhaustive (and therefore give us the full truth).
3. Data, and in particular Big Data, can therefore “speak for themselves”; the correlations they reveal can help society overcome “its obsession for causality”.

The Data Revolution exposes central reasoning errors behind these assumptions, which is essential for constructing a more critical and ultimately more fruitful approach to BD:

1. BD are, first of all, data. Etymologically, the word suggests that data are given by the phenomena that are measured. However, in general use, “data refer to those elements that are taken [abstracted from phenomena]: extracted through observations, computations, experiments, and record keeping”, “selected from nature by the scientist in accordance with his [sic] purpose” (Kitchin, p.2). So “capta” rather than “data” would actually be a better name, in that it stresses the taking as opposed to an assumed but impossible giving.

2. The exhaustivity assumption is shown to be over-simplified if one considers that even the typical “big” datasets (all Google queries, all Facebook users’ posts) are still representations (of what people want to know, of what people think) and samples (because they are limited to the users of these platforms, plus often filtered by users’ access restrictions and/or platforms’ API policies).

3. Because they were “taken”, data, measurements, and analytics cannot “speak for themselves”, cannot overcome the subjectivity of qualitative methods and the central involvement of humans in sense-making. Kitchin argues that rather than naïvely assume such objectivity in data, we should always critically question the origins of data, the purposes for which they were collected and processed, and the methods by which this was done. (The non-neutrality of processing is further expanded on in a chapter on Data Infrastructures and Data Brokers.) He also points out that all such questioning and all interpretations of data and analysis results again arise from a human speaking about the data or inferences made from them and on behalf of the interpreter’s agenda. Part of this background and agenda are the manifold assumptions about causality that are integral to any sense-making. Any simplistic “correlation can replace causation” appeals to the assumption that “the data can speak for themselves” and at most serves to obfuscate the assumptions and settings that created the data and their models in the first place.

These are just three (even if key) insights of many more in this excellent book; for more, see the full review text.

Why does this matter?

Much of the hype around BD involves “sciencey-sounding” but in fact misleading or even wrong accounts of phenomena that the public cares about. Ben Goldacre has termed this Bad Science, and he demonstrates in a host of case studies, most from the medical domain, how such poor reasoning, resonating through media and public discourse, has real and dangerous consequences. It is to be feared that the same can and will happen in the BD field.

What can we do? One answer.

I want to advocate the deconstruction of “Bad Big Data Science” arguments as one way of countering this danger. This is a truly interdisciplinary exercise, which needs to draw on computer science as much as on sociology, politics, philosophy and many other sciences. At the same time, I believe that an interlinking of concrete examples of BD uses with concrete examples of BD criticism is needed – not only for encouraging more critical perspectives on uses of BD but also for transforming the critical programme itself (whose abstract notions such as “discourse” are hard to grasp for many readers) into actionable recommendations.

In the second part of the essay, I present examples of such enrichment (of examples of BD use) and of such deconstruction (of examples of BD arguments). I draw on and extend Kitchin’s conceptual analysis as a “toolbox” for such deconstruction.

First, three short passages expand on examples from The Data Revolution:

1. Data concepts vs. data measurement: How DNA identification can tell different stories depending on progress in the natural sciences
2. When recourse to physics can be more productive than a focus on data: Privacy by design in rubbish collection
3. Data collection discourses: How voluntary are social-media data?

Second, I deconstruct a passage about predictive policing from Big Data in detail:

“The promise of big data is that we do what we’ve been doing all along – profiling – but make it better, less discriminatory, and more individualized. That sounds acceptable if the aim is simply to prevent unwanted actions. But it becomes very dangerous if we use big-data predictions to decide whether somebody is culpable and ought to be punished for behaviour that has not yet happened.” 

While this passage does talk about opportunities and risks of using Big Data analytics, I argue that its analysis is flawed in subtle but important ways. The deconstruction steps draw on different, example-specific fields; here: computer science, law and sociology/political science.

1. Misconceptions about profiling and about how machine-learned prediction works
2. Misconceptions about the use of data under the rule of law
3. The need for an analysis of politics and power

Of course, not only the social sciences are needed for us to question BD arguments – the humanities and behavioural sciences have an equally large role to play. Their transformation by and into “Digital Humanities” bears both risks of importing the problems of Bad Big Data Science and opportunities to improve research methods while they evolve. Kitchin discusses this tension in Chapter 8 of The Data Revolution and also in Big Data, new epistemologies and paradigm shifts. In work partially building on the essay described here (Is it Research or is it Spying?), Rockwell, Büchler and I describe forms of dialogue and collaboration between researchers and teachers of AI, Digital Humanities, and other knowledge sciences.

In sum, I hope to have convinced you that both doing and understanding “big data” and “data analytics” require a deep, interdisciplinary, and critical analysis, and that it is our responsibility as scientists working with data to undertake it. I also hope you will find The Data Revolution to be a useful recommendation for your own, and your students’, reading lists. Last but not least, I look forward to your comments and questions!

A video abstract by Rob Kitchin on his BD&S article, 'Big Data, New Epistemologies and Paradigm Shifts', is available here.

About the author

Bettina Berendt is a Professor of Computer Science in the Declarative Languages and Artificial Intelligence group at KU Leuven, Belgium. She works on Web, text and social and semantic mining; privacy and (non)-discrimination and how data mining can hinder or help these goals; user issues; and teaching of and for privacy.