Ethics of Big Data

Boyd & Crawford’s Questions for Big Data

In Drs. Boyd and Crawford’s “Critical Questions for Big Data,” they propose particular challenges to many common notions of so-called “Big Data.” On their surface, many of these statements seem to strike chords that ring true to people who specialize in smaller datasets, and to be somewhat abrasive to those who work with larger datasets. However, it is not clear how they should be considered from the scientific perspective in ways that are unique relative to prior scientific critique. I will discuss this in a moment.

These statements are as follows:

1. Computational Turn of Knowledge
2. Claims to Objectivity & Accuracy
3. Bigger Data Are Not Always Better Data
4. Data Out of Context Is Meaningless
5. Accessibility Is Not Necessarily Ethical
6. Lack of Accessibility Is Socially Divisive

I will discuss the first three together and the last two together.

The Computational Turn, Objectivity, & Questioning the “Goodness” of Big Data

At first glance, Boyd and Crawford seem to imply a relativist definition of knowledge dependent on the definition of Big Data. However, this is not true, at least not directly. What they do instead is conflate ontology and epistemology. They argue that researchers see objects of science differently because they “think” more computationally with Big Data. I have two objections to this: the first depends on their characterization of Big Data; the second concerns their assumption that the thoughts follow from Big Data, and usually not the other way around.

In terms of their characterization of “Big Data,” they define it as an “interplay” of computational optimization (“Technology”), “Analysis” in the form of more classical social and computer science methodologies, and the “Mythology” of Big Data’s accuracy and intelligence. This is a problematic definition because it presumes that Big Data classifies itself in a way that other data does not. Certainly the computational optimization aspects of Big Data are a primary goal and are reasonable as a defining characteristic, but one cannot argue that Big Data deviates from prior scientific ontologies while also requiring classic methodologies. If classic methodology is a necessary aspect of Big Data, then Big Data necessarily must engage in a discussion of classic scientific ontologies, or it is otherwise distinct in the scientific community only because others do not think in this ontological way.

This brings me to the second point: it seems to me that Big Data brings more questions, not different objects. Perhaps the discussions that follow allow us to consider new objects, but these happen after a period of systematic, analytical discussion that the authors are not certain occurs, owing to a general false belief that others think Big Data is more right than other data analysis. I have personally never met a big data scientist in an academic context who believes dogmatically that their data methodology is better. Everyone has a rationalization, and furthermore, plenty of data scientists arguably do not fit this definition, strictly because of the “Mythology” aspect that the authors assume definitively exists. Perhaps this is true from a communicative perspective of science, but that assumes big data scientists are not critical of their own methodologies. This belief conflates the public view with the scientist’s view, which is an important distinction.

Later on, under “Claims to Objectivity and Accuracy,” Boyd and Crawford specifically argue that Big Data could enforce the status quo. If this is true, then Big Data is not a turn in thinking at all; it is a compounding of what we already think at a faster or denser scale. But even if this is true, the assumption is that we are forgetting to properly vet our methodologies or algorithmic computations.

Just because I, as a casual reader, do not understand a methodology does not mean it was not properly vetted. The same can be said of any method. This is the primary critique at the overlap of any two disciplines, and it seems more a systemic matter of nonspecialists not believing the specialists in a method; it is not a direct critique of the method itself, but of the behavior of the method’s users. The bias here seems to be that most people do not think computationally, and thus struggle to understand Big Data methodologies, and therefore believing in them seems “Mythological.” But I think the same is true of all of academia when we consider the public view.

All one has to do is listen to Dr. Neil deGrasse Tyson speaking about the objective, positivist truths of science to a general public to realize that lots of people believe this even where there are clear counterexamples. It seems this is an argument almost entirely for the understandability of models to the masses, or to those who are not specialists. This is a very normative perspective on what science should do. We are all science-illiterate in some other science. Just because a few of us have doctoral degrees does not mean we have a right to the understandability of someone else’s work.

The Accessibility Tradeoff

In light of the last few sentences, perhaps this should seem obvious: we are not going to understand all data, or even be able to see every data point. In fact, as the article points out, we possibly should not see all data points even if we currently can.

This leads to an interesting issue for consideration. If a lack of accessibility creates inequality in society in various ways, but we should have a lack of accessibility, then we must accept some inequality as necessary. So we cannot simultaneously argue for the inaccessibility of some data and argue for universal equality, given that this data must exist. Put contrapositively, one cannot argue for closing divides on the grounds of digital equality with data and simultaneously argue for data privacy.

The other alternative is saying that certain data should never exist in the first place. Perhaps this is true, but I would argue this is practically an impossible ethical proposal, especially considering that large ethical violations of data often involve the masses willingly creating the data themselves (whether out of ignorance of what they are creating or with reasonable knowledge of it). For example, perhaps we should not be using at-home DNA tests, because these produce a wealth of data that can be used much as it was in the Cambridge Analytica scandal, and by policing agencies; yet people actively pay to put their own DNA data on a server somewhere.

This certainly creates an interesting discussion of how to deal with data, and certainly there are ways to make getting certain kinds of data harder. However, given that the data already exists, this becomes a much more difficult policy to actualize without heavy policing, which itself creates more possible points of access to data, like backdoors to Apple phones. There are certainly interesting policy questions here, but this is partially off topic, since we are speaking of the accuracy and proper vetting of science, not of data or access to data.

Instead, I would propose a pragmatic consideration of the more immediate tradeoff between mass accessibility and privacy of data as an ethical dilemma of the use of Big Data.

Conclusion

In total, these arguments do not seem unique to Big Data. These are critical perspectives that could apply to any kind of data without real changes to their core claims. Accessibility and privacy are necessary considerations for every field of science, ones that I think fields largely decide upon, tacitly or explicitly, in conferences, on listservs, or in publication, for example. It is good that we keep these notions in mind, but I do not think they specialize to Big Data, and I do not entirely accept the claims made about computation and Big Data methodology. I am more likely to be apologetic toward data science because I think it is the lack of accessibility that leads non-computational scholars to feel it is not properly vetted.

Perhaps this is more true in industry, or more visible to the public, because mass communication pushes the fear of a lack of privacy, a lack of equality, a lack of interpretable justification, a strong profit motive relative to a weak accuracy motive, and other problems; but I would posit that, relative to non-Big-Data fields, these problems exist there also. In truth-interested institutions such as universities, however, it is not clear that this occurs more often in Big Data than under other methodological perspectives, so applying it specifically to Big Data as described here seems more like children pointing fingers at a monster under the bed.

Reference

boyd, d., & Crawford, K. (2012). Critical questions for big data. Information, Communication & Society, 15, 662–679.

