Given the increased use of machine learning models for high-stakes decision making, a growing body of work aims to understand and intervene in the social values embedded within machine learning models: for example, work that asks whether such models exacerbate existing racial, gender, or other disparities between groups, and that intervenes to mitigate such disparities (Buolamwini and Gebru, 2018; Chen et al., 2019; Chouldechova, 2017; Friedler et al., 2019; Celis et al., 2019; Hardt et al., 2016).
We ask a complementary question: how are societal, political, environmental, and other values embedded within the machine learning discipline? In particular, how do values influence what the discipline focuses on and the way it develops? It is important to think about these questions because when undesirable values are at play at the level of the discipline, intervening on particular models will not suffice to address the problem. Rather, the correct intervention must also be posed at the level of the discipline.
The focus of this paper is how values shape the development of the discipline over time. We argue that major disciplinary shifts within the machine learning discipline are not (and cannot be) "objective" processes; instead, they are value-laden. This is a consequential distinction. When disciplinary shifts are incorrectly seen as objective progress, the values at play remain implicit, and because they are hidden they are simply accepted as defaults. In order to make any intentional choices about values, we must first recognize that they exist and are at work.
Our argument proceeds in three parts.
Section 2: First, we give a conceptual, descriptive framework for progress in machine learning. We argue that "model-types" within machine learning, e.g. deep learning, graphical models, and support vector machines, guide and organize research activities in machine learning. We point out similarities and differences between model-types and traditional concepts from philosophy of science: Kuhn's paradigms and Lakatos's research programmes.
Section 3: Second, we argue that the rise of a model-type is self-reinforcing: it comes hand in hand with the rise of corresponding criteria for evaluating model-types. We use the recent rise of deep learning as a case study to illustrate this point. We visit a commonly-cited cause for the rise of deep learning: its success in the 2012 ImageNet challenge. However, we argue that ImageNet not only triggered a shift to deep learning, but also a shift toward evaluating models in the environments in which deep learning performs best, namely, compute-rich and data-rich environments.
Section 4: Third, we argue that the criteria used to evaluate model-types encode loaded social and political values. We again illustrate the point using deep learning as a case study. Deep learning performs better when evaluated in compute-rich and data-rich environments, and typically better when evaluated on predictive accuracy (as compared to evaluations centered on, e.g., robustness or interpretability). This kind of evaluation furthers certain values, such as centralization of power, while hindering others, such as environmental sustainability and privacy. Therefore, the rise of deep learning is not straightforwardly "objective" but, rather, a value-laden process.
We have written our paper to be of interest to both philosophers of science and machine learning researchers, and our contributions are slightly different for each group. For both groups, we give a more nuanced account of disciplinary shifts in machine learning, and highlight ways in which values shape the discipline. For machine learning researchers, we hope the explicit exposition helps to bring into awareness, and shift, some of the values present in the discipline. In addition, for philosophers of science, our descriptive account of machine learning is a conceptual contribution on its own. First, since to our knowledge this is the first work analyzing machine learning as a discipline from a philosophy of science perspective,[1] we hope that our framework provides a starting point for others to further analyze the growing discipline of machine learning. Second, considering the case of machine learning can be helpful in reexamining traditional concepts in philosophy of science. In particular, our analysis suggests that it may be possible to reproduce traditional concepts and puzzles from the philosophy of science without the usual focus on theories and hypotheses.
[1] Related work in philosophy of science includes work on how philosophy of science and machine learning can illuminate one another on topics like inductivism, the logic of discovery, and scientific realism. For examples, see (Frické, 2015; Kitchin, 2014; Thagard, 1990; Korb, 2004; Williamson, 2004, 2009; Gillies, 1996; Bensusan, 2000; Bergadano, 1993; Corfield, 2010).
2. Model types as organizing and guiding research
Machine learning models can be grouped into different types based on the ways that they extract patterns from data. Deep learning models (neural networks), graphical models, decision trees, and support vector machines are all examples of model-types. We argue that model-types are more than just a technical apparatus: they guide the research agendas of machine learning practitioners who are committed to them, and when many people are committed to the same model-type, the discipline and the resources available to practitioners change.
We start this section by explaining what it means to be committed to a model type. We then explain how this commitment influences research agendas and the discipline. Last, we point out analogies and differences between the function of model types in machine learning and paradigms and research programmes in natural sciences. These will later be used to argue that comparison between model types is not an objective process.
2.1. Commitment to model-types
Researchers often associate their work with a specific model-type. For example, they sometimes identify themselves as working specifically on "deep learning" or "graphical models". This reflects the fact that researchers can be committed to a model-type: they have a favored model-type and focus on improving it as a means to making research progress. Commitments to model-types also manifest in the structuring of machine learning workshops. Since workshops in machine learning are meant to provide smaller venues focused on making progress, centering a workshop on a model-type indicates that working on that model-type can be seen as a way of making progress. For example, this year's International Conference on Machine Learning (ICML) included at least seven workshops focused specifically on deep learning.[2]
[2] The deep learning workshops at ICML 2019 were: "Theoretical Physics for Deep Learning"; "Uncertainty and Robustness in Deep Learning"; "Synthetic Realities: Deep Learning for Detecting AudioVisual Fakes"; "Understanding and Improving Generalization in Deep Learning"; "Identifying and Understanding Deep Learning Phenomena"; "On-Device Machine Learning & Compact Deep Neural Network Representations"; and "Invertible Neural Networks and Normalizing Flows".
Researchers who are committed to a certain model-type think of that model-type as generalizable: the success that the model-type has had on some problems is taken as an indication that, if we put more work into it, it will do well on many other problems. A commitment to a model-type is therefore fueled by exemplars - cases of great success for the model-type which are taken to be strong evidence of generalizability. For example, deep learning's success on a particular computer vision challenge, called ImageNet, was taken as evidence of its potential future success on other types of problems (we discuss ImageNet in greater detail in Section 3).
A commitment to a model-type has downstream effects for the research that is done. In particular, we argue that a commitment to a model-type guides the selection of problems, constrains the creation of solutions, and promotes prerequisites around supporting tools and technologies.
2.2. How a commitment to a model-type influences research
2.2.1. Problem selection
Those who are committed to a model-type work to increase its precision and expand its scope.
Different model-types are naturally good at different things, and those who are committed to a model-type work to improve its performance in the areas in which it does less well. For example, deep learning models have done very well in the field of computer vision and, more recently, in other fields such as natural language processing and reinforcement learning. However, deep learning has done less well in areas involving causal, logical, or probabilistic reasoning. Other model-types, such as functional causal models, decision rules, or probabilistic graphical models (PGMs), are often better at these problems. But researchers are now working to improve the ability of deep learning models to represent causal (Lopez-Paz et al., 2017), logical (Cai et al., 2017), or probabilistic knowledge (Kingma and Welling, 2014; Rezende et al., 2014).
In this way, commitments to model-types skew the selection of problems by virtue of each model-type's varying strengths and weaknesses.
2.2.2. Constraining the search for solutions
Model-types also help constrain the solutions considered to problems. When a researcher is committed to a model-type, she believes that working on the model-type will be the most fruitful path to progress. Thus, she primarily pursues improvements to her committed model-type, rather than pursuing improvements to other model-types or devising a new model-type altogether.
Improvements to a model-type are often made by revising hyperparameters, which are the parameters that must be specified by the researcher rather than learned from the data. In deep learning, hyperparameters include the mathematical transformations between layers, how many units are in each layer, how many layers there are, and so on. For example, an important hyperparameter innovation in deep learning was the change from using a sigmoid or tanh function for the non-linear transformation between layers to using a ReLU function, a change that is often considered to have been essential to the revival of deep learning (Krizhevsky et al., 2012).
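To make the activation-function example concrete, here is a small sketch (using NumPy; the specific input value is arbitrary) of one commonly-cited reason ReLU eased the training of deep networks: for large pre-activations, the gradients of sigmoid and tanh vanish, while ReLU's gradient stays at 1, so error signals can propagate through many layers without shrinking toward zero.

```python
import numpy as np

# Gradients of three common non-linear transformations between layers.
def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

def relu_grad(x):
    return (np.asarray(x) > 0).astype(float)

# At a large pre-activation, the sigmoid and tanh gradients have all but
# vanished, while the ReLU gradient is still exactly 1.
x = 10.0
print(sigmoid_grad(x))   # ~4.5e-05
print(tanh_grad(x))      # ~8.2e-09
print(relu_grad(x))      # 1.0
```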
2.2.3. Promoting prerequisites
The success of a model-type depends on associated prerequisites: supporting tools, technologies, and information. A commitment to a model-type shapes the discipline by promoting and reinforcing such prerequisites. Consider deep learning as an example.
One prerequisite for the success of deep learning is the amount of data available. Machine learning algorithms are data-driven, and are supposed to get better at a given task the more data they are provided. However, there are different ways in which a model's performance may improve with more data. Figure 1 depicts two such ways: one model (curve A) has better performance with smaller amounts of data, and the other model (curve B) has better performance with larger amounts of data. For any model, its "data efficiency" - its performance as a function of the amount of data - will vary depending on the application. But generally, different model-types tend to achieve success with different levels of data. Deep learning typically requires large data sets to perform well (Hestness et al., 2017; Sun et al., 2017). Other methods - support vector machines, linear models, probabilistic graphical models, etc. - can often do better on smaller data sets. Thus, deep learning has a larger data prerequisite than other model-types.
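The two regimes in Figure 1 can be sketched with hypothetical learning curves; the saturating-exponential form and all constants below are illustrative assumptions, not fitted to any real model-type.

```python
import numpy as np

# Hypothetical learning curves: accuracy as a function of dataset size n.
def curve_a(n):
    # Strong with little data, but plateaus early.
    return 0.80 * (1.0 - np.exp(-n / 1e3))

def curve_b(n):
    # Weak with little data, but keeps improving at scale.
    return 0.95 * (1.0 - np.exp(-n / 1e5))

small, large = 1e3, 1e6
print(curve_a(small) > curve_b(small))  # True: A is better with little data
print(curve_b(large) > curve_a(large))  # True: B is better with lots of data
```

The point of the sketch is only that "which model is better?" has no single answer: it depends on where along the data axis the comparison is made.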
Another prerequisite for deep learning is extensive compute power: the amount of computation that a computer can perform within a given time frame. All model-types benefit from running on computers that have a lot of compute power, but some model-types benefit more than others. A graph similar to Figure 1 can also be drawn with compute power on the x-axis. As in the data case, deep learning would follow the pattern of curve B: unlike other model-types, it performs better given access to large amounts of compute power. Thus, the availability of a lot of compute power may be a prerequisite for deep learning, but not for other model-types.
Third, model-types require specialized technologies. For example, deep learning requires graphics processing units (GPUs), a specialized type of hardware. GPUs are useful for deep learning because they speed up matrix multiplication, a computation that is essential to deep learning models but not necessarily to other model-types. Model-types can also benefit from specialized software that makes it easier for researchers to create models of that type. For example, deep learning benefited from the creation of software packages that automatically perform backpropagation, an algorithm for computing derivatives that is primarily used to train deep learning models. Other model-types can also benefit from such software packages, but to a much lesser degree.
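As a rough illustration of what such software packages automate, the following is a minimal backpropagation sketch for a one-hidden-layer network, checked against a finite-difference approximation. The toy data and network sizes are arbitrary; packages such as PyTorch and TensorFlow automate exactly this kind of derivative computation.

```python
import numpy as np

# Toy data and a one-hidden-layer network with ReLU.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))          # 8 examples, 3 features
y = rng.normal(size=(8, 1))
W1 = 0.1 * rng.normal(size=(3, 4))   # input-to-hidden weights
W2 = 0.1 * rng.normal(size=(4, 1))   # hidden-to-output weights

def loss(W1, W2):
    h = np.maximum(0.0, X @ W1)      # hidden layer with ReLU
    return float(np.mean((h @ W2 - y) ** 2))

# Forward pass, saving intermediates, then backward pass (chain rule).
z = X @ W1
h = np.maximum(0.0, z)
d_pred = 2.0 * (h @ W2 - y) / y.size           # d(loss)/d(prediction)
dW2 = h.T @ d_pred                             # gradient w.r.t. W2
dW1 = X.T @ ((d_pred @ W2.T) * (z > 0))        # gradient w.r.t. W1

# Sanity check: one entry of dW1 against a central finite difference.
eps = 1e-5
W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
numeric = (loss(W1p, W2) - loss(W1m, W2)) / (2 * eps)
print(abs(numeric - dW1[0, 0]) < 1e-6)  # True
```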
A commitment to a model-type involves promoting the prerequisites that model-type needs, for example building the right tools, collecting enough data, or buying the necessary compute resources. When many people are committed to the same model-type, the resources available to the discipline at large change, reinforcing the predominance of that model-type. For example, as the popularity of deep learning increased, industry labs like Facebook and Google have put effort into developing easy-to-use supporting software, such as PyTorch (Paszke et al., 2017) and TensorFlow (Abadi et al., 2016), which have greatly reduced the barrier to entry for creating deep learning models. Furthermore, many industry labs have also designed hardware that accelerates deep learning by specializing components for the computations deep learning models use (Sze et al., 2017). For example, Nvidia frames one of its recent GPUs as "Tesla P100: The Fastest Accelerator for Training Deep Neural Networks" (NVIDIA, 2016).
The mass commitment to deep learning has promoted the availability of data, compute power, and the specialized technologies required for deep learning.
2.3. Similarities and differences between model-types and traditional concepts
The functionality of model-types in machine learning is similar to the functionality of Kuhnian paradigms and Lakatosian research programmes in science. However, it is also different in important ways.
We argued that researchers who are committed to a model-type work to increase the scope and precision of that model-type. This functionality is analogous to the way paradigms[3] function in the natural sciences according to Kuhn (1962). Kuhn argues that most of the time scientists are concerned with increasing the scope and precision of their paradigm. Theoretical and experimental scientists are engaged in activities such as applying analogous solutions to more and more problems, getting better agreement between observation and theory, elaborating the theory to make it easier to compare to observations, and so on (e.g. Kuhn, 1962, pp. 25-30).
[3] In the narrowest sense, a Kuhnian paradigm is an exemplar: a widely accepted solution to a problem. See Hoyningen-Huene (1993) for more discussion of the different senses of "paradigm".
We have also argued that researchers who are committed to a model-type look for ways to overcome problems by making adjustments within the confines of that model-type. They may tune the hyperparameters (the aspects of the model that must be specified by the researcher). Or, when that doesn't work, they may blame the prerequisites - for example, in the case of deep learning, argue that there isn't sufficient compute power or data. This functionality is analogous to the way scientists work with research programmes according to Lakatos (1970). Lakatos argues that scientific theories are composed of core hypotheses, which are the hypotheses scientists mainly defend, and auxiliary hypotheses, which connect the core hypotheses to observations. For example, in the Newtonian research programme, the core hypotheses are the three laws of motion. The auxiliary hypotheses include definitions of the terms the laws use (such as mass), procedures for measuring the quantities the laws are about (such as mass and velocity), methodologies on the correct use of scales, what reasonable margins of error are, and so on. Upon conflicts with experience, scientists tend to revise the auxiliary hypotheses to save the core from falsification: when a prediction fails, one can always blame the instruments, initial conditions, and so on. The series of theories created by successively modifying the auxiliary hypotheses is the research programme.
However, model-types differ from paradigms and research programmes in that they seem to rely much less on activities of theory-making. Model-types themselves are not theories, and are not composed of hypotheses like research programmes (at least not straightforwardly).[4] Researchers may have hypotheses about which model-types will be the most successful, but the model-type itself does not seem to consist of theories or hypotheses in the same way that a paradigm or research programme does.
[4] Some may wonder whether deep learning can be a theory - a theory of how the brain works. Indeed, the earliest neural network (connectionist) models were inspired by neuroscience. However, to most modern machine learning researchers, who are more interested in the performance gains produced by deep learning, this is only a tangential connection. And as neuroscience has progressed, it has become clear that the neural networks used in machine learning are far simpler than the neural networks in the brain (Crick, 1989; Barrett et al., 2019).
The differences between the activities in machine learning and those in other disciplines invite not only the naming of new concepts (which is what we have done with "model-type"), but also re-examination of traditional philosophy of science concepts. For example: To what extent would it be useful to apply traditional concepts from philosophy of science, such as "theory", "paradigm", or "research programme", to machine learning? How central are activities of theory-making to these traditional concepts, given that it seems possible to reproduce their functionality without them? Thinking about such questions is among the ways in which machine learning can enrich philosophy of science.
3. Comparison between model types is model-type-laden
How can model-types be compared? Some natural ways include comparing the problems that each model-type is successful at solving and the reasonableness of the background assumptions that model-types rely on. However, as we have argued, a commitment to a model-type involves taking a stand on these very issues. Thus, being committed to a certain model-type means not being neutral with respect to the measures of success of model-types. We can't simply say that a person committed to model-type A would become committed to model-type B if it were shown that model-type B solves the important problems better. Rather, changing one's commitment may involve changing one's views on which problems are important and on what it means for a solution to be "better".
In this section, we illustrate this complexity using the rise of deep learning. We start by examining a common explanation for the rise of deep learning—the success of a deep learning model in the ImageNet competition.
3.1. The rise of deep learning
ImageNet is a large-scale database of over 14 million images that was curated with the goal of furthering computer vision and related research (Deng et al., 2009). Between 2010 and 2017, ImageNet ran an annual competition called the "ImageNet Large Scale Visual Recognition Challenge", commonly referred to as simply "ImageNet" (Russakovsky et al., 2015). The task for the competition was to classify images into one of 1,000 known classes. No deep learning models were submitted in 2010 and 2011, and the best error rate was 25.8% (Lin et al., 2011; Sánchez and Perronnin, 2011). In 2012, the winner was the only deep learning model submitted, AlexNet (Krizhevsky et al., 2012). AlexNet achieved a 16.4% error rate, roughly 10 percentage points lower than the runner-up.
This success is commonly cited as a trigger for the rise of deep learning. For example, Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, who received the Turing Award in 2019 for their work in deep learning, wrote (LeCun et al., 2015): "ConvNets [a type of neural network] were largely forsaken by the mainstream computer vision and machine-learning communities until the ImageNet competition in 2012."
By 2014, nearly all submissions to the ImageNet challenge were deep learning models, and the error rate steadily decreased. Deep learning became increasingly popular not only within the competition but also outside of it, in industry at large as well as in academia.
3.2. Does ImageNet justify the rise of deep learning?
In the natural sciences, success in a given experiment cannot on its own justify a shift to a different paradigm or research programme. It requires, among other things, prioritizing standards and preferences. To borrow Lakatos's example (1970), consider comparing Newton's early theory of optics, which focused on light refraction, with Huyghens's early theory of light, which focused on light interference. We could compare the two using experiments pertaining to, for example, light refraction. If we considered these experiments crucial, then we would adopt Newton's theory over Huyghens's. But in doing so, we would also implicitly elevate the problem of light refraction over the problem of light interference. When prioritizing problems, researchers are not necessarily under the illusion that, at the moment the prioritization is made, one theory is superior to all others in all respects. Rather, there is great hope and anticipation that the chosen theory's success on the puzzle of interest indicates success on other puzzles as well. The question is then: why do some experiments trigger this hope while others do not?
Similarly, in explaining the popularity of a model-type it is not enough to point to success in some competition. The question is not in which competitions a model-type did well, but rather why success in some competitions rather than others had an impact on the discipline. Deep learning did well in other competitions prior to ImageNet, but these didn't make nearly the same impact on the discipline as its success in ImageNet. For example, Jürgen Schmidhuber's team at the Dalle Molle Institute for Artificial Intelligence Research won four computer vision competitions with deep learning models (Ciresan et al., 2011, 2011, 2012) between May 15, 2011 and September 20, 2012 (Schmidhuber, 2017); the 2012 ImageNet competition was held on September 30, 2012. Matthew Zeiler, one of the winners of the 2013 ImageNet challenge, also notes (Gershgorn, 2017): "This Imagenet 2012 event was definitely what triggered the big explosion of AI today. There were definitely some very promising results in speech recognition shortly before this… but they didn't take off publicly as much as that ImageNet win did in 2012 and the following years." Furthermore, while deep learning models did very well on ImageNet and other computer vision and speech recognition competitions, they did less well in other areas, such as causal, probabilistic, or logical reasoning, again calling into question why ImageNet had such a large impact.
ImageNet, but not prior competitions, sparked great hope that deep learning could generalize within the field of computer vision and beyond. For example, Ilya Sutskever, one of the winners of the 2012 ImageNet challenge, said, "It was so clear that if you do a really good job on ImageNet, you could solve image recognition" (Gershgorn, 2017). Indeed, in a paper at CVPR 2019 (a computer vision conference), Kornblith et al. (2019) claim, "An implicit hypothesis in modern computer vision research is that models that perform better on ImageNet necessarily perform better on other vision tasks. However, this hypothesis has never been systematically tested." Furthermore, the influence of ImageNet spread outside the field of computer vision, and triggered hopes of generalization in other fields as well.
So what, if anything, makes ImageNet special? One important factor was that ImageNet's database was much larger. Compare ImageNet to its main predecessor, the PASCAL VOC challenge (Everingham et al., 2010, 2015). PASCAL VOC was a better-established image classification challenge, and in the first two years that the ImageNet challenge was hosted, it was co-located with the PASCAL VOC challenge as a mere "taster" competition. However, PASCAL VOC 2010 had only 20 classes and 19,737 images, while ImageNet 2010 had 1,000 classes and 1,461,406 images.
When the paper detailing the ImageNet database was originally published in 2009, skeptics disputed the value of such a large-scale database. Jia Deng, the lead author of the paper, said (Gershgorn, 2017): "There were comments like 'If you can't even do one object well, why would you do thousands, or tens of thousands of objects?'" And yet, after ImageNet, many people take the importance of large datasets for granted.
Thus, the 2012 ImageNet challenge did not simply showcase the high performance of deep learning, it also marked a shift in how researchers thought progress would be made. More and more people began to believe that the field could make significant progress simply by scaling up datasets (Sun et al., 2017).
Furthermore, the 2012 ImageNet challenge also impacted the way people conceived of the role of compute power in progress. This is reflected in the increase in the amount of compute power used to train models after 2012. Researchers from OpenAI found that before 2012, the amount of compute power used to train neural networks was doubling every two years, in line with Moore's Law. Since 2012, when the current deep learning boom began, the amount of compute power used to train deep learning models that achieve state-of-the-art results has been increasing exponentially, doubling every 3.5 months (Amodei and Hernandez, 2018). Both trends are shown in Figure 2 (reproduced with permission from OpenAI).
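The gap between these two doubling rates compounds quickly. As a rough illustration (the six-year window below is our own illustrative choice, not a figure from the OpenAI analysis):

```python
# Compound growth implied by a given doubling period.
def growth(months, doubling_period_months):
    return 2.0 ** (months / doubling_period_months)

span = 72  # an illustrative six-year window

moore = growth(span, 24)        # doubling every 2 years
post_2012 = growth(span, 3.5)   # doubling every 3.5 months

print(moore)       # 8.0: Moore's-Law-style growth
print(post_2012)   # ~1.6e6: over five orders of magnitude more
```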
Thus, the rise of deep learning came hand in hand with the rise of a new way to assess the success of models: in data-rich and compute-rich environments. Indeed, this shift was advocated by the authors of the winning 2012 ImageNet model (Krizhevsky et al., 2012): "All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available."
It would be natural to assume that these shifts in the conception of progress were simply the result of new tools or resources that were not present before. For example, the increase in compute power was made possible by the use of GPUs and techniques like GPU parallelization. However, the mere availability of these tools does not by itself imply that we ought to evaluate progress in environments that are rich in data and compute power. Doing so assumes that we should define "progress" using metrics that do not depend on the amount of data or compute power used, e.g. classification accuracy. However, we may have good reason to reject a data- and compute-independent notion of progress. For example, we may object to increased data collection for privacy reasons, or to increased use of computational power because of its environmental costs. We will discuss this point in more detail in Section 4.
In conclusion, we cannot straightforwardly say that ImageNet shows that deep learning is better than other model-types. This explanation takes for granted the prerequisites of deep learning - that model-types should be evaluated in data-rich and compute-rich environments. If models are evaluated under the conditions in which deep learning models perform better, then the cards are stacked in favor of deep learning. The more general point here is that evaluation of model-types depends on considerations on which people committed to different model-types would disagree. Therefore, evaluation of model-types is model-type-laden: it depends on which model-type the evaluator is committed to.
The claim that comparison between model-types is model-type-laden is analogous to the claim that comparison between paradigms is paradigm-laden, i.e. that paradigms are incommensurable. Since incommensurability has been harshly criticized, one might wonder how plausible it is to claim that comparison between model-types is model-type-laden. However, key criticisms of incommensurability are not applicable in the case of machine learning.
What is incommensurability? In The Structure of Scientific Revolutions,[5] Kuhn highlighted three interconnected aspects of incommensurability: semantic, ontological/perceptual, and methodological.[6] From the semantic perspective, we can't say that the laws of one paradigm are derivable from the laws of a different paradigm, due to the semantic differences between them. In other words, paradigms are not straightforwardly comparable because key terms don't mean or refer to the same things. For example, it is not the case that Newton's laws are derivable from Einsteinian mechanics, because key terms, such as mass, don't mean the same thing. Second, we can't say that one paradigm describes the world better than another, because the world itself is different for people working within different paradigms. The reason is that Kuhn argues that observations crucially depend on the theory held by those who make them - that is, that observation is theory-laden. Since the conceptual apparatus of different paradigms is different, the observations made by practitioners of different paradigms are different - so different that practitioners of different paradigms essentially practice science in different worlds. Third, we can't say that one paradigm is better than another because it does a better job at solving important problems. The reason is that, since the worlds of practitioners of different paradigms are different, the lists of problems of interest are different. In addition, the standards of success are different.
[5] Kuhn dedicated much of his work after The Structure of Scientific Revolutions to developing the concept of incommensurability. While some think that his concept of incommensurability changed greatly in subsequent work (e.g. Sankey, 1993), others think the later work is a refinement of the earlier work (e.g. Hoyningen-Huene, 1993).
[6] We follow the distinction made by Bird (2018). See (Hoyningen-Huene, 1993; Sankey, 1993; Sankey and Hoyningen-Huene, 2001) for different versions of this distinction.
The semantic aspect of Kuhn's incommensurability has received most of the critical attention, at least in philosophy of science (Scheffler, 1982; Sankey, 2018; Mizrahi, 2018).[7] However, the incommensurability-like effect in machine learning doesn't directly depend on semantics. Model-types are incommensurable because people who are committed to different model-types would disagree on the measures to use to compare them. For example, among other things, the dominance of deep learning involves a shift toward favoring evaluations in data-rich and compute-rich environments. There is no need to appeal to the use of specialized languages or to the theory-ladenness of observations to see this methodological incommensurability in machine learning. Moreover, semantic and ontological/perceptual incommensurability are less convincing in machine learning than they are in the natural sciences. Since machine learning researchers are not directly invested in theorizing about what the world is like, there is no need to think of the work of researchers committed to different model-types as producing specialized languages about the world or as shaping their observations of it. In other words, there is no need to think of researchers committed to different model-types as working in different worlds, or as unable to fully communicate about the world. Thus, methodological incommensurability in machine learning is independent of semantic and ontological/perceptual incommensurability.
[7] The focus on the semantic aspect is perhaps due to the fact that Kuhn's later work (e.g. Kuhn, 1982) as well as Feyerabend's (1978) version of incommensurability focus on semantic incommensurability. Other criticisms of incommensurability focus on its implications. See, e.g., Laudan (1996) and Gattei (2003).
4. Comparison between model types is value-laden
Having conceptualized how progress takes place in machine learning, we can now see how values shape the discipline at large. We have argued that prioritization of model-types is model-type-laden, in the sense that it depends on considerations on which people committed to different model-types would disagree. We now argue that those same considerations are also value-laden, in the sense that they implicitly encode political, social, and other values. Therefore, prioritization of model-types is not only model-type-laden but also value-laden. We illustrate with the case of deep learning.
4.1. Prerequisites
Prioritization of model-types requires favoring one set of prerequisites over another. When the prerequisites encode social and political values, the prioritization is value-laden. Let’s consider two of the prerequisites for deep learning as examples.
4.1.1. Compute Power
We have pointed out that, unlike other model-types, deep learning requires a lot of compute power, and a lot of GPUs in particular. This prerequisite is a carrier of a political value: centralization of power. Centralization of power, or rather its avoidance, is already used explicitly to compare procedures and techniques in science and medicine. The general point is that tools that can only be made or used by a select few, because they are complex or expensive, contribute to the concentration of power. Such tools are only available to those with means, and they sustain and deepen the dependency between those who are in a position to provide the services and those who need them (Longino, 1995). For example, in agriculture, sophisticated technologies create dependency on those who have the means and expertise to use them. Techniques that are accessible and can be locally implemented, such as small-scale sustainable agriculture, promote decentralization of power. People who advocate for techniques of this sort make the power dynamics involved in utilizing certain tools and procedures explicit, and favor those that decentralize power.
Kevin Elliott (2017) highlights these issues with regard to the vitamin A deficiency crisis. Vitamin A deficiency is an acute problem among the poor worldwide, to the extent that hundreds of thousands of people go blind or die of it every year. One proposed course of action is to utilize a genetically modified species of rice, called “golden rice”, which is enriched with vitamin A. An alternative is to identify which of the crops indigenous to the relevant areas are rich in vitamin A, and encourage locals to consume them. Elliott argues that addressing the vitamin A problem using golden rice caters to the Western biochemical community, which stands to benefit not only from selling golden rice but also from the related tools of Western agriculture that would likely accompany it, such as pesticides and fertilizers (2017, p. 42).
The compute power prerequisite promotes centralization of power in two related ways. First, since GPUs are very expensive, they create an entry barrier that favors those with financial means, such as big companies in rich countries. In addition, GPUs sustain and deepen the dependency on the major corporations that produce or can afford them. The cost of GPUs may drop substantially over time. However, like golden rice, the compute power prerequisite promotes centralization of power even if the tool itself is cheap: even cheap golden rice creates a need to rely on products that would not be necessary under other techniques.
The contribution of the popularization of deep learning to centralization of power has also been noticed by other researchers. For example, Strubell et al. ((2019)) note that: “Limiting this style of research to industry labs hurts the NLP [natural language processing] research community in many ways. First, it stifles creativity. Researchers with good ideas but without access to large-scale compute will simply not be able to execute their ideas, instead constrained to focus on different problems. Second, it prohibits certain types of research on the basis of access to financial resources. This even more deeply promotes the already problematic “rich get richer” cycle of research funding, where groups that are already successful and thus well-funded tend to receive more funding due to their existing accomplishments. Third, the prohibitive start-up cost of building in-house resources forces resource-poor groups to rely on cloud compute services such as AWS, Google Cloud and Microsoft Azure.”
The compute power prerequisite also encodes environmental values. Using environmental values to compare tools and procedures in science and agriculture is not new. For example, one of the reasons Greenpeace objects to using golden rice is environmental concerns (Elliott, 2017, p. 43). In machine learning, the extensive computational resources required by deep learning models take a toll on the environment. For example, Strubell et al. (2019) found that training one especially large state-of-the-art deep learning model resulted in carbon emissions over three times the amount emitted in one car’s average lifetime. These large environmental impacts have also been noticed by the Allen Institute for Artificial Intelligence, which recently released a position paper stating that “Green AI” would be an emerging focus at the institute (Schwartz et al., 2019).
4.1.2. Large Data Sets
Another prerequisite for deep learning is the availability of large data sets for training. Large data sets introduce several complexities. First, they create entry barriers because not everyone has access to sufficiently large amounts of data. This gives an advantage to large companies and promotes centralization of power, like the compute power prerequisite.
Second, the collection of data about people, e.g. their location, heart rate, or clicks, introduces an additional set of complexities around privacy, which bear on individual freedoms. For example, a 2009 House of Lords report on surveillance stated that “Mass surveillance has the potential to erode privacy. As privacy is an essential pre-requisite to the exercise of individual freedom, its erosion weakens the constitutional foundations on which democracy and good governance have traditionally been based in this country.” Deep learning requires mass collection of data, and when this data is sensitive data about individuals, it comes into tension with the individual freedom that privacy protects.
Furthermore, we have more of certain kinds of data about some groups than others, and this data can be abused. For example, an anonymous programmer created a deep learning-based app called “DeepNude”, which allowed users to upload a photo of any woman and receive a fake, undressed version of the photo. The creator stated that the app only worked for women because, due to pornography, it is easier to find images of nude women online (Cole, 2019). Technology that requires large amounts of data to work furthers the power structures underlying what kinds of data are available about certain groups in the first place.
4.2. Evaluation Criteria
In comparing theories, one needs to rely on characteristics of the theories as a whole - e.g. are they internally consistent? Are they consistent with established theories? Do they entail accurate predictions? Do they serve humanity well? Do they promote equal opportunity? Such characteristics of theories are often called “theoretical virtues”. Evaluating theories based on their theoretical virtues is a value-laden activity when the theoretical virtues are carriers of values.
Some of these virtues, such as applicability to human needs, wear their value commitments on their sleeves. However, even virtues that appear neutral, such as simplicity and consistency, are at least sometimes carriers of political, social, or other values (this was pointed out by, e.g., Kuhn (1977) and Longino (1996)). To borrow two examples from Longino, consider consistency with established theories. If the established theories are sexist, new theories will also need to be sexist to be consistent with them. Similarly, simplicity favors theories with fewer kinds of entities. Therefore, theories in biology that treat all humans as versions of a man are simpler than theories with multiple archetypes of humans. However, these simpler theories are androcentric. Moreover, even if virtues like simplicity and consistency are not themselves politically or socially loaded, weighing them against virtues that are is loaded. Suppose theory A is simpler and more consistent with established theories (for whatever reason) and theory B is more applicable to human needs. Even if simplicity and consistency are not politically loaded, favoring theory A over theory B involves taking a stand on what is more important, and that is loaded.
Similar points apply to evaluation criteria in machine learning. Well-known evaluation criteria for models include accuracy, explainability, transparency, and fairness. Some of these criteria, like fairness, are explicitly politically loaded. But even criteria that appear neutral at first glance may be carriers of values because of what it takes to satisfy them. Accuracy is one example. Which considerations is it permissible to use when attempting to make accurate predictions? For example, when is it permissible to use racial identity in making predictions about recidivism? Today, many think that we should not use social identity attributes, such as race and gender, to make such predictions even if they increase accuracy. (We note as a separate point that many in the machine learning and ethics community have pointed out that “fairness through unawareness” is an insufficient solution because machine learning classifiers can still pick up on proxies for the sensitive attributes (Dwork et al., 2012; Hardt et al., 2016).)
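The proxy problem behind “fairness through unawareness” can be seen in a minimal sketch. Everything below is synthetic and hypothetical: a withheld group attribute, a “neighborhood” feature that correlates with it, and a toy majority-vote model. The point is only that a model that never sees the group attribute can still produce group-dependent predictions through the proxy.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Synthetic, hypothetical data: `group` is the sensitive attribute, which
# the model never sees; `neighborhood` is a feature that matches `group`
# 90% of the time (a proxy); the outcome happens to track `group`.
group = rng.integers(0, 2, size=n)
neighborhood = np.where(rng.random(n) < 0.9, group, 1 - group)
label = group

# A "fair through unawareness" model: it only sees `neighborhood` and
# predicts the majority label observed for each neighborhood value.
pred = np.empty(n, dtype=int)
for v in (0, 1):
    mask = neighborhood == v
    pred[mask] = int(label[mask].mean() > 0.5)

# Despite never seeing `group`, the model's positive-prediction rates
# differ sharply across the two groups, via the proxy.
rate_g0 = pred[group == 0].mean()
rate_g1 = pred[group == 1].mean()
print(f"positive rate, group 0: {rate_g0:.2f}; group 1: {rate_g1:.2f}")
```

With a 90%-correlated proxy, the two groups’ positive-prediction rates end up roughly 0.1 and 0.9 in this toy setup, even though the sensitive attribute was excluded.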
The disapproval of using protected identities in making predictions is reflected in regulation. In the US, laws such as the Fair Housing Act (FHA), the Equal Credit Opportunity Act (ECOA), and the Fair Credit Reporting Act (FCRA) are being applied to predictions based on big data and prohibit the use of protected identities for making decisions on loans, employment, and so on. For example, a 2016 report by the Federal Trade Commission (FTC) states that: “an employer may not disfavor a particular protected group because big data analytics show that members of this protected group are more likely to quit their jobs within a five-year period. Similarly, a lender cannot refuse to lend to single persons or offer less favorable terms to them than married persons even if big data analytics show that single persons are less likely to repay loans than married persons.” (2016, p. 18) There are also restrictions on using generalizations based on proxies of protected identities, such as an address. For example, it is prohibited to deny a loan request because data analytics has found that people who live in the same zip code are generally not creditworthy (p. 16).
If one thinks that using these attributes is wrong, even when they yield accurate predictions, it is because some other value is more important than accuracy. For example, Bolinger (2018) argues that accepting generalizations based on racial stereotypes is morally and epistemically wrong even if the stereotypes are statistically accurate. She gives the following example, originally from Gendler (2011), to illustrate this claim: John Hope Franklin is hosting an event to celebrate being awarded the Presidential Medal of Freedom. All other black men on the premises are uniformed attendants. Mistaking Franklin for an attendant, a woman hands him her coat check ticket and demands he bring her the coat. Statistically, since all other black men at the party are attendants, the woman’s prediction that Franklin is an attendant is accurate. However, the assumption that Franklin is an attendant still feels wrong. Bolinger’s explanation is that the prediction is based on a racial stereotype (that black people have a lower social status), which is wrong even if the stereotype is statistically accurate. On Bolinger’s view, relying on racial stereotypes is wrong because of cumulative effects. If the only time someone assumed Franklin had a low social status was at that party, the harm would have been minimal, just a one-off mistake based on a correct generalization. But when the same type of assumption is made consistently, as it is in the case of racial stereotypes, it interferes with black people’s ability to signal authority and high social status. This limits their opportunities for advancement, which is incompatible with respecting their moral equality and autonomy.
Bolinger’s explanation is of course only one attempt to explain what is wrong with relying on generalizations based on social identity. Whatever the details, the general point is that accuracy is not necessarily a value-neutral evaluation criterion. One reason is that decisions about which considerations are permissible to use in attempting to make accurate predictions are value-laden. For example, when predictions are accurate because they rely on racial generalizations, accuracy is a carrier of social values. The accuracy of algorithms that intentionally avoid relying on racial generalizations is also value-laden: in those cases accuracy is a carrier of the value of equality. Moreover, even putting aside the fact that accuracy is not value-neutral, prioritizing accuracy over other virtues is not neutral, because the other virtues are not.
What these examples illustrate is that high accuracy is not the only thing that matters; how a model achieves high accuracy is also important. Therefore, interpretability, the ability to articulate why a model made a certain prediction, is in competition with accuracy. Deep learning models are often described as “black boxes” because how they achieve high accuracy is difficult to scrutinize. Other model-types, such as graphical models, decision trees, or support vector machines, are typically easier to scrutinize to understand how a decision was made. There are multiple ways to address the tension between accuracy and interpretability. A person who prioritizes accuracy may be more inclined to try to make deep learning algorithms more amenable to scrutiny (Simonyan et al., 2013; Zeiler and Fergus, 2014; Olah et al., 2018). On the other hand, if interpretability is essential, then it may be more appealing to simply use a model that is already more amenable to interpretation. (Hybrid approaches, which approximate a deep learning model with a simpler model-type, e.g. linear models or decision trees, are also popular (Ribeiro et al., 2016).) What we see here is that disagreements on which approach to take and which model-type to use also encode prioritizations of values, in this case between accuracy and interpretability. Thus, the rise and fall of a model-type may encode the rise and fall of some values. In particular, increased interest in values like fairness, explainability, and interpretability may motivate favoring model-types other than deep learning.
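To make the hybrid-approach idea concrete, here is a minimal sketch (not the method of Ribeiro et al.) of fitting a global interpretable surrogate: an opaque model is queried on sample inputs, and a single-split decision stump is chosen to mimic its predictions as closely as possible. The black-box rule and all numbers below are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# A stand-in "black box": an opaque decision rule we can only query.
# (The rule itself is invented for this sketch.)
def black_box(x):
    return (np.sin(3 * x[:, 0]) + 0.3 * x[:, 1] > 0.5).astype(int)

# Query the black box on a sample of inputs.
X = rng.uniform(-1, 1, size=(5000, 2))
y_bb = black_box(X)

# Fit an interpretable global surrogate: a single-split decision stump
# ("predict 1 iff feature j exceeds threshold t") chosen to agree with
# the black box's predictions as often as possible.
best = (0.0, 0, 0.0)  # (agreement, feature index, threshold)
for j in range(X.shape[1]):
    for t in np.linspace(-1.0, 1.0, 41):
        agreement = ((X[:, j] > t).astype(int) == y_bb).mean()
        best = max(best, (agreement, j, t))

agreement, feature, threshold = best
print(f"surrogate: predict 1 iff x[{feature}] > {threshold:.2f}; "
      f"agrees with the black box on {agreement:.0%} of samples")
```

The surrogate is a readable one-line rule that mimics the opaque model on most inputs; the residual disagreement is the accuracy one pays for interpretability in this toy setup.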
5. Conclusion, and values in deliberation
There is a simple story to be told about disciplinary shifts in machine learning: that progress is made by better models. But as we have seen, what counts as “better” cannot be a purely objective choice. First, because the rise of model-types comes hand in hand with the rise of corresponding ways to evaluate model-types. Second, because comparisons between model-types are value-laden, encoding social, political, and environmental values.
It is important to explicitly reveal the values that are shaping the discipline, so that these values can be examined and changed when deemed undesirable. Some in the community have already begun pushing for reforms of such disciplinary values. For example, the Allen Institute for Artificial Intelligence’s recent “Green AI” paper (Schwartz et al., 2019) advocates for increasing effort in “environmentally friendly and inclusive” AI research. They suggest introducing environmentally positive incentives in the research community by creating norms around reporting measures of efficiency, e.g. accuracy as a function of computational cost, rather than accuracy alone.
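As a toy illustration of such a reporting norm (the model names and numbers below are invented), a leaderboard that ranks by accuracy per unit of training compute can order models differently than one that ranks by accuracy alone:

```python
# Invented numbers for two imaginary models, illustrating the suggestion
# to report efficiency (accuracy per unit of compute) alongside accuracy.
models = {
    # name: (test accuracy, training cost in petaflop/s-days)
    "small_model": (0.91, 2.0),
    "large_model": (0.93, 120.0),
}

leader_by_accuracy = max(models, key=lambda m: models[m][0])
leader_by_efficiency = max(models, key=lambda m: models[m][0] / models[m][1])

for name, (acc, cost) in models.items():
    print(f"{name}: accuracy={acc:.2f}, cost={cost} pf/s-days, "
          f"accuracy per pf/s-day={acc / cost:.4f}")
print("leader by accuracy alone:", leader_by_accuracy)   # large_model
print("leader by efficiency:", leader_by_efficiency)     # small_model
```

Which metric heads the leaderboard is exactly the kind of value-laden choice discussed above: accuracy alone favors the compute-hungry model, while efficiency favors the cheaper one.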
A related question inspired by these issues is who should make decisions about which values are furthered. Who gets to have a voice? In discussing the selection of problems in science, Kitcher (2011) argues that all sides should have a say, including laypersons. Is the same true for machine learning? Who should have a say about which criteria are important in evaluating model-types? That is itself another value-laden question.
For extensive feedback on this paper, we would like to thank Lara Buchak and Shamik Dasgupta. For comments and helpful discussion on early drafts of this paper, we would like to thank Morgan Ames, Kevin Baker, Julia Bursten, Christopher Grimsley, Elijah Mayfield, John Miller, Ludwig Schmidt, David Stump, and Will Sutherland.
Smitha was supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE 1752814. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
- Buolamwini and Gebru  Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency (FAT*), pages 77–91, 2018.
- Chen et al.  Jiahao Chen, Nathan Kallus, Xiaojie Mao, Geoffry Svacha, and Madeleine Udell. Fairness under unawareness: Assessing disparity when protected class is unobserved. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 339–348. ACM, 2019.
- Chouldechova  Alexandra Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big data, 2017.
- Friedler et al.  Sorelle A Friedler, Carlos Scheidegger, Suresh Venkatasubramanian, Sonam Choudhary, Evan P Hamilton, and Derek Roth. A comparative study of fairness-enhancing interventions in machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 329–338. ACM, 2019.
- Celis et al.  L Elisa Celis, Lingxiao Huang, Vijay Keswani, and Nisheeth K Vishnoi. Classification with fairness constraints: A meta-algorithm with provable guarantees. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 319–328. ACM, 2019.
- Hardt et al.  Moritz Hardt, Eric Price, Nati Srebro, et al. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pages 3315–3323, 2016.
- Frické  Martin Frické. Big data and its epistemology. Journal of the Association for Information Science and Technology, 66(4):651–661, 2015.
- Kitchin  Rob Kitchin. Big Data, new epistemologies and paradigm shifts. Big data & society, 1(1):2053951714528481, 2014.
- Thagard  Paul Thagard. Philosophy and machine learning. Canadian Journal of Philosophy, 20(2):261–276, 1990.
- Korb  Kevin B. Korb. Introduction: Machine learning as philosophy of science. Minds and Machines, 14(4):433–440, 2004. ISSN 09246495. doi: 10.1023/B:MIND.0000045986.90956.7f.
- Williamson  Jon Williamson. A dynamic interaction between machine learning and the philosophy of science. Minds and Machines, 14(4):539–549, 2004. ISSN 0924-6495. doi: 10.1023/B:MIND.0000045990.57744.2b. URL http://kar.kent.ac.uk/7451/.
- Williamson  Jon Williamson. The philosophy of science and its relation to machine learning. In Scientific Data Mining and Knowledge Discovery, pages 77–89. Springer, Berlin, Heidelberg, 2009.
- Gillies  Donald Gillies. Artificial intelligence and scientific method. Oxford University Press, New York, 1996.
- Bensusan  Hilan Bensusan. Is machine learning experimental philosophy of science. ECAI2000 Workshop notes on scientific Reasoning in Artificial Intelligence and the Philosophy of Science, 2000.
- Bergadano  Francesco Bergadano. Machine learning and the foundations of inductive inference. Minds and Machines, 3(1):31–51, 1993.
- Corfield  David Corfield. Varieties of justification in machine learning. Minds and Machines, 20(2):291–301, 2010.
- Lopez-Paz et al.  David Lopez-Paz, Robert Nishihara, Soumith Chintala, Bernhard Scholkopf, and Léon Bottou. Discovering causal signals in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6979–6987, 2017.
- Cai et al.  Jonathon Cai, Richard Shin, and Dawn Song. Making neural programming architectures generalize via recursion. ICLR, 2017.
- Kingma and Welling  Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.
- Rezende et al.  Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286, 2014.
- Krizhevsky et al.  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 1097–1105, 2012.
- Hestness et al.  Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Patwary, Mostofa Ali, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017.
- Sun et al.  Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017.
- Paszke et al.  Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
- Abadi et al.  Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
- Sze et al.  Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S Emer. Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE, 105(12):2295–2329, 2017.
- NVIDIA  NVIDIA. Nvidia tesla p100 white paper. NVIDIA Corporation, 2016.
- Amodei and Hernandez  Dario Amodei and Danny Hernandez. Ai and compute, May 2018. URL https://openai.com/blog/ai-and-compute/.
- Hoyningen-Huene  Paul Hoyningen-Huene. Reconstructing scientific revolutions: Thomas S. Kuhn’s philosophy of science. University of Chicago Press, 1993.
- Kuhn  Thomas S. Kuhn. The structure of scientific revolutions. The University of Chicago Press,, Chicago; London, 1962.
- Lakatos  Imre Lakatos. Falsification and the methodology of scientific research programmes. In Criticism and the Growth of Knowledge: Volume 4: Proceedings of the International Colloquium in the Philosophy of Science, London, 1965, volume 4, page 91. Cambridge University Press, 1970.
- Crick  Francis Crick. The recent excitement about neural networks. Nature, 337(6203):129–132, 1989.
- Barrett et al.  David GT Barrett, Ari S Morcos, and Jakob H Macke. Analyzing biological and artificial neural networks: challenges with opportunities for synergy? Current opinion in neurobiology, 55:55–64, 2019.
- Deng et al.  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Conference on Computer Vision and Pattern recognition (CVPR). IEEE, 2009.
- Russakovsky et al.  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Fei-Fei Li. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
- Lin et al.  Yuanqing Lin, Fengjun Lv, Shenghuo Zhu, Ming Yang, Timothee Cour, Kai Yu, Liangliang Cao, and Thomas Huang. Large-scale image classification: Fast feature extraction and svm training. In Conference on Computer Vision and Pattern recognition (CVPR), pages 1689–1696. IEEE, 2011.
- Sánchez and Perronnin  Jorge Sánchez and Florent Perronnin. High-dimensional signature compression for large-scale image classification. In Conference on Computer Vision and Pattern recognition (CVPR), pages 1665–1672. IEEE, 2011.
- LeCun et al.  Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
- Ciresan et al.  Dan Ciresan, Ueli Meier, Jonathan Masci, Luca Maria Gambardella, and Jürgen Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Twenty-Second International Joint Conference on Artificial Intelligence, 2011.
- Ciresan et al.  Dan Ciresan, Alessandro Giusti, Luca M Gambardella, and Jürgen Schmidhuber. Deep neural networks segment neuronal membranes in electron microscopy images. In Advances in neural information processing systems, pages 2843–2851, 2012.
- Schmidhuber  Jürgen Schmidhuber. Computer vision contests won by GPU CNNs, Mar 2017. URL http://people.idsia.ch/~juergen/computer-vision-contests-won-by-gpu-cnns.html.
- Gershgorn  Dave Gershgorn. The data that transformed ai research-and possibly the world, Jul 2017. URL https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/.
- Kornblith et al.  Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better imagenet models transfer better? In Computer Vision and Pattern Recognition (CVPR), 2019.
- Everingham et al.  M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.
- Everingham et al.  M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, January 2015.
- Sankey  Howard Sankey. Kuhn’s changing concept of incommensurability. The British Journal for the Philosophy of Science, 44(4):759–774, 1993.
- Bird  Alexander Bird. Thomas Kuhn. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy (Winter 2018 Edition). 2018. URL https://plato.stanford.edu/archives/win2018/entries/thomas-kuhn/.
- Sankey and Hoyningen-Huene  Howard Sankey and Paul Hoyningen-Huene, editors. Incommensurability and related matters. Kluwer, Dordrecht, 2001.
- Scheffler  Israel Scheffler. Science and subjectivity. Hackett Publishing, 1982.
- Sankey  Howard Sankey. The demise of the incommensurability thesis. In Moti Mizrahi, editor, The Kuhnian Image of Science: Time for a Decisive Transformation?, pages 75–91. Rowman & Littlefield International Ltd, London and New York, 2018.
- Mizrahi  Moti Mizrahi. Kuhn’s incommensurability thesis: What’s the argument? In Moti Mizrahi, editor, The Kuhnian Image of Science: Time for a Decisive Transformation?, pages 25–44. Rowman & Littlefield International Ltd, London and New York, 2018.
- Kuhn  Thomas S. Kuhn. Commensurability, comparability, communicability. PSA: Proceedings of the biennial meeting of the Philosophy of Science Association, 2, 1982.
- Feyerabend  Paul Feyerabend. Science in a free society. New Left Books, London, 1978.
- Laudan  Larry Laudan. Beyond positivism and relativism. Westview Press, Boulder, 1996.
- Gattei  Stefano Gattei, editor. Special issue of Social Epistemology [vol. 17, issue 2–3]. 2003.
- Longino  Helen Longino. Gender, politics, and the theoretical virtues. Synthese, 104(3):383–397, 1995.
- Elliott  Kevin C. Elliott. A Tapestry of Values: An Introduction to Values in Science. 2017.
- Strubell et al.  Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in nlp. In Association for Computational Linguistics (ACL), 2019.
- Elliott and Steel  Kevin C. Elliott and Daniel Steel, editors. Current Controversies in Values and Science. Taylor & Francis, 2017.
- Schwartz et al.  Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. Green AI. 2019.
- House of Lords  House of Lords. Surveillance: citizens and the state. The Stationery Office, 2009.
- Cole  Samantha Cole. Creator of deepnude, app that undresses photos of women, takes it offline, Jun 2019. URL https://www.vice.com/en_us/article/qv7agw/deepnude-app-that-undresses-photos-of-women-takes-it-offline.
- Dwork et al.  Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214–226. ACM, 2012.
- Federal Trade Commission  Federal Trade Commission. Big data: A tool for inclusion or exclusion? Understanding the issues. FTC Report, January 2016.
- Bolinger  Renée Jorgensen Bolinger. The rational impermissibility of accepting (some) racial generalizations. Synthese, pages 1–17, 2018.
- Gendler  Tamar Szabó Gendler. On the epistemic costs of implicit bias. Philosophical Studies, 156(1):33, 2011.
- Simonyan et al.  Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
- Zeiler and Fergus  Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV), pages 818–833. Springer, 2014.
- Olah et al.  Chris Olah, Arvind Satyanarayan, Ian Johnson, Shan Carter, Ludwig Schubert, Katherine Ye, and Alexander Mordvintsev. The building blocks of interpretability. Distill, 3(3):e10, 2018.
- Ribeiro et al.  Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144. ACM, 2016.
- Kitcher  Philip Kitcher. Science in a democratic society. Prometheus Books, 2011.