Evaluation Gaps in Machine Learning Practice

05/11/2022
by   Ben Hutchinson, et al.
Google

Forming a reliable judgement of a machine learning (ML) model's appropriateness for an application ecosystem is critical for its responsible use, and requires considering a broad range of factors including harms, benefits, and responsibilities. In practice, however, evaluations of ML models frequently focus on only a narrow range of decontextualized predictive behaviours. We examine the evaluation gaps between the idealized breadth of evaluation concerns and the observed narrow focus of actual evaluations. Through an empirical study of papers from recent high-profile conferences in the Computer Vision and Natural Language Processing communities, we demonstrate a general focus on a handful of evaluation methods. By considering the metrics and test data distributions used in these methods, we draw attention to which properties of models are centered in the field, revealing the properties that are frequently neglected or sidelined during evaluation. By studying these properties, we demonstrate the machine learning discipline's implicit assumption of a range of commitments which have normative impacts; these include commitments to consequentialism, abstractability from context, the quantifiability of impacts, the limited role of model inputs in evaluation, and the equivalence of different failure modes. Shedding light on these assumptions enables us to question their appropriateness for ML system contexts, pointing the way towards more contextualized evaluation methodologies for robustly examining the trustworthiness of ML models.



1. Introduction

When evaluating a machine learning (ML) model for real-world uses, two fundamental questions arise: Is this ML model good (enough)? and Is this ML model better than some alternative? Obtaining reliable answers to these questions can be consequential for safety, fairness, and justice concerns in the deployment ecosystems. To address such questions, model evaluations use a variety of methods, and in doing so make technical and normative assumptions that are not always explicit. For example, when striving to answer whether a model is desirable (or good in the normative sense), evaluations typically focus largely on what is easily measurable (a reductive technical notion of good). These implicit assumptions can obscure the presence of epistemic gaps and motivations in the model evaluations, which, if not identified, constitute risky unknown unknowns.

Recent scholarship has critiqued the ML community’s evaluation practices, focusing on the use of evaluation benchmarks and leaderboards. Although leaderboards support the need of the discipline to iteratively optimize for accuracy, they neglect concerns such as inference latency, robustness, and externalities (Ethayarajh and Jurafsky, 2020). The structural incentives of the “competition mindset” encouraged by leaderboards can pose challenges to empirical rigor (Sculley et al., 2018). For example, over-reliance on a small number of evaluation metrics can lead to gaming the metric (cf. Goodhart’s Law: “when a measure becomes a target, it ceases to be a good measure”) (Thomas and Uminsky, 2020); this can happen unintentionally as researchers pursue models with “state of the art” performance. Benchmarks that encourage narrowly optimizing for test set accuracy can also lead to models relying on spurious signals (Carter et al., 2021), while neglecting the challenge of measuring the full range of likely harms (Bowman and Dahl, 2021). Birhane et al. find evidence for this in their study of the discourse of ML papers, showing that the field centers accuracy, generalization, and novelty, while marginalizing values such as safety (Birhane et al., 2021). Given that benchmark evaluations serve as proxies for performance on underlying abstract tasks (Schlangen, 2021), evaluating against a range of diverse benchmarks for each task might help mitigate biases within each benchmark. However, ML research disciplines seem to be trending towards relying on fewer evaluation benchmark datasets (Koch et al., 2021), with test set reuse potentially leading to a research community’s overfitting with respect to the general task (Zhang et al., 2020a; Liao et al., 2021). Furthermore, within each benchmark, items are weighted equally (thus focusing on the head of the data distribution), failing to capture inherent differences in difficulty across items, and hence providing poor measures of progress on task performance (Rodriguez et al., 2021). As Raji et al. point out, the ML research discipline’s decontextualized and non-systematic use of benchmark data raises serious issues with regards to the validity of benchmarks as measures of progress on general task performance (Raji et al., 2021).

This paper complements and extends this range of critiques, considering the risks of application developers adopting the ML research community’s standard evaluation methodologies. We seek to address challenges in measuring technology readiness (Lin et al., 2007; Rismani and Moon, 2021), while acknowledging this cannot be reduced to a purely technical question (Dahlin, 2021; Rismani and Moon, 2021). By studying and analyzing the ML research community’s evaluation practices, we draw attention to the evaluation gaps between ideal theories of evaluation and what is observed in ML research. By considering aspects of evaluation data and evaluation metrics—as well as considerations of evaluation practices such as error analysis and reporting of error bars—we highlight the discrepancies between the model quality signals reported by the research community and what is relevant to real-world model use. Our framework for analyzing the gaps builds upon and complements other streams of work on ML evaluation practices, including addressing distribution shifts between development data and application data (Sugiyama et al., 2007; Koh et al., 2020; Chen et al., 2021), and robustness to perturbations in test items (Prabhakaran et al., 2019; Moradi and Samwald, 2021; Winkens et al., 2020; Hendrycks and Dietterich, 2019). We situate this work alongside studies of the appropriateness of ML evaluation metrics (e.g., (Japkowicz, 2006; Derczynski, 2016; Zhang et al., 2020a)), noting that reliable choice of metric is often hampered by unclear goals (Kuwajima et al., 2020; D’Amour et al., 2020). In foregrounding the information needs of application developers, we are also aligned with calls for transparent reporting of ML model evaluations (Mitchell et al., 2019), prioritizing the needs of ML fairness practitioners (Holstein et al., 2019), model auditing practices (Raji et al., 2020), and robust practices for evaluating ML systems for production readiness (Breck et al., 2017).

In Section 2, we consider various ideal goals that motivate why ML models are evaluated, discussing how these goals can differ between research contexts and application contexts. We then report in Section 3 on an empirical study into how machine learning research communities report model evaluations. By comparing the ideal goals of evaluation with the observed evaluation trends in our study, we highlight in Section 4 the evaluation gaps that present challenges to evaluations being good proxies for what application developers really care about. We identify six implicit evaluation assumptions that could account for the presence of these gaps. Finally, in Section 5, we discuss various techniques and methodologies that may help to mitigate these gaps.

2. Ideals of ML Model Evaluation

Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise. — John Tukey (Tukey, 1962, pp. 13–14)

Although this paper is ultimately concerned with practical information needs when evaluating ML models for use in applications, it is useful to first step back and consider the ultimate motivations and goals of model evaluation. To evaluate is to form a judgement; however, asking Is this a good ML model? is akin to asking such a question of other artefacts—such as Is this a good glass?—in that it requires acknowledging the implicit semantic arguments of uses and goals (Pustejovsky, 1998). For example, Is this a good glass [for my toddler to drink from, given that I want to avoid broken glass]? is a very different question from Is this a good glass [in which to serve wine to my boss, given that I want to impress them]?

In this paper, we will speak of a model evaluation as a system of arbitrary structure that takes a model as an input and produces outputs of some form to judge the model. Designing a model evaluation often involves choosing one or more evaluation metrics (such as accuracy) combined with a choice of test data. The evaluation might be motivated by various stakeholder perspectives and interests (Jones and Galliers, 1995). The output might, for example, produce a single metric and an associated numeric value, or a table of such metrics and values; it might include confidence intervals and significance tests on metric values; and it might include text. By producing such an output, the evaluation helps to enable transparency by reducing the number of both unknown unknowns and known unknowns.

For the purposes of this paper, it is useful to distinguish between two types of evaluations:

  1. An ML model evaluation system useful for evaluating the learner (i.e., machine learning algorithm).

  2. An ML model evaluation system useful for evaluating a potential application.

Learner-centric evaluations make conclusions about the quality of the learner or its environment based on the evaluation of the learned model. These include evaluations motivated by novel learning algorithms or model architectures, but also ones that a) aim to shed light on the training data (for example, ML model evaluations can shed light on the data-generation practices used by institutions (Andrus and Gilbert, 2019)), or b) “Green AI” explorations of how the learner can efficiently use limited amounts of resources (Schwartz et al., 2020). However, when we evaluate a model without a specific application in mind, we lose the opportunity to form judgements specific to a use case. On the other hand, application-centric evaluations are concerned with how the model will operate within an ecosystem consisting of both human agents and technical components (Figure 1), sometimes described as the “ecological validity” (De Vries et al., 2020). Applications often use scores output by the model to initiate discrete actions or decisions, by applying a specific classification threshold to the scores. (Footnote: The history of this type of use case extends beyond ML models, e.g., to the use of regression models in university admissions testing (Hutchinson and Mitchell, 2019).) In contrast, learner-centric evaluations sometimes care about scores output by models even in the absence of any thresholds.

Figure 1. Learner-centric ML model evaluations are concerned with the learner and its environment. Application-centric model evaluations are concerned with how the model will interact with an ecosystem into which it is introduced.

This distinction between learner-centric and application-centric is related (albeit imperfectly) to the different objectives of model evaluations that concern the engineering and science disciplines (Wallach, 2018; Mazzocchi, 2015). Note that we are not claiming (cf. the debate in (Norvig, 2017)) that science lies outside the bounds of statistical/ML methods, but rather that scientific-flavored pursuits have distinct uses of such methods (Breiman, 2001). Debates between AI practitioners about the relationships between AI, science, and statistical methods have a long history, for example Diana Forsythe’s studies of 1980s AI labs (Forsythe, 2001a). Important to this debate regarding the scientific goals of ML is the question of construct validity; that is, whether our measurements actually measure the things that we claim they do (Jacobs et al., 2020; Raji et al., 2021; Jacobs, 2021). Conversely, consequential validity—which includes the real-world consequences of an evaluation’s interpretation and use—is likely more important to considerations of accountability and governance of ML models in applications (Jacobs, 2021).

  1. Evaluating the model can motivate beliefs/explanations about the world (including possibly the learner).

  2. Evaluating the model can tell us whether the model can be used as a means towards a goal.

This distinction is closely related to one between “scientific testing” and “competitive testing” made by Hooker in 1995, who takes the position that competitive testing a) is unscientific, and b) does not constitute true research but merely development (Hooker, 1995). However, since engineering research has its own goals, distinct from those of science (Bulleit et al., 2015), a more defensible position is that evaluations in support of scientific research are distinct from evaluations in support of engineering research.

|                         | Learner-centric evaluations                  | Application-centric evaluations |
| Typical evaluation goal | Distinguish better learners from poorer ones | Predict ecosystem outcomes      |
| Schematic of goal       | (schematic figure)                           | (schematic figure)              |
| Disciplinary goals      | Science or engineering                       | Primarily engineering           |
Table 1. Summary of typical goals of the idealized learner-centric and application-centric evaluations.

Table 1 summarizes the above distinctions and the relationships between them. The distinction between learner-centric and application-centric evaluations relates to the question of internal validity and external validity that is more commonly discussed in the social sciences than in ML (see, e.g., (Olteanu et al., 2019)) but also sometimes in ML (Liao et al., 2021). This is reflected in the ways in which practitioners of the two types of evaluations discuss the topic of robustness. Learner-centric evaluations pay attention to the robustness of the learner to changes in the training data (e.g., distributional shifts, outliers, perturbations, poisoning attacks; and with connections to robust estimation of statistics (Lecué and Lerasle, 2020)), while application-centric evaluations pay attention to desired behaviors such as the (in)sensitivity of the model to certain classes of perturbations of the input, or to sensitive input features (e.g., (Garg et al., 2019)).

Note that nothing in the ideals of evaluation described above has stipulated whether evaluations are quantitative or qualitative. For example, one could imagine interrogating a chatbot model using qualitative techniques, or adopting methodologies of political critique such as (Crawford and Paglen, 2021). Similarly, nothing has stipulated what combinations of empirical or deductive methods are used.

3. ML Model Evaluations in Practice

Beneath the technical issues lie some differences in values concerning not only the meaning but also the relative merit of “science” and “artificial intelligence.” — Diana Forsythe (Forsythe, 2001a)

To shed light on the ML research community’s norms and values around model evaluation, we looked at how these communities report their model evaluations. By examining 200 papers from several top conferences in two research disciplines that use ML approaches extensively, we identified patterns regarding choices of metrics, evaluation data, and measurement practices. This empirical study of ML research practices complements several recent studies of ML evaluation practices. These include: a survey of 144 research papers studying the properties of models that are tested for (Zhang et al., 2020a); a review of 107 papers from Computer Vision (CV), Natural Language Processing (NLP) and other ML disciplines to diagnose internal and external modes of evaluation failures (Liao et al., 2021); an analysis of whether 60 NLP and CV papers pay attention to accuracy or efficiency (Schwartz et al., 2020); and an analysis of the Papers With Code dataset (https://paperswithcode.com) for patterns of benchmark dataset creation and re-use (Koch et al., 2021).

3.1. Method

3.1.1. Data

We sampled 200 research papers, stratified by discipline, conference, and year. 100 papers were selected from each of the NLP and CV disciplines. We selected 20 papers from the proceedings of each of the 55th to 59th Annual Meetings of the Association for Computational Linguistics (ACL’2017–ACL’2021), 25 papers at random from each of the proceedings of the 2019–2021 IEEE Conferences on Computer Vision and Pattern Recognition (CVPR’2019–CVPR’2021), and 25 papers from the 24th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI’2021). These conferences represent the pinnacles of their respective research fields.

(Footnote: ACL and CVPR are rated A* (“flagship conference”) and MICCAI is rated A (“excellent conference”) by core.edu.au; all three are in the top 30 computer science conferences out of over 900 listed on research.com.)

3.1.2. Analysis

The authors of this paper performed this analysis, dividing the papers among themselves based on disciplinary familiarity. Using an iterative procedure of analysis and discussion, we converged on a set of labels that captured important aspects of evaluations across and within disciplines. Recall from Section 2 that, for our purposes, a single evaluation typically involves choosing one or more metrics and one or more datasets. We coded each of the papers along three dimensions. a) Metrics: Which evaluation metrics were reported? After iteration, we converged on the categories of metrics shown in Table 2. b) Data: Was test data drawn from the same distribution as the training data, under the Independent and Identically Distributed (I.I.D.) assumption? c) Analysis: Was statistical significance of differences reported? Were error bars and/or confidence intervals reported? Was error analysis performed? Were examples of model performance provided to complement measurements with qualitative information?

3.2. Results

Although each of the disciplines and conferences does not define itself solely in terms of ML, the practice of reporting one or more model evaluations in a research paper is ubiquitous. Only five papers did not include evaluations of ML models; of these, three were published at ACL (a survey paper, a paper aimed at understanding linguistic features, and one on spanning-tree algorithms) and two at CVPR (a paper with only qualitative results, and one introducing a dataset). Table 3 summarizes the results of the other 195. Counts are non-exclusive; for example, papers frequently reported multiple metrics and sometimes reported performance both on I.I.D. test data and on non-I.I.D. test data.

Appendix B contains an overview of the flavors of test data we observed. We found evidence to support the claim that evaluations of NLP models have “historically involved reporting the performance (generally meaning the accuracy) of the model on a specific held-out [i.e., I.I.D.] test set” (Bommasani et al., 2021, p. 94). (Footnote: Two observed non-I.I.D. evaluation patterns in NLP were: a) testing on a different linguistic “domain” (e.g., training on texts about earthquakes and testing on texts about floods (Alam et al., 2018)); and b) testing a model’s ability to predict properties of a manually compiled lexical resource (e.g., (Ustalov et al., 2017)). See also Appendix B.) CV evaluations seem to be even more likely to utilize I.I.D. test data, and—consistent with (Koch et al., 2021)—CV papers typically either introduce a new task (and corresponding benchmark dataset) (Jafarian and Park, 2021; Li et al., 2016; Rostamzadeh et al., 2018; Wu et al., 2021) or present results of a new model on an existing widely-used benchmark (Ren et al., 2015; He et al., 2016). An exception to this trend was CV papers which explored shared representations (e.g., in multi-task learning (Lacoste et al., 2018; Evci et al., 2021) or domain adaptation (Pinheiro et al., 2019; Murez et al., 2018)).

Evaluations in both disciplines showed a heavy reliance on reporting point estimates of metrics, with variance or error bars typically not reported in our sample. While colloquial uses of phrases like “significantly better” were fairly common, most papers did not report on technical calculations of statistical significance; we considered only the latter instances when coding whether a paper reported significance. Regarding metrics, most of those that were frequently seen in our sample were somewhat insensitive to different types of errors. For example, accuracy does not distinguish between FP and FN; F1 is symmetric in FP and FN (they can be swapped without affecting F1); the Overlap metrics are similarly invariant to swapping of the predicted bounding box and the reference bounding box; and the Distance category of metrics does not distinguish over-estimation from under-estimation on regression tasks.
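To make these insensitivities concrete, the short sketch below (plain Python, with invented confusion-matrix counts) shows that accuracy and F1 are unchanged when false positives and false negatives are swapped, even though the two error types may carry very different downstream impacts.

```python
# Illustrative only: the counts are invented. Accuracy and F1 depend on FP and FN
# only through their sum, so swapping the two error types leaves both unchanged.

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def f1(tp, tn, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)  # note that TN plays no role at all

original = dict(tp=80, tn=90, fp=5, fn=25)   # few false alarms, many misses
swapped  = dict(tp=80, tn=90, fp=25, fn=5)   # many false alarms, few misses

for metric in (accuracy, f1):
    assert metric(**original) == metric(**swapped)
    print(metric.__name__, round(metric(**original), 3))
```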

From our reading of the 200 papers in our sample, one qualitative observation we had was that model evaluations typically do not include concrete examples of model behavior, nor analyses of errors (for a counterexample which includes these practices, see (Chiril et al., 2020)). Also, we noted the scarcity of papers whose sole contribution is a new dataset for an existing task, aligning with previous observations that dataset contributions are not valued highly within the community (Sambasivan et al., 2021b). We hypothesise that conference reviewers place emphasis on novelty of model, task, and/or metric. We note a general tension between disciplinary values of task novelty and demonstrating state-of-the-art performance by outperforming previous models, and the risk of overfitting from test set re-use discussed by (Liao et al., 2021).

| Metric category | Examples                  | Description |
| Accuracy        | Accuracy, error rate      | Sensitive to the sum TP+TN and to N. Not sensitive to class imbalance. |
| Precision       | Precision, Bleu           | Sensitive to TP and FP. Not sensitive to FN or TN. |
| Recall          | Recall, Rouge             | Sensitive to TP and FN. Not sensitive to FP or TN. |
| F-score         | F1, Fβ                    | Sensitive to TP, FP and FN. Not sensitive to TN. |
| Overlap         | Dice, IoU                 | Sensitive to intersection and overlap of predicted and actual. |
| Likelihood      | Perplexity                | Sensitive to the probability that the model assigns to the test data. |
| Distance        | MSE, MAE, RMSE, CD        | Sensitive to the distance between the prediction and the actual value. |
| Correlation     | Pearson’s r, Spearman’s ρ | Sensitive to each of TP, TN, FP and FN, but unlike Accuracy metrics they factor in the degree of agreement that would be expected by chance. |
| AUC             | MAP, AUROC                | Does not rely on a specific classification threshold, but instead calculates the area under a curve parameterized by different thresholds. |
Table 2. Categories of evaluation metrics used in the analysis of the ML research literature. TP = true positives; TN = true negatives; FP = false positives; FN = false negatives; N = total number of data points. See Appendix A for the most common metrics in our data and their categorizations.
| Discipline:Venue                              | NLP:ACL | CV:CVPR | CV:MICCAI | CV:Combined | NLP+CV:Combined |
| (# papers with ML evals)                      | (97)    | (73)    | (25)      | (98)        | (195)           |
| Most common metric categories (num. of papers) | Accuracy (47), F-score (45), Precision (43), Recall (25) | AUC (32), Accuracy (25), Overlap (22), Distance (10) | Distance (14), Overlap (9), AUC (6), Accuracy (4) | AUC (38), Overlap (31), Accuracy (29), Distance (24) | Accuracy (76), F-score + Overlap (74)‡, Precision (48), AUC (44) |
| I.I.D. test data                              | 78      | 72      | 25        | 97          | 175             |
| Non-I.I.D. test data                          | 28      | 21      | 4         | 25          | 53              |
| Reports significance                          | 24      | 0       | 7         | 7           | 31              |
| Reports error bars†                           | 10      | 6       | 10        | 16          | 26              |
Table 3. Analysis of how Natural Language Processing (NLP) and Computer Vision (CV) research communities perform ML model evaluations. Appendix A provides definitions of commonly observed metrics, and their mappings to categories.
† Includes any form of error bars/confidence intervals/credible intervals/variation across multiple runs.
‡ Reported together here due to the equivalence of the Dice measure (in the Overlap category) and F1 (in the F-score category) (Powers, 2011).

3.3. Discussion

This small-scale quantitative study of model evaluations provides clues as to the values and goals of the ML research communities. Test data was often old (e.g., the CoNLL 2003 English NER dataset (Sang and De Meulder, 2003), used in two papers); optimizing for these static test sets fails to account for societal and linguistic change (Bender et al., 2021). Disaggregation of metrics was rare, and fairness analyses were absent despite our sample being from 2017 onward, concurrent with mainstream awareness of ML fairness concerns. Despite being acknowledged by influential thought-leaders in ML to be unrealistic for applications (Bengio et al., 2021), using I.I.D. test data is the norm. These findings are in alignment with the learner-centric goals of evaluations (Section 2). Similarly, with a few exceptions in our sample, there was a general paucity of discussions of tradeoffs such as accuracy vs. resource-efficiency that are typical of engineering disciplines (Bulleit et al., 2015), suggesting that the ML research disciplines generally aspire to scientific goals concerning understanding and explaining the learner. With this lens, the disciplinary paradigm of measuring accuracy on I.I.D. test data is not surprising: the goal is to assess a model’s ability to generalize. This assessment would then give us good guarantees on the application’s behavior, if the practical challenges of ascertaining the data distributions in an application ecosystem can be overcome. In practice, however, these challenges can be severe, and the research papers we surveyed do not generally tackle questions of uncertainty regarding data distributions.


4. Gaps and Assumptions in Common Evaluation Practices

In theory there is no difference between theory and practice, while in practice there is. — Brewster (Brewster, 1881)

We now consider whether the research evaluation practices observed in Section 3 are aligned with the needs of decision-makers who consider whether to use a model in an application. That is, we consider whether the typically learner-centric evaluations, which commonly use metrics such as accuracy or F1 on test data I.I.D. with the training data, meet the needs of application-centric evaluations. In doing so, we expose, in a novel way, the interplay of technical and normative considerations in model evaluation methodologies.

4.1. Assumptions in Model Evaluation

We introduce six assumptions in turn, describing both how they operate individually in evaluations and how they compose and compound. We also call out “evaluation gaps” of concern relevant to each assumption. Appendix C contains a hypothetical example from a specific application domain that illustrates the flavors of the concerns. Our starting point is the observation from Section 2 that the goal of application-centric model evaluations is to understand how a model will interact with its ecosystem, which we denote schematically as:

(1)   Eval(model)  :=  Interaction(model, ecosystem)

Assumption 1: Consequentialism. Consequentialism is the view that whether actions are good or bad depends only on their consequences (Sinnott-Armstrong, 2021). The ML research literature often appeals to motivations about model utility to humans (e.g., (Blasi et al., 2021; Ethayarajh and Jurafsky, 2020; Bunescu and Huang, 2010; Lo and Wu, 2010; Idahl et al., 2021; Neumann et al., 2020; Hepp et al., 2018; Orekondy et al., 2018; Fu et al., 2022; Zhao et al., 2020), including papers on fairness in ML such as (Corbett-Davies et al., 2017; Card and Smith, 2020; Chohlas-Wood et al., 2021; Corbett-Davies and Goel, 2018)). In adopting consequentialism as its de facto ethical framework, ML prioritizes the greatest good for the greatest number (IEEE, 2019) and centers measurable future impacts. Moreover, the consequences that are centered are the direct consequences, with little attention given to motives, rules, or public acceptance (Sinnott-Armstrong, 2021). This is realised as a focus on the first-order consequences of introducing the model into the ecosystem. Changes to the ecosystem itself—e.g., addressing what social change is perceived as possible and desirable (Green, 2020; Hovy and Spruit, 2016; Eckhouse et al., 2019)—are assumed to be out of scope, as are concerns for setting of precedents for other ML developers. We denote this assumption schematically as:

(2)   Interaction(model, ecosystem)  ≈  Consequences(model, ecosystem)

Evaluation Gap 1: Provenance.  A focus on future consequences neglects important moral considerations regarding the construction of the model. This excludes both deontological concerns—for example, Were data consent and sovereignty handled appropriately? (Kukutai and Taylor, 2016; Andreotta et al., 2021; Crawford and Paglen, 2021) and Were data workers treated with dignity? (Gray and Suri, 2019)—as well as questions regarding past costs of development—for example, What were the energy use externalities of model training? (Strubell et al., 2019; García-Martín et al., 2019; Crawford and Joler, 2018) and Was the labour paid fairly? (Silberman et al., 2018). Schwartz et al. coin the phrase “Red AI” to describe ML work that disregards the costs of training, noting that such work inhibits discussions of when costs might outweigh benefits (Schwartz et al., 2020).

Evaluation Gap 2: Social Responsibilities.  Another outcome of focusing primarily on direct consequences is marginalizing the assessment of a model against the social contracts that guide the ecosystem in which the model is used, such as moral values, principles, laws, and social expectations. For instance, Does the model adhere to the moral duty to treat people in ways that upholds their basic human rights? (Shue, 2020), Does it abide by legal mechanisms of accountability? (Raso et al., 2018; McGregor et al., 2019), and Does it satisfy social expectations of inclusion, such as the “nothing about us without us” principle? (Charlton, 1998).

Assumption 2: Abstractability from Context. The model’s ecosystem is reduced to a set of considerations ⟨X, Y⟩, i.e., the inputs X to the model and the “ground truth” Y, and in practice X may often fail to model socially important yet sensitive aspects of the environment (Barocas et al., 2017; Andrus and Gilbert, 2019). The model itself is reduced to a predicted value Ŷ, ignoring e.g., secondary model outputs such as confidence scores, or predictions on auxiliary model heads.

(3)   Consequences(model, ecosystem)  ≈  Utility(X, Y, Ŷ)

Evaluation Gap 3: System Considerations.  Equating a model with its prediction overlooks the potential usefulness of model interpretability and explainability. Also, reducing an ecosystem to model inputs and “ground truth” overlooks questions of system dynamics (Martin et al., 2020; Selbst et al., 2019), such as feedback loops, “humans-in-the-loop,” and other effects “due to actions of various agents changing the world” (Bengio et al., 2021). Also overlooked are inference-time externalities of energy use (Cai et al., 2017; García-Martín et al., 2019), cultural aspects of the ecosystem (Sambasivan et al., 2021a), and long term impacts (Card and Smith, 2020).

Evaluation Gap 4: Interpretive Epistemics.  By positing a variable Y which represents the “ground truth” of a situation—even in situations involving social phenomena—a positivist stance on knowledge is implicitly adopted. That is, a “true” value is taken to be objectively singular and knowable. This contrasts with anthropology’s understanding of knowledge as socially and culturally dependent (Forsythe, 2001b) and requiring interpretation (Geertz, 1973). In the specific cases of CV and NLP discussed in Section 3, cultural aspects of image and language interpretation are typically marginalized (cf. (Jappy, 2013; Lakoff and Johnson, 2008; Berger, 2008; Barthes, 1977), for example), exemplifying what Aroyo and Welty call AI’s myth of “One Truth” (Aroyo and Welty, 2015). Furthermore, the positivist stance downplays the importance of questions of construct validity and reliability (Jacobs, 2021; Friedler et al., 2021).

Figure 2. Causal graph illustrating the Input Myopia Assumption.

Assumption 3: Input Myopia. Once the input variable X has been used by the model to calculate the model prediction Ŷ, X is typically ignored for the remainder of the evaluation. That is, the utility of the model is assumed to depend only on the model’s prediction and on the “ground truth.” We illustrate this with a causal graph diagram in Figure 2, which shows Utility as independent of X once the effects of Y and Ŷ are taken into account.

(4)   Utility(X, Y, Ŷ)  ≈  Utility(Y, Ŷ)

Evaluation Gap 5: Disaggregated Analyses.  By reducing the variables of interest to the evaluation to the prediction Ŷ and the ground truth Y, the downstream evaluation is denied the potential to use the inputs X. This exacerbates Evaluation Gap 3 by further abstracting the evaluation statistics from their contexts. For example, X could have been used to disaggregate the evaluation statistics in various dimensions—including for fairness analyses, assuming that socio-demographic data is available and appropriate (Andrus et al., 2021; Barocas et al., 2021)—or to examine regions of the input space which raise critical safety concerns (e.g., distinguishing a computer vision model’s failure to recognise a pedestrian on the sidewalk from failure to recognise one crossing the road) (Amodei et al., 2016). Similarly, robustness analyses which compare the model predictions for related inputs in the same neighborhood of the input space are also excluded.
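As a minimal sketch of the kind of disaggregated analysis that Input Myopia forecloses, the following Python snippet keeps an input-derived context label alongside each (y, ŷ) pair and slices accuracy by it; the records and the context categories are hypothetical.

```python
# Keeping the inputs X alongside (y, y_hat) lets us slice accuracy by an
# input-derived attribute. Records and the "context" labels are invented.
from collections import defaultdict

records = [
    # (context derived from the input x, ground truth y, model prediction y_hat)
    ("pedestrian_on_sidewalk", 1, 1),
    ("pedestrian_on_sidewalk", 1, 1),
    ("pedestrian_crossing_road", 1, 0),   # a safety-critical miss
    ("pedestrian_crossing_road", 1, 1),
    ("no_pedestrian", 0, 0),
    ("no_pedestrian", 0, 0),
]

overall = sum(y == y_hat for _, y, y_hat in records) / len(records)

by_context = defaultdict(list)
for context, y, y_hat in records:
    by_context[context].append(y == y_hat)

print(f"overall accuracy: {overall:.2f}")
for context, hits in by_context.items():
    print(f"  {context}: {sum(hits)/len(hits):.2f} (n={len(hits)})")
```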

Assumption 4: Quantifiability. We have not yet described any modeling assumptions about the mathematical or topological nature of the implied utility function, which up to now has been conceived as an arbitrary procedure producing an arbitrary output. We observe, however, that when models are evaluated, there is a social desire to produce a small number of scalar scores. This is reinforced by “leaderboardism” (Ethayarajh and Jurafsky, 2020), and extends to the point of averaging different types of scores such as correlation and accuracy (Wang et al., 2018). We identify two assumptions here: first, that impacts on each individual can be reduced to a single numeric value (and thus different dimensions of impacts are commensurable; e.g., one machine learning fairness paper says “c is the cost of detention in units of crime prevented” (Corbett-Davies et al., 2017)); second, that impacts across individuals are similarly commensurable. We define ŷ_i and y_i to be a specific model prediction and a specific “ground truth” value, respectively, leading to the Individual Quantifiability Assumption and the Collective Quantifiability Assumption, respectively.

(5)   Utility(y_i, ŷ_i)  ≈  u(y_i, ŷ_i) ∈ ℝ
(6)   Utility(Y, Ŷ)  ≈  (1/N) Σ_i u(y_i, ŷ_i)

Composing these assumptions with the previous ones leads to the belief that the evaluation can be summarized as a scalar statistic: Eval(model) ≈ (1/N) Σ_i u(y_i, ŷ_i).

Evaluation Gap 6: Incommensurables.  The Quantifiability Assumptions assume that the impacts on individuals are reducible to numbers, trivializing the frequent difficulty in comparing different benefits and costs (Marrkula Center, 2019). Furthermore, the harms and benefits across individuals are assumed to be comparable in the same scale. These assumptions are likely to disproportionately impact underrepresented groups, for whom model impacts might differ in qualitative ways from the well represented groups (Sambasivan et al., 2021a; Sambasivan and Holbrook, 2018; Heldreth et al., 2021). The former groups are less likely to be represented in the ML team (West et al., 2019) and hence less likely to have their standpoints on harms and benefits acknowledged.
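A toy illustration of this gap, using invented per-person utility values: two hypothetical models share the same mean “utility,” yet differ sharply for the worst-off individual, which a single aggregate score cannot express.

```python
# Invented per-person utilities for two hypothetical models. Both have the same
# mean "utility", but model_b concentrates its harms on one person, which a
# single aggregate score cannot express (cf. maximin / worst-off comparisons).
model_a = [0.6, 0.6, 0.6, 0.6, 0.6]
model_b = [1.0, 1.0, 1.0, 1.0, -1.0]

def mean(xs):
    return sum(xs) / len(xs)

print(mean(model_a), mean(model_b))   # 0.6 vs 0.6: indistinguishable on average
print(min(model_a), min(model_b))     # 0.6 vs -1.0: very different for the worst-off
```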

Assumption 5: Failure Cases Are Equivalent. For classification tasks, common evaluation metrics such as accuracy or error rate model the utility of a prediction ŷ_i as binary (i.e., either 1 or 0), depending entirely on whether or not it is equal to the “ground truth” y_i. That is, for a binary task, u(y=1, ŷ=1) = u(y=0, ŷ=0) = 1 and u(y=1, ŷ=0) = u(y=0, ŷ=1) = 0. Similarly for regression tasks, common metrics such as MAE and MSE take the magnitude of error into account, yet still treat certain failures as equivalent (specifically, u(y, y+δ) = u(y, y−δ) for all y and δ).

(7)   u(y_i, ŷ_i)  ≈  1[y_i = ŷ_i]          (classification)
(8)   u(y_i, ŷ_i)  ≈  g(|y_i − ŷ_i|)        (regression)

Taken together with the previous assumptions, this yields Eval(model) ≈ (1/N) Σ_i 1[y_i = ŷ_i], i.e., accuracy, for classification tasks.

Evaluation Gap 7: Disparate harms and benefits.  Treating all failure cases as equivalent fails to appreciate that different classes of errors often have very different impacts (Provost and Fawcett, 1997; Challen et al., 2019). In multiclass classification, severely offensive predictions (e.g., predicting an animal in an image of a person) are given the same weight as inoffensive ones. In regression tasks, insensitivity to either the direction of the difference ŷ − y or to the magnitude of y can result in evaluations being poor proxies for downstream impacts. (One common application use case of regression models is to apply a cutoff threshold to the predicted scalar values, for which both the direction of error and the magnitude of y are relevant.)

Assumption 6: Test Data Validity. Taken collectively, the previous assumptions might lead one to use accuracy as an evaluation metric for a classification task. Further assumptions can then be made in deciding how to estimate accuracy. The final assumption we discuss here is that the test data over which accuracy (or other metrics) is calculated provides a good estimate of the accuracy of the model when embedded in the ecosystem.

(9)   (1/N) Σ_i 1[y_i = ŷ_i]  ≈  (1/N*) Σ_j 1[y*_j = ŷ*_j],

where Y* = (y*_j) and Ŷ* = (ŷ*_j) are the ground truth labels and the model predictions on the test data, respectively, and N* is the number of test items.

Evaluation Gap 8: Data Drifts.  A simple model of the ecosystem’s data distributions is particularly risky when system feedback effects would cause the distributions of data in the ecosystem to diverge from those in the evaluation sample (Liu et al., 2018; Kannan et al., 2019). In general, this can lead to overconfidence in the system’s reliability, which can be exacerbated for regions in the tail of the input distribution.
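The following sketch illustrates the risk with hypothetical numbers: per-slice accuracies measured on an I.I.D. test set are re-weighted under an assumed (drifted) deployment mix, and the headline accuracy drops accordingly. In practice the deployment proportions would themselves be uncertain estimates.

```python
# A minimal sketch of how an I.I.D. test-set accuracy can mislead when the
# deployment distribution drifts. Slice accuracies and proportions are
# hypothetical; deployment proportions must be estimated in practice.
test_accuracy = {"head": 0.95, "tail": 0.60}   # per-slice accuracy on the test set
test_mix      = {"head": 0.90, "tail": 0.10}   # slice proportions in the test set
deploy_mix    = {"head": 0.60, "tail": 0.40}   # anticipated proportions in deployment

iid_estimate     = sum(test_accuracy[s] * test_mix[s] for s in test_accuracy)
shifted_estimate = sum(test_accuracy[s] * deploy_mix[s] for s in test_accuracy)

print(f"accuracy implied by the I.I.D. test set: {iid_estimate:.3f}")      # 0.915
print(f"accuracy re-weighted for deployment mix: {shifted_estimate:.3f}")  # 0.810
```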

4.2. Discussion

| Assumption                      | Considerations that might be Overlooked |
| Application-centric evaluation  | Opportunities for scientific insights. |
| Consequentialism                | Data sourcing and processing; invisible labour; consultation with impacted communities; motives; public acceptance; human rights. |
| Abstractability from Context    | System feedback loops; humans-in-the-loop. |
| Input Myopia                    | Disaggregated analyses; sensitivity analyses; safety-critical edge cases. |
| Quantitative Modeling           | Different flavors of impacts on a single person; different flavors of impacts across groups. |
| Equivalent Failures             | Severe failure cases; confusion matrices; topology of the prediction space. |
| Test Data Validity              | Data sampling biases; distribution shifts. |
Table 4. Sketch of how the six assumptions of Section 4—when taken collectively—compose to simplify the task of evaluating a model for an application (Eq. 1) to one of calculating accuracy over a data sample. A pseudo-formal notation (akin to pseudo-code) is used to enable rapid glossing of the main connections. Y and Ŷ denote the true (unobserved) distributions of ground truth and model predictions, respectively, while the variables Y* and Ŷ* denote the samples of reference labels and model predictions over which accuracy is calculated in practice. The order of the assumptions reflects an increasing focus on technical aspects of model evaluation, and a corresponding minimizing of non-technical aspects. Appendix C illustrates how each of the sets of considerations might apply in a hypothetical application of a computer vision model.

We have described six assumptions that simplify the model evaluation task. Taken together, they would cause one to believe—with compounding risks—that a model’s accuracy is a good proxy for its fitness for an application. We sketch this composition of assumptions in Table 4, along with questions that illustrate the gaps raised by each assumption. Our reason for teasing apart these assumptions and their compounding effects is not to attack the “strawman” of naive application-centric evaluations which rely solely on estimating model accuracy. Rather, our goal is to point out that most model evaluations, even sophisticated ones, make such assumptions to varying degrees. For example:

  • Some robustness evaluations (for surveys, see (Farahani et al., 2020; Wang et al., 2021)) explicitly tackle the problem of distribution shifts, rejecting the Assumption of Test Data Validity without questioning the other assumptions we have identified.

  • Some sensitivity evaluations consider the effect on the model predictions of small changes in the input, but use accuracy as an evaluation metric, rejecting the Input Myopia Assumption without questioning the others (Ribeiro et al., 2020).

  • Some fairness evaluations perform disaggregated analyses using the Recall or Precision metrics, sticking by all assumptions other than Input Myopia and Equivalent Failures (Chouldechova, 2017; Hardt et al., 2016).

It may not be possible to avoid all of the assumptions all of the time; nevertheless, unavoidable assumptions should be acknowledged and critically examined. The six assumptions we have identified also provide a lens for assessing the consistency of some evaluation metrics with other assumptions that have been made during the evaluation, for example:

  • Is the F-score consistent with a utilitarian evaluation framework? The F-score is mathematically a harmonic mean, which is often appropriate for averaging pairs of rates (e.g., two speeds). When applied to Precision and Recall, however, the F-score constitutes a peculiar averaging of “apples and oranges,” since, when conceived as rates, Precision and Recall measure rates of change of different quantities (Powers, 2014). The F-score is thus difficult to interpret within an evaluation framework that aims to maximize model utility.

  • Do threshold-free evaluations such as the Area Under the Receiver Operating Characteristic (AUROC) abstract too much of the deployment context? Since AUROC is calculated by averaging over a range of possible threshold values, it “cannot be interpreted as having any relevance to any particular classifier” (Powers, 2012b) (which is not to say AUROC is irrelevant to evaluating the learner, cf. Section 2, nor to a learned model’s propensity to correctly rank positive instances above negative ones). The same argument can be made for the Mean Average Precision metric used in image classification (see Appendix A). For useful application-centric evaluations, it is more meaningful to report pairs of values (for all classes) for a range of threshold values (Powers, 2012a).

In both cases, we ask whether such metrics are of limited utility in application-centric evaluations and whether they are better left to learner-centric ones.

5. Contextualizing Application-centric Model Evaluations

the ornithologists were forced to adapt their behavior (for the sake of “science”) to the most primitive evaluation method which was the only one considered or known, or else throw their data away. — Hampel (Hampel and Zurich, 1998)

When applications of ML models have the potential to impact human lives and livelihoods, thorough and reliable evaluations of models are critical. As discussed in Section 3, the different goals and values of academic ML research communities mean that research norms cannot be relied upon as guideposts for evaluating models for applications. In this section, we propose steps towards evaluations that are rigorous in their methods and aim to be humble about their epistemic uncertainties. In doing so, we expand on the call by Raji et al. to pay more attention not just to evaluation metric values but also to the quality and reliability of the measurements themselves, including sensitivity to external factors (Raji et al., 2021).

5.1. Minding the Gaps between Evaluation Goals and Research Practice

Documenting assumptions made during model evaluation is critical for transparency and enables more informed decisions. If an assumption is difficult to avoid in practice, consider augmenting the evaluation with signals that may shed complementary light on questions of concern. For example, even a handful of insightful comments from members of impacted communities can be an invaluable complement to evaluations using quantitative metrics. We now consider specific mitigation strategies for each of the gaps in turn.

Minding Gap 1: Evaluate More than Consequences. To reduce the gap introduced by the Consequentialism Assumption, evaluate the processes that led to the creation of the model, including how datasets were constructed (Scheuerman et al., 2021). We echo calls for more reflexivity around social and intentional factors around model development (Miceli et al., 2021), more documentation of the complete lifecycle of model development (Vogelsang and Borg, 2019; Hutchinson et al., 2021), and greater transparency around ML models and their datasets (Mitchell et al., 2019; Gebru et al., 2021; Bender and Friedman, 2018). It may be appropriate to contemplate whether the model is aligned with the virtues the organization aspires to (Vallor, 2016). Consider the question of whether any ML model could be a morally appropriate solution in this application context, e.g., whether it is appropriate to make decisions about one person on the basis of others’ behaviors (Eckhouse et al., 2019).

Minding Gap 2: Center Obligations. Since reasoning about uncertain future states of the world is fraught with challenges (Card and Smith, 2020), evaluations should consider indirect consequences and assess how the model upholds social obligations within the ecosystem. This may involve processes such as assessments of human rights, social and ethical impact (McGregor et al., 2019; Mantelero, 2018), audits of whether the ML system upholds the organization’s declared values or principles (Raji et al., 2020), and/or assessments of the potential for privacy leakage (e.g., (Yeom et al., 2018; Carlini et al., 2021)).

Minding Gap 3: Demarginalize the Context. To address the gap introduced by the Assumption of Abstractability from Context, consider externalities such as energy consumption (Henderson et al., 2020; Schwartz et al., 2020), as well as resource requirements (Ethayarajh and Jurafsky, 2020). It is important to think about how the human and technical parts of the system will interact (Selbst et al., 2019; Martin et al., 2020). Note that when substituting one model for another—or for displaced human labor—system stability can itself be a desirable property independent of model accuracy (and perhaps counter to tech industry discourses of “disruption” (Geiger, 2020)), and a range of metrics exist for comparing predictions with those of a legacy model (Derczynski, 2016). Care should be taken to avoid the “portability trap” of assuming that what is good for one context is good for another (Selbst et al., 2019). The more attention paid to the specifics of the application context, the better; hence, metrics which assume no particular classification threshold, such as AUC, may provide limited signal for any single context.

Minding Gap 4: Make Subjectivities Transparent. Acknowledge the subjectivities inherent in many tasks (Alm, 2011). An array of recent scholarship on subjectivity in ML has “embraced disagreement” through practices of explicitly modeling—in both the data model and the ML model—inter-subject variation in interpretations (Davani et al., 2022; Basile et al., 2021; Fornaciari et al., 2021; Díaz and Diakopoulos, 2019; Aroyo and Welty, 2015). For the purposes of ML model evaluations, disaggregating labels on test data according to the cultural and socio-demographic standpoints of their annotators enables more nuanced disaggregated evaluation statistics (Prabhakaran et al., 2021).

Minding Gap 5: Respect Differences Between Inputs. A realistic “null hypothesis” is that misclassifications affect people in the application ecosystem disparately. For example, people may differ both in their preferences regarding model predictions per se and in their preferences regarding model accuracy (Binns, 2018). (Footnote: Note that in many real-world applications the “ground truth” variable Y may be a convenient counterfactual fiction, since the system’s actions on the basis of the prediction Ŷ may inhibit Y from being realised; for example, a finance ML model may predict that a potential customer would default on a loan if given one, and hence the system the model is deployed in may prevent the customer getting a loan in the first place.) As such, and independent of fairness considerations, evaluations should routinely pay attention to different parts of the input distribution, including disaggregating along social subgroups. Special attention should be paid to the tail of the distribution and to outliers during evaluation, as these may require further analysis to diagnose the potential for rare but unsafe impacts. Input sensitivity testing can provide useful information about the sensitivity of the classifier to dimensions of input variation known to be of concern (e.g., gender in text (Zhao et al., 2018; Borkan et al., 2019; Gonen and Webster, 2020; Huang et al., 2020)).
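A minimal sketch of such an input sensitivity test follows; the templates, term pairs, tolerated gap, and the toy toxicity_score function are all hypothetical stand-ins for the application's own choices and for the model under evaluation.

```python
# A minimal counterfactual sensitivity check: perturb one identity term and
# compare predictions. Everything here is illustrative; templates, term pairs
# and the tolerated gap would be chosen per application.
TEMPLATES = ["I am a {} person.", "My neighbour is {}."]
TERM_PAIRS = [("young", "old"), ("deaf", "hearing")]
GAP_THRESHOLD = 0.05   # tolerated score difference; application-specific

def toxicity_score(text: str) -> float:
    # Placeholder standing in for the model under evaluation; replace with a
    # real inference call. Here: a crude keyword heuristic, for illustration only.
    return 0.9 if "stupid" in text.lower() else 0.1

def sensitivity_report():
    for template in TEMPLATES:
        for term_a, term_b in TERM_PAIRS:
            gap = abs(toxicity_score(template.format(term_a))
                      - toxicity_score(template.format(term_b)))
            flag = "FLAG" if gap > GAP_THRESHOLD else "ok"
            print(f"{flag}  gap={gap:.3f}  {template!r}  {term_a} vs {term_b}")

sensitivity_report()
```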

Minding Gap 6: Think Beyond Scalar Utility. Resist the temptation to reduce a model’s utility to a single scalar value, either for stack ranking (Ethayarajh and Jurafsky, 2020) or to simplify the cognitive load on decision makers. Instead, include a range of different metrics and evaluation distributions in the evaluation (Mitchell et al., 2019). Acknowledge and report epistemic uncertainty, e.g., the effects of missing data or measurement and sampling error on metrics. Acknowledge qualitative impacts that are not addressed by metrics (e.g., harms to application users caused by supplanting socially meaningful human interactions), and rigorously assess the validity of attempts to measure social or emotional harms. Be conservative in aggregations: consider plotting data rather than reporting summary statistics (cf. Anscombe’s quartet); do not aggregate unlike quantities; report multiple estimates of central tendency and variation; and don’t assume that all users of an application will have equal benefits (or harms) from system outcomes. Consider applying aggregation and partial ranking techniques from the fair division literature to ML models, including techniques that give greater weight to those with the worst outcomes (e.g., in the extreme case, “Maximin”) (Endriss, 2018).
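As one concrete way of reporting sampling uncertainty rather than a bare point estimate, the sketch below computes a bootstrap confidence interval for accuracy from invented per-item outcomes.

```python
# A minimal bootstrap sketch for reporting sampling uncertainty alongside a
# point estimate of accuracy. The 0/1 per-item outcomes below are invented.
import random

random.seed(0)
correct = [1] * 87 + [0] * 13          # invented outcomes (n=100, accuracy=0.87)

def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05):
    stats = sorted(
        sum(random.choices(outcomes, k=len(outcomes))) / len(outcomes)
        for _ in range(n_resamples)
    )
    lo = stats[int(alpha / 2 * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

low, high = bootstrap_ci(correct)
print(f"accuracy = {sum(correct)/len(correct):.2f}, "
      f"95% bootstrap CI ≈ [{low:.2f}, {high:.2f}]")
```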

Minding Gap 7: Respect Differences Between Failures. If the harms of false positives and false negatives are incommensurable, report them separately. If commensurable, weight each appropriately. For multiclass classifiers, this approach generalizes to a classification cost matrix (Turney, 1994) and, more generally, to reporting the confusion matrix before costs are assigned; for regression tasks, report metrics such as MSE disaggregated by buckets of the error ŷ − y.
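The following sketch contrasts plain accuracy with a cost-weighted evaluation using a hypothetical cost matrix; the records and costs are invented and would need to come from the application context.

```python
# A sketch of weighting failure modes instead of treating them as equivalent.
# The cost matrix and evaluation records are hypothetical.
labels = ["no_pedestrian", "pedestrian"]

# cost[true][predicted]: a missed pedestrian is far costlier than a false alarm.
cost = {
    "pedestrian":    {"pedestrian": 0.0, "no_pedestrian": 100.0},
    "no_pedestrian": {"pedestrian": 1.0, "no_pedestrian": 0.0},
}

# Invented evaluation records: (ground truth, prediction).
records = ([("pedestrian", "pedestrian")] * 40
           + [("pedestrian", "no_pedestrian")] * 2
           + [("no_pedestrian", "pedestrian")] * 8
           + [("no_pedestrian", "no_pedestrian")] * 50)

accuracy = sum(y == y_hat for y, y_hat in records) / len(records)
expected_cost = sum(cost[y][y_hat] for y, y_hat in records) / len(records)

print(f"accuracy: {accuracy:.2f}")                     # 0.90: looks fine
print(f"mean cost per decision: {expected_cost:.2f}")  # dominated by the 2 misses
```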

Minding Gap 8: Validate Quality of Test Data. For transparency, do not assume it is obvious to others which datasets are used in training and evaluation; instead, be explicit about the provenance, distribution, and known biases of the datasets in use (Andrus et al., 2021). Consider Bayesian approaches to dealing with uncertainty about data distributions (McNair, 2018; Ji et al., 2020; Lacoste et al., 2017), especially when sample sizes are small or prior work has revealed systematic biases. For example, an evaluation which uses limited data in a novel domain (or in an under-studied language) to investigate gender biases in pronoun resolution should be tentative in drawing strong positive conclusions about “fairness” due to abundant evidence of gender biases in English pronoun resolution models (e.g. (Webster et al., 2019)).
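Continuing the small-sample theme, a minimal Bayesian sketch along these lines places a Beta posterior over an error rate and reports a credible interval by Monte Carlo sampling; the counts and the uniform prior are illustrative assumptions, not recommendations.

```python
# Place a Beta posterior over a model's error rate on a small probe set and
# read off a credible interval by Monte Carlo, rather than reporting a bare
# point estimate. Counts and the Beta(1, 1) prior are illustrative assumptions.
import random

random.seed(0)
errors, trials = 1, 20                 # e.g., 1 observed failure in 20 probe items
prior_a, prior_b = 1.0, 1.0            # uniform prior over the error rate

samples = sorted(
    random.betavariate(prior_a + errors, prior_b + trials - errors)
    for _ in range(20000)
)
lo, hi = samples[int(0.025 * len(samples))], samples[int(0.975 * len(samples))]
print(f"posterior 95% credible interval for the error rate: [{lo:.3f}, {hi:.3f}]")
# A bare point estimate of 0.05 would overstate certainty given only 20 items.
```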

5.2. Alternate Model Evaluation Methodologies

More radical excursions from the disciplinary paradigm are often worth considering, especially in scenarios with high stakes or high uncertainty.

Evaluation Remits. In 1995, Sparck Jones and Galliers called for a careful approach to NLP evaluation that is broadly applicable to ML model evaluations (see Appendix D) (Jones and Galliers, 1995). Their approach involves a top-down examination of the context and goal of the evaluation before the evaluation design even begins, and their call for careful documentation of the evaluation “remit”—i.e., official responsibilities—is in line with more recent work calling for stakeholder transparency for ML (Raji et al., 2020; Hutchinson et al., 2021). They advocate for establishing whose perspectives are adopted in the evaluation and whose interests prompted it. Appendix D sketches how Sparck Jones and Galliers’ framework could be adopted for ML model evaluations.

Active Testing. Active Testing aims to iteratively choose new test items that are most informative in addressing the goals of the evaluation (Kossen et al., 2021; Ha et al., 2021) (cf. its cousin Active Learning, which selects items that are informative for the learner). Active Testing provides a better estimate of model performance than using the same number of test instances sampled I.I.D. Exploring Active Testing in pursuit of fairness testing goals seems a promising direction for future research.
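A deliberately naive sketch of the acquisition step is shown below: it ranks a hypothetical unlabeled pool by predictive entropy and sends the most uncertain items for labelling. The cited methods additionally correct the resulting performance estimate for the bias this selection introduces; that correction is not shown here.

```python
# Naive acquisition sketch for Active Testing: label the pool items the model
# is least certain about first. This only illustrates the selection step; the
# cited work also re-weights the labelled sample to keep estimates unbiased.
# `predict_proba` and the pool are hypothetical.
import math

def entropy(p: float) -> float:
    eps = 1e-12
    return -(p * math.log(p + eps) + (1 - p) * math.log(1 - p + eps))

def predict_proba(item) -> float:
    # Hypothetical stand-in for the model's P(positive | item).
    return item["score"]

pool = [{"id": i, "score": s} for i, s in enumerate([0.02, 0.97, 0.55, 0.48, 0.91])]
budget = 2

to_label = sorted(pool, key=lambda it: entropy(predict_proba(it)), reverse=True)[:budget]
print("send for labelling:", [it["id"] for it in to_label])   # the items near 0.5
```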

Adversarial Testing. In many cases, there is great uncertainty regarding an application deployment context. One cautious and conservative approach—especially in the face of great uncertainty—is to simulate “adversaries” trying to provoke harmful outcomes from the system. Borrowing adversarial techniques from security testing and privacy testing, adversarial testing of models requires due diligence to trigger the most harmful model predictions, using either manually chosen or algorithmically generated test instances (Ruiz et al., 2022; Zeng et al., 2021; Zhang et al., 2020b; Ettinger et al., 2017).
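A small sketch of this idea follows: a brute-force search over a hypothetical template grid surfaces the inputs that provoke the most harmful score, with a placeholder scoring function standing in for the model under test.

```python
# A sketch of adversarial testing by brute-force search over a template grid,
# surfacing the inputs that provoke the most harmful prediction. The scoring
# function is a hypothetical stand-in for the model under test.
from itertools import product

TEMPLATES = ["{} people are {}", "all {} folks are {}"]
GROUPS = ["tall", "left-handed", "rural"]
ATTRIBUTES = ["friendly", "terrible", "brilliant"]

def harm_score(text: str) -> float:
    # Placeholder for the model under evaluation (e.g., a toxicity classifier).
    return 0.95 if "terrible" in text else 0.05

candidates = [t.format(g, a) for t, g, a in product(TEMPLATES, GROUPS, ATTRIBUTES)]
worst = sorted(candidates, key=harm_score, reverse=True)[:3]
for text in worst:
    print(f"{harm_score(text):.2f}  {text!r}")
```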

Multidimensional Comparisons. When comparing candidate models, avoid the “Leaderboardism Trap” of believing that a total ordering of candidates is possible. A multidimensional and nuanced evaluation may provide at best a partial ordering of candidate models, and it may require careful and accountable judgement and qualitative considerations to decide among them. The Fair Division literature on Social Welfare Orderings may be a promising direction for developing evaluation frameworks that prioritize “egalitarian” considerations, in which greater weighting is given to those who are worst impacted by a model (Endriss, 2018).
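The sketch below illustrates why only a partial ordering may be available: with three invented metric profiles, one model Pareto-dominates another, while a further pair remains incomparable and requires judgement to choose between.

```python
# A sketch of a multidimensional comparison that yields only a partial order:
# model B Pareto-dominates model C, but neither A nor B dominates the other.
# Metric values are invented; higher is assumed better for every dimension.
models = {
    "A": {"accuracy": 0.91, "worst_group_accuracy": 0.70, "energy_efficiency": 0.40},
    "B": {"accuracy": 0.88, "worst_group_accuracy": 0.82, "energy_efficiency": 0.90},
    "C": {"accuracy": 0.86, "worst_group_accuracy": 0.80, "energy_efficiency": 0.85},
}

def dominates(x: dict, y: dict) -> bool:
    return all(x[k] >= y[k] for k in x) and any(x[k] > y[k] for k in x)

for a in models:
    for b in models:
        if a != b and dominates(models[a], models[b]):
            print(f"{a} Pareto-dominates {b}")
# Prints only "B Pareto-dominates C"; A and B remain incomparable.
```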

5.3. Evaluation-driven ML Methodologies

In this section, we follow Rostamzadeh et al. in drawing inspiration from test-driven practices, such as those of software development (Rostamzadeh et al., 2021). Traditional software testing involves significant time, resources, and effort (Harrold, 2000); even moderate-sized software projects spend hundreds of person-hours writing test cases, implementing them, and meticulously documenting the test results. In fact, software testing is sometimes considered an art (Myers et al., 2011) requiring its own technical and non-technical skills (Sánchez-Gordón et al., 2020; Matturro, 2013), and entire career paths are built around testing (Cunningham et al., 2019). Test-driven development, often associated with agile software engineering frameworks, integrates testing considerations in all parts of the development process (Astels, 2003; George and Williams, 2004). These processes rely on a deep understanding of software requirements and user behavior to anticipate failure modes during deployment and to expand the test suite. (In contrast, ML testing is often relegated to a small portion of the ML development cycle, and predominantly focuses on a static snapshot of data to provide performance guarantees.) These software testing methodologies provide a model for ML testing. First, this model suggests anticipating, planning for, and integrating testing in all stages of the development cycle, including research problem ideation, the setting of objectives, and system implementation. Second, it suggests building a practice around bringing diverse perspectives into designing the test suite. Additionally, consider participatory approaches (e.g., (Martin et al., 2020)) to ensure that the test suite accounts for societal contexts and embedded values within which the ML system will be deployed.
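To make the analogy concrete, the following pytest-style sketch treats agreed-upon model behaviours as unit tests; the dummy model, the expectations, and the invariance check are hypothetical placeholders for suites built with stakeholder input (pytest itself is an assumed dependency).

```python
# A sketch of treating model behaviours as unit tests in the spirit of
# test-driven development. The model and expectations are hypothetical.
import pytest

class DummySentimentModel:
    # Stand-in for a real model loader so the example is self-contained.
    def predict(self, text: str) -> str:
        return "negative" if "awful" in text.lower() else "positive"

def load_model():
    return DummySentimentModel()

@pytest.fixture(scope="module")
def model():
    return load_model()

def test_obvious_negative_case(model):
    assert model.predict("The service was awful.") == "negative"

def test_prediction_invariant_to_name_swap(model):
    # An invariance ("metamorphic") expectation agreed with stakeholders.
    assert model.predict("Alex said the food was awful.") == \
           model.predict("Maria said the food was awful.")
```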

An important principle in test-driven software development is visibility into the test data. Typically, engineers working on a system can not only see the details of the test suites but also often develop those test suites themselves. In contrast, the paradigm of ML evaluation methodologies is that the ML practitioner should not inspect the test data, lest their observations result in design decisions that produce an overfitted model. How, then, can these two methodologies be reconciled? We believe that incentives are one important consideration. In the ML research community, the “competition mindset” might indeed lead to “cheating” via deliberate overfitting. In contrast, in real-world applications model developers might benefit from a healthy model ecosystem, for example when they are members of that ecosystem. (However, when developers come from a different society altogether there may be disinterest or misalignment (Sambasivan et al., 2021a).)

Software testing produces artifacts such as execution traces and test coverage information (Harrold, 2000). Developing practices for routinely sharing testing artifacts with stakeholders enables more robust scrutiny and diagnosis of harmful error cases (Raji et al., 2020). In being flexible enough to adapt to the information needs of stakeholders, software testing artifacts can be considered a form of boundary object (Star and Griesemer, 1989). Within an ML context, these considerations point towards adopting ML transparency mechanisms that incorporate comprehensive evaluations, such as model cards (Mitchell et al., 2019). The processes that go into building test cases should be documented, so that consumers of the ML system can better understand the system’s reliability. Finally, as for any high-stakes system—software, ML or otherwise—evaluation documentation constitutes an important part of the chain of auditable artifacts required for robust accountability and governance practices (Raji et al., 2020).
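One lightweight way to make such artifacts shareable is to give them an explicit schema. The sketch below is one assumption about what a minimal, machine-readable evaluation record might contain; the field names and the example values (which reuse the hypothetical grocery classifier of Appendix C) are illustrative, not a proposed standard.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EvaluationArtifact:
    """A shareable record of one model evaluation (illustrative schema)."""
    model_id: str
    test_set_provenance: str                      # how test data was collected/labelled
    metrics: Dict[str, float]                     # aggregate scores
    disaggregated: Dict[str, Dict[str, float]]    # scores per subgroup or slice
    known_failure_modes: List[str] = field(default_factory=list)
    intended_contexts: List[str] = field(default_factory=list)


card = EvaluationArtifact(
    model_id="grocery-classifier-v3",
    test_set_provenance="images contributed by blind users, with documented consent",
    metrics={"accuracy": 0.91},
    disaggregated={"low-light images": {"accuracy": 0.78}},
    known_failure_modes=["confuses visually similar allergen-bearing products"],
    intended_contexts=["pantry-assistance application"],
)
print(card)
```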

6. Conclusions

In this paper, we compared the evaluation practices of the ML research community to the ideal information needs of those who use models in real-world applications. The observed disconnect between the two is likely due to differences in motivations and goals, to pressures to demonstrate “state of the art” performance on shared tasks, metrics and leaderboards (Ethayarajh and Jurafsky, 2020; Koch et al., 2021; Thomas and Uminsky, 2020), and to a focus on the learner as the object upon which the researcher hopes to shed light. One limitation of our methodology is its reliance on published papers, and we encourage more human subjects research in the future, in a similar vein to e.g. (Holstein et al., 2019; Sambasivan et al., 2021b; Madaio et al., 2022). We identified a range of evaluation gaps that risk being overlooked if the ML research community’s evaluation practices are uncritically adopted for applications, and identified six assumptions that would have to be valid for these gaps to be safely ignored. The assumptions range from a broad commitment to consequentialism to technical concerns regarding the distributions of evaluation data. By presenting these assumptions as a coherent framework, we not only provide a set of mitigations for each evaluation gap but also demonstrate the relationships between these mitigations. We show how, in the naive case, these assumptions chain together, leading to the coarsest assumption of all: that calculating model accuracy on data I.I.D. with the training data can be a reliable signal for real-world applications. We contrast the practices of ML model evaluation with the mature engineering practices of software testing to draw out lessons for non-I.I.D. testing under a variety of stress conditions and failure severities. Another limitation of our analysis is that it is largely domain-agnostic, and we hope to stimulate investigations of assumptions and gaps for specific application domains. We believe it is fundamental that model developers be explicit about the methodological assumptions in their evaluations, and that ML model evaluations have great potential to enable interpretation and use by different technical and non-technical communities (Star and Griesemer, 1989). By naming each assumption we identify and exploring its technical and sociological consequences, we hope to encourage more robust interdisciplinary debate and, ultimately, to nudge model evaluation practice away from abundant opaque unknowns.

Acknowledgements.
We acknowledge useful feedback from Daniel J. Barrett, Alexander D’Amour, Stephen Pfohl, D. Sculley, and the anonymous reviewers.

References

  • F. Alam, S. Joty, and M. Imran (2018) Domain adaptation with adversarial training and graph embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1077–1087. Cited by: footnote 4.
  • C. O. Alm (2011) Subjective natural language problems: motivations, applications, characterizations, and implications. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 107–112. Cited by: §5.1.
  • D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané (2016) Concrete problems in ai safety. arXiv preprint arXiv:1606.06565. Cited by: §4.1.
  • A. J. Andreotta, N. Kirkham, and M. Rizzi (2021) AI, big data, and the future of consent. AI & Society, pp. 1–14. Cited by: §4.1.
  • M. Andrus and T. K. Gilbert (2019) Towards a just theory of measurement: a principled social measurement assurance program for machine learning. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 445–451. Cited by: §2, §4.1.
  • M. Andrus, E. Spitzer, J. Brown, and A. Xiang (2021) What we can’t measure, we can’t understand: challenges to demographic data procurement in the pursuit of fairness. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 249–260. Cited by: §4.1, §5.1.
  • L. Aroyo and C. Welty (2015) Truth is a lie: crowd truth and the seven myths of human annotation. AI Magazine 36 (1), pp. 15–24. Cited by: §4.1, §5.1.
  • D. Astels (2003) Test driven development: a practical guide. Prentice Hall Professional Technical Reference. Cited by: §5.3.
  • S. Barocas, A. Guo, E. Kamar, J. Krones, M. R. Morris, J. W. Vaughan, D. Wadsworth, and H. Wallach (2021) Designing disaggregated evaluations of ai systems: choices, considerations, and tradeoffs. arXiv preprint arXiv:2103.06076. Cited by: §4.1.
  • S. Barocas, M. Hardt, and A. Narayanan (2017) Fairness in machine learning. NIPS tutorial 1, pp. 2017. Cited by: §4.1.
  • R. Barthes (1977) Image-Music-Text. Macmillan. Cited by: §4.1.
  • V. Basile, F. Cabitza, A. Campagner, and M. Fell (2021) Toward a perspectivist turn in ground truthing for predictive computing. arXiv preprint arXiv:2109.04270. Cited by: §5.1.
  • E. M. Bender and B. Friedman (2018) Data statements for natural language processing: toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics 6, pp. 587–604. Cited by: §5.1.
  • E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell (2021) On the dangers of stochastic parrots: can language models be too big?. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 610–623. Cited by: §3.3.
  • Y. Bengio, Y. Lecun, and G. Hinton (2021) Deep learning for ai. Communications of the ACM 64 (7), pp. 58–65. Cited by: §3.3, §4.1.
  • J. Berger (2008) Ways of seeing. Penguin UK. Cited by: §4.1.
  • R. Binns (2018) Fairness in machine learning: lessons from political philosophy. In Conference on Fairness, Accountability and Transparency, pp. 149–159. Cited by: §5.1.
  • A. Birhane, P. Kalluri, D. Card, W. Agnew, R. Dotan, and M. Bao (2021) The values encoded in machine learning research. arXiv preprint arXiv:2106.15590. Cited by: §1.
  • D. Blasi, A. Anastasopoulos, and G. Neubig (2021) Systematic inequalities in language technology performance across the world’s languages. arXiv preprint arXiv:2110.06733. Cited by: §4.1.
  • R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, et al. (2021) On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258. Cited by: §3.2.
  • D. Borkan, L. Dixon, J. Sorensen, N. Thain, and L. Vasserman (2019) Nuanced metrics for measuring unintended bias with real data for text classification. In Companion proceedings of the 2019 world wide web conference, pp. 491–500. Cited by: §5.1.
  • S. Bowman and G. Dahl (2021) What will it take to fix benchmarking in natural language understanding?. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4843–4855. Cited by: §1.
  • E. Breck, S. Cai, E. Nielsen, M. Salib, and D. Sculley (2017) The ml test score: a rubric for ml production readiness and technical debt reduction. In 2017 IEEE International Conference on Big Data (Big Data), pp. 1123–1132. Cited by: §1.
  • L. Breiman (2001) Statistical modeling: the two cultures (with comments and a rejoinder by the author). Statistical science 16 (3), pp. 199–231. Cited by: §2.
  • B. Brewster (1881) The Yale Literary Magazine October 1881–June 1882. Cited by: §4.
  • W. Bulleit, J. Schmidt, I. Alvi, E. Nelson, and T. Rodriguez-Nikl (2015) Philosophy of engineering: what it is and why it matters. Journal of Professional Issues in Engineering Education and Practice 141 (3), pp. 02514003. Cited by: §2, §3.3.
  • R. Bunescu and Y. Huang (2010) A utility-driven approach to question ranking in social qa. In Proceedings of The 23rd International Conference on Computational Linguistics (COLING 2010), pp. 125–133. Cited by: §4.1.
  • E. Cai, D. Juan, D. Stamoulis, and D. Marculescu (2017) NeuralPower: predict and deploy energy-efficient convolutional neural networks. In Asian Conference on Machine Learning, pp. 622–637. Cited by: §4.1.
  • D. Card and N. A. Smith (2020) On consequentialism and fairness. Frontiers in Artificial Intelligence 3, pp. 34. Cited by: §4.1, §4.1, §5.1.
  • N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, U. Erlingsson, et al. (2021) Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pp. 2633–2650. Cited by: §5.1.
  • B. Carter, S. Jain, J. W. Mueller, and D. Gifford (2021) Overinterpretation reveals image classification model pathologies. Advances in Neural Information Processing Systems 34. Cited by: §1.
  • R. Challen, J. Denny, M. Pitt, L. Gompels, T. Edwards, and K. Tsaneva-Atanasova (2019) Artificial intelligence, bias and clinical safety. BMJ Quality & Safety 28 (3), pp. 231–237. Cited by: §4.1.
  • J. I. Charlton (1998) Nothing about us without us. University of California Press. Cited by: §4.1.
  • M. Chen, K. Goel, N. S. Sohoni, F. Poms, K. Fatahalian, and C. Ré (2021) Mandoline: model evaluation under distribution shift. In International Conference on Machine Learning, pp. 1617–1629. Cited by: §1.
  • P. Chiril, V. Moriceau, F. Benamara, A. Mari, G. Origgi, and M. Coulomb-Gully (2020) He said “who’s gonna take care of your children when you are at acl?”: reported sexist acts are not sexist. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4055–4066. Cited by: §3.2.
  • A. Chohlas-Wood, M. Coots, E. Brunskill, and S. Goel (2021) Learning to be fair: a consequentialist approach to equitable decision-making. arXiv preprint arXiv:2109.08792. Cited by: §4.1.
  • A. Chouldechova (2017) Fair prediction with disparate impact: a study of bias in recidivism prediction instruments. Big data 5 (2), pp. 153–163. Cited by: 3rd item.
  • S. Corbett-Davies and S. Goel (2018) The measure and mismeasure of fairness: a critical review of fair machine learning. arXiv preprint arXiv:1808.00023. Cited by: §4.1.
  • S. Corbett-Davies, E. Pierson, A. Feller, S. Goel, and A. Huq (2017) Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd acm sigkdd international conference on knowledge discovery and data mining, pp. 797–806. Cited by: §4.1, footnote 5.
  • K. Crawford and V. Joler (2018) Anatomy of an ai system. Note: (Accessed January, 2022) Cited by: §4.1.
  • K. Crawford and T. Paglen (2021) Excavating ai: the politics of images in machine learning training sets. AI & SOCIETY, pp. 1–12. Cited by: §2, §4.1.
  • S. Cunningham, J. Gambo, A. Lawless, D. Moore, M. Yilmaz, P. M. Clarke, and R. V. O’Connor (2019) Software testing: a changing career. In European Conference on Software Process Improvement, pp. 731–742. Cited by: §5.3.
  • A. D’Amour, K. Heller, D. Moldovan, B. Adlam, B. Alipanahi, A. Beutel, C. Chen, J. Deaton, J. Eisenstein, M. D. Hoffman, et al. (2020) Underspecification presents challenges for credibility in modern machine learning. arXiv preprint arXiv:2011.03395. Cited by: §1.
  • E. Dahlin (2021) Mind the gap! on the future of AI research. Humanities and Social Sciences Communications 8 (1), pp. 1–4. Cited by: §1.
  • A. M. Davani, M. Díaz, and V. Prabhakaran (2022) Dealing with disagreements: looking beyond the majority vote in subjective annotations. Transactions of the Association for Computational Linguistics 10, pp. 92–110. Cited by: §5.1.
  • H. De Vries, D. Bahdanau, and C. Manning (2020) Towards ecologically valid research on language user interfaces. arXiv preprint arXiv:2007.14435. Cited by: §2.
  • L. Derczynski (2016) Complementarity, f-score, and nlp evaluation. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pp. 261–266. Cited by: §1, §5.1.
  • M. Díaz and N. Diakopoulos (2019) Whose walkability?: Challenges in algorithmically measuring subjective experience. Proceedings of the ACM on Human-Computer Interaction 3 (CSCW), pp. 1–22. Cited by: §5.1.
  • L. Eckhouse, K. Lum, C. Conti-Cook, and J. Ciccolini (2019) Layers of bias: a unified approach for understanding problems with risk assessment. Criminal Justice and Behavior 46 (2), pp. 185–209. Cited by: §4.1, §5.1.
  • U. Endriss (2018) Lecture notes on fair division. arXiv preprint arXiv:1806.04234. Cited by: §5.1, §5.2.
  • K. Ethayarajh and D. Jurafsky (2020) Utility is in the eye of the user: a critique of nlp leaderboards. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4846–4853. Cited by: §1, §4.1, §4.1, §5.1, §5.1, §6.
  • A. Ettinger, S. Rao, H. Daumé III, and E. M. Bender (2017) Towards linguistically generalizable nlp systems: a workshop and shared task. arXiv preprint arXiv:1711.01505. Cited by: §5.2.
  • U. Evci, V. Dumoulin, H. Larochelle, and M. C. Mozer (2021) Head2Toe: utilizing intermediate representations for better ood generalization. Cited by: §3.2.
  • A. Farahani, S. Voghoei, K. Rasheed, and H. R. Arabnia (2020) A brief review of domain adaptation. arXiv preprint arXiv:2010.03978. Cited by: 1st item.
  • T. Fornaciari, A. Uma, S. Paun, B. Plank, D. Hovy, and M. Poesio (2021) Beyond black & white: leveraging annotator disagreement via soft-label multi-task learning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2591–2597. Cited by: §5.1.
  • D. Forsythe (2001a) Studying those who study us: an anthropologist in the world of artificial intelligence. Cited by: §2, §3.
  • D. Forsythe (2001b) Studying those who study us: an anthropologist in the world of artificial intelligence. Cited by: §4.1.
  • S. A. Friedler, C. Scheidegger, and S. Venkatasubramanian (2021) The (im) possibility of fairness: different value systems require different mechanisms for fair decision making. Communications of the ACM 64 (4), pp. 136–143. Cited by: §4.1.
  • B. Fu, C. Chen, O. Henniger, and N. Damer (2022) A deep insight into measuring face image utility with general and face-specific image quality metrics. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 905–914. Cited by: §4.1.
  • E. García-Martín, C. F. Rodrigues, G. Riley, and H. Grahn (2019) Estimation of energy consumption in machine learning. Journal of Parallel and Distributed Computing 134, pp. 75–88. Cited by: §4.1, §4.1.
  • S. Garg, V. Perot, N. Limtiaco, A. Taly, E. H. Chi, and A. Beutel (2019) Counterfactual fairness in text classification through robustness. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 219–226. Cited by: §2.
  • T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. D. Iii, and K. Crawford (2021) Datasheets for datasets. Communications of the ACM 64 (12), pp. 86–92. Cited by: §5.1.
  • C. Geertz (1973) The interpretation of cultures. Basic Books. Cited by: §4.1.
  • S. Geiger (2020) Silicon valley, disruption, and the end of uncertainty. Journal of cultural economy 13 (2), pp. 169–184. Cited by: §5.1.
  • B. George and L. Williams (2004) A structured experiment of test-driven development. Information and software Technology 46 (5), pp. 337–342. Cited by: §5.3.
  • H. Gonen and K. Webster (2020) Automatically identifying gender issues in machine translation using perturbations. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 1991–1995. Cited by: §5.1.
  • M. L. Gray and S. Suri (2019) Ghost work: how to stop silicon valley from building a new global underclass. Eamon Dolan Books. Cited by: §4.1.
  • B. Green (2020) Data science as political action: grounding data science in a politics of justice. Available at SSRN 3658431. Cited by: §4.1.
  • H. Ha, S. Gupta, S. Rana, and S. Venkatesh (2021) ALT-mas: a data-efficient framework for active testing of machine learning algorithms. arXiv preprint arXiv:2104.04999. Cited by: §5.2.
  • F. Hampel and E. Zurich (1998) Is statistics too difficult?. Canadian Journal of Statistics 26 (3), pp. 497–513. Cited by: §5.
  • M. Hardt, E. Price, and N. Srebro (2016) Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems 29, pp. 3315–3323. Cited by: 3rd item.
  • M. J. Harrold (2000) Testing: a roadmap. In Proceedings of the Conference on the Future of Software Engineering, pp. 61–72. Cited by: §5.3, §5.3.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.2.
  • C. Heldreth, M. Lahav, Z. Mengesha, J. Sublewski, and E. Tuennerman (2021) “I don’t think these devices are very culturally sensitive.”—the impact of errors on african americans in automated speech recognition. Frontiers in Artificial Intelligence 26. Cited by: §4.1.
  • P. Henderson, J. Hu, J. Romoff, E. Brunskill, D. Jurafsky, and J. Pineau (2020) Towards the systematic reporting of the energy and carbon footprints of machine learning. Journal of Machine Learning Research 21 (248), pp. 1–43. Cited by: §5.1.
  • B. Hepp, D. Dey, S. N. Sinha, A. Kapoor, N. Joshi, and O. Hilliges (2018) Learn-to-score: efficient 3D scene exploration by predicting view utility. In Proceedings of the European conference on computer vision (ECCV), pp. 437–452. Cited by: §4.1.
  • K. Holstein, J. Wortman Vaughan, H. Daumé III, M. Dudik, and H. Wallach (2019) Improving fairness in machine learning systems: what do industry practitioners need?. In Proceedings of the 2019 CHI conference on human factors in computing systems, pp. 1–16. Cited by: §1, §6.
  • J. N. Hooker (1995) Testing heuristics: we have it all wrong. Journal of Heuristics 1 (1), pp. 33–42. Cited by: §2.
  • D. Hovy and S. L. Spruit (2016) The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 591–598. Cited by: §4.1.
  • P. Huang, H. Zhang, R. Jiang, R. Stanforth, J. Welbl, J. Rae, V. Maini, D. Yogatama, and P. Kohli (2020) Reducing sentiment bias in language models via counterfactual evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 65–83. Cited by: §5.1.
  • B. Hutchinson and M. Mitchell (2019) 50 years of test (un) fairness: lessons for machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 49–58. Cited by: footnote 1.
  • B. Hutchinson, A. Smart, A. Hanna, E. Denton, C. Greer, O. Kjartansson, P. Barnes, and M. Mitchell (2021) Towards accountability for machine learning datasets: practices from software engineering and infrastructure. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 560–575. Cited by: §5.1, §5.2.
  • M. Idahl, L. Lyu, U. Gadiraju, and A. Anand (2021) Towards benchmarking the utility of explanations for model debugging. In Proceedings of the First Workshop on Trustworthy Natural Language Processing, pp. 68–73. Cited by: §4.1.
  • IEEE (2019) The IEEE Global Initiative on Ethics of Autonomous and Intelligent Systems. “Classical Ethics in A/IS”. In Ethically Aligned Design: A Vision for Prioritizing Human Well-being with Autonomous and Intelligent Systems, First Edition, pp. 36–67. Cited by: §4.1.
  • A. Z. Jacobs, S. L. Blodgett, S. Barocas, H. Daumé III, and H. Wallach (2020) The meaning and measurement of bias: lessons from natural language processing. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 706–706. Cited by: §2.
  • A. Z. Jacobs (2021) Measurement as governance in and for responsible ai. arXiv preprint arXiv:2109.05658. Cited by: §2, §4.1.
  • Y. Jafarian and H. S. Park (2021) Learning high fidelity depths of dressed humans by watching social media dance videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12753–12762. Cited by: §3.2.
  • N. Japkowicz (2006) Why question machine learning evaluation methods. In AAAI workshop on evaluation methods for machine learning, pp. 6–11. Cited by: §1.
  • T. Jappy (2013) Introduction to peircean visual semiotics. A&C Black. Cited by: §4.1.
  • D. Ji, P. Smyth, and M. Steyvers (2020) Can i trust my fairness metric? assessing fairness with unlabeled data and bayesian inference. Advances in Neural Information Processing Systems 33, pp. 18600–18612. Cited by: §5.1.
  • K. S. Jones and J. R. Galliers (1995) Evaluating natural language processing systems: an analysis and review. Vol. 1083, Springer Science & Business Media. Cited by: §2, §5.2, Table 7.
  • S. Kannan, A. Roth, and J. Ziani (2019) Downstream effects of affirmative action. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 240–248. Cited by: §4.1.
  • B. Koch, E. Denton, A. Hanna, and J. G. Foster (2021) Reduced, reused and recycled: the life of a dataset in machine learning research. NeurIPS Dataset & Benchmark track. Cited by: §1, §3.2, §3, §6.
  • P. W. Koh, S. Sagawa, H. Marklund, S. M. Xie, M. Zhang, A. Balsubramani, W. Hu, M. Yasunaga, R. L. Phillips, S. Beery, J. Leskovec, A. Kundaje, E. Pierson, S. Levine, C. Finn, and P. Liang (2020) WILDS: a benchmark of in-the-wild distribution shifts. CoRR abs/2012.07421. External Links: Link Cited by: §1.
  • J. Kossen, S. Farquhar, Y. Gal, and T. Rainforth (2021) Active testing: sample-efficient model evaluation. In International Conference on Machine Learning, pp. 5753–5763. Cited by: §5.2.
  • T. Kukutai and J. Taylor (2016) Indigenous data sovereignty: toward an agenda. ANU press. Cited by: §4.1.
  • H. Kuwajima, H. Yasuoka, and T. Nakae (2020) Engineering problems in machine learning systems. Machine Learning 109 (5), pp. 1103–1126. Cited by: §1.
  • A. Lacoste, T. Boquet, N. Rostamzadeh, B. Oreshkin, W. Chung, and D. Krueger (2017) Deep prior. arXiv preprint arXiv:1712.05016. Cited by: §5.1.
  • A. Lacoste, B. Oreshkin, W. Chung, T. Boquet, N. Rostamzadeh, and D. Krueger (2018) Uncertainty in multitask transfer learning. arXiv preprint arXiv:1806.07528. Cited by: §3.2.
  • G. Lakoff and M. Johnson (2008) Metaphors we live by. University of Chicago press. Cited by: §4.1.
  • G. Lecué and M. Lerasle (2020) Robust machine learning by median-of-means: theory and practice. The Annals of Statistics 48 (2), pp. 906–931. Cited by: §2.
  • Y. Li, Y. Song, L. Cao, J. Tetreault, L. Goldberg, A. Jaimes, and J. Luo (2016) TGIF: a new dataset and benchmark on animated gif description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4641–4650. Cited by: §3.2.
  • T. Liao, R. Taori, I. D. Raji, and L. Schmidt (2021) Are we learning yet? a meta review of evaluation failures across machine learning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), Cited by: §1, §2, §3.2, §3.
  • C. Lin, H. Shih, and P. J. Sher (2007) Integrating technology readiness into technology acceptance: the tram model. Psychology & Marketing 24 (7), pp. 641–657. Cited by: §1.
  • C. Lin (2004) Rouge: a package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81. Cited by: Table 5.
  • L. T. Liu, S. Dean, E. Rolf, M. Simchowitz, and M. Hardt (2018) Delayed impact of fair machine learning. In International Conference on Machine Learning, pp. 3150–3158. Cited by: §4.1.
  • C. Lo and D. Wu (2010) Evaluating machine translation utility via semantic role labels.. In LREC, Cited by: §4.1.
  • M. Madaio, L. Egede, H. Subramonyam, J. Wortman Vaughan, and H. Wallach (2022) Assessing the fairness of ai systems: ai practitioners’ processes, challenges, and needs for support. Proceedings of the ACM on Human-Computer Interaction 6 (CSCW1), pp. 1–26. Cited by: §6.
  • A. Mantelero (2018) AI and big data: a blueprint for a human rights, social and ethical impact assessment. Computer Law & Security Review 34 (4), pp. 754–772. Cited by: §5.1.
  • Marrkula Center (2019) Approaches to ethical decision-making. External Links: Link Cited by: §4.1.
  • D. Martin, V. Prabhakaran, J. Kuhlberg, A. Smart, and W. S. Isaac (2020) Extending the machine learning abstraction boundary: a complex systems approach to incorporate societal context. External Links: 2006.09663 Cited by: §4.1, §5.1, §5.3.
  • G. Matturro (2013) Soft skills in software engineering: a study of its demand by software companies in uruguay. In 2013 6th international workshop on cooperative and human aspects of software engineering (CHASE), pp. 133–136. Cited by: §5.3.
  • F. Mazzocchi (2015) Could big data be the end of theory in science? a few remarks on the epistemology of data-driven science. EMBO reports 16 (10), pp. 1250–1255. Cited by: §2.
  • L. McGregor, D. Murray, and V. Ng (2019) International human rights law as a framework for algorithmic accountability. International & Comparative Law Quarterly 68 (2), pp. 309–343. Cited by: §4.1, §5.1.
  • D. S. McNair (2018) Preventing disparities: bayesian and frequentist methods for assessing fairness in machine learning decision-support models. New Insights into Bayesian Inference, pp. 71. Cited by: §5.1.
  • M. Miceli, T. Yang, L. Naudts, M. Schuessler, D. Serbanescu, and A. Hanna (2021) Documenting computer vision datasets: an invitation to reflexive data practices. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 161–172. Cited by: §5.1.
  • M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, and T. Gebru (2019) Model cards for model reporting. In Proceedings of the conference on fairness, accountability, and transparency, pp. 220–229. Cited by: §1, §5.1, §5.1, §5.3.
  • M. Moradi and M. Samwald (2021) Evaluating the robustness of neural language models to input perturbations. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 1558–1570. Cited by: §1.
  • Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim (2018) Image to image translation for domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4500–4509. Cited by: §3.2.
  • G. J. Myers, C. Sandler, and T. Badgett (2011) The art of software testing. John Wiley & Sons. Cited by: §5.3.
  • M. Neumann, O. Roessler, D. Suendermann-Oeft, and V. Ramanarayanan (2020) On the utility of audiovisual dialog technologies and signal analytics for real-time remote monitoring of depression biomarkers. In Proceedings of the First Workshop on Natural Language Processing for Medical Conversations, pp. 47–52. Cited by: §4.1.
  • P. Norvig (2017) On chomsky and the two cultures of statistical learning. In Berechenbarkeit der Welt?, pp. 61–83. Cited by: §2.
  • A. Olteanu, C. Castillo, F. Diaz, and E. Kıcıman (2019) Social data: biases, methodological pitfalls, and ethical boundaries. Frontiers in Big Data 2, pp. 13. Cited by: §2.
  • T. Orekondy, M. Fritz, and B. Schiele (2018) Connecting pixels to privacy and utility: automatic redaction of private information in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8466–8475. Cited by: §4.1.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318. Cited by: Table 5.
  • P. O. Pinheiro, N. Rostamzadeh, and S. Ahn (2019) Domain-adaptive single-view 3d reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7638–7647. Cited by: §3.2.
  • D. M. W. Powers (2011) Evaluation: from precision, recall and F-factor to ROC, informedness, markedness & correlation. Journal of Machine Learning Technologies 2 (1), pp. 37–63. Cited by: Table 3.
  • D. M. W. Powers (2012a) The problem of area under the curve. In 2012 IEEE International conference on information science and technology, pp. 567–573. Cited by: 2nd item.
  • D. M. W. Powers (2012b) The problem with kappa. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 345–355. Cited by: 2nd item.
  • D. M. W. Powers (2014) What the f-measure doesn’t measure: features, flaws, fallacies and fixes. Technical report, Beijing University of Technology, China & Flinders University, Australia, Tech. Rep.. Cited by: 1st item.
  • V. Prabhakaran, A. M. Davani, and M. Diaz (2021) On releasing annotator-level labels and information in datasets. In Proceedings of The Joint 15th Linguistic Annotation Workshop (LAW) and 3rd Designing Meaning Representations (DMR) Workshop, pp. 133–138. Cited by: §5.1.
  • V. Prabhakaran, B. Hutchinson, and M. Mitchell (2019) Perturbation sensitivity analysis to detect unintended model biases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5740–5745. Cited by: §1.
  • F. Provost and T. Fawcett (1997) Analysis and visualization of classifier performance with nonuniform class and cost distributions. In Proceedings of AAAI-97 Workshop on AI Approaches to Fraud Detection & Risk Management, pp. 57–63. Cited by: §4.1.
  • J. Pustejovsky (1998) The generative lexicon. MIT press. Cited by: §2.
  • I. D. Raji, E. Denton, E. M. Bender, A. Hanna, and A. Paullada (2021) AI and the everything in the whole wide world benchmark. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), Cited by: §1, §2, §5.
  • I. D. Raji, A. Smart, R. N. White, M. Mitchell, T. Gebru, B. Hutchinson, J. Smith-Loud, D. Theron, and P. Barnes (2020) Closing the AI accountability gap: defining an end-to-end framework for internal algorithmic auditing. In Proceedings of the 2020 conference on fairness, accountability, and transparency, pp. 33–44. Cited by: §1, §5.1, §5.2, §5.3.
  • F. A. Raso, H. Hilligoss, V. Krishnamurthy, C. Bavitz, and L. Kim (2018) Artificial intelligence & human rights: opportunities & risks. Berkman Klein Center Research Publication (2018-6). Cited by: §4.1.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. Advances in neural information processing systems 28. Cited by: §3.2.
  • M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh (2020) Beyond accuracy: behavioral testing of nlp models with checklist. arXiv preprint arXiv:2005.04118. Cited by: 2nd item.
  • S. Rismani and A. Moon (2021) How do ai systems fail socially?: an engineering risk analysis approach. In 2021 IEEE International Symposium on Ethics in Engineering, Science and Technology (ETHICS), Vol. , pp. 1–8. External Links: Document Cited by: §1.
  • P. Rodriguez, J. Barrow, A. M. Hoyle, J. P. Lalor, R. Jia, and J. Boyd-Graber (2021) Evaluation examples are not equally informative: how should that change NLP leaderboards?. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, pp. 4486–4503. External Links: Link, Document Cited by: §1.
  • N. Rostamzadeh, S. Hosseini, T. Boquet, W. Stokowiec, Y. Zhang, C. Jauvin, and C. Pal (2018) Fashion-gen: the generative fashion dataset and challenge. arXiv preprint arXiv:1806.08317. Cited by: §3.2.
  • N. Rostamzadeh, B. Hutchinson, C. Greer, and V. Prabhakaran (2021) Thinking beyond distributions in testing machine learned models. In NeurIPS 2021 Workshop on Distribution Shifts: Connecting Methods and Applications, Cited by: §5.3.
  • N. Ruiz, A. Kortylewski, W. Qiu, C. Xie, S. A. Bargal, A. Yuille, and S. Sclaroff (2022) Simulated adversarial testing of face recognition models. CVPR. Cited by: §5.2.
  • N. Sambasivan, E. Arnesen, B. Hutchinson, T. Doshi, and V. Prabhakaran (2021a) Re-imagining algorithmic fairness in india and beyond. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, New York, NY, USA, pp. 315–328. External Links: ISBN 9781450383097, Link, Document Cited by: §4.1, §4.1, §5.3.
  • N. Sambasivan and J. Holbrook (2018) Toward responsible ai for the next billion users. Interactions 26 (1), pp. 68–71. Cited by: §4.1.
  • N. Sambasivan, S. Kapania, H. Highfill, D. Akrong, P. Paritosh, and L. M. Aroyo (2021b) “Everyone wants to do the model work, not the data work”: data cascades in high-stakes ai. In proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–15. Cited by: §3.2, §6.
  • M. Sánchez-Gordón, L. Rijal, and R. Colomo-Palacios (2020) Beyond technical skills in software testing: automated versus manual testing. In Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops, pp. 161–164. Cited by: §5.3.
  • E. T. K. Sang and F. De Meulder (2003) Introduction to the conll-2003 shared task: language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142–147. Cited by: §3.3.
  • M. K. Scheuerman, A. Hanna, and E. Denton (2021) Do datasets have politics? disciplinary values in computer vision dataset development. Proceedings of the ACM on Human-Computer Interaction 5 (CSCW2), pp. 1–37. Cited by: §5.1.
  • D. Schlangen (2021) Targeting the benchmark: on methodology in current natural language processing research. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 670–674. Cited by: §1.
  • R. Schwartz, J. Dodge, N. A. Smith, and O. Etzioni (2020) Green AI. Communications of the ACM 63 (12), pp. 54–63. Cited by: §2, §3, §4.1, §5.1.
  • D. Sculley, J. Snoek, A. Wiltschko, and A. Rahimi (2018) Winner’s curse? on pace, progress, and empirical rigor. In Proceedings of ICLR 2018, Cited by: §1.
  • A. D. Selbst, D. Boyd, S. A. Friedler, S. Venkatasubramanian, and J. Vertesi (2019) Fairness and abstraction in sociotechnical systems. In Proceedings of the conference on fairness, accountability, and transparency, pp. 59–68. Cited by: §4.1, §5.1.
  • H. Shue (2020) Basic rights: subsistence, affluence, and us foreign policy. Princeton University Press. Cited by: §4.1.
  • M. S. Silberman, B. Tomlinson, R. LaPlante, J. Ross, L. Irani, and A. Zaldivar (2018) Responsible research with crowds: pay crowdworkers at least minimum wage. Communications of the ACM 61 (3), pp. 39–41. Cited by: §4.1.
  • W. Sinnott-Armstrong (2021) Consequentialism. The Stanford Encyclopedia of Philosophy Winter 2021 Edition. External Links: Link Cited by: §4.1.
  • S. L. Star and J. R. Griesemer (1989) Institutional ecology, ‘translations’ and boundary objects: amateurs and professionals in berkeley’s museum of vertebrate zoology, 1907-39. Social studies of science 19 (3), pp. 387–420. Cited by: §5.3, §6.
  • E. Strubell, A. Ganesh, and A. McCallum (2019) Energy and policy considerations for deep learning in nlp. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3645–3650. Cited by: §4.1.
  • M. Sugiyama, S. Nakajima, H. Kashima, P. Buenau, and M. Kawanabe (2007) Direct importance estimation with model selection and its application to covariate shift adaptation. Advances in Neural Information Processing Systems 20. Cited by: §1.
  • R. Thomas and D. Uminsky (2020) Reliance on metrics is a fundamental challenge for AI. In Proceedings of the Ethics of Data Science Conference, Cited by: §1, §6.
  • J. W. Tukey (1962) The future of data analysis. The annals of mathematical statistics 33 (1), pp. 1–67. Cited by: §2.
  • P. D. Turney (1994) Cost-sensitive classification: empirical evaluation of a hybrid genetic decision tree induction algorithm. Journal of artificial intelligence research 2, pp. 369–409. Cited by: §5.1.
  • D. Ustalov, A. Panchenko, and C. Biemann (2017) Watset: automatic induction of synsets from a graph of synonyms. In 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, pp. 1579–1590. Cited by: footnote 4.
  • S. Vallor (2016) Technology and the virtues: a philosophical guide to a future worth wanting. Oxford University Press. Cited by: §5.1.
  • C. J. Van Rijsbergen (1974) Foundation of evaluation. Journal of documentation. Cited by: Table 5.
  • A. Vogelsang and M. Borg (2019) Requirements engineering for machine learning: perspectives from data scientists. In 2019 IEEE 27th International Requirements Engineering Conference Workshops (REW), pp. 245–251. Cited by: §5.1.
  • H. Wallach (2018) Computational social science computer science+ social data. Communications of the ACM 61 (3), pp. 42–44. Cited by: §2.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353–355. Cited by: §4.1.
  • J. Wang, C. Lan, C. Liu, Y. Ouyang, W. Zeng, and T. Qin (2021) Generalizing to unseen domains: a survey on domain generalization. In Proceedings of IJCAI 2021, Cited by: 1st item.
  • K. Webster, M. R. Costa-jussà, C. Hardmeier, and W. Radford (2019) Gendered ambiguous pronoun (gap) shared task at the gender bias in nlp workshop 2019. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pp. 1–7. Cited by: §5.1.
  • S. M. West, M. Whittaker, and K. Crawford (2019) Discriminating systems. AI Now. Cited by: §4.1.
  • J. Winkens, R. Bunel, A. Guha Roy, R. Stanforth, V. Natarajan, J. R. Ledsam, P. MacWilliams, P. Kohli, A. Karthikesalingam, S. Kohl, et al. (2020) Contrastive training for improved out-of-distribution detection. arXiv e-prints, pp. arXiv–2007. Cited by: §1.
  • H. Wu, Y. Gao, X. Guo, Z. Al-Halah, S. Rennie, K. Grauman, and R. Feris (2021) Fashion iq: a new dataset towards retrieving images by natural language feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11307–11317. Cited by: §3.2.
  • S. Yeom, I. Giacomelli, M. Fredrikson, and S. Jha (2018) Privacy risk in machine learning: analyzing the connection to overfitting. In 2018 IEEE 31st Computer Security Foundations Symposium (CSF), pp. 268–282. Cited by: §5.1.
  • G. Zeng, F. Qi, Q. Zhou, T. Zhang, Z. Ma, B. Hou, Y. Zang, Z. Liu, and M. Sun (2021) OpenAttack: an open-source textual adversarial attack toolkit. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pp. 363–371. Cited by: §5.2.
  • J. M. Zhang, M. Harman, L. Ma, and Y. Liu (2020a) Machine learning testing: survey, landscapes and horizons. IEEE Transactions on Software Engineering. Cited by: §1, §1, §3.
  • W. E. Zhang, Q. Z. Sheng, A. Alhazmi, and C. Li (2020b) Adversarial attacks on deep-learning models in natural language processing: a survey. ACM Transactions on Intelligent Systems and Technology (TIST) 11 (3), pp. 1–41. Cited by: §5.2.
  • B. Z. H. Zhao, M. A. Kaafar, and N. Kourtellis (2020) Not one but many tradeoffs: privacy vs. utility in differentially private machine learning. In Proceedings of the 2020 ACM SIGSAC Conference on Cloud Computing Security Workshop, pp. 15–26. Cited by: §4.1.
  • J. Zhao, T. Wang, M. Yatskar, V. Ordonez, and K. Chang (2018) Gender bias in coreference resolution: evaluation and debiasing methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 2. Cited by: §5.1.

Appendix A: Metrics in ML Model Evaluations

Here we give definitions and categorizations of some of the metrics reported in the study in Section 3. In practice, there was a long tail of metrics used in only a single paper; we include only those that were most frequently observed in our study.

Metric Example Task(s) Metric category Definition
Accuracy Classification Accuracy A metric that penalizes system predictions that do not agree with the reference data ((TP + TN) / (TP + TN + FP + FN)).
AUC Classification AUC The area under the curve parameterized by the classification threshold, typically with the y-axis representing recall (true positive rate) and the x-axis representing false positive rate, i.e., the area under the ROC curve.
Bleu Machine translation Precision A form of “n-gram precision,” originally designed for machine translation but also sometimes used for other text generation tasks, which measures whether sequences of words in the system output are also present in the reference texts (Papineni et al., 2002).
Dice Image segmentation Overlap Equivalent to 2TP / (2TP + FP + FN). More commonly used for medical image segmentation.
Error rate Classification Accuracy The complement of accuracy (1 − accuracy, i.e., (FP + FN) / (TP + TN + FP + FN)).
F1 (or F) Text classification Overlap The harmonic mean of recall and precision (2 · Precision · Recall / (Precision + Recall)), originally developed for information retrieval (Van Rijsbergen, 1974) but now widely used in NLP.
F2 Text classification Overlap A weighted harmonic mean of recall and precision, with greater weight given to recall (Fβ = (1 + β²) · Precision · Recall / (β² · Precision + Recall), with β = 2).
Hausdorff distance Medical Image Segmentation Distance A measure of distance between two sets in a metric space. Two sets have a low Hausdorff distance if every point in each set is close to a point in the other set.
IoU Image segmentation Overlap TP / (TP + FP + FN), i.e., the ratio of the intersection to the union of the predicted and reference regions. Equivalent to Jaccard.
Matthew’s Correlation Coefficient Correlation Has been argued to address shortcomings in F1’s asymmetry with respect to classes ((TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))).
Mean absolute error Regression Distance The mean over the test set of the absolute difference |y − ŷ| between actual and predicted values.
Mean Average Precision (MAP) Information retrieval (NLP) AUC In information retrieval, the average over information needs of the average precision of the documents retrieved for that need.
Mean average precision (mAP) Object detection (CV) AUC The area under the Precision-Recall tradeoff curve, averaged over multiple IoU (intersection over union) threshold values, then averaged across all categories (https://cocodataset.org/#detection-eval).
Mean reciprocal rank Information retrieval Other A measure for evaluating processes that produce an ordered list of possible responses: the average of the inverse rank of the first relevant item retrieved.
MSE Image Decomposition Distance Mean squared error (MSE) measures the average of the squared difference between estimated and actual values.
Normalized Discounted Cumulative Gain (NDCG) Recommendation or ranking tasks Other A measure of ranking quality which takes into account the usefulness of items based on their ranking in the result list.
Pearson’s r Quality Estimation Correlation A measure of linear correlation between two sets of data.
Perplexity Language modeling Perplexity Information-theoretic metric (measured in bits-per-unit, e.g., bits-per-character or bits-per-sentence) often used for language models, inversely related to the probability assigned to the test data by the model. Closely related to the cross-entropy between the model and the test data; can be thought of as a measure of how efficiently the language model encodes the test data.
Precision Classification Precision A metric that penalizes the system for predicting a class (if the class is unspecified, by default the “positive” class) when the reference data did not belong to this class (TP / (TP + FP)).
PSNR Super Resolution Distance Peak Signal-to-Noise Ratio (PSNR) is the ratio between the maximum possible value of a signal and the power of the distorting noise (mean squared error) that impacts the quality of its representation.
Recall Classification Recall Also known as “sensitivity”, this metric penalizes the system for failing to predict a class (if the class is unspecified, by default the “positive” class) when the reference data did belong to this class (TP / (TP + FN)); a.k.a. true positive rate.
RMSE Depth Estimation Distance Root Mean Square Error (RMSE) is the square root of the MSE.
Rouge Text summarization Recall A form of “n-gram recall,” originally designed for text summarization but also sometimes used for other text generation tasks, which measures whether sequences of words in the reference texts are also present in the system output (Lin, 2004).
Spearman’s ρ Graph Edit Distance Correlation A measure of monotonic association between two variables; less restrictive than a linear correlation.
Specificity Classification Other Like Precision, this metric penalizes the system for predicting a class (if the class is unspecified, by default the “positive” class) when the reference data did not belong to this class; unlike Precision it rewards true negatives rather than true positives (TN / (TN + FP)).
SSIM Super Resolution Distance The Structural Similarity Method (SSIM) is a perception-based method for measuring the similarity between two images. The formula is based on comparison measurements of luminance, contrast, and structure.
Top-k accuracy Face recognition Accuracy A metric for systems that return ranked lists, which calculates accuracy over the top k entries in each list.
Word error rate Speech recognition Accuracy The complement of word accuracy (1 − word accuracy, which is not technically always in [0, 1] due to the way word accuracy is defined, but which is categorized as “Accuracy” here because both insertions and deletions are penalized).
Table 5. Definitions and categorizations of metrics reported in Section 3. TP, TN, FP and FN indicate the number of true positives, true negatives, false positives and false negatives, respectively; y and ŷ represent actual values and values predicted by the system, respectively.
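As a quick worked check of the confusion-matrix-based definitions in Table 5, the snippet below computes several of them from a single set of illustrative counts (the counts themselves are arbitrary).

```python
# Illustrative confusion-matrix counts (assumptions chosen for the example).
TP, TN, FP, FN = 40, 45, 10, 5

accuracy    = (TP + TN) / (TP + TN + FP + FN)        # 0.85
error_rate  = 1 - accuracy                           # 0.15
precision   = TP / (TP + FP)                         # 0.80
recall      = TP / (TP + FN)                         # ~0.889
specificity = TN / (TN + FP)                         # ~0.818
f1          = 2 * precision * recall / (precision + recall)
iou         = TP / (TP + FP + FN)                    # Jaccard / IoU
dice        = 2 * TP / (2 * TP + FP + FN)            # equals f1 for binary counts
mcc_num     = TP * TN - FP * FN
mcc_den     = ((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)) ** 0.5
mcc         = mcc_num / mcc_den

print(f"accuracy={accuracy:.3f} f1={f1:.3f} dice={dice:.3f} "
      f"iou={iou:.3f} mcc={mcc:.3f}")
```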

Appendix B: Types of Evaluation Data used in ML Model Evaluations

Type of Test Data Example Task(s) I.I.D. with training data? Definition
Test split Classification yes Typically, labeled data is partitioned into training and test splits (and often a dev split too), drawn randomly from the same dataset.
Manual resource Lexical acquisition no A manually compiled resource (in NLP, often a word-based resource such as a lexicon or thesaurus), against which knowledge acquired from a dataset is compared.
References Machine translation no Reference outputs (typically obtained prior to building the system) which a generative system is trying to reproduce, typically obtained from humans (e.g., manual translations of input sentences in the case of evaluations using Bleu for machine translation tasks).
Training data Keyword extraction yes Training data that contains labels is used to evaluate an unsupervised algorithm that did not have access to the labels during learning.
Novel distribution Domain transfer no Test data that has the same form as the training data but is drawn from a different distribution (e.g., in the case of NLP training on labeled newspaper data and testing on labeled Wikipedia data).
Table 6. Types of datasets used in ML model evaluations.
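The contrast between the “test split” and “novel distribution” rows of Table 6 can be made concrete with a small synthetic experiment; the dataset, the shift, and the classifier below are placeholders chosen only so that the snippet runs end to end.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the "test split" row: training and test data drawn
# I.I.D. from the same distribution.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

# Synthetic stand-in for the "novel distribution" row: the same task, but the
# inputs are shifted (a crude proxy for, e.g., newspaper vs. Wikipedia text).
X_shifted = X_test + np.random.default_rng(0).normal(0.0, 1.5, X_test.shape)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("I.I.D. test split accuracy:  ", accuracy_score(y_test, model.predict(X_test)))
print("Novel-distribution accuracy: ", accuracy_score(y_test, model.predict(X_shifted)))
```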

Appendix C: Example of Assumptions and Gaps for a Hypothetical Application

Suppose we are evaluating a hypothetical image classification model for use in an application for assisting blind people in identifying groceries in their pantries. Then some application-specific questions related to the assumptions in Section 4 might be:

Consequentialism

Was the data ethically sourced and labeled? Were blind people involved in the design process? Does this use of the model encourage high-risk uses of other similar models, such as identifying pharmaceutical products?

Abstractability from Context

Does the application have a human-in-the-loop feature available when the model is uncertain? Will the system nudge purchasing behaviors towards products on which the model performs well?

Input Myopia

Are uncommon grocery products misclassified more often? Does this disproportionately impact home cooks who don’t stick to the dominant cuisines, or who have food requirements due to medical conditions?

Quantitative Modeling

Does measuring predictive accuracy fail to take into account dignitary consequences associated with increased independence? Should each user be weighted equally in the evaluation (cf. each image)?

Equivalent Failures

Are there severe risks in confusing certain pairs of products, e.g., food products associated with dangerous allergies? Are some errors only minimally inconvenient, such as confusing different shapes of pasta?

Test Data Validity

Is the evaluation data representative of what the application’s users have in their pantries? Are the image qualities (lighting, focus, framing, etc.) representative of images taken by blind photographers?

Appendix D: Model Evaluation Remits and Design

Model Evaluation Remit
 To establish:
  motivation — why evaluate the model?
   what is the perspective being adopted — task/financial/administrative/scientific/…
   whose interests prompted the evaluation — developer/funder/…
   who are the consumers of the model evaluation results — manager/user/researcher/…
  goal — what do we want/need to discover?
  orientation — intrinsic/extrinsic
  kind — investigation/experiment
  type — black box/glass box
  form (of yardstick) — ideal/attainable/exemplar/given/judged
  style — suggestive/indicative/exhaustive
  mode — quantitative/qualitative/hybrid
Model Evaluation Design
 To identify:
  ends — what is the model for? what is its objective or function?
  context — what is the ecosystem the model is in? what are the animate and inanimate factors?
  constitution — what is the structure of the model? what was the training data?
 To determine:
  factors that will be tested
   environment variables
   ‘system’ parameters
  evaluation criteria
   metrics/measures
   methods
 Evaluation data — what type, status and nature?
 Evaluation procedure
Table 7. A sketch of how Karen Sparck Jones and Julia Galliers’ 1995 NLP evaluation framework questionnaire (Jones and Galliers, 1995) can be adapted for the evaluation of ML models. The output of the remit and the design is a strategy for conducting the model evaluation. For a related but simpler framework based on model requirements analysis, see also the “7-step Recipe” for NLP system evaluation (https://www.issco.unige.ch/en/research/projects/eagles/ewg99/7steps.html) developed by the eagles Evaluation Working Group in 1999, which considers whether different parties have a shared understanding of the evaluation’s purpose.