Interpretable explanations present a direct conflict between explanations that best describe the underlying model and explanations that are tailored to the end user. Both features are important, however: the former is motivated by the ethical goal of transparency, while the latter is a matter of user trust and human comprehension. The interpretable machine learning community has championed evaluation of functional interpretability using human user studies. Little work has focused on the inherent problems of bias in human evaluation in cases where we hope to balance accuracy and complexity.
Discussion regarding the tradeoff between accuracy and complexity in models has long preceded the field of interpretable machine learning. In statistics, it is discussed under "parsimony" [Sidow, 1994, Townsend et al., 2015]. The problem is also discussed in the philosophy of science under "scientific realism" and "simplification" [Forster, 1986, Forster and Sober, 1994].
The majority of methods found in the interpretable machine learning community optimize for model complexity, straying away from more complex measures of interpretability. In fact, Ribeiro et al. [2016] explicitly mention optimization using model complexity as a proxy for interpretability in their seminal paper. But evaluation of these explanation systems is conducted using more complex definitions of interpretability. This difference between how explanation systems are optimized and how they are evaluated can introduce bias into the explanation vehicles and strategies themselves. In this way, control over the tradeoff between accuracy and simplicity is at risk when pursuing better performance on real-world evaluation metrics.
Studies have shown that users better understand and trust models that match their expectations and frame of view. When the user’s model is subpar, there is a boundary at which our current evaluation metrics will reward similarity to the faulty user model despite the explanation becoming more complex or less faithful to the underlying model.
In Section 2, we define terminology relevant to interpretable model explanations and human cognitive function that may affect functional interpretability. We provide definitions that distinguish between explanations that emphasize accuracy and explanations that balance that accuracy with human-related metrics in Section 3. Motivation and discussion concerning human evaluation and the resulting biases are discussed in Section 4. We present two potential research directions to mitigate the negative effects of human evaluation for interpretable machine learning in Section 5.
2 Related Terminology
An interpretable explanation, or explanation, is a simple model, visualization, or text description that lies in an interpretable feature space and approximates a more complex model. In this work, we focus on those in the form of statistical models, algorithms, or rules while referring to them as explanation models. Model complexity is a measure of the amount of information contained in a model. In this paper, we will often refer to the model complexity of an explanatory model as explanation complexity.
We introduce new terms to further clarify different components of interpretable explanation model systems. An explanation vehicle is the model class of an explanation, such as decision trees, generalized additive models [Caruana et al., 2015], or falling rule lists [Wang and Rudin, 2015]. We expand this definition to pairings of model class and a fixed explanation complexity constraint, e.g. decision trees of depth less than 10. We define an explanation strategy as the combination of explanation vehicle, objective function, constraints, and hyperparameters required to generate an interpretable model explanation. Finally, we consider a particular implementation of an explanation strategy that has been fit to an underlying machine learning model to be an explanation model system, or explanation system for short.
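As a minimal sketch of how these terms relate to one another, the definitions can be written as plain data structures. The class names, field names, and the `squared_error` objective below are hypothetical and chosen only for illustration; they are not part of any existing library.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class ExplanationVehicle:
    """Model class of an explanation, optionally paired with a fixed complexity constraint."""
    model_class: str                      # e.g. "decision_tree" (illustrative label)
    max_complexity: Optional[int] = None  # e.g. a bound on tree depth


@dataclass
class ExplanationStrategy:
    """Vehicle plus the objective, constraints, and hyperparameters used to fit it."""
    vehicle: ExplanationVehicle
    objective: Callable[..., float]
    constraints: Dict[str, Any] = field(default_factory=dict)
    hyperparameters: Dict[str, Any] = field(default_factory=dict)


@dataclass
class ExplanationSystem:
    """A strategy that has been fit to one particular underlying model."""
    strategy: ExplanationStrategy
    underlying_model: Any
    explanation_model: Any


def squared_error(y_underlying: float, y_explanation: float) -> float:
    """Hypothetical objective: squared disagreement between the two models."""
    return (y_underlying - y_explanation) ** 2


# Illustrative values only: a depth-bounded decision tree vehicle.
vehicle = ExplanationVehicle(model_class="decision_tree", max_complexity=10)
strategy = ExplanationStrategy(
    vehicle=vehicle,
    objective=squared_error,
    hyperparameters={"min_samples_leaf": 5},
)
```

Keeping the vehicle separate from the strategy mirrors the distinction drawn above: the same vehicle can be reused under different objectives and constraints.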
3 Characterization of explanation strategies
3.1 Descriptive explanation strategies
High-fidelity explanations, also referred to as faithful explanations, have a strong correspondence between the explanation model and the underlying machine learning model [Ribeiro et al., 2016]. We define a descriptive explanation strategy as one that generates explanations with maximum model fidelity for a particular explanation vehicle and underlying machine learning model. Optimizing a descriptive explanation often involves a traditional accuracy metric, such as mean squared error calculated between the underlying and explanatory models.
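The fidelity criterion can be sketched in a few lines. The example below assumes that both the underlying model and each candidate explanation are callables returning a number, and that fidelity is measured as mean squared error over a fixed set of inputs. The function names and the toy models are hypothetical and serve only as illustration.

```python
from typing import Callable, Sequence

Model = Callable[[float], float]


def fidelity_mse(underlying: Model, explanation: Model, inputs: Sequence[float]) -> float:
    """Mean squared disagreement between explanation and underlying model (lower is more faithful)."""
    errors = [(underlying(x) - explanation(x)) ** 2 for x in inputs]
    return sum(errors) / len(errors)


def most_descriptive(underlying: Model, candidates: Sequence[Model], inputs: Sequence[float]) -> Model:
    """Pick the candidate with maximum fidelity, i.e. minimum MSE against the underlying model."""
    return min(candidates, key=lambda g: fidelity_mse(underlying, g, inputs))


# Toy setup (assumed for illustration): a quadratic "black box" and two linear explanations.
def black_box(x: float) -> float:
    return x ** 2


def linear_a(x: float) -> float:
    return 3 * x - 1


def linear_b(x: float) -> float:
    return x


inputs = [0, 1, 2, 3]
best = most_descriptive(black_box, [linear_a, linear_b], inputs)
```

Note that the error is computed against the underlying model's outputs rather than against ground-truth labels, which is what distinguishes fidelity from predictive accuracy.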
Descriptive explanations best satisfy the ethical goal of transparency. When our explanations are more faithful to the underlying machine learning model, humans have more information about the inner workings of the system. This value is of societal importance and studied by the science and technology studies community, as well as the public policy and law community.
3.2 Persuasive explanation strategies
In contrast to descriptive explanations, persuasive explanation strategies do not achieve maximum model fidelity and often incorporate user preferences, knowledge, or characteristics. Much like a persuasive argument, these explanations balance accuracy with being convincing to the user. They are less faithful to the underlying model than descriptive explanations in a tradeoff for more freedom on the explanation complexity, structure, and parameters. This freedom permits explanations better tailored to human cognitive function, making them more functionally interpretable.
Model complexity is one of the most practiced persuasive treatments. This is natural because the optimal model complexity of an explanation may be dependent on user expectations and expertise. Researchers often must limit explanation complexity to facilitate trust and understanding by humans. Some aim to make an explanation simple enough to be contemplated by a human at once, a trait that Lipton [2016] calls simulatability.
User trust and understanding are two popular ethical motivations for interpretability. We consider these goals persuasive, as they both rely on human cognitive function and preferences. Framing our ethical goals with respect to persuasive explanation strategies highlights some ethical dilemmas that have plagued model interpretability research as it is currently practiced:
When is it unethical to manipulate an explanation to better persuade users? How do we balance our concerns for transparency and ethics with our desire for interpretability?
Implicit human cognitive bias
Doshi-Velez and Kim [2017] review evaluation within interpretable machine learning, describing both human evaluation in practical applications and functional evaluation using quantitative metrics. After proposing a taxonomy of evaluation methods, they hypothesize that we may find common latent factors to inform our understanding of both explanation vehicles and the applications to which they were applied. We believe this is an important analysis to pursue. However, finding latent factors amongst heavily influenced decisions for explanation vehicles, application domains, and performance metrics will lead to biased output unless representation across each choice is balanced.
If human cognitive attributes and user expectations are indeed predictive of user trust or comprehension, they will be a major source of this bias. The expectations and expertise to which explanations are compared can differ greatly across users and applications. There may also be characteristics shared across all human users that affect the performance of an explanation system. We refer to this as implicit human cognitive bias.
Explanation systems are more likely to overfit to these attributes as researchers produce new systems. This effect is amplified by publication bias. As the field matures, only explanation vehicles and strategies that achieve top results in performance will be published and visible to the community. Researchers will then favor and mimic top-performing explanation strategies as the basis for their future work. If methods that incorporate human cognitive function and preferences score more highly on functional evaluation metrics and future work is based on that research, we may not capture the characteristics that led to higher functional performance. Even if we understand the cognitive attributes of an explanation strategy, methods like that in Doshi-Velez and Kim [2017] fail to allow for fine-tuned control over accuracy and interpretability.
Other fields within artificial intelligence base performance on human evaluation and are also likely to have human cognitive bias. Examples include machine translation, information retrieval, and the user modeling, adaptation, and personalization (UMAP) communities. The perils of human evaluation warrant further study in these areas. However, these communities may favor results that best match human performance outright, as success is defined by the ability to replicate human performance. This definition of success would make research in these fields more tolerant of human cognitive bias than the machine learning interpretability community.
Implicit human cognitive bias is problematic within interpretable machine learning, as it directly limits our ability to provide purely descriptive explanations that are accurate and transparent. When the bias becomes sufficiently large, we will fail to satisfy our ethical goals related to transparency of the underlying machine learning model.
4 Research directions to combat implicit human cognitive bias
We present two research directions that have potential to reduce the implicit cognitive bias in interpretable explanation strategies, while still prioritizing functional human interpretability and trust. Both methods allow for control over the tradeoff between transparency and interpretability.
4.1 Separation between descriptive and persuasive tasks
One direction involves separating the descriptive and persuasive explanation generation tasks. In this case, we treat explanation complexity as a persuasive strategy attribute. In the first step, we create a fully descriptive explanation within an interpretable feature space for a particular explanation vehicle. No efforts to simplify the explanation are made beyond projection into a given interpretable feature space. The explanation is altered to become more persuasive in the second and final step. It is here we incorporate human cognitive function, user preferences, and expertise into the explanation.
This process is already treated as two separate steps in a number of interpretable explanation systems with regard to model complexity. When reducing model complexity, some researchers choose to fit a fully descriptive model before truncating it to reduce complexity. For example, using a standard CART algorithm [Breiman et al., 1984], we may produce a decision tree in an interpretable feature space that is of depth 40. This model is assumed to be too complex for human understanding, so we may remove nodes at depth greater than 10 to elicit an altered explanation that is more persuasive and interpretable.
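The truncation step in this example can be sketched as follows. The `Node` structure and the helper functions are hypothetical, and we assume that every internal node stores the prediction it would make if it were a leaf, so that removing a subtree simply turns its root into a leaf. A tiny tree stands in for the depth-40 tree to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    value: float                    # prediction stored at this node
    feature: Optional[int] = None   # index of the split feature; None marks a leaf
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def depth(node: Node) -> int:
    """Depth of the tree, counting the root as depth 0."""
    if node.feature is None:
        return 0
    return 1 + max(depth(node.left), depth(node.right))


def truncate(node: Node, max_depth: int) -> Node:
    """Return a copy in which every node deeper than max_depth has been removed."""
    if node.feature is None or max_depth == 0:
        return Node(value=node.value)  # collapse the subtree into a leaf
    return Node(
        value=node.value,
        feature=node.feature,
        threshold=node.threshold,
        left=truncate(node.left, max_depth - 1),
        right=truncate(node.right, max_depth - 1),
    )


def predict(node: Node, x: Sequence[float]) -> float:
    """Route x down the tree and return the value of the leaf it reaches."""
    while node.feature is not None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value


# Step 1 (descriptive): a fully grown tree, here only of depth 2 for brevity.
descriptive_tree = Node(
    value=0.5, feature=0, threshold=5.0,
    left=Node(value=0.2, feature=0, threshold=2.0,
              left=Node(value=0.0), right=Node(value=0.4)),
    right=Node(value=0.9),
)

# Step 2 (persuasive): the same tree with nodes below depth 1 removed.
persuasive_tree = truncate(descriptive_tree, max_depth=1)
```

Because `truncate` returns a copy, the descriptive tree remains available, and the fidelity lost in the persuasive step can be measured by comparing the predictions of the two trees.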
This separation of concern encourages more rapid innovation and reduces the cost of evaluation. When separated, the descriptive step can be evaluated using functional metrics. Researchers developing methods to project uninterpretable models into an interpretable feature space could rid themselves of the expensive and delicate human evaluation tasks. The persuasive step is now transformed to a task of altering a model of a specific explanation vehicle to balance accuracy and interpretability. Researchers can make progress on the evaluation of explanations across different users and applications without conflating it with the choice of explanation vehicle.
4.2 Explicit inclusion of cognitive features
The second research direction we consider is extending the objective function to include cognitive attributes and expertise that influence our functional measures of interpretability.
This strategy has some clear disadvantages:
– Any cognitive features that are important for modeling user trust or any measure of functional interpretability must be explicitly included as a constraint. A feature that is missing from our explanation strategy’s loss function will contribute to the implicit cognitive bias. Relevant cognitive features may differ across applications and evaluation metrics.
– Multi-objective loss functions – Optimization of a multi-objective loss function becomes difficult when the problem is nontrivial [Hwang and Masud, 1979, Miettinen, 1999, Ehrgott, 2005, Branke et al., 2008]. Goh et al. [2016] propose a solution for this problem using dataset constraints and a ramp penalty. The increased complexity in optimization may pose difficulty in fitting reasonable explanations.
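To make the idea of an extended objective concrete, the sketch below uses weighted-sum scalarization, a standard way to collapse several objectives into one. It is not the dataset-constraint approach of Goh et al., and the term names, weights, and values are assumptions made purely for illustration. The function refuses to evaluate when a term has no weight, so that every cognitive feature must be stated explicitly.

```python
from typing import Dict


def scalarized_loss(terms: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted-sum scalarization: the sum over terms of weight * loss value."""
    missing = set(terms) - set(weights)
    if missing:
        # An unweighted term would be silently ignored; fail loudly instead.
        raise ValueError(f"no weight given for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in terms.items())


# Hypothetical loss terms for one candidate explanation.
terms = {
    "fidelity": 0.10,              # disagreement with the underlying model
    "complexity": 7.0,             # e.g. number of rules or tree depth
    "expectation_mismatch": 0.30,  # disagreement with reported user expectations
}

# Hypothetical weights expressing the desired tradeoff between the terms.
weights = {"fidelity": 1.0, "complexity": 0.01, "expectation_mismatch": 0.5}

total = scalarized_loss(terms, weights)
```

Making the weights explicit is what gives control over the tradeoff: setting the cognitive weights to zero recovers a purely descriptive objective.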
We present user conviction as an example of an explicit attribute of human cognition. This, like other cognitive attributes, is a function of the users’ personal expertise model. It represents a fairly difficult attribute, as it differs between individual users and requires the capture of user knowledge. (See Knowledge representation above.)
Brey [2005] describes the conflict between a user trusting one’s own judgments and intuitions and trusting those of a potentially more intelligent technology. While his comment is with regard to intelligence-enabled devices, the conflict is comparable to that of users evaluating trust in machine learning explanations. In this paper, we define user conviction as the propensity of an individual or group to trust their own judgment of a classification model above that of the interpretable explanation.
At this point in time, we treat user conviction as a single parameter across the entire user expectation model. The concept of user conviction can be extended to capture feature-level or decision-level trust for more fine-grained evaluation.
If known, we then incorporate the user’s knowledge model with conviction such that, for a given input, the prediction from the explanation is weighed against the reported user expectation.
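One way such a combination could be written is given below. This is only an illustrative assumption and should not be read as the formulation used in this work: the symbols $x$ (input), $f(x)$ (output of the underlying model), $\hat{y}(x)$ (prediction from the explanation), $u(x)$ (reported user expectation), $c \in [0, 1]$ (user conviction), and $\ell$ (a pointwise loss such as squared error), as well as the convex-combination form itself, are all introduced here for illustration.

```latex
\mathcal{L}_c(x) \;=\; (1 - c)\,\ell\bigl(f(x),\, \hat{y}(x)\bigr) \;+\; c\,\ell\bigl(u(x),\, \hat{y}(x)\bigr),
\qquad c \in [0, 1].
```

Under this assumed form, $c = 0$ recovers a purely descriptive objective that depends only on fidelity to the underlying model, while $c = 1$ rewards agreement with the user's expectation alone.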
We presented two research directions that may mitigate the negative consequences of implicit human cognitive bias. We believe that this research should be further studied to better characterize the risks that are presented by human evaluation of functional interpretability.
This project was funded by generous grants from the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation to the University of Washington eScience Institute. Thank you to Valentina Staneva, Zachary Jones, and Bill Howe for fruitful discussion. A large thank you goes to Michael Esveldt, Katie Kuhl, and the University of Washington libraries for writing and editing support.
- Branke et al. [2008] Jürgen Branke, Kalyanmoy Deb, and Kaisa Miettinen. Multiobjective Optimization: Interactive and Evolutionary Approaches. Springer Science & Business Media, October 2008. ISBN 978-3-540-88907-6.
- Breiman et al. [1984] Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Wadsworth, 1984.
- Brey [2005] Philip Brey. Freedom and Privacy in Ambient Intelligence. Ethics and Information Technology, 7(3):157–166, September 2005. ISSN 1388-1957, 1572-8439. doi: 10.1007/s10676-006-0005-3. URL https://link.springer.com/article/10.1007/s10676-006-0005-3.
- Caruana et al. [2015] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, pages 1721–1730, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3664-2. doi: 10.1145/2783258.2788613. URL http://doi.acm.org/10.1145/2783258.2788613.
- Doshi-Velez and Kim [2017] Finale Doshi-Velez and Been Kim. Towards A Rigorous Science of Interpretable Machine Learning. arXiv:1702.08608 [cs, stat], February 2017. URL http://arxiv.org/abs/1702.08608.
- Ehrgott [2005] Matthias Ehrgott. Multicriteria Optimization. Springer Science & Business Media, May 2005. ISBN 978-3-540-21398-7.
- Forster [1986] Malcolm R. Forster. Unification and Scientific Realism Revisited. Philosophy of Science Association, 1:394–405, 1986.
- Forster and Sober [1994] Malcolm R. Forster and Elliott Sober. How to Tell when Simpler, More Unified, or Less Ad Hoc Theories Provide More Accurate Predictions. British Journal for the Philosophy of Science, (45):1–35, 1994.
- Goh et al. [2016] Gabriel Goh, Andrew Cotter, Maya Gupta, and Michael Friedlander. Satisfying Real-world Goals with Dataset Constraints. arXiv:1606.07558 [cs], June 2016. URL http://arxiv.org/abs/1606.07558.
- Hoffman [1989] Robert R. Hoffman. A Survey of Methods for Eliciting the Knowledge of Experts. SIGART Bull., (108):19–27, April 1989. ISSN 0163-5719. doi: 10.1145/63266.63269. URL http://doi.acm.org/10.1145/63266.63269.
- Hwang and Masud [1979] Ching-Lai Hwang and Abu Syed Md Masud. Multiple objective decision making, methods and applications: a state-of-the-art survey. Springer-Verlag, 1979. ISBN 978-0-387-09111-2.
- Leu and Abbass [2016] George Leu and Hussein Abbass. A multi-disciplinary review of knowledge acquisition methods: From human to autonomous eliciting agents. Knowledge-Based Systems, 105(Supplement C):1–22, August 2016. ISSN 0950-7051. doi: 10.1016/j.knosys.2016.02.012. URL http://www.sciencedirect.com/science/article/pii/S0950705116000988.
- Lipton [2016] Zachary C. Lipton. The Mythos of Model Interpretability. arXiv:1606.03490 [cs, stat], June 2016. URL http://arxiv.org/abs/1606.03490.
- Miettinen [1999] Kaisa Miettinen. Nonlinear Multiobjective Optimization. Springer Science & Business Media, 1999. ISBN 978-0-7923-8278-2.
- Ribeiro et al. [2016] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. arXiv:1602.04938 [cs, stat], February 2016. URL http://arxiv.org/abs/1602.04938.
- Sidow [1994] Arend Sidow. Parsimony or statistics? Nature, 367(6458):26–26, January 1994. doi: 10.1038/367026a0. URL http://dx.doi.org/10.1038/367026a0.
- Townsend et al. [2015] James T. Townsend, Jerome R. Busemeyer, Joachim Vandekerckhove, Dora Matzke, and Eric-Jan Wagenmakers. Model Comparison and the Principle of Parsimony. In James T. Townsend and Jerome R. Busemeyer, editors, The Oxford Handbook of Computational and Mathematical Psychology. Oxford University Press, January 2015. ISBN 978-0-19-995799-6. doi: 10.1093/oxfordhb/9780199957996.013.14. URL http://www.oxfordhandbooks.com/view/10.1093/oxfordhb/9780199957996.001.0001/oxfordhb-9780199957996-e-14.
- Wang and Rudin [2015] Fulton Wang and Cynthia Rudin. Falling Rule Lists. CoRR, abs/1411.5899, 2015.