
Challenges in Applying Explainability Methods to Improve the Fairness of NLP Models

Motivations for methods in explainable artificial intelligence (XAI) often include detecting, quantifying and mitigating bias, and contributing to making machine learning models fairer. However, exactly how an XAI method can help in combating biases is often left unspecified. In this paper, we briefly review trends in explainability and fairness in NLP research, identify the current practices in which explainability methods are applied to detect and mitigate bias, and investigate the barriers preventing XAI methods from being used more widely in tackling fairness issues.




1 Introduction

Trends in Natural Language Processing (NLP) mirror those in Machine Learning (ML): breakthroughs in deep neural network architectures, pre-training and fine-tuning methods, and a steady increase in the number of parameters have led to impressive performance improvements on a wide variety of NLP tasks. However, these successes have been shadowed by repeated discoveries that high accuracy on a held-out test set does not always mean the model performs satisfactorily on other important criteria such as fairness, robustness and safety. The discoveries that models are adversarially manipulable (zhang2020adversarial), show biases against underprivileged groups (chang2019bias), and leak sensitive user information (carlini2021extracting) have inspired a plethora of declarations on Responsible/Ethical AI (morley2021initial). Two of the principles commonly espoused in these documents are fairness and transparency.

Failures in the fairness of models are often attributed, among other things, to the lack of transparency of modern AI models. The implicit argument is that, if biased predictions are due to faulty reasoning learned from biased data, then we need transparency in order to detect and understand this faulty reasoning. Hence, one approach to solving these problems is to develop methods that can peek inside the black box, provide insights into the internal workings of the model, and identify whether the model is right for the right reasons.

As a result, ensuring the fairness of AI systems is frequently cited as one of the main motivations behind XAI research (doshi2017towards; das2020opportunities; wallace2020interpreting). However, it is not always clear how these methods can be applied in order to achieve fairer, less biased models. In this paper, we briefly summarize some XAI methods that are common in NLP research, the conceptualization, sources and metrics for unintended biases in NLP models, and some works that apply XAI methods to identify or mitigate these biases. Our review of the literature in this intersection reveals that applications of XAI methods to fairness and bias issues in NLP are surprisingly few, concentrated on a limited number of tasks, and often applied only to a few examples in order to illustrate the particular bias being studied. Based on our findings, we discuss some barriers to more widespread and effective application of XAI methods for debiasing NLP models, and some research directions to bridge the gap between these two areas.

2 Explainable Natural Language Processing

Self-explaining, Local: Gradients (simonyan2013deep); Integrated Gradients (sundararajan2017axiomatic); SmoothGrad (smilkov2017smoothgrad); DeepLIFT (shrikumar2017learning); Attention (xu2015show; choi2016retain); Representer Point Selection (yeh2018representer)
Self-explaining, Global: Counterfactual LM (feder2021causalm)
Post-hoc, Local: LIME (ribeiro2016should); SHAP (lundberg2017A); Counterfactuals (wu2021polyjuice; ross2021explaining); Extractive rationales (deyoung2020eraser); Influence Functions (koh2017understanding; han2020explaining); Anchors (ribeiro2018anchors)
Post-hoc, Global: TCAV (kim2018interpretability; nejadgholi2022improving); SEAR (ribeiro2018semantically)
Table 1: Explainability methods from Sec. 2 categorized as local vs. global and self-explaining vs. post-hoc.

With the success and widespread adoption of black-box models for machine learning tasks, increasing research effort has been devoted to developing methods that can give human-comprehensible explanations for the behaviour of these models, helping developers and end-users understand the reasoning behind a model's decisions. Broadly speaking, explainability methods try to pinpoint the causes of a single prediction, a set of predictions, or all predictions of a model by identifying the parts of the input, the model, or the training data that have the most influence on the model outcome.

The line dividing XAI methods from methods developed more generally for the understanding, analysis and evaluation of NLP models beyond standard accuracy metrics is not always clear-cut. Many popular approaches such as probes (hewitt2019designing; voita2020information), contrast sets (gardner2020evaluating) and checklists (ribeiro2020beyond) share many of their core motivations with XAI methods. Here, we present some of the most prominent works in XAI, and refer the reader to the survey by danilevsky2020survey for a more extensive overview of the field. We consider a method to be an XAI method if the authors framed it as such in its original presentation, and do not include others in our analysis.

A common categorization of explainability methods is whether they provide local or global explanations, and whether they are self-explaining or post-hoc (Guidotti2018; adadi2018peeking). The first distinction captures whether the explanations are given for individual instances (local) or explain the model behaviour on any input (global). Due to the complex nature of the data and the tasks common in NLP, the bulk of the XAI methods developed for or applicable to NLP models are local rather than global (danilevsky2020survey). The second distinction is related to how the explanations are generated. In self-explaining methods, the process of generating explanations is integrated into, or at least reliant on the internal structure of the model or the process of computing the model outcome. Because of this, self-explaining methods are often specific to the type of the model. On the other hand, post-hoc or model-agnostic methods only assume access to the input-output behaviour of the model, and construct explanations based on how changes to the different components of the prediction pipeline affect the outputs. Below, we outline some of the representative explainability methods used in NLP and categorize them along the two dimensions in Table 1.

Feature attribution methods, also referred to as feature importance or saliency map methods, aim to determine the relative importance of each token in an input text for a given model prediction. The underlying assumption in each of these methods is that the more important a token is for a prediction, the more the output should change when this token is removed or changed. One way to estimate this is through the gradients of the output with respect to each input token, as done by simonyan2013deep. Other methods have been developed to address some of the issues with this original approach, such as local consistency (sundararajan2017axiomatic; smilkov2017smoothgrad; selvaraju2017grad; shrikumar2017learning).
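As an illustration, the gradient-based idea can be sketched for a toy bag-of-embeddings logistic classifier, where the gradient with respect to each token embedding can be written down by hand. All tokens, embeddings and weights below are invented for illustration; real implementations differentiate through the full network with an autodiff framework.

```python
# Minimal gradient-saliency sketch (in the spirit of simonyan2013deep) for a
# toy classifier: score = sigmoid(w . mean(token embeddings)).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_saliency(tokens, embeddings, w):
    """Return |gradient x input| per token, derived analytically."""
    n = len(tokens)
    mean_emb = [sum(e[i] for e in embeddings) / n for i in range(len(w))]
    z = sum(wi * xi for wi, xi in zip(w, mean_emb))
    dz = sigmoid(z) * (1.0 - sigmoid(z))     # derivative of sigmoid at z
    # d score / d e_t = dz * w / n, identical for every token position
    grad = [dz * wi / n for wi in w]
    # gradient x input, summed over embedding dimensions, magnitude only
    return {t: abs(sum(g * ei for g, ei in zip(grad, e)))
            for t, e in zip(tokens, embeddings)}

tokens = ["the", "movie", "was", "terrible"]
embeddings = [[0.1, 0.0], [0.2, 0.3], [0.0, 0.1], [0.9, -0.8]]
w = [1.5, -2.0]
saliency = gradient_saliency(tokens, embeddings, w)
# "terrible" aligns most strongly with w, so it receives the largest score
print(max(saliency, key=saliency.get))
```

Because the model is linear in the mean embedding, gradient x input here reduces to each token's alignment with the weight vector, which is exactly the intuition the saliency map is meant to surface.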

Rather than estimating the effect of perturbations through gradients, an alternative approach is to perturb the input text directly and observe the effect on the model outcome. Two of the most common methods in this class are LIME (ribeiro2016should) and SHAP (lundberg2017A). LIME generates perturbations by dropping subsets of tokens from the input text, and then fits a linear classifier on these local perturbations. SHAP is inspired by Shapley values from cooperative game theory, and calculates feature importance as the fair division of a “payoff” from a game in which the features cooperate to obtain the given model outcome.
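The perturbation idea can be sketched in a few lines. The block below is a simplified stand-in, not LIME itself: rather than fitting a weighted linear model on sampled perturbations, it averages the effect of each token's presence over all subsets of the other tokens, and the black-box classifier is a made-up toy.

```python
# Simplified perturbation-based attribution in the spirit of LIME/SHAP:
# measure each token's average effect on a black-box score when it is
# added to subsets of the remaining tokens.
import itertools

def toy_classifier(tokens):
    # Hypothetical black box: high toxicity score iff "awful" is present.
    return 0.9 if "awful" in tokens else 0.1

def perturbation_importance(tokens, f):
    importances = {}
    for t in tokens:
        others = [x for x in tokens if x != t]
        with_t, without_t = [], []
        # Enumerate all subsets of the remaining tokens (fine for short texts)
        for r in range(len(others) + 1):
            for subset in itertools.combinations(others, r):
                without_t.append(f(list(subset)))
                with_t.append(f(list(subset) + [t]))
        importances[t] = (sum(with_t) - sum(without_t)) / len(with_t)
    return importances

imp = perturbation_importance(["what", "an", "awful", "movie"], toy_classifier)
print(max(imp, key=imp.get))   # "awful" dominates the attribution
```

Exhaustive enumeration is exponential in the input length; LIME and SHAP exist precisely to approximate this kind of quantity by sampling.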

The AllenNLP Interpret toolkit (wallace2019allennlp) provides implementations of both types of feature attribution methods, gradient-based and input-perturbation-based, for six core NLP tasks, including text classification, masked language modeling and named entity recognition.

A third way to obtain feature attribution maps in architectures that use an attention mechanism (bahdanau2015neural) is to look at the relative attention scores for each token (xu2015show; choi2016retain). Whether this approach provides valid explanations has been the subject of heated debate (jain2019attention; wiegreffe2019attention); however, as galassi2020attention note, the debate has mostly centered around the use of attention scores as local explanations. There have also been works that use attention scores to provide global explanations based on the syntactic structures that the model attends to (clark2019does).

Extractive rationales (deyoung2020eraser) are snippets of the input text that trigger the original prediction. They are similar in spirit to feature attribution methods; however, in rationales the attribution is usually binary rather than a real-valued score, and contiguous spans of the text are chosen rather than each token being treated individually. Rationales can also be obtained from humans as explanations of human annotations rather than of model decisions, and used as an additional training signal to guide the model.

Counterfactual explanations are new instances that are obtained by applying minimal changes to an input instance in order to change the model output. Counterfactuals are inspired by notions in causality, and aim to answer the question: "What would need to change for the outcome to be different?" Two examples of counterfactual explanations in NLP are Polyjuice (wu2021polyjuice) and MiCE (ross2021explaining). Polyjuice is model agnostic, and consists of a generative model trained on existing, human generated counterfactual data sets. It also allows finer control over the types of counterfactuals by allowing the user to choose which parts of the input to perturb, and how to perturb them with control codes such as “replace” or “negation”. MiCE uses model gradients to iteratively choose and mask the important tokens, and a generative model to change the chosen tokens so that the end prediction is flipped.
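The core counterfactual search loop can be sketched as follows. The classifier and the substitution table are invented toys; real systems such as Polyjuice and MiCE use trained generative models rather than a hand-written substitution list.

```python
# Minimal model-agnostic counterfactual search: greedily try single-token
# substitutions until the (toy) classifier's label flips.
def toy_sentiment(tokens):
    negative = {"awful", "terrible", "boring"}
    return "neg" if any(t in negative for t in tokens) else "pos"

# Hypothetical substitution list; real methods generate candidates instead.
SUBSTITUTIONS = {"awful": ["great"], "terrible": ["wonderful"], "boring": ["fun"]}

def find_counterfactual(tokens, f):
    original = f(tokens)
    for i, t in enumerate(tokens):
        for sub in SUBSTITUTIONS.get(t, []):
            candidate = tokens[:i] + [sub] + tokens[i + 1:]
            if f(candidate) != original:   # minimal edit that flips the label
                return candidate
    return None

cf = find_counterfactual(["a", "terrible", "movie"], toy_sentiment)
print(cf)   # a one-token edit that flips the prediction from "neg" to "pos"
```

The returned instance answers the counterfactual question directly: the single swapped token is, by construction, the change that would make the outcome different.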

There are also methods that try to pinpoint which examples in the training data have the most influence on a prediction. The most common approach is Influence Functions (koh2017understanding; han2020explaining), where the goal is to efficiently estimate how much removing an example from the data set and retraining the model would change the prediction on a particular input. An alternative is Representer Point Selection (yeh2018representer), which applies to a more limited set of architectures and aims to express the logits of an input as a weighted sum of all the training data points.
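The quantity that influence functions approximate can be computed exactly, by brute-force leave-one-out retraining, for a model small enough to refit. The sketch below does this for a one-parameter least-squares model; all data points are invented. Influence functions exist precisely to avoid this retraining loop for large models.

```python
# Exact leave-one-out influence for a tiny least-squares model -- the quantity
# Influence Functions (koh2017understanding) are designed to estimate cheaply.
def fit_slope(points):
    # least squares through the origin: w = sum(x*y) / sum(x*x)
    return sum(x * y for x, y in points) / sum(x * x for x, y in points)

def loo_influence(points, x_test):
    """Change in the prediction at x_test when each training point is removed."""
    base = fit_slope(points) * x_test
    return [fit_slope(points[:i] + points[i + 1:]) * x_test - base
            for i in range(len(points))]

# Last training point is a deliberately mislabeled outlier
train = [(1.0, 1.0), (2.0, 2.1), (3.0, 2.9), (3.0, 9.0)]
infl = loo_influence(train, x_test=2.0)
# The outlier moves the prediction the most when removed
print(max(range(len(train)), key=lambda i: abs(infl[i])))
```

Ranking training points this way is exactly the debugging use case cited for influence-based explanations: the most influential example for a suspect prediction is often a mislabeled or unrepresentative data point.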

Some explainability methods are designed to provide global explanations using higher-level, semantic concepts. feder2021causalm use counterfactual language models to provide causal explanations based on high-level concepts. Their method contrasts the original model representations with alternative pre-trained representations that are adversarially trained not to capture the chosen high-level concept, so that the total causal effect of the concept on the classification decisions can be estimated.

nejadgholi2022improving adapt the Testing Concept Activation Vector (TCAV) method of kim2018interpretability, originally developed for computer vision, to explain the generalization abilities of a hate speech classifier. In their approach, concepts are defined through a small set of human-chosen examples, and the method quantifies how strongly a concept is associated with a given label.

Finally, some methods produce explanations in the form of rules. One method in this category is Anchors (ribeiro2018anchors), which searches for a set of tokens in a particular input text that predicts the given outcome with high precision. Although Anchors is a local explainability method in that it gives explanations for individual input instances, the generated explanations are globally applicable. SEAR (ribeiro2018semantically), a global explainability method, finds universal replacement rules that, if applied to an input, adversarially change the prediction while keeping the semantics of the input the same.
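The central quantity in Anchors, the precision of a candidate rule over perturbations of the rest of the input, can be sketched with a toy classifier. All tokens and the classifier are made up, and the real method samples perturbations and searches over candidate anchors rather than enumerating everything.

```python
# Toy version of the precision estimate behind Anchors (ribeiro2018anchors):
# how often does the rule "IF these tokens are present THEN predict label"
# hold when the rest of the input is perturbed?
import itertools

def toy_classifier(tokens):
    return "toxic" if "idiot" in tokens else "ok"

def anchor_precision(anchor, other_tokens, f, label):
    """Fraction of perturbations (anchor kept, other tokens dropped or kept
    in every combination) on which the model still predicts `label`."""
    hits = total = 0
    for r in range(len(other_tokens) + 1):
        for kept in itertools.combinations(other_tokens, r):
            total += 1
            hits += (f(list(anchor) + list(kept)) == label)
    return hits / total

# "idiot" alone is a perfect anchor for the toxic label...
print(anchor_precision(["idiot"], ["you", "are", "an"], toy_classifier, "toxic"))
# ...while "you" only predicts it when "idiot" happens to survive perturbation
print(anchor_precision(["you"], ["are", "an", "idiot"], toy_classifier, "toxic"))
```

An anchor is accepted when this precision clears a high threshold, which is what makes the resulting rule applicable beyond the single instance it was extracted from.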

3 Fairness and Bias in NLP Models

Unintended bias in NLP is a complex and multifaceted issue that spans various undesirable model behaviours causing allocational and representational harms to certain demographic groups (blodgett2020language). When the demographic group is already marginalized and underprivileged in society, biases in NLP models can further contribute to that marginalization and to the unfair allocation of resources. Examples include performance disparities between standard and African American English (blodgett2017racial), stereotypical associations between gendered pronouns and occupations in coreference resolution (rudinger2018gender) and machine translation (stanovsky2019evaluating), and false positives in hate speech detection on innocuous tweets mentioning demographic attributes (rottger2021hatecheck). In this section, we review some of the most popular methods and metrics for identifying such biases. For more comprehensive coverage, see the recent surveys by mehrabi2021survey and caton2020fairness.

Most works in the ML fairness literature assume that biases in machine learning models originate from misrepresentations in training datasets and merely reflect societal biases. However, as hooker2021moving explains, design choices can amplify societal biases, and automated data processing can lead to systematic, unprecedented harms. shah2020predictive identify five sources of bias in NLP models. Selection bias and label bias originate in the training data. The former refers to biases created when choosing which data points to annotate, and includes under-representation of some demographic groups as well as misrepresentation due to spurious correlations. The latter refers to biases introduced by the annotation process, such as when annotators are less familiar with, or biased against, text generated by certain groups, causing more annotation errors for some groups than others. Model bias refers to biases due to the model structure, which are responsible for the over-amplification of discrepancies observed in the training data. Semantic bias refers to biases introduced by the pre-trained representations, and includes representational harms such as stereotypical associations. Finally, bias in research design covers the larger issue of the uneven allocation of research effort across different groups, dialects, languages and geographic areas.

Research in fair ML has developed a number of metrics to quantify the biases in an ML model. These metrics are usually classified as group fairness metrics and individual fairness metrics (castelnovo2021zoo; czarnowska2021quantifying). Group fairness metrics focus on quantifying the performance disparity between different demographic groups. Some examples are demographic parity, which measures the difference in positive prediction rates across groups; predictive parity, which measures the difference in precision across groups; and equality of odds, which measures the differences in false positive and false negative rates across groups. Individual fairness metrics are based on the idea that the model should behave the same for similar examples regardless of the value of a protected attribute. A refinement of this approach is counterfactual fairness, where the criterion is that the model decision remains the same for a given individual in a counterfactual world where that individual belonged to a different demographic group. In NLP, this notion often appears as counterfactual token fairness (garg2019counterfactual), and is operationalized through test suites that include variations of the same text in which tokens associated with certain social groups are replaced with others; the bias of the model is measured by the performance disparity between the pairs (kiritchenko2018examining; prabhakaran2019perturbation).
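The group fairness metrics above can be made concrete with a short sketch; the predictions and gold labels for the two hypothetical demographic groups are invented for illustration.

```python
# Group fairness gaps from binary predictions and gold labels.
def rates(preds, labels):
    """Positive prediction rate, precision, FPR and FNR for one group."""
    pos_rate = sum(preds) / len(preds)
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    neg_gold = sum(1 for y in labels if y == 0)
    pos_gold = sum(1 for y in labels if y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    return pos_rate, precision, fp / neg_gold, fn / pos_gold

# (prediction, gold label) pairs for two hypothetical demographic groups
a_preds, a_labels = [1, 1, 0, 0], [1, 0, 0, 0]
b_preds, b_labels = [1, 0, 0, 0], [1, 1, 0, 0]

pos_a, prec_a, fpr_a, fnr_a = rates(a_preds, a_labels)
pos_b, prec_b, fpr_b, fnr_b = rates(b_preds, b_labels)

demographic_parity_gap = abs(pos_a - pos_b)    # positive prediction rates
predictive_parity_gap = abs(prec_a - prec_b)   # precision across groups
equality_of_odds_gap = max(abs(fpr_a - fpr_b), abs(fnr_a - fnr_b))
print(demographic_parity_gap, predictive_parity_gap, equality_of_odds_gap)
```

Note that the three gaps generally disagree: on this toy data the model looks mildly unfair under demographic parity but much worse under predictive parity and equality of odds, which is why the choice of metric is itself a normative decision.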

Both group and individual fairness metrics are instances of outcome fairness: whether a model is fair is determined solely by its outcomes with respect to various groups, regardless of how the algorithm produced those outcomes. (Outcome fairness is also referred to as distributive fairness in this literature.) There is a complementary notion, procedural fairness, often considered in organizational settings (blader2003constitutes), which aims to capture whether the processes followed to obtain the outcome are fair. In ML, this translates to whether the model's internal reasoning process is fair to different groups or individuals (grgic2018beyond; morse2021ends). For example, outcome fairness for a resume sorting system might be implemented as ensuring that the model has the same acceptance rates, or the same precision and recall, for groups defined by race, gender, or other demographic attributes. A procedural fairness approach, on the other hand, might aim to ensure that the decision making process of the system relies only on skill-related features, and not on features strongly associated with demographic attributes, such as names and pronouns. The distinction between procedural and outcome fairness relates to the different kinds of discrimination outlined in anti-discrimination law, namely disparate treatment and disparate impact (barocas2016big).

Fairness metrics were originally developed for applications where social group membership is known, for example in healthcare-related tasks. An issue with applying them to NLP tasks is that either the demographic information is unavailable and needs to be estimated, or some auxiliary signal, such as the mention of a target group or the gender of a pronoun, needs to be used. However, inferring people's social attributes from their data raises important ethical concerns in terms of privacy violations, lack of meaningful consent, and intersectional invisibility (mohammad2021ethics). Since determining whether a text is about a certain identity group is easier than determining whether it was produced by a certain identity group, more works investigate the former than the latter. An exception is the line of studies on the disparate performance of models on certain dialects such as African American English (AAE) (sap2019risk; blodgett2017racial), made possible by the existence of a dialect identification tool for AAE, trained by pairing geo-located tweets with US census data on race (blodgett2016demographic).
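The counterfactual-token test suites discussed in this section can be sketched as a template check: swap identity terms into a slot and compare the scores of a deliberately biased, entirely invented toy classifier. Real evaluations use curated templates and much larger term lists (kiritchenko2018examining).

```python
# Minimal counterfactual token fairness check in the style of
# garg2019counterfactual: score gap across identity-term substitutions.
IDENTITY_TERMS = ["gay", "straight", "muslim", "christian"]

def toy_toxicity_score(tokens):
    # Hypothetical biased black box: some identity terms alone raise the score.
    return 0.8 if ("gay" in tokens or "muslim" in tokens) else 0.2

def counterfactual_gap(template, slot, terms, score):
    """Max score difference when identity terms are swapped into `slot`."""
    scores = [score([term if t == slot else t for t in template])
              for term in terms]
    return max(scores) - min(scores)

gap = counterfactual_gap(["i", "am", "IDENT"], "IDENT", IDENTITY_TERMS,
                         toy_toxicity_score)
# A large gap flags counterfactual token unfairness: the prediction depends
# on the identity term alone, with all other content held fixed.
print(round(gap, 2))
```

A fair model under this criterion would produce a gap near zero, since nothing but the identity term differs between the paired inputs.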

One source of bias that the NLP community has devoted significant research effort to is word embeddings and pre-trained language models (bolukbasi2016man; zhao2019gender), which shah2020predictive characterize as semantic bias. Although it is not framed as such, this can be seen as a particular global explanation for the biases that models demonstrate in downstream tasks. However, the effectiveness of these methods has recently been questioned by goldfarb2021intrinsic, who found no correlation between intrinsic bias metrics obtained from embedding association tests and extrinsic bias metrics on downstream tasks.

4 Applications of XAI in Fair NLP

Study | Overall Objective of the Study | Application | Bias Type | Explainability Method
mosca2021understanding | Detecting classifier sensitivity towards identity terms vs. user tweet history | hate speech detection | social group bias | SHAP
wich2020impact | Measuring the effect of bias on classification performance | hate speech detection | political orientation | SHAP
aksenov2021fine | Classification of political bias in news | hate speech detection | political orientation | aggregated attention scores
kennedy2020contextualizing | Reducing the classifier's oversensitivity to identity terms | hate speech detection | social group bias | feature importance (SOC)
mathew2021hatexplain | Improving group fairness | hate speech detection | social group bias | LIME, attention
prabhakaran2019perturbation | Detecting biases related to named entities | sentiment analysis, toxicity detection | sensitivity to named entities | perturbation analysis
balkir2022necessity | Detecting over- and under-sensitivity to identity tokens | hate speech and abusive language detection | social group bias | necessity and sufficiency
Table 2: Summary of the studies that apply explainability techniques to uncover unintended biases in NLP systems.

To determine the uses of explainability methods in fair NLP, we search the ACL Anthology for papers that cite the explainability methods listed in Section 2 and that include the keywords “fair”, “fairness”, or “bias”. We further exclude papers that focus on other senses of bias, such as inductive bias or the bias terms of an architecture. Our results show that although a number of papers mention unintended or societal biases as a wider motivation to contextualize the work (e.g., zylberajch2021hildif), only a handful apply explainability methods to uncover or investigate biases. All of the works we identify in this category use feature attribution methods and, except for aksenov2021fine, employ them for demonstration purposes on a few examples. Although our methodology excludes works published in venues other than ACL conferences and workshops, we believe it gives a good indication of the status of XAI in fairness and bias research in NLP.

mosca2021understanding use SHAP to demonstrate that adding user features to a hate speech detection model reduces biases that are due to spurious correlations in text, but introduces other biases based on user information. wich2020impact also apply SHAP to two example inputs in order to illustrate the political bias of a hate speech model. aksenov2021fine aggregate attention scores from BERT into global explanations in order to identify which words are most indicative of political bias.

Some works at the intersection of fairness and XAI in NLP that fall outside our search methodology are those of kennedy2020contextualizing, who use the Sampling and Occlusion algorithm of jin2019towards to detect bias toward identity terms in hate speech classifiers, and mathew2021hatexplain, who show that using human rationales as an additional signal when training hate speech detection models reduces the bias of the model towards target communities. prabhakaran2019perturbation target individual fairness, and develop a framework to evaluate model bias against particular named entities with a perturbation-based analysis. Although they do not frame their method as such, the automatically generated perturbations can be categorized as counterfactuals. balkir2022necessity suggest the use of two metrics, necessity and sufficiency, as feature attribution scores, and apply their method to uncover different kinds of bias against protected group tokens in hate speech and abusive language detection models.

As summarized in Table 2, almost all these works focus exclusively on hate speech detection, and use local feature attribution methods. The range of bias types is also quite limited. This demonstrates the very narrow context in which explainability has been linked to fairness in NLP.

There are also works beyond NLP that use XAI to improve the fairness of ML models. zhang2018fairness, parafita2021deep and grabowicz2022marrying leverage methods from causal inference both to model the causes of a given prediction and provide explanations, and to ensure that protected attributes do not influence model decisions through unacceptable causal chains. The disadvantage of these models is that they require an explicit model of the causal relations between features, which is difficult to construct for textual data (feder2021causal). pradhan2021interpretable also suggest a causality-inspired method that identifies the subsets of the data responsible for particular biases of the model. begley2020explainability extend Shapley values to attribute the overall unfairness of an algorithm to individual input features. The main limitation of all these methods is that they are currently applicable only to low-dimensional tabular data; how to extend them to explain the unfairness of NLP models remains an open research problem.

At a more abstract level, deepak2021fairness outline potential synergies between the two research areas, and alikhademi2021can enumerate different sources of bias and discuss how XAI methods can help identify and mitigate them.

5 XAI for Fair NLP through Causality and Robustness

The framework of causality (pearl2009causality) is invoked in both the fairness and the explainability literature. The promise of causality is that it goes beyond correlations and characterizes the causes behind observations. This is relevant to conceptualizing fairness since, as loftus2018causal argue, there are situations that are intuitively different from a fairness point of view but that purely observational criteria cannot distinguish.

Causality tries to capture the notion of causes of an outcome in terms of hypothetical interventions: if something is a true cause of a given outcome, then intervening on this variable will change the outcome. This notion of intervention is useful for both detecting biases and for choosing mitigation strategies. Causal interventions are also the fundamental notion behind counterfactual examples in XAI. It is easier for humans to identify the cause of a prediction if they are shown minimally different instances that result in opposite predictions. Hence, causal explanations can serve as proofs of bias or other undesirable correlations to developers and to end-users.

Going beyond correlations in the data and capturing causal relations is also an effective way to increase the robustness and generalization of machine learning models. As kaushik2020explaining argue, causal correlations are invariant across differing data distributions, while non-causal correlations are much more context- and dataset-specific. Hence, models that can differentiate between the two and rely solely on causal correlations, while ignoring non-causal ones, will perform well beyond the strict i.i.d. setting.

Non-causal, surface-level correlations are often referred to as spurious correlations, and a common use case of XAI methods for developers is to facilitate the identification of such patterns. A common motivating argument in XAI methods for debugging NLP models (lertvittayakumjorn2021explanation; zylberajch2021hildif), as well as in counterfactual data augmentation methods (kaushik2020explaining; balashankar2021can; yang2021exploring), is that unintended biases arise from the model picking up such spurious associations, and that XAI methods which improve the robustness of a model against these spurious patterns will also improve its fairness as a side effect. There is indeed evidence that methods for robustness also reduce unintended bias in NLP models (adragna2020fairness; pruksachatkun2021does). However, these methods can address unintended biases only insofar as the biases are present and identifiable as token-level spurious correlations.

6 Challenges and Future Directions

As we saw in Sec. 4 and 5, only a few studies to date have attempted to apply explainability techniques to uncover biases in NLP systems, and only to a limited extent. In this section, we discuss some possible reasons for this seeming lack of progress and outline promising directions for future research.

Local explainability methods rely on the user to identify examples that might reveal bias.

One issue preventing wider adoption of XAI methods in fair NLP stems from the local nature of most explanation methods applicable to NLP models. An important step in identifying fairness problems in a model is identifying the data points where these issues might manifest. Since local explainability methods give explanations for particular data points, it is left to the user to pick the instances to examine. This requires the user to decide what biases to search for before employing XAI methods, limiting their usefulness for identifying unknown biases.

Local explanations are not easily generalizable.

Even if an issue can be identified with a local XAI method, it is difficult to know to what extent the insight can be generalized. This is an issue because it is often essential to know what subsets of the input are affected by the identified biased behaviour in order to apply effective mitigation strategies. Some methods such as Anchors mitigate this problem by specifying the set of examples an explanation applies to. Other approaches use abstractions such as high-level concepts (feder2021causalm; nejadgholi2022improving) to provide more generalizable insights. Principled methods to aggregate local explanations into more global and actionable insights are needed to make local explainability methods better suited to identifying and mitigating unintended biases in NLP models. Also, future NLP research could explore global explainability methods that have been used to uncover unknown biases (tan2018distill).

Not all undesirable biases are surface-level or non-causal.

In the motivations for XAI methods, there is a strong emphasis on identifying token-level correlations caused by sampling bias or label bias. Although methods that target these patterns have been shown to also improve the fairness of models, not all sources of bias fit this characterization (hooker2021moving), and hence some may be difficult to detect with XAI methods that provide token-level explanations. For example, bagdasaryan2019differential show that the accuracy cost that differential privacy methods impose on deep learning NLP models is much higher for underrepresented subgroups. A rigorous study of a model's structure and training process is required to discover such sources of bias.

Another issue common to works that approach fairness through robustness is the characterization of unintended biases as non-causal associations in the data (kaushik2020explaining; adragna2020fairness). In fact, it can be argued that many of the undesirable correlations observed in data are causal in nature, and will likely hold across a wide variety of data distributions. For example, correlations between genders and occupations, which are arguably the source of the occupational gender stereotypes picked up by NLP models (rudinger2018gender), are not due to unrepresentative samples or random correlations in the data, but rather to underlying systemic biases in the distribution of occupations in the real world. To ensure a fair system, researchers must make a normative decision (blodgett2020language) that they do not want to reproduce this particular correlation in their model. This suggests there may be inherent limitations to the ability of XAI methods to improve the fairness of NLP models via improved robustness and generalization.

Some biases can be difficult for humans to recognize.

Even for biases that can be characterized in terms of surface-level correlations, XAI methods rely on humans to recognize that a correlation is undesirable, and biased models are often biased in subtle ways. For example, if the dialect bias in a hate speech detection system is mostly mediated by false positives on uses of reclaimed slurs, an explanation highlighting the slur might seem like a good justification to a user who is unfamiliar with this phenomenon (sap2019risk). More studies with human subjects are needed to investigate whether humans can use explainability methods to recognize the unintended biases that cause fairness issues as well as they can recognize simpler data biases.

Explainability methods are susceptible to fairwashing.

An issue that has repeatedly been raised with respect to XAI methods is the potential for “fairwashing” biased models: adversarially manipulating explanations in order to obscure the model’s reliance on protected attributes. Fairwashing has been shown to be possible for rule lists (aivodji2019fairwashing) and for both gradient-based and perturbation-based feature attribution methods (dimanov2020you; anders2020fairwashing). This relates to the wider issue of the faithfulness of an explainability method: if there is no guarantee that the explanations reflect the actual inner workings of the model, the explanations are of little use. One solution to this problem would be to extend certifiable robustness (cohen2019certified; ma2021metamorphic) beyond the model itself and develop certifiably faithful explainability methods, with proofs that a particular way of testing for bias cannot be adversarially manipulated. Another approach is to report the level of uncertainty in each explanation, giving the end-user more information on whether to trust it (zhang2019should), or to otherwise calibrate user trust to the quality of the provided explanations (zhang2020effect). However, the effectiveness of these methods depends substantially on whether the model’s predicted probabilities are well-calibrated to the true outcome probabilities, a criterion that commonly used deep learning models often fail to meet: they have been shown to be over-confident in their predictions (guo2017calibration). Calibrating uncertainties is therefore a necessary prerequisite if they are to be used to calibrate user trust, as over-confident predictions can themselves become a source of mistrust.

Fair AI is focused on outcome fairness, but XAI is motivated by procedural fairness.

Finally, there appears to be a larger conceptual gap between the notions of fairness that the ethical AI community has developed and the notion of fairness implicitly assumed in motivations for XAI methods. Almost all the fairness metrics developed in the Fair ML literature formalize outcome fairness: they are process-agnostic and quantify the fairness of a model from its observed outcomes only. The type of fairness that motivates XAI, on the other hand, is closer to the concept of procedural fairness: XAI aims to elucidate the internal reasoning of a model and make transparent whether any part of the decision process could be deemed unfair.
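To make the process-agnostic nature of outcome fairness metrics concrete, here is a minimal sketch of one standard metric, demographic parity difference; the data and group labels are toy values, and real evaluations would use an established library rather than this hand-rolled version.

```python
def demographic_parity_difference(predictions, groups):
    """Gap between the highest and lowest positive-prediction rate across
    protected groups. Uses only observed outcomes; the model's internal
    reasoning never enters the computation, which is what makes the metric
    process-agnostic."""
    by_group = {}
    for pred, g in zip(predictions, groups):
        by_group.setdefault(g, []).append(pred)
    rates = {g: sum(p) / len(p) for g, p in by_group.items()}
    return max(rates.values()) - min(rates.values())

# Toy binary predictions for two demographic groups.
preds  = [1, 1, 0, 1, 0, 0, 0, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(demographic_parity_difference(preds, groups))
# → 0.5 (group a: 0.75 positive rate, group b: 0.25)
```

Note that the function takes only predictions and group labels: two models with identical outputs receive identical scores even if one reasons directly from a protected attribute and the other does not, which is exactly the gap between outcome and procedural fairness discussed above.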

We observe that, lacking better definitions of procedural fairness, the most common way XAI methods are applied to fairness issues is to check whether the model uses features that are explicitly associated with protected attributes (e.g., gendered pronouns). This practice promotes an ideal similar to “fairness through unawareness”, except that it places the veil of ignorance about the protected attributes not at the level of the data fed into the model, but inside the model itself. In other words, the best one could do with these techniques seems to be to develop “colourblind” models which, even if they receive explicit information about protected attributes in their input, ignore this information when making their decisions. Although simple and intuitive, such an approach likely shares the problems of the much-criticized “fairness through unawareness” approach (kusner2017counterfactual; morse2021ends). More clearly specified notions of procedural fairness, along with precise quantitative metrics similar to those developed for outcome fairness, are needed to guide the development of XAI methods that can make ML models fairer.

7 Conclusion

Publications in explainable NLP often cite fairness as a motivation for the work, but the exact relationship between the two concepts is typically left unspecified. Most current XAI methods provide explanations on a local level through post-hoc processing, leaving open questions about how to automatically identify fairness issues in individual explanations, and how to generalize from local explanations to infer systematic model bias. Although the two fields of explainability and fairness feel intuitively linked, a review of the literature revealed a surprisingly small amount of work at the intersection. We have discussed some of the conceptual underpinnings shared by both these fields as well as practical challenges to uniting them, and proposed areas for future research.