Toward a Unified Framework for Debugging Gray-box Models

by Andrea Bontempelli, et al.
Università di Trento

We are concerned with debugging concept-based gray-box models (GBMs). These models acquire task-relevant concepts appearing in the inputs and then compute a prediction by aggregating the concept activations. This work stems from the observation that in GBMs both the concepts and the aggregation function can be affected by different bugs, and that correcting these bugs requires different kinds of corrective supervision. To this end, we introduce a simple schema for identifying and prioritizing bugs in both components, discuss possible implementations and open problems. At the same time, we introduce a new loss function for debugging the aggregation step that extends existing approaches to align the model's explanations to GBMs by making them robust to how the concepts change during training.




1 Introduction

A central tenet of explainable AI is that the bugs and biases affecting a model can be uncovered by computing and analyzing explanations for the model’s predictions Guidotti et al. (2018). Post-hoc explanations of black-box models, however, can be unfaithful and ambiguous Rudin (2019); Dombrowski et al. (2019); Teso (2019); Lakkaraju and Bastani (2020); Sixt et al. (2020), and the extraction process can be computationally challenging Van den Broeck et al. (2021).

Gray-box models (GBMs) are designed to make this step as straightforward as possible while retaining the performance of more opaque alternatives. We focus on concept-based gray-box models, a class of predictors that learn a set of high-level, interpretable concepts capturing task-salient properties of the inputs, and then obtain predictions by aggregating the concept activations in a (typically) understandable manner Alvarez-Melis and Jaakkola (2018); Losch et al. (2019); Koh et al. (2020); Chen et al. (2019); Hase et al. (2019); Rymarczyk et al. (2020); Nauta et al. (2021); Lage and Doshi-Velez (2020). GBMs explain their own predictions by supplying faithful concept-level attribution maps, which encompass both the concepts and the aggregation weights, thus facilitating the identification of bugs affecting the model. Work on troubleshooting GBMs is, however, sparse and ad hoc Rudin et al. (2021): some approaches assume the concepts to be given and focus on correcting the aggregation step Teso (2019); Stammer et al. (2021), while others fix the learned concepts while ignoring how they are aggregated Barnett et al. (2021); Lage and Doshi-Velez (2020). Fixing only one set of bugs is, however, insufficient, because a GBM can be affected by bugs in both components at once.

In this paper, we outline a unified framework for debugging GBMs. Our framework stems from the simple observations that the quality of a GBM hinges on both the quality of the concepts it relies on and on how these are aggregated, and that both elements are conveniently captured by the GBM’s explanations. This immediately suggests an explanation-based interactive debugging strategy composed of three different phases, namely evaluating concept quality, correcting the aggregation weights, and correcting the concepts themselves. This simple setup allows us to highlight limitations in existing works and indicate possible ways to overcome them. As a first step toward implementing this strategy, we introduce a new loss function on the aggregation weights that generalizes approaches for aligning local explanations Schramowski et al. (2020); Lertvittayakumjorn et al. (2020) to be robust to changes in the underlying concepts during training. We then show how the same strategy can be applied recursively to align the concepts themselves by correcting both their predictions and explanations.


Summarizing, we:

  1. Introduce a unified framework for debugging GBMs that enables us to distinguish between and address bugs affecting how the concepts are defined and how they are aggregated.

  2. Illustrate how to incorporate corrective feedback into the aggregation step in a manner that is invariant to changes to the learned concepts.

  3. Discuss how to align the concepts by fixing the reasons behind their activations, opening the door to explanation-based debugging of the concepts themselves.

2 Concept-based Gray-box Models

We are concerned with learning a high-quality classifier $f: \mathbb{R}^d \to [c]$ that maps instances $x$ into labels $y \in [c] := \{1, \ldots, c\}$. In particular, we focus on concept-based gray-box models (GBMs) that fit the following two-level structure.

At the lower level, the model extracts a vector of $k$ concept activations $\phi(x) := (\phi_1(x), \ldots, \phi_k(x))^\top$ from the raw input $x$. The concepts are usually learned from data so as to provide a strong indication of specific classes Chen et al. (2019). For instance, in order to discriminate between images of cars and plants, the model might learn concepts that identify wheels and leaves. The concepts themselves are completely black-box.

At the upper level, the model aggregates the concept activations into per-class scores $s_y(x)$, typically in a simulatable Lipton (2018) manner, most often by taking a linear combination of the concept activations:

$$ s_y(x) := \langle w^{(y)}(x), \phi(x) \rangle = \sum_{j=1}^{k} w^{(y)}_j(x) \, \phi_j(x), $$

where $w^{(y)}(x) \in \mathbb{R}^k$ is the weight vector associated to class $y$. Class probabilities are then obtained by passing the scores through a softmax activation, that is,

$$ P(y \mid x) := \frac{\exp s_y(x)}{\sum_{y'=1}^{c} \exp s_{y'}(x)}. $$


GBMs are learned by minimizing an empirical loss $\sum_{(x, y) \in S} \ell(f, (x, y))$, where $(x, y)$ runs over the training set $S$, as customary. The loss function combines a standard loss for classification, for instance the cross-entropy loss, together with additional regularization terms that encourage the learned concepts to be understandable to (sufficiently expert) human stakeholders. Commonly used criteria include similarity to concrete examples Alvarez-Melis and Jaakkola (2018); Chen et al. (2019), disentanglement Alvarez-Melis and Jaakkola (2018), and boundedness Koh et al. (2020).

With these GBMs, it is straightforward to extract local explanations that capture how different concepts contribute to a decision $(x, y)$. These explanations pair each aggregation weight with the activation of the corresponding concept, and take the form:

$$ E(x, y) := \{ (w^{(y)}_j(x), \phi_j(x)) : j = 1, \ldots, k \}. $$

Notice that both the concepts and the aggregation weights are integral to the explanation: the concepts establish a vocabulary that enables communication with stakeholders, while the weights convey the relative importance of different concepts. Crucially, the prediction $y$ is independent from the input $x$ given the explanation $E(x, y)$, ensuring that the latter is faithful to the model’s decision process.
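As an illustration, the two-level structure and the resulting local explanation can be sketched in a few lines of Python. This is a minimal sketch rather than the implementation of any of the models discussed here: the function names (`gbm_predict`, `explain`), the toy concepts and the weight values are all assumptions, and the concept activations are taken as given instead of being computed by a learned, black-box extractor.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of per-class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gbm_predict(concepts, weights):
    """Aggregate concept activations into class probabilities.

    concepts: list of k concept activations for one input (assumed to be
              precomputed by some concept extractor, not shown here).
    weights:  list of c weight vectors, one per class, each of length k.
    """
    scores = [sum(w * phi for w, phi in zip(w_y, concepts)) for w_y in weights]
    return softmax(scores)

def explain(concepts, weights, label):
    """Local explanation: (weight, activation) pairs for the given class."""
    return list(zip(weights[label], concepts))

# Toy usage: two hypothetical concepts ("wheel", "leaf") and two classes
# (0 = car, 1 = plant). All numbers are made up for illustration.
concepts = [0.9, 0.1]
weights = [[2.0, -1.0], [-1.0, 2.0]]
probs = gbm_predict(concepts, weights)
label = max(range(len(probs)), key=probs.__getitem__)
explanation = explain(concepts, weights, label)
```

In this sketch the prediction is computed from the weights and activations alone, which mirrors the faithfulness property discussed above: nothing about the raw input enters the decision except through the explanation.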

| Method | Concepts | Aggregator | Extra Annot. | Training |
| SENN Alvarez-Melis and Jaakkola (2018) | Autoencoder | Linear Comb. | | End-to-end |
| PPNet Chen et al. (2019) | Conv. Filters | Linear Comb. | | Multistep |
| IAIA-BL Barnett et al. (2021) | Conv. Filters | Linear Comb. | Concept Attr. | Multistep |
| CBM Koh et al. (2020) | Arbitrary | Arbitrary | Concept Labels | End-to-end |

Table 1: Comparison between concept-based GBMs considered in this work.

2.1 Implementations

Next, we outline how some well-known GBMs fit the above two-level template. A summary can be found in Table 1. Additional models are briefly discussed in the Related Work.

Self-Explainable Neural Networks (SENNs), introduced by Alvarez-Melis and Jaakkola (2018), generalize linear models to representation learning. SENNs instantiate the above two-level template as follows. The weight functions $w^{(y)}(\cdot)$, one per class $y$, are implemented using a neural network for multivariate regression. Importantly, the weights are regularized so as to vary slowly across inputs. This is achieved by penalizing the model for deviating from its first-order (i.e., linear) Taylor decomposition. The concepts $\phi$ are either given and fixed or, more commonly, learned using an autoencoder. The latter acquires concepts that reconstruct the input while producing diverse (more specifically, sparse) activation patterns. The weights and concepts are trained jointly in an end-to-end fashion. The concepts themselves are often represented using those concrete training examples that maximally activate them.

Part Prototype Networks (PPNets) Chen et al. (2019) ground the two-level template to image classification as follows. The weights $w^{(y)}$ are constant with respect to the input $x$, while the concepts $\phi$ indicate the presence of “part prototypes”, that is, prototypes that capture specific parts of images appearing in the training set. The part prototypes are implemented as convolutional filters extracted from a (pre-trained) convolutional neural network. Each such filter corresponds to a parameter vector and has a rectangular receptive field spanning a patch of input pixels. In PPNets, each class is associated to its own part prototypes, which are specifically selected among all the filters in the convolutional neural network so that they closely match training images of that class and do not match those of the other classes. The concepts and the weights are learned sequentially.

IAIA-BL Barnett et al. (2021) specializes PPNets to image-based medical diagnosis. In particular, IAIA-BL accepts per-example attribute relevance information (e.g., annotations of symptomatic regions in X-ray images) and penalizes part prototypes that activate outside the relevant areas.

Concept Bottleneck Models (CBMs) Koh et al. (2020); Losch et al. (2019) are regular feed-forward neural networks in which the neurons in a given layer are trained to align with a known set of concept functions $\phi_1, \ldots, \phi_k$. This is achieved by supplying the CBM with concept-level annotations. The aggregation step is not particularly restrained: the concept neurons can be combined through a single dense layer, in which case the weights are constant with respect to $x$, or through a more complex sequence of layers, in which case the weights are not constant and must be inferred using post-hoc techniques.

3 Existing Strategies for Debugging GBMs

Existing literature on debugging GBMs can be split into two groups. Approaches in the first group are concerned with bugs affecting the aggregation weights $w$. These bugs occur when the data fools the model into assigning non-zero weight to concepts that correlate with – but are not causal for – the label. A prototypical example is class-specific watermarks in image classification Lapuschkin et al. (2019). Teso (2019) addresses this issue by leveraging explanatory interactive learning (XIL) Teso and Kersting (2019); Schramowski et al. (2020). As in active learning, in XIL the machine obtains labels by querying a human annotator, but in addition it presents predictions for its queries and local explanations for its predictions. The user then provides corrective feedback on the explanations, for instance by indicating those parts of an image that are irrelevant to the class label but that the machine relies on for its prediction. The model is then penalized whenever its weights do not align with the user’s corrections. This setup assumes that the concepts $\phi$ are fixed rather than learned from data. Stammer et al. (2021) extend XIL to neuro-symbolic explanations and structured attention, but also assume the concepts to be fixed.

Conversely, approaches in the other group are only concerned with how the concepts are defined: concepts learned from data, even if discriminative and interpretable, may be misaligned with (the stakeholders’ understanding of) the prediction task and may thus fail to generalize properly. CBMs work around this issue by leveraging concept-level label supervision Koh et al. (2020); however, this does not ensure that the concepts themselves are “right for the right reasons” Ross et al. (2017). As a matter of fact, just like all other neural nets Szegedy et al. (2013), part prototypes learned by PPNets can pick up and exploit uninterpretable features of the input Hoffmann et al. (2021). Hoffmann et al. (2021) and Nauta et al. (2020) argue that it may be difficult for users to understand what a learned prototype (concept) represents, unless the model explains why the concept activates. To this end, Nauta et al. (2020) propose a perturbation-based technique – analogous to LIME Ribeiro et al. (2016) – for explaining why a particular prototype activates on a certain region of an image, but they do not illustrate how to fix this kind of bug. Finally, Barnett et al. (2021) introduce a loss term for PPNets that penalizes concepts that activate on regions annotated as irrelevant by a domain expert. None of these works, however, considers interaction with a human debugger and the issues that this brings with it.

4 A Unified Framework for Debugging GBMs

Existing debugging strategies differ in what bugs and models they handle and, more generally, neglect the case of GBMs affected by multiple, different bugs. Motivated by this observation, we propose a unified framework based on the “right for the right reasons” principle Ross et al. (2017). From this perspective, the goal is to ensure that the model outputs high-quality predictions that are accompanied by high-quality explanations.

4.1 What is a high-quality explanation?

Before proceeding, we need to define what we mean for an explanation to be “high-quality”. Intuitively, this ought to depend on the quality of the learned concepts $\phi$ and aggregation weights $w$. We unpack this intuition as follows:

A set of concepts $\phi$ is high-quality for a decision $(x, y)$ if:

  • C1: The set of concepts is sufficient to fully determine the ground-truth label $y$ from $x$. (In case the ground-truth label is ill-defined, it should be replaced with the Bayes optimal label.)

  • C2: The various concepts are (approximately) independent from each other.

  • C3: Each concept is semantically meaningful.

  • C4: Each concept is easy to interpret.

Given high-quality concepts $\phi$, a set of weights $w$ is high-quality for a decision $(x, y)$ if:

  • It ranks all concepts in $\phi$ compatibly with their relevance for the task.

  • It associates (near-)zero weight to concepts in $\phi$ that are task-irrelevant.

An explanation is high-quality for a decision $(x, y)$ if both its concepts $\phi$ and its weights $w$ are.

It is worth discussing the various points in detail. Requirement C1 ensures that the concepts are task-relevant and jointly sufficient to solve the prediction task. This requirement is imposed by the learning problem itself. Notice that, since in all GBMs the prediction $y$ is independent from the input $x$ given the explanation, C1 is necessary for the model to achieve good generalization. Although some works consider the completeness of the concept vocabulary Yeh et al. (2020); Bahadori and Heckerman (2021), we argue that – when it comes to local explanations – sufficiency is more relevant, although it is clear that if the concepts are sufficient for all instances $x$, then they do form a complete vocabulary.

Requirements C2, C3, and C4, on the other hand, are concerned with whether the concept set can be used for communicating with human stakeholders, and are at the core of GBMs and debugging techniques for them. Notice also that requirements C1–C4 are mutually independent: not all task-relevant concepts are understandable (indeed, this is often not the case), not all semantically meaningful concepts are easy to interpret, etc.

4.2 Debugging GBMs in three steps

Assume we are given a decision $(x, y)$ and an explanation for it that is not high-quality. What bugs should the user prioritize? We propose a simple three-step procedure:

Step 1

Evaluating concept quality: Determine if the concept set $\phi$ contains a high-quality subset that is sufficient to produce a correct prediction for the target instance $x$.

Step 2

Correcting the aggregation weights: If so, then it is enough to fix how the model combines the available concepts by supplying corrective supervision for the aggregation weights $w$.

Step 3

Correcting the learned concepts: Otherwise, it is necessary to create a high-quality subset by supplying appropriate supervision on the concepts themselves.

In the following we discuss how to implement these steps.
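The branching logic of the three steps can be sketched as follows. This is only an illustration of the control flow: the function `debug_decision` and the two callables standing in for the annotator's judgment (`is_high_quality`, `is_sufficient`) are hypothetical, and the sketch simplifies Step 1 by testing the largest high-quality subset only.

```python
def debug_decision(concepts, is_high_quality, is_sufficient):
    """Decide which kind of corrective supervision to request.

    concepts:        identifiers of the concepts appearing in the explanation.
    is_high_quality: hypothetical callable encoding the annotator's judgment
                     about a single concept (Step 1).
    is_sufficient:   hypothetical callable telling whether a list of concepts
                     suffices to predict the correct label for the instance.
    Returns a pair (action, targets).
    """
    # Step 1: collect the concepts that the annotator deems high-quality.
    good = [c for c in concepts if is_high_quality(c)]
    if good and is_sufficient(good):
        # Step 2: the concepts are fine, so only the weights need fixing.
        return "fix_weights", good
    # Step 3: supervision on the concepts themselves is required.
    bad = [c for c in concepts if c not in good]
    return "fix_concepts", bad

# Toy usage with made-up concepts: "watermark" is a confounder.
action, targets = debug_decision(
    ["wheel", "watermark"],
    is_high_quality=lambda c: c != "watermark",
    is_sufficient=lambda subset: "wheel" in subset,
)
```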

4.3 Step 1: Assessing concept quality

In this step, one must determine whether there exists a subset of the learned concepts $\phi$ that is high-quality. This necessarily involves conveying the learned concepts to a human expert in detail sufficient to determine whether they are “good enough” for computing the ground-truth label $y$ from $x$.

The most straightforward solution, adopted for instance by SENNs Alvarez-Melis and Jaakkola (2018) and PPNets Chen et al. (2019), is to present a set of instances that are most representative of each concept. In particular, PPNets identify parts of training instances that maximally activate each concept $\phi_j$. A more refined way to characterize a concept is to explain why it activates for certain instances, possibly the prototypes themselves Nauta et al. (2020); Hoffmann et al. (2021). Such explanations can be obtained by extracting an attribution map that identifies those inputs – either raw inputs or higher-level features of $x$ like color, shape, or texture – that maximally contribute to the concept’s activations. Since concepts are black-box, this map must be acquired using post-hoc techniques, like input gradients Baehrens et al. (2010); Sundararajan et al. (2017) or LIME Ribeiro et al. (2016). Explanations are especially useful to prevent stakeholders from confusing one concept for another. For instance, in a shape recognition task, a part prototype that activates on a yellow square might do so because of the shape, of the color, or both. Without a proper explanation, the user cannot tell these concepts apart. This is fundamental for preventing users from wrongly trusting ill-behaved concepts Teso and Kersting (2019); Rudin (2019).

Although assessing concept quality does not require communicating the full semantics of a learned concept to the human counterpart, acquiring feedback on the aggregation weights does. Doing so is non-trivial Zhang et al. (2019), as concepts learned by GBMs are not always interpretable Nauta et al. (2020); Hoffmann et al. (2021). However, we stress that uninterpretable concepts are not high-quality, and therefore they must be dealt with in Step 3, i.e., they must be improved by supplying appropriate concept-level supervision.

4.4 Step 2: Fixing how the concepts are used

In the second step, the goal is to fix the aggregation step. This case is reminiscent of debugging black-box models. Existing approaches for this problem acquire a (possibly partial Teso and Kersting (2019); Teso (2019)) ground-truth attribution map that specifies what input attributes are (ir)relevant for the target prediction, and then penalize the model for allocating non-zero relevance to the irrelevant ones Ross et al. (2017).

More specifically, let $f$ be a black-box model and $a$ be an attribution mechanism that assigns a numerical responsibility $a_j(x, y)$ for the decision $(x, y)$ to each input $x_j$, for $j = 1, \ldots, d$, for instance integrated gradients Sundararajan et al. (2017). The model’s explanations are corrected by introducing an attribute loss that penalizes the responsibility assigned to inputs annotated as irrelevant Ross et al. (2017); Schramowski et al. (2020); Shao et al. (2021). (The loss can be adapted to also encourage the model to rely on concepts that are deemed relevant, for instance by penalizing the model whenever the concept’s weight is too close to zero.)
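One common instantiation of such a loss, following the squared penalty used by Ross et al. (2017), is $\sum_{j \in I} a_j(x, y)^2$, where $I$ is the set of inputs annotated as irrelevant. The Python sketch below computes this quantity. The function name `attribution_loss`, the symbol $I$ and the toy numbers are assumptions, and the attributions are taken as given rather than computed by an actual attribution mechanism.

```python
def attribution_loss(attributions, irrelevant):
    """Sum of squared attributions over inputs flagged as irrelevant.

    attributions: per-input responsibility scores a_j(x, y), assumed to come
                  from some attribution mechanism (e.g. input gradients).
    irrelevant:   set of input indices that the annotator marked as irrelevant.
    The squared penalty follows Ross et al. (2017); other penalties are possible.
    """
    return sum(attributions[j] ** 2 for j in irrelevant)

# Toy usage: inputs 1 and 3 are annotated as irrelevant, so only their
# attributions (-2.0 and 1.0) are penalized.
loss = attribution_loss([0.5, -2.0, 0.0, 1.0], irrelevant={1, 3})
```

Note that the penalty is tied to input indices, which is harmless when the inputs have fixed semantics (e.g., pixels) but becomes an issue once the penalized quantities are learned.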

Now, consider a GBM and a decision $(x, y)$ obtained by aggregating high-quality concepts $\phi$ using low-quality weights $w$. It is easy to see how the loss above could help with aligning the aggregation weights. However, simply replacing the attributions $a_j$ with the weights $w_j$ does not work. The reason is that in GBMs the concepts change during training, and so do the semantics of their associated weights. Hence, feedback of the form “don’t use the $j$-th concept” becomes obsolete (and misleading) whenever the $j$-th concept changes.

Another major problem with this loss is that, by penalizing concepts by their index, it does not prevent the model from re-learning a forbidden concept under a different index. This is illustrated in Figure 1, which reports the concepts learned by a PPNet for a toy image classification task. Each image contains two colored 2-D shapes which entirely determine its class. In this toy example, an image belongs to the chosen class if and only if it contains a green circle (50% of the images) or a pink triangle (the other 50%). However, the data is confounded: 100% of the positive training images also contain a yellow square. The PPNet, which has a budget of two prototypes, is thus encouraged to rely on the confounder. This is precisely what happens if no corrective supervision is provided (left column, second prototype). However, if the model is penalized for using that prototype through the index-based loss, it still manages to rely on the confounder by learning it as its first prototype (middle column, both prototypes).

Roughly speaking, this shows that GBMs are too flexible for the above loss and can therefore easily work around it. Our solution to this problem is presented below, in Section 5.

4.5 Step 3: Fixing how the concepts are defined

In this last step, consider a decision $(x, y)$ that depends on a low-quality set of concepts $\phi$. In this case, the goal is to ensure that at least some of those concepts quickly become useful by supplying additional supervision.

The order in which the various concepts should be aligned is left to the user. This is reasonable under the assumption that she is a domain expert and that those concepts that are worth fixing are easy to distinguish from those that are not. An alternative is to debug the concepts sequentially based on their overall impact on the model’s behavior, for instance sorting them by decreasing relevance. More refined strategies will be considered in future work.
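A minimal sketch of the relevance-based ordering is given below. It assumes that the relevance of the $j$-th concept for a decision is measured by the magnitude of its contribution $|w_j \phi_j(x)|$ to the class score; this choice, like the function name `rank_concepts`, is an assumption of the sketch and other relevance measures could be substituted.

```python
def rank_concepts(weights, activations):
    """Return concept indices sorted by decreasing relevance.

    Relevance is taken to be |w_j * phi_j(x)|, i.e. the magnitude of each
    concept's contribution to the class score (an assumed definition).
    """
    relevance = [abs(w * a) for w, a in zip(weights, activations)]
    return sorted(range(len(relevance)), key=lambda j: relevance[j], reverse=True)

# Toy usage: the second concept contributes most, the third least.
order = rank_concepts([2.0, -3.0, 0.1], [0.5, 0.9, 1.0])
```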

Now, let $\phi_j$ be a low-quality concept. Our key insight is that there is no substantial difference between the model’s output and a specific concept appearing in it. (A similar point was brought up in Stammer et al. (2021).) This means that work on understanding and correcting predictions of black-box models can be applied for understanding and correcting concepts in GBMs – with some differences, see below. For instance, the work of Nauta et al. (2020) can be viewed as a concept-level version of LIME Ribeiro et al. (2016). To the best of our knowledge, this symmetry has never been pointed out before.

A direct consequence is that concepts, just like predictions, can be aligned by providing labels for them, as is done by CBMs Koh et al. (2020). One issue with this strategy is that, if the data are biased, the learned concept may end up relying on confounders Lapuschkin et al. (2019). A more robust alternative, which builds on approaches for correcting the model’s explanations Ross et al. (2017); Teso and Kersting (2019), is to align the concepts’ explanations. This involves extracting an explanation that uncovers the reasons behind the concept’s activations in terms of either inputs, as done in Barnett et al. (2021), or higher-level features, and supplying corrective feedback for them. One caveat is that these strategies are only feasible when the semantics of the concept are already quite clear to the human annotator (e.g., the concept clearly captures “leaves” or “wheels”). If this is not the case, then the only option is to instruct the model to avoid using it by means of the aggregation loss of Section 5 or an analogous penalty.

On-demand “grayfication”.

Since concepts in GBMs are black-box, explanations for them must be extracted using post-hoc attribution techniques, which – as we argued in the introduction – is less than optimal. A better alternative is to model the concepts themselves as GBMs, making cheap and faithful explanations for the concepts readily available. We call the idea of replacing a black-box model with an equivalent GBM grayfication. The GBM can be obtained using, for instance, model distillation Gou et al. (2021), and it would persist over time. A nice benefit of grayfication is that all debugging techniques discussed so far would immediately become available for the grayfied concepts too.

Grayfication definitely makes sense in applications where concepts can be computed from simpler, lower-level concepts (like instance shape, color, and texture Nauta et al. (2020)) or where concepts exhibit a natural hierarchical (parts-of-parts) structure Fidler and Leonardis (2007). However, grayfication need not be applied to all concepts unconditionally. One interesting direction of research is to let the user decide what concepts should be grayfied – for instance those that need to be corrected more often, or for which supervision can be easily provided.

4.6 The Debugging Loop

Summarizing, interactive debugging of GBMs involves repeatedly choosing an instance and then proceeding as in steps 1–3, acquiring corrective supervision on weights and concepts along the way. The target decisions can be either selected by the machine, as in explanatory active learning Teso and Kersting (2019), or selected by a human stakeholder, perhaps aided by the machine Popordanoska et al. (2020). Since in GBMs concepts and weights are learned jointly, the model is retrained to permit supervision to flow to both concepts and weights, further aligning them. This retraining stage proceeds until the model becomes stable enough in terms of, e.g., prediction or explanation accuracy on the training set or on a separate validation set. At this point, the annotator can either proceed with another debugging session or stop if the model looks good enough.

This form of debugging is quite powerful, as it allows stakeholders to identify and correct a variety of different bugs, but it is not universal: some bugs and biases cannot be uncovered using local explanations Popordanoska et al. (2020). Other bugs – like those due to bad choice of hyper-parameters, insufficient model capacity, and failure to converge – are not fixable with any form of supervision. Dealing with these issues, however, is beyond the scope of this paper and left to future work.

5 A Robust Aggregation Loss

As shown in the middle part of Figure 1, simply discouraging the model from using incorrectly learned concepts does not prevent it from learning them again with minimal changes. To address this problem, we propose to penalize the model for associating non-zero weight to concepts that are similar to those concepts indicated as irrelevant to the target decision during the debugging session. We assume these concepts to have been collected in a memory $C$. The resulting aggregation loss $\ell_{agg}$ (Eq. 1) is a double sum of penalties on the aggregation weights, each scaled by the similarity between a forbidden concept and a current one. The first sum iterates over old concepts irrelevant to $(x, y)$, denoted $C(x, y) \subseteq C$, the second sum iterates over the current concepts $\phi_1, \ldots, \phi_k$, and a kernel $\kappa$ measures the similarity between concepts.

Notice how Eq. 1, being independent of the concepts’ order, prevents the model from re-learning forbidden concepts under a different index. Moreover, it automatically deals with redundant concepts, which are sometimes learned in practice Chen et al. (2019), in which case all copies of the same irrelevant concept are similarly penalized.
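To make the structure of Eq. 1 concrete, the sketch below implements one possible instantiation, namely $\sum_{\phi' \in C(x, y)} \sum_{j} \kappa(\phi', \phi_j) \, w_j^2$. The squared penalty on the weights, the function name `aggregation_loss`, the representation of concepts as plain vectors and the use of a dot product as a stand-in similarity are all assumptions of this sketch rather than choices confirmed by the text.

```python
def aggregation_loss(weights, current, forbidden, kernel):
    """Similarity-weighted penalty on the aggregation weights.

    weights:   weights w_j of the current concepts for the target class.
    current:   representations of the current concepts (any type the kernel
               accepts; plain vectors in the toy usage below).
    forbidden: representations of the concepts stored in the memory as
               irrelevant for the target decision.
    kernel:    callable measuring the similarity between two concepts.
    The squared weight penalty is an assumption of this sketch.
    """
    return sum(
        kernel(old, phi) * w ** 2
        for old in forbidden           # outer sum: forbidden concepts
        for w, phi in zip(weights, current)  # inner sum: current concepts
    )

def dot(u, v):
    """Stand-in similarity: inner product between two vectors."""
    return sum(a * b for a, b in zip(u, v))

# Toy usage: the forbidden concept is re-learned under a different index.
forbidden = [[1.0, 0.0]]
before = aggregation_loss([0.2, 1.5], [[0.0, 1.0], [1.0, 0.0]], forbidden, dot)
after = aggregation_loss([1.5, 0.2], [[1.0, 0.0], [0.0, 1.0]], forbidden, dot)
```

In the toy usage the two calls differ only in the order of the concepts, and the penalty is the same in both cases: swapping indices does not let the model escape the loss.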

Measuring concept similarity.

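A principled way of doing so is to employ a kernel between functions. One natural approach is to employ a product kernel Jebara et al. (2004), for instance:

$$ \kappa_{act}(\phi, \phi') := \int p^*(x) \, \phi(x)^\rho \, \phi'(x)^\rho \, dx. $$

Here, $p^*(x)$ is the ground-truth distribution over instances and $\rho > 0$ controls the smoothness of the kernel. It is easy to show that $\kappa_{act}$ is a valid kernel by rewriting it as an inner product $\langle \psi, \psi' \rangle$, where we set $\psi(x) := \sqrt{p^*(x)} \, \phi(x)^\rho$.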

This kernel measures how often two concepts co-activate on different inputs, which is somewhat limiting. For instance, when discriminating between images of cars and plants, the concepts “wheel” and “license plate” are deemed similar because they co-activate on car images and not on plant images. This suggests using a more fine-grained kernel $\kappa_{loc}$ that considers where the concepts activate, for instance by comparing the concepts’ localization maps rather than their overall activations.

Since co-localization entails co-activation, this kernel specializes the co-activation kernel $\kappa_{act}$, in the sense that concepts that are similar according to $\kappa_{loc}$ are also similar according to $\kappa_{act}$. Unfortunately, the ground-truth distribution $p^*$ is not known and the integrals defining the two kernels are generally intractable, so in practice we replace the latter with a Monte Carlo approximation on the training data. Notice that the activation and localization maps of irrelevant concepts in the memory $C$ can be cached for future use, thus speeding up the computation.
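The Monte Carlo approximation can be sketched as follows. Both estimators average products of activations over a sample of training instances: the first uses one activation per instance, the second one activation per spatial location. The function names, the parameter name `rho`, the flat-list layout of the localization maps and the use of the per-image maximum as the overall activation are assumptions of this sketch, and activations are assumed to be non-negative.

```python
def activation_kernel(acts_a, acts_b, rho=1.0):
    """Monte Carlo estimate of a product kernel between two concepts.

    acts_a, acts_b: activations of the two concepts on the same n training
                    instances (assumed non-negative).
    rho:            smoothness exponent (hypothetical parameter name).
    """
    n = len(acts_a)
    return sum((a ** rho) * (b ** rho) for a, b in zip(acts_a, acts_b)) / n

def localization_kernel(maps_a, maps_b, rho=1.0):
    """Same estimate, computed on per-location activation maps.

    maps_a, maps_b: for each instance, a flat list holding one activation per
                    spatial location (an assumed layout).
    """
    n = len(maps_a)
    return sum(
        sum((a ** rho) * (b ** rho) for a, b in zip(ma, mb))
        for ma, mb in zip(maps_a, maps_b)
    ) / n

# Toy usage: instance 0 is a "car" image with two locations, instance 1 a
# "plant" image. The made-up "wheel" and "plate" concepts fire on the same
# image but at different locations.
wheel_maps = [[1.0, 0.0], [0.0, 0.0]]
plate_maps = [[0.0, 1.0], [0.0, 0.0]]
wheel_acts = [max(m) for m in wheel_maps]
plate_acts = [max(m) for m in plate_maps]
k_act = activation_kernel(wheel_acts, plate_acts)
k_loc = localization_kernel(wheel_maps, plate_maps)
```

In this toy case the co-activation estimate is positive while the localization-aware estimate is zero, which matches the motivation given above for preferring a more fine-grained kernel.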

Certain models may admit more efficient kernels between concepts. For instance, in PPNets one can measure concept similarity by directly comparing the convolutional filter representations of the learned concepts, e.g., through the inner product $\langle \theta, \theta' \rangle$ of their parameter vectors. Plugging this kernel into Eq. 1 replaces every kernel evaluation with an inner product between two parameter vectors, which dramatically simplifies the computation. However, taking inner products in parameter space is not entirely intuitive, rendering this kernel harder to control. A thorough analysis of this option is left to future work.

Beyond instance-level supervision.

Eq. 1 essentially encourages the distribution encoded by the GBM not to depend on a known-irrelevant concept. In previous work, relevance annotations are provided at the level of individual examples and, as such, specify an invariance that only holds for few similar examples Teso and Kersting (2019); Schramowski et al. (2020); Lertvittayakumjorn et al. (2020); Shao et al. (2021); Stammer et al. (2021).

Users, however, may find it more natural to provide relevance information for entire classes or for individual concepts. For instance, although the concept of “snow” is not relevant for a specific class “wolf”, it may be useful for other classes, like “winter”. Similarly, some concepts – for instance, artifacts in X-ray images due to the scanning process – should never be used for making predictions for any class. These more general forms of supervision encode invariances that apply to entire sets of examples, and as such have radically different expressive power. A nice property of Eq. 1 is that it can easily encode such supervision by letting the outer summation range over (or at least sample) sets of instances, i.e., all instances that belong to a certain class or all instances altogether.

5.1 A toy experiment

We run a preliminary evaluation of the aggregation loss, defined in Eq. 1, employing PPNets. We run the experiment on a data set of synthetic images representing squares, triangles and circles of different colors. The images are labeled by five random formulas like “pink triangle or green circle”. To mislead the model, we introduce a confounder, a yellow square, that appears in all training images of one class. The hyperparameters of PPNets are set as in the original paper Chen et al. (2019), and two prototype slots are reserved for each class. The model is initially trained for 20 epochs, after which an interactive step is performed where the prototype that learned the confounder is selected for correction. Training then proceeds for another 25 epochs by alternating the optimization of the concepts and of the weights, for 5 epochs each. Figure 1 shows the prototypes learned by the class “pink triangle or green circle” (affected by the “yellow square” confounder) under different conditions. The left column shows the two prototypes learned after the initial training phase, with the confounder clearly matched by (at least) the second prototype, and no prototype matching the triangle or the circle. The middle column shows the refined prototypes learned after the additional 25 epochs and feedback encoded via the attribute loss (i.e., the model is penalized for assigning non-zero weight to the second prototype). Note how the confounder is learned again despite the feedback, while the triangle and the circle are again both missed. The right column shows the prototypes learned when the feedback is encoded via the aggregation loss. This time the model manages to identify the pink triangle as a relevant concept for the class (second prototype) despite the confounder.

Figure 1 (panels: before correction, attribute loss, aggregation loss): Left: The two prototypes learned for the class “pink triangle or green circle” affected by a “yellow square” confounder. The second prototype learned the confounder. Center: The prototypes learned after penalizing the model for assigning non-zero weight to the second prototype of the left column using the attribute loss. The model eludes the corrective loss by reacquiring the confounder as its first prototype. Right: The prototypes learned by penalizing the model using the aggregation loss. The model manages to identify the pink triangle as a relevant concept despite the confounder (right prototype).

6 Related Work

Our work targets well-known concept-based GBMs, including SENNs Alvarez-Melis and Jaakkola (2018), CBMs Koh et al. (2020); Losch et al. (2019), and PPNets Li et al. (2018); Chen et al. (2019); Hase et al. (2019); Rymarczyk et al. (2020); Barnett et al. (2021) and related approaches Nauta et al. (2021), but it could be easily extended to similar models and techniques, for instance concept whitening Chen et al. (2020). The latter is reminiscent of CBMs, except that the neurons in the bottleneck layer are normalized and decorrelated and feedback on individual concepts is optional. Of course, not all GBMs fit our template. For instance, Al-Shedivat et al. (2020) propose a Bayesian model that, although definitely gray-box, does not admit selecting a unique explanation for a given prediction. More work is needed to capture these additional cases.

Our debugging approach is rooted in explanatory interactive learning (XIL), a set of techniques that acquire corrective feedback from a user by displaying explanations (either local Teso and Kersting (2019); Selvaraju et al. (2019); Lertvittayakumjorn et al. (2020); Schramowski et al. (2020); Shao et al. (2021) or global Popordanoska et al. (2020)) of the model’s beliefs. Recently, Stammer et al. (2021) designed a debugging approach for the specific case of attention-based neuro-symbolic models. These approaches assume the concept vocabulary to be given rather than learned and indeed are not robust to changes in the concepts, as illustrated above. Our aggregation loss generalizes these ideas to the case of concept-based GBMs. Other work on XIL has focused on example-based explanations Teso et al. (2021); Zylberajch et al. (2021), which we did not include in our framework, but which could provide an alternative device for controlling a GBM’s reliance on, e.g., noisy examples.

A causal approach for debiasing CBMs has been proposed by Bahadori and Heckerman (2021). This work is orthogonal to our contribution, and could indeed be integrated with it. The strategy of Lage and Doshi-Velez (2020) acquires concept-based white-box models that are better aligned with the user’s mental model by interactively acquiring concept-attribute dependency information. The machine does so by asking questions like “does depression depend on lorazepam?”, acquiring more impactful dependencies first. This approach is tailored for specific white-box models, but it could and should be extended to GBMs and integrated into our framework.

7 Conclusion

We proposed a unified framework for debugging concept-based GBMs that fixes bugs by acquiring corrective feedback from a human supervisor. Our key insight is that bugs can affect both how the concepts are defined and how they are aggregated, and that both elements have to be high-quality for a GBM to be effective. We proposed a three-step procedure to achieve this, showed how existing attribution losses are unsuitable for GBMs and proposed a new loss that is robust to changes in the learned concepts, and illustrated how the same schema can be used to correct the concepts themselves by interacting with their labels and explanations. A thorough empirical validation of these ideas is currently underway.

This research has received funding from the European Union’s Horizon 2020 FET Proactive project “WeNet - The Internet of us”, grant agreement No. 823783, and from the “DELPhi - DiscovEring Life Patterns” project funded by the MIUR Progetti di Ricerca di Rilevante Interesse Nazionale (PRIN) 2017 – DD n. 1062 del 31.05.2019. The research of ST and AP was partially supported by TAILOR, a project funded by EU Horizon 2020 research and innovation programme under GA No 952215.


  • [1] M. Al-Shedivat, A. Dubey, and E. P. Xing (2020) Contextual Explanation Networks. J. Mach. Learn. Res. 21, pp. 194–1. Cited by: §6.
  • [2] D. Alvarez-Melis and T. S. Jaakkola (2018) Towards robust interpretability with self-explaining neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 7786–7795. Cited by: §1, §2.1, Table 1, §2, §4.3, §6.
  • [3] D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K. Müller (2010) How to explain individual classification decisions. The Journal of Machine Learning Research 11, pp. 1803–1831. Cited by: §4.3.
  • [4] M. T. Bahadori and D. Heckerman (2021) Debiasing concept-based explanations with causal analysis. In International Conference on Learning Representations, Cited by: §4.1, §6.
  • [5] A. J. Barnett, F. R. Schwartz, C. Tao, C. Chen, Y. Ren, J. Y. Lo, and C. Rudin (2021) IAIA-bl: a case-based interpretable deep learning model for classification of mass lesions in digital mammography. arXiv preprint arXiv:2103.12308. Cited by: §1, §2.1, Table 1, §3, §4.5, §6.
  • [6] C. Chen, O. Li, D. Tao, A. Barnett, C. Rudin, and J. K. Su (2019) This looks like that: deep learning for interpretable image recognition. Advances in Neural Information Processing Systems 32, pp. 8930–8941. Cited by: §1, §2.1, Table 1, §2, §2, §4.3, §5.1, §5, §6.
  • [7] Z. Chen, Y. Bei, and C. Rudin (2020) Concept whitening for interpretable image recognition. Nature Machine Intelligence 2 (12), pp. 772–782. Cited by: §6.
  • [8] A. Dombrowski, M. Alber, C. Anders, M. Ackermann, K. Müller, and P. Kessel (2019) Explanations can be manipulated and geometry is to blame. Advances in Neural Information Processing Systems 32, pp. 13589–13600. Cited by: §1.
  • [9] S. Fidler and A. Leonardis (2007) Towards scalable representations of object categories: learning a hierarchy of parts. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §4.5.
  • [10] J. Gou, B. Yu, S. J. Maybank, and D. Tao (2021) Knowledge distillation: a survey. International Journal of Computer Vision 129 (6), pp. 1789–1819. Cited by: §4.5.
  • [11] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi (2018) A survey of methods for explaining black box models. ACM computing surveys (CSUR) 51 (5), pp. 1–42. Cited by: §1.
  • [12] P. Hase, C. Chen, O. Li, and C. Rudin (2019) Interpretable image recognition with hierarchical prototypes. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, Vol. 7, pp. 32–40. Cited by: §1, §6.
  • [13] A. Hoffmann, C. Fanconi, R. Rade, and J. Kohler (2021) This looks like that… does it? shortcomings of latent space prototype interpretability in deep networks. arXiv preprint arXiv:2105.02968. Cited by: §3, §4.3, §4.3.
  • [14] T. Jebara, R. Kondor, and A. Howard (2004) Probability product kernels. The Journal of Machine Learning Research 5, pp. 819–844. Cited by: §5.
  • [15] P. W. Koh, T. Nguyen, Y. S. Tang, S. Mussmann, E. Pierson, B. Kim, and P. Liang (2020) Concept bottleneck models. In International Conference on Machine Learning, pp. 5338–5348. Cited by: §1, §2.1, Table 1, §2, §3, §4.5, §6.
  • [16] I. Lage and F. Doshi-Velez (2020) Learning interpretable concept-based models with human feedback. arXiv preprint arXiv:2012.02898. Cited by: §1, §6.
  • [17] H. Lakkaraju and O. Bastani (2020) “How do I fool you?” Manipulating User Trust via Misleading Black Box Explanations. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pp. 79–85. Cited by: §1.
  • [18] S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek, and K. Müller (2019) Unmasking clever hans predictors and assessing what machines really learn. Nature communications 10 (1), pp. 1–8. Cited by: §3, §4.5.
  • [19] P. Lertvittayakumjorn, L. Specia, and F. Toni (2020) FIND: human-in-the-loop debugging deep text classifiers. In Conference on Empirical Methods in Natural Language Processing, pp. 332–348. Cited by: §1, §5, §6.
  • [20] O. Li, H. Liu, C. Chen, and C. Rudin (2018) Deep learning for case-based reasoning through prototypes: a neural network that explains its predictions. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §6.
  • [21] Z. C. Lipton (2018) The mythos of model interpretability. Queue 16 (3), pp. 31–57. Cited by: §2.
  • [22] M. Losch, M. Fritz, and B. Schiele (2019) Interpretability beyond classification output: Semantic Bottleneck Networks. arXiv preprint arXiv:1907.10882. Cited by: §1, §2.1, §6.
  • [23] M. Nauta, A. Jutte, J. Provoost, and C. Seifert (2020) This looks like that, because… explaining prototypes for interpretable image recognition. arXiv preprint arXiv:2011.02863. Cited by: §3, §4.3, §4.3, §4.5, §4.5.
  • [24] M. Nauta, R. van Bree, and C. Seifert (2021) Neural prototype trees for interpretable fine-grained image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14933–14943. Cited by: §1, §6.
  • [25] T. Popordanoska, M. Kumar, and S. Teso (2020) Machine guides, human supervises: interactive learning with global explanations. arXiv preprint arXiv:2009.09723. Cited by: §4.6, §4.6, §6.
  • [26] M. T. Ribeiro, S. Singh, and C. Guestrin (2016) “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1135–1144. Cited by: §3, §4.3, §4.5.
  • [27] A. S. Ross, M. C. Hughes, and F. Doshi-Velez (2017) Right for the right reasons: training differentiable models by constraining their explanations. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 2662–2670. Cited by: §3, §4.4, §4.4, §4.5, §4.
  • [28] C. Rudin, C. Chen, Z. Chen, H. Huang, L. Semenova, and C. Zhong (2021) Interpretable machine learning: fundamental principles and 10 grand challenges. arXiv preprint arXiv:2103.11251. Cited by: §1.
  • [29] C. Rudin (2019) Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1 (5), pp. 206–215. Cited by: §1, §4.3.
  • [30] D. Rymarczyk, Ł. Struski, J. Tabor, and B. Zieliński (2020) ProtoPShare: prototype sharing for interpretable image classification and similarity discovery. arXiv preprint arXiv:2011.14340. Cited by: §1, §6.
  • [31] P. Schramowski, W. Stammer, S. Teso, A. Brugger, F. Herbert, X. Shao, H. Luigs, A. Mahlein, and K. Kersting (2020) Making deep neural networks right for the right scientific reasons by interacting with their explanations. Nature Machine Intelligence 2 (8), pp. 476–486. Cited by: §1, §3, §4.4, §5, §6.
  • [32] R. R. Selvaraju, S. Lee, Y. Shen, H. Jin, S. Ghosh, L. Heck, D. Batra, and D. Parikh (2019) Taking a hint: leveraging explanations to make vision and language models more grounded. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2591–2600. Cited by: §6.
  • [33] X. Shao, A. Skryagin, P. Schramowski, W. Stammer, and K. Kersting (2021) Right for better reasons: training differentiable models by constraining their influence function. In Proceedings of Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI), Cited by: §4.4, §5, §6.
  • [34] L. Sixt, M. Granz, and T. Landgraf (2020) When explanations lie: why many modified bp attributions fail. In International Conference on Machine Learning, pp. 9046–9057. Cited by: §1.
  • [35] W. Stammer, P. Schramowski, and K. Kersting (2021) Right for the right concept: revising neuro-symbolic concepts by interacting with their explanations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3619–3629. Cited by: §1, §3, §5, §6, footnote 3.
  • [36] M. Sundararajan, A. Taly, and Q. Yan (2017) Axiomatic attribution for deep networks. In International Conference on Machine Learning, pp. 3319–3328. Cited by: §4.3, §4.4.
  • [37] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §3.
  • [38] S. Teso, A. Bontempelli, F. Giunchiglia, and A. Passerini (2021) Interactive label cleaning with example-based explanations. arXiv preprint arXiv:2106.03922. Cited by: §6.
  • [39] S. Teso and K. Kersting (2019) Explanatory interactive machine learning. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 239–245. Cited by: §3, §4.3, §4.4, §4.5, §4.6, §5, §6.
  • [40] S. Teso (2019) Toward faithful explanatory active learning with self-explainable neural nets. In Proceedings of the Workshop on Interactive Adaptive Learning (IAL 2019), pp. 4–16. Cited by: §1, §1, §3, §4.4.
  • [41] G. Van den Broeck, A. Lykov, M. Schleich, and D. Suciu (2021) On the tractability of shap explanations. In Proceedings of AAAI, Cited by: §1.
  • [42] C. Yeh, B. Kim, S. Arik, C. Li, T. Pfister, and P. Ravikumar (2020) On completeness-aware concept-based explanations in deep neural networks. Advances in Neural Information Processing Systems 33. Cited by: §4.1.
  • [43] Z. Zhang, J. Singh, U. Gadiraju, and A. Anand (2019) Dissonance between human and machine understanding. Proceedings of the ACM on Human-Computer Interaction 3 (CSCW), pp. 1–23. Cited by: §4.3.
  • [44] H. Zylberajch, P. Lertvittayakumjorn, and F. Toni (2021) HILDIF: interactive debugging of nli models using influence functions. Workshop on Interactive Learning for Natural Language Processing, pp. 1. Cited by: §6.