Interactive Label Cleaning with Example-based Explanations

by   Stefano Teso, et al.
Università di Trento

We tackle sequential learning under label noise in applications where a human supervisor can be queried to relabel suspicious examples. Existing approaches are flawed, in that they only relabel incoming examples that look “suspicious” to the model. As a consequence, mislabeled examples that elude (or never undergo) this cleaning step end up tainting the training data and the model, with no further chance of being cleaned. We propose Cincer, a novel approach that cleans both new and past data by identifying pairs of mutually incompatible examples. Whenever it detects a suspicious example, Cincer identifies a counter-example in the training set that, according to the model, is maximally incompatible with the suspicious example, and asks the annotator to relabel either or both examples, resolving this possible inconsistency. The counter-examples are chosen to be maximally incompatible, so as to serve as explanations of the model's suspicion, and highly influential, so as to convey as much information as possible if relabeled. Cincer achieves this by leveraging an efficient and robust approximation of influence functions based on the Fisher information matrix (FIM). Our extensive empirical evaluation shows that clarifying the reasons behind the model's suspicions by cleaning the counter-examples helps acquire substantially better data and models, especially when paired with our FIM approximation.



1 Introduction

Label noise is a major issue in machine learning as it can lead to compromised predictive performance and unreliable models Frénay and Verleysen (2014); Song et al. (2020). We focus on sequential learning settings in which a human supervisor, usually a domain expert, can be asked to double-check and relabel any potentially mislabeled example. Applications include crowd-sourced machine learning and citizen science, where trained researchers can be asked to clean the labels provided by crowd-workers Zhang et al. (2016); Kremer et al. (2018), and interactive personal assistants Bontempelli et al. (2020), where the user self-reports the initial annotations (e.g., about activities being performed) and unreliability is due to memory bias Tourangeau et al. (2000), unwillingness to report Corti (1993), or conditioning West and Sinibaldi (2013).

This problem is often tackled by monitoring for incoming examples that are likely to be mislabeled, aka suspicious examples, and asking the supervisor to provide clean (or at least better) annotations for them. Existing approaches, however, focus solely on cleaning the incoming examples Urner et al. (2012); Kremer et al. (2018); Zeni et al. (2019); Bontempelli et al. (2020). This means that noisy examples that did not undergo the cleaning step (e.g., those in the initial bootstrap data set) or that managed to elude it are left untouched. This degrades the quality of the model and prevents it from spotting future mislabeled examples that fall in regions affected by noise.

We introduce cincer (Contrastive and InflueNt CounterExample stRategy), a new explainable interactive label cleaning algorithm that lets the annotator observe and fix the reasons behind the model’s suspicions. For every suspicious example that it finds, cincer computes a counter-example, i.e., a training example that maximally supports the machine’s suspicion. The idea is that the example/counter-example pair captures a potential inconsistency in the data, as viewed from the model’s perspective, which is resolved by invoking the supervisor. More specifically, cincer asks the user to relabel the example, the counter-example, or both, thus improving the quality of, and promoting consistency between, the data and the model. Two hypothetical rounds of interaction on a noisy version of MNIST are illustrated in Figure 1.


cincer relies on a principled definition of counter-examples derived from a few explicit, intuitive desiderata, using influence functions Cook and Weisberg (1982); Koh and Liang (2017). The resulting counter-example selection problem is solved using a simple and efficient approximation based on the Fisher information matrix Lehmann and Casella (2006) that consistently outperforms more complex alternatives in our experiments.

Contributions: Summarizing, we: 1) Introduce cincer, an explanatory interactive label cleaning strategy that leverages example-based explanations to identify inconsistencies in the data, as perceived by the model, and enable the annotator to fix them. 2) Show how to select counter-examples that both explain why the model is suspicious and are highly informative, using (an efficient approximation of) influence functions. 3) Present an extensive empirical evaluation that showcases the ability of cincer to build cleaner data sets and better models.

Figure 1: Suspicious example and counter-examples selected using (from left to right) cincer, nearest neighbor (NN) and influence functions (IF), on noisy MNIST. Left: the suspicious example is mislabeled, the machine’s suspicion is supported by a clean counter-example. Right: the suspicious example is not mislabeled, the machine is wrongly suspicious because the counter-example is mislabeled. cincer’s counter-example is contrastive and influential; NN’s is not influential and IF’s is not pertinent.

2 Background

We are concerned with sequential learning under label noise. In this setting, the machine receives a sequence of examples $\tilde{z}_t = (x_t, \tilde{y}_t)$, for $t = 1, 2, \ldots$, where $x_t$ is an instance and $\tilde{y}_t$ is a corresponding label, with $\tilde{y}_t \in \{1, \ldots, c\}$. The label $\tilde{y}_t$ is unreliable and might differ from the ground-truth label $y_t^*$. The key feature of our setting is that a human supervisor can be asked to double-check and relabel any example. The goal is to acquire a clean data set and a high-quality predictor while asking few relabeling queries, so as to keep the cost of interaction under control.

The state-of-the-art for this setting is skeptical learning (SKL) Zeni et al. (2019); Bontempelli et al. (2020). SKL is designed primarily for smart personal assistants that must learn from unreliable users. SKL follows a standard sequential learning loop: in each iteration the machine receives an example and updates the model accordingly. However, for each example that it receives, the machine compares (an estimate of) the quality of the annotation with that of its own prediction, and if the prediction looks more reliable than the annotation by some factor, SKL asks the user to double-check his/her example. The details depend on the implementation: in Zeni et al. (2019) label quality is estimated using the empirical accuracy for the classifier and the empirical probability of contradiction for the annotator, while in Bontempelli et al. (2020) the machine measures the margin between the user’s and machine’s labels. Our approach follows the latter strategy.

Another closely related approach is learning from weak annotators (LWA) Urner et al. (2012); Kremer et al. (2018), which focuses on querying domain experts rather than end-users. The most recent approach Kremer et al. (2018) jointly learns a prediction pipeline composed of a classifier and a noisy channel, which allows it to estimate the noise rate directly. Moreover, the approach in Kremer et al. (2018) identifies suspicious examples that have a large impact on the learned model. A theoretical foundation for LWA is given in Urner et al. (2012). LWA is, however, designed for pool-based scenarios, where the training set is given rather than obtained sequentially. For this reason, in the remainder of the paper we will chiefly build on and compare to SKL.

Limitations of existing approaches. A major downside of SKL is that it focuses on cleaning the incoming examples only. This means that if a mislabeled example manages to elude the cleaning step and gets added to the training set, it is bound to stay there forever. This situation is actually quite common during the first stage of skeptical learning, in which the model is highly uncertain and trusts the incoming examples—even if they are mislabeled. The same issue occurs if the initial training set used to bootstrap the classifier contains mislabeled examples. As shown by our experiments, the accumulation of noisy data in the training set can have a detrimental effect on the model’s performance. In addition, it can also affect the model’s ability to identify suspicious examples: a noisy data point can fool the classifier into trusting incoming mislabeled examples that fall close to it, further aggravating the situation.

3 Explainable Interactive Label Cleaning with CINCER

We consider a very general class of probabilistic classifiers of the form $f(x; \theta) = \operatorname{argmax}_{y} P(y \mid x; \theta)$, where the conditional distribution $P(Y \mid X; \theta)$ has been fit on training data by minimizing the cross-entropy loss $\ell((x, y), \theta) = -\log P(y \mid x; \theta)$. In our implementation, we also assume $P$ to be a neural network with a softmax activation at the top layer, trained using some variant of SGD and possibly early stopping.

3.1 The cincer Algorithm

The pseudo-code of cincer is listed in Algorithm 1. At the beginning of iteration $t$, the machine has acquired a training set $D_{t-1}$ and trained a model with parameters $\theta_{t-1}$ on it. At this point, the machine receives a new, possibly mislabeled example $\tilde{z}_t = (x_t, \tilde{y}_t)$ (line 3) and has to decide whether to trust it.

Following skeptical learning Bontempelli et al. (2020), cincer does so by computing the margin $\mu(\tilde{z}_t, \theta_{t-1})$, i.e., the difference in conditional probability between the model’s prediction $\hat{y}_t = \operatorname{argmax}_{y} P(y \mid x_t; \theta_{t-1})$ and the annotation $\tilde{y}_t$. More formally:

$$\mu(\tilde{z}_t, \theta_{t-1}) = P(\hat{y}_t \mid x_t; \theta_{t-1}) - P(\tilde{y}_t \mid x_t; \theta_{t-1}) \tag{1}$$

The margin estimates the incompatibility between the model and the example: the larger the margin, the more suspicious the example. The example is deemed compatible if the margin is below a given threshold $\tau$ and suspicious otherwise (line 4); possible choices for $\tau$ are discussed in Section 3.5.
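The margin test can be illustrated with a short sketch. It assumes the model exposes a vector of class probabilities for the incoming instance; the function names and the threshold value used in the example are hypothetical and chosen only for illustration.

```python
import numpy as np

def margin(probs: np.ndarray, annotated_label: int) -> float:
    """Probability of the predicted label minus probability of the annotated label."""
    predicted_label = int(np.argmax(probs))
    return float(probs[predicted_label] - probs[annotated_label])

def is_suspicious(probs: np.ndarray, annotated_label: int, tau: float) -> bool:
    """An example is compatible if its margin is below tau, suspicious otherwise."""
    return margin(probs, annotated_label) >= tau

# Illustrative class probabilities for a 3-class problem (made-up numbers).
probs = np.array([0.7, 0.2, 0.1])
agrees = is_suspicious(probs, annotated_label=0, tau=0.2)     # annotation matches prediction
disagrees = is_suspicious(probs, annotated_label=1, tau=0.2)  # annotation contradicts prediction
```

When the annotation coincides with the prediction the margin is exactly zero, so such examples are never flagged for any positive threshold.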

If $\tilde{z}_t$ is compatible, it is added to the data set as-is (line 5). Otherwise, cincer computes a counter-example $z_k \in D_{t-1}$ that maximally supports the machine’s suspicion. The intuition is that the pair $(\tilde{z}_t, z_k)$ captures a potential inconsistency in the data. For instance, the counter-example might be a correctly labeled example that is close or similar to $\tilde{z}_t$ but has a different label, or a distant noisy outlier that fools the predictor into assigning low probability to $\tilde{y}_t$. How to choose an effective counter-example is a major focus of this paper and is discussed in detail in Section 3.2 and following.

Next, cincer asks the annotator to double-check the pair $(\tilde{z}_t, z_k)$ and relabel the suspicious example, the counter-example, or both, thus resolving the potential inconsistency. The data set and model are then updated accordingly (line 9) and the loop repeats.

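1: fit $\theta_0$ on $D_0$
2: for $t = 1, 2, \ldots$ do
3:     receive new example $\tilde{z}_t = (x_t, \tilde{y}_t)$
4:     if $\mu(\tilde{z}_t, \theta_{t-1}) < \tau$ then
5:         $D_t \leftarrow D_{t-1} \cup \{\tilde{z}_t\}$    ▷ $\tilde{z}_t$ is compatible
6:     else
7:         find counter-example $z_k$ using the selection rule of Section 3.2    ▷ $\tilde{z}_t$ is suspicious
8:         present $\tilde{z}_t, z_k$ to the user, receive possibly cleaned labels $y'_t, y'_k$
9:         $D_t \leftarrow (D_{t-1} \setminus \{z_k\}) \cup \{(x_t, y'_t), (x_k, y'_k)\}$
10:     fit $\theta_t$ on $D_t$
Algorithm 1 Pseudo-code of cincer. Inputs: initial (noisy) training set $D_0$; threshold $\tau$.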
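The control flow of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the model, the counter-example selector and the annotator are passed in as callables, and the toy helpers below (a one-dimensional nearest-centroid model, a nearest-neighbor selector and a ground-truth oracle) are hypothetical stand-ins, not the classifiers or the selection rule used in the paper.

```python
import numpy as np

def cincer_loop(stream, data, fit, predict_proba, find_counter_example, ask_annotator, tau):
    """One pass over `stream`, mirroring the structure of Algorithm 1."""
    data = list(data)
    model = fit(data)
    for x, y in stream:
        probs = predict_proba(model, x)
        if probs.max() - probs[y] < tau:
            data.append((x, y))                         # compatible: keep as-is
        else:
            k = find_counter_example(model, data, (x, y))
            y_new, yk_new = ask_annotator((x, y), data[k])
            data[k] = (data[k][0], yk_new)              # possibly relabel the counter-example
            data.append((x, y_new))                     # possibly relabel the new example
        model = fit(data)                               # refit on the updated data set
    return model, data

# --- Toy stand-ins (assumptions for illustration only) ---
def fit(data):
    xs = np.array([x for x, _ in data])
    ys = np.array([y for _, y in data])
    return np.array([xs[ys == c].mean() for c in (0, 1)])   # per-class means

def predict_proba(means, x):
    logits = -np.abs(means - x)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def nearest_counter_example(means, data, example):
    x, _ = example
    y_hat = int(np.argmax(predict_proba(means, x)))
    candidates = [k for k, (_, yk) in enumerate(data) if yk == y_hat]
    return min(candidates, key=lambda k: abs(data[k][0] - x))

def oracle(example, counter_example):
    truth = lambda x: int(x > 0)                        # ground truth of the toy problem
    return truth(example[0]), truth(counter_example[0])

initial = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
stream = [(-1.2, 0), (1.5, 0)]                          # the second example is mislabeled
model, cleaned = cincer_loop(stream, initial, fit, predict_proba,
                             nearest_counter_example, oracle, tau=0.2)
```

In this toy run the first incoming example is accepted as-is, while the second triggers a query and is stored with its corrected label.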

3.2 Counter-example Selection

Counter-examples are meant to illustrate why a particular example is deemed suspicious by the machine in a way that makes it easy to elicit useful corrective feedback from the supervisor. We posit that a good counter-example should be:

  D1. Contrastive: $z_k$ should explain why $\tilde{z}_t$ is considered suspicious by the model, thus highlighting a potential inconsistency in the data.

  D2. Influential: if $z_k$ is mislabeled, correcting it should improve the model as much as possible, so as to maximize the information gained by interacting with the annotator.

In the following, we show how, for models learned by minimizing the cross-entropy loss, one can identify counter-examples that satisfy both desiderata.

What is a contrastive counter-example? We start by tackling the first desideratum. Let $\theta_{t-1}$ be the parameters of the current model. Intuitively, $z_k \in D_{t-1}$ is a contrastive counter-example for a suspicious example $\tilde{z}_t$ if removing it from the data set and retraining leads to a model with parameters $\theta_{t-1}^{-k}$ that assigns higher probability to the suspicious label $\tilde{y}_t$. The most contrastive counter-example is then the one that maximizes the change in probability:

$$\operatorname*{argmax}_{z_k \in D_{t-1}} \; P(\tilde{y}_t \mid x_t; \theta_{t-1}^{-k}) - P(\tilde{y}_t \mid x_t; \theta_{t-1}) \tag{2}$$

While intuitively appealing, optimizing Eq. 2 directly is computationally challenging as it involves retraining the model $|D_{t-1}|$ times. This is impractical for realistically sized models and data sets, especially in our interactive scenario where a counter-example must be computed in each iteration.

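Influence functions. We address this issue by leveraging influence functions (IFs), a computational device that can be used to estimate the impact of specific training examples without retraining Cook and Weisberg (1982); Koh and Liang (2017). Let $\theta_{t-1}$ be the empirical risk minimizer on $D_{t-1}$, with $n = |D_{t-1}|$, and $\theta_{t-1}(z, \epsilon)$ be the minimizer obtained after adding an example $z$ with weight $\epsilon$ to $D_{t-1}$, namely:

$$\theta_{t-1} = \operatorname*{argmin}_{\theta} \frac{1}{n} \sum_{z_k \in D_{t-1}} \ell(z_k, \theta), \qquad \theta_{t-1}(z, \epsilon) = \operatorname*{argmin}_{\theta} \frac{1}{n} \sum_{z_k \in D_{t-1}} \ell(z_k, \theta) + \epsilon \, \ell(z, \theta) \tag{3}$$

Taking a first-order Taylor expansion, the difference between $\theta_{t-1}(z, \epsilon)$ and $\theta_{t-1}$ can be written as $\theta_{t-1}(z, \epsilon) - \theta_{t-1} \approx \epsilon \cdot \frac{d}{d\epsilon} \theta_{t-1}(z, \epsilon) \big|_{\epsilon = 0}$. The derivative appearing on the right hand side is the so-called influence function, denoted $\mathcal{I}_{\theta_{t-1}}(z)$. It follows that the effect on $\theta_{t-1}$ of adding (resp. removing) an example $z$ to $D_{t-1}$ can be approximated by multiplying the IF by $1/n$ (resp. $-1/n$). Crucially, if the loss is strongly convex and twice differentiable, the IF can be written as:

$$\mathcal{I}_{\theta_{t-1}}(z) = -H(\theta_{t-1})^{-1} \nabla_\theta \ell(z, \theta_{t-1}), \qquad H(\theta) = \frac{1}{n} \sum_{z_k \in D_{t-1}} \nabla^2_\theta \ell(z_k, \theta) \tag{4}$$

where the curvature matrix $H(\theta_{t-1})$ is positive definite and invertible. IFs were shown to capture meaningful information even for neural networks and other non-convex models Koh and Liang (2017).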

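Identifying contrastive counter-examples with influence functions. To see the link between contrastive counter-examples and influence functions, notice that the second term of Eq. 2, $P(\tilde{y}_t \mid x_t; \theta_{t-1})$, is independent of $z_k$, while the first term can be conveniently approximated with IFs by applying the chain rule:

$$P(\tilde{y}_t \mid x_t; \theta_{t-1}^{-k}) - P(\tilde{y}_t \mid x_t; \theta_{t-1}) \approx -\frac{1}{n} \nabla_\theta P(\tilde{y}_t \mid x_t; \theta_{t-1})^\top \mathcal{I}_{\theta_{t-1}}(z_k) = \frac{1}{n} \nabla_\theta P(\tilde{y}_t \mid x_t; \theta_{t-1})^\top H(\theta_{t-1})^{-1} \nabla_\theta \ell(z_k, \theta_{t-1}) \tag{5}$$

Here $n = |D_{t-1}|$ and $H(\theta_{t-1})$ is the Hessian of the empirical risk. The constant $1/n$ can be dropped during the optimization. This shows that, under this approximation, Eq. 2 is equivalent to:

$$\operatorname*{argmax}_{z_k \in D_{t-1}} \; \nabla_\theta P(\tilde{y}_t \mid x_t; \theta_{t-1})^\top H(\theta_{t-1})^{-1} \nabla_\theta \ell(z_k, \theta_{t-1}) \tag{6}$$

Eq. 6 can be solved efficiently by combining two strategies Koh and Liang (2017): i) Caching the inverse Hessian-vector product (HVP) $H(\theta_{t-1})^{-1} \nabla_\theta P(\tilde{y}_t \mid x_t; \theta_{t-1})$, so that evaluating the objective on each $z_k$ becomes a simple dot product, and ii) Solving the inverse HVP with an efficient stochastic estimator like LISSA Agarwal et al. (2017). This gives us an algorithm for computing contrastive counter-examples.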
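As a minimal sketch of the caching strategy, the snippet below scores every training example against a suspicious example for a binary logistic regression model. It is an illustration under stated assumptions rather than the implementation used in the experiments: the function names are hypothetical, the Hessian is formed explicitly and the linear system is solved with `np.linalg.solve` instead of a stochastic estimator such as LISSA, and a small `damping` term is added to keep the Hessian invertible.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss_grads(theta, X, y):
    """Per-example gradients of the logistic (cross-entropy) loss, shape (n, d)."""
    return (sigmoid(X @ theta) - y)[:, None] * X

def hessian(theta, X, damping):
    """Hessian of the average logistic loss plus a damping term (assumption)."""
    p = sigmoid(X @ theta)
    return (X * (p * (1 - p))[:, None]).T @ X / len(X) + damping * np.eye(X.shape[1])

def prob_grad(theta, x, y):
    """Gradient of P(y | x; theta) with respect to theta, for y in {0, 1}."""
    p = sigmoid(x @ theta)
    sign = 1.0 if y == 1 else -1.0
    return sign * p * (1 - p) * x

def contrastive_scores(theta, X, y, x_susp, y_susp, damping=1e-2):
    # Cache s = H^{-1} grad P once; each candidate then costs a single dot product.
    s = np.linalg.solve(hessian(theta, X, damping), prob_grad(theta, x_susp, y_susp))
    return loss_grads(theta, X, y) @ s

# Synthetic data and a fixed parameter vector (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
theta = np.array([1.0, -0.5, 0.25])
y = (X @ theta > 0).astype(float)
scores = contrastive_scores(theta, X, y, x_susp=X[0], y_susp=1 - y[0])
k = int(np.argmax(scores))   # index of the selected counter-example
```

The explicit solve is only viable for small parameter vectors; for larger models the cached vector `s` would be obtained with an iterative or stochastic solver while the scoring step stays unchanged.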

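Contrastive counter-examples are highly influential. Can this algorithm be used for identifying influential counter-examples? It turns out that, as long as the model is obtained by optimizing the cross-entropy loss, the answer is affirmative. Indeed, note that if $\ell((x, y), \theta) = -\log P(y \mid x; \theta)$, then:

$$\nabla_\theta P(\tilde{y}_t \mid x_t; \theta) = P(\tilde{y}_t \mid x_t; \theta) \, \nabla_\theta \log P(\tilde{y}_t \mid x_t; \theta) = -P(\tilde{y}_t \mid x_t; \theta) \, \nabla_\theta \ell(\tilde{z}_t, \theta) \tag{7}$$

Hence, the influence-based objective in Eq. 6 can be rewritten as:

$$\operatorname*{argmax}_{z_k \in D_{t-1}} \; -P(\tilde{y}_t \mid x_t; \theta_{t-1}) \, \nabla_\theta \ell(\tilde{z}_t, \theta_{t-1})^\top H(\theta_{t-1})^{-1} \nabla_\theta \ell(z_k, \theta_{t-1}) \tag{8}$$

It follows that, under the above assumptions and as long as the model satisfies $P(\tilde{y}_t \mid x_t; \theta_{t-1}) > 0$, Eq. 6 is equivalent to:

$$\operatorname*{argmax}_{z_k \in D_{t-1}} \; -\nabla_\theta \ell(\tilde{z}_t, \theta_{t-1})^\top H(\theta_{t-1})^{-1} \nabla_\theta \ell(z_k, \theta_{t-1}) \tag{9}$$

This equation recovers exactly the definition of influential examples given in Koh and Liang (2017) (Eq. 2) and shows that, for the large family of classifiers trained by cross-entropy, highly influential counter-examples are highly contrastive and vice versa, so that no change to the selection algorithm is necessary.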

3.3 Counter-example Selection with the Fisher information matrix

Unfortunately, we found the computation of IFs to be unstable in practice, cf. Basu et al. (2020). This affects the quality and reliability of the counter-examples identified by IFs. The issue is that, for the common use case of non-convex classifiers trained using gradient-based methods (and possibly early stopping), $\theta_{t-1}$ is seldom close to a local minimum of the empirical risk, rendering the Hessian non-positive definite. In our setting, the situation is further complicated by the presence of noise, which dramatically skews the curvature of the empirical loss. Remedies like fine-tuning the model with L-BFGS Koh and Liang (2017); Yeh et al. (2018), preconditioning and weight decay Basu et al. (2020) proved unsatisfactory in our experiments.

We take a different approach. The idea is to replace the Hessian by the Fisher information matrix (FIM). The FIM of a discriminative model $P(Y \mid X; \theta)$ and training set $D_{t-1}$ of size $n$ is Martens and Grosse (2015); Kunstner et al. (2020):

$$F(\theta) = \frac{1}{n} \sum_{z_k \in D_{t-1}} \mathbb{E}_{y \sim P(Y \mid x_k; \theta)} \left[ \nabla_\theta \log P(y \mid x_k; \theta) \, \nabla_\theta \log P(y \mid x_k; \theta)^\top \right] \tag{10}$$

It can be shown that, if the model approximates the data distribution, the FIM approximates the Hessian, cf. Ting and Brochu (2018); Barshan et al. (2020). Even when this assumption does not hold, as is likely in our noisy setting, the FIM still captures much of the curvature information encoded into the Hessian Martens and Grosse (2015). Under this approximation, the influence-based selection rule of Section 3.2 can be rewritten as:

$$\operatorname*{argmax}_{z_k \in D_{t-1}} \; -\nabla_\theta \ell(\tilde{z}_t, \theta_{t-1})^\top F(\theta_{t-1})^{-1} \nabla_\theta \ell(z_k, \theta_{t-1}) \tag{11}$$

The advantage of this formulation is twofold. First of all, this optimization problem also admits caching the inverse FIM-vector product (FVP), which makes it viable for interactive usage. Second, and most importantly, the FIM is positive semi-definite by construction, making the computation of Eq. 11 much more stable.

The remaining step is how to compute the inverse FVP. Naive storage and inversion of the FIM, which is $|\theta| \times |\theta|$ in size, is impractical for typical models, so the FIM is usually replaced with a simpler matrix. Three common options are the identity matrix, the diagonal of the FIM, and a block-diagonal approximation where interactions between parameters of different layers are set to zero Martens and Grosse (2015). Our best results were obtained by restricting the FIM to the top layer of the network. We refer to this approximation as “Top Fisher”. While more advanced approximations like K-FAC Martens and Grosse (2015) exist, Top Fisher proved surprisingly effective in our experiments.
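To make the top-layer restriction concrete, the following sketch computes a Fisher-based score for every training example when only the weights of a softmax top layer are considered. It is a simplified illustration, not the paper's code: it assumes the penultimate-layer activations are available as plain arrays, ignores the bias term, adds a small `damping` constant because the FIM is only positive semi-definite, and uses hypothetical function names throughout.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def top_layer_loss_grads(W, feats, labels):
    """Per-example gradients of -log P(y | x) w.r.t. top-layer weights W, flattened."""
    R = softmax(feats @ W.T)                          # (n, c) class probabilities
    R[np.arange(len(labels)), labels] -= 1.0          # p - onehot(y)
    return (R[:, :, None] * feats[:, None, :]).reshape(len(labels), -1)

def top_fisher(W, feats):
    """FIM restricted to W: expectation over labels sampled from the model itself."""
    n, c = len(feats), W.shape[0]
    P = softmax(feats @ W.T)
    F = np.zeros((W.size, W.size))
    for y in range(c):
        G = top_layer_loss_grads(W, feats, np.full(n, y))  # gradients if the label were y
        F += (G * P[:, [y]]).T @ G
    return F / n

def top_fisher_counter_example(W, feats_train, y_train, feat_susp, y_susp, damping=1e-3):
    F = top_fisher(W, feats_train) + damping * np.eye(W.size)
    g_susp = top_layer_loss_grads(W, feat_susp[None, :], np.array([y_susp]))[0]
    s = np.linalg.solve(F, g_susp)                    # cached inverse FIM-vector product
    scores = -top_layer_loss_grads(W, feats_train, y_train) @ s
    return int(np.argmax(scores)), scores

# Random features and weights, for illustration only.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
feats_train = rng.normal(size=(20, 4))
y_train = rng.integers(0, 3, size=20)
k, scores = top_fisher_counter_example(
    W, feats_train, y_train, feats_train[0], int((y_train[0] + 1) % 3))
```

Because only the top layer is involved, the matrix has as many rows as there are top-layer weights, which keeps the explicit solve affordable for small output layers.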

3.4 Selecting Pertinent Counter-examples

So far we have discussed how to select contrastive and influential counter-examples. Now we discuss how to make the counter-examples easier to interpret for the annotator. To this end, we introduce the additional desideratum that counter-examples should be:

  D3. Pertinent: it should be clear to the user why $z_k$ is a counter-example for $\tilde{z}_t$.

We integrate D3 into cincer by restricting the choice of possible counter-examples. A simple strategy, which we employ in all of our examples and experiments, is to restrict the search to counter-examples whose label in the training set is the same as the prediction for the suspicious example, i.e., $y_k = \hat{y}_t$. This way, the annotator can interpret the counter-example as being in support of the machine’s suspicion. In other words, if the counter-example is labeled correctly, then the machine’s suspicion is likely right and the incoming example needs cleaning. Otherwise, if the machine is wrong and the suspicious example is not mislabeled, it is likely the counter-example, which backs the machine’s suspicions, that needs cleaning.
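The label restriction amounts to masking the candidate scores before taking the maximum. A minimal sketch, assuming the scores have already been computed by some selection rule and using a hypothetical function name:

```python
import numpy as np

def pertinent_argmax(scores, train_labels, predicted_label):
    """Best-scoring training example among those labeled like the model's prediction."""
    mask = np.asarray(train_labels) == predicted_label
    if not mask.any():
        return None                                   # no candidate carries that label
    masked = np.where(mask, np.asarray(scores, dtype=float), -np.inf)
    return int(np.argmax(masked))

# Made-up scores: the top-scoring example (index 0) has the wrong label and is skipped.
best = pertinent_argmax([0.9, 0.1, 0.5], train_labels=[0, 1, 1], predicted_label=1)
none = pertinent_argmax([0.9, 0.1, 0.5], train_labels=[0, 1, 1], predicted_label=2)
```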

Finally, one drawback of IF-selected counter-examples is that they may be perceptually different from the suspicious example. For instance, outliers are often highly influential as they fool the machine into mispredicting many examples, yet they have little in common with those examples Barshan et al. (2020). This can make it difficult for the user to understand their relationship with the suspicious examples they are meant to explain. This is not necessarily an issue: first, a motivated supervisor is likely to correct mislabeled counter-examples regardless of whether they resemble the suspicious example; second, highly influential outliers are often identified (and corrected if needed) in the first iterations of cincer (indeed, we did not observe a significant amount of repetitions among suggested counter-examples in our experiments). Still, cincer can be readily adapted to acquire more perceptually similar counter-examples. One option is to replace IFs with relative IFs Barshan et al. (2020), which trade off influence against locality. Alas, the resulting optimization problem does not support efficient caching of the inverse HVP. A better alternative is to restrict the search to counter-examples that are similar enough to $\tilde{z}_t$ in terms of some given perceptual distance Heusel et al. (2017) by filtering the candidates using fast nearest neighbor techniques in perceptual space. This is analogous to FastIF Guo et al. (2020), except that the motivation is to encourage perceptual similarity rather than purely efficiency, although the latter is a nice bonus.
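The filtering idea can be sketched as a two-stage selection: first keep the nearest candidates in some embedding space, then pick the best-scoring one among them. The snippet is illustrative only; it uses a brute-force Euclidean search in place of a fast nearest neighbor index, and the embeddings, the function name and the `num_neighbors` parameter are assumptions.

```python
import numpy as np

def prefilter_then_select(scores, train_embeddings, susp_embedding, num_neighbors):
    """Restrict the argmax over scores to the nearest candidates in embedding space."""
    scores = np.asarray(scores, dtype=float)
    dists = np.linalg.norm(train_embeddings - susp_embedding, axis=1)
    neighbors = np.argsort(dists)[:num_neighbors]     # brute-force nearest neighbors
    return int(neighbors[np.argmax(scores[neighbors])])

# Made-up one-dimensional embeddings: the globally best score (index 2) is too far away.
embeddings = np.array([[0.0], [1.0], [5.0], [6.0]])
chosen = prefilter_then_select([0.1, 0.3, 0.9, 0.2], embeddings, np.array([0.4]), num_neighbors=2)
```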

3.5 Advantages and Limitations

The main benefit of cincer is that, by asking a human annotator to correct potential inconsistencies in the data, it acquires substantially better supervision and, in turn, better predictors. In doing so, cincer also encourages consistency between the data and the model. Another benefit is that, by explaining the reasons behind the model’s skepticism, cincer allows the supervisor to spot bugs and justifiably build (or, perhaps more importantly, reject Rudin (2019); Teso and Kersting (2019)) trust in the prediction pipeline.

cincer only requires setting a single parameter, the margin threshold $\tau$, which determines how frequently the supervisor is invoked. The optimal value is highly application specific, but generally speaking it depends on the ratio between the cost of a relabeling query and the cost of noise. If the annotator is willing to interact (for instance, because they are paid to do so) then the threshold can be quite generous.

cincer can be readily applied in applications in which the data set is retained over time, even if the latter is summarized periodically to save space. For settings where retaining any data is not an option, cincer could be adapted to synthesize counter-examples ex novo. Doing so, however, is highly non-trivial and lies outside the scope of this paper.

4 Experiments

We address empirically the following research questions: Q1: Do counter-examples contribute to cleaning the data? Q2: Which influence-based selection strategy identifies the most mislabeled counter-examples? Q3: What contributes to the effectiveness of the best counter-example selection strategy?

We implemented cincer using Python and TensorFlow Abadi and others (2015) on top of three classifiers and compared different counter-example selection strategies on five data sets. The IF code is adapted from Ye (2020). All experiments were run on a 12-core machine with 16 GiB of RAM and no GPU. The code for all experiments is available at:

Data sets. We used a diverse set of classification data sets: Adult Dua and Graff (2017): data set of 48,800 persons, each described by 15 attributes; the goal is to discriminate customers with an income above/below $50K. Breast Dua and Graff (2017): data set of 569 patients described by 30 real-valued features. The goal is to discriminate between benign and malignant breast cancer cases. 20NG Dua and Graff (2017): data set of newsgroup posts categorized in twenty topics. The documents were embedded using a pre-trained Sentence-BERT model Reimers and Gurevych (2019) and compressed using PCA. MNIST LeCun et al. (2010): handwritten digit recognition data set of black-and-white $28 \times 28$ images with normalized pixel values. The data set consists of 60K training and 10K test examples. Fashion Xiao et al. (2017): fashion article classification data set with the same structure as MNIST. For adult and breast a random training-test split is used while for MNIST, fashion and 20NG the split provided with the data set is used. The labels of 20% of training examples were corrupted at random. Performance was measured in terms of $F_1$ score on the (uncorrupted) test set. Error bars in the plots indicate the standard error. The experiments were repeated five times, each time changing the seed used for corrupting the data. All competitors received exactly the same examples in exactly the same order.
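A label corruption step of this kind could look as follows. The sketch assumes that each selected example receives a label drawn uniformly from the other classes, which is one plausible reading of "corrupted at random" and is not confirmed by the text; the function name and arguments are hypothetical.

```python
import numpy as np

def corrupt_labels(labels, num_classes, noise_rate=0.2, seed=0):
    """Return a noisy copy of `labels` and the indices of the corrupted examples."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    noisy = labels.copy()
    n_flip = int(round(noise_rate * len(labels)))
    idx = rng.choice(len(labels), size=n_flip, replace=False)
    # A non-zero offset modulo num_classes guarantees that every picked label changes.
    offsets = rng.integers(1, num_classes, size=n_flip)
    noisy[idx] = (labels[idx] + offsets) % num_classes
    return noisy, idx

clean = np.arange(100) % 10
noisy, flipped = corrupt_labels(clean, num_classes=10, noise_rate=0.2, seed=0)
```

Changing `seed` between runs reproduces the protocol of repeating the experiment with different corruptions of the same data.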

Models. We applied cincer to three models: LR, a logistic regression classifier; FC, a feed-forward neural network with two fully connected hidden layers with ReLU activations; and CNN, a feed-forward neural network with two convolutional layers and two fully connected layers. For all models, the hidden layers have ReLU activations and 20% dropout while the top layer has a softmax activation. LR was applied to MNIST, FC to both the tabular data sets (namely: adult, breast, and 20NG) and image data sets (MNIST and fashion), and CNN to the image data sets only. Upon receiving a new example, the classifier is retrained from scratch for 100 epochs using Adam Kingma and Ba (2014) with default parameters, with early stopping once the accuracy on the training set reaches a model-specific level (one for FC and CNN, another for LR). This helps substantially to stabilize the quality of the model and speeds up the evaluation. Before each run, the models are trained on a bootstrap training set of 500 examples for 20NG and 100 for all the other data sets. The margin threshold $\tau$ is set to a fixed value. Due to space constraints, we report the results on one image data set and three tabular data sets, and we focus on FC and CNN. The other results are consistent with what is reported below; these plots are reported in the Supplementary Material.

4.1 Q1: Counter-examples improve the quality of the data

Figure 2: cincer using Top Fisher vs. drop CE and no CE. Left to right: results for FC on adult, breast and 20NG, CNN on MNIST. Top row: # of cleaned examples. Bottom row: $F_1$ score.

To evaluate the impact of cleaning the counter-examples, we compare cincer combined with the Top Fisher approximation of the FIM, which works best in practice, against two alternatives, namely: No CE: an implementation of skeptical learning Bontempelli et al. (2020) that asks the user to relabel any incoming suspicious examples identified by the margin and presents no counter-examples. Drop CE: a variation of cincer that identifies counter-examples using Top Fisher but drops them from the data set if the user considers the incoming example correctly labeled. The results are reported in Figure 2. The plots show that cincer cleans by far the most examples on all data sets, between 33% and 80% more than the alternatives (top row in Figure 2). This translates into better predictive performance as measured by $F_1$ score (bottom row). Notice also that cincer consistently outperforms the drop CE strategy in terms of $F_1$ score, suggesting that relabeling the counter-examples provides important information for improving the model. These results validate our choice of identifying and relabeling counter-examples for interactive label cleaning compared to focusing on suspicious incoming examples only, and allow us to answer Q1 in the affirmative.

4.2 Q2: Fisher Information-based strategies identify the most mislabeled counter-examples

Figure 3: Counter-example Pr@5 and Pr@10. Standard error information is reported. Left to right: results for FC on adult, breast and 20NG, and CNN on MNIST.

Next, we compare the ability of alternative approximations of IFs to discover mislabeled counter-examples. To this end, we trained a model on a noisy bootstrap data set, selected examples from the remainder of the training set, and measured how many truly mislabeled counter-examples are selected by alternative strategies. In particular, we computed influence using the IF LISSA estimator of Koh and Liang (2017), the actual FIM (denoted “Full Fisher” and reported for the simpler models only for computational reasons) and its approximations using the identity matrix (aka “Practical Fisher” Jaakkola et al. (1999)), and Top Fisher. We computed the precision at $k$ for $k \in \{5, 10\}$, i.e., the fraction of mislabeled counter-examples within the five or ten highest-scoring counter-examples retrieved by the various alternatives, averaged over iterations and over five runs. The results in Figure 3 show that, in general, FIM-based strategies outperform the LISSA estimator, with Full Fisher performing best and Top Fisher a close second. Since the full FIM is highly impractical to store and invert, this confirms our choice of Top Fisher as the best practical strategy.
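For reference, the precision at $k$ metric described above can be computed as in the sketch below, which assumes a vector of selection scores and a boolean vector marking the truly mislabeled training examples. The function name and the numbers are made up for illustration.

```python
import numpy as np

def precision_at_k(scores, is_mislabeled, k):
    """Fraction of truly mislabeled examples among the k highest-scoring candidates."""
    top = np.argsort(-np.asarray(scores, dtype=float))[:k]
    return float(np.mean(np.asarray(is_mislabeled, dtype=bool)[top]))

toy_scores = [0.9, 0.8, 0.1, 0.7, 0.2]
toy_mislabeled = [True, False, True, True, False]
p_at_3 = precision_at_k(toy_scores, toy_mislabeled, k=3)   # top-3 candidates: indices 0, 1, 3
p_at_5 = precision_at_k(toy_scores, toy_mislabeled, k=5)
```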

4.3 Q3: Both influence and curvature contribute to the effectiveness of Top Fisher

Figure 4: Top Fisher vs. Practical Fisher vs. NN. Left to right: results for FC on adult, breast and 20NG, CNN on MNIST. Top row: # of cleaned examples. Bottom row: $F_1$ score.

Finally, we evaluate the impact of selecting counter-examples using Top Fisher on the model’s performance along two axes: the use of influence, by comparing it to an intuitive nearest neighbor alternative (NN), and the modelling of the curvature, by comparing it to Practical Fisher. NN simply selects the counter-example that is closest to the suspicious example. The results can be viewed in Figure 4. Top Fisher is clearly the best strategy, both in terms of number of cleaned examples and $F_1$ score. NN is always worse than Top Fisher in terms of $F_1$, even in the case of adult (first column) where it cleans the same number of examples, confirming the importance of influence in selecting impactful counter-examples. Practical Fisher clearly underperforms compared with Top Fisher, which shows the importance of modelling the curvature matrix. For each data set, all methods make a similar number of queries: 58 for 20NG, 21 for breast, 31 for adult and 37 for MNIST. The complete values are reported in the Supplementary Material.

5 Related Work

Learning under noise. Typical strategies for learning from noisy labels include discarding or down-weighting suspicious examples and employing models robust to noise Angluin and Laird (1988); Frénay and Verleysen (2014); Nigam et al. (2020); Song et al. (2020). These approaches make no attempt to recover the ground-truth label and are not ideal in interactive learning settings characterized by high noise rate/cost and small data sets. Most works on interactive learning under noise are designed for crowd-sourcing applications in which items are labelled by different annotators of varying quality and the goal is to aggregate weak annotations into a high-quality consensus label Zhang et al. (2016). Our work is strongly related to approaches to interactive learning under label noise like skeptical learning Zeni et al. (2019); Bontempelli et al. (2020) and learning from weak annotators Urner et al. (2012); Kremer et al. (2018). These approaches completely ignore the issue of noise in the training set, which can be quite detrimental, as shown by our experiments. Moreover, they are completely black-box and do not attempt to explain to the supervisor why examples are considered suspicious by the machine, making it hard for him/her to establish or reject trust in the data and the model.

Influence functions and Fisher information. It is well known that mislabeled examples tend to exert a larger influence on the model Wojnowicz et al. (2016); Koh and Liang (2017); Khanna et al. (2019); Barshan et al. (2020) and indeed IFs may be a valid alternative to the margin for identifying suspicious examples. Building on the seminal work of Koh and Liang (2017), we instead leverage IFs to define and compute contrastive counter-examples that explain why the machine is suspicious. The difference is that noisy training examples influence the model as a whole, whereas contrastive counter-examples influence a specific suspicious example. To the best of our knowledge, this application of IFs is entirely novel. Notice also that empirical evidence that IFs recover noisy examples is restricted to offline learning Koh and Liang (2017); Khanna et al. (2019). Our experiments extend this to a less forgiving interactive setting in which only one counter-example is selected per iteration and the model is trained on the whole training set. The idea of exploiting the FIM to approximate the Hessian has ample support in the natural gradient descent literature Martens and Grosse (2015); Kunstner et al. (2020). The FIM has been used for computing example-based explanations by Khanna et al. (2019). Their approach is however quite different from ours. cincer is equivalent to maximizing the Fisher kernel Jaakkola et al. (1999) between the suspicious example and the counter-example (the Fisher-based selection rule of Section 3.3) for the purpose of explaining the model’s margin, and this formulation is explicitly derived from two simple desiderata. In contrast, Khanna et al. maximize a function of the Fisher kernel (namely, the squared Fisher kernel between $\tilde{z}_t$ and $z_k$ divided by the norm of $z_k$ in the RKHS). This optimization problem is not equivalent to the selection rule of Section 3.3 and does not admit efficient computation by caching the inverse FIM-vector product.

Other works. cincer draws inspiration from explanatory active learning, which integrates local Teso and Kersting (2019); Selvaraju et al. (2019); Lertvittayakumjorn et al. (2020); Schramowski et al. (2020) or global Popordanoska et al. (2020) explanations into interactive learning and allows the annotator to supply corrective feedback on the model’s explanations. These approaches differ from cincer in that they neither consider the issue of noise nor perform label cleaning, and indeed they explain the model’s predictions rather than the model’s suspicion. Another notable difference is that they rely on attribution-based explanations, whereas the backbone of cincer is example-based explanations, which enable users to reason about labels in terms of concrete (training) examples Miller (2018); Jeyakumar et al. (2020). Saliency maps could potentially be integrated into cincer to provide more fine-grained information.

6 Conclusion

We introduced cincer, an approach for handling label noise in sequential learning that asks a human supervisor to relabel any incoming suspicious examples. Compared to previous approaches, cincer identifies the reasons behind the model’s skepticism and asks the supervisor to double-check them too. This is done by computing a training example that maximally supports the machine’s suspicions. This enables the user to correct both incoming and old examples, cleaning inconsistencies in the data that confuse the model. Our experiments show that, by removing inconsistencies in the data, cincer enables acquiring better data and models than less informed alternatives.

Our work can be improved in several ways. cincer can be straightforwardly extended to online active learning, in which the label of incoming instances must be acquired on the fly Malago et al. (2014). This is closer to the skeptical learning setup Zeni et al. (2019). cincer can also be adapted to correct multiple counter-examples as well as the reasons behind mislabeled counter-examples. This could be achieved using “multi-round” label cleaning and group-wise measures of influence Koh et al. (2019); Basu et al. (2019); Ghorbani and Zou (2019) to handle multiple counter-examples at once.

Potential negative impact. As with most interactive approaches, there is a risk that cincer annoys the user by asking an excessive number of questions. This is currently mitigated by querying the user only when the model is confident enough in its own predictions (through the margin-based strategy) and by selecting influential counter-examples that have a high chance of improving the model upon relabeling, thus reducing the future chance of pointless queries. Moreover, the margin threshold makes it possible to modulate the amount of interaction based on the user’s commitment.

This research has received funding from the European Union’s Horizon 2020 FET Proactive project “WeNet - The Internet of us”, grant agreement No. 823783, and from the “DELPhi - DiscovEring Life Patterns” project funded by the MIUR Progetti di Ricerca di Rilevante Interesse Nazionale (PRIN) 2017 – DD n. 1062 del 31.05.2019. The research of ST and AP was partially supported by TAILOR, a project funded by EU Horizon 2020 research and innovation programme under GA No 952215.


  • [1] M. Abadi et al. (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Cited by: §4.
  • [2] N. Agarwal, B. Bullins, and E. Hazan (2017) Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research 18 (1), pp. 4148–4187. Cited by: §3.2.
  • [3] D. Angluin and P. Laird (1988) Learning from noisy examples. Machine Learning 2 (4), pp. 343–370. Cited by: §5.
  • [4] E. Barshan, M. Brunet, and G. K. Dziugaite (2020) RelatIF: identifying explanatory training examples via relative influence. In The 23rd International Conference on Artificial Intelligence and Statistics, Cited by: §3.3, §3.4, §5.
  • [5] S. Basu, P. Pope, and S. Feizi (2020) Influence functions in deep learning are fragile. arXiv preprint arXiv:2006.14651. Cited by: §3.3.
  • [6] S. Basu, X. You, and S. Feizi (2019) Second-order group influence functions for black-box predictions. arXiv preprint arXiv:1911.00418. Cited by: §6.
  • [7] A. Bontempelli, S. Teso, F. Giunchiglia, and A. Passerini (2020) Learning in the Wild with Incremental Skeptical Gaussian Processes. In Proceedings of the 29th International Joint Conference on Artificial Intelligence, Cited by: §1, §1, §2, §3.1, §4.1, §5.
  • [8] R. D. Cook and S. Weisberg (1982) Residuals and influence in regression. New York: Chapman and Hall. Cited by: §1, §3.2.
  • [9] L. Corti (1993) Using diaries in social research. Social research update 2 (2). Cited by: §1.
  • [10] D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. Cited by: §4.
  • [11] B. Frénay and M. Verleysen (2014) Classification in the presence of label noise: a survey. IEEE transactions on neural networks and learning systems 25 (5), pp. 845–869. Cited by: §1, §5.
  • [12] A. Ghorbani and J. Zou (2019) Data shapley: equitable valuation of data for machine learning. In International Conference on Machine Learning, pp. 2242–2251. Cited by: §6.
  • [13] H. Guo, N. F. Rajani, P. Hase, M. Bansal, and C. Xiong (2020) FastIF: scalable influence functions for efficient model interpretation and debugging. arXiv preprint arXiv:2012.15781. Cited by: §3.4.
  • [14] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6629–6640. Cited by: §3.4.
  • [15] T. S. Jaakkola, D. Haussler, et al. (1999) Exploiting generative models in discriminative classifiers. Advances in neural information processing systems, pp. 487–493. Cited by: §4.2, §5.
  • [16] J. V. Jeyakumar, J. Noor, Y. Cheng, L. Garcia, and M. Srivastava (2020) How can i explain this to you? an empirical study of deep neural network explanation methods. Advances in Neural Information Processing Systems 33. Cited by: §5.
  • [17] R. Khanna, B. Kim, J. Ghosh, and S. Koyejo (2019) Interpreting black box predictions using fisher kernels. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 3382–3390. Cited by: §5.
  • [18] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.
  • [19] P. W. Koh and P. Liang (2017) Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, pp. 1885–1894. Cited by: Appendix A, §1, §3.2, §3.2, §3.2, §3.3, §4.2, §5.
  • [20] P. W. W. Koh, K. Ang, H. Teo, and P. S. Liang (2019) On the accuracy of influence functions for measuring group effects. In Advances in Neural Information Processing Systems, pp. 5254–5264. Cited by: §6.
  • [21] J. Kremer, F. Sha, and C. Igel (2018) Robust active label correction. In International conference on artificial intelligence and statistics, pp. 308–316. Cited by: §1, §1, §2, §5.
  • [22] F. Kunstner, L. Balles, and P. Hennig (2020) Limitations of the empirical fisher approximation for natural gradient descent. In 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), pp. 4133–4144. Cited by: §3.3, §5.
  • [23] Y. LeCun, C. Cortes, and C. Burges (2010) MNIST handwritten digit database. ATT Labs [Online]. Cited by: §4.
  • [24] E. L. Lehmann and G. Casella (2006) Theory of point estimation. Springer Science & Business Media. Cited by: §1.
  • [25] P. Lertvittayakumjorn, L. Specia, and F. Toni (2020) Human-in-the-loop debugging deep text classifiers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 332–348. Cited by: §5.
  • [26] L. Malago, N. Cesa-Bianchi, and J. Renders (2014) Online active learning with strong and weak annotators. In NIPS Workshop on Learning from the Wisdom of Crowds, Cited by: §6.
  • [27] J. Martens and R. Grosse (2015) Optimizing neural networks with kronecker-factored approximate curvature. In International conference on machine learning, pp. 2408–2417. Cited by: §3.3, §3.3, §5.
  • [28] T. Miller (2018) Explanation in artificial intelligence: insights from the social sciences. Artificial Intelligence. Cited by: §5.
  • [29] N. Nigam, T. Dutta, and H. P. Gupta (2020) Impact of noisy labels in learning techniques: a survey. In Advances in data and information sciences, pp. 403–411. Cited by: §5.
  • [30] T. Popordanoska, M. Kumar, and S. Teso (2020) Machine guides, human supervises: interactive learning with global explanations. arXiv preprint arXiv:2009.09723. Cited by: §5.
  • [31] N. Reimers and I. Gurevych (2019) Sentence-bert: sentence embeddings using siamese bert-networks. In EMNLP-IJCNLP, Cited by: §4.
  • [32] C. Rudin (2019) Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1 (5), pp. 206–215. Cited by: §3.5.
  • [33] P. Schramowski, W. Stammer, S. Teso, A. Brugger, F. Herbert, X. Shao, H. Luigs, A. Mahlein, and K. Kersting (2020) Making deep neural networks right for the right scientific reasons by interacting with their explanations. Nature Machine Intelligence 2 (8), pp. 476–486. Cited by: §5.
  • [34] R. R. Selvaraju, S. Lee, Y. Shen, H. Jin, S. Ghosh, L. Heck, D. Batra, and D. Parikh (2019) Taking a hint: leveraging explanations to make vision and language models more grounded. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2591–2600. Cited by: §5.
  • [35] H. Song, M. Kim, D. Park, and J. Lee (2020) Learning from noisy labels with deep neural networks: a survey. arXiv preprint arXiv:2007.08199. Cited by: §1, §5.
  • [36] S. Teso and K. Kersting (2019) Explanatory interactive machine learning. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 239–245. Cited by: §3.5, §5.
  • [37] D. Ting and E. Brochu (2018) Optimal subsampling with influence functions. In Advances in Neural Information Processing Systems, pp. 3650–3659. Cited by: §3.3.
  • [38] R. Tourangeau, L. J. Rips, and K. Rasinski (2000) The psychology of survey response. Cambridge University Press. Cited by: §1.
  • [39] R. Urner, S. Ben-David, and O. Shamir (2012) Learning from weak teachers. In Artificial intelligence and statistics, pp. 1252–1260. Cited by: §1, §2, §5.
  • [40] B. T. West and J. Sinibaldi (2013) The quality of paradata: a literature review. Improving Surveys with Paradata, pp. 339–359. Cited by: §1.
  • [41] M. Wojnowicz, B. Cruz, X. Zhao, B. Wallace, M. Wolff, J. Luan, and C. Crable (2016) “Influence sketching”: finding influential samples in large-scale regressions. In 2016 IEEE International Conference on Big Data (Big Data), pp. 3601–3612. Cited by: §5.
  • [42] H. Xiao, K. Rasul, and R. Vollgraf (2017) arXiv preprint cs.LG/1708.07747. Cited by: §4.
  • [43] T. Ye (2020) Example based explanation in machine learning. GitHub. Cited by: §4.
  • [44] C. Yeh, J. S. Kim, I. E. Yen, and P. Ravikumar (2018) Representer point selection for explaining deep neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 9311–9321. Cited by: §3.3.
  • [45] M. Zeni, W. Zhang, E. Bignotti, A. Passerini, and F. Giunchiglia (2019) Fixing mislabeling by human annotators leveraging conflict resolution and prior knowledge. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 3 (1), pp. 32. Cited by: §1, §2, §5, §6.
  • [46] J. Zhang, X. Wu, and V. S. Sheng (2016) Learning from crowdsourced labeled data: a survey. Artificial Intelligence Review 46 (4), pp. 543–576. Cited by: §1, §5.

Appendix A Additional Details

For all models and data sets, the margin threshold is set to , the batch size is 1024 and the number of epochs is 100. As for influence functions, we used an implementation of the LISSA-based strategy suggested by Koh and Liang [19]. The Hessian damping (pre-conditioning) constant was set to 0.01, the number of stochastic LISSA iterations to 10 and the number of samples to 1 (the default value). We experimented with a large number of alternative hyperparameter settings, including a larger number of LISSA iterations (up to 1000) and of samples (up to 30), without any substantial improvement in performance for the IF approximation.
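To make the role of the damping constant, the number of iterations and the number of samples concrete, the following Python sketch shows a LISSA-style recursion for approximating an inverse-Hessian-vector product. It is only a sketch under stated assumptions and not the implementation used in the experiments: the function name, the `hvp` callable and the `scale` parameter are hypothetical, and in the stochastic setting `hvp` would be evaluated on a freshly sampled mini-batch at each call.

```python
import numpy as np

def lissa_inverse_hvp(hvp, v, damping=0.01, scale=10.0,
                      iterations=10, samples=1):
    """Approximate (H + damping * I)^{-1} v with a truncated Neumann series.

    hvp:     callable returning the Hessian-vector product H @ h
             (hypothetical interface; may be a mini-batch estimate).
    v:       float vector to be multiplied by the inverse Hessian.
    scale:   hypothetical constant; the recursion converges only if the
             eigenvalues of (H + damping * I) / scale lie in (0, 2).
    """
    estimates = []
    for _ in range(samples):
        h = v.copy()
        for _ in range(iterations):
            # Fixed point satisfies (H + damping * I) h / scale = v.
            h = v + h - (hvp(h) + damping * h) / scale
        estimates.append(h / scale)
    # Average the independent runs (identical if hvp is deterministic).
    return np.mean(estimates, axis=0)
```

With few iterations the series is truncated early, which is one plausible reason why the quality of the estimate is sensitive to these settings.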

Appendix B Full Plots for Q1

Figure 5 reports the number of cleaned examples, score and number of queries to the user. The results are the same as in the main text: cincer combined with the Top Fisher approximation of the FIM is by far the best performing method. In all cases, cincer cleans more examples and achieves a better score than the alternative approaches for noise handling, namely drop CE and no CE. The number of queries is similar for all methods across data sets and models, differing by only a few queries. For the same number of queries, cincer cleans more labels, confirming the advantages of identifying and relabeling counter-examples to increase the predictive performance.

Figure 5: cincer using Top Fisher vs. drop CE and no CE. Top row from left to right: LR, FC, CNN on MNIST and FC on fashion MNIST. Bottom row from left to right: CNN on fashion MNIST, FC on adult, breast and 20NG. For each row: number of cleaned examples, score and number of queries.

Appendix C Full Plots for Q2

To compare the number of mislabeled counter-examples discovered by the different approximations of IFs, we compute the precision at k (Pr@k) for k = 5 and k = 10. The results are shown in Figure 6. In general, FIM-based approaches outperform the LISSA estimator. Top Fisher is the best strategy after the full FIM, which is difficult to store and invert.
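For reference, precision at k can be computed as in the short Python sketch below. It assumes that the candidate counter-examples have already been ranked by the approximation under evaluation and that a ground-truth flag says whether each one is mislabeled; the function name and input format are illustrative, and how the values are aggregated across iterations is not specified here.

```python
def precision_at_k(ranked_is_mislabeled, k):
    """Fraction of the top-k ranked candidates that are truly mislabeled.

    ranked_is_mislabeled: booleans ordered from highest to lowest score.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranked_is_mislabeled[:k]
    return sum(bool(flag) for flag in top) / k
```

For example, if three of the five highest-ranked candidates are mislabeled, then the precision at 5 is 3/5.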

Figure 6: Counter-example Pr@5 and Pr@10. Standard error is reported. Top row from left to right: LR, FC, CNN on MNIST and FC on fashion MNIST. Bottom row from left to right: CNN on fashion MNIST, FC on adult, breast and 20NG.

Appendix D Full Plots for Q3

Figure 7 shows the results of the evaluation of Top Fisher, Practical Fisher and nearest neighbor (NN). Top Fisher outperforms the alternatives in terms of score and number of cleaned examples. NN is always worse than Top Fisher, even on adult (second column, second row) where it cleans the same number of examples but achieves lower predictive performance. These results show the importance of using the influence for choosing the counter-examples. As reported in the main text, Practical Fisher lags behind Top Fisher in all cases. The number of queries is similar for all strategies.

Figure 7: Top Fisher vs. Practical Fisher vs. NN. Top row from left to right: LR, FC, CNN on MNIST and FC on fashion MNIST. Bottom row from left to right: CNN on fashion MNIST, FC on adult, breast and 20NG. For each row: number of cleaned examples, score and number of queries.