Unsupervised Difficulty Estimation with Action Scores

11/23/2020 ∙ by Octavio Arriaga, et al. ∙ DFKI GmbH, Universität Bremen

Evaluating difficulty and biases in machine learning models has become of extreme importance as current models are now being applied in real-world situations. In this paper we present a simple method for calculating a difficulty score based on the accumulation of losses for each sample during training. We call this the action score. Our proposed method does not require any modification of the model nor any external supervision, as it can be implemented as a callback that gathers information from the training process. We test and analyze our approach in two different settings: image classification and object detection, and we show that in both settings the action score can provide insights about model and dataset biases.




1 Introduction

Current state-of-the-art models in computer vision rely on convolutional neural networks (CNNs). Modern CNN architectures contain sufficient structural priors to reduce the solution space to a computable and generalisable one, but are not restricted enough to prevent them from learning unstructured data nuances Zhang et al. (2016); Nguyen et al. (2015); Jo and Bengio (2017); Goodfellow et al. (2014). In this paper we present a simple method to assess the difficulty and possible biases of machine learning models by tracking the loss of each sample during training. In contrast to similar methods Shrivastava et al. (2016); Loshchilov and Hutter (2015); Lin et al. (2017); Wang and Vasconcelos (2018), ours does not rely on any external supervision or model modification. Specifically, we test it in a simple image classification scenario and in a more complex setting with a multi-objective loss used in object detection.

The use of per-sample loss values is widespread in the literature. Shrivastava et al. (2016) uses the per-sample loss to mine hard negative examples while training an object detector. Loshchilov and Hutter (2015) proposes a way to sample mini-batches using the loss as a criterion, where training samples with higher loss are chosen more frequently; this has the effect of speeding up training by a factor of 5. The focal loss Lin et al. (2017) introduces a similar concept, in which an object detector focuses on harder samples. Difficulty estimation is an emerging topic in this field. Wang and Vasconcelos (2018) proposes an additional output branch and a related loss function in order to learn to estimate sample difficulty; however, this method has learning difficulties and cannot be trained end-to-end.

2 Unsupervised Difficulty Estimation

Given a loss function $\mathcal{L}$ and a model $f$ with free parameters $\theta$, we define the action of a sample $x$ with labels $y$ as

    $A(x, y) = \sum_{t=1}^{T} \mathcal{L}(f(x; \theta_t), y)$,

where $t = 1, \dots, T$ represents epochs. Consequently, the action¹ of a sample is its loss accumulated over all epochs. Our method characterizes the action of each sample as a measurement of its difficulty; therefore, samples with a higher accumulated loss are samples that are more difficult to learn. Specifically, we argue that the action of a sample is directly proportional to its difficulty. Within this framework we can also recover the sample pairs $(x, y)$ that accumulate the least amount of loss during optimization. These samples reflect which elements are easier to learn, as well as possible biases that might be present in the data. We would like to emphasize that the method presented here can be applied to any learning algorithm that is optimized iteratively, and is not limited to artificial neural networks nor to supervised methods.

¹We adopt this name due to its similarity to the action of a physical system following the path of stationary action Landau and Lifshitz (1960).
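The accumulation above amounts to a column sum over a table of per-epoch, per-sample losses. The following NumPy sketch (ours, not the authors' code, with made-up loss values) shows how the action score recovers the hardest and easiest samples:

```python
import numpy as np

# Hypothetical per-sample losses recorded during training:
# one row per epoch (T = 3), one column per sample (N = 4).
losses_per_epoch = np.array([
    [2.3, 2.2, 2.4, 2.3],   # epoch 1: every sample starts near chance level
    [1.1, 0.6, 2.1, 0.4],   # epoch 2: sample 2 remains hard
    [0.4, 0.1, 1.9, 0.1],   # epoch 3
])

# Action = loss accumulated over all epochs, one score per sample.
action = losses_per_epoch.sum(axis=0)

hardest = int(np.argmax(action))  # highest accumulated loss -> 2
easiest = int(np.argmin(action))  # lowest accumulated loss  -> 3
```

Note that sample 2 dominates the action even though every sample starts with a similar loss: the ranking is driven by how quickly each loss decays, not by its initial value.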

3 Results

We first tested our method on a simple classification task in which we trained a VGG-like CNN² on CIFAR10 using the cross-entropy loss. At every epoch we calculated and stored the loss of each sample in the test set. After the conclusion of the training phase we calculated the action of each sample by summing the stored losses. In Figure 1 we display the samples with the highest and lowest action scores. From Figure 1 we can observe that the model learns with the least action two specific sets of samples: brown horses and red cars. For our second experiment we calculated the action scores of the multi-objective loss function used for training the single-shot object detector SSD300 Liu et al. (2016). The total loss of this model consists of a combination of three different losses: positive classification, negative classification, and bounding box regression. For the localization loss, the samples with the highest and lowest action are shown in Figure 2.

²We used the Keras CIFAR10 example CNN available at keras-examples.

We can observe that the most difficult samples for the box-regression loss correspond to images that contain indistinguishable small objects, while the easiest samples for the same loss correspond to single, centered objects.

We provide additional examples of object detection on PASCAL VOC 2007 in the supplementary material.
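The experimental procedure described above (store each sample's loss at every epoch, then sum) can be sketched end to end on a toy problem. This is an illustrative stand-in, not the paper's setup: a NumPy logistic regression on separable 2-D data with one deliberately mislabeled point playing the role of a "difficult" sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the classification experiment: two well-separated
# Gaussian blobs, with one label flipped to create a hard sample.
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)),
               rng.normal(+2.0, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20, dtype=float)
y[0] = 1.0                       # mislabel sample 0 on purpose

w, b = np.zeros(2), 0.0
stored_losses = []               # one row of per-sample losses per epoch

for epoch in range(50):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))            # predicted probabilities
    eps = 1e-12                                        # numerical safety
    per_sample = -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    stored_losses.append(per_sample)                   # the "callback" step

    grad = p - y                                       # cross-entropy gradient
    w -= 0.1 * (X.T @ grad) / len(y)                   # gradient-descent update
    b -= 0.1 * grad.mean()

# Action score per training sample: sum of its losses over all epochs.
action = np.sum(stored_losses, axis=0)
print(int(np.argmax(action)))    # the mislabeled point has the highest action
```

Because the mislabeled point sits deep inside the wrong class, its loss grows as the model fits everything else, so its accumulated loss dominates; this mirrors how the action score surfaces difficult or mislabeled samples without any extra supervision.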

4 Conclusions and Future Work

In this work we presented a method for estimating the difficulty and possible biases of a model. Our method requires no external supervision nor any modification of the original model, and it can be easily integrated into any learning framework. We tested our method in two different settings and displayed the samples with the highest and lowest action scores. The obtained results indicate that the maximum and minimum action scores qualitatively correspond to difficult or biased samples. For future work we propose applying our method in unsupervised settings, as well as testing its variability across different models.


  • [1] I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
  • [2] J. Jo and Y. Bengio (2017) Measuring the tendency of CNNs to learn surface statistical regularities. arXiv preprint arXiv:1711.11561.
  • [3] L. Landau and E. Lifshitz (1960) Course of theoretical physics. Vol. 1: Mechanics. Oxford.
  • [4] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988.
  • [5] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37.
  • [6] I. Loshchilov and F. Hutter (2015) Online batch selection for faster training of neural networks. arXiv preprint arXiv:1511.06343.
  • [7] A. Nguyen, J. Yosinski, and J. Clune (2015) Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. pp. 427–436.
  • [8] A. Shrivastava, A. Gupta, and R. Girshick (2016) Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 761–769.
  • [9] P. Wang and N. Vasconcelos (2018) Towards realistic predictors. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 36–51.
  • [10] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2016) Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.

Appendix A Object Detection Results on PASCAL VOC 2007 with SSD

In this section we show results on the PASCAL VOC 2007 validation set using the Single Shot Multibox Detector (SSD) [5]. SSD uses a multi-task loss: a localization loss for bounding box regression and a cross-entropy loss for class predictions. The cross-entropy loss can be divided into a loss for the positive examples (target objects) and a loss for the negative examples (background). We show results for each component of the multi-task loss, namely the localization, positive, and negative losses.
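With a multi-task loss, a separate action score can be kept per loss component, which is what the per-component rankings in the figures below report. A minimal NumPy sketch of that bookkeeping, using our own component names ('loc', 'pos', 'neg') and random stand-in losses:

```python
import numpy as np

rng = np.random.default_rng(1)
n_epochs, n_images = 10, 5

# Hypothetical per-epoch, per-image losses for each SSD loss component:
# 'loc' = localization, 'pos'/'neg' = positive/negative cross-entropy.
losses = {name: rng.random((n_epochs, n_images))
          for name in ("loc", "pos", "neg")}

# One action score per image *per component*: sum over the epoch axis.
actions = {name: mat.sum(axis=0) for name, mat in losses.items()}

for name, a in actions.items():
    print(name, "hardest image:", int(np.argmax(a)),
          "easiest image:", int(np.argmin(a)))
```

Ranking each component separately is what lets the hardest images differ across the localization, positive, and negative losses, as in the figures that follow.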

(a) Boat 1749.3
(b) Train 496.6
(c) Car 457.1
(d) Bird 413.1
(e) Bird 385.8
(f) Plant 357.6
(g) Bottle 354.5
(h) Person 3.9
(i) Cat 4.1
(j) Dog 4.2
(k) Cat 4.7
(l) Cat 5.0
(m) Person 5.2
(n) Person 5.3
Figure 2: Most difficult (top row) and easiest examples (bottom row) in the VOC 2007 validation set under the SSD localization loss. The action score and the true label are displayed below each image.
(a) Cow
(b) Car
(c) Person and Horse 565.0
(d) Horse
(e) Cat
(f) Plant, Bottle and Horse 527.8
(g) Dog
Figure 3: Hardest Examples on PASCAL VOC 2007 with SSD validation positive loss. Action score is included in each caption.
(a) Person
(b) Person
(c) Person
(d) Cat
(e) Person
(f) Cat
(g) Person
Figure 4: Easiest Examples on PASCAL VOC 2007 with SSD validation positive loss. Action score is included in each caption.
(a) Person
(b) Boat
(c) Table
(d) Person
(e) Table
(f) Chair
(g) Bicycle
Figure 5: Hardest Examples on PASCAL VOC 2007 with SSD validation negative loss. Action score is included in each caption.
(a) Cat
(b) Cat
(c) Cat
(d) Dog
(e) Cat
(f) Person
(g) Cat
Figure 6: Easiest Examples on PASCAL VOC 2007 with SSD validation negative loss. Action score is included in each caption.