Current state-of-the-art models for computer vision tasks rely on convolutional neural networks (CNNs). Modern CNN architectures contain enough structural priors to reduce the solution space to one that is computable and generalisable, yet they are not restricted enough to prevent the models from learning unstructured data nuances Zhang et al. (2016); Nguyen et al. (2015); Jo and Bengio (2017); Goodfellow et al. (2014). In this paper we present a simple method to assess the difficulty and possible biases of machine learning models by tracking the loss of each sample during training. In contrast to similar methods Shrivastava et al. (2016); Loshchilov and Hutter (2015); Lin et al. (2017); Wang and Vasconcelos (2018), our method does not rely on any external supervision or model modification. Specifically, we test it in a simple image classification scenario and in a more complex setting with a multi-objective loss used in object detection.
The use of per-sample loss values is widespread in the literature. Shrivastava et al. (2016) use the per-sample loss to mine hard negative examples while training an object detector. Loshchilov and Hutter (2015) propose a way to sample mini-batches using the loss as a criterion, in which training samples with a higher loss are chosen more frequently; this speeds up training by a factor of up to 5. The focal loss Lin et al. (2017) introduces a similar concept, in which an object detector focuses on harder samples. Difficulty estimation is an emerging topic in this field. Wang and Vasconcelos (2018) propose an additional output branch with a related loss function in order to learn to estimate sample difficulty; however, this method has learning difficulties and cannot be trained end-to-end.
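For concreteness, the focal-loss weighting described above can be sketched as follows. This is a minimal NumPy version for binary classification; the defaults γ = 2 and α = 0.25 follow the values reported by Lin et al. (2017), but this is an illustrative sketch rather than the reference implementation:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss sketch: p holds predicted probabilities of the
    positive class, y holds binary labels in {0, 1}."""
    p_t = np.where(y == 1, p, 1.0 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)  # class-balancing weight
    # the modulating factor (1 - p_t)^gamma down-weights easy examples
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + 1e-12)
```

With γ = 0 this reduces to α-weighted cross-entropy; increasing γ shrinks the loss contribution of well-classified samples, so the optimizer concentrates on the hard ones.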
2 Unsupervised Difficulty Estimation
Given a loss function $\mathcal{L}$ and a model $f$ with free parameters $\theta$, we define the action $S$ of a sample $x$ with labels $y$ as

$$S(x, y) = \sum_{t=1}^{T} \mathcal{L}\big(f_{\theta_t}(x),\, y\big),$$

where $t$ represents epochs. Consequently, the action (we adopt this name due to its similarity with a physical system following the path of stationary action Landau and Lifshitz (1960)) of a sample is the accumulated loss over all epochs. Our method characterizes the action of each sample as a measurement of its difficulty. Therefore, samples with a higher accumulated loss represent samples that are more difficult to learn. Specifically, we argue that the action of a sample is directly proportional to its difficulty. Within this framework we can also recover the samples that accumulate the least amount of loss during optimization. These samples reflect which elements are easier to learn, as well as possible biases that might be present in the data. We would like to emphasize that the method presented here can be applied to any learning algorithm that is optimized iteratively, and it is not limited to artificial neural networks or to supervised methods.
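As a concrete illustration of this procedure, the sketch below trains a logistic-regression classifier with gradient descent while accumulating the action of every sample. The model, data shapes, and hyperparameters are assumptions chosen only to keep the example self-contained; the same loop applies to any iteratively optimized model:

```python
import numpy as np

def train_with_action(X, y, epochs=20, lr=0.1, seed=0):
    """Train a logistic-regression model with full-batch gradient descent,
    accumulating each sample's per-epoch loss into its action score."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = rng.normal(scale=0.01, size=d), 0.0
    action = np.zeros(n)                      # one accumulator per sample
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        # per-sample cross-entropy at the current parameters theta_t
        losses = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1.0 - p + 1e-12))
        action += losses                      # S(x, y) = sum_t L(f_theta_t(x), y)
        w -= lr * (X.T @ (p - y)) / n         # gradient step on w
        b -= lr * np.mean(p - y)              # gradient step on b
    return w, b, action
```

On a toy separable dataset with one flipped label, the mislabeled sample accumulates the highest action, matching the intuition that high action flags samples that are hard (or impossible) to fit.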
3 Experiments

We first tested our method in a simple classification task in which we trained a VGG-like CNN (the Keras CIFAR10 example CNN available at keras-examples) on CIFAR10 using the cross-entropy loss. At every epoch we calculated and stored the loss of each sample in the test set. After the conclusion of the training phase we calculated the action of each sample by summing up the stored losses. In Figure 1 we display the samples with the highest and lowest action scores. From Figure 1 we can observe that the model learns to distinguish with the least action two specific sets of samples: brown horses and red cars. For our second experiment we calculated the action scores of a multi-objective loss function used for training the single-shot object detector SSD300 Liu et al. (2016). The total loss of this model consists of the combination of three different losses: positive classification, negative classification and bounding-box regression. For the localization loss, the samples with the most and least action are shown in Figure 2.
We can observe that the most difficult samples for the box-regression loss correspond to images that contain indistinguishable small objects. Moreover, the easiest samples for the same loss correspond to images with a single centered object.
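Recovering these extreme samples only requires storing the per-epoch losses and sorting the accumulated totals. A minimal sketch (the array layout is an assumption of this illustration):

```python
import numpy as np

def rank_by_action(per_epoch_losses, k=5):
    """per_epoch_losses: array of shape (epochs, num_samples) holding the
    loss of every sample at every epoch. Returns the indices of the k
    samples with the highest action (hardest) and lowest action (easiest)."""
    action = per_epoch_losses.sum(axis=0)   # accumulate losses over epochs
    order = np.argsort(action)              # indices in ascending action
    return order[-k:][::-1], order[:k]      # hardest first, then easiest
```

The returned index lists are what we use to retrieve and display the images in Figures 1 and 2.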
We provide additional examples of object detection on PASCAL VOC 2007 in the supplementary material.
4 Conclusions and Future Work
In this work we presented a method for assessing the difficulty of samples and the possible biases of a model. Our method requires no external supervision or modification of the original model, and it can be easily integrated into any learning framework. We tested our method in two different settings and displayed the samples with the highest and lowest action scores. Our results indicate that the maximum and minimum action scores do qualitatively correspond to difficult or biased samples. For future work we propose to apply our method in unsupervised settings, as well as to test its variability across different models.
- Goodfellow, I., J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
- Jo, J. and Y. Bengio (2017) Measuring the tendency of CNNs to learn surface statistical…. arXiv preprint arXiv:1711.11561.
- Landau, L. D. and E. M. Lifshitz (1960) Course of Theoretical Physics. Vol. 1: Mechanics. Oxford.
- Lin, T., P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988.
- Liu, W., D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37.
- Loshchilov, I. and F. Hutter (2015) Online batch selection for faster training of neural networks. arXiv preprint arXiv:1511.06343.
- Nguyen, A., J. Yosinski, and J. Clune (2015) Deep neural networks are easily fooled. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 427–436.
- Shrivastava, A., A. Gupta, and R. Girshick (2016) Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 761–769.
- Wang, P. and N. Vasconcelos (2018) Towards realistic predictors. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 36–51.
- Zhang, C., S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2016) Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.
Appendix A Object Detection Results on PASCAL VOC 2007 with SSD
In this section we show results on the PASCAL VOC 2007 validation set using the Single Shot Multibox Detector (SSD) Liu et al. (2016). SSD uses a multi-task loss: a localization loss for bounding-box regression and a cross-entropy loss for class predictions. The cross-entropy loss can be divided into a loss for the positive examples (target objects) and a loss for the negative examples (background). We show results for each component of the multi-task loss, namely the localization, positive, and negative losses.
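Because the SSD loss decomposes into named components, the action can be accumulated per component rather than only for the total. A small sketch (the dictionary layout and component names are assumptions of this illustration):

```python
import numpy as np

def component_actions(loss_log):
    """loss_log maps a loss-component name (e.g. 'loc', 'pos', 'neg') to an
    (epochs, num_samples) array of per-sample losses. Returns one action
    score per sample for every component, so difficulty can be inspected
    separately for localization, positive, and negative classification."""
    return {name: losses.sum(axis=0) for name, losses in loss_log.items()}
```

Ranking samples within each component's action scores produces the per-loss visualizations shown in this appendix.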