Weakly Supervised Object Localization using Min-Max Entropy: an Interpretable Framework

Weakly supervised object localization (WSOL) models aim to locate objects of interest in an image after being trained only on data with coarse image-level labels. Deep learning models for WSOL typically rely on convolutional attention maps with no constraints on the regions of interest, which allows them to select any region and makes them vulnerable to false positive regions. This issue occurs in many application domains, e.g., medical image analysis, where interpretability is central to the prediction. To improve localization reliability, we propose a deep learning framework for WSOL with pixel-level localization. It is composed of two sequential sub-networks: a localizer that localizes regions of interest, followed by a classifier that classifies them. Within its end-to-end training, we incorporate the prior knowledge that, in a class-agnostic setup, an image is more likely to contain both relevant regions (the object of interest) and irrelevant regions (noise). Based on the conditional entropy (CE) measured at the classifier, the localizer is driven to spot relevant regions (low CE) and irrelevant regions (high CE). Our framework is able to recover large discriminative regions using a recursive erasing algorithm that we incorporate within backpropagation during training. Moreover, the framework intrinsically handles multiple instances. Experimental results on public datasets with medical images (GlaS colon cancer) and natural images (Caltech-UCSD Birds-200-2011) show that, compared to state-of-the-art WSOL methods, our framework can provide significant improvements in terms of image-level classification, pixel-level localization, and robustness to overfitting when dealing with few training samples. A public reproducible PyTorch implementation is provided at: https://github.com/sbelharbi/wsol-min-max-entropy-interpretability




1 Introduction

Object localization can be considered one of the most fundamental tasks for image understanding, as it provides crucial clues to challenging visual problems such as object detection or semantic segmentation. (Footnote 2: Object localization consists in isolating an object of interest by providing the coordinates of its surrounding bounding box. In this work, it is also understood as a task providing pixel-level segmentation of the object, which gives a more accurate localization. To avoid confusion when presenting the literature, we specify the case being considered.) Deep learning methods, and particularly convolutional neural networks (CNNs), are driving recent progress in these tasks. Nevertheless, despite their remarkable performance, a downside of these methods is the large amount of labeled data required for training; producing such labels is time consuming and prone to observer variability. To overcome this limitation, weakly supervised learning (WSL) has emerged recently as a surrogate for extensive annotation of training data zhou2017brief. WSL involves scenarios where training is performed with inexact or uncertain supervision. In the context of object localization or semantic segmentation, weak supervision typically comes in the form of image-level tags KERVADEC201988; kim2017two; pathak2015constrained; teh2016attention; wei2017object, scribbles Lin2016; ncloss:cvpr18, or bounding boxes Khoreva2017.

In WSOL, current state-of-the-art methods in object localization and semantic segmentation rely heavily on class activation maps produced by convolutional networks in order to localize regions of interest zhou2016learning, which can also be used as an interpretation of the model’s decision Zhang2018VisualInterp. Various approaches have been proposed in the WSOL field to alleviate the need for pixel-level annotation. We mention bottom-up methods, which rely on the input signal to locate the object of interest. Such methods include spatial pooling techniques over activation maps zhou2016learning; oquab2015object; sun2016pronet; zhang2018adversarial; durand2017wildcat, multi-instance learning ilse2018attention, and attend-and-erase based methods SinghL17; wei2017object; kim2017two; LiWPE018CVPR; pathak2015constrained. While such methods provide pixel-level localization, other methods, named weakly supervised object detectors, have been introduced to predict a bounding box instead bilen2016weakly; kantorov2016contextlocnet; tang2017multiple; wan2018min; shen2018generative. Inspired by human visual attention, top-down methods, which rely on the input signal and a selective backward signal to determine the corresponding object, were also proposed. They include special feedback layers cao2015look, backpropagation error zhang2018top, and Grad-CAM selvaraju2017grad; ChattopadhyaySH18wacv, which uses the gradient of the object class with respect to the activation maps.

Within a class-agnostic setup, an input image often contains the object of interest among other parts such as noise, background, and other irrelevant subjects. Most of the aforementioned methods do not consider this prior, and feed the entire image to the model. When this prior is ignored and the object of interest has a common shape/texture/color across images, the model may still be able to localize its most discriminative part easily oquab2015object. This is the case for natural images, for instance. However, when the object can appear in different and random shapes/structures, or has relatively similar texture/color to the irrelevant parts, the model may easily confuse the object with the irrelevant parts. This is mainly because the network is free to select any area of the image as a region of interest as long as the selected region reduces the classification loss. Such free selection can lead to many false positive regions and inconsistent localization. This issue can be further understood from the point of view of feature selection and sparsity tibshirani2015statistical. However, instead of selecting the relevant features, the model is required to select a set of pixels (i.e., raw features) representing the object of interest. Since the only constraint during such selection is to minimize the classification loss, without other priors or pixel-level supervision, the optimization may converge to a model that selects any random subset of pixels as long as the loss is minimized. This does not guarantee that the selected pixels represent an object, nor the correct object, nor even make sense to us. (Footnote 3: wan2018min argue that there is an inconsistency between the classification loss and the task of WSOL, and that the optimization may typically reach sub-optimal solutions with considerable randomness in them.) From an optimization perspective, it does not matter which set of pixels is selected (with respect to interpretability); what matters is obtaining a minimal loss. In practice, in deep WSOL, this often results in localizing the smallest (i.e., sparsest) common discriminative region of the object, such as a dog’s face for the object ’dog’ kim2017two; SinghL17; zhou2016learning. This makes sense, since localizing the dog’s face can be statistically sufficient to discriminate the object ’dog’ from other objects. Once such a region is located, the classification loss may reach its minimum, and the model then stops learning.

False positive regions can be problematic in critical domains such as medical applications, where interpretability plays a central role in trusting and understanding an algorithm’s prediction. To address this important issue, and motivated by the importance of using prior knowledge in learning to alleviate overfitting when training with few samples mitchell1980need; krupka2007incorporating; yu2007incorporating; sbelharbiarxivsep2017, we propose to use the aforementioned prior (i.e., an image is likely to contain relevant and irrelevant regions) in order to favor models that behave accordingly. To this end, we constrain the model to learn to localize both relevant and irrelevant image regions simultaneously in an end-to-end manner within a weakly supervised scenario, where only image-level labels are used for training. We model the relevant (i.e., discriminative) regions as the complement of the irrelevant (i.e., non-discriminative) regions (Fig.1). Our model is composed of two sub-models: (1) a localizer that aims at localizing regions of interest by predicting a latent mask, and (2) a classifier that aims at classifying the visible content of the input image through the latent mask. The localizer is trained, by employing the conditional entropy coverentropy2006, to simultaneously identify (1) relevant regions where the classifier has high confidence with respect to the image label, and (2) irrelevant regions where the classifier is unable to decide which image label to assign. This modeling allows the discriminative regions to pop out and be used to assign the corresponding image label, while suppressing non-discriminative areas, leading to more reliable predictions. In order to localize complete discriminative regions, we extend our proposal by training the localizer to recursively erase discriminative parts during training. To this end, we propose a recursive erasing algorithm that we incorporate within the backpropagation. At each recursion, and within the backpropagation, the algorithm localizes the most discriminative region, stores it, then erases it from the input image. At the end of the final recursion, the model has gathered a large extent of the object of interest, which is then fed to the classifier. Thus, our model is driven to localize complete relevant regions while discarding irrelevant regions, resulting in more reliable object localization. Moreover, since the discriminative parts are allowed to extend over different instances, the proposed model handles multiple instances natively.

The main benefit of predicting a mask (i.e., pixel-level localization) instead of a bounding box, which only gives a coarse localization, is precision. In some applications, such as in the medical domain, object localization may require high precision, for example when localizing cells, boundaries, and organs, which may have an unstructured shape and varying scale that a bounding box may severely misrepresent. In such cases, a pixel-level localization, as in our proposal, can be more useful.

Figure 1: Intuition of our proposal. A decidable region covers all the discriminative parts, while an undecidable region covers all the non-discriminative parts. See Sec.3 for notation.

The main contribution of this paper is a new deep learning framework for weakly supervised object localization at the pixel level. Our framework is composed of two sequential sub-networks, where the first one localizes regions of interest and the second one classifies them. Based on conditional entropy, the end-to-end training of this framework makes it possible to incorporate prior knowledge indicating that, in a class-agnostic setup, the image is more likely to contain relevant regions (object of interest) and irrelevant regions (noise, background). Given the conditional entropy measured at the classifier level, the localizer is driven to localize relevant regions (with low conditional entropy) and irrelevant regions (with high conditional entropy). Such localization is achieved with the main goal of providing more interpretable and reliable regions of interest. This paper also contributes a recursive erasing algorithm that is incorporated within backpropagation, along with a practical implementation, in order to obtain complete discriminative regions. Finally, we conduct an extensive series of experiments on two public image datasets (medical and natural scenes), where the results show the effectiveness of the proposed approach in terms of pixel-level localization while maintaining competitive accuracy for image-level classification.

2 Background on WSOL

In this section, we briefly review state-of-the-art WSOL methods that aim at localizing objects of interest using only image-level labels as supervision.

Fully convolutional networks with spatial pooling have been shown to be effective at localizing discriminative regions zhou2016learning; oquab2015object; sun2016pronet; zhang2018adversarial; durand2017wildcat. Multi-instance learning based methods have been used within an attention framework to localize regions of interest ilse2018attention. Since neural networks often provide only the small, most discriminative regions of the object of interest kim2017two; SinghL17; zhou2016learning, SinghL17 propose to randomly hide large patches in the training image in order to force the network to seek other discriminative regions and recover a larger part of the object of interest. wei2017object use the attention map of a trained network to erase the most discriminative part of the original image. kim2017two use a two-phase learning stage where they combine the attention maps of two networks to obtain a complete region of the object. LiWPE018CVPR propose a two-stage approach where the first network classifies the image and provides an attention map of the most discriminative parts. This attention is used to erase the corresponding parts of the input image, and the resulting erased image is then fed to a second network to make sure that there are no discriminative parts left.

Weakly supervised object detection methods have emerged as an approach for localizing regions of interest using bounding boxes instead of pixel-level masks. Such approaches rely on region proposals such as edge boxes zitnick2014edge and selective search van2011segmentation; uijlings2013selective. In teh2016attention, the content of each proposed region is passed through an attention module, then a scoring module to obtain an average image. bilen2016weakly propose an approach to address multi-class object localization. Many improvements of this work have been proposed since then kantorov2016contextlocnet; tang2017multiple. Other approaches rely on multi-stage training, where a network is trained to localize in the first stage and is then refined in later stages for object detection sun2016pronet; diba2017weakly; ge2018multi. In order to reduce the variance of the localization of the boxes, wan2018min propose to reduce an entropy defined on the position of such boxes. shen2018generative propose to use generative adversarial networks to generate the proposals in order to speed up inference, since most region proposal techniques are time consuming.

Inspired by human visual attention, top-down methods were proposed. In Simonyan14a; DB15a; zeiler2014ECCV, the backpropagation error is used to visualize saliency maps over the image for the predicted class. In cao2015look, an attention map is built to identify the class-relevant regions using a feedback layer. zhang2018top propose Excitation backprop, which passes top-down signals downwards in the network hierarchy through a probabilistic framework. Grad-CAM selvaraju2017grad generalizes CAM zhou2016learning using the derivative of the class scores with respect to each location on the feature maps, and has been further generalized in ChattopadhyaySH18wacv. In practice, top-down methods are considered visual explanatory tools, and they can be overwhelming in terms of computation and memory usage, even during inference.

While the aforementioned approaches have shown great success, mostly with natural images, they still lack a mechanism for modeling what is relevant and irrelevant within an image, which is crucial to determine the quality of the regions of interest in terms of reliability. Erase-based methods SinghL17; wei2017object; kim2017two; LiWPE018CVPR; pathak2015constrained follow such a concept, where the non-discriminative parts are suppressed through constraints, allowing only the discriminative ones to pop out. Explicitly modeling negative evidence within the model has been shown to be effective in WSOL PariziVZF14; Azizpour2015SpotlightTN; durand2016weldon; durand2017wildcat. Among the cited literature, SinghL17; wei2017object; kim2017two; LiWPE018CVPR combined with wan2018min is probably the closest work to our proposal. Our proposal can also be seen as a supervised dropout srivastava14a. While dropout, applied over the input image, zeroes out pixels randomly, our proposal seeks to zero out irrelevant pixels and keep only the discriminative ones that support the image label. In that sense, our proposal mimics a discriminative gate that inhibits irrelevant and noisy regions while allowing only informative and discriminative regions to pass through.

3 The min-max entropy framework for WSOL

3.1 Notations and definitions

Let us consider a set of training samples where is an input image with depth , height , and width ; a realization of the discrete random variable with support set ; is the image level label (i.e., image class), a realization of the discrete random variable with support set . We define a decidable region of an image as any informative part of the image that allows predicting the image label. (Footnote 4: In this context, the notion of region indicates one pixel.) An undecidable region is any noisy, uninformative, and irrelevant part of the image that does not provide any indication nor support for the image class. To model such definitions, we consider a binary mask where a location with value indicates a decidable region, otherwise it is an undecidable region. We model the decidability of a given location with a binary random variable . Its realization is , and its conditional probability over the input image is defined as follows,


We note a binary mask indicating the undecidable region, where . We consider the undecidable region as the complement of the decidable one. We can write: , where is the norm. Following such definitions, an input image can be decomposed into two images as , where is the Hadamard product. We note , and . inherits the image-level label of . We can write the pair in the same way as . We note by , and as the respective approximation of , and (Sec.3.3). We are interested in modeling the true conditional distribution where .

is its estimate. Following the previous discussion, predicting the image label depends only on the decidable region, i.e.,

. Thus, knowing does not add any knowledge to the prediction, since does not contain any information about the image label. This leads to: . As a consequence, the image label is conditionally independent of the undecidable region provided the decidable region Kollergraphical2009 : , where are the random variables modeling the decidable and the undecidable regions, respectively. In the following, we provide more details on how to exploit such conditional independence property in order to estimate and .
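The decomposition described above can be illustrated with a small sketch. It is not the authors' code: the names `hadamard`, `complement`, `mask_pos`, and `mask_neg` are hypothetical, the image is a toy single-channel grid, and the norm is assumed to count nonzero entries.

```python
# Toy illustration (hypothetical names): split an image into its decidable
# and undecidable parts using a binary mask and its complement.

def hadamard(image, mask):
    """Element-wise (Hadamard) product of two equally sized 2D grids."""
    return [[px * m for px, m in zip(row, mrow)] for row, mrow in zip(image, mask)]

def complement(mask):
    """Binary complement: 1 becomes 0 and 0 becomes 1."""
    return [[1 - m for m in row] for row in mask]

def add(a, b):
    """Element-wise sum of two equally sized 2D grids."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def count_nonzero(mask):
    """Number of nonzero entries (assumed meaning of the norm in the text)."""
    return sum(1 for row in mask for v in row if v != 0)

image = [[5, 3, 0], [2, 8, 1]]
mask_pos = [[1, 0, 0], [0, 1, 1]]   # 1 marks a decidable pixel
mask_neg = complement(mask_pos)     # 1 marks an undecidable pixel

x_pos = hadamard(image, mask_pos)   # decidable part of the image
x_neg = hadamard(image, mask_neg)   # undecidable part of the image
```

The two parts sum back to the original image, and the two masks together cover every pixel exactly once.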

3.2 Min-max entropy

We consider modeling the uncertainty of the model prediction over decidable, or undecidable regions using conditional entropy (CE). Let us consider the CE of , denoted and computed as coverentropy2006 ,


Since the model is required to be certain about its prediction over , we constrain the model to have low entropy over . Eq.2 reaches its minimum when the probability of one of the classes is certain, i.e., coverentropy2006 . Instead of directly minimizing Eq.2, and in order to ensure that the model predicts the correct image label, we cast a supervised learning problem using the cross-entropy between and using the image-level label of as a supervision,


Eq.3 reaches its minimum at the same conditions as Eq.2 with the true image label as a prediction. We note that Eq.3 is the negative log-likelihood of the sample . In the case of , we consider the CE of , denoted and computed as,


Over irrelevant regions, the model is required to be unable to decide which image class to predict, since there is no evidence to support any class. This can be seen as high uncertainty in the model decision. Therefore, we consider maximizing the entropy of Eq.4. The latter reaches its maximum at the uniform distribution coverentropy2006. Thus, the inability of the model to decide is reached since each class is equiprobable. An alternative to maximizing Eq.4 is to use a supervised target distribution, since it is already known (i.e., the uniform distribution). To this end, we consider as a uniform distribution,
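The two extremes of the entropy discussed above can be checked numerically. The following is a minimal sketch using the standard Shannon entropy with natural logarithms; the function name `entropy` and the example distributions are illustrative choices, not taken from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution.

    Terms with zero probability contribute nothing, following the usual
    convention 0 * log(0) = 0.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)

one_hot = [1.0, 0.0, 0.0]         # the model is certain about one class
peaked = [0.9, 0.05, 0.05]        # the model is fairly confident
uniform = [1 / 3, 1 / 3, 1 / 3]   # the model cannot decide

h_one_hot = entropy(one_hot)
h_peaked = entropy(peaked)
h_uniform = entropy(uniform)
```

The entropy is zero for the certain prediction and reaches its largest value, the logarithm of the number of classes, for the uniform one.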


and cast a supervised learning setup using a cross-entropy between and over ,


The minimum of Eq.6 is reached when is uniform, thus, Eq.4 reaches its maximum. Now, we can write the total training loss to be minimized as,


The posterior probability is modeled using a classifier with a set of parameters ; it can operate either on or . The binary mask (and ) is learned using another model with a set of parameters . In this work, both models are based on neural networks (fully convolutional networks LongSDcvpr15 in particular). The networks and can be seen as two parts of one single network that localizes regions of interest using a binary mask, then classifies their content. Fig.2 illustrates the entire model.
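As a rough sketch of how the two terms described in this section could be combined, the code below adds a cross-entropy against the true label (for the prediction over the decidable part) to a cross-entropy against the uniform distribution (for the prediction over the undecidable part). The unweighted sum, the function names, and the clipping constant `eps` are assumptions made for illustration; the paper's exact total loss may differ.

```python
import math

def cross_entropy(target, pred, eps=1e-12):
    """Cross-entropy between a target and a predicted distribution (nats)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, pred))

def min_max_entropy_loss(pred_pos, pred_neg, label):
    """Assumed unweighted sum of the two supervised terms.

    pred_pos: class probabilities predicted from the decidable part.
    pred_neg: class probabilities predicted from the undecidable part.
    label:    index of the true image class.
    """
    num_classes = len(pred_pos)
    one_hot = [1.0 if k == label else 0.0 for k in range(num_classes)]
    uniform = [1.0 / num_classes] * num_classes
    # Low-entropy term: be confident and correct on the decidable part.
    certain_term = cross_entropy(one_hot, pred_pos)
    # High-entropy term: be undecided on the undecidable part.
    undecided_term = cross_entropy(uniform, pred_neg)
    return certain_term + undecided_term

uniform3 = [1 / 3, 1 / 3, 1 / 3]
# Desired behavior: confident on the relevant part, undecided on the rest.
good = min_max_entropy_loss([0.98, 0.01, 0.01], uniform3, label=0)
# Undesired behavior: unsure on the relevant part, confident on the rest.
bad = min_max_entropy_loss([0.4, 0.3, 0.3], [0.98, 0.01, 0.01], label=0)
```

The second term is smallest when the prediction over the undecidable part is exactly uniform, where it equals the logarithm of the number of classes.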

Figure 2: Our proposed method. The recursive mining of the discriminative parts is performed only during training (See Sec.3.4).

Due to the depth of , receives its supervised gradient based only on the error made by . In order to boost the supervised gradient at , and provide it with more hints to be able to select the most discriminative regions with respect to the image class, we propose to use a secondary classification task at the output of to classify the input , following lee15apmlr . computes the posterior probability which is another estimate of . To this end, is trained to minimize the cross-entropy between and ,


The total training loss to minimize is formulated as,


3.3 Mask computation

The mask is computed using the last feature maps of , which contain highly abstract discriminative activations. We denote such feature maps by a tensor that contains a spatial map for each class. is computed by aggregating the spatial activation of all the classes as,


where is the continuous downsampled version of , and is the feature map of the class of the input . At convergence, the posterior probability of the winning class is pushed toward while the rest is pushed down to . This leaves only the feature map of the winning class. is upscaled using interpolation to which has the same size as the input , then pseudo-thresholded using a sigmoid function to obtain a pseudo-binary


where is a constant scalar that ensures that the sigmoid is approximately equal to when is larger than , and approximately equal to otherwise.

Footnote 5: In most neural network libraries (PyTorch (pytorch.org), Chainer (chainer.org)), upscaling operations using interpolation/upsampling have a non-deterministic backward pass. This makes training unstable due to the non-deterministic gradient, and makes reproducibility impossible. To avoid such issues, we detach the upscaling operation from the training graph and consider it as input data for .
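The mask computation described in this section can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the per-class maps are assumed to be combined by a posterior-weighted sum, the upscaling step is omitted, and the names `aggregate_maps`, `pseudo_binarize`, `scale`, and `threshold` (with their default values) are hypothetical.

```python
import math

def aggregate_maps(class_maps, posteriors):
    """Weighted sum of per-class spatial maps (assumed aggregation rule).

    class_maps: list of 2D grids, one per class.
    posteriors: class probabilities used as weights.
    """
    height, width = len(class_maps[0]), len(class_maps[0][0])
    return [
        [sum(p * cmap[r][z] for p, cmap in zip(posteriors, class_maps))
         for z in range(width)]
        for r in range(height)
    ]

def pseudo_binarize(t_map, scale=10.0, threshold=0.5):
    """Soft thresholding with a sigmoid: values well above `threshold` map
    close to 1, values well below it map close to 0. The defaults are
    illustrative only."""
    return [[1.0 / (1.0 + math.exp(-scale * (v - threshold))) for v in row]
            for row in t_map]

class_maps = [
    [[1.0, 0.0], [0.0, 0.0]],   # map of class 0: active at the top-left
    [[0.0, 0.0], [0.0, 1.0]],   # map of class 1: active at the bottom-right
]
posteriors = [0.9, 0.1]         # class 0 is the winning class

t_map = aggregate_maps(class_maps, posteriors)
mask = pseudo_binarize(t_map)
```

Because the winning class dominates the weights, only its active location survives the soft threshold; the sigmoid keeps the operation differentiable, unlike a hard threshold.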

3.4 Object completeness using incremental recursive erasing and trust coefficients

Object classification methods tend to rely on small discriminative regions kim2017two; SinghL17; zhou2016learning. Thus, may still contain discriminative parts. Following SinghL17; kim2017two; LiWPE018CVPR; pathak2015constrained, and in particular wei2017object, we propose an incremental and recursive erasing approach, applied during learning, that drives to seek complete discriminative regions. However, in contrast to wei2017object, where such mining is done offline, we propose to incorporate the erasing within the backpropagation using an efficient and practical implementation. This allows to learn to seek discriminative parts. Therefore, erasing during inference is unnecessary. Our approach consists in applying recursively before applying within the same forward pass. The aim of the recursion, with maximum depth , is to mine more discriminative parts within the non-discriminative regions of the image masked by . We accumulate all discriminative parts in a temporal mask . At each recursion, we mine the most discriminative part that has been correctly classified by , and accumulate it in . However, as increases, the image may run out of discriminative parts. Thus, is forced, unintentionally, to consider non-discriminative parts as discriminative. To alleviate this risk, we introduce trust coefficients that control how much we trust a mined discriminative region at each step of the recursion for each sample as follows,


where computes the trust of the current mask of the sample at the step as follows,


where encodes the overall trust with respect to the current step of the recursion. Such trust is expected to decrease with the depth of the recursion bel16. controls the slope of the trust function. The second part of Eq.13 is computed with respect to each sample. It quantifies how much we trust the estimated mask for the current sample ,


In Eq.14, is computed over . Eq.14 ensures that at a step , for a sample , the current mask is trusted only if the correctly classifies the erased image, and does not increase the loss. The first condition ensures that the accumulated discriminative regions belong to the same class and, more importantly, to the true class. Moreover, it ensures that does not change its class prediction through the erasing process. This introduces consistency between the mined regions across the steps and avoids mixing discriminative regions of different classes. The second condition ensures maintaining, at least, the same confidence in the predicted class compared to the first forward pass without erasing (). The given trust in this case is equal to the probability of the true class. The regions accumulator is initialized to zero at each forward pass in . is not maintained through epochs; starts over each time the sample is processed. This prevents accumulating incorrect regions that may occur at the beginning of the training. In order to decide automatically when to stop erasing, we consider a maximum depth of the recursion . For a mini-batch, we keep erasing as long as we have not reached steps of erasing and there is at least one sample with a non-zero trust coefficient (Eq.14). Once a sample is assigned a zero trust coefficient, it is kept at zero for the rest of the erasing (Eq.12, Fig.4). Direct implementation of Eq.12 is not practical, since performing a recursive computation on a large model requires a large amount of memory that increases with the depth . To avoid this issue, we propose a practical implementation using gradient accumulation at through the loss Eq.8; such an implementation requires the same memory size as in the case without erasing (Alg.1). We provide more details in the supplementary material (Sec.A.1).
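The accumulation logic described in this section can be sketched for a single sample as below. This is a simplified illustration rather than the authors' algorithm: the networks are replaced by precomputed per-step outputs, the step-wise trust is assumed to decay exponentially with the step index, masks are merged with an element-wise maximum, and all names (`sample_trust`, `accumulate_regions`, `slope`, the dictionary keys) are hypothetical.

```python
import math

def sample_trust(prob_true, predicted, label, loss, base_loss):
    """Per-sample trust: the probability of the true class if the erased
    image is still classified correctly and the loss did not increase,
    otherwise zero."""
    if predicted == label and loss <= base_loss:
        return prob_true
    return 0.0

def accumulate_regions(step_outputs, label, base_loss, slope=10.0):
    """Accumulate mined masks over erasing steps for one sample.

    step_outputs: list of dicts with keys "mask", "prob_true", "predicted",
                  and "loss", one per erasing step (hypothetical layout).
    """
    rows, cols = len(step_outputs[0]["mask"]), len(step_outputs[0]["mask"][0])
    accumulated = [[0.0] * cols for _ in range(rows)]  # reset at each forward
    for step, out in enumerate(step_outputs, start=1):
        gamma = sample_trust(out["prob_true"], out["predicted"], label,
                             out["loss"], base_loss)
        if gamma == 0.0:
            break  # once the trust is zero, it stays zero for this sample
        # Assumed decay of the overall trust with the recursion depth.
        psi = math.exp(-step / slope) * gamma
        accumulated = [
            [max(a, psi * m) for a, m in zip(arow, mrow)]
            for arow, mrow in zip(accumulated, out["mask"])
        ]
    return accumulated

steps = [
    {"mask": [[1, 0, 0]], "prob_true": 0.9, "predicted": 2, "loss": 0.10},
    {"mask": [[0, 1, 0]], "prob_true": 0.8, "predicted": 2, "loss": 0.20},
    {"mask": [[0, 0, 1]], "prob_true": 0.3, "predicted": 0, "loss": 1.20},
]
acc = accumulate_regions(steps, label=2, base_loss=0.25)
```

In this toy run the third step is misclassified, so its region is never added, and earlier steps receive more weight than later ones.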

4 Results and analysis

Our experiments focused simultaneously on classification and object localization tasks. Thus, we consider datasets that provide image-level and pixel-level labels for evaluation on both tasks. In particular, the following two datasets were considered: GlaS in the medical domain, and CUB-200-2011 for natural scene images. (1) The GlaS dataset was provided in the 2015 Gland Segmentation in Colon Histology Images Challenge Contest (GlaS: warwick.ac.uk/fac/sci/dcs/research/tia/glascontest) sirinukunwattana2017gland. The main task of the challenge is gland segmentation of microscopic images. However, image-level labels were provided as well. The dataset is composed of 165 images derived from 16 Hematoxylin and Eosin (H&E) histology sections of two grades (classes): benign and malignant. It is divided into 85 samples for training and 80 samples for test. Images have a high variation in terms of gland shape/size and overall H&E stain. In this dataset, the glands are the regions of interest that pathologists use to grade the image as benign or malignant. (2) The CUB-200-2011 dataset (CUB-200-2011: www.vision.caltech.edu/visipedia/CUB-200-2011.html) WahCUB2002011 is a dataset for bird species with samples and species. For the sake of evaluation, and due to time limitations, we randomly selected 5 species and built a small dataset with samples for training, and for test; it is referred to in this work as CUB5. In this dataset, the objects of interest are the birds. In both datasets, we randomly take of the training samples for effective training, and for validation to perform early stopping. We provide the splits used and the deterministic code that generated them for both datasets.

In all the experiments, image-level labels are used during training/evaluation, while pixel-level labels are used exclusively during evaluation. The evaluation is conducted at two levels: at the image level, where the classification error is reported, and at the pixel level, where we report the F1 score (Dice index) over the foreground (object of interest), referred to as F1+. When dealing with binary data, the F1 score is equivalent to the Dice index. We also report the F1 score over the background, referred to as F1−, in order to measure how well the model is able to identify irrelevant regions. We compare our method to different WSOL methods. These methods use a similar pre-trained backbone (resnet18 [13]) for feature extraction and differ mainly in the final pooling layer: CAM-Avg uses average pooling [53], CAM-Max uses max-pooling [26], CAM-LSE uses an approximation to the maximum [38, 29], Wildcat uses the pooling in [9], and Deep MIL is the work of [14] with adaptation to multi-class. We use supervised segmentation with U-Net [30] as an upper bound of the performance for pixel-level evaluation (Full sup.). As a basic baseline, we use a mask full of 1 of the same size as the image as a constant prediction of the objects of interest, to show that F1+ is not sufficient on its own to evaluate pixel-level localization, particularly over the GlaS set (All-ones, see Tab.1). In our method, the localizer and the classifier share the same pre-trained backbone (resnet101 [13]) to avoid overfitting, while using the pooling of [9] as a pooling function. All methods are trained using stochastic gradient descent with momentum. In our approach, we used the same hyper-parameters over both datasets, while other methods required adaptation to each dataset. We provide reproducible code (https://github.com/sbelharbi/wsol-min-max-entropy-interpretability), the dataset splits, more experimental details, and visual results in the supplementary material (Sec.B).
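To make the difference between the pooling layers of the compared CAM baselines concrete, the following minimal sketch pools a flat list of spatial scores with average, max, and log-sum-exp (LSE) pooling. It is an illustration only, not the released implementation: the function names and the sharpness parameter `r` are assumptions, and the Wildcat and Deep MIL poolings are not shown.

```python
import math


def avg_pool(scores):
    """Average pooling over a flat list of spatial scores."""
    return sum(scores) / len(scores)


def max_pool(scores):
    """Max pooling: only the strongest location contributes."""
    return max(scores)


def lse_pool(scores, r=10.0):
    """Log-sum-exp pooling: (1/r) * log(mean(exp(r * s))).

    A smooth approximation to the maximum. Large r approaches max pooling,
    r close to 0 approaches average pooling. The scores are shifted by
    their maximum for numerical stability.
    """
    m = max(scores)
    mean_exp = sum(math.exp(r * (s - m)) for s in scores) / len(scores)
    return m + math.log(mean_exp) / r


scores = [0.1, 0.2, 0.9, 0.3]  # toy class-activation scores (hypothetical)
print(avg_pool(scores), lse_pool(scores, r=10.0), max_pool(scores))
```

For any positive `r`, the LSE score lies between the average and the maximum, which is why it is described as an approximation to the maximum.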

A comparison of the results obtained by the different methods, over both datasets, is presented in Tab.1, with visual results illustrated in Fig.3. Tab.2 shows the impact of using our recursive erasing algorithm to mine discriminative regions. From Tab.2, we can observe that using our recursive algorithm adds a large improvement in F1+ (foreground) without degrading F1− (background). This means that the recursion allows the model to correctly localize larger portions of the object of interest without including false positive regions. In Tab.1, compared to other WSOL methods, our method obtains a relatively similar F1+ score, while it obtains a large F1− over GlaS, where it may be easy to obtain a high F1+ by predicting a mask full of 1 (Fig.3). However, a model needs to be very selective to obtain a high F1− score, that is, to localize tissues (irrelevant regions), which is where our model seems to excel. The CUB5 set seems to be more challenging due to the variable size (from small to big) of the birds, their view, the context/surrounding environment, and the few training samples. Our model outperforms all the WSOL methods in both F1+ and F1− by a large gap, due mainly to its ability to discard non-discriminative regions, which leaves it only with the region of interest, in this case, the bird. While our model shows improvements in localization, it is still far behind full supervision. In terms of classification, all methods obtained a low error over GlaS, which implies that it is an easy set for classification. However, and surprisingly, the other methods seem to overfit over CUB5, while our model shows robustness. The results obtained over both datasets demonstrate, compared to WSOL methods, the effectiveness of our approach in terms of image classification and object localization, with more reliable object localization.
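The behaviour of the All-ones baseline can be checked with a few lines of code. The sketch below computes the Dice index over the foreground and over the background for flat binary masks; the helper names, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions made for this illustration.

```python
def dice(pred, target):
    """Dice index (F1 score) between two flat binary masks of equal length."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumption: two empty masks are counted as a perfect match.
    return 2.0 * inter / total if total > 0 else 1.0


def f1_foreground_background(pred, target):
    """Return (F1 over foreground, F1 over background)."""
    f1_fg = dice(pred, target)
    f1_bg = dice([1 - p for p in pred], [1 - t for t in target])
    return f1_fg, f1_bg


target = [1, 1, 1, 1, 1, 1, 0, 0]     # toy mask where the object covers 75%
all_ones = [1] * len(target)          # constant "everything is object" mask
selective = [1, 1, 1, 1, 1, 0, 0, 0]  # misses one object pixel, no false positive

print(f1_foreground_background(all_ones, target))
print(f1_foreground_background(selective, target))
```

When the object covers most of the image, as is often the case for glands, the constant mask reaches a high foreground score but a background score of zero, which is why both scores are reported.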

The visual quality of our approach (Fig.3) shows that the predicted regions of interest over GlaS agree with the methodology doctors follow for colon cancer diagnosis, where the glands are used as a diagnostic tool. Additionally, the results show the ability of our model to deal with multi-instances when there are multiple glands within the image. Over CUB5, our model succeeds in localizing the bird in order to predict its category, which is what one may do in such a task. We notice that the head, chest, tail, or particular spots on the body are often the parts used by our model to decide the bird's species, which seems a reasonable strategy as well.

Image level Pixel level
Method Error (%) F1+ (%) F1− (%)
All-ones N/A N/A
CAM-Avg [53]
CAM-Max [26]
CAM-LSE [38, 29]
Wildcat [9]
Deep MIL [14]
Ours ()
Full sup.: U-Net [30] N/A N/A
Table 1: Performance over GlaS and CUB-200-2011 (CUB5) test sets.
Image level Pixel level
Ours Error (%) F1+ (%) F1− (%)
Table 2: Impact of our incremental recursive erasing algorithm over the performance over GlaS and CUB-200-2011 (CUB5) test sets.
Figure 3: Visual comparison of the predicted binary mask of each method over GlaS and CUB-200-2011 (CUB5) test sets. (Best visualized in color.) (See supplementary material for more samples.)

5 Conclusion

In this work, we have presented a novel approach for WSOL where we constrain the model to learn relevant and irrelevant regions. Evaluated on two datasets, and compared to state of the art WSOL methods, our approach showed its effectiveness in correctly localizing objects of interest with small false positive regions while maintaining a competitive classification error. This makes our approach more reliable in terms of interpretability. As future work, we consider extending our approach to handle multiple classes within the image. Different constraints can be applied over the predicted mask, such as texture properties, shape, or other region constraints. However, this requires the mask to be differentiable with respect to the model's parameters in order to train the network using such constraints. Predicting bounding boxes instead of heat maps is considered as well, since they can be more suitable in some applications where pixel-level accuracy is not required.

We discussed in Sec.B.3 a fundamental issue in erasing-based algorithms that we noticed when applying our approach over the CUB5 dataset. We arrived at the conclusion that such algorithms lack the ability to remember the location of the already mined regions of interest, which can be problematic when there is only one instance in the image with only a small discriminative region. This can easily prevent recovering the complete discriminative region, since the rest of the regions may not be discriminative enough to be spotted, as in the case of birds once the head is already erased. Assisting erasing algorithms with a memory-like mechanism, or with spatial information about the previously mined discriminative regions, may drive the network to seek discriminative regions around the previously spotted regions, since the parts of an object of interest are often closely located. Potentially, this may allow the model to spot a large portion of the object of interest in this case.


This work was partially supported by the Natural Sciences and Engineering Research Council of Canada and the Canadian Institutes of Health Research.


  • [1] H. Azizpour, M. Arefiyan, S. Naderi Parizi, and S. Carlsson. Spotlight the negatives: A generalized discriminative latent model. In BMVC 2015.
  • [2] S. Belharbi, C. Chatelain, R. Hérault, and S. Adam. Neural networks regularization through class-wise invariant representation learning. arXiv preprint arXiv:1709.01867, 2017.
  • [3] S. Belharbi, R. Hérault, C. Chatelain, and S. Adam. Deep multi-task learning with evolving weights. In ESANN 2016.
  • [4] H. Bilen and A. Vedaldi. Weakly supervised deep detection networks. In CVPR 2016.
  • [5] C. Cao, X. Liu, Y. Yang, Y. Yu, J. Wang, Z. Wang, Y. Huang, L. Wang, C. Huang, W. Xu, et al. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In ICCV 2015.
  • [6] A. Chattopadhyay, A. Sarkar, P. Howlader, and V. N. Balasubramanian. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In WACV 2018.
  • [7] T. M. Cover and J. A. Thomas. Elements of Information Theory. 2006.
  • [8] A. Diba, V. Sharma, A. M. Pazandeh, H. Pirsiavash, and L. Van Gool. Weakly supervised cascaded convolutional networks. In CVPR 2017.
  • [9] T. Durand, T. Mordan, N. Thome, and M. Cord. Wildcat: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation. In CVPR 2017.
  • [10] T. Durand, N. Thome, and M. Cord. Weldon: Weakly supervised learning of deep convolutional neural networks. In CVPR 2016.
  • [11] W. Ge, S. Yang, and Y. Yu. Multi-evidence filtering and fusion for multi-label classification, object detection and semantic segmentation based on weakly supervised learning. In CVPR 2017.
  • [12] G. Ghiasi, T.-Y. Lin, and Q. V. Le. Dropblock: A regularization method for convolutional networks. In NIPS 2018.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR 2016.
  • [14] M. Ilse, J. M. Tomczak, and M. Welling. Attention-based deep multiple instance learning. arXiv preprint arXiv:1802.04712, 2018.
  • [15] V. Kantorov, M. Oquab, M. Cho, and I. Laptev. Contextlocnet: Context-aware deep network models for weakly supervised localization. In ECCV 2016.
  • [16] H. Kervadec, J. Dolz, M. Tang, E. Granger, Y. Boykov, and I. Ben Ayed. Constrained-CNN losses for weakly supervised segmentation. MedIA 2019.
  • [17] A. Khoreva, R. Benenson, J.H. Hosang, M. Hein, and B. Schiele. Simple does it: Weakly supervised instance and semantic segmentation. In CVPR 2017.
  • [18] D. Kim, D. Cho, D. Yoo, and I. So Kweon. Two-phase learning for weakly supervised object localization. In ICCV 2017.
  • [19] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning.
  • [20] E. Krupka and N. Tishby. Incorporating prior knowledge on features into learning. In Artificial Intelligence and Statistics, 2007.
  • [21] C. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-Supervised Nets. In ICAIS 2015.
  • [22] K. Li, Z. Wu, K.-C. Peng, J. Ernst, and Y. Fu. Tell me where to look: Guided attention inference network. In CVPR 2018.
  • [23] D. Lin, J. Dai, J. Jia, K. He, and J. Sun. Scribblesup: Scribble-supervised convolutional networks for semantic segmentation. In CVPR 2016.
  • [24] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR 2015.
  • [25] T.M. Mitchell. The need for biases in learning generalizations. Department of Computer Science, Laboratory for Computer Science Research, 1980.
  • [26] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free?-weakly-supervised learning with convolutional neural networks. In CVPR 2015.
  • [27] S. Naderi Parizi, A. Vedaldi, A. Zisserman, and P. F. Felzenszwalb. Automatic discovery and optimization of parts for image classification. In ICLR 2015.
  • [28] D. Pathak, P. Krahenbuhl, and T. Darrell. Constrained convolutional neural networks for weakly supervised segmentation. In ICCV 2015.
  • [29] P. H. O. Pinheiro and R. Collobert. From image-level to pixel-level labeling with convolutional networks. In CVPR 2015.
  • [30] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI 2015.
  • [31] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, et al. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV 2017.
  • [32] Y. Shen, R. Ji, S. Zhang, W. Zuo, Y. Wang, and F. Huang. Generative adversarial learning towards fast weakly supervised detection. In CVPR 2018.
  • [33] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ICLRw 2014.
  • [34] K. K. Singh and Y. J. Lee. Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In ICCV 2017.
  • [35] K. Sirinukunwattana, J. P. Pluim, H. Chen, X. Qi, P.-A. Heng, Y. B. Guo, L. Y. Wang, B. J. Matuszewski, E. Bruni, U. Sanchez, et al. Gland segmentation in colon histology images: The glas challenge contest. MIA 2017.
  • [36] J.T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. In ICLRw 2015.
  • [37] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR 2014.
  • [38] C. Sun, M. Paluri, R. Collobert, R. Nevatia, and L. Bourdev. Pronet: Learning to propose object-specific boxes for cascaded neural networks. In CVPR 2016.
  • [39] M. Tang, A. Djelouah, F. Perazzi, Y. Boykov, and C. Schroers. Normalized Cut Loss for Weakly-supervised CNN Segmentation. In CVPR 2018.
  • [40] P. Tang, X. Wang, X. Bai, and W. Liu. Multiple instance detection network with online instance classifier refinement. In CVPR 2017.
  • [41] E. W. Teh, M. Rochan, and Y. Wang. Attention networks for weakly supervised object localization. In BMVC 2016.
  • [42] R. Tibshirani, M. Wainwright, and T. Hastie. Statistical learning with sparsity: the lasso and generalizations. Chapman and Hall/CRC, 2015.
  • [43] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. IJCV 2013.
  • [44] K. E. Van de Sande, J. R. Uijlings, T. Gevers, and A. W. Smeulders. Segmentation as selective search for object recognition. In ICCV 2011.
  • [45] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical report, California Institute of Technology, 2011.
  • [46] F. Wan, P. Wei, J. Jiao, Z. Han, and Q. Ye. Min-entropy latent model for weakly supervised object detection. In CVPR 2018.
  • [47] Y. Wei, J. Feng, X. Liang, M.-M. Cheng, Y. Zhao, and S. Yan. Object region mining with adversarial erasing: A simple classification to semantic segmentation approach. In CVPR 2017.
  • [48] T. Yu, T. Jan, S. Simoff, and J. Debenham. Incorporating prior domain knowledge into inductive machine learning. Unpublished doctoral dissertation Computer Sciences, 2007.
  • [49] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV 2014.
  • [50] J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff. Top-down neural attention by excitation backprop. IJCV 2018.
  • [51] Q.-s. Zhang and S.-c. Zhu. Visual interpretability for deep learning: a survey. FITEE 2018.
  • [52] X. Zhang, Y. Wei, J. Feng, Y. Yang, and T. Huang. Adversarial complementary learning for weakly supervised object localization. In CVPR 2018.
  • [53] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In CVPR 2016.
  • [54] Z.-H. Zhou. A brief introduction to weakly supervised learning. NSR 2017.
  • [55] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV 2014.

Appendix A The min-max entropy framework for WSOL

A.1 Object completeness using incremental recursive erasing and trust coefficients

In this section, we present an illustration of our proposed recursive erasing algorithm (Fig.4). Alg.1 illustrates our implementation using accumulated gradients through the backpropagation within the localizer. We note that such erasing is performed only during training.

Figure 4: Illustration of the implementation of the proposed recursive incremental mining of discriminative parts within the backpropagation. The recursive mining is performed only during training (See Sec.3.4 for notations and more details).
1:Input: .
2:Initialization: .
3:for  do
4:     Initialization: , stop = False.
5:     Make a copy of : .
6:     # Perform the recursion. Accumulate gradients, and masks.
7:     while  and stop is False do
8:         Forward in .
9:         Compute .
10:         if  then
11:              Update accumulative mask . (Eq.12)
12:              Accumulate gradient:
13:              Erase the discriminative parts: .
14:         else
15:              stop = True.
16:         end if
17:     end while
18:     Compute: .
19:     Forward in .
20:     Compute: .
21:     Update the total gradient: .
22:end for
23:Normalize total gradient: . Update using .
24:Output: updated.
Algorithm 1 Practical implementation of our incremental recursive erasing approach during training for one epoch (or one mini-batch) using gradient accumulation.
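The control flow of the inner while-loop of Alg.1 (mine a region, test whether it is trusted, accumulate the mask and the gradient, erase, repeat) can be sketched on toy data. This is a hedged illustration and not the released implementation: `localize`, `confidence`, and `grad` are hypothetical stand-ins for the localizer forward pass, the trust test, and the backpropagated gradient, the element-wise maximum is only an assumed form of the mask accumulation (the actual update is given by Eq.12), and the gradient is reduced to a scalar.

```python
def recursive_erasing(x, localize, confidence, grad, max_steps=3, trust=0.5):
    """Toy version of the inner loop: mine, accumulate, erase."""
    x_cur = list(x)             # working copy of the sample
    acc_mask = [0.0] * len(x)   # accumulated mask over the recursion
    acc_grad = 0.0              # accumulated (scalar) gradient
    steps = 0
    while steps < max_steps:
        mask = localize(x_cur)                  # forward pass (stand-in)
        if confidence(x_cur, mask) < trust:     # mined region is not trusted
            break
        # Assumed accumulation rule: element-wise maximum of the masks.
        acc_mask = [max(a, m) for a, m in zip(acc_mask, mask)]
        acc_grad += grad(x_cur, mask)           # gradient accumulation
        # Erase the mined region before the next recursion step.
        x_cur = [v * (1.0 - m) for v, m in zip(x_cur, mask)]
        steps += 1
    return acc_mask, acc_grad, steps


# Hypothetical stand-ins operating on a flat list of "evidence" values.
def localize(x):
    top = max(x)
    return [1.0 if top > 0 and v == top else 0.0 for v in x]


def confidence(x, mask):
    return sum(v * m for v, m in zip(x, mask))


def grad(x, mask):
    return 1.0  # placeholder for a backpropagated gradient


evidence = [0.9, 0.1, 0.7, 0.2, 0.6]
print(recursive_erasing(evidence, localize, confidence, grad, max_steps=5))
```

In this toy run, the loop stops by itself once the remaining evidence falls below the trust level, so weak regions are never added to the accumulated mask.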

Appendix B Results and analysis

In this section, we provide more details on our experiments and analysis, and discuss some of the drawbacks of our approach. We took many precautions to make the code of our model reproducible up to PyTorch's terms of reproducibility. Please see the README.md file for the concerned section with the code (https://github.com/sbelharbi/wsol-min-max-entropy-interpretability). We checked reproducibility up to a precision of . All our experiments were conducted using the seed . We ran all our experiments over one GPU with 12GB of memory (our code supports multi-GPU, and Batchnorm synchronization with our support for reproducibility), and an environment with 10GB of RAM. Finally, this section shows more visual results, analysis, training time, and drawbacks. A link to download all the predictions with high resolution over both test sets is provided.

B.1 Datasets

We provide in Fig.5 some samples from each dataset’s test set along with their mask that indicates the object of interest.

Figure 5: Top row: GlaS dataset: test set examples of different classes with the gland segmentation. The decidable regions are the glands while the undecidable regions are the leftover tissues. Glands have different shapes, size, context. They can be multi-instance. Images have variable H&E stain. [35]. Bottom row: CUB-200-2011 (CUB5) dataset: test set examples of the 5 different classes. The decidable regions are the birds while the undecidable regions are the leftover surrounding environment. Birds have different sizes, position/view, appearance, context. [45] (Best visualized in color.)

As mentioned in Sec.4, due to time constraints, we consider a subset of the original CUB-200-2011 dataset, which we refer to as CUB5. To build it, we randomly select 5 classes from the original dataset. Then, we pick all the corresponding samples of each class in the provided train and test sets to build our train and test sets (CUB5). Then, we build the effective train set, and validation set by taking randomly , and the left from the train set of CUB5, respectively. We provide the splits, and the code used to generate them. Our code generates the following classes:

  1. 019.Gray_Catbird

  2. 099.Ovenbird

  3. 108.White_necked_Raven

  4. 171.Myrtle_Warbler

  5. 178.Swainson_Warbler
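The construction described above (random selection of 5 classes, then a random train/validation split with a fixed seed) can be sketched as follows. This is an illustration only and will not reproduce the classes listed above: the function name, the seed, and in particular `valid_fraction=0.2` are assumptions, since the actual proportions are given by the released splits.

```python
import random


def build_subset(samples, n_classes=5, valid_fraction=0.2, seed=0):
    """Pick n_classes at random, then split their samples into train/valid.

    samples: list of (sample_id, class_name) pairs from the provided train set.
    valid_fraction and seed are hypothetical values used for illustration.
    """
    rng = random.Random(seed)                        # deterministic generator
    classes = sorted({c for _, c in samples})
    chosen = sorted(rng.sample(classes, n_classes))  # random class selection
    subset = [s for s in samples if s[1] in chosen]
    rng.shuffle(subset)
    n_valid = int(round(valid_fraction * len(subset)))
    return chosen, subset[n_valid:], subset[:n_valid]


# Toy data: 10 classes with 10 samples each (hypothetical identifiers).
toy = [(f"img_{c}_{i}", f"class_{c:02d}") for c in range(10) for i in range(10)]
chosen, train, valid = build_subset(toy)
print(chosen, len(train), len(valid))
```

Using a dedicated `random.Random(seed)` instance keeps the split independent of any other random state in the program, which is one simple way to make the generated splits repeatable.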

B.2 Experiments setup

The following is the configuration we used for our model:

1. Patch size (hxw): . (for training sample patches, however, for evaluation, use the entire input image). 2. Augment patch using random rotation, horizontal/vertical flipping. (for CUB5 only horizontal flipping is performed). 3. Channels are normalized using mean and standard deviation. 4. For GlaS: patches are jittered using brightness=, contrast=, saturation=, hue=.

Pretrained resnet101 [13] as a backbone with [9] as a pooling score with our adaptation, using modalities per class. We consider using dropout [37] (with value over GlaS and over CUB5, over the final map of the pooling function right before computing the score). High dropout is motivated by [34, 12]. This allows the most discriminative parts to be dropped at the features with the most abstract representation. The dropout is not performed over the final mask, but only on the internal mask of the pooling function. As for the parameters of [9], we consider their since most negative evidence is dropped, and use .

For evaluation, our predicted mask is binarized using a threshold to obtain exactly a binary mask. All the masks presented in this work follow this thresholding. Our F1+ and F1− are computed over this binary mask.
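As a small illustration of this thresholding step, the sketch below turns a soft mask with values in [0, 1] into a binary mask. The threshold value of 0.5 and the function name are assumptions made for the example.

```python
def binarize(mask, threshold=0.5):
    """Map a soft mask (values in [0, 1]) to a binary mask of 0s and 1s."""
    # threshold=0.5 is a hypothetical default, used only for illustration.
    return [1 if v >= threshold else 0 for v in mask]


soft_mask = [0.05, 0.5, 0.93, 0.49]
print(binarize(soft_mask))
```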

1. Stochastic gradient descent, with momentum , with Nesterov. 2. Weight decay of over the weights. 3. Learning rate of decayed by each epochs with minimum value of . 4. Maximum epochs of 400. 5. Batch size of . 6. Early stopping over validation set using classification error as a stopping criterion.
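The learning rate schedule in item 3 (a rate decayed by a constant factor every fixed number of epochs, with a floor) can be written as a one-line function. All numeric defaults below are hypothetical placeholders and not the values used in our experiments.

```python
def step_lr(epoch, base_lr=0.001, gamma=0.1, step=40, min_lr=1e-7):
    """Step decay with a lower bound.

    The rate is multiplied by gamma every `step` epochs and never goes
    below min_lr. All default values are hypothetical.
    """
    return max(base_lr * gamma ** (epoch // step), min_lr)


for epoch in (0, 39, 40, 80, 399):
    print(epoch, step_lr(epoch))
```

With a maximum of 400 epochs, the floor prevents the learning rate from shrinking to values where the updates would have no practical effect.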

Other WSOL methods use the following setup with respect to each dataset:


1. Patch size (hxw): . 2. Augment patch using random horizontal flip. 3. Random rotation of one of: (degrees). 4. Patches are jittered using brightness=, contrast=, saturation=, hue=.
1. Pretrained resnet18 [13] as a backbone.
1. Stochastic gradient descent, with momentum , with Nesterov. 2. Weight decay of over the weights. 3. 160 epochs 4. Learning rate of for the first , and of for the last epochs. 5. Batch size of . 6. Early stopping over validation set using classification error/loss as a stopping criterion.


1. Patch size (hxw): . (resized while maintaining the ratio). 2. Augment patch using random horizontal flip. 3. Random rotation of one of: (degrees). 4. Random affine transformation with degrees , shear , scale .

Pretrained resnet18 [13] as a backbone.

1. Stochastic gradient descent, with momentum , with Nesterov. 2. Weight decay of over the weights. 3. 90 epochs. 4. Learning rate of decayed every with . 5. Batch size of . 6. Early stopping over validation set using classification error/loss as a stopping criterion.

Running time:

Adding recursive computation in the backpropagation loop is expected to add extra computation time. Tab.3 shows the training time (of 1 run) of our model with and without recursion over identical computation resources. The observed extra computation time is mainly due to gradient accumulation (line 12, Alg.1), which takes the same amount of time as the parameters' update (which is expensive to compute). The forward and backward passes are fast in practice, and take less time than the gradient update. We do not compare the running time between the datasets since they have a different number/size of samples, and different pre-processing that is included in the reported time. Moreover, the size of the samples has an impact on the total time spent on the validation set during training.

Model GlaS CUB5
Ours () 49min 65min
Ours () 90min () 141min ()
Table 3: Comparison of training time, of 1 run, over 400 epochs over GlaS and CUB5 of our model using identical computation resources (NVIDIA Tesla V100 with 12GB memory) when using our erasing algorithm () and without using it ().
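The relative overhead implied by the training times in Tab.3 can be recomputed directly. The sketch below assumes that the first row (49min, 65min) is the model without erasing and the second row (90min, 141min) is the model with erasing, which is consistent with the text above; the variable and function names are illustrative.

```python
# Training time in minutes: (without erasing, with erasing), read from Tab.3.
times = {"GlaS": (49, 90), "CUB5": (65, 141)}


def overhead(base, with_erasing):
    """Relative extra time introduced by the recursion."""
    return (with_erasing - base) / base


for name, (base, erasing) in times.items():
    print(f"{name}: +{100 * overhead(base, erasing):.1f}% training time")
```

Under this reading, the recursion roughly doubles the training time on both datasets.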

B.3 Results

In this section, we provide more visual results over the test set of each dataset. All the predictions with high resolution over the test set of both datasets can be downloaded from this Google drive link: https://drive.google.com/file/d/18K3BawR9Aqz6igK60H6IRGx-klkAwyJk/view?usp=sharing.

Over the GlaS dataset (Fig.6, 7), the visual results show clearly how our model, with and without erasing, can handle multi-instance. Adding the erasing feature allows recovering more discriminative regions. The results over CUB5 (Fig.8, 9, 10, 11, 12), while interesting, show a fundamental limitation of the concept of erasing in the case of one instance. In the case of multi-instance, if the model spots one instance and then erases it, it is more likely that the model will seek another instance, which is the expected behavior. However, in the case of one instance where the discriminative parts are small, the first forward pass mainly spots such a small part and erases it. Then, the leftover may not be sufficient to discriminate. For instance, in CUB5, the model often spots the head. Once it is hidden, the model is unable to find other discriminative parts. A clear illustration of this issue is in Fig.8, row 5. The model correctly spots the head, but is unable to spot the body, although the body has a similar texture and is located right next to the found head. We believe that the main cause of this issue is that the erasing concept forgets where discriminative parts are located. Erasing algorithms seem to be missing this feature, which can be helpful to localize the entire object of interest by seeking around the found discriminative regions. In our erasing algorithm, once a region is erased, the model forgets about its location. Adding a memory-like mechanism, or constraints over the spatial distribution of the mined discriminative regions, may potentially alleviate this issue.

It is interesting to notice the strategy used by our model to localize some types of birds. In the case of the 099.Ovenbird, it relies on the texture of the chest (white dotted with black), while it localizes the white spot on the bird's neck in the case of 108.White_necked_Raven. One can notice as well that our model seems to be robust to small/occluded objects. In many cases, it was able to spot small birds in a difficult context where the bird is not salient.

Figure 6: Visual comparison of the predicted binary mask of each method over GlaS test set. Class: +benign+ (Best visualized in color.)
Figure 7: Visual comparison of the predicted binary mask of each method over GlaS test set. Class: +malignant+ (Best visualized in color.)
Figure 8: Visual comparison of the predicted binary mask of each method over CUB-200-2011 (CUB5) test sets. Species: +019.Gray_Catbird+ (Best visualized in color.)
Figure 9: Visual comparison of the predicted binary mask of each method over CUB-200-2011 (CUB5) test sets. Species: +171.Myrtle_Warbler+ (Best visualized in color.)
Figure 10: Visual comparison of the predicted binary mask of each method over CUB-200-2011 (CUB5) test sets. Species: +099.Ovenbird+ (Best visualized in color.)
Figure 11: Visual comparison of the predicted binary mask of each method over CUB-200-2011 (CUB5) test sets. Species: +178.Swainson_Warbler+ (Best visualized in color.)
Figure 12: Visual comparison of the predicted binary mask of each method over CUB-200-2011 (CUB5) test sets. Species: +108.White_necked_Raven+ (Best visualized in color.)