The success of deep convolutional networks for large-scale object recognition (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; Szegedy et al., 2015; He et al., 2016) has spurred interest in using them to automatically detect and localize objects in natural images. Pioneering this direction, Simonyan et al. (2013) and Springenberg et al. (2014) demonstrated that the gradient of the class-specific score of a given classifier could be used to extract a saliency map of an image. Such classifier-dependent saliency maps can be used to analyze the inner workings of a specific network. However, as only the part of the image that is used by a given model is highlighted, these methods do not identify all “evidence” in a given image. They also tend to be noisy, covering many irrelevant pixels and missing many relevant ones. Therefore, much of the recent work has focused on introducing regularization techniques for correcting such classifier-dependent saliency maps. For instance, Selvaraju et al. (2017) propose averaging multiple saliency maps created for perturbed images to obtain a smooth saliency map. We argue, however, that applying tricks and tweaks on top of methods that were designed to analyze the inner workings of a given classifier is not a principled way to get saliency maps that focus on all useful evidence.
In this work, we aim to find saliency maps indicating pixels which aid classification, i.e. we want to find pixels in the input image such that, if they were masked, an unknown classifier would be confused. Assuming we were given a classifier, a naive approximate solution would be to train a generative model to output a mask (a saliency map) confusing that classifier. That can be achieved using a simple GAN-like approach (Goodfellow et al., 2014) where the classifier acts as a fixed discriminator. Unfortunately, as we show experimentally, this solution suffers from the same issues as prior approaches. We argue that the strong dependence on a given classifier lies at the center of the problem. To tackle this directly, we propose to train a saliency mapping that is not strongly coupled with any specific classifier. Our approach, a class-agnostic saliency map extraction, can be formulated as a practical algorithm that realizes our goal.
Our focus on classifier-agnostic saliency maps is not an objective per se; it is a remedy that resolves the core problem. The proposed approach results in a neural-network-based saliency mapping that depends only on an input image. We qualitatively find that it extracts higher-quality saliency maps than classifier-dependent methods, as can be seen in Fig. 2. The extracted saliency maps show all the evidence without using any symptom-masking methods: difficult-to-tune regularization penalties (such as total variation), exotic activation functions, complex training procedures or image preprocessing tricks (such as superpixels). We also evaluate our method quantitatively by using the extracted saliency maps for object localization. We observe that the proposed approach outperforms the existing weakly supervised techniques, setting a new state-of-the-art result on the ImageNet dataset, and closely approaches the localization performance of a strongly supervised model. Furthermore, we experimentally validate that the proposed approach works reasonably well even for classes unseen during training.
Our method has many potential applications in which being classifier-agnostic is of primary importance. One example is medical image analysis, where we are interested not only in class prediction but also in indicating which part of the image is important to classification. Importantly, however, it is critical to indicate all parts of the image that can influence diagnosis, not just the ones used by a specific classifier.
2 Classifier-agnostic saliency map extraction
In this paper, we cast the problem of extracting a salient region of an input image as the problem of extracting a mapping $m: \mathbb{R}^{W \times H \times 3} \to [0, 1]^{W \times H}$ over an input image $x \in \mathbb{R}^{W \times H \times 3}$. Such a mapping should retain ($m_{ij}(x) = 1$) any pixel of the input image if it aids classification, while it should mask ($m_{ij}(x) = 0$) any other pixel.
2.1 Classifier-dependent saliency map extraction
Earlier work has largely focused on a setting in which a classifier $f$ was given (Fong & Vedaldi, 2017; Dabkowski & Gal, 2017). These approaches can be implemented as solving the following maximization problem:

$$m = \arg\max_{m'} S(m', f), \tag{1}$$

where $S$ is a score function corresponding to a classification loss. That is,

$$S(m, f) = \frac{1}{N} \sum_{n=1}^{N} \left[ l\big(f((1 - m(x_n)) \odot x_n), y_n\big) + R(m(x_n)) \right], \tag{2}$$

where $\odot$ denotes elementwise multiplication (masking), $R(m)$ is a regularization term and $l$ is a classification loss, such as cross-entropy. We are given a training set $D = \{(x_1, y_1), \ldots, (x_N, y_N)\}$. This optimization procedure can be interpreted as finding a mapping $m$ that maximally confuses a given classifier $f$. We refer to it as classifier-dependent saliency map extraction.
A mapping $m$ obtained with a classifier $f$ may differ from a mapping $m'$ found using $f'$, even if both classifiers are equally good with respect to a classification loss for both original and masked images, i.e. $L(f, 0) = L(f', 0)$ and $L(f, m) = L(f', m')$, where $0$ denotes the all-zero mask and

$$L(f, m) = \frac{1}{N} \sum_{n=1}^{N} l\big(f((1 - m(x_n)) \odot x_n), y_n\big). \tag{3}$$

This property contradicts our definition of the mapping $m$ above, which stated that any pixel which helps classification should be indicated by the mask (a saliency map) with a value of 1. This is possible because two equally good but distinct classifiers may use different subsets of input pixels to perform classification.
This behaviour can be explained intuitively with a simple example illustrating an extreme special case. Consider a data set in which every instance consists of two identical copies of an image concatenated together, that is, for all $x_n$, $x_n = [x'_n; x'_n]$, where $x'_n \in \mathbb{R}^{W/2 \times H \times 3}$. For such a data set, there exist at least two classifiers, $f$ and $f'$, with the same classification loss. The classifier $f$ uses only the left half of the image, while $f'$ uses the other half. Each of the corresponding mappings, $m$ and $m'$, would then indicate a region of interest only on the corresponding half of the image. When the input image does not consist of two concatenated copies of the same image, it is unlikely that two equally good classifiers will use disjoint sets of input pixels. Our example shows an extreme case in which this is possible.
2.2 Classifier-agnostic saliency map extraction
In order to address the saliency mapping’s dependence on a single classifier, we propose to alter the objective function in Eq. (1) to consider not only a single fixed classifier but all possible classifiers weighted by their posterior probabilities. That is,

$$m = \arg\max_{m'} \mathbb{E}_{f \sim p(f \mid D, m')} \left[ S(m', f) \right], \tag{4}$$

where the posterior probability, $p(f \mid D, m')$, is defined to be proportional to the exponentiated classification loss $L$, i.e., $p(f \mid D, m') \propto p(f) \exp(-L(f, m'))$. Solving this optimization problem is equivalent to searching over the space of all possible classifiers and finding a mapping $m$ that works with all of them. As we parameterize $f$ as a convolutional network (with parameters denoted as $\theta_f$), the space of all possible classifiers is isomorphic to the space of its parameters. The proposed approach considers all the classifiers, and we call it classifier-agnostic saliency map extraction.
In the case of the simple example above, where each image contains two copies of a smaller image, the posterior probabilities of the two classifiers $f$ and $f'$, which respectively look at one and the other half of an image, would be the same (we assume a flat prior over classifiers). Solving Eq. (4) implies that a mapping $m$ must minimize the loss for both of these classifiers.
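As a worked illustration of Eq. (4) on the two-copy example, suppose (a simplifying assumption made here for illustration only) that the posterior mass is split equally between just the two classifiers $f$ and $f'$. The expected score then reduces to an equal-weight average:

```latex
\mathbb{E}_{f'' \sim p(\cdot \mid D, m)} \left[ S(m, f'') \right]
  \approx \tfrac{1}{2}\, S(m, f) + \tfrac{1}{2}\, S(m, f')
```

Under this assumption, a mapping that masks only the left half confuses $f$ but leaves $f'$ untouched, so only one of the two terms is large. A maximizer of the average therefore has to cover the evidence in both halves.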
The optimization problem in Eq. (4) is, unfortunately, generally intractable. This arises from the intractable expectation over the posterior distribution. Furthermore, the expectation is inside the optimization loop for the mapping $m$, making it even harder to solve.
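Thus, we approximately solve this problem by simultaneously estimating the mapping $m$ and the expected objective. First, we sample one $f^{(k)}$ with the posterior probability $p(f \mid D, m^{(k-1)})$ by taking a single step of stochastic gradient descent (SGD) on the classification loss with respect to $\theta_f$ (classifier parameters) with a small step size:

$$\theta_{f^{(k)}} \leftarrow \theta_{f^{(k-1)}} - \eta_f \nabla_{\theta_f} L(f^{(k-1)}, m^{(k-1)}). \tag{5}$$

We have up to $k+1$ samples from the posterior distribution, $F^{(k)} = \{f^{(0)}, \ldots, f^{(k)}\}$ (the usual practice of “thinning” may be applied, leading to fewer than $k+1$ samples). We sample $f^{(k')} \in F^{(k)}$ (we set the chance of selecting $f^{(k)}$ to 50% and spread the remaining 50% uniformly over $F^{(k-1)}$) to get a single-sample estimate of the expected objective in Eq. (4) by computing $S(m^{(k-1)}, f^{(k')})$. Then, we use it to obtain an updated $\theta_m$ (mapping parameters) by

$$\theta_{m^{(k)}} \leftarrow \theta_{m^{(k-1)}} + \eta_m \nabla_{\theta_m} S(m^{(k-1)}, f^{(k')}). \tag{6}$$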
We alternate between these two steps until $m^{(k)}$ converges (cf. Alg. 1). Note that our algorithm resembles the training procedure of GANs (Goodfellow et al., 2014), where the mapping $m$ takes the role of a generator and the classifier $f$ can be understood as a discriminator.
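The classifier-sampling step used in the mapping update can be sketched as follows. This is a minimal illustration, assuming the stored classifiers are kept in a list ordered from oldest to newest; the function name `sample_classifier` and the list representation are hypothetical and not taken from the paper's code.

```python
import random


def sample_classifier(classifiers, rng=random):
    """Pick one classifier from the stored set (ordered oldest to newest).

    The newest entry is chosen with probability 0.5; the remaining 0.5 is
    spread uniformly over the older entries. If only one classifier is
    stored, it is always returned.
    """
    if not classifiers:
        raise ValueError("classifier set is empty")
    if len(classifiers) == 1 or rng.random() < 0.5:
        return classifiers[-1]
    # Uniform choice among all classifiers except the newest one.
    return rng.choice(classifiers[:-1])
```

In a training loop, the returned classifier would be the one used to evaluate the score for the next mapping update, while the newest classifier keeps receiving SGD steps.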
The score function $S(m, f)$ estimates the quality of the saliency map extracted by $m$ given a data set and a classifier $f$. The score function must be designed to balance precision and recall. Precision refers to the fraction of relevant pixels among those marked by $m$ as relevant, while recall is the fraction of pixels correctly marked by $m$ as relevant among all the relevant pixels. In order to balance these two, the score function often consists of two terms.
The first term aims to ensure that all relevant pixels are included (high recall). As in Eq. (2), a popular choice has been the classification loss based on an input image $x$ masked out by $1 - m(x)$. In our preliminary experiments, however, we noticed that this approach leads to masks with adversarial artifacts. We hence propose to use the entropy $H(f((1 - m(x)) \odot x))$ instead. This makes generated masks cover all salient pixels in the input, avoiding masks that may sway the class prediction to a different but semantically close class, for example from one dog species to another. The second term, $R(m)$, excludes a trivial solution, simply outputting an all-ones saliency map, which would achieve maximal recall with very low precision. To do so, we must introduce a regularization term. Some of the popular choices include total variation (Rudin et al., 1992) and the $L_1$ norm. For simplicity, we use only the latter.
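In summary, we use the following score function for the class-agnostic saliency map extraction:

$$S(m, f) = \frac{1}{N} \sum_{n=1}^{N} \left[ H\big(f((1 - m(x_n)) \odot x_n)\big) - \lambda_R \| m(x_n) \|_1 \right], \tag{7}$$

where $\lambda_R$ is a regularization coefficient.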
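The score described above (entropy of the classifier's prediction on the masked-out image, minus an L1 penalty on the mask) can be sketched in plain Python. This is an illustrative sketch only: it assumes the class probabilities for each masked-out image have already been computed by some classifier, and the names `entropy`, `score` and `lam` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def score(masked_out_probs, masks, lam):
    """Batch average of entropy(prediction) - lam * L1(mask).

    masked_out_probs: one list of class probabilities per image, assumed to
        come from a classifier applied to the masked-out image.
    masks: one flat list of mask values in [0, 1] per image.
    lam: regularization coefficient (hypothetical name for the penalty weight).
    """
    total = 0.0
    for probs, mask in zip(masked_out_probs, masks):
        total += entropy(probs) - lam * sum(abs(v) for v in mask)
    return total / len(masks)
```

A high score requires the classifier to be maximally uncertain on the masked-out image while the mask stays small, which is the precision and recall trade-off discussed above.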
As the algorithm collects a set of classifiers, $f^{(k)}$’s, from the posterior distribution, we need a strategy to keep a small subset of them. An obvious approach would be to keep all classifiers, but this does not scale well with the number of iterations. We propose and empirically evaluate a few strategies. The first three of them assume a fixed size of $F^{(k)}$: keeping the first classifier only, denoted by F ($F^{(k)} = \{f^{(0)}\}$); the last only, denoted by L ($F^{(k)} = \{f^{(k)}\}$); and the first and last only, denoted by FL ($F^{(k)} = \{f^{(0)}, f^{(k)}\}$). As an alternative, we also considered a growing set of classifiers where we keep only one every 1000 iterations (denoted by L1000), but whenever the set reaches a fixed maximum size, we randomly remove one from the set. Analogously, we experimented with L100.
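The thinning strategies above can be sketched as a single update function. This is a hedged illustration, not the authors' implementation: the function name, the `max_size` cap and its default value, and the choice to evict after appending are all assumptions made for the example.

```python
import random


def update_classifier_set(stored, new_clf, iteration, strategy,
                          max_size=8, rng=random):
    """Return the stored classifier set after one iteration.

    strategy: "F" (first only), "L" (last only), "FL" (first and last),
        or "L<period>" such as "L100" / "L1000" (keep one snapshot every
        <period> iterations, evicting a random one above max_size).
    max_size: illustrative cap on the set size (hypothetical value).
    """
    stored = list(stored)
    if not stored:
        return [new_clf]
    if strategy == "F":
        return stored[:1]
    if strategy == "L":
        return [new_clf]
    if strategy == "FL":
        return [stored[0], new_clf]
    if strategy.startswith("L") and strategy[1:].isdigit():
        period = int(strategy[1:])
        if iteration % period == 0:
            stored.append(new_clf)
            if len(stored) > max_size:
                # Randomly drop one snapshot to keep the set bounded.
                stored.pop(rng.randrange(len(stored)))
        return stored
    raise ValueError(f"unknown thinning strategy: {strategy}")
```

With strategy F the set never changes after the first classifier, which corresponds to training against a fixed classifier.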
Although we described our approach using the classification loss computed only on masked images, as in Eq. (3), it is not necessary to define the classification loss exactly in this way. In preliminary experiments, we noticed that the following alternative formulation, inspired by adversarial training (Szegedy et al., 2013), works better:

$$L(f, m) = \frac{1}{2N} \sum_{n=1}^{N} \left[ l\big(f((1 - m(x_n)) \odot x_n), y_n\big) + l\big(f(x_n), y_n\big) \right]. \tag{8}$$

We thus use the loss as defined above in the experiments. We conjecture that it is advantageous over the original one in Eq. (3), as the additional term prevents the degradation of the classifier’s performance on the original, unmasked images, while the first term encourages the classifier to collect new pieces of evidence from the images that are masked.
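The mixed loss can be illustrated with a small sketch that averages the cross-entropy on masked-out and clean images. It assumes the class probabilities for both versions of each image are already available; the equal weighting of the two terms and the names `cross_entropy` and `mixed_loss` are assumptions of this example.

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[label])


def mixed_loss(probs_masked, probs_clean, labels):
    """Average of the losses on masked-out and clean images over a batch.

    probs_masked / probs_clean: per-image class probabilities predicted on
        the masked-out and the original image, respectively.
    labels: integer class indices.
    """
    total = 0.0
    for p_masked, p_clean, y in zip(probs_masked, probs_clean, labels):
        total += cross_entropy(p_masked, y) + cross_entropy(p_clean, y)
    return total / (2 * len(labels))
```

The clean-image term is what keeps the classifier from drifting entirely toward the distribution of masked-out images.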
3 Experimental settings
Our models were trained on the official ImageNet training set with ground truth class labels (Deng et al., 2009). We evaluate them on the validation set. Depending on the experiment, we use ground truth class or localization labels.
We made our code publicly available at MASKED.
Classifier and mapping
We use ResNet-50 (He et al., 2016) as a classifier $f$ in our experiments. We follow an encoder-decoder architecture for constructing a mapping $m$. The encoder is also implemented as a ResNet-50, so its weights can either be shared with the classifier or kept separate. We experimentally find that sharing is beneficial. The decoder is a deep deconvolutional network that ultimately outputs the mask of an input image. The overall architecture is shown in Fig. 1. Details of the architecture and training procedure are in the appendix.
As noticed by Fan et al. (2017), it is not trivial to find an optimal regularization coefficient $\lambda_R$. They proposed an adaptive strategy which gets rid of the manual selection of $\lambda_R$. We, however, find it undesirable due to the lack of control over the average size of the saliency map. Instead, we propose to control the average number of relevant pixels by manually setting $\lambda_R$, while applying the regularization term only when there is a disagreement between $f(x)$ and $f((1 - m(x)) \odot x)$. We then set $\lambda_R$ for each experiment such that approximately 50% of pixels in each image are indicated as relevant by a mapping $m$. In the preliminary experiments, we further noticed that this approach avoids problematic behavior when an image contains small objects, earlier observed by Fong & Vedaldi (2017).
We also noticed that the training of the mapping $m$ is more stable and effective when we use only images that the classifier is not trivially confused on, i.e. images for which it predicts the correct class on the original input.
In our experiments we only use a single architecture explained in subsection 3.1. We use the abbreviation CASM (classifier-agnostic saliency mapping) to denote the final model obtained using the proposed method. Our baseline model (Baseline) is of the same architecture and it is trained with a fixed classifier (classifier-dependent saliency mapping) realized by following thinning strategy F.
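We binarize a mask $m(x)$ as

$$m^{b}_{ij}(x) = \begin{cases} 1, & \text{if } m_{ij}(x) \geq \alpha \bar{m}(x), \\ 0, & \text{otherwise,} \end{cases}$$

where $\bar{m}(x)$ is the average mask intensity and $\alpha$ is a hyperparameter. We simply set $\alpha$ to 1, hence the average of pixel intensities is the same for the input mask $m(x)$ and the discretized binary mask $m^{b}(x)$. To focus on the most dominant object, we take the largest connected component of the binary mask to obtain the binary connected mask.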
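The discretization step can be sketched as follows, assuming the threshold is `alpha` times the average mask intensity, a non-strict comparison, and 4-connectivity for connected components. All three choices, and the function names, are assumptions of this sketch, since the text does not pin them down.

```python
from collections import deque


def binarize(mask, alpha=1.0):
    """Threshold a 2D mask (list of rows) at alpha * mean intensity."""
    h, w = len(mask), len(mask[0])
    mean = sum(sum(row) for row in mask) / (h * w)
    return [[1 if v >= alpha * mean else 0 for v in row] for row in mask]


def largest_connected_component(binary):
    """Keep only the largest 4-connected component of a binary mask."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for i in range(h):
        for j in range(w):
            if not binary[i][j] or seen[i][j]:
                continue
            # Breadth-first search over one component.
            component, queue = [], deque([(i, j)])
            seen[i][j] = True
            while queue:
                r, c = queue.popleft()
                component.append((r, c))
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if (0 <= nr < h and 0 <= nc < w
                            and binary[nr][nc] and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            if len(component) > len(best):
                best = component
    out = [[0] * w for _ in range(h)]
    for r, c in best:
        out[r][c] = 1
    return out
```

Applying `largest_connected_component(binarize(mask))` yields what the text calls the binary connected mask.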
We visualize the learned mapping $m$ by inspecting the saliency map of each image in three different ways. First, we visualize the masked-in image $m(x) \odot x$, which ideally leaves only the relevant pixels visible. Second, we visualize the masked-out image $(1 - m(x)) \odot x$, which highlights pixels irrelevant to classification. Third, we visualize the inpainted masked-out image, obtained using an inpainting algorithm (Telea, 2004). This allows us to check that the object that should be masked out cannot be easily reconstructed from nearby pixels.
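The masked-in and masked-out images are simple elementwise products. A minimal sketch, assuming the image is stored as nested lists of shape height by width by channels and the mask as height by width (the helper name `apply_mask` is hypothetical):

```python
def apply_mask(image, mask, keep_salient=True):
    """Multiply every channel of each pixel by m (masked-in) or 1 - m (masked-out).

    image: nested lists, height x width x channels.
    mask: nested lists, height x width, values in [0, 1].
    """
    out = []
    for img_row, mask_row in zip(image, mask):
        row = []
        for pixel, m in zip(img_row, mask_row):
            weight = m if keep_salient else 1.0 - m
            row.append([weight * channel for channel in pixel])
        out.append(row)
    return out
```

Calling it with `keep_salient=True` gives the masked-in image and with `keep_salient=False` the masked-out image; the inpainting step is not covered by this sketch.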
Classification by multiple classifiers
In order to verify our claim that the proposed approach results in a classifier-agnostic saliency mapping, we evaluate a set of classifiers on the validation sets of masked-in images, masked-out images and inpainted masked-out images. The set consists of twenty ResNet-50 models trained with different initial random sets of parameters, in addition to the classifiers from torchvision.models (https://pytorch.org/docs/master/torchvision/models.html): densenet121, densenet169, densenet201, densenet161, resnet18, resnet34, resnet50, resnet101, resnet152, vgg11, vgg11_bn, vgg13, vgg13_bn, vgg16, vgg16_bn, vgg19 and vgg19_bn. If our claim is correct, we expect the inpainted masked-out images created by our method to break these classifiers, while the masked-in images would suffer minimal performance degradation.
As the saliency map can be used to find the most dominant object in an image, we can evaluate our approach on the task of weakly supervised localization. To do so, we use the ILSVRC’14 localization task. We compute the bounding box of an object as the tightest box that covers the binary connected mask.
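The tightest bounding box around a binary mask can be computed as in the sketch below. The `(x_min, y_min, x_max, y_max)` convention with inclusive pixel indices is an assumption of this example, as is the function name.

```python
def bounding_box(binary):
    """Tightest box covering all nonzero pixels of a 2D binary mask.

    Returns (x_min, y_min, x_max, y_max) with inclusive pixel indices,
    or None if the mask is empty.
    """
    rows = [i for i, row in enumerate(binary) if any(row)]
    if not rows:
        return None
    width = len(binary[0])
    cols = [j for j in range(width) if any(row[j] for row in binary)]
    return (min(cols), min(rows), max(cols), max(rows))
```

In the pipeline described here, the input would be the binary connected mask, so the box encloses only the most dominant object.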
We use three metrics to quantify the quality of localization. First, we use the official metric (OM) from the ImageNet localization challenge, which considers the localization successful if at least one ground truth bounding box has IOU with the predicted bounding box higher than 0.5 and the class prediction is correct. Since OM depends on the classifier, from which we have sought to make our mapping independent, we use another widely used metric, called localization error (LE), which depends only on the bounding box prediction (Cao et al., 2015; Fong & Vedaldi, 2017). Lastly, we evaluate the original saliency map, in which each mask pixel is a continuous value between 0 and 1, by the continuous F1 score. Precision and recall are defined as follows:

$$\text{precision} = \frac{\sum_{(i,j) \in B^{*}(x)} m_{ij}(x)}{\sum_{i,j} m_{ij}(x)}, \qquad \text{recall} = \frac{\sum_{(i,j) \in B^{*}(x)} m_{ij}(x)}{|B^{*}(x)|},$$

where $B^{*}(x)$ is the ground truth bounding box. We compute F1 scores against all the ground truth bounding boxes for each image and report the highest one among them as its final score.
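The localization criterion and the continuous F1 score can be sketched as follows. The box convention `(x_min, y_min, x_max, y_max)` with inclusive pixel indices, the normalization of recall by the box area, and all function names are assumptions made for this illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two inclusive pixel boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(min(ax1, bx1) - max(ax0, bx0) + 1, 0)
    inter_h = max(min(ay1, by1) - max(ay0, by0) + 1, 0)
    inter = inter_w * inter_h
    area_a = (ax1 - ax0 + 1) * (ay1 - ay0 + 1)
    area_b = (bx1 - bx0 + 1) * (by1 - by0 + 1)
    return inter / (area_a + area_b - inter)


def is_localized(pred_box, gt_boxes, threshold=0.5):
    """True if the prediction overlaps any ground truth box with IOU > threshold."""
    return any(iou(pred_box, gt) > threshold for gt in gt_boxes)


def continuous_f1(mask, box):
    """F1 of a continuous mask against one ground truth box."""
    x0, y0, x1, y1 = box
    inside = sum(mask[i][j] for i in range(y0, y1 + 1) for j in range(x0, x1 + 1))
    total = sum(sum(row) for row in mask)
    if inside == 0 or total == 0:
        return 0.0
    precision = inside / total
    recall = inside / ((x1 - x0 + 1) * (y1 - y0 + 1))
    return 2 * precision * recall / (precision + recall)


def best_f1(mask, gt_boxes):
    """Highest continuous F1 over all ground truth boxes of an image."""
    return max(continuous_f1(mask, box) for box in gt_boxes)
```

Localization error over a data set would then be the fraction of images for which `is_localized` is false.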
4 Results and analysis
Visualization and statistics
We randomly select seven consecutive images from the validation set and input them to two instances of CASM (each using a different thinning strategy – L or L100) and Baseline. We visualize the original (clean), masked-in, masked-out and inpainted masked-out images in Fig. 2. The proposed approach produces clearly better saliency maps, while the classifier-dependent approach (Baseline) produces so-called adversarial masks (Dabkowski & Gal, 2017).
We further compute some statistics of the saliency maps generated by CASM and Baseline over the validation set. The masks extracted by CASM exhibit lower total variation, indicating that CASM produces more regular masks despite the lack of explicit TV regularization. The entropy of mask pixel intensities is much smaller for CASM, indicating that the mask intensities are closer to either 0 or 1 on average. Furthermore, the standard deviation of the masked-out volume is larger with CASM, indicating that CASM is capable of producing saliency maps of varying sizes depending on the input images.
| Model | OM | LE |
|---|---|---|
| Fan et al. (2017) | 54.5 | 43.5 |
| Zeiler & Fergus (2014) | - | 48.6 |
| Zhou et al. (2016) | 56.4 | 48.1 |
| Selvaraju et al. (2017) | - | 47.5 |
| Fong & Vedaldi (2017) | - | 43.1 |
| Mahendran & Vedaldi (2016) | - | 42.0 |
| Simonyan et al. (2013) | - | 41.7 |
| Cao et al. (2015) | - | 38.8 |
| Zhang et al. (2016) | - | 38.7 |
| Dabkowski & Gal (2017) | - | 36.7 |
| Simonyan & Zisserman (2014) | - | 34.3 |
As shown in the left panel of Figure 3, the entire set of classifiers suffers less from the masked-in images produced by CASM than from those produced by Baseline. We, however, notice that most of the classifiers fail to classify the masked-out images produced by Baseline, which we conjecture is due to the adversarial nature of the saliency maps produced by the Baseline approach. This is confirmed by the right panel, which shows that simple inpainting of the masked-out images dramatically increases the accuracy when the saliency maps were produced by Baseline. The masked-out images produced by CASM, on the other hand, do not benefit from inpainting, because they truly do not retain any useful evidence for classification.
We report the localization performance of CASM, Baseline and prior works in Table 2 using two different metrics. Most of the existing approaches, except for Fan et al. (2017), assume knowledge of the target class, unlike our work. CASM performs better than all prior approaches, including the classifier-dependent Baseline. The difference is statistically significant, as assessed over ten separate training runs with random initialization. The fully supervised approach is the only approach that outperforms CASM.
In Table 2 (a–e), we compare the five thinning strategies described earlier, where F is equivalent to the Baseline. According to the LE and OM metrics, the strategies L100 and L1000 perform better than the others, closely followed by L. These three strategies also perform best in terms of F1.
Sharing the encoder and classifier
Unlike Fan et al. (2017), we use separate score functions for training the classifier and the saliency mapping. We empirically observe in Table 2 that the proposed use of entropy as a score function results in a better mapping in terms of OM and LE. The gap, however, narrows as we use better thinning strategies. On the other hand, the classification loss is better for F1, as it makes CASM focus on the dominant object only. Because we take the highest score over the ground truth bounding boxes, concentrating on the dominant object yields higher scores.
5 Unseen classes
Since the proposed approach does not require knowing the class of the object to be localized, we can use it with images that contain objects of classes that were seen during training by neither the classifier $f$ nor the mapping $m$. We explicitly test this capability by training five different CASMs on five subsets of the original training set of ImageNet.
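| Training classes | A | B | C | D | E | F | All |
|---|---|---|---|---|---|---|---|
| D, E, F | 37.9 | 39.3 | 40.0 | 38.0 | 38.0 | 37.4 | 38.1 |
| C, D, E, F | 38.2 | 38.5 | 39.9 | 37.9 | 37.9 | 37.8 | 38.1 |
| B, C, D, E, F | 36.7 | 36.8 | 39.9 | 37.4 | 37.0 | 37.0 | 37.4 |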
We first divide the 1000 classes into six disjoint subsets (denoted as A, B, C, D, E and F) of sizes 50, 50, 100, 300, 300 and 200, respectively. We train our models (in all stages) on 95% of the images (classes in B, C, D, E and F), 90% of the images (classes in C, D, E and F), 80% of the images (classes in D, E and F), 50% of the images (classes in E and F) and finally on only 20% of the images (classes in F only). Then, we test each saliency mapping on all six subsets of classes independently. We use the thinning strategy L for computational efficiency in each case.
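The stated percentages follow directly from the subset sizes, as the short check below shows. It counts classes rather than images, under the assumption that classes contribute roughly equally to the training set; the names used are illustrative.

```python
# Number of classes in each disjoint subset, as listed in the text.
SUBSET_SIZES = {"A": 50, "B": 50, "C": 100, "D": 300, "E": 300, "F": 200}


def training_fraction(subsets):
    """Fraction of all classes covered by the given subsets, e.g. 'DEF'."""
    total = sum(SUBSET_SIZES.values())
    return sum(SUBSET_SIZES[name] for name in subsets) / total


for split in ("BCDEF", "CDEF", "DEF", "EF", "F"):
    print(split, f"{100 * training_fraction(split):.0f}%")
```

The six sizes sum to 1000, and the five nested training splits cover 95%, 90%, 80%, 50% and 20% of the classes.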
All models generalize well, and the difference between their accuracy on seen and unseen classes is negligible (with the exception of the model trained on 20% of the classes). The general performance is a little poorer, which can be explained by the smaller training set. In Table 3, we see that the proposed approach works well even for localizing objects from previously unseen classes. The gap in the localization error between the seen and unseen classes grows as the training set shrinks. However, with a reasonably sized training set, the difference between the seen and unseen classes is small. This is an encouraging sign for the proposed model as a class-agnostic saliency map.
6 Related work
The adversarial localization network (Fan et al., 2017) is perhaps the most closely related to our work. Similarly to ours, they simultaneously train the classifier and the saliency mapping, which does not require the object’s class at test time. There are four major differences between that work and ours. First, we use the entropy as a score function for training the mapping, whereas they used the classification loss. This results in better saliency maps, as we have shown earlier. Second, we make the training procedure faster by tying the weights of the encoder and the classifier, which also results in much better performance. Third, we do not let the classifier shift to the distribution of masked-out images, as we continue training it on both clean and masked-out images. Finally, their mapping relies on superpixels to build more contiguous masks, which may miss small details due to inaccurate segmentation and makes the entire procedure more complex. Our approach works solely on raw pixels without requiring any extra tricks.
Dabkowski & Gal (2017) also train a separate neural network dedicated to predicting saliency maps. However, their approach is a classifier-dependent method and, as such, a lot of effort is devoted to preventing the generation of adversarial masks. Furthermore, the authors use a complex training objective with multiple hyperparameters which also have to be tuned carefully. On a final note, their model needs a ground truth class label, which limits its use in practice.
In this paper, we proposed a new framework for classifier-agnostic saliency map extraction which aims at finding a saliency mapping that works for all possible classifiers weighted by their posterior probabilities. We designed a practical algorithm that amounts to simultaneously training a classifier and a saliency mapping using stochastic gradient descent. We qualitatively observed that the proposed approach extracts saliency maps that cover all the relevant pixels in an image and that the masked-out images cannot be easily recovered by inpainting, unlike for classifier-dependent approaches. We further observed that the proposed saliency map extraction procedure outperforms all existing weakly supervised approaches to object localization and can also be used on images containing objects from previously unseen classes, paving a way toward class-agnostic saliency map extraction.
- Cao et al. (2015) Chunshui Cao, Xianming Liu, Yi Yang, Yinan Yu, Jiang Wang, Zilei Wang, Yongzhen Huang, Liang Wang, Chang Huang, Wei Xu, et al. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In International Conference on Computer Vision, 2015.
- Dabkowski & Gal (2017) Piotr Dabkowski and Yarin Gal. Real time image saliency for black box classifiers. In Neural Information Processing Systems, 2017.
- Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009.
- Fan et al. (2017) Lijie Fan, Shengjia Zhao, and Stefano Ermon. Adversarial localization network. In Learning with limited labeled data: weak supervision and beyond, NIPS Workshop, 2017.
- Fong & Vedaldi (2017) Ruth C Fong and Andrea Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. arXiv preprint arXiv:1704.03296, 2017.
- Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Neural Information Processing Systems, 2014.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, 2016.
- Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Neural Information Processing Systems, 2012.
- Mahendran & Vedaldi (2016) Aravindh Mahendran and Andrea Vedaldi. Salient deconvolutional networks. In European Conference on Computer Vision. Springer, 2016.
- Mandt et al. (2017) Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. The Journal of Machine Learning Research, 2017.
- Rudin et al. (1992) Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60, 1992.
- Selvaraju et al. (2017) Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, pp. 618–626, 2017.
- Simonyan & Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Simonyan et al. (2013) Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
- Springenberg et al. (2014) Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
- Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
- Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition, 2015.
- Telea (2004) Alexandru Telea. An image inpainting technique based on the fast marching method. Journal of graphics tools, 2004.
- Welling & Teh (2011) Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In International Conference on Machine Learning, 2011.
- Zeiler & Fergus (2014) Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, 2014.
- Zhang et al. (2016) Jianming Zhang, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. Top-down neural attention by excitation backprop. In European Conference on Computer Vision, 2016.
- Zhou et al. (2016) Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Computer Vision and Pattern Recognition, 2016.
Architecture and training procedure
Classifier and mapping
As mentioned before, we use ResNet-50 (He et al., 2016) as a classifier $f$ in our experiments. We follow the encoder-decoder architecture for constructing a mapping $m$. The encoder is also implemented as a ResNet-50, so its weights can either be shared with the classifier or kept separate. We experimentally find that sharing is beneficial. The decoder is a deep deconvolutional network that ultimately outputs the mask of an input image. The input to the decoder consists of all hidden layers of the encoder which are directly followed by a downscaling operation. We upsample them to be of the same size and concatenate them into a single feature map. This upsampling operation is implemented by first applying a 1×1 convolution and then rescaling to 56×56 pixels (using bilinear interpolation). Finally, a single 3×3 convolutional filter followed by a sigmoid activation is applied to this feature map, and the output is upscaled to a 224×224 pixel-sized mask using proximal interpolation. The overall architecture is shown in Fig. 1.
We initialize the classifier $f$ by training it on the entire training set. We find that this pretraining strategy facilitates learning, particularly in the early stage. In practice, we use the pretrained ResNet-50 from the torchvision model zoo. We use vanilla SGD with a small learning rate, a momentum coefficient of 0.9 and weight decay to continue training the classifier with the mixed classification loss as in Eq. (8). To train the mapping $m$, we use Adam (Kingma & Ba, 2014) with its own learning rate and weight-decay coefficient, and all the other hyperparameters set to default values. We fix the number of training epochs to 70 (each epoch covers only a random 20% of the training set).
We noticed that the details of the resizing policy preceding the evaluation procedures OM and LE vary between different works. The one thing they have in common is that the resized image is always 224×224 pixels. The two main approaches are the following.
- The image in its original size is resized such that the smaller edge of the resulting image is 224 pixels long. Then, the central 224×224 crop is taken. The original aspect ratio of the objects in the image is preserved. Unfortunately, this method has a flaw: it may be impossible to obtain IOU > 0.5 between the predicted localization box and the ground truth box when more than half of the bounding box is not seen by the model.
- The image in its original size is resized directly to 224×224 pixels. The advantage of this method is that the image is not cropped and it is always possible to obtain IOU > 0.5 between the predicted localization box and the ground truth box. However, the original aspect ratio is distorted.
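The flaw of the crop-based policy can be made concrete by computing how much of a ground truth box survives the central crop. The sketch below is illustrative: it uses continuous coordinates with boxes given as `(x0, y0, x1, y1)` in original-image pixels, and the function name is hypothetical.

```python
def visible_fraction_after_center_crop(width, height, box, size=224):
    """Fraction of a box's area that remains inside the central crop.

    The image is first rescaled so that its shorter edge equals `size`,
    then the central size x size window is kept.
    """
    scale = size / min(width, height)
    left = (width * scale - size) / 2
    top = (height * scale - size) / 2
    x0, y0, x1, y1 = (coord * scale for coord in box)
    # Overlap between the rescaled box and the crop window.
    overlap_w = max(0.0, min(x1, left + size) - max(x0, left))
    overlap_h = max(0.0, min(y1, top + size) - max(y0, top))
    return overlap_w * overlap_h / ((x1 - x0) * (y1 - y0))
```

Because a predicted box lies inside the crop, its IOU with the ground truth cannot exceed this visible fraction, so a value of at most 0.5 rules out a successful localization under the first policy. Direct resizing keeps the whole box visible at the cost of distorting the aspect ratio.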
The difference in LE scores for different resizing strategies should not be large. For CASM it is 0.6%. In this paper, for CASM, we report results for the first method.
In the remainder of the appendix, we replicate the content of Fig. 2 for sixteen randomly chosen classes. That is, in each figure we visualize saliency maps obtained for seven consecutive images from the validation set. The original images are in the first row. In the following rows, masked-in images, masked-out images and inpainted masked-out images are shown. As before, we used two instances of CASM (each using a different thinning strategy, L or L100) and Baseline.