1 Introduction
Semantic image segmentation is the problem of assigning each pixel of an image to one of a predefined set of labels, based on the semantic structure to which the pixel belongs. Most successful models for semantic image segmentation employ a variant of a CNN to compute the probability distribution over the classes for each pixel. During inference, these distributions are fed as unary potentials to a fully connected CRF with Gaussian edge potentials, and a joint labeling for the pixels of the image is inferred from the CRF. The work of Krähenbühl & Koltun [12] allows for efficient inference in such models.
Successful semantic image segmentation requires access to a large number of densely labeled images. However, dense labeling of images is an expensive and time-consuming operation [16, 19, 18]. Therefore, the number of densely labeled images available is usually a minuscule fraction of the total set of images. Hence, models that rely solely on densely labeled images are limited in their scope. These models will be referred to as fully supervised models in the sequel.
The limitations of fully supervised models have necessitated the development of models that can incorporate weakly labeled images for training. These include models that utilize bounding-box priors [13, 5, 16, 29], a few points per class [1], and image-level labels [25, 16, 19]. Of particular interest are models that rely on image-level labels only, since the web provides an almost unlimited source of weakly annotated images.
Unfortunately, the decoupled CNN-CRF combination (or a CNN alone) fares poorly when only image-level labels are available [19, 17]. To alleviate this problem, several researchers have resorted to the use of localization cues, such as saliency and attention maps [28, 10] or objectness priors [19, 27], thereby improving performance to an extent. Improvements in CNN architectures for segmentation [4, 30] have further boosted performance.
In this paper, we propose a model that learns to output segmentation masks using only image-level labels, without the aid of localization cues or saliency masks. In particular, we enforce a pixel-label loss as well as a neighborhood loss on the output of a CNN. Since real pixel-labels are unavailable, we map the output of the CNN to auxiliary pixel labels to get an approximate segmentation mask. The neighborhood loss allows us to enforce the constraints imposed by a conditional random field on the output of the CNN, thereby forcing it to generate crisp segmentation masks that align with the boundary of the object.
Contributions
Our contributions are as follows. (1) We propose a new interpretable model for weakly supervised semantic segmentation. (2) The model is trained by imposing pixel-label and neighborhood loss functions on the output of a fully convolutional neural network. (3) We prove that imposing the neighborhood loss forces the output of the CNN to satisfy the constraints imposed by a conditional random field. (4) We achieve an accuracy of 52.01% on the test set and 51.6% on the validation set of Pascal VOC 2012, which is the state of the art for methods that do not employ any pixel-level labels.
2 Preliminaries and background
Notations
Probability distributions are indicated by lower-case letters, for example, $p$ and $q$. Subscripts on a distribution indicate the location, whereas superscripts indicate the name of the distribution. The label at location $i$ is denoted as $y_i$, while any distribution at the corresponding location is denoted as $p_i$ with an appropriate superscript indicating the type of distribution. The labels in the segmentation mask form a grid, which is denoted by $\mathcal{G}$. The entire segmentation mask is denoted as $Y$.
Conditional random fields for semantic image segmentation
A conditional random field is an example of an undirected graphical model that models the conditional distribution of the output given the input, when the output is structured in the form of factors. For the problem of semantic image segmentation, the image is the input, whereas the pixel-level labels form the output. A conditional random field is completely characterized by its potential functions.
Traditionally, conditional random fields employ two forms of potential functions: unary potentials and binary potentials. A unary potential is a function of the conditioning variable and a single output variable $y_i$. The unary potentials encode the suitability of assigning a specific label at a specific location. A popular approach is to learn a local classifier for each location in the image, using features extracted locally from the image. The unary potential for a specific label at a specific location is then equated to the negative log-probability of observing that label at that location. These local classifiers have largely been replaced by convolutional neural networks that consider local as well as global information [31, 4].
The binary potential measures the compatibility of two labels $y_i$ and $y_j$ for locations $i$ and $j$. The most commonly used binary potential is the contrast-sensitive Potts model [2, 23, 9], which captures the difference in color among neighboring locations, that is,
(1)  $\psi_b(y_i, y_j) = [y_i \neq y_j]\,\big(\lambda_1 + \lambda_2 \exp\!\big(-\beta \|I_i - I_j\|^2\big)\big)$
where $\lambda_1$ and $\lambda_2$ are non-negative scalar constants, and $I_i$ denotes the color of the pixel at location $i$.
A CRF with unary and binary potentials only is referred to as a pairwise CRF. For a given input image $X$ and a segmentation mask $Y$, the joint distribution of a pairwise CRF is given below:
(2)  $p(Y \mid X) = \frac{1}{Z(X)} \exp\!\Big(-\sum_{i} \psi_u(y_i, X) - \sum_{i} \sum_{j \in N(i)} \psi_b(y_i, y_j, X)\Big)$
where $N(i)$ indicates the neighbors of the $i^{th}$ node, and $Z(X)$ is the normalization constant.
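As an illustrative sketch of the pairwise form (the function, data layout, and toy numbers below are our illustration, not code from the paper): the energy of one labeling is the sum of the unary costs plus a Potts penalty for every pair of disagreeing neighbors, and the joint distribution of equation (2) is proportional to the exponential of the negative energy.

```python
import numpy as np

def crf_energy(labels, unary, edge_weights):
    """Energy of one labeling under a pairwise CRF.

    labels: (n,) label index per location
    unary: (n, L) unary potentials (e.g. negative log-probabilities)
    edge_weights: {(i, j): w} Potts penalty paid when labels[i] != labels[j]
    """
    energy = sum(unary[i, labels[i]] for i in range(len(labels)))
    for (i, j), w in edge_weights.items():
        if labels[i] != labels[j]:
            energy += w          # disagreeing neighbors pay the penalty
    return float(energy)
```

Normalizing $\exp(-\text{energy})$ over all labelings recovers the distribution of equation (2); the normalization constant $Z(X)$ is what makes exact inference hard.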
3 Proposed Model
3.1 Overview
Given a set of images and their corresponding image-level labels, the aim is to learn a model that can output pixel-level labels from the input image. It is important to note that pixel-level labels are not provided during training, and hence constitute the latent variables in the current model. An image is fed through a segmentation network that outputs a distribution over the labels for each pixel location. We refer to this distribution as the predicted distribution, since this is the only distribution that will be required during inference. Our aim is to ensure that the predicted distribution constitutes a valid segmentation mask for the input image. Hence, we impose multiple losses on the predicted distribution. In particular, the pixel-label estimator incorporates the image-label information in the predicted distribution to generate a distribution over pixel-level labels. This distribution can be thought of as an auxiliary ground truth, since the true pixel-level labels are not available. The segmentation network is trained using this auxiliary ground truth.
Next, the neighborhood estimator computes a smooth version of the output distribution by averaging the output of the neighbors for each location. We force the output of the CNN to be close to the neighborhood distribution. We further show that this is equivalent to enforcing the constraints of a CRF on the output of a CNN.
In the sequel, $X$ represents the input image and $Y$ represents the segmentation mask. The label of the pixel at the location indexed by $i$ is denoted as $y_i$. The various components of the model are discussed below.
3.2 Segmentation Network
The segmentation network receives the image as input and generates a distribution over segmentation masks as output. As is common in CNN-based training for segmentation [3, 15], we assume that the output distribution over pixel-level labels factorizes completely across locations. In particular, let $p(Y \mid X)$ be the conditional distribution over pixel-level labels given the image. We assume that $p(Y \mid X) = \prod_i p_i(y_i \mid X)$, where $p_i$ is the distribution at location $i$, and $y_i$ is the corresponding label. Furthermore, we assume that the distribution is parametrized by a CNN $f(\cdot\,; \theta)$, that is,
(3)  $p_i(y_i = l \mid X) = f_{i, l}(X; \theta)$
The segmentation network used in this paper is shown in Figure 2.
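As a minimal sketch of the factorized output of equation (3) (the function name is ours): the raw CNN scores at each location are turned into independent per-pixel label distributions with a location-wise softmax, so the mask distribution factorizes across the grid.

```python
import numpy as np

def pixelwise_softmax(scores):
    # scores: (H, W, L) raw CNN outputs; returns a distribution over the
    # L labels at every location, so p(Y|X) factorizes as in eq. (3)
    z = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```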
3.3 Pixel Label Estimator
Since ground truth information for the pixels is not available, we attempt to generate auxiliary ground truth information for each pixel from the output of the network. In particular, we infer a distribution $q_i$ over the labels for each location $i$ in the image from the output distribution of the segmentation network. Given this auxiliary ground truth, the classification objective can be rewritten as
(4)  $\mathcal{L}_{pixel} = -\sum_{i} \sum_{l} q_i(l) \log p_i(l \mid X)$
In the absence of any restriction, the model can choose to assign all the pixels to a single class, for instance, the background class. In order to prevent this from happening, one needs to ensure that for each class present in the image, at least a certain percentage of pixels is allotted to that class. Furthermore, one also needs to ensure that no pixels are allotted to classes absent from the image. Hence, we couple the distribution with a prior $\pi$ to obtain the distribution over the pixel-labels:
(5)  $q_i(l) = \frac{p_i(l \mid X)\,\pi(l)}{Z_i}$
for $l \in \{0, 1, \ldots, L\}$, and $Z_i = \sum_{l} p_i(l \mid X)\,\pi(l)$. In order to complete the description, we define the prior distribution as below:
(6)  $\pi(l) = \begin{cases} \alpha_l, & \text{if label } l \text{ is present in the image} \\ 0, & \text{otherwise} \end{cases}$
Images in the ImageNet dataset are assumed to contain only one foreground object, and hence the prior is supported on only two classes. We learn the constants $\alpha_l$ independently for each image. In particular, for each image, we learn the most non-informative prior that can guarantee the assignment of a certain percentage of pixels to each class present in the image, while assigning no pixels to classes absent from the image. In order to quantify information, we maximize the entropy of the prior distribution, while simultaneously forcing it to satisfy a set of constraints. That is,
(7)  $\max_{\pi} \; -\sum_{l} \pi(l) \log \pi(l)$
subject to  $\frac{1}{|\mathcal{G}|} \sum_{i} q_i(l) \geq \rho_l$ for every label $l$ present in the image,
and  $\pi(l) = 0$ for every label $l$ absent from the image.
The constant $\rho_l$ dictates the fraction of pixels that is guaranteed to belong to class $l$, if label $l$ is present in the image. We choose one value of $\rho_l$ for the background class, and a common value for all the other object classes present in the image.
For images in the ImageNet dataset, only two labels are present, and hence the above optimization problem contains only two variables, $\pi(0)$ and $\pi(f)$, where $f$ is the label of the foreground object present in the image. Furthermore, by equating $\pi(f)$ to $1 - \pi(0)$, we further reduce the number of variables from two to one. Hence, the above optimization problem reduces to a constrained optimization in a single variable, which can be solved very efficiently. This approach is discussed in further detail in the Appendix.
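The single-variable search described above can be sketched as follows. This is our illustration, not the paper's code: the function name, the threshold values, the grid resolution, and the use of a hard (argmax) assignment to check the pixel-share constraints are all assumptions. For a two-class prior, maximizing entropy amounts to picking the feasible background weight closest to the uniform value 0.5.

```python
import numpy as np

def adaptive_prior(p, fg, rho_bg=0.4, rho_fg=0.2):
    """Grid-search the background prior weight a = pi(background).

    p: (n, L) predicted per-pixel distributions; class 0 is background,
    fg is the single foreground class. Returns the feasible a closest
    to 0.5 (i.e. the maximum-entropy two-class prior), or None.
    """
    best = None
    for a in np.linspace(0.01, 0.99, 99):
        pi = np.zeros(p.shape[1])
        pi[0], pi[fg] = a, 1.0 - a
        hard = (p * pi).argmax(axis=1)       # reweighted hard assignment
        feasible = ((hard == 0).mean() >= rho_bg and
                    (hard == fg).mean() >= rho_fg)
        if feasible and (best is None or abs(a - 0.5) < abs(best - 0.5)):
            best = a
    return best
```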
3.4 Neighborhood Estimator
To ensure the correct alignment between the predicted boundaries and the actual boundaries, we utilize the following observation: pixels that lie close together and have similar color tend to have the same label. Hence, we force the distribution at location $i$ to be close to the distributions of its neighbors. Towards that end, we compute a neighborhood distribution $m_i$ for each location $i$, and minimize the KL-divergence between the output distribution at that location and the corresponding neighborhood distribution. The corresponding objective is given by
(8)  $\mathcal{L}_{nbr} = \sum_{i} \mathrm{KL}\!\big(p_i \,\|\, m_i\big)$
The combined objective is given by
(9)  $\mathcal{L} = \mathcal{L}_{pixel} + \lambda\, \mathcal{L}_{nbr}$
for some constant $\lambda > 0$. We propose two approaches for computing the neighborhood distribution.
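The combined objective of cross-entropy against the auxiliary labels plus a weighted KL term can be sketched as follows (the function name, the epsilon smoothing, and the default weight are our illustrative choices):

```python
import numpy as np

def combined_loss(p, q_aux, m_nbr, lam=0.5):
    # p: (n, L) predicted distributions; q_aux: auxiliary ground truth
    # of eq. (5); m_nbr: neighborhood distributions. Computes
    # L = L_pixel + lam * L_nbr, with L_pixel the cross-entropy of
    # eq. (4) and L_nbr the KL-divergence of eq. (8), averaged over n.
    eps = 1e-12
    l_pixel = -(q_aux * np.log(p + eps)).sum(axis=1).mean()
    l_nbr = (p * (np.log(p + eps) - np.log(m_nbr + eps))).sum(axis=1).mean()
    return float(l_pixel + lam * l_nbr)
```

When the prediction already matches its neighborhood distribution, the KL term vanishes and only the pixel-label term remains.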
Weighted mean: In this approach, the neighborhood distribution is computed as follows:
(10)  $m_i(l) = \sum_{j \in N(i)} \bar{k}(i, j)\, p_j(l)$
for $\bar{k}(i, j) = k(i, j) / \sum_{j' \in N(i)} k(i, j')$. Here, $k(i, j)$ is a measure of similarity between the locations $i$ and $j$. For our purpose, we define the neighbors as all the locations that lie close to the current location and whose pixels have a similar color. Hence, we use the contrast-sensitive two-kernel potential [12], defined in terms of pixel locations and pixel brightness as follows:
(11)  $k(i, j) = w_1 \exp\!\Big(-\frac{\|s_i - s_j\|^2}{2\theta_\alpha^2} - \frac{\|I_i - I_j\|^2}{2\theta_\beta^2}\Big) + w_2 \exp\!\Big(-\frac{\|s_i - s_j\|^2}{2\theta_\gamma^2}\Big)$
where $s_i$ and $I_i$ denote the position and color of the pixel at location $i$. Here, $w_1$, $w_2$, $\theta_\alpha$, $\theta_\beta$ and $\theta_\gamma$ are hyperparameters that are fixed during training.
As discussed in [12], the second term prevents the formation of small isolated regions as segments.
Exponentiated weighted mean: As the name suggests, the neighborhood distribution in this approach is obtained by exponentiating the weighted mean. The exponentiation causes the neighborhood distribution to be sharper, resulting in high-confidence predictions.
(12)  $m_i(l) = \frac{1}{Z_i} \exp\!\Big(\sum_{j \in N(i)} \bar{k}(i, j)\, p_j(l)\Big)$
where $\bar{k}$ is the normalized kernel of equation (10), and $Z_i$ is a normalization constant that ensures that the above distribution sums up to 1. Next, we show the connection between the exponentiated weighted mean and the CRF.
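Both estimators can be sketched on flattened pixel arrays as follows. The function names and hyperparameter values are placeholders, not the paper's settings; the kernel is the contrast-sensitive two-kernel similarity of [12], row-normalized with the diagonal zeroed so a pixel is not its own neighbor.

```python
import numpy as np

def two_kernel(pos, rgb, theta_a=1.0, theta_b=1.0, theta_g=1.0,
               w1=1.0, w2=1.0):
    # pos: (n, 2) pixel positions; rgb: (n, c) pixel colors
    d_pos = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(-1)
    d_rgb = ((rgb[:, None, :] - rgb[None, :, :]) ** 2).sum(-1)
    k = (w1 * np.exp(-d_pos / (2 * theta_a**2) - d_rgb / (2 * theta_b**2))
         + w2 * np.exp(-d_pos / (2 * theta_g**2)))
    np.fill_diagonal(k, 0.0)                  # a pixel is not its own neighbor
    return k / k.sum(axis=1, keepdims=True)   # normalized kernel k-bar

def neighborhood(p, k_norm, exponentiate=False):
    m = k_norm @ p                            # weighted mean, eq. (10)
    if exponentiate:                          # exponentiated weighted mean,
        m = np.exp(m)                         # eq. (12)
        m /= m.sum(axis=1, keepdims=True)
    return m
```

Since each row of the normalized kernel and of `p` sums to 1, the weighted mean is automatically a valid distribution; the exponentiated variant needs the explicit renormalization.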
3.5 Connections with CRF
In this section, we provide a formal justification for the choice of the neighborhood-based objective function. In particular, we will show that the objective emerges naturally when a CRF is used as a prior while computing the conditional log-likelihood.
Given an image $X$, let the CRF prior over the segmentation masks be defined as below:
(13)  $p^{crf}(Y \mid X) = \frac{1}{Z(X)} \exp\!\Big(\sum_{\{i, j\}} \psi_{ij}(y_i, y_j)\Big)$
where $Z(X)$ is the normalization constant and the sum runs over unordered pairs of locations. Note that the prior distribution has no unary potentials. We will further assume that the binary potentials have no trainable parameters.
The prior provides a distribution over all possible segmentation masks for a given image. Furthermore, let $\psi_{ij}$ have the form
(14)  $\psi_{ij}(y_i, y_j) = k(i, j)\,[y_i = y_j]$
for some choice of kernel $k$. The corresponding CRF prior gives low probability to masks that assign different labels to pixels $i$ and $j$ with high similarity (that is, high $k(i, j)$). This is a reasonable prior assumption about the segmentation mask of an image. Note that the prior does not penalize masks that assign the same label to pixels $i$ and $j$ with low similarity. This allows the inclusion of object classes with multi-colored instances. For instance, the dress a person wears will often be colored differently from their skin.
If the CRF prior is approximated by a fully factorized distribution, the resultant distribution will have to satisfy the constraints entailed in Proposition 3.1.
Proposition 3.1.
Let $q$ be the mean field approximation to the CRF prior $p^{crf}$. That is, among all distributions of the form $q(Y) = \prod_i q_i(y_i)$, let $q$ be the one that minimizes $\mathrm{KL}(q \,\|\, p^{crf})$. Then the distribution $q$ satisfies the following constraints:
(15)  $q_i(l) = \frac{1}{Z_i} \exp\!\Big(\sum_{j \neq i} k(i, j)\, q_j(l)\Big)$
for every location $i \in \mathcal{G}$ and every label $l$.
These constraints are referred to as mean field constraints. The proof of the Proposition is given in the Appendix. $Z_i$ is the normalization constant that ensures that the distribution $q_i$ sums up to 1.
Coming back to the predictive distribution $p$ for segmentation masks in our model, if one wishes to impose a CRF prior on $p$, one must force it to satisfy the mean-field constraints, that is,
(16)  $p_i(l) = \frac{1}{Z_i} \exp\!\Big(\sum_{j \neq i} k(i, j)\, p_j(l)\Big)$
for every location $i \in \mathcal{G}$ and label $l$. The distribution on the RHS of the above equation is exactly the neighborhood distribution of equation (12), with the exception that the kernel is normalized in (12). The distribution $p$ is defined in terms of the output of a neural network. Hence, instead of the equality constraints imposed by the mean field, we add the term $\mathrm{KL}(p_i \,\|\, m_i)$ to the objective. The KL-divergence term forces the output of the network to satisfy the mean field constraints imposed by the CRF prior.
Note: The binary potential used in the CRF prior in this section is given by $k(i, j)\,[y_i = y_j]$. In contrast, the binary potential commonly used for semantic segmentation has the form $-k(i, j)\,[y_i \neq y_j]$. However, when the kernel is normalized at the pixel level (as has been suggested in [12]), the resultant distributions are exactly the same.
4 Relation with similar works
Recent works on semantic segmentation using deep architectures have focused on pairwise CRFs with only unary and binary potentials. The unary potentials were specified by the output of a CNN while the binary potentials have no learnable parameters [4, 31, 22]. The work in [14] allows the binary potentials to be learnable as well.
Our work differs significantly from the above-mentioned works in the learning algorithm used for training the parameters of the CRF. Most of the works that combine CNNs with CRFs use piecewise learning [24, 14, 4]; that is, the energy function is decomposed into its potentials, and each potential is normalized individually. For instance, if $\psi_c$, $c = 1, \ldots, C$, are the potentials that form the energy function, the piecewise approximation to the objective is given by
(17)  $\log p(Y \mid X) \approx \sum_{c} \log \frac{\psi_c(Y_c, X)}{Z_c}$
where $Z_c$ is the normalization constant for the $c^{th}$ potential.
Hence, in a pairwise CRF, with unary potentials given by the output of a CNN, each output location of the CNN is trained independently. Furthermore, the contribution of the pairwise potentials is not incorporated during training of the parameters of the CNN. Hence, the training is equivalent to the training of several independent classifiers, one for each location in the segmentation mask.
While piecewise training is extremely efficient, the lower bound that it optimizes is a very weak lower bound on the true log-likelihood. By training the local classifiers without incorporating the binary potentials, we ignore the dependence among labels of nearby pixels with similar color. More importantly, it is completely unsuitable when pixel-level labels are absent, which is the main concern of this paper.
More recently, several authors have considered training the mean field approximation rather than the actual CRF distribution [11, 31, 22] for semantic segmentation. The mean-field approximation for a distribution $p$ is the fully factorized distribution $q$ that minimizes the KL-divergence $\mathrm{KL}(q \,\|\, p)$. By computing the gradient of the KL-divergence with respect to $q$, and equating it to 0, one can obtain an iterative algorithm for finding the minima. For a pairwise CRF, the mean-field update equations are given by
$q_i^{(t+1)}(l) \propto \exp\!\Big(-\psi_u(l) - \sum_{j \in N(i)} \sum_{l'} \psi_b(l, l')\, q_j^{(t)}(l')\Big)$
where $q_i^{(t)}$ is the distribution at the $i^{th}$ location of the mean field approximation at the $t^{th}$ iteration.
Hence, the mean field approximation at the $(t+1)^{th}$ iteration can be defined recursively, as a function of the mean-field approximation at the $t^{th}$ iteration and the potential functions. Consequently, the gradient of the mean-field distribution at iteration $t+1$ can be written as a function of the gradient of the approximation at iteration $t$ and the gradients of the potential functions. This approach for training CRFs has been used in [11, 31, 22].
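The recursive update above can be sketched as a fixed-point iteration (our illustration: the function name is hypothetical, and the attractive Potts messages are folded into a single normalized-kernel matrix product, as in the prior of Section 3.5):

```python
import numpy as np

def mean_field(unary_logp, k_norm, iters=10):
    """Fixed-point iteration for the mean-field constraints.

    unary_logp: (n, L) log of the unary (e.g. CNN) distributions
    k_norm: (n, n) normalized neighbor kernel (zero diagonal)
    Each step sets q_i proportional to exp(unary message + kernel-weighted
    average of the neighbors' current distributions).
    """
    q = np.exp(unary_logp)
    q /= q.sum(axis=1, keepdims=True)         # initialize with the unaries
    for _ in range(iters):
        s = unary_logp + k_norm @ q           # unary term + neighbor message
        s -= s.max(axis=1, keepdims=True)     # numerical stability
        q = np.exp(s)
        q /= q.sum(axis=1, keepdims=True)     # renormalize each location
    return q
```

With a zero kernel the iteration leaves the unary distributions untouched; a nonzero kernel pulls each location toward its similarly colored neighbors.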
5 Experimental setup
5.1 Network architecture
For our experiments, we used a pretrained VGG16 network (trained on ImageNet for classification; torchvision: https://github.com/pytorch/vision/tree/master/torchvision), and modified it for the task of semantic segmentation. The VGG16 network consists of 13 convolutional layers and 3 fully connected layers. First, we removed the fully connected layers and the last pooling layer. The resultant network generates downsampled feature maps from the input image. The receptive field of a neuron in the last convolutional layer nearly encompasses the entire input image. This implies that every neuron in the last layer has access to almost the entire image. This network serves as the encoder in our model.
To learn fine-grained contours of objects, we added skip connections from the layers of the encoder to the layers of the decoder. The receptive-field size of the neurons in the encoder is much smaller than that of their counterparts in the decoder. Hence, they have access to more fine-grained information. We perform convolution on the outputs of the encoder layers before concatenating them with the layers of the decoder. The concatenated output is fed through multiple layers of convolution and non-linearities, to allow the final prediction layer to learn a non-linear mapping of the local and global information. The final architecture of our model is shown in Figure 1.
5.2 Dataset
Models that use weak supervision for training often employ a much larger dataset than fully supervised models. For instance, the model in [28] mines the pages of Flickr (https://www.flickr.com/) to generate the training data. Similarly, the model in [21] uses a subset of the Flickr dataset [8] for training. A clean subset of the ImageNet dataset is used in [7], while a much larger subset of the same dataset is used in [19]. More importantly, almost all approaches for semantic segmentation utilize the ImageNet dataset for pretraining the network.
We follow the experimental setup of [19]. In particular, we downloaded images of objects belonging to the object classes of the VOC 2012 dataset [6] from the ImageNet database [20]. Several authors mine the dataset further to obtain a set of simple images which contain the object against a plain background. However, we choose to use the entire dataset to minimize manual intervention. A script to download the ImageNet classes used in training will soon be available.
For testing, we use the validation and test sets of the VOC 2012 dataset. For most of our experiments, we use the validation set only, since its ground truth is publicly available. The test set is used only for comparing the final model against the state of the art.
Almost all approaches for weakly supervised semantic segmentation report results on the VOC 2012 dataset. This is primarily because the classes in VOC 2012 are discrete object categories such as sheep, person, dog, etc. Most images in VOC 2012 contain very few of these object classes, and hence, it is easy to utilize weak labels for training. Two classes that almost always occur together, for instance road and sky, would be impossible to discern using weak labels only.
5.3 Training protocol
Stochastic gradient descent (SGD) with mini-batches of images is used for training. Separate initial learning rates are used for the pretrained layers and the newly added layers. Momentum and weight decay are used for gradient descent. We halve the learning rate at regular intervals, and all networks are trained for a fixed number of iterations.
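The halving schedule amounts to a simple step decay. The function below is an illustrative sketch with placeholder names, not the exact values used in our experiments:

```python
def step_decay(base_lr, iteration, halve_every):
    # learning rate after `iteration` steps when it is halved
    # after every `halve_every` iterations
    return base_lr * 0.5 ** (iteration // halve_every)
```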
6 Experiments
We use a weighted combination of the objective based on auxiliary labels and the neighborhood-based objective in our experiments. For our comparisons, we have limited ourselves to models that do not employ pixel-level information. Hence, we have excluded models that employ class-agnostic segmentation or manually labeled saliency masks [7, 26], with the exception of STC [28].
(18)  $\mathcal{L} = \mathcal{L}_{pixel} + \lambda\, \mathcal{L}_{nbr}$
The model is trained on the ImageNet subset and evaluated on the VOC 2012 validation set.
Hyperparameters: We use the same hyperparameter settings for the kernel as used in the publicly available code associated with [12]. In particular, we modify the code in [12] to allow us to compute the weighted and exponentiated weighted means. To account for the fact that our network reduces the segmentation mask to one-fourth the size of the input image, we divide the spatial kernel widths $\theta_\alpha$ and $\theta_\gamma$ by the same factor during training.
We also evaluate the effect of the hyperparameter $\lambda$ on the performance of the model. Note that the KL-divergence term is minimized when all the pixels are assigned the same label. Hence, when a higher weight is given to the KL-divergence term, all the pixels get assigned to the background class (since it is the most prominent class). In contrast, if $\lambda$ is close to 0, the boundaries of the segmentation masks learnt by the model do not coincide with the boundaries of the objects. We vary the value of $\lambda$ to study its effect on the performance of the model. The results are shown in Figure 3. As can be observed, when the weighted mean is used as the neighborhood distribution, the model remains quite stable to the choice of $\lambda$. However, the overall performance achieved is worse than that of the model that relies on the exponentiated weighted mean (51.6% vs. 50.7%). Segmentation masks generated using the exponentiated weighted mean are shown in Figure 2.
We also compare the proposed model against other models for weakly supervised segmentation on the VOC 2012 val set. Note that generating segmentation masks involves identifying the object and identifying the boundary of the object. If the boundary of the object can be identified correctly, segmentation reduces to classifying the segmented object, which can be done using a classifier with relative ease. Hence, models that employ networks pretrained on boundary detection are at an obvious advantage, and are excluded from comparison. This includes [7], which employs ground-truth saliency masks, since the boundaries in the saliency masks used in that paper coincide with the boundaries of the segmentation masks.
The models used for comparison are (i) MIL+ILP [19], (ii) EM-Adapt [16], (iii) CCNN [17], (iv) SEC [10], and (v) STC [28]. Among these models, SEC employs separate convolutional networks for computing the saliency maps for foreground and background. On the other hand, the saliency network in STC is trained using dense (pixel-level) supervision. The results are given in Table 1. Note that the proposed model is similar in spirit to CCNN [17] and EM-Adapt [16], with the exception of the incorporation of a neighborhood-based objective. One can observe that the incorporation of neighborhood information results in drastically improved performance.
Finally, we evaluate the performance of our model on the VOC 2012 test set by submitting our results to the evaluation server of VOC 2012. An anonymous view of the results is available at http://host.robots.ox.ac.uk:8080/anonymous/BEC3EB.html. We achieve an accuracy of 52.01%, which is the state of the art for weakly supervised semantic image segmentation on this dataset.
Table 1: Per-class results on the VOC 2012 validation set.

class | MIL+ILP | EM-Adapt | CCNN | SEC | STC | Ours
background | 77.2 | 67.2 | 68.5 | 82.4 | 84.5 | 85.4
aeroplane | 37.3 | 29.2 | 25.5 | 62.9 | 68.0 | 68.4
bike | 18.4 | 17.6 | 17.0 | 26.4 | 19.5 | 28.3
bird | 25.4 | 28.6 | 25.4 | 61.6 | 60.5 | 63.6
boat | 28.2 | 22.2 | 20.2 | 27.6 | 42.5 | 42.9
bottle | 31.9 | 29.6 | 26.3 | 38.1 | 44.8 | 54.4
bus | 41.6 | 47.0 | 46.8 | 66.6 | 68.4 | 62.7
car | 48.1 | 44.0 | 47.1 | 62.7 | 64.0 | 62.9
cat | 50.7 | 44.2 | 48.0 | 75.2 | 64.8 | 67.5
chair | 12.7 | 14.6 | 15.8 | 22.1 | 14.5 | 10.6
cow | 53.5 | 45.7 | 35.1 | 53.5 | 52.0 | 46.3
diningtable | 14.6 | 24.9 | 21.0 | 28.3 | 22.8 | 37.2
dog | 50.9 | 41.0 | 44.5 | 65.8 | 58.0 | 48.7
horse | 44.1 | 34.8 | 34.5 | 57.8 | 55.3 | 53.8
motorbike | 39.2 | 41.6 | 46.2 | 62.3 | 57.8 | 57.3
person | 37.9 | 32.1 | 40.7 | 62.3 | 60.5 | 64.7
plant | 28.3 | 30.4 | 24.8 | 32.5 | 40.6 | 44.3
sheep | 44.0 | 36.3 | 37.4 | 62.6 | 56.7 | 58.2
sofa | 19.6 | 24.0 | 22.2 | 32.1 | 23.0 | 35.8
train | 37.6 | 38.1 | 38.8 | 45.4 | 57.1 | 43.6
tvmonitor | 35.0 | 31.6 | 36.9 | 45.3 | 31.2 | 47.4
mean | 36.6 | 33.8 | 35.3 | 50.7 | 49.8 | 51.6
7 Conclusions
In this paper, we have proposed a new model for weakly supervised semantic image segmentation that uses only image-level labels. We have shown that the output of the CNN can be forced to satisfy the constraints of a conditional random field, without explicitly evaluating the mean-field distribution at every step. We achieve this by forcing the output distribution at every pixel to be close to that of its neighbors. As a consequence, we achieve a significant performance improvement over traditional CNNs with a negligible increase in training time.
We focused on weakly annotated images in this paper, since CRFs can achieve drastic performance improvements for this task. When pixel-level information is available, forcing the pixel-level labels to be close to their neighbors will serve as an unnecessary and often over-smooth regularizer, unless only rough object boundaries are available. A model capable of handling rough object boundaries can drastically reduce the time required for manually generating segmentation masks. We intend to explore the utility of the model for handling rough boundaries in the future.
8 Acknowledgement
The authors acknowledge financial support for the "CyberGut" expedition project by the Robert Bosch Centre for Cyber Physical Systems at the Indian Institute of Science, Bengaluru.
References

[1] A. Bearman, O. Russakovsky, V. Ferrari, and L. Fei-Fei. What's the point: Semantic segmentation with point supervision. In European Conference on Computer Vision, pages 549–565. Springer, 2016.
[2] Y. Y. Boykov and M.-P. Jolly. Interactive graph cuts for optimal boundary & region segmentation of objects in n-d images. In Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV), volume 1, pages 105–112. IEEE, 2001.
[3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv preprint arXiv:1412.7062, 2014.
[4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint arXiv:1606.00915, 2016.
[5] J. Dai, K. He, and J. Sun. BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1635–1643, 2015.
[6] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[7] Q. Hou, P. K. Dokania, D. Massiceti, Y. Wei, M.-M. Cheng, and P. Torr. Mining pixels: Weakly supervised semantic segmentation using image labels. arXiv preprint arXiv:1612.02101, 2016.
[8] M. J. Huiskes, B. Thomee, and M. S. Lew. New trends and ideas in visual concept detection: The MIR Flickr retrieval evaluation initiative. In Proceedings of the International Conference on Multimedia Information Retrieval, pages 527–536. ACM, 2010.
[9] P. Kohli, P. H. Torr, et al. Robust higher order potentials for enforcing label consistency. International Journal of Computer Vision, 82(3):302–324, 2009.
[10] A. Kolesnikov and C. H. Lampert. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In European Conference on Computer Vision, pages 695–711. Springer, 2016.
[11] P. Krähenbühl and V. Koltun. Parameter learning and convergent inference for dense random fields. In Proceedings of the 30th International Conference on Machine Learning, pages 513–521, 2013.
[12] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems 24, pages 109–117. Curran Associates, Inc., 2011.
[13] V. Lempitsky, P. Kohli, C. Rother, and T. Sharp. Image segmentation with a bounding box prior. In IEEE 12th International Conference on Computer Vision, pages 277–284. IEEE, 2009.
[14] G. Lin, C. Shen, A. van den Hengel, and I. Reid. Efficient piecewise training of deep structured models for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3194–3203, 2016.
[15] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[16] G. Papandreou, L.-C. Chen, K. Murphy, and A. L. Yuille. Weakly- and semi-supervised learning of a DCNN for semantic image segmentation. arXiv preprint arXiv:1502.02734, 2015.
[17] D. Pathak, P. Krähenbühl, and T. Darrell. Constrained convolutional neural networks for weakly supervised segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1796–1804, 2015.
[18] D. Pathak, E. Shelhamer, J. Long, and T. Darrell. Fully convolutional multi-class multiple instance learning. arXiv preprint arXiv:1412.7144, 2014.
[19] P. O. Pinheiro and R. Collobert. From image-level to pixel-level labeling with convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1713–1721, 2015.
[20] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[21] F. Saleh, M. S. A. Akbarian, M. Salzmann, L. Petersson, S. Gould, and J. M. Alvarez. Built-in foreground/background prior for weakly-supervised semantic segmentation. In European Conference on Computer Vision, pages 413–432. Springer, 2016.
[22] A. G. Schwing and R. Urtasun. Fully connected deep structured networks. arXiv preprint arXiv:1503.02351, 2015.
[23] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In European Conference on Computer Vision, pages 1–15. Springer, 2006.
[24] C. Sutton and A. McCallum. Piecewise training for undirected models. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pages 568–575. AUAI Press, 2005.
[25] A. Vezhnevets, V. Ferrari, and J. M. Buhmann. Weakly supervised semantic segmentation with a multi-image model. In IEEE International Conference on Computer Vision (ICCV), pages 643–650. IEEE, 2011.
[26] Y. Wei, J. Feng, X. Liang, M.-M. Cheng, Y. Zhao, and S. Yan. Object region mining with adversarial erasing: A simple classification to semantic segmentation approach. arXiv preprint arXiv:1703.08448, 2017.
[27] Y. Wei, X. Liang, Y. Chen, Z. Jie, Y. Xiao, Y. Zhao, and S. Yan. Learning to segment with image-level annotations. Pattern Recognition, 59:234–244, 2016.
[28] Y. Wei, X. Liang, Y. Chen, X. Shen, M.-M. Cheng, J. Feng, Y. Zhao, and S. Yan. STC: A simple to complex framework for weakly-supervised semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
[29] J. Xu, A. G. Schwing, and R. Urtasun. Learning to segment under various forms of weak supervision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[30] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[31] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1529–1537, 2015.
On Adaptive prior
Let $X$ be an image and $\mathcal{L}_X$ be the corresponding set of image-level labels (including the background class). The optimization problem for obtaining the pixel-labels using an adaptive prior is given below:
(19)  $\min_{\pi} \; \sum_{l} \pi(l) \log \pi(l)$
subject to  $\frac{1}{|\mathcal{G}|} \sum_{i} q_i(l) \geq \rho_l$ for all $l \in \mathcal{L}_X$, and $\pi(l) = 0$ for all $l \notin \mathcal{L}_X$.
Here, $\rho_l$ is a constant that determines the minimum fraction of pixels in the image that must be assigned to class $l$. When the image contains objects from a single object class (as is the case with the ImageNet dataset), say $f$, the above optimization problem can be rewritten as shown below:
(20)  $\min_{\alpha \in [0, 1]} \; \alpha \log \alpha + (1 - \alpha) \log (1 - \alpha)$
subject to  $\frac{1}{|\mathcal{G}|} \sum_{i} q_i(0) \geq \rho_0$ and $\frac{1}{|\mathcal{G}|} \sum_{i} q_i(f) \geq \rho_f$
where
(21)  $q_i(0) = \frac{\alpha\, p_i(0)}{\alpha\, p_i(0) + (1 - \alpha)\, p_i(f)}, \qquad q_i(f) = 1 - q_i(0)$
and $\alpha = \pi(0)$.
The above problem is an optimization problem in a single variable. We can solve it approximately by evaluating the constraints for several values of $\alpha$. This can be achieved very efficiently on a GPU, since it involves element-wise operations on matrices of probability distributions. Let $S$ be the set of all the selected values of $\alpha$ which satisfy the constraints. Among these values, we return the value of $\alpha$ which minimizes the objective.
If none of the selected values of $\alpha$ satisfy the constraints, we choose the value of $\alpha$ which minimizes the following measure of constraint violation:
(22)  $\max\!\Big(0,\, \rho_0 - \frac{1}{|\mathcal{G}|} \sum_{i} q_i(0)\Big) + \max\!\Big(0,\, \rho_f - \frac{1}{|\mathcal{G}|} \sum_{i} q_i(f)\Big)$
Proof of Proposition 3.1
Proposition 3.1 Let $q$ be the mean field approximation to the CRF prior $p^{crf}$. That is, among all distributions of the form $q(Y) = \prod_i q_i(y_i)$, let $q$ be the one that minimizes $\mathrm{KL}(q \,\|\, p^{crf})$. Then the distribution $q$ satisfies the following constraints:
(23)  $q_i(l) = \frac{1}{Z_i} \exp\!\Big(\sum_{j \neq i} k(i, j)\, q_j(l)\Big)$
Proof.
The proof of this result can be obtained by differentiating the KL-divergence with respect to the components of $q$ and equating it to 0. Towards that end, we expand the KL-divergence term as given below:
(24)  $\mathrm{KL}(q \,\|\, p^{crf}) = \sum_{i} \sum_{l} q_i(l) \log q_i(l) - \sum_{\{i, j\}} \sum_{l} k(i, j)\, q_i(l)\, q_j(l) + C$
Here, $C$ is a constant that does not depend on $q$. Differentiating the above expression with respect to $q_i(l)$, with a Lagrange multiplier for the normalization of $q_i$, and equating it to 0, we get
(25)  $q_i(l) \propto \exp\!\Big(\sum_{j \neq i} k(i, j)\, q_j(l)\Big)$
Finally, since $q_i$ is a probability distribution that sums up to 1, we normalize to get the desired result. ∎