1 Introduction
The semantic segmentation task has rapidly advanced with the use of Convolutional Neural Networks (CNNs) [5, 6, 24, 37]. The performance of CNNs, however, is largely dependent on the availability of a large corpus of annotated training data, which is both costly and time-intensive to acquire. The pixel-level annotation of an image in PASCAL VOC takes on the order of minutes [4], which is likely a conservative estimate given that it is based on the COCO dataset [23], in which ground truths are obtained by annotating polygon corners rather than pixels directly. In response, recent works focus on weakly-supervised semantic segmentation [4, 19, 27, 28, 30, 31, 35]. These differ from fully-supervised cases in that, rather than having pixel-level ground-truth segmentations, the supervision available is of some lesser degree (image-level labels [19, 27, 28, 30], bounding boxes [27], or points and scribbles [4, 22, 34]).
In this work, we address the semantic segmentation task using only image labels, which specify the object categories present in the image. Our motivation for this is twofold: (i) the annotation of an image with object classes in PASCAL VOC is estimated to take only seconds [4], which is substantially faster than a pixel-level annotation and is also scalable, and (ii) images with their image labels or tags can easily be downloaded from the Internet, providing a rich and virtually infinite source of training data. The method we adopt, similarly to the weakly-supervised semantic segmentation of [27], takes the form of Expectation-Maximization (EM) [10, 25]. An EM-based approach has three key steps: (i) initialization; (ii) latent posterior estimation (E-step); and (iii) parameter update (M-step). We focus on all of these aspects. In what follows we briefly discuss each of them.
We provide an informed initialization to the EM algorithm by training an initial model for the semantic segmentation task using an approximate ground truth, obtained by combining a class-agnostic saliency map [15] with class-specific attention maps [36] on a set of simple images containing a single object category (the ImageNet [11] classification dataset). The intuition is that it is easier to learn to segment simple images first and then progress to more complex ones, similar to the work of [35]. Given an image, our saliency map finds the object (Figure 1(b)); this is a class-agnostic ‘bottom-up’ cue. In addition, once provided with the class present in the image (from the image label, ‘boat’ in this case), our attention map (Figure 1(c)) gives the ‘top-down’ class-specific regions in the image. Since both saliency and attention maps are tasked with finding the same object, their combination is more powerful than either one used in isolation, as shown in Figure 1(d). The combined probability map is then used as the ground truth for training an initial model for the semantic segmentation task. The trained initial model provides the initialization parameters for the follow-up EM algorithm. Notice that this initialization is in contrast to [27], where the initial model is trained for the image classification task on the same ImageNet dataset. Experimentally, we have found that this simple way of combining bottom-up with top-down cues on the ImageNet dataset (with no images from PASCAL VOC 2012) allows us to obtain an initial model capable of outperforming all current state-of-the-art algorithms for the weakly-supervised semantic segmentation task on the PASCAL VOC 2012 dataset. This is surprising since these algorithms are significantly more complex and rely on higher degrees of supervision such as bounding boxes, points/squiggles, and superpixels. This clearly indicates the importance of learning from simple images before delving into more complex ones.
With the trained initial model, we then incorporate PASCAL VOC images (with multiple objects) for the E and M steps of our EM-based algorithm.
In the E-step, we obtain the latent posterior probability distribution by constraining (or regularizing) the CNN likelihood using an image-label-based prior. This reduces many false positives by redistributing the probability mass (initially spread over all object categories) among only the labels present in the image and the background. In the M-step, the parameter update step, we then minimize a combination of the standard softmax loss (where the ground truth is assumed to be a Dirac delta distribution) and the KL divergence [21] between the latent posterior distribution (obtained in the E-step) and the likelihood given by the CNN. In the weakly-supervised setting this makes the approach more robust than using the softmax loss alone, since in the case of confusing classes the latent posterior (from the E-step) can sometimes be completely wrong. In addition, to obtain better CNN parameters, we add a probabilistic approximation of the Intersection-over-Union (IoU) [1, 9, 26] to the above loss function.
With this intuitive approach we obtain state-of-the-art results for the weakly-supervised semantic segmentation task on the PASCAL VOC 2012 benchmark [12], beating the best method that uses image-label supervision by 10%.
2 Related Work
Work in weakly-supervised semantic segmentation has explored varying levels of supervision, including combinations of image labels [19, 27, 28, 35], annotated points [4], squiggles [22, 34], and bounding boxes [27]. Papandreou et al. [27] employ an EM-based approach with supervision from image labels and bounding boxes. Their method iterates between inferring a latent segmentation (E-step) and optimizing the parameters of a segmentation network (M-step) by treating the inferred latents as the ground-truth segmentation. Similarly, [35] train an initial network using saliency maps, following which a more powerful network is trained using the output of the initial network. The MIL frameworks of [30] and [29] use fully convolutional networks to learn pixel-level semantic segmentations from only image labels. The image labels, however, provide no information about the position of the objects in an image. To address this, localization cues can be used [30, 31], obtained through indirect methods like bottom-up proposal generation (for example, MCG [3]), and saliency-based [35] and attention-based [36] mechanisms. Localization cues can also be obtained directly through point- and squiggle-level annotations [4, 22, 34].
Our method is most similar to the EM-based approach of [27]. We use saliency and attention maps to learn a network for a simplified semantic segmentation task, which provides a better initialization of the EM algorithm. This is in contrast to [27], where a network trained for a classification task is used as initialization. Also, different from [27], where the latent posterior is approximated by a Dirac delta function (which we argue is too harsh a constraint in a weakly-supervised setting), we instead propose to use the combination of the true posterior distribution and the Dirac delta function to learn the parameters.
3 The Semantic Segmentation Task
Consider an image $X$ consisting of a set of pixels $\{x_1, \ldots, x_N\}$. Each pixel can be thought of as a random variable taking a value from a discrete semantic label set $\mathcal{L} = \{l_0, l_1, \ldots, l_C\}$, where $C$ is the number of object classes and $l_0$ is the background. Under this setting, a semantic segmentation is defined as the assignment of all pixels to their corresponding semantic labels, denoted as $Y = \{y_1, \ldots, y_N\}$. CNNs are extensively used to model the class-conditional likelihood for this task. Specifically, assuming each random variable (or pixel) to be independent, a CNN models a likelihood function of the form $P(Y \mid X; \theta) = \prod_{n=1}^{N} p_n(y_n; \theta)$, where $p_n(y_n; \theta)$ is the softmax probability (or the marginal) of assigning label $y_n$ to the $n$-th pixel, obtained by applying the softmax function $p_n(c; \theta) = \exp(f_n(c; \theta)) / \sum_{k} \exp(f_n(k; \theta))$ to the CNN outputs $f_n(\cdot; \theta)$, such that $\sum_{c} p_n(c; \theta) = 1$. Given a training dataset $\mathcal{D} = \{X_i, Y_i\}_{i=1}^{M}$, where $X_i$ and $Y_i$ represent the $i$-th image and its corresponding ground-truth semantic segmentation, respectively, the log-likelihood is maximized by minimizing the cross-entropy loss function using the backpropagation algorithm to obtain the optimal $\theta$. At test time, for a given image, the learned $\theta$ is used to obtain the softmax probabilities for each pixel. These probabilities are either post-processed or used directly to assign semantic labels to each pixel.
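The per-pixel softmax likelihood and cross-entropy loss described above can be sketched in a few lines of NumPy (a toy sketch for illustration only, not the paper's Caffe implementation; shapes and function names are ours):

```python
import numpy as np

def softmax(logits):
    """Per-pixel softmax over the class axis (the last axis)."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, gt_labels):
    """Mean negative log-probability of the ground-truth label at each pixel."""
    flat = probs.reshape(-1, probs.shape[-1])
    idx = np.arange(gt_labels.size)
    return -np.log(flat[idx, gt_labels.ravel()]).mean()

# Toy example: a 2x2 "image" with 3 classes (background + 2 objects).
logits = np.zeros((2, 2, 3))        # uniform scores at every pixel
probs = softmax(logits)             # each pixel gets probability 1/3 per class
gt = np.array([[0, 1], [2, 1]])     # ground-truth label per pixel
loss = cross_entropy(probs, gt)     # log(3) for uniform predictions
```

Minimizing this loss over the training set is exactly the maximum-likelihood training described above.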
4 Weakly-Supervised Semantic Segmentation
As mentioned in Sec. 3, to find the optimal $\theta$ for the semantic segmentation task, we need a dataset with ground-truth pixel-level semantic labels, which is highly time-consuming and expensive to obtain: for a given image, annotating it with its object classes takes only seconds, while a pixel-wise segmentation takes minutes [4]. This does not scale to larger numbers of images and classes. Motivated by this, we use an Expectation-Maximization (EM) [10, 25] based approach for weakly-supervised semantic segmentation using only image labels. Let $\mathcal{L}$ denote the set of object labels we are interested in, and let $\mathcal{D}_w = \{X_i, Z_i\}_{i=1}^{M}$ be a weak dataset, where $X_i$ and $Z_i$ are the $i$-th image and the set of labels present in the image, respectively. The task is to learn an optimal $\theta$ using $\mathcal{D}_w$.
4.1 The EM Algorithm
Similar to [27], we treat the unknown semantic segmentation $Y$ as the latent variable. Our probabilistic graphical model is of the following form (Fig. 2):

$$P(X, Y, Z; \theta) = P(X)\, P(Y \mid X; \theta)\, P(Z \mid Y) \quad (1)$$

where we assume that $P(Z \mid Y, X) = P(Z \mid Y)$. Briefly, to learn $\theta$ while maximizing the above joint probability distribution, the three major steps of an EM algorithm are: (i) initialize the parameters $\theta_0$; (ii) E-step: compute the expected complete-data log-likelihood $Q(\theta; \theta')$; and (iii) M-step: update $\theta$ by maximizing $Q(\theta; \theta')$. In what follows, we first discuss how to obtain a good initialization $\theta_0$ in order to avoid poor local maxima, and then discuss optimizing the parameters (E and M steps) for a given $\theta_0$.
4.2 Initialization: Skipping Poor Local Maxima Using Bottom-Up Top-Down Cues
It is well known that when the log-likelihood has several maxima or saddle points, an EM-based approach is highly susceptible to mediocre local maxima, and a good initialization is crucial [16]. We argue that instead of initializing the algorithm with parameters learned for the classification task on the ImageNet classification dataset [11], as is done by most state-of-the-art methods irrespective of their nature, it is much more effective and intuitive to initialize with parameters learned for solving the task at hand, that is, semantic segmentation, using the same dataset. This allows the full power of the dataset to be harnessed. For weakly-supervised semantic segmentation, however, the challenge is that only image-level labels are accessible during training. In the following, we address this issue and obtain a good initialization by training an initial CNN model on simple ImageNet images for the weakly-supervised semantic segmentation task.
Let us denote by $\mathcal{D}_s$ a subset of images from the ImageNet dataset containing objects of the categories in which we are interested (for details of this dataset, see Section 5). Dataset $\mathcal{D}_s$ mostly contains images with centered and clutter-free single objects, unlike the challenging PASCAL VOC 2012 dataset [12]. Given $\mathcal{D}_s$, in order to train the initial model to obtain $\theta_0$, we need pixel-level semantic labels, which are not available in the weakly-supervised setting. To circumvent this, we use the combination of a class-agnostic saliency map [8, 15] (bottom-up cue) and a class-specific attention map [36] (top-down cue) to obtain an approximate ground-truth probability distribution of labels for each pixel. Intuitively, a saliency map gives the probability of each pixel belonging to any foreground class, and an attention map provides the probability of it belonging to the given object class. Combining these two maps allows us to obtain a very accurate ground-truth probability distribution of the labels for each pixel in the image (see Figure 1).
Precisely, as shown in Algorithm 1, for a given simple image and its corresponding class label $c$, we combine the attention and the saliency values per pixel to obtain a foreground probability $s_n$ for every pixel in the image. The value of $s_n$ for the $n$-th pixel denotes the probability of it belonging to the object category $c$; similarly, $1 - s_n$ is the probability of it being the background. The combination function in Algorithm 1 is a user-defined function that combines the saliency and the attention maps. In this work we employ a function which takes the union of the attention and saliency maps (Figure 1), so that the two cues complement each other. Let us define the approximate ground-truth label distribution for the $n$-th pixel as $\hat{g}_n$, where $\hat{g}_n(c) = s_n$ at the index of the object category $c$, $\hat{g}_n(l_0) = 1 - s_n$ at the index of the background, and zero otherwise. Given $\hat{g}_n$ for each pixel, we find $\theta_0$ by using a CNN and optimizing the per-pixel cross-entropy loss between $\hat{g}_n$ and $p_n(\theta)$, where $p_n(\theta)$ is the CNN likelihood.
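The construction of the approximate ground-truth distribution can be sketched as follows. Note that the exact combination function in Algorithm 1 is user-defined; the probabilistic OR shown here is one plausible choice consistent with "taking the union" of the two cues, and all names are illustrative:

```python
import numpy as np

def combine_union(saliency, attention):
    """Probabilistic OR of the two foreground cues: a pixel is foreground if
    either the bottom-up (saliency) or the top-down (attention) cue fires."""
    return saliency + attention - saliency * attention

def approx_ground_truth(saliency, attention, class_idx, num_classes):
    """Per-pixel approximate ground-truth distribution: mass s on the image's
    object class, 1 - s on the background (index 0), zero elsewhere."""
    s = combine_union(saliency, attention)   # H x W foreground probability
    g = np.zeros(s.shape + (num_classes,))
    g[..., class_idx] = s
    g[..., 0] = 1.0 - s
    return g

# One-row toy image, class 'boat' at index 3 of 21 PASCAL classes.
sal = np.array([[0.9, 0.1]])
att = np.array([[0.8, 0.0]])
g = approx_ground_truth(sal, att, class_idx=3, num_classes=21)
# first pixel: union = 0.9 + 0.8 - 0.72 = 0.98 foreground probability
```

Training the initial model then amounts to minimizing the per-pixel cross-entropy between this soft distribution and the CNN softmax output.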
Notice that this approach to obtaining the approximate ground-truth label distribution requires only image-level labels; no human annotator involvement is required. Using the probability values directly, instead of a standard Dirac delta distribution, makes the approach more robust to noisy attention and saliency maps. This approach can be seen as a way of mining class-specific, noise-free pixels, and is motivated by the work of Bearman et al. [4], where humans annotate points and squiggles in complex images. Their work showed that instead of training a network with a fully-supervised dataset, the learning process can be sufficiently guided using only a few supervised pixels, which are easy to obtain. Their approach still requires human intervention to obtain points and squiggles, whereas our approach requires only image-level labels, which makes it highly scalable.
4.3 Optimizing Parameters
Let us now discuss how to define and optimize the expected complete-data log-likelihood $Q(\theta; \theta')$. By definition, $Q(\theta; \theta') = \mathbb{E}_{P(Y \mid X, Z; \theta')}[\log P(X, Y, Z; \theta)]$, where the expectation is taken over the posterior over the latent variables at a given set of parameters $\theta'$, denoted as $P(Y \mid X, Z; \theta')$. In the case of semantic segmentation, the latent space is exponentially large, $|\mathcal{L}|^N$; therefore, computing $Q(\theta; \theta')$ by direct enumeration is infeasible. However, as will be shown, the independence assumption over the random variables, namely $P(Y \mid X; \theta) = \prod_n p_n(y_n; \theta)$, allows us to maximize $Q(\theta; \theta')$ efficiently by decomposition. By using Eqn. (1), the independence assumption, the linearity of expectation, and ignoring the terms independent of $\theta$, $Q(\theta; \theta')$ can be written in a simplified form as:
$$Q(\theta; \theta') = \sum_{Y} P(Y \mid X, Z; \theta') \log P(Y \mid X; \theta) \quad (2)$$
Without loss of generality, we can write the posterior marginals as $\sigma'_n(c) := P(y_n = c \mid X, Z; \theta')$, and using the identity $\log P(Y \mid X; \theta) = \sum_n \log p_n(y_n; \theta)$, we obtain:

$$Q(\theta; \theta') = \sum_{n=1}^{N} \sum_{c \in \mathcal{L}} \sigma'_n(c) \log p_n(c; \theta) \quad (3)$$
Therefore, the M-step parameter update, which maximizes $Q(\theta; \theta')$ w.r.t. $\theta$, can be written as:

$$\theta^{new} = \operatorname*{argmax}_{\theta} \sum_{n=1}^{N} \sum_{c \in \mathcal{L}} \sigma'_n(c) \log p_n(c; \theta) \quad (4)$$
As mentioned in Sec. 4.1, the latent posterior distribution is $P(Y \mid X, Z; \theta') \propto P(Y \mid X; \theta')\, P(Z \mid Y)$, where $P(Y \mid X; \theta')$ is the likelihood obtained using the CNN at a given $\theta'$. The distribution $P(Z \mid Y)$ can be used to regularize the CNN likelihood by imposing constraints based on the image-label information given in the dataset. Note that $P(Z \mid Y)$ is independent of $\theta$ and is a task-specific, user-defined distribution that depends on the image labels or object categories. For example, if we know that there are only two classes in a given training image, such as ‘cat’ and ‘person’, out of many other possible classes, then we would like to push the latent posterior probability of the absent classes to zero and increase the probability of the present classes. To impose this constraint, let us assume that, similar to the likelihood, $P(Z \mid Y)$ also decomposes over the pixels and belongs to the exponential family of distributions, $P(Z \mid y_n) \propto \exp(f(y_n, Z))$, where $f$ is a user-defined function designed to impose the desired constraints. Thus, the posterior can be written as $\sigma'_n(c) \propto p_n(c; \theta') \exp(f(c, Z))$. In order to impose the above-mentioned constraints, we use the following form of $f$:
$$f(c, Z) = \begin{cases} 0, & \text{if } c \in Z \cup \{l_0\} \\ -\infty, & \text{otherwise} \end{cases} \quad (5)$$
Practically speaking, imposing the above constraint is equivalent to obtaining softmax probabilities over only those classes (including the background) present in the image and assigning a probability of zero to all other classes. In other words, the above definition of $f$ inherently defines $P(Z \mid Y)$ such that it is uniform over the classes present in the image, including the background ($l_0$), and zero for the remaining ones. Other forms of $f$ can also be used to impose different task-specific, label-dependent constraints.
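The practical effect of this constraint is a simple masked renormalization of the CNN softmax output, which can be sketched as follows (a minimal illustration of the E-step; names are ours):

```python
import numpy as np

def constrained_posterior(likelihood, present_classes):
    """E-step sketch: zero out absent classes (exp(-inf) in Eq. 5) and
    renormalize, with the background (index 0) always allowed."""
    mask = np.zeros(likelihood.shape[-1])
    mask[0] = 1.0                       # background always permitted
    mask[list(present_classes)] = 1.0   # image-level labels
    post = likelihood * mask
    return post / post.sum(axis=-1, keepdims=True)

# One pixel, 4 classes; the CNN puts most mass on an absent class (index 2).
lik = np.array([0.1, 0.2, 0.6, 0.1])
post = constrained_posterior(lik, present_classes=[1])
# mass is redistributed over {background, class 1} only
```

This is how the image-label prior removes false positives: probability mass that the CNN assigned to absent classes is redistributed among the background and the classes actually present.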
Cross-entropy functions
Consider the parameter update (M-step) as defined in Eq. 4. Solving this is equivalent to minimizing the cross-entropy, or the KL divergence [21], between the latent posterior distribution and the CNN likelihood. Notice that, as opposed to [27], which uses a Dirac delta approximation of the posterior distribution, $\delta_n(c) = 1$ at $c = \hat{y}_n := \operatorname{argmax}_c \sigma'_n(c)$ and zero otherwise, we use the true posterior distribution (or the regularized likelihood) itself. We argue that using a Dirac delta distribution imposes a hard constraint that is suitable only when we are very confident about the true label assignment (for example, in the fully-supervised setting where the label is provided by a human annotator). In the weakly-supervised setting, where the latent posterior that decides the label can be noisy (mostly in the case of confusing classes), it is more suitable to use the true posterior distribution, obtained using the combination of the CNN likelihood and the class-label-based prior. We propose to optimize $\theta$ by combining this true posterior distribution and its Dirac delta approximation:
$$\hat{Q}(\theta; \theta') = \sum_{n=1}^{N} \sum_{c \in \mathcal{L}} \big[\, \alpha_n\, \delta_n(c) + (1 - \alpha_n)\, \sigma'_n(c) \,\big] \log p_n(c; \theta) \quad (6)$$

where $\alpha_n \in \{0, 1\}$. To investigate the interplay between the two terms, we define a criterion, which we call the
Relative Heuristic,
to compute the value of $\alpha_n$ given the pixel-wise latent posterior and a user-defined hyperparameter $\eta$:

$$\alpha_n = \begin{cases} 1, & \text{if } (p^1_n - p^2_n)/p^2_n \geq \eta \\ 0, & \text{otherwise} \end{cases} \quad (7)$$

where $p^1_n$ and $p^2_n$ are the highest and the second-highest probability values in the latent posterior distribution at the $n$-th pixel. Intuitively, $\alpha_n = 1$ implies that the most probable score is at least a relative factor $\eta$ better than the second most probable score.
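A sketch of this heuristic and the resulting mixed loss, under the reconstruction above (the exact per-pixel weighting in the paper may differ; names and shapes are illustrative):

```python
import numpy as np

def alpha_relative(posterior, eta):
    """Relative heuristic: trust the hard (Dirac delta) target only when the
    top posterior score beats the runner-up by a relative margin eta."""
    srt = np.sort(posterior, axis=-1)
    p1, p2 = srt[..., -1], srt[..., -2]
    return ((p1 - p2) / p2 >= eta).astype(float)

def combined_loss(posterior, probs, eta):
    """Per-pixel mix: hard cross-entropy where the posterior is confident,
    soft cross-entropy against the full posterior where it is ambiguous."""
    a = alpha_relative(posterior, eta)
    idx = posterior.argmax(axis=-1)[..., None]
    hard = -np.log(np.take_along_axis(probs, idx, axis=-1))[..., 0]
    soft = -(posterior * np.log(probs)).sum(axis=-1)
    return (a * hard + (1 - a) * soft).mean()

# Two pixels over 3 classes: confident (0.7 vs 0.2) and ambiguous (0.4 vs 0.35).
post = np.array([[0.7, 0.2, 0.1], [0.4, 0.35, 0.25]])
a = alpha_relative(post, eta=1.0)   # pixel 0 -> 1.0, pixel 1 -> 0.0
loss = combined_loss(post, post, eta=1.0)
```

With eta = 1.0, the first pixel's margin (0.7 vs 0.2) clears the threshold and gets the hard target, while the ambiguous second pixel keeps its soft posterior target.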
The IoU gain function
Along with minimizing the cross-entropy losses shown in Eq. 6, in order to obtain a better parameter estimate, we also maximize a probabilistic approximation of the intersection-over-union (IoU) between the posterior distribution and the likelihood [1, 9, 26]:

$$\text{IoU}(\sigma', p(\theta)) \approx \frac{1}{|\mathcal{L}|} \sum_{c \in \mathcal{L}} \frac{I_c(\theta)}{U_c(\theta)} \quad (8)$$

where $I_c(\theta) = \sum_n \sigma'_n(c)\, p_n(c; \theta)$ and $U_c(\theta) = \sum_n \big( \sigma'_n(c) + p_n(c; \theta) - \sigma'_n(c)\, p_n(c; \theta) \big)$. Refer to [9] for further details about Eq. 8.
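The soft IoU can be sketched as below. This uses one common probabilistic relaxation (products for the intersection, probabilistic ORs for the union, cf. [9]); the paper's exact formulation may differ in detail:

```python
import numpy as np

def soft_iou(posterior, probs, eps=1e-8):
    """Probabilistic IoU between a soft target (posterior) and the CNN
    likelihood, averaged over classes. Inputs have shape (H, W, C)."""
    inter = (posterior * probs).sum(axis=(0, 1))                      # per class
    union = (posterior + probs - posterior * probs).sum(axis=(0, 1))  # per class
    return float((inter / (union + eps)).mean())

# A 1x2 "image" over 2 classes, one-hot in both arguments.
p = np.zeros((1, 2, 2))
p[0, 0, 0] = 1.0   # pixel 0 -> class 0
p[0, 1, 1] = 1.0   # pixel 1 -> class 1
q = np.zeros_like(p)
q[0, 0, 1] = 1.0   # completely disagreeing prediction
q[0, 1, 0] = 1.0
```

Because this quantity is differentiable in the softmax outputs, it can be added to the cross-entropy objective and maximized by backpropagation alongside it.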
Overall objective function and the algorithm
Combining the cross-entropy loss function (Eq. 6) and the IoU gain function (Eq. 8), the M-step parameter update problem is:

$$\theta^{new} = \operatorname*{argmax}_{\theta} \Big\{ \hat{Q}(\theta; \theta') + \text{IoU}(\sigma', p(\theta)) \Big\} \quad (9)$$
We use a CNN model along with the backpropagation algorithm to optimize the above objective function. Recall that our evaluation is based on the PASCAL VOC 2012 dataset; therefore, during the M-step of the algorithm we use both the ImageNet and the PASCAL trainval datasets (see Section 5 for details). Our overall approach is summarized in Algorithm 2.
| Method | Dataset | Dependencies | Supervision | CRF [20] | mIoU (Val) | mIoU (Test) |
|---|---|---|---|---|---|---|
| EM-Adapt [27] | | No | Image labels | ✗ | | |
| | | | | ✓ | | |
| CCNN [28] | | No | Image labels | ✗ | | |
| | | | | ✓ | | |
| | | Class size | | ✗ | | |
| | | | | ✓ | | |
| SEC [19] | | Saliency [32] & Localization [38] | Image labels | ✗ | | |
| | | | | ✓ | | |
| MIL [30] | | Superpixels [13] | Image labels | ✗ | | |
| | | BBox BING [7] | | | | |
| | | MCG [3] | | | | |
| WTP [4] | | Objectness [2] | Image labels | | | |
| | | | Image labels + 1 Point/Class | | | |
| | | | Image labels + 1 Squiggle/Class | | | |
| STC [35] | | Saliency [18] | Image labels | ✓ | | |
| AugFeed [31] | | SS [33] | Image labels | ✗ | | |
| | | | | ✓ | | |
| | | MCG [3] | | ✗ | | |
| | | | | ✓ | | |
| Ours (Initial Model) | | Saliency [15] & Attention [36] | Image labels | ✗ | | |
| | | | | ✓ | | |
| Ours (Final) | | Saliency [15] & Attention [36] | Image labels | ✗ | | |
| | | | | ✓ | | |
5 Experimental Results and Comparisons
We show the efficacy of our method on the challenging PASCAL VOC 2012 benchmark, outperforming all existing state-of-the-art methods by a large margin. Specifically, we improve on the current state-of-the-art weakly-supervised method using only image labels [31] by 10%.
5.1 Setup
Dataset $\mathcal{D}_s$ for the initial model. To train our initial model (Section 4.2), we download images from the ImageNet dataset which contain objects in the foreground object categories of the PASCAL VOC 2012 segmentation task. We filter this dataset using simple heuristics. First, we discard images with width or height less than 200 or greater than 500 pixels. Using the attention model of [36] (which is trained with only image-level labels), we generate an attention map for each image and record the most probable class label with its corresponding probability. We discard the image if (i) its most probable label does not match the image label, or (ii) its most probable label matches the image label but its corresponding probability is less than 0.2. We then generate saliency maps using the saliency model of [15] (which is trained with only class-agnostic saliency masks). For the remaining images, we combine attention and saliency by finding the pixel-wise intersection between the saliency and the attention binary masks. The saliency mask is obtained by setting a pixel's value to 1 if its corresponding saliency probability is greater than 0.5; the same is done to obtain the attention mask. For each object category, the images are sorted by this intersection area (i.e., the number of overlapping pixels between the two masks), with the intuition that larger intersections correspond to higher-quality saliency and attention maps. The top 1500 images are then selected for each category, with the exception of the ‘person’ category, where the top 2500 images are kept, and any category with fewer than 1500 images, in which case all images are kept. This complete filtering process leaves us with a set of simple images, which contain uncluttered and mainly-centered single objects. We denote this dataset as $\mathcal{D}_s$ and will make it publicly available. We highlight that $\mathcal{D}_s$ does not contain any additional images relative to those used by other weakly-supervised works (see the Dataset column in Table 1).
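The filtering heuristics above can be sketched as follows. The 0.5 binarization threshold and the 0.2 confidence threshold come from the text; the function names and array shapes are illustrative:

```python
import numpy as np

def mask_intersection_area(saliency, attention, thresh=0.5):
    """Number of pixels where both the binarized saliency and attention
    masks fire; used to rank candidate ImageNet images per category."""
    return int(((saliency > thresh) & (attention > thresh)).sum())

def keep_image(att_label, image_label, att_prob, min_prob=0.2):
    """Heuristic filter: keep an image only if the attention model's most
    probable class matches the image label with sufficient confidence."""
    return att_label == image_label and att_prob >= min_prob

# Toy 2x2 maps: masks agree at two pixels.
sal = np.array([[0.9, 0.4], [0.6, 0.7]])
att = np.array([[0.8, 0.9], [0.2, 0.6]])
area = mask_intersection_area(sal, att)   # 2 overlapping pixels
```

Sorting each category's surviving images by `area` and keeping the top-ranked ones yields the simple, uncluttered subset used to train the initial model.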
Datasets for the M-step. For the M-step, we use a filtered subset of PASCAL VOC 2012 images, denoted $\mathcal{D}_c$, and a subset of $\mathcal{D}_s$. To obtain $\mathcal{D}_c$, we take complex PASCAL VOC 2012 images (the training images [12] together with the extra images provided by [14]) and use the trained initial model (i.e., $\theta_0$) to generate a (hard) ground-truth segmentation for each. The hard segmentations are obtained by assigning each pixel the class label with the highest probability. The ratio of the foreground area to the whole image area (where area is the number of pixels) is computed, and if the ratio is below 0.05, the image is discarded. We also further filter $\mathcal{D}_s$: using the trained initial model, we generate (hard) segmentations for all simple ImageNet images in $\mathcal{D}_s$. We compute the intersection area (as above) between the attention mask and the predicted segmentation (rather than the saliency mask as before) and select the top-ranked images based on this metric. Together, $\mathcal{D}_c$ and this subset of $\mathcal{D}_s$ constitute the images used for the M-step.
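The foreground-area filter on the PASCAL images can be sketched as below. The 0.05 ratio comes from the text; names are illustrative:

```python
import numpy as np

def foreground_ratio(hard_seg):
    """Fraction of pixels assigned a non-background label (background = 0)."""
    return float((hard_seg != 0).mean())

def keep_complex_image(probs, min_ratio=0.05):
    """Discard a PASCAL image if the initial model's hard segmentation
    (per-pixel argmax) covers too little foreground."""
    hard = probs.argmax(axis=-1)
    return foreground_ratio(hard) >= min_ratio

# 10x10 toy image, 2 classes; background score 1 everywhere.
probs = np.zeros((10, 10, 2))
probs[..., 0] = 1.0
probs[0, :4, 1] = 2.0                    # 4 foreground pixels -> ratio 0.04
keep_small = keep_complex_image(probs)   # below 0.05, discarded
probs[0, :, 1] = 2.0                     # 10 foreground pixels -> ratio 0.10
keep_big = keep_complex_image(probs)     # kept
```

This removes images on which the initial model finds almost no foreground, since their hard segmentations would contribute little beyond background labels to the M-step.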
CNN architecture and parameter settings. Similar to [19, 27, 35], both our initial model and our EM model are based on the large-FOV DeepLab architecture [6]. We use simple bilinear interpolation to map the downsampled feature maps to the original image size, as suggested in [24], and the publicly available Caffe toolbox [17] for our implementation. We train with standard weight decay and momentum, accumulating gradients over multiple iterations, and decay the learning rate in steps every few epochs. Input images are randomly cropped to a fixed size; images smaller than the crop are padded with the mean pixel values, and the corresponding places in the ground truth are padded with ignore labels to nullify the effect of the padding. We flip the images horizontally, resulting in an augmented set twice the size of the original one. We train our networks by optimizing Eq. 9 as per Algorithm 2. Performance gains beyond two EM iterations were not significant compared to the computational cost.

| Data | Method | CRF | bkg | plane | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | motor | person | plant | sheep | sofa | train | tv | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Val | Initial | ✗ | 87.1 | 74.7 | 29.0 | 69.8 | 55.8 | 55.6 | 73.3 | 65.2 | 63.4 | 15.8 | 61.5 | 15.9 | 60.0 | 56.4 | 57.5 | 53.7 | 32.9 | 65.6 | 23.9 | 64.6 | 42.2 | |
| | | ✓ | 87.7 | 79.7 | 32.6 | 73.4 | 58.4 | 57.8 | 74.3 | 64.8 | 66.0 | 15.9 | 63.1 | 15.0 | 62.3 | 59.6 | 57.7 | 54.9 | 33.8 | 69.1 | 23.7 | 65.0 | 44.3 | |
| | Final | ✗ | 87.8 | 72.4 | 28.7 | 67.9 | 58.8 | 55.8 | 78.0 | 69.7 | 70.2 | 17.8 | 63.3 | 23.2 | 65.7 | 60.5 | 63.1 | 58.7 | 40.0 | 68.2 | 28.9 | 70.9 | 45.5 | |
| | | ✓ | 88.6 | 76.1 | 30.0 | 71.1 | 62.0 | 58.0 | 79.2 | 70.5 | 72.6 | 18.1 | 65.8 | 22.3 | 68.4 | 63.5 | 64.7 | 60.0 | 41.5 | 71.7 | 29.0 | 72.5 | 47.3 | |
| Test | Initial | ✗ | 87.9 | 69.2 | 29.2 | 74.9 | 41.7 | 53.4 | 70.6 | 69.6 | 59.9 | 18.3 | 66.1 | 24.9 | 62.5 | 63.3 | 68.8 | 55.4 | 33.7 | 63.8 | 18.6 | 64.3 | 44.9 | |
| | | ✓ | 88.5 | 72.6 | 32.6 | 80.3 | 44.6 | 55.4 | 70.9 | 69.6 | 62.2 | 18.9 | 68.4 | 24.6 | 65.2 | 66.8 | 71.2 | 57.2 | 37.2 | 66.7 | 17.4 | 64.8 | 45.9 | |
| | Final | ✗ | 88.2 | 69.5 | 29.7 | 72.2 | 45.1 | 57.3 | 73.2 | 72.7 | 69.3 | 20.5 | 65.4 | 33.5 | 67.8 | 64.0 | 72.3 | 58.9 | 45.5 | 69.8 | 26.8 | 63.8 | 46.8 | |
| | | ✓ | 88.9 | 72.7 | 31.0 | 76.3 | 47.7 | 59.2 | 74.3 | 73.2 | 71.7 | 19.9 | 67.1 | 34.0 | 70.3 | 66.6 | 74.4 | 60.2 | 48.1 | 73.1 | 27.8 | 66.9 | 47.9 | |
5.2 Results, Comparisons, and Analysis
We provide Tables 1 and 3 for an extensive comparison between our method and current methods, their dependencies, and degrees of supervision. Regarding the dependencies of our method: our saliency network [15] is trained using salient-region masks. These masks are class-agnostic; therefore, once trained, the network can be used for any semantic object category, so there is no issue with scalability and no need to retrain the saliency network for new object categories. Our second dependency, the attention network [36], is trained using solely image labels.
| Dependency | Supervision |
|---|---|
| Class size | Image labels + Bboxes |
| Saliency [32] | Image labels |
| Saliency [18] | Bboxes |
| Saliency [15] | Saliency masks |
| Attention [36] | Image labels |
| Objectness [2] | Image labels + Bboxes |
| Localization [38] | Image labels |
| Superpixels [13] | None |
| Bbox BING [7] | Bboxes |
| MCG [3] | Pixel labels |
| SS [33] | Bboxes |
| CRF [20] | Pixel labels (parameter cross-val) |
State-of-the-art.
Our method outperforms all existing state-of-the-art methods by a very high margin. The most directly comparable method in terms of supervision and dependencies is AugFeed [31], which uses superpixels. Our method obtains almost 10% better mIoU than AugFeed on both the val and test sets. Even disregarding ‘equivalent’ supervision and dependencies, our method still clearly exceeds the best competing method on both the val and test sets. Table 2 shows the class-wise performance of our method.
Simplicity vs. sophistication.
The initial model is essential to the success of our method. We train this model in a very simple and intuitive way, by employing a filtered subset of simple ImageNet images and training for the semantic segmentation task. Importantly, this process uses only image labels and is fully automatic, requiring no human intervention. The learned $\theta_0$ provides a very good initialization for the EM algorithm, enabling it to avoid poor local maxima. This is shown visually in Figure 3: the initial model (third column) already gives a good prediction, with the first and second EM iterations (fourth and fifth columns) improving the semantic segmentation even further. We highlight that with this simple approach, surprisingly, our initial model beats all current state-of-the-art methods, which are more complex and often use higher degrees of supervision. By adopting this intuitive modification, we believe that many methods could easily boost their performance.
To CRF or not to CRF?
In our work, we specifically choose not to employ a CRF [20], either as a post-processing step or inside our models, during training and testing, for the following reasons. (1) CRF hyperparameters are normally cross-validated over a fully-supervised pixel-wise segmentation dataset, which contradicts fully “weak” supervision. This is likewise the case for MCG [3], which is trained on a pixel-level semantic segmentation dataset. (2) The CRF hyperparameters are highly sensitive, and if we wish to extend our framework to incorporate new object categories, this would require a pixel-level annotated dataset of the new categories, along with the old ones, for the cross-validation of the CRF hyperparameters. This is highly non-scalable. For completeness, however, we include our method with a CRF applied (Tables 1 & 2), which further boosts our accuracy. We note that even without a CRF, our best approach exceeds the state-of-the-art (which uses a CRF and a higher degree of supervision) on the test set.
6 Conclusions and Future Work
We have addressed weakly-supervised semantic segmentation using only image labels. We proposed an EM-based approach and focused on the three key components of the algorithm: (i) initialization, (ii) E-step, and (iii) M-step. Using only the image labels of a filtered subset of ImageNet, we learn a set of parameters for the semantic segmentation task which provides an informed initialization for our EM algorithm. Following this, with each EM iteration, we empirically and qualitatively verify that our method improves the segmentation accuracy on the challenging PASCAL VOC 2012 benchmark. Furthermore, we show that our method outperforms all state-of-the-art methods by a large margin.
Future directions include making our method more robust to noisy labels, for example, when images downloaded from the Internet have incorrect labels, as well as better handling images with multiple classes of objects.
References
 [1] F. Ahmed, D. Tarlow, and D. Batra. Optimizing expected intersectionoverunion with candidateconstrained crfs. In ICCV, 2015.
 [2] B. Alexe, T. Deselaers, and V. Ferrari. Measuring the objectness of image windows. In PAMI, 2012.
 [3] P. Arbelaez, J. Pont-Tuset, J. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In CVPR, 2014.
 [4] A. Bearman, O. Russakovsky, V. Ferrari, and F.F. Li. What’s the point: Semantic segmentation with point supervision. In ECCV, 2016.
 [5] S. Chandra and I. Kokkinos. Fast, exact and multiscale inference for semantic image segmentation with deep gaussian crfs. In ECCV, 2016.
 [6] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.

 [7] M. Cheng, Z. Zhang, W. Lin, and P. H. S. Torr. BING: Binarized normed gradients for objectness estimation at 300fps. In CVPR, 2014.
 [8] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. S. Torr, and S.-M. Hu. Global contrast based salient region detection. IEEE TPAMI, 37(3):569–582, 2015.
 [9] M. Cogswell, X. Lin, S. Purushwalkam, and D. Batra. Combining the best of graphical models and convnets for semantic segmentation. In arXiv:1412.4313, 2014.
 [10] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm. In Journal of the Royal Statistical Society, 1977.
 [11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
 [12] M. Everingham, S. M. A. Eslami, L. V. Gool, C. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge a retrospective. In IJCV, 2014.
 [13] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph based image segmentation. In IJCV, 2004.
 [14] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011.
 [15] Q. Hou, M.M. Cheng, X.W. Hu, A. Borji, Z. Tu, and P. Torr. Deeply supervised salient object detection with short connections. In IEEE CVPR, 2017.
 [16] C. F. Jeff Wu. On the convergence properties of the em algorithm. In The Annals of Statistics, 1983.
 [17] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM International Conference on Multimedia, 2014.
 [18] H. Jiang, J. Wang, Z. Yuan, Y. Wu, Z. N., and S. Li. Salient object detection: A discriminative regional feature integration approach. In CVPR, 2013.
 [19] A. Kolesnikov and C. H. Lampert. Seed, Expand and Constrain: Three Principles for WeaklySupervised Image Segmentation. In ECCV, 2016.
 [20] P. Krahenbuhl and V. Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In NIPS, 2011.
 [21] S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 1951.
 [22] D. Lin, J. Dai, J. Jia, K. He, and J. Sun. Scribblesup: Scribblesupervised convolutional networks for semantic segmentation. In CVPR, 2016.
 [23] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
 [24] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
 [25] G. J. McLachlan and T. Krishnan. The EM algorithm and extensions. Wiley, 1997.
 [26] S. Nowozin. Optimal decisions from probabilistic models: the intersectionoverunion case. In CVPR, 2014.

 [27] G. Papandreou, L.-C. Chen, K. P. Murphy, and A. L. Yuille. Weakly- and semi-supervised learning of a DCNN for semantic image segmentation. In ICCV, 2015.
 [28] D. Pathak, P. Krahenbuhl, and T. Darrell. Constrained convolutional neural networks for weakly supervised segmentation. In ICCV, 2015.
 [29] D. Pathak, E. Shelhamer, J. Long, and T. Darrell. Fully convolutional multiclass multiple instance learning. In ICLR, 2014.
 [30] P. O. Pinheiro and R. Collobert. From imagelevel to pixellevel labeling with convolutional networks. In CVPR, 2015.
 [31] X. Qi, Z. Liu, J. Shi, H. Zhao, and J. Jia. Augmented feedback in semantic segmentation under image level supervision. In ECCV, 2016.
 [32] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ICLR, 2014.
 [33] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. Selective search for object recognition. In IJCV, 2013.
 [34] J. Xu, A. Schwing, and R. Urtasun. Learning to segment under various forms of weak supervision. In CVPR, 2015.
 [35] Y. Wei, X. Liang, Y. Chen, X. Shen, M.-M. Cheng, Y. Zhao, and S. Yan. STC: A simple to complex framework for weakly-supervised semantic segmentation. In arXiv:1509.03150, 2015.
 [36] J. Zhang, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff. Topdown neural attention by excitation backprop. In ECCV, 2016.

 [37] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. Conditional random fields as recurrent neural networks. In ICCV, 2015.
 [38] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In CVPR, 2016.