Amortized Inference and Learning in Latent Conditional Random Fields for Weakly-Supervised Semantic Image Segmentation

05/03/2017 ∙ by Gaurav Pandey, et al. ∙ Indian Institute of Science

Conditional random fields (CRFs) are commonly employed as a post-processing tool for image segmentation tasks. The unary potentials of the CRF are often learnt independently by a classifier, thereby decoupling the inference in the CRF from the training of the classifier. Such a scheme works effectively when pixel-level labelling is available for all the images. However, in the absence of pixel-level labels, the classifier is faced with the uphill task of selectively assigning the image-level labels to the pixels of the image. Prior work often relied on localization cues, such as saliency maps, objectness priors and bounding boxes, to address this challenging problem. In contrast, we model the labels of the pixels as latent variables of a CRF. The pixels and the image-level labels are the observed variables of the latent CRF. We amortize the cost of inference in the latent CRF over the entire dataset by training an inference network to approximate the posterior distribution of the latent variables given the observed variables. The inference network can be trained in an end-to-end fashion, and requires no localization cues for training. Moreover, unlike other approaches for weakly-supervised segmentation, the proposed model does not require further post-processing. On the challenging VOC 2012 dataset, the proposed model achieves performance comparable with other approaches that employ saliency masks for weakly-supervised semantic image segmentation.







1 Introduction

Semantic image segmentation is the problem of assigning the pixels of an image to a selected set of predefined labels, based on the semantic structure that each pixel belongs to. Most successful models for semantic image segmentation employ a variant of a CNN for computing the probability distribution over the classes for each pixel. During inference, these distributions are fed as unary potentials to a fully connected CRF with Gaussian edge potentials, and a joint labeling for the pixels of the image is inferred from the CRF. The work by Krähenbühl & Koltun [12] allows for efficient inference in such models.

Successful semantic image segmentation requires access to a large number of images that have been densely labeled. However, dense labeling of images is an expensive and time-consuming operation [16, 19, 18]. Therefore, the number of densely labeled images available is usually a minuscule percentage of the total set of images. Hence, models that rely solely on densely labeled images are limited in their scope. These models will be referred to as fully supervised models in the sequel.

The limitations of fully supervised models have necessitated the development of models that can incorporate weakly labeled images for training. These include models that utilize bounding box priors [13, 5, 16, 29], a few points per class [1] and image-level labels [25, 16, 19]. Of particular interest are models that rely on image-level labels only, since the web provides an almost unlimited source of weakly annotated images.

Unfortunately, the decoupled CNN-CRF combination (or a CNN alone) fares poorly when only image-level labels are available [19, 17]. To alleviate this problem, several researchers have resorted to the use of localization cues, such as saliency and attention maps [28, 10] or objectness priors [19, 27], thereby improving performance to an extent. Improvements in CNN architectures for segmentation [4, 30] have further improved performance.

In this paper, we propose a model that learns to output segmentation masks using only image-level labels, without the aid of localization cues or saliency masks. In particular, we enforce a pixel-label loss as well as a neighborhood loss on the output of a CNN. Since real pixel-labels are unavailable, we map the output of the CNN to auxiliary pixel labels to get an approximate segmentation mask. The neighborhood loss allows us to enforce the constraints imposed by a conditional random field on the output of the CNN, thereby forcing it to generate crisp segmentation masks that align with the boundary of the object.


Our contributions are as follows. (1) We propose a new interpretable model for weakly supervised semantic segmentation. (2) The model is trained by imposing pixel-label and neighborhood loss functions on the output of a fully convolutional neural network. (3) We prove that by imposing the neighborhood loss on the output of the CNN, the output of the CNN is forced to satisfy the constraints imposed by a conditional random field. (4) We achieve an accuracy of 52.01% on the test set and 51.6% on the validation set of Pascal VOC 2012, which is the state of the art for methods that do not employ any pixel-level labels.

2 Preliminaries and background


The probability distributions are indicated by lower case letters, for example, and . Subscripts in the distribution indicate the location, whereas superscripts indicate the name of the distribution. The label at location is denoted as , while any distribution at the corresponding location is denoted as with an appropriate superscript indicating the type of distribution. The labels in the segmentation mask form a grid, which is denoted by . The entire segmentation mask is denoted as .

Conditional random fields for semantic image segmentation

A conditional random field is an undirected graphical model that models the conditional distribution of the output given the input, when the output is structured in the form of factors. For the problem of semantic image segmentation, the image is the input, whereas the pixel-level labels form the output. A conditional random field is completely characterized by its potential functions.

Traditionally, conditional random fields employ two forms of potential functions: unary potentials and binary potentials. A unary potential $\psi_i(y_i, x)$ is a function of the conditioning variable $x$ and a single output variable $y_i$. The unary potentials encode the suitability of assigning a specific label at a specific location. A popular approach is to learn a local classifier for each location in the image, using the features extracted locally from the image. The unary potential for a specific label at a specific location is then equated to the negative log-probability of observing the label at that location. These local classifiers have largely been replaced by convolutional neural networks that consider local as well as global information [31, 4].

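The binary potential $\psi_{ij}(y_i, y_j, x)$ measures the compatibility of two labels $y_i$ and $y_j$ for locations $i$ and $j$. The most commonly used binary potential is the contrast sensitive Potts model [2, 23, 9], which captures the difference in color among neighboring locations, that is

$$\psi_{ij}(y_i, y_j, x) = \theta_1 \exp\left(-\theta_2 \|x_i - x_j\|^2\right) [y_i \neq y_j]$$

where $x_i$ is the color of the pixel at location $i$, $[\cdot]$ equals $1$ when its argument holds and $0$ otherwise, and $\theta_1$ and $\theta_2$ are non-negative scalar constants.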

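A CRF with only unary and binary potentials is referred to as a pairwise CRF. For a given input image $x$ and a segmentation mask $y$, the joint distribution of a pairwise CRF is given below:

$$p(y \mid x) = \frac{1}{Z(x)} \exp\left(-\sum_i \psi_i(y_i, x) - \sum_i \sum_{j \in \mathcal{N}(i),\, j > i} \psi_{ij}(y_i, y_j, x)\right)$$

where $\psi_i$ and $\psi_{ij}$ are the unary and binary potentials, $Z(x)$ is the normalization constant, and $\mathcal{N}(i)$ indicates the neighbors of the node $i$.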
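To make the role of the potentials concrete, the following is a minimal sketch of a pairwise CRF on a toy three-pixel chain, normalized by brute-force enumeration. The chain layout, the scalar "colors", the form of the Potts cost and the values of `theta1` and `theta2` are illustrative assumptions and are not taken from the paper; each neighboring pair is counted once in the energy.

```python
import itertools
import math

def potts_binary(label_i, label_j, color_i, color_j, theta1=1.0, theta2=0.5):
    """Contrast-sensitive Potts cost (assumed form): zero for equal labels,
    otherwise a cost that shrinks as the color difference grows."""
    if label_i == label_j:
        return 0.0
    return theta1 * math.exp(-theta2 * (color_i - color_j) ** 2)

def energy(labels, unary, colors):
    """Energy of a labeling on a 1-D chain where each node neighbors the next."""
    total = sum(unary[i][labels[i]] for i in range(len(labels)))
    for i in range(len(labels) - 1):
        total += potts_binary(labels[i], labels[i + 1], colors[i], colors[i + 1])
    return total

def joint_distribution(unary, colors, num_labels):
    """Enumerate every labeling and normalize exp(-energy) by brute force."""
    labelings = list(itertools.product(range(num_labels), repeat=len(unary)))
    weights = [math.exp(-energy(l, unary, colors)) for l in labelings]
    z = sum(weights)  # normalization constant Z(x)
    return {l: w / z for l, w in zip(labelings, weights)}

# Toy example: 3 pixels, 2 labels; unaries are negative log-probabilities.
unary = [[-math.log(0.8), -math.log(0.2)],
         [-math.log(0.5), -math.log(0.5)],
         [-math.log(0.3), -math.log(0.7)]]
colors = [0.1, 0.15, 0.9]  # pixels 0 and 1 look alike, pixel 2 differs
dist = joint_distribution(unary, colors, num_labels=2)
best = max(dist, key=dist.get)  # most probable joint labeling
```

Brute-force enumeration is only feasible for such toy problems; for real images the normalization constant is intractable, which is why approximate inference is needed.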

3 Proposed Model

Figure 1: The input image is fed through a fully convolutional network to generate a distribution over segmentation masks. The pixel label estimator incorporates the image label information in the distribution to generate an updated distribution. We force the output of the segmentation network to be close to this updated distribution. Simultaneously, the neighborhood loss forces the output of the segmentation network to be close to the distribution computed from its neighbors.

3.1 Overview

Given a set of images and their corresponding image-level labels, the aim is to learn a model that can output pixel-level labels from the input image. It is important to note that pixel-level labels are not provided during training, and hence constitute the latent variables in the current model. An image is fed through a segmentation network that outputs a distribution over the labels for each pixel location. We refer to this distribution as the predicted distribution, since this is the only distribution that will be required during inference. Our aim is to ensure that the predicted distribution constitutes a valid segmentation mask for the input image. Hence, we impose multiple losses on the predicted distribution. In particular, the pixel-label estimator incorporates the image-label information in the predicted distribution to generate a distribution over pixel-level labels. This distribution can be thought of as an auxiliary ground truth, since the true pixel-level labels are not available. The segmentation network is trained using the auxiliary ground truth.

Next, the neighborhood estimator computes a smooth version of the output distribution by averaging the output of the neighbors for each location. We force the output of the CNN to be close to the neighborhood distribution. We further show that this is equivalent to enforcing the constraints of a CRF on the output of a CNN.

Figure 2: The segmentation network in the proposed model. Except for the last layer, each convolutional layer in the inference network is followed by a ReLU layer and a batch normalization layer. The last layer is followed by exponentiation and normalization to get the predicted distribution.


In the sequel, $x$ represents the input image and $y$ represents the segmentation mask. The label of the pixel at the location indexed by $i$ is denoted as $y_i$. The various components involved in the model are discussed below.

3.2 Segmentation Network

The segmentation network receives the image as input and generates a distribution over the segmentation masks as output. As is common in CNN-based training for segmentation [3, 15], we assume that the output distribution over pixel-level labels factorizes completely over locations. In particular, let $q^{pred}(y \mid x)$ be the conditional distribution over the pixel-level labels given the image. We assume that $q^{pred}(y \mid x) = \prod_i q^{pred}_i(y_i \mid x)$, where $q^{pred}_i$ is the distribution at location $i$, and $y_i$ is the corresponding label. Furthermore, we assume that the distribution is parametrized by a CNN $f$, that is,

$$q^{pred}_i(y_i = l \mid x) = \frac{\exp(f_{i,l}(x))}{\sum_{l'} \exp(f_{i,l'}(x))}.$$

The segmentation network used in this paper is shown in Figure 2.
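The factorization assumption can be illustrated with a short sketch: the network's final scores are exponentiated and normalized independently at every location, and the probability of a full mask is the product of the per-pixel probabilities. The score values and the helper names below are hypothetical and serve only as an illustration.

```python
import math

def pixel_softmax(scores):
    """Exponentiate and normalize the class scores of one pixel."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def predicted_distribution(score_map):
    """Apply the softmax independently at every location of a 2-D score map."""
    return [[pixel_softmax(px) for px in row] for row in score_map]

def mask_probability(q, mask):
    """Probability of a full mask under the fully factorized distribution."""
    prob = 1.0
    for q_row, mask_row in zip(q, mask):
        for q_i, y_i in zip(q_row, mask_row):
            prob *= q_i[y_i]
    return prob

# Hypothetical 2x2 score map with 3 classes per pixel.
scores = [[[2.0, 0.0, -1.0], [0.5, 0.5, 0.5]],
          [[0.0, 3.0, 0.0], [1.0, 0.0, 2.0]]]
q = predicted_distribution(scores)
p_mask = mask_probability(q, [[0, 1], [1, 2]])
```

Because the distribution factorizes, nothing in this step couples neighboring pixels; that coupling has to come from the losses imposed on the output.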

3.3 Pixel Label Estimator

Since ground truth information for the pixels is not available, we attempt to generate auxiliary ground truth information for each pixel from the output of the network. In particular, we infer a distribution over the labels for each location in the image from the output distribution of segmentation network. Given this auxiliary ground truth, the classification objective can be rewritten as


In the absence of any restriction, the model can choose to assign all the pixels to a single class, for instance, the background class. In order to prevent this from happening, one needs to ensure that for each class present in the image, at least a certain percentage of pixels are allotted to that class. Furthermore, one also needs to ensure that no pixels are allotted to classes absent from the image. Hence, we couple the distribution with a prior to obtain the distribution over the pixel-labels.


for , and . In order to complete the description, we define the prior distribution as below:


Images in the ImageNet dataset are assumed to contain only one foreground object, and hence

. We learn the constants independently for each image. In particular, for each image, we learn the most non-informative prior that can guarantee the assignment of a certain percentage of pixels to each class present in the image, while assigning no pixels to classes absent from the image. In order to quantify information, we maximize the entropy of the prior distribution, while simultaneously forcing it to satisfy a set of constraints. That is,

subject to

The constants dictates the percentage of pixels that are guaranteed to belong to class , if label is present in the image. We choose for the background class, and for all the other object classes present in the image.

For images in the ImageNet dataset, which are assumed to contain a single foreground object, the above optimization problem contains only two variables: one for the background class and one for the label of the foreground object that is present in the image. Furthermore, by expressing one of these variables in terms of the other, we further reduce the number of variables from 2 to 1. Hence, the above optimization problem reduces to a constrained optimization in a single variable which can be solved very efficiently. This approach is discussed in further detail in the Appendix.
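The single-variable search described above can be sketched as follows. This is only an illustration under several assumptions that the text does not confirm: the auxiliary distribution is taken to be the predicted distribution multiplied by the prior and renormalized, the fraction of pixels assigned to a class is measured by the mean auxiliary probability, the two prior masses sum to one, and the fallback picks the candidate with the smallest constraint violation. The function names and the thresholds in `min_fraction` are hypothetical.

```python
import math

def couple_with_prior(q, prior):
    """Auxiliary distribution at each pixel: q_i(l) * prior(l), renormalized."""
    aux = []
    for q_i in q:
        w = [ql * pl for ql, pl in zip(q_i, prior)]
        z = sum(w)
        aux.append([wl / z for wl in w])
    return aux

def class_fractions(aux):
    """Soft fraction of pixels per class (mean of the auxiliary probabilities)."""
    n = len(aux)
    return [sum(a[l] for a in aux) / n for l in range(len(aux[0]))]

def entropy(prior):
    return -sum(p * math.log(p) for p in prior if p > 0)

def fit_prior(q, min_fraction, steps=99):
    """Grid search over the foreground prior mass a; background gets 1 - a."""
    feasible, fallback = [], []
    for s in range(1, steps + 1):
        a = s / (steps + 1)
        prior = [1.0 - a, a]
        frac = class_fractions(couple_with_prior(q, prior))
        violation = sum(max(0.0, m - f) for m, f in zip(min_fraction, frac))
        (feasible if violation == 0.0 else fallback).append((prior, violation))
    if feasible:
        # Most non-informative (maximum-entropy) prior among the feasible ones.
        return max(feasible, key=lambda t: entropy(t[0]))[0]
    # Assumed fallback: the candidate that violates the constraints the least.
    return min(fallback, key=lambda t: t[1])[0]

# Ten pixels, two classes (background, foreground); the network favors background.
q = [[0.9, 0.1]] * 8 + [[0.6, 0.4]] * 2
min_fraction = [0.3, 0.3]  # hypothetical thresholds
prior = fit_prior(q, min_fraction)
fractions = class_fractions(couple_with_prior(q, prior))
```

In this toy case the uniform prior leaves too few pixels for the foreground, so the search shifts just enough prior mass toward the foreground class to satisfy the constraint.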

3.4 Neighborhood Estimator

To ensure the correct alignment between the predicted boundaries and the actual boundaries, we utilize the following observation: pixels that lie close together and have similar color tend to have the same label. Hence, we force the distribution at each location to be close to the distribution of its neighbors. Towards that end, we compute a neighborhood distribution for each location, and minimize the KL-divergence between the output distribution at that location and the corresponding neighborhood distribution. The corresponding objective is given by


The combined objective is given by


for some constant . We propose two approaches for computing the neighborhood distribution.

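Weighted mean: In this approach, the neighborhood distribution is computed as follows:

$$q^{nbr}_i(l) = \frac{\sum_{j \neq i} w_{ij}\, q^{pred}_j(l)}{\sum_{j \neq i} w_{ij}}$$

for each label $l$. Here, $w_{ij}$ is a measure of similarity between the locations $i$ and $j$. For our purpose, we define the neighbors as all the locations that lie close to the current location and whose pixels have similar color. Hence, we use the contrast sensitive two-kernel potential [12] defined in terms of pixel locations and pixel brightness as follows:

$$w_{ij} = w^{(1)} \exp\left(-\frac{\|p_i - p_j\|^2}{2\sigma_\alpha^2} - \frac{\|I_i - I_j\|^2}{2\sigma_\beta^2}\right) + w^{(2)} \exp\left(-\frac{\|p_i - p_j\|^2}{2\sigma_\gamma^2}\right)$$

where $p_i$ and $I_i$ denote the position and the color of the pixel at location $i$. Here, $w^{(1)}, w^{(2)}, \sigma_\alpha, \sigma_\beta$ and $\sigma_\gamma$ are hyperparameters that are fixed during training. As discussed in [12], the second term prevents the formation of small isolated regions as segments.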
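A minimal sketch of the weighted-mean neighborhood distribution is given below. It assumes the two-kernel form of [12] (an appearance kernel over position and color plus a smoothness kernel over position only) and a dense loop over all pairs of locations; the hyperparameter values are hypothetical, and an efficient implementation would rely on the filtering code of [12] rather than explicit loops.

```python
import math

def two_kernel(pos_i, pos_j, col_i, col_j,
               w1=1.0, w2=1.0, s_alpha=2.0, s_beta=0.2, s_gamma=1.0):
    """Appearance kernel (position and color) plus smoothness kernel (position only).
    All hyperparameter defaults are hypothetical."""
    d_pos = sum((a - b) ** 2 for a, b in zip(pos_i, pos_j))
    d_col = sum((a - b) ** 2 for a, b in zip(col_i, col_j))
    appearance = w1 * math.exp(-d_pos / (2 * s_alpha ** 2) - d_col / (2 * s_beta ** 2))
    smoothness = w2 * math.exp(-d_pos / (2 * s_gamma ** 2))
    return appearance + smoothness

def weighted_mean_neighborhood(q, positions, colors):
    """For each location, average the distributions of all other locations,
    weighted by their kernel similarity to that location."""
    n, num_labels = len(q), len(q[0])
    nbr = []
    for i in range(n):
        weights = [0.0 if j == i else
                   two_kernel(positions[i], positions[j], colors[i], colors[j])
                   for j in range(n)]
        total = sum(weights)
        nbr.append([sum(weights[j] * q[j][l] for j in range(n)) / total
                    for l in range(num_labels)])
    return nbr

# Four pixels in a row: two dark ones followed by two bright ones.
positions = [(0, 0), (0, 1), (0, 2), (0, 3)]
colors = [(0.10,), (0.12,), (0.90,), (0.88,)]
q = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
nbr = weighted_mean_neighborhood(q, positions, colors)
```

Since the weights are normalized per location, each neighborhood distribution is a convex combination of the other pixels' distributions and therefore sums to one.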

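Exponentiated weighted mean: As the name suggests, the neighborhood distribution in this approach is obtained by exponentiating the weighted mean. The exponentiation causes the neighborhood distribution to be sharper, resulting in high confidence predictions.

$$q^{nbr}_i(l) = \frac{1}{Z_i} \exp\left(\sum_{j \neq i} \bar{w}_{ij}\, q^{pred}_j(l)\right)$$

where $\bar{w}_{ij} = w_{ij} / \sum_{j' \neq i} w_{ij'}$ is the normalized similarity between the locations $i$ and $j$, and $Z_i$ is a normalization constant that ensures that the above distribution sums up to $1$. Next, we show the connection between the exponentiated weighted mean and CRF.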
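The exponentiated variant and the resulting neighborhood loss can be sketched as below. The similarity matrix `weights` is a hypothetical input (any symmetric, non-negative similarity would do), and the direction of the KL-divergence, from the output distribution to the neighborhood distribution, is an assumption made for this illustration only.

```python
import math

def exponentiated_weighted_mean(q, weights):
    """q[i]: distribution at location i; weights[i][j]: similarity of i and j.
    Weights are normalized per location and the self-weight is ignored."""
    n, num_labels = len(q), len(q[0])
    nbr = []
    for i in range(n):
        total = sum(weights[i][j] for j in range(n) if j != i)
        mean = [sum(weights[i][j] * q[j][l] for j in range(n) if j != i) / total
                for l in range(num_labels)]
        exps = [math.exp(m) for m in mean]
        z = sum(exps)  # per-location normalization constant
        nbr.append([e / z for e in exps])
    return nbr

def kl_divergence(p, r):
    """KL(p || r) for two discrete distributions with r > 0 everywhere."""
    return sum(pl * math.log(pl / rl) for pl, rl in zip(p, r) if pl > 0)

def neighborhood_loss(q, weights):
    """Sum over locations of KL(output distribution || neighborhood distribution)."""
    nbr = exponentiated_weighted_mean(q, weights)
    return sum(kl_divergence(q_i, n_i) for q_i, n_i in zip(q, nbr))

weights = [[0.0, 1.0, 0.2],
           [1.0, 0.0, 0.2],
           [0.2, 0.2, 0.0]]
q = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
loss = neighborhood_loss(q, weights)
uniform_loss = neighborhood_loss([[0.5, 0.5]] * 3, weights)
```

In a training loop this term would be added, with some weight, to the loss computed against the auxiliary labels.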

3.5 Connections with CRF

In this section, we provide a formal justification for the choice of the neighborhood-based objective function. In particular, we will show that the objective emerges naturally, when a CRF is used as a prior while computing the conditional log-likelihood.

Given an image $x$, let the CRF prior over the segmentation masks be defined as below:

$$p(y \mid x) = \frac{1}{Z(x)} \exp\left(-\sum_{i<j} \psi_{ij}(y_i, y_j, x)\right)$$

where $Z(x)$ is the normalization constant. Note that the prior distribution has no unary potentials. We will further assume that the binary potentials have no trainable parameters.

The prior provides a distribution over all possible segmentation masks for a given image. Furthermore, let the binary potential $\psi_{ij}$ have the form

$$\psi_{ij}(y_i, y_j, x) = -k(i, j)\,[y_i = y_j]$$

for some choice of kernel $k$, where $[\cdot]$ equals $1$ when its argument holds and $0$ otherwise. The corresponding CRF prior gives low probability to masks that assign different labels to pixels $i$ and $j$ with high similarity (that is, high $k(i, j)$). This is a reasonable prior assumption about the segmentation mask of an image. Note that the prior does not penalize masks that assign the same label to pixels $i$ and $j$ with low similarity. This allows the inclusion of object classes with multicolored instances. For instance, a person's clothing will often differ in color from their skin.

If the CRF prior is approximated by a fully factorized distribution, the resultant distribution will have to satisfy the constraints entailed in Proposition 3.1.

Proposition 3.1.

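Let $q$ be the mean field approximation to the CRF prior $p(y \mid x)$. That is, among all distributions of the form $q(y) = \prod_i q_i(y_i)$, let $q$ be the one that minimizes $KL(q \,\|\, p)$. Then the distribution $q$ satisfies the following constraints:

$$q_i(l) = \frac{1}{Z_i} \exp\left(\sum_{j \neq i} k(i, j)\, q_j(l)\right)$$

for every location $i$ and every label $l$, where $k(i, j)$ is the kernel that defines the binary potential of the prior.

These constraints are referred to as mean field constraints. Here, $Z_i$ is the normalization constant that ensures that the distribution $q_i$ sums up to $1$. The proof of the Proposition is given in the Appendix.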

Coming back to the predictive distribution for segmentation masks in our model, $q^{pred}$, if one wishes to impose a CRF prior on $q^{pred}$, one must force it to satisfy the mean-field constraints, that is,

$$q^{pred}_i(l) = \frac{1}{Z_i} \exp\left(\sum_{j \neq i} k(i, j)\, q^{pred}_j(l)\right)$$

for every location $i$ and every label $l$. The distribution on the RHS of the above equation is exactly the neighborhood distribution of equation (12), with the exception that the kernel is normalized in (12). The distribution $q^{pred}$ is defined in terms of the output of a neural network. Hence, instead of the equality constraints imposed by the mean-field, we add the KL-divergence between $q^{pred}_i$ and the distribution on the RHS to the objective. The KL-divergence term forces the output of the network to satisfy the mean field constraints imposed by the CRF prior.

Note: The binary potential used in the CRF prior in this section is given by $-k(i, j)\,[y_i = y_j]$. In contrast, the binary potential commonly used for semantic segmentation has the form $k(i, j)\,[y_i \neq y_j]$. However, when the kernel is normalized at the pixel-level (as has been suggested in [12]), the resultant distributions are exactly the same.

4 Relation with similar works

Recent works on semantic segmentation using deep architectures have focused on pairwise CRFs with only unary and binary potentials. The unary potentials are specified by the output of a CNN, while the binary potentials have no learnable parameters [4, 31, 22]. The work in [14] allows the binary potentials to be learnable as well.

Our work differs significantly from the above-mentioned works in the learning algorithm used for training the parameters of the CRF. Most of the works that combine CNNs with CRFs use piecewise learning [24, 14, 4], that is, the energy function is decomposed into its potentials, and each potential is normalized individually. For instance, if $\psi_1, \ldots, \psi_m$ are the potentials that form the energy function, the piecewise approximation to the objective is given by

$$\log p(y \mid x) \geq \sum_{c=1}^{m} \left(-\psi_c(y_c, x) - \log Z_c(x)\right)$$

where $y_c$ denotes the labels involved in the $c$-th potential and $Z_c(x)$ is the normalization constant for the $c$-th potential.

Hence, in a pairwise CRF, with unary potentials given by the output of a CNN, each output location of the CNN is trained independently. Furthermore, the contribution of the pairwise potentials is not incorporated during training of the parameters of the CNN. Hence, the training is equivalent to the training of several independent classifiers, one for each location in the segmentation mask.

While piecewise training is extremely efficient, the lower bound that it optimizes is a very weak lower bound of the true log-likelihood. By training the local classifier without incorporating the binary potentials, we ignore the dependence among labels of nearby pixels with similar color. More importantly, piecewise training is completely unsuitable when the pixel-level labels are absent, which is the main concern of this paper.

More recently, several authors have considered training the mean field approximation rather than the actual CRF distribution [11, 31, 22] for semantic segmentation. The mean-field approximation for a distribution $p$ is the fully factorized distribution $q$ that minimizes the KL-divergence $KL(q \,\|\, p)$. By computing the gradient of the KL-divergence with respect to $q$, and equating it to $0$, one can obtain an iterative algorithm for finding the minima. For a pairwise CRF, the mean-field update equations are given by

$$q_i^{(t)}(l) = \frac{1}{Z_i} \exp\left(-\psi_i(l, x) - \sum_{j \in \mathcal{N}(i)} \sum_{l'} \psi_{ij}(l, l', x)\, q_j^{(t-1)}(l')\right)$$

where $q_i^{(t)}$ is the distribution of the $i$-th location of the mean field approximation at the $t$-th iteration, and $\psi_i$ and $\psi_{ij}$ are the unary and binary potentials.

Hence, the mean field approximation at the $t$-th iteration can be defined recursively, as a function of the mean-field approximation at the $(t-1)$-th iteration and the potential functions. Consequently, the gradient of the mean-field distribution for iteration $t$ can be written as a function of the gradient of the approximation at the $(t-1)$-th iteration and the gradient of the potential functions. This approach for training CRFs has been used in [11, 31, 22].
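The recursive structure of the mean-field updates can be seen in the following sketch for a pairwise CRF with Potts binary potentials. The parallel update schedule, the initialization from the unaries and the toy potentials are assumptions chosen for brevity; this is not the implementation used in the cited works.

```python
import math

def normalize_exp(logits):
    """Exponentiate and normalize a list of log-scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def mean_field(unary, pair_weight, num_iters=10):
    """unary[i][l]: unary potential of label l at location i.
    pair_weight[i][j]: strength of the Potts potential pair_weight[i][j] * [y_i != y_j]."""
    n, num_labels = len(unary), len(unary[0])
    q = [normalize_exp([-u for u in unary[i]]) for i in range(n)]  # start from unaries
    for _ in range(num_iters):
        new_q = []
        for i in range(n):
            logits = []
            for l in range(num_labels):
                # Expected binary cost of label l at i under the previous iterate:
                # sum_j w_ij * sum_{l'} [l != l'] q_j(l') = sum_j w_ij * (1 - q_j(l)).
                cost = sum(pair_weight[i][j] * (1.0 - q[j][l])
                           for j in range(n) if j != i)
                logits.append(-unary[i][l] - cost)
            new_q.append(normalize_exp(logits))
        q = new_q  # parallel update: iteration t only uses iteration t - 1
    return q

# Two pixels: the first is confident, the second is undecided but strongly tied to it.
unary = [[-math.log(0.9), -math.log(0.1)],
         [-math.log(0.5), -math.log(0.5)]]
pair_weight = [[0.0, 2.0], [2.0, 0.0]]
q = mean_field(unary, pair_weight)
```

Every iterate is a differentiable function of the previous one, which is what allows gradients to be propagated through a fixed number of such updates.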

5 Experimental setup

5.1 Network architecture

For our experiments, we have used a pretrained VGG16 network (trained on ImageNet for classification and obtained from torchvision), and we have modified it for the task of semantic segmentation. The VGG16 network consists of 13 convolutional layers and 3 fully connected layers. First, we removed the fully connected layers and the last pooling layer. The resultant network receives images of size , and generates feature maps of size . The receptive field of a neuron in the last convolutional layer is , and hence, it nearly encompasses the entire input image. This implies that every neuron in the last layer has access to almost the entire image. This network serves as an encoder in our model.
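The receptive field of the truncated encoder can be worked out with a few lines of arithmetic. The sketch below assumes the standard VGG16 layout (3×3 convolutions with stride 1, 2×2 max-pooling with stride 2, in blocks of 2, 2, 3, 3 and 3 convolutions) with the last pooling layer removed; the helper name is hypothetical.

```python
def receptive_field(layers):
    """Return (receptive field, cumulative stride) for a stack of
    (kernel_size, stride) layers applied in order."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= stride             # distance in input pixels between adjacent outputs
    return rf, jump

CONV, POOL = (3, 1), (2, 2)
# 13 convolutional layers in 5 blocks, with the final pooling layer removed.
vgg16_encoder = ([CONV] * 2 + [POOL] + [CONV] * 2 + [POOL] +
                 [CONV] * 3 + [POOL] + [CONV] * 3 + [POOL] + [CONV] * 3)
rf, stride = receptive_field(vgg16_encoder)
```

Under these assumptions the last convolutional layer has a receptive field of 196 pixels per side and a cumulative stride of 16.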

To learn fine-grained contours of objects, we added skip connections from the layers of the encoder to the layers of the decoder. The receptive-field size of the neurons in the encoder is much smaller than that of their counterparts in the decoder. Hence, they have access to more fine-grained information. We apply a convolution to the outputs of the encoder layers before concatenating them with the layers of the decoder. The concatenated output is fed through multiple layers of convolution and non-linearities, to allow the final prediction layer to learn a non-linear mapping of the local and global information. The final architecture of our model is shown in Figure 1.

5.2 Dataset

Models that use weak supervision for training often employ a much larger dataset than fully-supervised models. For instance, the model in [28] mines the pages of Flickr to generate the training data. Similarly, the model in [21] uses a subset of the Flickr dataset [8] for training. A clean subset of the ImageNet dataset is used in [7], while a much larger subset of the same dataset is used in [19]. More importantly, almost all the approaches for semantic segmentation utilize the ImageNet dataset for pretraining the network.

We follow the experimental setup of [19]. In particular, we downloaded images of objects belonging to the object classes in the VOC 2012 dataset [6] from the ImageNet database [20]. Several authors mine the dataset further to obtain a set of simple images which contain the object against a plain background. However, we choose to use the entire dataset to minimize manual dependence. A script to download the ImageNet classes used in training will soon be available.

For testing, we use the validation and test set of VOC 2012 dataset. For most of our experiments, we use the validation set only, since the ground truth is publicly available. The test set is used only for comparing the final model against the state-of-the-art.

Almost all approaches for weakly supervised semantic segmentation report their results on the VOC 2012 dataset. This is primarily because the classes in VOC 2012 are discrete object categories such as sheep, person, dog, etc. Most images in VOC 2012 contain very few of these object classes, and hence, it is easy to utilize weak labels for training. In contrast, two classes that almost always occur together, for instance road and sky, are impossible to tell apart using weak labels only.

5.3 Training protocol

Stochastic gradient descent (SGD) with a minibatch of images is used for training. The initial learning rate for the pretrained layers is set at , while the initial learning rate for newly added layers is set at . The momentum and the weight decay for gradient descent is set at and respectively. We halve the learning rate after every iterations. All the networks are trained for iterations.
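The step schedule described above (halving the learning rate at fixed intervals, with separate initial rates for pretrained and newly added layers) can be sketched as follows. All numeric values in the example are hypothetical placeholders and are not the settings used in the paper.

```python
def learning_rate(initial_lr, iteration, halve_every):
    """Step schedule: the learning rate is halved after every `halve_every` iterations."""
    return initial_lr * 0.5 ** (iteration // halve_every)

# Hypothetical placeholder values, chosen only to illustrate the schedule.
PRETRAINED_LR = 1e-4   # initial rate for the pretrained encoder layers
NEW_LAYER_LR = 1e-3    # initial rate for the newly added layers
HALVE_EVERY = 1000     # iterations between two halvings

schedule = [(it,
             learning_rate(PRETRAINED_LR, it, HALVE_EVERY),
             learning_rate(NEW_LAYER_LR, it, HALVE_EVERY))
            for it in (0, 999, 1000, 2500)]
```

Both groups of layers follow the same decay, so the ratio between their learning rates stays constant throughout training.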

6 Experiments

We use a weighted combination of the objective based on auxiliary labels and the neighborhood based objective in our experiments. For our comparisons, we have limited ourselves to models that do not employ pixel level information. Hence, we have excluded the models that employ agnostic segmentation or saliency masks that have been manually labelled [7, 26] with the exception of STC [28].


The model is trained on the ImageNet subset and evaluated on the VOC 2012 validation set.

Hyperparameters: We use the same hyperparameter settings for the kernel as used in the publicly available code associated with the paper [12]. In particular, we modify the code in [12] to allow us to compute the weighted and exponentiated weighted mean. To account for the fact that our network reduces the segmentation mask to one-fourth of the input image, we divide and by . during training.

We also evaluate the effect of the hyperparameter on the performance of the model. Note that the KL-divergence term will be minimized when all the pixels are assigned the same label. Hence, when higher weight is given to the KL-divergence term, all the pixels get assigned to the background class (since it is the most prominent class). In contrast, if the weight is close to $0$, the boundaries of the segmentation masks learnt by the model do not coincide with the boundaries of the objects. We vary the value of from to to study the effects of on the performance of the model. The results are shown in Figure 3. As can be observed, when the weighted mean is used as the neighborhood distribution, the model remains quite stable to the choice of . However, the overall performance achieved is worse than that of the model that relies on the exponentiated weighted mean (50.7% vs 51.6%). The segmentation masks generated by using the exponentiated weighted mean at are shown in Table 2.

We also compare the proposed model against other models for weakly supervised segmentation on the VOC 2012 val set. Note that generating segmentation masks involves identifying the object and identifying the boundary of the object. If the boundary of the object can be identified correctly, segmentation involves classifying the segmented object which can be done using a classifier with relative ease. Hence, models that employ networks pretrained on boundary detection are obviously at an advantage and hence, excluded from comparison. This includes [7] which employs groundtruth saliency masks, since the boundaries in the saliency masks used in this paper, coincide with the boundaries of the segmentation masks.

The models used for comparison are (i) MIL+ILP [19], (ii) EM-Adapt [16], (iii) CCNN [17], (iv) SEC [10] and (v) STC [28]. Among these models, SEC employs separate convolutional networks for computing the saliency maps for foreground and background. On the other hand, the saliency network in STC is trained using dense (pixel-level) supervision. The results are given in Table 1. Note that the proposed model is similar in spirit to CCNN [17] and EM-Adapt [16], with the exception of the incorporation of a neighborhood-based objective. One can observe that the incorporation of neighborhood information results in drastically improved performance.

Finally, we evaluate the performance of our model on the VOC 2012 test set by submitting our results to the evaluation server of VOC 2012. An anonymous view of the results is available at . We achieve an accuracy of 52.01%, which is the state of the art for weakly supervised semantic image segmentation on this dataset.








Class        MIL+ILP [19]  EM-Adapt [16]  CCNN [17]  SEC [10]  STC [28]  Ours
background   77.2          67.2           68.5       82.4      84.5      85.4
aeroplane    37.3          29.2           25.5       62.9      68.0      68.4
bike         18.4          17.6           17.0       26.4      19.5      28.3
bird         25.4          28.6           25.4       61.6      60.5      63.6
boat         28.2          22.2           20.2       27.6      42.5      42.9
bottle       31.9          29.6           26.3       38.1      44.8      54.4
bus          41.6          47.0           46.8       66.6      68.4      62.7
car          48.1          44.0           47.1       62.7      64.0      62.9
cat          50.7          44.2           48.0       75.2      64.8      67.5
chair        12.7          14.6           15.8       22.1      14.5      10.6
cow          53.5          45.7           35.1       53.5      52.0      46.3
diningtable  14.6          24.9           21.0       28.3      22.8      37.2
dog          50.9          41.0           44.5       65.8      58.0      48.7
horse        44.1          34.8           34.5       57.8      55.3      53.8
motorbike    39.2          41.6           46.2       62.3      57.8      57.3
person       37.9          32.1           40.7       62.3      60.5      64.7
plant        28.3          30.4           24.8       32.5      40.6      44.3
sheep        44.0          36.3           37.4       62.6      56.7      58.2
sofa         19.6          24.0           22.2       32.1      23.0      35.8
train        37.6          38.1           38.8       45.4      57.1      43.6
tvmonitor    35.0          31.6           36.9       45.3      31.2      47.4
mean         36.6          33.8           35.3       50.7      49.8      51.6
Table 1: Results (IoU in %, per class and mean) on the PASCAL VOC 2012 val set for weakly supervised segmentation.
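As a quick consistency check, the mean IoU in the last row of Table 1 can be recomputed from the per-class entries. The sketch below does this for the last column, which is assumed here to correspond to the proposed model, since its mean matches the 51.6% reported for the validation set.

```python
# Per-class IoU (%) from the last column of Table 1, background first.
last_column = [85.4, 68.4, 28.3, 63.6, 42.9, 54.4, 62.7,
               62.9, 67.5, 10.6, 46.3, 37.2, 48.7, 53.8,
               57.3, 64.7, 44.3, 58.2, 35.8, 43.6, 47.4]

# The mean IoU is the unweighted average over the 21 classes (20 objects + background).
mean_iou = sum(last_column) / len(last_column)
reported = round(mean_iou, 1)
```

The unweighted average treats every class equally, so rare classes such as chair weigh as much as the background class.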
Figure 3: When weighted mean is used as neighborhood distribution
Table 2: Examples of predicted segmentation masks. The middle row is the ground truth. Note that the model has learnt to align the predicted boundaries with the true boundaries.

7 Conclusions

In this paper, we have proposed a new model for weakly supervised semantic image segmentation that uses only image-level labels. We have shown that the output of the CNN can be forced to satisfy the constraints of a conditional random field, without explicitly evaluating the mean-field distribution at every step. We achieve this by forcing the output distribution at every pixel to be close to that of its neighbors. As a consequence, we achieve a significant performance improvement over traditional CNNs with a negligible increase in training time.

We focused on weakly annotated images in this paper, since CRFs can achieve drastic performance improvements for this task. When pixel-level information is available, forcing the pixel-level labels to be close to those of their neighbors will serve as an unnecessary regularizer that often over-smooths, unless only rough object boundaries are available. A model capable of handling rough object boundaries can drastically reduce the time required for manually generating segmentation masks. We intend to explore the utility of the model for handling rough boundaries in the future.

8 Acknowledgement

The authors acknowledge financial support for the "CyberGut" expedition project by the Robert Bosch Centre for Cyber Physical Systems at the Indian Institute of Science, Bengaluru.


On Adaptive prior

Let be an image and be the corresponding image level labels (including the background class). The optimization problem for obtaining the pixel-labels using an adaptive prior is given below:

subject to

Here, is a constant that determines the minimum fraction of pixels in the image that must be assigned to class . When the images contain objects from a single object class (as is the case with ImageNet dataset), say , the above optimization problem can be rewritten as shown below:

subject to



and .

The above problem is an optimization problem in single variable. We can solve it approximately by evaluating the constraints for several values of . This can be achieved very efficiently on a GPU, since it involves element-wise operations on matrices of probability distributions. Let be the set of all the selected values of which satisfy the constraints. Among these values, we return the value of which minimizes the objective.

If none of the selected values of satisfy the constraints, we choose the value of which minimizes the following:


Proof of Proposition 3.1

Proposition 3.1 Let be the mean field approximation to the CRF prior . That is, among all distributions of the form , let be the one that minimizes . Then the distribution will have to satisfy the following constraints:


The proof of this result can be obtained by differentiating the KL-divergence with respect to the components of the mean field distribution and equating the derivative to $0$. Towards that end, we expand the KL-divergence term as given below:


Here, is a constant that doesn't depend on . Differentiating the above expression with respect to and equating it to $0$, we get


Finally, since the distribution at each location is a probability distribution that sums up to $1$, we normalize to get the desired result. ∎
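For readers who want the intermediate steps, the derivation can be sketched as follows. The notation is an assumption made for this sketch: the prior is taken to be $p(y \mid x) = \frac{1}{Z} \exp\big(\sum_{i<j} k(i,j)[y_i = y_j]\big)$ with a symmetric kernel $k$, the factorized distribution is $q(y) = \prod_i q_i(y_i)$, labels are indexed by $l$, and the normalization of each $q_i$ is handled with a Lagrange multiplier $\lambda_i$ rather than by normalizing at the end.

```latex
% Sketch under assumed notation; k(i,j) is a symmetric kernel and l indexes labels.
\begin{align*}
KL(q \,\|\, p)
  &= \sum_i \sum_l q_i(l) \log q_i(l)
     - \sum_{i<j} k(i,j) \sum_l q_i(l)\, q_j(l) + \log Z, \\
0 &= \frac{\partial}{\partial q_i(l)}
     \Big[ KL(q \,\|\, p) + \sum_{i'} \lambda_{i'} \Big( \sum_{l'} q_{i'}(l') - 1 \Big) \Big]
   = \log q_i(l) + 1 - \sum_{j \neq i} k(i,j)\, q_j(l) + \lambda_i, \\
q_i(l) &= \frac{1}{Z_i} \exp\Big( \sum_{j \neq i} k(i,j)\, q_j(l) \Big),
\qquad
Z_i = \sum_{l'} \exp\Big( \sum_{j \neq i} k(i,j)\, q_j(l') \Big).
\end{align*}
```

The second term of the first line uses the fact that, under a factorized $q$, the expected value of $[y_i = y_j]$ is $\sum_l q_i(l)\, q_j(l)$; the constant $e^{-1-\lambda_i}$ is absorbed into $1/Z_i$.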