A Caffe implementation of http://arxiv.org/abs/1512.07928
We propose a novel weakly-supervised semantic segmentation algorithm based on a Deep Convolutional Neural Network (DCNN). Contrary to existing weakly-supervised approaches, our algorithm exploits auxiliary segmentation annotations available for different categories to guide segmentation on images with only image-level class labels. To make the segmentation knowledge transferable across categories, we design a decoupled encoder-decoder architecture with an attention model. In this architecture, the model generates spatial highlights of each category present in an image using the attention model, and subsequently generates a foreground segmentation for each highlighted region using the decoder. Combined with the attention model, the decoder trained with segmentation annotations in different categories boosts the performance of weakly-supervised semantic segmentation. The proposed algorithm demonstrates substantially improved performance compared to state-of-the-art weakly-supervised techniques on the challenging PASCAL VOC 2012 dataset when our model is trained with the annotations in 60 exclusive categories of the Microsoft COCO dataset.
Semantic segmentation refers to the task of assigning dense class labels to each pixel in an image. Although pixel-wise labels provide richer descriptions of images than bounding box labels or image-level tags, inferring such labels is a much more challenging task as it involves a highly complicated structured prediction problem.
Recent breakthroughs in semantic segmentation have been mainly driven by approaches based on Convolutional Neural Networks (CNNs) [21, 4, 11, 10, 25]. Given a classification network pre-trained on a large image collection, they learn a network for segmentation based on strong supervision, i.e., pixel-wise class labels. Although these approaches substantially improve performance over the prior art, training a CNN requires a large number of high-quality segmentation annotations, which are difficult to collect due to the extensive labeling cost. For this reason, scaling up the semantic segmentation task to a large number of categories is very challenging in practice.
Weakly-supervised semantic segmentation [5, 31, 27, 29] is an alternative approach that alleviates the annotation effort required by supervised methods. These methods infer latent segmentation labels from training images given weak annotations such as image-level class labels [31, 27, 29] or bounding boxes. Since such annotations are easy to collect and are often already available in existing datasets, it is straightforward to apply those approaches to large-scale problems with many categories. However, the segmentation quality of weakly-supervised techniques is typically much worse than that of supervised methods, since there is no direct supervision for segmentation, such as object shapes and locations, during training.
The objective of this paper is to reduce the gap between semantic segmentation algorithms based on strong supervision (e.g., semi- and fully-supervised approaches) and weak supervision (e.g., weakly-supervised approaches). Our key idea is to employ segmentation annotations available for different categories to compensate for the missing supervision in weakly annotated images. No additional cost is required to collect such data since there are already several publicly available datasets with pixel-wise annotations, e.g., BSD, Microsoft COCO, and LabelMe. These datasets have not been actively explored yet for semantic segmentation due to the mismatches in semantic categories with popular benchmark datasets, e.g., PASCAL VOC. The critical challenge in this problem is to learn common prior knowledge for segmentation that is transferable across categories. This is not a trivial task with existing architectures, since they simply pose semantic segmentation as pixel-wise classification, which makes it difficult to exploit examples from unseen classes.
We propose a novel encoder-decoder architecture with an attention model, which is conceptually appropriate for transferring segmentation knowledge from one category to another. In this architecture, the attention model generates category-specific saliency at each location of an image, while the decoder performs foreground segmentation using the saliency map based on category-independent segmentation knowledge. Our model trained on one dataset is transferable to another by adapting the attention model to focus on unseen categories. Since the attention model is trainable with only image-level class labels, our algorithm is applicable to semantic segmentation on weakly-annotated images through transfer learning.
The contributions of this paper are summarized below.
We propose a new paradigm for weakly-supervised semantic segmentation, which exploits segmentation annotations from different categories to guide segmentations with weak annotations. To our knowledge, this is the first attempt to tackle the weakly-supervised semantic segmentation problem by transfer learning.
We propose a novel encoder-decoder architecture with an attention model, which is appropriate for transferring segmentation knowledge across categories.
Using segmentation annotations only from exclusive categories, the proposed algorithm achieves a substantial performance improvement over existing weakly-supervised approaches.
The rest of the paper is organized as follows. We briefly review related work and introduce our algorithm in Sections 2 and 3, respectively. The detailed configuration of the proposed network is described in Section 4. Training and inference procedures are presented in Section 5. Section 6 presents experimental results on a benchmark dataset.
The recent success of CNNs has brought significant progress in semantic segmentation over the past few years [21, 4, 11, 10, 25]. By posing semantic segmentation as a region-based classification problem, these methods train the network to produce pixel-wise class labels using segmentation annotations as training data [21, 11, 10, 25]. Based on this framework, some approaches improve segmentation performance by learning a deconvolution network to capture accurate object boundaries or by adopting a fully connected CRF as post-processing [38, 4]. However, the performance of the supervised approaches depends heavily on the size and quality of the training data, which limits the scalability of the algorithms.
To reduce the annotation effort, weakly-supervised approaches attempt to learn a model for semantic segmentation with only weak annotations [31, 27, 29, 5]. To infer latent segmentation labels, they often rely on techniques such as Multiple Instance Learning (MIL) [31, 29] or Expectation-Maximization (EM). Unfortunately, these techniques are not sufficient to make up for the missing supervision and lead to significant performance degradation compared to fully-supervised approaches. In between, semi-supervised approaches [27, 13] exploit a limited number of strong annotations to reduce the performance gap between fully- and weakly-supervised approaches. Notably, a decoupled encoder-decoder architecture has been proposed for segmentation, which divides semantic segmentation into two separate problems, classification and segmentation, and learns a decoder to perform binary segmentation for each class identified in the encoder. Although this semi-supervised approach improves performance by sharing the decoder across all classes, it still needs strong annotations in the corresponding classes for segmentation. We avoid this problem by using segmentation annotations available for other categories.
In computer vision, the idea of employing external data to improve the performance of a target task has been explored in the context of domain adaptation [33, 15, 9, 8] and transfer learning [19, 36]. However, approaches in domain adaptation often assume that there are shared categories across domains, and techniques based on transfer learning are often limited to simple tasks such as classification. We refer the reader to the survey literature for comprehensive coverage of domain adaptation and transfer learning. Hoffman et al. proposed a large-scale detection system by transferring knowledge for object detection between categories. Our work shares its motivation with this work, but aims to solve a highly complicated structured prediction problem, semantic segmentation.
There has been a long line of research on learning visual attention [18, 3, 1, 24, 2, 37, 35]. Its objective is to learn an attention mechanism that can adaptively focus on the salient part of an image or video for various computer vision tasks, such as object recognition [18, 1, 2], object tracking, caption generation, image generation, etc. Our work extends this idea to semantic segmentation by transfer learning.
This paper tackles the weakly-supervised semantic segmentation problem from a transfer learning perspective. Suppose that we have two sources of data, $\mathcal{T}$ and $\mathcal{S}$, which are composed of $N_t$ and $N_s$ images, respectively. Note that the set of images in the target domain, denoted by $\mathcal{T}$, only has image-level class labels, while the other set of data in the source domain, referred to as $\mathcal{S}$, has pixel-wise segmentation annotations. Our objective is to improve weakly-supervised semantic segmentation on the target domain using the segmentation annotations available in the source domain. We assume that the target and source domains are composed of exclusive sets of categories. In this setting, there is no direct supervision (i.e., ground-truth segmentation labels) for the categories in the target domain, which makes our objective similar to a weakly-supervised semantic segmentation setting.
To transfer segmentation knowledge from the source to the target domain, we propose a novel encoder-decoder architecture with an attention model. Figure 1 illustrates the overall architecture of the proposed algorithm. The network is composed of three parts: an encoder, a decoder, and an attention model between the encoder and the decoder. In this architecture, the input image is first transformed into a multi-dimensional feature vector by the encoder, and the attention model identifies the salient region for each category associated with the image. The output of the attention model reveals the location of each category in a coarse feature map, from which the dense and detailed foreground segmentation mask for each category is subsequently obtained by the decoder.
Training our network involves different mechanisms for source and target domain examples, since they are associated with heterogeneous annotations with different levels of supervision. We leverage the segmentation annotations from the source domain to train both the decoder and the attention model with a segmentation objective, while image-level class labels in both the target and source domains are used to train the attention model under a classification objective. The training is performed jointly for both objectives using examples from both domains.
The proposed architecture exhibits several advantages for capturing transferable segmentation knowledge across domains. Employing the decoupled encoder-decoder architecture makes it possible to share the information for shape generation among different categories. The attention model provides not only predictions for localization but also category-specific information that enables us to adapt the decoder trained in the source domain to the target domain. The combination of the two components makes segmentation information transferable across different categories, and provides a useful segmentation prior that is missing in weakly annotated images in the target domain.
This section describes the architecture of the proposed algorithm, including the attention model and the decoder.
We first describe notations and the general configuration of the proposed model. Our network is composed of four parts, $f_{\text{enc}}$, $f_{\text{att}}$, $f_{\text{cls}}$ and $f_{\text{dec}}$, which are neural networks corresponding to the encoder, attention model, classifier and decoder, respectively. Our objective is to train all components using the examples from both domains.
Let $x$ denote a training image from either the source or the target domain. We assume that the image is associated with a set of class labels $\mathcal{L}^*$, which is given by either the ground truth (in training) or prediction (in testing). Given an input image $x$, the network first extracts a feature descriptor as
$$A = f_{\text{enc}}(x; \theta_e), \quad A \in \mathbb{R}^{M \times D} \tag{1}$$
where $\theta_e$ is the model parameter for the encoder, and $M$ and $D$ denote the number of hidden units in each channel and the number of channels, respectively. We employ the VGG-16 layer net pre-trained on ImageNet as our encoder $f_{\text{enc}}$, and the feature descriptor $A$ is obtained from the last convolutional layer to retain the spatial information of the input image. The extracted feature and associated labels are then used to generate attention, which is discussed in the following subsection.
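Given a feature descriptor $A$ extracted from the encoder and its associated class labels $\mathcal{L}^*$, the objective of our attention model is to learn a set of positive weight vectors $\{\alpha^l\}_{l \in \mathcal{L}^*}$ defined over a 2D space, where each element of $\alpha^l \in \mathbb{R}^M$ represents the relevance of each feature location to the $l$-th category. Our attention model can be formally given by
$$v^l = f_{\text{att}}(A, y^l; \theta_\alpha), \quad v^l \in \mathbb{R}^M \tag{2}$$
$$\alpha^l_i = \frac{\exp(v^l_i)}{\sum_{j=1}^{M} \exp(v^l_j)}, \quad \alpha^l \in \mathbb{R}^M \tag{3}$$
where $y^l$ denotes a one-hot label vector for the $l$-th category, and $v^l$ represents the unnormalized attention weights. To encourage the model to pay attention to only a part of the image, we normalize $v^l$ to $\alpha^l$ using a softmax function.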
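To generate category-specific attention using our attention model $f_{\text{att}}$, we employ multiplicative interactions between the feature and the label vector. The model learns a set of gating parameters represented by a 3-way tensor to model the correlation between feature and label vectors. For scalability, we reduce the number of parameters using a factorization technique, and our model can be written as
$$v^l = W^{\text{att}} \left( W^{\text{att}}_{\text{feat}} A \odot W^{\text{att}}_{\text{label}} y^l \right) + b \tag{4}$$
where $\odot$ denotes element-wise multiplication and $b \in \mathbb{R}^M$ is the bias. Note that the weights are given by $W^{\text{att}}_{\text{feat}} \in \mathbb{R}^{d \times MD}$, $W^{\text{att}}_{\text{label}} \in \mathbb{R}^{d \times L}$ and $W^{\text{att}} \in \mathbb{R}^{M \times d}$, where $L$ and $d$ denote the size of the label vector and the number of factors, respectively (so $A$ enters Eq. (4) flattened into a vector of length $MD$). We observe that using a multiplicative interaction generally gives better results than additive ones (e.g., concatenation), because it is capable of capturing high-order dependencies between the feature and the label.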
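The factorized multiplicative interaction followed by softmax normalization can be illustrated with a small NumPy sketch. The function name `attention_weights`, the toy sizes, and the random weights are assumptions made for illustration; they are not taken from the released Caffe model.

```python
import numpy as np

def attention_weights(A, y, W_feat, W_label, W_att, b):
    """Factorized multiplicative attention followed by a spatial softmax.

    A: (M, D) feature map, y: (L,) one-hot label vector.
    Returns a length-M vector of positive weights that sum to one.
    """
    feat_factors = W_feat @ A.reshape(-1)             # (d,) projection of the flattened feature
    label_factors = W_label @ y                       # (d,) projection of the label
    v = W_att @ (feat_factors * label_factors) + b    # (M,) unnormalized attention
    v = v - v.max()                                   # stabilise the softmax numerically
    return np.exp(v) / np.exp(v).sum()

rng = np.random.default_rng(0)
M, D, L, d = 16, 8, 5, 4                              # toy sizes (assumptions)
A = rng.standard_normal((M, D))
W_feat = rng.standard_normal((d, M * D)) * 0.1
W_label = rng.standard_normal((d, L))
W_att = rng.standard_normal((M, d))
b = np.zeros(M)

# The same feature map yields different attention for different input labels.
alpha_0 = attention_weights(A, np.eye(L)[0], W_feat, W_label, W_att, b)
alpha_1 = attention_weights(A, np.eye(L)[1], W_feat, W_label, W_att, b)
```

Because the label only selects one column of the label projection, changing the one-hot input re-gates the same feature factors, which is what makes the attention category-specific.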
To apply the attention to our transfer-learning scenario, the model should be trainable in both the target and source domains. Since examples from the target domain are associated with only image-level class labels, we put additional layers $f_{\text{cls}}$ on top of the attention model, and optimize both $f_{\text{att}}$ and $f_{\text{cls}}$ under the classification objective. To this end, we extract features based on the category-specific attention by aggregating features over the spatial region as follows:
$$z^l = A^{\top} \alpha^l, \quad z^l \in \mathbb{R}^D \tag{5}$$
Intuitively, $z^l$ represents a category-specific feature defined over all the channels in the feature map.
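As a small worked example of this spatial aggregation, the sketch below weights each location of a toy feature map by its attention and sums over locations. The array values and the function name `attended_feature` are illustrative assumptions.

```python
import numpy as np

def attended_feature(A, alpha):
    """Aggregate an (M, D) feature map over its M locations using weights alpha (M,)."""
    return A.T @ alpha                 # (D,) category-specific feature

A = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [4.0, 4.0]])             # M = 3 locations, D = 2 channels
alpha = np.array([0.5, 0.5, 0.0])      # attention concentrated on the first two locations
z = attended_feature(A, alpha)         # -> [0.5, 1.0]
```

The third location has large activations but zero attention, so it does not contribute to the aggregated feature.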
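Using the weak annotations associated with both target and source domain images, our attention model is trained to minimize the classification loss as follows:
$$\min_{\theta_\alpha, \theta_c} \sum_{i \in \mathcal{T} \cup \mathcal{S}} \sum_{l \in \mathcal{L}^*_i} e_c\left(y^l_i, f_{\text{cls}}(z^l_i; \theta_c)\right) \tag{6}$$
where $e_c$ denotes the loss between the ground-truth label vector $y^l_i$ and the predicted label vector $f_{\text{cls}}(z^l_i; \theta_c)$, both of which are defined for a single label $l$ and aggregated over $\mathcal{L}^*_i$ for each image. We employ the softmax loss function to measure the classification loss.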
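The per-label softmax loss and its aggregation over the labels of one image can be sketched as follows. This is a minimal illustration under the assumption that the classifier outputs one score vector per attended label; `softmax_loss` and `image_classification_loss` are hypothetical helper names.

```python
import numpy as np

def softmax_loss(logits, target):
    """Negative log-probability of class `target` under a softmax over `logits`."""
    shifted = logits - np.max(logits)                  # avoid overflow in exp
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-log_probs[target])

def image_classification_loss(logits_per_label):
    """Sum the per-label losses over all labels attached to one image.

    logits_per_label maps a class index l to the classifier scores computed
    from the attended feature of that class.
    """
    return sum(softmax_loss(scores, l) for l, scores in logits_per_label.items())

# Toy image with two labels (class 0 and class 2) out of four classes.
scores = {0: np.zeros(4), 2: np.array([0.0, 0.0, 5.0, 0.0])}
loss = image_classification_loss(scores)
```

In this toy case the uninformative scores for class 0 contribute log(4), while the confident scores for class 2 contribute a value close to zero.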
The optimization of Eq. (6) is potentially prone to overfitting, since the ground-truth class label vector $y^l$ is also given as an input to the attention model. In practice, we observe that our model avoids this issue by eliminating the direct link from attention to label prediction and constructing the intermediate representation $z^l$ based on the original feature $A$.
Figure 2 illustrates the learned attention weights for individual classes. We observe that the attention model adapts spatial saliency effectively depending on its input labels.
The attention model described in the previous section generates a set of adaptive saliency maps $\{\alpha^l\}$, one for each category, which provide useful information for localization. Given these attentions, the next step of our algorithm is to reconstruct a dense foreground segmentation mask for each attended category using the decoder. However, the direct application of attention weights to segmentation may be problematic, since the activations tend to be sparse due to the softmax operation in Eq. (3) and may lose information encoded in the feature map that is useful for shape generation.
To resolve this issue and reconstruct useful information for segmentation, we feed additional inputs to the decoder using the attention $\alpha^l$ and the original feature $A$. Rather than directly using the attention, we exploit the intermediate representation $z^l$ obtained from Eq. (5). It represents the relevance of each channel of the feature maps with respect to the $l$-th category. Then we aggregate the spatial activations in each channel of the feature using $z^l$ as coefficients, which is given by
$$s^l = A z^l, \quad s^l \in \mathbb{R}^M \tag{7}$$
where $s^l$ represents the densified attention, which has the same size as $\alpha^l$ and serves as the input to the decoder. As shown in Figure 3, densified attention maps preserve more details of the object shape compared to the original attention ($\alpha^l$). See Appendix A for a more comprehensive analysis of Eq. (7).
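The densification step can be illustrated on a toy feature map: the attention is first turned into per-channel coefficients, which are then projected back onto every location. The values below and the function name `densified_attention` are assumptions chosen so that the result is easy to verify by hand.

```python
import numpy as np

def densified_attention(A, alpha):
    """Spread a sparse attention over all locations with similar features.

    A: (M, D) feature map, alpha: (M,) attention weights.
    """
    z = A.T @ alpha        # (D,) relevance of each channel to the attended category
    return A @ z           # (M,) densified attention, one value per location

A = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0]])              # locations 0 and 1 have similar features
alpha = np.array([1.0, 0.0, 0.0])       # attention falls on location 0 only
s = densified_attention(A, alpha)       # -> [1.0, 0.9, 0.0]
```

Location 1 receives no attention at all, yet it obtains a high densified value because its feature resembles that of the attended location, while the dissimilar location 2 stays at zero.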
Given the densified attention $s^l$ as input, the decoder is trained to minimize the segmentation loss by the following objective function
$$\min_{\theta_\alpha, \theta_s} \sum_{i \in \mathcal{S}} \sum_{l \in \mathcal{L}^*_i} e_s\left(d^l_i, f_{\text{dec}}(s^l_i; \theta_s)\right) \tag{8}$$
where $d^l_i$ denotes the binary segmentation mask of the $i$-th image for the $l$-th category, and $e_s$ denotes the pixel-wise loss function between the ground-truth and predicted segmentation masks. Similar to classification, we employ a softmax loss function for $e_s$. Since training requires ground-truth segmentation annotations, the objective function is only optimized with images in the source domain.
We employ the recently proposed deconvolution network for our decoder architecture $f_{\text{dec}}$. Given an input $s^l$ to the decoder, it generates a segmentation mask of the same size as the input image through multiple successive operations of unpooling, deconvolution and rectification. Pooling switches are shared between pooling and unpooling layers, which is appropriate for recovering accurate object boundaries. We refer to the original deconvolution network paper for more details.
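To make the role of pooling switches concrete, the following NumPy sketch records the position of each maximum during 2x2 max-pooling and reuses it during unpooling. It is a simplified single-channel illustration with hypothetical function names, not the Caffe layers used in the actual network.

```python
import numpy as np

def max_pool_with_switches(x, k=2):
    """k x k max-pooling that also returns the argmax position inside each block."""
    H, W = x.shape
    # Rearrange the map into non-overlapping k x k blocks, flattened to length k*k.
    blocks = (x.reshape(H // k, k, W // k, k)
               .transpose(0, 2, 1, 3)
               .reshape(H // k, W // k, k * k))
    switches = blocks.argmax(axis=-1)          # where the max was found in each block
    pooled = blocks.max(axis=-1)
    return pooled, switches

def unpool(pooled, switches, k=2):
    """Place each pooled value back at its recorded position; other entries are zero."""
    h, w = pooled.shape
    blocks = np.zeros((h, w, k * k), dtype=pooled.dtype)
    rows, cols = np.indices((h, w))
    blocks[rows, cols, switches] = pooled
    return blocks.reshape(h, w, k, k).transpose(0, 2, 1, 3).reshape(h * k, w * k)

x = np.array([[1., 2., 0., 0.],
              [3., 4., 0., 5.],
              [0., 0., 7., 0.],
              [6., 0., 0., 0.]])
pooled, switches = max_pool_with_switches(x)   # pooled -> [[4, 5], [6, 7]]
restored = unpool(pooled, switches)            # maxima return to their original positions
```

The restored map is sparse but spatially exact, which is why the subsequent deconvolution layers can recover sharp object boundaries.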
Note that we train our decoder to generate a foreground segmentation for each attention $\alpha^l$. By decoupling classification, which is a domain-specific task, from decoding, we can capture category-independent information for shape generation and apply the architecture to any unseen categories. Since all weights in the decoder are shared between different categories, the decoder is encouraged to capture common shape information that is generally applicable to multiple categories.
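Combining the classification objective in Eq. (6) and the segmentation objective in Eq. (8), the overall objective function is given by
$$\min_{\theta_\alpha, \theta_c, \theta_s} \sum_{i \in \mathcal{T} \cup \mathcal{S}} \sum_{l \in \mathcal{L}^*_i} e_c\left(y^l_i, f_{\text{cls}}(z^l_i; \theta_c)\right) + \lambda \sum_{i \in \mathcal{S}} \sum_{l \in \mathcal{L}^*_i} e_s\left(d^l_i, f_{\text{dec}}(s^l_i; \theta_s)\right) \tag{9}$$
where $e_c$ and $e_s$ are the classification and segmentation losses of Eq. (6) and Eq. (8), and $\lambda$ controls the balance between them. During training, we optimize Eq. (9) using examples from both domains. Note that this allows joint optimization of the attention model for both classification and segmentation. Although our attention model is generally good even when trained with only class labels (see Figure 3), training attention based only on the classification objective sometimes leads to noisy predictions due to the missing supervision for localization. By jointly training with the segmentation objective, we can regularize the attention model to avoid noisy solutions for target domain categories. After training, we remove the classification layers, since they are required only in training to learn attentions for the data from target domain categories.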
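The way the two objectives are combined over a mixed batch can be sketched as below. The dictionary layout, the function name `joint_loss`, and the value of the balance weight are assumptions for illustration; the text does not state the weight that was actually used.

```python
def joint_loss(examples, lam=1.0):
    """Combine classification and segmentation losses over a mixed batch.

    Every example contributes its classification loss; only source-domain
    examples, which carry segmentation masks, contribute a segmentation loss.
    """
    total = 0.0
    for ex in examples:
        total += ex["cls_loss"]
        if ex["domain"] == "source":
            total += lam * ex["seg_loss"]
    return total

batch = [
    {"domain": "target", "cls_loss": 1.0},                     # image-level labels only
    {"domain": "source", "cls_loss": 0.5, "seg_loss": 2.0},    # labels and masks
]
total = joint_loss(batch, lam=0.5)    # 1.0 + 0.5 + 0.5 * 2.0 = 2.5
```

Since the attention model feeds both terms, gradients from the segmentation loss on source images also shape the attention that is later applied to target categories.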
For inference on target domain images with the trained model, we first apply a separate classifier to identify the set of labels associated with the image. Then, for each identified label $l$, we construct the attention weights $\alpha^l$ and obtain the foreground segmentation mask from the decoder output. Given the foreground probability maps from all labels, the final segmentation label is obtained by taking the maximum probability in the channel direction.
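The final fusion step can be sketched as a per-pixel maximum over the foreground probability maps of all identified labels. The background rule used here (a pixel becomes background when no label reaches a 0.5 foreground probability) and the class ids are assumptions added to make the example complete; the text only specifies the maximum over channels.

```python
import numpy as np

def fuse_foreground_maps(prob_maps, bg_threshold=0.5):
    """Merge per-label foreground probability maps into one label map.

    prob_maps: dict mapping a class id to an (H, W) foreground probability map.
    Pixels whose best foreground probability is below bg_threshold get label 0.
    """
    labels = sorted(prob_maps)
    stack = np.stack([prob_maps[l] for l in labels])   # (K, H, W)
    best = stack.argmax(axis=0)                        # index of the winning label per pixel
    seg = np.asarray(labels)[best]                     # map indices back to class ids
    seg[stack.max(axis=0) < bg_threshold] = 0          # assumed background rule
    return seg

maps = {
    12: np.array([[0.9, 0.2], [0.1, 0.6]]),            # hypothetical class id 12
    15: np.array([[0.3, 0.8], [0.2, 0.7]]),            # hypothetical class id 15
}
seg = fuse_foreground_maps(maps)                       # -> [[12, 15], [0, 15]]
```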
This section describes the implementation and experimental setup in detail, and provides results on a challenging benchmark dataset.
We employ PASCAL VOC 2012 as the target domain and Microsoft COCO (MS-COCO) as the source domain, which have 20 and 80 labeled semantic categories, respectively. To simulate the transfer learning scenario, we remove all training images containing any of the 20 PASCAL categories from the MS-COCO dataset, and use only the remaining 17,443 images from 60 categories (with no overlap with the PASCAL categories) to construct the source domain data. We train our model using image-level class labels in the PASCAL VOC dataset and segmentation annotations in the MS-COCO dataset, and evaluate the performance on PASCAL VOC 2012 benchmark images.
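The construction of the source domain can be sketched as a simple filter that drops every image containing at least one target category. The image ids and category names below are made-up placeholders, and `build_source_set` is a hypothetical helper rather than part of any dataset toolkit.

```python
def build_source_set(image_categories, target_categories):
    """Keep only images that contain none of the target-domain categories.

    image_categories: dict mapping an image id to the set of categories it contains.
    """
    target = set(target_categories)
    return {img: cats for img, cats in image_categories.items()
            if target.isdisjoint(cats)}

# Placeholder annotations; real ids and category lists come from the datasets.
annotations = {
    "img_001": {"giraffe", "zebra"},
    "img_002": {"person", "frisbee"},      # contains a target category, so it is dropped
    "img_003": {"pizza"},
}
source = build_source_set(annotations, target_categories={"person", "dog", "car"})
```

Dropping whole images, rather than only masking out the overlapping objects, ensures that no pixel of a target category ever appears with a segmentation annotation during training.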
We initialize the encoder by fine-tuning the CNN pre-trained on ImageNet to perform multi-class classification on the combined PASCAL VOC and MS-COCO datasets. The weights in the attention model and classification layers ($\theta_\alpha$ and $\theta_c$, respectively) are pre-trained by optimizing Eq. (6). Then we optimize the decoder, attention model and classification layers jointly using the objective function in Eq. (9), with the weights in the decoder ($\theta_s$) initialized from zero-mean Gaussians. We fix the weights in the encoder ($\theta_e$) during training.
We implement the proposed algorithm based on the Caffe library. We employ Adam optimization to train our network with a learning rate of 0.0005 and the default hyper-parameter values proposed in the original Adam paper. The mini-batch size is set to 64. We trained our models using an NVIDIA Titan X GPU. Training takes 4 hours for the pre-training of the attention model including the classification layers, and 10 hours for the joint training of all other parts.
We exploit an additional classifier trained on the PASCAL VOC dataset to predict class labels on target domain images. The predicted class labels are used to generate segmentations as described in Section 5. Optionally, we employ post-processing based on a fully-connected CRF. In this case, we apply the CRF to the foreground/background probability maps for each class label independently, and obtain the combined segmentation by taking the pixel-wise maximum of foreground probabilities across labels.
This section presents evaluation results of our algorithm with the competitors on PASCAL VOC 2012 benchmark dataset. We follow comp6 evaluation protocol, and scores are measured by computing Intersection over Union (IoU) between ground truth and predicted segmentations.
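For reference, the IoU score can be computed as in the following sketch, which averages the per-class ratio of intersection to union. It is a simplified illustration: it works on a single pair of label maps and omits the handling of void pixels that the benchmark evaluation applies.

```python
import numpy as np

def mean_iou(gt, pred, num_classes):
    """Mean Intersection over Union across the classes present in gt or pred."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(gt == c, pred == c).sum()
        union = np.logical_or(gt == c, pred == c).sum()
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

gt = np.array([[0, 0],
               [1, 1]])
pred = np.array([[0, 1],
                 [1, 1]])
score = mean_iou(gt, pred, num_classes=2)  # class 0: 1/2, class 1: 2/3, mean: 7/12
```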
Table 1 summarizes the evaluation results on the PASCAL VOC 2012 validation dataset. We compared the proposed algorithm with state-of-the-art weakly- and semi-supervised algorithms. (Strictly speaking, our method is not directly comparable to either group since we use auxiliary examples. Note that we do not use ground-truth segmentation annotations for the categories used in evaluation, since the examples are from different categories.) Our method is denoted by TransferNet, and TransferNet-GT indicates our method with ground-truth class labels for segmentation inference, which serves as the upper-bound performance of our method since it assumes that classification is perfect. The proposed algorithm outperforms all weakly-supervised semantic segmentation techniques by substantial margins, although it does not employ any ground-truth segmentations for the categories used in evaluation. The performance of the proposed algorithm is comparable to that of semi-supervised semantic segmentation methods, which exploit a small number of ground-truth segmentations in addition to weakly-annotated images for training. The results suggest that segmentation annotations from different categories can be used to make up for the missing supervision in weakly-annotated images; the proposed encoder-decoder architecture based on the attention model successfully captures transferable segmentation knowledge from the exclusive segmentation annotations and uses it as a prior for segmentation in unseen categories.
Table 2 summarizes our results on the PASCAL VOC 2012 test dataset. Our algorithm exhibits superior performance to weakly-supervised approaches, but there are still large performance gaps with respect to fully-supervised approaches. This shows that there is domain-specific segmentation knowledge that cannot be made up for by annotations from different categories.
The qualitative results of the proposed algorithm are presented in Figure 5. Our algorithm often produces accurate segmentations in the target domain by transferring the decoder trained with source domain examples, although it fails to capture some category-specific fine details in some examples. The missing details can be recovered through post-processing based on CRF. Since the attention model in the target domain may not be perfect due to the missing supervision, our algorithm sometimes produces noisy predictions as illustrated in Figure 5(b). See Appendix C for more qualitative comparisons.
To better understand the benefits from the attention model in our transfer learning scenario, we compare the proposed algorithm with two baseline algorithms, which are denoted by DecoupledNet and BaselineNet. (See Appendix B for detailed configurations of the baselines and proposed algorithm.)
DecoupledNet directly applies the decoupled encoder-decoder architecture to our transfer learning task by training the decoder on MS-COCO and applying it to PASCAL VOC for inference. The model employs the same decoupled decoder architecture as ours, but has a direct connection between the encoder and the decoder without an attention mechanism. The result in Table 1 shows that the model trained on the source domain fails to adapt to target domain categories. This is mainly because the decoder cannot interpret the features from unseen categories in the target domain. Our model mitigates the issue since the attention model provides coherent representations to the decoder across domains.
Although the above baseline shows the benefits of the attention model in our architecture, the advantage of estimating attention from the intermediate layer may still be unclear. To examine this, we employ another FCN-style baseline denoted by BaselineNet, which uses the class score map as input to the decoder. It can be considered a special case of our method in which the attention is extracted from the final layer of the classification network. The performance of BaselineNet is better than that of DecoupledNet since the class score map is a general representation for the decoder across different categories. However, the performance is considerably worse than that of the proposed method, as shown in Table 1 and Figure 5. We observe that the class score map is sparse and low-resolution, while the densified attention map in our model contains richer information for segmentation.
The comparisons to the baseline algorithms show that transferring segmentation knowledge from different categories is a very challenging task. The straightforward extensions of existing architectures have difficulty in generalizing the knowledge across different categories. In contrast, our model effectively transfers segmentation knowledge by learning general features through attention mechanism.
To see the impact of the number of annotations in the source domain, we conduct additional experiments by varying the number of annotations in the source domain (MS-COCO). To this end, we randomly construct subsets of the training data with different sizes (50%, 25%, 10%, 5% and 1% of the full set) and average the performance over 3 subsets for each size. The results are illustrated in Figure 4. In general, more annotations in the source domain improve the segmentation quality on the target domain. Interestingly, the performance of the proposed algorithm is still better than that of other weakly-supervised methods even with a very small fraction of annotations. This suggests that exploiting even a small number of segmentations from other categories can effectively reduce the gap between the approaches based on strong and weak supervision.
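The subset construction for this experiment can be sketched with the standard library as follows. The seed, the rounding rule, and the function name `sample_subsets` are assumptions; only the ratios, the three repetitions, and the total of 17,443 source images come from the text.

```python
import random

def sample_subsets(image_ids, ratios=(0.5, 0.25, 0.1, 0.05, 0.01), repeats=3, seed=0):
    """Draw `repeats` random subsets of the source images for each ratio."""
    rng = random.Random(seed)                      # fixed seed for reproducibility
    subsets = {}
    for ratio in ratios:
        size = max(1, round(len(image_ids) * ratio))
        # Each subset is sampled without replacement from the full image list.
        subsets[ratio] = [rng.sample(image_ids, size) for _ in range(repeats)]
    return subsets

source_ids = list(range(17443))                    # stand-in ids for the source images
subsets = sample_subsets(source_ids)
```

The reported score for each ratio would then be the average over the three models trained on the corresponding subsets.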
We propose a novel approach for weakly-supervised semantic segmentation, which exploits extra segmentation annotations in different categories to improve segmentation on a dataset with missing supervision. The proposed encoder-decoder architecture with an attention model is appropriate for capturing transferable segmentation knowledge across categories. The results on a challenging benchmark dataset suggest that the gap between approaches based on strong and weak supervision can be reduced by transfer learning. We believe that scaling up the proposed algorithm to a large number of categories would be an interesting future research direction (e.g., semantic segmentation on 7.6K categories in the ImageNet dataset using segmentation annotations from the 20 PASCAL VOC categories).
Unsupervised domain adaptation by backpropagation. In ICML, 2015.
Learning to combine foveal glimpses with a third-order Boltzmann machine. In NIPS, 2010.
Conditional random fields as recurrent neural networks. In ICCV, 2015.
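Combining Eq. (5) and Eq. (7), the densified attention can be written as
$$s^l = A A^{\top} \alpha^l = G \alpha^l \tag{10}$$
where $G = A A^{\top} \in \mathbb{R}^{M \times M}$ is the Gram matrix of the feature $A$; each element of $G$, denoted by $G_{ij}$, represents the similarity between the $i$-th and $j$-th pixels in the feature map. Eq. (10) means that the densified attention is given by a weighted linear combination of the rows of the Gram matrix, where the weights are given by the attention $\alpha^l$. The densified attention $s^l$ reveals a more detailed shape of the objects than the sparse attention weights $\alpha^l$ by highlighting not only attended pixels but also visually correlated areas.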
Figure 6 visualizes the densified attention obtained by selecting a row of the Gram matrix in Eq. (10). The row is selected by a one-hot attention vector $\alpha^l$, which represents attention on a single pixel in the feature map. We observe that this successfully generates dense activation maps from extremely sparse attention by using the correlation between pixels. It also suggests that using the densified attention as the input to the decoder is more useful for generating accurate and dense segmentation masks than using the original sparse attention.
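The Gram-matrix view can be checked numerically with a short NumPy sketch: the two-step aggregation and the Gram-matrix product give the same result, and a one-hot attention simply selects one row of the Gram matrix. The random feature map and its size are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))          # toy feature map: M = 6 locations, D = 3 channels
G = A @ A.T                              # (M, M) Gram matrix of pairwise feature similarities

alpha = np.zeros(6)
alpha[2] = 1.0                           # one-hot attention on a single location

s_two_step = A @ (A.T @ alpha)           # aggregate channels first, then project back
s_gram = G @ alpha                       # equivalent Gram-matrix form
selected_row = G[2]                      # similarity of location 2 to every location
```

Because the Gram matrix is symmetric, its rows and columns coincide, so the densified attention for a single attended pixel is exactly that pixel's similarity to all other pixels.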
Figure 7 illustrates the detailed configurations of the algorithms used in Section 6.3: the proposed algorithm and two baselines denoted by BaselineNet and DecoupledNet. The baseline algorithms are straightforward extensions of existing CNNs for semantic segmentation, adapted for transfer learning without an attention mechanism. They share the same encoder and decoder architectures with the proposed model but construct the input to the decoder differently.
BaselineNet (Figure 7(b)) is an extension of the FCN-style architecture for our transfer learning scenario. Given an input image, it generates a set of 2D class-specific score maps that retain spatial information, based on a fully-convolutional network. Then, for each class present in the image, it extracts the two-dimensional score map of the selected class and converts it to the input to the decoder through a fully-connected layer. Based on this architecture, transfer learning can be achieved by training the decoder in the source domain and fine-tuning the classification network with target domain categories.
DecoupledNet (Figure 7(c)) is an extension of the decoupled encoder-decoder architecture for the transfer learning scenario. Given an input image, it first identifies the categories present in the image using the outputs of the classification network, and subsequently generates a foreground segmentation of each identified category using the decoder. To this end, it computes the gradient of the class score with respect to the feature map by back-propagation, and constructs the input to the decoder by combining the feature and gradient maps using a few feed-forward layers. In this architecture, transfer learning can be achieved in the same way as in BaselineNet: training the decoder in the source domain and fine-tuning the classification network with target domain categories.
Table 1 presents comparisons of the proposed algorithm to the baseline architectures. DecoupledNet directly exploits feature and gradient maps to construct the input to the decoder. Since both representations are domain-specific and are changed by fine-tuning the classification network in different domains, the decoder trained on the source domain is difficult to generalize to unseen categories in the target domain. Our architecture alleviates this problem using attention, where the attention model provides a coherent representation to the decoder across domains. Compared to BaselineNet, the proposed architecture reconstructs the information required for segmentation more effectively, since it employs the attention as well as the original features, and therefore produces more accurate segmentations.
The poor performance of the baseline architectures suggests that transferring segmentation knowledge across domains is a very challenging task, and naive extensions of the existing architectures may not be able to handle this challenge effectively. The proposed architecture based on the attention mechanism is appropriate for transferring the decoder across domains and obtaining accurate segmentations.
Figure 8 presents qualitative results of state-of-the-art weakly-supervised semantic segmentation techniques [27, 28] and our algorithm on PASCAL VOC 2012 validation images. The compared algorithms adopt post-processing based on a fully-connected CRF to refine the segmentation results. Nevertheless, the other methods frequently fail, since the output predictions of the CNN are often too noisy and inaccurate to capture precise object shapes. Our approach tends to find more accurate object boundaries even without the CRF by exploiting the decoder trained with segmentation annotations in different categories, and its results are clearly better than those of existing weakly-supervised approaches.