Weakly Supervised Semantic Image Segmentation with Self-correcting Networks

11/17/2018
by Mostafa S. Ibrahim, et al.
Simon Fraser University

Building a large image dataset with high-quality object masks for semantic segmentation is costly and time-consuming. In this paper, we reduce the data preparation cost by leveraging weak supervision in the form of object bounding boxes. To accomplish this, we propose a principled framework that trains a deep convolutional segmentation model by combining a large set of weakly supervised images (having only object bounding box labels) with a small set of fully supervised images (having semantic segmentation labels and box labels). Our framework trains the primary segmentation model with the aid of an ancillary model that generates initial segmentation labels for the weakly supervised instances and a self-correction module that improves the generated labels during training using the increasingly accurate primary model. We introduce two variants of the self-correction module using either linear or convolutional functions. Experiments on the PASCAL VOC 2012 and Cityscapes datasets show that our models trained with a small fully supervised set perform similarly to, or better than, models trained with a large fully supervised set, while requiring 7x less annotation effort.


1 Introduction

Deep convolutional neural networks (CNNs) have been successful in many computer vision tasks including image classification [27, 18, 71], object detection [43, 32, 41], semantic segmentation [3, 66, 8], action recognition [13, 24, 47, 52], and facial landmark localization [50, 64, 70]. However, the common prerequisite for all these successes is the availability of large training corpora of labeled images. Of these tasks, semantic image segmentation is one of the most costly in terms of data annotation. For example, drawing a segmentation annotation on an object is on average 8x slower than drawing a bounding box and 78x slower than labeling the presence of objects in images [4]. As a result, most image segmentation datasets are orders of magnitude smaller than image-classification datasets.

In this paper, we mitigate the data demands of semantic segmentation with a weakly supervised method that leverages cheap object bounding box labels in training. This approach reduces the data annotation requirements at the cost of requiring inference of the mask label for an object within a bounding box.

Current state-of-the-art weakly supervised methods typically rely on hand-crafted heuristics to infer an object mask inside a bounding box [39, 11, 25]. In contrast, we propose a principled framework that trains semantic segmentation models in a weakly supervised setting using a small set of fully supervised images (with semantic object masks and bounding boxes) and a larger set of weakly supervised images (with only bounding box annotations). The fully supervised set is first used to train an ancillary segmentation model that predicts object masks on the weakly labeled set. Using this augmented data, a primary segmentation model is trained. This primary segmentation model is probabilistic to accommodate the uncertainty of the mask labels generated by the ancillary model. Training is formulated so that the labels supplied to the primary model are refined during training from the initial ancillary mask labels to more accurate labels obtained from the primary model itself as it improves. Hence, we call our framework a self-correcting segmentation model, as it improves the weak ancillary labels based on its current probabilistic model of object masks.

We propose two approaches to the self-correction mechanism. Firstly, inspired by Vahdat [53], we use a function that linearly combines the ancillary and model predictions. We show that this simple and effective approach is the natural result of minimizing a weighted Kullback-Leibler (KL) divergence from a distribution over segmentation labels to both the ancillary and primary models. However, this approach requires defining a weight whose optimal value should change during training. With this motivation, we develop a second adaptive self-correction mechanism. We use CNNs to learn how to combine the ancillary and primary models to predict a segmentation on weakly supervised instances. This approach eliminates the need for a weighting schedule.

Experiments on the PASCAL VOC and Cityscapes datasets show that our models trained with a small portion of the fully supervised set achieve a performance comparable to (and in some cases better than) models trained with all the fully supervised images.

2 Related Work

Semantic Segmentation:

Fully convolutional networks (FCNs) [35] have become indispensable models for semantic image segmentation. Many successful applications of FCNs rely on atrous convolutions [62] (to increase the receptive field of the network without down-scaling the image) and dense conditional random fields (CRFs) [26] (either as post-processing [5] or as an integral part of the segmentation model [68, 31, 46, 34]). Recent efforts have focused on encoder-decoder based models that extract long-range information using encoder networks whose output is passed to decoder networks that generate a high-resolution segmentation prediction. SegNet [3], U-Net [44] and RefineNet [30] are examples of such models that use different mechanisms for passing information from the encoder to the decoder (SegNet [3] transfers max-pooling indices from the encoder to the decoder, U-Net [44] introduces skip-connections between the encoder and decoder networks, and RefineNet [30] proposes multi-path refinement in the decoder through long-range residual blocks). Another approach for capturing long-range contextual information is spatial pyramid pooling [28]. ParseNet [33] adds global context features to the spatial features, DeepLabv2 [6] uses atrous spatial pyramid pooling (ASPP), and PSPNet [66] introduces spatial pyramid pooling on several scales for the segmentation problem.

While other segmentation models may be used, we employ DeepLabv3+ [8] as our segmentation model because it outperforms previous CRF-based DeepLab models using simple factorial output. DeepLabv3+ replaces DeepLabv3's [7] backbone with the Xception network [9] and stacks it with a simple two-level decoder that uses lower-resolution feature maps of the encoder.

Robust Training:

Training a segmentation model from bounding box information can be formulated as a problem of robust learning from noisy labeled instances. Previous work on robust learning has focused on classification problems with a small number of output variables. In this setting, a common simplifying assumption models the noise on output labels as independent of the input [38, 37, 40, 49, 65]. However, recent work has lifted this constraint to model noise based on each instance's content (i.e., input-dependent noise). Xiao et al. [60] use a simple binary indicator function to represent whether each instance does or does not have a noisy label. Misra et al. [36] represent label noise for each class independently. Vahdat [53] proposes CRFs to represent the joint distribution of noisy and clean labels, extending structural max-margin models [54, 55] to deep networks. Ren et al. [42] gain robustness against noisy labels by reweighting each instance during training, whereas Dehghani et al. [12] reweight gradients based on a confidence score on labels. Among methods proposed for label correction, Veit et al. [56] use a neural regression model to predict clean labels given noisy labels and image features, Jiang et al. [23] learn a curriculum, and Tanaka et al. [51] use the current model to predict labels on noisy instances. All these models have been restricted to image-classification problems and have not yet been applied to image segmentation.

Weakly Supervised Semantic Segmentation:

The focus of this paper is to train deep segmentation CNNs using bounding box annotations. Papandreou et al. [39] propose an Expectation-Maximization-based (EM) algorithm on top of DeepLabv1 [5] to estimate segmentation labels for the weakly annotated images with box information. In each training step, segmentation labels are estimated based on the network output in an EM fashion. Dai et al. [11] propose an iterative training approach that alternates between generating region proposals (from a pool of fixed proposals) and fine-tuning the network. Similarly, Khoreva et al. [25] use an iterative algorithm but rely on GrabCut [45] and hand-crafted rules to extract the segmentation mask in each iteration. Our work differs from these previous methods in two significant aspects: i) We replace hand-crafted rules with an ancillary CNN for extracting probabilistic segmentation labels for an object within a box for the weakly supervised set. ii) We use a self-correcting model to correct for the mismatch between the output of the ancillary CNN and the primary segmentation model during training.

In addition to box annotations, segmentation models may use other forms of weak supervision such as image pixel-level [57, 59, 21, 2, 16, 58, 14], image label-level [63], scribbles [61, 29], point supervision [4], or web videos [19]. Recently, adversarial learning-based methods [22, 48] have also been proposed for this problem. Our framework is complementary to other forms of supervision or adversarial training and can be used alongside them.

Figure 1: An overview of our segmentation framework consisting of three models: i) Primary segmentation model generates a semantic segmentation of objects given an image. This is the main model being trained and is the one used at test time. ii) Ancillary segmentation model outputs a segmentation given an image and bounding box. This model generates an initial segmentation for the weakly supervised set, which aids training of the primary model. iii) Self-correction module refines segmentations generated by the ancillary model and the current primary model for the weakly supervised set. The primary model is trained using a cross-entropy loss that matches its output to either the ground-truth segmentation labels for the fully supervised examples or the soft refined labels generated by the self-correction module for the weakly supervised examples.

3 Methods

Our goal is to train a semantic segmentation network in a weakly supervised setting using two training sets: i) a small fully supervised set (containing images, segmentation ground-truth and object bounding boxes) and ii) a larger weakly supervised set (containing images and object bounding boxes only). An overview of our framework is shown in Fig. 1. There are three models: i) The Primary segmentation model generates a semantic segmentation of objects given an image. ii) The Ancillary segmentation model outputs a segmentation given an image and bounding box. The model generates an initial segmentation for the weakly supervised set, which aids training of the primary model. iii) The Self-correction module refines the segmentations generated by the ancillary and current primary model for the weakly supervised set. Both the ancillary and the primary models are based on DeepLabv3+ [8]. However, our framework is general and can use any existing segmentation model.

In Sec. 3.1, we present the ancillary model, and in Sec. 3.2, we show a simple way to use this model to train the primary model. In Sec. 3.3 and Sec. 3.4, we present two variants of the self-correction module.

Notation: $x$ represents an image, $b$ represents the object bounding boxes in an image, and $y = \{y^{(m)}\}_{m=1}^{M}$ represents a segmentation label, where $y^{(m)} \in \{0,1\}^{C+1}$ for $m \in \{1, \dots, M\}$ is a one-hot label for the $m^{\text{th}}$ pixel, $C+1$ is the number of foreground labels augmented with the background class, and $M$ is the total number of pixels. Each bounding box is associated with an object and has one of the foreground labels. The fully supervised dataset is indicated as $\mathcal{F} = \{(x_f, b_f, y_f)\}_{f=1}^{|\mathcal{F}|}$, where $|\mathcal{F}|$ is the total number of instances in $\mathcal{F}$. Similarly, the weakly supervised set is denoted by $\mathcal{W} = \{(x_w, b_w)\}_{w=1}^{|\mathcal{W}|}$. We use $p(y \mid x; \boldsymbol{\theta})$ to represent the primary segmentation model and $p_{\text{anc}}(y \mid x, b; \boldsymbol{\phi})$ to represent the ancillary model; $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$ are the respective parameters of each model. We occasionally drop the parameters from the notation for readability. We assume that both ancillary and primary models define a distribution over segmentation labels using a factorial distribution, i.e., $p(y \mid x) = \prod_{m=1}^{M} p(y^{(m)} \mid x)$ and $p_{\text{anc}}(y \mid x, b) = \prod_{m=1}^{M} p_{\text{anc}}(y^{(m)} \mid x, b)$, where each factor ($p(y^{(m)} \mid x)$ or $p_{\text{anc}}(y^{(m)} \mid x, b)$) is a categorical distribution over the $C+1$ categories.

3.1 Ancillary Segmentation Model

The key challenge in weakly supervised training of segmentation models with bounding box annotations is to infer the segmentation of the object inside a box. Existing approaches to this problem mainly rely on hand-crafted rule-based procedures such as GrabCut [45] or iterative label refinement mechanisms [39, 11, 25]. The latter typically iterate between extracting a segmentation from the image and refining the labels using the bounding box information (for example, by zeroing out the mask outside the boxes). The main issues with such procedures are that i) the bounding box information is not directly used to extract the segmentation mask, ii) the procedure may be suboptimal as it is hand-designed, and iii) the segmentation becomes ambiguous when multiple boxes overlap.

In this paper, we take a different approach by designing an ancillary segmentation model that forms a per-pixel label distribution given an image and its bounding box annotation. This model is easily trained using the fully supervised set $\mathcal{F}$ and can be used as a training signal for images in $\mathcal{W}$. At inference time, both the image and its bounding boxes are fed to the network to obtain $p_{\text{anc}}(y \mid x, b)$, the distribution over segmentation labels.

Our key observation in designing the ancillary model is that encoder-decoder-based segmentation networks typically rely on encoders initialized from an image-classification model (e.g., ImageNet-pretrained models). This usually improves the segmentation performance by transferring knowledge from large image-classification datasets. To maintain the same advantage, we augment an encoder-decoder-based segmentation model with a parallel bounding box encoder network that embeds bounding box information at different scales (see Fig. 2).

Figure 2: An overview of the ancillary segmentation model. We modify an existing encoder-decoder segmentation model by introducing a bounding box encoder that embeds the box information. The output of the bounding box encoder after passing through a sigmoid activation acts as an attention map. Feature maps at different scales from the encoder are fused (using element-wise-multiplication) with attention maps, then passed to the decoder.

The input to the bounding box encoder is a 3D tensor representing a binarized mask of the bounding boxes and a 3D shape representing the target dimensions for the encoder output. The input mask tensor is resized to the target shape and then passed through a 3×3 convolution layer with sigmoid activations. The resulting tensor can be interpreted as an attention map, which is element-wise multiplied with the feature maps generated by the segmentation encoder. Fig. 2 shows two paths of such feature maps at two different scales, as in the DeepLabv3+ architecture. For each scale, an attention map is generated, fused with the corresponding feature map using element-wise multiplication, and fed to the decoder. For an image of size $H \times W$, we represent its object bounding boxes using a binary mask of size $H \times W \times (C+1)$ that encodes one binary mask per class. The mask for a class has the value 1 at a pixel if that pixel is inside one of the bounding boxes of that class. A pixel in the background mask has value 1 if it is not covered by any bounding box.
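
As a concrete illustration, the following sketch shows one way such a box mask could be constructed; the box format (class_id, x1, y1, x2, y2) and the channel ordering (background first) are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def boxes_to_mask(boxes, height, width, num_fg_classes):
    """Encode bounding boxes as a binary mask of shape (H, W, C+1).

    `boxes` holds (class_id, x1, y1, x2, y2) tuples with 0-based foreground
    class ids and pixel coordinates; channel 0 is used for background here.
    """
    mask = np.zeros((height, width, num_fg_classes + 1), dtype=np.float32)
    for class_id, x1, y1, x2, y2 in boxes:
        # 1 inside any box of this class
        mask[y1:y2, x1:x2, class_id + 1] = 1.0
    # background channel: 1 where no box of any class covers the pixel
    mask[..., 0] = 1.0 - mask[..., 1:].max(axis=-1)
    return mask

# Example: a single box of class 14 on a 200x300 image with 20 foreground classes.
m = boxes_to_mask([(14, 50, 30, 120, 180)], height=200, width=300, num_fg_classes=20)
print(m.shape)  # (200, 300, 21)
```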

The ancillary model is trained using the cross-entropy loss on the fully supervised dataset $\mathcal{F}$:

$\min_{\boldsymbol{\phi}}\; -\sum_{(x, b, y) \in \mathcal{F}} \log p_{\text{anc}}(y \mid x, b; \boldsymbol{\phi})$     (1)

which can be expressed analytically under the factorial distribution assumption. This model is held fixed for the subsequent experiments.

3.2 No Self-Correction

We empirically observe that the performance of our ancillary model is superior to segmentation models that do not have box information. This is mainly because the bounding box information guides the ancillary model to look for the object inside the box at inference time.

The simplest approach to training the primary model is to train it to predict segmentation labels using the ground-truth labels on the fully supervised set $\mathcal{F}$ and the labels generated by the ancillary model on the weakly supervised set $\mathcal{W}$. For this “no-self-correction” model, the Self-correction module in Fig. 1 merely copies the predictions made by the ancillary segmentation model.

Training is guided by optimizing the following loss:

$\min_{\boldsymbol{\theta}}\; -\sum_{(x, y) \in \mathcal{F}} \log p(y \mid x; \boldsymbol{\theta}) \;-\; \sum_{(x, b) \in \mathcal{W}} \mathbb{E}_{p_{\text{anc}}(y \mid x, b; \boldsymbol{\phi})}\left[\log p(y \mid x; \boldsymbol{\theta})\right]$

where the first term is the cross-entropy loss with one-hot ground-truth labels as target and the second term is the cross-entropy with the soft probabilistic labels generated by $p_{\text{anc}}$ as target. Note that the ancillary model, parameterized by $\boldsymbol{\phi}$, is fixed. We call this approach the no self-correction model as it relies directly on the ancillary model for training the primary model on examples in $\mathcal{W}$.
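
A minimal sketch of this two-term loss, written with NumPy for a single image from each set; the per-pixel averaging and the equal weighting of the two terms are assumptions of the sketch, not settings reported here.

```python
import numpy as np

def cross_entropy(target, probs, eps=1e-8):
    """Per-pixel cross-entropy, averaged over pixels.

    target: (M, C+1) per-pixel target distribution (one-hot or soft).
    probs:  (M, C+1) per-pixel class probabilities from the primary model.
    """
    return -np.mean(np.sum(target * np.log(probs + eps), axis=-1))

def no_self_correction_loss(y_full, p_full, q_anc_weak, p_weak):
    loss_f = cross_entropy(y_full, p_full)        # fully supervised: one-hot ground truth
    loss_w = cross_entropy(q_anc_weak, p_weak)    # weakly supervised: soft ancillary labels
    return loss_f + loss_w
```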

3.3 Linear Self-Correction

Eq. 3.2 relies on the ancillary model to predict the label distribution on the weakly supervised set. However, this model is trained using only instances of $\mathcal{F}$, without the benefit of the data in $\mathcal{W}$. Several recent works [39, 11, 25, 51, 53] have incorporated the information in $\mathcal{W}$ by using the primary model itself (as it is being trained on both $\mathcal{F}$ and $\mathcal{W}$) to extract more accurate label distributions on $\mathcal{W}$.

Vahdat [53] introduced a regularized Expectation-Maximization algorithm that uses a linear combination of KL divergences to infer a distribution over missing labels for general classification problems. The main insight is that the inferred distribution $q(y \mid x, b)$ over labels should be close to the distributions generated by both the ancillary model $p_{\text{anc}}(y \mid x, b)$ and the primary model $p(y \mid x)$. However, since the primary model is not capable of predicting the segmentation mask accurately early in training, the two KL terms are reweighted using a positive scaling factor $\alpha$:

$\min_{q}\;\; \alpha\, \mathrm{KL}\left(q(y \mid x, b)\,\|\,p_{\text{anc}}(y \mid x, b)\right) \;+\; \mathrm{KL}\left(q(y \mid x, b)\,\|\,p(y \mid x)\right)$     (3)

The global minimizer of Eq. 3 is obtained as the weighted geometric mean of the two distributions:

$q(y \mid x, b) \;\propto\; \left(p(y \mid x)\; p_{\text{anc}}^{\alpha}(y \mid x, b)\right)^{\frac{1}{\alpha+1}}$     (4)

Since both $p(y \mid x)$ and $p_{\text{anc}}(y \mid x, b)$ decompose into a product of probabilities over the components of $y$, and since the distribution over each component is categorical, $q(y \mid x, b)$ is also factorial, where the parameters of the categorical distribution over each component are computed by applying a softmax activation to the linear combination of logits coming from the primary and ancillary models: $q^{(m)} = \sigma\big((l^{(m)} + \alpha\, l_{\text{anc}}^{(m)}) / (\alpha + 1)\big)$. Here, $\sigma$ is the softmax function, and $l^{(m)}$ and $l_{\text{anc}}^{(m)}$ are the logits generated by the primary and ancillary models for the $m^{\text{th}}$ pixel.
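
In code, the refined soft label for a weakly supervised image reduces to a single softmax over the $\alpha$-weighted logits. The sketch below only illustrates this closed form; the (M, C+1) tensor shapes are an assumption.

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def linear_self_correction(logits_primary, logits_ancillary, alpha):
    """Soft labels q from Eq. 4: softmax of the alpha-weighted mean of logits.

    Large alpha -> q follows the ancillary model; alpha -> 0 -> q follows the
    primary model. Both logit arrays are assumed to have shape (M, C+1).
    """
    return softmax((logits_primary + alpha * logits_ancillary) / (1.0 + alpha))
```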

Having fixed $q(y \mid x, b)$ on the weakly supervised instances in each iteration of training the primary model, we can train the primary model using:

$\min_{\boldsymbol{\theta}}\; -\sum_{(x, y) \in \mathcal{F}} \log p(y \mid x; \boldsymbol{\theta}) \;-\; \sum_{(x, b) \in \mathcal{W}} \mathbb{E}_{q(y \mid x, b)}\left[\log p(y \mid x; \boldsymbol{\theta})\right]$

Note that $\alpha$ in Eq. 3 controls the closeness of $q$ to $p_{\text{anc}}$ and $p$. With $\alpha \to \infty$, we have $q = p_{\text{anc}}$ and the linear self-correction in Eq. 3.3 collapses to Eq. 3.2, whereas $\alpha \to 0$ recovers $p$. A finite $\alpha$ maintains $q$ close to both $p$ and $p_{\text{anc}}$. At the beginning of training, $p$ cannot predict the segmentation label distribution accurately. Therefore, we define a schedule for $\alpha$ in which it is decreased from a large value to a small value during training of the primary model.

This corrective model is called the linear self-correction model as it uses the solution to a linear combination of KL divergences (Eq. 3) to infer a distribution over latent segmentation labels. (In principle, the logits of $q$ can be obtained by a 1×1 convolutional layer applied to the depth-wise concatenation of $l^{(m)}$ and $l_{\text{anc}}^{(m)}$ with a fixed averaging kernel. This originally motivated us to develop the convolutional self-correction model in Sec. 3.4 using trainable kernels.) As the primary model's parameters are optimized during training, the decreasing $\alpha$ biases the self-correction mechanism towards the primary model.

3.4 Convolutional Self-Correction

One disadvantage of linear self-correction is the hyperparameter search required for tuning the $\alpha$ schedule during training. In this section, we present an approach that overcomes this difficulty by replacing the linear function with a convolutional network that learns the self-correction mechanism. As a result, the network automatically tunes the mechanism dynamically as the primary model is trained. If the primary model predicts labels accurately, this network can shift its predictions towards the primary model.

Fig. 3 shows the architecture of the convolutional self-correcting model. This small network accepts the logits generated by the $p$ and $p_{\text{anc}}$ models and generates the factorial distribution $q_{\text{conv}}(y \mid x, b; \boldsymbol{\lambda})$ over segmentation labels, where $\boldsymbol{\lambda}$ represents the parameters of the subnetwork. The convolutional self-correction subnetwork consists of two convolution layers. Both layers use a 3×3 kernel and ReLU activations. The first layer has 128 output feature maps and the second has $C+1$ feature maps, based on the number of classes in the dataset.
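
A sketch of this subnetwork in Keras is given below; the padding, the softmax applied on top of the second layer, and the depth-wise concatenation of the two logit tensors are assumptions made for illustration.

```python
import tensorflow as tf

def conv_self_correction_head(num_classes):
    """Two 3x3 convolution layers with ReLU activations, as described above."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(128, 3, padding="same", activation="relu"),
        tf.keras.layers.Conv2D(num_classes, 3, padding="same", activation="relu"),
    ])

# logits_primary, logits_ancillary: (batch, H, W, C+1) tensors from the two models.
# head = conv_self_correction_head(num_classes=21)
# refined = head(tf.concat([logits_primary, logits_ancillary], axis=-1))
# q_conv = tf.nn.softmax(refined)   # factorial distribution over per-pixel labels
```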

The challenge here is to train this subnetwork such that it predicts the segmentation labels more accurately than either $p$ or $p_{\text{anc}}$. To this end, we introduce an additional term in the objective function, which trains the subnetwork using training examples in $\mathcal{F}$ while the primary model is being trained on the whole dataset:

$\min_{\boldsymbol{\theta}, \boldsymbol{\lambda}}\; -\sum_{(x, y) \in \mathcal{F}} \log p(y \mid x; \boldsymbol{\theta}) \;-\; \sum_{(x, b) \in \mathcal{W}} \mathbb{E}_{q_{\text{conv}}(y \mid x, b; \boldsymbol{\lambda})}\left[\log p(y \mid x; \boldsymbol{\theta})\right] \;-\; \sum_{(x, b, y) \in \mathcal{F}} \log q_{\text{conv}}(y \mid x, b; \boldsymbol{\lambda})$

where the first and second terms train the primary model on $\mathcal{F}$ and $\mathcal{W}$ (we do not backpropagate through $q_{\text{conv}}$ in the second term) and the last term trains the convolutional self-correcting network on $\mathcal{F}$.
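
The objective can be sketched as three cross-entropy terms, as in the hypothetical helper below; the tensor names, shapes, and the use of tf.stop_gradient to block gradients into the corrector in the second term are assumptions of the sketch rather than the authors' implementation.

```python
import tensorflow as tf

def conv_self_correction_objective(p_logits_f, y_f, p_logits_w, q_logits_w, q_logits_f):
    """Three-term loss: primary on F, primary on W with refined soft labels, corrector on F.

    p_logits_*: primary-model logits on images from F / W, shape (N, H, W, C+1).
    q_logits_*: logits of the convolutional self-correction network on the same images.
    y_f:        one-hot ground-truth masks for the images from F.
    """
    ce = tf.nn.softmax_cross_entropy_with_logits
    term_f = tf.reduce_mean(ce(labels=y_f, logits=p_logits_f))   # primary vs. ground truth on F
    q_w = tf.stop_gradient(tf.nn.softmax(q_logits_w))            # no gradient into the corrector
    term_w = tf.reduce_mean(ce(labels=q_w, logits=p_logits_w))   # primary vs. refined labels on W
    term_q = tf.reduce_mean(ce(labels=y_f, logits=q_logits_f))   # corrector vs. ground truth on F
    return term_f + term_w + term_q
```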

Because the subnetwork is initialized randomly, it is not able to accurately predict segmentation labels on $\mathcal{W}$ early in training. To overcome this issue, we propose the following pretraining procedure:

  1. Initial training of the ancillary model: As with the previous self-correction models, we need to train the ancillary model. Here, half of the fully supervised set $\mathcal{F}$ is used for this purpose.

  2. Initial training of the convolutional self-correction network: The fully supervised data $\mathcal{F}$ is used to train the primary model and the convolutional self-correcting network. This is done using the first and last terms in Eq. 3.4.

  3. The main training: The whole data ($\mathcal{F}$ and $\mathcal{W}$) is used to fine-tune the previous model using the objective function in Eq. 3.4.

The rationale behind using half of $\mathcal{F}$ in stage 1 is that if we used all of $\mathcal{F}$ for training the $p_{\text{anc}}$ model, it would learn to predict the segmentation mask almost perfectly on this set; the subsequent training of the convolutional self-correcting network would then just learn to rely on $p_{\text{anc}}$. To overcome this training issue, the second half of $\mathcal{F}$ is held out to help the self-correcting network learn how to combine $p$ and $p_{\text{anc}}$.

Figure 3: The convolutional self-correction model learns to refine the input label distributions. This subnetwork receives logits from the primary and ancillary models, concatenates them, and feeds the result to a two-layer convolutional network.

4 Experiments

In this section, we evaluate our models on the PASCAL VOC 2012 and Cityscapes datasets. Both datasets contain object segmentation and bounding box annotations. We split the full dataset annotations into two parts to simulate a fully and weakly supervised setting. Similar to  [8, 39], performance is measured using the mean intersection-over-union (mIOU) across the available classes.

Training:

We use the public TensorFlow [1] implementation of DeepLabv3+ [8] as the primary model. We use an initial learning rate of 0.007 and train the models for 30,000 steps from the ImageNet-pretrained Xception-65 model [8]. (Note that we do not initialize the parameters from an MS-COCO pretrained model.) For all other parameters, we use the standard settings suggested by other authors. At evaluation time, we apply flipping and multi-scale processing for images as in [8]. We use 4 GPUs, each with a batch of 4 images.

We define the following baselines in all our experiments:

  1. Ancillary Model: This is the ancillary model introduced in Sec. 3.1, which predicts semantic segmentation labels given an image and its object bounding boxes. This model is expected to perform better than the other models as it uses bounding box information.

  2. No Self-correction: This is the primary model trained using the model introduced in Sec. 3.2.

  3. Lin. Self-correction: This is the primary model trained with linear self-correction as in Sec. 3.3.

  4. Conv. Self-correction: The primary model trained with the convolutional self-correction as in Sec. 3.4.

  5. EM-fixed Baseline: Since our linear self-correction model is derived from a regularized EM model [53], we compare our model with Papandreou et al. [39], which is also an EM-based model. We implemented their EM-fixed baseline with DeepLabv3+ for a fair comparison. This baseline achieved the best results in [39] for weakly supervised learning.

For linear self-correction, $\alpha$ controls the weighting in the KL-divergence bias, with large $\alpha$ favoring the ancillary model and small $\alpha$ favoring the primary model. We explored different starting and ending values for $\alpha$ with an exponential decay between these endpoints, and find that a single pair of starting and final values performs well for both datasets. This parameter setting is robust, as moderate changes to these values have little effect.
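
As a sketch of such a schedule, the helper below decays $\alpha$ log-linearly between two endpoints over the training steps; the interpolation form and the endpoint values in the comment are illustrative assumptions, not the values used in the experiments above.

```python
def alpha_schedule(step, total_steps, alpha_start, alpha_end):
    """Exponentially decay alpha from alpha_start down to alpha_end over training."""
    t = min(max(step / float(total_steps), 0.0), 1.0)
    return alpha_start * (alpha_end / alpha_start) ** t

# Illustrative (hypothetical) endpoints only:
# [alpha_schedule(s, 30000, 10.0, 0.1) for s in (0, 15000, 30000)]  -> [10.0, 1.0, 0.1]
```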

4.1 PASCAL VOC Dataset

In this section, we evaluate all models on the PASCAL VOC 2012 segmentation benchmark [15]. This dataset consists of 1464 training, 1449 validation, and 1456 test images covering 20 foreground object classes and one background class for segmentation. An auxiliary dataset of 9118 training images is provided by [17]. We suspect, however, that the segmentation labels of  [17] contain a small amount of noise. In this section, we refer to the union of the original PASCAL VOC training dataset and the auxiliary set as the training set. We evaluate the models mainly on the validation set and the best model is evaluated only once on the test set using the online evaluation server.

In Table 1, we show the performance of different variants of our model for different sizes of the fully supervised set $\mathcal{F}$. The remaining examples in the training set are used as $\mathcal{W}$. We make several observations from Table 1: i) The ancillary model, which predicts segmentation labels given an image and its object bounding boxes, performs well even when it is trained with a training set as small as 200 images. This shows that this model can also provide a good training signal for the weakly supervised set that lacks segmentation labels. ii) The linear self-correction model typically performs better than the no self-correction model, supporting our idea that combining the primary and ancillary models for inferring segmentation labels results in better training of the primary model. iii) The convolutional self-correction model performs better than the linear self-correction while eliminating the need for defining an $\alpha$ schedule. Fig. 4 shows the output of these models.

# images in $\mathcal{F}$ 200 400 800 1464
Ancillary Model 81.57 83.56 85.36 86.71
No Self-correction 78.75 79.19 80.39 80.34
Lin. Self-correction 79.43 79.59 80.69 81.35
Conv. Self-correction 78.29 79.63 80.12 82.33
Table 1: Ablation study of models on the PASCAL VOC 2012 validation set using mIOU for different sizes of $\mathcal{F}$. For the last three rows, the remaining images in the training set are used as $\mathcal{W}$, i.e., $|\mathcal{W}| = 10582 - |\mathcal{F}|$.

Table 2 compares the performance of the self-correcting models against different baselines and published results. In this experiment, we use 1464 images as $\mathcal{F}$ and the 9118 images originally from the auxiliary dataset as $\mathcal{W}$. Both self-correction models achieve similar results and outperform the other models. The EM-fixed baseline [39] trained using the DeepLabv3+ [8] architecture achieves 79.25% on the validation set, while the original paper reported 64.6% using DeepLabv1 trained with the same split. We also train DeepLabv3+ [8] initialized from the ImageNet-pretrained Xception model using all the training data as the fully supervised set, and observe that this model achieves 79.33%. (Chen et al. [8] report 81.21% for the same baseline without COCO pretraining; we believe the discrepancy is mainly due to [8] computing the batch-norm statistics with a batch size of 16, while the publicly available implementation can only use a batch size of 4.)

Surprisingly, our weakly supervised models outperform the fully supervised model trained using the whole training data. We hypothesize that this may be due to label noise in the 9k auxiliary set [17] that negatively affects the performance of the vanilla DeepLabv3+. Fig. 5 compares the output of the ancillary model with the ground-truth annotations for a few examples in this set.

Comparing Tables 1 and 2, we see that with $|\mathcal{F}| = 1464$ and $|\mathcal{W}| = 9118$, our linear self-correction model performs similarly to vanilla DeepLabv3+ trained with the whole dataset. Using the labeling cost reported in [4], this theoretically translates to a 7x reduction in annotation cost.

$|\mathcal{F}|$ $|\mathcal{W}|$ Architecture Method Val Test
1464 9118 DeepLabv3+ Lin. Self-Corr. 81.35 81.97
1464 9118 DeepLabv3+ Conv. Self-Corr. 82.33 82.72
1464 9118 DeepLabv3+ EM-fixed [39] 79.25 -
10582 - DeepLabv3+ Vanilla 79.33 80.39
1464 9118 DeepLabv1 BoxSup-MCG [11] 63.5 -
1464 9118 DeepLabv1 EM-fixed [39] 65.1 -
1464 9118 DeepLabv1 M∩G+ [25] 65.8 -
10582 - DeepLabv1 Vanilla [25] 69.1 -
Table 2: Results on PASCAL VOC 2012 validation and test sets. The last four rows report the performance of previous weakly supervised models. Note that the previous weakly supervised approaches are usually inferior to the fully supervised models whereas our models outperform the fully supervised model.
Figure 4: Qualitative results on the PASCAL VOC 2012 validation set. The last four columns correspond to the models in the 1464 column of Table 1. The Conv. Self-correction model typically segments objects better than the other models.
Figure 5: Qualitative results on the PASCAL VOC 2012 auxiliary (weakly supervised) set. The heatmap of a single class from the ancillary model is shown for several examples. The ancillary model successfully corrects the labels for missing or over-segmented objects in these images (marked by ellipses).

4.2 Cityscapes Dataset

In this section, we evaluate performance on the Cityscapes dataset [10], which contains images collected from cars driving in cities during different seasons. This dataset has good-quality annotations; however, some instances are over- or under-segmented. It consists of 2975 training, 500 validation, and 1525 test images covering 19 classes (stuff and objects) for the segmentation task. However, 8 of these classes are flat or construction labels (e.g., road, sidewalk, and building), and a few bounding boxes of such classes can cover the whole scene. To create an object segmentation task similar to the PASCAL VOC dataset, we use only 11 classes (pole, traffic light, traffic sign, person, rider, car, truck, bus, train, motorcycle, and bicycle) as foreground classes, and all other classes are mapped to background. Due to this modification of labels, we report results only on the validation set, as the test-set evaluation server evaluates on all classes. We do not use the coarsely annotated training data in the dataset.
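
For concreteness, a sketch of this label remapping using the standard Cityscapes trainId convention; the specific id values follow that convention, and the output encoding (0 = background, 1-11 = objects) is an assumption of the sketch rather than a detail from the paper.

```python
import numpy as np

# Standard Cityscapes trainIds for the 11 classes kept as foreground here.
FOREGROUND_TRAIN_IDS = {
    5: 1,   # pole
    6: 2,   # traffic light
    7: 3,   # traffic sign
    11: 4,  # person
    12: 5,  # rider
    13: 6,  # car
    14: 7,  # truck
    15: 8,  # bus
    16: 9,  # train
    17: 10, # motorcycle
    18: 11, # bicycle
}

def remap_labels(train_id_map):
    """Map a Cityscapes trainId label map to 0 = background, 1..11 = objects."""
    out = np.zeros_like(train_id_map)
    for src, dst in FOREGROUND_TRAIN_IDS.items():
        out[train_id_map == src] = dst
    return out
```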

Table 3 reports the performance of our model for an increasing number of images in $\mathcal{F}$, and Table 4 compares our models with several baselines, as for the previous dataset. The same conclusions and insights observed on the PASCAL dataset hold for the Cityscapes dataset, indicating the efficacy of our self-correction framework.

# images in $\mathcal{F}$ 200 450 914
Ancillary Model 79.4 81.19 81.89
No Self-correction 73.69 75.10 75.44
Lin. Self-correction 73.56 75.24 76.22
Conv. Self-correction 69.38 77.16 79.46
Table 3: Ablation study of our models on the Cityscapes validation set using mIOU for different sizes of $\mathcal{F}$. For the last three rows, the remaining images in the training set are used as $\mathcal{W}$, i.e., $|\mathcal{W}| = 2975 - |\mathcal{F}|$.
$|\mathcal{F}|$ $|\mathcal{W}|$ Architecture Method mIOU
914 2061 DeepLabv3+ Lin. Self-Correction 76.22
914 2061 DeepLabv3+ Conv. Self-Correction 79.46
914 2061 DeepLabv3+ EM-fixed [39] 74.97
2975 - DeepLabv3+ Vanilla 77.49
Table 4: Results on the Cityscapes validation set. Here 30% of the training examples are used as the fully supervised set, and the remaining 2061 examples as weakly supervised data.

5 Conclusion

In this paper, we have proposed a weakly supervised framework for training deep CNN segmentation models using a small set of fully labeled and a large set of weakly labeled images. We introduced two mechanisms that enable the underlying primary model to correct the weak labels provided by an ancillary model. The proposed self-correction mechanisms combine the predictions made by the primary and ancillary models using either a linear function or a trainable CNN. The experiments show that our proposed framework outperforms previous weakly supervised models on both the PASCAL VOC 2012 and Cityscapes datasets. Our framework can also be applied to the instance segmentation task [20, 69, 67], but we leave further study of this to future work.

References

  • [1] M. Abadi, A. Agarwal, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. 2016.
  • [2] J. Ahn and S. Kwak. Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [3] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. 2015.
  • [4] A. Bearman, O. Russakovsky, V. Ferrari, and L. Fei-Fei. What’s the point: Semantic segmentation with point supervision. In European Conference on Computer Vision (ECCV), 2016.
  • [5] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In International Conference on Learning Representations (ICLR), 2015.
  • [6] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017.
  • [7] L. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. 2017.
  • [8] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In European Conference on Computer Vision (ECCV), 2018.
  • [9] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [10] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [11] J. Dai, K. He, and J. Sun. Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In IEEE International Conference on Computer Vision (ICCV), 2015.
  • [12] M. Dehghani, A. Mehrjou, S. Gouws, J. Kamps, and B. Schölkopf. Fidelity-weighted learning. In International Conference on Learning Representations (ICLR), 2018.
  • [13] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2625–2634, 2015.
  • [14] T. Durand, T. Mordan, N. Thome, and M. Cord. WILDCAT: weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [15] M. Everingham, S. M. A. Eslami, L. J. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision (IJCV), 2015.
  • [16] W. Ge, S. Yang, and Y. Yu. Multi-evidence filtering and fusion for multi-label classification, object detection and semantic segmentation based on weakly supervised learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [17] B. Hariharan, P. Arbelaez, L. D. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In IEEE International Conference on Computer Vision (ICCV), 2011.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • [19] S. Hong, D. Yeo, S. Kwak, H. Lee, and B. Han. Weakly supervised semantic segmentation using web-crawled videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [20] R. Hu, P. Dollár, K. He, T. Darrell, and R. Girshick. Learning to segment every thing. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [21] Z. Huang, X. Wang, J. Wang, W. Liu, and J. Wang. Weakly-supervised semantic segmentation network with deep seeded region growing. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [22] W.-C. Hung, Y.-H. Tsai, Y.-T. Liou, Y.-Y. Lin, and M.-H. Yang. Adversarial learning for semi-supervised semantic segmentation. In British Machine Vision Conference (BMVC), 2018.
  • [23] L. Jiang, Z. Zhou, T. Leung, L.-J. Li, and L. Fei-Fei. Mentornet: Regularizing very deep neural networks on corrupted labels. In International Conference on Machine Learning (ICML), 2018.
  • [24] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
  • [25] A. Khoreva, R. Benenson, J. H. Hosang, M. Hein, and B. Schiele. Simple does it: Weakly supervised instance and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [26] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems (NIPS), 2011.
  • [27] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.
  • [28] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In Conference on Computer Vision and Pattern Recognition (CPRV), pages 2169–2178. IEEE Computer Society, 2006.
  • [29] D. Lin, J. Dai, J. Jia, K. He, and J. Sun. Scribblesup: Scribble-supervised convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [30] G. Lin, A. Milan, C. Shen, and I. D. Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [31] G. Lin, C. Shen, A. Van Den Hengel, and I. Reid. Efficient piecewise training of deep structured models for semantic segmentation. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [32] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
  • [33] W. Liu, A. Rabinovich, and A. Berg. Parsenet: Looking wider to see better. arXiv preprint arXiv:1506.04579, 2015.
  • [34] Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang. Semantic image segmentation via deep parsing network. In IEEE International Conference on Computer Vision (ICCV), 2015.
  • [35] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [36] I. Misra, C. Lawrence Zitnick, M. Mitchell, and R. Girshick. Seeing through the human reporting bias: Visual classifiers from noisy human-centric labels. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [37] V. Mnih and G. E. Hinton. Learning to label aerial images from noisy data. In International Conference on Machine Learning (ICML), pages 567–574, 2012.
  • [38] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari. Learning with noisy labels. In Advances in neural information processing systems, pages 1196–1204, 2013.
  • [39] G. Papandreou, L.-C. Chen, K. P. Murphy, and A. L. Yuille. Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In IEEE International Conference on Computer Vision (ICCV), 2015.
  • [40] G. Patrini, A. Rozza, A. Menon, R. Nock, and L. Qu. Making neural networks robust to label noise: A loss correction approach. In Computer Vision and Pattern Recognition, 2017.
  • [41] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
  • [42] M. Ren, W. Zeng, B. Yang, and R. Urtasun. Learning to reweight examples for robust deep learning. In International Conference on Machine Learning (ICML), 2018.
  • [43] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • [44] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015.
  • [45] C. Rother, V. Kolmogorov, and A. Blake. Grabcut: Interactive foreground extraction using iterated graph cuts. In ACM transactions on graphics (TOG), volume 23, pages 309–314. ACM, 2004.
  • [46] A. G. Schwing and R. Urtasun. Fully connected deep structured networks. arXiv preprint arXiv:1503.02351, 2015.
  • [47] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, pages 568–576, 2014.
  • [48] N. Souly, C. Spampinato, and M. Shah. Semi supervised semantic segmentation using generative adversarial network. In IEEE International Conference on Computer Vision (ICCV), 2017.
  • [49] S. Sukhbaatar, J. Bruna, M. Paluri, L. Bourdev, and R. Fergus. Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080, 2014.
  • [50] Y. Sun, X. Wang, and X. Tang. Deep convolutional network cascade for facial point detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3476–3483, 2013.
  • [51] D. Tanaka, D. Ikami, T. Yamasaki, and K. Aizawa. Joint optimization framework for learning with noisy labels. In Computer Vision and Pattern Recognition (CVPR), 2018.
  • [52] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015.
  • [53] A. Vahdat. Toward robustness against label noise in training deep discriminative neural networks. In Neural Information Processing Systems (NIPS), 2017.
  • [54] A. Vahdat and G. Mori. Handling uncertain tags in visual recognition. In International Conference on Computer Vision (ICCV), 2013.
  • [55] A. Vahdat, G.-T. Zhou, and G. Mori. Discovering video clusters from visual features and noisy tags. In European Conference on Computer Vision (ECCV), 2014.
  • [56] A. Veit, N. Alldrin, G. Chechik, I. Krasin, A. Gupta, and S. Belongie. Learning from noisy large-scale datasets with minimal supervision. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6575–6583. IEEE, 2017.
  • [57] X. Wang, S. You, X. Li, and H. Ma. Weakly-supervised semantic segmentation by iteratively mining common object features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [58] Y. Wei, J. Feng, X. Liang, M. Cheng, Y. Zhao, and S. Yan. Object region mining with adversarial erasing: A simple classification to semantic segmentation approach. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [59] Y. Wei, H. Xiao, H. Shi, Z. Jie, J. Feng, and T. S. Huang. Revisiting dilated convolution: A simple approach for weakly- and semi- supervised semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [60] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang. Learning from massive noisy labeled data for image classification. In Computer Vision and Pattern Recognition (CVPR), 2015.
  • [61] J. Xu, A. G. Schwing, and R. Urtasun. Learning to segment under various forms of weak supervision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [62] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations (ICLR), 2015.
  • [63] W. Zhang, S. Zeng, D. Wang, and X. Xue. Weakly supervised semantic segmentation for social images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [64] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial landmark detection by deep multi-task learning. In European Conference on Computer Vision, pages 94–108. Springer, 2014.
  • [65] Z. Zhang and M. R. Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In Neural Information Processing Systems (NIPS), 2018.
  • [66] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [67] X. Zhao, S. Liang, and Y. Wei. Pseudo mask augmented object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [68] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In International Conference on Computer Vision (ICCV), pages 1529–1537, 2015.
  • [69] Y. Zhou, Y. Zhu, Q. Ye, Q. Qiu, and J. Jiao. Weakly supervised instance segmentation using class peak response. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [70] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li. Face alignment across large poses: A 3d solution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 146–155, 2016.
  • [71] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.