On Boosting Semantic Street Scene Segmentation with Weak Supervision

03/08/2019 ∙ by Panagiotis Meletis, et al. ∙ TU Eindhoven

Training convolutional networks for semantic segmentation requires per-pixel ground truth labels, which are very time consuming and hence costly to obtain. Therefore, in this work, we research and develop a hierarchical deep network architecture and the corresponding loss for semantic segmentation that can be trained from weak supervision, such as bounding boxes or image-level labels, as well as from strong per-pixel supervision. We demonstrate that the hierarchical structure and the simultaneous training on strong (per-pixel) and weak (bounding box) labels, even from separate datasets, consistently increase performance over per-pixel-only training. Moreover, we explore the more challenging case of adding weak image-level labels. We collect street scene images and weak labels from the immense Open Images dataset to generate the OpenScapes dataset, and we use this novel dataset to increase segmentation performance on two established per-pixel labeled datasets, Cityscapes and Vistas. We report performance gains of up to +13.2% mIoU on specific classes, and an inference speed of 20 fps on a Titan V GPU for Cityscapes at 512 x 1024 resolution. Our network and the OpenScapes dataset are shared with the research community.




I Introduction

Semantic segmentation of street scene images is a fundamental building block for automated driving [1]. It is the first step of scene understanding and provides the necessary information for higher level reasoning and planning [2]. Formulating the problem as per-pixel (dense) classification and modeling it with Fully Convolutional Networks [3] has become the de facto solution for semantic segmentation of images. However, its success is based on the availability of large amounts of tediously per-pixel labeled data [4, 5, 6], and existing solutions do not leverage the weakly labeled data that are provided in much larger and more diverse datasets [7].

Therefore, in this work, we explore a method for per-pixel training of Fully Convolutional Networks on multiple datasets simultaneously, containing images with strong (per-pixel) or weak (bounding boxes and image-level) labels. The ability to train from weakly labeled data is an important research topic in the field of computer vision [8, 9, 10, 11], which, when solved, can be of benefit to many application domains.

Our method fully solves, in a consistent and uniform manner, the challenge of different annotation types during training on heterogeneous datasets, as introduced in [12]. The system consists of a hierarchy of classifiers and a hierarchical loss function, which can handle any of the aforementioned types of supervision. This is achieved by transforming the weak labels, which are incompatible with semantic segmentation, into per-pixel weak labels, a common practice that also appears in other related problems.


Fig. 1: Our contributions are shown in blue. Using diverse types of weak supervision from the Open Images dataset, we increase performance on datasets with strong supervision using a fully convolutional network with a hierarchy of classifiers and the corresponding loss.

In order to demonstrate how weak labels can boost semantic segmentation performance on strongly (per-pixel) labeled datasets, e.g. Cityscapes [4], we collect a weakly labeled dataset by mining street scene images from the very large scale Open Images dataset [7] and we name it OpenScapes. This new dataset contains 100,000 images with 2,242,203 bounding box labels and 100,000 images with 1,199,582 image-level labels, and spans 14 of the most important street scene classes. However, even after the automated selection procedure, the domain gap between OpenScapes and the per-pixel datasets on which we want to show the performance increase remains large, as can be seen in Fig. 3, making the weak supervision even "weaker".

We evaluate our system on two established per-pixel labeled datasets and we show that the performance increase grows with the amount of extra weak labels used. We achieve this without using any external components as in [11, 8], relying only on the hierarchical structure of the classifiers and the proposed hierarchical loss.

To summarize, the contributions of this work are:

  • A methodology for training semantic segmentation networks on datasets with diverse supervision, including per-pixel labels, bounding box labels, and image-level labels.

  • OpenScapes dataset: a large, weakly labeled dataset with 200,000 images and 14 semantic classes for street scene recognition.

Our system and the OpenScapes dataset are made available to the research community [14].

Fig. 2: Network architecture. The root classifier passes its decisions to the two subclassifiers, which classify only pixels that are assigned to them by the root classifier.

II Method

In this Section, we describe the proposed training and inference methodology. We generalize previous work [12] by enabling weak supervision from both bounding box labeled and image-level labeled images without using any external components.

Our method facilitates training of any fully convolutional network for per-pixel semantic segmentation and only requires a specific structure of classifiers and a specialized loss to train them. To achieve that, weak labels (bounding boxes and image labels) have to be converted to pseudo per-pixel ground truth. This is described in Sec. II-B. The network architecture and the corresponding hierarchical loss are presented in Sec. II-A and Sec. II-C respectively. In addition, we address the shortcomings of pseudo ground-truth generation [12] for any type of weak labels.

Fig. 3: Example images from per-pixel labeled Cityscapes dataset and the weakly labeled OpenScapes dataset that demonstrate the big domain gap.

II-A Convolutional Network Architecture

The network architecture follows the design proposed in [12] and is depicted in Fig. 2. Specifically, we opt for a two-level hierarchical convolutional network, which consists of a fully convolutional shared feature extractor and a set of hierarchically arranged classifiers. The root classifier is trained only with strong supervision (per-pixel labeled semantic classes). The subclassifiers are trained with per-pixel supervision, using the hierarchical loss of Sec. II-C. For that purpose, the weak labels are converted into per-pixel pseudo ground truth as described in Sec. II-B.

The benefits of the hierarchical structure [12] are twofold: 1) it solves the problem of simultaneously training with different types of supervision, by placing classes with weak labels in the subclassifiers, and 2) it solves the semantic class incompatibilities between datasets, due to the unavailability of specific semantic classes in all datasets.

The hierarchy of classifiers is constructed according to the availability of strong and weak labels for each class. The root classifier (left in Fig. 2) contains high-level classes with per-pixel labels. Each one of the subclassifiers corresponds to one high-level class of the root classifier, and contains subclasses with per-pixel and/or weak supervision.
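The routing of root decisions to the subclassifiers (Fig. 2) can be illustrated with a minimal NumPy sketch. It is not the released implementation: the function name, the argument layout, and the example class name "road" are assumptions made for illustration only.

```python
import numpy as np

def hierarchical_decisions(root_probs, sub_probs, root_names, sub_names):
    """Combine root and subclassifier outputs into one per-pixel label map.

    root_probs: (H, W, R) softmax output of the root classifier.
    sub_probs:  dict root class name -> (H, W, K) softmax output of its subclassifier.
    root_names: list of R root class names.
    sub_names:  dict root class name -> list of K subclass names.
    """
    root_ids = root_probs.argmax(axis=-1)
    # Start from the root decision for every pixel.
    labels = np.array(root_names, dtype=object)[root_ids]
    for name, probs in sub_probs.items():
        # Pixels that the root classifier assigned to this subclassifier.
        assigned = root_ids == root_names.index(name)
        sub_ids = probs.argmax(axis=-1)
        # Refine only those pixels with the subclassifier decision.
        labels[assigned] = np.array(sub_names[name], dtype=object)[sub_ids[assigned]]
    return labels
```

Pixels whose root decision has no subclassifier keep their root label, so a single label map is obtained from all classifiers.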

The shared feature representation (see Fig. 2) is passed through two shallow, per-classifier adaptation networks, which adapt the common representation, its depth, and receptive field to meet the requirements of each classifier as described in [12]. In this work, we use a single ResNet bottleneck layer [15] as in [12].

II-B Generation of pseudo per-pixel ground truth from weak labels

The goal is to train the network with per-pixel labels, thus we need to generate per-pixel ground truth from bounding boxes and image-level labels. The 2D ground truth generation procedure in [12] is ambiguous for classes whose bounding box boundaries do not tightly match the object boundaries. Thus, that method is valid only for square-shaped, compact objects, like traffic signs, and cannot be applied to image-level labels.

According to [12], the 2D pseudo ground truth for each image is generated pixel-wise, by assigning a single label to each pixel from the set of bounding boxes that this pixel belongs to, as depicted at the top of Fig. 4. This procedure effectively generates a so-called sparse or one-hot categorical probability distribution, since each pixel belongs to a specific class with probability 1. In contrast, in this work we model the per-pixel labels as a dense or multi-hot categorical probability distribution, and thus the ground truth for each image becomes 3D (see Fig. 4). This model assigns to each pixel a probability for every class, and the sum of probabilities over all classes must be 1. In order to convert bounding boxes and image labels to per-pixel labels, we use a voting scheme, according to which each label increases the counter vector of each pixel it covers by 1 in the entry of its class. After collecting all votes we normalize across all classes, so that the labels represent a valid probability distribution.
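The voting scheme can be illustrated with a minimal NumPy sketch. It is not the released implementation: the function name, the box format (class id plus end-exclusive pixel coordinates), and the choice to leave pixels without any vote as all-zero vectors are assumptions made for illustration.

```python
import numpy as np

def pseudo_ground_truth(height, width, num_classes, boxes=(), image_labels=()):
    """Build a dense (H, W, C) per-pixel label distribution from weak labels.

    boxes:        iterable of (class_id, y0, x0, y1, x1), end-exclusive coordinates.
    image_labels: iterable of class_id, each treated as a box covering the whole image.
    """
    votes = np.zeros((height, width, num_classes), dtype=np.float32)
    for class_id, y0, x0, y1, x1 in boxes:
        votes[y0:y1, x0:x1, class_id] += 1.0  # one vote per label per covered pixel
    for class_id in image_labels:
        votes[:, :, class_id] += 1.0
    totals = votes.sum(axis=-1, keepdims=True)
    # Normalize across classes where votes exist; unvoted pixels stay all-zero (assumption).
    return np.divide(votes, totals, out=np.zeros_like(votes), where=totals > 0)
```

With two overlapping boxes of different classes, the pixels in the overlap receive probability 0.5 for each class instead of a single arbitrary label, which is the ambiguity that the 3D ground truth is meant to resolve.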

Fig. 4: 2D vs 3D per-pixel pseudo ground truth (GT) generation. Left: image with a selected subset of bounding boxes colored by class. Right top: 2D GT generation used in [12]. Right bottom: proposed 3D GT generation. In 2D GT, overlapping bounding boxes produce ambiguity in generated per-pixel labels (e.g. car label is hidden behind pedestrian labels), which is solved by adding a 3rd dimension in GT generation. The same principle is used for generating GT from image level labels by considering for each label the boundaries of the bounding box to extend to the whole image.

II-C Hierarchical loss

classifier     | Per-pixel labeled data (Cityscapes or Vistas) | Weakly labeled data (OpenScapes)
root           | sparse CCE                                    | -
vehicle subcl. | dense CCE                                     | conditional dense CCE
human subcl.   | dense CCE                                     | conditional dense CCE
TABLE I: Loss components per classifier and per dataset. All losses are per-pixel Categorical Cross Entropy (CCE) losses between the dense or sparse categorical labels and the softmax probabilities of the associated classifier.

We construct the hierarchical loss as in [12], namely the loss is accumulated unconditionally for per-pixel labeled datasets and conditionally for bounding box or image-level labeled datasets, for all classifiers. The loss terms are pixel-wise categorical cross entropy losses and are summarized in Table I. The five loss terms of Table I are added, with the subclassifiers' losses weighted by coefficients (see Sec. III-B), and together with the regularization loss they build the total loss.

For the root classifier we use only the sparse categorical per-pixel labels, i.e. each pixel belongs to one class with probability 1, since this classifier receives supervision from the per-pixel labeled dataset. For the subclassifiers we use dense categorical labels for both the per-pixel and the weakly labeled images (see Sec. II-B). We convert the sparse labels of the per-pixel labeled dataset to dense categorical labels by assigning a probability of 0 to all classes except for the ground truth class, which is assigned probability 1.
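The conversion from sparse to dense labels and the resulting dense CCE can be illustrated with a short NumPy sketch. The function names and the small constant `eps`, added for numerical stability, are assumptions made for illustration and are not taken from the released code.

```python
import numpy as np

def sparse_to_dense(labels, num_classes):
    """Convert (H, W) integer class ids into (H, W, C) one-hot distributions."""
    return np.eye(num_classes, dtype=np.float32)[labels]

def dense_cce(dense_labels, probs, eps=1e-7):
    """Per-pixel categorical cross entropy between label and softmax distributions."""
    return -(dense_labels * np.log(probs + eps)).sum(axis=-1)
```

For a one-hot label the dense CCE reduces to the usual sparse CCE, so both label types can share one loss implementation.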

The parts of the losses that correspond to per-pixel labeled data, i.e. the sparse categorical CE loss of the root classifier and the corresponding parts of the dense categorical CE losses of the subclassifiers, are accumulated unconditionally. The remaining parts of the losses, i.e. the parts of the dense categorical CE losses of the subclassifiers that correspond to weakly labeled data, are accumulated for a pixel only if the following two conditions hold: 1) the per-pixel pseudo ground truth has positive probability for that pixel, and 2) the root classifier decision agrees with the per-pixel pseudo ground truth for that pixel, i.e. it is a class that has positive probability in the per-pixel pseudo ground truth.
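The two conditions can be illustrated as a per-pixel mask for a single subclassifier. This is a minimal NumPy sketch under stated assumptions, not the released implementation: the function name, the averaging over the selected pixels, and the constant `eps` are illustrative choices.

```python
import numpy as np

def conditional_dense_cce(pseudo_gt, sub_probs, root_decisions, root_class_id, eps=1e-7):
    """Mean dense CCE of one subclassifier over the pixels that pass both conditions.

    pseudo_gt:      (H, W, K) pseudo label distribution over the K subclasses.
    sub_probs:      (H, W, K) softmax output of the subclassifier.
    root_decisions: (H, W) argmax class ids predicted by the root classifier.
    root_class_id:  id of the high-level root class that this subclassifier refines.
    """
    has_label = pseudo_gt.sum(axis=-1) > 0          # condition 1
    root_agrees = root_decisions == root_class_id   # condition 2
    mask = has_label & root_agrees
    per_pixel = -(pseudo_gt * np.log(sub_probs + eps)).sum(axis=-1)
    if not mask.any():
        return 0.0  # no pixel qualifies, so this term contributes nothing
    return float(per_pixel[mask].mean())
```

Pixels outside every bounding box, or pixels where the root classifier predicts a different high-level class, therefore do not contribute to the subclassifier loss.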

                  | Cityscapes | Vistas | OpenScapes
# of images       | 2,975      | 18,000 | 200,000
# of classes      | 27         | 65     | 14
# of pixel labels |            |        | -
# of bound. boxes | -          | -      | 2,242,203
# of image labels | -          | -      | 1,199,582
TABLE II: OpenScapes dataset overview and comparison with per-pixel labeled datasets. Training splits are shown.

III OpenScapes Dataset and Implementation

In this Section, we describe the collection process of OpenScapes and we compare it with the per-pixel annotated Cityscapes and Vistas datasets. Moreover, we discuss all the implementation details for our experiments.

III-A OpenScapes Street Scenes Dataset

We collect images of street scenes from the recently open-sourced, very large scale Open Images dataset [7] and create a subset that we call OpenScapes. The Open Images dataset contains over 9,000,000 images, 14,600,000 bounding boxes for 600 object classes, and more than 27,900,000 human-verified image-level labels for 19,794 classes. We collected 200,000 images, containing 2,242,203 bounding box labels and 1,199,582 image-level labels from 14 classes, selected to contain as much street scene related content as possible.

The fully automated collection procedure is described in Sec. III-A1. However, even after the careful selection, the domain gap [16, 17] between the per-pixel datasets (Cityscapes, Vistas) and OpenScapes is large. This can be seen by the image examples in Figures 1 and 3, and is discussed together with a comparison with the employed per-pixel labeled datasets in Sec. III-A2.

III-A1 Mining procedure

First, we rank in descending order images from Open Images by the number of bounding boxes and image-level labels they contain for the 14 selected street scene classes. Then we select the top 100,000 images for the bounding box labeled subset and then 100,000 images for the image-level labeled subset and we make sure that there is no image overlap between the two subsets. For the ranking we used a voting system, according to which classes in the weak labels of an image vote for an image to be a street scene image or not. The more probable classes, like traffic light and license plate, can cast more votes than classes that may appear in other contexts (e.g. car, person).
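The ranking step can be sketched in a few lines of Python. The function name and the vote weights used in the usage note are hypothetical; the actual weights of the mining procedure are not given here.

```python
def rank_street_scene_images(image_labels, class_votes, top_k):
    """Return the ids of the top_k images ranked by their total class votes.

    image_labels: dict image id -> list of class names found in its weak labels.
    class_votes:  dict class name -> number of votes one label of that class casts.
    """
    scores = {
        image_id: sum(class_votes.get(name, 0) for name in labels)
        for image_id, labels in image_labels.items()
    }
    # Sort in descending order of votes; ties keep their original order.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]
```

For example, with hypothetical weights of 3 votes for "traffic light" and "license plate" and 1 vote for "car" and "person", an image labeled with a traffic light, a license plate, and a car collects 7 votes and ranks above an image labeled only with a car and a person, which collects 2.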

III-A2 Comparison with per-pixel labeled datasets

In Tab. II we compare OpenScapes with the two established per-pixel labeled datasets that we also use in our experiments. In Fig. 3 we present some images from Cityscapes and OpenScapes. As can be seen, the Cityscapes image domain is very consistent, with images taken from a specific point of view and in one country, in contrast to OpenScapes, which contains web-like images and does not correspond to a consistent domain.

Fig. 5: Comparison on Cityscapes validation split images of the hierarchical network trained on OpenScapes and Cityscapes against the baseline network trained on Cityscapes only. The three classes with the biggest improvement in mIoU are Truck (+13.2%), Rider (+3.5%), and Person (+2.1%).

III-B Implementation details

The network is depicted in Fig. 2. The feature extractor consists of the ResNet-50 layers (without the classifier) from [15], followed by a 1x1 convolutional layer, to decrease feature dimensions to 256, and a Pyramid Pooling Module [18]. The stride of the feature representation on the input is reduced from 32 to 8, using dilated convolutions. Each branch has an extra bottleneck module [15], a bilinear upsampling layer to recover the original resolution, and a softmax classifier.

We use TensorFlow and 4 Titan V 12 GB GPUs for training. We implemented synchronous, cross-GPU batch normalization, and for all experiments we use a batch size of up to 4 images per GPU depending on the experiment, containing 1 image from the per-pixel labeled dataset (Cityscapes or Vistas), 2 images from the bounding box labeled dataset (OpenScapes subset), and 1 image from the image-level labeled dataset (OpenScapes subset).

For experiments involving Cityscapes we use image dimensions of 512x1024 and for Vistas 621x855. Since OpenScapes images have multiple aspect ratios, we upscale each image to tightly fit the aspect ratio of the per-pixel labeled dataset and then crop a random patch of the same dimensions as the per-pixel labeled image. Networks with a batch size of 3 per GPU are trained for 26 epochs with initial learning rate 0.02, and networks with a batch size of 4 per GPU for 31 epochs with initial learning rate 0.03. All networks are trained with Stochastic Gradient Descent with momentum of 0.9 and L2 weight regularization with decay of 0.00017; the learning rate is halved three times, and the batch normalization moving average decay is set to 0.9. We use the same hyperparameter values for the coefficients of the loss as in [12].
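The resize-and-crop step for OpenScapes images can be sketched as follows. The sketch interprets "fit tightly" as resizing by the smallest factor for which the image covers the target dimensions; this interpretation, the function name, and the rounding are assumptions made for illustration.

```python
import math
import random

def resize_and_crop_box(height, width, target_h, target_w, rng=random):
    """Return the resized (H, W) and a random crop window (y0, x0, y1, x1)."""
    # Smallest scale factor for which the resized image covers the target size.
    scale = max(target_h / height, target_w / width)
    new_h = max(target_h, math.ceil(height * scale))
    new_w = max(target_w, math.ceil(width * scale))
    # randint is inclusive on both ends, so the window always fits in the image.
    y0 = rng.randint(0, new_h - target_h)
    x0 = rng.randint(0, new_w - target_w)
    return (new_h, new_w), (y0, x0, y0 + target_h, x0 + target_w)
```

For a 300x400 image and the 512x1024 Cityscapes training size, the width is the limiting dimension, so the random offset is drawn mainly along the vertical axis.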

per-pixel | OpenScapes bound. boxes | OpenScapes image-level | Cityscapes mAcc | Cityscapes mIoU | Vistas mAcc | Vistas mIoU
✓         | -                       | -                      | 77.8            | 68.9            | 53.0        | 43.6
✓         | ✓                       | -                      | 79.2            | 70.2            | 52.1        | 43.6
✓         | ✓                       | ✓                      | 79.3            | 70.3            | 52.0        | 43.0
TABLE III: Overall performance improvements using weak supervision from the OpenScapes dataset in addition to strong supervision from Cityscapes or Vistas, over the baseline network trained with only per-pixel labels.

IV Experiments

We evaluate performance using two established multiclass metrics for semantic segmentation [4], namely mean pixel Accuracy (mAcc) and mean Intersection over Union (mIoU). Metrics for all experiments are evaluated on the validation sets of the per-pixel datasets and are averaged over the last three epochs. In Sec. IV-A and IV-B we present overall results and per class results for those classes that receive extra weakly supervised examples. In Sec. IV-C we investigate the effect of the number of examples used from the weakly labeled dataset. Example results from all datasets are shown in Figures 5 and 6.
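Both metrics can be computed from a confusion matrix, as in the following NumPy sketch. It uses the common definitions (per-class accuracy and per-class IoU, averaged over the classes that occur in the ground truth); the exact evaluation protocol of the experiments is not restated here, so the function names and the handling of absent classes are assumptions.

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes):
    """Rows index ground-truth classes, columns index predicted classes."""
    idx = gt.ravel() * num_classes + pred.ravel()
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def mean_accuracy_and_iou(cm):
    """Return (mAcc, mIoU) averaged over classes present in the ground truth."""
    tp = np.diag(cm).astype(np.float64)
    gt_total = cm.sum(axis=1)
    pred_total = cm.sum(axis=0)
    acc = tp / np.maximum(gt_total, 1)
    iou = tp / np.maximum(gt_total + pred_total - tp, 1)
    present = gt_total > 0  # skip classes that never occur in the ground truth
    return acc[present].mean(), iou[present].mean()
```

The per-class IoU values in the vector `iou` are the quantities reported per class in the tables of this section.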

Origin of ground truth (columns 1-3) | Vehicle subclassifier (columns 4-10) | Human subclassifier (columns 11-13)
per-pixel | bound. boxes | image-level |      |      |      | Motorcycle |      | Truck | mean | Person | Rider | mean
✓         | -            | -           | 67.0 | 79.7 | 91.9 | 52.2       | 69.3 | 62.3  | 70.4 | 70.2   | 47.9  | 59.0
✓         | ✓            | -           | 67.8 | 81.8 | 92.5 | 50.3       | 69.3 | 71.4  | 72.2 | 71.9   | 50.7  | 61.3
✓         | ✓            | ✓           | 67.9 | 79.1 | 92.5 | 48.7       | 69.3 | 75.5  | 72.2 | 72.3   | 51.4  | 61.9
TABLE IV: Cityscapes per class mIoU (%) improvements for the classes that belong to subclassifiers receiving extra supervision from the weakly labeled OpenScapes dataset (100k subsets). Results are grouped per subclassifier.

Origin of ground truth (columns 1-3) | Vehicle subclassifier (columns 4-15) | Human subclassifier (columns 16-20)
per-pixel | bound. boxes | image-level |      |      |      |      | Caravan |      | On rails | Other vehicle |      |      | Wheeled slow | mean |      |      | Motorcyclist | Other rider | mean
✓         | -            | -           | 55.0 | 26.7 | 75.0 | 88.8 | 0.3     | 54.2 | 38.4     | 16.9          | 0.3  | 65.0 | 7.4          | 38.9 | 65.5 | 51.4 | 43.1         | 0.0         | 40.0
✓         | ✓            | -           | 56.1 | 21.2 | 73.8 | 88.6 | 11.6    | 53.9 | 49.2     | 18.4          | 0.9  | 66.9 | 10.7         | 41.0 | 64.7 | 47.1 | 52.7         | 0.4         | 41.2
✓         | ✓            | ✓           | 54.5 | 21.2 | 74.0 | 88.4 | 11.4    | 52.8 | 49.0     | 18.1          | 0.8  | 66.0 | 10.6         | 40.6 | 64.6 | 47.1 | 49.9         | 0.3         | 40.5
TABLE V: Vistas per class mIoU (%) improvements for the classes that belong to subclassifiers receiving extra supervision from the weakly labeled OpenScapes dataset (100k subsets). Results are grouped per subclassifier.

IV-A Overall results

In Table III the overall results for Cityscapes [4] and Vistas [5] are shown. All networks are trained with strong (per-pixel) supervision from Cityscapes or Vistas, and a combination of weak (bounding box, image-level, or both) supervision from OpenScapes. We used the two OpenScapes subsets with 100k images each (Sec. III-A) and their generated pseudo per-pixel labels, as described in Sec. II-B, mixed in the batch with Cityscapes or Vistas images (see Sec. III-B for implementation details).

For Cityscapes, we observe that mAcc and mIoU increase steadily as the amount of weakly labeled data included during training increases. For Vistas, however, training together with the OpenScapes subsets slightly harms the performance. This is possibly due to the diversity of the Vistas images and the large domain gap with OpenScapes. Overall, we note that adding extra supervision for specific classes does not dramatically harm the mean performance over all classes, and in most cases boosts it.

IV-B Improvements on classes with weak supervision

In this Section, we investigate the performance on classes that receive extra weak supervision in addition to strong per-pixel supervision. As can be seen in Tables IV and V, the overall mIoU of classes belonging to the vehicle and human subclassifiers improves in both datasets when adding the OpenScapes bounding box labeled subset. Adding the OpenScapes image-level labeled subset increases the performance in the Cityscapes case, but reduces it in the Vistas case. We hypothesize that this is due to the domain gap between the datasets (see Sec. III-A2 and Sec. V). We would also like to mention the big increase for specific classes, e.g. +13.2% for the Cityscapes "Truck" class, +11.3% for the Vistas "Caravan" class, and +10.8% for the Vistas "On rails" class.

IV-C Effect of weakly labeled dataset size

In this experiment we train the hierarchical architecture on Cityscapes, together with different portions of the OpenScapes bounding box labeled subset, with all other hyperparameters fixed, to investigate the effect of the size of the weakly labeled dataset. From Table VI, row 2, it becomes clear that without enough weakly labeled images the performance may even drop. However, when enough weak supervision is provided (rows 3 and 4), the performance improves.

per-pixel + #images with bbox GT    | mAcc | mIoU
0 images (0 bboxes)                 | 77.8 | 68.9
  images ( bboxes)                  | 77.4 | 68.4
  images ( bboxes)                  | 78.2 | 69.2
100,000 images (2,242,203 bboxes)   | 79.2 | 70.2
TABLE VI: Performance (mAcc and mIoU, %) on Cityscapes with different amounts of bounding boxes used to generate pseudo ground truth labels for the weakly labeled dataset.

Fig. 6: Comparison on Vistas validation split images of the hierarchical network trained on the OpenScapes bounding boxes subset and Vistas against the baseline network trained on Vistas only. The three classes with the biggest improvement in mIoU are Caravan (+11.3%), On rails (+10.8%), and Motorcyclist (+9.6%).

V Discussion and future work

The performance of our method heavily depends on two factors: 1) the amount of weak labels and their semantic extent of class connotation, and 2) the domain gap [16, 17, 20] between strongly and weakly labeled datasets.

In this work we assumed that the images of datasets that are trained simultaneously come from similar domains, and thus features from a common feature extractor can be classified by the same classifier. In reality, this assumption rarely holds, but we leave the investigation of this matter, and of how to solve it during inference, to future research. Methods that perform domain agnostic inference, like [20], may offer solutions to this problem.

Another important matter is the connotation extent of a semantic class (the extent of the class name connotation for labeling visually similar objects). Although in this work we assumed that classes described by the same high-level semantic concepts, like truck or bus, depict very similar objects across datasets, this is not true in general, and should be investigated in the future. This is visible, for example, in the performance drop for the motorcycle class in Table IV, for which the connotation extent diverges between the Cityscapes and OpenScapes datasets.

VI Conclusion

We presented a fully convolutional network coupled with a hierarchy of classifiers for simultaneous training on strongly and weakly labeled datasets for semantic segmentation. We collected street scene images from Open Images to generate a large-scale weakly labeled dataset called OpenScapes. Using OpenScapes we showed that the overall performance, as well as the performance for classes that receive extra weak supervision, is increased, provided that enough weak labels are available. Moreover, we examined the effect of the size of the weakly labeled dataset and showed that the performance increase grows with the size of the dataset. For our experiments we assumed that the domain gap between simultaneously trained datasets is minor; however, we would like to stress that it can be a limiting factor, especially when using image-level labels, and thus should receive attention in future research.