Training Constrained Deconvolutional Networks for Road Scene Semantic Segmentation

04/06/2016 ∙ by German Ros, et al. ∙ iRobot Universitat Autònoma de Barcelona 0

In this work we investigate the problem of road scene semantic segmentation using Deconvolutional Networks (DNs). Several constraints limit the practical performance of DNs in this context: firstly, the paucity of existing pixel-wise labelled training data, and secondly, the memory constraints of embedded hardware, which rule out the practical use of state-of-the-art DN architectures such as fully convolutional networks (FCN). To address the first constraint, we introduce a Multi-Domain Road Scene Semantic Segmentation (MDRS3) dataset, aggregating data from six existing densely and sparsely labelled datasets for training our models, and two existing, separate datasets for testing their generalisation performance. We show that, while MDRS3 offers a greater volume and variety of data, end-to-end training of a memory efficient DN does not yield satisfactory performance. We propose a new training strategy to overcome this, based on (i) the creation of a best-possible source network (S-Net) from the aggregated data, ignoring time and memory constraints; and (ii) the transfer of knowledge from S-Net to the memory-efficient target network (T-Net). We evaluate different techniques for S-Net creation and T-Net transferral, and demonstrate that training a constrained deconvolutional network in this manner can unlock better performance than existing training approaches. Specifically, we show that a target network can be trained to achieve improved accuracy versus an FCN despite using less than 1% of the memory. We believe that our approach can be useful beyond automotive scenarios where labelled data is similarly scarce or fragmented and where practical constraints exist on the desired model size. We make available our network models and aggregated multi-domain dataset for reproducibility.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 3

page 5

page 8

page 15

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Deconvolutional Networks (DNs) are a class of neural network which have achieved notable recent success on the task of semantic segmentation, in which image recognition is performed at the resolution of individual pixels 

[1, 2, 3, 4]. They have consequently become an attractive architecture for road scene segmentation—a useful component in many autonomous driving or advanced driver assistance systems. However, several limitations exist when trying to apply state-of-the-art DNs in practice.

Firstly, they are inefficient in terms of memory footprint. While commercial chips targeting the automotive industry are becoming increasingly parallel, the small size of fast-access on-chip SRAM memories remains limited (e.g. 512 KB for the Mobileye EyeQ2111http://www.mobileye.com/technology/processing-platforms/eyeq2/222All product names may be trademarks of their respective companies. chip and 1-10 MB for the Toshiba TMPV 760 Series333http://toshiba.semicon-storage.com/ap-en/product/automotive.html2 chip family). In contrast, the popular FCN-8s network [2] with 134.5 M parameters requires more than 500 MB of memory. Although more efficient architectures have been proposed, such as SegNet [4], they still contain tens of millions of parameters (29.5 M for [4]) and are yet to demonstrate accuracy on a par with the larger FCN-8s.

Secondly, since DNs are typically trained in a supervised manner, their performance benefits from access to a large amount of training data with corresponding per-pixel annotations. Producing such annotations is an expensive and time-consuming process. Hence, while datasets for tasks such as image classification can reach images in scale [5, 6, 7], popular semantic road scene segmentation datasets such as CamVid [8] or KITTI [9] contain images. The scarcity of data results in a lack of samples for rarer but important classes such as pedestrians and cyclists, which can make it difficult for models to learn these concepts without overfitting. Furthermore, data scarcity implies poor coverage over the true distribution of possible road scenes: datasets are typically captured in one or a few localised regions under relatively homogeneous road conditions. Understanding how best to incorporate knowledge from new domains as training data becomes available is an important problem to ensure the best general task performance given available data.

To address these limitations, we propose an approach which draws on ideas from domain adaptation [10, 11] and model compression by transfer learning [12]. We begin by collating numerous publicly available datasets from different domains and modalities which are useful for the task of semantic road scene segmentation. We refer to our aggregated dataset as the Multi-Domain Road Scene Semantic Segmentation (MDRS3) dataset. In contrast to existing work in road scene semantic segmentation [13, 14], we select two of the constituent datasets in their entirety as the test set for MDRS3. This means that training and testing for MDRS3 are not carried out on subsets of the same original dataset and performance is a better indication of task generalisation. We then examine methods for training on different modalities of data to create the best possible model, ignoring time and memory constraints. We discover that ensembling networks trained on distinct domains leads to much improved performance, and create a best-performing network containing 269 million parameters, which we refer to as the Source Network (S-Net). Finally, we explore methods for transferring knowledge from the unconstrained S-Net to a memory constrained architecture, which we refer to as the Target Network (T-Net). We demonstrate that by this approach, we can meet the desired constraints for embedded applications while achieving a higher accuracy than is otherwise impossible through existing training strategies. Concretely, we show that the performance of a state-of-the-art FCN [2] can be bettered using a deconvolutional network with 1% of the memory and comparable run-time. Fig. 1 summarises our experimental findings. For reproducibility, we plan to make publicly available all of our trained models and MDRS3 dataset upon publication.

Figure 1: Left: Model performance versus model size. Blue points indicate the baseline performance of the two different network architectures we compare. Green points indicate performance using different training strategies explained in Sec. 5. Red point denotes the performance of our constrained network after knowledge transfer, explained in Sec. 5. Our approach outperforms the per-class accuracy of a state-of-the-art fully convolutional network (FCN e2e (D)) with 1% of the model size. Right: Segmentation output on a test image, illustrating the qualitative improvement of our method. Best viewed on screen.

2 Related Work

Our work spans the topics of semantic segmentation, training with limited data, and knowledge transfer. We briefly recapitulate related literature from each.

Semantic Segmentation.

The task of semantic segmentation involves the estimation of a function

which maps an input image, such as , to an output label image , where the labels index the semantic class of the input at that pixel (e.g. road, sidewalk, sky, vegetation, pedestrians, etc.). It is a popular problem in computer vision and has been tackled for various environments from indoors [15, 16] to outdoors [13, 17], as well as for specific tasks such as road scene perception [18]. For the latter, which is the focus of our work, semantic segmentation is expected to play a key role as part of the local planning and obstacle avoidance sub-systems of future semi-autonomous and autonomous vehicles.

Classical tools for addressing the problem include pipelines based on a combination of hand-crafted features (e.g. 

SIFT, HOG) and region-based classifiers (

e.g. SVM, ADABoost), with probabilistic graphical models such as Conditional Random Fields (CRFs) used to produce structured predictions [14, 18, 19, 20, 21, 22]

. With the arrival of deep convolutional neural networks (CNNs), hand-crafted features were substituted by learned CNN representations, which worked at the level of image patches 

[23, 24]. This trend continued with the introduction of DNs, which naturally perform the process of recognition and whole-image segmentation, producing a dense inference at a pixel level [2, 3, 4, 15]. Recently, this trend has culminated in the addition of structured prediction by employing messaging passing between the net and an MRF [25] or adopting the use of Recurrent Nets with equivalent behaviours to a CRF [26].

Training with Limited Data.

One key problem with DNs is that when applied to certain domains, such as automotive environments, there is a lack of suitably large and varied training data. Two recent approaches [27, 28] propose means of mitigating this problem by augmenting an existing semantic segmentation dataset (i.e. consisting of pixel-wise labels) with additional data from object detection and image classification datasets, which are weakly annotated with bounding boxes or text captions. Both approaches are directly applied on the augmented datasets to train DNs in an end-to-end fashion and both report subsequent improvements in accuracy for the PASCAL-VOC dataset [29]. However, obtaining significant improvements in this manner is possible only when the existing and additional datasets are similar in nature—in this case, both consisted of annotations of simple objects, ignoring the architectonic elements composing the background, e.g. road, sidewalk, etc. As we show in this work, the application of these strategies when dealing with urban imagery and architectonic classes fails to produce the most competitive results in automotive scenarios. Furthermore, while the issue of training data scarcity is likely to diminish with time, as new larger datasets are released (e.g. the recent releases of the Cityscapes [30] dataset, containing 5,000 fine-labelled images and 20,000 coarse-labelled images, and the SYNTHIA dataset [31], containing 250,000 synthetic fine-labelled images), we believe that our approach will remain useful for training resource-constrained segmentation networks.

Model Compression by Knowledge Transfer.

While the recent trend in deep learning has been to strive for even deeper models 

[32], the preference for deep versus shallower models is not because shallower models have been shown to have limited capacity or representational power, but rather that learning and regularization procedures used to train shallow models are not sufficiently powerful [33]. One reason for this is that, counterintuitively, the likelihood of falling into poor quality local minima increases with decreasing network size [34]. Various approaches to extract better performance from shallow networks have been proposed in the literature. In [35], an ensemble of classifiers, trained on a small but representative subset of a larger dataset, is used to label the larger unlabelled dataset. The large ensemble-labelled dataset is then used to train a network, demonstrating improved performance versus training on the original ground truth for the smaller dataset. More recently, [33]

shows that shallow neural nets can be trained to achieve performances previously only reachable by deep models, by training a large teacher ensemble and transferring knowledge from it to a shallow but wide model by training it to match the logit activations of the teacher. Hinton

et al.  [12]

confirm these findings and propose to address the problem by exploiting the “dark knowledge” available in the teacher ensemble, referring to the full probability distribution produced by the soft-max classifier. This knowledge is transferred to a compact student network using relaxation of cross-entropy. This approach was extended in 

[36], showing that it is possible to reduce the number of parameters by creating deeper-and-thinner students out of shallower-wider teachers, at the possible expense of increasing computation time. A further recent line of relevant work on network compression focuses on applying sophisticated engineering tools to reduce the network size. Examples include [37], which uses pruning, trained quantization and Huffman coding to further compression, and  [38], in which these engineering tools are combined with a novel architecture to produce very compact classification networks. In this paper we extend previous research on knowledge transfer to a novel problem and the recently proposed architecture of DNs.

3 Multi-Domain Road Scene Semantic Segmentation (MDRS3)

DENSE: CamVid (2007) [8, 39] KITTI-S (2012) [9] *Urban LabelMe (2008) [40] *CBCL (2013) [41]
Example(s)
# Images 600 (Cambridge, UK) 547 (Karlsruhe, GER) 942 (Various) 3,547 (Boston, USA)
Used for Training Training Test Test
SPARSE: *ETH-RMPTMP (2009) [42] *GTSRB (2013) [43] M-COCO (2014) [5] *KITTI-O (2012) [9]
Example(s)
# Images 14,056 (Zurich, CH) 740 (Various, GER) 3,262 (Various) 7,481 (Karlsruhe, GER)
Used for Training Training Training Training
Figure 2: Constituent datasets of the Multi-Domain Road Scene Semantic Segmentation (MDRS3) dataset, containing dense and sparse labels. Datasets marked with * were upgraded from coarse labels or bounding boxes to pixel-wise annotations. The test set consists of two dense datasets (CBCL, Urban LabelMe) to better evaluate generalisation performance. Best viewed on screen.

Acquiring data suitable for training road scene semantic segmentation is expensive and time-consuming. The process of densely labelling an image with 10-20 classes can take up to 30 minutes for a typical, cluttered perspective street-view image and so existing datasets tend to be relatively small. In addition, datasets are often confined to localised geographic regions and trained and tested on in isolation. In our work we consider using numerous datasets to create one aggregate dataset, which we refer to as the Multi-Domain Road Scene Semantic Segmentation dataset (MDRS3), to take advantage of all of the relevant training data available.

Dataset composition.

Fig. 2 details the constituent datasets of our MDRS3 dataset. We consider popular road scene semantic segmentation datasets with dense pixel-wise annotations such as CamVid [8, 39] and KITTI Semantic (KITTI-S) [9, 13, 14]. However, as shown in Table 1, these dense datasets contain a large imbalance in the frequency of occurrence of various classes: structural classes such as road, sky or building are several orders of magnitude more frequent than important non-structural classes such as cars, pedestrians, road-signs or cyclists. To boost the recognition of the latter, we include specific detection and recognition datasets where annotations are available in the form of bounding-boxes or segmentation masks: KITTI Objects (KITTI-O) [9], a filtered set of Microsoft COCO (M-COCO) [5] containing pedestrians, cyclists, road signs and cars in urban environments, ETH Robust Multi-Person Tracking from Mobile Platforms (ETH-RMPTMP) [42] for pedestrians and the German Traffic Sign Recognition Benchmark (GTSRB) [43] for road signs. The distribution of classes for our MDRS3 train and test sets (final two rows of Table 1) illustrate how training data in our dataset includes many more instances of important rare classes compared to existing dense datasets.

Dataset # Image sky build. road side. fence veg. pole car sign ped. cycl.
CamVid [8, 39] 600 15.7 24.4 33.4 6.2 2.6 11.4 0.4 4.8 0.5 0.4 0.5
KITTI-S [9] 547 6.2 25.9 17.2 7.0 3.7 28.7 0.5 9.9 0.4 0.2 0.2
*U-LabelMe [40] 942 13.2 39.9 19.1 8.1 0.3 11.1 0.5 5.8 0.3 1.1 0.5
*CBCL [41] 3,547 5.4 26.4 28.2 6.9 0.7 17.9 1.3 11.8 0.3 0.8 0.2
*ETH-RMPTMP [42] 14,056 - - - - - - - - - 100.0 -
*GTSRB [43] 740 - - - - - - - - 100.0 - -
M-COCO [5] 3,262 - - - - - - 1.0 63.7 11.4 16.6 7.3
*KITTI-O [9] 7,481 - - - - - - - 90.7 - 7.4 1.9
MDRS3-Train 26,686 5.4 12.1 12.5 3.1 1.5 9.2 0.5 36.6 3.4 13.2 2.5
MDRS3-Test 4,489 10.0 34.4 22.8 7.6 0.5 14.0 0.8 8.3 0.3 1.0 0.3
Table 1: Class distribution (% of total pixels) for the MDRS3 dataset constituents and test/train splits. The “void” class has been removed for clarity.

Refinement of sparse annotations.

For constituent datasets where annotations are provided in the form of bounding-boxes (marked with an asterisk in Table 1), we perform refinement to pixel-wise annotations by adopting a similar GrabCut-based approach to [28]. For the CBCL dataset, which is labelled with polygonal bounding-boxes for 9 object categories and contains many void areas, we enlarge the category set to 11 and extend existing labels to missing areas using a CRF classifier [21]. We provide further detail of this process in the accompanying supplementary material.

Test dataset.

For evaluation, we maintain a separation between datasets used for training and testing. We use a combination of different domains with dense and sparse annotations for training, while for testing we use two separate datasets with dense pixel-wise annotations: a new subset of the LabelMe dataset [40] with urban images from different cities, referred here to as Urban LabelMe (U-LabelMe) and a processed subset of the CBCL StreetScenes Challenge Framework [41]. These two datasets are more challenging compared to CamVid and KITTI, containing a larger variety of scenarios with different viewpoint and illumination conditions (compared to the forward-looking camera viewpoint in CamVid and KITTI). Our test dataset thus provides a better measure of the generalisation performance of the trained network at test time, especially compared to the common practice of using subsets of the same sequence for training and testing.

4 Network Architectures for Semantic Segmentation

We consider two DN architectures and the trade-off they achieve between task performance and memory footprint. The two selected state-of-the-art networks are: the fully convolutional network (FCN) [2] and the DeconvNet [3]. We do not consider models that are extended with a CRF such as [26], since such extensions do not alter the intrinsic model capacity and smoothing can be added as a post-processing step if desired.

The DeconvNet and FCN architectures are shown in Fig. 3(ii)-(iii), respectively. Both DNs expand on VGG-16 [44], but the DeconvNet is much deeper than FCN (75% more parameters), making it harder to train and unsuitable for embedded applications. The depth of these networks is justified for the task of semantic segmentation of general scenes (which contain thousands of classes of objects), but shallower networks may suffice for constrained urban environments. Another difference between these architectures is in the upsampling philosophy: the FCN combines outputs of different layers to achieve better localization accuracy, while the DeconvNet stores pooling indices to re-use later for guiding feature map upsampling. This last strategy has been shown to improve localization accuracy for different problems [45].


Figure 3: Summary of network architectures. From top to bottom: (i) a key to basic building blocks; (ii) DeconvNet [3] (251M parameters); (iii) FCN [2]

(134M parameters); (iv) our constrained Target Net or T-Net (1.4M parameters); and (v) our dense/sparse FCN ensemble or chosen S-Net (269M parameters). All convolutions assume stride 1 unless otherwise specified. Descriptions of these networks can be found in sections 

4 and 5.

4.1 Source Network (S-Net) Architecture

The Source Network (S-Net) is selected by choosing the best possible performing network, disregarding memory or computational constraints. Our choice of S-Net consists of an ensemble of two FCN networks trained on different data modalities, i.e. dense and sparse data modality—see Fig. 3(iv)—, which was found to be the best performing unconstrained network as reported in section 7. Although this ensemble contains more parameters than a DeconvNet (269M versus 251M), it leads to better results and is faster to train, reason why we do not use DeconvNet and favour FCN-based approaches instead for S-Net. Further detail is contained in section 5.4.

4.2 Target Network (T-Net) Architecture

The T-Net, shown in Fig. 3(v), consists of a model based on the pooling-unpooling principle of the DeconvNet [3] but simplified to suit an embedded system (similar to the “basic” SegNet version proposed in [4]).

T-Net consists of 4 contraction blocks, followed by 4 expansion blocks, with a total of M parameters. This reduced size offers a good compromise between memory requirement and performance. Contraction blocks serve to create a rich representation that allows for recognition as in standard classification CNNs. Expansion blocks are used to improve the localization and delineation of label assignments. Both contraction and expansion blocks use

kernels with a stride of 1 pixel and a fixed number of 64 feature maps. Batch normalization is added prior to ReLU to reduce internal covariate shift 

[46] during training and improve convergence. Upsampling in expansion blocks is carried out by storing and retrieving pooling indices for current activations. This helps to produce sharp edges in the final output, avoiding blocky results [3]. A linear classifier performs the final label estimation at the pixel level. The choice of 4 blocks is motivated by empirical analysis, offering the best trade-off between model compactness and good performance (see supplementary material).

5 Training Strategies for DN architectures

In this section we describe the different approaches used to train our networks on the challenging MDRS3 dataset described in section 3. The approaches explored are (i) “e2e”—standard end-to-end training over various subsets of the multi-domain training data; (ii) “BGC”—which uses Balanced Gradient Contribution to generate stable gradient directions for end-to-end training; (iii) “Flying-Cars”—dynamic domain adaptation of the sparse training data; and (iv) “Ensemble”—the ensembling of models trained on separate domains. The performance of these methods is evaluated in section 7.

Each training strategy was initialised identically. Contraction blocks of the architecture under study were assigned the weights of classification networks pre-trained on ImageNet – VGG-16 

[44] in the case of FCN and VGG-F [47] in the case of T-Net. Adjustments to the shape of the weights were performed where dimensions do not match. Expansion blocks were initialized using the method of He et al. [48].

Optimisation was performed via standard backpropagation using Stochastic Conjugate Gradient Descent (S-CGD), endowed with a bounded line-search strategy and backtracking with Armijo’s rule 

[49]

. To avoid overfitting, the number of line-search iterations was bounded to 3.This proved to converge faster to good solutions than stochastic gradient descent without manual tweaking of learning rates.

5.1 End-to-End Training (e2e)

Our simplest training approach, end-to-end (e2e) training, consisted of standard mini-batch training on random samples (with replacement) from the mixed dense and sparse training set (i.e. all data in). To achieve reasonable per-class accuracy, the use of weighted cross-entropy (WCE) as training loss was found to be essential. WCE re-scales the importance of each class, , according to its inverse frequency in the training data set , i.e. :

(1)

where refers to the network, , stand for the -th training image and ground truth image respectively, and the weighting function is given by

(2)

In this way, WCE helped the networks to account for class frequency imbalances, a common phenomenon exposed in Table 1, which were otherwise observed to reduce a network’s attention to rare but important classes such as pedestrians or bicycles during training.

End-to-end training was applied to learn separate models for the dense and sparse domains as well as a combined model on both data domains. However, when this approach is used naively on the combined data, we observed an unstable oscillatory behaviour of the objective and eventually divergence of the system. This phenomenon is due to the strong difference between the statistics of both distributions, which give rise to very noisy descent directions during optimisation. Thus, in order to exploit all the information available in both domains one needs to stabilize the training process, via alternatives such as those proposed in the following sections.

5.2 Balanced Gradient Contribution (BGC)

The severe statistical difference between the domains induces a large variance in gradients for a sequence of mini-batches. Data from the dense domain is more stable and suitable for structural classes, but less informative in general. Data from the sparse domain is highly informative, with critical information about dynamic classes, but very noisy. To deal with these aspects we propose to compute search directions using the directions proposed by the dense domain under a controlled perturbation given by the sparse domain as shown in (

3).

(3)

where stand for a subset of samples and their associated labels, drawn from the dense (D) or sparse (S) domains. This procedure can be seen as the addition of a very informative regularizer controlled by the parameter , but an analogous effect can be achieved by generating mini-batches containing a carefully chosen proportion of images from each domain, such that . This modification of the training procedure leads to superior results and a stable behaviour, as reported in section 7.

5.3 Flying Cars (FC): Domain Adaptation by Data Projection

Another alternative to solve the problem caused by the combination of incompatible domains is to project or transfer one domain into another. In our case, the noisy sparse domain is projected to the dense domain, using ideas from domain adaptation [50]. This can be achieved, for instance, by selecting random images from the dense domain and using them as backgrounds in which to inject the objects and labels of the sparse domain. This approach was recently used in the “Flying Chairs” approach of [51] to train DNs for optical flow from synthetic data. It can be seen as a way of performing highly informative data augmentation over the dense domain. Similarly to [51], we use a naive approach which does not provide a hard constraint on the spatial context of the objects being inserted into the scene, hence the name “Flying Cars” (FC).

5.4 Ensemble of Sparse and Dense Domains

Finally, it is possible to think about the domains as two different tasks: one consisting of recognizing classes from finely-annotated data and the other recognizing classes, i.e. foreground, traffic signs, poles, cars, pedestrians and cyclists; from noisy sparse annotations. The model trained on the dense domain, , is better at structural elements such as roads, buildings and sidewalks; while the model trained on the sparse domain, , is extremely good at segmenting dynamic objects such as pedestrians and cyclists. These models can be combined as part of a larger network which adds several new trainable blocks to perform a consensus from the output of the original models. In our experiments the ensemble is performed by fixing the original networks and adding a convolutional block and four residual-blocks [32] to estimate a consistent output. This is shown in Fig. 3(iv). Residual-blocks were used as they were found to lead to better generalization than simple convolutions in practice. The current configuration, 4 blocks, one with features and three with features, was the best configuration we found that did not lead to clear overfitting. This approach is further analysed in section 7.

6 Transferring Knowledge across Deconvolutional Networks

We use the training methods described in section 5 to train both FCN and T-Net architectures. The results are reported in Table 2. For all training methods explored, FCN is observed to consistently outperform the smaller T-Net. Moreover, among the different approaches for training, the most outstanding in terms of per-class accuracy is the multi-domain ensemble.

Despite the high accuracy of the FCN ensemble, its large number of parameters makes it unsuitable for embedded applications, in the context of road scene segmentation. We next investigate whether it is possible to boost a more compact model such as T-Net to have an equivalent performance. Our hypothesis is that the capacity of T-Net is sufficient to produce results at the level of FCN and FCN-ensemble, but due to specific details of its training and architecture, such as batch normalization and noise within the training data, the methods of section 5 cannot exploit its full potential. We therefore examine an alternative training approach for T-Net. We adopt the FCN ensemble as a Source Network (S-Net) and attempt to emulate its behaviour with (i.e. transfer its knowledge to) the T-Net. We describe three approaches to transfer knowledge: (i) via labels (TK-L), (ii) via soft-max probabilities (TK-SMP), and (iii) via soft-max probabilities with weighted-cross-entropy (TK-SMP-WCE).

Transferring Knowledge Through Labels (TK-L).

This strategy aims to distill the knowledge of the S-Net directly from its predicted labels, in the spirit of [35]. We use both dense and sparse domains of training data described in section 3, ignoring their original annotations. The benefit of this approach is that the multi-modality of the data has been filtered by the S-Net and some distractors are ignored, so the information reaching T-Net is simpler, leading to a smoother search space and making it easier to find good solutions. In our setup we included extra training data from a large unlabelled Google Street View (GSV) dataset [52], taken of street scenes from multiple cities in the US. We remove the upward facing camera and took a random crop from each image to produce 51,715 images. We combined previous and new training data using BGC to train the T-Net with a standard cross-entropy loss. Here, BGC is used as an important mechanism to control the influence of the GSV data and prevent from drift.

Transferring Knowledge Through SoftMax Probabilities (TK-SMP).

The strategy uses additional information from S-Net during transfer, by considering the probability distributions produced by the softmax classifier, which contains information about how different classes are correlated [12]

. To this end, we train a T-Net using standard cross-entropy between the probability distributions of S-Net and T-Net as our loss function. As in the previous strategy, the training makes use of BGC to control the influence of GSV data to bound its contribution. This second approach leads to a notable improvement of the network per-class accuracy as shown in Table 

3 (i)-(ii).

A variation of this method consists of adding drop-out blocks to the T-Net during the transference process. In practice, this addition behaves as in end-to-end training, helping to improve the generalization of the net. See Table 3(iii).

Transferring Knowledge Through SoftMax Probabilities with WCE (TK-SMP-WCE).

One of the problems with the previous approaches of TK-TL and TK-SMP is that they do not account for class imbalance during transfer. In practice this means that the resulting models are biased towards the dominant classes and producing models with higher per-class accuracy requires a higher number of epochs during training. We propose to solve this problem by controlling the influence of each softmax sample with WCE, in the same way that the influence of different datasets is controlled by BGC. This simple modification, in combination with the use of dropout in the T-Net, leads to models that have virtually the same per-class accuracy as the S-Net,

i.e. an ensemble of FCNs; see Table 3 (iv). In this way the full potential of the T-Net is unlocked, giving rise to an accurate and memory-efficient model, convenient for embedded systems and automotive applications.

7 Experimental Results

We evaluate the performance of the proposed training methodology with respect to a set of state-of-the-art baselines. Special emphasis is set on the performance of our TK-SMP-WCE transfer technique when used in combination with Balanced Gradient Contributions (BGC).

Experiment Setup.

All our experiments are carried out on the MDRS3 dataset (Section 3), testing on the combination of U-LabelMe and CBCL (1,526 images overall). Due to time and resource constraints, we subsample the original images to a resolution of in all our experiments. This speeds up training and evaluation of models but makes certain classes, such as sidewalks, poles and traffic signs, systematically harder to recognize for all models due to the low resolution. Nevertheless, this factor is consistent across all the experiments and does not affect the conclusions obtained when comparing different training approaches and models. Images are initially normalized using spatial contrast normalization, independently applied to each channel. Afterwards, zero-mean and data re-scaling in the range [-127,127] are applied. In practice we observed that this normalization speeds-up convergence.

Results are evaluated according to the average per-class accuracy (per-class) and the global accuracy (global). Given the number of pixels, , belonging to class and classified as class , and assuming is the number of classes, then per-class is evaluated as and global as where is the total number of pixels in the evaluation set. Due to the intrinsic unbalanced nature of the class frequencies in urban scenes, we consider the average per-class to be more significant to assess the recognition and generalisation capabilities of the models. Within parenthesis we report the difference between the results of the current method and the FCN model at Table 2(i) as a reference (improvements are highlighted in blue, diminishments in red).

7.1 Assessing Multi-Domain Training

End-to-End training.

In Table 2 (i)-(iii), we first evaluate the performance of T-Net against the FCN network [2] and ALE [21], a classical semantic segmentation framework based on hand-crafted features. These models are trained using the dense domain only, with the end-to-end approach described in section 5.1. As Table 2 (i)-(iii) shows, for this initial setup T-Net underperforms both FCN, by 11.2 points per-class, and ALE, by 1.9 points.

We extended this first evaluation by adding the sparse domain to the end-to-end training. However, as shown in Table 2 (iv) and (viii) the training diverged in both cases. This phenomenon was commented on in section 5.1 and is attributed to the gradient noise introduced by the sparse domain when its contribution is unbounded. This reinforces our claim that control over the distribution and the complexity of the data is required to produce competitive training results.

Flying Cars, BGC & Ensemble.

When the end-to-end training is replaced by methods implementing policies to control the contribution of each domain, the improvement in accuracy is notable. Table 2 (v)-(vii) shows that for all the techniques, controlled training improves the per-class of the standard FCN. FC and BGC methods, although not achieving the top performance, have the advantage of requiring just one training stage; while the ensemble requires training individual models first (per domain) and then merging them. Yet, since the ensemble of FCN achieves the highest performance we use it as our S-Net, and try to match its performance with T-Net. The outcome of applying FC, BGC and ensemble on the T-Net are analogous to the previous case; and again, the ensemble renders the best results in terms of per-class accuracy (see Table 2 (ix)-(xi)).

7.2 Evaluation of Knowledge Transfer Methods

As summarized in Table 3, results of previous training approaches on T-Net are dramatically improved when applying knowledge transfer methods. For all the transference method we added the unlabelled data from the Google Street View Data Set [53] in order to increase the variability of the S-Net responses during the process, which helps capturing the behaviour of S-Net.

Here we see that the evolution of the transferring techniques is directly correlated to the improvement of the T-Net performance. A simple transfer of labels (TK-L) from the S-Net produces a T-Net model that is already 2.9 points better than FCN (used here as a reference). When the transfer is based on the softmax probability distribution over the classes, as in TK-SMP, accuracy is boosted up to 57.3 (6.7 points better than FCN). It is worth noticing than, when dropout is included in the TK-SMP transference (TK-SMP-Drop), it improves global accuracy in 3.2 points compared to FCN. We observed this effect when using dropout at expense of some loss in per-class accuracy.

Finally, Table 3 (iv) shows that when the S-Net softmax distributions are weighted according to their relevance in the dataset (i.e. less abundant more relevant), the transference of this knowledge reaches the maximum performance found so far, 59.3% of per-class accuracy. Thus, the TK-SMP-WCE approach produces a T-Net 9.1 points better than FCN in per-class and 0.2 in global accuracy, almost reaching the results of the S-Net, i.e. an ensemble of two FCN which has more parameters. Visual results of this evolution are shown in Fig. 4 for testing examples. Notice how the T-Net obtained from TK-SMP-WCE can sometimes produce better results than the S-Net. We believe that these results give strong evidence to render knowledge transfer methods and in particular TK-SMP-WCE as preferred methods to train memory constrained DNs.

Method sky building road sidewalk fence vegetat. pole car sign pedest. cyclist per-class global
(i) FCN [2] (D) 77.8 67.6 86.1 35.9 35.1 89.6 8.9 86.2 41.3 17.6 10.6 50.6 71.6
(ii) T-Net (D) 80.6 62.0 85.2 20.6 4.4 84.5 9.4 70.0 6.4 7.6 2.3 39.4 (-11.2- -11.2 -11.2) 66.6 (-5.0- -5.0 -5.0)
(iii) ALE [21] (D) 85.0 69.0 94.0 8.0 19.0 83.0 3.0 74.0 13.0 5.0 1.0 41.3 (-9.3- -9.3 -9.3) 72.1 (0.5- 0.5 0.5)
(iv) FCN [2] e2e (D+S) training diverged
(v) FCN [2] FC 85.3 71.4 87.0 26.4 19.8 86.5 10.6 89.3 45.4 58.8 7.0 53.4 (2.8- 2.8 2.8) 73.2 (1.6- 1.6 1.6)
(vi) FCN [2] BGC 80.3 73.5 82.6 49.5 39.6 91.6 11.6 87.3 50.8 44.1 19.6 57.3 (6.7- 6.7 6.7) 75.5 (3.9- 3.9 3.9)
(vii) FCN [2] Ens. (S-Net) 77.4 71.9 85.0 27.8 40.8 85.8 8.0 93.4 43.0 80.4 60.6 61.3 (10.7- 10.7 10.7) 73.4 (1.8- 1.8 1.8)
(viii) T-Net e2e (D+S) training diverged
(ix) T-Net FC 77.5 67.2 77.7 34.4 18.4 86.3 8.0 80.0 18.8 25.9 4.8 45.3 (-5.3- -5.3 -5.3) 69.1 (-2.5- -2.5 -2.5)
(x) T-Net BGC 58.9 64.5 81.6 21.5 4.8 83.1 11.0 82.3 21.2 31.3 9.3 42.7 (-8.3- -8.3 -8.3) 64.9 (-6.7- -6.7 -6.7)
(xi) T-Net Ens. 89.0 57.4 85.5 22.9 0.3 92.2 11.4 86.3 14.6 56.6 16.9 46.9 (-3.7- -3.7 -3.7) 65.5 (-6.1- -6.1 -6.1)
Table 2: Semantic segmentation quantitative results for FCN and T-Net on the testing dataset for the training methods under study, i.e. end-to-end (dense, D; sparse, S; and both, D+S), BGC, Flying cars (FC) and net ensemble (Ens.).
Method sky building road sidewalk fence vegetat. pole car sign pedest. cyclist per-class global
(b) baseline: FCN [2] (D) 77.8 67.6 86.1 35.9 35.1 89.6 8.9 86.2 41.3 17.6 10.6 50.6 71.6
(i) T-Net (TK-L) 88.7 50.6 68.8 45.4 48.7 77.2 18.6 73.1 19.4 68.8 29.3 53.5 (2.9- 2.9 2.9) 62.9 (-8.7- -8.7 -8.7)
(ii) T-Net (TK-SMP) 85.1 65.1 87.5 21.1 35.7 85.3 6.6 90.0 45.2 53.2 55.6 57.3 (6.7- 6.7 6.7) 70.8 (-0.8- -0.8 -0.8)
(iii) T-Net (TK-SMP-Drop) 87.6 75.9 79.3 43.2 27.1 80.8 4.0 86.9 19.9 68.5 14.0 53.4 (2.8- 2.8 2.8) 74.8 (3.2- 3.2 3.2)
(iv) T-Net (TK-SMP-WCE) 87.4 66.9 82.0 33.0 37.9 83.3 14.1 89.4 40.0 78.6 40.2 59.3 (9.1- 9.1 9.1) 71.8 (0.2- 0.2 0.2)
Table 3: Evaluation of the proposed Knowledge Transfer techniques for .
Figure 4: Qualitative results on test images for different training methods. Our proposed method of training T-Net via transfer learning results in visually good segmentations, in some cases providing better results than the FCN ensemble and even noisy ground truth.

8 Conclusions and Future Work

In this work we have described a training strategy for DNs to be used in resource-constrained applications such as road scene segmentation. We showed that training relatively shallow target networks via regular end-to-end approaches on a challenging aggregate dataset leads to underperformance versus state-of-the-art deep models. One likely cause for this is that shallow models are much harder to optimize and, when trained directly on noisy or multi-modal data, have difficulty in navigating local minima. To overcome this, we extended the idea of knowledge transfer to DNs. We first explored various means of producing a best-performing source network, by relaxing resource constraints and ensembling networks across different data domains. We then demonstrated that by using the source network as a guide, it was possible to train a target network which satisfied all constraints while giving better performance than a state-of-the-art FCN network and almost the same performance than an ensemble of FCNs, with a memory footprint that is just of the ensemble. We believe that our findings will be very useful for training DNs not just in automotive settings but also in context where labelled data is limited and practical constraints exist on model size.

References

  • [1] Farabet, C., Couprie, C., Najman, L., LeCun, Y.: Learning hierarchical features for scene labeling. IEEE Trans. Pattern Anal. Machine Intell. 35(8) (2013) 1915–1929
  • [2] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation.

    In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). (2015)

  • [3] Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. In: Intl. Conf. on Computer Vision (ICCV). (2015)
  • [4] Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561 (2015)
  • [5] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common Objects in Context. In: Eur. Conf. on Computer Vision (ECCV). (2014)
  • [6] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition challenge. Intl. J. of Computer Vision (2015)
  • [7] Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., Oliva, A.:

    Learning deep features for scene recognition using places database.

    In: Advances in Neural Information Processing Systems (NIPS). (2014)
  • [8] Brostow, G.J., Fauqueur, J., Cipolla, R.: Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters 30(2) (2009) 88–97
  • [9] Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The KITTI dataset. Intl. J. of Robotics Research (2013)
  • [10] Chen, Q., Huang, J., Feris, R., Brown, L.M., Dong, J., Yan, S.: Deep domain adaptation for describing people based on fine-grained clothing attributes. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). (2015)
  • [11] Tzeng, E., Hoffman, J., Darrell, T., Saenko, K.: Simultaneous deep transfer across domains and tasks. In: Intl. Conf. on Computer Vision (ICCV). (2015)
  • [12] Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
  • [13] Kundu, A., Li, Y., Dellaert, F., Li, F., Rehg, J.M.: Joint semantic segmentation and 3D reconstruction from monocular video. In: Eur. Conf. on Computer Vision (ECCV). (2014)
  • [14] Ros, G., Ramos, S., Granados, M., Bakhtiary, A., Vázquez, D., López, A.M.: Vision-based offline-online perception paradigm for autonomous driving. In: Winter Conference on Applications of Computer Vision (WACV). (2015)
  • [15] N. Silberman, D.H., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: Eur. Conf. on Computer Vision (ECCV). (2012)
  • [16] Handa, A., Patraucean, V., Badrinarayanan, V., Stent, S., Cipolla, R.: Understanding real world indoor scenes with synthetic data. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). (2016)
  • [17] Tighe, J., Lazebnik, S.: Superparsing: Scalable nonparametric image parsing with superpixels. In: Eur. Conf. on Computer Vision (ECCV). (2010)
  • [18] Sengupta, S., Greveson, E., Shahrokni, A., Torr, P.H.S.: Urban 3D semantic modelling using stereo vision. In: IEEE Intl. Conf. on Robotics and Automation (ICRA). (2013)
  • [19] Hu, H., Munoz, D., Bagnell, J.A., Hebert, M.: Efficient 3-D scene analysis from streaming data. In: IEEE Intl. Conf. on Robotics and Automation (ICRA). (2013) 2297–2304
  • [20] Kohli, P., Ladický, L., Torr, P.H.S.: Robust higher order potentials for enforcing label consistency. Intl. J. of Computer Vision 82(3) (2009) 302–324
  • [21] Ladický, L., Sturgess, P., Alahari, K., Russell, C., Torr, P.H.S.: What, where and how many? Combining object detectors and CRFs. In: Eur. Conf. on Computer Vision (ECCV). (2010) 427–437
  • [22] Valentin, J.P.C., Sengupta, S., Warrell, J., Shahrokni, A., Torr, P.H.S.: Mesh based semantic modelling for indoor and outdoor scenes. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). (2013)
  • [23] Álvarez, J.M., Gevers, T., LeCun, Y., López, A.M.: Road scene segmentation from a single image. In: Eur. Conf. on Computer Vision (ECCV). (2012)
  • [24] Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). (2014)
  • [25] Chen, L.C., Schwing, A., Yuille, A., Urtasun, R.: Learning deep structured models.

    In: Intl. Conf. on Machine Learning (ICML). (2015)

  • [26] Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., Torr, P.H.S.:

    Conditional random fields as recurrent neural networks.

    In: Intl. Conf. on Computer Vision (ICCV). (2015)
  • [27] Dai, J., He, K., Sun, J.: BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In: Intl. Conf. on Computer Vision (ICCV). (2015)
  • [28] Papandreou, G., Chen, L.C., Murphy, K., Yuille, A.L.:

    Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation.

    In: Intl. Conf. on Computer Vision (ICCV). (2015)
  • [29] Everingham, M., Eslami, S.M.A., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The pascal visual object classes challenge: A retrospective. Intl. J. of Computer Vision 111 (2015)
  • [30] Cordts, M., Omran, M., Ramos, S., Scharwächter, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset. In: CVPR Workshop on The Future of Datasets in Vision. (2015)
  • [31] Ros, G., Sellart, L., Materzynska, J., Vazquez, D., Lopez, A.: SYNTHIA: A large collection of synthetic images for semantic segmentation of urban scenes. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). (2016)
  • [32] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). (2016)
  • [33] Ba, L.J., Caruana, R.: Do deep nets really need to be deep? In: Advances in Neural Information Processing Systems (NIPS). (2014)
  • [34] Choromanska, A., Henaff, M., Mathieu, M., Ben Arous, G., LeCun, Y.: The loss surfaces of multilayer networks.

    In: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics. (2015) 192–204

  • [35] Buciluă, C., Caruana, R., Niculescu-Mizil, A.: Model compression. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM (2006) 535–541
  • [36] Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: Fitnets: Hints for thin deep nets. In: Intl. Conf. on Learning Representations (ICLR). (2015)
  • [37] Han, S., Mao, H., Dally, W.J.: Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR (2015)
  • [38] Iandola, F.N., Moskewicz, M.W., Ashraf, K., Han, S., Dally, W.J., Keutzer, K.: SqueezeNet: AlexNet-level accuracy with 50 fewer parameters and 1MB model size. arXiv preprint arXiv:1602.07360 (2016)
  • [39] Brostow, G.J., Shotton, J., Cipolla, R.: Segmentation and recognition using structure from motion point clouds. In: Eur. Conf. on Computer Vision (ECCV). (2008)
  • [40] Russell, B.C., Torralba, A., Murphy, K.P., Freeman, W.T.: LabelMe: a database and web-based tool for image annotation. Intl. J. of Computer Vision 77(1–3) (2008) 157–173
  • [41] Bileschi, S.: CBCL StreetScenes challenge framework. http://cbcl.mit.edu/software-datasets/streetscenes/ (2007)
  • [42] Ess, A., Leibe, B., Schindler, K., Gool, L.V.: Robust multi-person tracking from a mobile platform. IEEE Trans. Pattern Anal. Machine Intell. 31(10) (2009) 1831–1846
  • [43] Houben, S., Stallkamp, J., Salmen, J., Schlipsing, M., Igel, C.: Detection of traffic signs in real-world images: The German Traffic Sign Detection Benchmark. In: International Joint Conference on Neural Networks. Number 1288 (2013)
  • [44] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: Intl. Conf. on Learning Representations (ICLR). (2015)
  • [45] Zeiler, M.D., Krishnan, D., Taylor, G.W., Fergus, R.: Deconvolutional networks. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). (2010)
  • [46] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Intl. Conf. on Machine Learning (ICML). (2015)
  • [47] Chatfield, K., Simonyan, K., Vedaldi, A., Zisserman, A.: Return of the devil in the details: Delving deep into convolutional networks. In: British Machine Vision Conf. (BMVC). (2014)
  • [48] He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: Intl. Conf. on Computer Vision (ICCV). (2015)
  • [49] Le, Q.V., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., Ng, A.Y.: On optimization methods for deep learning. In: Intl. Conf. on Machine Learning (ICML). (2011)
  • [50] Vazquez, D., Lopez, A., Ponsa, D., Marin, J.:

    Cool world: domain adaptation of virtual and real worlds for human detection using active learning.

    In: NIPS Domain Adaptation Workshop: Theory and Application. (2011)
  • [51] Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbaş, C., Golkov, V., Smagt, P.V., Cremers, D., Brox, T.: Flownet: Learning optical flow with convolutional networks. In: Intl. Conf. on Computer Vision (ICCV). (2015)
  • [52] Zamir, A.R., Shah, M.: Image geo-localization based on multiplenearest neighbor feature matching usinggeneralized graphs. Pattern Analysis and Machine Intelligence, IEEE Transactions on 36(8) (2014) 1546–1558
  • [53] Zamir, A., Shah, M.: Image geo-localization based on multiple nearest neighbor feature matching using generalized graphs. (2014)