Adversarially Tuned Scene Generation

01/02/2017 ∙ by V S R Veeravasarapu, et al. ∙ 0

Generalization performance of trained computer vision systems that use computer graphics (CG) generated data is not yet effective due to the concept of 'domain-shift' between virtual and real data. Although simulated data augmented with a few real world samples has been shown to mitigate domain shift and improve transferability of trained models, guiding or bootstrapping the virtual data generation with the distributions learnt from target real world domain is desired, especially in the fields where annotating even few real images is laborious (such as semantic labeling, and intrinsic images etc.). In order to address this problem in an unsupervised manner, our work combines recent advances in CG (which aims to generate stochastic scene layouts coupled with large collections of 3D object models) and generative adversarial training (which aims train generative models by measuring discrepancy between generated and real data in terms of their separability in the space of a deep discriminatively-trained classifier). Our method uses iterative estimation of the posterior density of prior distributions for a generative graphical model. This is done within a rejection sampling framework. Initially, we assume uniform distributions as priors on the parameters of a scene described by a generative graphical model. As iterations proceed the prior distributions get updated to distributions that are closer to the (unknown) distributions of target data. We demonstrate the utility of adversarially tuned scene generation on two real-world benchmark datasets (CityScapes and CamVid) for traffic scene semantic labeling with a deep convolutional net (DeepLab). We realized performance improvements by 2.28 and 3.14 points (using the IoU metric) between the DeepLab models trained on simulated sets prepared from the scene generation models before and after tuning to CityScapes and CamVid respectively.



There are no comments yet.


page 4

page 6

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Recently, computer graphics (CG) generated data has been actively utilized to train and validate computer vision (CV) systems, especially, in situations where acquiring large scale data and groundtruth is costly. Examples are many pixel level prediction tasks such as semantic segmentation [8, 17, 16, 19], optical flow [7] and intrinsic images [11] etc. However, the performance of CV systems when they are trained only on simulated data is not as good as expected due to the issue of domain shift [17]

. This problem is due to the fact that the probability distribution over parameters resulting from the simulation process,

, may not match those parameters describing real-world data, . This can be caused by many factors such as deviations in lighting, camera parameters, scene geometry and many others from the true unknown underlying distributions . These deviations may result in poor generalization of the trained CV models to the target application domains. The term used to describe this phenomenon is ’domain-shift’ or ’data-shift’.

In the classical CV literature, two alternatives to reduce domain-shift have been discussed: 1) Using engineered feature spaces that achieve invariance to large variations in specific attributes such as illumination or pose, and 2) learning of scene priors for the generative process that are optimized to the specific target domain. Several works designed [3] or transferred the representations from virtual domains that are quasi invariant to domain shift, for instance, geometry or motion feature representations as well as their distributions (see for instance [14]). With the advent of automated feature learning architectures, recent works [17, 16] have demonstrated that augmenting large scale simulated training data with a few labelled real-world samples can ameliorate domain shift. However, annotating even a few samples is expensive and laborious in many pixel level applications such as optical flow and intrinsic images. Hence, bootstrapping generative models from real-world data is often desired but difficult to achieve due to its inherent complexities in the bootstrapping process and the need for richly annotated seed data along with meta-information such as camera parameters, geographic information, etc. [8].

Recently, advances in the field of unsupervised generative learning, i.e. Generative Adversarial Training [9], popularly known as generative adversarial networks (GANs), propose to use unlabelled samples from a target domain to progressively obtain better point estimates of parameters in generative models by minimizing the discrepancy between generative and target distributions in the space of a deep discriminatively-trained classifier. Here, we propose to use and evaluate the ability of this adversarial approach to tune scene priors in the context of CG based data generation.

In the traditional GAN approach neural networks are used both for the generative model and the discriminative model

[9, 15]). Our paper focuses on the iterative estimation of the posterior density over parameters describing prior distributions over parameters, , for a generative graphical model via: 1) generation of virtual samples given a starting prior, 2) estimation of conditional class probabilities of labeling a given virtual sample as real data using a discriminative classifier network , 3) mapping these conditional class probabilities to estimate class conditional probabilities for labeling of data as real given the parameters of the generative model , and finally, 4) doing a Bayesian update to estimate the posterior density over parameters describing the prior . This is done within a rejection sampling framework. Initially, we assume uniform distributions as priors on the parameters of the generative scene model. As iterations proceed the uniform prior distributions get updated to distributions that are closer to the unknown prior distributions of target data. Please see Fig 1 for a schematic flow of our adversarial tuning procedure.

More specifically, we use a parametric generative 3D scene model, , which is a graphical model with scene semantics. This makes it possible to generate semantic annotations along with image data by using an off-the-shelf graphics rendering method. This model exploits existing 3D CAD models of objects and implements intra-object variations. This model is parametrized by several variables including 1) Light variables: intensity, spectrum, position of light source, weather scattering parameters; 2) Geometry variables: object cooccurrences, spatial alignments; 3) Camera parameters: position and location of the camera.

Figure 1: Flow chart of adversarial tuning

Paper organization: We will first review some of the related concepts and works in Section 2. Section 3 introduces our generative model and adversarial training approach to tune the model’s parameters. Our experiments in Section 4 compare the model’s properties before and after adversarial training. This includes comparing data statistics and generalization of vision systems against real world data. Finally we conclude in Section 5 by describing future directions.

2 Background

Our work builds upon several recent advances in the fields of computer graphics, which aim to automatically generate configurations of 3D objects from individual 3D CAD models of objects, and unsupervised generative learning, which aim to train generative models to a given unlabeled dataset from a target domain. Here, we summarize related work and concepts that are relevant to our work.

2.1 Scene Generative Models

Automatic scene generation has been a goal both within CG and CV. The optimal spatial arrangement of randomly selected 3D CAD models according to a cost function is a well studied problem in the field of CG. Simulated annealing based optimization of scene layouts have been applied to specific domains such as the arrangement of furniture in a room. For instance, [24] use a simulated annealing approach to generate furniture arrangements that obey specific feasibility constraints such as spatial relationships and visibility. Similarly, [13]

propose an interactive indoor layout system built on top of reversible-jump MCMC (monte-carlo markov chain) that recommends different layouts by sampling from a density function that incorporates layout guidelines. Factor potentials are used in

[23] to incorporate several constraints, for example, that furniture does not overlap, that chairs face each other in seating arrangements, and that sofas are placed with their backs against a wall.

Similarly, in the aerial image understanding literature, several spatial processes have been used to infer 3D layouts [12, 21]. Ample literature has been describing the constraints that characterize pleasing design patterns such as spatial exclusion, mutual alignment [2]. Inspired by these works, we view city layouts as point fields that are associated with some marks, i.e. attributes such as type, shape, scale, and orientation. Hence, we use a stochastic spatial process called a Marked Point Process, which is coupled with 3D CAD models and is used to synthesize geometric city layouts. Spatial relations and mutual alignments are encoded using Gibbs potentials between marks.

2.2 Graphics for Vision

Due to the need for large scale annotated datasets, e.g. in the automotive setting, several attempts have been utilizing existing CAD city models [17], racing games [16, 19] or probabilistic scene models for annotated data generation, but naturalistic scenes have even been used to investigate properties of the human visual system [18]. In the context of pedestrian detection, some work [22] demonstrated domain adaptation methods by exploring several ways of combining a few real world pedestrian samples to many synthetic samples from H-life game environments. In [8], the authors introduced a fully annotated synthetic video dataset, based on a virtual cloning method that takes richly annotated video as input seed. More recently, several independent research groups [17, 16, 19] demonstrated that augmenting a large collection of virtual samples with few labelled real-world samples could improve domain-shift. In our work, we address the question of how far one can go without the need of labelled real world samples. We use unlabelled data from a target domain and estimate the scene prior distributions of the generative model whose samples are adversary to the classifier.

Figure 2: Graphical representation of the scene generative model and illustration of 3D CAD object models used in this work.
(a) RGB image
(b) Semantic labels
(c) Depth
(d) Surface normals
(e) Diffuse reflections
Figure 3: A rendered image sample together with corresponding pixel-level annotations.

3 Approach

Our approach to tuning a generative model to given target-data is shown in Fig 1. We summarize the key steps below:

  • The generative model has a set of parameters related to different scene attributes such as geometry and photometry.

  • A renderer takes these parameters sampled from the distributions and outputs image data .

  • The discriminator , a standard convolutional network, is trained using gradient descent to classify data originating from the target domain and as being either real or generated. outputs a scalar probability, which is trained to be high if the input was real and low if the data were generated from .

  • The probabilities for all simulated samples are used to estimate the likelihood .

  • This is then used to update our prior distributions, which will be used in the next iteration as .

We now describe the details of the components used in this process.

3.1 Probabilistic Scene Generation

Probabilistic scene models deal with several attributes for a scene that are relevant for the target domain. One can divide these attributes into 1) geometry, 2) photometry and 3) dynamics. However, we skip the modeling of scene dynamics in this work as we only consider static images and also aim to use publicly available large scale 3D CAD repositories such as Google’s sketchup 3D warehouse. Hence, in our generative scene model we consider modeling scene layouts with CAD models and photometric parameters.

Scene geometry: We designed a 3D scene geometry layout model that is based on Marked Poisson Processes coupled with 3D CAD object models. It considers objects as points in a world coordinate system and their attributes, such as object class, position, orientation, and scale as marks associated with them. These points are sampled from a probabilistic point process and the marks are sampled from another set of conditional distributions such as distributions on bounding box sizes, orientations given object type, etc. 3D CAD models are randomly imported from our collection with a few samples shown in Fig 2, and placed in sampled scene layouts. The camera is linked to a random car with a height that is uniformly distributed around a mean height of .

In sampling from the world models one can assume statistical independence between marks of the point process for simplicity. Such scene states are likely to generate objects with spatial overlaps, which are physically improbable. Hence, some inter-dependencies between marks such as spatial non-overlap, cooccurrence, and coherence among instances of object classes are incorporated with the help of Gibbs potentials. In such cases, the resulting point process is called a Poisson process [12] and the density of object layouts is formulated using the Gibbs equation: , where introduces prior knowledge on the object layouts by taking into account pairwise interactions between the objects . This allows encoding strong structural information by defining complex and specific interactions such as interconnections or mutual alignments of objects [12, 20]. However, due to computational complexities such constraints results in extended computational times in sampling scene states. To avoid these problems, we limit the interactions to the essential ones for obtaining a general model of the non-overlapping objects and constraining road angles. Strong structural information can then be introduced in a subsequent step by developing post-processing in order to connect objects. This can be expressed using the term , where takes on values in the interval and quantifies the relative mutual overlap between objects and , and is a large positive real value (in our experiments ), which strongly penalizes large overlaps. For small overlaps between two objects this prior will only weakly penalize the global energy. But if the overlapping is high, this prior will act as a hard constraint, strongly affecting overall energy.

Scene photometry: In addition to the above geometry parameters, we also model 1) the light source sun and its extrinsic (position and orientation) as well as intrinsic parameters (intensity and color spectrum), 2) weather scattering parameters (particle density and scattering coefficient), 3) camera extrinsic parameters such as orientation and field-of-view. These models are implemented through the use of python scripting interface to an open source graphics platform, BLENDER [1]. A Monte-Carlo path tracer is used to render the scenes as images along with annotations, if required. Please see the supplementary material for details. A schematic graphical model is as shown in Fig 2 along with a few samples of CAD object models used in this work.

3.2 Initialization

As shown in Fig 2

, our generative model is a physics-based parametric model whose inputs are a set of scene variables

such as lighting, weather, geometry and camera parameters. We assume that all these parameters are statistically independent of each other, which provides the least expensive option for modeling and sampling. One can model dependencies using distributions on these parameters based on an expert’s knowledge for a target domain or based on additional knowledge such as atmospheric optics, geographic and demographic studies. However, in the absence of priors, we use uniform distributions in their permissible ranges. For instance, the light source’s intensity in BLENDER is modeled as , where an intensity level of can correspond to night while corresponds to lighting at noon. With these settings our model was able to render physically plausible and visually realistic images. This scene model was used in our previous work, which is provided as supplementary material. Performance of a vision model trained to perform semantic segmentation on simulated data was quite good on real-world data. Yet, data-shift was observed due to deviations between the scene generation statistics and the target real-world domain. Hence, in the present work, we focus on the task of matching generative statistics to those of real-world target data such as for instance CityScapes [6]. Some samples rendered in this initial setting are shown in Fig (b)b.

3.3 Sampling and Rendering

Although sampling from is easy initially, it eventually becomes harder as gets updated iteratively through Bayesian updates: . The reason is that we do not have conjugate relationships between the classifier’s probabilities and . Hence, these intermediate probabbility functions lose their easy-to-sample-from structure. Hence, we use a rejection sampling scheme to sample from due to its scalability. In general, an open issue in the use of rejection sampling schemes is to come up with an optimal scaling factor , which results in a proposal distribution that is an envelope to the complicated distribution that we want to sample from. This issue does not arise in our case as our initial uniform distributions of can behave as envelopes for all intermediate s, if they are not re-normalized. However, this ends up increase the probability of rejecting many samples and therefore generating samples becomes computationally progressively more expensive with the number of iterations. We solve this issue by normalizing intermediate probability tables with their respective maximum values. Corresponding labels are obtained through annotation shaders, which we implemented in Blender. An image sample with corresponding labels are shown in Fig 3. The details about our rendering choices and their impact on the semantic segmentation results can be found in the supplementary material.

3.4 Adversarial Training

In a GAN setting, the generator is supplemented with a discriminator , which is trained to classify samples as real versus generated. In simple terms, the output of the discriminator should be one for a real image and zero for a generated image. One can select any off-the-shelf classifier as . However, the choice of plays a critical role as it measures dissimilarity between and in the feature space that is based on. Here we use AlexNet, a 5 layer convolutional neural net, as

to learn the feature space automatically as in conventional GANs. Standard stochastic gradient descent with backpropagation is used to train this net.

Training D: All images are resized to a common resolution of

, which is the default input size of AlexNet’s implementation in Tensorflow. This is done to speed-up the training process and save memory. However, this has the disadvantage of missing the details of smaller objects of some pedestrians and vehicles. All real images in

are labelled as 1, while simulated data is labeled as 0. Data augmentation techniques such as random cropping, left-right flipping, random brightness and contrast modifiers are applied, too, including per-image whitening. 10000 epochs are used to train the classifier.

Tuning G: We now estimate the quantity from the classification probabilities, i.e. the softmax outputs of for all virtual samples in

. This is estimated using weighted Gaussian kernel density estimation (KDE). Using the classifier outputs

as weights we obtain:


where a Gaussian kernel with bandwidth . In our experiments, we use . We explored the use of automated bandwidth selection methods but in our experiments a default setting seemed to perform adequately. This KDE estimate represents the likelihood of generating samples similar to for given values of . In a Bayesian setting, this can be used to update our prior beliefs about iteratively as:


After a number of iterations, if and have enough capacity, they will reach a point at which both cannot improve because . In the limit, the discriminator is unable to differentiate between the two distributions and becomes a random classifier, i.e. . However, we fix the maximum number of updates on to 6 in the following experiments.

(a) Histogram of
(b) A few samples of sampled from the model before tuning
(c) Pixel-proportions/class
(d) Histogram of CityScapes
(e) A few samples from CityScapes data
(f) Pixel-proportions/class
(g) Histogram of
(h) A few samples of sampled from the model after tuning
(i) Pixel-proportions/class
(j) Histogram of CamVid
(k) A few samples from CamVid data
(l) Pixel-proportions/class
(m) Histogram of
(n) A few samples of sampled from the model after tuning
(o) Pixel-proportions/class
Figure 4: Qualitative comparison of training sets, both simulated and real, and their statistics before and after tuning the generative model (Best viewed in color).

4 Experiments

In this section, we provide an evaluation of our generative adversarial tuning approach in terms of performance of a deep convolutional network (DCN) for urban traffic scene semantic segmentation. We choose to use a state-of-the-art DCN-based architecture as a vision system for these experiments. As we treat as a black-box, we believe that our experimental results will be of interest to other researchers using DCN-based applications. We selected two publicly available urban datasets to study the benefits of our approach for synthetic data generation.

Vision system (S): We select a state-of-the-art DCN-based architecture, i.e. DeepLab [5] as . DeepLab is an modified version of VGG-net to operate at original image resolutions, by making the following changes: 1) replacing the fully connected layers with convolutional ones, 2) skiping the last subsampling steps and up-sampling the feature-maps by using Atros

convolutions. This still results in a coarser map with a stride of 8 pixels. Hence, during training the targets, i.e. the semantic labels, are the ground truth labels subsampled by 8. During testing, bi-linear interpolation followed by a fully connected conditional random field (CRF) was used to obtain the final label maps. We modify the last layer of DeepLab from a 21-class to a 7-class, including the categories: vehicle, pedestrian, building, vegetation, road, ground, and sky.

Training S

: Our DeepLab models are initialized with ImageNet pre-trained weights to avoid longer training times. Stochastic gradient descent and the cross-entropy loss function are used with an initial learning rate of 0.001, momentum of 0.9 and a weight decay of 0.0005. We use a mini-batch of 4 images and the learning rate is multiplied by 0.1 after every 2000 iterations. High-resolution input images are down-sampled by a factor 4. Training data is augmented by vertical mirror reflections and random croppings from the original resolution images, which increases the amount of data by a factor of 4. As stopping criteria, we used a fixed number of SGD iterations (30,000) in all our experiments. In the CRF postprocessing, we used fixed parameters in the CRF inference process (10 mean field iterations with Gaussian edge potentials as described in the

[5]) in all reported experiments. The CRF parameters are optimized on a subset of 300 images randomly selected from the training set. The peformance of DeepLab with different training-testing settings is tabulated in Table 1. We report the accuracy in terms of the IoU measure for DeepLab for each of the seven classes with their average per-class and global accuracies for both real datasets we used.

Real world target datasets : We used CityScapes [6] and CamVid [4] as target datasets which are tailored for urban scene semantic segmentation. CityScapes was recorded on the streets of several European cities. It provides a diverse set of videos with public access to 3475 images with finer pixel-level annotations for semantic labels. However, in the adversarial tuning process we use 1000 randomly selected samples from CityScapes as in each iteration to train and we set to generate 1000 samples from . CamVid is recorded in and around the Cambridge region in UK. It provides 701 images along with high-quality semantic annotations. While tuning the generative model to CamVid, we randomly sample 500 samples from CamVid in each iteration and set .

It is worth highlighting the differences between these datasets. Each of them has been acquired in a different city or cities. The camera models used are different. Due to the geographical and demographical differences in weather, lighting, object shapes, the statistics of these dataset may differ. For instance, we computed the intensity histograms over full CityScapes and CamVid datasets, see Fig (d)d and Fig (j)j. For better visual comparison, we normalized the histograms with their maximum frequencies. Topologically, these histograms are quite different. Similarly, label statistics also differ, see the histograms of semantic class labels in Fig (f)f and Fig (l)l. As quantified in Table 1, these statistical differences in the training datasets are reflected as performance shift of DeepLab. For instance, the DeepLab model trained on CityScapes training data () is performing at 67.71 IoU points on , a validation set from CityScapes, i.e. within the same domain. This performance is reduced by nearly 13 points instead when the validation set from CamVid () is used for testing. Similar behavior is observed when transferring the DeepLab model from CityScapes to CamVid. Performance degradation when transferring from virtual to real domains is comparable. Similar observations can be found in [22] in the context of pedestrian detection with a classifier based on HOG and linearSVM.

Virtual reality datasets (): To quantify the performance changes due to adversarial tuning, we prepared three sets that are simulated from the initial model and the models tuned with the approach discussed in Section 3 to the datasets CityScapes and CamVid. We denote them with , and respectively. Each set has 5000 images along with several annotations along with pixel-wise semantic labels. We first compare the performance statistics of simulated training sets against the target datasets used for adversarial tuning. Later in the section, we also compare the generalizations of a vision system on the target dataset when it is trained on these sets separately to quantify the performance shift due to adversarially trained scene generation.

4.1 Statistics of Training sets

Though its difficult to appreciate significant performance changes due to adversarial training by visual inspection, Figures (b)b, (h)h, and (n)n can be used to obtain insights about how the training affected pixel-level labeling. We computed histograms of pixel intensities over the full datasets generated from the initial model, our target data CityScapes and generated the the model tuned to CityScapes . These plots are shown in the first column of Fig 4. The structure of these histogram has been moved closer to the one of CityScapes through the process of tuning. Quantitatively, the KL divergence between virtual data and CityScapes data has been reduced from 0.57 before tuning to 0.44 after tuning to CityScapes. A similar behavior is observed when the model is trained on the CamVid data. Finally we also obtained similar histograms for the ground-truth labels. As with the previous comparisons, on can observe that the label statistics are again closer to the real datasets after tuning, as shown in the last column of Fig 4. This evidence points to the potential usefulness of simulated datasets as virtual proxies for these real world datasets.

Training set Validation global vehicle pedestrian building vegetation road ground sky
Model Tuned to CityScapes data
V_init CS_val 49.86 48 53 63 51 47 34 53
V_cityscapes CS_val 52.14 (+2.28) 56 47 65 57 53 31 56
CS_train CS_val 67.71 59 57 73 64 69 64 88
V_cityscapes CV_val 50.28 (+0.43) 51 50 55 48 49 49 50
CS_train CV_val 54.42 47 43 55 69 46 51 70
Model Tuned to CamVid Data
V_init CV_val 46.42 53 38 54 35 43 39 63
V_camvid CV_val 49.85 (+3.42) 57 34 63 37 48 44 66
CV_train CV_val 67.42 77 34 65 54 98 45 99
V_camvid CS_val 39.85 (-6.57) 35 41 44 44 32 40 43
CV_train CS_val 54.28 46 43 55 69 46 51 70
Data augmentations
V_init+10%CS CS_val 67.42 60 66 52 67 74 72 81
V_cityscapes + 10%CS CS_val 70.01 (+2.57) 68 60 59 68 77 69 89
V_init+10%CV CV_val 68.85 51 61 71 67 65 77 90
V_camvid+10%CV CV_val 70.57 (+1.71) 63 57 76 73 67 74 84
Table 1: Quantitative analysis of the performance of DeepLab models with different training-testing combinations.
Notation: CS and CV refers to real CityScapes and CamVid datasets respectively, and prefix ’V’ represents simulated sets.

4.2 Generalization of DeepLab

In our first set of experiments we used CityScapes as the target domain which means that we took the validation set from CityScapes () for testing. We compared the utility of simulated data generated from the initial model and the model tuned to CityScapes () in terms of generalization of the trained models to . produced good results in classifying the objects such as building, vehicles, vegetation, roads, and sky. However, pedestrians were poorly recognized due to low frequency of occurrences and the use of low quality (low polygon meshes and textured) CAD models. However, the use of , which is generated from the model tuned to real CityScapes, improved the over-all performance on global IoU by 2.28 points. This time, the per-class IoU measure on the pedestrian class also improved to some extent. This may be credited to the increased number of occurrences after tuning. This can be discerned in the bar plot of Fig 4

, last column. To measure the statistical significance of these improvements, we repeated the training-testing experiment 5 times and measured the improvement each time. The computed mean and standard deviations are


In our second set of experiments we use CamVid as the target domain and take the validation set from CamVid for testing. We compared the utility of the simulated data generated from the initial model and the model tuned to CamVid in terms of the generalization from the trained models to . already produced good results. However, the use of improved the overall performance, i.e. the global IoU by 3.42 points. Interestingly, the DeepLab model trained on showed improved performance also on the CamVid validation set, which however was not true the other way around as seen by a degradation in performance of 6.57%. We conjecture that the high number of pedestrians and their diversity in the CityScapes set might be one of the reasons.

In the final set of experiments, we compared the results of unsupervised adversarial tuning to those of supervised domain adaptation, i.e. augmenting the simulated data with 10% labeled samples from the target domain. Clearly, supervised domain adaptation provides improved performance gains over our adversarial tuning approach. However, we note that our modest improvements using unsupervised learning described above were achieved without labelled samples from the target domain, thus, the costs for these improvements is low by comparison. Instead of using the data simulated with the initial model

, we improve performance on the corresponding validation sets by 2.57 and 1.71 IoU points respectively by using data from models tuned to and with DeepLab. This suggests that the amount of real world labelled data required to correct for the domain-shift in order to achieve the same level of performance as +10%CS is reduced. A rough analysis using a linear fit to the empirical performance gains reported in Table 1 provides the observation that the amount of labelled real world data needed to reach the same level performance with is 9% of training data compared to the 10% labeling of training data needed for .

5 Conclusions and Future Work

In this work, we have evaluated an adversarial approach to tune generative scene priors for the process of CG-based data generation to train CV systems. To achieve this goal, we designed a parametric scene generative model, followed by AlexNet whose output probabilities are used to update the distributions over scene parameters. Our experiments in the context of urban scene semantic segmentation with DeepLab provided evidence of improved generalization of models trained on simulated data generated from adversarially tuned scene models. These improvements were found to be on average 2.28% and 3.42% IoU points on two real world benchmark datasets, CityScapes and CamVid respectively.

Our current work does not vary the intrinsic attributes of objects such as shape and texture. Instead we used a fixed set of CAD shapes and textures as a proxy to model intra-class variations. We expect significant performance improvements for the future when expanding the set of 3D models from the current, relatively small and fixed set of CAD models. A possible extension is to use component-based shape synthesis models similar to [10] in order to learn distributions over object shapes. We plan to conduct more experiments to characterize the behavior of adversarial tuning by studying the variability in performance on simulated training and target domains. Of particular interest should be relating the performance gains as a function of the KL-divergence between the prior distributions used for training and those of the target domains.


This work was supported by the German Federal Ministry of Education and Research (projects 01GQ0840 and 01GQ0841) and by Continental automotive GmbH.