Recently, computer graphics (CG) generated data has been actively utilized to train and validate computer vision (CV) systems, especially in situations where acquiring large-scale data and ground truth is costly. Examples include many pixel-level prediction tasks such as semantic segmentation [8, 17, 16, 19], optical flow, and intrinsic images. However, the performance of CV systems trained only on simulated data is not as good as expected due to the issue of domain shift.
This problem arises because the probability distribution over parameters resulting from the simulation process may not match the distribution of the parameters describing real-world data. This can be caused by deviations of many factors, such as lighting, camera parameters, and scene geometry, from the true unknown underlying distributions. These deviations may result in poor generalization of the trained CV models to the target application domains. The term used to describe this phenomenon is 'domain-shift' or 'data-shift'.
In the classical CV literature, two alternatives to reduce domain-shift have been discussed: 1) using engineered feature spaces that achieve invariance to large variations in specific attributes such as illumination or pose, and 2) learning scene priors for the generative process that are optimized to the specific target domain. Several works designed or transferred representations from virtual domains that are quasi-invariant to domain shift, for instance, geometry or motion feature representations as well as their distributions. With the advent of automated feature learning architectures, recent works [17, 16] have demonstrated that augmenting large-scale simulated training data with a few labelled real-world samples can ameliorate domain shift. However, annotating even a few samples is expensive and laborious in many pixel-level applications such as optical flow and intrinsic images. Hence, bootstrapping generative models from real-world data is often desired but difficult to achieve due to the inherent complexities of the bootstrapping process and the need for richly annotated seed data along with meta-information such as camera parameters, geographic information, etc.
Recently, advances in the field of unsupervised generative learning, popularly known as generative adversarial networks (GANs), propose to use unlabelled samples from a target domain to progressively obtain better point estimates of the parameters of generative models by minimizing the discrepancy between the generative and target distributions in the feature space of a deep, discriminatively-trained classifier. Here, we propose to use and evaluate the ability of this adversarial approach to tune scene priors in the context of CG-based data generation.
In the traditional GAN approach, neural networks are used both for the generative model and the discriminative model [9, 15]. Our paper focuses on the iterative estimation of the posterior density over the parameters of the prior distributions of a generative graphical model via: 1) generation of virtual samples given a starting prior, 2) estimation of the conditional class probabilities of labeling a given virtual sample as real data using a discriminative classifier network, 3) mapping these conditional class probabilities to estimates of the class-conditional probabilities of labeling data as real given the parameters of the generative model, and finally, 4) performing a Bayesian update to estimate the posterior density over the parameters describing the prior. This is done within a rejection sampling framework. Initially, we assume uniform distributions as priors on the parameters of the generative scene model. As iterations proceed, the uniform prior distributions get updated to distributions that are closer to the unknown prior distributions of the target data. Please see Fig 1 for a schematic flow of our adversarial tuning procedure.
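The four steps above can be sketched in a few lines of Python. This is a minimal illustrative sketch, not the paper's implementation: the renderer and discriminator are replaced by a hypothetical stand-in likelihood function so that the sampling and Bayesian-update steps can be shown on a tabulated prior.

```python
import random

def sample_params(prior_table, rng=random):
    """Step 1: draw a scene parameter value from the current prior P(theta)."""
    values, weights = zip(*prior_table.items())
    return rng.choices(values, weights=weights)[0]

def bayes_update(prior_table, likelihood):
    """Step 4: posterior P(theta | real) proportional to P(real | theta) P(theta)."""
    posterior = {t: likelihood(t) * p for t, p in prior_table.items()}
    z = sum(posterior.values())
    return {t: p / z for t, p in posterior.items()}

# Toy run: a binary "light intensity" parameter with a uniform prior, where
# the stand-in discriminator (steps 2-3) finds value 1 more realistic.
prior = {0: 0.5, 1: 0.5}
toy_likelihood = lambda t: 0.2 if t == 0 else 0.8
posterior = bayes_update(prior, toy_likelihood)  # -> {0: 0.2, 1: 0.8}
```

In the actual pipeline the likelihood would come from the classifier's probabilities over rendered samples, and the loop would repeat until the posterior stabilizes.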
More specifically, we use a parametric generative 3D scene model, which is a graphical model with scene semantics. This makes it possible to generate semantic annotations along with image data by using an off-the-shelf graphics rendering method. The model exploits existing 3D CAD models of objects and implements intra-object variations. It is parametrized by several variables, including 1) light variables: intensity, spectrum, position of the light source, and weather scattering parameters; 2) geometry variables: object co-occurrences and spatial alignments; 3) camera parameters: position and orientation of the camera.
Paper organization: We will first review some of the related concepts and works in Section 2. Section 3 introduces our generative model and adversarial training approach to tune the model’s parameters. Our experiments in Section 4 compare the model’s properties before and after adversarial training. This includes comparing data statistics and generalization of vision systems against real world data. Finally we conclude in Section 5 by describing future directions.
Our work builds upon several recent advances in the fields of computer graphics, which aim to automatically generate configurations of 3D objects from individual 3D CAD models, and unsupervised generative learning, which aims to fit generative models to a given unlabeled dataset from a target domain. Here, we summarize related work and concepts that are relevant to our work.
2.1 Scene Generative Models
Automatic scene generation has been a goal both within CG and CV. The optimal spatial arrangement of randomly selected 3D CAD models according to a cost function is a well-studied problem in the field of CG. Simulated annealing based optimization of scene layouts has been applied to specific domains such as the arrangement of furniture in a room, generating furniture arrangements that obey specific feasibility constraints such as spatial relationships and visibility. Similarly, an interactive indoor layout system has been proposed, built on top of reversible-jump MCMC (Markov chain Monte Carlo), that recommends different layouts by sampling from a density function incorporating layout guidelines. Factor potentials have been used to incorporate several constraints, for example, that furniture does not overlap, that chairs face each other in seating arrangements, and that sofas are placed with their backs against a wall.
Similarly, in the aerial image understanding literature, several spatial processes have been used to infer 3D layouts [12, 21]. Ample literature describes the constraints that characterize pleasing design patterns, such as spatial exclusion and mutual alignment. Inspired by these works, we view city layouts as point fields that are associated with marks, i.e., attributes such as type, shape, scale, and orientation. Hence, we use a stochastic spatial process called a Marked Point Process, which is coupled with 3D CAD models and used to synthesize geometric city layouts. Spatial relations and mutual alignments are encoded using Gibbs potentials between marks.
2.2 Graphics for Vision
Due to the need for large-scale annotated datasets, e.g. in the automotive setting, several attempts have utilized existing CAD city models, racing games [16, 19], or probabilistic scene models for annotated data generation, and naturalistic scenes have even been used to investigate properties of the human visual system. In the context of pedestrian detection, earlier work demonstrated domain adaptation methods by exploring several ways of combining a few real-world pedestrian samples with many synthetic samples from Half-Life game environments. Other authors introduced a fully annotated synthetic video dataset based on a virtual cloning method that takes richly annotated video as input seed. More recently, several independent research groups [17, 16, 19] demonstrated that augmenting a large collection of virtual samples with a few labelled real-world samples could reduce domain-shift. In our work, we address the question of how far one can go without the need for labelled real-world samples. We use unlabelled data from a target domain and estimate the scene prior distributions of the generative model whose samples are adversarial to the classifier.
Our approach to tuning a generative model to given target data is shown in Fig 1. We summarize the key steps below:
The generative model G has a set of parameters related to different scene attributes such as geometry and photometry.
A renderer takes these parameters, sampled from the prior distributions, and outputs image data.
The discriminator D, a standard convolutional network, is trained using gradient descent to classify data originating from the target domain and from G as being either real or generated. D outputs a scalar probability, which is trained to be high if the input was real and low if the data were generated by G.
The probabilities for all simulated samples are used to estimate the likelihood of the generated data being labelled real given the parameter values.
This likelihood is then used to update our prior distributions, which are used in the next iteration.
We now describe the details of the components used in this process.
3.1 Probabilistic Scene Generation
Probabilistic scene models deal with several attributes of a scene that are relevant for the target domain. One can divide these attributes into 1) geometry, 2) photometry, and 3) dynamics. However, we skip the modeling of scene dynamics in this work, as we only consider static images and aim to use publicly available large-scale 3D CAD repositories such as Google's SketchUp 3D Warehouse. Hence, in our generative scene model we consider modeling scene layouts with CAD models and photometric parameters.
Scene geometry: We designed a 3D scene geometry layout model that is based on Marked Poisson Processes coupled with 3D CAD object models. It considers objects as points in a world coordinate system, with their attributes, such as object class, position, orientation, and scale, as marks associated with them. The points are sampled from a probabilistic point process, and the marks are sampled from another set of conditional distributions, such as distributions on bounding box sizes and orientations given the object type. 3D CAD models are randomly imported from our collection, with a few samples shown in Fig 2, and placed in the sampled scene layouts. The camera is linked to a random car at a height that is uniformly distributed around a mean height.
In sampling from the world models one can assume statistical independence between the marks of the point process for simplicity. Such scene states are likely to generate objects with spatial overlaps, which are physically improbable. Hence, some inter-dependencies between marks, such as spatial non-overlap, co-occurrence, and coherence among instances of object classes, are incorporated with the help of Gibbs potentials. In such cases, the resulting point process is called a Gibbs point process, and the density of object layouts is formulated using the Gibbs equation f(x) ∝ exp(−U(x)), where the energy U(x) introduces prior knowledge on the object layouts by taking into account pairwise interactions between the objects. This allows encoding strong structural information by defining complex and specific interactions such as interconnections or mutual alignments of objects [12, 20]. However, such constraints result in extended computational times when sampling scene states. To avoid these problems, we limit the interactions to the ones essential for obtaining a general model of non-overlapping objects and for constraining road angles. Strong structural information can then be introduced in a subsequent post-processing step that connects objects. The non-overlap prior can be expressed using the term w0 · A(u, v), where A(u, v) takes on values in the interval [0, 1] and quantifies the relative mutual overlap between objects u and v, and w0 is a large positive real value, which strongly penalizes large overlaps. For small overlaps between two objects this prior will only weakly penalize the global energy. But if the overlap is high, this prior acts as a hard constraint, strongly affecting the overall energy.
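A minimal sketch of this pairwise non-overlap potential, under the simplifying assumption that object footprints are axis-aligned 2D boxes (the paper's layouts are 3D; box coordinates and the weight name w0 are illustrative):

```python
import math

def relative_overlap(a, b):
    """Relative mutual overlap in [0, 1] between boxes given as (x0, y0, x1, y1)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # intersection width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # intersection height
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return (ix * iy) / min(area(a), area(b))

def layout_energy(boxes, w0=1000.0):
    """Gibbs energy summing w0 * A(u, v) over all object pairs: small overlaps
    are weakly penalized, large overlaps act as a near-hard constraint."""
    e = 0.0
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            e += w0 * relative_overlap(boxes[i], boxes[j])
    return e

# Unnormalized layout density f(x) proportional to exp(-U(x))
density = lambda boxes: math.exp(-layout_energy(boxes))
```

Disjoint layouts get energy zero (density one), while heavily overlapping layouts get a density that is effectively zero, which is what makes rejection-style sampling of valid layouts practical.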
Scene photometry: In addition to the above geometry parameters, we also model 1) the light source (the sun) with its extrinsic (position and orientation) as well as intrinsic parameters (intensity and color spectrum), 2) weather scattering parameters (particle density and scattering coefficient), and 3) camera extrinsic parameters such as orientation and field-of-view. These models are implemented through the Python scripting interface to an open-source graphics platform, BLENDER. A Monte-Carlo path tracer is used to render the scenes as images along with annotations, if required. Please see the supplementary material for details. A schematic graphical model is shown in Fig 2 along with a few samples of the CAD object models used in this work.
As shown in Fig 2, our generative model is a physics-based parametric model whose inputs are a set of scene variables such as lighting, weather, geometry, and camera parameters. We assume that all these parameters are statistically independent of each other, which provides the least expensive option for modeling and sampling. One can model dependencies using distributions on these parameters based on an expert's knowledge of a target domain or on additional knowledge such as atmospheric optics, geographic and demographic studies. However, in the absence of priors, we use uniform distributions over their permissible ranges. For instance, the light source's intensity in BLENDER is modeled as a uniform distribution, where a low intensity level can correspond to night while a high level corresponds to lighting at noon. With these settings our model was able to render physically plausible and visually realistic images. This scene model was used in our previous work, which is provided as supplementary material. The performance of a vision model trained on simulated data to perform semantic segmentation was quite good on real-world data. Yet, data-shift was observed due to deviations between the scene generation statistics and the target real-world domain. Hence, in the present work, we focus on the task of matching the generative statistics to those of real-world target data such as, for instance, CityScapes. Some samples rendered with this initial setting are shown in Fig 4(b).
3.3 Sampling and Rendering
Although sampling from the prior is easy initially, it eventually becomes harder as the prior gets updated iteratively through Bayesian updates. The reason is that we do not have conjugate relationships between the classifier's probabilities and the prior, so these intermediate probability functions lose their easy-to-sample-from structure. Hence, we use a rejection sampling scheme due to its scalability. In general, an open issue in the use of rejection sampling schemes is to come up with an optimal scaling factor, which results in a proposal distribution that is an envelope of the complicated distribution we want to sample from. This issue does not arise in our case, as our initial uniform distributions can act as envelopes for all intermediate distributions if these are not re-normalized. However, this increases the probability of rejecting samples, so generating samples becomes computationally more expensive with the number of iterations. We solve this issue by normalizing the intermediate probability tables with their respective maximum values. Corresponding labels are obtained through annotation shaders, which we implemented in Blender. An image sample with corresponding labels is shown in Fig 3. The details about our rendering choices and their impact on the semantic segmentation results can be found in the supplementary material.
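The max-normalization trick can be illustrated on a tabulated distribution. This is a hypothetical sketch (discrete parameter values, fixed seed), not the paper's renderer-side code: dividing the table by its maximum value keeps the uniform proposal a valid envelope while keeping the acceptance rate from collapsing.

```python
import random

def rejection_sample(prob_table, n, rng=None):
    """Draw n samples from a tabulated, unnormalized distribution using a
    uniform proposal as the envelope."""
    rng = rng or random.Random(0)
    values = list(prob_table)
    peak = max(prob_table.values())
    scaled = {v: p / peak for v, p in prob_table.items()}  # max scales to 1
    samples = []
    while len(samples) < n:
        v = rng.choice(values)         # uniform proposal (the envelope)
        if rng.random() <= scaled[v]:  # accept with probability p(v) / max p
            samples.append(v)
    return samples
```

Without dividing by the peak, the acceptance probabilities of an un-normalized intermediate table could all be small, wasting most proposals.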
3.4 Adversarial Training
In a GAN setting, the generator G is supplemented with a discriminator D, which is trained to classify samples as real versus generated. In simple terms, the output of the discriminator should be one for a real image and zero for a generated image. One can select any off-the-shelf classifier as D. However, the choice of D plays a critical role, as it measures the dissimilarity between real and generated data in the feature space that D is based on. Here we use AlexNet, a 5-layer convolutional neural net, as D.
Training D: All images are resized to a common resolution, the default input size of AlexNet's implementation in TensorFlow. This is done to speed up the training process and save memory. However, it has the disadvantage of missing the details of smaller objects such as some pedestrians and vehicles. All real images are labelled as 1, while simulated data is labeled as 0. Data augmentation techniques such as random cropping, left-right flipping, and random brightness and contrast modifiers are applied, too, including per-image whitening. 10000 epochs are used to train the classifier.
Tuning G: We now estimate the likelihood from the classification probabilities, i.e. the softmax outputs of D for all virtual samples. This is estimated using weighted Gaussian kernel density estimation (KDE): using the classifier outputs as weights, the likelihood is a weighted sum of Gaussian kernels with a fixed bandwidth, centered at the sampled parameter values. We explored the use of automated bandwidth selection methods, but in our experiments a default setting seemed to perform adequately. This KDE estimate represents the likelihood of generating samples similar to the target data for given parameter values. In a Bayesian setting, it can be used to iteratively update our prior beliefs about the parameters.
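A minimal sketch of this weighted-KDE likelihood for a scalar parameter. The symbols, bandwidth value, and normalization here are illustrative assumptions, not the paper's exact settings: each simulated sample's parameter value is weighted by the classifier's probability that the corresponding rendered image is real.

```python
import math

def weighted_kde(thetas, weights, h=0.1):
    """Return an unnormalized likelihood function over a scalar theta,
    built from Gaussian kernels of bandwidth h weighted by classifier outputs."""
    norm = h * math.sqrt(2.0 * math.pi)
    def likelihood(theta):
        return sum(
            w * math.exp(-0.5 * ((theta - t) / h) ** 2)
            for t, w in zip(thetas, weights)
        ) / norm
    return likelihood

# Samples near 0.8 looked "real" to the classifier (high weights), so the
# estimated likelihood peaks there.
lik = weighted_kde([0.2, 0.75, 0.8, 0.85], [0.1, 0.9, 0.95, 0.9])
```

Multiplying this likelihood by the current prior (and renormalizing) gives the Bayesian update used in the next iteration.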
After a number of iterations, if G and D have enough capacity, they will reach a point at which both cannot improve because the generative distribution matches the target distribution. In the limit, the discriminator is unable to differentiate between the two distributions and becomes a random classifier, i.e., it outputs 1/2 everywhere. However, we fix the maximum number of updates on the prior to 6 in the following experiments.
In this section, we provide an evaluation of our generative adversarial tuning approach in terms of the performance of a deep convolutional network (DCN) for urban traffic scene semantic segmentation. We choose a state-of-the-art DCN-based architecture as the vision system S for these experiments. As we treat S as a black box, we believe that our experimental results will be of interest to other researchers working on DCN-based applications. We selected two publicly available urban datasets to study the benefits of our approach for synthetic data generation.
Vision system (S): We select a state-of-the-art DCN-based architecture, DeepLab, as S. DeepLab is a modified version of VGG-net that operates at original image resolutions by making the following changes: 1) replacing the fully connected layers with convolutional ones, and 2) skipping the last subsampling steps and up-sampling the feature maps using atrous convolutions. This still results in a coarser map with a stride of 8 pixels. Hence, during training the targets, i.e. the semantic labels, are the ground truth labels subsampled by 8. During testing, bi-linear interpolation followed by a fully connected conditional random field (CRF) is used to obtain the final label maps. We modify the last layer of DeepLab from a 21-class to a 7-class output, with the categories: vehicle, pedestrian, building, vegetation, road, ground, and sky.
Training S: Our DeepLab models are initialized with ImageNet pre-trained weights to avoid long training times. Stochastic gradient descent and the cross-entropy loss function are used with an initial learning rate of 0.001, momentum of 0.9, and a weight decay of 0.0005. We use a mini-batch of 4 images, and the learning rate is multiplied by 0.1 after every 2000 iterations. High-resolution input images are down-sampled by a factor of 4. Training data is augmented by vertical mirror reflections and random croppings from the original-resolution images, which increases the amount of data by a factor of 4. As stopping criterion, we used a fixed number of SGD iterations (30,000) in all our experiments. In the CRF post-processing, we used fixed parameters in the CRF inference process (10 mean-field iterations with Gaussian edge potentials) in all reported experiments. The CRF parameters are optimized on a subset of 300 images randomly selected from the training set. The performance of DeepLab with different training-testing settings is tabulated in Table 1. We report the accuracy in terms of the IoU measure for each of the seven classes, along with the average per-class and global accuracies for both real datasets we used.
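The step schedule described above (initial rate 0.001, multiplied by 0.1 after every 2000 iterations) can be written as a small helper; the function name and defaults are just a sketch of the stated schedule, not the training code itself:

```python
def learning_rate(step, base=1e-3, drop=0.1, every=2000):
    """Piecewise-constant schedule: base * drop**(number of completed drops)."""
    return base * drop ** (step // every)
```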
Real-world target datasets: We used CityScapes and CamVid as target datasets, which are tailored for urban scene semantic segmentation. CityScapes was recorded on the streets of several European cities. It provides a diverse set of videos with public access to 3475 images with fine pixel-level semantic annotations. In the adversarial tuning process, we use 1000 randomly selected samples from CityScapes in each iteration to train D, and we set G to generate 1000 samples. CamVid was recorded in and around the Cambridge region in the UK. It provides 701 images along with high-quality semantic annotations. While tuning the generative model to CamVid, we randomly sample 500 samples from CamVid in each iteration and generate 500 samples from G.
It is worth highlighting the differences between these datasets. Each of them has been acquired in a different city or cities, and the camera models used are different. Due to geographical and demographic differences in weather, lighting, and object shapes, the statistics of these datasets may differ. For instance, we computed the intensity histograms over the full CityScapes and CamVid datasets, see Fig 4(d) and Fig 4(j). For better visual comparison, we normalized the histograms by their maximum frequencies. The shapes of these histograms are quite different. Similarly, the label statistics also differ, see the histograms of semantic class labels in Fig 4(f) and Fig 4(l). As quantified in Table 1, these statistical differences in the training datasets are reflected as a performance shift of DeepLab. For instance, the DeepLab model trained on the CityScapes training data performs at 67.71 IoU points on CS_val, a validation set from CityScapes, i.e. within the same domain. This performance is reduced by nearly 13 points when the validation set from CamVid is used for testing instead. Similar behavior is observed when transferring a DeepLab model trained on CamVid to CityScapes. Performance degradation when transferring from virtual to real domains is comparable. Similar observations have been made in the context of pedestrian detection with a classifier based on HOG and a linear SVM.
Virtual reality datasets: To quantify the performance changes due to adversarial tuning, we prepared three sets, simulated from the initial model and from the models tuned with the approach discussed in Section 3 to the datasets CityScapes and CamVid; we denote simulated sets with the prefix 'V'. Each set has 5000 images with several annotations, including pixel-wise semantic labels. We first compare the statistics of the simulated training sets against the target datasets used for adversarial tuning. Later in the section, we also compare the generalization of a vision system to the target dataset when it is trained on each of these sets separately, to quantify the performance shift due to adversarially tuned scene generation.
4.1 Statistics of Training sets
Though it is difficult to appreciate the performance changes due to adversarial training by visual inspection, Figures 4(b), 4(h), and 4(n) can be used to obtain insights about how the training affected pixel-level labeling. We computed histograms of pixel intensities over the full datasets generated from the initial model, our target data CityScapes, and the model tuned to CityScapes. These plots are shown in the first column of Fig 4. The structure of these histograms has moved closer to the one of CityScapes through the process of tuning. Quantitatively, the KL divergence between the virtual data and the CityScapes data has been reduced from 0.57 before tuning to 0.44 after tuning to CityScapes. A similar behavior is observed when the model is tuned to the CamVid data. Finally, we obtained similar histograms for the ground-truth labels. As with the previous comparisons, one can observe that the label statistics are closer to the real datasets after tuning, as shown in the last column of Fig 4. This evidence points to the potential usefulness of simulated datasets as virtual proxies for these real-world datasets.
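The KL-divergence comparison between intensity histograms can be sketched as follows. This is an illustrative implementation, not necessarily the exact binning or smoothing used for the 0.57/0.44 numbers above; a small epsilon guards against empty bins and log(0).

```python
import math

def kl_divergence(p_hist, q_hist, eps=1e-12):
    """KL(P || Q) between two pre-binned histograms (counts or frequencies)."""
    zp, zq = float(sum(p_hist)), float(sum(q_hist))
    p = [max(x / zp, eps) for x in p_hist]  # normalize to probabilities
    q = [max(x / zq, eps) for x in q_hist]
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

A value of zero means identical (binned) distributions; the drop from 0.57 to 0.44 reported above corresponds to the tuned model's histogram moving toward the CityScapes histogram.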
Table 1 (excerpt), covering the model tuned to CityScapes data and the model tuned to CamVid data. Row shown: V_cityscapes + 10%CS, tested on CS_val, achieves 70.01 (+2.57) IoU, with per-class IoU of 68, 60, 59, 68, 77, 69, and 89.
Notation: CS and CV refer to the real CityScapes and CamVid datasets, respectively, and the prefix 'V' denotes simulated sets.
4.2 Generalization of DeepLab
In our first set of experiments we used CityScapes as the target domain, which means that we took the validation set from CityScapes (CS_val) for testing. We compared the utility of simulated data generated from the initial model and from the model tuned to CityScapes (V_cityscapes) in terms of the generalization of the trained models to CS_val. Data from the initial model produced good results in classifying objects such as buildings, vehicles, vegetation, roads, and sky. However, pedestrians were poorly recognized due to their low frequency of occurrence and the use of low-quality (low-polygon meshes and textures) CAD models. The use of V_cityscapes, which is generated from the model tuned to real CityScapes, improved the overall performance in global IoU by 2.28 points. This time, the per-class IoU measure on the pedestrian class also improved to some extent. This may be credited to the increased number of pedestrian occurrences after tuning, which can be discerned in the last column of the bar plot of Fig 4. To measure the statistical significance of these improvements, we repeated the training-testing experiment 5 times, measuring the improvement each time and computing its mean and standard deviation.
In our second set of experiments we used CamVid as the target domain and took the validation set from CamVid for testing. We compared the utility of the simulated data generated from the initial model and from the model tuned to CamVid in terms of the generalization of the trained models to this validation set. Data from the initial model already produced good results. However, the use of data from the tuned model improved the overall performance, i.e. the global IoU, by 3.42 points. Interestingly, the DeepLab model trained on V_cityscapes also showed improved performance on the CamVid validation set, which however was not true the other way around, as seen by a degradation in performance of 6.57%. We conjecture that the high number of pedestrians and their diversity in the CityScapes set might be one of the reasons.
In the final set of experiments, we compared the results of unsupervised adversarial tuning to those of supervised domain adaptation, i.e. augmenting the simulated data with 10% labeled samples from the target domain. Clearly, supervised domain adaptation provides larger performance gains than our adversarial tuning approach. However, we note that the modest improvements from unsupervised learning described above were achieved without any labelled samples from the target domain; thus, the cost of these improvements is low by comparison. Instead of using the data simulated with the initial model, we improve performance on the corresponding validation sets by 2.57 and 1.71 IoU points, respectively, by using data from the models tuned to CityScapes and CamVid with DeepLab. This suggests that the amount of real-world labelled data required to correct for the domain-shift, in order to achieve the same level of performance as the initial data augmented with 10% labelled CityScapes samples, is reduced. A rough analysis using a linear fit to the empirical performance gains reported in Table 1 suggests that the amount of labelled real-world data needed to reach the same level of performance with the tuned data is about 9% of the training data, compared to the 10% needed with data from the initial model.
5 Conclusions and Future Work
In this work, we have evaluated an adversarial approach to tuning generative scene priors for CG-based data generation to train CV systems. To achieve this goal, we designed a parametric generative scene model, followed by an AlexNet discriminator whose output probabilities are used to update the distributions over scene parameters. Our experiments in the context of urban scene semantic segmentation with DeepLab provided evidence of improved generalization of models trained on simulated data generated from adversarially tuned scene models. These improvements were on average 2.28 and 3.42 IoU points on two real-world benchmark datasets, CityScapes and CamVid, respectively.
Our current work does not vary the intrinsic attributes of objects, such as shape and texture. Instead, we used a fixed set of CAD shapes and textures as a proxy to model intra-class variations. We expect significant performance improvements in the future from expanding the set of 3D models beyond the current, relatively small and fixed set of CAD models. A possible extension is to use component-based shape synthesis models in order to learn distributions over object shapes. We plan to conduct more experiments to characterize the behavior of adversarial tuning by studying the variability in performance across simulated training and target domains. Of particular interest would be relating the performance gains to the KL-divergence between the prior distributions used for training and those of the target domains.
This work was supported by the German Federal Ministry of Education and Research (projects 01GQ0840 and 01GQ0841) and by Continental automotive GmbH.
-  http://www.blender.org/.
-  C. Alexander, S. Ishikawa, and M. Silverstein. A pattern language: towns, buildings, construction, volume 2. Oxford University Press, 1977.
-  M. Baktashmotlagh, M. T. Harandi, B. C. Lovell, and M. Salzmann. Unsupervised domain adaptation by domain invariant projection. In Proceedings of the IEEE International Conference on Computer Vision, pages 769–776, 2013.
-  G. J. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 30(2):88–97, 2009.
-  L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv preprint arXiv:1412.7062, 2014.
-  M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. arXiv preprint arXiv:1604.01685, 2016.
-  P. Fischer, A. Dosovitskiy, E. Ilg, P. Häusser, C. Hazırbaş, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. arXiv preprint arXiv:1504.06852, 2015.
-  A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. arXiv preprint arXiv:1605.06457, 2016.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
-  E. Kalogerakis, S. Chaudhuri, D. Koller, and V. Koltun. A probabilistic model for component-based shape synthesis. ACM Transactions on Graphics (TOG), 31(4):55, 2012.
-  N. Kong and M. J. Black. Intrinsic depth: Improving depth transfer with intrinsic images. In IEEE International Conference on Computer Vision (ICCV), pages 3514–3522, Dec. 2015.
-  F. Lafarge, G. Gimel'Farb, and X. Descombes. Geometric feature extraction by a multimarked point process. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(9):1597–1609, 2010.
-  P. Merrell, E. Schkufza, Z. Li, M. Agrawala, and V. Koltun. Interactive furniture layout using interior design guidelines. In ACM Transactions on Graphics (TOG), volume 30, page 87. ACM, 2011.
-  V. Parameswaran, V. Shet, and V. Ramesh. Design and validation of a system for people queue statistics estimation. In Video Analytics for Business Intelligence, pages 355–373. Springer, 2012.
-  A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
-  S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: Ground truth from computer games. arXiv preprint arXiv:1608.02192, 2016.
-  G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3234–3243, 2016.
-  C. A. Rothkopf, T. H. Weisswange, and J. Triesch. Learning independent causes in natural images explains the spacevariant oblique effect. In Development and Learning, 2009. ICDL 2009. IEEE 8th International Conference on, pages 1–6. IEEE, 2009.
-  A. Shafaei, J. J. Little, and M. Schmidt. Play and learn: Using video games to train computer vision models. arXiv preprint arXiv:1608.01745, 2016.
-  O. Tournaire, N. Paparoditis, and F. Lafarge. Rectangular road marking detection with marked point processes. In Proc. conference on Photogrammetric Image Analysis, 2007.
-  A. Utasi and C. Benedek. A 3-d marked point process model for multi-view people detection. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 3385–3392. IEEE, 2011.
-  D. Vazquez, A. M. Lopez, J. Marin, D. Ponsa, and D. Geroimo. Virtual and real world adaptation for pedestrian detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(4):797–809, 2014.
-  Y.-T. Yeh, L. Yang, M. Watson, N. D. Goodman, and P. Hanrahan. Synthesizing open worlds with constraints using locally annealed reversible jump mcmc. ACM Transactions on Graphics (TOG), 31(4):56, 2012.
-  L. F. Yu, S. K. Yeung, C. K. Tang, D. Terzopoulos, T. F. Chan, and S. J. Osher. Make it home: automatic optimization of furniture arrangement. ACM Transactions on Graphics (TOG), 30(4):86, 2011.