A large amount of labeled data is required to train deep neural networks. The manual annotation process is laborious and time-consuming, especially for complex vision tasks like object detection, pose estimation or instance segmentation. Furthermore, a costly sensor setup is required to gather annotations for tasks like depth estimation. The use of computer graphics as a source of synthetic data is an attractive alternative, as the annotations are essentially free. However, training a learner with synthetic data results in the problem of domain gap, where the training data (source domain) differs from the test data (target domain). The well-studied paradigm of domain adaptation (DA) and the recently proposed domain randomization (DR) are two popular solutions to this problem. However, DA assumes a certain degree of information about the target domain, for example unlabelled samples in the case of unsupervised domain adaptation. In the absence of such target samples, DR is an effective approach. DR compensates for the lack of information by assuming access to a sufficiently accurate simulator capable of synthesizing data. The key idea is to train on synthesized data with enough variations such that target data is perceived as just another variation by the learner.
Figure 1: (a) Object Detection; (b) Depth Estimation. We show (top left) the object spawn probability learned by our policy, (top right) domain-randomized images generated using parameters sampled from the policy, (bottom left) a real input, and (bottom right) the output of a model trained only on synthetic data, for object detection and depth estimation. Our approach encourages the generation of hard examples, such as occluded and truncated objects for object detection and small objects for depth estimation.
A natural question therefore arises: how accurate does the simulator have to be for DR to work? This is in fact one of the key underlying assumptions made by DR – that invariants of the target domain such as shape, size and type of object are already contained in the simulator by design. For example, when one is training a car detector from synthetic data, the synthetic data generator must contain cars. While this point may seem obvious and inconsequential, it is important to point out that with a truly arbitrary simulator, randomization is bound to fail (i.e., it is unlikely to randomly generate the data that you need for your task). In this work, we establish a theoretical framework for analyzing such assumptions and characteristics of DR, and we use it to answer questions like: What differentiates DR from DA? When, where and why is DR effective? Can DR be used together with DA?
Along with generating free and complex annotations, using a simulator has the advantage of synthesizing challenging scenarios for a learner. For example, small, occluded and significantly truncated objects are some of the hard instances for the task of object detection. This is useful because such examples can be notoriously hard to observe, making both training and evaluation of existing systems difficult: 6 out of 20 classes in Cityscapes account for 90% of the annotated pixel mass, and the Caltech pedestrian detection benchmark contains comparatively few cyclist examples. However, we argue that DR fails to fully utilize this capability of a simulator. DR uniformly samples from a space of rendering parameters, which often leads to a redundant set of easy synthetic data for the learner. Furthermore, uniform sampling in the rendering parameter space does not guarantee uniform exploration in the image space. For example, when generating a car, uniform sampling of rendering parameters may result in many images of the same car under slight variations in lighting conditions. This is not only a waste of valuable compute, but also results in insufficient training on the hard examples. Thus, we motivate the need for a more systematic sampling strategy to ensure better performance with a tractable number of synthetic samples.
We address these concerns by proposing the Visual Adversarial Domain Randomization and Augmentation (VADRA) framework, which randomizes the examples in an adversarial way to improve performance in the target domain by generating hard examples. Figure 1 shows the results of VADRA for the tasks of object detection and depth estimation. We give control of a part of the rendering space to a policy, for example where to spawn an object in the scene. The policy is trained using reinforcement learning and is encouraged to generate hard examples with respect to the learner. We visualize the object spawn probability map learned by the policy for detection and depth estimation as a heatmap, where we notice an increased likelihood for regions far from the camera, resulting in harder examples.
We also extend our framework to incorporate unsupervised domain adaptation when unlabelled target data is available. Unlike traditional adaptation approaches, we share the 'adaptation workload' equally between the simulator and the learner. In summary, the contributions of our work are as follows:
Theoretical analysis: We present a theoretical perspective on the effectiveness of domain randomization. Our analysis shows that, contrary to popular belief, domain adaptation and domain randomization are complementary techniques and can be used together effectively.
Visual Adversarial Domain Randomization and Augmentation: We propose a novel algorithm to maximize the utility of using a simulator for generating annotated data. Our data generation approach specifically focuses on hard examples with respect to a supervised learner. This results in an effective traversal of an essentially infinite space of rendering parameters.
Evaluations on diverse tasks: We benchmark our approach on various vision tasks like object classification, object detection and depth estimation, using state-of-the-art simulators on public datasets like CLEVR, Syn2Real and VIRAT.
2 Related Work
Our work is broadly related to approaches using a simulator as a source of supervised data and solutions for the reduction of domain gap.
Synthetic Data for Training Recently, with the advent of rich 3D model repositories like ShapeNet and the related ModelNet, Google 3D Warehouse, ObjectNet3D, IKEA3D, PASCAL3D+ and the increase in accessibility of rendering engines like Blender3D, Unreal Engine 4 and Unity3D, we have seen a rapid increase in the use of synthetic data for visual tasks like object classification, object detection [30, 47, 14], pose estimation [42, 22, 43], semantic segmentation [50, 40] and visual question answering. Often the source of such synthetic data is a simulator, and the use of simulators for training control policies is already a popular approach in robotics [4, 45]. SYNTHIA, GTA5, VIPER, CLEVR, AirSim and CARLA are some of the popular simulators in computer vision.
Unfortunately, despite the growing photorealism, simply training a supervised learning model on synthetic images yields disappointing results on real images due to the domain gap. The solutions addressing this problem can be broadly classified into domain adaptation and domain randomization.
Domain Adaptation: Given a source domain and a target domain, methods like [3, 7, 16, 17, 55, 28, 48] aim to reduce the gap between the feature distributions of the two domains. [3, 13, 12, 48, 17] do this in an adversarial fashion using a discriminator for domain classification, whereas [49, 27] minimize a defined distance metric between the domains. Another approach is to match statistics at the batch, class or instance level for both domains [17, 6]. Although these approaches outperform simply training on the source domain, they all rely on having access to target data, albeit unlabelled.
Domain Randomization: These methods [39, 10, 46, 19, 47, 32, 31, 44, 43, 21] do not use any information about the target domain during training and only rely on a simulator capable of generating varied data. The goal is to close the domain gap by generating synthetic data with sufficient variation that the network views real data as just another variation. The underlying assumption here is that the simulator encodes the domain knowledge about the target domain, which is often specified manually.
3 Theoretical framework for DR
In this section we establish a theoretical framework for domain randomization. Furthermore, we provide a qualitative reasoning about its effectiveness using insights on combining data from multiple sources .
3.1 Problem Setup
Let $\mathcal{X}$ and $\mathcal{Y}$ be the input and output space respectively, and let $D$ be a probability distribution defined on $\mathcal{X}$ with $f: \mathcal{X} \rightarrow \mathcal{Y}$ as the labeling function. We define the target domain as a two-tuple $\langle D_T, f_T \rangle$ consisting of the target data distribution $D_T$ and the target labeling function $f_T$. Our goal is to learn a hypothesis $h$ from a finite VC-dimensional hypothesis space $\mathcal{H}$ which closely approximates $f_T$. Given a distance metric $\Delta$ in the output space, we can frame our goal as the optimization of the following objective function:

$$h^{*} = \arg\min_{h \in \mathcal{H}} \; \mathbb{E}_{x \sim D_T}\big[\Delta\big(h(x), f_T(x)\big)\big]$$
3.2 Domain Randomization
This technique addresses the above problem when we do not assume access to the target domain but are instead given a simulator at our disposal. The simulator is a generative model capable of generating labelled data $(x, y)$. It encapsulates domain knowledge in the form of rules about the target domain and represents the target labelling function internally, i.e. $f_S \approx f_T$. Concretely, let $\Omega$ be the rendering parameter space and $P_\Omega$ be a probability distribution defined on $\Omega$. We denote the simulator as a function $S: \Omega \rightarrow \mathcal{X} \times \mathcal{Y}$ where $S(\omega) = (x, y)$.
The domain randomization algorithm randomly generates labelled data samples by sampling uniformly, i.e. $\omega_i \sim U(\Omega)$, where $U(\Omega)$ is a uniform distribution over $\Omega$. Alg. 1 represents the domain randomization algorithm: we randomly generate $m$ data samples using the simulator and train a hypothesis on the generated data.
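Algorithm 1 can be sketched as follows. This is a minimal illustration, not the authors' exact implementation: `simulator`, `train_step` and the parameter ranges are placeholders for whatever rendering setup is in use.

```python
import random

def domain_randomization(simulator, train_step, param_space, m, epochs=1):
    """Alg. 1 sketch: uniformly sample m rendering parameters, render
    labelled data, and train a hypothesis h on the result."""
    # Uniform sampling over the rendering parameter space Omega
    dataset = []
    for _ in range(m):
        omega = {k: random.uniform(lo, hi) for k, (lo, hi) in param_space.items()}
        x, y = simulator(omega)          # S(omega) -> labelled sample (x, y)
        dataset.append((x, y))
    # Supervised training of h on the generated data
    for _ in range(epochs):
        for x, y in dataset:
            train_step(x, y)
    return dataset
```

In practice the parameter space is mixed (continuous and categorical) and training is mini-batched; the uniform per-parameter sampling is the defining step of DR.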
The objective function optimized by these steps can be written as follows:

$$h^{*} = \arg\min_{h \in \mathcal{H}} \; \mathbb{E}_{\omega \sim U(\Omega)}\big[\Delta\big(h(x_\omega), y_\omega\big)\big], \quad \text{where } (x_\omega, y_\omega) = S(\omega)$$
3.3 Combining data from Multiple Sources
In this section we present a qualitative discussion based on Theorem 4 from Ben-David et al., stated here without proof for completeness.
Consider the setting where we are presented data from $N$ source domains $\{\langle D_j, f_j \rangle\}_{j=1}^{N}$. Each source domain is associated with an unknown distribution $D_j$ and labelling function $f_j$. We sample a total of $m$ labeled data points from these source domains, with $\beta_j m$ samples from each source such that $\sum_{j=1}^{N} \beta_j = 1$.

Let $\epsilon_j(h)$ be the expected loss and $\hat{\epsilon}_j(h)$ be the empirical loss of a hypothesis $h$ on the domain $\langle D_j, f_j \rangle$. We use this sampled data to train a learner using the source-weighted empirical loss $\hat{\epsilon}_\alpha(h) = \sum_{j=1}^{N} \alpha_j \hat{\epsilon}_j(h)$, with $\sum_{j=1}^{N} \alpha_j = 1$.

The objective is to use the samples from the $N$ source domains to train a model that performs well on the target domain $\langle D_T, f_T \rangle$. For simplicity, we write $\epsilon(h)$ for the expected loss on the target domain instead of $\epsilon_T(h)$.
Theorem 1. Let $\mathcal{H}$ be a hypothesis space of VC dimension $d$. Given $N$ source domains, for each $j \in \{1, \dots, N\}$ we generate a labeled sample of size $\beta_j m$ by drawing points from $D_j$ and labeling them according to $f_j$. If $\hat{h} \in \mathcal{H}$ is the empirical minimizer of $\hat{\epsilon}_\alpha(h)$ for a fixed weight vector $\alpha$ on these samples, and $h^{*} = \arg\min_{h \in \mathcal{H}} \epsilon(h)$ is the target error minimizer, then for any $\delta \in (0, 1)$, with probability at least $1 - \delta$,

$$\epsilon(\hat{h}) \leq \epsilon(h^{*}) + 4\sqrt{\Big(\sum_{j=1}^{N} \frac{\alpha_j^2}{\beta_j}\Big)\Big(\frac{2d \log(2(m+1)) + \log(4/\delta)}{m}\Big)} + \sum_{j=1}^{N} \alpha_j \big(2\lambda_j + d_{\mathcal{H}\Delta\mathcal{H}}(D_j, D_T)\big)$$

where $\lambda_j = \min_{h \in \mathcal{H}} \{\epsilon(h) + \epsilon_j(h)\}$. Here $\epsilon(h^{*})$ is the minimum error on the target domain, and $\epsilon(\hat{h})$ is the target error of a learner empirically trained using the $N$ source domains.

$\lambda_j$ represents the consistency between the labelling functions $f_j$ and $f_T$, and $d_{\mathcal{H}\Delta\mathcal{H}}(D_j, D_T)$ is the divergence between the distributions $D_j$ and $D_T$. Overall, $2\lambda_j + d_{\mathcal{H}\Delta\mathcal{H}}(D_j, D_T)$ is the distance between source domain $j$ and the target domain.
(1) Single Source: $N = 1$ and $\alpha_1 = \beta_1 = 1$. We use only one source domain to draw all $m$ labelled samples. Using Theorem 1 we have,

$$\epsilon(\hat{h}) \leq \epsilon(h^{*}) + 4\sqrt{\frac{2d \log(2(m+1)) + \log(4/\delta)}{m}} + 2\lambda_1 + d_{\mathcal{H}\Delta\mathcal{H}}(D_1, D_T)$$

(2) Multi Source: $\alpha_j = \beta_j = 1/N$ for all $j$. We draw $m/N$ labelled samples from each of the $N$ sources and weigh all the sources equally in the loss. Since $\sum_j \alpha_j^2 / \beta_j = 1$, using Theorem 1 we have,

$$\epsilon(\hat{h}) \leq \epsilon(h^{*}) + 4\sqrt{\frac{2d \log(2(m+1)) + \log(4/\delta)}{m}} + \frac{1}{N}\sum_{j=1}^{N} \big(2\lambda_j + d_{\mathcal{H}\Delta\mathcal{H}}(D_j, D_T)\big)$$
We argue that domain randomization is similar to the multi-source case with each distinct data sample being an individual source domain. The randomization process increases the variation in the data and is therefore equivalent to using an ensemble of source domains for training.
$\frac{1}{N}\sum_{j=1}^{N} d_{\mathcal{H}\Delta\mathcal{H}}(D_j, D_T)$ is the average distance of all source domains from the target domain, and the ensemble effect leads to a decrease in the variance of this distance measure, resulting in a lower upper bound on $\epsilon(\hat{h})$ for the multi-source case.
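The variance argument can be made concrete under an illustrative simplification: if the per-source distances $d_{\mathcal{H}\Delta\mathcal{H}}(D_j, D_T)$ are modelled as i.i.d. random variables with variance $\sigma^2$, averaging over $N$ sources shrinks the variance of the aggregate distance term:

$$\operatorname{Var}\!\Big(\frac{1}{N}\sum_{j=1}^{N} d_{\mathcal{H}\Delta\mathcal{H}}(D_j, D_T)\Big) = \frac{\sigma^2}{N} \;\longrightarrow\; 0 \quad \text{as } N \to \infty$$

Treating the distances as i.i.d. is an assumption for illustration; randomized sources are correlated in practice, but the qualitative effect of averaging is the same.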
In the case when the source data is generated using a simulator, we can assume that the labelling functions are fairly consistent with the target labeling function, i.e. $\lambda_j \approx 0$ for all $j$. This assumption is justified as the simulator encodes the annotation knowledge by design. Therefore the dominant term in the bound is the average divergence $\frac{1}{N}\sum_{j} d_{\mathcal{H}\Delta\mathcal{H}}(D_j, D_T)$. This quantity can be minimized if we have access to unlabelled samples from the target domain, which is the approach used in unsupervised domain adaptation by mapping the input to an intermediate domain-invariant space. Interestingly, we can still use averaging to further decrease the variance of the divergence, i.e. domain adaptation and domain randomization are complementary to each other.
Obtaining different sources depends on our data generation approach. Domain randomization employs uniform sampling over the rendering parameter space. However, with uniform sampling it is necessary to choose a large enough $m$ to ensure good performance on the target domain. Thus, we motivate the need for a cleverer sampling strategy that ensures better performance with a tractable $m$.
4 Visual Adversarial Domain Randomization and Augmentation
We model $P_\Omega$ by making a strongly pessimistic assumption about the rendering parameter distribution: the learner is trained against the worst-case distribution over rendering parameters. Under this assumption, the learned hypothesis is robust to large variations occurring in the target domain. This is especially useful when annotated target data is not available for rare scenarios.
The proposed min-max objective function is as follows:

$$\min_{h \in \mathcal{H}} \; \max_{P_\Omega} \; \mathbb{E}_{\omega \sim P_\Omega}\big[\Delta\big(h(x_\omega), y_\omega\big)\big], \quad \text{where } (x_\omega, y_\omega) = S(\omega)$$
4.1 Implementation Details
The above objective function can be formulated as a two-player zero-sum game whose Nash equilibrium corresponds to the optimal value of the objective. We proceed to find the equilibrium of the above min-max optimization by performing gradient descent directly on the actions of the two players, which we model as a generator $G$ and a learner $H$. $G$ consists of a policy $\pi$ together with the simulator $S$, where $\pi$ generates rendering parameters $\omega$ which are converted into labelled data by $S$. On the other hand, $H$ is a supervised learning model trained on the data generated by $G$.
Specifically, we maximize the objective for $G$. We use REINFORCE to obtain gradients for updating $\pi$ from an unbiased empirical estimate of the policy gradient,

$$\nabla_{\pi} J \approx \frac{1}{M} \sum_{i=1}^{M} (r_i - b)\, \nabla_{\pi} \log \pi(\omega_i)$$

where $r_i$ is the learner's loss on the $i$-th generated sample, $b$ is a baseline computed using previous rewards, and $M$ is the generated sample size. Both $G$ and $H$ are trained together adversarially according to Algorithm 2.
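A minimal sketch of the REINFORCE step for a categorical policy over rendering choices follows. The direct update of the probability vector (rather than softmax logits) and the learning rate are simplifications for illustration, not the paper's exact implementation:

```python
import numpy as np

def reinforce_update(probs, sampled, rewards, baseline, lr=0.1):
    """One REINFORCE step for a categorical policy pi (illustrative sketch).

    probs    : current categorical policy over K rendering choices
    sampled  : indices of the M choices drawn this iteration
    rewards  : learner losses used as rewards (higher = harder example)
    baseline : running mean of past rewards, for variance reduction
    """
    grad = np.zeros_like(probs)
    for i, r in zip(sampled, rewards):
        # grad of log pi(i) w.r.t. the probability vector: e_i / p_i
        grad[i] += (r - baseline) / probs[i]
    grad /= len(sampled)
    probs = probs + lr * grad            # gradient ascent on expected reward
    probs = np.clip(probs, 1e-6, None)
    return probs / probs.sum()           # project back onto the simplex
```

In practice one would parameterize logits and apply softmax so the simplex constraint is handled implicitly; the variance-reduced score-function estimator is the same.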
4.2 VADRA with Unlabelled Target Data
In the presence of unlabelled target data, we introduce a domain classifier $C$ (target labelled as 1, source labelled as 0) which takes features extracted from the penultimate layer of $H$ as input. The reward for $\pi$ is modified to additionally reward fooling $C$, with hyperparameters weighting the task and domain terms. This encourages the policy $\pi$ to fool $C$, thus generating synthetic data which looks similar to target data.
However, it is plausible that due to the simulator's design limitations we might never be able to match the target distribution. In this case, similar to domain adaptation, we also modify $H$'s loss with a domain-adversarial term weighted by hyperparameters. This formulation allows both the simulator and the task model to minimize the distance from the target domain.
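The modified policy reward can be sketched as a weighted combination of the hard-example signal and the domain-confusion signal. The convex-combination form and the names `alpha`/`beta` are assumptions for illustration; the paper's exact weighting is determined by its hyperparameters:

```python
def policy_reward(task_loss, domain_score, alpha=1.0, beta=1.0):
    """Combined reward for the policy when unlabelled target data is
    available (illustrative form only).

    task_loss    : H's loss on the generated sample (hard-example signal)
    domain_score : C's probability that the sample is real target data,
                   so a high value means the synthetic sample fools C
    """
    return alpha * task_loss + beta * domain_score
```

The policy is thus rewarded both for examples $H$ finds hard and for examples $C$ mistakes for target data.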
5 Experiments
5.1 Object Classification
We formulate a toy problem for the task of color classification with 6 colors. Here, $\omega = [\text{shape}, \text{material}, \text{color}, \text{size}]$ where shape ∈ {sphere, cube, cylinder}, material ∈ {rubber, metal}, color ∈ {red, yellow, green, cyan, blue, magenta}, and size varies over a range of scales. Random noise is added to the color, size, lighting, object position and camera position in the image.
Target Data: We generate 5000 images at a fixed resolution consisting only of spheres but with all other variations. Figure 3 shows the target data. For consistency, we ensure each variation has an equal number of images in the target data.
Simulator ($S$): We use Blender3D with the assets provided by the CLEVR dataset for data generation. The source data is generated at the same resolution and consists of all variations in the space of $\omega$. Figure 4 visualizes a random source batch.
Policy ($\pi$): $\pi$ consists of parameters each representing the probability of a possible variation in $\omega$.
Task Model ($H$): We use a ResNet18 followed by a fully connected layer as our classifier, which is trained end-to-end. The hyperparameters are reported in the supplementary material.
Results: We evaluate DR, VADRA and VADRA+DA. For a fair comparison, each iteration of DR trains on the same total number of samples as VADRA. We separately generate 1000 unlabelled target images for VADRA+DA. Figure 5 shows the target accuracy vs. training iterations, averaged over 10 independent runs.
Our initial experiments show that it is possible to achieve high accuracy on the target data just by training on small objects, which is not the case with large objects. Thus, it is useful to focus on the generation of these hard examples, which VADRA quickly does after a few iterations; it therefore does better than DR. In VADRA+DA, the domain classifier eventually learns to discriminate on the basis of the shape of the object; until then, VADRA+DA performs worse than DR. Once it does, however, $\pi$ learns to generate sphere images, which is in fact the target data, leading to a quicker boost in performance.
Target Data: We use the MS-COCO based validation split of the closed-set classification task from the Syn2Real dataset as our target data. Figure 6 shows examples of images from the target data.
Simulator ($S$): We use Blender3D along with CAD models for the 12 classes, with varying camera elevation, lighting conditions and object pose. Figure 7 shows examples of source data.
Policy ($\pi$): The policy controls which class sample will be generated. It is a multinomial distribution of size 12, initialized as a uniform distribution.
Task Model ($H$): We use a ResNet18 pretrained on ImageNet as our task model. The domain classifier $C$ is a small fully connected network accepting the 512-dimensional feature vector from $H$ as input.
Results: Table 1 shows quantitative results for DR, VADRA, DR + DA (adapting only $H$) and VADRA + DA. Our policy focuses on pairwise confusing classes in a batch, such as (1) car and bus, and (2) bike and motorbike, while decreasing samples of easier classes like plant and train. We provide visualizations of $\pi$ during training in the supplementary material.
|DR + DA||68||41||63||34||57||45||74||30||57||24||63||15||47.6|
|VADRA + DA||65||54||60||46||53||41||72||42||54||29||65||25||50.5|
5.2 Object Detection
We compare our approach against DR for the task of object detection on two surveillance scenes from VIRAT dataset .
Target Data: We evaluate our approach on 5000 images each from two VIRAT scenes at 1920×1080 resolution. The data has bounding box annotations for two kinds of foreground objects, namely person and car. For our evaluations we only use the car bounding box annotations.
Simulator ($S$): We use an existing Unreal Engine based simulator for the VIRAT dataset. The simulator models the surveillance scene in 3D using the scene geometry and camera parameters. To ensure performance on real data, we perform randomization using a texture bank of 100 textures with 10 car models, 5 person models and geometric distractors, along with varying lighting conditions, contrast and brightness. Please refer to the supplementary material for the details.
Figure 8 shows labelled samples from the VIRAT dataset generated using the simulator along with ground truth instance segmentation map.
Policy ($\pi$): In this case, $S(\omega) = (x, y)$ where $x$ is the RGB image and $y$ is a list of car bounding boxes which we extract from the instance segmentation map. $\omega$ includes a list of object attributes and lighting conditions. Each object attribute consists of class, CAD model type, pose, 3D location, 3D size, and texture.
To include a variable number of objects in the image, we randomly sample $n$, the number of objects, per image. Furthermore, we divide the ground plane of the scene into rectangular cells of equal size, where each cell $i$ is associated with an object spawning probability $p_i$ such that $\sum_i p_i = 1$. Each $p_i$ is further divided into three object class probabilities (car, person, distractor) which sum to $p_i$. The spawn map $p$ is controlled by the policy and is visited $n$ times to decide where and which object to place in the scene. The exact location inside the cell and other object attributes like texture, size, pose and model type are randomly sampled. The reward for the policy is computed per cell and is the negative of the IoU of the bounding box predicted by the model.
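The spawn-map sampling step above can be sketched as follows. The factorization into a per-cell probability and per-cell class probabilities mirrors the description; the function and argument names are placeholders, and the uniform attribute sampling inside a cell is omitted:

```python
import numpy as np

def sample_spawns(spawn_probs, class_probs, n, rng=None):
    """Sample n (cell, class) spawn decisions from the policy's spawn map.

    spawn_probs : (K,) per-cell spawn probabilities, summing to 1
    class_probs : (K, 3) per-cell class probabilities
                  (car / person / distractor), each row summing to 1
    n           : number of objects to place in the scene
    """
    rng = np.random.default_rng(rng)
    # Pick a ground-plane cell for each of the n objects
    cells = rng.choice(len(spawn_probs), size=n, p=spawn_probs)
    # Pick an object class conditioned on the chosen cell
    classes = [rng.choice(3, p=class_probs[c]) for c in cells]
    return list(zip(cells.tolist(), classes))
```

Each returned (cell, class) pair is then turned into a rendered object by sampling the remaining attributes (exact location, texture, size, pose, model type) uniformly.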
Task Model (): We use FasterRCNN  with ROI Align and ResNet101 with feature pyramid network  architecture as the backbone for object detection. Please refer to the supplementary material for the hyperparameters and the training schedule.
Results: We compare our synthetic-data-trained model with a baseline detector, referred to as COCO, trained on the MS-COCO dataset. To investigate the effect of the number of samples on performance, we evaluate models trained on 5000 and 10,000 synthetic images generated per scene: (1) DR-5k, (2) VADRA-5k, (3) DR-10k, (4) VADRA-10k.
Figure 1 visualizes the learned object spawn probabilities (summed over all object classes) as a heatmap. The policy encourages hard examples like object truncation. Table 2 shows quantitative evaluations. VADRA outperforms the COCO and DR models in the low-data regime and is especially effective on scene 2, which has severely truncated objects. However, as we generate on average about 10 instances per image, in the high-data regime of 10k images per scene DR is more effective than VADRA, because DR then has enough samples to cover the target domain. VADRA, in contrast, achieves high performance with a small number of samples. We show qualitative results from VADRA-5k in Figure 9; our model performs well under severe truncations and occlusions even with few data samples.
Table 2: Detection results per model on VIRAT Scene 1 and VIRAT Scene 2.
5.3 Depth Estimation
Similar to an appearance-invariant feature like optical flow, the depth profile of a surveillance scene can be helpful in activity recognition. We show that this is possible without installing costly depth sensors by using simulators.
Target Data: We use 5000 images each from the two VIRAT scenes, at the same resolution as before, for evaluation.
Simulator ($S$): We update the asset shader in the simulator from the previous section to generate a synthetic depth map along with the RGB images. Figure 10 shows synthetic samples generated from the simulator using VADRA. Please refer to the supplementary material for more details on data generation and the hyperparameters.
Policy ($\pi$): Similar to the object detection setup, we divide the ground plane of the 3D scene in the simulator into a rectangular grid. Each cell in the grid is associated with an object spawn probability. We do not use distractors for this task and only choose between person and car objects. The object's appearance, pose and location in the cell are sampled from a uniform distribution. We also randomly vary lighting conditions, such as the number of light sources and their orientations. The reward is again computed per cell and is the negative of the average cross-entropy loss of the task model's predictions.
Task Model ($H$): The ground truth synthetic depth map is quantized into 80 bins, and a fully convolutional network with a ResNet101 backbone is used to classify each pixel into one of the bins at a coarse output resolution. The batch size is set to 2, with an SGD optimizer and a linearly decaying learning rate. We train a separate model for each scene.
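The depth quantization can be sketched as below. The bin edges and the depth range (`d_min`, `d_max`) are assumptions for illustration; the text only specifies that 80 bins are used:

```python
import numpy as np

def quantize_depth(depth, num_bins=80, d_min=0.0, d_max=100.0):
    """Quantize a metric depth map into class bins for per-pixel
    classification (sketch; bin edges and depth range are assumed).

    Returns an integer map in [0, num_bins - 1], usable as a
    cross-entropy target for the fully convolutional classifier."""
    edges = np.linspace(d_min, d_max, num_bins + 1)
    # np.digitize against the interior edges yields bins 0..num_bins-1,
    # with out-of-range depths clamped to the first/last bin
    bins = np.digitize(depth, edges[1:-1])
    return np.clip(bins, 0, num_bins - 1).astype(np.int64)
```

Framing depth estimation as per-pixel classification over bins lets the model use a standard cross-entropy loss instead of a regression loss.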
Results: Figure 11 shows qualitative results for the task of depth estimation. $H$ very quickly learns the depth profile of the background, as it is almost constant across all the training data. $\pi$ attempts to make it harder for $H$ to predict the depth profile of the foreground objects by minimizing their pixel footprint in the image. This results in an increased spawn probability for cells away from the surveillance camera, as shown in Figure 1. As a result, our task model trained using VADRA becomes good at predicting the depth profile of small foreground objects like people.
6 Conclusion
We presented a theoretical analysis of domain randomization, giving a qualitative explanation of its effectiveness. We also proposed the VADRA algorithm, which generates hard examples in order to improve a network's performance; VADRA makes domain randomization more effective through an adversarial sampling strategy. We demonstrated the effectiveness of the proposed method on diverse tasks such as object classification, object detection, and depth estimation.
-  S. L. Arlinghaus and S. Arlinghaus. Google earth: Benchmarking a map of walter christaller. 2007.
-  S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine learning, 79(1-2):151–175, 2010.
-  K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain separation networks. In Advances in Neural Information Processing Systems, pages 343–351, 2016.
-  G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
-  A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
-  Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool. Domain adaptive faster r-cnn for object detection in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3339–3348, 2018.
-  Y. Chen, W. Li, and L. Van Gool. Road: Reality oriented adaptation for semantic segmentation of urban scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7892–7901, 2018.
-  M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
-  P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: A benchmark. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 304–311. IEEE, 2009.
-  A. Dosovitskiy and V. Koltun. Learning to act by predicting the future. arXiv preprint arXiv:1611.01779, 2016.
-  A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun. Carla: An open urban driving simulator. arXiv preprint arXiv:1711.03938, 2017.
-  Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495, 2014.
-  Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
-  H. Hattori, N. Lee, V. N. Boddeti, F. Beainy, K. M. Kitani, and T. Kanade. Synthesizing a scene-specific pedestrian detector and pose estimator for static video surveillance. International Journal of Computer Vision, 2018.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. Cycada: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213, 2017.
-  J. Hoffman, D. Wang, F. Yu, and T. Darrell. Fcns in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649, 2016.
-  S. Huang and D. Ramanan. Expecting the unexpected: Training detectors for unusual pedestrians with adversarial imposters. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, 2017.
-  S. James, A. J. Davison, and E. Johns. Transferring end-to-end visuomotor control from simulation to real world for a multi-stage task. arXiv preprint arXiv:1707.02267, 2017.
-  J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 1988–1997. IEEE, 2017.
-  R. Khirodkar, D. Yoo, and K. M. Kitani. Domain Randomization for Scene-Specific Car Detection and Pose Estimation. ArXiv:1811.05939, Nov. 2018.
-  A. Kundu, Y. Li, and J. M. Rehg. 3d-rcnn: Instance-level 3d object reconstruction via render-and-compare. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3559–3568, 2018.
-  J. J. Lim, H. Pirsiavash, and A. Torralba. Parsing ikea objects: Fine pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2992–2999, 2013.
-  T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In CVPR, volume 1, page 4, 2017.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
-  J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
-  M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. arXiv preprint arXiv:1502.02791, 2015.
-  Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim. Image to image translation for domain adaptation. arXiv preprint arXiv:1712.00479, 13, 2017.
-  S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C.-C. Chen, J. T. Lee, S. Mukherjee, J. Aggarwal, H. Lee, L. Davis, E. Swears, X. Wang, Q. Ji, K. Reddy, M. Shah, C. Vondrick, H. Pirsiavash, D. Ramanan, J. Yuen, A. Torralba, B. Song, A. Fong, A. Roy-Chowdhury, and M. Desai. A large-scale benchmark dataset for event recognition in surveillance video. IEEE Computer Vision and Pattern Recognition (CVPR), 2011.
-  X. Peng, B. Usman, K. Saito, N. Kaushik, J. Hoffman, and K. Saenko. Syn2real: A new benchmark for synthetic-to-real visual domain adaptation. arXiv preprint arXiv:1806.09755, 2018.
-  X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–8. IEEE, 2018.
-  L. Pinto, M. Andrychowicz, P. Welinder, W. Zaremba, and P. Abbeel. Asymmetric actor critic for image-based robot learning. arXiv preprint arXiv:1710.06542, 2017.
-  A. Prakash, S. Boochoon, M. Brophy, D. Acuna, E. Cameracci, G. State, O. Shapira, and S. Birchfield. Structured domain randomization: Bridging the reality gap by context-aware synthetic data. arXiv preprint arXiv:1810.10093, 2018.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
-  S. R. Richter, Z. Hayder, and V. Koltun. Playing for benchmarks. In International conference on computer vision (ICCV), volume 2, 2017.
-  S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: Ground truth from computer games. In European Conference on Computer Vision, pages 102–118. Springer, 2016.
-  G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3234–3243, 2016.
-  S. Rota Bulo, G. Neuhold, and P. Kontschieder. Loss max-pooling for semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2126–2135, 2017.
-  F. Sadeghi and S. Levine. Cad2rl: Real single-image flight without a single real image. arXiv preprint arXiv:1611.04201, 2016.
-  F. S. Saleh, M. S. Aliakbarian, M. Salzmann, L. Petersson, and J. M. Alvarez. Effective use of synthetic data for urban scene semantic segmentation. In European Conference on Computer Vision, pages 86–103. Springer, Cham, 2018.
-  S. Shah, D. Dey, C. Lovett, and A. Kapoor. Airsim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics, 2017.
-  H. Su, C. R. Qi, Y. Li, and L. J. Guibas. Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3d model views. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
-  M. Sundermeyer, Z. Marton, M. Durner, and R. Triebel. Implicit 3d orientation learning for 6d object detection from rgb images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 699–715, 2018.
-  J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke. Sim-to-real: Learning agile locomotion for quadruped robots. arXiv preprint arXiv:1804.10332, 2018.
-  Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018.
-  J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pages 23–30. IEEE, 2017.
-  J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, and S. Birchfield. Training deep networks with synthetic data: Bridging the reality gap by domain randomization. arXiv preprint arXiv:1804.06516, 2018.
-  E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), volume 1, page 4, 2017.
-  E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
-  G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev, and C. Schmid. Learning from synthetic humans. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pages 4627–4635. IEEE, 2017.
-  R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
-  M. Wrenninge and J. Unger. Synscapes: A photorealistic synthetic dataset for street scene parsing. arXiv preprint arXiv:1810.08705, 2018.
-  Y. Xiang, W. Kim, W. Chen, J. Ji, C. Choy, H. Su, R. Mottaghi, L. Guibas, and S. Savarese. Objectnet3d: A large scale database for 3d object recognition. In European Conference on Computer Vision, pages 160–176. Springer, 2016.
-  Y. Xiang, R. Mottaghi, and S. Savarese. Beyond pascal: A benchmark for 3d object detection in the wild. In Applications of Computer Vision (WACV), 2014 IEEE Winter Conference on, pages 75–82. IEEE, 2014.
-  Y. Zhang, P. David, and B. Gong. Curriculum domain adaptation for semantic segmentation of urban scenes. In The IEEE International Conference on Computer Vision (ICCV), volume 2, page 6, 2017.