VADRA: Visual Adversarial Domain Randomization and Augmentation

12/03/2018 ∙ by Rawal Khirodkar, et al. ∙ Carnegie Mellon University

We address the issue of learning effectively from synthetic domain randomized data. While previous works have showcased domain randomization as an effective learning approach, it fails to challenge the learner and wastes valuable compute on generating easy examples. This can be attributed to uniform randomization over the rendering parameter distribution. In this work, we first provide a theoretical perspective on the characteristics of domain randomization and analyze its limitations. As a solution to these limitations, we propose a novel algorithm which closes the loop between the synthetic generative model and the learner in an adversarial fashion. Our framework easily extends to the scenario where unlabelled target data is available, thus incorporating domain adaptation. We evaluate our method on diverse vision tasks using state-of-the-art simulators for public datasets like CLEVR, Syn2Real, and VIRAT, where we demonstrate that a learner trained using adversarial data generation performs better than one trained using a random data generation strategy.




1 Introduction

A large amount of labeled data is required to train deep neural networks. The manual annotation process is laborious and time-consuming, especially for complex vision tasks like object detection, pose estimation or instance segmentation. Furthermore, a costly sensor setup is required to gather annotations for tasks like depth estimation. Using computer graphics as a source of synthetic data is an attractive alternative, as the annotations are essentially free. However, training a learner with synthetic data results in the problem of domain gap, where the training data (source domain) is different from the test data (target domain). The well-studied paradigm of domain adaptation (DA) and the recently proposed domain randomization (DR) are two popular solutions to this problem. However, DA assumes a certain degree of information about the target domain, for example unlabelled samples in the case of unsupervised domain adaptation. In the absence of such target samples, DR is an effective approach. DR compensates for the lack of information by assuming access to a sufficiently accurate simulator capable of synthesizing data. The key idea here is to train on synthesized data with enough variations that target data is perceived as just another variation by the learner.

Figure 1: (a) Object Detection, (b) Depth Estimation. We show (top left) object spawn probability learned by our policy, (top right) domain randomized images generated using parameters sampled from the policy, (bottom left) real input and (bottom right) model’s output trained only on synthetic data, for object detection and depth estimation. Our approach encourages generation of hard examples like occluded and truncated objects for object detection and small objects for depth estimation.

A natural question therefore arises: how accurate does the simulator have to be for DR to work? This is in fact one of the key underlying assumptions made by DR: that invariants of the target domain such as shape, size and type of object are already contained in the simulator by design. For example, when one is training a car detector from synthetic data, the synthetic data generator must contain cars. While this point may seem obvious and inconsequential, it is important to point out that with a truly arbitrary simulator, randomization is bound to fail (i.e., it is unlikely to randomly generate the data that you need for your task). In this work, we establish a theoretical framework for analyzing such assumptions and characteristics of DR, and we also use it to answer questions like: What differentiates DR from DA? When, where and why is DR effective? Can DR be used together with DA?

Along with generation of free and complex annotations, using a simulator has the advantage of synthesizing challenging scenarios for a learner. For example, small, occluded and significantly truncated objects are some of the hard instances for the task of object detection. This is useful as such examples can be notoriously hard to observe, making both training and evaluation of existing systems difficult. 6 out of 20 classes in Cityscapes [8] account for 90% of the annotated pixel mass [38], Caltech pedestrian detection benchmark [9] has cyclist examples [18]. However, we argue that DR fails to fully utilize this capability of a simulator. DR samples uniformly from a space of rendering parameters, which often leads to a redundant set of easy synthetic data for the learner [47]. Furthermore, uniform sampling in the rendering parameter space does not guarantee uniform exploration in the image space. For example, when generating a car, uniform sampling of rendering parameters may result in many images of the same car under only slight variations in lighting conditions. This is not only a waste of valuable compute, but also results in insufficient training on the hard examples. Thus, we motivate the need for a more systematic sampling strategy to ensure better performance with a tractable amount of synthetic data.

We address these concerns by proposing the Visual Adversarial Domain Randomization and Augmentation (VADRA) framework, which randomizes the examples in an adversarial way to improve performance in the target domain by generating hard examples. Figure 1 shows the results of VADRA for the tasks of object detection and depth estimation. We give control of a part of the rendering space to a policy, for example where to spawn an object in the scene. The policy is trained using reinforcement learning and is encouraged to generate hard examples with respect to the learner. We visualize the object spawn probability map learned by the policy for detection and depth estimation as a heatmap, where we notice increased likelihood for regions far from the camera, resulting in harder examples.

We also extend our framework to incorporate unsupervised domain adaptation when unlabelled target data is available. Unlike traditional adaptation approaches, we share the ’adaptation work load’ equally between the simulator and the learner. In summary, the contributions of our work are as follows:

Theoretical analysis: We present a theoretical perspective on effectiveness of domain randomization. Our analysis shows that contrary to popular belief domain adaptation and domain randomization are both complementary techniques and can be used together effectively.

Visual Adversarial Domain Randomization and Augmentation: We propose a novel algorithm to maximize the utility of using a simulator for generating annotated data. Our data generation approach specifically focuses on hard examples with respect to a supervised learner. This results in an effective traversal of an essentially infinite space of rendering parameters.

Evaluations on diverse tasks: We benchmark our approach on various vision tasks like object classification, object detection and depth estimation using state of the art simulators on public datasets like CLEVR, Syn2Real, VIRAT.

2 Related Work

Our work is broadly related to approaches using a simulator as a source of supervised data and solutions for the reduction of domain gap.

Synthetic Data for Training: Recently, with the advent of rich 3D model repositories like ShapeNet and the related ModelNet [5], Google 3D warehouse [1], ObjectNet3D [53], IKEA3D [23], PASCAL3D+ [54] and the increasing accessibility of rendering engines like Blender3D, Unreal Engine 4 and Unity3D, we have seen a rapid increase in the use of synthetic data for visual tasks like object classification [30], object detection [30, 47, 14], pose estimation [42, 22, 43], semantic segmentation [50, 40] and visual question answering [20]. Often the source of such synthetic data is a simulator, and the use of simulators for training control policies is already a popular approach in robotics [4, 45]. SYNTHIA [37], GTA5 [36], VIPER [35], CLEVR [20], AirSim [41] and CARLA [11] are some of the popular simulators in computer vision.

Unfortunately, despite the growing photorealism [52], simply training a supervised learning model on synthetic images yields disappointing results on real images due to domain gap. The solutions addressing this problem can be broadly classified into domain adaptation and domain randomization.

Domain Adaptation: Given source domain and target domain, methods like [3, 7, 16, 17, 55, 28, 48] aim to reduce the gap between the feature distributions of the two domains. [3, 13, 12, 48, 17] did this in an adversarial fashion using a discriminator for domain classification whereas [49, 27] minimized a defined distance metric between the domains. Another approach is to match statistics on the batch, class or instance level [17, 6] for both the domains. Although these approaches outperform simply training on source domain, they all rely on having access to target data albeit unlabelled.

Domain Randomization: These methods [39, 10, 46, 19, 47, 32, 31, 44, 43, 21] do not use any information about the target domain during training and rely only on a simulator capable of generating varied data. The goal is to close the domain gap by generating synthetic data with sufficient variation that the network views real data as just another variation. The underlying assumption here is that the simulator encodes the domain knowledge about the target domain, which is often specified manually [33].

3 Theoretical framework for DR

In this section we establish a theoretical framework for domain randomization. Furthermore, we provide a qualitative reasoning about its effectiveness using insights on combining data from multiple sources [2].

3.1 Problem Setup

We consider an input space and an output space, a probability distribution defined on the input space, and a labeling function that maps inputs to outputs. We define the target domain as a two-tuple consisting of the target data distribution and the target labeling function. Our goal is to learn a hypothesis, from a hypothesis space of finite VC dimension, which closely approximates the target labeling function. Given a distance metric in the output space, we can frame our goal as an optimization problem: find the hypothesis that minimizes the expected distance, under the target data distribution, between its prediction and the output of the target labeling function.
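One way to write this objective in symbols is sketched below. The notation is an assumption made here for illustration: $\mathcal{H}$ for the hypothesis space, $D_T$ and $f_T$ for the target data distribution and target labeling function, and $\ell$ for the distance metric in the output space.

```latex
h^{*} \;=\; \arg\min_{h \in \mathcal{H}} \;
  \mathbb{E}_{x \sim D_T}\big[\, \ell\big(h(x),\, f_T(x)\big) \big]
```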

3.2 Domain Randomization

This technique addresses the above problem when we do not assume access to the target domain but instead have a simulator at our disposal. The simulator is a generative model capable of generating labelled data. It encapsulates domain knowledge in the form of rules about the target domain and represents the target labelling function internally. Concretely, we consider a rendering parameter space with a probability distribution defined on it, and we denote the simulator as a function that maps a rendering parameter to a labelled data sample.

The domain randomization algorithm randomly generates labelled data samples by drawing rendering parameters from a uniform distribution over the rendering parameter space. Algorithm 1 represents the domain randomization algorithm, where we randomly generate m data samples using the simulator and train a hypothesis on the generated data.


2:for  i in {1, 2, .. m} do
Algorithm 1 DR algorithm
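The domain randomization loop can be illustrated with a minimal Python sketch. Everything here is a toy stand-in and an assumption: `toy_simulator` replaces a real renderer, the rendering parameter is a hypothetical (color id, size) pair, and the "hypothesis" is a nearest-centroid classifier rather than a deep network.

```python
import random

def toy_simulator(theta, rng):
    """Hypothetical stand-in for a renderer: maps a rendering parameter
    theta = (color_id, size) to a labelled sample (x, y)."""
    color_id, size = theta
    # Smaller objects give a noisier "image" feature in this toy model.
    x = (color_id + rng.gauss(0.0, 0.1 / size), size)
    y = color_id  # the label comes for free from the rendering parameter
    return x, y

def sample_uniform(rng, n_colors=3, sizes=(0.5, 1.0, 2.0)):
    """Uniform sampling over the (toy) rendering parameter space."""
    return rng.randrange(n_colors), rng.choice(sizes)

def domain_randomization(m, rng):
    """Generate m samples with uniformly drawn parameters, then train."""
    data = [toy_simulator(sample_uniform(rng), rng) for _ in range(m)]
    # "Training": per-class mean of the first feature (nearest centroid).
    sums, counts = {}, {}
    for (feat, _size), y in data:
        sums[y] = sums.get(y, 0.0) + feat
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: sums[y] / counts[y] for y in sums}
    return data, centroids

def predict(centroids, x):
    """Predict the class whose centroid is closest to the first feature."""
    return min(centroids, key=lambda y: abs(centroids[y] - x[0]))
```

Note that the sampler never looks at the learner: easy and hard parameters are drawn equally often, which is the limitation discussed in this work.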

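The objective function optimized by these steps is the same expected distance as above, with the expectation now taken over labelled samples generated by the simulator from uniformly sampled rendering parameters.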

3.3 Combining data from Multiple Sources

In this section we present a qualitative discussion based on Theorem 4 from [2] stated here without proof for completeness.

Consider the setting where we are presented with data from N source domains. Each source domain is associated with an unknown distribution and a labelling function. We sample a total of m labeled data points from these source domains, with each source contributing a fixed share of the m samples.

Each source domain has an expected loss and an empirical loss. We use the sampled data to train a learner by minimizing a source-weighted empirical loss, i.e. a weighted combination of the per-source empirical losses.

The objective is to use samples from N source domains to train a model to perform well on the target domain . We use for expected loss on target domain for simplicity instead of .

Theorem 1.

Consider a hypothesis space of VC dimension d. Given N source domains, for each source we generate a labeled sample by drawing points from its distribution and labeling them according to its labelling function. Consider the empirical minimizer of the source-weighted empirical loss on these samples for a fixed weight vector, and the target error minimizer. Then, with high probability, the target error of the empirical minimizer is bounded in terms of the following quantities.

The bound involves the minimum error on the target domain and the error on the target domain of a learner empirically trained using the N source domains.

For each source domain, one term represents the consistency between the source and target labelling functions, and another is the divergence between the source and target distributions. Together they give the distance between that source domain and the target domain.

Let ,
and .

(1) Single Source: We use only one of the source domains to draw all m labelled samples. Using Theorem 1 we have,


(2) Multi Source: We draw labelled samples from each of the N sources and weigh all the sources equally in the loss. Using Theorem 1 we have,


We argue that domain randomization is similar to the multi-source case, with each distinct data sample being an individual source domain. The randomization process increases the variation in the data and is therefore equivalent to using an ensemble of source domains for training.

In the multi-source case, the distance term is the average distance of all source domains from the target domain, and the ensemble effect decreases the variance of this distance measure, resulting in a lower upper bound on the target error.
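To make the comparison between the two cases concrete, the bound can be specialised as sketched below. The form of the bound and all symbols are assumptions based on our reading of Theorem 4 in [2]: $\alpha_j$ are the source weights, $\beta_j$ the fraction of the m samples drawn from source j, $\lambda_j$ the labelling-consistency term and $d_{\mathcal{H}\Delta\mathcal{H}}$ the distribution divergence. The constants in particular should be checked against [2].

```latex
% General form (assumed from [2]), holding with probability at least 1 - \delta:
\epsilon_T(\hat{h}) \;\le\; \epsilon_T(h_T^{*})
  + 2\sqrt{\Big(\sum_{j=1}^{N}\frac{\alpha_j^{2}}{\beta_j}\Big)
           \frac{d\log(2m) - \log\delta}{2m}}
  + \sum_{j=1}^{N} \alpha_j \big(2\lambda_j + d_{\mathcal{H}\Delta\mathcal{H}}(D_j, D_T)\big)

% (1) Single source: \alpha_1 = \beta_1 = 1
\sum_{j}\frac{\alpha_j^{2}}{\beta_j} = 1,
\qquad \text{distance term} = 2\lambda_1 + d_{\mathcal{H}\Delta\mathcal{H}}(D_1, D_T)

% (2) Multi source: \alpha_j = \beta_j = 1/N
\sum_{j=1}^{N}\frac{(1/N)^{2}}{1/N} = \sum_{j=1}^{N}\frac{1}{N} = 1,
\qquad \text{distance term} = \frac{1}{N}\sum_{j=1}^{N}
  \big(2\lambda_j + d_{\mathcal{H}\Delta\mathcal{H}}(D_j, D_T)\big)
```

Under these assumptions the sample-complexity term is identical in both cases, so the two bounds differ only in the distance term: a single source-to-target distance versus the average over N sources.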

When the source data is generated using a simulator, we can assume that the source labelling functions are fairly consistent with the target labelling function. This assumption is justified as the simulator encodes the annotation knowledge by design. The distance between a source domain and the target domain is then governed by the divergence between their distributions. This quantity can be minimized if we have access to unlabelled samples from the target domain, which is the approach used in unsupervised domain adaptation, where the input is mapped to an intermediate domain-invariant space. Interestingly, we can still use averaging to further decrease the variance of this distance, i.e. domain adaptation and domain randomization are complementary to each other.

Obtaining different sources depends on our data generation approach. Domain randomization employs uniform sampling over the rendering parameter space. However, with uniform sampling it is necessary to generate a large enough number of samples to ensure good performance on the target domain. Thus, we motivate the need for a cleverer sampling strategy to ensure better performance with a tractable number of samples.

4 Visual Adversarial Domain Randomization and Augmentation

We model by making a strong pessimistic assumption about the rendering parameter distribution. By making this assumption, the hypothesis learned would be robust to large variations occurring in the target domain. This is especially useful when annotated target data is not available for rare scenarios.

The proposed min-max objective function is as follows:
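A min-max objective of this kind could be written as follows. This is a sketch in assumed notation rather than the exact formula: $\pi_\phi$ is a policy over rendering parameters $\theta$ with parameters $\phi$, $S$ is the simulator, $h$ the hypothesis and $\ell$ the task loss.

```latex
\min_{h \in \mathcal{H}} \; \max_{\phi} \;
  \mathbb{E}_{\theta \sim \pi_{\phi}}
  \big[\, \ell\big(h(x),\, y\big) \big],
  \qquad (x, y) = S(\theta)
```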

Figure 2: policy, simulator, rendering parameter, data sample, hypothesis

4.1 Implementation Details

The above objective function can be formulated as a two-player zero-sum game whose Nash equilibrium corresponds to the optimal value of the objective. We proceed to find the equilibrium of the above min-max optimization by performing gradient descent directly on the actions of the two players. We model the optimization problem using two players, G and H. G consists of a policy and the simulator, where the policy generates rendering parameters which the simulator converts into labelled data. H, on the other hand, is a supervised learning model trained on the data generated by G.

Specifically, we maximize the objective for G.


We use REINFORCE [51] to obtain gradients for updating the policy, using an unbiased empirical estimate of the policy gradient in which b is a baseline computed using previous rewards and M is the generated sample size. Both G and H are trained together adversarially according to Algorithm 2.
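For reference, the standard REINFORCE estimate with a baseline takes the form below. The symbols other than b and M are assumptions introduced here: $\pi_\phi$ is the policy, $\theta_i$ the i-th sampled rendering parameter and $r_i$ the reward obtained on the corresponding rendered sample.

```latex
\nabla_{\phi} J(\phi) \;\approx\;
  \frac{1}{M} \sum_{i=1}^{M}
  \nabla_{\phi} \log \pi_{\phi}(\theta_i)\,\big(r_i - b\big)
```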


1: for iteration i = {1, 2, ..., T} do
2:     Use policy to sample rendering parameters
3:     Render labelled samples using the simulator
4:     Train H on rendered data
6:     Use policy to sample rendering parameters
7:     Render labelled samples using the simulator
8:     Obtain rewards on samples using equation 4
9:     Update policy using equation 5
Algorithm 2 VADRA algorithm
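The alternating updates of Algorithm 2 can be sketched on a toy problem. This is an illustrative sketch under stated assumptions, not the implementation used in our experiments: the simulator is a hypothetical 1D generator whose rendering parameter selects a class margin, the learner H is a one-weight logistic classifier, the policy is a softmax over three parameter values, and the reward is the learner's loss on the rendered sample.

```python
import math
import random

MARGINS = (2.0, 1.0, 0.2)  # hypothetical rendering parameter values (class separation)

def softmax(logits):
    mx = max(logits)
    exps = [math.exp(l - mx) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def render(theta, rng):
    """Toy simulator: theta indexes a margin; returns a labelled sample (x, y)."""
    y = rng.choice((-1.0, 1.0))
    x = y * MARGINS[theta] + rng.gauss(0.0, 0.3)
    return x, y

def loss(w, x, y):
    """Logistic loss of a one-weight linear learner."""
    return math.log1p(math.exp(-y * w * x))

def vadra(T=300, M=32, lr_h=0.1, lr_pi=0.5, seed=0):
    rng = random.Random(seed)
    n = len(MARGINS)
    logits = [0.0] * n  # the policy starts uniform, i.e. plain DR
    w, baseline = 0.0, 0.0
    for _ in range(T):
        probs = softmax(logits)
        # Steps 2-4: sample parameters, render, train the learner H by SGD.
        for theta in rng.choices(range(n), weights=probs, k=M):
            x, y = render(theta, rng)
            grad_w = -y * x / (1.0 + math.exp(y * w * x))
            w -= lr_h * grad_w
        # Steps 6-8: sample again, render, reward = loss of the learner.
        thetas = rng.choices(range(n), weights=probs, k=M)
        rewards = [loss(w, *render(t, rng)) for t in thetas]
        # Step 9: REINFORCE update of the softmax logits with a baseline.
        for k in range(n):
            g = sum((r - baseline) * ((t == k) - probs[k])
                    for t, r in zip(thetas, rewards)) / M
            logits[k] += lr_pi * g
        # The baseline is a running mean of previous rewards.
        baseline = 0.9 * baseline + 0.1 * sum(rewards) / M
    return softmax(logits), w
```

In this toy setting the policy shifts probability mass toward the smallest margin, the parameter value on which the learner's loss stays highest, which mirrors the hard-example behaviour described above.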

4.2 VADRA with Unlabelled Target Data

In the presence of unlabelled target data, we introduce a domain classifier (target label as 1, source label as 0) which takes features extracted from the penultimate layer of H as input. The reward for the policy is modified to combine the task reward with a domain classifier term, weighted by hyperparameters. This encourages the policy to fool the domain classifier, thus generating synthetic data which looks similar to target data.

However, it is plausible that due to the simulator’s design limitations we might never be able to match the target distribution. In this case, similar to domain adaptation, we also modify H’s loss to include a domain classifier term, weighted by hyperparameters. This formulation allows both the simulator and the task model to minimize the distance from the target domain.

5 Experiments

5.1 Object Classification

5.1.1 Clevr

We formulate a toy problem for the task of color classification with 6 colors. Here, the rendering parameter is [shape, material, color, size], where shape is one of sphere, cube, cylinder; material is one of rubber, metal; color is one of red, yellow, green, cyan, blue, magenta; and size is an additional parameter. Random noise is added to color, size, lighting, object position and camera position in the image.

Target Data: We generate 5000 images at resolution consisting only of spheres but with all other variations. Figure 3 shows the target data. For consistency, we ensure each variation has equal numbers of images in the target data ().

Simulator : We use Blender3D with assets provided by [20] for data generation. The source data is generated at resolution and it consists of all variations in the space. Figure 4 visualizes a random source batch.

Policy : consists of parameters each representing the probability of a possible variation in .
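As a rough illustration of such a policy, the sketch below enumerates the variations and keeps one probability per variation. The joint enumeration and the two size labels are assumptions made for this example, since the actual size values and policy parameterization are not given here.

```python
import itertools
import random

SHAPES = ("sphere", "cube", "cylinder")
MATERIALS = ("rubber", "metal")
COLORS = ("red", "yellow", "green", "cyan", "blue", "magenta")
SIZES = ("small", "large")  # assumed labels; the real size values are not stated

# One entry per possible variation in the rendering parameter space.
VARIATIONS = list(itertools.product(SHAPES, MATERIALS, COLORS, SIZES))

# A uniform policy reproduces plain domain randomization.
probs = [1.0 / len(VARIATIONS)] * len(VARIATIONS)

def sample_variations(probs, k, rng):
    """Draw k rendering parameters according to the policy probabilities."""
    return rng.choices(VARIATIONS, weights=probs, k=k)
```

Training the policy then amounts to moving these probabilities away from uniform, toward variations that are hard for the task model.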

Task Model : We use ResNet18 followed by a fully connected layer as our classifier, which is trained end-to-end. The hyperparameters are reported in the supplementary material.

Results: We evaluate DR, VADRA and VADRA+DA. For a fair comparison, each iteration of DR consists of training on samples whereas VADRA trains on samples and trains on samples. We separately generate 1000 unlabelled target data for VADRA+DA. Figure 5 shows the target accuracy Vs training iterations averaged over 10 independent runs.

Our initial experiments show that it is possible to achieve accuracy on target data just by training on small size objects, which is not the case with large objects. Thus, it is useful to focus on the generation of these hard examples, which VADRA quickly does after a few iterations, and it therefore does better than DR. In VADRA + DA, the domain classifier eventually learns to discriminate on the basis of the shape of the object; until then, VADRA + DA performs worse than DR. However, the policy then learns to generate sphere images, which are in fact the target data, and this leads to a quicker boost in performance.

Figure 3: Target data for color classification using CLEVR.
Figure 4: Source data visualization of a random batch of samples.

Figure 5: Target Classification Accuracy Vs Iteration.

5.1.2 Syn2Real

Target Data: We use MS-COCO based validation split of the closed-set classification task from Syn2Real dataset as our target data. Figure 6 shows examples of images from the target data.

Simulator (): We use Blender3D along with CAD models for 12 classes with varying camera elevation, lighting condition and object pose. Figure 7 shows examples of source data.

Policy (): The policy controls which class sample would be generated. It is a multinomial distribution of size 12 initialized as uniform distribution.

Task Model (H): We use ResNet18 pretrained on ImageNet as our task model. The domain classifier is a small fully connected network accepting a 512-dimensional feature vector from H as input.

Results: Table 1 shows quantitative results for DR, VADRA, DR + DA (only H) and VADRA + DA. Our policy focuses on generating pairwise confusing classes in a batch, like (1) car and bus, (2) bike and motorbike, while decreasing samples of easier classes like plant and train. We provide visualizations of the policy during training in the supplementary material.

Method      aero  bike  bus  car  horse  knife  mbike  person  plant  skbrd  train  truck  Mean
DR            48     3   46   54     26     10     21       5     22     13     55      3  25.5
VADRA         40    12   52   50     25     18     26       9     19     16     48     10  27.1
DR + DA       68    41   63   34     57     45     74      30     57     24     63     15  47.6
VADRA + DA    65    54   60   46     53     41     72      42     54     29     65     25  50.5
Table 1: Classification accuracy on Syn2Real validation dataset
Figure 6: Target data for object classification depicting 12 classes.
Figure 7: Source data for object classification depicting 12 classes.

5.2 Object Detection

Figure 8: Source data for VIRAT scenes rendered using VADRA along with the ground truth instance segmentation map.

Figure 9: Object Detection results on VIRAT using FasterRCNN trained on 5000 domain randomized images using VADRA

We compare our approach against DR for the task of object detection on two surveillance scenes from VIRAT dataset [29].

Target Data: We evaluate our approach on 5000 images each from two VIRAT scenes at 1920 × 1080 resolution. The data has bounding box annotations for two kinds of foreground objects, namely person and car. For our evaluations we only use the car bounding box annotations.

Simulator (): We use the Unreal Engine based simulator by [21] for the VIRAT dataset. The simulator models the surveillance scene in 3D using scene geometry and camera parameters. To ensure performance on real data, we perform randomization using a texture bank of 100 textures with 10 cars, 5 person models and geometric distractors along with varying lighting conditions, contrast and brightness. Please refer to the supplementary material for the details.

Figure 8 shows labelled samples from the VIRAT dataset generated using the simulator along with ground truth instance segmentation map.

Policy (): In this case, each data sample consists of an RGB image and a list of car bounding boxes, which we extract from the instance segmentation map. The rendering parameter includes a list of object attributes and lighting conditions. Each object attribute consists of class, CAD model type, pose, 3D location, 3D size, and texture.

To include a variable number of objects in the image, we randomly sample the number of objects. Furthermore, we divide the ground plane of the scene into rectangular cells of equal size, where each cell is associated with an object spawning probability. Each cell's spawning probability is further divided into three object class probabilities. The spawn map is controlled by the policy and is visited repeatedly to decide where and which object to place in the scene. The exact location inside the cell and other object attributes like texture, size, pose and model type are randomly sampled. The reward for the policy is computed per cell and is the negative of the IoU of the bounding box predicted by the model.
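The spawn map and the per-cell reward can be sketched as follows. The data layout (a grid of dictionaries from class name to probability) and the function names are hypothetical choices for this example; only the negative-IoU reward follows the description above.

```python
import random

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def sample_spawns(spawn_map, n_objects, rng):
    """spawn_map[r][c] maps a class name to its spawn probability in cell
    (r, c). Returns n_objects draws of ((row, col), class_name)."""
    choices = [((r, c), cls)
               for r, row in enumerate(spawn_map)
               for c, cell in enumerate(row)
               for cls in cell]
    weights = [spawn_map[r][c][cls] for (r, c), cls in choices]
    return rng.choices(choices, weights=weights, k=n_objects)

def cell_reward(gt_box, pred_box):
    """Reward for the policy: negative IoU, so poorly detected objects
    yield a higher reward for the cell that spawned them."""
    return -iou(gt_box, pred_box)
```

A perfectly detected object gives the lowest reward of -1, while a completely missed object gives the highest reward of 0, so the policy is pushed toward cells that produce hard detections.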

Task Model (): We use FasterRCNN [34] with ROI Align and ResNet101 with feature pyramid network [24] architecture as the backbone for object detection. Please refer to the supplementary material for the hyperparameters and the training schedule.

Results: We compare our synthetic data trained model with a baseline detector, referred to as COCO, trained on the MS-COCO [25] dataset. To investigate the effect of the number of samples on performance, we evaluate two kinds of models, DR and VADRA, each trained on 5000 and 10,000 synthetic images generated per scene: (1) DR-5k, (2) VADRA-5k, (3) DR-10k, (4) VADRA-10k.

Figure 1 visualizes the learned object spawn probabilities (summed over all object classes) as a heatmap. The policy encourages hard examples like object truncation. Table 2 shows quantitative evaluations. VADRA outperforms the COCO and DR models in the low data regime and is especially effective on scene 2, which has severely truncated objects. However, as we generate on average about 10 instances per image, in the high data regime of 10k images per scene DR is more effective than VADRA because DR has enough samples for the target domain. VADRA, in contrast, achieves high performance with a small number of samples. We show qualitative results from VADRA-5k in Figure 9; our model performs well under severe truncations and occlusions even with few data samples.

Model       VIRAT Scene 1 (AP@0.5)   VIRAT Scene 2 (AP@0.5)
COCO        92.9                     85.5
DR-5k       96.5                     88.1
VADRA-5k    97.6                     94.7
DR-10k      98.9                     97.6
VADRA-10k   98.2                     96.4
Table 2: Object detection results for VIRAT dataset.

5.3 Depth Estimation

Figure 10: Source data for VIRAT scenes rendered using VADRA along with the ground truth depth map.
Figure 11: Depth estimation output from a FCN trained on data generated by VADRA for VIRAT dataset

Similar to an appearance invariant feature like optical flow, depth profile of a surveillance scene can be helpful in activity recognition. We show this is possible without installing costly depth sensors using simulators.

Target Data: We use 5000 images each from two VIRAT scene at resolution for evaluation.

Simulator (): We update the asset shader in the simulator from previous section to generate synthetic depth map along with RGB images. Figure 10 shows synthetic samples generated from the simulator using VADRA. Please refer to the supplementary material for more details on data generation and the hyperparameters.

Policy (): Similar to the object detection setup, we divide the ground plane of the 3D scene in the simulator into a rectangular grid. Each cell in the grid is associated with an object spawn probability. We do not use distractors for this task and only choose between a person or a car object. The object’s appearance, pose and location in the cell are sampled from a uniform distribution. We also randomly vary lighting conditions like the number of light sources and their orientations. The reward is again computed per cell and is the negative of the average cross entropy loss using the task model’s predictions.

Task Model (H): The ground truth synthetic depth map is quantized into 80 bins, and a fully convolutional neural network [26] with a ResNet101 [15] backbone is used to classify each pixel into one of the bins at a coarse resolution. The batch size is set to 2 and we use the SGD optimizer with a linearly decaying learning rate. We train a separate model for each of the scenes.
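Treating depth estimation as per-pixel classification can be illustrated with the short sketch below. The uniform binning over a fixed depth range is an assumption made for this example (the binning scheme is not specified here), and the helper names are hypothetical.

```python
import math

NUM_BINS = 80

def depth_to_bin(depth, d_min, d_max, num_bins=NUM_BINS):
    """Map a depth value to a class index using uniform bins over
    [d_min, d_max] (assumed scheme); out-of-range values are clamped."""
    t = (depth - d_min) / (d_max - d_min)
    return min(num_bins - 1, max(0, int(t * num_bins)))

def pixel_cross_entropy(logits, target_bin):
    """Cross entropy of one pixel's bin logits against its target bin,
    computed with a numerically stable log-sum-exp."""
    mx = max(logits)
    log_z = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return log_z - logits[target_bin]
```

Averaging `pixel_cross_entropy` over the pixels of a sample and negating it gives a quantity of the kind used as the per-cell policy reward described above.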

Results: Figure 11 shows qualitative results for the task of depth estimation. H very quickly learns the depth profile of the background, as it is almost constant across all the training data. The policy attempts to make it harder for H to predict the depth profile of the foreground objects by minimizing their pixels in the image. This results in an increased spawn probability for cells away from the surveillance camera, as shown in Figure 1. As a result, our task model trained using VADRA becomes good at predicting the depth profile of small foreground objects like people.

6 Conclusion

We presented a theoretical analysis of domain randomization in which we gave a qualitative explanation of its effectiveness. We also proposed the VADRA algorithm, which generates hard examples in order to improve a network’s performance. Our VADRA algorithm makes domain randomization more useful by using an adversarial strategy. We demonstrated the effectiveness of our proposed method on diverse tasks such as object classification, object detection, and depth estimation.