DeceptionNet: Network-Driven Domain Randomization

04/04/2019, by Sergey Zakharov et al.

We present a novel approach to tackle domain adaptation between synthetic and real data. Instead of employing 'blind' domain randomization, i.e. augmenting synthetic renderings with random backgrounds or changing illumination and colorization, we leverage the task network as its own adversarial guide towards useful augmentations that maximize the uncertainty of the output. To this end, we design a min-max optimization scheme where a given task competes against a special deception network, with the goal of minimizing the task error subject to specific constraints enforced by the deceiver. The deception network samples from a family of differentiable pixel-level perturbations and exploits the task architecture to find the most destructive augmentations. Unlike GAN-based approaches that require unlabeled data from the target domain, our method achieves robust mappings that scale well to multiple target distributions from source data alone. We apply our framework to the tasks of digit recognition on enhanced MNIST variants as well as classification and object pose estimation on the Cropped LineMOD dataset and compare to a number of domain adaptation approaches, demonstrating similar results with superior generalization capabilities.


1 Introduction

Figure 1: Training pipeline. Training is performed in two alternating phases. Phase 1: The weights of the deception network are updated, while those of the recognition network are frozen. The recognition network’s objective is maximized instead of being minimized, forcing the deception network to produce increasingly confusing images. Phase 2: The generated deceptive images provided by the deception network, whose weights are now frozen, are passed to the recognition network and its weights are updated such that the loss is minimized. As a result of this min-max optimization, the input images are automatically altered by the deception network, forcing the recognition network to be robust to these domain changes.

The alluring possibility of training machine learning models on purely synthetic data allows for a theoretically infinite supply of both input data samples and associated label information. Unfortunately, for computer vision applications, the domain gap between synthetic renderings and real-world imagery poses serious challenges for generalization. Despite the apparent visual similarity, synthetic images structurally differ from real camera sensor data. First, synthetic image formation produces clear edges with approximate physical shading and illumination, whereas real images undergo many types of noise, such as optical aberrations, Bayer demosaicing, or compression artifacts. Second, the visual differences between synthetic CAD models and their actual physical counterparts can be quite significant. Apart from the visual gap, supervised approaches also require cumbersome and error-prone human labeling of real training data in the form of 2D bounding boxes, segmentation masks, or 6D poses [25, 17]. For other approaches, such as robotic control learning, solutions must be found by exploration in tight simulation-based feedback loops that require synthetic rendering [28, 41, 31].

The gap between the visual domains is nowadays mainly bridged with adaptation and/or randomization techniques. In the case of supervised domain adaptation approaches [57, 32, 3, 30], a certain amount of labeled data from the target domain exists, while in unsupervised approaches [13, 48, 43, 4] the target data are available but unlabeled. In both cases, the goal is to match the source and target distributions by finding either a direct mapping, a common latent space, or a regularization of task networks trained on the source data. Recent unsupervised approaches are mostly based on generative adversarial networks (GANs) [4, 22, 26, 48, 51, 24, 45, 38, 1], and although these methods perform proper target domain transfers, they can overfit to the chosen target domain and exhibit a decline in performance for unfamiliar out-of-distribution samples.

Domain randomization methods [49, 21, 29, 55, 47] have no access to any target domain and employ the rather simple technique of randomly perturbing (synthetic) source data during training to make the task networks robust to perceptual differences. This approach can be effective, but it is generally unguided and requires exhaustive evaluation to find meaningful augmentations that increase the target domain performance. Last but not least, results from pixel-level adversarial attacks [6, 46] suggest the existence of architecture-dependent effects that cannot be addressed by "blind" domain randomization for robust transfer.

We propose herein a general framework that performs guided randomization with the help of an auxiliary deception network trained in a min-max fashion similar to GANs. This is done in two alternating phases, as illustrated in Fig. 1. In the first phase, the synthetic input is fed to our deception network, which is responsible for producing augmented images; these are then passed to a recognition network to compute the final task-specific loss with the provided labels. Then, instead of minimizing the loss, we maximize it via gradient reversal [12] and only back-propagate an update to the deception network parameters. The deception network parameters steer a set of differentiable modules from which augmentations are sampled. In the next phase, we feed the augmented images to the recognition network together with the original images to minimize the task-specific loss and update the recognition network. In this way, the deception network is encouraged to produce domain randomization that confuses the recognition network, while the recognition network is made resilient to such changes. By adding different modules and constraints, we can influence how much and which parts of the image the deception network alters. As a result, our method produces images completely independent of the target domain and therefore generalizes much better to new, unseen domains than related approaches. In summary, our contributions are:


  • DeceptionNet framework that performs a min-max optimization for guided domain randomization;

  • Various pixel-level perturbation modules employed in such a framework suited for synthetic data;

  • Novel datasets, MNIST-COCO and Extended Cropped LineMOD, that allow us to demonstrate strong generalization capabilities to unseen domains.

In the experimental section we show that randomization steered by leveraging the network structure generalizes much better to new domains than unsupervised approaches with access to the target data, while performing comparably to them on known target domains.

2 Related Work

Domain Adaptation.

Various domain adaptation works attempt to bridge the gap between domains, mostly based on unsupervised conditional generative adversarial networks (GANs) [48, 43, 4, 1] or style-transfer solutions [14]. These methods use an unlabeled subset of target data to improve the performance of models trained on synthetic data. For example, the authors of [4, 1] proposed to use GANs to learn the mapping from synthetic to real images. Extending this idea, the approaches of [45, 38] use GANs to tune the parameters of user-defined transformations to fit the target distribution. As opposed to GANs, [11] used a sequence autoencoder to extract feature vector pairs from the available data, which are then decoded to generate new data samples.

Alternatively, domain-invariant features that work well for both real and synthetic domains can be learned.  [37] mapped real image features to the feature space of synthetic images and used the mapped information as an input to a task-specific network, trained on synthetic data only.

Another example is DSN [5], which proposes the extraction of image representations partitioned into two subspaces: one private to each domain and one shared across domains (learning domain-invariant features). The shared subspace is then used to train a classifier that performs well on both domains. Similarly, DRIT [23] embeds images onto a domain-invariant content space (capturing shared information across domains) and a domain-specific attribute space by introducing a cross-cycle consistency loss based on disentangled representations. Other approaches, such as DANN [13] or ADDA [51], instead focus on adapting the recognition methods themselves to make them more robust to domain changes.

Domain Randomization.

However, what if one does not have real data available? The answer in this case is domain randomization: a popular approach [49, 21, 55, 36, 47, 40] that aims to randomize the parts of the domain that we do not want our algorithm to be sensitive to. For example, [49] and [40] trained complex recognition methods by adding variability to the input render data, i.e., different illumination conditions, texture changes, scene decomposition, etc. This sort of parameterization allows the network to learn features that are invariant to the particular properties of the domain. The authors of [55] used a sophisticated depth augmentation pipeline trying to cover possible artifacts of common commodity depth sensors. It was then used to train a network that removes these artifacts from the input and generates a clean, synthetic-looking image. Building on top of this idea, the methods of [36, 47] extended this to the RGB domain.

Nevertheless, the main question remains unanswered: what is the main cause of confusion given the domain change? Domain randomization tries to cover all possible scenarios, but we do not really know which of them are actually useful for bridging the domain gap. Moreover, covering all possible variations present in the real world by applying simple augmentations is almost impossible.

Our approach can be placed between domain randomization and GAN-based methods. However, instead of forcing randomization without any clear guidance on its usefulness, we propose to delegate this task to a neural network, which we call a deception network, that alters the images in an automated fashion such that the task network is maximally confused. Moreover, we do not require any images, labeled or unlabeled, from the target domain to do so.

3 Methodology

(a) Deception Modules for MNIST
(b) Deception Modules for LineMOD
Figure 2: Architecture of the deception networks used for the presented experiments. For MNIST classification, three deception modules are used: the distortion module applying elastic deformations to the image, the BG/FG module responsible for generating background and foreground colors, and the noise module additionally distorting the image by applying slight noise. The LineMOD dataset requires a more sophisticated treatment and features four deception modules: noise and distortion modules similar to the previous case but applied to the depth channel only, a pixel-wise BG module, and a light module generating different illumination conditions based on the Phong model.

As outlined, our approach towards steered domain randomization is essentially an extension of the task algorithm. Therefore, we have (1) the actual task network T, which, given an input image x, returns an estimated label (e.g., class, pose, segmentation mask, etc.), and (2) a deception network D that takes a source image x_s and returns a deceptive image x_d = D(x_s), which, when provided to the task net T, maximizes the error of the prediction with respect to the ground truth label. While the recognition network architectures are standard and follow related work [4, 12], we first focus on our structured deception network and then describe the optimization objective and the training.

To formalize our pipeline, let S_c be a source dataset composed of source images x_s for an object of class c. Then, S, the union of all S_c, is the source dataset covering all object classes. A dataset of real images R (not used by us for training) is similarly defined.

3.1 Deception Modules

The deception network D follows an encoder-decoder architecture in which the input is encoded into a lower-dimensional 2D latent representation and then given as input to multiple decoding modules D_1, ..., D_M. The final output of D is a weighted sum of the decoded outputs, where the weights act as spatial masking operations. While such a formulation allows for flexibility, the decoders must follow a set of predefined constraints to create meaningful outputs and leverage the inherent image structure instead of finding trivial mappings that decrease the task performance (e.g., by always decoding to 0). Note that our proposed framework is general and thus requires instantiations of the deception network for specific datasets. Similar to architecture search, discovering the "best" instantiation is infeasible, but good ones can be found by analyzing the data source. After reasonable experimentation, we settled on the configurations for MNIST (RGB) and LineMOD (RGB-D) depicted in Fig. 2. We continue by providing more detail on the decoder modules and their constraint ranges.
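As a minimal sketch of the fusion step described above, the snippet below combines the outputs of several deception modules via predicted spatial masks; the softmax normalization, tensor shapes, and function names are our assumptions for illustration, not the paper's exact formulation.

```python
import torch

def fuse_module_outputs(decoded, masks):
    """Fuse per-module decoder outputs into a single deceptive image.

    decoded: list of M tensors, each (B, C, H, W), one per deception module
    masks:   tensor (B, M, H, W) of unnormalized per-module spatial weights
    """
    weights = torch.softmax(masks, dim=1)                 # spatial masks summing to 1 per pixel
    stacked = torch.stack(decoded, dim=1)                 # (B, M, C, H, W)
    fused = (weights.unsqueeze(2) * stacked).sum(dim=1)   # weighted sum over modules
    return fused.clamp(0.0, 1.0)                          # keep a valid image range
```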

3.1.1 Background Module (Bg)

Since our source images have black backgrounds, they hardly transfer to the real world with its infinite background variety, resulting in a significant accuracy drop. [21, 29] tackle this problem by rendering objects on top of images from large-scale datasets (e.g., MS COCO [25]).

Instead, our background module produces its output by chaining multiple upsampling and convolution operations. While the output is rather simple at the start, the module regresses very complex and visually confusing structures in the advanced stages of training.

For MNIST, we used a simpler variant that outputs a single RGB background color and an RGB foreground bias (restricted not to intersect with the background color). To form the output, we first apply the background color and then add the foreground bias using the mask. We ensure that the final values remain within a valid color range.
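A minimal sketch of this simpler MNIST variant is given below, assuming the digit intensities serve as the foreground mask; the tensor shapes and the clamping to [0, 1] are our assumptions.

```python
import torch

def mnist_bg_fg(digit, bg_color, fg_bias):
    """Colorize a monochrome digit with a regressed background color and foreground bias.

    digit:    (B, 1, H, W) grayscale digit in [0, 1], used as the foreground mask
    bg_color: (B, 3, 1, 1) regressed RGB background color
    fg_bias:  (B, 3, 1, 1) regressed RGB foreground bias (kept away from bg_color)
    """
    # paint the background color everywhere, then add the foreground bias on digit pixels
    out = bg_color + fg_bias * digit      # broadcast over the spatial dimensions
    return out.clamp(0.0, 1.0)            # keep the result in a valid color range
```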

3.1.2 Distortion Module (Ds)

The module is based on the idea of elastic distortions first presented in [44]. Essentially, a 2D deformation field is randomly initialized from a uniform distribution and then convolved with a Gaussian filter of standard deviation σ. For large values of σ, the resulting field approaches 0, whereas small values of σ keep the field mostly random. Moderate values of σ, however, make the resulting field perform elastic deformations, where σ defines the elasticity coefficient. The resulting field is then multiplied by a scaling factor α, which controls the deformation intensity.

Our implementation closely follows the described approach, but we use the decoder output as the distortion field and apply resampling similar to spatial transformer networks [20]. We fix σ but learn both α and the general decoder parameters. This means that the network itself controls where and how much to deform the object.
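The following sketch shows the resampling step with a decoder-predicted field, in the spirit of a spatial transformer; the Gaussian smoothing with the fixed σ is omitted, and the field layout and α handling are assumptions on our part.

```python
import torch
import torch.nn.functional as F

def elastic_warp(img, flow, alpha):
    """Resample `img` with a decoder-predicted deformation field.

    img:   (B, C, H, W) input image
    flow:  (B, 2, H, W) deformation field regressed by the distortion decoder
           (channel 0 = x displacement, channel 1 = y displacement, assumed)
    alpha: scalar intensity of the deformation (learned)
    """
    B, _, H, W = img.shape
    # identity sampling grid in normalized [-1, 1] coordinates
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H, device=img.device),
                            torch.linspace(-1, 1, W, device=img.device), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, -1, -1, -1)
    # displace the grid by the scaled field and bilinearly resample
    grid = base + alpha * flow.permute(0, 2, 3, 1)
    return F.grid_sample(img, grid, align_corners=True)
```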

3.1.3 Noise Module (Ns)

Applying slight random noise augmentation to the network input during training is common practice. In a similar fashion, we use the noise decoder to add generated values to the input. The noise decoder regresses a tensor of the input size with values restricted to a small range, which are then added to the module's input.
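A correspondingly small sketch, where the amplitude bound max_amp is a hypothetical value since the exact range is not reproduced here:

```python
import torch

def add_bounded_noise(img, noise_raw, max_amp=0.1):
    """Add decoder-regressed noise, squashed into [-max_amp, max_amp], to the input."""
    noise = max_amp * torch.tanh(noise_raw)   # bound the regressed values
    return (img + noise).clamp(0.0, 1.0)
```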

3.1.4 Light Module (L)

Another feature not well covered by synthetic data is proper illumination. Recent methods [21, 29, 16, 56] prerender a number of synthetic images featuring different light conditions. Here, we instead implement differentiable lighting based on the simple Phong model [35], which is fully operated by the network. While more complex parametric and differentiable illumination models do exist, we found this basic approach to already work quite well.

The module requires surface information, which is provided in the form of normal maps. From these, we generate three different types of illumination, namely ambient, diffuse, and specular. The light decoder outputs a block of 9 parameters that define the final light properties, i.e., a 3D light direction, an RGB light color (restricted to a valid color range), and a weight for each of the three illumination types.
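A sketch of such a differentiable Phong-style shading step is given below; the parameter squashing via sigmoids, the fixed viewer direction, and the shininess exponent are our assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def phong_shade(albedo, normals, params):
    """Relight an image from 9 regressed parameters using a simple Phong model.

    albedo:  (B, 3, H, W) rendered RGB image
    normals: (B, 3, H, W) unit surface normals
    params:  (B, 9) -> 3 light direction, 3 RGB light color, 3 illumination weights
    """
    light_dir = F.normalize(params[:, 0:3], dim=1).view(-1, 3, 1, 1)
    color = torch.sigmoid(params[:, 3:6]).view(-1, 3, 1, 1)    # valid color range
    w_amb, w_dif, w_spec = torch.sigmoid(params[:, 6:9]).unbind(dim=1)
    w_amb = w_amb.view(-1, 1, 1, 1)
    w_dif = w_dif.view(-1, 1, 1, 1)
    w_spec = w_spec.view(-1, 1, 1, 1)

    n_dot_l = (normals * light_dir).sum(dim=1, keepdim=True).clamp(min=0.0)
    view = torch.tensor([0.0, 0.0, 1.0], device=albedo.device).view(1, 3, 1, 1)
    refl = 2.0 * n_dot_l * normals - light_dir                          # reflected light direction
    spec = (refl * view).sum(dim=1, keepdim=True).clamp(min=0.0) ** 8   # hypothetical shininess
    shaded = albedo * color * (w_amb + w_dif * n_dot_l) + color * w_spec * spec
    return shaded.clamp(0.0, 1.0)
```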

(a) MNIST
(b) MNIST-M
(c) MNIST-COCO
(d) PixelDA [4]
(e) Ours
Figure 3: Example samples of the MNIST modalities: MNIST (Source), MNIST-M (Target), and MNIST-COCO (Generalization) on the left; and example augmentation images generated by PixelDA and our method respectively.

3.2 Optimization Objective

The optimization objective of the deception network is essentially the loss of the recognition network; however, instead of minimizing it, we maximize it by updating the parameters in the direction of the positive gradient. This is achieved by adding a gradient reversal layer [12] between the deception and recognition nets as shown in Fig. 1. The layer only negates the gradient when back-propagating, thereby resulting in the reversed optimization objective for a given loss. Therefore, the general optimization objective can be written as follows:

$$\max_{\phi} \; \mathcal{L}_T\big(T(D(x_s;\,\phi)),\, y_s\big) \qquad (1)$$

$$\text{subject to} \quad D(x_s;\,\phi) \in \mathcal{C}, \qquad (2)$$

where x_s is the input image, y_s is the ground truth label, T is the task network, L_T is the task loss, D is the deception network with parameters φ, and C denotes the hard constraints defined by the deception modules, enforced by projection after a gradient step. In this framework, the deception network's objective depends only on the objective of the recognition task and can therefore be easily applied to any other task.
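Since the sign flip happens only in the backward pass, the gradient reversal layer mentioned above can be written in a few lines; the sketch below is a standard PyTorch formulation rather than the authors' code.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates the gradient in the backward pass,
    so the deception network ascends the task loss that the task network descends."""

    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

def grad_reverse(x):
    return GradReverse.apply(x)
```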

3.3 Training Procedure

We use two different SGD solvers, where the task network's learning rate decays by a fixed factor every iteration, while the deception network's learning rate was found to work well when kept constant. We train with a batch size of 64 for all experiments and stop training after 500 epochs. During experimentation, we also discovered that concatenating the original and perturbed images led to a consistent improvement in numbers.
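A compact sketch of one alternating training step is shown below, assuming generic deceiver, task_net, and task_loss objects; for clarity it negates the loss explicitly instead of inserting the gradient reversal layer, which has the same effect when only the deceiver is updated.

```python
import torch

def train_step(deceiver, task_net, opt_d, opt_t, x, y, task_loss):
    """One min-max iteration of the two-phase training described above."""
    # Phase 1: ascend the task loss w.r.t. the deception network only
    x_d = deceiver(x)
    loss_d = -task_loss(task_net(x_d), y)
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Phase 2: freeze the deceiver, train the task network on original + deceptive images
    with torch.no_grad():
        x_d = deceiver(x)
    inputs = torch.cat([x, x_d], dim=0)
    targets = torch.cat([y, y], dim=0)
    loss_t = task_loss(task_net(inputs), targets)
    opt_t.zero_grad()                 # also clears gradients accumulated in phase 1
    loss_t.backward()
    opt_t.step()
    return loss_t.item()
```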

4 Evaluation

(a) Synthetic
(b) Real
(c) Extended
(d) PixelDA [4]
(e) Ours
Figure 4: Example samples of the LineMOD modalities: Synthetic (Source), Real (Target), and Extended (Generalization) on the left; and example augmentation images generated by PixelDA and our method respectively.

In this section, we conduct a series of experiments to compare the capabilities of our pipeline with state-of-the-art domain adaptation methods. We first compare against these baselines on the adaptation problem and then compare in terms of generalization. We conclude with an ablation analysis to measure the impact of each module and modality on the final performance.

As the first dataset, we used the popular handwritten digits dataset MNIST as well as MNIST-M, introduced in [13] for unsupervised domain adaptation (depicted in Figs. 3(a) and 3(b)). MNIST-M blends digits from the original monochrome set with random color patches from BSDS500 [2] by simply inverting the color values of the pixels belonging to the digit. The training split containing 59001 target images is used for domain adaptation, and the remaining 9001 target images are used for evaluation. That means that around 86% of the target data is used for training. Note that while MNIST is not technically synthetic, its clean and homogeneous appearance is typical for synthetic data.

The second dataset is the Cropped LineMOD dataset [53], consisting of small centered 64×64 crops of 11 different objects in cluttered indoor settings, shown in various poses. It is based on the LineMOD dataset [15], a collection of annotated RGB-D sequences recorded using the Primesense Carmine sensor together with associated 3D object reconstructions. The dataset also features a synthetic set of crops of the same objects in various poses on a black background. We treat this Synthetic Cropped LineMOD as the source dataset and the Real Cropped LineMOD as the target dataset. Domain adaptation methods use a split of 109208 rendered source images and 9673 real-world target images for training, 1000 real images for validation, and a target domain test set of 2655 images for testing. We show examples in Figs. 4(a) and 4(b).

The last dataset pair we used for the experiments is SYNTHIA [39] and Cityscapes [10]. SYNTHIA is a collection of pixel-annotated road scene frames rendered from a virtual city. Cityscapes is its real counterpart acquired in the street scenes of 50 different actual cities. Following a common evaluation protocol, we used a subset of 9400 SYNTHIA images, also known as SYNTHIA-RAND-CITYSCAPES, as the source data and 500 Cityscapes validation images as the target data.

4.1 Adaptation Tests

All domain adaptation methods use a significant portion of the target data for training, making the resulting mapped source images very similar to the target images (e.g., Fig. 3(b) vs. 3(d) and Fig. 4(b) vs. 4(d)). A common benchmark for domain adaptation is then to compare the performance of a classifier trained on the mapped data against a classifier trained on the source data only (lower baseline) and against a classifier trained directly on the target data (upper baseline).

Our approach is generally disadvantaged since we can structure our domain mapping only through the source data and the deception architecture. To show that our learned randomization is indeed guided, we additionally implement an unguided randomization variant that applies train time augmentation similar to the related work. It employs the same modules and constraints as our deception network, but its perturbations are conditioned on random values in each forward pass instead of latent codes from the input.
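For illustration, the difference between the two variants boils down to how the perturbation modules are conditioned; the sketch below assumes a hypothetical module interface taking an image and a conditioning code.

```python
import torch

def unguided_augment(modules, x):
    """Unguided baseline: run the same differentiable perturbation modules,
    but condition each on freshly sampled random codes rather than on codes
    encoded from the input image."""
    out = x
    for m in modules:
        z = torch.rand(x.shape[0], m.code_dim, device=x.device)  # random conditioning
        out = m(out, z)                                           # apply the perturbation
    return out
```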

4.1.1 Classification on MNIST

In Table 1 we collect the results of the most relevant methods tested on the MNIST → MNIST-M scenario and split them according to the type of data used. Since domain adaptation methods use both source and target data for training, they are allocated to a separate group (S + T). Both our method and the unguided randomization variant only have access to the source data and are therefore grouped under S. The task network follows the architecture presented in [12], which is also used by the other methods. The task objective is a simple cross-entropy loss between the predicted and the ground truth label distributions.

We can identify three key observations: (1) our method shows very competitive results (90.4% classification accuracy) and is on par with the latest domain adaptation pipelines: DSN – 83.2%, DRIT – 91.5%, and PixelDA – 95.9%. Moreover, we outperform most of the methods by a significant margin despite the fact that they had access to a large portion of the target data to minimize the domain shift. (2) Guiding the randomization leads to approximately 7% higher accuracy, which convincingly supports our claim. (3) Surprisingly, unguided randomization (with appropriate modules) alone is in fact enough to outperform most methods on MNIST.

| Data | Model | MNIST → MNIST-M, Classification Accuracy (%) | Synthetic → Real Cropped LineMOD, Classification Accuracy (%) | Mean Angle Error (°) |
| | Source (S) | 56.6 | 42.9 | 73.7 |
| S | Unguided | 83.1 | 53.1 | 52.6 |
| S | Ours | 90.4 | 95.8 | 51.9 |
| S + T | CycleGAN [59] | 74.5 | 68.2 | 47.5 |
| S + T | MMD [52, 27] | 76.9 | 72.4 | 70.6 |
| S + T | DANN [13] | 77.4 | 99.9 | 56.6 |
| S + T | DSN [5] | 83.2 | 100 | 53.3 |
| S + T | DRIT [23] | 91.5 | 98.1 | 34.4 |
| S + T | PixelDA [4] | 95.9 | 99.9 | 23.5 |
| | Target (T) | 96.5 | 100 | 12.3 |

Table 1: Baseline tests: While performing slightly worse than the leading state-of-the-art domain adaptation methods using target data, we still manage to achieve very competitive performance without access to target data.

4.1.2 Classification and Pose Estimation on LineMOD

As before, the domain adaptation methods are trained on a mix of source (Synthetic Cropped LineMOD) and target (Real Cropped LineMOD) data, and we compare to the predefined baselines. We use the common task network for this benchmark from [12] and the associated task loss:

$$\mathcal{L}_T = \mathcal{L}_{class} + \xi \log\big(1 - |q \cdot \hat{q}|\big) \qquad (3)$$

where the first term is the classification loss and the second term is the log of a quaternion rotation metric [19]; ξ weighs both terms, whereas q and q̂ are the ground truth and predicted quaternions, respectively.
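A sketch of this loss in code, where the weight value xi and the numerical clamp are our assumptions:

```python
import torch
import torch.nn.functional as F

def linemod_task_loss(class_logits, q_pred, class_gt, q_gt, xi=0.1):
    """Classification + pose loss in the spirit of Eq. (3).

    q_pred, q_gt: (B, 4) quaternions; q_pred is normalized before the metric.
    """
    cls = F.cross_entropy(class_logits, class_gt)
    q = F.normalize(q_pred, dim=1)
    dot = (q * q_gt).sum(dim=1).abs().clamp(max=1.0 - 1e-6)  # avoid log(0)
    pose = torch.log(1.0 - dot).mean()
    return cls + xi * pose
```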

The results in Table 1 present a more nuanced case. On this visually complex dataset, unguided randomization performs above only the lower baseline and is far behind any other method. Our guided randomization, on the other hand, with 95.8% classification accuracy and a 51.9° mean angle error, is competitive with the latest domain adaptation methods using target data: DSN – 100% & 53.3°, DRIT – 98.1% & 34.4°, and PixelDA – 99.9% & 23.5°. Nonetheless, we believe that both DRIT and PixelDA are not fully reachable by target-agnostic methods like ours, since the space of all needed adaptations (e.g., aberrations or JPEG artifacts) has to be spanned by our deception modules. The augmentation differences between PixelDA and our method (Figs. 4(d) and 4(e)) suggest the existence of some visual phenomena we are still not accounting for with our deception network.

4.2 Generalization Tests

For the second set of experiments, we test the generalization capabilities of our method as well as the competing approaches. The major advantage of our pipeline is its independence from any target domain by design. To support our case we designed two new datasets:


  • MNIST-COCO: The data collection follows the exact same generation procedure as MNIST-M and has the same number of images for both training and testing. The only difference is that instead of the BSDS500 dataset, we use crops from MS COCO. Fig. 3(c) demonstrates some of the newly generated images.

  • Extended Real Cropped LineMOD: Thanks to the help of the authors of the original LineMOD dataset [15], we were able to obtain some of the original LineMOD objects, namely "phone", "benchvise", and "driller". We repeated the physical acquisition setup and generated an annotated scene for each object. Each scene depicts a specific object placed on a white markerboard atop a turntable, coarsely surrounded by a small number of cluttered objects that slightly occlude it at times. Each sequence contains 130 RGB-D images covering a full 360° rotation at an elevation angle of approximately 60°. Given the acquired and refined poses, we again crop the images in the same fashion as in the Cropped LineMOD dataset [53]. All 390 images are used for evaluation, with some examples shown in Fig. 4(c).

| Data | Model | MNIST → MNIST-COCO, Classification Accuracy (%) | Synthetic → Extended Real Cropped LineMOD, Classification Accuracy (%) | Mean Angle Error (°) |
| | Source (S) | 57.2 | 63.1 | 78.3 |
| S | Unguided | 85.8 | 77.2 | 48.5 |
| S | Ours | 89.4 | 99.0 | 46.5 |
| S + T | DSN [5] | 73.2 | 45.7 | 76.3 |
| S + T | PixelDA [4] | 72.5 | 76.0 | 84.2 |
| | Target (T) | 96.1 | 100 | 14.7 |

Table 2: Generalization tests: Our method generalizes well to the extended datasets, while the adaptation methods underperform due to overfitting.

For a comparison with the strongest related methods, i.e., DSN, DRIT, and PixelDA, we used open-source implementations and diligently ensured that we were able to properly train them and reproduce the reported numbers from Table 1. While the DRIT implementation worked well for the adaptation experiments, we failed to produce reasonably high numbers for the generalization experiment and chose to exclude it from the comparison.

Similar to before, we train them using the target data from MNIST-M and Real Cropped LineMOD. After training is finished and the corresponding accuracies on the target test splits are reached, we test the methods on the newly acquired datasets. While different, these extended datasets still bear a certain resemblance to the target datasets, so a certain amount of generalization could be expected. For our randomization methods, we can immediately test on the new data, since no retraining is necessary.

| Modules | MNIST → MNIST-M, Classification Accuracy (%) | Synthetic → Real Cropped LineMOD, Classification Accuracy (%) | Mean Angle Error (°) |
| None | 56.6 | 42.9 | 73.7 |
| BG | 82.4 | 74.8 | 50.4 |
| BG + NS | 86.5 | 77.6 | 52.8 |
| BG + NS + DS | 90.4 | 78.7 | 48.2 |
| BG + NS + DS + L | - | 95.8 | 51.9 |

Table 3: Module ablation: Evaluation of the importance of the deception network's modules. BG – background, NS – noise, DS – distortion, L – light.
| Data | Model | Road | SW | BLDG | Wall | Fence | Pole | TL | TS | VEG | Sky | PRSN | Rider | Car | Bus | Mbike | Bike | mIoU | mIoU* |
| | Source (S) | 3.8 | 10.2 | 46.3 | 1.8 | 0.3 | 19.1 | 4.0 | 7.5 | 71.8 | 72.2 | 44.6 | 3.4 | 24.9 | 5.2 | 0.0 | 2.5 | 19.8 | 22.8 |
| S | Unguided | 17.9 | 8.8 | 59.2 | 0.8 | 0.4 | 22.1 | 3.5 | 6.1 | 71.4 | 70.4 | 40.3 | 7.3 | 37.9 | 3.3 | 0.2 | 7.3 | 22.3 | 25.7 |
| S | Ours | 51.4 | 17.8 | 62.5 | 1.6 | 0.4 | 22.6 | 6.0 | 11.9 | 70.9 | 73.5 | 42.1 | 8.2 | 40.9 | 8.1 | 3.9 | 18.4 | 27.5 | 32.0 |
| S + T | FCNs Wld [18] | 11.5 | 19.6 | 30.8 | 4.4 | 0.0 | 20.3 | 0.1 | 11.7 | 42.3 | 68.7 | 51.2 | 3.8 | 54.0 | 3.2 | 0.2 | 0.6 | 20.1 | 22.9 |
| S + T | CDA [58] | 65.2 | 26.1 | 74.9 | 0.1 | 0.5 | 10.7 | 3.7 | 3.0 | 76.1 | 70.6 | 47.1 | 8.2 | 43.2 | 20.7 | 0.7 | 13.1 | 29.0 | 34.8 |
| S + T | Cross-City [7] | 62.7 | 25.6 | 78.3 | - | - | - | 1.2 | 5.4 | 81.3 | 81.0 | 37.4 | 6.4 | 63.5 | 16.1 | 1.2 | 4.6 | - | 35.7 |
| S + T | Tsai et al. [50] | 78.9 | 29.2 | 75.5 | - | - | - | 0.1 | 4.8 | 72.6 | 76.7 | 43.4 | 8.8 | 71.1 | 16.0 | 3.6 | 8.4 | - | 37.6 |
| S + T | ROAD-Net [9] | 77.7 | 30.0 | 77.5 | 9.6 | 0.3 | 25.8 | 10.3 | 15.6 | 77.6 | 79.8 | 44.5 | 16.6 | 67.8 | 14.5 | 7.0 | 23.8 | 36.1 | 41.7 |
| S + T | LSD-seg [42] | 80.1 | 29.1 | 77.5 | 2.8 | 0.4 | 26.8 | 11.1 | 18.0 | 78.1 | 76.7 | 48.2 | 15.2 | 70.5 | 17.4 | 8.7 | 16.7 | 36.1 | 42.1 |
| S + T | Chen et al. [8] | 78.3 | 29.2 | 76.9 | 11.4 | 0.3 | 26.5 | 10.8 | 17.2 | 81.7 | 81.9 | 45.8 | 15.4 | 68.0 | 15.9 | 7.5 | 30.4 | 37.3 | 43.0 |
| | Target (T) | 96.5 | 74.6 | 86.1 | 37.1 | 33.2 | 30.2 | 39.7 | 51.6 | 87.3 | 90.4 | 60.1 | 31.7 | 88.4 | 52.3 | 33.6 | 59.1 | 59.5 | 65.5 |

Table 4: Real-world application: Segmentation performance on the SYNTHIA → Cityscapes benchmark based on Intersection over Union (IoU), tested on 16 (mIoU) and 13 (mIoU*) classes of the Cityscapes dataset. Our method outperforms the source and unguided baselines by a significant margin and remains competitive with the methods relying on target data.

As is evident from Table 2, the accuracy of our method on MNIST-COCO is very close to the MNIST-M number (89.4% vs. 90.4%). For the case of Extended Real Cropped LineMOD, we obtain even better results than for Real Cropped LineMOD in terms of both accuracy and angle error: we only need to classify 3 objects instead of 11 with a much smaller pose space, and the scenes are in general cleaner and less occluded. These results underline our claim with respect to generalization. This is, however, not the case for the domain adaptation methods, which show drastically worse results. Interestingly, we observe an inverse trend where better results on the original target data lead to a more significant drop. Despite having a very high accuracy on the target data and the ability to generate additional samples that do not exist in the dataset, these methods present typical signs of overfit mappings that cannot generalize well to extensions of the same data acquired in a similar manner. The simple reason for this might be the nature of these methods: they do not generalize to the features that matter most for the recognition task, but simply replicate the target distribution as closely as possible. As a result, it is not clear what the classifier exactly focuses on during inference; however, it could very likely be the particular type of images (e.g., in the case of MNIST-COCO) or a specific type of backgrounds and illumination (e.g., in the case of Extended Real Cropped LineMOD). In contrast to domain adaptation methods, our pipeline is designed not to replicate the target distribution, but to make the classifier invariant to changes that should not affect classification, which is why our results remain stable.

4.3 Ablation Studies

In this section, we perform a set of ablation studies to gain more insight into the impact of each module inside the deception network. Our modules obviously model only a fraction of the possible perturbations, and it is important to understand their individual contributions. Moreover, we demonstrate how well we perform when provided with different types of input modalities for the LineMOD datasets.

| Input | Synthetic → Real Cropped LineMOD, Classification Accuracy (%) | Mean Angle Error (°) | Synthetic → Extended Real Cropped LineMOD, Classification Accuracy (%) | Mean Angle Error (°) |
| D | 73.3 | 36.6 | 78.7 | 34.9 |
| RGB | 84.8 | 57.4 | 85.9 | 49.4 |
| RGB-D | 95.8 | 51.9 | 99.0 | 46.5 |

Table 5: Input modality ablation: Performance evaluation based on the input data type used: depth, RGB, or RGB-D.

4.3.1 Deception Modules

We tested 4 different variations of the deception net that use varying combinations of the deception modules: background (BG), noise (NS), distortion (DS), and light (L). The exact combinations and the results on both datasets are listed in Table 3.

It can be clearly seen that each additional module in the deception network adds to the discriminative power of the final task network. The most important modules can also be easily distinguished based on the results. Apparently, the background module always makes a significant difference: the purely black backgrounds of the source data are drastically different from real imagery. Another interesting observation is the strong impact of the lighting perturbation in the case of the Cropped LineMOD dataset. This reinforces the notion that real sequences undergo many kinds of lighting changes that are not well represented by synthetic renderings without additional relighting. Note that the MNIST deception network does not employ lighting.

4.3.2 Input Modalities

For the task of simultaneous instance classification and pose estimation, we (as well as the other methods) always use the full RGB-D information. This ablation aims to show how well we fare when provided with only a certain type of data and its impact on the final results. Table 5 shows that RGB allows for better classification, whereas depth provides better pose estimates. Combining both modalities further boosts classification enormously and reduces the pose error.

4.4 Real-world Scenario

We demonstrate a real-world application of our approach on the more practical problem of semantic segmentation using the common SYNTHIA → Cityscapes benchmark. Having only synthetic SYNTHIA renderings, we try to generalize to the real Cityscapes data, evaluating our method on 13 and 16 classes using the Intersection over Union (IoU) metric. This setup is particularly difficult since the domain gap problem is intensified by a completely different set of segmentation instances and camera views. For a fair comparison, all methods use a VGG-16 base (FCN-8s) recognition network. The deception modules used in this case are: 2D noise (NS), elastic distortion (DS), and light (L). Normal maps for the light module are generated from the available synthetic depth data.

Table 4 shows that even without access to target domain data, our pipeline remains competitive with the methods relying on target data, reaching an mIoU of 27.5 and an mIoU* of 32.0 (16 and 13 classes, respectively) – well above the source and unguided baselines. The results also confirm the generality of the approach with respect to different task architectures and datasets.

5 Conclusion

In this paper, we presented a new framework to tackle the domain gap problem when no target data is available. Given a task network and its objective, we show how to extend it with a simple encoder-decoder deception network and bind both in a min-max game in order to achieve guided domain randomization. As a result, we obtain increasingly more robust task networks. We demonstrate performance comparable to domain adaptation methods on two datasets and, most importantly, show superior generalization capabilities in settings where domain adaptation methods tend to drop in performance due to overfitting to the target distribution. Our results suggest that guided randomization, because of its simple but effective nature, should become a standard procedure for defining baselines for domain transfer and adaptation techniques.

Figure 5: DeceptionNet architecture. Our network features a typical encoder-decoder architecture. The encoder part consists of 2 consecutive downsamplings followed by a sequence of convolutional blocks. The decoder part shares a similar architecture for all presented augmentation modules. Arrows show the skip connections between blocks.

Appendix A Supplementary Material

a.1 Network Architecture

Let a convolutional block be composed of the following layers: a 3×3 convolution with a given number of filters, BatchNorm (BN), and a ReLU activation function. Similarly, let a decoding block consist of a 2-factor upsampling transposed convolution, BatchNorm (BN), and ReLU.

DeceptionNet:

Using the defined nomenclature, the encoding part of the DeceptionNet consists of two consecutive downsampling stages (convolutional blocks followed by 2-factor max-pooling layers) and a subsequent sequence of convolutional blocks; the decoding part mirrors this structure with decoding blocks, where the final output layer is defined by the specific module type. Encoding blocks have skip connections concatenating their channels with the opposite decoding blocks. A visual representation of the DeceptionNet architecture is depicted in Fig. 5.
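The following is a minimal backbone sketch of such an encoder-decoder with skip connections; filter counts, exact depths, and class names are assumptions for illustration rather than the paper's exact configuration, and each deception module would attach its own constrained output head to the returned feature map.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    """3x3 convolution + BatchNorm + ReLU."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

def up_block(c_in, c_out):
    """2-factor upsampling transposed convolution + BatchNorm + ReLU."""
    return nn.Sequential(nn.ConvTranspose2d(c_in, c_out, 2, stride=2),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class DeceptionBackbone(nn.Module):
    """Encoder-decoder with two downsamplings and skip connections."""

    def __init__(self, c_in=3, base=32):
        super().__init__()
        self.enc1 = conv_block(c_in, base)
        self.enc2 = conv_block(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(base * 2, base * 4)
        self.up2 = up_block(base * 4, base * 2)
        self.dec2 = conv_block(base * 4, base * 2)   # input: upsampled features + enc2 skip
        self.up1 = up_block(base * 2, base)
        self.dec1 = conv_block(base * 2, base)       # input: upsampled features + enc1 skip

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return d1   # fed to a module-specific output layer
```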

Figure 6: MNIST Classifier: Simple LeNet-like architecture, where 2 convolutional layers followed by ReLUs and max-poolings are finalized by 3 fully-connected layers.
Task Network:

In both cases, i.e., MNIST classification and Cropped LineMOD classification and pose estimation, the task network follows a simple LeNet-like architecture. For MNIST (see Fig. 6), the final layer outputs a 10D vector, whereas for Cropped LineMOD (see Fig. 7) there is an 11D classification output as well as a 4D quaternion output.
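A sketch of the LineMOD variant with its two heads is shown below; layer widths and the flattened feature size (for 64×64 RGB-D crops) are assumptions, not the exact configuration from Fig. 7.

```python
import torch
import torch.nn as nn

class LineMODTaskNet(nn.Module):
    """LeNet-like task network: shared conv trunk, a class head, and a quaternion head."""

    def __init__(self, n_classes=11):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, 5), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 5), nn.ReLU(inplace=True), nn.MaxPool2d(2))
        self.trunk = nn.Sequential(nn.Flatten(), nn.Linear(64 * 13 * 13, 256),
                                   nn.ReLU(inplace=True), nn.Dropout(0.5))
        self.cls_head = nn.Linear(256, n_classes)    # 11 classes; 10 for the MNIST variant
        self.quat_head = nn.Linear(256, 4)           # quaternion output (LineMOD only)

    def forward(self, x):                            # x: (B, 4, 64, 64) RGB-D crop
        h = self.trunk(self.features(x))
        return self.cls_head(h), self.quat_head(h)
```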

Figure 7: Cropped LineMOD Task Network: Simple LeNet-like architecture followed by a dropout layer with a 50% rate and outputting both a class and pose vector.
Figure 8: Deceptive images over consecutive iterations: The output becomes increasingly more complex as training progresses.
Figure 9: Unguided samples: We provide a sample of unguided augmentations for MNIST and LineMOD.
Figure 10: Deceptive augmentations: Augmentations applied for the SYNTHIA Cityscapes scenario.

a.2 Unguided Randomization: BG Filling

One of the baselines we compare our results with is unguided randomization, which applies augmentations during the data preprocessing step. While using the same modules and constraints as our deception network, its perturbations are conditioned on random values instead of latent codes from the input.

Since our DeceptionNet is capable of generating very complex backgrounds, we also used complex noise types for unguided randomization to make the comparison fairer (see Fig. 9). Apart from uniform white noise, two additional noise types were used: Perlin noise [34] and cellular noise [54]. Sampling frequencies were drawn from a uniform distribution. Both noise types were generated using the open-source FastNoise library [33].

a.3 Additional Qualitative Results

In this section, we present additional output examples of the deception networks for Synthetic Cropped LineMOD and SYNTHIA test cases.

The LineMOD deception network uses all of the deception modules presented in the paper, whereas the SYNTHIA deception network uses three modules: light (L), elastic distortion (DS), and foreground noise (NS). Sample outputs from each of the above-mentioned modules are shown in Fig. 10. Moreover, Fig. 8 demonstrates the output of the deception network during the training process. One can see that the output becomes increasingly harder for the task network to recognize.

References

  • [1] A. Antoniou, A. Storkey, and H. Edwards (2017) Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340. Cited by: §1, §2.
  • [2] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik (2011) Contour detection and hierarchical image segmentation. TPAMI. Cited by: §4.
  • [3] A. Babenko, A. Slesarev, A. Chigorin, and V. S. Lempitsky (2014) Neural codes for image retrieval. CoRR. External Links: Link, 1404.1777 Cited by: §1.
  • [4] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan (2017) Unsupervised pixel-level domain adaptation with generative adversarial networks. In CVPR, Cited by: §1, §2, 2(d), §3, 3(d), Table 1, Table 2.
  • [5] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan (2016) Domain separation networks. In NIPS, Cited by: §2, Table 1, Table 2.
  • [6] T. B. Brown, D. Mané, A. Roy, M. Abadi, and J. Gilmer (2017) Adversarial patch. In NIPS, Cited by: §1.
  • [7] Y. Chen, W. Chen, Y. Chen, B. Tsai, Y. Frank Wang, and M. Sun (2017) No more discrimination: cross city adaptation of road scene segmenters. In ICCV, Cited by: Table 4.
  • [8] Y. Chen, W. Li, X. Chen, and L. V. Gool (2019) Learning semantic segmentation from synthetic data: a geometrically guided input-output adaptation approach. In CVPR, Cited by: Table 4.
  • [9] Y. Chen, W. Li, and L. Van Gool (2018) Road: reality oriented adaptation for semantic segmentation of urban scenes. In CVPR, Cited by: Table 4.
  • [10] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In CVPR, Cited by: §4.
  • [11] T. DeVries and G. W. Taylor (2017) Dataset augmentation in feature space. arXiv preprint arXiv:1702.05538. Cited by: §2.
  • [12] Y. Ganin and V. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In ICML, Cited by: §1, §3.2, §3, §4.1.1, §4.1.2.
  • [13] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. JMLR. Cited by: §1, §2, Table 1, §4.
  • [14] L. A. Gatys, A. S. Ecker, and M. Bethge (2016) Image style transfer using convolutional neural networks. In CVPR, Cited by: §2.
  • [15] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab (2012) Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In ACCV, Cited by: 2nd item, §4.
  • [16] S. Hinterstoisser, V. Lepetit, P. Wohlhart, and K. Konolige (2017) On pre-trained image features and synthetic images for deep learning. Cited by: §3.1.4.
  • [17] T. Hodan, P. Haluza, Š. Obdržálek, J. Matas, M. Lourakis, and X. Zabulis (2017) T-less: an rgb-d dataset for 6d pose estimation of texture-less objects. In WACV, Cited by: §1.
  • [18] J. Hoffman, D. Wang, F. Yu, and T. Darrell (2016) Fcns in the wild: pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649. Cited by: Table 4.
  • [19] D. Q. Huynh (2009) Metrics for 3d rotations: comparison and analysis. Journal of Mathematical Imaging and Vision. Cited by: §4.1.2.
  • [20] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu (2015) Spatial transformer networks. In NIPS, Cited by: §3.1.2.
  • [21] W. Kehl, F. Manhardt, F. Tombari, S. Ilic, and N. Navab (2017) SSD-6d: making rgb-based 3d detection and 6d pose estimation great again. In ICCV, Cited by: §1, §2, §3.1.1, §3.1.4.
  • [22] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi (2016) Photo-realistic single image super-resolution using a generative adversarial network. CoRR. Cited by: §1.
  • [23] H. Lee, H. Tseng, J. Huang, M. Singh, and M. Yang (2018) Diverse image-to-image translation via disentangled representations. In ECCV, Cited by: §2, Table 1.
  • [24] K. Lee, G. Ros, J. Li, and A. Gaidon (2019) SPIGAN: privileged adversarial learning from simulation. ICLR. Cited by: §1.
  • [25] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In ECCV, Cited by: §1, §3.1.1.
  • [26] M. Liu and O. Tuzel (2016) Coupled generative adversarial networks. CoRR. Cited by: §1.
  • [27] M. Long, Y. Cao, J. Wang, and M. I. Jordan (2015) Learning transferable features with deep adaptation networks. ICML. Cited by: Table 1.
  • [28] J. Mahler and K. Goldberg (2017) Learning deep policies for robot bin picking by simulating robust grasping sequences. In CoRL, Cited by: §1.
  • [29] F. Manhardt, W. Kehl, N. Navab, and F. Tombari (2018) Deep model-based 6d pose refinement in rgb. In ECCV, Cited by: §1, §3.1.1, §3.1.4.
  • [30] S. Motiian, M. Piccirilli, D. A. Adjeroh, and G. Doretto (2017) Unified deep supervised domain adaptation and generalization. ICCV. Cited by: §1.
  • [31] OpenAI, M. Andrychowicz, B. Baker, M. Chociej, R. Józefowicz, B. McGrew, J. W. Pachocki, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, and W. Zaremba (2018) Learning dexterous in-hand manipulation. CoRR. Cited by: §1.
  • [32] M. Oquab, L. Bottou, I. Laptev, and J. Sivic (2014) Learning and transferring mid-level image representations using convolutional neural networks. In CVPR, Cited by: §1.
  • [33] J. Peck (2016) FastNoise library. GitHub. Note: https://github.com/Auburns/FastNoise Cited by: §A.2.
  • [34] K. Perlin (2002) Improving noise. In ACM Transactions on Graphics (TOG), Cited by: §A.2.
  • [35] B. T. Phong (1975) Illumination for computer generated pictures. Communications of the ACM. Cited by: §3.1.4.
  • [36] B. Planche, S. Zakharov, Z. Wu, A. Hutter, H. Kosch, and S. Ilic (2019) Seeing beyond appearance-mapping real images into geometrical domains for unsupervised cad-based recognition. IROS. Cited by: §2.
  • [37] M. Rad, M. Oberweger, and V. Lepetit (2018) Feature mapping for learning fast and accurate 3d pose inference from synthetic images. In CVPR, Cited by: §2.
  • [38] A. J. Ratner, H. Ehrenberg, Z. Hussain, J. Dunnmon, and C. Ré (2017) Learning to compose domain-specific transformations for data augmentation. In Advances in neural information processing systems, Cited by: §1, §2.
  • [39] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez (2016) The synthia dataset: a large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, Cited by: §4.
  • [40] F. Sadeghi and S. Levine (2016) Cad2rl: real single-image flight without a single real image. arXiv preprint arXiv:1611.04201. Cited by: §2.
  • [41] F. Sadeghi and S. Levine (2017) CAD2RL: real single-image flight without a single real image. In Robotics: Science and Systems(RSS), Cited by: §1.
  • [42] S. Sankaranarayanan, Y. Balaji, A. Jain, S. Nam Lim, and R. Chellappa (2018) Learning from synthetic data: addressing domain shift for semantic segmentation. In CVPR, Cited by: Table 4.
  • [43] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb (2017) Learning from simulated and unsupervised images through adversarial training. In CVPR, Cited by: §1, §2.
  • [44] P. Y. Simard, D. Steinkraus, J. C. Platt, et al. (2003) Best practices for convolutional neural networks applied to visual document analysis.. In ICDAR, Cited by: §3.1.2.
  • [45] L. Sixt, B. Wild, and T. Landgraf (2018) Rendergan: generating realistic labeled data. Frontiers in Robotics and AI. Cited by: §1, §2.
  • [46] J. Su, D. V. Vargas, and K. Sakurai (2019) One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation. Cited by: §1.
  • [47] M. Sundermeyer, Z. Marton, M. Durner, M. Brucker, and R. Triebel (2018) Implicit 3D Orientation Learning for 6D Object Detection from RGB Images. In ECCV, Cited by: §1, §2.
  • [48] Y. Taigman, A. Polyak, and L. Wolf (2017) Unsupervised cross-domain image generation. ICLR. Cited by: §1, §2.
  • [49] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel (2017) Domain randomization for transferring deep neural networks from simulation to the real world. IROS. Cited by: §1, §2.
  • [50] Y. Tsai, W. Hung, S. Schulter, K. Sohn, M. Yang, and M. Chandraker (2018) Learning to adapt structured output space for semantic segmentation. In CVPR, Cited by: Table 4.
  • [51] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell (2017) Adversarial discriminative domain adaptation. CVPR. Cited by: §1, §2.
  • [52] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell (2014) Deep domain confusion: maximizing for domain invariance. CoRR. Cited by: Table 1.
  • [53] P. Wohlhart and V. Lepetit (2015) Learning descriptors for object recognition and 3d pose estimation. In CVPR, Cited by: 2nd item, §4.
  • [54] S. Worley (1996) A cellular texture basis function. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pp. 291–294. Cited by: §A.2.
  • [55] S. Zakharov, B. Planche, Z. Wu, A. Hutter, H. Kosch, and S. Ilic (2018) Keep it unreal: bridging the realism gap for 2.5d recognition with geometry priors only. 3DV. Cited by: §1, §2.
  • [56] S. Zakharov, I. Shugurov, and S. Ilic (2019) DPOD: 6d pose object detector and refiner. In ICCV, Cited by: §3.1.4.
  • [57] M. D. Zeiler and R. Fergus (2013) Visualizing and understanding convolutional networks. CoRR. Cited by: §1.
  • [58] Y. Zhang, P. David, and B. Gong (2017) Curriculum domain adaptation for semantic segmentation of urban scenes. In ICCV, Cited by: Table 4.
  • [59] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, Cited by: Table 1.