Sensor Transfer: Learning Optimal Sensor Effect Image Augmentation for Sim-to-Real Domain Adaptation

09/17/2018, by Alexandra Carlson et al., University of Michigan

Performance on benchmark datasets has drastically improved with advances in deep learning. Still, cross-dataset generalization performance remains relatively low due to the domain shift that can occur between two different datasets. This domain shift is especially exaggerated between synthetic and real datasets. Significant research has been done to reduce this gap, specifically via modeling variation in the spatial layout of a scene, such as occlusions, and scene environmental factors, such as time of day and weather effects. However, few works have addressed modeling the variation in the sensor domain as a means of reducing the synthetic-to-real domain gap. The camera or sensor used to capture a dataset introduces artifacts into the image data that are unique to the sensor model, suggesting that sensor effects may also contribute to domain shift. To address this, we propose a learned augmentation network composed of physically-based augmentation functions. Our proposed augmentation pipeline transfers specific effects of the sensor model (chromatic aberration, blur, exposure, noise, and color temperature) from a real dataset to a synthetic dataset. We provide experiments that demonstrate that augmenting synthetic training datasets with the proposed learned augmentation framework reduces the domain gap between synthetic and real domains for object detection in urban driving scenes.


I Introduction

Synthetic datasets are designed to contain numerous spatial and environmental features that are found in the real domain: images captured during different times of day, in various weather conditions, and in structured urban environments. However, in spite of these shared features and high levels of photorealism, images from synthetic datasets are noticeably stylistically distinct from real images. Figure 1 shows a side-by-side comparison of two widely-used real benchmark vehicle datasets, KITTI [1, 2] and Cityscapes [3], and a state-of-the-art synthetic dataset, GTA Sim10k [4, 5]. These differences can be quantified: a performance drop is observed when deep neural networks (DNNs) are trained in one of the synthetic and real domains and tested in the other [5]. This suggests that real and synthetic datasets differ in their global pixel statistics.

Fig. 1: A comparison of images sampled from the real KITTI benchmark dataset (left column), the real Cityscapes dataset (center column), and the synthetic GTA Sim10k dataset (right column). Note that each dataset has a distinct visual style, specifically differing color cast, brightness, and blur.

Domain adaptation methods attempt to minimize such dissimilarities between synthetic and real datasets that result from an uneven representation of visual information in one domain compared to the other. Recent domain adaptation research has focused on learning salient visual features from real data – specifically scene lighting, scene background, weather, and occlusions – using generative adversarial frameworks in an effort to better model the representation of these visual elements in synthetic training sets [6, 7, 8]. However, little work has focused on modeling realistic, physically-based augmentations of synthetic data. Carlson et al. [9] demonstrate that randomizing across the sensor domain significantly improves performance over standard augmentation techniques. The information loss that results from the interaction between the camera model and lighting in the environment is not generally modeled in rendering engines, despite the fact that it can greatly influence the pixel-level artifacts, distortions, and dynamic range, and thus the global visual style, induced in each image [10, 11, 12, 13, 14, 15, 16].

In this study, we build upon [9] to work towards closing the gap between real and synthetic data domains. We propose a novel learning framework that performs sensor transfer on synthetic data. That is, the network learns to transfer the real sensor effect domain – blur, exposure, noise, color cast, and chromatic aberration – to synthetic images via a generative augmentation network. We demonstrate that augmenting relatively small labeled datasets using sensor transfer generates more robust and generalizable training datasets that improve the performance of DNNs for object detection and semantic segmentation tasks in urban driving scenes for both real and synthetic visual domains.

This paper is organized as follows: Section II presents related background work; Section III details the proposed Sensor Transfer learning framework; Section IV describes experiments and discusses their results; and Section V concludes the paper. Code will be made publicly available.

II Related Work

Our work focuses on augmenting the training data directly so that it can be applied to any task or input into any deep neural network regardless of the architecture. Zhang et al. [17] demonstrate that the level of photorealism of the synthetic training data directly impacts the robustness and performance of the deep learning algorithm when tested on real data across a variety of computer vision tasks. However, it remains unclear what features of real data are necessary for this performance gain, or what parts of rendering pipelines should be modified to bridge the synthetic-to-real domain gap. Much work in the fields of data augmentation and learned rendering pipelines has proposed methods that shed light on this topic; these are summarized below.

II-A Domain Randomization

Recent work on domain randomization seeks to bridge the sim-to-real domain gap by generating synthetic data that has sufficient random variation over scene factors and rendering parameters such that the real data falls into this range of variation, even if the rendered data does not appear photorealistic. Such scene factors include textures, occlusion levels, scene lighting, camera field of view, and uniform noise, and these techniques have been applied to vision tasks in complex indoor and outdoor environments [18, 19]. The drawback of these techniques is that they only work if they sample the visual parameter spaces finely enough, and create a large enough dataset from a broad enough range of visual distortions to encompass the variation observed in real data. This can result in intractably large datasets that require significant training time for a deep learning algorithm. While we also aim to achieve robustness via an augmentation framework, we can use smaller datasets to achieve state-of-the-art performance because our method learns how to augment synthetic data with salient visual information that exists in real data. Note that, because our work focuses on image augmentation outside of the rendering pipeline, it could be used in addition to domain randomization techniques.

II-B Optimizing Augmentation

In contrast to domain randomization, task-dependent techniques have been proposed to achieve more efficient data augmentation by learning the type and number of image augmentations that are important for algorithm performance. State-of-the-art methods [20, 21, 22] in this area treat data augmentation as a form of network regularization, selecting a subset of augmentations that optimize algorithm performance for a given dataset and task as the algorithm is being trained. Unlike these methods, we propose that data augmentation can function as a domain adaptation method. Our learning framework is task-independent, and uses physically-based augmentation techniques to investigate the visual degrees of freedom necessary for optimizing network performance from the synthetic to the real domain.

II-C Learned Rendering Pipelines

Several studies have proposed unsupervised, generative learning frameworks that either take the place of a standard rendering engine [7] or complement the rendering engine via post-processing [23, 24, 25] in order to model relevant visual information directly from real images with no dependency on a specific task framework. Both [7] and [25] are applied to complex outdoor image datasets, but are designed to learn distributions over simpler spatial features in real images, specifically scene geometry. Other methods, such as [23, 24], attempt to learn low-level pixel features. However, they are only applied to image sets that are homogeneously structured and low resolution. This may be due to the sensitivity of training adversarial frameworks. Our work focuses specifically on modeling the camera and image processing pipeline rather than scene elements or environmental factors that are specific to a given task. Our method can be applied to high resolution images of complex scenes.

II-D Impact of Sensor Effects on Deep Learning

Recent work has demonstrated that elements of the image formation and processing pipeline can have a large impact upon learned representations for deep neural networks across a variety of vision tasks [26, 27, 15, 16]. The majority of methods propose learning techniques that remove these effects from images [27]. As many of these sensor effects can lead to loss of information, correcting for them is non-trivial, potentially unstable, and may result in the hallucination of visual structure in the restored image. In contrast, Carlson et al. [9] demonstrate that significant performance boosts can be achieved by augmenting images using physically-based sensor effect domain randomization. However, their method requires hand-tuning and evaluation of the visual quality of the image augmentations. This human-in-the-loop dependence is inefficient and difficult to scale for large synthetic datasets, and the evaluated visual image quality is subjective. Rather than removing these effects, adding them in randomly, or adding them in manually via a human in the loop, our method learns the style of sensor effects from real data and transfers this sensor style to synthetic images to bridge the synthetic-to-real domain gap.

III Methods

Fig. 2: The schematic of the proposed Sensor Transfer Network structure. The style loss trains the sensor effect parameter generators (represented as the yellow boxes) to select parameters that transform the input synthetic images based upon how the sensor effect augmentation functions alter the style of the real data domain. This effectively transfers the 'sensor style' of the target dataset to the source dataset.

The objective of the Sensor Transfer Network is to learn the optimal set of augmentations that transfer sensor effects from a real dataset to a synthetic dataset. Our complete Sensor Transfer Network is shown in Figure 2.

III-A Sensor Effect Augmentation Pipeline

We adopt the sensor effect augmentation pipeline from  [9]. This is the backbone of the Sensor Transfer Network. Refer to [9] for a detailed discussion of each function and its relationship to the image formation process in a camera. We briefly describe each sensor effect augmentation function below for completeness. The sensor effect augmentation pipeline is a composition of chromatic aberration, Gaussian blur, exposure, pixel-sensor noise, and post-processing color balance augmentation functions:

I_{aug} = \phi_{color}(\phi_{noise}(\phi_{exposure}(\phi_{blur}(\phi_{chrom}(I)))))     (1)
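As a reference point, the composition in Eqn. 1 can be expressed as a simple loop over per-effect functions. This is a minimal, hypothetical sketch (the names sensor_effect_pipeline, effect_fns, and the parameter keys are ours, not from the paper); each entry of effect_fns is assumed to map an image and its sampled parameters to an augmented image, and sketches of the individual effects follow their descriptions below.

```python
def sensor_effect_pipeline(image, effect_fns, params):
    """Apply the composed augmentation of Eqn. 1: chromatic aberration,
    then blur, exposure, noise, and finally post-processing color balance."""
    for name in ("chrom_ab", "blur", "exposure", "noise", "color_balance"):
        image = effect_fns[name](image, params[name])
    return image
```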

Chromatic Aberration
To model lateral chromatic aberration, we apply translations in 2D pixel space to each of the color channels of an image. To model longitudinal chromatic aberration, we scale the green color channel relative to the red and blue channels of an image by a value S. We combine these parameters into an affine transformation applied to each pixel of each color channel of the image. The augmentation parameters learned for this augmentation function are S, the red channel translations t_x^R and t_y^R, the green channel translations t_x^G and t_y^G, and the blue channel translations t_x^B and t_y^B.
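A minimal sketch of this augmentation is given below, assuming an H x W x 3 RGB image in [0, 1]; the exact affine parameterization in the paper may differ, and the function and argument names are ours.

```python
import numpy as np
from scipy.ndimage import affine_transform

def chromatic_aberration(img, translations, green_scale):
    """Translate each color channel and scale the green channel about the
    image center. `translations` maps "r"/"g"/"b" to (row, col) pixel offsets,
    e.g. {"r": (0.5, -0.5), "g": (0.0, 0.0), "b": (-0.5, 0.5)}."""
    h, w, _ = img.shape
    center = np.array([h / 2.0, w / 2.0])
    scales = {"r": 1.0, "g": green_scale, "b": 1.0}
    out = np.empty_like(img)
    for idx, ch in enumerate("rgb"):
        s, t = scales[ch], np.asarray(translations[ch], dtype=float)
        # Inverse map: output pixel p samples the input at (p - center - t) / s + center.
        out[..., idx] = affine_transform(
            img[..., idx], np.eye(2) / s, offset=center - (center + t) / s,
            order=1, mode="nearest")
    return out
```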

Blur
We implement out-of-focus blur, which is modeled by convolving the image with a Gaussian filter [28]. We fix the window size of the kernel to 7. The augmentation parameter learned for this augmentation function is the standard deviation σ of the kernel.
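A minimal sketch of the blur augmentation, assuming a float RGB image and building the fixed 7 x 7 kernel explicitly:

```python
import numpy as np
from scipy.ndimage import convolve

def gaussian_blur(img, sigma, window=7):
    """Out-of-focus blur: convolve each channel with a normalized Gaussian
    kernel of fixed window size; only the standard deviation is learned."""
    ax = np.arange(window) - (window - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    return np.stack(
        [convolve(img[..., c], kernel, mode="nearest") for c in range(3)], axis=-1)
```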

Exposure
We implement the exposure density function developed in [29, 30]:

S = 255 / (1 + e^{-A \cdot I})     (2)

where S is image intensity, I models the incoming light intensity, and A is a constant value that describes image contrast. We set A to 0.85. This model is used to re-expose an image as follows:

I' = (1/A) \ln( S / (255 - S) ) + \Delta S     (3)
S' = 255 / (1 + e^{-A \cdot I'})     (4)

The augmentation parameter learned for this augmentation function is \Delta S, which models changing exposure: a positive \Delta S relates to increasing the exposure, and a negative value indicates decreasing exposure.
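A minimal sketch of re-exposure following Eqns. 2-4, assuming an RGB image normalized to [0, 1] that is mapped to the 0-255 range of Eqn. 2 internally; the interface is ours.

```python
import numpy as np

def re_expose(img, delta_s, contrast=0.85, eps=1e-6):
    """Re-expose an image: invert the exposure curve, shift the recovered
    intensity by delta_s (>0 brightens, <0 darkens), and re-apply the curve."""
    s = np.clip(img * 255.0, eps, 255.0 - eps)
    # Eqn. 3: I' = (1/A) ln(S / (255 - S)) + delta_s
    intensity = np.log(s / (255.0 - s)) / contrast + delta_s
    # Eqn. 4: S' = 255 / (1 + exp(-A * I')), returned in [0, 1]
    return (255.0 / (1.0 + np.exp(-contrast * intensity))) / 255.0
```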

Noise
We use the Poisson-Gaussian noise model proposed in [12]:

I_{noise}(x, y) = I(x, y) + \eta_{poiss}(I(x, y)) + \eta_{gauss}     (5)

where I(x, y) is the ground truth image at pixel location (x, y), \eta_{poiss} is the signal-dependent Poisson noise, and \eta_{gauss} is the signal-independent Gaussian noise. The augmentation parameters learned for this augmentation function are the Poisson and Gaussian noise parameters for each color channel, for a total of six parameters.
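A minimal sketch of the noise augmentation, approximating the signal-dependent Poisson component of Eqn. 5 by zero-mean Gaussian noise whose variance scales with intensity (a common approximation, not necessarily the paper's exact implementation); a and b are hypothetical names for the per-channel signal-dependent and signal-independent parameters.

```python
import numpy as np

def poisson_gaussian_noise(img, a, b, rng=None):
    """Add Poisson-Gaussian noise to an H x W x 3 image in [0, 1].
    a, b: three values each, one per color channel."""
    rng = np.random.default_rng() if rng is None else rng
    a = np.asarray(a, dtype=float).reshape(1, 1, 3)
    b = np.asarray(b, dtype=float).reshape(1, 1, 3)
    signal_dependent = np.sqrt(np.clip(a * img, 0.0, None)) * rng.standard_normal(img.shape)
    signal_independent = np.sqrt(b) * rng.standard_normal(img.shape)
    return np.clip(img + signal_dependent + signal_independent, 0.0, 1.0)
```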

Post-processing
We model post-processing techniques done by cameras, such as white balancing or gamma transformation, by performing linear translations in LAB color space [31, 32]. The augmentation parameters learned for this augmentation function are the translations in the a (red-green) and b (blue-yellow) channels in normalized LAB space.
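A minimal sketch of the post-processing augmentation using skimage's LAB conversion; note that the shifts here are applied in raw LAB units, whereas the paper works in a normalized LAB space, so the scaling of delta_a and delta_b is an assumption.

```python
import numpy as np
from skimage.color import rgb2lab, lab2rgb

def color_shift_lab(img, delta_a, delta_b):
    """Translate the a (red-green) and b (blue-yellow) channels in LAB space,
    then convert back to RGB."""
    lab = rgb2lab(img)
    lab[..., 1] += delta_a
    lab[..., 2] += delta_b
    return np.clip(lab2rgb(lab), 0.0, 1.0)
```

Sketches such as these five functions could then be passed to the pipeline composition shown after Eqn. 1.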

Fig. 3: A detailed schematic of how the training process occurs for a single sensor effect function. A 200-dimensional Gaussian noise vector is generated for a given input synthetic image. The Gaussian noise vector is input into the fully connected neural network that constitutes the parameter generator, which outputs sampled value(s) for the respective sensor effect augmentation function. The sampled parameter value(s) and the input synthetic image are fed into the sensor effect augmentation function, which outputs an augmented synthetic image. The style loss is calculated between the augmented synthetic image and a real image. This style loss is then backpropagated through the augmentation functions to train the parameter generator to select parameters that reduce the style differences between the real and augmented synthetic images.

III-B Training the Sensor Transfer Network

A high-level overview of a single training iteration for a single sensor effect is given in Figure 3. Each sensor effect augmentation function has its own parameter generator network. The training objective for each of these networks is to learn the distribution over its respective augmentation parameter(s) based upon real data. Each generator network is a two-layer, fully connected neural network. The following steps are required to perform a single training iteration of the Sensor Transfer Network using a single synthetic image. First, a 200-dimensional Gaussian noise vector, z, is generated and paired with the input synthetic image. The noise vector is input into each separate generator network. Each generator network consists of two fully connected layers that together project z into its respective sensor effect parameter space. For example, the blur parameter generator will map z to a value in the σ parameter space. The output sampled parameters, paired with the input synthetic image, are then input into the augmentation pipeline, which outputs an augmented synthetic image. This augmented synthetic image is then paired with a real image, both of which are input to the loss function.
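A minimal sketch of one per-effect parameter generator is given below, assuming PyTorch; the hidden width, output activation, and parameter range are our assumptions and not values reported in the paper.

```python
import torch
import torch.nn as nn

class ParameterGenerator(nn.Module):
    """Two fully connected layers mapping a 200-d Gaussian noise vector to the
    augmentation parameter(s) of a single sensor effect."""

    def __init__(self, n_params, hidden=64, param_min=0.0, param_max=3.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(200, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_params),
        )
        self.param_min, self.param_max = param_min, param_max

    def forward(self, z):
        # Squash the output into an assumed parameter range (e.g. sigma for blur).
        return self.param_min + (self.param_max - self.param_min) * torch.sigmoid(self.net(z))


blur_generator = ParameterGenerator(n_params=1)   # hypothetical blur generator
z = torch.randn(1, 200)                           # 200-d Gaussian noise vector
sigma = blur_generator(z)                         # sampled blur parameter
```

In the full network, one such generator per sensor effect is trained jointly, and the style loss gradient flows back through the differentiable augmentation functions into the generators.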


We employ a loss function similar to the one used in Johnson et al. [33]. We assume that the layers of the VGG-16 network [34] trained on ImageNet [35] encode relevant style information for salient objects. We fix the weights of the pretrained VGG-16 network and use it to project real and augmented synthetic images into the hidden layer feature spaces. We calculate the style loss, given in Eqn. 6, and use this as the training signal for the parameter generators.

L_{style} = \sum_{l=1}^{L} \| G_l(x_{real}) - G_l(x_{aug}) \|_F^2     (6)

In the above equation, x_{real} is a real image batch, x_{aug} is an augmented synthetic image batch, G_l(x_{real}) is the Gram matrix of the feature map of hidden layer l of the pretrained VGG-16 network, and G_l(x_{aug}) is the corresponding quantity for the augmented synthetic images. Through performance-based ablation studies, we found that L = 10 gives the best performance, so the style loss is calculated for the first ten layers of VGG-16. Once calculated, the style loss is backpropagated through the sensor effect augmentation functions to train the sensor effect parameter generators. The above process is repeated with images from the synthetic and real datasets until the style loss has converged.
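A minimal sketch of the style loss of Eqn. 6, using a frozen, ImageNet-pretrained VGG-16 from torchvision; treating the first ten modules of vgg16().features as the "first ten layers", and the Gram matrix normalization, are our assumptions.

```python
import torch
import torchvision.models as models

def gram_matrix(features):
    """Gram matrix of a feature map batch of shape (N, C, H, W)."""
    n, c, h, w = features.shape
    f = features.view(n, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

class StyleLoss(torch.nn.Module):
    """Sum of squared Frobenius distances between Gram matrices of real and
    augmented-synthetic feature maps over the first n_layers of VGG-16."""

    def __init__(self, n_layers=10):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features
        self.layers = vgg[:n_layers].eval()
        for p in self.layers.parameters():
            p.requires_grad_(False)

    def forward(self, real, augmented):
        loss = 0.0
        x, y = real, augmented
        for layer in self.layers:
            x, y = layer(x), layer(y)
            loss = loss + torch.sum((gram_matrix(x) - gram_matrix(y)) ** 2)
        return loss
```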

We train the sensor effects generators concurrently to learn the joint probability distribution over the sensor effect parameters. This is done to capture the dependencies that exist between these effects in a real camera. Once training is complete, we can fix the weights of the parameter generators and use them to sample learned parameters to augment synthetic images. Table I shows the statistics of the learned distributions of sensor effect parameters for different real datasets. See Section IV for analysis and discussion of the learned parameters. Note that the style loss was chosen because it is independent of the spatial structure of an image. In effect, the augmentation parameter generators learn to sample the distributions of sensor effects in real data as constrained by the style of the real image domain.

IV Experiments

IV-A Experimental Setup

To verify that the proposed method can transfer the sensor effects of different datasets, we train Sensor Transfer Networks using the following synthetic and real benchmark datasets. GTASim10k [5] is comprised of 10,000 highly photorealistic synthetic images collected from the Grand Theft Auto (GTA) rendering engine; it captures different weather conditions and times of day. The Cityscapes [3] training image set is comprised of 2975 real images collected in over 50 cities across Germany. The KITTI training set [1] is comprised of 7481 real images collected in Karlsruhe, Germany. We train a Sensor Transfer Network to transfer the sensor style of the KITTI training set to GTASim10k, referred to as GTASim10kKITTI, and another to transfer the sensor style of the Cityscapes training set to GTASim10k, referred to as GTASim10kCityscapes. To train each Sensor Transfer Network, we use a batch size of 1 and train for 4 epochs. For all experiments, we compare our results to the Sensor Effect Domain Randomization [9] of GTASim10k as a baseline measure. To generate the Sensor Effect Domain Randomization augmentations, we used the same human-selected parameter ranges as in [9].

IV-B Evaluation of Learned Sensor Effect Augmentations

Qualitatively, from observing Figure 1, KITTI images feature more pronounced visual distortions due to blur, over-exposure, and a blue color tone. Cityscapes, on the other hand, has a more under-exposed, darker visual style.

Fig. 4: Qualitative comparison of augmented GTASim10kKITTI images in the left column, unaugmented GTASim10k in the center column, and GTASim10kCityscapes in the right column. The primary sensor effect transferred in the GTASim10kCityscapes augmentation is decreased exposure, whereas the primary sensor effects transferred in the GTASim10kKITTI augmentation are a bluish hue and increased exposure.
Proposed Method GTASim10kCityscapes Sensor Effect Parameters (learned mean and standard deviation per parameter): Chrom. Ab. | Blur | Exposure | Noise | Post-processing
Proposed Method GTASim10kKITTI Sensor Effect Parameters (learned mean and standard deviation per parameter): Chrom. Ab. | Blur | Exposure | Noise | Post-processing
Carlson et al. [9] Sensor Effect Domain Randomization Parameter Ranges:
Chrom. Ab.: channel scaling 0.998-1.002; per-channel x and y translations -0.003-0.003
Blur: kernel window 3-11; standard deviation 0.0-3.0
Exposure: -0.6-1.2
Noise: 0.00-0.05 (six per-channel parameters)
Post-processing: a and b translations -10.0-10.0
TABLE I: Learned sensor effect parameters for GTASim10kCityscapes and GTASim10kKITTI, and the Sensor Effect Domain Randomization parameters from Carlson et al. [9]. Note that for the sensor-transferred parameters in the first two rows, the mean and standard deviation of each sensor effect parameter value are given. For the Carlson et al. [9] Sensor Effect Domain Randomization parameters, given in the final row, the minimum and maximum of the human-selected range are provided. Quantitatively, GTASim10kKITTI increases image exposure, adds chromatic aberration and noise, and adds a blue color cast. For GTASim10kCityscapes, image exposure is decreased, chromatic aberration and a higher level of noise are added, and a slight yellow-blue color cast is applied.

Figure 4 shows examples of unaugmented GTASim10k images in the center column, those same images augmented by GTASim10kKITTI in the left column, and those images augmented by GTASim10kCityscapes in the right column. When compared to Figure 1, it does appear that, for both GTASim10kKITTI and GTASim10kCityscapes, realistic aspects of exposure, noise, and color cast are transferred to GTASim10k. The statistics of the learned parameter values are given in Table I. In general, the selected parameter values generate augmented synthetic images whose style matches the real datasets. We hypothesize that the color shift for GTASim10kCityscapes is not as strong as for GTASim10kKITTI because there is a more even distribution of sky and buildings in Cityscapes, whereas KITTI has a significant number of instances of sky. Interestingly, the blur parameter, σ, did not converge and was pushed towards zero for both GTASim10kCityscapes and GTASim10kKITTI. This suggests that Gaussian blur does not match the blur captured in the style of real images. Further research could consider more accurate models of blur, such as motion blur.

IV-C Impact of Learned Sensor Transformation on Object Detection for Benchmark Datasets

Training Dataset Tested on KITTI
Augmentation Method | AP50 | Gain
Baseline | 51.01 | n/a
Carlson et al. [9] | 48.94 | -2.07
Proposed Method | 52.67 | +1.66

Training Dataset Tested on Cityscapes
Augmentation Method | AP50 | Gain
Baseline | 30.13 | n/a
Carlson et al. [9] | 34.89 | +4.76
Proposed Method | 35.48 | +5.35
TABLE II: Results of the sensor effect augmentations on Faster R-CNN object detection performance (VOC AP50 for the car class). The gain for both the Carlson et al. [9] and proposed methods is calculated relative to the full, unaugmented baseline dataset.

To evaluate if the Sensor Transfer Network is adding in visual information that is salient for vision tasks in the real image domain, we train an object detection neural network on the unaugmented and augmented synthetic data and evaluate the performance of the object detection network on the real data domains, KITTI and Cityscapes. We chose to use Faster R-CNN as our base network for 2D object detection [36]. Faster R-CNN achieves relatively high performance on the KITTI benchmark dataset. Many state-of-the-art object detection networks that improve upon these results still use Faster R-CNN as their base architecture.

We compare against Faster R-CNN networks trained on unaugmented GTASim10k and on GTASim10k augmented with Sensor Effect Domain Randomization from Carlson et al. [9] as our baselines. To create augmented training datasets, we combine the unaugmented GTASim10k with varying amounts of augmented GTASim10k data. For all datasets, both augmented and unaugmented, we trained each Faster R-CNN network for 10 epochs using two Titan X Pascal GPUs in order to control for potential confounds between performance and training time. We evaluate the Faster R-CNN networks on either the KITTI training dataset or the Cityscapes training dataset, depending on the Sensor Transfer Network used for training dataset augmentation. Each dataset is converted into Pascal VOC 2012 format to standardize training and evaluation, and performance values are the VOC AP50 reported for the car class [37].

Table II shows the object detection results for the proposed method in comparison to the baselines. In general, the addition of sensor effect augmentations provides a positive boost to Faster R-CNN performance when training on GTASim10k and testing on Cityscapes. Our proposed method, for both GTASim10kCityscapes and GTASim10kKITTI, achieves the best performance, outperforming both the baseline and Sensor Effect Domain Randomization.

Fig. 5: Results of the learned sensor effect augmentations on Faster R-CNN object detection performance. Note that higher performance can be achieved using smaller synthetic datasets augmented with the proposed method for both KITTI and Cityscapes.

To evaluate the impact of Sensor Transfer on the number of synthetic training images required for maximal object detection performance, we trained Faster R-CNNs on datasets comprised of the 10k unaugmented GTASim10k images combined with either 2k, 5k, 8k, or 10k augmented images. Figure 5 captures the effect of an increasing number of augmentations on Faster R-CNN performance. We see that, when compared to the Sensor Effect Domain Randomization method, fewer training images are required when using Sensor Transfer augmentation for both GTASim10kKITTI and GTASim10kCityscapes. Our results indicate that learning the augmentation parameters allows us to train on significantly smaller datasets without compromising performance. This demonstrates that we are modeling salient visual information more efficiently than domain randomization. Interestingly, the Sensor Effect Domain Randomization method does worse than the baseline across all levels of augmentation when tested on KITTI. We expect that this is because the human-chosen set of parameter ranges, shown in the bottom row of Table I, does not generalize well when adapting GTASim10k to KITTI, even though it may generate visually realistic images. One reason for this is that the visually realistic parameter ranges selected in [9] were chosen using a GTA dataset of all daytime images, whereas GTASim10k contains an even representation of daytime and nighttime images. This further demonstrates the importance of learning the sensor effect parameter distributions constrained by how they affect the styles of both the real and synthetic image datasets.

V Discussion and Conclusions

In general, our results show that the proposed Sensor Transfer Network reduces the synthetic-to-real domain gap more effectively and more efficiently than domain randomization. Future work includes increasing the complexity and realism of the Sensor Transfer augmentation pipeline by modeling additional sensor effects, as well as implementing models that better capture the pixel statistics of real images, such as motion or defocus blur. Other avenues include investigating the impact of task performance and problem space on the sensor effect parameter selection, and evaluating how the proposed method impacts performance for synthetic training datasets rendered with various levels of photorealism.

References