Synthetic datasets are designed to contain numerous spatial and environmental features found in the real domain: images captured at different times of day, in various weather conditions, and in structured urban environments. However, despite these shared features and high levels of photorealism, images from synthetic datasets remain noticeably stylistically distinct from real images. Figure 1 shows a side-by-side comparison of two widely-used real benchmark vehicle datasets, KITTI [1, 2] and Cityscapes [3], and a state-of-the-art synthetic dataset, GTA Sim10k [4, 5]. These differences can be quantified: a performance drop is observed when deep neural networks (DNNs) are trained in one of the synthetic and real domains and tested in the other. This suggests that real and synthetic datasets differ in their global pixel statistics.
Domain adaptation methods attempt to minimize such dissimilarities between synthetic and real datasets that result from an uneven representation of visual information in one domain compared to the other. Recent domain adaptation research has focused on learning salient visual features from real data – specifically scene lighting, scene background, weather, and occlusions – using generative adversarial frameworks in an effort to better model the representation of these visual elements in synthetic training sets [6, 7, 8]. However, little work has focused on modeling realistic, physically-based augmentations of synthetic data. Carlson et al. [9] demonstrate that randomizing across the sensor domain significantly improves performance over standard augmentation techniques. The information loss that results from the interaction between the camera model and lighting in the environment is not generally modeled in rendering engines, despite the fact that it can greatly influence the pixel-level artifacts, distortions, and dynamic range, and thus the global visual style induced in each image [10, 11, 12, 13, 14, 15, 16].
In this study, we build upon [9] to work towards closing the gap between the real and synthetic data domains. We propose a novel learning framework that performs sensor transfer on synthetic data. That is, the network learns to transfer the real sensor effect domain – blur, exposure, noise, color cast, and chromatic aberration – to synthetic images via a generative augmentation network. We demonstrate that augmenting relatively small labeled datasets using sensor transfer generates more robust and generalizable training datasets that improve the performance of DNNs for object detection and semantic segmentation tasks in urban driving scenes for both real and synthetic visual domains.
II. Related Work
Our work focuses on augmenting the training data directly so that it can be applied to any task or input into any deep neural network regardless of architecture. Zhang et al. [17] demonstrate that the level of photorealism of synthetic training data directly impacts the robustness and performance of a deep learning algorithm when tested on real data across a variety of computer vision tasks. However, it remains unclear which features of real data are necessary for this performance gain, or which parts of the rendering pipeline should be modified to bridge the synthetic-to-real domain gap. Much work in the fields of data augmentation and learned rendering pipelines has proposed methods that shed light on this topic; these are summarized below.
II-A. Domain Randomization
Recent work on domain randomization seeks to bridge the sim-to-real domain gap by generating synthetic data that has sufficient random variation over scene factors and rendering parameters such that the real data falls into this range of variation, even if the rendered data does not appear photorealistic. Such scene factors include textures, occlusion levels, scene lighting, camera field of view, and uniform noise, and have been applied to vision tasks in complex indoor and outdoor environments [18, 19]. The drawback of these techniques is that they only work if they sample the visual parameter spaces finely enough, and create a large enough dataset from a broad enough range of visual distortions to encompass the variation observed in real data. This can result in intractably large datasets that require significant training time for a deep learning algorithm. While we also aim to achieve robustness via an augmentation framework, we can use smaller datasets to achieve state-of-the-art performance because our method learns how to augment synthetic data with salient visual information that exists in real data. Note that, because our work focuses on image augmentation outside of the rendering pipeline, it could be used in addition to domain randomization techniques.
II-B. Optimizing Augmentation
In contrast to domain randomization, task-dependent techniques have been proposed to achieve more efficient data augmentation by learning the type and number of image augmentations that are important for algorithm performance. State-of-the-art methods [20, 21, 22]
in this area treat data augmentation as a form of network regularization, selecting a subset of augmentations that optimize algorithm performance for a given dataset and task as the algorithm is being trained. Unlike these methods, we propose that data augmentation can function as a domain adaptation method. Our learning framework is task-independent, and uses physically based augmentation techniques to investigate the visual degrees of freedom (defined by physically-based models) necessary for optimizing network performance from the synthetic to real domain.
II-C. Learned Rendering Pipelines
Several studies have proposed unsupervised, generative learning frameworks that either take the place of a standard rendering engine or complement the rendering engine via post-processing [23, 24, 25] in order to model relevant visual information directly from real images with no dependency on a specific task framework. Some of these frameworks are applied to complex outdoor image datasets, but are designed to learn distributions over simpler spatial features in real images, specifically scene geometry. Other methods, such as [23, 24], attempt to learn low-level pixel features, but are only applied to image sets that are homogeneously structured and low resolution; this may be due to the sensitivity of training adversarial frameworks. Our work focuses specifically on modeling the camera and image processing pipeline rather than scene elements or environmental factors that are specific to a given task, and our method can be applied to high resolution images of complex scenes.
II-D. Impact of Sensor Effects on Deep Learning
Recent work has demonstrated that elements of the image formation and processing pipeline can have a large impact upon the learned representations of deep neural networks across a variety of vision tasks [26, 27, 15, 16]. The majority of methods propose learning techniques that remove these effects from images. As many of these sensor effects can lead to loss of information, correcting for them is non-trivial, potentially unstable, and may result in the hallucination of visual structure in the restored image. In contrast, Carlson et al. [9] demonstrate that significant performance boosts can be achieved by augmenting images using physically-based, sensor effect domain randomization. However, their method requires hand-tuning and evaluation of the visual quality of image augmentation. This human-in-the-loop dependence is inefficient and difficult to scale for large synthetic datasets, and the evaluated visual image quality is subjective. Rather than removing these effects, randomly adding them in, or manually adding them in via a human-in-the-loop, our method learns the style of sensor effects from real data and transfers this sensor style to synthetic images to bridge the synthetic-to-real domain gap.
The objective of the sensor transfer network is to learn the optimal set of augmentations that transfer sensor effects from a real dataset to a synthetic dataset. Our complete Sensor Transfer Network is shown in Figure 2.
III-A. Sensor Effect Augmentation Pipeline
We adopt the sensor effect augmentation pipeline from [9], which is the backbone of the Sensor Transfer Network. Refer to [9] for a detailed discussion of each function and its relationship to the image formation process in a camera. We briefly describe each sensor effect augmentation function below for completeness. The sensor effect augmentation pipeline is a composition of chromatic aberration, Gaussian blur, exposure, pixel-sensor noise, and post-processing color balance augmentation functions:
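The composition of augmentation functions described above can be sketched as a simple function chain. This is a minimal illustration, not the paper's implementation; the stage names and per-stage parameter dictionaries are hypothetical:

```python
def compose(*stages):
    """Chain sensor-effect stages in a fixed order (e.g. chromatic
    aberration, blur, exposure, noise, color cast); each stage takes an
    image plus its own keyword parameters."""
    def pipeline(img, params):
        # params: one dict of keyword arguments per stage, in order
        for stage, p in zip(stages, params):
            img = stage(img, **p)
        return img
    return pipeline

# toy stages standing in for the real augmentation functions
def brighten(img, delta):
    return img + delta

def contrast(img, gain):
    return img * gain

aug = compose(brighten, contrast)
```

Each learned parameter set produced by the generator networks would populate one entry of `params`, so a single call applies the full augmentation chain to one image.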
To model lateral chromatic aberration, we apply translations in 2D pixel space to each of the color channels of an image. To model longitudinal chromatic aberration, we scale the green color channel relative to the red and blue channels of an image by a scaling value S. We combine these parameters into an affine transformation applied to each pixel of each color channel of the image. The augmentation parameters learned for this augmentation function are the green channel scaling S, the red channel translations tx_R and ty_R, the green channel translations tx_G and ty_G, and the blue channel translations tx_B and ty_B.
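A minimal numpy sketch of this effect, assuming integer per-channel shifts and nearest-neighbor resampling for the green-channel scaling (the function and parameter names are illustrative, not from the paper):

```python
import numpy as np

def chromatic_aberration(img, tx, ty, green_scale):
    """Lateral aberration: shift each color channel by (tx[c], ty[c])
    pixels. Longitudinal aberration: scale the green channel about the
    image center. img is an H x W x 3 float array."""
    out = np.empty_like(img)
    for c in range(3):
        out[..., c] = np.roll(img[..., c], shift=(ty[c], tx[c]), axis=(0, 1))
    # resample the green channel about the center by green_scale
    h, w = img.shape[:2]
    ys = np.clip(((np.arange(h) - h / 2) / green_scale + h / 2).astype(int), 0, h - 1)
    xs = np.clip(((np.arange(w) - w / 2) / green_scale + w / 2).astype(int), 0, w - 1)
    out[..., 1] = out[np.ix_(ys, xs)][..., 1]
    return out
```

With zero translations and a unit green scale the transform reduces to the identity, which is the expected behavior for "no aberration" parameters.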
We implement out-of-focus blur, which is modeled by convolving the image with a Gaussian filter. We fix the window size of the kernel to 7. The augmentation parameter learned for this augmentation function is the standard deviation σ of the kernel.
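A minimal sketch of this blur under the fixed window size of 7, implemented as a separable Gaussian convolution with edge padding (the implementation details beyond the kernel size and learned σ are assumptions):

```python
import numpy as np

def gaussian_blur(img, sigma, ksize=7):
    """Separable Gaussian blur of an H x W x 3 image with a fixed window
    size of 7; sigma is the learned standard deviation."""
    r = ksize // 2
    x = np.arange(-r, r + 1)
    k = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    k /= k.sum()                              # normalize the kernel
    h, w = img.shape[:2]
    pad = np.pad(img, ((r, r), (r, r), (0, 0)), mode='edge')
    tmp = sum(k[i] * pad[:, i:i + w, :] for i in range(ksize))   # horizontal pass
    return sum(k[i] * tmp[i:i + h, :, :] for i in range(ksize))  # vertical pass
```

Because the kernel is normalized, a constant image is left unchanged for any σ, which is a quick sanity check on the implementation.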
We use the sigmoid camera response model I = 255 / (1 + e^(-A·S)), where I is image intensity, S models the incoming light intensity, and A is a constant value that describes image contrast. We set A to 0.85. This model is used to re-expose an image as follows: the latent signal S = f^(-1)(I) is recovered, shifted by ΔS, and passed back through the response function to produce the re-exposed image I' = f(S + ΔS).
The augmentation parameter learned for this augmentation function is ΔS, which models changing exposure: a positive ΔS corresponds to increasing the exposure, and a negative ΔS corresponds to decreasing it.
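A minimal sketch of this re-exposure step, assuming intensities normalized to [0, 1] and the sigmoid response f(S) = 1 / (1 + e^(-A·S)) with the contrast constant A = 0.85 from the text:

```python
import numpy as np

def reexpose(img, delta_s, A=0.85):
    """Invert the sigmoid camera-response model to recover the latent
    light signal S, shift it by the learned delta_s, and re-apply the
    response. img holds intensities in [0, 1]."""
    eps = 1e-6
    I = np.clip(img, eps, 1.0 - eps)        # keep the inverse finite
    S = -np.log(1.0 / I - 1.0) / A          # f^{-1}(I)
    return 1.0 / (1.0 + np.exp(-A * (S + delta_s)))
```

A positive `delta_s` brightens the image and a negative one darkens it, matching the sign convention described above.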
We use the Poisson-Gaussian noise model proposed in [12]:

I_noise(x) = I(x) + n_p(I(x)) + n_g(x)

where I(x) is the ground truth image at pixel location x, n_p is the signal-dependent Poisson noise, and n_g is the signal-independent Gaussian noise. The augmentation parameters learned for this augmentation function are the Poisson noise parameter and the Gaussian noise parameter for each color channel, for a total of six parameters.
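A minimal sketch of this noise model, using the standard Gaussian approximation in which the Poisson component contributes signal-dependent variance a·I and the Gaussian component a constant variance b per channel (the parameter names `a` and `b` are illustrative):

```python
import numpy as np

def poisson_gaussian_noise(img, a, b, seed=None):
    """Additive noise with variance a*I (signal-dependent, Poisson-like)
    plus b (signal-independent, Gaussian); a and b hold one value per
    color channel, giving the six learned parameters."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)  # shape (3,), broadcast over H x W
    b = np.asarray(b, dtype=float)
    var = np.clip(a * img + b, 0.0, None)
    return img + rng.normal(0.0, np.sqrt(var))
```

Setting both parameter vectors to zero recovers the clean image, so the model degrades gracefully to "no noise".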
We model post-processing techniques performed by cameras, such as white balancing or gamma transformation, by performing linear translations in LAB color space [31, 32]. The augmentation parameters learned for this augmentation function are the translations in the a (red-green) and b (blue-yellow) channels in normalized LAB space.
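A minimal sketch of this color cast, assuming the image has already been converted to normalized LAB (e.g. via a library routine such as scikit-image's `rgb2lab`, with channels rescaled to [0, 1]); only the two chroma channels are shifted:

```python
import numpy as np

def color_cast(lab_img, shift_a, shift_b):
    """Translate the a (red-green) and b (blue-yellow) channels of an
    H x W x 3 image in normalized LAB space ([0, 1] per channel),
    leaving the lightness channel L untouched."""
    out = lab_img.copy()
    out[..., 1] = np.clip(out[..., 1] + shift_a, 0.0, 1.0)
    out[..., 2] = np.clip(out[..., 2] + shift_b, 0.0, 1.0)
    return out
```

Keeping L fixed means the cast changes hue balance without altering overall brightness, which is the intent of modeling white-balance-style post-processing.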
III-B. Training the Sensor Transfer Network
A high-level overview of a single training iteration for a single sensor effect is given in Figure 3. Each sensor effect augmentation function has its own parameter generator network. The training objective for each of these networks is to learn the distribution over its respective augmentation parameter(s) based upon real data. Each generator network is a two-layer, fully connected neural network. The following steps perform a single training iteration of the Sensor Transfer Network using a single synthetic image. First, a 200-dimensional Gaussian noise vector, z, is generated and paired with the input synthetic image. The noise vector is input into each separate generator network. Each generator network consists of two fully connected layers that together project z into its respective sensor effect parameter space; for example, the blur parameter generator maps z to a value in the σ parameter space. The output sampled parameters, paired with the input synthetic image, are then input into the augmentation pipeline, which outputs an augmented synthetic image. This augmented synthetic image is then paired with a real image, and both are input to the loss function.
We employ a loss function similar to the one used in Johnson et al. [33]. We assume that the layers of the VGG-16 network [34] trained on ImageNet [35] encode relevant style information for salient objects. We fix the weights of the pretrained VGG-16 network and use it to project real and augmented synthetic images into its hidden layer feature spaces. We calculate the style loss, given in Eqn. 6, and use this as the training signal for the parameter generators.
In the above equation, x_r is a real image batch, x_a is an augmented synthetic image batch, G_l(x_r) is the Gram matrix of the feature map of hidden layer l of the pretrained VGG-16 network, and G_l(x_a) is the corresponding quantity for augmented synthetic images. Through performance-based ablation studies, we found that computing the style loss over the first ten layers of VGG-16 gives the best performance. Once calculated, the style loss is backpropagated through the sensor effect augmentation functions to train the sensor effect parameter generators. The above process is repeated with images from the synthetic and real datasets until the style loss has converged.
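The Gram-matrix style loss can be sketched as follows; the normalization constant and the distance over layers are assumptions consistent with the perceptual-loss formulation of Johnson et al., not a transcription of Eqn. 6:

```python
import numpy as np

def gram(feat):
    """Gram matrix of a C x H x W feature map, normalized by its size."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def style_loss(real_feats, aug_feats):
    """Sum over layers of the squared Frobenius distance between the
    Gram matrices of real and augmented-synthetic feature maps."""
    return sum(float(np.sum((gram(r) - gram(a)) ** 2))
               for r, a in zip(real_feats, aug_feats))
```

Because the Gram matrix discards spatial arrangement, this loss compares feature co-occurrence statistics only, which is why it is insensitive to the differing scene layouts of real and synthetic images.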
We train the sensor effect generators concurrently to learn the joint probability distribution over the sensor effect parameters. This is done to capture the dependencies that exist between these effects in a real camera. Once training is complete, we fix the weights of the parameter generators and use them to sample learned parameters to augment synthetic images. Table I shows the statistics of the learned distributions of sensor effect parameters for different real datasets. See Section IV for analysis and discussion of the learned parameters. Note that the style loss was chosen because it is independent of the spatial structure of an image. In effect, the augmentation parameter generators learn to sample the distributions of sensor effects in real data as constrained by the style of the real image domain.
IV-A. Experimental Setup
To verify that the proposed method can transfer the sensor effects of different datasets, we train Sensor Transfer Networks using the following synthetic and real benchmark datasets. GTASim10k [4, 5] is comprised of 10,000 highly photorealistic synthetic images collected from the Grand Theft Auto (GTA) rendering engine; it captures different weather conditions and times of day. The Cityscapes [3] training image set is comprised of 2975 real images collected in over 50 cities across Germany. The KITTI training set [1, 2] is comprised of 7481 real images collected in Karlsruhe, Germany. We train one Sensor Transfer Network to transfer the sensor style of the KITTI training set to GTASim10k, referred to as GTASim10kKITTI, and another to transfer the sensor style of the Cityscapes training set to GTASim10k, referred to as GTASim10kCityscapes. To train each Sensor Transfer Network, we use a batch size of 1 and a learning rate of . We trained each network for 4 epochs. For all experiments, we compare our results to the Sensor Effect Domain Randomization of GTASim10k as a baseline. To generate the Sensor Effect Domain Randomization augmentations, we used the same human-selected parameter ranges as in [9].
IV-B. Evaluation of Learned Sensor Effect Augmentations
Qualitatively, as observed in Figure 1, KITTI images feature more pronounced visual distortions due to blur, over-exposure, and a blue color cast. Cityscapes, on the other hand, has a darker, more under-exposed visual style.
Figure 4 shows examples of unaugmented GTASim10k images in the center columns, the same images augmented by GTASim10kKITTI in the left column, and the same images augmented by GTASim10kCityscapes in the right column. When compared to Figure 1, it appears that, for both GTASim10kKITTI and GTASim10kCityscapes, realistic aspects of exposure, noise, and color cast are transferred to GTASim10k. The statistics of the learned parameter values are given in Table I. In general, the selected parameter values generate augmented synthetic images whose style matches the real datasets. We hypothesize that the color shift for GTASim10kCityscapes is not as strong as for GTASim10kKITTI because there is a more even distribution of sky and buildings in Cityscapes, whereas KITTI has a significant number of instances of sky. Interestingly, the blur parameter, σ, did not converge and was pushed towards zero for both GTASim10kCityscapes and GTASim10kKITTI. This suggests that Gaussian blur does not match the blur present in real images. Further research could consider more accurate models of blur, such as motion blur.
IV-C. Impact of Learned Sensor Transformation on Object Detection for Benchmark Datasets
To evaluate whether the Sensor Transfer Network adds visual information that is salient for vision tasks in the real image domain, we train an object detection network on unaugmented and augmented synthetic data and evaluate its performance on the real data domains, KITTI and Cityscapes. We chose Faster R-CNN [36] as our base network for 2D object detection. Faster R-CNN achieves relatively high performance on the KITTI benchmark dataset, and many state-of-the-art object detection networks that improve upon these results still use Faster R-CNN as their base architecture.
We compare against Faster R-CNN networks trained on unaugmented GTASim10k and on GTASim10k augmented with the Sensor Effect Domain Randomization of Carlson et al. [9] as our baselines. To create augmented training datasets, we combine the unaugmented GTASim10k with varying amounts of augmented GTASim10k data. For all datasets, both augmented and unaugmented, we trained each Faster R-CNN network for 10 epochs using two Titan X Pascal GPUs in order to control for potential confounds between performance and training time. We evaluate the Faster R-CNN networks on either the KITTI training dataset or the Cityscapes training dataset, depending on the Sensor Transfer Network used for training dataset augmentation. Each dataset is converted into Pascal VOC 2012 format to standardize training and evaluation, and performance values are the VOC AP50 reported for the car class [37].
Table II shows the object detection results for the proposed method in comparison to the baselines. In general, the addition of sensor effect augmentations has a positive boost on Faster R-CNN performance for training on GTASim10k and testing on Cityscapes. Our proposed method, for both GTASim10kCityscapes and GTASim10kKITTI, achieves the best performance over both the baseline and Sensor Effect Domain Randomization.
To evaluate the impact of Sensor Transfer on the number of synthetic training images required for maximal object detection performance, we trained Faster R-CNNs on datasets comprised of the 10k unaugmented GTASim10k images combined with either 2k, 5k, 8k, or 10k augmented images. Figure 5 captures the effect of an increasing number of augmentations on Faster R-CNN performance. We see that, compared to the Sensor Effect Domain Randomization method, fewer training images are required when using Sensor Transfer augmentation for both GTASim10kKITTI and GTASim10kCityscapes. Our results indicate that learning the augmentation parameters allows us to train on significantly smaller datasets without compromising performance. This demonstrates that we model salient visual information more efficiently than domain randomization. Interestingly, the Sensor Effect Domain Randomization method does worse than the baseline across all levels of augmentation when tested on KITTI. We expect that this is because the human-chosen parameter ranges, shown in the bottom row of Table I, do not generalize well when adapting GTASim10k to KITTI even though they may generate visually realistic images. One reason for this is that the visually realistic parameter ranges selected in [9] were chosen using a GTA dataset of all daytime images, whereas GTASim10k contains an even representation of daytime and nighttime images. This further demonstrates the importance of learning sensor effect parameter distributions constrained by how they affect the styles of both the real and synthetic image datasets.
V. Discussion and Conclusions
In general, our results show that the proposed Sensor Transfer Network reduces the synthetic-to-real domain gap more effectively and more efficiently than domain randomization. Future work includes increasing the complexity and realism of the Sensor Transfer augmentation pipeline by modeling additional sensor effects, as well as implementing models that better capture the pixel statistics of real images, such as motion or defocus blur. Other avenues include investigating the impact of task performance and problem space on sensor effect parameter selection, and evaluating how the proposed method impacts performance when training on synthetic datasets rendered with various levels of photorealism.
-  A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 3354–3361.
-  J. Fritsch, T. Kuehnl, and A. Geiger, “A new performance measure and evaluation benchmark for road detection algorithms,” in International Conference on Intelligent Transportation Systems (ITSC), 2013.
-  M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223.
-  S. R. Richter, V. Vineet, S. Roth, and V. Koltun, “Playing for data: Ground truth from computer games,” in European Conference on Computer Vision (ECCV), ser. LNCS, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds., vol. 9906. Springer International Publishing, 2016, pp. 102–118.
-  M. Johnson-Roberson, C. Barto, R. Mehta, S. N. Sridhar, K. Rosaen, and R. Vasudevan, “Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks?” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 746–753.
-  H. Zhang, V. Sindagi, and V. M. Patel, “Image de-raining using a conditional generative adversarial network,” arXiv preprint arXiv:1701.05957, 2017.
-  V. Veeravasarapu, C. Rothkopf, and R. Visvanathan, “Adversarially tuned scene generation,” arXiv preprint arXiv:1701.00405, 2017.
-  C. Sakaridis, D. Dai, S. Hecker, and L. Van Gool, “Model adaptation with synthetic and real data for semantic dense foggy scene understanding,” arXiv preprint arXiv:1808.01265, 2018.
-  A. Carlson, K. A. Skinner, and M. Johnson-Roberson, “Modeling camera effects to improve deep vision for real and synthetic data,” arXiv preprint arXiv:1803.07721, 2018.
-  M. D. Grossberg and S. K. Nayar, “Modeling the space of camera response functions,” vol. 26, no. 10. IEEE, 2004, pp. 1272–1282.
-  F. Couzinie-Devy, J. Sun, K. Alahari, and J. Ponce, “Learning to estimate and remove non-uniform image blur,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 1075–1082.
-  A. Foi, M. Trimeche, V. Katkovnik, and K. Egiazarian, “Practical poissonian-gaussian noise modeling and fitting for single-image raw-data,” vol. 17, no. 10. IEEE, 2008, pp. 1737–1754.
-  A. Andreopoulos and J. K. Tsotsos, “On sensor bias in experimental methods for comparing interest-point, saliency, and recognition algorithms,” vol. 34, no. 1. IEEE, 2012, pp. 110–126.
-  S. Song, S. P. Lichtenberg, and J. Xiao, “Sun rgb-d: A rgb-d scene understanding benchmark suite,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 567–576.
-  S. Dodge and L. Karam, “Understanding how image quality affects deep neural networks,” in Quality of Multimedia Experience (QoMEX), 2016 Eighth International Conference on. IEEE, 2016, pp. 1–6.
-  C. Doersch, A. Gupta, and A. A. Efros, “Unsupervised visual representation learning by context prediction,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1422–1430.
-  Y. Zhang, S. Song, E. Yumer, M. Savva, J.-Y. Lee, H. Jin, and T. Funkhouser, “Physically-based rendering for indoor scene understanding using convolutional neural networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 5057–5065.
-  J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain randomization for transferring deep neural networks from simulation to the real world,” in Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on. IEEE, 2017, pp. 23–30.
-  J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, and S. Birchfield, “Training deep networks with synthetic data: Bridging the reality gap by domain randomization,” arXiv preprint arXiv:1804.06516, 2018.
-  M. Paulin, J. Revaud, Z. Harchaoui, F. Perronnin, and C. Schmid, “Transformation pursuit for image classification,” in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014, pp. 3646–3653.
-  E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le, “Autoaugment: Learning augmentation policies from data,” arXiv preprint arXiv:1805.09501, 2018.
-  J. Lemley, S. Bazrafkan, and P. Corcoran, “Smart augmentation learning an optimal data augmentation strategy.” IEEE Access, vol. 5, pp. 5858–5869, 2017.
-  L. Sixt, B. Wild, and T. Landgraf, “Rendergan: Generating realistic labeled data,” arXiv preprint arXiv:1611.01331, 2016.
-  A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb, “Learning from simulated and unsupervised images through adversarial training,” arXiv preprint arXiv:1612.07828, 2016.
-  S. Huang and D. Ramanan, “Expecting the unexpected: Training detectors for unusual pedestrians with adversarial imposters,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 00, pp. 4664–4673, 2017.
-  C. Kanan and G. W. Cottrell, “Color-to-grayscale: does the method matter in image recognition?” PloS one, vol. 7, no. 1, p. e29740, 2012.
-  S. Diamond, V. Sitzmann, S. Boyd, G. Wetzstein, and F. Heide, “Dirty pixels: Optimizing image classification architectures for raw sensor data,” 2017.
-  H. Cheong, E. Chae, E. Lee, G. Jo, and J. Paik, “Fast image restoration for spatially varying defocus blur of imaging sensor,” Sensors, vol. 15, no. 1, pp. 880–898, 2015.
-  S. A. Bhukhanwala and T. V. Ramabadran, “Automated global enhancement of digitized photographs,” IEEE Transactions on Consumer Electronics, vol. 40, no. 1, pp. 1–10, 2 1994.
-  G. Messina, A. Castorina, S. Battiato, and A. Bosco, “Image quality improvement by adaptive exposure correction techniques,” in Multimedia and Expo, 2003. ICME ’03. Proceedings. 2003 International Conference on, vol. 1, 7 2003, pp. 549–552.
-  R. S. Hunter, “Accuracy, precision, and stability of new photoelectric color-difference meter,” in Journal of the Optical Society of America, vol. 38, no. 12, 1948, pp. 1094–1094.
-  S. Annadurai, “Fundamentals of digital image processing.” Pearson Education India, 2007.
-  J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision. Springer, 2016, pp. 694–711.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014. [Online]. Available: http://arxiv.org/abs/1409.1556
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. Ieee, 2009, pp. 248–255.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
-  M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results,” http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.