Mimicking the In-Camera Color Pipeline for Camera-Aware Object Compositing

03/27/2019 ∙ by Jun Gao, et al. ∙ University of Toronto ∙ Microsoft ∙ Peking University

We present a method for compositing virtual objects into a photograph such that the object colors appear to have been processed by the photo's camera imaging pipeline. Compositing in such a camera-aware manner is essential for high realism, and it requires the color transformation in the photo's pipeline to be inferred, which is challenging due to the inherent one-to-many mapping that exists from a scene to a photo. To address this problem for the case of a single photo taken from an unknown camera, we propose a dual-learning approach in which the reverse color transformation (from the photo to the scene) is jointly estimated. Learning of the reverse transformation is used to facilitate learning of the forward mapping, by enforcing cycle consistency of the two processes. We additionally employ a feature sharing schema to extract evidence from the target photo in the reverse mapping to guide the forward color transformation. Our dual-learning approach achieves object compositing results that surpass those of alternative techniques.




1 Introduction

Compositing virtual objects into real photographs, such as adding a streetlamp in front of a building, is a common feature in interactive applications such as augmented reality. While this can be done with current computer vision technology, making the composited object look realistic remains a challenge. Even with a highly detailed object model and known illumination conditions, the object’s appearance in a photo can look unnatural because its colors do not conform with the rest of the scene, as shown in Figure 1 (left). The colors in a photograph are a result not only of the scene content, but also of the imaging pipeline in the camera, which may include color filters, white balancing, and dynamic range compression. For attaining high realism, a composited virtual object needs to undergo the same color transformations as the rest of the image, so that it can blend seamlessly into the photo, as exemplified in Figure 1 (right).

Figure 1: Compositing of a virtual streetlamp. Left: Directly compositing the object model. Right: Compositing the object model via the proposed color translation network.

If the camera that took the photo is available, one could capture a collection of aligned RAW-JPEG image pairs (RAW: minimally processed image data from the sensor of a camera) and train a network to map RAW to JPEG [23], as the virtual object is in the RAW domain. However, for broad applicability, it is desirable to composite virtual objects into photos taken by unknown cameras for which we have no training data. Since there exist many possible color transformations from RAW to JPEG due to the broad diversity of camera imaging pipelines, finding the specific one that generated a given photo, without having the camera, is a non-trivial problem. This is the core challenge addressed in our work.

To address this issue, we present a deep dual-learning approach in which images are bidirectionally transformed between the JPEG colors produced by the imaging device and a canonical RAW color space in which the virtual object is represented. In particular, to aid in learning the RAW-to-JPEG transformation (primal network), we simultaneously learn a JPEG-to-RAW transformation (dual network), which is more practical to learn, since it represents a one-to-one mapping (i.e., a JPEG image captures a specific scene). Furthermore, there exist many objects such as grass, sky, and human skin that span a limited range of colors in natural scenes and thus provide strong constraints on this mapping. To facilitate learning of the coupled networks, we employ cycle consistency [31, 8], in which translating JPEG to RAW and back to JPEG should be an identity mapping. However, after training, as the primal network itself is still a deterministic function, it can only represent a one-to-one mapping. We thus propose a feature sharing scheme where features extracted by the dual network from the JPEG are passed to the primal network to guide the RAW-to-JPEG transformation.

Given an object model and a target JPEG image at test time, our system feeds the JPEG into the dual network to obtain an image in the canonical RAW color space and the corresponding shared features. The object is rendered into the image, which, together with the shared features, is then input into the primal network to generate the compositing result. Although there exist many camera-dependent ways in which a RAW image can be translated to JPEG, the color translation in the primal network is determined by the original JPEG image through the neural features extracted and shared by the dual network. Done this way, the primal network generates a result that mimics the color imaging pipeline of the JPEG input, without needing training data from its camera.

A user study shows that our proposed approach leads to compositing results that are perceptually more coherent than those from common baseline techniques. An empirical examination of different variants of this approach is presented as well.

2 Related Work

Image Pipeline Modeling

Physics-based computer vision methods such as shape-from-shading require measurements of scene radiance that are physically accurate. Towards obtaining accurate measurements from photographs, the imaging pipeline of cameras has been modeled and used to undo the effects of in-camera processing. Many techniques have been proposed for modeling a particular component of an imaging pipeline, such as tone mapping [4, 14, 20] or white balancing [1, 10, 28]. More comprehensive are works that aim to model the sequence of processing operations that occur within an imaging device [3, 13]. Recently, a deep neural network was presented for modeling the scene-dependent color processing of a given camera, where RAW-JPEG image pairs are captured from the camera for training [23]. In our work, we utilize this deep network for modeling color transformations in the imaging pipeline, but infer the model using only a single photograph from an unknown camera. This inference from a single image is made possible through the use of contextual color priors on common scene objects and our proposed dual-learning approach with a feature sharing schema.

Image Compositing

For increasing the realism of objects composited into photographs, methods have been presented for estimating scene illumination [12, 18] and for recovering camera distortions such as those resulting from sensor noise and motion blur [6], or caused by the camera’s lens and rolling shutter [17]. In contrast to these previous techniques, our work seeks to heighten realism by estimating and applying the in-camera color processing to composited objects, and thus is complementary to this prior research. Moreover, unlike the methods that model imaging distortions [6, 17], which require access to the camera for calibrating these effects, our method is specifically developed not to need the camera at hand, so that it can be applied to arbitrary images.

Image-to-Image Translation

Many image processing problems can be viewed as translating an input image into an output image that exhibits a different representation of the scene. A general framework for this translation problem was introduced using a Generative Adversarial Network (GAN) that learns this mapping from a training set of aligned image pairs from the two domains [11]. To relax the requirement of paired training data, recent methods have exploited the duality in the image translation problem by jointly learning an additional GAN that maps images from the output domain to the input domain while enforcing a cycle-consistency constraint in which an image mapped from the input domain to the output domain and then back to the input domain should yield the original input [31, 21, 15]. Through this coupling of GANs, the training data need not be paired, but rather it is sufficient to have independent sets of images in each of the two domains.

Modeled by a deterministic network, the translation learned in these prior works is a one-to-one mapping, where an image in one domain corresponds to a specific image in the other domain, and vice versa. By contrast, our work deals with a one-to-many mapping (RAW-to-JPEG) that arises from the differences in imaging pipelines among different in-camera processes, and we focus on how to determine the correct transformation for the one-to-many mapping.

3 Dual Learning for Object Compositing

Our approach proceeds as follows. We first feed the target JPEG image into the dual network (JPEG-to-RAW) in order to translate it into a demosaiced RAW image in a canonical color space. The virtual object, also represented in the canonical space, is then rendered under the estimated lighting conditions and composited into the RAW image. Here, we utilize an existing technique for illumination estimation [9] and focus on the composition task. The compositing result is then obtained by passing the composited RAW image through the primal network (RAW-to-JPEG). An overview of this process is illustrated in Figure 2.

We first introduce the canonical color space and a method for transforming a specific camera’s RAW image colors to this space in Sec. 3.1. Then we present the primal and dual networks and their training algorithms in Sec. 3.2. The object compositing method based on these networks is described in Sec. 3.3.

3.1 Canonical Color Space

Our system translates image colors between the input JPEG image and a canonical color space in which a virtual object can be represented. As part of learning color translations to and from this canonical space, we capture RAW images from multiple cameras, and transform the camera-dependent RAW image colors to the canonical space through color camera calibration.

Color calibration of cameras is conventionally performed using a ColorChecker chart, which contains patches of known colors [25]. For a RAW image taken of a ColorChecker chart, we thus know the RAW image colors of the patches and their corresponding colors in various color spaces. In this paper, we choose linear sRGB as the canonical color space. With the correspondence between RAW and sRGB colors, the RAW-to-sRGB color transformation can be expressed as follows:

$$y = f(x),$$

where $y$ denotes colors in linear sRGB, $x$ represents the corresponding colors in the RAW image, and $f$ is a mapping function. To model $f$, we utilize a linear transformation $f(x) = Tx$, which has been found to give the best performance among several candidate models for color calibration [24]. The mapping is optimized via least squares fitting after subtracting the black level value from the RAW color values.
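The least squares fit described above can be sketched in a few lines. This is only an illustration under stated assumptions: the transformation is taken to be a 3x3 matrix acting on black-level-subtracted RAW colors, a single scalar black level is used, and the function names (`fit_color_matrix`, `apply_color_matrix`, `solve_linear_system`) are hypothetical rather than taken from the paper.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gauss-Jordan elimination with partial pivoting.

    a is an n x n matrix and b an n x m matrix, both given as lists of rows.
    """
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        # Pick the row with the largest pivot to keep the elimination stable.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_color_matrix(raw_patches, srgb_patches, black_level):
    """Least squares fit of a 3x3 matrix T with srgb ~= T * (raw - black_level).

    Assumption: one scalar black level shared by all channels.
    """
    x = [[c - black_level for c in patch] for patch in raw_patches]
    # Normal equations (X^T X) T^T = X^T Y, with patches stored as rows.
    xtx = [[sum(p[i] * p[j] for p in x) for j in range(3)] for i in range(3)]
    xty = [[sum(p[i] * q[j] for p, q in zip(x, srgb_patches)) for j in range(3)]
           for i in range(3)]
    t_transposed = solve_linear_system(xtx, xty)
    return [[t_transposed[j][i] for j in range(3)] for i in range(3)]


def apply_color_matrix(t, raw_color, black_level):
    """Map one RAW color to the canonical space with the fitted matrix."""
    shifted = [c - black_level for c in raw_color]
    return [sum(t[i][k] * shifted[k] for k in range(3)) for i in range(3)]
```

In practice the patch colors would come from a photographed ColorChecker chart; at least three patches with linearly independent colors are needed for the normal equations to be solvable.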

Figure 2: Overview of object compositing using the learned primal network $G_{R \to J}$ (RAW-to-JPEG) and dual network $G_{J \to R}$ (JPEG-to-RAW).

3.2 Image Translation

Mapping of an image between the canonical RAW and the photo’s JPEG domain can be modeled as an image translation problem, which has been widely studied for applications including image colorization and super-resolution [31, 19, 30]. For the two mappings, we train two networks denoted as $G_{R \to J}$ and $G_{J \to R}$, the first for RAW-to-JPEG prediction and the other for estimating JPEG-to-RAW.

Figure 3: Network Architecture. For $G_{R \to J}$ (RAW-to-JPEG) and $G_{J \to R}$ (JPEG-to-RAW), we first extract multi-scale histogram features from the input image, which are further processed with three convolutional layers to predict the image in the target domain. We use average pooling to extract the shared features from $G_{J \to R}$ and propagate them to $G_{R \to J}$ via a single fully connected layer followed by repetition.

| Layer | Output Size | Layer | Output Size |
|-------|-------------|-------|-------------|
| Hist  | (3+3)*h*w   | Hist  | (3+3)*h*w   |
| Conv1 | 128*h*w     | Conv1 | 128*h*w     |
| Conv2 | 128*h*w     | Conv2 | 128*h*w     |
| Conv3 | 3*h*w       | Conv3 | 3*h*w       |
| FC    | 128         |       |             |

Table 1: Network Configuration. Image size of h*w.

3.2.1 Network Architecture

The structure of our network is illustrated in Figure 3, with the network configuration details given in Table 1. Our networks $G_{R \to J}$ (RAW-to-JPEG) and $G_{J \to R}$ (JPEG-to-RAW) are adopted from the Multiscale Learnable Histogram network in [23], which achieves state-of-the-art performance on radiometric calibration. The networks first extract color histogram features from the input image with learnable bin centers and widths. The histograms are then computed within a multi-scale pyramid, allowing global and local context to be extracted and combined for each pixel. The stacked histograms and images are fed into a three-layer convolutional neural network to predict the output image. Additionally, $G_{J \to R}$ produces a feature vector that encodes the global color transformation properties of the JPEG photo and is forwarded to $G_{R \to J}$ to aid in predicting the final JPEG image. The design of this feature sharing scheme from $G_{J \to R}$ to $G_{R \to J}$ will be described in Sec. 3.2.2.

Note that other image translation models could potentially be used for $G_{R \to J}$ and $G_{J \to R}$. A model should ideally satisfy two properties: (1) the network should be able to account for the high-level global semantic content of the image so that objects which can constrain the mapping (i.e., those with a restricted range of natural colors) are all jointly considered in determining a color transformation; (2) it should be able to extract low-level local color information that reflects the properties of the color transformations in the imaging pipeline. We found multi-scale histogram pyramids to be effective at capturing these two types of information in images. We also tried deep encoder-decoder networks with skip connections [27]. This yielded worse results, likely because deeper networks are better at extracting high-level semantics but discard low-level information.

3.2.2 Feature Sharing

As mentioned in Sec. 1, RAW-to-JPEG color translation is dependent on the imaging pipeline. It thus requires information related to the pipeline’s color processing that produced the input JPEG photo. For this, we extract features from a hidden layer in the JPEG-to-RAW network $G_{J \to R}$ and share them with the RAW-to-JPEG network $G_{R \to J}$. Hidden layers in $G_{J \to R}$ can provide clues on the color processing, as $G_{J \to R}$ seeks to separate the JPEG color processing from the intrinsic colors of objects in the scene. Let us denote the output of the hidden layer as $h$. When passing $h$ from $G_{J \to R}$ to $G_{R \to J}$, we transform it by a function $\phi$:

$$s = \phi(h),$$

where $s$ denotes the features received by $G_{R \to J}$. For $\phi$, we employ average pooling followed by a fully connected layer. Effective in extracting global features [26], average pooling facilitates inference of global color transformations. The fully connected layer learns to adapt the features so that they become compatible with the feature space of $G_{R \to J}$. In $G_{R \to J}$, $s$ is repeated across the spatial dimensions to be consistent in size with the feature map in $G_{R \to J}$, so that it can be easily concatenated with the feature map and processed by the convolutional layers.
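The pooling, fully connected, and repetition steps can be illustrated with a small dependency-free sketch. It is not the authors' PyTorch code: feature maps are plain nested lists indexed as `[channel][row][column]`, and the weights, bias, and function names are hypothetical placeholders.

```python
def average_pool(feature_map):
    """Global average pooling: one mean value per channel."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def fully_connected(vector, weights, bias):
    """Affine layer: weights has one row per output feature."""
    return [
        sum(w * v for w, v in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


def repeat_spatially(vector, height, width):
    """Tile each feature over a height x width grid for concatenation."""
    return [[[v] * width for _ in range(height)] for v in vector]


def share_features(hidden, weights, bias, height, width):
    """Pool the hidden feature map, adapt it, and tile it spatially."""
    pooled = average_pool(hidden)
    shared = fully_connected(pooled, weights, bias)
    return shared, repeat_spatially(shared, height, width)
```

The tiled output has the same spatial size as the receiving feature map, so the two can be concatenated along the channel axis before the next convolution.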

Along the processing hierarchy of our JPEG-to-RAW network $G_{J \to R}$, the feature maps should provide an image representation increasingly sensitive to the canonical RAW colors. In contrast to the later layers, the earlier layers more closely represent the JPEG coloring effects of the imaging pipeline. We thus choose the feature map after the first convolutional layer in $G_{J \to R}$ as the hidden-layer output $h$. Related concepts have been used for style transfer, where style information is extracted from earlier layers while content-related features are obtained from later layers [7, 22]. These shared features are inserted into the RAW-to-JPEG network $G_{R \to J}$ at its first convolutional layer, so that this information can be accounted for throughout the subsequent layers. The whole model can be expressed as:

$$(\hat{I}_R, s) = G_{J \to R}(I_J), \qquad \hat{I}_J = G_{R \to J}(I_R, s),$$

where $\hat{I}_R$ and $\hat{I}_J$ denote the predicted RAW and JPEG images, respectively, $I_R$ and $I_J$ are the input RAW and JPEG images, and $s$ denotes the shared features.

3.2.3 Training Loss

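To optimize the network parameters, the most straightforward loss function is the mean-squared error between the target images and predicted images of the two networks:

$$\mathcal{L}_{rec} = \mathrm{MSE}(\hat{I}_R, I_R) + \mathrm{MSE}(\hat{I}_J, I_J),$$

where $\hat{I}_R$ and $\hat{I}_J$ are the predicted RAW and JPEG images and $I_R$ and $I_J$ are the corresponding target images. However, at inference time, error will accumulate along the processing hierarchy as the JPEG image passes through the JPEG-to-RAW network $G_{J \to R}$ and then through the RAW-to-JPEG network $G_{R \to J}$. To reduce such error, we encourage cycle consistency [31, 8]:

$$G_{R \to J}(G_{J \to R}(I_J)) \approx I_J,$$

where a JPEG image passed through the JPEG-to-RAW and RAW-to-JPEG networks should yield a predicted image identical to the original JPEG input. This is done by adding the following term to the loss function:

$$\mathcal{L}_{cyc} = \mathrm{MSE}(G_{R \to J}(G_{J \to R}(I_J)), I_J).$$

A hyperparameter $\lambda$ is introduced to balance the reconstruction loss and the cycle consistency constraint, giving us the overall loss function:

$$\mathcal{L} = \mathcal{L}_{rec} + \lambda \mathcal{L}_{cyc}.$$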
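As a rough illustration of how the reconstruction and cycle terms combine, the following sketch computes the overall loss for images flattened to lists of floats. The function names and the flattened representation are assumptions made for brevity; an actual implementation would operate on tensors inside the training loop.

```python
def mse(prediction, target):
    """Mean-squared error between two equally sized flat images."""
    if len(prediction) != len(target):
        raise ValueError("images must have the same number of values")
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)


def total_loss(pred_raw, raw, pred_jpeg, jpeg, cycle_jpeg, lam=1.0):
    """Reconstruction loss of both networks plus a weighted cycle term.

    pred_raw:   output of the JPEG-to-RAW network for the input JPEG
    pred_jpeg:  output of the RAW-to-JPEG network for the ground-truth RAW
    cycle_jpeg: JPEG obtained by passing the input JPEG through both networks
    lam:        weight balancing reconstruction and cycle consistency
    """
    reconstruction = mse(pred_raw, raw) + mse(pred_jpeg, jpeg)
    cycle = mse(cycle_jpeg, jpeg)
    return reconstruction + lam * cycle
```

Setting `lam` to 0 recovers plain supervised training of the two networks, which is a convenient way to reproduce the no-cycle baseline.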


3.3 Object Compositing

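To composite a synthetic object into a JPEG photo, we first render the object with lighting estimated using the online demo (http://rachmaninoff.gel.ulaval.ca:8000/) provided by [9]. Its RGB values $O$ in the canonical color space and image mask $M$ are obtained with the Blender renderer (https://www.blender.org/). At the same time, we also feed the JPEG photo into the JPEG-to-RAW network $G_{J \to R}$ to get the corresponding RAW image $\hat{I}_R$ and the shared feature vector $s$. We then composite the rendered object with the RAW image using the mask $M$:

$$I_R^{c} = M \odot O + (1 - M) \odot \hat{I}_R, \qquad (\hat{I}_R, s) = G_{J \to R}(I_J),$$

where $\odot$ denotes the Hadamard product and $I_J$ is the input JPEG photo. Subsequently, we pass $I_R^{c}$ and $s$ to the RAW-to-JPEG network $G_{R \to J}$ and obtain the predicted JPEG image $\hat{I}_J^{c}$:

$$\hat{I}_J^{c} = G_{R \to J}(I_R^{c}, s).$$

The final composited JPEG image is computed as:

$$I_J^{c} = M \odot \hat{I}_J^{c} + (1 - M) \odot I_J.$$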
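The compositing steps can be traced with a toy example. In this sketch, images are flat lists of floats, the two networks are passed in as callables, and the stand-in networks used in the usage note (a pure gamma curve and its inverse) are hypothetical; they only serve to make the data flow visible.

```python
def blend(mask, foreground, background):
    """Element-wise blend: mask * foreground + (1 - mask) * background."""
    return [m * f + (1.0 - m) * b for m, f, b in zip(mask, foreground, background)]


def composite(jpeg, rendered_object, mask, dual_net, primal_net):
    """Insert a rendered object into a JPEG photo via the RAW domain.

    dual_net(jpeg) must return (predicted_raw, shared_features);
    primal_net(raw, shared_features) must return a predicted JPEG.
    """
    raw_pred, shared = dual_net(jpeg)
    # Composite in the canonical RAW space, where the object is rendered.
    raw_composite = blend(mask, rendered_object, raw_pred)
    jpeg_pred = primal_net(raw_composite, shared)
    # Keep the original photo outside the object mask.
    return blend(mask, jpeg_pred, jpeg)


def toy_dual_net(jpeg, gamma=2.2):
    """Hypothetical stand-in: undo a gamma curve and share the gamma value."""
    return [v ** gamma for v in jpeg], gamma


def toy_primal_net(raw, shared_gamma):
    """Hypothetical stand-in: reapply the gamma curve that was shared."""
    return [v ** (1.0 / shared_gamma) for v in raw]
```

Calling `composite([0.5, 0.8, 0.2], [0.0, 0.25, 0.0], [0.0, 1.0, 0.0], toy_dual_net, toy_primal_net)` leaves the two background pixels untouched and gamma-encodes only the masked object pixel.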


4 Experiments

In this section, we extensively evaluate our image translation system. As the object compositing largely relies on the quality of image translation, we first focus our evaluation on various alternative network configurations in Sec. 4.3. The system is further qualitatively validated on compositing results through comparisons to alternative approaches and by conducting user studies in Sec. 4.4.

4.1 Data Collection

To train the coupled networks $G_{R \to J}$ and $G_{J \to R}$, we manually collected 683 RAW-JPEG pairs using a Sony α-5100 camera. All photos were acquired with the camera set to auto mode, which results in various color transformation pipelines depending on the scene. Our dataset contains various kinds of scenes including outdoor, indoor, landscape, and portrait. Some examples are shown in the Supplementary Material. We additionally utilize the Canon 5D Mark III dataset from [23], which contains 645 RAW-JPEG image pairs of various scenes. Although two additional datasets are provided in [23], we do not use them because we lack access to those camera models for color calibration.

To further diversify the training data, we augment each of the two datasets by simulating various simple pipelines on the RAW images, specifically by applying random RGB rescalings, saturation level adjustments, and a gamma correction from a set of ten common samples. Details on the data augmentation are given in the Supplementary Material.
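A simulated pipeline of this kind might look as follows for a single linear RGB pixel in [0, 1]. The exact operations and parameter ranges are given only in the Supplementary Material, so everything here is an assumption: the saturation model (scaling each channel's distance from the channel mean), the clipping, and the function name are illustrative choices, not the authors' procedure.

```python
def simulate_pipeline(rgb, gains, saturation, gamma):
    """Apply per-channel gains, a saturation change, and gamma encoding.

    rgb:        linear RGB triple with values in [0, 1]
    gains:      per-channel scale factors (random RGB rescaling)
    saturation: 1.0 keeps colors, 0.0 turns them gray, > 1.0 boosts them
    gamma:      encoding gamma; the output is value ** (1 / gamma)
    """
    def clip(v):
        return min(1.0, max(0.0, v))

    scaled = [clip(c * g) for c, g in zip(rgb, gains)]
    gray = sum(scaled) / 3.0
    # Assumed saturation model: move each channel away from or toward gray.
    saturated = [clip(gray + saturation * (c - gray)) for c in scaled]
    return [c ** (1.0 / gamma) for c in saturated]
```

Sampling `gains`, `saturation`, and `gamma` at random for each RAW image would then yield additional synthetic RAW-JPEG training pairs.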

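| Sharing schema  | Pooling | Cycle loss | RAW→JPEG    | JPEG→RAW    | Cycle(JPEG) |
|-----------------|---------|------------|-------------|-------------|-------------|
| No Sharing [23] | -       | no         | 26.29/24.25 | 34.92/32.42 | 24.49/25.33 |
| No Sharing [23] | -       | yes        | 26.03/24.00 | 34.40/32.39 | 31.72/32.37 |
| Sharing-Conv2   | Max     | no         | 28.65/25.26 | 34.02/32.42 | 26.57/27.94 |
| Sharing-Conv2   | Avg     | no         | 28.80/25.44 | 34.05/32.57 | 26.74/28.06 |
| Sharing-Conv2   | Max     | yes        | 28.46/24.99 | 34.09/32.41 | 31.70/33.06 |
| Sharing-Conv2   | Avg     | yes        | 28.64/25.09 | 34.20/32.30 | 31.81/33.09 |
| Sharing-Conv1   | Max     | no         | 30.84/26.07 | 34.36/32.59 | 26.88/28.18 |
| Sharing-Conv1   | Avg     | no         | 31.07/26.16 | 34.34/32.60 | 27.24/28.35 |
| Sharing-Conv1   | Max     | yes        | 30.73/25.75 | 34.35/32.48 | 31.84/33.38 |
| Sharing-Conv1   | Avg     | yes        | 30.83/25.98 | 34.16/32.55 | 32.06/33.64 |

Table 2: Comparisons with different network configurations. The two values in each cell represent PSNR values for Canon/Sony images. Cycle(JPEG) denotes results where we feed the output of the JPEG-to-RAW network $G_{J \to R}$ to the RAW-to-JPEG network $G_{R \to J}$ and get the predicted JPEG images. Bold text indicates the best performance.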

4.2 Experimental Settings

We first calibrate demosaiced RAW images from both datasets into the canonical color space using the estimated transformation function described in Sec. 3.1. For the Sony α-5100 dataset that we collected, we set aside a portion of the images for testing and split the remaining images into training and validation sets. For the Canon 5D Mark III dataset, we use the same training, validation, and test configuration as in [23]. Note that none of the images generated using a simulated pipeline were used as test images. Considering the relatively small size of the datasets that we have, we further augment the data during training. Each RAW-JPEG pair is first randomly flipped left-right and up-down, each with probability 0.5. Then we crop the image with a randomly generated square bounding box, which is obtained by first randomly choosing its upper-left corner location and then randomly selecting the box length, from 128px to the maximum length that does not extend beyond the image. The crops are resized to 256px × 256px to facilitate batch training. In testing, the whole image can be fed into the networks, which can accept input images of arbitrary size.
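The crop sampling can be sketched as below. One detail is an assumption on our part: the upper-left corner is drawn only from positions that still leave room for a 128px box, since otherwise the minimum side length could not be honored. The function names are hypothetical, and the same box and flips would be applied to both images of a RAW-JPEG pair to keep them aligned.

```python
import random


def random_square_crop_box(width, height, min_side=128, rng=random):
    """Sample (left, top, side) of a square crop that stays inside the image."""
    if min(width, height) < min_side:
        raise ValueError("image is smaller than the minimum crop size")
    # Assumption: restrict the corner so that a min_side box always fits.
    left = rng.randint(0, width - min_side)
    top = rng.randint(0, height - min_side)
    max_side = min(width - left, height - top)
    side = rng.randint(min_side, max_side)
    return left, top, side


def random_flips(rng=random):
    """Decide independently, each with probability 0.5, on the two flips."""
    flip_left_right = rng.random() < 0.5
    flip_up_down = rng.random() < 0.5
    return flip_left_right, flip_up_down
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible across runs.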

Our networks are implemented in PyTorch and trained with the Adam optimizer [16]. For a mini batch, we randomly select 8 images, half from the Sony α-5100 dataset and the other half from the Canon 5D Mark III dataset. The same learning rate is used for both networks. The hyperparameter $\lambda$, which balances the reconstruction loss and the cycle consistency constraint, is set to 1 in all experiments.

4.3 Image Translation

4.3.1 Different Network Configurations

We compare our model with variants that employ other loss functions, feature sharing schemas, and base networks. Following its wide use in the literature [23], the Peak Signal-to-Noise Ratio (PSNR) on the test sets of both the Canon 5D Mark III and Sony α-5100 is used as the metric. The results, shown in Table 2, Table 3, and Figure 4, are discussed in the following. We also measure performance using CIE Delta E 2000 [2] and find the results consistent with those from PSNR, as shown in the supplement.
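For reference, PSNR between a predicted and a ground-truth image can be computed as in the following sketch, which assumes images flattened to lists of floats with a peak value of 1.0; the function name and the peak convention are illustrative choices.

```python
import math


def psnr(prediction, target, peak=1.0):
    """Peak Signal-to-Noise Ratio in dB; higher means a closer match."""
    if len(prediction) != len(target):
        raise ValueError("images must have the same number of values")
    mse = sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)
```

Because PSNR is logarithmic in the mean-squared error, a gain of about 3 dB corresponds to roughly halving the error.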

Cycle Consistency Constraint

As shown in Table 2, including the cycle consistency constraint leads to better cycle (JPEG-to-RAW-to-JPEG) results for both datasets, regardless of whether feature sharing is enabled. Note that the performance of RAW-to-JPEG prediction becomes slightly worse. We hypothesize that, as the two networks are trained jointly, they implicitly cooperate with each other to achieve a smaller loss on the joint prediction task (JPEG-to-RAW-to-JPEG) at the cost of degrading performance on a single task (JPEG-to-RAW or RAW-to-JPEG).

Shared Features

We observe that JPEG prediction is better with feature sharing than without it. The RAW-to-JPEG prediction improves from 26.03/24.00 to 30.83/25.98 in terms of PSNR on the Canon/Sony Datasets, and the cycle JPEG prediction performance also increases on the two datasets, by 0.34/1.27. Sharing global features related to the in-camera color processing pipeline effectively removes ambiguity in JPEG prediction and generates better results.

Sharing Methods

We examine the use of different hidden layers in the JPEG-to-RAW network $G_{J \to R}$ and different transform functions $\phi$ for feature sharing. As indicated by the results in Table 2, the performance becomes worse when the feature map is taken from deeper layers. This result is expected, as deeper features provide a more semantic representation of RAW images and are less reflective of the JPEG coloring properties. We thus use the feature map of the first convolutional layer in our system. Among the variants of $\phi$, we find that average pooling slightly outperforms max pooling.

| Network    | RAW→JPEG    | JPEG→RAW    | Cycle(JPEG) |
|------------|-------------|-------------|-------------|
| MLP        | 23.44/21.90 | 31.97/31.62 | 34.93/35.06 |
| SRCNN [5]  | 26.17/23.21 | 33.34/32.64 | 31.82/32.76 |
| UNet [27]  | 26.13/23.56 | 32.70/32.37 | 31.72/32.83 |
| Multi      | 30.83/25.98 | 34.16/32.55 | 32.06/33.64 |

Table 3: Comparisons with different base networks. The two values in each cell represent PSNR values for Canon/Sony images.

| Camera     | RAW→JPEG | JPEG→RAW | Cycle(JPEG) |
|------------|----------|----------|-------------|
| Canon 60D  | 27.45    | 30.50    | 32.71       |
| Sony NEX 7 | 26.93    | 31.58    | 32.74       |

Table 4: PSNR results on unknown cameras.

Base Networks

We also experiment with different image translation models as our base network. Specifically, we consider four different types of neural networks: Multi-layer Perceptron (MLP), SRCNN [5], UNet [27], and Multi-Scale Learnable Histograms [23]. The configurations for the different neural networks are given in the supplement, and the networks are trained using cycle consistency and feature sharing. From the results listed in Table 3, it can be seen that the Multi-Scale Learnable Histogram performs best on both RAW-to-JPEG and JPEG-to-RAW. This was also found to be the case with other network settings, whose results are also provided in the supplement. Although the MLP gives better cycle results (JPEG-to-RAW-to-JPEG), we found that the lower performance of the other networks on RAW-to-JPEG leads to poorer compositing results.

Figure 4: The first row shows predictions using different network configurations (panels: Input, Cycle + Feature Sharing, Feature Sharing, Baseline). The corresponding error maps are displayed in the second row (panels: Ground Truth, Cycle + Feature Sharing, Feature Sharing, Baseline). The baseline method utilizes no feature sharing or cycle consistency.

Figure 5: The first/second row shows the predictions for a photo from the Canon 60D/Sony NEX 7 (panels: Input JPEG, Predicted JPEG, Calibrated RAW, Predicted RAW).

Figure 6: Comparisons to related image translation methods, where the deer is the composited object. (a) Input JPEG. (b) Blended RAW with linear scaling. (c) Style transfer by [7]. (d) Color transfer by [22]. (e) Our method. [Please zoom-in.]

Figure 7: Comparison to estimated white balance, where the desk is the composited object. (a) Input JPEG. (b) Blended RAW with linear scaling. (c) Gamma correction. (d) White balance and gamma correction. (e) Our method. [Please zoom-in.]

4.3.2 Unknown Cameras

As the feature sharing schema is designed to extract features that represent the JPEG rendering characteristics of the imaging pipelines, our model should be able to generalize to unknown cameras. To verify this, we first train the model on images from just a single camera (Canon 5D Mark III, specifically) and test it on images from the training camera and another camera (Sony α-5100, specifically). The model achieves a 31.10/35.98 (RAW→JPEG/Cycle) PSNR on the same camera and a 24.69/36.30 PSNR on the Sony α-5100 camera, exhibiting a moderate level of generalization ability from only a single training camera.

As is the case for other convolutional networks, more generalizable CNN features can be learned by providing a broader distribution of training data, i.e., from multiple cameras. We thus additionally train the model on images from two cameras (Canon 5D Mark III and Sony α-5100) and collect 50 other RAW-JPEG pairs using a Canon 60D and Sony NEX 7 for testing. Although these models are from the same manufacturers as the training cameras, we observed differences in the color transformation pipelines from images taken of the same scenes (see the supplement for examples). We directly test the trained model on these datasets without finetuning. The prediction results are shown in Table 4. Though the PSNR values are slightly lower than those in Table 2, they are at a similar level. Examples of JPEG inputs, predicted RAW images, ground truth calibrated RAW images, and predicted JPEG images (after JPEG-to-RAW and RAW-to-JPEG) are shown in Figure 5.

|         | Gamma | Gamma+WB | Our method |
|---------|-------|----------|------------|
| Max     | 58.3% | 58.3%    | 91.7%      |
| Min     | 16.7% | 4.2%     | 29.2%      |
| Average | 24.6% | 38.0%    | 58.5%      |

Table 5: User study results, in terms of selection percentage.

4.4 Compositing Objects

4.4.1 External Comparison

An alternative approach to our problem is to apply related image-to-image translation techniques such as style transfer [7] or color transfer [22]. Specifically, these methods could be used to transfer style or color from the JPEG photo to the virtual object prior to compositing. The results are presented in Figure 6. It can be seen that neither of these approaches is suitable for this problem. The style transfer method [7] copies textural properties of the JPEG photo to the virtual object, producing an unnatural-looking result that is inconsistent with the object’s white plaster material. Even when only color is transferred, as with [22], the transferred colors reflect the intrinsic colors of objects in the scene in addition to the JPEG color processing. By extracting and applying the color processing from the JPEG input, our method produces the most satisfactory results.

Another technique we compare to is the use of white balance estimation with gamma correction, where the inverse of the white balance (estimated using the method of [10]) is applied to the virtual object and is followed by a gamma correction of 2.2. Figure 7 shows a comparison with our method. One can observe that white balance plus gamma does not adequately approximate the downloaded photo’s color processing, which includes a boost in saturation. By contrast, our neural network model is powerful enough to capture such color transformations, as seen by the more saturated composited object.

Another alternative approach is to harmonize the foreground object and background through Deep Image Harmonization (DIH) [29]. We present the results in Figure 8. It can be seen that although DIH produces a visually pleasing result where the object color looks aesthetically compatible with the surroundings, this is not the same as being photometrically correct with respect to the imaging pipeline. In this particular case, it can be seen that the inserted object for DIH exhibits color variations that are inconsistent with the actual object.

Figure 8: Comparison with DIH [29]. Left: result from DIH. Right: result from our method. [Please zoom-in.]

4.4.2 User Study

We conducted a user study to evaluate the visual quality of our compositing results. Virtual objects were composited into 24 images for this study. The images were all downloaded from the web and were taken by unknown cameras, with some having an Instagram-style appearance. For comparison, the objects were also composited into the images using a default gamma correction of 2.2 or gamma correction with an estimated white balance [10]. For each image, our result and the comparisons were shown in random order. The users were asked to select which of the three images appeared most natural. A total of 25 users participated in this study.

The results are presented in Table 5, which shows statistics on the percentage of times a method’s result was selected. Max/Min are the maximum and minimum percentages among all the users. The users clearly prefer the results of our method over compositing using gamma correction with or without white balance. Images for the user study are provided in the supplement.

Figure 9: Analysis on the shared features. The first column is the input JPEG; the second column is the network prediction following the standard procedure; the last column is the prediction by swapping shared features. [Please zoom-in.]

4.5 Analysis

To further analyze the characteristics of shared features in the proposed pipeline, we swap the shared features for two different images that capture the same scene but undergo different color pipelines. Specifically, we first capture two photos of the same scene using different camera settings and feed each JPEG to the JPEG-to-RAW network $G_{J \to R}$ separately to obtain the corresponding RAW image and shared features for each photo. We then swap the shared features and use the RAW-to-JPEG network $G_{R \to J}$ to predict a new JPEG. Results are shown in Figure 9. It can be seen that, by swapping the shared features, the colors in the predicted images are also swapped. On the other hand, using the original shared features leads to predictions consistent with the input. This demonstrates that the shared features actually capture the color characteristics of the input JPEG.

5 Conclusion and Future Work

We presented an object compositing system that estimates the color transformation in the imaging pipeline of the target photo. To solve for this transformation from a single image, we propose a dual learning approach that is made tractable through the use of shared features from the dual to the primal network. We believe that this strategy could be useful for other problems in which a network needs to infer a particular solution in an inherently one-to-many mapping.

Our system is designed to model global color transformations. Some advanced imaging pipelines may process certain image regions differently from others, for example, by detecting the sky region in an outdoor photo and making it bluer. Extending our model to handle spatial variations in color processing would be an interesting direction for future study. Another avenue for further work is to apply model compression techniques to the image translation model so that it can run on mobile devices with fast inference.

6 Acknowledgement

This work is partially supported by the National Basic Research Program of China (973 Program) (grant no. 2015CB352502), NSFC (61573026), BJNSF (L172037), and a grant from Microsoft Research Asia.


  • [1] J. T. Barron and Y.-T. Tsai. Fast Fourier color constancy. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [2] D. H. Brainard et al. Color appearance and color difference specification. The science of color, 2:191–216, 2003.
  • [3] A. Chakrabarti, D. Scharstein, and T. Zickler. An empirical camera model for internet color vision. In British Machine Vision Conference, 2009.
  • [4] A. Chakrabarti, Y. Xiong, B. Sun, T. Darrell, D. Scharstein, T. Zickler, and K. Saenko. Modeling radiometric uncertainty for vision with tone-mapped color images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(11):2185–2198, 2014.
  • [5] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision, pages 184–199. Springer, 2014.
  • [6] J. Fischer, D. Bartz, and W. Strasser. Enhanced visual realism by incorporating camera image effects. In Proc. Int’l Symp. Mixed and Augmented Reality, pages 205–208, 2006.
  • [7] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [8] D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T. Liu, and W.-Y. Ma. Dual learning for machine translation. In Advances in Neural Information Processing Systems, pages 820–828, 2016.
  • [9] Y. Hold-Geoffroy, K. Sunkavalli, S. Hadap, E. Gambaretto, and J.-F. Lalonde. Deep outdoor illumination estimation. In IEEE International Conference on Computer Vision and Pattern Recognition, 2017.
  • [10] Y. Hu, B. Wang, and S. Lin. FC4: Fully convolutional color constancy with confidence-weighted pooling. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
  • [11] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint, 2017.
  • [12] K. Karsch, V. Hedau, D. Forsyth, and D. Hoiem. Rendering synthetic objects into legacy photographs. ACM Transactions on Graphics (TOG), 30(6):157, 2011.
  • [13] S. J. Kim, H. T. Lin, Z. Lu, S. Süsstrunk, S. Lin, and M. S. Brown. A new in-camera imaging model for color computer vision and its application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(12):2289–2302, 2012.
  • [14] S. J. Kim and M. Pollefeys. Robust radiometric calibration and vignetting correction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(4):562–576, 2008.
  • [15] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. In International Conference on Machine Learning, 2017.
  • [16] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [17] G. Klein and D. W. Murray. Simulating low-cost cameras for augmented reality compositing. IEEE transactions on visualization and computer graphics, 16(3):369–380, 2010.
  • [18] S. B. Knorr and D. Kurz. Real-time illumination estimation from faces for coherent rendering. In Mixed and Augmented Reality (ISMAR), 2014 IEEE International Symposium on, pages 113–122. IEEE, 2014.
  • [19] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint, 2016.
  • [20] S. Lin, J. Gu, S. Yamazaki, and H. Shum. Radiometric calibration from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 938–945, 2004.
  • [21] M.-Y. Liu, T. Breuel, and J. Kautz. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, 2017.
  • [22] F. Luan, S. Paris, E. Shechtman, and K. Bala. Deep photo style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [23] S. Nam and S. J. Kim. Modelling the scene dependent imaging in cameras with a deep neural network. arXiv preprint arXiv:1707.08350, 2017.
  • [24] R. Nguyen, D. K. Prasad, and M. S. Brown. Raw-to-raw: Mapping between image sensor color responses. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3398–3405, 2014.
  • [25] D. Pascale. RGB coordinates of the Macbeth ColorChecker. The BabelColor Company, pages 1–16, 2006.
  • [26] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 1(2):4, 2017.
  • [27] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
  • [28] W. Shi, C. C. Loy, and X. Tang. Deep specialized network for illuminant estimation. In Proceedings of the European Conference on Computer Vision, pages 371–387, 2016.
  • [29] Y.-H. Tsai, X. Shen, Z. Lin, K. Sunkavalli, X. Lu, and M.-H. Yang. Deep image harmonization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3789–3797, 2017.
  • [30] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016.
  • [31] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.