
One-Shot Mutual Affine-Transfer for Photorealistic Stylization

Photorealistic style transfer aims to transfer the style of a reference photo onto a content photo naturally, such that the stylized image looks like a real photo taken by a camera. Existing state-of-the-art methods are prone to spatial structure distortion of the content image and global color inconsistency across different semantic objects, making the results less photorealistic. In this paper, we propose a one-shot mutual Dirichlet network to address these challenging issues. The essential contribution of the work is the realization of a representation scheme that successfully decouples the spatial structure and color information of images, such that the spatial structure can be well preserved during stylization. This representation is discriminative and context-sensitive with respect to semantic objects. It is extracted with a shared sparse Dirichlet encoder. Moreover, the representations are encouraged to match between the content and style images for faithful color transfer. An affine-transfer model is embedded in the decoder of the network to facilitate the color transfer. The strong representative and discriminative power of the proposed network enables one-shot learning given only one content-style image pair. Experimental results demonstrate that the proposed method is able to generate photorealistic photos without spatial distortion or abrupt color changes.



1 Introduction

Photorealistic style transfer is a challenging problem which aims to change the style of a content photo to that of a reference photo, as shown in Fig. 1. By choosing different reference photos, one could make the content photo look as if, for example, it was taken under different illumination, at a different time of day, or in a different season of the year [24, 27, 26]. A successful photorealistic stylization method should be able to transfer sophisticated styles with drastic local color changes while at the same time preserving the spatial (or structural) information of the content photo naturally, such that the resulting image looks like a real photo taken by a camera [24, 26].

Figure 1: Given a reference style photo taken at night, the content image is stylized as if it was taken at night. Top-left: content-style image pairs. Top-right: WCT [23]. Bottom-left: OS-MDN. Bottom-right: sub-images of the content, WCT and proposed.

Existing approaches generally perform stylization either in a global or a local way. Global-based methods [34, 31] transfer the style of a photo with spatially invariant functions. Although they perform well for global color shifting, they usually fail to handle style images with drastic local color changes. Local-based methods [38, 9, 27] generally consist of two major steps, i.e., feature extraction and stylization. Recent methods perform the stylization based on high-level features of the content and style images extracted with pre-trained convolutional neural networks (CNNs). Although they can transfer dramatic local styles, they often fail to preserve the spatial structure of the content image. For faithful local color transfer [26, 24], context-based methods were proposed that preserve the spatial structure with constraints and perform region-based color transfer according to semantic labels. However, region-based transfer can be prone to abrupt color changes across different regions, making the result less photorealistic.

There are, in general, two key challenges in the photorealistic style transfer problem: structure preservation (i.e., stylization without changing the spatial structure of the content image) and local changes vs. global consistency (i.e., faithfully transferring local colors without introducing abrupt changes/artifacts within or across semantic regions). In addition, CNN-based methods usually have to be trained on a large amount of data, which is computationally costly.

To generate photorealistic images with desired local color transfer, well-preserved spatial structure, and smooth color transitions, the key is to find a representation scheme that effectively decouples the color and spatial structure information of both the content and style images, such that when performing color transfer, the spatial information is untouched. On the other hand, such a representation scheme should also facilitate context-sensitive local color transfer, such that stylization can be performed in a globally consistent fashion. If such a representation scheme is realized, one-shot learning becomes possible, where a single pair of content and style images is all that is needed to conduct photorealistic style transfer.

In this paper, we propose a one-shot mutual Dirichlet network, referred to as OS-MDN, to address the challenges of photorealistic style transfer. The essential contribution of the work is the realization of a representation scheme that successfully decouples the spatial structure and color information of both the content and style images. It converts the image from the RGB color space to a so-called "abundance space", where each pixel is represented by the coefficient (or abundance) of each color basis. We refer to this coefficient vector as the "representation", which satisfies the sum-to-one and non-negativity physical constraints. The physical meaning of this representation is that it indicates how much each color basis contributes to constructing a given pixel. By enforcing a sparsity constraint on the representation, we show that each pixel is made up of only one major color basis, and that this basis reflects the semantic object in the scene. Therefore, although this representation is pixel-wise, it is context-sensitive from a semantic perspective. This representation is extracted with a shared sparse Dirichlet encoder.

The second important contribution of OS-MDN is making the context-sensitive representation correlated between the style image and the content image. This is necessary to achieve faithful color transfer. We achieve this by matching the representations of the content and style images through maximizing the mutual information (MI) between the representations and their own RGB input. This is done through a simplified mutual discriminative network.

The third contribution of OS-MDN is the design of an affine-transfer decoder that learns the global color bases. Since the color bases of the content and style images are different, the affine-transfer model is embedded into the decoder network to facilitate color transfer.

Due to the strong representative and discriminative power of the designed network, the extracted representation is context-sensitive with semantic information embedded in its distribution, and yet well decoupled from the global color bases. This enables one-shot learning with only one pair of content and style images. To the best of our knowledge, this work is the first to perform photorealistic style transfer with one-shot learning.

2 Related Work

Classical style transfer methods stylize an image in a global fashion with spatially invariant transfer functions [34, 31, 1, 32, 8, 26]. These methods can handle global color shifts, but they are limited in matching sophisticated styles with drastic color changes [26, 24], as shown in Fig. 3.

The quality of image stylization can be improved by densely matching the low-level or high-level features between the content and style images [37, 38, 41, 9]. Gatys et al. [9] demonstrated impressive art style transfer results with a pretrained CNN, matching the correlations of deep features extracted from the CNN through the Gram matrix. Since then, numerous approaches have been developed to further improve the stylization performance as well as efficiency [5, 21, 25, 16, 36, 22]. For example, feed-forward approaches [17, 39] improved the stylization speed by training a decoder network with different loss functions. In order to transfer arbitrary styles to content images, Li et al. [23] adopted the classical signal whitening and coloring transforms (WCTs) on features extracted from a CNN. These methods can generate promising images with different art styles. However, the spatial structure of the content image is not preserved well even when the given style image is a real photo, as shown in Fig. 4.

Recently, there have been a few methods specifically designed for photorealistic image stylization [12, 27]. Luan et al. [26] preserved the structure of the content image by adopting a color-affine-transfer constraint, and color transfer is performed according to the semantic regions. However, the generated results easily suffer from abrupt color changes with noticeable artifacts, especially between adjacent regions/segments. Mechrez et al. [27] proposed to maintain the fidelity of the stylized image with a post-processing step based on the screened Poisson equation (SPE). Li et al. [24] improved the spatial consistency of the output image by adopting the manifold ranking algorithm as a post-processing step. He et al. [12] optimized the dense semantic correspondence in the deep feature domain, resulting in smooth local color transfer in the image domain. Although these methods preserve the spatial structure well, the light and color changes across different parts and materials are not smooth. See Fig. 5 for a comparison. Aside from image quality, these methods need to train a network with a large number of parameters on a large dataset.

3 Problem Formulation

As discussed in Sec. 1, the key issue in obtaining a faithful and high-quality photorealistic style transfer is the decoupling of spatial structure and color information. We solve this problem from a unique angle, based on image decomposition [30, 19]. To facilitate the subsequent processing, the images are unfolded into 2D matrices.

Given a content image $X_c \in \mathbb{R}^{N_c \times B}$, where $N_c = w_c h_c$ with $w_c$, $h_c$ and $B$ denoting its width, height and number of channels, respectively, and a style image $X_s \in \mathbb{R}^{N_s \times B}$ defined analogously, the goal is to reconstruct an image with the content from $X_c$ and the color style from $X_s$. The images are decomposed based on the fact that a natural image can be represented by a set of color bases with corresponding coefficient vectors [30, 19]. The decomposition can be expressed as

$$X_c = S_c A_c, \quad (1)$$
$$X_s = S_s A_s, \quad (2)$$

where $A_c, A_s \in \mathbb{R}^{L \times B}$, each row of which denotes a color basis that preserves the color information of the entire content and style images, respectively, and $S_c \in \mathbb{R}^{N_c \times L}$, $S_s \in \mathbb{R}^{N_s \times L}$ denote the corresponding coefficients of the content and style images for each of the $L$ color bases. Since each row vector of $S_c$ (or $S_s$) indicates how the color bases are combined at a specific pixel location, it only carries the color information of an individual pixel. Applying a transfer on such vectors would not affect the spatial structure of the content image.

Through the above decomposition, we extract the "representations" $S_c$ and $S_s$, which transform the style transfer problem from the original RGB space to the so-called abundance (weight coefficient) space. This representation carries spatial information (of the color mixture) at each spatial location, and is decoupled from the global color information of the image.
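The decoupling above can be illustrated with a minimal numpy sketch of the decomposition in Eq. (1). The basis count, the palette values, and the variable names are illustrative toys, not values from the paper:

```python
import numpy as np

# Toy illustration of Eq. (1): an unfolded image X (N pixels x 3 channels)
# factored into abundances S (N x L, rows on the simplex) and color bases A.
# The basis count L = 3 and all numbers below are illustrative.
rng = np.random.default_rng(0)
L, N = 3, 6

A_content = np.array([[0.9, 0.9, 1.0],   # "sky" basis
                      [0.2, 0.5, 0.1],   # "tree" basis
                      [0.4, 0.3, 0.2]])  # "ground" basis

S = rng.dirichlet(alpha=np.ones(L) * 0.3, size=N)  # sparse-ish simplex rows
X_content = S @ A_content                          # Eq. (1)

# Color transfer: keep S (the spatial structure) fixed, swap in new bases.
A_style = np.array([[0.10, 0.10, 0.30],  # night "sky"
                    [0.05, 0.15, 0.05],
                    [0.15, 0.10, 0.10]])
X_stylized = S @ A_style

# The abundances are untouched, so the spatial mixture at every pixel is
# preserved; only the global colors change.
assert np.allclose(S.sum(axis=1), 1.0)
assert X_stylized.shape == X_content.shape
```

Because only the right-hand factor changes, every pixel keeps the same mixing proportions; this is the sense in which color transfer in abundance space leaves spatial structure untouched.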

In the following, we elaborate on how this representation also carries the context information from both color and spatial perspectives.

Taking $X_c$ as an example, Eq. (1) can also be written as $X_c = \sum_{i=1}^{L} s_i a_i^\top$, where $a_i^\top$ denotes a row vector of $A_c$ carrying one color basis, and $s_i$ denotes a column vector of $S_c$ indicating the proportion of that color basis in making up the whole image. Studies [30, 19] have shown that each pixel can be constructed with only a few color bases. Hence the representations should be sparse to increase their discriminative capacity. If we reshape such an $s_i$ into a 2D matrix of the image size, the spatial distribution of the basis over the whole image is easy to observe. As shown in Fig. 6, different columns of $S_c$ carry the spatial distributions of different color bases, and these color bases indicate the major color components of different objects, e.g., sky, cloud, and tree. Therefore, we claim that the representations are context-sensitive with respect to different objects if they are enforced to be sparse.

Since the content and style images hold different color bases, to facilitate the color transfer, we should transfer the color bases of the content image to those of the style image with an affine-transfer model, i.e., $A_s \approx A_c W + b$ for a weight matrix $W$ and bias $b$.

Context-based transfer is important for transferring the right colors to the right places, e.g., the color of the sky in the style image should be transferred to the sky in the content image. In our framework, since different objects usually have only one major color basis, and its coefficient column ($s_i$) indicates its distribution, we want the coefficient columns of the content image to match those of the style image for better stylization.

4 Proposed Method

We propose a one-shot mutual Dirichlet network architecture to perform photorealistic style transfer through two procedures, i.e., feature extraction and style transfer. The network mainly consists of three unique structures: a shared sparse Dirichlet encoder for the extraction of feature vectors with both representative and discriminative capacity, an affine-transfer decoder for global color transfer, and a simplified mutual discriminative network to enforce the correspondence between multi-modal representations. The stylization is done by whitening and coloring the context-sensitive representation vectors and transferring them according to the global color bases of the style image. The architecture is shown in Fig. 2. Note that the dashed lines in Fig. 2 show the path of back-propagation, which will be elaborated in Sec. 4.5.

Figure 2: Flowchart of the proposed method.

4.1 Network Architecture

As shown in Fig. 2, the network extracts the spatial representations and global color bases from both the content image and the style image by sharing the same encoder structure and learning an affine-transfer decoder. Let us define the input domain as $\mathcal{X}$, the output domain as $\hat{\mathcal{X}}$, and the representation domain as $\mathcal{S}$. The encoder of the network, $E_\theta: \mathcal{X} \to \mathcal{S}$, maps the input data to high-dimensional representations (latent variables on the hidden layer), i.e., $S = E_\theta(X)$. Let $\psi$ denote the affine transform parameters; the affine-transfer decoder $D_\psi: \mathcal{S} \to \hat{\mathcal{X}}$ reconstructs the data from the representations, i.e., $\hat{X} = D_\psi(S)$. Note that $D_\psi$ is constructed with fully-connected layers with only identity activation functions. The representation $S$ contains the mixing coefficients that reflect the local contributions of the different color bases, and the weights of the decoders $D_{\psi_c}$ and $D_{\psi_s}$ serve as the color bases $A_c$ and $A_s$ in Eqs. (1) and (2), respectively. This correspondence is further elaborated below.

Take the procedure of content image reconstruction as an example. The content image is reconstructed by $\hat{X}_c = D_{\psi_c}(S_c)$, where $S_c = E_\theta(X_c)$. Since $D_{\psi_c}$ is linear, the structure can be simplified as $\hat{X}_c = S_c A_c$. Compared with Eq. (1), we find that the weights of the decoder correspond to the color bases $A_c$.

In the stylization procedure, as shown in the lower part of Fig. 2, the distribution of $S_c$ is matched with that of $S_s$ with the whitening and coloring transform (WCT) [23]. The transferred representation $S_{cs}$ is then fed into the style's affine-transfer decoder $D_{\psi_s}$ to generate the stylized image $\hat{X}_{cs}$.

4.2 Sparse Dirichlet Encoder

We adopt the Dirichlet-Net [28, 33] as our encoder, which naturally meets the coefficients' physical constraints, i.e., sum-to-one and non-negativity. Furthermore, to increase the representations' discriminative capacity, we enforce them to be sparse via a normalized entropy penalty. The sparse representations are context-sensitive according to their major colors. The detailed encoder is shown at the bottom-left of Fig. 2.

The Dirichlet-Net is constructed with the stick-breaking process, which can be illustrated as breaking a unit-length stick into $L$ pieces, the lengths of which follow a Dirichlet distribution [35]. Here, we follow the work of [28, 33], which draws the samples of the representation from the Kumaraswamy distribution [18]. Assuming that a row vector of the representation is denoted as $\mathbf{s} = (s_1, \dots, s_L)$, we have $0 \le s_j \le 1$ and $\sum_{j=1}^{L} s_j = 1$. Each variable can be defined as

$$s_j = \begin{cases} v_1 & j = 1,\\ v_j \prod_{k<j}(1 - v_k) & j > 1, \end{cases} \quad (3)$$

where $v_j$ is drawn from the inverse transform of the Kumaraswamy distribution, $v_j = (1 - u_j^{1/\beta_j})^{1/\alpha_j}$ with $u_j \sim \mathcal{U}(0, 1)$. Both parameters $\alpha_j$ and $\beta_j$ are learned through the network for each row vector, as illustrated in Fig. 2. Since $\beta_j > 0$, a softplus is adopted as the activation function [7] at the $\beta$ layer. Similarly, a sigmoid [11] is used to map $u_j$ into the $(0, 1)$ range at the $u$ layer. To increase the representation power of the network, the encoder of the network is densely connected, i.e., each layer is fully connected with all its subsequent layers [14].
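The stick-breaking sampler can be sketched in a few lines of numpy. This is a simplified sketch, not the paper's network: the shape parameters are illustrative constants rather than encoder outputs, and, as is common in practice, the last stick segment absorbs the remainder so the row sums exactly to one:

```python
import numpy as np

def kumaraswamy_stick_breaking(alpha, beta, rng):
    """Draw one simplex-valued representation row via stick-breaking.

    alpha, beta: positive shape parameters, one pair per stick segment.
    In the paper these come from the encoder; here they are constants.
    """
    L = len(alpha)
    u = rng.uniform(size=L)
    # Inverse-transform sample of the Kumaraswamy distribution.
    v = (1.0 - u ** (1.0 / beta)) ** (1.0 / alpha)
    s = np.empty(L)
    remaining = 1.0
    for j in range(L):
        # Break off a fraction v[j] of the remaining stick; the last
        # piece takes whatever is left so the row sums exactly to one.
        s[j] = v[j] * remaining if j < L - 1 else remaining
        remaining -= s[j]
    return s

rng = np.random.default_rng(0)
s = kumaraswamy_stick_breaking(np.full(4, 2.0), np.full(4, 2.0), rng)
assert np.isclose(s.sum(), 1.0) and np.all(s >= 0)
```

By construction every sampled row is non-negative and sums to one, which is exactly the physical constraint the Dirichlet encoder is meant to guarantee.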

To increase the discriminative capacity, we encourage the representations to be sparse in order to better embed the semantic context information. Since the stick-breaking structure has the sum-to-one property, the traditional, widely used $\ell_1$ regularization or Kullback-Leibler divergence [10] cannot be used to promote sparsity. Instead, we adopt the normalized entropy function [15], defined in Eq. (4), which decreases monotonically as the data become sparse:

$$H_p(\mathbf{s}) = -\sum_{j=1}^{L} \frac{|s_j|^p}{\|\mathbf{s}\|_p^p} \log \frac{|s_j|^p}{\|\mathbf{s}\|_p^p}. \quad (4)$$

For example, if the representation has two dimensions with $s_1 + s_2 = 1$, the local minimum only occurs at the boundaries of the quadrants, i.e., where either $s_1$ or $s_2$ is zero. This nice property guarantees the sparsity of arbitrary data even when the data are sum-to-one.


We choose $p = 1$ for efficiency. The objective function for the sparse loss can then be defined as

$$\mathcal{L}_s = \sum_{i=1}^{N_c} H_1(\mathbf{s}_c^i) + \sum_{i=1}^{N_s} H_1(\mathbf{s}_s^i), \quad (5)$$

where $\mathbf{s}_c^i$ and $\mathbf{s}_s^i$ denote the $i$-th row vectors of $S_c$ and $S_s$, respectively.
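The behavior of the normalized entropy penalty can be checked with a small numpy sketch. The exponent `p` and the toy vectors below are illustrative; for non-negative sum-to-one rows and `p = 1` the measure reduces to plain Shannon entropy, which is minimized at the simplex vertices:

```python
import numpy as np

def normalized_entropy(s, p=1, eps=1e-12):
    """Sparsity measure in the spirit of Eq. (4): Shannon entropy of the
    p-normalized magnitudes. Lower values mean a sparser vector."""
    w = np.abs(s) ** p
    w = w / (w.sum() + eps)
    return -np.sum(w * np.log(w + eps))

dense  = np.array([0.25, 0.25, 0.25, 0.25])  # uniform mixture of bases
sparse = np.array([0.97, 0.01, 0.01, 0.01])  # one dominant color basis
assert normalized_entropy(sparse) < normalized_entropy(dense)
```

Minimizing this quantity therefore pushes each pixel toward a single dominant color basis, even though an $\ell_1$ penalty would be a constant under the sum-to-one constraint.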


4.3 Simplified Mutual Discriminative Network

Context information is important for conducting faithful color transfer. Previous researchers either transfer color within a region according to its semantic label [26, 24], or between the most similar patches [21, 12]. However, the former is prone to introducing abrupt color changes, especially at the border between adjacent regions. The latter only focuses on local similarity, which may generate results where the colors of many patches are transferred from the same style patch [26]. Thus, for faithful (or semantically accurate) color transfer, we should extract features carrying context information, and such features should be matched between the multiple input modalities (i.e., the content and style images in our case). In our network design, we enforce such context correspondence with a simplified mutual discriminative network based on mutual information.

As shown in Sec. 3 and Fig. 6, the extracted representations also carry context information from a semantic perspective. Suppose the content and style images both include sky with different color styles; the spatial distribution of the sky's major color basis is included in a single column, say $s_c^i$, of the content image representation $S_c$. Similarly, a column $s_s^j$ carries the spatial distribution of the sky's major color in the style image representation $S_s$. For faithful color style transfer, $s_c^i$ and $s_s^j$ should be at the same position of $S_c$ and $S_s$, respectively, i.e., $i = j$. Similarly, other columns of the representations $S_c$ and $S_s$ should also correspond to each other if they carry the distributions of similar objects. Such correspondence can be encouraged by maximizing the dependency between $S_c$ and $S_s$. Since our encoder is non-linear, traditional measures like correlation may not capture such dependency. Thus we maximize the dependency by maximizing their mutual information.

Mutual information has been widely used for multi-modality registration [42, 40]. It is a Shannon-entropy-based measurement of the mutual dependence between two random variables, e.g., $X$ and $S$. The mutual information measures how much uncertainty of one variable ($X$ or $S$) is reduced given the other variable ($S$ or $X$). Mathematically, it is defined as

$$I(X; S) = H(X) - H(X|S) = D_{KL}(P_{XS} \,\|\, P_X \otimes P_S), \quad (6)$$

where $H(X)$ indicates the Shannon entropy and $H(X|S)$ is the conditional entropy of $X$ given $S$. $P_{XS}$ is the joint probability distribution, and $P_X \otimes P_S$ denotes the product of the marginals. Belghazi et al. [2] introduced an MI estimator, which allows MI to be estimated through a neural network.

In our problem, since $S_c = E_\theta(X_c)$ and $S_s = E_\theta(X_s)$, their MI can also be expressed as $I(S_c; S_s) = I(E_\theta(X_c); E_\theta(X_s))$. However, it is difficult to maximize their dependency through the MI estimator directly, because the resolutions of the content and style images, as well as their scenes, are different. Instead, we maximize the average MI between the representations and their own inputs, i.e., $I(X_c; S_c)$ and $I(X_s; S_s)$, through the same discriminative network $T_\omega$ simultaneously. Note that $S_c$ and $S_s$ are obtained by projecting the content and style images into the abundance space. In this space, $S_c$ and $S_s$ are context-sensitive; their distributions are related to the distributions of objects in their images, not their color information. When we maximize the MI with the same $T_\omega$, the dependency between $X_c$ and $S_c$ becomes similar to that between $X_s$ and $S_s$, i.e., the columns of $S_c$ and $S_s$, which carry the context distributions, are encouraged to follow the same order if they contain similar objects.

Let us take $I(X_c; S_c)$ as an example. It is equivalent to the Kullback-Leibler (KL) divergence [2] between the joint distribution $P_{X_c S_c}$ and the product of the marginals $P_{X_c} \otimes P_{S_c}$. Such MI can be maximized by maximizing the KL-divergence's lower bound based on the Donsker-Varadhan (DV) representation [6]. Since we do not need to calculate the exact MI, we introduce an alternative lower bound based on the Jensen-Shannon divergence, which works better than the DV-based objective function [13].

In the network design, the mutual network $T_\omega$ is built with parameters $\omega$. The MI estimator can be defined as

$$\hat{I}_\omega(X; S) = \mathbb{E}_{P_{XS}}[-\mathrm{sp}(-T_\omega(x, s))] - \mathbb{E}_{P_X \otimes P_S}[\mathrm{sp}(T_\omega(\tilde{x}, s))], \quad (7)$$

where $\mathrm{sp}(z) = \log(1 + e^z)$ is the softplus function and $\tilde{x}$ is an input sampled from $P_X$ by randomly shuffling the input data. The term carrying the shuffled data is called the negative sample. Since our input only has two images, it is unstable to train the network with randomly shuffled input data. Thus, we simplify the network by dropping the negative samples in Eq. (7). Combined with the MI of $(X_s, S_s)$, our objective function is defined as

$$\mathcal{L}_m = \mathbb{E}_{P_{X_c S_c}}[-\mathrm{sp}(-T_\omega(x_c, s_c))] + \mathbb{E}_{P_{X_s S_s}}[-\mathrm{sp}(-T_\omega(x_s, s_s))]. \quad (8)$$
By maximizing $\mathcal{L}_m$, we can extract optimized representations $S_c$ and $S_s$ that best represent $X_c$ and $X_s$, and the columns of $S_c$ and $S_s$ are ordered in a similar way, as shown in Fig. 6, i.e., if the first column of $S_c$ carries the distribution of the sky, the first column of $S_s$ will also carry the distribution of its sky, and so on. Then $S_c$ and $S_s$ are encouraged to be matched.
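The positive-samples-only Jensen-Shannon objective described above reduces to a softplus of the discriminator scores on matched pairs. The sketch below is illustrative: the discriminator network itself is omitted and replaced by a raw score vector `T_pos`, which is an assumption for demonstration purposes:

```python
import numpy as np

def softplus(z):
    # Numerically stable log(1 + e^z).
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0.0)

def mi_objective(T_pos):
    """Simplified Jensen-Shannon MI objective: only the positive (joint)
    term is kept, since with a single content-style pair the shuffled
    negative term is unstable. T_pos holds discriminator scores
    T_w(x, s) on matched input/representation pairs."""
    return np.mean(-softplus(-T_pos))

# Higher discriminator scores on matched pairs => larger objective.
low  = mi_objective(np.array([-1.0, 0.0, 1.0]))
high = mi_objective(np.array([2.0, 3.0, 2.5]))
assert high > low
```

Maximizing this quantity pushes the discriminator scores on true (image, representation) pairs upward, which is what ties the abundance columns of the two images to the same semantic ordering.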

4.4 Affine-Transfer Decoder

Since the global color bases of the content and style images are different, to enable global color transfer, we assume that the bases of the content image and style image have an affine relationship [3, 20]. Since the content image and style image may have drastic color changes, we define the affine-transfer decoder as

$$\hat{X}_c = D_{\psi_c}(S_c) = S_c(\bar{A} W_c + b_c), \qquad \hat{X}_s = D_{\psi_s}(S_s) = S_s(\bar{A} W_s + b_s), \quad (9)$$

where $\bar{A}$ denotes the shared basis weights, and $(W_c, b_c)$, $(W_s, b_s)$ are the network weights of the content and style branches, respectively. $A_c = \bar{A} W_c + b_c$ and $A_s = \bar{A} W_s + b_s$ share the same basis weights $\bar{A}$ and correspond to $A_c$ and $A_s$ in Eqs. (1) and (2), respectively. Thus they carry the global color bases of the content and style images.
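A minimal numpy sketch of this shared-basis-plus-affine construction is given below. The dimensions, random weights, and variable names are illustrative assumptions, not the trained values:

```python
import numpy as np

rng = np.random.default_rng(1)

# Both decoder branches share the basis weights A_bar and differ only by
# an affine map (W, b). Shapes are illustrative: L = 10 abundances,
# B = 3 color channels.
L, B = 10, 3
A_bar = rng.standard_normal((L, B))
W_c, b_c = rng.standard_normal((B, B)), rng.standard_normal(B)
W_s, b_s = rng.standard_normal((B, B)), rng.standard_normal(B)

A_content = A_bar @ W_c + b_c   # color bases of the content branch
A_style   = A_bar @ W_s + b_s   # color bases of the style branch

def decode(S, A):
    """Linear decoder: abundances times color bases, as in Eq. (1)."""
    return S @ A

S = rng.dirichlet(np.ones(L), size=5)   # 5 pixels on the simplex
x_same_structure = decode(S, A_style)   # same S, style-branch colors
assert decode(S, A_content).shape == (5, 3)
```

Because the bases of the two branches are affine images of the same $\bar{A}$, feeding content abundances through the style branch changes only the colors, not the mixing structure.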

4.5 Style Transfer and Implementation Details

In order to extract better color bases, we adopt the $\ell_{2,1}$ norm [29] instead of the traditional Frobenius norm for the reconstruction loss. The objective function is defined as

$$\mathcal{L}_r = \|X_c - \hat{X}_c\|_{2,1} + \|X_s - \hat{X}_s\|_{2,1}, \quad (10)$$

where $\|E\|_{2,1} = \sum_{i=1}^{N} \sqrt{\sum_{j=1}^{B} e_{ij}^2}$, i.e., the sum of the $\ell_2$ norms of the rows of $E$. The $\ell_{2,1}$ norm encourages the rows of the reconstruction error to be sparse. That is, the network is designed to learn individual pixels as accurately as possible. In this way, it extracts better color bases to further facilitate the style transfer.
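The row-wise $\ell_{2,1}$ penalty is a one-liner in numpy; the error matrix below is a toy example:

```python
import numpy as np

def l21_norm(E):
    """Row-wise l2,1 norm: the l2 norm of each row (pixel) of the error
    matrix, summed over rows. Unlike the squared Frobenius norm, large
    per-pixel errors are not squared, so a few badly reconstructed
    pixels dominate the loss less."""
    return np.sqrt((E ** 2).sum(axis=1)).sum()

E = np.array([[3.0, 4.0],   # pixel error with l2 norm 5
              [0.0, 0.0]])  # perfectly reconstructed pixel
assert np.isclose(l21_norm(E), 5.0)
```

The per-row grouping is what makes the loss favor driving entire pixel errors to zero rather than spreading a small error over every channel of every pixel.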

The objective function of the proposed network architecture can then be expressed as

$$\mathcal{L} = \mathcal{L}_r + \alpha \mathcal{L}_s - \beta \mathcal{L}_m + \gamma \|\Psi\|_2^2, \quad (11)$$

where the $\ell_2$ norm is applied to the decoder weights $\Psi$ to prevent over-fitting. $\alpha$, $\beta$ and $\gamma$ are the parameters that balance the trade-off between the reconstruction error, sparse loss, negative mutual information and weight loss, respectively. Note that $\mathcal{L}_m$ is computed by first reshaping the representations $S_c$ and $S_s$ to the 3D domain according to their image sizes, then stacking them on top of $X_c$ and $X_s$, respectively, before feeding them into the network $T_\omega$.

Before training, the content and style images are down-sampled to a smaller size for efficiency and transformed to zero-mean vectors by subtracting their own means. Because the down-sampled image pairs still have similar distributions to the input image pairs, the learned weights can be used to generate the stylized image at the original resolution.

The network consists of a few fully-connected layers. The numbers of layers and nodes are shown in Table 1. The network is optimized with back-propagation, as illustrated in Fig. 2 with red-dashed lines, and training stops when the reconstruction error of the network no longer decreases.

layers 4/1/1 2 1/1 2
nodes [3,3,3,3]/10/1 [13,1] 10/10 [10,10]
Table 1: The number of layers and nodes in the proposed network.

After the training procedure, the network learns to extract context-sensitive representations and global color bases. In order to match the distribution of $S_c$ to that of $S_s$, we adopt the WCT [23] as the transfer function, which is defined based on the covariance matrix and is therefore independent of the dimensions of the representations. Applying WCT on the matched $S_c$ and $S_s$, we are able to match the distribution of objects in $X_c$ to that of similar objects in $X_s$. By feeding the images of original size into the network, we obtain $S_c$ and $S_s$. The distribution of $S_c$ is matched to that of $S_s$ with WCT to generate the transferred representation $S_{cs}$. Then $S_{cs}$ is fed into the style decoder $D_{\psi_s}$ to generate the stylized image $\hat{X}_{cs}$.
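The whitening and coloring transform on the representations can be sketched with eigendecompositions of the two covariance matrices. This is a generic WCT sketch on random toy data, assuming matched representation dimensions; the regularizer `eps` is an implementation detail added for numerical stability:

```python
import numpy as np

def wct(Sc, Ss, eps=1e-5):
    """Whitening and coloring transform (sketch): match the mean and
    covariance of content representations Sc (N_c x L) to those of style
    representations Ss (N_s x L). Being covariance-based, it does not
    depend on the number of pixels in each image."""
    mu_c, mu_s = Sc.mean(axis=0), Ss.mean(axis=0)
    Zc, Zs = Sc - mu_c, Ss - mu_s

    def inv_sqrt_and_sqrt(Z):
        cov = Z.T @ Z / (len(Z) - 1) + eps * np.eye(Z.shape[1])
        w, V = np.linalg.eigh(cov)
        return (V * w ** -0.5) @ V.T, (V * w ** 0.5) @ V.T

    whiten, _ = inv_sqrt_and_sqrt(Zc)   # Cov_c^(-1/2): decorrelates Sc
    _, color = inv_sqrt_and_sqrt(Zs)    # Cov_s^(1/2): imposes Ss stats
    return Zc @ whiten @ color + mu_s

rng = np.random.default_rng(2)
Sc = rng.random((500, 4))
Ss = rng.random((300, 4)) * 2.0 + 1.0
Scs = wct(Sc, Ss)
# After the transform, the content representations carry the style's
# first- and second-order statistics.
assert np.allclose(Scs.mean(axis=0), Ss.mean(axis=0), atol=1e-6)
```

Since the transform acts on the abundance columns rather than on pixels in RGB space, it reorders the distributions of the color bases while leaving the decoder's global bases to supply the actual colors.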

5 Experimental Results

The stylization results of the proposed OS-MDN on various types of photos are compared with state-of-the-art methods belonging to different categories: global-based methods, local-based general stylization methods, and context-based photorealistic stylization methods with CNN structures. Both visual comparisons and a user study are provided to demonstrate the effectiveness of the proposed method.

Visual Comparison Figure 3 shows the stylization results of the proposed method compared to those of the global-based methods. We can observe that Reinhard et al. [34] fail to handle dramatic local color changes. Although Pitié et al. [4] are able to transfer colors, they generate noticeable artifacts. That is because the global-based methods transfer the style of a photo with spatially invariant functions, which limits their stylization ability on sophisticated styles. In contrast, our method yields better results with fewer artifacts, because the stylization is based on context-sensitive features.

Figure 4 shows the comparison of our method against general stylization methods. Although both Gatys et al. [9] and Li et al. [23] can transfer the color style well, they fail to preserve the spatial structure of the content image. The proposed method generates satisfactory results without spatial distortion. That is because the designed architecture, powered by the Dirichlet encoder, projects images onto the abundance space, and the transfer in this space does not distort the spatial structure of the content image.

Figure 5 shows visual results of the proposed method as compared to context-based photorealistic stylization methods. Both Luan et al. [26] and Li et al. [24] can successfully transfer the color style to the content image. The generated results from Luan et al. [26] preserve the spatial structure well with a local affine transfer; however, the method causes color inconsistency in homogeneous areas. With a post-processing smoothing step, Li et al. [24] can generate better results. However, both methods show abrupt color changes between different semantic regions. That is because they perform region-based transfer within the semantic regions of the content and style images. We also observe that they tend to match the color of the content image to that of the style image at the same position if the semantic labels of the style and content images do not match; e.g., in the last row of Fig. 5, the red umbrella is included in a semantic label of the style image but not of the content image, and the red color is transferred to the same position in the content image by both methods. Overall, the images generated by the proposed method not only transfer the color style correctly but also preserve natural color transitions between neighboring pixels, especially the transitions between different contexts. The key contributor to the performance gain is that the proposed method is able to extract matched context-sensitive representations with the mutual sparse Dirichlet network. These representations can be transferred globally with WCT and the affine-transfer decoder. Thus it produces more photorealistic photos with the desired styles.

Figure 3: Visual comparison with global-based methods. First column: content image. Second: reference style image. Third: Reinhard [34]. Fourth: Pitié [4]. Fifth: proposed OS-MDN.
Figure 4: Visual comparison with local stylization methods. First column: content image. Second: reference style image. Third: Gatys [9]. Fourth: WCT [23]. Fifth: proposed OS-MDN.
Figure 5: Visual comparison with photorealistic methods. First column: content image. Second: reference style image. Third: Luan [26]. Fourth: Li  [24]. Fifth: proposed OS-MDN.

User Study Since the evaluation of photorealistic style transfer tends to be subjective, we conduct two user studies to further validate the proposed method quantitatively. One study asks users to choose the result that better carries the style of the reference style image. The other asks users to choose the result that looks more like a real photo without artifacts. We choose 20 images of different scenes from the benchmark dataset offered by Luan et al. [26] and collect responses from the Amazon Mechanical Turk (AMT) platform for both studies. Since the general stylization methods of Gatys et al. [9] and Li et al. [23] cannot generate photorealistic results, as shown in Fig. 4, the proposed method is only compared with photorealistic stylization methods, including the global-based Pitié et al. [4] and the CNN-based Luan et al. [26] and Li et al. [24]. For each study, there are 60 questions in total. For each question, we show the AMT workers a pair of content and style images together with the result of our method and that of one other method. Each question is answered by 10 different workers. Thus the evaluation is based on 600 responses for each study. The feedback is summarized in Table 2. We can observe that, compared to the other photorealistic transfer methods, our method can not only stylize the image well but also generate more photorealistic images. Note that our method only needs one pair of data, i.e., the content and style images, to generate such results.

Methods Better Stylization Photorealistic
Pitié [4]/ours 37%/63% 26%/74%
Luan [26]/ours 36.5%/63.5% 29%/71%
Li [24]/ours 31%/69% 20.5%/79.5%
Table 2: User Study

Computational Efficiency In Table 3, the run-time of the proposed OS-MDN is compared with other methods [4, 26, 24] on an NVIDIA Tesla K40c GPU with images of the same resolution. To generate satisfactory results, CNN-based methods usually need to train an encoder and decoder on a large dataset for days [26, 24]. In contrast, given a content-style pair, our method only needs to train on its down-sampled pair, which takes a few minutes even without a GPU. Since our method is carefully designed with few layers, the style transfer only takes 1.6 seconds, which is much faster than the state-of-the-art.

Methods Pitié [4] Luan [26] Li [24] ours
Time(s) 4.94 31.2 380.54 1.6
Table 3: Running Time in Seconds

6 Ablation Study

Figure 6: The representations extracted with the OS-MDN from the content image (top) and style image (bottom). Brighter color indicates higher value. Second, third, fourth column: the major color distribution of sky, cloud, and tree, respectively. With the sparse constraint, color bases are highlighted in different representation columns. With the mutual discriminative network, the extracted representations of the content and style image are encouraged to be matched with each other, e.g., sky-to-sky, cloud-to-cloud.
Figure 7: Visualization of the effects of the sparse and mutual parameters. Top: the sparse parameter is varied with the mutual parameter fixed. Bottom: the mutual parameter is varied with the sparse parameter fixed.

The two most important components of the proposed method are the sparse constraint, used to extract discriminative features, and the simplified mutual discriminative network, used to enforce context correspondence between the representations of the content and style images. With these two factors, the network extracts discriminative representations that correspond to each other's semantic context, as shown in Fig. 6. To evaluate their effects, we vary the sparse and mutual parameters and show the results in Fig. 7. When both parameters are zero, the result resembles a global transfer, i.e., there are no drastic local color changes. When we increase the sparse parameter, the local colors change dramatically, because the distribution of the representations becomes more discriminative; however, since the two sets of features do not correspond to each other, the color is not transferred correctly. When we also increase the mutual parameter, the extracted features become more correlated with each other, resulting in a more faithful local color transfer.
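As a concrete illustration of the two factors, the sketch below pairs an entropy-style sparsity penalty with a simple channel-usage matching term. The specific formulas, weights, and names (`sparsity_penalty`, `correspondence_penalty`) are our own simplification for intuition, not the paper's exact objectives:

```python
import numpy as np

def sparsity_penalty(r, eps=1e-8):
    # Entropy of each pixel's channel distribution; lower means sparser.
    # Rows of r sum to 1, as with a Dirichlet-constrained encoder.
    return float(-(r * np.log(r + eps)).sum(axis=1).mean())

def correspondence_penalty(rc, rs):
    # Penalize mismatched average channel usage between content and style.
    return float(np.abs(rc.mean(axis=0) - rs.mean(axis=0)).sum())

def ablation_loss(rc, rs, lam_sparse, lam_mutual):
    return lam_sparse * sparsity_penalty(rc) + lam_mutual * correspondence_penalty(rc, rs)

rng = np.random.default_rng(0)
rc = rng.dirichlet(np.ones(4), size=16)  # toy content representations
rs = rng.dirichlet(np.ones(4), size=16)  # toy style representations

# With both weights at zero the objective vanishes, mirroring the
# "global transfer" behavior observed in the ablation.
print(ablation_loss(rc, rs, 0.0, 0.0))  # 0.0
```

Turning up the first weight alone rewards sparse (discriminative) representations, while the second weight pulls the content and style channel statistics toward each other, matching the qualitative trends described above.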

Failure Cases The proposed method transfers the color style based on the content image's own characteristics. As shown in Fig. 8, since the bottle in the input image is transparent, the output follows its natural color transitions; the method does not change the bottle's pattern.

Figure 8: Failure case due to the natural properties of the content image. First and second: the content and style images. Third: WCT [23]. Fourth: OS-MDN.

7 Conclusion

We proposed a one-shot mutual Dirichlet network (OS-MDN) to address the problem of photorealistic style transfer. To the best of our knowledge, this work represents the first attempt at addressing the problem with one-shot learning. The representation scheme successfully decouples the spatial structure and color information of both the content and style images, using a shared sparse Dirichlet encoder to extract discriminative representations and an affine-transfer decoder for global color transfer, so that the structure information is well preserved during style transfer. The extracted sparse representation is context-sensitive from a semantic perspective. The network further enforces the correspondence of these representations between the content and style images, through a simplified mutual discriminative network, for faithful color transfer. Experimental results demonstrate that the proposed method generates photorealistic photos without spatial distortion or abrupt color changes.


  • [1] S. Bae, S. Paris, and F. Durand. Two-scale tone management for photographic look. ACM Transactions on Graphics (TOG), 25(3):637–645, 2006.
  • [2] I. Belghazi, S. Rajeswar, A. Baratin, R. D. Hjelm, and A. Courville. Mine: mutual information neural estimation. arXiv preprint arXiv:1801.04062, 2018.
  • [3] A. Bousseau, S. Paris, and F. Durand. User-assisted intrinsic images. ACM Transactions on Graphics (TOG), 28(5):130, 2009.
  • [4] R. Caputo. National Geographic Photography Field Guide. National Geographic, 2005.
  • [5] T. Q. Chen and M. Schmidt. Fast patch-based style transfer of arbitrary style. arXiv preprint arXiv:1612.04337, 2016.
  • [6] M. D. Donsker and S. S. Varadhan. Asymptotic evaluation of certain markov process expectations for large time. iv. Communications on Pure and Applied Mathematics, 36(2):183–212, 1983.
  • [7] C. Dugas, Y. Bengio, F. Bélisle, C. Nadeau, and R. Garcia. Incorporating second-order functional knowledge for better option pricing. Advances in neural information processing systems, pages 472–478, 2001.
  • [8] D. Freedman and P. Kisilev. Object-to-object color transfer: Optimal flows and SMSP transformations. 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 287–294, 2010.
  • [9] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2414–2423, 2016.
  • [10] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
  • [11] J. Han and C. Moraga. The influence of the sigmoid function parameters on the speed of backpropagation learning. From Natural to Artificial Neural Computation, pages 195–201, 1995.
  • [12] M. He, J. Liao, D. Chen, L. Yuan, and P. V. Sander. Progressive color transfer with dense semantic correspondences. ACM Transactions on Graphics (TOG), 2018.
  • [13] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, A. Trischler, and Y. Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
  • [14] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
  • [15] S. Huang and T. D. Tran. Sparse signal recovery via generalized entropy functions minimization. arXiv preprint arXiv:1703.10556, 2017.
  • [16] X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. Proceedings of the IEEE International Conference on Computer Vision, pages 1501–1510, 2017.
  • [17] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. European Conference on Computer Vision, pages 694–711, 2016.
  • [18] P. Kumaraswamy. A generalized probability density function for double-bounded random processes. Journal of Hydrology, 46(1-2):79–88, 1980.
  • [19] P.-Y. Laffont, A. Bousseau, S. Paris, F. Durand, and G. Drettakis. Coherent intrinsic images from photo collections. ACM Transactions on Graphics, 31(6), 2012.
  • [20] A. Levin, D. Lischinski, and Y. Weiss. A closed-form solution to natural image matting. IEEE transactions on pattern analysis and machine intelligence, 30(2):228–242, 2008.
  • [21] C. Li and M. Wand. Combining markov random fields and convolutional neural networks for image synthesis. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2479–2486, 2016.
  • [22] X. Li, S. Liu, J. Kautz, and M.-H. Yang. Learning linear transformations for fast arbitrary style transfer. arXiv preprint arXiv:1808.04537, 2018.
  • [23] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang. Universal style transfer via feature transforms. Advances in neural information processing systems, pages 386–396, 2017.
  • [24] Y. Li, M.-Y. Liu, X. Li, M.-H. Yang, and J. Kautz. A closed-form solution to photorealistic image stylization. Proceedings of the European Conference on Computer Vision (ECCV), pages 453–468, 2018.
  • [25] J. Liao, Y. Yao, L. Yuan, G. Hua, and S. B. Kang. Visual attribute transfer through deep image analogy. arXiv preprint arXiv:1705.01088, 2017.
  • [26] F. Luan, S. Paris, E. Shechtman, and K. Bala. Deep photo style transfer. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4990–4998, 2017.
  • [27] R. Mechrez, E. Shechtman, and L. Zelnik-Manor. Photorealistic style transfer with screened poisson equation. arXiv preprint arXiv:1709.09828, 2017.
  • [28] E. Nalisnick and P. Smyth. Deep generative models with stick-breaking priors. ICML, 2017.
  • [29] F. Nie, H. Huang, X. Cai, and C. H. Ding. Efficient and robust feature selection via joint ℓ2,1-norms minimization. Advances in Neural Information Processing Systems, pages 1813–1821, 2010.
  • [30] I. Omer and M. Werman. Color lines: image specific color representation. Proceedings of IEEE computer society conference on Computer vision and pattern recognition (CVPR), pages 946–953, 2004.
  • [31] F. Pitie, A. C. Kokaram, and R. Dahyot. N-dimensional probability density function transfer and its application to color transfer. Tenth IEEE International Conference on Computer Vision (ICCV), 2:1434–1439, 2005.
  • [32] F. Pitié, A. C. Kokaram, and R. Dahyot. Automated colour grading using colour distribution transfer. Computer Vision and Image Understanding, 107(1-2):123–137, 2007.
  • [33] Y. Qu, H. Qi, and C. Kwan. Unsupervised sparse dirichlet-net for hyperspectral image super-resolution. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2511–2520, 2018.
  • [34] E. Reinhard, M. Adhikhmin, B. Gooch, and P. Shirley. Color transfer between images. IEEE Computer graphics and applications, 21(5):34–41, 2001.
  • [35] J. Sethuraman. A constructive definition of dirichlet priors. Statistica sinica, pages 639–650, 1994.
  • [36] L. Sheng, Z. Lin, J. Shao, and X. Wang. Avatar-net: Multi-scale zero-shot style transfer by feature decoration. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8242–8250, 2018.
  • [37] Y. Shih, S. Paris, C. Barnes, W. T. Freeman, and F. Durand. Style transfer for headshot portraits. ACM Transactions on Graphics (TOG), 33(4):148, 2014.
  • [38] Y. Shih, S. Paris, F. Durand, and W. T. Freeman. Data-driven hallucination of different times of day from a single outdoor photo. ACM Transactions on Graphics (TOG), 32(6):200, 2013.
  • [39] D. Ulyanov, V. Lebedev, V. Lempitsky, et al. Texture networks: Feed-forward synthesis of textures and stylized images. International Conference on Machine Learning, pages 1349–1357, 2016.
  • [40] J. Woo, M. Stone, and J. L. Prince. Multimodal registration via mutual information incorporating geometric and spatial context. IEEE Transactions on Image Processing, 24(2):757–769, 2015.
  • [41] F. Wu, W. Dong, Y. Kong, X. Mei, J.-C. Paul, and X. Zhang. Content-based colour transfer. Computer Graphics Forum, 32(1):190–203, 2013.
  • [42] B. Zitova and J. Flusser. Image registration methods: a survey. Image and vision computing, 21(11):977–1000, 2003.