DeepHist: Differentiable Joint and Color Histogram Layers for Image-to-Image Translation

by   Mor Avi-Aharon, et al.

We present the DeepHist - a novel Deep Learning framework for augmenting a network by histogram layers, and demonstrate its strength by addressing image-to-image translation problems. Specifically, given an input image and a reference color distribution, we aim to generate an output image with the structural appearance (content) of the input (source) yet with the colors of the reference. The key idea is a new technique for a differentiable construction of joint and color histograms of the output images. We further define a color distribution loss based on the Earth Mover's Distance between the output's and the reference's color histograms, and a Mutual Information loss based on the joint histograms of the source and the output images. Promising results are shown for the tasks of color transfer, image colorization and edges → photo, where the color distribution of the output image is controlled. Comparisons to Pix2Pix and CycleGAN are shown.




1 Introduction

Convolutional Neural Networks (CNNs) have dramatically improved the state of the art in many practical domains [9, 10]. While numerous loss functions have been proposed, metrics based on image histograms, which represent images by their color distributions [2, 21], have not been considered. The main obstacle seems to be the histogram construction, which is not a differentiable operation and therefore cannot be incorporated into a deep learning framework.

In this work, we introduce the DeepHist - a deep learning framework for image generation, which enables a differentiable construction of joint and color histograms of the output images. We further define color-based and statistical similarity loss functions that are built exclusively on the differentiable histograms of the generated images. Specifically, we augment a neural network generator by histogram layers that take part in the back-propagation process, in which the respective histogram loss functions are used for updating the generator weights. Relying on the color distribution rather than on the differences between corresponding pixels allows us to address image-to-image translation problems for which the desired target images do not necessarily exist. Consider for example the color transfer problem, as exemplified in the left panel of Fig. 1, where the aim is to paint an input (source) image with the colors of a different color reference image. For this kind of unpaired learning task, none of the prevalent loss functions that are based on pixel-by-pixel comparison, e.g., mean-square error (MSE) or cross-entropy, can be used. We also address generalizations of the image colorization and edges → photo problems, where the color distribution of a generated image is constrained to fit a particular color histogram (Fig. 1, middle and right panels).

Figure 1: Image-to-image translation tasks, presented from left to right: color transfer, image colorization and edges → photo. The inputs for all tasks consist of a content reference image (an edge map in the case of edges → photo) and the color histograms of an RGB image (a color-reference image). The output for all tasks is an RGB image with the content of the source image and the color distribution of the color-reference image. For example, (l) and (m) are two possible outputs of edges → photo, for the input histogram of either (i) or (k), respectively.

Color and intensity histograms are useful representations for image-to-image translation tasks. Classical methods for color transfer were based on the concept of histogram matching, where the main idea was to adapt a color histogram of a given image to the target image. Reinhard et al. [17] addressed color transfer by using a simple statistical analysis to impose one image’s color characteristics on another, in the Lab color space. Neumann et al. [14] used 3D histogram matching in the hue-saturation-lightness (HSL) colorspace. Their method is based on mapping an arbitrary source gamut to the arbitrary target one, while colors with same hues of target image will have the same hues after the transformation. The proposed mapping required histogram smoothing to reduce undesired gradient effects.

In this work, we exploit histogram matching using the network as an optimizer. The distance between a pair of histograms is defined by the Earth Mover’s Distance (EMD).

A deformation of the color distribution of an image can distort its content; therefore, enforcement of the structural similarity between the source and the output images is required. The main problem is that the images have different intensities in corresponding locations, which makes pixel-to-pixel comparison inapplicable. To address this issue, we suggest using the mutual information (MI) of the source and the output images as a measure of their content-based, color-free similarity. In a seminal work, Viola and Wells [22] used a cost function based on MI for image registration, where the target image and the source have different intensity distributions. Since then, MI-based registration has become popular in biomedical imaging applications, in particular when the alignment of medical images acquired by different imaging modalities is addressed. An essential component for calculating the MI of two images is the generation of their joint histogram. In the context of image registration it is called a co-occurrence matrix. While there has been significant work exploiting co-occurrence matrices, the use of joint histograms and MI for image-to-image translation tasks has (to the best of our knowledge) not been done before. Moreover, differentiable construction of intensity histograms and joint histograms as part of a deep learning framework is done here for the first time.

Recent image generation approaches, and image-to-image translation in particular, are mostly based on deep learning frameworks. Since the main aim is generating realistic examples, adversarial frameworks, in which an adversarial network is trained to discriminate between real and fake examples, seem to be very effective [3]. In their pix2pix framework, Isola et al. performed image-to-image translation (e.g., colorization of gray-scale images and edges → photo) by using an adversarial loss as well as a loss between corresponding pixels in the network's output and the desired target image [6]. In this sense, pix2pix is a fully supervised method and obviously cannot be applied to problems (such as color transfer) where the desired target image does not exist. Moreover, as discussed in [6], the images generated by using the L1 loss tend to have grayish or brownish colors when there is uncertainty regarding which of several plausible color values a pixel should take on. Specifically, L1 will be minimized by choosing the median of the conditional probability density function over possible colors. The problem of color uncertainty is addressed in Zhang et al. [25] by a class-based colorization approach, in which the loss of each pixel in an image of a particular class is weighted according to the frequency of its color in that class. This process, termed class rebalancing, increases the color diversity of the test results.

Zhu et al. [27] addressed image-to-image translation in the unpaired setting using cycle-consistent adversarial networks. The CycleGAN enables style and color transfer (e.g., summer to winter) when the desired output image cannot be used for training. The main idea is using an adversarial loss to map an image X into Y and then mapping Y back into X such that cycle consistency is preserved. The CycleGAN presents compelling results, yet since in many cases the cycle-consistency constraint is not sufficient, additional supervision and loss functions are often required. He et al. [4] proposed a two-step pipeline for color transfer based on deep semantic correspondences (via VGG19) between an input and a reference image, followed by local color transfer in the image domain. The method provides visually appealing results yet requires structural and semantic similarity of the reference with respect to the input image. Moreover, the output color distribution can only be controlled by the reference image.

Figure 2: Edges → photo based on the same edge image yet with different user-selected color histograms (panels, left to right: Input, Source, Output 1, Output 2, Output 3). The reference color histogram for Output 1 is the color histogram of the source image. For Outputs 2 and 3, we directly defined the color histogram in order to generate images with new colors.

The DeepHist presents a conceptual alternative to existing image-to-image translation methods. It does not require the extraction of semantic features, nor does it need a reference color image with semantic similarity to the input image. Instead, reference color histograms representing the desired color distribution of the output are provided to the network. While these color histograms can be constructed from a reference color image, as is the case for the color transfer problem and as exemplified in Fig. 1, they can also be user-defined for color-controlled image colorization or edges → photo tasks (Fig. 2). The intensity-based loss we propose for 'painting' the output image with the reference colors is based on the EMD between a differentiable histogram constructed from the output image and the reference histogram. Moreover, structure/content similarity between the source and the output images is preserved thanks to the mutual information loss, which we define based on the joint source-output histogram. An adversarial loss is also utilized here for generating realistic images, ensuring, for example, green grass and blue sky and not the other way around; however, our framework does not rely on it exclusively or even mainly, which makes it much more stable. Finally, avoiding pixel-to-pixel comparison via L1 or other distance measures allows us to handle unpaired image-to-image translation such as color transfer.

The DeepHist framework comprises a generator, which is an adaptation of the well-known U-Net - an encoder-decoder with skip connections [18]. Yet, the main contributions are the augmented parts of the network, which allow differentiable construction of intensity (1D) and joint (2D) histograms, such that histogram-based loss functions are used to train the image generator in an end-to-end manner. We demonstrate the proposed framework for different paired and unpaired image-to-image translation tasks with several publicly available datasets. This includes color transfer for the flowers dataset [15], image colorization for the summer-winter dataset [27] and edges → photo for the shoes [24] and the bags [26] datasets.

2 Methods

In this section we review the main principles underlying the differentiable construction of 1D and 2D (joint) color histograms (Section 2.1). We then define the histogram-based metrics (Section 2.2) that are used for defining the differentiable loss functions (Section 2.3). The network architecture is presented in Section 2.4. Implementation details are presented in Section 2.5.

2.1 Differentiable Histograms Construction

2.1.1 Color Space

To address image-to-image translation problems we choose the YUV color space. It is composed of one luma component (Y) and two chrominance components, called U (blue projection) and V (red projection). The native range of the Y channel differs from that of the U and V channels; for practical reasons we map all channels' values to a common range. In the following sections we refer to each color channel as a gray-level image.
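As a rough illustration of this color-space step, the sketch below converts RGB values in [0, 1] to YUV using the standard BT.601 analog coefficients and then rescales every channel to a common range. The target range [0, 1], the function name `rgb_to_unit_yuv` and the constants `U_MAX` and `V_MAX` are assumptions made for this example and are not taken from the text above.

```python
import numpy as np

# BT.601 analog YUV coefficients; rows produce Y, U, V from (R, G, B) in [0, 1].
RGB_TO_YUV = np.array([
    [0.299, 0.587, 0.114],
    [-0.14713, -0.28886, 0.436],
    [0.615, -0.51499, -0.10001],
])
U_MAX = 0.436  # |U| never exceeds this value for RGB in [0, 1]
V_MAX = 0.615  # |V| never exceeds this value for RGB in [0, 1]


def rgb_to_unit_yuv(rgb):
    """Convert an (..., 3) RGB array in [0, 1] to YUV with all channels in [0, 1].

    The common target range [0, 1] is an assumption of this sketch.
    """
    rgb = np.asarray(rgb, dtype=np.float64)
    yuv = rgb @ RGB_TO_YUV.T
    y = yuv[..., 0]                          # already in [0, 1]
    u = (yuv[..., 1] + U_MAX) / (2 * U_MAX)  # shift and scale to [0, 1]
    v = (yuv[..., 2] + V_MAX) / (2 * V_MAX)  # shift and scale to [0, 1]
    return np.stack([y, u, v], axis=-1)
```

After this mapping each of the three channels can be treated as a separate gray-level image, which is how the histogram construction below uses them.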

2.1.2 Differentiable 1D Color Histogram Formulation

Images acquired by digital cameras have three color channels, each with a discrete range of intensity values. The intensity distribution of each channel can be described by an intensity histogram, obtained by counting the number of pixels at each intensity value. Since synthesized images can take any value in a continuous range, we denote the intensity of an image pixel in a particular channel by $g$.

We use Kernel Density Estimation (KDE) for estimating the gray-level density $f_{I_c}(g)$ of an image channel $I_c$ as follows:

$$f_{I_c}(g) = \frac{1}{NW} \sum_{x \in \Omega} K\left(\frac{I_c(x) - g}{W}\right) \tag{1}$$

where $c \in \{Y, U, V\}$, $x \in \Omega$ is a pixel location, $K(\cdot)$ is the kernel, $W$ is the bandwidth and $N$ is the number of pixels in the image. We choose the kernel $K(\cdot)$ as the derivative of the logistic (sigmoid) function $\sigma(\cdot)$ as follows:

$$K(z) = \frac{d}{dz}\sigma(z) = \sigma(z)\,\sigma(-z) \tag{2}$$

where $\sigma(z) = 1/(1 + e^{-z})$. We note that Eq. (2) is a non-negative, real-valued, integrable function and satisfies the requirements for a kernel (normalization and symmetry).

For the construction of a smooth and differentiable image histogram, we partition the intensity interval into $K$ sub-intervals $B_k$, each with length $L$ and center $\mu_k$, so that $B_k = [\mu_k - L/2,\ \mu_k + L/2]$. We can then define the probability that a pixel in the image belongs to a certain gray-level interval (the value of the normalized histogram's $k$-th bin) as

$$P_{I_c}(k) = \int_{B_k} f_{I_c}(g)\, dg \tag{3}$$

By solving the integral we get

$$P_{I_c}(k) = \frac{1}{N} \sum_{x \in \Omega} \left[ \sigma\left(\frac{I_c(x) - \mu_k + L/2}{W}\right) - \sigma\left(\frac{I_c(x) - \mu_k - L/2}{W}\right) \right] \tag{4}$$

The function which provides the value of the $k$-th bin in a differentiable histogram can be rewritten as follows:

$$P_{I_c}(k) = \frac{1}{N} \sum_{x \in \Omega} \Pi_k\left(I_c(x)\right) \tag{5}$$

where

$$\Pi_k(z) = \sigma\left(\frac{z - \mu_k + L/2}{W}\right) - \sigma\left(\frac{z - \mu_k - L/2}{W}\right) \tag{6}$$

is a differentiable approximation of the Rect function. Fig. 3 illustrates the application of three (out of $K$) activation functions $\Pi_k$ to a gray-scale image. The resulting channels (activation maps) are used for the construction of the corresponding gray-level histogram. Specifically, the $k$-th histogram bin is obtained by a summation of the values of the $k$-th channel. The set of $K$ channels can be viewed as smooth one-hot approximations of the pixel values in a gray-level image. Note that the support of $\Pi_k$ is over the gray-level range and, as opposed to a convolutional kernel, it is not spatial. A differentiable histogram of a gray-level image $I_c$ is defined as follows:

$$h(I_c) = \left(P_{I_c}(1), \dots, P_{I_c}(K)\right) \tag{7}$$
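The 1D construction can be sketched in a few lines of NumPy: each pixel is passed through one soft "rect" activation per bin (a difference of two logistic functions), and each bin value is the mean of its activation map. This is a minimal sketch rather than the authors' implementation; the function name `soft_histogram`, the value range [0, 1], the default of 16 bins and the bandwidth choice of bin length divided by 2.5 are all assumptions made for illustration. NumPy itself does not differentiate this code, but every operation is smooth, so the same expressions can be written in an automatic-differentiation framework.

```python
import numpy as np


def sigmoid(z):
    """Logistic function; its derivative is the KDE kernel described above."""
    return 1.0 / (1.0 + np.exp(-z))


def soft_histogram(channel, num_bins=16, lo=0.0, hi=1.0, bandwidth=None):
    """Smooth histogram of one image channel with values in [lo, hi].

    Returns (hist, maps): hist has shape (num_bins,), and maps has shape
    (num_bins, N) with one flattened activation map per bin.
    """
    values = np.asarray(channel, dtype=np.float64).ravel()   # N pixel values
    bin_len = (hi - lo) / num_bins                           # bin length
    if bandwidth is None:
        bandwidth = bin_len / 2.5                            # assumed bandwidth
    centers = lo + bin_len * (np.arange(num_bins) + 0.5)     # bin centers
    diff = values[None, :] - centers[:, None]                # (num_bins, N)
    # Soft rect: close to 1 inside the bin, decaying smoothly outside it.
    maps = (sigmoid((diff + bin_len / 2) / bandwidth)
            - sigmoid((diff - bin_len / 2) / bandwidth))
    hist = maps.mean(axis=1)                                 # sum over pixels / N
    return hist, maps
```

A smaller bandwidth makes each activation closer to a hard indicator (and the result closer to an ordinary histogram), at the cost of gradients that vanish away from the bin edges.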

Figure 3: Activation functions and maps. (a) Output channel: one (out of three) color channel of the generated output image. (b) Activation functions: three out of the K activation functions. (c) Activation maps: three out of the K activation maps, each generated by applying the respective activation function to the output color channel shown in (a). Note that pixels with values closer to the center of a bin have higher values in the corresponding activation map.

2.1.3 Differentiable Joint Color Histogram Formulation

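The joint histogram of two gray-level images, each with $K$ discrete gray levels, is a $K \times K$ matrix constructed such that its $(k_1, k_2)$ entry counts the number of times pixels with gray-level value $k_1$ in one image correspond to pixels with gray-level value $k_2$ in the other. The joint gray-level density is obtained by normalizing the joint gray-level histogram. Considering two images $I_1$, $I_2$ with continuous pixel values, their joint gray-level density can be defined using multivariate KDE as follows:

$$f_{I_1, I_2}(\mathbf{g}) = \frac{1}{N} |\mathbf{H}|^{-1/2} \sum_{x \in \Omega} K\left(\mathbf{H}^{-1/2}\left(\mathbf{I}(x) - \mathbf{g}\right)\right) \tag{8}$$

where $\mathbf{g} = (g_1, g_2)^T$, $\mathbf{I}(x) = (I_1(x), I_2(x))^T$, $N$ is the number of pixels, $\mathbf{H}$ is the bandwidth (or smoothing) matrix and $K(\cdot)$ is the symmetric 2D kernel function. As in the 1D case (Eq. 2), we choose the kernel as the derivative of the logistic function $\sigma(z) = 1/(1 + e^{-z})$ for each of the two variables separately:

$$K(\mathbf{z}) = \frac{d}{dz_1}\sigma(z_1)\, \frac{d}{dz_2}\sigma(z_2) = \sigma(z_1)\,\sigma(-z_1)\,\sigma(z_2)\,\sigma(-z_2) \tag{9}$$

We define the bandwidth matrix as $\mathbf{H} = \mathrm{diag}(W^2, W^2)$, where $W$ is the 1D bandwidth. We define the probability of corresponding pixels in $I_1$ and $I_2$ to belong to the intensity intervals $B_{k_1}$ and $B_{k_2}$, respectively, as follows:

$$P_{I_1, I_2}(k_1, k_2) = \int_{B_{k_1}} \int_{B_{k_2}} f_{I_1, I_2}(g_1, g_2)\, dg_2\, dg_1 \tag{10}$$

By solving the integral, with $\mu_k$ and $L$ denoting the center and the length of bin $B_k$, we get:

$$P_{I_1, I_2}(k_1, k_2) = \frac{1}{N} \sum_{x \in \Omega} \left[ \sigma\left(\frac{I_1(x) - \mu_{k_1} + L/2}{W}\right) - \sigma\left(\frac{I_1(x) - \mu_{k_1} - L/2}{W}\right) \right] \left[ \sigma\left(\frac{I_2(x) - \mu_{k_2} + L/2}{W}\right) - \sigma\left(\frac{I_2(x) - \mu_{k_2} - L/2}{W}\right) \right] \tag{11}$$

By using the definition of the activation function $\Pi_k$ from Eq. 6, we can express the value of the $(k_1, k_2)$-th bin of the joint histogram as

$$P_{I_1, I_2}(k_1, k_2) = \frac{1}{N} \sum_{x \in \Omega} \Pi_{k_1}\left(I_1(x)\right)\, \Pi_{k_2}\left(I_2(x)\right) \tag{12}$$

This equation can also be written using matrix notation. We define a $K \times N$ matrix $\mathbf{P}_I$, each row of which is a flattened activation map generated from a gray-level image $I$. A differentiable joint histogram of two images $I_1$, $I_2$ can be constructed via matrix multiplication as follows:

$$\mathbf{J}(I_1, I_2) = \frac{1}{N}\, \mathbf{P}_{I_1} \mathbf{P}_{I_2}^T \tag{13}$$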
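The matrix form of the joint histogram can be sketched as follows: the activation maps of each image are flattened into a (bins × pixels) matrix, and the product of one matrix with the transpose of the other, divided by the number of pixels, gives the smooth joint histogram. The helper names `activation_maps` and `joint_histogram`, the value range [0, 1] and the bandwidth of bin length divided by 2.5 are assumptions of this illustration, not details confirmed by the text.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def activation_maps(channel, num_bins, lo=0.0, hi=1.0):
    """Return a (num_bins, N) matrix; row k is the flattened k-th activation map."""
    values = np.asarray(channel, dtype=np.float64).ravel()
    bin_len = (hi - lo) / num_bins
    bandwidth = bin_len / 2.5                      # assumed bandwidth
    centers = lo + bin_len * (np.arange(num_bins) + 0.5)
    diff = values[None, :] - centers[:, None]
    return (sigmoid((diff + bin_len / 2) / bandwidth)
            - sigmoid((diff - bin_len / 2) / bandwidth))


def joint_histogram(img_a, img_b, num_bins=16):
    """Smooth joint histogram of two equally sized gray-level images."""
    maps_a = activation_maps(img_a, num_bins)      # (num_bins, N)
    maps_b = activation_maps(img_b, num_bins)      # (num_bins, N)
    num_pixels = maps_a.shape[1]
    # Entry (k1, k2) sums, over pixels, the product of the two soft bin memberships.
    return maps_a @ maps_b.T / num_pixels
```

Because the whole computation is a single matrix product over precomputed activation maps, the cost grows with the number of pixels times the square of the number of bins, which is why the number of bins is a practical design parameter.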


2.1.4 Histogram Layers

1D Histogram Layer
Figure 4: Histogram construction. (a) Activation maps are generated by applying the activation functions to an output channel of the generator network (one of the three color channels of the synthesized image). Summation of the values of each activation map (and normalization by the number of pixels) provides the respective histogram bin. (b) Construction of activation maps by applying the activation functions to each of the color channels of the source image. (c) The joint histogram is constructed by matrix multiplication of the reshaped activation maps of the output channel and the reshaped activation maps of the source. The joint histogram is used for defining the Mutual Information loss, which constrains content-based similarity between the generated and the source images.

Three gray-level (1D) histogram layers, each with one unit per histogram bin, are constructed from the output layers of the generator network (the synthesized output image), one for each color channel. The value of each unit in a histogram layer is obtained by summing the respective activation map and normalizing by the number of pixels (Eq. 5). As illustrated in Figure 3, the activation maps are constructed by applying the activation functions to the three output image layers. This operation is illustrated in Fig. 4a.

Joint Histogram Layer

Having one activation map per bin for each color channel of the synthesized output image, we construct three matrices (one row per bin, one column per pixel) by reshaping the maps into vectors. Applying a similar process to the input image, we can now construct three joint histograms via three matrix multiplications (Eq. 13), corresponding to the Y, U and V channels. Figure 4 illustrates the main ideas.

2.2 Metrics

2.2.1 Earth Mover’s Distance

We use the EMD [19], also known as the Wasserstein metric [1], to define the distance between two image histograms. Let $h_1$ and $h_2$ be the histograms of the images $I_1$ and $I_2$, respectively. We note that when $h_1$ and $h_2$ have the same overall mass, the EMD is a true metric [19]. Moreover, when the compared histograms are also 1D, the EMD has been shown to be equivalent to the Mallows distance, which has a closed-form solution [11]. Werman et al. [23] showed that the EMD is equal to the $L_1$ distance between the cumulative histograms. Following Hou et al. [5] we use the (squared) Euclidean distance because it usually converges faster and is easier to optimize with gradient descent [13, 20]:

$$EMD(h_1, h_2) = \frac{1}{K} \sum_{k=1}^{K} \left( CDF_{h_1}(k) - CDF_{h_2}(k) \right)^2 \tag{14}$$

where $K$ is the number of histogram bins and $CDF_h(k)$ is the $k$-th element of the cumulative density function of $h$.
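To make the cumulative-histogram view concrete, the sketch below compares two 1D histograms through their cumulative sums, once with the L1 distance (the classical 1D EMD) and once with a squared Euclidean variant of the kind used as a training loss. The function names and the choice to average the squared differences over the bins are assumptions made for this example.

```python
import numpy as np


def cumulative(hist):
    """Cumulative distribution of a histogram, normalized to unit mass."""
    hist = np.asarray(hist, dtype=np.float64)
    return np.cumsum(hist / hist.sum())


def emd_l1(hist_a, hist_b):
    """L1 distance between cumulative histograms (1D EMD, in units of bins)."""
    return float(np.sum(np.abs(cumulative(hist_a) - cumulative(hist_b))))


def emd_l2(hist_a, hist_b):
    """Mean squared difference between cumulative histograms (assumed normalization)."""
    return float(np.mean((cumulative(hist_a) - cumulative(hist_b)) ** 2))
```

For example, moving all the mass from the first to the last of four bins gives an L1 value of 3 (one unit of mass moved three bins), while moving it by a single bin gives 1; unlike a bin-by-bin comparison, both variants grow with how far the mass has to travel.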

2.2.2 Mutual information

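The MI of two images $I_1$ and $I_2$ is defined as follows:

$$MI(I_1, I_2) = \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} P_{I_1, I_2}(k_1, k_2) \log \frac{P_{I_1, I_2}(k_1, k_2)}{P_{I_1}(k_1)\, P_{I_2}(k_2)} \tag{15}$$

where $P_{I_1}$, $P_{I_2}$ are the image histograms as defined in Eq. 5, and $P_{I_1, I_2}$ is the joint histogram discussed in Section 2.1.3. Maximizing the MI between the output and the source image allows us to generate images with color-free statistical similarity. Following [8] we define the MI loss as follows:

$$D(I_1, I_2) = 1 - \frac{MI(I_1, I_2)}{H(I_1, I_2)} \tag{16}$$

where $H(I_1, I_2)$ is the joint entropy of $I_1$, $I_2$, defined as

$$H(I_1, I_2) = -\sum_{k_1=1}^{K} \sum_{k_2=1}^{K} P_{I_1, I_2}(k_1, k_2) \log P_{I_1, I_2}(k_1, k_2) \tag{17}$$

The quantity $D(I_1, I_2)$ is a metric [8], with $D(I_1, I_1) = 0$ and $D(I_1, I_2) \le 1$ for all pairs $(I_1, I_2)$. This metric has symmetry, positivity and boundedness properties.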
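The MI-based distance can be sketched directly from a joint histogram. In this illustration the marginal histograms are obtained by summing the joint histogram over its rows and columns, which is a simplification assumed here rather than something stated in the text; the function name `mi_distance` and the small constant `eps` (used only to avoid taking the logarithm of zero) are likewise assumptions.

```python
import numpy as np


def mi_distance(joint, eps=1e-12):
    """Return (distance, mi, joint_entropy) for a non-negative joint histogram.

    distance = 1 - MI / joint entropy, so it is 0 when one image determines
    the other and 1 when the two images are statistically independent.
    """
    joint = np.asarray(joint, dtype=np.float64)
    joint = joint / joint.sum()                    # normalize to unit mass
    p_a = joint.sum(axis=1, keepdims=True)         # marginal of the first image
    p_b = joint.sum(axis=0, keepdims=True)         # marginal of the second image
    mi = float(np.sum(joint * (np.log(joint + eps) - np.log(p_a @ p_b + eps))))
    joint_entropy = float(-np.sum(joint * np.log(joint + eps)))
    distance = 1.0 - mi / (joint_entropy + eps)
    return distance, mi, joint_entropy
```

A diagonal joint histogram (every gray level in one image maps to exactly one gray level in the other) yields a distance close to 0 regardless of which gray levels are paired, which is what makes the measure insensitive to a change of colors that preserves structure.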

2.3 Loss functions

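The complete loss is a weighted sum of three loss functions:

$$\mathcal{L} = \lambda_{EMD}\, \mathcal{L}_{EMD} + \lambda_{MI}\, \mathcal{L}_{MI} + \lambda_{ADV}\, \mathcal{L}_{ADV} \tag{18}$$

where $\mathcal{L}_{EMD}$, $\mathcal{L}_{MI}$, $\mathcal{L}_{ADV}$ are the color loss using EMD, the statistical similarity loss using MI and the adversarial loss, respectively. The scalars $\lambda_{EMD}$, $\lambda_{MI}$, $\lambda_{ADV}$ are the weights.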

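The EMD loss is derived from Eq. 14, which defines the EMD between two histograms. The EMD loss between the output and reference color histograms is defined as follows:

$$\mathcal{L}_{EMD} = \sum_{c \in \{Y, U, V\}} EMD\left(h^{c}_{ref}, h^{c}_{out}\right) \tag{19}$$

where $h^{c}_{ref}$, $h^{c}_{out}$ are the reference and the output histograms of the YUV channels.

The MI loss between the channels of the network's output and the source image is based on their relative MI (Eq. 16) and is defined as follows:

$$\mathcal{L}_{MI} = \sum_{c \in \{Y, U, V\}} \left( 1 - \frac{MI\left(I^{c}_{src}, I^{c}_{out}\right)}{H\left(I^{c}_{src}, I^{c}_{out}\right)} \right) \tag{20}$$

where $I^{c}_{src}$, $I^{c}_{out}$ are channel $c$ of the source and the output images, $MI$ is the mutual information and $H$ is the joint entropy.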

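We use a conditional GAN loss similar to [6]. The discriminator learns to distinguish between the output and the source, conditioned on the input. For the color transfer problem, the discriminator input is the source or the output image, without a conditioning input. The objective of the conditional GAN can be expressed as:

$$\mathcal{L}_{ADV}(G, D) = \mathbb{E}_{x, y}\left[\log D(x, y)\right] + \mathbb{E}_{x, z}\left[\log\left(1 - D\left(x, G(x, z)\right)\right)\right] \tag{21}$$

where $G$ is the generator, $D$ is the discriminator, $x$ is the input image, $y$ is the real (source) image shown to the discriminator, $G(x, z)$ is the generated output image, and $z$ is noise in the form of dropout.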
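As a small numerical illustration of the adversarial objective, the sketch below evaluates the discriminator loss and the commonly used non-saturating generator loss from arrays of discriminator outputs (probabilities that a sample is real). It only evaluates the loss values; no network or training loop is included, and the function names and the constant `eps` are assumptions of this example.

```python
import numpy as np


def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Negative of E[log D(real)] + E[log(1 - D(fake))], to be minimized by D."""
    d_real = np.asarray(d_real, dtype=np.float64)
    d_fake = np.asarray(d_fake, dtype=np.float64)
    return float(-np.mean(np.log(d_real + eps))
                 - np.mean(np.log(1.0 - d_fake + eps)))


def generator_loss(d_fake, eps=1e-12):
    """Non-saturating generator loss: minimizing it maximizes E[log D(fake)]."""
    d_fake = np.asarray(d_fake, dtype=np.float64)
    return float(-np.mean(np.log(d_fake + eps)))
```

When the discriminator outputs 0.5 for every sample it cannot separate real from generated images, and the discriminator loss equals 2 log 2; the generator loss decreases as the discriminator assigns higher "real" probabilities to generated samples.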

2.4 DeepHist Network Architecture

Figure 5: DeepHist network architecture. The DeepHist network architecture is composed of an image generator (a modified version of the U-Net, yellow) augmented by input (light blue) and output (pink) histogram layers. The input to the encoder part of the generator is either a gray-scale image (for image colorization), an edge map (for edges → photo), or a different color image (for color transfer). In addition, target color histograms are fed (each separately) into embedding layers, followed by a fully connected layer and a concatenation with the code layer of the generator. The three output layers of the generator (which together compose the three color channels of the synthesized output image) are used for the construction of color (1D) and joint (2D) histogram layers. The histogram construction is illustrated in Figure 4.

Figure 5 illustrates the generator architecture as well as the augmented input and output histogram layers. The DeepHist network architecture is composed of an image generator (a modified version of the U-Net [18]) augmented by input and output histogram layers. The input to the encoder part of the generator is either a gray-scale image (for image colorization), an edge map (for edges → photo), or an RGB image (for color transfer). In addition, reference color histograms are fed (each separately) into embedding layers, followed by a fully connected layer and a concatenation with the code layer of the generator. Embedding the reference histogram within the network generator allows us to control the color distribution of the output image. The three output layers of the generator (which together compose the three color channels of the synthesized output image) are used for the construction of color (1D) and joint (2D) histogram layers. The histogram construction is illustrated in Fig. 4. The color histogram layers allow us to constrain color similarity to the reference, while the joint histogram layers allow us to constrain content similarity to the source, via the respective loss functions. As in [6], we use the convolutional "PatchGAN" classifier [12] as a discriminator for the construction of an adversarial loss.

2.5 Implementation Details

To optimize our networks, we alternate between one gradient descent step on the Discriminator (D), then one step on the Generator (G). As suggested in [3], we train to maximize . We use minibatch SGD and apply the Adam solver [7], with a learning rate of , and momentum parameters , . For histograms construction we use bins, , .

3 Experimental Results

To demonstrate the strengths of the DeepHist method, we test it on several tasks and datasets:

  1. Edges → photo. We used two different datasets from [24] and [26] to demonstrate the edges → shoe and edges → bag problems. We divided the datasets into training and test sets as in [6]. During training, the input is an edge map and the output is a synthesized image with the color distribution of the source image (i.e., the real image used for generating the edge map). The MI loss is calculated with respect to the source image to constrain content similarity to the source. During the test phase, we generate synthesized images based on the same edge map yet with different selected color histograms. For evaluating our method, we present synthesized images with the color distribution of either the source image or a color reference image. Figure 6 and Figure 7 present visual edges → shoe and edges → bag results, respectively.

    Figure 6: Visual results of the edges → shoe problem (columns, left to right: Input, Source, Ours (source), Pix2Pix, Reference, Ours (reference)). The output image is generated with the colors of either the source image (col. 3) or a different color reference image (col. 6). For comparison, Pix2Pix [6] results are presented (col. 4).
    Figure 7: Visual results of the edges → bag problem (columns, left to right: Input, Source, Output (source), Reference, Output (reference)). The output image is generated with the colors of either the source image (col. 3) or a different color reference image (col. 5).
  2. Image colorization. We use the summer/winter Yosemite dataset, prepared by [27] using the Flickr API. We use train/test splits as in [27]. During training, the input is a gray-scale image (generated from the source image), randomly selected from the training set of summer and winter images, and the output is a colorized image with the color distribution of the source (original) image. During the test phase, we generate synthesized images based on the same gray-scale image yet with different selected color histograms. For evaluating our method, we present synthesized images with the color distribution of either the source image or a color reference image. Results and a comparison to CycleGAN are shown in Figure 8. We note that the results obtained by the CycleGAN are much less colorful than the DeepHist results.

    Figure 8: Visual results of the image colorization problem (columns, left to right: Input, Source, Output with the colors of the row's own source, Output with the colors of the other source, CycleGAN (winter or summer)). An input gray-scale image is painted with the colors of either of two source color distributions. Specifically, output (1) or output (2) refers to the colors of source image 1 or 2, respectively. Comparison to CycleGAN [27] is presented in column 5.
  3. Color transfer. We used the Oxford 102 Category Flower Dataset [15]. The dataset was randomly divided into training and test sets. During training, the input consists of an input image and a color reference image that were randomly selected. The aim is to paint the output image in the colors of the reference. Figure 9 presents color-transferred images obtained with and without the MI loss, demonstrating the contribution of the MI loss. To further justify the use of the MI loss, we calculated the MI between the input and the output images for all three color channels. As expected (and desired), the MI between the output and the source is higher when using all three DeepHist loss functions than when the MI loss is omitted. Results are shown in Table 1. The implication is that the content of the input is better preserved when using the MI loss. This can also be observed visually in Figure 9 when comparing the third and the fourth columns.

    Figure 9: Visual color transfer results of the proposed framework compared to the framework trained without the MI loss (columns, left to right: Source, Target, DeepHist, DeepHist w/o MI loss).

    Table 1: MI results for the color transfer problem. Average MI results for each of the three color channels between the generated color-transferred images and the respective input images. The comparison is made for the DeepHist framework, using and not using the MI loss.

    Loss                    Y       U       V
    DeepHist                0.178   0.141   0.167
    DeepHist w/o MI loss    0.066   0.035   0.044

3.1 Perceptual Realism

In the color transfer problem, the aim is to paint an input image with the colors of a different target image. Note that the desired output image does not exist, and therefore we cannot evaluate the results by a quantitative (pixel-to-pixel) comparison of the output image to a ground-truth image. For evaluating the 'realism' of our color transfer results we set up a questionnaire for human observers, in which we presented real (reference) or color-transferred (output) images in a random order. The questionnaire, based on our generated images and the true ones, mixed real and painted images. Participants were asked to mark each image as 'real' or 'fake'. Specifically, the following instructions were presented:
The following questionnaire shows real pictures of flowers and pictures of flowers that were obtained by painting (changing the colors) of real flower images using a deep learning approach. Can you tell whether these images are Real or Fake?
We distributed the questionnaire anonymously through a social network (via WhatsApp). The statistics presented here are based on nearly 100 questionnaire participants, of age groups as shown in Figure 10.

Figure 10: Age distribution of the participants in our DeepHist questionnaire.
Figure 11: The total distribution of the correct results (points) out of 24 questions.

Table 2: Average percentage of participants who marked the target (first row) or the output (second row) as "Real" (first column) or "Fake" (second column). True-Positive (Target, Real), True-Negative (Target, Fake), False-Positive (Output, Real) and False-Negative (Output, Fake) statistics are shown in the table.

                              Real (%)   Fake (%)
    Target (real image)       65.8       34.2
    Output (painted image)    51.9       48.1

The distribution of the number of correct answers is shown in Figure 11. The confusion matrix of the average percentage of participants who marked the target or the output images as either "Real" or "Fake" is shown in Table 2. As shown in the table, the DeepHist color transfer results misled the questionnaire participants in about half of the cases on average: 51.9% of the synthesized (fake) images were marked as real.

3.2 Ablation study

We run ablation studies to isolate the effect of the EMD term, the MI term and the GAN term. Figure 12 shows the qualitative effects of these variations on the edges → photo problem. MI and EMD alone (setting the adversarial weight to zero in Eq. 18) are not enough to overcome the "checkerboard artifact" [16]. Using only the MI and ADV loss functions without the EMD loss (setting the EMD weight to zero in Eq. 18) does not allow the network to adapt the color distribution of the output image to the target color distribution. Finally, using the ADV and EMD loss functions without the MI loss introduces visual artifacts. The MI loss is important for preserving the content of the source (regions with the same color). Table 3, which presents the quantitative ablation results, shows that it also improves the MSE. We note that in the color transfer problem, since the discriminator does not have a conditional input, the MI term is essential to preserve the content of the image. Examples are shown in Fig. 9.

Figure 12: Ablation study on the edges → shoes dataset. Columns, left to right: Input, Source, DeepHist, and three variants each trained without (w/o) one of the loss terms.

Table 3: Ablation study for the edges → shoe problem. MSE between the source images and the generated shoe images. The calculation was performed for 200 test images. Specifically, we compared the MSE results obtained for the DeepHist method implemented with all three loss functions with respect to the MSE results obtained without either of the loss functions. All image values are in the range .

                         w/o             w/o             w/o             DeepHist
    MSE average (std)    0.31 (0.077)    0.053 (0.013)   0.024 (0.012)   0.018 (0.010)

3.3 Colorfulness

As discussed in the pix2pix paper [6], the images generated by using the L1 loss tend to have grayish or brownish colors when there is uncertainty regarding which of several plausible color values a pixel should take. Specifically, L1 will be minimized by choosing the median of the conditional probability density function over possible colors. In [6] it was shown that the conditional GAN loss makes the output images more colorful. In Figure 13, we show the gray-level range obtained for each of the YUV color channels in the generated images for the edges → shoe dataset. The plots show the gray-level distributions in the YUV color space for the entire test set, comparing the proposed DeepHist with Pix2Pix and the actual color images. While there are no significant differences for the Y channel, it is apparent that the DeepHist better reflects the gray-level distribution of the actual images for the U and V channels.

Figure 13: Gray-level distributions for the output edges → shoes images. The plots show the gray-level distributions for the Y, U and V color channels (panels, left to right), comparing the proposed DeepHist (blue) with Pix2Pix (green) and the actual color images (orange).
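Per-channel distributions of this kind could be gathered along the lines of the sketch below. It assumes RGB inputs in [0, 1] and the analog BT.601 YUV conversion; the paper does not state which YUV variant or bin count was used, so the matrix, the value ranges and the names `rgb_to_yuv` and `channel_histograms` are all assumptions.

```python
import numpy as np

# BT.601 analog RGB -> YUV matrix (an assumed choice of YUV variant).
RGB_TO_YUV = np.array([
    [0.299, 0.587, 0.114],
    [-0.14713, -0.28886, 0.436],
    [0.615, -0.51499, -0.10001],
])

# Nominal value range of each channel for RGB inputs in [0, 1].
CHANNEL_RANGES = {"Y": (0.0, 1.0), "U": (-0.436, 0.436), "V": (-0.615, 0.615)}


def rgb_to_yuv(rgb):
    """Convert an array of shape (..., 3) from RGB to YUV."""
    return np.asarray(rgb, dtype=np.float64) @ RGB_TO_YUV.T


def channel_histograms(images, bins=256):
    """Return normalized Y, U and V histograms pooled over a set of images."""
    yuv = rgb_to_yuv(images)
    histograms = {}
    for index, (name, (low, high)) in enumerate(CHANNEL_RANGES.items()):
        # Clip so that rounding errors cannot push values outside the bins.
        values = np.clip(yuv[..., index], low, high)
        counts, _ = np.histogram(values, bins=bins, range=(low, high))
        histograms[name] = counts / counts.sum()
    return histograms


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_test_set = rng.random((10, 16, 16, 3))  # stand-in for generated images
    for name, hist in channel_histograms(fake_test_set, bins=8).items():
        print(name, np.round(hist, 3))
```

Running the same function on the outputs of each method and on the real images, with identical bins, gives curves that can be overlaid per channel as in Figure 13.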

4 Conclusions

We presented DeepHist, a novel deep learning method for image-to-image translation based on the construction of differentiable histograms and histogram-based loss functions. Specifically, intensity-based and MI loss functions are used to encourage intensity similarity to a reference color distribution and structural similarity to the source image. The adversarial loss is incorporated to constrain the generation to realistic images, making sure, for example, that the leaves and not the petals will be painted in green. While the results are promising, we believe that the tools we developed are applicable to other computer vision tasks with slight modifications, e.g., multi-modal image registration or changing illumination.


References

  • [1] R. L. Dobrushin (1970) Prescribing a system of random variables by conditional distributions. Theory of Probability & Its Applications 15 (3), pp. 458–486. Cited by: §2.2.1.
  • [2] R. C. Gonzalez and R. E. Woods (2008) Digital image processing. Prentice Hall. Cited by: §1.
  • [3] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1, §2.5.
  • [4] M. He, J. Liao, D. Chen, L. Yuan, and P. V. Sander (2019) Progressive color transfer with dense semantic correspondences. ACM Transactions on Graphics (TOG) 38 (2), pp. 1–18. Cited by: §1.
  • [5] L. Hou, C. Yu, and D. Samaras (2016) Squared earth mover’s distance-based loss for training deep neural networks. CoRR abs/1611.05916. Cited by: §2.2.1.
  • [6] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2016) Image-to-image translation with conditional adversarial networks. CoRR abs/1611.07004. Cited by: §1, §2.3, §2.4, Figure 6, item 1, §3.3.
  • [7] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §2.5.
  • [8] A. Kraskov, H. Stögbauer, R. G. Andrzejak, and P. Grassberger (2003) Hierarchical clustering based on mutual information. arXiv preprint q-bio/0311039. Cited by: §2.2.2.
  • [9] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
  • [10] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §1.
  • [11] E. Levina and P. Bickel (2001) The earth mover’s distance is the mallows distance: some insights from statistics. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, Vol. 2, pp. 251–256. Cited by: §2.2.1.
  • [12] C. Li and M. Wand (2016) Precomputed real-time texture synthesis with markovian generative adversarial networks. In European Conference on Computer Vision, pp. 702–716. Cited by: §2.4.
  • [13] D. G. Luenberger, Y. Ye, et al. (1984) Linear and nonlinear programming. Vol. 2, Springer. Cited by: §2.2.1.
  • [14] L. Neumann and A. Neumann (2005) Color style transfer techniques using hue, lightness and saturation histogram matching. In Computational Aesthetics, pp. 111–122. Cited by: §1.
  • [15] M-E. Nilsback and A. Zisserman (2008) Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing. Cited by: §1, item 3.
  • [16] A. Odena, V. Dumoulin, and C. Olah (2016) Deconvolution and checkerboard artifacts. Distill 1 (10), pp. e3. Cited by: §3.2.
  • [17] E. Reinhard, M. Ashikhmin, B. Gooch, and P. Shirley (2001) Color transfer between images. IEEE Computer graphics and applications 21 (5), pp. 34–41. Cited by: §1.
  • [18] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §1, §2.4.
  • [19] Y. Rubner, C. Tomasi, and L. J. Guibas (2000) The earth mover’s distance as a metric for image retrieval. International journal of computer vision 40 (2), pp. 99–121. Cited by: §2.2.1.
  • [20] S. Shalev-Shwartz and A. Tewari (2011) Stochastic methods for l1-regularized loss minimization. Journal of Machine Learning Research 12 (Jun), pp. 1865–1892. Cited by: §2.2.1.
  • [21] R. Szeliski (2010) Computer vision: algorithms and applications. 1st edition, Springer-Verlag, Berlin, Heidelberg. ISBN 1848829345, 9781848829343. Cited by: §1.
  • [22] P. Viola and W. M. Wells III (1997) Alignment by maximization of mutual information. International journal of computer vision 24 (2), pp. 137–154. Cited by: §1.
  • [23] M. Werman, S. Peleg, and A. Rosenfeld (1985) A distance metric for multidimensional histograms. Computer Vision, Graphics, and Image Processing 32 (3), pp. 328–336. ISSN 0734-189X. Cited by: §2.2.1.
  • [24] A. Yu and K. Grauman (2014) Fine-grained visual comparisons with local learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 192–199. Cited by: §1, item 1.
  • [25] R. Zhang, P. Isola, and A. A. Efros (2016) Colorful image colorization. In European conference on computer vision, pp. 649–666. Cited by: §1.
  • [26] J. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros (2016) Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pp. 597–613. Cited by: §1, item 1.
  • [27] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §1, §1, Figure 8, item 2.