While numerous loss functions have been proposed, metrics based on image histograms, which represent images by their color distributions [2, 21], have not been considered. The main obstacle seems to be the histogram construction, which is not a differentiable operation and therefore cannot be incorporated into a deep learning framework.
In this work, we introduce DeepHist - a deep learning framework for image generation that enables a differentiable construction of joint and color histograms of the output images. We further define color-based and statistical similarity loss functions that are built exclusively on the differentiable histograms of the generated images. Specifically, we augment a neural network generator with histogram layers that take part in the back-propagation process, in which the respective histogram loss functions are used to update the generator weights. Relying on the color distribution rather than on differences between corresponding pixels allows us to address image-to-image translation problems for which the desired target images do not necessarily exist. Consider, for example, the color transfer problem, exemplified in the left panel of Fig. 1, where the aim is to paint an input (source) image with the colors of a different color reference image. For this kind of unpaired learning task, none of the prevalent loss functions based on pixel-by-pixel comparison, e.g., mean-square error (MSE) or cross-entropy, can be used. We also address generalizations of the image colorization and edges→photo problems, where the color distribution of a generated image is constrained to fit a particular color histogram (Fig. 1, middle and right panels).
Color and intensity histograms are useful representations for image-to-image translation tasks. Classical methods for color transfer were based on the concept of histogram matching, where the main idea is to adapt the color histogram of a given image to that of a target image. Reinhard et al. addressed color transfer by using a simple statistical analysis to impose one image's color characteristics on another in the Lab color space. Neumann et al. used 3D histogram matching in the hue-saturation-lightness (HSL) color space. Their method maps an arbitrary source gamut to an arbitrary target gamut, such that colors sharing the same hue are mapped to the same hue after the transformation. The proposed mapping required histogram smoothing to reduce undesired gradient effects.
In this work, we exploit histogram matching using the network as an optimizer. The distance between a pair of histograms is defined by the Earth Mover’s Distance (EMD).
A deformation of the color distribution of an image can distort its content; therefore, enforcing structural similarity between the source and the output images is required. The main problem is that the two images have different intensities at corresponding locations, making pixel-to-pixel comparison inapplicable. To address this issue, we propose using the mutual information (MI) of the source and the output images as a measure of their content-based, color-free similarity. In a seminal work, Viola and Wells used a cost function based on MI for image registration, where the target and the source images have different intensity distributions. Since then, MI-based registration has become popular in biomedical imaging applications, in particular for the alignment of medical images acquired by different imaging modalities. An essential component in calculating the MI of two images is the generation of their joint histogram, which in the context of image registration is called a co-occurrence matrix. While there has been significant work exploiting co-occurrence matrices, to the best of our knowledge joint histograms and MI have not been used before for image-to-image translation tasks. Moreover, differentiable construction of intensity histograms and joint histograms as part of a deep learning framework is done here for the first time.
Recent image generation approaches, and image-to-image translation methods in particular, are mostly based on deep learning frameworks. Since the main aim is to generate realistic examples, adversarial frameworks, in which an adversarial network is trained to discriminate between real and fake examples, have proven very effective. In their pix2pix framework, Isola et al. performed image-to-image translation (e.g., colorization of gray-scale images and edges→photo) by using an adversarial loss as well as a loss between corresponding pixels in the network's output and the desired target image. In this sense, pix2pix is a fully supervised method and obviously cannot be applied to problems (such as color transfer) where the desired target image does not exist. Moreover, as discussed in
the images generated using an L1 loss tend to have grayish or brownish colors when there is uncertainty regarding which of several plausible color values a pixel should take on. Specifically, L1 is minimized by choosing the median of the conditional probability density function over possible colors. The problem of color uncertainty is addressed by Zhang et al. with a class-based colorization approach, in which the loss of each pixel in an image of a particular class is weighted by the frequency of its color in that class. This process, termed class rebalancing, increases the color diversity of the test results. Zhu et al. addressed image-to-image translation in the unpaired setting using cycle-consistent adversarial networks. The CycleGAN enables style and color transfer (e.g., summer to winter) when the desired output image cannot be used for training. The main idea is to use an adversarial loss to map an image X into Y and then map Y back into X such that cycle consistency is preserved. The CycleGAN presents compelling results; yet, since in many cases the cycle-consistency constraint is not sufficient, additional supervision and loss functions are often required. He et al. proposed a two-step pipeline for color transfer based on deep semantic correspondences (via VGG19) between an input and a reference image, followed by local color transfer in the image domain. The method provides visually appealing results yet requires structural and semantic similarity of the reference with respect to the input image. Moreover, the output color distribution can only be controlled by the reference image.
[Figure panels: Input, Source, Output 1, Output 2, Output 3]
DeepHist presents a conceptual alternative to existing image-to-image translation methods. It requires neither the extraction of semantic features nor a reference color image with semantic similarity to the input image. Instead, reference color histograms representing the desired color distribution of the output are provided to the network. While these color histograms can be constructed from a reference color image - as is the case for the color transfer problem exemplified in Fig. 1 - they can also be user-defined for color-controlled image colorization or edges→photo tasks (Fig. 2). The intensity-based loss we propose for 'painting' the output image with the reference colors is based on the EMD between a differentiable histogram constructed from the output image and the reference histogram. Moreover, structure/content similarity between the source and the output images is preserved thanks to the mutual information loss, which we define based on the joint source-output histogram. While an adversarial loss is utilized here as well to generate realistic images - ensuring, for example, green grass and blue sky rather than the other way around - our framework does not rely exclusively or mainly on it, making it much more stable. Finally, avoiding pixel-to-pixel comparison via L1 or other distance measures allows us to handle unpaired image-to-image translation tasks such as color transfer.
The DeepHist framework comprises a generator, which is an adaptation of the well-known U-Net - an encoder-decoder with skip connections. The main contributions, however, are the augmented parts of the network, which allow differentiable construction of intensity (1D) and joint (2D) histograms, such that histogram-based loss functions can train the image generator in an end-to-end manner. We demonstrate the proposed framework on different paired and unpaired image-to-image translation tasks with several publicly available datasets. These include color transfer for the flowers dataset, image colorization for the summer-winter dataset, and edges→photo for the shoes and bags datasets.
In this section we review the main principles underlying the differentiable construction of 1D and 2D (joint) color histograms (Section 2.1). We then define the histogram-based metrics (Section 2.2) that are used for defining the differentiable loss functions (Section 2.3). The network architecture is presented in Section 2.4. Implementation details are presented in Section 2.5.
2.1 Differentiable Histograms Construction
2.1.1 Color Space
To address image-to-image translation problems we choose the YUV color space. It is composed of one luma component (Y) and two chrominance components, called U (blue projection) and V (red projection). The Y channel is in the range [0, 1], while the U and V channels are in the range [-0.5, 0.5]. For practical reasons we map all channels' values to [0, 1]. In the following sections we refer to each color channel as a gray-level image.
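To make the mapping concrete, here is a minimal sketch of the RGB→YUV conversion and the range mapping, assuming BT.601 conversion coefficients (the paper does not specify the exact matrix, so treat the constants as illustrative):

```python
import numpy as np

def rgb_to_yuv01(rgb):
    """Convert an RGB image (float values in [0, 1]) to YUV and map every
    channel to [0, 1]. The conversion matrix follows BT.601; the exact
    coefficients are an assumption, not taken from the paper."""
    m = np.array([[ 0.299,  0.587,  0.114],   # Y  (luma)
                  [-0.147, -0.289,  0.436],   # U  (blue projection)
                  [ 0.615, -0.515, -0.100]])  # V  (red projection)
    yuv = rgb @ m.T
    # Y is already in [0, 1]; shift and rescale U and V into [0, 1].
    yuv[..., 1] = yuv[..., 1] / (2 * 0.436) + 0.5
    yuv[..., 2] = yuv[..., 2] / (2 * 0.615) + 0.5
    return yuv
```

After this mapping each channel can be treated as a gray-level image on [0, 1], as the text assumes in the sections that follow.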
2.1.2 Differentiable 1D Color Histogram Formulation
Images acquired by digital cameras have three color channels, each with a discrete range of intensity values. The intensity distribution of each channel can be described by an intensity histogram, obtained by counting the number of pixels at each intensity value. Considering synthesized images whose pixels can take any value in the continuous range $[0,1]$, we define the kernel density estimate (KDE) of the intensity distribution of an image channel as follows:

$$\hat{f}(x) = \frac{1}{nB}\sum_{i=1}^{n} k\left(\frac{x - x_i}{B}\right),$$

where $x_i$ is the intensity of the $i$-th pixel, $k$ is the kernel, $B$ is the bandwidth and $n$ is the number of pixels in the image. We choose the kernel $k$ as the derivative of the logistic regression (sigmoid) function $\sigma(z) = (1 + e^{-z})^{-1}$:

$$k(z) = \sigma'(z) = \sigma(z)\,\sigma(-z) = \frac{e^{-z}}{(1 + e^{-z})^{2}}. \quad (2)$$

We note that Eq. (2) is a non-negative, real-valued, integrable function and satisfies the requirements for a kernel (normalization and symmetry).
For the construction of a smooth and differentiable image histogram, we partition the interval $[0,1]$ into $K$ sub-intervals, each of length $L = 1/K$ and with center $\mu_k = (k + 1/2)L$, $k = 0, \ldots, K-1$. We can then define the probability of a pixel in the image to belong to the $k$-th gray-level interval (the value of the $k$-th normalized histogram bin) as

$$P_k = \int_{\mu_k - L/2}^{\mu_k + L/2} \hat{f}(x)\, dx.$$

By solving the integral we get

$$P_k = \frac{1}{n}\sum_{i=1}^{n}\left[\sigma\left(\frac{x_i - \mu_k + L/2}{B}\right) - \sigma\left(\frac{x_i - \mu_k - L/2}{B}\right)\right].$$

The function that provides the value of the $k$-th bin in a differentiable histogram can therefore be rewritten as follows:

$$P_k = \frac{1}{n}\sum_{i=1}^{n} \Pi_k(x_i), \quad (5) \qquad \Pi_k(z) = \sigma\left(\frac{z - \mu_k + L/2}{B}\right) - \sigma\left(\frac{z - \mu_k - L/2}{B}\right), \quad (6)$$

where $\Pi_k$ is a differentiable approximation of the Rect function. Fig. 3
illustrates the application of three (out of $K$) activation functions $\Pi_k$ to a gray-scale image. The resulting $K$ channels are used for the construction of the corresponding gray-level histogram. Specifically, the $k$-th histogram bin is obtained by a summation of the $k$-th channel values. The set of $K$ channels can be viewed as smooth one-hot approximations of the pixel values in a gray-level image. Note that the support of $\Pi_k$ is over the gray-level range and, as opposed to a convolutional kernel, it is not spatial. A differentiable histogram of a gray-level image $I$ is defined as the vector $h_I = [P_0, \ldots, P_{K-1}]$ of its bin values (Eq. 5).
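The construction above can be sketched in a few lines. This is an illustrative NumPy implementation (so without autograd); the bin count and the bandwidth heuristic below are assumptions, not the paper's settings. In a deep learning framework the same tensor operations are differentiable end-to-end:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_histogram(img, K=64, bandwidth=None):
    """Differentiable histogram of a gray-level image with values in [0, 1].
    Each per-pixel bin activation Pi_k is a difference of two sigmoids,
    a smooth approximation of the Rect function."""
    L = 1.0 / K                       # sub-interval length
    if bandwidth is None:
        bandwidth = L / 2.5           # heuristic smoothing (an assumption)
    mu = (np.arange(K) + 0.5) * L     # bin centers
    z = img.reshape(-1, 1) - mu[None, :]                 # (n, K) offsets
    pi = sigmoid((z + L / 2) / bandwidth) - sigmoid((z - L / 2) / bandwidth)
    return pi.sum(axis=0) / img.size  # normalized K-bin histogram
```

Because the per-pixel activations telescope across bins, the histogram of an interior-valued image sums to one, and a constant image concentrates its mass in the bins around its gray level.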
[Figure 3 panels: (a) output channel; (b) activation functions; (c) activation maps]
2.1.3 Differentiable Joint Color Histogram Formulation
The joint histogram of two gray-level images, each with $K$ discrete gray levels, is a $K \times K$ matrix constructed such that its $(k, l)$ entry counts the number of times pixels with gray-level value $k$ in one image correspond to pixels with gray-level value $l$ in the other. The joint gray-level density is obtained by normalizing the joint gray-level histogram. Considering two images with continuous pixel values, their joint gray-level density can be defined using multivariate KDE as follows:

$$\hat{f}(x, y) = \frac{1}{n}\sum_{i=1}^{n} k_H\left(x - x_i,\; y - y_i\right),$$

where $x_i$ and $y_i$ are the intensities of the $i$-th pixel in the two images, $H$ is the bandwidth (or smoothing) matrix and $k_H$ is the symmetric 2D kernel function. As in the 1D case (Eq. 2), we choose the kernel as the derivative of the logistic regression function for each of the two variables separately:

$$k_H(z_1, z_2) = \frac{1}{B^2}\,\sigma'\!\left(\frac{z_1}{B}\right)\sigma'\!\left(\frac{z_2}{B}\right).$$
We define the bandwidth matrix as $H = \mathrm{diag}(B, B)$. We define the probability of corresponding pixels in $I_1$ and $I_2$ to belong to the intensity intervals $k$ and $l$, correspondingly, as follows:

$$P_{k,l} = \int_{\mu_k - L/2}^{\mu_k + L/2} \int_{\mu_l - L/2}^{\mu_l + L/2} \hat{f}(x, y)\, dy\, dx.$$

By solving the integral we get:

$$P_{k,l} = \frac{1}{n}\sum_{i=1}^{n}\left[\sigma\left(\frac{x_i - \mu_k + L/2}{B}\right) - \sigma\left(\frac{x_i - \mu_k - L/2}{B}\right)\right]\left[\sigma\left(\frac{y_i - \mu_l + L/2}{B}\right) - \sigma\left(\frac{y_i - \mu_l - L/2}{B}\right)\right].$$

By using the definition of $\Pi_k$ from Eq. 6, we can express the value of the joint histogram's $(k, l)$-th bin as

$$P_{k,l} = \frac{1}{n}\sum_{i=1}^{n} \Pi_k(x_i)\, \Pi_l(y_i).$$
This equation can also be written in matrix notation. We define a matrix $P_I \in \mathbb{R}^{K \times n}$ where each of its rows is a flattened activation map $\Pi_k$, generated from a gray-level image $I$. A differentiable joint histogram of two images $I_1$, $I_2$ can then be constructed via matrix multiplication as follows:

$$h_{I_1, I_2} = \frac{1}{n}\, P_{I_1} P_{I_2}^{\top}. \quad (13)$$
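A sketch of this matrix-product construction, again in NumPy with illustrative K and bandwidth values: each image is expanded into a K × n matrix of soft bin activations, and a single matrix product yields the joint histogram:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def activation_maps(img, K=64, bandwidth=0.005):
    """Return the K x n matrix P whose row k holds the soft bin-k activation
    of every pixel (a smooth one-hot code per pixel). K and the bandwidth
    are illustrative assumptions."""
    L = 1.0 / K
    mu = (np.arange(K) + 0.5) * L
    z = img.reshape(1, -1) - mu[:, None]                 # (K, n) offsets
    return sigmoid((z + L / 2) / bandwidth) - sigmoid((z - L / 2) / bandwidth)

def joint_histogram(img1, img2, K=64, bandwidth=0.005):
    """Differentiable K x K joint histogram via one matrix product."""
    P1 = activation_maps(img1, K, bandwidth)
    P2 = activation_maps(img2, K, bandwidth)
    return (P1 @ P2.T) / img1.size
```

For two identical images the mass of the joint histogram concentrates near the diagonal, which is what the MI loss defined later exploits.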
2.1.4 Histogram Layers
1D Histogram Layer
Three gray-level (1D) histogram layers of size $K$ are constructed from the output layers of the generator network (the synthesized output image), one for each color channel. The value of the $k$-th unit in a histogram layer is obtained by a summation (and normalization by $n$) of the respective activation map (Eq. 5). As illustrated in Figure 2, the activation maps are constructed by the application of the $K$ activation functions to the three output image layers. This operation is illustrated in Fig. 4a.
Joint Histogram Layer
Having $K$ activation maps for each color channel of the synthesized output image, we construct three matrices of size $K \times n$ by reshaping the maps into vectors. Applying a similar process to the input image, we can then construct three joint histograms via three matrix multiplications (Eq. 13), corresponding to the Y, U and V channels. Figure 4 illustrates the main ideas.
2.2 Histogram-Based Metrics

2.2.1 Earth Mover’s Distance
We use the EMD, also known as the Wasserstein metric, to define the distance between two image histograms. Let $h_1$ and $h_2$ be the histograms of the images $I_1$ and $I_2$, respectively. We note that when $h_1$ and $h_2$ have the same overall mass, the EMD is a true metric. Moreover, when the compared histograms are also 1D, the EMD has been shown to be equivalent to the Mallows distance, which has a closed-form solution. Werman et al. showed that the 1D EMD is equal to the $\ell_1$ distance between the cumulative histograms. Following Hou et al., we use the Euclidean distance instead because it usually converges faster and is easier to optimize with gradient descent [13, 20]:

$$\mathrm{EMD}(h_1, h_2) = \sum_{k=1}^{K} \left(C_1(k) - C_2(k)\right)^2, \quad (14)$$

where $C_j(k)$ is the $k$-th element of the cumulative distribution function of $h_j$.
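A minimal sketch of this loss, assuming the squared form of the cumulative-difference distance (as in Hou et al.):

```python
import numpy as np

def emd_loss(h1, h2):
    """Squared Euclidean distance between the cumulative sums of two
    normalized 1D histograms -- the closed-form 1D EMD surrogate used
    for gradient-based training (squared form is an assumption)."""
    c1, c2 = np.cumsum(h1), np.cumsum(h2)
    return np.sum((c1 - c2) ** 2)
```

The loss is zero for identical histograms and grows with the amount of mass that must be transported, e.g., moving a unit spike across two bins costs 2 under the squared form.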
2.2.2 Mutual information
The MI of two images $I_1$ and $I_2$ is defined as follows:

$$I(I_1; I_2) = \sum_{k=1}^{K}\sum_{l=1}^{K} h_{I_1, I_2}(k, l)\, \log \frac{h_{I_1, I_2}(k, l)}{h_{I_1}(k)\, h_{I_2}(l)},$$

where $h_{I_1}$, $h_{I_2}$ are the image histograms as defined in Eq. 5, and $h_{I_1, I_2}$ is the joint histogram discussed in Section 2.1.3. Maximizing the MI between the output and the source image allows us to generate images with color-free statistical similarity. Following , we define the MI loss as follows:

$$\mathcal{L}_{MI} = D(I_1, I_2) = 1 - \frac{I(I_1; I_2)}{H(I_1, I_2)}, \quad (16)$$

where $H(I_1, I_2)$ is the joint entropy of $I_1$ and $I_2$, defined as

$$H(I_1, I_2) = -\sum_{k=1}^{K}\sum_{l=1}^{K} h_{I_1, I_2}(k, l)\, \log h_{I_1, I_2}(k, l).$$

The quantity $D(I_1, I_2)$ is a metric, with $D(I_1, I_2) = 0$ for identical images and $D(I_1, I_2) \le 1$ for all pairs. This metric has symmetry, positivity and boundedness properties.
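Given a normalized joint histogram, the MI loss of Eq. 16 reduces to a few entropy computations, since the marginals are the row and column sums of the joint. A minimal NumPy sketch (the epsilon guard is an implementation detail, not from the paper):

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (nats) of a normalized distribution, skipping zeros."""
    p = p[p > eps]
    return -np.sum(p * np.log(p))

def mi_loss(joint):
    """Loss 1 - I(X;Y)/H(X,Y) from a normalized K x K joint histogram:
    0 when the two images are identical, 1 when they are independent."""
    px = joint.sum(axis=1)                    # marginal of image 1
    py = joint.sum(axis=0)                    # marginal of image 2
    h_xy = entropy(joint.ravel())             # joint entropy H(X, Y)
    mi = entropy(px) + entropy(py) - h_xy     # I(X; Y)
    return 1.0 - mi / max(h_xy, 1e-12)
```

The two extreme cases are easy to check: a diagonal joint histogram (identical images) gives a loss of 0, and a rank-one product of uniform marginals (independent images) gives a loss of 1.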
2.3 Loss functions
The complete loss is a weighted sum of three loss functions:

$$\mathcal{L} = \lambda_{EMD}\,\mathcal{L}_{EMD} + \lambda_{MI}\,\mathcal{L}_{MI} + \lambda_{adv}\,\mathcal{L}_{adv}, \quad (18)$$

where $\mathcal{L}_{EMD}$, $\mathcal{L}_{MI}$, $\mathcal{L}_{adv}$ are the color loss using the EMD, the statistical similarity loss using MI and the adversarial loss, respectively. The scalars $\lambda_{EMD}$, $\lambda_{MI}$, $\lambda_{adv}$ are the weights.
The EMD loss is derived from Eq. 14, which defines the EMD between two histograms. The EMD loss between the output and reference color histograms is defined as follows:

$$\mathcal{L}_{EMD} = \sum_{c \in \{Y, U, V\}} \mathrm{EMD}\left(h_{ref}^{c},\, h_{out}^{c}\right),$$

where $h_{ref}^{c}$, $h_{out}^{c}$ are the reference and the output histograms of the YUV channels.
The MI loss between the channels of the network's output and the source image is based on their relative MI (Eq. 16) and is defined as follows:

$$\mathcal{L}_{MI} = \sum_{c \in \{Y, U, V\}} \left(1 - \frac{I(I_{out}^{c};\, I_{src}^{c})}{H(I_{out}^{c},\, I_{src}^{c})}\right).$$
We use a conditional GAN loss similar to . The discriminator learns to distinguish between the output and the source, conditioned on the input. For the color transfer problem, the discriminator input is the source or the output image, without a conditioning input. The objective of the conditional GAN can be expressed as:

$$\mathcal{L}_{adv}(G, D) = \mathbb{E}_{x, y}\left[\log D(x, y)\right] + \mathbb{E}_{x, z}\left[\log\left(1 - D(x, G(x, z))\right)\right],$$

where $G$ is the generator, $D$ is the discriminator, $x$ is the input image, $y$ is the output image, and $z$ is noise in the form of dropout.
2.4 DeepHist Network Architecture
Figure 5 illustrates the generator architecture as well as the augmented input and output histogram layers. The DeepHist network architecture is composed of an image generator (a modified version of the U-Net) augmented by input and output histogram layers. The input to the encoder part of the generator is either a gray-scale image (for image colorization), an edge map (for edges→photo), or an RGB image (for color transfer). In addition, reference color histograms are fed (each separately) into embedding layers, followed by a fully connected layer and a concatenation with the code layer of the generator. Embedding the reference histograms within the network generator allows us to control the color distribution of the output image. The three output layers of the generator (which together compose the three color channels of the synthesized output image) are used for the construction of the color (1D) and joint (2D) histogram layers. The histogram construction is illustrated in Fig. 4. The color histogram layers allow us to constrain color similarity to the reference, while the joint histogram layers enable us to constrain content similarity to the source, via the respective loss functions. As in
, we use the convolutional “PatchGAN” classifier as a discriminator for the construction of an adversarial loss.
2.5 Implementation Details
To optimize our networks, we alternate between one gradient descent step on the discriminator (D) and one step on the generator (G). As suggested in , we train G to maximize $\log D(x, G(x, z))$. We use minibatch SGD and apply the Adam solver , with a learning rate of , and momentum parameters , . For the histogram construction we use  bins, , .
3 Experimental Results
To demonstrate the strengths of the DeepHist method, we test it on several tasks and datasets:
Edges→photo We used two different datasets from  and  to demonstrate the edges→shoe and edges→bag problems. We divided the datasets into training and test sets as in . During training, the input is an edge map and the output is a synthesized image with the color distribution of the source image (i.e., the real image used for generating the edge map). The MI loss is calculated with respect to the source image to constrain content similarity to the source. During the test phase, we generate synthesized images based on the same edge map yet with different selected color histograms. For evaluating our method, we present synthesized images with the color distribution of either the source image or a color reference image. Figure 6 and Figure 7 present visual edges→shoe and edges→bag results, respectively.
Figure 6: Visual results of the edges→shoe problem (columns: Input, Source, Ours (source), Pix2Pix, Reference, Ours (reference)). The output image is generated with the colors of either the source image (col. 3) or a different color reference image (col. 6). For comparison, Pix2Pix results are presented (col. 4).
Figure 7: Visual results of the edges→bag problem (columns: Input, Source, Output (source), Reference, Output (reference)). The output image is generated with the colors of either the source image (col. 3) or a different color reference image (col. 5).
Image colorization We use the summer/winter Yosemite dataset, prepared by  using the Flickr API, with train/test splits as in . During training, the input is a gray-scale image (generated from a source image randomly selected from the training sets of summer and winter images) and the output is a colorized image with the color distribution of the source (original) image. During the test phase, we generate synthesized images based on the same gray-scale image yet with different selected color histograms. For evaluating our method, we present synthesized images with the color distribution of either the source image or a color reference image. Results and a comparison to CycleGAN are shown in Figure 8. We note that the results obtained by CycleGAN are much less colorful than the DeepHist results.
Figure 8: Visual results of the image colorization problem (columns: Input, Source, Output (1), Output (2), CycleGAN). An input gray-scale image is painted with the colors of either of two source color distributions. Specifically, output (1) or output (2) refers to the colors of source image 1 or 2, respectively. A comparison to CycleGAN is presented in column 5.
Color transfer We used the Oxford 102 Category Flower Dataset , which consists of  images. The dataset was randomly divided into  and  images for training and test, respectively. During training, the input consists of an input image and a color reference image that were randomly selected. The aim is to paint the output image in the colors of the reference. Figure 9 presents color-transferred images obtained with and without the MI loss, demonstrating the contribution of the MI loss. To further justify the use of the MI loss, we calculated the MI between the input and the output images for all three color channels. As expected (and desired), the MI between the output and the source is higher when using all three DeepHist loss functions than when training without the MI loss. Results are shown in Table 1. The implication is that the content of the input is better preserved when using the MI loss. This can also be visually observed in Figure 9 when comparing the third and the fourth columns.
Figure 9: Visual color transfer results of the proposed framework (columns: Source, Target, DeepHist, w/o MI loss) compared to the framework trained without the MI loss.

Table 1: MI results for the color transfer problem. Average MI results for each of the three color channels between the generated color-transferred images and the respective input images. The comparison is made for the DeepHist framework, using and not using the MI loss.

| Loss | Y | U | V |
| DeepHist | 0.178 | 0.141 | 0.167 |
| w/o MI | 0.066 | 0.035 | 0.044 |
3.1 Perceptual Realism
Addressing the problem of color transfer, the aim is to paint an input image with the colors of a different target image. Note that the desired output image does not exist and therefore we cannot measure the results by quantitative comparison (pixel-to-pixel) of the output image to a ground truth image.
For evaluating the ‘realism’ of our color transfer results, we set up a questionnaire for human observers, in which we presented real (reference) or color-transferred (output) images in random order.
The questionnaire based on our generated images and the true ones can be accessed via https://forms.gle/NN6HB4Sbr5fDPYo1A.
Overall, we used images, of which were real and were painted.
Participants were asked to mark ‘real’ or ‘fake’.
Specifically, the following instructions are presented:
The following questionnaire shows real pictures of flowers and pictures of flowers that were obtained by painting (changing the colors) of real flower images using a deep learning approach. Can you tell whether these images are Real or Fake?
We distributed the questionnaire anonymously via a social network (WhatsApp). The statistics presented here are based on nearly 100 questionnaire participants, with age groups as shown in Figure 10.
Table 2: Average percentage of participants who marked the target (real) or output (painted) images as “Real” or “Fake”.

| | Marked “Real” (%) | Marked “Fake” (%) |
| Target (real image) | 65.8 | 34.2 |
| Output (painted image) | 51.9 | 48.1 |
The distribution of the number of correct answers is shown in Figure 11. The confusion matrix of the average percentage of participants who marked the target or the output images as either “Real” or “Fake” is shown in Table 2. As shown in the table, the DeepHist color transfer results misled the questionnaire participants, on average, in about half of the cases; 51.9% of the synthesized (fake) images were marked as real.
3.2 Ablation study
We ran ablation studies to isolate the effects of the EMD term, the MI term and the GAN term. Figure 12 shows the qualitative effects of these variations on the edges→photo problem. The MI and EMD losses alone (setting $\lambda_{adv} = 0$ in Eq. 18) are not enough to overcome the “checkerboard artifact”. Using only the MI and adversarial loss functions without the EMD loss (setting $\lambda_{EMD} = 0$ in Eq. 18) does not allow the network to adapt the color distribution of the output image to the target color distribution. Finally, using the adversarial and EMD loss functions without the MI loss introduces visual artifacts. The MI loss is important for preserving the content of the source (regions with the same color); Table 3 shows that it also improves the MSE. We note that in the color transfer problem, since the discriminator does not have a conditional input, the MI term is essential for preserving the content of the image. Examples are shown in Fig. 9. Table 3 presents quantitative ablation study results.
Figure 12: Qualitative ablation results on the edges→photo problem (columns: Input, Source, DeepHist, and the ablated variants, each trained without one loss term).

Table 3: Quantitative ablation results. MSE average (std) across the compared configurations, ordered as in Figure 12: .31 (.077), .053 (.013), .024 (.012), .018 (.010).
As discussed in the pix2pix paper , the images generated using an L1 loss tend to have grayish or brownish colors when there is uncertainty regarding which of several plausible color values a pixel should take. Specifically, L1 is minimized by choosing the median of the conditional probability density function over possible colors. In  it was shown that the conditional GAN loss makes the output images more colorful. In Figure 13, we show the gray-scale range obtained for each of the YUV color channels in the generated images for the edges→shoe dataset. The plots show the gray-level distributions in the YUV color space for the entire test set, comparing the proposed DeepHist with Pix2Pix and the actual color images. While there are no significant differences for the Y channel, it is apparent that DeepHist better reflects the gray-level distribution of the actual images for the U and V channels.
Figure 13: Gray-level distributions for the Y, U and V channels.
We presented DeepHist, a novel deep learning method for image-to-image translation based on the construction of differentiable histograms and histogram-based loss functions. Specifically, intensity-based and MI loss functions are used to encourage intensity similarity to a reference color distribution and structural similarity to the source image. The adversarial loss is incorporated to constrain the generation of realistic images, making sure, for example, that the leaves, and not the petals, will be painted green. While the results are promising, we believe that the tools we developed can be applied, with slight modifications, to other computer vision tasks, e.g., multi-modal image registration or illumination change.
-  Prescribing a system of random variables by conditional distributions. Theory of Probability & Its Applications 15 (3), pp. 458–486. Cited by: §2.2.1.
-  (2008) Digital image processing. Prentice Hall. Cited by: §1.
-  (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1, §2.5.
-  (2019) Progressive color transfer with dense semantic correspondences. ACM Transactions on Graphics (TOG) 38 (2), pp. 1–18. Cited by: §1.
-  (2016) Squared earth mover’s distance-based loss for training deep neural networks. CoRR abs/1611.05916. External Links: Cited by: §2.2.1.
-  Image-to-image translation with conditional adversarial networks. CoRR abs/1611.07004. External Links: Cited by: §1, §2.3, §2.4, Figure 6, item 1, §3.3.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §2.5.
-  (2003) Hierarchical clustering based on mutual information. External Links: Cited by: §2.2.2.
-  (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
-  (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §1.
-  (2001-07) The earth mover’s distance is the mallows distance: some insights from statistics. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, Vol. 2, pp. 251–256 vol.2. External Links: Cited by: §2.2.1.
-  (2016) Precomputed real-time texture synthesis with markovian generative adversarial networks. In European Conference on Computer Vision, pp. 702–716. Cited by: §2.4.
-  (1984) Linear and nonlinear programming. Vol. 2, Springer. Cited by: §2.2.1.
-  (2005) Color style transfer techniques using hue, lightness and saturation histogram matching. In Computational Aesthetics, pp. 111–122. Cited by: §1.
-  (2008-12) Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, Cited by: §1, item 3.
-  (2016) Deconvolution and checkerboard artifacts. Distill 1 (10), pp. e3. Cited by: §3.2.
-  (2001) Color transfer between images. IEEE Computer graphics and applications 21 (5), pp. 34–41. Cited by: §1.
-  (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §1, §2.4.
-  The earth mover’s distance as a metric for image retrieval. International Journal of Computer Vision 40 (2), pp. 99–121. Cited by: §2.2.1.
-  Stochastic methods for l1-regularized loss minimization. Journal of Machine Learning Research 12 (Jun), pp. 1865–1892. Cited by: §2.2.1.
-  (2010) Computer vision: algorithms and applications. 1st edition, Springer-Verlag, Berlin, Heidelberg. External Links: Cited by: §1.
-  (1997) Alignment by maximization of mutual information. International journal of computer vision 24 (2), pp. 137–154. Cited by: §1.
-  (1985) A distance metric for multidimensional histograms. Computer Vision, Graphics, and Image Processing 32 (3), pp. 328 – 336. External Links: Cited by: §2.2.1.
-  Fine-grained visual comparisons with local learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 192–199. Cited by: §1, item 1.
-  (2016) Colorful image colorization. In European conference on computer vision, pp. 649–666. Cited by: §1.
-  (2016) Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pp. 597–613. Cited by: §1, item 1.
-  (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §1, §1, Figure 8, item 2.