1 Introduction
Convolutional Neural Networks (CNNs) have dramatically improved the state-of-the-art in many practical domains [12, 13]. While numerous loss functions have been proposed, metrics based on intensity histograms, which represent images by their gray-level distributions [4, 27], are not considered. The main obstacle seems to be histogram construction, which is not a differentiable operation and therefore cannot be incorporated into a deep learning framework. In this work, we introduce the HueNet, a deep neural network which is an image generator augmented by layers that allow, in a differentiable manner, the construction of intensity histograms of the generated images. We further define differentiable intensity-based and semantic-based similarity measures between pairs of images that are built exclusively on their histograms. Using these similarity measures as loss functions allows us to address 'image-to-image translation' problems. This paper focuses on color transfer, as shown in Figure 1, where the aim is to paint a source image with the colors of a different target image. In this type of problem the desired output is rarely available, and therefore loss functions based on pixel-to-pixel comparison cannot be used. In contrast, using the proposed histogram-based loss functions, the HueNet can be trained in an unsupervised manner, providing realistic and semantically meaningful images at test time.
(Figure 1: source image, target image, and HueNet output.)
An intensity histogram is a useful representation for image-to-image translation tasks. A widely used technique is histogram equalization (HE), which enhances image contrast by spreading the intensity histogram over the entire dynamic range. While its simplicity may be appealing, HE tends to produce unrealistic results, as the dependency between neighboring pixels is not considered. Some works aimed to address this problem by partitioning the image space and equalizing the locally generated histograms [1, 2, 14, 20, 26, 31]. Classical methods for color transfer were based on the concept of histogram matching, where the main idea was to adapt the color histogram of a given image to that of a target image. Neumann et al. [18] used 3D histogram matching in HSL color space, similarly to the sampling of multivariable functions via a sequential chain of conditional probability density functions. Reinhard et al. [21] achieved color transfer by using a statistical analysis in Lab color space. In this work, we exploit histogram matching using the network as an optimizer. The distance between a pair of histograms is defined by the Earth Mover's Distance (EMD). Observing that the HSV (hue, saturation, value) color space is most suitable for addressing color transfer problems, we use a cyclic version of the EMD when applied to the hue channel. Hue channels of some images are presented in rows 2 and 5 of Figure 2.
A deformation of the intensity distribution of an image can distort its content; therefore, enforcement of the semantic similarity between the source and the 'color-transferred' images is required. The main problem is that the images have different intensities in corresponding locations, making pixel-to-pixel comparison inapplicable. To address this issue, we suggest using the mutual information (MI) of the source and the output images as a measure of their color-free similarity. In a seminal work, Viola and Wells [28] used a cost function based on MI for image registration, where the target image and the source have different intensity distributions. Since then, MI-based registration has become popular in biomedical imaging applications, in particular when aligning medical images acquired by different imaging modalities. An essential component for calculating the MI of two images is the generation of their joint histogram (see Figure 3), called a co-occurrence matrix in the context of image registration. While there has been significant work exploiting co-occurrence matrices, the use of joint histograms and MI for color transfer has (to the best of our knowledge) not been done before. Moreover, differentiable construction of intensity histograms and joint histograms as part of a deep learning framework is done here for the first time.
Recent approaches to image generation, and image-to-image translation in particular, are mostly based on deep learning. Since the main aim is generating realistic examples, adversarial frameworks, in which an adversarial network is trained to discriminate between real and fake examples, seem to be very effective [5]. In their pix2pix framework, Isola et al. [8] performed image-to-image translation (e.g. colorization of gray-scale images) by using an adversarial loss as well as an $\ell_1$ loss between corresponding pixels in the network's output and the desired target image. In this sense, pix2pix is a fully supervised method and obviously cannot be applied to problems (such as color transfer) where the desired target image does not exist. A more recent paper by Zhu et al. [30] addressed image-to-image translation in an unpaired setting using cyclic GANs. While we agree that an adversarial loss is needed to enhance the generated images, the intensity-based loss is needed for 'transferring' the source colors to those of the target. We also show (Section 3) that the semantic-based loss is essential. Color transfer has recently been addressed by [6], providing visually appealing results by using neural image representations for matching. While interesting and promising, this approach is very different from ours. The HueNet is comprised of a generator, which is the well-known U-Net, an encoder-decoder with skip connections [22]. Yet, the main contributions are the augmented parts of the HueNet, which allow differentiable construction of intensity (1D) and joint (2D) histograms, such that histogram-based loss functions are used to train the image generator in an end-to-end manner. We demonstrate the proposed framework for color transfer of images from three different datasets. While the results are promising, we believe that the tools we developed are applicable, with slight modifications, to different image-to-image translation tasks.
2 Methods
In this section we review the main principles used to define an intensity-based metric via 1D histogram comparison (Section 2.1) and a semantic-based metric via the joint (2D) histogram (Section 2.2). The HueNet architecture and its augmented parts are presented in Sections 2.3 and 2.4, respectively. The HueNet loss functions are presented in Section 2.5. We conclude the section with implementation details (Section 2.6).
2.1 Intensity-based metric
In this section we derive a differentiable approximation to intensity histograms and present the metric we use for histogram matching.
2.1.1 HSV Color representation
To address the color transfer problem we choose the HSV (hue, saturation, value) color space. The HSV representation, as opposed to RGB, allows us to separate the color component of the image (hue) from the color intensity (saturation) and the black/white ratio (value). The proposed HueNet is designed to generate the hue channel of the desired output image, while its input consists of all HSV channels of the source image and the hue channel of the target image (Figure 4d). The intensity-based metric defined below is applied to the HueNet output (the generated hue channel) with respect to the target's hue channel.
2.1.2 Differentiable intensity histogram formulation
Images acquired by digital cameras have a discrete range of intensity values. Their intensity distributions can be described by intensity histograms obtained by counting the number of pixels at each intensity value. We, however, define a gray-level image considering also generated images, whose pixels can take any value in the continuous range $[0,1]$. Let $x_i \in [0,1]$ denote the gray-level value of image pixel $i$.
We use Kernel Density Estimation (KDE) for estimating the intensity density $f$ of an image as follows:
$$f(x) = \frac{1}{N}\sum_{i=1}^{N} K_B(x - x_i) \qquad (1)$$
where $K_B(z) = \frac{1}{B}K\!\left(\frac{z}{B}\right)$, $K$ is the kernel, $B$ is the bandwidth and $N$ is the number of pixels in the image. We choose the kernel $K$ as the derivative of the logistic regression function $\sigma(z) = 1/(1+e^{-z})$:
$$K(z) = \sigma'(z) = \sigma(z)\,\sigma(-z). \qquad (2)$$
We note that Eq. (2) is a non-negative real-valued integrable function and satisfies the requirements for a kernel (normalization and symmetry).
For the construction of a smooth and differentiable image histogram, we partition the interval $[0,1]$ into $K$ sub-intervals $\{I_k\}_{k=1}^{K}$, each of length $L = 1/K$ and with center $\mu_k$, such that $I_k = [\mu_k - L/2,\; \mu_k + L/2]$. We can then define the probability of pixel $i$ in the image to belong to a certain intensity interval (the value of the normalized histogram's bin) as
$$P(x_i \in I_k) = \int_{\mu_k - L/2}^{\mu_k + L/2} K_B(x - x_i)\, dx. \qquad (3)$$
Solving the integral, we get
$$P(x_i \in I_k) = \sigma\!\left(\frac{\mu_k + L/2 - x_i}{B}\right) - \sigma\!\left(\frac{\mu_k - L/2 - x_i}{B}\right). \qquad (4)$$
The function providing the value of the $k$-th bin in a differentiable histogram can thus be written as
$$h(k) = \frac{1}{N}\sum_{i=1}^{N} \Pi_k(x_i), \qquad (5)$$
where
$$\Pi_k(z) = \sigma\!\left(\frac{z - \mu_k + L/2}{B}\right) - \sigma\!\left(\frac{z - \mu_k - L/2}{B}\right) \qquad (6)$$
is a differentiable approximation of the Rect function (by the symmetry of the kernel, Eqs. (4) and (6) coincide). Finally, we define the differentiable histogram of an image as
$$h = \left[h(1), \ldots, h(K)\right]. \qquad (7)$$
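To make the construction concrete, the following is a minimal NumPy sketch of Eqs. (5)-(7). The function name `soft_histogram` and the bandwidth choice $B = L/2.5$ are our own illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_histogram(x, n_bins=256, bandwidth=None):
    """Differentiable (soft) histogram of intensities x in [0, 1], Eqs. (5)-(7).

    Each pixel contributes to bin k through the smooth Rect approximation
    Pi_k(z) = sigmoid((z - mu_k + L/2)/B) - sigmoid((z - mu_k - L/2)/B).
    """
    x = np.asarray(x, dtype=np.float64).ravel()
    L = 1.0 / n_bins                                     # bin width
    B = bandwidth if bandwidth is not None else L / 2.5  # assumed bandwidth
    mu = L / 2 + L * np.arange(n_bins)                   # bin centers
    # (K, N) matrix of per-pixel soft bin memberships Pi_k(x_i)
    pi = sigmoid((x[None, :] - mu[:, None] + L / 2) / B) \
       - sigmoid((x[None, :] - mu[:, None] - L / 2) / B)
    return pi.sum(axis=1) / x.size                       # normalized histogram
```

Because every operation is a sigmoid, a subtraction or a sum, the same computation expressed in an autodiff framework yields gradients with respect to the pixel values, which is what makes histogram-based losses trainable.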
2.1.3 Earth Mover’s Distance
We use the EMD [23], also known as the Wasserstein metric [3], to define the distance between two histograms. Let $h_y$ and $h_z$ be the source and the target histograms of images $Y$ and $Z$, respectively:
$$h_y = [h_y(1), \ldots, h_y(K)], \qquad h_z = [h_z(1), \ldots, h_z(K)], \qquad (8)$$
where $\sum_k h_y(k) = \sum_k h_z(k) = 1$. Let $D = [d_{k,l}]$ be the ground distance matrix, where its $(k,l)$-th entry is the ground distance between bins $k$ and $l$. Let $T = [t_{k,l}]$ be the transportation matrix, where its $(k,l)$-th entry indicates the mass transported from $h_y(k)$ to $h_z(l)$. We aim to find a flow that minimizes the overall cost:
$$T^* = \arg\min_{T} \sum_{k=1}^{K}\sum_{l=1}^{K} d_{k,l}\, t_{k,l}, \qquad (9)$$
subject to the following constraints:
$$t_{k,l} \ge 0, \qquad \sum_{l=1}^{K} t_{k,l} = h_y(k), \qquad \sum_{k=1}^{K} t_{k,l} = h_z(l). \qquad (10)$$
The EMD between the two histograms is the minimum cost of work (Eq. 9) that satisfies the constraints (Eq. 10), normalized by the total flow:
$$\mathrm{EMD}(h_y, h_z) = \frac{\sum_{k,l} d_{k,l}\, t^*_{k,l}}{\sum_{k,l} t^*_{k,l}}. \qquad (11)$$
For one-dimensional histograms with equal areas, the EMD has been shown to be equivalent to the Mallows distance, which has the following closed-form solution [15]:
$$\mathrm{EMD}(h_y, h_z) = \frac{1}{K}\,\lVert H_y - H_z \rVert_{\ell}, \qquad (12)$$
where $H(k) = \sum_{j=1}^{k} h(j)$ is the cumulative density function (CDF). We use $\ell = 2$, the Euclidean distance. Dropping the normalization term, we obtain an intensity-based metric:
$$\mathcal{D}_{\mathrm{EMD}}(h_y, h_z) = \sum_{k=1}^{K} \left(H_y(k) - H_z(k)\right)^2, \qquad (13)$$
where $H_y(k)$ is the $k$-th element of the CDF of $h_y$. Following Hou et al. [7] we use the squared EMD instead of the EMD itself, because it usually converges faster and is easier to optimize with gradient descent [17, 25].
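As a sanity check of Eq. (13), the squared-EMD loss reduces to a few lines of NumPy; the function name `emd2` is our own.

```python
import numpy as np

def emd2(h_y, h_z):
    """Squared EMD between two normalized 1D histograms (Eq. 13):
    the squared Euclidean distance between their cumulative histograms."""
    return float(np.sum((np.cumsum(h_y) - np.cumsum(h_z)) ** 2))
```

Unlike a bin-wise distance, this loss grows with how far the mass has to travel: moving all mass two bins away costs more than moving it one bin away.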
2.1.4 Cyclic histogram
The hue channel in HSV color space is cyclic; thus the EMD measure should be adapted accordingly. Werman et al. [29] showed that the EMD between one-dimensional histograms is equal to the $\ell_1$ distance between their cumulative histograms. They also proved that matching two cyclic histograms by examining only cyclic permutations is optimal. Therefore, the cyclic distance can be expressed as
$$\mathcal{D}^{\mathrm{cyc}}_{\mathrm{EMD}}(h_y, h_z) = \min_{\alpha \in \{0,\ldots,K-1\}} \sum_{k=1}^{K} \left(H_y(k) - H^{\alpha}_z(k)\right)^2, \qquad (14)$$
where $H^{\alpha}_z$ is the CDF of $h^{\alpha}_z$, obtained by negatively shifting the elements of $h_z$ by an offset $\alpha$; elements that pass the last position wrap around to the first. This operator can be described as
$$h^{\alpha}_z(k) = h_z\big((k + \alpha) \bmod K\big). \qquad (15)$$
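A direct NumPy sketch of Eqs. (14)-(15), minimizing the squared EMD over all $K$ cyclic shifts as the text prescribes; `cyclic_emd2` is our name for it, and the loop makes the cost $O(K^2)$.

```python
import numpy as np

def cyclic_emd2(h_y, h_z):
    """Cyclic squared EMD (Eq. 14): minimum of the linear squared EMD
    over all K cyclic permutations of h_z (Eq. 15)."""
    K = len(h_z)
    H_y = np.cumsum(h_y)
    best = np.inf
    for alpha in range(K):
        # negative shift by alpha with wrap-around: out[k] = h_z[(k + alpha) mod K]
        h_shift = np.roll(h_z, -alpha)
        best = min(best, float(np.sum((H_y - np.cumsum(h_shift)) ** 2)))
    return best
```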
2.2 Semantic-based metric
In a color transfer problem, we wish to enforce semantic similarity between the source image and the color-transformed output image, which have different intensity distributions. This is accomplished by applying the semantic-based metric defined below to the generated hue channel with respect to the source's hue channel (Figure 4k). This metric is based on the MI between these hue channels.
2.2.1 Mutual information
The MI of two images $Y$ and $Z$ is defined as follows:
$$\mathrm{MI}(Y, Z) = \sum_{k=1}^{K}\sum_{l=1}^{K} h_{yz}(k,l)\, \log \frac{h_{yz}(k,l)}{h_y(k)\, h_z(l)}, \qquad (16)$$
where $h_y$, $h_z$ are the image histograms as defined in Eq. 5, and $h_{yz}$ is the joint histogram that will be described in the next section. Maximizing the MI between the output and the input image allows us to generate images which are semantically similar. Following [10] we define the semantic-based loss as follows:
$$\mathcal{D}_{\mathrm{MI}}(Y, Z) = 1 - \frac{\mathrm{MI}(Y, Z)}{\mathrm{H}(Y, Z)}, \qquad (17)$$
where $\mathrm{H}(Y, Z)$ is the joint entropy of $Y$ and $Z$, defined as
$$\mathrm{H}(Y, Z) = -\sum_{k=1}^{K}\sum_{l=1}^{K} h_{yz}(k,l)\, \log h_{yz}(k,l). \qquad (18)$$
The quantity $\mathcal{D}_{\mathrm{MI}}$ is a metric [10], with $\mathcal{D}_{\mathrm{MI}}(Y, Y) = 0$ and $0 \le \mathcal{D}_{\mathrm{MI}}(Y, Z) \le 1$ for all pairs $(Y, Z)$. This metric has symmetry, positivity and boundedness properties.
2.2.2 Differentiable joint intensity histogram
We use the multivariate KDE for estimating the joint intensity density of two images , as follows:
(19) 
where, , , is the bandwidth (or smoothing) matrix and is the symmetric 2D kernel function. As in the 1D case (Eq. 2), we choose the kernel as the derivative of the logistic regression function for each of the two variables separately:
(20) 
We define the bandwidth matrix as . We define the probability of corresponding pixels in and to belong to the intensity intervals and correspondingly, as follows:
(21) 
By solving the integral we get:
(22) 
By using the definition of from Eq. 6, we can expressed the value of joint histogram th bin as
(23) 
This equation can be also written using matrix notation. We define a matrix where , where is the pixel index in a flatten image , . The approximated joint histogram ( matrix) of two images , can be defined with a matrix multiplication:
(24) 
2.3 HueNet architecture
Figure 4 illustrates the HueNet architecture. As in [8], we use the UNet architecture [22]
for the HueNet generator and the convolutional “PatchGAN” classifier
[16] for its discriminator. The UNet generates the hue channel of the output image. The augmented parts of the HueNet are detailed next.2.4 Network augmentation
2.4.1 1D Histogram construction
For the construction of an intensity histogram the output hue channel (of size ) is replicated
times, where each pixel is represented by a smooth 1hot vector. The
th channel is obtained by an application of the function to the output’s pixels, thus representing its ‘contribution’ to the th histogram bin (in the interval ). An intensity histogram is constructed by a summation of each of these channels (Eq. 5). This operation is illustrated in Figure 5.We construct cyclic permutations of the constructed histogram by matrix multiplication of and circulant matrix (a special kind of Toeplitz matrix), to obtain the transformation described in Eq. 15.
2.4.2 Joint Histogram construction
Similar to the construction of the 1D histograms, the generation of the joint histogram of two images and is based on the construction of channels from each. Each channel is reshaped into an vector (activation map in Figure 5) , where is the image size in pixels. The Kchannels form a matrix, which is exactly as defined in Eq. 24. Multiplying and normalizing the matrices obtained for each image we get the joint histogram as defined in Eq. 24.
2.5 HueNet losses
The complete HueNet loss is a weighted sum of three loss functions:
(25) 
where , ,
are the intensitybased loss using EMD, the semanticbased loss using MI and the adversarial loss, respectively. The scalars , are the weights.
Note, that we empirically set =100, =25.
Intensitybased loss
The intensitybased loss is derived from Eq. 14 which defines the EMD between the two cyclic histograms of the target and the output:
(26) 
Semanticbased loss
Semanticbased loss between the hue channels of the network’s output and the source image
is based on their MI (Eq. 17) and defined as follows:
(27) 
Adversarial loss
We use conditional GAN loss similar [8].
The discriminator learns to distinguish between and conditioned by
The adversarial loss for the generator is based on the discriminator output:
(28) 
The discriminator loss is defined as:
(29)  
where is the discriminator output.
2.6 Implementation Details
To optimize our networks, we alternate between two gradient descent steps training the Generator (G) and the Discriminator (D). As suggested in [5], we train to maximize We use a batch size of , minibatch SGD and apply the Adam solver [9], with a learning rate of , and momentum parameters ,
. We augmented the training images with vertical flips. We randomly shuffled the dataset to get different pairs of source and target images in each epoch. Histograms were constructed using
bins.Input Target Without proposed HueNet Reinhard [21] 
3 Experimental Results
We test our method on three datasets:

Oxford 102 Category Flower Dataset [19] consists of 8189 images, which we divide into and images for train and test respectively. The images have large scale, pose and light variations. In addition, there are categories that have large variations within the category and several very similar categories. Example images are shown in Figure 6.

Flower Color Images consists of only 210 images was downloaded from Kaggle web site: https://www.kaggle.com/olgabelitskaya/flowercolorimages. Examples are shown in Figures 2,1 and in the HueNet architecture diagram (Figure 4).
Evaluating the quality of synthesized images is an open and difficult problem [24]
. Traditional metrics such as perpixel meansquared error do not assess joint statistics of the result, and therefore do not measure the image similarity we aim to capture. A “real vs. fake” questionnaire based on our generated images and the true ones can be accessed via
https://forms.gle/wAmRSChHpa65d1So7. Human observer results and statistics based this questionnaire can be found in the supplementary material.We evaluate the contribution of semanticbased HueNet loss by training the network without it. In addition, we compared the results to those obtained by Reinhard [21] via manipulations of the color histograms. Results are shown in Figure 6. As can be seen, the HueNet painting is not an application of deterministic functions to huechannel histograms. The insect for example, that camouflages with the blue flower in the third row, appears to be green in the HueNet painted image while the petals are pink. Another example is the painted three flowers in Figure 1 (first row) that appear in different shades, as in the target image although they all have the same shade of purple in the source image. For each of the images in Figure 6, we calculated the EMD and the relative MI between the target’s and source’s or output’s hue channels. We also measured the relative MI for the color transfer results of the network trained without the MI loss.The results are shown in Table 1.
EMD  MI  
No.  TARSRC  TAROUT  SRCTAR  SRCOUT  SRCOUT 
w/o  
1  .175  .08  .029  .21  .111 
2  .184  .131  .039  .46  .112 
3  .173  .139  .069  .255  .118 
4  .186  .053  .055  .304  .094 
5  .482  .081  .034  .144  .046 
6  .221  .088  .07  .263  .127 
7  .29  .216  .036  .18  .095 
4 Summary and future work
We presented the HueNet, a novel deep learning method for color transfer based on the construction of differentiable histograms and histogrambased loss functions. Specifically, intensitybased and semanticbased metrics are used to encourage intensity similarity to the target image and semantic similarity to the source image. The adversarial loss is incorporated to constrain the generation of realistic images, making sure, for example, that the leaves and nor the petals will be painted in green. While the results are promising we believe that the tools we developed can be applicable to different imagetoimage translation tasks, such as photo enhancement, removal of illumination effects and colorization, with slight modifications.
References
 [1] M. AbdullahAlWadud, M. H. Kabir, M. A. Akber Dewan, and O. Chae. A dynamic histogram equalization for image contrast enhancement. IEEE Transactions on Consumer Electronics, 53(2):593–600, May 2007.
 [2] T. Celik and T. Tjahjadi. Contextual and variational contrast enhancement. IEEE Transactions on Image Processing, 20(12):3431–3441, Dec 2011.

[3]
Roland L Dobrushin.
Prescribing a system of random variables by conditional distributions.
Theory of Probability & Its Applications, 15(3):458–486, 1970.  [4] Rafael C. Gonzalez and Richard E. Woods. Digital image processing. Prentice Hall, 2008.
 [5] Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 [6] Mingming He, Jing Liao, Lu Yuan, and Pedro V. Sander. Neural color transfer between images. CoRR, abs/1710.00756, 2017.
 [7] Le Hou, ChenPing Yu, and Dimitris Samaras. Squared earth mover’s distancebased loss for training deep neural networks. CoRR, abs/1611.05916, 2016.
 [8] Phillip Isola, JunYan Zhu, Tinghui Zhou, and Alexei A. Efros. Imagetoimage translation with conditional adversarial networks. CoRR, abs/1611.07004, 2016.
 [9] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [10] Alexander Kraskov, Harald Stögbauer, Ralph G. Andrzejak, and Peter Grassberger. Hierarchical clustering based on mutual information, 2003.
 [11] Jonathan Krause, Michael Stark, Jia Deng, and Li FeiFei. 3d object representations for finegrained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR13), Sydney, Australia, 2013.
 [12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [13] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [14] C. Lee, C. Lee, and C. Kim. Contrast enhancement based on layered difference representation of 2d histograms. IEEE Transactions on Image Processing, 22(12):5372–5384, Dec 2013.

[15]
E. Levina and P. Bickel.
The earth mover’s distance is the mallows distance: some insights
from statistics.
In
Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001
, volume 2, pages 251–256 vol.2, July 2001.  [16] Chuan Li and Michael Wand. Precomputed realtime texture synthesis with markovian generative adversarial networks. In European Conference on Computer Vision, pages 702–716. Springer, 2016.
 [17] David G Luenberger, Yinyu Ye, et al. Linear and nonlinear programming, volume 2. Springer, 1984.
 [18] László Neumann and Attila Neumann. Color style transfer techniques using hue, lightness and saturation histogram matching. In Computational Aesthetics, pages 111–122. Citeseer, 2005.
 [19] ME. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, Dec 2008.
 [20] Stephen M. Pizer, E. Philip Amburn, John D. Austin, Robert Cromartie, Ari Geselowitz, Trey Greer, Bart ter Haar Romeny, John B. Zimmerman, and Karel Zuiderveld. Adaptive histogram equalization and its variations. Computer Vision, Graphics, and Image Processing, 39(3):355 – 368, 1987.
 [21] Erik Reinhard, Michael Adhikhmin, Bruce Gooch, and Peter Shirley. Color transfer between images. IEEE Computer graphics and applications, 21(5):34–41, 2001.
 [22] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. Unet: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computerassisted intervention, pages 234–241. Springer, 2015.
 [23] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The earth mover’s distance as a metric for image retrieval. International journal of computer vision, 40(2):99–121, 2000.
 [24] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans, 2016.

[25]
Shai ShalevShwartz and Ambuj Tewari.
Stochastic methods for l1regularized loss minimization.
Journal of Machine Learning Research
, 12(Jun):1865–1892, 2011.  [26] J. A. Stark. Adaptive image contrast enhancement using generalizations of histogram equalization. IEEE Transactions on Image Processing, 9(5):889–896, May 2000.
 [27] Richard Szeliski. Computer Vision: Algorithms and Applications. SpringerVerlag, Berlin, Heidelberg, 1st edition, 2010.
 [28] Paul Viola and William M Wells III. Alignment by maximization of mutual information. International journal of computer vision, 24(2):137–154, 1997.
 [29] Michael Werman, Shmuel Peleg, and Azriel Rosenfeld. A distance metric for multidimensional histograms. Computer Vision, Graphics, and Image Processing, 32(3):328 – 336, 1985.
 [30] JunYan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired imagetoimage translation using cycleconsistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017.
 [31] Karel Zuiderveld. Graphics gems iv. chapter Contrast Limited Adaptive Histogram Equalization, pages 474–485. Academic Press Professional, Inc., San Diego, CA, USA, 1994.
Comments
There are no comments yet.