XNet: GAN Latent Space Constraints

01/14/2019 ∙ by Omry Sendik, et al.

Recent GAN-based architectures have been able to deliver impressive performance on the general task of image-to-image translation. In particular, it was shown that a wide variety of image translation operators may be learned from two image sets, containing images from two different domains, without establishing an explicit pairing between the images. This was made possible by introducing clever regularizers to overcome the under-constrained nature of the unpaired translation problem. In this work, we introduce a novel architecture for unpaired image translation, and explore several new regularizers enabled by it. Specifically, our architecture comprises a pair of GANs, as well as a pair of translators between their respective latent spaces. These cross-translators enable us to impose several regularizing constraints on the learnt image translation operator, collectively referred to as latent cross-consistency. Our results show that our proposed architecture and latent cross-consistency constraints are able to outperform the existing state-of-the-art on a wide variety of image translation tasks.


1. Introduction

Figure 1. Our architecture uses two GANs that learn an image translation operator from two unpaired sets of images. By introducing a pair of cross-translators between the latent spaces of the two Encoder-Decoder generators, we enable several novel latent cross-consistency constraints. At test time, the only components used are the encoders and decoders, while the rest of the components serve the training process alone.

Many useful graphical operations on images may be cast as an image translation task. These include style transfer, image colorization, automatic tone mapping, and many more. Several such operations are demonstrated in the teaser figure. While each of these operations may be carried out by a carefully-designed task-specific operator, in many cases, the abundance of digital images, along with the demonstrated effectiveness of deep learning architectures, makes a data-driven approach feasible and attractive.

A straightforward supervised approach is to train a deep network to perform the task using a large number of pairs of images, before and after the translation (Isola et al., 2017). However, collecting a large training set consisting of paired images is often prohibitively expensive or infeasible.

Alternatively, it has been demonstrated that an image translation operator, which maps an image from a source domain to a target domain, may also be learned from two image sets, one containing images from each domain, without establishing an explicit pairing between images in the two sets (Zhu et al., 2017; Yi et al., 2017; Liu and Tuzel, 2016). This is accomplished using generative adversarial networks (GANs) (Goodfellow et al., 2014).

This latter approach is more attractive, as it requires much weaker supervision; however, this comes at the cost of making the translation problem highly under-constrained. In particular, a meaningful pairing is not guaranteed, as there are many pairings that are able to yield the desired distribution of translated images. Furthermore, undesirable phenomena, such as mode collapse, may arise when attempting to train the translation GAN (Goodfellow et al., 2014).

To address these issues, existing GAN-based approaches for unpaired image translation (Zhu et al., 2017; Yi et al., 2017) train two GANs. One GAN maps images from the source domain to the target domain, and a second one operates in the opposite direction. Furthermore, a strong regularization is imposed in the form of the cycle consistency loss, which ensures that concatenating the two translators roughly reconstructs the original image. Note that the cycle consistency loss is measured using a pixelwise metric in the original input domain.
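For reference, the cycle consistency loss of CycleGAN can be written as follows, where G and F denote the two translation directions and x, y are images drawn from the two domains (the symbols here are ours, not necessarily those of the cited works):

\mathcal{L}_{\mathrm{cyc}}(G,F) = \mathbb{E}_{x}\big[\lVert F(G(x)) - x \rVert_1\big] + \mathbb{E}_{y}\big[\lVert G(F(y)) - y \rVert_1\big]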

In this work, we introduce a novel architecture for unpaired image translation, and explore several new regularizers enabled by it. Our architecture also comprises a pair of GANs (one for each translation direction), to which we add a pair of translators between their respective latent spaces, as shown in Figure 1. These cross-translators enable us to impose several regularizing constraints on the learnt image translation operator, collectively referred to as latent cross-consistency. Intuitively, regularizing the latent spaces is a powerful yet flexible approach: the latent representation computed by the GAN’s generator captures the most pertinent information for the translation task, while the original input representation (pixels) contains much additional irrelevant information.

We demonstrate the competence of the proposed architecture and latent cross-consistency, in conjunction with several additional loss terms, by means of an ablation study and comparisons with existing approaches. We show our competitive advantages vs. the existing state-of-the-art on tasks such as translating between specular and diffuse objects, inverting halftone images, removing watermarks and translating mobile phone photos to DSLR-like quality. The teaser figure demonstrates some of our results.

2. Related work

2.1. Unpaired image-to-image translation

2017 was a year with multiple breakthroughs in unpaired image-to-image translation. Taigman et al. (2017) kicked off by proposing an unsupervised formulation employing GANs for transfer between two unpaired domains, demonstrating transfer of SVHN images to MNIST ones and of face photos from the Facescrub dataset to emojis.

Two seminal works which achieved great success in unpaired image-to-image translation are CycleGAN (Zhu et al., 2017) and DualGAN (Yi et al., 2017). Both proposed to regularize the training procedure by requiring a bijection, enforcing the translation from the source domain to the target domain and back to return to the same starting point. Such a constraint yields a meaningful mapping between the two domains. Furthermore, since a bijection cannot be achieved under mode collapse, the constraint also helps prevent it.

Dong et al. (2017) trained a conditional GAN to learn shared global features from two image domains, followed by synthesis of plausible images in either domain from a noise vector conditioned on a class/domain label. To enable image-to-image translation, they separately train an encoder to learn a mapping from an image to its latent code, which serves as the noise input to the conditional GAN to generate a target image.

Choi et al. (2017) proposed StarGAN, a network that learns the mappings among multiple domains using only a single generator and a discriminator, training from images of multiple domains. Their novelty was in enabling image-to-image translations for multiple domains using only a single model.

Kim et al. (2017) tackled the lack of image pairing in the image-to-image translation setting through a model based on two different GANs coupled together. Each of them ensured that their generative functions can map each domain to its counterpart domain. Since their method discovers relations between different domains, it may be leveraged to successfully transfer style.

A recent very different approach is NAM (Hoshen and Wolf, 2018), which relies on having a high quality pre-trained unsupervised generative model for the source domain. Assuming such a generator is available, a generative model needs to be trained only once per target dataset, and can thus be used to map to many target domains without adversarial generative training.

In this work we also address unpaired image-to-image translation. Contemporary approaches and some of those mentioned above tackle this problem by imposing constraints formulated in the image domain. Our approach consists of novel regularizers operating across the two latent spaces. Through the introduction of a unique architecture which enables a strong coupling between a pair of generators, we are able to define a set of losses which are domain-agnostic. Additionally, a benefit of our architecture is that it enables multiple regularizers, which together push the trained outcome to a more stable final result. We stress that this is different from contemporary approaches relying on image domain losses, which make use of one or two losses (an identity loss and cycle-consistency or another prior).

2.2. Latent space regularization

Motivated by the fact that image-to-image translation aims at learning a joint distribution of images from the source and target domains, using images from the marginal distributions in the individual domains, Liu et al. (2017) made a shared-latent space assumption, and devised an architecture which maps images from both domains to the same latent space. By sharing the weight parameters corresponding to high level semantics in both the encoder and decoder networks, the coupled GANs are forced to interpret these image semantics in the same way. Additionally, VeeGAN (Srivastava et al., 2017) also addressed mode collapse by imposing latent space constraints. In their work, a reconstructor network reverses the action of the generator through an architecture which specifies a loss function over the latent space.

The two works mentioned above attempt to translate an image from a source domain to a single target domain. The scheme by which they achieve this limits their ability to extend to translating an input image to multiple domains at once. Armed with this realization, Huang et al. (2018) proposed a multimodal unpaired image-to-image translation (MUNIT) framework. Achieving this involved decoupling the latent space into content and style, under the assumption that what differs between target domains is the style alone.

With the existing architectures, there is no path within the network graph that enables formulating losses which constrain both latent spaces at once. For this reason, we dissect the common GAN architecture, and propose a path between encoders from cross (opposite) domains. Our architecture thus consists of a pair of GANs, but in addition, we couple each generator with a translator between latent spaces. The addition of the translators opens up not only the ability to enforce bijection constraints in latent space, but also more intriguing losses, which further constrain the problem, leading to better translations.

3. Cross Consistency Constraints

Figure 2. Latent space visualization for two Horse-to-Zebra translation examples (columns: input, latent code, activations, output). The second column visualizes the latent space of the generator by using PCA to reduce the 256 channels of the latent space to three, mapped to RGB. Alternatively, the third column shows the magnitude of the 256-dimensional feature vector at each latent space neuron. Note that the latent space in these examples indicates the positions and shapes of the zebra stripes in the resulting translated image. The two upper rows show the results of CycleGAN. The bottommost row shows our result, where it is visible, in both the output and the latent space visualization, that the encoder does not attempt to texture improper regions.

Architectures such as CycleGAN (Zhu et al., 2017) or DualGAN (Yi et al., 2017) are able to accomplish unpaired image-to-image translation by imposing consistency constraints in the original image domains. Thus, their constraints operate on the original, pixel-based, image representations, which contain much information that is irrelevant to the translation task at hand. However, it is well-known that in a properly trained Encoder-Decoder architecture, the latent space contains a distillation of the features that are the most relevant and pertinent to the task. To demonstrate this, consider the Horse-to-Zebra translation task. The top row of Figure 2 visualizes the latent code of CycleGAN’s Horse-to-Zebra Encoder-Decoder generator, where we can see that the latent code already contains zebra-specific information, such as the locations and shapes of the zebra stripes. In a manner of speaking, the generator’s encoder has already planted “all the makings of a zebra” into the latent code, leaving the decoder with the relatively simpler task of transforming it back into the image domain. Similarly, the second row of Figure 2 demonstrates a case where the zebra-specific features are embedded in the wrong spatial regions, yielding a failed translation result.
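Visualizations such as those in Figure 2 can be reproduced, in principle, with a few lines of code. The following is a minimal sketch of our own (not the authors' code) that reduces a 256-channel latent tensor to three channels with PCA, for display as RGB, and also computes the per-location feature magnitude:

import numpy as np
from sklearn.decomposition import PCA

def visualize_latent(latent):
    # latent: array of shape (C, H, W), e.g. C = 256 channels of the generator's latent code.
    c, h, w = latent.shape
    feats = latent.reshape(c, h * w).T                        # one C-dimensional vector per spatial location
    rgb = PCA(n_components=3).fit_transform(feats)            # project each vector onto 3 principal components
    rgb = rgb.reshape(h, w, 3)
    rgb = (rgb - rgb.min()) / (np.ptp(rgb) + 1e-8)            # normalize to [0, 1] for display as an RGB image
    magnitude = np.linalg.norm(feats, axis=1).reshape(h, w)   # per-location feature magnitude
    return rgb, magnitude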

The above observation motivates us to explore an architecture that enables imposing consistency constraints on and between the latent spaces. In our architecture, the latent spaces of the two Encoder-Decoder generators (one for each translation direction) are coupled via a pair of cross-translators. Adding these translators creates additional paths through which data can flow, enabling several novel latent cross-consistency constraints in the training stage, as described below. The bottom row of Figure 2 shows an example where imposing these constraints avoids the incorrect embedding of the zebra-specific features.

3.1. Architecture

Armed with the motivation to impose regularizations in latent space, we propose an architecture which links the latent spaces of the two generators (one for each translation direction), thereby enabling a variety of consistency constraints. Our architecture is shown in Figure 1, and the notations that we use in this paper are summarized in Table 1.

The architecture consists of a pair of Encoder-Decoder generators, one for each translation direction. In each generator, the encoder encodes an input image into a latent code, while the decoder decodes that code into an output image in the target domain. The two discriminators attempt to determine whether an input image from their respective domain is real or fake. The novel part of our architecture is the addition of two cross-translators, shown in Figure 1. Each translator is trained to transform the latent codes of one generator into those of the other. By adding the two cross-translators to our architecture, several additional paths, through which data may flow, become possible, paving the way for new consistency constraints. In this work, we present three novel latent cross-consistency losses, which are shown to conjoin to produce superior results on a variety of image translation tasks.

Note that the translators and the discriminators are used only at train time. At test time, the only components used for translating a new input image are the encoders and decoders.
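The following is a minimal PyTorch-style sketch of how these components compose. The names (domains A and B; encoders E_A, E_B; decoders D_A, D_B; cross-translators T_AB, T_BA; discriminators Disc_A, Disc_B) are placeholders of our own rather than the paper's notation, and the module internals are omitted:

import torch.nn as nn

class XNetSketch(nn.Module):
    # Illustrative container for the components of Figure 1 (placeholder names, not the authors' code).
    def __init__(self, E_A, E_B, D_A, D_B, T_AB, T_BA, Disc_A, Disc_B):
        super().__init__()
        self.E_A, self.E_B = E_A, E_B      # encoders: image -> latent code
        self.D_A, self.D_B = D_A, D_B      # decoders: latent code -> image (D_B outputs domain-B images)
        self.T_AB, self.T_BA = T_AB, T_BA  # cross-translators: T_AB maps codes produced by E_A into the latent space of the other generator
        self.Disc_A, self.Disc_B = Disc_A, Disc_B  # discriminators; used during training only

    def translate_a_to_b(self, a):
        # Test-time path: encode with E_A and decode with D_B of the same (A-to-B) generator.
        return self.D_B(self.E_A(a))

    def translate_b_to_a(self, b):
        return self.D_A(self.E_B(b))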

Table 1. Summary of notations used in this paper: the two image domains; the two latent domains; the two encoders, each mapping an image domain to its latent domain; the two translators, each mapping one latent domain to the other; the latent codes produced by the encoders; the latent codes produced by the translators; the two decoders, each mapping a latent domain back to an image domain; and the two discriminators, one per image domain.

3.2. Latent Cross-Identity Loss

In order to train the cross-translators, we require that an image fed into one encoder be reconstructable by the decoder of the dual generator, after translation of its latent code by the corresponding cross-translator. A symmetric requirement is imposed on the other translator. These two requirements are formulated as the cross-identity loss:

(1)

The corresponding data path through the network is shown in Figure 3(a). This may be thought of as an autoencoder loss, where the autoencoder, in addition to an encoder and a decoder, has a dual latent space, with a translator between its two parts.
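A plausible reconstruction of the cross-identity loss in (1), using the placeholder names from the sketch in Section 3.1 (E_A and D_B form the A-to-B generator, E_B and D_A the reverse one, and T_AB, T_BA are the cross-translators between their latent spaces) and assuming an L1 penalty:

\mathcal{L}_{XId} = \mathbb{E}_{a}\big[\lVert D_A(T_{AB}(E_A(a))) - a \rVert_1\big] + \mathbb{E}_{b}\big[\lVert D_B(T_{BA}(E_B(b))) - b \rVert_1\big]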

Note that some previous unpaired translation works (Yi et al., 2017; Zhu et al., 2017) use an ordinary identity loss (without cross-translation), where images from the target domain are fed into the generator mapping into that domain, and vice versa. We adopt this loss as well, as we found it to complement our cross-identity loss in (1):

(2)
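Under the same placeholder notation, and again assuming an L1 penalty, the ordinary identity loss in (2) would read:

\mathcal{L}_{Id} = \mathbb{E}_{b}\big[\lVert D_B(E_A(b)) - b \rVert_1\big] + \mathbb{E}_{a}\big[\lVert D_A(E_B(a)) - a \rVert_1\big]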

3.3. Latent Cross-Translation Consistency

While the normal expected input for each encoder is an image from its intended source domain, let us consider the scenario where one of the encoders is given an image from its target domain instead. For example, if one domain consists of images of horses and the other of images of zebras, what should happen when a zebra image is given as input to the “horse-to-zebra” encoder? Our intuition tells us that in such a case we’d like the generator to avoid modifying its input. This implies that the resulting latent code should capture and retain the essential “zebra-specific” information present in the input image. The cross-translator is trained to map such “zebra features” to “horse features”; thus, we expect the translated code to be similar to the latent code obtained by feeding the zebra image to the “zebra-to-horse” encoder, which should also yield “horse features”. The above reasoning, applied in both directions, is formally expressed using the cross-translation consistency loss (see Figure 3(b)):

(3)
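With the same placeholder notation, a sketch of the cross-translation consistency loss in (3) is given below (the choice of norm is our assumption):

\mathcal{L}_{XTC} = \mathbb{E}_{b}\big[\lVert T_{AB}(E_A(b)) - E_B(b) \rVert_1\big] + \mathbb{E}_{a}\big[\lVert T_{BA}(E_B(a)) - E_A(a) \rVert_1\big]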
Figure 3. Data paths used by the novel loss terms in our approach (symmetric paths are omitted for clarity). (a) The latent cross-identity loss trains the cross-translators to map between the two latent spaces of the dual generators. (b) The latent cross-translation consistency loss regularizes the latent spaces generated by each of the two encoders. (c) The latent cycle-consistency loss ensures that the cross-translators define bijections between the two latent spaces.

3.4. Latent Cycle-Consistency

Our final latent space regularization is designed to ensure that our cross-translators are bijections between the two latent spaces of the generators. Similarly to the motivation behind the cycle-consistency loss of Zhu et al. (2017), having bijections helps achieve a meaningful mapping between the two domains, and also helps avoid mode collapse during the optimization process.

Specifically, we require that translating a latent code by one cross-translator and then back by the other yields roughly the same code:

(4)

The data path corresponding to this latent cycle-consistency loss is depicted in Figure 3(c).
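A sketch of the latent cycle-consistency loss in (4), again under our placeholder notation and an assumed L1 penalty:

\mathcal{L}_{LCyc} = \mathbb{E}_{a}\big[\lVert T_{BA}(T_{AB}(E_A(a))) - E_A(a) \rVert_1\big] + \mathbb{E}_{b}\big[\lVert T_{AB}(T_{BA}(E_B(b))) - E_B(b) \rVert_1\big]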

3.5. Training Details

Our final loss, which we optimize throughout the entire training process, is a weighted sum of the losses presented in the previous sections and a GAN loss (Goodfellow et al., 2014):

(5)
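The total objective in (5) is therefore of the form below, where the λ's denote the relative weights of the individual loss terms (the symbols are our placeholders):

\mathcal{L} = \mathcal{L}_{GAN} + \lambda_{XId}\,\mathcal{L}_{XId} + \lambda_{Id}\,\mathcal{L}_{Id} + \lambda_{XTC}\,\mathcal{L}_{XTC} + \lambda_{LCyc}\,\mathcal{L}_{LCyc}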

Rather than using a negative log likelihood objective in the GAN loss, we make use of a least-squares loss (Mao et al., 2017). Additionally, we adopt Shrivastava et al.’s method (2017), which updates the discriminators using a history of translated images rather than only the recently translated ones.

Unless otherwise mentioned, the same fixed values of these relative weights are used throughout the entirety of this paper.

For our generators, we adopt the Encoder and Decoder architecture from Johnson et al. (2016). Their Encoder consists of an initial 7x7 convolution with stride 1, two stride-2 convolutions with a 3x3 kernel, and 9 residual blocks with 3x3 convolutions. The Decoder consists of two transposed convolutions with stride 2 and a kernel size of 3x3, followed by a final convolution with a kernel size of 7x7 and a hyperbolic tangent activation for normalization of the output range. For our discriminator networks we use 70x70 PatchGANs (Li and Wand, 2016), whose task is to classify whether overlapping image patches are real or fake.

Finally, our two latent code translators consist of 9 residual blocks that use 3x3 convolutions with stride 1.
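A minimal PyTorch-style sketch of such a latent-code translator is given below. The channel count of 256 and the use of instance normalization are our assumptions, based on the latent space description above and common practice; the authors' exact block design may differ:

import torch.nn as nn

class ResBlock(nn.Module):
    # A 3x3, stride-1 residual block operating on latent feature maps.
    def __init__(self, channels=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)

class LatentTranslator(nn.Module):
    # Maps latent codes of one generator into the latent space of the other (the translators of Figure 1).
    def __init__(self, channels=256, n_blocks=9):
        super().__init__()
        self.blocks = nn.Sequential(*[ResBlock(channels) for _ in range(n_blocks)])

    def forward(self, z):
        return self.blocks(z)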

For all of the applications which we present in the following sections, we trained our proposed method using two sets of ~1000 images (a total of ~2000 images). The final generator used for producing the results was the one obtained after training for 200 epochs.

4. Comparisons

4.1. Ablation study

To evaluate the necessity and effect of our newly devised cross-consistency losses, we conduct an ablation study. In Figure 4 we show the results for Horse↔Zebra translation, watermark removal and Halftone-to-Grayscale translation, demonstrating the visual effect of adding each one of the three losses. Additionally, we compare our results to those of CycleGAN (Zhu et al., 2017), Double-DIP (Gandelsman et al., 2018) (a watermark removal method on which we elaborate more in the coming sections) and the Rolling Guidance Filter (RGF) method (Zhang et al., 2014) (a halftone inversion method on which we also elaborate next). In all of these results, the GAN loss and the ordinary identity loss were included in our training.

Figure 4 shows that as the three cross-consistency losses are gradually added, the results improve. The best results are obtained when all three losses are included, as shown in the 6th row (one before last). The 7th (last) row shows translation results generated using competing methods, where it may be seen that they are less successful: some zebra stripes remain when translating zebras to horses, and not all horses are translated to zebras. In the watermark removal use case, watermark residue is rather visible, and in the Halftone-to-Grayscale translation, RGF produces blurrier results.

4.2. Unpaired image-to-image translation

In Figure 5 we show a variety of our unpaired image-to-image translation results (XNet), compared with CycleGAN. From top to bottom, we show image-to-image translation results for Apples↔Oranges, Summer↔Winter, and Halftone-to-Grayscale translation. All of our results were achieved with the full loss in (5), with the relative weights reported in Section 3.5. Qualitatively, it may be observed that in all three image-to-image translation tasks, XNet outperforms CycleGAN, providing better texture transfer, color reproduction, and also better structure (visible in the Apples↔Oranges translations). Note that for producing the CycleGAN results, where possible, we used the existing pretrained models made available by Zhu et al. (2017).

Figure 4. Ablation study (columns, left to right: Zebra-to-Horse, Horse-to-Zebra, watermark removal, Halftone-to-Grayscale; top row: input; last row: competitor). Through a gradual inclusion of losses, we demonstrate how the results of translating between horses and zebras or removing watermarks improve. A combination of all three cross-consistency losses (6th row) is shown to yield better results than those produced by our competitors (last row), namely CycleGAN on the horse-to-zebra translation (the six leftmost columns) and Double-DIP on the watermark removal task (the three rightmost columns).
Figure 5. A variety of image-to-image translation results of our method (XNet), compared to CycleGAN (columns: input, XNet, CycleGAN; rows: Apples↔Oranges, Summer↔Winter, Halftone-to-Grayscale). Note that XNet exhibits more semantic consistency over the entire image. For example, in the leftmost Winter-to-Summer image, the grass in the CycleGAN result is not very plausible (it grows on the top of the mountain and not on the bottom). In the Apples↔Oranges samples, CycleGAN consistently produces errors and artifacts that seem semantic in nature (shadows become illuminated, and the rightmost orange looks like a superimposed image of an open orange).

5. Applications

5.1. Weakly supervised watermark removal

Visible watermarks are commonly used by stock content providers to mark and protect their digital photos and videos. Such watermarking usually involves alpha-compositing a text or a logo over a source image. In order to make the unlicensed usage of photos difficult, visible watermarks often contain complex structures. As previously stated by Dekel et al. (2017), “removing a watermark from a single image without user supervision or a-priori information is an extremely difficult task. However, the fact that watermarks are added in a consistent manner to many images has thus far been overlooked.”

We demonstrate that our XNet architecture may be used for weakly-supervised watermark removal by training it using a set of clean images and a set of watermarked images, without providing a pairing between the two sets. Specifically, we randomly draw the coordinates of the watermark, and composite it over a random clean photo. Similarly to Dekel et al. (2017), we assume that the same alpha value is used for the entire set, a property which usually holds for the watermarks of stock content providers. The actual alpha value is randomly drawn.
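As an illustration, the training pairs could be synthesized along the following lines (a sketch of our own, not the authors' pipeline; the watermark image, alpha range, and coordinate sampling are placeholders):

import random
import numpy as np
from PIL import Image

def composite_watermark(clean, watermark, alpha):
    # Alpha-composite `watermark` (an RGBA image) over `clean` at random coordinates,
    # scaling the watermark's own alpha channel by the global `alpha` used for the whole set.
    out = clean.convert("RGBA")
    x = random.randint(0, max(0, out.width - watermark.width))
    y = random.randint(0, max(0, out.height - watermark.height))
    mark = watermark.copy()
    a = np.array(mark.getchannel("A"), dtype=np.float32) * alpha
    mark.putalpha(Image.fromarray(a.astype(np.uint8)))
    out.alpha_composite(mark, dest=(x, y))
    return out.convert("RGB")

# The same randomly drawn alpha is reused for every image in the watermarked set.
alpha = random.uniform(0.3, 0.7)  # placeholder range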

Once trained, our network is able to translate watermarked images into clean ones (neither of which were seen during training) in a completely automatic manner. In contrast, Dekel et al. (2017) require the user to provide an initial rough bounding box of the watermark.

In Figure 6 we show our results, compared with a commercial inpainting-based watermark removal application (Web-Inpaint, 2018), which requires the user to provide the bounding box of the watermark. We also compare with Double-DIP (Gandelsman et al., 2018), a recent method that uses a pretrained CNN for watermark removal, given several images with the same watermark as input (three images in their available implementation).

For the sake of objective comparison, we report two measures. We calculate the PSNR between the ground truth photo without the watermark and each of the three results, namely Web-Inpaint, Double-DIP, and ours. Since Web-Inpaint requires a bounding box as input, it benefits from knowing which parts of the image should not be altered. Thus, we also report the PSNR inside the bounding box of the watermark, in order to quantify the proximity between the ground truth and the output results only over the affected rectangular region. We also stress that both methods we compare with directly target the task of watermark removal, while our approach is a generic one that assumes no additional priors unique to this specific task.
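For reference, the two measures can be computed as in the following sketch (assuming 8-bit images and an (x0, y0, x1, y1) bounding box; this is our own illustration, not the authors' evaluation code):

import numpy as np

def psnr(gt, pred, peak=255.0):
    # Peak signal-to-noise ratio between two images of identical shape.
    mse = np.mean((gt.astype(np.float64) - pred.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def psnr_in_bbox(gt, pred, bbox):
    # PSNR restricted to the rectangular region affected by the watermark.
    x0, y0, x1, y1 = bbox
    return psnr(gt[y0:y1, x0:x1], pred[y0:y1, x0:x1])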

The results in Figure 6 show that our approach achieves higher PSNR values compared with the two competitors. While our margin is only mild (up to ~1.6dB) when measured over the entire image, it is more significant (up to ~5dB) when measured over the bounding box containing the watermark. Additional images and quantitative results are provided in the supplementary material.

Input | Web-Inpaint | Double-DIP | XNet
Row 1: Web-Inpaint PSNR=19.91dB (BBox=14.53dB); Double-DIP PSNR=22.08dB (BBox=19.52dB); XNet PSNR=22.38dB (BBox=20.49dB)
Row 2: Web-Inpaint PSNR=22.13dB (BBox=17.01dB); Double-DIP PSNR=21.45dB (BBox=17.89dB); XNet PSNR=22.74dB (BBox=24.06dB)
Row 3: Web-Inpaint PSNR=21.58dB (BBox=16.32dB); Double-DIP PSNR=22.22dB (BBox=17.38dB); XNet PSNR=23.89dB (BBox=22.40dB)

Figure 6. Watermark removal: From left to right, we show the watermarked input, the results of a commercial watermark removal application, the Double-DIP method (2018) and our output. We report the PSNR over the full image, as well as the PSNR within a bounding box of the watermark. Despite being generic, our approach achieves the best PSNR values.

5.2. Inverse Halftoning

Halftoning is a technique that involves printing dots of a single tone, which vary either in size or in spacing, for simulating continuous tone imagery. Due to the binary nature of halftone images, simple operations such as image rescaling are difficult to perform. Quality degradation is greatly reduced if the halftone is inverted (converted to grayscale) before any processing.

Our XNet architecture may be applied to the task of inverse halftoning: reconstruction of continuous tone images from halftone ones. Here, the two sets of images used for training are a set of grayscale images and their halftone versions, produced using the Floyd-Steinberg algorithm (Floyd, 1976). We compare our results with ones obtained using the Rolling Guidance Filter (RGF), a state-of-the-art inverse halftoning approach by Zhang et al. (2014). The RGF is an effective scale-aware filter that can remove different levels of detail from any input natural image, and is very naturally applied to halftone inversion.
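Generating the halftone half of the training data can be done with off-the-shelf Floyd-Steinberg dithering; for instance, Pillow's 1-bit conversion applies it by default. A minimal sketch of our own (the resizing and file handling are placeholders, not the authors' data pipeline):

from PIL import Image

def make_training_pair(path, size=256):
    gray = Image.open(path).convert("L").resize((size, size))
    # Pillow's mode-"1" conversion uses Floyd-Steinberg error diffusion by default.
    halftone = gray.convert("1", dither=Image.FLOYDSTEINBERG)
    return halftone, gray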

We quantitatively compare our results with RGF by reporting the PSNR between the ground truth and the inverted outputs. In Figure 7 we show the halftone input, the RGF result, our result, and the ground truth, from left to right. The PSNR values show that our method produces outputs more similar to the ground truth. Additionally, our subjective impression is that, through learning, high frequency details are reproduced better. Notice, for example, the high frequency details in the bottom row, showing fine details of vegetation.

During training, we found that halftone to grayscale translation results are best after only 25 epochs. Thus, we present results using this epoch’s output network.

We provide additional quantitative results and output images in the supplementary material.

Halftone input | RGF (2014) | XNet | Ground truth
Row 1: RGF PSNR=20.38dB; XNet PSNR=21.06dB
Row 2: RGF PSNR=18.84dB; XNet PSNR=19.38dB
Row 3: RGF PSNR=18.54dB; XNet PSNR=19.55dB

Figure 7. Results for halftone reconstruction: From left to right, we show the input halftone image, RGF (2014), our result, and the ground truth. PSNR values show that our method more accurately reconstructs the original grayscale image.

5.3. Specular ↔ Diffuse

Most multi-view 3D reconstruction algorithms assume that object appearance is predominantly diffuse. However, in real-world images, often the contrary is true. In order to alleviate this restriction, Wu et al. (2018) proposed a neural network for transferring multiple views of objects with specular reflection into diffuse ones. By introducing a Multi-View Coherence loss, exploiting the multiple views of a single specular object, they were able to synthesize faithful diffuse appearances of an object.

We make use of their publicly available dataset, and train XNet to translate specular objects to diffuse ones, and vice versa. Our training set is composed of ~1000 specular images and ~1000 diffuse ones. We emphasize that the results shown below were achieved by applying XNet on input images from a test set, which were not used at train time (and similarly for all of the applications shown throughout this paper).

Note that, differently from Wu et al., we do not rely on any prior assumptions specific to the task of specular-to-diffuse translation. A fundamental difference between Wu et al. and our approach is that the input to their network is a sequence of images, which encourages the learning of a specific object’s structure, while in our approach, only a single image is provided as input to the appropriate generator.

In Figure 8 we show the results of applying XNet on Wu et al.’s dataset. Results are shown for both translation directions, specular to diffuse and diffuse to specular. Both types of translations produce visually convincing results. It is visible that our translation captures the sculpture’s fine details, and properly shades the outputs. An undesired phenomenon, which occurs due to the inherent nature of our approach, is the change introduced in the background. A perfect translation between these two domains should not alter any background pixels, but our method provides no mechanism to control which pixels are left untouched. Additional results are provided in the supplementary material.

Figure 8. Specular ↔ Diffuse translation: The odd columns show the inputs, while the even ones show our outputs. The two top rows show a translation from a specular input to a diffuse output, while the two bottom rows show the opposite translation.

5.4. Mobile phone to SLR

Although extremely popular, contemporary mobile phones are still very far from being able to produce results of quality comparable to those of a professional DSLR camera. This is mostly due to the limitations on sensor and aperture size. In a recent work, Ignatov et al. (2017) proposed a translation function, based on a residual convolutional neural network, that improves both color rendition and image sharpness, and applied it to their own manually collected large-scale dataset, which consists of real photos captured by three different phones and one high-end reflex camera. Here, we demonstrate the applicability of our approach to the same task.

In their work, Ignatov et al. proposed a loss function, under the assumption that the overall perceptual image quality can be decomposed into three independent parts: i) color quality, ii) texture quality and iii) content quality. They defined an explicit loss function for each component, and ensured invariance to local shifts by design.

Ignatov et al.’s approach is fully supervised, where a pairing of images (mobile and DSLR photos) is available during training. To this end, they introduced the large-scale DPED dataset, which consists of photos taken synchronously in the wild by a smartphone and a DSLR camera. They captured a total of 5727 image pairs from an iPhone and a DSLR.

Similarly to our previous applications, we train on only a subset of the images, using ~1000 images per set. In contrast to DPED, we naively train XNet to translate iPhone photos to those of DSLR quality, ignoring the pairing of data and adding no additional explicit priors on the nature of the data.

Since the DSLR camera did not capture the scene from exactly the same perspective as the iPhone, an objective comparison is not possible. Ignatov et al. proposed aligning and warping between the two images, but since the entire essence of the compared approaches is to produce high resolution details, we find warping (and essentially interpolating) improper, and resort to qualitative comparisons.

We compare our results to CycleGAN, MUNIT (2018) and to the fully supervised method DPED, and show that our approach translates the input image into one that is richer in colors and details.

In Figure 9 we present our results and compare them with those mentioned above. Zooming into specific regions of the results shows the effectiveness of XNet vs. the competitors in producing colors and details in the translated images. In order to compare properly, all of the algorithms were fed with an input image scaled to 256x256. We show the input, XNet’s, CycleGAN’s, MUNIT’s and DPED’s results, from left to right respectively. Every odd row shows the full images, while the even rows show a zoom-in. The two top rows show XNet’s ability to enhance colors more strongly than the competing methods. The two middle rows show that our approach enhances contrast more strongly. Finally, the two bottommost rows show that our result enhances details better than the competitors. We provide additional results in the supplementary material.

Figure 9. Results of employing XNet for enhancing mobile phone photos. We show the input, our result, and CycleGAN’s, MUNIT’s and DPED’s results, from left to right. Every odd row shows the full images, while the even rows show a zoom-in.

6. Limitations and Discussion

Through the addition of the latent code translators, cross coupling between the two generators has become feasible, and we were able to apply our losses, which regularized the training process and provided compelling results. We presented three losses, which involve routing data through our neural network graph along various paths. However, it seems intriguing to try other paths which we have not explored. One example is a path that feeds an image from one domain through the network along two different routes, one of which passes through a cross-translator, requiring that the latent codes produced by the two routes be consistent. This is somewhat symmetrical to one of our proposed losses, but involves back-propagating through different parts of our architecture using different inputs.

We presented the results of our newly devised architecture on a variety of applications, namely watermark removal, halftone reconstruction, converting an object’s appearance from specular to diffuse, and enhancing a mobile phone’s photo to DSLR quality. In all of these applications, we have used the exact same architecture and training hyperparameters, and were able to demonstrate that the resulting translation competes favorably with the available alternatives. Nevertheless, our method is not artifact-free. Since XNet is a fully-automated method, with no prior on the specific translation task at hand, some flaws are inevitable. Typical artifacts include blurriness, checkerboard artifacts, and improper color shifts. We believe that through the addition of task-specific priors, realized in latent space, one may leverage our proposed architecture and push the visual quality of the results further. We see this as a promising topic for further research.

Finally, in Figure 10 we provide some results of our method’s failures. The most striking limitation which we found is the lack of a control input directing the encoder-decoder pair which pixels to leave untouched. This is of interest for both the watermark removal and specular-to-diffuse applications. In the upper left pair in Figure 10, it is visible that not only has our method failed to remove the watermark, it has also changed the palette of the entire image. Similarly, in the upper right pair, the background of the specular object has undesirably changed. In the bottom left pair, our method has failed to properly translate the zebra to a horse, most probably due to the large scale of the zebra. Our intuition is that such cases may be improved by increasing the receptive field of the encoder. In the bottom right pair, notice that the translated halftone image contains a few green pixels, whereas in this application it is very clear that the output should be completely monochrome.

Figure 10. Examples of our method’s failures (columns alternate between input and XNet output). The upper left pair shows a failure in the attempt to remove a watermark. The upper right pair shows our method’s result applied on specular-to-diffuse translation. The bottom row shows results of Zebra-to-Horse and Halftone-to-Grayscale translations, all of which are failures.

References

7. Supplementary Material

Figure 11. A variety of image-to-image translation results of our method (XNet), applied on style transfer tasks (Monet/Photo and Cezanne/Photo; columns alternate between input and XNet output).
Figure 12. Specular ↔ Diffuse results: The odd columns show the inputs, while the even ones show our outputs. The four leftmost columns depict a translation from a specular input to a diffuse output, while the four rightmost columns show the opposite direction of translation.
Figure 13. Results of employing XNet for enhancing a mobile phone photo to DSLR quality. We show the input, our result, and CycleGAN’s, MUNIT’s and DPED’s results, from left to right. The top row shows the full images, while the second row shows a zoom-in.
Input | Web-Inpaint | DIP | XNet | Input | Web-Inpaint | DIP | XNet
PSNR (Crop) values, in the order they appear in the figure: 17.84dB (12.44dB), 22.46dB (17.57dB), 18.06dB (15.78dB), 20.08dB (14.56dB), 21.30dB (17.03dB), 21.67dB (20.96dB), 21.39dB (16.08dB), 23.43dB (20.21dB), 22.77dB (21.62dB), 17.92dB (13.01dB), 18.87dB (16.08dB), 18.42dB (16.55dB), 19.21dB (13.78dB), 22.27dB (18.60dB), 22.59dB (20.34dB), 19.82dB (14.20dB), 21.25dB (16.48dB), 21.55dB (19.58dB).

Figure 14. Unsupervised watermark removal: From left to right, we show the watermarked input, the results of a commercial watermark removal application, the DIP approach (2018) and our output. We report the PSNR over the full image, as well as the PSNR within a bounding box (Crop) of the watermark. Despite being generic, our approach achieves the best PSNR values.
Halftone input | XNet | RGF | Ground truth | Halftone input | XNet | RGF | Ground truth
Row 1: XNet PSNR=20.96dB, RGF PSNR=20.57dB; XNet PSNR=21.88dB, RGF PSNR=21.51dB
Row 2: XNet PSNR=19.82dB, RGF PSNR=19.48dB; XNet PSNR=22.01dB, RGF PSNR=21.71dB
Row 3: XNet PSNR=22.98dB, RGF PSNR=22.70dB; XNet PSNR=24.82dB, RGF PSNR=24.63dB

Figure 15. Results for halftone reconstruction: From left to right, we show the input halftone image, our result, RGF (2014) and the ground truth. PSNR values show our competitive ability to reproduce the original grayscale image.