From the perspective of human visual perception, we usually judge a synthesized image to be fake because it contains local artifacts. Although it looks real at first glance, we can still easily distinguish the fake from the real by gazing for only about . Humans are able to draw a realistic scene from coarse structure to fine detail; that is, we usually grasp the global structure of a scene while focusing on the details of an object and understanding how it relates to its surroundings. Following this intuition, our goal in this work is to develop an image-to-image translation system for high-quality image synthesis with clear structure and vivid details.
Many efforts have been made to develop automatic image-to-image translation systems. The straightforward approach is to optimize in pixel space with an L1 or L2 loss [9, 23]. However, both losses produce blurry results. Some works therefore added an adversarial loss to generate sharper images in both the spatial and spectral dimensions . Besides the GAN loss, perceptual loss has also been used in image-to-image translation tasks, but it is limited by its dependence on a pre-trained deep model and the training datasets . Although a variety of losses are available to measure the discrepancy between real and generated images, using GANs for image-to-image translation still suffers from artifacts and unsmooth color distributions, and generating high-resolution photo-realistic images remains hard because of the high-dimensional distribution .
So how can this problem be solved intuitively? We decompose the image-to-image translation task into three iterated steps: first, generate an image with global structure but some local artifacts (via a GAN); second, propose the most fake region of the generated image (using our DRPnet, shown in Fig. 1); and third, perform “image inpainting” on the most fake region for a more realistic result, so that the system (our DRPAN) is gradually optimized to synthesize images with more attention paid to the local parts with the most artifacts. Motivated by this, we develop a framework based on a patch-wise discriminator that predicts a discriminative score map, and we use sliding windows to find the most artificial region. The proposed discriminative region is then used to mask the corresponding real sample, and the output is called the “masked fake”. Finally, we propose a reviser that distinguishes the real from the masked fake to produce realistic details and that serves as an auxiliary for the generator in synthesizing high-quality translation results. The reviser criticizes the fake image iteratively over different regions. We provide a weighting parameter to balance the contributions of the patch discriminator and our reviser for different levels of translation tasks. With the proposed DRPAN, we can synthesize high-resolution images with photo-realistic details and fewer artifacts.
The contributions of this study are threefold: first, we design a mechanism that uses patch-based discriminators to produce a discriminative region; second, we propose a reviser for GANs that provides the generator with constructive revisions which are usually missed by the patch discriminator; third, we build the DRPAN model as a general-purpose solution for high-quality image-to-image translation tasks on different levels. The code of this paper is available at https://github.com/godisboy/DRPAN.
2 Related works
Feed-forward based approach. Many studies (e.g., [15]) were mainly based on the VGG-16 network architecture  and used perceptual losses for style translation . Network architectures that work well on object recognition tasks have also proved to work well in generative models; e.g., some computer vision translation and editing tasks used residual blocks as a strong feature learning architecture [19, 22]. Feed-forward CNNs trained with a per-pixel loss have been presented for image super-resolution [9, 16, 34, 15], image colorization [8, 44], and semantic segmentation [23, 4, 31]. A recent photo-realistic image synthesis system, called CRN , can synthesize images with high resolution. However, the images synthesized by feed-forward approaches are usually overly smooth rather than realistic, i.e., not sharp enough in details. Besides, these methods are hard to apply to other image-to-image translation tasks.
GAN based approach. GANs  introduced an unsupervised method for learning the real data distribution. DCGAN  was the first to use CNNs to train generative adversarial networks, which had previously been hard to deploy in other tasks. Since then, CNNs have been used extensively in GAN architectures. Towards stable GAN training, WGAN  replaced the Jensen-Shannon divergence with the Wasserstein distance as the optimization metric, and a variety of more stable alternatives have recently been proposed [28, 18, 12]. Wang and Gupta  combined a structure GAN with a style GAN to learn to generate natural indoor scenes. Reed et al.  used text as conditional input to synthesize images with semantic variation. Pathak et al.  proposed context encoders for image inpainting trained with an adversarial loss. Li et al.  trained GANs with a combination of a reconstruction loss, two adversarial losses and a semantic parsing loss for face completion. Nguyen et al.  presented Plug & Play Generative Networks for high-resolution and photo-realistic image generation with the resolution of images. Isola et al.  explored conditional GANs for a variety of image-to-image translation problems. ID-CGAN  combined conditional GANs with a perceptual loss for single image de-raining and de-snowing. Because paired images are scarce and hard to collect, some works proposed unpaired or unsupervised translation frameworks [46, 17, 40]. However, these frameworks are limited by the similarity required between the source domain and the target domain .
PatchGAN was first used in neural style transfer with CNNs based on patch feature inputs . Pix2pix  showed that a full ImageGAN does not improve quality compared with a smaller patch discriminator, which has fewer parameters and needs fewer computing resources. SimGAN  used a patch-based score map for real image synthesis tasks and mapped a full image to a probability map. Our method extends PatchGAN into a unified discriminative region proposal network model that decides where and how to synthesize via a reviser. We show that this approach improves the quality of translation results, especially in terms of resolution and photo-realism.
Our image-to-image translation model, called Discriminative Region Proposal Adversarial Networks (DRPAN), is composed of three components: a generator, a discriminator, and a reviser. The discriminator builds on PatchGAN to construct the Discriminative Region Proposal network (DRPnet, see Fig. 1), which finds and extracts the discriminative region used to produce the masked fake sample, while the reviser uses a CNN to distinguish the real from the masked fake and thereby provides constructive revisions to the generator. The overall network architecture and data flow are illustrated in Fig. 2.
Fig. 3 shows how our process improves the quality of the synthesized image. As DRPAN continues to train, the discriminative region for the masked fake images (right) varies, and the quality of the synthesized images (left) improves, with a brighter score map (compare the first and the last). Moreover, although it becomes hard to distinguish the synthesized sample from the real sample after many epochs, DRPAN can still revise the generator to refine the details of the synthesized result.
We first suggest that patch-based discriminators produce meaningful score maps, which may have applications beyond image synthesis. Fig. 4 shows the score maps output by a pre-trained PatchGAN for images of different quality levels (fake and real). The score maps of the fake samples, which have obvious artifacts and shape deformation in some regions, are mostly dark, with lower scores in the corresponding regions; in contrast, the score maps of the real samples are the brightest, with the highest scores. From the visualization of a score map, we can therefore locate its darkest region and propose it as the discriminative region, i.e., the most noticeably fake region.
Based on the observation shown in Fig. 4, we extend the patch discriminator into DRPnet to produce the discriminative region. Given an input image with resolution , it is processed by the patch discriminator into a probability score map with size . Suppose we want to obtain the discriminative region at , the size of sliding window for score map can be calculated by
Then our DRPnet will find the discriminative square patch on score map with the center coordinates and length , so the scale between the input image and output score map is
The center coordinates of discriminative region will be calculated by
Finally, the discriminative region produced by DRPnet can be expressed as
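The region proposal steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `propose_region`, the use of the lowest window sum as the "darkest" criterion, and the choice of the window centre as the reported coordinate are all hypothetical choices made for this sketch. It assumes a square input image and a square score map.

```python
def propose_region(score_map, image_size, region_size):
    """Find the darkest square window on a patch score map.

    score_map   : square list of lists of per-patch scores (low = looks fake)
    image_size  : side length of the square input image, in pixels
    region_size : desired side length of the discriminative region, in pixels
    Returns (x_center, y_center, region_size) in image coordinates.
    All names here are hypothetical; the selection rule (lowest window sum)
    is an assumption of this sketch.
    """
    map_size = len(score_map)
    scale = image_size / map_size                 # image pixels per score-map cell
    window = max(1, round(region_size / scale))   # window side on the score map

    best_sum, best_pos = None, (0, 0)
    for row in range(map_size - window + 1):
        for col in range(map_size - window + 1):
            total = sum(score_map[row + i][col + j]
                        for i in range(window) for j in range(window))
            if best_sum is None or total < best_sum:
                best_sum, best_pos = total, (row, col)

    row, col = best_pos
    # Window centre on the score map, scaled back to image coordinates.
    x_center = (col + window / 2) * scale
    y_center = (row + window / 2) * scale
    return x_center, y_center, region_size


if __name__ == "__main__":
    scores = [[0.9] * 4 for _ in range(4)]
    scores[2][1] = 0.1                            # one clearly fake patch
    print(propose_region(scores, image_size=64, region_size=16))
```

With a 4x4 score map over a 64x64 image, each score cell covers 16 pixels, so a 16-pixel region corresponds to a window of one cell.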
Instead of optimizing only independent local regions, we consider the relationship between the fake discriminative region and the real surrounding influence region, so that the fake is connected to the real and the generator receives constructive revisions. The influence region is defined as the region that is connected to the “most fake region” and has a semantic and spatial relationship with its content (e.g., the wheel is often below the car window). For this purpose, we mask the corresponding real sample with the fake discriminative region to make the masked fake sample, and then design a CNN-based reviser that distinguishes the real from the masked fake, which optimizes the generator towards synthesizing high-quality images. The proposed reviser can also be used with other GANs to improve the quality of generated samples.
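The masking operation described above can be sketched as pasting the fake discriminative region into a copy of the real image. The function name `make_masked_fake`, the nested-list image representation, and the clipping at the image border are assumptions of this sketch rather than details from the paper.

```python
def make_masked_fake(real, fake, x_center, y_center, size):
    """Return a copy of `real` whose square region is replaced by `fake`.

    real, fake : images as equally sized lists of rows
    x_center, y_center, size : discriminative region in image coordinates
    The region is clipped to the image border (an assumption of this sketch).
    """
    height, width = len(real), len(real[0])
    top = int(y_center) - size // 2
    left = int(x_center) - size // 2

    masked = [row[:] for row in real]             # leave the real image untouched
    for y in range(max(0, top), min(height, top + size)):
        for x in range(max(0, left), min(width, left + size)):
            masked[y][x] = fake[y][x]
    return masked


if __name__ == "__main__":
    real = [[1] * 4 for _ in range(4)]
    fake = [[0] * 4 for _ in range(4)]
    for row in make_masked_fake(real, fake, x_center=2, y_center=2, size=2):
        print(row)
```

Because everything outside the region comes from the real sample, a reviser looking at the whole image sees the fake patch in its real context, which matches the role of the influence region described in the text.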
For image-to-image translation tasks, we want not only realistic samples but also diversity across different conditional inputs. The original GANs suffer from instability and mode collapse [1, 2], and some recent works [1, 28, 12] have improved GAN training. To train our DRPAN stably with high-diversity synthesis ability, we use a modified DRAGAN  loss for our reviser , and use the original objective function for training the patch discriminator.
For reviser , to distinguish between the very similar real and masked fake ( represents the mask operation), we add a regularization to the loss of reviser as the penalty, which is expressed as
where is hyper parameter, is random noise on , and indicates gradient.
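As an illustration of the kind of regularizer described above, the following sketch computes a DRAGAN-style gradient penalty for a toy logistic "reviser" whose input gradient is available in closed form. The toy model, the function names, the target gradient norm of 1, and the Gaussian perturbation of the input are assumptions made for this sketch; the paper's exact modification of DRAGAN is not reproduced here.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def toy_reviser_gradient(x, weights, bias):
    """Gradient of sigmoid(w . x + b) with respect to the input x."""
    out = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
    return [out * (1.0 - out) * w for w in weights]


def gradient_penalty(x, weights, bias, lam=10.0, noise_std=0.5, rng=None):
    """DRAGAN-style penalty: lam * (||grad R(x + delta)|| - 1) ** 2.

    x         : flattened input sample
    lam       : penalty weight (hypothetical default)
    noise_std : standard deviation of the random perturbation delta
    """
    rng = rng or random.Random(0)
    x_noisy = [v + rng.gauss(0.0, noise_std) for v in x]
    grad = toy_reviser_gradient(x_noisy, weights, bias)
    norm = math.sqrt(sum(g * g for g in grad))
    return lam * (norm - 1.0) ** 2


if __name__ == "__main__":
    print(gradient_penalty([0.2, -0.1], weights=[1.5, -0.5], bias=0.0))
```

In a real training loop the gradient would come from automatic differentiation of the reviser network rather than from a closed form, and the penalty would be averaged over a batch.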
Previous studies have found it beneficial to mix the GAN objective with a more traditional loss, such as L2 and L1 distance [14, 35]. Considering that L1 distance encourages less blurring than L2 , we provide extra L1 loss for regularization on the whole input image and the local discriminative region to generator, which is defined as
where and are hyper parameters, is the discriminative region, and represents the region on the real image corresponding to the discriminative region on the synthesized image. Then the total loss of generator can be expressed as
Our proposed model totally contains a generator , a patch discriminator for DRPnet, and a reviser . will be optimized by , and . And our full objective function is
where is a hyper parameter to balance and .
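The L1 regularization on the whole image and on the local discriminative region can be sketched as a weighted sum of two mean absolute errors. The weight names `weight_global` and `weight_local`, the crop-by-corner convention, and the use of a mean rather than a sum are hypothetical choices for this sketch.

```python
def mean_abs_error(a, b):
    """Mean absolute difference between two equally sized 2D lists."""
    diffs = [abs(u - v) for row_a, row_b in zip(a, b) for u, v in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def crop(image, top, left, size):
    """Square crop of a 2D list, given its top-left corner."""
    return [row[left:left + size] for row in image[top:top + size]]


def generator_l1_loss(real, fake, top, left, size,
                      weight_global=1.0, weight_local=1.0):
    """Weighted L1 on the whole image plus L1 on the discriminative region."""
    global_term = mean_abs_error(real, fake)
    local_term = mean_abs_error(crop(real, top, left, size),
                                crop(fake, top, left, size))
    return weight_global * global_term + weight_local * local_term


if __name__ == "__main__":
    real = [[0.0] * 4 for _ in range(4)]
    fake = [[0.0] * 4 for _ in range(4)]
    fake[1][1] = 4.0                              # a single bad pixel
    print(generator_l1_loss(real, fake, top=1, left=1, size=2,
                            weight_global=1.0, weight_local=2.0))
```

The local term makes an error inside the proposed region count more than the same error elsewhere, which is the effect the text attributes to the extra regional L1 loss.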
3.3 Network architecture
For our generator, we use an architecture based on  which has convincing power for single image super-resolution. We adopt convolution and fractionally-strided convolution blocks for down-sampling and up-sampling respectively, and residual blocks  for task learning. Each layer uses Batch Normalization  followed by an activation function. For the patch discriminator, we mainly follow PatchGAN [20, 14]. The DRPAN reviser is a discriminator modified from DCGAN  that has a global view of the whole input. At the end of both the discriminator and the reviser, we adopt Sigmoid as the activation function to output a probability.
To evaluate the performance of our proposed method on image-to-image translation, we conduct a variety of experiments on different levels of translation tasks and compare our method with state-of-the-art methods. For different tasks, we also use different evaluation metrics, including human perceptual studies and automatic quantitative measures.
4.1 Evaluation metrics
Image quality evaluation. PSNR, SSIM  and VIF  are among the most popular evaluation metrics in low-level computer vision tasks such as deblurring, dehazing and image restoration. For the de-raining and aerial to maps tasks, we therefore adopt PSNR, SSIM, VIF and RECO  to quantify the quality of the results.
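Of the metrics listed above, PSNR has the simplest closed form, 10 log10(MAX^2 / MSE). The sketch below computes it for single-channel images stored as nested lists; the function name and the default peak value of 255 (8-bit images) are assumptions of this sketch, and SSIM, VIF and RECO are not covered.

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D lists."""
    squared = [(r - t) ** 2
               for row_r, row_t in zip(reference, test)
               for r, t in zip(row_r, row_t)]
    mse = sum(squared) / len(squared)
    if mse == 0:
        return math.inf                           # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    clean = [[10, 20], [30, 40]]
    noisy = [[11, 21], [31, 41]]
    print(round(psnr(clean, noisy), 2))
```

Higher values indicate that the translated image is closer to the ground truth in the mean-squared-error sense.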
Image segmentation evaluation metrics. We use the standard metrics of the Cityscapes benchmark  to evaluate the real to semantic labels task on the Cityscapes dataset, namely per-pixel accuracy, per-class accuracy, and class IOU.
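The three segmentation metrics named above can all be derived from a confusion matrix. The following sketch works on flat lists of integer class labels; the function names, and the choice to skip classes that never occur in the ground truth, are assumptions of this sketch and may differ from the official Cityscapes evaluation scripts.

```python
def confusion_matrix(gt, pred, num_classes):
    """conf[g][p] counts pixels with ground-truth class g predicted as p."""
    conf = [[0] * num_classes for _ in range(num_classes)]
    for g, p in zip(gt, pred):
        conf[g][p] += 1
    return conf


def segmentation_scores(gt, pred, num_classes):
    """Per-pixel accuracy, mean per-class accuracy, and mean class IoU."""
    conf = confusion_matrix(gt, pred, num_classes)
    total = sum(sum(row) for row in conf)
    correct = sum(conf[k][k] for k in range(num_classes))

    class_acc, class_iou = [], []
    for k in range(num_classes):
        gt_count = sum(conf[k])
        pred_count = sum(conf[j][k] for j in range(num_classes))
        if gt_count == 0:
            continue                              # class absent from ground truth
        class_acc.append(conf[k][k] / gt_count)
        class_iou.append(conf[k][k] / (gt_count + pred_count - conf[k][k]))

    return {
        "pixel_acc": correct / total,
        "class_acc": sum(class_acc) / len(class_acc),
        "class_iou": sum(class_iou) / len(class_iou),
    }


if __name__ == "__main__":
    print(segmentation_scores(gt=[0, 0, 1, 1], pred=[0, 1, 1, 1], num_classes=2))
```

Per-pixel accuracy is dominated by large classes, whereas per-class accuracy and class IoU weight every class equally, which is why the three numbers are usually reported together.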
Amazon Mechanical Turk (AMT). AMT [14, 46, 40] is adopted in many tasks as a gold-standard metric for evaluating how real synthesized images appear, and we use it as the evaluation metric for the semantic labels to photo and maps to aerial tasks.
The intuition behind using an off-the-shelf classifier for automatic quantitative measurement is that if the generated images are realistic, a classifier trained on real images will also classify the synthesized images correctly. We use the FCN-8s score  to evaluate the semantic labels to real task on the Cityscapes dataset. The FCN-8s model trained on the Cityscapes segmentation task is taken from .
4.2 Why DRPAN?
To study the influence of DRPAN on revising the synthesis, and the effect of different losses between the proposed region and the real region, we set up an experiment that starts from a pre-trained PatchGAN and continues with several training pipelines: continue training with PatchGAN; continue training with PatchGAN and an L1 loss between the discriminative and real regions; and continue training with PatchGAN and the reviser.
We argue that the PatchD is efficient at discovering the most fake or most real region of an image (Fig. 4) but is limited in improving these regions with fine details, because it is hard for the PatchD to capture the high-dimensional distribution. We therefore propose DRPnet (which exploits the strength of the PatchD) for discriminative region proposal and design a reviser to gradually remove visual artifacts, which reduces the task to a lower-dimensional estimation problem. This can be seen as a “top-down” procedure, in contrast to other gradual “bottom-up” image generation methods. Fig. 5 shows the necessity of our proposed DRPAN for high-quality image-to-image translation: continuing to train the PatchD does not help to reduce artifacts, even with an L1 loss for balance; DRPAN with only the L1 loss smooths the artifacts but is not very sharp in details; and DRPAN with the reviser exceeds the PatchD’s performance with fewer visual artifacts. The combination of the reviser and the L1 loss can reduce the artifacts ignored by the PatchD. We also find that the fake-mask operation can improve the coherence of the whole image in certain samples (e.g., the connection between a door and a wall). DRPAN with fake-mask is therefore used in the following experiments.
4.3 Low level translation
We first apply our model to two low level translation tasks, which only involve translating the appearance of images; for example, in the de-raining task we do not need to change the content and texture of the input sample. So we set in Eqn. 9 for image synthesis using only reviser.
Single image de-raining. We trained and tested our DRPAN model on the single image de-raining task using the same procedure as , and evaluated the results with both qualitative and quantitative metrics. Fig. 6 shows the qualitative results of our DRPAN with different sizes of discriminative region compared to ID-CGAN ; DRPAN outperforms ID-CGAN with more effective de-raining as well as more vivid colors and clearer details. Tab. 1 reports the corresponding quantitative results evaluated by the PSNR, SSIM, VIF, and RECO metrics, and the best results (in bold font) are all achieved by our DRPAN.
|Method Metrics||L2+ CGAN||ID-CGAN||PAN||DRPAN (w/o mask)||DRPAN (128)||DRPAN (64)||DRPAN (32)||DRPAN (16)|
Bw to color. We trained our DRPAN model for the image colorization task on ImageNet, and tested it on the ImageNet val dataset, with an example shown in Fig. 7. Our DRPAN produces compelling colorization results compared with classification with class rebalancing . In addition, we ran an AMT evaluation for colorization (Tab. 2). Our method fooled participants on which is competitive with the full method from .
|Method||% Turkers labeled real|
4.4 Real to abstract translation
We then apply our proposed DRPAN to two real to abstract translation tasks, which require many-to-one abstraction ability.
Real to semantic labels. For the real to semantic labels task, we tested our DRPAN model on two of the most widely used datasets: Cityscapes and facades. Fig. 8 shows the qualitative results of our DRPAN compared to Pix2pix  on the Cityscapes dataset for translating real images to semantic labels; DRPAN synthesizes more realistic results that are closer to the ground truth than Pix2pix. The quantitative results in Tab. 4 confirm this in terms of per-pixel accuracy, per-class accuracy, and class IOU.
Aerial to maps. We also applied our DRPAN to the aerial photo to maps task, and the experiment was implemented using paired images with resolution . The top row of Fig. 9 shows the qualitative results of our DRPAN compared to Pix2pix , indicating that our DRPAN correctly translates the motorway in the aerial photo into the orange line on the map while Pix2pix cannot.
4.5 Abstract to real translation
We also demonstrate our proposed DRPAN on several abstract to real tasks, which translate one to many: semantic labels to photo, maps to aerial, edge to real, and sketch to real.
Semantic labels to real. For the semantic labels to real task, the translation model aims to synthesize real-world images from semantic labels. CGAN based works fail to capture the details of the real world and suffer from deformation and blur. CNN based methods such as CRN can synthesize high-resolution results, but these are smooth rather than realistic. Fig. 10 shows a qualitative comparison, from which it can be seen that our DRPAN synthesizes the most realistic, high-quality results (clearer and less distorted at high resolution) compared to Pix2pix  and CRN .
The evaluation of GANs is still a challenging problem. Many works [32, 38, 44, 14] used off-the-shelf classifiers as automatic measures of synthesized images. Tab. 4 reports the segmentation performance of the FCN-8s model, and our DRPAN exceeds Pix2pix  by on per-pixel accuracy and also achieves the highest performance on per-class accuracy and class IOU.
|Model||Per-pixel acc.||Per-class acc.||Class IOU|
|Model||Per-pixel acc.||Per-class acc.||Class IOU|
|Model||% Turkers labeled real|
|Comparison||% Turkers labeled more realistic|
|DRPAN vs. Pix2pix ||91.2%|
|DRPAN vs. StackGAN-like||84.6%|
|DRPAN vs. CRN ||75.7%|
|Model||% Turkers labeled real|
Maps to aerial. In the opposite direction to the aerial to maps task, we also tested our DRPAN on the maps to aerial task. The qualitative results are shown in the bottom row of Fig. 9, which clearly demonstrates that our DRPAN synthesizes higher quality aerial photos than Pix2pix .
Human perceptual validation. We assess abstract to real performance on the semantic labels to photo and maps to aerial tasks by AMT. For the fake against real study, we followed the perceptual study protocol from , and collected data of each algorithm from participants. Each participant has to look one sample. We also compared how realistic the images synthesized by different algorithms appear. Tab. 6 illustrates that images synthesized by DRPAN are ranked more realistic than those of state-of-the-art methods (DRPAN > CRN > StackGAN-like > Pix2pix ); moreover, compared to Pix2pix , StackGAN-like  and CRN , images synthesized by DRPAN are ranked more realistic by 91.2%, 84.6% and 75.7% of Turkers, respectively. Tab. 6 reports the comparison on maps to aerial task and our DRPAN fooled participants on over of Pix2pix and of CycleGAN  respectively.
Edges to real and sketch to real. For the edge to real and sketch to real tasks, previous works often encounter two problems : one is that artifacts and artificial color distributions are easily generated in regions where the input, such as an edge map, is sparse; the other is that unusual inputs such as sketches are difficult to handle. We tested our DRPAN model on the UT Zappos50k dataset  and the edge to handbag dataset . Fig. 11 shows that our model handles both problems well.
We propose Discriminative Region Proposal Adversarial Networks (DRPAN) for high-resolution and photo-realistic image-to-image translation. Human perceptual studies and automatic quantitative measures validate the performance of our proposed DRPAN against state-of-the-art methods in synthesizing high-quality results. We hope it can be explored for discriminative feature learning and other computer vision tasks in the future.
This work was supported by the National Natural Science Foundation of China under Grants 61771440 and 41776113, and Qingdao Municipal Science and Technology Program under Grant 17-1-1-5-jch.
-  Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN. In: ICML (2017)
-  Arora, S., Ge, R., Liang, Y., Ma, T., Zhang, Y.: Generalization and equilibrium in generative adversarial nets (GANs). arXiv preprint arXiv:1703.00573 (2017)
-  Baroncini, V., Capodiferro, L., Di Claudio, E.D., Jacovitti, G.: The polar edge coherence: a quasi blind metric for video quality assessment. In: ESPC (2009)
-  Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected CRFs. In: ICLR (2015)
-  Chen, Q., Koltun, V.: Photographic image synthesis with cascaded refinement networks. In: ICCV (2017)
-  Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: CVPR (2016)
-  Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR (2009)
-  Deshpande, A., Rock, J., Forsyth, D.: Learning large-scale automatic image colorization. In: ICCV (2015)
-  Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convolutional networks. IEEE TPAMI 38(2), 295–307 (2016)
-  Gatys, L.A., Ecker, A.S., Bethge, M.: A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576 (2015)
-  Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: NIPS (2014)
-  Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.: Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028 (2017)
-  Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: ICML (2015)
-  Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: CVPR (2017)
-  Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: ECCV (2016)
-  Kim, J., Lee, J.K., Lee, K.M.: Accurate image super-resolution using very deep convolutional networks. In: CVPR (2016)
-  Kim, T., Cha, M., Kim, H., Lee, J., Kim, J.: Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192 (2017)
-  Kodali, N., Abernethy, J., Hays, J., Kira, Z.: How to train your DRAGAN. arXiv preprint arXiv:1705.07215 (2017)
-  Ledig, C., Theis, L., Huszar, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z.: Photo-realistic single image super-resolution using a generative adversarial network. In: CVPR (2017)
-  Li, C., Wand, M.: Precomputed real-time texture synthesis with Markovian generative adversarial networks. In: ECCV (2016)
-  Li, Y., Liu, S., Yang, J., Yang, M.H.: Generative face completion. In: CVPR (2017)
-  Lin, G., Milan, A., Shen, C., Reid, I.: RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In: CVPR (2017)
-  Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)
-  Nguyen, A., Clune, J., Bengio, Y., Dosovitskiy, A., Yosinski, J.: Plug & play generative networks: Conditional iterative generation of images in latent space. In: CVPR (2017)
-  Odena, A., Olah, C., Shlens, J.: Conditional image synthesis with auxiliary classifier GANs. In: ICLR (2016)
-  Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: Feature learning by inpainting. In: CVPR (2016)
-  Qi, G.J.: Loss-sensitive generative adversarial networks on Lipschitz densities. arXiv preprint arXiv:1701.06264 (2017)
-  Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. In: ICLR (2016)
-  Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative adversarial text to image synthesis. In: ICML (2016)
-  Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: MICCAI (2015)
-  Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: NIPS (2016)
-  Sheikh, H.R., Bovik, A.C.: Image information and visual quality. IEEE TIP 15(2), 430–444 (2006)
-  Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A.P., Bishop, R., Rueckert, D., Wang, Z.: Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In: CVPR (2016)
-  Shrivastava, A., Pfister, T., Tuzel, O., Susskind, J., Wang, W., Webb, R.: Learning from simulated and unsupervised images through adversarial training. In: CVPR (2017)
-  Tyleček, R., Šára, R.: Spatial pattern templates for recognition of objects with regular structure. In: GCPR (2013)
-  Wang, C., Xu, C., Wang, C., Tao, D.: Perceptual adversarial networks for image-to-image transformation. In: IJCAI (2017)
-  Wang, X., Gupta, A.: Generative image modeling using style and structure adversarial networks. In: ECCV (2016)
-  Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE TIP 13(4), 600–612 (2004)
-  Yi, Z., Zhang, H., Tan, P., Gong, M.: DualGAN: Unsupervised dual learning for image-to-image translation. In: ICCV (2017)
-  Yu, A., Grauman, K.: Fine-grained visual comparisons with local learning. In: CVPR (2014)
-  Zhang, H., Xu, T., Li, H., Zhang, S., Huang, X., Wang, X., Metaxas, D.N.: StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In: ICCV (2017)
-  Zhang, H., Sindagi, V., Patel, V.M.: Image de-raining using a conditional generative adversarial network. In: CVPR (2017)
-  Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: ECCV (2016)
-  Zhu, J.Y., Krähenbühl, P., Shechtman, E., Efros, A.A.: Generative visual manipulation on the natural image manifold. In: ECCV (2016)
-  Zhu, J., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV (2017)