Bi-level Feature Alignment for Versatile Image Translation and Manipulation

07/07/2021 · Fangneng Zhan, et al. · Nanyang Technological University

Generative adversarial networks (GANs) have achieved great success in image translation and manipulation. However, high-fidelity image generation with faithful style control remains a grand challenge in computer vision. This paper presents a versatile image translation and manipulation framework that achieves accurate semantic and style guidance in image generation by explicitly building a correspondence. To handle the quadratic complexity incurred by building the dense correspondences, we introduce a bi-level feature alignment strategy that adopts a top-$k$ operation to rank block-wise features followed by dense attention between block features, which reduces the memory cost substantially. As the top-$k$ operation involves index swapping which precludes gradient propagation, we propose to approximate the non-differentiable top-$k$ operation with a regularized earth mover's problem so that its gradient can be effectively back-propagated. In addition, we design a novel semantic position encoding mechanism that builds up a coordinate system for each individual semantic region to preserve texture structures while building correspondences. Further, we design a novel confidence feature injection module which mitigates the mismatch problem by fusing features adaptively according to the reliability of the built correspondences. Extensive experiments show that our method achieves superior performance qualitatively and quantitatively as compared with the state-of-the-art. The code is available at \href{https://github.com/fnzhan/RABIT}{https://github.com/fnzhan/RABIT}.


1 Introduction

Image translation and manipulation aim to generate and edit photo-realistic images conditioned on certain inputs such as semantic segmentation [54, 65], key points [60, 11] and layouts [35]. They have been studied intensively in recent years thanks to their wide spectrum of applications in various tasks [56, 51, 63]. However, achieving high-fidelity image translation and manipulation with faithful style control remains a grand challenge due to the high complexity of natural image styles. A typical approach to controlling image styles is to encode image features into a latent space with certain regularization (e.g., a Gaussian distribution) on the latent feature distribution. For example, Park et al. [54] utilize a VAE [10] to regularize the distribution of encoded features for faithful style control. However, VAEs struggle to encode the complex distribution of natural image styles and often suffer from posterior collapse [45], which leads to degraded style control. Another strategy is to encode reference images into style codes to provide style guidance in image generation. Choi et al. [5] employ a style encoder to extract the style code from a given reference image and achieve diverse image synthesis over multiple domains. Zhu et al. [95] further propose to extract style codes for each semantic region individually and achieve flexible style control within each semantic region. However, style codes often capture the overall image style or local region style without explicit style guidance in the spatial domain.

Recently, building dense correspondences between images has attracted increasing interest in image translation and manipulation thanks to its impressive image generation performance. Prior studies have explored building correspondences between images of the same domain for exemplar-based image colorization [17, 85]. Zhang et al. [87] further explore building cross-domain correspondences with Cosine similarity to achieve exemplar-based image translation. However, constructing semantic correspondences based on Cosine similarity often leads to many-to-one matching (i.e., multiple conditional input features match to the same exemplar feature). Zhan et al. [78] thus propose to build the correspondence with optimal transport, whose mass-preserving property mitigates the many-to-one matching. On the other hand, building dense correspondences has quadratic complexity, which incurs high memory costs and struggles to scale up to high-resolution images. To achieve high-resolution image translation, Zhou et al. [93] propose a GRU-assisted PatchMatch [1] method to build high-resolution correspondences efficiently. Zheng et al. [90] tackle high-resolution correspondences via sparse attention with applications to semantic image manipulation. However, all the above methods build correspondences based on semantic coherence without considering structure coherence. As textures within a semantic region share identical semantic information, texture structure information tends to be lost while building purely semantic correspondences. Warping exemplars with such purely semantic correspondences further destroys texture patterns in the warped exemplars, which then provide inaccurate guidance for image generation.

Fig. 1: Bi-level feature alignment via a ranking and attention scheme: With a query block from the Conditional Input, we first retrieve the top-k most similar blocks from the Exemplar Image through a differentiable ranking operation, and then compute dense attention between the features in the query block and the features in the retrieved top-k blocks. Such bi-level alignment reduces the computational cost greatly, and it also allows building high-resolution correspondences, which leads to more realistic translation with finer details.

This paper presents RABIT, a Ranking and Attention scheme with Bi-level feature alignment for versatile Image Translation and manipulation. RABIT consists of an alignment network and a generation network that are optimized jointly. The alignment network establishes feature correspondences between a conditional input (semantic guidance) and an exemplar (style guidance). With the built correspondences, the exemplar is warped to be aligned with the conditional input to provide accurate style guidance for the generation network. However, building dense correspondences incurs quadratic computational complexity, which struggles with high-resolution correspondences. We design a bi-level alignment strategy with a Ranking and Attention Scheme (RAS) that builds feature correspondences efficiently at two levels: 1) a top-k ranking operation for dynamically generating block-wise ranking matrices; 2) a dense attention module that achieves dense correspondences between features within blocks as illustrated in Fig. 1. RAS enables building high-resolution correspondences and reduces the memory cost from O(N^2) to O(N^2/b^2 + Nkb) (N is the number of features for alignment, b is the block size, and k is the number of retrieved blocks). However, the top-k operation involves index swapping whose gradient cannot be propagated in networks. To address this issue, we approximate the top-k ranking operation with a regularized earth mover's problem by imposing entropy regularization on the earth mover's distance. The regularized earth mover's problem can then be solved with Sinkhorn iterations [8] in a differentiable manner, which enables effective gradient back-propagation.

As in [87, 93], building correspondences based on semantic information only often leads to the loss of texture structures and patterns in warped exemplars. Thus, spatial information should also be incorporated to preserve the texture structures and patterns and yield more accurate feature correspondences. A vanilla method to encode the position information is to concatenate the semantic features with the corresponding feature coordinates via coordconv [40]. However, vanilla position encoding builds a single coordinate system for the whole image, which ignores the position information within each semantic region. Instead, we design a semantic position encoding (SPE) mechanism that builds a dedicated coordinate system for each semantic region, which outperforms the vanilla position encoding significantly.

In addition, conditional inputs and exemplars are seldom perfectly matched, e.g., conditional inputs could contain several semantic classes that do not exist in the exemplar images. Under such circumstances, the built correspondences often contain errors, which lead to inaccurate exemplar warping and further deteriorated image generation. We tackle this problem by designing a CONfidence Feature Injection (CONFI) module that fuses the features of conditional inputs and warped exemplars according to the reliability of the built correspondences. Although the warped exemplar may not be reliable, the conditional input always provides accurate semantic guidance for image generation. The CONFI module thus assigns higher weights to the conditional input when the built correspondence (or warped exemplar) is unreliable. Experiments show that CONFI helps to generate faithful yet high-fidelity images consistently by assigning adaptive weights (to the conditional input) based on the reliability of the built correspondences.

The contributions of this work can be summarized in four aspects. First, we propose a versatile image translation and manipulation framework that introduces a bi-level feature alignment strategy, which greatly reduces the memory cost of building correspondences between conditional inputs and exemplars. Second, we approximate the non-differentiable top-k ranking with a regularized earth mover's problem, which enables effective gradient propagation for end-to-end network training. Third, we introduce a semantic position encoding mechanism that encodes region-level position information to preserve texture structures and patterns. Fourth, we design a confidence feature injection module that provides reliable feature guidance in image translation and manipulation.

2 Related Work

2.1 Image-to-Image Translation

Image translation has achieved remarkable progress in learning the mapping among images of different domains. It could be applied in different tasks such as style transfer [22, 13, 36], image super-resolution [32, 38, 31, 86], domain adaptation [57, 51, 19, 62, 77, 82], image synthesis [84, 7, 73, 76, 74, 81, 79, 80, 83, 75], image inpainting [71, 72, 39, 66], etc. To achieve high-fidelity and flexible translation, existing work uses different conditional inputs such as semantic segmentation [25, 65, 54], scene layouts [59, 89, 35], key points [48, 50, 11], edge maps [25, 12], etc. However, effective style control remains a challenging task in image translation.

Style control has attracted increasing attention in image translation and generation. Earlier works such as [30] regularize the latent feature distribution to control the generation outcome. However, they struggle to capture the complex textures of natural images. Style encoding has been studied to address this issue. For example, [23] and [47] transfer style codes from exemplars to source images via adaptive instance normalization (AdaIN) [22]. [5] employs a style encoder for style consistency between exemplars and translated images. [95] designs semantic region-adaptive normalization (SEAN) to control the style of each semantic region individually. Wang et al. [64] demonstrate the feasibility of exemplar-guided style control by directly concatenating the exemplar image and the condition as input for image translation. However, encoding style exemplars tends to capture the overall image style and ignores the texture details in local regions. To achieve accurate style guidance for each local region, Zhang et al. [87] build dense semantic correspondences between conditional inputs and exemplars with Cosine similarity to capture accurate exemplar details. To mitigate the many-to-one matching issue in Zhang et al. [87], Zhan et al. [78] further propose to utilize the mass-preserving property of optimal transport to build the correspondence. On the other hand, the above methods usually work with low-resolution correspondences due to the quadratic complexity of correspondence computation. To build correspondences at high resolution, Zhou et al. [93] introduce GRU-assisted PatchMatch to efficiently establish high-resolution correspondences. Zheng et al. [90] tackle high-resolution correspondences through a sparse attention module with applications to semantic image manipulation. However, all these methods only utilize semantic information for building correspondences, which often destroys texture structures and patterns in the warped exemplar. In this work, we propose a bi-level alignment strategy that allows building correspondences efficiently and design a semantic position encoding to preserve the texture structures and patterns.

2.2 Semantic Image Editing

The rise of generative adversarial networks (GANs) has brought revolutionary advances to image editing [94, 20, 52, 4, 55, 68, 67]. As one of the most intuitive representations in image editing, semantic information has been extensively investigated in conditional image synthesis. For example, Isola et al. [25] achieve label-to-pixel generation by training an encoder-decoder network with a conditional adversarial objective. Wang et al. [65] further achieve high-resolution image manipulation by editing the pixel-wise semantic labels. Park et al. [54] introduce spatially-adaptive normalization (SPADE) to inject guided features in image generation. MaskGAN [34] exploits a dual-editing consistency as auxiliary supervision for robust face image manipulation. Gu et al. [14] learn facial embeddings for different face components to enable local facial editing. Chen et al. [3] propose a mask re-targeting strategy for identity-preserved face animation. Xia et al. [69] map images into the latent space of a pre-trained network to facilitate editing. Instead of directly learning a label-to-pixel mapping, Hong et al. [20] propose a semantic manipulation framework HIM that generates images guided by a predicted semantic layout. Building on this work, Ntavelis et al. [52] propose SESAME, which requires only local semantic maps to achieve image manipulation. However, the aforementioned methods either only learn a global feature without local focus (e.g., MaskGAN [34]) or ignore the features in the editing regions of the original image (e.g., HIM [20], SESAME [52]). To better utilize the fine features in the original image, Zheng et al. [90] adapt the exemplar-based image synthesis framework CoCosNet [87] for semantic image manipulation by building a high-resolution correspondence between the original image and the edited semantic map. However, it may inherit the issue of texture pattern loss from [87], which can be effectively ameliorated by the proposed semantic position encoding mechanism.

2.3 Feature Correspondence

Early studies determine feature correspondences by focusing on sparse correspondences [44] or dense correspondences between nearby views of the same objects only [21, 53]. Differently, semantic correspondence establishes dense correlations between different instances of the same semantic object. For example, [2, 24, 29] focus on matching hand-crafted features. Leveraging the power of convolutional neural networks (CNNs) in learning high-level semantic features, Long et al. [43] first employ CNNs to establish semantic correspondences between images. Later efforts further improve correspondence quality by including additional annotations [92, 6, 15, 16], adopting coarse-to-fine strategies [37], extending to cross-domain images [87], etc. However, most existing studies only work with low-resolution correspondences as constrained by the heavy computation cost. We design a bi-level alignment strategy that greatly improves computation efficiency and allows computing dense correspondences at higher resolutions.

Fig. 2: The framework of the proposed RABIT: the Conditional Input and Exemplar are fed to feature extractors $F_X$ and $F_Z$ to extract feature vectors $X$ and $Z$, where neighboring local features are grouped into feature blocks. In the first level, each block from the conditional input serves as a query to retrieve the top-k similar blocks from the exemplar through a differentiable ranking operation. In the second level, dense attention is then built between the features in the query block and the features in the retrieved blocks. The built Ranking Matrices and Attention Matrices are combined to warp the exemplar to be aligned with the conditional input (Warped Exemplar), which serves as style guidance to generate the final result through a generation network.

3 Proposed Method

The proposed RABIT consists of an alignment network and a generation network that are inter-connected as shown in Fig. 2. The alignment network learns the correspondence between a conditional input and an exemplar for warping the exemplar to be aligned with the conditional input. The generation network produces the final generation under the guidance of the warped exemplar and the conditional input. RABIT is typically applied to conditional image translation with an extra exemplar as style guidance. It is also applicable to image manipulation by treating the exemplars as the original images to be edited and the conditional inputs as the edited semantic maps.

3.1 Alignment Network

The alignment network aims to build correspondences between conditional inputs and exemplars, and accordingly provides accurate style guidance by warping the exemplars to be aligned with the conditional inputs. As shown in Fig. 2, the conditional input and exemplar are fed to feature extractors $F_X$ and $F_Z$ to extract two sets of feature vectors $X$ and $Z$, where each set contains N feature vectors of dimension C. Most existing methods [87, 17, 85] align $X$ and $Z$ by building a dense correspondence matrix $A$ of size N × N, where each entry denotes the Cosine similarity between the corresponding feature vectors in $X$ and $Z$. However, such correspondence computation has quadratic complexity, which incurs large memory and computation costs. Most existing studies thus work with low-resolution exemplar images (e.g., 64 × 64 in CoCosNet [87]), which often struggle to generate realistic images with fine texture details.
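For concreteness, the following is a minimal sketch (not the authors' released code) of this dense-correspondence baseline: features are channel-normalized, an N × N cosine-similarity matrix is built, and the exemplar is softly warped with it. The tensor names and the softmax temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dense_warp(x_feat, z_feat, exemplar, tau=0.01):
    """Baseline dense alignment: cosine-similarity correspondence matrix
    followed by soft warping of the exemplar (quadratic in N)."""
    # x_feat, z_feat: (N, C) features of the conditional input and exemplar
    # exemplar: (N, C_e) exemplar pixels/features flattened to the same grid
    x_hat = F.normalize(x_feat, dim=-1)        # channel-wise normalization
    z_hat = F.normalize(z_feat, dim=-1)
    corr = x_hat @ z_hat.t()                   # (N, N) cosine similarities
    attn = F.softmax(corr / tau, dim=-1)       # row-wise soft correspondence
    return attn @ exemplar                     # warped exemplar aligned to the input
```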

In this work, we propose a bi-level alignment strategy via a novel ranking and attention scheme (RAS) that greatly reduces computational costs and allows building correspondences between high-resolution images, as shown in Fig. 4. Instead of building correspondences between features directly, the bi-level alignment strategy builds correspondences at two levels: the first level introduces top-k ranking to generate block-wise ranking matrices dynamically, and the second level achieves dense attention between the features within blocks. As Fig. 2 shows, b neighboring local features are grouped into a block, so the features of the conditional input and exemplar are partitioned into N/b blocks each, denoted by X′ and Z′. In the first level of top-k ranking, each block feature of the conditional input serves as a query to retrieve the top-k block features from the exemplar according to the Cosine similarity between blocks. In the second level of local attention, the features in each query block further attend to the features in the top-k retrieved blocks to build local attention matrices within block features. The correspondence between the exemplar and conditional input can thus be built much more efficiently by combining such inter-block ranking and intra-block attention.
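The sketch below illustrates the bi-level ranking-and-attention idea under simplifying assumptions: a hard (non-differentiable) top-k is used for readability (the paper relaxes it with the regularized earth mover's formulation of Sec. 3.2), block descriptors are taken as mean-pooled block features, and all names and shapes are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def ras_warp(x_feat, z_feat, exemplar, b=4, k=3, tau=0.01):
    """Bi-level alignment: block-wise top-k ranking, then dense attention
    restricted to the features inside the k retrieved blocks."""
    # x_feat, z_feat: (N, C); exemplar: (N, C_e); N must be divisible by b
    N, C = x_feat.shape
    nb = N // b
    xb = F.normalize(x_feat, dim=-1).view(nb, b, C)      # query blocks
    zb = F.normalize(z_feat, dim=-1).view(nb, b, C)      # exemplar blocks
    eb = exemplar.view(nb, b, -1)

    # Level 1: rank exemplar blocks by cosine similarity of block descriptors.
    x_desc = F.normalize(xb.mean(dim=1), dim=-1)         # (nb, C)
    z_desc = F.normalize(zb.mean(dim=1), dim=-1)         # (nb, C)
    block_sim = x_desc @ z_desc.t()                      # (nb, nb)
    topk_idx = block_sim.topk(k, dim=-1).indices         # hard top-k (no gradient)

    # Level 2: dense attention between query-block features and the
    # features gathered from the k retrieved exemplar blocks.
    z_sel = zb[topk_idx].reshape(nb, k * b, C)           # (nb, k*b, C)
    e_sel = eb[topk_idx].reshape(nb, k * b, -1)
    attn = F.softmax(torch.bmm(xb, z_sel.transpose(1, 2)) / tau, dim=-1)
    warped = torch.bmm(attn, e_sel)                      # (nb, b, C_e)
    return warped.reshape(N, -1)
```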

Semantic Position Encoding. Existing works [87, 93] mainly rely on semantic features to establish correspondences. However, as all textures within a semantic region share the same semantic feature, a purely semantic correspondence fails to preserve the texture structures or patterns within each semantic region. For example, the building regions of the conditional inputs in Fig. 4 establish correspondences with the building regions in the exemplars without considering the building textures, which results in warped exemplars with messy textures as shown in the Baseline (64). Thus, the position information of features should also be exploited to preserve texture structures and patterns. A vanilla method to encode position information is to employ a simple coordconv [40] to build a global coordinate system for the full image. However, this vanilla position encoding mechanism builds a single coordinate system for the whole image, ignoring region-wise semantic differences. To preserve the fine texture patterns within each semantic region, we design a semantic position encoding (SPE) mechanism that builds a dedicated coordinate system for each semantic region as shown in Fig. 3. Specifically, SPE treats the center of each semantic region as the origin of its coordinate system, and the coordinates within each semantic region are normalized to [-1, 1]. The proposed SPE outperforms vanilla position encoding significantly, as shown in Fig. 4 and evaluated in the experiments.
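A possible sketch of the semantic position encoding, assuming a per-pixel integer label map: each region gets its own coordinate system, centered at the region centroid and normalized to [-1, 1]. Normalizing by the per-axis region extent is an assumption made here for illustration.

```python
import torch

def semantic_position_encoding(seg):
    """Per-region coordinates in [-1, 1], centered at each region's centroid.
    seg: (H, W) integer semantic label map. Returns (2, H, W) coordinates."""
    H, W = seg.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    spe = torch.zeros(2, H, W)
    for label in seg.unique():
        mask = seg == label
        cy, cx = ys[mask].mean(), xs[mask].mean()          # region centroid = origin
        # normalize offsets by the region's extent so each region spans [-1, 1]
        sy = (ys[mask] - cy).abs().max().clamp(min=1.0)
        sx = (xs[mask] - cx).abs().max().clamp(min=1.0)
        spe[0][mask] = (ys[mask] - cy) / sy
        spe[1][mask] = (xs[mask] - cx) / sx
    return spe  # concatenated with the semantic features before matching
```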

Fig. 3: Comparison of vanilla position encoding and the proposed semantic position encoding (SPE). Red dots denote the coordinate origins. The proposed SPE builds a dedicated coordinate system for each semantic region.
Fig. 4: Warped exemplars with different methods: '64' and '128' mean building correspondences at resolutions 64 × 64 and 128 × 128, respectively. CoCosNet [87] tends to lose texture details and structures, while CoCosNet v2 [93] tends to generate messy warping. The Baseline denotes building correspondences with Cosine similarity, which tends to lose texture details and structures. The proposed ranking and attention scheme (RAS) allows efficient image warping at high resolutions, and the proposed semantic position encoding (SPE) better preserves texture structures. Their combination, denoted by SPE+RAS, achieves the best warping performance with high resolution and preserved texture structures.

3.2 Differentiable Top-k Ranking

The core of the ranking and attention scheme lies in a top-k operation that ranks the correlative blocks. However, the original top-k operation involves index swapping whose gradient cannot be computed, so it cannot be integrated into end-to-end network training. We tackle this issue by formulating top-k ranking as a regularized earth mover's problem, which allows gradient computation via implicit differentiation [46, 70].

3.2.1 Top-k Ranking Formulation

We first show that a specific form of the earth mover's problem is essentially equivalent to a top-k element ranking problem. The earth mover's problem [27] aims to find a transport plan that minimizes the total cost of transforming one distribution into another. Consider two discrete distributions $\mu$ and $\nu$ defined on supports $\mathcal{X} = \{x_1, \cdots, x_n\}$ and $\mathcal{Y} = \{y_1, \cdots, y_m\}$, with probability (or amount of earth) $u = [u_1, \cdots, u_n]$ and $v = [v_1, \cdots, v_m]$. We define $C \in \mathbb{R}^{n \times m}$ as the cost matrix, where $C_{ij}$ denotes the cost of transportation between $x_i$ and $y_j$, and $T \in \mathbb{R}^{n \times m}$ as a transport plan, where $T_{ij}$ denotes the amount of earth transported between $x_i$ and $y_j$. An earth mover's (EM) problem can be formulated by:

$$\mathcal{D}(\mu, \nu) = \min_{T \geq 0} \ \langle C, T \rangle, \quad \text{s.t.} \quad T \mathbf{1}_m = u, \ \ T^{\top} \mathbf{1}_n = v, \tag{1}$$

where $\mathbf{1}$ denotes a vector of ones and $\langle \cdot, \cdot \rangle$ denotes the (Frobenius) inner product.

Fig. 5: Illustration of the earth mover's problem in top-k retrieval. The earth mover's problem is conducted between distributions $\mu$ and $\nu$ defined on supports $\mathcal{X}$ and $\mathcal{Y}$. The transport plan $T$ indicates the retrieved top-k elements.

We then derive the earth mover's form of the top-k operator. With a query block from the conditional input and the blocks from the exemplar, their correlation scores $\mathcal{X} = \{x_1, \cdots, x_n\}$ can be obtained based on their Cosine similarity. The top-k operation aims to retrieve the $k$ most similar elements from $\mathcal{X}$. We define another set $\mathcal{Y} = \{y_1, y_2\}$ with $y_1 = 1$ and $y_2 = 0$, and consider two discrete distributions $\mu$ and $\nu$ defined on supports $\mathcal{X}$ and $\mathcal{Y}$ with $u_i = 1/n$, $v_1 = k/n$ and $v_2 = (n-k)/n$. The cost is defined to be the squared Euclidean distance, i.e., $C_{i1} = (x_i - 1)^2$ and $C_{i2} = x_i^2$, $i = 1, \cdots, n$. The earth mover's distance between $\mu$ and $\nu$ can thus be formulated as:

$$\mathcal{D}(\mu, \nu) = \min_{T \geq 0} \ \sum_{i=1}^{n} \big[ T_{i1} (x_i - 1)^2 + T_{i2}\, x_i^2 \big], \quad \text{s.t.} \quad T \mathbf{1}_2 = u, \ \ T^{\top} \mathbf{1}_n = v.$$

Since $T_{i1} + T_{i2} = 1/n$ and $\sum_i T_{i1} = k/n$, expanding the squares shows that minimizing $\mathcal{D}(\mu, \nu)$ suffices to maximize $\sum_i T_{i1} x_i$. Hence, minimizing $\mathcal{D}(\mu, \nu)$ essentially aims to select the $k$ largest elements from $\mathcal{X}$, as implied in the transport plan $T$:

$$T_{i1} = \begin{cases} 1/n, & x_i \text{ is among the top-}k \text{ elements of } \mathcal{X}, \\ 0, & \text{otherwise}, \end{cases}$$

where $T_{i1} = 1/n$ indicates the retrieved top-k elements. Fig. 5 illustrates the earth mover's problem and the transport plan, where the earth from the $k$ points closest to $1$ (i.e., with the largest scores) is transported to $1$, and meanwhile the earth from the remaining $n-k$ points is transported to $0$. Therefore, the transport plan exactly indicates the top-k elements.

3.2.2 Differentiable Optimization

The top-k operation has been formulated as an earth mover's problem, but the standard earth mover's problem cannot be solved in a differentiable way. We therefore introduce a regularized earth mover's distance, which serves as a smoothed approximation of the standard top-k operator and enables effective gradient propagation. The regularized version of the earth mover's problem in Eq. (1) is defined as:

$$\mathcal{D}_{\varepsilon}(\mu, \nu) = \min_{T \geq 0} \ \langle C, T \rangle + \varepsilon H(T), \quad \text{s.t.} \quad T \mathbf{1}_m = u, \ \ T^{\top} \mathbf{1}_n = v, \tag{2}$$

where $H(T) = \sum_{ij} T_{ij} (\log T_{ij} - 1)$ is the entropy regularization term and $\varepsilon$ is the regularization coefficient. The optimal transport plan of the regularized earth mover's problem thus becomes a smoothed version of the standard top-k operator.

The regularized earth mover's distance can be computed efficiently via the Sinkhorn algorithm [8]. Specifically, an exponential kernel is applied to the cost matrix, which yields $K = e^{-C / \varepsilon}$. Then $K$ is converted iteratively towards a doubly stochastic matrix through a Sinkhorn operation $\mathcal{S}$, as denoted by:

$$\mathcal{S}^{(t)}(K) = \mathcal{N}_{c}\big(\mathcal{N}_{r}\big(\mathcal{S}^{(t-1)}(K)\big)\big), \qquad \mathcal{S}^{(0)}(K) = K,$$

where $t$ denotes the iteration number, and $\mathcal{N}_{r}$ and $\mathcal{N}_{c}$ are row and column normalization, which can be denoted by:

$$[\mathcal{N}_{r}(S)]_{ij} = \frac{S_{ij}}{\sum_{j'} S_{ij'}}, \qquad [\mathcal{N}_{c}(S)]_{ij} = \frac{S_{ij}}{\sum_{i'} S_{i'j}},$$

where $S_{ij}$ represents an element in $S$. The partial derivatives of one iteration (taking the row normalization $\mathcal{N}_{r}$ as an example) can then be derived as:

$$\frac{\partial [\mathcal{N}_{r}(S)]_{ij}}{\partial S_{i'j'}} = \mathbb{1}[i = i'] \left( \frac{\mathbb{1}[j = j']}{\sum_{j''} S_{ij''}} - \frac{S_{ij}}{\big(\sum_{j''} S_{ij''}\big)^{2}} \right),$$

where $i$ and $j$ represent the indices of the rows and columns in $S$, and $\mathbb{1}[\cdot]$ represents an indicator function. Thus, the Sinkhorn operation is differentiable and its gradient can be calculated by unrolling the sequence of row and column normalization operations. When the iterations converge, the transport plan indicating the top-k elements is obtained.
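The following is a compact sketch of the relaxed top-k described above, written in the standard Sinkhorn scaling form with marginals u and v rather than literal row/column normalization; scores are assumed to lie roughly in [0, 1], and the regularization coefficient and iteration count are illustrative choices.

```python
import torch

def soft_topk(scores, k, eps=0.05, n_iter=200):
    """Differentiable top-k via the entropy-regularized earth mover's problem.
    scores: (n,) similarity scores. Returns a soft indicator of the top-k."""
    n = scores.numel()
    y = torch.tensor([1.0, 0.0])                       # target supports {1, 0}
    C = (scores.view(n, 1) - y.view(1, 2)) ** 2        # squared Euclidean cost
    u = torch.full((n,), 1.0 / n)                      # source masses
    v = torch.tensor([k / n, (n - k) / n])             # target masses
    K = torch.exp(-C / eps)                            # exponential kernel
    a = torch.ones(n)
    for _ in range(n_iter):                            # Sinkhorn scaling iterations
        b = v / (K.t() @ a)
        a = u / (K @ b)
    T = a.view(n, 1) * K * b.view(1, 2)                # smoothed transport plan
    return n * T[:, 0]                                 # ≈1 for top-k scores, ≈0 otherwise

# e.g. soft_topk(torch.tensor([0.9, 0.1, 0.8, 0.3]), k=2) ≈ [1, 0, 1, 0]
```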

3.2.3 Complexity Analysis

Vanilla dense correspondence has a self-attention memory complexity of O(N^2), where N is the input sequence length (the number of features). For our bi-level alignment strategy, the memory complexities of building the block ranking matrices and the local attention matrices are O(N^2 / b^2) and O(Nkb) respectively, where b, N/b and k denote the block size (number of features per block), the block number and the number of top-k selections. Thus, the overall memory complexity of the proposed bi-level alignment strategy is O(N^2 / b^2 + Nkb).
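As a rough numeric illustration of the savings implied by these expressions (counting correspondence entries only; actual GPU memory depends on the implementation), assuming b denotes the number of features per block:

```python
# Correspondence entries for dense attention versus the bi-level scheme,
# for a 128x128 feature map with b = 4 and k = 3.
N, b, k = 128 * 128, 4, 3
dense = N * N                           # O(N^2) dense correspondence
bilevel = (N // b) ** 2 + N * k * b     # block ranking + local attention
print(dense, bilevel, dense / bilevel)  # ~2.7e8 vs ~1.7e7, roughly 16x fewer entries
```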

Fig. 6: Illustration of confidence feature injection: the conditional input and warped exemplar are initially fused with a single-channel confidence map (CMAP). A multi-channel confidence map (Multi-CMAP) is then obtained from the initial fusion, which further fuses the conditional input and warped exemplar across channels.

 

Methods            | ADE20K             | CelebA-HQ (Semantic) | DeepFashion        | CelebA-HQ (Edge)
                   | FID   SWD   LPIPS  | FID   SWD   LPIPS    | FID   SWD   LPIPS  | FID   SWD   LPIPS
Pix2pixHD [65]     | 81.80 35.70 N/A    | 43.69 34.82 N/A      | 25.20 16.40 N/A    | 42.70 33.30 N/A
StarGAN v2 [5]     | 98.72 65.47 0.551  | 53.20 41.87 0.324    | 43.29 30.87 0.296  | 48.63 41.96 0.214
SPADE [54]         | 33.90 19.70 0.344  | 39.17 29.78 0.254    | 36.20 27.80 0.231  | 31.50 26.90 0.207
SelectionGAN [61]  | 35.10 21.82 0.382  | 42.41 30.32 0.277    | 38.31 28.21 0.223  | 34.67 27.34 0.191
SMIS [96]          | 42.17 22.67 0.476  | 28.21 24.65 0.301    | 22.23 23.73 0.240  | 23.71 22.23 0.201
SEAN [95]          | 24.84 10.42 0.499  | 17.66 14.13 0.285    | 16.28 17.52 0.251  | 16.84 14.94 0.203
CoCosNet [87]      | 26.40 10.50 0.580  | 21.83 12.13 0.292    | 14.40 17.20 0.272  | 14.30 15.30 0.208
CoCosNet v2 [93]   | 25.20 9.900 0.557  | 20.64 11.21 0.303    | 13.04 16.65 0.270  | 13.21 14.01 0.216
RABIT              | 24.35 9.893 0.571  | 20.44 11.18 0.307    | 12.58 16.03 0.284  | 11.67 14.22 0.219
TABLE I: Comparing RABIT with state-of-the-art image translation methods over four translation tasks with FID, SWD and LPIPS as the evaluation metrics.

 

Methods            | Semantic Consistency | Style Consistency
SPADE [54]         | 0.861  0.772         | 0.934  0.884
StarGAN v2 [5]     | 0.741  0.718         | 0.919  0.907
SelectionGAN [61]  | 0.843  0.785         | 0.951  0.912
SMIS [96]          | 0.862  0.787         | 0.951  0.933
SEAN [95]          | 0.868  0.791         | 0.962  0.942
CoCosNet [87]      | 0.878  0.790         | 0.986  0.965
CoCosNet v2 [93]   | 0.889  0.800         | 0.994  0.972
RABIT              | 0.891  0.812         | 0.993  0.977
TABLE II: Comparing RABIT with state-of-the-art image translation methods in semantic consistency and style consistency (each measured by two VGG-feature-based scores) on ADE20K [91].

3.3 Generation Network

The generation network aims to synthesize images under the semantic guidance of conditional inputs and style guidance of exemplars. As the exemplars are warped by the alignment network to be semantically matched with the conditional inputs, the warped exemplar can serve as accurate style guidance for each image region in the generation network. The overall architecture of the generation network is similar to SPADE [54]. Please refer to supplementary material for details of the network structure.

The state-of-the-art approach [87] simply concatenates the warped exemplar and the conditional input to guide the image generation process. However, the conditional input and the warped exemplar come from different domains with different distributions, and naively concatenating them is often sub-optimal [9]. In addition, the warped input image and the edited semantic map could be structurally aligned but semantically different, especially when they have severe semantic discrepancies. Such unreliably warped exemplars could serve as false guidance for the generation network and heavily deteriorate the generation performance. Therefore, a mechanism is required to identify the semantic reliability of the warped exemplar and provide reliable guidance for the generation network. To this end, we propose a CONfidence Feature Injection (CONFI) module that adaptively weights the features of the conditional input and the warped exemplar according to the reliability of feature matching.

Confidence Feature Injection. Intuitively, in the case of lower reliability of the feature correspondence, we should assign a relatively lower weight to the warped exemplar which provides unreliable style guidance and a higher weight to the conditional input which consistently provides accurate semantic guidance.

As illustrated in Fig. 6, the proposed CONFI fuses the features of the conditional input and the warped exemplar based on a confidence map (CMAP) that captures the reliability of the feature correspondence. To derive the confidence map, we first obtain a block-wise correlation map by computing the Cosine similarity between the blocks of the conditional input X′ and the blocks of the exemplar Z′. For each block of the conditional input, higher correlation scores indicate more reliable feature matching, so we treat the peak value of its correlation scores over all exemplar blocks as its confidence score. Collecting the confidence scores of all blocks yields the confidence map CMAP, which captures the semantic reliability of every block. The features of the conditional input and the warped exemplar (projected to the same size by convolution layers) can thus be fused via a weighted sum based on CMAP:

$$F = \mathrm{CMAP} \odot (A \cdot Z) + (1 - \mathrm{CMAP}) \odot X,$$

where $A$ is the built correspondence matrix (so $A \cdot Z$ is the warped exemplar) and $\odot$ denotes element-wise multiplication.

As the confidence map contains only one channel, the above feature fusion is applied uniformly across channels and ignores channel-wise differences. To achieve thorough feature fusion in all channels, we feed the initial fusion to convolution layers to generate a multi-channel confidence map (Multi-CMAP). The conditional input and warped exemplar are then thoroughly fused via a channel-wise weighted summation according to the Multi-CMAP. The final fused feature is further injected into the generation process via spatial de-normalization [54] to provide accurate semantic and style guidance.
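A simplified sketch of the confidence feature injection described above. The per-block peak correlation as confidence and the two-stage (single-channel, then multi-channel) fusion follow the text; the small convolutional head producing the multi-channel map and the assumption that block confidences have already been upsampled to feature resolution are illustrative.

```python
import torch
import torch.nn as nn

class ConfidenceFusion(nn.Module):
    """Fuse conditional-input features and warped-exemplar features,
    weighted by the reliability of the built correspondence."""
    def __init__(self, channels):
        super().__init__()
        self.to_multi = nn.Sequential(            # initial fusion -> multi-channel map
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.Sigmoid())

    def forward(self, x_feat, warped, block_corr):
        # x_feat, warped: (B, C, H, W)
        # block_corr: (B, H*W, L) block correlations, assumed upsampled to H*W positions
        B, C, H, W = x_feat.shape
        cmap = block_corr.max(dim=-1).values.view(B, 1, H, W)  # peak score = confidence
        fused = cmap * warped + (1.0 - cmap) * x_feat          # single-channel fusion
        multi = self.to_multi(fused)                           # per-channel confidences
        return multi * warped + (1.0 - multi) * x_feat         # channel-wise fusion
```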

Fig. 7: Qualitative comparison of the proposed RABIT and state-of-the-art methods over four types of conditional image translation tasks.

3.4 Loss Functions

The alignment network and generation network are jointly optimized. For clarity, we denote the conditional input and exemplar as $X$ and $Z$, the ground truth as $Y$, the generated image as $\hat{Y}$, the feature extractors for the conditional input and exemplar as $F_X$ and $F_Z$, and the generator and discriminator in the generation network as $G$ and $D$.

Alignment Network. First, the warping should be cycle consistent, i.e., the exemplar should be recoverable from the warped exemplar. We thus employ a cycle-consistency loss:

$$\mathcal{L}_{cyc} = \big\| A^{\top} (A \cdot Z) - Z \big\|_{1},$$

where $A$ is the correspondence matrix ($A \cdot Z$ is the warped exemplar, and $A^{\top}$ maps it back). The feature extractors $F_X$ and $F_Z$ aim to extract invariant semantic information across domains, i.e., the features extracted from the conditional input $X$ and the corresponding ground truth $Y$ should be consistent. A feature consistency loss can thus be formulated as:

$$\mathcal{L}_{feat} = \big\| F_X(X) - F_Z(Y) \big\|_{1}.$$
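A hedged sketch of these two alignment losses, assuming the warped exemplar is mapped back with the transposed correspondence matrix and that feature consistency is measured between the conditional input and the ground truth in feature space; both are written as L1 penalties for illustration.

```python
import torch
import torch.nn.functional as F

def alignment_losses(A, z_feat, x_feat, y_feat):
    """A: (N, N) row-normalized correspondence matrix.
    z_feat: exemplar features; x_feat / y_feat: features of the conditional
    input and the ground truth extracted by F_X and F_Z respectively."""
    warped = A @ z_feat                     # exemplar warped to the conditional input
    cycle = A.t() @ warped                  # warp back (assumed inverse mapping)
    loss_cyc = F.l1_loss(cycle, z_feat)     # exemplar should be recovered
    loss_feat = F.l1_loss(x_feat, y_feat)   # domain-invariant features should agree
    return loss_cyc, loss_feat
```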

Generation Network. The generation network employs several losses for high-fidelity synthesis with a style consistent with the exemplar and semantics consistent with the conditional input. As the generated image $\hat{Y}$ should be semantically consistent with the ground truth $Y$, we employ a perceptual loss [26] to penalize their semantic discrepancy:

$$\mathcal{L}_{perc} = \sum_{l} \big\| \phi_{l}(\hat{Y}) - \phi_{l}(Y) \big\|_{1}, \tag{3}$$

where $\phi_{l}$ is the activation of layer $l$ in a pre-trained VGG-19 [58] model. To ensure statistical consistency between the generated image $\hat{Y}$ and the exemplar $Z$, a contextual loss [49] is adopted:

$$\mathcal{L}_{cxt} = -\log \Big( \frac{1}{N_l} \sum_{i} \max_{j} \, \mathrm{CX}\big( \phi_{l}^{i}(\hat{Y}), \, \phi_{l}^{j}(Z) \big) \Big), \tag{4}$$

where $i$ and $j$ index the features of the feature map in layer $l$, $N_l$ is the number of features in that layer, and $\mathrm{CX}(\cdot, \cdot)$ denotes the pairwise contextual similarity between features. Besides, a pseudo pairs loss $\mathcal{L}_{pse}$ as described in [87] is included in training.

The discriminator $D$ is employed to drive adversarial generation with an adversarial loss $\mathcal{L}_{adv}$ [25]. The full network is thus optimized with the following objective:

$$\mathcal{L} = \lambda_{1}\mathcal{L}_{cyc} + \lambda_{2}\mathcal{L}_{feat} + \lambda_{3}\mathcal{L}_{perc} + \lambda_{4}\mathcal{L}_{cxt} + \lambda_{5}\mathcal{L}_{pse} + \lambda_{6}\mathcal{L}_{adv}, \tag{5}$$

where the weights $\lambda_{1}, \cdots, \lambda_{6}$ balance the losses in the objective.

4 Experiments

4.1 Experimental Settings

Datasets: We evaluate and benchmark our method over multiple datasets for image translation & manipulation tasks.

ADE20K [91] has 20k training images each of which is associated with a 150-class segmentation mask. We use its semantic segmentation as conditional inputs in image translation experiments, and 2k test images for evaluations. For image manipulation, we apply object-level affine transformations on the test set to acquire paired data (150 images) for evaluations as in [90].

CelebA-HQ [42] has 30,000 high-quality face images. We conduct two translation tasks by using face semantics and face edges as conditional inputs. In addition, we also conduct image manipulation experiments on this dataset by editing the face semantics. We use 2993 face images for translation evaluations as in [87], and manually edit 100 randomly selected semantic maps for image manipulation evaluations.

DeepFashion [41] has 52,712 person images of different appearance and poses. We use its key points as conditional inputs for image translation, and select 4993 images for evaluations as in [87].

Evaluation Metrics: For image translation, we adopt the Fréchet Inception Distance (FID) [18] and the Sliced Wasserstein Distance (SWD) [28] to evaluate the perceptual quality of translated images. We adopt the Learned Perceptual Image Patch Similarity (LPIPS) [88] to evaluate the translation diversity with different exemplars.

For image manipulation, we adopt FID, SWD and LPIPS to evaluate the perceptual quality of manipulated images. We also adopt the L1 distance, peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) as low-level evaluation metrics. Note that LPIPS evaluates image translation diversity by measuring the distance between translated images, whereas it evaluates image manipulation quality by measuring the distance between manipulated images and the ground truth.

Similar to [87], we design two metrics to evaluate semantic consistency and two metrics to evaluate style consistency. For semantic consistency, we apply a pre-trained VGG model [58] to extract high-level features of the ground truth and the generated images; the semantic consistency scores are defined as the Cosine similarity between the high-level features extracted at two different layers. For style consistency, we extract low-level style features from the generated images and the exemplars; the style consistency scores are defined as the Cosine similarity of the channel-wise means and standard deviations of these features.
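A rough sketch of how such consistency scores can be computed with a pre-trained VGG; the layer indices used here are placeholders and not necessarily the layers used in the paper.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

vgg = vgg19(pretrained=True).features.eval()
HIGH, LOW = 26, 8   # placeholder indices for high-/low-level feature layers

def vgg_feat(img, layer):
    x = img
    for i, m in enumerate(vgg):
        x = m(x)
        if i == layer:
            return x

def semantic_consistency(gen, gt):
    a, b = vgg_feat(gen, HIGH).flatten(1), vgg_feat(gt, HIGH).flatten(1)
    return F.cosine_similarity(a, b).mean()              # cosine of high-level features

def style_consistency(gen, exemplar):
    a, b = vgg_feat(gen, LOW), vgg_feat(exemplar, LOW)
    stat = lambda f: torch.cat([f.mean(dim=(2, 3)), f.std(dim=(2, 3))], dim=1)
    return F.cosine_similarity(stat(a), stat(b)).mean()  # cosine of channel mean/std
```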

Implementation Details: The alignment and generation networks are jointly optimized with learning rates of 1e-4 and 4e-4 for the generator and discriminator, respectively. We adopt the Adam solver. All experiments were conducted on 4 32GB Tesla V100 GPUs with synchronized BatchNorm. The default size for our correspondence computation is 128 × 128 with a block size of b = 4. The number k in top-k ranking is set to 3 by default in our experiments. The generated images share the same resolution across all generation tasks.

4.2 Image Translation Experiments

We compare RABIT with eight state-of-the-art image translation methods: 1) Pix2pixHD [65] on supervised image translation; 2) StarGAN v2 [5] on multi-modal translation with support for style encoding from reference images; 3) SPADE [54] on supervised translation that supports style injection from exemplar images; 4) SelectionGAN [61] on guided translation with cascaded semantic guidance; 5) SMIS [96] on semantically multi-modal synthesis with all group convolutions; 6) SEAN [95] on conditional generation that can control the style of each individual semantic region; 7) CoCosNet [87] on exemplar-based image translation that builds cross-domain correspondences; and 8) CoCosNet v2 [93] on building high-resolution correspondences for image translation. Note that CoCosNet adopts a default correspondence size of 64 × 64 in this work as constrained by high memory costs, whereas CoCosNet v2 and RABIT adopt a default correspondence size of 128 × 128.

Fig. 8: Illustration of generation diversity of the proposed RABIT: With the same conditional input, RABIT can generate a variety of images that have consistent styles with the provided exemplars. It works for different types of conditional inputs consistently.

Quantitative Results. In quantitative experiments, all methods translate images with the same exemplars except Pix2pixHD [65], which does not support style injection from exemplars. LPIPS is calculated over images generated with randomly selected exemplars: each compared method generates three images per conditional input (with three different exemplars), and the final LPIPS is obtained by averaging the LPIPS between any two generated images.

Table I shows experimental results. It can be seen that RABIT outperforms all compared methods over most metrics and tasks consistently. By building explicit yet accurate correspondences between conditional inputs and exemplars, RABIT enables direct and accurate guidance from the exemplar and achieves better translation quality (in FID and SWD) and diversity (in LPIPS) as compared with regularization-based methods such as SPADE [54] and SMIS [96], and style-encoding methods such as StarGAN v2 [5] and SEAN [95]. Compared with the correspondence-based method CoCosNet [87], the proposed bi-level alignment allows RABIT to build correspondences and warp exemplars at higher resolutions (e.g., 128 × 128), which offers more detailed guidance in the generation process and helps to achieve better FID and SWD. Compared with CoCosNet v2 [93], the proposed semantic position encoding preserves the texture structures and patterns, thus yielding more accurate warped exemplars as guidance. In addition, the proposed confidence feature injection module fuses conditional inputs and warped exemplars adaptively based on the matching confidence, which provides more reliable guidance and improves FID and SWD. Besides generation quality, RABIT achieves the best generation diversity in LPIPS except for StarGAN v2 [5], which sacrifices generation quality with much worse FID and SWD.

 

Models FID SWD PSNR SSIM
SPADE [54] 120.2 41.62 13.11 0.334
HIM [20] 59.89 22.23 18.23 0.667
SESAME [52] 52.51 29.40 18.67 0.691
CoCosNet [87] 41.03 23.08 20.30 0.744
CoCosNetv2 [93] 34.31 19.55 21.75 0.797
RABIT 26.61 15.05 23.08 0.823
TABLE III: Comparing RABIT with state-of-the-art image manipulation methods on ADE20K [91] with evaluation metrics FID, SWD, PSNR, and SSIM.

 

Models FID SWD LPIPS
SPADE [54] 105.1 41.90 0.376
SEAN [95] 96.31 35.90 0.351
MaskGAN [34] 80.89 23.86 0.271
CoCosNet [87] 68.70 22.90 0.224
CoCosNetv2 [93] 62.53 21.11 0.190
RABIT 60.87 21.07 0.176
TABLE IV: Comparing RABIT with state-of-the-art image manipulation methods on CelebA-HQ [42] with evaluation metrics FID, SWD and LPIPS.

 

Methods            | Semantic Consistency | Style Consistency
SPADE [54]         | 0.853  0.766         | 0.929  0.876
HIM [20]           | 0.865  0.773         | 0.934  0.884
SESAME [52]        | 0.870  0.779         | 0.969  0.947
CoCosNet [87]      | 0.878  0.790         | 0.986  0.965
CoCosNet v2 [93]   | 0.889  0.804         | 0.985  0.967
RABIT              | 0.889  0.802         | 0.992  0.975
TABLE V: Comparing RABIT with state-of-the-art image manipulation methods in semantic consistency and style consistency (on ADE20K [91]).

We also evaluated the generated images by measuring their semantic consistency with the conditional inputs and their style consistency with the exemplars. As shown in Table II, the proposed RABIT achieves the best style consistency thanks to the bi-level feature alignment for building high-resolution correspondences and the semantic position encoding for preservation of texture patterns. It also achieves the best semantic consistency due to the confidence feature injection that offers reliable fusion of semantic and style features.

Fig. 9: Qualitative illustration of RABIT and state-of-the-art image manipulation methods on the augmented test set of ADE20K with ground truth as described in [90]: The edited regions of the semantic maps are highlighted by white boxes. The artifacts generated by CoCosNet and CoCosNet v2 are highlighted by orange boxes. The proposed RABIT is capable of generating high-fidelity editing results without undesired artifacts.
Fig. 10: The comparison of image manipulation by MaskGAN [34] and the proposed RABIT over dataset CelebA-HQ [42].

Qualitative Evaluations. Fig. 7 shows qualitative comparisons. It can be seen that RABIT achieves the best visual quality with styles faithful to the exemplars. SPADE [54], SMIS [96] and StarGAN v2 [5] adopt a single latent code to encode image styles, which tends to capture the global style but misses local details. SEAN [95] employs multiple latent codes but struggles to preserve faithful exemplar styles. CoCosNet [87] builds low-resolution correspondences, which leads to missing details, while CoCosNet v2 [93] builds high-resolution correspondences without position encoding, which leads to destroyed texture patterns. RABIT excels with its RAS that offers accurate feature alignment at high resolution and its SPE that preserves texture structures and patterns.

RABIT also demonstrates superior diversity in image translation, as illustrated in Fig. 8. It can be observed that RABIT is capable of synthesizing various high-fidelity images with styles faithful to the various exemplars.

4.3 Image Manipulation Experiment

The proposed RABIT manipulates images by treating input images as exemplars and edited semantic guidance as conditional inputs. We compare RABIT with several state-of-the-art image manipulation methods including 1) SPADE [54], which supports semantic manipulation with style injection from input images; 2) SEAN [95], which supports semantic manipulation with style control of each individual semantic region; 3) MaskGAN [34], a geometry-oriented face manipulation framework with semantic masks as an intermediate representation for manipulation; 4) Hierarchical Image Manipulation (HIM) [20], a hierarchical framework for semantic image manipulation; 5) SESAME [52], a semantic image editing method covering the operations of adding, manipulating and erasing; 6) CoCosNet [87], a leading exemplar-based image generation framework that enables manipulation by building cross-domain correspondences; and 7) CoCosNet v2 [93], which builds high-resolution correspondences for image generation.

Fig. 11: Various image editing results by the proposed RABIT: With input images as the exemplars and edited semantic maps as the conditional inputs, RABIT generates new images with faithful semantics and high-fidelity textures with few artifacts.
Fig. 12: AMT (Amazon Mechanical Turk) user studies of different image translation and image manipulation methods in terms of the visual quality and style consistency of the generated images.

Quantitative Results: In quantitative experiments, all compared methods manipulate images with the same input image and edited semantic label map. Table III shows experimental results over the synthesized test set of ADE20K [91]. It can be observed that RABIT outperforms state-of-the-art methods over all evaluation metrics consistently. Table IV shows experimental results over the CelebA-HQ dataset with manually edited semantic maps. It can be observed that RABIT outperforms the state-of-the-art methods by large margins in all perceptual quality metrics. The superior generation quality of RABIT is largely attributed to the ranking and attention scheme for building high-resolution correspondences and the semantic position encoding for preserving rich texture details of input images.

Besides the quality of manipulated images, we also evaluate their semantic consistency and style consistency as shown in Table V. It can be seen that RABIT achieves the best semantic consistency and style consistency as compared with state-of-the-art image manipulation methods. The outstanding performance can be explained by the proposed ranking and attention scheme for building high-resolution correspondence, the semantic position encoding for texture pattern preservation as well as the confidence feature injection for reliable image generation.

Qualitative Evaluation: Fig. 9 shows visual comparisons with state-of-the-art manipulation methods on ADE20K. HIM [20] and SESAME [52] produce unrealistic textures and artifacts for drastic semantic changes due to the lack of texture details after masking. CoCosNet [87] can preserve certain details, but it adopts Cosine similarity to align low-resolution features, which often leads to missing details as demonstrated by blurry textures and artifacts. RABIT achieves superior fidelity due to its bi-level feature alignment for building high-resolution correspondences, its semantic position encoding for preserving texture patterns, and its confidence feature injection for reliable guidance in image generation. Fig. 11 shows the editing capacity of RABIT with various types of manipulation on semantic labels. It can be seen that the RABIT manipulation results faithfully align with the edited semantic maps and contain realistic details. With the proposed bi-level feature alignment strategy and semantic position encoding, RABIT accurately matches features for the edited semantics and minimizes undesired changes outside the editing regions.

We also compare RABIT with MaskGAN [34] on CelebA-HQ [42] in Fig. 10. MaskGAN tends to introduce undesired changes in the edited images such as the skin color (columns 1 and 3) and the missing hand (column 2). RABIT achieves better editing with little change in other regions thanks to the accurate correspondences built between input images and edited semantic maps.

4.4 User Study

We conduct crowdsourcing user studies through Amazon Mechanical Turk (AMT) to evaluate image translation & manipulation in terms of generation quality and style consistency. The code of the AMT user studies is available at https://github.com/fnzhan/AMT. Specifically, each compared method generates 100 images with the same conditional inputs and exemplars. The generated images, together with the conditional inputs and exemplars, were then presented to 10 users for assessment. For the evaluation of image quality, the users were instructed to pick the best-quality images. For the evaluation of style consistency, the users were instructed to select the images with the best style relevance to the exemplars. The final AMT score of each method is the average number of times it is selected as having the best quality or the best style relevance.

Fig. 12 shows AMT results on multiple datasets. It can be observed that RABIT outperforms state-of-the-art methods consistently in image quality and style consistency on both image translation & image manipulation tasks.

 

Models          | Image Translation                  | Image Manipulation
                | FID   SWD   LPIPS  Sem.   Style    | FID   SWD   PSNR   Sem.   Style
SPADE           | 33.90 19.70 0.344  0.772  0.884    | 120.2 41.62 13.11  0.766  0.876
SPADE+COS       | 27.72 14.98 0.556  0.787  0.941    | 42.02 23.23 19.92  0.772  0.926
CONFI+COS       | 26.58 12.33 0.529  0.801  0.958    | 40.56 22.87 20.73  0.782  0.959
CONFI+RAS       | 25.51 11.94 0.548  0.807  0.966    | 33.32 18.89 21.97  0.785  0.964
CONFI+RAS+PE    | 24.93 10.20 0.578  0.791  0.974    | 28.32 16.48 22.79  0.797  0.971
CONFI+RAS+SPE*  | 24.35 9.893 0.596  0.812  0.977    | 26.61 15.05 23.08  0.802  0.975
TABLE VI: Ablation studies on image translation and image manipulation tasks (both on ADE20K [91]): COS refers to Cosine similarity for building correspondences. RAS and CONFI denote the proposed ranking and attention scheme for building correspondences and the confidence feature injection module in the generation network, respectively. PE and SPE refer to vanilla position encoding and the proposed semantic position encoding, respectively. Sem. and Style denote the semantic consistency and style consistency scores. The model in the last row (*) is the standard RABIT.

 

Models            | Resolution 1 (lowest)       | Resolution 2                | Resolution 3 (highest)
                  | FID   LPIPS  SSIM   MC      | FID   LPIPS  SSIM   MC      | FID   LPIPS  SSIM   MC
CoCosNet [87]     | 102.9 0.402  0.621  6.179   | 64.43 0.298  0.630  11.23   | 54.87 0.273  0.657  21.73
CoCosNet v2 [93]  | 132.3 0.458  0.614  5.101   | 73.14 0.339  0.625  9.065   | 58.66 0.283  0.641  14.94
RAS (k=1, b=64)   | 144.1 0.470  0.608  4.912   | 97.79 0.367  0.618  8.986   | 76.47 0.333  0.622  14.35
RAS (k=1, b=16)   | 129.8 0.431  0.615  4.963   | 81.07 0.344  0.622  9.012   | 66.22 0.306  0.625  15.16
RAS (k=1, b=4)    | 102.4 0.382  0.620  5.093   | 70.12 0.321  0.624  9.082   | 59.97 0.292  0.634  15.95
RAS (k=2, b=4)    | 97.82 0.379  0.619  5.126   | 65.39 0.312  0.624  9.114   | 57.54 0.281  0.638  16.06
RAS (k=3, b=4)*   | 95.93 0.357  0.623  5.157   | 63.84 0.302  0.628  9.136   | 54.15 0.268  0.644  16.35
TABLE VII: Ablation study of correspondence accuracy and memory cost with different correspondence resolutions (the three column groups, from low to high), block sizes (b = 4, 16, 64) and top-k selections (k = 1, 2, 3) on DeepFashion [41] for image translation. Correspondence accuracy is evaluated by comparing the warped exemplars with the ground truth using FID, LPIPS and SSIM. MC denotes the memory cost in gigabytes (GB). '*' denotes the default setting of RAS.

 

Models                        | FID   | Semantic Consist. | Style Consist.
w/o $\mathcal{L}_{cyc}$       | 28.17 | 0.794             | 0.958
w/o $\mathcal{L}_{feat}$      | 29.27 | 0.809             | 0.962
w/o $\mathcal{L}_{perc}$      | 45.16 | 0.738             | 0.861
w/o $\mathcal{L}_{cxt}$       | 35.05 | 0.798             | 0.853
w/o $\mathcal{L}_{pse}$       | 25.43 | 0.807             | 0.972
Full model (RABIT)            | 24.35 | 0.812             | 0.977
TABLE VIII: Ablation studies of loss functions over ADE20K [91]. $\mathcal{L}_{cyc}$ and $\mathcal{L}_{feat}$ denote the cycle-consistency loss and feature consistency loss in the alignment network. $\mathcal{L}_{perc}$, $\mathcal{L}_{cxt}$ and $\mathcal{L}_{pse}$ denote the perceptual loss, contextual loss and pseudo pairs loss in the generation network.

4.5 Ablation Study

We conduct extensive ablation studies to evaluate our technical designs on image translation and image manipulation tasks. Table VI shows experimental results on ADE20K. SPADE [54] is selected as the baseline, which achieves image translation & manipulation without feature alignment. The performance is clearly improved when Cosine similarity is included to align features, as denoted by SPADE+COS. Replacing the SPADE-style injection with the proposed CONFI for feature injection further improves FID and SWD, as denoted by CONFI+COS. In addition, the translation is further improved by large margins when the proposed RAS is included for building high-resolution correspondences. Including vanilla position encoding (PE) brings some improvement in FID, but the semantic consistency score is affected severely. The proposed semantic position encoding improves the FID score and semantic consistency consistently.

As the correspondence quality is critical to correspondence-based generation, we analyze the accuracy, memory costs and parameters (e.g., resolution, block size) of correspondence construction in different methods. The experiments were conducted on the DeepFashion dataset [41] (with paired images), where the warped exemplars are compared with the ground truth to evaluate the accuracy of the built correspondences, and the memory cost is measured by the memory footprint on GPU. We compare the Cosine similarity in CoCosNet [87], the PatchMatch in CoCosNet v2 [93] and the proposed RAS at three correspondence resolutions. As shown in Table VII, RAS (k=3, b=4) outperforms the Cosine similarity in CoCosNet and the PatchMatch in CoCosNet v2. In addition, RAS (k=1, b=64) reduces memory costs consistently under different resolutions as compared with CoCosNet and CoCosNet v2. We also study the effect of the correspondence resolution, the top-k number (k = 1, 2, 3) and the block size (b = 4, 16, 64) in RAS. As Table VII shows, the accuracy of the built correspondences keeps improving and the memory cost keeps increasing when the resolution or the top-k selection increases or the block size decreases. Compared with CoCosNet and CoCosNet v2, RAS reduces memory more clearly as the correspondence resolution increases. Trading off correspondence accuracy against memory cost, we select k = 3, b = 4 and a correspondence resolution of 128 × 128 as the default setting of RAS.

In addition, we perform several ablation studies to examine the contribution of each loss by removing it from the overall objective. Table VIII shows experimental results on the image translation task over ADE20K. As shown in Table VIII, all the involved losses contribute to the image translation with different significance. Specifically, the image quality as indicated by FID drops clearly without the perceptual loss $\mathcal{L}_{perc}$, and the style consistency decreases significantly with the removal of the contextual loss $\mathcal{L}_{cxt}$.

5 Conclusions

This paper presents RABIT, a versatile conditional image translation & manipulation framework that adopts a novel bi-level alignment strategy with a ranking and attention scheme (RAS) to align the features between conditional inputs and exemplars efficiently. As the ranking operation precludes gradient propagation in model training, we approximate it with a regularized earth mover's formulation which enables differentiable optimization of the ranking operation. A semantic position encoding mechanism is designed to encode semantic-level position information and preserve the texture patterns of the exemplars. To handle the semantic mismatch between conditional inputs and warped exemplars, a novel confidence feature injection module is proposed to achieve multi-channel feature fusion based on the matching reliability of the warped exemplars. Quantitative and qualitative experiments over multiple datasets show that RABIT achieves high-fidelity image translation and manipulation while preserving semantics consistent with the conditional input and styles faithful to the exemplar.

The current exemplar-based image translation still requires the conditional input and the exemplar to be semantically similar to build meaningful correspondences, which constrains the generalization of this translation approach. A possible solution is to further relax the constraints on exemplar selection. In this work, we propose the confidence feature injection module to mitigate the semantic discrepancy between conditional inputs and exemplars by assigning higher weights to the conditional input when the exemplar features are misaligned. However, adjusting fusion weights only mitigates the misalignment, and the misaligned features still tend to mislead the generation process to some extent. Instead of adjusting the fusion weights, we could rectify the misaligned features directly based on a pre-built feature bank with well-aligned features. These related issues will be studied in our future research.

6 Acknowledgement

This study is supported under the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s).

References

  • [1] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman (2009) PatchMatch: a randomized correspondence algorithm for structural image editing. ACM Trans. Graph. 28 (3), pp. 24. Cited by: §1.
  • [2] H. Bristow, J. Valmadre, and S. Lucey (2015)

    Dense semantic correspondence where every pixel is a classifier

    .
    In Proceedings of the IEEE International Conference on Computer Vision, pp. 4024–4031. Cited by: §2.3.
  • [3] Z. Chen, C. Wang, B. Yuan, and D. Tao (2020) Puppeteergan: arbitrary portrait animation with semantic-aware appearance transformation. In

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    ,
    pp. 13518–13527. Cited by: §2.2.
  • [4] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo (2018)

    Stargan: unified generative adversarial networks for multi-domain image-to-image translation

    .
    In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8789–8797. Cited by: §2.2.
  • [5] Y. Choi, Y. Uh, J. Yoo, and J. Ha (2020) Stargan v2: diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8188–8197. Cited by: §1, §2.1, TABLE I, TABLE II, §4.2, §4.2, §4.2.
  • [6] C. B. Choy, J. Gwak, S. Savarese, and M. Chandraker (2016) Universal correspondence network. arXiv preprint arXiv:1606.03558. Cited by: §2.3.
  • [7] K. Cui, G. Zhang, F. Zhan, and S. Lu (2021) FBC-gan: diverse and flexible image synthesis via foreground-background composition. arXiv preprint. Cited by: §2.1.
  • [8] M. Cuturi (2013) Sinkhorn distances: lightspeed computation of optimal transport. In Advances in neural information processing systems, pp. 2292–2300. Cited by: §1, §3.2.2.
  • [9] Y. Dai, F. Gieseke, S. Oehmcke, Y. Wu, and K. Barnard (2021) Attentional feature fusion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3560–3569. Cited by: §3.3.
  • [10] C. Doersch (2016) Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908. Cited by: §1.
  • [11] H. Dong, X. Liang, K. Gong, H. Lai, J. Zhu, and J. Yin (2018) Soft-gated warping-gan for pose-guided person image synthesis. arXiv preprint arXiv:1810.11610. Cited by: §1, §2.1.
  • [12] Y. Fu, J. Ma, L. Ma, and X. Guo (2019) EDIT: exemplar-domain aware image-to-image translation. arXiv preprint arXiv:1911.10520. Cited by: §2.1.
  • [13] L. A. Gatys, A. S. Ecker, and M. Bethge (2016) Image style transfer using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2414–2423. Cited by: §2.1.
  • [14] S. Gu, J. Bao, H. Yang, D. Chen, F. Wen, and L. Yuan (2019) Mask-guided portrait editing with conditional gans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3436–3445. Cited by: §2.2.
  • [15] B. Ham, M. Cho, C. Schmid, and J. Ponce (2017) Proposal flow: semantic correspondences from object proposals. IEEE transactions on pattern analysis and machine intelligence 40 (7), pp. 1711–1725. Cited by: §2.3.
  • [16] K. Han, R. S. Rezende, B. Ham, K. K. Wong, M. Cho, C. Schmid, and J. Ponce (2017) Scnet: learning semantic correspondence. In Proceedings of the IEEE international conference on computer vision, pp. 1831–1840. Cited by: §2.3.
  • [17] M. He, D. Chen, J. Liao, P. V. Sander, and L. Yuan (2018) Deep exemplar-based colorization. ACM Transactions on Graphics (TOG) 37 (4), pp. 1–16. Cited by: §1, §3.1.
  • [18] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pp. 6626–6637. Cited by: §4.1.
  • [19] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell (2018) Cycada: cycle-consistent adversarial domain adaptation. In International conference on machine learning, pp. 1989–1998. Cited by: §2.1.
  • [20] S. Hong, X. Yan, T. Huang, and H. Lee (2018) Learning hierarchical semantic image manipulation through structured representations. In NIPS. Cited by: §2.2, §4.3, §4.3, TABLE III, TABLE V.
  • [21] A. Hosni, C. Rhemann, M. Bleyer, C. Rother, and M. Gelautz (2012) Fast cost-volume filtering for visual correspondence and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (2), pp. 504–511. Cited by: §2.3.
  • [22] X. Huang and S. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1501–1510. Cited by: §2.1, §2.1.
  • [23] X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 172–189. Cited by: §2.1.
  • [24] J. Hur, H. Lim, C. Park, and S. Chul Ahn (2015) Generalized deformable spatial pyramid: geometry-preserving dense correspondence estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1392–1400. Cited by: §2.3.
  • [25] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: §2.1, §2.2, §3.4.
  • [26] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pp. 694–711. Cited by: §3.4.
  • [27] L. V. Kantorovich (1960) Mathematical methods of organizing and planning production. Management science 6 (4), pp. 366–422. Cited by: §3.2.1.
  • [28] T. Karras, T. Aila, S. Laine, and J. Lehtinen (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196. Cited by: §4.1.
  • [29] J. Kim, C. Liu, F. Sha, and K. Grauman (2013) Deformable spatial pyramid matching for fast dense correspondences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2307–2314. Cited by: §2.3.
  • [30] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §2.1.
  • [31] W. Lai, J. Huang, N. Ahuja, and M. Yang (2017) Deep laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 624–632. Cited by: §2.1.
  • [32] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4681–4690. Cited by: §2.1.
  • [33] C. Lee, Z. Liu, L. Wu, and P. Luo (2020) MaskGAN: towards diverse and interactive facial image manipulation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §4.3, §4.3.
  • [34] C. Lee, Z. Liu, L. Wu, and P. Luo (2020) Maskgan: towards diverse and interactive facial image manipulation. In CVPR, pp. 5549–5558. Cited by: §2.2, Fig. 10, §4.3, TABLE IV.
  • [35] Y. Li, Y. Cheng, Z. Gan, L. Yu, L. Wang, and J. Liu (2020) BachGAN: high-resolution image synthesis from salient object layout. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2.1.
  • [36] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M. Yang (2017) Universal style transfer via feature transforms. arXiv preprint arXiv:1705.08086. Cited by: §2.1.
  • [37] J. Liao, Y. Yao, L. Yuan, G. Hua, and S. B. Kang (2017) Visual attribute transfer through deep image analogy. arXiv preprint arXiv:1705.01088. Cited by: §2.3.
  • [38] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee (2017) Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 136–144. Cited by: §2.1.
  • [39] H. Liu, B. Jiang, Y. Song, W. Huang, and C. Yang (2020) Rethinking image inpainting via a mutual encoder-decoder with feature equalizations. In ECCV, pp. 725–741. Cited by: §2.1.
  • [40] R. Liu, J. Lehman, P. Molino, F. Petroski Such, E. Frank, A. Sergeev, and J. Yosinski (2018) An intriguing failing of convolutional neural networks and the coordconv solution. In Advances in Neural Information Processing Systems. Cited by: §1, §3.1.
  • [41] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang (2016) Deepfashion: powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1096–1104. Cited by: §4.1, §4.5, TABLE VII.
  • [42] Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pp. 3730–3738. Cited by: Fig. 10, §4.1, TABLE IV.
  • [43] J. Long, N. Zhang, and T. Darrell (2014) Do convnets learn correspondence?. arXiv preprint arXiv:1411.1091. Cited by: §2.3.
  • [44] D. G. Lowe (2004) Distinctive image features from scale-invariant keypoints. International journal of computer vision 60 (2), pp. 91–110. Cited by: §2.3.
  • [45] J. Lucas, G. Tucker, R. Grosse, and M. Norouzi (2019) Don’t blame the elbo! a linear vae perspective on posterior collapse. arXiv preprint arXiv:1911.02469. Cited by: §1.
  • [46] G. Luise, A. Rudi, M. Pontil, and C. Ciliberto (2018) Differential properties of sinkhorn approximation for learning with wasserstein distance. In Advances in Neural Information Processing Systems, pp. 5864–5874. Cited by: §3.2.
  • [47] L. Ma, X. Jia, S. Georgoulis, T. Tuytelaars, and L. Van Gool (2018) Exemplar guided unsupervised image-to-image translation with semantic consistency. In International Conference on Learning Representations. Cited by: §2.1.
  • [48] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool (2017) Pose guided person image generation. In Advances in neural information processing systems, pp. 406–416. Cited by: §2.1.
  • [49] R. Mechrez, I. Talmi, and L. Zelnik-Manor (2018) The contextual loss for image transformation with non-aligned data. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 768–783. Cited by: §3.4.
  • [50] Y. Men, Y. Mao, Y. Jiang, W. Ma, and Z. Lian (2020) Controllable person image synthesis with attribute-decomposed gan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5084–5093. Cited by: §2.1.
  • [51] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim (2018) Image to image translation for domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4500–4509. Cited by: §1, §2.1.
  • [52] E. Ntavelis, A. Romero, I. Kastanis, L. Van Gool, and R. Timofte (2020) Sesame: semantic editing of scenes by adding, manipulating or erasing objects. In ECCV, pp. 394–411. Cited by: §2.2, §4.3, §4.3, TABLE III, TABLE V.
  • [53] M. Okutomi and T. Kanade (1993) A multiple-baseline stereo. IEEE Transactions on pattern analysis and machine intelligence 15 (4), pp. 353–363. Cited by: §2.3.
  • [54] T. Park, M. Liu, T. Wang, and J. Zhu (2019) Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2337–2346. Cited by: §1, §2.1, §2.2, §3.3, §3.3, TABLE I, TABLE II, §4.2, §4.2, §4.2, §4.3, §4.5, TABLE III, TABLE IV, TABLE V.
  • [55] A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-Noguer (2018) Ganimation: anatomically-aware facial animation from a single image. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 818–833. Cited by: §2.2.
  • [56] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb (2017) Learning from simulated and unsupervised images through adversarial training. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2107–2116. Cited by: §1.
  • [57] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb (2017) Learning from simulated and unsupervised images through adversarial training. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2107–2116. Cited by: §2.1.
  • [58] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.4, §4.1.
  • [59] W. Sun and T. Wu (2019) Image synthesis from reconfigurable layout and style. In Proceedings of the IEEE International Conference on Computer Vision, pp. 10531–10540. Cited by: §2.1.
  • [60] H. Tang, D. Xu, G. Liu, W. Wang, N. Sebe, and Y. Yan (2019) Cycle in cycle generative adversarial networks for keypoint-guided image generation. In Proceedings of the 27th ACM International Conference on Multimedia, pp. 2052–2060. Cited by: §1.
  • [61] H. Tang, D. Xu, N. Sebe, Y. Wang, J. J. Corso, and Y. Yan (2019) Multi-channel attention selection gan with cascaded semantic guidance for cross-view image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2417–2426. Cited by: TABLE I, TABLE II, §4.2.
  • [62] Y. Tsai, W. Hung, S. Schulter, K. Sohn, M. Yang, and M. Chandraker (2018) Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7472–7481. Cited by: §2.1.
  • [63] Z. Wan, B. Zhang, D. Chen, P. Zhang, D. Chen, J. Liao, and F. Wen (2020) Bringing old photos back to life. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2747–2757. Cited by: §1.
  • [64] M. Wang, G. Yang, R. Li, R. Liang, S. Zhang, P. M. Hall, and S. Hu (2019) Example-guided style-consistent image synthesis from semantic labeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1495–1504. Cited by: §2.1.
  • [65] T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2018) High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8798–8807. Cited by: §1, §2.1, §2.2, TABLE I, §4.2, §4.2.
  • [66] Y. Wang, X. Tao, X. Qi, X. Shen, and J. Jia (2018) Image inpainting via generative multi-column convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 331–340. Cited by: §2.1.
  • [67] R. Wu and S. Lu (2020) LEED: label-free expression editing via disentanglement. In European Conference on Computer Vision, pp. 781–798. Cited by: §2.2.
  • [68] R. Wu, G. Zhang, S. Lu, and T. Chen (2020) Cascade ef-gan: progressive facial expression editing with local focuses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5021–5030. Cited by: §2.2.
  • [69] W. Xia, Y. Yang, J. Xue, and B. Wu (2021) TediGAN: text-guided diverse face image generation and manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2256–2265. Cited by: §2.2.
  • [70] Y. Xie, H. Dai, M. Chen, B. Dai, T. Zhao, H. Zha, W. Wei, and T. Pfister (2020) Differentiable top-k with optimal transport. Advances in Neural Information Processing Systems 33. Cited by: §3.2.
  • [71] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang (2019) Free-form image inpainting with gated convolution. In ICCV, pp. 4471–4480. Cited by: §2.1.
  • [72] Y. Yu, F. Zhan, R. Wu, J. Pan, K. Cui, S. Lu, F. Ma, X. Xie, and C. Miao (2021) Diverse image inpainting with bidirectional and autoregressive transformers. arXiv preprint arXiv:2104.12335. Cited by: §2.1.
  • [73] F. Zhan, S. Lu, and C. Xue (2018) Verisimilar image synthesis for accurate detection and recognition of texts in scenes. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 249–266. Cited by: §2.1.
  • [74] F. Zhan, S. Lu, C. Zhang, F. Ma, and X. Xie (2020) Adversarial image composition with auxiliary illumination. In Proceedings of the Asian Conference on Computer Vision. Cited by: §2.1.
  • [75] F. Zhan, S. Lu, C. Zhang, F. Ma, and X. Xie (2020) Towards realistic 3d embedding via view alignment. arXiv preprint arXiv:2007.07066. Cited by: §2.1.
  • [76] F. Zhan and S. Lu (2019) Esir: end-to-end scene text recognition via iterative image rectification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2059–2068. Cited by: §2.1.
  • [77] F. Zhan, C. Xue, and S. Lu (2019) GA-dan: geometry-aware domain adaptation network for scene text detection and recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9105–9115. Cited by: §2.1.
  • [78] F. Zhan, Y. Yu, K. Cui, G. Zhang, S. Lu, J. Pan, C. Zhang, F. Ma, X. Xie, and C. Miao (2021) Unbalanced feature transport for exemplar-based image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2.1.
  • [79] F. Zhan, Y. Yu, R. Wu, C. Zhang, S. Lu, L. Shao, F. Ma, and X. Xie (2021) GMLight: lighting estimation via geometric distribution approximation. arXiv preprint arXiv:2102.10244. Cited by: §2.1.
  • [80] F. Zhan, C. Zhang, W. Hu, S. Lu, F. Ma, X. Xie, and L. Shao (2021) Sparse needlets for lighting estimation with spherical transport loss. arXiv preprint arXiv:2106.13090. Cited by: §2.1.
  • [81] F. Zhan, C. Zhang, Y. Yu, Y. Chang, S. Lu, F. Ma, and X. Xie (2020) EMLight: lighting estimation via spherical distribution approximation. arXiv preprint arXiv:2012.11116. Cited by: §2.1.
  • [82] F. Zhan and C. Zhang (2021) Spatial-aware gan for unsupervised person re-identification. In 2020 25th International Conference on Pattern Recognition (ICPR), pp. 6889–6896. Cited by: §2.1.
  • [83] F. Zhan, H. Zhu, and S. Lu (2019) Scene text synthesis for efficient and effective deep network training. arXiv preprint arXiv:1901.09193. Cited by: §2.1.
  • [84] F. Zhan, H. Zhu, and S. Lu (2019) Spatial fusion gan for image synthesis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3653–3662. Cited by: §2.1.
  • [85] B. Zhang, M. He, J. Liao, P. V. Sander, L. Yuan, A. Bermak, and D. Chen (2019) Deep exemplar-based video colorization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8052–8061. Cited by: §1, §3.1.
  • [86] J. Zhang, S. Lu, F. Zhan, and Y. Yu (2021) Blind image super-resolution via contrastive representation learning. arXiv preprint arXiv:2107.00708. Cited by: §2.1.
  • [87] P. Zhang, B. Zhang, D. Chen, L. Yuan, and F. Wen (2020) Cross-domain correspondence learning for exemplar-based image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5143–5153. Cited by: §1, §1, §2.1, §2.2, §2.3, Fig. 4, §3.1, §3.1, §3.3, §3.4, TABLE I, TABLE II, §4.1, §4.1, §4.1, §4.2, §4.2, §4.2, §4.3, §4.3, §4.5, TABLE III, TABLE IV, TABLE V, TABLE VII.
  • [88] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 586–595. Cited by: §4.1.
  • [89] B. Zhao, L. Meng, W. Yin, and L. Sigal (2019) Image generation from layout. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8584–8593. Cited by: §2.1.
  • [90] H. Zheng, Z. Lin, J. Lu, S. Cohen, J. Zhang, N. Xu, and J. Luo (2020) Semantic layout manipulation with high-resolution sparse attention. arXiv preprint arXiv:2012.07288. Cited by: §1, §2.1, §2.2, Fig. 9, §4.1.
  • [91] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba (2017) Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 633–641. Cited by: TABLE II, §4.1, §4.3, TABLE III, TABLE V, TABLE VI, TABLE VIII.
  • [92] T. Zhou, P. Krahenbuhl, M. Aubry, Q. Huang, and A. A. Efros (2016) Learning dense correspondence via 3d-guided cycle consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 117–126. Cited by: §2.3.
  • [93] X. Zhou, B. Zhang, T. Zhang, P. Zhang, J. Bao, D. Chen, Z. Zhang, and F. Wen (2021) CoCosNet v2: full-resolution correspondence learning for image translation. In CVPR. Cited by: §1, §1, §2.1, Fig. 4, §3.1, TABLE I, TABLE II, §4.2, §4.2, §4.2, §4.3, §4.5, TABLE III, TABLE IV, TABLE V, TABLE VII.
  • [94] J. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros (2016) Generative visual manipulation on the natural image manifold. In ECCV, pp. 597–613. Cited by: §2.2.
  • [95] P. Zhu, R. Abdal, Y. Qin, and P. Wonka (2020) SEAN: image synthesis with semantic region-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5104–5113. Cited by: §1, §2.1, TABLE I, TABLE II, §4.2, §4.2, §4.2, §4.3, TABLE IV.
  • [96] Z. Zhu, Z. Xu, A. You, and X. Bai (2020) Semantically multi-modal image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5467–5476. Cited by: TABLE I, TABLE II, §4.2, §4.2, §4.2.