Semantic image synthesis is a powerful technique for generating images with intuitive control using spatial semantic layouts. A drawback is that most existing techniques require substantial training data in source and target domains for high-quality outputs. Even worse, annotations of pixel-wise labels (e.g., semantic masks) are quite costly.
In this paper, we present the first method for few-shot semantic image synthesis, assuming that we can utilize a large amount of unlabeled data with only a few labeled data of the target domain. Imagine that you have a large dataset of car or cat photos, but only a single annotated pair is available (Figure 1, the 2nd and 3rd rows). In this scenario, we utilize the state-of-the-art generative adversarial network (GAN), StyleGAN [DBLP:conf/cvpr/KarrasLA19, DBLP:conf/cvpr/KarrasLAHLA20], pre-trained using the unlabeled dataset. Namely, we achieve high-quality image synthesis by exploring StyleGAN’s latent space via GAN inversion. What is challenging here is that, although common GAN inversion techniques [DBLP:conf/iccv/AbdalQW19, DBLP:conf/cvpr/AbdalQW20] assume that test inputs belong to the same domain as GAN’s training data (e.g., facial photographs), our test and training data are in different domains, i.e., semantic layouts and photographs. How to invert the input in a different domain into GAN’s latent space is an open question, as introduced in the latest survey [xia2021gan].
To bridge the domain gaps for the first time, we construct a mapping between the semantics predefined in the few-shot examples and StyleGAN’s latent space. Inspired by the fact that pixels with the same semantics tend to have similar StyleGAN features [DBLP:conf/cvpr/CollinsBPS20], we generate pseudo semantic masks from random noise in StyleGAN’s latent space via simple nearest-neighbor matching. This way, we can draw an unlimited number of training pairs by only feeding random noise to the pre-trained StyleGAN generator. After integrating an encoder on top of the fixed StyleGAN generator, we then train the encoder for controlling the generator using the pseudo-labeled data in a supervised fashion. Although our pseudo semantic masks might be too noisy or coarse for the previous pixel-aligned approach [DBLP:conf/cvpr/Park0WZ19], our method works well with such masks thanks to the tolerance to misalignment. Our approach integrates semantic layout control into pre-trained StyleGAN models publicly available on the Web [awesomeStyleGAN2], via pseudo labeling even from a single annotated pair with not only a dense mask but also sparse scribbles or landmarks.
In summary, our major contributions are three-fold:
We explore a novel problem of few-shot semantic image synthesis, where users can synthesize diverse, high-quality images in the target domains even from very few and rough semantic layouts provided during training.
We propose a simple yet effective method for training a StyleGAN encoder for semantic image synthesis in few-shot scenarios, via pseudo sampling and labeling based on the StyleGAN prior, without hyperparameter tuning for complicated loss functions.
We demonstrate that our method significantly outperforms the existing methods w.r.t. layout fidelity and visual quality via extensive experiments on various datasets.
2 Related Work
There are various image-to-image (I2I) translation methods suitable for semantic image synthesis; the goals are, e.g., to improve image quality [DBLP:conf/iccv/ChenK17, DBLP:conf/nips/LiuYS0L19, DBLP:conf/cvpr/Park0WZ19, DBLP:conf/cvpr/000500TS20, DBLP:conf/mm/0005BS20], generate multi-modal outputs [DBLP:conf/cvpr/Park0WZ19, DBLP:conf/iccv/LiZM19, DBLP:conf/cvpr/ZhuXYB20, DBLP:journals/cgf/EndoK20], and simplify input annotations using bounding boxes [DBLP:conf/cvpr/ZhaoMYS19, DBLP:conf/iccv/SunW19, DBLP:conf/cvpr/LiCGYWL20]. However, all of these methods require large amounts of training data of both source and target domains and thus are unsuitable for our few-shot scenarios.
FUNIT [DBLP:conf/iccv/0001HMKALK19] and SEMIT [DBLP:conf/cvpr/WangKG0K20] are recently proposed methods for “few-shot” I2I translation among different classes of photographs (e.g., dog, bird, and flower). However, their meaning of “few-shot” is quite different from ours; they mean that only a few target class data are available at test time, but assume sufficient data of both source and target classes at training time (with a difference in whether the image class labels are fully available [DBLP:conf/iccv/0001HMKALK19] or not [DBLP:conf/cvpr/WangKG0K20]). In contrast, we assume only a few source data, i.e., ground-truth (GT) semantic masks, at training time. These “few-shot” I2I translation methods do not work at all in our settings, as shown in Figure 2.
Benaim and Wolf [DBLP:conf/nips/BenaimW18] presented a one-shot unsupervised I2I translation framework for the same situation as ours. However, their “unpaired” approach struggles to handle semantic masks, which have less distinctive features than photographs (see Figure 2). Moreover, their trained model has low generalizability because it is specialized for the single source image provided during training. In other words, their method needs to train a model for each test input, while our method does not. Table 1 summarizes the differences in problem settings among the methods.
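| Method | Training pairs | Source data (training) | Target data (training) | Target data (test time) |
|---|---|---|---|---|
| SMIS [DBLP:conf/cvpr/IsolaZZE17, DBLP:conf/cvpr/Park0WZ19, DBLP:conf/nips/LiuYS0L19] | paired | large | large | none |
| Benaim and Wolf [DBLP:conf/nips/BenaimW18] | unpaired | small | large | none |
| Few-shot SMIS (ours) | paired | small | large | none |

Table 1: Comparison of problem settings among semantic image synthesis (SMIS), image-to-image translation (I2I), and ours.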
Latent space manipulation
Recent GAN inversion (e.g., Image2StyleGAN [DBLP:conf/iccv/AbdalQW19, DBLP:conf/cvpr/AbdalQW20]) can control GAN outputs by inverting given images into GAN’s latent space. There have also been many attempts to manipulate inverted codes in disentangled latent spaces [DBLP:journals/tog/ChiuKLIY20, gansteerability, DBLP:conf/cvpr/ShenGTZ20, DBLP:journals/corr/abs-2007-06600, DBLP:journals/corr/abs-2004-02546]. However, inverting semantic masks into a latent space defined by photographs is not straightforward because how to measure the discrepancy between the two different domains (i.e., semantic masks and photographs) is an open question [xia2021gan]. Note that we cannot use pre-trained segmentation networks in our few-shot scenarios. Our method is the first attempt at GAN inversion of semantic masks into StyleGAN’s latent space defined by photographs.
Few-shot semantic image synthesis
To the best of our knowledge, there is no other few-shot method dedicated to semantic image synthesis. An alternative approach might be to use few-shot semantic segmentation [DBLP:conf/bmvc/DongX18, DBLP:conf/iccv/WangLZZF19, DBLP:conf/mm/LiuCLGCT20, DBLP:conf/aaai/TianWQWSG20, WangECCV20, YangECCV20] to annotate unlabeled images to train image-to-image translation models. In recent few-shot semantic segmentation methods based on a meta-learning approach, however, training episodes require large numbers of labeled images of various classes other than target classes to obtain common knowledge. Therefore, this approach is not applicable to our problem setting.
3 Few-shot Semantic Image Synthesis
3.1 Problem setting
Our goal is to accomplish semantic image synthesis via semi-supervised learning with $N_u$ unlabeled images and $N_l$ labeled pairs, both in the same target domain, where $N_l \ll N_u$. In particular, we assume few-shot scenarios, setting $N_l = 1$ or $5$ in our results. A labeled pair consists of a one-hot semantic mask $x \in \{0,1\}^{C \times W \times H}$ (where $C$, $W$, and $H$ are the number of classes, width, and height) and its GT RGB image $y \in \mathbb{R}^{3 \times W \times H}$. A semantic mask can be a dense map pixel-aligned to $y$ or a sparse map (e.g., scribbles or landmarks). In a sparse map, each scribble or landmark has a unique class label, whereas unoccupied pixels have an “unknown” class label. Hereafter we denote the labeled dataset as $\mathcal{D}_l$ and the unlabeled dataset as $\mathcal{D}_u$.
The core of our method is to find appropriate mappings between semantics defined by a few labeled pairs and StyleGAN’s latent space defined by an unlabeled dataset. Specifically, we first extract a feature vector representing each semantic class, which we refer to as a representative vector, and then find matchings with StyleGAN’s feature map via ($k$-)nearest-neighbor search. Such matchings enable pseudo labeling, i.e., obtaining pseudo semantic masks from random noise in StyleGAN’s latent space, which are then used to train an encoder for controlling the pre-trained StyleGAN generator. A similar approach is the prototyping used in recent few-shot semantic segmentation [DBLP:conf/bmvc/DongX18, DBLP:conf/iccv/WangLZZF19, YangECCV20]. Our advantage is that our method suffices with unsupervised training of StyleGAN models, whereas the prototyping requires supervised training of feature extractors (e.g., VGG [DBLP:journals/corr/SimonyanZ14a]).
Our pseudo semantic masks are often noisy and distorted (see Figure 3) and are thus inadequate for conventional approaches to semantic image synthesis or image-to-image translation, which require pixel-wise correspondence. However, even from such low-quality pseudo semantic masks, we can synthesize high-quality images with spatial layout control by utilizing the pre-trained StyleGAN generator. This is because the StyleGAN generator only requires latent codes that encode spatially global information.
As an encoder for generating such latent codes, we adopt the Pixel2Style2Pixel (pSp) encoder [DBLP:journals/corr/abs-2008-00951]. The inference process is the same as that of pSp; from a semantic mask, the encoder generates latent codes that are then fed to the fixed StyleGAN generator to control the spatial layout. We can optionally change or fix latent codes that control local details of the output images. Please refer to Figure 3 in the pSp paper [DBLP:journals/corr/abs-2008-00951] for more details.
Hereafter we explain the pseudo labeling process and the training procedure with the pseudo semantic masks.
3.3 Pseudo labeling
We elaborate on how to calculate the representative vectors and pseudo labeling, for which we propose different approaches to dense and sparse semantic masks.
3.3.1 Dense pseudo labeling
Figure 4 illustrates the pseudo labeling process for dense semantic masks. We first extract StyleGAN’s feature maps corresponding to the semantic masks in the labeled dataset $\mathcal{D}_l$. If pairs of semantic masks $x$ and GT RGB images $y$ are available in $\mathcal{D}_l$, we first invert $y$ into StyleGAN’s latent space via optimization and then extract the feature map via forward propagation. Otherwise, we feed one or a few noise vectors to the pre-trained StyleGAN generator, extract the feature maps and synthesized images, and manually annotate the synthesized images to create semantic masks. Next, we extract a representative vector $v_c$ for each semantic class $c$ from the pairs of extracted feature maps and semantic masks, following the approach by Wang et al. [DBLP:conf/iccv/WangLZZF19] for prototyping. Specifically, we apply masked average pooling to the feature map $F \in \mathbb{R}^{Z \times W' \times H'}$ (where $Z$, $W'$, and $H'$ are the number of channels, width, and height) using a resized semantic mask $x' \in \{0,1\}^{C \times W' \times H'}$, and then average over the $N_l$ pairs in $\mathcal{D}_l$:

$$v_c = \frac{1}{N_l} \sum_{i=1}^{N_l} \frac{\sum_{(p,q)} F_i^{(p,q)} \, \mathbb{1}\left[ x_i'^{(c,p,q)} = 1 \right]}{\sum_{(p,q)} \mathbb{1}\left[ x_i'^{(c,p,q)} = 1 \right]}$$

where $(p,q)$ denote pixel positions, and $\mathbb{1}[\cdot]$ is the indicator function that returns 1 if the argument is true and 0 otherwise.
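The masked average pooling described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function name `representative_vectors` and the flattened list layout are hypothetical, and, as an added assumption, a class that is absent from a labeled pair is skipped for that pair instead of contributing a zero-by-zero division.

```python
def representative_vectors(feature_maps, label_maps, num_classes):
    """Masked average pooling per class, averaged over labeled pairs.

    feature_maps: one entry per labeled pair; each entry is a list of
                  per-pixel feature vectors (pixels flattened row by row).
    label_maps:   one entry per labeled pair; each entry is a list of integer
                  class ids of the resized mask, flattened the same way.
    Returns {class_id: vector}; classes that never appear are omitted.
    """
    sums = {}    # class id -> sum of per-pair pooled vectors
    counts = {}  # class id -> number of pairs that contain the class
    for features, labels in zip(feature_maps, label_maps):
        for c in range(num_classes):
            pixels = [f for f, lab in zip(features, labels) if lab == c]
            if not pixels:
                continue  # assumption: skip pairs in which the class is absent
            # Average the feature vectors of all pixels labeled c.
            pooled = [sum(channel) / len(pixels) for channel in zip(*pixels)]
            if c in sums:
                sums[c] = [s + p for s, p in zip(sums[c], pooled)]
            else:
                sums[c] = pooled
            counts[c] = counts.get(c, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


# One labeled pair with four pixels, two channels, and two classes.
features = [[[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]]
labels = [[0, 0, 1, 1]]
vectors = representative_vectors(features, labels, num_classes=2)
```

In a tensor framework the same computation is usually written as a masked sum divided by the mask area; the list-based form is used here only to keep the sketch dependency-free.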
After obtaining representative vectors, we generate pseudo semantic masks for training our encoder. Every time we feed random noise to the pre-trained StyleGAN generator, we extract a feature map $F$ and then calculate a semantic mask via nearest-neighbor matching between the representative vectors and the pixel-wise vectors in $F$. In all of our results, feature maps are extracted at a fixed resolution from the layer closest to the output layer of the StyleGAN generator. The class label $c^{(p,q)}$ for the pixel at position $(p,q)$ is calculated as follows:

$$c^{(p,q)} = \arg\max_{c \in \{1, \dots, C\}} \cos\left( v_c, F^{(p,q)} \right)$$

where $v_c$ is the representative vector of class $c$. As a similarity measure, we adopt the cosine similarity $\cos(\cdot, \cdot)$, inspired by the finding [DBLP:conf/cvpr/CollinsBPS20] that StyleGAN’s feature vectors having the same semantics form clusters on a unit sphere. Finally, we enlarge the semantic masks to the size of the synthesized images. Figure 3(a) shows examples of pseudo labels for dense semantic masks.
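The nearest-neighbor matching for dense masks amounts to choosing, for every pixel, the class whose representative vector has the highest cosine similarity. The sketch below illustrates this under simplifying assumptions: the names `cosine_similarity` and `dense_pseudo_labels` are hypothetical, pixels are given as a flat list, and the final enlargement of the mask to the image size is omitted.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def dense_pseudo_labels(features, rep_vectors):
    """Assign each pixel the class with the most similar representative vector.

    features:    list of per-pixel feature vectors from the generator.
    rep_vectors: {class_id: representative vector}.
    """
    return [
        max(rep_vectors, key=lambda c: cosine_similarity(rep_vectors[c], f))
        for f in features
    ]


rep_vectors = {0: [1.0, 0.0], 1: [0.0, 1.0]}
pixels = [[2.0, 0.1], [0.2, 5.0], [1.0, 0.9]]
labels = dense_pseudo_labels(pixels, rep_vectors)  # [0, 1, 0]
```

Because cosine similarity ignores vector length, the pixel `[2.0, 0.1]` and a shorter vector in the same direction receive the same label, which matches the motivation of clustering on a unit sphere.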
3.3.2 Sparse pseudo labeling
As described in Subsection 3.1, sparse semantic masks have a class label for each annotation (e.g., a scribble or landmark) and an “unknown” label. Here we adopt a pseudo-labeling approach different from the dense version for the following reason. We want to retain the spatial sparsity in pseudo semantic masks so that they resemble genuine ones as much as possible. However, if we calculate nearest neighbors for a representative vector of each annotation as done in the dense version, the resultant pseudo masks might form dense clusters of semantic labels. Instead, as a simple heuristic, we consider that each pixel in each annotation has its own representative vector and calculate a one-to-one matching between each annotated pixel and each pixel-wise vector. In this case, however, many annotated pixels might match an identical pixel-wise vector (i.e., many-to-one mapping), which results in fewer samples in pseudo semantic masks. Therefore, we calculate top-$k$ matches (i.e., $k$-nearest-neighbors) instead of the one-nearest-neighbor to increase matchings. In the case of many-to-one mappings, we assign the class label of the annotation that has the largest cosine similarity. To avoid outliers, we discard the matchings whose cosine similarities are lower than a threshold $t$ and assign the “unknown” label instead. Figure 3(b) shows examples of pseudo labels for sparse semantic masks.
We use the same values of $k$ and $t$ in all of our results in this paper. The supplementary material contains pseudo-labeled results with different parameters.
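The sparse variant can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: `sparse_pseudo_labels` and the `UNKNOWN` id are hypothetical names, and the values of `k` and `t` in the usage lines are arbitrary placeholders rather than the settings used in the paper.

```python
import math

UNKNOWN = -1  # hypothetical id for the "unknown" class


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def sparse_pseudo_labels(features, annotated, k, t):
    """Top-k matching between annotated pixels and a new feature map.

    features:  per-pixel feature vectors of the newly sampled image.
    annotated: list of (class_id, vector), one entry per annotated pixel of
               the labeled example (each pixel is its own representative).
    k:         number of nearest neighbors kept per annotated pixel.
    t:         matches with cosine similarity below t are discarded.
    """
    labels = [UNKNOWN] * len(features)
    best = [None] * len(features)  # best similarity accepted so far per pixel
    for class_id, vector in annotated:
        sims = [cosine_similarity(vector, f) for f in features]
        top_k = sorted(range(len(features)), key=lambda i: sims[i], reverse=True)[:k]
        for i in top_k:
            if sims[i] < t:
                continue  # outlier: leave the pixel as "unknown"
            # Many-to-one case: the annotation with the largest similarity wins.
            if best[i] is None or sims[i] > best[i]:
                best[i] = sims[i]
                labels[i] = class_id
    return labels


pixels = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
annotated = [(0, [1.0, 0.0]), (1, [0.0, 1.0])]
labels = sparse_pseudo_labels(pixels, annotated, k=2, t=0.5)  # [0, 0, 1, -1]
```

Pixels that no annotation selects, or whose best match falls below `t`, keep the “unknown” label, which is what preserves the sparsity of the pseudo mask.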
3.4 Training procedure
Figure 6 illustrates the learning process of our encoder. First, we explain the forward pass in the training phase. We sample random noise $z$ from a normal distribution $\mathcal{N}(0, I)$ and obtain latent codes $\{w_i\}_{i=1}^{L}$ (where $L$ is the number of layers to input/output latent codes) via the pre-trained StyleGAN’s mapping network $f$. We feed the latent codes to the pre-trained StyleGAN generator to synthesize an image while extracting the intermediate layer’s feature map. From this feature map and representative vectors, we create a pseudo semantic mask, which is then fed to our encoder to extract latent codes $\{\hat{w}_i\}_{i=1}^{L}$.
In the backward pass, we optimize the encoder using the following loss function:

$$\mathcal{L} = \sum_{i=1}^{L} \left\| w_i - \hat{w}_i \right\|_2^2$$

This loss function indicates that our training is quite simple because backpropagation does not go through the pre-trained StyleGAN generator. Algorithm 1 summarizes the whole process of training. In the supplementary material, we also show the intermediate pseudo semantic masks and reconstructed images obtained during the training iterations.
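The data flow of one training iteration can be sketched with placeholder components. Everything passed into `training_forward_pass` below is a hypothetical stand-in (plain callables instead of the StyleGAN mapping network, generator, and pSp encoder), and the loss is assumed to be a sum of squared differences between latent codes; gradient computation and the optimizer step are left out.

```python
def latent_l2_loss(target_codes, predicted_codes):
    """Sum of squared differences over all layers and latent dimensions."""
    return sum(
        (w - w_hat) ** 2
        for target, predicted in zip(target_codes, predicted_codes)
        for w, w_hat in zip(target, predicted)
    )


def training_forward_pass(sample_noise, mapping, generator_features,
                          pseudo_label, encoder):
    """One forward pass; only `encoder` would receive gradients in training."""
    z = sample_noise()                # random noise
    w = mapping(z)                    # target latent codes, one per layer
    features = generator_features(w)  # fixed generator, intermediate features
    mask = pseudo_label(features)     # pseudo semantic mask
    w_hat = encoder(mask)             # latent codes predicted from the mask
    return latent_l2_loss(w, w_hat)


# Toy stand-ins that only demonstrate the order of operations.
loss = training_forward_pass(
    sample_noise=lambda: [0.5, -0.5],
    mapping=lambda z: [[z[0], z[1]], [2 * z[0], 2 * z[1]]],
    generator_features=lambda w: [w[0], w[1]],
    pseudo_label=lambda feats: [0 if f[0] > 0 else 1 for f in feats],
    encoder=lambda mask: [[0.0, 0.0], [1.0, -1.0]],
)
```

Since the loss compares latent codes directly, the generator appears only on the path that produces the pseudo mask, so no gradient has to flow through it.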
We conducted experiments to evaluate our method. The supplementary material contains implementation details.
We used public StyleGAN2 [DBLP:conf/cvpr/KarrasLAHLA20] models pre-trained with FFHQ (human faces) [DBLP:conf/cvpr/KarrasLA19, DBLP:conf/cvpr/KarrasLAHLA20], LSUN (car, cat, and church) [DBLP:journals/corr/YuZSSX15, DBLP:conf/cvpr/KarrasLAHLA20], ukiyo-e [ukiyoePinkney], and anime face images [animeGokaslan]. To evaluate our method quantitatively, we used the pre-processed CelebAMask-HQ dataset [DBLP:conf/cvpr/ZhuAQW20], which contains face images and corresponding semantic masks (namely, 2,000 for test and 28,000 for training). In addition, we extracted face landmarks as sparse annotations using OpenPose. The numbers of “ground-truth” face landmarks are reduced to 1,993 for test and 27,927 for training because OpenPose sometimes failed. We also used LSUN church [DBLP:journals/corr/YuZSSX15] (300 in a validation set and 1,000 in a training set) for the quantitative evaluation. Because this dataset does not contain semantic masks, we prepared them using the scene parsing model [DBLP:conf/cvpr/ZhouZPFB017] consisting of the ResNet101 encoder [DBLP:conf/cvpr/HeZRS16] and the UPerNet101 decoder [DBLP:conf/eccv/XiaoLZJS18]. For our experiments of $N_l$-shot learning, we selected $N_l$ paired images from the training sets, while the full-shot version uses all of them.
4.2 Qualitative results
Figure 7 compares the results generated from semantic masks of the CelebAMask-HQ dataset in a one-shot setting. Figure 8 also shows the results generated from sparse landmarks in a five-shot setting. The pixel-aligned approaches, SPADE [DBLP:conf/cvpr/Park0WZ19] and pix2pixHD++ [DBLP:conf/cvpr/Park0WZ19], generate images faithful to the given layouts, but the visual quality is very low. Meanwhile, pSp [DBLP:journals/corr/abs-2008-00951], which uses pre-trained StyleGANs, is able to generate realistic images. Although pSp can optionally generate multi-modal outputs, it ignores the input layouts due to over-fitting to the too few training examples. In contrast, our method produces photorealistic images corresponding to the given layouts. We can see the same tendency in the comparison on the LSUN church dataset in Figure 9.
The benefit of our few-shot learning approach is that it does not need much labeled data. We therefore validate the applicability of our method to various domains where annotations are hardly available in public. Figures 1 and 10 show car, cat, and ukiyo-e images generated from semantic masks and scribbles. Again, pSp does not reflect the input layouts in the results, whereas our method controls output semantics accordingly (e.g., the cats’ postures and the ukiyo-e hairstyles). Interestingly, our method works well with cross lines as inputs, which specify the orientations of anime faces (Figure 11).
Finally, we conducted a comparison with the pixel-aligned approach using our pseudo labeling technique. Figure 12 shows the results of SPADE, pix2pixHD++, and ours, which were trained up to 100,000 iterations with the appropriate loss functions. Because our pseudo semantic masks are often misaligned, the pixel-aligned approach failed to learn photorealistic image synthesis, whereas ours succeeded. Please refer to the supplementary material for more qualitative results.
4.3 Quantitative results
We quantitatively evaluated competitive methods and ours with respect to layout fidelity and visual quality. For each dataset, we first generate images from test data (i.e., semantic masks/landmarks in CelebA-HQ and semantic masks in LSUN church) using each method and then extract the corresponding semantic masks/landmarks for evaluation, as done in Subsection 4.1. As evaluation metrics for parsing, we used Intersection over Union (IoU) and accuracy. As for IoU, we used mean IoU (mIoU) for CelebA-HQ. For LSUN church, we used frequency weighted IoU (fwIoU) because our “ground-truth” (GT) semantic masks synthesized by [DBLP:conf/cvpr/ZhouZPFB017] often contain small noisy-labeled regions, which strongly affect mIoU. As a landmark metric, we computed RMSE of Euclidean distances between landmarks of generated and GT images. If landmarks cannot be detected in generated images, we counted them as N/A. We used Fréchet Inception Distance (FID) as a metric for visual quality.
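For reference, the parsing and landmark metrics named above can be computed as in the following sketch. It uses the standard textbook definitions (mIoU averaged over classes that occur in either mask, fwIoU weighted by ground-truth class frequency); the function names are hypothetical, and the exact conventions of the evaluation code used in the experiments may differ.

```python
import math


def segmentation_metrics(pred, gt, num_classes):
    """Pixel accuracy, mean IoU, and frequency weighted IoU.

    pred, gt: non-empty flat lists of integer class ids of equal length.
    """
    inter = [0] * num_classes      # pixels where prediction and GT agree
    pred_area = [0] * num_classes  # pixels predicted as each class
    gt_area = [0] * num_classes    # GT pixels of each class
    for p, g in zip(pred, gt):
        pred_area[p] += 1
        gt_area[g] += 1
        if p == g:
            inter[p] += 1
    total = len(gt)
    ious = {}
    for c in range(num_classes):
        union = pred_area[c] + gt_area[c] - inter[c]
        if union > 0:  # ignore classes absent from both masks
            ious[c] = inter[c] / union
    return {
        "accuracy": sum(inter) / total,
        "mIoU": sum(ious.values()) / len(ious),
        "fwIoU": sum(gt_area[c] / total * iou for c, iou in ious.items()),
    }


def landmark_rmse(pred_points, gt_points):
    """Root mean square of Euclidean distances between paired 2D landmarks."""
    squared = [
        (px - gx) ** 2 + (py - gy) ** 2
        for (px, py), (gx, gy) in zip(pred_points, gt_points)
    ]
    return math.sqrt(sum(squared) / len(squared))


scores = segmentation_metrics([0, 0, 1, 1, 1, 2], [0, 0, 0, 1, 1, 1], num_classes=3)
```

In this toy example a single mislabeled pixel of a class that never occurs in the ground truth pulls mIoU down sharply while leaving fwIoU much less affected, which illustrates why fwIoU is the more robust choice when GT masks contain small noisy regions.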
Table 2 shows the quantitative comparison in few-shot settings, except for the bottom row, where all labeled images in the training datasets were used. In the five-shot setting, the pixel-aligned approaches (i.e., pix2pixHD++ [DBLP:conf/cvpr/Park0WZ19] and SPADE [DBLP:conf/cvpr/Park0WZ19]) record consistently high IoU, accuracy, and FID scores. These scores indicate that the output images are aligned to the semantic masks relatively better but the image quality is lower, as we can see from the qualitative results. The larger numbers of undetected faces (denoted as “N/A”) also indicate low visual quality. We confirmed that our pseudo labeling technique does not yield consistent improvements for the pixel-aligned approaches (indicated with “*”). In contrast, ours consistently yields lower FID scores than the pixel-aligned approaches and pSp [DBLP:journals/corr/abs-2008-00951] (even in the full-shot setting) and is overall improved by increasing $N_l$ from 1 to 5. Ours also outperforms pSp in the few-shot settings w.r.t. all the metrics except for N/A. The qualitative full-shot results are also included in the supplementary material.
Here we summarize the pros and cons of the related methods and ours. The pixel-aligned approach [DBLP:conf/cvpr/Park0WZ19] preserves spatial layouts specified by the semantic masks but fails to learn from our noisy pseudo labels due to the sensitivity to misaligned semantic masks. In contrast, ours on top of pSp [DBLP:journals/corr/abs-2008-00951] is tolerant of misalignment and thus works well with our pseudo labels. However, it is still challenging to reproduce detailed input layouts and to handle layouts that StyleGAN cannot generate. A future direction is to overcome these limitations by, e.g., directly manipulating hidden units corresponding to semantics of input layouts [DBLP:conf/iclr/BauZSZTFT19]. Another limitation is that we cannot handle semantic classes unseen in the few-shot examples. Figure 13 shows such an example with a more challenging dataset, ADE20K [DBLP:conf/cvpr/ZhouZPFB017]. Please refer to the supplementary material for more results.
It is also worth mentioning that our method outperformed the full-shot version of pSp in FID. This is presumably because our pseudo sampling could better explore StyleGAN’s latent space defined by a large unlabeled dataset (e.g., 70K images in FFHQ and 48M images in LSUN church) than pSp, which uses limited labeled datasets for training the encoder.
In this paper, we have proposed a simple yet effective method for few-shot semantic image synthesis for the first time. To compensate for the lack of pixel-wise annotation data, we generate pseudo semantic masks via ($k$-)nearest-neighbor mapping between the feature vector of the pre-trained StyleGAN generator and each semantic class in the few-shot labeled data. In each training iteration, we can generate a pseudo label from random noise to train an encoder [DBLP:journals/corr/abs-2008-00951] for controlling the pre-trained StyleGAN generator using a simple L2 loss. The experiments with various datasets demonstrated that our method can synthesize higher-quality images with spatial control than competitive methods and works well even with sparse semantic masks such as scribbles and landmarks.
Appendix A Implementation Details
We implemented our method with PyTorch and ran our code on PCs equipped with GeForce GTX 1080 Ti. We used StyleGAN2 [DBLP:conf/cvpr/KarrasLAHLA20] as a generator and pSp [DBLP:journals/corr/abs-2008-00951] as an encoder. We trained the encoder using the Ranger optimizer [DBLP:journals/corr/abs-2008-00951] with a learning rate of 0.0001. The batch size (i.e., the number of pseudo-labeled images per iteration) was set to 2. Training ran for 100,000 iterations and took at most a day. Regarding our multi-modal results, please refer to Section D in this supplementary material.
Appendix B Sparse Pseudo Labeling with Different Parameters
Figure 14 shows the sparsely pseudo-labeled results (right) for the StyleGAN sample (lower left) using different parameters $k$ and $t$ with a one-shot training pair (upper left). As explained in Subsubsection 3.3.2 in our paper, $k$ is used for top-$k$ matching between per-pixel feature vectors and representative vectors, whereas $t$ is a threshold of cosine similarity. For all of our other results, we chose $k$ to reduce the number of mismatched pixels and $t$ to reduce outliers.
Appendix C Images Reconstructed from Pseudo Semantic Masks During Training Procedure
Figures 15, 16, 17, 18, 19, and 20 show the intermediate outputs in one-shot settings during training iterations, as explained in Subsection 3.4 of our paper. For each set of results, we fed random noise vectors to the pre-trained StyleGAN generator to obtain synthetic images (top row) and feature vectors, from which we calculated pseudo semantic masks (middle row). We then used the pseudo masks to train the pSp encoder to generate latent codes for reconstructing images (bottom row). It can be seen that the layouts of the bottom-row images reconstructed from the middle-row pseudo semantic masks gradually become close to those of the top-row StyleGAN samples as the training iterations increase.
Appendix D Multi-modal Results
Figure 21 demonstrates that our method can generate multi-modal results. To obtain multi-modal outputs at test time, we follow the same approach as pSp [DBLP:journals/corr/abs-2008-00951]; we feed latent codes encoded from an input layout to the first $l$ layers of the generator and random noise vectors to the other layers. While we used a single fixed value of $l$ for the other results in our paper and this supplementary material, here we used different values of $l$ to create various outputs. Specifically, we set $l$ to 8, 5, 7, 5, 5, 5, and 5 from the top rows in Figure 21. As explained in the pSp paper [DBLP:journals/corr/abs-2008-00951], smaller $l$ affects coarser-scale styles whereas larger $l$ changes finer-scale ones.
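The layer-wise mixing used for multi-modal outputs can be illustrated with a short sketch. The function `mix_latent_codes` and the `sample_code` callable are hypothetical; in practice the random codes would come from the pre-trained generator's own sampling procedure, which is not modeled here.

```python
def mix_latent_codes(encoded_codes, num_encoded_layers, sample_code):
    """Keep encoder codes for the first layers and resample the rest.

    encoded_codes:      per-layer latent codes predicted from the input layout.
    num_encoded_layers: number of leading layers that keep the encoded codes.
    sample_code:        callable returning one randomly sampled latent code.
    """
    if not 0 <= num_encoded_layers <= len(encoded_codes):
        raise ValueError("num_encoded_layers must be between 0 and the layer count")
    kept = [list(code) for code in encoded_codes[:num_encoded_layers]]
    resampled = [sample_code() for _ in encoded_codes[num_encoded_layers:]]
    return kept + resampled


# Four layers with one-dimensional codes; the last layer is resampled.
encoded = [[1.0], [2.0], [3.0], [4.0]]
mixed = mix_latent_codes(encoded, 3, sample_code=lambda: [0.0])
```

Calling the function repeatedly with a stochastic `sample_code` yields different outputs that share the layout carried by the leading layers.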
Appendix E Additional Qualitative Results
Appendix F Limitation
Figure 26 and Table 3 show the results with a more challenging dataset, ADE20K [DBLP:conf/cvpr/ZhouZPFB017], which consists of 20,210 training and 2,000 validation images and contains indoor and outdoor scenes with 150 semantic classes. We used the training set without semantic masks to pre-train StyleGAN. Although our method can generate plausible images for some scenes, it struggles to handle complex scenes with diverse and unseen semantic classes.