StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators

08/02/2021 · Rinon Gal et al.

Can a generative model be trained to produce images from a specific domain, guided only by a text prompt, without seeing any image? In other words: can an image generator be trained blindly? Leveraging the semantic power of large-scale Contrastive Language-Image Pre-training (CLIP) models, we present a text-driven method that allows shifting a generative model to new domains, without having to collect even a single image from those domains. We show that through natural language prompts and a few minutes of training, our method can adapt a generator across a multitude of domains characterized by diverse styles and shapes. Notably, many of these modifications would be difficult or outright impossible to reach with existing methods. We conduct an extensive set of experiments and comparisons across a wide range of domains. These demonstrate the effectiveness of our approach and show that our shifted models maintain the latent-space properties that make generative models appealing for downstream tasks.


1 Introduction

Code and videos available at: stylegan-nada.github.io/

The unprecedented ability of Generative Adversarial Networks (GANs) [goodfellow2014generative] to capture and model image distributions through a semantically-rich latent space has revolutionized countless fields. These range from image enhancement [yang2021gan, ledig2017photo] and editing [shen2020interpreting, harkonen2020ganspace] to, more recently, discriminative tasks such as classification and regression [xu2021generative, nitzan2021large].

Typically, the scope of these models is restricted to domains for which one can collect a large set of images. This requirement severely constrains their applicability. Indeed, in many cases (paintings by specific artists, imaginary scenes), there may not be sufficient data to train a GAN, or even any data at all.

Photo Fernando Botero Painting
Photo of a Church Cryengine render of New York City
Photo Painting by Salvador Dali
Figure 1: Examples of text-driven, out-of-domain generator adaptations induced by our method. The textual directions driving the change appear next to each set of generated images.

Recently, it has been shown that Vision-Language models [radford2021learning] encapsulate generic information that can bypass the need for collecting data. Moreover, these models can be paired with generative models to provide a simple and intuitive text-driven interface for image generation and manipulation [patashnik2021styleclip]. However, such works are built upon pre-trained generative models with fixed domains, limiting the user to in-domain generation and manipulation.

In this work, we present a text-driven method that enables out-of-domain generation. We introduce a training scheme that shifts the domain of a pre-trained model towards a new domain, using nothing more than a textual prompt. The domain shift is achieved by modifying the generator’s weights towards images aligned with the driving text. In Fig. 1, we demonstrate three examples of out-of-domain images generated by our method. In two of them, we modified generators trained on real human faces and cars to generate paintings of specific artistic styles. In the third, we modified a generator trained on churches to generate images of New York City. These models were trained blindly, without seeing any image of the target domain during training, i.e., in a zero-shot setting.

Leveraging CLIP for text-guided training is not trivial. The naïve approach, requiring generated images to maximize some CLIP-based classification score, often leads to adversarial solutions [goodfellow2014explaining] (see Sec. 5). Instead, we propose a novel dual-generator training approach. These generators are intertwined, sharing a joint latent space. One generator is kept frozen for the duration of the training, providing the context of the source domain across an infinite range of generated instances. The other generator is trained with the goal of shifting each generated instance, individually, along a textually-prescribed path in CLIP’s embedding space. To increase training stability for more drastic domain changes, we introduce a novel adaptive training approach that utilizes CLIP to identify the layers most relevant to the target domain at each training iteration and restricts training to those layers.

We demonstrate the effectiveness of our approach for a wide range of source and target domains. These include artistic styles, cross-species identity transfer, and significant shape changes. We compare our method to existing editing techniques and alternative few-shot approaches and show that it enables modifications which are beyond their scope – and it does so without any training data.

Finally, we show that our method maintains the appealing structure of the latent space. Our shifted generator not only inherits the editing capabilities of the original, but it can even re-use any off-the-shelf editing directions and models trained for the original domain.

2 Related Work

Text-guided synthesis

Vision-language tasks include language-based image retrieval, image captioning, visual question answering, and text-guided synthesis, among others. Typically, to solve these tasks, a cross-modal vision and language representation is learned [Desai2020VirTexLV, sariyildiz2020learning, Tan2019LXMERTLC, Lu2019ViLBERTPT, Li2019VisualBERTAS, Su2020VL-BERT:, Li2020UnicoderVLAU, Chen2020UNITERUI, Li2020OscarOA]. A common approach to learning such a representation is by training a transformer [NIPS2017_3f5ee243]. Recently, OpenAI introduced CLIP [radford2021learning], a powerful model that learns a joint vision-language representation. CLIP was trained on 400 million text-image pairs, collected from a variety of publicly available sources on the Internet, using a contrastive learning objective. CLIP is composed of two encoders, an image encoder and a text encoder, which aim to map their respective inputs into a joint, multi-modal embedding space. The representations learned by CLIP have been shown to be powerful enough to guide a variety of tasks, including image synthesis [murdock2021bigsleep, katherine2021vqganclip] and manipulation [patashnik2021styleclip].

In contrast, we present a novel approach in which a text prompt guides the training of an image generator, rather than the generation or manipulation of a specific image.

Training Generators with Limited Data

The goal of few-shot generative models is to mimic a rich and diverse data distribution using only a few images. Methods used to tackle such a task can be divided into two broad categories – those that seek to train a generator from scratch, and those that seek to leverage the diversity of a pre-trained generator and adapt it to a novel domain.

Among those that train a new generator, ‘few’ often denotes several hundred or thousand images, rather than the tens of thousands [karras2020analyzing] or even millions [brock2018large] required to train SoTA GANs. Such works typically employ data augmentations [tran2021data, zhao2020differentiable, zhao2020image, Karras2020ada] or empower the discriminator to better learn from existing data using auxiliary tasks [yang2021data, liu2020towards].

In the transfer-learning scenario, ‘few’ typically refers to significantly smaller numbers, with supervision ranging from several hundred to as few as five images. In some degenerate cases, generator adaptation can even be achieved with a single image [ojha2021few]. When training with extremely limited data, a primary concern is the need to stave off mode collapse or overfitting, and to successfully transfer the diversity of the source generator to the target domain. Multiple methods have been proposed to tackle these challenges. Some place restrictions on the space of trainable weights, either during training [mo2020freeze, Robb2020FewShotAO] or by mixing the weights of the fine-tuned model with those of the original [pinkney2020resolution]. Others introduce new parameters in order to control channel-wise statistics [noguchi2019image], leverage a miner network in order to steer GAN sampling towards suitable regions of the latent space [Wang_2020_CVPR], add novel regularization terms to the loss function [Tseng2021RegularizingGA, li2020few], or mix contrastive learning ideas with patch-level discrimination in order to maintain cross-domain consistency while adapting to a target style [ojha2021few].

Our work similarly seeks to adapt an existing generator, pre-trained on a large source domain. However, while previous methods deal with adapting generators with limited data, we do so in a zero-shot manner, i.e. without access to any training data. Additionally, unlike prior methods which constrain the space of trainable weights using fixed, hand-picked subsets, our method is adaptive: it automatically accounts for both the current state of the network at each optimization step and for the target class, without the need for any additional supervision.

3 Preliminaries

At the core of our approach are two components: StyleGAN [karras2020analyzing] and CLIP [radford2021learning]. In the following section, we provide a brief overview of StyleGAN and highlight the unique features most relevant to our work. We then discuss CLIP through the lens of StyleCLIP, a prior work which first suggested combining StyleGAN and CLIP as a tool for in-domain image editing.

3.1 StyleGAN

In recent years, StyleGAN and its multiple follow-ups [karras2019style, karras2020analyzing, Karras2020ada, karras2021aliasfree] have established themselves as the state-of-the-art unconditional image generators, owing to their ability to synthesize high-resolution images of unprecedented quality. The StyleGAN generator consists of two main components. The first is a mapping network, which converts a latent code z, sampled from a Gaussian distribution, into a vector w in a learned latent space W. These latent vectors are then fed into the second component, the synthesis network, where they control feature (or, equivalently, convolutional kernel) statistics across the different network layers. By traversing this intermediate latent space W, or by mixing different codes across different network layers, prior work demonstrated fine-grained control over semantic properties in generated images [shen2020interpreting, harkonen2020ganspace, 10.1145/3447648, patashnik2021styleclip]. However, such latent-space traversal is typically limited to in-domain modifications. That is, it is constrained to the manifold of images with properties that match the initial training set. In contrast, here we aim to shift the generator between domains, moving beyond latent-space editing and towards semantically-aware fine-tuning.
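
To make this two-stage structure concrete, the following Python sketch (our own toy illustration, not the actual StyleGAN code) shows how Gaussian codes z are mapped to latent vectors w, which are then replicated per layer and consumed by a stand-in synthesis network; all module types and sizes are placeholders.

```python
import torch

# Schematic sketch (NOT the actual StyleGAN implementation) of the two-stage
# design described above: a mapping network turns Gaussian codes z into latent
# vectors w, which then modulate every layer of a synthesis network.
class ToyStyleGenerator(torch.nn.Module):
    def __init__(self, z_dim=512, w_dim=512, n_layers=14):
        super().__init__()
        self.mapping = torch.nn.Sequential(            # z -> w
            torch.nn.Linear(z_dim, w_dim), torch.nn.LeakyReLU(0.2),
            torch.nn.Linear(w_dim, w_dim))
        self.synthesis = torch.nn.ModuleList(          # stand-in synthesis blocks
            [torch.nn.Linear(w_dim, w_dim) for _ in range(n_layers)])

    def forward(self, z):
        w = self.mapping(z)                            # latent code in W
        w_plus = w.unsqueeze(1).repeat(1, len(self.synthesis), 1)  # one copy per layer (W+)
        h = torch.zeros_like(w)
        for i, block in enumerate(self.synthesis):
            h = block(h + w_plus[:, i])                # each layer consumes its own code
        return h                                       # stand-in for the generated image

features = ToyStyleGenerator()(torch.randn(4, 512))
```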

3.2 StyleCLIP

In a recent work, Patashnik et al. [patashnik2021styleclip] combine the generative power of StyleGAN with the semantic knowledge of CLIP to discover editing directions within the latent space of a pre-trained GAN, using only a textual description of the desired change. They outline three approaches for leveraging the semantic power of CLIP:

The first, a latent optimization technique, uses standard back-propagation to modify a given latent code in a manner that minimizes the CLIP-space distance between a generated image and a given target text:

L_global = D_CLIP(G(w), t_target),    (1)

where G(w) is the image generated by feeding the latent code w to the generator G, t_target is the textual description of the target class, and D_CLIP is the CLIP-space cosine distance. We name this target-text distance loss the global CLIP loss. In Sec. 4.3, we use this latent-optimization approach to determine a subset of layers to optimize at each training iteration.
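
For illustration, the sketch below shows one way such a global CLIP loss can be computed with OpenAI's released CLIP package; the image preprocessing (resizing, normalization, dtype handling) is deliberately simplified and should not be taken as the paper's exact implementation.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI's CLIP package: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

def global_clip_loss(images, target_text):
    """Cosine distance in CLIP space between generated images and a target text
    (Eq. 1). `images` is assumed to be an NCHW tensor in CLIP's expected value
    range; resizing and dtype handling are simplified here."""
    images = F.interpolate(images, size=224, mode="bilinear", align_corners=False)
    img_feat = clip_model.encode_image(images.type(clip_model.dtype))
    txt_feat = clip_model.encode_text(clip.tokenize([target_text]).to(device))
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return 1.0 - (img_feat * txt_feat).sum(dim=-1).mean()  # 1 - cosine similarity
```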

In the second approach, the latent mapper, a network is trained to convert an input latent code to one which modifies a textually-described property in the generated image. This mapper is trained using the same global CLIP loss objective, i.e. it should produce codes that correspond to images which minimize the CLIP-space distance to a target text. For some drastic shape modifications, we find that training such a latent mapper can help improve results by identifying regions of latent space which produce better candidates for the target class. See Sec. 4.4 for more details.

The last approach aims to discover directions of meaningful change in the latent space of the GAN by determining which latent-code modifications induce an image-space change which is co-linear with the direction between two textual descriptors (denoting the source and desired target) in CLIP-space.

These approaches span a wide range of training and inference times, and vary greatly in expressiveness, but they all share a common property with other latent-space editing approaches: the modifications that they can apply to a given image are largely constrained to the domain of the pre-trained generator. As such, they can allow for changes in hairstyle or expression, or even convert a wolf to a lion if the generator has seen both, but they cannot convert a photo to a painting in the style of Raphael or turn a generator that produces only dogs into one which produces cats.

Figure 2: Overview of our training setup. We initialize two intertwined generators, G_frozen and G_train, using the weights of a generator pre-trained on images from a source domain (e.g. FFHQ [karras2019style]). The weights of G_frozen remain fixed throughout the process, while those of G_train are modified through optimization and an iterative layer-freezing scheme. The process shifts the domain of G_train according to a user-provided textual direction while maintaining a shared latent space.

4 Method

Our goal is to shift a pre-trained generator from a given source domain to a new target domain, using only textual prompts and no images of the target domain. As a source of supervision for the target domain, we use only a pre-trained CLIP model.

In order to approach this task, we ask ourselves two key questions: (1) How can we best distill the semantic information encapsulated in CLIP? and (2) How should we regularize the optimization process in order to avoid adversarial solutions or mode collapse? In the following section we will outline a training scheme and a set of losses that seek to answer both questions.

4.1 Network Architecture

We begin by outlining the architecture used during our training phase. At the core of our method are two intertwined generators, both utilizing the StyleGAN2 [karras2020analyzing] architecture. The generators share a mapping network, and thus the same latent space, such that the same latent code will initially produce the same image in both. We initialize both generators using the weights of a model pre-trained on a single source domain (e.g. human faces, dogs, churches or cars). Our goal is to change the domain of one of the paired generators, while keeping the other fixed as a reference. We dub these generators G_train and G_frozen, respectively.

We guide the change in the trainable generator using a set of CLIP-based losses and a layer-freezing scheme which increases training stability by adaptively determining the most relevant subset of layers to train at each iteration - and freezing the rest. Our training architecture is shown in Fig. 2. The losses and layer-freezing schemes are detailed below.
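
A minimal sketch of this initialization, assuming a pre-trained StyleGAN2 generator module `pretrained_G` from any PyTorch implementation, could look as follows; the interface is illustrative rather than the exact code used in the paper.

```python
import copy
import torch

def build_paired_generators(pretrained_G: torch.nn.Module):
    """Sketch of the dual-generator setup: both copies start from the same
    pre-trained StyleGAN2 weights (and hence share a latent space); only the
    trainable copy is later shifted towards the new domain."""
    G_frozen = copy.deepcopy(pretrained_G).eval()
    for p in G_frozen.parameters():
        p.requires_grad_(False)   # reference generator: fixed throughout training

    G_train = copy.deepcopy(pretrained_G).train()
    for p in G_train.parameters():
        p.requires_grad_(True)    # this copy is optimized towards the target domain
    return G_frozen, G_train
```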

4.2 CLIP-based Guidance

We rely on a pre-trained CLIP model to serve as the sole source of supervision for our target domain. In order to effectively extract knowledge from CLIP, we utilize three different losses: a global target loss, a local direction loss and an embedding-norm loss.

Global CLIP loss:

The first and most intuitive of the losses is the global loss described in Sec. 3.2. Recall that this loss seeks to minimize the CLIP-space cosine distance between the generated images and some given target text.

This loss, however, suffers from several shortcomings. First, it sees no benefit from maintaining diversity: indeed, a mode-collapsed generator which produces the same image regardless of the latent code may be the best minimizer of this distance. Second, it is highly susceptible to adversarial solutions. Without sufficient regularization, the model can opt to fool the classifier (CLIP) by adding appropriate pixel-level perturbations to the image.

These shortcomings make the global loss unsuitable for training the generator. However, we still leverage it in order to adaptively determine which subset of layers to train at each iteration (see Sec. 4.3).

Directional CLIP loss:

Our second loss is thus designed with the goal of preserving diversity and discouraging extensive image corruption. For this loss, we utilize a pair of images: one generated by the reference (frozen) generator, and another generated by the modified (trainable) generator using the same latent code. Rather than optimizing for a better classification (or similarity) score, we draw inspiration from StyleCLIP’s global direction approach. We demand that the CLIP-space direction between the embeddings of the reference (source) and modified (target) images aligns with the CLIP-space direction between the embeddings of a pair of source and target texts. An illustration of this idea is provided in Fig. 3.

Figure 3: Illustration of our directional loss (Sec. 4.2). We embed the images generated by both generators into CLIP-space and demand that the vector connecting them, ΔI, is co-linear with a direction prescribed by a source and target text, ΔT. We do so by maximizing their normalized inner product.

The direction loss is thus given by:

ΔT = E_T(t_target) − E_T(t_source),
ΔI = E_I(G_train(w)) − E_I(G_frozen(w)),
L_direction = 1 − (ΔI · ΔT) / (|ΔI| |ΔT|),    (2)

where E_I and E_T are CLIP’s image and text encoders respectively, G_frozen and G_train are the frozen source generator and the modified trainable generator, and t_source, t_target are the source and target class texts.

Such a loss overcomes the shortcomings of the global loss. First, it is adversely affected by mode-collapsed solutions: if the target generator only creates a single image, the CLIP-space directions from all source images to this single target image will differ, so they cannot all align with the textual direction. Second, it is harder for the network to converge to adversarial solutions, because it has to engineer perturbations that fool CLIP across an infinite set of different instances.
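
The sketch below illustrates how the directional loss of Eq. (2) could be implemented on top of the released CLIP package; the image-preprocessing helpers are simplifying assumptions rather than the paper's exact code.

```python
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

def encode_image(images):
    # Simplified CLIP image embedding for generator outputs (NCHW tensors).
    images = F.interpolate(images, size=224, mode="bilinear", align_corners=False)
    feats = clip_model.encode_image(images.type(clip_model.dtype))
    return feats / feats.norm(dim=-1, keepdim=True)

def encode_text(text):
    feats = clip_model.encode_text(clip.tokenize([text]).to(device))
    return feats / feats.norm(dim=-1, keepdim=True)

def directional_clip_loss(frozen_images, train_images, source_text, target_text):
    """Eq. (2): align the CLIP-space image direction with the text direction."""
    delta_T = encode_text(target_text) - encode_text(source_text)       # text direction
    delta_I = encode_image(train_images) - encode_image(frozen_images)  # image direction
    return (1.0 - F.cosine_similarity(delta_I, delta_T, dim=-1)).mean()
```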

Embedding-norm Loss:

As discussed in Sec. 3, in some scenarios we use the latent mapper from StyleCLIP [patashnik2021styleclip] to better identify latent-space regions which match the target domain. Unfortunately, the mapper occasionally induces undesirable semantic artifacts in the image, such as opening an animal’s mouth or enlarging its tongue. We observe that these artifacts correlate with an increase in the norm of the CLIP-space embeddings of the generated images. We thus discourage the mapper from introducing such artifacts by adding a loss during mapper training that constrains these norms:

L_norm = ||E_I(G_train(M(w)))||_2,    (3)

where M is the latent mapper.
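
As a small illustration, this loss could be sketched as follows, assuming `G_train` and the mapper are available as PyTorch modules; unlike the directional loss, the CLIP embedding here is left unnormalized, since its norm is exactly the quantity being constrained.

```python
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

def embedding_norm_loss(G_train, mapper, w):
    """Sketch of Eq. (3): penalize the norm of the (unnormalized) CLIP embedding
    of images produced through the latent mapper. `G_train` and `mapper` are
    assumed modules; preprocessing is simplified."""
    images = G_train(mapper(w))
    images = F.interpolate(images, size=224, mode="bilinear", align_corners=False)
    return clip_model.encode_image(images.type(clip_model.dtype)).norm(dim=-1).mean()
```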

4.3 Layer-Freezing

For domain shifts which are predominantly texture-based, such as converting a photo to a sketch, training quickly converges before mode collapse or overfitting occurs. However, more extensive shape modifications require longer training, which in turn destabilizes the network and leads to poor results.

Prior works in the field of few-shot generator domain translation have observed that the quality of translated results can be significantly improved by restricting the training to a subset of network weights [mo2020freeze, Robb2020FewShotAO]. The intuition is that some layers of the source generator are useful for generating aspects of the target domain, so we want to preserve them. Furthermore, optimizing fewer parameters reduces the model complexity and the risk of overfitting. Following these approaches, we regularize the training process by limiting the number of weights that can be modified at each training iteration.

Ideally, we would have liked to restrict training to those model weights that are most relevant to a given change. To identify these weights, we turn back to latent-space editing techniques, and specifically StyleCLIP. Recall that editing techniques aim to find directions in the latent space of a GAN that correspond to a change in a single attribute, and at the same time avoid unrelated changes.

In the case of StyleGAN, it has been shown that the codes provided to different network layers can influence different semantic attributes. Thus, by considering editing directions in the W+ space [abdal2019image2stylegan], the extended latent space where each layer of StyleGAN can be provided with a different code, we can identify which layers are most strongly tied to a given change. While we are interested in changes that cannot be achieved through mere latent-code interpolation, we can still consider such editing directions locally (i.e. at each iteration along the optimization process) and ask: ‘given the current version of the model, which layers are most relevant to our change?’.

Building upon this intuition, we suggest a training scheme where, at each iteration, we (i) select the k most relevant layers, and (ii) perform a single training iteration of the generator that optimizes only these layers, while freezing all the others. To select the layers, we randomly sample a batch of latent codes and convert them to W+ codes by replicating the same code for each layer. We then perform a small number of iterations of the StyleCLIP latent-optimization step, with the current version of the generator held constant and using the global CLIP loss (Sec. 3.2). We select the k layers for which the latent code changed most significantly. The two-step process is illustrated in Fig. 4.

In all cases we additionally freeze StyleGAN’s mapping network, affine code transformations, and all toRGB layers.
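
The following sketch shows one way this selection step could be implemented; the generator interface (`G.mapping`, `G.synthesis`, `G.z_dim`, `G.n_layers`) and the reuse of the `global_clip_loss` helper from the earlier sketch are assumptions, and the sample and iteration counts are placeholders rather than the paper's exact values.

```python
import torch

def select_trainable_layers(G, target_text, k, n_codes=8, n_opt_iters=10, lr=0.01):
    """Sketch of the adaptive layer-selection phase (Sec. 4.3). Assumes a
    generator exposing a mapping network `G.mapping(z)` and a synthesis network
    `G.synthesis(w_plus)` that takes W+ codes of shape [batch, n_layers, w_dim],
    plus the `global_clip_loss` sketched for Eq. (1)."""
    device = next(G.parameters()).device
    with torch.no_grad():
        z = torch.randn(n_codes, G.z_dim, device=device)
        w_plus = G.mapping(z).unsqueeze(1).repeat(1, G.n_layers, 1)  # replicate w per layer

    w_opt = w_plus.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([w_opt], lr=lr)
    for _ in range(n_opt_iters):                      # the generator itself stays frozen here
        optimizer.zero_grad()
        global_clip_loss(G.synthesis(w_opt), target_text).backward()
        optimizer.step()

    # Rank layers by how much their W+ entries moved; only the top-k are trained.
    movement = (w_opt.detach() - w_plus).abs().mean(dim=(0, 2))
    return movement.topk(k).indices.tolist()
```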

Figure 4: The adaptive layer-freezing mechanism has two phases. In the first phase (left), we optimize a set of latent codes in W+ (turquoise), leaving all network weights fixed. This optimization is conducted using the global CLIP loss (Sec. 3.2), driven by the textual description of our target domain. We select the layers for which the corresponding W+ entry changed most significantly (illustrated by darker colors, left). In the second phase (right), we unfreeze the weights of the selected layers. We then optimize these layers using the directional CLIP loss (Sec. 4.2).
Photo Sketch
Photo A painting in Ukiyo-e style
Human Werewolf
Figure 5: Image synthesis using models adapted from StyleGAN2-FFHQ [karras2020analyzing] to a set of textually-prescribed target domains. All images were sampled randomly, using truncation. The driving texts appear to the left of each row.

4.4 Latent-Mapper ‘Mining’

As a last step, we note that for some shape changes, the generator does not undergo a complete transformation. In the case of dogs to cats, for example, the fine-tuning process results in a new generator that can output cats, dogs, and an assortment of images that lie in between. Consider, however, that this shifted generator now includes both cats and dogs within its domain. As such, we can now turn to standard editing techniques, and in particular StyleCLIP’s latent mapper, in order to map all latent codes into the cat-like region of the latent space. We use the same network and training scheme as StyleCLIP, with the addition of the embedding-norm loss of Eq. 3.

5 Experiments

5.1 Results

We begin by showcasing the wide range of out-of-domain adaptations enabled by our method. These range from style and texture changes to extensive shape modifications, and from realistic to fantastical, including domains where quality data would be prohibitively expensive to gather. All of these are achieved through a simple text-based interface and, in all but the most extreme shape changes, a few minutes of training.

In Figs. 5 and 6, we show a series of randomly sampled images synthesized by generators converted from faces, churches, dogs and cars to a host of target domains. In all these experiments we use our full model without a latent mapper. For purely style-based changes, we allowed training across all layers. For minor shape modifications, we find that training roughly two-thirds of the model layers provides a good compromise between stability and training time. Additional large-scale galleries portraying a wide set of target domains are shown in the Appendix.

In Fig. 7 we show domain adaptations from dogs to a wide range of animals. For all animal translation experiments, we set the number of trainable layers to three at each iteration, and train a latent mapper to reduce leakage from the source domain. In contrast to Figs. 6 and 5, where changes focused mainly on style or minor shape adjustments, here the model is required to perform significant shape modifications. For example, many animals sport upright ears, while most AFHQ-dog [choi2020starganv2] breeds do not.

Church Hut Church Ancient underwater ruin Photo of a church Cryengine render of New York
Dog The Joker Dog Nicolas Cage Photo Watercolor Art with thick brushstrokes
Car Ghost car Chrome wheels TRON wheels Car made of metal Car made of Gold
Figure 6: Image synthesis using models adapted from StyleGAN2’s [karras2020analyzing] LSUN Church and LSUN Car [yu2015lsun] models and StyleGAN-ADA’s [Karras2020ada] AFHQ-dog [choi2020starganv2] model. The generators were converted to a set of textually-prescribed target domains. All images were sampled randomly, using truncation. The driving texts appear below each generated set.
”Fox” ”Hamster” ”Badger”
”Lion” ”Bear” ”Skunk”
”Capybara” ”Pig” ”Meerkat”
”Otter” ”Wolf” ”Boar”
Figure 7: Generator translation to multiple animal domains. In all cases we begin with a StyleGAN-ADA [Karras2020ada] generator pre-trained on the AFHQ-dog dataset [choi2020starganv2]. All generators are adapted using our method and a StyleCLIP [patashnik2021styleclip] latent mapper. For all experiments, the source domain text was ’dog’. The target domain text is shown below each image.

5.2 Latent space exploration

As previously discussed, modern image generators (and StyleGAN in particular) are known to have a well-behaved latent space. Such a latent space is conducive to tasks such as image editing and image-to-image translation [shen2020interpreting, harkonen2020ganspace, patashnik2021styleclip, richardson2020encoding, alaluf2021matter]. The ability to apply such manipulations to real images is of particular interest, leading to an ever-increasing list of methods for GAN inversion [abdal2019image2stylegan, tov2021designing, richardson2020encoding, alaluf2021restyle, xia2021gan]. We show that our generators can still support such manipulation tasks, using the same techniques and inversion methods. Indeed, as we outline below, our model can even reuse off-the-shelf models pre-trained on the source generator’s domain, with no need for additional fine-tuning.

GAN Inversion

We begin by pairing existing inversion methods with our modified generator. Specifically, we use a ReStyle encoder [alaluf2021restyle], pre-trained on the human face domain. For a given real image, we first invert it using ReStyle, and then feed the inverted latent code into our modified generators. In Fig. 8 we show a series of results obtained in such a manner, using generators adapted across multiple domains. As can be seen, our generators successfully preserve the identity tied to the latent code, even for codes obtained from the inversion of real images. As such, our model can be used for textually-driven, zero-shot, out-of-domain image-to-image translation of real images (or, equivalently, cross-domain editing).
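
Schematically, this real-image translation pipeline reduces to the following sketch, where `restyle_invert` is a hypothetical wrapper around an off-the-shelf ReStyle encoder rather than its actual API.

```python
import torch

@torch.no_grad()
def translate_real_image(image, restyle_invert, G_source, G_adapted):
    """Sketch of zero-shot image-to-image translation through latent-code reuse.
    `restyle_invert` is a hypothetical helper that returns a latent code for a
    real image; the same code is then fed to the domain-shifted generator."""
    w = restyle_invert(image, G_source)   # invert against the original (source) generator
    return G_adapted(w)                   # same latent code, new domain
```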

Inverted Cubism Anime Modigliani Super Saiyan
Figure 8: Out-of-domain editing through latent-code equivalence between generators. We invert an image into the latent space of a StyleGAN2 FFHQ model [karras2020analyzing], using a pre-trained ReStyle encoder [alaluf2021restyle]. We then feed the same latent code into the translated generators in order to map the same identity to a novel domain.

Latent traversal editing

We have shown that the identity and facial attributes tied to a given latent code are preserved through the training process. This suggests that the latent space of the adapted generator is aligned with the original latent space. This is not entirely surprising: first, because of our intertwined generator architecture and the nature of the directional loss; and second, because prior methods have successfully exploited the natural alignment of a fine-tuned generator for downstream applications [pinkney2020resolution, 10.1145/3450626.3459771, wang2021crossdomain]. However, our fine-tuning approach is non-adversarial and thus differs from these prior methods, so it is of interest to verify that such latent-space alignment remains unbroken. By using existing latent-traversal editing techniques, we show that such alignment does indeed hold in our case, beyond mere identity preservation. Rather than finding new paths in the latent space of the modified generator, we show that we can reuse the same paths and editing models found for the original, source generator.

In Figure 9, we show editing performed on real images mapped into novel domains, using several off-the-shelf methods. In particular, we use StyleCLIP [patashnik2021styleclip] to edit expression and hairstyle, StyleFlow [10.1145/3447648] to edit pose, and InterfaceGAN [shen2020interpreting] to edit age. In all cases we use the original implementations and pre-trained models where such are available. As can be seen, the adapted generator maintains a similar capacity for semantic, latent-based editing. Notably, the directions that control semantic attributes in the new generator are aligned with those of the source, enabling the use of a host of existing editing methods and pre-trained models, and ensuring that we do not need to re-train or re-discover such directions for each new domain.

Photo Fernando Botero Painting
Photo Rendered 3D in the Style of Pixar
Photo Mona Lisa Painting
Human Tolkien Elf
Human Zombie
Unmodified Mohawk Age Surprised Pose
Figure 9: Editing in the translated generator. In each line we show the results of editing a real, inverted image using StyleCLIP [patashnik2021styleclip] (mohawk and surprised), InterfaceGAN [shen2020interpreting] (age) and StyleFlow [10.1145/3447648] (pose). The top row portrays editing in the source domain. All rows below show the same editing operations in our translated domains. The driving texts are shown to the left of each row. Editing directions are described beneath each column.

Image-to-image translation

The use of generative models extends beyond image editing techniques. Richardson et al. [richardson2020encoding] demonstrate a wide range of image-to-image translation applications that can be approached using a pre-trained StyleGAN generator. Their architecture, pSp, pairs such a generator with a dedicated encoder. This encoder is tasked with mapping an input image into a latent code that, when fed into the generator, produces some (possibly different) result image. In their work, they utilize such an architecture for conditional synthesis tasks, image restoration and super-resolution. However, a significant limitation of this approach is that the target domain of the generated images is restricted to domains for which a StyleGAN generator can be trained. We show that pre-trained pSp encoders can also be paired with our adapted generators, enabling a more generic image-to-image translation approach. Specifically, in Fig. 10 we show the results of conditional image synthesis in multiple domains, using segmentation masks and sketch-based guidance.

Human White Walker
Photo Sketch
Human Disney Princess
Photo Painting in the Style of Edvard Munch
Figure 10: Examples of conditional synthesis across multiple domains. In all cases we use a pSp [richardson2020encoding] encoder pre-trained on StyleGAN2-FFHQ [karras2020analyzing] in order to invert segmentation masks and simple sketches (top row) into the latent space of the GAN. The same inverted codes work seamlessly with our adapted models, enabling off-the-shelf conditional synthesis in the new domains.

5.3 Comparison to other methods

We compare two aspects of our method to alternative approaches. First, we show that the text-driven, out-of-domain capabilities of our method cannot be replicated by current latent-editing techniques. Then, we demonstrate our ability to effect large shape changes compared to few-shot training approaches.

Text-guided editing

We show that existing editing techniques which operate within the domain of a pre-trained generator are unable to induce changes which shift an image beyond said domain. In particular, we are interested in text-driven methods which are unconstrained by the need to gather additional data. A natural comparison is then StyleCLIP and its three CLIP-guided editing approaches (outlined in Sec. 3). The results of such a comparison are shown in Fig. 11. As can be seen, none of StyleCLIP’s approaches succeed in performing out-of-domain manipulations, even in scenarios that require only relatively minor changes (such as inducing celebrity identities on dogs).

Photo Raphael Painting
Dog The Joker
Dog Nicolas Cage
Church The Shire
StyleCLIP
Latent Optimization
StyleCLIP
Latent Mapper
StyleCLIP
Global Directions
Ours
Figure 11: Out-of-domain manipulation comparisons to StyleCLIP [patashnik2021styleclip]. In the left column, we show an image synthesized from a source generator with a given latent code. We show the results of editing the latent code towards an out-of-domain textual direction using all three StyleCLIP [patashnik2021styleclip] methods. In the last column, we show the image produced by feeding the original latent code to a generator converted using our method. Driving texts are shown to the left of each row. The latent optimization and mapper utilize only the target text. Our model successfully applies out-of-domain changes which are beyond the scope of all StyleCLIP approaches.

Few-shot generators

We next compare our zero-shot training approach with two few-shot alternatives. In particular, we compare to Ojha et al. [ojha2021few] and MineGAN [Wang_2020_CVPR]. The first is an approach focused on maintaining the diversity of the source domain while adapting to the style of the target domain, while the latter seeks to stabilize training by steering the GAN towards regions of the latent space that better match the target-set distribution, even at the cost of diversity. Results are shown in Fig. 12. While our model is not free of artifacts, it successfully avoids both the over-fitting and mode-collapsed results displayed by the alternatives, maintains a high degree of diversity, and does so without being provided with any image of the target domain.

Ojha et al. (10) MineGAN (10)
Ours (0) Ojha et al. (5) MineGAN (5)
Figure 12: Image synthesis using models adapted from StyleGAN-ADA’s [Karras2020ada] AFHQ-dog [choi2020starganv2] model to the cat domain. We compare our method to two few-shot models, Ojha et al. [ojha2021few] and MineGAN [Wang_2020_CVPR]. Next to each method we list the number of training images used. Note that in the case of 5 examples, MineGAN simply memorizes the training set.

5.4 Ablation study

In order to evaluate the significance of our suggested modifications, we conduct a qualitative ablation study. The results are presented in Fig. 13. The global loss approach consistently fails to produce meaningful results, across all domains and modifications. Meanwhile, our directional-loss model with adaptive layer-freezing achieves the best visual results. In some cases, the quality of translation can be improved further by training a latent mapper.

Dog Cat
Dog Quokka
Human Plastic Puppet
Photo Sierra Quest Graphics
Global
Loss
All
Layers
Manual
Layers
Adaptive
Layers
With
Mapper
Figure 13: Images synthesized by a translated generator when removing or adjusting individual network components. In the first column we show an image generated by the source generator. We then show the images produced by a converted generator after training with the global CLIP loss (Sec. 3.2), training all layers, manually selecting the training layers, using the adaptive layer-freezing method, and adding a StyleCLIP [patashnik2021styleclip] latent mapper. The mapper is only required for more extensive shape changes.

5.5 Training details

For texture-based changes we find that a training session typically requires 300 iterations with a batch size of 2, or roughly 3 minutes on a single NVIDIA V100 GPU. In some cases ("Photo" to "Sketch"), training can converge in less than a single minute, an improvement of two orders of magnitude compared to recent adversarial methods designed for training speed [10.1145/3450626.3459771].

For animal changes, training typically lasts 2000 iterations. We then train a StyleCLIP mapper using the modified generator as a base model. The entire process takes roughly 6 hours on a single NVIDIA V100 GPU.

For all experiments, we use an Adam optimizer with a learning rate of 0.02. When training a latent mapper, we follow StyleCLIP’s training scheme, with the addition of the embedding-norm loss of Eq. 3.

When using our adaptive layer-freezing approach, we sample a small batch of latent codes and perform a small, fixed number of latent-optimization iterations at each training step.
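
Putting these pieces together, a simplified training loop with the hyperparameters listed above might look as follows; the generator interface and the helper functions from the earlier sketches are assumptions rather than the paper's released code.

```python
import torch

def finetune(G_frozen, G_train, source_text, target_text,
             iterations=300, batch_size=2, lr=0.02, k_layers=None):
    """Simplified end-to-end loop using the hyperparameters listed above
    (Adam, lr=0.02, batch size 2, ~300 iterations for texture-level changes).
    Reuses `directional_clip_loss` and `select_trainable_layers` from the
    earlier sketches; the generator attributes (`mapping`, `synthesis`,
    `synthesis_layers`, `z_dim`) are assumed, not the released interface."""
    optimizer = torch.optim.Adam(G_train.parameters(), lr=lr)
    device = next(G_train.parameters()).device
    for _ in range(iterations):
        if k_layers is not None:
            # Adaptive layer-freezing: unfreeze only the k most relevant layers.
            active = set(select_trainable_layers(G_train, target_text, k=k_layers))
            for i, layer in enumerate(G_train.synthesis_layers):
                layer.requires_grad_(i in active)

        z = torch.randn(batch_size, G_train.z_dim, device=device)
        w = G_frozen.mapping(z)                            # shared latent space
        loss = directional_clip_loss(G_frozen.synthesis(w),
                                     G_train.synthesis(w),
                                     source_text, target_text)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return G_train
```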

6 Conclusions

We presented StyleGAN-NADA, a CLIP-guided, zero-shot method for Non-Adversarial Domain Adaptation of image generators. By using CLIP to guide the training of a generator, rather than an exploration of its latent space, we are able to move beyond the domain of the original generator and effect large changes in both style and shape.

The ability to train intertwined generators without data leads to exciting new possibilities - from the ability to edit images in ways that are constrained almost only by the user’s creativity, to the synthesis of paired cross-domain data and labeled images for downstream applications such as image-to-image translation.

Our method, however, is not without limitations. By relying on CLIP, we are limited to those domains which CLIP has observed. In particular, novel domains or styles may not be accounted for in CLIP’s embedding space. Such translation is also inherently limited by the ambiguity of natural language prompts. When one describes a ‘Raphael Painting’, for example, do they refer to the artistic style of the Renaissance painter, a portrait of his likeness, or – as one may discover by training the network for extended periods – an animated turtle bearing that name? Such limitations may lead to a need for careful prompt engineering, as seen in text-guided artistic circles.

Our method works particularly well for style and fine details, but in some cases it may struggle with large scale attributes or geometry changes. Such restrictions are common also in few-shot approaches. We find that a good translation often requires a fairly similar pre-trained generator as a starting point.

While our work focuses on shifting an existing generator, an immediate question is whether one can do away with this requirement entirely and train a generator from scratch, using nothing but CLIP’s guidance. While such an idea may seem outlandish at first, recent progress in inverting classifiers [dong2021deep] and in generative art [katherine2021vqganclip, murdock2021bigsleep] gives us hope that it is not entirely beyond reach.

We hope our work can inspire others to continue exploring the world of textually-guided generation, and particularly the astounding ability of CLIP to guide visual transformations. Perhaps, not too long in the future, our day-to-day efforts would no longer be constrained by data requirements - but only by our creativity.

Acknowledgments

We thank Yuval Alaluf, Ron Mokady and Ethan Fetaya for reviewing early drafts and for their helpful suggestions, Assaf Hallak for discussions, and Zongze Wu for assistance with the StyleCLIP comparisons.

References

Appendix A Additional Samples

We provide additional synthesized results from a large collection of source and target domains. In Figs. 14 and 15 we show results from models converted from the face domain. In Fig. 16 we show results from models converted from the church domain. In Fig. 17 we show additional results from the dog domain.

Photo Amedeo Modigliani painting
Human Tolkien elf
Human Zombie
Human Neanderthal
Human Mark Zuckerberg
Figure 14: Additional images synthesized using models adapted from StyleGAN2-FFHQ [karras2020analyzing] to a set of textually-prescribed target domains. All images were sampled randomly, using truncation. The driving texts appear to the left of each row.
Photo Mona Lisa Painting
Photo 3D render in the style of Pixar
Photo A painting by Raphael
Photo Old-timey photograph
Figure 15: Additional images synthesized using models adapted from StyleGAN2-FFHQ [karras2020analyzing] to a set of textually-prescribed target domains. All images were sampled randomly, using truncation. The driving texts appear to the left of each row.
Church Hut
Church Snowy Mountain
Church Ancient underwater ruin
Church The Shire
Photo of a church Cryengine render of Shibuya at night
Photo of a church Cryengine render of New York
Figure 16: Additional images synthesized using models adapted from the StyleGAN2 [karras2020analyzing] LSUN Church [yu2015lsun] model to a set of textually-prescribed target domains. All images were sampled randomly, using truncation. The driving texts appear to the left of each row.
Photo Pixel Art
Dog The Joker
Dog Bugs Bunny
Dog Nicolas Cage
Photo Watercolor Art
Photo Watercolor Art with Thick Brushstrokes
Figure 17: Additional images synthesized using models adapted from the StyleGAN-ADA [Karras2020ada] AFHQ-dog [choi2020starganv2] model to a set of textually-prescribed target domains. All images were sampled randomly, using truncation. The driving texts appear to the left of each row.

Appendix B Training Transition

In Fig. 18 we show the gradual transition of images generated by our modified generators during the training process. The results are visually similar to an interpolation between the two domains.

Human Werewolf
Human Gollum
Photo Sketch
Figure 18: Visualization of generated images for fixed identities along different steps of the training process. In many cases, generated images resemble a linear interpolation between the source and target domains.