PluGeN: Multi-Label Conditional Generation From Pre-Trained Models

09/18/2021 ∙ by Maciej Wołczyk, et al. ∙ Jagiellonian University

Modern generative models achieve excellent quality in a variety of tasks, including image or text generation and chemical molecule modeling. However, existing methods often lack the essential ability to generate examples with requested properties, such as the age of the person in a photo or the weight of a generated molecule. Incorporating such additional conditioning factors would require rebuilding the entire architecture and optimizing the parameters from scratch. Moreover, it is difficult to disentangle selected attributes so that one attribute can be edited while the others remain unchanged. To overcome these limitations, we propose PluGeN (Plugin Generative Network), a simple yet effective generative technique that can be used as a plugin to pre-trained generative models. The idea behind our approach is to transform the entangled latent representation using a flow-based module into a multi-dimensional space where the values of each attribute are modeled as an independent one-dimensional distribution. In consequence, PluGeN can generate new samples with desired attributes as well as manipulate labeled attributes of existing examples. Due to the disentangling of the latent representation, we are even able to generate samples with combinations of attributes that are rare or unseen in the dataset, such as a young person with gray hair, men with make-up, or women with beards. We combined PluGeN with GAN and VAE models and applied it to conditional generation and manipulation of images and to chemical molecule modeling. Experiments demonstrate that PluGeN preserves the quality of backbone models while adding the ability to control the values of labeled attributes.





Figure 1: Attribute manipulation performed by PluGeN using the StyleGAN backbone. Columns: input, gender, glasses, smile.

Generative models such as GANs and variational autoencoders have achieved great results in recent years, especially in the domains of images Brock, Donahue, and Simonyan (2018); Brown et al. (2020) and cheminformatics Gómez-Bombarelli et al. (2018); Jin, Barzilay, and Jaakkola (2018). However, in many practical applications, we need to control the process of creating samples by enforcing particular features of the generated objects. This is required, for instance, to regulate the biases present in the data, e.g. to ensure that people of each ethnicity are properly represented in a generated set of face images. In numerous realistic problems, such as drug discovery, we want to find objects with desired properties, like molecules with a particular activity, non-toxicity, and solubility.

(a) Factorization of true data distribution
(b) Probability distribution covered by PluGeN.
Figure 2: PluGeN factorizes true data distribution into components (marginal distributions) related to labeled attributes, see (a), and allows for describing unexplored regions of data (uncommon combinations of labels) by sampling from independent components, see (b). In the case illustrated here, PluGeN constructs pictures of men with make-up or women with beards, although such examples rarely (or never) appear in the training set.

Designing conditional variants of generative models that operate on multiple labels is a challenging problem due to intricate relations among the attributes. In practice, this means that some combinations of attributes (e.g. a woman with a beard) might be unobserved or rarely observed in the training data. In essence, the model should be able to go beyond the distribution of seen data and generate examples with combinations of attributes not encountered previously. One might approach this problem by building a new conditional generative model from the ground up or by designing a solution tailored to a specific existing unsupervised generative model. However, this requires additional effort whenever one wants to adapt it to a newly invented approach.

To tackle this problem while leveraging the power of existing techniques, we propose PluGeN (Plugin Generative Network), a simple yet effective generative technique that can be used as a plugin to various pre-trained generative models such as VAEs or GANs (see Figure 1 for a demonstration). Making use of PluGeN, we can manipulate the attributes of input examples as well as generate new samples with desired features. When training the proposed module, we do not change the parameters of the base model and thus retain its generative and reconstructive abilities, which places our work in the emerging family of non-invasive network adaptation methods Wołczyk et al. (2021); Rebuffi, Bilen, and Vedaldi (2017); Koperski et al. (2020); Kaya, Hong, and Dumitras (2019); Zhou et al. (2020).

Our idea is to find a mapping between the entangled latent representation of the backbone model and a disentangled space, where each dimension corresponds to a single, interpretable attribute of the image. By factorizing the true data distribution into independent components, we can sample from each component independently, which results in creating samples with arbitrary combinations of attributes, see Figure 2. In contrast to many previous works, which are constrained to the attribute combinations seen in the training set, PluGeN gives us full control of the generation process and is able to create uncommon combinations of attributes, such as a woman with a beard or a man with heavy make-up. Generating samples with unseen combinations of attributes can be viewed as extending the distribution of generative models to unexplored yet plausible regions of the data space, which distinguishes our approach from existing solutions.

Extensive experiments performed on the domain of images and a dataset of chemical compounds demonstrate that PluGeN is a reusable plugin that can be applied to various architectures including GANs and VAEs. In contrast to the baselines, PluGeN can generate new samples as well as manipulate the properties of existing examples, being capable of creating uncommon combinations of attributes.

Our contributions are as follows:

  • We propose a universal and reusable plugin for multi-label generation and manipulation that can be attached to various generative models and applied to diverse domains, such as chemical molecule modeling.

  • We introduce a novel way of modeling conditional distributions using invertible normalizing flows based on the latent space factorization.

  • We experimentally demonstrate that PluGeN can produce samples with uncommon combinations of attributes going beyond the distribution of training data.

Related work

Conditional VAE (cVAE) is one of the first methods that includes additional information about the labeled attributes in a generative model Kingma et al. (2014). Although this approach has been widely used in various areas ranging from image generation Sohn, Lee, and Yan (2015); Yan et al. (2016); Klys, Snell, and Zemel (2018) to molecular design Kang and Cho (2018), the independence of the latent vector from the attribute data is not assured, which negatively influences the generation quality. Conditional GAN (cGAN) is an alternative approach that gives results of significantly better quality Mirza and Osindero (2014); Perarnau et al. (2016); He et al. (2019), but the model is more difficult to train Kodali et al. (2017). cGAN works very well for generating new images, and conditioning factors may take various forms (images, sketches, labels) Park et al. (2019); Jo and Park (2019); Choi et al. (2020), but manipulating existing examples is more problematic because GAN models lack the encoder network Tov et al. (2021).

Fader Networks Lample et al. (2017) combine features of both cVAE and cGAN, as they use an encoder-decoder architecture together with a discriminator, which predicts the image attributes from the latent vector returned by the encoder. As discussed in Li et al. (2020), the training of Fader Networks is even more difficult than that of standard GANs, and disentanglement of attributes is not preserved. MSP Li et al. (2020) is a recent auto-encoder based architecture with an additional projection matrix, which is responsible for disentangling the latent space and separating the attribute information from other characteristic information. In contrast to PluGeN, MSP cannot be used with pre-trained GANs and performs poorly at generating new images (it was designed for manipulating existing examples). CAGlow Liu et al. (2019) is an adaptation of Glow Kingma and Dhariwal (2018) to conditional image generation based on modeling a joint probabilistic density of an image and its conditions. Since CAGlow does not reduce data dimension, applying it to more complex data might be problematic.

While the above approaches focus on building conditional generative models from scratch, recent works often focus on manipulating the latent codes of pre-trained models. StyleFlow Abdal et al. (2021) operates on the latent space of StyleGAN Karras, Laine, and Aila (2019) using a conditional continuous flow module. Although the quality of generated images is impressive, the model has not been applied to generative models other than StyleGAN or to domains other than images. Moreover, StyleFlow needs an additional classifier to compute the conditioning factor (labels) for images at test time. Competing approaches that operate on StyleGAN appear in Gao et al. (2021); Tewari et al. (2020); Härkönen et al. (2020); Nitzan et al. (2020). Hijack-GAN Wang, Yu, and Fritz (2021) is a framework for attribute manipulation using the latent space of any GAN by designing a proxy model to traverse the latent space.

Plugin Generative Network

Figure 3: PluGeN maps the entangled latent space of pre-trained generative models using invertible normalizing flow into a separate space, where labeled attributes are modeled using independent 1-dimensional distributions. By manipulating label variables in this space, we fully control the generation process.

We propose a plugin generative network (PluGeN), which can be attached to pre-trained generative models and allows for direct manipulation of labeled attributes, see Figure 3 for the basic scheme of PluGeN. Making use of PluGeN we preserve all properties of the base model, such as generation quality and reconstruction in the case of auto-encoders, while adding new functionalities. In particular, we can:

  • modify selected attributes of existing examples,

  • generate new samples with desired labels.

In contrast to typical conditional generative models, PluGeN is capable of creating examples with rare or even unseen combinations of attributes, e.g. a man with makeup.

Probabilistic model. PluGeN works in a multi-label setting, where every example $x \in \mathcal{X}$ is associated with a $K$-dimensional vector of binary labels $y = (y_1, \ldots, y_K) \in \{0, 1\}^K$. (Our model can be extended to continuous values, which we describe in the supplementary materials due to the page limit.) We assume that there is a pre-trained generative model $G: \mathcal{Z} \to \mathcal{X}$, where $\mathcal{Z} \subset \mathbb{R}^N$ is the latent space, which is usually heavily entangled. That is, although each latent code $z \in \mathcal{Z}$ contains the information about the labels $y$, there is no direct way to extract or modify it.

We want to map this entangled latent space $\mathcal{Z}$ into a separate latent space $\mathcal{D} \subset \mathbb{R}^N$, which encodes the value of each label $y_i$ as a separate random variable $C_i$ living in a single dimension of this space. Thus, by changing the value of $C_i$, going back to the entangled space $\mathcal{Z}$, and generating a sample, we can control the value of $y_i$. Since labeled attributes usually do not fully describe a given example, we consider additional $N-K$ random variables $S_j$, which are supposed to encode the information not included in the labels. We call $C = (C_1, \ldots, C_K)$ the label variables (or attributes) and $S = (S_1, \ldots, S_{N-K})$ the style variables.

Since we want to control the value of each attribute independently of any other factors, we assume the factorized form of the probability distribution of the random vector $(C, S)$. More precisely, the conditional probability distribution of $(C, S)$ given any condition $Y = y$ imposed on labeled attributes is of the form:

$$p_{C,S \mid Y=y}(c, s) = \prod_{i=1}^{K} p_{C_i \mid Y_i = y_i}(c_i) \cdot p_S(s), \qquad (1)$$

for all $(c, s) = (c_1, \ldots, c_K, s_1, \ldots, s_{N-K}) \in \mathbb{R}^N$. In other words, modifying $Y_i = y_i$ influences only the $i$-th factor $C_i$, leaving other features unchanged.

Parametrization. To instantiate the above probabilistic model (1), we need to parametrize the conditional distribution of $C_i$ given $Y_i = y_i$ and the distribution of $S$. Since we do not impose any constraints on style variables, we use the standard Gaussian distribution for modeling the density of $S$:

$$p_S = \mathcal{N}(0, I_{N-K}).$$

To provide consistency with $p_S$ and avoid potential problems with training our deep learning model using discrete distributions, we use a mixture of two Gaussians for modeling the presence of labels: each component corresponds to a potential value of the label ($0$ or $1$). More precisely, the conditional distribution of $C_i$ given $Y_i = y_i$ is parametrized by:

$$p_{C_i \mid Y_i = y_i} = \mathcal{N}(m_0, \sigma_0^2)^{(1 - y_i)} \cdot \mathcal{N}(m_1, \sigma_1^2)^{y_i}, \qquad (2)$$

where $m_0, m_1, \sigma_0, \sigma_1$ are user-defined parameters. If $y_i = 0$, then the latent factor $C_i$ takes values close to $m_0$; otherwise we get values around $m_1$ (depending on the values of $\sigma_0$ and $\sigma_1$). To provide good separation between components, the means $m_0$ and $m_1$ are placed well apart; the selection of $\sigma_0, \sigma_1$ is discussed in the supplementary materials.
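To make the notion of "good separation" concrete, the following minimal sketch computes how often a label variable drawn for label 0 would land on the side of the midpoint that belongs to label 1. The midpoint threshold, the function names, and the parameter values are assumptions chosen for illustration; they are not taken from the paper.

```python
import math


def standard_normal_cdf(x):
    """CDF of N(0, 1), written with the error function from the standard library."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def wrong_side_probability(m0, m1, sigma0):
    """P(C_i > (m0 + m1) / 2 | Y_i = 0) for C_i ~ N(m0, sigma0^2).

    Thresholding at the midpoint is an illustrative choice, not part of the paper.
    """
    midpoint = 0.5 * (m0 + m1)
    return 1.0 - standard_normal_cdf((midpoint - m0) / sigma0)


# Hypothetical parameter values: smaller sigma gives sharper separation.
for sigma in (0.25, 0.5, 1.0):
    print(sigma, wrong_side_probability(m0=-1.0, m1=1.0, sigma0=sigma))
```

The overlap shrinks quickly as the standard deviation decreases relative to the distance between the means, which is the trade-off the user-defined parameters control.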

Thanks to this continuous parametrization, we can smoothly interpolate between different labels, which would not be so easy using e.g. the Gumbel softmax parametrization Jang, Gu, and Poole (2016). In consequence, we can gradually change the intensity of certain labels, like smile or beard, even though such information was not available in the training set (see Figure 4 in the experimental section).

Training the model

To establish a two-way mapping between the entangled space $\mathcal{Z}$ and the disentangled space $\mathcal{D}$, we use an invertible normalizing flow (INF), $F: \mathcal{D} \to \mathcal{Z}$. Let us recall that an INF is a neural network where the inverse mapping is given explicitly and the Jacobian determinant can be easily calculated Dinh, Krueger, and Bengio (2014). Due to the invertibility of the INF, we can transform latent codes $z \in \mathcal{Z}$ to the prior distribution of the INF, modify selected attributes, and map the resulting vector back to $\mathcal{Z}$. Moreover, INFs can be trained using a log-likelihood loss, which is very appealing in generative modeling.
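As an illustration of why the inverse is explicit and the Jacobian determinant is cheap, here is a minimal sketch of an additive coupling layer in the style of NICE (Dinh, Krueger, and Bengio 2014). The function `shift_network` is a hypothetical stand-in for a learned network, and the even input length is an assumption made to keep the example short.

```python
import math


def shift_network(x1):
    """Hypothetical stand-in for the learned coupling network m(x1)."""
    return [math.tanh(2.0 * v) + 0.5 * v * v for v in x1]


def coupling_forward(x):
    """Additive coupling: y1 = x1, y2 = x2 + m(x1). Assumes an even-length list."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    return x1 + [b + t for b, t in zip(x2, shift_network(x1))]


def coupling_inverse(y):
    """Exact inverse: x1 = y1, x2 = y2 - m(y1). No network inversion is needed."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    return y1 + [b - t for b, t in zip(y2, shift_network(y1))]


# The Jacobian of this map is triangular with ones on the diagonal,
# so log|det J| = 0 for every input, whatever shift_network computes.
x = [0.3, -1.2, 0.7, 2.0]
y = coupling_forward(x)
x_back = coupling_inverse(y)
```

Stacking several such layers, each time swapping which half is passed through unchanged, gives an expressive map that remains exactly invertible.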

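Summarizing, given a latent representation $z \in \mathcal{Z}$ of a sample $x$ with label $y$, the loss function of PluGeN equals:

$$-\log p_{Z \mid Y=y}(z) = -\sum_{i=1}^{K} \log p_{C_i \mid Y_i = y_i}(c_i) - \log p_S(s) - \log \left| \det \frac{\partial F^{-1}(z)}{\partial z} \right|, \qquad (3)$$

where $(c, s) = F^{-1}(z)$ collects the label variables $c$ and the style variables $s$. In the training phase, we collect latent representations $z$ of data points $x$. Making use of the labeled attributes $y$ associated with every $x$, we modify the weights of $F$ to minimize the negative log-likelihood (3) using gradient descent. The weights of the base model $G$ are kept frozen.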
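The negative log-likelihood above can be sketched for a single example as follows, assuming the flow has already produced the label variables, the style variables, and the log-determinant term. The function names and the default means and standard deviations are hypothetical illustration values, not the settings used by the authors.

```python
import math


def gaussian_log_pdf(x, mean, sigma):
    """Log-density of the 1-D Gaussian N(mean, sigma^2) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * ((x - mean) / sigma) ** 2


def plugen_nll(c, s, y, log_abs_det, means=(-1.0, 1.0), sigmas=(0.5, 0.5)):
    """Negative log-likelihood of one latent code under the factorized prior.

    c: label variables, s: style variables, y: binary labels (0 or 1),
    log_abs_det: log|det dF^{-1}(z)/dz| reported by the flow.
    means and sigmas are hypothetical values for the two Gaussian components.
    """
    # Each label selects which Gaussian component scores its label variable.
    label_term = sum(gaussian_log_pdf(ci, means[yi], sigmas[yi]) for ci, yi in zip(c, y))
    # Style variables are scored under a standard Gaussian.
    style_term = sum(gaussian_log_pdf(si, 0.0, 1.0) for si in s)
    return -(label_term + style_term + log_abs_det)


matched = plugen_nll(c=[1.0, -1.0], s=[0.0], y=[1, 0], log_abs_det=0.0)
mismatched = plugen_nll(c=[1.0, -1.0], s=[0.0], y=[0, 1], log_abs_det=0.0)
```

In this toy setting the loss is much lower when each label variable sits near the mean of the component selected by its label, which is the pressure that aligns individual dimensions with individual attributes.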

In contrast to many previous works Abdal et al. (2021), PluGeN can be trained in a semi-supervised setting, where only partial information about labeled attributes is available (see supplementary materials for details).

Inference. We may use PluGeN to generate new samples with desired attributes as well as to manipulate attributes of input examples. In the first case, we generate a vector $(c, s)$ from the conditional distribution $p_{C,S \mid Y=y}$ with the selected condition $Y = y$. To get the output sample, the vector $(c, s)$ is transformed by the INF $F$ and the base generative network $G$, which gives us the final output $x = G(F(c, s))$.

In the second case, to manipulate the attributes of an existing example $x$, we need to find its latent representation $z$. If $G$ is a decoder network of an autoencoder model, then $x$ should be passed through the encoder network to obtain $z$ Li et al. (2020). If $G$ is a GAN, then $z$ can be found by minimizing the reconstruction error between $x$ and $G(z)$ using gradient descent for a frozen $G$ Abdal et al. (2021). In both cases, $z$ is next processed by the INF, which gives us its factorized representation $(c, s) = F^{-1}(z)$. In this representation, we can modify any label variable $c_i$ and map the resulting vector back through $F$ and $G$ as in the generative case.

Observe that PluGeN does not need to know the values of the labeled attributes when it modifies attributes of existing examples. Given a latent representation $z$, PluGeN maps it through $F^{-1}$, which gives us the factorization into labeled and unlabeled attributes. In contrast, existing solutions based on conditional INF, e.g. StyleFlow Abdal et al. (2021), have to determine all labels before passing $z$ through the INF, as they represent the conditioning factors. In consequence, these models involve additional classifiers for labeled attributes.
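The manipulation procedure can be sketched on a two-dimensional toy example. A fixed rotation stands in for the trained flow, with one label variable and one style variable; the rotation angle, the function names, and the linear interpolation schedule are all assumptions made for illustration, and the decoding step through the base model is omitted.

```python
import math

THETA = 0.6  # arbitrary rotation angle for the toy flow (assumption)


def flow_inverse(z):
    """Toy F^{-1}: entangled code z -> (c, s). A rotation stands in for the trained flow."""
    cos_t, sin_t = math.cos(THETA), math.sin(THETA)
    c = cos_t * z[0] + sin_t * z[1]
    s = -sin_t * z[0] + cos_t * z[1]
    return c, s


def flow_forward(c, s):
    """Toy F: (c, s) -> z, the exact inverse of flow_inverse."""
    cos_t, sin_t = math.cos(THETA), math.sin(THETA)
    return [cos_t * c - sin_t * s, sin_t * c + cos_t * s]


def edit_attribute(z, target_c, steps=4):
    """Move the label variable toward target_c in equal steps, keeping the style variable fixed."""
    c, s = flow_inverse(z)  # the current label value is read off here, no classifier needed
    path = []
    for k in range(1, steps + 1):
        c_k = c + (target_c - c) * k / steps
        path.append(flow_forward(c_k, s))
    return path


z0 = [0.2, -0.4]
edited_codes = edit_attribute(z0, target_c=1.0)
```

Each code in `edited_codes` would then be passed to the frozen base model to render the gradually edited sample.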


Experiments

To empirically evaluate the properties of PluGeN, we combine it with GAN and VAE architectures to manipulate attributes of image data. Moreover, we present a practical use-case of chemical molecule modeling using CharVAE. Due to the page limit, architecture details and additional results are included in the supplementary materials.

GAN backbone

First, we consider the state-of-the-art StyleGAN architecture Karras, Laine, and Aila (2019), which was trained on Flickr-Faces-HQ (FFHQ) containing 70,000 high-quality images of resolution $1024 \times 1024$. The Microsoft Face API was used to label 8 attributes in each image (gender, pitch, yaw, eyeglasses, age, facial hair, expression, and baldness).

PluGeN is instantiated using the NICE flow model Dinh, Krueger, and Bengio (2014), which operates on the latent vectors sampled from the space of the StyleGAN. As a baseline, we select StyleFlow Abdal et al. (2021), which is currently one of the state-of-the-art models for controlling the generation process of StyleGAN. In contrast to PluGeN, StyleFlow uses a conditional continuous INF to operate on the latent codes of StyleGAN, where the conditioning factor corresponds to the labeled attributes. For evaluation, we modify one of 5 attributes (the remaining 3 attributes, age, pitch, and yaw, are continuous, and their modifications are more difficult to assess) and verify the success of this operation using the prediction accuracy returned by the Microsoft Face API. The quality of images is additionally assessed by calculating the standard Fréchet Inception Distance (FID) Heusel et al. (2017).

(a) PluGeN
(b) StyleFlow
Figure 4: Gradual modification of attributes (age, baldness, and yaw, respectively) performed on the StyleGAN latent codes.

Figures 1 and 4 show how PluGeN and StyleFlow manipulate images sampled by StyleGAN. It is evident that PluGeN can switch the labels to opposite values as well as gradually change their intensities. At the same time, the requested modifications leave the remaining attributes unchanged. One can observe that the results produced by StyleFlow are also acceptable, but the modification of the requested attribute changes other attributes as well. For example, increasing the intensity of ”baldness” changes the type of glasses, and turning the head to the right makes the woman look straight ahead.

The above qualitative evaluation is supported by the quantitative assessment presented in Table 1. As can be seen, StyleFlow obtains a better FID score, while PluGeN outperforms StyleFlow in terms of accuracy. Since FID compares the distributions of generated and real images, images with uncommon combinations of attributes that do not appear in the training set may receive a worse (higher) FID, which can explain the relation between accuracy and FID obtained by PluGeN and StyleFlow. In consequence, FID is not an adequate metric for measuring the quality of the arbitrary image manipulations considered here, because it is too closely tied to the distribution of input images.
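For reference, the standard definition of FID from Heusel et al. (2017) makes this dependence on the real-image distribution explicit. The symbols below are not notation from this paper: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the mean and covariance of Inception features of real and generated images, respectively.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Both terms grow when the feature statistics of the generated set move away from those of the real set, so a method that deliberately produces attribute combinations absent from the real data is penalized even if each image is realistic.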

It is worth mentioning that PluGeN obtains these very good results using the NICE model, which is the simplest type of INF. In contrast, StyleFlow uses a continuous INF, which is significantly more complex and requires an ODE solver, leading to unstable training. Moreover, to modify even a single attribute, StyleFlow needs to determine the values of all labels, since they represent the conditioning factors of the INF. In consequence, every modification requires applying an auxiliary classifier to predict all image labels. The usage of PluGeN is significantly simpler, as subsequent coordinates in the latent space of the INF correspond to the labeled attributes and are automatically determined by PluGeN.

Requested value    PluGeN    StyleFlow
female             0.95      0.95
male               0.92      0.87
no-glasses         1.00      0.99
glasses            0.90      0.70
not-bald           1.00      1.00
bald               0.53      0.54
no-facial-hair     1.00      1.00
facial-hair        0.72      0.65
no-smile           0.99      0.92
smile              0.96      0.99
Average Acc        0.90      0.86
Average FID        46.51     32.59
Table 1: Accuracy and FID scores of attribute modification using the StyleGAN backbone.

Image manipulation on VAE backbone

In the following experiment, we show that PluGeN can be combined with autoencoder models to effectively manipulate image attributes. We use the CelebA database, where every image is annotated with 40 binary labels.

We compare PluGeN to MSP Li et al. (2020), a strong baseline, which uses a specific loss for disentangling the latent space of VAE. Following the idea of StyleFlow, we also consider a conditional INF attached to the latent space of pre-trained VAE (referred to as cFlow), where conditioning factors correspond to the labeled attributes. The architecture of the base VAE and the evaluation protocol were taken from the original MSP paper. More precisely, for every input image, we manipulate the values of two attributes (we inspect 20 combinations in total). The success of the requested manipulation is verified using a multi-label ResNet-56 classifier trained on the original CelebA dataset.

The sample results presented in Figure 5 demonstrate that PluGeN attached to a VAE produces high-quality images satisfying the constraints imposed on the labeled attributes. The quantitative comparison shown in Table 2 confirms that PluGeN is effective at creating uncommon combinations of attributes (e.g. female x beard: 0.59, versus 0.33 for MSP and 0.31 for cFlow), while cFlow performs well only for the usual combinations. At the same time, the quality of images produced by PluGeN and MSP is better than in the case of cFlow. Although both PluGeN and MSP focus on disentangling the latent space of the base model, MSP has to be trained jointly with the base VAE model and was designed only for autoencoder models. In contrast, PluGeN is a separate module, which can be attached to arbitrary pre-trained models. Due to the use of invertible neural networks, it preserves the reconstruction quality of the base model while adding manipulation functionalities. In the following experiment, we show that PluGeN also performs well at generating entirely new images, which is not possible using MSP.

Figure 5: Examples of image attribute manipulation using VAE backbone. Panel titles: input image, MSP, cFlow. Requested attribute combinations shown in each panel, row by row: ♂+beard, ♂+mkup, open+smile, ♂+bald, hair-glass; ♀+beard, ♂-mkup, open-smile, ♂+bangs, hair-glass; ♂-beard, ♀+mkup, shut+smile, ♀+bald, hair+glass; ♀-beard, ♀-mkup, shut-smile, ♀+bangs, hair+glass.
Requested value            PluGeN    MSP      cFlow
male x beard               0.80      0.79     0.85
female x beard             0.59      0.33     0.31
male x no-beard            0.88      0.92     0.91
female x no-beard          0.85      0.82     0.95
male x makeup              0.44      0.43     0.29
male x no-makeup           0.72      0.92     0.96
female x makeup            0.42      0.41     0.58
female x no-makeup         0.55      0.40     0.85
smile x open-mouth         0.97      0.99     0.79
no-smile x open-mouth      0.79      0.82     0.77
smile x calm-mouth         0.84      0.91     0.72
no-smile x calm-mouth      0.96      0.97     0.99
male x bald                0.26      0.41     0.34
male x bangs               0.58      0.74     0.45
female x bald              0.19      0.13     0.39
female x bangs             0.52      0.49     0.60
no-glasses x black-hair    0.92      0.93     0.74
no-glasses x golden-hair   0.92      0.91     0.81
glasses x black-hair       0.76      0.90     0.58
glasses x golden-hair      0.75      0.85     0.61
Average Acc                0.69      0.70     0.67
Average FID                28.07     30.67    39.68
Table 2: Accuracy and FID scores of image manipulation performed on the VAE backbone.

Image generation with VAE backbone

In addition to manipulating the labeled attributes of existing images, PluGeN generates new examples with desired attributes. To verify this property, we use the same VAE architecture as before, trained on the CelebA dataset. The baselines include cFlow and two previously introduced methods for multi-label conditional generation: cVAE Yan et al. (2016) and Δ-GAN Gan et al. (2017). (For cVAE and Δ-GAN we use images of the size used in their implementations.) We exclude MSP from the comparison because it cannot generate new images, but only manipulate the attributes of existing ones (see supplementary materials for a detailed explanation).

Figure 6: Examples of conditional generation using VAE backbone. Each row contains the same person (style variables) with modified attributes (label variables).

Figure 6 presents sample results of image generation under specific conditions. In each row, we fix the style variables and vary the label variables in each column, generating the same person but with different characteristics such as hair color, eyeglasses, etc. Although cVAE manages to modify the attributes, the quality of the obtained samples is poor, while Δ-GAN falls completely out of distribution. PluGeN and cFlow generate images of similar quality, but only PluGeN is able to correctly manipulate the labeled attributes. The lower quality of generated images is caused by the poor generation abilities of the VAE backbone, which does not work well with high-dimensional images (see supplementary materials for a discussion). For this reason, it is especially notable that PluGeN can improve the generation performance of the backbone model, in contrast to MSP.

Disentangling the attributes

The attributes in the CelebA dataset are strongly correlated and at times even contradictory, e.g. the attributes ’bald’ and ’blond hair’ cannot both be present at the same time. In this challenging task, we aim to disentangle the attribute space as much as possible to allow for generating examples with arbitrary combinations of attributes. For this purpose, we sample the conditional variables independently, effectively ignoring the underlying correlations of attributes, and use them to generate images. Since the attributes in the CelebA dataset are often imbalanced (e.g. the person wears glasses in only 6.5% of examples), we calculate F1 and AUC scores for each attribute.
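To illustrate why F1 is more informative than accuracy under this kind of imbalance, the following minimal sketch scores a trivial predictor that never outputs the positive class on a synthetic label set with a 6.5% positive rate. The data and function names are made up for illustration and are unrelated to the actual evaluation pipeline.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Synthetic labels: 65 positives out of 1000 examples (6.5%).
labels = [1] * 65 + [0] * 935
always_negative = [0] * 1000

acc = accuracy(labels, always_negative)
f1 = f1_score(labels, always_negative)
```

The trivial predictor reaches high accuracy while its F1 is zero, so F1 exposes a failure on the rare attribute that accuracy would hide.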

The quantitative analysis of the generated images presented in Table 3 confirms that PluGeN outperforms the rest of the methods with respect to classification scores. The overall metrics are quite low for all approaches, which is due to the difficulty of disentanglement mentioned above, as well as the inaccuracy of the ResNet attribute classifier. Deep learning models often fail when the correlations in the training data are broken, e.g. the classifier might use the presence of a beard to predict gender, thus introducing noise in the evaluation Beery, Horn, and Perona (2018).

       PluGeN    cFlow    Δ-GAN    cVAE
F1     0.44      0.29     0.39     0.39
AUC    0.76      0.68     0.70     0.73
Table 3: Results of the independent conditional generation using VAE backbone.

Chemical molecules modeling

Finally, we present a practical use-case, in which we apply PluGeN to generate chemical molecules with the requested properties. As a backbone model, we use CharVAE Gómez-Bombarelli et al. (2018), which is a type of recurrent network used for processing SMILES Weininger (1988), a textual representation of molecules. It was trained on the ZINC 250k database Sterling and Irwin (2015) of commercially available chemical compounds. For every molecule, we model 3 continuous (not binary) physicochemical labels: logP, SAS, and TPSA, whose values were calculated using the RDKit package Landrum et al. (2006). Additional explanations and more examples are given in the supplementary materials.

First, we imitate a practical task of de novo design Olivecrona et al. (2017); Popova, Isayev, and Tropsha (2018), where we force the model to generate new compounds with desirable properties. For every attribute, we generate 25k molecules with 3 different values: for logP we set the label of generated molecules to: 1.5, 3.0, 4.5; for TPSA we set generated labels to: 40, 60, 80; for SAS we set them to: 2.0, 3.0, 4.0, which gives 9 scenarios in total. From density plots of labels of generated and original molecules presented in Figure 7, we can see that PluGeN changes the distribution of values of the attributes and moves it towards the desired value. A slight discrepancy between desired and generated values may follow from the fact that values of labeled attributes were sampled independently, which could make some combinations physically contradictory.

Figure 7: Distribution of attributes of generated molecules, together with distribution for the training dataset. Each color shows the value of a labeled attribute that was used for generation. PluGeN is capable of moving the density of generated molecules’ attributes towards the desired value. The average of every distribution is marked with a vertical line.

Next, we consider the setting of lead optimization Jin et al. (2019); Maziarka et al. (2020), where selected compounds are improved to meet certain criteria. For this purpose, we encode a molecule into the latent representation of INF and force PluGeN to gradually increase the value of logP by 3 and decode the resulting molecules.

The obtained molecules together with their logP values are shown in Figure 8. As can be seen, PluGeN generates molecules that are structurally similar to the initial one, but with the desired attribute optimized.

The obtained results show that PluGeN is able to model the physicochemical molecular features, which is a non-trivial task that could speed up the long and expensive process of designing new drugs.

(a) Molecules decoded from path
(b) LogP of presented molecules
Figure 8: Molecules obtained by the model during an optimization phase (left side), and their LogP (right side).


Conclusion

We proposed a novel approach for disentangling the latent space of pre-trained generative models, which works well both for generating new samples with desired conditions and for manipulating the attributes of existing examples. In contrast to previous works, we demonstrated that PluGeN performs well across diverse domains, including chemical molecule modeling, and can be combined with various architectures, such as GAN and VAE backbones.


  • Abdal et al. (2021) Abdal, R.; Zhu, P.; Mitra, N. J.; and Wonka, P. 2021. Styleflow: Attribute-conditioned exploration of stylegan-generated images using conditional continuous normalizing flows. ACM Transactions on Graphics (TOG), 40(3): 1–21.
  • Beery, Horn, and Perona (2018) Beery, S.; Horn, G. V.; and Perona, P. 2018. Recognition in Terra Incognita. In ECCV.
  • Bohacek, McMartin, and Guida (1996) Bohacek, R. S.; McMartin, C.; and Guida, W. C. 1996. The art and practice of structure-based drug design: a molecular modeling perspective. Medicinal research reviews, 16(1): 3–50.
  • Brock, Donahue, and Simonyan (2018) Brock, A.; Donahue, J.; and Simonyan, K. 2018. Large Scale GAN Training for High Fidelity Natural Image Synthesis. In International Conference on Learning Representations.
  • Brown et al. (2020) Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert-Voss, A.; Krueger, G.; Henighan, T.; Child, R.; Ramesh, A.; Ziegler, D. M.; Wu, J.; Winter, C.; Hesse, C.; Chen, M.; Sigler, E.; Litwin, M.; Gray, S.; Chess, B.; Clark, J.; Berner, C.; McCandlish, S.; Radford, A.; Sutskever, I.; and Amodei, D. 2020. Language Models are Few-Shot Learners. In Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.; and Lin, H., eds., Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
  • Chen et al. (2018) Chen, R. T.; Rubanova, Y.; Bettencourt, J.; and Duvenaud, D. 2018. Neural ordinary differential equations. arXiv preprint arXiv:1806.07366.
  • Cho et al. (2014) Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
  • Choi et al. (2020) Choi, Y.; Uh, Y.; Yoo, J.; and Ha, J.-W. 2020. Stargan v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8188–8197.
  • Coley et al. (2017) Coley, C. W.; Barzilay, R.; Green, W. H.; Jaakkola, T. S.; and Jensen, K. F. 2017. Convolutional embedding of attributed molecular graphs for physical property prediction. Journal of chemical information and modeling, 57(8): 1757–1772.
  • Dinh, Krueger, and Bengio (2014) Dinh, L.; Krueger, D.; and Bengio, Y. 2014. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516.
  • Gan et al. (2017) Gan, Z.; Chen, L.; Wang, W.; Pu, Y.; Zhang, Y.; Liu, H.; Li, C.; and Carin, L. 2017. Triangle generative adversarial networks. arXiv preprint arXiv:1709.06548.
  • Gao et al. (2021) Gao, Y.; Wei, F.; Bao, J.; Gu, S.; Chen, D.; Wen, F.; and Lian, Z. 2021. High-Fidelity and Arbitrary Face Editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16115–16124.
  • Gaulton et al. (2017) Gaulton, A.; Hersey, A.; Nowotka, M.; Bento, A. P.; Chambers, J.; Mendez, D.; Mutowo, P.; Atkinson, F.; Bellis, L. J.; Cibrián-Uhalte, E.; et al. 2017. The ChEMBL database in 2017. Nucleic acids research, 45(D1): D945–D954.
  • Gómez-Bombarelli et al. (2018) Gómez-Bombarelli, R.; Wei, J. N.; Duvenaud, D.; Hernández-Lobato, J. M.; Sánchez-Lengeling, B.; Sheberla, D.; Aguilera-Iparraguirre, J.; Hirzel, T. D.; Adams, R. P.; and Aspuru-Guzik, A. 2018. Automatic chemical design using a data-driven continuous representation of molecules. ACS central science, 4(2): 268–276.
  • Härkönen et al. (2020) Härkönen, E.; Hertzmann, A.; Lehtinen, J.; and Paris, S. 2020. Ganspace: Discovering interpretable gan controls. arXiv preprint arXiv:2004.02546.
  • He et al. (2019) He, Z.; Zuo, W.; Kan, M.; Shan, S.; and Chen, X. 2019. Attgan: Facial attribute editing by only changing what you want. IEEE Transactions on Image Processing, 28(11): 5464–5478.
  • Heusel et al. (2017) Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; and Hochreiter, S. 2017. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Guyon, I.; von Luxburg, U.; Bengio, S.; Wallach, H. M.; Fergus, R.; Vishwanathan, S. V. N.; and Garnett, R., eds., Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 6626–6637.
  • Jang, Gu, and Poole (2016) Jang, E.; Gu, S.; and Poole, B. 2016. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144.
  • Jin, Barzilay, and Jaakkola (2018) Jin, W.; Barzilay, R.; and Jaakkola, T. 2018. Junction tree variational autoencoder for molecular graph generation. In International Conference on Machine Learning, 2323–2332. PMLR.
  • Jin et al. (2019) Jin, W.; Yang, K.; Barzilay, R.; and Jaakkola, T. 2019. Learning multimodal graph-to-graph translation for molecular optimization. International Conference on Learning Representations.
  • Jo and Park (2019) Jo, Y.; and Park, J. 2019. Sc-fegan: Face editing generative adversarial network with user’s sketch and color. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1745–1753.
  • Kang and Cho (2018) Kang, S.; and Cho, K. 2018. Conditional molecular design with deep generative models. Journal of chemical information and modeling, 59(1): 43–52.
  • Karras, Laine, and Aila (2019) Karras, T.; Laine, S.; and Aila, T. 2019. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4401–4410.
  • Karras et al. (2020) Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; and Aila, T. 2020. Analyzing and Improving the Image Quality of StyleGAN. In Proc. CVPR.
  • Kaya, Hong, and Dumitras (2019) Kaya, Y.; Hong, S.; and Dumitras, T. 2019. Shallow-Deep Networks: Understanding and Mitigating Network Overthinking. In Chaudhuri, K.; and Salakhutdinov, R., eds., Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, 3301–3310. PMLR.
  • Kingma and Dhariwal (2018) Kingma, D. P.; and Dhariwal, P. 2018. Glow: Generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039.
  • Kingma et al. (2014) Kingma, D. P.; Rezende, D. J.; Mohamed, S.; and Welling, M. 2014. Semi-supervised learning with deep generative models. arXiv preprint arXiv:1406.5298.
  • Klys, Snell, and Zemel (2018) Klys, J.; Snell, J.; and Zemel, R. 2018. Learning latent subspaces in variational autoencoders. arXiv preprint arXiv:1812.06190.
  • Kodali et al. (2017) Kodali, N.; Abernethy, J.; Hays, J.; and Kira, Z. 2017. On convergence and stability of gans. arXiv preprint arXiv:1705.07215.
  • Koperski et al. (2020) Koperski, M.; Konopczynski, T.; Nowak, R.; Semberecki, P.; and Trzcinski, T. 2020. Plugin Networks for Inference under Partial Evidence. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2883–2891.
  • Lample et al. (2017) Lample, G.; Zeghidour, N.; Usunier, N.; Bordes, A.; Denoyer, L.; and Ranzato, M. 2017. Fader networks: Manipulating images by sliding attributes. arXiv preprint arXiv:1706.00409.
  • Landrum et al. (2006) Landrum, G.; et al. 2006. RDKit: Open-source cheminformatics.
  • Li and Wand (2016) Li, C.; and Wand, M. 2016. Precomputed real-time texture synthesis with markovian generative adversarial networks. In European conference on computer vision, 702–716. Springer.
  • Li et al. (2020) Li, X.; Lin, C.; Li, R.; Wang, C.; and Guerin, F. 2020. Latent space factorisation and manipulation via matrix subspace projection. In International Conference on Machine Learning, 5916–5926. PMLR.
  • Liu et al. (2019) Liu, R.; Liu, Y.; Gong, X.; Wang, X.; and Li, H. 2019. Conditional adversarial generative flow for controllable image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7992–8001.
  • Maziarka et al. (2020) Maziarka, Ł.; Pocha, A.; Kaczmarczyk, J.; Rataj, K.; Danel, T.; and Warchoł, M. 2020. Mol-CycleGAN: a generative model for molecular optimization. Journal of Cheminformatics, 12(1): 1–18.
  • Mestre-Ferrandiz et al. (2012) Mestre-Ferrandiz, J.; Sussex, J.; Towse, A.; et al. 2012. The R&D cost of a new medicine. Monographs.
  • Mirza and Osindero (2014) Mirza, M.; and Osindero, S. 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
  • Nitzan et al. (2020) Nitzan, Y.; Bermano, A.; Li, Y.; and Cohen-Or, D. 2020. Disentangling in latent space by harnessing a pretrained generator. arXiv preprint arXiv:2005.07728, 2(3).
  • Olivecrona et al. (2017) Olivecrona, M.; Blaschke, T.; Engkvist, O.; and Chen, H. 2017. Molecular de-novo design through deep reinforcement learning. Journal of cheminformatics, 9(1): 1–14.
  • Park et al. (2019) Park, T.; Liu, M.-Y.; Wang, T.-C.; and Zhu, J.-Y. 2019. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2337–2346.
  • Paszke et al. (2019) Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; Desmaison, A.; Kopf, A.; Yang, E.; DeVito, Z.; Raison, M.; Tejani, A.; Chilamkurthy, S.; Steiner, B.; Fang, L.; Bai, J.; and Chintala, S. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; and Garnett, R., eds., Advances in Neural Information Processing Systems 32, 8024–8035. Curran Associates, Inc.
  • Perarnau et al. (2016) Perarnau, G.; Van De Weijer, J.; Raducanu, B.; and Álvarez, J. M. 2016. Invertible conditional gans for image editing. arXiv preprint arXiv:1611.06355.
  • Popova, Isayev, and Tropsha (2018) Popova, M.; Isayev, O.; and Tropsha, A. 2018. Deep reinforcement learning for de novo drug design. Science advances, 4(7): eaap7885.
  • Rebuffi, Bilen, and Vedaldi (2017) Rebuffi, S.; Bilen, H.; and Vedaldi, A. 2017. Learning multiple visual domains with residual adapters. In Guyon, I.; von Luxburg, U.; Bengio, S.; Wallach, H. M.; Fergus, R.; Vishwanathan, S. V. N.; and Garnett, R., eds., Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 506–516.
  • Sohn, Lee, and Yan (2015) Sohn, K.; Lee, H.; and Yan, X. 2015. Learning structured output representation using deep conditional generative models. Advances in neural information processing systems, 28: 3483–3491.
  • Sterling and Irwin (2015) Sterling, T.; and Irwin, J. J. 2015. ZINC 15–ligand discovery for everyone. Journal of chemical information and modeling, 55(11): 2324–2337.
  • Tewari et al. (2020) Tewari, A.; Elgharib, M.; Bernard, F.; Seidel, H.-P.; Pérez, P.; Zollhöfer, M.; and Theobalt, C. 2020. Pie: Portrait image embedding for semantic control. ACM Transactions on Graphics (TOG), 39(6): 1–14.
  • Tov et al. (2021) Tov, O.; Alaluf, Y.; Nitzan, Y.; Patashnik, O.; and Cohen-Or, D. 2021. Designing an encoder for stylegan image manipulation. ACM Transactions on Graphics (TOG), 40(4): 1–14.
  • Wang, Yu, and Fritz (2021) Wang, H.-P.; Yu, N.; and Fritz, M. 2021. Hijack-gan: Unintended-use of pretrained, black-box gans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7872–7881.
  • Weininger (1988) Weininger, D. 1988. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of chemical information and computer sciences, 28(1): 31–36.
  • Wołczyk et al. (2021) Wołczyk, M.; Wójcik, B.; Bałazy, K.; Podolak, I.; Tabor, J.; Śmieja, M.; and Trzciński, T. 2021. Zero Time Waste: Recycling Predictions in Early Exit Neural Networks. arXiv preprint arXiv:2106.05409.
  • Yan et al. (2016) Yan, X.; Yang, J.; Sohn, K.; and Lee, H. 2016. Attribute2image: Conditional image generation from visual attributes. In European Conference on Computer Vision, 776–791. Springer.
  • Yang et al. (2019) Yang, K.; Swanson, K.; Jin, W.; Coley, C.; Eiden, P.; Gao, H.; Guzman-Perez, A.; Hopper, T.; Kelley, B.; Mathea, M.; et al. 2019. Analyzing learned molecular representations for property prediction. Journal of chemical information and modeling, 59(8): 3370–3388.
  • Zhou et al. (2020) Zhou, W.; Xu, C.; Ge, T.; McAuley, J. J.; Xu, K.; and Wei, F. 2020. BERT Loses Patience: Fast and Robust Inference with Early Exit. In Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.; and Lin, H., eds., Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Parametrization of PluGeN

Continuous attributes

In this section, we show that PluGeN can also be applied to the case of continuous labeled attributes. Without loss of generality, we assume that . Analogically to the case of binary labels, we assume that the conditional distribution of label variable given is parametrized by

where σ is the user-defined parameter (the standard deviation of the Gaussian) controlling smoothness.

Observe that by marginalizing out the label variable over training set , we obtain:

which coincides with a 1-dimensional kernel density estimator (KDE). Although KDE does not work well in high dimensional spaces, it is a reliable estimate of the probability density function in the 1-dimensional situation considered here.

For high values of σ there is a large overlap between Gaussian components. This results in small penalties in terms of negative log-likelihood for incorrect assignments. From a practical perspective, we start the training process with high values of σ, which provides a reasonable initialization of PluGeN. Next, we gradually decrease σ to match the correct assignments.
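The two expressions referred to in this subsection can be sketched as follows. This is a hedged reconstruction, assuming a Gaussian conditional centred on the label value; the symbols c_i (label variable in the flow latent space), y_i (attribute value), σ (smoothness) and X (training set) are notation chosen here for illustration and may differ from the original formulas.

```latex
% Assumed conditional: Gaussian centred on the continuous label value
p(c_i \mid y_i) = \mathcal{N}\!\left(c_i;\, y_i,\, \sigma^2\right)

% Marginalising over the labels observed in the training set X
% gives a 1-dimensional Gaussian kernel density estimator with bandwidth sigma
p(c_i) = \frac{1}{|X|} \sum_{x \in X} \mathcal{N}\!\left(c_i;\, y_i(x),\, \sigma^2\right)
```

Under this reading, σ plays the role of the KDE bandwidth, which is why large values smooth the density and small values tie each c_i closely to its label.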

Modeling imbalanced binary labels

Figure 9: Gradual modification of attributes (age, baldness, beard, and yaw, respectively) performed by PluGeN (left) and StyleFlow (right) using the StyleGAN backbone.
Figure 10: Attributes manipulation performed by PluGeN (left) and StyleFlow (right) using the StyleGAN backbone.
Figure 11: Examples of image attribute manipulation using VAE backbone. Panel titles: Input image, MSP, cFlow. Requested attribute combinations: ♂/♀ ± beard; ♂/♀ ± mkup; open/shut ± smile; ♂/♀ + bald or + bangs; hair ± glass.

In many cases, the class labels are imbalanced, which means that the number of examples from one class significantly exceeds the other class (e.g., only

examples in CelebA dataset have the ’glasses’ label). To deal with imbalanced data, we scale the variance of Gaussian density modeling the conditional distribution


We consider the conditional density of -th attribute represented by:


where and . We assume that are the fractions of examples with class 0 and 1, respectively. To deal with imbalanced classes we put


is a fixed parameter. For a majority class, standard deviation becomes higher, which introduces a lower penalty in the case of negative log-likelihood loss. The minority class has a higher penalty because we need to stop the mixture from collapsing into a single component.

Let us calculate the log-likelihood of our conditional prior density using the parametrization . We have


where is an extra weighting factor.

We observe that, for our selection of , the expected value of the weighting factors with respect to labeling variable equals 1. In consequence,

which is a typical log-likelihood of Gaussian distribution assuming class proportion .
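The weighting argument above can be made explicit. The following is a sketch under one assumption: the class standard deviations are scaled as σ_k = σ·sqrt(2 p_k) for k in {0, 1}, where p_k is the fraction of examples in class k and m_k is the class mean. This scaling and the symbol names are chosen here to be consistent with the surrounding statements (a higher standard deviation for the majority class, and weighting factors with expected value 1); they are not a verbatim copy of the original formulas.

```latex
\begin{align*}
\log \mathcal{N}\!\left(c_i;\, m_k,\, \sigma_k^2\right)
  &= -\log\!\left(\sqrt{2\pi}\,\sigma_k\right) - \frac{(c_i - m_k)^2}{2\sigma_k^2} \\
  &= -\log\!\left(\sqrt{2\pi}\,\sigma\right) - \tfrac{1}{2}\log(2 p_k)
     - \underbrace{\frac{1}{2 p_k}}_{w_k}\,\frac{(c_i - m_k)^2}{2\sigma^2}, \\[4pt]
\mathbb{E}_{y_i}\!\left[w_{y_i}\right]
  &= p_0 \cdot \frac{1}{2 p_0} + p_1 \cdot \frac{1}{2 p_1} = 1 .
\end{align*}
```

For balanced classes (p_0 = p_1 = 1/2) the weights are both 1 and σ_k = σ. For a minority class, p_k < 1/2 gives w_k > 1, so deviations from the class mean are penalized more strongly.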

Reducing σ during the training of PluGeN

Here, we describe the schedule for the parameter σ used for modeling the conditional distribution of each label variable. We want to ensure the flexibility of the INF at the beginning of the training, but we also need the attribute values to be strictly separated. In order to satisfy both of these conditions, we impose a schedule on the standard deviation σ. Starting with a high σ, we allow for great flexibility of our model, and then we obtain class separation by reducing the value of σ. Namely, we use the following schedule for the standard deviation of the class normal distributions:


is the index of the current epoch and

are hyperparameters setting, respectively, the starting point and the speed of value decay. The selection process of

and is described in the following sections.
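As an illustration of such a schedule, here is a minimal sketch assuming an exponential decay. The functional form, the function name `sigma_schedule`, and the parameters `sigma_start`, `decay` and `sigma_min` are all hypothetical; they only mirror the description of a starting point and a decay speed, and the schedule actually used may differ.

```python
def sigma_schedule(epoch: int, sigma_start: float, decay: float,
                   sigma_min: float = 0.0) -> float:
    """Standard deviation used at a given (0-based) epoch.

    Assumed form: sigma_start * decay**epoch, clipped from below at sigma_min.
    sigma_start sets the starting point, decay (in (0, 1]) sets the speed.
    """
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must lie in (0, 1]")
    return max(sigma_min, sigma_start * decay ** epoch)


# Example: start wide, then shrink so that the class components separate.
sigmas = [sigma_schedule(e, sigma_start=1.0, decay=0.5) for e in range(4)]
```

The lower bound `sigma_min` is included only to keep the Gaussian components from degenerating late in training; it is likewise an assumption.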

Semi-supervised setting

It is worth noticing that PluGeN can be trained in a semi-supervised setting, where only partial information about labeled attributes is available. Namely, for every latent representation we can use an arbitrary condition imposed on . If the value of -th label is unknown, then we use the marginal density:

instead of in the loss function (3). Here are the fraction of negatively and positively labeled examples in . Further investigations about this setting are left for future work.
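The per-attribute loss term in this setting can be sketched as below. This is a minimal sketch, assuming Gaussian class components and a two-component mixture for unknown labels, weighted by the class fractions. The function names, the default means of -1 and 1, and the default standard deviations are illustrative assumptions, not values confirmed by the text.

```python
import math


def gaussian_pdf(c: float, mean: float, std: float) -> float:
    """Density of a 1-dimensional normal distribution at c."""
    z = (c - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def attribute_nll(c, label, p0, p1, m0=-1.0, m1=1.0, s0=1.0, s1=1.0):
    """Negative log-likelihood of one label coordinate c.

    label is 0 or 1 when the attribute is annotated and None when it is
    unknown; in the latter case the marginal (mixture) density is used.
    """
    if label is None:
        density = p0 * gaussian_pdf(c, m0, s0) + p1 * gaussian_pdf(c, m1, s1)
    elif label == 0:
        density = gaussian_pdf(c, m0, s0)
    elif label == 1:
        density = gaussian_pdf(c, m1, s1)
    else:
        raise ValueError("label must be 0, 1, or None")
    # Note: for c very far from both means the density can underflow to 0;
    # a robust version would work in log-space with log-sum-exp.
    return -math.log(density)
```

With an unknown label, the term no longer pulls c toward one specific class mean; it only asks that c be plausible under either class, in proportion to how common each class is.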

Details of image experiments

All experiments were run on a single NVIDIA DGX Station with Ubuntu 20.04 LTS. The full hardware specification includes 8 Tesla V100 GPUs with 32GB VRAM, 512GB RAM, and Intel(R) Xeon(R) CPU E5-2698 v4. Each experiment was run using a single GPU. The code is based on the PyTorch Paszke et al. (2019) framework.

Architectures of the models

StyleGAN backbone

Our experiments were performed using the pre-trained, publicly available StyleGAN2 trained on the FFHQ dataset Karras et al. (2020).

PluGeN for StyleGAN backbone

We use NICE architecture with coupling layers with layers in each and width . We use Adam optimizer with learning rate and train model for epochs. The hyperparameters and used for modeling conditional distributions, are set to and , respectively.

VAE backbone

For our experiments, we reuse the VAE architecture from Li et al. (2020). We use an encoder with 5 convolutional layers, starting with 128 filters and doubling the number of filters in each subsequent layer. The decoder is symmetrical to the encoder. We use LeakyReLU activations. We train the network for 50 epochs with batch size 40 and Adam optimizer with the learning rate set to . We additionally train a PatchGAN model Li and Wand (2016) to improve the sharpness of the images.

PluGeN for VAE backbone

As previously, we use NICE architecture with coupling layers with layers in each and width . We train the model for epochs using Adam optimizer with learning rate and . The hyperparameters are set to and , respectively.


We also train the cFlow model on top of the base network. We use Conditional Masked Autoregressive Flow with layers of each consisting of reverse permutation and MADE component with residual blocks. Moreover, we encode the attributes using a linear layer, whose output is passed as a context input to the flow. We train the model for epochs using Adam optimizer with learning rate . During sampling, the temperature trick was used with .

ResNet classifier

To evaluate the correctness of attribute manipulation in the case of CelebA dataset, we used a standard ResNet-56 classifier. We trained it on the task of multi-label classification, with class weighting to correct for class imbalance. We used the Adam optimizer with the learning rate set to , batch size and trained it for epochs.

Additional Results

In this subsection, we present additional results and model comparisons, which were not included in the main paper because of space restrictions.

Manipulating the StyleGAN latent codes

In Figures 9 and 10, we present additional results of attribute manipulations performed by PluGeN and StyleFlow on the latent codes of the StyleGAN backbone. In most cases, PluGeN modifies only the requested attribute and leaves the remaining ones unchanged, which is not always true for StyleFlow (compare the 4th row of Figure 9 or the 3rd row of Figure 10). This suggests that the latent space produced by PluGeN is more disentangled than the one created by StyleFlow.

Manipulating images using VAE backbone

In Figure 11, we show additional results of image manipulation performed by PluGeN, MSP, and cFlow using VAE backbone. One can observe that PluGeN and MSP perform the requested modification more accurately than cFlow.

Manipulating attributes intensity of generated images

In this experiment, we consider images fully generated by PluGeN (not reconstructed images) attached to the VAE backbone. More precisely, we sample a single style variable from the prior distribution and manipulate the label variables of CelebA attributes. It is evident in Figure 12 that PluGeN freely interpolates between binary values of each attribute and even extrapolates outside the data distribution. This is possible thanks to the continuous form of prior distribution we are using in the latent space, which enables us to choose the intensity of each attribute. We emphasize that this information is not encoded in the dataset, where labels are represented as binary variables. However, in reality, an attribute such as ’narrow eyes’ covers a whole spectrum of possible eyelid positions, from eyes fully closed, through half-closed to wide open. PluGeN is able to recover this property without explicit supervision. Interestingly, we also see cases of extrapolation outside of the dataset, e.g. setting a significantly negative value of the ’bangs’ attribute, which can be interpreted as an illogical condition ’extreme absence of bangs’, creates a white spot on the forehead.

Figure 13 shows that the shape of the empirical distributions in the latent space of PluGeN allows for this continuous change. While the positive and negative classes of boolean attributes such as the presence of a hat or eyeglasses are clearly separated, in more continuous variables like youth and attractiveness they overlap significantly, allowing for smooth interpolations. This phenomenon emerges naturally, even though CelebA provides only binary labels for all the attributes.

Figure 12: Manipulating the intensity of labeled attributes of the generated sample. Since PluGeN models the values of the attributes with continuous distributions, it can control the intensity of each attribute and even sometimes extrapolate outside the data distribution (e.g. very bright blond hair).
Figure 13: The density of the positive and negative samples for chosen attributes in the flow latent space estimated using all examples from the CelebA test set. Binary attributes (top row) are clearly separated while continuous attributes (bottom row) overlap significantly.

Generation capabilities of MSP and the VAE backbone

In Figure 14 (top), we demonstrate that the base VAE model taken from the MSP paper Li et al. (2020) cannot generate new face images, but can only manipulate the attributes of input examples. In consequence, it works similarly to a plain autoencoder. For this reason, it is especially notable that PluGeN can improve the generation performance of the backbone model (see the main paper). In contrast, MSP cannot generate new face images using this VAE model, as shown in the bottom row of Figure 14. Only for very low temperatures does MSP generate face images, and these are typical (not diverse) faces.



Figure 14: Samples from the base VAE (top row) and MSP (bottom row) models using increasing values of the temperature parameter (0.1 to 1.0 in steps of 0.1, left to right). MSP generates typical face images only for a very low temperature, while VAE does not generate face images at all.

Generating images with attributes combinations taken from test set

We present additional quantitative results for generating images with the requested combinations of attributes. In this experiment, we focus on typical combinations, which appear in the dataset. For this purpose, we generate 20,000 images with the same attribute combinations as in the CelebA test set. The results presented in Table 4 show that PluGeN outperforms cFlow, cVAE, and Δ-GAN in terms of classification scores.

Metric  PluGeN  cFlow  Δ-GAN  cVAE
F1      0.69    0.49   0.58   0.59
AUC     0.92    0.85   0.87   0.88
Table 4: Average classification metrics for generating images with the combinations of attributes taken from the test set of CelebA.



In our main experiments, we use the NICE Dinh, Krueger, and Bengio (2014) approach to flow-based models. This choice was motivated by the computational and conceptual simplicity of the approach. However, we also empirically evaluated a more complex approach, continuous normalizing flows (CNF) Chen et al. (2018), which casts the distribution modeling task as a problem of solving differential equations. The CNF implementation consisted of stacked CNFs, each containing concatsquash layers with a hidden dimension . Table 5 shows the results of both approaches in the task of multi-label conditional generation using the VAE backbone. We use the same combinations of attributes as in the CelebA test set. Table 6 shows an analogous comparison when the attributes were sampled independently, which is a more challenging setting. In both settings, the results for NICE and CNF are comparable. Although CNF samples get better FIDs, they also score worse on the classification metrics, which suggests that the model might be worse at enforcing the class conditions. Overall, both models perform similarly, and we therefore use NICE, as it is less expensive computationally.

Metric  NICE   CNF
FID     72.72  68.96
F1      0.69   0.63
AUC     0.92   0.89
Table 5: Average classification metrics and FIDs for generating images with the combinations of attributes taken from the test set of CelebA.

Metric  NICE   CNF
FID     77.48  73.31
F1      0.44   0.41
AUC     0.78   0.75
Table 6: Average classification metrics and FIDs for generating images, when the values of attributes were sampled independently.

Different autoencoder backbones

In order to investigate how the structure of the latent space of the backbone autoencoder impacts the performance of our model, we check multiple β-VAE models with varying values of β. For each model we trained three architectures of INFs (small, medium, big) and picked the best performing ones for evaluation. The results presented in Table 7 show that the FID scores get worse as the value of β increases. This is caused by the drop in the reconstructive power of the base model, which focuses more on the latent space regularization instead. Interestingly, the statistics also fall as the value of β gets too low. The flow-based model cannot disentangle factors of variation from a latent space which is not already at least partially structured. This experiment shows a limitation of our model with respect to its reliance on the performance of the backbone autoencoder. However, PluGeN is still quite robust, as it achieves good results for a wide range of β values.

β    0.5    1      2      4      8      16
FID  61.86  55.11  61.96  65.76  77.94  110.46
F1   0.45   0.66   0.63   0.59   0.57   0.53
AUC  0.79   0.90   0.88   0.87   0.86   0.83
Table 7: Results for PluGeN using β-VAE backbone for different values of β.

Details of molecules generation experiments


Designing a new drug is a long and expensive process that can cost up to 10 billion dollars and take as long as 10 years Mestre-Ferrandiz et al. (2012). The recent spread of the SARS-CoV-2 virus and the pandemic it caused have shown how important it is to speed up this process. Recently, deep learning has been gaining popularity in the cheminformatics community, where it is used to propose new drug candidates. However, using neural networks in the drug generation task is not easy and is fraught with problems. The complexity of the chemical space is high, and thus training generative and predictive models is challenging. Although there are around 10^60 possible molecules Bohacek, McMartin, and Guida (1996), detailed information (such as class labels) is known only about a small percentage of them. For example, the ChEMBL database Gaulton et al. (2017), one of the biggest databases with information about molecular attributes, contains data for 2.1 M chemical compounds. Moreover, since obtaining labeled data requires long and costly laboratory experiments, the number of labeled molecules in the training datasets is usually very small (often less than 1000), which is often not sufficient to train a good model. This poses an important research problem.

Deep neural networks are mostly used in cheminformatics for the following tasks:

  • virtual screening – the search for potentially active compounds in the libraries of commercially available molecules using predictive models Coley et al. (2017); Yang et al. (2019),

  • de novo design – generating new compounds with desirable properties that are not present in the above-mentioned libraries Olivecrona et al. (2017); Popova, Isayev, and Tropsha (2018),

  • lead optimization – improving selected promising compounds to meet certain criteria Jin et al. (2019); Maziarka et al. (2020).

PluGeN can be used for the two latter tasks, as our model can generate molecules with specified values of given attributes as well as optimize molecules by changing the value of selected labels.

SMILES representation

SMILES Weininger (1988) (simplified molecular-input line-entry system) is a notation for describing the structure of chemical species using a sequence of characters. The SMILES representation is based on a specially defined grammar, which guarantees that a correct SMILES string defines a unique molecule. The converse does not hold, as a molecule can be encoded by multiple SMILES strings. To obtain a one-to-one correspondence, the community introduced the canonicalization algorithm, which returns the canonical SMILES that is unique for each molecule.

In Figure 15 we show two molecules together with their canonical SMILES as well as other SMILES representations.

(a) Melatonin
(b) Vanillin
Figure 15: Sample molecules together with their SMILES representations.
Figure 16: Density plots of chemistry attributes present in the training dataset.
      logP   TPSA   SAS
logP  1.00   -0.16  0.51
TPSA  -0.16  1.00   -0.18
SAS   -0.51  -0.18  1.00
Table 8: Correlations of attributes for chemical molecules modeling.

Modeled attributes

In our chemistry experiments, we modeled 3 chemical attributes: logP, TPSA, and SAS. Below, we describe what each of them measures:

  • logP – logarithm of the partition coefficient. It describes the solubility of the molecule in fats and indicates how well the molecule passes through membranes.

  • TPSA – the topological polar surface area of a molecule is the surface sum over all polar atoms or molecules (together with their attached hydrogen atoms). TPSA could be used as a metric of the ability of a drug to permeate cells.

  • SAS – synthetic accessibility score defines the ease of synthesis of a drug-like molecule. When generating a drug candidate, one would rather want it to be easily synthesized so that it can be obtained in the laboratory.


We conducted our chemistry experiments using a dataset of 250k molecules sampled from the ZINC database Sterling and Irwin (2015), which is a dataset of commercially available chemical compounds. The mean number of SMILES tokens in our dataset is equal to 38.31, with a standard deviation equal to 8.46.

Figure 16 shows the distribution of attributes of molecules that make up our training dataset.

Since the values of the chemical attributes are related to the structure of the molecule, many of them will be correlated in some way. In Table 8 we present the correlations between the chemistry attributes. The correlations suggest that it might be difficult or even impossible to manipulate logP and SAS attributes independently, setting a difficult challenge for PluGeN.
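A correlation table of this kind can be computed with the Pearson coefficient, as in the minimal sketch below. The helper names and the toy values are hypothetical and are not taken from the dataset; in practice the attribute values would typically be computed with a cheminformatics toolkit for every molecule.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict: attribute name -> values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Toy, made-up attribute values for four molecules (illustration only).
toy = {
    "logP": [1.2, 2.5, 3.1, 4.0],
    "TPSA": [80.0, 65.0, 70.0, 40.0],
    "SAS": [3.5, 3.0, 2.8, 2.1],
}
matrix = correlation_matrix(toy)
```

By construction the resulting matrix is symmetric with ones on the diagonal, so a strong off-diagonal entry signals a pair of attributes that are hard to change independently.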



The encoder consists of 3 bi-GRU Cho et al. (2014) layers, with hidden size equal to 256 and output size (latent dimensionality) equal to 100. The decoder consists of 3 GRU layers with the hidden size equal to 256. The architecture of the backbone model is significantly different from the one used in the image domain, which partially confirms that PluGeN can be combined with various autoencoder models.

We trained the VAE model for 100 epochs, using batch size of 256 and learning rate equal to 1e-4.


The flow model consisted of 6 coupling layers, each of which consists of 6 dense layers with a hidden size equal to 256. We trained NICE for 50 epochs, with learning rate equal to 1e-4 and batch size 256. We used and .

Figure 17: Distribution of labeled attributes for generated molecules (for the experiment with conditions on multiple attributes), together with the distribution for the training dataset. The average of every distribution is marked with a vertical line. Panels: (a) LogP = 1.0, TPSA = 60.0, SAS = 5.0; (b) LogP = 3.0, TPSA = 75.0, SAS = 3.0; (c) LogP = 5.0, TPSA = 50.0, SAS = 2.0.
Figure 18: Molecules obtained by the model during an optimization phase (left side), together with their LogP (right side). Panels: (a) molecules decoded from path; (b) LogP of presented molecules.
Figure 19: Molecules obtained by the model during an optimization phase (left side), together with their LogP (right side). Panels: (a) molecules decoded from path; (b) LogP of presented molecules.
Figure 20: Molecules obtained by the model during an optimization phase (left side), together with their TPSA (right side). Panels: (a) molecules decoded from path; (b) TPSA of presented molecules.
Figure 21: Molecules obtained by the model during an optimization phase (left side), together with their TPSA (right side). Panels: (a) molecules decoded from path; (b) TPSA of presented molecules.
Figure 22: Molecules obtained by the model during an optimization phase (left side), together with their SAS (right side). Panels: (a) molecules decoded from path; (b) SAS of presented molecules.
Figure 23: Molecules obtained by the model during an optimization phase (left side), together with their SAS (right side). Panels: (a) molecules decoded from path; (b) SAS of presented molecules.

Additional experiments

In the following subsection, we show additional results for the chemistry-based experiments, covering both conditional generation and latent space traversal. Furthermore, we show how PluGeN works with the conditional normalizing flow instead of NICE as a base flow model.

Conditional generation

In the main paper, we presented results for conditional generation in the setting of a single attribute condition (where the values of the remaining attributes were sampled from their prior distribution). Here we also show results for the situation where we set conditions on all attributes at the same time.

In particular, we tested 3 different settings:

  1. LogP set to 1.0, TPSA set to 60.0, SAS set to 5.0.

  2. LogP set to 3.0, TPSA set to 75.0, SAS set to 3.0.

  3. LogP set to 5.0, TPSA set to 50.0, SAS set to 2.0.

The density plots of the attributes of the molecules generated in these settings are presented in Figure 17.
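The multi-attribute setting can be sketched as follows. This is only an illustration under stated assumptions: the first three coordinates of the flow space are assumed to carry LogP, TPSA and SAS directly (the actual parameterization of the attribute coordinates may differ), the remaining coordinates are drawn from a standard normal prior, and `flow_inverse` and `decode` are hypothetical callables standing in for the trained flow and the VAE decoder. Identity functions are used as stand-ins so that the sketch runs.

```python
import random

LATENT_DIM = 100  # latent dimensionality of the backbone VAE (from the text)
ATTRS = ("LogP", "TPSA", "SAS")  # assumed order of the attribute coordinates

def build_flow_code(targets, rng):
    """Attribute coordinates fixed to the targets, the rest drawn from N(0, 1)."""
    attr_part = [float(targets[name]) for name in ATTRS]
    free_part = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM - len(ATTRS))]
    return attr_part + free_part

def generate(targets, flow_inverse, decode, n_samples, seed=0):
    """Map flow-space codes back to the VAE latent space, then decode them."""
    rng = random.Random(seed)
    return [decode(flow_inverse(build_flow_code(targets, rng)))
            for _ in range(n_samples)]

# The three settings listed above.
settings = [
    {"LogP": 1.0, "TPSA": 60.0, "SAS": 5.0},
    {"LogP": 3.0, "TPSA": 75.0, "SAS": 3.0},
    {"LogP": 5.0, "TPSA": 50.0, "SAS": 2.0},
]

def identity(v):
    """Stand-in for both the trained inverse flow and the decoder."""
    return v

samples = {i: generate(s, identity, identity, n_samples=4)
           for i, s in enumerate(settings, start=1)}
```

With a trained model, the decoded molecules would then be scored for LogP, TPSA and SAS to produce density plots like those in Figure 17.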

Latent space traversal

We also present more results for latent space traversal, a task that imitates the inter-class interpolation experiments from the image domain. We tested how PluGeN can traverse the latent space of CharVAE. To this end, we selected a few random molecules from our dataset and, for each one, forced PluGeN to gradually increase the value of the specified attribute by a fixed amount and decoded the resulting latent codes back into molecules. The goal of this task is to generate molecules that are structurally similar to the initial one, except for changes in the desired attribute. This is an important challenge in the lead optimization stage of the drug discovery process.

LogP. For LogP, we forced PluGeN to increase the molecular attribute value by 3. Figures 18 and 19 show the obtained molecules, together with the optimized attribute values.

TPSA. For TPSA, we forced PluGeN to increase the molecular attribute value by 40. Figures 20 and 21 show the obtained molecules, together with the optimized attribute values.

SAS. For SAS, we forced PluGeN to increase the molecular attribute value by 2. Figures 22 and 23 show the obtained molecules, together with the optimized attribute values.
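The traversal procedure can be sketched as below. The increments are the ones stated above; everything else is an assumption made for illustration: the index of each attribute coordinate in the flow space, the number of steps, and the toy invertible map used in place of the trained flow. In practice each latent code on the path would be passed to the VAE decoder to obtain a molecule.

```python
INCREMENTS = {"LogP": 3.0, "TPSA": 40.0, "SAS": 2.0}  # from the text
ATTR_INDEX = {"LogP": 0, "TPSA": 1, "SAS": 2}  # assumed coordinate layout

def traverse(z, attr, steps, flow_forward, flow_inverse):
    """Return latent codes along a path that raises one attribute coordinate."""
    w = flow_forward(z)  # move to the flow space
    idx, delta = ATTR_INDEX[attr], INCREMENTS[attr]
    path = []
    for k in range(steps + 1):
        w_k = list(w)
        w_k[idx] = w[idx] + delta * k / steps  # linear schedule over the path
        path.append(flow_inverse(w_k))  # back to the VAE latent space
    return path

def toy_forward(z):
    """Toy invertible map standing in for the trained flow (NOT the real model)."""
    return [v + 1.0 for v in z]

def toy_inverse(w):
    return [v - 1.0 for v in w]

z0 = [0.5, 60.0, 3.0, 0.1]
path = traverse(z0, "LogP", steps=3, flow_forward=toy_forward, flow_inverse=toy_inverse)
```

Only the selected coordinate changes along the path, which is what allows the other attributes and, ideally, most of the molecular structure to stay fixed.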


We also tested how replacing NICE Dinh, Krueger, and Bengio (2014) with a conditional normalizing flow (CNF) Chen et al. (2018) affects the process of molecular generation using PluGeN. For this purpose, we repeated the chemistry-based conditional generation experiments from the main text, but with CNF as our backbone flow model. Results are presented in Figure 24. In this version, PluGeN is also capable of moving the density of the attributes of the generated molecules towards the desired value. The obtained changes, however, are worse than with NICE as the flow backbone.

Figure 24: Distribution of labeled attributes for generated molecules for PluGeN with the conditional normalizing flow, together with distribution for the training dataset. Each color shows the value of the labeled attribute that was used for generation.