Generative models such as GANs and variational autoencoders have achieved great results in recent years, especially in the domains of images Brock, Donahue, and Simonyan (2018); Brown et al. (2020) and cheminformatics Gómez-Bombarelli et al. (2018); Jin, Barzilay, and Jaakkola (2018). However, in many practical applications, we need to control the process of creating samples by enforcing particular features of the generated objects. This may be required to regulate the biases present in the data, e.g. to ensure that people of each ethnicity are properly represented in a generated set of face images. In numerous realistic problems, such as drug discovery, we want to find objects with desired properties, like molecules with a particular activity, non-toxicity, and solubility.
Designing conditional variants of generative models that operate on multiple labels is a challenging problem due to intricate relations among the attributes. Practically, it means that some combinations of attributes (e.g. a woman with a beard) might be unobserved or rarely observed in the training data. In essence, the model should be able to go beyond the distribution of seen data and generate examples with combinations of attributes not encountered previously. One might approach this problem by building a new conditional generative model from the ground up or by designing a solution tailored to a specific existing unsupervised generative model. However, this introduces additional effort whenever one wants to adapt it to a newly invented approach.
To tackle this problem while leveraging the power of existing techniques, we propose PluGeN (Plugin Generative Network), a simple yet effective generative technique that can be used as a plugin to various pre-trained generative models such as VAEs or GANs, see Figure 1 for a demonstration. Making use of PluGeN, we can manipulate the attributes of input examples as well as generate new samples with desired features. When training the proposed module, we do not change the parameters of the base model, and thus we retain its generative and reconstructive abilities, which places our work in the emerging family of non-invasive network adaptation methods Wołczyk et al. (2021); Rebuffi, Bilen, and Vedaldi (2017); Koperski et al. (2020); Kaya, Hong, and Dumitras (2019); Zhou et al. (2020).
Our idea is to find a mapping between the entangled latent representation of the backbone model and a disentangled space, where each dimension corresponds to a single, interpretable attribute of the image. By factorizing the true data distribution into independent components, we can sample from each component independently, which results in creating samples with arbitrary combinations of attributes, see Figure 2. In contrast to many previous works, which are constrained to the attribute combinations visible in the training set, PluGeN gives us full control over the generation process, being able to create uncommon combinations of attributes, such as a woman with a beard or a man with heavy make-up. Generating samples with unseen combinations of attributes can be viewed as extending the distribution of generative models to unexplored yet reasonable regions of the data space, which distinguishes our approach from existing solutions.
Extensive experiments performed on the domain of images and a dataset of chemical compounds demonstrate that PluGeN is a reusable plugin that can be applied to various architectures including GANs and VAEs. In contrast to the baselines, PluGeN can generate new samples as well as manipulate the properties of existing examples, being capable of creating uncommon combinations of attributes.
Our contributions are as follows:
We propose a universal and reusable plugin for multi-label generation and manipulation that can be attached to various generative models and applied to diverse domains, such as chemical molecule modeling.
We introduce a novel way of modeling conditional distributions using invertible normalizing flows based on the latent space factorization.
We experimentally demonstrate that PluGeN can produce samples with uncommon combinations of attributes going beyond the distribution of training data.
Conditional VAE (cVAE) is one of the first methods which include additional information about the labeled attributes in a generative model Kingma et al. (2014). Although this approach has been widely used in various areas ranging from image generation Sohn, Lee, and Yan (2015); Yan et al. (2016); Klys, Snell, and Zemel (2018) to molecular design Kang and Cho (2018), the independence of the latent vector from the attribute data is not assured, which negatively influences the generation quality. Conditional GAN (cGAN) is an alternative approach that gives results of significantly better quality Mirza and Osindero (2014); Perarnau et al. (2016); He et al. (2019), but the model is more difficult to train Kodali et al. (2017). cGAN works very well for generating new images, and the conditioning factors may take various forms (images, sketches, labels) Park et al. (2019); Jo and Park (2019); Choi et al. (2020), but manipulating existing examples is more problematic because GAN models lack an encoder network Tov et al. (2021). Fader Networks Lample et al. (2017) combine features of both cVAE and cGAN, as they use an encoder-decoder architecture together with a discriminator, which predicts the image attributes from the latent vector returned by the encoder. As discussed in Li et al. (2020), the training of Fader Networks is even more difficult than that of standard GANs, and the disentanglement of attributes is not preserved. MSP Li et al. (2020) is a recent auto-encoder based architecture with an additional projection matrix, which is responsible for disentangling the latent space and separating the attribute information from other characteristic information. In contrast to PluGeN, MSP cannot be used with pre-trained GANs and performs poorly at generating new images (it was designed for manipulating existing examples). CAGlow Liu et al. (2019) is an adaptation of Glow Kingma and Dhariwal (2018) to conditional image generation based on modeling a joint probabilistic density of an image and its conditions. Since CAGlow does not reduce the data dimension, applying it to more complex data might be problematic.
While the above approaches focus on building conditional generative models from scratch, recent works often focus on manipulating the latent codes of pre-trained models. StyleFlow Abdal et al. (2021) operates on the latent space of StyleGAN Karras, Laine, and Aila (2019) using a conditional continuous flow module. Although the quality of generated images is impressive, the model has not been applied to generative models other than StyleGAN or to domains other than images. Moreover, StyleFlow needs an additional classifier to compute the conditioning factor (labels) for images at test time. Competitive approaches to StyleGAN appear in Gao et al. (2021); Tewari et al. (2020); Härkönen et al. (2020); Nitzan et al. (2020). Hijack-GAN Wang, Yu, and Fritz (2021) is a framework for attribute manipulation that uses the latent space of any GAN by designing a proxy model to traverse the latent space.
Plugin Generative Network
We propose a plugin generative network (PluGeN), which can be attached to pre-trained generative models and allows for direct manipulation of labeled attributes, see Figure 3 for the basic scheme of PluGeN. Making use of PluGeN we preserve all properties of the base model, such as generation quality and reconstruction in the case of auto-encoders, while adding new functionalities. In particular, we can:
modify selected attributes of existing examples,
generate new samples with desired labels.
In contrast to typical conditional generative models, PluGeN is capable of creating examples with rare or even unseen combinations of attributes, e.g. a man with makeup.
Probabilistic model. PluGeN works in a multi-label setting, where every example x is associated with a K-dimensional vector of binary labels y = (y_1, …, y_K). (Our model can be extended to continuous values, which we describe in the supplementary materials due to the page limit.) We assume that there is a pre-trained generative model G : Z → X, where Z is the latent space, which is usually heavily entangled. That is, although each latent code z ∈ Z contains information about the labels y, there is no direct way to extract or modify it.
We want to map this entangled latent space Z into a separate latent space W, which encodes the value of each label y_i as a separate random variable c_i living in a single dimension of this space. Thus, by changing the value of c_i, going back to the entangled space, and generating a sample, we can control the value of the i-th attribute. Since labeled attributes usually do not fully describe a given example, we consider additional random variables s_1, …, s_N, which are supposed to encode the information not included in the labels. We call c_1, …, c_K the label variables (or attributes) and s_1, …, s_N the style variables.
Since we want to control the value of each attribute independently of any other factors, we assume a factorized form of the probability distribution of the random vector w = (c_1, …, c_K, s_1, …, s_N). More precisely, the conditional probability distribution of w given a condition imposed on the labeled attributes is of the form:

p(c_1, …, c_K, s_1, …, s_N | y) = p(c_1 | y_1) ··· p(c_K | y_K) · p(s_1) ··· p(s_N),   (1)

for every condition y = (y_1, …, y_K). In other words, modifying y_i influences only the i-th factor c_i, leaving the other features unchanged.
Parametrization. To instantiate the above probabilistic model (1), we need to parametrize the conditional distribution of c_i given y_i and the distribution of s_j. Since we do not impose any constraints on the style variables, we use the standard Gaussian distribution for modeling the density of s_j:

p(s_j) = N(s_j | 0, 1).
To provide consistency with the continuous style variables and avoid potential problems with training our deep learning model on discrete distributions, we use a mixture of two Gaussians for modeling the presence of labels: each component corresponds to one of the potential values of the label (y_i = −1 or y_i = 1). More precisely, the conditional distribution of c_i given y_i is parametrized by:

p(c_i | y_i) = N(c_i | a · y_i, σ²),   (2)

where a, σ > 0 are user-defined parameters. If y_i = 1, then the latent factor c_i takes values close to a; otherwise, we get values around −a (depending on the values of a and σ). To provide good separation between the components, we put σ ≪ a; the selection of σ is discussed in the supplementary materials.
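As a concrete illustration, sampling from this conditional prior takes only a few lines. The sketch below is a minimal numpy mock-up: the function name `sample_prior` and the values of `a` and `sigma` are our illustrative placeholders, not the parameters used in the paper.

```python
import numpy as np

def sample_prior(y, n_style, a=1.0, sigma=0.1, rng=None):
    """Sample w = (c_1..c_K, s_1..s_N) from the factorized conditional prior.

    y       -- K binary labels in {-1, +1}
    n_style -- number of unconstrained style dimensions
    a, sigma are placeholder mixture parameters (illustrative values).
    """
    rng = rng or np.random.default_rng(0)
    y = np.asarray(y, dtype=float)
    c = rng.normal(loc=a * y, scale=sigma)  # label variables: N(a * y_i, sigma^2)
    s = rng.normal(size=n_style)            # style variables: N(0, 1)
    return np.concatenate([c, s])

w = sample_prior(y=[+1, -1, +1], n_style=4)
print(w.shape)  # (7,)
```

Because each label lives in its own coordinate, fixing `y` and resampling only the style part varies the appearance while keeping the requested attributes.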
Thanks to this continuous parametrization, we can smoothly interpolate between different labels, which would not be as easy using, e.g., the Gumbel-softmax parametrization Jang, Gu, and Poole (2016). In consequence, we can gradually change the intensity of certain labels, like smile or beard, even though such information was not available in the training set (see Figure 4 in the experimental section).
Training the model
To establish a two-way mapping between the entangled space Z and the disentangled space W, we use an invertible normalizing flow (INF), F : Z → W. Let us recall that an INF is a neural network for which the inverse mapping is given explicitly and the Jacobian determinant can be easily calculated Dinh, Krueger, and Bengio (2014). Due to the invertibility of the INF, we can transform latent codes to the prior distribution of the INF, modify selected attributes, and map the resulting vector back to Z. Moreover, INFs can be trained using a log-likelihood loss, which is very appealing in generative modeling.
Summarizing, given a latent representation z of a sample with label y = (y_1, …, y_K), the loss function of PluGeN equals:

L(z, y) = − Σ_{i=1}^{K} log p(c_i | y_i) − Σ_{j=1}^{N} log p(s_j) − log |det ∂F(z)/∂z|,   (3)

where (c_1, …, c_K, s_1, …, s_N) = F(z). In the training phase, we collect the latent representations z of the data points. Making use of the labeled attributes y associated with every data point, we modify the weights of F so as to minimize the negative log-likelihood (3) using gradient descent. The weights of the base model G are kept frozen.
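To make the loss concrete, the following self-contained sketch wires a single NICE-style additive coupling layer (for which log|det J| = 0) to the factorized prior and evaluates the negative log-likelihood (3). The tiny network `m`, its weights, and the parameters `a` and `sigma` are illustrative stand-ins, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))  # toy coupling-net weights

def m(h):
    """Small fixed network used inside the coupling layer."""
    return np.tanh(h @ W1) @ W2

def flow_forward(z):
    """One additive coupling layer (NICE): volume preserving, log|det J| = 0."""
    z1, z2 = z[:2], z[2:]
    return np.concatenate([z1, z2 + m(z1)]), 0.0  # (w, log|det J|)

def flow_inverse(w):
    w1, w2 = w[:2], w[2:]
    return np.concatenate([w1, w2 - m(w1)])

def log_normal(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def plugen_loss(z, y, n_labels=2, a=1.0, sigma=0.1):
    """Negative log-likelihood (3) of a latent code z with labels y."""
    w, logdet = flow_forward(z)
    c, s = w[:n_labels], w[n_labels:]
    log_lik = (log_normal(c, a * np.asarray(y, float), sigma).sum()
               + log_normal(s, 0.0, 1.0).sum())
    return -(log_lik + logdet)

z = np.array([0.2, -0.4, 1.0, 0.3])
print(plugen_loss(z, y=[1, -1]))
```

In training, gradient descent would update the coupling-net weights while the base model stays frozen; invertibility means `flow_inverse` recovers `z` from `flow_forward(z)` exactly.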
In contrast to many previous works, e.g. Abdal et al. (2021), PluGeN can be trained in a semi-supervised setting, where only partial information about the labeled attributes is available (see supplementary materials for details).
Inference. We may use PluGeN to generate new samples with desired attributes as well as to manipulate the attributes of input examples. In the first case, we generate a vector w from the conditional distribution (1) with the selected condition y. To get the output sample, the vector w is transformed by the inverse flow F⁻¹ and the base generative network G, which gives us the final output x = G(F⁻¹(w)).
In the second case, to manipulate the attributes of an existing example x, we need to find its latent representation z. If G is the decoder network of an autoencoder model, then x should be passed through the encoder network to obtain z Li et al. (2020). If G is a GAN, then z can be found by minimizing the reconstruction error between x and G(z) using gradient descent for a frozen G Abdal et al. (2021). In both cases, z is next processed by the INF, which gives us its factorized representation (c_1, …, c_K, s_1, …, s_N) = F(z). In this representation, we can modify any label variable c_i and map the resulting vector back through F⁻¹ and G as in the generative case.
Observe that PluGeN does not need to know the values of the labeled attributes when it modifies the attributes of existing examples. Given a latent representation z, PluGeN maps it through F, which gives us the factorization into labeled and unlabeled attributes. In contrast, existing solutions based on conditional INFs, e.g. StyleFlow Abdal et al. (2021), have to determine all labels before passing through the INF, as they represent the conditioning factors. In consequence, these models involve additional classifiers for the labeled attributes.
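This editing loop can be sketched in a few lines; here a fixed invertible linear map `A` stands in for the trained flow F, and `edit_attribute` is a hypothetical helper, not the paper's API.

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy invertible linear map standing in for the trained INF (diagonally
# dominated, so it is safely invertible).
A = rng.normal(size=(4, 4)) + 4.0 * np.eye(4)
A_inv = np.linalg.inv(A)

def edit_attribute(z, i, new_value):
    """Map z to the factorized space, overwrite the i-th label variable,
    and map back; the labels are read off F(z) directly, so no auxiliary
    classifier is needed."""
    w = A @ z           # (c_1..c_K, s_1..s_N) = F(z)
    w[i] = new_value    # set c_i to the code of the desired label value
    return A_inv @ w    # edited latent code for the base generator

z = np.array([0.5, -0.2, 0.1, 0.9])
z_edit = edit_attribute(z, i=0, new_value=1.0)  # push attribute 0 towards "on"
print(np.round(z_edit, 3))
```

Setting `new_value` to the current value of the label coordinate leaves the latent code unchanged, which is how invertibility preserves reconstructions.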
To empirically evaluate the properties of PluGeN, we combine it with GAN and VAE architectures to manipulate attributes of image data. Moreover, we present a practical use case of chemical molecule modeling using CharVAE. Due to the page limit, architecture details and additional results are included in the supplementary materials.
First, we consider the state-of-the-art StyleGAN architecture Karras, Laine, and Aila (2019), which was trained on Flickr-Faces-HQ (FFHQ), a dataset containing 70 000 high-quality images of resolution 1024 × 1024. The Microsoft Face API was used to label 8 attributes in each image (gender, pitch, yaw, eyeglasses, age, facial hair, expression, and baldness).
PluGeN is instantiated using the NICE flow model Dinh, Krueger, and Bengio (2014), which operates on the latent vectors sampled from the latent space of StyleGAN. As a baseline, we select StyleFlow Abdal et al. (2021), which is currently one of the state-of-the-art models for controlling the generation process of StyleGAN. In contrast to PluGeN, StyleFlow uses a conditional continuous INF to operate on the latent codes of StyleGAN, where the conditioning factor corresponds to the labeled attributes. For evaluation, we modify one of 5 attributes (the remaining 3 attributes, age, pitch, and yaw, are continuous, and it is more difficult to assess their modifications) and verify the success of this operation using the prediction accuracy returned by the Microsoft Face API. The quality of images is additionally assessed by calculating the standard Fréchet Inception Distance (FID) Heusel et al. (2017).
Figures 1 (first page) and 4 present how PluGeN and StyleFlow manipulate images sampled by StyleGAN. It is evident that PluGeN can switch the labels to opposite values as well as gradually change their intensities. At the same time, the requested modifications do not influence the remaining attributes, leaving them unchanged. The results produced by StyleFlow are also acceptable, but the modification of the requested attribute implies a change of other attributes. For example, increasing the intensity of ”baldness” changes the type of glasses, and turning the head to the right makes the woman look straight ahead.
The above qualitative evaluation is supported by the quantitative assessment presented in Table 1. As can be seen, StyleFlow obtains a better FID score, while PluGeN outperforms StyleFlow in terms of accuracy. Since FID compares the distribution of generated and real images, creating images with uncommon combinations of attributes that do not appear in a training set may be scored lower, which can explain the relation between accuracy and FID obtained by PluGeN and StyleFlow. In consequence, FID is not an adequate metric for measuring the quality of arbitrary image manipulations considered here, because it is too closely tied to the distribution of input images.
It is worth mentioning that PluGeN obtains these very good results using the NICE model, which is the simplest type of INF. In contrast, StyleFlow uses a continuous INF, which is significantly more complex and requires an ODE solver, leading to unstable training. Moreover, to modify even a single attribute, StyleFlow needs to determine the values of all labels, since they represent the conditioning factors of the INF. In consequence, every modification requires applying an auxiliary classifier to predict all image labels. The usage of PluGeN is significantly simpler, as subsequent coordinates in the latent space of the INF correspond to the labeled attributes and are automatically determined by PluGeN.
Image manipulation on VAE backbone
In the following experiment, we show that PluGeN can be combined with autoencoder models to effectively manipulate image attributes. We use the CelebA database, where every image is annotated with binary attribute labels.
We compare PluGeN to MSP Li et al. (2020), a strong baseline, which uses a specific loss for disentangling the latent space of a VAE. Following the idea of StyleFlow, we also consider a conditional INF attached to the latent space of a pre-trained VAE (referred to as cFlow), where the conditioning factors correspond to the labeled attributes. The architecture of the base VAE and the evaluation protocol were taken from the original MSP paper. More precisely, for every input image, we manipulate the values of two attributes (we inspect 20 combinations in total). The success of the requested manipulation is verified using a multi-label ResNet-56 classifier trained on the original CelebA dataset.
The sample results presented in Figure 5 demonstrate that PluGeN attached to a VAE produces high-quality images satisfying the constraints imposed on the labeled attributes. The quantitative comparison shown in Table 2 confirms that PluGeN is extremely efficient in creating uncommon combinations of attributes, while cFlow performs well only for the usual combinations. At the same time, the quality of images produced by PluGeN and MSP is better than in the case of cFlow. Although both PluGeN and MSP focus on disentangling the latent space of the base model, MSP has to be trained jointly with the base VAE model and is applicable only to autoencoder models. In contrast, PluGeN is a separate module, which can be attached to arbitrary pre-trained models. Due to the use of invertible neural networks, it preserves the reconstruction quality of the base model, while adding manipulation functionalities. In the following experiment, we show that PluGeN also performs well at generating entirely new images, which is not possible using MSP.
| Attribute combination | PluGeN | MSP | cFlow |
|---|---|---|---|
| male x beard | 0.80 | 0.79 | 0.85 |
| female x beard | 0.59 | 0.33 | 0.31 |
| male x no-beard | 0.88 | 0.92 | 0.91 |
| female x no-beard | 0.85 | 0.82 | 0.95 |
| male x makeup | 0.44 | 0.43 | 0.29 |
| male x no-makeup | 0.72 | 0.92 | 0.96 |
| female x makeup | 0.42 | 0.41 | 0.58 |
| female x no-makeup | 0.55 | 0.40 | 0.85 |
| smile x open-mouth | 0.97 | 0.99 | 0.79 |
| no-smile x open-mouth | 0.79 | 0.82 | 0.77 |
| smile x calm-mouth | 0.84 | 0.91 | 0.72 |
| no-smile x calm-mouth | 0.96 | 0.97 | 0.99 |
| male x bald | 0.26 | 0.41 | 0.34 |
| male x bangs | 0.58 | 0.74 | 0.45 |
| female x bald | 0.19 | 0.13 | 0.39 |
| female x bangs | 0.52 | 0.49 | 0.60 |
| no-glasses x black-hair | 0.92 | 0.93 | 0.74 |
| no-glasses x golden-hair | 0.92 | 0.91 | 0.81 |
| glasses x black-hair | 0.76 | 0.90 | 0.58 |
| glasses x golden-hair | 0.75 | 0.85 | 0.61 |
Image generation with VAE backbone
In addition to manipulating the labeled attributes of existing images, PluGeN generates new examples with desired attributes. To verify this property, we use the same VAE architecture as before, trained on the CelebA dataset. The baselines include cFlow and two previously introduced methods for multi-label conditional generation: cVAE Yan et al. (2016) and Δ-GAN Gan et al. (2017) (for cVAE and Δ-GAN we use lower-resolution images, following their implementations). We exclude MSP from the comparison because it cannot generate new images, but only manipulate the attributes of existing ones (see supplementary materials for a detailed explanation).
Figure 6 presents sample results of image generation under specific conditions. In each row, we fix the style variables and vary the label variables in each column, generating the same person but with different characteristics such as hair color, eyeglasses, etc. Although cVAE manages to modify the attributes, the quality of the obtained samples is poor, while Δ-GAN falls completely out of distribution. PluGeN and cFlow generate images of similar quality, but only PluGeN is able to correctly manipulate the labeled attributes. The lower quality of generated images is caused by the poor generation abilities of the VAE backbone, which does not work well with high-dimensional images (see supplementary materials for a discussion). For this reason, it is especially notable that PluGeN can improve the generation performance of the backbone model, in contrast to MSP.
Disentangling the attributes
The attributes in the CelebA dataset are strongly correlated and at times even contradictory, e.g. the attributes ’bald’ and ’blond hair’ cannot both be present at the same time. In this challenging task, we aim to disentangle the attribute space as much as possible to allow for generating examples with arbitrary combinations of attributes. For this purpose, we sample the conditional variables independently, effectively ignoring the underlying correlations of attributes, and use them to generate images. Since the attributes in the CelebA dataset are often imbalanced (e.g. the person wears glasses in only 6.5% of examples), we calculate F1 and AUC scores for each attribute.
The quantitative analysis of the generated images presented in Table 3 confirms that PluGeN outperforms the rest of the methods with respect to classification scores. The overall metrics are quite low for all approaches, which is due to the difficulty of disentanglement mentioned above, as well as the inaccuracy of the ResNet attribute classifier. Deep learning models often fail when the correlations in the training data are broken, e.g. the classifier might use the presence of a beard to predict gender, thus introducing noise in the evaluation Beery, Horn, and Perona (2018).
Chemical molecules modeling
Finally, we present a practical use case in which we apply PluGeN to generate chemical molecules with the requested properties. As a backbone model, we use CharVAE Gómez-Bombarelli et al. (2018), a type of recurrent network used for processing SMILES Weininger (1988), a textual representation of molecules. It was trained on the ZINC 250k database Sterling and Irwin (2015) of commercially available chemical compounds. For every molecule, we model 3 continuous (not binary) physicochemical labels: logP, SAS, and TPSA, whose values were calculated using the RDKit package Landrum et al. (2006). Additional explanations and more examples are given in the supplementary materials.
First, we imitate the practical task of de novo design Olivecrona et al. (2017); Popova, Isayev, and Tropsha (2018), where we force the model to generate new compounds with desirable properties. For every attribute, we generate 25k molecules with 3 different values: for logP we set the label of the generated molecules to 1.5, 3.0, and 4.5; for TPSA to 40, 60, and 80; and for SAS to 2.0, 3.0, and 4.0, which gives 9 scenarios in total. From the density plots of the labels of generated and original molecules presented in Figure 7, we can see that PluGeN shifts the distribution of attribute values towards the desired value. A slight discrepancy between desired and generated values may follow from the fact that the values of the labeled attributes were sampled independently, which could make some combinations physically contradictory.
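A rough numpy sketch of the conditional sampling behind such de novo generation, under our illustrative assumption that a continuous label variable is drawn around the requested target value (the paper's exact continuous-label parametrization is in its supplementary materials; `sample_codes` and `sigma` are hypothetical names and values):

```python
import numpy as np

def sample_codes(n, targets, n_style, sigma=0.1, rng=None):
    """Sample n factorized latent codes with continuous label targets.

    targets -- requested attribute values, e.g. (logP, TPSA, SAS)
    sigma   -- placeholder spread around each target (illustrative)
    """
    rng = rng or np.random.default_rng(0)
    t = np.asarray(targets, dtype=float)
    c = rng.normal(loc=t, scale=sigma, size=(n, t.size))  # label variables
    s = rng.normal(size=(n, n_style))                     # style variables
    return np.concatenate([c, s], axis=1)

# request logP = 3.0, TPSA = 60, SAS = 2.0 for a batch of 25k codes
codes = sample_codes(25_000, targets=[3.0, 60.0, 2.0], n_style=8)
print(codes.shape)  # (25000, 11)
```

Each code would then be passed through the inverse flow and the CharVAE decoder to obtain a SMILES string.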
Next, we consider the setting of lead optimization Jin et al. (2019); Maziarka et al. (2020), where selected compounds are improved to meet certain criteria. For this purpose, we encode a molecule into the latent representation of the INF, force PluGeN to gradually increase the value of logP by 3, and decode the resulting molecules.
The obtained molecules, together with their logP values, are shown in Figure 8. As can be seen, PluGeN generates molecules that are structurally similar to the initial one, but with the desired attribute optimized.
These results show that PluGeN is able to model physicochemical molecular features, which is a non-trivial task that could speed up the long and expensive process of designing new drugs.
We proposed a novel approach for disentangling the latent space of pre-trained generative models, which works well both for generating new samples with desired conditions and for manipulating the attributes of existing examples. In contrast to previous works, we demonstrated that PluGeN performs well across diverse domains, including chemical molecule modeling, and can be combined with various architectures, such as GAN and VAE backbones.
- Abdal et al. (2021) Abdal, R.; Zhu, P.; Mitra, N. J.; and Wonka, P. 2021. Styleflow: Attribute-conditioned exploration of stylegan-generated images using conditional continuous normalizing flows. ACM Transactions on Graphics (TOG), 40(3): 1–21.
- Beery, Horn, and Perona (2018) Beery, S.; Horn, G. V.; and Perona, P. 2018. Recognition in Terra Incognita. In ECCV.
- Bohacek, McMartin, and Guida (1996) Bohacek, R. S.; McMartin, C.; and Guida, W. C. 1996. The art and practice of structure-based drug design: a molecular modeling perspective. Medicinal research reviews, 16(1): 3–50.
- Brock, Donahue, and Simonyan (2018) Brock, A.; Donahue, J.; and Simonyan, K. 2018. Large Scale GAN Training for High Fidelity Natural Image Synthesis. In International Conference on Learning Representations.
- Brown et al. (2020) Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert-Voss, A.; Krueger, G.; Henighan, T.; Child, R.; Ramesh, A.; Ziegler, D. M.; Wu, J.; Winter, C.; Hesse, C.; Chen, M.; Sigler, E.; Litwin, M.; Gray, S.; Chess, B.; Clark, J.; Berner, C.; McCandlish, S.; Radford, A.; Sutskever, I.; and Amodei, D. 2020. Language Models are Few-Shot Learners. In Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.; and Lin, H., eds., Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
- Chen et al. (2018) Chen, R. T.; Rubanova, Y.; Bettencourt, J.; and Duvenaud, D. 2018. Neural ordinary differential equations. arXiv preprint arXiv:1806.07366.
- Cho et al. (2014) Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
- Choi et al. (2020) Choi, Y.; Uh, Y.; Yoo, J.; and Ha, J.-W. 2020. Stargan v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Coley et al. (2017) Coley, C. W.; Barzilay, R.; Green, W. H.; Jaakkola, T. S.; and Jensen, K. F. 2017. Convolutional embedding of attributed molecular graphs for physical property prediction. Journal of chemical information and modeling, 57(8): 1757–1772.
- Dinh, Krueger, and Bengio (2014) Dinh, L.; Krueger, D.; and Bengio, Y. 2014. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516.
- Gan et al. (2017) Gan, Z.; Chen, L.; Wang, W.; Pu, Y.; Zhang, Y.; Liu, H.; Li, C.; and Carin, L. 2017. Triangle generative adversarial networks. arXiv preprint arXiv:1709.06548.
- Gao et al. (2021) Gao, Y.; Wei, F.; Bao, J.; Gu, S.; Chen, D.; Wen, F.; and Lian, Z. 2021. High-Fidelity and Arbitrary Face Editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16115–16124.
- Gaulton et al. (2017) Gaulton, A.; Hersey, A.; Nowotka, M.; Bento, A. P.; Chambers, J.; Mendez, D.; Mutowo, P.; Atkinson, F.; Bellis, L. J.; Cibrián-Uhalte, E.; et al. 2017. The ChEMBL database in 2017. Nucleic acids research, 45(D1): D945–D954.
- Gómez-Bombarelli et al. (2018) Gómez-Bombarelli, R.; Wei, J. N.; Duvenaud, D.; Hernández-Lobato, J. M.; Sánchez-Lengeling, B.; Sheberla, D.; Aguilera-Iparraguirre, J.; Hirzel, T. D.; Adams, R. P.; and Aspuru-Guzik, A. 2018. Automatic chemical design using a data-driven continuous representation of molecules. ACS central science, 4(2): 268–276.
- Härkönen et al. (2020) Härkönen, E.; Hertzmann, A.; Lehtinen, J.; and Paris, S. 2020. Ganspace: Discovering interpretable gan controls. arXiv preprint arXiv:2004.02546.
- He et al. (2019) He, Z.; Zuo, W.; Kan, M.; Shan, S.; and Chen, X. 2019. Attgan: Facial attribute editing by only changing what you want. IEEE Transactions on Image Processing, 28(11): 5464–5478.
- Heusel et al. (2017) Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; and Hochreiter, S. 2017. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Guyon, I.; von Luxburg, U.; Bengio, S.; Wallach, H. M.; Fergus, R.; Vishwanathan, S. V. N.; and Garnett, R., eds., Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 6626–6637.
- Jang, Gu, and Poole (2016) Jang, E.; Gu, S.; and Poole, B. 2016. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144.
- Jin, Barzilay, and Jaakkola (2018) Jin, W.; Barzilay, R.; and Jaakkola, T. 2018. Junction tree variational autoencoder for molecular graph generation. In International Conference on Machine Learning, 2323–2332. PMLR.
- Jin et al. (2019) Jin, W.; Yang, K.; Barzilay, R.; and Jaakkola, T. 2019. Learning multimodal graph-to-graph translation for molecular optimization. International Conference on Learning Representations.
- Jo and Park (2019) Jo, Y.; and Park, J. 2019. Sc-fegan: Face editing generative adversarial network with user’s sketch and color. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1745–1753.
- Kang and Cho (2018) Kang, S.; and Cho, K. 2018. Conditional molecular design with deep generative models. Journal of chemical information and modeling, 59(1): 43–52.
- Karras, Laine, and Aila (2019) Karras, T.; Laine, S.; and Aila, T. 2019. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4401–4410.
- Karras et al. (2020) Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; and Aila, T. 2020. Analyzing and Improving the Image Quality of StyleGAN. In Proc. CVPR.
- Kaya, Hong, and Dumitras (2019) Kaya, Y.; Hong, S.; and Dumitras, T. 2019. Shallow-Deep Networks: Understanding and Mitigating Network Overthinking. In Chaudhuri, K.; and Salakhutdinov, R., eds., Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, 3301–3310. PMLR.
- Kingma and Dhariwal (2018) Kingma, D. P.; and Dhariwal, P. 2018. Glow: Generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039.
- Kingma et al. (2014) Kingma, D. P.; Rezende, D. J.; Mohamed, S.; and Welling, M. 2014. Semi-supervised learning with deep generative models. arXiv preprint arXiv:1406.5298.
- Klys, Snell, and Zemel (2018) Klys, J.; Snell, J.; and Zemel, R. 2018. Learning latent subspaces in variational autoencoders. arXiv preprint arXiv:1812.06190.
- Kodali et al. (2017) Kodali, N.; Abernethy, J.; Hays, J.; and Kira, Z. 2017. On convergence and stability of gans. arXiv preprint arXiv:1705.07215.
- Koperski et al. (2020) Koperski, M.; Konopczynski, T.; Nowak, R.; Semberecki, P.; and Trzcinski, T. 2020. Plugin Networks for Inference under Partial Evidence. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2883–2891.
- Lample et al. (2017) Lample, G.; Zeghidour, N.; Usunier, N.; Bordes, A.; Denoyer, L.; and Ranzato, M. 2017. Fader networks: Manipulating images by sliding attributes. arXiv preprint arXiv:1706.00409.
- Landrum et al. (2006) Landrum, G.; et al. 2006. RDKit: Open-source cheminformatics.
- Li and Wand (2016) Li, C.; and Wand, M. 2016. Precomputed real-time texture synthesis with markovian generative adversarial networks. In European conference on computer vision, 702–716. Springer.
- Li et al. (2020) Li, X.; Lin, C.; Li, R.; Wang, C.; and Guerin, F. 2020. Latent space factorisation and manipulation via matrix subspace projection. In International Conference on Machine Learning, 5916–5926. PMLR.
- Liu et al. (2019) Liu, R.; Liu, Y.; Gong, X.; Wang, X.; and Li, H. 2019. Conditional adversarial generative flow for controllable image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7992–8001.
- Maziarka et al. (2020) Maziarka, Ł.; Pocha, A.; Kaczmarczyk, J.; Rataj, K.; Danel, T.; and Warchoł, M. 2020. Mol-CycleGAN: a generative model for molecular optimization. Journal of Cheminformatics, 12(1): 1–18.
- Mestre-Ferrandiz et al. (2012) Mestre-Ferrandiz, J.; Sussex, J.; Towse, A.; et al. 2012. The R&D cost of a new medicine. Monographs.
- Mirza and Osindero (2014) Mirza, M.; and Osindero, S. 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
- Nitzan et al. (2020) Nitzan, Y.; Bermano, A.; Li, Y.; and Cohen-Or, D. 2020. Disentangling in latent space by harnessing a pretrained generator. arXiv preprint arXiv:2005.07728, 2(3).
- Olivecrona et al. (2017) Olivecrona, M.; Blaschke, T.; Engkvist, O.; and Chen, H. 2017. Molecular de-novo design through deep reinforcement learning. Journal of cheminformatics, 9(1): 1–14.
- Park et al. (2019) Park, T.; Liu, M.-Y.; Wang, T.-C.; and Zhu, J.-Y. 2019. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2337–2346.
- Paszke et al. (2019) Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; Desmaison, A.; Kopf, A.; Yang, E.; DeVito, Z.; Raison, M.; Tejani, A.; Chilamkurthy, S.; Steiner, B.; Fang, L.; Bai, J.; and Chintala, S. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; and Garnett, R., eds., Advances in Neural Information Processing Systems 32, 8024–8035. Curran Associates, Inc.
- Perarnau et al. (2016) Perarnau, G.; Van De Weijer, J.; Raducanu, B.; and Álvarez, J. M. 2016. Invertible conditional gans for image editing. arXiv preprint arXiv:1611.06355.
- Popova, Isayev, and Tropsha (2018) Popova, M.; Isayev, O.; and Tropsha, A. 2018. Deep reinforcement learning for de novo drug design. Science advances, 4(7): eaap7885.
- Rebuffi, Bilen, and Vedaldi (2017) Rebuffi, S.; Bilen, H.; and Vedaldi, A. 2017. Learning multiple visual domains with residual adapters. In Guyon, I.; von Luxburg, U.; Bengio, S.; Wallach, H. M.; Fergus, R.; Vishwanathan, S. V. N.; and Garnett, R., eds., Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 506–516.
- Sohn, Lee, and Yan (2015) Sohn, K.; Lee, H.; and Yan, X. 2015. Learning structured output representation using deep conditional generative models. Advances in neural information processing systems, 28: 3483–3491.
- Sterling and Irwin (2015) Sterling, T.; and Irwin, J. J. 2015. ZINC 15–ligand discovery for everyone. Journal of chemical information and modeling, 55(11): 2324–2337.
- Tewari et al. (2020) Tewari, A.; Elgharib, M.; Bernard, F.; Seidel, H.-P.; Pérez, P.; Zollhöfer, M.; and Theobalt, C. 2020. Pie: Portrait image embedding for semantic control. ACM Transactions on Graphics (TOG), 39(6): 1–14.
- Tov et al. (2021) Tov, O.; Alaluf, Y.; Nitzan, Y.; Patashnik, O.; and Cohen-Or, D. 2021. Designing an encoder for stylegan image manipulation. ACM Transactions on Graphics (TOG), 40(4): 1–14.
- Wang, Yu, and Fritz (2021) Wang, H.-P.; Yu, N.; and Fritz, M. 2021. Hijack-gan: Unintended-use of pretrained, black-box gans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7872–7881.
- Weininger (1988) Weininger, D. 1988. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of chemical information and computer sciences, 28(1): 31–36.
- Wołczyk et al. (2021) Wołczyk, M.; Wójcik, B.; Bałazy, K.; Podolak, I.; Tabor, J.; Śmieja, M.; and Trzciński, T. 2021. Zero Time Waste: Recycling Predictions in Early Exit Neural Networks. arXiv preprint arXiv:2106.05409.
- Yan et al. (2016) Yan, X.; Yang, J.; Sohn, K.; and Lee, H. 2016. Attribute2image: Conditional image generation from visual attributes. In European Conference on Computer Vision, 776–791. Springer.
- Yang et al. (2019) Yang, K.; Swanson, K.; Jin, W.; Coley, C.; Eiden, P.; Gao, H.; Guzman-Perez, A.; Hopper, T.; Kelley, B.; Mathea, M.; et al. 2019. Analyzing learned molecular representations for property prediction. Journal of chemical information and modeling, 59(8): 3370–3388.
- Zhou et al. (2020) Zhou, W.; Xu, C.; Ge, T.; McAuley, J. J.; Xu, K.; and Wei, F. 2020. BERT Loses Patience: Fast and Robust Inference with Early Exit. In Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.; and Lin, H., eds., Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
Parametrization of PluGeN
In this section, we show that PluGeN can also be applied to the case of continuous labeled attributes. Without loss of generality, we assume that . Analogously to the case of binary labels, we assume that the conditional distribution of the label variable given is parametrized by
where is the user-defined parameter controlling smoothness.
Observe that by marginalizing out the label variable over training set , we obtain:
For high values of , there is a large overlap between the Gaussian components, which results in small negative log-likelihood penalties for incorrect assignments. From a practical perspective, we start the training process with high values of , which provides a reasonable initialization of PluGeN, and then gradually decrease to enforce the correct assignments.
Modeling imbalanced binary labels
In many cases, the class labels are imbalanced, which means that the number of examples from one class significantly exceeds that of the other (e.g., only examples in the CelebA dataset have the ’glasses’ label). To deal with imbalanced data, we scale the variance of the Gaussian density modeling the conditional distribution.
We consider the conditional density of the -th attribute represented by:
where and . We assume that are the fractions of examples with class 0 and 1, respectively. To deal with imbalanced classes, we put , where is a fixed parameter. For the majority class, the standard deviation becomes higher, which introduces a lower penalty under the negative log-likelihood loss. The minority class receives a higher penalty, which prevents the mixture from collapsing into a single component.
Let us calculate the log-likelihood of our conditional prior density using the parametrization . We have
where is an extra weighting factor.
We observe that, for our selection of , the expected value of the weighting factors with respect to the labeling variable equals 1. In consequence,
which is the typical log-likelihood of a Gaussian distribution assuming class proportion .
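The effect of this variance scaling can be illustrated with a short sketch. The exact scaling rule is given by the (elided) formula above; here we assume, purely for illustration, that the class standard deviation grows proportionally to the class frequency, so the majority class is penalized less:

```python
import math

def gaussian_nll(c, mu, sigma):
    # negative log-likelihood of a 1-D Gaussian component
    return 0.5 * math.log(2 * math.pi * sigma**2) + (c - mu)**2 / (2 * sigma**2)

def class_sigma(p_y, alpha=1.0):
    # hypothetical scaling: std proportional to class frequency,
    # so the majority class receives a lower penalty
    return alpha * p_y

p0, p1 = 0.95, 0.05   # e.g. a rare attribute such as 'glasses'
sigma0, sigma1 = class_sigma(p0), class_sigma(p1)

# a sample halfway between the two component means (-1 and +1)
nll_majority = gaussian_nll(0.0, -1.0, sigma0)
nll_minority = gaussian_nll(0.0, +1.0, sigma1)
```

The narrow minority component assigns a much larger penalty to ambiguous samples, which is what keeps the mixture from collapsing into the majority mode.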
Reducing during the training of PluGeN
Here, we describe the schedule for the parameter used for modeling the conditional distribution . We want to ensure the flexibility of the INF at the beginning of the training, but we also need the attribute values to be strictly separated. To achieve both of these conditions, we impose a schedule on the standard deviation : starting with a high value, we allow for great flexibility of our model, and we then obtain class separation by reducing it. Namely, we use the following schedule for the standard deviation
of the class normal distributions:
where is the index of the current epoch and are hyperparameters setting, respectively, the starting point and the speed of the decay. The selection process of and is described in the following sections.
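The exact decay formula is elided above; as an illustration, a simple exponential schedule with assumed hyperparameters `sigma_start` and `decay` behaves as described, starting wide and shrinking monotonically:

```python
def sigma_schedule(epoch, sigma_start=2.0, decay=0.9):
    # hypothetical exponential decay: wide components early in training
    # (flexible flow), progressively sharper class separation later
    return sigma_start * decay ** epoch

sigmas = [sigma_schedule(e) for e in range(50)]
```

Any monotonically decreasing schedule with the stated starting point and decay speed would serve the same purpose.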
It is worth noticing that PluGeN can be trained in a semi-supervised setting, where only partial information about the labeled attributes is available. Namely, for every latent representation we can use an arbitrary condition imposed on . If the value of the -th label is unknown, then we use the marginal density:
instead of in the loss function (3). Here, are the fractions of negatively and positively labeled examples in . Further investigation of this setting is left for future work.
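Since the conditional densities are Gaussian, the marginal is a two-component Gaussian mixture weighted by the class fractions. A sketch, assuming component means of -1 and +1 for the negative and positive class (the exact means are elided above):

```python
import math

def normal_pdf(c, mu, sigma):
    # density of a 1-D Gaussian
    return math.exp(-(c - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

def marginal_density(c, p_neg, p_pos, sigma=0.5):
    # mixture over both possible label values, weighted by the
    # fractions of negatively and positively labeled examples
    return p_neg * normal_pdf(c, -1.0, sigma) + p_pos * normal_pdf(c, 1.0, sigma)
```

When a label is known, only the matching component contributes; when it is unknown, this mixture replaces it in the loss.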
Details of image experiments
All experiments were run on a single NVIDIA DGX Station with Ubuntu 20.04 LTS. The full hardware specification includes 8 Tesla V100 GPUs with 32GB VRAM, 512GB RAM, and Intel(R) Xeon(R) CPU E5-2698 v4. Each experiment was run using a single GPU. The code is based on the PyTorch Paszke et al. (2019) framework.
Architectures of the models
Our experiments were performed using the pre-trained, publicly available StyleGAN2 trained on the FFHQ dataset Karras et al. (2020).
PluGeN for StyleGAN backbone
We use the NICE architecture with coupling layers, with layers in each and width . We use the Adam optimizer with learning rate and train the model for epochs. The hyperparameters and , used for modeling the conditional distributions, are set to and , respectively.
For our experiments, we reuse the VAE architecture from Li et al. (2020). We use an encoder with 5 convolutional layers, starting with 128 filters and doubling their number in each subsequent layer. The decoder is symmetrical to the encoder. We use leakyReLU activations. We train the network for 50 epochs with batch size 40 and the Adam optimizer with the learning rate set to . We additionally train a PatchGAN model Li and Wand (2016) to improve the sharpness of the images.
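A minimal PyTorch sketch of such an encoder follows; the input resolution of 64x64 and the kernel-4/stride-2 settings are our assumptions for illustration, not values taken from the paper:

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Sketch: 5 conv layers, 128 filters doubled at each layer, leakyReLU."""
    def __init__(self, in_channels=3, base_filters=128, n_layers=5):
        super().__init__()
        layers, c_in, c_out = [], in_channels, base_filters
        for _ in range(n_layers):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                       nn.LeakyReLU(0.2)]
            c_in, c_out = c_out, c_out * 2  # double the filters each layer
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

enc = ConvEncoder()
out = enc(torch.randn(1, 3, 64, 64))  # spatial size halves at every layer
```

The symmetric decoder would mirror this stack with transposed convolutions.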
PluGeN for VAE backbone
As previously, we use the NICE architecture with coupling layers, with layers in each and width . We train the model for epochs using the Adam optimizer with learning rate and . The hyperparameters are set to and , respectively.
We also train the cFlow model on top of the base network. We use a Conditional Masked Autoregressive Flow with layers, each consisting of a reverse permutation and a MADE component with residual blocks. Moreover, we encode the attributes using a linear layer, whose output is then passed as a context input to the flow. We train the model for epochs using the Adam optimizer with learning rate . During sampling, the temperature trick was used with .
To evaluate the correctness of attribute manipulation on the CelebA dataset, we used a standard ResNet-56 classifier. We trained it on the task of multi-label classification, with class weighting to correct for the class imbalance. We used the Adam optimizer with the learning rate set to and batch size , and trained it for epochs.
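Class weighting for multi-label classification can be sketched as a per-attribute weighted binary cross-entropy; scaling the positive term by a `pos_weight` (e.g. the negative-to-positive ratio for a rare attribute) is a common choice and only an assumption here:

```python
import math

def weighted_bce(logit, label, pos_weight):
    # binary cross-entropy with a positive-class weight to counter imbalance;
    # pos_weight > 1 up-weights the rare positive class (e.g. 'glasses')
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(pos_weight * label * math.log(p) + (1 - label) * math.log(1 - p))
```

Summing this term over all attributes yields the multi-label training loss.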
In this subsection, we present additional results and model comparisons that were not included in the main paper because of space restrictions.
Manipulating the StyleGAN latent codes
In Figures 9 and 10, we present additional results of attribute manipulations performed by PluGeN and StyleFlow on the latent codes of the StyleGAN backbone. In most cases, PluGeN modifies only the requested attribute, leaving the remaining ones unchanged, which is not always the case for StyleFlow (compare the 4th row of Figure 9 or the 3rd row of Figure 10). This confirms that the latent space produced by PluGeN is more disentangled than the one created by StyleFlow.
Manipulating images using VAE backbone
In Figure 11, we show additional results of image manipulation performed by PluGeN, MSP, and cFlow using VAE backbone. One can observe that PluGeN and MSP perform the requested modification more accurately than cFlow.
Manipulating attribute intensity of generated images
In this experiment, we consider images fully generated by PluGeN attached to the VAE backbone (rather than reconstructions of input images). More precisely, we sample a single style variable from the prior distribution and manipulate the label variables of the CelebA attributes. It is evident in Figure 12
that PluGeN freely interpolates between binary values of each attribute and even extrapolates outside the data distribution. This is possible thanks to the continuous form of prior distribution we are using in the latent space, which enables us to choose the intensity of each attribute. We emphasize that this information is not encoded in the dataset, where labels are represented as binary variables. However, in reality, an attribute such as ’narrow eyes’ covers a whole spectrum of possible eyelid positions, from eyes fully closed, through half-closed to wide open. PluGeN is able to recover this property without explicit supervision. Interestingly, we also see cases of extrapolation outside of the dataset, e.g. setting a significantly negative value of the ’bangs’ attribute, which can be interpreted as an illogical condition ’extreme absence of bangs’, creates a white spot on the forehead.
Figure 13 shows that the shape of the empirical distributions in the latent space of PluGeN allows for this continuous change. While the positive and negative classes of boolean attributes such as the presence of a hat or eyeglasses are clearly separated, in more continuous variables like youth and attractiveness they overlap significantly, allowing for smooth interpolations. This phenomenon emerges naturally, even though CelebA provides only binary labels for all the attributes.
Generation capabilities of MSP and the VAE backbone
In Figure 14 (top), we demonstrate that the base VAE model taken from the MSP paper Li et al. (2020) cannot generate new face images, but only manipulate the attributes of input examples. In consequence, it works similarly to an autoencoder model. For this reason, it is especially notable that PluGeN can improve the generation performance of the backbone model (see the main paper). Likewise, MSP cannot generate new face images using this VAE model, as shown in the bottom row of Figure 14. For very low temperatures, MSP generates typical (not diverse) faces.
Generating images with attributes combinations taken from test set
We present additional quantitative results for generating images with the requested combinations of attributes. In this experiment, we focus on typical combinations that appear in the dataset. For this purpose, we generate 20,000 images with the same attribute combinations as in the CelebA test set. The results presented in Table 4 show that PluGeN outperforms cFlow, cVAE, and -GAN in terms of classification scores.
CNF vs NICE
In our main experiments, we use the NICE Dinh, Krueger, and Bengio (2014) approach to flow-based models. This choice was motivated by the computational and conceptual simplicity of the approach. However, we also empirically evaluated the more complex approach of continuous normalizing flows Chen et al. (2018), which casts distribution modeling as a problem of solving differential equations. The CNF implementation consisted of stacked CNFs, each containing concatsquash layers with a hidden dimension . Table 5 shows the results of both approaches in the task of multi-label conditional generation using the VAE backbone. We use the same combinations of attributes as in the CelebA test set. Table 6 shows an analogous comparison when the attributes were sampled independently, which is a more challenging setting. In both settings, the results of NICE and CNF are comparable. Although CNF samples obtain better FIDs, they also score worse on the classification metrics, which suggests that the model might be worse at enforcing the class conditions. Overall, both models perform similarly, and we therefore use NICE, as it is less expensive computationally.
Different autoencoder backbones
In order to investigate how the structure of the latent space of the backbone autoencoder impacts the performance of our model, we check multiple -VAE models with varying values of . For each model, we trained three architectures of INFs (small, medium, big) and picked the best-performing one for evaluation. The results presented in Table 7 show that the FID scores get worse as the value of increases. This is caused by a drop in the reconstructive power of the base model, which focuses more on latent space regularization instead. Interestingly, the statistics also deteriorate when the value of gets too low: the flow-based model cannot disentangle factors of variation from a latent space that is not already at least partially structured. This experiment shows the limitations of our model with respect to its reliance on the performance of the backbone autoencoder. However, PluGeN is still quite robust, as it achieves good results for a wide range of values.
Details of molecules generation experiments
Designing a new drug is a long and expensive process that can cost up to 10 billion dollars and take as long as 10 years Mestre-Ferrandiz et al. (2012). The recent spread of the SARS-CoV-2 virus and the pandemic it caused have shown how important it is to speed up this process. Recently, deep learning has been gaining popularity in the cheminformatics community, where it is used to propose new drug candidates. However, using neural networks in the drug generation task is not easy and is fraught with problems. The complexity of the chemical space is high, and thus training generative and predictive models is challenging. Although there are around possible molecules Bohacek, McMartin, and Guida (1996), detailed information (such as class labels) is known only for a small percentage of them. For example, the ChEMBL database Gaulton et al. (2017), one of the biggest databases with information about molecular attributes, contains data for 2.1M chemical compounds. Moreover, since obtaining labeled data requires long and costly laboratory experiments, the number of labeled molecules in the training datasets is usually very small (often fewer than 1000), which is often not sufficient to train a good model. This poses an important research problem.
Deep neural networks are mostly used in cheminformatics for the following tasks:
PluGeN can be used for the latter two tasks, as our model can generate molecules with specified values of given attributes as well as optimize molecules by changing the value of selected labels.
SMILES Weininger (1988) (simplified molecular-input line-entry system) is a notation for describing the structure of chemical species using a sequence of characters. The SMILES representation is based on a specially defined grammar, which guarantees that a correct SMILES string defines a unique molecule. The opposite does not hold, as a molecule can be encoded by multiple SMILES representations. To obtain uniqueness in this direction as well, the community introduced a canonicalization algorithm, which returns the canonical SMILES that is unique for each molecule.
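RDKit (Landrum et al. 2006) implements such a canonicalization: `Chem.MolToSmiles` returns the canonical SMILES of a parsed molecule. For example, two different SMILES strings encoding the same molecule (ethanol) map to the same canonical form:

```python
from rdkit import Chem

# "OCC" and "CCO" are two valid SMILES encodings of ethanol
canon_a = Chem.MolToSmiles(Chem.MolFromSmiles("OCC"))
canon_b = Chem.MolToSmiles(Chem.MolFromSmiles("CCO"))
```

Comparing canonical SMILES is therefore a standard way to test whether two encodings describe the same molecule.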
In Figure 15 we show two molecules together with their canonical SMILES as well as other SMILES representations.
In our chemistry experiments, we modeled 3 chemical attributes: logP, TPSA, and SAS. Below, we describe each of them:
logP – the logarithm of the partition coefficient. It describes the molecule’s solubility in fats and indicates how well the molecule passes through membranes.
TPSA – the topological polar surface area of a molecule is the sum of the surfaces of all its polar atoms (together with their attached hydrogen atoms). TPSA can be used as a metric of the ability of a drug to permeate cells.
SAS – the synthetic accessibility score estimates the ease of synthesis of a drug-like molecule. When generating a drug candidate, one wants it to be easy to synthesize so that it can actually be obtained in the laboratory.
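The first two attributes can be computed directly with RDKit (Landrum et al. 2006); SAS is provided by the `sascorer` module shipped in RDKit's Contrib directory, which we do not import here:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromSmiles("c1ccccc1O")  # phenol
logp = Descriptors.MolLogP(mol)        # Crippen logP estimate
tpsa = Descriptors.TPSA(mol)           # topological polar surface area
```

For a molecule with no polar atoms, such as benzene, TPSA is exactly zero, while the hydroxyl group of phenol gives a positive value.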
We conducted our chemistry experiments using a dataset of 250k molecules sampled from the ZINC database Sterling and Irwin (2015), a database of commercially available chemical compounds. The mean number of SMILES tokens in our dataset is 38.31, with a standard deviation of 8.46.
Figure 16 shows the distribution of attributes of molecules that make up our training dataset.
Since the values of the chemical attributes are related to the structure of the molecule, many of them are correlated. In Table 8, we present the correlations between the chemical attributes. The correlations suggest that it might be difficult or even impossible to manipulate the logP and SAS attributes independently, setting a difficult challenge for PluGeN.
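Such correlations can be measured with the Pearson coefficient, as reported in Table 8. The attribute values below are made-up illustrations, not data from our set:

```python
import numpy as np

# hypothetical attribute values for a handful of molecules
logp = np.array([1.2, 3.1, 5.0, 2.4])
sas  = np.array([4.8, 3.0, 1.9, 3.6])

# Pearson correlation between the two attributes
corr = np.corrcoef(logp, sas)[0, 1]
```

A strongly negative (or positive) value signals that the two attributes cannot be manipulated fully independently.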
The encoder consists of 3 bi-GRU Cho et al. (2014) layers, with hidden size equal to 256 and output size (latent dimensionality) equal to 100. The decoder consists of 3 GRU layers with the hidden size equal to 256. The architecture of the backbone model is significantly different from the one used in the image domain, which partially confirms that PluGeN can be combined with various autoencoder models.
We trained the VAE model for 100 epochs, using batch size of 256 and learning rate equal to 1e-4.
The flow model consisted of 6 coupling layers, each consisting of 6 dense layers with a hidden size equal to 256. We trained NICE for 50 epochs, with learning rate equal to 1e-4 and batch size 256. We used and .
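The additive coupling layer used by NICE is invertible by construction: one half of the vector passes through unchanged, and the other is shifted by a function of the first. A minimal NumPy sketch, with a `tanh` stand-in for the 6 dense layers of the real model:

```python
import numpy as np

def shift_net(x_half):
    # stand-in for the dense layers of the real coupling network
    return np.tanh(x_half)

def coupling_forward(x):
    # additive coupling (NICE): x1 passes through, x2 is shifted by m(x1)
    x1, x2 = np.split(x, 2)
    return np.concatenate([x1, x2 + shift_net(x1)])

def coupling_inverse(y):
    # exact inverse: subtract the same shift
    y1, y2 = np.split(y, 2)
    return np.concatenate([y1, y2 - shift_net(y1)])

x = np.random.default_rng(0).normal(size=8)
y = coupling_forward(x)
```

Stacking such layers (with the halves swapped between layers) yields the full invertible flow.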
In the following subsection, we show additional results for the chemistry experiments, covering both conditional generation and latent space traversal. Furthermore, we show how PluGeN works with a conditional normalizing flow instead of NICE as the base flow model.
In the main paper, we presented results for conditional generation in the setting of a single attribute condition (where the values of the remaining attributes were sampled from their prior distribution). Here, we also show results for a setting where we impose conditions on all attributes at the same time.
In particular, we tested 3 different settings:
LogP set to 1.0, TPSA set to 60.0, SAS set to 5.0.
LogP set to 3.0, TPSA set to 75.0, SAS set to 3.0.
LogP set to 5.0, TPSA set to 50.0, SAS set to 2.0.
The density plots of the attributes of the molecules generated in these settings are presented in Figure 17.
Latent space traversal
We also present more results for latent space traversal, a task that imitates the inter-class interpolation experiments from the image domain. For this purpose, we tested how PluGeN can traverse the latent space of CharVAE. We selected a few random molecules from our dataset and, for each of them, forced PluGeN to gradually change the value of the specified attribute, decoding the resulting latent codes back into molecules. The goal of this task is to generate molecules that are structurally similar to the initial one, except for changes in the desired attributes. This is an important challenge in the lead optimization stage of the drug discovery process.
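Conceptually, the traversal shifts a single label coordinate in PluGeN's disentangled space and maps each step back through the inverse flow into the backbone's latent space. A schematic sketch, with an identity map standing in for the trained flow:

```python
import numpy as np

def flow_forward(z):    # identity stand-in for PluGeN's invertible flow
    return z.copy()

def flow_inverse(w):
    return w.copy()

def traverse(z, attr_idx, deltas):
    # gradually shift one label variable; each step is mapped back
    # to the backbone's latent space and would then be decoded
    w = flow_forward(z)
    steps = []
    for d in deltas:
        w_step = w.copy()
        w_step[attr_idx] = w_step[attr_idx] + d
        steps.append(flow_inverse(w_step))
    return steps

z0 = np.zeros(10)
path = traverse(z0, attr_idx=0, deltas=[0.0, 0.5, 1.0])
```

In the real pipeline, each returned latent code is fed to the CharVAE decoder to obtain the modified molecule.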
CNF vs NICE
We also tested how replacing NICE Dinh, Krueger, and Bengio (2014) with a conditional normalizing flow Chen et al. (2018) affects the process of molecular generation using PluGeN. For this purpose, we repeated the chemistry-based conditional generation experiments from the main text, but with CNF as our backbone flow model. The results are presented in Figure 24. One can see that, in this version, PluGeN is also capable of moving the density of the attributes of the generated molecules towards the desired value. The obtained changes, however, are worse than in the case of NICE as the flow backbone.