
Training a discrete variational autoencoder for generative chemistry and drug design on a quantum annealer

by   A. I. Gircha, et al.

Deep generative chemistry models emerge as powerful tools to expedite drug discovery. However, the immense size and complexity of the structural space of all possible drug-like molecules pose significant obstacles, which could be overcome with hybrid architectures combining quantum computers with deep classical networks. We built a compact discrete variational autoencoder (DVAE) with a Restricted Boltzmann Machine (RBM) of reduced size in its latent layer. The size of the proposed model was small enough to fit on a state-of-the-art D-Wave quantum annealer and allowed training on a subset of the ChEMBL dataset of biologically active compounds. Finally, we generated 4290 novel chemical structures with medicinal chemistry and synthetic accessibility properties in the ranges typical for molecules from ChEMBL. The experimental results point towards the feasibility of using already existing quantum annealing devices for drug discovery problems, which opens the way to building quantum generative models for practically relevant applications.




I Introduction

Drug design is the process of identifying biologically active compounds; it relies on the efficient generation of novel, drug-like, and yet synthetically accessible compounds. So far, only a tiny fraction of the possible drug-like substances has ever been synthesized kim2016pubchem, whereas the total number of realistic drug-like molecules is estimated to be astronomically large polishchuk2013estimation. Optimizing physical and chemical properties alongside biological activities in such a large and unstructured space is extremely challenging. This is why deep generative machine learning models, such as variational autoencoders (VAEs, see Ref. kingma2013auto) or generative adversarial networks (GANs, Kurach2019), should help solve the problem of generative chemistry, that is, sampling from hitherto unknown distributions of possible drug-like molecules.

A fully developed generative model should implicitly estimate fundamental molecular properties, such as stability and synthetic accessibility, for each generated compound and intermediate product. All those features depend on the ability of the network architecture to approximate the solutions of the underlying quantum mechanical problems, which is computationally hard for molecules of realistic size. Quantum computers are naturally suited to solving complex quantum many-body problems Lloyd and thus may be instrumental in applications involving quantum chemistry McArdle; Bauer_2020; Aspuru-Guzik2018. In addition, quantum algorithms can speed up machine learning Aspuru-Guzik2018; Biamonte2017. Therefore, one can expect that quantum-enhanced generative models Gao2021, including quantum GANs li2021quantum, may eventually be developed into ultimate generative chemistry algorithms.

Figure 1: Scheme of the DVAE learning a joint probability distribution over the molecular structural features x and their latent-variable representations (discrete z and continuous ζ). Here q_φ and p_θ are the encoder and decoder, respectively, whereas p_θ(z) is the prior distribution in the latent variable space and is approximated by an RBM. We provide an example of reconstruction of a target molecule (diaveridine) using the Gibbs-300 model saved after 300 epochs of training (here T is the Tanimoto similarity between the initial molecule and its reconstruction, and p is the output probability).

An ultimate implementation of quantum machine-learning algorithms requires the development of fault-tolerant QPUs Biamonte2017, which remains an outstanding challenge. Meanwhile, readily available noisy intermediate-scale quantum (NISQ) devices Preskill2018 provide a test bed for developing and testing quantum machine-learning algorithms on practical problems of modest size. Specifically, quantum annealing processors Johnson2011 could potentially enable more efficient solving of quadratic unconstrained binary optimization problems and approximate sampling from the thermal distributions of transverse-field Ising systems. These applications make them attractive in the context of machine learning, as tools both for solving optimization problems Neven2008; Neven2012; Lidar2013; Lidar2017 and for sampling Andriyash2016; Perdomo-Ortiz2018; Melko2018; Dixit2021.

In this work, we prototyped a discrete variational autoencoder (DVAE, khoshaman2018quantum) whose latent generative process is implemented in the form of a Restricted Boltzmann Machine (RBM) of a small enough size to fit readily available annealers. We trained the network on a D-Wave annealer and generated novel chemical structures with medicinal chemistry and synthetic accessibility properties in the ranges typical for molecules from ChEMBL. Hence, we demonstrated that such hybrid architectures might allow practical machine learning applications for generative chemistry and drug design. Once the hardware matures, the RBM could be turned into a Quantum Boltzmann Machine (QBM), and the whole system might be transformed into a Quantum VAE (QVAE, khoshaman2018quantum) sampling from richer non-classical distributions.

II Results

We proposed and characterized a generative model (see Fig. 1) in the form of a combination of a Discrete Variational Autoencoder (DVAE) with a Restricted Boltzmann Machine (RBM) in the latent space khoshaman2018quantum; rolfe2016discrete and the Transformer model vaswani2017attention. The model learns useful representations of chemical structures from ChEMBL, the manually curated database of biologically active molecules with drug-like properties gaulton2017chembl.

Following Ref. gomez2018automatic, we used the common SMILES weininger1988smiles encoding for organic molecules and trained the system to encode and subsequently decode molecular representations by optimizing the evidence lower bound (ELBO) for the DVAE log-likelihood khoshaman2018quantum:

L(θ, φ) = E_{q_φ(ζ, z | x)}[ log p_θ(x | ζ) ] − β D_KL( q_φ(z | x) ‖ p_θ(z) )    (1)

Here, E denotes the expectation value, D_KL is the Kullback–Leibler (KL) divergence, and p_θ(z) is the prior distribution in the latent variable space, encoded by an RBM as in Ref. khoshaman2018quantum (see Methods). The two layers of the RBM were chosen small enough that an RBM of precisely this size could be sampled on readily available quantum annealers. We used the spike-and-exponential transformation between the discrete z and continuous ζ variables and employed the standard reparameterization trick to avoid calculating derivatives over random variables.
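As an illustration, the spike-and-exponential reparameterization mentioned above can be sketched in a few lines. The inverse-CDF form and the sharpness parameter beta below follow the construction in rolfe2016discrete and are our assumptions, not the exact values used in this work:

```python
import math

def spike_and_exp(q, rho, beta=5.0):
    """Reparameterized sample from the spike-and-exponential smoothing
    distribution: a point mass ("spike") at zeta = 0 with probability
    1 - q, and a density proportional to exp(beta * zeta) on [0, 1]
    with probability q. `rho` is uniform noise in [0, 1), so in an
    autodiff framework gradients can flow through `q` (the
    reparameterization trick)."""
    if rho < 1.0 - q:
        return 0.0                       # spike branch: zeta = 0
    # Rescale rho to [0, 1) within the exponential branch, then invert
    # the CDF of the truncated exponential exp(beta * zeta) on [0, 1].
    u = (rho - (1.0 - q)) / q
    return math.log(u * (math.exp(beta) - 1.0) + 1.0) / beta
```

Given an encoder output q close to 1, most draws fall in the exponential branch, concentrating ζ near 1; for q close to 0, most draws return the spike at ζ = 0.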

The respective encoder and decoder functions, q_φ and p_θ, are approximated by deep neural networks with Transformer layers, each depending on its own set of adjustable parameters, φ and θ. We multiplied the KL divergence term by an additional coefficient β to avoid posterior collapse yan2020rebalancing.

Figure 2: Learning curves of the DVAE trained with classical Gibbs sampling (red) and with samples from the D-Wave annealer (blue). Training on D-Wave was suspended before reaching convergence due to resource limitations.

We trained the network for 300 epochs until apparent convergence using Gibbs sampling (see the red solid and dashed lines in Fig. 2, representing the total loss over the validation and train sets, respectively). In what follows, we discuss two checkpoints: the fully trained model (Gibbs-300) and, for reference purposes, the intermediate model (Gibbs-75) saved by the end of the 75th epoch.

A VAE is a probabilistic model. In particular, this means that each of the discrete states of the latent variables is decoded into a probability distribution of SMILES-encoded molecules. At the top of Fig. 1 we provide an example of encoding a particular molecule (diaveridine) and its reconstruction by the Gibbs-300 network (see the structures at the bottom). In this case, the target molecule was reconstructed exactly in a fraction of the runs (see the reconstruction probabilities and Tanimoto similarities to the target molecule next to the reported structures).

The DVAE is a generative model and can produce novel molecules with properties presumably matching those in the training set. In Fig. 3 and Table 1 we compare the distributions of the basic biochemical properties of the molecules in the training set and among the molecules generated by each of the models trained and discussed in this work. The molecules generated by the network were mostly valid (validity of 0.55 and 0.69 for the Gibbs-75 and Gibbs-300 models, respectively; see Table 1). We kept track of the molecular weight (MW), the water-octanol partition coefficient (logP), the synthetic accessibility score (SAscore Ertl2009), and the drug-likeness (QED) score.

Aside from the biochemical and drug-likeness properties, we also measured the novelty of the generated molecules. Only a small fraction of the molecules generated by the Gibbs-75 and Gibbs-300 models came within a small Tanimoto distance of any molecule in the training set, indicating that the models produce genuinely novel structures rather than memorized ones. Extra training time improved the validity of the generated molecules and brought the molecular properties closer to those found in the training set (see the relevant Gibbs-75 and Gibbs-300 columns in Table 1).
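For reference, the Tanimoto similarity underlying this novelty analysis is the intersection-over-union of two binary molecular fingerprints; the distance is one minus the similarity. A minimal sketch (in practice one would compute Morgan/ECFP fingerprints with RDKit rather than use raw bit-index sets):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two binary fingerprints given as
    collections of "on" bit indices: |A & B| / |A | B|. The Tanimoto
    *distance* used for novelty filtering is 1 - similarity."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 1.0          # identical (empty) fingerprints
    return len(a & b) / len(a | b)
```

A generated molecule whose maximal similarity to every training molecule stays below a chosen threshold is counted as novel.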

The proposed network architecture is sufficiently compact to fit the D-Wave hardware. Hence, we were able to train the network using the annealer instead of Gibbs sampling. The learning of the hybrid model on D-Wave progressed more slowly than on a classical computer using Gibbs sampling (see the blue solid and dashed lines in Fig. 2, corresponding to the total loss of the model on the validation and training sets). We had, however, to stop the training before reaching convergence, at the 75th epoch, due to the limited performance of the available quantum hardware. With further hardware improvements, we expect to be able to prolong the training. Eventually, we used D-Wave to generate a set of molecular structures (see Fig. 3 and the corresponding column in Table 1). As expected, the distributions of basic properties of the generated molecules were close to those obtained from the Gibbs-75 model and could be improved if more training time were available.

          Train set   D-Wave sampl.   Gibbs sampl.   Gibbs sampl.
                      (75 epochs)     (75 epochs)    (300 epochs)
MW        409.99      374.16          397.81         416.14
LogP      3.41        3.68            3.64           3.61
QED       0.54        0.54            0.54           0.52
SAS       2.96        3.12            3.04           3.13
Validity  1.0         0.54            0.55           0.69
Table 1: The parameters of distributions of basic physical and medicinal chemistry properties of the molecules produced by the generative models discussed in this work. The entries in the table are mean values computed with the help of RDKit library rdkitlib to obtain the molecular weight (MW), LogP, QED, and the synthetic accessibility score (SAS) Ertl2009.
Figure 3: Distributions of basic physical and medicinal chemistry properties of the molecules produced by the generative models discussed in this work (same as in Table 1).

III Discussion and outlook

VAEs are powerful generative machine learning models capable of learning, and sampling from, the unknown distribution of the input data kingma2013auto. As a first step towards building a hybrid quantum generative model, we prototyped the DVAE (along the lines of Ref. khoshaman2018quantum) with the RBM in its latent space khoshaman2018quantum; rolfe2016discrete. If provided with a large dataset of drug-like molecules, such a system should, once converged, learn representations of molecular structure and could be used to generate novel and yet drug-like molecules for drug design applications.

We employed Transformer layers vaswani2017attention in the encoder and decoder components and developed additional preprocessing layers that allowed our model to operate at the character level (rather than the word level) to parse SMILES, the textual representations of the input molecules. As a result, we were able to build a compact model whose two-layer RBM is small enough to be deployed on already existing quantum annealing devices (such as the D-Wave Advantage processor). The RBM could be turned into a Quantum Boltzmann Machine (QBM), and the whole system might then be transformed into a Quantum VAE (QVAE, khoshaman2018quantum) sampling from richer non-classical distributions.

For demonstration purposes, we trained the DVAE on a subset of randomly selected molecules from the ChEMBL database of manually curated, biologically active molecules. On classical hardware, the system could be trained with Gibbs sampling. We were able to show that the training converged and that the resulting network could generate molecules whose distributions of basic properties, such as MW, logP, QED, and SAS, closely match those in the training set. At the same time, the average size of the molecules increased as the training of the network progressed, and relatively more harder-to-synthesize compounds appeared among the molecules generated by the network.

Training the same network architecture on the quantum annealer proceeded more slowly per epoch than on the classical computer, most probably due to noise. Nevertheless, the distributions of the molecular properties of the generated molecules were sufficiently close to those in the training set and among the molecules generated by the classical counterparts, Gibbs-75 and Gibbs-300. While certain discrepancies between the distributions were present, those results were computed after only a limited number of training epochs due to the restrictions on public access to the quantum computer.

In the original VAE paper gomez2018automatic, the authors proposed predicting additional properties, such as the binding constant to a particular target, on top of the autoencoder loss. In such a form, the network could be used in problems involving actual drug design, i.e., for the generation of novel compounds binding specific, medically relevant targets. Although we did not attempt to demonstrate such a capability, the DVAE, and eventually its hybrid implementation such as a QVAE, can be appropriately refitted by adding the extra loss term.

Hence, based on the results of our limited experiments, we conclude that the DVAE model can be used in generative chemistry and drug discovery applications of practical scale on already existing quantum annealing devices.

Training of a QVAE with a QBM in the latent layer reduces to training a DVAE with an RBM in the latent layer with Gibbs sampling if the transverse magnetic field in the QBM is absent khoshaman2018quantum. Using genuine QBMs should speed up the training of the system as the size of the network grows Biamonte2017. Ref. khoshaman2018quantum demonstrated that “quantum” samplers with non-vanishing transverse fields outperformed the DVAE when assessed by metrics achieved at the same number of training cycles (epochs).

Construction of a QVAE with a controllable non-zero transverse field can, in principle, be performed on the existing generation of D-Wave chips. However, it would require additional hardware tuning and applying a combination of additional tricks, such as reverse annealing schedules, pause-and-quench, and others dwavemanual. We expect that, with further developments in the engineering of quantum computing devices, hybrid architectures similar to the QVAE will surpass their classical counterparts. This may be especially true in problems potentially involving the rules of quantum chemistry, such as learning efficient representations of molecular structures for applications related to generative chemistry and drug design.

IV Acknowledgements

We thank Sergey Usmanov for help in performing the experiment. P.O.F. and A.K. would like to thank Dr. A. Tarkhov from Gero and Daria Orhunova for fruitful discussions and help with the data analysis. We acknowledge online cloud access to the quantum annealing device produced by D-Wave Systems. A.K.F. acknowledges support from the Russian Science Foundation grant (19-71-10092). P.O.F. and A.K. are supported by Gero LLC PTE (Singapore).

Author contributions

All the authors jointly developed the problem statement and analyzed existing state-of-the-art techniques. A.I.G. and A.S.B. implemented the considered methods and performed the simulation of molecules. All the authors contributed to discussions of the results and writing the manuscript. P.O.F. and A.K.F. supervised the project.

Competing interests

Owing to the employment and consulting activities of A.I.G., A.S.B., and A.K.F., they have financial interests in the commercial applications of quantum computing. P.O.F. and K.A. are employees of (P.O.F. is a stakeholder in) Gero LLC PTE and hence have financial interests in commercial applications of AI/ML systems for drug discovery.

Code availability

The code that is deemed central to the conclusions is available from the corresponding author upon reasonable request.

V Methods

We proposed and characterized a combination of a Discrete Variational Autoencoder (DVAE) with a Restricted Boltzmann Machine (RBM) in the latent space khoshaman2018quantum; rolfe2016discrete, with the encoder and decoder including layers from the Transformer model vaswani2017attention. We trained the models on a subset of the ChEMBL dataset by optimizing the evidence lower bound (ELBO) for the DVAE log-likelihood khoshaman2018quantum, modified with an additional coefficient multiplying the KL divergence term yan2020rebalancing, see Eq. (1). A sketch of the architecture of our models is given in Fig. 1.

Below we describe in detail the dataset, the network architecture, the training parameters, and the training schedule of the classical and quantum annealer models.

V.1 Dataset

We used a small subset of randomly selected molecules from the ChEMBL (release 26) database Gaulton2011; Gaulton2017. We selected structures encoded by SMILES strings of a maximum length of 200 symbols and containing atoms from the organic subset only (B, C, N, O, P, S, F, Cl, Br, I). To focus on the relevant biologically active compounds, we removed salt residuals. Finally, we converted all SMILES into the canonical format with the help of RDKit rdkitlib.
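A rough pre-filter implementing the selection criteria above can be sketched as follows. The regular expression and the structural-character whitelist are illustrative assumptions; a production pipeline would instead parse each molecule with RDKit, strip salts, and re-emit canonical SMILES:

```python
import re

# Organic subset: B, C, N, O, P, S, F, Cl, Br, I.
# Match two-letter element symbols first, then single letters;
# lowercase letters denote aromatic atoms in SMILES (c, n, o, s, ...).
ATOM_RE = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")

def keep_molecule(smiles, max_len=200):
    """Rough pre-filter mirroring the selection criteria described
    above: SMILES of at most `max_len` symbols whose atoms all belong
    to the organic subset. This sketch only scans the raw string."""
    if len(smiles) > max_len:
        return False
    # Drop atoms we recognize; anything left over besides bond, ring,
    # branch, charge, and hydrogen-count symbols (e.g. the 'e' of
    # [Se] or the 'a' of [Na+]) indicates a non-organic-subset atom.
    stripped = ATOM_RE.sub("", smiles)
    return re.fullmatch(r"[\d\(\)\[\]=#+\-@/\\.%Hh]*", stripped) is not None
```

For example, benzene ("c1ccccc1") passes, while a selenium-containing string ("C[Se]C") is rejected.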

The remaining molecules were randomly assigned to the train and validation sets.

V.2 Training DVAE using Gibbs sampling

Molecular SMILES strings are tokenized with the regular expression from Ref. schwaller2018found, which produced a vocabulary of unique tokens. A standard trainable embedding layer and the positional encoding from Ref. vaswani2017attention are used. Our implementation utilizes a combination of embedding and positional encoding in which the positional encoding is multiplied by an additional correction factor:

X = E + PE / √d,

where E is the embedding tensor, PE is the positional encoding tensor, and d is the dimensionality of the embedding. This factor is required to make the proportion between the embedding tensor and the positional encoding closer to that in the original model vaswani2017attention. The dimension of the embeddings is a model hyperparameter, which was set to 160.
In the encoder, we employed a layer of one-dimensional convolutions and a highway layer Srivastava2015 between the embedding and Transformer layers. The number of filters and the kernel size of the convolution layer were chosen based on previous work banar2020Character.

We used a highway layer to improve the performance of the model according to banar2020Character; kim2016character.

The preprocessed sequence of 160-d tensors is passed from the highway layer to the stack of Transformer encoder layers. The width of the feed-forward part of the layers and the number of heads in the multi-head attention are model hyperparameters. We used GeLU activation functions Hendrycks2016 and Dropout.

The original Transformer encoder layers produce an output tensor of variable length, equal to the size of the input string. In order to further reduce the dimensionality of the latent-space layer in the model, we construct a fixed-length tensor from the Transformer encoder output by calculating a fixed number of vectors, which we then concatenate into one tensor. First, we select a single distinguished vector from the Transformer layers' output. Then, we calculate the arithmetic mean of all vectors along the length of the output tensor. Next, we consider subsets of output vectors whose indices have the same remainder after division by a fixed modulus. For each subset, we compute the arithmetic mean and concatenate all the calculated vectors into the fixed-length output tensor.
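The pooling scheme above can be sketched as follows. Taking the first output position as the distinguished vector and treating the modulus k as a hyperparameter are our assumptions, since the text does not fix either choice:

```python
def fixed_length_summary(outputs, k):
    """Condense a variable-length Transformer output (a list of
    equal-width vectors) into a fixed number of vectors: one
    distinguished vector (here, position 0), the mean over the whole
    sequence, and the means of k "strided" subsets grouping positions
    by their index modulo k. Returns 2 + k vectors, which the caller
    concatenates into one fixed-length tensor."""
    d = len(outputs[0])
    n = len(outputs)

    def mean(vectors):
        return [sum(v[j] for v in vectors) / len(vectors) for j in range(d)]

    summary = [outputs[0], mean(outputs)]     # distinguished vector + global mean
    for r in range(k):                        # strided means: indices i with i % k == r
        subset = [outputs[i] for i in range(n) if i % k == r]
        summary.append(mean(subset) if subset else [0.0] * d)
    return summary
```

Because the result always has 2 + k vectors of width d regardless of the input length, the downstream latent layer can have a fixed size.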

A Restricted Boltzmann machine (RBM) is implemented in the latent space as presented in Refs. khoshaman2018quantum; rolfe2016discrete. The RBM consists of two layers, sized so that an RBM of precisely this size can be sampled using existing quantum annealing devices. It is worth noting that all the units of the RBM in the DVAE are latent variables connected to the rest of the model; hence, there is no distinction between "hidden" and "visible" units as in a standalone RBM khoshaman2018quantum; rolfe2016discrete. We employed the spike-and-exponential transformation as a smoothing probability distribution and the reparameterization trick to avoid calculating derivatives over random variables. The RBM is sampled by performing a number of Gibbs-update steps using persistent contrastive divergence (PCD).

The decoder works in two modes: training and inference. In the inference mode, the decoder uses the preprocessing layers. The main part of data processing in both the training and inference modes consists of a stack of Transformer decoder layers with GeLU activations and Dropout; the width of the feed-forward part of the layers and the number of heads in the multi-head attention are model hyperparameters.

To train the model, we used the rebalanced objective function, in which the KL divergence term is multiplied by an additional small coefficient yan2020rebalancing to avoid the posterior collapse problem, and employed the Adam optimizer.

In contrast to the original Transformer model, we used a different learning-rate schedule: we trained the model for 300 epochs using a MultiStep learning-rate schedule, in which the learning rate starts from an initial value and is subsequently reduced by a constant factor at fixed fractions of the length of the training process.
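Such a schedule can be written as a small helper. The base rate, reduction factor, and milestone fractions below are placeholders for illustration, not the paper's values:

```python
def multistep_lr(epoch, total_epochs, base_lr=1e-4, factor=0.1,
                 milestones=(0.5, 0.75, 0.9)):
    """MultiStep learning-rate schedule: start at base_lr and multiply
    by `factor` each time training passes one of the milestone
    fractions of the total training length."""
    lr = base_lr
    for m in milestones:
        if epoch >= m * total_epochs:
            lr *= factor
    return lr
```

This mirrors the behavior of standard MultiStep schedulers (e.g. PyTorch's MultiStepLR), with milestones expressed as fractions of the run rather than absolute epoch indices.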

To estimate the logarithm of the partition function of the Boltzmann distribution, we used the annealed importance sampling (AIS) algorithm Neal2001AnnealedIS during the evaluation of the model at the end of each epoch, using a set of intermediate distributions and samples.
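To illustrate the idea behind AIS, the sketch below estimates log Z for a toy Boltzmann distribution with independent binary units, where the exact answer is known in closed form. The linear temperature schedule and sample counts are illustrative, not the paper's settings:

```python
import math
import random

def ais_log_z(b, n_temps=100, n_samples=200, seed=0):
    """Annealed importance sampling estimate of log Z for a toy
    Boltzmann distribution over independent binary units,
    p(z) ~ exp(sum_i b_i * z_i). Anneals from the uniform base
    distribution (beta = 0, Z0 = 2^n) to the target (beta = 1)
    through a linear schedule of inverse temperatures; because the
    units are independent, the Gibbs transition at each temperature
    samples them exactly."""
    rng = random.Random(seed)
    n = len(b)
    betas = [k / n_temps for k in range(n_temps + 1)]
    log_weights = []
    for _ in range(n_samples):
        z = [rng.randint(0, 1) for _ in range(n)]   # sample from the base
        log_w = 0.0
        for k in range(1, len(betas)):
            neg_energy = sum(bi * zi for bi, zi in zip(b, z))
            log_w += (betas[k] - betas[k - 1]) * neg_energy
            # Gibbs transition leaving the beta_k distribution invariant
            z = [1 if rng.random() < 1 / (1 + math.exp(-betas[k] * bi)) else 0
                 for bi in b]
        log_weights.append(log_w)
    # log of the mean importance weight, via log-sum-exp for stability
    m = max(log_weights)
    log_mean_w = m + math.log(sum(math.exp(lw - m) for lw in log_weights)
                              / n_samples)
    return n * math.log(2) + log_mean_w     # log Z0 + log(mean weight)
```

For an RBM, the same scheme applies with the bipartite energy in place of the independent-unit energy; the true log Z is then intractable, which is why AIS is used for evaluation.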

Due to resource constraints, we did not have a chance to optimize the hyperparameters or explore many architectural variants of the model. The presented variant of the network simply worked and can be considered a first step towards a truly effective solution.

V.3 Training DVAE on a quantum annealer

We used exactly the same network architecture on the quantum annealer, with the only difference from the classical case being that the RBM in the latent space was sampled using a quantum annealer (the D-Wave Advantage processor). The quantum model was trained for 75 epochs with a constant learning rate.

To estimate the logarithm of the partition function of the Boltzmann distribution during the evaluation of the model, we used a different version of annealed importance sampling (AIS; see Ref. Ulanov2019QuantuminspiredAA) with the same parameters as in the classical case.