I Introduction
Drug design is the process of identifying biologically active compounds; it relies on the efficient generation of novel, drug-like, and yet synthetically accessible compounds. So far, only about substances have ever been synthesized kim2016pubchem, whereas the total number of realistic drug-like molecules is estimated to be in the range between
and polishchuk2013estimation. Optimizing physical and chemical properties alongside biological activities in such a large and unstructured space is extremely challenging. This is why deep generative machine learning models, such as variational autoencoders (VAEs, see Ref. gomez2018automatic) or generative adversarial networks (GANs Kurach2019), are expected to help solve the problem of generative chemistry, that is, sampling from the hitherto unknown distribution of possible drug-like molecules. A fully developed generative model should implicitly estimate fundamental molecular properties, such as stability and synthetic accessibility, for each generated compound and its intermediate products. All these features depend on the ability of the network architecture to approximate solutions of the underlying quantum mechanical problems, which is computationally hard for molecules of realistic size. Quantum computers are naturally suited to solving complex quantum many-body problems Lloyd and thus may be instrumental in applications involving quantum chemistry McArdle; Bauer_2020; AspuruGuzik2018. In addition, quantum algorithms can speed up machine learning AspuruGuzik2018; Biamonte2017. One can therefore expect that quantum-enhanced generative models Gao2021, including quantum GANs li2021quantum, may eventually be developed into ultimate generative chemistry algorithms.
An ultimate implementation of quantum machine-learning algorithms requires the development of fault-tolerant quantum processing units (QPUs) Biamonte2017, which remains an outstanding challenge. Meanwhile, readily available noisy intermediate-scale quantum (NISQ) devices Preskill2018 provide a testbed for developing and testing quantum machine-learning algorithms on practical problems of modest size. Specifically, quantum annealing processors Johnson2011 could enable more efficient solution of quadratic unconstrained binary optimization problems and approximate sampling from the thermal distributions of transverse-field Ising systems. These applications make annealers attractive in the context of machine learning, both as tools for solving optimization problems Neven2008; Neven2012; Lidar2013; Lidar2017 and for sampling Andriyash2016; PerdomoOrtiz2018; Melko2018; Dixit2021.
In this work, we prototyped a discrete variational autoencoder (DVAE, khoshaman2018quantum) whose latent generative process is implemented in the form of a Restricted Boltzmann Machine (RBM) small enough to fit readily available annealers. We trained the network on a D-Wave annealer and generated novel chemical structures with medicinal chemistry and synthetic accessibility properties in the ranges typical for molecules from ChEMBL. Hence, we demonstrated that such hybrid architectures may enable practical machine learning applications for generative chemistry and drug design. Once the hardware matures, the RBM could be turned into a Quantum Boltzmann Machine (QBM), and the whole system could be transformed into a Quantum VAE (QVAE, khoshaman2018quantum) sampling from richer non-classical distributions.
II Results
We proposed and characterized a generative model (see Fig. 1) combining a Discrete Variational Autoencoder (DVAE) with a Restricted Boltzmann Machine (RBM) in the latent space khoshaman2018quantum; rolfe2016discrete and the Transformer model vaswani2017attention. The model learns useful representations of chemical structures from ChEMBL, a manually curated database of biologically active molecules with drug-like properties gaulton2017chembl.
Following Ref. gomez2018automatic, we used the common SMILES weininger1988smiles encoding for organic molecules and trained the system to encode and subsequently decode molecular representations by optimizing the evidence lower bound (ELBO) for the DVAE log-likelihood khoshaman2018quantum:
L(θ, φ) = E_{q_φ(ζ|x)}[log p_θ(x|ζ)] − D_KL(q_φ(ζ|x) ‖ p_θ(ζ)).   (1)
Here, E_{q_φ(ζ|x)} denotes the expectation value, D_KL is the Kullback–Leibler (KL) divergence, and p_θ(ζ) is the prior distribution in the latent variable space, encoded by the RBM as in Ref. khoshaman2018quantum (see Methods). The two layers of the RBM contain units each, since an RBM of precisely this size can be sampled on readily available quantum annealers. We used the spike-and-exponential transformation between the discrete and continuous variables and employed the standard reparameterization trick to avoid calculating derivatives over random variables.
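The spike-and-exponential reparameterization mentioned above can be sketched as an inverse-CDF transform (a minimal sketch following Refs. rolfe2016discrete; khoshaman2018quantum, with an assumed sharpness parameter β, since the actual value is not given here):

```python
import numpy as np

def spike_and_exp(q, rho, beta=3.0):
    """Inverse-CDF sample of the continuous relaxation zeta in [0, 1].

    The smoothing distribution is a mixture of a delta "spike" at 0
    (with weight 1 - q) and an exponential branch on [0, 1]:
        r(zeta | z = 1) = beta * exp(beta * zeta) / (exp(beta) - 1).

    q    : probability that the binary latent z = 1 (encoder output)
    rho  : uniform random number in [0, 1); the reparameterization
           makes zeta a deterministic, differentiable function of q
    beta : sharpness of the exponential branch (assumed value)
    """
    q = np.asarray(q, dtype=float)
    rho = np.asarray(rho, dtype=float)
    # position of rho inside the exponential branch, clipped to [0, 1]
    u = np.clip((rho - (1.0 - q)) / np.maximum(q, 1e-12), 0.0, 1.0)
    zeta = np.log1p(u * np.expm1(beta)) / beta
    return np.where(rho < 1.0 - q, 0.0, zeta)
```

Because zeta depends smoothly on q wherever the exponential branch is active, gradients can propagate through the encoder despite the underlying latent variable being binary.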
The respective encoder and decoder functions, q_φ(ζ|x) and p_θ(x|ζ), are approximated by deep neural networks with Transformer layers, each depending on its own set of adjustable parameters, φ and θ. We modified the KL divergence term with a coefficient β to avoid posterior collapse yan2020rebalancing. We trained the network for 300 epochs until apparent convergence using Gibbs sampling (see the red solid and dashed lines in Fig. 2, representing the total loss over the validation and training sets, respectively). In what follows, we discuss two checkpoints: the fully trained model (Gibbs300) and, for reference purposes, the intermediate model (Gibbs75) obtained by the end of the 75th epoch.
A VAE is a probabilistic model. In particular, this means that each of the discrete states of the latent variables is decoded into a probability distribution of SMILES-encoded molecules. At the top of Fig. 1, we provide an example of encoding a particular molecule (diaveridine) and its reconstruction by the Gibbs300 network (see the structures at the bottom). In this case, the target molecule was reconstructed exactly in runs (see the reconstruction probabilities and Tanimoto similarities to the target molecule next to the reported structures).
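For reference, Tanimoto similarity is the Jaccard index over binary molecular fingerprints (in practice computed with a cheminformatics toolkit such as RDKit); a minimal sketch over fingerprints stored as sets of set-bit indices:

```python
def tanimoto_similarity(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity of two binary fingerprints,
    each represented as a set of indices of the bits that are on."""
    if not fp_a and not fp_b:
        return 1.0  # convention: two empty fingerprints are identical
    intersection = len(fp_a & fp_b)
    union = len(fp_a | fp_b)
    return intersection / union

def tanimoto_distance(fp_a, fp_b):
    """Distance used for the novelty analysis: 1 - similarity."""
    return 1.0 - tanimoto_similarity(fp_a, fp_b)
```

The same function applies whether the fingerprints come from Morgan/ECFP hashing or any other binary encoding of substructures.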
A DVAE is a generative model and can produce novel molecules with properties presumably matching those in the training set. In Fig. 3 and Table 1, we compare the distributions of basic biochemical properties of the molecules in the training set and among the molecules generated by each of the models trained in this work. The molecules generated by the network were mostly valid ( and in the Gibbs75 ( molecules) and Gibbs300 ( molecules) models, respectively). We kept track of the molecular weight (MW), the water–octanol partition coefficient (logP), the synthetic accessibility score (SAscore Ertl2009), and the drug-likeness (QED) score.
Aside from the biochemical and drug-likeness properties, we also measured the novelty of the generated molecules. Less than of the generated molecules ( and in the Gibbs75 and Gibbs300 models, respectively) came closer than a Tanimoto distance of to any of the molecules in the training set (and less than came closer than a Tanimoto distance of in both models). Extra training improved the validity of the generated molecules and brought the molecular properties closer to those found in the training set (see the Gibbs75 and Gibbs300 columns in Table 1).
The proposed network architecture is sufficiently compact to fit the D-Wave hardware. Hence, we were able to train the network using the annealer instead of Gibbs sampling. The learning of the hybrid model on D-Wave progressed more slowly than on a classical computer using Gibbs sampling (see the blue solid and dashed lines in Fig. 2, corresponding to the total loss of the model on the validation and training sets). We had, however, to stop the training before reaching convergence at the 75th epoch due to the limited performance of the available quantum hardware. With further hardware improvements, we expect to be able to prolong the training. Eventually, we used D-Wave to generate molecular structures (see Fig. 3 and the corresponding column in Table 1). As expected, the distributions of basic properties of the generated molecules were close to those obtained from the Gibbs75 model and could be improved if more training time were available.
            Train set   D-Wave sampl.   Gibbs sampl.   Gibbs sampl.
                        (75 epochs)     (75 epochs)    (300 epochs)
MW          409.99      374.16          397.81         416.14
LogP        3.41        3.68            3.64           3.61
QED         0.54        0.54            0.54           0.52
SAS         2.96        3.12            3.04           3.13
Validity    1.0         0.54            0.55           0.69
III Discussion and outlook
VAEs are powerful generative machine learning models capable of learning and sampling from the unknown distribution of input data kingma2013auto. As a first step towards building a hybrid quantum generative model, we prototyped a DVAE (along the lines of Ref. khoshaman2018quantum) with an RBM in its latent space khoshaman2018quantum; rolfe2016discrete. If provided with a large dataset of drug-like molecules, such a system should, once converged, learn representations of molecular structure and could be used to generate novel and yet drug-like molecules for drug design applications.
We employed Transformer layers vaswani2017attention in the encoder and decoder components and developed additional preprocessing layers that allowed our model to operate at the character level (rather than at the word level) when parsing SMILES, the textual representations of the input molecules. As a result, we were able to build a compact model with an RBM consisting of two layers of just units each. RBMs of this size may be deployed on already existing quantum annealing devices (such as the D-Wave Advantage processor). The RBM could then be turned into a Quantum Boltzmann Machine (QBM), and the whole system could be transformed into a Quantum VAE (QVAE, khoshaman2018quantum) sampling from richer non-classical distributions.
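A character-level SMILES tokenizer of this kind can be built around the regular expression of Ref. schwaller2018found; the pattern below is reproduced from memory and should be treated as an approximation of the published one:

```python
import re

# Regex-based SMILES tokenizer in the spirit of Ref. schwaller2018found.
# Bracket atoms, two-letter halogens (Cl, Br), bonds, ring-closure
# digits, and %NN ring labels are each captured as single tokens.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles):
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # a lossless tokenization must reassemble to the original string
    assert "".join(tokens) == smiles, "untokenizable characters present"
    return tokens
```

For example, `tokenize_smiles("CCCl")` keeps chlorine as the single token `Cl` rather than splitting it into `C` and `l`, which is the main point of tokenizing SMILES at the character level with such a pattern.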
For demonstration purposes, we trained the DVAE using almost random molecules from the ChEMBL database of manually curated, biologically active molecules. On classical hardware, the system could be trained with Gibbs sampling. We were able to show that the training converged and that the resulting network could generate molecules with distributions of basic properties, such as logP and QED, closely matching those in the training set. At the same time, the average size of the generated molecules increased as the training progressed, and relatively more harder-to-synthesize compounds appeared among the molecules generated by the network.
Training the same network architecture on the quantum annealer proceeded more slowly per epoch than on a classical computer, most probably due to noise. Nevertheless, the distributions of the molecular properties of the generated molecules were sufficiently close to those in the training set and among the molecules generated by the classical counterparts Gibbs75 and Gibbs300. While certain discrepancies between the distributions were present, these results were computed after only a limited number of training epochs due to restrictions on public access to the quantum computer.
In the original VAE paper gomez2018automatic, the authors proposed predicting additional properties, such as the binding constant to a particular target, on top of the autoencoder loss. In such a form, the network could be used in problems of actual drug design, i.e., for the generation of novel compounds binding specific medically relevant targets. Although we did not attempt to demonstrate such a capability, the DVAE, and eventually its hybrid implementation such as a QVAE, can be appropriately refitted by adding the extra loss term.
Hence, based on the results of our limited experiments, we conclude that the DVAE model can be used in generative chemistry and drug discovery applications of practical scale on already existing quantum annealing devices.
Training of a QVAE with a QBM in the latent layer reduces to the training of a DVAE with an RBM in the latent layer with Gibbs sampling if the transverse magnetic field in the QBM is absent khoshaman2018quantum. Using genuine QBMs should speed up the training of the system ( vs. , with being the size of the network Biamonte2017). Indeed, Ref. khoshaman2018quantum demonstrated that "quantum" samplers with non-vanishing transverse fields outperformed the DVAE when assessed by metrics achieved at the same number of training cycles (epochs).
Construction of a QVAE with a controllable non-zero transverse field can, in principle, be performed on the existing generation of D-Wave chips. However, it would require additional hardware tuning and a combination of additional techniques, such as reverse annealing schedules, pause-and-quench protocols, and others dwavemanual. We expect that with further developments in the engineering of quantum computing devices, hybrid architectures similar to the QVAE will surpass their classical counterparts. This may be especially true for problems involving the rules of quantum chemistry, such as learning efficient representations of molecular structures for applications in generative chemistry and drug design.
IV Acknowledgements
We thank Sergey Usmanov for help in performing the experiment. P.O.F. and A.K. would like to thank Dr. A. Tarkhov from Gero and Daria Orhunova for fruitful discussions and help with the data analysis. We acknowledge online cloud access to the quantum annealing device produced by D-Wave Systems. A.K.F. acknowledges support from the Russian Science Foundation grant (19-71-10092). P.O.F. and A.K. are supported by Gero LLC PTE (Singapore).
Author contributions
All the authors jointly developed the problem statement and analyzed existing state-of-the-art techniques. A.I.G. and A.S.B. implemented the considered methods and performed the simulation of molecules. All the authors contributed to discussing the results and writing the manuscript. P.O.F. and A.K.F. supervised the project.
Competing interests
Owing to the employment and consulting activities of A.I.G., A.S.B., and A.K.F., they have financial interests in commercial applications of quantum computing. P.O.F. and K.A. are employees of Gero LLC PTE (P.O.F. is a stakeholder in the company) and hence have financial interests in commercial applications of AI/ML systems for drug discovery.
Code availability
The code that is deemed central to the conclusions is available from the corresponding author upon reasonable request.
V Methods
We proposed and characterized a combination of a Discrete Variational Autoencoder (DVAE) with a Restricted Boltzmann Machine (RBM) in the latent space khoshaman2018quantum; rolfe2016discrete, with the encoder and decoder including layers from the Transformer model vaswani2017attention. We trained the models on a subset of the ChEMBL dataset by optimizing the evidence lower bound (ELBO) for the DVAE log-likelihood khoshaman2018quantum, modified with an additional coefficient that multiplies the KL divergence term yan2020rebalancing; see Eq. (1). A sketch of the architecture of our models is shown in Fig. 1.
Below we describe in detail the dataset, the network architecture, the training parameters, and the training schedules of the classical and quantum annealer models.
V.1 Dataset
We used a small subset of random molecules from the ChEMBL (release 26) database Gaulton2011; Gaulton2017. Altogether, we selected structures encoded by SMILES strings of a maximum length of 200 symbols and containing atoms from the organic subset only (B, C, N, O, P, S, F, Cl, Br, I). To focus on the relevant biologically active compounds, we removed salt residuals. Finally, we converted all SMILES into the canonical format with the help of RDKit rdkitlib.
The remaining molecules were randomly assigned to training and validation sets containing and of all samples ( and molecules, respectively).
V.2 Training DVAE using Gibbs sampling
Molecular SMILES strings are tokenized with the regular expression from Ref. schwaller2018found, which produced unique tokens. A standard trainable embedding layer and the positional encoding from Ref. vaswani2017attention are used. Our implementation combines the embedding and positional encoding, with the positional encoding multiplied by an additional correction factor:
X = E + PE / √d_model,   (2)

where E is the embedding tensor, PE is the positional encoding tensor, and d_model is the dimensionality of the embedding. This factor is required to make the proportion between the embedding tensor and the positional encoding closer to that in the original model vaswani2017attention. The dimension of the embeddings is a model hyperparameter, which was set to . In the encoder, we employed a layer of one-dimensional convolutions and a highway layer Srivastava2015 between the embedding and Transformer layers. The convolution layer, with filters and a kernel size equal to , was developed based on previous work banar2020Character. We used a highway layer to improve the performance of the model, following banar2020Character; kim2016character.
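The embedding combination above can be sketched in numpy; the 1/√d_model correction factor is our reading of Eq. (2), chosen so that X = E + PE/√d_model is proportionally equivalent to the original Transformer's E·√d_model + PE:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding of Vaswani et al. (2017)."""
    pos = np.arange(seq_len, dtype=float)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2.0 * (i // 2)) / d_model)
    # even dimensions get sin, odd dimensions get cos
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def add_positional_encoding(embeddings):
    """Combine token embeddings E with positional encoding PE as
    X = E + PE / sqrt(d_model); the scaling keeps the E-to-PE ratio
    equivalent to the original Transformer, which multiplies the
    embeddings by sqrt(d_model) instead of rescaling PE."""
    seq_len, d_model = embeddings.shape
    pe = positional_encoding(seq_len, d_model)
    return embeddings + pe / np.sqrt(d_model)
```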
The preprocessed sequence of 160-dimensional vectors is passed from the highway layer to the stack of Transformer encoder layers. The width of the feed-forward part of the layers is equal to . The number of heads in the multi-head attention is . We used GeLU activation functions Hendrycks2016 and dropout with a rate of .
The original Transformer encoder layers produce an output tensor of variable length, equal to the length of the input string. In order to further reduce the dimensionality of the latent space layer, we construct a fixed-length tensor from the Transformer encoder output by calculating a fixed number of vectors, which we then concatenate into one tensor. First, we select the vector with index from the Transformer layers' output. Then, we calculate the arithmetic mean of all vectors along the length of the output tensor. Next, we consider the subsets, each consisting of vectors with indices that have the same remainder after division by . For each subset, we compute the arithmetic mean and concatenate all calculated vectors into the fixed-length output tensor.
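This pooling scheme can be sketched as follows (the selected index, here 0, and the number of index classes k are assumptions, since the actual values are elided above):

```python
import numpy as np

def fixed_length_pool(h, k=4):
    """Collapse a variable-length Transformer output h of shape (L, d)
    into a fixed-length vector of shape ((k + 2) * d,).

    Concatenates: the vector at a fixed index (0 here, an assumption),
    the mean over the whole sequence, and k means over the index
    classes i mod k. Assumes L >= k so every class is non-empty.
    """
    L = h.shape[0]
    parts = [h[0], h.mean(axis=0)]
    for r in range(k):
        parts.append(h[np.arange(L) % k == r].mean(axis=0))
    return np.concatenate(parts)
```

The output length is fixed at (k + 2)·d regardless of the input sequence length, which is what allows the latent layer that follows to have a fixed size.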
The Restricted Boltzmann Machine (RBM) is implemented in the latent space as presented in Refs. khoshaman2018quantum; rolfe2016discrete. The RBM consists of two layers of units each; an RBM of precisely this size can be sampled using existing quantum annealing devices. It is worth noting that all the units of the RBM in the DVAE are latent variables connected to the rest of the model; hence, there is no distinction between "hidden" and "visible" units as in a standalone RBM khoshaman2018quantum; rolfe2016discrete. We employed the spike-and-exponential transformation as a smoothing probability distribution and the reparameterization trick to avoid calculating derivatives over random variables. The RBM is sampled by performing steps of Gibbs updates using persistent contrastive divergence (PCD) tieleman2008training. The decoder works in two modes: training and inference. In the inference mode, the decoder uses preprocessing layers. The main part of data processing in both the training and inference modes of the decoder consists of Transformer decoder layers. Altogether, we used Transformer decoder layers of the size (GeLU activation, dropout = ). The width of the feed-forward part of the layers was equal to , and the number of heads in the multi-head attention is .
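The block-Gibbs updates and the PCD negative phase mentioned above can be sketched as follows (a minimal numpy sketch; layer sizes and the number of Gibbs steps are placeholders, since the actual values are not reproduced here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, a, b, rng):
    """One block-Gibbs update of an RBM: sample the hidden units given
    the visibles, then resample the visibles given the hiddens.

    v : (n_chains, n_visible) binary states
    W : (n_visible, n_hidden) coupling matrix; a, b : bias vectors
    """
    p_h = sigmoid(v @ W + b)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(h @ W.T + a)
    v = (rng.random(p_v.shape) < p_v).astype(float)
    return v, h

def pcd_negative_phase(v_persistent, W, a, b, n_steps, rng):
    """Persistent contrastive divergence: continue the Gibbs chains
    from their previous state instead of restarting from the data."""
    v = v_persistent
    for _ in range(n_steps):
        v, h = gibbs_step(v, W, a, b, rng)
    return v, h
```

The returned chain states are reused as `v_persistent` at the next parameter update, which is what distinguishes PCD from plain contrastive divergence.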
To train the model, we used the rebalanced objective function, in which the KL divergence term is multiplied by an additional small coefficient yan2020rebalancing to avoid the posterior collapse problem, and employed the Adam optimizer.
In contrast to the original Transformer model, we used a different learning rate schedule: we trained the model for epochs using a MultiStep learning rate schedule with an initial learning rate of . The learning rate was subsequently reduced by a factor of at points corresponding to , , and of the length of the training process.
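A MultiStep schedule of this kind can be sketched as a plain function of the epoch; the base rate, decay factor, and milestone fractions below are illustrative placeholders, not the paper's values:

```python
def multistep_lr(epoch, total_epochs, base_lr=1e-4, gamma=0.1,
                 milestones=(0.5, 0.75, 0.9)):
    """MultiStep schedule: multiply the learning rate by `gamma` once
    the epoch passes each milestone, expressed as a fraction of the
    total training length. All numeric defaults are placeholders."""
    lr = base_lr
    for m in milestones:
        if epoch >= m * total_epochs:
            lr *= gamma
    return lr
```

In a PyTorch training loop the same behavior is typically obtained with `torch.optim.lr_scheduler.MultiStepLR`, with the milestones given as epoch indices.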
To estimate the logarithm of the partition function of the Boltzmann distribution, we used the annealed importance sampling (AIS) algorithm Neal2001AnnealedIS during the evaluation of the model at the end of each epoch, using intermediate distributions and samples.
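For illustration, AIS for a small RBM can be sketched as follows (hidden units are summed out analytically; the linear β schedule, chain count, and number of intermediate distributions are our assumptions, not the paper's settings):

```python
import numpy as np

def log_pstar(v, W, a, b, beta):
    """Unnormalized log-probability of visibles v (hiddens summed out)
    for an RBM with all parameters scaled by beta."""
    return beta * (v @ a) + np.logaddexp(0.0, beta * (v @ W + b)).sum(axis=-1)

def gibbs_step(v, W, a, b, beta, rng):
    """One block-Gibbs transition that leaves p_beta invariant."""
    p_h = 1.0 / (1.0 + np.exp(-beta * (v @ W + b)))
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = 1.0 / (1.0 + np.exp(-beta * (h @ W.T + a)))
    return (rng.random(p_v.shape) < p_v).astype(float)

def ais_log_z(W, a, b, n_chains=500, n_betas=100, seed=0):
    """AIS estimate of log Z, annealing from the zero-parameter RBM
    (for which log Z0 = (n_v + n_h) * log 2) to the target model."""
    rng = np.random.default_rng(seed)
    n_v, n_h = W.shape
    betas = np.linspace(0.0, 1.0, n_betas)
    v = (rng.random((n_chains, n_v)) < 0.5).astype(float)
    log_w = np.zeros(n_chains)
    for beta_prev, beta_next in zip(betas[:-1], betas[1:]):
        # accumulate importance weights, then move the chains
        log_w += log_pstar(v, W, a, b, beta_next) - log_pstar(v, W, a, b, beta_prev)
        v = gibbs_step(v, W, a, b, beta_next, rng)
    log_z0 = (n_v + n_h) * np.log(2.0)
    m = log_w.max()
    return log_z0 + m + np.log(np.exp(log_w - m).mean())
```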
Due to resource constraints, we did not have a chance to optimize the hyperparameters or explore many architectural variants of the model. The presented variant of the network simply worked and can be considered a first step towards a truly effective solution.
V.3 Training DVAE on a quantum annealer
We used exactly the same network architecture on the quantum annealer, the only difference from the classical case being that the RBM in the latent space was sampled using the quantum annealer (D-Wave Advantage processor). Also, the quantum model was trained for epochs with a constant learning rate equal to .
To estimate the logarithm of the partition function of the Boltzmann distribution during the evaluation of the model, we used a different version of annealed importance sampling (AIS; see Ref. Ulanov2019QuantuminspiredAA) with the same parameters as in the classical case.