Synthesizing and characterizing small molecules with desired properties in a laboratory is a time-consuming task (Zhavoronkov et al., 2019). Until recently, experimental laboratories were mostly human operated; they relied completely on experts in the field to design experiments, carry out characterization, analyze and validate results, and make decisions about the final product. Moreover, the experimental process involves a series of steps, each requiring several correlated parameters to be tuned (Kackar, 1985; Kim et al., 2017), which is a daunting task, as each parameter set conventionally demands individual experiments. This has slowed the discovery of high-impact small molecules and/or materials, in some cases by decades, with implications for diverse fields such as energy storage, electronics, catalysis, and drug discovery. Moreover, the high-impact materials of today come from exploring only a fraction of the known chemical space. Larger portions of the chemical space remain unexplored and are expected to contain exotic materials with the potential to bring unprecedented advances to state-of-the-art technologies. Exploring such a large space with conventional experiments would take enormous time and resources (Leelananda and Lindert, 2016; DiMasi et al., 2016; Murcko, 2012; Paul et al., 2010). In this scenario, complete automation of laboratories is long overdue; past attempts have met with only limited success (Nicolaou et al., 2019; Vidler and Baumgartner, 2019; Struble et al., 2020; Godfrey et al., 2013).
Automating the computational design of molecules, integrating physics-based simulations and optimization with ML approaches, is a feasible and efficient alternative; it can contribute significantly to accelerating autonomous molecular design. High-throughput quantum mechanical calculations, such as efficient density functional theory (DFT) based simulations, are the first step toward this goal of providing insight into a larger chemical space and have shown promise in accelerating novel molecule discovery. However, this approach still requires human intelligence for various decision-making processes and cannot autonomously guide the steps of small-molecule therapeutic discovery, slowing down the entire process. Additionally, inverse design of molecules is notoriously difficult with DFT alone. The amount of data produced by these high-throughput methods is so large that it cannot be analyzed in real time with conventional methods. Autonomous computational design and characterization of molecules is even more important in scenarios where existing experimental or computational approaches are inefficient (Marklund et al., 2015; Li et al., 2019). One particular example is the challenge of identifying new metabolites in a biological sample from mass spectrometry data, which requires mapping the fragmented spectra of novel molecules to an existing spectral library, making it slow and tedious. In many cases, such reference libraries do not exist, and a machine-learning-integrated, automated workflow would be an ideal choice for rapid identification of metabolites as well as for expanding the existing libraries for future reference. Such workflows have shown an early ability to quickly screen molecules and accurately predict their properties for different applications.
The synergistic use of high-throughput methods in a closed loop with machine-learning-based methods capable of inverse design is considered vital for the autonomous and accelerated discovery of molecules (Godfrey et al., 2013). In this perspective, we discuss how computational workflows for autonomous molecular design can guide the goal of laboratory automation and review the current state-of-the-art in artificial intelligence (AI) guided autonomous molecular design, focusing mainly on small-molecule therapeutic discovery.
2 Components of Computational Autonomous Molecular Design Workflow
The workflow for computational autonomous molecular design (CAMD) must be an integrated, closed-loop system with (i) efficient data generation and extraction tools, (ii) robust data representation techniques, (iii) physics-based predictive machine learning models, and (iv) tools to generate new molecules using the knowledge learned in steps i-iii. Ideally, an autonomous computational workflow for molecule discovery would learn from its own experience and adjust its functionality as the chemical environment or the targeted functionality changes. This can be achieved when all the components work in collaboration, providing feedback and improving model performance as we move from one step to another.
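The closed-loop structure of steps i-iv can be sketched in a few lines of code. The sketch below is purely illustrative: every function is a hypothetical placeholder (`generate_data` stands in for high-throughput DFT labeling, `train_predictor` for the predictive model, and `propose_candidates` for the inverse-design step), and the "model" is a trivial stub rather than any real ML architecture.

```python
# Hedged sketch of a closed-loop CAMD iteration; all functions are placeholders.
import random

random.seed(0)

def generate_data(pool):
    # stand-in for step (i): assign a property label to each molecule
    return {mol: random.random() for mol in pool}

def train_predictor(data):
    # stand-in for steps (ii)-(iii): a trivial model predicting the label mean
    mean = sum(data.values()) / len(data)
    return lambda mol: mean

def propose_candidates(predictor, pool, k=2):
    # stand-in for step (iv): rank the pool by predicted property
    return sorted(pool, key=predictor, reverse=True)[:k]

pool = ["CCO", "c1ccccc1", "CC(=O)O"]
for _ in range(3):                  # closed loop: relabel, retrain, refeed
    data = generate_data(pool)
    model = train_predictor(data)
    leads = propose_candidates(model, pool)
    pool = pool + leads             # leads go back for validation/relabeling
```

The essential point is the feedback edge: candidates proposed by the generator re-enter the labeling and training steps, so each component improves the others over successive passes.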
For data generation in CAMD, high-throughput density functional theory (DFT) (Hohenberg and Kohn, 1964; Kohn and Sham, 1965) is a common choice, mainly because of its reasonable accuracy and efficiency (Jain et al., 2011; Qu et al., 2015). In DFT, we feed in 3D structures to predict the properties of interest. Data generated from DFT simulations is processed to extract the most relevant information, which is then used either as input to learn the representation (Qiao et al., 2020; Lee et al., 2020) or as the target required for the ML models (Dral, 2020; Bogojeski et al., 2020). The generated data can be used in two different ways: to predict the properties of new molecules using a direct supervised ML approach, or to generate new molecules with desired properties using inverse design. CAMD can be tied to supplementary components, such as databases to store and visualize the data. The AI-assisted CAMD workflow presented here is a first step toward fully automated workflows for molecular design; on its own, it accelerates only lead optimization. These workflows can, in principle, be combined with experimental setups for computer-aided synthesis planning that include synthesis and characterization tools, at the cost of increased complexity and expense. Instead, experimental measurements and characterization should be performed only for the lead compounds obtained from CAMD.
The data generated from inverse design should, in principle, be validated for the desired properties using an integrated DFT method, or by docking with a target protein to determine binding affinity within the closed-loop system, with the rest of the CAMD updated accordingly. These steps are then repeated in a closed loop, improving and optimizing the data representation, property prediction, and data generation components. Once we have confidence in the workflow's ability to generate valid new molecules, the DFT validation step can be bypassed or replaced with an ML predictive tool to make the workflow computationally more efficient. In the following, we briefly discuss the main components of CAMD while reviewing recent breakthroughs.
3 Data Generation and Molecular Representation
Machine learning models are data centric. A lack of accurate and well-curated data is the main bottleneck limiting their use in many domains of physical and biological science. For some sub-domains, a limited amount of data exists, coming mainly from physics-based simulations stored in databases (Ramakrishnan et al., 2014; Ruddigkeit et al., 2012) or from experimental databases such as NIST (Shen et al.). For other fields, such as biochemical reactions (Seaver et al., 2020), databases with the free energies of reactions exist, but the values are obtained with empirical methods, which are not considered ideal as ground truth for machine learning models. For many domains, accurate and curated data does not exist at all. In these scenarios, the slightly unconventional yet very effective approach of creating ML data from published scientific literature and patents has recently gained adoption (Kononova et al., 2019; Zheng et al., 2019; Singhal et al., 2016; Krallinger et al., 2017).
Robust representation of molecules is required for the accurate functioning of machine learning models (Huang and von Lilienfeld, 2016). An ideal representation should be unique, invariant, invertible, and efficient to obtain, and should capture the physics, chemistry, and structural motifs of the molecules. Some of these goals can be achieved by using all available physical, chemical, and structural properties (Chen et al., 2019), but these are rarely well documented together, so obtaining this information is considered cumbersome. Over time, this has been tackled with several alternative approaches that work well for specific problems (Elton et al., 2019; Bjerrum, 2017; Gilmer et al., 2017; Hamilton et al., 2017; Kearnes et al., 2016; Wu et al., 2017). However, obtaining robust representations of molecules for diverse machine learning problems remains challenging, and a gold-standard method that works consistently for all kinds of problems is yet to be found. Molecular representations used in the literature fall into two broad groups: (a) 1D and/or 2D representations designed by experts using domain-specific knowledge, including properties from simulations and experiments, and (b) molecular representations learned iteratively within ML frameworks, directly from the 3D nuclear coordinates and properties.
Expert-engineered molecular representations have been used extensively for predictive modeling in the last decade; these include properties of the molecules (Rupp et al., 2012; Hansen et al., 2015), structured text sequences such as SMILES and InChI (Weininger, 1988; Heller et al., 2013; Grethe et al., 2013), and molecular fingerprints (Elton et al., 2018), among others. Such representations are carefully selected for each specific problem using domain expertise, a lot of resources, and time. The SMILES representation is the main workhorse as a starting point both for representation learning and for generating expert-engineered molecular descriptors. For the latter, SMILES strings can be used directly as one-hot encoded vectors, to calculate fingerprints, or to calculate a range of empirical properties using open-source platforms such as RDKit and ChemAxon, thereby bypassing expensive feature generation from quantum chemistry or experiments while providing fast access to diverse properties, including 3D coordinates, for molecular representation. Moreover, SMILES strings can easily be converted into 2D graphs, which are the preferred choice to date for generative modeling, where molecules are treated as graphs with nodes and edges. Although significant progress has been made in molecular generative modeling using mainly SMILES strings (Weininger, 1988), they often lead to the generation of syntactically invalid molecules and are non-unique. In addition, they are known to violate fundamental physics- and chemistry-based constraints (Krenn et al., 2019). Case-specific solutions to circumvent some of these problems exist, but a universal solution is still unknown. Extensions of SMILES have been attempted by more robustly encoding the rings and branches of molecules to obtain representations with higher semantic and syntactic validity, including canonical SMILES (Koichi et al., 2007; O'Boyle, 2012), InChI (Heller et al., 2013; Grethe et al., 2013), SMARTS, DeepSMILES (O'Boyle and Dalke, 2018), and DESMILES (Maragakis et al., 2020). More recently, Krenn et al. proposed a 100% syntactically correct and robust string-based representation of molecules known as SELFIES (Krenn et al., 2019), which has been increasingly adopted for predictive and generative modeling (Nigam et al., 2020).
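The simplest of the uses above, feeding a SMILES string directly to a sequence model as one-hot encoded vectors, can be sketched without any cheminformatics library. The character vocabulary and the padding length below are illustrative choices, not a standard tokenization.

```python
# Sketch: one-hot encoding a SMILES string over a small illustrative vocabulary.
VOCAB = ["C", "c", "O", "N", "(", ")", "=", "1", "PAD"]
CHAR_TO_IDX = {ch: i for i, ch in enumerate(VOCAB)}

def one_hot_smiles(smiles, max_len=12):
    tokens = list(smiles)[:max_len]
    tokens += ["PAD"] * (max_len - len(tokens))   # pad to a fixed length
    matrix = []
    for tok in tokens:
        row = [0] * len(VOCAB)
        row[CHAR_TO_IDX[tok]] = 1                 # exactly one 1 per position
        matrix.append(row)
    return matrix

encoding = one_hot_smiles("CC(=O)O")   # acetic acid, 7 tokens + 5 PAD rows
```

A real pipeline would tokenize multi-character atoms (e.g. `Cl`, `Br`) and build the vocabulary from the data set; the fixed-length matrix produced here is the form consumed by the RNN/CNN encoders discussed later.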
Recently, molecular representations learned iteratively and directly from the molecules themselves have gained adoption, mainly for predictive molecular modeling, achieving chemical accuracy for a range of properties (Chen et al., 2019; Gebauer et al., 2019b; Schütt et al., 2019). Such representations are more robust and outperform expert-designed representations in drug discovery (Minnich et al., 2020).
For representation learning, different variants of graph neural networks are a popular choice (Gilmer et al., 2017; C. et al., 2019). The process starts by generating atom (node) and bond (edge) features for all the atoms and bonds within a molecule; these are then iteratively updated using graph traversal algorithms, taking the chemical-environment information into account, to learn a robust molecular representation. The initial atom and bond features may be simple one-hot encoded vectors capturing only the atom type and bond type, or a list of atom and bond properties derived from SMILES strings. Yang et al. achieved chemical accuracy in predicting a number of properties by augmenting the learned atom and bond features with additional molecule-level features (Yang et al., 2019).
Molecules are three-dimensional, multi-conformational objects, and it is therefore natural to assume that they can be well represented by their nuclear coordinates, as is the case in quantum-mechanics-based molecular simulations (Göller et al., 2020). However, a coordinate-based representation of molecules is non-invariant, non-invertible, and non-unique (Elton et al., 2019), and hence not commonly used in conventional machine learning. In addition, the coordinates by themselves do not carry information about key molecular attributes such as bond types, symmetry, spin states, and charge. Approaches and architectures have been proposed to create robust, unique, invariant representations from nuclear coordinates using atom-centered Gaussian functions, tensor field networks, and, more robustly, representation learning techniques (Schütt et al., 2017a; T. et al., 2018; Schütt et al., 2017b, 2019; Chen et al., 2019; Axelrod and Gomez-Bombarelli, 2020).
Chen et al. (2019) achieved chemical accuracy in predicting a number of properties with ML models that combine the atom and bond features of molecules with global state features before the iterative update process. Robust representations of molecules can also be learned from only the nuclear charges and coordinates, as demonstrated by Schütt et al. (2017a, b, 2019). Different variants (see Table 1) of message passing neural networks for representation learning have been proposed; the main differences lie in how messages are passed between the nodes and edges and how the hidden states are updated during the iterative process. The hidden state $h_v^t$ at node $v$ during the message passing phase is updated as

$$m_v^{t+1} = \sum_{w \in N(v)} M_t\left(h_v^t, h_w^t, e_{vw}\right), \qquad h_v^{t+1} = U_t\left(h_v^t, m_v^{t+1}\right),$$

where $M_t$ and $U_t$ are the message and vertex update functions, $h_w^t$ and $e_{vw}$ are the node and edge features, and the summation runs over all neighbors $N(v)$ of $v$ in the molecular graph. A readout phase then uses this information to generate a feature vector for the whole molecule, which is used for property prediction.
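One step of this update scheme can be traced numerically on a toy graph. In the sketch below, a three-atom chain carries two-dimensional node features and one-dimensional edge features; the concrete choices of $M_t$ (edge-weighted copy of the neighbor's features) and $U_t$ (an average) are illustrative stand-ins, not those of any published model.

```python
# Toy numeric sketch of one message-passing step for the equations above.
h = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.0, 1.0]}            # node features
e = {(0, 1): [1.0], (1, 0): [1.0], (1, 2): [1.0], (2, 1): [1.0]}  # edge features
neighbors = {0: [1], 1: [0, 2], 2: [1]}

def M(h_v, h_w, e_vw):          # message function M_t (illustrative)
    return [x * e_vw[0] for x in h_w]

def U(h_v, m_v):                # vertex update function U_t (illustrative)
    return [0.5 * a + 0.5 * b for a, b in zip(h_v, m_v)]

def message_passing_step(h):
    new_h = {}
    for v in h:
        m_v = [0.0, 0.0]        # m_v = sum over neighbors w of M(h_v, h_w, e_vw)
        for w in neighbors[v]:
            msg = M(h[v], h[w], e[(v, w)])
            m_v = [a + b for a, b in zip(m_v, msg)]
        new_h[v] = U(h[v], m_v)
    return new_h

h = message_passing_step(h)
# simple sum readout producing one feature vector for the whole molecule
readout = [sum(h[v][i] for v in h) for i in range(2)]
```

Stacking several such steps lets each node's state absorb information from progressively larger chemical neighborhoods before the readout.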
These approaches, however, require relatively large amounts of data and computationally intensive DFT-optimized ground state coordinates for the desired accuracy, limiting their use for domains and data sets lacking them. Moreover, representations learned from a single 3D conformation of a molecule fail to capture the conformer flexibility on its potential energy surface (Axelrod and Gomez-Bombarelli, 2020), thus requiring expensive QM-based calculations for each conformer of the molecule. Some work in this direction, based on semi-empirical DFT calculations to produce a database of conformers with 3D geometries, has recently been published (Axelrod and Gomez-Bombarelli, 2020); however, it does not provide a significant improvement in predictive power. In practice, these methods can be used with empirical coordinates generated from SMILES using RDKit or ChemAxon, but they still require the corresponding ground-state target properties for building a robust predictive modeling engine, as well as for optimizing the properties of new molecules with generative modeling.
Moreover, in these models a cutoff distance is used to restrict interactions among atoms to the local environment, hence generating local representations. In many molecular systems and for several applications, explicit non-local interactions are equally important (Yue et al., 2021). Long-range interactions have been implemented in convolutional neural networks, which, however, are known to be inefficient at information propagation. Matlock et al. (2018) proposed a novel architecture that encodes non-local features of molecules in terms of efficient local features in aromatic and conjugated systems using gated recurrent units. In their models, information is propagated back and forth across the molecule in the form of waves, making it possible to pass information locally while simultaneously traversing the entire molecule in a single pass. Following the unprecedented success of learned molecular representations in predictive modeling, they have also been adopted with success in generative models (Gebauer et al., 2019b; Joshi et al., 2021).
4 Predictive Modeling
Predictive modeling is the most widely studied area of applied machine learning in molecular modeling, drug discovery, and medicine (Gertrudes et al., 2012; Talevi et al., 2020; Lo et al., 2018; Agarwal et al., 2010; Rodrigues and Bernardes, 2020; Gao et al., 2020; Schütt et al., 2017a, b, 2019; Dahal and Gautam, 2020). Depending on whether the ML architecture requires pre-defined input representations or can learn its own input representation, predictive modeling can be broadly classified into two sub-categories. The former is well covered in several recent review articles (Gertrudes et al., 2012; Talevi et al., 2020; Lo et al., 2018; Agarwal et al., 2010; Rodrigues and Bernardes, 2020; Gao et al., 2020). We focus only on the latter, which has recently been adopted with unprecedented accuracy across a range of properties and data sets. A number of related approaches for predictive feature and property learning have been proposed in recent years under the umbrella term graph neural network (GNN) (Duvenaud et al., 2015; Faber et al., 2017; Fung et al., 2021) and extensively tested on different quantum chemistry benchmark data sets. A GNN for predictive molecular modeling consists of two phases, representation learning and property prediction, integrated end to end so as to learn a meaningful representation of the molecules while simultaneously learning how to use the learned features for accurate property prediction. In the feature-learning phase, atom and bond connectivity information read from the nuclear coordinates or graph inputs is updated by passing through a sequence of layers for robust chemical encoding, and the result is then used in the subsequent property prediction blocks.
In one of the first works on embedded feature learning, Schütt et al. (2017a) used the concept of many-body Hamiltonians to devise the size-extensive, rotationally, translationally, and permutationally invariant deep tensor neural network (DTNN) architecture for molecular feature learning and property prediction. Starting with the embedded atomic numbers and nuclear coordinates as input, and after a series of refinement steps to encode the chemical environment, their approach learns atom-centered Gaussian-basis functions as features that can be used to predict the atomic contribution to a given molecular property. The total property of the molecule is the sum over the atomic contributions. They demonstrated a chemical accuracy of 1 kcal/mol in total energy prediction for the relatively small molecules in the QM7/QM9 data sets, which contain only H, C, N, O, and F atoms.
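The additive, size-extensive structure of the DTNN prediction (total property as a sum of atomic contributions) is easy to illustrate. In the sketch below the per-atom values are made-up placeholders standing in for the learned, environment-dependent contributions; a real DTNN computes a different contribution for each atom depending on its chemical neighborhood.

```python
# Sketch of the additive atomic-contribution idea; values are placeholders,
# not learned numbers, and ignore the chemical environment entirely.
atomic_contribution = {"H": -0.5, "C": -38.0, "N": -54.5, "O": -75.0}

def molecular_property(atoms):
    # total property = sum over per-atom contributions (size-extensive)
    return sum(atomic_contribution[a] for a in atoms)

methane = ["C", "H", "H", "H", "H"]
energy = molecular_property(methane)   # -38.0 + 4 * (-0.5) = -40.0
```

Because the prediction is a sum over atoms, the model naturally extends to molecules of any size, which is the property the DTNN design exploits.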
Table 1: Benchmark models for molecular representation learning: MPNN (C. et al., 2019), d-MPNN (Yang et al., 2019), SchNet (Schütt et al., 2019), MEGNet (Chen et al., 2019), and SchNet-edge (Jørgensen et al., 2018).
Building on the DTNN, Schütt et al. (2019) proposed the SchNet model, in which the interactions between atoms are encoded using continuous-filter convolution layers whose filters are produced by filter-generating neural networks. They also expanded the predictive power of their model, compared with the total-energy-only DTNN, to the electronic, optical, and thermodynamic properties of molecules in the QM9 data set, achieving state-of-the-art chemical accuracy on 8 out of 12 properties. They further improved on the accuracy of the related message passing neural network (MPNN) approach of Gilmer et al. (2017) for a number of properties, except the polarizability and the electronic spatial extent. In contrast to SchNet and the DTNN, which learn atom-wise representations of the molecule, the MPNN learns a global representation from the atomic numbers, nuclear coordinates, and other relevant bond attributes and uses it for molecular property prediction. The MPNN is more accurate for the intensive properties (the polarizability α and the electronic spatial extent ⟨R²⟩), where a decomposition into individual atomic contributions is not required. The performance of SchNet was further improved by Jørgensen et al. (2018) by making the edge features inclusive of the atom receiving the message.
In another related model, Chen et al. (2019) proposed an integrated framework with unique feature-update steps that works equally well for molecules and solids. They used several atom and bond attributes, combined with a global state attribute, to learn the feature representation of a molecule. They claimed to outperform the SchNet model on 11 out of 13 properties, including U0, U, H, and G, on the benchmark QM9 data set. However, they trained their model on the respective atomization energies (P − nX, for P = U0, U, H, and G), in contrast to the parent U0, U, H, and G used to train SchNet; a fair comparison should be made between like quantities. These models also demonstrated that a GNN trained to predict a single molecular property will consistently outperform a model optimized to predict all the properties simultaneously. Other variants of the MPNN have been published with slight improvements in accuracy over the parent MPNN for some properties of the QM9 data set (Jørgensen et al., 2018; Yang et al., 2019). The key features of a few benchmark models, with their advantages and disadvantages, are listed in Table 1, and the mean absolute errors obtained from some of the benchmark models are compared with the target chemical accuracies in Table 2. Together, these show that with an appropriate ML model, a proper molecular representation, and a well-curated, accurate data set, the sought-after state-of-the-art chemical accuracy can be achieved with machine learning.
5 Inverse Molecular Design
To achieve the long-overdue goals of exploring a large chemical space, accelerating molecular design, and generating molecules with desired properties, inverse design is unavoidable. It is generally known that a molecule must have specific functionalities to be an effective therapeutic candidate against a particular disease, but in many cases new molecules that host such functionalities cannot easily be found with a direct approach. Furthermore, the pool in which such molecules may exist is astronomically large (approximately 10^60 molecules) (Polishchuk et al., 2013; Kim et al., 2015; Coley, 2020), making it impossible to explore each of them with quantum-mechanics-based simulations or experiments.
In such scenarios, inverse design, where the focus is on quickly identifying novel molecules with desired properties, is of significant interest, in contrast to the conventional, so-called direct approach, where known molecules are explored for different properties. In inverse design, we start with an initial data set for which we know the structures and properties, map it to a probability distribution, and then use that distribution to generate new, previously unknown candidate molecules with desired properties very efficiently. Inverse design uses optimization and search algorithms for this purpose (Zunger, 2018; Kuhn and Beratan, 1996) and by itself can accelerate lead molecule discovery, the first step in any drug development. This paradigm holds even more promise when used in a closed loop with synthesis, characterization, and testing tools, in such a way that each of these steps receives and transmits feedback concurrently, improving the others over time. This has recently shown promise by substantially reducing the timeline from discovery toward commercialization to days, a process that otherwise spans over a decade in most cases. In one recent work, Zhavoronkov et al. (2019) designed, developed, and tested a workflow that integrates deep reinforcement learning with experimental synthesis, characterization, and testing tools for the de novo design of drug molecules, producing potential inhibitors of discoidin domain receptor 1 in 21 days. Such a paradigm shift in drug design is possible only because of recently developed deep generative model architectures. Here, we briefly discuss some of these breakthrough architectures along with recent applications in drug discovery.
Variational autoencoders (VAEs) (Gómez-Bombarelli et al., 2018) and their different variants have been used extensively for generating small molecules with optimal physico-chemical and biological properties. A VAE consists of an encoder and a decoder network: the encoder compresses a high-dimensional discrete molecular representation into a continuous vector in a low-dimensional latent space, while the decoder recreates the original molecule from the compressed representation. Within VAEs, recurrent neural networks (RNNs) (Zaremba et al., 2015) and convolutional neural networks (CNNs) (Schmidhuber, 2015) are commonly used as encoders, whereas several RNN-based architectures, such as GRUs and LSTMs, are used as decoders. RNNs have also been used on their own to generate molecules. Gómez-Bombarelli et al. (2018) first used VAEs to generate molecules, in the form of SMILES strings, from the latent space while simultaneously predicting their properties; for property prediction, they coupled the encoder-decoder network with a predictor network that takes the latent-space vector as input. The SMILES strings generated by their VAE do not always correspond to valid molecules. To improve on this, Kusner et al. (2017) proposed a VAE variant known as the grammar VAE, which constrains SMILES generation using context-free grammar rules. Both of these works employed string-based molecular representations. More recent work has focused on using molecular graphs as the input and output of variational autoencoders (Liu et al., 2018), employing different VAE variants (Liu et al., 2018; Kusner et al., 2017; Jin et al., 2018) such as stacked autoencoders, semi-supervised deep autoencoders, adversarial autoencoders, and the junction tree variational autoencoder (JT-VAE) for generating molecules for drug discovery. In the JT-VAE (Jin et al., 2018), tree-like structures are generated from the valid sub-graph components of molecules and encoded along with the full graph to form two complementary latent spaces, one for the molecular graph and another for the corresponding junction tree. These two spaces are then used for hierarchical decoding, generating 100% valid small molecules. Further improvements include using the JT-VAE in combination with auto-regressive and graph-to-graph translation methods for generating valid large molecules (Jin et al., 2019).
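The encode/sample/decode cycle at the heart of a VAE can be traced numerically without any deep learning framework. The sketch below replaces the trained encoder and decoder networks with tiny hand-set linear maps; the dimensions, weights, and input vector are all illustrative, and only the reparameterization step (z = μ + σ·ε) follows the standard VAE recipe.

```python
# Minimal numeric sketch of the VAE encode/sample/decode cycle.
import math
import random

def encode(x):
    # "encoder" stand-in: map a 4-d input to a latent mean and log-variance
    mu = [0.5 * sum(x), 0.1 * sum(x)]
    log_var = [-2.0, -2.0]
    return mu, log_var

def sample(mu, log_var, rng):
    # reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)
    return [m + math.exp(0.5 * lv) * rng.gauss(0, 1)
            for m, lv in zip(mu, log_var)]

def decode(z):
    # "decoder" stand-in: map the latent vector back to input space
    return [z[0] + z[1]] * 4

rng = random.Random(0)
x = [1.0, 0.0, 0.0, 1.0]          # e.g. one row of a one-hot SMILES encoding
mu, log_var = encode(x)
z = sample(mu, log_var, rng)
x_rec = decode(z)
```

In a real molecular VAE the encoder and decoder are trained RNN/CNN networks, the input is a full one-hot SMILES matrix or molecular graph, and the loss combines reconstruction error with a KL term on (μ, log σ²); the continuous latent space is what makes gradient-based property optimization possible.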
Generative adversarial networks (GANs) are another class of neural networks popular for generating molecules (Bian et al.; Cao and Kipf, 2018; Kadurin et al., 2017). They consist of a generative and a discriminative model working in coordination: the generator is trained to generate molecules, and the discriminator is trained to judge the validity of the generated molecules. Kadurin et al. (2017) first successfully used the GAN architecture for the de novo generation of molecules with anti-cancer properties, demonstrating higher flexibility, more efficient training, and the ability to process a larger data set compared with a VAE. However, their model uses unconventional binary chemical-compound feature vectors and requires cumbersome validation of the output fingerprints against the PubChem chemical library. Guimaraes et al. (2017) and Sanchez-Lengeling et al. (2017) used a sequence-based GAN in combination with reinforcement learning for molecule generation, biasing the generator to produce molecules with desired properties. Both of these works suffer from several issues associated with GANs, including mode collapse during training. Some of these issues can be eliminated by the reinforced adversarial neural computer method (Putin et al., 2018), which extends their work. As with VAEs, GANs have also been used for molecular graph generation, which is considered more robust than SMILES-string generation. Cao and Kipf (2018) non-sequentially and efficiently generated molecular graphs of small molecules with high validity and novelty from jointly trained GAN and reinforcement learning architectures. Maziarka et al. proposed a graph-to-graph translation method that generates 100% valid molecules similar to the input molecules but with different desired properties; their approach relies on the latent space trained for the JT-VAE, and the degree of similarity of the generated molecules to the starting ones can be tuned. Méndez-Lucio et al. (2018) proposed conditional GANs to generate molecules that produce a desired biological effect at the cellular level, thus bridging systems biology and molecular design. A deep convolutional GAN (Bian et al.) has been used for de novo drug design targeting cannabinoid receptor subtypes.
Generative models such as GANs, RNNs, and VAEs have been combined with reward-driven, dynamic decision-making reinforcement learning (RL) techniques, in many cases with unprecedented success in generating molecules. Popova et al. (2018) used deep RL for the de novo design of molecules with desired hydrophobicity or inhibitory activity against Janus protein kinase 2. They first trained the generative and predictive models separately and then trained both together with an RL approach, biasing the model toward generating molecules with the desired properties. In RL, an agent, here a neural network, explores the chemical space and takes actions based on the rewards, penalties, and policies set up to maximize the desired outcome. Olivecrona et al. (2017) trained a policy-based RL model for generating bioactives against dopamine receptor type 2, with more than 95% of the generated molecules active. Furthermore, taking the drug celecoxib as an example, they demonstrated that RL can generate structures similar to celecoxib even when no celecoxib was included in the training set. De novo drug design has so far focused mostly on generating structures that satisfy only one of the several criteria required of a drug. Ståhl et al. (2019) proposed a fragment-based RL approach employing an actor-critic model that generates more than 90% valid molecules while optimizing multiple properties. Genetic algorithms (GAs) have also been used for generating molecules while optimizing their properties (O'Boyle et al., 2011; Virshup et al., 2013; Rupakheti et al., 2015; Jensen, 2019). GA-based models suffer from stagnation when trapped in regions of local optima (Paszkowicz, 2009). One notable work alleviating these problems is that of Nigam et al. (2020), who hybridized a GA with a deep neural network to generate diverse molecules while outperforming related models in optimization.
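The select-mutate loop shared by the GA-based generators above can be sketched on plain strings. Everything here is a stand-in: the "fitness" (counting 'C' characters) replaces a real property oracle, and the single-character mutation alphabet replaces chemically valid edits such as the SELFIES mutations used by Nigam et al.

```python
# Toy genetic-algorithm sketch of property-driven string optimization.
import random

ALPHABET = "CNO"

def fitness(s):
    return s.count("C")            # placeholder property to maximize

def mutate(s, rng):
    i = rng.randrange(len(s))      # change one random position
    return s[:i] + rng.choice(ALPHABET) + s[i + 1:]

def evolve(pop, generations=20, rng=None):
    rng = rng or random.Random(0)
    for _ in range(generations):
        pop = sorted(pop, key=fitness, reverse=True)[: len(pop) // 2]  # select
        pop = pop + [mutate(p, rng) for p in pop]                      # vary
    return max(pop, key=fitness)

best = evolve(["NONO", "ONON", "NNOO", "OONN"])
```

The stagnation problem mentioned above appears when selection repeatedly keeps near-identical high-fitness strings; hybrid GA-neural approaches add a learned penalty on such repeated motifs to keep the population diverse.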
All of the generative models discussed above generate molecules in the form of 2D graphs or SMILES strings. Models that generate molecules directly as 3D coordinates have also recently gained attention (Simm et al. 2020a,b; Gebauer et al. 2019a). The generated 3D coordinates can be used directly in subsequent quantum mechanical simulations or docking studies. One of the first such models was proposed by Gebauer et al. (2019a), who generate 3D coordinates of small molecules containing light atoms (H, C, N, O, F). Their model learns a representation from the 3D coordinates of known molecules and uses this learned space to generate 3D coordinates of novel molecules. Building on this for a drug discovery application, we recently proposed a model (Joshi et al. 2021) that generates 3D coordinates of molecules while always preserving a desired scaffold. This approach has generated synthesizable drug-like molecules that show high docking scores against the target protein. Other scaffold-based models that generate molecules as 2D graphs or SMILES strings have also been published (Li et al. 2020; Lim et al. 2020; Arús-Pous et al. 2020; Zhang et al. 2007; Scott and Edith Chan 2020).
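The scaffold-preserving idea can be caricatured in a few lines: the scaffold coordinates stay fixed while new atom positions are sampled around existing atoms and rejected on steric clashes. The scaffold geometry, bond length, and clash threshold below are illustrative placeholders, not values from any published model (which would use a learned conditional distribution rather than uniform sampling).

```python
import math
import random

random.seed(1)

# hypothetical fixed scaffold: four heavy-atom positions (angstroms)
SCAFFOLD = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (2.1, 1.2, 0.0), (1.4, 2.4, 0.0)]
BOND = 1.5      # target bond length for newly placed atoms
MIN_SEP = 1.1   # reject placements that clash with existing atoms

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def random_unit():
    # uniform random direction on the unit sphere
    z = random.uniform(-1, 1)
    t = random.uniform(0, 2 * math.pi)
    r = math.sqrt(1 - z * z)
    return (r * math.cos(t), r * math.sin(t), z)

def grow(n_new, max_tries=200):
    atoms = list(SCAFFOLD)          # the scaffold itself is never modified
    for _ in range(n_new):
        for _ in range(max_tries):
            anchor = random.choice(atoms)
            u = random_unit()
            cand = tuple(a + BOND * ui for a, ui in zip(anchor, u))
            if all(dist(cand, other) >= MIN_SEP for other in atoms):
                atoms.append(cand)
                break
    return atoms

mol = grow(5)   # scaffold plus five decorating atoms
```

A learned generative model replaces the uniform placement with a distribution conditioned on the scaffold and previously placed atoms, but the constraint structure, fixed scaffold plus sequential conditional placement, is the same.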
Recently, with the surge of interest in architectures and algorithms for quantum computing, quantum versions of generative models such as the quantum autoencoder (Romero et al. 2017) and quantum GANs (Allcock and Zhang 2018) have been proposed, which carry great potential, among other applications, for drug discovery. Preliminary proof-of-concept work (Romero et al. 2017; Allcock and Zhang 2018) shows that molecular information can be encoded and decoded with a quantum autoencoder, demonstrating that generative modeling is possible with quantum VAEs; more work, especially on the supporting hardware architecture, is required in this direction.
6 Conclusions and Future Perspectives
The success of current ML approaches depends on how accurately we can represent a chemical structure for a given model. Finding a robust, transferable, interpretable, and easy-to-obtain representation that obeys the fundamental physics and chemistry of molecules and works across different kinds of applications remains a critical task. Such a representation would save a lot of resources while increasing the accuracy and flexibility of molecular modeling. Efficiently combining such representations with robust and reproducible ML architectures will provide a predictive modeling engine. Once the desired accuracy is achieved for a given property across diverse molecular systems, the engine can routinely be used as an alternative to expensive QM-based simulations or experiments. In the chemical and biological sciences, a main bottleneck for deploying ML models is the lack of sufficient data, curated under similar conditions, required for training. Finding architectures that work consistently well with relatively small amounts of data is equally important. Strategies such as active learning (AL) and transfer learning (TL) are ideal for tackling such scenarios (Li and Rangarajan 2019; Warmuth et al. 2002; Fusani and Cabrera 2019; Green et al. 2019; Zhang et al. 2016). Graph-based methods for end-to-end feature learning and predictive modeling have so far been used successfully on small molecules consisting of lighter atoms. For larger molecules, the representation learning and molecule generation components must include non-local interactions such as van der Waals forces and hydrogen bonding when building predictive and generative models.
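A minimal sketch of the active-learning idea mentioned above, assuming a simple query-by-committee acquisition: two models trained on random subsets of the labeled data, with the pool point of maximal disagreement sent to an "oracle" that stands in for an expensive experiment or QM calculation. The 1-D toy function and k-NN committee are illustrative choices, not a recipe from the cited works.

```python
import random

random.seed(2)

def oracle(x):
    # stand-in for an expensive experiment or DFT calculation
    return x * x

def knn_predict(labeled, x, k=3):
    # simple 1-D k-nearest-neighbour regressor
    nearest = sorted(labeled, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

pool = [i / 10 for i in range(-50, 51)]               # unlabeled candidates
labeled = [(x, oracle(x)) for x in (-5.0, 0.0, 5.0)]  # tiny seed set

for _ in range(10):
    # committee: two models fit on random halves of the labeled set
    half = len(labeled) // 2 + 1
    m1 = random.sample(labeled, half)
    m2 = random.sample(labeled, half)
    # query the pool point where the committee disagrees most,
    # label it with the oracle, and fold it into the training set
    x_star = max(pool, key=lambda x: abs(knn_predict(m1, x) - knn_predict(m2, x)))
    pool.remove(x_star)
    labeled.append((x_star, oracle(x_star)))
```

The design choice is that labels are spent only where the current models are least certain, which is exactly why AL suits the data-poor regimes discussed above.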
Equally important is to develop a robust, transferable, and scalable state-of-the-art platform for inverse molecular design, tied in a closed loop with the predictive modeling engine, to accelerate therapeutic design and ultimately reduce the cost and time required. Many ML models used for inverse design take a single biochemical activity as the criterion for the success of a generated therapeutic candidate, in contrast to real clinical development, where small-molecule therapeutics are optimized for several bioactivities simultaneously. A CAMD workflow should therefore be designed to optimize multiple objective functions while generating and validating therapeutic molecules. Validating every newly generated lead molecule by experiment or quantum mechanical simulation is expensive; ways to auto-validate molecules, using an inbuilt robust predictive model, would be ideal to save resources. In addition, CAMD workflows should quantify the uncertainty associated with their predictions using statistical measures. Ideally, this uncertainty should decrease over time as the workflow learns from its own experience across successive closed-loop iterations.
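The multi-objective scoring idea can be sketched as a geometric mean of per-objective desirabilities, so a candidate must satisfy every objective to score well (a single zero kills the score). The property names, target windows, and candidate values below are purely illustrative, not taken from any published CAMD workflow.

```python
import math

def desirability(value, lo, hi):
    # clamp-and-scale a raw property value into [0, 1]
    return max(0.0, min(1.0, (value - lo) / (hi - lo)))

def multi_objective_score(props, targets):
    # geometric mean of per-objective desirabilities: one failed
    # objective (desirability 0) zeroes the overall score
    ds = [desirability(props[name], lo, hi) for name, (lo, hi) in targets.items()]
    return math.prod(ds) ** (1.0 / len(ds))

targets = {
    "binding": (5.0, 9.0),       # e.g. predicted pKd, higher is better
    "solubility": (-6.0, -2.0),  # e.g. a logS window
    "sa_score": (0.0, 1.0),      # synthetic accessibility, rescaled
}
candidate = {"binding": 7.5, "solubility": -3.0, "sa_score": 0.8}
print(round(multi_objective_score(candidate, targets), 3))  # → 0.721
```

Using a product rather than a weighted sum is one way to encode the "several bioactivities simultaneously" requirement: a molecule cannot buy back a failed objective with an excellent one.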
CAMD workflows are generally built and trained with a specific goal in mind; they need to be re-configured and re-trained to work for a different objective in therapeutic design and discovery. Designing and building a single automated CAMD setup for multiple experiments (multi-parameter optimization), in a transfer-learning fashion, remains a challenge. It would be particularly helpful in domains where only a relatively small amount of data exists. Having such a CAMD infrastructure, with its algorithm and software stack, would speed up end-to-end antiviral lead design and optimization for any future pandemic like COVID-19.
This research was supported by the DOE Office of Science through the National Virtual Biotechnology Laboratory, a consortium of DOE national laboratories focused on response to COVID-19, with funding provided by the Coronavirus CARES Act. Additional support was provided by the Laboratory Directed Research and Development Program at Pacific Northwest National Laboratory (PNNL). PNNL is a multiprogram national laboratory operated by Battelle for the DOE under Contract DE-AC05-76RLO 1830. Computing resources were provided by the Intramural program at the William R. Wiley Environmental Molecular Sciences Laboratory (EMSL; grid.436923.9), a DOE Office of Science User Facility sponsored by the Office of Biological and Environmental Research and operated under Contract No. DE-AC05-76RL01830.
- Ranking chemical structures for drug discovery: a new machine learning approach. 50 (5), pp. 716–731. Cited by: §4.
- Quantum machine learning. Natl. Sci. Rev. 6 (1), pp. 26–28. External Links: Cited by: §5.
- SMILES-based deep generative scaffold decorator for de-novo drug design. 12 (1), pp. 38. External Links: Cited by: §5.
- GEOM: energy-annotated molecular conformations for property prediction and molecular generation. External Links: Cited by: §3, §3.
- Deep convolutional generative adversarial network (DCGAN) models for screening and design of small molecules targeting cannabinoid receptors. Mol. Pharm. Cited by: §5.
- SMILES enumeration as data augmentation for neural network modeling of molecules. External Links: Cited by: §3.
- Quantum chemical accuracy from density functional approximations via machine learning. 11 (1), pp. 5223. External Links: Cited by: §2.
- Message-passing neural networks for high-throughput polymer screening. J. Chem. Phys. 150 (23), pp. 234111. Cited by: §3, Table 1.
- MolGAN: an implicit generative model for small molecular graphs. External Links: Cited by: §5.
- Graph networks as a universal machine learning framework for molecules and crystals. Chem. Mater. 31 (9), pp. 3564–3572. Cited by: §3, §3, §3, §3, Table 1, §4.
- Defining and exploring chemical spaces. External Links: Cited by: §5.
-  cxcalc, ChemAxon (https://www.chemaxon.com). Cited by: §3.
- Argumentative comparative analysis of machine learning on coronary artery disease.. 10, pp. 694–705. Cited by: §4.
-  Daylight Chemical Information Systems Inc. http://www.daylight.com/ dayhtml/doc/theory/theory.smarts.html. Cited by: §3.
- Innovation in the pharmaceutical industry: new estimates of R&D costs. 47, pp. 20–33. Cited by: §1.
- Quantum chemistry in the age of machine learning. 11 (6), pp. 2336–2347. Cited by: §2.
- Convolutional networks on graphs for learning molecular fingerprints. External Links: Cited by: §4.
- Applying machine learning techniques to predict the properties of energetic materials. Sci. Rep. 8 (1), pp. 9059. Cited by: §3.
- Deep learning for molecular design—a review of the state of the art. Mol. Syst. Des. Eng. 4, pp. 828–849. Cited by: §3, §3.
- Prediction errors of molecular machine learning models lower than hybrid dft error. J. Chem. Theory and Comput. 13 (11), pp. 5255–5264. Cited by: §4.
- Benchmarking Graph Neural Networks for Materials Chemistry. Cited by: §4.
- Active learning strategies with combine analysis: new tricks for an old dog. J. Comput. Aided Mol. Des. 33 (2), pp. 287–294. External Links: Cited by: §6.
- Applications of machine learning in drug target discovery. 21 (10), pp. 790–803. Cited by: §4.
- Symmetry-adapted generation of 3d point sets for the targeted discovery of molecules. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32, pp. 7566–7578. Cited by: §3, §3, §5.
- Machine learning techniques and drug design. 19 (25), pp. 4289–4297. Cited by: §4.
- Neural message passing for quantum chemistry. External Links: Cited by: §3, §3, §4.
- A remote-controlled adaptive medchem lab: an innovative approach to enable drug discovery in the 21st century. 18 (17-18), pp. 795–802. Cited by: §1, §1.
- Bayer’s in silico admet platform: a journey of machine learning over the past two decades. 25 (9), pp. 1702 – 1709. External Links: Cited by: §3.
- Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4 (2), pp. 268–276. Cited by: §5.
- BRADSHAW: a system for automated molecular design. J. Comput. Aided Mol. Des.. External Links: Cited by: §6.
- International chemical identifier for chemical reactions. J. Cheminform. 5 (1), pp. O16. Cited by: §3.
- Objective-reinforced generative adversarial networks (organ) for sequence generation models. External Links: Cited by: §5.
- Representation learning on graphs: methods and applications. External Links: Cited by: §3.
- Machine learning predictions of molecular properties: accurate many-body potentials and nonlocality in chemical space. J. Phys. Chem. Lett. 6 (12), pp. 2326–2331. Cited by: §3.
- InChI - the worldwide chemical structure identifier standard. J. Cheminform. 5 (1). Cited by: §3.
- Inhomogeneous electron gas. Phys. Rev. 136, pp. B864–B871. Cited by: §2.
-  https://aspuru.substack.com/p/molecular-graph-representations-and. Cited by: §3.
- Communication: understanding molecular representations in machine learning: the role of uniqueness and target similarity. J. Chem. Phys. 145 (16), pp. 161102. Cited by: §3.
- A high-throughput infrastructure for density functional theory calculations. Comput. Mater. Sci. 50 (8), pp. 2295–2310. Cited by: §2.
- A graph-based genetic algorithm and generative model/monte carlo tree search for the exploration of chemical space. Chem. Sci. 10, pp. 3567–3572. Cited by: §5.
- Multi-resolution autoregressive graph-to-graph translation for molecules. External Links: Cited by: §5, §5.
- Learning multimodal graph-to-graph translation for molecular optimization. External Links: Cited by: §5.
- Neural message passing with edge updates for predicting properties of molecules and materials. (English). Note: 32nd Conference on Neural Information Processing Systems, NIPS 2018 ; Conference date: 02-12-2018 Through 08-12-2018 Cited by: Table 1, §4, §4.
- 3D-scaffold: deep learning framework to generate 3D coordinates of drug-like molecules with desired scaffolds. Cited by: §3, §5.
- Off-line quality control, parameter design, and the taguchi method. J. Qual. Technol. 17 (4), pp. 176–188. Cited by: §1.
- DruGAN: an advanced generative adversarial autoencoder model for de novo generation of new molecules with desired molecular properties in silico. Mol. Pharm. 14 (9), pp. 3098–3104. Cited by: §5.
- Molecular graph convolutions: moving beyond fingerprints. J. Comput. Aided Mol. Des. 30 (8), pp. 595–608. External Links: Cited by: §3.
- Machine-learned and codified synthesis parameters of oxide materials. Sci. Data 4 (170127). Cited by: §1.
- PubChem Substance and Compound databases. Nucleic Acids Res. 44 (D1), pp. D1202–D1213. External Links: Cited by: §5.
- Self-consistent equations including exchange and correlation effects. 140, pp. A1133–A1138. Cited by: §2.
- Algorithm for advanced canonical coding of planar chemical structures that considers stereochemical and symmetric information. 47 (5). Cited by: §3.
- Text-mined dataset of inorganic materials synthesis recipes. Sci. Data 6 (203). Cited by: §3.
- Information retrieval and text mining technologies for chemistry. Chem. Rev. 117 (12), pp. 7673–7761. Cited by: §3.
- SELFIES: a robust representation of semantically constrained graphs with an example application in chemistry. External Links: Cited by: §3.
- Inverse strategies for molecular design. J. Phys. Chem. 100 (25), pp. 10595–10599. Cited by: §5.
- Grammar variational autoencoder. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pp. 1945–1954. Cited by: §5.
- Landrum G. RDKit: Open-Source Cheminformatics Software. 2016; (20 December 2020, date last accessed) http://rdkit.org/. Cited by: §3.
- Analytical gradients for molecular-orbital-based machine learning. External Links: Cited by: §2.
- Computational methods in drug discovery. 12 (1), pp. 2694–2718. Cited by: §1.
- Designing compact training sets for data-driven molecular property prediction through optimal exploitation and exploration. Mol. Syst. Des. Eng. 4, pp. 1048–1057. Cited by: §6.
- DeepScaffold: a comprehensive tool for scaffold-based de novo drug discovery using deep learning. 60 (1), pp. 77–91. Cited by: §5.
- Identification of metabolites from tandem mass spectra with a machine learning approach utilizing structural features. Bioinformatics. External Links: Cited by: §1.
- Scaffold-based molecular design with a graph generative model. 11, pp. 1153–1164. Cited by: §5.
- Constrained graph variational autoencoders for molecule design. External Links: Cited by: §5.
- Machine learning in chemoinformatics and drug discovery. 23 (8), pp. 1538 – 1546. External Links: Cited by: §4.
- A deep-learning view of chemical space designed to facilitate drug discovery. External Links: Cited by: §3.
- Collision cross sections for structural proteomics. Structure 23 (4), pp. 791 – 799. External Links: Cited by: §1.
- Learning a local-variable model of aromatic and conjugated systems. ACS Cent. Sci. 4 (1), pp. 52–62. Cited by: §3.
- De Novo Generation of Hit-like Molecules from Gene Expression Signatures Using Artificial Intelligence . External Links: Cited by: §5.
- AMPL: a data-driven modeling pipeline for drug discovery. 60 (4), pp. 1955–1968. Cited by: §3.
- Envisioning the future: medicine in the year 2050. 1 (2), pp. 89–99. Cited by: §1.
- Idea2Data: toward a new paradigm for drug discovery. 10 (3), pp. 278–286. Cited by: §1.
- Augmenting genetic algorithms with deep neural networks for exploring the chemical space. External Links: Cited by: §3, §5.
- DeepSMILES: An Adaptation of SMILES for Use in Machine-Learning of Chemical Structures . External Links: Cited by: §3.
- Computational design and selection of optimal organic photovoltaic materials. J. Phys. Chem. C 115 (32), pp. 16200–16210. Cited by: §5.
- Towards a Universal SMILES representation - A standard method to generate canonical SMILES based on the InChI.. 4 (1). Cited by: §3.
- Molecular de novo design through deep reinforcement learning. External Links: Cited by: §5.
- Properties of a genetic algorithm equipped with a dynamic penalty function. Comput. Mater. Sci. 45 (1), pp. 77 – 83. Note: Selected papers from the E-MRS 2007 Fall Meeting Symposium G: Genetic Algorithms in Materials Science and Engineering External Links: Cited by: §5.
- How to improve r&d productivity: the pharmaceutical industry’s grand challenge. 9 (3), pp. 203–214. Cited by: §1.
- Estimation of the size of drug-like chemical space based on GDB-17 data. J. Comput. Aided Mol. Des. 27 (8), pp. 675–679. External Links: Cited by: §5.
- Deep reinforcement learning for de novo drug design. Sci. Adv.. External Links: Cited by: §5.
- Reinforced adversarial neural computer for de novo molecular design. J. Chem. Inf. Model. 58 (6), pp. 1194–1204. Cited by: §5.
- OrbNet: deep learning for quantum chemistry using symmetry-adapted atomic-orbital features. 153 (12), pp. 124111. Cited by: §2.
- The electrolyte genome project: a big data approach in battery materials discovery. Comput. Mater. Sci. 103, pp. 56 – 67. External Links: Cited by: §2.
- Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data 1. Cited by: §3.
- Machine learning for target discovery in drug development. 56, pp. 16 – 22. Note: Next Generation Therapeutics External Links: Cited by: §4.
- Quantum autoencoders for efficient compression of quantum data. Quantum Sci. Technol. 2 (4), pp. 045001. External Links: Cited by: §5.
- Enumeration of 166 billion organic small molecules in the chemical universe database gdb-17. J. Chem. Inf. and Model. 52 (11), pp. 2864–2875. Cited by: §3.
- Strategy to discover diverse optimal molecules in the small molecule universe. J. Chem. Inf. Model. 55 (3), pp. 529–537. Cited by: §5.
- Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett. 108 (5), pp. 058301. Cited by: §3.
- Optimizing distributions over molecular space. An Objective-Reinforced Generative Adversarial Network for Inverse-design Chemistry (ORGANIC) . External Links: Cited by: §5.
- Deep learning in neural networks: an overview. 61, pp. 85–117. External Links: Cited by: §5.
- Quantum-chemical insights from deep tensor neural networks. Nat. Commun. 8 (1), pp. 13890. Cited by: §3, §3, §4, §4.
- SchNetPack: a deep learning toolbox for atomistic systems. J. Chem. Theory and Comput. 15 (1), pp. 448–455. Cited by: §3, §3, §3, Table 1, §4, §4.
- SchNet: a continuous-filter convolutional neural network for modeling quantum interactions. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 991–1001. Cited by: §3, §3, §4.
- ScaffoldGraph: an open-source library for the generation and analysis of molecular scaffold networks and scaffold trees. 36 (12), pp. 3930–3931. External Links: Cited by: §5.
- The ModelSEED Biochemistry Database for the Integration of Metabolic Annotations and the Reconstruction, Comparison and Analysis of Metabolic Models for Plants, Fungi and Microbes. External Links: Cited by: §3.
-  NIST standard reference simulation website, nist standard reference database number 173, national institute of standards and technology, gaithersburg md, 20899. Cited by: §3.
- Symmetry-aware actor-critic for 3d molecular design. External Links: Cited by: §5.
- Reinforcement learning for molecular design guided by quantum mechanics. External Links: Cited by: §5.
- Text mining for precision medicine: automating disease-mutation relationship extraction from biomedical literature. J. Am. Med. Inform. Assoc. 23 (4), pp. 766–772. Cited by: §3.
- Deep Reinforcement Learning for Multiparameter Optimization in de novo Drug Design . External Links: Cited by: §5.
- Current and future roles of artificial intelligence in medicinal chemistry synthesis. 63 (16), pp. 8667–8682. Cited by: §1.
- SchNet – a deep learning architecture for molecules and materials. J. Chem. Phys. 148 (24), pp. 241722. Cited by: §3.
- Machine learning in drug discovery and development part 1: a primer. 9 (3), pp. 129–142. Cited by: §4.
- Creating a virtual assistant for medicinal chemistry. 10 (7), pp. 1051–1055. Cited by: §1.
- Stochastic voyages into uncharted chemical space produce a representative library of all possible drug-like compounds. J. Amer. Chem. Soc. 135 (19), pp. 7296–7303. Cited by: §5.
- Active learning in the drug discovery process. In Advances in Neural Information Processing Systems 14, T. G. Dietterich, S. Becker, and Z. Ghahramani (Eds.), pp. 1449–1456. Cited by: §6.
- SMILES, a chemical language and information system. 1. introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28 (1), pp. 31–36. Cited by: §3.
- MoleculeNet: a benchmark for molecular machine learning. External Links: Cited by: §3.
- Analyzing learned molecular representations for property prediction. 59 (8), pp. 3370–3388. Cited by: §3, Table 1, §4.
- When do short-range atomistic machine-learning models fall short?. 154 (3), pp. 034111. Cited by: §3.
- Recurrent neural network regularization. External Links: Cited by: §5.
- Scaffold-based drug discovery. In Structure-Based Drug Discovery, pp. 129–153. External Links: Cited by: §5.
- Deep model based transfer and multi-task learning for biological image analysis. IEEE Trans. Big Data. (), pp. 1–1. External Links: Cited by: §6.
- Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nat. Biotechnol. 37 (2), pp. 1038––1040. Cited by: §1, §5.
- Text mining for drug discovery. In Bioinformatics and Drug Discovery, R. S. Larson and T. I. Oprea (Eds.), pp. 231–252. External Links: Cited by: §3.
- Inverse design in search of materials with target functionalities. Nat. Rev. Chem. 2 (4), pp. 1–16. Cited by: §5.