G protein-coupled receptors (GPCRs) are one of the most prominent families of integral membrane proteins and are among the most widely studied targets in drug discovery and development. According to [hauser2017trends], 34 percent of all drugs approved by the US Food and Drug Agency are GPCRs. While the exact size of the GPCR family remains to be determined, genome analyses suggest that the GPCR family is represented by approximately 800-1,000 genes in humans [garland2013gpcrs, thomsen2005functional]. More than 150 GPCRs are characterized as orphan receptors, which means that the receptors’ endogenous ligands are still unknown [bjarnadottir2006comprehensive]. However, it remains unclear whether endogenous ligands exist for all orphan GPCRs [davenport2013international].
Since GPCRs are involved in many different cellular and biological processes, make excellent drug targets, and remain orphaned to a relatively large extent, the prediction and consequent identification of GPCR ligands is an active area of research and interest [raschka2019automated]. Recent studies suggest that drugs target only approximately 10% of known GPCRs [southan2015iuphar]. The high cost of clinical trials coupled with a relatively low success rate (approximated to be below 6.2% [wong2019estimation, vamathevan2019applications]), motivates the development and application of machine learning and artificial intelligence-based methods for ligand discovery, to lower costs as well as to identify candidates that would otherwise be missed using conventional methods.
1.1 GPCRs and drug discovery
Many common human diseases involve GPCR signaling, including schizophrenia, glaucoma, depression, and hypertension [garland2013gpcrs]. Despite the increasing popularity of biologics, which includes gene therapy, recombinant therapeutic proteins, antibody-drug conjugates, and tissues [biologics2019web], experts see the discovery of small-molecule ligands being crucial for the future of molecular medicine and the treatment of human diseases [mullard2019fda, rodrigues2020machine].
GPCR ligands exhibit a wide range of characteristics and are very diverse in their physiochemical properties, shape, and size [southan2015iuphar]
. Furthermore, GPCR ligands include lipids, peptides, proteins, steroids, and other small organic molecules. Depending on their mechanism of action, GPCR ligands can be described as full or partial agonists, antagonists, or inverse agonists. This diversity among GPCR ligands poses challenges for standardized binding assays used in experimental bioactivity screening. Thus, wet-lab techniques such as high throughput screening have risen in popularity. However, high throughput screening can still be cost-prohibitive and labor-intensive. Another downside of high throughput screening is the limitation of available ligand libraries in terms of size and diversity. With drug discovery being an expensive and labor-intensive process, with estimated costs ranging between 0.5-2.6 billion US dollars, and a timeline ranging between 10-20 years, researchers have begun to favor computational methods for identifying candidate molecules for more elaborate bioactivity assays[paul2010improve, zhavoronkov2019deep].
When it comes to the identification of protein targets of small-molecule ligands and assessing potential off-target activity, chemical proteomics are considered the gold standard [parker2017ligand, rodrigues2020machine, bar2017chemical, moellering2012chemoproteomics]. However, similar to the challenges of binding assays for small molecule screening, chemical proteomics experiments are laborious and challenging to streamline [laraia2017natural]. Hence, many researchers are increasingly favoring computational alternatives [rodrigues2020machine] or augmenting the experimental discovery pipeline with artificial intelligence [duros2017human, hase2019next].
1.2 Computer-aided ligand discovery
Modern computational databases feature hundreds of millions of molecules and are available free of charge [sterling2015zinc, sunseri2016pharmit, bento2014chembl], which makes computer-aided ligand discovery a particularly compelling alternative to high throughput screening and other experimental approaches during the early stages of ligand discovery. The two major approaches for computer-aided ligand discovery, also known as virtual screening (VS), are ligand-based VS and structure-based VS.
Ligand-based VS approaches focus on the structure and physicochemical properties of ligands in the absence of the receptor structure. Broadly, ligand-based VS can be described as a set of techniques for ligand-based similarity search or property prediction. Ligand-based VS approaches for identifying bioactive molecules are often based on a known bioactive molecule given the hypothesis that similar molecules are likely to bind the target receptor and exhibit a certain level of bioactivity, which has to be determined experimentally.
In contrast to ligand-based VS, structure-based VS assumes knowledge of the receptor structure with molecular docking being one of the most prominent structure-based VS techniques. While molecular docking has led to many successful discoveries, one of its limitations for GPCR ligand discovery is the small number of high-resolution GPCR structures that are currently available. Recent years brought many significant improvements for protein X-ray crystallography and cryo-electron microscopy; however, high-resolution structures still only cover four out of the six different classes of GPCR: classes A, B, C, and F, of which the rhodopsin-like receptors from class A form the most significant portion [basith2018exploring]. Furthermore, another limitation for structure-based approaches for bioactive ligand discovery is the breadth and diversity of GPCR structures that are currently available: only 44 of the 205 GPCR structures correspond to unique GPCRs [hauser2017trends].
All GPCRs consist of seven transmembrane helices, which are relatively conserved across the different classes of GPCRs (A-F). However, GPCRs may differ more noticeably in the intracellular and extracellular loop domains. The extracellular loops play an essential role in many GPCR structures, since they often form the orthosteric ligand binding site or provide access to binding sites that are located within the transmembrane bundle [wheatley2012lifting]. The diversity of GPCR ligands and GPCR ligand binding sites are a direct consequence of the structural variety of the extracellular loops, which makes employing structure-based approaches extremely challenging in the absence of high-quality structural information. In those cases, ligand-based strategies may represent the only viable alternative [raschka2018enabling]. In recent years, both traditional structure-based and ligand-based VS have been augmented or replaced by machine learning techniques. Furthermore, machine learning has also been used for predicting bioactivity from other types of data such as a bioactivity matrix (a targets-by-compounds matrix of functional interactions) [zhang2019biomatrix]. Today, machine learning is widely recognized as an essential method in the chemical biology toolbox for researching ligand binding [rodrigues2020machine].
1.3 Augmenting ligand discovery with artificial intelligence and machine learning
Artificial intelligence (AI) is traditionally considered a subfield of computer science that focuses on tasks that humans are naturally good at, such as image recognition and natural language processing. Furthermore, can be categorized into artificial general intelligence, i.e., human-level intelligence on a breadth of different asks, and "narrow" AI, which focuses on solving a single task well. Examples of narrow AI include tasks such as classifying objects from images, language translation, playing chess, or predicting the bioactivity of small molecules. Currently, the most popular approach for implementing and programming narrow AI is machine learning. Machine learning is a field that focuses on the development of algorithms that allow computers to learn from representative datasets, as opposed to having domain experts developing rules manually.
The three major subcategories of machine learning are supervised learning, unsupervised learning, and reinforcement learning. In supervised learning, the goal is to predict a category label (classification) or score (regression analysis) by learning from a large collection of such labeled examples. In other words, supervised learning is concerned with learning a mapping function between the so-called input features or observations (for example, fingerprint representations of small molecules) and a discrete or continuous target variable, for example, an active/inactive label or the binding affinity in the context of a specific receptor. After learning to predict the desired scores or labels from the labeled training dataset, an independent test dataset, consisting of labeled examples that were unseen during training, is used to evaluate the performance of the predictive model. Unsupervised learning is used for tasks that do not involve target labels. A typical example of unsupervised learning is clustering, for instance, grouping small molecules based on a user-specified similarity measure. Reinforcement learning, a third major category of machine learning, focuses on decision making in complex environments, which requires learning an optimal series of actions in response to environmental information as opposed to a target output as in supervised learning. A classic example is the development of a chess program.
In the presence of labeled high-quality data, machine learning provides opportunities for automating the development of customized solutions that are optimized for a specific type of receptor, binding site, or small molecule – as opposed to more general scoring functions such as DrugScore [neudert2011dsx] or AutoDock Vina [trott2010autodock]. Machine learning has already shown to be successful in various stages throughout typical ligand and drug discovery, including the discovery of novel drug targets and ligands [jeon2014systematic, riniker2014using], bioactivity prediction [rifaioglu2018recent], predicting new binding sites in GPCRs [chan2019new, wu2018coach], analyzing the association between the drug targets and diseases [ferrero2017silico], studying binding pathways (for example, the binding pathways of opiates to -opioid receptors [farimani2018binding]), optimizing lead compounds [ballester2010machine], molecular de novo design [olivecrona2017molecular], modifying molecular properties [kadurin2017drugan], and developing biomarkers to assess the efficacy of drugs [mamoshina2018machine].
1.4 Deep learning–representation learning with deep neural networks
In the last decade, deep learning research has seen a substantial increase in attention, since it has allowed researchers to develop state-of-the-art solutions for computer vision, natural language processing, and computational biology[lecun2015deep]. Deep learning is a subfield of machine learning that focuses on artificial neural networks with many layers that can learn multiple levels of abstractions of data that are useful for supervised and unsupervised learning tasks. Modern deep learning architectures consist of hundreds of millions of parameters [canziani2017analysis]
, the so-called model weights, which are learned from a training dataset using the backpropagation algorithm[rumelhart1986learning]. An overview of the most commonly used deep learning architectures is provided in Section 1.7.
The predictive performance of traditional machine learning methods – this includes logistic regression, random forests, support vector machines (SVMs), k-nearest neighbors, and many others – heavily depends on the design of the feature engineering pipeline. For instance, the aforementioned conventional machine learning methods generally do not operate well on high dimensional datasets and are unable to extract knowledge from raw data (such as text or images)[lecun2015deep]. Hence, the training data has to be converted into a tabular format via manual feature engineering – for example, by representing a document as a vector of word frequencies [raschka2014naive], or by representing flowers by tabulated measurements of their leaf sizes rather than photographs [fisher1936use]. Careful feature engineering requires substantial domain expertise, and useful information contained in the data can be lost by feature engineering. Deep learning, on the other hand, relies on general-purpose algorithms that include the automatic extraction of salient information from raw data as part of the modeling architecture and optimization objective. In this respect, deep learning can be characterized as a feature or representation learning method. However, one downside of deep learning is that it is relatively data-hungry, and datasets for supervised learning require large amounts of labeled examples – dataset sizes ranging between 50 thousand and 15 million training examples are not unusual [cao2019consistent, deng2009imagenet].
1.5 Open source software and datasets
Under similar open source licensing terms, the Scikit-learn machine learning library [pedregosa2011scikit]
has been widely adopted for predictive modeling with traditional machine learning (for example, generalized linear models, SVMs, and tree-based methods such as random forests and gradient boosting). In recent years, computationally efficient and GPU-enabled libraries such as TensorFlow[abadi2016tensorflow]
and PyTorch[paszke2019pytorch] made deep learning accessible to the broad scientific communities similar to how Scikit-learn contributed to popularizing machine learning.
While the use of machine learning and deep learning has seen widespread adoption throughout various research areas, a bottleneck for applications to computational biology was the lack of large, annotated, high-quality datasets. However, recent years brought both improvements of experimental techniques and the "data and code sharing" culture in academia, which led to the increase of publicly available molecular activity and biomedical datasets. For instance, in drug discovery, recent efforts focused on developing an open, annotated datasets that can be utilized for therapeutic target validation [koscielny2016open]. MoleculeNet is a large benchmark dataset consisting of more than 700,000 molecules and their property annotation [wu2018moleculenet]. The ChEMBL large-scale bioactivity database contains more than 5.4 million bioactivity measurements for 5,200 protein targets and more than 1 million ligands [gaulton2016chembl]
. While the aforementioned databases are general-purpose datasets that are usually used for general algorithm development and benchmarking, these can be used as a starting point for pre-training deep learning models that can then be fine-tuned to smaller datasets containing a specific target using transfer learning, which is discussed in greater detail in Section6.
1.6 Interpretability and repeatability of machine learning
Machine learning models can learn to identify salient patterns in high-dimensional and complex datasets that are not obvious to a human researcher. However, machine learning and particularly deep learning models are often criticized for their black box aspect and lack of interpretability [vamathevan2019applications]
. We agree that machine-learning-based methods, except for the most fundamental decision tree models and rule-based classifiers, are less interpretable than pure "if/then" rules. However, many methods exist that allow scientists to obtain insights into which features a model learns for making particular predictions. For example, in[raschka2018enabling]
, the researchers were able to identify a potent inhibitor of GPCR pheromone receptor signaling using a machine learning-aided virtual screening approach that provided structure-activity information. Furthermore, by combining machine learning with feature selection algorithms, the researchers were able to identify sulfate groups on the tail end of the candidate ligands that are crucial for bioactivity in SLOR-1 GPCR signaling[raschka2018automated]. While the methods described in [raschka2018automated] were applied to summarize structure-activity relationships of 3D-structural overlays of small molecules that were prioritized for experimental bioassays, the same techniques can be used for different types of data, such as hydrogen bonding patterns [raschka2018protein] or ligand-receptor docking poses [raschka2016detecting].
Other commonly used methods for understanding model predictions are general-purpose methods such as LIME [ribeiro2016should] or SHAP [lundberg2017unified]. Neural network-specific methods include guided backpropagation [springenberg2014striving], class activation mapping [zhou2016learning], gradient-weighted class activation mapping [selvaraju2017grad], and learning important features through propagating activation differences (DeepLIFT) [shrikumar2017learning].
Another point of criticism against the use of deep neural networks is the lack of repeatability [vamathevan2019applications]. In this context, repeatability refers to the ability to produce the exact same results upon repeating an experiment under identical conditions; in contrast, reproducibility refers to the ability to obtain similar results in different environments or conditions, for example, if a different research lab conducts similar experiments. According to [vamathevan2019applications], the issue of repeatability "arises because [machine learning] outputs are highly dependent on the initial values or weights of the network parameters or even the order in which training examples are presented to the network, as all of them are typically chosen at random." We want to highlight that these repeatability issues can easily be circumvented by specifying the seeds of pseudo-random number generators and setting the behavior of machine learning and deep learning software libraries to deterministic. In practice, however, issues with reproducibility and repeatability arise when researchers share insufficient details about the software versions that were used to conduct the experiments. Reproducibility and repeatability require best code sharing practices, which includes not only sharing the exact code but also the experimental settings and software version numbers. Fortunately, sharing data and code has become a recommended practice in the field of machine learning [neurips2018callforpapers], and we hope that computational biology journals will adopt similar best practices such that repeatability and reproducibility issues that plagued the field in previous years can hopefully be avoided in the future.
1.7 Commonly used deep learning architectures
This section summarizes the main concepts behind machine learning with a focus on the fundamental deep learning concepts that are relevant for the remainder of this article. For an introduction and overview of general machine learning methods, we recommend [burkov2019hundred, raschka2019python]. For a more comprehensive introduction to deep learning, we recommend [lecun2015deep, goodfellow2016deep, raschka2019python].
1.7.1 Multilayer perceptrons
Multilayer perceptrons (MLPs) are fully connected artificial neural networks (NNs) that consist of an input layer, an output layer, and at least one hidden layer between the two (Figure 1 A). While there is no precise definition of what constitutes a deep neural network (DNN), an NN that has more than one hidden layer is commonly referred to as a DNN. The hidden units in the hidden layer manipulate the input information (observations or features) in a non-linear fashion so that, in the case of supervised classification, the training examples from different classes become linearly separable by the last layer. In DNNs, the early layers can be considered representation learning layers that distort the data in such a way that it can be classified by an output layer that has a similar structure to a generalized linear model. In other words, NN architectures with one or more hidden layers can learn highly complex relationships between the input features and the target label.
An MLP with only one (sufficiently large) hidden layer can already be considered a universal function approximator [csaji2001approximation, cybenko1989approximations, hornik1989multilayer]
. However, a DNN with many hidden layers can achieve the same expressiveness with fewer parameters than an NN with only one hidden layer. Furthermore, constructing an architecture with multiple layers provides some form of regularization: later layers are constrained by the behavior of earlier layers. Unfortunately, a common problem with deep architectures is that increasing the number of layers exacerbates the vanishing and exploding gradient problems that arise from repeated multiplications via the chain rule in backpropagation, making it hard to parameterize very deep models. A particular focus in the deep learning field has been on developing methods that help with training very deep architectures successfully, as discussed in Section1.7.2 and Section 1.7.3.
While the early MLP architecture was first proposed in the early 1960s [schmidhuber2015deep, ivakhnenko1966cybernetic, ivakhnenko1967cybernetics], efficient ways for training such multilayer neural networks using the backpropagation procedure were formulated by several researchers independently between 1970 and 1986 [linnainmaa1970representation, werbos1974beyond, rumelhart1986learning]. Although the backpropagation procedure is still the main learning algorithm in deep learning, the training of DNNs has only become broadly feasible in recent years due to several advances towards making the general training more efficient and effective. These improvements include implementing machine learning models and training DNNs on GPUs [steinkraus2005using, chellapilla2006high, raina2009large, cirecsan2012multi], which are extremely well-suited for performing linear algebra operations such as matrix multiplications efficiently and on a large scale.
In addition to developing software libraries that allow researchers to develop and implement DNNs more easily (see Section 1.5
), countless algorithmic advances towards training DNNs more efficiently and robustly have been made in recent years. For example, new activation functions have been proposed, such as the rectified linear unit (ReLU)[nair2010rectified]
, which has become a popular default choice since it can help with vanishing gradient effects and generally allows for training DNNs faster and with better predictive performance. New stochastic gradient descent-based optimization algorithms such as Adam[kingma2014adam] can accelerate the model training and make it easier to find a proper learning rate setting compared to conventional stochastic gradient descent. Regularization methods such as Dropout [srivastava2014dropout]
help reduce the effect of overfitting by dropping nodes randomly during training and thus make the model less reliant on particular inputs and connectivity patterns. Batch normalization is a method for scaling layer inputs[ioffe2015batch], which can speed up learning further by allowing learning with larger batch sizes and converging to local minima on the loss surface in fewer iterations over the training set. While not all of these improvements exist in all modern architectures, they have been fundamental to allowing experts as well as non-experts to train DNNs successfully on a wide range of different datasets.
In a quantitative structure-activity relationship (QSAR) study, researchers found that in 13 out of 15 cases, very simple DNNs were already able to outperform traditional machine learning methods such as random forest and SVMs for quantitative-activity relationship analysis even when trained on traditional fingerprint representations [ma2015deep]. This is a remarkable result because, in this case, the model the researchers referred to as DNN was a simple MLP with only one hidden layer. Additional details about the different molecular data input representations will be provided in Section 2.
1.7.2 Convolutional neural networks
Convolutional neural networks (CNNs) are feedforward neural networks that utilize the discrete convolution operation as a filtering operation on multi-dimensional arrays (Figure mlp-cnn). Depending on the architectural design and discrete convolution operation, CNNs can handle various data modalities, including 1D (sequences and signals in vector form), 2D (pictures and audio spectrograms), and 3D arrays (such as videos or images with depth dimension like 3D computerized tomography scans).
Today, CNNs are the most widely used deep learning architectures. They are widely recognized for their state-of-the-art performance for image-related tasks such as object classification, detection, and segmentation. The first successful demonstration of CNNs for image classification was the design of the LeNet architecture for handwritten digit and document recognition in the 1990s [lecun1990handwritten, lecun1998gradient]. While CNNs have shown exceptional performance in various image recognition tasks, including the segmentation of biological images [ning2005toward, lecun2015deep]
, the rise of CNNs to mainstream success can be attributed to 2012 ImageNet competition, where CNNs were able to outperform traditional computer vision methods by doubling the object classification accuracy on a database consisting of millions of images[krizhevsky2012imagenet].
Similar to MLP architectures, CNNs are feedforward neural networks. However, in contrast to MLPs, CNNs do not use fully connected layers. The main concepts behind CNNs that distinguish this architecture from MLPs are sparse connectivity and parameter sharing. Here, sparse connectivity refers to the fact that each unit in the hidden layer is only connected to a small number of units in the previous layer. The connection occurs through a so-called kernel or filter, which is a parameter matrix that is commonly interpreted as a feature extractor. Via the filter, information between neighboring pixels is combined to compute the activation of a unit in the next hidden layer. This allows CNNs to compose hierarchies of local features to recognize complex shapes in later layers. For instance, in an image classification task, a multi-layer CNN may learn to detect simple geometric shapes like edges and curves in the earlier layers, and in later layers, it can identify more sophisticated features such as cars or buildings based on the underlying geometric shapes. These feature detectors typically only cover a small portion of the input image or feature map, and in the process of the convolution operation, they can be thought of as a sliding window over the image. Hence, the weights for the different patches of the input image or feature maps are shared within a given layer, which is inspired by the fact that a feature detector that works well in one region may also work well in another part. The advantage of weight sharing is that it reduces the number of parameters that need to be learned, which improves computational efficiency and can reduce overfitting, compared to fully connected networks.
In traditional CNN architectures, feature maps were followed by pooling layers [lecun1990handwritten, lecun1998gradient]. These pooling layers combine similar features from neighboring regions in the previous feature map into a single feature, which is supposed to help neural networks to learn more complex relationships or geometrical shapes. However, it has been shown that pooling layers are not a requirement in modern CNN architectures [springenberg2014striving].
Deep learning is a fast-moving field, and even revolutionary architectures such as AlexNet, which outperformed state-of-the-art computer vision methods at the ImageNet competition in 2012 [krizhevsky2012imagenet], have long been succeeded by other reference architectures. For example, the VGG architecture has achieved notable performance by shrinking the receptive field and stacking many more layers than AlexNet [simonyan2014very]
. The ResNet architecture introduced residual connections, which are shortcut connections between layers that help with vanishing gradient problems and thus improve the training of very deep neural networks[he2016deep]. Inception networks improved the extraction aspect of both local and global features by using multiple filter modules in parallel [szegedy2016rethinking]. While the design of CNN architectures still remains an active area of research, recent efforts have been targeted towards improving efficiency, for example, MobileNet [howard2017mobilenets] and EfficientNet [tan2019efficientnet]. In a benchmark study, Canziani et al. compare both the predictive and computational performance of these models for further reference [canziani2017analysis].
1.7.3 Recurrent neural networks
Recurrent neural networks (RNNs) were originally designed for tasks that involve sequence data, for example, textual data and audio signals such as speech. While traditional RNNs consist of fully connected layers similar to MLPs, RNNs have recurrent connections through time. This recurrence property allows RNNs to maintain an internal history of sequence elements that it processed previously (Figure 2).
While RNNs can model arbitrary long sequences, the recurrent edges make this architecture particularly prone to vanishing and exploding gradient problems [pascanu2013difficulty]. One solution to this problem is the so-called long short-term memory mechanism (LSTM) by Hochreiter and Schmidhuber [hochreiter1997long], which is a memory cell that replaces the hidden layer of conventional RNNs. A mechanism similar to LSTM, although slightly more computationally efficient, is the Gated Recurrent Unit (GRU) [jozefowicz2015empirical].
Since its first proposal in 2017, the Transformer architecture [vaswani2017attention] has gained a lot of attention as it has been shown to outperform RNNs on many sequence modeling tasks [merity2019single]. The Transformer architecture relies on a so-called attention mechanism, which allows for modeling global dependencies between input and output sequences. In other words, the attention mechanism allows the model to focus on only those parts of the text that are relevant for the prediction task. While the Transformer architecture is still relatively new, it has already been adopted for several applications in computational biology, including retrosynthetic reaction prediction [karpov2019transformer], QSAR models [karpov2019transformer2], and learning molecular fingerprint representations [honda2019smiles].
Autoencoders refer to architectures that were traditionally used for dimensionality reduction and trained in an unsupervised manner [hinton2006reducing]. The main idea behind autoencoders is that they learn to preserve the essential part of the data and remove the non-essential ones (Figure 3).
An autoencoder consists of two neural networks: an encoder that transforms a training example into a lower-dimensional representation, and a decoder that reconstructs the original training example from its lower-dimensional representation. For example, if both the encoder and decoder submodules of the autoencoder are fully connected NNs with linear activation functions, the latent feature embedding produced by the encoder is equivalent to features transformed via principal component analysis except for the feature orthogonality constraint.
It shall be noted that regular autoencoders are deterministic models that can be used for data compression or to modify existing data given a user-specified objective [mirjalili2018semi, mirjalili2019flowsan, mirjalili2018ensemble, mirjalili2020privacynet]. A related model is the variational autoencoder, which can be considered a generative model as it is capable of synthesizing entirely new, realistic data records mimicking the data contained in the training dataset [kingma2013auto, doersch2016tutorial].
The remainder of this paper is organized as follows. First, we discuss one of the common challenges with applying machine learning to bioactive ligand discovery, namely choosing an appropriate feature representation for the ligand structure, and highlight the recent developments using deep learning. Next, we discuss the latest trends for ligand-based analysis, from molecular property prediction to similarity-based virtual screening. While the ligand-based approaches assume that a high-quality receptor structure is not available, the next section reviews the recent developments for receptor structure-based bioactive ligand discovery. This includes the detection of new GPCR binding sites, protein-ligand docking, and de novo small-molecule design. Finally, this review concludes by motivating the use of transfer learning, which allow researchers to make better use of publicly available data in machine learning and AI-based bioactive ligand discovery.
2 Molecular feature representations
Machine learning methods excel at prediction tasks across multiple disciplines but require careful data preparation as most methods are designed to operate on tabular datasets. The standard data input format is the so-called design matrix, where each row represents a new training example, and the columns correspond to the different feature variables. A common challenge in conventional machine learning is how to prepare datasets as input to machine learning algorithms – in practice, machine learning practitioners have to find a sweet spot between reducing the dimensionality and retaining salient information that the model can learn from. In contrast to conventional machine learning, deep learning excels at learning from raw data, such as images and text, directly. However, molecular data, such as conformations of small molecules and receptors, can be challenging to represent in a standard format that most machine learning and deep learning methods have been designed for. Even if the same information can be extracted from two different data representations, an algorithm may be more effective at extracting that information from one over the other. There is no clear best representation of molecules for machine learning methods and indeed certain representations may be better for certain tasks. The following section provides a brief overview of commonly used molecular representations as well as some recent applications of them using AI-based methods.
2.1 Property-based feature vectors
A molecular descriptor is the transformation of chemical information into a numeric value [todeschini2009molecular]. Dragon [mauri2006dragon] and Mordred descriptors [moriwaki2018mordred] are examples of sets of molecular descriptors. As an alternative to molecular descriptors, molecular fingerprints encode molecular structure in a vector format, a so-called bit vector consisting of 1’s and 0’s. When used as input for machine learning models, both molecular fingerprints and descriptors have historically produced state-of-the-art results on chemical machine learning tasks such as chemical odor prediction and bioactivity [keller2017predicting, ballester2010machine].
The extended connectivity fingerprint (ECFP) is among the most widely-used 2D fingerprint methods [rogers2010extended], and we use its generation procedure as an example of the general process for generating traditional molecular fingerprints. A fingerprint is generated by a multistep process in which each atom is associated with a series of integers. In this series the th integer encodes information about the atom it is associated with as well as information about the atoms and bonds within bonds of that atom – that is, the substructure of the compound that is within bonds of the atom. Next, the integers associated with each atom are concatenated into an array format, which is then processed via a hashing algorithm to generate a bit vector of a desired length (typically 1024 or 2048 elements). This method captures information about all identified substructures in a compound, resulting in a fixed-length vector regardless of the input compound’s size. ECFPs do not explicitly encode the 3D spatial information of a compound; however, specialized fingerprint methods have recently been developed that incorporate 3D-strucutral information [axen2017simple]. Lastly, there are also fingerprints that can encode protein-ligand interactions [da2014structural].
Simplified molecular-input line-entry system (SMILES) strings are ASCII string representations of compounds (Figure 4 A), which are generated according to a procedure that guarantees a unique mapping from a SMILES string to a compound structure (though not the inverse) [Weininger1988smiles]. One benefit of SMILES strings over 2D molecular fingerprints like ECFP is that they encode stereochemistry explicitly. One downside for machine learning is that SMILES do not have a fixed-length; however, certain deep learning architectures designed for processing text documents, like RNNs or 1D CNNs, can handle variable-length inputs.
Recently, Hirohara et al. [hirohara2018convolutional] proposed a novel molecular representation scheme by converting SMILES strings into "SMILES feature matrices," which were used as inputs into a 1D CNN [hirohara2018convolutional]. A SMILES feature matrix was constructed by mapping a SMILES string of length to a matrix, where the th row represents the
th character (corresponding to either atom or connectivity information) of the string, and the 42 columns correspond to properties of that character. To address the problem of varying-length input strings, the feature matrix was padded with rows of zeros such that all feature matrices had their number of rows set to the length of the longest SMILE string. The predictive performance of the resulting CNN model was comparable with other deep learning methods on the Tox 21 dataset, which contains 8,000 compounds labeled as active or inactive for 12 proteins[huang2016tox21challenge]. More interestingly though, this novel feature representation method enabled the extraction of a 64-dimensional vector from a convolutional layer referred to as the SMILES convolution fingerprint, which can be mapped back to the model input SMILES, providing an interpretable data-driven fingerprint.
Another notable recent development is the SMILES2vec method, which combines components from both recurrent and convolutional neural networks [Goh2017smiles2vec]
. Here, the model input is a SMILES string converted into a one-hot encoding of characters present in the SMILES in the training dataset. The one-hot encoding then serves as an input to a 1D convolutional layer, which is followed by two GRUs before the fully-connected output layer that returns the target value. This architecture achieved equivalent performance to the state-of-the-art at the time on the ESOL solubility dataset[delaney2004esol].
2.3 3D voxels
Molecular structures can be considered as 3D objects, and modeling them as such would encode their structural properties effectively. The conventional way for encoding 3D objects is by voxelization, which, in this case, takes a 3D space and discretizes it into a 3D grid. Each unit cube of the 3D grid represents a voxel (Figure 4 C), which can be considered as the equivalent of a 2D pixel in a 3D space. Instead of merely labeling each voxel via a binary membership indicator based on what part of the compound occupies the voxel, a feature vector with discrete or continuous-valued attributes can be assigned to each voxel, which can provide further information about the atom type or charge, for example. This adds an additional dimension to the representation. Since this representation is typically associated with a large number of input features, depending on the resolution of the voxel space as well as the size of the vector representation at each voxel position, deep learning approaches used with this type of input representation typically rely on 3D convolutions. This is equivalent to using 2D convolutions, which are commonly used for image analysis (Section 1.7.2), in the 3D voxel space. Unfortunately, these convolutions can still be prohibitively slow, even at relatively low voxel resolutions, which can result in a coarse representation of the compounds properties. Since each cube represents a unit of measurement, and compounds can vary in size, this representation needs to be padded so that it can accommodate for the largest compound in the dataset, which exacerbates the computational efficiency challenges of this method. Likely owed to being computationally very intensive and inefficient, this method is not a common choice for ligand-based virtual screening. However, this representation has shown to be successful for structure-based VS where binding pocket size is consistent regardless of the interacting ligand, and capturing the 3D structure of the protein-ligand complex is important for the task.
We examine the 3D voxelization process from Ragoza et al. as a concrete example [ragoza2017protein]. In this publication the authors create 3D voxel representations of high-scoring docking poses of protein-ligand complexes for use in a 3D CNN that computes binding affinities. This representation voxelizes a 24 Å cube centered on the docked ligand into .5 Å voxels. Each voxel has channels for Smina atom types in the ligand and separately for Smina atom types in the protein [koes2013lessons]. The value contributed to a channel by an atom of the type that is associated with that channel is based on a function of two values: First, the distance between the center of the voxel and the center of that atom, and second, the atom’s van der Waals radius." This function is a continuous piece-wise combination of a Gaussian and a quadratic based on these two values.
2.4 Graph representation
Molecules are commonly visualized as undirected graphs. In this representation, atoms represent the graph’s nodes, and the bonds are the graph’s edges. A naive approach that utilizes such graphical information is to consider pictures of molecular graphs as inputs to a DNN. This has been tried in the literature using 2D CNN architectures that have been effective for image classification [goh2017chemception, Meyer2019learning]. However, these models do not outperform MLP and random forest trained on ECFPs. Two issues with this approach are that atomic properties and spatial relationships must be inferred implicitly and that the representation is sparse, with lots of white space that provides little chemically relevant information to the CNNs. Both these issues can be addressed with graph neural networks (GNNs) [duvenaud2015convolutional, gilmer2017neural].
NNs operate directly on a molecular graph (Figure 4 D). Similar to images, graphs can encode local structure, but the local structural information is based on the graphs structure rather than Euclidean distance. Like CNNs, GNNs utilize sparse connectivity and parameter sharing, but the connectivity is based on the graph’s structure. For a detailed explanation of graph convolutional operators, we refer to [gilmer2017neural]. On a high-level, graph convolutions are based on message passing frameworks for GNNs. At each node of the graph, the following steps are performed: a message function is applied to each of a nodes’ neighbors individually, and the outputs are then added. An update function is then applied to the summed message functions and the current node with the output being the updated value for the current node. After graph convolutions are performed, the information in the graph can be aggregated by a readout function, which produces the networks’ prediction output.
3 Ligand-based methods
Ligand-based VS methods are traditionally defined as methods that only rely on physicochemical and structural information about the ligand and sometimes measurements of ligand-receptor interactions. No other information about the target, like the receptor structure, is utilized. The use of ligand-based methods is particularly appealing when working with GPCRs for which only a small set of high-quality structures exist.
In the following subsection, we review the most notable advances in predicting molecular properties, which are applicable to GPCR bioactive molecule VS.
3.1 Molecular property prediction
Researchers applied a variety of standard machine learning algorithms to the task of classifying compounds as active or inactive with two GPCRs, cannaboid receptor 1 (CB1) and cannaboid receptor 2 (CB2) [bian2019prediction]. In addition to the activity prediction, for CB1, the researchers also predicted if compounds were allosteric or orthosteric. Here, molecules with a an inhibitory constant ( ) greater than 100 for the given receptor were defined as orthosteric positives and allosteric positives were selected from an existing database of allosteric compounds [shen2015asd]. The researchers trained a set of machine learning classifiers on each task and compared their performance. This set included SVM, MLPs with 1-5 hidden layers, random forest, naïve Bayes and a logistic regression model. In this study, the researchers considered ECFP fingerprints, MACCS fingerprints, as well as molecular descriptor input feature representations. While multiple models had comparable performances, an MLP with only one hidden layer performed best overall on a variety of performance metrics with the MACCS featurization while logistic regression performed best in combination with the ECFP. There was no consistent top performer with the molecular descriptors. Notably, deeper MLP models were also assessed and did not have strong performance, which is likely due to vanishing gradient problems as discussed in Section 1.3 and the limited dataset size (on the order of 1000s).
The above result suggests different compound representations capture chemical information in manners that are utilized more or less effectively by different machine learning algorithms. If a model took multiple representations as input the ensemble may have better performance. The model CheMixNet followed this approach and used both a SMILES and fingerprint representation of a molecule as input [paul2018chemixnet]
. The authors considered four subnetworks three of which received a one-hot SMILES string as input and one that was trained on an MACCS fingerprint representation. The three SMILES subnetworks respectively consisted of GRU cells, 1D convolutional layers, and a combination of 1D convolutional layers and GRU cells; the fingerprint subnetwork was an MLP. The authors trained models with combinations of these subnetworks, where the outputs of each subnetwork were concatenated and passed through additional fully connected layers before a linear or softmax layer to predict the property. The model was trained and tested multiple times to predict a variety of properties. In all cases certain subnetwork permutations outperformed or matched other state-of-the-art methods at the time. Interestingly, the best subnetwork ensemble for each dataset was not consistent, suggesting that different representations may be better at predicting different chemical properties.
WDL-RF is a novel architecture for predicting molecular properties that was trained and tested on a suite of GPCR ligand datasets [wu2018wdl]. The first part of WDL-RF produces learned molecular fingerprints using a neural network. The NN is a (GNN) that shares weights in the network based on the number of bonds attached to each atom node. Each atoms initial feature vector included a one-hot encoding of its element, degree, number of attached hydrogen atoms, implicit valence and aromaticity indicator. This network was trained on GPCR datasets to predict bioactivity. After it was trained, an internal embedding in the network was extracted for each ligand to use as a data-driven fingerprint. GNN-based fingerprints were introduced in other work [duvenaud2015convolutional]. The advantage of such fingerprints is that a network must capture molecular properties in order to predict bioactivity, so the embedding of a ligand in the network will capture its chemical information. The neural fingerprints were then used to train a random forest to predict bioactivity. This model outperformed other random forest models that were trained on different molecular fingerprints on a large majority of the GPCR datasets.
GNNs have also found success in predicting other molecular properties. For instance, they have achieved state-of-the-art performance in odor descriptor prediction [sanchez2019machine]. Odor perception involves 300-4000 different types of olfactory receptors, which are rhodopsin like (class A) GPCRs [su2009olfactory]. The odor prediction GNN used atom-node vectors based on atom type, charge, etc. Similar to WDL-RF, an embedding within the GNN was found to outperform traditional fingerprints and molecular descriptors when used as input to a random forest model.
One issue with the data-driven fingerprints discussed thus far is that they are generated with labeled training data, which restricts the amount of data that can be used to construct a robust fingerprint. The Seq2Seq model attempted to address this issue by learning a fingerprint from SMILES in an unsupervised manner [xu2017seq2seq]. This model was an autoencoder that used an GRU-based RNN to encode SMILES into a fixed-length embedding vector and then recover the original SMILES from the embedding. After the model was trained, the embedding vectors for ligands in a labeled dataset were generated to use as data-driven fingerprints in other machine learning models. Experiments showed that random forest, gradient boosting, and SVMs that were trained on the Seq2Seq fingerprints outperformed the same models trained on ECFP and the original neural fingerprint produced by Duvenaud et al. [duvenaud2015convolutional]. One additional benefit of fingerprints generated by Seq2Seq is that they are invertible, so it is easy to recover the original SMILES string if given the fingerprint.
Shortly after the publication of the Seq2Seq, researchers developed a variational autoencoder (VAE) for encoding SMILES into a continuous representation in the latent space [gomez2018automatic]. this method also produces invertible fingerprints and a regression model can be trained on these embeddings to predict various molecular properties. The use of a VAE is the game changer here as molecules with similar properties are encouraged to be close together in latent space. As a result slightly perturbing the embedding of a ligand and inverting it will tend to give a different but similar ligand. This property is of interest in de novo synthesis which is discussed in Section 5.
3.2 Similarity-based virtual screening
Similarity-based screening methods are based on the hypothesis that active compounds share similar properties [johnson1990concepts]. However, there is no trivial definition of compound similarity as computational methods work with approximate representations. For this reason, similarity-based methods heavily depend on and are defined by the molecular representations they operate on. Most machine learning-based methods for ligand-based VS are clustering algorithms, and recent advances in this research are focused on developing new data representation methods and similarity measures.
Tanimoto similarity is commonly used when comparing molecular fingerprint vectors as it compares favorably to other similarity metrics on molecular fingerprints, which has been confirmed in analyses based on the sum of ranking differences and ANOVA analysis [bajusz2015tani]. Fingerprint-similarity-based VS is not a new methodology [willett2006similarity]. While there are more complex 3D approaches, these methods are still used to query large databases due to their computational efficiency. However, Raschka et al. recently showed that 3D overlay-based VS can be made computationally feasible on large databases if hypotheses based on prior information about the ligand-receptor interactions can be used as filtering criteria [raschka2018enabling]. Using this hypothesis-driven approach, combined with machine learning-based molecular feature importance assessments, the researchers discovered a potent GPCR signaling inhibitor by screening more than ten million commercially available molecules [raschka2018enabling, raschka2018automated].
A recent example of a Tanimoto-based fingerprint similarity method is MuSSeL [alberga2018new]. MuSSeL generates an ensemble of different fingerprints for a database of compounds which were labeled with a receptor as their target class and their binding affinity for the corresponding receptor. Given a query compound’s fingerprint ensemble, a molecule is considered as a candidate for a given target receptor class only if it meets a minimum Tanimoto similarity cutoff with at least fingerprints in the set of known actives. If this requirement is met, the compound must also pass the following criterion for fingerprints – where is a number that has to be specified by the user, similar to choosing . First, the compound must have at least nearest neighbors above a new user-specified Tanimoto cutoff, and second, the maximum difference between two of the neighbors’ binding affinities must be below a user-specified threshold. The maximum difference threshold is included to help avoid activity cliffs [stumpfe2012exploring]. For each fingerprint that meets these conditions, the neighbors’ binding affinities are averaged. Finally, for each query compound, the -neighbor averages are averaged for all the fingerprints in the ensemble that passed the second filter to predict the binding affinity of the query compound. The algorithm had good performance on a calibration set and correctly labeled compounds in a number of case studies with GPCR targets such as adenosine receptor and dopamine receptor D4.
Common fingerprints like ECFP and MACCS do not explicitly capture 3D information about the compounds. This is problematic in situations where a molecule’s 3D structure is valuable for assessing a property, for instance the steric fit of a predicted pose docked into the receptor’s binding site, so methods for comparing 3D representations of compounds are of practical use. One approach for comparing 3D representations of molecules could be to find an overlay with maximum shape overlap, but this does not utilize electrostatic information. Thus, similarity scores in this space are generally based on a combination of structural overlap and the overlap of groups with like properties. Methods like ROCS [hawkins2007comparison] and WEGA [yan2013enhancing] represent molecules as overlapping Gaussian spheres with each atom sphere containing information about its location, atom type, and charge. The maximum scoring overlay is then based on both the structural and chemical overlap. Esim is a recent method that computes overlay similarity in a different way [cleves2019electrostatic]. The model uses a set of "observer points" in 3D space that are a fixed distance from one target molecule. At each observer point spatial, angular and electrostatic values are computed that capture properties of the target and query molecule from that observer points perspective. The closer the values for each molecule are when summed over all observer points, the higher the similarity score is. This was found to outperform other 3D similarity methods on DUD-E, a common benchmark set that also includes GPCR targets [mysinger2012directory].
Methods that rely on 2D fingerprints or 3D overlays for ligand-based VS utilize different information about the molecular structures – each method has individual strengths and weaknesses [hu2012performance]. The HybridSim-VS method combines both 2D fingerprint and 3D-structural information [Shang2017hybridsim], and the proposed similarity metric is a combination of the Tanimoto similarity of the 3D shape overlay of a pair of molecules and of its 2D fingerprint. This hybrid metric outperformed MACCS, a 2D fingerprint, and WEGA, a 3D representation, when compared on the tested on the DUD-E benchmark dataset [mysinger2012directory].
Section 3.1 summarized different neural network embeddings that can be used as inputs to machine learning models to predict various molecular properties. The same embeddings can be used for clustering-based similarity search. One example of this in practice is a clustering-based SPiDER search [reker2014identifying]
. SPiDER uses self-organizing maps, a special type of artificial neural network approach for unsupervised learning, to create low-dimensional embeddings of the molecule space based on the CATS topological pharmacophores models and physiochemical small molecule descriptors[reutlinger2013chemically]. Recently SPiDER was used in tandem with a binding affinity prediction via DEcRyPT to discover celastrol, a cannabinoid receptor-1 and 2 agonist [rodrigues2018machine, rodrigues2019dissecting]. DEcRyPT is based on random forest models for regression analysis, which have been trained to predict binding affinities from topological pharmacophore features. In this study, the researchers first used SPiDER to select potential cannabinoid receptor agonists via clustering. The selected molecules were then analyzed using DEcRyPT, which was trained on experimental binding affinity values for selected targets, as a predictive model that was applied to the most confident predictions from SPiDER to obtain binding affinity scores (-log10 values). The agonist celastrol was identified by this approach, and the ligand was experimentally validated by biochemical assays including radiological displacement assays, followed by dynamic light scattering measurements to rule out false positive readouts. The authors emphasized that common similarity-based search using conventional ECFP4 Morgan-fingerprints would have disqualified celastrol as a GPCR ligand.
4 Receptor structure-based methods
4.1 Binding site prediction
One of the most notable achievements in binding site prediction for identifying new targets was Nayal and Honig’s random forest model, which was trained on 408 features representing structural, geometric, and physicochemical properties [nayal2006nature]. Their method was able to identify 1347 cavities on the surface of 99 diverse proteins, and in 100% of the cases, a drug has been experimentally found to bind at least one of these surface cavities.
A more recent method for discovering binding sites in GPCRs is implemented in the COACH-D server [chan2019new, wu2018coach], which optionally allows users to provide a ligand structure that is then docked into the predicted binding site of the target protein structure to refine the ligand binding site predictions. The prediction in COACH-D is based on the COACH algorithm, which aggregates binding site predictions from both sequence and structure data based on five other methods and then uses a linear SVM classifier for the final prediction [yang2013protein].
Challenges for applying deep learning-based models to binding site identification are two-fold. First, the dataset of available GPCR structures is relatively small. However, this problem can potentially be addressed by transfer learning, which is discussed in Section 6. Second, deep learning excels when the models can learn to extract patterns from the raw input data rather than hand-engineered feature descriptors as described in Section 3.1. While deep learning models based on 3D voxelization methods [jimenez2018k, li2019deepatom] and graph convolutions [duvenaud2015convolutional, gilmer2017neural, schutt2017quantum] have been successful in predicting binding affinities and molecular properties from small molecule structures, representing large molecules such as membrane receptors is more computationally prohibitive and remains yet to be explored.
4.2 Protein-ligand docking
Structure-based virtual screening for GPCRs is limited by the number of high-resolution GPCR structures that are available. However, advancements in techniques like Cryo-EM have increased the rate at which these structures are solved, such that many general AI-based methods for protein-ligand docking can be applied to modeling GPCR-ligand recognition in the future.
Structure-based VS methods need to be able to score protein-ligand complexes to identify favorable candidates. Scoring functions can be grouped into four classes: Force field, empirical, knowledge-based, and machine learning-based [li2019overview]. Popular docking programs typically use scoring from the first three classes [trott2010autodock, verdonk2003improved, vilar2008medicinal]. Functions from these three classes are sometimes referred to as classical scoring functions. Classical scoring functions are based on linear combinations of features which restricts they ways they can utilize structural and interaction data [li2019classical]. Scoring functions based on machine learning are not restricted in this way. One of the first machine learning scoring functions to outperform classical methods is RF-score, which has been updated multiple times to further improve its performance [ballester2010machine, ballester2014does, li2015improving]. All iterations of RF-score use the same algorithm for random forest regression. However, the researchers experimented with different ways for encoding the molecular information in the training dataset to improve the performance of RF-score, which underlines the importance of choosing a good input representation as previously discussed in Section 2. The feature representation of the protein-ligand complex used in the latest version of the RF-score (v3) [li2015improving] consists of counts of interacting protein-ligand atom pairs. Two atoms are considered interacting if they are within 10 Å in the structure. For example, a carbon in the protein that is 7 Å away from a nitrogen in the ligand in a binding pose would increment the count. These features are not symmetric; atom in protein atom in ligand and vice versa are separate features. Additionally, RF-score v3 uses features produced by AutoDock Vina [trott2010autodock]
. When trained and tested to model a VS setting RF-score, outperformed classical scoring functions in various evaluation metrics[wojcikowski2017performance].
Initial attempts at developing NN-based scoring functions in this space struggled with overfitting issues and were outperformed by methods based on conventional machine learning methods, like RF-score [sunseri2016d3r]. Recently, however, a set of new NN-based scoring functions have shown state-of-the-art performance [jimenez2018k, li2019deepatom, zheng_onionnet_2019, stepniewska2018development]. All these networks use novel input representations to capture the structural and chemical information of the protein-ligand complex.
A number of NN-based scoring functions voxelize the 3D representation of the protein-ligand complex (Section 2.3) and use functions that capture structural and chemical information to compute the channels of each voxel [jimenez2018k, li2019deepatom, stepniewska2018development]. All of these methods are based on 3D CNNs, which can be computationally demanding due to their large number of trainable model parameters. Furthermore, the large number of parameters increases the number of training examples required to effectively train the networks without overfitting, which has been identified as an issue for DNNs in this space [ragoza2017protein]. However, more recent methods aim to address the overfitting issue by using more modern CNN architectures. For instance, DeepAtom [li2019deepatom] was recently designed based on ShuffleNet v2 [ma2018shufflenet], and the recent [jimenez2018k] method is based on SqueezeNet [iandola2016squeezenet]. These architectures were designed to achieve good computational performance while keeping the number of trainable parameters relatively small, which helps with minimizing overfitting when deeper architectures are being trained on smaller datasets. Both DeepAtom and achieved state-of-the-art binding affinity predictions on the PDBbind v. 2016 core dataset [wang2005pdbbind]. The success of these networks suggests that advances in other subfields of deep learning could be effective in VS as well and could be a fruitful area of future research.
Other NN-based scoring functions achieved state-of-the-art performance by including more information in the input representations. OnionNet uses atom pair counts similar to RF-score, but generates them for multiple cutoffs and makes each pair exclusive to the first cutoff it appears in [zheng_onionnet_2019]. These values are then converted into a matrix where is the cutoff value and is the atom pair and this matrix is used as input for a 2D CNN. The use of a 2D CNN reduces the parameter burden of this model so the area of the protein that can be measured increases substantially. While the previously mentioned 3D CNN models, DeepAtom and capture molecular information in a 20 Å cube. As described in [zheng_onionnet_2019], OnionNet counted at least all atom pairs within a 61 Å cube. Long-range electrostatic interactions extend beyond 20 Å and are known to be important in protein-ligand binding. Hence, OnionNet may be more effective for VS targets where these long-range interactions are important for activity prediction [dagliyan2011structural]. In practice, its performance on the PDBbind v. 2016 core was comparable to the 3D CNNs mentioned above, though there was no performance comparison conducted on individual targets where this information could be useful.
5 De novo small-molecule design
De novo molecule design is closely aligned with the goals of VS. While VS attempts to answer questions such as "Does this molecule activate this target?", de novo design attempts to answer "Can we generate a molecule that can activate this target?". Answering the second question in the affirmative implies that we understand what makes a molecule active against a given target. De novo molecule design can be tied to the idea of inverse QSAR/QSPR, deriving the structures of all molecules that are active with a target or have a certain property. This is a challenging problem but despite its difficulty, de novo synthesis has recently become an active area of research, which is likely due to the recent advances in the development of deep generative models and reinforcement learning.
5.1 Generating molecules with variational autoencoders
The most commonly used deep generative model in this space is the variational autoencoder (VAE) [kingma2013auto]. In essence, a VAE learns to reconstruct high-dimensional inputs from a low-dimensional continuous latent space. In contrast to a conventional autoencoder (Section 1.7.4), points close in the latent space are encouraged to have reconstructions with similar properties. In de novo design, this property can be utilized to generate novel molecules that have similar properties to selected input molecules by taking that input’s latent representation and perturbing it in a controlled manner [gomez2018automatic, dai2018syntax]. One of the challenges with using VAE-based models in this space is that with current architectures, the VAE has no notion of what constitutes a valid molecule [jin2018junction]. Since these models are trained on large numbers of valid compounds, certain models were still able to achieve reasonable results despite producing some invalid molecules [gomez2018automatic, kusner2017grammar].
Jin et al. introduced a novel VAE design that only allowed valid molecules as outputs called a junction tree VAE [jin2018junction]. This model uses a GNN to encode a molecules’ graph and also a junction tree of valid molecular fragments in the graph. The decoder network first decodes a junction tree from the latent space. Then, it decodes a molecule that satisfies that junction tree. Since all molecular fragments of the junction tree are valid, a valid molecule can be generated from the fragments given the nature of a junction tree, which forces outputs to be valid molecules[wiegerinck2000variational]. In addition to always producing valid molecules, the model outperformed other VAEs on a Bayesian optimization task, where the goal was to find a molecule that maximized octanol-water partition coefficients (logP).
As a general example, consider the latent representation of a ligand with a thiol group that induces a positive activity response in a given target receptor. Perturbation of the latent representation of that molecule may result in similar molecules that share the thiol group feature but are not active against the target. As a consequence, the performance for VAEs in this space is usually assessed on simple chemical properties [jin2018junction, kusner2017grammar, dai2018syntax]. For this reason, while there is still plenty to be investigated about VAEs for de novo synthesis, research that has focused on the de novo generation of compounds for specific targets has largely utilized reinforcement learning, which will be discussed in the next section.
5.2 Reinforcement learning-based molecular design
Reinforcement learning seeks to teach an agent to perform actions that will maximize a cumulative reward over a series of iterations. In this paradigm, the definition of the reward is very flexible, which makes it possible to more directly optimize for specific traits compared to using generative models such as VAEs. As illustrated in Figure 5, at each iteration, the agent is given a state by the environment as well as the reward from the previous iteration. The agent then selects an action based on this information. This selected action is fed into the environment, which then outputs a new state and reward. This cycle continues until the series of iterations, called an episode, is terminated at which point the agent’s behavior is updated based on the information provided by its trajectory, the ordered list of states, action, and rewards associated with each iteration in the episode. While this high-level overview will suffice to understand how reinforcement learning is utilized for de novo design, the reader is referred to [sutton2018reinforcement, raschka2019python], which we recommend as resources with more in-depth explanations.
Several examples exist in which reinforcement learning models have been optimized to generate GPCR bioactive molecules. One such example is the REINVENT model [olivecrona2017molecular]. In REINVENT, the agent is an RNN that generates one SMILES string token at a time over the course of multiple iterations. After a batch of 128 SMILES strings are produced, the RNN agent is updated with a modified form of REINFORCE, an algorithm that updates the agent network based on the trajectories of the episodes in the current batch [williams1992simple]. In one experiment, REINVENT was trained to generate compounds that are active with the GPCR Dopamine receptor (DRD2). In this experiment, the reward function was based on the output of an SVM trained to predict a compound’s DRD2 activity. After training, the agent was able to produce compounds that the SVM labeled as active against DRD2 96% of the time. It shall be noted that the SVM was highly accurate in predicting the activity of molecules when evaluated on an independent test set (98% prediction accuracy and a high precision of .96). The reinforcement learning model generated both novel, chemically valid compounds and, impressively, compounds that are known to be active with DRD2 even though these were not included in the training set.
DrugEx is another deep reinforcement model that was directly optimized for generating molecules with GPCR activity [liu2019exploration]. Here, the researchers focused on adenosine receptor, which has been targeted to treat cardiovascular and inflammatory diseases [chen2013adenosine]. Similar to REINVENT, the DrugEx model also produces SMILES strings over the course of an episode. However, DrugEx adds a stochastic component during training. Before being used in the reinforcement learning training, the RNN agent network was individually trained with a large set of molecule SMILES from ZINC [sterling2015zinc]. Two copies of this pre-trained network were then created, with one referred to as the exploration network and the other as the exploitation
network. Only the exploitation network was updated during the reinforcement learning training process; however, with a specified probability at each iteration, the exploration network would be queried for the next token instead. The purpose of this procedure was to explore a wider chemical space during training – afterwards the exploration network was discarded, and only the exploitation network was used to generate new molecules. This method successfully rediscovered some known actives for adenosinereceptor. The RNN agent was also able to produce molecules with a large diversity, which was evident by covering all cluster centers with a generated active when using a fingerprint-based clustering.
Other reinforcement learning-based models for de novo synthesis described in the literature were not specifically focused on GPCRs’ bioactive molecule design but could be adopted for such tasks in the future [zhou2019optimization, you2018graph]. One such example is Zhou et al.’s Molecule Deep Q-Networks (MOLDQN) approach, which modifies deep Q-networks for molecule generation [zhou2019optimization, mnih2015human]. This model’s agent network takes the current molecular graph’s Morgan fingerprint as input and selects an action to modify the molecular graph. The actions include adding an atom, adding a bond, or increasing a bond’s order. Additionally, the actions that are allowed at a given iteration are restricted if they are invalid so the system will always produce valid molecules. Similar to DrugEx, the model encourages exploration by selecting a random valid action at a given iteration with some probability . The model achieved state-of-the-art performance when producing molecules that maximized for logP and quantitative estimates of drug likeness separately [leeson2012drug].
While many publications in this area demonstrate the ability to optimize for molecular properties, they unfortunately lack experimental follow-up procedures for further model evaluation. As a notable counter-example, we want to highlight GENTRL, which was used to discover novel inhibitors of discodin domain receptor 1, a tyrosine kinase [zhavoronkov2019deep]. The inhibitors were generated computationally and then synthesized and experimentally validated. The experiments in silico and in vitro were completed in approximately 46 days at a fraction of the cost of a high throughput screening approach.
6 Transfer learning
Machine learning, and deep learning in particular, requires large training datasets. It’s not atypical for modern deep neural networks to have millions of trainable parameters, which require sufficient data for successful parameterization. Transfer learning refers to the process of adapting models that have been trained on one task to another, typically similar task [pan2009survey]. The most common form of transfer learning is to take a DNN that was trained on a large general-purpose dataset and fine-tune it to a smaller dataset of interest. Typically, the more similar the general-purpose dataset is to the target task, the more successful transfer learning can be. Furthermore, the model weights can be frozen and only the last layer replaced and tuned to the target data, or the model parameters of the entire model can be considered starting weights (as opposed to randomly initialized weights) and further fine-tuned.
The advantage of transfer learning is that it is less data- hungry compared to training a model entirely from scratch. This allows researchers to apply DNN models even though the available labeled datasets for the target task are relatively small. One example of this is in a recent structure-based method that that sought to train a 3D CNN to classify molecule activity from high-scoring docking poses [imrie2018protein]. When the network was trained on targets in a DUD-E based training set and then fine-tuned on a smaller protein family-specific dataset, the model performed better on test set targets in that protein family than the same network that was trained only on the family-specific set [mysinger2012directory]. Family-specific models typically outperform general models, but data availability may limit their performance [ross2013one]. This work shows that transfer learning can help to address this issue. While it has not yet been applied further in bioactivity predictions, transfer learning has been successful in improving the performance of models that predict standard molecular properties like solvency as well as in models that predict quantum mechanical approximations [goh2018using, smith2019approaching].
7 Future directions
Currently, one of the most neglected areas of machine learning in bioactive ligand discovery is active learning, which aims to combine artificial and human intelligence. Active learning is a branch of machine learning that focuses on selecting labeled data for supervised learning to improve non-confident predictions and fill the knowledge gaps of the model [munro2020active]. Knowledge gaps can be filled through human interaction, by sampling optimal unlabeled training examples for annotations by humans. We think that utilizing domain knowledge from experts is crucial in a highly specialized field such as GPCR bioactive ligand discovery, and we believe that it is highly advantageous to develop machine learning systems that include human feedback loops.
Another avenue that remains largely unexplored for computational ligand discovery is a combination of active learning and transfer learning (Section 6), called active transfer learning. Here, a separate model is trained on an existing validation set to predict whether a given model is able to make correct predictions or not on a given unlabeled dataset, which then can highlight problematic cases for human review [munro2020active].
Another recent development and promising research trend in deep learning is semi-supervised learning, which is particularly useful if pre-trained models for transfer learning are not available for the target domain or are infeasible to obtain. Semi-supervised learning is the process of deriving and utilizing label information directly from the data itself rather than having humans annotating it. In a language model, for example, semi-supervised learning can be utilized by training a model to predict the next word in a sequence [howard2018universal] or, in the context of computer vision, this could be the composition of an image into a jigsaw puzzle that the DNN learns to assemble [noroozi2016unsupervised]. The main idea of this approach is to choose a task that requires an understanding of the underlying data in order to be solved. This stage of self-supervised training can be regarded as pre-training, and the model can then be adopted and fine-tuned to solve the target task downstream similar to conventional transfer learning.
The confluence of the meteoric rise of deep learning methods, the increase in large chemical datasets, and the further ease of access to high powered computing resources have spurred the field of chemical machine learning forward at a rapid pace. The many representations of chemical data allow a variety of methods used in other fields to be put to task on chemical quandaries and have encouraged the development of new machine learning models. As we have discussed, many of these models have been or could be applied to VS on GPCRs, though the types of GPCR data currently available may limit the general applicability of certain models. However, as the amount of bioactivity data available continues to increase, so will the performance of these models. One area in which the space is currently lacking is the experimental validation of the computational method development.
Hopefully, by providing an overview of deep learning methods and detailed explanations of new approaches in the space, we can encourage the more experimentally-minded to utilize methods in this space, or at least to collaborate more confidently. In the other direction, it is important to convey to the methodologically-inclined that while it is great to develop a new method that performs well on a benchmark, experimental validation is the gold standard, so it is beneficial to seek out experimental collaboration when possible. This will lead to more actionable results and push the development of machine learning VS methods even faster in the future.
Support for this work was provided by the Office of the Vice Chancellor for Research and Graduate Education at the University of Wisconsin-Madison with funding from the Wisconsin Alumni Research Foundation. In addition, we are thankful for the support provided by the NLM Biomedical Informatics Training Program (grant number 5T15LM007359). Also, we would like to thank Kylie Moynihan for helpful feedback on the manuscript.