1 Introduction
Recent advances in deep learning, an instance of artificial intelligence (AI) based on neural networks [lecun2015deep, schmidhuber2015deep], have led to numerous applications in the molecular sciences, e.g., in drug discovery [gawehn2016deep, jimenez2021artificial], quantum chemistry [gilmer2017neural], and structural biology [jumper2021highly, baek2021accurate]. Two characteristics of deep learning render it particularly promising when applied to molecules. First, deep learning methods can cope with "unstructured" data representations, such as text sequences [vaswani2017attention, brown2020language], speech signals [hinton2012deep, mikolov2011strategies], images [krizhevsky2017imagenet, farabet2012learning, tompson2014joint], and graphs [bronstein2017geometric, monti2019fake]. This ability is particularly useful for molecular systems, for which chemists have developed many models (i.e., "molecular representations") that capture molecular properties at varying levels of abstraction (Figure 1). The second key characteristic is that deep learning can perform feature extraction (feature learning), that is, produce data-driven features from the input data without the need for manual intervention. These two characteristics make deep learning a promising complement to "classical" machine learning applications (e.g., quantitative structure-activity relationship [QSAR] modeling), in which molecular features (i.e., "molecular descriptors" [todeschini2009molecular]) are encoded a priori with rule-based algorithms. The capability to learn from unstructured data and obtain data-driven molecular features has led to unprecedented applications of AI in the molecular sciences.
One of the most promising advances in deep learning is geometric deep learning (GDL), an umbrella term encompassing emerging techniques that generalize neural networks to Euclidean and non-Euclidean domains, such as graphs, manifolds, meshes, or string representations [bronstein2017geometric]. In general, GDL encompasses approaches that incorporate a geometric prior, i.e., information on the structure space and symmetry properties of the input variables. Such a geometric prior is leveraged to improve the quality of the information captured by the model. Although GDL has been increasingly applied to molecular modeling [gainza2020deciphering, segler2018generating, gilmer2017neural], its full potential in the field remains untapped.
The aim of this review is to (i) provide a structured and harmonized overview of the applications of GDL on molecular systems, (ii) delineate the main research directions in the field, and (iii) provide a forecast of the future impact of GDL. Three fields of application are highlighted, namely drug discovery, quantum chemistry, and computeraided synthesis planning (CASP), with particular attention to the datadriven molecular features learned by GDL methods. A glossary of selected terms can be found in Box 1.
2 Principles of geometric deep learning
The term geometric deep learning was coined in 2017 [bronstein2017bronstein]. Although GDL was originally used for methods applied to non-Euclidean data [bronstein2017geometric], it now extends to all deep learning methods that incorporate geometric priors [bronstein2021geometric], that is, information about the structure and symmetry of the system of interest. Symmetry is a crucial concept in GDL, as it encompasses the properties of the system with respect to manipulations (transformations), such as translation, reflection, rotation, scaling, or permutation (Box 2).
Symmetry is often recast in terms of invariance and equivariance to express the behavior of any mathematical function with respect to a transformation T (e.g., rotation, translation, reflection, or permutation) of an acting symmetry group [marsden1974reduction]. Here, the mathematical function is a neural network f applied to a given molecular input x. f can therein transform equivariantly, invariantly, or neither with respect to T, as described below:

Equivariance. A neural network f applied to an input x is equivariant to a transformation T if the transformation of the input commutes with the transformation of the output via a transformation T′ of the same symmetry group, such that f(T(x)) = T′(f(x)). Neural networks are therefore equivariant to the actions of a symmetry group on their inputs if and only if each layer of the network "equivalently" transforms under any transformation of that group.

Invariance. Invariance is a special case of equivariance, where f is invariant to T if T′ is the trivial group action (i.e., the identity): f(T(x)) = f(x).

Neither. f is neither invariant nor equivariant to T when (i) the transformation of the input does not commute with a transformation of the output, f(T(x)) ≠ T′(f(x)), and (ii) T′ is not the trivial group action.
The symmetry properties of a neural network architecture vary depending on the network type and the symmetry group of interest, and are individually discussed in the following sections. Readers can find an in-depth treatment of equivariance and group-equivariant layers in neural networks elsewhere [cohen2016group, cohen2016steerable, cohen2018spherical, kondor2018generalization].
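The definitions of invariance and equivariance above can be illustrated numerically. The following minimal numpy sketch (our own illustration, not taken from any specific GDL library) checks the permutation behavior of three toy functions applied to a 5-atom feature matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "molecule": 5 atoms, 3 features each.
x = rng.normal(size=(5, 3))

# Permutation T acting on the atom ordering (a cyclic relabeling).
perm = np.array([1, 2, 3, 4, 0])
T = lambda a: a[perm]

# f1: sum pooling over atoms -> permutation INVARIANT: f1(T(x)) == f1(x).
f1 = lambda a: a.sum(axis=0)
assert np.allclose(f1(T(x)), f1(x))

# f2: a shared per-atom linear layer -> permutation EQUIVARIANT:
# f2(T(x)) == T(f2(x)).
W = rng.normal(size=(3, 3))
f2 = lambda a: a @ W
assert np.allclose(f2(T(x)), T(f2(x)))

# f3: flatten + dense layer -> neither invariant nor equivariant in general.
V = rng.normal(size=(15, 15))
f3 = lambda a: (a.reshape(-1) @ V).reshape(5, 3)
assert not np.allclose(f3(T(x)), T(f3(x)))
```

Sum pooling discards the atom ordering entirely (invariance), a weight-shared per-atom layer preserves it (equivariance), while a dense layer over the flattened matrix entangles positions and features and respects neither property.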
The concepts of equivariance and invariance can also be used in reference to the molecular features obtained from a given molecular representation, depending on their behavior when a transformation is applied to that representation. For instance, many molecular descriptors are invariant to rotation and translation of the molecular representation by design [todeschini2009molecular], e.g., the Moriguchi octanol-water partition coefficient [moriguchi1992simple], which relies only on the occurrence of specific molecular substructures. The symmetry properties of molecular features extracted by a neural network depend on both the symmetry properties of the input molecular representation and those of the utilized neural network.
Many relevant molecular properties (e.g., equilibrium energies, atomic charges, or physicochemical properties such as permeability, lipophilicity, or solubility) are invariant to certain symmetry operations (Box 2). In many chemistry tasks, it is thus desirable to design neural networks that transform equivariantly under the actions of predefined symmetry groups. Exceptions occur if the targeted property changes upon a symmetry transformation of the molecule (e.g., chiral properties, which change under inversion of the molecule, or vector properties, which change under rotation of the molecule). In such cases, the inductive bias (learning bias) of equivariant neural networks would not allow for the differentiation of symmetry-transformed molecules.
While neural networks can be considered universal function approximators [cybenko1989approximation], incorporating prior knowledge such as reasonable geometric information (geometric priors) has evolved into a core design principle of neural network modeling [bronstein2021geometric]. By incorporating geometric priors, GDL increases model quality and bypasses several bottlenecks related to the need to force the data into Euclidean geometries (e.g., by feature engineering). Moreover, GDL provides novel modeling opportunities, such as data augmentation in low-data regimes [tetko2020state, skinnider2021chemical].
Box 1: Glossary of selected terms
CoMFA and CoMSIA. Comparative Molecular Field Analysis (CoMFA) [cramer1988comparative] and Comparative Molecular Similarity Indices Analysis (CoMSIA) [klebe1998comparative] are popular 3D QSAR methods developed in the 1980s and 1990s, in which three-dimensional grids are used to capture the distributions of molecular features (e.g., steric, hydrophobic, and electrostatic properties). The obtained molecular descriptors serve as inputs to a regression model for quantitative bioactivity prediction.
Convolution. Operation within a neural network that transforms a feature space into a new feature space and thereby captures the local information found in the data. Convolutions were first introduced for pixels in images [lecun1995convolutional, lecun1998gradient] but the term "convolution" is now used for neural network architectures covering a variety of data structures such as graphs, point clouds, spheres, grids, or manifolds.
Density Functional Theory (DFT). A quantum mechanical modeling approach used to investigate the electronic structure of molecules.
Data augmentation. Artificial increase of the data volume available for model training, often achieved by leveraging symmetrical properties of the input data which are not captured by the model (e.g., rotation or permutation).
Feature. An individually measurable or computationally obtainable characteristic of a given sample (e.g., molecule), in the form of a scalar. In this review, the term refers to a numeric value characterizing a molecule. Such molecular features can be computed with rule-based algorithms ("molecular descriptors") or generated automatically by deep learning from a molecular representation ("hidden" or "learned" features).
Geometric prior. An inductive bias incorporating information on the symmetric nature of the system of interest into the neural network architecture. Also known as symmetry prior.
Inductive bias. Set of assumptions that a learning algorithm (e.g., a neural network) uses to learn the target function and to make predictions on previously unseen data points.
One-hot encoding. Method for representing categorical variables as numerical arrays by obtaining a binary variable (0, 1) for each category. It is often used to convert sequences (e.g., SMILES strings) into numerical matrices, suitable as inputs and/or outputs of deep learning models (e.g., chemical language models).
Quantitative Structure-Activity Relationship (QSAR). Machine learning techniques aimed at finding an empirical relationship between the molecular structure (usually encoded as molecular descriptors) and experimentally determined molecular properties, such as pharmacological activity or toxicity.
Reinforcement learning. A technique used to steer the output of a machine learning algorithm toward user-defined regions of optimality via a predefined reward function [sutton2018reinforcement].
Transfer learning. Transfer of knowledge from an existing deep learning model to a related task for which fewer training samples are available [pan2009survey].
Unstructured data. Data that are not arranged as vectors of (typically handcrafted) features. Examples of unstructured data include graphs, images, and meshes. Molecular representations are typically unstructured, whereas numerical molecular descriptors (e.g., molecular properties, molecular "fingerprints") are examples of structured data.
Voxel. Element of a regularly spaced, 3D grid (equivalent to a pixel in 2D space).


[Table: overview of GDL approaches, the molecular representation(s) they operate on (e.g., 2D molecular graph), and their applications.]
3 Molecular GDL
The application of GDL to molecular systems is challenging, in part because there are multiple valid ways of representing the same molecular entity. Molecular representations can be categorized based on their levels of abstraction and the physicochemical and geometrical aspects they capture. (Note that in this review, the term "representation" is used solely to denote human-made models of molecules, e.g., molecular graphs, 3D conformers, and SMILES strings. To avoid confusion with other usages of the word "representation" in deep learning, we use the term "feature" whenever referring to any numerical description of molecules, obtained either with rule-based algorithms [molecular descriptors] or learned [extracted] by neural networks.) Importantly, all of these representations are models of the same reality and are thus "suitable for some purposes, not for others" [hoffmann1991representation]. GDL provides the opportunity to experiment with different representations of the same molecule and leverages their intrinsic geometrical features to increase model quality. Moreover, GDL has repeatedly proven useful in providing insights into molecular properties relevant to the task at hand, thanks to its feature extraction (feature learning) capabilities. In the following sections, we delineate the most prevalent molecular GDL approaches and their applications in chemistry, grouped according to the respective molecular representations used for deep learning: molecular graphs, grids, strings, and surfaces.
Box 2: Euclidean symmetries in molecular systems
Molecular systems (and three-dimensional representations thereof) can be considered as objects in Euclidean space. In such a space, one can apply several symmetry operations (transformations) that are (i) performed with respect to three symmetry elements (i.e., line, plane, point), and (ii) rigid, that is, they preserve the Euclidean distance between all pairs of atoms (i.e., isometry). The Euclidean transformations are as follows:


Rotation. Circular movement of an object around a fixed point or axis, preserving the distances to that point or axis.

Translation. Movement of every point of an object by the same distance in a given direction.

Reflection. Mapping of an object to itself through a point (inversion), a line or a plane (mirroring).
All three transformations and their arbitrary finite combinations are included in the Euclidean group [E(3)]. The special Euclidean group [SE(3)] comprises only translations and rotations.
Molecules are always symmetric in the SE(3) group, i.e., their intrinsic properties (e.g., biological and physicochemical properties, and equilibrium energy) are invariant to coordinate rotation and translation, and combinations thereof. Several molecules are chiral, that is, some of their (chiral) properties depend on the absolute configuration of their stereogenic centers and are thus non-invariant to molecule reflection. Chirality plays a key role in chemical biology; relevant examples of chiral molecules include DNA and several drugs whose isomers exhibit markedly different pharmacological and toxicological properties [nguyen2006chiral].
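The isometry property of the rigid transformations described above can be verified numerically. The following sketch (our own illustration) applies a random SE(3) transformation to a set of atom coordinates and checks that all pairwise distances are preserved:

```python
import numpy as np

rng = np.random.default_rng(1)
coords = rng.normal(size=(4, 3))  # four atoms in 3D

# Random rotation matrix via QR decomposition of a random matrix.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(Q) < 0:
    Q[:, 0] *= -1.0  # force a proper rotation (exclude reflection -> SE(3))
t = rng.normal(size=3)  # translation vector

transformed = coords @ Q.T + t  # SE(3) action on the coordinates

# Isometry: the matrix of pairwise Euclidean distances is unchanged.
dist = lambda c: np.linalg.norm(c[:, None, :] - c[None, :, :], axis=-1)
assert np.allclose(dist(coords), dist(transformed))
```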
3.1 Learning on molecular graphs
3.1.1 Molecular graphs
Graphs are among the most intuitive ways to represent molecular structures [kipf2016semi]. Any molecule can be thought of as a mathematical graph G = (V, E), whose vertices (V) represent atoms and whose edges (E) constitute their connections (Figure 3.1). In many deep learning applications, molecular graphs are further characterized by a set of vertex and edge features.
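As a minimal illustration (not tied to any specific chemistry library), the heavy-atom graph of ethanol (C-C-O) can be encoded as a vertex-feature matrix plus a symmetric adjacency matrix:

```python
import numpy as np

# Heavy-atom graph of ethanol (CCO): vertices = atoms, edges = bonds.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]  # C-C and C-O single bonds

# Vertex features: a simple one-hot over the element types present.
elements = ["C", "O"]
V = np.array([[float(a == e) for e in elements] for a in atoms])

# Symmetric adjacency matrix encoding the (undirected) edges.
A = np.zeros((len(atoms), len(atoms)))
for i, j in bonds:
    A[i, j] = A[j, i] = 1.0

# Row sums of A give the heavy-atom degrees: [1, 2, 1].
```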
3.1.2 Graph neural networks
Deep learning methods devoted to handling graphs as input are commonly referred to as graph neural networks (GNNs). When applied to molecules, GNNs allow for feature extraction by progressively aggregating information from atoms and their molecular environments (Figure 2a, [battaglia2016interaction, battaglia2018relational]). Different GNN architectures have been introduced [zhou2020graph], the most popular of which fall under the umbrella term of message passing neural networks [geerts2020let, duvenaud2015convolutional, gilmer2017neural]. Such networks iteratively update the vertex features h_v^(l) of the l-th network layer via graph convolutional operations, employing at least two learnable functions ψ and φ, and a local permutation-invariant aggregation operator ⊕ (e.g., sum): h_v^(l+1) = φ(h_v^(l), ⊕_{u ∈ N(v)} ψ(h_v^(l), h_u^(l))), where N(v) denotes the neighbors of vertex v.
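A single message-passing step of this form can be sketched in a few lines of numpy. This is a hedged, minimal sketch: ψ and φ are modeled as single dense layers with ReLU (real architectures use multi-layer perceptrons), and the sum over neighbors is the permutation-invariant aggregator:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def mp_layer(H, A, W_psi, W_phi):
    """One message-passing step: h_v' = phi(h_v, sum_u A[v,u] * psi(h_v, h_u)).

    H: (n, d) vertex features; A: (n, n) adjacency matrix.
    """
    n, d = H.shape
    # Build all ordered pairs [h_v, h_u] and compute messages psi(h_v, h_u).
    pair = np.concatenate([np.repeat(H, n, 0), np.tile(H, (n, 1))], axis=1)
    msgs = relu(pair @ W_psi).reshape(n, n, -1)
    agg = (A[..., None] * msgs).sum(axis=1)  # sum over neighbors u (invariant)
    return relu(np.concatenate([H, agg], axis=1) @ W_phi)  # update phi

rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))  # 3 atoms, 4 features each
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # path graph
W_psi = rng.normal(size=(8, 4))
W_phi = rng.normal(size=(8, 4))
H1 = mp_layer(H, A, W_psi, W_phi)

# Permutation equivariance: relabeling the atoms relabels the outputs.
p = np.array([2, 0, 1])
assert np.allclose(mp_layer(H[p], A[p][:, p], W_psi, W_phi), H1[p])
```

Because the only cross-vertex operation is a sum over neighbors, the layer is equivariant to vertex permutations by construction, which is the geometric prior that makes GNNs suitable for molecular graphs.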
Since their introduction as a means to predict quantum chemical properties of small molecules at the density functional theory (DFT) level [gilmer2017neural], GNNs have found many applications in quantum chemistry [klicpera2020directional, mxmnet2020molecular, withnall2020building, tang2020self, goodall2020predicting], drug discovery [stokes2020deep, feinberg2018potentialnet, torng2019graph], CASP [somnath2020learning], and molecular property prediction [li2017learning, liu2019chemi]. When applied to quantum chemistry tasks, GNNs often use E(3)-invariant 3D information by including radial and angular information in the edge features of the graph [unke2019physnet, klicpera2020fast, klicpera2020directional, mxmnet2020molecular, schutt2021equivariant], thereby improving the prediction accuracy of quantum chemical forces and energies for equilibrium and non-equilibrium molecular conformations, as in the case of SchNet [schutt2018schnet, schutt2017quantum] and PaiNN [schutt2021equivariant]. SchNet-like architectures were used to predict quantum mechanical wavefunctions in the form of Hartree-Fock and DFT density matrices [schutt2019unifying], and differences in quantum properties obtained by DFT and coupled-cluster level-of-theory calculations [bogojeski2020quantum].
GNNs for molecular property prediction have been shown to outperform human-engineered molecular descriptors for several biologically relevant properties [yang2019analyzing]. Although including 3D information in molecular graphs generally improved the prediction of drug-relevant properties, no marked difference was observed between using a single or multiple molecular conformers for network training [axelrod2020molecular]. Because of their natural connection with molecular representations, GNNs seem particularly suitable in the context of explainable AI (XAI) [gunning2017explainable, jimenez2020drug], where they have been used to interpret models predicting molecular properties of preclinical relevance [jimenez2021coloring] and quantum chemical properties [schnake2020xai].
GNNs have been used for de novo molecule generation [Battaglia2018learning, simonovsky2018graphvae, de2018molgan, zhou2019optimization], for example by performing vertex and edge addition from an initial vertex [Battaglia2018learning] (Figure 2b). GNNs have also been combined with variational autoencoders [jin2018junction, de2018molgan, simonovsky2018graphvae, flam2021mpgvae] and reinforcement learning [you2018graph, jin2020multi, zhou2019optimization]. Finally, GNNs have been applied to CASP [somnath2020learning, coley2019graph, lei2017deriving]; however, the current approaches are limited to reactions in which one bond is removed between the products and the reactants.
3.1.3 Equivariant message passing
A recent development in graph-based methods is SE(3)- and E(3)-equivariant GNNs (equivariant message passing networks), which deal with the absolute coordinate systems of 3D graphs [e3nn2018, smidt2021finding] (Figure 2b). Such networks exploit the Euclidean symmetries of the system (Box 2) and may therefore be particularly well suited for 3D molecular representations.
3D molecular graphs, in addition to their vertex and edge features, also encode information on the vertex positions in a 3D coordinate system. By employing E(3)- [satorras2021n] and SE(3)-equivariant [e3nn2018] convolutions, such networks have shown high accuracy in predicting several quantum chemical properties, such as energies [clebschgordonnet, cormorant, smidt2020euclidean, e3nn2020qm9L1, fuchs2020se, hutchinson2020lietransformer, schutt2021equivariant], interatomic potentials for molecular dynamics simulations [fuchs2020se, batzner2021se, satorras2021n], and wavefunctions [unke2021se3equivariant]. SE(3)-equivariant neural networks possess reflection equivariance rather than invariance, thereby enabling the model to distinguish between chiral molecules [e3nn2018]. SE(3)-equivariant networks are computationally expensive due to their use of spherical harmonics [muller2006spherical] and Wigner D-functions [dray1986unified] to compute learnable weight kernels. E(3)-equivariant neural networks are computationally more efficient and have been shown to perform on par with, or better than, SE(3)-equivariant networks, e.g., for the modeling of quantum chemical properties and dynamic systems [satorras2021n]. Equivariant message passing networks have been applied to predict the quantum mechanical wavefunction of nuclei- and electron-based representations in an end-to-end fashion [hermann2020deep, pfau2020ab, choo2020fermionic]. However, such networks are currently limited to small molecular systems because of the large size of the learned matrices, which scale quadratically with the number of electrons in the system.
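The core idea behind E(n)-equivariant coordinate updates can be sketched compactly. In the spirit of EGNN [satorras2021n] (but with a hypothetical stand-in edge function, tanh of the squared distance, instead of a learned MLP), messages depend only on rotation/translation-invariant scalars, and coordinates move along relative position vectors, so the whole update commutes with rigid transformations:

```python
import numpy as np

def egnn_step(x, w=0.1):
    """Sketch of an E(n)-equivariant coordinate update.

    x: (n, 3) atom coordinates; w: a scalar "weight" standing in for
    learned parameters. The edge function tanh(||x_i - x_j||^2) is a
    hypothetical placeholder for a learned MLP over invariant scalars.
    """
    diff = x[:, None, :] - x[None, :, :]     # relative position vectors
    sq = (diff ** 2).sum(-1, keepdims=True)  # invariant to rotation/translation
    msg = w * np.tanh(sq)                    # scalar weight per edge
    return x + (msg * diff).sum(axis=1)      # move atoms along relative vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))

# Equivariance check: rotating/translating the input transforms the output
# in exactly the same way: f(xQ^T + t) == f(x)Q^T + t.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
t = rng.normal(size=3)
assert np.allclose(egnn_step(x @ Q.T + t), egnn_step(x) @ Q.T + t)
```

Since the translation cancels in the relative vectors and the squared distance is unchanged by any orthogonal matrix, equivariance holds by construction rather than having to be learned from data augmentation.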
3.2 Learning on grids
Grids capture the properties of a system at regularly spaced intervals. Based on the number of dimensions included in the system, grids can be 1D (e.g., sequences), 2D (e.g., RGB images), 3D (e.g., cubic lattices), or higher-dimensional. Grids are defined by a Euclidean geometry and can be considered as graphs with a special adjacency, where (i) the vertices have a fixed ordering defined by the spatial dimensions of the grid, and (ii) each vertex has an identical number of adjacent edges and is therefore structurally indistinguishable from all other vertices [bronstein2021geometric]. These two properties render local convolutions applied to a grid inherently permutation invariant and provide a strong geometric prior for translation invariance (e.g., by weight sharing in convolutions). These grid properties have critically determined the success of convolutional neural networks (CNNs), e.g., in computer vision [lecun1998gradient, lecun1995convolutional], natural language processing [hochreiter1997long, brown2020language], and speech recognition [hinton2012deep, mikolov2011strategies].
3.2.1 Molecular grids
Molecules can be represented as grids in different ways. 2D grids (e.g., molecular structure drawings) are generally more useful for visualization than for prediction, with few exceptions [tabchouri2019machine]. Analogous to some popular pre-deep-learning approaches, such as Comparative Molecular Field Analysis (CoMFA) [cramer1988comparative] and Comparative Molecular Similarity Indices Analysis (CoMSIA) [klebe1998comparative], 3D grids are often used to capture the spatial distribution of properties within one (or more) molecular conformer. Such representations are then used as inputs to 3D CNNs. 3D CNNs are characterized by a greater resource efficiency than equivariant GNNs, which until now have mainly been applied to molecules with fewer than approximately 100 atoms. Thus, 3D CNNs are a method of choice when the protein structure has to be considered, e.g., for protein-ligand binding affinity prediction [jimenez2018k, ragoza2017protein, li2019deepatom, karimi2019deepaffinity, jimenez2019deltadelta] or active-site recognition [jimenez2017deepsite].
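The mapping from a 3D conformer to a grid representation can be illustrated with a minimal occupancy voxelizer (our own sketch; real pipelines typically use per-atom-type channels and Gaussian density smoothing rather than hard counts, and the box size and resolution below are arbitrary choices):

```python
import numpy as np

def voxelize(coords, box=8.0, res=1.0):
    """Map 3D atom coordinates (assumed centered in a cubic box of side
    `box` Å) onto a 3D occupancy grid of spacing `res` Å."""
    n_bins = int(box / res)
    grid = np.zeros((n_bins,) * 3)
    idx = np.floor((coords + box / 2) / res).astype(int)   # shift to [0, box)
    idx = idx[(idx >= 0).all(1) & (idx < n_bins).all(1)]   # drop atoms outside
    for i, j, k in idx:
        grid[i, j, k] += 1.0                               # occupancy count
    return grid

# Three atoms -> an 8x8x8 grid with three occupied voxels.
coords = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [-2.6, 1.1, 0.4]])
grid = voxelize(coords)
```

Such a grid is the direct analogue of an image with one channel, which is why standard 3D CNN machinery (translation-invariant weight sharing) applies to it without modification.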
3.3 Learning on molecular surfaces
Molecular surfaces are defined as the surface enclosing the 3D structure of a molecule at a certain distance from each atom center. Each point on such a continuous surface is characterized by its chemical (e.g., hydrophobicity, electrostatics) and geometric (e.g., shape, curvature) features. From a geometrical perspective, molecular surfaces can be treated as 3D meshes, i.e., sets of polygons (faces) defined by vertices that specify the mesh coordinates in 3D space [ahmed2018survey]. The vertices can be represented by a 2D grid structure (where four vertices on the mesh define a pixel) or by a 3D graph structure. The grid- and graph-based structures of meshes enable the application of 2D CNNs and GNNs to learn on mesh-based molecular surfaces. Recently, 2D CNNs have been applied to mesh-based representations of protein surfaces to predict protein-protein complexes and protein binding sites [gainza2020deciphering]. However, 2D CNNs applied to meshes come with certain limitations, such as the need for rotational data augmentation (owing to the lack of rotational invariance of the network) and for homogeneous mesh resolution. Recently introduced GNNs for mesh-based representations incorporate rotational equivariance into the network architecture and allow for heterogeneous mesh resolution (i.e., non-consistent distances between the vertices in 3D) [pfaff2020learning]. Such GNNs are computationally efficient and have potential for modeling macromolecular structures; however, they have not yet found applications in molecular systems. Other studies have used 3D voxel-based surface representations of (macro)molecules as inputs to 3D CNNs, e.g., for protein-ligand affinity [liu2021octsurf] and protein binding-site [mylonas2020deepsurf] prediction.
3.4 Learning on string representations
3.4.1 Molecular strings
Molecules can be represented as molecular strings, i.e., linear sequences of alphanumeric symbols. Molecular strings were originally developed as manual ciphering tools to complement systematic chemical nomenclature [barnard2003representation, wiswesser1985historic] and later became suitable for data storage and retrieval. Some of the most popular stringbased representations are the Wiswesser Line Notation [wln_1952], the Sybyl line notation [ash1997sybyl], the International Chemical Identifier (InChI) [heller2013inchi], Hierarchical Editing Language for Macromolecules [zhang_2012_helm], and the Simplified Molecular Input Line Entry System (SMILES) [weininger1988smiles].
Each type of linear representation can be considered as a "chemical language." In fact, such notations possess a defined syntax, i.e., not all possible combinations of alphanumerical characters will lead to a “chemically valid” molecule. Furthermore, these notations possess semantic properties: depending on how the elements of the string are combined, the corresponding molecule will have different physicochemical and biological properties. These characteristics make it possible to extend the deep learning methods developed for language and sequence modeling to the analysis of molecular strings for "chemical language modeling" [ozturk2020exploring, cadeddu_2014_language].
SMILES strings – in which letters represent atoms, and symbols and numbers encode bond types, connectivity, branching, and stereochemistry (Figure 3a) – have become the most frequently employed representation for sequence-based deep learning [segler2018generating, schwaller2018found]. While several other string representations have been tested in combination with deep learning, e.g., InChI [gomez2018automatic], DeepSMILES [deepsmiles], and self-referencing embedded strings (SELFIES) [krenn2020self], SMILES remains the de facto representation of choice for chemical language modeling. The following text introduces the most prominent chemical language modeling methods, along with selected examples of their application to chemistry.
3.4.2 Chemical language models
Chemical language models are machine learning methods that can handle molecular sequences as inputs and/or outputs. The most common algorithms for chemical language modeling are recurrent neural networks (RNNs) and Transformers:

RNNs (Figure 3b) [rumelhart1985learning] are neural networks that process sequence data as Euclidean structures, usually via one-hot encoding. RNNs model a dynamic system in which the hidden state (h_t) of the network at any t-th time point (i.e., at any t-th position in the sequence) depends on both the current observation (x_t) and the previous hidden state (h_{t-1}). RNNs can process sequence inputs of arbitrary lengths and provide outputs of arbitrary lengths. RNNs are often used in an "autoregressive" fashion, i.e., to predict the probability distribution over the next possible elements (tokens) at time step t+1, given the current hidden state (h_t) and the preceding portions of the sequence. Several RNN architectures have been proposed to solve the vanishing or exploding gradient problems of "vanilla" RNNs [hochreiter1998vanishing, pascanu2013difficulty], such as long short-term memory networks [hochreiter1997long] and gated recurrent units [chung2014empirical].

Transformers (Figure 3c) process sequence data as non-Euclidean structures, by encoding sequences as either (i) a fully connected graph, or (ii) a sequentially connected graph, where each token is only connected to the previous tokens in the sequence. The former approach is often used for feature extraction (e.g., in a Transformer-encoder), whereas the latter is employed for next-token prediction (e.g., in a Transformer-decoder). The positional information of tokens is usually encoded by positional embeddings or sinusoidal positional encoding [vaswani2017attention]. Transformers combine graph-like processing with so-called attention layers, which allow the model to focus on ("pay attention to") the tokens perceived as relevant for each prediction. Transformers have been particularly successful in sequence-to-sequence tasks, such as language translation.
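The one-hot encoding step mentioned above can be made concrete with a minimal character-level sketch (our own illustration; practical chemical language models typically use tokenizers that group multi-character atoms such as "Cl" or "Br" into single tokens, and the vocabulary below is a toy subset):

```python
import numpy as np

def one_hot_smiles(smiles, vocab):
    """Character-level one-hot encoding of a SMILES string into a
    (sequence length x vocabulary size) binary matrix."""
    idx = {c: i for i, c in enumerate(vocab)}
    x = np.zeros((len(smiles), len(vocab)))
    for t, ch in enumerate(smiles):
        x[t, idx[ch]] = 1.0
    return x

vocab = ["C", "O", "(", ")", "=", "N"]  # toy vocabulary
x = one_hot_smiles("CC(=O)N", vocab)    # acetamide -> (7, 6) matrix
```

Each row of the resulting matrix is the input x_t fed to the RNN at time step t; an autoregressive model is then trained so that its output at step t approximates the distribution of the token at step t+1.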
Extending early studies [yuan2017chemspacemim, segler2018generating, bjerrum2017molecular], RNNs for next-token prediction have been routinely applied to the de novo generation of molecules with desired biological or physicochemical properties, in combination with transfer [segler2018generating, gupta2018generative, merk2018novo, merk2018tuning] or reinforcement learning [olivecrona2017molecular, popova2018deep]. In this context, RNNs have shown a remarkable capability to learn the SMILES syntax [segler2018generating, gupta2018generative] and to capture high-level molecular features ("semantics"), such as physicochemical [segler2018generating, gupta2018generative] and biological properties [merk2018novo, merk2018tuning, grisoni2021combining, yuan2017chemspacemim]. Data augmentation based on SMILES randomization [arus2019randomized, bjerrum2017molecular] or bidirectional learning [grisoni2020bidirectional] has proven efficient for improving the quality of the chemical language learned by RNNs. Most published studies have used SMILES strings or derivative representations; in a few studies, one-letter amino acid sequences were employed for peptide design [muller2018recurrent, nagarajan2018computational, hamid_2018_amp, grisoni2018designing, das2021accelerated]. RNNs have also been applied to predict ligand–protein interactions and the pharmacokinetic properties of drugs [zheng2020predicting, wang2020optimizing], protein secondary structure [senior2019protein, zhou2020combining], and the temporal evolution of molecular trajectories [tsai2020learning]. Finally, RNNs have been applied to molecular feature extraction [bombarelli_automatic_2018, lin_bigru], showing that the learned features outperformed both traditional molecular descriptors and graph-convolution methods for virtual screening and property prediction [bombarelli_automatic_2018].
The Fréchet ChemNet distance [preuer2018frechet], which is based on the physicochemical and biological features learned by an RNN model, has become the de facto reference method to capture molecular similarity in this context.
Molecular Transformers have been applied to CASP, which can be cast as a sequence-to-sequence translation task in which the string representations of the reactants are mapped to those of the corresponding product, or vice versa. Since their initial applications [schwaller2019molecular], Transformers have been employed to predict multistep syntheses [schwaller2020predicting], regio- and stereoselective reactions [pesciullesi2020transfer], enzymatic reaction outcomes [kreutter2021predicting], and reaction yields and classes [schwaller2021prediction, schwaller2020data, schwaller2021mapping]. Recently, Transformers have been applied to molecular property prediction [chithrananda2020chemberta, morris_2020_transformer] and optimization [he2020molecular]. Transformers have also been used for de novo molecule design by learning to translate a target protein sequence into SMILES strings of the corresponding ligands [grechishnikova2021transformer]. Representations learned from SMILES strings by Transformers have shown promise for property prediction in low-data regimes [honda2019smiles]. Furthermore, Transformers have recently been combined with E(3)- and SE(3)-equivariant layers to learn the 3D structures of proteins from their amino acid sequences [jumper2021highly, baek2021accurate]; these equivariant Transformers achieve state-of-the-art performance in protein structure prediction.
Other deep learning approaches have relied on string-based representations for de novo design, e.g., conditional generative adversarial networks [mirza2014conditional, arjovsky2017wasserstein, mendez2020novo] and variational autoencoders [griffiths2020constrained, alperstein2019all]. Most of these models, however, have shown limited or at best equivalent ability to learn SMILES syntax compared with RNNs. 1D CNNs [hirohara2018convolutional, kimber2018synergy] and self-attention networks [zheng2019identifying, lim2020predicting, shin2019self] have been used with SMILES for property prediction. Recently, deep learning on amino acid sequences for property prediction was shown to perform on par with approaches based on human-engineered features [elabd2020amino].
Box 3: Structure-activity landscape modeling with geometric deep learning
This worked example shows how geometric deep learning (GDL) can be used to interpret the structure-activity landscape learned by a trained model. Starting from a publicly available molecular dataset containing estrogen receptor binding information [valsecchi2020nura], we trained an E(3)-equivariant graph neural network (six hidden layers, 128 hidden neurons per layer) and analyzed the learned features and their relationship to ligand binding to the estrogen receptor. The figure shows an analysis of the learned molecular features (third hidden layer, analyzed via principal component analysis; the first two principal components are shown) and how these features relate to the density of active and inactive molecules in the chemical space. The network successfully separated the molecules based on both their experimental bioactivity and their structural features (e.g., atom scaffolds [bemis1996properties]), and might offer novel opportunities for explainable AI with GDL.

4 Conclusions and outlook
Geometric deep learning in chemistry has allowed researchers to leverage the symmetries of different unstructured molecular representations, resulting in greater flexibility and versatility of the available computational models for molecular structure generation and property prediction. Such approaches represent a valid alternative to classical chemoinformatics approaches based on molecular descriptors or other human-engineered features. For modeling tasks that usually require highly engineered rules (e.g., chemical transformations for de novo design, and reactive-site specification for CASP), the benefits of GDL have been consistently shown. In published applications of GDL, each molecular representation has shown characteristic strengths and weaknesses.
Molecular strings, like SMILES, have proven particularly suited for generative deep learning tasks, such as de novo design and CASP. This success may be due to the relatively simple syntax of such a chemical language, which facilitates next-token and sequence-to-sequence prediction. For molecular property prediction, SMILES strings may be limited by their non-univocity, i.e., the fact that multiple valid strings can represent the same molecule.
Molecular graphs have proven particularly useful for property prediction, partly because of their human interpretability and the ease with which desired edge and node features can be included. The incorporation of 3D information (e.g., via equivariant message passing) is useful for quantum chemistry-related modeling, whereas in drug discovery applications this approach has often failed to clearly outweigh the increased model complexity. E(3)-equivariant graph neural networks have also been applied to conformation-aware de novo design [satorras2021gen], but prospective experimental validation studies have not yet been published.
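As an illustration of equivariant message passing, the layer updates of an E(n)-equivariant graph neural network (following Satorras et al.; the exact parameterization varies across implementations) can be written as:

```latex
m_{ij} = \phi_e\!\left( h_i^{(l)}, h_j^{(l)}, \lVert x_i^{(l)} - x_j^{(l)} \rVert^2, a_{ij} \right),
\qquad
x_i^{(l+1)} = x_i^{(l)} + C \sum_{j \neq i} \left( x_i^{(l)} - x_j^{(l)} \right) \phi_x(m_{ij}),
\qquad
h_i^{(l+1)} = \phi_h\!\left( h_i^{(l)}, \sum_{j \neq i} m_{ij} \right)
```

where $h_i$ are node features, $x_i$ atom coordinates, $a_{ij}$ edge attributes, $\phi_e, \phi_x, \phi_h$ learned functions, and $C$ a normalization constant. Because coordinates enter only via pairwise differences and distances, rotating or translating the input rotates or translates the output accordingly.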
Molecular grids have become the de facto standard for 3D representations of large molecular systems, due to (i) their ability to capture information at a user-defined resolution (voxel density) and (ii) the Euclidean structure of the input grid.
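A minimal Python sketch of the idea behind grid representations (an illustrative toy, not a published featurization): atom coordinates are binned into a dense 3D occupancy grid at a chosen resolution, which a 3D CNN can then process like an image.

```python
import math

def voxelize(atoms, grid_size=8, resolution=1.0, origin=(0.0, 0.0, 0.0)):
    """Map atom coordinates onto a dense 3D occupancy grid.

    atoms: iterable of (x, y, z) coordinates in the same units as
    `resolution` (e.g., angstroms). Each occupied voxel is set to 1;
    real featurizations typically use one channel per atom type and a
    smooth density (e.g., a Gaussian per atom) instead of hard occupancy.
    """
    grid = [[[0 for _ in range(grid_size)]
             for _ in range(grid_size)]
            for _ in range(grid_size)]
    for x, y, z in atoms:
        i, j, k = (int(math.floor((c - o) / resolution))
                   for c, o in zip((x, y, z), origin))
        if all(0 <= idx < grid_size for idx in (i, j, k)):
            grid[i][j][k] = 1  # atoms outside the grid are clipped
    return grid

grid = voxelize([(0.2, 0.2, 0.2), (1.6, 0.1, 0.3)], grid_size=4, resolution=1.0)
```

The resolution parameter makes the trade-off explicit: finer voxels capture more geometric detail at a cubically growing memory cost.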
Finally, molecular surfaces are currently at the forefront of GDL. We expect many interesting applications of GDL on molecular surfaces in the near future.
To further the application and impact of GDL in chemistry, an evaluation of the optimal trade-off between algorithmic complexity, performance, and model interpretability will be required. These aspects are crucial for reconciling the “two QSARs” [fujita2016understanding] and connecting the computer science and chemistry communities. We encourage GDL practitioners to include aspects of interpretability in their models (e.g., via XAI [jimenez2020drug]) whenever possible and to communicate transparently with domain experts. The feedback from domain experts will also be crucial to develop new "chemistry-aware" architectures and to further the potential of molecular GDL for concrete prospective applications.
The potential of GDL for molecular feature extraction has not yet been fully explored. Several studies have shown the benefits of learned representations compared to classical molecular descriptors, but in other cases, GDL failed to live up to its promise in terms of superior learned features. Although there are several benchmarks for evaluating machine learning models for property prediction [hu2020open, wu2018moleculenet] and molecule generation [polykovskiy2020molecular, brown2019guacamol], at present, there is no such framework to enable the systematic evaluation of the usefulness of data-driven features learned by AI. Such benchmarks and systematic studies are key to obtaining an unvarnished assessment of deep representation learning. Moreover, investigating the relationships between the learned features and the physicochemical and biological properties of the input molecules will augment the interpretability and applicability of GDL, e.g., to modeling structure-function relationships like structure-activity landscapes (Box 3).
In contrast to conventional QSAR approaches, in which the assessment of the applicability domain (i.e., the region of the chemical space where model predictions are considered reliable) is routinely performed, contemporary GDL studies generally lack such an assessment. This systematic gap might constitute one of the limiting factors to the more widespread use of GDL approaches for prospective studies, as it could lead to unreliable predictions, e.g., for molecules with different mechanisms of action, functional groups, or physicochemical properties than the training data. In the future, it will be necessary to devise “geometry-aware” approaches for applicability domain assessment.
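To illustrate the concept, a naive distance-based applicability-domain check can be sketched in a few lines of Python (an illustrative toy under assumed feature vectors and an assumed threshold, not a recommended AD method): a query molecule is flagged as out-of-domain if its nearest training-set neighbor in feature space is too far away.

```python
import math

def in_applicability_domain(query, training_set, threshold):
    """Naive applicability-domain check: a query molecule, represented as
    a feature vector (e.g., learned GDL features or classical descriptors),
    is considered in-domain if its Euclidean distance to the nearest
    training-set vector does not exceed `threshold`. Both the feature
    space and the threshold are illustrative choices; established
    descriptor-based AD methods are considerably more refined.
    """
    nearest = min(math.dist(query, x) for x in training_set)
    return nearest <= threshold

train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # toy training-set features
in_applicability_domain([0.1, 0.1], train, threshold=0.5)  # close to training data
in_applicability_domain([5.0, 5.0], train, threshold=0.5)  # far from training data
```

A "geometry-aware" variant would replace the Euclidean distance in descriptor space with a distance that respects the symmetries of the learned representation.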
Another opportunity will be to leverage less explored molecular representations for GDL. For instance, the electronic structure of molecules has vast potential for tasks such as CASP, molecular property prediction, and prediction of macromolecular interactions (e.g., protein-protein interactions). Although accurate statistical and quantum mechanical simulations are computationally expensive, modern quantum machine learning models [von2020exploring, christensen2020fchl, huang2020quantum, heinen2020machine, heinen2020quantum] trained on large quantum data collections [ramakrishnan2014quantum, isert2021qmugs, von2020thousands] allow quantum information to be accessed much faster with high accuracy. This aspect could enable quantum and electronic featurization of extensive molecular datasets, to be used as input molecular representations for the task of interest.
Deep learning can be applied to a multitude of biological and chemical representations. The corresponding deep neural network models have the potential to augment human creativity, paving the way for scientific studies that were previously unfeasible. However, research has only explored the tip of the iceberg. One of the most significant catalysts for the integration of deep learning in the molecular sciences may be the commitment of academic institutions to fostering interdisciplinary collaboration, communication, and education. Picking the "high-hanging fruit" will only be possible with a deep understanding of both chemistry and computer science, along with out-of-the-box thinking and collaborative creativity. In such a setting, we expect molecular GDL to increase the understanding of molecular systems and biological phenomena.
5 Acknowledgements
This research was supported by the Swiss National Science Foundation (SNSF, grant no. 205321182176) and the ETH RETHINK initiative.
6 Competing interest
G.S. declares a potential financial conflict of interest as cofounder of inSili.com LLC, Zurich, and in his role as scientific consultant to the pharmaceutical industry.
7 List of abbreviations
AD: Applicability Domain
AI: Artificial Intelligence
CASP: Computer-aided Synthesis Planning
CNN: Convolutional Neural Network
DFT: Density Functional Theory
E(3): Euclidean Symmetry Group
GDL: Geometric Deep Learning
GNN: Graph Neural Network
QSAR: Quantitative Structure-Activity Relationship
RNN: Recurrent Neural Network
SE(3): Special Euclidean Symmetry Group
SMILES: Simplified Molecular Input Line Entry System
XAI: Explainable Artificial Intelligence
1D: One-dimensional
2D: Two-dimensional
3D: Three-dimensional