Geometric Deep Learning on Molecular Representations

07/26/2021 ∙ by Kenneth Atz, et al. ∙ ETH Zurich ∙ TU Eindhoven

Geometric deep learning (GDL), which is based on neural network architectures that incorporate and process symmetry information, has emerged as a recent paradigm in artificial intelligence. GDL bears particular promise in molecular modeling applications, in which various molecular representations with different symmetry properties and levels of abstraction exist. This review provides a structured and harmonized overview of molecular GDL, highlighting its applications in drug discovery, chemical synthesis prediction, and quantum chemistry. Emphasis is placed on the relevance of the learned molecular features and their complementarity to well-established molecular descriptors. Current challenges and opportunities are summarized, and a forecast of the future of GDL for the molecular sciences is presented.




1 Introduction

Recent advances in deep learning, which is an instance of artificial intelligence (AI) based on neural networks [lecun2015deep, schmidhuber2015deep], have led to numerous applications in the molecular sciences, e.g., in drug discovery [gawehn2016deep, jimenez2021artificial], quantum chemistry [gilmer2017neural], and structural biology [jumper2021highly, baek2021accurate]. Two characteristics of deep learning render it particularly promising when applied to molecules. First, deep learning methods can cope with "unstructured" data representations, such as text sequences [vaswani2017attention, brown2020language], speech signals [hinton2012deep, mikolov2011strategies], images [krizhevsky2017imagenet, farabet2012learning, tompson2014joint], and graphs [bronstein2017geometric, monti2019fake]. This ability is particularly useful for molecular systems, for which chemists have developed many models (i.e., "molecular representations") that capture molecular properties at varying levels of abstraction (Figure 1). The second key characteristic is that deep learning can perform feature extraction (feature learning), that is, produce data-driven features from the input data without the need for manual intervention. These two characteristics make deep learning a promising complement to "classical" machine learning applications (e.g., quantitative structure-activity relationship [QSAR] modeling), in which molecular features (i.e., "molecular descriptors" [todeschini2009molecular]) are encoded a priori with rule-based algorithms. The capability to learn from unstructured data and obtain data-driven molecular features has led to unprecedented applications of AI in the molecular sciences.

One of the most promising advances in deep learning is geometric deep learning (GDL). Geometric deep learning is an umbrella term encompassing emerging techniques which generalize neural networks to Euclidean and non-Euclidean domains, such as graphs, manifolds, meshes, or string representations [bronstein2017geometric]. In general, GDL encompasses approaches that incorporate a geometric prior, i.e., information on the structure space and symmetry properties of the input variables. Such a geometric prior is leveraged to improve the quality of the information captured by the model. Although GDL has been increasingly applied to molecular modeling [gainza2020deciphering, segler2018generating, gilmer2017neural], its full potential in the field is still untapped.

Figure 1: Exemplary molecular representations for a selected molecule (i.e., the penam substructure of penicillin)
a. Two-dimensional (2D) depiction (Kekulé structure).
b. Molecular graph (2D), composed of vertices (atoms) and edges (bonds).
c. SMILES string [weininger1988smiles], in which atom type, bond type and connectivity are specified by alphanumerical characters.
d. Three-dimensional (3D) graph, composed of vertices (atoms), their positions (x, y, z coordinates) in 3D space, and edges (bonds).
e. Molecular surface represented as a mesh colored according to the respective atom types.

The aim of this review is to (i) provide a structured and harmonized overview of the applications of GDL on molecular systems, (ii) delineate the main research directions in the field, and (iii) provide a forecast of the future impact of GDL. Three fields of application are highlighted, namely drug discovery, quantum chemistry, and computer-aided synthesis planning (CASP), with particular attention to the data-driven molecular features learned by GDL methods. A glossary of selected terms can be found in Box 1.

2 Principles of geometric deep learning

The term geometric deep learning was coined in 2017 [bronstein2017geometric]. Although GDL was originally used for methods applied to non-Euclidean data [bronstein2017geometric], it now extends to all deep learning methods that incorporate geometric priors [bronstein2021geometric], that is, information about the structure and symmetry of the system of interest. Symmetry is a crucial concept in GDL, as it encompasses the properties of the system with respect to manipulations (transformations), such as translation, reflection, rotation, scaling, or permutation (Box 2).

Symmetry is often recast in terms of invariance and equivariance to express the behavior of any mathematical function with respect to a transformation T (e.g., rotation, translation, reflection, or permutation) of an acting symmetry group [marsden1974reduction]. Here, the mathematical function is a neural network f applied to a given molecular input x. f can therein transform equivariantly, invariantly, or neither with respect to T, as described below:

  • Equivariance. A neural network f applied to an input x is equivariant to a transformation T if the transformation of the input commutes with a corresponding transformation T′ of the output, T′ belonging to the same symmetry group, such that: f(T(x)) = T′(f(x)). Neural networks are therefore equivariant to the actions of a symmetry group on their inputs if and only if each layer of the network "equivalently" transforms under any transformation of that group.

  • Invariance. Invariance is a special case of equivariance: f is invariant to T if T′ is the trivial group action (i.e., the identity): f(T(x)) = f(x).

  • f is neither invariant nor equivariant to T when (i) the transformation of the input does not commute with the transformation of the output, f(T(x)) ≠ T′(f(x)), and (ii) T′ is not the trivial group action.
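These definitions can be checked numerically. The following sketch (plain Python; all function names are illustrative, not from the review) verifies that the sum of pairwise distances is invariant, and the centroid equivariant, under a rotation:

```python
# Sketch: checking invariance and equivariance of two simple point-cloud
# functions under a rotation T. Function names are illustrative only.
import math

def rotate_z(points, theta):
    """Rotate 3D points about the z axis (a rigid SE(3) transformation)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y, z) for (x, y, z) in points]

def sum_pairwise_distances(points):
    """Invariant feature: depends only on inter-point distances."""
    return sum(
        math.dist(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    )

def centroid(points):
    """Equivariant feature: rotating the input rotates the output."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))

pts = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 2.0, 1.0)]
theta = 0.7

# Invariance: f(T(x)) == f(x)
assert math.isclose(sum_pairwise_distances(rotate_z(pts, theta)),
                    sum_pairwise_distances(pts))

# Equivariance: f(T(x)) == T'(f(x)), here with T' = T
lhs = centroid(rotate_z(pts, theta))
rhs = rotate_z([centroid(pts)], theta)[0]
assert all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(lhs, rhs))
```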

The symmetry properties of a neural network architecture vary depending on the network type and the symmetry group of interest and are individually discussed in the following sections. Readers can find an in-depth treatment of equivariance and group equivariant layers in neural networks elsewhere [cohen2016group, cohen2016steerable, cohen2018spherical, kondor2018generalization].

The concepts of equivariance and invariance can also be used in reference to the molecular features obtained from a given molecular representation, depending on their behavior when a transformation is applied to that representation. For instance, many molecular descriptors are invariant to the rotation and translation of the molecular representation by design [todeschini2009molecular], e.g., the Moriguchi octanol-water partitioning coefficient [moriguchi1992simple], which relies only on the occurrence of specific molecular substructures for its calculation. The symmetry properties of molecular features extracted by a neural network depend both on the symmetry properties of the input molecular representation and on those of the utilized neural network.

Many relevant molecular properties (e.g., equilibrium energies, atomic charges, or physicochemical properties such as permeability, lipophilicity, or solubility) are invariant to certain symmetry operations (Box 2). In many tasks in chemistry, it is thus desirable to design neural networks that transform equivariantly under the actions of pre-defined symmetry groups. Exceptions occur if the targeted property changes upon a symmetry transformation of the molecule (e.g., chiral properties, which change under inversion of the molecule, or vector properties, which change under rotation of the molecule). In such cases, the inductive bias (learning bias) of equivariant neural networks would not allow for the differentiation of symmetry-transformed molecules.

While neural networks can be considered universal function approximators [cybenko1989approximation], incorporating prior knowledge such as reasonable geometric information (geometric priors) has evolved into a core design principle of neural network modeling [bronstein2021geometric]. By incorporating geometric priors, GDL increases the quality of the model and bypasses several bottlenecks related to the need to force the data into Euclidean geometries (e.g., by feature engineering). Moreover, GDL provides novel modeling opportunities, such as data augmentation in low-data regimes [tetko2020state, skinnider2021chemical].


Box 1: Glossary of selected terms

CoMFA and CoMSIA. Comparative Molecular Field Analysis (CoMFA) [cramer1988comparative] and Comparative Molecular Similarity Indices Analysis (CoMSIA) [klebe1998comparative] are popular 3D QSAR methods developed in the 1980s and 1990s, in which three-dimensional grids are used to capture the distributions of molecular features (e.g., steric, hydrophobic, and electrostatic properties). The obtained molecular descriptors serve as inputs to a regression model for quantitative bioactivity prediction.

Convolution. Operation within a neural network that transforms a feature space into a new feature space and thereby captures the local information found in the data. Convolutions were first introduced for pixels in images [lecun1995convolutional, lecun1998gradient] but the term "convolution" is now used for neural network architectures covering a variety of data structures such as graphs, point clouds, spheres, grids, or manifolds.

Density Functional Theory (DFT). A quantum mechanical modeling approach used to investigate the electronic structure of molecules.

Data augmentation. Artificial increase of the data volume available for model training, often achieved by leveraging symmetrical properties of the input data which are not captured by the model (e.g., rotation or permutation).

Feature. An individually measurable or computationally obtainable characteristic of a given sample (e.g., molecule), in the form of a scalar. In this review, the term refers to a numeric value characterizing a molecule. Such molecular features can be computed with rule-based algorithms ("molecular descriptors") or generated automatically by deep learning from a molecular representation ("hidden" or "learned" features).

Geometric prior. An inductive bias incorporating information on the symmetric nature of the system of interest into the neural network architecture. Also known as symmetry prior.

Inductive bias. Set of assumptions that a learning algorithm (e.g., a neural network) uses to learn the target function and to make predictions on previously unseen data points.

One-hot encoding. Method for representing categorical variables as numerical arrays by obtaining a binary variable (0, 1) for each category. It is often used to convert sequences (e.g., SMILES strings) into numerical matrices suitable as inputs and/or outputs of deep learning models (e.g., chemical language models).

Quantitative Structure-Activity Relationship (QSAR). Machine learning techniques aimed at finding an empirical relationship between the molecular structure (usually encoded as molecular descriptors) and experimentally determined molecular properties, such as pharmacological activity or toxicity.

Reinforcement learning. A technique used to steer the output of a machine learning algorithm toward user-defined regions of optimality via a predefined reward function [sutton2018reinforcement].

Transfer learning. Transfer of knowledge from an existing deep learning model to a related task for which fewer training samples are available [pan2009survey].

Unstructured data. Data that are not arranged as vectors of (typically handcrafted) features. Examples of unstructured data include graphs, images, and meshes. Molecular representations are typically unstructured, whereas numerical molecular descriptors (e.g., molecular properties, molecular "fingerprints") are examples of structured data.

Voxel. Element of a regularly spaced, 3D grid (equivalent to a pixel in 2D space).


GDL approach | Molecular representation(s) | Applications
------------ | --------------------------- | ------------
Graph neural networks | 2D molecular graph | Molecular property prediction [feinberg2018potentialnet, jimenez2021coloring], CASP [somnath2020learning, jin2017predicting], and generative molecular design [zhou2019optimization, jin2018junction]
Equivariant graph neural networks | 3D molecular graph and point cloud | Prediction of quantum chemical energies [e3nn2020qm9L1, cormorant, satorras2021n], forces [fuchs2020se, satorras2021n, schutt2021equivariant], and wave-functions [unke2021se3equivariant]
3D convolutional neural networks | 3D grid | Structure-based drug design and property prediction [jimenez2018k, jimenez2019deltadelta, jimenez2017deepsite]
Mesh convolutional neural networks | Surface encoded as a mesh (represented as 2D grid or 3D graph) | Protein-protein interaction prediction and ligand-pocket fingerprinting [gainza2020deciphering]
Recurrent neural networks | String notation (1D grid) | Generative molecular design [segler2018generating, grisoni2018designing, merk2018tuning], synthesis planning [schwaller2018found], protein structure prediction [senior2019protein, zhou2020combining], and prediction of properties in drug discovery [wang2020optimizing, zheng2020predicting]
Transformers | String notation (encoded as a graph) | Prediction of properties in drug discovery [mayr2018large], synthesis planning [schwaller2018molecular, schwaller2018found], prediction of reaction yields [schwaller2021prediction], generative molecular design [grechishnikova2021transformer], and protein structure prediction [jumper2021highly, baek2021accurate]

Table 1: Summary of selected geometric deep learning (GDL) approaches for molecular modeling. For each approach, the utilized molecular representation(s) and selected applications are reported. 1D, one-dimensional; 2D, two-dimensional; 3D, three-dimensional.

3 Molecular GDL

The application of GDL to molecular systems is challenging, in part because there are multiple valid ways of representing the same molecular entity. (Note that in this review the term "representation" is used solely to denote human-made models of molecules, e.g., molecular graphs, 3D conformers, SMILES strings. To avoid confusion with other usages of the word "representation" in deep learning, we will use the term "feature" whenever referring to any numerical description of molecules, obtained either with rule-based algorithms [molecular descriptors] or learned [extracted] by neural networks.) Molecular representations can be categorized based on their different levels of abstraction and the physicochemical and geometrical aspects they capture. Importantly, all of these representations are models of the same reality and are thus "suitable for some purposes, not for others" [hoffmann1991representation]. GDL provides the opportunity to experiment with different representations of the same molecule and to leverage their intrinsic geometrical features to increase the quality of the model. Moreover, GDL has repeatedly proven useful in providing insights into the molecular properties relevant for the task at hand, thanks to its feature extraction (feature learning) capabilities. In the following sections, we delineate the most prevalent molecular GDL approaches and their applications in chemistry, grouped according to the respective molecular representations used for deep learning: molecular graphs, grids, strings, and surfaces.


Box 2: Euclidean symmetries in molecular systems

Molecular systems (and three-dimensional representations thereof) can be considered as objects in Euclidean space. In such a space, one can apply several symmetry operations (transformations) that are (i) performed with respect to three symmetry elements (i.e., line, plane, point), and (ii) rigid, that is, they preserve the Euclidean distance between all pairs of atoms (i.e., isometry). The Euclidean transformations are as follows:


  • Rotation. Movement of an object with respect to the radial orientation to a given point.

  • Translation. Movement of every point of an object by the same distance in a given direction.

  • Reflection. Mapping of an object to itself through a point (inversion), a line or a plane (mirroring).

All three transformations and their arbitrary finite combinations are included in the Euclidean group [E(3)]. The special Euclidean group [SE(3)] comprises only translations and rotations.

Molecules are always symmetric in the SE(3) group, i.e., their intrinsic properties (e.g., biological and physicochemical properties, and equilibrium energy) are invariant to coordinate rotation and translation, and combinations thereof. Several molecules are chiral, that is, some of their (chiral) properties depend on the absolute configuration of their stereogenic centers, and are thus non-invariant to molecule reflection. Chirality plays a key role in chemical biology; relevant examples of chiral molecules are DNA, and several drugs whose isomers exhibit markedly different pharmacological and toxicological properties [nguyen2006chiral].
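The distinction matters in practice: a hypothetical sketch (plain Python, not from the review) shows that reflection preserves all pairwise distances, yet flips the sign of the signed volume spanned by four atoms. Features invariant to the full E(3) group (such as distance-based descriptors) therefore cannot distinguish mirror images, whereas a pseudoscalar quantity can:

```python
# Sketch (illustrative): reflection is an E(3) isometry, so it preserves all
# pairwise distances, but it flips the sign of the signed tetrahedron volume,
# which is how mirror images (enantiomers) can be told apart.
import math

def reflect_xy(points):
    """Mirror through the xy plane: z -> -z."""
    return [(x, y, -z) for (x, y, z) in points]

def signed_volume(p0, p1, p2, p3):
    """Signed volume of the tetrahedron (p0, p1, p2, p3) via a 3x3 determinant."""
    a = [p1[k] - p0[k] for k in range(3)]
    b = [p2[k] - p0[k] for k in range(3)]
    c = [p3[k] - p0[k] for k in range(3)]
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            - a[1] * (b[0] * c[2] - b[2] * c[0])
            + a[2] * (b[0] * c[1] - b[1] * c[0]))

pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
mirrored = reflect_xy(pts)

# Distances are invariant under reflection ...
assert all(
    math.isclose(math.dist(pts[i], pts[j]), math.dist(mirrored[i], mirrored[j]))
    for i in range(4) for j in range(i + 1, 4)
)
# ... but the signed volume changes sign.
assert signed_volume(*pts) == -signed_volume(*mirrored)
```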

3.1 Learning on molecular graphs

Figure 2: Deep learning on molecular graphs.
a. Message passing graph neural networks applied to two-dimensional (2D) molecular graphs: a 2D molecular graph is labeled with vertex (atom) features and edge (bond) features. Vertex features are updated by iterative message passing for a defined number of time steps across each pair of vertices connected via an edge. After the last message-passing convolution, the final vertex features can be (i) mapped to a bond or atom property, or (ii) aggregated to form molecular features that can be mapped to a molecular property.
b. E(3)-equivariant message passing graph neural networks applied to three-dimensional (3D) molecular graphs: 3D graphs are labeled with atom features, their absolute coordinates in 3D space, and their edge features. Iterative spherical convolutions are used to obtain data-driven atomic features, which can be mapped to atomic properties, or aggregated and mapped to molecular properties.

3.1.1 Molecular graphs

Graphs are among the most intuitive ways to represent molecular structures [kipf2016semi]. Any molecule can be thought of as a mathematical graph G = (V, E), whose vertices V represent atoms and whose edges E constitute their connections (Figure 1b). In many deep learning applications, molecular graphs are further characterized by a set of vertex and edge features.

3.1.2 Graph neural networks

Deep learning methods devoted to handling graphs as input are commonly referred to as graph neural networks (GNNs). When applied to molecules, GNNs allow for feature extraction by progressively aggregating information from atoms and their molecular environments (Figure 2a, [battaglia2016interaction, battaglia2018relational]). Different GNN architectures have been introduced [zhou2020graph], the most popular of which fall under the umbrella term of message passing neural networks [geerts2020let, duvenaud2015convolutional, gilmer2017neural]. Such networks iteratively update the vertex features h_v^(l) of the l-th network layer via graph convolutional operations, employing at least two learnable functions φ and ψ, and a local permutation-invariant aggregation operator ⊕ (e.g., sum): h_v^(l+1) = ψ(h_v^(l), ⊕_{u∈N(v)} φ(h_v^(l), h_u^(l), e_uv)).
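The message-passing update can be made concrete with a small sketch (plain Python; `phi`, `psi`, and all weights are toy stand-ins for learnable networks, chosen only for illustration):

```python
# Minimal message-passing sketch (illustrative; phi and psi stand in for
# learnable networks). Vertex features are scalars; the graph is an edge list.
def message_passing_step(h, edges, edge_feat, phi, psi):
    """One update: h_v <- psi(h_v, sum over neighbors u of phi(h_v, h_u, e_uv))."""
    msg = {v: 0.0 for v in h}
    for (u, v), e in zip(edges, edge_feat):
        # sum is a permutation-invariant aggregator over neighbors
        msg[v] += phi(h[v], h[u], e)
        msg[u] += phi(h[u], h[v], e)  # undirected molecular graph
    return {v: psi(h[v], m) for v, m in msg.items()}

# Toy "learnable" functions with fixed weights
phi = lambda hv, hu, e: 0.5 * hu + 0.1 * e
psi = lambda hv, m: hv + m

# Path graph with three vertices and two bonds (e.g., a three-atom chain)
h = {0: 1.0, 1: 2.0, 2: 1.0}
edges = [(0, 1), (1, 2)]
edge_feat = [1.0, 1.0]

h1 = message_passing_step(h, edges, edge_feat, phi, psi)
# Terminal vertices aggregate one message; the central vertex aggregates two.
```

Because the aggregation is a sum, the result does not depend on the order in which edges are listed, which is the permutation invariance the main text refers to.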

Since their introduction as a means to predict quantum chemical properties of small molecules at the density functional theory (DFT) level [gilmer2017neural], GNNs have found many applications in quantum chemistry [klicpera2020directional, mxmnet2020molecular, withnall2020building, tang2020self, goodall2020predicting], drug discovery [stokes2020deep, feinberg2018potentialnet, torng2019graph], CASP [somnath2020learning], and molecular property prediction [li2017learning, liu2019chemi]. When applied to quantum chemistry tasks, GNNs often use E(3)-invariant 3D information by including radial and angular information into the edge features of the graph [unke2019physnet, klicpera2020fast, klicpera2020directional, mxmnet2020molecular, schutt2021equivariant], thereby improving the prediction accuracy of quantum chemical forces and energies for equilibrium and non-equilibrium molecular conformations, as in the case of SchNet [schutt2018schnet, schutt2017quantum] and PaiNN [schutt2021equivariant]. SchNet-like architectures were used to predict quantum mechanical wave-functions in the form of Hartree-Fock and DFT density matrices [schutt2019unifying], and differences in quantum properties obtained by DFT and coupled cluster level-of-theory calculations [bogojeski2020quantum].

GNNs for molecular property prediction have been shown to outperform human-engineered molecular descriptors for several biologically relevant properties [yang2019analyzing]. Although including 3D information into molecular graphs generally improved the prediction of drug-relevant properties, no marked difference was observed between using a single or multiple molecular conformers for network training [axelrod2020molecular]. Because of their natural connection with molecular representations, GNNs seem particularly suitable in the context of explainable AI (XAI) [gunning2017explainable, jimenez2020drug], where they have been used to interpret models predicting molecular properties of preclinical relevance [jimenez2021coloring] and quantum chemical properties [schnake2020xai].

GNNs have been used for de novo molecule generation [Battaglia2018learning, simonovsky2018graphvae, de2018molgan, zhou2019optimization], for example by performing vertex and edge addition from an initial vertex [Battaglia2018learning] (Figure 2b). GNNs have also been combined with variational autoencoders [jin2018junction, de2018molgan, simonovsky2018graphvae, flam2021mpgvae] and reinforcement learning [you2018graph, jin2020multi, zhou2019optimization]. Finally, GNNs have been applied to CASP [somnath2020learning, coley2019graph, lei2017deriving]; however, the current approaches are limited to reactions in which one bond is removed between the products and the reactants.

3.1.3 Equivariant message passing

A recent area of development of graph-based methods is SE(3)- and E(3)-equivariant GNNs (equivariant message passing networks), which deal with the absolute coordinate systems of 3D graphs [e3nn2018, smidt2021finding] (Figure 2b). These networks may therefore be particularly well suited to 3D molecular representations, as they exploit the Euclidean symmetries of the system (Box 2).

3D molecular graphs, in addition to their vertex and edge features, also encode the position of each vertex in a 3D coordinate system. By employing E(3)- [satorras2021n] and SE(3)-equivariant [e3nn2018] convolutions, such networks have shown high accuracy in predicting several quantum chemical properties, such as energies [clebschgordonnet, cormorant, smidt2020euclidean, e3nn2020qm9L1, fuchs2020se, hutchinson2020lietransformer, schutt2021equivariant], interatomic potentials for molecular dynamics simulations [fuchs2020se, batzner2021se, satorras2021n], and wave-functions [unke2021se3equivariant]. SE(3)-equivariant neural networks possess reflection-equivariance and thereby enable the model to distinguish between chiral molecules [e3nn2018]. SE(3) neural networks are computationally expensive due to their use of spherical harmonics [muller2006spherical] and Wigner D-functions [dray1986unified] to compute learnable weight kernels. E(3)-equivariant neural networks are computationally more efficient and have been shown to perform equal to, or better than, SE(3)-equivariant networks, e.g., for the modeling of quantum chemical properties and dynamic systems [satorras2021n]. Equivariant message passing networks have been applied to predict the quantum mechanical wave-function of nuclei- and electron-based representations in an end-to-end fashion [hermann2020deep, pfau2020ab, choo2020fermionic]. However, such networks are currently limited to small molecular systems because of the large size of the learned matrices, which scale quadratically with the number of electrons in the system.
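A coordinate update in the spirit of the E(n)-equivariant GNN of Satorras et al. [satorras2021n] can be sketched as follows (illustrative; `weight` stands in for a learnable network acting on the rotation-invariant squared distances):

```python
# Sketch of an E(n)-equivariant coordinate update (EGNN-style). Because the
# update uses only relative vectors and invariant squared distances, rotating
# the input is equivalent to rotating the output.
import math

def egnn_coordinate_update(xs, weight):
    """x_i <- x_i + sum over j of (x_i - x_j) * weight(||x_i - x_j||^2)."""
    new = []
    for i, xi in enumerate(xs):
        delta = [0.0, 0.0, 0.0]
        for j, xj in enumerate(xs):
            if i == j:
                continue
            d2 = sum((a - b) ** 2 for a, b in zip(xi, xj))
            w = weight(d2)
            delta = [dk + (a - b) * w for dk, (a, b) in zip(delta, zip(xi, xj))]
        new.append(tuple(a + d for a, d in zip(xi, delta)))
    return new

def rotate_z(points, theta):
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y, z) for (x, y, z) in points]

weight = lambda d2: 1.0 / (1.0 + d2)  # toy stand-in for a learnable network
xs = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 1.0)]

# Equivariance check: update(rotate(x)) == rotate(update(x))
lhs = egnn_coordinate_update(rotate_z(xs, 0.3), weight)
rhs = rotate_z(egnn_coordinate_update(xs, weight), 0.3)
assert all(math.isclose(a, b, abs_tol=1e-12) for p, q in zip(lhs, rhs)
           for a, b in zip(p, q))
```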

3.2 Learning on grids

Grids capture the properties of a system at regularly spaced intervals. Based on the number of dimensions included in the system, grids can be 1D (e.g., sequences), 2D (e.g., RGB images), 3D (e.g., cubic lattices), or higher-dimensional. Grids are defined by a Euclidean geometry and can be considered as graphs with a special adjacency, where (i) the vertices have a fixed ordering that is defined by the spatial dimensions of the grid, and (ii) each vertex has an identical number of adjacent edges and is therefore structurally indistinguishable from all other vertices [bronstein2021geometric]. These two properties render local convolutions applied to a grid inherently permutation invariant, and provide a strong geometric prior for translation invariance (e.g., by weight sharing in convolutions). These grid properties have critically determined the success of convolutional neural networks (CNNs), e.g., in computer vision [lecun1998gradient, lecun1995convolutional], natural language processing [hochreiter1997long, brown2020language], and speech recognition [hinton2012deep, mikolov2011strategies].

3.2.1 Molecular grids

Molecules can be represented as grids in different ways. 2D grids (e.g., molecular structure drawings) are generally more useful for visualization than for prediction, with few exceptions [tabchouri2019machine]. Analogous to popular pre-deep-learning approaches, for example Comparative Molecular Field Analysis (CoMFA) [cramer1988comparative] and Comparative Molecular Similarity Indices Analysis (CoMSIA) [klebe1998comparative], 3D grids are often used to capture the spatial distribution of properties within one (or more) molecular conformers. Such representations then serve as inputs to 3D CNNs. 3D CNNs are characterized by greater resource efficiency than equivariant GNNs, which until now have mainly been applied to molecules with fewer than approximately 100 atoms. Thus, 3D CNNs are a method of choice when the protein structure has to be considered, e.g., for protein-ligand binding affinity prediction [jimenez2018k, ragoza2017protein, li2019deepatom, karimi2019deepaffinity, jimenez2019deltadelta] or active-site recognition [jimenez2017deepsite].
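The voxelization step that turns atom coordinates into a 3D CNN input can be sketched as follows (plain Python; the grid parameters and occupancy encoding are hypothetical simplifications of what real pipelines use):

```python
# Sketch (illustrative): voxelizing atom coordinates into a 3D occupancy grid,
# the kind of Euclidean input a 3D CNN consumes.
def voxelize(coords, origin, spacing, shape):
    """Map 3D points into a dense occupancy grid of `shape` voxels."""
    nx, ny, nz = shape
    grid = [[[0 for _ in range(nz)] for _ in range(ny)] for _ in range(nx)]
    for (x, y, z) in coords:
        i = int((x - origin[0]) / spacing)
        j = int((y - origin[1]) / spacing)
        k = int((z - origin[2]) / spacing)
        if 0 <= i < nx and 0 <= j < ny and 0 <= k < nz:
            grid[i][j][k] = 1  # occupancy; real models use per-channel densities
    return grid

# Three atoms on a 4x4x4 grid with 1 A voxels
atoms = [(0.2, 0.2, 0.2), (1.6, 0.1, 0.3), (1.7, 0.2, 0.4)]
grid = voxelize(atoms, origin=(0.0, 0.0, 0.0), spacing=1.0, shape=(4, 4, 4))

occupied = sum(v for plane in grid for row in plane for v in row)
# Two voxels are occupied: the last two atoms fall into the same voxel
assert occupied == 2
```

In practice, separate channels per atom type (or per property field, as in CoMFA-like grids) replace the single occupancy channel shown here.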

3.3 Learning on molecular surfaces

Molecular surfaces are defined by the surface enclosing the 3D structure of a molecule at a certain distance from each atom center. Each point on such a continuous surface is characterized by its chemical (e.g., hydrophobicity, electrostatics) and geometric (e.g., shape, curvature) features. From a geometrical perspective, molecular surfaces can be considered as 3D meshes, i.e., sets of polygons (faces) defined by vertices that specify the mesh coordinates in 3D space [ahmed2018survey]. The vertices can be represented by a 2D grid structure (where four vertices on the mesh define a pixel) or by a 3D graph structure. The grid- and graph-based structures of meshes enable the application of 2D CNNs and GNNs to learn on mesh-based molecular surfaces. Recently, 2D CNNs have been applied to mesh-based representations of protein surfaces to predict protein-protein complexes and protein binding sites [gainza2020deciphering]. However, 2D CNNs applied to meshes come with certain limitations, such as the need for rotational data augmentation (owing to the networks' lack of rotational invariance) and for homogeneous mesh resolution. Recently introduced GNNs for mesh-based representations incorporate rotational equivariance into their network architecture and allow for heterogeneous mesh resolution (i.e., non-uniform distances between the vertices in 3D) [pfaff2020learning]. Such GNNs are computationally efficient and hold potential for modeling macromolecular structures; however, they have not yet found applications in molecular systems. Other studies have used 3D voxel-based surface representations of (macro)molecules as inputs to 3D CNNs, e.g., for protein-ligand affinity [liu2021octsurf] and protein binding-site [mylonas2020deepsurf] prediction.

3.4 Learning on string representations

3.4.1 Molecular strings

Molecules can be represented as molecular strings, i.e., linear sequences of alphanumeric symbols. Molecular strings were originally developed as manual ciphering tools to complement systematic chemical nomenclature [barnard2003representation, wiswesser1985historic] and later became suitable for data storage and retrieval. Some of the most popular string-based representations are the Wiswesser Line Notation [wln_1952], the Sybyl line notation [ash1997sybyl], the International Chemical Identifier (InChI) [heller2013inchi], Hierarchical Editing Language for Macromolecules [zhang_2012_helm], and the Simplified Molecular Input Line Entry System (SMILES) [weininger1988smiles].

Each type of linear representation can be considered as a "chemical language." In fact, such notations possess a defined syntax, i.e., not all possible combinations of alphanumerical characters will lead to a “chemically valid” molecule. Furthermore, these notations possess semantic properties: depending on how the elements of the string are combined, the corresponding molecule will have different physicochemical and biological properties. These characteristics make it possible to extend the deep learning methods developed for language and sequence modeling to the analysis of molecular strings for "chemical language modeling" [ozturk2020exploring, cadeddu_2014_language].

SMILES strings – in which letters are used to represent atoms, and symbols and numbers are used to encode bond types, connectivity, branching, and stereochemistry (Figure 3a) – have become the most frequently employed representation for sequence-based deep learning [segler2018generating, schwaller2018found]. Although several other string representations have been tested in combination with deep learning, e.g., InChI [gomez2018automatic], DeepSMILES [deepsmiles], and self-referencing embedded strings (SELFIES) [krenn2020self], SMILES remains the de facto representation of choice for chemical language modeling. The following text introduces the most prominent chemical language modeling methods, along with selected examples of their application to chemistry.
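The character-level one-hot encoding commonly used to feed SMILES strings to sequence models can be sketched as follows (the vocabulary here is a toy example; real pipelines derive it from the training corpus and tokenize multi-character atoms such as "Cl" as single tokens):

```python
# Sketch (illustrative): one-hot encoding a SMILES string at the character
# level, a typical input format for recurrent chemical language models.
def one_hot_smiles(smiles, vocabulary):
    """Return a len(smiles) x len(vocabulary) binary matrix."""
    index = {ch: i for i, ch in enumerate(vocabulary)}
    matrix = []
    for ch in smiles:
        row = [0] * len(vocabulary)
        row[index[ch]] = 1
        matrix.append(row)
    return matrix

# Toy vocabulary; real models also add start/end/padding tokens
vocab = ["C", "O", "N", "=", "(", ")", "1"]
mat = one_hot_smiles("CC(=O)N", vocab)  # acetamide
# One row per character, exactly one 1 per row
```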

Figure 3: Chemical language modeling.
a. SMILES strings, in which atom types are represented by their element symbols, and bond types and branching are indicated by other predefined alphanumeric symbols. For each molecule, the SMILES algorithm yields a string of symbols ("tokens") that encodes the molecular connectivity, illustrated here via the colors indicating the corresponding atomic positions in the graph (left) and string (right). A molecule can be encoded by different SMILES strings depending on the chosen starting atom. Three random permutations incorporating identical molecular information are shown.
b. Recurrent neural networks, at any sequence position t, learn to predict the next token of a sequence given the current sequence and the hidden state h_t.
c. Transformer-based language models, in which the input sequence is structured as a graph. Vertices are featurized according to their token identity (e.g., via token embedding) and their position in the sequence (e.g., via sinusoidal positional encoding). During transformer learning, the vertices are updated via residual attention blocks. After passing the attention layers, an individual feature representation for each token is obtained.

3.4.2 Chemical language models

Chemical language models are machine learning methods that can handle molecular sequences as inputs and/or outputs. The most common algorithms for chemical language modeling are recurrent neural networks (RNNs) and Transformers:

  • RNNs (Figure 3b) [rumelhart1985learning] are neural networks that process sequence data as Euclidean structures, usually via one-hot encoding. RNNs model a dynamic system in which the hidden state of the network at any t-th time point (i.e., at any t-th position in the sequence) depends on both the current observation and the previous hidden state. RNNs can process sequence inputs of arbitrary length and produce outputs of arbitrary length. RNNs are often used in an "auto-regressive" fashion, i.e., to predict the probability distribution over the possible next elements (tokens) at the next time step, given the current hidden state and the preceding portion of the sequence. Several RNN architectures have been proposed to solve the vanishing or exploding gradient problems of "vanilla" RNNs [hochreiter1998vanishing, pascanu2013difficulty], such as long short-term memory (LSTM) and gated recurrent unit (GRU) networks.


  • Transformers (Figure 3c) process sequence data as non-Euclidean structures, by encoding sequences as either (i) a fully connected graph, or (ii) a sequentially connected graph, in which each token is only connected to the previous tokens in the sequence. The former approach is often used for feature extraction (e.g., in a Transformer-encoder), whereas the latter is employed for next-token prediction (e.g., in a Transformer-decoder). The positional information of tokens is usually encoded by positional embedding or sinusoidal positional encoding [vaswani2017attention]. Transformers combine this graph-like processing with so-called attention layers, which allow the model to focus on ("pay attention to") the tokens perceived as relevant for each prediction. Transformers have been particularly successful in sequence-to-sequence tasks, such as language translation.
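The auto-regressive RNN update described above can be sketched in a few lines of NumPy. All parameter shapes and sizes here are illustrative, and the weights are random rather than trained:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden = 32, 64  # illustrative sizes

# Randomly initialized parameters (a real chemical language model learns these).
W_xh = rng.normal(0, 0.1, (hidden, vocab_size))
W_hh = rng.normal(0, 0.1, (hidden, hidden))
W_hy = rng.normal(0, 0.1, (vocab_size, hidden))
b_h = np.zeros(hidden)
b_y = np.zeros(vocab_size)

def rnn_step(x_t, h_prev):
    """One vanilla-RNN update: new hidden state and next-token distribution."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
    logits = W_hy @ h_t + b_y
    p_next = np.exp(logits - logits.max())
    p_next /= p_next.sum()  # softmax over the token vocabulary
    return h_t, p_next

# Auto-regressive decoding: feed the sampled token back as the next input.
h = np.zeros(hidden)
token = 0  # e.g., a start-of-sequence token
for _ in range(5):
    x = np.zeros(vocab_size); x[token] = 1.0  # one-hot encoding
    h, p = rnn_step(x, h)
    token = int(rng.choice(vocab_size, p=p))
```

In SMILES generation, the sampling loop runs until an end-of-sequence token is drawn, yielding one candidate molecule per decoded string.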
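The sinusoidal positional encoding used by Transformers [vaswani2017attention] assigns each sequence position a deterministic vector of sines and cosines at geometrically spaced frequencies. A minimal NumPy sketch (assuming an even model dimension):

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle).
    Assumes d_model is even."""
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.empty((n_positions, d_model))
    pe[:, 0::2] = np.sin(angle)  # even dimensions
    pe[:, 1::2] = np.cos(angle)  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(16, 8)
# position 0 encodes as sin(0) = 0 on even and cos(0) = 1 on odd dimensions
```

Because the encoding is fixed rather than learned, it extends naturally to sequence lengths not seen during training.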

Extending early studies [yuan2017chemspacemim, segler2018generating, bjerrum2017molecular], RNNs for next-token prediction have been routinely applied to the de novo generation of molecules with desired biological or physicochemical properties, in combination with transfer [segler2018generating, gupta2018generative, merk2018novo, merk2018tuning] or reinforcement learning [olivecrona2017molecular, popova2018deep]. In this context, RNNs have shown a remarkable capability to learn the SMILES syntax [segler2018generating, gupta2018generative] and to capture high-level molecular features ("semantics"), such as physicochemical [segler2018generating, gupta2018generative] and biological properties [merk2018novo, merk2018tuning, grisoni2021combining, yuan2017chemspacemim]. Furthermore, data augmentation based on SMILES randomization [arus2019randomized, bjerrum2017molecular] or bidirectional learning [grisoni2020bidirectional] has proven effective for improving the quality of the chemical language learned by RNNs. Most published studies have used SMILES strings or derivative representations; in a few studies, one-letter amino acid sequences were employed for peptide design [muller2018recurrent, nagarajan2018computational, hamid_2018_amp, grisoni2018designing, das2021accelerated]. RNNs have also been applied to predict ligand–protein interactions and the pharmacokinetic properties of drugs [zheng2020predicting, wang2020optimizing], protein secondary structure [senior2019protein, zhou2020combining], and the temporal evolution of molecular trajectories [tsai2020learning]. Moreover, RNNs have been used for molecular feature extraction [bombarelli_automatic_2018, lin_bigru], with the learned features outperforming both traditional molecular descriptors and graph-convolution methods for virtual screening and property prediction [bombarelli_automatic_2018]. 
The Fréchet ChemNet distance [preuer2018frechet], which is based on the physicochemical and biological features learned by an RNN model, has become the de facto reference method to capture molecular similarity in this context.
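At its core, the Fréchet ChemNet distance fits a Gaussian to the learned features of each molecule set and computes the Fréchet (Wasserstein-2) distance between the two Gaussians, d² = ||μ₁ − μ₂||² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^(1/2)). A NumPy sketch of this statistic follows; the ChemNet feature extraction itself is omitted, so the inputs are plain feature matrices:

```python
import numpy as np

def _sqrtm_psd(a: np.ndarray) -> np.ndarray:
    """Matrix square root of a symmetric positive semi-definite matrix."""
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def frechet_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature matrices
    (rows = molecules, columns = learned features)."""
    mu1, mu2 = x.mean(0), y.mean(0)
    s1 = np.cov(x, rowvar=False)
    s2 = np.cov(y, rowvar=False)
    r1 = _sqrtm_psd(s1)
    # Tr((S1 S2)^(1/2)) computed via the symmetric form r1 S2 r1.
    tr_covmean = np.trace(_sqrtm_psd(r1 @ s2 @ r1))
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(s1) + np.trace(s2) - 2.0 * tr_covmean)
```

The symmetric reformulation avoids taking the square root of the generally non-symmetric product Σ₁Σ₂, while leaving the trace unchanged.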

Molecular Transformers have been applied to CASP, which can be cast as a sequence-to-sequence translation task, in which the string representations of the reactants are mapped to those of the corresponding product, or vice versa. Since their initial applications [schwaller2019molecular], Transformers have been employed to predict multi-step syntheses [schwaller2020predicting], regio- and stereoselective reactions [pesciullesi2020transfer], enzymatic reaction outcomes [kreutter2021predicting], and reaction yields and classes [schwaller2021prediction, schwaller2020data, schwaller2021mapping]. Recently, Transformers have been applied to molecular property prediction [chithrananda2020chemberta, morris_2020_transformer] and optimization [he2020molecular]. Transformers have also been used for de novo molecule design by learning to translate the target protein sequence into SMILES strings of the corresponding ligands [grechishnikova2021transformer]. Representations learned from SMILES strings by Transformers have shown promise for property prediction in low-data regimes [honda2019smiles]. Furthermore, Transformers have recently been combined with E(3) and SE(3) equivariant layers to learn the 3D structures of proteins from their amino-acid sequence [jumper2021highly, baek2021accurate]. These equivariant Transformers achieve state-of-the-art performance in protein structure prediction.

Other deep learning approaches have relied on string-based representations for de novo design, e.g., conditional generative adversarial networks [mirza2014conditional, arjovsky2017wasserstein, mendez2020novo] and variational autoencoders [griffiths2020constrained, alperstein2019all]. Most of these models, however, have shown limited or at best equivalent ability to automatically learn the SMILES syntax compared to RNNs. 1D CNNs [hirohara2018convolutional, kimber2018synergy] and self-attention networks [zheng2019identifying, lim2020predicting, shin2019self] have been used with SMILES for property prediction. Recently, deep learning on amino acid sequences for property prediction was shown to perform on par with approaches based on human-engineered features [elabd2020amino].


Box 3: Structure-activity landscape modeling with geometric deep learning

This worked example shows how geometric deep learning (GDL) can be used to interpret the structure-activity landscape learned by a trained model. Starting from a publicly available molecular dataset containing estrogen receptor binding information [valsecchi2020nura], we trained an E(3)-equivariant graph neural network (six hidden layers, 128 hidden neurons per layer) and analyzed the learned features and their relationship to ligand binding to the estrogen receptor. The figure shows an analysis of the learned molecular features (third hidden layer, analyzed via principal component analysis; the first two principal components are shown), and how these features relate to the density of active and inactive molecules in chemical space. The network successfully separated the molecules based on both their experimental bioactivity and their structural features (e.g., atom scaffolds [bemis1996properties]), suggesting novel opportunities for explainable AI with GDL.
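A feature analysis of the kind described in this box can be reproduced in outline with a plain SVD-based principal component projection. The activation matrix below is synthetic, standing in for the hidden-layer features of a trained network:

```python
import numpy as np

def pca_2d(features: np.ndarray) -> np.ndarray:
    """Project a (molecules x features) activation matrix onto its first
    two principal components via SVD of the centered data."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Synthetic stand-in for hidden-layer activations of a trained GNN.
rng = np.random.default_rng(1)
activations = rng.normal(size=(200, 128))
coords = pca_2d(activations)  # 2D coordinates for plotting actives/inactives
```

Coloring the resulting 2D coordinates by experimental bioactivity, as in the figure, reveals how well the learned feature space separates active from inactive molecules.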

4 Conclusions and outlook

Geometric deep learning in chemistry has allowed researchers to leverage the symmetries of different unstructured molecular representations, resulting in greater flexibility and versatility of the available computational models for molecular structure generation and property prediction. Such approaches represent a valid alternative to classical chemoinformatics approaches based on molecular descriptors or other human-engineered features. For modeling tasks that usually require highly engineered rules (e.g., chemical transformations for de novo design, and reactive-site specification for CASP), the benefits of GDL have been consistently shown. In published applications of GDL, each molecular representation has shown characteristic strengths and weaknesses.

Molecular strings, like SMILES, have proven particularly suited for generative deep learning tasks, such as de novo design and CASP. This success may be due to the relatively simple syntax of such a chemical language, which facilitates next-token and sequence-to-sequence prediction. For molecular property prediction, SMILES strings may be limited by their non-univocity, i.e., the fact that the same molecule can be encoded by multiple valid strings.

Molecular graphs have shown particular usefulness for property prediction, partly because of their human interpretability and the ease with which desired edge and node features can be included. The incorporation of 3D information (e.g., with equivariant message passing) is useful for quantum chemistry modeling, whereas in drug discovery applications this approach has often failed to provide benefits that clearly outweigh the increased model complexity. E(3)-equivariant graph neural networks have also been applied to conformation-aware de novo design [satorras2021gen], but prospective experimental validation studies have not yet been published.

Molecular grids have become the de facto standard for 3D representations of large molecular systems, due to (i) their ability to capture information at a user-defined resolution (voxel density) and (ii) the Euclidean structure of the input grid.

Finally, molecular surfaces are currently at the forefront of GDL. We expect many interesting applications of GDL on molecular surfaces in the near future.

To further the application and impact of GDL in chemistry, an evaluation of the optimal trade-off between algorithmic complexity, performance, and model interpretability will be required. These aspects are crucial for reconciling the “two QSARs” [fujita2016understanding] and for connecting the computer science and chemistry communities. We encourage GDL practitioners to include aspects of interpretability in their models (e.g., via XAI [jimenez2020drug]) whenever possible and to communicate transparently with domain experts. Feedback from domain experts will also be crucial for developing new "chemistry-aware" architectures and furthering the potential of molecular GDL for concrete prospective applications.

The potential of GDL for molecular feature extraction has not yet been fully explored. Several studies have shown the benefits of learned representations compared to classical molecular descriptors, but in other cases, GDL failed to live up to its promise in terms of superior learned features. Although there are several benchmarks for evaluating machine learning models for property prediction [hu2020open, wu2018moleculenet] and molecule generation [polykovskiy2020molecular, brown2019guacamol], at present, there is no such framework to enable the systematic evaluation of the usefulness of data-driven features learned by AI. Such benchmarks and systematic studies are key to obtaining an unvarnished assessment of deep representation learning. Moreover, investigating the relationships between the learned features and the physicochemical and biological properties of the input molecules will augment the interpretability and applicability of GDL, e.g., to modeling structure-function relationships like structure-activity landscapes (Box 3).

Compared to conventional QSAR approaches, in which the assessment of the applicability domain (i.e., the region of chemical space where model predictions are considered reliable) is routinely performed, contemporary GDL studies lack such an assessment. This systematic gap might be one of the factors limiting the more widespread use of GDL approaches in prospective studies, as it could lead to unreliable predictions, e.g., for molecules whose mechanisms of action, functional groups, or physicochemical properties differ from those of the training data. In the future, it will be necessary to devise “geometry-aware” approaches for applicability domain assessment.
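One simple descriptor-space heuristic for applicability-domain assessment, shown here as an illustrative sketch rather than a recommended method, flags a query molecule when its mean distance to the k nearest training molecules exceeds a threshold derived from the training set itself (the values of k and z are illustrative choices):

```python
import numpy as np

def in_applicability_domain(train: np.ndarray, query: np.ndarray,
                            k: int = 5, z: float = 3.0) -> bool:
    """k-NN applicability-domain check in a Euclidean feature space:
    a query is inside the AD if its mean distance to the k nearest
    training points is below mean + z * std of the same statistic
    computed over the training set itself."""
    def knn_mean_dist(point, ref, k):
        d = np.linalg.norm(ref - point, axis=1)
        return np.sort(d)[:k].mean()

    # Reference distribution: leave-one-out k-NN distances within the training set.
    ref_stats = np.array([
        knn_mean_dist(p, np.delete(train, i, axis=0), k)
        for i, p in enumerate(train)
    ])
    threshold = ref_stats.mean() + z * ref_stats.std()
    return bool(knn_mean_dist(query, train, k) <= threshold)
```

A "geometry-aware" variant of this idea would replace the Euclidean descriptor distances with distances in the learned, symmetry-respecting feature space of the GDL model.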

Another opportunity will be to leverage less explored molecular representations for GDL. For instance, the electronic structure of molecules has vast potential for tasks such as CASP, molecular property prediction, and prediction of macromolecular interactions (e.g. protein-protein interactions). Although accurate statistical and quantum mechanical simulations are computationally expensive, modern quantum machine learning models [von2020exploring, christensen2020fchl, huang2020quantum, heinen2020machine, heinen2020quantum] trained on large quantum data collections [ramakrishnan2014quantum, isert2021qmugs, von2020thousands] allow quantum information to be accessed much faster with high accuracy. This aspect could enable quantum and electronic featurization of extensive molecular datasets, to be used as input molecular representations for the task of interest.

Deep learning can be applied to a multitude of biological and chemical representations. The corresponding deep neural network models have the potential to augment human creativity, paving the way for new scientific studies that were previously unfeasible. However, research has only explored the tip of the iceberg. One of the most significant catalysts for the integration of deep learning in molecular sciences may be the responsibility of academic institutions to foster interdisciplinary collaboration, communication, and education. Picking the "high hanging fruits" will only be possible with a deep understanding of both chemistry and computer science, along with out-of-the-box thinking and collaborative creativity. In such a setting, we expect molecular GDL to increase the understanding of molecular systems and biological phenomena.

5 Acknowledgements

This research was supported by the Swiss National Science Foundation (SNSF, grant no. 205321182176) and the ETH RETHINK initiative.

6 Competing interest

G.S. declares a potential financial conflict of interest as co-founder of LLC, Zurich, and in his role as scientific consultant to the pharmaceutical industry.

7 List of abbreviations

AD: Applicability Domain
AI: Artificial Intelligence
CASP: Computer-aided Synthesis Planning
CNN: Convolutional Neural Network
DFT: Density Functional Theory
E(3): Euclidean Symmetry Group
GDL: Geometric Deep Learning
GNN: Graph Neural Network
QSAR: Quantitative Structure-Activity Relationship
RNN: Recurrent Neural Network
SE(3): Special Euclidean Symmetry Group
SMILES: Simplified Molecular Input Line Entry System
XAI: Explainable Artificial Intelligence
1D: One-dimensional
2D: Two-dimensional
3D: Three-dimensional