Learning from Protein Structure with Geometric Vector Perceptrons

by Bowen Jing et al.
Stanford University

Learning on 3D structures of large biomolecules is emerging as a distinct area in machine learning, but there has yet to emerge a unifying network architecture that simultaneously leverages the graph-structured and geometric aspects of the problem domain. To address this gap, we introduce geometric vector perceptrons, which extend standard dense layers to operate on collections of Euclidean vectors. Graph neural networks equipped with such layers are able to perform both geometric and relational reasoning on efficient and natural representations of macromolecular structure. We demonstrate our approach on two important problems in learning from protein structure: model quality assessment and computational protein design. Our approach improves over existing classes of architectures, including state-of-the-art graph-based and voxel-based methods.




1 Introduction

Many efforts in structural biology aim to predict, or derive insights from, the structure of a macromolecule (such as a protein, RNA, or DNA), represented as a set of positions associated with atoms or groups of atoms in 3D Euclidean space. These problems can often be framed as functions mapping the input domain of structures to some property of interest, for example predicting the quality of a structural model or determining whether two molecules will bind. Thanks to their importance and difficulty, such problems, which we broadly refer to as learning from structure, have recently developed into an exciting and promising application area for deep learning [gainza2020deciphering, graves2020review, ingraham2019generative, pereira2016boosting, townshend2019end, won2019assessment].

Successful applications of deep learning are often driven by techniques that leverage the problem structure of the domain, for example convolutions in computer vision and attention in natural language processing [vaswani2017attention]. What are the relevant considerations in the domain of learning from structure? Using proteins as the most common example, the spatial arrangement and orientation of the amino acids govern the topology, dynamics and function of the molecule [berg2002biochemistry]. These properties are in turn mediated by key pairwise and higher-order residue-residue interactions [del2006residue, hammes2006relating]. We refer to these as the geometric and relational aspects of the problem domain, respectively.

Current state-of-the-art methods for learning from structure succeed by leveraging one of these two aspects. Commonly, such methods employ either graph neural networks (GNNs), which are expressive in terms of relational reasoning [battaglia2018relational], or convolutional neural networks (CNNs), which operate directly on spatial features. Here, we present a unifying architecture that bridges spatial and graph-based methods to leverage both aspects of the problem domain.

We do so by introducing geometric vector perceptrons (GVPs), a drop-in replacement for standard multi-layer perceptrons (MLPs) in aggregation and feed-forward layers of GNNs. GVPs operate directly on both scalar and geometric features—features that transform as a vector under rotation of spatial coordinates. GVPs therefore allow for the embedding of geometric information at nodes and edges without reducing such information to hand-picked scalars that may not fully capture complex geometric relationships. We postulate that our approach makes it easier for a network to learn functions whose significant features are both geometric and relational.

Our method (GVP-GNN) can be applied to any problem where the input domain is a structure of a single macromolecule or of molecules bound to one another. In this work, we demonstrate our approach on two important problems connected to protein structure: model quality assessment and computational protein design. Model quality assessment (MQA) aims to select the best structural model of a protein from a large pool of candidate structures and is a crucial step in protein structure prediction [cheng2019estimation]. Computational protein design (CPD) is the conceptual inverse of structure prediction, aiming to infer an amino acid sequence that will fold into a given structure. Our method outperforms existing methods on both tasks.

2 Related Work

Current methods for learning from protein structure generally use one of three classes of structural representations, which we outline below. We also describe representative and state-of-the-art examples for MQA and CPD to set the stage for our experiments later.

Sequential representation

One way of representing a protein structure is as a sequence of feature vectors, one for each amino acid, that can be input into a 1D convolutional network. These representations intrinsically encode neither geometric nor relational information and instead rely on hand-selected features to represent the 3D structural neighborhood of each amino acid. Such features can include contact-based features [olechnovivc2017voromqa], orientations or positions collectively projected to local coordinates [karasikov2019smooth, wang2018computational], and physics-inspired energy terms [o2018spin2, uziela2017proq3d]. These methods were among the earliest developed for learning from structure, and a number of them (ProQ3D [uziela2017proq3d], VoroMQA [olechnovivc2017voromqa], SBROD [karasikov2019smooth]) remain competitive in assessments of MQA.

Graph-based representations

Graph-based methods represent amino acids or individual atoms as nodes and draw edges based on chemical bonds or spatial proximity, and are well-suited for complex relational reasoning [battaglia2018relational]. These methods reduce the challenging task of representing the collective structural neighborhood of an amino acid to that of representing individual edges. The CPD method ProteinSolver [strokach2020fast], for example, uses pairwise Euclidean and sequence distance to represent edges. However, to encode more complex geometric information, a choice of scalar representation scheme is again necessary; that is, vector-valued features whose components are intrinsically linked need to be decomposed into individual scalars. For example, the MQA method ProteinGCN [sanyal2020proteingcn] projects pairwise relative positions into local coordinates. The CPD method Structured Transformer [ingraham2019generative] additionally uses a quaternion representation of pairwise relative orientations.

Voxel-based representations

Many structural biology methods voxelize proteins or neighborhoods of amino acids into occupancy maps and then perform hierarchical 3D convolutions. These architectures circumvent the need for hand-picked representations of geometric information altogether and instead directly operate on approximate positions in 3D space. However, CNNs do not directly encode relational information within the structure. In addition, standard CNNs do not intrinsically account for the symmetries of the problem space. To address the latter, early methods, such as the MQA method 3DCNN [derevyanko2018deep], learned rotation invariance via data augmentation. More recent MQA methods such as Ornate [pages2019protein] and many end-to-end CPD models [anand2020protein, zhang2019prodconn] instead define voxels based on a local coordinate system.

Voxel-based and graph-based architectures, catering to geometric and relational aspects of the problem domain, respectively, have complementary strengths and weaknesses. To address this, some recent work on small molecules aims to leverage geometric features in GNNs through voxelized convolutions at each graph node [spurek2019geometric], message transformation based on encoded angle information [klicpera2020directional] or products with position vectors to account for vector features [cho2019enhanced]. Our contribution is similar in spirit but more general and expressive in methodology. With the introduction of GVPs, we provide a conceptually simple augmentation of GNNs that allows for arbitrary scalar and vector features at all steps of graph propagation, and has the desired equivariance properties with respect to rotations of atomic structures in space.

3 Methods

In this section, we formally introduce the geometric vector perceptron, discuss datasets and evaluation metrics for quality assessment and protein design, and describe the protein representations and model architectures used for those tasks.

3.1 Geometric vector perceptrons

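Given a tuple $(s, V)$ of scalar features $s \in \mathbb{R}^n$ and vector features $V \in \mathbb{R}^{\nu \times 3}$, the geometric vector perceptron computes new features $(s', V') \in \mathbb{R}^m \times \mathbb{R}^{\mu \times 3}$. As outlined in pseudocode in Algorithm 1 and illustrated in Figure 1A, the computation is analogous to the linear combination of inputs in normal dense layers, with the addition of a norm operation and an intermediary hidden layer, which allow information in the vector channels to be extracted as scalars.

Algorithm 1 Geometric vector perceptron
Input: Scalar and vector features $(s, V) \in \mathbb{R}^n \times \mathbb{R}^{\nu \times 3}$.
Output: Scalar and vector features $(s', V') \in \mathbb{R}^m \times \mathbb{R}^{\mu \times 3}$.
    $h \leftarrow \max(\nu, \mu)$
    $V_h \leftarrow W_h V \in \mathbb{R}^{h \times 3}$
    $V_\mu \leftarrow W_\mu V_h \in \mathbb{R}^{\mu \times 3}$
    $s_h \leftarrow \|V_h\|_2$ (row-wise) $\in \mathbb{R}^h$
    $v_\mu \leftarrow \|V_\mu\|_2$ (row-wise) $\in \mathbb{R}^\mu$
    $s_{h+n} \leftarrow \mathrm{concat}(s_h, s) \in \mathbb{R}^{h+n}$
    $s_m \leftarrow W_m s_{h+n} + b \in \mathbb{R}^m$
    $s' \leftarrow \sigma(s_m) \in \mathbb{R}^m$
    $V' \leftarrow \sigma^+(v_\mu) \odot V_\mu$ (row-wise multiplication) $\in \mathbb{R}^{\mu \times 3}$
    return $(s', V')$

$W_h \in \mathbb{R}^{h \times \nu}$, $W_\mu \in \mathbb{R}^{\mu \times h}$, and $W_m \in \mathbb{R}^{m \times (h+n)}$ are weight matrices and $b \in \mathbb{R}^m$ is a bias vector. $\sigma$ and $\sigma^+$ denote activation functions. In our experiments, we choose ReLU and sigmoid for $\sigma$ and $\sigma^+$, respectively, although other nonlinearities may be used. Additionally, $h$ may also be defined independently, but we use the maximum of $\nu, \mu$ for convenience.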
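The computation described above can be sketched in a few lines of NumPy. This is a minimal illustration of one reading of Algorithm 1, not the authors' implementation: the function names `gvp` and `init_gvp`, the parameter names `W_h`, `W_mu`, `W_m`, `b`, and the random initialization are assumptions made for this example.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gvp(s, V, W_h, W_mu, W_m, b, act=relu, vec_act=sigmoid):
    """One geometric vector perceptron on scalars s (n,) and vectors V (nu, 3)."""
    V_h = W_h @ V                          # (h, 3) hidden vector channels
    V_mu = W_mu @ V_h                      # (mu, 3) output vector channels
    s_h = np.linalg.norm(V_h, axis=1)      # (h,) row-wise L2 norms
    v_mu = np.linalg.norm(V_mu, axis=1)    # (mu,) row-wise L2 norms
    s_cat = np.concatenate([s_h, s])       # (h + n,) norms joined with scalars
    s_out = act(W_m @ s_cat + b)           # (m,) new scalar features
    V_out = vec_act(v_mu)[:, None] * V_mu  # (mu, 3) row-wise gating of vectors
    return s_out, V_out

def init_gvp(n, nu, m, mu, rng):
    """Hypothetical helper: random weights with h = max(nu, mu)."""
    h = max(nu, mu)
    return dict(
        W_h=rng.normal(size=(h, nu)),
        W_mu=rng.normal(size=(mu, h)),
        W_m=rng.normal(size=(m, h + n)),
        b=np.zeros(m),
    )

rng = np.random.default_rng(0)
params = init_gvp(n=6, nu=3, m=8, mu=4, rng=rng)
s_out, V_out = gvp(rng.normal(size=6), rng.normal(size=(3, 3)), **params)
print(s_out.shape, V_out.shape)  # (8,) (4, 3)
```

Note that no bias is added to the vector channels and the only nonlinearity applied to them is a rescaling by a function of their norms, which is what keeps the vector output tied to the orientation of the input.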

Figure 1: (A) Schematic of the geometric vector perceptron illustrating the computation shown in Algorithm 1. Given a tuple of scalar and vector input features $(s, V)$, the perceptron computes an updated tuple $(s', V')$. $s'$ is a function of both $s$ and $V$. (B) Illustration of the structure-based prediction tasks. In model quality assessment (top), the goal is to predict a quality score given the 3D structure of a candidate model. Individual atoms are represented as colored spheres. The quality score measures the accuracy of a candidate structure with respect to an experimentally determined structure (shown in gray). In computational protein design (bottom), the goal is to predict an amino acid sequence that would fold into a given protein backbone structure.

Despite its conceptual simplicity, the GVP offers some desirable properties. First, the vector and scalar outputs of the GVP are equivariant and invariant, respectively, with respect to an arbitrary composition of rotations and reflections in 3D Euclidean space described by $R$, i.e., if $\mathrm{GVP}(s, V) = (s', V')$ then

$$\mathrm{GVP}(s, R(V)) = (s', R(V')).$$

This is because the only operations on vector-valued inputs are scalar multiplication, linear combination, and the norm. We include a formal proof in Supplementary Information.
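The equivariance property can be checked numerically. The sketch below assumes a compact GVP with ReLU and sigmoid activations as described in the text, draws a random orthogonal matrix `U` (a rotation or reflection) from a QR decomposition, and compares the outputs before and after transforming the input vectors. All names and sizes are illustrative assumptions.

```python
import numpy as np

def gvp(s, V, W_h, W_mu, W_m, b):
    """Compact GVP: ReLU on scalars, sigmoid gate on vector norms."""
    V_h = W_h @ V
    V_mu = W_mu @ V_h
    s_cat = np.concatenate([np.linalg.norm(V_h, axis=1), s])
    s_out = np.maximum(W_m @ s_cat + b, 0.0)
    gate = 1.0 / (1.0 + np.exp(-np.linalg.norm(V_mu, axis=1)))
    return s_out, gate[:, None] * V_mu

rng = np.random.default_rng(1)
n, nu, m, mu = 5, 4, 7, 3
h = max(nu, mu)
W_h, W_mu = rng.normal(size=(h, nu)), rng.normal(size=(mu, h))
W_m, b = rng.normal(size=(m, h + n)), rng.normal(size=m)

# Random orthogonal 3x3 matrix, acting on row vectors from the right.
U, _ = np.linalg.qr(rng.normal(size=(3, 3)))

s, V = rng.normal(size=n), rng.normal(size=(nu, 3))
s1, V1 = gvp(s, V, W_h, W_mu, W_m, b)
s2, V2 = gvp(s, V @ U, W_h, W_mu, W_m, b)

scalar_invariant = np.allclose(s1, s2)        # scalars do not change
vector_equivariant = np.allclose(V1 @ U, V2)  # vectors rotate with the input
print(scalar_invariant, vector_equivariant)
```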

In addition, a GVP defined with $n, \mu = 0$ (that is, the part of a GVP that transforms vector features to scalar features) is able to $\epsilon$-approximate a function that is invariant with respect to rotations and reflections in 3D under mild assumptions.


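Theorem. Let $R$ describe an arbitrary rotation or reflection in $\mathbb{R}^3$. For $\nu \ge 3$ let $\Omega^\nu \subset \mathbb{R}^{\nu \times 3}$ be the set of all $V = [v_1, \ldots, v_\nu]^\top \in \mathbb{R}^{\nu \times 3}$ such that $v_1, v_2, v_3$ are linearly independent and $0 < \|v_i\|_2 \le b$ for all $i$ and some finite $b > 0$. Then for any continuous $F : \Omega^\nu \to \mathbb{R}$ such that $F(R(V)) = F(V)$ and for any $\epsilon > 0$, there exists a form $f(V) = w^\top \mathrm{GVP}(V)$ such that $|F(V) - f(V)| < \epsilon$ for all $V \in \Omega^\nu$.

Proof. In Supplementary Information. ∎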

As a corollary, a GVP with nonzero $n, \mu$ is also able to approximate similarly-defined functions over the full input domain $\mathbb{R}^n \times \mathbb{R}^{\nu \times 3}$.

Finally, the GVP layer is nearly as fast as normal dense layers, incurring additional overhead only in reshaping and concatenation operations.

In addition to the GVP layer itself, we use a version of dropout that drops entire vector channels at random (as opposed to components within vector channels). We also introduce layer normalization for the vector features as

$$V \leftarrow V \Big/ \sqrt{\frac{1}{\nu} \sum_{i=1}^{\nu} \|v_i\|_2^2},$$

where $v_i$ are the individual row vectors of the vector feature matrix $V \in \mathbb{R}^{\nu \times 3}$. This vector layer norm has no trainable parameters, but we continue to use normal layer normalization on scalar channels with trainable parameters $\gamma, \beta$.
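The two auxiliary operations can be illustrated as follows. This is a sketch under stated assumptions: the rescaling of kept channels by `1 / (1 - rate)` is borrowed from standard inverted dropout and is not stated in the text, and the small `eps` is added only for numerical safety.

```python
import numpy as np

def vector_dropout(V, rate, rng):
    """Zero whole vector channels (rows of V) with probability `rate`."""
    keep = rng.random(V.shape[0]) >= rate
    # Assumption: kept channels are rescaled as in standard inverted dropout.
    return V * keep[:, None] / (1.0 - rate)

def vector_layer_norm(V, eps=1e-8):
    """Divide all channels by the root-mean-square row norm (no parameters)."""
    rms = np.sqrt(np.mean(np.sum(V ** 2, axis=1)))
    return V / (rms + eps)

rng = np.random.default_rng(0)
V = rng.normal(size=(16, 3))
V_norm = vector_layer_norm(V)
V_drop = vector_dropout(V, rate=0.25, rng=rng)
```

Both operations only rescale entire rows, so they commute with any rotation or reflection applied to the vectors and preserve the equivariance of the layer.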

3.2 Dataset and evaluation

We benchmark GVP-augmented graph neural networks on two distinct challenges in structural biology: model quality assessment and protein design. Figure 1B illustrates these tasks.

3.2.1 Model quality assessment

Model quality assessment aims to select the best structural model of a protein from a large pool of candidate structures. (We refer to models of protein structure as "structural models" or "candidate structures" to avoid confusion with the term "model" as used in the ML community.) The performance of different MQA methods is regularly assessed in the biennial blind prediction competition CASP, of which the 13th was the most recent [cheng2019estimation]. For a number of recently solved but unreleased structures, called targets, structure generation programs produce a large number of candidate structures. MQA methods are ranked by how well they are able to predict the GDT-TS score of a candidate structure. The GDT-TS is a measure of how similar two protein backbones are after global alignment; specifically, it is the mean of

$$\frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[ d(x_i, \hat{x}_i) \le t \right]$$

for $t \in \{1, 2, 4, 8\}$ Å, where $d$ is the Euclidean distance between the position $x_i$ of residue $i$ in the candidate structure and its position $\hat{x}_i$ in the reference structure, and $N$ is the number of residues. In addition to accurately predicting the absolute quality of a candidate structure, a good MQA method should also be able to accurately assess the relative model qualities among a pool of candidates for a given target so that the best ones can be selected, perhaps for further refinement. Therefore, MQA methods are commonly evaluated on two metrics: a global correlation between the predicted and ground truth scores, pooled across all targets, and the average per-target correlation among only the candidate structures for a specific target [cao2016protein, derevyanko2018deep, pages2019protein].
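As a worked example of the score, the sketch below computes a GDT-TS-style value for two sets of coordinates that are assumed to be already superimposed. The function name `gdt_ts` and the toy coordinates are assumptions; the official score additionally searches over superpositions, which is not done here.

```python
import numpy as np

def gdt_ts(pred_ca, ref_ca, thresholds=(1.0, 2.0, 4.0, 8.0)):
    """Mean over thresholds (in angstroms) of the fraction of residues whose
    distance to the reference is within the threshold.
    Assumes both (N, 3) arrays are already globally aligned."""
    d = np.linalg.norm(pred_ca - ref_ca, axis=1)
    return float(np.mean([(d <= t).mean() for t in thresholds]))

ref = np.zeros((4, 3))
pred = np.array([[0.5, 0.0, 0.0],
                 [1.5, 0.0, 0.0],
                 [3.0, 0.0, 0.0],
                 [9.0, 0.0, 0.0]])
score = gdt_ts(pred, ref)
print(score)  # (1/4 + 2/4 + 3/4 + 3/4) / 4 = 0.5625
```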

To fulfill this aim, we train our MQA model on an absolute loss and a pairwise loss. That is, for each training step we intake pairs $(x_i, y_i), (x_j, y_j)$ where $x_i, x_j$ are candidate structures for the same target and compute

$$\mathcal{L} = H(\mathrm{MQA}(x_i), y_i) + H(\mathrm{MQA}(x_j), y_j) + H(\mathrm{MQA}(x_i) - \mathrm{MQA}(x_j),\; y_i - y_j),$$

where $H$ is the Huber loss. When reshuffling at the beginning of each epoch, we also randomly pair up the candidate structures for each target. Interestingly, adding the pairwise term also improves global correlation, likely because the much larger number of possible pairs makes it more difficult to overfit.
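A combined absolute and pairwise Huber objective of the kind described here might look as follows. This is a sketch: the names `huber` and `mqa_pair_loss`, the Huber threshold `delta=1.0`, and the equal weighting of the three terms are assumptions, not values given in the text.

```python
import numpy as np

def huber(pred, target, delta=1.0):
    """Huber loss: quadratic for small errors, linear beyond `delta`."""
    err = np.abs(pred - target)
    quad = np.minimum(err, delta)
    return 0.5 * quad ** 2 + delta * (err - quad)

def mqa_pair_loss(pred_i, pred_j, y_i, y_j):
    """Absolute terms for both candidates plus a term on their difference."""
    absolute = huber(pred_i, y_i) + huber(pred_j, y_j)
    pairwise = huber(pred_i - pred_j, y_i - y_j)
    return absolute + pairwise

# Two candidates for the same target with true scores 0.6 and 0.5.
loss = mqa_pair_loss(pred_i=0.7, pred_j=0.4, y_i=0.6, y_j=0.5)
print(loss)  # approximately 0.005 + 0.005 + 0.02 = 0.03
```

The pairwise term penalizes errors in the predicted score difference, so two predictions that are each slightly off in opposite directions are penalized more than two that are off in the same direction.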

We train and validate on the set of candidate structures generated in the CASP 5-10 assessments, which collectively contain 528 targets and 79200 candidate structures. For testing, we predict model quality for a total of 20880 stage 1 and stage 2 candidate structures from CASP 11 (84 targets) and 12 (40 targets).

3.2.2 Protein design

Computational protein design is the conceptual inverse of structure prediction, aiming to infer an amino acid sequence that will fold into a given structure. In comparison to model quality assessment, computational protein design is more difficult to unambiguously benchmark, as some structures may correspond to a large space of sequences and others may correspond to none at all. Therefore, the proxy metric of native sequence recovery (splitting the set of all known native structures in the PDB and attempting to design the sequences corresponding to held-out structures) is typically used to benchmark CPD models [li2014direct, o2018spin2, wang2018computational]. Drawing an analogy between sequence design and language modelling, Ingraham et al. [ingraham2019generative] also evaluate the model perplexity on held-out native sequences. Both metrics rest on the implicit assumption that native sequences are optimized for their structures [kuhlman2000native] and should be assigned high probabilities.

To best approximate real-world applications that may require design of novel structures, the held-out evaluation set should bear minimal similarity to the training structures. We use the CATH 4.2 dataset curated by Ingraham et al. [ingraham2019generative], in which all available structures with 40% non-redundancy are partitioned by their CATH (class, architecture, topology/fold, homologous superfamily) classification. The canonical training, validation, and test splits consist of 18204, 608, and 1120 structures, respectively.

3.3 Model architecture


In a computation graph, a GVP can be placed at any node typically occupied by a dense layer. To most directly augment relational representations with geometric information, we modify GNNs by permitting all edge and node embeddings to have geometric vector and scalar features, and then use GVPs at all graph propagation and point-wise feed-forward steps.

Representations of proteins

A protein structure is a sequence of amino acids, where each amino acid consists of four backbone atoms (Cα, C, N, and O) and a set of sidechain atoms located in 3D Euclidean space. Here we represent only the backbone because our MQA benchmark corresponds to the assessment of backbone structure. In CPD, the sidechains are by definition unknown.

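Let $X_i$ be the position of atom X in the $i$th amino acid (where X is a Cα, C, N, or O atom). We represent backbone structure as a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ where each node $v_i \in \mathcal{V}$ corresponds to an amino acid and has embedding $h_v^{(i)}$ with the following features:

  • Scalar features $\{\sin, \cos\} \circ \{\phi, \psi, \omega\}$, where $\phi, \psi, \omega$ are the dihedral angles computed from $C_{i-1}$, $N_i$, $C\alpha_i$, $C_i$, and $N_{i+1}$.

  • The forward and reverse unit vectors in the directions of $C\alpha_{i+1} - C\alpha_i$ and $C\alpha_{i-1} - C\alpha_i$, respectively.

  • The unit vector in the imputed direction of $C\beta_i - C\alpha_i$ (Cβ is the second carbon from the carboxyl carbon C). This is computed by assuming tetrahedral geometry and normalizing

    $$\sqrt{\tfrac{1}{3}}\, \frac{n \times c}{\|n \times c\|_2} - \sqrt{\tfrac{2}{3}}\, \frac{n + c}{\|n + c\|_2},$$

    where $n = N_i - C\alpha_i$ and $c = C_i - C\alpha_i$. The three unit vectors unambiguously define the orientation of each amino acid residue.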

  • A one-hot representation of amino acid identity, when available.
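Part of the node featurization listed above can be sketched as follows: the sine and cosine of the backbone dihedral angles as scalar features, and the forward and reverse unit vectors between consecutive Cα atoms as vector features. The imputed Cβ direction and the one-hot identity are left out of this sketch. The function names, the zero-padding of angles and vectors that are undefined at chain ends, and the use of ω from the following residue's Cα are assumptions for illustration.

```python
import numpy as np

def unit(v, eps=1e-8):
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (radians) about the p1-p2 bond."""
    b0, b1, b2 = p1 - p0, p2 - p1, p3 - p2
    y = np.linalg.norm(b1) * np.dot(b0, np.cross(b1, b2))
    x = np.dot(np.cross(b0, b1), np.cross(b1, b2))
    return np.arctan2(y, x)

def node_features(N, CA, C):
    """N, CA, C: (L, 3) backbone coordinates.
    Returns scalar features (L, 6) and vector features (L, 2, 3)."""
    L = len(CA)
    angles = np.zeros((L, 3))  # phi, psi, omega; 0 where undefined (assumption)
    for i in range(L):
        if i > 0:
            angles[i, 0] = dihedral(C[i - 1], N[i], CA[i], C[i])        # phi
        if i < L - 1:
            angles[i, 1] = dihedral(N[i], CA[i], C[i], N[i + 1])        # psi
            angles[i, 2] = dihedral(CA[i], C[i], N[i + 1], CA[i + 1])   # omega
    scalars = np.concatenate([np.sin(angles), np.cos(angles)], axis=1)
    forward, backward = np.zeros_like(CA), np.zeros_like(CA)
    forward[:-1] = unit(CA[1:] - CA[:-1])    # towards the next residue
    backward[1:] = unit(CA[:-1] - CA[1:])    # towards the previous residue
    return scalars, np.stack([forward, backward], axis=1)

rng = np.random.default_rng(0)
N, CA, C = (rng.normal(size=(5, 3)) for _ in range(3))
scalars, vectors = node_features(N, CA, C)
print(scalars.shape, vectors.shape)  # (5, 6) (5, 2, 3)
```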

The set of edges is $\mathcal{E} = \{e_{j \to i}\}_{i \ne j}$ for all $i, j$ where $v_j$ is among the $k = 30$ nearest neighbors of $v_i$ as measured by the distance between their Cα atoms. This is motivated by the fact that if two entities have a relational dependency they are likely to be close together in space. Each edge has an embedding $h_e^{(j \to i)}$ with the following features:

  • The unit vector in the direction of $C\alpha_j - C\alpha_i$.

  • The radial basis encoding of the distance $\|C\alpha_j - C\alpha_i\|_2$.

  • A sinusoidal encoding of $j - i$ as described in Vaswani et al. [vaswani2017attention], representing distance along the backbone.
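The edge construction listed above can be sketched as a nearest-neighbor search over Cα coordinates followed by the three encodings. The number of radial basis functions, their range (0 to 20 Å), the Gaussian form of the basis, the encoding dimension, and all function names are assumptions chosen for this example; the text does not specify them.

```python
import numpy as np

def knn_edges(ca, k):
    """Edges j -> i where j is among the k nearest neighbours of i."""
    d = np.linalg.norm(ca[:, None] - ca[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                   # exclude self-edges
    src = np.argsort(d, axis=1)[:, :k].reshape(-1)
    dst = np.repeat(np.arange(len(ca)), k)
    return src, dst

def rbf(dist, d_min=0.0, d_max=20.0, n_bins=16):
    """Gaussian radial basis encoding of distances (assumed parameters)."""
    centers = np.linspace(d_min, d_max, n_bins)
    width = (d_max - d_min) / n_bins
    return np.exp(-((dist[:, None] - centers) / width) ** 2)

def positional(offset, dim=16):
    """Sinusoidal encoding of the sequence offset j - i."""
    freq = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    ang = offset[:, None] * freq
    return np.concatenate([np.cos(ang), np.sin(ang)], axis=1)

def edge_features(ca, k):
    src, dst = knn_edges(ca, k)
    diff = ca[src] - ca[dst]
    dist = np.linalg.norm(diff, axis=1)
    vectors = (diff / dist[:, None])[:, None, :]  # (E, 1, 3) unit vectors
    scalars = np.concatenate([rbf(dist), positional(src - dst)], axis=1)
    return src, dst, scalars, vectors

rng = np.random.default_rng(0)
ca = rng.normal(size=(6, 3)) * 5.0
src, dst, e_s, e_V = edge_features(ca, k=3)
print(e_s.shape, e_V.shape)  # (18, 32) (18, 1, 3)
```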

In our notation, each feature vector is a concatenation of scalar and vector features as described above. Collectively, these features are sufficient for a complete description of the protein backbone. In particular, whereas previous graph-based representations depended on scalar edge embeddings to represent relative orientations and positions, we are able to directly encode the absolute orientations of each amino acid and relative positions in the equivalent of fewer real-valued channels.

Network architecture

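Our GVP-GNN takes as input the protein graph defined above and performs graph propagation steps that update the node embeddings according to:

$$h_v^{(i)} \leftarrow \mathrm{LayerNorm}\left( h_v^{(i)} + \frac{1}{k}\, \mathrm{Dropout}\left( \sum_{j :\, e_{j \to i} \in \mathcal{E}} g\left( \mathrm{concat}\left( h_v^{(j)}, h_e^{(j \to i)} \right) \right) \right) \right)$$

where $h_v^{(i)}$ and $h_e^{(j \to i)}$ are the node and edge embeddings, $k$ is the number of incoming edges, and $g$ is a composition of GVPs. We do not update edge embeddings and do not use a global graph embedding. Between node update steps, we also use a residual feed-forward point-wise update layer:

$$h_v^{(i)} \leftarrow \mathrm{LayerNorm}\left( h_v^{(i)} + \mathrm{Dropout}\left( g\left( h_v^{(i)} \right) \right) \right)$$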
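A single node update of this kind can be sketched with scalar and vector features carried side by side. This is a simplified illustration, not the authors' implementation: the message function `g` here is one batched GVP rather than a composition of several, layer normalization and dropout are omitted, and the names `make_gvp` and `propagate` and the weight scaling are assumptions.

```python
import numpy as np

def make_gvp(n_in, nu_in, n_out, nu_out, rng):
    """Hypothetical factory returning a batched GVP with random weights."""
    h = max(nu_in, nu_out)
    W_h = rng.normal(size=(h, nu_in)) / np.sqrt(nu_in)
    W_mu = rng.normal(size=(nu_out, h)) / np.sqrt(h)
    W_m = rng.normal(size=(n_out, h + n_in)) / np.sqrt(h + n_in)
    b = np.zeros(n_out)

    def g(s, V):                                   # s: (E, n_in), V: (E, nu_in, 3)
        V_h = W_h @ V                              # (E, h, 3)
        V_mu = W_mu @ V_h                          # (E, nu_out, 3)
        s_cat = np.concatenate([np.linalg.norm(V_h, axis=-1), s], axis=-1)
        s_out = np.maximum(s_cat @ W_m.T + b, 0.0)
        gate = 1.0 / (1.0 + np.exp(-np.linalg.norm(V_mu, axis=-1)))
        return s_out, gate[..., None] * V_mu
    return g

def propagate(h_s, h_V, e_s, e_V, src, dst, g):
    """Residual update: h_i <- h_i + mean over edges j -> i of g(concat(h_j, e_ji))."""
    m_s, m_V = g(np.concatenate([h_s[src], e_s], axis=1),
                 np.concatenate([h_V[src], e_V], axis=1))
    agg_s, agg_V = np.zeros_like(h_s), np.zeros_like(h_V)
    np.add.at(agg_s, dst, m_s)                     # sum messages per target node
    np.add.at(agg_V, dst, m_V)
    deg = np.maximum(np.bincount(dst, minlength=len(h_s)), 1)
    return h_s + agg_s / deg[:, None], h_V + agg_V / deg[:, None, None]

rng = np.random.default_rng(0)
src, dst = np.array([1, 2, 3, 0]), np.array([0, 1, 2, 3])   # toy ring graph
h_s, h_V = rng.normal(size=(4, 5)), rng.normal(size=(4, 2, 3))
e_s, e_V = rng.normal(size=(4, 3)), rng.normal(size=(4, 1, 3))
g = make_gvp(n_in=8, nu_in=3, n_out=5, nu_out=2, rng=rng)
new_s, new_V = propagate(h_s, h_V, e_s, e_V, src, dst, g)
print(new_s.shape, new_V.shape)  # (4, 5) (4, 2, 3)
```

Because messages are built only from GVP outputs and then summed, rotating all node and edge vectors rotates the updated node vectors in the same way and leaves the updated scalars unchanged.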
In model quality assessment, the network performs regression against the true quality score of a candidate structure, a global scalar property. To obtain a single global representation, we apply a node-wise GVP to reduce all features to scalars after all graph propagation steps, and then average the representations across all nodes. (We also tried learning a weighted average of nodes, but this led to increased overfitting.) A final dense network with dropout and layer normalization then outputs the network's prediction.

In computational protein design, the network learns a generative model over the space of protein sequences conditioned on the given backbone structure. Following Ingraham et al. [ingraham2019generative], we frame this as an autoregressive task and use a masked encoder-decoder architecture to capture the joint distribution over all positions: for each position $i$, the network models the distribution over amino acids at $i$ based on the complete structure graph, as well as the sequence information at positions $j < i$. The encoder first performs graph propagation on the structural information only. Then, sequence information is added to the graph, and the decoder performs further graph propagation where incoming messages for $j \ge i$ are computed only with the encoder embeddings. Finally, we use one last GVP with 20-way scalar output and softmax activation to output the probability of the amino acids.
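The masking rule in the decoder can be illustrated by choosing, for every edge j → i, which embedding of the source node is allowed to contribute to the message. The sketch below is one possible way to express that rule; the function name `decoder_inputs` and the toy embeddings are assumptions.

```python
import numpy as np

def decoder_inputs(h_enc, h_dec, src, dst):
    """For each edge j -> i, select the source embedding visible to position i.
    j < i:  decoder embedding, which carries sequence information;
    j >= i: encoder embedding, which carries structure information only."""
    visible = (src < dst)[:, None]
    return np.where(visible, h_dec[src], h_enc[src])

# Toy embeddings: encoder rows are all 0, decoder rows are all 1.
h_enc, h_dec = np.zeros((3, 2)), np.ones((3, 2))
src = np.array([0, 2, 0, 1, 1, 2])
dst = np.array([1, 1, 2, 2, 0, 0])
msg_src = decoder_inputs(h_enc, h_dec, src, dst)
print(msg_src[:, 0])  # [1. 0. 1. 1. 0. 0.]
```

With this selection, the prediction at position i never depends on the amino acid identity at position i or at any later position, which is what makes the factorization autoregressive.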

The MQA network, structure encoder, and masked decoder each use 3 graph propagation layers. The feed-forward modules consist of 2 GVPs and the message-gather function $g$ uses 3.

4 Experiments

4.1 Model quality assessment

We compare our method against state-of-the-art methods on the CASP 11-12 test set in Table 1. These include representatives of voxel-based methods (3DCNN and Ornate), a graph-based method (GraphQA), and three methods that use sequential representations. All of these methods learn solely from protein structure (there are two versions of GraphQA; we compare against the one using only structure information), with the exception of ProQ3D, which in addition uses sequence information on related proteins that is not always available. We include ProQ3D because it is an improved version of the best single-model method in CASP 11 and CASP 12 [uziela2017proq3d]. Our method outperforms all other structural methods in both global and per-target correlation, and even performs better than ProQ3D on all but one benchmark.

                                            CASP 11                      CASP 12
                                     Stage 1       Stage 2        Stage 1       Stage 2
Method                               Glob   Per    Glob   Per     Glob   Per    Glob   Per
Ours                                 0.84   0.66   0.87   0.45    0.79   0.73   0.82   0.62
3DCNN [derevyanko2018deep]           0.59   0.52   0.64   0.40    0.49   0.44   0.61   0.51
Ornate [pages2019protein]            0.64   0.47   0.63   0.39    0.55   0.57   0.67   0.49
GraphQA [baldassarre2020graphqa]     0.83   0.63   0.82   0.38    0.72   0.68   0.81   0.61
VoroMQA [olechnovivc2017voromqa]     0.69   0.62   0.65   0.42    0.46   0.61   0.61   0.56
SBROD [karasikov2019smooth]          0.58   0.65   0.55   0.43    0.37   0.64   0.47   0.61
ProQ3D [uziela2017proq3d]            0.80   0.69   0.77   0.44    0.67   0.71   0.81   0.60

Table 1: Comparison with state-of-the-art methods on CASP 11 and 12 in terms of global (Glob) and mean per-target (Per) Pearson correlation coefficients (higher is better). ProQ3D is set aside as the sole non-structure-only method. The top performing structure-only method for each metric is in bold, as is the top performing-method overall (if different). Our method generally improves over all other methods.
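The two metrics reported in Table 1 differ only in how predictions are grouped before the Pearson correlation is computed. The sketch below shows the distinction on toy data; the function names and the example scores are assumptions and do not correspond to any values in the table.

```python
import numpy as np

def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])

def mqa_correlations(pred, true, target_ids):
    """Global correlation pools all candidates; the per-target correlation is
    computed within each target and then averaged over targets."""
    pred, true, target_ids = map(np.asarray, (pred, true, target_ids))
    global_r = pearson(pred, true)
    per_target = [pearson(pred[target_ids == t], true[target_ids == t])
                  for t in np.unique(target_ids)]
    return global_r, float(np.mean(per_target))

# Target "A" is ranked perfectly, target "B" is ranked in reverse.
pred = [0.1, 0.2, 0.3, 0.9, 0.8, 0.7]
true = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
targets = ["A", "A", "A", "B", "B", "B"]
global_r, per_target_r = mqa_correlations(pred, true, targets)
print(round(global_r, 3), round(per_target_r, 3))
```

In this toy case the global correlation stays high because the method separates the two targets well, while the mean per-target correlation is zero because the ranking within one target is inverted.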

The CASP 11-12 datasets have been the most well-benchmarked in the recent MQA literature. However, for completeness, we also evaluate our method on CASP 13 (Table 2). Because of its recency, many target structures remain publicly unavailable. We use the stage 2 candidate structures of a subset of 20 targets previously used for benchmarking [baldassarre2020graphqa]. Our method achieves improved results over all other methods, including ProteinGCN [sanyal2020proteingcn], a more recent graph-based method. Because of the small sample size, we emphasize that these results, although promising, should be considered preliminary until further structures for CASP 13 are available.

Method                               Global   Mean per-target
Ours                                 0.881    0.685
ProteinGCN [sanyal2020proteingcn]    0.723    0.603
VoroMQA [olechnovivc2017voromqa]     0.769    0.665
ProQ3D [uziela2017proq3d]            0.849    0.671

Table 2: Performance on the 20 publicly available CASP 13 targets, stage 2 in terms of global and mean per-target Pearson correlation coefficient. As before, the top structure-only method is in bold, as is the top method overall (if different). Our method outperforms all other methods.

4.2 Protein design

Our method achieves state-of-the-art performance on CATH 4.2, representing a substantial improvement both in terms of perplexity and sequence recovery over Structured Transformer [ingraham2019generative], which was trained on the same training set (Table 3). Following Ingraham et al. [ingraham2019generative], we report evaluation on short (100 or fewer amino acid residues) and single-chain subsets of the CATH 4.2 test set, containing 94 and 103 proteins, respectively, in addition to the full test set. Although Structured Transformer leverages an attention mechanism on top of a graph-based representation of proteins, the authors note in ablation studies that removing attention appeared to increase performance. We therefore retrain and compare against a version of Structured Transformer in which the attention layers are replaced with standard graph propagation operations, which we call Structured GNN. Our method also improves upon this model.

                                                          Perplexity                     Recovery %
Method                                            Short  Single-chain  All      Short  Single-chain  All
Ours                                              7.10   7.44          5.29     32.1   32.0          40.2
Structured GNN                                    8.31   8.88          6.55     28.4   28.1          37.3
Structured Transformer [ingraham2019generative]   8.54   9.03          6.85     28.3   27.6          36.4

Table 3: Performance on the CATH 4.2 test set and its short and single-chain subsets in terms of per-residue perplexity (lower is better) and recovery (higher is better). Recovery is reported as the median over all structures of the mean recovery of 100 sequences per structure. Our method performs better than Structured Transformer and a variant of it, Structured GNN, in which we replaced the attention mechanisms with standard graph propagation operations (see 4.2).
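For reference, the two quantities in Table 3 can be computed per sequence as follows. This is a sketch with assumed function names: `perplexity` is the exponential of the mean negative log-likelihood of the native residues, and `recovery` is the fraction of designed residues matching the native sequence. The aggregation over 100 sampled sequences and the median over structures described in the caption are not shown.

```python
import numpy as np

def perplexity(probs, native):
    """probs: (L, 20) predicted distribution per residue; native: (L,) indices."""
    nll = -np.log(probs[np.arange(len(native)), native])
    return float(np.exp(nll.mean()))

def recovery(designed, native):
    """Fraction of positions where the designed residue equals the native one."""
    return float(np.mean(np.asarray(designed) == np.asarray(native)))

probs = np.full((4, 20), 1 / 20)        # a model that predicts uniformly
native = np.array([0, 1, 2, 3])
ppl = perplexity(probs, native)
rec = recovery([0, 1, 5, 3], native)
print(ppl, rec)  # approximately 20.0, and 0.75
```

A uniform predictor over the 20 amino acids has perplexity 20, which gives a useful baseline for reading the values in the table.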

We emphasize that the underlying architecture of Structured GNN is comparable to ours, with the exception of GVPs and geometric vector channels. In particular, the underlying protein graph is built in a similar manner, except with structural information encoded in solely scalar channels. For example, relative orientations are encoded in terms of quaternion coefficients (4 scalar channels per edge), whereas we represent absolute orientations with 3 vector channels per node. Therefore, our improvement over Structured GNN is the most direct indication of the benefit of our approach that combines geometric and relational information. Additionally, due to the efficiency of our representation, we are able to achieve this performance gain with a modest but notable decrease in parameter count—1.01M in our model versus 1.38M in Structured GNN and 1.53M in Structured Transformer.

5 Conclusion

In this work, we developed the first architecture designed specifically for learning on dual relational and geometric representations of 3D macromolecular structure. At its core, our method, GVP-GNN, augments graph neural networks with computationally fast layers that perform expressive geometric reasoning over Euclidean vector features. Our method possesses desirable theoretical properties and empirically outperforms existing architectures on learning both quality scores and sequence designs from protein structure.

In further work, we hope to apply our architecture to other important structural biology problem areas, including protein complexes, RNA structure, and protein-ligand interactions. Our results more generally highlight the promise of domain-aware architectures in specialized applications of deep learning, and we hope to continue developing and refining such architectures in the domain of learning from structure.


We thank Tri Dao and all members of the Dror group for feedback and discussions.


We acknowledge support from the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Scientific Discovery through Advanced Computing (SciDAC) program, and Intel Corporation. SE is supported by a Stanford Bio-X Bowes fellowship. RJLT is supported by the U.S. Department of Energy, Office of Science Graduate Student Research (SCGSR) program.


Supplementary Information

Equivariance and invariance

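The vector and scalar outputs of the GVP are equivariant and invariant, respectively, with respect to an arbitrary composition of rotations and reflections in 3D Euclidean space described by $R$, i.e., if $\mathrm{GVP}(s, V) = (s', V')$ then

$$\mathrm{GVP}(s, R(V)) = (s', R(V')).$$

Proof. We can write the transformation described by $R$ as multiplying $V$ with a unitary matrix $U \in \mathbb{R}^{3 \times 3}$ from the right. The $L_2$-norm, scalar multiplications, and nonlinearities are defined row-wise as in Algorithm 1. We consider scalar and vector outputs separately. The scalar output, as a function of the inputs, is

$$s' = \sigma\left( W_m \begin{bmatrix} \|W_h V\|_2 \\ s \end{bmatrix} + b \right).$$

Since $\|W_h V U\|_2 = \|W_h V\|_2$, we conclude $s'$ is invariant. Similarly the vector output is

$$V' = \sigma^+\left( \|W_\mu W_h V\|_2 \right) \odot W_\mu W_h V.$$

The row-wise scaling can also be viewed as left-multiplication by a diagonal matrix $D$. Since $\|W_\mu W_h V U\|_2 = \|W_\mu W_h V\|_2$, $D$ is invariant. Since

$$D W_\mu W_h (V U) = (D W_\mu W_h V)\, U,$$

we conclude that $V'$ is equivariant. ∎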

The ability to approximate rotation-invariant functions

The GVP inherits an analogue of the Universal Approximation property [cybenko1989approximation] of standard dense layers. If $R$ describes an arbitrary rotation or reflection in 3D Euclidean space (more precisely, if $R$ describes a unitary transformation), we show that a GVP can approximate arbitrary scalar-valued functions invariant under $R$ and defined over $\Omega^\nu \subset \mathbb{R}^{\nu \times 3}$, the bounded subset of $\mathbb{R}^{\nu \times 3}$ whose elements can be canonically oriented based on three linearly independent vector entries. Without loss of generality, we assume the first three vector entries can be used.

The machinery for such approximations is a GVP with only vector inputs, only scalar outputs, and a sigmoidal nonlinearity $\sigma$, followed by a dense layer. This can also be viewed as the sequence of matrix multiplication with $W_h$, taking the $L_2$-norm, and a dense network with one hidden layer. Such machinery can be extracted from any two consecutive GVPs (assuming a sigmoidal $\sigma$). The theorem is restated from the main text:


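Theorem. Let $R$ describe an arbitrary rotation or reflection in $\mathbb{R}^3$. For $\nu \ge 3$ let $\Omega^\nu \subset \mathbb{R}^{\nu \times 3}$ be the set of all $V = [v_1, \ldots, v_\nu]^\top \in \mathbb{R}^{\nu \times 3}$ such that $v_1, v_2, v_3$ are linearly independent and $0 < \|v_i\|_2 \le b$ for all $i$ and some finite $b > 0$. Then for any continuous $F : \Omega^\nu \to \mathbb{R}$ such that $F(R(V)) = F(V)$ and for any $\epsilon > 0$, there exists a form $f(V) = w^\top \mathrm{GVP}(V)$ such that $|F(V) - f(V)| < \epsilon$ for all $V \in \Omega^\nu$.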


The idea is to write as a composition and . We show that multiplication with and and taking the L-norm can compute , and that the dense network with one hidden layer can approximate .

Call an element oriented if , , and , with . Define to be the orientation function that orients its input and then extracts the vector of coefficients, . These elements can be written as

and are invariant under rotation and reflection, because they are defined using only the norms and inner products of the . Then , where .

The key insight is that if we construct such that the rows of are the original vectors and all differences , then we can compute from the row-wise norms of . That is, where and is an application of the cosine law. The GVP precisely computes as an intermediate step: . It remains to show that there exists a form that -approximates . Up to a translation and uniform scaling of the hypercube, this is the result of the Universal Approximation Theorem [cybenko1989approximation]. ∎
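The cosine-law step in this argument can be checked numerically. If the rows of the product of the weight matrix and the input contain every original vector and every pairwise difference, then all inner products follow from the row norms alone via $v_i \cdot v_j = \tfrac{1}{2}\left(\|v_i\|_2^2 + \|v_j\|_2^2 - \|v_i - v_j\|_2^2\right)$. In the sketch below the variable names are assumptions, and the hidden dimension is chosen independently of the default maximum, which the main text permits.

```python
import numpy as np

rng = np.random.default_rng(0)
nu = 4
V = rng.normal(size=(nu, 3))

# Weight matrix whose rows pick out each v_i and each difference v_i - v_j.
eye = np.eye(nu)
pairs = [(i, j) for i in range(nu) for j in range(i + 1, nu)]
W_h = np.vstack([eye] + [eye[i] - eye[j] for i, j in pairs])

# Row-wise norms, the quantity a GVP computes as an intermediate step.
norms = np.linalg.norm(W_h @ V, axis=1)

# Recover the full matrix of inner products from the norms (cosine law).
gram = np.diag(norms[:nu] ** 2)
for (i, j), d in zip(pairs, norms[nu:]):
    gram[i, j] = gram[j, i] = 0.5 * (norms[i] ** 2 + norms[j] ** 2 - d ** 2)

print(np.allclose(gram, V @ V.T))  # True
```

Since the inner products determine the vectors up to a rotation or reflection, these norms carry all the information needed by any function that is invariant to such transformations.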