1 Introduction
Many efforts in structural biology aim to predict, or derive insights from, the structure of a macromolecule (such as a protein, RNA, or DNA), represented as a set of positions associated with atoms or groups of atoms in 3D Euclidean space. These problems can often be framed as functions mapping the input domain of structures to some property of interest, such as the quality of a structural model or whether two molecules will bind. Thanks to their importance and difficulty, such problems, which we broadly refer to as learning from structure, have recently developed into an exciting and promising application area for deep learning [gainza2020deciphering, graves2020review, ingraham2019generative, pereira2016boosting, townshend2019end, won2019assessment].

Successful applications of deep learning are often driven by techniques that leverage the problem structure of the domain, such as convolutions in computer vision [cohen2016inductive] and attention in natural language processing [vaswani2017attention]. What are the relevant considerations in the domain of learning from structure? Using proteins as the most common example, the spatial arrangement and orientation of the amino acids govern the topology, dynamics, and function of the molecule [berg2002biochemistry]. These properties are in turn mediated by key pairwise and higher-order residue-residue interactions [del2006residue, hammes2006relating]. We refer to these as the geometric and relational aspects of the problem domain, respectively.

Current state-of-the-art methods for learning from structure succeed by leveraging one of these two aspects. Commonly, such methods either employ graph neural networks (GNNs), which are expressive in terms of relational reasoning [battaglia2018relational],
or convolutional neural networks (CNNs), which operate directly on spatial features. Here, we present a unifying architecture that bridges spatial and graph-based methods to leverage both aspects of the problem domain.

We do so by introducing geometric vector perceptrons (GVPs), a drop-in replacement for standard multi-layer perceptrons (MLPs) in the aggregation and feed-forward layers of GNNs. GVPs operate directly on both scalar and geometric features, i.e., features that transform as vectors under a rotation of spatial coordinates. GVPs therefore allow for the embedding of geometric information at nodes and edges without reducing such information to hand-picked scalars that may not fully capture complex geometric relationships. We postulate that our approach makes it easier for a network to learn functions whose significant features are both geometric and relational.
Our method (GVP-GNN) can be applied to any problem where the input domain is the structure of a single macromolecule or of molecules bound to one another. In this work, we demonstrate our approach on two important problems connected to protein structure: model quality assessment and computational protein design. Model quality assessment (MQA) aims to select the best structural model of a protein from a large pool of candidate structures and is a crucial step in protein structure prediction [cheng2019estimation]. Computational protein design (CPD) is the conceptual inverse of structure prediction, aiming to infer an amino acid sequence that will fold into a given structure. Our method outperforms existing methods on both tasks.
2 Related Work
Current methods for learning from protein structure generally use one of three classes of structural representations, which we outline below. We also describe representative and state-of-the-art examples for MQA and CPD to set the stage for our experiments later.
Sequential representation
One way of representing a protein structure is as a sequence of feature vectors, one for each amino acid, that can be input into a 1D convolutional network. These representations intrinsically encode neither geometric nor relational information and instead rely on hand-selected features to represent the 3D structural neighborhood of each amino acid. Such features can include contact-based features [olechnovivc2017voromqa], orientations or positions collectively projected to local coordinates [karasikov2019smooth, wang2018computational], and physics-inspired energy terms [o2018spin2, uziela2017proq3d]. These methods were among the earliest developed for learning from structure, and a number of them (ProQ3D [uziela2017proq3d], VoroMQA [olechnovivc2017voromqa], SBROD [karasikov2019smooth]) remain competitive in assessments of MQA.
Graph-based representations
Graph-based methods represent amino acids or individual atoms as nodes, draw edges based on chemical bonds or spatial proximity, and are well-suited for complex relational reasoning [battaglia2018relational]. These methods reduce the challenging task of representing the collective structural neighborhood of an amino acid to that of representing individual edges. The CPD method ProteinSolver [strokach2020fast], for example, uses pairwise Euclidean and sequence distance to represent edges. However, to encode more complex geometric information, a choice of scalar representation scheme is again necessary; i.e., vector-valued features whose components are intrinsically linked must be decomposed into individual scalars. For example, the MQA method ProteinGCN [sanyal2020proteingcn] projects pairwise relative positions into local coordinates. The CPD method Structured Transformer [ingraham2019generative] additionally uses a quaternion representation of pairwise relative orientations.
Voxel-based representations
Many structural biology methods voxelize proteins or neighborhoods of amino acids into occupancy maps and then perform hierarchical 3D convolutions. These architectures circumvent the need for hand-picked representations of geometric information altogether and instead operate directly on approximate positions in 3D space. However, CNNs do not directly encode relational information within the structure. In addition, standard CNNs do not intrinsically account for the symmetries of the problem space. To address the latter, early methods, such as the MQA method 3DCNN [derevyanko2018deep], learned rotation invariance via data augmentation. More recent MQA methods such as Ornate [pages2019protein] and many end-to-end CPD models [anand2020protein, zhang2019prodconn] instead define voxels based on a local coordinate system.
Voxel-based and graph-based architectures, catering to the geometric and relational aspects of the problem domain, respectively, have complementary strengths and weaknesses. To address this, some recent work on small molecules aims to leverage geometric features in GNNs through voxelized convolutions at each graph node [spurek2019geometric], message transformation based on encoded angle information [klicpera2020directional], or products with position vectors to account for vector features [cho2019enhanced]. Our contribution is similar in spirit but more general and expressive in methodology. With the introduction of GVPs, we provide a conceptually simple augmentation of GNNs that allows for arbitrary scalar and vector features at all steps of graph propagation, and has the desired equivariance properties with respect to rotations of atomic structures in space.
3 Methods
In this section, we formally introduce the geometric vector perceptron, discuss datasets and evaluation metrics for quality assessment and protein design, and describe the protein representations and model architectures used for those tasks.
3.1 Geometric vector perceptrons
Given a tuple (s, V) of scalar features s ∈ ℝⁿ and vector features V ∈ ℝ^(ν×3), the geometric vector perceptron computes new features (s′, V′) ∈ ℝᵐ × ℝ^(μ×3). As outlined in pseudocode in Algorithm 1 and illustrated in Figure 1A, the computation is analogous to the linear combination of inputs in normal dense layers, with the addition of a norm operation and an intermediary hidden layer, which allow information in the vector channels to be extracted as scalars.
W_h ∈ ℝ^(h×ν), W_μ ∈ ℝ^(μ×h), and W_m ∈ ℝ^(m×(h+n)) are weight matrices and b ∈ ℝᵐ is a bias vector. σ and σ⁺ denote activation functions. In our experiments, we choose ReLU and sigmoid for σ and σ⁺, respectively, although other nonlinearities may be used. Additionally, the hidden dimension h may also be defined independently, but we use h = max(ν, μ) for convenience.

Despite its conceptual simplicity, the GVP offers some desirable properties. First, the vector and scalar outputs of the GVP are equivariant and invariant, respectively, with respect to an arbitrary composition of rotations and reflections in 3D Euclidean space described by R; i.e.,
GVP(s, R(V)) = (s′, R(V′)), where (s′, V′) = GVP(s, V).   (1)
This is due to the fact that the only operations on vectorvalued inputs are scalar multiplication, linear combination, and the norm. We include a formal proof in Supplementary Information.
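To make this concrete, the forward pass of Algorithm 1 and the property in Eq. (1) can be sketched in NumPy. The weight shapes and the ReLU/sigmoid choices follow the text; the function and variable names are ours:

```python
import numpy as np

def gvp(s, V, Wh, Wmu, Wm, b):
    """Sketch of one geometric vector perceptron forward pass.

    s : (n,) scalar features; V : (nu, 3) vector features.
    Wh : (h, nu) and Wmu : (mu, h) act on the vector channels;
    Wm : (m, h + n) and b : (m,) act on the scalar channels.
    """
    Vh = Wh @ V                            # (h, 3) hidden vector channels
    Vmu = Wmu @ Vh                         # (mu, 3) output vector channels
    sh = np.linalg.norm(Vh, axis=-1)       # row-wise norms -> scalar information
    vmu = np.linalg.norm(Vmu, axis=-1)
    sm = Wm @ np.concatenate([sh, s]) + b
    s_out = np.maximum(sm, 0.0)            # sigma = ReLU on scalars
    gate = 1.0 / (1.0 + np.exp(-vmu))      # sigma+ = sigmoid gating
    V_out = gate[:, None] * Vmu            # row-wise scaling of vector channels
    return s_out, V_out
```

Right-multiplying V by an orthogonal matrix (a rotation or reflection of every vector channel) leaves s_out unchanged and transforms V_out by the same matrix, matching Eq. (1).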
In addition, a GVP defined with vector inputs and scalar outputs only (that is, the part of a GVP that transforms vector features to scalar features) is able to approximate any function of the vector features that is invariant with respect to rotations and reflections in 3D, under mild assumptions.
Theorem.
Let R describe an arbitrary rotation or reflection in ℝ³. For ν ≥ 3, let Ω ⊂ ℝ^(ν×3) be the set of all V with row vectors v₁, …, v_ν such that v₁, v₂, v₃ are linearly independent and ‖vᵢ‖₂ ≤ b for all i and some finite b. Then for any continuous F : Ω → ℝ such that F(V) = F(R(V)) for all V ∈ Ω, and for any ε > 0, there exists a form f(V) = wᵀ σ(W_m ‖W_h V‖₂ + b) (with ‖·‖₂ taken row-wise) such that |F(V) − f(V)| < ε for all V ∈ Ω.
Proof.
In Supplementary Information. ∎
As a corollary, a GVP with a nonzero number of scalar inputs n is also able to approximate similarly defined functions over the full input domain ℝⁿ × Ω.
Finally, the GVP layer is nearly as fast as normal dense layers, incurring additional overhead only in reshaping and concatenation operations.
In addition to the GVP layer itself, we use a version of dropout that drops entire vector channels at random (as opposed to components within vector channels). We also introduce layer normalization for the vector features as

LayerNorm(V) = V / sqrt( (1/ν) Σᵢ ‖vᵢ‖₂² )   (2)

where v₁, …, v_ν are the individual row vectors of the vector feature matrix V. This vector layer norm has no trainable parameters, but we continue to use normal layer normalization, with trainable gain and bias parameters, on the scalar channels.
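A minimal sketch of these two operations in NumPy (the naming and the epsilon guard are ours, not from the text):

```python
import numpy as np

def vector_dropout(V, rate, rng):
    """Drop entire vector channels (rows of V) at random, as opposed to
    individual components, rescaling survivors as in standard dropout."""
    if rate == 0.0:
        return V
    mask = (rng.random(V.shape[0]) >= rate).astype(V.dtype)
    return V * mask[:, None] / (1.0 - rate)

def vector_layer_norm(V, eps=1e-8):
    """Eq. (2): scale V by the root-mean-square of its row norms.
    No trainable parameters."""
    mean_sq_norm = np.mean(np.sum(V * V, axis=-1))
    return V / np.sqrt(mean_sq_norm + eps)
```

After normalization, the mean squared norm of the vector channels is 1, and dropping a channel zeroes all three of its components together, preserving the direction of the surviving channels.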
3.2 Dataset and evaluation
We benchmark GVP-augmented graph neural networks on two distinct challenges in structural biology: model quality assessment and protein design. Figure 1B illustrates these tasks.
3.2.1 Model quality assessment
Model quality assessment aims to select the best structural model of a protein from a large pool of candidate structures. (We refer to models of protein structure as "structural models" or "candidate structures" to avoid confusion with the term "model" as used in the ML community.) The performance of different MQA methods is regularly assessed in the biennial blind prediction competition CASP, of which the 13th was the most recent [cheng2019estimation]. For a number of recently solved but unreleased structures, called targets, structure generation programs produce a large number of candidate structures. MQA methods are ranked by how well they predict the GDT_TS score of a candidate structure. GDT_TS is a measure of how similar two protein backbones are after global alignment; specifically, it is the mean of
(1/N) Σᵢ 1[ d(xᵢ, xᵢ^exp) < c ]   (3)

for c ∈ {1, 2, 4, 8} Å, where xᵢ and xᵢ^exp are the positions of the i-th Cα atom in the candidate and experimental structures and d is the Euclidean distance. In addition to accurately predicting the absolute quality of a candidate structure, a good MQA method should also be able to accurately assess the relative model qualities among a pool of candidates for a given target so that the best ones can be selected, perhaps for further refinement. Therefore, MQA methods are commonly evaluated on two metrics: a global correlation between the predicted and ground truth scores, pooled across all targets, and the average per-target correlation among only the candidate structures for a specific target [cao2016protein, derevyanko2018deep, pages2019protein].
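As a sketch, GDT_TS can be computed directly from Eq. (3), assuming the two backbones are already globally aligned and coordinates are in Å:

```python
import numpy as np

def gdt_ts(x, x_exp, thresholds=(1.0, 2.0, 4.0, 8.0)):
    """Mean, over distance thresholds c, of the fraction of positions whose
    Euclidean deviation from the experimental structure is below c.
    x, x_exp : (N, 3) aligned coordinate arrays in Angstroms."""
    d = np.linalg.norm(x - x_exp, axis=-1)
    return float(np.mean([(d < c).mean() for c in thresholds]))
```

A candidate identical to the experimental structure scores 1.0; one whose every residue deviates by 3 Å passes only the 4 Å and 8 Å thresholds and scores 0.5.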
To fulfill this aim, we train our MQA model on an absolute loss and a pairwise loss. That is, for each training step we take in pairs (xᵢ, xⱼ), where xᵢ and xⱼ are candidate structures for the same target with true scores yᵢ and yⱼ, and compute
L = ℓ(f(xᵢ), yᵢ) + ℓ(f(xⱼ), yⱼ) + ℓ( f(xᵢ) − f(xⱼ), yᵢ − yⱼ )   (4)
where ℓ is the Huber loss and f denotes the network. When reshuffling at the beginning of each epoch, we also randomly pair up the candidate structures for each target. Interestingly, adding the pairwise term also improves the global correlation, likely because the much larger number of possible pairs makes it more difficult to overfit.
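A sketch of this combined objective with the standard Huber loss; the equal weighting of the three terms is our assumption, not stated in the text:

```python
def huber(x, delta=1.0):
    """Standard Huber loss: quadratic within delta, linear beyond it."""
    ax = abs(x)
    return 0.5 * x * x if ax <= delta else delta * (ax - 0.5 * delta)

def mqa_loss(pred_i, pred_j, y_i, y_j):
    """Absolute loss on each candidate's predicted score, plus a pairwise
    loss on the predicted vs. true score difference between two candidates
    for the same target."""
    absolute = huber(pred_i - y_i) + huber(pred_j - y_j)
    pairwise = huber((pred_i - pred_j) - (y_i - y_j))
    return absolute + pairwise
```

The pairwise term is zero whenever the predicted gap between two candidates matches the true gap, even if both absolute predictions are shifted, which is what drives the per-target ranking behavior discussed above.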
We train and validate on the set of candidate structures generated in the CASP 5–10 assessments, which collectively contain 528 targets and 79,200 candidate structures. For testing, we predict model quality for a total of 20,880 stage 1 and stage 2 candidate structures from CASP 11 (84 targets) and CASP 12 (40 targets).
3.2.2 Protein design
Computational protein design is the conceptual inverse of structure prediction, aiming to infer an amino acid sequence that will fold into a given structure. In comparison to model quality assessment, computational protein design is more difficult to unambiguously benchmark, as some structures may correspond to a large space of sequences and others may correspond to none at all. Therefore, CPD models are typically benchmarked with the proxy metric of native sequence recovery: the set of all known native structures in the PDB is split, and the model attempts to design the sequences corresponding to held-out structures [li2014direct, o2018spin2, wang2018computational]. Drawing an analogy between sequence design and language modelling, ingraham2019generative also evaluate the model perplexity on held-out native sequences. Both metrics rest on the implicit assumption that native sequences are optimized for their structures [kuhlman2000native] and should be assigned high probabilities.
To best approximate real-world applications that may require the design of novel structures, the held-out evaluation set should bear minimal similarity to the training structures. We use the CATH 4.2 dataset curated by ingraham2019generative, in which all available structures at 40% non-redundancy are partitioned by their CATH (class, architecture, topology/fold, homologous superfamily) classification. The canonical training, validation, and test splits consist of 18,204, 608, and 1,120 structures, respectively.
3.3 Model architecture
GVP-GNNs
In a computation graph, a GVP can be placed at any node typically occupied by a dense layer. To most directly augment relational representations with geometric information, we modify GNNs by permitting all edge and node embeddings to have both geometric vector and scalar features, and then use GVPs at all graph propagation and point-wise feed-forward steps.
Representations of proteins
A protein structure is a sequence of amino acids, where each amino acid consists of four backbone atoms (Cα, C, N, and O) and a set of side-chain atoms located in 3D Euclidean space. Here we represent only the backbone, because our MQA benchmark corresponds to the assessment of backbone structure, and in CPD the side chains are by definition unknown.
Let Xᵢ be the position of atom X in the i-th amino acid (where X is a Cα, C, N, or O atom). We represent the backbone structure as a graph where each node corresponds to an amino acid and has an embedding with the following features:

Scalar features {sin, cos} ∘ {φᵢ, ψᵢ, ωᵢ}, where φᵢ, ψᵢ, ωᵢ are the dihedral angles computed from Cᵢ₋₁, Nᵢ, Cαᵢ, Cᵢ, and Nᵢ₊₁.

The forward and reverse unit vectors in the directions of Cαᵢ₊₁ − Cαᵢ and Cαᵢ₋₁ − Cαᵢ, respectively.

The unit vector in the imputed direction of Cβᵢ − Cαᵢ. (Cβ is the second carbon from the carboxyl carbon C.) This is computed by assuming tetrahedral geometry about Cαᵢ and normalizing the resulting direction, expressed in terms of the unit vectors in the directions of Nᵢ − Cαᵢ and Cᵢ − Cαᵢ. The three unit vectors unambiguously define the orientation of each amino acid residue.

A one-hot representation of amino acid identity, when available.
The set of edges is {e_{j→i}} for all i ≠ j where j is among the k nearest neighbors of i, as measured by the distance between their Cα atoms. This is motivated by the fact that if two entities have a relational dependency, they are likely to be close together in space. Each edge has an embedding with the following features:

The unit vector in the direction of Cα_j − Cα_i.

The radial basis encoding of the distance ‖Cα_j − Cα_i‖₂.

A sinusoidal encoding of j − i, as described in vaswani2017attention, representing distance along the backbone.
In our notation, each node or edge embedding is a concatenation of the scalar and vector features described above. Collectively, these features are sufficient for a complete description of the protein backbone. In particular, whereas previous graph-based representations depended on scalar edge embeddings to represent relative orientations and positions, we are able to directly encode the absolute orientation of each amino acid and the relative positions in the equivalent of fewer real-valued channels.
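The edge construction above can be sketched as follows. The RBF bin count and distance range are illustrative choices, not the paper's hyperparameters, and the function names are ours:

```python
import numpy as np

def rbf_encode(d, d_min=0.0, d_max=20.0, n_bins=16):
    """Radial basis encoding of a scalar distance in Angstroms."""
    centers = np.linspace(d_min, d_max, n_bins)
    sigma = (d_max - d_min) / n_bins
    return np.exp(-(((d - centers) / sigma) ** 2))

def backbone_edges(ca, k):
    """k-nearest-neighbor edges over C-alpha coordinates ca : (N, 3).
    Returns (j, i) index pairs with unit-vector and RBF distance features."""
    N = ca.shape[0]
    D = np.linalg.norm(ca[:, None, :] - ca[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)                 # exclude self-edges
    edges, feats = [], []
    for i in range(N):
        for j in np.argsort(D[i])[:k]:          # k nearest neighbors of i
            unit = (ca[j] - ca[i]) / D[i, j]    # direction Ca_j - Ca_i
            edges.append((int(j), i))
            feats.append((unit, rbf_encode(D[i, j])))
    return edges, feats
```

Each edge thus carries one vector channel (the unit direction) alongside scalar channels (the RBF-encoded distance, plus the sinusoidal backbone encoding, omitted here).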
Network architecture
Our GVP-GNN takes as input the protein graph defined above and performs graph propagation steps that update the node embeddings according to:

h_m^(j→i) := g( concat( h_v^(j), h_e^(j→i) ) )   (5)

h_v^(i) ← LayerNorm( h_v^(i) + (1/k′) Drop( Σ_{j : e_{j→i} ∈ E} h_m^(j→i) ) )   (6)

where g is a composition of GVPs, h_v^(i) and h_e^(j→i) denote the node and edge embeddings, and k′ is the number of incoming messages. We do not update edge embeddings and do not use a global graph embedding. Between node update steps, we also use a residual point-wise feed-forward update layer:

h_v^(i) ← LayerNorm( h_v^(i) + Drop( g(h_v^(i)) ) )   (7)
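The propagation and feed-forward updates can be sketched with scalar features only; g_msg and g_ff stand in for the GVP compositions, dropout is omitted, and the naming is ours:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Row-wise layer normalization without trainable parameters."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def propagation_step(h_v, h_e, edges, g_msg, g_ff):
    """One node update (messages + mean aggregation with a residual,
    as in Eqs. 5-6) followed by the residual feed-forward update (Eq. 7).
    h_v : (N, d) node embeddings; h_e : dict mapping (j, i) -> edge features."""
    agg = np.zeros_like(h_v)
    count = np.zeros(h_v.shape[0])
    for j, i in edges:                        # message along each edge j -> i
        agg[i] += g_msg(np.concatenate([h_v[j], h_e[(j, i)]]))
        count[i] += 1
    h = layer_norm(h_v + agg / np.maximum(count, 1.0)[:, None])
    return layer_norm(h + g_ff(h))            # residual feed-forward update
```

In the full model, both h_v and the messages carry vector channels as well, with the vector layer norm of Eq. (2) applied alongside the scalar one.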
In model quality assessment, the network performs regression against the true quality score of a candidate structure, a global scalar property. To obtain a single global representation, we apply a node-wise GVP to reduce all features to scalars after all graph propagation steps, and then average the representations across all nodes. (We also tried learning a weighted average of nodes, but this led to increased overfitting.) A final dense network with dropout and layer normalization then outputs the network's prediction.
In computational protein design, the network learns a generative model over the space of protein sequences conditioned on the given backbone structure. Following ingraham2019generative, we frame this as an autoregressive task and use a masked encoder–decoder architecture to capture the joint distribution over all positions: for each position i, the network models the distribution over amino acids at i based on the complete structure graph, as well as the sequence information at positions j < i. The encoder first performs graph propagation on the structural information only. Then, sequence information is added to the graph, and the decoder performs further graph propagation in which incoming messages from positions j ≥ i are computed using only the encoder embeddings. Finally, we use one last GVP with 20-way scalar output and softmax activation to output the probabilities of the amino acids.

The MQA network, structure encoder, and masked decoder each use 3 graph propagation layers. The feed-forward modules consist of 2 GVPs, and the message function g uses 3.
4 Experiments
4.1 Model quality assessment
We compare our method against state-of-the-art methods on the CASP 11–12 test set in Table 1. These include representatives of voxel-based methods (3DCNN and Ornate), a graph-based method (GraphQA), and three methods that use sequential representations. All of these methods learn solely from protein structure (there are two versions of GraphQA; we compare against the one using only structure information), with the exception of ProQ3D, which in addition uses sequence information on related proteins that is not always available. We include ProQ3D because it is an improved version of the best single-model method in CASP 11 and CASP 12 [uziela2017proq3d]. Our method outperforms all other structural methods in both global and per-target correlation, and even performs better than ProQ3D on all but one benchmark.
                                          CASP 11                  CASP 12
                                     Stage 1    Stage 2      Stage 1    Stage 2
Method                               Glob  Per  Glob  Per    Glob  Per  Glob  Per
Ours                                 0.84  0.66  0.87  0.45  0.79  0.73  0.82  0.62
3DCNN [derevyanko2018deep]           0.59  0.52  0.64  0.40  0.49  0.44  0.61  0.51
Ornate [pages2019protein]            0.64  0.47  0.63  0.39  0.55  0.57  0.67  0.49
GraphQA [baldassarre2020graphqa]     0.83  0.63  0.82  0.38  0.72  0.68  0.81  0.61
VoroMQA [olechnovivc2017voromqa]     0.69  0.62  0.65  0.42  0.46  0.61  0.61  0.56
SBROD [karasikov2019smooth]          0.58  0.65  0.55  0.43  0.37  0.64  0.47  0.61
ProQ3D [uziela2017proq3d]            0.80  0.69  0.77  0.44  0.67  0.71  0.81  0.60
The CASP 11–12 datasets have been the most well-benchmarked in the recent MQA literature. However, for completeness, we also evaluate our method on CASP 13 (Table 2). Because of its recency, many target structures remain publicly unavailable. We use the stage 2 candidate structures of a subset of 20 targets previously used for benchmarking [baldassarre2020graphqa]. Our method achieves improved results over all other methods, including ProteinGCN [sanyal2020proteingcn], a more recent graph-based method. Because of the small sample size, we emphasize that these results, although promising, should be considered preliminary until further structures from CASP 13 are available.
Method                               Global  Mean per-target
Ours                                 0.881   0.685
ProteinGCN [sanyal2020proteingcn]    0.723   0.603
VoroMQA [olechnovivc2017voromqa]     0.769   0.665
ProQ3D [uziela2017proq3d]            0.849   0.671
4.2 Protein design
Our method achieves state-of-the-art performance on CATH 4.2, representing a substantial improvement in both perplexity and sequence recovery over Structured Transformer [ingraham2019generative], which was trained on the same training set (Table 3). Following ingraham2019generative, we report evaluation on short (100 or fewer amino acid residues) and single-chain subsets of the CATH 4.2 test set, containing 94 and 103 proteins, respectively, in addition to the full test set. Although Structured Transformer leverages an attention mechanism on top of a graph-based representation of proteins, the authors note in ablation studies that removing attention appeared to increase performance. We therefore retrain and compare against a version of Structured Transformer with the attention layers replaced with standard graph propagation operations, which we call Structured GNN. Our method also improves upon this model.
                                                       Perplexity                Recovery %
Method                                           Short  Single-chain  All   Short  Single-chain  All
Ours                                             7.10   7.44          5.29  32.1   32.0          40.2
Structured GNN                                   8.31   8.88          6.55  28.4   28.1          37.3
Structured Transformer [ingraham2019generative]  8.54   9.03          6.85  28.3   27.6          36.4
We emphasize that the underlying architecture of Structured GNN is comparable to ours, with the exception of GVPs and geometric vector channels. In particular, the underlying protein graph is built in a similar manner, except with structural information encoded in solely scalar channels. For example, relative orientations are encoded in terms of quaternion coefficients (4 scalar channels per edge), whereas we represent absolute orientations with 3 vector channels per node. Therefore, our improvement over Structured GNN is the most direct indication of the benefit of our approach that combines geometric and relational information. Additionally, due to the efficiency of our representation, we are able to achieve this performance gain with a modest but notable decrease in parameter count—1.01M in our model versus 1.38M in Structured GNN and 1.53M in Structured Transformer.
5 Conclusion
In this work, we developed the first architecture designed specifically for learning on dual relational and geometric representations of 3D macromolecular structure. At its core, our method, GVP-GNN, augments graph neural networks with computationally fast layers that perform expressive geometric reasoning over Euclidean vector features. Our method possesses desirable theoretical properties and empirically outperforms existing architectures on learning quality scores and sequence designs from protein structure.
In further work, we hope to apply our architecture to other important structural biology problem areas, including protein complexes, RNA structure, and proteinligand interactions. Our results more generally highlight the promise of domainaware architectures in specialized applications of deep learning, and we hope to continue developing and refining such architectures in the domain of learning from structure.
Acknowledgements
We thank Tri Dao and all members of the Dror group for feedback and discussions.
Funding
We acknowledge support from the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Scientific Discovery through Advanced Computing (SciDAC) program, and Intel Corporation. SE is supported by a Stanford BioX Bowes fellowship. RJLT is supported by the U.S. Department of Energy, Office of Science Graduate Student Research (SCGSR) program.
References
Supplementary Information
Equivariance and invariance
The vector and scalar outputs of the GVP are equivariant and invariant, respectively, with respect to an arbitrary composition of rotations and reflections in 3D Euclidean space described by R; i.e., GVP(s, R(V)) = (s′, R(V′)), where (s′, V′) = GVP(s, V).
Proof.
We can write the transformation described by R as multiplying V with a unitary matrix U from the right. The L2-norm, scalar multiplications, and nonlinearities are defined row-wise as in Algorithm 1. We consider scalar and vector outputs separately. The scalar output, as a function of the inputs, is

s′ = σ( W_m concat( ‖W_h V U‖₂, s ) + b ).

Since ‖W_h V U‖₂ = ‖W_h V‖₂ (row-wise norms are unchanged by right multiplication with a unitary matrix), we conclude s′ is invariant. Similarly, the vector output is

V′ = σ⁺( ‖W_μ W_h V U‖₂ ) ⊙ ( W_μ W_h V U ).

The row-wise scaling ⊙ can also be viewed as left multiplication by the diagonal matrix D = diag( σ⁺( ‖W_μ W_h V U‖₂ ) ). Since ‖W_μ W_h V U‖₂ = ‖W_μ W_h V‖₂, D is invariant. Since

D W_μ W_h ( V U ) = ( D W_μ W_h V ) U,

we conclude that V′ is equivariant. ∎
The ability to approximate rotationinvariant functions
The GVP inherits an analogue of the Universal Approximation property [cybenko1989approximation] of standard dense layers. If R describes an arbitrary rotation or reflection in 3D Euclidean space (more precisely, if R describes a unitary transformation), we show that a GVP can approximate arbitrary scalar-valued functions invariant under R and defined over Ω, the bounded subset of ℝ^(ν×3) whose elements can be canonically oriented based on three linearly independent vector entries. Without loss of generality, we assume the first three vector entries can be used.

The machinery corresponding to such approximations is a GVP with only vector inputs, only scalar outputs, and a sigmoidal nonlinearity σ, followed by a dense layer. This can also be viewed as the sequence of matrix multiplication with W_h, taking the row-wise L2-norm, and a dense network with one hidden layer. Such machinery can be extracted from any two consecutive GVPs (assuming a sigmoidal σ). The theorem is restated from the main text:

Theorem.
Let R describe an arbitrary rotation or reflection in ℝ³. For ν ≥ 3, let Ω ⊂ ℝ^(ν×3) be the set of all V with row vectors v₁, …, v_ν such that v₁, v₂, v₃ are linearly independent and ‖vᵢ‖₂ ≤ b for all i and some finite b. Then for any continuous F : Ω → ℝ such that F(V) = F(R(V)) for all V ∈ Ω, and for any ε > 0, there exists a form f(V) = wᵀ σ( W_m ‖W_h V‖₂ + b ) such that |F(V) − f(V)| < ε for all V ∈ Ω.
Proof.
The idea is to write F as a composition of an invariant function ϕ and a continuous function G, with F = G ∘ ϕ. We show that multiplication with W_h and taking the row-wise L2-norm can compute ϕ, and that the dense network with one hidden layer can approximate G.

Call an element V ∈ Ω oriented if v₁ lies along the positive x-axis, v₂ lies in the xy-plane, and v₃ has nonnegative z-component. Define ϕ to be the orientation function that orients its input and then extracts the vector of coefficients, ϕ(V) ∈ ℝ^(3ν). These coefficients can be written in terms of the norms ‖vᵢ‖₂ and the pairwise inner products ⟨vᵢ, vⱼ⟩ of the vᵢ, and are therefore invariant under rotation and reflection. Then F = G ∘ ϕ, where G := F ∘ ϕ⁻¹ on the image of ϕ.

The key insight is that if we construct W_h such that the rows of W_h V are the original vectors vᵢ and all pairwise differences vᵢ − vⱼ, then we can compute ϕ(V) from the row-wise norms of W_h V. That is, ϕ = ψ ∘ ‖·‖₂ ∘ W_h, where ψ recovers the inner products via ⟨vᵢ, vⱼ⟩ = ( ‖vᵢ‖₂² + ‖vⱼ‖₂² − ‖vᵢ − vⱼ‖₂² ) / 2, an application of the cosine law. The GVP precisely computes ‖W_h V‖₂ as an intermediate step: s′ = σ( W_m ‖W_h V‖₂ + b ). It remains to show that there exists a form that approximates G ∘ ψ. Up to a translation and uniform scaling of the hypercube, this is the result of the Universal Approximation Theorem [cybenko1989approximation]. ∎
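The cosine-law step is easy to check numerically: the row-wise norms of the original vectors and of their pairwise differences suffice to recover every inner product. The function name here is ours:

```python
import numpy as np

def gram_from_norms(norms, diff_norms):
    """Recover the Gram matrix of inner products <v_i, v_j> from the
    norms ||v_i||_2 and difference norms ||v_i - v_j||_2 alone,
    via the cosine law."""
    n = len(norms)
    G = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                G[i, j] = norms[i] ** 2
            else:
                G[i, j] = 0.5 * (norms[i] ** 2 + norms[j] ** 2
                                 - diff_norms[i, j] ** 2)
    return G
```

Since the Gram matrix determines the configuration of vectors up to rotation and reflection, this is exactly the information the norm operation of the GVP makes available to the subsequent dense layers.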