1. Introduction
Decades of scientific enquiry across disciplines have demonstrated just how fundamental form is to function (Boehr and Wright, 2008). A central question that arises in various scientific domains is how to effectively explore the space of all possible forms of a dynamic system to uncover those that satisfy the constraints needed for the system to be active/functional. In computational structural biology, a most visible instantiation of this question is found in the problem of
de novo (or template-free) protein structure prediction (PSP). PSP seeks to determine one or more biologically-active/native forms/structures of a protein molecule from knowledge of its chemical composition (the sequence of amino acids bonded to one another to form a protein chain) (Lee et al., 2017). PSP has a natural and elegant formulation as an optimization problem (Liwo et al., 1999) and is routinely tackled with stochastic optimization algorithms that are tasked with exploring a vast and high-dimensional structure space. These algorithms sample one structure at a time, biased by an expert-designed scoring/energy function that scores sampled structures based on the likelihood of these structures being biologically active/native (Shehu, 2013). While great progress has been made in stochastic optimization for PSP, fundamental challenges remain: achieving a resource-aware balance between exploration (covering more of the structure space so as not to miss novel structures) and exploitation (improving structures so as to reach local optima of the scoring function) in vast and high-dimensional search spaces, and dealing with inherently-inaccurate scoring functions that often drive the search towards score-optimal but non-native structures (Kryshtafovych et al., 2017).
Stochastic optimization frameworks that conduct a scoring-function-biased exploration of a protein's structure space do not learn to avoid generating structures that are unfit according to the scoring function. Efforts by Shehu and others to inject a learning component struggle with how best to connect the structure generation mechanism/engine in these frameworks with the evaluation engine (Maximova et al., 2016). A machine learning (ML) framework would seemingly provide a natural setting; however, while discriminative ML models can in principle be built to evaluate tertiary structures, and a growing body of research is showing their promise (Di Lena et al., 2012; Cheng and Baldi, 2007; Eickholt and Cheng, 2012; Li et al., 2011), effective generative ML models for tertiary protein structures have so far been elusive. The majority of work on leveraging deep neural networks (NNs) for problems related to PSP has been on the prediction of distance or contact maps. The latter are alternative representations of tertiary molecular structures that dispense with storing the Cartesian coordinates of atoms in a molecule and instead record only pairwise distances (or, as in contact maps, 0/1 bits that indicate whether a distance falls below a threshold). Work on contact map (similarly, distance matrix) prediction focuses on the following setting: given the amino-acid sequence of a protein molecule of interest, predict one contact map that represents the most likely native structure of the target protein. The predicted contact map (or distance matrix) can then be used to recover a compatible tertiary structure via optimization algorithms that treat the predicted contacts/distances as restraints
(Vendruscolo et al., 1997; Adhikari et al., 2015). Early work on prediction of a contact or distance matrix from a given amino-acid sequence (Kukic et al., 2014) employed bidirectional recursive NNs (BRNNs). RaptorX-Contact started a thread of productive research on residual convolutional NNs (CNNs) (Wang et al., 2017). DNCON2 leveraged a two-stage CNN (Adhikari et al., 2018). DeepContact deepened the framework via a 9-layer residual CNN (Liu et al., 2018). DeepCov alleviated some of the demands for evolutionary information (Jones and Kandathil, 2018). PconsC4 further limited the input features and significantly shortened prediction time (Michel et al., 2019). SPOT-Contact extended RaptorX-Contact by adding a 2D-RNN stage downstream of the CNN (Hanson et al., 2018). TripletRes put together four CNNs trained end-to-end (Li et al., 2019).
It is worth emphasizing that the above NNs for contact or pairwise distance prediction predict one instance from a given amino-acid sequence. These models do not generate multiple contact maps or multiple distance matrices that would give any view of the diversity of physically-realistic (and possibly functionally-relevant) structures in the structure space of a given protein. The objective of these models is to leverage the strong bias in the input features (which include amino-acid sequence, physicochemical information of amino acids, coevolutionary information, secondary structures, solvent accessibility, and other more sophisticated features) to get the model to output the one prediction that is most likely to represent the native structure. While the performance of these models varies (Torrisi et al., 2020) and is beyond the scope of this paper, this body of work does not leverage generative deep learning and is not focused on generating a distribution of tertiary structures for a given amino-acid sequence (whether directly in Cartesian space or indirectly through the intermediate representation of contact maps or pairwise distance matrices).
Work in our laboratory has integrated some of these contact prediction models, such as RaptorX-Contact, in optimization-based structure generation engines to evaluate the quality of a sampled tertiary structure in lieu of or in addition to a scoring/energy function (Zaman et al., 2019). AlphaFold can also be seen as a similar effort: a deep NN is used to predict a pairwise distance matrix for a given amino-acid sequence, and the predicted distances are then encapsulated in a penalty-based scoring function that guides a gradient descent-based optimization algorithm assembling fragments into tertiary structures (Senior et al., 2019).
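The restraint-based use of predicted distances can be illustrated with a toy harmonic penalty. This is a minimal sketch under stated assumptions, not AlphaFold's actual distance-likelihood potential; the function name and quadratic form are illustrative choices.

```python
# Illustrative sketch: score a candidate structure by its squared deviation
# from a predicted pairwise distance matrix. A toy stand-in for the
# penalty-based scoring functions described in the text.
import numpy as np

def restraint_penalty(coords: np.ndarray, pred_dist: np.ndarray) -> float:
    """coords: (N, 3) candidate CA positions; pred_dist: (N, N) predicted distances."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    resid = dist - pred_dist
    iu = np.triu_indices_from(resid, k=1)   # count each residue pair once
    return float(np.sum(resid[iu] ** 2))
```

A gradient-based optimizer can then minimize this penalty over `coords`, driving the candidate structure toward the predicted distances.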
While it remains more challenging to generate a distribution of tertiary structures (whether directly or indirectly through the concept of contact maps or distance matrices), some preliminary efforts have been made (Anand and Huang, 2018b; Anand et al., 2019; Sabban and Markovsky, 2019; Ingraham et al., 2019). Work in (Anand and Huang, 2018b) investigates the promise of such models for protein de novo design. Other work that focuses on PSP remains preliminary, and the lack of rigorous evaluation on metrics of interest in PSP and, more generally, protein modeling leaves unclear what these models have actually learned.
Inspired by recent momentum in generative deep learning, we propose here a novel approach based on deep generative models. Leveraging recent opportunities in graph generative learning, we put forth a disentanglement-enhanced contact variational autoencoder (VAE), to which we refer as DECOVAE from now on. Like any other deep neural network, DECOVAE learns from data, which we obtain from an initial population generated by the state-of-the-art stochastic optimization engine Rosetta (Leaver-Fay et al., 2011). As a VAE, DECOVAE learns from this data the latent space encapsulating the distribution and then directly generates more data (tertiary structures) from the latent space.
DECOVAE addresses several hurdles that stand in the way of generative models for highly-structured data. First, each tertiary structure is represented as a graph via the concept of the contact graph, and DECOVAE learns the latent space of such graphs, directly sampling novel contact graphs from this space. In this first version, readily-available 3D reconstruction web servers, such as CONFOLD (Adhikari et al., 2015), are used to obtain tertiary structures corresponding to DECOVAE-generated graphs. Second, DECOVAE addresses the issue of interpretability via its disentanglement mechanism. In one of several insightful analyses presented in this paper, we demonstrate that DECOVAE is more powerful than a version with no disentanglement mechanism, COVAE, as well as other deep architectures. We show that DECOVAE generates a distribution that does not simply reproduce the input distribution but instead contains novel and better-quality structures. A thorough comparison with several alternative generative models is also presented. In addition, the disentanglement mechanism integrated in DECOVAE elucidates what exactly about the tertiary structures the "axes"/factors of the learned latent space control, thus opening the black box often associated with deep neural networks.
The paper proceeds as follows. First, we present a summary of related work. DECOVAE, the input data, and other components of our approach (reconstruction of tertiary structures and various analyses and metrics) are then described in detail. The evaluation of the obtained distribution and the interpretability of DECOVAE are presented next. The paper concludes with a summary of remaining challenges and future opportunities.
2. Background and Related Work
The concept of a contact graph or a contact map has long been leveraged in molecular biology, particularly in protein modeling (Wu and Zhang, 2008). Briefly, the contact map is an alternative, compressed representation of a tertiary molecular structure. We can think of it as an adjacency matrix, where entry (i, j) encodes whether the atoms at positions i and j are spatially proximate; proximity is determined via a distance threshold (in Å), whose value varies in the computational biology literature. Typically, the main carbon atoms of each amino acid (the CA atoms) are considered. Some works focus instead on the beta-carbon atoms (connecting the side-chain atoms to the backbone atoms of an amino acid). In this paper, we take the convention of CA-based contact maps.
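The construction above can be sketched in a few lines. The 8 Å threshold below is an illustrative choice for the sake of the example, not a value prescribed by this paper.

```python
# Sketch: build a CA-based contact map from an array of CA coordinates.
# The 8.0 Angstrom threshold is illustrative; conventions vary in the
# literature.
import numpy as np

def contact_map(ca_coords: np.ndarray, threshold: float = 8.0) -> np.ndarray:
    """ca_coords: (N, 3) array of CA atom positions, one row per amino acid.
    Returns an (N, N) 0/1 adjacency matrix."""
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]  # pairwise displacements
    dist = np.linalg.norm(diff, axis=-1)                  # (N, N) distance matrix
    cmap = (dist <= threshold).astype(np.uint8)
    np.fill_diagonal(cmap, 0)                             # no self-contacts
    return cmap
```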
Interest in predicting contacts from a given amino-acid sequence has a long history in protein structure prediction, as contact prediction was thought to be perhaps less challenging than predicting other representations (such as Cartesian coordinates). Many machine learning approaches have been used, including support vector machines (Cheng and Baldi, 2007; Wu and Zhang, 2008; Li et al., 2011), neural networks (Tegge et al., 2009; Pollastri and Baldi, 2002; Walsh et al., 2009; Fariselli et al., 2001; Hamilton et al., 2004), and deep learning (Di Lena et al., 2012; Eickholt and Cheng, 2012), with varied success. These approaches struggle with how best to capture the inherent structure in the input training data and how to discover the pertinent latent space. More recent methods leverage the concept of the latent space via variational autoencoders (Ma et al., 2019; Livi et al., 2016; Anand and Huang, 2018a). Specifically, Ma et al. (Ma et al., 2019) develop a deep learning-based approach that uses convolutions and a variational autoencoder (a CVAE) to cluster tertiary structures obtained from folding simulations and discover intermediate states along folding pathways. Anand et al. (Anand and Huang, 2018a) apply generative adversarial networks (GANs) to de novo protein design. The authors encode protein structures in terms of pairwise distances between CA atoms, eliminating the need for the generative model to learn translational and rotational symmetries. Livi et al. (Livi et al., 2016) are the first to treat the contact map as a graph and propose a generative model that learns the probability distribution of the existence of a link/contact given the distance between two nodes.
To the best of our knowledge, no deep graph-generative models have been applied to contact map generation; the DECOVAE we present in this paper is the first to do so. The method leverages developments in graph-generative neural networks (GNNs). Most existing GNN-based graph generation methods for general graphs have been proposed in the last two years and are based on VAEs (Simonovsky and Komodakis, 2018; Samanta et al., 2019) or GANs (Bojchevski et al., 2018). Most of these approaches generate nodes and edges sequentially to form a whole graph, leading to sensitivity to the generation order and high time costs for large graphs. Specifically, GraphVAE (Simonovsky and Komodakis, 2018) represents each graph by its adjacency matrix and feature vector and utilizes the VAE model to learn the distribution of graphs conditioned on a latent representation at the graph level. Graphite (Grover et al., 2018) and VGAE (Kipf and Welling, 2016b) encode the nodes of each graph into node-level embeddings and predict the links between each pair of nodes to generate a graph. In contrast, GraphRNN (You et al., 2018) represents each graph as a sequence of node and edge additions, builds an autoregressive generative model over these sequences with a long short-term memory (LSTM) model, and shows good scalability.
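The node-embedding decoders of VGAE and Graphite can be illustrated with the inner-product decoder of VGAE, in which the probability of an edge between nodes i and j is the sigmoid of the dot product of their embeddings. The sketch below is a simplified, NumPy-only illustration; the actual models learn the embeddings with graph neural networks.

```python
# Sketch of a VGAE-style inner-product decoder: given node embeddings Z,
# edge (contact) probabilities are sigmoid(z_i . z_j).
import numpy as np

def inner_product_decoder(Z: np.ndarray) -> np.ndarray:
    """Z: (N, d) node embeddings. Returns an (N, N) matrix of edge probabilities."""
    logits = Z @ Z.T
    return 1.0 / (1.0 + np.exp(-logits))  # element-wise sigmoid

def sample_graph(Z: np.ndarray, rng=None) -> np.ndarray:
    """Sample a symmetric 0/1 adjacency matrix (no self-loops) from the probabilities."""
    rng = rng or np.random.default_rng(0)
    probs = inner_product_decoder(Z)
    upper = np.triu(rng.random(probs.shape) < probs, k=1)  # sample upper triangle only
    return (upper | upper.T).astype(np.uint8)              # symmetrize
```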
These current graph generation models have two drawbacks: (i) the encoder and decoder architectures are not powerful enough to handle real-world graphs with complex structural relationships; (ii) they are black boxes that lack interpretability. In the biology domain, we need to understand the features learned from the observed data and how the contact maps are generated.
In this paper, we additionally address the issue of interpretability via the concept of disentanglement. Recently, a variety of disentangled representation learning algorithms have been proposed for VAEs. Assuming that the data are generated from independent factors of variation (e.g., object orientation and lighting conditions in images of objects), disentanglement as a meta-prior encourages these factors to be captured by different independent variables in the representation. Learning via disentanglement should result in a concise abstract representation of the data useful for a variety of downstream tasks, and promises improved sample efficiency. This has motivated a search for ways of learning disentangled representations, where perturbations of an individual dimension of the latent code perturb the corresponding factor of variation in the generated output in an interpretable manner. A well-known example is β-VAE (Higgins et al., 2017). This has prompted a number of approaches that modify the VAE objective by adding, removing, or altering the weight of individual terms (Kim and Mnih, 2018; Chen et al., 2018; Zhao et al., 2017; Kumar et al., 2017; Lopez et al., 2018; Esmaeili et al., 2018; Alemi et al., 2016). Marrying recent developments in disparate communities, we present here a VAE that generates contact graphs (of tertiary protein structures) and learns disentangled representations that allow interpreting how the learned latent factors control changes at the contact graph level. By utilizing known 3D reconstruction tools, we additionally show how the factors control changes in Cartesian space. The following section provides further details on DECOVAE and the other models we design and assess for their ability to generate physically-realistic tertiary protein structures.
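The disentanglement-by-reweighting idea behind β-VAE (Higgins et al., 2017), referenced above, modifies the standard VAE evidence lower bound by scaling the KL term:

```latex
% beta-VAE objective: for beta > 1, the up-weighted KL term pressures the
% approximate posterior toward the factorized prior p(z), encouraging
% independent (disentangled) latent dimensions.
\mathcal{L}(\theta, \phi; x, \beta) =
  \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  - \beta \, D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
```

Setting β = 1 recovers the standard VAE objective; the approaches cited next vary which terms of this bound are added, removed, or reweighted.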
3. Results
Comparison Methods
We utilize three deep graph generation models to validate the superiority of our proposed model. VGAE (Kipf and Welling, 2016b) is a framework for unsupervised learning on graph-structured data based on the variational autoencoder (VAE). This model makes use of latent variables and is capable of learning interpretable latent representations for undirected graphs. Graphite (Grover et al., 2018) is an algorithmic framework for unsupervised learning of representations over nodes in large graphs using deep latent-variable generative models. This model parameterizes variational autoencoders (VAEs) with graph neural networks and uses a novel iterative graph refinement strategy inspired by low-rank approximations for decoding. GraphRNN (You et al., 2018) is a deep autoregressive model that approximates any distribution of graphs with minimal assumptions about their structure. GraphRNN learns to generate graphs by training on a representative set of graphs and decomposes the graph generation process into a sequence of node and edge formations, conditioned on the graph structure generated so far.
Training and Testing Dataset
The evaluation is carried out on proteins of varying lengths and folds that are used as a benchmark to evaluate structure prediction methods (Zhang et al., 2018; Zaman and Shehu, 2019). For each protein, given its amino-acid sequence in fasta format, we run the Rosetta AbInitio protocol available in the Rosetta suite (Leaver-Fay et al., 2011) to obtain an ensemble of structures. For each structure, we retain only the central carbon (CA) atom of each amino acid (effectively discarding the other atoms). Each such CA-only structure is converted into a contact graph, as described above. The contact graph dataset of each protein is split into a training and a testing set in an x:y split. The deep models described above are trained separately on each protein. An overview of the Disentanglement-enhanced Contact VAE (DECOVAE) is illustrated in Fig. 6. By observing many decoy contact maps for a protein sequence, it learns the distribution of these decoys and generates many more decoys for that protein sequence. DECOVAE has two main capabilities: contact generation and interpretation. Contact generation is implemented as a deep graph variational autoencoder: the graph encoder learns the distribution of the decoys of a given sequence, new decoys are generated from a low-dimensional latent code, and the generated contact maps can then be recovered into 3D structures by existing methods. Contact interpretation ensures that each element of the latent representation controls an interpretable factor of the generated protein structure; by changing the value of an element of the latent code continuously, we can control different properties of the generated protein structure. The proposed method also has a variant that does not consider disentanglement but only targets the generation of contact maps, named Contact VAE (COVAE).
COVAE has the same structure as DECOVAE, except for the objective function used when training the generative model. The details of both methods are given in the Methods section.
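The dataset preparation described above can be sketched as follows. The 80/20 ratio and the function name are illustrative assumptions; the paper's x:y split ratio is not specified here.

```python
# Sketch: shuffle a protein's decoy contact maps and split them into
# training and testing sets. The 0.8 train fraction is an assumed example
# value, not the paper's actual split.
import numpy as np

def split_dataset(contact_maps, train_frac=0.8, seed=0):
    """contact_maps: list of (N, N) 0/1 arrays for one protein."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(contact_maps))
    cut = int(train_frac * len(contact_maps))
    train = [contact_maps[i] for i in idx[:cut]]
    test = [contact_maps[i] for i in idx[cut:]]
    return train, test
```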
Comparing the generated contact maps with the native one
The proposed methods have been evaluated on 13 datasets, each of which corresponds to a protein. For each protein, a sequence of amino acids, we use thousands of its contact map decoys to train the COVAE and DECOVAE models. Based on each well-trained model, the same number of contact maps are generated and compared with the single native one. For each generated contact map, we compute the precision, recall, coverage, and F1 score against the native map, as shown in Table 1. In addition, the performance of our methods is also compared with other graph generation methods from the machine learning domain (Graphite (Grover et al., 2018), VGAE (Kipf and Welling, 2016b), GraphRNN (You et al., 2018)). The algorithm and parameter settings of the proposed COVAE and DECOVAE models and of the comparison methods are described in detail in the Methods section.
As shown in Table 1, the graphs generated by COVAE and DECOVAE achieve very high F1 scores and outperform the other methods by a large margin. Specifically, our precision is 0.67 and 0.71 on average, over 53.4% and 56.7% higher than the best-performing comparison method; our coverage is 0.55 and 0.58 on average, over 12.7% and 24.8% higher; our recall is 0.57 and 0.59 on average, over 12.2% and 15.2% higher; and our F1 score is 0.61 and 0.64 on average, over 52.4% and 54.7% higher. Moreover, the proposed methods (COVAE and DECOVAE) obtain the best performance on the four metrics in almost 76% of the proteins. This indicates that COVAE and DECOVAE truly learn the underlying distribution of the contact map decoys for an amino-acid sequence. The good performance of the proposed methods relies on their special architecture, in which each amino acid has its own parameters for extracting features from its contacts, and structural similarity among different amino acids is considered in the generation process.
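The per-map metrics in Table 1 can be sketched as below, treating the native contacts as ground truth. This is a minimal sketch: the paper's exact definition of "coverage" is not reproduced here, since its reported values differ slightly from recall.

```python
# Sketch: precision, recall, and F1 between a generated contact map and the
# native one, computed over the upper triangle (each residue pair once).
import numpy as np

def contact_scores(generated: np.ndarray, native: np.ndarray):
    """Both inputs: (N, N) 0/1 contact maps. Returns (precision, recall, f1)."""
    iu = np.triu_indices_from(native, k=1)
    g, n = generated[iu].astype(bool), native[iu].astype(bool)
    tp = np.sum(g & n)                         # contacts present in both maps
    precision = tp / max(np.sum(g), 1)
    recall = tp / max(np.sum(n), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1
```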
Protein  Model  Precision  Coverage  Recall  F1_score  Protein  Model  Precision  Coverage  Recall  F1_score 

1DTJA  Graphite  0.1320  0.5000  0.5000  0.2031  1DTDB  Graphite  0.1526  0.4893  0.5002  0.2338 
VGAE  0.1390  0.5109  0.5000  0.2101  VGAE  0.1526  0.4892  0.5001  0.2339  
GraphRNN  0.3153  0.1301  0.1301  0.1733  GraphRNN  0.2974  0.2696  0.2755  0.2858  
COVAE  0.7432  0.3310  0.3312  0.4607  COVAE  0.6144  0.5961  0.6093  0.6118  
DECOVAE  0.7429  0.3347  0.3346  0.4618  DECOVAE  0.6136  0.6014  0.6148  0.6142  
1AIL  Graphite  0.1277  0.4912  0.4994  0.2034  2H5ND  Graphite  0.0748  0.4976  0.5004  0.1302 
VGAE  0.1276  0.4911  0.4992  0.2032  VGAE  0.0747  0.4971  0.4998  0.1300  
GraphRNN  0.3900  0.3718  0.37802  0.3836  GraphRNN  0.2846  0.2829  0.2845  0.2844  
COVAE  0.8244  0.8391  0.8531  0.8385  COVAE  0.7127  0.6393  0.6428  0.6759  
DECOVAE  0.8231  0.8393  0.8533  0.8379  DECOVAE  0.7248  0.6354  0.6389  0.6791  
1ALY  Graphite  0.0684  0.4981  0.5002  0.1203  1HHP  Graphite  0.0902  0.4979  0.5003  0.1529 
VGAE  0.0683  0.4989  0.5011  0.1206  VGAE  0.0903  0.4978  0.5002  0.1529  
GraphRNN  0.2356  0.1979  0.1987  0.2155  GraphRNN  0.2804  0.2582  0.2594  0.2693  
COVAE  0.8034  0.4463  0.4482  0.5754  COVAE  0.6494  0.6612  0.6643  0.6567  
DECOVAE  0.8368  0.4408  0.4427  0.5791  DECOVAE  0.6544  0.6591  0.6622  0.6583  
1AOY  Graphite  0.1065  0.4963  0.4995  0.1756  1BQ9  Graphite  0.1566  0.4809  0.4994  0.2384 
VGAE  0.1066  0.4964  0.4996  0.1757  VGAE  0.1568  0.4815  0.5001  0.2387  
GraphRNN  0.3717  0.2959  0.2978  0.3299  GraphRNN  0.3659  0.3232  0.3356  0.3494  
COVAE  0.7620  0.7363  0.7409  0.7511  COVAE  0.4658  0.4451  0.4622  0.4639  
DECOVAE  0.8109  0.7299  0.7345  0.7708  DECOVAE  0.4638  0.4363  0.4531  0.4584  
1C8CA  Graphite  0.1322  0.4907  0.5002  0.2091  1CC5  Graphite  0.1234  0.4843  0.4997  0.1980 
VGAE  0.1319  0.4900  0.4995  0.2088  VGAE  0.1234  0.4846  0.4999  0.1979  
GraphRNN  0.3381  0.3065  0.3124  0.3239  GraphRNN  0.3489  0.2733  0.2820  0.3117  
COVAE  0.6513  0.6707  0.6837  0.6670  COVAE  0.8972  0.5878  0.6064  0.7237  
DECOVAE  0.6538  0.6695  0.6824  0.6678  DECOVAE  0.8965  0.5864  0.6051  0.7225  
1HZ6A  Graphite  0.1500  0.4933  0.5000  0.2308  1ISUA  Graphite  0.1534  0.4891  0.4996  0.2348 
VGAE  0.1500  0.4933  0.5000  0.2307  VGAE  0.1537  0.4898  0.5003  0.2351  
GraphRNN  0.1849  0.3543  0.3592  0.2441  GraphRNN  0.2837  0.2546  0.2601  0.2711  
COVAE  0.5960  0.4519  0.4581  0.5179  COVAE  0.7542  0.4738  0.4839  0.5886  
DECOVAE  0.6067  0.4545  0.4607  0.5237  DECOVAE  0.8435  0.4738  0.4839  0.6151  
1SAP  Graphite  0.1329  0.4958  0.4993  0.2099  1TIG  Graphite  0.1055  0.4945  0.5008  0.1742 
VGAE  0.1332  0.4964  0.5000  0.2103  VGAE  0.1054  0.4939  0.5001  0.1740  
GraphRNN  0.3021  0.3178  0.3202  0.3107  GraphRNN  0.2787  0.2687  0.2721  0.2752  
COVAE  0.3021  0.3179  0.3201  0.3106  COVAE  0.5797  0.5691  0.5764  0.5780  
DECOVAE  0.6681  0.6838  0.6888  0.6783  DECOVAE  0.5799  0.5689  0.5761  0.5780
Percentage of native and non-native contacts in a structure
For each generated contact map, we calculate the percentage of its contacts that are indeed native contacts, namely contacts present in the native structure. Specifically, given the contacts of a generated structure (model) and the known native structure, we calculate the percentage of native contacts that also appear in the reconstructed structure. This percentage is calculated for each reconstructed structure (model), and an average over all reconstructed structures is reported per protein and per method. In addition, we also calculate the percentage of non-native contacts: contacts that are in a reconstructed structure but not in the native structure. For each model, we count the non-native contacts and divide by the number of amino acids in the protein, then report the average over all structures per protein and per method. The comparison methods are the same as those mentioned above.
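The two percentages just described can be sketched as follows; function and variable names are illustrative.

```python
# Sketch: percentage of native contacts recovered by a generated map, and
# the number of non-native contacts normalized by protein length, per the
# definitions in the text.
import numpy as np

def native_nonnative(generated: np.ndarray, native: np.ndarray):
    """Both inputs: (N, N) 0/1 contact maps."""
    iu = np.triu_indices_from(native, k=1)
    g, n = generated[iu].astype(bool), native[iu].astype(bool)
    pct_native = 100.0 * np.sum(g & n) / max(np.sum(n), 1)   # % of native contacts found
    nonnative_per_aa = 100.0 * np.sum(g & ~n) / native.shape[0]  # non-native, per residue
    return pct_native, nonnative_per_aa
```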
As shown in Table 2, the proposed COVAE and DECOVAE achieve both the highest percentage of native contacts and the lowest percentage of non-native contacts in around 86% of the proteins, often by a very large margin (33% on average for native contacts and 61% on average for non-native contacts, compared to the best-performing comparison method). The results demonstrate that the proposed methods can accurately generate native-like contacts and that the generated contact maps strike a balance between contacts and non-contacts similar to that of the native structure. Moreover, even though DECOVAE adds the disentanglement term to the objective function, its performance on contact map generation remains unaffected, and is sometimes even better than that of COVAE.
Protein ID  Length  Metric type  VGAE  Graphite  GraphRNN  COVAE  DECOVAE 

1AIL  73  contacts  49.41  49.47  35.94  97.06  97.00 
73  noncontacts  87.86  87.84  62.26  5.14  5.00  
1ALY  146  contacts  49.46  49.60  28.86  65.00  63.00 
146  noncontacts  98.98  98.98  95.28  85.07  85.02  
1AOY  78  contacts  49.46  49.44  27.39  82.73  83.00 
78  noncontacts  89.39  89.39  65.10  14.54  8.01  
1BQ9  54  contacts  49.57  49.56  28.64  83.41  83.34 
54  noncontacts  84.57  84.53  66.92  12.73  11.43  
1C8CA  64  contacts  49.44  49.42  27.94  92.03  93.00 
64  noncontacts  86.68  86.68  68.11  9.65  8.08  
1CC5  83  contacts  49.51  49.44  26.49  59.83  59.52 
83  noncontacts  88.66  88.67  69.09  18.62  19.06  
1DTDB  61  contacts  49.52  49.57  25.65  37.80  38.00 
61  noncontacts  86.69  86.69  74.63  65.61  66.00  
1DTJA  76  contacts  49.56  49.49  10.56  27.97  27.02 
76  noncontacts  88.18  88.19  75.10  46.86  43.71  
1HHP  99  contacts  49.48  49.55  24.32  90.95  91.11 
99  noncontacts  90.82  90.83  72.88  10.77  9.18  
1HZ6A  72  contacts  49.48  49.51  35.11  45.53  46.00 
72  noncontacts  88.50  88.50  85.53  55.72  55.00  
1ISUA  62  contacts  49.59  49.43  24.64  51.83  52.00 
62  noncontacts  85.15  85.19  72.65  19.65  9.34  
1SAP  66  contacts  49.52  49.43  29.21  91.70  92.18 
66  noncontacts  86.93  86.96  71.81  9.85  10.47  
1TIG  94  contacts  49.75  50.11  43.70  100.00  100.00 
94  noncontacts  99.00  99.00  98.73  98.00  98.00  
2H5ND  133  contacts  49.53  49.50  25.22  82.61  81.15 
133  noncontacts  93.56  93.61  76.86  19.92  19.40 
Evaluating the learned graph distribution
Since the successful generation of contact maps relies on successfully learning their distribution, we evaluate whether the generated contact maps follow the learned distribution. Regarding each contact map as a graph, we calculate four properties of each graph: density, number of edges, average degree correlation, and transitivity. The density of a graph is the ratio of the number of edges to the number of possible edges. The average degree correlation measures the similarity of connections in the graph with respect to node degree. Transitivity is the overall probability for the graph to have adjacent nodes interconnected. All these properties can be calculated with the open-source library NetworkX. The distance between the distribution of the generated contact maps and that of the training set, in terms of the four properties, is then measured by three metrics: the Pearson correlation coefficient, the Bhattacharyya distance, and the Earth Mover's Distance (EMD). In statistics, the Pearson correlation coefficient (PCC) is a measure of the linear correlation between two variables X and Y; here, X and Y refer to a graph property of the generated graphs and of the training graphs, respectively. The Bhattacharyya distance and EMD both measure the similarity of two probability distributions, with smaller values being better. The results for one example protein are shown in Table 3, alongside the comparison methods; results for the other proteins can be found in the supplementary materials. As shown in Table 3, in terms of EMD, the proposed COVAE has the best performance on all the graph properties. For the Bhattacharyya distance, COVAE outperformed the comparison methods by 33.38% and the disentangled version (DECOVAE) outperformed them by 33.68%. The Pearson correlation is very small, which demonstrates that the generated graphs differ to some degree from the training set, ensuring the diversity of the generated contact maps. Specifically, in terms of EMD, the proposed COVAE and DECOVAE achieve 1.4 and 2.9 on average, 65.8% and 29.2% smaller than the best-performing comparison method.
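The property and distance computations above can be sketched as follows. The NetworkX calls are the library's actual API; the histogram binning used for the Bhattacharyya distance is an illustrative choice, since the paper does not specify one.

```python
# Sketch: graph properties via NetworkX, plus the three distribution
# comparison metrics (Pearson, histogram-based Bhattacharyya, EMD).
import numpy as np
import networkx as nx
from scipy.stats import wasserstein_distance

def graph_properties(adj: np.ndarray):
    G = nx.from_numpy_array(adj)
    return {
        "density": nx.density(G),
        "num_edges": G.number_of_edges(),
        "degree_assortativity": nx.degree_assortativity_coefficient(G),
        "transitivity": nx.transitivity(G),
    }

def bhattacharyya(x, y, bins=20):
    """Histogram-based Bhattacharyya distance between two samples."""
    lo, hi = min(np.min(x), np.min(y)), max(np.max(x), np.max(y))
    p, _ = np.histogram(x, bins=bins, range=(lo, hi))
    q, _ = np.histogram(y, bins=bins, range=(lo, hi))
    p, q = p / p.sum(), q / q.sum()
    bc = np.sum(np.sqrt(p * q))          # Bhattacharyya coefficient
    return -np.log(max(bc, 1e-12))

def compare(x, y):
    """x, y: equal-length samples of one graph property (generated vs. training)."""
    return {
        "pearson": float(np.corrcoef(x, y)[0, 1]),
        "bhattacharyya": float(bhattacharyya(x, y)),
        "emd": float(wasserstein_distance(x, y)),
    }
```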
Graph Property  Methods  Pearson  Bhattacharyya  EMD 

Density  Graphite  0.008  5.410  0.465 
VGAE  0.007  5.415  0.466  
GraphRNN  0.022  2.450  0.004  
COVAE  0.018  4.941  0.002  
DECOVAE  0.004  3.740  0.004  
Number of Edge  Graphite  0.007  5.410  1327 
VGAE  0.007  5.415  1329  
GraphRNN  0.022  2.450  12.18  
COVAE  0.019  4.941  5.423  
DECOVAE  0.004  3.741  11.29  
AveDegree Cor  Graphite  0.030  5.056  0.136 
VGAE  0.009  5.401  0.030  
GraphRNN  Nan  3.413  Nan  
COVAE  0.005  5.310  0.017  
DECOVAE  0.004  3.361  0.092  
Transitivity  Graphite  0.008  5.436  0.244 
VGAE  0.008  5.348  0.307  
GraphRNN  0.009  2.886  0.122  
COVAE  0.005  5.159  0.004  
DECOVAE  0.005  3.571  0.027
Visualizing the 3D Structure from Generated Contact Maps
Given the contact maps generated by the proposed COVAE and DECOVAE, we use the existing tool CONFOLD (Adhikari et al., 2015) to reconstruct each contact map into its corresponding 3D structure. To visualize and evaluate the resulting 3D structures, we use PyMOL (DeLano, 2002), a widely used, user-sponsored molecular visualization system. We show results both for the proposed COVAE model and for its disentangled variant, DECOVAE, and also compare our models with the graph generative model GraphRNN. As shown in Fig. 1, the proposed COVAE and DECOVAE perform better at recovering both secondary and tertiary structure.
Interpreting the disentangled latent representation
It is important to be able to measure the level of disentanglement achieved by different models. In this subsection, we qualitatively demonstrate that our proposed DECOVAE framework consistently discovers more latent factors and disentangles them in a cleaner fashion. Having learned a latent code representation of a protein structure, we assume that each variable in the latent code corresponds to a certain factor or property in generating the protein structure. Thus, by changing the value of one variable continuously while fixing the remaining variables, we can visualize the corresponding change in the generated contact map and 3D protein structure.
First, an example protein, 1DTJA, is analyzed and displayed. As shown in Fig. 3 and Fig. 2, we use the convention that in each row one latent code varies from left to right while the other latent codes and the noise are fixed. Different rows correspond to different random samples of the fixed latent codes and noise. For instance, in Fig. 2, each column contains four generated contact map samples for a given variable, and each row shows the contact maps generated for 5 values between 1 and 10000 of a certain variable of the latent code, with the other noise fixed. From the changing contact maps in each row, it is easy to identify which part of the contact map is related to, or controlled by, that variable; we highlight the part controlled by each variable with a red circle in Fig. 2. The corresponding 3D structure visualization is shown in Fig. 3. In this way, we can easily interpret the role of each variable of the latent code in controlling the generation of the contact maps and the 3D structure.
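The latent-traversal procedure just described can be sketched as follows. Here `decoder` is a hypothetical stand-in for the trained DECOVAE decoder: any function mapping a latent vector to an (N, N) contact-map probability matrix.

```python
# Sketch: vary one latent dimension over a grid of values while holding the
# others fixed, decoding each perturbed code into a contact map. `decoder`
# is a stand-in for a trained decoder, not the paper's actual model.
import numpy as np

def latent_traversal(decoder, base_code: np.ndarray, dim: int, values):
    """Return one decoded contact map per value of latent dimension `dim`."""
    maps = []
    for v in values:
        code = base_code.copy()
        code[dim] = v                  # perturb only the chosen dimension
        maps.append(decoder(code))
    return maps
```

Laying the resulting maps out left to right (one row per traversed dimension) reproduces the grid-of-maps visualization used in Fig. 2.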
4. Discussion
In summary, our study aims at utilizing deep neural network-based generative models for generating tertiary protein structures as well as interpreting the generation process. To the best of our knowledge, we show for the first time the development and application of an interpretative graph variational autoencoder for the problem of protein structure generation and interpretation. This demonstrates the promise of generative models in directly revealing a latent space for sampling novel tertiary structures, highlighting axes/factors that carry structural meaning, and opening the black box of deep models.
By treating the tertiary protein structure construction problem as the generation of contact maps, we proposed a new disentangled VAE, named the Contact Map VAE (COVAE), with a new graph encoder and decoder. It first learns the underlying distribution of contact maps and then generates additional contact maps by sampling from the learned distribution. The structural similarity (e.g., in terms of precision, recall, F1-score, and coverage) as well as the positive-contact percentage between the generated contact maps and the native ones demonstrated the quality of the generated contact maps. The proposed methods (COVAE and DCOVAE) show clear advantages over existing models, obtaining the best performance on the four metrics for almost 76% of the proteins, as well as the highest percentage of native contacts and the lowest percentage of non-native contacts for around 86% of all the proteins. Furthermore, we reconstructed the generated contact maps into 3D protein structures. The visualization of the reconstructed 3D proteins illustrated that the generated contact maps are valid and useful for further generating the 3D tertiary protein structures.
To further investigate whether the proposed COVAE indeed learns the distribution of the observed contact-map samples, we generated a set of contact maps and compared their underlying distribution with that of the real contact maps. Since it is difficult to evaluate a graph distribution directly, we resorted to evaluating the distributions of graph properties, such as density, edge number, average degree correlation, and transitivity. The small EMD and Bhattacharyya distances between the learned and real distributions in terms of all four graph properties validated that COVAE and DCOVAE effectively discover the underlying graph distribution; these distances improved on those obtained by the comparison methods by 33.4% and 47.5% on average. Though the Pearson correlation score was very low, this indicated that the diversity of the generated graphs was ensured, since Pearson correlation measures the strength and direction of a linear relationship between two variables rather than the similarity of two distributions.
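This evaluation protocol can be illustrated as follows. The graphs and property values here are synthetic stand-ins, not the experimental data, and only the density property with the Bhattacharyya distance is shown; the other properties and the EMD are computed analogously.

```python
import numpy as np

def bhattacharyya_distance(samples_p, samples_q, bins=20):
    """Bhattacharyya distance between two empirical distributions of a
    scalar graph property (e.g., density or transitivity)."""
    lo = min(samples_p.min(), samples_q.min())
    hi = max(samples_p.max(), samples_q.max())
    p, _ = np.histogram(samples_p, bins=bins, range=(lo, hi))
    q, _ = np.histogram(samples_q, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    bc = np.sum(np.sqrt(p * q))       # Bhattacharyya coefficient, 1 if equal
    return -np.log(bc + 1e-12)        # distance; ~0 when distributions match

def graph_density(adj):
    """Edge density of an undirected graph given its adjacency matrix
    (self-loops on the diagonal are ignored)."""
    n = adj.shape[0]
    edges = (np.triu(adj, k=1) > 0).sum()
    return edges / (n * (n - 1) / 2)

# Synthetic "real" and "generated" graph sets drawn from the same process:
# their density distributions should be close, so the distance is small.
rng = np.random.default_rng(1)
real = np.array([graph_density(rng.integers(0, 2, size=(10, 10)))
                 for _ in range(200)])
gen = np.array([graph_density(rng.integers(0, 2, size=(10, 10)))
                for _ in range(200)])
print(bhattacharyya_distance(real, gen))
```

A large distance would instead indicate that the generator failed to capture the property's distribution.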
Next, to explore our generative model's capability of interpreting the process of contact-map generation, we enhanced our COVAE by increasing the weight of the second term of the training objective, leading to disentanglement among the latent variables and hence a new interpretable model, named DCOVAE. The learned latent representation variables in DCOVAE are expected to relate to the factors that influence the formation of the contact maps. As a result, for each latent variable, when varying the value of this variable from 1 to 10,000 while fixing the values of the others, a certain local part of the generated contact maps showed obvious trends (e.g., contracting or stretching). This demonstrated that the learned latent variables are effectively disentangled, which indicates potential semantic factors in the formation of contact maps and the corresponding protein structures.
On one hand, although some deep generative models have been applied to protein contact-map generation, they do not treat contact maps as graph-structured data and thus cannot exploit the relationships among amino acids with graph generative models. On the other hand, although many interpretable learning models have been explored and applied to image generation, there are no interpretable models that can discover the semantic factors controlling the process of protein folding. In summary, to the best of our knowledge, this is the first time an interpretable deep generative model for graphs has been applied to protein structure prediction, and its effectiveness in generating good-quality protein structures has been demonstrated. In addition, the proposed DCOVAE can also be applied to other real-world applications where there is a need for graph-structured data generation and interpretation.
Many promising and challenging topics for future exploration originate from the research in this paper. For example, it would be interesting and potentially beneficial to develop an end-to-end tertiary protein structure generation model that operates directly on 3D structures instead of contact maps. This is because there is a gap between contact-map generation and the 3D-structure formation process, and the learned variables can only explain the formation of the contact maps well, not that of the 3D structures. In addition, the exploration of a node-edge joint generative model would also be highly interesting. The proposed DCOVAE model focuses on generating the graph topology (i.e., the contact maps) rather than the node features (e.g., the properties and types of the amino acids). Jointly generating both graph topology and node features could be important in some cases, such as directly generating the 3D structure, where the node features can be the 3D positions.
5. Methods
Problem Formulation
A recoverable tertiary-structure formulation requires preserving information not only on which atom is bonded to which other atom but, more importantly, on which atoms are in proximity to each other in three-dimensional space. As shown in Fig. 6, to address this issue we employ the contact-map graph, which can be trivially computed from a tertiary structure, as the input to our model; recovering a tertiary structure from a contact map is comparatively straightforward (Adhikari et al., 2015).
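As a concrete illustration, a contact map can be computed from representative-atom coordinates with a distance cutoff. The 8 Å value below is a common convention assumed for illustration; the text leaves the choice of atoms and cutoff open.

```python
import numpy as np

def contact_map(coords, cutoff=8.0):
    """Binary contact map from an (n, 3) array of representative-atom
    coordinates: A[i, j] = 1 if atoms i and j lie within `cutoff` angstroms.
    The diagonal is 1 by construction (each atom is in contact with itself)."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return (dist < cutoff).astype(int)

# A toy 4-residue "chain" laid out on a line, 5 angstroms between neighbors:
# only sequence neighbors (and each residue itself) fall within 8 angstroms,
# so the resulting map is tridiagonal.
coords = np.array([[0.0, 0.0, 0.0],
                   [5.0, 0.0, 0.0],
                   [10.0, 0.0, 0.0],
                   [15.0, 0.0, 0.0]])
A = contact_map(coords)
print(A)
```

The map is symmetric with a unit diagonal, matching the adjacency-matrix conventions used below.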
Hence, the contact map is a graph-based representation of a tertiary structure that effectively embeds the three-dimensional Cartesian space into a two-dimensional space. Specifically, in our approach, let the contact graph $G$ be associated with an edge attribute tensor, which reduces to the adjacency matrix $A$ when each edge carries a single binary attribute, and a node attribute matrix $X \in \mathbb{R}^{n \times d}$ (as shown in Fig. 6) denoting the identity of each atom by one-hot embedding, where $n$ is the number of atoms over which contacts are computed; one can do so for all atoms in a molecule, or for representative atoms to control the size of the input space. The edge and node attributes are rich mechanisms for adding information. For instance, node attributes can store not just the identities of the amino acids but also their PSSM profile (derived from the Position-Specific Scoring Matrix), as well as their solvent accessibility and secondary structure as derived from a given tertiary structure. The edge attributes can encode additional information about contacts, such as their exact distance and/or the contact predicted for that pair of amino acids (vertices) from sequence information alone. In our experiments, we are given an undirected, unweighted graph $G$ as described above, and we use only the adjacency matrix $A$ of $G$ (we assume the diagonal elements are set to 1, i.e., every node is connected to itself); the node attributes store the identities of the amino acids, with $d$ the number of all atom types.
Deep Contact Map Variational Autoencoder
The task in this paper requires learning disentangled generative models on contact maps that encode graph information. Although disentanglement enhancement and deep graph generative models have each attracted fast-increasing attention in recent years, their synergistic integration has rarely been explored. To learn the distribution of the contact-map decoys of an amino acid sequence, we first propose a new graph variational autoencoder framework that consists of a powerful graph encoder, a graph decoder, and a latent graph code disentanglement mechanism. First we introduce the model without the disentanglement mechanism, named the Deep Contact Map Variational Autoencoder (COVAE).
Specifically, given the adjacency matrix $A$ of a contact map with node attributes $X$, we introduce stochastic latent variables $z_i$, summarized in a latent matrix $Z$. The overall architecture contains a graph encoder and a graph decoder, which are trained by optimizing the variational lower bound $\mathcal{L}$ w.r.t. the variational parameters $\phi$ and $\theta$:

$\mathcal{L} = \mathbb{E}_{q_\phi(Z|X,A)}[\log p_\theta(A|Z)] - \mathrm{KL}(q_\phi(Z|X,A) \,\|\, p(Z))$   (1)

where $\mathrm{KL}(q \| p)$ is the Kullback-Leibler divergence between $q$ and $p$. The first term is the reconstruction loss of the generated contact maps, and the second term pushes the inferred latent distribution toward the prior. We take a Gaussian prior $p(Z) = \prod_i \mathcal{N}(z_i \mid 0, I)$. For the encoder, we take a simple inference model parameterized by a two-layer Graph Convolutional Network (GCN) (Kipf and Welling, 2016a) as the encoder of our COVAE:
$q_\phi(Z \mid X, A) = \prod_{i=1}^{n} q_\phi(z_i \mid X, A)$, with $q_\phi(z_i \mid X, A) = \mathcal{N}(z_i \mid \mu_i, \mathrm{diag}(\sigma_i^2))$   (2)

Here $\mu$ is the matrix of mean vectors $\mu_i$ of the latent variables, inferred by a GCN, $\mu = \mathrm{GCN}_\mu(X, A)$; similarly, $\sigma$ collects the standard deviations of the latent variables and is inferred by another GCN, $\log \sigma = \mathrm{GCN}_\sigma(X, A)$. Thus, it is possible to sample from the distributions of the latent vectors $z_i$.
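A minimal numpy sketch of such a two-layer GCN encoder follows: a shared first layer, then one GCN layer for the means and another for the log standard deviations. Weights are randomly initialized and the layer sizes are illustrative; the real model is trained end to end.

```python
import numpy as np

def normalize_adj(A):
    """Symmetrically normalized adjacency D^{-1/2} A D^{-1/2}; A is assumed
    to already contain self-loops on its diagonal."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_encoder(A, X, W0, W_mu, W_sigma):
    """Two-layer GCN inference model: shared hidden layer, then separate
    output layers producing the mean and log-std of each latent vector z_i."""
    A_hat = normalize_adj(A)
    H = np.maximum(A_hat @ X @ W0, 0.0)   # ReLU hidden layer, shared
    mu = A_hat @ H @ W_mu                 # mean of each latent vector
    log_sigma = A_hat @ H @ W_sigma       # log standard deviation
    return mu, log_sigma

rng = np.random.default_rng(2)
n, d, hidden, latent = 6, 3, 8, 2
A = rng.integers(0, 2, size=(n, n))
A = ((A + A.T) > 0).astype(float)
np.fill_diagonal(A, 1.0)                  # self-loops, as assumed in the text
X = np.eye(d)[rng.integers(0, d, size=n)] # one-hot node identities
W0 = rng.normal(size=(d, hidden))
W_mu = rng.normal(size=(hidden, latent))
W_sigma = rng.normal(size=(hidden, latent))
mu, log_sigma = gcn_encoder(A, X, W0, W_mu, W_sigma)
print(mu.shape, log_sigma.shape)  # one latent Gaussian per node
```

Each row of `mu` and `log_sigma` parameterizes the Gaussian posterior of one node's latent vector.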
For the generative model, we utilize the graph decoder network proposed in our previous work (Guo et al., 2018), which first proposed graph-deconvolution-based graph decoders; these achieve the best performance on the graph generation task when the node set of the graph is fixed. We therefore choose this graph decoder as part of our COVAE:
$p_\theta(A \mid Z) = \prod_{i=1}^{n} \prod_{j=1}^{n} p_\theta(A_{ij} \mid Z)$   (3)

where $A_{ij}$ are the elements of $A$.
We perform full-batch gradient descent and make use of the reparameterization trick (Kingma and Welling, 2013) for training. The architecture and mathematical operations of the graph encoder modeling $q_\phi(Z \mid X, A)$ and of the decoder modeling $p_\theta(A \mid Z)$ are detailed in the supplemental materials.
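The reparameterization trick, together with the closed-form KL term that holds for a diagonal Gaussian posterior against the standard-normal prior, can be sketched in numpy. Shapes and inputs are illustrative, not trained quantities.

```python
import numpy as np

def reparameterize(mu, log_sigma, rng):
    """z = mu + sigma * eps with eps ~ N(0, I): sampling becomes a
    deterministic, differentiable function of (mu, sigma) plus
    parameter-free noise, so gradients can flow through the encoder."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(log_sigma) * eps

def kl_to_standard_normal(mu, log_sigma):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)), summed over all
    latent dimensions of all nodes; always non-negative."""
    return 0.5 * np.sum(np.exp(2 * log_sigma) + mu**2 - 1 - 2 * log_sigma)

def recon_loss(A, A_logits):
    """Binary cross-entropy between the true adjacency matrix and the
    decoder's edge logits (the negative reconstruction term of the bound)."""
    p = 1.0 / (1.0 + np.exp(-A_logits))
    return -np.sum(A * np.log(p + 1e-12) + (1 - A) * np.log(1 - p + 1e-12))

rng = np.random.default_rng(3)
mu = rng.normal(size=(6, 2))
log_sigma = rng.normal(size=(6, 2)) * 0.1
z = reparameterize(mu, log_sigma, rng)
print(z.shape, kl_to_standard_normal(mu, log_sigma) >= 0.0)
```

Minimizing `recon_loss(A, logits) + kl_to_standard_normal(mu, log_sigma)` corresponds to maximizing the variational lower bound of Eq. 1.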
Disentangled Contact Map VAE
Next, we introduce the Contact Map VAE with a disentanglement mechanism, which is inspired by $\beta$-VAE (Higgins et al., 2017). $\beta$-VAE was proposed for the automated discovery of interpretable, factorized latent representations from data in a completely unsupervised manner, and has been used in the image domain (Huang et al., 2018) and in natural language processing (Yi et al., 2018). It is a modification of the variational autoencoder (VAE) framework that introduces an adjustable hyperparameter $\beta$ balancing latent channel capacity and independence constraints against reconstruction accuracy. Thus, we propose the Disentangled Contact Map VAE (DCOVAE) to better interpret the latent representations for protein generation. We augment the COVAE framework with the single hyperparameter $\beta$, which modulates the learning constraints applied to the model; these constraints impose a limit on the capacity of the latent information channel and control the emphasis on learning statistically independent latent factors. Eq. 1 can be rewritten, with the addition of the coefficient $\beta$, to arrive at the DCOVAE formulation:

$\mathcal{L} = \mathbb{E}_{q_\phi(Z|X,A)}[\log p_\theta(A|Z)] - \beta \, \mathrm{KL}(q_\phi(Z|X,A) \,\|\, p(Z))$   (4)
where setting $\beta = 1$ corresponds to the COVAE framework described above, while with $\beta > 1$ the model is pushed to learn a more efficient latent representation of the data, which is disentangled if the data contain at least some underlying factors of variation that are independent.
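The only change relative to the COVAE objective is the $\beta$ weight on the KL term. A minimal sketch, with hypothetical loss values purely to illustrate the role of $\beta$:

```python
def dcovae_loss(recon, kl, beta=1.0):
    """Negative of the Eq. 4 bound, as minimized in training: beta = 1
    recovers the COVAE objective (Eq. 1); beta > 1 penalizes the KL term
    more heavily, encouraging a disentangled latent code."""
    return recon + beta * kl

# Illustrative (untrained) loss components: increasing beta raises the
# pressure on the KL term relative to reconstruction accuracy.
recon, kl = 12.5, 3.0
print(dcovae_loss(recon, kl, beta=1.0))   # 15.5, the plain COVAE loss
print(dcovae_loss(recon, kl, beta=4.0))   # 24.5, KL dominates more
```

In practice $\beta$ is a tuning knob: too large a value sacrifices reconstruction quality for independence of the latent factors.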
Data availability
All data, models, and source code are freely available to readers upon contacting the authors.
References

DNCON2: improved protein contact prediction using two-level deep convolutional neural networks. Bioinformatics 34, pp. 1466–1472.
CONFOLD: residue-residue contact-guided ab initio protein folding. Proteins: Structure, Function, and Bioinformatics 83 (8), pp. 1436–1449.
Deep variational information bottleneck. arXiv preprint arXiv:1612.00410.
Fully differentiable full-atom protein backbone generation.
Generative modeling for protein structures. In Advances in Neural Information Processing Systems, pp. 7494–7505.
How do proteins interact?. Science 320 (5882), pp. 1429–1430.
NetGAN: generating graphs via random walks. arXiv preprint arXiv:1803.00816.
Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620.
Improved residue contact prediction using support vector machines and a large feature set. BMC Bioinformatics 8 (1), pp. 113.
PyMOL.
Deep architectures for protein contact map prediction. Bioinformatics 28 (19), pp. 2449–2457.
Predicting protein residue–residue contacts using deep networks and boosting. Bioinformatics 28 (23), pp. 3066–3072.
Structured disentangled representations. arXiv preprint arXiv:1804.02086.
Prediction of contact maps with neural networks and correlated mutations. Protein Engineering 14 (11), pp. 835–843.
Graphite: iterative generative modeling of graphs. arXiv preprint arXiv:1803.10459.
Deep graph translation. arXiv preprint arXiv:1805.09980.
Protein contact prediction using patterns of correlation. Proteins: Structure, Function, and Bioinformatics 56 (4), pp. 679–684.
Accurate prediction of protein contact maps by coupling residual two-dimensional bidirectional long short-term memory with convolutional neural networks. Bioinformatics 34, pp. 4039–4045.
beta-VAE: learning basic visual concepts with a constrained variational framework. ICLR 2 (5), pp. 6.
Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 172–189.
Learning protein structure with a differentiable simulator. In International Conference on Learning Representations.
High precision in protein contact prediction using fully convolutional neural networks and minimal sequence features. Bioinformatics 34, pp. 3308–3315.
Disentangling by factorising. arXiv preprint arXiv:1802.05983.
Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
Variational graph auto-encoders. arXiv preprint arXiv:1611.07308.
Assessment of model accuracy estimations in CASP12. Proteins: Struct, Funct, Bioinf 86 (Suppl 1), pp. 345–360.
Toward an accurate prediction of inter-residue distances in proteins using 2D recursive neural networks. BMC Bioinf 15, pp. 6.
Variational inference of disentangled latent concepts from unlabeled observations. arXiv preprint arXiv:1711.00848.
ROSETTA3: an object-oriented software suite for the simulation and design of macromolecules. Methods Enzymol 487, pp. 545–574.
Ab initio protein structure prediction. In From Protein Structure to Function with Bioinformatics, D. J. Rigden (Ed.), pp. 3–35.
Ensembling multiple raw coevolutionary features with deep residual neural networks for contact-map prediction in CASP13. Proteins: Struct, Funct, Bioinf 87 (12), pp. 1082–1091.
Predicting residue–residue contacts using random forest models. Bioinformatics 27 (24), pp. 3379–3384.
Enhancing evolutionary couplings with deep convolutional neural networks. Cell Syst 6 (e3), pp. 65–74.
A generative model for protein contact networks. Journal of Biomolecular Structure and Dynamics 34 (7), pp. 1441–1454.
Protein structure prediction by global optimization of a potential energy function. Proc Natl Acad Sci USA 96 (10), pp. 5482–5485.
Information constraints on auto-encoding variational Bayes. In Advances in Neural Information Processing Systems, pp. 6114–6125.
Deep generative model driven protein folding simulation. arXiv preprint arXiv:1908.00496.
Principles and overview of sampling methods for modeling macromolecular structure and dynamics. PLoS Comp Biol 12 (4), pp. e1004619.
PconsC4: fast, accurate and hassle-free contact predictions. Bioinformatics 35, pp. 2677–2679.
Prediction of contact maps by GIOHMMs and recurrent neural networks using lateral propagation from all four cardinal corners. Bioinformatics 18 (suppl_1), pp. S62–S70.
RamaNet: computational de novo protein design using a long short-term memory generative adversarial neural network. BioRxiv, pp. 671552.
NeVAE: a deep generative model for molecular graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 1110–1117.
Protein structure prediction using multiple deep neural networks in CASP13. Proteins: Struct, Funct, Bioinf 87 (12), pp. 1141–1148.
Probabilistic search and optimization for protein energy landscapes. In Handbook of Computational Molecular Biology, S. Aluru and A. Singh (Eds.).
GraphVAE: towards generation of small graphs using variational autoencoders. In International Conference on Artificial Neural Networks, pp. 412–422.
NNcon: improved protein contact map prediction using 2D-recursive neural networks. Nucleic Acids Research 37 (suppl_2), pp. W515–W518.
Deep learning methods in protein structure prediction. Comput and Struct Biotech J, pp. 1–10.
Recovery of protein structure from contact maps. Folding and Design 1 (5), pp. 295–306.
Ab initio and template-based prediction of multi-class distance maps by two-dimensional recursive neural networks. BMC Structural Biology 9 (1), pp. 5.
Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Comput Biol 13, pp. e1005324.
A comprehensive assessment of sequence-based and template-based methods for protein contact prediction. Bioinformatics 24 (7), pp. 924–931.
Neural-symbolic VQA: disentangling reasoning from vision and language understanding. In Advances in Neural Information Processing Systems, pp. 1031–1042.
GraphRNN: generating realistic graphs with deep autoregressive models. arXiv preprint arXiv:1802.08773.
Using sequence-predicted contacts to guide template-free protein structure prediction. In ACM Conf on Bioinf and Comp Biol (BCB), Niagara Falls, NY, pp. 154–160.
Balancing multiple objectives in conformation sampling to control decoy diversity in template-free protein structure prediction. BMC Bioinformatics 20 (1), pp. 211.
Secondary structure and contact guided differential evolution for protein structure prediction. IEEE/ACM Trans Comput Biol and Bioinf, preprint.
InfoVAE: information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262.
Acknowledgements
This work was supported in part by the National Science Foundation Grant No. 1907805 to AS and LZ and a Jeffress Memorial Trust Award to LZ and AS. The authors additionally thank members of the Zhao and Shehu laboratories for valuable feedback during this work.
Author contributions
XG implemented and evaluated the proposed methodologies, as well as drafted the manuscript. ST assisted with preparation of the input data and the evaluation of reconstructed structures. LZ and AS conceptualized the methodologies and provided guidance on implementation and evaluation, as well as edited and finalized the manuscript.
Competing interests
The authors declare no competing interests.