Decades of scientific enquiry across disciplines have demonstrated just how fundamental form is to function (Boehr and Wright, 2008)
. A central question that arises in various scientific domains is how to effectively explore the space of all possible forms of a dynamic system to uncover those that satisfy the constraints needed for the system to be active/functional. In computational structural biology we find a most visible instantiation of this question in the problem ofde-novo (or template-free) protein structure prediction (PSP). PSP seeks to determine one or more biologically-active/native forms/structures of a protein molecule from knowledge of its chemical composition (the sequence of amino acids bonded to one another to form a protein chain) (Lee et al., 2017).
PSP has a natural and elegant formulation as an optimization problem (Liwo et al., 1999) and is routinely tackled with stochastic optimization algorithms that are tasked with exploring a vast and high-dimensional structure space. These algorithms sample one structure at a time, biased by an expert-designed scoring/energy function that scores sampled structures based on the likelihood of these strutures being biologically-active/native (Shehu, 2013). While great progress has been made in stochastic optimization for PSP, there are fundamental challenges on how to achieve a resource-aware balance between exploration (exploring more of the structure space so as not to miss novel structures) and exploitation (improving structures so as to reach local optima of the scoring function) in vast and high-dimensional search spaces and dealing with inherently-inaccurate scoring functions that often drive towards score-optimal but non-native structures (Kryshtafovych et al., 2017).
Stochastic optimization frameworks that conduct a biased (by the scoring function) exploration of a protein’s structure space, do not learn not to generate structures that are unfit according to the scoring function. Efforts to inject a learning component by Shehu and others struggle with how to best connect the structure generation mechanism/engine in these frameworks with the evaluation engine (Maximova et al., 2016)
. A machine learning (ML) framework would seemingly provide a natural setting; however, while discriminative (ML) models can in principle be built to evaluate tertiary structures, and a growing body of research is showing their promise(Di Lena et al., 2012; Cheng and Baldi, 2007; Eickholt and Cheng, 2012; Li et al., 2011), effective generative ML models for tertiary protein structures have so far been elusive.
The majority of work on leveraging deep neural networks (NNs) for problems related to PSP has been on the prediction of distance or contact maps. The latter are alternative representations of tertiary molecular structures that dispense with storing the Cartesian coordinates of atoms in a molecule and instead record only pairwise distances (or, as in contact maps, 0/1 bits to indicate a distance above or not above a threshold). Work on contact map (similarly, distance matrix) prediction focuses on the following setting: given the amino-acid sequence of a protein molecule of interest, predict one contact map that represents the most likely native structure of the target protein. The predicted contact map (or distance matrix) can be utilized to recover a compatible tertiary structure via optimization algorithms that treat the predicted contacts/distances as restraints(Vendruscolo et al., 1997; Adhikari et al., 2015).
Early work on prediction of a contact or distance matrix from a given amino-acid sequence (Kukic et al., 2014) employed bidirectional recursive NNs (BRNNs). RaptorX-Contact started a thread of productive research on residual convolutional NNs (CNNs) (Wang et al., 2017). DNNCON2 leveraged a two-stage CNN (Adhikari et al., 2018). DeepContact deepened the framework via a 9-layer residual CNN with filters (Liu et al., 2018). DeepCOV alleviated some of the demands for evolutionary information (Jones and Kandathil, 2018). PconsC4 further limited input features and significantly shortens prediction time (Michel et al., 2019). SPOT-Contact extended RaptorX-Contact by adding a 2D-RNN stage downstream of the CNN (Hanson et al., 2018). TripletRes put together four CNNs trained end-to-end (Li et al., 2019).
It is worth emphasizing that the above NNs for contact or pairwise distance prediction predict one instance from a given amino-acid sequence. These models do not generate multiple contact maps or multiple distance matrices to give any view of the diversity of physically-realistic (and possibly functionally-relevant) structures in the structure space of a given protein. The objective of these models is to leverage the strong bias in the input features (which include amino-acid sequence, physico-chemical information of amino acids, co-evolutionary information, secondary structures, solvent accessibility, and other more sophisticated features) to get the model to output one prediction that is most likely to represent the native structure. While the performance of these models varies (Torrisi et al., 2020) and is beyond the scope of this paper, this body of work does not leverage generative deep learning and is not focused on generating a distribution of tertiary structures for a given amino-acid sequence (whether directly in Cartesian space or indirectly through the intermediate representation of contact maps or pairwise distance matrices).
Work in our laboratory has integrated some of these contact prediction models, such as RaptorX-Contact, in optimization-based structure generation engines to evaluate the quality of a sampled tertiary structure in lieu of or in addition to a scoring/energy function(Zaman et al., 2019). AlphaFold can also be seen as a similar effort, where a deep NN is used to predict a pairwise distance matrix for a given amino-acid sequence and then predicted distances are encapsulated in a penalty-based scoring function to guide a gradient descent-based optimization algorithm assembling fragments into tertiary structures (Senior et al., 2019).
While it remains more challenging to generate a distribution of tertiary structures (whether directly or indirectly through the concept of contact maps or distance matrices), some preliminary efforts have been made (Anand and Huang, 2018b; Anand et al., 2019; Sabban and Markovsky, 2019; Ingraham et al., 2019). Work in (Anand and Huang, 2018b) investigates the promise of such models for protein de-novo design. Other work that focuses on PSP remains preliminary and lack of rigorous evaluation on metrics of interest in PSP and, more generally, protein modeling does not expose what these models have actually learned. – need to provide more detail here to differentiate these efforts.
Inspired by recent momentum in generative deep learning, we propose here a novel approach based on generative, adversarial deep learning. Leveraging recent opportunities in graph generative learning, we put forth a disentanglement-enhanced contact variational autoencoder (VAE) to which we refer as DECO-VAE from now on. As any other deep neural network, DECO-VAE learns from data, which we obtain from an initial population generated by the state-of-the-art stochastic optimization engine Rosetta (Leaver-Fay et al., 2011). As a VAE, DECO-VAE learns from this data the latent space encapsulating the distribution and then directly generates more data (tertiary structures) from the latent space.
DECO-VAE addresses several hurdles that stand in the way of generative models for highly-structured data. First, each tertiary structure is represented as a graph via the concept of the contact graph, and DECO-VAE learns the latent space of such graphs, directly sampling novel contact graphs from this space. In this first version, readily-available 3D reconstruction webservers, such as CONFOLD (Adhikari et al., 2015), are used to obtain tertiary structures corresponding to DECO-VAE-generated graphs. Second, DECO-VAE addresses the issue of interpretability via the disentanglement mechanism. In one of several insightful analyses presented in this paper, we demonstrate how DECO-VAE is more powerful than a version with no disentanglement mechanism, CO-VAE, as well as other deep architectures. We show that DECO-VAE generates a distribution that does not simply reproduce the input distribution but instead contains novel and better-quality structures. A thorough comparison with several alternative generative models is also presented. In addition, the disentanglement mechanism integrated in DECO-VAE elucidates what exactly about the tertiary structures the "axes"/factors of the learned latent space control, thus opening the black box that is often associated with deep neural networks.
The paper proceeds as follows. First, we present a summary of related work. DECO-VAE, the input data, and other components of our approach (reconstruction of tertiary structures and various analyses and metrics) are then described in detail. The evaluation of the obtained distribution and the interpretability of DECO-VAE is presented next. The paper concludes with a summary of remaining challenges and future opportunities.
2. Background and Related Work
The concept of a contact graph or a contact map has long been leveraged in molecular biology, particularly in protein modeling (Wu and Zhang, 2008). Briefly, the contact map is an alternatively, compressed representation of a tertiary molecular structure. We can think of it as an adjacenncy matrix, where the entry encodes whether atoms at positions are spatially proximate; the latter uses threshholds, which typically vary in Åin computational biology literature. Typically, the main carbon atoms of each amino acid (the CA atoms) are considered. Some works focus instead on the carbon beta atoms (connecting the side chain atoms to the backbone atoms of an amino acid). In this paper, we take the convention of CA-based contact maps.
Interest in predicting contacts from a given amino-acid sequence has a long history in protein structure prediction, as it was thought to be perhaps less challenging that predicting other representations (such as Cartesian coordinates). Many machine learning approaches have been used, including support vector machines(Cheng and Baldi, 2007; Wu and Zhang, 2008) et al., 2011), neural networks (Tegge et al., 2009; Pollastri and Baldi, 2002; Walsh et al., 2009; Fariselli et al., 2001; Hamilton et al., 2004), and deep learning (Di Lena et al., 2012; Eickholt and Cheng, 2012), with varied success. These approaches struggle with how best to capture the inherent structure in the input training data and how to discover the pertinent latent space.
More recent methods leverage the concept of the latent space via variational autoencoders (Ma et al., 2019; Livi et al., 2016; Anand and Huang, 2018a). Specifically, Ma et al.(Ma et al., 2019) develops a deep learning based approach that uses convolutions and a variational autoencoder (a CVAE) to cluster tertiary structures obtained from folding simulations and discover intermediate states from the folding pathways. Anand et al. (Anand and Huang, 2018a) applies generative adversarial networks (GANs) for de-novo protein design. The authors encode protein structures in terms of pairwise distances between CA atoms, eliminating the need for the generative model to learn translational and rotational symmetries. Livi et al. (Livi et al., 2016)
is the first to treat the contact map as a graph and propose a generative model by learning the probability distribution of the existence of a link/contact given the distance between two nodes.
To the best of our knowledge, no deep graph-generative models have been applied to contact map generation. The DECO-VAE we present in this paper is the first to do so. The method leverages developments in graph-generative neural networks (GNNs). Most of the existing GNN-based graph generation for general graphs have been proposed in the last two years and are based on VAEs (Simonovsky and Komodakis, 2018; Samanta et al., 2019) or GANs (Bojchevski et al., 2018). Most of these approaches generate nodes and edges sequentially to form a whole graph, leading to issues of sensitivity to generation order and high time costs for large graphs. Specifically, GraphVAE (Simonovsky and Komodakis, 2018) represents each graph by its adjacent matrix and feature vector and utilizes the VAE model to learn the distribution of the graphs conditioned on a latent representation at the graph level. Graphite (Grover et al., 2018) and VGAE (Kipf and Welling, 2016b) encode the nodes of each graph into node-level embeddings and predict the links between each pair of nodes to generate a graph. In contrast, GraphRNN (You et al., 2018)
builds an autoregressive generative model on these sequences with a long short-term memory (LSTM) model and shows good scalability.
These current graph generation models have two drawbacks: (i) the encoder and decoder architecture are not powerful to handle real-world graphs with complex structure relationships; (ii) they are black boxes, lacking interpretability. In the biology domain, we need to understand the features we learn from the observing data and how we generate the contact maps.
In this paper, we additionally address the issue of interpretability via the concept of disentanglement. Recently, a variety of disentangled representation learning algorithms have been proposed for VAEs. Assuming that the data is generated from independent factors of variation (e.g., object orientation and lighting conditions in images of objects), disentanglement as a meta-prior encourages these factors to be captured by different independent variables in the representation. Learning via disentanglement should result in a concise abstract representation of the data useful for a variety of downstream tasks and promises improved sample efficiency. This has motivated a search for ways of learning disentangled representations, where perturbations of an individual dimension of the latent codeperturb the corresponding in an interpretable manner. A well-known example is the (Higgins et al., 2017). This has prompted a number of approaches that modify the VAE objective by adding, removing, or altering the weight of individual terms (Kim and Mnih, 2018; Chen et al., 2018; Zhao et al., 2017; Kumar et al., 2017; Lopez et al., 2018; Esmaeili et al., 2018; Alemi et al., 2016).
Marrying recent developments in disparate communities, we present here a VAE that generates contact graphs (of tertiary protein structures) and learns disentangled representations that allow interpreting how the learned latent factors control changes at the contact graph level. By utilizing known 3D reconstruction tools, we additionally show how the factors control changes in Cartesian space. The following Section provides further details into the DECO-VAE and other models we design and assess for their ability to generate physically-realistic tertiary protein structures.
We utilize three deep graph generation models to validate the superiority of our proposed model. VGAE (Kipf and Welling, 2016b)
is a framework for unsupervised learning on graph-structured data based on the variational auto-encoder (VAE). This model makes use of latent variables and is capable of learning interpretable latent representations for undirected graphs. Graphite(Grover et al., 2018) is an algorithmic framework for unsupervised learning of representations over nodes in large graphs using deep latent variable generative models. This model parameterizes variational autoencoders (VAE) with graph neural networks, and uses a novel iterative graph refinement strategy inspired by low-rank approximations for decoding. GraphRNN (You et al., 2018)
is a deep autoregressive model to approximate any distribution of graphs with minimal assumptions about their structure. GraphRNN learns to generate graphs by training on a representative set of graphs and decomposes the graph generation process into a sequence of node and edge formations, conditioned on the graph structure generated so far.
Training and Testing Dataset
The evaluation is carried out on proteins of varying lengths ( to amino acids long) and folds (, , , and ) that are used as a benchmark to evaluate structure prediction methods (Zhang et al., 2018; Zaman and Shehu, 2019). On each protein, given its amino-acid sequence in fasta format, we run the Rosetta AbInitio protocol available in the Rosetta suite (Leaver-Fay et al., 2011) to obtain an ensemble of structures. For each structure, we only retain the central carbon (CA) atom for each amino acid (effectively discarding the other atoms). Each such CA-only structure is converted into a contact graph, as described above. The contact graph dataset of each protein is split into a training and testing data set in a x:y split. The deep models described above are separately trained on each protein. An overview of Disentanglement-enhanced Contact-VAE (DECO-VAE) is illustrated in Fig.6. By obeserving many decoys contact maps for a protein sequence, it learns the distribution of these decoys and generate many more decoys for that protein sequence. DECO-VAE has two main properties: contact generation and interpretation. The contact generation is an implementation of a deep graph varitional auto-encoder. The graph encoder learns the distribution of the decoys of a certain sequence and generate new decoys based on a low dimension latent code. And then the contact decoys can be recovered into 3D structure by the existing methods. The contact interpretation ensures each element in the latent representation has a control on an interpretable factor of the generated protein structure. By changing the element value continuously in the latent code, we can control the different properties of the generated protein structure. The proposed method also has a variant which does not consider the disentanglement but only target on the generation of contact maps, named as Contac-VAE (CO-VAE). The CO-VAE has the same structure with the DECO-VAE, except the objective function when training a generative model. The details of the methods are illustrated in the Section of Methods
Comparing the generated contact maps with the native one
The proposed methods have been evaluated on 13 datasets, each of which corresponds to a protein. For each protein that is a sequence if amino acids, we use thousands of its contact map decoys to train the CO-VAE and DCO-VAE model. Then based on this well-trained model, the same number of contact maps are generated and compared with the only native one. For each generated contact map and native one, we compute the precision, recall, coverage and and F1 score between them, as shown in Table 1. In addition, the performance of our methods are also compared with the other graph generation methods in machine learning domain (Graphite (Grover et al., 2018), GVAE (Kipf and Welling, 2016b), GraphRNN (You et al., 2018)). The algorithm and parameter settings of the proposed CO-VAE, DCO-VAE model and the comparison methods are described in details in the Section of Methods.
As shown in Table 1, the graphs generated by CO-VAE and DCO-VAE both have very high F1_score, and the coverage and recall are all around 0.34, which outperforms the other methods by a large margin. Specifically, our precision is 0.67 and 0.71 on average, over 53.4% and 56.7% higher than the highest performance of comparison methods; our coverage is 0.55 and 0.58 on average, over 12.7% and 24.8% higher than the highest performance of comparison methods; our recall is 0.57 and 0.59 on average, over 12.2% and 15.2% higher than the highest performance of comparison methods; our F1_score is 0.61 and 0.64 on average, over 52.4% and 54.7% higher than the highest performance of comparison methods; Moreover, the proposed methods (CO-VAE and DCO-VAE) obtain the best performance in four metrics in almost 76% of the proteins. This indicates that the P-GraphVAE and PD-GraphVAE truly learn the underlined distribution for the contact map decoys for the amino acid sequence. The good performance of the proposed methods rely on its special architecture, where each amino acid has its unique parameters in extracting features from its contacts and the structure similarity is considered among different amino acids in generation process.
Percentage of native and non-native contacts in a structure
For each of the generated contact map, we calculate the percentage of its contacts that are indeed native contacts, namely those contacts in a native structure. Specifically: Given contacts for a structure (each model) and the known native structure, we calculate the percentage of contacts that are in the native structure that are also in the reconstructed structure. Percentage for each reconstructed structure (model) is calculated for each protein per each method and then an average over all reconstructed structures is reported per protein per method. In addition, we also calculate the percentage of non-native contacts: the contacts that are in a reconstructed structure but not in the structure. For each model, we calculate the number of non-native contacts and divide it by the number of amino acids in the protein. Then we report the average percentage over all structures per protein per method. The comparison methods are the same ones that mentioned above.
As shown in Table 2, the proposed CO-VAE and DCO-VAE both have the highest percentage of the native contacts and the lowest non-native contacts among around 86% of all the proteins, especially with a very large superiority (e.g. 33% on average for native contacts and 61% on average for non-native contacts comparing to the highest performance of comparison methods). The results demonstrate that the proposed methods can accurately generate the contacts like the native one and the generated contact maps have the similar balance between the contacts and non-contacts in the structure. Moreover, even the DCO-VAE enhanced the disentanglement in the objective function of the model, its performance on contact map generation still remains infected, and sometimes even better than the CO-VAE.
Evaluating the learned graph distribution
Since the successful generation of contact maps rely on successfully learning the distribution of the contact maps, we evaluate whether the generated contact maps follow the learned distributions. By regarding each contact map as a graph, we can calculate four properties of each graphs: density, number of edges, the degree, and transitivity. The density of a graph is the ratio of the number of edges and the number of possible edges. The average degree coefficient measures the similarity of connections in the graph with respect to the node degree. Transitivity is the overall probability for the graph to have adjacent nodes interconnected. All these properties can be calculated by the open source API NetworkX. Then the distance between the distribution of the generated contact maps and the distribution of the training sets in terms of the four properties is measured by three metrics: Pearson correlation coefficient, Bhattacharyya distance and Earth Mover’s Distance (EMD). In statistics, the Pearson correlation coefficient (PCC) is a measure of the linear correlation between two variablesand . Here the and refers to the a kind of graph property of the generated graphs and training graphs respectively. The Bhattacharyya distance and EMD both measures the similarity of two probability distributions and the smaller the better.The results for one example protein are shown as Table 3. In addition, we also compare our proposed methods with the comparison methods. The results for other proteins can be seen in the supplemental materials.
As shown in Table 3, considering EMD distance, the proposed P-GraphVAE has the best performance regarding all the graph properties. For the Bhattacharyya distance, the proposed P-GraphVAE outperformed other comparison methods by 33.38% and the disentangle version outperformed others by 33.68%. The Pearson similarity is very small, which demonstrate that the generated graphs are various in some degree than the training set, which ensure the diveristy of the generated contact maps. Specifically, in terms of EMD distance, the proposed CO-VAE and DCO-VAE achieve 1.4 and 2.9 on average, which is 65.8% and 29.2% smaller than the best performance comparison method.
|Number of Edge||Graphite||0.007||5.410||1327|
Visualizing the 3D Structure from Generated Contact Maps
Given the generated contact maps by the proposed CO-VAE and DCO-VAE, we use the existing tool CONFOLD (Adhikari et al., 2015) to reconstruct the contact map to its responding 3D structure. To visualize and evaluate the real 3D structure of the generated proteins, we also visualize the protein structure by the tool PyMol (DeLano, 2002) which is widely used and user-sponsored molecular visualization system. We show the results both for the proposed CO-VAE model and the DCO-VAE, which is disentangled. And we also compare our proposed model with the graph generative model graphRNN. As shown in Fig. 1, the proposed CO-VAE and DCO-VAE have better performance in finding both the secondary and tertiary structure.
Interpreting the disentangled latent representation
It is important to be able to measure the level of disentanglement achieved by different models. In this subsection, we qualitatively demonstrate that our proposed DCO-VAE framework consistently discovers more latent factors and disentangles them in a cleaner fashion. By learning a latent code representation of a protein structure, we assume each variable in the latent code corresponds to a certain factor or property in generating the protein structure. Thus, by changing the value of one variable continuously and fixing the remaining variables, we could visualize the corresponding change of the generated contact map and 3D protein structure.
First, an example protein IDTJA is analyzed and displayed. the As shown in Fig 3 and Fig 2, we use the convention that in each one latent code varies from left to right while the other latent codes and noise are fixed. The different rows correspond to different random samples of fixed latent codes and noise. For instance, in Fig 2, one column contains four generated contact map samples regarding each variables, and a row shows the generated contact maps for 5 possible values among 1 and 10000 in a certain variable of latent code with other noise fixed. From the changing contact maps in each row, it is easily to find out which part of the contact map is related or controlled by the certain row. We highlight the part that controlled by each variable in red circle in Fig 2. The corresponding 3D structure visualization is shown in Fig 2. On this way, we could easily interpret the role of each variable in the latent code in controlling the generation of the contact maps and the 3D structure.
In summary, our study aims at utilizing the deep neural network-based generative model for generating tertiary protein structure as well as interpreting the generation process. To the best of our knowledge, we showed for the first time the development and application of an interpretative graph variational auto-encoder for the problem of protein structure generation and interpretation. This demonstrated the promise of generative models in directly revealing the latent space for sampling novel tertiary structures, highlighting axes/factors that carry structural meaning, and opening the black box of deep models.
By treating tertiary protein structure construction problem as the generation of contact maps, we proposed a new disentangled VAE, named Contact Map VAE (CO-VAE) with new graph encoder and decoder. It first learns the underlying distribution of contact maps and then generates additional contact maps by sampling from the learned distribution. The similarity of the structure (e.g., in terms of precision, recall, F1-score, and coverage) as well as positive contact percentage between the generated contact maps and the native ones demonstrated the quality of the generated contact maps. The proposed methods(CO-VAE and DCO-VAE) show great advantages than the existing models by obtaining the best performance in four metrics in almost 76% of the proteins and the highest percentage of the native contacts as well as the lowest non-native contacts among around 86% of all the proteins. Furthermore, we reconstructed the generated contact maps into 3D protein structures. The visualization of the constructed 3D protein illustrated that the generated contact maps are valid and useful for further generating the 3D tertiary protein structure.
To further investigate whether the proposed CO-VAE indeed learns the distribution of the observed contact map samples, we generated a set of contact maps and compared their underlying distribution and that of the real contact maps. Since it is difficult to directly evaluate the graph distribution, we resorted to evaluating the distribution of the properties of graphs, such as density, edge numbers, average degree correlation, and transitivity. The small EMD and Bhattachryya distances between the learned and real distribution in terms of all four properties of graphs generated by CO-VAE and DCO-VAE validated that the underlying graph distribution is effectively discovered, which outperformed the distances calculated from the graphs that generated by comparison methods by 33.4% and 47.5% on average. Though the Pearson correlation score was very low, it showed that the diversity of the generated graphs was ensured since it measures the strength and direction of a linear relationship between two variables other than the similarity of two distributions.
Next, to explore our generative model’s capability of interpreting the process of contact map generation, we enhanced our CO-VAE by enforcing the weights of the second term of the training objective, leading to the disentanglement among variables and hence a new interpretable model named DCO-VAE. The learned latent representation variables in DCO-VAE is expected to be related to the factors that influence the formation of the contact maps. As a result, for each latent variable, by varying the value of this variable from 1 to 10,000 and fixing the values of others at the same time, a certain local part of the generated contact maps showed obvious trends (e.g., contracting or sketching). This demonstrated that the learned latent variables are effectively disentangled, which indicated the potential semantic factors in the formation of contact maps and the corresponding protein structures.
From one aspect, though there have been some deep generative models applied to the protein contact map generation, they cannot utilize the relationships among amino acids by treating the contact maps as the graph-structured data with graph generative models. From another aspect, though many interpretable learning models have been explored and applied into the image generation, there are no interpretable models that can discover the semantic factors that can control the process of protein folding. In summary, to the best of our knowledge, this is the first time an interpretable deep generative model for graphs have been applied for protein structure prediction and its effectiveness in generating good-quality protein structures is demonstrated. In addition, the proposed DCO-VAE can also be applied to other real-world applications where there is a need for graph-structured data generation and interpretation.
There are much more promising and challenging topics that can be originated from the research in this paper for future exploration. For example, it would be interesting and potentially beneficial to develop an end-to-end tertiary protein structure generation model directly for 3D structure instead of contact maps. This is because there is a gap between the contact map generation and 3D structure formulation process, and the learned variables can only explain well of the formation of contact maps instead of the 3D structure. In addition, the exploration of the node-edge joint generative model would also be highly interesting. The proposed DCO-VAE model focuses on generating the graph topology (i.e., via contact maps) instead of node features (e.g., properties and types of amino acid). Jointly generating both graph topology and node features could be important in some cases, such as when directly generating the 3D structure where the node features can be the 3D positions.
Recoverable tertiary structure formulation requires to preserve information not only on which atom is bonded with which else, but, more importantly, on which atoms are in proximity of each-other in three-dimensional space. As shown in Fig. 6, to address this issue, we employ contact map graph which can be trivially computed from tertiary structures as the input to our model. And to recover a tertiary structure from a contact map is comparatively trivial (Adhikari et al., 2015).
Hence, the contact map is a graph-based representation of a tertiary structure that effectively embeds the three-dimensional Cartesian space into a two-dimensional space. Specifically, in our approach, let the contact graph
now be associated with its edge attribute tensor, which denotes the adjacent matrix when ; and node attribute matrix (as shown in Fig. 6) denoting the identity of each atom by hot vector embedding, where is the number of atoms over which contacts are computed; one can do so for all atoms in a molecule, or representative atoms to control the size of the input space. The edge and node attributes are rich mechanisms to add additional information. For instance, node attributes can store not just the identities of the amino acids but also their PSSM profile (derived from the Position Specific Scoring Matrix), as well as their solvent accessibility and secondary structure as derived from a given tertiary structure. The edge attributes can encode additional information about contacts, such as their exact distance and/or contact predicted for that pair of amino acids (vertices) from sequence information alone. In our experiment, we are given an undirected, unweighted graph as mentioned above, and we only use the adjacency matrix of (we assume diagonal elements set to 1, i.e. every node is connected to itself) and node attributes store the identities of the amino acids with (the number of all the atom types).
Deep Contact Map Variational Auto-encoder
The task in this paper requires to learn the disentangled generative models on contact map that encodes graph information. Although disentanglement enhancement and deep graph generative models are respectively attracting fast-increasing attention in recent years, the synergistic integration of them has rarely been explored yet. To learn the distribution of the contact map decoys of a amino acid sequence, we first propose a new graph varational auto-encoder framework that consists of powerful graph encoder, graph decoder, and latent graph code disentanglement mechanism. First we introduce the model without disentanglement mechanism, named as The Deep Contact Map Variational Auto-encoder (CO-VAE).
Specifically, given the adjacent matrix of a contact map, we further introduce stochastic latent variables , summarized in an vector . The overall architecture contains a graph encoder and decoder which are trained by optimizing the variational lower bound Loss w.r.t. the variational parameters :
is the Kullback-Leibler divergence betweenand . The first item is the reconstruction loss of the generated contact maps and the second term enforces the inferenced latent vector close to the real latent vector distribution. We take a Gaussian prior
For the encoder part, we take a simple inference model parameterized by a two-layer Graph Convolution Nueral Network (GCN) (Kipf and Welling, 2016a) as the encoder of our CO-VAE:
Here is the mean of the latent vectors, which is inferenced by GCN; similarly is the standard of the latent vector that is inferenced by anotehr GCN. Thus, it is possible to sample from the distributions of the latent vectors .
For the generative model, we utilize the graph decoder network proposed in our previous works(Guo et al., 2018). This work first proposes the graph deconvolution based graph decoders, which achieve the best performance in the graph generation task when the node set of the graph is fixed. Thus, we choose this graph decoder as part of our CO-VAE.
where are the elements of .
We perform full-batch gradient descent and make use of the re-parameterization trick (Kingma and Welling, 2013) for training. The architecture and mathematical operations of the graph encoder for modelling the and decoder for modelling the are detailed in the supplemental materials.
Disentangled Contact Map VAE
Then next we introduce the Contact Map VAE with the disentanglement mechanism, which is inpired by the (Higgins et al., 2017). is proposed for automated discovery of interpretable factorised latent representations from data in a completely unsupervised manner, and has been currently used in image domain (Huang et al., 2018)et al., 2018)
. It is a modification of the variational autoencoder (VAE) framework by introducing an adjustable hyperparameterthat balances latent channel capacity and independence constraints with reconstruction accuracy. Thus, we proposed the Disentangled Contact Map VAE (DCO-VAE) to largely interpret the latent representations for the protein generation. We propose augmenting the CO-VAE framework with a single hyperparameter that modulates the learning constraints applied to the model. These constraints impose a limit on the capacity of the latent information channel and control the emphasis on learning statistically independent latent factors. Eq. 1 can be re-written to arrive at the DCO-VAE formulation, but with the addition of the coefficient:
where with the corresponds to the CO-VAE framework mentioned above; and with the model is pushed to learn a more efficient latent representation of the data, which is disentangled if the data contains at least some underlying factors of variation that are independent.
All data, models, and source code are freely available to readers upon contacting the authors.
DNCON2: improved protein contact prediction using two-level deep convolutional neural networks. Bioinformatics 34, pp. 1466–1472. Cited by: §1.
- CONFOLD: residue-residue contact-guided ab initio protein folding. Proteins: Structure, Function, and Bioinformatics 83 (8), pp. 1436–1449. Cited by: §1, §1, §3, §5.
- Deep variational information bottleneck. arXiv preprint arXiv:1612.00410. Cited by: §2.
- Fully differentiable full-atom protein backbone generation. Cited by: §1.
- Generative modeling for protein structures. In Advances in Neural Information Processing Systems, pp. 7494–7505. Cited by: §2.
- Generative modeling for protein structures. In Advances in Neural Information Processing Systems, pp. 7494–7505. Cited by: §1.
- How do proteins interact?. Science 320 (5882), pp. 1429–1430. Cited by: §1.
- Netgan: generating graphs via random walks. arXiv preprint arXiv:1803.00816. Cited by: §2.
- Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620. Cited by: §2.
- Improved residue contact prediction using support vector machines and a large feature set. BMC bioinformatics 8 (1), pp. 113. Cited by: §1, §2.
- PyMOL. Cited by: §3.
- Deep architectures for protein contact map prediction. Bioinformatics 28 (19), pp. 2449–2457. Cited by: §1, §2.
- Predicting protein residue–residue contacts using deep networks and boosting. Bioinformatics 28 (23), pp. 3066–3072. Cited by: §1, §2.
- Structured disentangled representations. arXiv preprint arXiv:1804.02086. Cited by: §2.
- Prediction of contact maps with neural networks and correlated mutations. Protein engineering 14 (11), pp. 835–843. Cited by: §2.
- Graphite: iterative generative modeling of graphs. arXiv preprint arXiv:1803.10459. Cited by: §2, §3, §3.
- Deep graph translation. arXiv preprint arXiv:1805.09980. Cited by: §5.
- Protein contact prediction using patterns of correlation. Proteins: Structure, Function, and Bioinformatics 56 (4), pp. 679–684. Cited by: §2.
- Accurate prediction of protein contact maps by coupling residual two-dimensional bidirectional long short-term memory with convolutional neural networks. Bioinformatics 34, pp. 4039–4045. Cited by: §1.
- Beta-vae: learning basic visual concepts with a constrained variational framework.. ICLR 2 (5), pp. 6. Cited by: §2, §5.
Multimodal unsupervised image-to-image translation. In
Proceedings of the European Conference on Computer Vision (ECCV), pp. 172–189. Cited by: §5.
- Learning protein structure with a differentiable simulator. In International Conference on Learning Representations, Cited by: §1.
- High precision in protein contact prediction using fully convolutional neural networks and minimal sequence features. Bioinformatics 34, pp. 3308–3315. Cited by: §1.
- Disentangling by factorising. arXiv preprint arXiv:1802.05983. Cited by: §2.
- Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §5.
- Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §5.
- Variational graph auto-encoders. arXiv preprint arXiv:1611.07308. Cited by: §2, §3, §3.
Assessment of model accuracy estimations in CASP12. Proteins: Struct, Funct, Bioinf 86 (Suppl 1), pp. 345–360. Cited by: §1.
- Toward an accurate prediction of inter-residue distances in proteins using 2d recursive neural networks. BMC Bioinf 15, pp. 6. Cited by: §1.
- Variational inference of disentangled latent concepts from unlabeled observations. arXiv preprint arXiv:1711.00848. Cited by: §2.
- ROSETTA3: an object-oriented software suite for the simulation and design of macromolecules. Methods Enzymol 487, pp. 545–574. Cited by: §1, §3.
- Ab initio protein structure prediction. In From Protein Structure to Function with Bioinformatics, D. J. Rigden (Ed.), pp. 3–35. Cited by: §1.
- Ensembling multiple raw coevolutionary features with deep residual neural networks for contact-map prediction in CASP13. Proteins: Struct, Funct, Bioinf 87 (12), pp. 1082–1091. Cited by: §1.
- Predicting residue–residue contacts using random forest models. Bioinformatics 27 (24), pp. 3379–3384. Cited by: §1, §2.
- Enhancing evolutionary couplings with deep convolutional neural networks. Cell Syst 6 (e3), pp. 65–74. Cited by: §1.
- A generative model for protein contact networks. Journal of Biomolecular Structure and Dynamics 34 (7), pp. 1441–1454. Cited by: §2.
- Protein structure prediction by global optimization of a potential energy function. Proc Natl Acad Sci USA 96 (10), pp. 5482–5485. Cited by: §1.
- Information constraints on auto-encoding variational bayes. In Advances in Neural Information Processing Systems, pp. 6114–6125. Cited by: §2.
- Deep generative model driven protein folding simulation. arXiv preprint arXiv:1908.00496. Cited by: §2.
- Principles and overview of sampling methods for modeling macromolecular structure and dynamics. PLoS Comp. Biol. 12 (4), pp. e1004619. Cited by: §1.
- PconsC4: fast, accurate and hassle-free contact predictions. Bioinformatics 35, pp. 2677–2679. Cited by: §1.
Prediction of contact maps by giohmms and recurrent neural networks using lateral propagation from all four cardinal corners. Bioinformatics 18 (suppl_1), pp. S62–S70. Cited by: §2.
- RamaNet: computational de novo protein design using a long short-term memory generative adversarial neural network. BioRxiv, pp. 671552. Cited by: §1.
Nevae: a deep generative model for molecular graphs.
Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 1110–1117. Cited by: §2.
- Protein structure prediction using multiple deep neural networks in CASP13. Proteins: Struct, Funct, Bioinf 87 (12), pp. 1141–1148. Cited by: §1.
- Probabilistic search and optimization for protein energy landscapes. In Handbook of Computational Molecular Biology, S. Aluru and A. Singh (Eds.), Cited by: §1.
- Graphvae: towards generation of small graphs using variational autoencoders. In International Conference on Artificial Neural Networks, pp. 412–422. Cited by: §2.
- NNcon: improved protein contact map prediction using 2d-recursive neural networks. Nucleic acids research 37 (suppl_2), pp. W515–W518. Cited by: §2.
- Deep learning methods in protein structure prediction. Comput and Struct Biotech J (S2001037019304441), pp. 1–10. Cited by: §1.
- Recovery of protein structure from contact maps. Folding and Design 1 (5), pp. 295–30. Cited by: §1.
- Ab initio and template-based prediction of multi-class distance maps by two-dimensional recursive neural networks. BMC structural biology 9 (1), pp. 5. Cited by: §2.
- Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Comput Biol 13, pp. e1005324. Cited by: §1.
- A comprehensive assessment of sequence-based and template-based methods for protein contact prediction. Bioinformatics 24 (7), pp. 924–931. Cited by: §2, §2.
- Neural-symbolic vqa: disentangling reasoning from vision and language understanding. In Advances in Neural Information Processing Systems, pp. 1031–1042. Cited by: §5.
- Graphrnn: generating realistic graphs with deep auto-regressive models. arXiv preprint arXiv:1802.08773. Cited by: §2, §3, §3.
- Using sequence-predicted contacts to guide template-free protein structure prediction. In ACM Conf on Bioinf and Comp Biol (BCB), Niagara Falls, NY, pp. 154–160. Cited by: §1.
- Balancing multiple objectives in conformation sampling to control decoy diversity in template-free protein structure prediction. BMC Bioinformatics 20 (1), pp. 211. External Links: Cited by: §3.
- Secondary structure and contact guided differential evolution for protein structure prediction. IEEE/ACM Trans Comput Biol and Bioinf. Note: preprint External Links: Cited by: §3.
- Infovae: information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262. Cited by: §2.
This work was supported in part by the National Science Foundation Grant No. 1907805 to AS and LZ and a Jeffress Memorial Trust Award to LZ and AS. The authors additionally thank members of the Zhao and Shehu laboratories for valuable feedback during this work.
XG implemented and evaluated the proposed methodologies, as well as drafted the manuscript. ST assisted with preparation of the input data and the evaluation of reconstructed structures. LZ and AS conceptualized the methodologies and provided guidance on implementation and evaluation, as well as edited and finalized the manuscript.
The authors declare no competing interests.