One of the goals of metagenomics is to identify the functions of the proteins present in a given sample. Two commonly used methods for determining protein function are (1) comparing the amino acid sequence of a protein with the functionally annotated sequences in protein sequence databases, and (2) comparing the 3-D structure of a protein against those in protein structure databases [lietal]. Thanks to recent advances in computational tools and techniques, especially applications of machine learning in metagenomics, the number of available protein annotations is growing.
The inverse problem, determining the 3-D structure of a protein for a given function, is a young field that has attracted the interest of researchers because engineering proteins with specific functions has promising applications in biotechnology and medicine [protstructpredict]. Designing such proteins may lead to novel therapeutic agents, such as custom-designed signaling proteins that would allow us to give specific instructions to cells [Gurevich2014].
To address this problem, we propose an implementation of a Deep Convolutional Generative Adversarial Network (DCGAN) trained on a protein data set obtained from the Protein Data Bank (PDB).
II Related Work
Functionality of a protein and its structure are tightly coupled. Understanding the 3-D structure of a protein can give us knowledge regarding its functionality.
II-A Protein Structure Prediction
In the literature, X-ray crystallography and Nuclear Magnetic Resonance (NMR) are used to determine the 3-D structure of a protein [ilari2008protein]. By directing X-rays onto a protein and measuring the resulting diffraction and scattering, one can measure the electron density of the molecules in the protein. NMR is faster than X-ray crystallography, but it is only used on proteins that have fewer than 150 amino acids [goodman2000relationships]. Developing computational models that predict protein structure is therefore a crucial need [moyer2020machine].
For this reason, identifying the native three-dimensional (3-D) structure of a protein is a common problem in bioinformatics and has applications in drug design, protein engineering, and protein annotation. Previous methods of 3-D structure prediction have focused on energy minimization to find thermodynamically favored structures and the results are assessed in comparison to the free energy of the native structure [dill2008protein].
A few works in energy prediction have successfully used Convolutional Neural Networks (CNNs) to predict the energy of each bond in a structure [yao2017intrinsic]. Moreover, some have attempted to quantify the relative energy deviation of a decoy structure from its native, or most optimally folded, structure [moyer2020measuring]. This latter method is displayed in Figure 1a, where the red-to-blue gradient corresponds to a measurement of the energy deviation of a decoy structure from a native structure.
II-B Functional Protein Annotation
The function of a protein is closely tied to its structure. A similar sequence of amino acids between two proteins can imply an identical or similar function. However, there are cases in which even a single amino acid change entirely changes the function of a protein [schaefer2012predict]. As such, additional criteria beyond protein structure are needed to predict the function of a protein. To simplify this task, a large body of work focuses on identifying "structural motifs," that is, certain protein structures and amino acid sequences that are found in many proteins with a specific function. It should be noted that the presence of a structural motif in a protein does not necessarily indicate that the protein has a certain function.
A more general question is whether a protein is functional or non-functional. Since functionality is more or less indicative of native folding, one would expect a search for functional proteins to be computationally expensive. To put this in perspective, the protein structure search space of an amino acid sequence grows exponentially with its length $n$. Each individual amino acid sequence may have a range of unique structures in which the protein can function relatively well and a select few in which it functions optimally. An exhaustive search is therefore unrealistic, and effort should instead be put into recognizing relationships between functionally annotated structures that have already been identified using X-ray crystallography and NMR.
II-C Generative Adversarial Network
Generative Adversarial Networks (GANs) were first introduced by Ian Goodfellow and have seen wide adoption [goodfellow2014generative]. Even though noise cancellation was the first purpose of GANs, the field has expanded to conditional GANs and has seen wide adoption in style transfer [isola2018imagetoimage], image generation [zhu2020unpaired], and audio generation [alparslanspeech2020] [sparsitypaper].
A Generative Adversarial Network (GAN) can be thought of as a zero-sum game between two networks: (1) one that discriminates between real and fake data samples and (2) one that generates data samples to fool this discriminator. This dynamic is illustrated in Figure 1d. In mathematical terms, this corresponds to the minimax game:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
where $x$ is a vector of real data samples, $z$ is a latent representation of a fake data sample, $G$ is the generative network, and $D$ is the discriminative network [goodfellow2014generative]. Often, the generator uses random noise to create seemingly novel samples that are increasingly indistinguishable from real samples. When both models are trained simultaneously, the discriminator becomes more accurate at discrimination and the generator becomes more accurate at generation.
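As a minimal sketch of how the value function of this minimax game [goodfellow2014generative] can be estimated in practice, the snippet below averages log D(x) over real samples and log(1 - D(G(z))) over generated samples. The function name `gan_value` and the example probabilities are illustrative assumptions, not part of the original work.

```python
import math


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of the GAN value function V(D, G).

    d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1)
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that cannot tell real from fake outputs 0.5 everywhere,
# which gives V = -2 * log(2).
undecided = gan_value([0.5, 0.5], [0.5, 0.5])

# A confident and correct discriminator pushes V toward its upper bound of 0.
confident = gan_value([0.99, 0.98], [0.01, 0.02])
```

The discriminator is trained to increase this quantity while the generator is trained to decrease it, which is what makes the game zero-sum.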
II-D Deep Convolutional GAN (DCGAN)
DCGANs merge convolutional neural networks with the GAN architecture described in Section II-C. They extend the standard GAN architecture by replacing the fully-connected generator and discriminator networks with deep convolutional neural networks.
Convolutional Neural Networks (CNNs) are well suited to classifying images, especially compared with their non-convolutional counterparts, because the convolution operation preserves the spatial properties of images by working with 2D representations. In contrast, non-convolutional networks require 1D representations, so the image is "flattened" (the rows or columns of the image are concatenated). [DBLP:journals/corr/abs-1905-03288] and [SHARMA2018377] provide more in-depth discussions of how CNNs are used for image classification and recognition.
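The contrast between a 2D convolution and a flattened input can be sketched with a toy example. The naive `conv2d_valid` helper below is a hypothetical illustration (deep learning libraries provide optimized equivalents); it shows that the feature map keeps a 2D layout, whereas flattening moves vertically adjacent pixels several positions apart.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Naive 'valid' 2D cross-correlation, the operation used in CNN layers."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


image = np.zeros((4, 4))
image[1:3, 1:3] = 1.0       # a 2x2 bright patch in the middle
kernel = np.ones((2, 2))    # responds to 2x2 neighbourhoods

feature_map = conv2d_valid(image, kernel)  # shape (3, 3): 2D layout is kept
flat = image.flatten()                     # shape (16,): rows are concatenated

# Pixels (1, 1) and (2, 1) touch in the image but sit one full row width
# apart once the image has been flattened row by row.
gap = (2 * image.shape[1] + 1) - (1 * image.shape[1] + 1)
```

The strongest response of the feature map sits exactly where the filter overlaps the whole patch, so the location of the feature survives the operation.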
The use and deployment of a GAN or DCGAN is highly dependent on the problem at hand. In one case, the model could be used to discriminate between otherwise indistinguishable samples. In another case, the model may be deployed to generate unique samples. Our work focuses on the latter.
III-A Protein Data Set
In the Protein Data Bank (PDB), every molecular structure can be uniquely identified by a four-character, case-insensitive accession number, also called a PDB ID. These accession numbers are standardized so that the first character is numeric and the last three characters are alphanumeric. An example of such a code is 1crn, which identifies a specific hydrophobic protein structure of crambin [teeter1984water].
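The identifier format described above can be checked with a short regular expression. This is a sketch of that description only (one digit followed by three alphanumeric characters, compared without regard to case); the helper names `is_pdb_id` and `normalize_pdb_id` are hypothetical.

```python
import re

# Format as described in the text: one numeric character followed by
# three alphanumeric characters, in either upper or lower case.
PDB_ID_PATTERN = re.compile(r"[0-9][0-9A-Za-z]{3}")


def is_pdb_id(code):
    """Return True if the whole string looks like a four-character PDB ID."""
    return PDB_ID_PATTERN.fullmatch(code) is not None


def normalize_pdb_id(code):
    """Lower-case a valid PDB ID so that '1CRN' and '1crn' compare equal."""
    if not is_pdb_id(code):
        raise ValueError(f"not a four-character PDB ID: {code!r}")
    return code.lower()
```

For example, `normalize_pdb_id("1CRN")` returns `"1crn"`, while `is_pdb_id("crn1")` is false because the first character is not numeric.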
Our data set is composed of 1000 protein segments identified by PDB IDs. Each segment is exactly 11 amino acids long and has more than 80% alpha helix composition.
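The selection criteria above (11 residues, more than 80% alpha helix) could be applied as a sliding-window filter. The sketch below assumes a per-residue secondary-structure string in which `H` marks a helical residue, in the style of DSSP output; how the segments were actually extracted is not stated in this work, so the function names and the windowing strategy are assumptions.

```python
SEGMENT_LENGTH = 11        # residues per segment, from the text
MIN_HELIX_FRACTION = 0.8   # segments must exceed this helix fraction


def helix_fraction(ss):
    """Fraction of residues annotated as alpha helix ('H')."""
    return ss.count("H") / len(ss)


def helical_segments(sequence, ss):
    """Yield (start, segment) for each 11-residue window that is >80% helix.

    sequence: one-letter amino acid string
    ss: secondary-structure string of the same length (ASSUMED format)
    """
    if len(sequence) != len(ss):
        raise ValueError("sequence and secondary structure differ in length")
    for start in range(len(sequence) - SEGMENT_LENGTH + 1):
        window = ss[start:start + SEGMENT_LENGTH]
        if helix_fraction(window) > MIN_HELIX_FRACTION:
            yield start, sequence[start:start + SEGMENT_LENGTH]


# Toy input: four coil residues followed by ten helical residues.
example = list(helical_segments("ACDEFGHIKLMNPQ", "CCCCHHHHHHHHHH"))
```

With a window of 11 residues, the threshold means that at least 9 residues must be helical, since 9/11 is the smallest fraction above 0.8.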
III-B Protein Representation
Protein structures are commonly encoded using a contact map or distance matrix, which is an $n \times n$ matrix of pairwise distances between the $n$ atoms in a given structure [baldi2003principled]. These matrices can easily be used to reconstruct a protein using methods known as multidimensional scaling [kruskal1964multidimensional]. Although this representation captures the distances between each pair of atoms, it ignores the atom types, which are responsible for forming specific bonds at different levels of protein structure. For instance, in alpha helix secondary structures, hydrogen atoms are responsible for holding the helical spiral together. Additionally, sulfur atoms are known to form disulfide bridges in the tertiary structure of a protein. Therefore, this work replaces the contact map with a convolutional network design, which is known to be successful in image recognition. Our hope is that convolution can be used in place of contact maps, as it uses filter maps to learn the positional relationships among features. In addition to our use of convolution, we store three additional features for each atom: a one-hot encoded vector for the generic atom type (carbon, nitrogen, sulfur, etc.), a one-hot encoded vector for the positional atom type (beta-carbon, alpha-carbon, etc.), and the atomic occupancy value corresponding to the nearest atom, given by the following formula:
where is the single atom occupancy of atom with radius , and is the Van der Waals attractive force radius for atom .
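The three per-atom features can be sketched as a single concatenated vector. Everything specific in this example is an assumption made for illustration: the vocabularies `ELEMENTS` and `POSITIONS` are hypothetical, and the occupancy uses a smooth form, 1 - exp(-(r_vdw / r)^12), that appears in other voxel-based structure representations and is not confirmed to be the formula intended here.

```python
import math

# Hypothetical vocabularies; the text only lists a few examples of each.
ELEMENTS = ["C", "N", "O", "S", "H"]
POSITIONS = ["CA", "CB", "C", "N", "O"]


def one_hot(value, vocabulary):
    """Return a list with 1.0 at the index of value and 0.0 elsewhere."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def occupancy(distance, vdw_radius):
    """Occupancy at a given distance from an atom centre.

    ASSUMED functional form: 1 - exp(-(r_vdw / r)^12). It is close to 1
    inside the Van der Waals radius and decays quickly outside it.
    """
    if distance <= 0.0:
        return 1.0
    return 1.0 - math.exp(-((vdw_radius / distance) ** 12))


def atom_features(element, position, distance, vdw_radius):
    """Concatenate both one-hot vectors with the scalar occupancy."""
    return (one_hot(element, ELEMENTS)
            + one_hot(position, POSITIONS)
            + [occupancy(distance, vdw_radius)])


# An alpha-carbon sampled exactly at its (illustrative) Van der Waals radius.
features = atom_features("C", "CA", distance=1.7, vdw_radius=1.7)
```

Under this assumed form, a point exactly at the Van der Waals radius has occupancy 1 - 1/e, and points a few radii away contribute almost nothing.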
III-C Network Architecture & Training
As can be seen in Figure 2, we trained our DCGAN for up to 50 epochs with early stopping. Training stopped at around epoch 17 and resulted in a final loss of 2.962 for the generator and 0.642 for the discriminator. The generator has a total of 46,302,126 parameters, all of which are trainable. The discriminator has a total of 287,169 parameters, all of which are trainable. The generator loss indicates that structure prediction is improving over time, while the discriminator tries to distinguish between decoy and non-decoy generated structures. Functional annotation generation is robust enough against adversarial attacks.
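Early stopping with a 50-epoch cap can be sketched as follows. The monitored quantity, the `patience` value, and the `min_delta` threshold are assumptions, since the stopping criterion is not specified in this work; the function simply replays a sequence of per-epoch losses and reports where training would halt.

```python
def train_with_early_stopping(epoch_losses, max_epochs=50, patience=3,
                              min_delta=0.0):
    """Return (stopped_epoch, best_loss) for a stream of monitored losses.

    epoch_losses: iterable giving the monitored loss after each epoch
    max_epochs:   hard cap on the number of epochs (50 in the text)
    patience:     ASSUMED number of non-improving epochs tolerated
    min_delta:    ASSUMED minimum decrease that counts as an improvement
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    stopped_epoch = 0
    for epoch, loss in enumerate(epoch_losses, start=1):
        if epoch > max_epochs:
            break
        stopped_epoch = epoch
        if loss < best_loss - min_delta:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return stopped_epoch, best_loss


# The loss stops improving after epoch 3, so training halts at epoch 6.
result = train_with_early_stopping([5.0, 4.0, 3.5, 3.6, 3.7, 3.8, 1.0])
```

In a real training loop, the loss for each epoch would come from a training or validation step rather than from a precomputed list.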
IV Conclusion and Future Work
Two optimization steps were included in order to alleviate the computational load of the DCGAN. First, the feature space was trimmed from the 70x70x70 Å window down to the smallest possible rectangular prism grid without removing non-zero occupancy entries. Second, this grid-like formatted data was fed into the DCGAN using a generator function and a batch size of ten.
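The two optimization steps can be sketched with NumPy. This example assumes that the trimmed box is shared by all structures so that the cropped grids can be stacked into batches; the helper names `common_bounds`, `crop`, and `batches` are hypothetical and do not come from the original implementation.

```python
import numpy as np


def common_bounds(grids):
    """Smallest box (lo, hi) holding every non-zero voxel of every grid."""
    occupied = None
    for grid in grids:
        mask = grid != 0
        occupied = mask if occupied is None else (occupied | mask)
    if occupied is None or not occupied.any():
        raise ValueError("no occupied voxels found")
    idx = np.argwhere(occupied)
    return idx.min(axis=0), idx.max(axis=0) + 1


def crop(grid, lo, hi):
    """Trim a 3D grid to the half-open box [lo, hi) along each axis."""
    return grid[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]


def batches(grids, batch_size=10):
    """Lazily yield stacked batches so only one batch is built at a time."""
    batch = []
    for grid in grids:
        batch.append(grid)
        if len(batch) == batch_size:
            yield np.stack(batch)
            batch = []
    if batch:
        yield np.stack(batch)  # final, possibly smaller, batch
```

A typical use would compute `lo, hi = common_bounds(grids)` once and then iterate over `batches(crop(g, lo, hi) for g in grids)`, which removes empty space without discarding any non-zero occupancy entry.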
V Experiment Results & Observations
In this work, we limited our scope to discriminating between functional and non-functional protein structures. Although our results provide insight into the difficulty of the problem at hand, a more focused future development would be to train a DCGAN on subsets of protein structures with highly specific functions, such as ligand binding and RNA degradation. Furthermore, adding features such as torsion (Ramachandran) angles between the outer bonds of the proteins would increase the representational fidelity of newly generated data. Such future work would allow for the possible discovery of novel protein structures that are related to real samples by function.
We would like to acknowledge the Drexel Society of Artificial Intelligence for its contributions to and support of this research.