1 Introduction
Proteins are chains of chemical units, called amino acids, that fold to form three-dimensional structures. Predicting a protein’s structure from its amino acid sequence remains one of the most elusive yet rewarding challenges in computational biology. The structure determines the protein’s function, and knowing it can help us understand life-threatening diseases and accelerate drug discovery (Kuntz (1992)). However, experimental methods for solving a protein’s structure are both time-consuming and costly, and they account for only a small percentage of known protein sequences.
Earlier computational methods for predicting protein structure include molecular dynamics (MD), which uses physics-based equations to simulate the trajectory of a protein’s folding process into a stable final 3D conformation (Marx and Hutter (2010)). However, this method is computationally expensive and ineffective for larger proteins. Other approaches use coevolutionary information to predict the residue-residue contact map, which can then guide structure prediction methods. With the help of deep learning architectures like convolutional neural networks, contact prediction remains the prevailing method in structure prediction (Wang et al. (2016)). However, because these methods do not provide an explicit mapping from sequence to structure, they lack the ability to capture intrinsic information between the sequence and structure.
Recurrent geometric networks (RGNs) address this issue: an RGN is an end-to-end differentiable model that jointly optimizes the relationships between protein sequences and structures. However, because it uses RNNs as internal representations, training can be both difficult and time-consuming (AlQuraishi (2019a)).
In this paper, we propose a modification to the RGN architecture, the Universal Transforming Geometric Network (UTGN). Inspired by the recent successes of transformer models in the NLP community, we replace the LSTMs in the RGN model with the encoder portion of the Universal Transformer (UT) as the internal representation (Dehghani et al. (2019)). As a result, the model is faster to train and is contextually informed by all subsequent symbols, which makes it better at learning global dependencies among the amino acids than RNNs.
The UTGN operates by first taking a sequence of vector representations of the amino acids and applying the universal transformer architecture to iteratively refine a sequence of internal representations. Next, it uses the internal states to construct three torsional angles for each position, which are then used to construct the 3D Cartesian structure (Figure 1).
Our experiments show that the UTGN improves on the RGN in both RMSD and TM score on the free modeling portion of CASP12, and likewise on the template-based modeling portion.
2 Model Description
2.1 Input Representation
We represent each amino acid in the protein sequence as a one-hot encoding. Next, we derive the Multiple Sequence Alignment (MSA) from JackHMMER and use it to calculate the Position-Specific Scoring Matrix (PSSM) (Potter et al. (2018)). Then, we normalize the PSSM values to between 0 and 1 and concatenate them with the one-hot encoding. After feeding this into a fully connected layer of dimension $d$, we add positional encodings as in Vaswani et al. (2017) as follows
$$P_{pos,\,2j} = \sin\left(pos / 10000^{2j/d}\right) \qquad (1)$$
$$P_{pos,\,2j+1} = \cos\left(pos / 10000^{2j/d}\right) \qquad (2)$$
where $pos$ is the position of the vector in the protein sequence and $j$ is the index into that vector.
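The sinusoidal encoding of Vaswani et al. (2017) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation; the function name and the assumption that the model dimension is even are ours.

```python
import numpy as np

def sinusoidal_positional_encoding(length: int, dim: int) -> np.ndarray:
    """Return a (length, dim) matrix of sinusoidal position encodings."""
    if dim % 2 != 0:
        raise ValueError("this sketch assumes an even model dimension")
    pos = np.arange(length)[:, None]      # positions, shape (length, 1)
    j = np.arange(dim // 2)[None, :]      # frequency index, shape (1, dim / 2)
    angles = pos / np.power(10000.0, 2.0 * j / dim)
    enc = np.zeros((length, dim))
    enc[:, 0::2] = np.sin(angles)         # even components use sine
    enc[:, 1::2] = np.cos(angles)         # odd components use cosine
    return enc
```

The resulting matrix would be added element-wise to the output of the fully connected input layer, one row per residue.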
2.2 Universal Transformer
We use the multilayer encoder portion of the Universal Transformer (UT) for the internal representation. This neural architecture operates by recurring over representations of each position of the input sequence. Unlike recurrent neural networks, which recur over positions in the sequence, the UT recurs over revisions of the vector representation of each position (Dehghani et al. (2019)). In each time step, the representations are revised by passing through a stack of layers, where each layer consists of a self-attention mechanism that exchanges information across all positions of the previous representation, followed by a transition function.
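More specifically, given an input sequence of length $m$, we initialize a matrix $H^0 \in \mathbb{R}^{m \times d}$. Each new representation $H^t$ at time step $t$ is determined by first applying the multi-head dot-product self-attention mechanism (Vaswani et al. (2017)). We compute the scaled dot-product attention using queries $Q$, keys $K$, and values $V$ as follows
$$\mathrm{ATTENTION}(Q, K, V) = \mathrm{SOFTMAX}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V \qquad (3)$$
where $d_k$ is the number of columns of $Q$ and $K$.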
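For each head $i$, we map the state $H^t$ to queries, keys, and values using learned matrices $W_i^Q$, $W_i^K$, and $W_i^V$, where $i = 1, \dots, k$ and $k$ is the number of heads (Vaswani et al. (2017)). Next, we apply the scaled dot-product attention to each head, concatenate the results, and multiply by a learned matrix $W^O$.
$$\mathrm{MULTIHEAD}(H^t) = \mathrm{CONCAT}(\mathrm{head}_1, \dots, \mathrm{head}_k)\, W^O \qquad (4)$$
$$\mathrm{head}_i = \mathrm{ATTENTION}(H^t W_i^Q,\; H^t W_i^K,\; H^t W_i^V) \qquad (5)$$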
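The multi-head self-attention step described above can be sketched as follows. This is an illustrative NumPy version under our own assumptions (function names, weight shapes, and a per-head width of the model dimension divided by the number of heads); it omits dropout and any masking of padded positions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention for 2-D query, key, and value matrices."""
    d_k = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d_k), axis=-1)  # (m, m) attention weights
    return weights @ v

def multi_head_self_attention(h, w_q, w_k, w_v, w_o):
    """h: (m, d) states; w_q, w_k, w_v: (heads, d, d // heads); w_o: (d, d)."""
    heads = [
        attention(h @ wq, h @ wk, h @ wv)   # one (m, d // heads) output per head
        for wq, wk, wv in zip(w_q, w_k, w_v)
    ]
    return np.concatenate(heads, axis=-1) @ w_o
```

Because every row of the attention weights spans all positions, each residue's new representation can draw on every other residue in a single step.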
The state is then passed through the first layer of the encoder as follows
(6) 
where
(7) 
(8) 
where LAYERNORM is defined in Ba et al. (2016) and TRANSITION is either a one-dimensional separable convolution (Chollet (2016)) or a fully connected layer. In addition, each layer has its own weights and is indexed by its layer number. Between the multi-head attention and the transition function, we incorporate both residual connections and dropout (Srivastava et al. (2014)).
For an encoder with multiple layers, we have
(9) 
In contrast to the original UT model, we do not add positional encodings on each time step; rather, the positional encoding is only added at the initial starting phase (see Figure 2 for a complete model).
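One plausible way to write the layered recurrence, following the general form given by Dehghani et al. (2019), is shown below. The notation is assumed for illustration and is not taken from the original equations: $H^{t-1}$ is the state entering time step $t$, $n$ is the number of encoder layers, and $l$ indexes the layers. Dropout is omitted for brevity.

```latex
% Assumed notation: H^{t-1} enters step t; n layers indexed by l.
\begin{align*}
H^{t}_{0} &= H^{t-1} \\
A^{t}_{l} &= \mathrm{LAYERNORM}\big(H^{t}_{l-1} + \mathrm{MULTIHEAD}_{l}(H^{t}_{l-1})\big), \qquad l = 1, \dots, n \\
H^{t}_{l} &= \mathrm{LAYERNORM}\big(A^{t}_{l} + \mathrm{TRANSITION}_{l}(A^{t}_{l})\big) \\
H^{t} &= H^{t}_{n}
\end{align*}
```

In this reading, the same stack of $n$ layers is reapplied at every time step, which is what distinguishes the recurrence from a standard transformer encoder of fixed depth.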
2.3 Dynamic Halting
Because we wish to expend more computing resources on amino acids with ambiguous relevancy, we use Adaptive Computation Time (ACT) to dynamically halt changes in certain representations (Graves (2016)). For each step and each symbol, if the scalar halting probability predicted by the model exceeds a threshold, the state representation is simply copied to the next time step. Recurrence continues until all representations are halted or the maximum number of steps is reached.
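The halting rule described above can be sketched as a simple loop. This is a simplified illustration that follows the description in the text literally: it freezes a position once its predicted halting probability exceeds the threshold, and it leaves out the accumulated halting probabilities and remainder weighting of the full ACT algorithm in Graves (2016). The names `step_fn` and `halt_fn`, and the default threshold and step limit, are hypothetical.

```python
import numpy as np

def run_with_halting(state, step_fn, halt_fn, threshold=0.9, max_steps=10):
    """Refine per-position states, copying halted positions forward unchanged.

    state:   (m, d) array of per-position representations.
    step_fn: maps an (m, d) array to a revised (m, d) array.
    halt_fn: maps an (m, d) array to (m,) halting probabilities.
    """
    halted = np.zeros(state.shape[0], dtype=bool)
    steps_used = np.zeros(state.shape[0], dtype=int)
    for _ in range(max_steps):
        if halted.all():
            break
        # A position halts once its halting probability exceeds the threshold.
        halted = halted | (halt_fn(state) > threshold)
        active = ~halted
        revised = step_fn(state)
        # Halted positions keep their state; active positions take the revision.
        state = np.where(active[:, None], revised, state)
        steps_used += active
    return state, steps_used
```

The returned step counts show how much computation each position received, which is the quantity ACT is meant to adapt.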
2.4 Structure Construction
As in AlQuraishi (2019a), we use the final states for each position to construct the three torsional angles $\varphi$, $\psi$, and $\omega$. These angles represent the geometry of the protein spanned by the backbone atoms $N$, $C^{\alpha}$, and $C'$. Though bond lengths and angles also vary, their variation is limited enough that we can assume them to be fixed. We also ignore the side chains of the protein, as our focus is on the backbone atoms. The resulting angles at each position are then translated into 3D coordinates for the backbone.
More specifically, at each position, the corresponding angle triplet is calculated as follows
(10) 
(11) 
where the input is the row of the final state matrix for that position, the remaining parameters are learned weights, and arg is the complex-valued argument function. In addition, the learned weights define an alphabet whose letters correspond to triplets of torsional angles over the 3-torus. Next, recurrent geometric units convert the sequence of torsional angles into 3D Cartesian coordinates as follows
(12) 
where each new atom is placed using the length of the bond connecting it to the preceding atom, the bond angle formed with the two preceding atoms, and the predicted torsional angle about the preceding bond; vectors are unit-normalized and combined through cross products to build the local reference frame. The resulting sequence of atom positions forms the final Cartesian coordinates of the protein backbone chain.
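The two stages of structure construction, angles from states and coordinates from angles, can be sketched as below. This is an illustration under stated assumptions, not the authors' code: `angularize` takes a softmax-weighted circular mean over a learned angle alphabet in the spirit of AlQuraishi (2019a), and `place_next_atom` uses the standard natural-extension reference frame construction to place one atom from the previous three. All function and parameter names are hypothetical, and the angle conventions (radians, bond angle measured at the preceding atom) are our choice.

```python
import numpy as np

def angularize(h, w, b, alphabet):
    """Map states to torsion angles via a soft choice over an angle alphabet.

    h: (m, d) final states; w: (d, a) and b: (a,) learned weights;
    alphabet: (a, 3) torsion triplets in radians. Returns (m, 3) angles.
    """
    logits = h @ w + b
    logits -= logits.max(axis=-1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=-1, keepdims=True)      # (m, a) weights over alphabet letters
    mean_vec = p @ np.exp(1j * alphabet)    # weighted mean on the unit circle
    return np.angle(mean_vec)               # complex argument, in (-pi, pi]

def place_next_atom(c1, c2, c3, bond_length, bond_angle, torsion):
    """Return the atom bonded to c3, given the three preceding atoms.

    bond_angle is the angle c2-c3-new and torsion is the dihedral
    c1-c2-c3-new, both in radians.
    """
    bc_hat = (c3 - c2) / np.linalg.norm(c3 - c2)
    n = np.cross(c2 - c1, bc_hat)
    n_hat = n / np.linalg.norm(n)
    # Columns of the local frame: bond direction, in-plane normal, plane normal.
    frame = np.stack([bc_hat, np.cross(n_hat, bc_hat), n_hat], axis=1)
    local = bond_length * np.array([
        -np.cos(bond_angle),
        np.cos(torsion) * np.sin(bond_angle),
        np.sin(torsion) * np.sin(bond_angle),
    ])
    return c3 + frame @ local
```

Calling `place_next_atom` repeatedly, three times per residue with fixed bond lengths and bond angles, would grow the backbone one atom at a time; every operation is differentiable in the torsion angles, which is what allows the loss on coordinates to reach the encoder weights.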
For training, the weights are optimized through the dRMSD loss function between the predicted and expected coordinates. This computes the pairwise distances between all atoms within the predicted structure and within the expected structure, and then finds the distance between those two sets of distances. More specifically,
(13) 
where the pairwise distances form the elements of a matrix. We chose this loss function because it is differentiable and captures both local and global aspects of the protein structure.
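A distance-based RMSD of this kind can be sketched in NumPy as follows. This is a minimal illustration; the function names are ours, and the normalization (root mean square over distinct atom pairs) is one common convention that may differ from the original by a constant factor.

```python
import numpy as np

def pairwise_distances(coords):
    """coords: (n, 3) atom positions. Returns the (n, n) distance matrix."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def drmsd(pred, true):
    """Root mean square difference between two structures' pairwise distances."""
    n = pred.shape[0]
    delta = pairwise_distances(pred) - pairwise_distances(true)
    upper = np.triu_indices(n, k=1)   # count each atom pair once
    return np.sqrt(np.mean(delta[upper] ** 2))
```

Because only internal distances enter the computation, the value is unchanged when either structure is rotated or translated, so no superposition step is needed before computing the loss.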
3 Experiments and Analysis
3.1 Training Data and Batching
We evaluate our models on the CASP12 ProteinNet dataset with a thinning of 90%, which consists of around 50,000 structures (AlQuraishi (2019b)). The training and validation sets contain all sequences and structures that existed prior to the CASP12 competition. The test set consists of the CASP12 targets, which cover both template-based modeling (TBM), intended to assess the prediction of targets with structural homologs in the Protein Data Bank, and free modeling (FM), intended to test a model’s ability to predict novel structures (Moult et al. (1995)). In the training and validation sets, entries with missing residues are annotated and are not included in the calculation of dRMSD. Sequences of similar length are batched together with a batch size of 32.
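Batching by length can be sketched as a sort followed by chunking. This is a hypothetical helper for illustration only; the actual pipeline may bucket sequences differently (for example, with shuffling inside length ranges).

```python
def batch_by_length(sequences, batch_size=32):
    """Group sequences of similar length into batches of at most batch_size."""
    ordered = sorted(sequences, key=len)   # neighbours now have similar lengths
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```

Grouping similar lengths keeps padding within a batch small, so less computation is spent on padded positions.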
3.2 Model Parameters
The dimension of the feed-forward layer that connects the input to the UT encoder is . We use heads and layers for the UT encoder architecture. The ACT threshold is and the maximum number of ACT recurrences is . If a feed-forward layer is used for the transition function (UTGN-FF), the feed-forward dimension is . If a separable convolution is used instead (UTGN-SepConv), the kernel size is set to and the stride is set to . In addition, we set the alphabet size to 60 for the angularization layer. The UTGN architecture amounts to about million trainable parameters. For comparison, we train the RGN model with a size of , which is also around million trainable parameters.
3.3 Optimizer
We used the ADAM optimizer with , , and learning rate of (Kingma and Ba (2014)). In addition, the loss function for optimization was length normalized (dRMSD / protein length).
3.4 Regularization
We apply a dropout probability of in the UT encoder architecture (Srivastava et al. (2014)). In addition, gradients are clipped using norm rescaling with a threshold of 5.0 (Pascanu et al. (2012)). Furthermore, we perform early stopping when the validation loss failed to change noticeably in epochs.
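Norm rescaling of gradients can be sketched as below. This is an illustration that assumes the norm is taken jointly over all parameter gradients (a global norm); the source does not say whether clipping is global or per tensor, and the function name is ours.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    # Gradients already within the threshold are left unchanged.
    scale = min(1.0, max_norm / total) if total > 0 else 1.0
    return [g * scale for g in grads], total
```

Rescaling preserves the direction of the update while bounding its size, which is the usual remedy for exploding gradients.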
3.5 Analysis
We evaluate our model using two metrics: root mean squared deviation (RMSD) and Template Modelling (TM) Score (Zhang and Skolnick (2004)). RMSD is calculated by
(14) 
where the two inputs are sets of points, one per structure. This metric has the advantage that it does not require two structures to be globally aligned, and is able to detect regions of high agreement even if the global structure is not aligned. However, RMSD is very sensitive to protein length, leading to higher RMSD for longer proteins. The TM Score is calculated by
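$$\mathrm{TM} = \max\left[\frac{1}{L_{\mathrm{target}}} \sum_{i=1}^{L_{\mathrm{aligned}}} \frac{1}{1 + \left(d_i / d_0(L_{\mathrm{target}})\right)^2}\right] \qquad (15)$$
where $L_{\mathrm{target}}$ is the length of the target protein, $L_{\mathrm{aligned}}$ is the number of aligned residues, $d_i$ is the distance between the $i$-th pair of aligned residues, and $d_0(L_{\mathrm{target}}) = 1.24\sqrt[3]{L_{\mathrm{target}} - 15} - 1.8$. TM scores are length-normalized and take values between 0 and 1, with higher values indicating better alignment. A TM score below 0.17 corresponds to random alignment, whereas a TM score above 0.5 corresponds to the same protein fold (Xu and Zhang (2010)).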
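The TM score for a single, already-chosen superposition can be sketched as follows, using the standard length-dependent scale from Zhang and Skolnick (2004). This is an illustration only: the full metric takes the maximum over superpositions, which this sketch does not search for, and the function name and argument layout are ours. The scale is only meaningful for targets longer than about 15 residues.

```python
import numpy as np

def tm_score(distances, target_length):
    """TM score for one superposition.

    distances: distances (in angstroms) between aligned residue pairs.
    target_length: number of residues in the target protein.
    """
    d0 = 1.24 * np.cbrt(target_length - 15) - 1.8   # length-dependent scale
    terms = 1.0 / (1.0 + (np.asarray(distances, dtype=float) / d0) ** 2)
    return float(terms.sum() / target_length)
```

Each aligned pair contributes a value between 0 and 1 that decays with distance, so a few badly placed residues lower the score far less than they would raise an RMSD.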
After evaluating our models, we found that the UTGN with separable convolution as the transition function performed better than the RGN model in the free modeling category in both the RMSD and the TM score metrics (Table 2). For template-based modeling, it likewise improved on the RGN in both RMSD and TM score (Table 1).
From training both the RGN and UTGN models, we note that RGNs tend to suffer heavily from exploding gradients, whereas the UTGN model never had that issue. In addition, the RGN model takes longer for each epoch, and the UTGN model converged to its result faster. Furthermore, we found that UTGNs have more stable initializations, whereas different initializations in RGNs can produce very different evaluation results.
Because RGNs and UTGNs must learn very deep neural networks from scratch and do not include any biophysical priors in the model, training a state-of-the-art model would require months of training and substantially more parameters. Though we train for only a few days and with far fewer parameters, we show that UTGNs have the potential to outperform RGNs.
Table 1: Template-based modeling (TBM)
Model         dRMSD (Å)  TM score
RGN           17.8       0.200
UTGN-FF       17.6       0.198
UTGN-SepConv  17.1       0.208

Table 2: Free modeling (FM)
Model         dRMSD (Å)  TM score
RGN           19.8       0.181
UTGN-FF       19.4       0.174
UTGN-SepConv  18.1       0.194
4 Discussion
Before the RGN was introduced, protein structure prediction competitions were dominated by complex models that fuse together multiple pipelines (Yang et al. (2014)). These models tend to incorporate biological priors, like coevolutionary information and secondary structure, that significantly improve their performance. But the RGN model has proven to be a very competitive option without biological priors, outperforming the CASP11 model in the free modeling category (AlQuraishi (2019a)). Just as end-to-end differentiable models were able to replace complex pipelines in image recognition, we expect end-to-end differentiable architectures like UTGN to eventually replace the complex pipelines in protein structure prediction.
The biggest bottleneck in training RGNs is time. The recurrent neural network portion is unstable, leading to gradient explosions. In addition, different initializations produce very different model performance, requiring researchers to try many different initializations. Furthermore, a fully refined model may take months to train (AlQuraishi (2019a)). As a result, it may take a significant amount of time to search for optimal parameters. UTGNs, however, are able to solve many of these problems, as replacing RNNs with transformers leads to a more stable and more easily parallelizable model.
UTGNs are also better at learning global dependencies than RGNs. In RNNs, as the length of the path between two amino acids increases, information flow decreases. In contrast, the UT effectively has a global receptive field, as each new representation is contextually informed by all previous representations.
Some possible extensions for UTGN include using pretrained embedding representations of amino acids like UniRep (Alley et al. (2019)). This can replace the need to calculate PSSMs for each new sequence, which further reduces the prediction time. In addition, instead of using static positional encodings, we could train relative position representations along with the transformer (Shaw et al. (2018)). Or we could incorporate more information in the input sequence, like secondary structure predictions.
5 Conclusion
This paper introduces UTGN, an end-to-end protein structure prediction architecture that uses a universal transformer as an internal representation. Compared with the existing RGN model, UTGN is better at learning long-range dependencies among the amino acids. In addition, the UTGN performs slightly better, converges much more quickly, and is more stable to train. This progress shows that end-to-end differentiable protein prediction architectures can become competitive models in the protein folding problem.
The code for UTGN is available at: https://github.com/JinLi711/3DProteinPrediction
Acknowledgments
We are grateful to Jie Zheng and Suwen Zhao for providing insightful guidance and commentary. We are also thankful to ShanghaiTech, School of Information Science and Technology, for providing access to its computer cluster.
References
Alley et al. (2019). Unified rational protein engineering with sequence-only deep representation learning.
AlQuraishi (2019a). End-to-end differentiable learning of protein structure. Cell Systems 8 (4), pp. 292–301.e3.
AlQuraishi (2019b). ProteinNet: a standardized data set for machine learning of protein structure. In BMC Bioinformatics.
Ba et al. (2016). Layer normalization. ArXiv abs/1607.06450.
Chollet (2016). Xception: deep learning with depthwise separable convolutions. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1800–1807.
Dehghani et al. (2019). Universal transformers. ArXiv abs/1807.03819.
Graves (2016). Adaptive computation time for recurrent neural networks. ArXiv abs/1603.08983.
Kingma and Ba (2014). Adam: a method for stochastic optimization. CoRR abs/1412.6980.
Kuntz (1992). Structure-based strategies for drug design and discovery. Science 257 (5073), pp. 1078–82.
Marx and Hutter (2010). Ab initio molecular dynamics: basic theory and advanced methods. Cambridge University Press.
Moult et al. (1995). A large-scale experiment to assess protein structure prediction methods. Proteins 23 (3), pp. ii–v.
Pascanu et al. (2012). Understanding the exploding gradient problem. ArXiv abs/1211.5063.
Potter et al. (2018). HMMER web server: 2018 update. In Nucleic Acids Research.
Shaw et al. (2018). Self-attention with relative position representations. In NAACL-HLT.
Srivastava et al. (2014). Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, pp. 1929–1958.
Vaswani et al. (2017). Attention is all you need. In NIPS.
Wang et al. (2016). Accurate de novo prediction of protein contact map by ultra-deep learning model. bioRxiv, pp. 073239.
Xu and Zhang (2010). How significant is a protein structure similarity with TM-score = 0.5? Bioinformatics 26 (7), pp. 889–95.
Yang et al. (2014). The I-TASSER suite: protein structure and function prediction. Nature Methods 12, pp. 7–8.
Zhang and Skolnick (2004). Scoring function for automated assessment of protein structure template quality. Proteins 57 (4), pp. 702–10.