Proteins have important roles in living systems. The complex three-dimensional (3D) structure that determines the function of a protein represents an equilibrium state determined by the amino acid sequence and the appropriate physiological conditions [1]. Protein design [2] requires determining the optimal amino acid sequence that results in a given 3D structure as an equilibrium state. Protein design, therefore, is the inverse problem of 3D structure prediction. In addition, as amino acid sequences are determined by genomic information, protein design can be used to explore the design principles of life. In recent decades, many computational protein design methods have been proposed and applied to drug design [3–10]. However, there have been fewer theoretical studies of design principles based on statistical mechanics, and few heuristics have been applied to the design of real proteins.
In order to design an optimal sequence, the most reasonable statistical mechanical procedure involves finding a sequence that minimizes the free energy of a given target conformation. As part of this procedure, every time a candidate optimal sequence is selected, one needs to carry out a folding simulation to check that the selected sequence folds into the target conformation with high probability, for example, using the negative design method [11].
However, many other design methods stabilize only a target conformation, although they could in principle lower the energy of other conformations. Nevertheless, they successfully find an optimal sequence that minimizes the energy of the target conformation. It is not clear why such design (inverse problem) methods work without an exhaustive search (forward problem) for all possible compact conformations. One of the main purposes of the present study is to investigate the success of the inverse problem without solving the forward problem, which is a highly non-trivial problem in statistical and biological physics.
II Model and Method
II.1 The lattice HP model
In order to address the problem described above, we take a statistical mechanical approach [12] based on Bayesian learning, using a coarse-grained protein model called the HP model [13]. We use a lattice model in which every amino acid residue is located on a lattice site, and a protein structure is represented by a self-avoiding walk on a 2D or 3D lattice. Although a real protein has 20 types of amino acid, the HP model includes only two types, hydrophobic (H) and polar (P). Here, each residue occupies a lattice position and carries a binary variable that indicates whether it is an H residue or a P residue.
We assume the energy of the lattice protein is given by
where denotes the interaction potential between the monomers and . We moreover assume the simplest functional form of : , using two types of interaction set, and . The definition of the contact energy is
where two residues are in contact when they occupy nearest-neighbor lattice sites but are not adjacent along the chain. In this model, therefore, the energy given by Eq. (1) of denatured conformations is always higher than the energy of compact conformations; thus, we only consider compact conformations in the present study. Equilibrium statistical mechanics has been successfully applied to the HP model, which is similar to the Ising model. For lattice protein models of comparatively small size, several successful theoretical studies have been reported.
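As an illustration of the contact definition above, the following minimal sketch evaluates the contact energy of a lattice conformation. The function names `contact_pairs` and `hp_energy`, the string encoding of sequences, and the default energy parameters (-1, 0, 0) for H-H, H-P, and P-P contacts are assumptions made for this example only.

```python
from itertools import combinations


def contact_pairs(coords):
    """Return pairs (i, j), i < j, that sit on adjacent lattice sites
    but are not neighbours along the chain."""
    pairs = []
    for i, j in combinations(range(len(coords)), 2):
        if j - i == 1:
            continue  # chain neighbours do not count as contacts
        distance = sum(abs(a - b) for a, b in zip(coords[i], coords[j]))
        if distance == 1:
            pairs.append((i, j))
    return pairs


def hp_energy(coords, seq, e_hh=-1.0, e_hp=0.0, e_pp=0.0):
    """Sum the pairwise contact energies of an HP sequence on a conformation.

    The default energy parameters are illustrative, not prescribed values.
    """
    table = {("H", "H"): e_hh, ("H", "P"): e_hp,
             ("P", "H"): e_hp, ("P", "P"): e_pp}
    return sum(table[(seq[i], seq[j])] for i, j in contact_pairs(coords))


# A four-residue walk around a unit square: residues 0 and 3 are in contact.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(contact_pairs(square), hp_energy(square, "HPPH"))
```

Because the coordinates are plain tuples, the same functions apply unchanged to 3D conformations.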
II.2 Related works
The first and pioneering study of protein design using statistical mechanics and the HP model was performed by Shakhnovich and Gutin (SG) [14]. Subsequently, Kurosky and Deutsch proposed a design criterion in which the solution of the design problem is a sequence that maximizes the Boltzmann distribution as the conditional probability [15,16], where
is a set of position vectors of the target conformation. We hereafter call this conditional probability the ‘target probability,’ and the design criterion maximizing the target probability is denoted the MTP criterion.
For the MTP criterion, a solution is given by
where is the inverse temperature of the system, is any compact conformation of the conformation space into which the sequence can fold, and
is a partition function of the conformation space. In terms of statistical learning, the MTP criterion is the maximum likelihood estimation for the canonical distribution [Eq. (4)]. The MTP criterion, however, includes the partition function of the conformation space. Hence, to obtain it, one must carry out an exhaustive conformational search every time a candidate sequence is found. For such conformational searches, very fast and accurate methods are required. Currently available methods include generalized ensemble Monte Carlo methods [17–19]. Even these methods, however, cannot provide reasonable results for longer chains (more than 100 residues). Thus, design of large and realistic protein models remains impossible, and successful statistical mechanical protein design using the MTP criterion has been reported only for comparatively small lattice proteins and models [15,16,20–33].
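To make the cost of the MTP criterion concrete, the sketch below evaluates a target probability by brute force on a small 2D lattice, enumerating every compact self-avoiding walk for a single trial sequence. It is a minimal illustration under stated assumptions: conformations are identified by their contact maps (which merges rotated and reflected copies but keeps head-tail reversed walks distinct), and the names `compact_walks`, `contact_map`, and `target_probability`, the inverse temperature, and the energy parameters are all hypothetical choices.

```python
import math
from itertools import combinations


def compact_walks(width, height):
    """Yield every directed self-avoiding walk that fills a width x height grid."""
    def extend(walk, visited):
        if len(walk) == width * height:
            yield tuple(walk)
            return
        x, y = walk[-1]
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nxt[0] < width and 0 <= nxt[1] < height and nxt not in visited:
                walk.append(nxt)
                visited.add(nxt)
                yield from extend(walk, visited)
                walk.pop()
                visited.remove(nxt)

    for x in range(width):
        for y in range(height):
            yield from extend([(x, y)], {(x, y)})


def contact_map(walk):
    """Set of index pairs on adjacent sites that are not chain neighbours."""
    return frozenset(
        (i, j) for i, j in combinations(range(len(walk)), 2)
        if j - i > 1
        and abs(walk[i][0] - walk[j][0]) + abs(walk[i][1] - walk[j][1]) == 1
    )


def energy(cmap, seq, e_hh=-1.0, e_hp=0.0, e_pp=0.0):
    """Contact energy of an HP sequence on a contact map (illustrative parameters)."""
    total = 0.0
    for i, j in cmap:
        n_h = (seq[i] == "H") + (seq[j] == "H")  # number of H residues in the pair
        total += (e_pp, e_hp, e_hh)[n_h]
    return total


def target_probability(target, seq, width, height, beta=10.0):
    """Boltzmann weight of the target divided by the sum over all conformations."""
    maps = {contact_map(w) for w in compact_walks(width, height)}
    weights = {m: math.exp(-beta * energy(m, seq)) for m in maps}
    return weights[contact_map(target)] / sum(weights.values())


snake = [(0, 0), (1, 0), (2, 0), (2, 1), (1, 1), (0, 1), (0, 2), (1, 2), (2, 2)]
print(target_probability(snake, "HHPHHHPHH", 3, 3))
```

The full enumeration is repeated for every trial sequence, and the number of walks grows rapidly with lattice size, which is what makes the exhaustive search impractical for longer chains.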
To overcome the above difficulty, Kurosky and Deutsch [16] carried out high-temperature expansion of the free energy and minimized using simulated annealing for a 2D HP model with –. A design method using simulated annealing for both sequence and conformation space was proposed by Seno et al. [20] using a 2D lattice HP model with and . The multi-sequence Monte Carlo method proposed by Irbäck et al. [22,23] is an efficient procedure that obtains an optimal sequence by excluding bad sequences with low target probability using fluctuation of sequences; this method was used to design a comparatively large 2D HP model ( and ) and 3D off-lattice HP model (). The ‘design equation’ method of Iba et al. was the first application of Boltzmann machine learning to protein design and was used to obtain a correct sequence for several 3D cubic conformations [24,25]. There are some other design methods with a similar policy to the design equation method: minimization of the relative entropy [29,30] and gradient descent [31] to optimize the sequence. Coluzza et al. proposed a method using both energy minimization and free energy minimization for some targets [32,33].
In recent years, many successful applications of deep learning have been reported in various engineering and scientific fields, including protein folding and drug design [34–41]. Deep learning for protein design involves learning the relations between a protein conformation and an amino acid sequence using big data from the Protein Data Bank [42]. Although deep neural networks have been used to successfully predict an optimal sequence for a target conformation [36,41], we still do not theoretically understand the design principles of machine learning in this context. In the present study, therefore, we apply Bayesian learning to protein design to explore these design principles.
II.3 Design method by Bayesian learning
In Bayesian learning, one assumes that data $X$ are generated by a conditional probability $p(X \mid \theta)$ under the parameter $\theta$. By Bayes’ theorem, the posterior distribution $p(\theta \mid X)$ is given by

$$p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{\int p(X \mid \theta')\, p(\theta')\, d\theta'}, \tag{6}$$

where $p(\theta)$ is the prior distribution of $\theta$, if $\theta$ is a continuous random variable. One estimates $\theta$ and predicts unobserved data from the posterior given by Eq. (6). The basic procedure of Bayesian learning is as follows: starting from a highly arbitrary prior $p(\theta)$, the posterior is repeatedly updated with the highly objective likelihood function $p(X \mid \theta)$ using Eq. (6); thus we can finally obtain a precise value of $\theta$.
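Here we apply the above procedure to protein design. Let the appearance probability of the given target conformation $\mathbf{R}$ for data be the target probability $p(\mathbf{R} \mid \boldsymbol{\sigma})$, and let the HP sequence $\boldsymbol{\sigma}$ play the role of the parameter, with prior $p(\boldsymbol{\sigma})$. Then, the posterior of the sequence is given by

$$p(\boldsymbol{\sigma} \mid \mathbf{R}) = \frac{p(\mathbf{R} \mid \boldsymbol{\sigma})\, p(\boldsymbol{\sigma})}{\sum_{\boldsymbol{\sigma}'} p(\mathbf{R} \mid \boldsymbol{\sigma}')\, p(\boldsymbol{\sigma}')}, \tag{7}$$

where $\sum_{\boldsymbol{\sigma}'}$ denotes summation over all sequences.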
Furthermore, we focus on the fact that a protein acquires its native state as an equilibrium with water molecules in a living cell.
We therefore consider the following target probability as a grand canonical distribution:
The definition of the energy of the target for a given sequence is given by Eq. (1), and is the chemical potential of water. We assume that one water molecule combines with one P residue, hence , where is the number of all P residues. Therefore, Eqs. (8) and (9) denote the canonical distribution of the Hamiltonian with external field . Consequently, we rewrite Eqs. (8) and (9) as follows:
where is obtained by the conditions and as
In order to obtain an optimal sequence that maximizes , we have to repeat the exhaustive conformational search in Eq. (11) for each trial sequence . In general, this calculation takes an enormous amount of time.
One of the new ideas of the present study is that the prior distribution is given by
where the normalization constant is the partition function of both conformation and sequence space. The expression of the prior Eq. (13) is based on the following hypothesis: as a result of evolution, the probability of a sequence is proportional to its partition function so that the free energy takes a minimum. Note that Kurosky and Deutsch [15] assumed equal a priori weights, namely the inverse of the number of all HP sequences. By contrast, our method considers the above postulation for the weight of the appearance of sequences, which is reasonable from the viewpoint of thermodynamics and protein evolution. We obtain the posterior distribution by substituting Eqs. (13) and (10) into Eq. (7) and by canceling out the partition function of the conformation space as follows:
where the denominator is a partition function of the sequence space corresponding to the given target conformation. An important point regarding Eq. (15) is that it no longer includes the partition function of the conformation space. Thus, we can efficiently obtain an optimal sequence using Eq. (15) simply by summation over sequences. This design method without the conformational partition function is essentially the same as a procedure used previously to obtain more realistic protein designs [3–10].
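The cancellation described above can be written out in one line. The notation in this sketch is assumed for illustration: $\boldsymbol{\sigma}$ is an HP sequence, $\mathbf{R}$ the target conformation, $H(\mathbf{R},\boldsymbol{\sigma})$ the Hamiltonian including the external field, $Z(\boldsymbol{\sigma})$ the partition function of the conformation space, and $\Xi$ the partition function of both conformation and sequence space.

```latex
% Assumed notation (not taken verbatim from the source):
%   likelihood  p(R | sigma) = exp(-beta H(R, sigma)) / Z(sigma)
%   prior       p(sigma)     = Z(sigma) / Xi
\begin{align}
p(\boldsymbol{\sigma} \mid \mathbf{R})
  &= \frac{p(\mathbf{R} \mid \boldsymbol{\sigma})\, p(\boldsymbol{\sigma})}
          {\sum_{\boldsymbol{\sigma}'} p(\mathbf{R} \mid \boldsymbol{\sigma}')\, p(\boldsymbol{\sigma}')}
   = \frac{\dfrac{e^{-\beta H(\mathbf{R},\boldsymbol{\sigma})}}{Z(\boldsymbol{\sigma})}
           \cdot \dfrac{Z(\boldsymbol{\sigma})}{\Xi}}
          {\sum_{\boldsymbol{\sigma}'}
           \dfrac{e^{-\beta H(\mathbf{R},\boldsymbol{\sigma}')}}{Z(\boldsymbol{\sigma}')}
           \cdot \dfrac{Z(\boldsymbol{\sigma}')}{\Xi}} \\
  &= \frac{e^{-\beta H(\mathbf{R},\boldsymbol{\sigma})}}
          {\sum_{\boldsymbol{\sigma}'} e^{-\beta H(\mathbf{R},\boldsymbol{\sigma}')}} .
\end{align}
```

Both $Z(\boldsymbol{\sigma})$ and $\Xi$ drop out, so the only remaining sum runs over sequences on the fixed target conformation.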
is large. We thus utilize one of the simplest Markov-chain Monte Carlo (MCMC) methods, Gibbs sampling. In this method, the sampling probability at each Monte Carlo step (MCS) is a conditional probability of one residue variable given the other random variables. We thus obtain the following sampling probability by substituting Eqs. (1) and (12) into Eq. (15). Accordingly, the sampling probability of an H residue () or P residue () is given by
where , a vector of all random variables of residues except for the -th residue , and the double signs correspond. Let , where denotes the set of sites that are the nearest neighbors of -th site except for those along the chain (). The random variables have fixed realizations in the denominator and the numerator of the right-hand side of Eq. (15) at every MCS. Thus, the random variables that interact with the -th residue remain only on the right-hand side of Eq. (17), because those fixed realizations, except for the residues that interact with , are canceled out in Eq. (15). We decide whether each residue is H or P using the expectation ; that is, is H if and P otherwise. We also take the number of MCSs until the estimated value does not change and let the burn-in be the leading of all MCSs. In this study, the inverse temperature is set to for all conformations of all lattice models. On the other hand, we heuristically set the chemical potential in order to design a unique ground state by repeating the design experiment many times. The necessary and sufficient condition for successful design is that the energy given by Eq. (1) of the target conformation and the sequence designed corresponds to a unique ground state of all possible compact conformations.
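The sampling scheme described above can be sketched as a short Gibbs sampler over HP sequences on a fixed target conformation. This is a minimal illustration, not the authors' implementation: it assumes a Hamiltonian equal to the contact energy minus a chemical potential `mu` per P residue, and the function names, the inverse temperature, the energy parameters, the number of sweeps, the burn-in fraction, and the 0.5 threshold are all hypothetical choices.

```python
import math
import random
from itertools import combinations


def contact_neighbours(coords):
    """For each residue, list residues on adjacent sites that are not chain neighbours."""
    nbrs = [[] for _ in coords]
    for i, j in combinations(range(len(coords)), 2):
        if j - i > 1 and sum(abs(a - b) for a, b in zip(coords[i], coords[j])) == 1:
            nbrs[i].append(j)
            nbrs[j].append(i)
    return nbrs


def gibbs_design(coords, beta, mu, e_hh=-1.0, e_hp=0.0, e_pp=0.0,
                 sweeps=2000, burn_in=0.2, seed=0):
    """Sample HP sequences on a fixed conformation and return the thresholded design."""
    rng = random.Random(seed)
    n = len(coords)
    nbrs = contact_neighbours(coords)
    is_h = [rng.random() < 0.5 for _ in range(n)]  # random initial sequence
    h_counts = [0] * n
    kept = 0
    for sweep in range(sweeps):
        for i in range(n):
            # Local energy of residue i as H or as P, given its contact partners.
            h_if_h = sum(e_hh if is_h[j] else e_hp for j in nbrs[i])
            h_if_p = sum(e_hp if is_h[j] else e_pp for j in nbrs[i]) - mu
            x = beta * (h_if_h - h_if_p)
            p_h = 0.0 if x > 700 else 1.0 / (1.0 + math.exp(x))  # guard against overflow
            is_h[i] = rng.random() < p_h
        if sweep >= burn_in * sweeps:  # discard the leading burn-in sweeps
            kept += 1
            for i in range(n):
                h_counts[i] += is_h[i]
    p_h_mean = [count / kept for count in h_counts]
    seq = "".join("H" if p > 0.5 else "P" for p in p_h_mean)
    return seq, p_h_mean


snake = [(0, 0), (1, 0), (2, 0), (2, 1), (1, 1), (0, 1), (0, 2), (1, 2), (2, 2)]
sequence, p_h = gibbs_design(snake, beta=10.0, mu=0.3)
print(sequence)
```

Each update touches only the contact partners of one residue, so a full sweep costs time proportional to the chain length and never requires a conformational search.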
The formulation of our design method as described so far can be derived by assuming a joint distribution given by
where is given by Eq. (14). One can derive the prior [Eq. (13)] and the likelihood function (the target probability) given by Eq. (10) by a marginalization and a relation , respectively. Thus, the hypothesis [Eq. (13)] is included in the joint distribution [Eq. (18)].
III.1 Enumerable conformations
First, we tested our design method with comparatively small lattice protein models, for which all compact conformations were enumerable. We designed 2D , , , , and lattice models, and 3D and lattice models. The numbers of all conformations and the numbers of all HP sequences of these lattice models are shown in Table 1.
The number of conformations is the number of all compact self-avoiding walks from which all kinds of rotational, reflection, and head-tail symmetrical conformations have been eliminated. The total number of conformations of these lattice models is enumerable; hence, one can confirm whether or not the designed sequence folds into the target conformation as a unique ground state. Note that not every conformation has a solution to the design problem. The number of sequences that fold into the target conformation as a unique ground state differs among target conformations and is called designability. In general, a conformation with higher designability is easier to design. This is because high designability means a large solution space in sequence space.
Table 1 (partial): number of conformations, number of HP sequences, and target conformations.
|52667||68719476736||Random 100 and MHDC|
|103346||134217728||Random 100 and MHDC|
Designability is a significant quantity that relates to the thermodynamic stability of proteins; however, we do not address issues of designability in depth here. In order to calculate the exact success rate () of the overall conformation, one needs to select designable target conformations with designability greater than zero; however, to enumerate the designabilities of each conformation, one would need to enumerate the energy of every combination of conformations and sequences. This would require vast computation time for models with comparatively large size, such as the , , and models (Table 1), even though they are compact. Therefore, in this study, we carried out the enumeration of designabilities only for the , , , and lattice models.
For the models with , , and , the number of conformations was too large. Thus, we randomly chose 100 target conformations and determined the , that is, the number of successfully designed conformations (Table 2). For the models with and , we moreover identified the most highly designable conformation (MHDC) (Figs. 1, 2, and 3), for which designabilities were exactly enumerated [43], to test whether our method could be used to design the easiest instance.
The results of the application of our method are summarized in Table 2. All designed sequences were classified into three types: good, medium, and bad sequences. The good sequences had the target conformation as a unique ground state, medium sequences had the target conformation as one of the degenerate ground states, and bad sequences had ground state conformation(s) that did not include the target conformation. The table lists the percentage of good sequences and the numbers of conformations that were designed with good, medium, and bad sequences. We also calculated the average degeneracy of all ground states. We repeated the calculations with various values of the chemical potential and obtained the optimal value that gave the maximum success rate. The optimal values for each lattice size are listed in Table 2. The values of the energy parameters are also listed in Table 2. The same energy parameters were used for a lattice model in previous work [43] in order to avoid the degeneracy of ground states. We used the same energy parameters for 3D lattice models for the same reason. The total MCSs were set to for all target conformations listed in Table 2. The small and lattices included several non-designable conformations; we excluded such conformations when calculating the success rate.
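The three-way classification above reduces to a comparison of energies once the energy of the designed sequence on every compact conformation is known. The following minimal sketch shows that logic; the function names, the tolerance, and the example energies are hypothetical and serve only to illustrate the definitions.

```python
def classify_design(energies, target_index, tol=1e-9):
    """Classify a designed sequence from its energy on every compact conformation.

    energies[k] is the energy of the sequence on conformation k, and
    target_index marks the target conformation. Returns the label and the
    ground-state degeneracy.
    """
    e_min = min(energies)
    ground = [k for k, e in enumerate(energies) if e - e_min <= tol]
    if target_index not in ground:
        return "bad", len(ground)
    return ("good" if len(ground) == 1 else "medium"), len(ground)


def success_rate(labels):
    """Percentage of target conformations designed with a good sequence."""
    return 100.0 * sum(label == "good" for label in labels) / len(labels)


# Hypothetical energies of one sequence on five conformations, target at index 1.
print(classify_design([-2.0, -4.0, -4.0, -2.0, -4.0], target_index=1))
```

Non-designable targets would be removed from `labels` before calling `success_rate`, matching the exclusion described in the text.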
According to the results shown in Table 2, success rates were relatively high for small 2D HP models, but they decreased as increased. The average degeneracy was low for 2D models. By contrast, the success rate for 3D models was low compared with that of 2D models. For , was low, but for , it was comparatively high. Thus, designed sequences did not appear to be likely to fold into the target conformations for the cubic lattice. In addition, increased as the number of residues increased for both the 2D and 3D lattices.
Note that we did not enumerate designabilities of all conformations for the , , and models; hence, there may have been non-designable conformations among the 100 randomly chosen conformations. However, it is likely that this was not the case for the and models, because the smaller HP model did not lead to any non-designable conformation. On the other hand, the fraction of non-designable conformations out of all conformations for the model was 21/69; the fraction for the model is expected to be less than that because the fraction decreased as the size increased in the 2D cases. Thus, there may have been a considerable number of non-designable conformations among the randomly chosen 100 conformations for the model; hence, the real success rate of the model increased when non-designable conformations were excluded.
Concerning the MHDC of the and HP models, we obtained a good sequence (Figs. 1, 2, and 3). This is the first example of design of an MHDC without enumerating all HP sequences [43]. For the MHDC of the HP model, we successfully designed a good sequence for the energy parameters (Fig. 2) and (-1, 0, 0) (Fig. 3). We executed MCSs for these three cases. The results obtained here demonstrate the features of general globular proteins, with H residues on the inside of the protein and P residues on the surface exposed to the surrounding water molecules. We observed four residues (surrounded by dotted black circles in Figs. 2 and 3) that were different from each other, possibly owing to the presence or absence of H-P (P-H) contact energies.
The larger of the MHDC of HP model with compared with the case of could have been due to the lower H-H interaction , leading to a greater increase in the number of H-residues than in the case of . Therefore, one needs to increase in order for the surface residues to be P residues.
III.2 Large 2D conformations
Here, we chose 2D HP models of comparatively large size () studied by Irbäck et al. [22,23]. They confirmed with simulated tempering that the designed sequence was likely to fold into the target conformation. For the model with (respectively 50), the parameters were set to (0.85) and the MCSs were (). The energy parameters were set to in both cases. The simulation was executed on a standard PC with a 1.2 GHz dual-core Intel Core m3 and 8 GB of memory, and the calculation time was approximately 0.5–1 s (11–12 s) for (50). Thus, our method ran faster than those used in previous studies. As a result, we successfully designed the same sequences reported by Irbäck et al. (Figs. 4 and 5). Our method also demonstrates the features of globular proteins.
III.3 Optimal chemical potential and number of surface residues
We represent the relation between the optimal and the number of surface residues in Fig. 6. We show only the results for because the optimal varies depending on the energy parameters for a given conformation. We therefore plotted the results for all 2D models and the MHDC model with The residues that were bent 90 degrees inward (indicated by a dashed black circle in Figs. 4 and 5) were not counted for because a water molecule is unlikely to combine with such residues (see Fig. 6).
We observed noticeable linearity between the optimal chemical potential and the number of surface residues. The outlier was obtained in the 2D case (Fig. 4), in which the target conformation was not fully compact and the number of surface residues was much larger than those of other target conformations tested. According to these results, the optimal chemical potential can be estimated from the number of surface residues of a target conformation.
III.4 Probability of a P residue
Finally, in order to clarify why 3D models performed less well than 2D models, we consider the probability that a residue is P for all residues of the and MHDC models (Figs. 7 and 8). We use in Eq. (17) as .
Each in Figs. 7 and 8 is the average of over the last MCSs in both cases. In the case of the 3D lattice, the values of for residues 2, 16, 22, and 24 were not very high. These residues were located in the center of each cube side. By contrast, all values greater than 0.5 were almost equal to 1 in the case of the 2D lattice. Accordingly, it can be seen that the clear division of all values into or is an index of successful design. Thus, the 3D models are difficult instances for our design method.
Our method is similar to the SG method proposed by Shakhnovich and Gutin [14], which did not include . The difference between the SG method and ours is the minimization function. The SG method minimizes directly, keeping a constant value determined a priori, but our method minimizes . Therefore, our method can minimize , maintaining the general features of globular proteins, that is, H residues on the inside and P residues on the surface exposed to the surrounding water molecules. Thus, one can minimize while reducing the diversity of conformations into which a designed HP sequence can fold by minimizing . This corresponds, in a sense, to negative design [11].
On the other hand, as discussed above, our method failed in the cases of 3D HP models and comparatively large compact 2D HP models because it failed to reduce the diversity of the foldable conformations of the designed sequence in these cases. The diversity—or, more simply, the total number of self-avoiding walks—increases for 3D lattices compared with 2D lattices. Such diversity is expected to increase as increases even in the case of 2D lattices. Thus, the success rates decrease in these cases.
In addition, the numbers of core residues in the 3D models used in this study were low, e.g., the 3D model had no core residue and the 3D model had only one; hence, it was difficult to design globular protein-like conformations using these models. Given that our design method finds an optimal sequence by controlling , such small numbers of cores may explain the lower performance of our method in the case of the 3D models. If our design method works in a given instance, the posterior [Eq. (15)] should show a sharp peak at the optimal . This is equivalent to the case where the of every residue is almost equal to 1 or 0. For the 3D lattice, however, there were several comparatively low values (close to 0.5), even in the case of the highly designable conformation. Thus, our design method was not appropriate for those conformations.
The greatest advantage of our method is that it skips the exhaustive calculation of the partition function of the conformation space by assuming the prior distribution given in Eq. (13). As already stated, the form of the prior means that the lower the free energy of a sequence, the higher its prior probability. The prior [Eq. (13)] states the hypothesis that sequences rich in P residues are, in general, more likely to be evolutionarily selected than sequences with unique stable conformations. This hypothesis is consistent with recent findings that organisms have many intrinsically disordered proteins [44–47]; these proteins do not have unique native conformations and are composed of sequences rich in P residues. Recent work [48] has shown that such proteins form ‘droplets (that function as membraneless organelles)’ and have various biologically important roles (e.g., spatiotemporal regulation of gene expression, signaling, and stress response). This indicates that organisms make good use of the physical property given by Eq. (13).
In addition to the biological validity of the prior, the fact that it enabled fast protein design without the calculation of is significant because it suggests that Eq. (13) is not a unique solution for protein design without exhaustive calculation of . As all information about the thermodynamic profile of a protein is evolutionarily embedded solely in the sequence, it is in principle possible to search for a sequence that stabilizes a given target conformation if the code connecting and the thermodynamic profile is broken. The prior given by Eq. (13) may be one such code.
The simple conclusion from these results is that it is possible to design many conformations without an exhaustive conformational search by taking the water effect into account. This approach is more successful with 2D HP models than with 3D models; however, our method is expected to correctly design 3D target conformations given a sufficiently high designability of the target conformation. This is consistent with conventional protein design software, e.g., Rosetta. Hence, our method based on statistical mechanics may enable future studies on more realistic protein design.
Future work could consider an additional parameter reflecting the specific topological information of a target conformation. In addition, setting different numbers of water molecules to combine with each P residue would help to more closely model realistic globular proteins. As Bayesian learning is simple and flexible, such modifications could be easily implemented.
The authors are grateful to M. Ota, Nagoya University, for illuminating discussions. This work was supported by KAKENHI Nos. 19H03166 (G.C.) and 19K03650 (K.T.).
- (1) C. B. Anfinsen, Science 181, 223 (1973).
- (2) I. Coluzza, J. Phys. Condens. Matt. 29, 143001 (2017).
- (3) B. I. Dahiyat and S. L. Mayo, Science 278, 82 (1997).
- (4) B. Kuhlman, G. Dantas, G. C. Ireton, G. Varani, B. L. Stoddard, and D. Baker, Science 302, 1364 (2003).
- (5) S. J. Fleishman, T. A. Whitehead, D. C. Ekiert, C. Dreyfus, J. E. Corn, E.-M. Strauch, I. A. Wilson, and D. Baker, Science 332, 816 (2011).
- (6) N. Koga, R. Tatsumi-Koga, G. Liu, R. Xiao, T. B. Acton, G. T. Montelione, and D. Baker, Nature 491, 222 (2012).
- (7) J. B. Bale, S. Gonen, Y. Liu, W. Sheffler, D. Ellis, C. Thomas, D. Cascio, T. O. Yeates, T. Gonen, N. P. King, and D. Baker, Science 353, 389 (2016).
- (8) D.-A. Silva, S. Yu, U. Y. Ulge, J. B. Spangler, K. M. Jude, C. Labão-Almeida, L. R. Ali, A. Quijano-Rubio, M. Ruterbusch, I. Leung, T. Biary, S. J. Crowley, E. Marcos, C. D. Walkey, B. D. Weitzner, F. Pardo-Avila, J. Castellanos, L. Carter, L. Stewart, S. R. Riddell, M. Pepper, G. J. L. Bernardes, M. Dougan, K. C. Garcia, and D. Baker, Nature 565, 186 (2019).
- (9) Y. Liu and B. Kuhlman, Nucleic Acids Res. 34, W235 (2006).
- (10) P.-S. Huang, Y.-E. A. Ban, F. Richter, I. Andre, R. Vernon, W. R. Schief, and D. Baker, PLoS ONE 6, e24109 (2011).
- (11) W. Jin, O. Kambara, H. Sasakawa, A. Tamura, and S. Takada, Structure 11, 581 (2003).
- (12) S. Cocco, C. Feinauer, M. Figliuzzi, R. Monasson, and M. Weigt, Rep. Prog. Phys. 81, 032601 (2018).
- (13) K. F. Lau and K. A. Dill, Macromolecules 22, (1989).
- (14) E. I. Shakhnovich and A. M. Gutin, Proc. Natl. Acad. Sci. U.S.A. 90, 719 (1993).
- (15) T. Kurosky and J. M. Deutsch, J. Phys. A 28, L387 (1995).
- (16) J. M. Deutsch and T. Kurosky, Phys. Rev. Lett. 76, 323 (1996).
- (17) Y. Iba, G. Chikenji, and M. Kikuchi, J. Phys. Soc. Jpn. 67, 3327 (1998).
- (18) G. Chikenji, M. Kikuchi, and Y. Iba, Phys. Rev. Lett. 83, 1886 (1999).
- (19) N. C. Shirai and M. Kikuchi, J. Chem. Phys. 139, 225103-1 (2013).
- (20) F. Seno, M. Vendruscolo, A. Maritan, and J. R. Banavar, Phys. Rev. Lett. 77, 1901 (1996).
- (21) C. Micheletti, F. Seno, A. Maritan, and J. R. Banavar, Phys. Rev. Lett. 80, 2237 (1998).
- (22) A. Irbäck, C. Peterson, F. Potthast, and E. Sandelin, Phys. Rev. E 58, R5249 (1998).
- (23) A. Irbäck, C. Peterson, F. Potthast, and E. Sandelin, Structure 7, 347 (1999).
- (24) Y. Iba, K. Tokita, and M. Kikuchi, J. Phys. Soc. Jpn. 67, 3985 (1998).
- (25) K. Tokita, M. Kikuchi, and Y. Iba, Prog. Theor. Phys. Suppl. 138, 378 (2000).
- (26) A. Rossi, A. Maritan, and C. Micheletti, J. Chem. Phys. 112, 2050 (2000).
- (27) M. R. Betancourt and D. Thirumalai, J. Phys. Chem. B 106, 599 (2002).
- (28) J. Zou and J. G. Saven, J. Chem. Phys. 118, 3843 (2003).
- (29) Wang, B. Wang, Y. Liu, W. Chen, and C. Wang, Chin. Sci. Bull. 49, 426 (2004).
- (30) X. Jiao, B. Wang, J. Su, W. Chen, and C. Wang, Phys. Rev. E 73, 061903-1 (2006).
- (31) C. L. Kleinman, N. Rodrigue, C. Bonnard, H. Philippe, and N. Lartillot, BMC Bioinform. 7, 326 (2006).
- (32) I. Coluzza, H. G. Muller, and D. Frenkel, Phys. Rev. E 68, 046703-1 (2003).
- (33) I. Coluzza, PLoS ONE 6, e20853 (2011).
- (34) A. W. Senior, R. Evans, J. Jumper, J. Kirkpatrick, L. Sifre, T. Green, C. Qin, A. Žídek, A. W. R. Nelson, A. Bridgland, H. Penedones, S. Petersen, K. Simonyan, S. Crossan, P. Kohli, D. T. Jones, D. Silver, K. Kavukcuoglu, and D. Hassabis, Nature 577, 706 (2020).
- (35) Z. Li, Y. Yang, E. Faraggi, J. Zhan, and Y. Zhou, Proteins 82, 2565 (2014).
- (36) J. Wang, H. Cao, J. Z. H. Zhang, and Y. Qi, Sci. Rep. 8, 6349 (2018).
- (37) J. G. Greener, L. Moffat, and D. T. Jones, Sci. Rep. 8, 16189 (2018).
- (38) D. Merk, L. Friedrich, F. Grisoni, and G. Schneider, Mol. Inf. 37, 1700153 (2018).
- (39) A. Gupta, A. T. Müller, B. J. H. Huisman, J. A. Fuchs, P. Schneider, and G. Schneider, Mol. Inf. 37, 1700111 (2018).
- (40) J. O’Connell, Z. Li, J. Hanson, R. Heffernan, J. Lyons, K. Paliwal, A. Dehzangi, Y. Yang, and Y. Zhou, Proteins 86, 629 (2018).
- (41) A. Button, D. Merk, J. A. Hiss, and G. Schneider, Nat. Mach. Intell. 1, 307 (2019).
- (42) F. C. Bernstein, T. F. Koetzle, G. J. B. Williams, E. F. Meyer Jr, M. D. Brice, J. R. Rodgers, O. Kennard, T. Shimanouchi, and M. Tasumi, Euro. J. Biochem. 80, 319 (1977).
- (43) H. Li, R. Helling, C. Tang, and N. Wingreen, Science 273, 666 (1996).
- (44) P. E. Wright and H. J. Dyson, J. Mol. Bio. 293, 321 (1999).
- (45) V. N. Uversky, J. R. Gillespie, and A. L. Fink, Proteins 41, 415 (2000).
- (46) A. K. Dunker, J. D. Lawson, C. J. Brown, R. M. Williams, P. Romero, J. S. Oh, C. J. Oldfield, A. M. Campen, C. M. Ratliff, K. W. Hipps, J. Ausio, M. S. Nissen, R. Reeves, C. Kang, C. R. Kissinger, R. W. Bailey, M. D. Griswold, W. Chiu, E. C. Gerner, and Z. Obradovic, J. Mol. Graph. Model. 19, 26 (2001).
- (47) P. Tompa, Trends Biochem. Sci. 27, 527 (2002).
- (48) V. N. Uversky, Curr. Opin. Struct. Biol. 44, 18 (2017).