Monte Carlo sampling of flexible protein structures: an application to the SARS-CoV-2 omicron variant

01/19/2022
by   Samuel W. K. Wong, et al.
University of Waterloo
0

Proteins can exhibit dynamic structural flexibility as they carry out their functions, especially in binding regions that interact with other molecules. For the key SARS-CoV-2 spike protein that facilitates COVID-19 infection, studies have previously identified several such highly flexible regions with therapeutic importance. However, protein structures available from the Protein Data Bank are presented as static snapshots that may not adequately depict this flexibility, and furthermore these cannot keep pace with new mutations and variants. In this paper we present a sequential Monte Carlo method for broadly sampling the 3-D conformational space of protein structure, according to the Boltzmann distribution of a given energy function. Our approach is distinct from previous sampling methods that focus on finding the lowest-energy conformation for predicting a single stable structure. We exemplify our method on the SARS-CoV-2 omicron variant as an application of timely interest. Our results identify sequence positions 495-508 as a key region where omicron mutations have the most impact on the space of possible conformations, which coincides with the findings of other preliminary studies on the binding properties of the omicron variant.

READ FULL TEXT VIEW PDF

Authors

page 3

page 10

page 18

page 19

05/18/2021

Conformational variability of loops in the SARS-CoV-2 spike protein

The SARS-CoV-2 spike (S) protein facilitates viral infection, and has be...
11/25/2020

Protein Structure Parameterization via Mobius Distributions on the Torus

Proteins constitute a large group of macromolecules with a multitude of ...
01/06/2021

Statistical challenges in the analysis of sequence and structure data for the COVID-19 spike protein

As the major target of many vaccines and neutralizing antibodies against...
11/14/2017

Protofold II: Enhanced Model and Implementation for Kinetostatic Protein Folding

A reliable prediction of 3D protein structures from sequence data remain...
03/11/2020

Mapping active allosteric loci SARS-CoV Spike Proteins by means of Protein Contact Networks

Coronaviruses are a class of virus responsible of the recent outbreak of...
01/15/2022

StemP: A fast and deterministic Stem-graph approach for RNA and protein folding prediction

We propose a new deterministic methodology to predict RNA sequence and p...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

SARS-CoV-2 is the virus that causes COVID-19 and its genome was rapidly sequenced following the first infections discovered in late 2019 (Zhu et al., 2020). The SARS-CoV-2 genome encodes 29 proteins (Gordon et al., 2020), of which the spike (S) protein has been most widely studied due to its role in facilitating viral infection (Walls et al., 2020). The 3-D structure of the S protein has the ability to bind to human ACE2 (angiotensin-converting enzyme 2), thereby gaining entry into cells. Thus, the presence of antibodies that recognize and block the binding region of the S protein can provide immunity against COVID-19, as elicited from prior infection (Ju et al., 2020) or vaccination (Wang et al., 2021). Infusion of monoclonal antibodies as an early treatment of COVID-19 infection can block disease progression in a similar way (Marovich et al., 2020). Proteins are often flexible and exhibit dynamic structural changes, such that they may have multiple states that facilitate binding (Henzler-Wildman and Kern, 2007). Investigating the S protein’s structural flexibility in its key binding regions, especially prior to their interaction with ACE2, can thus reveal insights into the potential efficacy of antibody candidates (Dehury et al., 2020). Computational methods for doing so would also be valuable more generally in drug discovery applications.

The SARS-CoV-2 protein is 1273 amino acids long, of which positions 319–541 are known as the receptor binding domain (RBD) that is responsible for recognizing ACE2 (Huang et al., 2020). The excerpted RBD from the original (or reference) S protein sequence from January 2020 (NCBI identifier: NC_045512.2) is displayed in the first row of the top panel of Figure 1. Within the RBD, there are four key regions with flexible structure (known as loops) that have been experimentally observed to directly interact with ACE2 (Williams et al., 2021); these are marked by colored boxes and labelled as Loops 1–4 in Figure 1. As the pandemic has progressed, mutations in the viral genome have led to variation in the amino acid sequences recorded from new infections; sets of mutations that have been deemed to affect COVID-19 epidemiology are designated as variants of concern by the WHO (2021). These include the delta and omicron variants; to illustrate, their RBD amino acid sequences are aligned with that of the reference sequence in the top panel of Figure 1, where mutated positions are indicated by the lighter blue shades. The delta variant, which was predominant for most of late 2021, has only two amino acid mutations in the RBD (at positions 452 and 478) compared to reference. In contrast, the omicron variant that has become dominant as of December 2021 is characterized by a much larger number of mutations, with 15 in the RBD alone. Some mutations can affect protein structure and function, such that the immunity provided by previous antibodies may no longer be as effective against new variants of SARS-CoV-2 (Harvey et al., 2021).

Figure 1: The sequence-to-structure correspondence for the RBD of the SARS-CoV-2 spike protein. The top panel shows the amino acid sequences of the original (reference) protein together with the delta and omicron variants. Light or dark blue shades indicate positions where two or three sequences have the same amino acid type, respectively. The four loops which interact with ACE2 are marked by the colored boxes (red, orange, cyan, and green). The bottom panel shows the laboratory-determined 3-D structure corresponding to the RBD of the reference sequence (PDB code 6XM0). The four loops of interest in the structure are highlighted with the colors corresponding to the top panel. A gap can be seen in the structure within Loop 1 (red), where the laboratory experiment could not resolve the coordinates of the two amino acids in positions 445–446. The RBD also contains several -helices and -sheets; these regions have more regular structural pattern compared to loops and are colored with magenta and yellow respectively in the structure.

As a result of laboratory structure determination efforts, the Protein Data Bank (PDB, Bernstein et al., 1977) has grown to over 185000 entries as of 2021. Each PDB structure provides a static snapshot of a protein’s arrangement of atoms in 3-D coordinates, known as a conformation. Since early 2020, laboratories have contributed over 700 PDB entries relating to the SARS-CoV-2 S protein alone. To illustrate the sequence-to-structure correspondence, the bottom panel of Figure 1 shows a consensus 3-D structure of the reference RBD obtained from the PDB (accession code 6XM0), where the locations of Loops 1-4 in the structure are highlighted using the same colours as the top panel. While this figure only displays a single static snapshot of the protein, analyses of the available PDB structure data show that many loop regions of the S protein have significant conformational flexibility (Wong and Liu, 2022), including Loops 1, 3 and 4. This echos the findings of studies that have used molecular dynamics (MD) simulations to assess the structural flexibility of the RBD, where Loops 3 and 4 were identified to be among the most flexible regions (Dehury et al., 2020); Williams et al. (2021) also identified additional conformations for Loop 3 that were not represented in the PDB. Further, atomic coordinates that cannot be determined by laboratory experiment are missing from published PDB structures, most often occurring in regions with high structural flexibility (Shehu et al., 2006); in Figure 1, two such amino acids are missing from Loop 1. Computational approaches can thus complement experimental knowledge and help address some of its limitations; this can be especially useful when new mutations are discovered and PDB structures have not been determined (Chen et al., 2020).

The energy landscape theory Onuchic et al. (1997)

has provided a general approach taken by many computational methods for protein structure prediction, namely to find the most probable stable conformation with the minimum potential energy. That approach however is less useful for quantifying the range of structural variation for a loop that is highly flexible and potentially has multiple stable states. For this latter goal, we would ideally draw samples from the Boltzmann distribution

, where represents a conformation, is an energy function, and is the effective temperature. This type of conformational sampling problem was studied using sequential Monte Carlo by Zhang et al. (2007) in the context of simplified protein 3-D models. It becomes more challenging to design samplers whose draws follow for realistic 3-D protein structures. While subsequent studies have extended the sequential chain growth idea to tackle real proteins (Tang et al., 2014; Wong et al., 2018), those sampling methods focused on finding low-energy conformations for protein structure prediction, rather than drawing from the Boltzmann distribution. The value of more broadly sampling the conformational space has been recently recognized in the context of flexible loops as a necessary step to reconstruct their energy landscape (Barozet et al., 2020), although the exhaustive sampling method presented therein did not attempt to compute the likelihood of its samples.

These considerations motivate the two main contributions of this paper. First, we present a sequential Monte Carlo (SMC) method to draw samples from the protein’s 3-D conformational space that follow the Boltzmann distribution for a given energy function. This provides a principled statistical approach that is generally applicable for exploring the energy landscape of conformations for flexible protein loops. Second, as an application of timely interest we apply the proposed method on the key loops of the SARS-CoV-2 S protein RBD, using both the reference and omicron sequences as input. This provides insight into how the range of conformations for each loop may change as a result of the observed mutations, even as the omicron variant continues to be studied experimentally. We summarize the SMC output by applying hierarchical clustering on the samples obtained, and our results indicate that omicron RBD mutations may be most impactful on the possible conformations of Loop 4, while Loops 1 and 3 remain relatively stable before and after mutation.

2 Method

We present an SMC method for drawing samples from the Boltzmann distribution for a given energy function , and a protein loop represented as a continuous segment of amino acids. To achieve this, we design a new sampling scheme that leverages sequential importance resampling (Liu et al., 2001) and the protein sample space representation of Wong et al. (2018). We take as given that a template 3-D structure can be supplied as input for the method, which governs how the segment is expected to interact with the rest of the protein. This may be obtained from the PDB if available, such as the reference S protein structure partially shown in the bottom panel of Figure 1. Then we require the amino acid sequence and positions for the segment of interest, e.g., ‘SNNLDSKVGGNYN’ for Loop 1 of the reference RBD sequence in Figure 1 (positions 438–450). In general, suppose the segment is amino acids long, denoted by . Then the goal of SMC is to sample conformations beginning at the C atom of and ending at the C atom of in the template structure, so that proper backbone connectivity is maintained with the rest of the template which is held fixed. The method allows for mutations within the segment to be modeled as well, since no existing structure information is needed for the segment being sampled. Computational analyses of mutations can be quite useful in general since it is impractical to apply laboratory structure determination to study all possible mutations (Chen et al., 2020).

2.1 Overview of sampling setup and strategy

While atom locations are given as 3-D coordinates in the PDB, for sampling it is more convenient to work with the equivalent dihedral angles that represent the relevant geometric degrees of freedom in protein structures. For an individual amino acid, the dihedral angles

and determine the backbone and side chain atom coordinates, respectively;

is a vector of length 0 to 4, depending on the amino acid type. Specifying a complete length

segment requires , and . To satisfy the required backbone connectivity with the template, the actual free backbone dihedral angles for the segment are , , with the remaining six backbone dihedrals determined by analytic closure Coutsias et al. (2004). Our sequential sampling strategy is then to draw for one amino acid at a time, and evaluate its incremental energy contribution given the previously sampled amino acids. We sample in an order that alternates between adding one amino acid to the left and right ends of the segment: . This is more efficient than sampling in only one direction, since the two ends are more tightly constrained by the rest of the template (Wong et al., 2018). Concurrently with this backbone sampling, we also check possible configurations for the side chain atoms of the partial backbone conformation to ensure that they have feasible intermediate placements. After amino acids are constructed, we use the connectivity requirement to place the remaining backbone atoms. A finalized draw for the entire vector of side chain dihedrals () is then obtained conditional on the completed backbone.

SMC methods were originally developed for real-time analysis of dynamically evolving systems, so that the target distribution of interest at a given time-step is represented by a properly weighted sample (Liu and Chen, 1998). The framework of SMC is also useful more generally for sampling from complicated high-dimensional distributions, including simulation of biopolymers (Liu et al., 2001), by decomposing into a sequence of propagation and resampling steps (Doucet et al., 2001). In its application here, we use these steps to maintain a weighted sample of partial conformations (or particles) as each amino acid is added.

For convenience in what follows, let be the maximum number of particles in the SMC algorithm and denote , , . Then represents a complete conformation for the segment, and the sequential sampling approach we described is based on the factorization of the target distribution as

(2.1)

where are constructed using the incremental energies of partial conformations, as we subsequently describe.

2.2 Representation of sample space

The angular sample space for the backbone dihedrals is . In practice, is generally close to

due to a planar bond and its distribution may be approximated by a normal random variable with mean

and SD based on PDB structures (except for the Proline amino acid type, where the normal mean may be either or with probabilities 0.1 and 0.9 respectively), independent of other dihedral angles. Let denote this normal (or mixture of normals) density. The angles and

are much more flexible; their joint distribution is known as the Ramachandran plot and has been studied extensively

(Hooft et al., 1997). To provide sufficient resolution to accurately reproduce the backbones of PDB structures, their values may be discretized using a by grid (Wong et al., 2018). Thus, we may exhaustively evaluate a maximum of directions of growth for the pair . Specifically, we use the discretized grid and let denote the set of unique angle pairs on the grid.

The sample space for the side chain dihedrals is often given a discrete representation based on the most commonly observed positions in real structures, known as rotamers, and lists are tabulated for each of the 18 amino acid types whose side chains have at least one dihedral angle (Ponder and Richards, 1987). We follow this approach and use the rotamer lists provided in Shapovalov and Dunbrack Jr (2011), which range from three rotamers for Serine’s single dihedral angle (with positions , , and ) to a total of 75 combinations for Arginine’s four dihedral angles. Let denote the list of rotamers for the amino acid type corresponding to . In practice we may perturb the rotamers to increase the allowable range of side chain conformations; here we opt to add Normal noise with SD to the listed rotamer position of the first dihedral angle (Wong et al., 2018).

2.3 Energy evaluation

We take the energy function to be given, which has the role of evaluating the relative likelihood of sampled conformations. A typical will consider features including dihedral angles and atomic interactions, with the latter computed as a function of pairwise distances between atoms (Alford et al., 2017). For sampling, we may assign for conformations with unrealistic angles or distances (e.g., having atoms that clash with each other); within an SMC algorithm such conformations will then be given weights of zero and discarded. In the context of a segment, accounts for the conformation of the segment, along with atomic interactions between the segment and the rest of the template. Atomic clashes will occur very frequently if is naively sampled, which further motivates an SMC approach.

To carry out SMC we need to compute incremental energies, so it is convenient to rewrite (2.1) using an energy perspective (without loss of generality, is scaled so that ):

(2.2)

where we use to denote incremental contributions, e.g., of angles and atoms corresponding to given an existing placement of . More concretely, we take the form , where is the probability of the angle grid pair , refers to the positions of the four backbone atoms implied by , computes the energy of their atomic interactions with the template, and are coefficients that depend on the particular energy function chosen. Analogously for subsequent backbone contributions we have where now additionally depends on the interactions between and the previously sampled backbone. Note that for , the term will be to incorporate the energy contributions of the remaining backbone atoms for connectivity. Finally, we have the side chain energy , where is the joint density of the side chain dihedrals, refers to the positions of all the side chain atoms in the segment implied by , and are additional coefficients.

Details of the energy function used for our analyses are provided in the Supplement.

2.4 Techniques for computational efficiency

The main computational cost associated with energy evaluation is the calculation of pairwise distances, which is vital for detecting conformations with atomic clashes. Thus, a sensible strategy is to “foresee” which partial conformations will lead to unavoidable clashes, and remove them from consideration. This is most useful when applied to side chains, since they generally contain more atoms than the completed backbone but their energy contribution would otherwise be evaluated only at the very end; completed backbones with no possible side chain placements would result in wasted computational effort.

We thus implement a side chain pre-computation scheme as follows. When sampling , we also pre-compute for all rotamers in the list where is the current growth direction being evaluated; we then store the rotamers, up to a maximum of of them, to represent the distribution of (i.e., if , we sample with probabilities ). However, if all rotamers have , we set so that will be removed from consideration. In subsequent backbone sampling, suppose we have up to joint rotamer positions for representing the distribution of for a partial backbone , . Then when evaluating a growth direction for , we similarly pre-compute for all rotamers in and set if all . Otherwise, we store rotamers out of , again sampled if necessary . Based on the stored positions for and for , we can efficiently obtain candidates for , where the first two terms are already pre-computed. Finally, if all these we set ; otherwise we sample from these for use when sampling the backbone of the next amino acid.

The role of and is to achieve a balance between computational cost and maintaining a reasonably accurate representation of during intermediate steps. This has two practical benefits: (i) partial backbone conformations with no feasible side chain placements are removed early; (ii) the pre-computed joint rotamer positions reduce the time needed to finish side chain sampling at the end (only rotamers for will remain to be evaluated). While Wong et al. (2018) also proposed keeping a collection of intermediate , there are two important distinctions: (i) our are sampled according to the energy function; (ii) our stored energies of are only used to remove dead ends and not for the resampling steps, so that any bias introduced to the weights will be minimal.

2.5 SMC algorithm

Based on the preceding setup, the scheme of the SMC algorithm is laid out as follows:

  1. Initialization. We initialize the SMC algorithm by constructing the first set of particles representing . For each of the angle pairs of the grid, i.e., , , we draw an independent value of from , then set and compute . Then the feasible directions of growth for the first amino acid are represented by the collection of where ; suppose there are of these. We use this collection to define the particles for after normalizing the weights, i.e., . The set of rotamers and energies for are also stored (Section 2.4). In practice we will choose , so resampling is not needed and we proceed directly to the propagation step for .

  2. Propagation of backbone. Suppose we have a set of particles that represent partial backbone conformations, where . Taking , we add the -th amino acid by appending for each angle pair of the grid, together with an independent draw from and computing . Suppose there are a total of feasible growths with , out of all the possible directions starting from existing particles (i.e., ). We use this collection to define an expanded set of intermediate particles representing , i.e., where each consists of a unique pair with and the intermediate weights are normalized so that . The set of joint rotamer positions and energies for are also stored (Section 2.4).

  3. Resampling. If there is no need to resample; we simply set and assign the intermediate particles and weights to be . Our scheme rapidly increases the number of intermediate particles after each propagation step, so generally a resampling step is needed to reduce the particle population to size . For this purpose we adopt the optimal sampling of Fearnhead and Clifford (2003) to select particles for further propagation. Briefly, this proceeds as follows in our context: (i) solve the equation to obtain the value ; (ii) divide the intermediate particles into two groups, where group 1 contains the particles that have and group 2 contains the remaining ; (iii) resample particles from group 2 with probabilities proportional to using stratified resampling; (iv) set and take to be all the group 1 particles (keeping their intermediate weights ) and the resampled particles from group 2 (with weights reassigned to be ). The propagation and resampling steps are iterated until we obtain the set of completed backbones for the segment, i.e., after the connectivity constraint is applied to .

  4. Finalizing the side chains. After backbone sampling, we have and joint rotamer positions representing the distributions of , for each . To complete the side chains for each backbone, we draw one vector , and then successively sample the three remaining side chain rotamers according to , , as long as there exists rotamers with . This simple strategy for sampling works for most particles to obtain a final conformation without atomic clashes; those that cannot be completed in this way (empirically 10-25%) are assigned a final weight of zero.

3 Application: SARS-CoV-2 omicron variant

We focus our analyses on Loops 1, 3, and 4, corresponding to the segments 438–450, 471–491, and 495–508 as indicated in Figure 1. Compared to the reference sequence, the omicron variant has 2, 3, and 4 mutations in these loops, respectively (no mutations were observed in Loop 2). The full-length S protein from the PDB structure 6XM0 was used as the template. For the main results, we ran the proposed SMC method on the reference and omicron sequences for these three loops, using particles, and for side chain pre-computation. We also discuss and compare with existing loop sampling and prediction methods to provide additional context for our results.

3.1 Main results

We use hierarchical clustering as a simple way to summarize the SMC samples, so that the main distinct conformations possible for each loop can be represented by the clusters identified. Complete linkage is chosen to encourage the identification of compact clusters (p. 62, Everitt et al., 2011)

. To compute the distance matrix between sampled conformations, we adopt the most commonly-used metric for comparing protein structures, namely the root-mean-square deviation (RMSD) of the backbone atoms. For each loop, we perform this cluster analysis on the combined SMC samples from the reference and omicron sequences. Then insights into a key question of interest, namely the effects of mutation on the energy landscape of the loop, may be gleaned in terms of the conformational clusters. Specifically, if a cluster is mainly composed of samples from either reference or omicron, this suggests the conformational space represented by that cluster is more favorable for one amino acid sequence than the other. Across the different clusters, we can then see evidence of whether the range of conformations for the loop has changed as a result of mutation.

The results of this cluster analysis for Loops 1, 3, and 4 are displayed in Table 1, after cutting each dendrogram into 10 clusters. The reference and omicron composition of each cluster is presented in terms of the number of SMC samples (“” columns) and total particle weight (“Weight” columns); e.g., cluster 1 for Loop 3 contains 33096 SMC samples, 17546 from reference (with a normalized total weight of 0.425 among reference samples) and 15550 from omicron (with a corresponding total weight of 0.398). For Loops 1 and 3, all 10 clusters have compositions that are distributed relatively evenly between reference and omicron; this suggests that according to the energy function there are no dramatic changes to the accessible conformational space of these two loops as a result of the omicron mutations. In contrast, the results for Loop 4 have more pronounced empirical differences, with the compositions of three clusters in particular being sharply divided between the reference and omicron sequences: omicron is predominant in cluster 1, while reference is predominant in clusters 6 and 10. Both the number and weight of samples in these clusters suggest a similar conclusion: as a result of the omicron mutations, the energy landscape of Loop 4 shifts to favor the conformation represented by cluster 1, and disfavor those represented by clusters 6 and 10.

Loop 1 Loop 3 Loop 4
Reference Omicron Reference Omicron Reference Omicron
Cluster Weight Weight Weight Weight Weight Weight
1 5818 0.151 9328 0.331 17546 0.426 15550 0.398 3428 0.069 13923 0.345
2 4915 0.124 6453 0.187 10238 0.267 14660 0.347 8651 0.271 3923 0.097
3 8293 0.198 8949 0.114 5241 0.158 5664 0.112 3870 0.107 6006 0.144
4 8242 0.192 5036 0.096 2762 0.075 3283 0.070 2200 0.057 4289 0.131
5 5485 0.119 6074 0.106 1887 0.033 1285 0.027 1812 0.068 4869 0.108
6 4868 0.121 2675 0.053 1426 0.029 1349 0.030 6043 0.161 481 0.008
7 3470 0.049 3370 0.066 310 0.004 299 0.004 4538 0.122 1287 0.042
8 1996 0.036 1456 0.037 337 0.005 268 0.003 939 0.022 4270 0.093
9 480 0.005 601 0.005 190 0.002 323 0.005 1839 0.063 854 0.028
10 526 0.006 325 0.004 151 0.002 164 0.002 1805 0.060 344 0.003
Table 1: Results of cluster analysis comparing the reference and omicron conformations sampled by SMC for Loops 1, 3, and 4 of the SARS-CoV-2 S protein. Hierarchical clustering was applied after combining the SMC samples from reference and omicron for each loop. The reference and omicron composition of each resulting cluster is presented in terms of the number of SMC samples (“” columns) and total particle weight (“Weight” columns, normalized so that each column of weights sums to 1).

To visualize these conformational clusters, we take the medoid as the representative for each cluster (as the centroid is not guaranteed to satisfy protein geometry). The medoids of the 10 clusters for Loop 4 are plotted in Figure 2, using different colours to illustrate their conformations; e.g., the clusters discussed above are shown in red (cluster 1), dark blue and light blue (clusters 6 and 10). Similar plots for Loops 1 and 3 are provided in the Supplement. With a choice of 10 clusters, the conformations represented are substantively distinct: for Loop 4, the smallest RMSD between any pair of medoids is 1.9 Å, which is around the 2 Å cutoff that has been previously suggested as a threshold for structural similarity (Krissinel, 2007). The size of the conformational space, and thus the separation between cluster medoids, are generally expected to grow with loop length. Loop 1, which at 13 amino acids long is one shorter than Loop 4, has a somewhat larger 2.3 Å between its two closest medoids. An intuitive explanation is that the right end of Loop 4 (near position 508) appears to be more constrained by atomic interactions with the rest of the S protein, such that the 10 clusters have little structural variability there (see Figure 2). Loop 3 is the longest at 21 amino acids; its conformational space is very large and its 10 clusters are all separated by at least 3.8 Å.

Figure 2: Sampled conformations for Loop 4 based on the reference sequence and the omicron variant. The 3-D structures representing the medoids of the 10 clusters of conformations identified by hierarchical clustering are visualized with different colours. Clusters 1 to 10 corresponding to those listed for Loop 4 in Table 1 are coloured as follows: red, orange, yellow, green, cyan, dark blue, indigo, light violet, grey, and light blue. The starting and ending positions of the loop are labelled (495 and 508). Substantive structural differences can be seen among the clusters. Cluster 1 is predominantly composed of samples from the omicron variant, while clusters 6 and 10 are predominantly composed of samples from the reference sequence.

We expect the differences noted between reference and omicron for Loop 4 in Table 1 are larger than can be explained by Monte Carlo variation alone. To informally quantify this, we ran the SMC algorithm twice on the Loop 4 reference sequence and performed the same hierarchical clustering procedure after combining the outputs of the two repetitions. Table 2 shows the cluster compositions by repetition, in a similar format to Table 1

. In relative terms, the number of samples is more stable across the repetitions compared to the total weights. The amount of Monte Carlo variation seen here is comparable to the empirical differences between reference and omicron for Loops 1 and 3, but substantially less than those seen for Loop 4 (clusters 1, 6, 10). A more formal investigation into the variance of SMC in this context would be an interesting direction of future research.

Repetition 1 Repetition 2
Cluster Weight Weight
1 8568 0.253 8970 0.215
2 3656 0.116 5536 0.119
3 2423 0.065 2430 0.148
4 3610 0.118 3719 0.086
5 4065 0.127 3100 0.067
6 3457 0.074 3894 0.081
7 2885 0.060 3334 0.088
8 2459 0.058 2869 0.087
9 1795 0.066 1765 0.056
10 2207 0.062 2149 0.053
Table 2: Quantifying the Monte Carlo variation across repetitions of the SMC algorithm on the Loop 4 reference sequence. Hierarchical clustering was applied on the combined outputs of two SMC repetitions. The composition of each resulting cluster by repetition is presented in terms of the number of SMC samples (“” columns) and total particle weight (“Weight” columns, normalized so that each column of weights sums to 1).

3.2 Comparisons with other methods and approaches

To provide additional context for our results, we consider some established methods for sampling and prediction of loops: DiSGro (Tang et al., 2014), next-generation KIC (Stein and Kortemme, 2013), and SMC-based stochastic optimization (Wong et al., 2018). We denote the latter two as NGK and SMCopt for short. Each of these methods is designed to sample the conformational space of a loop and select the lowest-energy conformation found as the prediction. While this approach may be less useful in the context of loops that are flexible enough to adopt multiple stable conformations (see Introduction), we can nonetheless use some simple metrics to assess the samples obtained by these methods in comparison to ours. Software for DiSGro and SMCopt can output a user-specified number of samples in a single run. For NGK (available in the Rosetta macromolecular modelling suite https://rosettacommons.org/) each run is designed to output a single conformation as a prediction; since each run is stochastic, a larger sample can be obtained by repeated runs of the software. Each of these methods outputs conformations together with their computed energy values (according to the method’s energy function), although the samples have not been designed to follow the Boltzmann distribution.

The PDB template contains the ground truth of one known conformation for each of Loops 3 and 4, in the case of the reference sequence. Samples that resemble this conformation (e.g., as measured by RMSD) should be present among those generated by an effective sampling method. This provides a first metric of comparison: the minimum RMSD to the template conformation among the samples generated by each method. When the goal is to predict a single ‘correct’ conformation for the loop, the usual approach is to choose the lowest-energy sample as the prediction. In the case of highly flexible loops, it would not be surprising if the lowest-energy and template conformations have significant disagreement; this phenomenon has been previously observed in studies of loops with multiple stable conformations (Marks et al., 2018). We use this as a second illustrative metric: the RMSD to the template conformation of the lowest-energy sample from each method.

The results for these metrics on Loops 3 and 4 are shown in Table 3; to use each method to sample as exhaustively as possible, DiSGro was run to output 100,000 conformations, SMCopt was run with 50000 particles (to match our particles), and NGK was run 500 times. DiSGro could handle a maximum length of 20 amino acids, so we reduced its Loop 4 task by one amino acid, to 471–490. The minimum RMSD metrics confirm that the samples from our SMC method can cover the conformational space near the template comparably to methods designed for loop prediction. Overall, we expect the minimum RMSDs to depend on the size of the conformational space that increases with loop length; here the methods obtain 1-2 Å for Loop 3 and 3-4 Å for Loop 4. In contrast, the RMSDs of the lowest-energy sample to the template tend to have much larger values (6–12 Å, with SMCopt’s Loop 3 being an exception), confirming our intuition that these flexible loops may be better studied beyond a traditional loop prediction framework.

Minimum RMSD to template RMSD of lowest-energy sample
Loop Positions SMC DiSGro NGK SMCopt SMC DiSGro NGK SMCopt
3 438–450 2.07 1.81 1.71 1.16 7.96 10.80 6.42 1.92
4 471–491 3.37 3.49 2.51 4.20 11.59 12.18 10.25 12.84
Table 3: Sampled conformations for Loops 3 and 4 using SMC, DiSGro, NGK, and SMCopt. Two metrics are calculated for each method: the minimum RMSD to the template conformation among all its samples; the RMSD to the template conformation of its lowest-energy sample. DiSGro results for Loop 4 are approximated by sampling 471–490 due to the limitations of the method.

4 Discussion and conclusion

This paper presented a new application of SMC for sampling the conformational space of loops in protein structures, according to the Boltzmann distribution for a given energy function. Compared to existing methods for loop sampling and prediction that focus on finding the lowest-energy conformation, the emphasis of this work is to broadly explore the energy landscape of possible conformations using principled statistical methodology. Our approach is especially pertinent for studying loops that can adopt a range of conformations rather than a single stable one. Evidence for such highly flexible loops has been seen in both laboratory experiments that generate PDB structure data, and computational approaches such as MD simulations. As a case study of timely interest, we applied the proposed SMC method to key regions in the RBD of the SARS-CoV-2 S protein that are known to be flexible and have a number of mutations in the omicron variant discovered November 2021.

Our results indicate that the energy landscapes of Loop 4 for the reference and omicron variant cover a different range of conformations, when compared via a cluster analysis of the SMC samples; in contrast Loops 1 and 3 appear to be more stable before and after mutation. Other preliminary studies on the omicron variant have sought to identify the key mutations that may lead to a new or stronger binding interface with ACE2: among the 15 omicron RBD mutations, those identified were 493, 496, 501, 498 (Lubin et al., 2021) and 477, 496, 498, 501 (Wang et al., 2022). The mutations at 496, 498, and 501 were deemed significant by both studies; all three of these positions are within Loop 4 and agree with our findings on its highest propensity for conformational change. Overall, SMC can be a powerful tool and used in conjunction with other computational analyses of protein structure (e.g., MD simulations, comparative modelling, binding free energy calculations) to obtain new insights.

As some directions for future research, we could investigate more formal ways to use the SMC samples to test for conformational differences between sequence variants, and to quantify the Monte Carlo variance of the SMC method in this context. It would also be of practical interest to apply the method to study more extensive datasets, such as protein loops that are known to be highly flexible (Marks et al., 2018), PDB structures with missing loops that cannot be determined by laboratory experiment (Shehu et al., 2006), and any new SARS-CoV-2 variants that may continue to emerge.

Acknowledgements

This work was partially supported by Discovery Grant RGPIN-2019-04771 from the Natural Sciences and Engineering Research Council of Canada.

References

  • Alford et al. (2017) Alford, R. F., Leaver-Fay, A., Jeliazkov, J. R., O’Meara, M. J., DiMaio, F. P., Park, H., Shapovalov, M. V., Renfrew, P. D., Mulligan, V. K., Kappel, K., et al. (2017). The rosetta all-atom energy function for macromolecular modeling and design. Journal of chemical theory and computation 13, 3031–3048.
  • Barozet et al. (2020) Barozet, A., Molloy, K., Vaisset, M., Siméon, T., and Cortés, J. (2020).

    A reinforcement-learning-based approach to enhance exhaustive protein loop sampling.

    Bioinformatics 36, 1099–1106.
  • Bernstein et al. (1977) Bernstein, F. C., Koetzle, T. F., Williams, G. J., Meyer, E. F., Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T., and Tasumi, M. (1977). The protein data bank. European Journal of Biochemistry 80, 319–324.
  • Chen et al. (2020) Chen, J., Wang, R., Wang, M., and Wei, G.-W. (2020). Mutations strengthened SARS-CoV-2 infectivity. Journal of molecular biology 432, 5212–5226.
  • Coutsias et al. (2004) Coutsias, E. A., Seok, C., Jacobson, M. P., and Dill, K. A. (2004). A kinematic view of loop closure. Journal of computational chemistry 25, 510–528.
  • Dehury et al. (2020) Dehury, B., Raina, V., Misra, N., and Suar, M. (2020). Effect of mutation on structure, function and dynamics of receptor binding domain of human SARS-CoV-2 with host cell receptor ACE2: a molecular dynamics simulations study. Journal of Biomolecular Structure and Dynamics 39, 7231–7245.
  • Doucet et al. (2001) Doucet, A., De Freitas, N., and Gordon, N. (2001). An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo methods in practice, pages 3–14. Springer.
  • Everitt et al. (2011) Everitt, B., Landau, S., Leese, M., and Stahl, D. (2011). Cluster Analysis. Wiley Series in Probability and Statistics. Wiley.
  • Fearnhead and Clifford (2003) Fearnhead, P. and Clifford, P. (2003).

    On-line inference for hidden Markov models via particle filters.

    Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65, 887–899.
  • Gordon et al. (2020) Gordon, D. E., Jang, G. M., Bouhaddou, M., Xu, J., Obernier, K., White, K. M., O’Meara, M. J., Rezelj, V. V., Guo, J. Z., Swaney, D. L., et al. (2020). A SARS-CoV-2 protein interaction map reveals targets for drug repurposing. Nature 583, 459–468.
  • Harvey et al. (2021) Harvey, W. T., Carabelli, A. M., Jackson, B., Gupta, R. K., Thomson, E. C., Harrison, E. M., Ludden, C., Reeve, R., Rambaut, A., Peacock, S. J., et al. (2021). SARS-CoV-2 variants, spike mutations and immune escape. Nature Reviews Microbiology 19, 409–424.
  • Henzler-Wildman and Kern (2007) Henzler-Wildman, K. and Kern, D. (2007). Dynamic personalities of proteins. Nature 450, 964–972.
  • Hooft et al. (1997) Hooft, R. W., Sander, C., and Vriend, G. (1997). Objectively judging the quality of a protein structure from a Ramachandran plot. Bioinformatics 13, 425–430.
  • Huang et al. (2020) Huang, Y., Yang, C., Xu, X.-F., Xu, W., and Liu, S.-W. (2020). Structural and functional properties of SARS-CoV-2 spike protein: potential antivirus drug development for COVID-19. Acta Pharmacologica Sinica 41, 1141–1149.
  • Ju et al. (2020) Ju, B., Zhang, Q., Ge, J., Wang, R., Sun, J., Ge, X., Yu, J., Shan, S., Zhou, B., Song, S., et al. (2020). Human neutralizing antibodies elicited by SARS-CoV-2 infection. Nature 584, 115–119.
  • Krissinel (2007) Krissinel, E. (2007). On the relationship between sequence and structure similarities in proteomics. Bioinformatics 23, 717–723.
  • Liu and Chen (1998) Liu, J. S. and Chen, R. (1998). Sequential Monte Carlo methods for dynamic systems. Journal of the American statistical association 93, 1032–1044.
  • Liu et al. (2001) Liu, J. S., Chen, R., and Logvinenko, T. (2001). A theoretical framework for sequential importance sampling with resampling. In Sequential Monte Carlo methods in practice, pages 225–246. Springer.
  • Lubin et al. (2021) Lubin, J. H., Markosian, C., Balamurugan, D., Pasqualini, R., Arap, W., Burley, S. K., and Khare, S. D. (2021). Structural models of SARS-CoV-2 Omicron variant in complex with ACE2 receptor or antibodies suggest altered binding interfaces. bioRxiv .
  • Marks et al. (2018) Marks, C., Shi, J., and Deane, C. M. (2018). Predicting loop conformational ensembles. Bioinformatics 34, 949–956.
  • Marovich et al. (2020) Marovich, M., Mascola, J. R., and Cohen, M. S. (2020). Monoclonal antibodies for prevention and treatment of COVID-19. JAMA 324, 131–132.
  • Onuchic et al. (1997) Onuchic, J. N., Luthey-Schulten, Z., and Wolynes, P. G. (1997). Theory of protein folding: the energy landscape perspective. Annual review of physical chemistry 48, 545–600.
  • Ponder and Richards (1987) Ponder, J. W. and Richards, F. M. (1987). Tertiary templates for proteins: use of packing criteria in the enumeration of allowed sequences for different structural classes. Journal of molecular biology 193, 775–791.
  • Shapovalov and Dunbrack Jr (2011) Shapovalov, M. V. and Dunbrack Jr, R. L. (2011).

    A smoothed backbone-dependent rotamer library for proteins derived from adaptive kernel density estimates and regressions.

    Structure 19, 844–858.
  • Shehu et al. (2006) Shehu, A., Clementi, C., and Kavraki, L. E. (2006). Modeling protein conformational ensembles: from missing loops to equilibrium fluctuations. Proteins: Structure, Function, and Bioinformatics 65, 164–179.
  • Stein and Kortemme (2013) Stein, A. and Kortemme, T. (2013). Improvements to robotics-inspired conformational sampling in rosetta. PloS one 8, e63090.
  • Tang et al. (2014) Tang, K., Zhang, J., and Liang, J. (2014). Fast protein loop sampling and structure prediction using distance-guided sequential chain-growth Monte Carlo method. PLoS computational biology 10, e1003539.
  • Walls et al. (2020) Walls, A. C., Park, Y.-J., Tortorici, M. A., Wall, A., McGuire, A. T., and Veesler, D. (2020). Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein. Cell 181, 281–292.
  • Wang et al. (2022) Wang, T., Ge, J., Zhang, L., Lan, J., He, X., Ren, Y., Wang, Z., Zhou, H., Fan, S., Zhu, C., et al. (2022). Structural and computational insights into the SARS-CoV-2 Omicron RBD-ACE2 interaction. BioRxiv .
  • Wang et al. (2021) Wang, Z., Schmidt, F., Weisblum, Y., Muecksch, F., Barnes, C. O., Finkin, S., Schaefer-Babajew, D., Cipolla, M., Gaebler, C., Lieberman, J. A., et al. (2021). mRNA vaccine-elicited antibodies to SARS-CoV-2 and circulating variants. Nature 592, 616–622.
  • WHO (2021) WHO (2021). Tracking SARS-CoV-2 variants. https://www.who.int/en/activities/tracking-SARS-CoV-2-variants/. Accessed: 2022-01-03.
  • Williams et al. (2021) Williams, J. K., Wang, B., Sam, A., Hoop, C. L., Case, D. A., and Baum, J. (2021). Molecular Dynamics Analysis of a Flexible Loop at the Binding Interface of the SARS-CoV-2 Spike Protein Receptor-Binding Domain. Proteins: Structure, Function, and Bioinformatics .
  • Wong et al. (2018) Wong, S. W., Liu, J. S., and Kou, S. (2018). Exploring the conformational space for protein folding with sequential Monte Carlo. The Annals of Applied Statistics 12, 1628–1654.
  • Wong and Liu (2022) Wong, S. W. and Liu, Z. (2022). Conformational variability of loops in the SARS-CoV-2 spike protein. Proteins: Structure, Function, and Bioinformatics 90, 691–703.
  • Zhang et al. (2007) Zhang, J., Lin, M., Chen, R., Liang, J., and Liu, J. S. (2007). Monte carlo sampling of near-native structures of proteins with applications. PROTEINS: Structure, Function, and Bioinformatics 66, 61–68.
  • Zhu et al. (2020) Zhu, N., Zhang, D., Wang, W., Li, X., Yang, B., Song, J., Zhao, X., Huang, B., Shi, W., Lu, R., et al. (2020). A novel coronavirus from patients with pneumonia in China, 2019. New England Journal of Medicine 382, 727–733.

Appendix A Details of energy function

The energy function used in this paper combines aspects of those previously developed for protein loop sampling and prediction.

For the backbone dihedral density , we took the empirical probabilities computed by Wong et al. (2018) on the same by grid, for each of the 20 amino acid types based on a database of loop structures. The angle pairs on the grid with very low empirical probability () were treated to be zero. We used a Normal density for with mean and SD , except for Proline which was taken to be the mixture of Normals as described in the main text.

For the function , the energies of pairwise atomic interactions were taken to be the values derived in the DiSGro study (Tang et al., 2014) for scoring loop conformations, with two modifications. First, we set the energy of pairwise atomic clashes (i.e., when the Van der Waals forces of a pair exceeds 10.0 kcal/mol according to the Lennard-Jones 12-6 model, as in Wong et al. (2018)) to be . This ensures the final conformation will be free of atomic clashes, compared to the more relaxed maximum pairwise energy of 8.0 used originally in DiSGro. Second, as in Wong et al. (2018) we used a distance check between the backbone atoms of the current amino acid and the C on the other end of the segment and set if it is not possible to complete the segment with realistic bond lengths and angles.

We took the side chain dihedral density , that is, uniform over all allowable rotamer positions. With this choice, the energy contribution of side chain atoms is entirely based on their atomic interactions as computed via .

Finally, values were needed for the coefficients that weight the different energy components. Based on the empirical results in Wong et al. (2018), a coefficient ratio of 10 to 1 between the dihedral and DiSGro energy components is a good choice. We then used the loops of length 8 and 12 given in Table 4 of Wong et al. (2018) as a tuning set to select the overall energy scaling factor (i.e., equivalent to the effective temperature), taking the minimum RMSD to the template among the SMC samples as our metric. Based on these experiments, we chose and (noting that is inconsequential here, since the side chain dihedrals were taken to be uniform).

Appendix B Plots of conformational clusters for Loops 1 and 3

Figure 2 in the main text plots the medoids of the 10 clusters of conformations for Loop 4 of the SARS-CoV-2 spike protein, based on the combined SMC samples from the reference and omicron variant. We display analogous plots for Loops 1 and 3 in Web Figures S1 and S2. In contrast to Loop 4, the compositions of the clusters identified for Loops 1 and 3 are more evenly distributed between reference and omicron samples, which suggests the energy landscape is fairly stable before and after mutation. The cluster medoids have substantively distinct structures. The smallest RMSD between any pair of medoids is 2.3 Å for Loop 1; meanwhile Loop 3, which has a much larger conformational space due to its length (21 amino acids), has all its medoids separated by at least 3.8 Å RMSD.

Figure S1: Sampled conformations for Loop 1 based on the reference sequence and the omicron variant. The 3-D structures representing the medoids of the 10 clusters of conformations identified by hierarchical clustering are visualized with different colours. Clusters 1 to 10 corresponding to those listed for Loop 1 in Table 1 of the main text are coloured as follows: red, orange, yellow, green, cyan, dark blue, indigo, light violet, grey, and light blue. The starting and ending positions of the loop are labelled (438 and 450). Clear structural differences can be seen among the clusters. The cluster compositions are distributed relatively evenly between reference and omicron samples.
Figure S2: Sampled conformations for Loop 3 based on the reference sequence and the omicron variant. The 3-D structures representing the medoids of the 10 clusters of conformations identified by hierarchical clustering are visualized with different colours. Clusters 1 to 10 corresponding to those listed for Loop 3 in Table 1 of the main text are coloured as follows: red, orange, yellow, green, cyan, dark blue, indigo, light violet, grey, and light blue. The starting and ending positions of the loop are labelled (471 and 491). Clear structural differences can be seen among the clusters. The cluster compositions are distributed relatively evenly between reference and omicron samples.

Appendix C Running the software

A software package for Linux systems that implements the SMC loop sampling algorithm presented in this paper can be downloaded from https://swong.ca/downloads/loopsmc.tar.gz. The processed data from the Protein Data Bank to supply as reference and omicron structure templates for sampling Loops 1, 3, and 4 are also provided with the package. For detailed usage instructions, see the README file inside the package.

The other loop sampling and prediction methods discussed in Section 3.2 of the main text are also freely available: