Modern science increasingly leverages machine learning on large datasets in the sciences, from electronic structurewu2018moleculenet to whole genome sequences kavvas2018machine to distributed ocean sensor measurements FroylandPadberg_Chaos15
. Many of these datasets capture the dynamics of a system evolving in time, encoding trends with predictive power. Analyzing these datasets using a statistically robust and interpretable framework is a longstanding challenge that often involves clustering, or the unsupervised learning of coherent groups within the dataset.
Each of the aforementioned methods exhibits drawbacks with respect to a priori assumptions and algorithmic limitations. For example, partition-based clustering such as -means requires the modeler to prescribe the number of partitions in a dataset before constructing the model. If multiple results are obtained from different values of , these results are not interrelated; similarly, the model cannot be used to determine relationships between the clusters of a single model. While connectivity-based methods feature interrelated clusters, these also require the determination of where to cut the corresponding dendrogram to obtain the clustering result. Although density-based methods do not require a priori or a posteriori determination of the number of clusters to use, these methods are generally not robust to datasets containing a range of cluster densities Ali_ICIET10 .
Here, we present a new method, simultaneous coherent structure coloring (sCSC), which minimizes the assumptions required in an unsupervised clustering task. sCSC focuses solely on the efficient separation of the most dissimilar states in the system, resulting in a quantitative structure that automatically captures the clusters in the dataset and their interrelationships without a priori knowledge of the system. The method is demonstrated for simulated and empirical systems of fluid and molecular dynamics, and its straightforward extension to other types of data is discussed.
The use of clustering for data analysis is ubiquitous. However, our motivation emerged from research on the identification of coherent structures from fluid dynamics. A variety of mathematical frameworks have been developed to identify coherent structures. The broad class of Lagrangian methods has been developed to describe flows that are unsteady (i.e., not well-summarized by instantaneous snapshots) in a way that is not dependent on their frame of reference (i.e., may not contain velocity or acceleration data). Two recent reviews of Lagrangian methods for the detection of coherent structures in fluids can be found in Refs. AllshousePeacock_Chaos15, and Hadjighasemetal_Chaos17, .
Existing algorithms for coherent structure analysis that involve clustering exhibit various limitations. The fuzzy -means approach presented in Ref. FroylandPadberg_Chaos15, , for example, introduces a dynamic distance between particle trajectories, but ultimately requires the choice of
, i.e. how many clusters to use. To avoid explicitly choosing the number of coherent structures, a spectral clustering method was introduced in Ref.Hadjighasemetal_PRE16, , which utilizes the spectral gap in the graph Laplacian to determine the number of coherent structures. However, it was subsequently shown in Ref. SchlueterKuck_JFM17, that such a gap is only robust when the number of trajectories used exceeds .
The method of coherent structure coloring (CSC), introduced in Refs. SchlueterKuck_JFM17, and SchlueterKuck_Chaos17, , was designed to address these and other limitations of clustering algorithms for coherent structure determination based on trajectories of particle spatial coordinates. In this work, we extend CSC in the context of its own limitations, as described below.
While we restrict the focus of the rest of the paper to clustering, this is not the only way to identify coherent structures from frame-independent particle trajectories. Over the past two decades, both the fluid and molecular dynamics communities have developed methods to identify “almost-invariant” sets through data-driven approximations to the Perron-Frobenius operator and its adjoint, the Koopman operator. The former, also referred to as the transfer or transition operator, propagates probability densities forward in time, whereas the latter propagates observablesklus2016numerical .
used finite approximations to the Perron-Frobenius eigenfunctions to divide the space occupied by a dynamical system into almost-invariant sets and almost-cycles. At the same time, Schütteet al. schutte1999direct and Deuflhard et al. Deuflhard_LAA00 introduced the use of approximations to the same operator, under a reversibility constraint, to determine the “metastable” states of molecular systems simulated on the atomic level.
A decade later, researchers in their respective fields independently determined equivalent algorithms for optimizing the estimation of the approximated eigenfunctions of the Perron-Frobenius and Koopman operatorsNoe_MMS13 ; williams2015data ; wu2017variational
. In both cases, linear models are generated using a data-driven, objective protocol to model highly nonlinear dynamics, where the eigenvectors and eigenvalues can be used to identify coherent sets.
In fact, in our final example in the current study, we utilize both types of methods on a molecular dynamics simulation dataset. We first create an optimized model that approximates the Perron-Frobenius operator Noe_MMS13 , which is difficult to visualize due to its high dimensionality. Then, to reduce our model to a visualizable and interpretable coarse-grained model, we use the clustering method presented in this work to identify the major coherent structures.
Simultaneous coherent structure coloring (sCSC)
Coherent structure coloring theory
Many datasets we wish to explore in the physical sciences are generated by complex dynamical systems that exhibit instabilities and chaos. A key consequence of these processes is that states of the system (e.g. fluid particle trajectories or protein conformations) that are proximal but belonging to different coherent sets will separate exponentially faster as the system evolves than states belonging to the same cluster Guckenheimer_Book83 ; HallerYuan_PhysicaD00 .
On this basis, we previously hypothesized that these complex datasets can be clustered more robustly and effectively by amplifying state differences rather than state similarity SchlueterKuck_JFM17 ; SchlueterKuck_Chaos17 . The rationale for this approach is that the exponential separation of dissimilar states can provide more sensitive detection of clusters than a focus on state similarity, the latter requiring longer observation to become apparent Guckenheimer_Book83 ; HallerYuan_PhysicaD00 . In other words, we aim to identify coherent clusters indirectly, by prioritizing the separation of states with greatest dissimilarity and confidently ruling out the possibility of their membership in the same cluster. Those states that remain together after the separation process will subsequently emerge as belonging to the same cluster.
To amplify the dissimilarity between states, we solve an optimization problem to maximize a figure of merit that quantifies total state dissimilarity in the dataset. Specifically, this figure of merit depends on a scalar value assigned to each state in the system, where the squared difference in the scalar value assigned to each of pair states (e.g. for states 1 and 2) is weighted by a measure of their dissimilarity. Formally, the clustering parameter is given by
where the summations of and are each taken over the full set of states to be clustered, and is an element of the adjacency matrix containing the pairwise dissimilarity between states and . The construction of this matrix requires the calculation of
adjacency values. Example definitions of the (symmetric) pairwise dissimilarity can include the standard deviation for comparison of time-dependent signals, or the Jensen-Shannon divergence for comparison of probability distributionsLin_IEEE91 ; Husic_Draft17 . Both definitions represent measures of dissimilarity, where identical data points receive . Thus, assuming all data points are unique, the matrix will be dense.
Given the adjacency matrix , we seek to find state assignments that will maximize , subject to the constraint that the magnitude of the vector containing the scalar values must remain finite (e.g. to avoid the trivial case that maximizes for and ). It is straightforward to show that the constrained optimization of equation 1 with finite can be written as the generalized eigenvalue problem Hall_MS70 :
where is a diagonal matrix with entries equal to the row-sums of the adjacency matrix, i.e. for each row , and is the graph Laplacian. This maximization is expressed using the Lagrangian form; see SchlueterKuck_JFM17 for more details.
Each of the eigenvectors of equation 2 represents a solution that assigns to each state a scalar value based on its dissimilarity to the other states in the system. Those states with scalar assignments in each that are most dissimilar can be presumed to belong to different clusters of the data when the data is partitioned according to that particular solution of equation 1. The eigenvector associated with the maximum eigenvalue contains the scalar assignments that maximize the figure of merit . This can be considered the single most effective partitioning of the dataset.
Given the analogy between this approach and the problem of fuzzy graph coloring Munoz_Omega05 , wherein the connected nodes of a graph with large weights are assigned the most disparate values, we call this method coherent structure coloring (CSC) SchlueterKuck_JFM17 . The technique has recently been demonstrated to successfully identify coherent eddies and jets associated with individual fluid particle trajectories in model geophysical flows SchlueterKuck_Chaos17 .
Simultaneous inclusion of multiple CSC solutions
. Hence, although multiple dimensions of dissimilarity are almost always present in real data, the method cannot simultaneously distinguish between multiple types of dissimilarity in a dataset. Moreover, the method applied to individual fluid particle trajectories in a subsequent study required a subjectively defined threshold to calculate eigenspace distancesSchlueterKuck_Chaos17 , and it was shown to produce degenerate results for fluid particles in chaotic regions of the flow (cf. Fig 7 in Ref. SchlueterKuck_Chaos17, ).
Importantly, because the adjacency matrix introduced in the previous section is real and symmetric, the remaining eigenvectors associated with lesser eigenvalues provide additional, linearly independent (i.e. orthogonal) solutions for partitioning the data, albeit less effectively Press_Book83 . The key innovation of the present work is to use all of the eigenvectors in a top-down fashion to simultaneously cluster the system states.
To perform sCSC, we begin with the most effective partition given by the eigenvector associated with the maximum eigenvalue, and proceed through the set of orthogonal eigenvectors in order of decreasing eigenvalue. This approach simultaneously reveals the coherent sets of the system, and eliminates the subjective user intervention required in the previous method SchlueterKuck_Chaos17 .
Given a dissimilarity measure and resulting eigenvector solutions, the simultaneous coherent structure coloring (sCSC) algorithm begins by assigning to each state in the system a binary membership based on its corresponding scalar value along each orthogonal coordinate direction. A bifurcation is appropriate given that each one-dimensional coordinate has two extreme ends toward which the optimization of equation 1 pulls dissimilar states.
The states are bifurcated along each coordinate dimension by using agglomerative clustering with average linkage (although other linkages or splitting methods could be used for this step; see e.g. Ref. mullner2013fastcluster, , Table 1.) and assigning to each state a value of 0 or 1 based on its membership within either of the two largest clusters of the resulting dendrogram. Each eigenvector contributes a separate bit to the binary code associated with each of the states in the system, with the leading bit corresponding to the largest eigenvalue and the remaining bits concatenated in decreasing order of their corresponding eigenvalues. Though we suggest using a bifurcation in general, the method does not prohibit the division of each eigenvector coordinate into three or more discrete bins, thus creating a -way splitting and associated base- codes.
For each subsequent eigenvector, the bifurcation is performed for all data points (i.e. states), and each is assigned a 0 or a 1. For the th eigenvector bifurcation, this enables numerically possible clusters (Fig 1). For example, the first splitting produces branches and , and the second splitting enables the population of unique clusters by appending or to each branch of the existing binary code (). However, it may be the case that the numerically possible branch is not occupied because there is no data point that receives both a label of in the first bifurcation and a label of in the second bifurcation. Thus, we hypothesize that branch (and its only occupied split, branch ) evidences a coherent region of the data. In this way, a natural stopping criterion emerges from unoccupied bit codes during the binary splitting.
The binary codes generated by the aforementioned process can be visualized in a dendrogram, with each branch pair connecting those states that differ only at the least significant bit of their binary code. The length of each branch pair is a measure of the dissimilarity between the groups connected by the branches, and it corresponds to the value of the summation in equation (1) computed only over those states connected by the branches. Bits for progressively smaller eigenvalues are included at progressively lower levels of the dendrogram. The dissimilarity between the groups connected at lower levels therefore generally becomes smaller as well. While in principle the sCSC dendrogram should naturally truncate when no further splits occur, large amounts of data points or statistical noise may lead to insignificant (i.e., low-) clusters or explore a combinatorially unfavorable number of splits. In that case, one may choose to truncate the dendrogram after a certain number of eigenvectors according to visual inspection, or determine a cutoff based on the magnitude of or the eigenvalue.
As in standard divisive and hierarchical clustering methods, the clustering models produced with sCSC are dependent on the adjacency definition supplied by the user. Because the adjacency matrix summarizes pairwise dissimilarities only, this has the benefit of not requiring adherence to the triangle inequality Hansen1997 —in fact, the data points need not exist in a well-defined space at all. However, with this flexibility comes the drawback that a poor dissimilarity metric may obscure patterns in the data. The dissimilarity measures used in this study have been shown to be effective in previous studies SchlueterKuck_JFM17 ; Husic_Draft17 , and in general may require domain-specific knowledge to determine for a given dataset.
In the next two sections, we apply sCSC to benchmark problems in fluid dynamics in order to demonstrate its effectiveness in identifying coherent structures in the absence of a priori assumptions. Then, we demonstrate the use of sCSC to determine the number and shape of flow structures involved in vortex ring entrainment using data obtained from empirical measurements the laboratory. Finally, to highlight the interpretability of the sCSC dendrogram for high-dimensional datasets, we use sCSC to visualize an interpretable representation of an atomistic protein folding simulation. Finally, we discuss the relationship of this method to other unsupervised clustering methods, and the possibility of extending sCSC beyond physical dynamical systems.
Coherent structure identification from analytical geophysical flow simulations and empirical measurements
Quadruple-eddy simulated ocean flow
A key challenge in geophysical fluid dynamics is to extract and characterize coherent fluid motions from sparsely sampled turbulent flows of air or water. The coherent structures, often manifested as eddies and jets, can dominate the transport of heat, salt, nutrients, and pollutants Huhnetal_GRL12 ; HughesMiller_GRL17 . Therefore, they can serve as the basis for low-order models that capture the salient physics Kaiseretal_JFM14 , or as a template for data assimilation into large-scale weather forecasting models Macleanetal_PhysicaD17 . Turbulent flow structures in the ocean also impact the behavior and ecology of marine life Stockeretal_Nature17 .
Distributed sensor networks such as the Argo collection of 3800 ocean drifters Argo_00 sample the flow field in a Lagrangian sense, recording the properties of the water as each drifter is carried by the prevailing currents. Here we demonstrate the capability of the sCSC method to extract coherent fluid structures from such collections of Lagrangian measurements.
To do so, we first apply sCSC to a common model of Lagrangian ocean drifters in a simplified flow field comprising only four eddies, the unsteady quadruple-eddy flow AllshousePeacock_Chaos15 ; FroylandPadberg_Chaos15 . While this model represents a simplification of the full physics, it is valuable due to its common use for the evaluation and comparison with existing methods to identify coherent structures AllshousePeacock_Chaos15 ; FroylandPadberg_Chaos15 ; SchlueterKuck_JFM17 ; SchlueterKuck_Chaos17 .
As shown in Fig 2A, drifter trajectories within the two eddies at the upper-left and lower-right rotate clockwise, whereas trajectories within the other two eddies rotate counter-clockwise. Simultaneous with this rotation, an east-west oscillation of the eddy field occurs, which causes exchange of drifters between the eastern and western eddies. This exchange, which depends on the location and timing of the drifter release relative to the east-west oscillation cycle, is illustrated in the transition from initial drifter positions in Fig 2B to their final positions in Fig 2C.
Each drifter trajectory represents a state of this fluid dynamic system, and the pairwise dissimilarity between each of the states is given by the standard deviation of the instantaneous distance between drifter positions at time , , divided by the average distance between each pair of drifters, , for total time points SchlueterKuck_JFM17 :
This measure anticipates that coherent structures will comprise drifters whose relative positions do not vary as the flow evolves, leading to a small values of the pairwise dissimilarity measure (i.e. a small standard deviation) within each cluster. By contrast, pairs of drifters that straddle the boundary between coherent structures can experience exponential separation over time and a correspondingly large standard deviation of their instantaneous separation HallerYuan_PhysicaD00 .
Without requiring the specification of the number of eddies, the sCSC method reveals a clear, physically interpretable structure for this complex flow (Fig 2D). The primary bifurcation of the flow is between trajectories that remain in the eddy cores of their original quadrant (branch 0) and trajectories that do not (branch 1). The trajectories of branch 0 are then further subdivided into trajectories that remain within eddy cores in the northern half of the flow (branch 00) and those that remain within eddy cores in the southern half of the flow (branch 01), reflecting the absence of north-south drifter exchange. Finally, the trajectories associated with the individual quadrants are identified at the level of the third bifurcation (e.g. branch 000 shown in Fig 2D inset, as well as branches 001, 010, and 011 for the other three individual quadrants, not shown in inset). An additional visualization of the major coherent structures identified—namely, branches 00, 01, 10, and 11 in Fig 2D—is presented in Fig 3A.
Whereas the application of -means clustering or other conventional tools would require a priori guidance to determine that four independent structures exist in branch 0 (i.e. one eddy per quadrant) Hadjighasemetal_Chaos17 , this result is revealed naturally by the sCSC dendrogram, as further bifurcations after branch 000 do not produce additional coherent states; all of the trajectories that remain together after the third bifurcation remain together after subsequent bifurcation.
To be sure, the presence of the four eddy cores can also be revealed by a contour map of the largest finite-time Lyapunov exponent (FTLE) corresponding to the quadruple-eddy velocity field (see Figs. 2 and 4 in Ref. SchlueterKuck_JFM17, ). The key advantage of the sCSC approach is that a similar result can be achieved with two orders-of-magnitude less data: Schlueter-Kuck and Dabiri showed in Ref. SchlueterKuck_JFM17, that the FTLE gradient calculation is well-posed when the number of drifters is on the order of By contrast, the same cores can be identified by as few as 300 drifters using the present method, and the cores can be identified as long as drifters are present in the cores over timescales longer than the eddy turnover time.
The structure of branch 1 is less well organized and reflects the chaotic advection of trajectories that spiral radially within a quadrant and/or switch quadrants in the unsteady flow. Nonetheless, the dendrogram structure does indicate geometric symmetries within the chaotic motions, such as a preference for three quadrants among the trajectories in branches 110 and 111; and a more constrained preference for two quadrants exists at branch 1110. A general observation is that geometric symmetries appear as balanced dendrogram bifurcations. This is in contrast to the structure of random noise, which is characterized by a trivial sCSC dendrogram with a single branch that contains nearly all of the states and a splintering of a small number of fully-converged states at each level of the dendrogram (see Fig. S1).
Bickley jet simulated atmospheric flow
A more complex geophysical flow model is the Bickley jet, which serves as a common model for zonal jets in the atmosphereRypinaetal_JAS07 . This flow is composed of a central meandering jet as well as flanking eddies that are periodic along the east-west axis (Fig 4A). The sCSC dendrogram corresponding to this flow (for the same dissimilarity measure as the quadruple-eddy flow, equation (3)) is similarly effective in extracting the salient coherent features (Fig 4D). The flanking eddies are identified in branch 0.
However, a key difference from the previous quadruple-eddy example is that the individual eddies are largely indistinguishable from one another. This result reflects the homogeneity of fluid dynamics within the flanking eddies, which was not present among the trajectories in the quadruple-eddy flow (contrast e.g. Figs. 2C and 4C). A notable exception is the eddy located at the meridional axis of symmetry, i.e. branch 011. An additional analysis tracking the distance of particles from the eddy cores showed that fewer particles belonging to this center eddy travel past a given contour threshold during the simulated time-series than particles from other eddies. Branch 1 of the Bickley jet dendrogram collects those trajectories that are not associated with the flanking eddies. A subset of those trajectories, namely branch 10, is the meandering zonal jet. The remaining trajectories (branch 11) form a chaotic background flow that is robust to further bifurcation. These three coherent structures are further visualized in Fig 3B.
The sCSC structure of both of these simulated geophysical flows can be exploited to create low-order models of the governing fluid transport processes, without the need for ad hoc assumptions regarding the number of coherent structures present. Because similar results can be achieved despite significant missing or noisy data (see Ref. SchlueterKuck_JFM17, ), the inherently limited data collection that can be achieved in the ocean and atmosphere can be more effectively leveraged to potentially improve the accuracy of weather forecasting, for exampleMacleanetal_PhysicaD17 . Hence, the sCSC method can be a powerful tool for both very large and very sparse datasets.
Empirical measurement of vortex ring formation and entrainment
Vortex ring formation is a prominent phenomenon in engineered and biological systems as diverse as aerodynamic flow control, animal swimming, and the human cardiovascular system KruegerGharib_POF03 ; Dabiri_ARFM09 ; Gharibetal_PNAS06 . The growth and dynamics of vortex rings are dictated by the extent to which they entrain surrounding fluid Dabiri_JFM04 . Moreover, knowledge of the precise region of the flow that is ultimately entrained by a forming vortex ring can be used to predict how a vortex delivers mass, momentum, and energy to the surrounding flow. For example, pathological vortex ring formation in the human left ventricle has been shown to provide an effective diagnostic of heart failure Gharibetal_PNAS06 . Despite the importance of vortex ring entrainment, methods to quantify the region of the flow impacted by vortex rings have shown limited success, particularly in cases for which the FTLE field cannot be calculated due to the sparsity of measurements. Here, we demonstrate the ability of the sCSC technique to precisely identify the region of a flow that is entrained by a forming vortex ring—knowledge that has been previously inaccessible in cases where measurement data is sparse, such as when the flow is interrogated using non-invasive clinical methods such as ultrasound or magnetic resonance imaging.
Vortex rings were formed in the laboratory using a piston-cylinder apparatus described in previous work Schlueter_PRF16 . A motor-driven piston pushes water through a vertical hollow cylinder of diameter = 2.49 cm that is submerged in a tank with cross-sectional area of 61 cm by 61 cm and height of 91 cm. As the flow exits the cylinder at a nominal speed of 7 cm s, the fluid boundary layer at the inner surface of the cylinder rolls up into a toroidal vortex ring, which propagates away from the cylinder via self-induction.
A set of 1174 fluid particle trajectories in the domain encountered by the vortex ring were analyzed using the present sCSC method and the dissimilarity measure in equation (3) to identify regions of the ambient flow that were entrained by the vortex ring. As illustrated in Fig. 5A, it is impossible to determine which fluid particles have been entrained by the vortex ring based on visual inspection of the trajectories alone. A comparison FTLE analysis performed by Schlueter-Kuck and Dabiri on 30,500 advected particles (see Ref. SchlueterKuck_JFM17, , Fig. 9) showed that 1174 trajectories are not sufficiently close to one another to compute the FTLE field, because the required gradient calculations are not well-posed for sparse trajectories. The alternative use of existing techniques based on heuristics, such as -means or the spectral eigengap, rely on knowledge of the number of eddies to guide clustering; in the present case, it is not known a priori how many structures comprise the flow.
The sCSC dendrogram (Fig. 5D) avoids the need for explicit determination of the number of eddies, as it unambiguously identifies the fluid particles entrained by the vortex ring as those belonging to branch 0. Branch 1 identifies all other particles and further bifurcations of that branch reveal underlying geometric symmetries, as in Branch 1 of the quadruple-eddy flow in Fig. 2D.
Plots of the initial and final positions of the fluid particles in Fig. 5B and C show that the fluid entrained by the vortex ring occupies a well-defined region in the immediate path of the vortex ring, a result that is consistent with intuition but that can now be characterized quantitatively for the first time. The void created by the evolution of the blue particles from Fig. 5B to Fig. 5C is filled by the fluid ejected from the cylinder. The entrained blue particles ultimately occupy positions around the vortex ring that are consistent with the FTLE analysis in Ref. SchlueterKuck_JFM17, . This provides another demonstration of the interpretability of the sCSC results: notably, these result have been achieved without any of the ad hoc assumptions required by existing methods of entrainment quantification Dabiri_JFM04 ; Olcay_EIF08 .
Visualizing macrostate modeling of molecular dynamics
In this section we highlight the interpretability of the sCSC dendrogram for a high-dimensional dynamical dataset. Specifically, we focus on an atomistic simulation of protein folding. Whereas fluid dynamics datasets typical represent only a few spatiotemporal coordinates, atomistic molecular dynamics (MD) datasets can contain thousands of degrees of freedom with complex interrelationships.
While MD is resource-intensive, advances in simulation parameters Wang_JPCB2017 , bespoke hardware Shaw_Science10 , and distributed computing frameworks Shirts_Science00 , have enabled MD analyses to yield insight into complex biophysical systems at biologically meaningful timescales Husic_JACS18 . Thus, these simulations have the potential to uncover biophysical phenomena such as the misfolding mechanisms involved in a variety of diseases, stable configurations yet undiscovered by crystallography, and small molecule binding sites and kinetics for drug discovery.
However, without complementary analysis methods designed to communicate statistically rigorous and understandable conclusions resulting from such computational experiments, the benefits of advances in MD cannot be fully realized. While many methods have been developed to perform these analyses Husic_JACS18 , it remains a challenge to display their results in a meaningful way. sCSC can be used to augment already-existing methods for analyzing MD simulations such that the results can be visualized and interpreted.
To demonstrate the use of sCSC to visualize an MD analysis, we use an ultralong MD simulation performed by Lindorff-Larsen et al. LindorffLarsen_Science11 of the folding and unfolding of Protein G, a -residue protein expressed in streptococcal bacteria. The simulation details are described in the Supporting Materials of Ref. LindorffLarsen_Science11, . We use a Markov state model (MSM) analysis to define the states of the system, which is a discrete approximation to the Perron-Frobenius operator schutte1999direct . This established mathematical framework codifies the system using a kinetic master equation Husic_JACS18 . The master equation takes the form of a stochastic transition probability matrix, in which each state of the system is identified by a probability distribution of transitioning to every other state.
After constructing a quantitatively accurate and optimized MSM Noe_MMS13 , we are interested in clustering these state into a smaller number of interpretable “macrostates”, since it is conceptually difficult to describe hundreds unique states in a physically interpretable way Pande_Methods10 .
For our MSM, we found that 175 states optimally describes the system according to a variational evaluation (the MSM construction protocol is consistent with current best practices and is described in detail in the Methods). Minimum variance clustering analysis (MVCA), an effective coarse-graining method for MSMs, has recently been developed by one of the authors and uses a pairwise information theoretic dissimilarity metric in order to group states into a smaller number of macrostates, namely, the Jensen-Shannon divergence between the probability distributions characterizing the rows of the MSM transition probability matrixHusic_Draft17 (see also Methods equation (4)).
By using the same pairwise dissimilarity metric as MVCA, the state adjacencies can be input into the sCSC algorithm to produce a visualization of a set of macrostates in the protein folding dataset, which is displayed in Fig. 6. Nine branches of the sCSC dendrogram are depicted by sampling one structure from each original MSM state contained in that branch. Since the nine depicted branches contain all 175 original MSM states, these branches can be interpreted as a possible set of system macrostates. By superimposing a representative conformation from each MSM state and coloring the protein according to its secondary structure, we can visualize the MD trajectory by interpreting the sCSC groupings.
First, we note that the folded structure (branch 0) is identified in the first sCSC solution and is separated from the denatured, unfolded ensemble, which comprises the rest of the dendrogram (branch 1). We see that the folded branch isolates a well-defined conformation with low variance across sampled conformations. The incorporation of subsequent sCSC eigenvectors identifies groups of structures unified by their protein secondary structure features. Various branches contain similar secondary structure elements (similar colors in the structure visualization in Fig. 6), elucidating substructures exhibited during the folding of Protein G. For example, branch 1110 contains -sheet secondary structure (yellow), whereas branch 11110 contains noticeable -helical secondary structure (pink). Branch 1111111 is the least coherent, containing the most unstructured states. Summary statistics for each macrostate can be found in Table S1.
In this example, we have chosen to highlight secondary structure changes so we can understand which secondary structure elements characterize different subprocesses within folding. We see that the yellow -sheet secondary structure appears in several macrostates—often along with the blue -helix, thought to be an intermediate structure during -helix formation Sundaralingam_Science89 —which might indicate that the formation of the pink -helix is a rate-limiting step in the folding process. However, we could also choose to quantify and visualize macrostate contact maps, radii of gyration, or distance to various structures in order to gain complementary insight into the folding system.
For other dynamical processes characteristic of proteins, such as conformational change, allostery, and drug binding, we might choose to visualize parameters related to specific sites of interest or observables that can be probed experimentally. The choice of how to describe the macrostates is independent of the clustering process, but the depictions or statistics that enable the best interpretation of the system will depend on the dynamics of interest.
The sCSC dendrogram analysis offers advantages in the interpretation of high-dimensional MD datasets after an adequate kinetic model has been constructed. Utilizing the pairwise state adjacency from this kinetic model for an sCSC analysis produces a hierarchical representation of structural motifs according to the extent of their dissimilarity, which provides insight into the protein conformations that characterize subprocesses within folding. As in the fluid dynamics examples in the previous section, truncating the dendrogram when bifurcations are unoccupied produces an objective way to visualize the converged clusters. Finally, the generation of orthogonal sCSC solutions enables orthogonal dynamical processes in the protein folding simulation to be incorporated in analogy to the simple model in Fig 1. We anticipate that this type of interpretable visualization will be useful for communicating the results obtained from high-dimensional datasets.
The present approach addresses the previously stated challenges with common clustering algorithms: it does not require a choice of cluster number or dendrogram cutting, it leverages the concept of dissimilarity in a computationally tractable way, and it maintains an interpretable hierarchical relationship among splittings.
Perhaps the most important advantage of this approach relative to commonly used tools is that the number, shape, and size of clusters in the data emerges naturally from the sCSC dendrogram rather than being specified a priori. As the set of eigenvectors that is included in the analysis is increased to include those associated with lesser eigenvalues, the number of unoccupied binary codes generally increases. This is because progressively fewer groups of states that survived the preceding orthogonal partitions will be subsequently separated at lower levels of the dendrogram. In this way, the set of clusters in the dataset is revealed to be those branches of the dendrogram structure whose shape converges as the number of eigenvectors included in the analysis increases. The sCSC dendrogram indicates not only the number of these converged clusters but also the relative strength of the partitions between clusters, via the length of the connecting branches in -space.
While sCSC conceptually resembles divisive hierarchical clustering, the number of possible divisions in the latter scales as with the number of clusters , which is generally not feasible for large unless the initial dataset is sparse Boley_DMKD98 . sCSC scales in the same way as agglomerative hierarchical clustering, requiring a computationally nontrivial but tractable calculation of dissimilarity values for initial data points. However, unlike agglomerative clustering—which also requires the calculation of dissimilarities—small differences between states at lower levels of the sCSC dendrogram have no impact on the clusters that form at higher levels, as the top-down approach begins by using the most significant partitions indicated by the eigenvectors associated with the largest eigenvalues.
When applying sCSC, domain knowledge should inform selection of an appropriate dissimilarity measure, but ad hoc and a priori
assumptions about the structure of the data itself are not needed. While we have demonstrated sCSC only for simulated physical systems, we anticipate that these features will make sCSC a powerful tool for interrogating both new and longstanding research problems, including those in fields where the underlying processes are less accessible, such as genomics and neuroscience. For example, genetic ancestries can potentially be clustered on the basis of the sCSC structure that emerges from the dissimilarity of single-nucleotide polymorphisms (SNPs) among individuals within a population. In the latter case, differences in neuronal activation can be amplified using sCSC to identify emergent functions that involve coordination of spatially distant neurons. These and other applications can be pursued immediately given the tools developed here.
Quadruple-eddy ocean flow model
The velocity field of the quadruple-eddy ocean flow model is given by,
where and are the dimensionless east-west and north-south spatial coordinates (i.e. normalized by the quadrant side length), is time in dimensionless units, and
In the present unsteady implementation of the model, , , and . 3000 Lagrangian drifters were randomly initialized in the domain and advected in the flow using a fifth-order Runge-Kutta integration scheme. The duration of advection was 40 dimensionless time units, corresponding to 4 periods of horizontal oscillation of the flow.
Bickley jet atmospheric model
The Bickley jet flow is given by the streamfunction , where,
In the present study, we use similar parameter values as in Ref. Hadjighasemetal_PRE16, : ms, km, , , , , , and , , , and the flow is computed on the interval , m, , m, over the time interval , days, divided into 601 discrete time steps. The flow was treated as periodic in . 3000 Lagrangian fluid particles were randomly initialized in the domain and advected in the flow using a fifth-order Runge-Kutta integration scheme.
Markov state models
Markov state models (MSMs) are a kinetic master equation framework for describing and analyzing time-series data such as molecular dynamics (MD) simulations by approximating the continuous Perron-Frobenius operator using a discrete transition probability matrix schutte1999direct . A MSM requires partitioning the phase space explored by a system into discrete states (henceforth “microstates”), and is represented by a transition probability matrix defined for a Markovian lag time at which transitions between the microstates are independent of the history of the system. For protein folding analyses, phase space (positions and velocities) is conventionally approximated by conformation space (positions), and states are chosen according to an objective optimization protocol, in this case a variational principle Noe_MMS13 . The Markovian lag time chosen to analyze a system must be long enough for memoryless inter-state transitions, but short enough to resolve dynamics; for protein folding dynamics, lag times on the order of tens of nanoseconds are typical.
The MSM transition probability matrix is constrained to be stochastic, symmetric with respect to a stationary distribution, ergodic, and aperiodic. It can thus be decomposed into eigenvalues and eigenvectors, , where the eigenvalues are on the unit interval and the highest eigenvalue is unique. The variational principle states that the sum of estimated eigenvalues is bounded from above by the sum of true eigenvalues; thus many state decompositions can be tested according to the sum of a set number of eigenvalues for a set Markovian lag time and the state decomposition resulting in the highest sum of approximated eigenvalues can be chosen for further analysis.
The MSM for the simulation analyzed in this work was constructed according to the protocol used in Ref. Husic_JCTC17, for a set lag time of 50 ns according to a previous analysis for the same system performed in Ref. Beauchamp_PNAS12, . First, the Cartesian coordinates from the raw simulation data are transformed into the sines and cosines of the and
side chain dihedral angles for each amino acid of the protein. Next, the vector of dihedrals is again transformed using time structure-based independent component analysis (tICA)Schwantes_JCTC13 with a tICA lag time of 50 ns wherein each tICA solution vector was weighted according to its associated eigenvalue Noe_JCTC16 . Then, mini-batch -means was used to cluster the simulation frames according to their weighted tICA representations for 265 different numbers of cluster centers randomly chosen between 10 and 5000. Finally, a MSM was constructed on each -means state decomposition in which the transition probability matrix is obtained using a maximum likelihood estimator of the data such that detailed balance is achieved. For each model, five MSMs were fit to a randomly chosen half of the data and then applied to the other half of the data, and the latter was used to sum the first 50 MSM eigenvalues as that model’s score. The winning model was chosen to be the one that achieved the single maximum score from parameter sets with mean scores within one standard deviation of the maximum mean score. For our analysis of 265 different microstate numbers, the best model according to this variational analysis had 175 microstates and was used for analysis in the main text.
Coarse-graining MSMs with MVCA
Minimum variance clustering analysis (MVCA) was recently published by one of the authors as an algorithm for coarse-graining an MSM transition probability matrix into a smaller number of macrostates by grouping the original microstates Husic_Draft17 . MVCA achieves a coarse-grained model by using agglomerative hierarchical clustering with Ward’s minimum variance method ward1963hierarchical to cluster the microstates, where the pairwise dissimilarity between microstates is quantified using an information theoretic measure between the probability distribution characterized by the corresponding row of the MSM transition matrix.
If two microstates are defined by transition probability distributions and , their pairwise dissimilarity can be written using the Jensen-Shannon divergence Lin_IEEE91 ,
where is the elementwise mean of and
, and each term is the Kullback-Leibler divergence to the mean. We quantify the dissimilarity between microstates using the square root of equation4 Endres_IEEE03 ; Husic_Draft17 .
From this set of pairwise similarities, MVCA goes on to cluster the microstates using agglomerative hierarchical clustering with Ward’s method. In the analysis presented in this work, the set of pairwise dissimilarities is instead used to construct the adjacency matrix for sCSC.
The three fluid mechanics datasets and all adjacency matrices used to create the models in this work are available on github at https://github.com/brookehus/sCSC. This repository also contains example MATLAB and Python codes, including Jupyter notebook tutorials. The all-atom molecular dynamics simulations of Protein G were previously published in Ref. LindorffLarsen_Science11, , and the trajectories are available at no cost for non-commercial use through contacting email@example.com.
The authors are grateful to Muneeb Sultan, Jared Dunnmon, Nicole Xu, and the referees for insightful manuscript feedback and to D. E. Shaw Research for providing the Protein G dataset. This work was supported by the U.S. National Science Foundation and by the Department of Defense (DoD) through the National Defense Science & Engineering Graduate Fellowship (NDSEG) Program.
- (1) Wu Z, Ramsundar B, Feinberg EN, Gomes J, Geniesse C, Pappu AS, et al. MoleculeNet: a benchmark for molecular machine learning. Chem Sci. 2018;9(2):513–530. doi:10.1039/c7sc02664a.
- (2) Kavvas ES, Catoiu E, Mih N, Yurkovich JT, Seif Y, Dillon N, et al. Machine learning and structural analysis of Mycobacterium tuberculosis pan-genome identifies genetic signatures of antibiotic resistance. Nat Commun. 2018;9(1):4306. doi:10.1038/s41467-018-06634-y.
- (3) Froyland G, Padberg-Gehle K. A rough-and-ready cluster-based approach for extracting finite-time coherent sets from sparse and incomplete trajectory data. Chaos. 2015;25:087406. doi:10.1063/1.4926372.
- (4) Friedman J, Hastie T, Tibshirani R. The elements of statistical learning. vol. 1. Springer series in statistics New York, NY, USA:; 2001.
- (5) Macqueen J. Some methods for classification and analysis of multivariate observations. In: 5th Berkeley Symposium on Mathematical Statistics and Probability; 1967. p. 281–297.
- (6) Dunn JC. A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters. J Cybern. 1973;3(3):32–57. doi:10.1080/01969727308546046.
- (7) Ester M, Kriegel HP, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the Second ACM SIGKDD International Conference on Knowledge, Discovery, and Data Mining. vol. 96; 1996. p. 226–231.
- (8) Newman MEJ, Girvan M. Finding and evaluating community structure in networks. Phys Rev E. 2004;69:026113. doi:10.1103/PhysRevE.69.026113.
- (9) Kaufman L, Rousseeuw PJ. Finding groups in data: an introduction to cluster analysis. vol. 344. John Wiley & Sons; 2009.
- (10) Ali T, Asghar S, Sajid NA. Critical analysis of DBSCAN variations. In: 2010 International Conference on Information and Emerging Technologies; 2010. p. 1–6.
- (11) Allshouse MR, Peacock T. Lagrangian based methods for coherent structure detection. Chaos. 2015;25:097617. doi:10.1063/1.4922968.
- (12) Hadjighasem A, Farazmand M, Blazevski D, Froyland G, Haller G. A critical comparison of Lagrangian methods for coherent structure detection. Chaos. 2017;27:053104. doi:10.1063/1.4982720.
- (13) Hadjighasem A, Karrasch D, Teramoto H, Haller G. Spectral-clustering approach to Lagrangian vortex detection. Phys Rev E. 2016;93:063107. doi:10.1103/PhysRevE.93.063107.
- (14) Schlueter-Kuck KL, Dabiri JO. Coherent structure colouring: identification of coherent structures from sparse data using graph theory. J Fluid Mech. 2017;811:468–486. doi:10.1017/jfm.2016.755.
- (15) Schlueter-Kuck KL, Dabiri JO. Identification of individual coherent sets associated with flow trajectories using coherent structure coloring. Chaos. 2017;27(9):091101. doi:10.1063/1.4993862.
- (16) Klus S, Koltai P, Schütte C. On the numerical approximation of the Perron-Frobenius and Koopman operator. J Comput Dynam. 2016;3(1):51–79. doi:10.3934/jcd.2016003.
- (17) Dellnitz M, Junge O. On the approximation of complicated dynamical behavior. SIAM J Numer Anal. 1999;36(2):491–515. doi:10.1137/S0036142996313002.
- (18) Froyland G, Dellnitz M. Detecting and locating near-optimal almost-invariant sets and cycles. SIAM J Sci Comput. 2003;24(6):1839–1863. doi:10.1137/S106482750238911X.
- (19) Mezić I. Spectral properties of dynamical systems, model reduction and decompositions. Nonlinear Dyn. 2005;41(1-3):309–325. doi:10.1007/s11071-005-2824-x.
- (20) Schütte C, Fischer A, Huisinga W, Deuflhard P. A direct approach to conformational dynamics based on hybrid Monte Carlo. J Comput Phys. 1999;151(1):146–168. doi:10.1006/jcph.1999.6231.
Deuflhard P, Huisinga W, Fischer A, Schütte C.
Identification of almost invariant aggregates in reversible nearly uncoupled Markov chains.Linear Algebra Appl. 2000;315(1):39–59. doi:10.1016/S0024-3795(00)00095-1.
- (22) Noé F, Nüske F. A Variational Approach to Modeling Slow Processes in Stochastic Dynamical Systems. Multiscale Model Simul. 2013;11(2):635–655. doi:10.1137/110858616.
- (23) Williams MO, Kevrekidis IG, Rowley CW. A data–driven approximation of the koopman operator: Extending dynamic mode decomposition. J Nonlinear Sci. 2015;25(6):1307–1346. doi:10.1007/s00332-015-9258-5.
- (24) Wu H, Nüske F, Paul F, Klus S, Koltai P, Noé F. Variational Koopman models: slow collective variables and molecular kinetics from short off-equilibrium simulations. J Chem Phys. 2017;146(15):154104. doi:10.1063/1.4979344.
- (25) Guckenheimer J, Holmes P. Nonlinear Oscillations, Dynamical Systems, and Bifurcations of Vector Fields. vol. 42. Springer; 1983.
- (26) Haller G, Yuan G. Lagrangian coherent structures and mixing in two-dimensional turbulence. Physica D. 2000;147(3-4):352–370. doi:10.1016/S0167-2789(00)00142-1.
- (27) Lin J. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory. 1991;37(1):145–151. doi:10.1109/18.61115.
- (28) Husic BE, McKiernan KA, Wayment-Steele HK, Sultan MM, Pande VS. A minimum variance clustering approach produces robust and interpretable coarse-grained models. J Chem Theory Comput. 2018;14(2):1071–1082.
- (29) Hall KM. An r-Dimensional Quadratic Placement Algorithm. Management Sci. 1970;17(3):219–229.
- (30) Munoz S, Ortuno MT, Ramirez J, Yanez J. Coloring fuzzy graphs. Omega. 2005;33:211–221. doi:10.1016/j.omega.2004.04.006.
- (31) Press WH, Teukolsky SA, Vetterling WT, Flannery BP. Numerical Recipes. Cambridge University Press; 2007.
- (32) Müllner D. fastcluster: Fast hierarchical, agglomerative clustering routines for R and Python. J Stat Softw. 2013;53(9):1–18. doi:10.18637/jss.v053.i09.
- (33) Hansen P, Jaumard B. Cluster analysis and mathematical programming. Math Program. 1997;79.
- (34) Huhn F, von Kameke A, Perez-Munuzuri V, Olascoaga MJ, Beron-Vera FJ. The impact of advective transport by the South Indian Ocean Countercurrent on the Madagascar plankton bloom. Geophys Res Lett. 2012;39(6):L06602. doi:10.1029/2012GL051246.
- (35) Hughes CW, Miller PI. Rapid water transport by long-lasting modon eddy pairs in the southern midlatitude oceans. Geophys Res Lett. 2017;44(24). doi:10.1002/2017GL075198.
- (36) Kaiser E, Noack BR, Cordier L, Spohn A, Segond M, Abel M, et al. Cluster-based reduced-order modelling of a mixing layer. J Fluid Mech. 2014;754:365–414. doi:10.1017/jfm.2014.355.
- (37) Maclean J, Santitissadeekorn N, Jones CKRT. A coherent structure approach for parameter estimation in Lagrangian Data Assimilation. Physica D. 2017;360:36–45. doi:10.1016/j.physd.2017.08.007.
- (38) Sengupta A, Carrara F, Stocker R. Phytoplankton can actively diversify their migration strategy in response to turbulent cues. Nature. 2017;543:555–558. doi:10.1038/nature21415.
- (39) Argo. Argo float data and metadata from Global Data Assembly Centre (Argo GDAC). 2000;.
- (40) Rypina II, Brown MG, Beron-Vera FJ, Kocak H, Olascoaga MJ, Udovydchenkov IA. On the Lagrangian Dynamics of Atmospheric Zonal Jets and the Permeability of the Stratospheric Polar Vortex. J Atmos Sci. 2007;64:3595–3610. doi:10.1175/JAS4036.1.
- (41) Krueger PS, Gharib M. The significance of vortex ring formation to the impulse and thrust of a starting jet. Phys Fluids. 2003;15:1271–1281. doi:10.1063/1.1564600.
- (42) Dabiri JO. Optimal vortex formation as a unifying principle in biological propulsion. Annu Rev Fluid Mech. 2009;41:17–33. doi:10.1146/annurev.fluid.010908.165232.
- (43) Gharib M, Rambod E, Kheradvar A, Sahn DJ, Dabiri JO. Optimal vortex formation as an index of cardiac health. Proc Natl Acad Sci USA. 2006;103:6305–6308. doi:10.1073/pnas.0600520103.
- (44) Dabiri JO, Gharib M. Fluid entrainment by isolated vortex rings. J Fluid Mech. 2004;511:311–331. doi:10.1017/S0022112004009784.
- (45) Schlueter-Kuck KL, Dabiri JO. Pressure evolution in the shear layer of forming vortex rings. Phys Rev Fluids. 2016;1:012501(R). doi:10.1103/PhysRevFluids.1.012501.
- (46) Olcay AB, Krueger PS. Measurement of ambient fluid entrainment during vortex ring formation. Exp Fluids. 2008;44:235–247. doi:10.1007/s00348-007-0397-9.
- (47) Wang LP, McKiernan KA, Gomes J, Beauchamp KA, Head-Gordon T, Rice JE, et al. Building a More Predictive Protein Force Field: A Systematic and Reproducible Route to AMBER-FB15. J Phys Chem B. 2017;121(16):4023–4039. doi:10.1021/acs.jpcb.7b02320.
- (48) Shaw DE, Maragakis P, Lindorff-Larsen K, Piana S, Dror RO, Eastwood MP, et al. Atomic-Level Characterization of the Structural Dynamics of Proteins. Science. 2010;330(6002):341–346. doi:10.1126/science.1187409.
- (49) Shirts M, Pande VS. Screen Savers of the World Unite! Science. 2000;290(5498):1903–1904. doi:10.1126/science.290.5498.1903.
- (50) Husic BE, Pande VS. Markov State Models: From an Art to a Science. J Am Chem Soc. 2018;140(7):2386–2396. doi:10.1021/jacs.7b12191.
- (51) Lindorff-Larsen K, Piana S, Dror RO, Shaw DE. How Fast-Folding Proteins Fold. Science. 2011;334(6055):517–520. doi:10.1126/science.1208351.
- (52) Pande VS, Beauchamp K, Bowman GR. Everything you wanted to know about Markov State Models but were afraid to ask. Methods. 2010;52(1):99 – 105. doi:10.1016/j.ymeth.2010.06.002.
- (53) Sundaralingam M, Sekharudu Y. Water-inserted a-helical segments implicate reverse turns as folding intermediates. Science. 1989;244:1333–1337. doi:10.1126/science.2734612.
- (54) Boley D. Principal Direction Divisive Partitioning. Data Min Knowl Discov. 1998;2(4):325–344. doi:10.1023/A:1009740529316.
- (55) Husic BE, Pande VS. Ward Clustering Improves Cross-Validated Markov State Models of Protein Folding. J Chem Theory Comput. 2017;13(3):963–967. doi:10.1021/acs.jctc.6b01238.
- (56) Beauchamp KA, McGibbon R, Lin YS, Pande VS. Simple few-state models reveal hidden complexity in protein folding. Proc Natl Acad Sci. 2012;109(44):17807–17813. doi:10.1073/pnas.1201810109.
- (57) Schwantes CR, Pande VS. Improvements in Markov State Model Construction Reveal Many Non-Native Interactions in the Folding of NTL9. J Chem Theory Comput. 2013;9(4):2000–2009. doi:10.1021/ct300878a.
- (58) Noé F, Clementi C. Kinetic Distance and Kinetic Maps from Molecular Dynamics Simulation. J Chem Theory Comput. 2015;11(10):5002–5011. doi:10.1021/acs.jctc.5b00553.
- (59) Ward Jr JH. Hierarchical grouping to optimize an objective function. J Amer Statist Assoc. 1963;58(301):236–244.
- (60) Endres DM, Schindelin JE. A new metric for probability distributions. IEEE Transactions on Information Theory. 2003;49(7):1858–1860. doi:10.1109/TIT.2003.813506.
Kabsch W, Sander C.
Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features.Biopolymers. 1983;22(12):2577–2637. doi:10.1002/bip.360221211.
- (62) McGibbon RT, Beauchamp KA, Harrigan MP, Klein C, Swails JM, Hernández CX, et al. MDTraj: A Modern Open Library for the Analysis of Molecular Dynamics Trajectories. Biophys J. 2015;109(8):1528 – 1532. doi:10.1016/j.bpj.2015.08.015.