Principal nested shape space analysis of molecular dynamics data

03/22/2019 ∙ by Ian L. Dryden, et al. ∙ The University of Nottingham

Molecular dynamics simulations produce huge datasets of temporal sequences of molecules. It is of interest to summarize the shape evolution of the molecules in a succinct, low-dimensional representation. However, Euclidean techniques such as principal components analysis (PCA) can be problematic as the data may lie far from a flat manifold. Principal nested spheres gives a fundamentally different decomposition of data from the usual Euclidean sub-space based PCA (Jung, Dryden and Marron, 2012, Biometrika). Sub-spaces of successively lower dimension are fitted to the data in a backwards manner, with the aim of retaining signal and dispensing with noise at each stage. We adapt the methodology to 3D sub-shape spaces and provide some practical fitting algorithms. The methodology is applied to cluster analysis of peptides, where different states of the molecules can be identified. Also, the temporal transitions between cluster states are explored.




1 Introduction

There are many notions of shape, and one of the most common is that the shape of an object is obtained by removing location, rotation and scale (Kendall, 1984). Analyzing the shapes of objects measured at sets of labelled landmarks is of interest in a wide variety of disciplines (Dryden and Mardia, 2016), and in many applications the dataset is large, either in sample size or in dimension, or in both. For example, in molecular dynamics simulations the three dimensional co-ordinates of a molecule of hundreds or thousands of atoms may be available for millions of observations. It is of interest to summarize the main features of shape variability, such as describing the main modes of variability and clustering the data into several states. A key aspect is to project the data into a low dimensional space which retains the main features of the signal in the data.

The most common approaches to dimension reduction in shape analysis involve projecting the data into a tangent space to the mean shape and then carrying out principal components analysis (PCA) in this Euclidean space (Kent, 1994; Cootes et al., 1994; Dryden and Mardia, 2016, Section 7.7). This approach has been successful in many applications over the past couple of decades, although if the data are very dispersed then the method can be problematic as the data may lie far from a flat manifold. In addition, linear variation (corresponding to geodesics in the manifold) may not be the most appropriate or efficient summary of the variability. There have been several advances in dimension reduction on manifolds, motivated by shape analysis applications. Fletcher et al. (2004) developed principal geodesic analysis, which extends PCA to the manifold setting by finding principal directions and variances in the tangent space and projecting back to the manifold using the exponential map. The above approaches are forwards fitting, in that the lower dimensional representations are fitted before the higher dimensional ones.

Huckemann and Ziezold (2006) proposed a method of PCA for Riemannian manifolds based on geodesics of the intrinsic metric, and Huckemann et al. (2010) developed an algorithmic method to perform intrinsic PCA on quotient spaces based on generalized geodesics. Kenobi et al. (2010) proposed two intrinsic methods of fitting minimal geodesics through shape data, and an extension to polynomial fitting. These intrinsic methods differ in their formulation from the tangent space methods, in that tangent space PCA involves maximising the explained variance for each linear component in the tangent space, whereas the intrinsic geodesic methods involve minimizing the unexplained variance in order to fit a geodesic subspace. The intrinsic methods are also forwards fitting in the sense that the first principal geodesic is fitted before the second principal geodesic etc., although there is a different notion of a mean, which is the point that minimizes the variance of the projected data in the principal geodesic. Panaretos et al. (2014) introduced the principal flow, which is a curve passing through the mean of the data, where a particle moves along the principal flow in a path of maximal variation, up to smoothness constraints.

A very different backwards approach to subspace fitting was considered by Jung et al. (2012) who introduced a technique called Principal Nested Spheres (PNS) for sequentially decomposing a unit sphere into subspheres of successively lower dimension. This backwards approach to dimension reduction is the approach that we consider, which requires special adaptation to be applicable to three dimensional shapes and to large datasets, and these are our main methodological contributions.

The main motivation for this work comes from biomedical sciences, where it is of great interest to study the changes in shape of a protein, as its shape is an important component of a protein’s function. Studies of this type are often approached through molecular dynamics simulations, which are large deterministic simulations using Newtonian mechanics to model the movement of a protein in a box of water molecules (e.g. Salomon-Ferrer et al., 2013). We consider a dataset of 100 independent simulation runs in the study of the small alanine pentapeptide (Ala), which consists of 29 atoms in 3 dimensions observed at 10,000 equally spaced picosecond time intervals. Further details on this particular peptide are given by Margulis et al. (2002). The runs all start with fairly similar (but different) configurations and the subsequent simulated configurations vary considerably over the 10,000 time points (10 nanoseconds). Figure 1 shows example configurations at times near the beginning and end of the sequence for run 1, after removing the effect of rotation, translation and scaling using generalized Procrustes analysis (Gower, 1975; Goodall, 1991). As seen in the figures, the shape changes quite a lot with bending and straightening over the course of time.

Shape analysis can be considered an example of Object Oriented Data Analysis (Wang and Marron, 2007; Marron and Alonso, 2014), and the key initial question to ask is: “What are the data objects?” We have several choices available in our application, for example the data objects could be (i) the shapes of each individual molecule of 29 atoms in 3 dimensions or (ii) individual runs of 29 atoms in 3 dimensions observed at all 10,000 time points. The two respective choices of object space are (i) $\Sigma^{29}_3$ and (ii) $(\Sigma^{29}_3)^{10000}$, where $\Sigma^k_m$ is the shape space of $k$ points in $m$ dimensions (Kendall, 1984). For the application in this paper we will consider (i) for our data objects, and so the full sample size is $n = 100 \times 10{,}000 = 1{,}000{,}000$ in our case, whereas in (ii) the sample size would be $n = 100$.

Figure 1: Temporal sequences of thinned run 1.

It is of interest to describe the shape variability using a low dimensional representation and to examine if there are preferred states, i.e. clusters of shapes which are more commonly formed by the dynamic peptide. Also, the patterns of temporal transitions between states are of interest.

2 Shape PCA, principal nested spheres and principal nested shape spaces

2.1 Shape PCA

Consider configuration matrices $X_1, \ldots, X_n$, where $n$ is the number of configurations. Shape PCA is similar to classical PCA (Jolliffe, 2002), except the shapes must first be projected into the Procrustes tangent space centred at the overall estimated mean shape of the data. Technical details of the Procrustes tangent space are given in Appendix 1. The observations are optimally aligned (by translating, scaling and rotating) using generalized Procrustes analysis (Gower, 1975; Goodall, 1991) and the full Procrustes sample mean is obtained as an estimate of the overall population mean shape (Dryden and Mardia, 2016, Equation (6.11)). The procedure is described in detail in Chapter 7 of Dryden and Mardia (2016) and is implemented in the shapes package in R (Dryden, 2018). Let $v_1, \ldots, v_n$ be the Procrustes tangent projections of the centred, scaled and rotated configurations at their Procrustes mean (Dryden and Mardia, 2016, Equation (4.33)), which are obtained using the command procGPA with option tangentcoords="partial" from the R package shapes. The principal components are the eigenvectors of the sample covariance matrix of $v_1, \ldots, v_n$, and these eigenvectors and the PC scores are given in the output of procGPA. The method is equivalent to carrying out PCA using the extrinsic Frobenius distance between matrices in the pre-shape space after removing rotations using Procrustes registration.
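The pipeline above can be illustrated with a minimal Python/NumPy sketch (not the R shapes package implementation: the helper names, the simplified single-mean-update GPA loop, and the synthetic data are our own illustrative assumptions):

```python
import numpy as np

def preshape(X):
    """Centre (remove translation) and scale to unit norm (remove size)."""
    Z = X - X.mean(axis=0)
    return Z / np.linalg.norm(Z)

def align(A, B):
    """Rotate A to best match B (ordinary Procrustes / Kabsch solution)."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    D = np.eye(A.shape[1])
    D[-1, -1] = np.sign(np.linalg.det(U @ Vt))  # exclude reflections
    return A @ (U @ D @ Vt)

def gpa(configs, iters=10):
    """Simplified generalized Procrustes analysis on pre-shapes."""
    Z = [preshape(X) for X in configs]
    mean = Z[0]
    for _ in range(iters):
        Z = [align(z, mean) for z in Z]
        mean = preshape(np.mean(Z, axis=0))
    return np.array(Z), mean

# synthetic configurations: a base shape plus small landmark noise
rng = np.random.default_rng(1)
base = rng.normal(size=(10, 2))
configs = [base + 0.05 * rng.normal(size=base.shape) for _ in range(40)]
Z, mean = gpa(configs)

# tangent-style coordinates: residuals from the mean, then PCA via SVD
V = (Z - mean).reshape(len(Z), -1)
Vc = V - V.mean(axis=0)
_, s, _ = np.linalg.svd(Vc, full_matrices=False)
explained = s**2 / np.sum(s**2)   # proportion of variance per PC
```

The eigenvectors of the sample covariance matrix are obtained here via the SVD of the centred residual matrix, which is numerically equivalent.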

2.2 Principal nested spheres

The analysis of principal nested spheres for a given data set on a unit sphere was introduced by Jung et al. (2012). The main idea is that a high dimensional unit sphere is decomposed into successively lower dimensional subspheres using backwards fitting, and at each level the Euclidean PNS scores are obtained from the residuals. The method can capture non-geodesic variation, so if the data lie on a curved submanifold of the sphere, then the resulting variation after fitting the PNS can be linearly represented (Jung et al., 2012). PNS is an iterative method of decomposing $S^d$ into a sequence of nested (sub-)spheres $S^d \supset A_{d-1} \supset \cdots \supset A_1 \supset A_0$, where each $A_k$ is a sphere of dimension $k$ and $A_0$ is a point. At each step, the method captures the variation of the data set in a lower dimensional subsphere and provides the best lower-dimensional approximation to the data. Since $A_{d-1}$ is a great subsphere of $S^d$ if and only if it is a unit sphere, the method extends the existing methods of finding principal geodesics for the data. The main ingredients of the method can be summarised as follows.

Any subsphere of $S^d$ of dimension $d-1$ can be characterised by an axis $v \in S^d$ and a distance $r \in (0, \pi/2]$ as

$$A(v, r) = \{x \in S^d : \rho_d(v, x) = r\}, \qquad (2.1)$$

where $\rho_d$ is the spherical distance between $x, y \in S^d$ given by $\rho_d(x, y) = \arccos(x^\top y)$.

For a data set $x_1, \ldots, x_n \in S^d$, the best fitting $(d-1)$-dimensional subsphere $\hat{A}_{d-1} = A(\hat{v}_1, \hat{r}_1)$ is defined to be

$$(\hat{v}_1, \hat{r}_1) = \mathop{\arg\min}_{v, r} \sum_{i=1}^n \rho_d(x_i, A(v, r))^2, \qquad (2.2)$$

where $\rho_d(x, A(v, r))$ is the distance from $x$ to the subsphere. Then

$$\xi_{d,i} = \rho_d(x_i, \hat{v}_1) - \hat{r}_1, \qquad i = 1, \ldots, n,$$

are the signed residuals on the sphere. Since $\hat{A}_{d-1}$ has radius $\sin \hat{r}_1$, denote by $x_i^{(1)}$ the projection of $x_i$ onto $\hat{A}_{d-1}$ divided by $\sin \hat{r}_1$, so that $x_i^{(1)} \in S^{d-1}$. Since $\hat{A}_{d-1}$ is isometric with the standard unit sphere $S^{d-1}$, applying the above procedure to $x_1^{(1)}, \ldots, x_n^{(1)}$ we get the best fitting $(d-2)$-dimensional (sub-)sphere to the data to be $\hat{A}_{d-2} = A(\hat{v}_2, \hat{r}_2)$, where $(\hat{v}_2, \hat{r}_2)$ is determined by (2.2) with $x_i$ replaced by $x_i^{(1)}$ and $S^d$ replaced by the $(d-1)$-dimensional unit sphere $S^{d-1}$, and where $A(\hat{v}_2, \hat{r}_2)$ is defined on $S^{d-1}$ in a similar fashion to that in (2.1). Moreover, the corresponding residuals are

$$\xi_{d-1,i} = \rho_{d-1}(x_i^{(1)}, \hat{v}_2) - \hat{r}_2, \qquad i = 1, \ldots, n.$$

The resulting fitting sequence $\hat{A}_{d-1} \supset \hat{A}_{d-2} \supset \cdots \supset \hat{A}_1 \supset \hat{A}_0$ obtained by repeating this procedure is the so-called sequence of principal nested spheres for the given data set $x_1, \ldots, x_n$. Writing $\xi_{1,i}$ for the spherical coordinates of the final projection points in $\hat{A}_1$ of the data, with respect to their Fréchet mean in that sphere (assuming it is unique), the resulting combined residuals

$$\Xi = \begin{pmatrix} \xi_{1,1} & \cdots & \xi_{1,n} \\ \vdots & & \vdots \\ \xi_{d,1} & \cdots & \xi_{d,n} \end{pmatrix}$$

give a representation of the data, where the $i$th column comprises the coordinates of $x_i$ in terms of the principal nested spheres. This representation is called the PNS coordinates of the data and can be used to interpret the structure of the data.
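The subsphere fit of (2.2) can be illustrated numerically on $S^2$. The Python/NumPy sketch below is our own illustration, not one of the paper's fitting algorithms: it exploits the fact that, for a fixed axis $v$, the optimal radius is the mean of the angles to $v$, and it uses a crude grid search over candidate axes purely for demonstration.

```python
import numpy as np

def subsphere_rss(v, X):
    """Residual sum of squares of data X (rows on the sphere) about the
    subsphere A(v, r), using the optimal radius r for this axis v."""
    ang = np.arccos(np.clip(X @ v, -1.0, 1.0))
    r = ang.mean()                       # optimal radius given the axis
    return np.sum((ang - r) ** 2), r

# synthetic data lying exactly on a small circle of S^2:
# axis (0, 0, 1) and distance r = pi/4
theta = np.linspace(0, 2 * np.pi, 50, endpoint=False)
X = np.column_stack([np.sin(np.pi / 4) * np.cos(theta),
                     np.sin(np.pi / 4) * np.sin(theta),
                     np.full_like(theta, np.cos(np.pi / 4))])

# coarse grid search over candidate axes v on S^2 (illustration only;
# a serious implementation would use a proper optimizer)
best = (np.inf, None, None)
for phi in np.linspace(0, np.pi, 19):
    for psi in np.linspace(0, 2 * np.pi, 37):
        v = np.array([np.sin(phi) * np.cos(psi),
                      np.sin(phi) * np.sin(psi),
                      np.cos(phi)])
        rss, r = subsphere_rss(v, X)
        if rss < best[0]:
            best = (rss, v, r)
rss_hat, v_hat, r_hat = best   # recovers the axis and r = pi/4
```

Because the synthetic points lie exactly on the small circle, the fitted residual sum of squares is essentially zero and the fitted $r$ matches $\pi/4$.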

2.3 Principal nested shape-spaces

Principal nested spheres analysis for a data set on a sphere uses the particular structure of the sphere. In principle, this can be generalised to fitting a sequence of nested sub-manifolds for a data set contained in a manifold; however, this is generally not straightforward. For the analysis of shape variation of a given set of configurations, we propose a method of applying the technique of PNS to obtain principal nested sub-shape spaces (PNSS) for shapes of $m$-dimensional configurations.

We wish to describe the shape variability of the peptide data, and so this requires removing information about translation, scale and rotation. For a given configuration $X$ of $k$ labelled landmarks in $\mathbb{R}^m$ that are not all identical, the pre-shape $Z$ of the configuration is obtained from $X$ by removing the effects of translation and scaling. The pre-shape can be represented by a $(k-1) \times m$ matrix of unit norm (Kendall, 1984) and it is known that the pre-shape sphere consisting of all pre-shapes of such configurations is the entire sphere $S^{m(k-1)-1}$. Then, the shape of $X$ is the equivalence class of $Z$ under rotation, i.e. $[Z] = \{ZR : R \in SO(m)\}$.
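In code, the pre-shape construction amounts to centring and unit-norm scaling. The Python/NumPy sketch below is our own illustration; for simplicity it keeps the full $k \times m$ centred matrix rather than applying the Helmert sub-matrix reduction used in the formal $(k-1) \times m$ representation.

```python
import numpy as np

def preshape(X):
    """Map a k x m configuration to a pre-shape representative:
    centre to remove translation, then scale to unit Frobenius norm
    to remove size (illustrative sketch, without Helmertising)."""
    Z = X - X.mean(axis=0)          # remove translation
    return Z / np.linalg.norm(Z)    # remove scale: now ||Z|| = 1

rng = np.random.default_rng(0)
X = rng.normal(size=(29, 3))        # e.g. a 29-atom configuration in 3D
Z = preshape(X)
# the pre-shape has unit norm, i.e. it lies on the pre-shape sphere,
# and its landmark coordinates sum to zero in each dimension
unit_norm = np.isclose(np.linalg.norm(Z), 1.0)
```

Rotation is deliberately not removed at this stage; that is the role of the quotient by $SO(m)$ described next.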

In order to remove rotations we consider the quotient space of the pre-shape sphere with respect to rotations $SO(m)$, and this shape space $\Sigma^k_m = S^{m(k-1)-1}/SO(m)$ is complicated for $m \ge 3$, being non-homogeneous and containing singularities (Le and Kendall, 1993). However, practical statistical analysis can be carried out by identifying the tangent space to the pre-shape sphere at a point as comprising two orthogonal real sub-spaces: the vertical and horizontal tangent spaces. The vertical tangent space contains the rotation information and the horizontal tangent space contains the shape information, and the latter is often called the Procrustes tangent space (Kent and Mardia, 2001). Working in this horizontal tangent space forms the heart of most practical statistical analyses of landmark shapes.

To develop the method of PNSS we need to make some further identifications. We consider the subset of the pre-shape sphere which is orthogonal to the vertical tangent space at the Procrustes mean; this is a great sphere of lower dimension, and it is the image of the Procrustes tangent space under the exponential map onto the pre-shape sphere. If we now have a datapoint on the pre-shape sphere then we can obtain its Procrustes fit as the solution to minimizing the great circle distance to the mean over rotations in $SO(m)$, with a corresponding fitted rotation matrix. For all datapoints the Procrustes fits lie within a half-sphere, because the minimized distance is at most $\pi/2$. The Procrustes fit will be unique if the configuration is of sufficiently high rank and lies outside the cut-locus of the mean. In the case of uniqueness, which will often be the case for practical data, we can identify this subset of the pre-shape sphere, via a bijective map, with a corresponding subset of shape space. The two subsets are diffeomorphic to each other, and the practical implication of this identification is that any sub-manifolds in the pre-shape subset are mapped to sub-manifolds in the shape space subset.

Definition: Given a sequence of principal nested spheres on the pre-shape sphere (as defined in Section 2.2), the intersection of the sequence of principal nested spheres with the identified subset maps to a sequence of sub-manifolds in shape space, which we define as a sequence of Principal Nested Shape Spaces. The final point in the sequence, with dimension 0, corresponds to the PNSS mean shape.

We now consider the practical implementation of principal nested shape spaces for the peptide shape data in $m = 3$ dimensions. First of all an overall Procrustes mean is obtained using generalized Procrustes analysis; we use the function procGPA in the shapes package in R (Dryden, 2018). For our dataset each Procrustes fit is unique, as will generally be the case for most practical datasets. We then apply PNS to the Procrustes fitted data on the pre-shape sphere, and the intersections of the resulting PNS subspaces with the identified subset have a diffeomorphic mapping to the PNSS subspaces in shape space, and the PNS scores are equal to the PNSS scores.

PNSS uses the relationship between the pre-shape sphere and the shape space to construct a nested sequence of sub-shape-spaces in a given shape space from a sequence of nested subspheres, as constructed in the previous section, in the corresponding pre-shape sphere. Despite not being strictly analogous to the concept of principal nested spheres, this approach does offer an extrinsic method to apply that concept. The $m = 2$ dimensional case was discussed by Jung et al. (2012), where complex arithmetic leads to a simple adaptation of PNS to planar shapes; in that case the shape space is the complex projective space $\mathbb{C}P^{k-2}$, which is a homogeneous space (Kendall, 1984).

A full technical construction of PNSS is given in Appendix 1.

2.4 Approximate PNSS using PC scores

When a dataset is large, the numerical computation required for the principal nested spheres analysis, and hence for the principal nested sub-shape-spaces analysis, tends to be slow. Much of the time is spent fitting the higher dimensional spheres, which account for only small amounts of noise. An alternative approximate approach is to first carry out PCA and retain the first few PC scores, where these scores capture a high percentage of the variability. If the Procrustes fits of the original data configurations to their mean shape are close to a great subsphere of lower dimension than that of the pre-shape sphere, the speed of computation can be much improved by first using principal component analysis to determine this great subsphere and then using the projections onto this subsphere as the input for the principal nested sub-shape-spaces analysis.

The technical details of the method of approximate PNSS using PC scores are given in Appendix 2.

A practical implementation of PNSS using PC scores is given in the shapes package in R using the command pns4pc (Dryden, 2018), and this code is used in our peptide application. Output from the R command pns4pc includes the PNSS scores, where the first PNSS score is a circular variable and the remaining PNSS scores are Euclidean. In our application we shall see that the clusters of peptide states are much more apparent in the PNSS scores compared to the PC scores. The percentage of shape variability explained by the PNSS scores is helpful to explore the efficiency and effectiveness of the method, and we will see that the first few PNSS scores explain much more shape variability than the first few PC scores.
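The idea behind the approximation can be seen in a small toy example: if unit vectors actually lie on a great subsphere, PCA of the raw coordinates recovers the corresponding linear subspace, and the scores remain on a unit sphere of lower dimension, ready as input for PNS. The Python/NumPy sketch below is our own illustration, not the pns4pc implementation (which works with Procrustes tangent coordinates).

```python
import numpy as np

rng = np.random.default_rng(2)
# points on S^9 that actually lie on a great subsphere of dimension 2:
# random unit vectors in a fixed 3-dimensional subspace of R^10
B, _ = np.linalg.qr(rng.normal(size=(10, 3)))    # orthonormal basis, 10 x 3
Y = rng.normal(size=(200, 3))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)    # unit vectors in R^3
X = Y @ B.T                                      # embedded on S^9

# SVD-based PCA (uncentred here, to preserve the spherical geometry)
# detects the subspace: only 3 singular values are non-negligible
_, s, Vt = np.linalg.svd(X, full_matrices=False)
q = int(np.sum(s > 1e-8 * s[0]))                 # effective dimension
scores = X @ Vt[:q].T                            # coordinates in the subspace
norms = np.linalg.norm(scores, axis=1)           # all ~1: data on S^{q-1}
```

The projected scores can then be fed to the subsphere fitting of Section 2.2, at far lower dimension than the original sphere.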

3 Peptide shape analysis

3.1 Molecular dynamics data

The data set consists of a total of 100 runs, with each run consisting of $k = 29$ landmarks in $m = 3$ dimensional space from a small peptide (Ala) evaluated at 10,000 consecutive times at picosecond intervals. As the time interval between consecutive configurations is very small, we first thin the data in time to give 100 configurations at equally spaced times from each run, i.e. every 100th time point. We consider the temporal sequences of each run and register using generalized Procrustes analysis (Gower, 1975; Goodall, 1991).

The starting configurations of the peptide are quite similar across runs, with relatively small pairwise Riemannian shape distances between runs at the first time points. The variability in shape between runs is much larger at the end of the runs, with considerably larger pairwise Riemannian distances, as seen in Figure 2.

(a) Histograms
(b) Boxplots
Figure 2: Histograms (a) and boxplots (b) of pairwise Riemannian shape distances at the first two starting times and last two finishing times.
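The Riemannian shape distance between two configurations can be computed from the singular values of the matched pre-shapes, $\rho = \arccos\left(\sum_i \lambda_i\right)$, where the sign of the smallest singular value is flipped if a reflection would otherwise be needed. The following Python/NumPy sketch is our own, not the shapes package routine:

```python
import numpy as np

def riemann_dist(X1, X2):
    """Riemannian shape distance rho in [0, pi/2] between two
    landmark configurations (illustrative sketch)."""
    def pre(X):
        Z = X - X.mean(axis=0)
        return Z / np.linalg.norm(Z)
    Z1, Z2 = pre(X1), pre(X2)
    M = Z1.T @ Z2
    lam = np.linalg.svd(M, compute_uv=False).copy()
    if np.linalg.det(M) < 0:        # restrict matching to rotations only
        lam[-1] = -lam[-1]
    return float(np.arccos(np.clip(lam.sum(), -1.0, 1.0)))

sq = np.array([[0., 0.], [1., 0.], [1., 1.], [0., 1.]])   # a unit square
same = riemann_dist(sq, 2.0 * sq + 5.0)   # scaled + shifted copy: distance ~0
diff = riemann_dist(sq, np.array([[0., 0.], [2., 0.],
                                  [2., 1.], [0., 1.]]))   # a 2x1 rectangle
```

The scaled-and-shifted copy of the square has distance essentially zero, while the rectangle, being a genuinely different shape, has a strictly positive distance.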

3.2 Shape PCA and PNSS

We first run shape PCA on the thinned 100 configurations from each of the 100 runs (10,000 configurations in total). Much of the variation in the data can be explained by the first several principal components, as shown in Figure 3. We choose the first ten shape principal components for the computation of PNSS using PC scores, since these explain 90.1% of the shape variation. The percentages of shape variability captured by the first three individual PC scores are 28.7%, 16.4%, 12.7%, and by the first three PNSS scores are 65%, 9.2%, 3.9%.

Figure 3: (Left) Cumulative percentage of shape variability explained by successive components for PNSS (black) and shape PCA (red). (Right) Percentages of shape variability captured by the individual components for PNSS (black) and shape PCA (red).

We carry out the PNSS analysis to sequentially reduce the dimension of the projected data, one dimension at a time. The percentage variability explained by the first two PNSS scores (74.2%) is much higher than by the first two PCA scores (45.1%). The PNSS score 1 is a circular variable, where the most negative and most positive PNSS1 scores are identified with each other (at the cut point of the circle). The higher PNSS scores do not have their extreme points identified with each other.

We calculate the PNSS1, PNSS2 and PNSS3 coordinates of all 1,000,000 configurations through the model constructed using the thinned 10,000 configurations. The pairwise 2D plots of the first three PNSS and PC scores are given in Figure 4. In the PCA plot a long vertical cluster extends over the range of PC2 within a small range of PC1. On the other hand, in the PNSS1–PNSS2 plot three salient clusters are close to each other and one cluster is positioned on the left side. For this particular peptide there are believed to be four preferred shape states, and these can be clearly seen in the first PNSS plot, whereas they are not evident with PCA. Hence, there is a clear advantage in using PNSS compared to PCA for this dataset.

The relative variation of PNSS1 to PNSS2 is much higher than that of PC1 to PC2. Since the first two PNSS components account for a high proportion (74.2%) of the shape variability, the subsequent components from PNSS3 onwards are far less important. The left and right tails in PNSS1 should be considered connected, given that this is a circular variable.

(a) PCA plot.
(b) PNSS plot using 10 shape PC scores.
Figure 4: Pairwise components plot of (a) shape PCA and (b) PNSS using first 10 shape PC scores.

3.3 Cluster analysis

Given that there are believed to be four preferred states we now consider cluster analysis. The result of four-group clustering is shown in Figure 5(b), where hierarchical cluster analysis using Ward’s method (Ward, 1963) is used, but with distances rather than squared distances (ward.D in R). In (a) the cluster analysis was performed on (PC1, PC2, PC3) and in (b) it was performed using the great circle distance and then displayed in terms of (PNSS1, PNSS2) coordinates. Three PCs are needed in (a) to separate the data into four groups, whereas two PNSS scores in (b) are enough to split them clearly. Indeed, looking carefully at Figure 4, the four clusters do not stand out in the PCA plot even with three PCs, but they are clearly distinct in the PNSS1–PNSS2 plot. Plotting PNSS1 and PNSS2 with the colour scheme from the PCA clustering is given in (c), and we see that the result is similar. The densest part of the blue cluster in (b) is where the starting configurations for each of the runs are located. Ward’s method was our preferred clustering method, giving good practical separation into four groups in our application.

In (d) we plot the PNSS scores using the PNSS clustering. The solid red line is the estimated first principal arc, while the solid-dotted red line is the estimated second principal arc, where the solid part indicates the range of the PNSS2 values. In (e) we plot the same data as in (d), except that we use the PCA cluster colours, and the clusterings in plots (d) and (e) are very similar. If we carry out the clustering on other sets of variables, such as (PNSS1, PNSS2), then the broad pattern is similar, particularly near the main modes of the four clusters.

(a) Clustering on (PC1, PC2, PC3).
(b) Clustering on the data and representation in terms of (PNSS1, PNSS2). PNSS1 is a circular variable, with the two extremes being equivalent, as the cut-point of the circle.
(c) Clustering on (PC1, PC2, PC3) and representation in terms of (PNSS1, PNSS2).
(d) Projected data with PNSS clusters.
(e) Projected data with PCA clusters.
Figure 5: Pairwise plots of PC1, PC2, PC3 scores (a) and PNSS1, PNSS2 scores (b)-(e). The colouring uses four-group hierarchical clustering on (PC1, PC2, PC3) in (a), (c) and (e), and on the data in (b) and (d), all using Ward’s method with non-squared distances. Panels (d) and (e) show the projected data points with the estimated first two principal arcs.

3.4 Structure of PNSS variability

For each PNSS we can obtain an interval along the $j$th principal arc as

$$[\hat{\mu} - c\,\hat{\sigma}_j \hat{e}_j,\; \hat{\mu} + c\,\hat{\sigma}_j \hat{e}_j], \qquad (3.1)$$

where $\hat{\mu}$ is the PNSS mean, $c$ is a constant, $\hat{\sigma}_j$ is the standard deviation of the $j$th PNSS scores and $\hat{e}_j$ is the direction of the $j$th principal arc. Most of the variability lies in the first few components, as seen in Figure 3, and we used a fixed value of $c$ for display.

Figure 6: PNSS 1 (top left), PNSS 2 (top right) and PNSS 3 (bottom) in landmark space, with the PNSS mean (rainbow coloured points) on all plots, and the Procrustes mean (black) shown on the PNSS1 plot. Rainbow arcs have been drawn from the PNSS mean along each PNSS score direction to the ends of the interval (3.1). The grey points and lines show the landmarks at the ends of the interval.

We display the first three principal arcs in landmark space in Figure 6, and we display both the PNSS mean (corresponding to the zero-dimensional PNSS) and the Procrustes mean for comparison. The black dots connected with black lines indicate the Procrustes mean. The PNSS mean is given by rainbow coloured points (with the rainbow colour indicating point order). We can see that the PNSS mean and Procrustes mean shapes are visibly different. In addition, coloured rainbow arcs have been drawn from the PNSS mean along each PNSS score direction to the ends of the interval (3.1). The grey points indicate the locations at the ends of the interval along each arc, and these are joined by grey lines which give an idea of the structure of variability in the PNSS. This plot shows considerable amounts of twisting along the arcs in PNSS1, and a smaller amount of different bending in PNSS2 and PNSS3.

We see that the first PNSS score demonstrates an articulated type of movement, sweeping out arcs. The PNSS mean is clearly different from the Procrustes mean. The PNSS description of variation is particularly appropriate here as bonds have fixed lengths, and local rotational movement is much more practical than linear movement to and from the Procrustes mean, as in shape PCA.

Figure 7: PNSS 1 for each of clusters 1 and 2 (top row, left and right) and clusters 3 and 4 (bottom row, left and right). The PNSS mean (rainbow coloured points) and Procrustes means (black) are displayed. Rainbow arcs have been drawn from the PNSS mean along the PNSS 1 score direction to the ends of the interval (3.1).

In order to examine the shapes of the clusters, we plot the PNSS mean and Procrustes mean shapes of each cluster in Figure 7, together with the PNSS 1 arcs for each cluster. The cluster 1 mean has two clear bends in the peptide, the cluster 2 and cluster 4 means have a single bend each in different positions, and the cluster 3 mean is broadly straight with no bends. These four cluster means are quite different, and they illustrate the wide range of shape configurations that the peptides visit during their temporal sequences. Cluster 1 is also clearly more variable, with larger PNSS 1 arcs than the other clusters, and the PNSS mean and Procrustes mean differ most in cluster 1.

3.5 Temporal clustering

A graphical view of the shape-changing pattern within these four groups over time is given in Figure 8(a) for run 55. This run changes shape regularly, but other runs have different patterns: for example, run 66 changes a great deal during the middle times, run 25 does not visit the yellow cluster, and so on. All runs are shown in Figure 8(b).

(a) Run 55
(b) All runs
Figure 8: Visit history.

Since all 100 runs exhibit different dynamic patterns, we carried out cluster analysis on the 100 transition matrices between these four states. To this end we estimate a transition matrix for each run using empirical transition probabilities, and we use the Hellinger distance between transition matrices $P^{(a)} = (p^{(a)}_{ij})$ and $P^{(b)} = (p^{(b)}_{ij})$ defined by

$$d_H(P^{(a)}, P^{(b)}) = \left\{ \sum_{i} \sum_{j} \left( \sqrt{p^{(a)}_{ij}} - \sqrt{p^{(b)}_{ij}} \right)^2 \right\}^{1/2}.$$

The overall transition matrix $\hat{P}$ estimated from all the runs is given in Table 1, indicating the conditional probability of moving from a cluster (rows) to another cluster (columns). The transition jumps are evaluated every 100 time points on the original scale, i.e. at consecutive times of the thinned data. The overall equilibrium distribution of the chain, obtained from $\hat{P}$ by solving $\pi = \pi \hat{P}$, is given in Table 2. Overall we see that more time is spent in Cluster 3 and less in Cluster 4. There is a preference to move from Cluster 1 to either Cluster 2 or 4 first, before then transitioning to Cluster 3 as the next most likely move. So we can see that the movement between the states is asymmetric, and there are preferred orderings of transitions between states.

1 2 3 4
1 0.8628 0.0712 0.0135 0.0525
2 0.0744 0.7480 0.1608 0.0168
3 0.0069 0.0893 0.8501 0.0537
4 0.0655 0.0178 0.1588 0.7578
Table 1: Overall average transition matrix.
Figure 9: TC1 - TC4 denote temporal cluster labels after clustering the transition matrices, and Cluster 1 - Cluster 4 on the horizontal axis denote the finishing location labels after clustering the data.
1 2 3 4
Overall 0.213 0.218 0.416 0.153
TC1 0.411 0.220 0.254 0.114
TC2 0.085 0.218 0.540 0.158
TC3 0.180 0.112 0.334 0.374
TC4 0.091 0.105 0.795 0.008
Table 2: Equilibrium probabilities for Cluster 1,2,3,4, for the overall dataset and for runs within each temporal cluster TC1,TC2,TC3,TC4.
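The Markov-chain summaries of Tables 1 and 2 can be reproduced mechanically from a cluster-label sequence: estimate empirical transition probabilities, solve $\pi = \pi \hat{P}$ for the equilibrium distribution, and compare runs via the Hellinger distance. The Python/NumPy sketch below uses a made-up chain whose matrix only loosely echoes Table 1, and the unnormalised form of the Hellinger distance is our assumption:

```python
import numpy as np

def transition_matrix(states, n_states=4):
    """Empirical one-step transition probabilities from a state sequence."""
    C = np.zeros((n_states, n_states))
    for a, b in zip(states[:-1], states[1:]):
        C[a, b] += 1
    return C / C.sum(axis=1, keepdims=True)

def equilibrium(P):
    """Stationary distribution pi with pi P = pi (left Perron eigenvector)."""
    w, V = np.linalg.eig(P.T)
    pi = np.real(V[:, np.argmax(np.real(w))])
    return pi / pi.sum()

def hellinger(P, Q):
    """Hellinger-type distance between transition matrices
    (our reading; the exact normalisation is an assumption)."""
    return float(np.sqrt(np.sum((np.sqrt(P) - np.sqrt(Q)) ** 2)))

rng = np.random.default_rng(3)
P_true = np.array([[0.86, 0.07, 0.01, 0.06],
                   [0.07, 0.75, 0.16, 0.02],
                   [0.01, 0.09, 0.85, 0.05],
                   [0.06, 0.02, 0.16, 0.76]])
# simulate a long chain of cluster labels and re-estimate the matrix
states = [0]
for _ in range(20000):
    states.append(rng.choice(4, p=P_true[states[-1]]))
P_hat = transition_matrix(states)
pi_hat = equilibrium(P_hat)          # time spent in each cluster, long-run
```

Clustering the 100 per-run matrices then only requires the 100 x 100 matrix of pairwise `hellinger` values as input to a hierarchical clustering routine.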

The first panel in Figure 9 shows the average distance between clusters divided by the average distance within clusters, on the basis of which we split the transition matrices into four groups (temporal clusters) TC1 to TC4, by hierarchical clustering using Ward’s method with the Hellinger distance. Focusing on the finishing locations where runs terminate, we estimated the final probability for each location. In the bottom panel, the labels Cluster 1 to Cluster 4 on the horizontal axis denote the finishing location labels, which correspond to the four groups in Figure 5(b). Again we see that Cluster 3 is the finishing location with high probability overall, and relatively fewer runs terminate in Cluster 4.

We also estimate the transition matrices within each of TC1, TC2, TC3 and TC4 and display the corresponding equilibrium probabilities in Table 2. Simulation runs in TC1 have a relatively strong pull to Cluster 1, runs in TC2 have a pull towards Cluster 3, runs in TC3 have a pull towards Clusters 3 and 4, and runs in TC4 have a very strong pull towards Cluster 3. This behaviour is also seen in the estimated probabilities of final locations in Figure 9.

Hence it is of interest that the molecular dynamics simulations do exhibit different behaviours in the runs, with TC1 being particularly different from the other temporal clusters.

The analysis is dependent on the choice of temporal thinning. We have chosen to thin to every 100th observation due to the very fine temporal resolution of the original data. There will be some short periods spent in other clusters between the thinned values, but the choice of 100 gives a good compromise between excessively long times spent in a cluster (too little thinning) versus brief or missed visits to clusters (too much thinning).

4 Discussion

We have developed the method of principal nested shape spaces as an extension of principal nested spheres (Jung et al., 2012), and the peptide application clearly demonstrates the utility of the method. The technique has been applied to other datasets, including molecular dynamics simulations of enzyme data, where again the backwards method yields useful insights that differ from those of conventional tangent space shape PCA, a forwards method that requires estimation of the mean shape at the outset. The general approach to backwards PCA and principal nested manifold relations is discussed by Damon and Marron (2014). Further related work includes Pennec (2016), who discusses barycentric sub-spaces, where sub-manifolds are generated from weighted means of reference points.

The data are available at:


  • Cootes et al. (1994) Cootes, T. F., Taylor, C. J., Cooper, D. H., and Graham, J. (1994). Image search using flexible shape models generated from sets of examples. In Mardia, K. V., editor, Statistics and Images: Vol. 2, pages 111–139. Carfax, Oxford.
  • Damon and Marron (2014) Damon, J. and Marron, J. S. (2014). Backwards principal component analysis and principal nested relations. Journal of Mathematical Imaging and Vision, 50:107–114.
  • Dryden (2018) Dryden, I. L. (2018). shapes package. R Foundation for Statistical Computing, Vienna, Austria. Contributed package, Version 1.2.4.
  • Dryden and Mardia (2016) Dryden, I. L. and Mardia, K. V. (2016). Statistical Shape Analysis, with Applications in R, 2nd edition. Wiley, Chichester.
  • Fletcher et al. (2004) Fletcher, P. T., Lu, C., Pizer, S., and Joshi, S. (2004). Principal geodesic analysis for the study of nonlinear statistics of shape. IEEE Transactions on Medical Imaging, 23:995–1005.
  • Goodall (1991) Goodall, C. (1991). Procrustes methods in the statistical analysis of shape. Journal of the Royal Statistical Society, Series B, 53:285–339.
  • Gower (1975) Gower, J. (1975). Generalized Procrustes analysis. Psychometrika, 40:33–51.
  • Huckemann et al. (2010) Huckemann, S., Hotz, T., and Munk, A. (2010). Intrinsic shape analysis: geodesic PCA for Riemannian manifolds modulo isometric Lie group actions. Statistica Sinica, 20:1–100.
  • Huckemann and Ziezold (2006) Huckemann, S. and Ziezold, H. (2006). Principal component analysis for Riemannian manifolds, with an application to triangular shape spaces. Adv. Appl. Prob. (SGSA), 38:299–319.
  • Jolliffe (2002) Jolliffe, I. T. (2002). Principal Component Analysis, 2nd edition. Springer, New York.
  • Jung et al. (2012) Jung, S., Dryden, I. L., and Marron, J. S. (2012). Analysis of principal nested spheres. Biometrika, 99:551–568.
  • Kendall (1984) Kendall, D. G. (1984). Shape manifolds, Procrustean metrics and complex projective spaces. Bulletin of the London Mathematical Society, 16:81–121.
  • Kenobi et al. (2010) Kenobi, K., Dryden, I. L., and Le, H. (2010). Shape curves and geodesic modelling. Biometrika, 97:567–584.
  • Kent (1994) Kent, J. T. (1994). The complex Bingham distribution and shape analysis. Journal of the Royal Statistical Society, Series B, 56:285–299.
  • Kent and Mardia (2001) Kent, J. T. and Mardia, K. V. (2001). Shape, Procrustes tangent projections and bilateral symmetry. Biometrika, 88:469–485.
  • Le (1991) Le, H. (1991). On geodesics in Euclidean shape spaces. Journal of the London Mathematical Society, 44:360–372.
  • Le and Kendall (1993) Le, H. and Kendall, D. G. (1993). The Riemannian structure of Euclidean shape spaces: a novel environment for statistics. Ann. Statist., 21:1225–1271.
  • Margulis et al. (2002) Margulis, C. J., Stern, H. A., and Berne, B. J. (2002). Helix unfolding and intramolecular hydrogen bond dynamics in small α-helices in explicit solvent. J. Phys. Chem. B, 106:10748–10752.
  • Marron and Alonso (2014) Marron, J. S. and Alonso, A. M. (2014). Overview of object oriented data analysis. Biometrical Journal, 56(5):732–753.
  • Panaretos et al. (2014) Panaretos, V. M., Pham, T., and Yao, Z. (2014). Principal flows. Journal of the American Statistical Association, 109(505):424–436.
  • Pennec (2016) Pennec, X. (2016). Barycentric subspace analysis on manifolds. ArXiv e-prints. 1607.02833v2.
  • Salomon-Ferrer et al. (2013) Salomon-Ferrer, R., Case, D. A., and Walker, R. C. (2013). An overview of the Amber biomolecular simulation package. Wiley Interdisciplinary Reviews: Computational Molecular Science, 3:198–210.
  • Wang and Marron (2007) Wang, H. and Marron, J. S. (2007). Object oriented data analysis: sets of trees. Ann. Statist., 35(5):1849–1873.
  • Ward (1963) Ward, Jr., J. H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58:236–244.

Appendix 1

Principal nested sub-shape-spaces for $m$-dimensional data

Let $X \in \mathbb{R}^{k \times m}$ denote a configuration of $k$ labelled landmarks in $m$ dimensions that are not all identical, and write $Z$ for the pre-shape of the configuration, obtained from $X$ by removing the effects of translation and scaling. The pre-shape $Z$ is a $(k-1) \times m$ matrix of unit norm, and the pre-shape sphere is denoted $S^{(k-1)m-1}$.

The tangent space to $S^{(k-1)m-1}$ at any $Z$ is the subspace of $M(k-1,m)$ given by

$$T_Z = \{ V \in M(k-1,m) : \mathrm{trace}(V^\top Z) = 0 \},$$

where $M(k-1,m)$ denotes the space of all $(k-1) \times m$ matrices. In terms of the quotient map from the pre-shape sphere to the Kendall shape space $\Sigma^k_m$ of configurations in $\mathbb{R}^m$ of $k$ labelled landmarks, we can decompose $T_Z$ into the direct sum of two orthogonal subspaces, one of which is tangent to the equivalence class of $Z$. We denote this (vertical) subspace of $T_Z$ by $T^v_Z$. Its orthogonal complement $T^h_Z$ is the Procrustean, or horizontal, sub-tangent space at $Z$. It is known that (Kent and Mardia, 2001)

$$T^h_Z = \{ V \in T_Z : V^\top Z = Z^\top V \},$$

and that $T^h_Z$ is isometric with the tangent space at $[Z]$ to the shape space $\Sigma^k_m$.

For any $1 \le i < j \le m$, write $A_{ij}$ for the $m \times m$ skew-symmetric matrix with $(i,j)$ component 1, $(j,i)$ component $-1$ and 0 otherwise. Then, for any pre-shape $Z$ of full rank,

$$\mathrm{trace}((Z A_{ij})^\top Z) = \mathrm{trace}(A_{ij}^\top Z^\top Z) = 0,$$

so that $Z A_{ij} \in T_Z$. Moreover, $\{ Z A_{ij} : 1 \le i < j \le m \}$ spans $T^v_Z$.

The subset of $S^{(k-1)m-1}$ orthogonal to the Euclidean subspace spanned by $\{ Z A_{ij} : 1 \le i < j \le m \}$ is a great sphere of dimension $(k-1)m - 1 - m(m-1)/2$. We denote this subsphere by $S^*_Z$, that is,

$$S^*_Z = \{ W \in S^{(k-1)m-1} : \mathrm{trace}(W^\top Z A_{ij}) = 0, \ 1 \le i < j \le m \}.$$

Clearly, $Z \in S^*_Z$ and the dimension of $S^*_Z$ is the same as that of the shape space $\Sigma^k_m$. It can be checked that $S^*_Z$ is the image of $T^h_Z$ under the exponential map at $Z$.
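As an illustration of the pre-shape construction, the following numpy sketch removes translation with the Helmert sub-matrix and rescales to unit norm, following the convention in Dryden and Mardia (2016); the function names are ours, not from any package:

```python
import numpy as np

def helmert_submatrix(k):
    """(k-1) x k Helmert sub-matrix: orthonormal rows, each summing to zero."""
    H = np.zeros((k - 1, k))
    for j in range(1, k):
        hj = -1.0 / np.sqrt(j * (j + 1))
        H[j - 1, :j] = hj        # first j entries equal h_j
        H[j - 1, j] = -j * hj    # then -j * h_j, remaining entries zero
    return H

def preshape(X):
    """Pre-shape of a k x m configuration: remove translation, scale to unit norm."""
    k = X.shape[0]
    Z = helmert_submatrix(k) @ X
    return Z / np.linalg.norm(Z)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))          # 8 landmarks in 3D
Z = preshape(X)                          # (k-1) x m = 7 x 3, on S^{20}
print(np.isclose(np.linalg.norm(Z), 1.0))  # True
```

Here the pre-shape sphere has dimension $(k-1)m - 1 = 20$ for $k = 8$, $m = 3$, matching the count above.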

For any pre-shape $W$, we apply the ordinary Procrustes analysis procedure to choose $\hat\Gamma \in SO(m)$, where

$$\hat\Gamma = \arg\min_{\Gamma \in SO(m)} \| Z - W \Gamma \|.$$

Then:

  1. The resulting $W\hat\Gamma$ is usually called the Procrustes fit to $Z$ of the original configuration;

  2. $W\hat\Gamma$ has the same shape as $W$;

  3. $Z^\top W\hat\Gamma$ is symmetric;

  4. $\mathrm{trace}(Z^\top W\hat\Gamma) \ge 0$; and

  5. the spherical distance, $d(Z, W\hat\Gamma)$, between $Z$ and $W\hat\Gamma$ is the same as the induced Riemannian shape distance between the shapes of $Z$ and $W$.

The fact that $Z^\top W\hat\Gamma$ is a symmetric matrix implies that, for any $A_{ij}$,

$$\mathrm{trace}((W\hat\Gamma)^\top Z A_{ij}) = 0,$$

so that $W\hat\Gamma$ is orthogonal to each $Z A_{ij}$. It follows that $W\hat\Gamma \in S^*_Z$ and, in particular, that the intersection of the fibre of $W$ with $S^*_Z$ is always non-empty. Since $\mathrm{trace}(Z^\top W\hat\Gamma) \ge 0$, all such $W\hat\Gamma$ lie in the same half sphere of $S^*_Z$.

For given $Z$ and $W$, the corresponding $\hat\Gamma$, and hence the corresponding $W\hat\Gamma$, is not necessarily unique. The uniqueness of $\hat\Gamma$, or of $W\hat\Gamma$, is equivalent to the shape $[W]$ being within the non-singular part of $\Sigma^k_m$ but outside the cut locus of the shape $[Z]$ (Le, 1991). Hence, the subset

$$S_Z = \{ W\hat\Gamma : [W] \text{ non-singular and outside the cut locus of } [Z] \}$$

of $S^*_Z$ is topologically a ball in $S^*_Z$ containing $Z$ and is bijective, under the quotient map, with the subset

$$U_{[Z]} = \{ [W] \in \Sigma^k_m : [W] \text{ non-singular and outside the cut locus of } [Z] \}$$

of $\Sigma^k_m$. Note that $[W]$ being non-singular is equivalent to it being the shape of a configuration whose pre-shape matrix has rank at least $m-1$. Since the quotient map is differentiable and since $S_Z$ and $U_{[Z]}$ are respectively submanifolds of $S^{(k-1)m-1}$ and $\Sigma^k_m$, it follows that these two sets are diffeomorphic with each other. Thus, any sub-manifold in $S_Z$ is mapped to a sub-manifold in $U_{[Z]}$. In particular, the intersection of $S_Z$ with a subsphere of $S^*_Z$ is mapped to a sub-manifold of $\Sigma^k_m$, so that the intersection of $S_Z$ with a sequence of nested subspheres of $S^*_Z$ is mapped to a sequence of nested sub-manifolds of $\Sigma^k_m$. Note, however, that these maps are not isometric. Nevertheless, the great circle arc through $Z$ and $W\hat\Gamma$ is mapped to a geodesic in $\Sigma^k_m$ through $[Z]$ and $[W]$.
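The symmetry and non-negative-trace properties of the ordinary Procrustes fit can be checked numerically. The sketch below uses the standard SVD solution for the optimal rotation over SO(m), applied to synthetic pre-shapes; the function name is ours:

```python
import numpy as np

def procrustes_rotation(Z, W):
    """Rotation Gamma in SO(m) minimising ||Z - W @ Gamma|| (ordinary Procrustes)."""
    A = Z.T @ W                          # m x m cross-product matrix
    U, D, Vt = np.linalg.svd(A)          # A = U diag(D) Vt
    # Reflect the smallest-singular-value axis if needed so det(Gamma) = +1
    S = np.eye(A.shape[0])
    if np.linalg.det(Vt.T @ U.T) < 0:
        S[-1, -1] = -1.0
    return Vt.T @ S @ U.T

rng = np.random.default_rng(1)
Z = rng.standard_normal((7, 3)); Z /= np.linalg.norm(Z)   # synthetic pre-shapes
W = rng.standard_normal((7, 3)); W /= np.linalg.norm(W)

Gamma = procrustes_rotation(Z, W)
fit = W @ Gamma                          # the Procrustes fit of W onto Z
M = Z.T @ fit
print(np.allclose(M, M.T), np.trace(M) >= 0)   # True True
```

Algebraically $Z^\top W \hat\Gamma = U \operatorname{diag}(D) S U^\top$, which is symmetric with trace $d_1 + \dots + d_{m-1} \pm d_m \ge 0$, matching properties 3 and 4.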

Appendix 2

Approximate PNSS using PCA

Consider configurations $X_1, \dots, X_n$ in $\mathbb{R}^m$ of $k$ labelled landmarks with Procrustes mean shape $[\hat\mu]$. Their pre-shape matrices are $Z_i$, $i = 1, \dots, n$, and their Procrustes fits to $\hat\mu$ are $Z^P_i$, $i = 1, \dots, n$. Let $v_i$, $i = 1, \dots, n$, be the Procrustes tangent projection of $Z^P_i$ at $\hat\mu$ (Dryden and Mardia, 2016, (4.33)), i.e.

$$v_i = \left( I_{(k-1)m} - \mathrm{vec}(\hat\mu)\,\mathrm{vec}(\hat\mu)^\top \right) \mathrm{vec}(Z^P_i),$$

and let $\hat\gamma_j$ be the unit eigenvector corresponding to the $j$th largest eigenvalue $\hat\lambda_j$ of the sample covariance matrix of $v_1, \dots, v_n$. Here $\mathrm{vec}$ is the vectorise operation of stacking the columns of a $(k-1) \times m$ matrix into a single vector of length $(k-1)m$, and we write $\hat\Gamma_j = \mathrm{vec}^{-1}(\hat\gamma_j)$, where $\mathrm{vec}^{-1}$ is the inverse vectorise operator forming a $(k-1) \times m$ matrix of $m$ columns. Then, $\mathrm{trace}(\hat\Gamma_j^\top \hat\mu) = 0$ and $\mathrm{trace}(\hat\Gamma_j^\top \hat\Gamma_l) = \delta_{jl}$. For any $p$, the $p$-dimensional unit sphere in the $(p+1)$-dimensional linear subspace spanned by $\{ \mathrm{vec}(\hat\mu), \hat\gamma_1, \dots, \hat\gamma_p \}$ is a $p$-dimensional subsphere $S_p$ of $S^{(k-1)m-1}$. A minimal practical requirement is that the first $p$ eigenvalues be strictly positive. If we choose $p$ such that the first $p$ principal components explain a high proportion of the total variation of the data then, without much loss of information, the use of the projection of the $Z^P_i$ to $S_p$ as the input for the principal nested sub-shape-spaces analysis will reduce the initial dimension of the sphere and improve the speed of computation.

The projection of $Z^P_i$ to $S_p$ can be approximated by using the shape principal component scores, where the shape principal component score for the $i$th configuration on the $j$th principal component is given by $s_{ij} = \hat\gamma_j^\top v_i$. For this, let

$$\rho_i = d(\hat\mu, Z^P_i), \qquad w_i = \frac{\rho_i}{\sin \rho_i}\, v_i.$$

Then, $w_i$ is the image of $Z^P_i$ under the inverse exponential map of $S^{(k-1)m-1}$ at $\hat\mu$, it lies in $T^h_{\hat\mu}$ and, in particular, $\| w_i \| = \rho_i$. By writing

$$w_i = \sum_j (\hat\gamma_j^\top w_i)\, \hat\gamma_j,$$

the projection of $w_i$ in the subspace, spanned by $\{ \hat\gamma_1, \dots, \hat\gamma_p \}$, of the tangent space is

$$w^{(p)}_i = \sum_{j=1}^p (\hat\gamma_j^\top w_i)\, \hat\gamma_j = \frac{\rho_i}{\sin \rho_i} \sum_{j=1}^p s_{ij}\, \hat\gamma_j.$$

The image of $w^{(p)}_i$ under the exponential map back in the pre-shape sphere is

$$z^{(p)}_i = \cos\left( \| w^{(p)}_i \| \right) \mathrm{vec}(\hat\mu) + \sin\left( \| w^{(p)}_i \| \right) \frac{w^{(p)}_i}{\| w^{(p)}_i \|},$$

where, if $\| w^{(p)}_i \| = 0$, we take $z^{(p)}_i = \mathrm{vec}(\hat\mu)$. When $Z^P_i$ is sufficiently close to $\hat\mu$, $z^{(p)}_i$ gives a good approximation to the projection of $Z^P_i$ to $S_p$.

Since $\mathrm{vec}(\hat\mu), \hat\gamma_1, \dots, \hat\gamma_p$ are orthonormal, they form a basis of the $(p+1)$-dimensional linear subspace spanned by themselves and, in terms of this new basis, $z^{(p)}_i$ can be expressed as

$$z^{(p)}_i = \left( \cos \| w^{(p)}_i \|,\ \frac{\sin \| w^{(p)}_i \|}{\| w^{(p)}_i \|} \frac{\rho_i s_{i1}}{\sin \rho_i},\ \dots,\ \frac{\sin \| w^{(p)}_i \|}{\| w^{(p)}_i \|} \frac{\rho_i s_{ip}}{\sin \rho_i} \right)^\top.$$

This representation will simplify computation. Conversely, for a given $z = (z_0, z_1, \dots, z_p)^\top \in S_p$, expressed in this basis, its principal component scores are

$$s_j = \frac{\sin \rho}{\rho}\, \hat\gamma_j^\top w, \qquad j = 1, \dots, p,$$

where $\rho = \arccos(z_0)$ and $w$ is the image of $z$ under the inverse exponential map at $\hat\mu$.
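The steps of this approximation (tangent projection at the mean, PCA, projection onto the first $p$ components, and exponential map back to the sphere) can be sketched in numpy. This is an illustrative sketch on a generic unit sphere rather than a full shape-space implementation: `mu` is a stand-in unit vector playing the role of $\mathrm{vec}(\hat\mu)$, the data are synthetic, and all names are ours:

```python
import numpy as np

def log_map(mu, z):
    """Inverse exponential map on the unit sphere at mu (unit vectors)."""
    c = np.clip(z @ mu, -1.0, 1.0)
    rho = np.arccos(c)
    v = z - c * mu                       # tangent projection, norm sin(rho)
    n = np.linalg.norm(v)
    return np.zeros_like(mu) if n == 0 else (rho / n) * v

def exp_map(mu, w):
    """Exponential map on the unit sphere at mu, for tangent w orthogonal to mu."""
    t = np.linalg.norm(w)
    return mu if t == 0 else np.cos(t) * mu + np.sin(t) * w / t

rng = np.random.default_rng(2)
n, d, p = 50, 20, 3                      # n points on S^{d-1}; keep p components
mu = np.zeros(d); mu[0] = 1.0            # stand-in for the mean pre-shape vec(mu)
Z = mu + 0.1 * rng.standard_normal((n, d))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)

W = np.array([log_map(mu, z) for z in Z])       # tangent coordinates w_i at mu
evals, evecs = np.linalg.eigh(np.cov(W.T))
G = evecs[:, ::-1][:, :p]                       # first p eigenvectors (descending)
Wp = W @ G @ G.T                                # projection w_i^(p) onto PC subspace
Zp = np.array([exp_map(mu, w) for w in Wp])     # approximate projection to S_p
print(np.allclose(np.linalg.norm(Zp, axis=1), 1.0))   # True: back on the sphere
```

The rows of `Zp` lie in the span of `mu` and the first `p` eigenvectors, i.e. on the subsphere $S_p$, and would then be the input to the principal nested spheres fit.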