1. Introduction
In the past 60 years, image analysis and the related science have developed at an enormous speed along with computational power. Abnormality detection, an active field of research, can be very useful if key features can be extracted from the image templates under comparison. The process of identifying prominent features generally includes two major steps: (1) transforming the image or shape representations from nonlinear spaces into locally linear coordinates for any necessary calculations, and (2) making inferences from these calculations via statistical analyses. Among various representations used, landmark representation, despite its long history and simplicity, has proven to be one of most effective tools in making statistical inferences on the shape/image objects.
The traditional Procrustes methods extract residual vectors of landmark locations through calculating a group average, which serves as a quantity similar to the statistical mean
[3, 4]. The resulting residual vectors are then analyzed via classical statistical techniques for making inferences. Alternatively, a class of diffeomorphic methods, that were first introduced in the 1990s [15, 10] and developed, both theoretically and numerically more recently, [1, 18, 2, 27, 28, 35, 37, 26] model the given image data through certain manifolds and relate two deformable templates through geodesics on the selected manifolds. For detailed information about the development of this class of methods, we refer readers to the book and paper by Younes et al. [38, 29]. The geodesic of a Riemannian manifold is governed by a class of partial differential equations (PDE) called the EulerPoincaré (EP) equations
[8, 17, 30, 37, 25]. When images or shapes are represented by landmarks, geodesic matching algorithms can be constructed via finding the proper initial conditions of the equivalent finite dimensional particle system of the EP equations. For example, an efficient template matching algorithm that takes advantage of the landmark representation and the particle formulation of the EP equation was introduced by Camassa et al. [6, 7]. This method uses a constant matrix to update the search direction of the geodesic shooting for finding the initial conditions that evolve the reference template to the target one, instead of the traditional methods of forwardbackward integration for updating the gradient (known as the LDDMM) or Newton’s optimization. Another feature of this method is that a nonsmooth conic kernel for the particle system is used to accelerate the convergence of matching. We give a brief introduction of the aforementioned geodesic matching algorithm in Section 2. Using this matching algorithm as our underlying warping method, in this paper we propose : (1) a landmarkbased algorithm calculating the residual momentum representation for each landmark template as feature vectors and (2) a Bayesian approach on the resulting feature vectors for detecting the most statistically significant landmarks that encode the difference between groups under comparison. While calculating the momentum representation, like in the Procrustes approach, an estimator as an average of a group on a general Riemannian manifold is also calculated as a byproduct. Some similar estimators of central tendency in this context are also given in [2, 12]. The resulting residual momentum at convergence then provides a locally linear space for applying our Bayesian statistical analysis for detection purposes. This is the novelty of our proposed algorithm as well as the major difference that sets it apart from the existing literature
[35, 32, 23, 33, 12].The paper is organized as follows. In Sections 2 and 3, an averaging algorithm on the momentum field under a deformative template setting is constructed using a landmark representation and the particle shooting algorithm. We show how the momentum representation for each template is collected during the averaging procedure. In Section 4, we present an example of practical usage of the proposed algorithm for medical image abnormality detection. Using the momentum representation based on the calculated average template, a Bayesian predictive statistical approach is presented to detect landmarks whose momentum coordinates encode the most significant differences between groups under comparison. We call this set of landmarks a predictor. Finally, we show that the brain structure changes identified by our predictors are highly consistent with studies in the literature.
2. Basic Formulation
For template matching, based on the principle of diffeomorphism, the warping between two images is established through the geodesic defined on the diffeomorphism group Diff or Diff for some , which is an open subset of the space , where is an identity map. Suppose that such a space is denoted by . For , a vector in the tangent space of with identity , its norm is defined by:
(2.1) 
Suppose that the operator is defined by , the associated Green’s kernel of is given by
(2.2) 
where, is the modified Bessel function of the second kind of order and is the usual notation for the Gamma function [30]. A notable special parametric choice is the twodimensional () Green function’s with and , resulting in the simple form^{1}^{1}1In this case the space is
(2.3) 
This yields a conic shape kernel that has advantages for template matching as discussed in [7]; we will use this later in our numerical experiments.
The length of a curve is defined by using the right translation
(2.4) 
Furthermore, an energy functional can be defined by
(2.5) 
The geodesic is usually obtained by minimizing the energy functional rather than the length . This is because minimizing chooses not only the best path but also the simplest parametrization of time in that satisfies the constraints and , where is referred to as the reference, whereas is the target [30]. In this case, we have , where with being an invertible increasing function (a reparametrization of time) from [0,1] to itself. Note that we can assume the existence and the uniqueness of such a geodesic in our problems, as long as the manifold we consider is compact and we only work on some neighborhood of .
The EulerLagrange equation, arising from minimizing the energy functional, turns out to have exactly the same form as the EP equations [27, 30]:
(2.6) 
with , subject to the “time” boundary conditions and . Assuming that the space is geodesic complete, we can convert this boundaryvalue problem into finding the corresponding initial condition for an initialvalue problem. Let the corresponding initial condition be defined through a proper logarithmic map from M to , the tangent space of at , by:
(2.7) 
The corresponding exponential map from to M then can be defined accordingly by
(2.8) 
The Exp map is an isometry between and to the first order [30], here we take the results from finitedimensional Riemannian geometry since this paper will only deal with finite landmarks:
(2.9) 
where
is the norm of curvature tensor and
. By letting , , the term . Combining all the previous results, we will have(2.10) 
Thus, numerically, the length of tangent vector can be used to estimate the length of the geodesic. This firstorder isometry is illustrated in Figure 1.
Under the setting of landmark representation, a diffeomorphism can be represented by its actions on the landmark space , i.e.
(2.11) 
This gives a finite dimensional approximation of the original infinite functional space^{2}^{2}2Readers are referred to [13, 14] for methods dealing with nonlabeled landmarks.. In the landmark space , it is convenient to define a pair of variables for each landmark and the corresponding momentum . For given initial data , the time evolution of the EulerLagrangian (or the EP) equations is obtained by evolving the particle system in time for [6]:
(2.12) 
The above particle system gives an easy way of calculating the tangent vector, which is the velocity field of the EP equations,
(2.13) 
It is worth noting that numerically solving the particle system (2.12) is equivalent to approximating an Exp map. Moreover, with this particle setting, on the tangent space, numerically it is advantageous for us to directly work on the momentum field , other than the velocity for the Log map. Detailed discussion about the advantage of using the momentum field is in Section 3. Note that the conversion between the velocity and the momentum fields is straightforward by using Eq. (2.13).
For our applications, landmark positions at and are known, while the initial momenta at are not. A shooting algorithm that searches for the initial momenta is introduced in [7]. In that algorithm, a predictioncorrection step:
(2.14)  
(2.15) 
where
(2.16)  
(2.17)  
(2.18) 
is coupled with the particle system (2.12) to search for the appropriate initial conditions , , so that . This shooting algorithm numerically generates the Riemann Log map. We refer readers to [7] for more details about this algorithm. Figure 2 gives a schematic illustration of this shooting algorithm.
3. A geodesic shooting algorithm for an average
In the literature, there are many classical approaches for obtaining an average of a group [3, 4, 22, 30, 33]. The basic idea of these approaches is to calculate the Karcher mean or Fréchet mean with different choice of metrics [19, 11]. Recently more sophisticated estimators related to tangent spaces of Riemannian manifolds have been studied for their potential to be used as averages of groups [12, 35, 36, 32]. Among all these approaches, the classical morphormetric average, like the Procrustes average, in a sense, ignores the nonlinear and nonflat geometric features. On the other hand, to consider such features in finding an average under the setting of diffeomorphic deformation, it is necessary to find a nice linear space (at least locally), on which performing linear combinations is both mathematically sound and physically meaningful. To this end, we choose the momentum field as such a space for our proposed algorithm (described below), because the momentum provides a local templatebased coordinate system which is also linear in nature [17, 32, 35]. As a result, in our algorithm, when we find an average of a group in the momentum field, and map the result through the Exp map back to the manifold, in principle certain information of the nonflat geometry will still be preserved. At convergence, the initial momentum that drives the average to each individual observation will serve as the momentum representation of that landmark template and used for later statistical analysis.
Let denote the average of the group . We use Figure 3 to illustrate the proposed averaging procedure, with details given in Algorithm 1.
(3.1) 
(3.2) 
(3.3) 
(3.4)  
(3.5) 
Note that in Algorithm 1, in Eq. (3.3) is called the weight. Two special weighting schemes are worth noting. In the first scheme, the weight is chosen to be
(3.6) 
This weighting scheme mimics the direct average method for the Karcher mean in flat Eulerian spaces, which gives the solution of the least sum of distance squares, i.e.
(3.7) 
Note that for this weight, each member has an equal opportunity for contributing to the group average, which makes this scheme equivalent to that introduced in [32]; the latter reference includes a thorough discussion of the socalled "center of mass" in Riemannian manifolds under a statistical setting.
The second weighting scheme is motivated by the idea of finding the geometric median [12], a robust estimator when the dataset has corrupted members. Using the idea from the Weizfeld’s algorithm for Eulerian spaces [20], the assigned weight for this scheme is calculated by
(3.8) 
For this choice of weight, the estimator is the value that minimizes the sum of distances, denoted by
(3.9) 
where is the geodesic connecting the unknown group average and each group member , i.e. the restriction is now and for each .
The existence and uniqueness of both “mean" () and “median" () were investigated in [12, 32] respectively with assumptions on the manifold that either one of the below holds:

the sectional curvature ;

, and diam(U)< with a subset of the manifold that can cover all observations.
Following the idea of Kuhn and others [20, 12], we now give a proof that shows the convergence of Algorithm 1 for both choices of weights, (3.6) and (3.8) under the condition (b).
Theorem 1 (Local Convergence of Algorithm 1).
Proof:
If the condition (b) is satisfied, there exists a unique minimizer on the manifold [12, 32]. We only need to show the introduced algorithm actually converges under this condition.
We first show claim (1). Let , where and and is a manifold. Let . The distance is defined by , with the norm induced by operator .
Since is a geodesic convex function, there will be no local optimum on . In order to show that the algorithm converges, we only need to prove if is not the minimum.
Let be the iteration of Algorithm 1, with the following defined as:
(3.10) 
Hence,
(3.11)  
(3.12)  
(3.13) 
It is worth noting that the last inequality holds, only if all observations and the condition (b) is satisfied. In this case, we can use Toponogov Comparison Theorem to obtain the inequality.
Let be the geodesic triangle on the manifold, see Figure 4. Toponogov Comparison Theorem states that if observations satisfy the condition (b), for each corresponding side of the two triangles, the distance on the is less than the corresponding distance on , i.e. .
Hereafter, for a fixed , we drop in and denote the norm by for simplicity. Suppose that the algorithm uses the update , where , then is the center of gravity of weights placed at , and hence minimizes the function^{3}^{3}3Since is the denominator, observations cannot be selected as an initial guess for starting the algorithm.:
(3.14) 
Thus .
On the other hand,
(3.15)  
(3.16) 
Substituting this into , we get
(3.17) 
Hence, we have
(3.18) 
Now combining Eqs. (3.13) and (3.18) yields
(3.19) 
Note that the equalities in (3.19) hold if and only if , which is equivalent to , i.e. the iteration stops and the minimum is reached. This proves that Algorithm 1 converges when the weight (3.8) is used. Moreover, at convergence, the result of Algorithm 1 minimizes the sum of the distances.
Following the above structure, and using the same notation, we can prove claim (2). Below, we repeat the proof but skip some steps to avoid redundancy. Redefining , we have . Since , the function is now defined by
. So by the same argument, we have
The equality holds only if . This proves the convergence of Algorithm 1 with equal weights, and the result gives the least sum of distances squared.
We remark that similar to Weizfeld’s algorithm for flat Eulerian spaces, the initial guess for Algorithm 1 cannot be chosen as any of the group members , otherwise the algorithm will not converge. However, there is a way to get around with this restriction in flat Eulerian spaces [36], so that the algorithm converges for any initial guess. The extension of it on general Riemannian manifolds is still under our investigation. We also remark that Pennec [32] investigated other geometric measurements in a statistical setting for the equally weighted schemes, see [32] for further details.
The two weighting schemes give rise to different average templates for a group, based on which, the resulting residual momentum representation for each group member will be different. We remark that the result of Theorem 1 relies heavily on the condition (b) which may not hold for every dataset, as discussed in [24]
. Nevertheless, in this paper, we intend to use the algorithm as a feature extraction tool, a preprocessor to extract momentum features for the purpose of detecting local changes that are statistically significant. The data we use are often recorded from the same or similar objects, such as the annotated brain landmarks of different people or of the same people but obtained at different times. In these cases, observations are assumed “close", and hence convergence can be expected. In other cases, the algorithms may not converge
^{4}^{4}4For example, curvatures are negative., or the minimizer is not unique.^{5}^{5}5For example, curvatures are positive, but diameter inequality is not satisfied for observations. In such cases, we will use a smaller scale parameter in Green’s kernel^{6}^{6}6All the following examples are conducted using unless specifically specified otherwise. , so that the matrix with entry will become more diagonally dominated when used for calculating the geodesic. Landmarks in this case will become less correlated with each other and the deformation will look more linear. By doing so, the geometry of the landmark deformation may become closer to flat case, and we can hope that the diameter inequality in condition (b) could then be satisfied and the proposed algorithms can at least converge numerically.3.1. Example
In this example, we calculate the average for a group of 20 shapes by using the two different weighting schemes, respectively. 20 landmarks are used to represent each shape. In the group we consider, a portion of the group (, where ), the shapes are the ellipses , where
obeys a normal distribution:
, whereis the identity matrix, whereas the rest of the group are the socalled outliers, which are copies of the heartshape curve:
(3.20) 
Figure 5 shows a typical ellipse and a typical outlier in this group. We show the comparison between the two weighting schemes in Figure 6, where is chosen to be , , , and for different levels of outliers.
It is clear from the figure that the equally weighted scheme absorbs the traits equally from each member of the group; namely when the level of outliers is higher (larger ), the shape of the resulting average deviates more from the shapes of the majority, which are a distribution of ellipses. On the contrary, the other weighting scheme gives a robust estimator, in the sense that the obtained average stays close to the majority of the group, in particular when is large.
We remark that the use of the weighting schemes depends on the goals of the application. For example, the equally weighted scheme is sensitive to outliers, which can be advantageous if the goal is to to detect outliers in streaming data. On the other hand, if a most representative average is the aim of the application, then the robust algorithm should be used to obtain a robust estimator.
4. Abnormality Detection
Once an average for a group is defined and properly calculated, the next two important questions are

Whether or not new data can be classified into a member of this group by comparing with the average?

What features from the new data can be used to justify the decision made in (a)?
These two questions are part of an even bigger topic of feature extraction in pattern recognition and classification, namely finding features with which data from different categories can be effectively discriminated. To answer these two questions, we need to seek a set of meaningful variables that can carry necessary information about the nature of the data. The two variables we choose for the task are (i) a global variable , which is the Hamiltonian that indicates the energy required for the deformation between two sets of landmarks [7], and (ii) a local variable , which is the collection of momentum at each landmark calculated by Algorithm 1.
The global variable could be used to distinguish templates of different kinds, for example brains between humans and chimpanzees. In [7], we have utilized for clustering and classifying different simple contour curves. However, it may not be sensitive enough to compare subcategories of the same kind, such as brain structures between people with and without schizophrenia where only minor local changes could cause a difference. To distinguish such a difference, we need to resort to the local variable. In this section, we will focus on applying the residual momentum from Algorithm 1 for abnormality detection. A Bayesian predictive approach will be used to assess the information hidden behind the momentum representation . At the same time, we will discuss advantages of using the momentum representation over the traditional location representation. In order to justify this advantage numerically, we use a dataset of 28 parasagittal brain images for illustration. Each image is annotated by 13 landmarks. The dataset can be sorted into two groups. One contains 14 images of nonschizophrenic people, while the other has 14 images from confirmed schizophrenic cases. For more details about this dataset, we refer the reader to the link [16]. The annotated landmarks of a typical brain image in this dataset are shown in Figure 7. In applications like this, observations on landmarks of different individuals are essentially close, so we can expect the convergence of Algorithm 1.
4.1. Detection
For each image in the dataset, the annotated landmarks represent locations of human brain structures. It is well known that landmark locations could be different between a schizophrenic and a nonschizophrenic template. Such differences are sometimes beyond visualization and can only be identified through mathematical and statistical analysis. These landmarks carry certain features and information that can help distinguish schizophrenic images from a group of nonschizophrenic templates. We refer to the collection of these landmarks as a predictor. A predictor is viewed as a signature for describing the difference between templates. For the remainder of this section, we focus on finding the landmarks that comprise a predictor for our image data. The upshot of this exercise could help medical researchers locate those specific landmarks (structures of brain), and shed light on the development of the disease.
It is worth noting that in order to obtain convincing results, more than one method should be applied for validation. One of the traditional Procrustes methods uses superposition of datasets to compare the variations for each landmark. Adopting this approach, Bookstein[4] showed that landmarks 1 and 13 are the “most different” landmarks between the two groups in the dataset. For comparison, below we apply Algorithm 1 and use a Bayesian approach on momentum coordinates of each template to detect the abnormal landmarks.
Our algorithm begins with finding the momentum coordinates for each template in the database. Momentum coordinates for templates of the nonschizophrenic group will be calculated through Algorithm 1. For each , is a 13 by 2 matrix, in which the th row () contains the momentum coordinate calculated at the th landmark. We denote the resulting average of this group by . After is found, momentum representations based on the average calculated from the previous step for each individual in the schizophrenic group, denoted , will be computed by the Riemannian logarithm map, i.e. . To apply statistical methods for the momentum coordinates, we construct the observation data matrix for each landmark. For each landmark , we collect the row of the momentum coordinates of the nonschizophrenic group to form a matrix . Similarly, we construct the data matrix for the schizophrenic group.
Note 1.
We remark that for the image in each group, e.g. the nonschizophrenic group, the momentum coordinators are correlated, whereas is independent of , the momentum coordinate of first landmark of the image in this group. Our statistics in the next section are based on the independent momentum coordinates computed, respectively, from the 14 different images with respective to the average within the two groups.
4.2. Statistical methods for momentum data
In this section, we use two statistical methods to process the momentum data for landmarks . Our aim is to identify those landmarks associated with changes between nonschizophrenia group and the compared schizophrenia group through analysis on their residual momentum representation. If both methods give consistent conclusions that are also consistent with the classical Procrustes method, the result by our method will be trustworthy in this task. We do not use classical hypothesis testing since our sample sizes are relatively small and most of the classical results rely on large sample asymptotics. Thus, we use a simple distancebased approach and also construct a Bayesian predictive distribution which conditions on the data and does not rely on asymptotics.
4.2.1. Method 1(Norm of mean momentum)
The first statistical method we use to analyze the momentum data is to directly compare the norm of mean momentum at each landmark. This method follows an intuitive idea that landmarks with large mean momentum calculated from the schizophrenic group may indicate abnormalities. In this method, for each and , , we compute the norm of mean momentum by
(4.1) 
Note that for each , there are two components. The norm evaluates the square root of the sum of component squares. Table 1 and Table 2 summarize the norm of the mean momenta calculated at each landmark for the two groups, respectively. For the schizophrenic group (see Table 2 ), we note that Landmarks 1, 6 and 13 have unusually large values compared to other landmarks. We group these landmarks together as a predictor for this method.
Norm of Mean Momentum for N Group  

Landmark  1  2  3  4  5  6  7 
0.0001  9.39e06  1.17e05  9.07e06  1.11e05  0.0001  3.74e05  
Landmark  8  9  10  11  12  13  
2.6610e05  1.3453e05  1.9547e05  2.4342e05  4.8575e06  5.9788e05 
Norm of Mean Momentum for S Group  
Landmark  1  2  3  4  5  6  7 
0.0338  0.0533  0.0995  0.0932  0.0432  
Landmark  8  9  10  11  12  13  
0.0722  0.0635  0.0270  0.0826  0.0873 
4.2.2. Method 2 (Bayesian Posterior Predictive distributions)
While the method by observing the sample mean can help provide certain information on abnormal location of landmarks, using Bayesian (posterior) predictive distributions we can obtain distributions for the landmarks in each of the two groups. This offers the advantage that the inference is not on the means of the landmarks, but on the distributions of the landmarks themselves. The landmarks are ‘observables’ while their means are parameters and ‘latent’; inference on observables can be validated with future data as more images are obtained. Predictive distributions of the momentum data for each landmark are obtained via a Gibbs sampler for sampling from the posterior predictive distributions; details of the statistical modeling and its implementation are in Appendix A.
For each landmark, the respective predictive distributions of the two groups are compared to decide whether there is a significant difference. The predictive distribution is a surface, which is hard for us to analyze the difference. Instead, we use the contour lines of for their comparison. Figure 8 shows the plots of the 95% contour lines of the two groups for each of the 13 landmarks.
To quantify the difference between the contour lines for each landmark, we compute the ratio of the overlapping area of the two contour regions over the union of the two regions. Let and represent the contour region for each group, respectively, for the landmark. We define the ratio as . We expect that the smaller the value is, the more different the two distributions will be. In practice, we can approximate by to avoid the extra computation of 2D integration, where . Here with and are the central intervals of the marginal predictive distributions. is defined exactly the same way.
Table 3 shows the approximated ratios for the 13 landmarks. Landmarks with (i.e. approximately less than a half of the total region is overlapped) is considered as having significant difference between the two groups, and they will be grouped together as a predictor. In this case, the predictor for this method includes landmarks 1, 2, 6 and 13. We remark that the momentum field used in our calculations is advantageous. For comparison, we repeat exactly the same calculations, but replace the momentum data by the original position data. Table 4 is the result of using position data, the counter part to Table 3. As we can see that no landmarks have ratios that are less than 0.5 in Table 4. This indicates that the momentum representation has better separability than the location one.
Results on Momentum Data  
Landmark  1  2  3  4  5  6  7 
0.6906  0.6542  0.5561  0.7187  
Landmark  8  9  10  11  12  13  
0.6825  0.5348  0.6155  0.5470  0.5554 
Results on Position Data  

Landmark  1  2  3  4  5  6  7 
0.7900  0.6636  0.5480  0.6249  0.5559  0.6045  0.5528  
Landmark  8  9  10  11  12  13  
0.5887  0.5022  0.6171  0.5560  0.6066  0.6063 
The above three methods provide different viewpoints for our investigation. Method 2 finds a predictor by directly examining the peak values of the parameters (the 2norm of mean momentum in our case). Method 3 uses Bayesian inference to search for a predictor. By examining the mean momentum, we found an additional abnormal landmark 6 when compared to Procrustes Method. The Bayesian approach, which also models the data, helps to dig out another suspected abnormal landmark 2. The three methods here seem to discover abnormal landmarks at different scales. In Table
5, we summarize the predictors found in the literature and our methods. The results are consistent.Method  Predictor 

Procrustes Method in [4]  1,13 
Mean Momentum (MM)  1, 6,13 
Predictive Distribution Contour (PDC)  1, 2, 6, 13 
Finally, we comment that from Table 5, the structure differences between the groups of schizophrenia and nonschizophrenia in our database most likely occur at the locations annotated by landmarks 1, 2, 6 and 13. Indeed, the region enclosed by these landmarks (see Figure 9) were investigated in clinical trials and found to be relevant to schizophrenia from some of the literature [21, 31].
5. Conclusions
We have proposed a Bayesian approach using the residual momentum representation of landmark templates through the calculation of a group average. We prove a convergence theorem for the averaging algorithm with different weights under certain conditions. The resulting residual momentum variable from the averaging process provides a local templatebased linear coordinate system. It also serves as a representation which could enhance potential differences between groups under comparison when using the original location representation cannot, and thus is advantageous for related statistical analysis like abnormality detection. We apply the proposed approach on a medical image database. The database contains 14 nonschizophrenic and 14 schizophrenic templates annotated with 13 landmarks. The region of brain structure changes in schizophrenia found in our study is highly coherent with that in the literature.
Finally, though there are certainly much remain to do, the applications of this proposed algorithms could have impacts on clinical medicine, forensic, and surveillance. For example, medical image abnormality detection will help physicians for early diagnoses on schizophrenia, alzheimer, and heart diseases. Fast automatic face recognition algorithms will provide forensic identification and instantaneous image matching in surveillance that helps governments for intelligence gathering, the prevention of crime, the protection of a process, person, group or object, or for the investigation of crime.
Appendix A Statistical model specifics and implementation of method 2
The data are computed from the analysis of images from two groups of 14 nonschizophrenic and 14 schizophrenic brains, so that within each group at each of 13 landmarks, we have 14 observations (number of brains in group) on bivariate data (x,y components). Notationally, the vectors in the and directions result in data which we denote as . In what follows, we use the notation to denote a subset of formed by collecting vectors over the index .
For fixed , we model the vectors
as conditionally independent bivariate normal random variables given the mean,
, and covariance matrix, yielding the joint probability distribution,
(A.1) 
where denotes a probability distribution with for the conditional version. For purposes of data modeling, the bivariate distribution can be rewritten as the product of , the marginal distribution of the first component of the vector, with
, the conditional distribution of the second component given the first, where both are univariate normal distributions. This reformulation yields a likelihood where the covariance structure is a function of the two standard deviations (
) and one correlation (). There is more flexibility in assigning priors to scalar parameters (Lunn et al. 2013, p.226), whereas the usual bivariate formulation restricts priors for the covariance matrix, , to the Wishart distribution with little flexibility in the prior’s shape. Given data from a combination, we denote the contribution to the likelihood for as , which is proportional to the quantity in equation (A.1). Then, the overall likelihood is , where .We used independence priors for the mean and covariance parameters,
where the components of each were modeled hierarchically. Specifically, for the mean parameters, we took , which implies that the means for the and components of the vectors had independent prior distributions. Across the combinations, the means for the components were assumed to be conditionally independent with and . We used fairly diffuse hyperprior distributions and . Fixing the hyperpriors at specific constants resulted in no difference in the final results.
The components of the covariance matrices were also modeled to have independent prior distributions,
with prior conditional independence for combinations and distributions , and ,
, which give diffuse gamma distributions. Finally, the priors for the correlation parameters were also conditionally independent,
; that is, a truncated normal distribution with and . Fixing parameters of the priors to constants gave similar results to assigning hyperprior distributions.The overall prior distribution for , , is given by a product of the prior and hyperprior distributions, with the hyperparameters integrated out. The resulting posterior distribution for is obtained as
with the high dimensional integration in the denominator bypassed using Markov chain Monte Carlo (MCMC) sampling.
Furthermore, we use the posterior distributions to obtain posterior predictive distributions for ‘new’ observations from each of the 26 landmarkgroup combinations. Using predictive inference to assess the potential similarities and differences across the two groups at each landmark gives inference in terms of the observables, namely, the vectors instead of making inferences on the parameters in used to model the data on the vectors. The predictive distributions are obtained as
where is a bivariate normal distribution with conditionally independent of the observed data . Again, MCMC is used to bypass this integration.
We used open source programs for all model implementation; specifically, R (R Development Core Team 2015) using the ‘rjags’ (Plummer 2012b) package which interface with MCMC sampling programs ‘Just Another Gibbs Sampler (JAGS)’ (Plummer 2003). The ‘coda’ package (Plummer 2012a) assists in managing and presenting sampling output. The ‘emdbook’ package (Bolker 2014) was used to produce contour plots.
Appendix B Acknowledgements
The paper was completed while SH was in the Statistics Department at the University of Wyoming.
References
 [1] Beg, M. F., Miller M. I., Trouvé, A., and Younes, L., Computing large deformation metric mappings via geodesics flows of diffeomorphisms. Int. J. Comp. Vis., 61(2), (2005), pp 139157.
 [2] Beg M. F., Khan, A.. Computing an average anatomical using LDDMM and geodesic shooting. IEEE, ISBI, (2006), pp 1116–1119.
 [3] Bookstein, F. L., Landmark methods for forms without landmarks: morphometrics of group differences in outline shape. Medical Image Analysis, 1(3), (1996), pp 225243.
 [4] Bookstein, F. L., Biometrics, Biomathematics and the morphometric synthesis. Bulletin of Mathematical Biology, 56(2), (1996), pp 313–365.

[5]
Brunelli, R.,
Chapter 11,Template matching techniques in computer vision: theory and practice
John Wiley & Sons, Ltd. 2009.  [6] Camassa, R., Kuang, D., and Lee, L., Solitary waves and particle algorithms for a class of EulerPoincaré equations. Stud. Appl. Math., 137, (2016), pp 502546.
 [7] Camassa, R., Kuang, D., and Lee, L., A geodesic landmark shooting algorithm for template matching and its applications. SIAM Journal on Imaging Sciences, 10(1), (2017), pp 303334.
 [8] Chertock, A., Du Toit, P., and Marsden, J. E., Integration of the EPDIFF equation by particle methods. ESAIM:M2AN, 46, (2012), pp 515534.
 [9] Duda R. O., Hart P. E., and Stork D. G. Pattern Classification, 2nd Edition. John Wiley & Sons 2001
 [10] Dupuis, P., Grenander, U., Miller, M. I., Variational problems on flows of diffeomorphisms for image matching. Q. Appl. Math., 56, (1998), pp 587600.
 [11] Fletcher, P. T. , Lu, C., Pizer, M., and Joshi, S., Principal geodesic analysis for the study of nonlinear statistics of shape. IEEE Trans. on Medical Imaging, (2004), pp995–1005.
 [12] Fletcher, P. T., Venkatasubramanian, S., and Joshi, S, Robust statistics on Riemannian manifolds via the geometric median. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2008), pp. 18.
 [13] Glaunes, J. and Joshi, S., Template estimation from unlabeled point set data and surfaces for computational anatomy. Proceedings of the International Workshop on the Mathematical Foundations of Computational Anatomy (MFCA2006).
 [14] Glaunes, J., Trouvé A. ,and, Younes, L., Diffeomorphic matching 10 of distributions a new approach for unlabelled pointsets and submanifolds matching, Proceedings of CVPR’04.
 [15] Grenander U., General pattern theory Oxford University Press, Clarendon, UK, 1993.
 [16] http://life.bio.sunysb.edu/morph/data/bookschizo.dta.
 [17] Holm, D. D, Ratnanather, J. T., Trouvé, A., and Younes, L., Soliton dynamics in computational anatomy. Neuroimage, 23, (2004), pp 170–178.
 [18] Joshi, S. and Miller, M. I., Landmark matching via large deformation diffeomorphisms. IEEE Trans. Image Processing, 9, (2000), pp 13571370.
 [19] Karcher, H., Riemannian center of mass and mollifier smoothing. Communications on Pure and Applied Mathematics, 30(5):(1977) pp509–541.
 [20] Kuhn H., A note on Fermat’s problem. Mathematical Programming, 4, (1973), pp 98107.
 [21] Marsh, L., Suddath R. L., Higgins, N., and Weinberger, D. R., Medial temporal lobe structures in schizophrenia: relationship of size to duration of illness. Schizophrenia Research, Vol. 11, Issue 3, (1994), pp 225238.
 [22] Lorenzen, P., Davis, B., and Joshi, S., Unbiased atlas formation via large deformations metric mapping. In Int. Conf. on Medical Image Computing and Computer Assisted Intervention, 8, (2005), pp 411–418.
 [23] Ma, J., Miller, M.,Trouvé, A.,and Younes, L., Bayesian template estimation in computational analysis. Neuroimage, 42, (2008), pp 251–262.
 [24] Mario Micheli, Peter W. Michor, and David Mumford Sectional Curvature in Terms of the Cometric, with Applications to the Riemannian Manifolds of Landmarks. SIAM J. Imaging Sciences Vol.5, No.1, (2012), pp. 394433.
 [25] McLachlan, R. I. and Marsland S., Nparticle dynamics of the Euler equations for planar diffeomorphisms. Dyn. Sys., 22(3), (2007), pp 269290.
 [26] McLachlan, M. I. and Marsland , S., Discrete Mechanics and Optimal Control for Image Registration. Anziam Journal,, 48, (2007), pp 1–16.
 [27] Miller, M. I., Trouvé, A., and Younes, L., On metrics and the EulerLagrange equations of computational anatomy. Annual Reviews in Biomedical Engineering, 4, (2002), pp 375405.
 [28] Miller, M. I., Trouvé, A., and Younes, L., Geodesic shooting for computational anatomy. J. Math Imaging, 24, (2006), pp 209228.
 [29] Miller, M. I., Trouvé, A., and Younes, L., Hamiltonian Systems and Optimal Control in Computational Anatomy: 100 Years Since D’Arcy Thompson. Annu Rev Biomed Eng. , 17, (2015), pp 447509.
 [30] Mumford, D. and Desolneux, A., Pattern Theory: The Stochastic Analysis of Real World Signals. A K Peters, Lid, Natick, MA, 2010.
 [31] Nopoulosa, P. C., Ceilleya, J. W., Gailisa, E. A. and Andreasen, N. C. , An MRI study of midbrain morphology in patients with schizophrenia: relationship to psychosis, neuroleptics, and cerebellar neural circuitry. Biological Psychiatry, Vol. 49, Issue 1, (2001), pp 1319.
 [32] Pennec, X, Intrinsic statistics on Riemann manifolds: basic tools for geometric measurements. Journal of Mathematical Imaging and Vision, 25, (2006), pp 127–154.
 [33] Srivastava, A., Joshi, S., Mio, W., and Liu, X., Statistical shape analysis: clustering, learning and testing. IEEE transactions on pattern analysis and machine intelligence, 27(4), April, (2005).
 [34] Tome, P., Fierrez, J. VeraRodriguez, R., and Ramos, D., Identification using face regions: Application and assessment in forensic scenarios. Forensic Science International , 233, (2013), pp 75–83.
 [35] Vaillant M., Miller, M. I., Younes L., and Trouvé A., Statistics on diffeomorphisms via tangent space representations. NeuroImage, 23, (2004) pp161–169.
 [36] Vardi, Y. and Zhang, C.H. . The multivariate L1median and associated data depth. Proceedings of the National Academy of Sciences of the United States of America 97 (4), (2000), pp 1423–1426 (electronic).
 [37] Younes L., Arrate F., and Miller M. I., Evolutions equations in computational anatomy. NeuroImage, 45, (2009), pp 40–50.
 [38] Younes, L., Shapes and Diffeomorphisms. Applied Mathematical Sciences, Volume 171, Springer, 2010.
Comments
There are no comments yet.