1 Introduction
A compositional data vector has nonnegative components which sum to a constant, usually taken to be , as we assume here. Such vectors lie in the standard simplex, , which has the form
(1) 
where . Compositional datasets appear naturally in many fields including medicine, geology, archeology, biology and economics; see e.g. [2].
The following two approaches have been widely used in compositional data analysis. One of these approaches, due to Aitchison [1, 2], is to apply the following transformation, known as the centred logratio transformation:
(2) 
Also, define . Note that, by definition, the components of sum to . Given a sample of compositional vectors , we transform to using (2
). Aitchison then proposed applying standard multivariate methods, such as principal components analysis, to logtransformed compositional vectors, which is tantamount to assuming the standard Euclidean metric is appropriate for the
.Another popular approach is to “do nothing”; i.e. assume that the Euclidean metric inherited from the ambient Euclidean space is appropriate for the , and apply standard multivariate methods directly to the . For advocation of this approach see Baxter [3, 4], Baxter et al. [5] and Baxter and Freestone [6].
For some datasets, the logratio approach seems preferable, whereas for other datasets, the “do nothing” approach seems superior. Nevertheless, the choice over which method to use, if either, has occasionally become heated and acrimonious. See the paper by Scealy and Welsh [19] for a careful and detailed discussion of the issues and history of this debate. Our view concerning the choice of metric is that it is not possible to come up with a compelling choice for either method based on purely a priori or theoretical grounds, and that a more pragmatic approach is to make a datadependent choice of metric.
Bearing this in mind, one possibility is to consider a finitedimensional family of transformations, such as the onedimensional transformation family considered in this paper, where each transformation implies a choice of metric. Then use the data to choose the optimal transformation, which in turn determines an optimal metric. It is not so clear how to do this nonparametrically, but if a parametric family of models, such as the Dirichlet in the present context, is assigned to the transformed compositional data, then we can estimate both the distributional parameters and the transformation parameter simultaneously by maximum likelihood. Using this approach we estimate the parameters of the transformed compositional data model. In effect, the optimal transformation parameter compels the data to resemble the chosen parametric family as closely as possible. However, after carrying out this approach with a number of real and simulated datasets we found that in cases where the maximum likelihood estimator of is close to , there is a tendency for the other (Dirichlet) parameters to be rather large. See Tables 14 in Section 5 for examples of this phenomenon. This has motivated us to take a close look at the asymptotics as (i) and (ii) the Dirichlet becomes highly concentrated. This theoretical and numerical investigation is the principal focus of our paper.
The phenomenon mentioned in the previous paragraph potentially arises with analogous families of metrics on other manifolds should we wish to select an optimal metric in a datadependent way. Consequently, before focusing on compositional data in the remainder of the paper, we briefly discuss the question of choice of metric in the setting of general manifolds in Section 2.
The structure of this paper is as follows. In the next section we discuss the choice of metric in the statistical analysis of manifoldvalued data from a broader perspective. Then, in Section 3, we briefly review the transformation for compositional data. In Section 4 we present some asymptotics for small when the transformation is applied to the Dirichlet family, and in Section 5 we present numerical results, both from simulations and the analysis of real datasets. These numerical results reflect the theoretical results in the previous section. Proofs are given in Section 6.
2 Statistics on manifolds and choice of metric
Since the middle of the last century, statistical analysis of data which naturally lies in a manifold has grown enormously in interest and relevance. An important early example was R.A. Fisher’s paper [12] on directional statistics with application to paleomagnetic directional data. Subsequently, the field of directional statistics, in which the sample space consists of the surface of a unit sphere in Euclidean space (or, equivalently, the set of directions or unit vectors) has grown into a welldeveloped and mature field. See, for example, the books by Mardia [17], Mardia and Jupp [18] and Fisher et al. [13].
In a related but different direction, there has been great interest in statistical shape analysis from the 1980s onwards. A popular and mathematically deep approach to statistical shape analysis was initiated by D.G. Kendall; see the monograph Kendall et al. [16]. Other books on this field include Small [20] and Dryden and Mardia [10, 11]. In the Kendall approach, objects in Euclidean space, usually in two or three dimensions, are represented by the coordinates of a finite set of labelled landmarks. For example, a human face could be represented approximately by landmark coordinates at obvious places including the eyes, ears, nose and mouth. Then the shape is defined as what remains when translation, scale and orientation have been “quotiented out”. Further details may be found in the literature mentioned above.
Whenever analysing manifold data, one question that always needs to be addressed is what metric to use. For example, a choice of metric is required if we wish to define a Frechét mean on the manifold, which is a generalisation of the leastsquares mean in Euclidean space. One important dichotomy is between intrinsic and extrinsic metrics. In the former, the metric, often a Riemannian metric, is defined “intrinsically”, without any reference to an ambient space. In contrast, with extrinsic metrics, the manifold of interest is considered to be embedded in an ambient space, typically a Euclidean space endowed with the standard Euclidean metric, and the manifold inherits the metric from the ambient space. From a conceptual point of view, intrinsic metrics are often to be preferred. However, extrinsic metrics are often easier to work with, both from a practical and theoretical point of view; from a practical point of view because intrinsic distance may be more difficult to calculate and from a theoretical point of view because e.g. asymptotic theory tends to be more difficult when intrinsic metrics are used. As an example of the difficulties which can arise, see Hotz and Huckemann [15], who study the nonstandard behaviour which arises when the intrinsic, or arclength, metric is used for circular data.
Even when we have made the choice to use an extrinsic metric, the choice of metric may still not be clear. Two examples (which supplement the case of compositional data considered here) in which a oneparameter family of metrics may be worth considering will now be mentioned.
In [8]
a family of power metrics on the space of covariances matrices of a given dimension is considered, with a particular focus on diffusion tensor imaging. In this case, the relevant manifold is the space of positivedefinite matrices of given dimension. There is a close analogy with the
transformation considered here; in their setting the powered quantities are the eigenvalues of the covariance matrices, whereas here the powered quantities are components of compositional vectors.
The second example relates to statistical shape analysis. In [9] a family of extrinsic metrics is considered for measuring the distance between two shapes described by labelled landmarks. Here, the family of metrics consists of powers of eigenvalues of certain symmetric nonnegative definite matrices, but in this case the relevant matrices do not have full rank.
In both of these examples, the question of how to choose the metric arises, as it does in this paper, where the focus is on compositional data. It is clear that the choice one makes is going to have an effect on the statistical analysis, in some cases a major effect.
3 Review of the transformation
For the purpose of this paper it is convenient to define the transformation of a compositional vector by
(3) 
This transformation, which has a slightly different but mathematically equivalent form to that given in [21] and [22], is also known as the compositional power transformation [2].
Note that the transformation given by (3) defines a bijection of the interior of with inverse
(4) 
If one excludes the boundary of the simplex, which corresponds to compositional vectors that have one or more components equal to zero, then the transformation (3) and its inverse (4) are well defined for all . The motivation for transformation (3) is that the case corresponds to the logratio approach, whereas corresponds to the "do nothing" approach; see the discussion in the Introduction. We define the case in terms of the limit ; then, rescaling by as follows, we obtain
where is the vector of ones and is the centred logratio transformation (2). For the case , (3) is just the identity transformation of the simplex.
The transformation (3) leads naturally to a simplicial distance measure , which we call the metric, between observations , defined in terms of the Euclidean distance between transformed observations, i.e.,
The special case
is the Euclidean distance on the logratio transformed data, whereas
is just Euclidean distance multiplied by .
The choice of the value of depends upon the context of the analysis. Maximum likelihood estimation, assuming the transformation (3) or (4) has been applied to a parametric family of distributions such as the Dirichlet, requires the Jacobian determinant of (3), which is given by (see [23] for the derivation)
(5) 
4 Gaussian asymptotic limits as
In this section, we examine the transformation defined in (3), applied to a sample of compositional data on . The transformation along with the assumption of a dimensional Dirichlet distribution for the transformed data result in a parametric family which has parameters in total (i.e. one additional parameter). The particular focus in this Section is to see what happens when we apply maximum likelihood estimation to the transformed parameter family in the limit where the Dirichlet becomes highly concentrated and Gaussian and . This question arose when fitting this transformed model to datasets where the MLE of turns out to be small, while the MLE of the Dirichlet parameters tended to be large. For numerical evidence of this phenomenon, see Tables 14 below.
4.1 The Dirichlet case
Recall that the Dirichlet density on the simplex at a point is
where . Suppose that
where
Then the loglikelihood which results when the transformation is applied to is given by
which can be rewritten after expansion and reorganization as
(6)  
where . Also, define
(7) 
We now state our first result about the asymptotic behavior of the loglikelihood.
Proposition 1. Define as in (7) and
(8) 
where
(9) 
and define
Then as , the loglikelihood for and the satisfies
(10) 
where , ,
(11) 
and
(12) 
and
(13) 
are the first three sample cumulants of the .
Proposition 1 is proved in Section 6. We present several remarks below concerning Proposition 1.
Remark 2. The limit distribution is logistic normal with isotropic covariance structure (see Remark 3 below for clarification). The simple covariance structure in the limit as is a consequence of the restricted nature of the covariance structure of the Dirichlet distribution. Note that the limit distribution has the same number of parameters as the dimensional Dirchlet distribution, namely .
Remark 3. Although the likelihood in (10) clearly has a Gaussian character, it may not be immediately clear how it arises, given that when we sum over we get . To clarify, suppose we have independent Gaussian variables , , where it is assumed that . Define to be the sample mean. The covariance matrix of is , where is the identity matrix and is the vector of ones. The MoorePenrose inverse of is given by
and the relevant quadratic form is given by
(16) 
because we have assumed that . Note that if we put , and , , then (16) is equal to
which agrees with the inner sum of the second term in (10).
Remark 4. The assumption (9) only plays a cosmetic role in the proof provided that we are prepared to allow and the to depend weakly on . Specifically, suppose that . Then define
It is easily checked that
and also and the only depend weakly on in the sense that and , , as .
Remark 5. Note that the defined in (8) have a rather special structure in that they are asymptotically equal as . A natural question is: what happens to the asymptotics in Proposition 1 if the do not coalesce as , specifically if the are of the form . It turns out that, in contrast to Proposition 1, when the do not coalesce the loglikelihood does not converge to a finite limit as , as we shall see later.
4.2 The role of asymptotic normality and the general case
The asymptotic regime considered in the previous section is quite limited because the in (8) satisfy
However, it seems that we need this strong assumption to hold if we are to have convergence to a finite loglikelihood as indicated in Proposition 1. One important point we have not discussed yet is the fact that the Dirichlet distribution in Proposition 1 is converging to a multivariate normal. Consider the following elementary result.
Proposition 2. Suppose , where . Then as ,
where , and . Moreover, the convergence of densities also occurs uniformly in growing balls of the form as for any fixed , in the sense that, on the set ,
uniformly for , where in the line above, is the Dirichlet density.
The proof of Proposition 2 is similar to that of Proposition 1 but is somewhat simpler and so is omitted.
We now consider the asymptotic behaviour of as in the context of a Dirichlet distribution with parameters which do not coalesce as in Proposition 1.
Corollary 1. Under the assumptions of Proposition 2,
(17) 
where , and , and .
Proof of Corollary 1. If , then elementary calculations show that as ,
(18) 
where and . Consequently, from (18),
(19) 
and so Corollary 1 follows because, from Proposition 2, .
So for to attain a general limiting mean vector in the limit as , it is necessary for the expectation vector of to go to infinity as .
Finally, we go to a more general case in which we do not assume that is Dirichlet but just assume that as ,
Since the are defined on the simplex the vector has nonnegative components which satisfy and the covariance matrix
has an eigenvector
with corresponding eigenvalue . In this case, we obtain the same limiting result as in (17), but with the more general covariance matrix replacing , and with the same requirement that the mean vector of goes to infinity if we wish for to obtain a general limiting mean vector.5 Numerical results
We first show the asymptotic results of the previous section on synthetic data and then apply the transformation on real datasets.
5.1 Numerical Simulations
As discussed earlier, a finite loglikelihood is obtained under the assumption in (8) for . In contrast, when the Dirichlet’s parameters do not converge to the same value as the loglikelihood is expected to be infinite. Figure 1 presents this behaviour as it is encoded by the asymptotic limit for the mean value . Blue lines correspond to the former case with and . As expected from Proposition 1, the mean value of converges to a finite value which is however different at each coordinate due to the different values of ’s elements. Red lines correspond to the case with where the mean value diverges as implied by Corollary 1. Evidently, the rate of divergence is inverse proportional to .
Remark 6. The estimation of requires the exponentiation of with which leads to numerical instability in the general case (i.e., the red lines) when is below . In order to ensure numerical stability, we calculate with the following equivalent formula
where .
5.2 Real data examples
Three real datasets are used for illustration purposes.

Mammals. This dataset is taken from Hartigan [14] and describe the percentage composition of 24 mammals’ milk on the basis of 5 different constituents (water, protein, fat, lactose and ash). A 4group solution has been considered as the optimal grouping of such data.

East Bay Clams. From the many colonies of clams in East Bay, 20 were selected at random and from each a sample of clams was taken [2]. Each sample was sieved into three size ranges, large, medium, small; then each size range was sorted by shell colour, dark, light. For each colony the proportions of clams in each coloursize combination was estimated and the corresponding compositions are recorded.

OECD. The dataset contains percentages of the labour force and the per capita income of 20 European OECD countries in 1960. The data can be downloaded from the DASL library.

GRTA. This dataset contains information on the number of Greek road traffic accidents and persons injured from January 2010 up to August 2017 on a monthly basis. Three quantities of interest are measured, Killed, seriously injured and slightly injured.
For each of these three datasets, we apply the transformation (3) and maximise the Dirichlet loglikelihood. This is the same as maximizing the profile loglikelihood of presented in (6). For a range of values of we apply the transformation (3) to the compositional data and maximize the sum of the Dirichlet loglikelihood in (6) and the logarithm of the Jacobian in (5) of the transformation given in (3). Figure 2 presents these profile loglikelihoods. Tables 1–4 show the resulting estimates of the Dirichlet parameters obtained from the MLE applied to the transformed data and based on the two asymptotic forms. Asymptotic 1 corresponds to model with and being estimated from (15) and (14), respectively while Asymptotic 2 corresponds to the general case . The estimation of the parameter vector is based again on MLE. Therefore the results for Asymptotic 2 are expected to be similar with the results when direct MLE is applied.
(a)  (b) 
(a)  (b) 
Method  water  protein  fat  lactose  ash 

Direct MLE  793.908  671.054  680.629  663.055  604.305 
Asymptotic 1  782.297  668.362  677.964  660.235  597.352 
Asymptotic 2  793.673  670.856  680.428  662.859  604.127 
Method  dl  dm  ds 

Direct MLE  232.218  224.377  222.197 
Asymptotic 1  231.568  223.800  221.592 
Asymptotic 2  231.990  224.156  221.979 
Method  PCINC  AGR  IND  SER 

Direct MLE  212.965  129.782  143.360  136.907 
Asymptotic 1  199.034  124.925  139.824  132.928 
Asymptotic 2  212.669  129.601  143.161  136.716 
Method  Killed  Seriously injured  Slightly injured 

Direct MLE  28366.65  28109.32  25420.99 
Asymptotic 1  28305.94  28057.80  25320.62 
Asymptotic 2  28366.36  28109.03  25420.73 
6 Proofs
Proof of Proposition 1. Let us focus on the loglikelihood for a single observation . At the end of the proof we will sum over to obtain the loglikelihood (10).
After reorganization, the loglikelihood for observation may be written
(20) 
We now make use of Lemma 1 and Lemma 2 which are stated and proved below. Noting that
and substituting (22) and (25) into (20), we obtain, as ,
(21) 
where and are defined in (11). Finally, sum (21) over to obtain (10).
Lemma 1 and Lemma 2 are now stated and proved.
Proof of Lemma 1. This involves no more than repeated application of Stirling’s formula
Making use of (9) and applying Stirling’s formula gives
(23) 
Similarly,
(24) 
Summing (24) over and subtracting from (23) yields (22) after some further calculations.
7 Discussion and Conclusions
Our numerical and theoretical results show that some care is needed when applying a finitedimensional family of transformations to a parametric model, and then fitting the resulting larger model by maximum likelihood. The particular focus of study here is on using the
transformed Dirichlet distribution as the parametric model. The numerical results in Section 5 and the theoretical results Proposition 1 and Corollary 1 in Section 4 suggest that when the maximum likelihood estimator of is close to , there is a tendency for the other parameter estimates to be rather large. In some of these cases with small one may be better off using the logratio transformation, provided there are no sample data components which are zero or very close to zero.Acknowledgements
This work was partially supported by EPSRC grant EP/K022547/1, for which we are grateful. Partial results from this research were obtained when the second author was a PhD student at the University of Nottingham.
References
 [1] Aitchison, J. (1983). Principal components analysis of compositional data. Biometrika, 70, 5765.

[2]
Aitchison, J. (1986). The Statistical Analysis of Compositional Data.
Monographs on Statistics and Applied Probability. Chapman & Hall Ltd, London. Reprinted in 2003 with additional material by The Blackburn Press.
 [3] Baxter, M.J. (1995). Standardization and transformation in principal component analysis, with applications to archaeometry. Applied Statistics, 44, 513527.
 [4] Baxter, M.J. (2001). Statistical modelling of artefact compositional data. Archaeometry, 43, 131147.
 [5] Baxter, M.J., Beardah, C.C., Cool, H.E.M. and Jackson, C.M. (2005). Compositional data analysis of some alkaline glasses. Mathematical Geology, 37, 183196.
 [6] Baxter, M.J. and Freestone, I.C. (2006). Logratio compositional data analysis in archaeometry. Archaeometry, 48, 511531.
 [7] Bhattacharya, A. and Bhattacharya, R.N. (2012). Nonparametric Inference on Manifolds with Applications to Shape Spaces. Cambridge University Press, Cambridge.
 [8] Dryden, I.L., Koloydenko, A. and Zhou, D. (2009). NonEuclidean statistics for covariance matrices, with applications to diffusion tensor imaging. Ann. Appl. Statist., 3, 11021123.
 [9] Dryden, I.L., Le, H., Preston, S.P. and Wood, A.T.A. (2014). Mean shapes, projections and intrinsic limiting distributions. [Discussion contribution.] Journal of Statistical Planning and Inference, 145, 2532.
 [10] Dryden, I.L. and Mardia, K.V. (1998). Statistical Shape Analysis. Wiley, New York.
 [11] Dryden, I.L. and Mardia, K.V. (2016). Statistical Shape Analysis with Applications in R. Second Edition. Wiley, New York.
 [12] Fisher, R. A. (1953) Dispersion on a sphere. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences. The Royal Society, 217, 295305.
 [13] Fisher, N.I., Lewis, T. and Embleton, B.J.J. (1987). Statistical analysis of spherical data. Cambridge University Press, Cambridge.
 [14] Hartigan, J.A. (1975). Clustering algorithms. Wiley, New York.
 [15] Hotz, T., Huckemann, S. (2015). Intrinsic Means on the Circle: Uniqueness, Locus and Asymptotics. The Annals of the Institute of Statistical Mathematics, 67, 177193
 [16] Kendall, D.G., Barden, D., Carne, T.K. and Le, H. (1999). Shape and Shape Theory. Wiley, New York.
 [17] Mardia, K.V. (1972). Statistics of Directional Data. Academic Press, London.
 [18] Mardia, K.V. and Jupp, P.E. (2000). Directional Statistics. John Wiley & Sons, Chichester.
 [19] Scealy, J. L. and Welsh, A. H. (2014). Colours and cocktails: compositional data analysis. 2013 Lancaster lecture. Australian & New Zealand Journal of Statistics, 56, 145–169.
 [20] Small, C.G. (1996). The Statistical Theory of Shape. Springer, New York.
 [21] Tsagris, M.T., Preston, S., and Wood, A. T. A. (2011). A databased power transformation for compositional data. In: Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain.
 [22] Tsagris, M, Preston, S. and Wood, A.T.A. (2016). Improved classification for compositional data using the transformation. Journal of Classification 33, 243261.
 [23] Tsagris, M., and Stewart, C. (2018). A folded model for compositional data analysis. arXiv preprint arXiv:1802.07330.