A compositional data vector has non-negative components which sum to a constant, usually taken to be 1, as we assume here. Such vectors lie in the standard simplex
$\Delta^{D-1} = \left\{ x = (x_1, \ldots, x_D)^\top : x_j \geq 0, \; \sum_{j=1}^{D} x_j = 1 \right\},$
where $D$ is the number of components. Compositional datasets appear naturally in many fields including medicine, geology, archaeology, biology and economics; see e.g. .
The following two approaches have been widely used in compositional data analysis. One of these approaches, due to Aitchison [1, 2], is to apply the following transformation, known as the centred log-ratio transformation:
$y = \mathrm{clr}(x) = \left( \log\{x_1/g(x)\}, \ldots, \log\{x_D/g(x)\} \right)^\top. \qquad (2)$
Also, define $g(x) = \left( \prod_{j=1}^{D} x_j \right)^{1/D}$, the geometric mean of the components of $x$. Note that, by definition, the components of $y$ sum to $0$. Given a sample of compositional vectors $x_1, \ldots, x_n$, we transform to $y_1, \ldots, y_n$ using (2). Aitchison then proposed applying standard multivariate methods, such as principal components analysis, to the log-transformed compositional vectors, which is tantamount to assuming that the standard Euclidean metric is appropriate for the $y_i$.
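As an illustration, the centred log-ratio transformation is straightforward to compute; the sketch below (Python; the function name is ours) subtracts the mean of the log-components, which is equivalent to dividing by the geometric mean inside the logarithm.

```python
import numpy as np

def clr(x):
    """Centred log-ratio transform of a compositional vector (components > 0),
    or of the rows of a matrix of compositional vectors.

    Subtracting the mean of the logs equals dividing by the geometric
    mean inside the log, so the transformed components sum to zero."""
    logx = np.log(x)
    return logx - logx.mean(axis=-1, keepdims=True)

x = np.array([0.2, 0.3, 0.5])
y = clr(x)  # components of y sum to zero by construction
```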
Another popular approach is to “do nothing”; i.e. to assume that the Euclidean metric inherited from the ambient Euclidean space is appropriate for the $x_i$, and to apply standard multivariate methods directly to the $x_i$. For advocacy of this approach see Baxter [3, 4], Baxter et al. [5] and Baxter and Freestone [6].
For some datasets the log-ratio approach seems preferable, whereas for other datasets the “do nothing” approach seems superior. Nevertheless, the choice of which method to use, if either, has occasionally become heated and acrimonious. See the paper by Scealy and Welsh [19] for a careful and detailed discussion of the issues and the history of this debate. Our view concerning the choice of metric is that it is not possible to come up with a compelling choice for either method on purely a priori or theoretical grounds, and that a more pragmatic approach is to make a data-dependent choice of metric.
Bearing this in mind, one possibility is to consider a finite-dimensional family of transformations, such as the one-dimensional α-transformation family considered in this paper, where each transformation implies a choice of metric, and then to use the data to choose the optimal transformation, which in turn determines an optimal metric. It is not so clear how to do this non-parametrically, but if a parametric family of models, such as the Dirichlet in the present context, is assigned to the transformed compositional data, then we can estimate both the distributional parameters and the transformation parameter simultaneously by maximum likelihood. Using this approach we estimate the parameters of the α-transformed compositional data model. In effect, the optimal transformation parameter compels the data to resemble the chosen parametric family as closely as possible. However, after carrying out this approach with a number of real and simulated datasets, we found that in cases where the maximum likelihood estimator of α is close to 0, there is a tendency for the other (Dirichlet) parameters to be rather large. See Tables 1-4 in Section 5 for examples of this phenomenon. This has motivated us to take a close look at the asymptotics as (i) α → 0 and (ii) the Dirichlet becomes highly concentrated. This theoretical and numerical investigation is the principal focus of our paper.
The phenomenon mentioned in the previous paragraph potentially arises with analogous families of metrics on other manifolds should we wish to select an optimal metric in a data-dependent way. Consequently, before focusing on compositional data in the remainder of the paper, we briefly discuss the question of choice of metric in the setting of general manifolds in Section 2.
The structure of this paper is as follows. In the next section we discuss the choice of metric in the statistical analysis of manifold-valued data from a broader perspective. Then, in Section 3, we briefly review the α-transformation for compositional data. In Section 4 we present some asymptotics for small α when the α-transformation is applied to the Dirichlet family, and in Section 5 we present numerical results, both from simulations and from the analysis of real datasets. These numerical results reflect the theoretical results of the previous section. Proofs are given in Section 6.
2 Statistics on manifolds and choice of metric
Since the middle of the last century, statistical analysis of data which naturally lie in a manifold has grown enormously in interest and relevance. An important early example was R.A. Fisher’s paper [12] on directional statistics, with application to palaeomagnetic directional data. Subsequently, the field of directional statistics, in which the sample space is the surface of a unit sphere in Euclidean space (or, equivalently, the set of directions or unit vectors), has grown into a well-developed and mature field. See, for example, the books by Mardia [17], Mardia and Jupp [18] and Fisher et al. [13].
In a related but different direction, there has been great interest in statistical shape analysis from the 1980s onwards. A popular and mathematically deep approach to statistical shape analysis was initiated by D.G. Kendall; see the monograph by Kendall et al. [16]. Other books on this field include Small [20] and Dryden and Mardia [10, 11]. In the Kendall approach, objects in Euclidean space, usually in two or three dimensions, are represented by the coordinates of a finite set of labelled landmarks. For example, a human face could be represented approximately by landmark coordinates at obvious places including the eyes, ears, nose and mouth. Then the shape is defined as what remains when translation, scale and orientation have been “quotiented out”. Further details may be found in the literature mentioned above.
Whenever analysing manifold-valued data, one question that always needs to be addressed is what metric to use. For example, a choice of metric is required if we wish to define a Fréchet mean on the manifold, which is a generalisation of the least-squares mean in Euclidean space. One important dichotomy is between intrinsic and extrinsic metrics. In the former case the metric, often a Riemannian metric, is defined “intrinsically”, without any reference to an ambient space. In contrast, with extrinsic metrics, the manifold of interest is considered to be embedded in an ambient space, typically a Euclidean space endowed with the standard Euclidean metric, and the manifold inherits the metric from the ambient space. From a conceptual point of view, intrinsic metrics are often to be preferred. However, extrinsic metrics are often easier to work with, both practically and theoretically: practically because intrinsic distances may be more difficult to calculate, and theoretically because, for example, asymptotic theory tends to be more difficult when intrinsic metrics are used. As an example of the difficulties which can arise, see Hotz and Huckemann [15], who study the non-standard behaviour which arises when the intrinsic, or arc-length, metric is used for circular data.
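To make the intrinsic/extrinsic dichotomy concrete, the following sketch (ours, not from the paper) computes both kinds of mean for a sample of angles on the circle: the extrinsic mean averages the embedded unit vectors in the plane and projects back to the circle, while the intrinsic (Fréchet) mean minimises the summed squared arc-length distance, here by a brute-force grid search purely for illustration.

```python
import numpy as np

def extrinsic_mean(theta):
    """Extrinsic mean of angles: average the embedded unit vectors in R^2,
    then project back to the circle (defined when the average is nonzero)."""
    v = np.column_stack([np.cos(theta), np.sin(theta)]).mean(axis=0)
    return np.arctan2(v[1], v[0])

def intrinsic_mean(theta, grid=10000):
    """Intrinsic (Frechet) mean: minimise the summed squared arc-length
    distance over a grid of candidate angles (brute force, for illustration)."""
    cand = np.linspace(-np.pi, np.pi, grid, endpoint=False)
    # wrapped angular difference between each candidate and each observation
    d = np.angle(np.exp(1j * (cand[:, None] - theta[None, :])))
    return cand[np.argmin((d ** 2).sum(axis=1))]
```

For tightly clustered angles the two means nearly coincide; for widely dispersed angles they can differ markedly, which is one symptom of the non-standard behaviour of intrinsic means on the circle.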
Even when we have made the choice to use an extrinsic metric, the choice of metric may still not be clear. Two examples (which supplement the case of compositional data considered here) in which a one-parameter family of metrics may be worth considering will now be mentioned.
In [8], a family of power metrics on the space of covariance matrices of a given dimension is considered, with a particular focus on diffusion tensor imaging. In this case, the relevant manifold is the space of positive-definite matrices of the given dimension. There is a close analogy with the α-transformation considered here; in their setting the powered quantities are the eigenvalues of the covariance matrices, whereas here the powered quantities are the components of compositional vectors.
The second example relates to statistical shape analysis. In [9], a family of extrinsic metrics is considered for measuring the distance between two shapes described by labelled landmarks. Here, the family of metrics consists of powers of eigenvalues of certain symmetric non-negative definite matrices, but in this case the relevant matrices do not have full rank.
In both of these examples, the question of how to choose the metric arises, as it does in this paper, where the focus is on compositional data. It is clear that the choice one makes is going to have an effect on the statistical analysis, in some cases a major effect.
3 Review of the α-transformation
For the purpose of this paper it is convenient to define the α-transformation of a compositional vector $x = (x_1, \ldots, x_D)^\top$ by
$u^{(\alpha)}(x) = \left( x_1^\alpha \big/ \textstyle\sum_{k=1}^{D} x_k^\alpha, \; \ldots, \; x_D^\alpha \big/ \textstyle\sum_{k=1}^{D} x_k^\alpha \right)^\top, \qquad \alpha \neq 0. \qquad (3)$
Note that the transformation given by (3) defines a bijection of the interior of the simplex onto itself, with inverse
$x_j = \{u_j^{(\alpha)}\}^{1/\alpha} \big/ \textstyle\sum_{k=1}^{D} \{u_k^{(\alpha)}\}^{1/\alpha}, \qquad j = 1, \ldots, D. \qquad (4)$
If one excludes the boundary of the simplex, which corresponds to compositional vectors that have one or more components equal to zero, then the α-transformation (3) and its inverse (4) are well defined for all $\alpha \in \mathbb{R}$. The motivation for transformation (3) is that the case $\alpha = 0$ corresponds to the log-ratio approach, whereas $\alpha = 1$ corresponds to the "do nothing" approach; see the discussion in the Introduction. We define the case $\alpha = 0$ in terms of the limit $\alpha \to 0$; then, rescaling by $1/\alpha$ as follows, we obtain
$\lim_{\alpha \to 0} \frac{1}{\alpha}\left\{ D\, u_j^{(\alpha)}(x) - 1 \right\} = \log\{ x_j / g(x) \}, \qquad j = 1, \ldots, D,$
the centred log-ratio transformation.
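As a computational illustration, here is a sketch of the normalised power-transform form of the α-transformation, its inverse, and the rescaled version whose α → 0 limit is the centred log-ratio transform. The function names are ours, and this particular functional form is an assumption chosen to be consistent with the two limiting cases described above.

```python
import numpy as np

def alpha_transform(x, alpha):
    """Normalised power transform u_j = x_j**alpha / sum_k x_k**alpha
    (a common form of the alpha-transformation; requires alpha != 0)."""
    p = x ** alpha
    return p / p.sum(axis=-1, keepdims=True)

def alpha_inverse(u, alpha):
    """Inverse of the transform on the interior of the simplex:
    raising to the power 1/alpha and renormalising undoes the transform."""
    return alpha_transform(u, 1.0 / alpha)

def clr(x):
    """Centred log-ratio transform (the alpha -> 0 limit)."""
    logx = np.log(x)
    return logx - logx.mean(axis=-1, keepdims=True)

def rescaled(x, alpha):
    """(1/alpha) * (D * u^(alpha) - 1), which tends to clr(x) as alpha -> 0."""
    D = x.shape[-1]
    return (D * alpha_transform(x, alpha) - 1.0) / alpha
```

Note that `alpha_transform(x, 1.0)` returns `x` unchanged, matching the "do nothing" case.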
The α-transformation (3) leads naturally to a simplicial distance measure $\Delta_\alpha(x, y)$, which we call the α-metric, between observations $x, y$ in the interior of the simplex, defined in terms of the Euclidean distance between transformed observations, i.e.,
$\Delta_\alpha(x, y) = \frac{D}{|\alpha|} \left\| u^{(\alpha)}(x) - u^{(\alpha)}(y) \right\|,$
where $u^{(\alpha)}$ denotes the transformed vector in (3). The special case
$\Delta_0(x, y) = \lim_{\alpha \to 0} \Delta_\alpha(x, y) = \left\| \mathrm{clr}(x) - \mathrm{clr}(y) \right\|$
is the Euclidean distance on the log-ratio transformed data, whereas
$\Delta_1(x, y) = D \left\| x - y \right\|$
is just the Euclidean distance multiplied by $D$.
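The α-metric can be computed directly. In the sketch below (ours), the scaling D/|α| applied to the Euclidean distance between transformed points is an assumption chosen so that α = 1 gives D times the Euclidean distance and α → 0 recovers the distance between centred log-ratio transforms.

```python
import numpy as np

def alpha_metric(x, y, alpha):
    """Distance between two compositions: Euclidean distance between their
    alpha-transformed versions, scaled by D/|alpha| (assumed scaling), with
    the alpha = 0 case defined as the clr distance."""
    D = x.shape[-1]
    if alpha == 0.0:
        lx, ly = np.log(x), np.log(y)
        return np.linalg.norm((lx - lx.mean()) - (ly - ly.mean()))
    ux = x ** alpha / (x ** alpha).sum()
    uy = y ** alpha / (y ** alpha).sum()
    return (D / abs(alpha)) * np.linalg.norm(ux - uy)
```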
The choice of the value of α depends upon the context of the analysis. Maximum likelihood estimation, assuming the transformation (3) or (4) has been applied to a parametric family of distributions such as the Dirichlet, requires the Jacobian determinant of (3), which is given in (5); see [21] for the derivation.
4 Gaussian asymptotic limits as α → 0
In this section, we examine the α-transformation defined in (3), applied to a sample of compositional data on the simplex. The α-transformation, along with the assumption of a Dirichlet distribution for the transformed data, results in a parametric family which has $D + 1$ parameters in total (i.e. one additional parameter beyond the Dirichlet parameters). The particular focus in this section is to see what happens when we apply maximum likelihood estimation to the transformed parametric family in the limit where the Dirichlet becomes highly concentrated, and hence approximately Gaussian, and α → 0. This question arose when fitting this transformed model to datasets for which the MLE of α turns out to be small, while the MLEs of the Dirichlet parameters tend to be large. For numerical evidence of this phenomenon, see Tables 1-4 below.
4.1 The Dirichlet case
Recall that the Dirichlet density on the simplex at a point $x = (x_1, \ldots, x_D)^\top$, with $x_j > 0$ and $\sum_{j=1}^{D} x_j = 1$, is
$f(x; a) = \frac{\Gamma\left( \sum_{j=1}^{D} a_j \right)}{\prod_{j=1}^{D} \Gamma(a_j)} \prod_{j=1}^{D} x_j^{a_j - 1},$
where $a = (a_1, \ldots, a_D)^\top$ and $a_j > 0$ for each $j$. Suppose that
Then the log-likelihood which results when the α-transformation is applied to $x_1, \ldots, x_n$ is given by
which can be rewritten after expansion and re-organization as
where . Also, define
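The Dirichlet log-likelihood being maximised here is straightforward to evaluate. Below is a sketch (Python; the function name is ours) of the standard Dirichlet log-density summed over observations, omitting the Jacobian contribution of the α-transformation, and checked against `scipy.stats`.

```python
import numpy as np
from scipy.special import gammaln

def dirichlet_loglik(a, X):
    """Log-likelihood of Dirichlet(a) for the rows of X (each on the simplex):
    sum_i [ logGamma(sum_j a_j) - sum_j logGamma(a_j)
            + sum_j (a_j - 1) * log x_ij ]."""
    a = np.asarray(a, dtype=float)
    const = gammaln(a.sum()) - gammaln(a).sum()  # normalising constant
    return X.shape[0] * const + ((a - 1.0) * np.log(X)).sum()
```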
We now state our first result about the asymptotic behaviour of the log-likelihood.
Proposition 1. Define as in (7) and
Then, as α → 0, the log-likelihood for α and the remaining parameters satisfies
where , ,
are the first three sample cumulants of the .
Proposition 1 is proved in Section 6. We present several remarks below concerning Proposition 1.
Remark 1. When α is small, the MLE obtained by maximising (10) is given by
where , and
Remark 2. The limit distribution is logistic normal with isotropic covariance structure (see Remark 3 below for clarification). The simple covariance structure in the limit as α → 0 is a consequence of the restricted nature of the covariance structure of the Dirichlet distribution. Note that the limit distribution has the same number of parameters as the Dirichlet distribution, namely $D$.
Remark 3. Although the likelihood in (10) clearly has a Gaussian character, it may not be immediately clear how it arises, given that when we sum over $j$ we get $0$. To clarify, suppose we have independent Gaussian variables $Z_j \sim N(\mu_j, \sigma^2)$, $j = 1, \ldots, D$, where it is assumed that $\sum_{j=1}^{D} \mu_j = 0$. Define $\bar{Z} = D^{-1} \sum_{j=1}^{D} Z_j$ to be the sample mean. The covariance matrix of $W = (Z_1 - \bar{Z}, \ldots, Z_D - \bar{Z})^\top$ is $\Sigma = \sigma^2 (I_D - D^{-1} 1_D 1_D^\top)$, where $I_D$ is the identity matrix and $1_D$ is the $D$-vector of ones. The Moore-Penrose inverse of $\Sigma$ is given by
$\Sigma^{+} = \sigma^{-2} (I_D - D^{-1} 1_D 1_D^\top),$
and the relevant quadratic form is given by
$W^\top \Sigma^{+} W = \sigma^{-2} \sum_{j=1}^{D} (Z_j - \bar{Z})^2, \qquad (16)$
because the components of $W$ sum to $0$. Note that if we substitute the quantities appearing in (10) for $\sigma^2$ and the $Z_j$, then (16) is equal to
which agrees with the inner sum of the second term in (10).
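The Moore-Penrose identity used in this remark is easy to verify numerically. The snippet below checks that, for the centering projection $P = I_D - D^{-1} 1_D 1_D^\top$ and $\Sigma = \sigma^2 P$, the pseudo-inverse is $\sigma^{-2} P$ and the quadratic form reduces to $\sigma^{-2} \sum_j w_j^2$ on vectors whose components sum to zero (the specific numbers are arbitrary).

```python
import numpy as np

# Centering projection P = I - (1/D) 1 1^T. For Sigma = sigma^2 * P, the
# Moore-Penrose inverse is sigma^(-2) * P, because P is idempotent with
# eigenvalues 0 and 1. For any w with sum(w) = 0, P w = w, so the
# quadratic form w' Sigma^+ w reduces to sum(w_j^2) / sigma^2.
D, sigma2 = 4, 2.5
P = np.eye(D) - np.ones((D, D)) / D
Sigma = sigma2 * P
Sigma_pinv = np.linalg.pinv(Sigma)
```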
Remark 4. The assumption (9) only plays a cosmetic role in the proof provided that we are prepared to allow and the to depend weakly on . Specifically, suppose that . Then define
It is easily checked that
and also and the only depend weakly on in the sense that and , , as .
Remark 5. Note that the parameters defined in (8) have a rather special structure, in that they are asymptotically equal as α → 0. A natural question is: what happens to the asymptotics in Proposition 1 if the parameters do not coalesce as α → 0? It turns out that, in contrast to Proposition 1, when the parameters do not coalesce the log-likelihood does not converge to a finite limit as α → 0, as we shall see later.
4.2 The role of asymptotic normality and the general case
The asymptotic regime considered in the previous section is quite limited because the parameters in (8) satisfy
However, it seems that we need this strong assumption if we are to have convergence to a finite log-likelihood, as indicated in Proposition 1. One important point we have not yet discussed is the fact that the Dirichlet distribution in Proposition 1 is converging to a multivariate normal. Consider the following elementary result.
Proposition 2. Suppose $X \sim \mathrm{Dirichlet}(\lambda \mu_1, \ldots, \lambda \mu_D)$, where $\mu_j > 0$ and $\sum_{j=1}^{D} \mu_j = 1$. Then, as $\lambda \to \infty$,
$\lambda^{1/2} (X - \mu) \xrightarrow{d} N_D(0, \Sigma_\mu), \qquad (17)$
where $\mu = (\mu_1, \ldots, \mu_D)^\top$ and $\Sigma_\mu = \mathrm{diag}(\mu) - \mu \mu^\top$. Moreover, the convergence of densities also occurs uniformly on balls whose radius grows with $\lambda$, in the sense that the density of the standardised variable converges to the limiting normal density uniformly on such sets, where the density referred to is the Dirichlet density.
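The covariance statement in this limit can be checked against the exact Dirichlet moments: with parameter vector $\lambda \mu$, the exact covariance is $(\mathrm{diag}(\mu) - \mu \mu^\top)/(\lambda + 1)$, so $\lambda$ times the covariance approaches $\mathrm{diag}(\mu) - \mu \mu^\top$ as $\lambda$ grows. A small sketch (the helper name is ours):

```python
import numpy as np

def dirichlet_cov(a):
    """Exact covariance matrix of Dirichlet(a):
    Cov_ij = m_i (delta_ij - m_j) / (a0 + 1), with a0 = sum(a), m = a / a0."""
    a = np.asarray(a, dtype=float)
    a0 = a.sum()
    m = a / a0
    return (np.diag(m) - np.outer(m, m)) / (a0 + 1.0)

# As lambda grows with a = lambda * mu, the scaled covariance
# lambda * Cov approaches diag(mu) - mu mu^T.
mu = np.array([0.2, 0.3, 0.5])
lam = 1e6
scaled = lam * dirichlet_cov(lam * mu)
limit = np.diag(mu) - np.outer(mu, mu)
```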
The proof of Proposition 2 is similar to that of Proposition 1 but is somewhat simpler and so is omitted.
We now consider the asymptotic behaviour of the transformed variable in the context of a Dirichlet distribution with parameters which do not coalesce as in Proposition 1.
Corollary 1. Under the assumptions of Proposition 2,
where , and , and .
Proof of Corollary 1. If , then elementary calculations show that as ,
where and . Consequently, from (18),
and so Corollary 1 follows because, from Proposition 2, .
So, for the transformed variable to attain a general limiting mean vector in the limit, it is necessary for the expectation vector to go to infinity.
Finally, we turn to a more general case in which we do not assume that the underlying distribution is Dirichlet, but just assume that, as the concentration grows,
Since the components are defined on the simplex, the vector has non-negative components which satisfy $\sum_{j=1}^{D} X_j = 1$, and the covariance matrix
has the eigenvector $1_D$ with corresponding eigenvalue $0$. In this case, we obtain the same limiting result as in (17), but with this more general covariance matrix in place of the earlier one, and with the same requirement that the mean vector goes to infinity if we wish to obtain a general limiting mean vector.
5 Numerical results
We first illustrate the asymptotic results of the previous section on synthetic data, and then apply the α-transformation to real datasets.
5.1 Numerical Simulations
As discussed earlier, a finite log-likelihood is obtained under the assumption in (8) as α → 0. In contrast, when the Dirichlet’s parameters do not converge to the same value as α → 0, the log-likelihood is expected to diverge. Figure 1 presents this behaviour as it is encoded by the asymptotic limit of the mean value. Blue lines correspond to the former, coalescing, case. As expected from Proposition 1, the mean value converges to a finite value, which is, however, different at each coordinate due to the differing parameter values. Red lines correspond to the non-coalescing case, where the mean value diverges, as implied by Corollary 1. Evidently, the rate of divergence is inversely proportional to α.
Remark 6. The estimation requires exponentiation with exponents that grow as α decreases, which leads to numerical instability in the general case (i.e., the red lines) when α is small. In order to ensure numerical stability, we perform the calculation with the following equivalent formula
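The paper's exact rearranged formula is not reproduced above, but the standard device for this kind of instability is the log-sum-exp trick: quantities such as $\log \sum_k x_k^\alpha$ are computed as a shifted log-sum of exponentials of $\alpha \log x_k$, never forming the powers directly. A sketch under that assumption:

```python
import numpy as np

def log_sum_exp(v):
    """Numerically stable log(sum(exp(v))): subtracting the maximum before
    exponentiating prevents overflow, and the shift is added back outside."""
    m = np.max(v)
    return m + np.log(np.sum(np.exp(v - m)))

def log_sum_pow(x, alpha):
    """Stable evaluation of log(sum_k x_k**alpha) via log_sum_exp of
    alpha * log(x_k), avoiding forming the powers directly."""
    return log_sum_exp(alpha * np.log(x))

# naive np.log(np.sum(np.exp(v))) overflows for inputs like these:
v = np.array([1000.0, 1001.0, 999.0])
```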
5.2 Real data examples
Four real datasets are used for illustration purposes.
Mammals. This dataset is taken from Hartigan [14] and describes the percentage composition of the milk of 24 mammals in terms of 5 constituents (water, protein, fat, lactose and ash). A 4-group solution has been considered to be the optimal grouping for these data.
East Bay Clams. From the many colonies of clams in East Bay, 20 were selected at random, and from each a sample of clams was taken. Each sample was sieved into three size ranges (large, medium and small), and each size range was then sorted by shell colour (dark and light). For each colony the proportions of clams in each colour-size combination were estimated, and the corresponding compositions were recorded.
OECD. The dataset contains percentages of the labour force and the per capita income of 20 European OECD countries in 1960. The data can be downloaded from the DASL library.
GRTA. This dataset contains information on the number of Greek road traffic accidents and persons injured from January 2010 up to August 2017, on a monthly basis. Three quantities of interest are measured: killed, seriously injured and slightly injured.
For each of these four datasets, we apply the α-transformation (3) and maximise the Dirichlet log-likelihood; this is the same as maximising the profile log-likelihood of α presented in (6). For a range of values of α we apply the α-transformation (3) to the compositional data and maximise the sum of the Dirichlet log-likelihood in (6) and the logarithm of the Jacobian in (5). Figure 2 presents these profile log-likelihoods. Tables 1-4 show the resulting estimates of the Dirichlet parameters obtained from MLE applied to the α-transformed data and based on the two asymptotic forms. Asymptotic 1 corresponds to the model with parameters estimated from (15) and (14), respectively, while Asymptotic 2 corresponds to the general case. The estimation of the parameter vector is again based on MLE; therefore the results for Asymptotic 2 are expected to be similar to those obtained when direct MLE is applied.
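A sketch of this fitting procedure is given below. The function names and the log-Jacobian expression are ours (the Jacobian is derived for the normalised power-transform form of the transformation), and the inner Dirichlet fit uses a generic Nelder-Mead optimisation rather than the authors' implementation.

```python
import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize

def alpha_transform(X, alpha):
    """Row-wise normalised power transform (alpha != 0)."""
    P = X ** alpha
    return P / P.sum(axis=1, keepdims=True)

def log_jacobian(X, alpha):
    """Total log |Jacobian| of the normalised power transform (alpha > 0):
    per observation, (D-1) log(alpha) + (alpha-1) sum_j log x_ij
    - D log sum_j x_ij**alpha (derived for this assumed form)."""
    n, D = X.shape
    return (n * (D - 1) * np.log(alpha) + (alpha - 1.0) * np.log(X).sum()
            - D * np.log((X ** alpha).sum(axis=1)).sum())

def dirichlet_negloglik(log_a, U):
    """Negative Dirichlet log-likelihood; log-parametrisation keeps a > 0."""
    a = np.exp(log_a)
    n = U.shape[0]
    return -(n * (gammaln(a.sum()) - gammaln(a).sum())
             + ((a - 1.0) * np.log(U)).sum())

def profile_loglik(X, alpha):
    """Maximise the Dirichlet log-likelihood of the transformed data,
    then add the log-Jacobian of the transformation."""
    U = alpha_transform(X, alpha)
    res = minimize(dirichlet_negloglik, np.zeros(X.shape[1]),
                   args=(U,), method="Nelder-Mead")
    return -res.fun + log_jacobian(X, alpha)
```

In use, one evaluates `profile_loglik` over a grid of α values and selects the maximiser; note that at α = 1 the transform is the identity and the log-Jacobian vanishes, since each row of the data sums to 1.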
Table (GRTA data): Method | Killed | Seriously injured | Slightly injured
6 Proofs
Proof of Proposition 1. Let us focus on the log-likelihood for a single observation. At the end of the proof we will sum over observations to obtain the log-likelihood (10).
After re-organization, the log-likelihood for observation may be written
We now make use of Lemma 1 and Lemma 2 which are stated and proved below. Noting that
Lemma 1 and Lemma 2 are now stated and proved.
Proof of Lemma 1. This involves no more than repeated application of Stirling’s formula,
$\Gamma(z) = \sqrt{2\pi}\, z^{z - 1/2} e^{-z} \left\{ 1 + O(z^{-1}) \right\} \quad \text{as } z \to \infty.$
Making use of (9) and applying Stirling’s formula gives
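As a quick numerical sanity check of Stirling's formula in log form, $\log \Gamma(z) \approx (z - \tfrac{1}{2}) \log z - z + \tfrac{1}{2} \log(2\pi)$ with relative error $O(z^{-1})$, compared against `scipy.special.gammaln`:

```python
import numpy as np
from scipy.special import gammaln

def stirling_log_gamma(z):
    """Stirling's approximation to log Gamma(z):
    (z - 1/2) log z - z + (1/2) log(2*pi); the error is O(1/z)."""
    return (z - 0.5) * np.log(z) - z + 0.5 * np.log(2.0 * np.pi)
```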
7 Discussion and Conclusions
Our numerical and theoretical results show that some care is needed when applying a finite-dimensional family of transformations to a parametric model, and then fitting the resulting larger model by maximum likelihood. The particular focus of study here is on using the α-transformed Dirichlet distribution as the parametric model. The numerical results in Section 5 and the theoretical results (Proposition 1 and Corollary 1 in Section 4) suggest that when the maximum likelihood estimator of α is close to 0, there is a tendency for the other parameter estimates to be rather large. In some of these cases with small α, one may be better off using the log-ratio transformation, provided there are no sample data components which are zero or very close to zero.
This work was partially supported by EPSRC grant EP/K022547/1, for which we are grateful. Partial results from this research were obtained when the second author was a PhD student at the University of Nottingham.
-  Aitchison, J. (1983). Principal components analysis of compositional data. Biometrika, 70, 57-65.
-  Aitchison, J. (1986). The Statistical Analysis of Compositional Data. Monographs on Statistics and Applied Probability. Chapman & Hall, London. Reprinted in 2003 with additional material by The Blackburn Press.
-  Baxter, M.J. (1995). Standardization and transformation in principal component analysis, with applications to archaeometry. Applied Statistics, 44, 513-527.
-  Baxter, M.J. (2001). Statistical modelling of artefact compositional data. Archaeometry, 43, 131-147.
-  Baxter, M.J., Beardah, C.C., Cool, H.E.M. and Jackson, C.M. (2005). Compositional data analysis of some alkaline glasses. Mathematical Geology, 37, 183-196.
-  Baxter, M.J. and Freestone, I.C. (2006). Log-ratio compositional data analysis in archaeometry. Archaeometry, 48, 511-531.
-  Bhattacharya, A. and Bhattacharya, R.N. (2012). Nonparametric Inference on Manifolds with Applications to Shape Spaces. Cambridge University Press, Cambridge.
-  Dryden, I.L., Koloydenko, A. and Zhou, D. (2009). Non-Euclidean statistics for covariance matrices, with applications to diffusion tensor imaging. Ann. Appl. Statist., 3, 1102-1123.
-  Dryden, I.L., Le, H., Preston, S.P. and Wood, A.T.A. (2014). Mean shapes, projections and intrinsic limiting distributions. [Discussion contribution.] Journal of Statistical Planning and Inference, 145, 25-32.
-  Dryden, I.L. and Mardia, K.V. (1998). Statistical Shape Analysis. Wiley, New York.
-  Dryden, I.L. and Mardia, K.V. (2016). Statistical Shape Analysis with Applications in R. Second Edition. Wiley, New York.
-  Fisher, R.A. (1953). Dispersion on a sphere. Proceedings of the Royal Society of London A, 217, 295-305.
-  Fisher, N.I., Lewis, T. and Embleton, B.J.J. (1987). Statistical Analysis of Spherical Data. Cambridge University Press, Cambridge.
-  Hartigan, J.A. (1975). Clustering algorithms. Wiley, New York.
-  Hotz, T. and Huckemann, S. (2015). Intrinsic means on the circle: uniqueness, locus and asymptotics. Annals of the Institute of Statistical Mathematics, 67, 177-193.
-  Kendall, D.G., Barden, D., Carne, T.K. and Le, H. (1999). Shape and Shape Theory. Wiley, New York.
-  Mardia, K.V. (1972). Statistics of Directional Data. Academic Press, London.
-  Mardia, K.V. and Jupp, P.E. (2000). Directional Statistics. John Wiley & Sons, Chichester.
-  Scealy, J. L. and Welsh, A. H. (2014). Colours and cocktails: compositional data analysis. 2013 Lancaster lecture. Australian & New Zealand Journal of Statistics, 56, 145–169.
-  Small, C.G. (1996). The Statistical Theory of Shape. Springer, New York.
-  Tsagris, M., Preston, S. and Wood, A.T.A. (2011). A data-based power transformation for compositional data. In: Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain.
-  Tsagris, M., Preston, S. and Wood, A.T.A. (2016). Improved classification for compositional data using the α-transformation. Journal of Classification, 33, 243-261.
-  Tsagris, M., and Stewart, C. (2018). A folded model for compositional data analysis. arXiv preprint arXiv:1802.07330.