Gaussian asymptotic limits for the α-transformation in the analysis of compositional data

11/29/2018 · Yannis Pantazis et al. · Foundation for Research & Technology-Hellas (FORTH) and The University of Nottingham

Compositional data consist of vectors of proportions whose components sum to 1. Such vectors lie in the standard simplex, which is a manifold with boundary. One issue that has been rather controversial within the field of compositional data analysis is the choice of metric on the simplex. One popular possibility has been to use the metric implied by log-transforming the data, as proposed by Aitchison [1, 2]; another popular approach has been to use the standard Euclidean metric inherited from the ambient space. Tsagris et al. [21] proposed a one-parameter family of power transformations, the α-transformations, which include both the metric implied by Aitchison's transformation and the Euclidean metric as particular cases. Our underlying philosophy is that, with many datasets, it may make sense to use the data to help us determine a suitable metric. A related possibility is to apply the α-transformations to a parametric family of distributions, and then estimate α along with the other parameters. However, as we shall see, when one follows this last approach with the Dirichlet family, some care is needed in a certain limiting case which arises (α → 0), as we found out when fitting this model to real and simulated data. Specifically, when the maximum likelihood estimator of α is close to 0, the other parameters tend to be large. The main purpose of the paper is to study this limiting case both theoretically and numerically and to provide insight into these numerical findings.


1 Introduction

A compositional data vector has non-negative components which sum to a constant, usually taken to be 1, as we assume here. Such vectors lie in the standard simplex, $\Delta^{p-1}$, which has the form

$$\Delta^{p-1} = \Big\{ x = (x_1, \ldots, x_p) : x_j \geq 0, \ \sum_{j=1}^{p} x_j = 1 \Big\}, \qquad (1)$$

where $p \geq 2$ is the number of components. Compositional datasets appear naturally in many fields including medicine, geology, archaeology, biology and economics; see e.g. [2].

The following two approaches have been widely used in compositional data analysis. One of these approaches, due to Aitchison [1, 2], is to apply the following transformation, known as the centred log-ratio transformation:

$$v = \mathrm{clr}(x) = \Big( \log\frac{x_1}{g(x)}, \ldots, \log\frac{x_p}{g(x)} \Big), \qquad (2)$$

where $g(x) = \big( \prod_{j=1}^{p} x_j \big)^{1/p}$ is the geometric mean of the components of $x$. Note that, by definition, the components of $v$ sum to $0$. Given a sample of compositional vectors $x_1, \ldots, x_n$, we transform each $x_i$ to $v_i$ using (2). Aitchison then proposed applying standard multivariate methods, such as principal components analysis, to the log-ratio transformed compositional vectors, which is tantamount to assuming that the standard Euclidean metric is appropriate for the $v_i$.
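To make (2) concrete, here is a minimal sketch in Python/NumPy (our own illustration; the function name `clr` is our choice, not notation from the paper):

```python
import numpy as np

def clr(x):
    """Centred log-ratio transformation (2): log(x_j / g(x)), where
    g(x) is the geometric mean; equivalently, centre log(x) to sum to 0."""
    log_x = np.log(x)
    return log_x - log_x.mean(axis=-1, keepdims=True)

x = np.array([0.2, 0.3, 0.5])   # a composition in the simplex
v = clr(x)
print(v, v.sum())               # the components of v sum to 0
```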

Another popular approach is to “do nothing”; i.e. assume that the Euclidean metric inherited from the ambient Euclidean space is appropriate for the $x_i$, and apply standard multivariate methods directly to the $x_i$. For advocacy of this approach see Baxter [3, 4], Baxter et al. [5] and Baxter and Freestone [6].

For some datasets the log-ratio approach seems preferable, whereas for other datasets the “do nothing” approach seems superior. Nevertheless, the debate over which method to use, if either, has occasionally become heated and acrimonious. See the paper by Scealy and Welsh [19] for a careful and detailed discussion of the issues and history of this debate. Our view concerning the choice of metric is that it is not possible to make a compelling case for either method on purely a priori or theoretical grounds, and that a more pragmatic approach is to make a data-dependent choice of metric.

Bearing this in mind, one possibility is to consider a finite-dimensional family of transformations, such as the one-parameter α-transformation family considered in this paper, where each transformation implies a choice of metric. Then one can use the data to choose the optimal transformation, which in turn determines an optimal metric. It is not so clear how to do this non-parametrically, but if a parametric family of models, such as the Dirichlet in the present context, is assigned to the transformed compositional data, then we can estimate both the distributional parameters and the transformation parameter simultaneously by maximum likelihood. Using this approach we estimate the parameters of the α-transformed compositional data model. In effect, the optimal transformation parameter compels the data to resemble the chosen parametric family as closely as possible. However, after carrying out this approach with a number of real and simulated datasets we found that in cases where the maximum likelihood estimator of α is close to 0, there is a tendency for the other (Dirichlet) parameters to be rather large. See Tables 1-4 in Section 5 for examples of this phenomenon. This has motivated us to take a close look at the asymptotics as (i) α → 0 and (ii) the Dirichlet becomes highly concentrated. This theoretical and numerical investigation is the principal focus of our paper.

The phenomenon mentioned in the previous paragraph potentially arises with analogous families of metrics on other manifolds should we wish to select an optimal metric in a data-dependent way. Consequently, before focusing on compositional data in the remainder of the paper, we briefly discuss the question of choice of metric in the setting of general manifolds in Section 2.

The structure of this paper is as follows. In the next section we discuss the choice of metric in the statistical analysis of manifold-valued data from a broader perspective. Then, in Section 3, we briefly review the α-transformation for compositional data. In Section 4 we present some asymptotics for small α when the α-transformation is applied to the Dirichlet family, and in Section 5 we present numerical results, both from simulations and from the analysis of real datasets. These numerical results reflect the theoretical results of the previous section. Proofs are given in Section 6.

2 Statistics on manifolds and choice of metric

Since the middle of the last century, statistical analysis of data which naturally lies in a manifold has grown enormously in interest and relevance. An important early example was R.A. Fisher’s paper [12] on directional statistics with application to paleomagnetic directional data. Subsequently, the field of directional statistics, in which the sample space consists of the surface of a unit sphere in Euclidean space (or, equivalently, the set of directions or unit vectors) has grown into a well-developed and mature field. See, for example, the books by Mardia [17], Mardia and Jupp [18] and Fisher et al. [13].

In a related but different direction, there has been great interest in statistical shape analysis from the 1980s onwards. A popular and mathematically deep approach to statistical shape analysis was initiated by D.G. Kendall; see the monograph Kendall et al. [16]. Other books on this field include Small [20] and Dryden and Mardia [10, 11]. In the Kendall approach, objects in Euclidean space, usually in two or three dimensions, are represented by the coordinates of a finite set of labelled landmarks. For example, a human face could be represented approximately by landmark coordinates at obvious places including the eyes, ears, nose and mouth. Then the shape is defined as what remains when translation, scale and orientation have been “quotiented out”. Further details may be found in the literature mentioned above.

Whenever analysing manifold data, one question that always needs to be addressed is which metric to use. For example, a choice of metric is required if we wish to define a Fréchet mean on the manifold, which is a generalisation of the least-squares mean in Euclidean space. One important dichotomy is between intrinsic and extrinsic metrics. In the former case, the metric, often a Riemannian metric, is defined “intrinsically”, without any reference to an ambient space. In contrast, with extrinsic metrics, the manifold of interest is considered to be embedded in an ambient space, typically a Euclidean space endowed with the standard Euclidean metric, and the manifold inherits the metric from the ambient space. From a conceptual point of view, intrinsic metrics are often to be preferred. However, extrinsic metrics are often easier to work with, both from a practical and a theoretical point of view: from a practical point of view because intrinsic distances may be more difficult to calculate, and from a theoretical point of view because, e.g., asymptotic theory tends to be more difficult when intrinsic metrics are used. As an example of the difficulties which can arise, see Hotz and Huckemann [15], who study the non-standard behaviour which arises when the intrinsic, or arc-length, metric is used for circular data.

Even when we have made the choice to use an extrinsic metric, the choice of metric may still not be clear. Two examples (which supplement the case of compositional data considered here) in which a one-parameter family of metrics may be worth considering will now be mentioned.

In [8], a family of power metrics on the space of covariance matrices of a given dimension is considered, with a particular focus on diffusion tensor imaging. In this case, the relevant manifold is the space of positive-definite matrices of given dimension. There is a close analogy with the α-transformation considered here; in their setting the powered quantities are the eigenvalues of the covariance matrices, whereas here the powered quantities are the components of compositional vectors.

The second example relates to statistical shape analysis. In [9] a family of extrinsic metrics is considered for measuring the distance between two shapes described by labelled landmarks. Here, the family of metrics consists of powers of eigenvalues of certain symmetric non-negative definite matrices, but in this case the relevant matrices do not have full rank.

In both of these examples, the question of how to choose the metric arises, as it does in this paper, where the focus is on compositional data. It is clear that the choice one makes is going to have an effect on the statistical analysis, in some cases a major effect.

3 Review of the α-transformation

For the purpose of this paper it is convenient to define the α-transformation of a compositional vector $x = (x_1, \ldots, x_p) \in \Delta^{p-1}$ by

$$u_\alpha(x) = \Bigg( \frac{x_1^\alpha}{\sum_{k=1}^{p} x_k^\alpha}, \ldots, \frac{x_p^\alpha}{\sum_{k=1}^{p} x_k^\alpha} \Bigg). \qquad (3)$$

This transformation, which has a slightly different but mathematically equivalent form to that given in [21] and [22], is also known as the compositional power transformation [2].

Note that the transformation given by (3) defines a bijection of the interior of $\Delta^{p-1}$ with inverse

$$u_\alpha^{-1}(u) = \Bigg( \frac{u_1^{1/\alpha}}{\sum_{k=1}^{p} u_k^{1/\alpha}}, \ldots, \frac{u_p^{1/\alpha}}{\sum_{k=1}^{p} u_k^{1/\alpha}} \Bigg). \qquad (4)$$

If one excludes the boundary of the simplex, which corresponds to compositional vectors that have one or more components equal to zero, then the α-transformation (3) and its inverse (4) are well defined for all $\alpha \neq 0$. The motivation for transformation (3) is that the limit $\alpha \to 0$ corresponds to the log-ratio approach, whereas $\alpha = 1$ corresponds to the "do nothing" approach; see the discussion in the Introduction. We define the case $\alpha = 0$ in terms of the limit $\alpha \to 0$; rescaling $u_\alpha(x)$ as follows, we obtain

$$\lim_{\alpha \to 0} \frac{1}{\alpha} \big( p \, u_\alpha(x) - \mathbf{1}_p \big) = \mathrm{clr}(x),$$

where $\mathbf{1}_p$ is the $p$-vector of ones and $\mathrm{clr}$ is the centred log-ratio transformation (2). For the case $\alpha = 1$, (3) is just the identity transformation of the simplex.
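The following short sketch (our own code, with our function names) illustrates (3), (4) and the rescaled limit above:

```python
import numpy as np

def alpha_transform(x, alpha):
    """alpha-transformation (3): componentwise power, renormalised."""
    w = x ** alpha
    return w / w.sum()

def inv_alpha_transform(u, alpha):
    """Inverse transformation (4); valid on the interior of the simplex."""
    w = u ** (1.0 / alpha)
    return w / w.sum()

def clr(x):
    log_x = np.log(x)
    return log_x - log_x.mean()

x = np.array([0.2, 0.3, 0.5])
p = x.size

# (4) inverts (3) away from the boundary of the simplex.
print(inv_alpha_transform(alpha_transform(x, 0.5), 0.5))   # recovers x

# The rescaling (1/alpha) * (p * u_alpha(x) - 1) approaches clr(x).
for alpha in (0.1, 0.01, 0.001):
    r = (p * alpha_transform(x, alpha) - 1.0) / alpha
    print(alpha, np.abs(r - clr(x)).max())                 # error shrinks
```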

The α-transformation (3) leads naturally to a simplicial distance measure $\Delta_\alpha$, which we call the α-metric, between observations $x, y \in \Delta^{p-1}$, defined in terms of the Euclidean distance between transformed observations, i.e.,

$$\Delta_\alpha(x, y) = \frac{p}{|\alpha|} \, \big\| u_\alpha(x) - u_\alpha(y) \big\|.$$

The special case $\alpha \to 0$ is the Euclidean distance between the log-ratio transformed data, whereas $\alpha = 1$ is just the Euclidean distance multiplied by $p$.
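In code, the α-metric is immediate from the transformation (a sketch under the same conventions as above; the α = 0 branch uses the log-ratio limit):

```python
import numpy as np

def alpha_metric(x, y, alpha):
    """alpha-metric: (p/|alpha|) times the Euclidean distance between
    alpha-transformed observations; alpha -> 0 gives the Euclidean
    distance between centred log-ratio transformed observations."""
    p = x.size
    if alpha == 0.0:
        lx, ly = np.log(x), np.log(y)
        return np.linalg.norm((lx - lx.mean()) - (ly - ly.mean()))
    ux = x ** alpha / (x ** alpha).sum()
    uy = y ** alpha / (y ** alpha).sum()
    return (p / abs(alpha)) * np.linalg.norm(ux - uy)

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.25, 0.35, 0.40])
print(alpha_metric(x, y, 1.0))      # p times the Euclidean distance
print(alpha_metric(x, y, 1e-4))     # close to the alpha = 0 value
print(alpha_metric(x, y, 0.0))
```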

The choice of the value of α depends upon the context of the analysis. Maximum likelihood estimation, assuming the transformation (3) or (4) has been applied to a parametric family of distributions such as the Dirichlet, requires the Jacobian determinant of (3), which is given by (see [23] for the derivation)

$$|J| = |\alpha|^{p-1} \prod_{j=1}^{p} \frac{u_j}{x_j} = |\alpha|^{p-1} \, \frac{\prod_{j=1}^{p} x_j^{\alpha-1}}{\big( \sum_{k=1}^{p} x_k^\alpha \big)^{p}}. \qquad (5)$$
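As a sanity check on (5) (our own verification rather than the derivation in [23]), take $p = 2$, write $x = (x_1, x_2)$ with $x_2 = 1 - x_1$ and $S = x_1^\alpha + x_2^\alpha$, and differentiate $u_1 = x_1^\alpha / S$ with respect to $x_1$:

```latex
\frac{du_1}{dx_1}
  = \frac{\alpha x_1^{\alpha-1} S
          - x_1^{\alpha}\,\alpha\,(x_1^{\alpha-1} - x_2^{\alpha-1})}{S^2}
  = \frac{\alpha\,(x_1^{\alpha-1} x_2^{\alpha} + x_1^{\alpha} x_2^{\alpha-1})}{S^2}
  = \alpha\,\frac{(x_1 x_2)^{\alpha-1}}{S^{2}}\,(x_1 + x_2)
  = \alpha\,\frac{(x_1 x_2)^{\alpha-1}}{S^{2}},
```

using $x_1 + x_2 = 1$, which agrees with (5) for $p = 2$.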

4 Gaussian asymptotic limits as α → 0

In this section, we examine the α-transformation defined in (3), applied to a sample of compositional data on $\Delta^{p-1}$. The α-transformation, along with the assumption of a $(p-1)$-dimensional Dirichlet distribution for the transformed data, results in a parametric family which has $p + 1$ parameters in total (i.e. one additional parameter, namely α). The particular focus in this section is to see what happens when we apply maximum likelihood estimation to the transformed parametric family in the limit where the Dirichlet becomes highly concentrated, and therefore approximately Gaussian, and $\alpha \to 0$. This question arose when fitting this transformed model to datasets for which the MLE of α turns out to be small, while the MLEs of the Dirichlet parameters tend to be large. For numerical evidence of this phenomenon, see Tables 1-4 below.

4.1 The Dirichlet case

Recall that the Dirichlet density on the simplex at a point $x = (x_1, \ldots, x_p) \in \Delta^{p-1}$ is

$$f(x; \lambda_1, \ldots, \lambda_p) = \frac{\Gamma(\lambda_+)}{\prod_{j=1}^{p} \Gamma(\lambda_j)} \prod_{j=1}^{p} x_j^{\lambda_j - 1},$$

where $\lambda_+ = \sum_{j=1}^{p} \lambda_j$ and $\lambda_j > 0$ for each $j$. Suppose that the α-transformed observations $u_i = u_\alpha(x_i)$, $i = 1, \ldots, n$, follow this Dirichlet distribution. Then the log-likelihood which results when the α-transformation is applied to the sample is the sum of the log-densities of the $u_i$, which can be rewritten after expansion and re-organization as

$$\ell_n(\alpha, \lambda_1, \ldots, \lambda_p) = n \log \Gamma(\lambda_+) - n \sum_{j=1}^{p} \log \Gamma(\lambda_j) + \sum_{j=1}^{p} (\lambda_j - 1) \sum_{i=1}^{n} \log u_{ij}, \qquad (6)$$

where $u_{ij}$ denotes the $j$-th component of $u_i$. Also, define

(7)

We now state our first result about the asymptotic behaviour of the log-likelihood.

Proposition 1. Define the statistic in (7) and the parameters

(8)

where

(9)

and define the further quantities appearing below. Then, as $\alpha \to 0$, the log-likelihood for α and the $\lambda_j$ satisfies

(10)

where the notation is as follows:

(11)

and

(12)

and

(13)

are the first three sample cumulants of the log-ratio transformed data.


Proposition 1 is proved in Section 6. We present several remarks below concerning Proposition 1.

Remark 1. When α is small, the MLE obtained by maximising (10) is given by

(14)

and

(15)

Remark 2. The limit distribution is logistic normal with isotropic covariance structure (see Remark 3 below for clarification). The simple covariance structure in the limit as $\alpha \to 0$ is a consequence of the restricted nature of the covariance structure of the Dirichlet distribution. Note that the limit distribution has the same number of parameters as the $(p-1)$-dimensional Dirichlet distribution, namely $p$.


Remark 3. Although the likelihood in (10) clearly has a Gaussian character, it may not be immediately clear how it arises, given that when we sum the components of each transformed observation we get $1$. To clarify, suppose we have independent Gaussian variables $z_j \sim N(\mu_j, \sigma^2)$, $j = 1, \ldots, p$, where it is assumed that $\sum_{j=1}^{p} \mu_j = 0$. Define $\bar{z} = p^{-1} \sum_{j=1}^{p} z_j$ to be the sample mean. The covariance matrix of $w = (z_1 - \bar{z}, \ldots, z_p - \bar{z})^\top$ is $\Sigma = \sigma^2 \big( I_p - p^{-1} \mathbf{1}_p \mathbf{1}_p^\top \big)$, where $I_p$ is the $p \times p$ identity matrix and $\mathbf{1}_p$ is the $p$-vector of ones. The Moore-Penrose inverse of $\Sigma$ is given by

$$\Sigma^{+} = \sigma^{-2} \big( I_p - p^{-1} \mathbf{1}_p \mathbf{1}_p^\top \big),$$

and the relevant quadratic form is given by

$$w^\top \Sigma^{+} w = \sigma^{-2} \sum_{j=1}^{p} (z_j - \bar{z})^2, \qquad (16)$$

because we have assumed that $\sum_{j=1}^{p} \mu_j = 0$. Note that if we substitute the quantities from Proposition 1 into (16), then (16) agrees with the inner sum of the second term in (10).
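For completeness, here is a short verification (ours) of the Moore-Penrose inverse stated above; it uses only the fact that the centring matrix $C_p = I_p - p^{-1}\mathbf{1}_p\mathbf{1}_p^\top$ is symmetric and idempotent:

```latex
% Write Sigma = sigma^2 C_p with C_p^2 = C_p and C_p = C_p^T.
% Candidate: Sigma^+ = sigma^{-2} C_p.
\Sigma \Sigma^{+} \Sigma = \sigma^{2} C_p^{3} = \sigma^{2} C_p = \Sigma,
\qquad
\Sigma^{+} \Sigma \Sigma^{+} = \sigma^{-2} C_p^{3} = \Sigma^{+},
```

and both products $\Sigma \Sigma^{+} = C_p$ and $\Sigma^{+} \Sigma = C_p$ are symmetric, so all four Moore-Penrose conditions hold.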


Remark 4. The assumption (9) only plays a cosmetic role in the proof, provided that we are prepared to allow α and the $\lambda_j$ to depend weakly on the centring. Specifically, suppose that (9) does not hold exactly; then recentred parameters satisfying (9) can be defined. It is easily checked that the conclusion of Proposition 1 still holds, and that the recentred parameters depend only weakly on the original ones in the sense that the differences converge to zero as $\alpha \to 0$.


Remark 5. Note that the $\lambda_j$ defined in (8) have a rather special structure in that they are asymptotically equal as $\alpha \to 0$. A natural question is: what happens to the asymptotics in Proposition 1 if the $\lambda_j$ do not coalesce as $\alpha \to 0$, specifically if the $\lambda_j$ are of the form considered in Proposition 2 below? It turns out that, in contrast to Proposition 1, when the $\lambda_j$ do not coalesce the log-likelihood does not converge to a finite limit as $\alpha \to 0$, as we shall see later.

4.2 The role of asymptotic normality and the general case

The asymptotic regime considered in the previous subsection is quite limited because the $\lambda_j$ in (8) are required to coalesce, i.e. to become asymptotically equal as $\alpha \to 0$. However, it seems that we need this strong assumption to hold if we are to have convergence to a finite log-likelihood, as indicated in Proposition 1. One important point we have not discussed yet is the fact that the Dirichlet distribution in Proposition 1 is converging to a multivariate normal. Consider the following elementary result.

Proposition 2. Suppose $X \sim \mathrm{Dirichlet}(\lambda a_1, \ldots, \lambda a_p)$, where $a_j > 0$ and $\sum_{j=1}^{p} a_j = 1$. Then as $\lambda \to \infty$,

$$\lambda^{1/2} (X - a) \xrightarrow{\; d \;} N_p(0, \Sigma_a), \qquad \Sigma_a = \mathrm{diag}(a) - a a^\top,$$

where $a = (a_1, \ldots, a_p)^\top$. Moreover, the convergence of densities also occurs uniformly in growing balls of the form $\{ s : \| s \| \leq c_\lambda \}$, with $c_\lambda \to \infty$ sufficiently slowly as $\lambda \to \infty$, in the sense that on such a set the ratio of the Dirichlet density to the limiting Gaussian density converges to 1 uniformly.

The proof of Proposition 2 is similar to that of Proposition 1 but is somewhat simpler and so is omitted.
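A quick Monte Carlo check of Proposition 2 (a sketch assuming NumPy; the parameter values are our choices):

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([0.2, 0.3, 0.5])            # weights a_j summing to 1
lam = 10_000.0                           # concentration parameter lambda

# Draw from Dirichlet(lambda * a) and standardise as in Proposition 2.
X = rng.dirichlet(lam * a, size=200_000)
S = np.sqrt(lam) * (X - a)

emp_cov = np.cov(S, rowvar=False)        # empirical covariance
lim_cov = np.diag(a) - np.outer(a, a)    # limiting covariance diag(a) - a a^T

print(np.round(emp_cov, 4))
print(np.round(lim_cov, 4))
# Both matrices are singular: the vector of ones is an eigenvector with
# eigenvalue 0, because the components of X sum to 1.
```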

We now consider the asymptotic behaviour of the α-transformed vector, under the limiting regime of Proposition 2, in the context of a Dirichlet distribution whose parameters do not coalesce as in Proposition 1.

Corollary 1. Under the assumptions of Proposition 2,

(17)

where the notation is as in Proposition 2.


Proof of Corollary 1. Elementary calculations show that, in this limiting regime,

(18)

Consequently, from (18), we obtain

(19)

and so Corollary 1 follows because of the convergence established in Proposition 2.

So, for the α-transformed vector to attain a general limiting mean vector in this limit, it is necessary for the corresponding expectation vector to go to infinity.

Finally, we turn to a more general case in which we do not assume that $X$ is Dirichlet, but just assume that a Gaussian limit of the type in Proposition 2 holds as the distribution becomes highly concentrated. Since the observations are defined on the simplex, the vector $X$ has non-negative components which satisfy $\sum_{j=1}^{p} X_j = 1$, and the covariance matrix $\Sigma$ of the Gaussian limit therefore has the eigenvector

$$\mathbf{1}_p = (1, \ldots, 1)^\top$$

with corresponding eigenvalue $0$; this is because $\mathrm{Var}\big( \mathbf{1}_p^\top X \big) = \mathbf{1}_p^\top \Sigma \mathbf{1}_p = 0$. In this case, we obtain the same limiting result as in (17), but with the more general covariance matrix $\Sigma$ replacing $\Sigma_a$, and with the same requirement that the relevant mean vector goes to infinity if we wish to obtain a general limiting mean vector.

5 Numerical results

We first illustrate the asymptotic results of the previous section on synthetic data, and then apply the α-transformation to real datasets.

5.1 Numerical Simulations

As discussed earlier, a finite log-likelihood is obtained under the assumption in (8) as $\alpha \to 0$. In contrast, when the Dirichlet's parameters do not converge to the same value as $\alpha \to 0$, the log-likelihood is expected to be infinite. Figure 1 presents this behaviour as it is encoded by the asymptotic limit of the mean value. Blue lines correspond to the former, coalescing case. As expected from Proposition 1, the mean value converges to a finite value, which is however different at each coordinate owing to the different values of the elements of the mean parameter vector. Red lines correspond to the non-coalescing case, where the mean value diverges as implied by Corollary 1. Evidently, the rate of divergence is inversely proportional to α.

Figure 1: Asymptotic comparison as $\alpha \to 0$ between the coalescing case of Proposition 1 (blue lines) and the non-coalescing case of Corollary 1 (red lines).

Remark 6. The estimation of $x = u_\alpha^{-1}(u)$ requires exponentiation with exponent $1/\alpha$, which leads to numerical instability in the general case (i.e., the red lines) when α is very small. In order to ensure numerical stability, we calculate (4) with the following equivalent formula

$$x_j = \frac{\exp\{ (\log u_j - m)/\alpha \}}{\sum_{k=1}^{p} \exp\{ (\log u_k - m)/\alpha \}}, \qquad j = 1, \ldots, p,$$

where $m = \max_{1 \leq k \leq p} \log u_k$.
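A sketch of this stabilised inverse in NumPy (our own implementation of the max-shift formula above):

```python
import numpy as np

def inv_alpha_transform_stable(u, alpha):
    """Numerically stable inverse alpha-transformation (4).

    Computes u_j^(1/alpha) / sum_k u_k^(1/alpha) on the log scale,
    shifting by the maximum so the largest term exponentiates to
    exactly 1; this avoids overflow when |alpha| is very small."""
    z = np.log(u) / alpha
    z -= z.max()                 # max-shift: all entries are now <= 0
    w = np.exp(z)
    return w / w.sum()

u = np.array([0.32, 0.33, 0.35])
print(inv_alpha_transform_stable(u, 1e-6))   # naive u**(1/alpha) overflows
```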

5.2 Real data examples

Four real datasets are used for illustration purposes.

  • Mammals. This dataset is taken from Hartigan [14] and describes the percentage composition of the milk of 24 mammals on the basis of 5 different constituents (water, protein, fat, lactose and ash). A 4-group solution has been considered as the optimal grouping of these data.

  • East Bay Clams. From the many colonies of clams in East Bay, 20 were selected at random, and from each a sample of clams was taken [2]. Each sample was sieved into three size ranges, large, medium and small; then each size range was sorted by shell colour, dark and light. For each colony the proportions of clams in each colour-size combination were estimated and the corresponding compositions recorded.

  • OECD. The dataset contains the percentages of the labour force in agriculture, industry and services, together with the per capita income, for 20 European OECD countries in 1960. The data can be downloaded from the DASL library.

  • GRTA. This dataset contains information on the number of Greek road traffic accidents and persons injured from January 2010 up to August 2017, on a monthly basis. Three quantities of interest are measured: killed, seriously injured and slightly injured.

For each of these four datasets, we apply the α-transformation (3) and maximise the Dirichlet log-likelihood; this is the same as maximising the profile log-likelihood of α based on (6). For a range of values of α we apply the α-transformation (3) to the compositional data and maximise the sum of the Dirichlet log-likelihood in (6) and the logarithm of the Jacobian (5) of the α-transformation. Figure 2 presents these profile log-likelihoods. Tables 1-4 show the resulting estimates of the Dirichlet parameters obtained from the MLE applied to the α-transformed data and based on the two asymptotic forms. Asymptotic 1 corresponds to the model of Proposition 1, with the parameters estimated from (15) and (14), respectively, while Asymptotic 2 corresponds to the general case. The estimation of the parameter vector is again based on MLE; therefore the results for Asymptotic 2 are expected to be similar to the results when direct MLE is applied.
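The following sketch (our own Python illustration; the function names, the optimiser choice and the grid are ours) indicates the kind of computation behind Figure 2 and Tables 1-4: for each α on a grid, transform the data by (3), maximise the Dirichlet log-likelihood (6) over the parameters, and add the log-Jacobian term from (5):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def alpha_transform(X, alpha):
    """Row-wise alpha-transformation (3) of an n x p data matrix."""
    W = X ** alpha
    return W / W.sum(axis=1, keepdims=True)

def dirichlet_negloglik(log_lam, U):
    """Negative Dirichlet log-likelihood (6), parametrised by log(lambda)
    so that the parameters stay positive during optimisation."""
    lam = np.exp(log_lam)
    n = U.shape[0]
    return -(n * gammaln(lam.sum()) - n * gammaln(lam).sum()
             + ((lam - 1.0) * np.log(U).sum(axis=0)).sum())

def profile_loglik(X, alpha):
    """Profile log-likelihood of alpha: maximised (6) plus the sum over
    observations of the log-Jacobian (5)."""
    n, p = X.shape
    U = alpha_transform(X, alpha)
    fit = minimize(dirichlet_negloglik, x0=np.zeros(p), args=(U,),
                   method="Nelder-Mead")
    log_jac = (n * (p - 1) * np.log(abs(alpha))
               + (alpha - 1.0) * np.log(X).sum()
               - p * np.log((X ** alpha).sum(axis=1)).sum())
    return -fit.fun + log_jac

# Example usage on a matrix X of compositions (rows sum to 1):
# grid = np.linspace(0.05, 1.0, 20)
# prof = [profile_loglik(X, a) for a in grid]  # profile curve as in Figure 2
```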

Figure 2: Profile log-likelihood as a function of α for the (a) Mammals, (b) East Bay Clams, (c) OECD and (d) GRTA datasets.
Method water protein fat lactose ash
Direct MLE 793.908 671.054 680.629 663.055 604.305
Asymptotic 1 782.297 668.362 677.964 660.235 597.352
Asymptotic 2 793.673 670.856 680.428 662.859 604.127
Table 1: Estimates of the parameters for the Mammals dataset. The columns of the table give estimates of the parameters defined in (8), but labelled by component name rather than by index. The optimal value of α is the maximiser of the corresponding profile log-likelihood in Figure 2.
Method dl dm ds
Direct MLE 232.218 224.377 222.197
Asymptotic 1 231.568 223.800 221.592
Asymptotic 2 231.990 224.156 221.979
Table 2: Estimates for the East Bay Clams dataset. The columns of the table give estimates of the parameters defined in (8), but labelled by component name rather than by index. The optimal value of α is the maximiser of the corresponding profile log-likelihood in Figure 2.
Method PCINC AGR IND SER
Direct MLE 212.965 129.782 143.360 136.907
Asymptotic 1 199.034 124.925 139.824 132.928
Asymptotic 2 212.669 129.601 143.161 136.716
Table 3: Estimates for the OECD dataset. The columns of the table give estimates of the parameters defined in (8), but labelled by component name rather than by index. The optimal value of α is the maximiser of the corresponding profile log-likelihood in Figure 2.
Method Killed Seriously injured Slightly injured
Direct MLE 28366.65 28109.32 25420.99
Asymptotic 1 28305.94 28057.80 25320.62
Asymptotic 2 28366.36 28109.03 25420.73
Table 4: Estimates for the GRTA dataset. The columns of the table give estimates of the parameters defined in (8), but labelled by component name rather than by index. The optimal value of α is the maximiser of the corresponding profile log-likelihood in Figure 2.

6 Proofs

Proof of Proposition 1. Let us focus on the log-likelihood for a single observation. At the end of the proof we will sum over the observations to obtain the log-likelihood (10).

After re-organization, the log-likelihood for a single observation may be written as

(20)

We now make use of Lemma 1 and Lemma 2, which are stated and proved below. Substituting (22) and (25) into (20), we obtain, as $\alpha \to 0$,

(21)

where the relevant quantities are defined in (11). Finally, summing (21) over the observations yields (10).


Lemma 1 and Lemma 2 are now stated and proved.

Lemma 1. With the $\lambda_j$ defined as in (8) and subject to (9), the following result holds: as $\alpha \to 0$,

(22)

Proof of Lemma 1. This involves no more than repeated application of Stirling's formula

$$\log \Gamma(z) = \tfrac{1}{2} \log(2\pi) + \big( z - \tfrac{1}{2} \big) \log z - z + O(z^{-1}) \quad \text{as } z \to \infty.$$

Making use of (9) and applying Stirling's formula gives

(23)

Similarly,

(24)

Summing (24) over $j$ and subtracting from (23) yields (22) after some further calculations.

Lemma 2. As $\alpha \to 0$,

(25)

where the relevant quantities are defined in (12) and (13).

Proof of Lemma 2. As $\alpha \to 0$,

(26)

where again the quantities are defined in (12) and (13). Since (9) holds,

(25) follows easily when we multiply (26) through and collect terms.

7 Discussion and Conclusions

Our numerical and theoretical results show that some care is needed when applying a finite-dimensional family of transformations to a parametric model, and then fitting the resulting larger model by maximum likelihood. The particular focus of study here is on using the α-transformed Dirichlet distribution as the parametric model. The numerical results in Section 5 and the theoretical results, Proposition 1 and Corollary 1, in Section 4 suggest that when the maximum likelihood estimator of α is close to 0, there is a tendency for the other parameter estimates to be rather large. In some of these cases with small α one may be better off using the log-ratio transformation, provided there are no components of the sample data which are zero or very close to zero.

Acknowledgements

This work was partially supported by EPSRC grant EP/K022547/1, for which we are grateful. Partial results from this research were obtained when the second author was a PhD student at the University of Nottingham.

References

  • [1] Aitchison, J. (1983). Principal components analysis of compositional data. Biometrika, 70, 57-65.
  • [2] Aitchison, J. (1986). The Statistical Analysis of Compositional Data. Monographs on Statistics and Applied Probability. Chapman & Hall, London. Reprinted in 2003 with additional material by The Blackburn Press.

  • [3] Baxter, M.J. (1995). Standardization and transformation in principal component analysis, with applications to archaeometry. Applied Statistics, 44, 513-527.
  • [4] Baxter, M.J. (2001). Statistical modelling of artefact compositional data. Archaeometry, 43, 131-147.
  • [5] Baxter, M.J., Beardah, C.C., Cool, H.E.M. and Jackson, C.M. (2005). Compositional data analysis of some alkaline glasses. Mathematical Geology, 37, 183-196.
  • [6] Baxter, M.J. and Freestone, I.C. (2006). Log-ratio compositional data analysis in archaeometry. Archaeometry, 48, 511-531.
  • [7] Bhattacharya, A. and Bhattacharya, R.N. (2012). Nonparametric Inference on Manifolds with Applications to Shape Spaces. Cambridge University Press, Cambridge.
  • [8] Dryden, I.L., Koloydenko, A. and Zhou, D. (2009). Non-Euclidean statistics for covariance matrices, with applications to diffusion tensor imaging. Ann. Appl. Statist., 3, 1102-1123.
  • [9] Dryden, I.L., Le, H., Preston, S.P. and Wood, A.T.A. (2014). Mean shapes, projections and intrinsic limiting distributions. [Discussion contribution.] Journal of Statistical Planning and Inference, 145, 25-32.
  • [10] Dryden, I.L. and Mardia, K.V. (1998). Statistical Shape Analysis. Wiley, New York.
  • [11] Dryden, I.L. and Mardia, K.V. (2016). Statistical Shape Analysis with Applications in R. Second Edition. Wiley, New York.
  • [12] Fisher, R.A. (1953). Dispersion on a sphere. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 217, 295-305.
  • [13] Fisher, N.I., Lewis, T. and Embleton, B.J.J. (1987). Statistical Analysis of Spherical Data. Cambridge University Press, Cambridge.
  • [14] Hartigan, J.A. (1975). Clustering Algorithms. Wiley, New York.
  • [15] Hotz, T. and Huckemann, S. (2015). Intrinsic means on the circle: uniqueness, locus and asymptotics. Annals of the Institute of Statistical Mathematics, 67, 177-193.
  • [16] Kendall, D.G., Barden, D., Carne, T.K. and Le, H. (1999). Shape and Shape Theory. Wiley, New York.
  • [17] Mardia, K.V. (1972). Statistics of Directional Data. Academic Press, London.
  • [18] Mardia, K.V. and Jupp, P.E. (2000). Directional Statistics. John Wiley & Sons, Chichester.
  • [19] Scealy, J. L. and Welsh, A. H. (2014). Colours and cocktails: compositional data analysis. 2013 Lancaster lecture. Australian & New Zealand Journal of Statistics, 56, 145–169.
  • [20] Small, C.G. (1996). The Statistical Theory of Shape. Springer, New York.
  • [21] Tsagris, M.T., Preston, S., and Wood, A. T. A. (2011). A data-based power transformation for compositional data. In: Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain.
  • [22] Tsagris, M., Preston, S. and Wood, A.T.A. (2016). Improved classification for compositional data using the α-transformation. Journal of Classification, 33, 243-261.
  • [23] Tsagris, M., and Stewart, C. (2018). A folded model for compositional data analysis. arXiv preprint arXiv:1802.07330.