
A folded model for compositional data analysis

A folded type model is developed for analyzing compositional data. The proposed model, which is based upon the α-transformation for compositional data, provides a new and flexible class of distributions for modeling data defined on the simplex sample space. Despite its seemingly complex structure, employment of the EM algorithm enables efficient parameter estimation. The model is validated through simulation studies and examples which illustrate that the proposed model performs better in terms of capturing the data structure when compared to the popular logistic normal distribution.





1 Introduction

Compositional data are positive multivariate data which sum to the same constant, usually 1. Their sample space is then the standard simplex

$$\mathbb{S}^{D-1} = \left\{ (x_1, \dots, x_D) \colon x_i > 0, \ \sum_{i=1}^{D} x_i = 1 \right\}, \qquad (1)$$

where $D$ denotes the number of variables (better known as components).

Compositional data are met in many different scientific fields. In sedimentology, for example, samples were taken from an Arctic lake and their composition of water, clay and sand was the quantity of interest. Data from oceanography studies involving Foraminiferal (a marine plankton species) compositions at different sea depths were analyzed in Aitchison (2003, pg 399). Schnute & Haigh (2007) also analyzed marine compositional data through catch curve models for a quillback rockfish (Sebastes maliger) population. In hydrochemistry, Otero et al. (2005) used regression analysis to draw conclusions about anthropogenic and geological pollution sources of rivers in Spain. Stewart & Field (2011) modeled compositional diet estimates with an abundance of zero values obtained through quantitative fatty acid signature analysis. In another biological setting, Ghosh & Chakrabarti (2009) were interested in the classification of immunohistochemical data. Other application areas of compositional data analysis include archaeometry (Baxter et al., 2005), where the composition of ancient glasses, for instance, is of interest, and economics (Fry, Fry & McLaren, 2000), where the focus is on the percentage of household expenditure allocated to different products. Compositional data are also met in political science (Katz & King, 1999) for modeling electoral data, and in forensic science, where the compositions of forensic glasses are compared and classified (Neocleous, Aitken & Zadora, 2011). In demography, compositional data are met in multiple-decrement life tables, where the mortality rates amongst age groups are modeled (Oeppen, 2008). In a study of the brain, Prados et al. (2010) evaluated the diffusion anisotropy from diffusion tensor imaging using new measures derived from compositional data distances. Some recent areas of application include bioinformatics and specifically microbiome data analysis (Xia et al., 2013; Chen & Li, 2016; Shi, Zhang & Li, 2016). These examples illustrate the breadth of compositional data analysis applications and, consequently, the need for parametric models defined on the simplex.

The Dirichlet distribution is a natural distribution for such data due to its support being the simplex. However, it has long been recognized that this distribution is not statistically rich and flexible enough to capture many types of variability (especially curvature) of compositional data and, for this reason, a variety of transformations have been proposed that map the data outside of the simplex. In Aitchison (1982) the log-ratio transformation approach was developed, followed later by the so-called isometric log-ratio transformation, which was first proposed in Aitchison (2003, pg 90) and examined in detail by Egozcue et al. (2003). More recently, Tsagris, Preston & Wood (2011) suggested the α-transformation, which includes the isometric transformation as a special case. The α-transformation is a Box-Cox type transformation and has been successfully applied in regression analysis (Tsagris, 2015) and classification settings (Tsagris, Preston & Wood, 2016).

Regardless of the transformation chosen, the usual approach for modeling compositional data is to assume that the transformed data are multivariate normally distributed. While the α-transformation offers flexibility, a disadvantage of this transformation is that it maps the compositional data from the simplex to a subset of $\mathbb{R}^{D-1}$, and not onto $\mathbb{R}^{D-1}$ itself, on which the multivariate normal density is defined. An improvement to this method can be obtained by using a folded model procedure, similar to the approach used in Scealy & Welsh (2014) and to the folded normal distribution employed in Leone, Nelson & Nottingham (1961) and Johnson (1962). The univariate folded normal distribution, for example, corresponds to taking the absolute value of a normal random variable and essentially “folding” the negative values to the positive side of the distribution. The model we propose here works in a similar fashion, where values outside the simplex are mapped inside of it via a folding type transformation. An advantage of this approach over the aforementioned log-ratio methodology is that it allows one to fit any suitable multivariate distribution on $\mathbb{R}^{D-1}$ through the parameter α.

The paper is structured as follows. In Section 2 we describe the folding approach and in Section 3 we introduce the α-folded multivariate normal model for compositional data, along with the EM algorithm for maximum likelihood estimation (MLE), an algorithm for generating data from the proposed folded distribution, inference for α, and an approach for principal component analysis. Simulation studies are conducted in Section 4 to assess the estimation accuracy of the parameters, and two data sets are analyzed to further assess the performance of our model. Finally, concluding remarks may be found in Section 5.

2 The folding technique

2.1 The α-transformation

For a composition $\mathbf{x} = (x_1, \dots, x_D) \in \mathbb{S}^{D-1}$, the centered log-ratio transformation is defined in Aitchison (1982) as

$$\mathbf{w} = \left( \log\frac{x_1}{g\left(\mathbf{x}\right)}, \dots, \log\frac{x_D}{g\left(\mathbf{x}\right)} \right), \qquad g\left(\mathbf{x}\right) = \left( \prod_{i=1}^{D} x_i \right)^{1/D}. \qquad (2)$$

The sample space of Equation (2) is the set

$$\mathcal{Q}^{D} = \left\{ (w_1, \dots, w_D) \colon \sum_{i=1}^{D} w_i = 0 \right\}, \qquad (3)$$

which is a subset of $\mathbb{R}^{D}$. Note that the zero sum constraint in Equation (2) is an obvious drawback of this transformation as it can lead to singularity issues. In order to remove the redundant dimension imposed by this constraint, one can apply the isometric log-ratio transformation

$$\mathbf{v} = \mathbf{H}\mathbf{w}, \qquad (4)$$

where $\mathbf{H}$ is the Helmert matrix (Lancaster, 1965), an orthonormal matrix, after deletion of the first row. The left multiplication by the Helmert sub-matrix maps the data onto $\mathbb{R}^{D-1}$, thus, in effect, removing the zero sum constraint. The Helmert sub-matrix is a standard orthogonal matrix used in shape analysis to overcome singularity problems (Dryden & Mardia, 1998; Le & Small, 1999).

Tsagris, Preston & Wood (2011) developed the α-transformation as a more general transformation than that in Equation (4). Let

$$\mathbf{u} = \left( \frac{x_1^{\alpha}}{\sum_{j=1}^{D} x_j^{\alpha}}, \dots, \frac{x_D^{\alpha}}{\sum_{j=1}^{D} x_j^{\alpha}} \right) \qquad (5)$$

denote the power transformation for compositional data as defined by Aitchison (2003). In a manner analogous to Equations (2)-(4), first define

$$\mathbf{z} = \frac{1}{\alpha} \left( D\mathbf{u} - \mathbf{1}_D \right), \qquad (6)$$

where $\mathbf{1}_D$ is the $D$-dimensional vector of ones. The sample space of Equation (6) is then the set

$$\mathcal{Q}^{D}_{\alpha} = \left\{ (z_1, \dots, z_D) \colon \sum_{i=1}^{D} z_i = 0, \ \alpha z_i + 1 > 0 \ \text{for} \ i = 1, \dots, D \right\}. \qquad (7)$$

Note that the inverse of Equation (6) is as follows

$$\mathbf{z}^{-1} = \left( \frac{\left(\alpha z_1 + 1\right)^{1/\alpha}}{\sum_{j=1}^{D} \left(\alpha z_j + 1\right)^{1/\alpha}}, \dots, \frac{\left(\alpha z_D + 1\right)^{1/\alpha}}{\sum_{j=1}^{D} \left(\alpha z_j + 1\right)^{1/\alpha}} \right) \qquad (8)$$

for $\alpha z_i + 1 > 0$, $i = 1, \dots, D$. Finally, the α-transformation is defined as

$$\mathbf{y} = \mathbf{H}\mathbf{z}. \qquad (9)$$

The transformation in Equation (9) is a one-to-one transformation which, for $\alpha \neq 0$, maps data inside the simplex onto a subset of $\mathbb{R}^{D-1}$ and vice versa. The corresponding sample space of Equation (9) is

$$\left\{ \mathbf{H}\mathbf{z} \colon \mathbf{z} \in \mathcal{Q}^{D}_{\alpha} \right\}. \qquad (10)$$

For $\alpha \neq 0$, the inverse transformation from the set in Equation (10) to $\mathbb{S}^{D-1}$ is given by $\mathbf{x} = \mathbf{z}^{-1}\left(\mathbf{H}^{\top}\mathbf{y}\right)$, where $\mathbf{z}^{-1}(\cdot)$ is given in Equation (8).

Note that vectors in the image space of the α-transformation are not subject to the sum to zero constraint, and that this space is a strict subset of $\mathbb{R}^{D-1}$.

For convenience purposes we allow α to lie within $[-1, 1]$. From Equations (5) and (6), when $\alpha = 1$ the simplex is linearly expanded, as the values of the components are simply multiplied by a scalar and then centered. When $\alpha = -1$, the inverses of the values of the components are multiplied by a scalar and then centered.
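The α-transformation, its inverse, and the Helmert sub-matrix can be sketched in a few lines of NumPy. The paper's own implementation is in R; the Python sketch below, with illustrative function names of our choosing, simply follows Equations (5)-(9) as described above.

```python
import numpy as np

def helmert_sub(D):
    """Helmert matrix (Lancaster, 1965) with its first row removed:
    a (D-1) x D matrix whose rows are orthonormal and sum to zero."""
    H = np.zeros((D - 1, D))
    for i in range(1, D):
        H[i - 1, :i] = 1.0 / np.sqrt(i * (i + 1))
        H[i - 1, i] = -i / np.sqrt(i * (i + 1))
    return H

def alpha_transform(x, a):
    """Map a composition x (positive, summing to 1) to R^(D-1):
    power transform (Eq. 5), centre and scale (Eq. 6), then
    left-multiply by the Helmert sub-matrix (Eq. 9)."""
    x = np.asarray(x, dtype=float)
    D = x.size
    u = x**a / np.sum(x**a)            # power transformation (Eq. 5)
    z = (D * u - 1.0) / a              # Eq. 6
    return helmert_sub(D) @ z          # Eq. 9

def alpha_inverse(y, a):
    """Inverse alpha-transformation: map y back to the simplex (Eq. 8)."""
    y = np.asarray(y, dtype=float)
    D = y.size + 1
    z = helmert_sub(D).T @ y           # back to the zero-sum hyperplane
    t = (a * z + 1.0)**(1.0 / a)       # requires a*z + 1 > 0 componentwise
    return t / np.sum(t)
```

A round trip simplex to $\mathbb{R}^{D-1}$ and back recovers the composition for any $\alpha \neq 0$.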

If we assume that the α-transformed data (for any value of $\alpha \neq 0$) follow a multivariate normal distribution, then a way to choose the value of α is via maximum likelihood estimation. However, given that the multivariate normal distribution is defined on the whole of $\mathbb{R}^{D-1}$, whereas the image of the simplex under the α-transformation is only a subset of it, this approach might neglect an important amount of volume (probability) of the multivariate normal. The same problem arose in Scealy & Welsh (2011b), who subsequently developed the folded Kent distribution (Scealy & Welsh, 2014), and we propose a similar solution here.

2.2 The folding transformation

Let $\mathbf{y}$ be as in Equation (9) for some $\alpha \neq 0$; the inverse transformation then maps the image of the simplex under the α-transformation back to $\mathbb{S}^{D-1}$. Now suppose we have a point $\mathbf{y} \in \mathbb{R}^{D-1}$ lying outside this image and we want to map it inside the simplex $\mathbb{S}^{D-1}$. It can be shown (Tsagris, 2013) that the following transformation maps $\mathbf{y}$ to $\mathbb{S}^{D-1}$


where $\mathbf{z}^{-1}(\cdot)$ is defined in Equation (8).

Note that the inverse of Equation (11) is the $(D-1)$-dimensional vector


where $\mathbf{z}$ is defined in Equation (6).

In summary, we propose the following folding transformation from $\mathbb{R}^{D-1}$ to $\mathbb{S}^{D-1}$


where the two cases correspond to $\mathbf{y}$ falling inside or outside the image of the simplex under the α-transformation.

3 The α-folded multivariate normal distribution

3.1 Definition

The distribution of $\mathbf{x}$ in Equation (13) can be derived as a mixture distribution if we assume that $\mathbf{y}$ is multivariate normally distributed with parameters $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, that is, if $\mathbf{y} \sim N_{D-1}\left(\boldsymbol{\mu}, \boldsymbol{\Sigma}\right)$. Motivated by the folded normal distribution in Leone, Nelson & Nottingham (1961), we define the distribution of $\mathbf{x}$ on $\mathbb{S}^{D-1}$ and refer to it as the α-folded multivariate normal distribution. The term α-folded reflects the combination of the α-transformation and the folding type approach that is applied.

The distribution of can be written as


where and refer to the densities of and respectively when , and is the probability that .

To specify the density , we require the relevant Jacobians of the transformations and . These are given in the following two Lemmas with corresponding proofs in the Appendix.

Lemma 3.1

The Jacobian of is

Lemma 3.2

The Jacobian of is

From the two lemmas, it follows that

where , , and is given in Equation (9). We will refer to the density in Equation (3.1) as the α-folded multivariate normal distribution.

Although we have indicated that $\mathbf{x} \in \mathbb{S}^{D-1}$, it is important to note that boundary points of the simplex (that is, zero values) are not allowed, due to the logarithm of zero being undefined. This potential limitation also occurs with conventional transformations for compositional data analysis, including the isometric transformation in Equation (4). Recent approaches to handle zero values include mapping the data onto the hyper-sphere (Scealy & Welsh, 2011b, 2014), assuming latent variables (Butler & Glasbey, 2008) or conditioning on the zero values (Stewart & Field, 2011; Tsagris & Stewart, 2018).

A few special cases are worthy of mention. First, when no probability is left outside the simplex, the second term in Equation (3.1) vanishes and we end up with the α-normal distribution []. Finally, when $\alpha = 0$, Equation (3.1) reduces to the multivariate logistic normal on $\mathbb{S}^{D-1}$ ([])

where $\mathbf{v}$ is defined in Equation (4). The density function in Equation (3.1) is then that of the logistic normal on $\mathbb{S}^{D-1}$. This leads us to the following Definition, which generalizes Aitchison’s (2003) Definition 6.2.

A $D$-part composition $\mathbf{x}$ is said to follow the α-folded multivariate normal distribution if $\mathbf{y}$ follows a multivariate normal distribution in $\mathbb{R}^{D-1}$, where


and $\mathbf{y}$ is defined in Equation (9).

The generic power transformation in Equation (9) includes the isometric log-ratio transformation in Equation (4) as a special case, and thus the same concepts can be used to build the model. This means that once the data are transformed via Equation (18), a multivariate normality test can be applied. However, numerical optimization is required for obtaining the maximum likelihood estimates of the mean vector and covariance matrix, as described in Section 3.3.

The α-folded multivariate normal distribution resolves the problem associated with the assumption of multivariate normality of the α-transformed data (Tsagris, Preston & Wood, 2011), namely the ignored probability left outside the simplex due to the sample space of the transformation being a strict subset of $\mathbb{R}^{D-1}$. With this distribution, any ignored probability is folded back onto the simplex, and hence the density has the form of Equation (3.1).

3.2 Contour plots of the α-folded bivariate normal distribution

We now consider a visual illustration of the bivariate normal distribution with and without folding. In particular, we examine contours of the normal distribution in $\mathbb{R}^2$ and compare these to the contours of the folded model in Equation (3.1) plotted on the simplex. We consider two settings with the same mean vector in both cases, but with two different covariance matrices, as follows


The mean vector was selected by applying the α-transformation in Equation (9) to a sub-composition formed by the first three components of the Hongite data (an artificial dataset used by Aitchison, 2003). The elements of the first covariance matrix were chosen so that the correlation was positive, whereas those of the second were chosen so that it was not.

For each combination of parameters we calculated the density of the normal and of the folded normal over a grid of two-dimensional vectors and then plotted their contours. For the unconstrained normal case, the density was simply calculated at all grid points; for the folded model, the transformation in Equation (13) was first applied and then the density in Equation (3.1) was calculated. The contour plots are presented in Figure 1.

Figure 1: All contour plots refer to a normal distribution with the same mean. The covariance matrix of the first row is the first matrix in Equation (19) and that of the second row is the second. The left column shows the contour plots of the multivariate normal in $\mathbb{R}^2$, where the triangle (the simplex after the α-transformation in Equation (9)) is drawn for illustration purposes, and the right column shows the contours of the α-folded normal on the simplex.

Figures 1(a) and 1(c) depict the contours of the multivariate normal distribution (defined in $\mathbb{R}^2$) while Figures 1(b) and 1(d) show the contours of the α-folded normal (defined on the simplex). The first row corresponds to the first covariance matrix in Equation (19) while the second row is derived from the second. The triangles in the second column (Figures 1(b) and 1(d)) display the simplex, while the corresponding triangles in Figures 1(a) and 1(c) were obtained through the transformation in Equation (9).

What is perhaps most evident from the contour plots is that points falling outside the triangle in Figures 1(a) and 1(c) result in modes inside the simplex in Figures 1(b) and 1(d) respectively. When there is a high probability of mass being left outside two or more sides of the triangle (or faces of the pyramid or hyper-pyramid in higher dimensions), as in Figures 1(c) and 1(d), the contours of the folding model will have a somewhat peculiar shape due to a multi-modal distribution arising on the simplex. The multi-modality depends upon the allocation of the probability left outside the simplex along the components. If only one side of the simplex has probability left outside (as in Figure 1(a)), then the resulting distribution will be unimodal (see Figure 1(b)).

An estimate of the probability left outside each side of the triangle (or, in general, the simplex) may be obtained through simulation in a straightforward manner. To accomplish this for our example, we generated data from the multivariate normal distribution with the parameters previously specified and applied the inverse of the α-transformation in Equation (13). For the points left outside of the simplex, we simply counted how many fall outside each edge of the triangle and divided by the total number of simulated data points.

If we partition the missed probability into three parts, where each part refers to one of the three components, then we obtain the values for the case when (see Figures 1(a) and 1(b)). In this case, most of the probability is left outside the third component and the total probability left outside of the simplex is therefore . The total probability left outside of the simplex when (see Figures 1(c) and 1(d)) is and the allocation to the three components is different than in the previous example, namely (). Since, in this case, all of the estimates are relatively high, multi-modality appears in Figure 1(d).
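The Monte Carlo procedure above can be sketched as follows. We assume, from the form of the inverse in Equation (8), that a normal draw falls outside the image of the simplex exactly when some component of $\alpha \mathbf{z} + 1$ is negative; function names are illustrative and the paper's own computations were done in R.

```python
import numpy as np

def helmert_sub(D):
    """Helmert sub-matrix: (D-1) x D with orthonormal, zero-sum rows."""
    H = np.zeros((D - 1, D))
    for i in range(1, D):
        H[i - 1, :i] = 1.0 / np.sqrt(i * (i + 1))
        H[i - 1, i] = -i / np.sqrt(i * (i + 1))
    return H

def prob_outside(mu, sigma, a, n=100_000, seed=0):
    """Estimate the probability that a N(mu, sigma) draw in R^(D-1)
    falls outside the image of the simplex, and the share of the
    violations attributable to each of the D components."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    D = mu.size + 1
    y = rng.multivariate_normal(mu, sigma, size=n)
    z = y @ helmert_sub(D)             # each row is H^T y, back in R^D
    neg = a * z + 1.0 < 0.0            # inverse (Eq. 8) undefined here
    return neg.any(axis=1).mean(), neg.mean(axis=0)
```

With a tight covariance the mass outside is essentially zero; inflating the covariance increases it, mimicking the behaviour reported in Table 1.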

3.3 Maximum likelihood estimation

The estimation of the parameters of the α-folded model on $\mathbb{S}^{D-1}$ is not too complicated, mainly because there are not too many parameters involved in the maximization of the log-likelihood: $D-1$ for the mean vector, $D(D-1)/2$ for the covariance matrix, and two more, namely the value of α and the probability of being inside the simplex. We can use the “simplex” algorithm of Nelder & Mead (1965) to maximize the logarithm of Equation (3.1), available via the command optim in R (R Core Team, 2015). This algorithm is generally robust and is derivative free (Venables & Ripley, 2002). However, when moving to higher dimensions (roughly $D = 5$ or larger), the maximization is not straightforward.

For this reason, we use the E-M algorithm (McLachlan & Krishnan, 2007) to maximize the log-likelihood corresponding to Equation (3.1). Let $\mathbf{x}_1, \dots, \mathbf{x}_n$ denote the compositional vectors in $\mathbb{S}^{D-1}$. Following Jung, Foskey & Marron (2011), who applied the E-M algorithm in the context of a univariate folded normal, we propose the algorithm below.

E-M algorithm

  1. For a fixed value of α, apply the α-transformation without the Helmert sub-matrix multiplication (that is, Equation (6)) to the compositional data to obtain the matrix of transformed vectors.

  2. Calculate for each vector , for .

  3. Left multiply each by the Helmert sub-matrix to obtain . Then multiply each by to obtain (see Equation 18). The and are the transformed compositional data onto and , for respectively.

  4. Choose initial values for the estimates of the parameters, for example


    where the weight is the estimated conditional expectation of the indicator function that indicates whether the $i$-th observation lies inside or outside the image of the simplex.

  5. Update all the parameters each time, for

  6. Repeat Step 5 until the change between two successive log-likelihood values is less than a tolerance value.

The procedure described above should be repeated for a grid of values of α, ranging from $-1$ to $1$, choosing the value of α which maximizes the log-likelihood. A more efficient search for the best α is via Brent’s algorithm (Brent, 2013). When $\alpha = 0$, the MLEs of the transformed data are obtained directly; no E-M algorithm is necessary, as all the probability is retained within the simplex. The fact that (9) tends to (4) as $\alpha \to 0$ ensures the continuity of the log-likelihood at $\alpha = 0$.
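The outer profile search over α can be sketched as below. Here `loglik` stands for any user-supplied routine returning the maximised log-likelihood for a fixed α (for instance, the E-M fit above), and the grid over $[-1, 1]$ follows the text; this is an illustrative skeleton, not the paper's R code.

```python
import numpy as np

def profile_alpha(loglik, grid=None):
    """Grid search for the alpha maximising a profile log-likelihood.
    Brent's algorithm (Brent, 2013) could replace the grid for a more
    efficient search, as noted in the text."""
    if grid is None:
        grid = np.linspace(-1.0, 1.0, 41)   # step 0.05 over [-1, 1]
    vals = np.array([loglik(a) for a in grid])
    return grid[int(np.argmax(vals))], vals
```

For example, `profile_alpha(lambda a: -(a - 0.3)**2)` returns a maximiser at (numerically) 0.3.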

3.4 Generating data from the α-folded multivariate normal distribution

The algorithm below describes how to simulate a random vector from the α-folded model in Equation (3.1) when $\alpha \neq 0$. The case when $\alpha = 0$ is considered subsequently.

  1. Choose $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$.

  2. Generate a $(D-1) \times 1$ vector $\mathbf{y}$ from a $N_{D-1}\left(\boldsymbol{\mu}, \boldsymbol{\Sigma}\right)$.

  3. As per Equation (13), determine whether $\mathbf{y}$ falls inside or outside the image of the simplex. To do this, compute $\mathbf{z} = \mathbf{H}^{\top}\mathbf{y}$ and check whether $\alpha z_i + 1 > 0$ for all components. If so, apply the inverse α-transformation to obtain $\mathbf{x}$; otherwise, apply the folding transformation in Equation (11).

When $\alpha = 0$, the following simplified algorithm is used:

  1. Choose $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$.

  2. Generate a $(D-1) \times 1$ vector $\mathbf{y}$ from a $N_{D-1}\left(\boldsymbol{\mu}, \boldsymbol{\Sigma}\right)$.

  3. Compute $\mathbf{w} = \mathbf{H}^{\top}\mathbf{y}$.

  4. Set $\mathbf{x} = \left( \dfrac{e^{w_1}}{\sum_{j=1}^{D} e^{w_j}}, \dots, \dfrac{e^{w_D}}{\sum_{j=1}^{D} e^{w_j}} \right)$. (21)

Equation (21) is the inverse of the centered log-ratio transformation in Equation (2) and is the mechanism behind Aitchison’s approach to generating compositional data.
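The simplified algorithm for α = 0 takes a few lines: draw from the normal, map back to the zero-sum hyperplane with the Helmert sub-matrix, and apply the inverse clr of Equation (21). A NumPy sketch with illustrative names (the paper itself works in R):

```python
import numpy as np

def helmert_sub(D):
    """Helmert sub-matrix: (D-1) x D with orthonormal, zero-sum rows."""
    H = np.zeros((D - 1, D))
    for i in range(1, D):
        H[i - 1, :i] = 1.0 / np.sqrt(i * (i + 1))
        H[i - 1, i] = -i / np.sqrt(i * (i + 1))
    return H

def r_logistic_normal(n, mu, sigma, seed=None):
    """Simulate n compositions for alpha = 0: y ~ N(mu, sigma) in
    R^(D-1), w = H^T y, then the inverse clr of Equation (21),
    x_i = exp(w_i) / sum_j exp(w_j)."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    D = mu.size + 1
    y = rng.multivariate_normal(mu, sigma, size=n)
    w = y @ helmert_sub(D)             # rows lie on the zero-sum hyperplane
    t = np.exp(w)
    return t / t.sum(axis=1, keepdims=True)
```

Every simulated row is a valid composition: strictly positive and summing to one.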

3.5 Inference for α

In the previous two subsections, simplifications arose when $\alpha = 0$, and it may, consequently, be worthwhile to test whether the simpler model in Equation (3.1) corresponding to $\alpha = 0$ is appropriate for the data at hand.

Consider the hypothesis test $H_0: \alpha = 0$ versus $H_1: \alpha \neq 0$. While one option is to use a log-likelihood ratio test, the degrees of freedom will vary depending on the alternative hypothesis (cases a.-d., involving different combinations of restrictions on α and on the probability left outside the simplex). We have not encountered case d. so far in our data analyses but, in this case, would recommend using a Dirichlet model. In fact, if we generate data from a Dirichlet distribution, case d. is expected to arise from the MLE of Equation (3.1). An alternative to the log-likelihood ratio test is to use a parametric bootstrap, such as the hypothesis testing procedure described below.

  1. For the given dataset, estimate the value of α via the E-M algorithm. This is the observed test statistic, denoted by $\hat{\alpha}$.

  2. Apply the α-transformation in Equation (9) with $\alpha = \hat{\alpha}$ to the data. The data are now mapped onto the set in Equation (10).

  3. Apply the inverse of the isometric transformation ($\alpha = 0$) in Equation (4) to the data in Step 2 to form a new sample of compositions acquired with $\alpha = 0$. That is, the data have been transformed under the null hypothesis.

  4. Re-sample $B$ times from this new composition and each time estimate the value of α, obtaining $\hat{\alpha}_b$ for $b = 1, \dots, B$.

The p-value is then given by (Davison & Hinkley, 1997)

$$p = \frac{1 + \sum_{b=1}^{B} \mathbb{1}\left( \hat{\alpha}_b \ge \hat{\alpha} \right)}{B + 1},$$

where $\mathbb{1}(\cdot)$ is the indicator function.
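The bootstrap p-value convention of Davison & Hinkley (1997) adds one to both the numerator and the denominator, which prevents a p-value of exactly zero. A small sketch (the helper name is ours):

```python
import numpy as np

def boot_pvalue(t_obs, t_boot):
    """Parametric bootstrap p-value: the proportion of bootstrap
    statistics at least as extreme as the observed one, with the
    +1 correction of Davison & Hinkley (1997)."""
    t_boot = np.asarray(t_boot, dtype=float)
    return (1.0 + np.sum(t_boot >= t_obs)) / (t_boot.size + 1.0)
```

With 99 bootstrap replicates all smaller than the observed statistic, the p-value is 1/100 rather than zero.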

One might argue that the value of $\hat{\alpha}$ itself is not a pivotal statistic; in fact, it is not even standardized, so a second bootstrap should be performed to obtain the standard error of the estimate for each bootstrap sample. In order to avoid this extra computational burden, the parametric bootstrap hypothesis test could alternatively be carried out using the log-likelihood ratio test statistic in Steps 1 and 4 above.

Inference for α could also be achieved via the construction of bootstrap confidence intervals. For this approach, simply re-sample the observations (compositional vectors) from the compositional dataset and, for each bootstrap sample, find the value of α for which the log-likelihood derived from Equation (3.1) is maximized. By repeating this procedure many times, we can empirically estimate the distribution of $\hat{\alpha}$, including its standard error. A variety of confidence intervals may be formed based on this distribution (see Davison & Hinkley, 1997, and the R package boot). The percentile method, for example, simply uses the lower and upper quantiles of the bootstrap distribution as confidence limits.
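The percentile interval just described is the pair of empirical quantiles of the bootstrap replicates of $\hat{\alpha}$; a minimal sketch:

```python
import numpy as np

def percentile_ci(boot_estimates, level=0.95):
    """Percentile bootstrap confidence interval: the lower and upper
    quantiles of the bootstrap distribution of the estimator."""
    half = (1.0 - level) / 2.0
    return np.quantile(boot_estimates, [half, 1.0 - half])
```

For instance, a 90% interval takes the 5% and 95% quantiles of the bootstrap replicates.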

A less computationally intensive approach to obtaining confidence intervals is based upon the second derivative of the profile log-likelihood of α, that is, the observed Fisher information. Assuming asymptotic normality of the estimator, the square root of the inverse of the observed information serves as an estimate of the standard error of the maximum likelihood estimator (Cox & Hinkley, 1979).

3.6 Principal component analysis

Principal component analysis for compositional data has been described by Aitchison (1983). The centred log-ratio transformation in Equation (2) is applied to the compositional data and standard eigen analysis is then applied to the covariance matrix (which has at least one zero eigenvalue).

If $\alpha \neq 0$, we suggest an analogous approach in which the estimated covariance matrix is mapped to the space in Equation (7), using the Helmert sub-matrix. If $\alpha = 0$, the covariance is mapped to the space in Equation (3).

Step 3 of the algorithm presented in Section 3.4 is used to back-transform the principal components onto the simplex.
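For the α = 0 case, Aitchison's (1983) procedure can be sketched directly: clr-transform the data and eigen-decompose the covariance matrix, whose zero-sum constraint forces at least one zero eigenvalue. A Python sketch with illustrative names; the α ≠ 0 analogue would replace the clr step with the α-transformation.

```python
import numpy as np

def clr_pca(X):
    """Aitchison (1983) PCA: centred log-ratio transform each
    composition (row of X), then eigen-analyse the sample covariance
    matrix.  Eigenvalues are returned in decreasing order; the
    smallest is (numerically) zero because of the zero-sum constraint."""
    X = np.asarray(X, dtype=float)
    logX = np.log(X)
    W = logX - logX.mean(axis=1, keepdims=True)   # clr: log(x_i / g(x))
    S = np.cov(W, rowvar=False)
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]
```

The returned eigenvectors are the principal axes in clr space; scores follow by projecting the clr-transformed data onto them.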

4 Simulation study and data analysis examples

4.1 Simulation study

Recovery of the mean, covariance and probability left outside the simplex

For this part of our simulation study, two values of α were chosen, and 4-dimensional data from a multivariate normal distribution with two different mean vectors and a variety of covariance parameters were generated. Specifically, we used mean vectors

and covariance matrices


Note that the scale of the covariance matrix changes the probability that a point is left outside of the simplex. The “true” probability for each parameter value was computed through Monte-Carlo simulation and is presented in Table 1 (the row denoted “Probability”).

For each combination of parameters, seven sample sizes were considered. Results are based on repeated simulated data sets and, for each simulated sample, estimates of the mean vector, the covariance matrix and the probability were calculated, and their distances from the true parameters were computed. All computations took place in R 3.2.3 (R Core Team, 2015) using a desktop computer with an Intel Core i5 3.5 GHz processor and 32GB RAM.

For the probability left outside of the simplex (that is, $1 - p$), the absolute difference between the estimated probability and the true probability was computed. For the mean vector, the Euclidean distance was calculated to measure the discrepancy between the estimated and true vectors, whereas for the covariance matrix the following metric [] was calculated

where $\lambda_i$ denotes the $i$-th eigenvalue of the matrix argument.

Note that while we could have used the Kullback-Leibler divergence of the fitted multivariate normal from the true multivariate normal to evaluate the overall performance of our estimation method, we would then have had no individual information regarding the accuracy of our procedure in terms of estimating the probability left outside the simplex, the mean vector and the covariance matrix. The results of the simulation studies are presented in Table 1 and Figure 2.

0.5 1 2 3 5 7 10
Estimated 0.0281 0.0925 0.1997 0.2800 0.3900 0.4630 0.5377
Probability 0.0223 0.1048 0.1849 0.3060 0.4217 0.4750 0.5356
Table 1: Probability left outside the simplex when and , calculated via Monte Carlo with 50,000,000 iterations.
Figure 2: All graphs contain the mean distance from each set of the parameters. The first column refers to the Euclidean distance of the estimated mean vector from the true mean vector. The second column refers to the mean distance between the estimated and the true covariance matrix. The third column refers to the mean absolute distance between the estimated probability and the true probability inside the simplex.

From Figure 2 we observe that when the probability left outside the simplex grows larger, a larger sample size is required in order to obtain better estimates of both the probability and the mean vector. The estimation of the covariance matrix seems to be unaffected by the probability left outside the simplex.

Computational time required by the maximum likelihood estimation

Using a single mean vector, we generated data as before with increasing sample sizes and, for each sample size, recorded the time (in seconds) required to estimate the value of α. As expected, the computational cost is mostly affected by the sample size. For large sample sizes the computational burden is similar regardless of the probability of being outside of the simplex.

Sample size 0.5 1 2 3 5 7 10
50 0.052 0.058 0.065 0.075 0.095 0.124 0.148
100 0.073 0.079 0.092 0.106 0.135 0.174 0.212
200 0.090 0.089 0.096 0.105 0.130 0.162 0.195
300 0.095 0.095 0.096 0.108 0.128 0.160 0.194
500 0.084 0.083 0.091 0.100 0.123 0.156 0.188
750 0.078 0.074 0.081 0.089 0.109 0.135 0.163
1000 0.082 0.078 0.086 0.095 0.117 0.140 0.168
2000 0.267 0.390 0.357 0.352 0.339 0.276 0.302
5000 0.638 0.933 0.838 0.841 0.816 0.675 0.722
10000 1.435 2.113 1.905 1.915 1.858 1.544 1.723
Table 2: Computational times (in seconds) required to estimate the value of α, averaged over 1000 repetitions.

Recovery of the true value of α

In the previous simulations, recall that the value of α was fixed. In order to have a better picture of our estimation procedure, we also assessed the accuracy of the estimate of α for large sample sizes, for insight into the asymptotic case. The mean vector was set as previously and the covariance matrices were the same as in Equation (23).

For a range of values of α, from $0$ up to $1$ with a step of $0.1$, we estimated α for the different covariance settings using 4 sample sizes ($n = 1000$, $5000$, $10000$ and $20000$). For each combination of α, covariance matrix and sample size we used repeated simulations.

Figure 3 shows the average bias of the estimates in boxplots for each sample size. Each box corresponds to a value of α and shows the average bias aggregated over all covariance settings. For example, Figure 3(a) refers to a sample size equal to 1000, and each box contains the average biases of the estimates of α. The range in the variances increases with the value of α, as expected, since higher values of α correspond to a higher probability of being left outside of the simplex. Table 3 presents the probability left outside the simplex for many combinations of α and the covariance parameter, calculated using Monte Carlo simulation.

Figure 3: Box plots of the range of $\hat{\alpha}$ as a function of α for 4 different sample sizes: (a) n = 1000, (b) n = 5000, (c) n = 10000, (d) n = 20000.
0.5 1 2 3 5 7 10
0.0 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.1 0.000 0.000 0.000 0.000 0.000 0.000 0.002
0.2 0.000 0.000 0.001 0.007 0.034 0.071 0.128
0.3 0.000 0.004 0.040 0.091 0.187 0.316 0.348
0.4 0.006 0.047 0.156 0.245 0.367 0.445 0.522
0.5 0.043 0.149 0.306 0.402 0.516 0.583 0.648
0.6 0.132 0.284 0.448 0.536 0.632 0.687 0.741
0.7 0.258 0.423 0.571 0.644 0.722 0.768 0.812
0.8 0.398 0.551 0.673 0.731 0.794 0.830 0.866
0.9 0.535 0.661 0.757 0.802 0.851 0.880 0.907
1.0 0.66 0.756 0.827 0.861 0.898 0.918 0.937
Table 3: Probability left outside the simplex for many combinations of α values (rows) and covariance parameter values (columns).

4.2 Example 1: Sharp’s artificial data

We chose to analyze Sharp’s artificial data (Sharp, 2006) because they are curved data and, according to Aitchison (2003), the logistic normal (3.1) should produce a very good fit for curved data. Clearly, a Dirichlet distribution would fail to capture the variability of such data and we would not expect a value of $\alpha = 1$ to do better. This dataset consists of three components, and Sharp constructed them from Aitchison’s Hongite data (Aitchison, 2003). Sharp (2006) showed that the normalized geometric mean of the components (assuming a logistic normal distribution) fails to lie within the corpus of the data, whereas the spatial graph median does.

Figure 4(a) shows the profile log-likelihood of α and the maximum of the log-likelihood, which occurs away from $\alpha = 0$. The log-likelihood ratio test of $H_0: \alpha = 0$ clearly rejects the logistic normal on the simplex (that is, the hypothesis that the isometric log-ratio transformation is the optimal transformation), and this conclusion is in line with the confidence interval limits. Figure 4(b) shows the ternary diagram of the data, and we can clearly see that both the simple and the normalized geometric mean fail to lie within the main bulk of the data. However, the Fréchet mean (or α-mean), calculated at the optimal α, achieves this goal. The three means are

and the estimated parameters, assuming a multivariate normal for the α-transformed data, are

Note that in order to transform the α-mean inside the simplex, we apply Equation (13). The resulting compositional mean vector (α-mean) was termed the Fréchet mean by Tsagris, Preston & Wood (2011).

In this example the probability left outside the simplex was calculated via Monte Carlo simulation (with 20,000,000 iterations). The estimated value of the mixing proportion in (3.1) differed somewhat from this. However, 2 out of the 25 estimated values were considerably smaller than the rest, so without these (possible outliers) a proportion closer to the Monte-Carlo estimated probability left outside the simplex would have been obtained.

The contour plots of the α-folded model and of the multivariate normal applied to the α-transformed data appear in Figures 5(a) and 5(c) respectively. We generated observations from each model and these are plotted in Figures 5(b) and 5(d). The data simulated from the folded model look more like the observed data, in contrast to those simulated from the unfolded model.

Figure 4: (a) Profile log-likelihood of α using Sharp’s data. The red and green lines indicate the confidence interval for the true value of the parameter. (b) Ternary plot of the data along with three α-means evaluated at (geometric mean normalised to sum to 1), and (arithmetic mean). The lines correspond to the scores of the first principal component for each value of α.
Figure 5: Ternary plots of Sharp’s data along with the contours of (a) the α-folded model with and (c) the multivariate normal distribution applied to the α-transformed data with . Plots (b) and (d) show simulated data from the α-folded model with and from the multivariate normal distribution applied to the α-transformed data with , respectively.

4.3 Example 2: Arctic lake data

Samples from an Arctic lake were collected and the composition of three minerals, sand, silt and clay, was measured at different depths (Aitchison, 2003, pg 359). The estimated optimal value of α for this dataset was , and Figure 6(b) shows the data along with the three α-means and the scores of the first principal component, evaluated at (geometric mean normalized to sum to 1) and (arithmetic mean).

A closer look at Figure 6(b) reveals that the α-mean evaluated at the optimal α lies within the bulk of the data, in contrast to the α-mean evaluated at as well as the arithmetic mean (α = 1). The improved fit with the optimal α (over α = 0) can also be seen from the centers of the contour plots in Figures 7(a) and 7(b). There are two observations which appear to be highly influential.

We removed the two seemingly influential observations and repeated the analysis. The optimal value of α is now , and the contour plots of the α-folded model at and are presented in Figures 7(c) and 7(d). We again observe that a value of α other than 0 appears to provide a better fit to the data, leading to the same conclusion as before: even though the data seem to be curved in the simplex, a log-ratio transformation is not the most suitable transformation.
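The grid search for the optimal α can be sketched as follows. We profile the log-likelihood of a multivariate normal fitted to the α-transformed data, computing the log-Jacobian by finite differences rather than via the closed form of Lemma 3.1, and we ignore the folding correction; the data below are synthetic Dirichlet draws standing in for the Arctic lake compositions, and all function names are our own.

```python
import numpy as np

def helmert_sub(D):
    """(D-1) x D Helmert sub-matrix with orthonormal rows."""
    H = np.zeros((D - 1, D))
    for i in range(1, D):
        H[i - 1, :i] = 1.0 / np.sqrt(i * (i + 1))
        H[i - 1, i] = -i / np.sqrt(i * (i + 1))
    return H

def alpha_transform(x, a):
    """alpha-transformation of compositions x (rows sum to 1), a != 0."""
    D = x.shape[-1]
    u = D * x**a / np.sum(x**a, axis=-1, keepdims=True)
    return ((u - 1.0) / a) @ helmert_sub(D).T

def log_jacobian(x, a, eps=1e-6):
    """Finite-difference log|J| of the map from the D-1 free coordinates."""
    d = x.size - 1
    J = np.zeros((d, d))
    def f(v):
        return alpha_transform(np.append(v, 1.0 - v.sum()), a)
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        J[:, j] = (f(x[:d] + e) - f(x[:d] - e)) / (2.0 * eps)
    return np.linalg.slogdet(J)[1]

def profile_loglik(X, a):
    """Gaussian log-likelihood of the transformed data plus log-Jacobian terms."""
    Y = alpha_transform(X, a)
    n, d = Y.shape
    S = np.cov(Y.T, bias=True)  # MLE covariance
    ll = -0.5 * n * (d * np.log(2.0 * np.pi) + np.linalg.slogdet(S)[1] + d)
    return ll + sum(log_jacobian(x, a) for x in X)

rng = np.random.default_rng(0)
X = rng.dirichlet([2.0, 3.0, 4.0], size=50)  # synthetic 3-part compositions
alphas = np.linspace(0.1, 1.0, 10)
best_alpha = max(alphas, key=lambda a: profile_loglik(X, a))
```

Maximizing over a finer grid, or with a one-dimensional optimizer, mirrors the profile log-likelihood plots in Figures 4(a) and 6(a).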

Figure 6: (a) Profile log-likelihood of α using the Arctic lake data. The red and green lines indicate the confidence interval for the true value of the parameter. (b) Ternary plot of the data along with three α-means evaluated at (geometric mean normalised to sum to 1), and (arithmetic mean). The lines correspond to the scores of the first principal component for each value of α.
Figure 7: The contour plots refer to (a) the α-folded model with and (b) the multivariate normal distribution applied to the α-transformed data with . The second row presents the contour plots with the two influential observations removed. Plots (c) and (d) correspond to and , respectively.

5 Conclusions

In this paper we developed a novel parametric model, with attractive properties, for compositional data analysis. Tsagris, Preston & Wood (2011) introduced the α-transformation and the corresponding multivariate normal distribution. A drawback of that model is that it does not take into account the probability left outside the simplex. This is similar to the Box-Cox transformation, where, as already mentioned, the support of the transformed data is not the whole of the real space. Another possible solution would be to use truncation. However, in the examples presented the α-folded model appeared to fit the data adequately and, interestingly, an improved fit was obtained when . For more examples where the log-ratio transformation fails to capture the variability of the data, see Tsagris, Preston & Wood (2011), Baxter (2006) and Sharp (2006).

The use of another multivariate model, such as the multivariate skew normal distribution (Azzalini & Dalla Valle, 1996), has been suggested. The difficulty with this distribution, however, is that more parameters need to be estimated, making the estimation procedure harder because the log-likelihood has many local maxima. This of course does not exclude the possibility of using this model. Another, perhaps simpler, alternative is the multivariate distribution. Bayesian analysis and regression modeling are two suggested directions for future research.

Similar to the Box-Cox transformation, the logarithm of the Jacobian of the α-transformation involves the logarithms of the components. Thus, zero values are not allowed. Note that this issue also arises with the logistic normal. However, it is possible to generalize most of the analyses suggested for the logistic normal distribution using our proposed folded model, including extensions that allow zeros.

As is standard practice in log-ratio analysis, if one is willing to exclude from the sample space the boundary of the simplex, which corresponds to observations with one or more components equal to zero, then the α-transformation (9) and its inverse are well defined for all compositions in the interior of the simplex, and the α-folded model provides a new approach to the analysis of compositional data, with the potential to provide an improved fit over traditional models.
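As a numerical sanity check of the connection with the log-ratio methodology (assuming the α-transformation of Tsagris, Preston & Wood, 2011, with a Helmert sub-matrix), the following sketch illustrates that for α close to zero the transformation is numerically indistinguishable from the isometric log-ratio transformation, i.e. the Helmert sub-matrix applied to the centred log-ratios.

```python
import numpy as np

def helmert_sub(D):
    """(D-1) x D Helmert sub-matrix with orthonormal rows."""
    H = np.zeros((D - 1, D))
    for i in range(1, D):
        H[i - 1, :i] = 1.0 / np.sqrt(i * (i + 1))
        H[i - 1, i] = -i / np.sqrt(i * (i + 1))
    return H

def alpha_transform(x, a):
    """alpha-transformation of a single composition x (sums to 1), a != 0."""
    D = x.size
    u = D * x**a / np.sum(x**a)
    return helmert_sub(D) @ ((u - 1.0) / a)

def ilr(x):
    """Isometric log-ratio: Helmert sub-matrix applied to centred log-ratios."""
    clr = np.log(x) - np.mean(np.log(x))
    return helmert_sub(x.size) @ clr

x = np.array([0.2, 0.3, 0.5])
# for small alpha the two transformations agree to several decimals
gap = np.max(np.abs(alpha_transform(x, 1e-6) - ilr(x)))
```

Both maps break down when any component is zero, which is precisely why the boundary of the simplex is excluded.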

Proof of Lemma 3.1

We begin by deriving the Jacobian determinant of (5). The map (5) is degenerate due to the constraints and . In order to make (9) non-degenerate, we consider the following version of (5):


Expression (A.1) is presented to highlight that in fact we have , and not , free variables.

We now derive the Jacobian of (5), or equivalently of (A.1). Denote , where . The diagonal and off-diagonal elements of the Jacobian matrix are as follows.

The Jacobian matrix takes the following form:

where is a diagonal matrix with elements , and and are defined as


Then the product is

So we end up with

Finally, the Jacobian of (5) takes the following form:

The Jacobian of the α-transformation (9), without the left multiplication by the Helmert sub-matrix, is simply the Jacobian of (5) multiplied by

The left multiplication by the Helmert sub-matrix adds an extra term to the Jacobian, which is , and hence the Jacobian becomes

Proof of Lemma 3.2

For convenience, the proof presented below treats the case of . The mapping of a point from inside the simplex to a point outside the simplex is given in Equation 12, and the component-wise transformation can be expressed as


where is defined in (6) and, since , we have excluded the superscript .

There are two cases to consider when calculating the Jacobian determinant of the transformation.

  1. The first case is when and the transformation is

    There are two sub-cases to consider.