
The α-k-NN regression for compositional data

02/12/2020
by Michail Tsagris, et al.
University of Crete
University of New Brunswick
Northern Border University

Compositional data arise in many real-life applications, and versatile methods for properly analyzing this type of data in the regression context are needed. This paper, through use of the α-transformation, extends classical k-NN regression to what is termed α-k-NN regression, yielding a highly flexible non-parametric regression model for compositional data. Unlike many of the recommended regression models for compositional data, zero values (which commonly occur in practice) are not problematic and can be incorporated into the proposed model without modification. Extensive simulation studies and real-life data analyses highlight the advantage of using α-k-NN regression for complex relationships between the response data and predictor variables in two cases, namely when the response data are compositional and the predictor variables are continuous (or categorical), and vice versa. Both cases suggest that α-k-NN regression can lead to more accurate predictions compared to current regression models, which assume a, sometimes restrictive, parametric relationship with the predictor variables.



1 Introduction

Non-negative multivariate vectors with variables (typically called components) conveying only relative information are referred to as compositional data. When the vectors are normalized to sum to 1, their sample space is the standard simplex given below

$$\mathbb{S}^{D-1} = \left\{ (x_1, \ldots, x_D)^\top : x_i \geq 0, \ \sum_{i=1}^{D} x_i = 1 \right\}, \quad (1)$$

where $D$ denotes the number of components.

Examples of compositional data may be found in many different fields of study, and the extensive scientific literature that has been published on the proper analysis of this type of data is indicative of its prevalence in real-life applications (for a substantial number of specific examples of applications involving compositional data see Tsagris and Stewart, 2020). It is perhaps not surprising, given the widespread occurrence of this type of data, that many compositional data analysis applications involve covariates. In sedimentology, for example, samples were collected from an Arctic lake and the change in their chemical composition across different water depths was of interest (van den Boogaart et al., 2018). This data set is analyzed in Section 5 using our proposed methodology, along with several other data sets. These include compositional glacial data, household consumption expenditures data, data on the concentration of chemical elements in samples of soil, data on morphometric measurements of fish, and electoral data, all of which are associated with some covariates. Also in that section, real-life data on life expectancy are linked to gender and to compositional predictor variables containing the proportions of deaths by various diseases. In addition to these examples, the literature cites numerous other applications of compositional regression analysis. For example, data from oceanography studies involving Foraminiferal (a marine plankton species) compositions at different sea depths were analyzed in Aitchison (2003). In hydrochemistry, Otero et al. (2005) used regression analysis to draw conclusions about anthropogenic and geological pollution sources of rivers in Spain. In economics, Morais et al. (2018) linked market shares to some independent variables, while in political science the percentage of votes of each candidate can be linked to some predictor variables (Katz and King, 1999, Tsagris and Stewart, 2018). Finally, in the field of bioinformatics, compositional data techniques have been used for analysing microbiome data (Xia et al., 2013, Chen and Li, 2016, Shi et al., 2016).

The need for valid regression models for compositional data in practice has led to several developments in this area, many of which have been proposed in recent years. The first regression model for compositional response data was developed by Aitchison (2003); commonly referred to as Aitchison's model, it was based on the additive log-ratio transformation defined in Section 2. Iyengar and Dey (2002) investigated the generalized Liouville family of distributions, which permits distributions with negative or mixed correlation and also contains non-Dirichlet distributions with non-positive correlation. Gueorguieva et al. (2008), Hijazi and Jernigan (2009), Melo et al. (2009) and Morais et al. (2018) modelled compositional data using Dirichlet regression. Tolosana-Delgado and von Eynatten (2009) also used the additive log-ratio transformation, while Egozcue et al. (2012) extended Aitchison's regression model by using a transformation similar to the isometric log-ratio transformation (see Section 2), but instead of employing the Helmert sub-matrix, they chose a different orthogonal matrix that is compositional-data dependent.

A drawback of the aforementioned regression models is their inability to handle zero values and, consequently, a few models have recently been proposed to address the zero problem. Scealy and Welsh (2011) transformed the compositional data onto the unit hyper-sphere and introduced the Kent regression, which treats zero values naturally. Leininger et al. (2013) modelled spatial compositional data with zeros from a Bayesian standpoint. Mullahy (2015) estimated regression models of economic share data where the shares could take zero values with non-trivial probability. Murteira and Ramalho (2016) discussed alternative regression models, applicable when zero values are present, in the field of econometrics. Tsagris (2015a) proposed a regression model that minimizes the Jensen-Shannon divergence and Tsagris (2015b) proposed the α-regression that generalises Aitchison's log-ratio regression, both of which are compatible with zeros. Tsagris and Stewart (2018) proposed the zero adjusted Dirichlet regression, an extension of Dirichlet regression allowing for zeros.

The case of compositional predictor data has also been examined, but to a smaller degree. Egozcue et al. (2012) suggested the use of the isometric log-ratio transformation for the compositional predictor variables, whereas Tsagris (2015b) proposed applying the α-transformation to the compositional predictors prior to fitting any regression model. (Both transformations are discussed in Section 2.) Shi et al. (2016), in the field of bioinformatics, also considered regression analysis with compositional data as covariates, while Lin et al. (2014) proposed a more sophisticated model that performs a LASSO type variable selection with compositional predictor variables.

Finally, Wang et al. (2013) considered linear regression with compositional data as both the dependent and the independent variables, again using the isometric log-ratio transformation, whereas Alenazi (2019) suggested the use of principal components regression.

Most of the aforementioned regression models share the same characteristic: they are limited to linear or generalised linear relationships between the dependent and independent variables, even though the relationships in many real-life applications are not restricted to the linear setting nor conform to fixed parametric forms. For this reason, more advanced regression models and algorithms, such as k-NN regression, are often considered. Kernel regression (Wand and Jones, 1994) is a more sophisticated technique that generalises k-NN regression by adding different weights to each observation that decay exponentially with distance. Di Marzio et al. (2015) introduced local constant and local linear smoothing regression and examined the cases where the response, the predictors, or both are compositions. A disadvantage of kernel regression is that it is more complex and computationally expensive than k-NN regression. Another highly flexible model is projection pursuit regression (Friedman and Stuetzle, 1981), applicable to log-ratio transformed compositional data. The log-ratio transformation can be either the additive or the isometric log-ratio previously mentioned, but in either case zero values are not allowed.

The contribution of this paper is the proposed α-k-NN regression for compositional data, which links compositional data to covariates in a non-parametric, non-linear fashion. The model has the potential to provide a better fit to the data, yielding improved predictions when the relationships between the compositional and the non-compositional variables are complex. The approach utilises the α-transformation and extends the classical k-NN regression, thus adding more flexibility. Additionally, in contrast to other non-parametric regressions such as projection pursuit, the method allows for zero values in the data. A disadvantage of the α-k-NN regression strategy, however, is that the usual statistical inference on the effect of the independent variables is not straightforward. α-k-NN regression is first developed for the case in which the response data are compositional (Compositional-Euclidean regression) and subsequently for the Euclidean-Compositional case (where only the predictor variables are compositional). The Compositional-Compositional regression case (where both the predictor and response variables are compositional) was also examined, but the proposed method did not perform better than the principal components regression method of Alenazi (2019) and hence is not considered here.

The paper is structured as follows: Section 2 describes relevant transformations and regression models for compositional data, while in Section 3 the α-k-NN regression is introduced. Simulation studies are presented in Section 4 and in Section 5 our proposed methodology is applied to real-life datasets, illustrating the advantages and limitations of the proposed model. Finally, concluding remarks can be found in Section 6.

2 Compositional data analysis: transformations and regression models

Some preliminary definitions and methods in compositional data analysis relevant to the work in this paper are now introduced. Two commonly used log-ratio transformations and the more general α-transformation are defined and, subsequently, some existing regression models for compositional response data are presented.

2.1 Transformations

2.1.1 Additive log-ratio transformation

Aitchison (1982) suggested applying the additive log-ratio (alr) transformation to compositional data. Let $\mathbf{x} = (x_1, \ldots, x_D)^\top \in \mathbb{S}^{D-1}$; then the alr transformation is given by

$$\mathbf{v} = \left( \log\frac{x_2}{x_1}, \ldots, \log\frac{x_D}{x_1} \right)^\top, \quad (2)$$

where $x_1$ is the common divisor. Note that the common divisor need not be the first component; it was simply chosen for convenience. The inverse of Equation (2) is given by

$$\mathbf{x} = \left( \frac{1}{1 + \sum_{j=1}^{D-1} e^{v_j}}, \frac{e^{v_1}}{1 + \sum_{j=1}^{D-1} e^{v_j}}, \ldots, \frac{e^{v_{D-1}}}{1 + \sum_{j=1}^{D-1} e^{v_j}} \right)^\top. \quad (3)$$
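As an illustration, the alr transformation and its inverse are one-liners in most scientific languages. The following is a minimal Python/NumPy sketch (the function names are ours for illustration, not taken from any package associated with the paper):

```python
import numpy as np

def alr(x):
    """Additive log-ratio transformation, Eq. (2), with the first
    component as the common divisor (x must contain no zeros)."""
    x = np.asarray(x, dtype=float)
    return np.log(x[1:] / x[0])

def alr_inv(v):
    """Inverse alr transformation, Eq. (3): maps R^{D-1} back onto the simplex."""
    e = np.exp(np.concatenate(([0.0], np.asarray(v, dtype=float))))  # exp(0) = 1 for the divisor
    return e / e.sum()
```

For example, `alr_inv(alr([0.2, 0.3, 0.5]))` recovers the original composition.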

2.1.2 Isometric log-ratio transformation

An alternative transformation proposed by Aitchison (1983) is the centred log-ratio (clr) transformation defined as

$$\mathbf{w} = \left( \log\frac{x_1}{g(\mathbf{x})}, \ldots, \log\frac{x_D}{g(\mathbf{x})} \right)^\top, \quad (4)$$

where $g(\mathbf{x}) = \left( \prod_{j=1}^{D} x_j \right)^{1/D}$ is the geometric mean which, in practice, is computed for each compositional vector in the sample. The inverse of Equation (4) is given by

$$\mathbf{x} = C\left( e^{w_1}, \ldots, e^{w_D} \right), \quad (5)$$

where $C(\cdot)$ denotes the closure operation, or normalization to the unity sum, $C(\mathbf{u}) = \mathbf{u} / \sum_{j=1}^{D} u_j$.

The clr transformation in Equation (4) was proposed in the context of principal component analysis and its drawback is that $\sum_{j=1}^{D} w_j = 0$, so essentially the problem of the unity sum constraint has been replaced by the problem of the zero sum constraint. In order to address this issue, Egozcue et al. (2003) proposed multiplying Equation (4) by the Helmert sub-matrix $\mathbf{H}$ (Lancaster, 1965, Dryden and Mardia, 1998, Le and Small, 1999), the $(D-1) \times D$ orthogonal matrix obtained from the full Helmert matrix by omitting its first row, which results in what is called the isometric log-ratio (ilr) transformation

$$\mathbf{z} = \mathbf{H}\mathbf{w}, \quad (6)$$

where $\mathbf{w}$ is the clr-transformed vector of Equation (4). Note that any orthogonal matrix which preserves distances would also be appropriate (Tsagris et al., 2011) in place of $\mathbf{H}$. The inverse of Equation (6) is

$$\mathbf{x} = C\left( e^{\mathbf{H}^\top \mathbf{z}} \right), \quad (7)$$

with the exponential applied componentwise.
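A minimal sketch of the clr and ilr transformations follows, including a direct construction of the Helmert sub-matrix $\mathbf{H}$ (again, the helper names are ours, continuing the NumPy sketch above):

```python
import numpy as np

def helmert_sub(D):
    """(D-1) x D Helmert sub-matrix: orthonormal rows, each orthogonal to 1_D."""
    H = np.zeros((D - 1, D))
    for i in range(1, D):
        H[i - 1, :i] = 1.0 / np.sqrt(i * (i + 1))
        H[i - 1, i] = -i / np.sqrt(i * (i + 1))
    return H

def clr(x):
    """Centred log-ratio transformation, Eq. (4); the result sums to zero."""
    lx = np.log(np.asarray(x, dtype=float))
    return lx - lx.mean()

def ilr(x):
    """Isometric log-ratio transformation, Eq. (6): clr followed by H."""
    return helmert_sub(len(x)) @ clr(x)

def ilr_inv(z):
    """Inverse ilr, Eq. (7): undo H, exponentiate and close to unit sum."""
    e = np.exp(helmert_sub(len(z) + 1).T @ np.asarray(z, dtype=float))
    return e / e.sum()
```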

2.1.3 α-transformation

The main disadvantage of the above transformations is that they do not allow zero values in any of the components, unless a zero value imputation technique (see Martín-Fernández et al. (2003)) is first applied. This strategy, however, can produce regression models with predictive performance worse than that of regression models which handle zeros naturally (Tsagris, 2015a). When zeros occur in the data, the power transformation introduced by Aitchison (2003), and subsequently modified by Tsagris et al. (2011), may be used. Specifically, Aitchison (2003) defined the power transformation as

$$\mathbf{u} = \left( \frac{x_1^{\alpha}}{\sum_{j=1}^{D} x_j^{\alpha}}, \ldots, \frac{x_D^{\alpha}}{\sum_{j=1}^{D} x_j^{\alpha}} \right)^\top, \quad (8)$$

and Tsagris et al. (2011) defined the α-transformation, based on Equation (8), as

$$\mathbf{z} = \frac{1}{\alpha} \mathbf{H} \left( D\,\mathbf{u} - \mathbf{1}_D \right), \quad (9)$$

where $\mathbf{H}$ is the Helmert sub-matrix and $\mathbf{1}_D$ is the $D$-vector of ones. The power transformed vector $\mathbf{u}$ in Equation (8) remains in the simplex $\mathbb{S}^{D-1}$, whereas $\mathbf{z}$ in Equation (9) is mapped onto a subset of $\mathbb{R}^{D-1}$. Note that the transformation in Equation (9) is simply a linear transformation of Equation (8). Furthermore, as $\alpha \rightarrow 0$, Equation (9) converges to the ilr transformation in Equation (6) (Tsagris et al., 2016). For convenience, $\alpha$ is generally taken to lie between $-1$ and $1$, but when zeros occur in the data, $\alpha$ must be restricted to be strictly positive. The inverse of (9) is

$$\mathbf{x} = C\left[ \left( \alpha\, \mathbf{H}^\top \mathbf{z} + \mathbf{1}_D \right)^{1/\alpha} \right]. \quad (10)$$

For a sample of compositional data $\mathbf{x}_1, \ldots, \mathbf{x}_n$ transformed by Equation (8), and a given value of $\alpha$, the sample Fréchet mean using the α-transformation was specified in Tsagris et al. (2011) as

$$\boldsymbol{\mu}_\alpha = C\left[ \left( \frac{1}{n}\sum_{i=1}^{n} \frac{x_{i1}^{\alpha}}{\sum_{j=1}^{D} x_{ij}^{\alpha}}, \ldots, \frac{1}{n}\sum_{i=1}^{n} \frac{x_{iD}^{\alpha}}{\sum_{j=1}^{D} x_{ij}^{\alpha}} \right)^{1/\alpha} \right]. \quad (11)$$

Also, Equation (11) converges to the closed geometric mean (defined below and in Aitchison (1989)) as $\alpha$ tends to zero. That is,

$$\boldsymbol{\mu}_0 = C\left( \prod_{i=1}^{n} x_{i1}^{1/n}, \ldots, \prod_{i=1}^{n} x_{iD}^{1/n} \right).$$

Tsagris et al. (2011) argued that while the α-transformation does not satisfy some of the properties that Aitchison (2003) deemed important, this is not a downside of the transformation, as those properties were suggested mainly to fortify the concept of log-ratio methods. Scealy and Welsh (2014) also questioned the importance of these properties and, in fact, showed that some of them are not actually satisfied by the log-ratio methods they were intended to justify. Further, using the power transformation in Equation (8), Pantazis et al. (2019) derived an important theoretical result: for Dirichlet distributed compositional data whose parameters grow large, so that the distribution becomes highly concentrated towards the centre of the simplex, the distribution of the transformed data becomes Gaussian.
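The α-transformation, its inverse and the Fréchet mean translate directly into code. The sketch below uses our own helper names (`helmert_sub` is from the previous sketch) and treats $\alpha = 0$ as the log-ratio limit:

```python
import numpy as np

def closure(u):
    """C(.): normalise a positive vector (or the rows of a matrix) to unit sum."""
    u = np.asarray(u, dtype=float)
    return u / u.sum(axis=-1, keepdims=True)

def alpha_power(x, a):
    """Aitchison's power transformation, Eq. (8); the result stays in the simplex."""
    return closure(np.asarray(x, dtype=float) ** a)

def alpha_transform(x, a):
    """alpha-transformation, Eq. (9); the a -> 0 limit is the ilr transform, Eq. (6)."""
    x = np.asarray(x, dtype=float)
    D = len(x)
    H = helmert_sub(D)
    if a == 0:
        lx = np.log(x)
        return H @ (lx - lx.mean())
    return H @ ((D * alpha_power(x, a) - 1.0) / a)

def alpha_inverse(z, a):
    """Inverse of Eq. (9), i.e. Eq. (10): map a transformed vector back to the simplex."""
    H = helmert_sub(len(z) + 1)
    if a == 0:
        return closure(np.exp(H.T @ np.asarray(z, dtype=float)))
    return closure((a * (H.T @ np.asarray(z, dtype=float)) + 1.0) ** (1.0 / a))

def frechet_mean(X, a):
    """Sample Frechet mean of Eq. (11); a -> 0 gives the closed geometric mean.
    X: n x D matrix of compositions."""
    X = np.asarray(X, dtype=float)
    if a == 0:
        return closure(np.exp(np.log(X).mean(axis=0)))
    return closure(alpha_power(X, a).mean(axis=0) ** (1.0 / a))
```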

2.1.4 Other transformations

Aitchison (2003) considered a Box–Cox transformation for compositional data, defined as $\left( (x_j/x_D)^{\lambda} - 1 \right)/\lambda$ for $j = 1, \ldots, D-1$, with the alr transformation in Equation (2) being the limit as $\lambda \rightarrow 0$. Iyengar and Dey (1998) extended this transformation by substituting $\lambda$ with $\lambda_j$, allowing a different lambda for each component. Greenacre (2009, 2011) considered a power transformation similar to Equation (9), but in the context of correspondence analysis; it is also a Box–Cox transformation, applied to each component, whose limit as the power parameter tends to zero is the log transformation. More recently, Stewart et al. (2014) suggested a new metric for compositional data that is based on the power transformation in Equation (8) and is similar to the α-distance defined in Equation (17), with its limit as $\alpha \rightarrow 0$ being Aitchison's distance measure (18).

Another approach is to treat compositional data as directional data after applying the square root transformation. This technique, which allows for zeros, was first suggested by Stephens (1982) and has been popularised by Scealy and Welsh (2011, 2014) and Scealy et al. (2015). Note that raw data analysis, which applies standard multivariate techniques to the standardised compositional data, has also been used by some authors (Baxter, 1995, 2001, Baxter et al., 2005).

2.2 Compositional-Euclidean regression models

In this section, some pre-existing regression models proposed for compositional response variables are reviewed. The additive and isometric log-ratio regression models, presented first, do not allow for zeros, whereas all the other models treat zeros naturally.

2.2.1 Additive and isometric log-ratio regression models

Let $\mathbf{V}$ denote the $n \times (D-1)$ response matrix whose rows contain the alr-transformed compositions. $\mathbf{V}$ can then be linked to some predictor variables via

$$\mathbf{V} = \mathbf{X}\mathbf{B} + \mathbf{E}, \quad (12)$$

where $\mathbf{B}$ is the matrix of coefficients, $\mathbf{X}$ is the design matrix containing the predictor variables and $\mathbf{E}$ is the residual matrix. Referring to Equation (2), Equation (12) can be re-written as

$$\log y_{ij} = \log y_{i1} + \mathbf{x}_i^\top \boldsymbol{\beta}_j + \epsilon_{ij}, \quad j = 2, \ldots, D, \quad (13)$$

where $\mathbf{y}_i$ is the $i$-th compositional response, $\mathbf{x}_i$ is the $i$-th row of $\mathbf{X}$ and $\boldsymbol{\beta}_j$ is the $j$-th column of $\mathbf{B}$. Equation (13) can be found in Tsagris (2015b), where it is shown that alr regression is in fact a multivariate linear regression on the logarithm of the compositional data, with the (logarithm of the) first component (or any other component) playing the role of an offset variable; an independent variable with coefficient equal to 1.

Regression based on the ilr transformation (ilr regression) is similar to alr regression and is carried out by substituting the alr-transformed responses in Equation (12) with the ilr-transformed data of Equation (6). The fitted values for both the alr and ilr transformations are the same and are therefore generally back-transformed onto the simplex using the inverse of the alr transformation in Equation (3) for ease of interpretation.
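As a sketch, alr regression of Equation (12) is ordinary multivariate least squares on the alr-transformed responses, back-transformed via Equation (3). The function names are ours (`alr_inv` is from the earlier sketch), and a zero-free response matrix is assumed:

```python
import numpy as np

def alr_regression(X, Y):
    """Fit Eq. (12): multivariate OLS on alr-transformed compositions.
    X: n x p matrix of predictors (no intercept column), Y: n x D compositions."""
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])            # add the intercept
    V = np.log(Y[:, 1:] / Y[:, [0]])                 # alr transform, Eq. (2)
    B, *_ = np.linalg.lstsq(Xd, V, rcond=None)       # coefficient matrix of Eq. (12)
    fitted = np.apply_along_axis(alr_inv, 1, Xd @ B) # back onto the simplex, Eq. (3)
    return B, fitted
```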

2.2.2 α-regression

Tsagris (2015b) proposed the α-regression, which utilises the inverse of the additive log-ratio transformation combined with the α-transformation as a link function. An interesting feature of this method is that the regression line is always curved (unless $\alpha$ is far from zero) and it can be seen as a generalization of log-ratio regression, with the limiting case ($\alpha \rightarrow 0$) being the alr regression in Equation (12). In order for the fitted values to satisfy the constraint imposed by the simplex, the inverse of the additive logistic transformation of the mean response is used to link the predictor variables to the compositional responses. Hence, the fitted values will always lie within $\mathbb{S}^{D-1}$ and the model retains the flexibility that the α-transformation can offer. Note that the α-transformation is applied to both the observed compositional data and the fitted values,

$$\boldsymbol{\mu}_i = \left( \frac{1}{1 + \sum_{j=2}^{D} e^{\mathbf{x}_i^\top \boldsymbol{\beta}_j}}, \frac{e^{\mathbf{x}_i^\top \boldsymbol{\beta}_2}}{1 + \sum_{j=2}^{D} e^{\mathbf{x}_i^\top \boldsymbol{\beta}_j}}, \ldots, \frac{e^{\mathbf{x}_i^\top \boldsymbol{\beta}_D}}{1 + \sum_{j=2}^{D} e^{\mathbf{x}_i^\top \boldsymbol{\beta}_j}} \right), \quad (14)$$

where the $\boldsymbol{\beta}_j$ are as in Equation (13), and a Gaussian log-likelihood is maximised. The value of $\alpha$ is chosen via minimisation of the Kullback-Leibler divergence of the observed from the fitted values (Tsagris, 2015b).

2.2.3 Kullback-Leibler divergence based regression

Murteira and Ramalho (2016) used Equation (14) (as in the α-regression) and estimated the coefficients via minimization of the Kullback-Leibler divergence

$$\min_{\mathbf{B}} \sum_{i=1}^{n} \sum_{j=1}^{D} y_{ij} \log\frac{y_{ij}}{\mu_{ij}}, \quad (15)$$

where $\boldsymbol{\mu}_i = (\mu_{i1}, \ldots, \mu_{iD})$ is defined in Equation (14). The above regression model (15), also referred to as multinomial logit regression, will be denoted by KLD (Kullback-Leibler divergence) regression throughout the rest of the paper.
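A minimal sketch of KLD regression follows: the fitted compositions come from the inverse-alr link of Equation (14), and the coefficients minimise the divergence in Equation (15). The paper cites the Newton-Raphson algorithm of Böhning (1992); for brevity this sketch substitutes a generic BFGS optimiser, so it is illustrative rather than the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize

def kld_regression(X, Y):
    """KLD (multinomial logit) regression, Eqs. (14)-(15).
    X: n x p predictors (no intercept column), Y: n x D compositions (zeros allowed)."""
    n, p = X.shape
    D = Y.shape[1]
    Xd = np.column_stack([np.ones(n), X])

    def fitted(B):
        eta = np.column_stack([np.zeros(n), Xd @ B])  # first component is the baseline
        eta -= eta.max(axis=1, keepdims=True)         # guard against overflow
        M = np.exp(eta)
        return M / M.sum(axis=1, keepdims=True)       # inverse alr link, Eq. (14)

    def kl(b):
        M = fitted(b.reshape(p + 1, D - 1))
        # zero observed parts contribute 0 to the divergence by convention
        return np.where(Y > 0, Y * np.log(np.where(Y > 0, Y, 1.0) / M), 0.0).sum()

    res = minimize(kl, np.zeros((p + 1) * (D - 1)), method="BFGS")
    B = res.x.reshape(p + 1, D - 1)
    return B, fitted(B)
```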

2.3 Euclidean-Compositional regression models

There are a limited number of proposed regression models for the case of compositional predictor variables. Hron et al. (2012) applied the ilr transformation in Equation (6) to the compositional predictor variables before carrying out a standard regression analysis. This is, however, not the optimal strategy, and collinearities among the transformed variables can still be present. Meng (2010) and Wang et al. (2010) proposed the use of partial least squares regression. In line with these methods, Tsagris (2015b) proposed the use of principal component regression coupled with the α-transformation in Equation (9): the compositional data are first α-transformed and principal component regression is then performed. The value of $\alpha$ and the number of principal components are chosen so as to minimize the cross-validated mean squared prediction error.

3 The α-k-NN regression

The well-known k-NN regression is a naive non-parametric smoother. In general terms, to predict the response value corresponding to a new vector of predictor values, the Euclidean distances from this new vector to the observed predictor values are first calculated. k-NN regression then selects the response values of the observations with the $k$ smallest distances to the new vector, and averages those response values using the sample mean or median.

The proposed α-k-NN regression is an extension of k-NN regression that takes into account the compositional nature of the data. The method incorporates the power transformation, allowing for more flexibility compared to the usual log-ratio methods. It is applicable to both regression cases, namely the Compositional-Euclidean case as well as the Euclidean-Compositional case, as described below.

3.1 Compositional-Euclidean α-k-NN regression

When the response variables, $\mathbf{y}_i$, represent compositional data, the k-NN algorithm can simply be applied to the transformed data. Then, for a given transformation, the average of the transformed observations whose predictor values are closest (in Euclidean distance) to the new predictor value is computed, and a back-transformation of the predicted vector can be used to map it onto the simplex.

The approach presented here involves using the power transformation in Equation (8) combined with the Fréchet mean in Equation (11). In α-k-NN regression, the predicted response value corresponding to a new predictor vector is then

$$\hat{\mathbf{y}}_\alpha = C\left[ \left( \frac{1}{k}\sum_{i \in N_k} \frac{y_{i1}^{\alpha}}{\sum_{j=1}^{D} y_{ij}^{\alpha}}, \ldots, \frac{1}{k}\sum_{i \in N_k} \frac{y_{iD}^{\alpha}}{\sum_{j=1}^{D} y_{ij}^{\alpha}} \right)^{1/\alpha} \right], \quad (16)$$

where $N_k$ denotes the set of the $k$ observations whose predictor values are nearest to the new predictor vector. In the limiting case of $\alpha \rightarrow 0$, the predicted response value is the closed geometric mean of the selected responses,

$$\hat{\mathbf{y}}_0 = C\left( \prod_{i \in N_k} y_{i1}^{1/k}, \ldots, \prod_{i \in N_k} y_{iD}^{1/k} \right).$$

It is interesting to note that the limiting case also results from applying the clr transformation to the response data, taking the mean of the relevant transformed observations and then back-transforming the mean using Equation (5). To see this, let

$$\bar{w}_j = \frac{1}{k}\sum_{i \in N_k} \log\frac{y_{ij}}{g(\mathbf{y}_i)}, \quad j = 1, \ldots, D,$$

and hence

$$\bar{w}_j = \log \prod_{i \in N_k} y_{ij}^{1/k} - \log \prod_{i \in N_k} g(\mathbf{y}_i)^{1/k},$$

or, upon exponentiating and applying the closure of Equation (5), the common term $\prod_{i \in N_k} g(\mathbf{y}_i)^{1/k}$ cancels, leaving exactly $\hat{\mathbf{y}}_0$ above. α-k-NN regression is therefore a generalization of the above procedure.
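Equation (16) then amounts to a k-NN search in the predictor space followed by the Fréchet mean of the selected response compositions. A minimal sketch, reusing `frechet_mean` from the earlier sketch (the function name is ours):

```python
import numpy as np

def alpha_knn_comp_response(X_train, Y_train, x_new, a, k):
    """alpha-k-NN prediction of Eq. (16): compositional response, Euclidean predictors.
    X_train: n x p predictors, Y_train: n x D compositions, x_new: p-vector."""
    X_train = np.atleast_2d(np.asarray(X_train, dtype=float))
    d = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    nn = np.argsort(d)[:k]                  # indices of the k nearest neighbours
    return frechet_mean(np.asarray(Y_train, dtype=float)[nn], a)
```

Setting `a = 0` returns the closed geometric mean of the $k$ selected compositions, i.e. the clr-based limit discussed above.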

3.2 Euclidean-Compositional α-k-NN regression

Predictor compositional data are treated in a similar manner, switching from the classification task (Tsagris et al., 2016) to the regression task. In this case the predicted values are given by

$$\hat{y} = \frac{1}{k}\sum_{i \in N_k} y_i,$$

where the response variables $y_i$ are assumed to belong to Euclidean space and $N_k$ denotes the set of the $k$ observations whose predictor values are closest to the new predictor values. The proximity of the compositional predictor vectors to the new predictor vector is measured via the α-metric (Tsagris et al., 2016), defined for compositional vectors $\mathbf{x}_1, \mathbf{x}_2 \in \mathbb{S}^{D-1}$ as

$$\Delta_\alpha(\mathbf{x}_1, \mathbf{x}_2) = \frac{D}{|\alpha|} \left[ \sum_{j=1}^{D} \left( \frac{x_{1j}^{\alpha}}{\sum_{l=1}^{D} x_{1l}^{\alpha}} - \frac{x_{2j}^{\alpha}}{\sum_{l=1}^{D} x_{2l}^{\alpha}} \right)^2 \right]^{1/2}. \quad (17)$$

The special case

$$\Delta_0(\mathbf{x}_1, \mathbf{x}_2) = \left[ \sum_{j=1}^{D} \left( \log\frac{x_{1j}}{g(\mathbf{x}_1)} - \log\frac{x_{2j}}{g(\mathbf{x}_2)} \right)^2 \right]^{1/2}, \quad (18)$$

obtained as $\alpha \rightarrow 0$, is Aitchison's distance measure (Aitchison, 2003), whereas $\Delta_1$ is just the Euclidean distance multiplied by $D$.
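For the Euclidean-Compositional case the only compositional ingredient is the α-metric of Equation (17); the prediction itself is a plain average of the neighbours' responses. A minimal sketch (names are ours; `alpha_power` is from the earlier sketch):

```python
import numpy as np

def alpha_distance(x1, x2, a):
    """alpha-metric of Eq. (17); a -> 0 gives Aitchison's distance, Eq. (18)."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    if a == 0:
        c1 = np.log(x1) - np.log(x1).mean()   # clr vectors
        c2 = np.log(x2) - np.log(x2).mean()
        return np.linalg.norm(c1 - c2)
    D = len(x1)
    return (D / abs(a)) * np.linalg.norm(alpha_power(x1, a) - alpha_power(x2, a))

def alpha_knn_comp_predictors(X_train, y_train, x_new, a, k):
    """alpha-k-NN: Euclidean response, compositional predictors.
    Neighbours are found with the alpha-metric; responses are simply averaged."""
    d = np.array([alpha_distance(x_new, xi, a)
                  for xi in np.asarray(X_train, dtype=float)])
    nn = np.argsort(d)[:k]
    return np.asarray(y_train, dtype=float)[nn].mean()
```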

4 Simulation studies

In this section, two sets of simulations are explored, namely one for the Compositional-Euclidean case and another for the Euclidean-Compositional setting. For each, two relationships between the response and predictor variables are considered. A 10-fold cross-validation protocol was repeatedly (100 times) applied for each regression model. The criterion of predictive performance depends upon the type of the response variable and is explained in the relevant subsections. All computations were carried out in R, and the R package Compositional (Tsagris et al., 2019) was utilised for the already published regression models.
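The evaluation protocol is straightforward to reproduce: the pair $(\alpha, k)$ is chosen on a grid by repeated 10-fold cross-validation. A minimal sketch of one repetition follows (the function and argument names are ours; `predict` is one of the α-k-NN predictors sketched above and `loss` is, e.g., the Kullback-Leibler or Jensen-Shannon divergence):

```python
import numpy as np
from itertools import product

def cv_tune(X, Y, alphas, ks, predict, loss, n_folds=10, seed=0):
    """One 10-fold CV pass over a grid of (alpha, k) pairs; returns the pair
    with the smallest average prediction error."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    best_pair, best_err = None, np.inf
    for a, k in product(alphas, ks):
        err = 0.0
        for test in folds:
            train = np.setdiff1d(np.arange(len(X)), test)
            err += sum(loss(Y[i], predict(X[train], Y[train], X[i], a, k))
                       for i in test)
        if err < best_err:
            best_pair, best_err = (a, k), err
    return best_pair, best_err / len(X)
```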

4.1 Compositional-Euclidean regression

As previously mentioned, alr and ilr regression do not allow zero values (without some form of imputation), whereas the α-regression and the KLD regression treat them naturally. However, the α-regression is computationally expensive (a numerical optimization using the Nelder-Mead algorithm (Nelder and Mead, 1965) is applied via the command optim in R) and will not be considered in the evaluation studies of the present paper. The KLD regression is also computationally expensive, but employment of the Newton-Raphson algorithm (Böhning, 1992) allows for an efficient computation.

For the Compositional-Euclidean setting, the values of the predictor variables were generated from a Gaussian distribution with mean zero and unit variance and were linked to the compositional responses via two functions, a polynomial as well as a more complex, segmented function. For both cases, the outcome $\mathbf{m}$ was mapped onto the simplex using Equation (19),

$$y_j = \frac{e^{m_j}}{1 + \sum_{l=1}^{D-1} e^{m_l}} \ (j = 1, \ldots, D-1), \qquad y_D = \frac{1}{1 + \sum_{l=1}^{D-1} e^{m_l}}, \quad (19)$$

where $D$ is the number of components. Note that Equation (19) is the inverse of the alr transformation (Equation (3)).

More specifically, for the simpler polynomial case, the values of the predictor variables (either 1 or 2 predictor variables) were raised to a power (one of three powers) and then multiplied by a vector of coefficients, with white noise added, yielding

$$\mathbf{m} = \boldsymbol{\beta}_0 + \boldsymbol{\beta}_1 x_1^{p} + \boldsymbol{\beta}_2 x_2^{p} + \boldsymbol{\epsilon}, \quad (20)$$

where $p$ indicates the degree of the polynomial and the second predictor term is present only in the two-predictor scenario. The constant terms and the slope coefficients of the regression were randomly generated.

For the segmented linear model case, one predictor variable was used, ranging over a fixed interval, and the function was defined piecewise as

$$\mathbf{m} = \begin{cases} \boldsymbol{\beta}_0 + \boldsymbol{\beta}_1 x + \boldsymbol{\epsilon}, & x \leq c, \\ \boldsymbol{\gamma}_0 + \boldsymbol{\gamma}_1 x + \boldsymbol{\epsilon}, & x > c, \end{cases} \quad (21)$$

for a breakpoint $c$. The $\boldsymbol{\beta}$ regression coefficients and the $\boldsymbol{\gamma}$ regression coefficients were randomly generated.

The above two scenarios were repeated with the addition of zero values in 20% of randomly selected compositional vectors. For each compositional vector that was randomly selected, a third of its component values were set to zero and the vector was re-normalised to sum to 1. Finally, for all cases, the sample sizes varied between 100 and 1000 with a step size of 50, and several values of the number of components $D$ were considered (shown in different colours in the figures). The estimated predictive performance of the regression models was computed using the Kullback-Leibler (KL) divergence from the observed to the predicted compositional vectors and the Jensen-Shannon (JS) divergence which, unlike the KL divergence, is a metric. For all examined scenarios the results were averaged over the repetitions.

Figures 1 and 2 graphically show the results of the comparison between the α-k-NN regression and the KLD regression with no zeros and with zero values present, respectively. In the first case, with no zero values present (Figure 1), when the relationship between the predictor variable(s) and the compositional responses is linear, the error is slightly smaller for the KLD regression than for the α-k-NN regression. In all other cases, i.e. quadratic, cubic and segmented relationships, the α-k-NN regression consistently produces more accurate predictions. Another trait observable in all plots of Figure 1 is that the relative predictive performance of the α-k-NN regression compared to the KLD regression decreases as the number of components in the compositional data increases. The results in the zero values present case, in Figure 2, are in essence the same as in the aforementioned case. When the relationship between the predictor variables and the responses is linear, the KLD regression again exhibits slightly more accurate predictions than the α-k-NN regression, while the opposite is true for most other cases. Furthermore, the impact of the number of components of the compositional responses on the relative predictive performance of the α-k-NN regression compared to the KLD regression varies according to the polynomial degree, the number of predictor variables and the sample size.

Figure 1: No zero values present scenario. Ratio of the Kullback-Leibler divergences between the α-k-NN regression and the KLD regression; values lower than 1 indicate that the α-k-NN regression has smaller prediction error than the KLD regression. Panels (a), (c), (e), (g) use one predictor variable; panels (b), (d), (f), (h) use two predictor variables. (a) and (b): the degree of the polynomial in (20) is p = 1; (c) and (d): p = 2; (e) and (f): p = 3; (g) and (h): the segmented linear relationship case (21). The number of components (D) appears in different colours.
Figure 2: Zero values present scenario. Ratio of the Kullback-Leibler divergences between the α-k-NN regression and the KLD regression; values lower than 1 indicate that the α-k-NN regression has smaller prediction error than the KLD regression. The panel layout is the same as in Figure 1.

4.2 Euclidean-Compositional regression

The values of the raw predictor variables were generated from a Gaussian distribution with mean zero and unit variance and were then mapped onto the simplex using the inverse alr transformation (3) to produce the compositional predictors. The (Euclidean) response variable was linked to the predictors via linear and non-linear functions; specifically, the scores of the first 2 eigenvectors (principal components) of the generated data were used to produce the response variable, as follows.

  • For the linear and polynomial relationships, the values of the scores ($s_1$ and $s_2$) were raised to a power and then multiplied by coefficients, followed by the addition of white noise ($\epsilon$),

    $$y = \beta_0 + \beta_1 s_1^{p} + \beta_2 s_2^{p} + \epsilon, \quad (22)$$

    where $p$ indicates the degree of the polynomial.

  • For the non-linear relationships, the response was generated as a two-part, non-linear function of the scores (23). The regression coefficients of the two parts were randomly generated.

The above two scenarios were repeated with the addition of zero values in 20% of randomly selected compositional vectors. For each randomly selected compositional vector, a third of its component values were set to zero and the vector was re-normalised to sum to 1. Finally, for all cases, the sample sizes varied between 100 and 1000 with a step of 50, and several values of the number of components $D$ were considered. The mean prediction square error (MPSE) measured the predictive performance of each regression method.

Figure 3: Ratio of the MPSE between the α-k-NN regression and the α-PCR; values lower than 1 indicate that the α-k-NN regression has smaller prediction error than the α-PCR. Panels (a), (c), (e), (g) refer to the no zero values present case; panels (b), (d), (f), (h) to the zero values present case. (a) and (b): the degree of the polynomial in (22) is p = 1; (c) and (d): p = 2; (e) and (f): p = 3; (g) and (h): the non-linear relationship case (23).

Figure 3 illustrates the relative predictive performance of the α-k-NN regression compared to the α-PCR (Tsagris, 2015b). The predictive performance of the α-k-NN regression is always better than that of the α-PCR, except for the case of the linear relationship. In all cases, the number of components of the compositional predictor variables affects the predictive performance of the α-k-NN regression: regardless of the type of relationship, the relative predictive performance of the α-k-NN regression deteriorates as the number of components increases.

5 Examples with real data

To illustrate the performance of the α-k-NN regression, 9 publicly available datasets were utilised as examples.

5.1 Compositional-Euclidean regression

The same cross-validation protocol was repeated using the 7 real datasets listed below.

  • Lake: Measurements of the silt, sand and clay composition were taken at 39 different water depths in an Arctic lake. The question of interest is to predict the composition of these three elements at a given water depth. The dataset is available in the R package compositions (van den Boogaart et al., 2018) and contains no zero values.

  • Glacial: In a pebble analysis of glacial tills, the percentages by weight of pebbles sorted into 4 categories, red sandstone, gray sandstone, crystalline and miscellaneous, were recorded for 92 observations. The glaciologist is interested in predicting the composition based on the total pebble counts. The dataset is available in the R package compositions (van den Boogaart et al., 2018) and almost half of the observations (42 out of 92) contain at least one zero value.

  • GDP: The 2009 GDP per capita of the 27 member states of the European Union and the mean household consumption expenditures (in Euros) in 12 categories: food, house, alcohol-tobacco, clothing, household, health, transport, communication, culture, education, horeca and miscellaneous. The data are taken from Eurostat, are available in Egozcue et al. (2012) and contain no zero values.

  • Gemas: This dataset contains 2083 compositional vectors of concentrations of 22 chemical elements (in mg/kg). According to Templ et al. (2011), the sampling, at a density of 1 site per 2500 sq. km, was completed at the beginning of 2009 by collecting 2211 samples of agricultural soil and 2118 samples from land under permanent grass cover, according to an agreed field protocol; all samples were then shipped to Slovakia for further processing. The dataset is available in the R package robCompositions (Templ et al., 2011) with 2108 vectors, but 25 vectors had missing values and were thus excluded from the current analysis. Only one vector contained a (single) zero value.

  • Fish: This dataset consists of information on the mass (predictor variable) and 26 morphometric measurements (compositional response variable) for 75 Salvelinus alpinus (a fish species). The dataset is available in the R package easyCODA (Greenacre, 2018) and contains no zero values.

  • Data: The response variable is a matrix of 9 party vote-shares across 89 different democracies (countries) and the predictor variable is the average number of electoral districts in each country. The dataset is available in the R package ocomposition (Rozenas, 2015) and 80 out of the 89 vectors contain at least one zero value.

  • Elections: This dataset contains information on the 2000 U.S. presidential election in the 67 counties of Florida. The number of votes each of the 10 candidates received was transformed into proportions. For each county, information on 8 predictor variables was available, such as population, percentage of population over 65 years old, mean personal income, percentage of people who graduated from college prior to 1990, and others. The dataset is available in Smith (2002) and 23 out of the 67 vectors contain at least one zero value.

Figure 4 presents boxplots of the relative performance of the α-k-NN regression compared to the KLD regression for each dataset. The first set of boxplots refers to the ratio of the Kullback-Leibler divergences between the α-k-NN regression and the KLD regression; the second set refers to the same ratio computed using the Jensen-Shannon divergences. In both cases, values lower than 1 indicate that the α-k-NN regression has smaller prediction error than the KLD regression.

The α-k-NN regression outperformed the KLD regression for the datasets Gemas and Elections, whereas the opposite was true for the datasets Lake, Fish and Data. There was no clear winner for the dataset Glacial.

Table 1 presents the most frequently selected values of $\alpha$ and $k$. It is worth mentioning that the value $\alpha = 0$ was never selected for any dataset, indicating that the isometric log-ratio transformation (6) was never considered the optimal transformation. In addition, the smaller the percentage of selection, the higher the variance of that parameter. For the dataset Lake, for example, the optimal value $\alpha = 1$ was chosen in 93% of the cases, exhibiting small variability, whereas for the dataset Fish the optimal value of $\alpha$ was selected only 29% of the time; indeed, for this dataset the optimal value of $\alpha$ exhibited large variability. The variability in the number of nearest neighbours ($k$) was independent of the variability of $\alpha$. For the dataset Lake, $k = 10$ nearest neighbours were chosen only 66% of the time, and the percentage was the same for the dataset Fish. For the dataset Gemas, the choice of $\alpha$ was highly variable, whereas the choice of $k$ was always the same. The opposite was true for the dataset Data, for which the optimal value of $\alpha$ was always the same but the value of $k$ was highly variable.

The first dataset (Lake) is the only one that contains 3 components and hence is easy to visualize. Figure 5 shows the observed compositional data, along with the fitted values of the α-k-NN regression and of the KLD regression. As expected, the fitted values of the α-k-NN regression do not fall along a line, as those of the KLD regression do.

Figure 4: Ratio of the Kullback-Leibler and the Jensen-Shannon divergences between the α-k-NN regression and the KLD regression. Values lower than 1 indicate that the α-k-NN regression has smaller prediction error than the KLD regression.
Figure 5: Ternary diagram of the Arctic lake data along with the fitted values based on the α-k-NN regression and the KLD regression.
Dataset   | Dimensions (n × D) | Vectors with at least one zero | No. of predictors | α (% of selection) | k (% of selection)
Lake      | 39 × 3             | 0                              | 1                 | 1 (93%)            | 10 (66%)
Glacial   | 92 × 4             | 42                             | 1                 | 1 (93%)            | 10 (75%)
GDP       | 27 × 12            | 0                              | 1                 | 0.8 (33%)          | 5 (89%)
Gemas     | 2083 × 22          | 1                              | 2                 | 0.5 (47%)          | 10 (100%)
Fish      | 75 × 26            | 0                              | 1                 | -1 (29%)           | 20 (66%)
Data      | 89 × 9             | 80                             | 1                 | 1 (100%)           | 10 (28%)
Elections | 67 × 10            | 23                             | 8                 | 0.5 (30%)          | 7 (32%)
Table 1: Information about the compositional response variables, the predictor variables, and the pairs (α, k) chosen for every dataset, specifically the most frequently selected values and their percentages of selection.

5.2 Euclidean-Compositional regression

The Mortality dataset, which has no zero values and is available in the R package robCompositions (Templ et al., 2011), will be analyzed. The life expectancy (in average years) is available for 30 European Union countries for both genders (male and female). The absolute numbers of deaths attributed to 8 types of diseases are also known: 1) certain infectious and parasitic diseases, 2) malignant neoplasms, 3) endocrine, nutritional and metabolic diseases, 4) mental and behavioural disorders, 5) diseases of the nervous system and the sense organs, 6) diseases of the circulatory system, 7) diseases of the respiratory system and 8) diseases of the digestive system. For each gender, the average country life expectancy was predicted using information on the composition of deaths by disease. The predictive performance metric used in this case is the Root Mean Prediction Square Error (RMPSE).

Figure 6 presents the results of the 100 times repeated CV protocol: the boxplots of the RMPSE of each regression model, along with a boxplot of their ratio, namely the ratio of the RMPSE between the α-PCR and the α-k-NN regression, where values greater than 1 indicate that the RMPSE of the α-PCR is higher than that of the α-k-NN regression. The RMPSE of the α-PCR is always more than 2 times higher than the RMPSE of the α-k-NN regression, showing that the α-k-NN regression consistently outperforms the α-PCR. Finally, the fitted values using the α-k-NN regression and using the α-PCR model are also presented, and they again provide evidence that the α-k-NN regression outperforms the α-PCR.

Table 2 presents some information regarding the selection of the pair ($\alpha$, $k$) for each gender. The variability of the selected values is not small and the conclusion is similar to the Compositional-Euclidean regression case: the degree of consistency in the selection of one parameter's value is independent of that of the other parameter.

Figure 6: (a) and (b): RMPSE of mean life expectancy for females and males, respectively, using the α-k-NN regression and the α-PCR model. (c): Ratio of the RMPSE between the α-PCR and the α-k-NN regression; values greater than 1 indicate that the RMPSE of the α-PCR is higher than that of the α-k-NN regression. (d) and (e): Observed versus estimated life expectancy using the α-k-NN regression and the α-PCR model for females and males, respectively. For both genders, the coefficient of determination was higher for the α-k-NN regression than for the α-PCR.
Dataset | α (% of selection) | k (% of selection)
Females | 0.9 (57%)          | 5 (57%)
Males   | 0.7 (43%)          | 4 (68%)
Table 2: Mortality dataset. Information on the pairs (α, k) chosen for each gender, specifically the most frequently selected values and their percentages of selection.

6 Conclusions

An improved, generic regression for compositional data, termed α-k-NN regression, was proposed. The following two cases were covered: Compositional-Euclidean regression and Euclidean-Compositional regression. In both cases, the α-transformation is first applied to the compositional data, followed by classical k-NN regression. For the Compositional-Euclidean regression we took advantage of the Fréchet mean defined for compositional data (Tsagris et al., 2011) in order to prevent fitted values outside the simplex; for the Euclidean-Compositional regression this was not necessary.

In the Compositional-Euclidean case, the α-k-NN regression outperformed or was on par with the Kullback-Leibler divergence regression, in both the simulation studies and the examples with real data, regardless of zero values being present. The simulation studies showed that when the relationships between the responses and the predictors are linear, the KLD regression has slightly better predictive performance and should be preferred to the α-k-NN regression. With non-linear relationships, on the contrary, the α-k-NN regression significantly outperformed the KLD regression, as the former's predictive performance was consistently better, at times by a factor of up to 2. Another interesting conclusion was that the relative predictive performance of the α-k-NN regression compared to the KLD regression was affected by the number of components of the compositional response: as the number of components increased, the ratio of their performances increased.

The conclusions from the real data analysis were similar. The α-k-NN regression outperformed the KLD regression in some datasets, whereas in some other datasets the converse was true, or their predictive performances were similar. An important trait was that these conclusions were the same regardless of the type of divergence used, Kullback-Leibler or Jensen-Shannon.

In the Euclidean-Compositional case, the α-k-NN regression again outperformed the α-PCR in both the simulation studies and the examples with real data, regardless of zero values being present. In the simulation studies, α-k-NN regression exhibited a higher predictive performance compared to α-PCR, except for some cases with a high number of components in the compositional data. However, as the sample size increased, the ratio of the regression models' relative performance approached 1. There were also cases where the RMPSE of α-k-NN regression was half the RMPSE of the α-PCR. In all cases, the ratio of their relative performance had a decreasing tendency as the sample size increased. A common trait, observed in the Compositional-Euclidean regression scenario as well, was that α-k-NN regression was affected by the number of components of the compositional predictor variables.

The analysis of the two real datasets showed that α-k-NN regression substantially outperformed the α-PCR, as the former's predictive performance was always more than two times higher.

A disadvantage, exhibited in both the Euclidean-Compositional and the Compositional-Euclidean regression, is that α-k-NN regression suffers from the curse of dimensionality: the higher the dimensionality, the higher the level of noise in the data, and the distances are heavily affected. This was observed in the simulation studies of both cases. A possible solution would be to estimate the predictive performance of both methods using cross-validation, or to check the percentage of variance explained by the first few principal components. A second disadvantage of the α-k-NN regression is that it lacks classical statistical inference, hypothesis testing, etc., but these drawbacks are counterbalanced by its higher predictive performance compared to parametric models.


Future directions suggest the use of non-linear PCA, such as kernel PCA (Mika et al., 1999). The disadvantage of kernel methods, though, is their computational cost, which increases sharply with the sample size; in this paper's context, they would also increase the model's complexity, as they require tuning of a bandwidth parameter. Projection pursuit (Friedman and Stuetzle, 1981) is another non-linear alternative to k-NN regression, which again requires tuning of a parameter, but at the moment this method would only work for compositional data without zero values. Finally, to overcome the problem associated with the curse of dimensionality in the Euclidean-Compositional regression, variable (or component) selection could prove beneficial. Moving in this direction, one could apply the LASSO (Lin et al., 2014), but unfortunately this methodology cannot treat zero values. Zero value imputation (Martín-Fernández et al., 2003, 2012) could resolve this problem, but could also lead to less accurate estimations (Tsagris, 2015a).

References

  • Aitchison (1982) Aitchison, J. (1982). The statistical analysis of compositional data. Journal of the Royal Statistical Society. Series B, 44(2):139–177.
  • Aitchison (1983) Aitchison, J. (1983). Principal component analysis of compositional data. Biometrika, 70(1):57–65.
  • Aitchison (1989) Aitchison, J. (1989). Measures of location of compositional data sets. Mathematical Geology, 21(7):787–790.
  • Aitchison (2003) Aitchison, J. (2003). The statistical analysis of compositional data. New Jersey: Reprinted by The Blackburn Press.
  • Alenazi (2019) Alenazi, A. (2019). Regression for compositional data with compositional data as predictor variables with or without zero values. Journal of Data Science, 17(1):219–237.
  • Baxter (1995) Baxter, M. (1995). Standardization and transformation in principal component analysis, with applications to archaeometry. Applied Statistics, 44(4):513–527.
  • Baxter (2001) Baxter, M. (2001). Statistical modelling of artefact compositional data. Archaeometry, 43(1):131–147.
  • Baxter et al. (2005) Baxter, M., Beardah, C., Cool, H., and Jackson, C. (2005). Compositional data analysis of some alkaline glasses. Mathematical Geology, 37(2):183–196.
  • Böhning (1992) Böhning, D. (1992). Multinomial logistic regression algorithm. Annals of the Institute of Statistical Mathematics, 44(1):197–200.
  • Chen and Li (2016) Chen, E. Z. and Li, H. (2016). A two-part mixed-effects model for analyzing longitudinal microbiome compositional data. Bioinformatics, 32(17):2611–2617.
  • Di Marzio et al. (2015) Di Marzio, M., Panzera, A., and Venieri, C. (2015). Non-parametric regression for compositional data. Statistical Modelling, 15(2):113–133.
  • Dryden and Mardia (1998) Dryden, I. and Mardia, K. (1998). Statistical Shape Analysis. John Wiley & Sons.
  • Egozcue et al. (2003) Egozcue, J., Pawlowsky-Glahn, V., Mateu-Figueras, G., and Barceló-Vidal, C. (2003). Isometric logratio transformations for compositional data analysis. Mathematical Geology, 35(3):279–300.
  • Egozcue et al. (2012) Egozcue, J. J., Daunis-I-Estadella, J., Pawlowsky-Glahn, V., Hron, K., and Filzmoser, P. (2012). Simplicial regression. The normal model. Journal of Applied Probability and Statistics, 6(182):87–108.
  • Friedman and Stuetzle (1981) Friedman, J. H. and Stuetzle, W. (1981). Projection pursuit regression. Journal of the American Statistical Association, 76(376):817–823.
  • Greenacre (2009) Greenacre, M. (2009). Power transformations in correspondence analysis. Computational Statistics & Data Analysis, 53(8):3107–3116.
  • Greenacre (2011) Greenacre, M. (2011). Measuring subcompositional incoherence. Mathematical Geosciences, 43(6):681–693.
  • Greenacre (2018) Greenacre, M. (2018). Compositional Data Analysis in Practice. Chapman & Hall/CRC Press.
  • Gueorguieva et al. (2008) Gueorguieva, R., Rosenheck, R., and Zelterman, D. (2008). Dirichlet component regression and its applications to psychiatric data. Computational Statistics & Data Analysis, 52(12):5344–5355.
  • Hijazi and Jernigan (2009) Hijazi, R. and Jernigan, R. (2009). Modelling compositional data using Dirichlet regression models. Journal of Applied Probability and Statistics, 4(1):77–91.
  • Hron et al. (2012) Hron, K., Filzmoser, P., and Thompson, K. (2012). Linear regression with compositional explanatory variables. Journal of Applied Statistics, 39(5):1115–1128.
  • Iyengar and Dey (1998) Iyengar, M. and Dey, D. K. (1998). Box–Cox transformations in Bayesian analysis of compositional data. Environmetrics, 9(6):657–671.
  • Iyengar and Dey (2002) Iyengar, M. and Dey, D. K. (2002). A semiparametric model for compositional data analysis in presence of covariates on the simplex. Test, 11(2):303–315.
  • Katz and King (1999) Katz, J. and King, G. (1999). A statistical model for multiparty electoral data. American Political Science Review, 93(1):15–32.
  • Lancaster (1965) Lancaster, H. (1965). The Helmert matrices. American Mathematical Monthly, 72(1):4–12.
  • Le and Small (1999) Le, H. and Small, C. (1999). Multidimensional scaling of simplex shapes. Pattern Recognition, 32(9):1601–1613.
  • Leininger et al. (2013) Leininger, T. J., Gelfand, A. E., Allen, J. M., and Silander Jr, J. A. (2013). Spatial regression modeling for compositional data with many zeros. Journal of Agricultural, Biological, and Environmental Statistics, 18(3):314–334.
  • Lin et al. (2014) Lin, W., Shi, P., Feng, R., and Li, H. (2014). Variable selection in regression with compositional covariates. Biometrika, 101(4):785–797.
  • Martín-Fernández et al. (2012) Martín-Fernández, J., Hron, K., Templ, M., Filzmoser, P., and Palarea-Albaladejo, J. (2012). Model-based replacement of rounded zeros in compositional data: Classical and robust approaches. Computational Statistics & Data Analysis, 56(9):2688–2704.
  • Martín-Fernández et al. (2003) Martín-Fernández, J. A., Barceló-Vidal, C., and Pawlowsky-Glahn, V. (2003). Dealing with zeros and missing values in compositional data sets using nonparametric imputation. Mathematical Geology, 35(3):253–278.
  • Melo et al. (2009) Melo, T. F., Vasconcellos, K. L., and Lemonte, A. J. (2009). Some restriction tests in a new class of regression models for proportions. Computational Statistics & Data Analysis, 53(12):3972–3979.
  • Meng (2010) Meng, J. (2010). Multinomial logit PLS regression of compositional data. In 2010 Second International Conference on Communication Systems, Networks and Applications, volume 2, pages 288–291. IEEE.
  • Mika et al. (1999) Mika, S., Schölkopf, B., Smola, A. J., Müller, K.-R., Scholz, M., and Rätsch, G. (1999). Kernel PCA and de-noising in feature spaces. In Advances in Neural Information Processing Systems, pages 536–542.
  • Morais et al. (2018) Morais, J., Thomas-Agnan, C., and Simioni, M. (2018). Using compositional and Dirichlet models for market share regression. Journal of Applied Statistics, 45(9):1670–1689.
  • Mullahy (2015) Mullahy, J. (2015). Multivariate fractional regression estimation of econometric share models. Journal of Econometric Methods, 4(1):71–100.
  • Murteira and Ramalho (2016) Murteira, J. M. R. and Ramalho, J. J. S. (2016). Regression analysis of multivariate fractional data. Econometric Reviews, 35(4):515–552.
  • Nelder and Mead (1965) Nelder, J. and Mead, R. (1965). A simplex algorithm for function minimization. Computer Journal, 7(4):308–313.
  • Otero et al. (2005) Otero, N., Tolosana-Delgado, R., Soler, A., Pawlowsky-Glahn, V., and Canals, A. (2005). Relative vs. absolute statistical analysis of compositions: a comparative study of surface waters of a mediterranean river. Water Research, 39(7):1404–1414.
  • Pantazis et al. (2019) Pantazis, Y., Tsagris, M., and Wood, A. T. (2019). Gaussian asymptotic limits for the α-transformation in the analysis of compositional data. Sankhya A, 81(1):63–82.
  • Rozenas (2015) Rozenas, A. (2015). ocomposition: Regression for Rank-Indexed Compositional Data. R package version 1.1.
  • Scealy et al. (2015) Scealy, J., De Caritat, P., Grunsky, E. C., Tsagris, M. T., and Welsh, A. (2015). Robust principal component analysis for power transformed compositional data. Journal of the American Statistical Association, 110(509):136–148.
  • Scealy and Welsh (2011) Scealy, J. and Welsh, A. (2011). Regression for compositional data by using distributions defined on the hypersphere. Journal of the Royal Statistical Society. Series B, 73(3):351–375.
  • Scealy and Welsh (2014) Scealy, J. and Welsh, A. (2014). Colours and cocktails: Compositional data analysis 2013 Lancaster lecture. Australian & New Zealand Journal of Statistics, 56(2):145–169.
  • Shi et al. (2016) Shi, P., Zhang, A., Li, H., et al. (2016). Regression analysis for microbiome compositional data. The Annals of Applied Statistics, 10(2):1019–1040.
  • Smith (2002) Smith, R. L. (2002). A statistical assessment of Buchanan’s vote in Palm Beach county. Statistical Science, 17(4):441–457.
  • Stephens (1982) Stephens, M. A. (1982). Use of the von Mises distribution to analyse continuous proportions. Biometrika, 69(1):197–203.
  • Stewart et al. (2014) Stewart, C., Iverson, S., and Field, C. (2014). Testing for a change in diet using fatty acid signatures. Environmental and Ecological Statistics, 21(4):775–792.
  • Templ et al. (2011) Templ, M., Hron, K., and Filzmoser, P. (2011). robCompositions: an R-package for robust statistical analysis of compositional data. John Wiley and Sons.
  • Tolosana-Delgado and von Eynatten (2009) Tolosana-Delgado, R. and von Eynatten, H. (2009). Grain-size control on petrographic composition of sediments: compositional regression and rounded zeros. Mathematical Geosciences, 41(8):869.
  • Tsagris (2015a) Tsagris, M. (2015a). A novel, divergence based, regression for compositional data. In Proceedings of the 28th Panhellenic Statistics Conference, April, Athens, Greece.
  • Tsagris (2015b) Tsagris, M. (2015b). Regression analysis with compositional data containing zero values. Chilean Journal of Statistics, 6(2):47–57.
  • Tsagris et al. (2019) Tsagris, M., Athineou, G., and Alenazi, A. (2019). Compositional: Compositional Data Analysis. R package version 3.5.
  • Tsagris et al. (2011) Tsagris, M., Preston, S., and Wood, A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4th Compositional Data Analysis Workshop, Girona, Spain.
  • Tsagris et al. (2016) Tsagris, M., Preston, S., and Wood, A. T. (2016). Improved classification for compositional data using the α-transformation. Journal of Classification, 33(2):243–261.
  • Tsagris and Stewart (2018) Tsagris, M. and Stewart, C. (2018). A Dirichlet regression model for compositional data with zeros. Lobachevskii Journal of Mathematics, 39(3):398–412.
  • Tsagris and Stewart (2020) Tsagris, M. and Stewart, C. (2020). A folded model for compositional data analysis. Australian Journal of Statistics (To appear).
  • van den Boogaart et al. (2018) van den Boogaart, K. G., Tolosana-Delgado, R., and Bren, M. (2018). compositions: Compositional Data Analysis. R package version 1.40-2.
  • Wand and Jones (1994) Wand, M. P. and Jones, M. C. (1994). Kernel smoothing. Chapman and Hall/CRC.
  • Wang et al. (2010) Wang, H., Meng, J., and Tenenhaus, M. (2010). Regression modelling analysis on compositional data. In Handbook of Partial Least Squares, pages 381–406. Springer.
  • Wang et al. (2013) Wang, H., Shangguan, L., Wu, J., and Guan, R. (2013). Multiple linear regression modeling for compositional data. Neurocomputing, 122:490–500.
  • Xia et al. (2013) Xia, F., Chen, J., Fung, W. K., and Li, H. (2013). A logistic normal multinomial regression model for microbiome compositional data analysis. Biometrics, 69(4):1053–1063.