Structural (and rounded) zeros are sometimes met in compositional data. The term structural refers to values which are truly zeros, for instance the percentage of money a family spends on smoking or alcohol. Rounded zeros on the other hand are very small values in some components which were rounded to zero. In geology for example the instrument which measures the composition of the elements has a detection limit. Values below that limit are not detected. This has two possible explanations; either the element is completely absent or had a value smaller than the detection limit of the instrument.
Ever since 1982 (Aitchison, 1982)
, the most widely used approach for compositional data analysis is the log-ratio approach. The nature of the logs though gives rise to a mathematical problem, the log of zero is undefined. This problem was dealt with simple imputation techniques such as imputation by a small value(Aitchison, 2003), or with substitution of the zero by a fraction of the detection limit (Palarea Albaladejo et al., 2005), or via the EM algorithm (Palarea-Albaladejo et al., 2007)
. If the zeros present are indeed rounded down only because the detection limit of the instrument was not that low, then these approaches can be used. However, even in this case, the true value could be lower than estimated.(Scealy and Welsh, 2011a) showed an example of the problem when these approaches are adopted. The smaller the imputed value is, the higher the magnitude of the log-ratio transformed values are. If on the other hand the value is a true zero (not rounded), then any imputation technique is clearly not correct.
(Butler and Glasbey, 2008)
proposed a latent Gaussian model for modelling zero values. They used a multivariate normal distribution in
to model the data. When a point was outside the simplex they projected it orthogonally onto the faces and vertices of the simplex. However this approach has the problem of sometimes assigning too much probability on the vertices and sometimes more than is necessary. Furthermore, the higher the dimensionality of the simplex, finding the correct regions to project the points lying outside the simplex becomes more difficult. Maximum likelihood estimation becomes more difficult also, but with the use of MCMC methods they managed to tackle the estimation problems. We propose a different model for handling zero values, which is inspired though from that model(Butler and Glasbey, 2008). Instead of using an orthogonal projection for the points lying outside the simplex we move them along the line connecting the points with the center of the simplex.
In this section we will discuss the issue of structural zeros, that is when the value observed is actually zero and is not due to a rounding error. We will suggest a new method for modelling structural zeros based on the multivariate normal distribution. It is a different projection than the one suggested by Butler and Glasbey (2008). Since the simplex has the form of a triangle (when ), it seems that the projection of the points lying outside the simplex should be projected onto the boundaries of the simplex following a similar idea to the folded model (Tsagris and Stewart, 2020).
2 The -transformation
2.1 The stay-in-the-simplex version of the -transformation
The power transformation defined by Aitchison (2003) we saw earlier in Section LABEL:add is
We shall call (1) the stay-in-the-simplex version of the -transformation. Note that the map (1) is degenerate due to the constraints and . In order to make (1) non-degenerate we consider the version of (1) as follows
2.2 The -transformation
A centred and scaled version of (1) is defined as
where is defined in (1), is the
-dimensional vector ofs and is the Helmert sub-matrix (Tsagris et al., 2011; Tsagris and Stewart, 2020). Note that (4
) is simply a linear transformation of (1) and so any inference made on either of them should be the same. The Jacobian of the -transformation (4) is equal to (Tsagris and Stewart, 2020)
3 The zero-censored model
We will try to fit a multivariate normal distribution on the simplex which comprises of two components, one component for the data which lie inside the simplex and a second component for the points lying on the faces. A key (possibly restrictive for some datasets) feature of the model is that we assign zero probability on the vertices. At first, we will use the -transformation, with , in order to escape the unit sum constraint. So in effect we center the simplex and multiply it by , the number of components and then multiply from the left with the Helmert sub-matrix to remove the unit sum constraint. Then similarly to Butler and Glasbey (2008) we can write the log-likelihood as
where the -transformation with , with , is the density of the multivariate normal for the data lying inside the simplex, is the density of the -th point lying outside the simplex given that it is in a line going through the origin. The is the number of points lying inside the interior of the simplex and denotes the number of points on the faces of the simplex. The line integral refers to the -th observation lying on the face the simplex for which the integral is calculated along the -th component, with , where is the number of components. Finally, is the Jacobian determinant of the -transformation with .
The rationale is similar to the Butler and Glasbey (2008) model. We assume there is a latent multivariate normal distribution but we have observed the compositional data only. Zero values of compositional data imply that the values of the latent distribution were outside the simplex. An advantage of this model over the one suggested by Butler and Glasbey (2008) is that the likelihood is tractable for any number of dimensions.
The limitation of our suggested model is that can handle compositional vectors with zero values in only one of their components. We will need to calculate the line integral of this component in the multivariate density from that point to infinity. Therefore, the log-likelihood consists of the density inside the interior of the simplex and the density on the faces, thus assigning zero probability on the vertices (and to the edges when ). We will express (5) in a more convenient way as
where is the full sample size. The vector inside the integral has changed from to with , where and
will be explained below in the Gram-Schmidt process. In order to calculate the line integral we will first perform a rotation towards one arbitrary direction. For convenience reasons we chose the first direction, the X-axis for instance in the two dimensions as seen in Figure1. The rotation takes place by multiplying the vector with the zero component (after the -transformation) by an orthonormal matrix. The matrix is calculated via the Gram-Schmidt orthonormalization process (Strang, 1988).
3.1 Gram-Schmidt orthonormalization process
The process in mathematical terms is described as follows. Suppose we have a vector in and we want to rotate it to the line defined by the unit vector , with in . We have to find an orthonormal basis first using the Gram-Schmidt orthonormalization process. Let us denote the projection operation of a vector onto by
Then the following operations will take place
denotes the inner product of two vectors. Denote the matrix of the orthonormalized vectorsby
Then all we have to do to get is and the first element of the vector is the term we saw in (6).
where means all elements except from the first one. is the density of the multivariate normal with parameters calculated at and is the density of the conditional univariate normal with parameters calculated at . The conditional distribution of a multivariate normal is still a normal (Mardia et al., 1979) and the following relationships hold true
Hence, using these relationships we can calculate the parameters of the normal density appearing inside the integral of (7). Thus, we have the following relationships
where the index is used to indicate the -th observation and means the rotation matrix (calculated from the Gram-Schmidt orthonormalization process) for the -th observation. The rotation matrix rotates the vector to the line defined by the unit vector . Now,
where means the matrix without the first element and indicates the first row of the matrix .
The rationale is to multiply each vector by the rotation matrix and rotate the data onto the first axis, thus the new vector is denoted by . Thus (7) can be written as
is the cumulative distribution of a standard normal random variable. The final form of the log-likelihood (8) is the form maximized numerically. The index in each of the parameters (for the compositions which contained one zero value) indicates that each composition with one zero value had to be projected onto the face and thus its contribution to the parameters is different.
Figure 1 shows a graphical example of the rotation in . The red line integral is calculated through a normal distribution whose parameters are rotated via the Gram-Schmidt orthonormalization process in the same way the black line was rotated to the red line. This is one example of a composition with a zero value in one of its components. In the sample, we have to sum all of these cases.
3.2 Example 1. Simulated data
Figure 2 shows a simulated example of the zero-censored model. Data of size were generated from the following multivariate normal
When an observation fell outside the simplex it was ”pulled” to the boundary, moving along the line connecting the point with the center of the simple, via the technique described in Tsagris and Stewart (2020). There were such cases in the data. We applied the zero-censored model to the data by maximizing the log-likelihood (7). We estimated zeros ( compositional vectors having one element with a zero value). We generated vectors from a multivariate normal and counted the number of vectors that fell outside the simplex. The estimated parameters of this normal distribution, used in this random vector generation, were
Figure 2 shows the ternary plot of the data along with the contours of the zero-censored model calculated from the estimated parameters.
A key thing we have to mention about Figure 2 is that the contour lines look vertical but have in fact a negative slope. This is not seen because of the scaling of the ternary plot. The range of values of the x-axis is larger than the range of the simulated values in the first variable and thus the contour lines do not depict the negative slope they should.
3.3 Example 2. Time budget data
We will illustrate the performance of the zero-censored model using real data (Härdle and Hlávka, 2007). There are individuals and for each person information about the time allocation in activities is known. The individuals are identified according to gender, country where they live, professional activity, and matrimonial status. We are not interested in their categorization but in the amount of time each person spent on categories of activities over days (the total is hours fixed for every row) in . The special feature of these data is that they contain some zero values. Some activities have zero allocation, for instance one woman did not spend even an hour on transportation linked to professional activity and four women did not spend any hour on occupation linked to children. This means that we have five compositions which have one zero in one component only.
The estimated parameters are
3.4 Diagnostics for the zero-censored model
We have performed a similar goodness of fit diagnostic to the one Butler and Glasbey (2008) performed. We generated data from the fitted multivariate normal model and estimated the number of zeros in each component. For the first example with the simulated data we had out of vectors with one zero element, , and zeros in the first, second and third component respectively. The corresponding percentages are . We estimated the percentages of the zero values in each component to be respectively based on simulated observations. We repeated the same procedure for the real data in the second example and the results are presented in Table 1.
|number of zeros||0||1||0||4||0||0||0||0||0||0|
|number of zeros||0.593||0.547||2.106||2.151||0.002||0.000||0.000||0.000||0.137||0.000|
From Table 1 we see that there is evidence to support the hypothesis that the fit of the model is not to be rejected. We could also use the test statistic as a discrepancy measure between the estimated and the observed frequencies and a -value could be calculated via simulations or via the distribution.
We developed a parametric zero-censored model for compositional data with zero values. The use of another multivariate model, such as the multivariate skew normal distribution(Azzalini and Valle, 1996) could also be utilized but the difficulty with this distribution is that more parameters need to be estimated, thus making the estimation procedure more difficult. This of course does not exclude the possibility of using this model.
The zero-censored model attacks the problem of zeros from a different perspective than the one Butler and Glasbey (2008) suggested. The data are projected on to the faces of the simplex using a non-orthogonal projection in contrast to the orthogonal Butler and Glasbey (2008) proposed. The advantage over Butler and Glasbey’s approach is that it is not difficult to project the data onto the edges regardless of the dimension. Both models however share the same problem, that of estimating the parameters of the normal distribution which becomes harder as the dimension increases. Both the zero-censored model (8) and the Butler and Glasbey’s model avoid the use of the log-ratio methodology or imputation of the zero values. A limitation of the zero-censored model is that it only allows for one zero per compositional vector. For instance if we have or components, only one zero should be present in each vector.
A further question when modelling compositional data by using either the proposed zero-censored model or the Butler and Glasbey (2008) model, is how to include covariates. Scealy and Welsh (2011b) defined an alternative model based on the Kent distribution which offers the possibility for regression and handling zeros conveniently, at the cost of computational complexity.
- Aitchison (1982) Aitchison, J. (1982). The statistical analysis of compositional data. Journal of the Royal Statistical Society: Series B, 44(2):139–177.
- Aitchison (2003) Aitchison, J. (2003). The statistical analysis of compositional data. Reprinted by The Blackburn Press.
- Azzalini and Valle (1996) Azzalini, A. and Valle, A. (1996). The multivariate skew-normal distribution. Biometrika, 83(4):715.
- Butler and Glasbey (2008) Butler, A. and Glasbey, C. (2008). A latent gaussian model for compositional data with zeros. Journal of the Royal Statistical Society: Series C (Applied Statistics), 57(5):505–520.
- Härdle and Hlávka (2007) Härdle, W. and Hlávka, Z. (2007). Multivariate statistics: exercises and solutions. Springer Publishing Company, Incorporated.
- Mardia et al. (1979) Mardia, K., Kent, J., and Bibby, J. (1979). Multivariate Analysis. London: Academic Press.
- Palarea Albaladejo et al. (2005) Palarea Albaladejo, J., Fernández, M., Antoni, J., and Gómez García, J. (2005). alr approach for replacing values below the detection limit. In Proceedings of the 2rd Compositional Data Analysis Workshop, Girona, Spain.
- Palarea-Albaladejo et al. (2007) Palarea-Albaladejo, J., Martín-Fernández, J., and Gómez-García, J. (2007). A parametric approach for dealing with compositional rounded zeros. Mathematical Geology, 39(7):625–645.
- Scealy and Welsh (2011a) Scealy, J. and Welsh, A. (2011a). Properties of a square root transformation regression model. In Proceedings of the 4rth Compositional Data Analysis Workshop, Girona, Spain.
- Scealy and Welsh (2011b) Scealy, J. and Welsh, A. (2011b). Regression for compositional data by using distributions defined on the hypersphere. Journal of the Royal Statistical Society: Series B, 73(3):351–375.
- Strang (1988) Strang, G. (1988). Linear algebra and its applications. Thomson Learning Inc.
- Tsagris et al. (2011) Tsagris, M., Preston, S., and Wood, A. (2011). A data-based power transformation for compositional data. In Proceedings of the 4rth Compositional Data Analysis Workshop, Girona, Spain.
- Tsagris and Stewart (2020) Tsagris, M. and Stewart, C. (2020). A folded model for compositional data analysis. Australian & New Zealand Journal of Statistics, 62(2):249–277.