1 Introduction
Let us denote by $y = (y_1, \ldots, y_C)'$ a vector of proportion values that is an observation of a random variable $Y$. As an application example, we can access blood sample composition data via the R add-on package DirichletReg [10]. The data set BloodSamples gives 36 compositions of Albumin, Pre-Albumin, Globulin A, and Globulin B in relation to two types of diseases (14 patients suffer from disease A, 16 from disease B, and 6 are unclassified). The first 6 observations show the following values:

library("DirichletReg")
Bld <- BloodSamples
head(Bld)
   Albumin Pre.Albumin Globulin.A Globulin.B Disease New
A1   0.348       0.197      0.201      0.254       A  No
A2   0.386       0.239      0.141      0.242       A  No
A3   0.471       0.240      0.089      0.200       A  No
A4   0.427       0.245      0.111      0.217       A  No
A5   0.346       0.230      0.204      0.219       A  No
A6   0.485       0.231      0.101      0.183       A  No
Figure 1 visualizes the data. The task for a Dirichlet regression model is here to work out the differences in the composition of blood compartments by differences in the levels of a binary disease variable (Disease; unclassified observations are dropped).
As $Y$ distributes in $C$ dimensions, i.e. one observation for observation unit $i$ is a vector of $C$ values $y_{ic}$, $c \in \{1, \ldots, C\}$, with $\sum_{c=1}^{C} y_{ic} = 1$. For each $y_{ic}$, one usually assumes a strictly positive component below 1, i.e. $0 < y_{ic} < 1$.
Zero observations
In applications, zero observations, i.e. $y_{ic} = 0$, are often present, either because the corresponding proportion is completely absent, or because $y_{ic}$ is only observable with a certain precision and is consequently at risk of falling below a detection limit. Several strategies have been introduced for how to deal with, and how to distinguish among, zero observations (see [11] for an overview). For the remainder of this article, we assume that any zero observation comes from the class of rounded zeroes [11]:
"One typical example of rounded zero is the zero in a component of a particular mineral which indicates that no quantifiable proportion of the mineral has been recorded by the measurement process. Because this kind of zeros is usually understood as 'a trace too small to measure', it seems reasonable to replace them by a suitable small value [...]." [11]
In case of zero observations, the transformation proposed by [15] – applied by [9] and implemented in the R package DirichletReg [10] in function DR_data as argument trafo – is applied:

y^*_{ic} = \frac{y_{ic}(n - 1) + 1/C}{n}    (1)

where $n$ is the number of observation units.
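A minimal sketch of transformation (1) (in Python for illustration; the function name squeeze_zeros and the example values are ours):

```python
import numpy as np

def squeeze_zeros(Y):
    """Row-wise shrinkage of compositions away from 0, cf. eq. (1):
    y* = (y * (n - 1) + 1/C) / n."""
    Y = np.asarray(Y, dtype=float)
    n, C = Y.shape  # n observation units, C components
    return (Y * (n - 1) + 1.0 / C) / n

# two compositions, one containing a rounded zero
Y = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5]])
Ystar = squeeze_zeros(Y)
# rows still sum to 1, and the zero has become a small positive value
```

Note that the additive term $1/C$ is exactly what keeps the row sums at 1 after the shrinkage.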
Dirichlet distribution
The Dirichlet distribution [1] is defined as:

f(y \,|\, \alpha) = \frac{1}{B(\alpha)} \prod_{c=1}^{C} y_c^{\alpha_c - 1}    (2)

with shape parameter vector $\alpha = (\alpha_1, \ldots, \alpha_C)'$, $\alpha_c > 0$, where the multinomial beta function $B(\alpha) = \prod_{c=1}^{C} \Gamma(\alpha_c) / \Gamma(\sum_{c=1}^{C} \alpha_c)$ serves as a normalizing constant, and $\Gamma(\cdot)$ denotes the Gamma function.
In the following, $Y \sim \mathcal{D}(\alpha)$ will denote that a random variable $Y$ is Dirichlet distributed. From this parametrisation, it is evident that the elements of $\alpha$ must control location as well as scale of the distribution: the sum $\alpha_0 = \sum_{c=1}^{C} \alpha_c$ of all components of parameter vector $\alpha$ is interpreted as precision, i.e. "the higher this value, the more density is near the expected value" [9], while componentwise expected values are given by the ratio of parameters with precision, i.e.

E(Y_c) = \frac{\alpha_c}{\alpha_0} = \frac{\alpha_c}{\sum_{d=1}^{C} \alpha_d}.    (3)
Two direct consequences of this definition:

1. Expected values are proportions summing up to 1, and by this
2. the vector of expected values only has $C - 1$ degrees of freedom: if we hold $C - 1$ expected values fixed on certain values, the last expected value is given by the difference to 1.
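Numerically, the roles of location and precision can be illustrated as follows (a Python sketch with arbitrary shape parameters):

```python
import numpy as np

alpha = np.array([2.0, 3.0, 5.0])  # shape parameter vector
alpha0 = alpha.sum()               # precision: sum of all components

mu = alpha / alpha0                # componentwise expected values, eq. (3)
# proportions summing to 1

# only C - 1 degrees of freedom: the last expected value is
# determined by all the others
mu_last = 1.0 - mu[:-1].sum()      # identical to mu[-1]
```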
Regression modeling for Dirichlet distribution
[9] introduces two parameter formulations for Dirichlet regression models; here the so-called alternative formulation will be used:

\alpha = \mu\theta, \quad \text{i.e.} \quad \alpha_c = \mu_c \theta    (4)

with expectation vector $\mu = (\mu_1, \ldots, \mu_C)'$, expectation vector elements $\mu_c = E(Y_c)$, component index $c \in \{1, \ldots, C\}$, and precision $\theta = \alpha_0$.
Using componentwise coefficient vectors $\beta_c = (\beta_{c1}, \ldots, \beta_{cK})'$ for components $c \in \{1, \ldots, C\}$ and covariate vectors $x_i = (x_{i1}, \ldots, x_{iK})'$, we can set up linear predictors defining $\mu_{ic}$ as:

\eta_{ic} = x_i' \beta_c

where for one reference component $\tilde{c}$, all elements of $\beta_{\tilde{c}}$ are equal to 0, i.e. $\beta_{\tilde{c}} = (0, \ldots, 0)'$, to guarantee identifiability as a consequence of the degrees of freedom reduced to $C - 1$ (usually, $\tilde{c}$ is selected as 1 or $C$, here as $C$).
Plugging these linear predictors into a ratio of model components – after applying the exponential function guaranteeing positivity – the componentwise expected values conditional on $x_i$ are defined as:

\mu_{ic} = \frac{\exp(\eta_{ic})}{\sum_{d=1}^{C} \exp(\eta_{id})}
Applying again the exponential function, and introducing coefficient vector $\gamma = (\gamma_1, \ldots, \gamma_K)'$, we can express the global precision conditional on $x_i$ as:

\theta_i = \exp(x_i' \gamma)
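The two link functions can be sketched together as follows (Python; the coefficient values are invented, and softmax is the usual numerically stabilized implementation):

```python
import numpy as np

def softmax(eta):
    e = np.exp(eta - eta.max())  # shift for numerical stability
    return e / e.sum()

# made-up coefficients for C = 3 components and K = 2 covariates;
# the reference component's row is fixed at 0 for identifiability
beta = np.array([[ 0.6, -0.3],
                 [ 0.1, -0.2],
                 [ 0.0,  0.0]])
gamma = np.array([1.5, 0.2])     # precision coefficients

x = np.array([1.0, 1.0])         # intercept plus one binary covariate

mu = softmax(beta @ x)           # componentwise expected values, sum to 1
theta = np.exp(gamma @ x)        # global precision
alpha = mu * theta               # Dirichlet parameters as in eq. (4)
```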
We get to densities:

y_i \,|\, x_i \sim \mathcal{D}(\mu_i \theta_i)    (5)

and likelihood:

L(\beta_1, \ldots, \beta_C, \gamma) = \prod_{i=1}^{n} f(y_i \,|\, \mu_i \theta_i).    (6)
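Under these definitions, the log of likelihood (6) is a sum of Dirichlet log-densities (5); a sketch using scipy's Dirichlet distribution with invented data and coefficients:

```python
import numpy as np
from scipy.stats import dirichlet

def softmax(eta):
    e = np.exp(eta - eta.max())
    return e / e.sum()

def loglik(beta, gamma, X, Y):
    """Log-likelihood, eq. (6): sum of Dirichlet log-densities, eq. (5)."""
    ll = 0.0
    for x, y in zip(X, Y):
        mu = softmax(beta @ x)          # expected values
        theta = np.exp(gamma @ x)       # precision
        ll += dirichlet.logpdf(y, mu * theta)
    return ll

# invented data: two observation units, three components
X = np.array([[1.0, 0.0],
              [1.0, 1.0]])
Y = np.array([[0.3, 0.3, 0.4],
              [0.2, 0.5, 0.3]])
beta = np.array([[0.5, -0.2], [0.1, 0.3], [0.0, 0.0]])
gamma = np.array([1.0, 0.5])
ll = loglik(beta, gamma, X, Y)
```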
Why this manuscript?
As seen above, one component is redundant since it is a linear combination of the others, but in practice we desire to know the relationship between the explanatory variables and the outcomes in all components. Often the components are equally important in the view of the researcher, and it is not sensible to relegate one to reference status. The use of a Bayesian simulation framework allows us, by the use of simple post-estimation calculations, to quantify uncertainty on both the coefficient scale as well as on the proportion scale. We are able to look at all quantities as an ensemble, get to statistical inference statements for all model components, and by this are able to move between the 'x in c' and 'c across x' perspectives. This is a great advantage in comparison to the ML framework [10], where predictions are also easily achieved and readily implemented, but the expression of uncertainties for all quantities and/or on the basis of proportions quickly becomes tedious, as handling variances and covariances is rather difficult for nonlinear combinations.
[17] recently proposed a Bayesian solution to this practical limitation based on a penalized likelihood formulation respecting the restrictions on the degrees of freedom as described above. Following the simulation from the joint posterior, [17] applies post-simulation calculations "to ensure that fitted expected values sum to one for each observation" [17].

This is a somewhat counterintuitive step, as the sampling of coefficients is performed such that it should respect the degrees of freedom restriction by suitably penalizing the likelihood. Further, it introduces a certain removal of sampling variation that was initially accepted during the Bayesian sampling algorithm as being plausible with respect to observed data and statistical model. By this one loses the direct linkage between the Bayesian model expressed by the joint posterior and the adjusted samples from the joint posterior.
In order to overcome this issue, but still remain within the Bayesian framework that [17] introduced with good reason:
"Specifically, we use the Bayesian simulation framework, which holds many advantages. It allows us to directly quantify uncertainty in both the coefficients and the means. Also, when moving to a predictive framework, construction of predictive densities is relatively straightforward." [17]
I will introduce a Stan implementation using one response component as reference, as used in the ML estimation framework by [9], and demonstrate how post-estimation calculations overcome the limitations induced by the reference status definition in the ML estimation framework. This post-estimation technique is simple: calculate the expected value of reference component $C$, conditional on the covariate values of an observation unit $i$, as the difference of the sum of the other expected values to the boundary condition that all must sum up to 1:

\mu_{iC} = 1 - \sum_{c=1}^{C-1} \mu_{ic}

This is so far no different from what we need to do to get to an estimate for $\mu_{iC}$ in the ML framework, but it is now performed repeatedly for each sample from the posterior. By this we are equipped with not only one estimate, but a full distribution of samples for $\mu_{iC}$, in the same fashion as for the other components, and consequently, uncertainty measures for all response components can be expressed, with, of course, still reduced degrees of freedom of $C - 1$.
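The mechanics of this post-estimation step can be sketched as follows (Python; the posterior draws are faked by random numbers purely to show the calculation, and all variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(42)

S, C, K = 4000, 4, 2           # posterior draws, components, covariates
x = np.array([1.0, 0.0])       # covariate values of one observation unit

# stand-in for posterior draws of beta (here: random numbers);
# the reference row C is fixed at zero, as in the Stan model
draws = rng.normal(0.0, 0.5, size=(S, C, K))
draws[:, C - 1, :] = 0.0

eta = draws @ x                              # (S, C) linear predictors
mu = np.exp(eta)
mu /= mu.sum(axis=1, keepdims=True)          # expected values per draw

# reference component via the boundary condition, once per draw
mu_ref = 1.0 - mu[:, :C - 1].sum(axis=1)

# a full distribution for mu_ref, hence e.g. a 95% credible interval
lo, hi = np.quantile(mu_ref, [0.025, 0.975])
```

Because the draws carry the reference row as zeros, subtracting the other components from 1 per draw reproduces the softmax value of the reference component itself.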
[17] based his software implementation on Gibbs sampling [5] via the OpenBUGS program [8], but as I personally feel much more comfortable and experienced using the alternative Bayesian inference framework Stan, I will introduce the new approach using Stan. However, any researcher should technically be able to perform her/his calculations in a similar way in the OpenBUGS framework.
The remainder of this article is organized as follows: Section 2 introduces the Bayesian model as well as the Stan estimation framework for the Dirichlet regression model, and Section 3 shows an application example.
2 Methods and software implementation
We move to a Bayesian framework and introduce normal priors with mean 0 and variance $\sigma^2$ on all model parameters:

\beta_{ck} \sim N(0, \sigma^2), \qquad \gamma_k \sim N(0, \sigma^2)    (7)
One can consider these normal priors as weakly informative. They certainly do not act uniformly on the coefficients, but the informativeness of a prior does not depend on its absolute value per se; rather, it depends on how flat the prior is in the region of high likelihood, and vice versa. We can therefore still consider that they let the data speak for themselves in almost any application with unit-scale covariates, as one can, for those unit-scale covariates, assume a region of high likelihood for coefficients in an interval of only a few units around 0.
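This flatness argument is easy to check numerically (a Python sketch; the region [-2, 2] stands in for a plausible coefficient range with unit-scale covariates):

```python
import numpy as np
from scipy.stats import norm

# prior density at the edge of a plausible coefficient region [-2, 2],
# relative to the density at 0, for different prior standard deviations
for sd in (1.0, 5.0, 10.0):
    ratio = norm.pdf(2.0, loc=0.0, scale=sd) / norm.pdf(0.0, loc=0.0, scale=sd)
    print(f"sd = {sd:4.1f}: relative prior density at +/-2 is {ratio:.3f}")
```

For a standard deviation of 5, the prior density at a coefficient value of 2 is still above 90% of the density at 0, so the prior is essentially flat where the likelihood concentrates.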
2.1 Software implementation
The Bayesian estimation of additive regression models with Dirichlet distributed responses is implemented in Stan [3] and practically applied using RStan [16]. RStan functions as an R [14] interface to Stan using common R syntax. Stan is statistical software written in C++ which operationalizes Bayesian inference by drawing samples from a regression model's joint posterior in the form of Markov chains.
Stan is my personal generic Bayesian software framework – a probabilistic programming language – of choice, as it is the current gold standard [12], and I, the author, personally feel more comfortable and experienced with it in comparison to JAGS [13], BUGS [6] and the like. In contrast to these alternatives for Bayesian analyses, Stan uses Hamiltonian Monte Carlo [4] and the No-U-Turn Sampler (NUTS) [7], which require fewer samples for chain convergence, though at the cost of increased computation time per sample. In each of the further applications, a single model fit incorporates four chains with 2000 iterations each, where the first 1000 iterations of each chain are discarded as warm-up, leaving in total 4000 post-warm-up samples. These are defaults as, for example, also chosen by brms [2]. The following gives a Stan implementation of the Dirichlet regression model including covariate effects for the variables transported by design matrix X (R base function model.matrix(...) is a convenient tool for its preparation):

stan_code <- '
data {
  int<lower=1> N;                 // total number of observations
  int<lower=2> ncolY;             // number of categories
  int<lower=2> ncolX;             // number of predictor levels
  matrix[N,ncolX] X;              // predictor design matrix
  matrix[N,ncolY] Y;              // response variable
  real sd_prior;                  // prior standard deviation
}
parameters {
  matrix[ncolY-1,ncolX] beta_raw; // coefficients (raw)
  real theta;
}
transformed parameters {
  real exptheta = exp(theta);
  matrix[ncolY,ncolX] beta;       // coefficients
  for (l in 1:ncolX) {
    beta[ncolY,l] = 0.0;
  }
  for (k in 1:(ncolY-1)) {
    for (l in 1:ncolX) {
      beta[k,l] = beta_raw[k,l];
    }
  }
}
model {
  // priors:
  theta ~ normal(0,sd_prior);
  for (k in 1:(ncolY-1)) {
    for (l in 1:ncolX) {
      beta_raw[k,l] ~ normal(0,sd_prior);
    }
  }
  // likelihood:
  for (n in 1:N) {
    vector[ncolY] logits;
    for (m in 1:ncolY) {
      logits[m] = X[n,] * transpose(beta[m,]);
    }
    transpose(Y[n,]) ~ dirichlet(softmax(logits) * exptheta);
  }
}
'
Supplement S1 gives an implementation of the alternative parametrization [9] of Dirichlet regression including varying precision by varying covariates.
3 Results for the Blood Samples Application
First, let's see how these data are analyzed using DirichletReg:
Bld <- na.omit(Bld)
Bld$Smp <- DR_data(Bld[, 1:4])
We get a warning that some values needed to be corrected such that all proportions for one observation unit sum up to 1:
Warning in DR_data(Bld[, 1:4]) : not all rows sum up to 1 => normalization forced
The ML estimate is calculated via:
m <- DirichReg(Smp ~ Disease | 1, Bld, model = "alternative", base = 4)
We get the following coefficient estimates:
(b <- unlist(coef(m)))
    beta.Alb..(Interc.)         beta.Alb..Dis.B
             0.63010700             -0.25191609
beta.Pre.Alb..(Interc.)     beta.Pre.Alb..Dis.B
             0.06274025             -0.30952737
  beta.Glob.A.(Interc.)       beta.Glob.A.Dis.B
            -0.48628655             -0.18189666
  gamma.gamma.(Interc.)
             4.22272495
And we can calculate expected values by:
(B <- matrix(nrow = 4, ncol = 2, c(b[1:6], rep(0, 2)), byrow = TRUE))
            [,1]       [,2]
[1,]  0.63010700 -0.2519161
[2,]  0.06274025 -0.3095274
[3,] -0.48628655 -0.1818967
[4,]  0.00000000  0.0000000
(eB <- t(exp(apply(B, MAR = 1, FUN = cumsum))))
          [,1]      [,2]
[1,] 1.8778115 1.4596416
[2,] 1.0647502 0.7813070
[3,] 0.6149056 0.5126391
[4,] 1.0000000 1.0000000
(mu <- cbind(eB[, 1]/colSums(eB)[1], eB[, 2]/colSums(eB)[2]))
          [,1]      [,2]
[1,] 0.4120296 0.3888657
[2,] 0.2336276 0.2081494
[3,] 0.1349227 0.1365731
[4,] 0.2194201 0.2664118
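The same chain of calculations, exponentiating the cumulated coefficients and then normalizing columnwise, can be cross-checked outside of R (a Python sketch; the signs of the coefficients are those implied by the exponentiated values above):

```python
import numpy as np

# (intercept, Disease B effect) per component; the reference row is zero
B = np.array([[ 0.63010700, -0.25191609],
              [ 0.06274025, -0.30952737],
              [-0.48628655, -0.18189666],
              [ 0.0,         0.0       ]])

# cumulated coefficients give the linear predictors for disease A
# (intercept only) and disease B (intercept + effect)
eB = np.exp(B.cumsum(axis=1))

# columnwise normalization yields the expected values per disease level
mu = eB / eB.sum(axis=0)
# columns sum to one; e.g. mu[:, 0] is approx. (0.412, 0.234, 0.135, 0.219)
```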
Now the Bayesian estimation using Stan begins with compiling the previously introduced Stan code:
library("rstan")
prg <- stan_model(model_code = stan_code)
We then need to translate the covariates into a design matrix X:
X <- as.matrix(model.matrix(lm(Albumin ~ Disease, data = Bld)))
X <- matrix(nrow = nrow(X), ncol = ncol(X), data = as.numeric(X))
Define the response object Y:
Y <- Bld$Smp
Provide everything as a list:
D <- list(N = nrow(Y), ncolY = ncol(Y), ncolX = ncol(X), X = X, Y = Y, sd_prior = 1)
And finally estimate the parameters using function sampling from the RStan package:
fit1 <- sampling(prg, data = D, chains = 4, iter = 2000, cores = 4,
                 control = list(adapt_delta = 0.95, max_treedepth = 20), refresh = 100)
Using extract, we can access the posterior samples:
B <- extract(fit1)$beta
A small helper function will assist us in calculating the expected values:
simplex <- function(x){ exp(x)/sum(exp(x)) }
We can plot everything using the following code:
my_colors <- c("black", "darkgrey")  # two plotting colours; not defined in the extracted code
## Disease A: posterior draws of the expected values (intercepts only)
plot(1:4, Bld[1, 1:4], ylim = c(0, 0.6), type = "n", xaxt = "n", las = 1,
     xlab = "", ylab = "Proportion", main = "Disease A", xlim = c(0.6, 4.4))
abline(h = seq(0, 0.6, by = 0.1), col = "grey", lty = 3)
axis(1, at = 1:4, labels = names(Bld)[1:4], las = 2)
aux <- t(apply(B[, , 1], MAR = 1, FUN = simplex))
lines(apply(aux, MAR = 2, FUN = quantile, prob = 0.975), type = "b", pch = 4,
      lty = 2, col = my_colors[2])
lines(apply(aux, MAR = 2, FUN = quantile, prob = 0.025), type = "b", pch = 4,
      lty = 2, col = my_colors[2])
lines(apply(aux, MAR = 2, FUN = mean), lwd = 2, col = my_colors[2],
      type = "b", pch = 16)
## Disease B: add the Disease B coefficients to the intercepts
plot(1:4, Bld[1, 1:4], ylim = c(0, 0.6), type = "n", xaxt = "n", las = 1,
     xlab = "", ylab = "Proportion", main = "Disease B", xlim = c(0.6, 4.4))
abline(h = seq(0, 0.6, by = 0.1), col = "grey", lty = 3)
axis(1, at = 1:4, labels = names(Bld)[1:4], las = 2)
aux <- t(apply(B[, , 1] + B[, , 2], MAR = 1, FUN = simplex))
lines(apply(aux, MAR = 2, FUN = quantile, prob = 0.975), type = "b", pch = 4,
      lty = 2, col = my_colors[2])
lines(apply(aux, MAR = 2, FUN = quantile, prob = 0.025), type = "b", pch = 4,
      lty = 2, col = my_colors[2])
lines(apply(aux, MAR = 2, FUN = mean), lwd = 2, col = my_colors[2],
      type = "b", pch = 16)
Figure 2 visualizes the outcome of these R plotting commands. Note that, because of the reduced degrees of freedom, the results should only be viewed as an ensemble, as the outcome of one proportion influences all the other proportions' outcomes, and vice versa.
Table 1 shows the results on the parameter level and compares ML to Bayes for several different prior choices. As can be seen, different prior choices only have a practically irrelevant influence on the parameters' posterior distributions.
4 Discussion
We have introduced an implementation of a Bayesian estimation framework for Dirichlet regression that combines the advantage of identifiability, by selection of a reference category, with post-estimation flexibility, by use of a Bayesian simulation algorithm. Ready-to-use Stan code was introduced, and an application example demonstrates the modeling capabilities of this solution.
References

[1] L. N. Bol'shev. Dirichlet distribution. Encyclopedia of Mathematics. URL: http://www.encyclopediaofmath.org/index.php?title=Dirichlet_distribution, last accessed 2018-04-17.
[2] P.-C. Bürkner. brms: An R Package for Bayesian Generalized Linear Mixed Models using Stan. Journal of Statistical Software, 80(1):1–28, 2017.
[3] B. Carpenter, A. Gelman, M. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell. Stan: A Probabilistic Programming Language. Journal of Statistical Software, 76(1):1–32, 2017.
[4] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.
[5] A. E. Gelfand and A. F. M. Smith. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410):398–409, 1990.
[6] W. R. Gilks, A. Thomas, and D. J. Spiegelhalter. A language and program for complex Bayesian modelling. The Statistician, 43:169–178, 1994.
[7] M. D. Hoffman and A. Gelman. The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15:1593–1623, 2014.
[8] D. Lunn, D. Spiegelhalter, A. Thomas, and N. Best. The BUGS project: Evolution, critique and future directions. Statistics in Medicine, 28:3049–3067, 2009.
[9] M. J. Maier. DirichletReg: Dirichlet Regression for Compositional Data in R. Research Report Series / Department of Statistics and Mathematics, Vienna University of Economics and Business, 125, 2014.
[10] M. J. Maier. DirichletReg: Dirichlet Regression in R, 2015. R package version 0.6-3.
[11] J. A. Martín-Fernández, C. Barceló-Vidal, and V. Pawlowsky-Glahn. Dealing With Zeros and Missing Values in Compositional Data Sets Using Nonparametric Imputation. Mathematical Geology, 35(3), 2003.
[12] C. C. Monnahan, J. T. Thorson, T. A. Branch, and R. B. O'Hara. Faster estimation of Bayesian models in ecology using Hamiltonian Monte Carlo. Methods in Ecology and Evolution, 8(3):339–348, 2017.
[13] M. Plummer. JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling, 2003.
[14] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2017.
[15] M. Smithson and J. Verkuilen. A Better Lemon Squeezer? Maximum-Likelihood Regression With Beta-Distributed Dependent Variables. Psychological Methods, 11(1):54–71, 2006.
[16] Stan Development Team. RStan: the R interface to Stan, 2016. R package version 2.14.1.
[17] S. Van der Merwe. A method for Bayesian regression modelling of composition data. arXiv preprint, 2018.
S1 Stan implementation of the alternative parametrization including varying precision
The following code uses design matrix object X both for expectation and precision:
stan_code_var_theta <- '
data {
  int<lower=1> N;                 // total number of observations
  int<lower=2> ncolY;             // number of categories
  int<lower=2> ncolX;             // number of predictor levels
  matrix[N,ncolX] X;              // predictor design matrix
  matrix[N,ncolY] Y;              // response variable
  real sd_prior_beta;             // prior standard deviation
  real sd_prior_theta;            // prior standard deviation
}
parameters {
  matrix[ncolY-1,ncolX] beta_raw; // coefficients (raw)
  matrix[1,ncolX] theta;          // precision coefficients
}
transformed parameters {
  matrix[ncolY,ncolX] beta;       // coefficients
  for (l in 1:ncolX) {
    beta[ncolY,l] = 0.0;
  }
  for (k in 1:(ncolY-1)) {
    for (l in 1:ncolX) {
      beta[k,l] = beta_raw[k,l];
    }
  }
}
model {
  // priors:
  for (l in 1:ncolX) {
    for (k in 1:(ncolY-1)) {
      beta_raw[k,l] ~ normal(0,sd_prior_beta);
    }
    theta[1,l] ~ normal(0,sd_prior_theta);
  }
  // likelihood:
  for (n in 1:N) {
    vector[ncolY] logits;
    real exptheta;
    for (m in 1:ncolY) {
      logits[m] = X[n,] * transpose(beta[m,]);
    }
    exptheta = exp(X[n,] * transpose(theta[1,]));
    logits = softmax(logits);
    transpose(Y[n,]) ~ dirichlet(logits * exptheta);
  }
}
'