    # Bayesian Regression for a Dirichlet Distributed Response using Stan

For an observed response that is composed of a set (or vector) of positive values summing to 1, the Dirichlet distribution (Bol'shev, 2018) is a helpful mathematical construction for quantifying the data-generating mechanism underlying such a process. In applications, these response sets are usually denoted as proportions, or compositions of proportions, and by means of covariates, one wishes to uncover the underlying signal, i.e. how changes in the values of these covariates lead to differently distributed response compositions. This article gives a brief introduction to this class of regression models and, based on a recently developed formulation (Maier, 2014), illustrates the implementation in the Bayesian inference framework Stan.


## 1 Introduction

Let us denote by $y = (y_1, \ldots, y_C)^\top$ a vector of proportion values that is an observation of a random variable $Y$. As an application example, we can access blood sample composition data via the R add-on package DirichletReg. The data set BloodSamples gives 36 compositions of Albumin, Pre-Albumin, Globulin A, and Globulin B in relation to two types of diseases (14 patients suffer from disease A, 16 from disease B, and 6 are unclassified). The first 6 observations show the following values:

```r
library("DirichletReg")
Bld <- BloodSamples
head(Bld)
```

```
   Albumin Pre.Albumin Globulin.A Globulin.B Disease New
A1   0.348       0.197      0.201      0.254       A  No
A2   0.386       0.239      0.141      0.242       A  No
A3   0.471       0.240      0.089      0.200       A  No
A4   0.427       0.245      0.111      0.217       A  No
A5   0.346       0.230      0.204      0.219       A  No
A6   0.485       0.231      0.101      0.183       A  No
```

Figure 1: Blood sample data: Grey points (lines indicate connections of proportions as coming from a shared observation unit) show the original data, black points the respective sample means.

Figure 1 on page 1 visualizes the data. The task for a Dirichlet regression model is here to work out the differences in the composition of blood compartments by differences in the levels of a binary disease variable (Disease, unclassified observations are dropped).
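As a quick plausibility check, we can verify that the compositions displayed above (approximately) sum to 1 per observation unit. A small sketch, here in Python for convenience, using the six rows printed above; note that some rows deviate slightly from 1, which is why DR_data later forces a normalization:

```python
# Row sums of the first six blood sample compositions shown above
rows = {
    "A1": [0.348, 0.197, 0.201, 0.254],
    "A2": [0.386, 0.239, 0.141, 0.242],
    "A3": [0.471, 0.240, 0.089, 0.200],
    "A4": [0.427, 0.245, 0.111, 0.217],
    "A5": [0.346, 0.230, 0.204, 0.219],
    "A6": [0.485, 0.231, 0.101, 0.183],
}
sums = {k: round(sum(v), 3) for k, v in rows.items()}
# A2 (1.008) and A5 (0.999) deviate slightly from 1, so the data
# have to be normalized before fitting a Dirichlet model.
```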

As $Y$ distributes in $C$ dimensions, i.e.

$$ \mathcal{Y} = \{(Y_1,\ldots,Y_C)^\top;\; Y_1+\ldots+Y_C=1\}, $$

one observation for observation unit $i$ is a vector of values $y_i=(y_{1,i},\ldots,y_{C,i})^\top$, $i=1,\ldots,n$, with $\sum_{c=1}^C y_{c,i}=1$. For each component $c$, one usually assumes a strictly positive proportion below 1, i.e. $0<y_{c,i}<1$.

#### Zero observations

In applications, zero observations, i.e. $y_{c,i}=0$, are often present, either because the corresponding proportion is completely absent, or because $y_{c,i}$ is only observable with a certain precision and is consequently at risk of falling below a detection limit. Several strategies have been introduced for how to deal with, and how to distinguish among, zero observations (see Martín-Fernández et al. for an overview). For the remainder of this article, we assume that any zero observation comes from the class of rounded zeros:

”One typical example of rounded zero is the zero in a component of a particular mineral which indicates that no quantifiable proportion of the mineral has been recorded by the measurement process. Because this kind of zeros is usually understood as ’a trace too small to measure’, it seems reasonable to replace them by a suitable small value […].” 

In case of zero observations, the transformation proposed by Martín-Fernández et al., applied by Maier and implemented in the R package DirichletReg in function DR_data as argument trafo, is used:

$$ y_{c,i}^{*} = \begin{cases} y_{c,i}, & \text{if } y_{c,i} > 0,\\[4pt] \dfrac{y_{c,i}\,(n-1) + \tfrac{1}{C}}{n}, & \text{else}, \end{cases} \qquad (1) $$

where $n$ is the number of observation units and $C$ the number of components.
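A minimal sketch of transformation (1) in Python (the function name is my own; DR_data implements this internally):

```python
def replace_rounded_zero(y, n, C):
    """Transformation (1): keep positive proportions; replace a rounded
    zero by shrinking towards the uniform proportion 1/C, where n is the
    number of observation units and C the number of components."""
    if y > 0:
        return y
    return (y * (n - 1) + 1.0 / C) / n

# A zero observation in the blood data (n = 30 classified units, C = 4
# components) would be replaced by 1/(C*n):
print(replace_rounded_zero(0.0, n=30, C=4))  # 1/120, i.e. about 0.00833
```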

#### Dirichlet distribution

The Dirichlet distribution is defined as:

$$ f(y\,|\,\alpha) = \frac{1}{B(\alpha)} \prod_{c=1}^{C} y_c^{\alpha_c - 1} \qquad (2) $$

with shape parameter vector $\alpha=(\alpha_1,\ldots,\alpha_C)^\top$, where the multinomial beta function, $B(\alpha)=\frac{\prod_{c=1}^{C}\Gamma(\alpha_c)}{\Gamma\left(\sum_{c=1}^{C}\alpha_c\right)}$, serves as a normalizing constant, and $\Gamma(\cdot)$ denotes the Gamma function.

In the following, $Y \sim D(\alpha)$ will denote that a random variable $Y$ is Dirichlet distributed. From this parametrisation, it is evident that the elements of $\alpha$ must control location as well as scale of the distribution. The sum of all components, $\alpha_0 = \sum_{c=1}^{C} \alpha_c$, is interpreted as precision, i.e. "the higher this value, the more density is near the expected value", while component-wise expected values are given by the ratio of each parameter to the precision, i.e.

$$ \operatorname{E}(Y_c) = \frac{\alpha_c}{\alpha_0}. \qquad (3) $$

Two direct consequences of this definition:

• Expected values are proportions summing up to 1, and by this

• the vector of expected values has only $C-1$ degrees of freedom: if we hold $C-1$ expected values fixed at certain values, the last expected value is given by the difference to 1.
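Both consequences are easy to verify numerically. A small Python sketch with a hypothetical shape vector (the values are illustrative only):

```python
alpha = [1.2, 0.8, 2.0]            # hypothetical shape parameters, C = 3
alpha0 = sum(alpha)                # precision alpha_0 = 4.0
mu = [a / alpha0 for a in alpha]   # component-wise expected values, Eq. (3)

assert abs(sum(mu) - 1.0) < 1e-12  # expectations are proportions summing to 1
# fixing the first C-1 expectations determines the last one:
assert abs(mu[-1] - (1.0 - mu[0] - mu[1])) < 1e-12
```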

#### Regression modeling for Dirichlet distribution

Maier (2014) introduces two parameter formulations for Dirichlet regression models; here the so-called alternative formulation will be used:

$$ Y_i\,|\,x_i \sim D(\mu_i, \theta), \qquad (4) $$

with expectation vector $\mu_i = (\mu_{1,i},\ldots,\mu_{C,i})^\top$, expectation vector elements $\mu_{c,i}$, component index $c=1,\ldots,C$, and precision $\theta$.

Using component-wise coefficient vectors $\beta_c$ for components $c=1,\ldots,C$, we can set up linear predictors defining $\mu_{c,i}$ as:

$$ \eta_{c,i} = x_i^\top \beta_c, $$

where for one reference component $\tilde{c}$, all elements of $\beta_{\tilde{c}}$ are set equal to 0, i.e. $\beta_{\tilde{c}} = (0,\ldots,0)^\top$, to guarantee identifiability as a consequence of the degrees of freedom being reduced to $C-1$ (usually, $\tilde{c}$ is selected as $1$ or $C$, here as $C$).

Plugging these linear predictors into a ratio of model components, after applying the exponential function to guarantee positivity, the component-wise expected values conditional on $x_i$ are defined as:

$$ \mu_{c,i} = \frac{\exp(\eta_{c,i})}{\sum_{d=1}^{C} \exp(\eta_{d,i})}. $$

Applying again the exponential function, and introducing coefficient vector $\gamma$, we can express the global precision conditional on covariates $z_i$ as:

$$ \theta_i = \exp(z_i^\top \gamma). $$
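Combining both parts, the Dirichlet shape parameters are recovered as $\alpha_{c,i} = \mu_{c,i}\,\theta_i$, which is exactly what the Stan code in Section 2 does via softmax(logits) * exptheta. A Python sketch of this chain, using the ML coefficient estimates reported in Section 3 for a patient with disease B, so $x_i = (1,1)^\top$ and reference component $\tilde{c} = 4$:

```python
import math

def softmax(eta):
    """Map linear predictors to expected proportions on the simplex."""
    ez = [math.exp(e) for e in eta]
    s = sum(ez)
    return [e / s for e in ez]

# coefficient vectors beta_c = (intercept, disease-B effect); beta_4 = 0
beta = [(0.63010700, -0.25191609),   # Albumin
        (0.06274025, -0.30952737),   # Pre-Albumin
        (-0.48628655, -0.18189666),  # Globulin A
        (0.0, 0.0)]                  # Globulin B (reference)
x = (1.0, 1.0)                       # intercept + disease-B dummy

eta = [b0 * x[0] + b1 * x[1] for b0, b1 in beta]  # linear predictors
mu = softmax(eta)                    # expected proportions under disease B
theta = math.exp(4.22272495)         # global precision (gamma intercept)
alpha = [m * theta for m in mu]      # Dirichlet shape parameters

assert abs(sum(mu) - 1.0) < 1e-12    # expectations live on the simplex
```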

We get to densities:

$$ f\big(y_i\,|\,(\mu_{1,i},\mu_{2,i},\ldots,\mu_{C,i})^\top,\theta_i\big) = f(y_i\,|\,\mu_i,\theta_i), \qquad (5) $$

and likelihood:

$$ L(\{y_i;\,i=1,\ldots,n\}) = \prod_{i=1}^{n} f(y_i\,|\,\mu_i,\theta_i). \qquad (6) $$
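Since the observation units are independent, the log-likelihood is a sum of Dirichlet log densities. A direct Python transcription of Eqs. (2), (5), and (6), with shape parameters $\alpha_i = \mu_i\theta_i$ (a sketch, not the Stan internals):

```python
import math

def dirichlet_logpdf(y, alpha):
    # log of Eq. (2): lgamma(sum alpha) - sum lgamma(alpha_c)
    #                 + sum (alpha_c - 1) * log y_c
    return (math.lgamma(sum(alpha))
            - sum(math.lgamma(a) for a in alpha)
            + sum((a - 1.0) * math.log(v) for a, v in zip(alpha, y)))

def log_likelihood(Y, Mu, Theta):
    # Eq. (6) on the log scale, one term per observation unit
    return sum(dirichlet_logpdf(y, [m * t for m in mu])
               for y, mu, t in zip(Y, Mu, Theta))

# sanity check: for alpha = (1, 1, 1) the Dirichlet is uniform on the
# simplex with density Gamma(3) = 2, so the log density is log(2):
assert abs(dirichlet_logpdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0]) - math.log(2)) < 1e-12
```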

#### Why this manuscript?

As seen above, one component is redundant since it is a linear combination of the others, but in practice we desire to know the relationship between the explanatory variables and the outcomes in all components. Often the components are equally important in the view of the researcher, and it is not sensible to relegate one to reference status. The use of a Bayesian simulation framework allows us, by the use of simple post-estimation calculations, to quantify uncertainty on both the coefficient scale and the proportion scale. We are able to look at all quantities as an ensemble, get to statistical inference statements for all model components, and by this are able to move between the 'x in c' and 'c across x' perspectives. This is a great advantage in comparison to the ML framework, where predictions are also easily achieved and readily implemented, but the expression of uncertainties for all quantities and/or on the basis of proportions quickly becomes tedious, as handling variances and covariances is rather difficult for nonlinear combinations.

Van der Merwe recently proposed a Bayesian solution to this practical limitation based on a penalized likelihood formulation respecting the restrictions on the degrees of freedom as described above. Following the simulation from the joint posterior, he applies post-simulation calculations "to ensure that fitted expected values sum to one for each observation".

This is a somewhat counter-intuitive step, as the sampling of coefficients is performed such that it should respect the degrees-of-freedom restriction by suitably penalizing the likelihood. Further, it introduces a certain removal of sampling variation that was initially accepted during the Bayesian sampling algorithm as being plausible with respect to the observed data and statistical model. By this, one loses the direct linkage between the Bayesian model expressed by the joint posterior and the adjusted samples from the joint posterior.

In order to overcome this issue, but still be able to remain in the Bayesian framework that Van der Merwe introduced with good reason:

”Specifically, we use the Bayesian simulation framework, which holds many advantages. It allows us to directly quantify uncertainty in both the coefficients and the means. Also, when moving to a predictive framework, construction of predictive densities is relatively straightforward.” 

I will introduce a Stan implementation using one response component as reference, as used in the ML estimation framework by Maier, and demonstrate how post-estimation calculations overcome the limitations induced by the definition of a reference component in the ML estimation framework. This post-estimation technique is simple: calculate the expected value of reference component $C$, conditional on the covariate values of observation unit $i$, as the difference of the sum of the other expected values to the boundary condition that all expected values must sum up to 1:

$$ \mu_{C,i,s} = 1 - \sum_{d=1}^{C-1} \mu_{d,i,s}, $$

repeatedly for each sample $s$ from the posterior. So far this is no different from what we need to do to get to an estimate for $\mu_{C,i}$ in the ML framework, but by repeating it per posterior sample we are equipped with not only one estimate, but a full distribution of samples for $\mu_{C,i}$, in the same fashion as for the other components. Consequently, uncertainty measures for all response components can be expressed, with, of course, still reduced degrees of freedom of $C-1$.
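The post-estimation step can be sketched as follows, here in Python with hypothetical posterior draws standing in for the Stan output:

```python
import random

random.seed(1)
S, C = 4000, 4
# hypothetical posterior draws: for each sample s, a sub-simplex point of
# the C-1 = 3 free expected values, as delivered by the softmax
draws = []
for _ in range(S):
    e = [random.gammavariate(a, 1.0) for a in (5.0, 3.0, 2.0, 4.0)]
    tot = sum(e)
    draws.append([v / tot for v in e[:3]])  # keep only the free components

# recover a full posterior distribution for the reference component:
mu_ref = [1.0 - sum(d) for d in draws]
mu_ref.sort()
ci = (mu_ref[int(0.025 * S)], mu_ref[int(0.975 * S)])  # 95% credible interval
```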

Van der Merwe based his software implementation on Gibbs sampling via the OpenBUGS program, but as I personally feel much more comfortable and experienced using the alternative Bayesian inference framework Stan, I will introduce the new approach using Stan. However, any researcher should technically be able to perform her/his calculations in a similar way in the OpenBUGS framework.

The remainder of this article is organized as follows: Section 2 introduces the Bayesian model as well as the Stan estimation framework for the Dirichlet regression model, and Section 3 shows an application example.

## 2 Methods and software implementation

We move to a Bayesian framework and introduce normal priors with mean 0 and variance $5^2$ on all model parameters:

$$ \begin{aligned} y_i\,|\,x_i,z_i &\sim D(\mu_i,\theta_i), \qquad (7)\\ \mu_{c,i} &= \frac{\exp(x_i^\top\beta_c)}{\sum_{d=1}^{C}\exp(x_i^\top\beta_d)},\\ \theta_i &= \exp(z_i^\top\gamma),\\ \beta_{j,c} &\sim N(0,5^2), \quad j=1,\ldots,p_\beta,\; \forall c\neq\tilde{c},\\ \beta_{j,\tilde{c}} &= 0, \quad j=1,\ldots,p_\beta,\\ \gamma_j &\sim N(0,5^2), \quad j=1,\ldots,p_\gamma. \end{aligned} $$

One can consider these normal priors as weakly informative. They certainly do not act uniformly on the coefficients, but the informativeness of a prior does not depend on its absolute spread per se; rather, it depends on how flat the prior is in the region of high likelihood, and vice versa. For almost any application with unit-scale covariates, the region of high likelihood for the coefficients can be assumed to lie within an interval of only a few units around 0, so these priors still let the data speak for themselves.

### 2.1 Software implementation

The Bayesian estimation of additive regression models with Dirichlet distributed responses is implemented in Stan and practically applied using RStan. RStan functions as an R interface to Stan using common R syntax. Stan is statistical software written in C++ which operationalizes Bayesian inference by drawing samples from a regression model's joint posterior in the form of Markov chains.

Stan is my personal generic Bayesian software framework (it is a probabilistic programming language) of choice, as it is the current gold standard, and I, the author, personally feel more comfortable and experienced with it in comparison to JAGS, BUGS, and the like. In contrast to these alternatives for Bayesian analyses, Stan uses Hamiltonian Monte Carlo and the No-U-Turn Sampler (NUTS), which require fewer samples for chain convergence, though at the cost of increased computation time per sample. In each of the further applications, a single model fit incorporates four chains with 2000 iterations each, where the first 1000 iterations of each chain are discarded as warmup, leaving in total 4000 post-warmup samples. These are defaults as, for example, also chosen by brms. Below is a Stan implementation of the Dirichlet regression model including covariate effects for the variables transported by design matrix X; R base function model.matrix(...) is a convenient tool for its preparation:

```
stan_code <- '
data {
  int<lower=1> N;      // total number of observations
  int<lower=2> ncolY;  // number of categories
  int<lower=2> ncolX;  // number of predictor levels
  matrix[N,ncolX] X;   // predictor design matrix
  matrix[N,ncolY] Y;   // response variable
  real sd_prior;       // prior standard deviation
}
parameters {
  matrix[ncolY-1,ncolX] beta_raw; // coefficients (raw)
  real theta;
}
transformed parameters {
  real exptheta = exp(theta);
  matrix[ncolY,ncolX] beta; // coefficients
  for (l in 1:ncolX) {
    beta[ncolY,l] = 0.0; // reference category fixed to 0
  }
  for (k in 1:(ncolY-1)) {
    for (l in 1:ncolX) {
      beta[k,l] = beta_raw[k,l];
    }
  }
}
model {
  // prior:
  theta ~ normal(0,sd_prior);
  for (k in 1:(ncolY-1)) {
    for (l in 1:ncolX) {
      beta_raw[k,l] ~ normal(0,sd_prior);
    }
  }
  // likelihood
  for (n in 1:N) {
    vector[ncolY] logits;
    for (m in 1:ncolY) {
      logits[m] = X[n,] * transpose(beta[m,]);
    }
    transpose(Y[n,]) ~ dirichlet(softmax(logits) * exptheta);
  }
}
'
```


Supplement S1 gives an implementation of the alternative parametrization of Dirichlet regression including varying precision by varying covariates.

## 3 Results for the Blood Samples Application

First, let's see how these data are analyzed using DirichletReg:

```r
Bld <- na.omit(Bld)
Bld$Smp <- DR_data(Bld[, 1:4])
```

We get a warning that some values needed to be corrected such that all proportions for one observation unit sum up to 1:

```
Warning in DR_data(Bld[, 1:4]) :
  not all rows sum up to 1 => normalization forced
```

The ML estimate is calculated via:

```r
m <- DirichReg(Smp ~ Disease | 1, Bld, model = "alternative", base = 4)
```

We get the following coefficient estimates:

```r
(b <- unlist(coef(m)))
    beta.Alb..(Interc.)         beta.Alb..Dis.B
             0.63010700             -0.25191609
beta.Pre.Alb..(Interc.)     beta.Pre.Alb..Dis.B
             0.06274025             -0.30952737
  beta.Glob.A.(Interc.)       beta.Glob.A.Dis.B
            -0.48628655             -0.18189666
  gamma.gamma.(Interc.)
             4.22272495
```

And can calculate expected values by:

```r
(B <- matrix(nrow = 4, ncol = 2, c(b[1:6], rep(0, 2)), byrow = T))
            [,1]       [,2]
[1,]  0.63010700 -0.2519161
[2,]  0.06274025 -0.3095274
[3,] -0.48628655 -0.1818967
[4,]  0.00000000  0.0000000
(eB <- t(exp(apply(B, MAR = 1, FUN = cumsum))))
          [,1]      [,2]
[1,] 1.8778115 1.4596416
[2,] 1.0647502 0.7813070
[3,] 0.6149056 0.5126391
[4,] 1.0000000 1.0000000
(mu <- cbind(eB[, 1]/colSums(eB), eB[, 2]/colSums(eB)))
          [,1]      [,2]
[1,] 0.4120296 0.3888657
[2,] 0.2336276 0.2081494
[3,] 0.1349227 0.1365731
[4,] 0.2194201 0.2664118
```

Now, the Bayesian estimation using Stan begins with compiling the previously introduced Stan code:

```r
library("rstan")
prg <- stan_model(model_code = stan_code)
```

We then need to translate the covariates into a design matrix X:

```r
X <- as.matrix(model.matrix(lm(Albumin ~ Disease, data = Bld)))
X <- matrix(nrow = nrow(X), ncol = ncol(X), data = as.numeric(X))
```

Define the response object Y:

```r
Y <- Bld$Smp
```


Provide everything as a list:

```r
D <- list(N = nrow(Y), ncolY = ncol(Y), ncolX = ncol(X),
          X = X, Y = Y, sd_prior = 1)
```


And finally estimate the parameters using function sampling from the RStan package:

```r
fit1 <- sampling(prg, data = D, chains = 4, iter = 2000, cores = 4,
                 control = list(adapt_delta = 0.95, max_treedepth = 20),
                 refresh = 100)
```


Using extract, we can access the posterior samples:

```r
B <- extract(fit1)$beta
```


A small helper function will assist us with calculating the expected values:

```r
simplex <- function(x) {
  exp(x) / sum(exp(x))
}
```


We can plot everything using the following code:

```r
my_colors <- "red"  # color for the Bayesian estimates
# Disease A: posterior draws of expected values (intercept column of beta)
aux <- t(apply(B[, , 1], MARGIN = 1, FUN = simplex))
plot(1:4, Bld[1, 1:4], ylim = c(0, 0.6), type = "n", xaxt = "n", las = 1,
     xlab = "", ylab = "Proportion", main = "Disease A", xlim = c(0.6, 4.4))
abline(h = seq(0, 0.6, by = 0.1), col = "grey", lty = 3)
axis(1, at = 1:4, labels = names(Bld)[1:4], las = 2)
lines(apply(aux, MAR = 2, FUN = quantile, prob = 0.975), type = "b", pch = 4,
      lty = 2, col = my_colors)
lines(apply(aux, MAR = 2, FUN = quantile, prob = 0.025), type = "b", pch = 4,
      lty = 2, col = my_colors)
lines(apply(aux, MAR = 2, FUN = mean), lwd = 2, col = my_colors, type = "b",
      pch = 16)
# Disease B: add the disease effect to the intercept before the simplex
# transformation (reconstructed; indexing of B assumes the design matrix
# columns intercept and disease-B dummy)
aux <- t(apply(B[, , 1] + B[, , 2], MARGIN = 1, FUN = simplex))
plot(1:4, Bld[1, 1:4], ylim = c(0, 0.6), type = "n", xaxt = "n", las = 1,
     xlab = "", ylab = "Proportion", main = "Disease B", xlim = c(0.6, 4.4))
abline(h = seq(0, 0.6, by = 0.1), col = "grey", lty = 3)
axis(1, at = 1:4, labels = names(Bld)[1:4], las = 2)
lines(apply(aux, MAR = 2, FUN = quantile, prob = 0.975), type = "b", pch = 4,
      lty = 2, col = my_colors)
lines(apply(aux, MAR = 2, FUN = quantile, prob = 0.025), type = "b", pch = 4,
      lty = 2, col = my_colors)
lines(apply(aux, MAR = 2, FUN = mean), lwd = 2, col = my_colors, type = "b",
      pch = 16)
```

Figure 2 visualizes the outcome of these R plotting commands. Note that by the reduced degrees of freedom, results should be seen only in the ensemble: the outcome of one proportion influences all the other proportions' outcomes, and vice versa.

Table 1 shows the results on the parameter level and compares ML to Bayes for several different prior choices. As can be seen, different prior choices have only a practically irrelevant influence on the parameters' posterior distributions.

Figure 2: Results for the blood sample data: Grey points show the original data, black points show the respective sample mean, red points give the estimated expected values conditional on disease based on the Bayesian approach, and black crosses show 95% (point-wise) credible intervals based on the Bayesian estimation approach.

## 4 Discussion

We have introduced an implementation of a Bayesian estimation framework for Dirichlet regression that combines the advantage of identifiability, by selection of a reference category, with post-estimation flexibility, by use of a Bayesian simulation algorithm. Ready-to-use Stan code was introduced, and an application example demonstrates the modeling capabilities of this solution.

## References

•  L. N. Bol'shev. Dirichlet distribution. Encyclopedia of Mathematics, URL: http://www.encyclopediaofmath.org/index.php?title=Dirichlet_distribution, last accessed 2018-04-17.
•  P.-C. Bürkner. brms: An R Package for Bayesian Generalized Linear Mixed Models using Stan. Journal of Statistical Software, 80(1):1–28, 2017.
•  B. Carpenter, A. Gelman, M. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell. Stan: A Probabilistic Programming Language. Journal of Statistical Software, 76(1):1–32, 2017.
•  S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.
•  A. E. Gelfand and A. F. M. Smith. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410):398–409, 1990.
•  W. R. Gilks, A. Thomas, and D. J. Spiegelhalter. A language and program for complex Bayesian modelling. The Statistician, 43:169–178, 1994.
•  M. D. Hoffman and A. Gelman. The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15:1593–1623, 2014.
•  D. Lunn, D. Spiegelhalter, A. Thomas, and N. Best. The BUGS project: Evolution, critique, and future directions. Statistics in Medicine, 28:3049–3067, 2009.
•  M. J. Maier. DirichletReg: Dirichlet Regression for Compositional Data in R. Research Report Series / Department of Statistics and Mathematics, Vienna University of Economics and Business, 125, 2014.
•  M. J. Maier. DirichletReg: Dirichlet Regression in R, 2015. R package version 0.6-3.
•  J. A. Martín-Fernández, C. Barceló-Vidal, and V. Pawlowsky-Glahn. Dealing With Zeros and Missing Values in Compositional Data Sets Using Nonparametric Imputation. Mathematical Geology, 35(3), 2003.
•  C. C. Monnahan, J. T. Thorson, T. A. Branch, and R. B. O'Hara. Faster estimation of Bayesian models in ecology using Hamiltonian Monte Carlo. Methods in Ecology and Evolution, 8(3):339–348, 2017.
•  M. Plummer. JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling, 2003.
•  R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2017.
•  M. Smithson and J. Verkuilen. A Better Lemon Squeezer? Maximum-Likelihood Regression With Beta-Distributed Dependent Variables. Psychological Methods, 11(1):54–71, 2006.
•  Stan Development Team. RStan: the R interface to Stan, 2016. R package version 2.14.1.
•  S. Van der Merwe. A method for Bayesian regression modelling of composition data. arXiv preprint, 2018.

## S1 Stan implementation of the alternative parametrization including varying precision

The following code uses design matrix object X both for expectation and precision:

```
stan_code_var_theta <- '
data {
  int<lower=1> N;      // total number of observations
  int<lower=2> ncolY;  // number of categories
  int<lower=2> ncolX;  // number of predictor levels
  matrix[N,ncolX] X;   // predictor design matrix
  matrix[N,ncolY] Y;   // response variable
  real sd_prior_beta;  // prior standard deviation
  real sd_prior_theta; // prior standard deviation
}
parameters {
  matrix[ncolY-1,ncolX] beta_raw; // coefficients (raw)
  matrix[1,ncolX] theta;          // precision coefficients
}
transformed parameters {
  matrix[ncolY,ncolX] beta; // coefficients
  for (l in 1:ncolX) {
    beta[ncolY,l] = 0.0; // reference category fixed to 0
  }
  for (k in 1:(ncolY-1)) {
    for (l in 1:ncolX) {
      beta[k,l] = beta_raw[k,l];
    }
  }
}
model {
  // priors
  for (l in 1:ncolX) {
    for (k in 1:(ncolY-1)) {
      beta_raw[k,l] ~ normal(0,sd_prior_beta);
    }
    theta[1,l] ~ normal(0,sd_prior_theta);
  }
  // likelihood
  for (n in 1:N) {
    vector[ncolY] logits;
    real exptheta;
    for (m in 1:ncolY) {
      logits[m] = X[n,] * transpose(beta[m,]);
    }
    exptheta = exp(X[n,] * transpose(theta[1,]));
    logits = softmax(logits);
    transpose(Y[n,]) ~ dirichlet(logits * exptheta);
  }
}
'
```