# Permuted and Augmented Stick-Breaking Bayesian Multinomial Regression

To model categorical response variables given their covariates, we propose a permuted and augmented stick-breaking (paSB) construction that one-to-one maps the observed categories to randomly permuted latent sticks. This new construction transforms multinomial regression into regression analysis of stick-specific binary random variables that are mutually independent given their covariate-dependent stick success probabilities, which are parameterized by the regression coefficients of their corresponding categories. The paSB construction allows transforming an arbitrary cross-entropy-loss binary classifier into a Bayesian multinomial one. Specifically, we parameterize the negative logarithms of the stick failure probabilities with a family of covariate-dependent softplus functions to construct nonparametric Bayesian multinomial softplus regression, and transform Bayesian support vector machine (SVM) into Bayesian multinomial SVM. These Bayesian multinomial regression models are not only capable of providing probability estimates, quantifying uncertainty, and producing nonlinear classification decision boundaries, but also amenable to posterior simulation. Example results demonstrate their attractive properties and appealing performance.


## 1 Introduction

Inferring the functional relationship between a categorical response variable and its covariates is a fundamental problem in the physical and social sciences. To address this problem, it is common to use either multinomial logistic regression (MLR) (McFadden, 1973; Greene, 2003; Train, 2009) or multinomial probit regression (Albert and Chib, 1993; McCulloch and Rossi, 1994; McCulloch et al., 2000; Imai and van Dyk, 2005), both of which can be expressed as a latent-utility-maximization model that lets an individual make the decision by comparing its random utilities across all categories at once. In this paper, we address the problem via a new stick-breaking construction of the multinomial distribution, which defines a one-to-one random mapping between the category and stick indices. Rather than assuming an individual compares its random utilities across all categories at once, we assume an individual makes a sequence of stick-specific binary random decisions. With $S$ categories in total, the choice of the individual is the category mapped to the stick that is the first to choose “1,” or the category mapped to stick $S$ if all the first $S-1$ sticks choose “0.” This framework transforms the problem of regression analysis of categorical variables into the problem of inferring the one-to-one mapping between the category and stick indices, and performing regression analysis of binary stick-specific random variables.

Both MLR and the proposed stick-breaking models link a categorical response variable to its covariate-dependent probability parameters. While MLR is invariant to the permutation of category labels, the proposed stick-breaking models, given a fixed category-stick mapping, purposely break that invariance. We are motivated to introduce this new framework for discrete choice modeling mainly to facilitate efficient Bayesian inference via data augmentation, introduce nonlinear decision boundaries, and relax a well-recognized restrictive model assumption of MLR, as described below.

An important motivation is to extend the efficient Bayesian inference available for binary regression to multinomial regression. In the proposed stick-breaking models, the binary stick-specific random variables of an individual are conditionally independent given their stick-specific covariate-dependent probabilities. Under this setting, one can solve a multinomial regression by solving conditionally independent binary ones. The only requirement is that the underlying binary regression model uses the cross entropy loss. In other words, we require each stick-specific binary random variable to be linked via the Bernoulli distribution to its corresponding stick-specific covariate-dependent probability parameter.

Another important motivation is to improve the model capacity of MLR, which is a linear classifier in the sense that if the total number of categories is $S$, then MLR uses the intersection of $S-1$ linear hyperplanes to separate one class from the others. By choosing nonlinear binary regression models, we are able to enhance the capacities of the proposed stick-breaking models. We are also motivated to relax the independence of irrelevant alternatives (IIA) assumption, an inherent property of MLR that requires the probability ratio of any two choices to be independent of the presence or characteristics of any other alternatives (McFadden, 1973; Greene, 2003; Train, 2009). By contrast, the proposed stick-breaking models make the probability ratio of two choices depend on other alternatives, as long as the two sticks that both choices are mapped to are not next to each other.

In light of these considerations, we will first extend the softplus regressions recently proposed in Zhou (2016), a family of cross-entropy-loss binary classifiers that can introduce nonlinear decision boundaries and can recover logistic regression as a special case, to construct Bayesian multinomial softplus regressions (MSRs). We then consider a multinomial generalization of the widely used support vector machine (SVM) (Boser et al., 1992; Cortes and Vapnik, 1995; Schölkopf et al., 1999; Cristianini and Shawe-Taylor, 2000), a max-margin binary classifier that uses the hinge loss. While there has been significant effort in extending binary SVMs into multinomial ones (Crammer and Singer, 2002; Lee et al., 2004; Liu and Yuan, 2011), the resulting extensions typically provide only deterministic class-label predictions. By contrast, we extend the Bayesian binary SVMs in Sollich (2002) and Mallick et al. (2005) under the proposed framework to construct Bayesian multinomial SVMs (MSVMs), which naturally provide predictive class probabilities.

We will show that the proposed Bayesian MSRs and MSVMs, which all generalize the stick-breaking construction to perform Bayesian multinomial regression, are not only capable of placing nonlinear decision boundaries between different categories, but also amenable to posterior simulation via data augmentation. Another attractive feature shared by all these proposed Bayesian algorithms is that they can not only predict class probabilities but also quantify model uncertainty. In addition, we will show that robit regression, a robust cross-entropy-loss binary classifier proposed in Liu (2004), can be extended into a robust Bayesian multinomial classifier under the proposed stick-breaking construction.

The remainder of the paper is organized as follows. In Section 2 we briefly review MLR and discuss the restrictions of its stick-breaking construction. In Section 3 we propose the permuted and augmented stick breaking (paSB) to construct Bayesian multi-class classifiers, present the inference, and show how the IIA assumption is relaxed. Under the paSB framework, we show how to transform softplus regressions and support vector machines into Bayesian multinomial regression models in Sections 4 and 5, respectively. We provide experimental results in Section 6 and conclude the paper in Section 7.

## 2 Multinomial Logistic Regression and Stick Breaking

In this section we first briefly review multinomial logistic regression (MLR). We then use the stick-breaking construction to show how to generate a categorical random variable as a sequence of dependent binary variables, and further discuss a naive approach to transform binary logistic regression under stick breaking into multinomial regression. In the following discussion, we use $i$ to index the individual/observation, $s$ to index the choice/category, and the prime symbol to denote the transpose operation.

### 2.1 Multinomial Logistic Regression

The widely used MLR parameterizes the probability of each category given the covariates as

$$P(y_i = s \mid x_i, \{\beta_s\}_{1,S}) = p_{is}, \qquad p_{is} = \frac{e^{x_i'\beta_s}}{\sum_{j=1}^{S} e^{x_i'\beta_j}}, \tag{1}$$

where $x_i$ consists of a constant term and the covariates of observation $i$, and $\beta_s$ consists of the regression coefficients for the $s$th category (McCullagh and Nelder, 1989; Albert and Chib, 1993; Holmes and Held, 2006). Without loss of generality, one may choose category $S$ as the reference category by setting all the elements of $\beta_S$ as 0, making $e^{x_i'\beta_S} = 1$ almost surely (a.s.). For MLR, if data point $i$ is assigned to the category with the largest $p_{is}$, then one may consider that category $s$ resides within a convex polytope (Grünbaum, 2013), defined by the set of solutions to $S-1$ inequalities as $x_i'(\beta_j - \beta_s) \le 0$, where $j \in \{1,\ldots,s-1,s+1,\ldots,S\}$.
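As a small illustration of the MLR likelihood and its polytope view, the following sketch computes the category probabilities and checks the $S-1$ linear inequalities that define the region of the most probable category. The function names and the toy numbers are assumptions made for illustration only; they do not come from the paper.

```python
import math

def dot(x, beta):
    """Inner product x'beta for two equal-length lists."""
    return sum(xj * bj for xj, bj in zip(x, beta))

def mlr_probs(x, betas):
    """Softmax probabilities p_s = exp(x'beta_s) / sum_j exp(x'beta_j)."""
    scores = [dot(x, beta) for beta in betas]
    top = max(scores)  # subtract the maximum score for numerical stability
    weights = [math.exp(v - top) for v in scores]
    total = sum(weights)
    return [w / total for w in weights]

def in_polytope(x, betas, s):
    """True if x'(beta_j - beta_s) <= 0 holds for every j != s."""
    score_s = dot(x, betas[s])
    return all(dot(x, betas[j]) - score_s <= 0
               for j in range(len(betas)) if j != s)

# Toy example (hypothetical values): first entry of x is the constant term,
# and the last category is the reference with all-zero coefficients.
x = [1.0, 2.0, -1.0]
betas = [[0.5, 1.0, 0.0], [-0.2, 0.3, 0.8], [0.0, 0.0, 0.0]]
probs = mlr_probs(x, betas)
best = max(range(len(probs)), key=lambda s: probs[s])
```

The category with the largest probability is exactly the one whose polytope contains `x`, which is why MLR can only carve the covariate space with linear hyperplanes.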

Despite its popularity, MLR is a linear classifier in the sense that it uses the intersection of $S-1$ linear hyperplanes to separate one class from the others. As a classical discrete choice model in econometrics, it makes the independence of irrelevant alternatives (IIA) assumption, implying that the unobserved factors for choice making are uncorrelated and have the same variance across all alternatives (McFadden, 1973; Train, 2009). Moreover, while its log-likelihood is convex and there are efficient iterative algorithms to find the maximum likelihood or maximum a posteriori solutions of $\{\beta_s\}$, the absence of conjugate priors on $\{\beta_s\}$ makes it difficult to derive efficient Bayesian inference. For Bayesian inference, Polson et al. (2013) have introduced the Pólya-Gamma data augmentation for logit models, and combined it with the data augmentation technique of Holmes and Held (2006) for the multinomial likelihood to develop a Gibbs sampling algorithm for MLR. This algorithm, however, has to update one $\beta_s$ at a time while conditioning on all $\beta_j$ for $j \neq s$. Thus it may not only lead to slow convergence and mixing, especially when the number of categories $S$ is large, but also prevent us from parallelizing the sampling of $\{\beta_s\}$ within each MCMC iteration.

### 2.2 Stick Breaking

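Suppose $y_i$ is a random variable drawn from a categorical distribution with a finite vector of probability parameters $(p_{i1},\ldots,p_{iS})$, where $S<\infty$, $p_{is}\ge 0$, and $\sum_{s=1}^{S} p_{is}=1$. Instead of directly using $y_i \sim \sum_{s=1}^{S} p_{is}\,\delta_s$, one may consider generating $y_i$ using the multinomial stick-breaking construction that sequentially draws binary random variables

$$b_{is} \mid \{b_{ij}\}_{j<s} \sim \text{Bernoulli}\Big[\Big(1-\sum_{j<s} b_{ij}\Big)\pi_{is}\Big], \qquad \pi_{is} = \frac{p_{is}}{1-\sum_{j<s} p_{ij}} \tag{2}$$

for $s=1,2,\ldots,S$. Note that $\pi_{iS}=1$ and $\sum_{s=1}^{S} b_{is}=1$ by construction. Defining $y_i=s$ if and only if $b_{is}=1$ and $b_{ij}=0$ for all $j\neq s$, one has a stick-breaking representation for the multinomial probability parameter as

$$P(y_i=s \mid \{\pi_{is}\}_{1,S}) = P(b_{is}=1)\prod_{j\neq s}P(b_{ij}=0) = \pi_{is}\prod_{j<s}(1-\pi_{ij}), \tag{3}$$

which, as expected, recovers $p_{is}$ by substituting the definitions of $\pi_{is}$ shown in (2).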
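The correspondence between category probabilities and stick probabilities can be checked numerically. The sketch below is only an illustration (the function names are assumptions, not from the paper): it converts a probability vector into stick-specific probabilities by dividing each probability by the mass not yet used by earlier sticks, and then multiplies the sticks back together to recover the original vector.

```python
def probs_to_sticks(p):
    """Stick probabilities: p_s divided by the mass left after sticks 1..s-1."""
    sticks, remaining = [], 1.0
    for p_s in p:
        sticks.append(p_s / remaining if remaining > 0 else 0.0)
        remaining -= p_s
    return sticks

def sticks_to_probs(pi):
    """Category probabilities: pi_s times the product of (1 - pi_j) for j < s."""
    probs, remaining = [], 1.0
    for pi_s in pi:
        probs.append(pi_s * remaining)
        remaining *= 1.0 - pi_s
    return probs

p = [0.2, 0.5, 0.3]
pi = probs_to_sticks(p)       # the last stick probability is (numerically) one
recovered = sticks_to_probs(pi)
```

Because each stick only sees the mass left over by the sticks before it, relabeling the categories changes the stick probabilities even though the categorical distribution itself is unchanged.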

The finite stick-breaking construction in (3) can be further generalized to an infinite setting, as widely used in Bayesian nonparametrics (Hjort et al., 2010). For example, the stick-breaking construction of Sethuraman (1994) represents the length of each stick using the product of stick-specific probabilities that are independent and identically distributed (i.i.d.) beta random variables. It represents a size-biased random permutation of a Dirichlet process (DP) (Ferguson, 1973) random draw, which includes countably infinite atoms whose weights sum to one. The stick-breaking construction of Sethuraman (1994) has also been generalized to represent a draw from a random probability measure that is more general than the DP (Pitman, 1996; Ishwaran and James, 2001; Wang et al., 2011a).

Related to this paper, one may further consider making the stick-specific probabilities depend on the covariates (Dunson and Park, 2008; Chung and Dunson, 2009; Ren et al., 2011). For example, the logistic stick-breaking process of Ren et al. (2011) uses the product of covariate-dependent logistic functions to parameterize the probability of each stick. To implement a stick-breaking process mixture model, truncated stick-breaking representations with a finite number of sticks are commonly used, with inference developed via both Gibbs sampling (Ishwaran and James, 2001; Dunson and Park, 2008; Rodriguez and Dunson, 2011) and variational approximation (Blei and Jordan, 2006; Kurihara et al., 2007; Ren et al., 2011).

Another related work is the order-based dependent Dirichlet processes of Griffin and Steel (2006), which use an ordered stick-breaking construction for mixture modeling, encouraging the data samples close to each other in the covariate space to share similar orders of the sticks and hence similar mixture weights. We will show that the proposed stick-breaking construction is distinct in that all data samples share the same category-stick mapping inferred from the data, with the category labels mapped to lower-indexed sticks subject to fewer geometric constraints on their decision boundaries.

### 2.3 Logistic Stick Breaking

The stick-breaking construction parameterizes each $p_{is}$ with a product of probability parameters and links each $y_i$ with a unit-norm binary vector $(b_{i1},\ldots,b_{iS})$, where $b_{is}=1$ and $b_{ij}=0$ a.s. for all $j\neq s$ if $y_i=s$. Following the logistic stick-breaking construction of Ren et al. (2011), one may represent $p_{is}$ with (3) and parameterize the logit of each $\pi_{is}$ with a latent Gaussian variable. To model observed or latent multinomial variables, a stick-breaking procedure, closely related to that of Ren et al. (2011), is used in Khan et al. (2012) to transform the modeling of multinomial probability parameters into the modeling of the logits of binomial probability parameters using Gaussian latent variables. As shown in Linderman et al. (2015), this procedure allows using the Pólya-Gamma data augmentation, without requiring the assistance of the technique of Holmes and Held (2006), to construct Gibbs sampling that simultaneously updates all categories in each MCMC iteration, leading to improved performance over the one proposed in Polson et al. (2013).

The simplification brought by the stick-breaking representation, which stochastically arranges its categories in decreasing order, comes with a clear change in that it removes the invariance of the multinomial distribution to label permutation. While the loss of invariance to label permutation may not pose a major issue for Bayesian mixture models inferred with MCMC (Jasra et al., 2005; Kurihara et al., 2007), it appears to be a major obstacle when applying stick breaking for multinomial regression, where the performance is often found to be sensitive to how the labels of the categories are ordered. In particular, if one constructs a logistic stick-breaking model by letting $\pi_{is} = 1/(1+e^{-x_i'\beta_s})$, which means $\text{logit}(\pi_{is}) = x_i'\beta_s$, then one has

$$p_{is} = (1+e^{-x_i'\beta_s})^{-1}\prod_{j<s}(1+e^{x_i'\beta_j})^{-1},$$

which clearly tends to impose fewer geometric constraints on the classification decision boundaries of a category with a smaller $s$. For example, $p_{i1}$ is larger than $1/2$ if $x_i'\beta_1>0$, while $p_{i2}$ can be larger than $1/2$ only if both $x_i'\beta_1<0$ and $x_i'\beta_2>0$. We will use an example to illustrate this type of geometric constraint in Section 6.1.

Under the logistic stick-breaking construction, not only could the performance be sensitive to how the different categories are ordered, but the imposed geometric constraints could also be overly restrictive even if the categories are appropriately ordered. Below we address the first issue by introducing a permuted and augmented stick-breaking representation for a multinomial model, and the second issue by adding the ability to model nonlinearity.

## 3 Permuted and Augmented Stick Breaking

To turn the seemingly undesirable sensitivity of the stick-breaking construction to label permutation into a favorable model property when label asymmetry is desired, and to mitigate performance degradation when label symmetry is desired, we introduce a permuted and augmented stick-breaking (paSB) construction for a multinomial distribution, making it straightforward to extend an arbitrary binary classifier with cross entropy loss into a Bayesian multinomial one. The paSB construction infers a one-to-one mapping between the labels of the categories and the indices of the latent sticks, transforming the problem from modeling a multinomial random variable into modeling conditionally independent binary ones. It not only allows for parallel computation within each MCMC iteration, but also improves the mixing of MCMC in comparison to the one used in Polson et al. (2013), which updates one regression-coefficient vector conditioning on all the others, as will be shown in Section 6.5. Note that the number of distinct one-to-one label-stick mappings is $S!$, which quickly becomes too large to exhaustively search for the best mapping as $S$ increases. Our experiments will show that the proposed MCMC algorithm can quickly escape from a purposely poorly initialized mapping and subsequently switch between many different mappings that all lead to similar performance, suggesting an effective search space that is considerably smaller than $S!$.

### 3.1 Category-Stick Mapping and Data Augmentation

The proposed paSB construction randomly maps a category to one and only one of the $S$ latent sticks and makes the augmented Bernoulli random variables $\{b_{iz_s}\}$ conditionally independent of each other given $\{\pi_{iz_s}\}$. Denote $z=(z_1,\ldots,z_S)$ as a permutation of $(1,\ldots,S)$, where $z_s$ is the index of the stick that category $s$ is mapped to. Given the label-stick mapping $z$, let us denote $p_{is}(z)$ as the multinomial probability of category $s$, and $\pi_{iz_s}(x_i,\beta_s)$ as the covariate-dependent stick probability that is associated with the covariates of observation $i$ and the stick that category $s$ is mapped to. For notational convenience, we will often drop these arguments and write $p_{is}(z)$ as $p_{is}$ and $\pi_{iz_s}(x_i,\beta_s)$ as $\pi_{iz_s}$. We emphasize that here the $s$th regression-coefficient vector $\beta_s$ is always associated with both category $s$ and the corresponding stick probability $\pi_{iz_s}$, a construction that will facilitate the inference of the label-stick mapping $z$. The following theorem shows how to generate a categorical random variable of $S$ categories with a set of conditionally independent Bernoulli random variables. This is key to transforming the problem from solving multinomial regression into solving $S$ binary regressions independently.

###### Theorem 1

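Suppose $y_i \sim \sum_{s=1}^{S} p_{is}(z)\,\delta_s$, where $[p_{i1}(z),\ldots,p_{iS}(z)]$ is a multinomial probability vector whose elements are constructed as

$$p_{is}(z) = (\pi_{iz_s})^{\mathbf{1}(z_s\neq S)} \prod_{j:\,z_j<z_s} (1-\pi_{iz_j}), \tag{4}$$

then $y_i$ can be equivalently generated under the permuted and augmented stick-breaking (paSB) construction as

$$y_i \sim \sum_{s=1}^{S} \Big\{ [\mathbf{1}(b_{iz_s}=1)]^{\mathbf{1}(z_s\neq S)} \prod_{j:\,z_j<z_s} \mathbf{1}(b_{iz_j}=0) \Big\}\,\delta_s, \tag{5}$$

$$b_{iz_s} \sim \text{Bernoulli}(\pi_{iz_s}), \quad s\in\{1,\ldots,S\}. \tag{6}$$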

Distinct from the conventional stick breaking in (2) that maps category $s$ to stick $s$ and makes $b_{is}$ depend on $\{b_{ij}\}_{j<s}$, under the new construction in (5)-(6), the $S$ categories are now randomly permuted and then one-to-one mapped to $S$ sticks, and the augmented binary random variables $\{b_{iz_s}\}$ become mutually independent given $\{\pi_{iz_s}\}$. Given $y_i$, we still have $b_{iz_s}=0$ for $z_s<z_{y_i}$ and $b_{iz_{y_i}}=1$ a.s., but impose no restriction on any $b_{iz_s}$ for $z_s>z_{y_i}$, whose conditional posteriors given $y_i$ and $\{\pi_{iz_s}\}$ remain the same as their priors. These changes are key to appropriately ordering the latent sticks, more flexibly parameterizing $\pi_{iz_s}$ and hence $p_{is}$, and maintaining tractable inference.
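The equivalence stated in Theorem 1 can be verified exactly for a small number of categories by enumerating all configurations of the independent stick variables. The following sketch is an illustration under assumed conventions (categories are 0-based list positions, `z[s]` is the 1-based stick index of category `s`, and `pi[s]` is the stick probability tied to category `s`); the function names are hypothetical.

```python
from itertools import product

def pasb_probs(pi, z):
    """Closed-form paSB category probabilities given stick probabilities and mapping."""
    S = len(pi)
    probs = []
    for s in range(S):
        p = pi[s] if z[s] != S else 1.0   # the reference stick S contributes no success term
        for j in range(S):
            if z[j] < z[s]:               # every earlier stick must have failed
                p *= 1.0 - pi[j]
        probs.append(p)
    return probs

def pasb_probs_by_enumeration(pi, z):
    """Same probabilities, obtained by summing over independent Bernoulli draws."""
    S = len(pi)
    probs = [0.0] * S
    for b in product([0, 1], repeat=S):   # b[s] is the stick variable of category s
        weight = 1.0
        for s in range(S):
            weight *= pi[s] if b[s] else 1.0 - pi[s]
        for s in range(S):
            chosen = (b[s] == 1 or z[s] == S) and all(
                b[j] == 0 for j in range(S) if z[j] < z[s])
            if chosen:
                probs[s] += weight
    return probs

pi = [0.3, 0.6, 0.5]
z = [2, 3, 1]                             # category 2 sits on stick 1, category 1 on stick S
closed_form = pasb_probs(pi, z)
enumerated = pasb_probs_by_enumeration(pi, z)
```

For every configuration of the stick variables exactly one category is selected, so the enumerated probabilities sum to one and match the closed form.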

With paSB, the problem of inferring the functional relationship between the categorical response and the corresponding covariates is now transformed into the problem of modeling conditionally independent binary regressions as

$$b_{iz_s} \mid x_i,\beta_s \sim \text{Bernoulli}[\pi_{iz_s}(x_i,\beta_s)], \quad i=1,\ldots,N,\ \ s=1,\ldots,S.$$

Note that the only requirement for the binary regression model under paSB is that it uses the Bernoulli likelihood. In other words, it uses the cross entropy loss (Murphy, 2012) as

$$-\sum_{i=1}^{N}\ln P(b_{iz_s}\mid x_i,\beta_s) = \sum_{i=1}^{N}\Big\{-b_{iz_s}\ln \pi_{iz_s}(x_i,\beta_s) - (1-b_{iz_s})\ln[1-\pi_{iz_s}(x_i,\beta_s)]\Big\}.$$

A basic choice is paSB logistic regression that lets

$$\pi_{iz_s}(x_i,\beta_s) = 1/(1+e^{-x_i'\beta_s}),$$

which becomes the same as the logistic stick-breaking construction described in Section 2.3 if $z_s=s$ for all $s$. Another choice is paSB-robit regression that extends robit regression of Liu (2004), a robust binary classifier using cross entropy loss, into a robust Bayesian multinomial classifier. In robit regression, observation $i$ is labeled as 1 if $x_i'\beta+\varepsilon_i>0$ and as 0 otherwise, where $\varepsilon_i$ are independently drawn from a $t$-distribution with $\kappa$ degrees of freedom, denoted as $t_\kappa$. Consequently, the conditional class probability function of robit regression is $F_{t,\kappa}(x_i'\beta)$, where $F_{t,\kappa}$ is the cumulative distribution function of $t_\kappa$. The robustness is attributed to the heavy-tail property of $t_\kappa$, which, for sufficiently small degrees of freedom, imposes less penalty than the conditional class probability function of logistic regression does on misclassified observations that are far from the decision boundary. Applying Theorem 1, the category probability of paSB-robit regression with $\kappa$ degrees of freedom is shown in (4), where $\pi_{iz_s}=F_{t,\kappa}(x_i'\beta_s)$. The paSB-robit regression provides a simple solution to robust multiclass classification; with $\{b_{iz_s}\}$ defined in Theorem 1, we run $S$ independent binary robit regressions using the Gibbs sampler proposed in Liu (2004).

In addition to paSB, we define permuted and augmented reverse stick breaking (parSB) in the following Corollary.

###### Corollary 2

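Suppose $y_i \sim \sum_{s=1}^{S} p_{is}(z)\,\delta_s$ and

$$p_{is}(z) = (1-\pi_{iz_s})^{\mathbf{1}(z_s\neq S)} \prod_{j:\,z_j<z_s} \pi_{iz_j},$$

then $y_i$ can also be generated under the permuted and augmented reverse stick-breaking (parSB) representation as

$$y_i \sim \sum_{s=1}^{S} \Big\{ [\mathbf{1}(b_{iz_s}=0)]^{\mathbf{1}(z_s\neq S)} \prod_{j:\,z_j<z_s} \mathbf{1}(b_{iz_j}=1) \Big\}\,\delta_s, \quad b_{iz_s} \sim \text{Bernoulli}(\pi_{iz_s}). \tag{7}$$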

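Generally speaking, if $1-\pi_{iz_s}(x_i,\beta_s)=\pi_{iz_s}(x_i,-\beta_s)$, so that reversing the roles of success and failure amounts to flipping the sign of $\beta_s$, then there is no need to introduce parSB as an addition to paSB. This is the case for logistic stick breaking and robit stick breaking, where $\pi_{iz_s}$ are defined as $1/(1+e^{-x_i'\beta_s})$ and $F_{t,\kappa}(x_i'\beta_s)$, respectively, and for the Bayesian multinomial SVMs to be discussed in Section 5. Otherwise, there are potential benefits, such as for the softplus regressions to be introduced in Section 4, in combining parSB with paSB.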

### 3.2 Inference of Stick Variables and Category-Stick Mapping

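Below we first describe Gibbs sampling for the augmented stick variables $\{b_{ij}\}$, and then introduce a Metropolis-Hastings (MH) step to infer the category-stick mapping $z$. Given the category label $y_i$, stick probability $\pi_{ij}$, and $z$, we sample $b_{ij}$ as

$$(b_{ij}\mid y_i,\pi_{ij},z) \sim \mathbf{1}(j=z_{y_i}) + \mathbf{1}(j>z_{y_i})\,\text{Bernoulli}(\pi_{ij}),$$

for $j=1,\ldots,S-1$, and let

$$b_{iS} = \mathbf{1}(z_{y_i}=S).$$

This means we let $b_{ij}=0$ if $j<z_{y_i}$, let $b_{ij}=1$ if $j=z_{y_i}$, draw $b_{ij}$ from $\text{Bernoulli}(\pi_{ij})$ if $j>z_{y_i}$, and let $b_{iS}=1$ if and only if $z_{y_i}=S$. Note that stick $S$ is used as a reference stick and is not used in defining $p_{is}$ in (4). Despite having no impact on computing $p_{is}$, we infer $\pi_{iS}$ (i.e., sample the regression-coefficient vector of the category mapped to stick $S$) under the Bernoulli likelihood of $\{b_{iS}\}_i$ and use it in a Metropolis-Hastings step, as described in (8) shown below, to decide whether to switch the mappings of two different categories when one of them is mapped to the reference stick $S$. Once we have an MCMC sample of $\{b_{ij}\}$, we then essentially solve $S$ independent binary classification problems, the $s$th of which can be expressed as

$$b_{iz_s}\mid x_i,\beta_s \sim \text{Bernoulli}[\pi_{iz_s}(x_i,\beta_s)], \quad i=1,\ldots,N.$$

Analogously, for parSB, $b_{ij}$ can be sampled as $(b_{ij}\mid y_i,\pi_{ij},z) \sim \mathbf{1}(j<z_{y_i}) + \mathbf{1}(j>z_{y_i})\,\text{Bernoulli}(\pi_{ij})$ for $j=1,\ldots,S-1$, and $b_{iS}=\mathbf{1}(z_{y_i}\neq S)$, which means we let $b_{ij}=1$ if $j<z_{y_i}$, let $b_{ij}=0$ if $j=z_{y_i}$, draw $b_{ij}$ from $\text{Bernoulli}(\pi_{ij})$ if $j>z_{y_i}$, and let $b_{iS}=0$ if and only if $z_{y_i}=S$.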
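The paSB sampling rule for the stick variables of one observation can be sketched as follows. This is an illustrative sketch only: the function name, the 0-based category labels, and the convention that `pi_sticks[j-1]` holds the probability of stick `j` are assumptions made here, not notation from the paper.

```python
import random

def sample_stick_variables(y, z, pi_sticks, rng):
    """Draw (b_1, ..., b_S) for one observation under paSB.

    y         : observed category (0-based)
    z         : z[s] is the 1-based stick index that category s is mapped to
    pi_sticks : pi_sticks[j-1] is the success probability of stick j
    rng       : a random.Random instance
    """
    S = len(z)
    target = z[y]                       # stick that the observed category sits on
    b = []
    for j in range(1, S):               # sticks 1, ..., S-1
        if j < target:
            b.append(0)                 # earlier sticks must have failed
        elif j == target:
            b.append(1)                 # the observed category's stick succeeded
        else:
            b.append(1 if rng.random() < pi_sticks[j - 1] else 0)  # unconstrained
    b.append(1 if target == S else 0)   # reference stick S
    return b

rng = random.Random(0)
z = [2, 3, 1]
pi_sticks = [0.5, 0.3, 0.6]
b_first = sample_stick_variables(0, z, pi_sticks, rng)   # category 0 is on stick 2
b_last = sample_stick_variables(1, z, pi_sticks, rng)    # category 1 is on stick S
b_free = sample_stick_variables(2, z, pi_sticks, rng)    # category 2 is on stick 1
```

Only the sticks after the observed category's stick involve randomness, which is what keeps the augmented variables conditionally independent across sticks.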

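Since stick-breaking multinomial classification is not invariant to the permutation of its class labels, it may perform substantially worse than it otherwise could if the inherent geometric constraints implied by the current ordering of the labels make it difficult to adapt the decision boundaries to the data. Our solution to this problem is to infer the one-to-one mapping between the category labels and stick indices from the data. We construct a Metropolis-Hastings (MH) step within each Gibbs sampling iteration, with a proposal of switching the two sticks that categories $c$ and $c'$, $1\le c<c'\le S$, are mapped to, by changing the current category-stick one-to-one mapping from $z$ to $z'$, where $z'$ equals $z$ except that $z'_c=z_{c'}$ and $z'_{c'}=z_c$. Assuming a uniform prior on $z$ and proposing $(c,c')$ uniformly at random from one of the $S(S-1)/2$ possibilities, we would accept the proposal with probability

$$\min\left\{\prod_i \frac{\prod_{s=1}^{S}[p_{is}(z')]^{\mathbf{1}(y_i=s)}}{\prod_{s=1}^{S}[p_{is}(z)]^{\mathbf{1}(y_i=s)}},\ 1\right\} = \min\left\{\prod_i \frac{\prod_{s=1}^{S}\Big[(\pi_{iz'_s})^{\mathbf{1}(z'_s\neq S)}\prod_{j:\,z'_j<z'_s}(1-\pi_{iz'_j})\Big]^{\mathbf{1}(y_i=s)}}{\prod_{s=1}^{S}\Big[(\pi_{iz_s})^{\mathbf{1}(z_s\neq S)}\prod_{j:\,z_j<z_s}(1-\pi_{iz_j})\Big]^{\mathbf{1}(y_i=s)}},\ 1\right\}. \tag{8}$$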
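A minimal sketch of this swap move is given below, assuming a uniform prior on the mapping and a symmetric proposal so that the acceptance probability reduces to a likelihood ratio. The function names and data layout are hypothetical: `pi[i][s]` denotes the stick success probability of observation `i` computed with the coefficients of category `s`, which stays attached to the category when its stick changes.

```python
import math
import random

def pasb_log_likelihood(y, pi, z):
    """Sum over observations of the log paSB probability of the observed category."""
    S = len(z)
    total = 0.0
    for y_i, pi_i in zip(y, pi):
        if z[y_i] != S:                       # the reference stick has no success term
            total += math.log(pi_i[y_i])
        for j in range(S):
            if z[j] < z[y_i]:                 # failures of all earlier sticks
                total += math.log(1.0 - pi_i[j])
    return total

def mh_swap_step(y, pi, z, rng):
    """Propose swapping the sticks of two random categories; return the new mapping."""
    c, c_prime = rng.sample(range(len(z)), 2)
    proposal = list(z)
    proposal[c], proposal[c_prime] = proposal[c_prime], proposal[c]
    log_ratio = pasb_log_likelihood(y, pi, proposal) - pasb_log_likelihood(y, pi, z)
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        return proposal
    return list(z)

# Toy data (hypothetical): all observations are in category 0, whose own stick
# probability is low, so moving category 0 to the reference stick fits better.
y = [0, 0, 0]
pi = [[0.1, 0.5]] * 3
z_new = mh_swap_step(y, pi, [1, 2], random.Random(1))
```

Working on the log scale avoids underflow when the product in the acceptance ratio runs over many observations.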

### 3.3 Sequential Decision Making

Random utility models, including both the logit and probit models as special examples, are widely used to infer the functional relationship between a categorical response variable and its covariates. For discrete choice analysis in econometrics (Hanemann, 1984; Greene, 2003; Train, 2009), these models assume that among a set of $S$ alternatives, an individual makes the choice that maximizes his/her utility $U_{is}=V_{is}+\varepsilon_{is}$, where $V_{is}$ and $\varepsilon_{is}$ represent the observable and unobservable parts of $U_{is}$, respectively. If $V_{is}$ is set as $x_i'\beta_s$, then marginalizing out $\varepsilon_{is}$ leads to MLR if all $\varepsilon_{is}$ follow the extreme value distribution (McFadden, 1973; Greene, 2003; Train, 2009), and to multinomial probit regression if they follow a multivariate normal distribution (Albert and Chib, 1993; McCulloch and Rossi, 1994; McCulloch et al., 2000; Imai and van Dyk, 2005).

Instead of examining the utilities of all choices before making the decision, the paSB construction is characterized by a sequential decision making process, described as follows. In step one, an individual decides whether to select the choice mapped to stick 1, or to select a choice among the remaining alternatives, i.e., the choices mapped to sticks $2,\ldots,S$. If the individual selects the choice mapped to stick 1, then the sequential process is terminated. Otherwise this choice is eliminated and the individual proceeds to step two, in which he/she would follow the same procedure to either select the choice mapped to stick 2 or proceed to the next step to select a choice among the remaining alternatives, i.e., the choices mapped to sticks $3,\ldots,S$. The individual, reconsidering none of the eliminated choices, will keep making a one-vs-remaining decision at each step until the termination of the sequential decision making process.

This unique sequential decision making procedure relaxes the independence of irrelevant alternatives (IIA) assumption, as described in the following Lemma.

###### Lemma 3

Under the paSB construction, the probability ratio of two choices is influenced by the success probabilities of the sticks that lie between these two choices’ corresponding sticks. In other words, the probability ratio of two choices will be influenced by some other choices if they are not mapped to adjacent sticks.

As stated in Lemma 3, the paSB construction can adjust how two choices’ probability ratio depends on the other alternatives by controlling the distance between the two sticks that they are mapped to, and hence provides a unique way to relax the IIA assumption. While the widely used MLR can be considered as a random-utility-maximization model with the IIA assumption, the paSB multinomial logistic model performs sequential random utility maximization that relaxes this assumption, as described in Lemma 5 in the Appendix.
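A small numerical check makes the statement of Lemma 3 concrete. In the sketch below (an illustration with hypothetical names and toy stick probabilities, using the identity mapping over four categories), changing the success probability of stick 2 alters the probability ratio between the choices on sticks 1 and 3, whereas changing stick 3 leaves the ratio between the adjacent sticks 1 and 2 untouched.

```python
def pasb_probs(pi, z):
    """paSB category probabilities; z[s] is the 1-based stick of category s."""
    S = len(pi)
    probs = []
    for s in range(S):
        p = pi[s] if z[s] != S else 1.0
        for j in range(S):
            if z[j] < z[s]:
                p *= 1.0 - pi[j]
        probs.append(p)
    return probs

identity = [1, 2, 3, 4]
base = pasb_probs([0.3, 0.4, 0.5, 0.6], identity)
stick2_changed = pasb_probs([0.3, 0.7, 0.5, 0.6], identity)   # only stick 2 differs
stick3_changed = pasb_probs([0.3, 0.4, 0.9, 0.6], identity)   # only stick 3 differs

# Non-adjacent sticks 1 and 3: the ratio depends on the stick in between.
ratio_13_base = base[0] / base[2]
ratio_13_changed = stick2_changed[0] / stick2_changed[2]

# Adjacent sticks 1 and 2: the ratio ignores what happens on stick 3.
ratio_12_base = base[0] / base[1]
ratio_12_changed = stick3_changed[0] / stick3_changed[1]
```

Under MLR, by contrast, the ratio of any two category probabilities depends only on the coefficients of those two categories.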

## 4 Bayesian Multinomial Softplus Regression

Logistic regression is a cross-entropy-loss binary classifier that can be straightforwardly extended to paSB multinomial logistic regression (paSB-MLR). However, it is a linear classifier that uses a single hyperplane to separate one class from the other. To introduce nonlinear classification decision boundaries, we consider extending softplus regression of Zhou (2016), a multi-hyperplane binary classifier that uses the cross entropy loss, into multinomial softplus regression (MSR) under paSB.

Softplus regression uses the interaction of multiple hyperplanes to construct a union of convex-polytope-like confined spaces to enclose the data labeled as “1,” which are hence separated from the data labeled as “0.” It is constructed under a Bernoulli-Poisson link (Zhou, 2015) that thresholds at one a latent Poisson count, with the distribution of the Poisson rate defined as the convolution of the probability density functions of $K$ experts, each of which corresponds to the stack of $T$ gamma distributions with covariate-dependent scale parameters. The number of experts $K$ and the number of layers $T$ can be considered as the two model parameters that determine the nonlinear capacity of the model. More specifically, for expert $k$, denoting $r_k$ as its weight and $\beta_k^{(t+1)}$ as its $t$th regression-coefficient vector, the conditional class probability can be expressed as

$$P\big(y_i=1 \mid x_i,\{r_k,\{\beta_k^{(t+1)}\}_{1,T}\}_{1,K}\big) = 1-\prod_{k=1}^{K}(1-p_{ik}), \qquad p_{ik} = 1-\Big(1+e^{x_i'\beta_k^{(T+1)}}\ln\Big\{1+e^{x_i'\beta_k^{(T)}}\ln\Big[1+\ldots\ln\big(1+e^{x_i'\beta_k^{(2)}}\big)\Big]\Big\}\Big)^{-r_k};$$

when $K=T=1$, the conditional class probability reduces to

$$P(y_i=1 \mid x_i,r,\beta) = 1-\Big(\frac{1}{1+e^{x_i'\beta}}\Big)^{r};$$

and when additionally $r=1$, it becomes the same as that of binary logistic regression. Note that a gamma process, a random draw from which can be expressed as a countably infinite weighted sum of atoms, can be used to support a potentially countably infinite number of experts for softplus regression. For this reason, one can set $K$ as large as permitted by computation and rely on the gamma process’s inherent shrinkage mechanism to turn off unneeded model capacity (not all experts will be used if $K$ is set to be sufficiently large).
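The conditional class probability of softplus regression can be evaluated with a few lines of code. The sketch below is illustrative only (function names and the layout of `experts` are assumptions): each expert is a pair of a weight and a list of its coefficient vectors ordered from the innermost layer to the outermost one, and the nested logarithms are accumulated from the inside out.

```python
import math

def dot(x, beta):
    """Inner product x'beta for two equal-length lists."""
    return sum(xj * bj for xj, bj in zip(x, beta))

def nested_softplus(values):
    """ln(1 + e^{v_T} ln{1 + e^{v_{T-1}} ln[1 + ... ln(1 + e^{v_1})]}), values = [v_1, ..., v_T]."""
    out = math.log1p(math.exp(values[0]))
    for v in values[1:]:
        out = math.log1p(math.exp(v) * out)
    return out

def softplus_regression_prob(x, experts):
    """P(y = 1 | x) = 1 - prod_k (1 - p_k), with experts = [(r_k, [beta^(2), ..., beta^(T+1)]), ...]."""
    log_fail = 0.0
    for r_k, betas in experts:
        values = [dot(x, beta) for beta in betas]
        log_fail -= r_k * nested_softplus(values)   # log of (1 - p_k)
    return 1.0 - math.exp(log_fail)

x = [1.0, 0.5]
beta = [0.2, -0.4]                                   # x'beta = 0 for this toy input
p_logistic = softplus_regression_prob(x, [(1.0, [beta])])       # K = T = 1, r = 1
p_weighted = softplus_regression_prob(x, [(2.0, [beta])])       # K = T = 1, r = 2
p_stacked = softplus_regression_prob(x, [(1.0, [beta, beta])])  # K = 1, T = 2
```

With a single expert, a single layer, and unit weight, the output coincides with the logistic sigmoid of `x'beta`, matching the special case noted in the text.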

### 4.1 paSB and parSB Extensions of Softplus Regressions

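We first follow Zhou (2016) to define

$$\varsigma(x_1,\ldots,x_t) = \ln\Big(1+e^{x_t}\ln\Big\{1+e^{x_{t-1}}\ln\Big[1+\ldots\ln\big(1+e^{x_1}\big)\Big]\Big\}\Big)$$

as the stack-softplus function. Note that if $t=1$, the stack-softplus function reduces to the softplus function $\varsigma(x)=\ln(1+e^{x})$, which is often considered as a smoothed version of the rectifier function, expressed as $\max(0,x)$, that has become the dominant nonlinear activation function for deep neural networks (Nair and Hinton, 2010; Glorot et al., 2011; Krizhevsky et al., 2012; LeCun et al., 2015). We then parameterize $\lambda_{iz_s}$, the negative logarithm of the failure probability of the stick that category $s$ is mapped to, as

$$\lambda_{iz_s} = \sum_{k=1}^{\infty} r_{sk}\,\varsigma\big(x_i'\beta_{sk}^{(2)},\ldots,x_i'\beta_{sk}^{(T+1)}\big), \tag{9}$$

where the countably infinite atoms $\{\beta_{sk}^{(2)},\ldots,\beta_{sk}^{(T+1)}\}_k$ and their weights $\{r_{sk}\}_k$ constitute a draw from a gamma process (Ferguson, 1973), defined by a finite and continuous base distribution over a complete separable metric space and a scale parameter. In other words, we let $\pi_{iz_s}=1-e^{-\lambda_{iz_s}}$, or

$$b_{iz_s} \sim \text{Bernoulli}\Big[1-\prod_{k=1}^{\infty}\Big(1+e^{x_i'\beta_{sk}^{(T+1)}}\ln\Big\{1+e^{x_i'\beta_{sk}^{(T)}}\ln\Big[1+\ldots\ln\big(1+e^{x_i'\beta_{sk}^{(2)}}\big)\Big]\Big\}\Big)^{-r_{sk}}\Big]. \tag{10}$$

As shown in Theorem 10 of Zhou (2016), $b_{iz_s}$ can be equivalently generated from a hierarchical model that convolves countably infinite stacked gamma distributions, with covariate-dependent scales, as

$$\begin{aligned}
&\theta_{isk}^{(T)} \sim \text{Gamma}\big(r_{sk},\, e^{x_i'\beta_{sk}^{(T+1)}}\big),\\
&\qquad\cdots\\
&\theta_{isk}^{(t)} \sim \text{Gamma}\big(\theta_{isk}^{(t+1)},\, e^{x_i'\beta_{sk}^{(t+1)}}\big),\\
&\qquad\cdots\\
&\theta_{isk}^{(1)} \sim \text{Gamma}\big(\theta_{isk}^{(2)},\, e^{x_i'\beta_{sk}^{(2)}}\big),\\
&b_{iz_s} = \mathbf{1}(m_{is}\ge 1),\quad m_{is} = \sum_{k=1}^{\infty} m_{isk}^{(1)},\quad m_{isk}^{(1)} \sim \text{Pois}\big(\theta_{isk}^{(1)}\big),
\end{aligned} \tag{11}$$

the marginalization of whose latent variables leads to (10). Note that the gamma distribution $\theta\sim\text{Gamma}(a,b)$ is parameterized by its shape $a$ and scale $b$, such that $\mathbb{E}[\theta]=ab$ and $\text{var}[\theta]=ab^2$, and the hierarchical structure in (11) can also be related to the augmentable gamma belief network proposed in Zhou et al. (2016). We consider the combination of (11) and either paSB in (5) or parSB in (7) as the Bayesian nonparametric hierarchical model for multinomial softplus regression (MSR) that is defined below.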

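[Multinomial Softplus Regression] With a draw from a gamma process for each category that consists of countably infinite atoms $\{\beta_{sk}^{(2)},\ldots,\beta_{sk}^{(T+1)}\}$ with weights $r_{sk}>0$, given the covariate vector $x_i$ and category-stick mapping $z$, MSR parameterizes $p_{is}(z)$, the multinomial probability of category $s$, under the paSB construction as

$$p_{is}(z) = \Big[1-\prod_{k=1}^{\infty}\Big(1+e^{x_i'\beta_{sk}^{(T+1)}}\ln\Big\{1+e^{x_i'\beta_{sk}^{(T)}}\ln\Big[1+\ldots\ln\big(1+e^{x_i'\beta_{sk}^{(2)}}\big)\Big]\Big\}\Big)^{-r_{sk}}\Big]^{\mathbf{1}(z_s\neq S)} \times \prod_{j:\,z_j<z_s}\ \prod_{k=1}^{\infty}\Big(1+e^{x_i'\beta_{jk}^{(T+1)}}\ln\Big\{1+e^{x_i'\beta_{jk}^{(T)}}\ln\Big[1+\ldots\ln\big(1+e^{x_i'\beta_{jk}^{(2)}}\big)\Big]\Big\}\Big)^{-r_{jk}},$$

and parameterizes $p_{is}(z)$ under the parSB construction as

$$p_{is}(z) = \Big[\prod_{k=1}^{\infty}\Big(1+e^{x_i'\beta_{sk}^{(T+1)}}\ln\Big\{1+e^{x_i'\beta_{sk}^{(T)}}\ln\Big[1+\ldots\ln\big(1+e^{x_i'\beta_{sk}^{(2)}}\big)\Big]\Big\}\Big)^{-r_{sk}}\Big]^{\mathbf{1}(z_s\neq S)} \times \prod_{j:\,z_j<z_s}\Big[1-\prod_{k=1}^{\infty}\Big(1+e^{x_i'\beta_{jk}^{(T+1)}}\ln\Big\{1+e^{x_i'\beta_{jk}^{(T)}}\ln\Big[1+\ldots\ln\big(1+e^{x_i'\beta_{jk}^{(2)}}\big)\Big]\Big\}\Big)^{-r_{jk}}\Big].$$

For the convenience of implementation, we truncate the number of atoms of the gamma process at $K$ by choosing a discrete base measure for each category, under which the weight $r_{sk}$ of expert $k$ in category $s$ has a gamma prior distribution. For each category, we expect only some of its experts to have non-negligible weights if $K$ is set large enough, and we may use the latent counts $m_{isk}^{(1)}$ defined in (11) to measure the number of active experts inferred from the data.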

### 4.2 Geometric Constraints for MSR

Since by definition we have $p_{is}(z)=(\pi_{iz_s})^{\mathbf{1}(z_s\neq S)}\prod_{j:\,z_j<z_s}(1-\pi_{iz_j})$ in MSR under paSB, it is clear that if $\pi_{iz_j}$ for all $z_j<z_s$ are small and $\pi_{iz_s}$ is the first one to have a large probability value close to one, $x_i$ will likely be assigned to category $s$ regardless of how large the values of $\pi_{iz_j}$ for $z_j>z_s$ are. To motivate the use of the seemingly over-parameterized sum-stack-softplus function in (9), we first consider the simplest case of $K=T=1$. Without loss of generality, let us assume that the category-stick mapping is fixed at $z=(1,2,\ldots,S)$.

###### Lemma 4

For paSB-MSR with and , the set of solutions to in the covariate space are bounded by a convex polytope defined by the intersection of linear hyperplanes.

Note that the binary softplus regression with is closely related to logistic regression, and reduces to logistic regression if (Zhou, 2016). With Lemma 4, it is clear that even if an optimal category-stick mapping is provided, paSB-MSR with may still clearly underperform MLR. This is because category uses a single hyperplane to separate itself from the remaining categories, and hence uses the interaction of at most hyperplanes to separate itself from the other categories. By contrast, MLR uses a convex polytope bounded by at most hyperplanes for each of the categories.

When and/or , an exact theoretical analysis is beyond the scope of this paper. Instead we provide some qualitative analysis by borrowing related geometric-constraint analysis for softplus regressions in Zhou (2016). Note that Equation (10) indicates that a noisy-or model (Pearl, 2014; Srinivas, 1993), commonly appearing in causal inference, is used at each step of the sequential one-vs-remaining decision process; at each step, the binary outcome of an observation is attributed to the disjunctive interaction of many possible hidden causes. Roughly speaking, to enclose category to separate it from the remaining categories in the covariate space, paSB-MSR with and uses the complement of a convex-polytope-bounded space, paSB-MSR with and uses a convex-polytope-like confined space, and paSB-MSR with both and uses a union of convex-polytope-like confined spaces. For parSB-MSR with , the interpretation is the same except a convex polytope in paSB will be replaced with the complement of a convex polytope, and vice versa. In contrast to SVMs using the kernel trick, MSRs using the original covariates might be more appealing in research areas, like biostatistics and sociology, where the interpretation of regression coefficients and investigation of causal relationships are of interest. In addition, we find that the classification capability of MSRs could be further enhanced with data transformation, as will be discussed in Section 6.4.

## 5 Bayesian Multinomial Support Vector Machine

Support vector machines (SVMs) are max-margin binary classifiers that typically minimize a regularized hinge loss objective function as

$$
l(\beta,\nu)=\sum_{i=1}^{N}\max(1-b_i x_i'\beta,\,0)+\nu R(\beta),
$$

where $b_i\in\{-1,1\}$ represents the binary label for the $i$th observation, $R(\beta)$ is a regularization function that is often set as the $L_1$ or $L_2$ norm of $\beta$, $\nu$ is a tuning parameter, and $x_i'$ is the $i$th row of the design matrix $X$. For linear SVMs, $x_i$ is the covariate vector of the $i$th observation, whereas for nonlinear SVMs, one typically sets the $j$th element of $x_i$ as the kernel distance between the covariate vector of the $i$th observation and the $j$th support vector. The decision boundary of a binary SVM is $\{x: x'\beta=0\}$ and an observation is assigned the label $b_i=\text{sign}(x_i'\beta)$, which means $b_i=1$ if $x_i'\beta\ge 0$ and $b_i=-1$ if $x_i'\beta<0$.
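For concreteness, the regularized hinge loss and the sign-based decision rule can be written in a few lines. The sketch below assumes labels in $\{-1,1\}$, an $L_1$ regularizer, and a tie at zero resolved toward the positive label; the function names and the toy data are hypothetical.

```python
def hinge_objective(X, b, beta, nu):
    """Sum of hinge losses max(1 - b_i x_i'beta, 0) plus nu times the L1 norm of beta."""
    loss = 0.0
    for x_i, b_i in zip(X, b):
        margin = b_i * sum(w * v for w, v in zip(beta, x_i))
        loss += max(1.0 - margin, 0.0)
    return loss + nu * sum(abs(w) for w in beta)


def predict_label(x, beta):
    """Assign +1 on or above the hyperplane x'beta = 0, and -1 below it."""
    return 1 if sum(w * v for w, v in zip(beta, x)) >= 0.0 else -1


X = [[1.0, 2.0], [1.0, -1.0]]   # toy design matrix, intercept in the first column
b = [1, -1]                     # toy labels
print(hinge_objective(X, b, [0.0, 1.0], 0.5))   # both margins are at least 1
print(hinge_objective(X, b, [0.0, 0.5], 0.5))   # the second point falls inside the margin
```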

### 5.1 Bayesian Binary SVMs

It is shown in Polson and Scott (2011) that the exponential of the negative of the hinge loss can be expressed as a location-scale mixture of normals as

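$$
L(b_i\,|\,x_i,\beta)=\exp[-2\max(1-b_i x_i'\beta,\,0)]=\int_0^{\infty}\frac{1}{\sqrt{2\pi\omega_i}}\exp\left[-\frac{1}{2}\frac{(1+\omega_i-b_i x_i'\beta)^2}{\omega_i}\right]d\omega_i.
$$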

Consequently, $L(b_i\,|\,x_i,\beta)$ can be regarded as a pseudo likelihood in the sense that it is unnormalized with respect to $b_i$. This location-scale normal mixture representation of the hinge loss allows developing closed-form Gibbs sampling update equations for the regression coefficients via data augmentation, as discussed in detail in Polson and Scott (2011) and further generalized in Henao et al. (2014) to construct nonlinear SVMs amenable to Bayesian inference. While data augmentation has made it feasible to develop Bayesian inference for SVMs, it has not addressed a common issue: SVMs predict deterministic class labels rather than class probabilities. For this reason, below we discuss how to allow SVMs to predict class probabilities while maintaining tractable Bayesian inference via data augmentation.
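The mixture identity is easy to verify numerically. The following sketch writes $u=b_i x_i'\beta$, substitutes $\omega_i=t^2$ to remove the $1/\sqrt{\omega_i}$ factor, and applies a plain midpoint rule. The integration limit, step count, and function names are arbitrary choices made for this illustration.

```python
import math


def hinge_pseudo_likelihood(u):
    """Closed form exp[-2 max(1 - u, 0)], with u = b_i * x_i'beta."""
    return math.exp(-2.0 * max(1.0 - u, 0.0))


def mixture_integral(u, upper=15.0, n_steps=30000):
    """Midpoint-rule approximation of the integral over omega in (0, infinity).

    With omega = t**2 the integrand becomes
    2 / sqrt(2 * pi) * exp(-(1 + t**2 - u)**2 / (2 * t**2)).
    """
    h = upper / n_steps
    total = 0.0
    for i in range(n_steps):
        t = (i + 0.5) * h
        total += math.exp(-((1.0 + t * t - u) ** 2) / (2.0 * t * t))
    return total * h * 2.0 / math.sqrt(2.0 * math.pi)


for u in (-1.0, 0.5, 1.0, 2.5):
    print(u, round(hinge_pseudo_likelihood(u), 6), round(mixture_integral(u), 6))
```

Because the integrand in $\omega_i$ is Gaussian in $\beta$ for fixed $\omega_i$, conditioning on $\omega_i$ is what makes conjugate updates for $\beta$ possible.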

Following Sollich (2002) and Mallick et al. (2005), by defining the joint distribution of $b_i$ and $x_i$ to be proportional to $\exp[-\max(1-b_i x_i'\beta,\,0)]$, one may define the conditional distribution of the binary label $b_i\in\{-1,1\}$ as

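$$
P(b_i\,|\,x_i,\beta)=
\begin{cases}
\dfrac{1}{1+e^{-2b_i x_i'\beta}}, & \text{for } |x_i'\beta|\le 1;\\[2ex]
\dfrac{1}{1+e^{-b_i[x_i'\beta+\text{sign}(x_i'\beta)]}}, & \text{for } |x_i'\beta|>1;
\end{cases}
\tag{12}
$$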

which defines a probabilistic inference model that has the same maximum a posteriori (MAP) solution as that of a binary SVM for a given data set. Note that for MAP inference, the penalty term of the regularized hinge loss can be related to a corresponding prior distribution imposed on $\beta$, such as Gaussian, Laplace, and spike-and-slab priors (Polson and Scott, 2011).
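A short numerical check helps in reading (12). The sketch below implements the piecewise formula and compares it with the result of normalizing $\exp[-\max(1-b\,x'\beta,\,0)]$ over the two labels $b\in\{-1,1\}$; the two agree on both branches. The function names are hypothetical, and $u$ stands for $x_i'\beta$.

```python
import math


def svm_label_prob(b, u):
    """P(b | x, beta) from (12), with b in {-1, +1} and u = x'beta."""
    if abs(u) <= 1.0:
        return 1.0 / (1.0 + math.exp(-2.0 * b * u))
    return 1.0 / (1.0 + math.exp(-b * (u + math.copysign(1.0, u))))


def normalized_hinge_prob(b, u):
    """exp(-hinge loss) for label b, normalized over the two possible labels."""
    def unnormalized(label):
        return math.exp(-max(1.0 - label * u, 0.0))
    return unnormalized(b) / (unnormalized(1) + unnormalized(-1))


for u in (-2.0, -0.5, 0.0, 0.5, 1.0, 2.0):
    print(u, round(svm_label_prob(1, u), 6), round(normalized_hinge_prob(1, u), 6))
```

The probability is continuous at $|x_i'\beta|=1$, where both branches equal $1/(1+e^{-2b_i\,\text{sign}(x_i'\beta)})$.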

### 5.2 paSB Multinomial Support Vector Machine

Generalizing previous work in constructing Bayesian binary SVMs, we propose multinomial SVM (MSVM) under the paSB framework that is distinct from previously proposed MSVMs (Crammer and Singer, 2002; Lee et al., 2004; Liu and Yuan, 2011). A Bayesian MSVM that predicts class probabilities has also been proposed before in Zhang and Jordan (2006), which, however, does not have a data augmentation scheme to sample the regression coefficients in closed form, and consequently, relies on a random-walk Metropolis-Hastings procedure that may be difficult to tune.

Redefining the label sample space from $\{-1,1\}$ to $\{0,1\}$, we may rewrite (12) as $b_i\sim\text{Bernoulli}[\pi_{i,\text{svm}}(x_i,\beta)]$, where

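$$
\pi_{i,\text{svm}}(x_i,\beta)=
\begin{cases}
\dfrac{1}{1+e^{-2x_i'\beta}}, & \text{for } |x_i'\beta|\le 1;\\[2ex]
\dfrac{1}{1+e^{-x_i'\beta-\text{sign}(x_i'\beta)}}, & \text{for } |x_i'\beta|>1.
\end{cases}
\tag{13}
$$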

The Bernoulli-likelihood-based cross-entropy-loss binary classifier, whose covariate-dependent probabilities are parameterized as in (13), is exactly what we need to extend the binary SVM into a multinomial classifier under the paSB construction introduced in Theorem 1. More specifically, given the category-stick mapping $z$, with the success probability of the stick that category $s$ is mapped to parameterized as $\pi_{iz_s,\text{svm}}(x_i,\beta_s)$ and the binary stick variables drawn as $b_{iz_s}\sim\text{Bernoulli}[\pi_{iz_s,\text{svm}}(x_i,\beta_s)]$, we have the following definition.

[paSB multinomial SVM] Under the paSB construction, given the covariate vector $x_i$ and category-stick mapping $z$, the multinomial support vector machine (MSVM) parameterizes $p_{is}(z)$, the multinomial probability of category $s$, as

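$$
p_{is}(z)=\big[\pi_{iz_s,\text{svm}}(x_i,\beta_s)\big]^{\mathbf{1}(z_s\neq S)}\prod_{j:z_j<z_s}\big[1-\pi_{iz_j,\text{svm}}(x_i,\beta_j)\big].
$$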

Note that there is no need to introduce parSB-MSVM in addition to paSB-MSVM, since by definition, we have $1-\pi_{i,\text{svm}}(x_i,\beta)=\pi_{i,\text{svm}}(x_i,-\beta)$ for all $\beta$.
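The MSVM probabilities can be sketched by plugging (13) into the stick-breaking product. The example below is illustrative only: it takes the product over all categories $j$ with $z_j<z_s$, and it also checks numerically that negating the coefficients turns a stick success probability into the corresponding failure probability, a symmetry that follows directly from (13). The function names, coefficients, and covariates are hypothetical.

```python
import math


def pi_svm(u):
    """Stick success probability from (13), with u = x'beta."""
    if abs(u) <= 1.0:
        return 1.0 / (1.0 + math.exp(-2.0 * u))
    return 1.0 / (1.0 + math.exp(-u - math.copysign(1.0, u)))


def msvm_probs(x, betas, z):
    """paSB multinomial SVM probabilities for one observation.

    betas[s] holds the coefficients of category s and z[s] in {1, ..., S}
    is the position of the stick that category s is mapped to.
    """
    S = len(z)
    pi = [pi_svm(sum(w * v for w, v in zip(beta, x))) for beta in betas]
    probs = []
    for s in range(S):
        p = 1.0 if z[s] == S else pi[s]           # the last stick takes the remainder
        for j in range(S):
            if z[j] < z[s]:
                p *= 1.0 - pi[j]                  # earlier sticks must fail
        probs.append(p)
    return probs


x = [1.0, 0.4, -1.2]                              # hypothetical covariates, intercept first
betas = [[0.5, 1.0, -0.5], [-0.2, 0.3, 2.0], [0.0, -1.0, 0.1]]
probs = msvm_probs(x, betas, z=[1, 2, 3])
print([round(v, 4) for v in probs], round(sum(probs), 6))
print(all(abs(pi_svm(-u) - (1.0 - pi_svm(u))) < 1e-12 for u in (-3.0, -0.7, 0.0, 0.2, 1.0, 4.5)))
```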

## 6 Example Results

Constructed under the paSB framework, a multinomial regression model of $S$ categories is characterized by not only how the stick-specific binary classifiers with cross-entropy loss parameterize their covariate-dependent probability parameters, but also how its categories are one-to-one mapped to latent sticks. To investigate the unique properties of a paSB multinomial regression model, we will study the benefits of both inferring an appropriate mapping and increasing the modeling capacity of the underlying binary regression model. For illustration purposes, we will focus on multinomial softplus regression (MSR), whose capacity and complexity are both explicitly controlled by $K$ and $T$.

### 6.1 Influence of Binary Regression Model Capacity

We first consider the Iris data set with $S=3$ categories. We choose the sepal and petal lengths as the two-dimensional covariates to illustrate the performance of MSR under four different settings. We fix $z=(1,2,3)$, which means category $s$ is mapped to stick $s$ for all $s$, but choose different model capacities by varying $K$ and $T$.

Examining the relative 2D spatial locations of the observations, where the blue, black, and gray points are labeled as category 1, 2, and 3, respectively, one can imagine that setting $z=(2,1,3)$, which means mapping categories 2, 1, and 3 to the 1st, 2nd, and 3rd sticks, respectively, will already lead to excellent class separations for MSR with $K=T=1$, according to the analysis in Section 4.2 and as also confirmed by our experimental results (not shown for brevity). More specifically, with the 2nd, 1st, and 3rd categories mapped to the 1st, 2nd, and 3rd sticks, respectively, one can first use a single hyperplane to separate category 2 (black points) from both categories 1 (blue points) and 3 (gray points), and then use another hyperplane to separate category 1 (blue points) from category 3 (gray points).

However, when the mapping is fixed at $z=(1,2,3)$, as shown in the first row of Figure 1, MSR with $K=T=1$ performs poorly and fails to separate out category 1 (blue points) right at the beginning. This is not surprising since MSR with