Softplus Regressions and Convex Polytopes

08/23/2016
by   Mingyuan Zhou, et al.
The University of Texas at Austin

To construct flexible nonlinear predictive distributions, the paper introduces a family of softplus-function-based regression models that convolve, stack, or combine both operations by convolving countably infinite stacked gamma distributions, whose scales depend on the covariates. Generalizing logistic regression, which uses a single hyperplane to partition the covariate space into two halves, softplus regressions employ multiple hyperplanes to construct a confined space, related to a single convex polytope defined by the intersection of multiple half-spaces or to a union of multiple convex polytopes, to separate one class from the other. The gamma process is introduced to support the convolution of countably infinite (stacked) covariate-dependent gamma distributions. For Bayesian inference, Gibbs sampling derived via novel data augmentation and marginalization techniques is used to deconvolve and/or demix the highly complex nonlinear predictive distribution. Example results demonstrate that softplus regressions provide flexible nonlinear decision boundaries, achieving classification accuracies comparable to those of kernel support vector machines while requiring significantly less computation for out-of-sample prediction.


1 Introduction

Logistic and probit regressions that use a single hyperplane to partition the covariate space into two halves are widely used to model binary response variables given the covariates

(Cox and Snell, 1989; McCullagh and Nelder, 1989; Albert and Chib, 1993; Holmes and Held, 2006). They are easy to implement and simple to interpret, but neither of them is capable of producing nonlinear classification decision boundaries, and they may not provide a large margin to achieve accurate out-of-sample predictions. For two classes not well separated by a single hyperplane, rather than regressing a binary response variable directly on its covariates, it is common to select a subset of covariate vectors as support vectors, choose a nonlinear kernel function, and regress a binary response variable on the kernel distances between its covariate vector and these support vectors (Boser et al., 1992; Cortes and Vapnik, 1995; Vapnik, 1998; Schölkopf et al., 1999; Tipping, 2001).

Alternatively, one may construct a deep neural network to nonlinearly transform the covariates in a supervised manner, and then regress a binary response variable on its transformed covariates

(Hinton et al., 2006; LeCun et al., 2015; Bengio et al., 2015).

Both kernel learning and deep learning map the original covariates into a more linearly separable space, transforming a nonlinear classification problem into a linear one. In this paper, we propose a fundamentally different approach for nonlinear classification. Relying on neither the kernel trick nor a deep neural network to transform the covariate space, we construct a family of softplus regressions that exploit two distinct types of interactions between hyperplanes to define flexible nonlinear classification decision boundaries directly on the original covariate space. Since kernel learning based methods such as kernel support vector machines (SVMs) (Cortes and Vapnik, 1995; Vapnik, 1998) may scale poorly in that the number of support vectors often increases linearly in the size of the training dataset, they could be not only slow and memory-inefficient to train but also unappealing for making fast out-of-sample predictions (Steinwart, 2003; Wang et al., 2011).

One motivation of the paper is to investigate the potential of using a set of hyperplanes, whose number is determined by how the interactions of multiple hyperplanes can spatially separate the two classes in the covariate space rather than by the training data size, to construct nonlinear classifiers that can match the out-of-sample prediction accuracies of kernel SVMs, but potentially with much lower computational complexity. Another motivation of the paper is to increase the margin of the classifier, related to the discussion in

Kantchelian et al. (2014) that, for two classes that are linearly separable, even though a single hyperplane is sufficient to separate the two classes in the training dataset, using multiple hyperplanes to enclose one class may clearly increase the total margin of the classifier and hence improve the out-of-sample prediction accuracies.

The proposed construction exploits two distinct operations, convolution and stacking, on gamma distributions with covariate-dependent scale parameters. The convolution operation convolves differently parameterized probability distributions to increase representation power and enhance smoothness, while the stacking operation mixes a distribution in the stack with a distribution of the same family that is subsequently pushed into the stack. Depending on whether and how the convolution and stacking operations are used, the models in the family differ from each other in how they use softplus functions to construct highly nonlinear probability density functions, and in how they construct their hierarchical Bayesian models to arrive at these functions. In comparison to nonlinear classifiers built on kernels or deep neural networks, the proposed softplus regressions all share a distinct advantage in providing interpretable geometric constraints, which are related to either a single convex polytope or a union of convex polytopes

(Grünbaum, 2013), on the classification decision boundaries defined on the original covariate space. In addition, unlike kernel learning, whose number of support vectors often increases linearly with the size of the data (Steinwart, 2003), and unlike deep learning, which often requires careful tuning of both the structure of the deep network and the learning algorithm (Bengio et al., 2015)

, the proposed nonparametric Bayesian softplus regressions naturally provide probability estimates, automatically learn the complexity of the predictive distribution, and quantify model uncertainties with posterior samples.

The remainder of the paper is organized as follows. In Section 2, we define four different softplus regressions, present their underlying hierarchical models, and describe their distinct geometric constraints on how the covariate space is partitioned. In Section 3, we discuss Gibbs sampling via data augmentation and marginalization. In Section 4, we present experimental results on eight benchmark datasets for binary classification, making comparisons with five different classification algorithms. We conclude the paper in Section 5. We defer to the Supplementary Materials all the proofs, an accurate approximate sampler and some new properties for the Polya-Gamma distribution, the discussions on related work, and some additional example results.

2 Hierarchical Models and Geometric Constraints

2.1 Bernoulli-Poisson link and softplus function

To model a binary random variable, it is common to link it to a real-valued latent Gaussian random variable using either the logistic or probit links. Rather than following the convention, in this paper, we consider the Bernoulli-Poisson (BerPo) link

(Dunson and Herring, 2005; Zhou, 2015) to threshold a latent count at one to obtain a binary outcome $b \in \{0, 1\}$ as

(1) $b = \mathbf{1}(m \geq 1),\; m \sim \mathrm{Pois}(\lambda),$

where $\lambda > 0$, $m \in \{0, 1, 2, \ldots\}$, and $\mathbf{1}(\cdot)$ equals one if the condition inside the parentheses is satisfied and zero otherwise. The marginalization of the latent count $m$ from the BerPo link leads to $b \sim \mathrm{Bernoulli}(1 - e^{-\lambda})$.

The conditional distribution of $m$ given $b$ and $\lambda$ can be efficiently simulated using a rejection sampler (Zhou, 2015). Since its use in Zhou (2015)

to factorize the adjacency matrix of an undirected unweighted symmetric network, the BerPo link has been further extended for big binary tensor factorization

(Hu et al., 2015), multi-label learning (Rai et al., 2015), and deep Poisson factor analysis (Henao et al., 2015). This link has also been used by Caron and Fox (2015) and Todeschini and Caron (2016) for network analysis.
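To make the link concrete, below is a minimal sketch, assuming only NumPy; the function name berpo_sample and the value lam = 1.3 are illustrative rather than taken from the paper. It draws binary outcomes through the BerPo link and checks numerically that marginalizing the latent count recovers the Bernoulli probability $1 - e^{-\lambda}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def berpo_sample(lam, rng):
    """Draw b = 1(m >= 1) with m ~ Poisson(lam): the BerPo link."""
    m = rng.poisson(lam)
    return int(m >= 1)

lam = 1.3
draws = np.array([berpo_sample(lam, rng) for _ in range(200000)])
print(draws.mean())            # Monte Carlo estimate of P(b = 1)
print(1.0 - np.exp(-lam))      # closed-form marginal: 1 - exp(-lambda)
```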

We now refer to $\lambda = -\ln P(b = 0)$, the negative logarithm of the Bernoulli failure probability, as the BerPo rate for $b$, and simply denote (1) as $b \sim \mathrm{BerPo}(\lambda)$. It is instructive to notice that $1 - e^{-\ln(1 + e^{\psi})} = e^{\psi} / (1 + e^{\psi})$, and hence letting

(2) $\lambda = \ln(1 + e^{\psi})$

is equivalent to letting

(3) $b \sim \mathrm{Bernoulli}\big(e^{\psi} / (1 + e^{\psi})\big),$

where $\varsigma(\psi) := \ln(1 + e^{\psi})$ was referred to as the softplus function in Dugas et al. (2001).

It is interesting that the BerPo link appears to be naturally paired with the softplus function, which is often considered a smoothed version of the rectifier, or rectified linear unit, $\max(0, \psi)$, that is now widely used in deep neural networks, replacing other canonical nonlinear activation functions such as the sigmoid and hyperbolic tangent functions (Nair and Hinton, 2010; Glorot et al., 2011; LeCun et al., 2015; Krizhevsky et al., 2012).

In this paper, we further introduce the stack-softplus function

(4)

which can be recursively defined with . In addition, with as the weights of the countably infinite atoms of a gamma process (Ferguson, 1973), we will introduce the sum-softplus function, expressed as , and the sum-stack-softplus (SS-softplus) function, expressed as . The stack-, sum-, and SS-softplus functions constitute a family of softplus functions, which are used to construct nonlinear regression models, as presented below.
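As a reference for the discussion below, here is a schematic NumPy sketch of the four members of the softplus family as we read them from the definitions in Section 2.2 and the stacked-gamma construction of Section 2.5: the softplus $\varsigma(\psi) = \ln(1 + e^{\psi})$, the sum-softplus that adds weighted softplus terms over experts, the stack-softplus obtained by nesting softplus transforms over layers, and the SS-softplus that sums stack-softplus terms over experts. The nesting order (which layer's hyperplane sits innermost) and the symbol names r and beta are our assumptions, not necessarily the paper's exact notation.

```python
import numpy as np

def softplus(psi):
    # numerically stable log(1 + exp(psi))
    return np.logaddexp(0.0, psi)

def sum_softplus(x, betas, r):
    """Sum-softplus: sum_k r_k * softplus(x' beta_k); betas has one row per expert."""
    return np.sum(r * softplus(betas @ x))

def stack_softplus(x, betas, r):
    """Stack-softplus with T layers: nested softplus transforms scaled by the weight r.
    The nesting order (betas[0] innermost) is an assumption consistent with Section 2.5."""
    inner = softplus(x @ betas[0])
    for beta_t in betas[1:]:
        inner = np.log1p(np.exp(x @ beta_t) * inner)
    return r * inner

def ss_softplus(x, betas_per_expert, r):
    """SS-softplus: sum over experts k of r_k times a stack-softplus term."""
    return sum(r_k * stack_softplus(x, B_k, 1.0)
               for r_k, B_k in zip(r, betas_per_expert))

def berpo_prob(lam):
    """BerPo rate lam -> P(b = 1 | x) = 1 - exp(-lam)."""
    return -np.expm1(-lam)
```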

2.2 The softplus regression family

The equivalence between (2) and (3), the apparent partnership between the BerPo link and the softplus function, and the convenience of employing multiple regression coefficient vectors to parameterize the BerPo rate, which is constrained to be nonnegative rather than between zero and one, motivate us to consider using the BerPo link together with softplus functions to model binary response variables given the covariates. We first show how a classification model under the BerPo link reduces to logistic regression, which uses a single hyperplane to partition the covariate space into two halves. We then generalize it to two distinct multi-hyperplane classification models, sum- and stack-softplus regressions, and further show how to integrate them into SS-softplus regression. These models clearly differ from each other in how the BerPo rates are parameterized with the softplus functions, leading to decision boundaries under distinct geometric constraints.

To be more specific, for the th covariate vector , where the prime denotes the operation of transposing a vector, we model its binary class label using

(5)

where the BerPo rate, given the regression model parameters that may come from a stochastic process, is a nonnegative deterministic function of the covariates that may contain a countably infinite number of parameters. Let $G \sim \Gamma\mathrm{P}(G_0, 1/c)$ denote a gamma process (Ferguson, 1973) defined on the product space $\mathbb{R}_+ \times \Omega$, where $\mathbb{R}_+ = \{x : x > 0\}$, $c$ is a scale parameter, and $G_0$ is a finite and continuous base measure defined on a complete separable metric space $\Omega$, such that $G(A_j) \sim \mathrm{Gamma}(G_0(A_j), 1/c)$ are independent gamma random variables for disjoint Borel sets $A_j$ of $\Omega$. Below we show how the BerPo rate function is parameterized under four different softplus regressions, two of which use the gamma process to support a countably infinite sum in the parameterization, and also show how to arrive at each parameterization using a hierarchical Bayesian model built on the BerPo link together with convolved and/or stacked gamma distributions.

Definition 1 (Softplus regression).

Given , weight , and a regression coefficient vector , softplus regression parameterizes in (5) using a softplus function as

(6)

Softplus regression is equivalent to the binary regression model

which, as proved in Appendix B, can be constructed using the hierarchical model

(7)
Definition 2 (Sum-softplus regression).

Given a draw from a gamma process , expressed as , where is an atom and is its weight, sum-softplus regression parameterizes in (5) using a sum-softplus function as

(8)

Sum-softplus regression is equivalent to the binary regression model

which, as proved in Appendix B, can be constructed using the hierarchical model

(9)
Definition 3 (Stack-softplus regression).

With weight and regression coefficient vectors , where , stack-softplus regression with layers parameterizes in (5) using a stack-softplus function as

(10)

Stack-softplus regression is equivalent to the regression model

which, as proved in Appendix B, can be constructed using the hierarchical model that stacks gamma distributions, whose scales are differently parameterized by the covariates, as

(11)
Definition 4 (Sum-stack-softplus (SS-softplus) regression).

Given a draw from a gamma process , expressed as , where is an atom and is its weight, with each , SS-softplus regression with layers parameterizes in (5) using a sum-stack-softplus function as

(12)

SS-softplus regression is equivalent to the regression model

which, as proved in Appendix B, can be constructed from the hierarchical model that convolves countably infinite stacked gamma distributions that have covariate-dependent scale parameters as

(13)

Below we discuss these four different softplus regression models in detail and show that both sum- and stack-softplus regressions use the interactions of multiple regression coefficient vectors through the softplus functions to define a confined space, related to a convex polytope (Grünbaum, 2013) defined by the intersection of multiple half-spaces, to separate one class from the other in the covariate space. They differ from each other in that sum-softplus regression infers a convex-polytope-bounded confined space to enclose negative examples (i.e., data samples labeled "0"), whereas stack-softplus regression infers a convex-polytope-like confined space to enclose positive examples (i.e., data samples labeled "1").

The opposite behaviors of sum- and stack-softplus regressions motivate us to unite them as SS-softplus regression, which can place countably infinite convex-polytope-like confined spaces, each of which favors positive examples inside and negative examples outside, at various regions of the covariate space, and use the union of these confined spaces to construct a flexible nonlinear classification decision boundary. Note that softplus regressions all operate on the original covariate space. It is possible to apply them to regress binary response variables on covariates that have already been nonlinearly transformed with the kernel trick or a deep neural network, which may combine the advantages of these distinct methods to achieve an overall improved classification performance. We leave the integration of softplus regressions with the kernel trick or deep neural networks for future study.

2.3 Softplus and logistic regressions

It is straightforward to show that softplus regression with its weight fixed at one is equivalent to logistic regression, which uses a single hyperplane dividing the covariate space into two halves to separate one class from the other. A similar connection has also been illustrated in Dunson and Herring (2005). Clearly, softplus regression arising from (7) generalizes logistic regression in allowing the weight to differ from one. Given a probability threshold for making binary decisions, one may consider that softplus regression defines a hyperplane that partitions the covariate space into two halves: one half is assigned the label "1," since there the predicted probability of the positive class exceeds the threshold, and the other half is assigned the label "0," since there it does not. Instead of using a single hyperplane, the three generalizations in Definitions 2-4 all partition the covariate space using a confined space that is related to a single convex polytope or the union of multiple convex polytopes, as described below.
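The reduction to logistic regression can be checked numerically: with a single expert whose weight is fixed at one, $1 - e^{-\ln(1 + e^{\psi})} = e^{\psi}/(1 + e^{\psi})$. A short illustrative sanity check in NumPy:

```python
import numpy as np

psi = np.linspace(-6, 6, 13)
p_softplus = -np.expm1(-np.logaddexp(0.0, psi))   # 1 - exp(-softplus(psi))
p_logistic = 1.0 / (1.0 + np.exp(-psi))           # sigmoid(psi)
assert np.allclose(p_softplus, p_logistic)        # identical predictive probabilities
```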

2.4 Sum-softplus regression and convolved NB regressions

Note that since the covariate-dependent gamma scale in (9) can be equivalently expressed through a probability parameter, sum-softplus regression can also be constructed as

(14)

where NB(r, p) represents a negative binomial (NB) distribution (Greenwood and Yule, 1920; Fisher et al., 1943) with shape parameter r and probability parameter p, and (14) can be considered as NB regression (Lawless, 1987; Long, 1997; Cameron and Trivedi, 1998; Winkelmann, 2008) that parameterizes the logit of the NB probability parameter with a linear function of the covariates. To ensure that the infinite model is well defined, we provide the following proposition and present the proof in Appendix B.

Proposition 1.

The infinite product in sum-softplus regression is smaller than one and has a finite expectation that is greater than zero.

As the probability distribution of the sum of independent random variables is the same as the convolution of these random variables' probability distributions (e.g., Fristedt and Gray, 1997), the probability distribution of the BerPo rate is the convolution of countably infinite gamma distributions, each of which parameterizes the logarithm of its scale using the inner product of the same covariate vector and an atom-specific regression coefficient vector. As in (14), since the latent count that is thresholded at one is the summation of countably infinite latent counts, each of which is an NB regression response variable, we essentially regress that latent count on the covariates using a convolution of countably infinite NB regression models. If the regression coefficient vectors are drawn from a continuous distribution, then they differ from each other a.s., and hence, given the covariates and regression coefficients, the BerPo rate would not follow a gamma distribution and the aggregate latent count would not follow an NB distribution.
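The agreement between the convolution view and the predictive probability can be sketched as follows; the gamma-Poisson sampling of the latent counts and the closed form $P(b = 1 \mid x) = 1 - \prod_k (1 + e^{x'\beta_k})^{-r_k}$ follow our reading of (8) and (9), so the symbol names and the illustrative values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# One covariate vector and three experts (all values illustrative).
x = np.array([0.5, -1.0, 1.0])
betas = rng.normal(size=(3, 3))          # one regression coefficient vector per expert
r = np.array([0.8, 1.5, 0.3])            # expert weights

def sample_b(rng):
    """Generative view of (9): gamma rates with covariate-dependent scales,
    Poisson latent counts (marginally NB regression counts), threshold at one."""
    scales = np.exp(betas @ x)                     # exp(x' beta_k)
    theta = rng.gamma(shape=r, scale=scales)       # theta_k ~ Gamma(r_k, exp(x' beta_k))
    m = rng.poisson(theta)                         # m_k ~ Pois(theta_k)
    return int(m.sum() >= 1)

mc = np.mean([sample_b(rng) for _ in range(100000)])

# Closed form under our reading of (8): 1 - prod_k (1 + e^{x' beta_k})^{-r_k}.
closed = -np.expm1(-np.sum(r * np.logaddexp(0.0, betas @ x)))
print(mc, closed)   # the two estimates should agree up to Monte Carlo error
```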

Note that if we modify the proposed sum-softplus regression model in (9) as

(15)

then we have , which becomes the same as Eq. 2.7 of Dunson and Herring (2005) that is designed to model multivariate binary response variables. Though related, that construction is clearly different from the proposed sum-softplus regression in that it uses only a single regression coefficient vector and does not support . It is of interest to extend the models in Dunson and Herring (2005) with the sum-softplus construction discussed above and the stack- and SS-softplus constructions to be discussed below.

2.4.1 Convex-polytope-bounded confined space that favors negative examples

For sum-softplus regression arising from (9), the binary classification decision boundary is no longer defined by a single hyperplane. Let us make the analogy that each atom of the gamma process draw is an expert on a committee that collectively makes binary decisions. For each expert, the magnitude of its weight indicates how strongly its opinion is weighted by the committee, a zero latent count represents that it votes "No," and a positive latent count represents that it votes "Yes." Since the response variable equals one if and only if the total latent count is positive, the committee would vote "No" if and only if all its experts vote "No" (i.e., all the latent counts are zero); in other words, the committee would vote "Yes" even if only a single expert votes "Yes."

Let us now examine the confined covariate space for sum-softplus regression that satisfies the inequality , where a data point is labeled as one with a probability no greater than . Although it is not immediately clear what kind of geometric constraints are imposed on the covariate space by this inequality, the following theorem shows that it defines a confined space, which is bounded by a convex polytope defined by the intersection of countably infinite half-spaces.

Theorem 2.

For sum-softplus regression, the confined space specified by the inequality , which can be expressed as

(16)

is bounded by a convex polytope defined by the set of solutions to countably infinite inequalities

(17)
Proposition 3.

For any data point that resides outside the convex polytope defined by (17), which means that it violates at least one of the inequalities in (17) a.s., it will be labeled under sum-softplus regression as "1" with a probability greater than the decision threshold, and as "0" with a probability no greater than one minus that threshold.

The convex polytope defined in (17) is enclosed by the intersection of countably infinite half-spaces. If we set 0.5 as the probability threshold to make binary decisions, then the convex polytope assigns a label of "0" to a point inside the convex polytope (i.e., one that satisfies all the inequalities in Eq. 17) with a relatively high probability, and assigns a label of "1" to a point outside the convex polytope (i.e., one that violates at least one of the inequalities in Eq. 17) with a probability of at least 0.5.

Note that as the weight of an expert approaches zero, its contribution to the BerPo rate, and hence to the decision, vanishes. Thus an expert with a tiny weight essentially has a negligible impact on both the decision of the committee and the boundary of the convex polytope. Choosing the gamma process as the nonparametric Bayesian prior sidesteps the need to tune the number of experts in the committee, shrinking the weights of all unnecessary experts and hence allowing a finite number of experts with non-negligible weights to be automatically inferred from the data.
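Under the reading above, and taking 0.5 as the decision threshold, each half-space in (17) can be taken as $r_k \ln(1 + e^{x'\beta_k}) \leq \ln 2$, i.e., $x'\beta_k \leq \ln(2^{1/r_k} - 1)$; this specific constant is our derivation from the threshold, not a quote of (17). The sketch below counts violated inequalities, the quantity displayed in Fig. 1 (c); the hyperplanes and test points are made up for illustration.

```python
import numpy as np

def violated_count(x, betas, r, threshold=0.5):
    """Count how many of the per-expert half-space inequalities a point x violates;
    zero violations means x lies inside the convex polytope."""
    bound = np.log(1.0 / (1.0 - threshold))        # -log(1 - threshold); log 2 for 0.5
    contrib = r * np.logaddexp(0.0, betas @ x)     # r_k * softplus(x' beta_k)
    return int(np.sum(contrib > bound))

# Illustrative use: four hyperplanes boxing in the origin (bias term appended to x).
betas = np.array([[ 2.0,  0.0, -2.0],
                  [-2.0,  0.0, -2.0],
                  [ 0.0,  2.0, -2.0],
                  [ 0.0, -2.0, -2.0]])
r = np.ones(4)
for point in ([0.0, 0.0], [3.0, 0.0]):
    x = np.append(point, 1.0)
    print(point, violated_count(x, betas, r))   # 0 inside the polytope, >= 1 outside
```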

2.4.2 Illustration for sum-softplus regression

A clear advantage of sum-softplus regression over both softplus and logistic regressions is that it can use multiple hyperplanes to construct a nonlinear decision boundary and, similar to the convex polytope machine of Kantchelian et al. (2014), separate two different classes by a large margin. To illustrate the imposed geometric constraints, we first consider a synthetic two-dimensional dataset with two classes, as shown in Fig. 1 (a), where most of the data points of one class reside within a unit circle and those of the other class reside within a ring outside the unit circle.

Figure 1: Visualization of sum-softplus regression with experts on a binary classification problem under two opposite labeling settings. For each labeling setting, 2000 Gibbs sampling iterations are used and the MCMC sample that provides the maximum likelihood on fitting the training data labels is used to display the results. (a) A two-dimensional dataset that consists of 150 data points from Class , whose radii are drawn from and whose angles are distributed uniformly at random between 0 and degrees, and another 150 data points from Class , whose x-axis and y-axis values are both drawn from . With data points in Classes and labeled as "1" and "0," respectively, and with , (b) shows the inferred weights of the experts, ordered by their values, (c) shows a contour map, the value of each point of which represents how many inequalities specified in (17) are violated, and whose region with zero values corresponds to the convex polytope enclosed by the intersection of the hyperplanes defined in (17), and (d) shows the contour map of the predicted class probabilities. (f)-(h) are analogous plots to (b)-(d), with the data points in Classes and relabeled as "0" and "1," respectively. (e) The average per-data-point log-likelihood as a function of MCMC iteration, for both labeling settings.
Figure 2: Visualization of the posteriors of sum-softplus regression based on 20 MCMC samples, collected once per every 50 iterations during the last 1000 MCMC iterations, with the same experimental setting used for Fig. 1. With , (a) and (b) show the contour maps of the posterior means and standard deviations, respectively, of the number of inequalities specified in (17) that are violated, and (c) and (d) show the contour maps of the posterior means and standard deviations, respectively, of the predicted class probabilities. (e)-(h) are analogous plots to (a)-(d), with the data points in Classes and relabeled as "0" and "1," respectively.

We first label the data points of Class as "1" and those of Class as "0." Shown in Fig. 1 (b) are the inferred weights of the experts, using the MCMC sample that has the highest log-likelihood in fitting the training data labels. It is evident from Figs. 1 (b) and (c) that sum-softplus regression infers four experts (hyperplanes) with significant weights. The convex polytope in Fig. 1 (c) that encloses the space marked as zero is intersected by these four hyperplanes, each of which is defined as in (17). Thus outside the convex polytope are data points that would be labeled as "1" with probabilities of at least 50%, and inside it are data points that would be labeled as "0" with relatively high probabilities. We further show in Fig. 1 (d) the contour map of the inferred class probabilities, which are calculated with (8). Note that due to the model construction, a single expert's influence on the decision boundary can be conveniently measured, and the exact decision boundary is bounded by a convex polytope. Thus it is not surprising that the convex polytope in Fig. 1 (c), which encloses the space marked as zero, aligns well with the 0.5 probability contour line shown in Fig. 1 (d).

Despite being able to construct a nonlinear decision boundary bounded by a convex polytope, sum-softplus regression has a clear restriction in that if the data labels are flipped, its performance may substantially deteriorate, becoming no better than that of logistic regression. For example, for the same data shown in Fig. 1 (a), if we choose the opposite labeling setting, where the data points of Class are labeled as "0" and those of Class are labeled as "1," then sum-softplus regression infers a single expert (hyperplane) with non-negligible weight, as shown in Figs. 1 (f)-(g), and fails to separate the data points of the two different classes, as shown in Figs. 1 (g)-(h). The data log-likelihood plots in Fig. 1 (e) also suggest that sum-softplus regression could perform substantially better if the training data are labeled in favor of its geometric constraints on the decision boundary. An advantage of a Bayesian hierarchical model is that with collected MCMC samples, one may estimate not only the posterior means but also the uncertainties. The standard deviations shown in Figs. 2

(b) and (d) clearly indicate the uncertainties of sum-softplus regression on its decision boundaries and predictive probabilities in the covariate space, which may be used to help decide how to sequentially query the labels of unlabeled data in an active learning setting

(Cohn et al., 1996; Settles, 2010).

The sensitivity of sum-softplus regression to how the data are labeled could be mitigated but not completely solved by combining two sum-softplus regression models trained under the two opposite labeling settings. In addition, sum-softplus regression may not perform well no matter how the data are labeled if neither of the two classes could be enclosed by a convex polytope. To fully resolve these issues, we first introduce stack-softplus regression, which defines a convex-polytope-like confined space to enclose positive examples. We then show how to combine the two distinct, but complementary, softplus regression models to construct SS-softplus regression that provides more flexible nonlinear decision boundaries.

2.5 Stack-softplus regression and stacked gamma distributions

The model in (11) combines the BerPo link with a gamma belief network that stacks differently parameterized gamma distributions. Note that here "stacking" is defined as an operation that mixes the shape parameter of the gamma distribution at one layer with the gamma distribution at the next layer, which is subsequently pushed into the stack, and then pops out the covariate-dependent gamma scale parameters from the top of the stack down to layer 2, following the last-in-first-out rule, to parameterize the BerPo rate of the class label shown in (10).
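To make the stacking operation concrete, here is a generative sketch of a stack of gamma distributions with covariate-dependent scales, read off from the description above and from (11); the layer ordering (which layer feeds the Poisson count), the symbol names, and all numerical values are assumptions for illustration. The closed form at the end matches the nested softplus expression used in our earlier sketch.

```python
import numpy as np

rng = np.random.default_rng(2)

def stack_softplus_sample(x, betas, r, rng):
    """Stack T gamma distributions: the top layer has shape r, each draw becomes the
    shape of the next layer's gamma, every layer's scale is exp(x' beta_t), and the
    bottom draw is the Poisson rate that is thresholded at one."""
    shape = r
    for beta_t in reversed(betas):            # push layers from the top of the stack down
        shape = rng.gamma(shape=shape, scale=np.exp(x @ beta_t))
    m = rng.poisson(shape)                    # latent count at the bottom of the stack
    return int(m >= 1)

x = np.array([0.5, 1.0])
betas = [np.array([1.0, -0.5]), np.array([-0.3, 0.2]), np.array([0.8, 0.1])]  # three layers
r = 1.0
print(np.mean([stack_softplus_sample(x, betas, r, rng) for _ in range(20000)]))

# Closed-form marginal under this construction: 1 - exp(-r * nested softplus).
inner = np.logaddexp(0.0, x @ betas[0])
for beta_t in betas[1:]:
    inner = np.log1p(np.exp(x @ beta_t) * inner)
print(-np.expm1(-r * inner))   # should match the Monte Carlo estimate above
```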

2.5.1 Convex-polytope-like confined space that favors positive examples

Let us make the analogy that each hyperplane corresponds to one of the criteria that an expert examines before making a binary decision. From (10) it is clear that as long as a single criterion of the expert is strongly violated, meaning that the corresponding linear score is much smaller than zero, the expert would vote "No" regardless of the values of the other criteria. Thus the response variable could be voted "Yes" by the expert only if none of the expert's criteria are strongly violated. For stack-softplus regression, let us specify a confined space using the inequality , which can be expressed as

(18)

and hence any data point outside the confined space (i.e., violating the inequality in Eq. 18 a.s.) will be labeled as "0" with a probability no less than one minus the decision threshold.

Considering the covariate space

(19)

where all the criteria except criterion of the expert tend to be satisfied, the decision boundary of stack-softplus regression in would be clearly influenced by the satisfactory level of criterion , whose hyperplane partitions into two parts as

(20)

for all . Let us define with and the recursion for , and define with and the recursion for . Using the definition of and , combining all the expert criteria, the confined space of stack-softplus regression specified in (18) can be roughly related to a convex polytope, which is specified by the solutions to a set of inequalities as

(21)

The convex polytope is enclosed by the intersection of the hyperplanes, and since none of the criteria would be strongly violated inside the convex polytope, the label "1" would be assigned to a point inside (outside) the convex polytope with a relatively high (low) probability.

Unlike the confined space of sum-softplus regression defined in (16), which is bounded by the convex polytope defined in (17), the convex polytope defined in (21) only roughly corresponds to the confined space of stack-softplus regression, as defined in (18). Nevertheless, the confined space defined in (18) is referred to as a convex-polytope-like confined space, due to both its connection to the convex polytope in (21) and the fact that (18) is likely to be violated if at least one of the criteria is strongly dissatisfied (i.e., if the corresponding linear score is much smaller than zero for some criterion).

2.5.2 Illustration for stack-softplus regression

Figure 3: Analogous figure to Fig. 1 for stack-softplus regression with expert criteria, with the following differences: (b) shows the average latent count per positive sample, , as a function of layer , and (c) shows a contour map, the value of each point of which represents how many inequalities specified in (21) are satisfied, and whose region with the values of corresponds to the convex polytope enclosed by the intersections of the hyperplanes defined in (21).
Figure 4: Analogous figure to Fig. 2 for stack-softplus regression, with the following differences: (a) and (b) show the contour maps of the posterior means and standard deviations, respectively, of the number of inequalities specified in (21) that are satisfied. (e)-(f) are analogous plots to (a)-(b) under the opposite labeling setting.

Let us examine how stack-softplus regression performs on the same data used in Fig. 1. When Class is labeled as "1," as shown in Fig. 4 (g), stack-softplus regression infers a convex polytope that encloses the space marked as using the intersection of all hyperplanes, each of which is defined as in (21); and as shown in Fig. 4 (h), it works well by using a convex-polytope-like confined space to enclose positive examples. However, as shown in Figs. 4 (c)-(e), its performance deteriorates when the opposite labeling setting is used. Note that due to the model construction that introduces complex interactions between the hyperplanes, (21) can only roughly describe how a single hyperplane could influence the decision boundary determined by all hyperplanes. Thus it is not surprising that neither the convex polytope in Fig. 4 (c), which encloses the space marked with the largest count there, nor the convex polytope in Fig. 4 (g), which encloses the space marked with , aligns well with the contour lines of in Figs. 4 (d) and (h), respectively.

While the rate at which the latent count decreases with the layer index does not indicate a clear cutoff point for the depth, neither do we observe a clear sign of overfitting when the depth is set as large as 100 in our experiments. Both Figs. 4 (c) and (g) indicate that most of the hyperplanes are far from any data points and tend to vote "Yes" for all training data. The standard deviations shown in Figs. 4 (f) and (h) clearly indicate the uncertainties of stack-softplus regression on its decision boundaries and predictive probabilities in the covariate space.

Like sum-softplus regression, stack-softplus regression also generalizes softplus and logistic regressions in that it uses the boundary of a confined space rather than a single hyperplane to partition the covariate space into two parts. Unlike the convex-polytope-bounded confined space of sum-softplus regression that favors placing negative examples inside it, the convex-polytope-like confined space of stack-softplus regression favors placing positive examples inside it. While both sum- and stack-softplus regressions could be sensitive to how the data are labeled, their distinct behaviors under the same labeling setting motivate us to combine them together as SS-softplus regression, as described below.

2.6 Sum-stack-softplus (SS-softplus) regression

Note that if the number of layers is one, SS-softplus regression reduces to sum-softplus regression; if the number of experts is one, it reduces to stack-softplus regression; and if both are one, it reduces to softplus regression, which further reduces to logistic regression if the weight of the single expert is fixed at one. To ensure that the SS-softplus regression model is well defined in its infinite limit, we provide the following proposition and present the proof in Appendix B.

Proposition 4.

The infinite product in sum-stack-softplus regression is smaller than one and has a finite expectation that is greater than zero.

2.6.1 Union of convex-polytope-like confined spaces

We may consider SS-softplus regression as a multi-hyperplane model that employs a committee, consisting of countably infinite experts, to make a decision, where each expert is equipped with criteria to be examined. The committee’s distribution is obtained by convolving the distributions of countably infinite experts, each of which mixes stacked covariate-dependent gamma distributions. For each , the committee votes “Yes” as long as at least one expert votes “Yes,” and an expert could vote “Yes” if and only if none of its criteria are strongly violated. Thus the decision boundary of SS-softplus regression can be considered as a union of convex-polytope-like confined spaces that all favor placing positively labeled data inside them, as described below, with the proofs deferred to Appendix B.
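The committee analogy can be sketched numerically: the committee's probability of voting "Yes" is one minus the product, over experts, of each expert's probability of voting "No," where each expert's BerPo rate is a stack-softplus term. The nesting convention, symbol names, and numerical values below follow our earlier sketches and remain assumptions.

```python
import numpy as np

def stack_softplus(x, betas):
    inner = np.logaddexp(0.0, x @ betas[0])
    for beta_t in betas[1:]:
        inner = np.log1p(np.exp(x @ beta_t) * inner)
    return inner

# Two experts, each with two criteria (hyperplanes); all values illustrative.
experts = [
    [np.array([ 2.0, 0.0, -1.0]), np.array([-2.0, 0.0, -1.0])],
    [np.array([ 0.0, 2.0, -1.0]), np.array([ 0.0, -2.0, -1.0])],
]
r = np.array([1.0, 1.0])
for point in ([0.0, 0.0], [2.0, 2.0]):
    x = np.append(point, 1.0)                                   # append a bias term
    per_expert = np.array([-np.expm1(-r_k * stack_softplus(x, B_k))
                           for r_k, B_k in zip(r, experts)])    # P(expert k votes "Yes")
    committee = 1.0 - np.prod(1.0 - per_expert)                 # "Yes" unless every expert says "No"
    print(point, per_expert.round(3), committee.round(3))
```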

Theorem 5.

For sum-stack-softplus regression, the confined space specified by the inequality , which can be expressed as

(22)

encompasses the union of convex-polytope-like confined spaces, expressed as

where the th convex-polytope-like confined space is specified by the inequality

(23)
Corollary 6.

For sum-stack-softplus regression, the confined space specified by the inequality is bounded by

Proposition 7.

For any data point that resides inside the union of countably infinite convex-polytope-like confined spaces, which means that it satisfies at least one of the inequalities in (23), it will be labeled under sum-stack-softplus regression as "1" with a probability greater than the decision threshold, and as "0" with a probability no greater than one minus that threshold.

2.6.2 Illustration for sum-stack-softplus regression

Let us examine how SS-softplus regression performs on the same dataset used in Fig. 1. When Class is labeled as “1,” as shown in Figs. 6 (b)-(c), SS-softplus regression infers about eight convex-polytope-like confined spaces, the intersection of six of which defines the boundary of the covariate space that separates the points that violate all inequalities in (23) from the ones that satisfy at least one inequality in (23). The union of these convex-polytope-like confined spaces defines a confined covariate space, which is included within the covariate space satisfying , as shown in Fig. 6 (d).

When Class is labeled as "1," as shown in Figs. 6 (f)-(g), SS-softplus regression infers about six convex-polytope-like confined spaces, one of which defines the boundary of the covariate space that separates the points that violate all inequalities in (23) from the others for the covariate space shown in Fig. 6 (g). The union of two convex-polytope-like confined spaces defines a confined covariate space, which is included in the covariate space with , as shown in Fig. 6 (h). Figs. 6 (f)-(g) also indicate that except for two convex-polytope-like confined spaces, the boundaries of all the other convex-polytope-like confined spaces are far from any data points, and the corresponding experts tend to vote "No" for all training data. The standard deviations shown in Figs. 6 (b), (d), (f), and (h) clearly indicate the uncertainties of SS-softplus regression on classification decision boundaries and predictive probabilities.

Figure 5: Analogous figure to Figs. 1 and 4 for SS-softplus regression with experts and criteria for each expert, with the following differences: (b) shows the average latent count per positive sample, , as a function of both the expert index and layer index , where the experts are ordered based on the values of , (c) shows a contour map, the value of each point of which represents how many inequalities specified in (23) are satisfied, and whose region with nonzero values corresponds to the union of convex-polytope-like confined spaces, each of which corresponds to an inequality defined in (23), and (f) and (g) are analogous plots to (b) and (c) under the opposite labeling setting where data in Class are labeled as “1.”
Figure 6: Analogous figure to Fig. 2 for SS-softplus regression, with the following differences: (a) and (b) show the contour maps of the posterior means and standard deviations, respectively, of the number of inequalities specified in (23) that are satisfied. (e)-(f) are analogous plots to (a)-(b).

3 Gibbs sampling via data augmentation and marginalization

Since logistic, softplus, sum-softplus, and stack-softplus regressions can all be considered as special cases of SS-softplus regression, below we will focus on presenting the nonparametric Bayesian hierarchical model and Bayesian inference for SS-softplus regression.

The gamma process has an inherent shrinkage mechanism: in the prior, the number of atoms with weights larger than any fixed positive constant follows a Poisson distribution whose mean is finite a.s. and is proportional to the mass parameter of the gamma process. In practice, an atom with a tiny weight generally has a negligible impact on the final decision boundary of the model, hence one may truncate either the atom weights to be above a small threshold or the number of atoms to be below a finite number. One may also follow Wolpert et al. (2011) to use a reversible jump MCMC (Green, 1995) strategy to adaptively truncate the number of atoms of a gamma process, which often comes with a high computational cost. For the convenience of implementation, we truncate the number of atoms in the gamma process to a finite number by choosing a finite discrete base measure, where that number will be set sufficiently large to achieve a good approximation to the truly countably infinite model.
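A common recipe for such a finite truncation, shown here only as an illustration of the kind of approximation described above and not necessarily the paper's exact construction, is to draw the K atom weights as r_k ~ Gamma(gamma0 / K, 1 / c): their sum is Gamma(gamma0, 1/c) distributed for every K, while the number of non-negligible weights stays essentially finite as K grows, which is the shrinkage behavior mentioned above.

```python
import numpy as np

rng = np.random.default_rng(3)

gamma0, c = 2.0, 1.0          # mass and scale parameters (illustrative values)
for K in (10, 100, 1000):
    # Truncated gamma-process weights: r_k ~ Gamma(gamma0 / K, 1 / c).
    r = rng.gamma(shape=gamma0 / K, scale=1.0 / c, size=K)
    print(K, float(r.sum()), int(np.sum(r > 1e-3)))   # total mass, number of non-negligible atoms
```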

We express the truncated SS-softplus regression model using (13) together with

(24)

where . Related to Tipping (2001), the normal-gamma construction in (24) is used to promote sparsity on the regression coefficients. We derive Gibbs sampling by exploiting local conjugacies under a series of data augmentation and marginalization techniques. We comment here that while the proposed Gibbs sampling algorithm is a batch learning algorithm that processes all training data samples in each iteration, the local conjugacies revealed under data augmentation and marginalization may be of significant value in developing efficient mini-batch-based online learning algorithms, including those based on stochastic gradient MCMC (Welling and Teh, 2011; Girolami and Calderhead, 2011; Patterson and Teh, 2013; Ma et al., 2015) and stochastic variational inference (Hoffman et al., 2013). We leave the maximum likelihood, maximum a posteriori, (stochastic) variational Bayes inference, and stochastic gradient MCMC for softplus regressions for future research.
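As one example of the local conjugacies being exploited, a normal-gamma construction in the spirit of the relevance vector machine of Tipping (2001) places a normal prior on each coefficient whose precision receives a gamma hyperprior, so that the precision has a closed-form conditional update. The hyperparameter values, symbol names, and the exact form of the hyperprior below are placeholders and may differ from (24).

```python
import numpy as np

rng = np.random.default_rng(4)

a0, b0 = 1e-2, 1e-2                    # illustrative hyperparameters of the gamma hyperprior
beta = np.array([2.5, 0.01, -0.03])    # pretend current values of three regression coefficients

# Conjugate conditional update of each precision given its coefficient:
# alpha_v | beta_v ~ Gamma(a0 + 1/2, rate = b0 + beta_v**2 / 2)  (scale = 1 / rate).
alpha = rng.gamma(shape=a0 + 0.5, scale=1.0 / (b0 + beta ** 2 / 2.0))
print(np.round(alpha, 2))   # small coefficients receive large precisions, i.e. strong shrinkage
```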

For a model with , we exploit the data augmentation techniques developed for the BerPo link in Zhou (2015) to sample , those developed for the Poisson and multinomial distributions (Dunson and Herring, 2005; Zhou et al., 2012a) to sample , those developed for the NB distribution in Zhou and Carin (2015) to sample and , and those developed for logistic regression in Polson and Scott (2011) and further generalized to NB regression in Zhou et al. (2012b) and Polson et al. (2013) to sample . We exploit local conjugacies to sample all the other model parameters. For a model with , we further generalize the inference technique developed for the gamma belief network in Zhou et al. (2015a) to sample the model parameters of deep hidden layers. Below we provide a theorem, related to Lemma 1 for the gamma belief network in Zhou et al. (2015a), to show that each regression coefficient vector can be linked to latent counts under NB regression. Letting denote the sum-logarithmic distribution described in Zhou et al. (2015b), Corollary 9 further shows an alternative representation of (13), the hierarchical model of SS-softplus regression, where all the covariate-dependent gamma distributions are marginalized out.
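As one representative example of the augmentation steps listed above, the Pólya-Gamma identity of Polson et al. (2013) makes the conditional posterior of a logistic-type regression coefficient vector Gaussian: drawing $\omega_i \sim \mathrm{PG}(b_i, x_i'\beta)$ and setting $\kappa_i = y_i - b_i/2$ gives $\beta \mid \omega \sim \mathrm{N}(V X'\kappa, V)$ with $V = (X'\Omega X + B^{-1})^{-1}$ under a zero-mean Gaussian prior with covariance $B$. The sketch below uses a truncated version of the standard infinite sum-of-gammas representation of the PG distribution rather than the accurate approximate sampler referenced in the Supplementary Materials, and a simple N(0, tau2 I) prior; both are our simplifications.

```python
import numpy as np

rng = np.random.default_rng(5)

def polya_gamma(b, c, rng, truncation=200):
    """Approximate PG(b, c) draw via the truncated sum-of-gammas representation:
    omega = (1 / (2 pi^2)) * sum_k g_k / ((k - 1/2)^2 + c^2 / (4 pi^2)), g_k ~ Gamma(b, 1)."""
    k = np.arange(1, truncation + 1)
    g = rng.gamma(shape=b, scale=1.0, size=truncation)
    return np.sum(g / ((k - 0.5) ** 2 + (c / (2.0 * np.pi)) ** 2)) / (2.0 * np.pi ** 2)

def sample_beta(X, y, trials, beta, tau2, rng):
    """One Gibbs step for logistic-type regression coefficients under a N(0, tau2 I) prior."""
    omega = np.array([polya_gamma(n_i, x_i @ beta, rng) for n_i, x_i in zip(trials, X)])
    kappa = y - trials / 2.0
    V = np.linalg.inv(X.T @ (omega[:, None] * X) + np.eye(X.shape[1]) / tau2)
    return rng.multivariate_normal(V @ (X.T @ kappa), V)

# Tiny illustrative run on synthetic Bernoulli (single-trial) logistic data.
X = np.column_stack([rng.normal(size=200), np.ones(200)])
beta_true = np.array([1.5, -0.5])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)
beta, trials = np.zeros(2), np.ones(200)
for _ in range(50):
    beta = sample_beta(X, y, trials, beta, tau2=10.0, rng=rng)
print(np.round(beta, 2))   # a posterior draw, expected to lie near beta_true
```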

Theorem 8.

Let us denote , , , and . With and

(25)

for , which means

(26)

one may find latent counts that are connected to the regression coefficient vectors as

(27)
Corollary 9.

With defined as in (26) and hence , the hierarchical model of sum-stack-softplus regression can also be expressed as