1 Introduction
The truncated multivariate normal (tMVN) distribution is routinely used as a prior distribution on model parameters in Bayesian shapeconstrained regression. Structural constraints, such as monotonicity and/or convexity, are commonly induced by expanding the function in an appropriate basis where the constraints can be induced by imposing linear constraints on the coefficients; some examples of such a basis include piecewise linear functions [10], splines [6], Bernstein polynomials [34], and compactly supported basis functions [23]. Under a Gaussian or scalemixture of Gaussian error distribution, the conditional posterior of the basis coefficients once again turns out to be truncated normal with linear constraints [23].
The problem of sampling from a tMVN distribution with linear constraints is also frequently encountered as a component of a larger Markov chain Monte Carlo (MCMC) algorithm to sample from the full conditional distribution of a constrained parameter vector. As a running example revisited on multiple occasions in this article, consider binary variables
, with a vector of latent Gaussian thresholds [1] anda vector of parameters/latent variables so that the joint distribution of
follows a distribution. It then immediately follows that the (conditional) posterior of follows a distribution truncated to , with or depending on whether or . Such latent Gaussian threshold models are ubiquitous in the analysis of binary and nominal data; examples include probit regression and its multivariate extensions [1, 17, 7, 26], multinomial probit models [24, 36, 18], tobit models [33, 28], and binary Gaussian process (GP) classification models [16] among others.In this article, we propose a new family of distributions called the soft tMVN distribution which replaces the hard constraints in a tMVN distribution with a smoothed or “soft” version using a logistic sigmoid function. The soft tMVN distribution admits a smooth logconcave density on the
dimensional Euclidean space. Although the soft tMVN distribution is supported on the entire dimensional space, it can be made to increasingly concentrate most of its mass on a polyhedron determined by multiple linear inequality constraints, by tweaking a parameter. In fact, we show that the soft tMVN distribution approximates the corresponding tMVN distribution in total variation distance.Recognizing the soft tMVN distribution as the posterior distribution in a pseudologistic regression model, we develop an efficient blocked Gibbs sampler combining the Polya–Gamma data augmentation of Polson, Scott & Windle (2013)
[30] along with a structured multivariate normal sampler from Bhattacharya, Chakraborty, and Malllick (2016) [3]. In contrast, existing Gibbs samplers for a tMVN distribution sample the coordinates oneatatime from their respective full conditional univariate truncated normal distributions [14, 20, 9, 32]. The algorithm of Geweke is implemented in the R package tmvtnorm [35]. While the Gibbs sampling procedure is entirely automated, it is wellrecognized in a broader context that such oneatatime updates can lead to slow mixing, especially if the variables are highly correlated. We have additionally observed numerical instabilities in the R implementation for unconstrained dimensions exceeding 400. While exact Hamiltonian Markov chain (HMC) algorithms to sample from tMVN [27] are also popular, such algorithms are not suitable to sample from the soft tMVN, and leaffrog steps with careful tuning are necessary to obtain good mixing. There also exists acceptreject algorithms for the tMVN distribution that create exact samples from the distribution [4]. The algorithm of Botev is implemented in the R package TruncatedNormal [5]. While exact samples are possible, when the acceptance probability becomes small, either the algorithm slows tremendously or approximate samples are produced. We typically saw small acceptance probabilities in the
R implementation when the constrained dimension exceeded 200. With such motivation, we propose to replace a tMVN distribution with its softened version inside a larger MCMC algorithm and use our sampling strategy for the soft tMVN distribution. In recent years, there has been several instances of such approximate MCMC (aMCMC) [19] algorithms where the exact transition kernel of a Markov chain is replaced by an approximation thereof for computational ease.The soft tMVN distribution can also be used as a prior distribution in Bayesian shapeconstrained regression problems as an alternative to the usual tMVN prior. Like the tMVN distribution, the soft tMVN distribution is conditionally conjugate for the mean in a Gaussian likelihood. The soft tMVN can be viewed as a shrinkage prior which encourages shrinkage towards a linearly constrained region rather than being supported on the region. There is an interesting parallel between the soft tMVN distribution and globallocal shrinkage priors used in sparse regression problems. The globallocal priors replace the point mass (at zero) of the more traditional discrete mixture priors and rather encourage shrinkage towards the origin, with the motivation that a subset of the regression coefficients may have a small but nonnegligible effect. Similarly, the soft tMVN prior favors the shape constraints while allowing for small departures.
The rest of the article is organized as follows. In Section 2, we introduce the soft tMVN distribution as an approximation to the tMVN distribution and discuss its properties. In Section 3, we discuss various strategies to sample from a soft tMVN distribution, including a scalable Gibbs sampler suitable for highdimensional situations. Section 4 contains a number of simulation examples to illustrate the efficacy of the proposed sampler as well as the approximation capability of the soft tMVN distribution. We conclude with a discussion in Section 5.
2 The soft tMVN distribution
Consider a tMVN distribution
(1) 
where , is a positive definite matrix, and is described by linear constraints,
where denotes the sign of the th inequality, and . Without loss of generality, we assume the first coordinates to be constrained; this is mainly for notational convenience and can always be achieved by reordering the variables, if necessary. We also assume throughout that has positive Lebesgue measure, so that the density in (1) is nonsingular on . In the special case where , the th unit vector in (with 1 at the th coordinate and 0 elsewhere), the constraint set reduces to the form mentioned in the introduction. While this is an important motivating example, our approach works more generally for the type of constraints in the above display.
Write, using the convention ,
Our main idea is to replace the indicator functions above with a smoothed or “soft” approximation. A rich class of approximations to the indicator function is provided by sigmoid functions, which are nonnegative, monotone increasing, differentiable, and satisfy and
. The cumulative distribution function of any absolutely continuous distribution on
which is symmetric about zero can be potentially used as a sigmoid function. Here, for reasons to be apparent shortly, we choose to use the logistic sigmoid function , which is the cdf of the logistic distribution. Specifically, define, for ,(2) 
to be a scaled version of . The parameter controls the quality of the approximation, with larger values of providing increasingly better approximations to . In fact, it is straightforward to see that
(3) 
It is also immediate that is an approximation to with the same approximation error.
We are now ready to describe our approximation scheme. Fixing some large and replacing the indicators by their respective sigmoidal approximations in (1), we obtain the approximation to as
(4) 
We refer to as a soft tMVN distribution and generically denote it by . It is immediate to note that is a smooth (infinitely differentiable) density supported on . Further, a simple calculation shows that
i.e., the Hessian matrix of the negative log density is positive definite. This implies that is a logconcave density, which, in particular means is unimodal. We collect these various observations about in Proposition 2.1.
Proposition 2.1.
A proof is provided in the Appendix. The last part of Proposition 2.1 formalizes the intuition that approximates for large by showing that the distance between and converges to 0 as . An inspection of the proof for the approximation will reveal that we haven’t used any particular feature of the logistic function and the argument can be extended to other sigmoid functions.
The approximation result implies that although has a nonzero density at all points in , the effective support is the region for large values of , and a random draw from will fall inside with overwhelmingly large probability. To obtain a more quantitative feel for how the approximation gets better with increasing , we set to be a standard bivariate normal distribution truncated to the first orthant,
(5) 
Figure 1 shows contour plots of (last column) along with those for for various values of , with increasing from left to right. Each row corresponds to a different value of . It is evident that the approximation quickly improves at increases, and stabilizes around . We later show in simulations involving substantial higher dimensions that with continues to provide a reasonable approximation to the corresponding tMVN distribution .
The accurate approximation of the soft tMVN has two important consequences in our opinion. First, for any of the examples discussed in the introduction which require a sample from a tMVN within an MCMC algorithm, a sample from a tMVN can be replaced with a sample from the corresponding soft tMVN distribution; we discuss efficient strategies to sample the tMVN distribution in the next section. Second, the soft tMVN distribution can itself be used as a prior distribution for constrained parameters. As a prior, the soft tMVN replaces the hard constraints imposed by the tMVN with soft constraints, encouraging shrinkage towards the constrained region . Indeed, the soft tMVN distribution can be considered a global shrinkage prior [29] which shrinks vectors towards a prespecified constrained region.
The tMVN prior is conditionally conjugate for a Gaussian likelihood and the soft tMVN prior naturally inherits this conditional conjugacy. Suppose and is assigned a soft tMVN prior. Then,
The conditional conjugacy allows one to fit a conditionally Gaussian model with a soft tMVN prior using standard Gibbs sampling algorithms, provided one can efficiently sample from a soft tMVN distribution. We discuss this in the next section.
3 Sampling from the soft tMVN distribution
3.1 Gibbs sampler in highdimensions
In this subsection, we propose a scalable dataaugmentation blockedGibbs sampler to sample from a soft tMVN distribution. The proposed Gibbs sampler updates the entire vector in a block, unlike oneatatime updates for Gibbs samplers for tMVNs.
Apart from logconcavity, the other nice feature behind our choice of the logistic sigmoid function is that can be recognized as the posterior distribution of a vector of regression parameters in a logistic regression model. To see this, consider the setup of a logistic regression model with binary response and vector of predictors for ,
Assuming a prior on the vector of regression coefficients , the posterior distribution of is given by
If we now set and , then the above density is identical to . The number of constraints plays the role of the sample size, and the ambient dimension indicates the number of the regression parameters in this pseudologistic model. Thus, sampling from is equivalent to sampling from the conditional posterior of regression parameters in a highdimensional logistic regression model, which can be conveniently carried out using the Polya–Gamma data augmentation scheme of Polson, Scott & Windle (2013) [30]. The Polya–Gamma scheme introduces auxiliary variables and performs Gibbs sampling by alternatively sampling from and as follows:
(i) Sample independently for ,
(ii) Sample , with
(6) 
where with th row , , , and .
In (i), PG denotes a Polya–Gamma distribution which can be sampled using the
Bayeslogit package in R [30]. Note that the entirevector is sampled in a block in step (ii). The worstcase complexity of sampling from the multivariate Gaussian distribution in (
6) is . However, exploiting the structure of and , a sample from can be obtained with significantly less cost using a recent algorithm in Bhattacharya et al. (2016) [3] provided and a variate can be cheaply sampled.Define and . Then, a sample from (ii) is obtained by first sampling
(7) 
and setting
(8) 
First, by the Sherman–Woodbury–Morrison formula,
Thus,
(9) 
which only requires solving a system.
Sampling in (7) can be efficiently carried out by adapting the algorithm of Bhattacharya et al. (2016) [3] to the present setting. The steps are:
(a) Sample and .
(b) Set .
(c) Solve .
(d) Set .
It follows from Bhattacharya et al. (2016) [3] that obtained in step (d) has the desired Gaussian distribution. Barring the sampling of in step (a), the remaining steps have a combined complexity of , which can be significantly smaller than when . If is a diagonal matrix, can be trivially sampled with cost. Even for nondiagonal , it is often possible to exploit its structure to cheaply sample from . For example, in the probit and multivariate probit regression context, assumes the form (see Section 4.2),
where is a diagonal matrix and is an (possibly dense) matrix. A sample from is then obtained by
(i) Sample and independently.
(ii) Set and .
Since
is a linear transformation of
which is jointly Gaussian, also has a joint Gaussian distribution. Calculating the covariance matrix of then immediately shows that . Since is diagonal, can be sampled in steps, and the matrix multiplication costs , so that the overall cost is .3.2 Other strategies
In moderate dimensions, it is possible to use a Metropolis (Gaussian) random walk and its various extensions to sample from a soft tMVN distribution. In particular, given that the soft tMVN distribution can be recognized as the posterior distribution in a model with a Gaussian prior, elliptical slice sampling [25] is a viable option.
There is substantial literature on sampling from logconcave distributions using variants of the Metropolis algorithm with strong theoretical guarantees [13, 12, 21, 22, 2]. More recently, Dalalyan (2017) [8] and Durmus & Moulines (2016) [11] provided nonasymptotic bounds on the rate of convergence of unadjusted Langevin Monte Carlo (LMC) algorithms for logconcave target densities. Assuming the target density is proportional to for some convex function , the successive iterates of a firstorder LMC algorithm takes the form
where the s are independent variates and is a stepsize parameter. Clearly, forms a discretetime Markov chain and the results in Dalalyan (2017) [8] and Durmus & Moulines (2016) [11] characterize the rate at which the distribution of converges to the target density in total variation distance. Aside from the nonasymptotic bounds, another key message from their results is that the typical Metropolis adjustment as in Metropolis adjusted Langevin (MALA) [roberts] is not required for logconcave targets. Dalalyan (2017) [8] also provide a secondorder version of the LMC algorithm called LMCO which can incorporate the Hessian . Since both and are analytically tractable, it is possible to use both the LMC and LMCO algorithms to sample from .
Other than MCMC, another possible strategy to sample from is to use a multivariate generalization of the adaptive rejection sampling (ARS) [15].
4 Simulations
In this section, we conduct a number of simulations to empirically illustrate that the soft tMVN distribution continues to provide an accurate approximation to the tMVN distribution in highdimensional situations. These simulations also demonstrate the scalability of the proposed Gibbs sampler.
To begin with, we first justify our continued use of in higher dimensions. In Figure 1, we had provided the contour plots of a bivariate tMVN distribution and its soft tMVN approximation with . As an obvious extension, we now consider the bivariate marginal of , where is drawn from a multivariate normal distribution with mean and with a compound symmetry covariance structure, , truncated to the positive orthant. We consider two choices of , namely and , and provide the contour plots for and in the top and bottom panels of Figure 2 respectively. The contour plots were drawn by collecting samples from the and distributions, and then retaining the first two coordinates in each case to obtain samples from the bivariate marginal. Specifically, we used the rejection sampler of Botev (2017) [4] implemented in the R package TruncatedNormal [5] to draw samples from a tMVN distribution and used our data augmentation Gibbs sampler to sample from the soft tMVN distribution. The figure shows that remains a reasonable choice in higher dimensions, and we henceforth fix throughout.
Next, we provide some numerical summaries in two different settings. Due to the inherent difficulty of comparing two highdimensional distributions, we will compare the marginal densities. Specifically, given densities and on with finite mean, we consider two different measures to compare them. The first one uses the 1^{st} Wasserstein () distance between two distributions, . It is an average distance between the marginals,
(10) 
where denotes the th marginal density of . Our second measure is an average squared distance between the mean vectors for the two densities,
(11) 
with .
We compute and between and for two different covariance structures in . Due to the lack of analytic expressions for the marginals for nondiagonal , we resort to simulations to approximate and . The highest dimension used in our simulations is ; while our sampler can be scaled beyond this, the rejection sampler starts producing warning messages due to incurring small acceptance probabilities.
4.1 ProbitGaussian Process Motivation
For our first example, we consider where the covariance matrix is formed from the Matern kernel [31] and where is either or for . This structure is motivated by a binary Gaussian process classification model. Suppose is a binary response at locations for . This is modeled as , where , , and is the Matern kernel. Letting , the conditional distribution of follows the above where if and if .
For the simulation, set . Let for . We randomly sample from and from and let , , and . We set the smoothness parameter for the Matern kernel at 3/5 and the scale parameter at 1. We then proceed to draw 5000 samples from the tMVN, , using Botev’s rejection sampler and 5000 samples from the soft tMVN, , using our Gibbs sampler. The 5000 samples were collected for our method after discarding 1000 initial samples as burnin and collecting every 100th sample to thin the chain.
Figures 3 and 4 show the marginal density plots of 8 coordinates of based on the 5000 samples for the two values of respectively. The tMVN distribution is shown in blue while the soft tMVN is in pink. It is evident that for both values of , the marginal densities are visually indistinguishable. To obtain an overall summary measure, Figure 5 shows the histogram of , defined in equation (11), (left panel) and , defined in (10), over independent simulations. Both the histograms are tightly centered near the origin, which again suggests the closeness of the tMVN and soft tMVN distributions. As a quick comparison, the value of between and for the current is about for both values of .
4.2 ProbitGaussian Motivation
Our second example assumes where
, is either or for in , is an matrix, and is a diagonal matrix.
This covariance structure is motivated by a univariate/multivariate probit model. The usual univariate probit model has binary response variables
with predictors for . Using the latent variable representation of Albert & Chib [1], where follows a distribution and . Setting a Gaussian prior on , , the joint distribution of follows a Gaussian distribution. Then the conditional posterior of follows the above distribution where , , , , and if and if .The multivariate probit model has data where is a binary response with predictors for . Using data augmentation, where follows a distribution and . Assume that follows a prior. Letting , , and , we can rewrite the model in terms of vectors instead of matrices. Let , , , and . Then follows a Gaussian distribution and the conditional distribution of follows the above where , , , , , and if and if .
For this simulation, we sample and , and then set to the above form. Draw and . Then if , set and if , set . For both , we then proceed to draw 5000 samples from the tMVN, , using Botev’s rejection sampler and 5000 samples from the soft tMVN, , using our Gibbs sampler. The 5000 samples were collected for our method after discarding 1000 initial samples as burnin and collecting every 100th sample to thin the chain.
Figures 6 and 7 show the marginal density plots of 8 coordinates of based on the 5000 samples for the two combinations respectively; as before, the tMVN distribution is shown in blue while the soft tMVN is in pink. We once again see that for both combinations, the marginal densities overlap well. To obtain an overall summary measure, Figure 8 shows the histogram of , defined in equation (11), (left panel) and , defined in (10), over independent simulations. We see that the histogram of and shifts to the right for than for . This shift is expected as the size of the matrix grows; as a point of comparison, in Figure 9, we plot the histogram of between and for the present choice of and see a similar shift.
5 Discussion
In this paper, we have presented the soft tMVN distribution, which provides a smooth approximation to the tMVN distribution with linear constraints. Our theoretical and empirical results suggest that the soft tMVN distribution offers a good approximation to the tMVN distribution in high dimensional situations. Future work will focus on using the soft tMVN distribution as a prior in various shapeconstrained models, with posterior computation aided by the Gibbs sampler developed herein.
Appendix
Proof of Proposition 1
Let , where
and is the normalizing constant. Similarly, let , where
where recall is the sigmoidal function. Bound
where we have used triangle inequality and the fact that . Now, we have
Thus, we have
(12) 
Now, using the inequality (2) and the fact that for numbers ,
we have that
where means for some positive constant , and the expectation is under a distribution. By monotone convergence theorem, the right hand side of the above display converges to 0 as .
References
 [1] Albert, J. H., and Chib, S. Bayesian analysis of binary and polychotomous response data. Journal of the American statistical Association 88, 422 (1993), 669–679.

[2]
Belloni, A., and Chernozhukov, V.
High dimensional sparse econometric models: An introduction.
In
Inverse Problems and HighDimensional Estimation
. Springer, 2011, pp. 121–156.  [3] Bhattacharya, A., Chakraborty, A., and Mallick, B. K. Fast sampling with Gaussian scale mixture priors in highdimensional regression. Biometrika (2016), asw042.
 [4] Botev, Z. The normal law under linear restrictions: simulation and estimation via minimax tilting. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79, 1 (2017), 125–148.
 [5] Botev, Z. I. TruncatedNormal: Truncated Multivariate Normal, 2015. R package version 1.0.
 [6] Cai, B., and Dunson, D. B. Bayesian multivariate isotonic regression splines: Applications to carcinogenicity studies. Journal of the American Statistical Association 102, 480 (2007), 1158–1171.
 [7] Chib, S., and Greenberg, E. Analysis of multivariate probit models. Biometrika 85, 2 (1998), 347–361.
 [8] Dalalyan, A. S. Theoretical guarantees for approximate sampling from smooth and logconcave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79, 3 (2017), 651–676.
 [9] Damien, P., and Walker, S. G. Sampling truncated normal, beta, and gamma densities. Journal of Computational and Graphical Statistics 10, 2 (2001), 206–215.
 [10] Dunson, D. B., and Neelon, B. Bayesian inference on orderconstrained parameters in generalized linear models. Biometrics 59, 2 (2003), 286–295.
 [11] Durmus, A., and Moulines, E. Highdimensional Bayesian inference via the Unadjusted Langevin Algorithm.
 [12] Frieze, A., and Kannan, R. Logsobolev inequalities and sampling from logconcave distributions. The Annals of Applied Probability 9, 1 (1999), 14–26.
 [13] Frieze, A., Kannan, R., and Polson, N. Sampling from logconcave distributions. The Annals of Applied Probability (1994), 812–837.
 [14] Geweke, J. Efficient simulation from the multivariate normal and studentt distributions subject to linear constraints and the evaluation of constraint probabilities, 1991.
 [15] Gilks, W. R., and Wild, P. Adaptive rejection sampling for Gibbs sampling. Applied Statistics (1992), 337–348.
 [16] Girolami, M., and Rogers, S. Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Computation 18, 8 (2006), 1790–1817.
 [17] Holmes, C. C., Held, L., et al. Bayesian auxiliary variable models for binary and multinomial regression. Bayesian analysis 1, 1 (2006), 145–168.
 [18] Johndrow, J., Dunson, D., and Lum, K. Diagonal orthant multinomial probit models. In Artificial Intelligence and Statistics (2013), pp. 29–38.
 [19] Johndrow, J., Mattingly, J., Mukherjee, S., and Dunson, D. Approximations of markov chains and highdimensional bayesian inference. arXiv preprint (2015).

[20]
Kotecha, J. H., and Djuric, P. M.
Gibbs sampling approach for generation of truncated multivariate Gaussian random variables.
In Acoustics, Speech, and Signal Processing, 1999. Proceedings., 1999 IEEE International Conference on (1999), vol. 3, IEEE, pp. 1757–1760.  [21] Lovász, L., and Vempala, S. Fast algorithms for logconcave functions: Sampling, rounding, integration and optimization. In Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on (2006), IEEE, pp. 57–68.
 [22] Lovász, L., and Vempala, S. Simulated annealing in convex bodies and an o*() volume algorithm. Journal of Computer and System Sciences 72, 2 (2006), 392–417.
 [23] Maatouk, H., and Bay, X. Gaussian process emulators for computer experiments with inequality constraints. Mathematical Geosciences 49, 5 (2017), 557–582.
 [24] McCulloch, R. E., Polson, N. G., and Rossi, P. E. A Bayesian analysis of the multinomial probit model with fully identified parameters. Journal of econometrics 99, 1 (2000), 173–193.
 [25] Murray, I., Adams, R., and MacKay, D. Elliptical slice sampling. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (2010), pp. 541–548.
 [26] O’brien, S. M., and Dunson, D. B. Bayesian multivariate logistic regression. Biometrics 60, 3 (2004), 739–746.
 [27] Pakman, A., and Paninski, L. Exact hamiltonian monte carlo for truncated multivariate gaussians. Journal of Computational and Graphical Statistics 23, 2 (2014), 518–542.
 [28] Polasek, W., and Krause, A. The hierarchical tobit model: A case study in Bayesian computing. OperationsResearchSpektrum 16, 2 (1994), 145–154.
 [29] Polson, N. G., and Scott, J. G. Shrink globally, act locally: Sparse Bayesian regularization and prediction. Bayesian statistics 9 (2010), 501–538.
 [30] Polson, N. G., Scott, J. G., and Windle, J. Bayesian inference for logistic models using Pólya–Gamma latent variables. Journal of the American statistical Association 108, 504 (2013), 1339–1349.

[31]
Rasmussen, C. E.
Gaussian processes in machine learning.
In Advanced lectures on machine learning. Springer, 2004, pp. 63–71. 
[32]
RodriguezYam, G., Davis, R. A., and Scharf, L. L.
Efficient Gibbs sampling of truncated multivariate normal with application to constrained linear regression.
Unpublished manuscript (2004).  [33] Tobin, J. Estimation of relationships for limited dependent variables. Econometrica: journal of the Econometric Society (1958), 24–36.
 [34] Wang, J., and Ghosh, S. K. Shape restricted nonparametric regression with Bernstein polynomials. Computational Statistics & Data Analysis 56, 9 (2012), 2729–2741.
 [35] Wilhelm, S., and G, M. B. tmvtnorm: Truncated Multivariate Normal and Student t Distribution, 2015. R package version 1.410.
 [36] Zhang, X., Boscardin, W. J., and Belin, T. R. Bayesian analysis of multivariate nominal measures using multivariate multinomial probit models. Computational statistics & data analysis 52, 7 (2008), 3697–3708.
Comments
There are no comments yet.