Sampling constrained probability distributions using Spherical Augmentation

06/19/2015 ∙ by Shiwei Lan, et al. ∙ University of California, Irvine University of Warwick 0

Statistical models with constrained probability distributions are abundant in machine learning. Some examples include regression models with norm constraints (e.g., Lasso), probit, many copula models, and latent Dirichlet allocation (LDA). Bayesian inference involving probability distributions confined to constrained domains could be quite challenging for commonly used sampling algorithms. In this paper, we propose a novel augmentation technique that handles a wide range of constraints by mapping the constrained domain to a sphere in the augmented space. By moving freely on the surface of this sphere, sampling algorithms handle constraints implicitly and generate proposals that remain within boundaries when mapped back to the original space. Our proposed method, called Spherical Augmentation, provides a mathematically natural and computationally efficient framework for sampling from constrained probability distributions. We show the advantages of our method over state-of-the-art sampling algorithms, such as exact Hamiltonian Monte Carlo, using several examples including truncated Gaussian distributions, Bayesian Lasso, Bayesian bridge regression, reconstruction of quantized stationary Gaussian process, and LDA for topic modeling.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 6

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Many commonly used statistical models in Bayesian analysis involve high-dimensional probability distributions confined to constrained domains. Some examples include regression models with norm constraints (e.g., Lasso), probit, many copula models, and latent Dirichlet allocation (LDA). Very often, the resulting models are intractable and simulating samples for Monte Carlo estimations is quite challenging

(Neal and Roberts, 2008; Sherlock and Roberts, 2009; Neal et al., 2012; Brubaker et al., 2012; Pakman and Paninski, 2013). Although the literature on improving the efficiency of computational methods for Bayesian inference is quite extensive (see, for example, Neal, 1996, 1993; Geyer, 1992; Mykland et al., 1995; Propp and Wilson, 1996; Roberts and Sahu, 1997; Gilks et al., 1998; Warnes, 2001; de Freitas et al., 2001; Brockwell, 2006; Neal, 2011, 2005, 2003; Beal, 2003; Møller et al., 2006; Andrieu and Moulines, 2006; Kurihara et al., 2006; Cappé et al., 2008; Craiu et al., 2009; Welling, 2009; Gelfand et al., 2010; Randal et al., 2007; Randal and P., 2011; Welling and Teh, 2011; Zhang and Sutton, 2011; Ahmadian et al., 2011; Girolami and Calderhead, 2011; Hoffman and Gelman, 2011; Beskos et al., 2011; Calderhead and Sustik, 2012; Shahbaba et al., 2014; Ahn et al., 2013; Lan et al., 2014; Ahn et al., 2014), these methods do not directly address the complications due to constrained target distributions. When dealing with such distributions, MCMC algorithms typically evaluate each proposal to ensure it is within the boundaries imposed by the constraints. Computationally, this is quite inefficient, especially in high dimensional problems where proposals are very likely to miss the constrained domain. Alternatively, one could map the original domain to the entire Euclidean space to remove the boundaries. This approach too is computationally inefficient since the sampler needs to explore a much larger space than needed.

In this paper, we propose a novel method, called Spherical Augmentation, for handling constraints involving norm inequalities (Figure 1). Our proposed method augments the parameter space and maps the constrained domain to a sphere in the augmented space. The sampling algorithm explores the surface of this sphere. This way, it handles constraints implicitly and generates proposals that remain within boundaries when mapped back to the original space. While our method can be applied to all Metropolis-based sampling algorithms, we mainly focus on methods based on Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal, 2011). As discussed by Neal (2011), one could modify standard HMC such that the sampler bounces off the boundaries by letting the potential energy go to infinity for parameter values that violate the constraints. This creates “energy walls” at boundaries. This approach, henceforth called Wall HMC, has limited applications and tends to be computationally inefficient, because the frequency of hitting and bouncing increases exponentially as dimension grows. Byrne and Girolami (2013) discuss an alternative approach for situations where constrained domains can be identified as sub-manifolds. Pakman and Paninski (2013) follow the idea of Wall HMC and propose an exact HMC algorithm specifically for truncated Gaussian distributions with non-holonomic constraints. Brubaker et al. (2012) on the other hand propose a modified version of HMC for handling holonomic constraint . All these methods provide interesting solutions for specific types of constraints. In contrast, our proposed method offers a general and efficient framework applicable to a wide range of problems.

The paper is structured as follows. Before presenting our methods, in Section 2 we provide a brief overview of HMC and one of its variants, namely, Lagrangian Monte Carlo (LMC) (Lan et al., 2014). We then present the underlying idea of spherical augmentation, first for two simple cases, ball type (Section 3.1) and box type (Section 3.2) constraints, then for more general -norm type constraints (Section 3.3), as well as some functional constraints (Section 3.4). In Section 4, we apply the spherical augmentation technique to HMC (Section 4.2) and LMC (Section 4.3) for sampling from constrained target distributions. We evaluate our proposed methods using simulated and real data in Section 5. Finally, Section 6 is devoted to discussion and future directions.

Figure 1: -norm constraints

2 Preliminaries

2.1 Hamiltonian Monte Carlo

HMC improves upon random walk Metropolis (RWM) by proposing states that are distant from the current state, but nevertheless accepted with high probability. These distant proposals are found by numerically simulating Hamiltonian dynamics, whose state space consists of its position

, denoted by the vector

, and its momentum, denoted by the vector . Our objective is to sample from the continuous probability distribution of with the density function . It is common to assume that the fictitious momentum variable , where is a symmetric, positive-definite matrix known as the mass matrix

, often set to the identity matrix

for convenience.

In this Hamiltonian dynamics, the potential energy, , is defined as minus the log density of (plus any constant), that is ; the kinetic energy, for the auxiliary momentum variable is set to be minus the log density of (plus any constant). Then the total energy of the system, Hamiltonian function, is defined as their sum,

(1)

Given the Hamiltonian , the system of evolves according to the following Hamilton’s equations,

(2)

In practice when the analytical solution to Hamilton’s equations is not available, we need to numerically solve these equations by discretizing them, using some small time step . For the sake of accuracy and stability, a numerical method called leapfrog is commonly used to approximate the Hamilton’s equations (Neal, 2011). We usually solve the system for steps, with some step size, , to propose a new state in the Metropolis algorithm, and accept or reject it according to the Metropolis acceptance probability. (See Neal, 2011, for more discussions).

2.2 Lagrangian Monte Carlo

Although HMC explores the target distribution more efficiently than RWM, it does not fully exploit its geometric properties of the parameter space. Girolami and Calderhead (2011) propose Riemannian HMC (RHMC), which adapts to the local Riemannian geometry of the target distribution by using a position-specific mass matrix . More specifically, they set to the Fisher information matrix. In this paper, we mainly use spherical metric instead to serve the purpose of constraint handling. The proposed method can be viewed as an extension to this approach since it explores the geometry of sphere.

Following the argument of Amari and Nagaoka (2000), Girolami and Calderhead (2011) define Hamiltonian dynamics on the Riemannian manifold endowed with metric . With the non-flat metic, the momentum vector becomes and the Hamiltonian is therefore defined as follows:

(3)

Unfortunately the resulting Riemannian manifold Hamiltonian dynamics becomes non-separable since it contains products of and , and the numerical integrator, generalized leapfrog, is an implicit scheme that involves time-consuming fixed-point iterations.

Lan et al. (2014) propose to change the variables and define an explicit integrator for RHMC by using the following equivalent Lagrangian dynamics:

(4)
(5)

where the velocity . Here, is the Christoffel Symbols derived from .

The proposed explicit integrator is time reversible but not volume preserving. Based on the change of variables theorem, one can adjust the acceptance probability with Jacobian determinant to satisfy the detailed balance condition. The resulting algorithm, Lagrangian Monte Carlo (LMC), is shown to be more efficient than RHMC (See Lan et al., 2014, for more details).

Throughout this paper, we express the kinetic energy in terms of velocity, , instead of momentum, (Beskos et al., 2011; Lan et al., 2014).

3 Spherical Augmentation

In this section, we introduce the spherical augmentation technique for handling norm constraints implicitly. We start with two simple constraints: ball type (2-norm) and box type (-norm). Then, we generalize the methodology to arbitrary -norm type constraints for . Finally, we discuss some functional constraints that can be reduced to norm constraints.

Figure 2: Transforming the unit ball to the sphere .

3.1 Ball type constraints

Consider probability distributions confined to the -dimensional unit ball . The constraint is given by restricting the 2-norm of parameters: .

The idea of spherical augmentation is to augment the original -dimensional manifold of unit ball to a hyper-sphere in -dimensional space. This can be done by adding an auxiliary variable to the original parameter to form an extended parameter such that . Next, we identify the lower hemisphere with the upper hemisphere by ignoring the sign of . This way, the domain of the target distribution is changed from the unit ball to the -dimensional sphere, , through the following transformation:

(6)

which can also be recognized as the coordinate map from the Euclidean coordinate chart to the manifold .

After collecting samples using a sampling algorithm (e.g., HMC) defined on the sphere, , we discard the last component and obtain the samples that automatically satisfy the constraint . Note that the sign of does not affect our Monte Carlo estimates. However, after applying the above transformation, we need to adjust our estimates according to the change of variables theorem as follows:

(7)

where as shown in Corollary 1 in Appendix A.1. Here, and are volume elements under the Euclidean metric and the canonical spherical metric respectively.

With the above transformation (6), the resulting sampler is defined and moves freely on while implicitly handling the constraints imposed on the original parameters. As illustrated in Figure 2, the boundary of the constraint, i.e., , corresponds to the equator on the sphere . Therefore, as the sampler moves on the sphere, e.g. from to , passing across the equator from one hemisphere to the other translates to “bouncing back” off the boundary in the original parameter space.

Figure 3: Transforming the hyper-rectangle to the sphere .

3.2 Box type constraints

Many constraints are given by both lower and upper bounds. Here we focus on a special case that defines a hyper-rectangle ; other box type constraints can be transformed to this hyper-rectangle. This constrained domain can be mapped to the unit ball and thus reduces to the ball type constraint discussed in Section 3.1. However, a more natural approach is to use spherical coordinates, which directly map the hyper-rectangle to the sphere ,

(8)

Therefore, we use as the spherical coordinate chart for the manifold . Instead of being appended with an extra dimension as in Section 3.1, here is treated as the spherical coordinates of the point with .

After obtaining samples on the sphere , we transform them back to in the original constrained domain using the following inverse mapping of (8):

(9)

Similarly, we need to adjust the estimates based on the following change of variables formula:

(10)

where as shown Proposition A.3 in Appendix A.3. Here, and are volume elements under the Euclidean metric and the round spherical metric respectively.

With the above transformation (8), we can derive sampling methods on the sphere to implicitly handle box type constraints. As illustrated in Figure 3, the red vertical boundary of collapses to the north pole of , while the green vertical boundary collapses to the south pole. Two blue horizontal boundaries are mapped to the same prime meridian of shown in blue color. As the sampler moves freely on the sphere , the resulting samples automatically satisfy the original constraint after being transformed back to the original domain.

Figure 4: Transforming -norm constrained domain to unit ball. Left: from unit cube to unit ball ; Right from general -norm domain to unit ball .

3.3 General -norm constraints

The ball and box type constraints discussed in previous sections are in fact special cases of more general -norm constraints with set to 2 and respectively. In general, these constraints are expressed in terms of -norm of the parameter vector ,

(11)

This class of constraints is very common in statistics and machine learning. For example, when are regression parameters,

corresponds to the ridge regression and

corresponds to Lasso (Tibshirani, 1996).

Denote the domain constrained by general -norm as . It could be quite challenging to sample probability distributions defined on (see Figure 1). To address this issue, we propose to transform to the unit ball so that the method discussed in Section 3.1 can be applied. As before, sampling methods defined on the sphere generate samples that automatically fall within . Then we transform those samples back to the -norm domain, , and adjust the estimates with the following change of variables formula:

(12)

where . In the following, we introduce the bijective mappings between and and specify the associated Jacobian determinants .

3.3.1 Norm constraints with

For , -norm domain can be transformed to the unit ball bijectively via the following map (illustrated by the left panel of Figure 4):

(13)

The Jacobian determinant of is . See Appendix B for more details.

3.3.2 Norm constraints with

When , the norm inequality defines a unit hypercube, , from which the more general form, hyper-rectangle, , can be obtained by proper shifting and scaling. The unit hypercube can be transformed to its inscribed unit ball through the following map (illustrated by the right panel of Figure 4):

(14)

The Jacobian determinant of is . More details can be found in Appendix B.

3.4 Functional constraints

Many statistical problems involve functional constraints. For example, Pakman and Paninski (2013) discuss linear and quadratic constraints for multivariate Gaussian distributions. Since the target distribution is truncated Gaussian, Hamiltonian dynamics can be exactly simulated and the boundary-hitting time can be analytically obtained. However, finding the hitting time and reflection trajectory is computationally expensive. Some constraints of this type can be handled by the spherical augmentation method more efficiently. Further, our method can be use for sampling from a wide range of distributions beyond Gaussian.

Figure 5: Sampling from a Gaussian distribution (first row) and a damped sine wave distribution (second row) with linear constraints. First column shows the true distributions. The exact HMC method of Pakman and Paninski (2013) is shown in the second column. The last two columns show our proposed methods.

3.4.1 Linear constraints

In general, linear constraints can be written as , where is matrix, is a -vector, and the boundaries and are both -vectors. Here, we assume and is invertible. (Note that we generally do not have .) Instead of sampling directly, we sample with the box type constraint: . Now we can apply our proposed method to sample and transform it back to . In this process, we use the following change of variables formula:

(15)

where .

Figure 5 illustrates that both exact HMC (Pakman and Paninski, 2013) and HMC with spherical augmentation can handle linear constraints, here , and , imposed on a 2d Gaussian distribution with and (first row). However, the exact HMC is not applicable to more complicated distributions such as the damped sine wave distribution (second row in Figure 5) with the following density:

(16)

However, it is worth noting that for truncated Gaussian distributions, the exact HMC method of Pakman and Paninski (2013) can handle a wider range of linear constraints compared to our method.

3.4.2 Quadratic constraints

General quadratic constraints can be written as , where are scalars. We assume symmetric and positive definite. By spectrum theorem, we have the decomposition , where

is an orthogonal matrix and

is a diagonal matrix of eigenvalues of

. By shifting and scaling, , we only need to consider the ring type constraints for ,

(17)

which can be mapped to the unit ball as follows:

(18)

We can then apply our proposed method in Section 3.1 to obtain samples in and transform them back to the original domain with the following inverse operation of (18):

(19)

In this process, we need the change of variables formula

(20)

where , .

3.4.3 More general constraints

We close this section with some comments on more general types of constraints. In some problems, several parameters might be unconstrained, and the type of constraints might be vary across the constrained parameters. In such cases, we could group the parameters into blocks and update each block separately using the methods discussed in this section. When dealing with one-sided constraints, e.g. , one can map the constrained domain to the whole space and sample the unconstrained parameter , where . Alternatively, the one-sided constraint can be changed to a two-sided constraint for by setting .

4 Monte Carlos with Spherical Augmentation

In this section, we show how the idea of spherical augmentation can be used to improve Markov Chain Monte Carlo methods applied to constrained probability distributions. In particular, we focus on two state-of-the-art sampling algorithms, namely Hamiltonian Monte Carlo

(Duane et al., 1987; Neal, 2011), and Lagrangian Monte Carlo(Lan et al., 2014). Note however that our proposed method is generic so its application goes beyond these two algorithms.

4.1 Common settings

Throughout this section, we denote the original parameter vector as , the constrained domain as , the coordinate vector of sphere as . All the change of variables formulae presented in the previous section can be summarized as

(21)

where is the Jacobian determinant of the mapping and is some spherical measure.

For energy based MCMC algorithms like HMC, RHMC and LMC, we need to investigate the change of energy under the above transformation. The original potential energy function should be transformed to the following

(22)

Consequently the total energy in (1) becomes

(23)

The gradient of potential energy , metric and natural gradient (preconditioned gradient) under the new coordinate system can be calculated as follows

(24)
(25)
(26)

4.2 Spherical Hamiltonian Monte Carlo

We define HMC on the sphere in two different coordinate systems: the Cartesian coordinate and the spherical coordinate. The former is applied to ball type constraints or those that could be converted to ball type constraints; the later is more suited for box type constraints. Besides the merit of implicitly handling constraints, HMC on sphere can take advantage of the splitting technique (Beskos et al., 2011; Shahbaba et al., 2014; Byrne and Girolami, 2013) to further improve its computational efficiency.

4.2.1 Spherical HMC in the Cartesian coordinate

We first consider HMC for the target distribution with density defined on the unit ball endowed with the Euclidean metric . The potential energy is defined as . Associated with the auxiliary variable (i.e., velocity), we define the kinetic energy for , which is a -dimensional vector sampled from the tangent space of . Therefore, the Hamiltonian is defined on as

(27)

Under the transformation in (6), the above Hamiltonian (27) on will be changed to the follwing Hamiltonian on as in (23):

(28)

where the potential energy (i.e., the distribution is fully defined in terms of the original parameter , which are the first elements of ), and is the canonical spherical metric.

Viewing as the Euclidean coordinate chart of manifold , we have the logorithm of volume adjustment, (See Appendix A.1). The last two terms in Equation (28) is the minus log density of (See Girolami and Calderhead, 2011; Lan et al., 2014, for more details). However, the derivative of log volume adjustment, , contributes an extremely large component to the gradient of energy around the equator (), which in turn increases the numerical error in the discretized Hamiltonian dynamics. For the purpose of numerical stability, we instead consider the following partial Hamiltonian and leave the volume adjustment as weights to adjust the estimation of integration (21):

(29)

If we extend the velocity as with , then falls in the tangent space of the sphere, . Therefore, . As a result, the partial Hamiltonian (29) can be recognized as the standard Hamiltonian (27) in the augmented dimensional space

(30)

This is due to the energy invariance presented as Proposition A.1 in Appendix A. Now we can sample the velocity and set . Alternatively, since is idempotent, we can sample by with .

The Hamiltonian function (29) can be used to define the Hamiltonian dynamics on the Riemannian manifold in terms of , or equivalently as the following Lagrangian dynamics in terms of (Lan et al., 2014):

(31)

where are the Christoffel symbols of second kind derived from . The Hamiltonian (29) is preserved under Lagrangian dynamics (31). (See Lan et al., 2014, for more discussion).

Byrne and Girolami (2013) split the Hamiltonian (29) as follows:

(32)

However, their approach requires the manifold to be embedded in the Euclidean space. To avoid this assumption, instead of splitting the Hamiltonian dynamics of , we split the corresponding Lagrangian dynamics (31) in terms of as follows (See Appendix C for more details):

(33a)
(33b)

Note that the first dynamics (33a) only involves updating velocity in the tangent space and has the following solution (see Appendix C for more details):

(34)

The second dynamics (33b) only involves the kinetic energy and has the geodesic flow that is a great circle (orthodrome or Riemannian circle) on the sphere as its analytical solution (See Appendix A.2 for more details):

(35)

This solution defines an evolution, denoted as . Both (34) and (35) are symplectic. Due to the explicit formula for the geodesic flow on sphere, the second dynamics in (33b) is simulated exactly. Therefore, updating does not involve discretization error so we can use large step sizes. This could lead to improved computational efficiency. Because this step is in fact a rotation on sphere, it can generate proposals that are far away from the current state. Algorithm 1 shows the steps for implementing this approach, henceforth called Spherical HMC in the Cartesian coordinate (c-SphHMC). It can be shown that the integrator in the algorithm has order 3 local error and order 2 global error (See the details in Appendix D).

Initialize at current after transformation
Sample a new velocity value
Set
Calculate
for  to  do
   
   
   
   
end for
Calculate
Calculate the acceptance probability
Accept or reject the proposal according to for the next state
Calculate and the corresponding weight
Algorithm 1 Spherical HMC in the Cartesian coordinate (c-SphHMC)

4.2.2 Spherical HMC in the spherical coordinate

Now we define HMC on the sphere in the spherical coordinate . The natural metric on the sphere induced by the coordinate mapping (8) is the round spherical metric111Note, ., .

As in Section 4.2.1, we start with the usual Hamiltonian defined on as in (27) with . Under the transformation in (8), Hamiltonian (27) on is changed to the following Hamiltonian as in (23):

(36)

where the potential energy and is the round spherical metric.

As before, the logorithm of volume adjustment is (See Appendix A.3). The last two terms in Equation (36) is the minus log density of . Again, for numerical stability we consider the following partial Hamiltonian and leave the volume adjustment as weights to adjust the estimation of integration (21):

(37)

Taking derivative of in (8) with respect to time we have

(38)

We can show that ; that is, . Taking derivative of in (9) with respect to time yields

(39)

Further, we have . Therefore, the partial Hamiltonian (37) can be recognized as the standard Hamiltonian (27) in the augmented dimensional space, which is again explained by the energy invariance Proposition A.1 (See more details in Appendix A)

(40)

Similar to the method discussed in Section 4.2.1, we split the Hamiltonian (37), , and its corresponding Lagrangian dynamics (31) as follows:

(41a)
(41b)

The first dynamics (41a) involves updating the velocity only. However, the diagonal term of , increases exponentially fast as dimension grows. This will cause the velocity updated by (41a) to have extremely large components. To avoid such issue, we use small time vector , instead of scalar , in updating Equation (41a). The second dynamics (41b) describes the same geodesic flow on the sphere as (33b) but in the spherical coordinate . Therefore it should have the same solution as (35) expressed in . To obtain this solution, we first apply , which consists of (8)(38). Then, we use in (35) to evolve for some time to find . Finally, we use , composite of (9)(39), to go back to .

Initialize at current after transformation
Sample a new velocity value
Set ,  
Calculate