# Bayesian Lasso: Concentration and MCMC Diagnosis

Using the posterior distribution of the Bayesian LASSO, we construct a semi-norm on the parameter space. We show that the partition function depends on the ratio of the $\ell_1$ and $\ell_2$ norms and identify three regimes. We derive the concentration of the Bayesian LASSO and present an MCMC convergence diagnosis.

Keywords: LASSO, Bayes, MCMC, log-concave, geometry, incomplete Gamma function


## 1 Introduction

Let $n$ and $p$ be two positive integers, and let $A$ be an $n \times p$ matrix with real entries. The Bayesian LASSO

$$c(x) = \frac{1}{Z}\exp\Big(-\frac{\|Ax-y\|_2^2}{2} - \|x\|_1\Big) \tag{1}$$

is a typical posterior distribution used in the linear regression model

$$y = Ax + w.$$

Here

$$Z = \int_{\mathbb{R}^p}\exp\Big(-\frac{\|Ax-y\|_2^2}{2} - \|x\|_1\Big)\,dx \tag{2}$$

is the partition function, and $\|\cdot\|_2$, $\|\cdot\|_1$ are respectively the Euclidean and the $\ell_1$ norms. The vector $y \in \mathbb{R}^n$ contains the observations, $x \in \mathbb{R}^p$ is the unknown signal to recover, $w$ is standard Gaussian noise, and $A$ is a known matrix which maps the signal domain $\mathbb{R}^p$ into the observation domain $\mathbb{R}^n$. If we suppose that $x$ is drawn from the Laplace distribution, i.e. the distribution proportional to

$$\exp(-\|x\|_1), \tag{3}$$

then the posterior distribution of $x$ given $y$ is (1). The mode

$$\operatorname{argmin}\Big\{\frac{\|Ax-y\|_2^2}{2} + \|x\|_1 : x \in \mathbb{R}^p\Big\} \tag{4}$$

of $c$ was first introduced in Tibshirani1996 and called the LASSO. It is also called the Basis Pursuit De-Noising method Chen. We adopt the term LASSO for the rest of the article.

In general the LASSO is not a singleton, i.e. the mode of the distribution $c$ is not unique. In this case the LASSO is a set, and we will denote by lasso any element of this set. A large number of theoretical results have been provided for the LASSO; see Daubechies2004, DDN, Fort, Mendoza, Pereyra and the references therein. The most popular algorithms to find the LASSO are the LARS algorithm Efron and the ISTA and FISTA algorithms; see e.g. Beck and the review article Parikh.
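As an illustration, a lasso point (4) can be approximated by ISTA in a few lines. This is a minimal sketch, not the algorithm used in the cited works: the step size, iteration count, and unit $\ell_1$ penalty weight matching (4) are our own choices.

```python
import numpy as np

def soft_threshold(v, t):
    # proximal operator of t * ||.||_1, applied componentwise
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, lam=1.0, n_iter=500):
    # minimize ||A x - y||_2^2 / 2 + lam * ||x||_1 by proximal gradient steps
    L = np.linalg.norm(A, 2) ** 2      # Lipschitz constant of the smooth part
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)       # gradient of the quadratic term
        x = soft_threshold(x - grad / L, lam / L)
    return x

# with A = I the lasso is the explicit soft-thresholding of y
print(ista(np.eye(2), np.array([3.0, 0.5])))  # -> close to [2. 0.]
```

With the identity design matrix the iteration reaches the closed-form soft-thresholding solution after one step, which makes the sketch easy to sanity-check.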

The aim of this work is to study the geometry of the Bayesian LASSO and to derive an MCMC convergence diagnosis.

## 2 Polar integration

Using polar coordinates $x = rs$, with $r > 0$ and $\|s\| = 1$, the partition function (2) becomes

$$Z_p = \int_S Z_p(s)\,ds, \tag{5}$$

where $\|\cdot\|$ denotes either the $\ell_1$ or the $\ell_2$ norm on $\mathbb{R}^p$, $ds$ denotes the surface measure on the unit sphere $S$ of the norm $\|\cdot\|$, and

$$Z_p(s) = \int_0^{+\infty}\exp\{-f(rs)\}\,r^{p-1}\,dr, \tag{6}$$

where $f(x) = \frac{\|Ax-y\|_2^2}{2} + \|x\|_1$.

We express the partition function (6) using the parabolic cylinder function. We also give a concentration inequality and a geometric interpretation of the partition function $Z_p$.

## 3 Parabolic cylinder function and partition function

We extend the function $Z_p$ to

$$x \in \mathbb{R}^p \to Z_p(x) = \int_0^{+\infty}\exp\{-f(rx)\}\,r^{p-1}\,dr. \tag{7}$$

This extension is homogeneous of order $-p$: $Z_p(\lambda x) = \lambda^{-p}Z_p(x)$ for all $\lambda > 0$.

If $Ax = 0$, then $f(rx) = \frac{\|y\|_2^2}{2} + r\|x\|_1$; moreover, if in addition $x \neq 0$, then

$$Z_p(x) = (p-1)!\,\|x\|_1^{-p}\exp\Big(-\frac{\|y\|_2^2}{2}\Big).$$

If $Ax \neq 0$, then we will express $Z_p(x)$ using the parabolic cylinder function. We recall that the parabolic cylinder function satisfies

$$D_b(z) = z^b\exp\Big(-\frac{z^2}{4}\Big)\big[1 + O(z^{-2})\big], \tag{8}$$

as $z \to +\infty$ Temme. We also recall the integral representation of Erdélyi Erdélyi for the parabolic cylinder function:

$$\exp\Big(\frac{z^2}{4}\Big)\Gamma(\nu)D_{-\nu}(z) = \int_0^{+\infty}\exp\Big(-\frac{1}{2}r^2 - zr\Big)r^{\nu-1}\,dr, \quad \nu > 0,$$

where $\Gamma$ is the Gamma function.
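Erdélyi's integral representation can be checked numerically; the sketch below uses `scipy.special.pbdv` for $D_{-\nu}$ and quadrature for the right-hand side. The test values $\nu = 2$, $z = 1$ are arbitrary choices of ours.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma, pbdv

def lhs(nu, z):
    # exp(z^2/4) * Gamma(nu) * D_{-nu}(z); pbdv(v, x) returns (D_v(x), D_v'(x))
    return np.exp(z**2 / 4) * gamma(nu) * pbdv(-nu, z)[0]

def rhs(nu, z):
    # integral_0^inf exp(-r^2/2 - z*r) * r^(nu - 1) dr
    val, _ = quad(lambda r: np.exp(-0.5 * r**2 - z * r) * r**(nu - 1), 0, np.inf)
    return val

assert abs(lhs(2.0, 1.0) - rhs(2.0, 1.0)) < 1e-8
```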

###### Proposition 3.1

The variable

$$\omega_{\mathrm{lasso}} := \frac{\|x\|_1 - \langle Ax, y\rangle}{\|Ax\|_2} \tag{9}$$

will play an important role. It is homogeneous of order $0$, so it depends only on the direction $s = x/\|x\|_1$; on the unit sphere of the $\ell_1$ norm the function $s \to \omega_{\mathrm{lasso}}(s)$ is bounded below by $\frac{1}{\lambda_{1,2}} - \|y\|_2$, where $\lambda_{1,2} = \sup\{\|As\|_2 : \|s\|_1 = 1\}$.

Now we can announce the following result.

###### Proposition 3.2

We have, for $Ax \neq 0$,

$$Z_p(x) = (p-1)!\exp\Big(-\frac{\|y\|_2^2}{2}\Big)\|Ax\|_2^{-p}\exp\Big(\frac{\omega_{\mathrm{lasso}}^2}{4}\Big)D_{-p}(\omega_{\mathrm{lasso}}).$$

If $Ax \to 0$ (so that $\omega_{\mathrm{lasso}} \to +\infty$), then $\|Ax\|_2\,\omega_{\mathrm{lasso}} \to \|x\|_1$ and

$$Z_p(x) = (p-1)!\exp\Big(-\frac{\|y\|_2^2}{2}\Big)\|x\|_1^{-p}\big[1 + O(\omega_{\mathrm{lasso}}^{-2})\big].$$
###### Corollary 3.3

If $y = 0$, then $\omega_{\mathrm{lasso}} = \|s\|_1/\|As\|_2$ is bounded below by $1/\lambda_{1,2}$ on the unit sphere of the $\ell_1$ norm, where $\lambda_{1,2}$ is the norm of the operator $A : (\mathbb{R}^p, \|\cdot\|_1) \to (\mathbb{R}^n, \|\cdot\|_2)$. The partition function

$$Z_p(s) = (p-1)!\,\omega_{\mathrm{lasso}}^{p}\exp\Big(\frac{\omega_{\mathrm{lasso}}^2}{4}\Big)D_{-p}(\omega_{\mathrm{lasso}})$$

is convex and decreasing as a function of $1/\omega_{\mathrm{lasso}}^2$.

###### Proof 3.4

It suffices to remark that, when $y = 0$ and $\|s\|_1 = 1$, we have $f(rs) = \frac{r^2}{2\omega_{\mathrm{lasso}}^2} + r$, so that

$$Z_p(s) = \int_0^{+\infty}\exp\Big\{-\frac{r^2}{2\omega_{\mathrm{lasso}}^2} - r\Big\}r^{p-1}\,dr,$$

which is a convex and decreasing function of $1/\omega_{\mathrm{lasso}}^2$.

## 4 Geometric interpretation of the partition function

First we represent $f(rx)$, for $Ax \neq 0$, in the form

$$f(rx) = \frac{\|y\|_2^2}{2} - \frac{\omega_{\mathrm{lasso}}^2}{2} + \frac{(r\|Ax\|_2 + \omega_{\mathrm{lasso}})^2}{2}. \tag{10}$$

The function $\exp(-f)$ is log-concave and integrable on $\mathbb{R}^p$. Observe that $x \to Z_p(x)^{-1/p}$ is, up to a constant factor, the norm $\|x\|_1$ on the null space of $A$. A general result Ball tells us that

$$x \in \mathbb{R}^p \to Z_p(x)^{-1/p} =: \|x\|_c$$

is a quasi-norm on $\mathbb{R}^p$. The unit ball of $\|\cdot\|_c$ is

$$\begin{aligned}B(A, y) &:= \{x \in \mathbb{R}^p : \|x\|_c \le 1\} = \{x \in \mathbb{R}^p : Z_p(x) \ge 1\} = \{x = rs \in \mathbb{R}^p : Z_p(s) \ge r^p\}\\ &= \Big\{x = rs \in \mathbb{R}^p : (p-1)!\exp\Big(-\frac{\|y\|_2^2}{2}\Big)\|As\|_2^{-p}\exp\Big(\frac{\omega_{\mathrm{lasso}}^2}{4}\Big)D_{-p}(\omega_{\mathrm{lasso}}) \ge r^p\Big\}.\end{aligned}$$

Its contour is

$$\begin{aligned}C(A, y) &:= \{x \in \mathbb{R}^p : \|x\|_c = 1\} = \{x \in \mathbb{R}^p : Z_p(x) = 1\} = \{x = rs \in \mathbb{R}^p : Z_p(s) = r^p\}\\ &= \Big\{x = rs \in \mathbb{R}^p : (p-1)!\exp\Big(-\frac{\|y\|_2^2}{2}\Big)\|As\|_2^{-p}\exp\Big(\frac{\omega_{\mathrm{lasso}}^2}{4}\Big)D_{-p}(\omega_{\mathrm{lasso}}) = r^p\Big\}.\end{aligned}$$

We summarize our results in the following proposition.

###### Proposition 4.1

1) For each direction $s$, the longest segment contained in $B(A, y)$ along $s$ is $[0, r_{\max}(s)]\,s$, where $r_{\max}(s)$ is the solution of the equation $Z_p(s) = r^p$, i.e. $r_{\max}(s) = Z_p(s)^{1/p}$.

2) The ball

$$B(A, y) = \bigcup_{s \in S_{p-1,1}}[0, r_{\max}(s)]\,s,$$

and its contour is

$$C(A, y) = \{r_{\max}(s)\,s : s \in S_{p-1,1}\}.$$

3) The volume of $B(A, y)$ is

$$\int_{S_{p-1,1}}\frac{r_{\max}^p(s)}{p}\,ds = \frac{Z_p}{p}.$$

## 5 Necessary and sufficient condition to have lasso equal zero

Now we can give the necessary and sufficient condition to have $\mathrm{lasso} = 0$.

###### Proposition 5.1

The following assertions are equivalent.

1) $\mathrm{lasso} = 0$.

2) $\langle Ax, y\rangle \le \|x\|_1$ for all $x \in \mathbb{R}^p$, i.e. $\omega_{\mathrm{lasso}}(s) \ge 0$ for all directions $s$.

3) $\|A^\top y\|_\infty \le 1$.

## 6 Concentration around the lasso

### 6.1 The case lasso null

The polar coordinates formula tells us that we can draw a vector $x = rs$ from $c$ by drawing its angle $s$ uniformly, and then simulating its distance $r$ to the origin from

$$c(r, s)\,dr = \frac{1}{Z_p(s)}\exp\{-f(rs)\}\,r^{p-1}\,dr. \tag{11}$$

Now let us estimate, for $r > 0$, the probability

$$P(\|x\| > r) = \int_S\int_r^{+\infty}c(r, s)\,dr\,\frac{ds}{|S|},$$

where $|S|$ denotes the Lebesgue measure of $S$. We introduce for each pair $(a, b)$ the function

$$g_{a,b,p}(r) := g_{a,b}(r) - (p-1)\ln(r), \quad r > 0, \tag{12}$$

where $g_{a,b}(r) := \frac{(ra + b)^2}{2}$. In the following we take

$$a = \|As\|_2, \qquad b = \omega_{\mathrm{lasso}}.$$

In the case $\mathrm{lasso} = 0$ we have $b = \omega_{\mathrm{lasso}} \ge 0$, so the function $g_{a,b}$ is increasing. The function $g_{a,b,p}$ is convex and attains its minimum at the point $r(s)$ that solves the equation

$$\|As\|_2\big(r\|As\|_2 + \omega_{\mathrm{lasso}}\big) = \frac{p-1}{r}.$$

The positive root is given by

$$r(s) = \frac{-\omega_{\mathrm{lasso}} + \sqrt{\omega_{\mathrm{lasso}}^2 + 4(p-1)}}{2\|As\|_2}. \tag{13}$$
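The critical point (13) is just the positive root of a quadratic in $r$, which is easy to check numerically; the values of $\|As\|_2$, $\omega_{\mathrm{lasso}}$ and $p$ below are arbitrary test choices of ours.

```python
import numpy as np

def r_star(a, omega, p):
    # positive root of a * (r * a + omega) = (p - 1) / r,
    # i.e. of a^2 r^2 + a * omega * r - (p - 1) = 0
    return (-omega + np.sqrt(omega**2 + 4 * (p - 1))) / (2 * a)

a, omega, p = 1.5, 0.7, 10          # arbitrary test values
r = r_star(a, omega, p)
assert abs(a * (r * a + omega) - (p - 1) / r) < 1e-10
```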

On the one hand,

$$\int_0^{+\infty}\exp\{-g_{a,b,p}(r)\}\,dr \ge \exp\{-g_{a,b}(r(s))\}\int_0^{r(s)}r^{p-1}\,dr = \exp\{-g_{a,b}(r(s))\}\frac{r(s)^p}{p}.$$

On the other hand, using the convexity of $g_{a,b}$, we have for all $r$,

$$g_{a,b}(r) \ge g_{a,b}(r(s)) + \frac{(p-1)(r - r(s))}{r(s)},$$

because $g_{a,b}'(r(s)) = \frac{p-1}{r(s)}$. We deduce, for $q > 1$,

$$\begin{aligned}\int_{qr(s)}^{+\infty}\exp\{-g_{a,b,p}(r)\}\,dr &\le \exp\{-g_{a,b}(r(s)) + p - 1\}\int_{qr(s)}^{+\infty}\exp\Big\{-\frac{p-1}{r(s)}r\Big\}r^{p-1}\,dr\\ &\le \exp\{-g_{a,b}(r(s)) + p - 1\}\frac{r(s)^p}{(p-1)^p}\int_{q(p-1)}^{+\infty}\exp(-r)\,r^{p-1}\,dr\\ &\le \exp\{-g_{a,b}(r(s)) + p - 1\}\frac{r(s)^p}{(p-1)^p}\,\Gamma(p, q(p-1)),\end{aligned}$$

where $\Gamma(\cdot, \cdot)$ is the upper incomplete gamma function. Finally we get the following result.

###### Proposition 6.1

We have, for all $q > 1$,

$$P(\|x\| \ge q\,r(s)) \le \frac{p\exp(p-1)}{(p-1)^p}\,\Gamma(p, q(p-1)) := P(q, p). \tag{14}$$

Using the following estimate Natalini,

$$x^{a-1}\exp(-x) < \Gamma(a, x) < B\,x^{a-1}\exp(-x), \quad a > 1,\ B > 1,\ x > \frac{B}{B-1}(a-1),$$

we get with $B = 2$, for $q > 2$,

$$\Gamma(p, q(p-1)) \le 2\,q^{p-1}(p-1)^{p-1}\exp(-q(p-1)).$$

Therefore

$$P(q, p) \le \frac{2p\,q^{p-1}}{p-1}\exp\{-(q-1)(p-1)\}.$$
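The bound (14) and its simplification can be evaluated with `scipy`, writing the upper incomplete gamma function as `gamma(a) * gammaincc(a, x)`; the values of $q$ and $p$ below are arbitrary test choices of ours.

```python
import numpy as np
from scipy.special import gamma, gammaincc

def P_exact(q, p):
    # P(q, p) = p * exp(p - 1) * Gamma(p, q * (p - 1)) / (p - 1)^p
    return p * np.exp(p - 1) * gamma(p) * gammaincc(p, q * (p - 1)) / (p - 1)**p

def P_upper(q, p):
    # simplified bound 2 p q^(p-1) exp(-(q - 1)(p - 1)) / (p - 1), for q > 2
    return 2 * p * q**(p - 1) * np.exp(-(q - 1) * (p - 1)) / (p - 1)

for q in (2.5, 3.0, 4.0):
    assert P_exact(q, 10) <= P_upper(q, 10)
```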

Summary: if $x$ is drawn from the density $c$, then $\|x\| < q\,r(s)$ with probability at least equal to $1 - P(q, p)$.

In Figure 1 we plot the density $c(\cdot, s)$ for a fixed value of $s$.

We notice that its mode is very close to the value $r(s)$ of (13) for the same fixed $s$.

### 6.2 The general case

We take a vector $l \in \mathrm{LASSO}$ and study the concentration of $x$ around $l$. The variable of interest is $u = x - l$. The law of $u$ has density

$$c(u + l)\,du = \frac{1}{Z_p}\exp\{-f(u + l)\}\,du.$$

The change of variables $u = r\theta$ gives, for each norm,

$$c(u + l)\,du = \frac{1}{Z_p}\exp\{-f(r\theta + l)\}\,r^{p-1}\,dr\,d\theta, \quad r > 0,\ \theta \in S.$$

By definition of the lasso, for any direction $\theta$, the convex function $r \to f(r\theta + l)$ reaches its minimum at the point $r = 0$. Therefore $r \to f(r\theta + l)$ is increasing.

The function

$$f(rs + l, p) := f(rs + l) - (p-1)\ln(r), \quad r > 0, \tag{15}$$

is strictly convex. Its critical point $r_l(s)$ is the solution of the equation

$$\partial_r f(rs + l) = \frac{p-1}{r}.$$

By a proof similar to that of Proposition 6.1 we have the following result.

###### Proposition 6.2

If $x$ is drawn from the density $c$ and $l \in \mathrm{LASSO}$, then for all $q > 1$,

$$P(\|x - l\| \ge q\,r_l(s)) \le \frac{p\exp(p-1)}{(p-1)^p}\,\Gamma(p, q(p-1)) := P(q, p). \tag{16}$$

## 7 Applications

### 7.1 The contour in the case p=2, n=1

Let $A = (a_1, a_2)$ be a matrix of order $1 \times 2$, with null space $\mathrm{Ker}(A)$. We have that $B(A, y)$ contains

$$\mathrm{Ker}(A)\bigcap B_{2,1}.$$

This intersection is a symmetric segment denoted $[-u, u]$.

To determine the other points of the set $B(A, y)$, we will directly calculate $Z_2(s)$. A simple calculation gives

$$Z_2(s) = \exp\Big(-\frac{y^2}{2} + \frac{\omega_{\mathrm{lasso}}^2}{2}\Big)\int_0^{+\infty}\exp\Big\{-\frac{(|As|r + \omega_{\mathrm{lasso}})^2}{2}\Big\}\,r\,dr,$$

and

$$|As|\int_0^{+\infty}\exp\Big\{-\frac{(|As|r + \omega_{\mathrm{lasso}})^2}{2}\Big\}\,r\,dr + \omega_{\mathrm{lasso}}\int_0^{+\infty}\exp\Big\{-\frac{(|As|r + \omega_{\mathrm{lasso}})^2}{2}\Big\}\,dr = \frac{1}{|As|}\exp\Big(-\frac{\omega_{\mathrm{lasso}}^2}{2}\Big).$$

Finally we have the following proposition.

###### Proposition 7.1

1) If $As \neq 0$, then

$$Z_2(s) = \exp\Big(-\frac{y^2}{2}\Big)|As|^{-2}\Big\{1 - \omega_{\mathrm{lasso}}\exp\Big(\frac{\omega_{\mathrm{lasso}}^2}{2}\Big)\sqrt{2\pi}\big(1 - F(\omega_{\mathrm{lasso}})\big)\Big\},$$

where $F$ is the distribution function of the standard normal law.

2) If $y = 0$ and $\|s\|_1 = 1$, then $\omega_{\mathrm{lasso}} = 1/|As|$ and

$$Z_2(s) = \omega_{\mathrm{lasso}}^2\Big\{1 - \omega_{\mathrm{lasso}}\exp\Big(\frac{\omega_{\mathrm{lasso}}^2}{2}\Big)\sqrt{2\pi}\big(1 - F(\omega_{\mathrm{lasso}})\big)\Big\}.$$

3) If $y = 0$, $\|s\|_1 = 1$, and $b = 1/\omega_{\mathrm{lasso}} = |As|$, then the function

$$z_2(b^2) = \frac{1}{b^2}\Big\{1 - \frac{1}{b}\exp\Big(\frac{1}{2b^2}\Big)\sqrt{2\pi}\Big(1 - F\Big(\frac{1}{b}\Big)\Big)\Big\}$$

defined on $(0, \lambda_{1,2}^2]$ is convex and decreasing.

4) We have, for $y = 0$,

$$Z_2(s) = z_2\Big(\frac{1}{\omega_{\mathrm{lasso}}^2}\Big), \quad \forall s \in S_{1,1}.$$

The ball

$$B(A, 0) = \{rs : Z_2(s) \ge r^2\}$$

is contained in the unit disk of the $\ell_1$ norm. The contour is defined by the equation

$$Z_2(s) = r^2.$$

The norm of the linear operator $A : (\mathbb{R}^2, \|\cdot\|_1) \to (\mathbb{R}, |\cdot|)$ is

$$\lambda_{1,2} = \sup_{s : \|s\|_1 = 1}\|As\|_2.$$

The function $s \to Z_2(s)$ is constant on

$$\Omega_{1,2} = \{s : \|s\|_1 = 1,\ \|As\|_2 = \lambda_{1,2}\}.$$

If $a_1 = a_2$, then

$$\Omega_{1,2} = \{s : \|s\|_1 = 1,\ \|As\|_2 = \lambda_{1,2}\} = [(1, 0), (0, 1)]\bigcup[(-1, 0), (0, -1)].$$

If $|a_1| < |a_2|$ with $a_2 \neq 0$, then

$$\Omega_{1,2} = \{s : \|s\|_1 = 1,\ \|As\|_2 = \lambda_{1,2}\} = \{(0, \mathrm{sgn}(a_2)), (0, -\mathrm{sgn}(a_2))\}.$$

In both cases

$$\big\{z_2(\lambda_{1,2}^2)\big\}^{\frac{1}{2}}\,\Omega_{1,2}$$

is part of the contour. The other points of the contour are deduced from the equation

$$z_2(b^2) = r^2, \quad b \in (0, \lambda_{1,2}).$$

Each pair $(r, b)$ generates four points of the contour of the form $rs$, where

$$|s_1| + |s_2| = 1, \qquad |a_1s_1 + a_2s_2| = b.$$

We plot in Figure 2 the contour of $B(A, 0)$ for different choices of the matrix $A$. We notice that the surface of $B(A, 0)$ is a decreasing function of the norm $\lambda_{1,2}$ of the matrix $A$.

###### Remark 7.2

The numerics show that the direct evaluation of $z_2$ explodes for large values of $\omega_{\mathrm{lasso}}$, i.e. when $s$ is close to the null space of $A$. To eliminate that explosion we need to estimate the tail of the Gaussian density. Using the Gordon estimate Gordon,

$$\frac{\exp(-\frac{x^2}{2})}{x + \frac{1}{x}} \le \sqrt{2\pi}\,\big(1 - F(x)\big) \le \frac{\exp(-\frac{x^2}{2})}{x}, \quad x > 0,$$

we have, near $b = 0$, the approximation

$$0 \le z_2(b^2) \le \frac{1}{1 + b^2}. \tag{17}$$
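The Gordon bounds themselves are easy to confirm numerically, using `scipy.stats.norm.sf` for the tail $1 - F$; the grid of test points is our own choice.

```python
import numpy as np
from scipy.stats import norm

def gordon_bounds(x):
    # exp(-x^2/2)/(x + 1/x) <= sqrt(2 pi) (1 - F(x)) <= exp(-x^2/2)/x, x > 0
    mid = np.sqrt(2 * np.pi) * norm.sf(x)
    return np.exp(-x**2 / 2) / (x + 1 / x), mid, np.exp(-x**2 / 2) / x

for x in (0.5, 1.0, 3.0, 5.0):
    lo, mid, up = gordon_bounds(x)
    assert lo <= mid <= up
```

Replacing the tail by these closed-form bounds is exactly what removes the overflow of $\exp(1/(2b^2))$ in $z_2$ for small $b$.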

## 8 MCMC diagnosis

Here we fix $p$, $n$, and the matrix $A$, and for simplicity we consider $y = 0$. We sample from the distribution (1) using the Hastings-Metropolis algorithm and propose the test $\|x^{(t)}\| \le q\,r(s)$ as a criterion for convergence. We recall that if $x$ is drawn from the target distribution $c$, then $\|x\| < q\,r(s)$ with probability at least equal to $1 - P(q, p)$. Table 2 gives the values of the probability $1 - P(q, p)$. Note that for large $q$ the criterion is satisfied with a large probability.

### 8.1 Independent sampler (IS)

The proposal distribution is

$$Q(x_2, x_1) = p(x_2) = \frac{1}{2^p}\exp(-\|x_2\|_1), \quad \forall x_1, x_2.$$

The ratio is bounded:

$$\frac{c(x)}{p(x)} \le \frac{2^p}{Z}, \quad \forall x.$$

It is known that MCMC with the target distribution $c$ and this proposal distribution is uniformly ergodic Mengersen:

$$\sup_{A \subset \mathcal{B}(\mathbb{R}^p)}\Big|P\big(x^{(t)} \in A \mid x^{(0)}\big) - \int_A c(x)\,dx\Big| \le \Big(1 - \frac{Z}{2^p}\Big)^t.$$

Figure 4(a) shows the corresponding plots of the criterion for the independent sampler.
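A minimal sketch of the independent sampler with the Laplace proposal follows; the toy matrix $A$, the observation $y$, and the chain length are our own illustrative choices, not the setup of the experiments above.

```python
import numpy as np

rng = np.random.default_rng(0)

def independent_sampler(A, y, n_iter=5000):
    # Metropolis-Hastings with proposal p(x) = 2^{-p} exp(-||x||_1):
    # the Laplace factors cancel in c(z) p(x) / (c(x) p(z)),
    # leaving only the Gaussian likelihood ratio
    def log_lik(x):
        return -0.5 * np.sum((A @ x - y) ** 2)
    p = A.shape[1]
    x = rng.laplace(size=p)
    chain = np.empty((n_iter, p))
    for t in range(n_iter):
        z = rng.laplace(size=p)       # fresh proposal, independent of x
        if np.log(rng.uniform()) < log_lik(z) - log_lik(x):
            x = z
        chain[t] = x
    return chain

chain = independent_sampler(np.eye(2), np.zeros(2))
```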

### 8.2 Random-walk (RW) Metropolis algorithm

We do not know whether the target distribution $c$ satisfies the curvature condition of Roberts, Section 6. Here we propose to analyse the convergence of the random walk Metropolis algorithm using the criterion $\|x^{(t)}\| \le q\,r(s)$. Figure 4(b) shows the corresponding plots of the criterion for the random walk.
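The corresponding sketch of the random-walk Metropolis algorithm uses a Gaussian step with standard deviation 0.5, the value mentioned in the conclusion; again the toy problem is our own choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_target(x, A, y):
    # unnormalized log-density of the Bayesian LASSO posterior (1)
    return -0.5 * np.sum((A @ x - y) ** 2) - np.sum(np.abs(x))

def random_walk_metropolis(A, y, sigma=0.5, n_iter=5000):
    p = A.shape[1]
    x = np.zeros(p)
    chain = np.empty((n_iter, p))
    for t in range(n_iter):
        z = x + sigma * rng.normal(size=p)   # symmetric Gaussian proposal
        if np.log(rng.uniform()) < log_target(z, A, y) - log_target(x, A, y):
            x = z
        chain[t] = x
    return chain

chain = random_walk_metropolis(np.eye(2), np.zeros(2))
```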

Figure 4 shows that, contrary to the independent sampler algorithm, the random walk (RW) algorithm satisfies the criterion early. More precisely,

• the independent sampler (IS) algorithm begins to satisfy the criterion only at a late iteration;

• the RW algorithm begins to satisfy the criterion at an early iteration, while the IS algorithm never satisfies the second criterion.

We finally compare the IS and RW algorithms through the approximation of a known integral: the best algorithm will furnish the best approximation of that integral. Table 3 gives the two estimators. We conclude that the random walk algorithm wins for both criteria against the independent sampler algorithm.

## 9 Conclusion

We studied the geometry of the Bayesian LASSO using polar coordinates and calculated the partition function. We obtained a concentration inequality and derived an MCMC convergence diagnosis for the Hastings-Metropolis algorithm. We showed that the random walk MCMC with variance 0.5 wins against the independent sampler with the Laplace proposal distribution.

## References

• (1) K. Ball, Logarithmically concave functions and sections of convex sets in $\mathbb{R}^n$, Studia Math. 88 (1988) 69-84.
• (2) A. Beck, M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci. 2 (1) (2009) 183-202.
• (3) E.T. Copson, Asymptotic Expansions, Cambridge University Press, 1965.
• (4) S. Chen, D. L. Donoho, M. Saunders, Atomic decomposition by basis pursuit, SIAM J. Sci. Comput. 20 (1) (1998) 33-61.
• (5) I. Daubechies, M. Defrise, C. De Mol, An iterative thresholding algorithm for linear inverse problems with a sparsity constraint, Comm. Pure Appl. Math. 57 (11) (2004) 1413–1457.
• (6) A. Dermoune, D. Ounaissi, N. Rahmania, Oscillation of Metropolis-Hastings and simulated annealing algorithms around LASSO estimator, Math. Comput. Simulation (2015).
• (7) A. Erdélyi, Higher Transcendental Functions, Bateman Manuscript Project, California Institute of Technology, Vol. 2, p. 119 (1953).
• (8) B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, Least angle regression, Ann. Statist. 32 (2) (2004) 407-499.
• (9) G. Fort, S. Le Corff, E. Moulines, A. Schreck, A shrinkage-thresholding Metropolis adjusted Langevin algorithm for bayesian variable selection, arXiv:1312.5658v3 [math.ST] (2015).
• (10) R.D. Gordon, Values of Mills’ ratio of area to bounding ordinate of the normal probability integral for large values of the argument, Annals of Mathematical Statistics, vol. 12 364–366 (1941).
• (11) B. Klartag, V.D. Milman, Geometry of log-concave functions and measures, Geom. Dedicata 112 (1) (2005) 169-182.
• (12) M. Mendoza, A. Allegra, T.P. Coleman, Bayesian Lasso Posterior Sampling Via Parallelized Measure Transport, arXiv:1801.02106v1 [stat.CO] (2018).
• (13) K. Mengersen, R.L. Tweedie, Rates of convergence of the Hastings and Metropolis algorithms, Ann. Statist. 24 (1) (1996) 101-121.
• (14) N. Parikh, S. Boyd, Proximal algorithms, Found. Trends Optim. 1 (3) (2014) 127-239.
• (15) M. Pereyra, Proximal Markov chain Monte Carlo algorithms, Stat. Comput. (2015) 1-16.

• (16) N.M. Temme, Numerical and asymptotic aspects of parabolic cylinder functions, Journal of Computational and Applied Mathematics 121 (2000) 221-246.
• (17) G.O. Roberts, R.L. Tweedie, Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms, Biometrika 83 (1) (1996) 95-110.

• (18) R. Tibshirani, Regression shrinkage and selection via the Lasso, J. R. Stat. Soc. Ser. B Stat. Methodol. 58 (1) (1996) 267-288.
• (19) R. Tibshirani, The Lasso problem and uniqueness, Electron. J. Stat. 7 (2013) 1456-1490.