Sparse Methods for Automatic Relevance Determination

05/18/2020 ∙ by Samuel H. Rudy, et al. ∙ MIT

This work considers methods for imposing sparsity in Bayesian regression with applications in nonlinear system identification. We first review automatic relevance determination (ARD) and analytically demonstrate the need for additional regularization or thresholding to achieve sparse models. We then discuss two classes of methods, regularization based and thresholding based, which build on ARD to learn parsimonious solutions to linear problems. In the case of orthogonal covariates, we analytically demonstrate favorable performance with regard to learning a small set of active terms in a linear system with a sparse solution. Several example problems are presented to compare the proposed methods, in terms of advantages and limitations relative to ARD, in bases with hundreds of elements. The aim of this paper is to analyze and understand the assumptions that lead to several algorithms and to provide theoretical and empirical results so that the reader may gain insight and make more informed choices regarding sparse Bayesian regression.


1 Introduction

In many modeling and engineering problems it is critical to build statistical models from data which include estimates of the model uncertainty. This is often achieved through non-parametric Bayesian regression in the form of Gaussian processes and similar methods [26]. While these methods offer tremendous flexibility and have seen success in a wide variety of applications, they have two significant shortcomings: they are not interpretable and they often fail in high dimensional settings. The simplest case of parametric Bayesian regression is Bayesian ridge regression, where one learns a distribution for model parameters by assuming independent and identically distributed (iid) Gaussian priors on model weights. However, Bayesian ridge requires the researcher to provide a single length scale for the prior that stays fixed across dimensions. It is therefore not invariant to changes in units. Furthermore, Bayesian ridge regression does not yield sparse models. In a high dimensional setting this may hinder the interpretability of the learned model. Automatic Relevance Determination (ARD) [20, 32, 36] addresses both of these problems. ARD learns length scales associated with each free variable in a regression problem. In the context of linear regression, ARD is often referred to as Sparse Bayesian Learning (SBL) [33, 36, 38] due to its tendency to learn sparse solutions to linear problems. ARD has been applied to problems in compressed sensing [3], sparse regression [35, 42, 43, 40, 12], matrix factorization [30], classification of gene expression data [16], earthquake detection [22], and Bayesian neural networks [20], as well as other fields.
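As a concrete point of comparison, the sketch below fits off-the-shelf Bayesian ridge and ARD models from scikit-learn to a toy sparse problem; the library implementations (which use Gamma hyperpriors) and the data-generating choices are illustrative stand-ins rather than the exact formulation analyzed in this paper.

```python
import numpy as np
from sklearn.linear_model import ARDRegression, BayesianRidge

rng = np.random.default_rng(0)

# Toy sparse problem: 100 samples, 50 candidate features, 5 active terms.
n, d, k = 100, 50, 5
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[rng.choice(d, size=k, replace=False)] = rng.standard_normal(k)
y = X @ w_true + 0.05 * rng.standard_normal(n)

ridge = BayesianRidge(fit_intercept=False).fit(X, y)
ard = ARDRegression(fit_intercept=False).fit(X, y)

# Bayesian ridge shrinks all coefficients uniformly, leaving essentially none
# at zero; ARD typically drives most irrelevant coefficients to negligible values.
print("coefficients below 1e-3, ridge:", np.sum(np.abs(ridge.coef_) < 1e-3))
print("coefficients below 1e-3, ARD:  ", np.sum(np.abs(ard.coef_) < 1e-3))
```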

More recently, some works have used ARD for interpretable nonlinear system identification. In this setting a linear regression problem is formulated to learn the equations of motion for a dynamical system from a large collection of candidate functions, called a library. Traditionally, frequentist methods have been applied to select a small set of active terms from the candidate functions. These include symbolic regression [6, 29, 24], sequential thresholding [8, 27, 28], information theoretic methods [2, 15], relaxation methods [31, 45], and constrained sparse optimization [34]. Bayesian methods, including ARD, have been applied to the nonlinear system identification problem [39, 21] for improved robustness in the case of low data [40] and for uncertainty quantification [42, 43, 12]. The critical challenge for any library method for nonlinear system identification is learning the correct set of active terms.
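To make the library construction concrete, the sketch below assembles a polynomial candidate library with scikit-learn and poses the identification of each state derivative as a linear regression; the trajectory, library degree, and variable names are illustrative assumptions, not the specific setup used later in the paper.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# State trajectory X (m snapshots x n states) and its time derivatives dX,
# e.g. obtained by simulation or numerical differentiation of measurements.
m_samples = 500
t = np.linspace(0, 10, m_samples)
X = np.column_stack([np.sin(t), np.cos(t)])      # placeholder trajectory
dX = np.column_stack([np.cos(t), -np.sin(t)])    # its time derivative

# Candidate library: all monomials of the state variables up to degree 3.
Theta = PolynomialFeatures(degree=3, include_bias=True).fit_transform(X)

# Each column of dX is regressed on Theta; a sparse Bayesian solver such as
# ARD would then be used to select a small set of active library terms.
coeffs, *_ = np.linalg.lstsq(Theta, dX, rcond=None)
print(Theta.shape, coeffs.shape)
```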

Motivated by problems such as nonlinear system identification, where accuracy in determining the sparsity pattern of a predictor is paramount, we focus on the ability of ARD to accurately learn a small subset of active terms in a linear system. This is in contrast to convergence in any norm. Indeed, it was previously shown [40] that ARD converges to the true predictor as the noise present in the training data shrinks to zero. However, we show analytically that ARD fails to obtain the true sparsity pattern in the case of an orthonormal design matrix, leading to extraneous terms for arbitrarily small magnitudes of noise. This result motivates further considerations for imposing sparsity on the learned model.

This paper explores several intuitive methods for imposing sparsity in the ARD framework. We discuss the assumptions that lead to each technique, any approximations we use to make them tractable, and in some cases provide theoretical results regarding their accuracy with respect to selection of active terms. We stress that while sparse regression is a mature field with many approaches designed to approximate the $\ell_0$-penalized least squares problem [31, 41, 2, 44, 5], most of these techniques do not consider uncertainty. We therefore only compare results of the proposed techniques to ARD.

The paper is organized as follows. In Section 2 we provide a brief discussion of the automatic relevance determination method for Bayesian linear regression. Section 3 introduces two regularization-based methods for imposing sparsity on the predictors learned by the ARD algorithm. Section 4 introduces various thresholding-based approaches. In each case we provide analytical results for the expected false positive and negative rates with respect to coefficients being set to zero. Section 5 includes a more detailed comparison between methods. Section 6 includes results of each of the proposed methods applied to a variety of problems including a sparse linear system, function fitting, and nonlinear system identification. Discussion and comments towards future work are included in Section 7.

2 Setup

We start with the likelihood model,

$$ y = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) + \epsilon \qquad (1) $$

where $\boldsymbol{\phi} : \mathbb{R}^n \to \mathbb{R}^D$ forms a nonlinear basis, $y$ is scalar, $\mathbf{w} \in \mathbb{R}^D$, and $\epsilon$ is normally distributed error with variance $\sigma^2$. We assume a prior distribution on the weights with variance given by the hyper-parameter $\boldsymbol{\gamma}$,

$$ p(\mathbf{w} \mid \boldsymbol{\gamma}) = \mathcal{N}\!\left(\mathbf{w} \mid \mathbf{0}, \Gamma\right), \qquad \Gamma = \operatorname{diag}(\boldsymbol{\gamma}). \qquad (2) $$

Automatic relevance determination seeks to learn the value of the parameter $\boldsymbol{\gamma}$ that maximizes the evidence. This approach is known as evidence maximization, empirical Bayes, or type-II maximum likelihood [4, 19]. Given a dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^{m}$ we marginalize over $\mathbf{w}$ to obtain the posterior likelihood of $\boldsymbol{\gamma}$. This gives,

$$ p(Y \mid \boldsymbol{\gamma}, \sigma^2) = \int p(Y \mid \mathbf{w}, \sigma^2)\, p(\mathbf{w} \mid \boldsymbol{\gamma})\, d\mathbf{w} = \mathcal{N}\!\left(Y \mid \mathbf{0},\; \sigma^2 I + \Phi \Gamma \Phi^\top\right) \qquad (3) $$

where $\Sigma_Y = \sigma^2 I + \Phi \Gamma \Phi^\top$, $\Gamma = \operatorname{diag}(\boldsymbol{\gamma})$, $Y \in \mathbb{R}^m$ is a column vector of all observed outputs and $\Phi \in \mathbb{R}^{m \times D}$ is a matrix whose rows are the nonlinear features $\boldsymbol{\phi}(\mathbf{x}_i)^\top$ of each observed $\mathbf{x}_i$. We estimate $\hat{\boldsymbol{\gamma}}$ by maximizing Eq. (3) and the subsequent distribution for $\mathbf{w}$, given by $p(\mathbf{w} \mid Y, \hat{\boldsymbol{\gamma}}, \sigma^2) = \mathcal{N}(\mathbf{w} \mid \boldsymbol{\mu}, \Sigma)$. Letting $\hat{\Gamma} = \operatorname{diag}(\hat{\boldsymbol{\gamma}})$ this is,

$$ \Sigma = \left(\sigma^{-2}\, \Phi^\top \Phi + \hat{\Gamma}^{-1}\right)^{-1}, \qquad \boldsymbol{\mu} = \sigma^{-2}\, \Sigma\, \Phi^\top Y. \qquad (4) $$
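For reference, a minimal sketch of this posterior computation under the Gaussian formulas above (function and variable names are illustrative, and all $\gamma_j$ are assumed strictly positive):

```python
import numpy as np

def ard_posterior(Phi, Y, gamma, sigma2):
    """Posterior mean and covariance of the weights for fixed gamma and sigma^2.

    Implements Sigma = (Phi^T Phi / sigma^2 + Gamma^{-1})^{-1} and
    mu = Sigma Phi^T Y / sigma^2, assuming every gamma_j > 0.
    """
    gamma = np.asarray(gamma, dtype=float)
    precision = Phi.T @ Phi / sigma2 + np.diag(1.0 / gamma)
    Sigma = np.linalg.inv(precision)
    mu = Sigma @ Phi.T @ Y / sigma2
    return mu, Sigma
```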

In practice $\hat{\boldsymbol{\gamma}}$ is found by minimizing the negative log of Eq. (3), given by,

$$ \mathcal{L}(\boldsymbol{\gamma}) = \log\left|\sigma^2 I + \Phi \Gamma \Phi^\top\right| + Y^\top \left(\sigma^2 I + \Phi \Gamma \Phi^\top\right)^{-1} Y. \qquad (5) $$

Following [40] (see Appendix A) the second term in (5) is equivalent to,

$$ Y^\top \left(\sigma^2 I + \Phi \Gamma \Phi^\top\right)^{-1} Y = \min_{\mathbf{w}} \left[ \frac{1}{\sigma^2}\left\|Y - \Phi \mathbf{w}\right\|_2^2 + \mathbf{w}^\top \Gamma^{-1} \mathbf{w} \right], \qquad (6) $$

which gives the following representation of the loss function (5),

$$ \mathcal{L}(\boldsymbol{\gamma}) = \log\left|\sigma^2 I + \Phi \Gamma \Phi^\top\right| + \min_{\mathbf{w}} \left[ \frac{1}{\sigma^2}\left\|Y - \Phi \mathbf{w}\right\|_2^2 + \sum_{j=1}^{D} \frac{w_j^2}{\gamma_j} \right]. \qquad (7) $$

To minimize (7) we solve a sequence of $\ell_1$-penalized least squares problems, as developed in [36]. This is shown in Alg. 1.

1:   Initialize $\boldsymbol{\gamma} = \mathbf{1}$
2:   while not converged:
3:        $z_j \leftarrow \boldsymbol{\phi}_j^\top \Sigma_Y^{-1} \boldsymbol{\phi}_j$ where $\Sigma_Y = \sigma^2 I + \Phi\, \operatorname{diag}(\boldsymbol{\gamma})\, \Phi^\top$
4:        $\mathbf{w} \leftarrow \arg\min_{\mathbf{w}} \frac{1}{\sigma^2}\|Y - \Phi\mathbf{w}\|_2^2 + 2\sum_j \sqrt{z_j}\, |w_j|$ where $\boldsymbol{\phi}_j$ is the $j$-th column of $\Phi$
5:        $\gamma_j \leftarrow |w_j| / \sqrt{z_j}$
6:        Optional: relearn $\sigma^2$
7:  return $\boldsymbol{\gamma}, \mathbf{w}$
Algorithm 1 ARD($\Phi, Y, \sigma^2$)
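A minimal sketch of this scheme is given below, assuming the iteratively reweighted $\ell_1$ interpretation of [36] as written in Alg. 1; it reuses a standard Lasso solver for step 4 via column rescaling and is not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def ard_reweighted_l1(Phi, Y, sigma2, n_iter=20):
    """Sketch of Alg. 1: ARD via a sequence of weighted l1-penalized problems."""
    m, D = Phi.shape
    gamma = np.ones(D)
    w = np.zeros(D)
    for _ in range(n_iter):
        # Step 3: z_j = phi_j^T (sigma^2 I + Phi Gamma Phi^T)^{-1} phi_j
        Sigma_Y = sigma2 * np.eye(m) + (Phi * gamma) @ Phi.T
        z = np.einsum('ij,ji->i', Phi.T, np.linalg.solve(Sigma_Y, Phi))
        # Step 4: weighted lasso, solved by rescaling columns (cf. Eqs. (8)-(9)).
        scale = np.sqrt(z)
        Phi_s = Phi / scale
        # sklearn's Lasso minimizes (1/(2m))||Y - Xu||^2 + alpha*||u||_1, which
        # matches (1/sigma^2)||Y - Xu||^2 + 2||u||_1 when alpha = sigma2 / m.
        u = Lasso(alpha=sigma2 / m, fit_intercept=False,
                  max_iter=10000).fit(Phi_s, Y).coef_
        w = u / scale
        # Step 5: gamma_j = |w_j| / sqrt(z_j)
        gamma = np.abs(w) / scale
    return gamma, w
```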

Some works have used Gamma distribution priors on the scale parameters $\boldsymbol{\gamma}$ and the precision $\sigma^{-2}$ [32]. This leads to a problem that is solved using coordinate descent of a slightly altered loss function from that shown in Eq. (7). More recent works [36, 40, 42] have not used this formulation, so much of the following work does not use hierarchical priors. We note, however, that the case of a Gamma distribution prior on $\gamma_j$ with shape parameter equal to one is in fact a Laplace prior on the weights. This case has been studied as a Bayesian compressed sensing method [3] and is a special case of the formulation considered in Sec. 3.

The minimization problem in step 4 of Alg. 1 may be re-written, after rescaling $\Phi$ and $\mathbf{w}$, to obtain the commonly used Lagrangian form of the least absolute shrinkage and selection operator (Lasso) [31]. Letting,

$$ \tilde{\boldsymbol{\phi}}_j = z_j^{-1/2} \boldsymbol{\phi}_j, \qquad u_j = z_j^{1/2} w_j, \qquad \tilde{\Phi} = \left[\tilde{\boldsymbol{\phi}}_1, \ldots, \tilde{\boldsymbol{\phi}}_D\right], \qquad (8) $$

we get,

$$ \mathbf{u} \leftarrow \arg\min_{\mathbf{u}} \; \frac{1}{\sigma^2}\left\|Y - \tilde{\Phi} \mathbf{u}\right\|_2^2 + 2 \left\|\mathbf{u}\right\|_1. \qquad (9) $$

Typical solvers for the Lasso problem (9) include coordinate descent [37], proximal gradient methods [23], alternating direction method of multipliers [7], and least angle regression (LARS) [10]. Several example datasets considered in this manuscript resulted in ill-conditioned $\Phi$ and therefore slow convergence of algorithms for solving the Lasso subroutine. We found empirically that all methods performed equally well for orthogonal $\Phi$, but for ill-conditioned cases LARS far outperformed the other optimization routines.
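A sketch of the LARS variant of this subproblem, assuming the rescaling in Eqs. (8)–(9) (parameter names and the exact scaling of the penalty are illustrative):

```python
import numpy as np
from sklearn.linear_model import LassoLars

def weighted_lasso_lars(Phi, Y, z, sigma2):
    """Solve the weighted l1 subproblem of Alg. 1 with LARS via column rescaling."""
    m = Phi.shape[0]
    scale = np.sqrt(z)
    Phi_s = Phi / scale                    # Eq. (8): rescale each column of Phi
    # LassoLars minimizes (1/(2m))||Y - Xu||^2 + alpha*||u||_1; alpha = sigma2/m
    # reproduces the subproblem in Eq. (9) under the scaling used here.
    u = LassoLars(alpha=sigma2 / m, fit_intercept=False).fit(Phi_s, Y).coef_
    return u / scale                       # undo the rescaling to recover w
```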

As we have noted, it is often the case that solutions to Eq. (5) exhibit some degree of sparsity. However, such solutions are only sparse in comparison to those derived by methods such as Bayesian ridge, where all coefficients are nonzero. For problems where we seek to find a very small set of nonzero terms, Alg. 1 must be adjusted to push extraneous terms to zero. In the following two sections we will discuss five methods for doing so.

3 Regularization Based Methods

We begin with a discussion of two methods for regularizing ARD to obtain sparser predictors: inflating the variance passed into Alg. 1 and placing a prior on the distribution of $\boldsymbol{\gamma}$. In each case the sparse predictor is found as the fixed point of an iterative algorithm. In subsequent sections we will discuss thresholding based methods that alternate between iterative optimization and thresholding operations. In certain cases we refer to the set-valued subgradient of a continuous piecewise differentiable function. In cases where the subgradient is a singleton we treat it as a real number.

3.1 Variance Inflation

The error variance $\sigma^2$ of the likelihood model (1) may be intuitively thought of as a level of mistrust for the data. Extremely large values of $\sigma^2$ will push estimates of $\mathbf{w}$ to be dominated by the priors. It is shown in [36] that the ARD prior given by (2) is equivalent to a concave regularization term. We therefore expect large $\sigma^2$ to encourage sparser models. The regularization may be strengthened by passing an artificially large value of $\sigma^2$ to the iterative algorithm for solving Eq. (7) or, if also learning $\sigma^2$, by applying an inflated value at each step of the algorithm. We will call this process ARD with variance inflation (ARDvi), shown in Algorithm 2. Note that this differs from Alg. 1 only slightly, by treating the variance used in the standard ARD algorithm as a tuning parameter, with a higher variance indicating less trust in the data and a stronger regularization.

1:   Initialize $\boldsymbol{\gamma} = \mathbf{1}$
2:   while not converged:
3:        $z_j \leftarrow \boldsymbol{\phi}_j^\top \Sigma_Y^{-1} \boldsymbol{\phi}_j$ where $\Sigma_Y = c\,\sigma^2 I + \Phi\, \operatorname{diag}(\boldsymbol{\gamma})\, \Phi^\top$
4:        $\mathbf{w} \leftarrow \arg\min_{\mathbf{w}} \frac{1}{c\,\sigma^2}\|Y - \Phi\mathbf{w}\|_2^2 + 2\sum_j \sqrt{z_j}\, |w_j|$ where $c > 1$ is the inflation factor
5:        $\gamma_j \leftarrow |w_j| / \sqrt{z_j}$
6:        Optional: relearn $\sigma^2$
7:  return $\boldsymbol{\gamma}, \mathbf{w}$
Algorithm 2 ARDvi($\Phi, Y, \sigma^2$, inflation factor $c$)
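In code, variance inflation amounts to a thin wrapper around the basic ARD routine; the sketch below reuses the hypothetical `ard_reweighted_l1` function from the earlier sketch and treats the inflation factor `c` as a tuning parameter.

```python
def ard_variance_inflation(Phi, Y, sigma2, c=10.0, n_iter=20):
    """ARDvi sketch: run ARD with the noise variance inflated by a factor c > 1."""
    return ard_reweighted_l1(Phi, Y, c * sigma2, n_iter=n_iter)
```

Larger values of `c` correspond to less trust in the data and a stronger sparsity-promoting effect.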

3.1.1 Sparsity properties of ARDvi for orthogonal features

To better understand the effect of variance inflation we consider Alg. 2 in the case where the columns of $\Phi$ are orthogonal. Note that this implies $m \geq D$ and that $\Phi^\top\Phi$ is diagonal. Let $\boldsymbol{\rho}$ be the vector of column norms of $\Phi$, so that $\rho_j = \|\boldsymbol{\phi}_j\|_2$. Define $\tilde{\Phi}$ to be the extension of $\Phi$ to an orthogonal basis of $\mathbb{R}^m$ whose first $D$ columns are those of $\Phi$. Now let $\boldsymbol{\gamma}$ be a fixed point of Algorithm 2, with $\mathbf{z}$ and $\mathbf{w}$ defined by steps 3 and 4. The expression in step 3 is given by,

(10)

The Karush-Kuhn-Tucker (KKT) stationarity condition for the update in step 4 gives,

(11)

where $\mathbf{w}^*$ denotes the true value from Eq. (1). We can find the false positive probability for a term being included in the model by setting $w_j^* = 0$ and finding conditions under which $w_j \neq 0$. Substituting the value of $z_j$ from Eq. (10) and dividing through gives,

(12)

where the set-valued function is the subgradient of the absolute value, taking value $\{\operatorname{sign}(w_j)\}$ if $w_j \neq 0$ and $[-1, 1]$ otherwise. If $w_j \neq 0$ then,

(13)

while for $w_j = 0$,

(14)

It follows that $w_j \neq 0$ only when the projected noise exceeds a threshold determined by the inflation factor. Since the projected noise is Gaussian, we find that the false positive rate is,

(15)

where $\operatorname{erf}$ is the Gauss error function. Of particular note is that the number of false positives is independent of the variance of the linear model's error term, $\sigma^2$. While the mean predictor learned from ARD does converge in any norm to the true solution as $\sigma^2 \to 0$, the expected number of nonzero terms in the learned predictor stays constant. If one desires a sparse predictor, this motivates including a small threshold parameter below which coefficients are ignored, which we discuss in a subsequent section.

We define a false negative as Algorithm 2 finding $w_j = 0$ (and, respectively, $\gamma_j = 0$) when the true solution has $w_j^* \neq 0$, and find the likelihood of such a case in a similar manner. Applying Eq. (10) and the KKT conditions we find,

(16)

The false negative likelihood is therefore,

(17)

Note that this function vanishes for large $|w_j^*|$, indicating that important terms, as measured by the magnitude of their true coefficients, are far less likely to be missed.

Figure 1 demonstrates the validity of equations (15) and (17) on a simple test problem. We construct a matrix $\Phi$ with orthogonal columns having random magnitudes, and a random sparse $\mathbf{w}^*$ whose non-zero terms are drawn at random. The mean numbers of added and missed nonzero terms across 50 trials are shown and agree very well with the predicted values. As anticipated, the number of missing terms decays to zero as $\sigma^2 \to 0$, but the same is not true for the number of added terms, which only decays as the inflation parameter is increased. The failure of ARDvi to converge to the true sparsity pattern as $\sigma^2 \to 0$ for a fixed inflation factor is certainly troubling, but for a sufficiently large inflation factor only an arbitrarily small number of terms will be added.
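A sketch of generating a test problem of this type is shown below; the column-norm range, number of active terms, and noise level are illustrative assumptions rather than the exact values used for Fig. 1.

```python
import numpy as np

rng = np.random.default_rng(1)

m = 250                                    # 250 x 250 design matrix
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
col_norms = rng.uniform(0.5, 2.0, size=m)  # random column magnitudes (assumed range)
Phi = Q * col_norms                        # orthogonal, but not orthonormal, columns

k = 10                                     # number of active terms (assumed)
w_true = np.zeros(m)
w_true[rng.choice(m, size=k, replace=False)] = rng.standard_normal(k)

sigma = 0.1                                # noise standard deviation (assumed)
Y = Phi @ w_true + sigma * rng.standard_normal(m)
```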

Figure 1: Number of missing and added terms using ARD with variance inflation. Dots indicate the empirical average over 50 trials and dashed lines indicate values predicted by equations (15) and (17). Since the theoretical number of missing terms depends on $\mathbf{w}^*$, it was held fixed across all trials and only the noise was sampled. In each case $\Phi$ is a 250×250 matrix with orthogonal columns.

3.2 Regularization via Sparsity Promoting Hierarchical Priors

Since the sparsity of $\mathbf{w}$ is controlled by that of $\boldsymbol{\gamma}$, we can also attempt to regularize $\boldsymbol{\gamma}$ through the use of a hierarchical prior. Previous approaches to ARD have suggested hierarchical priors on $\boldsymbol{\gamma}$ in the form of Gamma distributions [32]. However, except for certain cases, the general class of Gamma distributions does not impose sparsity. Instead, we consider the use of a sparsity promoting hierarchical prior on the scale parameters $\gamma_j$. We consider distributions of the form,

$$ p(\gamma_j) \propto \exp\!\left(-f(\gamma_j) - g(\gamma_j)\right), \qquad (18) $$

where $f$ and $g$ are convex and concave functions in $\gamma_j$, respectively. Given data we can follow a procedure similar to the one used in Sec. 2 and find,

$$ p(\boldsymbol{\gamma} \mid Y) \propto p(Y \mid \boldsymbol{\gamma}, \sigma^2)\, p(\boldsymbol{\gamma}). \qquad (19) $$

A fully Bayesian approach would estimate $\mathbf{w}$ through the joint posterior likelihood of the pairs $(\mathbf{w}, \boldsymbol{\gamma})$, but this would be computationally expensive. Instead, we approximate $\boldsymbol{\gamma}$ by its maximum-a-posteriori estimate, a process sometimes labelled type-II MAP [19]. The MAP estimate of $\boldsymbol{\gamma}$ is found by minimizing the negative log of the posterior distribution,

$$ \mathcal{L}_r(\boldsymbol{\gamma}) = \log\left|\sigma^2 I + \Phi \Gamma \Phi^\top\right| + \min_{\mathbf{w}} \left[ \frac{1}{\sigma^2}\left\|Y - \Phi \mathbf{w}\right\|_2^2 + \sum_{j=1}^{D} \frac{w_j^2}{\gamma_j} \right] + \sum_{j=1}^{D} \left( f(\gamma_j) + g(\gamma_j) \right). \qquad (20) $$

As expected, Eq. (20) closely resembles Eq. (7) and may be minimized with a similar method. Alg. 3 constructs a sequence of iterates that monotonically decreases the loss given by Eq. (20). Since Eq. (20) is nonconvex, we can only guarantee convergence to a local minimum. We initialize Alg. 3 using the unregularized ARD value of $\boldsymbol{\gamma}$.

1:   Initialize $\boldsymbol{\gamma}$ using Algorithm 1
2:   while not converged:
3:        compute the weights $z_j$ from the current $\boldsymbol{\gamma}$, as in step 3 of Alg. 1
4:        update $\mathbf{w}$ given the current weights, as in step 4 of Alg. 1
5:        update each $\gamma_j$ by minimizing the regularized objective (20) in $\gamma_j$, holding $\mathbf{w}$ fixed
6:        Optional: relearn $\sigma^2$
7:  return $\boldsymbol{\gamma}, \mathbf{w}$
Algorithm 3 ARDr($\Phi, Y, \sigma^2, f, g$)

Algorithm 3 allows for significant freedom in choosing $f$ and $g$. The concave component of the prior, $g$, acts as a sparsity encouraging regularizer on $\boldsymbol{\gamma}$, as is common for concave priors [11]. Examples of concave $g$ include the identity and approximations of the $\ell_0$-norm. We consider functions of the following form,

(21)

where one parameter controls the strength of the regularizer and another acts as a width parameter. The convex prior $f$ may be an indicator function restricting $\boldsymbol{\gamma}$ to a specific domain or left as a constant. In either case implementing the above algorithm is trivial. If the prior terms are not linear or indicator functions, then step 5 in Alg. 3 will require an internal iterative algorithm.
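For illustration, two candidate concave regularizers of this general type are sketched below: a linear penalty (corresponding to a Laplace-type prior) and a saturating surrogate of the $\ell_0$-norm with a width parameter. These specific functional forms are assumptions for the sake of example and are not necessarily the exact choice in Eq. (21).

```python
import numpy as np

def g_linear(gamma, lam):
    """Linear concave penalty g(gamma) = lam * gamma (a Laplace-type prior)."""
    return lam * gamma

def g_smooth_l0(gamma, lam, eps):
    """Saturating penalty g(gamma) = lam * gamma / (gamma + eps).

    For small eps this approaches lam times an indicator of gamma > 0,
    i.e. an l0-like count of the nonzero scale parameters.
    """
    return lam * gamma / (gamma + eps)

gamma = np.linspace(0.0, 5.0, 6)
print(g_linear(gamma, lam=1.0))
print(g_smooth_l0(gamma, lam=1.0, eps=0.1))
```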

3.2.1 Sparsity properties of ARDr for orthogonal features

The behavior of Alg. 3 is complicated by the generality of the functions $f$ and $g$. In the simplest case we let $f$ be constant and $g$ be the linear function $g(\gamma_j) = \lambda \gamma_j$. This is the formulation used in [3] and a special case of using a Gamma distribution prior with shape parameter equal to one on $\gamma_j$. The update step in line 5 of Alg. 3 then has a closed form, as in the unregularized case. For a fixed point of Alg. 3 we have,

(22)

The KKT conditions for the $\mathbf{w}$ update are unchanged from the unregularized case and are given by,

(23)

If $w_j \neq 0$ then Eq. (22) determines $\gamma_j$ and therefore,

(24)

The converse of Eq. (24) is shown in Appendix B. From this equivalence it follows that the false positive and negative rates for Alg. 3 are given by,

(25)
(26)
Figure 2: Number of missing and added terms using ARDr with orthonormal $\Phi$. Dots indicate the empirical average over 50 trials and dashed lines indicate values predicted by equations (25) and (26).

These rates are verified empirically by testing 50 trials using a 250×250 $\Phi$ with orthogonal columns and random $\mathbf{w}^*$ as in Sec. 3.1. Results are shown in Fig. 2. Note that for a fixed regularization parameter the false negative rate does indeed approach zero as $\sigma^2 \to 0$; however, the false positive rate increases. This indicates that a linear model with smaller error requires stronger regularization to achieve a sparse solution. If the regularization parameter is instead rescaled as $\sigma^2$ varies, the false negative rate still approaches zero and the false positive rate is constant. This latter case is shown in Fig. 3.

Figure 3 shows a similar convergence pattern to what we observed for ARDvi in Fig. 1. The number of added terms (false positives) remains constant as $\sigma^2 \to 0$ for any fixed regularization parameter. However, we note again that for a sufficiently strong regularizer the fixed false positive rate may be made arbitrarily small. In the following section we will construct thresholding methods, including one for which the false positive and negative rates both converge to zero as $\sigma^2 \to 0$.

Figure 3: Number of missing and added terms using ARDr with orthonormal $\Phi$ and the rescaled regularization parameter held constant. Dots indicate the empirical average over 50 trials and dashed lines indicate values predicted by equations (25) and (26).

4 Thresholding Based Methods

As we have shown, automatic relevance determination will not recover the correct non-zero coefficients in a general sparse regression problem, but it will converge in norm as the noise vanishes [40]. Therefore, applying an arbitrarily small threshold to the learned coefficients will ensure selection of the correct nonzero coefficients in the limit of low noise. In this section we discuss methods for thresholding the output of Alg. 1 based either on the magnitude of the posterior mean coefficients or on the posterior distribution of $\mathbf{w}$.

4.1 Magnitude Based Thresholding

Sequential thresholding based on the magnitude of coefficients has been used extensively in regression [8, 5] and also in conjunction with automatic relevance determination methods for identifying nonlinear dynamical systems with uncertainty quantification in [42]. Here we consider the method initially proposed in [42], called threshold sparse Bayesian regression (TSBR). To distinguish it from other thresholding methods we use the term magnitude sequential threshold sparse Bayesian learning (M-STSBL). Magnitude based thresholding assumes that coefficients learned in the ARD algorithm with sufficiently small magnitude are irrelevant and may be treated as zero.

1:  $\boldsymbol{\gamma}, \mathbf{w} \leftarrow$ ARD($\Phi_{\mathcal{S}}, Y, \sigma^2$)
2:  $\mathcal{S}' \leftarrow \{ j \in \mathcal{S} : |w_j| \geq \text{tol} \}$
3:  $w_j \leftarrow 0$ for $j \notin \mathcal{S}'$
4:  $\gamma_j \leftarrow 0$ for $j \notin \mathcal{S}'$
5:   if $\mathcal{S}' \neq \mathcal{S}$: return M-STSBL($\Phi, Y, \sigma^2, \text{tol}, \mathcal{S}'$)
6:   return $\boldsymbol{\gamma}, \mathbf{w}$
Algorithm 4 M-STSBL($\Phi, Y, \sigma^2, \text{tol}, \mathcal{S}$)

The sequential hard-thresholding method for automatic relevance determination is implemented in Alg. 4. Non-zero terms are indexed by the set $\mathcal{S}$, whose complement tracks terms removed from the model. At each iteration the algorithm either recursively calls itself with fewer features or terminates if all features are kept non-zero.
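A compact sketch of this recursion is given below; it reuses the hypothetical `ard_reweighted_l1` routine from the earlier sketch, and the threshold argument `tol` is an illustrative name.

```python
import numpy as np

def m_stsbl(Phi, Y, sigma2, tol, active=None):
    """Sketch of Alg. 4: magnitude-based sequential thresholding of ARD."""
    D = Phi.shape[1]
    if active is None:
        active = np.arange(D)
    if active.size == 0:
        return np.zeros(D)
    _, w = ard_reweighted_l1(Phi[:, active], Y, sigma2)
    keep = np.abs(w) >= tol
    if keep.all():                  # no features removed: terminate
        w_full = np.zeros(D)
        w_full[active] = w
        return w_full
    return m_stsbl(Phi, Y, sigma2, tol, active=active[keep])
```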

4.1.1 Sparsity properties of M-STSBL for orthogonal features

We consider the number of errors made by Alg. 4 in a similar context to the analysis of the variance inflation and regularization based methods. First consider the likelihood of a false non-zero term. Recall from the previous section that the KKT conditions for a fixed point of Alg. 1 imply,

(27)

We can rewrite this as,

(28)

where,

(29)

is strictly increasing, and hence invertible, on its domain. Therefore,

(30)

This gives the likelihood of a false non-zero coefficient as,

(31)

and the likelihood for a false zero coefficient as,

(32)

Equations (31) and (32) are verified empirically by testing 50 trials over a 250×250 $\Phi$ using the same experimental design as in Sec. 3.1. Results are shown in Fig. 4. In contrast to the regularization based approaches, we now have the desirable condition that the number of false positive terms goes to zero as $\sigma^2 \to 0$. However, the number of false negatives now only shrinks to a fixed positive value, a consequence of using a hard threshold. This motivates alternative criteria for thresholding. In the next section, we will discuss thresholding based not strictly on magnitude but on the marginal posterior likelihood that a coefficient is zero.

Figure 4: Number of missing and added terms using M-STSBL with orthonormal $\Phi$. Dots indicate the empirical average over 50 trials and dashed lines indicate values predicted by equations (31) and (32).

4.2 Likelihood Based Thresholding

While Alg. 4 was shown to be effective in [42], it is not invariant to the units of measurement used for each feature and is not practical in the case where some true coefficients are small. An alternative means of thresholding is to do so based on the marginal likelihood of a coefficient being zero. The marginal posterior distribution of $w_j$ is given by,

$$ p(w_j \mid Y, \hat{\boldsymbol{\gamma}}, \sigma^2) = \mathcal{N}\!\left(w_j \mid \mu_j, \Sigma_{jj}\right), \qquad (33) $$

where $\boldsymbol{\mu}$ and $\Sigma$ are given by Eq. (4), and the marginal likelihood that $w_j = 0$ is,

$$ p(w_j = 0 \mid Y, \hat{\boldsymbol{\gamma}}, \sigma^2) = \frac{1}{\sqrt{2\pi \Sigma_{jj}}} \exp\!\left( -\frac{\mu_j^2}{2 \Sigma_{jj}} \right). \qquad (34) $$

We can construct a sequential thresholding algorithm, shown in Alg. 5, by removing terms whose marginal likelihood evaluated at zero is sufficiently large. The remaining subset of features is then passed recursively to the same procedure until convergence, marked by no change in the number of features. This process is described by Alg. 5, where the threshold parameter specifies the marginal likelihood at zero above which features are removed.

1:  $\boldsymbol{\gamma}, \boldsymbol{\mu}, \Sigma \leftarrow$ ARD($\Phi_{\mathcal{S}}, Y, \sigma^2$)
2:  $\mathcal{S}' \leftarrow \{ j \in \mathcal{S} : \mathcal{N}(0 \mid \mu_j, \Sigma_{jj}) < p_0 \}$
3:  $\mu_j \leftarrow 0$ for $j \notin \mathcal{S}'$
4:  $\gamma_j \leftarrow 0$ for $j \notin \mathcal{S}'$
5:   if $\mathcal{S}' \neq \mathcal{S}$: return L-STSBL($\Phi, Y, \sigma^2, p_0, \mathcal{S}'$)
6:   return $\boldsymbol{\gamma}, \boldsymbol{\mu}, \Sigma$
Algorithm 5 L-STSBL($\Phi, Y, \sigma^2, p_0, \mathcal{S}$)
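A sketch of this procedure, reusing the hypothetical `ard_reweighted_l1` and `ard_posterior` routines from earlier sketches (the threshold `p0` is an illustrative name for the density-at-zero cutoff):

```python
import numpy as np

def l_stsbl(Phi, Y, sigma2, p0, active=None):
    """Sketch of Alg. 5: remove features whose marginal posterior density at zero
    exceeds p0, then re-run ARD on the survivors until no feature is removed."""
    D = Phi.shape[1]
    if active is None:
        active = np.arange(D)
    if active.size == 0:
        return np.zeros(D)
    gamma, _ = ard_reweighted_l1(Phi[:, active], Y, sigma2)
    # Small floor keeps Gamma^{-1} finite for features driven to gamma_j = 0.
    mu, Sigma = ard_posterior(Phi[:, active], Y, np.maximum(gamma, 1e-12), sigma2)
    var = np.diag(Sigma)
    # Marginal density of w_j evaluated at zero: N(0; mu_j, Sigma_jj).
    density_at_zero = np.exp(-mu**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
    keep = density_at_zero < p0
    if keep.all():
        w_full = np.zeros(D)
        w_full[active] = mu
        return w_full
    return l_stsbl(Phi, Y, sigma2, p0, active=active[keep])
```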

4.2.1 Sparsity properties of L-STSBL for orthogonal features

We again consider the case where the columns of $\Phi$ are orthogonal. Let,

(35)

so that the thresholding criterion is a function of $\gamma_j$. In Alg. 1 the returned $\mathbf{w}$ is the posterior mean estimate. The orthogonality of the columns of $\Phi$ allows us to express the marginal posterior variance as a function of $\gamma_j$. Letting $\Sigma$ be the covariance from a fixed point of Alg. 1, we have,

(36)

This allows us to express,

(37)

where the two factors are strictly decreasing and increasing functions of $\gamma_j$, respectively. It follows that the thresholding criterion is strictly decreasing and therefore invertible, with its inverse easily computed by bisection. For any threshold there is some critical value such that,

(38)

and recalling Eq. (30),

(39)

This gives,

(40)

and,

(41)
Figure 5: Number of missing and added terms using L-STSBL with orthonormal $\Phi$. Dots indicate points where data was collected as the empirical average over 50 trials.

Equations (40) and (41) are verified empirically using the same experimental setup as in previous sections. Results are shown in Fig. 5.