# Bayesian Masking: Sparse Bayesian Estimation with Weaker Shrinkage Bias

A common strategy for sparse linear regression is to introduce regularization, which eliminates irrelevant features by letting the corresponding weights be zeros. However, regularization often shrinks the estimator for relevant features, which leads to incorrect feature selection. Motivated by the above-mentioned issue, we propose Bayesian masking (BM), a sparse estimation method which imposes no regularization on the weights. The key concept of BM is to introduce binary latent variables that randomly mask features. Estimating the masking rates determines the relevance of the features automatically. We derive a variational Bayesian inference algorithm that maximizes the lower bound of the factorized information criterion (FIC), which is a recently developed asymptotic criterion for evaluating the marginal log-likelihood. In addition, we propose reparametrization to accelerate the convergence of the derived algorithm. Finally, we show that BM outperforms Lasso and automatic relevance determination (ARD) in terms of the sparsity-shrinkage trade-off.

## Authors

• 1 publication
• 16 publications
• 18 publications
• ### Factorized Asymptotic Bayesian Hidden Markov Models

This paper addresses the issue of model selection for hidden Markov mode...
06/18/2012 ∙ by Ryohei Fujimaki, et al. ∙ 0

• ### Sparse Methods for Automatic Relevance Determination

This work considers methods for imposing sparsity in Bayesian regression...
05/18/2020 ∙ by Samuel H. Rudy, et al. ∙ 0

• ### Approximate Variational Inference Based on a Finite Sample of Gaussian Latent Variables

Variational methods are employed in situations where exact Bayesian infe...
06/11/2019 ∙ by Nikolaos Gianniotis, et al. ∙ 0

• ### Rebuilding Factorized Information Criterion: Asymptotically Accurate Marginal Likelihood

Factorized information criterion (FIC) is a recently developed approxima...
04/22/2015 ∙ by Kohei Hayashi, et al. ∙ 0

• ### On the overestimation of widely applicable Bayesian information criterion

A widely applicable Bayesian information criterion (Watanabe, 2013) is a...
08/28/2019 ∙ by Toru Imai, et al. ∙ 0

• ### Shrinkage estimation of rate statistics

This paper presents a simple shrinkage estimator of rates based on Bayes...
10/17/2018 ∙ by Einar Holsbø, et al. ∙ 0

• ### Relevant sparse codes with variational information bottleneck

In many applications, it is desirable to extract only the relevant aspec...
05/24/2016 ∙ by Matthew Chalk, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

In sparse linear regression, various approaches impose sparsity by implementing regularization on a weight parameter. For example, Lasso (Tibshirani, 1994) introduces sparsity by regularizing the weights by L1 norm. Automatic relevance determination (ARD) (MacKay, 1994; Neal, 1996)

regularizes the weights by a prior distribution, with hyperparameters indicating the relevance of the input features. Empirical Bayes estimation of the hyperparameters thus eliminates irrelevant features automatically. Although ARD is notorious for its slow convergence, several authors have improved the algorithm (e.g.,

(Wipf and Nagarajan, 2008)).

The trade-off between sparsity and shrinkage is a crucial issue in sparse regularization methods (Aravkin et al., 2014). In Lasso, for example, a large regularization constant incorporates strong sparsity and is more likely to estimate the weights of irrelevant features as zero, which is desirable in terms of interpretability. However, it also shrinks the weights of relevant features and may eliminate them. ARD suffers from the same problem, although the bias of ARD is weaker than that of Lasso (Aravkin et al., 2014). Because both sparsity and shrinkage are caused by regularization, the shrinkage effects are inevitable as long as we use the regularization scheme.

To address this issue, we propose an alternative method for sparse estimation, namely, Bayesian masking (BM), which differs from existing methods in that it does not impose any regularization on the weights. Our contributions can be summarized as follows.

• The BM model (Section 4). The BM model introduces binary latent variables into a linear regression model. The latent variables randomly mask features to be zero at each sample according to the priors that are defined for each feature, but shared among samples. The estimation of the priors on the masking rates determines the relevance of the features.

• A variational Bayesian inference algorithm for the BM model (Section 4.2). The EM-like coordinate ascent algorithm maximizes the lower bound of the factorized information criterion (FIC). The convergence of the algorithm is accelerated by combining gradient ascent and reparametrization (Section 4.4), which are motivated by previous studies on convergence analysis of coordinate ascent and information geometry.

• Analytic forms of the one-dimensional (1D) estimators of Lasso, ARD (Section 3), and BM (Section 4.3). The analytic estimators of these methods provide insights into their shrinkage mechanisms.

Through numerical experiments, we empirically show that the proposed method outperforms Lasso and ARD in terms of the sparsity-shrinkage trade-off.

### 1.1 Notation

Hereafter,

denotes a column vector of the

-th row of a matrix .

## 2 Background

### 2.1 Linear Regression and Least Squares

Consider a linear regression model:

 y=Xβ+ϵ, (1)

where is a vector of target values, is a matrix of explanatory variables, is a vector of weight parameters, and denotes observation noise. Further, is the number of samples and is the number of features. Because the noise is i.i.d. Gaussian, the maximum likelihood estimator (MLE) is given as the least-squares (LS) solution:

 ^βLS =argminβλ2∥y−Xβ∥22=(X⊤X)−1X⊤y. (2)

### 2.2 Lasso

Lasso is formulated as the L1-penalized regression problem, and the estimator is given as

 ^βLasso=argminβλ2∥y−Xβ∥22+α∥β∥1, (3)

where is a regularization constant.

### 2.3 Ard

Consider the prior distribution of :

 p(β|γ)=N(β|0,Γ), (4)

where and

is the hyperparameter determining the variance of the prior. ARD determines

by the empirical Bayes principle, i.e., by maximizing the marginal log-likelihood: . Then, the estimator of is then often given by the posterior mean with plugged-in  (Wipf and Nagarajan, 2008; Aravkin et al., 2014):

 ^βARD=^ΓX⊤(λ−1I+X^ΓX⊤)−1y. (5)

Clearly, for , results in for any .

## 3 Trade-off between Sparsity and Shrinkage

Concerning the trade-off between sparsity and shrinkage, Aravkin et al. (2014) derived the upper bounds of the estimators for and showed that undesirable shrinkage occurs for both Lasso and ARD; specifically, the shrinkage bias of Lasso is larger than that of ARD when an unnecessary feature is correctly pruned.

In this section, we revisit the above-mentioned issue. To understand how sparse regularization works, we derive the exact forms of the estimators for and show that ARD is better than Lasso in terms of the sparsity-shrinkage trade-off. Although our analysis is much simpler than the earlier study by Aravkin et al. (2014), it is meaningful because our derived estimators

• are exact and analytically written (no approximation is needed) and

• highlight the significant differences between ARD and Lasso.

### 3.1 1D Estimators

##### Ls

When , the matrix inverse in Eq. (2) becomes a scalar inverse and is simply written as

 ^βLS=x⊤yx⊤x, (6)

where we let represent to emphasize the dimensionality.

##### Lasso

For , the L1-penalty becomes . Thus, Eq. (3) yields a stationary point where . For , the solution is the same except that the sign is reversed. Combining both cases yields the solution for all :

 ^βLasso=sign(^βLS)max(0,|^βLS|−2αλx⊤x). (7)
##### Ard

Similar to Lasso, the 1D estimator of ARD is analytically written as111The full derivation of Eq. (8) is shown in Appendix A.

 (8)

Note that we assumed that is known.222Practically, is set to the unbiased version of MLE (Aravkin et al., 2014).

### 3.2 Comparison of LS, Lasso and ARD

Although their regularizations are different, Lasso and ARD have the same shrinkage mechanism — subtracting the constant from and cropping to zero if its magnitude is smaller than that of the constant. Since the constants in Eqs. (7) and (8) are both larger than zero except for the noiseless case (i.e., ), the shrinkage bias is inevitable in both Lasso with and ARD. On the other hand, this retraction to zero is necessary for sparsity because it prunes the irrelevant features.

It is worth noting that the bias of ARD is much weaker than that of Lasso when the scale of is large. This is easily confirmed by transforming the constant in Eq. (8) as , which indicates that the shrinkage is weak when is large but strong when is close to zero. Compared to Lasso, this behavior of ARD is desirable, as it maintains sparsity for weak features while alleviating unnecessary shrinkage. Figure 1 shows how shrinkage occurs in Lasso and ARD.

## 4 BM Model and Inference Algorithms

### 4.1 BM Model

Obviously, the shrinkage bias of Lasso and ARD comes from their imposition regularization on the weights. For example, if

in Lasso, the loss function becomes equivalent to that of LS, and of course, no shrinkage occurs. Using

yields the same result in ARD.

Hence, we introduce a new estimation method that maintains sparsity by using latent variables instead of regularization. Let be binary latent variables having the same dimensionality of . We insert between and , i.e.,

 y=(X∘Z)β+ϵ, (9)

where ’’ denotes the Hadamard product (i.e., element-wise multiplication).

masks the features randomly at each sample. For Bayesian treatment, we introduce prior distributions. We assume that follows a Bernoulli prior distribution as , where indicates how feature is relevant. Then, estimating the priors on the masking rates automatically determines the relevance of the features.

We also introduce the priors for and ; however, we set them to be as weak as possible so that their effects are negligible when is sufficiently large. Thus, we simply employ them as constants as they do not depend on , i.e., .

### 4.2 FAB-EM Algorithm

Our approach is based on the concept of Bayesian model selection; the central task is to evaluate the marginal log-likelihood. However, in our case, the marginal likelihood is intractable.

Thus, we adopt FIC, a recently proposed approximation for the marginal log-likelihood (Fujimaki and Morinaga, 2012; Hayashi and Fujimaki, 2013; Hayashi et al., 2015). We also adopt the factorized asymptotic Bayesian inference (FAB), which provides a tractable algorithm for parameter inference and model pruning by optimizing the lower bound of FIC. The FAB algorithm alternately maximizes the lower bound in an EM-like manner.

To obtain the lower bound of FIC, we introduce a mean-field approximation on the posterior distribution of as . Then we obtain the objective function as

 G({μn},β,λ,π) = Eq[logp(y|X,Z,β,λ)]+Eq[logp(Z|π)] (10) −12∑k(log(Nπk)+∑nEq[znk]/N−πkπk)−K+12logN +∑n,kH(q(znk)),

where means the expectation under , and is the entropy. The derivation of Eq. (10) is described in Appendix B.

##### FAB E-step

In the FAB E-step, we update . By taking the gradient of w.r.t. and setting it to zero, we obtain the following fixed-point equations:

 μnk=σ(cnk+logπk1−πk−12Nπk) (11)

where

is the sigmoid function and

. Updating Eq. (11) several times give us at a local maximum of .

##### FAB M-step

Since only the first and second terms in Eq. (10) are relevant, the FAB M-step is equivalent to the M-step in the EM algorithm. We have the closed-form solutions to update , and as

 β = Ω−1(X∘M)⊤y, (12) 1λ = ∑n(y2n−2yn(xn∘μn)⊤β+(xn∘β)⊤Eq[znz⊤n](xn∘β))N (13) πk = ∑nμnkN, (14)

where and . We note that .

##### Pruning step

As noted in previous papers, the third term in penalizes and automatically eliminates irrelevant features. Then, when , we remove the corresponding feature from the model. The pseudo-code of the resulting algorithm is given in Algorithm 1.

### 4.3 Analysis of FAB Estimator

As in Section 3, we investigate the FAB estimator of . If follows the linear model (1) with , the FAB estimator is expectedly obtained as

 Eϵ[^βFAB] =Ω−1~X⊤Xβ∗ =β∗+Ω−1b (15)

for any , where and .

Eq. (15) immediately suggests that is biased by . The bias consists of the two cross terms: and . Thus, the bias increases when and are correlated and and are negatively correlated. On the other hand, the cross terms become zero when or for all . This also implies that the bias is weakened by an appropriately estimated . In Section 5.2, we numerically evaluate the bias of FAB with Lasso and ARD, showing that FAB achieves the lowest bias.

Remarkably, when , no cross term appears and the bias vanishes for any satisfying . Furthermore, the 1D estimator is simply written as

 ^βFAB=~x⊤y~x⊤x. (16)

Again, if for all , recovers .

### 4.4 FAB-EG and Hybrid FAB-EM-EG Algorithms

Hayashi and Fujimaki (2013) reported that model pruning in FAB is slow, and we find that our algorithm suffers from the same drawback. In our case, {} for irrelevant features requires many iterations until convergence to zero although the weights and for relevant features converge rapidly. To overcome the problem of slow convergence, we combine gradient ascent and reparametrization as discussed below.

First, we replace the FAB-M step by gradient ascent, which is motivated by previous studies (Salakhutdinov et al., 2003; Maeda and Ishii, 2007) on convergence analysis of the EM algorithm. Thus, we find that gradient ascent certainly helps; however, the convergence remains slow. This is because the model distribution is insensitive to the direction of when is small. This means that the gradient for would be shallow for an irrelevant feature , since the estimator of takes a small value for the feature.

The natural gradient (NG) method (Amari, 1998) can effectively overcome the insensitivity in the model distribution. The key concept of the NG method is to adopt the Fisher information matrix as a metric on a parameter space (the so-called the Fisher information metric) in order to define the distance by the Kullback-Leibler (KL) divergence between model distributions instead of the Euclidean distance between parameter vectors. Thus, parameter updates by the NG method can make steady changes in the model distribution at each iteration. Amari and his co-workers showed that the NG method is especially effective when the parameter space contains singular regions, i.e., regions where the Fisher information matrix degenerates (Amari et al., 2006; Wei et al., 2008). This is because the problem of shallow gradient is severe around a singular region, since gradient completely vanishes along with the region. Hence, learning by ordinary gradient is often trapped around singular regions and remains trapped for many iterations, even when the models in the regions show poor agreement with data. In contrast, learning by the NG method is free from such a slow down.

In our case, the model has two singular regions for each feature, i.e., and . In particular, singular region causes the slow convergence. Hence, the NG method should be effective; however, the evaluation of the Fisher information metric is computationally expensive in our case. Therefore, as an alternative, we propose a simple reparametrization that approximates the Fisher information metric. Our strategy is to perform a block diagonal approximation of the full matrix, as has been performed recently in (Desjardins et al., 2015)

for neural networks.

Toward this end, we examine the Fisher information metric in the single-parameter case (), which approximates diagonal blocks of the full metric. Then, the model of interest here is simply . We focus on the case of small because the slow learning of is prominent in this region as noted above. Although the exact form of the metric is difficult to compute, our focus on the small- case allows further approximation. Taylor expansion around

and lowest-order approximation give us a metric tensor:

 G≡(GββGβπGβπGππ)=λN⟨x2⟩(λβ2N⟨x2⟩f(π)+π2βπβπβ2)., (17)

where represent polynomials of and . The key factor in is because this represents the fact that a smaller value of makes the model more insensitive to changes in .

The approximated metric obtained above is complicated and difficult to handle. Hence, we consider the following simple reparametrization for :

 (βk,πk)→(βk,sk=βkπk). (18)

Let us show what metric is introduced in -space through the reparametrization. Toward this end, we recall that considering usual gradient ascent on ()-space corresponds to introducing a metric in the original space, where is the Jacobian matrix for the reparametrization. Then, the reparametrization corresponds to introducing a block diagonal metric tensor in which each diagonal block is given by

 G′=(1+π2kβkπkβkπkβ2k). (19)

The similarity between and is clear. Although they are not identical, is simpler and shares the key factor as .

##### FAB-G step

In the FAB-G step, we replace the FAB-M step by gradient ascent. We calculate the gradient on the parameter space after reparametrization (), and project it onto the original space. Then, we obtain the following update rule for ():

 (βt+1kπt+1k)=(βtkπtk)+ηt⎛⎜ ⎜⎝1−πkβk−πkβk1+π2kβ2k⎞⎟ ⎟⎠⎛⎜⎝∂G∂βk∂G∂πk⎞⎟⎠, (20)

where is the number iterations and are the learning coefficients. We note that the update of by is scaled by , and then accelerated when is small, as expected. When , we update only by . Since the singularity problem comes from and , not , updating retains the closed-form solution Eq. (14).

We refer to the FAB algorithm as the FAB-EG algorithm when the G step replaces the M step. Even though the FAB-EG algorithm shows faster convergence, the fast initial progress in the FAB-EM algorithm remains an attractive feature. Thus, to exploit both benefits, we propose a hybrid algorithm in which the learning progresses by FAB-EM initially and by FAB-EG later. The pseudo-code of the hybrid algorithm is given in Algorithm 2.

## 5 Experiments

### 5.1 Overview

First, we evaluate BM (i.e., the proposed method), Lasso, and ARD with a simple example where , and we show that BM outperforms the other two in terms of the sparsity-shrinkage trade-off. Second, using the same example, we show how the parameters are updated by the FAB-EM and FAB-EG algorithms. Specifically, we highlight how the use of gradient ascent and the introduction of the reparametrization help to avoid being trapped in the singular region. Third, we demonstrate that the hybrid FAB-EM-EG algorithm converges faster than the FAB-EM algorithm. Finally, we evaluate BM, Lasso, and ARD again using larger values of .

### 5.2 Experiment with Two Features

For demonstration purposes, we borrowed a simple setting from Aravkin et al. (2014) who considered that

 (y1y2)=(100.51)(β1β2)+(ϵ1ϵ2), (21)

where and are sampled from . Assume that the true parameter values are and , i.e., the first feature is irrelevant and the second feature is relevant. Note that the variance is supposed to be known for simplicity.

#### 5.2.1 Comparison of BM, Lasso, and ARD

In our setting, we generated 500 datasets, each containing samples of . Note that, in BM (Algorithm 1), we adopted zero tolerance for model pruning: we pruned the -th feature only when was smaller than the machine epsilon. In Lasso, we determined by 2-fold cross validation.

The estimation results are summarized in Figure 2; the left panel shows when the irrelevant feature was pruned and the right panel shows the frequency of pruning of the irrelevant feature in the 500 trials. Note that the relevant feature was not pruned in any of the methods. We can easily see that BM achieved the highest sparsity without shrinkage of . On the other hand, ARD displayed no visible shrinkage as in BM; however, its sparsity was lower than that of BM. Lasso displayed shrinkage bias and the lowest sparsity.

#### 5.2.2 Learning Trajectory of FAB-EM and FAB-EG Algorithms

Using the same simple example, we show how the parameters are updated by the FAB-EM and FAB-EG algorithms. For comparison, we also performed the FAB-EG algorithm without reparametrization. We fixed the learning coefficient as for FAB-EG and for FAB-EG without reparametrization.

Figure 3 shows typical learning trajectories of and with 100 samples. We considered the 10 initial points of and located diagonally in the upper right. The initial values of and are set to the true values. In FAB-EM, the learning trajectories were trapped around . In FAB-EG without reparametrization, the trapping was mitigated but still occurred, especially when the initial values of were small. Intuitively speaking, this is because the gradient for is shallow with small . Thus, the learning trajectory approached to smaller since the feature was irrelevant, and then, the learning of became slower. In contrast, the learning trajectories of FAB-EG with the reparametrization approached to with fewer iterations regardless of the initial points, which means that the irrelevant feature was pruned quickly. This result empirically demonstrates that using gradient ascent alone improves the convergence only slightly, but combining it with the reparametrization accelerates the convergence sharply.

### 5.3 Experiment with Larger Number of Features

Next, we explain the results with larger examples (). We generated and

from the uniform distribution in

, and half of the elements in were set as zero. was generated by Eq. (1) with . was set as . We controlled as described in Appendix C.

#### 5.3.1 Performance Validation of Hybrid FAB-EM-EG algorithm

With , we demonstrate that the hybrid FAB-EM-EG algorithm converges faster than the FAB-EM algorithm. Toward this end, we counted the number of correctly pruned features and plotted it against the elapsed time for the algorithms. We set in Algorithm 2 to iterations. Figure 4 shows the number of correctly pruned features against the elapsed time. We can clearly see the faster convergence of the hybrid FAB-EM-EG algorithm. Note that the faster convergence is not attributable to over-pruning because the number of wrongly pruned features at termination were (hybrid FAB-EM-EG) and (FAB-EM-EG), which were nearly equal.

#### 5.3.2 Precision and Recall

For , we used two performance measures, Recall and Precision, defined as Precision and Recall , where and are the numbers of true and estimated irrelevant features, respectively, and is the number of correctly pruned features. We examined , , , and , and for each , we generated 100 datasets. Figure 5 summarizes the estimation results for BM (Algorithm 2), Lasso, and ARD. We set the algorithm switching point as iterations. In Lasso, was determined by 10-fold cross validation. As shown in the left and middle panels, although BM displayed slightly lower Precision than the others, it achieved the highest Recall. We also computed the

score, the harmonic mean of Precision and Recall, which can be interpreted as a metric for evaluating the performance in terms of the sparsity-shrinkage trade-off. As shown in the right panel, BM attained the highest

score for all values in this range. Thus, we concluded that BM achieved the best performance for the larger values of .

## 6 Conclusion

In this paper, we proposed a new sparse estimation method, BM, whose key feature is that it does not impose direct regularization on the weights. Our strategy was to introduce binary latent variables that randomly mask features and to perform a variational Bayesian inference based on FIC. In addition, we introduced gradient ascent and the reparametrization to accelerate the convergence. Our analysis of the estimators of BM, Lasso, and ARD highlighted how their sparsity mechanisms are different from one another. Finally, experimental comparisons of the three methods demonstrated that BM achieves the best performance in terms of the sparsity-shrinkage trade-off.

Note that augmenting a statistical model by random masking variables itself is not a new idea. For example, van der Maaten et al. (2013)

used random masking to generate virtual training data. However, our approach is distinguished from those studies by its purpose. Namely, we aim to identify whether the features are relevant or not, rather than improving prediction performance. In the augmented model, the FAB algorithm penalizes the masking rates, i.e., existence probability of the features, unlike the sparse regularization techniques where the weight values of the features are penalized. Applying the BM to real-world tasks where model identification is crucial, e.g., causal network inference, is a promising future work.

Y.K. was supported by the Platform Project for Supporting in Drug Discovery and Life Science Research (Platform for Dynamic Approaches to Living Systems) from Japan Agency for Medical Research and development (AMED). S.M. and K.H. were supported by JSPS KAKENHI Grant Number 15K15210 and 15K16055, respectively.

## References

• Amari (1998) Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10:251–276, 1998.
• Amari et al. (2006) Shun-ichi Amari, Hyeyoung Park, and Tomoko Ozeki. Singularities affect dynamics of learning in neuromanifolds. Neural Computation, 18(5):1007–1065, 2006.
• Aravkin et al. (2014) Aleksandr Aravkin, James V. Burke, Alessandro Chiuso, and Gianluigi Pillonetto. Convex vs non-convex estimators for regression and sparse estimation: the mean squared error properties of ARD and GLasso.

Journal of Machine Learning Research

, 15:217–252, 2014.
• Desjardins et al. (2015) G. Desjardins, K. Simonyan, R. Pascanu, and K. Kavukcuoglu. Natural Neural Networks. ArXiv e-prints, 2015.
• Fujimaki and Morinaga (2012) Ryohei Fujimaki and Satoshi Morinaga. Factorized asymptotic bayesian inference for mixture modeling. In AISTATS, 2012.
• Hayashi and Fujimaki (2013) Kohei Hayashi and Ryohei Fujimaki. Factorized asymptotic bayesian inference for latent feature models. In 27th Annual Conference on Neural Information Processing Systems (NIPS), 2013.
• Hayashi et al. (2015) Kohei Hayashi, Shin-ichi Maeda, and Ryohei Fujimaki. Rebuilding factorized information criterion: Asymptotically accurate marginal likelihood. In International Conference on Machine Learning (ICML), 2015.
• MacKay (1994) DavidJ MacKay.

Bayesian Methods for Backpropagation Networks.

In Models of Neural Networks III, pages 211–254. Springer, 1994.
• Maeda and Ishii (2007) Shin-ichi Maeda and Shin Ishii. Convergence analysis of the EM algorithm and joint minimization of free energy. In IEEE Workshop on Machine Learning for Signal Processing, 2007.
• Neal (1996) Radford M. Neal. Bayesian Learning for Neural Networks. Springer, 1996.
• Petersen and Pedersen (2012) K. B. Petersen and M. S. Pedersen. The matrix cookbook, 2012.
• Salakhutdinov et al. (2003) Ruslan Salakhutdinov, Sam Roweis, and Zoubin Ghahramani. Optimization with EM and expectation-conjugate-gradient. In International Conference on Machine Learning (ICML), 2003.
• Tibshirani (1994) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1994.
• van der Maaten et al. (2013) Laurens van der Maaten, Minmin Chen, Stephen Tyree, and Kilian Q. Weinberger. Learning with marginalized corrupted features. In International Conference on Machine Learning (ICML), 2013.
• Wei et al. (2008) Haikun Wei, Jun Zhang, Florent Cousseau, Tomoko Ozeki, and Shun-ichi Amari. Dynamics of learning near singularities in layered networks. Neural Computation, 20(3):813–843, 2008.
• Wipf and Nagarajan (2008) David P. Wipf and Srikantan S. Nagarajan. A new view of automatic relevance determination. In Advances in Neural Information Processing Systems 20, 2008.

## Appendix A The One-Dimensional ARD Estimator

According to Wipf and Nagarajan (2008), the negative marginal log-likelihood when is given by

 log|λ−1I+γxx⊤|+y⊤(λ−1I+γxx⊤)−1y (22) = log(λ−1+γx⊤x)+λy⊤y−y⊤λ2γxx⊤1+λγx⊤xy (23) = log(λ−1+γx⊤x)+λy⊤y−λγ(x⊤y)2λ−1+γx⊤x. (24)

In the second line, we use the matrix determinant lemma (Petersen and Pedersen, 2012, Eq. (24)) for the first term and the variant of the Sherman-Morrison relation (Petersen and Pedersen, 2012, Eq. (160)) for the second term. The derivative is

 ∂ Eq.~{}(???)∂γ =x⊤xλ−1+γx⊤x−λ(x⊤y)2λ−1+γx⊤x+λγ(x⊤y)2x⊤x(λ−1+γx⊤x)2 (25) =x⊤x(λ−1+γx⊤x)−λ(x⊤y)2(λ−1+γx⊤x)+λγ(x⊤y)2x⊤x(λ−1+γx⊤x)2 (26) =λ−1x⊤x+γ(x⊤x)2−(x⊤y)2(λ−1+γx⊤x)2. (27)

The stationary point is then given as

 ^γ =max(0,(x⊤y)2−λ−1x⊤x(x⊤x)2) (28) =max(0,^β2LS−(λx⊤x)−1). (29)

Note that we use the max operator since is the variance and it must be non-negative. By substituting the result of into Eq. (5), we obtain

 ^βARD =^γx⊤(λ−1I+^γxx⊤)−1y (30) =λ^γx⊤y−λ^γ2x⊤xx⊤yλ−1+^γx⊤x (31) =λ^β2LSx⊤y−^βLS−λ^β4LSx⊤xx⊤y−2^β2LSx⊤y+λx⊤yλ2x⊤xλ−1+^β2LSx⊤x−λ−1 (32) =λ^β2LSx⊤y−^βLS−λ^β4LSx⊤xx⊤y−2^β2LSx⊤y+λ−1^βLS^β2LSx⊤x (33) =−^βLS+2^βLS−1λ^βLSx⊤x (34) =^βLS−1λx⊤y. (35)

Recall that when . Since and are both non-negative, the condition is written as , or equivalently, . Substituting this condition into the above equation yields Eq. (8).

## Appendix B Derivation of The Lower Bound of FIC

Hayashi et al. (2015) obtained a general representation of the lower bound of FIC, and in this case, we have

 FIC≥Eq[logp(y,Z|X,β,λ,π)]−12Eq[log|Fβ|]−K+12logN+H(q), (36)

where and denotes the Hessian matrix of the negative log-likelihood w.r.t. . Note that the priors for and do not appear in the lower bound, since the priors do not depend on , as assumed in Section 4.1. In order to derive the FAB algorithm, we compute the lower bound of the second term on the right-hand side.

The Hessian matrix is represented as

 Fβ = −∇β∇βlogp(y,Z|X,β,λ,π) (37) = (X∘Z)⊤(X∘Z). (38)

Then, using Hadamard’s inequality for a positive-semidefinite matrix yields

 −log|Fβ| ≥ −∑klog∑nx2nkznk (39) ≥ −∑klog(∑nznk)(∑nx2nk) (40) = −∑klog∑nznk+const. (41)

As stated in Fujimaki and Morinaga (2012), since is a concave function, its linear approximation at yields the lower bound:

 −Eq[log∑nznk]≥−(logNπk+∑nEq[znk]/N−πkπk). (42)

Thus, we obtain Eq. (10) in the main text.

## Appendix C Control of Learning Coefficient

We explain how to set the learning coefficients in Section 5.3. We set to a constant value ; however, when the maximum of the update of is greater than , is modified such that the maximum is . We chose the constant as , where is the number of samples.