1 Introduction
In sparse linear regression, various approaches impose sparsity through regularization of a weight parameter. For example, Lasso (Tibshirani, 1994) introduces sparsity by regularizing the weights with the L1 norm. Automatic relevance determination (ARD) (MacKay, 1994; Neal, 1996)
regularizes the weights by a prior distribution, with hyperparameters indicating the relevance of the input features. Empirical Bayes estimation of the hyperparameters thus eliminates irrelevant features automatically. Although ARD is notorious for its slow convergence, several authors have improved the algorithm (e.g., Wipf and Nagarajan, 2008).
The tradeoff between sparsity and shrinkage is a crucial issue in sparse regularization methods (Aravkin et al., 2014). In Lasso, for example, a large regularization constant induces strong sparsity and is more likely to estimate the weights of irrelevant features as zero, which is desirable in terms of interpretability. However, it also shrinks the weights of relevant features and may eliminate them. ARD suffers from the same problem, although its bias is weaker than that of Lasso (Aravkin et al., 2014). Because both sparsity and shrinkage are caused by regularization, the shrinkage effects are inevitable as long as we use the regularization scheme.
To address this issue, we propose an alternative method for sparse estimation, namely, Bayesian masking (BM), which differs from existing methods in that it does not impose any regularization on the weights. Our contributions can be summarized as follows.

The BM model (Section 4). The BM model introduces binary latent variables into a linear regression model. The latent variables randomly mask features to zero at each sample, according to priors that are defined per feature but shared among samples. Estimating the priors on the masking rates determines the relevance of the features.

A variational Bayesian inference algorithm for the BM model (Section 4.2). The EM-like coordinate ascent algorithm maximizes a lower bound of the factorized information criterion (FIC). The convergence of the algorithm is accelerated by combining gradient ascent and reparametrization (Section 4.4), motivated by previous studies on the convergence analysis of coordinate ascent and on information geometry.
Through numerical experiments, we empirically show that the proposed method outperforms Lasso and ARD in terms of the sparsity-shrinkage tradeoff.
1.1 Notation
2 Background
2.1 Linear Regression and Least Squares
Consider a linear regression model:
(1) $y = Xw + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_N)$,
where $y \in \mathbb{R}^N$ is a vector of target values, $X \in \mathbb{R}^{N \times D}$ is a matrix of explanatory variables, $w \in \mathbb{R}^D$ is a vector of weight parameters, and $\varepsilon$ denotes observation noise. Further, $N$ is the number of samples and $D$ is the number of features. Because the noise is i.i.d. Gaussian, the maximum likelihood estimator (MLE) is given as the least-squares (LS) solution:
(2) $\hat{w}_{\mathrm{LS}} = (X^\top X)^{-1} X^\top y$.
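As a concrete illustration, the LS solution of Eq. (2) can be computed in a few lines; this is a minimal sketch on synthetic data (the dimensions and noise level are arbitrary choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3                        # number of samples and features (arbitrary)
X = rng.normal(size=(N, D))          # explanatory variables
w_true = np.array([1.5, 0.0, -2.0])  # ground-truth weights (one irrelevant feature)
y = X @ w_true + 0.1 * rng.normal(size=N)  # targets with i.i.d. Gaussian noise

# LS solution (X^T X)^{-1} X^T y, computed stably via lstsq instead of an explicit inverse
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Note that the LS solution itself is dense: the weight of the irrelevant feature comes out small but not exactly zero, which is what the sparse methods below address.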
2.2 Lasso
Lasso is formulated as the L1-penalized regression problem, and the estimator is given as
(3) $\hat{w}_{\mathrm{Lasso}} = \operatorname*{argmin}_w \|y - Xw\|^2 + \lambda \|w\|_1$,
where $\lambda \geq 0$ is a regularization constant.
2.3 ARD
Consider the prior distribution of $w$:
(4) $p(w \mid c) = \prod_{d=1}^{D} \mathcal{N}(w_d \mid 0, c_d)$,
where $c = (c_1, \dots, c_D)$
is the hyperparameter determining the variance of the prior. ARD determines $c$
by the empirical Bayes principle, i.e., by maximizing the marginal log-likelihood $\log p(y \mid c)$. The estimator of $w$ is then often given by the posterior mean with $\hat{c}$ plugged in (Wipf and Nagarajan, 2008; Aravkin et al., 2014):
(5) $\hat{w}_{\mathrm{ARD}} = \mathbb{E}[w \mid y, \hat{c}] = \hat{C} X^\top (\sigma^2 I_N + X \hat{C} X^\top)^{-1} y$, where $\hat{C} = \mathrm{diag}(\hat{c})$.
Clearly, $\hat{c}_d = 0$ results in $\hat{w}_d = 0$ for any $y$.
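In the one-dimensional case, the empirical Bayes step has a closed form (derived in Appendix A): the evidence-maximizing prior variance is $\max(0, \hat{w}_{\mathrm{LS}}^2 - \sigma^2/\|x\|^2)$, and the estimator is the posterior mean with this value plugged in. A minimal sketch, assuming this reconstructed closed form and a known noise variance:

```python
import numpy as np

def ard_1d(x, y, sigma2):
    """1D empirical Bayes ARD estimate (closed form; see Appendix A)."""
    xx = x @ x
    w_ls = (x @ y) / xx                     # least-squares estimate
    c = max(0.0, w_ls**2 - sigma2 / xx)     # evidence-maximizing prior variance
    return c * (x @ y) / (sigma2 + c * xx)  # posterior mean with plugged-in c

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 * x + 0.5 * rng.normal(size=200)

w_relevant = ard_1d(x, y, 0.25)                   # close to the LS estimate
w_irrelevant = ard_1d(x, np.zeros_like(x), 0.25)  # pruned exactly to zero
```

Setting `c = 0` makes the posterior mean exactly zero, which is how ARD produces sparsity.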
3 Tradeoff between Sparsity and Shrinkage
Concerning the tradeoff between sparsity and shrinkage, Aravkin et al. (2014) derived upper bounds on the estimators and showed that undesirable shrinkage occurs in both Lasso and ARD; specifically, the shrinkage bias of Lasso is larger than that of ARD when an unnecessary feature is correctly pruned.
In this section, we revisit this issue. To understand how sparse regularization works, we derive the exact forms of the estimators for $D = 1$ and show that ARD is better than Lasso in terms of the sparsity-shrinkage tradeoff. Although our analysis is much simpler than the earlier study by Aravkin et al. (2014), it is meaningful because our derived estimators

are exact and analytically written (no approximation is needed) and

highlight the significant differences between ARD and Lasso.
3.1 1D Estimators
LS
When $D = 1$, the matrix inverse in Eq. (2) becomes a scalar inverse, and the estimator is simply written as
(6) $\hat{w}_{\mathrm{LS}} = \frac{x^\top y}{\|x\|^2}$,
where we let $x \in \mathbb{R}^N$ represent the single column of $X$ to emphasize the dimensionality.
Lasso
For $w > 0$, the L1 penalty becomes $\lambda w$. Thus, Eq. (3) yields a stationary point $w = \hat{w}_{\mathrm{LS}} - \frac{\lambda}{2\|x\|^2}$ where it is positive. For $w < 0$, the solution is the same except that the sign is reversed. Combining both cases yields the solution for all $\hat{w}_{\mathrm{LS}}$:
(7) $\hat{w}_{\mathrm{Lasso}} = \mathrm{sign}(\hat{w}_{\mathrm{LS}}) \max\!\Big(0,\; |\hat{w}_{\mathrm{LS}}| - \frac{\lambda}{2\|x\|^2}\Big)$.
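Eq. (7) is the familiar soft-thresholding operator. A minimal sketch, assuming the objective $\|y - xw\|^2 + \lambda|w|$ as reconstructed above:

```python
import numpy as np

def lasso_1d(x, y, lam):
    """Soft-thresholding solution of min_w ||y - x w||^2 + lam |w| (Eq. (7))."""
    xx = x @ x
    w_ls = (x @ y) / xx  # 1D least-squares estimate
    return np.sign(w_ls) * max(0.0, abs(w_ls) - lam / (2 * xx))

x = np.array([1.0, 1.0, 1.0, 1.0])  # ||x||^2 = 4, so LS estimate of y = 2x is 2.0
w_small = lasso_1d(x, 2 * x, 8.0)   # threshold 1.0: shrunk from 2.0 to 1.0
w_zero = lasso_1d(x, 2 * x, 32.0)   # threshold 4.0 > 2.0: pruned to zero
```

The same constant is subtracted regardless of the magnitude of the LS estimate, which is the source of the shrinkage bias discussed next.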
ARD
The one-dimensional ARD estimator is derived in Appendix A; the result is
(8) $\hat{w}_{\mathrm{ARD}} = \mathrm{sign}(\hat{w}_{\mathrm{LS}}) \max\!\Big(0,\; |\hat{w}_{\mathrm{LS}}| - \frac{\sigma^2}{\|x\|^2\, |\hat{w}_{\mathrm{LS}}|}\Big)$.
3.2 Comparison of LS, Lasso and ARD
Although their regularizations are different, Lasso and ARD share the same shrinkage mechanism: subtracting a constant from $|\hat{w}_{\mathrm{LS}}|$ and cropping the result to zero if its magnitude is smaller than that constant. Since the constants in Eqs. (7) and (8) are both larger than zero (for ARD, except in the noiseless case $\sigma = 0$), the shrinkage bias is inevitable in both Lasso with $\lambda > 0$ and ARD. On the other hand, this retraction to zero is necessary for sparsity because it prunes the irrelevant features.
It is worth noting that the bias of ARD is much weaker than that of Lasso when the scale of $\hat{w}_{\mathrm{LS}}$ is large. This is easily confirmed from the constant in Eq. (8), $\sigma^2 / (\|x\|^2 |\hat{w}_{\mathrm{LS}}|)$, which indicates that the shrinkage is weak when $|\hat{w}_{\mathrm{LS}}|$ is large but strong when $|\hat{w}_{\mathrm{LS}}|$ is close to zero. Compared to Lasso, this behavior of ARD is desirable, as it maintains sparsity for weak features while alleviating unnecessary shrinkage. Figure 1 shows how shrinkage occurs in Lasso and ARD.
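The difference between the two shrinkage constants can be made concrete numerically; a sketch assuming the reconstructed forms of Eqs. (7) and (8), with arbitrary illustrative values for $\lambda$, $\sigma^2$, and $\|x\|^2$:

```python
import numpy as np

xx, sigma2, lam = 1.0, 1.0, 2.0              # arbitrary illustrative values
w_ls = np.array([0.5, 1.0, 2.0, 5.0, 10.0])  # hypothetical LS estimates

# Amount subtracted from |w_ls| (capped at |w_ls|, where the estimate is cropped to zero)
lasso_bias = np.minimum(np.abs(w_ls), lam / (2 * xx))              # constant
ard_bias = np.minimum(np.abs(w_ls), sigma2 / (xx * np.abs(w_ls)))  # decays as 1/|w_ls|
```

The Lasso bias stays at $\lambda/(2\|x\|^2)$ once above the threshold, whereas the ARD bias vanishes as the LS estimate grows.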
4 BM Model and Inference Algorithms
4.1 BM Model
Obviously, the shrinkage bias of Lasso and ARD comes from the regularization they impose on the weights. For example, if $\lambda = 0$ in Lasso, the loss function becomes equivalent to that of LS, and of course, no shrinkage occurs. Letting the prior variances grow unboundedly yields the same result in ARD.
Hence, we introduce a new estimation method that maintains sparsity by using latent variables instead of regularization. Let $z_n \in \{0, 1\}^D$ be binary latent variables having the same dimensionality as $w$. We insert $z_n$ between $x_n$ and $w$, i.e.,
(9) $y_n = x_n^\top (z_n \circ w) + \varepsilon_n$,
where "$\circ$" denotes the Hadamard product (i.e., element-wise multiplication).
$z_n$ masks the features randomly at each sample. For a Bayesian treatment, we introduce prior distributions. We assume that each $z_{nd}$ follows a Bernoulli prior distribution, $z_{nd} \sim \mathrm{Bernoulli}(\pi_d)$, where $\pi_d$ indicates how relevant feature $d$ is. Then, estimating the priors on the masking rates automatically determines the relevance of the features.
We also introduce priors for the weights and the noise variance; however, we set them to be as weak as possible so that their effects are negligible when $N$ is sufficiently large. Thus, we simply employ them as constants that do not depend on $N$.
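Under the reconstructed model of Eq. (9), sampling from the BM generative process is straightforward; a sketch with hypothetical weights and masking rates (the extreme rates 0 and 1 are chosen only to make the masking visible):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 4
X = rng.normal(size=(N, D))
w = np.array([2.0, -1.0, 3.0, 0.5])  # hypothetical weights
pi = np.array([1.0, 1.0, 0.0, 0.0])  # masking rates: last two features always masked

Z = rng.binomial(1, pi, size=(N, D))                      # per-sample binary masks z_n
y = np.sum(X * Z * w, axis=1) + 0.1 * rng.normal(size=N)  # y_n = x_n^T (z_n o w) + noise
```

A masked feature contributes nothing to any sample, so its weight becomes unidentifiable; this is what the masking-rate priors exploit for feature selection.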
4.2 FAB-EM Algorithm
Our approach is based on the concept of Bayesian model selection; the central task is to evaluate the marginal log-likelihood. However, in our case, the marginal likelihood is intractable.
Thus, we adopt FIC, a recently proposed approximation of the marginal log-likelihood (Fujimaki and Morinaga, 2012; Hayashi and Fujimaki, 2013; Hayashi et al., 2015), together with factorized asymptotic Bayesian inference (FAB), which provides a tractable algorithm for parameter inference and model pruning by optimizing a lower bound of FIC. The FAB algorithm maximizes this lower bound alternately, in an EM-like manner.
To obtain the lower bound of FIC, we introduce a mean-field approximation on the posterior distribution of the latent masks. Then we obtain the objective function as
(10)  
where the bracket notation denotes the expectation under the variational distribution and the last term is its entropy. The derivation of Eq. (10) is described in Appendix B.
FAB E-step
In the FAB E-step, we update the variational distribution. By taking the gradient of the lower bound with respect to it and setting the gradient to zero, we obtain the following fixed-point equations:
(11) 
where $\sigma(\cdot)$ is the sigmoid function. Updating Eq. (11) several times gives us the variational distribution at a local maximum of the lower bound.
FAB M-step
Since only the first and second terms in Eq. (10) are relevant, the FAB M-step is equivalent to the M-step of the EM algorithm. We have closed-form solutions for updating the weights, the masking rates, and the noise variance:
(12)  
(13)  
(14) 
where and . We note that .
Pruning step
As noted in previous papers, the third term in Eq. (10) penalizes the masking rates and automatically eliminates irrelevant features. When the estimated masking rate of a feature reaches zero, we remove the corresponding feature from the model. The pseudocode of the resulting algorithm is given in Algorithm 1.
4.3 Analysis of FAB Estimator
As in Section 3, we investigate the FAB estimator of the weights. If $y$ follows the linear model (1), the FAB estimator is obtained in expectation as
(15) 
for any , where and .
Eq. (15) immediately suggests that the estimator is biased. The bias consists of two cross terms; it increases when the corresponding quantities are correlated and vanishes when the cross terms are zero. This also implies that the bias is weakened by an appropriately estimated posterior. In Section 5.2, we numerically compare the bias of FAB with that of Lasso and ARD, showing that FAB achieves the lowest bias.
Remarkably, when , no cross term appears and the bias vanishes for any satisfying . Furthermore, the 1D estimator is simply written as
(16) 
Again, if for all , recovers .
4.4 FAB-EG and Hybrid FAB-EM-EG Algorithms
Hayashi and Fujimaki (2013) reported that model pruning in FAB is slow, and we find that our algorithm suffers from the same drawback. In our case, the parameters of irrelevant features require many iterations to converge to zero, although those of relevant features converge rapidly. To overcome this problem of slow convergence, we combine gradient ascent and reparametrization, as discussed below.
First, we replace the FAB M-step by gradient ascent, motivated by previous studies (Salakhutdinov et al., 2003; Maeda and Ishii, 2007) on the convergence analysis of the EM algorithm. We find that gradient ascent certainly helps; however, the convergence remains slow. This is because the model distribution is insensitive to changes in the weight of a feature when its masking rate is small. Consequently, the gradient for the weight of an irrelevant feature is shallow, since the estimated masking rate of such a feature takes a small value.
The natural gradient (NG) method (Amari, 1998) can effectively overcome this insensitivity of the model distribution. The key concept of the NG method is to adopt the Fisher information matrix as a metric on the parameter space (the so-called Fisher information metric), so that distances are defined by the Kullback-Leibler (KL) divergence between model distributions instead of the Euclidean distance between parameter vectors. Thus, parameter updates by the NG method make steady changes in the model distribution at each iteration. Amari and his coworkers showed that the NG method is especially effective when the parameter space contains singular regions, i.e., regions where the Fisher information matrix degenerates (Amari et al., 2006; Wei et al., 2008). This is because the problem of shallow gradients is severe around a singular region, since the gradient completely vanishes along the region. Hence, learning by the ordinary gradient is often trapped around singular regions and remains trapped for many iterations, even when the models in those regions show poor agreement with the data. In contrast, learning by the NG method is free from such a slowdown.
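The effect can be seen in a toy problem that is not the paper's model but shares the relevant geometry: a loss $\frac{1}{2}(\pi w - t)^2$ whose gradient with respect to $w$ scales with a small factor $\pi$, so an ordinary gradient step crawls while a step rescaled by the (Fisher-like) factor $\pi^2$ makes steady progress:

```python
def loss_grad(w, pi, t):
    """Gradient of 0.5 * (pi * w - t)^2 w.r.t. w: shallow when pi is small."""
    return pi * (pi * w - t)

pi, t = 0.1, 1.0  # small scale factor; optimum is w* = t / pi = 10

w_gd = 0.0
for _ in range(100):  # ordinary gradient descent: step size effectively shrinks with pi
    w_gd -= 0.5 * loss_grad(w_gd, pi, t)

w_ng = 0.0
for _ in range(100):  # metric-rescaled ("natural") step: divide by the factor pi^2
    w_ng -= 0.5 * loss_grad(w_ng, pi, t) / pi**2
```

After 100 iterations the rescaled iterate has essentially reached $w^{*} = 10$, while the ordinary iterate is still far away.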
In our case, the model has two singular regions for each feature, i.e., a vanishing weight and a vanishing masking rate. In particular, the singularity associated with an irrelevant feature causes the slow convergence. Hence, the NG method should be effective; however, the evaluation of the Fisher information metric is computationally expensive in our case. Therefore, as an alternative, we propose a simple reparametrization that approximates the Fisher information metric. Our strategy is to perform a block-diagonal approximation of the full matrix, as has recently been done in (Desjardins et al., 2015) for neural networks.
Toward this end, we examine the Fisher information metric in the single-feature case, which approximates the diagonal blocks of the full metric. The model of interest then reduces to a single masked feature. We focus on the case of a small masking rate because the slow learning is prominent in this region, as noted above. Although the exact form of the metric is difficult to compute, this focus allows further approximation. A Taylor expansion around the singular point and a lowest-order approximation give us a metric tensor:
(17) 
where the remaining entries are polynomials of the parameters. The key factor is the masking rate, since a smaller masking rate makes the model more insensitive to changes in the weight.
The approximated metric obtained above is complicated and difficult to handle. Hence, we consider the following simple reparametrization:
(18) 
Let us show what metric is introduced in the original parameter space through the reparametrization. Toward this end, recall that ordinary gradient ascent in the reparametrized space corresponds to introducing the metric $(J J^\top)^{-1}$ in the original space, where $J$ is the Jacobian matrix of the reparametrization. Then, the reparametrization corresponds to introducing a block-diagonal metric tensor in which each diagonal block is given by
(19) 
The similarity between the two metrics is clear. Although they are not identical, the induced metric is simpler and shares the key vanishing factor.
FAB-G step
In the FAB-G step, we replace the FAB M-step by gradient ascent. We calculate the gradient in the reparametrized space and project it onto the original space, which yields the following update rule:
(20) 
where $t$ is the number of iterations and the step sizes are the learning coefficients. We note that the update of the weight is rescaled by the reparametrization and is thus accelerated when the masking rate is small, as expected. Since the singularity problem comes from the weights and the masking rates, not from the noise variance, the update of the noise variance retains the closed-form solution Eq. (14).
We refer to the FAB algorithm with the G step in place of the M step as the FAB-EG algorithm. Even though the FAB-EG algorithm shows faster convergence, the fast initial progress of the FAB-EM algorithm remains attractive. Thus, to exploit both benefits, we propose a hybrid algorithm in which learning proceeds by FAB-EM initially and by FAB-EG later. The pseudocode of the hybrid algorithm is given in Algorithm 2.
5 Experiments
5.1 Overview
First, we evaluate BM (i.e., the proposed method), Lasso, and ARD on a simple example with two features, and we show that BM outperforms the other two in terms of the sparsity-shrinkage tradeoff. Second, using the same example, we show how the parameters are updated by the FAB-EM and FAB-EG algorithms. Specifically, we highlight how gradient ascent and the reparametrization help avoid being trapped in the singular region. Third, we demonstrate that the hybrid FAB-EM-EG algorithm converges faster than the FAB-EM algorithm. Finally, we evaluate BM, Lasso, and ARD again using larger numbers of features.
5.2 Experiment with Two Features
For demonstration purposes, we borrowed a simple setting from Aravkin et al. (2014):
(21) 
where the two features are sampled from a standard Gaussian. The true parameter values are such that the first feature is irrelevant and the second feature is relevant. Note that the noise variance is assumed to be known for simplicity.
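The data-generating process can be sketched as follows; since the true parameter values are elided in this excerpt, the weights and noise level below are hypothetical stand-ins chosen to match the description (first feature irrelevant, second relevant):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
x1, x2 = rng.normal(size=N), rng.normal(size=N)  # two features sampled i.i.d.

w1, w2 = 0.0, 1.0  # hypothetical: first feature irrelevant, second relevant
sigma = 0.5        # hypothetical known noise level

y = w1 * x1 + w2 * x2 + sigma * rng.normal(size=N)
```

Any of the three methods is then judged on whether it prunes the first feature without shrinking the weight of the second.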
5.2.1 Comparison of BM, Lasso, and ARD
In this setting, we generated 500 datasets of independently drawn samples. Note that, in BM (Algorithm 1), we adopted zero tolerance for model pruning: we pruned a feature only when its estimated masking rate was smaller than the machine epsilon. In Lasso, we determined the regularization constant by 2-fold cross-validation.
The estimation results are summarized in Figure 2; the left panel shows the estimates when the irrelevant feature was pruned, and the right panel shows the frequency of pruning the irrelevant feature over the 500 trials. Note that the relevant feature was not pruned by any of the methods. We can easily see that BM achieved the highest sparsity without shrinkage. ARD also displayed no visible shrinkage, as in BM; however, its sparsity was lower than that of BM. Lasso displayed both shrinkage bias and the lowest sparsity.
5.2.2 Learning Trajectories of the FAB-EM and FAB-EG Algorithms
Using the same simple example, we show how the parameters are updated by the FAB-EM and FAB-EG algorithms. For comparison, we also ran the FAB-EG algorithm without reparametrization. We fixed the learning coefficients for both FAB-EG variants.
Figure 3 shows typical learning trajectories of the weight and the masking rate with 100 samples. We considered 10 initial points located diagonally in the upper right; the remaining parameters were initialized to their true values. In FAB-EM, the learning trajectories were trapped near the singular region. In FAB-EG without reparametrization, the trapping was mitigated but still occurred, especially when the initial values of the masking rate were small. Intuitively speaking, this is because the gradient for the weight is shallow when the masking rate is small. Thus, the learning trajectory approached smaller masking rates, since the feature was irrelevant, and the learning of the weight then became slower. In contrast, the learning trajectories of FAB-EG with the reparametrization converged in far fewer iterations regardless of the initial points, which means that the irrelevant feature was pruned quickly. This result empirically demonstrates that gradient ascent alone improves the convergence only slightly, whereas combining it with the reparametrization accelerates it sharply.
5.3 Experiment with Larger Number of Features
Next, we present the results for larger examples. We generated the features and the true weights from a uniform distribution, and half of the elements of the true weight vector were set to zero. The targets were then generated by Eq. (1). We controlled the learning coefficient as described in Appendix C.
5.3.1 Performance Validation of the Hybrid FAB-EM-EG Algorithm
We demonstrate that the hybrid FAB-EM-EG algorithm converges faster than the FAB-EM algorithm. Toward this end, we counted the number of correctly pruned features and plotted it against the elapsed time for both algorithms; the switching point in Algorithm 2 was set to a fixed number of iterations. Figure 4 shows the number of correctly pruned features against the elapsed time. We can clearly see the faster convergence of the hybrid FAB-EM-EG algorithm. Note that the faster convergence is not attributable to over-pruning, because the numbers of wrongly pruned features at termination were nearly equal for the two algorithms.
5.3.2 Precision and Recall
For the larger examples, we used two performance measures, Precision and Recall, defined as Precision $= K/\hat{S}$ and Recall $= K/S$, where $S$ and $\hat{S}$ are the numbers of true and estimated irrelevant features, respectively, and $K$ is the number of correctly pruned features. We examined several numbers of features, and for each setting, we generated 100 datasets. Figure 5 summarizes the estimation results for BM (Algorithm 2), Lasso, and ARD. The algorithm switching point was set to a fixed number of iterations. In Lasso, the regularization constant was determined by 10-fold cross-validation. As shown in the left and middle panels, although BM displayed slightly lower Precision than the others, it achieved the highest Recall. We also computed the F1 score, the harmonic mean of Precision and Recall, which can be interpreted as a metric for evaluating performance in terms of the sparsity-shrinkage tradeoff. As shown in the right panel, BM attained the highest F1 score for all settings in this range. Thus, we concluded that BM achieved the best performance for the larger examples as well.
6 Conclusion
In this paper, we proposed a new sparse estimation method, BM, whose key feature is that it does not impose direct regularization on the weights. Our strategy was to introduce binary latent variables that randomly mask features and to perform a variational Bayesian inference based on FIC. In addition, we introduced gradient ascent and the reparametrization to accelerate the convergence. Our analysis of the estimators of BM, Lasso, and ARD highlighted how their sparsity mechanisms are different from one another. Finally, experimental comparisons of the three methods demonstrated that BM achieves the best performance in terms of the sparsityshrinkage tradeoff.
Note that augmenting a statistical model with random masking variables is not itself a new idea. For example, van der Maaten et al. (2013) used random masking to generate virtual training data. However, our approach is distinguished from those studies by its purpose: we aim to identify whether the features are relevant, rather than to improve prediction performance. In the augmented model, the FAB algorithm penalizes the masking rates, i.e., the existence probabilities of the features, unlike sparse regularization techniques, where the weight values of the features are penalized. Applying BM to real-world tasks where model identification is crucial, e.g., causal network inference, is promising future work.
Y.K. was supported by the Platform Project for Supporting in Drug Discovery and Life Science Research (Platform for Dynamic Approaches to Living Systems) from the Japan Agency for Medical Research and Development (AMED). S.M. and K.H. were supported by JSPS KAKENHI Grant Numbers 15K15210 and 15K16055, respectively.
References
 Amari (1998) Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10:251–276, 1998.
 Amari et al. (2006) Shun-ichi Amari, Hyeyoung Park, and Tomoko Ozeki. Singularities affect dynamics of learning in neuromanifolds. Neural Computation, 18(5):1007–1065, 2006.

Aravkin et al. (2014) Aleksandr Aravkin, James V. Burke, Alessandro Chiuso, and Gianluigi Pillonetto. Convex vs non-convex estimators for regression and sparse estimation: the mean squared error properties of ARD and GLasso. Journal of Machine Learning Research, 15:217–252, 2014.
 Desjardins et al. (2015) Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, and Koray Kavukcuoglu. Natural neural networks. arXiv preprint, 2015.
 Fujimaki and Morinaga (2012) Ryohei Fujimaki and Satoshi Morinaga. Factorized asymptotic Bayesian inference for mixture modeling. In AISTATS, 2012.
 Hayashi and Fujimaki (2013) Kohei Hayashi and Ryohei Fujimaki. Factorized asymptotic Bayesian inference for latent feature models. In 27th Annual Conference on Neural Information Processing Systems (NIPS), 2013.
 Hayashi et al. (2015) Kohei Hayashi, Shinichi Maeda, and Ryohei Fujimaki. Rebuilding factorized information criterion: Asymptotically accurate marginal likelihood. In International Conference on Machine Learning (ICML), 2015.

MacKay (1994) David J. C. MacKay. Bayesian methods for backpropagation networks. In Models of Neural Networks III, pages 211–254. Springer, 1994.
 Maeda and Ishii (2007) Shin-ichi Maeda and Shin Ishii. Convergence analysis of the EM algorithm and joint minimization of free energy. In IEEE Workshop on Machine Learning for Signal Processing, 2007.
 Neal (1996) Radford M. Neal. Bayesian Learning for Neural Networks. Springer, 1996.
 Petersen and Pedersen (2012) K. B. Petersen and M. S. Pedersen. The matrix cookbook, 2012.
 Salakhutdinov et al. (2003) Ruslan Salakhutdinov, Sam Roweis, and Zoubin Ghahramani. Optimization with EM and expectationconjugategradient. In International Conference on Machine Learning (ICML), 2003.
 Tibshirani (1994) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1994.
 van der Maaten et al. (2013) Laurens van der Maaten, Minmin Chen, Stephen Tyree, and Kilian Q. Weinberger. Learning with marginalized corrupted features. In International Conference on Machine Learning (ICML), 2013.
 Wei et al. (2008) Haikun Wei, Jun Zhang, Florent Cousseau, Tomoko Ozeki, and Shun-ichi Amari. Dynamics of learning near singularities in layered networks. Neural Computation, 20(3):813–843, 2008.
 Wipf and Nagarajan (2008) David P. Wipf and Srikantan S. Nagarajan. A new view of automatic relevance determination. In Advances in Neural Information Processing Systems 20, 2008.
Appendix A The OneDimensional ARD Estimator
According to Wipf and Nagarajan (2008), the negative marginal log-likelihood for $D = 1$ is given by
(22) $L(c) = \log\det(\sigma^2 I_N + c\, x x^\top) + y^\top (\sigma^2 I_N + c\, x x^\top)^{-1} y$
(23) $= (N-1)\log \sigma^2 + \log(\sigma^2 + c\|x\|^2) + \frac{\|y\|^2}{\sigma^2} - \frac{c\,(x^\top y)^2}{\sigma^2(\sigma^2 + c\|x\|^2)}$
(24) $= \log(\sigma^2 + c\|x\|^2) - \frac{c\,(x^\top y)^2}{\sigma^2(\sigma^2 + c\|x\|^2)} + \mathrm{const.}$
In the second line, we use the matrix determinant lemma (Petersen and Pedersen, 2012, Eq. (24)) for the first term and a variant of the Sherman-Morrison relation (Petersen and Pedersen, 2012, Eq. (160)) for the second term. The derivative is
(25) $\frac{dL}{dc} = \frac{\|x\|^2}{\sigma^2 + c\|x\|^2} - \frac{(x^\top y)^2}{(\sigma^2 + c\|x\|^2)^2}$
(26) $= \frac{\|x\|^2(\sigma^2 + c\|x\|^2) - (x^\top y)^2}{(\sigma^2 + c\|x\|^2)^2}.$
The stationary point is then given as
(27) $\hat{c} = \max\!\Big(0,\; \frac{(x^\top y)^2 - \sigma^2\|x\|^2}{\|x\|^4}\Big) = \max\!\Big(0,\; \hat{w}_{\mathrm{LS}}^2 - \frac{\sigma^2}{\|x\|^2}\Big).$
Note that we use the max operator since $c$ is a variance and must be non-negative. By substituting $\hat{c}$ into Eq. (5), we obtain, for $\hat{c} > 0$,
(28) $\hat{w}_{\mathrm{ARD}} = \frac{\hat{c}\, x^\top y}{\sigma^2 + \hat{c}\|x\|^2} = \frac{(\hat{w}_{\mathrm{LS}}^2 - \sigma^2/\|x\|^2)\, \|x\|^2\, \hat{w}_{\mathrm{LS}}}{\hat{w}_{\mathrm{LS}}^2\, \|x\|^2} = \hat{w}_{\mathrm{LS}} - \frac{\sigma^2}{\|x\|^2\, \hat{w}_{\mathrm{LS}}}.$
Recall that $\hat{w}_{\mathrm{ARD}} = 0$ when $\hat{c} = 0$. The condition $\hat{c} > 0$ is written as $\hat{w}_{\mathrm{LS}}^2 > \sigma^2/\|x\|^2$, or equivalently, $|\hat{w}_{\mathrm{LS}}| > \sigma/\|x\|$. Combining both cases yields Eq. (8).
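The closed-form stationary point can be checked numerically against a grid search over the one-dimensional evidence; a sketch assuming the derivation above (terms constant in $c$ are dropped from the objective):

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma2 = 50, 1.0
x = rng.normal(size=N)
y = 1.5 * x + np.sqrt(sigma2) * rng.normal(size=N)

xx, xy = x @ x, x @ y
w_ls = xy / xx

def neg_log_evidence(c):
    # log(1 + c ||x||^2 / s2) - c (x^T y)^2 / (s2 (s2 + c ||x||^2)), constants in c dropped
    return np.log(1 + c * xx / sigma2) - c * xy**2 / (sigma2 * (sigma2 + c * xx))

c_closed = max(0.0, w_ls**2 - sigma2 / xx)        # closed-form stationary point
grid = np.linspace(0.0, 10.0, 100001)
c_grid = grid[np.argmin(neg_log_evidence(grid))]  # brute-force check on a fine grid
```

The grid minimizer agrees with the closed form up to the grid resolution, which supports the derivation.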
Appendix B Derivation of The Lower Bound of FIC
Hayashi et al. (2015) obtained a general representation of the lower bound of FIC; in our case, we have
(36) 
where the second term involves the Hessian matrix of the negative log-likelihood with respect to the model parameters. Note that the priors for the weights and the noise variance do not appear in the lower bound, since they do not depend on $N$, as assumed in Section 4.1. To derive the FAB algorithm, we compute a lower bound of the second term on the right-hand side.
The Hessian matrix is represented as
(37)  
(38) 
Then, applying Hadamard's inequality for a positive-semidefinite matrix yields
(39)  
(40)  
(41) 
As stated in Fujimaki and Morinaga (2012), since the logarithm is a concave function, its linear approximation at the current estimate yields the lower bound:
(42) 
Thus, we obtain Eq. (10) in the main text.
Appendix C Control of Learning Coefficient
We explain how we set the learning coefficients in Section 5.3. We set the learning coefficient to a constant value; however, when the maximum magnitude of the resulting update would exceed a threshold, the coefficient is rescaled so that the maximum update equals the threshold. We chose the constant depending on the number of samples.
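A sketch of such a rescaling rule, with a hypothetical cap on the largest per-parameter update (the actual constants used in the paper are elided in this excerpt):

```python
import numpy as np

def clipped_step(alpha0, grad, max_update=0.1):
    """Rescale the learning coefficient so the largest update stays at max_update (hypothetical cap)."""
    biggest = np.max(np.abs(alpha0 * grad))  # largest proposed parameter change
    alpha = alpha0 if biggest <= max_update else alpha0 * max_update / biggest
    return alpha * grad
```

This keeps the nominal coefficient for small gradients while preventing any single parameter from jumping too far in one iteration.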