I General Formulas
For concreteness, consider a series of $n$ independent observations of some dependent variable $y$, i.e. $\mathbf{y} = (y^{(1)}, \ldots, y^{(n)})$, that we want to explain using a set of $p$ potential features, $x_1, \ldots, x_p$. Furthermore, assume that to each of these potential features, $x_j$, we associate a model parameter $\theta_j$ and an indicator variable $s_j$, with $s_j = 1$ if the feature is included in the model and $s_j = -1$ otherwise. Let $S$ denote an index set specifying the positions for which $s_j = 1$. The cardinality of $S$, i.e. the number of indicator variables with $s_j = 1$, is denoted $|S|$. Also, we define a vector $\boldsymbol{\theta}_S$ of length $|S|$ that contains the model parameters corresponding to the active features, $j \in S$. With these definitions, we define the log-likelihood for the observed data as a function of the active features as $\ell(\boldsymbol{\theta}_S) \equiv \ln P(\mathbf{y}|\boldsymbol{\theta}_S, S)$. Throughout, we assume that the log-likelihood is twice differentiable.
In addition to the log-likelihood, we need to specify prior distributions for the parameters. Here, we will work with factorized priors of the form
\[ P(\boldsymbol{\theta}_S | S) = \prod_{j \in S} \frac{e^{-\lambda f(\theta_j)}}{Z_f(\lambda)}, \]
where $\lambda$ is a parameter that controls the strength of the regularization, $f(\theta)$ is a twice differentiable convex function minimized at the point $\theta = 0$ with $f(0) = 0$, and $Z_f(\lambda) = \int d\theta\, e^{-\lambda f(\theta)}$ is the normalization constant. As the function $f(\theta)$ is convex, we are assuming that the second derivative evaluated at $\theta = 0$, denoted $f''(0)$, is positive. Many commonly used priors take this form, including Gaussian and hyperbolic priors. Plugging these expressions into (2) yields an expression for the evidence when only the subset of features, $S$, is included:
\[ P(\mathbf{y}|S) = \int d\boldsymbol{\theta}_S\, e^{\ell(\boldsymbol{\theta}_S)} \prod_{j \in S} \frac{e^{-\lambda f(\theta_j)}}{Z_f(\lambda)}. \]
By definition, this integral is dominated by the second (prior) term for strongly regularizing priors. Since the log-likelihood is extensive in the number of data points, $n$, this generally requires that the regularization strength be much larger than the number of data points, $\lambda \gg n$. For such strongly regularizing priors we can perform a saddle-point approximation for the evidence. This yields, up to an irrelevant constant,
\[ \ln P(\mathbf{y}|S) \approx \frac{1}{2}\, \mathbf{g}_S^T \big( \tilde{\lambda} I - M_S \big)^{-1} \mathbf{g}_S - \frac{1}{2} \ln\det\!\big( I - M_S/\tilde{\lambda} \big), \]
where $\tilde{\lambda} \equiv \lambda f''(0)$ is a “renormalized” regularization strength,
\[ g_j = \frac{\partial \ell}{\partial \theta_j} \]
is the gradient of the log-likelihood, and
\[ M_{jk} = \frac{\partial^2 \ell}{\partial \theta_j\, \partial \theta_k} \]
is the Hessian of the log-likelihood, with all derivatives evaluated at $\boldsymbol{\theta}_S = 0$. We emphasize that the large parameter in our saddle-point approximation is the regularization strength $\lambda$. This differs from previous statistical physics approaches to Bayesian feature selection that commonly assume $n \to \infty$ and hence perform a saddle point in $n$ Kinney and Atwal (2014); Balasubramanian (1997); Nemenman and Bialek (2002). The fundamental reason for this difference is that we work in the strongly-regularizing regime where the prior is assumed to be important even in the infinite data limit.
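For completeness, the Gaussian integral behind this saddle point can be sketched explicitly. Writing $\mathbf{g}_S$ and $M_S$ for the gradient and Hessian of the log-likelihood at $\boldsymbol{\theta}_S = 0$ and $\tilde{\lambda} = \lambda f''(0)$, and expanding the exponent of the evidence integral through quadratic order (a sketch, valid when $\tilde{\lambda}$ is the largest scale in the problem):

```latex
\begin{aligned}
P(\mathbf{y}|S)
 &\approx \frac{e^{\ell(0)}}{Z_f(\lambda)^{|S|}}
   \int d\boldsymbol{\theta}_S\,
   \exp\!\Big( \mathbf{g}_S^T\boldsymbol{\theta}_S
   - \tfrac{1}{2}\,\boldsymbol{\theta}_S^T\big(\tilde{\lambda} I - M_S\big)\boldsymbol{\theta}_S \Big) \\
 &= \frac{e^{\ell(0)}\,(2\pi)^{|S|/2}}{Z_f(\lambda)^{|S|}\,\sqrt{\det\!\big(\tilde{\lambda} I - M_S\big)}}\,
   \exp\!\Big( \tfrac{1}{2}\,\mathbf{g}_S^T\big(\tilde{\lambda} I - M_S\big)^{-1}\mathbf{g}_S \Big).
\end{aligned}
```

Since $Z_f(\lambda) \approx \sqrt{2\pi/\tilde{\lambda}}$ for strongly regularizing priors, the $|S|$-dependent prefactors combine into $\det(I - M_S/\tilde{\lambda})^{-1/2}$, which is why only $S$-dependent, non-constant terms survive.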
Since for strongly-regularizing priors $\tilde{\lambda}$ is the largest scale in the problem, we can expand the log in (5) in a power series in $\tilde{\lambda}^{-1}$ to order $\tilde{\lambda}^{-2}$ to get:
\[ \ln P(\mathbf{y}|S) \approx \mathrm{const} + \frac{1}{2\tilde{\lambda}} \sum_{j \in S} \big( g_j^2 + M_{jj} \big) + \frac{1}{4\tilde{\lambda}^2} \sum_{j,k \in S} \big( 2\, g_j M_{jk}\, g_k + M_{jk}^2 \big). \]
One of the most striking aspects of this expression is that it is independent of the detailed form of the prior function, $f(\theta)$. All that is required is that $f(\theta)$ is a twice differentiable convex function with a global minimum at $\theta = 0$. Different choices of prior simply “renormalize” the effective regularization strength $\tilde{\lambda} = \lambda f''(0)$.
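As a quick numerical sanity check of this expansion (a sketch with illustrative numbers, not an example from the paper), one can compare the exact saddle-point form of the evidence with its $O(\tilde{\lambda}^{-2})$ expansion for a toy two-feature problem with a quadratic log-likelihood and a Gaussian prior (so $\tilde{\lambda} = \lambda$):

```python
import math

# Toy check (illustrative numbers): quadratic log-likelihood
# l(theta) = g . theta + (1/2) theta^T M theta with p = 2 features.
g = [1.0, 0.5]
M = [[-2.0, 0.3], [0.3, -1.0]]
lam = 1.0e4

# Exact: (1/2) g^T (lam I - M)^{-1} g - (1/2) ln det(I - M/lam),
# done with explicit 2x2 linear algebra for A = lam * I - M.
a11, a12 = lam - M[0][0], -M[0][1]
a21, a22 = -M[1][0], lam - M[1][1]
det_a = a11 * a22 - a12 * a21
inv_g = [(a22 * g[0] - a12 * g[1]) / det_a,
         (-a21 * g[0] + a11 * g[1]) / det_a]
quad = 0.5 * (g[0] * inv_g[0] + g[1] * inv_g[1])
logdet = -0.5 * (math.log(det_a) - 2.0 * math.log(lam))  # -(1/2) ln det(I - M/lam)
exact = quad + logdet

# Expansion: sum_j (g_j^2 + M_jj)/(2 lam)
#          + sum_{j,k} (2 g_j M_jk g_k + M_jk^2)/(4 lam^2)
approx = sum((g[j] ** 2 + M[j][j]) / (2.0 * lam) for j in range(2)) \
       + sum((2.0 * g[j] * M[j][k] * g[k] + M[j][k] ** 2) / (4.0 * lam ** 2)
             for j in range(2) for k in range(2))
```

For $\lambda$ much larger than the entries of $\mathbf{g}$ and $M$, the two values agree to the expected $O(\tilde{\lambda}^{-3})$ accuracy.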
Rewriting this expression in terms of the spin variables, $s_j = \pm 1$ (with $j \in S$ corresponding to $s_j = 1$), we see that the evidence takes the Ising form with local fields:
\[ h_j = \frac{g_j^2 + M_{jj}}{4\tilde{\lambda}} + \frac{1}{8\tilde{\lambda}^2} \sum_{k} \big( 2\, g_j M_{jk}\, g_k + M_{jk}^2 \big) \]
and couplings given by:
\[ J_{jk} = \frac{1}{8\tilde{\lambda}^2} \big( 2\, g_j M_{jk}\, g_k + M_{jk}^2 \big). \]
Notice that the couplings, which are proportional to the small parameter $\tilde{\lambda}^{-2}$, are weak. According to Bayes rule, $P(\mathbf{s}|\mathbf{y}) \propto P(\mathbf{y}|\mathbf{s})\, P(\mathbf{s})$, so the posterior probability of a set of features is described by an Ising model of the form:
\[ P(\mathbf{s}|\mathbf{y}) = \frac{1}{Z} \exp\!\Big( \sum_j h_j s_j + \sum_{j<k} J_{jk}\, s_j s_k \Big), \]
where $Z$ is the partition function that normalizes the probability distribution. We term this the Bayesian Ising Approximation (BIA) Fisher and Mehta (2014). Finally, for future use, it is useful to define a scale $\lambda^*$ below which the Ising approximation breaks down. This scale can be computed by requiring that the power series expansion used to derive (8) converge Fisher and Mehta (2014).
We demonstrated the utility of the BIA for feature selection in the specific context of linear regression in a recent paper, and we can import that machinery here Fisher and Mehta (2014). It is useful to explicitly indicate the dependence of the various expressions on the regularization strength, writing $h_j(\lambda)$ and $J_{jk}(\lambda)$. We want to compute the marginal posterior probability that a feature, $x_j$, is relevant:
\[ P(s_j = 1 | \mathbf{y}) = \frac{1 + \langle s_j \rangle}{2} \equiv \frac{1 + m_j(\lambda)}{2}, \]
where we have defined the magnetizations $m_j(\lambda) \equiv \langle s_j \rangle$. While there are many techniques for calculating the magnetizations of an Ising model, we focus on the mean field approximation, which leads to a self-consistent equation Opper and Winther (2001):
\[ m_j(\lambda) = \tanh\!\Big( h_j(\lambda) + \sum_{k \neq j} J_{jk}(\lambda)\, m_k(\lambda) \Big). \]
Our expressions depend on a free parameter ($\lambda$) that determines the strength of the prior distribution. As it is usually difficult, in practice, to choose a specific value of $\lambda$ ahead of time, it is often helpful to compute the feature selection path, i.e. to compute $P(s_j = 1 | \mathbf{y})$ over a wide range of $\lambda$'s. Indeed, computing the variable selection path is a common practice when applying other feature selection techniques such as LASSO regression Tibshirani (1996). To obtain the mean field variable selection path as a function of $\lambda$, we notice that $m_j(\lambda) \to 0$ as $\lambda \to \infty$ and so define the recursive formula:
\[ m_j(\lambda_{t+1}) = \tanh\!\Big( h_j(\lambda_{t+1}) + \sum_{k \neq j} J_{jk}(\lambda_{t+1})\, m_k(\lambda_t) \Big), \qquad \lambda_{t+1} = (1 - \epsilon)\, \lambda_t, \]
with a small step size $\epsilon$. We have set $\epsilon$ to the same small value in all of the examples presented below.
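The recursion above is straightforward to implement. The following sketch (an illustration, not code from the paper) uses the $O(\tilde{\lambda}^{-2})$ field and coupling expressions derived above with $\tilde{\lambda} = \lambda$; the toy gradient, Hessian, and all numerical settings are hypothetical:

```python
import math

def bia_path(g, M, lam_max, n_steps=200, eps=0.05, n_sweeps=50):
    """Mean-field BIA feature-selection path (illustrative sketch).

    g : gradient of the log-likelihood at theta = 0 (length p)
    M : Hessian of the log-likelihood at theta = 0 (p x p nested lists)

    Starting from lam_max, where all magnetizations vanish, lambda is
    lowered geometrically; each mean-field solve is warm-started from
    the solution at the previous lambda."""
    p = len(g)
    m = [0.0] * p                      # magnetizations at lambda -> infinity
    lam = lam_max
    path = []
    for _ in range(n_steps):
        lam *= (1.0 - eps)             # small multiplicative step down in lambda
        # Fields and couplings from the O(1/lambda^2) expansion (lam_tilde = lam)
        h = [(g[j] ** 2 + M[j][j]) / (4.0 * lam)
             + sum(2.0 * g[j] * M[j][k] * g[k] + M[j][k] ** 2 for k in range(p))
               / (8.0 * lam ** 2)
             for j in range(p)]
        J = [[(2.0 * g[j] * M[j][k] * g[k] + M[j][k] ** 2) / (8.0 * lam ** 2)
              for k in range(p)] for j in range(p)]
        for _ in range(n_sweeps):      # iterate the self-consistent equations
            m = [math.tanh(h[j] + sum(J[j][k] * m[k] for k in range(p) if k != j))
                 for j in range(p)]
        path.append((lam, [(1.0 + mj) / 2.0 for mj in m]))  # P(s_j = 1 | y)
    return path
```

In practice the path is only trusted for $\lambda$ above the crossover scale $\lambda^*$ where the power series converges; stopping the loop there would be the natural refinement.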
Logistic regression is a commonly used statistical method for modeling categorical data Bishop et al. (2006). To simplify notation, it is also useful to define an extra feature variable, $x_0$, which is always equal to $1$, and a $(p+1)$-dimensional vector of features, $\mathbf{x} = (x_0, x_1, \ldots, x_p)^T$, with corresponding parameters, $\boldsymbol{\theta} = (\theta_0, \theta_1, \ldots, \theta_p)^T$. In terms of these vectors, the likelihood function for logistic regression takes the compact form
\[ P(y | \mathbf{x}, \boldsymbol{\theta}) = \frac{e^{y\, \boldsymbol{\theta}^T \mathbf{x}}}{1 + e^{\boldsymbol{\theta}^T \mathbf{x}}}, \qquad y \in \{0, 1\}, \]
where $\boldsymbol{\theta}^T$ is the transpose of $\boldsymbol{\theta}$. If we have $n$ independent observations of the data (labeled by an index $i$), the log-likelihood can be written as:
\[ \ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \Big[ y^{(i)}\, \boldsymbol{\theta}^T \mathbf{x}^{(i)} - \ln\!\big( 1 + e^{\boldsymbol{\theta}^T \mathbf{x}^{(i)}} \big) \Big]. \]
We supplement this likelihood function with an $L_2$ norm on the parameters of the form:
\[ \lambda f(\theta_j) = \frac{\lambda}{2}\, \theta_j^2, \qquad j = 1, \ldots, p, \]
with $f''(0) = 1$ so that $\tilde{\lambda} = \lambda$, and where the intercept $\theta_0$ is chosen to match the observed probability that $y = 1$ in the data,
\[ \frac{1}{1 + e^{-\theta_0}} = \bar{y} \equiv \frac{1}{n} \sum_{i} y^{(i)}, \qquad \theta_0 = \ln\frac{\bar{y}}{1 - \bar{y}}. \]
Using these expressions, we can calculate the gradient and the Hessian of the log-likelihood:
\[ g_j = \sum_{i} x_j^{(i)} \big( y^{(i)} - \bar{y} \big), \qquad M_{jk} = -\,\bar{y}\,(1 - \bar{y}) \sum_{i} x_j^{(i)} x_k^{(i)}. \]
Notice that the gradient, $g_j$, is proportional to the correlation between the $x_j^{(i)}$ and the $y^{(i)}$. Furthermore, except for a multiplicative constant reflecting the variance of $y$, $M_{jk}$ is just the correlation between $x_j$ and $x_k$. Thus, as in linear regression Fisher and Mehta (2014), the coefficients of the Ising model are related to the correlations between variables and/or the data. In fact, it is easy to show this is the case for all Generalized Linear Models (see Appendix).
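These closed-form expressions are easy to verify numerically. The sketch below (hypothetical toy data, not the notMNIST example) checks the gradient and Hessian formulas at $\boldsymbol{\theta} = (\theta_0, 0)$ against finite differences of the logistic log-likelihood:

```python
import math

def log_likelihood(theta, X, y):
    # Logistic log-likelihood: sum_i [ y_i*(theta.x_i) - ln(1 + exp(theta.x_i)) ]
    total = 0.0
    for xi, yi in zip(X, y):
        z = sum(t * x for t, x in zip(theta, xi))
        total += yi * z - math.log(1.0 + math.exp(z))
    return total

# Toy data (hypothetical): x_0 = 1 is the constant feature, x_1 a real feature.
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 0.3], [1.0, 2.0], [1.0, -0.7]]
y = [1, 0, 1, 1, 0]
ybar = sum(y) / len(y)
theta0 = math.log(ybar / (1.0 - ybar))   # intercept matching P(y = 1) = ybar

# Closed-form gradient and Hessian entry at theta = (theta0, 0)
g1 = sum(xi[1] * (yi - ybar) for xi, yi in zip(X, y))
M11 = -ybar * (1.0 - ybar) * sum(xi[1] ** 2 for xi in X)

def fd_grad(j, d=1e-6):
    # Central finite difference of the log-likelihood along theta_j
    tp = [theta0, 0.0]; tm = [theta0, 0.0]
    tp[j] += d; tm[j] -= d
    return (log_likelihood(tp, X, y) - log_likelihood(tm, X, y)) / (2.0 * d)
```

Note that the finite-difference gradient with respect to $\theta_0$ vanishes, as it must when $\theta_0$ is matched to $\bar{y}$.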
To illustrate our approach, we used the BIA for a logistic regression-based classifier designed to classify Bs and Ds in the notMNIST dataset. The notMNIST dataset consists of diverse images of the letters A-J composed from publicly available computer fonts. Each letter is represented as a 28 by 28 grayscale image whose pixel intensities vary between 0 and 255. Figure 1 shows three randomly chosen examples of the letters B and D from the notMNIST dataset. The BIA was performed using 500 randomly chosen examples of each letter. Notice that the number of examples (500 for each letter) is comparable to the number of pixels (784) in an image, suggesting that strongly regularizing the problem is appropriate.
Using the expressions above, we calculated the couplings for the Ising model describing our logistic-regression-based classifier and computed the feature selection path as a function of $\lambda$ using the mean field approximation. As in linear regression, we estimated the crossover scale $\lambda^*$ from $n$, the number of samples, $p$, the number of potential features, and $\bar{x}$, the root mean squared correlation between pixels (see Figure 2A). Figure 2B shows the posterior probability of all 784 pixels at a fixed value of $\lambda$ along this path. To better visualize this, we have labeled the pixels with the highest posterior probabilities in red in the feature selection path in Fig. 2A and in the sample images shown in Figure 2C. The results agree well with our basic intuition about which pixels are important for distinguishing the letters B and D.
In conclusion, we have presented a general framework for Bayesian feature selection in a large class of statistical models. In contrast to previous Bayesian approaches that assume that the effect of the prior vanishes inversely with the amount of data, we have used strongly-regularizing priors that have large effects on the posterior distribution even in the infinite data limit. We have shown that in the strongly-regularized limit, Bayesian feature selection takes the universal form of an Ising model. Thus, the marginal posterior probability that each feature is relevant can be efficiently computed using a mean-field approximation. Furthermore, for Generalized Linear Models we have shown that the coefficients of the Ising model can be calculated directly from correlations between the data and features.
Surprisingly, aside from some mild regularity conditions, our approach is independent of the choice of prior or likelihood. This suggests that it may be possible to obtain general results about strongly-regularizing priors. It will be interesting to further explore Bayesian inference in this new limit. Our approach also gives a practical algorithm for quickly performing feature selection in many commonly employed statistical and machine learning approaches. The methods outlined here are especially well suited to modern data sets where the number of potential features can vastly exceed the number of independent samples. We envision using the Bayesian feature selection algorithm outlined here as part of a two-stage procedure: one can use the BIA to rapidly screen irrelevant variables and reduce the complexity of the dataset before applying a more comprehensive cross-validation procedure. More generally, it will be interesting to further develop statistical physics based approaches for the analysis of complex data.
Appendix A: Formulas for Generalized Linear Models
The gradient, $g_j$, and Hessian, $M_{jk}$, of the log-likelihood have particularly simple forms for Generalized Linear Models (GLMs), which extend the exponential family of distributions. In the exponential family, we can write the distribution in the form
\[ P(\mathbf{y} | \boldsymbol{\eta}) = h(\mathbf{y}) \exp\!\big( \boldsymbol{\eta}^T \mathbf{T}(\mathbf{y}) - A(\boldsymbol{\eta}) \big), \]
where $\mathbf{T}(\mathbf{y})$ is a vector of sufficient statistics, and the $\eta_a$ are called natural parameters. Notice that for these distributions, we have that
\[ \frac{\partial A}{\partial \eta_a} = \langle T_a \rangle, \qquad \frac{\partial^2 A}{\partial \eta_a\, \partial \eta_b} = \mathrm{Cov}(T_a, T_b), \]
where Cov denotes the covariance (connected correlation function).
In a GLM, we restrict ourselves to distributions over scalar quantities where $T(y) = y$ and say that $\eta = \boldsymbol{\theta}^T \mathbf{x}$. Then, we can write the likelihood as
\[ P(y | \mathbf{x}, \boldsymbol{\theta}) = h(y) \exp\!\big( y\, \boldsymbol{\theta}^T \mathbf{x} - A(\boldsymbol{\theta}^T \mathbf{x}) \big). \]
If we have $n$ independent data points $(y^{(i)}, \mathbf{x}^{(i)})$ with $i = 1, \ldots, n$, then we can write the log-likelihood for such a distribution as
\[ \ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \big[ y^{(i)}\, \boldsymbol{\theta}^T \mathbf{x}^{(i)} - A(\boldsymbol{\theta}^T \mathbf{x}^{(i)}) \big] + \mathrm{const}. \]
Using the expressions above for the exponential family and (6), we have that
\[ g_j = \sum_{i} x_j^{(i)} \big( y^{(i)} - \langle y \rangle_{\theta_0} \big), \]
where $\langle y \rangle_{\theta_0} = A'(\theta_0)$ is the expectation value of $y^{(i)}$ for the choice of parameter $\theta_0$. If we choose $\theta_0$ to reproduce the empirical probability $\bar{y} = n^{-1} \sum_i y^{(i)}$ we get:
\[ g_j = \sum_{i} x_j^{(i)} \big( y^{(i)} - \bar{y} \big). \]
Moreover, the entries of the Hessian are given by:
\[ M_{jk} = -A''(\theta_0) \sum_{i} x_j^{(i)} x_k^{(i)} = -\mathrm{Var}_{\theta_0}(y) \sum_{i} x_j^{(i)} x_k^{(i)}. \]
If we consider standardized variables, $\sum_i x_j^{(i)} = 0$ and $n^{-1} \sum_i (x_j^{(i)})^2 = 1$, then we can write:
\[ g_j = n\, \widehat{\mathrm{Cov}}(x_j, y), \qquad M_{jk} = -n\, \mathrm{Var}_{\theta_0}(y)\, \hat{\rho}_{jk}, \]
where $\widehat{\mathrm{Cov}}(x_j, y)$ is the empirical covariance of $x_j$ and $y$, and $\hat{\rho}_{jk}$ is the empirical correlation of $x_j$ and $x_k$.
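The same formulas can be checked on another member of the family. The sketch below (hypothetical toy data) treats a Poisson GLM, where $A(\eta) = e^{\eta}$, and verifies numerically that the gradient and Hessian at $\theta_j = 0$ for $j \geq 1$, with $\theta_0 = \ln \bar{y}$, reduce to the stated correlation forms:

```python
import math

def poisson_loglik(theta, X, y):
    # Poisson GLM log-likelihood (dropping the theta-independent ln h(y) term):
    # sum_i [ y_i * eta_i - exp(eta_i) ],  with A(eta) = exp(eta)
    total = 0.0
    for xi, yi in zip(X, y):
        eta = sum(t * x for t, x in zip(theta, xi))
        total += yi * eta - math.exp(eta)
    return total

# Hypothetical toy data; x_0 = 1 is the constant feature.
X = [[1.0, 0.4], [1.0, -0.9], [1.0, 1.3], [1.0, 0.1]]
y = [2, 0, 3, 1]
ybar = sum(y) / len(y)
theta0 = math.log(ybar)        # A'(theta0) = exp(theta0) = ybar

# GLM predictions at theta = (theta0, 0):
# g_j = sum_i x_j (y_i - ybar),  M_jk = -A''(theta0) sum_i x_j x_k
g1 = sum(xi[1] * (yi - ybar) for xi, yi in zip(X, y))
M11 = -ybar * sum(xi[1] ** 2 for xi in X)

def fd(j, d=1e-6):
    # Central finite difference of the log-likelihood along theta_j
    tp = [theta0, 0.0]; tm = [theta0, 0.0]
    tp[j] += d; tm[j] -= d
    return (poisson_loglik(tp, X, y) - poisson_loglik(tm, X, y)) / (2.0 * d)
```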
- Bishop et al. (2006) C. M. Bishop, Pattern Recognition and Machine Learning (Springer, New York, 2006).
- Mackay (2003) D. J. C. MacKay, Information Theory, Inference and Learning Algorithms (Cambridge University Press, Cambridge, UK, 2003).
- Tibshirani (1996) R. Tibshirani, Journal of the Royal Statistical Society, Series B (Methodological) 58, 267 (1996).
- Zou and Hastie (2005) H. Zou and T. Hastie, Journal of the Royal Statistical Society, Series B (Statistical Methodology) 67, 301 (2005).
- O’Hagan et al. (2004) A. O’Hagan and J. Forster, Kendall’s Advanced Theory of Statistics, Vol. 2B: Bayesian Inference (Arnold, London, 2004).
- MacKay (1992) D. J. C. MacKay, Neural Computation 4, 415 (1992).
- Fisher and Mehta (2014) C. K. Fisher and P. Mehta, arXiv preprint arXiv:1407.8187 (2014).
- Kinney and Atwal (2014) J. B. Kinney and G. S. Atwal, Neural Computation 26, 637 (2014).
- Balasubramanian (1997) V. Balasubramanian, Neural Computation 9, 349 (1997).
- Nemenman and Bialek (2002) I. Nemenman and W. Bialek, Physical Review E 65, 026137 (2002).
- Opper and Winther (2001) M. Opper and O. Winther, in Advanced Mean Field Methods: Theory and Practice (MIT Press, Cambridge, MA, 2001).