Gaussian priors may not be the best possible choice for the input layer weights of a feedforward neural network (NN) because allowing, a priori, a large weight for a potentially irrelevant feature may deteriorate the predictive performance. This behavior is analogous to a linear model because the input layer weights associated with each hidden unit of a multilayer perceptron (MLP) network can be interpreted as separate linear models whose outputs are combined nonlinearly in the next layer. Integrating over the posterior uncertainty of the unknown input weights mitigates the potentially harmful effects of irrelevant features but it may not be sufficient if the number of input features, or the total number of unknown variables, grows large compared with the number of observations. In such cases, an alternative strategy is required to suppress the effect of the irrelevant features. In this article we focus on a two-layer MLP model structure but aim to form a more general framework that can be used to combine linear models with sparsity-promoting priors using general activation functions and interaction terms between the hidden units.
Recently, appealing surrogates for ARD priors have been presented for linear models. These approaches are based on sparsity favoring priors, such as the Laplace prior (Seeger, 2008) and the spike and slab prior (Hernández-Lobato et al., 2008, 2010). The methods utilize the expectation propagation (EP) (Minka, 2001a) algorithm to efficiently integrate over the analytically intractable posterior distributions. Importantly, these sparse priors do not suffer from similar overfitting as many ARD approaches because point estimates of feature specific parameters such as are not used, but instead, approximate integration is done over the posterior uncertainty resulting from a sparse prior on the weights. Expectation propagation provides a useful alternative to MCMC for carrying out the approximate integration because it has been found computationally efficient and very accurate in many practical applications (Nickisch and Rasmussen, 2008; Hernández-Lobato et al., 2010).
In nonlinear regression, sparsity favoring Laplace priors have been considered for NNs, for instance, by Williams (1995), where the inference is performed using the Laplace approximation. However, a problem with the Laplace approximation is that the curvature of the log-posterior density at the posterior mode may not be well defined for all types of prior distributions, such as, the Laplace distribution whose derivatives are not continuous at the origin (Williams, 1995; Seeger, 2008). Implementing a successful algorithm requires some additional approximations as described by Williams (1995), whereas with EP the implementation is straightforward since it relies only on expectations of the prior terms with respect to a Gaussian measure.
Another possibly undesired characteristic of the Laplace approximation is that it approximates the posterior mean of the unknowns with their MAP estimate and their posterior covariance with the negative Hessian of the posterior distribution at the mode. This local estimate may not represent well the overall uncertainty on the unknown variables and it may lead to worse predictive performance for example when the posterior distribution is skewed(Nickisch and Rasmussen, 2008) or multimodal (Jylänki et al., 2011). Furthermore, when there are many unknowns compared to the effective number of observations, which is typical in practical NN applications, the MAP solution may differ significantly from the posterior mean. For example, with linear models the Laplace prior leads to strictly sparse estimates with many zero weight values only when the MAP estimator of the weights is used. The posterior mean estimate, on the other hand, can result in many clearly nonzero values for the same weights whose MAP estimates are zero (Seeger, 2008)
. In such case the Laplace approximation underestimates the uncertainty of the feature relevances, that is, the joint mode is sharply peaked at zero but the bulk of the probability mass is distributed widely at nonzero weight values. Recently, it has also been shown that in connection with linear models the ARD solution is exactly equivalent to a MAP estimate of the coefficients obtained using a particular class of non-factorized coefficient prior distributions which includes models that have desirable advantages over MAP weight estimates(Wipf and Nagarajan, 2008; Wipf et al., 2011). This connection suggests that the Laplace approximation of the input weights with a sparse prior may possess more similar characteristics with the point-estimate based ARD solution compared to the posterior mean solution.
into nonlinear regression by combining the sparse priors with a two-layer NN in a computationally efficient EP framework. We propose a logical extension of the linear regression models by adopting the algorithms presented for sparse linear models to MLPs with a linear input layer. For this purpose, the main challenge is constructing a reliable Gaussian EP approximation for the analytically intractable likelihood resulting from the NN observation model. Previously, Gaussian approximations for NN models have been formed using the extended Kalman filter (EKF)(de Freitas, 1999) and the unscented Kalman filter (UKF) (Wan and van der Merwe, 2000). Alternative mean field approaches possessing similar characteristic with EP have been proposed by Opper and Winther (1996) and Winther (2001).
We extend the ideas from the UKF approach by utilizing similar decoupling approximations for the weights as proposed by Puskorius and Feldkamp (1991) for EKF-based inference and derive a computationally efficient algorithm that does not require numerical sigma point approximations for multi-dimensional integrals. We propose a posterior approximation that assumes the weights associated with the output-layer and each hidden unit independent. The complexity of the EP updates in the resulting algorithm scale linearly with respect to the number of hidden units and they require only one-dimensional numerical quadratures. The complexity of the posterior computations scale similarly to an ensemble of independent sparse linear models (one for each hidden unit) with one additional linear predictor associated with the output layer. It follows that all existing methodology on sparse linear models (e.g., methods that facilitate computations with large number of inputs) can be applied separately on the sparse linear model corresponding to each hidden unit. Furthermore, the complexity of the algorithm scales linearly with respect to the number of observations, which is beneficial for large datasets. The proposed approach can also be extended beyond standard activation functions and NN model structures, for example, by including a linear hidden unit or predefined interactions between the linear input-layer models.
In addition to generalizing the standard EP framework for sparse linear models we introduce an efficient EP approach for inference on the unknown hyperparameters, such as the noise level and the scale parameters of the weight priors. This framework enables flexible definition of different hierarchical priors, such as one common scale parameter for all input weights, or a separate scale parameter for all weights associated with one input variable (i.e., an integrated ARD prior). For example, assigning independent Laplace priors on the input weights with a common unknown scale parameter does not produce very sparse approximate posterior mean solutions, but, if required, more sparse solutions can be obtained by adjusting the common hyperprior of the scale parameters with the ARD specification. We show that by making independent approximations for the hyperparameters, the inference on them can be done simultaneously within the EP algorithm for the network weights, without the need for separate optimization steps which is common for many EP approaches for sparse linear models and GPs(Rasmussen and Williams, 2006; Seeger, 2008)
, as well as combined EKF and expectation maximization (EM) algorithms for NNs(de Freitas, 1999). Additional benefits are achieved by introducing left-truncated priors on the output weights which prevent possible convergence problems in the EP algorithm resulting from inherent unidentifiability in the MLP network specification.
The main contributions of the paper can be summarized as:
An efficiently scaling EP approximation for the non-Gaussian likelihood resulting from the MLP-model that requires numerical approximations only for one-dimensional integrals. We derive closed-form solutions for the parameters of the site term approximations, which can be interpreted as pseudo-observations of a linear model associated with each hidden unit (Sections 3.1–3.3 and Appendices A–E).
An EP approach for integrating over the posterior uncertainty of the input weights and their hierarchical scale parameters assigned on predefined weight groups (Sections 3.1 and 3.2.1). The approach could be applied also for sparse linear models to construct factorized approximations for predefined weight groups with shared hyperparameters.
Approximate integration over the posterior uncertainty of the observation noise simultaneously within the EP algorithm for inference on the weights of a MLP-network (see Appendix D). Using factorizing approximations, the approach could be extended also for approximate inference on other hyperparameters associated with the likelihood terms and could be applied, for example, in recursive filtering.
Key properties of the proposed model are first demonstrated with three artificial case studies in which comparisons are made with a neural network with infinitely many hidden units implemented as a GP regression model with a NN covariance function and an ARD prior (Williams, 1998; Rasmussen and Williams, 2006). Finally, the predictive performance of the proposed approach is assessed using four real-world data sets and comparisons are made with two alternative models with ARD priors: a GP with a NN covariance function where point estimates of the relevance hyperparameters are determined by optimizing their marginal posterior distribution, and a NN where approximate inference on all unknowns is done using MCMC (Neal, 1996).
2 The Model
We focus on two layer NNs where the unknown function value related to a
-dimensional input vectoris modeled as
where is a nonlinear activation function, the number of hidden units, and
the output bias. Vectorcontains the input layer weights related to hidden unit and is the corresponding output layer weight. Input biases can be introduced by adding a constant term to the input vectors . In the sequel, all weights are denoted by vector , where , and .
In this work, we use the following activation function:
where , and the scaling by ensures that the prior variance of does not increase with assuming fixed Gaussian priors on and . We focus on regression problems with Gaussian observation model , where is the noise variance. However, the proposed approach could be extended for other activation functions and observations models, for example, the probit model for binary classification.
2.1 Prior Definitions
To reduce the effects of irrelevant features, independent zero-mean Laplace priors are assigned for the input layer weights:
where is the :th element of , and is a joint hyperparameter controlling the prior variance of all input weights belonging to group , that is, . Here index variable defines the group in which the weight belongs to. The EP approximate inference is done using the transformed scale parameters . The grouping of the weights can be chosen freely and also other weight prior distributions can be used in place of the Laplace distribution (3). By defining a suitable prior on the unknown group scales useful hierarchical priors can be implemented on the input layer. Possible definitions include one common scale parameter for all inputs (), and a separate scale parameter for all weights related to each feature, which implements an ARD prior (). To obtain the traditional ARD setting the Laplace priors (3
) can be replaced with Gaussian distributions. When the scale parameters are considered unknown, Gaussian hyperpriors are assigned to them:
where a common mean and variance have been defined for all groups . Definition (4) corresponds to a log-normal prior on the associated prior variance for the weights in group .
Because of the symmetry of the activation function, changing the signs of output weight and the corresponding input weights results in the same prediction
. This unidentifiability may cause converge problems in the EP algorithm: if the approximate posterior probability mass of output weightconcentrates too close to zero, expected values of and may start fluctuating between small positive and negative values. Therefore the output weights are constrained to positive values by assigning left-truncated heavy-tailed priors to them:
where , , and denotes a Student-
distribution with degrees of freedom, mean zero, and scale parameter . The prior scale is fixed to and the degrees of freedom to , which by experiments was found to produce sufficiently large posterior variations of when the activation function is scaled according to (2) and the observations are normalized to zero mean and unit variance. The heavy-tailed prior (5) enables very large output weights if required, for example, when some hidden unit is forming almost a linear predictor (see, e.g, Section 4.2). A zero-mean Gaussian prior is assigned to the output bias: , where the scale parameter is fixed to because it was also found to be a sufficient simplification with the same data normalization conditions. The noise level is assumed unknown and therefore a log-normal prior is assigned to it:
with prior mean and variance .
The values of the hyperparameters and affect the smoothness properties of the model in different ways. In the following discussion we first assume that there is only one input scale parameter () for clarity. Choosing a smaller value for penalizes more strongly for larger input weights and produces smoother functions with respect to changes in the input features. More precisely, in the two-layer NN model (1) the magnitudes of the input weights (or equivalently the ARD scale parameters) are related to the nonlinearity of the latent function with respect to the corresponding inputs . Strong nonlinearities require large input weights whereas almost a linear function is obtained with a very large output weight and very small input weights for a certain hidden unit (see Section 4.2 for illustration).
Because the values of the activation function are constrained to the interval , hyperparameter controls the overall magnitude of the latent function . Larger values of increase the magnitude of the steps the hidden unit activation can take in the direction of weight vector in the feature space. Choosing a smaller value for can improve the predictive performance by constraining the overall flexibility of the model but too small value can prevent the model from explaining all the variance in the target variable . In this work, we keep fixed to a sufficiently large value and infer from data promoting simultaneously smoother solutions with the prior on . If only one common scale parameter
is used, the sparsity-inducing properties of the prior depend on the shape of the joint distributionresulting from the choice of the prior terms (3). By decreasing , we can favor smaller input weight values overall, and with , we can adjust the thickness of the tails of . On the other hand, if individual scale parameters are assigned for all inputs according to the ARD setting, a family of sparsity-promoting priors is obtained by adjusting and . If is set to a small value, say 0.01, and
is increased, more sparse solutions are favored by allocating increasing amounts of prior probability on the axes of the input weight space. A sparse prior could be introduced also on the output weightsto suppress redundant hidden units but this was not found necessary in the experiments because the proposed EP updates have fixed point at and and during the iterations unused hidden units are gradually driven towards zero (see Section 3.3.1).
3 Approximate Inference
In this section, we describe how approximate Bayesian inference on the unknown model parameters, , , and can be done efficiently using expectation propagation. First, the structure of the approximation and the expressions of the approximate site terms are presented in Section 3.1. A general description of the EP algorithm for determining the parameters of the site approximations is given in Section 3.2 and approximations for the non-Gaussian hierarchial weight priors are described in Section 3.2.1. The various computational blocks required in the EP algorithm are discussed first in Section 3.3 and detailed descriptions of the methods are given in Appendices A–E. Finally, an algorithm description with references to the different building blocks is given in 3.3.1.
3.1 The Structure of the Approximation
Given a set of observations , where , , the posterior distribution of all the unknowns is given by
where and is the marginal likelihood of the observed data conditioned on all fixed hyperparameters (in this case ). Figure 1 shows a directed graph representing the joint distribution (7) where we have also included intermediate random variables and related to each data point to clarify the upcoming description of the approximate inference scheme. To form an analytically tractable approximation, all non-Gaussian terms are approximated with unnormalized Gaussian site functions, which is a suitable approximating family for random vectors defined in the real vector space.
We first consider different possibilities for approximating the likelihood terms which depend on the unknown noise parameter and the unknown weight vectors and through the latent function value according to:
where is a auxiliary matrix formed as Kronecker product. It can be used to write all the linear input layer activations of observation as . The vector valued function applies the nonlinear transformation (2) on each component of according to (the last element corresponds to the output bias ). If we approximate the posterior distribution of all the weights with a multivariate Gaussian approximation that is independent of all the other unknowns including , the resulting EP algorithm requires fast evaluation of the means and covariances of tilted distributions of the form
where is a known mean vector, and a known covariance matrix, and is assumed fixed. Because the non-Gaussian likelihood term depends on
only through linear transformation, it can be shown (by differentiating (9) twice with respect to ) that the normalization term, mean and covariance of
can be exactly determined by using the corresponding moments of the transformed lower dimensional random vector, where the transformation matrix can be written as
This results in significant computational savings because the size of is , where we have denoted the dimensions of and with and respectively. It follows that the EP algorithm can be implemented by propagating the moments of using, for example, the general algorithm described by Cseke and Heskes (2011, appendix C). The same principle has been utilized to form computationally efficient algorithms also for linear binary classification (Minka, 2001b; Qi et al., 2004).
Independent Gaussian posterior approximations for both and can be obtained by approximating the likelihood terms by a product of two unnormalized Gaussian site functions:
where is a scalar scaling parameter. Because of the previously described property, the first likelihood site approximation depends on only through transformation (Cseke and Heskes, 2011):
where is a vector of location parameters, and a site precision matrix. The second likelihood site term dependent on the scalar is chosen to be an unnormalized Gaussian
where the site parameters and control the location and the scale of site function, respectively. Combined with the known Gaussian prior term on this results in a Gaussian posterior approximation for that corresponds to a log-normal approximation for .
The prior terms of the output weights , for , are approximated with
where is a scalar scaling parameter, has a similar exponential form as (12), and the site parameters and control the location and scale of the site, respectively. If the prior scales are assumed unknown, the prior terms of the input weights , are approximated with
where a factorized site approximation with location parameters and , and scale parameters and , is assumed for weight and the associated scale parameter , respectively. A similar exponential form to equation (12) is assumed for both and . This factorizing site approximation results in independent posterior approximations for and each component of as will be described shortly.
The actual numerical values of the normalization parameters , , and are not required during the iterations of the EP algorithm but their effect must be taken into account if one wishes to form an EP approximation for the marginal likelihood (see Appendix G). This estimate could be utilized to compare models or to alternatively determine type-II MAP estimates for hyperparameters , , and in case they are not inferred within the EP framework.
3.1.1 Fully-coupled approximation for the network weights
Multiplying the site approximations together according to (7) and normalizing the resulting expression gives the following Gaussian posterior approximation:
where , , and are the approximate posterior distributions of the weights , the noise parameter , and the input weight scale parameter , respectively. The mean vector and covariance matrix of , are given by
The approximate mean and variance of can be computed analogously to (17). The key property of the approximation (15) is that if we can incorporate the information provided by the data point in the parameters and , for all , the approximate inference on the non-Gaussian priors and is straightforward by adopting the methods described by (Seeger, 2008). In the following sections we will show how this can be done by approximating the joint distribution of , and , given , with a multivariate Gaussian and using it to estimate the parameters and one data point at a time within the EP framework.
3.1.2 Factorizing approximation for the network weights
A drawback with the approximation (16) is that the evaluation of the covariance matrix scales as which may not be feasible when the number of inputs is large. Another difficulty arises in determining the mean and covariance of when is distributed according to (9) because during the EP iterations has similar correlation structure with . If the observation model is Gaussian and is fixed, this requires at least -dimensional numerical quadratures (or other alternative approximations) that may quickly become infeasible as increases. By adopting suitable independence assumptions between and the input weights associated with the different hidden units, the mean and covariance of can be approximated using only 1-dimensional numerical quadratures as will be described in Section 3.3.
The structure of the correlations in the approximation can be studied by dividing into four blocks as follows:
where is a matrix, a matrix, and a matrix. The element contributes to the approximate posterior covariance between and , and the sub-matrix contributes to the approximate covariance between and . To form an alternative computationally more efficient approximation we propose a simpler structure for . First, we approximate with a diagonal matrix, that is, , where only the :th component of the vector contributes to the posterior covariance of . Secondly, we set and approximate with an outer-product of the form . With this precision structure the site approximation (11) can be factorized into terms depending only on the output weights or the input weights associated with the different hidden units :
where the site location parameters now correspond to the first elements of in equation (11), that is, . Analogously, the site location vector corresponds to the last entries of , that is, .
The factored site approximation (19) results in independent posterior approximations for the output weights and the input weights associated with different hidden units:
where and . The approximate mean and covariance of is given by
where the diagonal matrix and the vector collect all the site parameters related to hidden unit : and . The diagonal matrix and the vector contain the prior site scales and the location variables related to the hidden unit . The approximate mean and covariance of the output weights are given by
where the diagonal matrix and the vector contain the prior site scales and the location variables .
For each hidden unit , approximations (20) and (21) can be interpreted as independent linear models with Gaussian likelihood terms , where the pseudo-observations are given by . The approximation for the output weights (22) has no explicit dependence on the input vectors but later we will show that the independence assumption results in a similar dependence on expected values of taken with respect to approximate leave-one-out (LOO) distributions of and .
3.2 Expectation Propagation
The parameters of the approximate posterior distribution (15) are determined using the EP algorithm (Minka, 2001a). The EP algorithm updates the site parameters and the posterior approximation sequentially. In the following, we briefly describe the procedure for updating the likelihood sites and and assume that the prior sites (13) and (14) are kept fixed. Because the likelihood terms do not depend on and posterior approximation is factorized, that is , we need to consider only the approximations for and . Furthermore, independent approximations for and are obtained by using (19) and (20) in place and , respectively.
At each iteration, first a proportion of the :th site term is removed from the posterior approximation to obtain a cavity distribution:
where is a fraction parameter that can be adjusted to implement fractional (or power) EP updates (Minka, 2004, 2005). When , the cavity distribution (23) can be thought of as a LOO posterior approximation where the contribution of the :th likelihood term is removed from . Then, the :th site is replaced with the exact likelihood term to form a tilted distribution
where is a normalization constant, which in this case can also be thought of as an approximation for the LOO predictive density of the excluded data point . The tilted distribution can be regarded as a more refined non-Gaussian approximation to the true posterior distribution. Next, the algorithm attempts to match the approximate posterior distribution with by finding first a Gaussian that satisfies
where KL denotes the Kullback-Leibler divergence. Whenis chosen to be a Gaussian distribution this is equivalent to determining the mean vector and the covariance matrix of and matching them with the mean and covariance of . Then, the parameters of the :th site terms are updated so that the moments of match with :
Finally, the posterior approximation is updated according to the changes in the site parameters. These steps are repeated for all sites in some suitable order until convergence.
From now on, we refer to the previously described EP update scheme as sequential EP. If the update of the posterior approximation in the last step is done only after new parameter values have been determined for all sites (in this case the likelihood term approximations), we refer to parallel EP (see, e.g., van Gerven et al., 2009). Because in our case the approximating family is Gaussian and each likelihood term depends on a linear transformation of , one sequential EP iteration requires (for each of the sites) either one rank- covariance matrix update with the fully-correlated approximation (16), or rank-one covariance matrix updates with the factorized approximation (21, 22). In parallel EP these updates are replaced with a single re-computation of the posterior representation after each sweep over all the sites. In practice, one parallel iteration step over all the sites can be much faster compared to a sequential EP iteration, especially if or are large, but parallel EP may require larger number of iterations for overall convergence.
Setting the fraction parameter to corresponds to regular EP updates whereas choosing a smaller value produces a slightly different approximation that puts less emphasis on preserving all the nonzero probability mass of the tilted distributions (Minka, 2005). Consequently, setting tries to represent possible multimodalities in (24) but ignores modes far away from the main probability mass, which results in tendency to underestimate variances. However, in practice decreasing can improve the overall numerical stability of the algorithm and alleviate convergence problems resulting from possible multimodalities in case the unimodal approximation is not a good fit for the true posterior distribution (Minka, 2005; Seeger, 2008; Jylänki et al., 2011).
There is no theoretical convergence guarantee for the standard EP algorithm but damping the site parameter updates can help to achieve convergence in harder problems (Minka and Lafferty, 2002; Heskes and Zoeter, 2002).111Alternative provably convergent double-loop algorithms exist but usually they require more computational effort in the form of additional inner-loop iterations (Minka, 2001a; Heskes and Zoeter, 2002; Opper and Winther, 2005; Seeger and Nickisch, 2011). In damping, the site parameters are updated to a convex combination of the old values and the new values resulting from (25) defined by a damping factor . For example, the precision parameter of the likelihood site term is updated as and a similar update using the same -value is done on the corresponding location parameter . The convergence problems are usually seen as oscillations over iterations in the site parameter values and they may occur, for example, if there are inaccuracies in the tilted moment evaluations, or if the approximate distribution is not a suitable proxy for the true posterior, for example, due to multimodalities.
3.2.1 EP approximation for the weight prior terms
Assuming fixed site parameters for the likelihood approximation (19), or (11) in the case of full couplings, the EP algorithm for determining the prior term approximations (13) and (14) can be implemented in the same way as with sparse linear models (Seeger, 2008).
To derive EP updates for the non-Gaussian prior sites of the output weights assuming the factorized approximation, we need to consider only the prior site approximations (13) and the approximate posterior defined in equation (22). All approximate posterior information from the observations and the priors on the input weights are conveyed in the parameters determined during the EP iterations for the likelihood sites. The EP implementation of Seeger (2008) can be readily applied by using and as a Gaussian pseudo likelihood as discussed in Appendix E. Because the prior terms depend only on one random variable , deriving the parameters of the cavity distributions and updates for the site parameters and require only manipulating univariate Gaussians. The moments of the tilted distribution can be computed either analytically or using a one-dimensional numerical quadrature depending on the functional form of the exact prior term .
To derive EP updates for the non-Gaussian hierarchical prior sites of the input weights assuming the factorized approximation (20), we can consider the approximate posterior distributions from equation (21) separately with the corresponding prior site approximations (14) related to the components of . All approximate posterior information from the observations is conveyed by the site parameters that together with the input features form a Gaussian pseudo likelihood with a precision matrix and location term (compare with equation 21). It follows that the EP implementation of Seeger (2008) can be applied to update the approximations but, in addition, we have to derive site updates also for the scale parameter approximations . EP algorithms for sparse linear models that operate on exact site terms depending on a nonlinear combination of multiple random variables have been proposed by Hernández-Lobato et al. (2008) and van Gerven et al. (2009).
Because the :th exact prior term (3) depends on both the weight and the corresponding log-transformed scale parameter , the :th cavity distribution is formed by removing a fraction of both site approximations and :
where is the :th marginal approximation extracted from the corresponding approximation , and the approximate posterior for is formed by combining the prior (4) with all the site terms that depend on :
The :th tilted distribution is formed by replacing the removed site terms with a fraction of the exact prior term :
where is a Gaussian approximation formed by determining the mean and covariance of . The site parameters are updated so that the resulting posterior approximation is consistent with the marginal means and variances of :
Because of the factorized approximation, the cavity computations (26) and the site updates (27) require only scalar operations similar to the evaluations of and to the updates of in equations (A) and (46) respectively (see Appendix A and E).
Determining the moments of (27) can be done efficiently using one-dimensional quadratures if the means and variances of with respect to the conditional distribution can be determined analytically. This can be readily done when is a Laplace distribution or a finite mixture of Gaussians. Note also that if we wish to implement an ARD prior we can choose simply , where is a common scale parameter for all weights related to the same input feature, that is, weights