1 Introduction
The Bayesian approach [Gelman et al., 2013]
has been of interest because of its ability to incorporate domain knowledge into reasoning and to provide principled uncertainty estimates. Bayesian neural networks (BNNs)
[Neal, 2012], however, are criticized for the difficulty of encoding domain knowledge into the prior over the high-dimensional weight space. This limits their use in applications where domain knowledge is important, e.g., when data are limited or the signal is extremely weak and sparse. Applications in genetics often fall into the latter category and are used as motivating examples for the proposed methods.
Two types of domain knowledge are often available in scientific applications: ballpark figures on feature sparsity and on the signal-to-noise ratio. Specifically, feature sparsity refers to the number of features expected to be used by the model. For example, in genomics less than 2% of the genome encodes for genes. A prior on the signal-to-noise ratio may encode the amount of variance of the target expected to be explained by the chosen features, which can be measured as the proportion of variance explained (PVE) [Glantz et al., 1990]. For example, a single gene may explain only a tiny fraction of the variance of a given phenotype, e.g., a PVE of less than 1%.
Gaussian scale mixtures (GSMs) are commonly used sparse priors on BNN weights, including, e.g., the horseshoe [Ghosh et al., 2018] and the spike-and-slab [Deng et al., 2019]. However, they are not flexible enough to encode all kinds of beliefs about sparsity, and it is not known how to encode information about the PVE into such a prior. Some recent works on functional BNNs encode domain knowledge into a functional prior, e.g., a Gaussian process [Flam-Shepherd et al., 2017, Sun et al., 2019], but it is not trivial to include sparsity in such priors.
We propose a new informative GSM prior for the weights to explicitly encode domain knowledge about feature sparsity and the signal-to-noise ratio. Our main contributions are:
1. Propose a joint hyperprior on the local scales that can model beliefs on the number of relevant features, and which includes the spike-and-slab as a special case (Figure 1).
2. Derive the relation between PVE and the global scale parameter for BNNs with the ReLU activation function.
3. Develop a method to determine the global scale parameter of the GSM prior by using domain knowledge about the PVE, which circumvents heuristics employed in the commonly used approaches
[Blundell et al., 2015] and computationally intensive cross-validation.

2 Background
2.1 Bayesian neural networks
Bayesian neural networks [MacKay, 1992, Neal, 2012] are defined by placing a prior distribution on the weights of a neural network. Then, instead of finding a point estimate of the weights by minimizing a cost function, a posterior distribution over the weights is computed conditionally on the data. Let f(x; w) denote the output of a BNN with weights w, and p(y | f(x; w)) the likelihood. Then, given a dataset D = (X, y) of inputs X and outputs y, training a BNN means computing the posterior distribution p(w | D).
Variational inference can be used to approximate the intractable p(w | D) with a simpler distribution q(w), by minimizing KL(q(w) || p(w | D)). This is equivalent to maximizing the ELBO:
L(q) = H[q(w)] + E_{q(w)}[log p(D, w)].   (1)
The first term in Eq. 1 is the entropy of the approximate posterior, which can be calculated analytically for many choices of q(w). The second term can be estimated with the reparametrization trick [Kingma and Welling, 2013], which reparametrizes the approximate posterior by a deterministic and differentiable function w = g(ε; θ) with ε ~ p(ε), such that w ~ q(w).
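The reparametrization-trick estimate of the ELBO can be sketched as follows. This is a minimal NumPy illustration, assuming a mean-field Gaussian q(w) and a user-supplied log joint density; the names `elbo_estimate` and `log_joint` are ours, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo_estimate(mu, sigma, log_joint, n_samples=1000):
    """Monte-Carlo ELBO for a mean-field Gaussian q(w) = N(mu, diag(sigma^2)).

    The first term (entropy of q) is analytic; the second term,
    E_q[log p(D, w)], is estimated with the reparametrization
    w = mu + sigma * eps, eps ~ N(0, I), so gradients could flow
    through mu and sigma.
    """
    eps = rng.standard_normal((n_samples, mu.size))
    w = mu + sigma * eps                                       # reparametrized draws
    entropy = np.sum(0.5 * np.log(2 * np.pi * np.e * sigma ** 2))
    return entropy + np.mean([log_joint(wi) for wi in w])
```

As a sanity check, if q equals a standard normal "joint" (prior with no likelihood term), the ELBO equals the negative KL divergence between identical distributions, i.e., zero up to Monte-Carlo noise.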
2.2 Gaussian scale mixture priors
The Gaussian scale mixture [Andrews and Mallows, 1974] is defined as a zero-mean Gaussian conditional on its scale, and the distribution of the scale characterizes its statistical properties, e.g., sparsity. It has been combined with Automatic Relevance Determination (ARD) [MacKay, 1996], a widely used approach for feature selection. In BNNs, the ARD prior means that all of the outgoing weights from the same node j in layer l share the same scale λ_j^(l) [Neal, 2012], and node j is dropped if λ_j^(l) shrinks to zero. We define the input layer as layer 0 in this paper.
A GSM ARD prior on each weight w_jk^(l) can be written in a hierarchical parametrization as follows:
w_jk^(l) | λ_j^(l) ~ N(0, (λ_j^(l))² σ₀²),   λ_j^(l) ~ p(λ),   (2)
where σ₀ is the global scale shared by all of the NN weights, and p(λ) defines a hyperprior on the local scales. The marginal distribution of a weight can be obtained by integrating out the local scales:
p(w_jk^(l)) = ∫ N(w_jk^(l) | 0, λ² σ₀²) p(λ) dλ.   (3)
Eq. 2 can also be written in an equivalent non-centered parametrization [Papaspiliopoulos et al., 2007] form (Figure 2, left):
w_jk^(l) = σ₀ λ_j^(l) z_jk^(l),   z_jk^(l) ~ N(0, 1),   λ_j^(l) ~ p(λ),   (4)
which has a better posterior geometry for inference [Betancourt and Girolami, 2015] than Eq. 2. The non-centered parametrization has been widely used in the BNN literature [Louizos et al., 2017, Ghosh et al., 2018], and we also use this form in the rest of the paper.
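As an illustration of the non-centered form of Eq. 4, the following sketch draws one weight matrix with per-input-node local scales shared across outgoing weights. The Bernoulli choice of p(λ) (the spike-and-slab special case) and all function names are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_gsm_ard_weights(n_in, n_out, sigma0, sample_lambda):
    """Draw one weight matrix from the non-centered GSM ARD prior (Eq. 4).

    Each input node j has a local scale lambda_j ~ p(lambda); all of its
    outgoing weights share it: w_jk = sigma0 * lambda_j * z_jk, z ~ N(0, 1).
    """
    lam = sample_lambda(n_in)                  # local scales, one per input node
    z = rng.standard_normal((n_in, n_out))     # non-centered standard normals
    return sigma0 * lam[:, None] * z

# Spike-and-slab special case: Bernoulli local scales drop whole input features.
bernoulli = lambda n: rng.binomial(1, 0.1, size=n).astype(float)
W = sample_gsm_ard_weights(500, 100, sigma0=1.0, sample_lambda=bernoulli)
```

With slab probability 0.1, roughly 10% of the 500 input features keep non-zero outgoing weights in a draw.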
Different hyperpriors p(λ) define different marginal prior distributions on the weights according to Eq. 3. For example, when p(λ) is Bernoulli, each weight follows a spike-and-slab prior distribution [Mitchell and Beauchamp, 1988]; when p(λ) is the half-Cauchy distribution, the weights follow the horseshoe prior [Carvalho et al., 2009]. The choice of p(λ) is important in practice, especially for small datasets [Piironen and Vehtari, 2017], as it controls the shape of the marginal p(w), and thus directly affects the sparsity level of the model. If the distribution of λ concentrates at 0, a sparse model will be preferred; otherwise, a dense model will be learned instead.

2.3 Proportion of Variance Explained
Assume that the data generating process takes the form
y = f(x) + ε,   ε ~ N(0, σ_ε²),   (5)
where f is the model and ε is the unexplainable noise. The Proportion of Variance Explained (PVE) [Glantz et al., 1990] of f on a dataset D, also called the coefficient of determination (R²) in linear regression, is then defined as:
PVE(f, D) = Var_D[f(x)] / Var_D[y],   (6)
where Var_D denotes the sample variance over D, and it is commonly used to measure the impact of the features on the prediction target y, especially in genomics [Marttinen et al., 2014]. In general, the PVE should be less than 1, because the variance of the predicted values should not exceed that of the data. However, this may easily fail to hold in nonlinear models, such as neural networks, unless explicitly accounted for. For an example, see Figure 1.
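A minimal sketch of computing the PVE, under our reading of Eq. 6 as the ratio of the sample variance of the model outputs to that of the target (which reduces to R² for fitted linear regression, and which an arbitrary nonlinear model can push above 1, as the text notes); the function name is ours:

```python
import numpy as np

def pve(y, f_x):
    """PVE (Eq. 6): sample variance of the model's outputs relative to the
    sample variance of the observed target y."""
    y, f_x = np.asarray(y, float), np.asarray(f_x, float)
    return np.var(f_x) / np.var(y)
```

A model that reproduces the target exactly has PVE 1, while an erratic model whose outputs vary more than the data has PVE greater than 1.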
It is known that the scale parameter of a mean-field (fully factorized) Gaussian prior on BNN weights affects the variability of the functions drawn from the prior [Neal, 2012], and thus the PVE. As we will show in Section 4, for BNNs with the GSM prior defined in Eq. 4, the global scale σ₀ and the local scales jointly control the PVE, and the effect of σ₀ grows exponentially as the depth increases. This underlines the importance of setting σ₀ properly based on the expected PVE. When the scale parameter is set so that the corresponding PVE is close to the true value, the model is more likely to recover the true data-generating function (demonstration in Figure 1).
3 Prior knowledge about sparsity
In this section, we propose a new hyperprior for the local scales according to the generative model shown in Figure 2 (right). The new prior generates the local scales conditionally on the number of relevant features, which allows us to explicitly express domain knowledge about the number of relevant features.
3.1 Discrete informative scale priors
We first consider the case where each scale λ_j is a discrete random variable with domain {0, 1}, such as the independent Bernoulli scale parameters in the spike-and-slab prior. We then generalize this to continuous scales in Section 3.2.

3.1.1 Prior on the number of relevant features
We assume the prior belief about the number of relevant features is given as p_m(m), i.e., a probability mass function for m ∈ {0, 1, ..., D}, where D is the dimension of the dataset. The prior p_m directly controls the model sparsity. Intuitively, if p_m concentrates on small m, a sparse model is preferred, as it has a high prior probability of using a small number of features; if p_m puts a large probability mass near m = D, all of the features are likely to be used instead. Hence, unlike other priors encouraging shrinkage, such as the Laplace or the horseshoe, the new prior is more interpretable about sparsity, as it directly models the number of relevant features.

The modeler can choose p_m based on the prior information available. When we have a good idea about the number of relevant features, a unimodal prior can be used, such as a discretized Gaussian:
p_m(m) = (1/Z) exp(-ρ (m - μ)²),   m ∈ {0, ..., D},   (7)
where μ is the mode, ρ represents the precision (the inverse variance of the continuous Gaussian), and Z is the normalization constant. Often we may only be able to specify an interval for the number of relevant features. Then we can use, for example, a 'flattened' Gaussian (Figure 1):
p_f(m) = (1/Z_f) exp(-ρ d(m)²),  with  d(m) = μ₁ - m if m < μ₁,  d(m) = 0 if μ₁ ≤ m ≤ μ₂,  d(m) = m - μ₂ if m > μ₂,   (8)
where [μ₁, μ₂] defines the interval on which the probability is uniform and reaches its maximum value, and Z_f is the corresponding normalization constant. If there is no prior information about sparsity, a discrete uniform prior over {0, ..., D} is a plausible alternative.
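The 'flattened' Gaussian of Eq. 8 is straightforward to tabulate. The sketch below computes its pmf over m = 0, ..., D; the function name and argument conventions are ours.

```python
import numpy as np

def flattened_gaussian_pmf(D, mu1, mu2, rho):
    """'Flattened' Gaussian prior (Eq. 8) on the number of relevant features m.

    Uniform (maximal) probability on the interval [mu1, mu2], Gaussian decay
    with precision rho outside it; normalized over m = 0, ..., D.
    """
    m = np.arange(D + 1)
    # distance from m to the flat interval [mu1, mu2]
    dist = np.where(m < mu1, mu1 - m, np.where(m > mu2, m - mu2, 0.0))
    unnorm = np.exp(-rho * dist.astype(float) ** 2)
    return unnorm / unnorm.sum()
```

For instance, `flattened_gaussian_pmf(500, 5, 15, 0.1)` gives equal, maximal mass to m between 5 and 15, matching a prior belief that roughly 5 to 15 of the 500 features are relevant.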
3.1.2 Feature allocation
Given the prior distribution p_m(m), we first specify which features are used for a given m, and then marginalize over m. We introduce binary indicator variables h_1, ..., h_D to denote whether feature j is used (h_j = 1) or not (h_j = 0); they must satisfy Σ_j h_j = m. Assuming no prior knowledge about the relative importance of the features (this assumption can easily be replaced if needed), i.e., (h_1, ..., h_D) has a jointly uniform distribution given m, we have:

p(h_1, ..., h_D | m) = 1 / C(D, m)  if Σ_j h_j = m, and 0 otherwise,   (9)

where the normalization constant C(D, m) is the binomial coefficient, the number of feature subsets of size m.
Now we can calculate the joint distribution of (h_1, ..., h_D) by marginalizing out m:

p(h_1, ..., h_D) = Σ_m p(h_1, ..., h_D | m) p_m(m) = p_m(Σ_j h_j) / C(D, Σ_j h_j).   (10)
For Bernoulli local scale variables λ_j ∈ {0, 1}, the scale λ_j itself takes the role of the indicator variable h_j. Thus we obtain the joint distribution over the discrete scale parameters as

p(λ_1, ..., λ_D) = p_m(Σ_j λ_j) / C(D, Σ_j λ_j),   (11)

where the distribution p_m encodes our domain knowledge about the number of relevant features.
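Under this reading of Eq. 11, the joint prior over Bernoulli scales is simple to evaluate. The sketch below computes its log-probability for a given scale configuration; the helper name and the array representation of p_m are our own conventions.

```python
import itertools
from math import comb

import numpy as np

def log_joint_scale_prior(lam, p_m):
    """Joint prior over Bernoulli local scales (Eq. 11):
    p(lam) = p_m(m) / C(D, m) with m = sum(lam), i.e. the prior mass on
    'm relevant features' spread uniformly over all size-m feature subsets."""
    lam = np.asarray(lam)
    D, m = lam.size, int(lam.sum())
    return np.log(p_m[m]) - np.log(comb(D, m))
```

Summing the probabilities over all 2^D configurations recovers 1, since each slice of size m contributes p_m(m) in total.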
3.2 Informative prior on continuous local scales
When the local scales are continuous (λ_j ∈ (0, ∞)), the number of relevant features is not a sensible way to define sparsity, because the posterior probability of the event λ_j = 0 is zero almost everywhere [Carvalho et al., 2009]. Thus the number of used features always equals D, unless heuristic pruning is applied [Louizos et al., 2017].

Instead of the number of relevant features, we propose to represent sparsity by the effective number of features [Piironen and Vehtari, 2017], which is defined as
m_eff = Σ_{j=1}^{D} (1 - κ_j),   (12)
where κ_j is the shrinkage factor [Carvalho et al., 2009], which can be defined for any GSM prior as follows: κ_j reflects the proportion by which feature j is shrunk compared to its maximum likelihood estimate, and it is given by
κ_j = ν / (ν + λ_j²),   (13)
where ν is a small positive constant [Carvalho et al., 2009]^1 (^1 For example, in Bayesian linear regression [Carvalho et al., 2009], ν = σ²/(N σ₀²), where σ² is the variance of the data likelihood, N is the size of the data, and σ₀ is the global scale parameter.) In our experiments we simply use a fixed small value of ν to apply this with BNNs. We assume that the user can give a prior belief p_f about m_eff in a similar way as for the number of relevant features in Section 3.1.
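Assuming the shrinkage form κ_j = ν/(ν + λ_j²) suggested by the linear-regression example above, the effective number of features of Eq. 12 can be sketched as follows (the function name is ours):

```python
import numpy as np

def m_eff(lam, nu):
    """Effective number of features (Eq. 12) from continuous local scales.

    Assumes the shrinkage factor (Eq. 13) kappa_j = nu / (nu + lam_j^2);
    m_eff sums the non-shrinkage factors 1 - kappa_j over the features.
    """
    lam = np.asarray(lam, float)
    kappa = nu / (nu + lam ** 2)
    return float(np.sum(1.0 - kappa))
```

A feature with λ_j = 0 contributes nothing, a feature with λ_j² much larger than ν contributes almost one, so m_eff interpolates between 0 and D.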
Inspired by Eq. 10 for the discrete case, we define the joint distribution on the non-shrinkage factors (1 - κ_1, ..., 1 - κ_D):

p(1 - κ_1, ..., 1 - κ_D) = p_f(Σ_j (1 - κ_j)) / C(D, Σ_j (1 - κ_j)),   (14)

where p_f is a continuous prior for the effective number of features, analogous to the discrete prior over the number of relevant features in Section 3.1. The binomial coefficient in Eq. 14 is calculated for continuous arguments using Stirling's approximation. Since the non-shrinkage factor 1 - κ_j increases monotonically w.r.t. λ_j, we get the joint density of the scale parameters by the change-of-variables formula:

p(λ_1, ..., λ_D) = [p_f(m_eff) / C(D, m_eff)] ∏_j |d(1 - κ_j)/dλ_j|,   (15)

where m_eff = Σ_j (1 - κ_j). We show the marginal distribution of λ_j for different p_f in Figure 3. When p_f concentrates more on small values of m_eff (Figure 3, left), the marginal of λ_j has a heavier tail (the blue line in Figure 3, right). We note that although the normalization constant of Eq. 15 is not analytically available, inference with this prior is still feasible via variational inference and the reparametrization trick.
3.3 BNNs with informative priors
In BNNs, the number of features used by the model (its sparsity) is determined by the weights in the input layer, unless all nodes of some hidden layer are jointly dropped, which is very unlikely in practice. Thus we encode the domain knowledge on sparsity in the informative prior of the input layer, i.e., in Eqs. 11 and 15. For the other layers we use the spike-and-slab in our experiments (see Experiments). However, informative priors could also be used in the higher layers. Because specifying domain knowledge about the number of hidden nodes would be difficult, the corresponding prior could be set to uniform. This way one could potentially learn the optimal number of hidden nodes; we leave this for future work.
To summarize, according to Section 3.1, we use the following prior on the input layer:

w_jk^(0) = σ₀ λ_j z_jk,   z_jk ~ N(0, 1),   (16)

with the hyperprior on the local scales λ_j following Eq. 11 in the discrete case and Eq. 15 in the continuous case. We use independent concrete Bernoulli distributions [Maddison et al., 2016] to approximate the posterior of the λ_j in the discrete case and independent log-normals in the continuous case, and independent Gaussians to approximate the posterior of the z_jk in both cases.

4 Prior knowledge about PVE
In this section, we introduce an approach for determining the global scale σ₀ of the GSM prior in Eq. 4, according to domain knowledge about the expected PVE.
4.1 PVE for Bayesian neural networks
According to the definition of the PVE in Eq. 6, when f is a BNN, PVE(f, D) has a distribution determined by the prior distribution of f. In this paper, instead of analysing the full distribution of the PVE, we analyse a summary statistic of the PVE commonly used in Bayesian linear models [Gelman et al., 2019]. The statistic is the expected PVE,

E_f[PVE] = E_f[PVE(f, D)],   (17)

over functions f drawn from the BNN prior. Consequently, instead of restricting the PVE directly, we express the prior beliefs using E_f[PVE], e.g., requiring it to be less than 1.
For BNNs with arbitrary activation functions or priors, the analytical form of E_f[PVE] is intractable. However, under certain assumptions, we have the following theorem (proof in the Supplement):
Theorem 1.
If the weights in different layers l = 0, ..., L are independent, and the priors within any single layer l have zero mean and the same second moment v_l, then the expected PVE of a ReLU BNN with L hidden layers and K_l units in layer l is given by

E_f[PVE] = c ∏_{l=0}^{L} (K_l v_l / 2),

where c is a constant independent of the priors of the weights.
Theorem 1 will be used to determine the global scale σ₀ according to the prior knowledge on E_f[PVE] in Section 4.2. Note that Theorem 1 is applicable to the commonly used priors, such as the fully factorized Gaussian or the spike-and-slab, as well as to the newly proposed informative prior, as long as the second moment of the prior is finite.
4.2 Determining σ₀ according to the PVE
For any zero-mean GSM ARD prior for BNNs, the variance (second moment) of the weights in layer l of Eq. 4 equals:

v_l = Var(w_jk^(l)) = σ₀² E[(λ_j^(l))²].   (18)
In practice, we first set the hyperpriors on the local scales (the same for all nodes within a single layer) by encoding the domain knowledge about sparsity, which determines E[(λ^(l))²]. According to Theorem 1, the expected PVE then simplifies to

E_f[PVE] = c σ₀^{2(L+1)},   (19)

where c is a constant that depends on the variance of the data, the model architecture, and the pre-encoded sparsity, but is independent of σ₀. Hence, we see that when the factors affecting c are kept constant, E_f[PVE] is fully determined by the global scale σ₀ alone.
Both E_f[PVE] and c in Eq. 19 are analytically tractable only when all layer widths are infinite (shown in the Supplement), which is impractical. However, since we know the exact form of the relationship, we can obtain a very accurate estimate of c by solving the linear regression problem

log E_f[PVE] = log c + 2(L + 1) log σ₀.   (20)

We solve Eq. 20 using data pairs (σ₀, Ê_f[PVE]), obtained by simulating from the BNN (with the pre-specified prior and data X) for multiple values of σ₀ to get a Monte-Carlo estimate of the corresponding E_f[PVE]; in practice, a small number of values of σ₀ is enough to make the R² of Eq. 20 very high. See an algorithm and an example in the Supplementary.
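The estimation of c can be sketched as follows: simulate E_f[PVE] for several values of σ₀ by drawing ReLU networks from the prior, then read off log c using the known slope 2(L + 1). For simplicity, this sketch uses a mean-field Gaussian prior N(0, σ₀²) on every weight rather than the full GSM prior; all function names and the grid of σ₀ values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_epve(sigma0, X, var_y, widths, n_draws=200):
    """Monte-Carlo estimate of E_f[PVE] for a ReLU BNN: draw weights from the
    prior, push X through the network, and average Var(f(X)) / Var(y)."""
    pves = []
    for _ in range(n_draws):
        h = X
        for k_in, k_out in zip((X.shape[1],) + widths[:-1], widths):
            h = np.maximum(h @ (sigma0 * rng.standard_normal((k_in, k_out))), 0.0)
        f = h @ (sigma0 * rng.standard_normal((widths[-1], 1)))  # linear output
        pves.append(np.var(f) / var_y)
    return float(np.mean(pves))

def fit_log_c(X, var_y, widths, sigma0_grid):
    """Estimate log c in log E[PVE] = log c + 2(L+1) log sigma0 (Eq. 20)
    by averaging the per-point intercepts at the known slope."""
    L = len(widths)  # number of hidden layers
    epve = np.array([simulate_epve(s, X, var_y, widths) for s in sigma0_grid])
    return float(np.mean(np.log(epve) - 2 * (L + 1) * np.log(np.asarray(sigma0_grid))))
```

Since a ReLU network's output is positively homogeneous in its weights, doubling σ₀ multiplies the expected PVE by 2^{2(L+1)}, which is exactly the exponential depth dependence Eq. 19 captures; once log c is estimated, a target E_f[PVE] yields σ₀ = (E_f[PVE] / c)^{1/(2(L+1))}.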
After the constant c has been estimated, σ₀ can be specified using domain knowledge about E_f[PVE], according to Eq. 19. We set the aleatoric uncertainty, i.e., σ_ε² in Eq. 5, to (1 - E_f[PVE]) Var(y), so that the model in Eq. 5 is consistent with our prior knowledge. If we have no relevant knowledge, a non-informative prior on E_f[PVE] can be used instead, such as the uniform distribution on [0, 1]. By the change-of-variables formula, the prior induced on σ₀ is then given by

p(σ₀) = 2(L + 1) c σ₀^{2(L+1)-1},   σ₀ ∈ (0, c^{-1/(2(L+1))}).   (21)
For some heavy-tailed priors, e.g., the horseshoe, the second moments do not exist, so E_f[PVE] does not exist either, and Theorem 1 cannot be applied. However, other summary statistics of the PVE, such as the median, may be used to determine the global scale parameter for the horseshoe. We leave this for future work.
5 Related Work
BNNs with a fully factorized Gaussian prior and posterior were proposed by Graves [2011]. BNNs with a fully factorized Gaussian prior and a mixture of Dirac-delta posteriors can be interpreted as NNs with dropout [Gal and Ghahramani, 2016]. Nalisnick et al. [2019] extended these works, and showed that NNs with any multiplicative noise can be interpreted as BNNs with GSM ARD priors. Several priors have been proposed to induce sparsity, such as log-uniform priors [Louizos et al., 2017], horseshoe priors [Louizos et al., 2017, Ghosh et al., 2018], and spike-and-slab priors [Deng et al., 2019]. However, none of these works proposes how to encode domain knowledge explicitly.
Building informative priors for neural networks in function space has been widely studied. One common type of prior information concerns the behavior of the output for certain inputs. Noise contrastive priors (NCPs) [Hafner et al., 2018] were designed to encourage reliably high uncertainty for OOD (out-of-distribution) data points. Gaussian processes have been widely used as functional priors because of their ability to encode rich functional structure. Flam-Shepherd et al. [2017] transformed a functional GP prior into a weight-space BNN prior, with which variational inference is performed. Functional BNNs [Sun et al., 2019], in contrast, perform variational inference directly in function space, where meaningful functional GP priors can be specified. Pearce et al. [2019] used a combination of different BNN architectures to encode prior knowledge about the function. Although building functional priors can avoid uninterpretable high-dimensional weights, encoding sparsity information about features into the functional space is not trivial.
6 Experiments
We first apply BNNs with the new discrete informative prior to synthetic datasets, and compare the convergence rate with the 'gold standard' baseline: the spike-and-slab prior. We then apply our approach to 5 public UCI datasets, with and without injected irrelevant features, and to a Genome-Wide Association Study (GWAS) dataset. We show that incorporating domain knowledge on both sparsity and the PVE through our approach can improve the results. All datasets were standardized before training.
Common model settings: We used BNNs with the ARD spike-and-slab prior (SSBNN) as the baseline for discrete scales, where the slab probabilities of the input layer were determined based on knowledge about the number of relevant features, and the slab probabilities of the hidden layers were set to a fixed constant. We consider the horseshoe prior on all layers (HSBNN) as the baseline for continuous scales, using the same hyperparameters as in Ghosh et al. [2018]. For BNNs with informative priors on the input layer, we used the 'flattened' Gaussian (Eq. 8) to model the number of relevant features (m) in the discrete case (InfoD), and the effective number of features (m_eff) in the continuous case (InfoC). The hidden layers of InfoD are the same as in SSBNN, and those of InfoC the same as in the corresponding HSBNN.
6.1 Synthetic data
Setup: We simulated datasets with 500 features, of which only 10 are used to compute the target with the model f. The model contains both main effects and interactions between the selected features (details in the Supplementary). The noise ε is Gaussian with zero mean, with its standard deviation chosen to give a fixed signal-to-noise ratio. We generated 3 datasets of different sizes (300, 700, and 1100), and compared the convergence of both MSE and sparsity (number of features) with different priors for each dataset. We used the test MSE to determine convergence. The expected number of features included in the first layer, calculated using the feature inclusion probabilities, is used as the estimate of the number of relevant features for both priors.

Parameter settings: We applied BNNs with two different types of ARD priors: the spike-and-slab prior (SSBNN) and the discrete informative prior (InfoD). The same global scale σ₀ is shared by all models. The slab probability of SSBNN on the first layer is set to reflect the expected number of relevant features. We encoded three types of prior knowledge into InfoD by setting the interval [μ₁, μ₂] and precision ρ differently:
1. a strongly informative InfoD;
2. a weakly informative InfoD;
3. a misspecified ('wrong') InfoD.
We used a concrete Bernoulli distribution with a fixed temperature to approximate the posteriors of the scales for both the SS and InfoD priors, and optimized with Adam. The neural networks used have 2 hidden layers with 100 and 50 nodes.
Results: Figure 4 shows the convergence of the test error and sparsity (the number of relevant features) for the different priors. For the small dataset (300), where the prior matters most, the InfoD prior with wrong information converges to the same MSE and number of features as the SS prior, while the InfoD priors with correct information converge to a lower MSE and the correct number of features (10, the green dotted lines). When we increase the data size to 700, InfoD with wrong information converges to the correct sparsity and the same MSE as the correct InfoD priors, but more slowly. For the largest dataset with 1100 data points, the converged MSEs for the different priors are similar. However, with the SS prior the number of features is overestimated even at convergence. The strongly informative InfoD prior always converges faster than the weakly informative InfoD prior, but the advantage diminishes as the dataset size increases.
6.2 UCI datasets
[Table 1: Test NLL, with dataset sizes (P, N), of the mean-field, SSBNN, SSBNN+PVE, InfoD, InfoD+PVE, HSBNN, and InfoC priors on the original Bike, California, Energy, Concrete, and Yacht datasets (non-informative prior) and on their extensions with irrelevant features (informative prior).]
Setup: We analyze 5 publicly available datasets (Table 1): Bike sharing, California housing prices, Energy efficiency, Concrete compressive strength, and Yacht hydrodynamics. We carry out two types of experiments: 1. analyze the original datasets as such, in which case the domain knowledge about sparsity is unknown; 2. concatenate 100 irrelevant features with the original features, which allows us to specify informative priors about sparsity (the number of relevant features is at most the number of original features). We examine whether performance can be improved by encoding the extra knowledge about the PVE and sparsity into the prior. We split the data into training, validation, and test sets. We consider three evaluation metrics on the test sets: negative log-likelihood (NLL), MSE, and PVE. We repeat each experiment on multiple random data splits to obtain confidence intervals.
Parameter settings: We considered 3 classes of priors: 1. the standard mean-field Gaussian; 2. discrete GSM ARD priors; 3. continuous GSM ARD priors.
For the discrete GSM priors, we used SSBNN as the baseline, with the global scale fixed as in prior work [Wenzel et al., 2020]. We set the slab probability on the input layer separately for the original datasets and for the datasets extended with noisy features, based on the number of features in the original dataset, so that it represents the correct level of sparsity. In addition, we considered the following three discrete informative priors:
1. SSBNN+PVE: the SSBNN prior with the global scale set according to a target E_f[PVE] (see Section 4).
2. InfoD: the discrete informative prior with the default global scale.
3. InfoD+PVE: same as InfoD, but with the global scale set according to the same target E_f[PVE].
To assess sensitivity, we also consider an alternative value of E_f[PVE] for SSBNN+PVE and InfoD+PVE.
Of the continuous GSM priors, we consider the horseshoe BNN (HSBNN) as the baseline. We only consider encoding prior knowledge about sparsity, using InfoC, because E_f[PVE] does not exist for horseshoe priors, as discussed in Section 4. We use the same interval [μ₁, μ₂] and precision ρ as in the discrete case (InfoD). The BNNs used have 3 hidden layers with 100, 50, and 20 nodes. The noise variance σ_ε² in Eq. 5 was set to the same value for each model.
Results: The results with the negative log-likelihood (NLL) metric are shown in Table 1; the MSE and PVE are reported in the Supplementary.
For the original datasets, we see that setting the global scale according to E_f[PVE] (SSBNN+PVE and InfoD+PVE) increases the data likelihood, which reflects both the quality of the predictive uncertainty and the prediction accuracy. The newly proposed informative priors (InfoD and InfoC) also slightly improve performance on the smaller datasets, even though there is no prior information about sparsity to encode. The horseshoe (HSBNN) has comparable performance even without explicitly encoded information.
On the extended datasets with 100 extra irrelevant features, knowledge of both the PVE and sparsity improves performance significantly in the discrete-scale cases. For most datasets, SSBNNs are not good enough to produce reasonable results even with the correct level of sparsity. We find that for the large datasets the discrete-scale priors are better than the continuous ones, while for the small datasets the continuous-scale priors perform better, although they cannot include information about the PVE. We also find that the non-sparsity-inducing prior (mean-field Gaussian) works comparably on some original datasets with larger sample sizes, but poorly on most datasets.
The results with an alternative assumed PVE (Supplementary) show that the conclusions are not sensitive to the specific value of the PVE assumed.
6.3 GWAS application
Motivation: The goal of a Genome-Wide Association Study (GWAS) is to learn associations between genetic variants called SNPs (input features) and phenotypes (targets). Ultimately, the goal is to predict the phenotype given the SNPs of an individual. This task is extremely challenging because 1. the input features are very high-dimensional and strongly correlated, and 2. they may explain only a tiny fraction of the variance of the phenotype, e.g., less than 1%. Thus, most approaches employ several heuristic but crucial preprocessing steps to reduce the input dimension and correlation. There exists strong domain knowledge about sparsity and the amount of variance explained by the SNPs, and we show that by incorporating this knowledge into informative priors we can predict accurately where alternatives fail.
Dataset: The FINRISKI dataset contains SNPs and 228 different metabolites as phenotypes for 4620 individuals. We selected 6 genes that have previously been associated with the metabolites [Kettunen et al., 2016]. We use the SNPs of each gene to predict the corresponding most correlated metabolite, resulting in 6 different experiments.
Parameter settings: We train BNNs with a single hidden layer. We consider 3 different priors: the mean-field Gaussian, the spike-and-slab (SSBNN), and the discrete informative prior (InfoD). We make predictions using the posterior mean and evaluate the performance by the PVE (higher is better) on test data. We use 50% of the data for training and 50% for testing, and we repeat this 50 times for each of the 6 experiments (i.e., genes), allowing us to assess experiment-specific variability due to the random split and training.
The slab probability of SSBNN is fixed, and the interval [μ₁, μ₂] of InfoD is set according to the number of SNPs in the chosen gene. This reflects the prior belief that only a small fraction of the SNPs in the gene actually affect the phenotype. The global scale of each prior is either kept fixed (without prior information about the PVE), or calculated by setting E_f[PVE] according to previous findings [Kettunen et al., 2016], as in Section 4.
Results: Figure 5 shows the results for the 6 experiments. We see that setting the global scale σ₀ according to the prior knowledge on the PVE always improves accuracy and reduces uncertainty, for all priors (purple bars). Without the prior knowledge on the PVE, learning with any of the priors can overfit seriously (blue bars, negative test PVE). The novel informative discrete GSM prior has the highest accuracy with the smallest standard deviation in all experiments, both with and without the PVE information.
7 Conclusion
We proposed a new informative Gaussian scale mixture prior for BNN weights, whose global and local scale parameters are specified using domain knowledge about the expected signal-to-noise ratio and sparsity. We demonstrated the utility of the prior on simulated data, on publicly available datasets, and in a GWAS application, where it outperformed strong, commonly used baselines. The informative hyperprior over the local scales can be generalized to any scale mixture distribution, not just the Gaussian scale mixture, e.g., the Strawderman-Berger prior. Possible future work includes encoding the PVE into heavy-tailed distributions, such as the horseshoe, and extending the results to hierarchical priors (a hyperprior over the global scale).
References

Andrews and Mallows [1974] D. F. Andrews and C. L. Mallows. Scale mixtures of normal distributions. Journal of the Royal Statistical Society: Series B (Methodological), 36(1):99–102, 1974.
Betancourt and Girolami [2015] M. Betancourt and M. Girolami. Hamiltonian Monte Carlo for hierarchical models. Current Trends in Bayesian Methodology with Applications, 79:30, 2015.
 Blundell et al. [2015] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
 Carvalho et al. [2009] C. M. Carvalho, N. G. Polson, and J. G. Scott. Handling sparsity via the horseshoe. In Artificial Intelligence and Statistics, pages 73–80, 2009.

Deng et al. [2019] W. Deng, X. Zhang, F. Liang, and G. Lin. An adaptive empirical Bayesian method for sparse deep learning. In Advances in Neural Information Processing Systems, pages 5564–5574, 2019.
Flam-Shepherd et al. [2017] D. Flam-Shepherd, J. Requeima, and D. Duvenaud. Mapping Gaussian process priors to Bayesian neural networks. In NIPS Bayesian Deep Learning Workshop, 2017.

Gal and Ghahramani [2016] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
Gelman et al. [2013] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 2013.
 Gelman et al. [2019] A. Gelman, B. Goodrich, J. Gabry, and A. Vehtari. Rsquared for Bayesian regression models. The American Statistician, pages 1–7, 2019.
 Ghosh et al. [2018] S. Ghosh, J. Yao, and F. DoshiVelez. Structured variational learning of Bayesian neural networks with horseshoe priors. In International Conference on Machine Learning, pages 1739–1748, 2018.
 Glantz et al. [1990] S. A. Glantz, B. K. Slinker, and T. B. Neilands. Primer of applied regression and analysis of variance, volume 309. McGrawHill New York, 1990.
 Graves [2011] A. Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.
 Hafner et al. [2018] D. Hafner, D. Tran, T. Lillicrap, A. Irpan, and J. Davidson. Noise contrastive priors for functional uncertainty. arXiv preprint arXiv:1807.09289, 2018.
 Kettunen et al. [2016] J. Kettunen, A. Demirkan, P. Würtz, H. H. Draisma, T. Haller, R. Rawal, A. Vaarhorst, A. J. Kangas, L.P. Lyytikäinen, M. Pirinen, et al. Genomewide study for circulating metabolites identifies 62 loci and reveals novel systemic effects of lpa. Nature Communications, 7(1):1–9, 2016.
 Kingma and Welling [2013] D. P. Kingma and M. Welling. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Louizos et al. [2017] C. Louizos, K. Ullrich, and M. Welling. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pages 3288–3298, 2017.

MacKay [1992] D. J. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.
MacKay [1996] D. J. MacKay. Bayesian non-linear modeling for the prediction competition. In Maximum Entropy and Bayesian Methods, pages 221–234. Springer, 1996.
 Maddison et al. [2016] C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
 Marttinen et al. [2014] P. Marttinen, M. Pirinen, A.P. Sarin, J. Gillberg, J. Kettunen, I. Surakka, A. J. Kangas, P. Soininen, P. O’Reilly, M. Kaakinen, et al. Assessing multivariate genemetabolome associations with rare variants using Bayesian reduced rank regression. Bioinformatics, 30(14):2026–2034, 2014.
 Mitchell and Beauchamp [1988] T. J. Mitchell and J. J. Beauchamp. Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023–1032, 1988.
 Nalisnick et al. [2019] E. Nalisnick, J. M. HernandezLobato, and P. Smyth. Dropout as a structured shrinkage prior. In International Conference on Machine Learning, pages 4712–4722, 2019.
 Neal [2012] R. M. Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
 Papaspiliopoulos et al. [2007] O. Papaspiliopoulos, G. O. Roberts, and M. Sköld. A general framework for the parametrization of hierarchical models. Statistical Science, pages 59–73, 2007.
 Pearce et al. [2019] T. Pearce, M. Zaki, A. Brintrup, and A. Neely. Expressive priors in Bayesian neural networks: Kernel combinations and periodic functions. arXiv preprint arXiv:1905.06076, 2019.

Piironen and Vehtari [2017]
J. Piironen and A. Vehtari.
On the hyperprior choice for the global shrinkage parameter in the horseshoe prior.
In Artificial Intelligence and Statistics, pages 905–913, 2017.  Sun et al. [2019] S. Sun, G. Zhang, J. Shi, and R. Grosse. Functional variational Bayesian neural networks. arXiv preprint arXiv:1903.05779, 2019.
 Wenzel et al. [2020] F. Wenzel, K. Roth, B. S. Veeling, J. Swiatkowski, L. Tran, S. Mandt, J. Snoek, T. Salimans, R. Jenatton, and S. Nowozin. How good is the bayes posterior in deep neural networks really? arXiv preprint arXiv:2002.02405, 2020.
References

 Andrews and Mallows [1974] D. F. Andrews and C. L. Mallows. Scale mixtures of normal distributions. Journal of the Royal Statistical Society: Series B (Methodological), 36(1):99–102, 1974.
 Betancourt and Girolami [2015] M. Betancourt and M. Girolami. Hamiltonian Monte Carlo for hierarchical models. Current Trends in Bayesian Methodology with Applications, 79:30, 2015.
 Blundell et al. [2015] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
 Carvalho et al. [2009] C. M. Carvalho, N. G. Polson, and J. G. Scott. Handling sparsity via the horseshoe. In Artificial Intelligence and Statistics, pages 73–80, 2009.

 Deng et al. [2019] W. Deng, X. Zhang, F. Liang, and G. Lin. An adaptive empirical Bayesian method for sparse deep learning. In Advances in Neural Information Processing Systems, pages 5564–5574, 2019.
 Flam-Shepherd et al. [2017] D. Flam-Shepherd, J. Requeima, and D. Duvenaud. Mapping Gaussian process priors to Bayesian neural networks. In NIPS Bayesian Deep Learning Workshop, 2017.

 Gal and Ghahramani [2016] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
 Gelman et al. [2013] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 2013.
 Gelman et al. [2019] A. Gelman, B. Goodrich, J. Gabry, and A. Vehtari. R-squared for Bayesian regression models. The American Statistician, pages 1–7, 2019.
 Ghosh et al. [2018] S. Ghosh, J. Yao, and F. Doshi-Velez. Structured variational learning of Bayesian neural networks with horseshoe priors. In International Conference on Machine Learning, pages 1739–1748, 2018.
 Glantz et al. [1990] S. A. Glantz, B. K. Slinker, and T. B. Neilands. Primer of Applied Regression and Analysis of Variance, volume 309. McGraw-Hill, New York, 1990.
 Graves [2011] A. Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.
 Hafner et al. [2018] D. Hafner, D. Tran, T. Lillicrap, A. Irpan, and J. Davidson. Noise contrastive priors for functional uncertainty. arXiv preprint arXiv:1807.09289, 2018.
 Kettunen et al. [2016] J. Kettunen, A. Demirkan, P. Würtz, H. H. Draisma, T. Haller, R. Rawal, A. Vaarhorst, A. J. Kangas, L.-P. Lyytikäinen, M. Pirinen, et al. Genome-wide study for circulating metabolites identifies 62 loci and reveals novel systemic effects of LPA. Nature Communications, 7(1):1–9, 2016.
 Kingma and Welling [2013] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Louizos et al. [2017] C. Louizos, K. Ullrich, and M. Welling. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pages 3288–3298, 2017.

 MacKay [1992] D. J. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.
 MacKay [1996] D. J. MacKay. Bayesian nonlinear modeling for the prediction competition. In Maximum Entropy and Bayesian Methods, pages 221–234. Springer, 1996.
 Maddison et al. [2016] C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
 Marttinen et al. [2014] P. Marttinen, M. Pirinen, A.-P. Sarin, J. Gillberg, J. Kettunen, I. Surakka, A. J. Kangas, P. Soininen, P. O’Reilly, M. Kaakinen, et al. Assessing multivariate gene-metabolome associations with rare variants using Bayesian reduced rank regression. Bioinformatics, 30(14):2026–2034, 2014.
 Mitchell and Beauchamp [1988] T. J. Mitchell and J. J. Beauchamp. Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023–1032, 1988.
 Nalisnick et al. [2019] E. Nalisnick, J. M. Hernandez-Lobato, and P. Smyth. Dropout as a structured shrinkage prior. In International Conference on Machine Learning, pages 4712–4722, 2019.
 Neal [2012] R. M. Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
 Papaspiliopoulos et al. [2007] O. Papaspiliopoulos, G. O. Roberts, and M. Sköld. A general framework for the parametrization of hierarchical models. Statistical Science, pages 59–73, 2007.
 Pearce et al. [2019] T. Pearce, M. Zaki, A. Brintrup, and A. Neely. Expressive priors in Bayesian neural networks: Kernel combinations and periodic functions. arXiv preprint arXiv:1905.06076, 2019.

 Piironen and Vehtari [2017] J. Piironen and A. Vehtari. On the hyperprior choice for the global shrinkage parameter in the horseshoe prior. In Artificial Intelligence and Statistics, pages 905–913, 2017.
 Sun et al. [2019] S. Sun, G. Zhang, J. Shi, and R. Grosse. Functional variational Bayesian neural networks. arXiv preprint arXiv:1903.05779, 2019.
 Wenzel et al. [2020] F. Wenzel, K. Roth, B. S. Veeling, J. Swiatkowski, L. Tran, S. Mandt, J. Snoek, T. Salimans, R. Jenatton, and S. Nowozin. How good is the Bayes posterior in deep neural networks really? arXiv preprint arXiv:2002.02405, 2020.
Supplementary
8 Proof of Theorem 1
8.1 Introduction
We first introduce the notation and some well-known results from probability theory.
Notation: We denote by $g^{(l)}$ any one of the nodes in the $l$th hidden layer before the activation function, and by $h^{(l)}$ the corresponding node after the activation. We use $W^{(l)}$ to represent all the weights from the $(l-1)$th layer to the $l$th layer. The number of nodes in layer $l$ is $K_l$. We use a subscript, as in $g^{(l)}_i$, to denote the $i$th node in a layer. The output of the neural network is $f(\mathbf{x})$, where $\mathbf{x}$ is the input. All activation functions are ReLUs. We assume that, in the prior distribution, the weights in different layers are independent. We use $W^{(0:l)}$ to denote all the weights from layer 0 (the input layer) up to layer $l$. The weights in the same layer have the same prior, with mean $\mu_l$ and variance $\sigma_l^2$. When the $K_l$ are large, the nodes follow a Gaussian distribution according to the central limit theorem. We assume that all weights are independent of the nodes, and that there are no bias terms. The features are also independent of each other.
Targets: We derive the form of the prior variance of the network output,
(22)
where we normalize $\mathbf{x}$ to have unit variance.
Probability results: We have the following results, based on the above assumptions.
When $W^{(l)}$ is not considered as a random variable:
(23)
If a Gaussian random variable $x$ has mean $\mu$ and variance $\sigma^2$, the first two moments of $\hat{x} = \max(0, x)$ after the ReLU activation are:
(24) $\quad \mathbb{E}[\hat{x}] = \mu\,\Phi(\alpha) + \sigma\,\phi(\alpha), \qquad \mathbb{E}[\hat{x}^2] = (\mu^2 + \sigma^2)\,\Phi(\alpha) + \mu\,\sigma\,\phi(\alpha),$
where $\alpha = \mu/\sigma$, and $\phi$ and $\Phi$ are the PDF and CDF of the standard Gaussian [Wu et al., 2018].
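As a quick numerical sanity check of these moment formulas, the following sketch compares them with Monte Carlo estimates (the mean 0.7 and standard deviation 1.3 are arbitrary test values, not taken from the paper):

```python
import math

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.7, 1.3                       # arbitrary test values
x = rng.normal(mu, sigma, size=2_000_000)  # Gaussian samples
xhat = np.maximum(x, 0.0)                  # after the ReLU

# Closed-form moments of ReLU(x) for x ~ N(mu, sigma^2)
alpha = mu / sigma
Phi = 0.5 * (1.0 + math.erf(alpha / math.sqrt(2.0)))        # standard Gaussian CDF
phi = math.exp(-0.5 * alpha**2) / math.sqrt(2.0 * math.pi)  # standard Gaussian PDF
m1 = mu * Phi + sigma * phi
m2 = (mu**2 + sigma**2) * Phi + mu * sigma * phi

# Monte Carlo estimates of the same two moments
mc_m1, mc_m2 = xhat.mean(), (xhat**2).mean()
```

With two million samples the Monte Carlo estimates agree with the closed forms to roughly two decimal places.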
8.2 Proof of Theorem 1
According to Eq. 24, we have the following recursion:
(25)
where $\psi_l$ is the variance shrinkage factor of layer $l$, i.e., how much the variance is shrunk by passing through a ReLU activation, and we write $\psi$ for simplicity. We first prove that $\psi$ is a constant for infinitely wide neural networks, and we then show empirically that $\psi$ is independent of the width for any finite neural network.
Lemma 2.
The variance shrinkage factor $\psi_l$ of the $l$th ReLU layer is a constant for any infinitely wide neural network, and it can be calculated by:
where $\psi_l$ is defined in Eq. 25.
Proof.
We denote by $\mu$ and $\sigma^2$ the mean and variance of $g^{(l)}_i$. According to Eq. 23, we know that:
(26)
Based on the symmetry of the hidden nodes, the covariances between any two different hidden nodes in the same layer, $\mathrm{Cov}[h^{(l-1)}_j, h^{(l-1)}_{j'}]$, are all equal. The means of the hidden nodes in layer $l-1$ are also all equal, i.e., $\mathbb{E}[h^{(l-1)}_j] = \mathbb{E}[h^{(l-1)}_{j'}]$ for all $j, j'$.
According to the central limit theorem, the summations in Eq. 26 converge to Gaussian random variables when the number of hidden units goes to infinity. Thus Eq. 26 can be rewritten as:
(27)
which implies that
(28)
Thus the variance shrinkage factor for any layer is
which proves Lemma 2. ∎
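Under a zero-mean reading of the prior (so that pre-activations have $\mu = 0$), the shrinkage constant follows directly from Eq. 24; a short worked step, using $\Phi(0) = 1/2$ and $\phi(0) = 1/\sqrt{2\pi}$:

```latex
% Setting \mu = 0 in Eq. 24:
\mathbb{E}[\hat{x}] = \sigma\,\phi(0) = \frac{\sigma}{\sqrt{2\pi}}, \qquad
\mathbb{E}[\hat{x}^2] = \sigma^2\,\Phi(0) = \frac{\sigma^2}{2},
% so the second moment after the ReLU is half the pre-activation variance:
\frac{\mathbb{E}[\hat{x}^2]}{\operatorname{Var}[x]} = \frac{1}{2}.
```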
Then, according to Lemma 2, Eq. 25 can be rewritten as:
(29)
Note that although Eq. 29 theoretically only holds for infinitely wide neural networks, empirically (Figure 7) we find that it still holds for neural networks with finitely many hidden nodes.
In prediction tasks, the final layer has a single output node, $K_L = 1$, so we have:
(30)
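To illustrate the finite-width behavior, the following sketch (our reading of the setup: zero-mean Gaussian weight priors with per-layer variances $\sigma_l^2$, unit-variance features, no biases; the layer sizes and variances are hypothetical) compares the Monte Carlo variance of the output of a small ReLU network with the product form $(1/2)^{L-1}\prod_{l=1}^{L} K_{l-1}\sigma_l^2$ implied by the recursion when $\psi = 1/2$:

```python
import numpy as np

rng = np.random.default_rng(1)

def output_variance(widths, sigmas, n_nets=20_000):
    """Monte Carlo Var[f(x)] over draws of zero-mean Gaussian weights.

    widths = [K_0, ..., K_{L-1}] are the input and hidden layer sizes
    (the output layer has a single node); sigmas[l] is the prior std
    of the weights entering layer l+1. The input x is held fixed.
    """
    x = rng.normal(size=widths[0])
    x /= np.sqrt((x**2).mean())  # normalize features to unit second moment
    outs = np.empty(n_nets)
    for n in range(n_nets):
        h = x
        for l, k_out in enumerate(widths[1:] + [1]):
            W = rng.normal(0.0, sigmas[l], size=(k_out, h.shape[0]))
            g = W @ h
            # ReLU on hidden layers only, not on the scalar output
            h = np.maximum(g, 0.0) if l < len(widths) - 1 else g
        outs[n] = h[0]
    return outs.var()

widths = [8, 16, 16]        # K_0, K_1, K_2 (hypothetical sizes)
sigmas = [0.3, 0.2, 0.25]   # per-layer prior stds (hypothetical)
L = len(sigmas)
closed_form = 0.5 ** (L - 1) * np.prod([k * s**2 for k, s in zip(widths, sigmas)])
mc = output_variance(widths, sigmas)
```

Under these assumptions the two quantities agree closely even at these small widths, consistent with the empirical observation above.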
9 Estimating $\psi$ through linear regression
We provide a simple algorithm to estimate $\psi$ by solving a linear regression problem.
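Assuming the quantity being estimated is the variance shrinkage factor $\psi$ from Eq. 25 (the symbol did not survive extraction), a minimal sketch of such an estimator simulates zero-mean Gaussian pre-activations at several variances and fits a through-the-origin least-squares slope of the post-ReLU second moment against the pre-activation variance (grid and sample sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate zero-mean Gaussian pre-activations at several variances and
# record the second moment after the ReLU.
pre_vars = np.linspace(0.5, 4.0, 8)  # illustrative variance grid
post_moments = np.empty_like(pre_vars)
for i, v in enumerate(pre_vars):
    g = rng.normal(0.0, np.sqrt(v), size=200_000)      # pre-activations
    post_moments[i] = (np.maximum(g, 0.0) ** 2).mean()  # after the ReLU

# Through-the-origin least squares: slope = <x, y> / <x, x>
psi_hat = (pre_vars @ post_moments) / (pre_vars @ pre_vars)
# For zero-mean Gaussian pre-activations E[ReLU(g)^2] = Var[g]/2,
# so the fitted slope should be close to 0.5.
```

The same regression can be run on pre- and post-activation statistics recorded from an actual network to check that the fitted slope does not depend on the layer width.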