# Informative Gaussian Scale Mixture Priors for Bayesian Neural Networks

Encoding domain knowledge into the prior over the high-dimensional weight space is challenging in Bayesian neural networks. Two types of domain knowledge are commonly available in scientific applications: 1. feature sparsity (number of relevant features); 2. signal-to-noise ratio, quantified, for instance, as the proportion of variance explained (PVE). We show both types of domain knowledge can be encoded into the widely used Gaussian scale mixture priors with Automatic Relevance Determination. Specifically, we propose a new joint prior over the local (i.e., feature-specific) scale parameters to encode the knowledge about feature sparsity, and an algorithm to determine the global scale parameter (shared by all features) according to the PVE. Empirically, we show that the proposed informative prior improves prediction accuracy on publicly available datasets and in a genetics application.


## 1 Introduction

The Bayesian approach [Gelman et al., 2013] has been of interest because of its ability to incorporate domain knowledge into reasoning and to provide principled uncertainty estimates. Bayesian neural networks (BNNs) [Neal, 2012], however, are criticized for the difficulty of encoding domain knowledge into the prior over the high-dimensional weight space. This limits their use in applications where domain knowledge is important, e.g., when data are limited or the signal is extremely weak and sparse. Applications in genetics often fall into the latter category and are used as motivating examples for the proposed methods.

Two types of domain knowledge are often available in scientific applications: ballpark figures on feature sparsity and the signal-to-noise ratio. Specifically, feature sparsity refers to the number of features expected to be used by the model. For example, in genomics less than 2% of the genome encodes for genes. A prior on the signal-to-noise ratio may encode the amount of variance of the target expected to be explained by the chosen features, and it can be measured as the proportion of variance explained (PVE) [Glantz et al., 1990]. For example, one gene may explain only a tiny fraction of the variance of a given phenotype, e.g., the PVE of the gene can be less than 1%.

Gaussian scale mixtures (GSMs) are commonly used sparse priors on the BNN weights, including, e.g., the horseshoe [Ghosh et al., 2018] and the spike-and-slab [Deng et al., 2019]. However, they are not flexible enough to encode all kinds of beliefs about sparsity, and it is not known how to encode information about the PVE into such a prior. Some recent works on functional BNNs encode domain knowledge into a functional prior, e.g., a Gaussian process [Flam-Shepherd et al., 2017, Sun et al., 2019], but it is not trivial to include sparsity in such priors.

We propose a new informative GSM prior for the weights to explicitly encode domain knowledge about feature sparsity and signal-to-noise ratio. Our main contributions are:

1. Propose a joint hyper-prior on the local scales that can model beliefs on the number of relevant features, which includes the spike-and-slab as a special case (Figure 1).

2. Derive the relation between PVE and the global scale parameter for BNNs with the ReLU activation function.

3. Develop a method to determine the global scale parameter of the GSM prior by using domain knowledge about the PVE, which circumvents heuristics employed in commonly used approaches [Blundell et al., 2015] and computationally intensive cross-validation.

## 2 Background

### 2.1 Bayesian neural networks

Bayesian neural networks [MacKay, 1992, Neal, 2012] are defined by placing a prior distribution on the weights of a neural network. Then, instead of finding a point estimate of the weights by minimizing a cost function, a posterior distribution of the weights is calculated conditionally on the data. Let $f(x; w)$ denote the output of a BNN with weights $w$ and $p(y \mid f(x; w))$ the likelihood. Then, given a dataset of inputs $X$ and outputs $y$, training a BNN means computing the posterior distribution $p(w \mid X, y)$.

Variational inference can be used to approximate the intractable $p(w \mid X, y)$ with a simpler distribution, $q_\phi(w)$, by minimizing $\mathrm{KL}(q_\phi(w) \,\|\, p(w \mid X, y))$. This is equivalent to maximizing the ELBO

$$\mathcal{L}(\phi) = H(q_\phi(w)) + \mathbb{E}_{q_\phi(w)}\left[\log p(y, w \mid X)\right]. \tag{1}$$

The first term in Eq. 1 is the entropy of the approximate posterior, which can be calculated analytically for many choices of $q_\phi$. The second term can be estimated with the reparametrization trick [Kingma and Welling, 2013], which reparametrizes the approximate posterior by a deterministic and differentiable function $w = g(\phi, \epsilon)$ with $\epsilon \sim p(\epsilon)$, such that $w \sim q_\phi(w)$.
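The ELBO estimator described above can be sketched in a few lines of NumPy. This is an illustrative, hypothetical snippet (not code from the paper), with `log_joint` standing in for $\log p(y, w \mid X)$:

```python
import numpy as np

def elbo_estimate(mu, log_sigma, log_joint, n_samples=1000, seed=0):
    """Monte-Carlo ELBO for a mean-field Gaussian q(w) = N(mu, sigma^2).

    ELBO = H(q) + E_q[log p(y, w | X)]; the expectation is estimated via
    the reparametrization w = mu + sigma * eps, eps ~ N(0, I).
    """
    rng = np.random.default_rng(seed)
    sigma = np.exp(log_sigma)
    # Entropy of a diagonal Gaussian is available in closed form.
    entropy = 0.5 * np.sum(np.log(2 * np.pi * np.e * sigma**2))
    eps = rng.standard_normal((n_samples, mu.size))
    w = mu + sigma * eps                       # reparametrized samples
    expected_log_joint = np.mean([log_joint(wi) for wi in w])
    return entropy + expected_log_joint

# Toy log-joint (standard Gaussian prior, no data term) for illustration:
log_joint = lambda w: -0.5 * np.sum(w**2) - 0.5 * w.size * np.log(2 * np.pi)
elbo = elbo_estimate(np.zeros(3), np.zeros(3), log_joint)
```

With this toy log-joint, $q$ matches the target exactly, so the ELBO is close to zero up to Monte-Carlo error.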

### 2.2 Gaussian scale mixture priors

The Gaussian scale mixture [Andrews and Mallows, 1974] is defined to be a zero-mean Gaussian conditional on its scale, and the distribution on the scale characterizes its statistical properties, e.g., sparsity. It has been combined with Automatic Relevance Determination (ARD) [MacKay, 1996], which is a widely used approach for feature selection. In BNNs, the ARD prior means that all of the outgoing weights from the same node $i$ in layer $l$ share the same scale $\lambda^{(l)}_i$ [Neal, 2012], and node $i$ will be dropped if $\lambda^{(l)}_i$ shrinks to zero. We define the input layer as layer $0$ in this paper.

A GSM ARD prior on each weight can be written in a hierarchical parametrization form as follows:

$$w^{(l)}_{i,j} \mid \lambda^{(l)}_i; \tau \sim \mathcal{N}\!\left(w^{(l)}_{i,j};\, 0,\, \tau^2 \lambda^{(l)2}_i\right); \qquad \lambda^{(l)}_i \sim p(\lambda^{(l)}_i; \theta), \tag{2}$$

where $\tau$ is the global scale shared by all of the NN weights, and $p(\lambda^{(l)}_i; \theta)$ defines a hyper-prior on the local scales. The marginal distribution of $w^{(l)}_{i,j}$ can be obtained by integrating out the local scales:

$$p(w^{(l)}_{i,j}; \tau, \theta) = \int \mathcal{N}\!\left(w^{(l)}_{i,j};\, 0,\, \tau^2 \lambda^{(l)2}_i\right) p(\lambda^{(l)}_i; \theta)\, d\lambda^{(l)}_i. \tag{3}$$

Eq.2 can also be written in an equivalent non-centered parametrization [Papaspiliopoulos et al., 2007] form (Figure 2 left):

$$w^{(l)}_{i,j} = \beta^{(l)}_{i,j} \lambda^{(l)}_i; \qquad \beta^{(l)}_{i,j} \sim \mathcal{N}(\beta^{(l)}_{i,j}; 0, \tau^2); \qquad \lambda^{(l)}_i \sim p(\lambda^{(l)}_i; \theta), \tag{4}$$

which has a better posterior geometry for inference [Betancourt and Girolami, 2015] than Eq.2. The non-centered parametrization form has been widely used in BNN literature [Louizos et al., 2017, Ghosh et al., 2018], and we also use this form in the rest of the paper.
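As a sketch under the paper's notation (not code from the paper), drawing one weight matrix from the non-centered parametrization of Eq. 4 might look as follows; the `sample_scale` callback stands in for the hyper-prior $p(\lambda; \theta)$:

```python
import numpy as np

def sample_gsm_ard_weights(n_in, n_out, tau, sample_scale, rng=None):
    """Draw one weight matrix from the non-centered GSM ARD prior (Eq. 4).

    Each input node i has one local scale lambda_i shared by all its
    outgoing weights; `sample_scale` draws lambda_i from p(lambda; theta).
    """
    rng = rng or np.random.default_rng(0)
    beta = tau * rng.standard_normal((n_in, n_out))    # beta ~ N(0, tau^2)
    lam = np.array([sample_scale(rng) for _ in range(n_in)])
    return beta * lam[:, None]                         # w = beta * lambda

# Example: Bernoulli(0.1) local scales give a spike-and-slab prior in which
# roughly 90% of the input nodes are dropped entirely.
w = sample_gsm_ard_weights(500, 100, tau=0.1,
                           sample_scale=lambda rng: float(rng.random() < 0.1))
```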

Different choices of $p(\lambda^{(l)}_i; \theta)$ define different marginal prior distributions on the weights according to Eq. 3. For example, when $p(\lambda^{(l)}_i; \theta)$ is Bernoulli, each weight follows a spike-and-slab prior distribution [Mitchell and Beauchamp, 1988]; when it is the half-Cauchy distribution, the weights follow the horseshoe prior [Carvalho et al., 2009]. The hyper-prior is important in practice, especially for small data sets [Piironen and Vehtari, 2017], as it controls the shape of the marginal prior on the weights, and thus directly affects the sparsity level of the model. If the distribution of $\lambda^{(l)}_i$ concentrates at $0$, a sparse model will be preferred; otherwise, a dense model will be learned instead.

### 2.3 Proportion of Variance Explained

Assume that the data generating process takes the form

$$y = f(x; w) + \epsilon, \tag{5}$$

where $f$ is the model, and $\epsilon$ is the unexplainable noise. The Proportion of Variance Explained (PVE) [Glantz et al., 1990] of $f$ on dataset $(X, y)$, also called the coefficient of determination ($R^2$) in linear regression, is then defined as:

$$\mathrm{PVE}(w) = 1 - \frac{\mathrm{Var}(y - f(X; w))}{\mathrm{Var}(y)} = \frac{\mathrm{Var}(f(X; w))}{\mathrm{Var}(y)}, \tag{6}$$

and it is commonly used to measure the impact of features $X$ on the prediction target $y$, especially in genomics [Marttinen et al., 2014]. In general, the PVE should be less than 1, because the variance of the predicted values should not exceed that of the data. However, this can easily be violated in non-linear models, such as neural networks, unless explicitly accounted for. For an example see Figure 1.
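As a small illustration, the PVE of Eq. 6 can be computed directly from predictions; the toy data below are hypothetical:

```python
import numpy as np

def pve(y, y_pred):
    """Proportion of variance explained (Eq. 6): 1 - Var(y - f) / Var(y)."""
    return 1.0 - np.var(y - y_pred) / np.var(y)

# Toy example: y = signal + noise with Var(signal) = 1, Var(noise) = 0.25,
# so the true signal achieves PVE ~ 1 / 1.25 = 0.8; a constant prediction
# achieves PVE = 0, and an overconfident model can drive the PVE negative.
rng = np.random.default_rng(0)
signal = rng.standard_normal(1000)
y = signal + 0.5 * rng.standard_normal(1000)
```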

It is known that the scale parameter of a mean-field (fully factorized) Gaussian prior on BNNs affects the variability of the functions drawn from the prior [Neal, 2012], and thus the PVE. As we will show in Section 4, for BNNs with the GSM prior defined in Eq. 4, the global scale $\tau$ and the local scales $\lambda^{(l)}_i$ jointly control the PVE, and the effect of $\tau$ grows exponentially as the depth increases. This underlines the importance of setting $\tau$ properly based on the expected PVE. When the scale parameter is set so that the corresponding PVE is close to the true value, the model is more likely to recover the true data-generating function (demonstration in Figure 1).

## 3 Prior knowledge about sparsity

In this section, we propose a new hyper-prior for the local scales according to the generative model shown in Figure 2 (right). The new prior generates the local scales conditionally on the number of relevant features, which allows us to explicitly express domain knowledge about the number of relevant features.

### 3.1 Discrete informative scale priors

We first consider the case when each scale $\lambda^{(l)}_i$ is a discrete random variable with domain $\{0, 1\}$, such as the independent Bernoulli scale parameters in the spike-and-slab prior. We will then generalize this to the continuous scale in Section 3.2.

#### 3.1.1 Prior on the number of relevant features

We assume the prior belief about the number of relevant features is defined as $p_m(m; \theta)$, i.e., a probability mass function for $m \in \{0, 1, \ldots, D\}$, where $D$ is the dimension of the dataset. The prior directly controls the model sparsity. Intuitively, if $p_m$ concentrates on small $m$, a sparse model is preferred as it has a high prior probability of using a small number of features; if $p_m$ puts a large probability mass on $m$ close to $D$, all of the features are likely to be used instead. Hence, unlike other priors encouraging shrinkage, such as the Laplace or horseshoe, our new prior is more interpretable about sparsity as it directly models the number of relevant features.

The modeler can choose based on the prior information available. When we have a good idea about the number of relevant features, a unimodal prior can be used, such as a discretized Gaussian:

$$p_m(m; \mu, \tau_m) = c_n \exp\left\{-\frac{\tau_m (m - \mu)^2}{2}\right\}, \tag{7}$$

where $\mu$ is the mode, $\tau_m$ represents the precision, as the inverse variance does in the continuous Gaussian, and $c_n$ is the normalization constant. Often we may only be able to specify an interval for the number of relevant features. Then we can use, for example, a ‘flattened’ Gaussian (Figure 1):

$$p_m(m; \mu_-, \mu_+, \tau_m) = c_n \exp\left\{-\frac{\tau_m R^2(m; \mu_-, \mu_+)}{2}\right\}, \qquad R(m; \mu_-, \mu_+) = \max\{m - \mu_+,\, \mu_- - m,\, 0\}, \tag{8}$$

where $[\mu_-, \mu_+]$ defines the interval where the probability is uniform and reaches its maximum value, and $c_n$ is the corresponding normalization constant. If there is no prior information about sparsity, a discrete uniform prior over $\{0, \ldots, D\}$ is a plausible alternative.
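A minimal sketch of the ‘flattened’ Gaussian pmf of Eq. 8; the function name and the example interval are illustrative, not from the paper:

```python
import math

def flattened_gaussian_pmf(D, mu_minus, mu_plus, tau_m):
    """Discrete 'flattened' Gaussian prior over m = 0..D (Eq. 8).

    R(m) = max(m - mu_plus, mu_minus - m, 0) is zero (flat, maximal
    probability) on [mu_minus, mu_plus] and grows linearly outside it.
    """
    def unnorm(m):
        r = max(m - mu_plus, mu_minus - m, 0)
        return math.exp(-tau_m * r**2 / 2)
    weights = [unnorm(m) for m in range(D + 1)]
    c = sum(weights)                 # inverse of the normalization constant
    return [w / c for w in weights]

# Belief: between 5 and 15 of D = 500 features are relevant.
pmf = flattened_gaussian_pmf(500, 5, 15, tau_m=1.0)
```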

#### 3.1.2 Feature allocation

Given the prior distribution $p_m(m; \theta)$, we first specify how it affects which variables will be used for a given $m$, and then marginalize over $m$. We introduce indicator variables $I_i$ to denote whether feature $i$ is used ($I_i = 1$) or not ($I_i = 0$); they should satisfy $\sum_{i=1}^D I_i = m$. Assuming no prior knowledge about the relative importance of features (this assumption can easily be replaced if needed), i.e., $\{I_i\}_{i=1}^D$ has a jointly uniform distribution given $m$, we have:

$$p(\{I_i\}_{i=1}^D \mid m) = c_d\, \delta\!\left[m - \sum_{i=1}^D I_i\right], \tag{9}$$

where the normalization constant is $c_d = \binom{D}{m}^{-1}$.

Now we can calculate the joint distribution of $\{I_i\}_{i=1}^D$ by marginalizing out $m$:

$$p(\{I_i\}_{i=1}^D; \theta) = \sum_{m=0}^{D} p_m(m; \theta)\, p(\{I_i\}_{i=1}^D \mid m) = p_m\!\left(\sum_{i=1}^D I_i; \theta\right) \binom{D}{\sum_{i=1}^D I_i}^{-1}. \tag{10}$$

For Bernoulli local scale variables $\lambda^{(l)}_i \in \{0, 1\}$, the scale itself takes the role of the indicator variable $I_i$. Thus we obtain the joint distribution over the discrete scale parameters as

$$p_d(\lambda^{(l)}_1, \ldots, \lambda^{(l)}_D; \theta) = p_m\!\left(\sum_{i=1}^D \lambda^{(l)}_i; \theta\right) \binom{D}{\sum_{i=1}^D \lambda^{(l)}_i}^{-1}, \tag{11}$$

where the distribution $p_m$ encodes our domain knowledge of the number of relevant features.

Generally, the local scales are dependent under this prior. However, when $p_m$ is a Binomial distribution (or its Gaussian approximation), the joint distribution in Eq. 11 becomes a product of independent Bernoullis, which corresponds to the spike-and-slab prior (Figure 1).
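The joint prior of Eq. 11 is easy to evaluate for a binary scale configuration. The sketch below (names are illustrative) also checks the special case where a Binomial $p_m$ reduces Eq. 11 to independent Bernoullis:

```python
import math

def log_joint_scale_prior(lams, pm):
    """Log of Eq. 11: joint prior over binary local scales lambda_i.

    `pm(m)` is the prior pmf over the number of relevant features; the
    binomial coefficient makes all configurations with the same number
    of active features equally likely.
    """
    D, m = len(lams), sum(lams)
    return math.log(pm(m)) - math.log(math.comb(D, m))

# With a Binomial pm, Eq. 11 factorizes into independent Bernoulli(p)
# terms, i.e., the spike-and-slab prior:
p = 0.1
pm = lambda m, D=4: math.comb(D, m) * p**m * (1 - p)**(D - m)
lp = log_joint_scale_prior([1, 0, 0, 1], pm)   # equals log(p^2 (1-p)^2)
```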

### 3.2 Informative prior on continuous local scales

When the local scales are continuous ($\lambda^{(l)}_i \in \mathbb{R}_+$), the number of relevant features $m$ is not a sensible way to define sparsity, because the posterior probability of the event $\lambda^{(l)}_i = 0$ is zero almost everywhere [Carvalho et al., 2009]. Thus $m$ will always equal $D$, unless heuristic pruning is used [Louizos et al., 2017].

Instead of using the number of relevant features to represent sparsity, we propose to use the effective number of features [Piironen and Vehtari, 2017], which is defined as

$$m_{\mathrm{eff}} = \sum_{i=1}^{D} \eta_i = \sum_{i=1}^{D} (1 - \kappa_i), \tag{12}$$

where $\kappa_i$ is the shrinkage factor [Carvalho et al., 2009], which can be defined for any GSM prior as follows. The $\kappa_i$ reflects the proportion by which feature $i$ is shrunk compared to its maximum likelihood estimate, and it is defined as

$$\kappa_i = \frac{c_s}{c_s + \lambda^{(l)2}_i}, \tag{13}$$

where $c_s$ is a small positive constant [Carvalho et al., 2009].[^1] In our experiments we use a fixed $c_s$ to apply this with BNNs. We assume that the user can give a prior belief about $m_{\mathrm{eff}}$ in a similar way as about the number of relevant features in Section 3.1.

[^1]: For example, $c_s = \sigma^2 / (n \tau^2)$ in Bayesian linear regression [Carvalho et al., 2009], where $\sigma^2$ is the variance of the data likelihood, $n$ is the size of the data, and $\tau$ is the global scale parameter.
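The effective number of features of Eqs. 12-13 can be computed as follows; this is an illustrative sketch, and the value of $c_s$ is an assumption:

```python
import numpy as np

def effective_num_features(lams, c_s=1.0):
    """Effective number of features m_eff (Eqs. 12-13).

    kappa_i = c_s / (c_s + lambda_i^2) is the shrinkage factor;
    eta_i = 1 - kappa_i; m_eff = sum_i eta_i.
    """
    lams = np.asarray(lams, dtype=float)
    kappa = c_s / (c_s + lams**2)
    return float(np.sum(1.0 - kappa))

# Large scales contribute ~1 each, near-zero scales ~0, so two large and
# two tiny scales give m_eff close to 2:
m_eff = effective_num_features([100.0, 100.0, 0.01, 0.01])
```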

Inspired by Eq. 10 for the discrete case, we define the joint distribution on $\eta_1, \ldots, \eta_D$:

$$p(\eta_1, \ldots, \eta_D; \theta) \propto p_{m_{\mathrm{eff}}}\!\left(\sum_{i=1}^D \eta_i; \theta\right) \binom{D}{\sum_{i=1}^D \eta_i}^{-1}, \tag{14}$$

where $p_{m_{\mathrm{eff}}}$ is a continuous prior for the effective number of features, analogous to the discrete prior over the number of relevant features in Section 3.1. The binomial coefficient in Eq. 14 is calculated for the continuous $\sum_i \eta_i$ using Stirling's approximation. Since the non-shrinkage factor $\eta_i$ increases monotonically w.r.t. $\lambda^{(l)}_i$, we get the joint density of the scale parameters by the change of variables formula:

$$p_c(\lambda_1, \ldots, \lambda_D; \theta) \propto p_{m_{\mathrm{eff}}}\!\left(\sum_{i=1}^D \eta(\lambda^{(l)}_i); \theta\right) \binom{D}{\sum_{i=1}^D \eta(\lambda^{(l)}_i)}^{-1} \prod_{i=1}^{D} \frac{d\eta(\lambda^{(l)}_i)}{d\lambda^{(l)}_i}, \tag{15}$$

where $\eta(\lambda^{(l)}_i) = 1 - \kappa_i$. We show the marginal distribution of the local scale for different $p_{m_{\mathrm{eff}}}$ in Figure 3. When $p_{m_{\mathrm{eff}}}$ concentrates more on small values (Figure 3 left), the marginal of the scale has a heavier tail (the blue line in Figure 3 right). We note that although the normalization constant in Eq. 15 is not analytically available, inference with this prior is still feasible via variational inference and the reparametrization trick.

### 3.3 BNNs with informative priors

In BNNs, the number of features used by the model (sparsity) is determined by the weights in the input layer, unless all nodes of some hidden layer are jointly dropped, which is very unlikely in practice. Thus we encode the domain knowledge on sparsity in the informative prior of the input layer, i.e., in Eq. 11 and Eq. 15. For the other layers we use the spike-and-slab prior in our experiments (see Experiments). However, informative priors could also be used in the higher layers. Because specifying domain knowledge about the number of hidden nodes would be difficult, $p_m$ could be set to be uniform there. This way one could potentially learn the optimal number of hidden nodes; we leave this for future work.

To summarize, according to Section 3.1, we use the following prior on the input layer:

$$w^{(0)}_{i,j} = \beta^{(0)}_{i,j} \lambda^{(0)}_i; \qquad \beta^{(0)}_{i,j} \sim \mathcal{N}(\beta^{(0)}_{i,j}; 0, \tau^2), \tag{16}$$

with the hyper-prior on the local scales following $p_d$ (Eq. 11) in the discrete case and $p_c$ (Eq. 15) in the continuous case. We use independent concrete Bernoulli distributions [Maddison et al., 2016] to approximate the posterior of the local scales in the discrete case and log-normal distributions in the continuous case, and independent Gaussians to approximate the posterior of $\beta^{(0)}_{i,j}$ in both cases.

## 4 Prior knowledge about PVE

In this section, we introduce an approach for determining the global scale $\tau$ for the GSM priors in Eq. 4, according to domain knowledge about the expected PVE.

### 4.1 PVE for Bayesian neural networks

According to the definition of the PVE in Eq. 6, when $f$ is a BNN, $\mathrm{PVE}(w)$ has a distribution determined by the distribution of $w$. In this paper, instead of analysing the distribution of $\mathrm{PVE}(w)$, we analyse a summary statistic of the PVE commonly used in Bayesian linear models [Gelman et al., 2019]: the expected PVE,

$$\mu_{\mathrm{PVE}} = \mathbb{E}_{p(w)}\left[\mathrm{PVE}(w)\right], \tag{17}$$

over functions drawn from the BNN prior. Consequently, instead of restricting the PVE directly, we express the prior beliefs using $\mu_{\mathrm{PVE}}$, e.g., requiring it to be less than 1.

For BNNs with arbitrary activation functions or priors $p(w)$, the analytical form of $\mu_{\mathrm{PVE}}$ is intractable. However, under certain assumptions, we have the following theorem (proof in the Supplement):

###### Theorem 1.

If the weights in different layers ($w^{(0)}, \ldots, w^{(L)}$) are independent, and the priors within any single layer $l$ have zero mean and the same second moment $\sigma^{(l)2}$, then the expected PVE of a ReLU BNN with $L$ hidden layers and $M_l$ units in layer $l$ is given by

$$\mu_{\mathrm{PVE}} = \alpha\, \sigma^{(0)2} \prod_{l=1}^{L} \sigma^{(l)2} M_l\, \frac{\sum_{i=1}^{D} \mathrm{Var}(x_i)}{\mathrm{Var}(y)},$$

where $\alpha$ is a constant independent of the priors of the weights.

Theorem 1 will be used to determine the global scale $\tau$ according to the prior knowledge on $\mu_{\mathrm{PVE}}$ in Section 4.2. Note that Theorem 1 is applicable to the commonly used priors, such as the fully factorized Gaussian or the spike-and-slab, as well as to the newly proposed informative prior, as long as the second moment of the prior is finite.

### 4.2 Determining τ according to PVE

For any zero-mean GSM ARD prior for BNNs, the variance (second moment) of a weight in Eq. 4 equals:

$$\sigma^{(l)2}_i = \mathbb{E}\!\left[(\beta^{(l)}_{i,j})^2\right] \mathbb{E}\!\left[(\lambda^{(l)}_i)^2\right] = \mathbb{E}\!\left[\lambda^{(l)2}_i\right] \tau^2. \tag{18}$$

In practice, we first set the hyper-priors on the local scales (the same for all nodes within a single layer), $p(\lambda^{(l)}_i; \theta)$, by encoding the domain knowledge about sparsity, which determines $\mathbb{E}[\lambda^{(l)2}_i]$. According to Theorem 1, the expected PVE then simplifies to

$$\mu_{\mathrm{PVE}} = \tilde{\alpha}\, \tau^{2L+2}, \tag{19}$$

where $\tilde{\alpha}$ is a constant that depends on the variance of the data, the model architecture, and the pre-encoded sparsity, but is independent of $\tau$. Hence, when the factors affecting $\tilde{\alpha}$ are kept constant, $\mu_{\mathrm{PVE}}$ is fully determined by the global scale $\tau$ alone.

Both $\alpha$ and $\tilde{\alpha}$ in Eq. 19 are analytically tractable only when all layer widths $M_l$ are infinite (shown in the Supplement), which is impractical. However, since we know the exact form of Eq. 19, we can obtain a very accurate estimate of $\tilde{\alpha}$ by solving the linear regression problem

$$\log \mu_{\mathrm{PVE}} = \log \tilde{\alpha} + (2L + 2) \log \tau. \tag{20}$$

We solve Eq. 20 using pairs $(\tau, \mu_{\mathrm{PVE}})$, obtained by simulating from the BNN (with the pre-specified prior and data $X$) for multiple values of $\tau$ to get a Monte-Carlo estimate of the corresponding $\mu_{\mathrm{PVE}}$; in practice, a small number of simulated values of $\tau$ is enough to make the $R^2$ of Eq. 20 very close to 1. See an algorithm and example in the Supplementary.
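The calibration procedure can be sketched as follows. The `simulate_pve` callback stands in for the Monte-Carlo estimate of $\mu_{\mathrm{PVE}}$ from BNN prior draws; here a synthetic stand-in that obeys Eq. 19 exactly is used purely for illustration:

```python
import numpy as np

def calibrate_tau(simulate_pve, L, target_pve, taus=np.logspace(-2, 0, 10)):
    """Estimate log(alpha~) by regressing log(mu_PVE) on log(tau) with the
    slope fixed to 2L + 2 (Eq. 20), then invert Eq. 19 for the target PVE.
    """
    log_mu = np.log([simulate_pve(t) for t in taus])
    # With a known slope, the least-squares intercept is a simple mean.
    log_alpha = np.mean(log_mu - (2 * L + 2) * np.log(taus))
    # Eq. 19: target = alpha~ * tau^(2L+2)  =>  solve for tau.
    return np.exp((np.log(target_pve) - log_alpha) / (2 * L + 2))

# Stand-in simulator obeying Eq. 19 with alpha~ = 50 and L = 2; in practice
# this would average PVE(w) over weights drawn from the BNN prior.
tau = calibrate_tau(lambda t: 50.0 * t**6, L=2, target_pve=0.25)
```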

After the constant $\tilde{\alpha}$ has been estimated, $\tau$ can be specified using domain knowledge about $\mu_{\mathrm{PVE}}$, according to Eq. 19. We set the aleatoric uncertainty, i.e., the variance of $\epsilon$ in Eq. 5, to $(1 - \mu_{\mathrm{PVE}})\,\mathrm{Var}(y)$, so that the model in Eq. 5 is consistent with our prior knowledge. If we have no relevant knowledge, a non-informative prior on $\mu_{\mathrm{PVE}}$ can be used instead, such as the uniform distribution on $[0, 1]$. By the change of variables, the prior induced on $\tau$ is then given by

$$p(\tau) = p(\mu_{\mathrm{PVE}})\left|\frac{d\mu_{\mathrm{PVE}}}{d\tau}\right| = \tilde{\alpha}\,(2L + 2)\,\tau^{2L+1}. \tag{21}$$

For some heavy-tailed priors, e.g., the horseshoe, whose second moments do not exist, $\mu_{\mathrm{PVE}}$ does not exist either, and Theorem 1 cannot be applied. However, other summary statistics of the PVE, such as the median, may be used to determine the global scale for the horseshoe. We leave this for future work.

## 5 Related Work

BNNs with a fully factorized Gaussian prior and posterior were proposed by Graves [2011]. BNNs with a fully factorized Gaussian prior and a mixture of Dirac-delta posteriors can be interpreted as NNs with dropout [Gal and Ghahramani, 2016]. Nalisnick et al. [2019] extended these works and showed that NNs with any multiplicative noise can be interpreted as BNNs with GSM ARD priors. Several priors have been proposed to induce sparsity, such as log-uniform priors [Louizos et al., 2017], horseshoe priors [Louizos et al., 2017, Ghosh et al., 2018], and spike-and-slab priors [Deng et al., 2019]. However, none of these works has proposed how to encode domain knowledge explicitly.

Building informative priors for neural networks in the functional space has been widely studied. One common type of prior information concerns the behavior of the output with certain inputs. Noise contrastive priors (NCPs) [Hafner et al., 2018] were designed to encourage reliable high uncertainty for OOD (out-of distribution) data points. Gaussian processes have been widely used as functional priors, because of their ability to encode rich functional structures. Flam-Shepherd et al. [2017] transformed a functional GP prior into a weight-space BNN prior, with which variational inference is performed. Functional BNNs [Sun et al., 2019], however, perform variational inference directly in the functional space, where meaningful functional GP priors can be specified. Pearce et al. [2019] used a combination of different BNN architectures to encode prior knowledge about the function. Although building functional priors can avoid uninterpretable high-dimensional weights, encoding sparsity information of features into the functional space is not trivial.

## 6 Experiments

We first apply BNNs with the new discrete informative prior on synthetic datasets, and compare the convergence rate with the ‘gold standard’ baseline: the spike-and-slab prior. We then apply our approach on 5 public UCI datasets with and without injected irrelevant features, and to a Genome-Wide Association Study (GWAS) dataset. We show that incorporating domain knowledge on both sparsity and the PVE through our approach can improve the results. All datasets were standardized before training.

Common model settings: We used BNNs with the ARD spike-and-slab prior (SSBNN) as the baseline for discrete scales, where the slab probabilities of the input layer were determined based on knowledge about the number of relevant features, and the slab probabilities of the hidden layers were set to a fixed value. We consider the horseshoe prior on all layers (HSBNN) as the baseline for continuous scales, using the same hyper-parameters as in Ghosh et al. [2018]. For BNNs with informative priors on the input layer, we used the ‘flattened’ Gaussian (Eq. 8) to model the number of relevant features ($m$) for discrete cases (InfoD), and the effective number of features ($m_{\mathrm{eff}}$) for continuous cases (InfoC). The hidden layers of InfoD are the same as in SSBNN, and those of InfoC are the same as in the corresponding HSBNN.

### 6.1 Synthetic data

Setup: We simulated datasets with 500 features, of which only 10 are used to compute the target $y$ with the model $f$. The model contains both main effects and interactions between the selected features (details in Supplementary). The noise $\epsilon$ is Gaussian with zero mean and a standard deviation chosen to give a fixed signal-to-noise ratio. We generated 3 datasets of different sizes (300, 700, and 1100), and compared the convergence of both MSE and sparsity (number of features) with the different priors on each dataset. We used the test MSE to determine convergence. The expected number of features included in the first layer, calculated using the feature inclusion probabilities, is used as the estimate for the number of relevant features for both priors.

Parameter settings: We applied BNNs with two different types of ARD priors: the spike-and-slab prior (SSBNN) and the discrete informative prior (InfoD). The same fixed global scale is used for all models. The slab probability of SSBNN on the first layer is set to reflect the expected number of relevant features. We encoded three types of prior knowledge into InfoD by setting $\mu_-$, $\mu_+$, and $\tau_m$ differently:
1. a strongly informative InfoD;
2. a weakly informative InfoD;
3. a wrong InfoD.

We used a concrete Bernoulli distribution with a fixed temperature to approximate the posteriors of the scales for both the SS and InfoD priors, and optimized with Adam. The neural network architecture has 2 hidden layers of 100 and 50 nodes.

Results: Figure 4 shows the convergence of the test error and sparsity (the number of relevant features) with the different priors. For the small dataset (300), where the prior is more important, the InfoD prior with the wrong information converges to the same MSE and number of features as the SS prior, while the InfoD priors with correct information converge to a lower MSE and the correct number of features (10, the green dotted lines). When we increase the data size to 700, InfoD with the wrong information converges to the correct sparsity and the same MSE as the correct InfoD priors, but more slowly. For the largest dataset with 1100 data points, the converged MSEs of the different priors are similar. However, with the SS prior the number of features is overestimated even at convergence. The strongly informative InfoD prior always converges faster than the weakly informative one, but the advantage diminishes as the size of the dataset increases.

### 6.2 UCI datasets

Setup: We analyze 5 publicly available datasets (Table 1): Bike sharing, California housing prices, Energy efficiency, Concrete compressive strength, and Yacht hydrodynamics. We carry out two types of experiments: 1. analyze the original datasets as such, in which case the domain knowledge about sparsity is unknown; 2. concatenate 100 irrelevant features with the original features, which allows us to specify informative priors about sparsity (the number of relevant features is at most the number of original features). We examine whether the performance can be improved by encoding the extra knowledge about the PVE and sparsity into the prior. We split the data into training, validation, and test sets. We consider three evaluation metrics on the test sets: negative log-likelihood (NLL), MSE, and PVE. We repeat each experiment on multiple random data splits to obtain confidence intervals.

Parameter settings: We considered 3 classes of priors: 1. the standard mean-field Gaussian; 2. discrete GSM ARD priors; 3. continuous GSM ARD priors.

For discrete GSM priors, we used SSBNN as the baseline, with a fixed global scale as was done in prior work [Wenzel et al., 2020]. We set the slab probability on the input layer separately for the original datasets and for the datasets extended with noisy features; in the latter case it represents the correct level of sparsity (at most $d$ relevant features out of $d + 100$, where $d$ is the number of features in the original dataset). In addition, we considered the following three discrete informative priors:

1. SSBNN+PVE: the SSBNN prior with the global scale set according to a target $\mu_{\mathrm{PVE}}$ (see Section 4).
2. InfoD: the discrete informative prior with the flattened-Gaussian hyper-prior and a fixed global scale.
3. InfoD+PVE: same as InfoD, but with the global scale set according to the target $\mu_{\mathrm{PVE}}$.

To assess sensitivity, we also consider an alternative value of $\mu_{\mathrm{PVE}}$ for SSBNN+PVE and InfoD+PVE.

Of the continuous GSM priors, we consider the horseshoe BNN (HSBNN) as the baseline. We only consider encoding prior knowledge about sparsity, by using InfoC, because $\mu_{\mathrm{PVE}}$ does not exist for the horseshoe prior, as discussed in Section 4. We use the same $\mu_-$, $\mu_+$, and $\tau_m$ as in the discrete case (InfoD). The BNNs we used have 3 hidden layers of 100, 50, and 20 nodes. The noise scale $\sigma$ in Eq. 5 was set to the same value for each model.

Results: The results with the negative log-likelihood (NLL) metric are shown in Table 1; MSE and PVE are reported in the Supplementary.

For the original datasets, we see that setting the global scale according to $\mu_{\mathrm{PVE}}$ (SSBNN+PVE and InfoD+PVE) increases the data likelihood, which reflects both the quality of the predictive uncertainty and the prediction accuracy. The newly proposed informative priors (InfoD and InfoC) also slightly improve the performance on the smaller datasets, even though there is no prior information about sparsity to encode. The horseshoe (HSBNN) has comparable performance even without explicitly encoded information.

In the extended datasets with 100 extra irrelevant features, knowledge of both the PVE and sparsity improves the performance significantly in the discrete-scale cases. For most datasets, SSBNNs, even with the correct level of sparsity, are not good enough to produce reasonable results. We find that for the large datasets the discrete-scale priors are better than the continuous ones, and for small datasets the continuous-scale priors perform better, although they cannot include information about the PVE. We also find that non-sparsity-inducing priors (mean-field Gaussian) work comparably on some original datasets with large sample sizes, but work poorly on most datasets.

The results with the alternative value of $\mu_{\mathrm{PVE}}$ (Supplementary) show that the conclusions are not sensitive to the specific value of the PVE assumed.

### 6.3 GWAS application

Motivation: The goal of a Genome-Wide Association Study (GWAS) is to learn associations between genetic variants called SNPs (input features) and phenotypes (targets). Ultimately, the goal is to predict the phenotype given the SNPs of an individual. This task is extremely challenging because 1. the input features are very high-dimensional and strongly correlated and 2. they may explain only a tiny fraction of the variance of the phenotype, e.g., less than 1%. Thus, most approaches employ several heuristic but crucial preprocessing steps to reduce the input dimension and correlation. There exists strong domain knowledge about sparsity and the amount of variance explained by the SNPs, and we show that by incorporating this knowledge into informative priors we can predict accurately where alternatives fail.

Dataset: The FINRISKI dataset contains SNPs and 228 different metabolites as phenotypes for 4620 individuals. We selected 6 genes that have previously been associated with the metabolites [Kettunen et al., 2016]. We use the SNPs of each gene to predict the corresponding most correlated metabolite, resulting in 6 different experiments.

Parameter settings: We train BNNs with 1 hidden layer. We consider 3 different priors: mean-field Gaussian, spike-and-slab (SSBNN), and the discrete informative prior (InfoD). We make predictions using the posterior mean and evaluate the performance by the PVE (higher is better) on test data. We use 50% of the data for training and 50% for testing, and we repeat this 50 times for each of the 6 experiments (i.e., genes), allowing us to assess experiment-specific variability due to the random split and training.

The slab probability of SSBNN is fixed, and in InfoD we set $\mu_-$ and $\mu_+$ as small fractions of $D$, where $D$ is the number of SNPs in the chosen gene. This reflects the prior belief that only a small fraction of the SNPs in the gene actually affect the phenotype. The global scale of each prior is either fixed (without prior information about the PVE), or calculated by setting $\mu_{\mathrm{PVE}}$ according to previous findings [Kettunen et al., 2016] as in Section 4.

Results: Figure 5 shows the results for the 6 experiments. We see that setting the global scale according to the prior knowledge on the PVE always improves accuracy and reduces uncertainty for all priors (purple bars). Without the prior knowledge on the PVE, learning with any of the priors can overfit seriously (blue bars, negative test PVE). The novel informative discrete GSM prior has the highest accuracy with the smallest standard deviation in all experiments, both with and without the PVE.

## 7 Conclusion

We proposed a new informative Gaussian scale mixture prior on BNN weights, whose global and local scale parameters are specified using domain knowledge about the expected signal-to-noise ratio and sparsity. We demonstrated the utility of the prior on simulated data, publicly available datasets, and in a GWAS application, where it outperformed strong, commonly used baselines. The informative hyper-prior over the local scales can be generalized to any scale mixture distribution, not just the Gaussian scale mixture, e.g., the Strawderman-Berger prior. Possible future work includes encoding the PVE into heavy-tailed distributions, such as the horseshoe, and extending the results to hierarchical priors (a hyper-prior over the global scale).

## 8 Proof of Theorem 1

### 8.1 Introduction

We first introduce the notation and some well-known results from probability theory.

Notation: We denote by $a^{(l)}$ any one of the nodes in the $l$-th hidden layer before the activation function, and $h^{(l)} = \mathrm{ReLU}(a^{(l)})$ is the node after the activation. We use $w^{(l)}$ to represent all the weights from the $l$-th layer to the $(l+1)$-th layer, and $M_l$ is the number of nodes in layer $l$. We use a subscript, as in $a^{(l)}_i$, to denote the $i$-th node in a layer. The output of the neural network is $f(X; w)$, where $X$ is the input. All activation functions are ReLU. We assume that, in the prior distribution, weights in different layers are independent, and we write $w^{(0:l)}$ for all the weights from layer 0 (the input layer) to layer $l$. Weights in the same layer share the same prior, with mean 0 and variance $\sigma^{(l)2}$. When the layer widths $M_l$ are large, the nodes follow a Gaussian distribution according to the central limit theorem. We assume all weights are independent of the nodes, and there is no bias term in any layer. Features are also independent of each other.

Targets: We derive the form of

$$\mu_{\mathrm{PVE}} = \frac{\mathbb{E}_{p(w)}[\mathrm{Var}_X(f(X; w))]}{\mathrm{Var}_y(y)}, \tag{22}$$

where we normalize $y$ to have unit variance.

Probability results: We have the following results based on the above assumptions. When $w$ is not considered a random variable:

$$\mathbb{E}[a^{(l)}_i] = \sum_j w^{(l-1)}_{j,i}\,\mathbb{E}[h^{(l-1)}_j], \quad \mathrm{Var}(a^{(l)}_i) = \sum_j w^{(l-1)2}_{j,i}\,\mathrm{Var}(h^{(l-1)}_j) + \sum_{k,\,j\neq k} w^{(l-1)}_{j,i}\,\mathrm{Cov}(h^{(l-1)}_j, h^{(l-1)}_k)\,w^{(l-1)}_{k,i}. \tag{23}$$
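Eq. 23 is the mean and variance of a linear combination of correlated random variables; a minimal Monte Carlo sketch (with hypothetical weights and hidden-node moments, purely for illustration) confirms the identity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fixed weights w and moments of the hidden nodes h
w = np.array([0.5, -1.0, 2.0])
mean = np.array([1.0, 0.0, -0.5])
cov = np.array([[1.0, 0.3, 0.0],
                [0.3, 2.0, -0.4],
                [0.0, -0.4, 0.5]])

# Monte Carlo samples of the pre-activation a = sum_j w_j h_j
h = rng.multivariate_normal(mean, cov, size=1_000_000)
a = h @ w

# Eq. 23: E[a] = sum_j w_j E[h_j];
# Var(a) = sum_j w_j^2 Var(h_j) + cross-covariance terms = w^T Cov(h) w
e_theory = w @ mean
var_theory = w @ cov @ w
```

The sample mean and variance of `a` agree with `e_theory` and `var_theory` up to Monte Carlo error.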

If a Gaussian random variable $a$ has mean $\mu_a$ and variance $\sigma^2_a$, the first two moments of $h = \mathrm{ReLU}(a)$ are:

$$\mathbb{E}[h] = \sigma_a\,\mathrm{SR}(\mu'_a), \quad \mathbb{E}[h^2] = \sigma^2_a\left[\mu'_a\,\psi(\mu'_a) + (1 + \mu'^2_a)\,\Psi(\mu'_a)\right], \tag{24}$$

where $\mu'_a = \mu_a / \sigma_a$, $\mathrm{SR}(x) = \psi(x) + x\,\Psi(x)$, and $\psi$ and $\Psi$ are the PDF and CDF of the standard Gaussian [Wu et al., 2018].
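As a sanity check, these closed-form ReLU moments can be compared against Monte Carlo estimates; a minimal sketch using SciPy for the standard-normal PDF and CDF:

```python
import numpy as np
from scipy.stats import norm

def relu_moments(mu, sigma):
    """E[h] and E[h^2] for h = ReLU(a) with a ~ N(mu, sigma^2) [Wu et al., 2018]."""
    m = mu / sigma                           # standardized mean mu'
    sr = norm.pdf(m) + m * norm.cdf(m)       # soft ReLU SR(mu')
    e_h = sigma * sr
    e_h2 = sigma**2 * (m * norm.pdf(m) + (1 + m**2) * norm.cdf(m))
    return e_h, e_h2

# Monte Carlo comparison for mu = 1, sigma = 2
rng = np.random.default_rng(0)
samples = np.maximum(rng.normal(1.0, 2.0, size=2_000_000), 0.0)
e_h, e_h2 = relu_moments(1.0, 2.0)
```

The analytic `e_h` and `e_h2` match the empirical mean of `samples` and `samples**2` up to Monte Carlo error.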

### 8.2 Proof of Theorem 1

According to Eq. 24, we have the following recursion:

$$\begin{aligned}\mathbb{E}_{p(w^{(1:l-1)})}[\mathrm{Var}_X(h^{(l)}_j)] &= \mathbb{E}\left[\mathbb{E}[h^2] - \mathbb{E}[h]^2\right]\\ &= \mathbb{E}\left[\mathrm{Var}_X(a^{(l)})\left[\mu'_{a^{(l)}}\psi(\mu'_{a^{(l)}}) + (1 + \mu'^2_{a^{(l)}})\Psi(\mu'_{a^{(l)}})\right] - \mathrm{Var}_X(a^{(l)})\,\mathrm{SR}^2(\mu'_{a^{(l)}})\right]\\ &= \mathbb{E}_{p(w^{(1:l-1)})}\left[\mathrm{Var}_X(a^{(l)})\,\beta(\mu'_{a^{(l)}})\right], \end{aligned}\tag{25}$$

where $\beta(\mu'_{a^{(l)}})$ is the variance shrinkage factor of layer $l$, i.e., how much the variance is shrunk by passing through a ReLU activation, and we write $\beta_l$ for simplicity. We first prove that $\beta_l$ is a constant for infinitely wide neural networks, and we then show empirically that this result still holds for any finite neural network.

###### Lemma 2.

The variance shrinkage factor of the $l$-th ReLU layer is a constant for any infinitely wide neural network, and it can be calculated by:

$$\beta_l = \beta(0) = \frac{\pi - 1}{2\pi},$$

where $\beta(\cdot)$ is defined in Eq. 25.

###### Proof.

We denote by $\mu_{h^{(l)}}$ and $\sigma^2_{h^{(l)}}$ the mean and variance of $h^{(l)}$. According to Eq. 23, we know that:

$$\mu_{a^{(l)}_i} = \sum_j w^{(l-1)}_{j,i}\,\mu_{h^{(l-1)}_j}, \quad \sigma^2_{a^{(l)}_i} = \sum_j w^{(l-1)2}_{j,i}\,\sigma^2_{h^{(l-1)}_j} + \sum_{k,\,j\neq k} w^{(l-1)}_{j,i}\,\mathrm{Cov}(h^{(l-1)}_j, h^{(l-1)}_k)\,w^{(l-1)}_{k,i}. \tag{26}$$

By the symmetry of the hidden nodes, the covariances $\mathrm{Cov}(h^{(l)}_j, h^{(l)}_k)$ between any two different hidden nodes in the same layer are equal. The means of different hidden nodes in layer $l$ are also equal, i.e., $\mu_{h^{(l)}_j} = \mu_{h^{(l)}}$ for all $j$.

According to the central limit theorem, both the summation $\sum_j w^{(l-1)}_{j,i}\,\mu_{h^{(l-1)}_j}$ and the covariance term $\sum_{k,\,j\neq k} w^{(l-1)}_{j,i}\,\mathrm{Cov}(h^{(l-1)}_j, h^{(l-1)}_k)\,w^{(l-1)}_{k,i}$ converge to zero when the number of hidden units goes to infinity, because the weights have zero mean. Thus Eq. 26 can be rewritten as:

$$\mu_{a^{(l)}_i} = 0, \quad \sigma^2_{a^{(l)}_i} = \sum_j w^{(l-1)2}_{j,i}\,\sigma^2_{h^{(l-1)}_j}, \tag{27}$$

which implies that

$$\mu'_{a^{(l)}} = 0. \tag{28}$$

Thus the variance shrinkage factor for any layer is

$$\beta_l = \beta(0) = \frac{\pi - 1}{2\pi},$$

which proves Lemma 2. ∎
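The constant in Lemma 2 is easy to verify numerically: for a zero-mean Gaussian pre-activation (Eq. 28), the ratio of the post-ReLU variance to the pre-activation variance should be $(\pi-1)/(2\pi) \approx 0.341$. A minimal Monte Carlo sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 3.0                                  # any scale works: beta(0) is scale-free
a = rng.normal(0.0, sigma, size=2_000_000)   # zero-mean Gaussian pre-activation
h = np.maximum(a, 0.0)                       # ReLU activation

shrinkage = h.var() / a.var()                # empirical variance shrinkage factor
beta_theory = (np.pi - 1) / (2 * np.pi)      # Lemma 2: approx 0.3408
```

The empirical `shrinkage` matches `beta_theory` up to Monte Carlo error, independently of the chosen `sigma`.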

Then Eq. 25 can be rewritten as:

$$\mathbb{E}_{p(w^{(1:l-1)})}[\mathrm{Var}_X(h^{(l)}_j)] = \beta_l\,\mathbb{E}_{p(w^{(1:l-1)})}[\mathrm{Var}_X(a^{(l)})], \tag{29}$$

according to Lemma 2. Note that although Eq. 29 theoretically holds only for infinitely wide neural networks, we find empirically (Figure 7) that it still holds for neural networks with a finite number of hidden nodes.

In prediction tasks, the final layer is linear, $f(X; w) = \sum_j w^{(L)}_j h^{(L)}_j$, so we have:

$$\mathbb{E}_{p(w)}[\mathrm{Var}_X(f(X; w))] = \mathbb{E}_{p(w^{(L)})}\Big[\sum_j w^{(L)2}_j\Big]\,\mathbb{E}_{p(w^{(1:L-1)})}[\mathrm{Var}_X(h^{(L)}_j)] = M_L\,\sigma^{(L)2}\,\mathbb{E}_{p(w^{(1:L-1)})}[\mathrm{Var}_X(h^{(L)}_j)]. \tag{30}$$

According to Eq. 23, we have the following recursive equation:

$$\begin{aligned}\beta_l\,\mathbb{E}_{p(w^{(1:l-1)})}[\mathrm{Var}_X(a^{(l)})] &= \beta_l\,\mathbb{E}_{p(w^{(1:l-1)})}\Big[\sum_j w^{(l-1)2}_{j,i}\,\mathrm{Var}_X(h^{(l-1)}_j) + \sum_{k,\,j\neq k} w^{(l-1)}_{j,i}\,\mathrm{Cov}(h^{(l-1)}_j, h^{(l-1)}_k)\,w^{(l-1)}_{k,i}\Big]\\ &= \beta_l\,\mathbb{E}_{p(w^{(l-1)})}\Big[\sum_j w^{(l-1)2}_{j,i}\Big]\,\mathbb{E}_{p(w^{(1:l-2)})}[\mathrm{Var}_X(h^{(l-1)}_j)]\\ &= \beta_l\,M_{l-1}\,\sigma^{(l-1)2}\,\mathbb{E}_{p(w^{(1:l-2)})}[\mathrm{Var}_X(h^{(l-1)}_j)]. \end{aligned}\tag{31}$$

Also, for the first layer, by assuming all the features are independent, we have:

$$\mathbb{E}_{p(w^{(0)})}[\mathrm{Var}_X(a^{(1)}_i)] = \mathbb{E}_{p(w^{(0)})}\Big[\sum_j w^{(0)2}_{j,i}\,\mathrm{Var}_X(x_j)\Big] + \mathbb{E}_{p(w^{(0)})}\Big[\sum_{k,\,j\neq k} w^{(0)}_{j,i}\,\mathrm{Cov}(x_j, x_k)\,w^{(0)}_{k,i}\Big] = \sigma^{(0)2}\sum_j \mathrm{Var}_X(x_j). \tag{32}$$

From Eqs. 29-32, we can conclude that:

$$\mu_{\mathrm{PVE}} = \sigma^{(0)2}\prod_{l=1}^{L}\beta_l\,M_l\,\sigma^{(l)2}\,\frac{\sum_{i=1}^{D}\mathrm{Var}_X(x_i)}{\mathrm{Var}_y(y)} = \alpha\,\sigma^{(0)2}\prod_{l=1}^{L}\sigma^{(l)2}M_l\,\frac{\sum_{i=1}^{D}\mathrm{Var}_X(x_i)}{\mathrm{Var}_y(y)},$$

where $\alpha = \prod_{l=1}^{L}\beta_l$. When the number of hidden nodes of each layer goes to infinity, $\alpha = \left(\frac{\pi-1}{2\pi}\right)^L$.
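The closed-form expression for $\mathbb{E}_{p(w)}[\mathrm{Var}_X(f(X;w))]$ can be checked by direct simulation. Below is a minimal sketch for the $L=1$ case (where $\alpha = \beta_1$), with unit-variance features and illustrative prior scales chosen here for the example; it compares the formula to a Monte Carlo average over prior weight draws:

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 10, 100                       # input dim and hidden width (illustrative)
s0, s1 = 0.3, 0.2                    # prior stds sigma^(0), sigma^(1) (illustrative)
beta = (np.pi - 1) / (2 * np.pi)     # variance shrinkage factor (Lemma 2)

# Closed form for L = 1 before dividing by Var(y):
# E_p(w)[Var_X f] = beta * M * sigma1^2 * sigma0^2 * sum_i Var(x_i)
theory = beta * M * s1**2 * s0**2 * D

# Monte Carlo: average Var_X(f) over draws of the weights from the prior
vals = []
for _ in range(500):
    W0 = rng.normal(0.0, s0, size=(D, M))
    w1 = rng.normal(0.0, s1, size=M)
    X = rng.normal(size=(2000, D))           # standardized (unit-variance) features
    f = np.maximum(X @ W0, 0.0) @ w1         # one-hidden-layer ReLU network
    vals.append(f.var())
mc = float(np.mean(vals))
```

`mc` agrees with `theory` up to Monte Carlo error; dividing by the variance of a target `y` would give the prior mean PVE.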

## 9 Estimating $\tilde{\alpha}$ through linear regression

We provide a simple algorithm to estimate $\tilde{\alpha}$ by solving a linear regression problem.
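One way such an estimate can be obtained (a hypothetical sketch, not the paper's exact procedure, assuming a one-hidden-layer ReLU network with standardized features): simulate $\mathbb{E}_{p(w)}[\mathrm{Var}_X(f)]$ for several prior-scale settings, and regress it through the origin on the $\alpha$-free predictor $\sigma^{(0)2}\prod_l \sigma^{(l)2}M_l \sum_i \mathrm{Var}_X(x_i)$; the fitted slope is the estimate of $\alpha$:

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, n_draws = 10, 100, 300         # illustrative dimensions and draw count

def mc_output_variance(s0, s1):
    """Monte Carlo estimate of E_p(w)[Var_X(f)] for a 1-hidden-layer ReLU net."""
    vals = []
    for _ in range(n_draws):
        W0 = rng.normal(0.0, s0, size=(D, M))
        w1 = rng.normal(0.0, s1, size=M)
        X = rng.normal(size=(2000, D))
        f = np.maximum(X @ W0, 0.0) @ w1
        vals.append(f.var())
    return np.mean(vals)

# Design points: several prior-scale settings; predictor x = s0^2 * s1^2 * M * D
settings = [(0.2, 0.2), (0.3, 0.25), (0.4, 0.3)]
x = np.array([s0**2 * s1**2 * M * D for s0, s1 in settings])
y = np.array([mc_output_variance(s0, s1) for s0, s1 in settings])

# Least squares through the origin: alpha_hat = sum(x*y) / sum(x^2)
alpha_hat = float(x @ y / (x @ x))
```

For $L = 1$ the slope should recover $\alpha = \beta = (\pi-1)/(2\pi)$ up to Monte Carlo error.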