Bayesian neural network unit priors and generalized Weibull-tail property

10/06/2021
by Mariia Vladimirova, et al. (Inria)

The connection between Bayesian neural networks and Gaussian processes has gained a lot of attention in the last few years. Hidden units are proven to follow a Gaussian process limit when the layer width tends to infinity. Recent work has suggested that finite Bayesian neural networks may outperform their infinite counterparts because they adapt their internal representations flexibly. To establish solid ground for future research on finite-width neural networks, our goal is to study the prior induced on hidden units. Our main result is an accurate description of hidden units' tails which shows that unit priors become heavier-tailed going deeper, thanks to the introduced notion of generalized Weibull-tail. This finding sheds light on the behavior of hidden units of finite Bayesian neural networks.


1 Introduction

Theoretical insights and a comprehensive theory are often the driving force behind new and improved methods. Neural networks are powerful models, but they still lack comprehensive theoretical support. The Bayesian approach applied to neural networks is considered one of the best-suited frameworks for obtaining theoretical explanations and improving the models.

Infinite-width Bayesian neural networks are well-studied. Induced priors in Bayesian neural networks with Gaussian weights are Gaussian processes when the number of hidden units per layer tends to infinity (Neal, 1996; Matthews et al., 2018; Lee et al., 2018; Garriga-Alonso et al., 2019). Stable distributions also lead to stable processes, which are generalizations of Gaussian ones (Favaro et al., 2020).

Since finite models perform better in some cases (Lee et al., 2018; Garriga-Alonso et al., 2019; Lee et al., 2020), there is a need for theoretical justifications. One way is to study the induced priors. By analyzing the priors over representations, Aitchison (2020) suggests that finite Bayesian neural networks may generalize better than their infinite counterparts because of their ability to learn representations. Another idea is to find the induced priors in function space (Wilson and Izmailov, 2020).

By bounding the moments of the distributions induced on units, Vladimirova et al. (2019) proved that hidden units have heavier-tailed upper bounds with depth and follow sub-Weibull distributions with a tail parameter increasing with the layer depth. Those bounds are optimal in the sense that they are achieved for neural networks with one hidden unit per layer; however, they are only upper bounds and do not describe the tails exactly.

Recently, Zavatone-Veth and Pehlevan (2021) and Noci et al. (2021) showed that there exists a precise description of the induced unit priors through Meijer G-functions. These are full descriptions of the priors at the unit level. The results are in accordance with the heavy-tailed nature of the units and with asymptotic expansions in the wide regime, but the approach has restrictions. First, the setting is simplified: linear or ReLU activation functions and Gaussian priors on the weights. While Wilson and Izmailov (2020) argue that vague Gaussian priors in parameter space induce useful function-space priors, in some cases heavier-tailed priors can perform better (Fortuin et al., 2021). Second, Meijer G-functions are hard to work with due to their complexity. Our goal is to obtain more general characterizations of hidden units.

We introduce a new concept for describing distributional tail properties by extending the existing Weibull-tail characterization. A random variable $X$ is called Weibull-tail (Gardes et al., 2011) with tail parameter $\theta > 0$, which is denoted by $X \sim \mathrm{WT}(\theta)$, if its cumulative distribution function $F$ satisfies

$$1 - F(x) = \exp\left(-x^{\theta} L(x)\right), \qquad (1)$$

where $L$ is a slowly-varying function, i.e. a positive function such that $L(tx)/L(x) \to 1$ as $x \to \infty$ for all $t > 0$. We note that the Weibull-tail property only considers the right tail of distributions. Here we adapt the Weibull-tail characterization to the whole real line by taking into consideration the left tail as well. Additionally, we introduce generalized Weibull-tail random variables with tail parameter $\theta$ which have Weibull-tail upper and lower bounds for both tails (Definition 2.1). Such a characterization is easily interpretable and stable under basic operations, even for dependent random variables. The family of generalized Weibull-tail distributions covers a large variety of fundamental distributions such as the Gaussian ($\theta = 2$), gamma ($\theta = 1$) and Weibull with shape parameter $k$ ($\theta = k$) distributions, to name a few, and turns out to be a key tool to describe distributional tails.
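As a quick numerical illustration (a sketch, not part of the paper), the tail parameter $\theta$ in Equation (1) can be read off as the asymptotic slope of $\log(-\log \bar F(x))$ against $\log x$. The hypothetical snippet below, assuming SciPy's exact survival functions, recovers values close to the ones listed above; convergence is slow because of the slowly-varying factor.

```python
# Hypothetical check of the Weibull-tail parameter of Eq. (1): slope of
# log(-log survival) versus log x, using exact survival functions from SciPy.
import numpy as np
from scipy import stats

def tail_parameter(sf, x_grid):
    """Least-squares slope of log(-log sf(x)) against log x."""
    y = np.log(-np.log(sf(x_grid)))
    return np.polyfit(np.log(x_grid), y, 1)[0]

x = np.linspace(5.0, 30.0, 200)                        # far enough in the right tail
print(tail_parameter(stats.norm.sf, x))                # close to 2   (Gaussian: theta = 2)
print(tail_parameter(stats.weibull_min(c=1.5).sf, x))  # 1.5 exactly  (Weibull with shape 1.5)
print(tail_parameter(stats.laplace.sf, x))             # close to 1   (Laplace: theta = 1)
```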

Contributions. We make the following contributions:

  • We introduce a new notion of tail characteristics called generalized Weibull-tail, a version of the Weibull-tail characterization on $\mathbb{R}_+$ extended to variables on $\mathbb{R}$ (Section 2). An additional advantage of this notion is its stability under basic operations such as multiplication by a constant and summation.

  • With the results on the generalized Weibull-tail characterization and on dependence, we obtain an accurate characterization of the heavy-tailed nature of hidden units in Bayesian neural networks (Section 3). We establish these results under possibly heavy-tailed priors and relatively mild assumptions on the non-linearity. The conclusions of Vladimirova et al. (2019), Zavatone-Veth and Pehlevan (2021) and Noci et al. (2021), which consider only Gaussian priors, mostly follow as corollaries of the obtained characterization. The comparison of different characterizations and related works is deferred to Sections 4 and 5.

2 Generalized Weibull-tail random variables

The study of distributional tail behavior arises in many applied probability models in different areas, such as hydrology (Strupczewski et al., 2011), finance (Rachev, 2003) and insurance risk theory (McNeil et al., 2015). Since exact distributions are not available in most cases, deriving asymptotic relationships for their tail probabilities becomes essential. In this context, an important role is played by so-called Weibull-tail distributions satisfying Equation (1) (Gardes et al., 2011; Gardes and Girard, 2016).

A majority of works focuses on the right tails of distributions and develops a theory only applicable to right tails, while it is essential to study both the right and left tails of distributions. We extend the notion of Weibull-tail from $\mathbb{R}_+$ to $\mathbb{R}$ and introduce generalized Weibull-tail random variables on $\mathbb{R}$ in the following definition. They are stable (under basic operations) extensions of Weibull-tail random variables on $\mathbb{R}_+$ (Appendix B).

Definition 2.1 (Generalized Weibull-tail on $\mathbb{R}$). A random variable $X$ is generalized Weibull-tail on $\mathbb{R}$ with tail parameter $\theta > 0$ if both its right and left tails are upper and lower bounded by Weibull-tail functions with tail parameter $\theta$: for $x$ large enough,

$$\exp\left(-x^{\theta} L_1(x)\right) \le \mathbb{P}(X \ge x) \le \exp\left(-x^{\theta} L_2(x)\right), \qquad \exp\left(-x^{\theta} L_3(x)\right) \le \mathbb{P}(X \le -x) \le \exp\left(-x^{\theta} L_4(x)\right),$$

where $L_1$, $L_2$, $L_3$ and $L_4$ are slowly-varying functions. We note $X \sim \mathrm{GWT}_{\mathbb{R}}(\theta)$.

This family includes widely used distributions such as the Gaussian ($\theta = 2$), Laplace ($\theta = 1$) and generalized Gaussian distributions. These distributions are also symmetric (around 0), so their left and right tails are equal. Thus, tail characteristics for symmetric distributions are naturally obtained by considering random variables whose absolute value is Weibull-tail on $\mathbb{R}_+$. See Appendix B for details. While Weibull-tail random variables are also generalized Weibull-tail, the opposite is not always true: take a positive function $L$ that oscillates between two distinct slowly-varying functions $L_1$ and $L_2$ without being slowly-varying itself. Then $x \mapsto \exp\left(-x^{\theta} L(x)\right)$ can still be decreasing, hence the survival function of some random variable $X$, and it is bounded between the Weibull-tail functions $\exp\left(-x^{\theta} L_1(x)\right)$ and $\exp\left(-x^{\theta} L_2(x)\right)$. Therefore, $X$ is $\mathrm{GWT}_{\mathbb{R}_+}(\theta)$ but not $\mathrm{WT}(\theta)$.

Next, we aim to obtain a tail characterization for the sum of generalized Weibull-tail variables on $\mathbb{R}$, where we allow the random variables to be dependent. Further, we show that under the following assumption of positive dependence, the sum has a tail parameter equal to the minimum among the considered ones (Theorem 2.1).

Definition 2.2 (Positive dependence condition). Random variables $X_1, \dots, X_n$ satisfy the positive dependence (PD) condition if the following inequalities hold for all $x \ge 0$ and some constant $C > 0$:

$$\mathbb{P}\left(X_1 \ge 0, \dots, X_{n-1} \ge 0 \mid X_n \ge x\right) \ge C, \qquad \mathbb{P}\left(X_1 \le 0, \dots, X_{n-1} \le 0 \mid X_n \le -x\right) \ge C.$$
Remark 2.1

The choice of the zero thresholds in the PD condition is arbitrary: one can choose any other thresholds instead, as long as the resulting conditional probabilities remain bounded away from zero. The choice of the $n$-th variable within $X_1, \dots, X_n$ is also arbitrary. Besides, if the random variables are independent with non-zero right and left tails, then they satisfy the PD condition and the constant $C$ is equal to the minimum between $\prod_{i=1}^{n-1}\mathbb{P}(X_i \ge 0)$ and $\prod_{i=1}^{n-1}\mathbb{P}(X_i \le 0)$.

There is a great variety of dependent distributions that obey this dependence property including pre-activations in Bayesian neural networks (see Lemma 3.1 for details). Note that the positive orthant dependence condition (POD, see Nelsen, 2007) implies the PD condition.

Theorem 2.1 (Sum of GWT variables)

Let $X_1 \sim \mathrm{GWT}_{\mathbb{R}}(\theta_1), \dots, X_n \sim \mathrm{GWT}_{\mathbb{R}}(\theta_n)$ with tail parameters $\theta_1, \dots, \theta_n > 0$. If $X_1, \dots, X_n$ satisfy the PD condition of Definition 2.2, then $X_1 + \dots + X_n \sim \mathrm{GWT}_{\mathbb{R}}(\theta)$ with $\theta = \min(\theta_1, \dots, \theta_n)$.

All proofs are deferred to the Appendix. The intuition behind the PD condition is that it prevents the tail of a sum from becoming lighter due to a negative connection between the terms. The simplest example of such negative dependence comes with counter-monotonicity, where $Y$ is such that $Y = -X$. Another, less trivial, example is $Y = -X\,\mathbb{1}_{\{|X| > a\}}$ for some $a > 0$: the sum $X + Y$ is a version of $X$ truncated to the compact set $[-a, a]$. In both cases, it is easy to see that the PD condition does not hold. We conclude this section with a result on the product of independent generalized Weibull-tail random variables.

Theorem 2.2 (Product of variables)

Let $X_1, \dots, X_n$ be independent symmetric $\mathrm{GWT}_{\mathbb{R}}$ random variables with tail parameters $\theta_1, \dots, \theta_n$. Then, the product $X_1 \cdots X_n \sim \mathrm{GWT}_{\mathbb{R}}(\theta)$ with $\theta$ such that $1/\theta = 1/\theta_1 + \dots + 1/\theta_n$.
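A rough Monte Carlo sketch (not from the paper) of the two results: for independent $X \sim \mathcal{N}(0,1)$, which is $\mathrm{GWT}_{\mathbb{R}}(2)$, and $Y$ standard Laplace, which is $\mathrm{GWT}_{\mathbb{R}}(1)$, Theorem 2.1 gives $X + Y \sim \mathrm{GWT}_{\mathbb{R}}(1)$ (the minimum of $2$ and $1$) and Theorem 2.2 gives $XY \sim \mathrm{GWT}_{\mathbb{R}}(2/3)$ since $1/\theta = 1/2 + 1/1$. The finite-sample slopes below are only indicative, as they are biased by the slowly-varying factors and by Monte Carlo noise in the extreme tail.

```python
# Hypothetical Monte Carlo illustration of Theorems 2.1 and 2.2 (sketch only).
import numpy as np

rng = np.random.default_rng(0)
n = 10_000_000
x_samples = rng.standard_normal(n)   # GWT(2)
y_samples = rng.laplace(size=n)      # GWT(1)

def empirical_tail_slope(samples, q_low=0.99, q_high=0.999999):
    """Slope of log(-log empirical survival of |samples|) against log threshold."""
    z = np.abs(samples)
    thresholds = np.quantile(z, np.linspace(q_low, q_high, 30))
    survival = np.array([(z >= t).mean() for t in thresholds])
    return np.polyfit(np.log(thresholds), np.log(-np.log(survival)), 1)[0]

print(empirical_tail_slope(x_samples + y_samples))  # roughly 1   = min(2, 1)
print(empirical_tail_slope(x_samples * y_samples))  # roughly 2/3 = 1/(1/2 + 1/1)
```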

The obtained results can now be applied to Bayesian neural networks, showing that hidden units are generalized Weibull-tail on $\mathbb{R}$.

3 Bayesian neural network induced priors

Neural networks are hierarchical models made of layers: an input, several hidden layers, and an output. Each layer following the input layer consists of units which are linear combinations of the previous layer's units transformed by a nonlinear function, often referred to as the nonlinearity or activation function, denoted by $\phi$. Given an input $x \in \mathbb{R}^N$, the $\ell$-th hidden layer consists of a vector $g^{(\ell)}(x)$ whose size is called the width of the layer, denoted by $H_\ell$. The coefficients of the linear combinations are called weights, denoted by $W^{(\ell)}$ for layer $\ell$. In Bayesian neural networks, weights are assigned some prior distribution (Neal, 1996). The pre-activations $g^{(\ell)}$ and post-activations $h^{(\ell)}$ of layer $\ell \ge 1$ are respectively defined as

$$g^{(\ell)}(x) = W^{(\ell)} h^{(\ell-1)}(x), \qquad h^{(\ell)}(x) = \phi\left(g^{(\ell)}(x)\right), \qquad (2)$$

where $h^{(0)}(x) = x$, so that the elements of the input vector $x$ are deterministic numerical object features.
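For concreteness, a minimal sketch of one prior draw of the recursion in Equation (2) is given below; the layer widths, the unit weight variance and the absence of biases are illustrative assumptions, not prescriptions from the paper.

```python
# Hypothetical single prior draw of the network in Eq. (2):
# g_l = W_l h_{l-1},  h_l = phi(g_l),  h_0 = x, with ReLU phi and Gaussian weights.
import numpy as np

rng = np.random.default_rng(0)
relu = lambda g: np.maximum(g, 0.0)

def sample_pre_activations(x, widths, weight_std=1.0):
    """Return the pre-activation vectors g^(1), ..., g^(L) for one weight draw."""
    h, g_list = x, []
    for width in widths:
        W = weight_std * rng.standard_normal((width, h.shape[0]))
        g = W @ h            # pre-activations of the current layer
        g_list.append(g)
        h = relu(g)          # post-activations feed the next layer
    return g_list

x = rng.standard_normal(64)   # fixed deterministic input
for ell, g in enumerate(sample_pre_activations(x, widths=[4, 4, 4, 4]), start=1):
    print(f"layer {ell}: g shape {g.shape}")
```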

The main ingredient for Theorem 2.1 is the positive dependence condition of Definition 2.2. The products of hidden units and weights satisfy this condition:

Lemma 3.1

Let $h_1, \dots, h_n$ be some possibly dependent random variables and let $W_1, \dots, W_n$ be symmetric, mutually independent and independent from $h_1, \dots, h_n$. Then the random variables $W_1 h_1, \dots, W_n h_n$ satisfy the PD condition.

Along with Theorem 2.1, the previous lemma implies that neural network hidden units are generalized Weibull-tail with tail parameter depending on those of the weights.

Theorem 3.1 (Hidden units are GWT)

Consider a Bayesian neural network as described in Equation (2) with the ReLU activation function. For every layer $\ell \ge 1$, let the $\ell$-th layer weights be independent, symmetric and generalized Weibull-tail on $\mathbb{R}$ with tail parameter $\theta_\ell$. Then, the $\ell$-th layer pre-activations are generalized Weibull-tail on $\mathbb{R}$ with tail parameter $\theta^{(\ell)}$ such that $1/\theta^{(\ell)} = \sum_{k=1}^{\ell} 1/\theta_k$.

Note that the most popular case of weight prior, iid Gaussian (Neal, 1996), corresponds to $\mathrm{GWT}_{\mathbb{R}}(2)$ weights. This leads to units of layer $\ell$ which are $\mathrm{GWT}_{\mathbb{R}}(2/\ell)$.
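Indeed, as a worked instance of Theorem 3.1 with identical Gaussian layers ($\theta_k = 2$ for all $k$),

$$\frac{1}{\theta^{(\ell)}} = \sum_{k=1}^{\ell}\frac{1}{\theta_k} = \sum_{k=1}^{\ell}\frac{1}{2} = \frac{\ell}{2}, \qquad \text{so that} \qquad \theta^{(\ell)} = \frac{2}{\ell},$$

giving $\theta^{(1)} = 2$, $\theta^{(2)} = 1$, $\theta^{(3)} = 2/3$ and $\theta^{(4)} = 1/2$ for the first four layers.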

To illustrate this theorem, we built neural networks of 4 hidden layers, with 4 hidden units on each layer. We used a fixed input, which can be thought of as a flattened image, sampled once and for all with standard Gaussian entries. In order to obtain samples from the prior distribution of the neural network units, we sampled the weights from independent centered Gaussians, from which units were obtained by forward evaluation with the ReLU non-linearity, and iterated this process a large number of times. Note that $-\log \mathbb{P}(|X| \ge x) = x^{\theta} L(x)$ for an $X \sim \mathrm{WT}(\theta)$ random variable with slowly-varying $L$, so the tail parameter can be expressed as

$$\theta = \lim_{x \to \infty} \frac{\log\left(-\log \mathbb{P}(|X| \ge x)\right)}{\log x}. \qquad (3)$$

In Figure 1, we plot $\log\left(-\log \hat{\mathbb{P}}(|g^{(\ell)}| \ge x)\right)$ as a function of $\log x$ for each hidden layer. The obtained tail parameter approximations visually correspond to the theoretical tail parameters $\theta^{(\ell)} = 2/\ell$, confirming that the units become heavier-tailed as the layer number increases.
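A sketch of this experiment is given below (the number of prior draws, the input size and the threshold grid are illustrative choices, not the exact script behind Figure 1). It records one hidden unit per layer across many independent weight draws and fits the slope of $\log(-\log)$ of the empirical survival against $\log x$, as in Equation (3). At these moderate sample sizes the estimated slopes are biased downward by the slowly-varying factors, but they decrease with depth as Theorem 3.1 predicts.

```python
# Hypothetical sketch of the Figure 1 experiment: Monte Carlo samples of hidden
# units under iid Gaussian priors, then slope estimates of
# log(-log P(|g^(l)| >= x)) versus log x, compared with theta_l = 2 / l.
import numpy as np

rng = np.random.default_rng(0)
relu = lambda g: np.maximum(g, 0.0)

n_layers, width, n_draws = 4, 4, 200_000
x_input = rng.standard_normal(64)        # fixed input, sampled once for all

units = np.zeros((n_draws, n_layers))    # first unit of each layer, per prior draw
for d in range(n_draws):
    h = x_input
    for ell in range(n_layers):
        W = rng.standard_normal((width, h.shape[0]))
        g = W @ h
        units[d, ell] = g[0]
        h = relu(g)

for ell in range(n_layers):
    z = np.abs(units[:, ell])
    thresholds = np.quantile(z, np.linspace(0.9, 0.9999, 25))
    survival = np.array([(z >= t).mean() for t in thresholds])
    slope = np.polyfit(np.log(thresholds), np.log(-np.log(survival)), 1)[0]
    print(f"layer {ell + 1}: estimated slope {slope:.2f} (theory: {2 / (ell + 1):.2f})")
```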


Figure 1: Solid lines: approximations of the tail parameters based on Equation (3), where $g^{(\ell)}$ are hidden units of layers $\ell = 1, \dots, 4$, corresponding theoretically to generalized Weibull-tail distributions with tail parameters $\theta^{(\ell)} = 2/\ell$, under the independent Gaussian weights assumption. Dashed lines: linear regressions with slope coefficients equal to the theoretical tail parameters and manually selected intercepts to approach the solid lines for visual comparison.

4 Comparison of different characterizations

4.1 Generalized Weibull-tail vs sub-Weibull

One of the commonly used techniques to study tail behavior is to consider probability tail bounds such as sub-Gaussian, sub-exponential, or their generalization to sub-Weibull distributions (Vladimirova et al., 2020; Kuchibhotla and Chakrabortty, 2018). A non-negative random variable $X$ is called sub-Weibull with tail parameter $\theta > 0$ if its survival function is upper-bounded by that of a Weibull distribution: for some constant $K > 0$ and all $x \ge 0$,

$$\mathbb{P}(X \ge x) \le \exp\left(-(x/K)^{1/\theta}\right). \qquad (4)$$

This property ensures bounds on the moments of $X$, as well as the existence of exponential moments of $X^{1/\theta}$. In contrast, the Weibull-tail property characterizes the survival or density function directly, without relying on moments. While the tail parameters in Equations (1) and (4) of the generalized Weibull-tail and sub-Weibull properties, respectively, play different roles, there exist connections between the two notions. Notice that any positive constant is a slowly-varying function and that $(x/K)^{1/\theta} = K^{-1/\theta} x^{1/\theta}$ for any $K > 0$. It means that if a random variable $X$ is sub-Weibull with parameter $\theta$, satisfying Equation (4), then the survival function of $X$ is upper-bounded by a Weibull-tail function with tail parameter $1/\theta$ and constant slowly-varying function $K^{-1/\theta}$, matching the form of Equation (1). Conversely, if a random variable $X$ is generalized Weibull-tail with tail parameter $\theta$, then, from the last item of Proposition A.1, $\mathbb{P}(X \ge x) \le \exp\left(-x^{\theta - \epsilon}\right)$ for any $\epsilon \in (0, \theta)$ and $x$ large enough, so that $X$ is sub-Weibull with any tail parameter larger than $1/\theta$, as illustrated in Figure 2.

Figure 2: Relation between sub-Weibull and generalized Weibull-tail characteristics.

It was recently shown in Vladimirova et al. (2019) that hidden units of Bayesian neural networks with iid Gaussian priors are sub-Weibull with tail parameter proportional to the hidden layer number, that is, $\theta = \ell/2$ for layer $\ell$. It means that the unit distributions of hidden layer $\ell$ can be upper-bounded by Weibull survival functions of the form $\exp\left(-(x/K)^{2/\ell}\right)$ for all $x \ge 0$. For a larger tail parameter $\theta$, the bounding Weibull distribution is heavier-tailed, but being sub-Weibull does not guarantee heaviness of the tails since it only provides an upper bound. However, this upper bound is optimal in the sense that it is achieved for neural networks with one hidden unit per layer.

From Theorem 3.1, for neural networks with independent Gaussian weights, hidden units of the $\ell$-th layer are generalized Weibull-tail with tail parameter $2/\ell$, so they have upper and lower bounds of the form $\exp\left(-x^{2/\ell} L(x)\right)$, up to constants, where $L$ is some slowly-varying function. Therefore, this proves that hidden units become heavier-tailed going deeper, for any finite number of hidden units per layer.
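Concretely, for the $\ell$-th hidden layer under iid Gaussian weight priors, the two characterizations of a unit $g^{(\ell)}$ read

$$\text{sub-Weibull:}\quad \mathbb{P}\left(|g^{(\ell)}| \ge x\right) \le \exp\left(-(x/K)^{2/\ell}\right), \qquad \text{generalized Weibull-tail:}\quad \exp\left(-x^{2/\ell} L_1(x)\right) \le \mathbb{P}\left(|g^{(\ell)}| \ge x\right) \le \exp\left(-x^{2/\ell} L_2(x)\right),$$

so the sub-Weibull property controls the tail from above only, while the generalized Weibull-tail property pins down the exponent $2/\ell$ from both sides.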

4.2 Meijer G-functions description

In Springer and Thompson (1970), it was shown that the probability density function of a product of independent normal variables can be expressed through a Meijer G-function. This resulted in an accurate description of the induced unit priors given Gaussian priors on the weights and a linear or ReLU activation function (Zavatone-Veth and Pehlevan, 2021; Noci et al., 2021). It is a full description of function-space priors, but it holds under strong assumptions (Gaussian priors on the weights and linear or ReLU activation functions) and comes with convoluted expressions. In contrast, we provide results for many weight prior distributions, including heavy-tailed ones, and our results can be extended to smooth activation functions such as PReLU, ELU and Softplus.

5 Future applications

Cold posterior effect and priors.

It was recently found empirically that Gaussian priors lead to the cold posterior effect, in which a tempered “cold” posterior, obtained by exponentiating the posterior to some power larger than one, performs better than the untempered one (Wenzel et al., 2020). The Bayesian inference performed is considered sub-optimal due to the need for cold posteriors, and the model is deemed misspecified. From that angle, Wenzel et al. (2020) suggested that Gaussian priors might not be a good choice for Bayesian neural networks. In some works, data augmentation is argued to be the main reason for this effect (Izmailov et al., 2021; Nabarro et al., 2021), as the increased amount of observed data naturally leads to higher posterior contraction (Izmailov et al., 2021). At the same time, even when accounting for data augmentation, the cold posterior effect is still present for some models. In addition, Aitchison (2021) demonstrates that the problem might originate in a wrong likelihood and that modifying only the likelihood, based on data curation, mitigates the cold posterior effect. Nabarro et al. (2021) hypothesize that using an appropriate prior incorporating knowledge of data augmentation might provide a solution. Moreover, heavy-tailed priors have been shown to mitigate the cold posterior effect (Fortuin et al., 2021). According to Theorem 3.1, heavier-tailed weight priors lead to even heavier-tailed induced priors in function space. Thus, the heavy-tail property of distributions in function space might be a highly beneficial feature. Fortuin et al. (2021) also proposed correlated priors for convolutional neural networks, since trained weights are empirically strongly correlated. Correlated priors improve overall performance but do not alleviate the cold posterior effect. Our theory can be extended to correlated weight priors. This direction is promising for further uncovering the effect of the weight prior on the function-space prior.
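As a minimal sketch of what "cold" means in practice, the log-posterior is rescaled by $1/T$ with temperature $T < 1$ before being passed to a sampler. The Gaussian prior, linear-Gaussian likelihood and temperature below are illustrative placeholders, not the setup of Wenzel et al. (2020).

```python
# Hypothetical sketch of a tempered ("cold") log-posterior: the posterior is raised
# to the power 1/T with T < 1, i.e. log-likelihood and log-prior are rescaled by 1/T.
import numpy as np

def log_prior(w, prior_std=1.0):
    return -0.5 * np.sum((w / prior_std) ** 2)      # iid Gaussian prior

def log_likelihood(w, X, y, noise_std=0.1):
    residuals = y - X @ w                           # toy linear-Gaussian likelihood
    return -0.5 * np.sum((residuals / noise_std) ** 2)

def cold_log_posterior(w, X, y, temperature=0.1):
    """Un-normalized log of posterior(w)^(1/T); T < 1 sharpens ("cools") the posterior."""
    return (log_likelihood(w, X, y) + log_prior(w)) / temperature
```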

Edge of Chaos.

An active line of research studies the propagation of deterministic inputs in neural networks (Poole et al., 2016; Schoenholz et al., 2017; Hayou et al., 2019). The main idea is to track the covariance between the pre-activations associated with two given different data points. Poole et al. (2016) and Schoenholz et al. (2017) obtained recurrence relations under the assumption of Gaussian initialization and Gaussian pre-activations. They conclude that there is a critical line, the so-called Edge of Chaos, separating signal propagation into two regions. The first one is an ordered phase, in which all inputs end up asymptotically correlated. The second is a chaotic phase, in which all inputs end up asymptotically independent. To propagate information deep into a neural network, one should choose Gaussian prior variances corresponding to the separating line. Hayou et al. (2019) show that the smoothness of the activation function also plays an important role. Since this line of work considers Gaussian priors not only on the weights but also on the pre-activations, it is closely related to the wide regime where the number of hidden units per layer tends to infinity. Given that hidden units are heavier-tailed with depth, we speculate that future research will focus on finding better approximations of the pre-activation distributions in the recurrence relations obtained for finite-width neural networks.
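A sketch of the kind of recurrence studied in these works is the mean-field map for the pre-activation variance, $q^{(\ell)} = \sigma_w^2\,\mathbb{E}\left[\phi\left(\sqrt{q^{(\ell-1)}}\,Z\right)^2\right] + \sigma_b^2$ with $Z \sim \mathcal{N}(0,1)$; the values of $\sigma_w$, $\sigma_b$, the activation and the Monte Carlo evaluation below are illustrative choices, and the analysis in the cited works relies on the Gaussian pre-activation assumption.

```python
# Hypothetical sketch of the mean-field variance recursion from the Edge of Chaos
# literature: q_l = sigma_w^2 * E[phi(sqrt(q_{l-1}) * Z)^2] + sigma_b^2, Z ~ N(0, 1).
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)     # Monte Carlo over a standard Gaussian

def variance_map(q_prev, phi, sigma_w=1.5, sigma_b=0.1):
    return sigma_w**2 * np.mean(phi(np.sqrt(q_prev) * z) ** 2) + sigma_b**2

q = 1.0
for layer in range(1, 11):
    q = variance_map(q, phi=np.tanh)
    print(f"layer {layer}: pre-activation variance q = {q:.3f}")   # converges to a fixed point
```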

6 Conclusion

We extend the theory of induced distributions in Bayesian neural networks and establish an accurate and easily interpretable characterization of hidden units' tails. The obtained results confirm the heavy-tailed nature of hidden units for a variety of weight priors.

References

  • Aitchison (2020) Laurence Aitchison. Why bigger is not always better: on finite and infinite neural networks. In International Conference on Machine Learning, 2020.
  • Aitchison (2021) Laurence Aitchison. A statistical theory of cold posteriors in deep neural networks. In International Conference on Learning Representations, 2021.
  • Bingham et al. (1989) Nicholas H Bingham, Charles M Goldie, and Jef L Teugels. Regular variation. Number 27. Cambridge University Press, 1989.
  • Favaro et al. (2020) Stefano Favaro, Sandra Fortini, and Stefano Peluchetti. Stable behaviour of infinitely wide deep neural networks. In International Conference on Artificial Intelligence and Statistics, 2020.
  • Fortuin et al. (2021) Vincent Fortuin, Adrià Garriga-Alonso, Florian Wenzel, Gunnar Rätsch, Richard Turner, Mark van der Wilk, and Laurence Aitchison. Bayesian neural network priors revisited. arXiv preprint arXiv:2102.06571, 2021.
  • Gardes and Girard (2016) Laurent Gardes and Stéphane Girard. On the estimation of the functional Weibull tail-coefficient. Journal of Multivariate Analysis, 146:29–45, 2016.
  • Gardes et al. (2011) Laurent Gardes, Stéphane Girard, and Armelle Guillou. Weibull tail-distributions revisited: a new look at some tail estimators. Journal of Statistical Planning and Inference, 141(1):429–444, 2011.
  • Garriga-Alonso et al. (2019) Adrià Garriga-Alonso, Carl Edward Rasmussen, and Laurence Aitchison. Deep convolutional networks as shallow Gaussian processes. In International Conference on Learning Representations, 2019.
  • Hayou et al. (2019) Soufiane Hayou, Arnaud Doucet, and Judith Rousseau. On the impact of the activation function on deep neural networks training. In International Conference on Machine Learning, 2019.
  • Izmailov et al. (2021) Pavel Izmailov, Sharad Vikram, Matthew D Hoffman, and Andrew Gordon Wilson. What are Bayesian neural network posteriors really like? In International Conference on Machine Learning, 2021.
  • Kuchibhotla and Chakrabortty (2018) Arun Kumar Kuchibhotla and Abhishek Chakrabortty. Moving beyond sub-Gaussianity in high-dimensional statistics: Applications in covariance estimation and linear regression. arXiv preprint arXiv:1804.02605, 2018.
  • Lee et al. (2018) Jaehoon Lee, Jascha Sohl-Dickstein, Jeffrey Pennington, Roman Novak, Sam Schoenholz, and Yasaman Bahri. Deep neural networks as Gaussian processes. In International Conference on Learning Representations, 2018.
  • Lee et al. (2020) Jaehoon Lee, Samuel Schoenholz, Jeffrey Pennington, Ben Adlam, Lechao Xiao, Roman Novak, and Jascha Sohl-Dickstein. Finite versus infinite neural networks: an empirical study. In International Conference on Neural Information Processing Systems, 2020.
  • Matthews et al. (2018) Alexander G de G Matthews, Mark Rowland, Jiri Hron, Richard E Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations, 2018.
  • McNeil et al. (2015) Alexander J McNeil, Rüdiger Frey, and Paul Embrechts. Quantitative risk management: concepts, techniques and tools. Princeton University Press, 2015.
  • Nabarro et al. (2021) Seth Nabarro, Stoil Ganev, Adrià Garriga-Alonso, Vincent Fortuin, Mark van der Wilk, and Laurence Aitchison. Data augmentation in Bayesian neural networks and the cold posterior effect. arXiv preprint arXiv:2106.05586, 2021.
  • Neal (1996) Radford M Neal. Bayesian learning for neural networks. Springer Science & Business Media, 1996.
  • Nelsen (2007) Roger B Nelsen. An introduction to copulas. Springer Science & Business Media, 2007.
  • Noci et al. (2021) Lorenzo Noci, Gregor Bachmann, Kevin Roth, Sebastian Nowozin, and Thomas Hofmann. Precise characterization of the prior predictive distribution of deep ReLU networks. arXiv preprint arXiv:2106.06615, 2021.
  • Poole et al. (2016) Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In International Conference on Neural Information Processing Systems, 2016.
  • Rachev (2003) Svetlozar Todorov Rachev. Handbook of Heavy Tailed Distributions in Finance: Handbooks in Finance, Book 1. Elsevier, 2003.
  • Schoenholz et al. (2017) Samuel S Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. In International Conference on Learning Representations, 2017.
  • Springer and Thompson (1970) Melvin Dale Springer and William E. Thompson. The distribution of products of beta, gamma and Gaussian random variables. SIAM Journal on Applied Mathematics, 18(4):721–737, 1970.
  • Strupczewski et al. (2011) Witold G Strupczewski, Krzysztof Kochanek, Iwona Markiewicz, Ewa Bogdanowicz, Stanislaw Weglarczyk, and Vijay P Singh. On the tails of distributions of annual peak flow. Hydrology Research, 42(2-3):171–192, 2011.
  • Vladimirova et al. (2019) Mariia Vladimirova, Jakob Verbeek, Pablo Mesejo, and Julyan Arbel. Understanding priors in Bayesian neural networks at the unit level. In International Conference on Machine Learning, 2019.
  • Vladimirova et al. (2020) Mariia Vladimirova, Stéphane Girard, Hien Nguyen, and Julyan Arbel. Sub-Weibull distributions: Generalizing sub-Gaussian and sub-exponential properties to heavier tailed distributions. Stat, 9(1):e318, 2020.
  • Wenzel et al. (2020) Florian Wenzel, Kevin Roth, Bastiaan Veeling, Jakub Swiatkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, and Sebastian Nowozin. How good is the Bayes posterior in deep neural networks really? In International Conference on Machine Learning, 2020.
  • Wilson and Izmailov (2020) Andrew Gordon Wilson and Pavel Izmailov. Bayesian deep learning and a probabilistic perspective of generalization. In International Conference on Neural Information Processing Systems, 2020.
  • Zavatone-Veth and Pehlevan (2021) Jacob A Zavatone-Veth and Cengiz Pehlevan. Exact priors of finite neural networks. arXiv preprint arXiv:2104.11734, 2021.

Appendix A Slowly and regularly-varying functions theory

The set of regularly-varying functions with index $\rho \in \mathbb{R}$ is denoted by $RV_{\rho}$. Note that for $\rho = 0$, the set boils down to the set of slowly-varying functions. In particular, any function $f \in RV_{\rho}$ can be written $f(x) = x^{\rho} L(x)$, where $L$ is slowly-varying.

Definition A.1 (Regularly-varying function). Let $f$ be a positive function. Then $f \in RV_{\rho}$ if $\lim_{x \to \infty} f(tx)/f(x) = t^{\rho}$ for all $t > 0$.

Proposition A.1

(Bingham et al., 1989, Proposition 1.3.6) Let $L, L_1, L_2$ be slowly-varying functions. Then:

  1. $\log L(x) / \log x \to 0$ as $x \to \infty$.

  2. $L^{\alpha}$ varies slowly for every $\alpha \in \mathbb{R}$.

  3. $L_1 + L_2$, $L_1 L_2$, and $L_1 \circ L_2$ (if $L_2(x) \to \infty$ as $x \to \infty$) vary slowly.

  4. If $R$ is a rational function with positive coefficients, $R(L(x))$ varies slowly.

  5. For any $\delta > 0$, $x^{\delta} L(x) \to \infty$ and $x^{-\delta} L(x) \to 0$ as $x \to \infty$.

Lemma A.1

If $L$ is slowly-varying, then $x \mapsto L(c x^{a})$ is slowly-varying for any constants $a, c > 0$.

Lemma A.2

If $L_1$ and $L_2$ vary slowly, so do $x \mapsto \min\left(L_1(x), L_2(x)\right)$ and $x \mapsto \max\left(L_1(x), L_2(x)\right)$.

Lemma A.3

Let $H_1$ and $H_2$ be regularly-varying functions with indices $\rho_1 > 0$ and $\rho_2 > 0$. Then, the function $H$ such that $\exp\left(-H(x)\right) = \exp\left(-H_1(x)\right) + \exp\left(-H_2(x)\right)$ is regularly-varying with index $\min(\rho_1, \rho_2)$.

Proof. Let us express the function $H$ from the statement: $H(x) = -\log\left(e^{-H_1(x)} + e^{-H_2(x)}\right)$. If $\rho_1 \neq \rho_2$, without loss of generality let us assume that $\rho_1 < \rho_2$; then

$$H(x) = H_1(x) - \log\left(1 + e^{-(H_2(x) - H_1(x))}\right).$$

Notice that $H_i(x) = x^{\rho_i} L_i(x)$ with $L_i$ slowly-varying, $i = 1, 2$. For $\rho_1 < \rho_2$, $x^{\rho_2 - \rho_1} L_2(x)/L_1(x) \to \infty$ when $x \to \infty$. The expression in the exponent is $H_2(x) - H_1(x) = x^{\rho_1} L_1(x)\left(x^{\rho_2 - \rho_1} L_2(x)/L_1(x) - 1\right) \to \infty$, so the exponential term tends to $0$. Then $H(x)/H_1(x) \to 1$ and $H \in RV_{\rho_1}$. It means that for the case $\rho_1 \neq \rho_2$, $H \in RV_{\min(\rho_1, \rho_2)}$.

Let us consider the case with equal parameters $\rho_1 = \rho_2 = \rho$; then $H_i(x) = x^{\rho} L_i(x)$, $i = 1, 2$, with slowly-varying $L_1$ and $L_2$. With $L_{\min}(x) = \min\left(L_1(x), L_2(x)\right)$, we can write

$$e^{-x^{\rho} L_{\min}(x)} \le e^{-H_1(x)} + e^{-H_2(x)} \le 2\, e^{-x^{\rho} L_{\min}(x)}.$$

Consider the logarithm of the latter expression:

$$x^{\rho} L_{\min}(x) - \log 2 \le H(x) \le x^{\rho} L_{\min}(x).$$

Since the function $L_{\min}$ is slowly-varying by Lemma A.2 and $\log 2 = o\left(x^{\rho} L_{\min}(x)\right)$, then $H \in RV_{\rho}$.

Appendix B Weibull-tail properties on ℝ₊

Let us first introduce a notion of generalized Weibull-tail random variable on $\mathbb{R}_+$ which has the additional property of stability.

Definition B.1 (Generalized Weibull-tail on $\mathbb{R}_+$). A non-negative random variable $X$ is called generalized Weibull-tail with tail parameter $\theta > 0$ if its survival function is bounded by Weibull-tail functions of tail parameter $\theta$ with possibly different slowly-varying functions $L_1$ and $L_2$: for $x$ large enough,

$$\exp\left(-x^{\theta} L_1(x)\right) \le \mathbb{P}(X \ge x) \le \exp\left(-x^{\theta} L_2(x)\right). \qquad (5)$$

We note $X \sim \mathrm{GWT}_{\mathbb{R}_+}(\theta)$.

Now we define a random variable whose right and left tails are both Weibull-tail.

Definition B.2 (Weibull-tail on $\mathbb{R}$). A random variable $X$ on $\mathbb{R}$ is Weibull-tail on $\mathbb{R}$ with tail parameter $\theta > 0$ if both its right and left tails are Weibull-tail with tail parameter $\theta$:

$$\mathbb{P}(X \ge x) = \exp\left(-x^{\theta} L_1(x)\right), \qquad \mathbb{P}(X \le -x) = \exp\left(-x^{\theta} L_2(x)\right),$$

where $L_1$ and $L_2$ are slowly-varying functions. We note $X \sim \mathrm{WT}_{\mathbb{R}}(\theta)$.

Lemma B.1
  1. If a random variable $X$ on $\mathbb{R}$ is $\mathrm{WT}_{\mathbb{R}}(\theta)$, then $|X|$ is $\mathrm{WT}(\theta)$ on $\mathbb{R}_+$.

  2. If $X$ is asymmetric but both of its tails are Weibull-tail and $|X|$ is $\mathrm{WT}(\theta)$, then one of the tails (right or left) is $\mathrm{WT}(\theta)$ and the other one is $\mathrm{WT}(\theta')$ where $\theta' \ge \theta$.

  3. For symmetric distributions, $X$ is $\mathrm{WT}_{\mathbb{R}}(\theta)$ if and only if $|X|$ is $\mathrm{WT}(\theta)$.

Proof.
  1. For $x \ge 0$, the cumulative distribution function of $|X|$ is the following:

     $$\mathbb{P}(|X| \le x) = \mathbb{P}(X \le x) - \mathbb{P}(X < -x).$$

     Then, $\mathbb{P}(|X| \ge x)$ can be expressed as a sum of the right and left tails:

     $$\mathbb{P}(|X| \ge x) = \mathbb{P}(X \ge x) + \mathbb{P}(X \le -x).$$

     If $X$ is $\mathrm{WT}_{\mathbb{R}}(\theta)$, then $|X|$ is $\mathrm{WT}(\theta)$ as a consequence of Definition B.2 and Lemma A.3.

  2. Let $|X|$ be $\mathrm{WT}(\theta)$ and let $X$ have Weibull-tail left and right tails with different tail parameters. Without loss of generality, assume that one of the tails is Weibull-tail with tail parameter $\theta_1 < \theta$. According to Lemma A.3, the sum of the two tails, that is the survival function of $|X|$, would then be Weibull-tail with tail parameter $\theta_1 < \theta$. We obtain a contradiction, so both tails must have tail parameter greater than or equal to $\theta$. If both tail parameters were greater than $\theta$, then the sum of the tails would have a tail parameter equal to the minimum among them, which is greater than $\theta$. It means that at least one tail must have tail parameter $\theta$.

  3. For symmetric distributions, $\mathbb{P}(X \ge x) = \mathbb{P}(X \le -x)$ for any $x \ge 0$, hence $\mathbb{P}(|X| \ge x) = 2\,\mathbb{P}(X \ge x)$; the factor $2$ can be absorbed into the slowly-varying function, which gives the equivalence.

Lemma B.2
  1. If a random variable $X$ on $\mathbb{R}$ is $\mathrm{GWT}_{\mathbb{R}}(\theta)$, then $|X|$ is $\mathrm{GWT}_{\mathbb{R}_+}(\theta)$.

  2. For symmetric distributions, $X$ is $\mathrm{GWT}_{\mathbb{R}}(\theta)$ if and only if $|X|$ is $\mathrm{GWT}_{\mathbb{R}_+}(\theta)$.

Proof. Similarly as in Lemma B.1, we obtain $\mathbb{P}(|X| \ge x) = \mathbb{P}(X \ge x) + \mathbb{P}(X \le -x)$ for $x \ge 0$.

  1. If $X$ is $\mathrm{GWT}_{\mathbb{R}}(\theta)$, then its right and left tails are upper and lower bounded by some Weibull-tail functions with tail parameter $\theta$. Then, the sum of the right and left tails is upper and lower bounded by the corresponding sums of these Weibull-tail functions, which are again of Weibull-tail form with tail parameter $\theta$ by Lemma A.3.

  2. For symmetric distributions we have $\mathbb{P}(X \ge x) = \mathbb{P}(X \le -x)$ for all $x \ge 0$, hence $\mathbb{P}(|X| \ge x) = 2\,\mathbb{P}(X \ge x)$ and we have the equivalence.

Lemma B.3 (Power and multiplication by a constant)

If $X \sim \mathrm{GWT}_{\mathbb{R}}(\theta)$ and the distribution of $X$ is symmetric, then $c|X|^{k} \sim \mathrm{GWT}_{\mathbb{R}_+}(\theta/k)$ for any constants $c, k > 0$.

Proof. According to Lemma B.2, $|X| \sim \mathrm{GWT}_{\mathbb{R}_+}(\theta)$. For $x$ large enough, the tail of $c|X|^{k}$ is

$$\mathbb{P}\left(c|X|^{k} \ge x\right) = \mathbb{P}\left(|X| \ge (x/c)^{1/k}\right).$$

Since $|X|$ is generalized Weibull-tail on $\mathbb{R}_+$ with tail parameter $\theta$, $\exp\left(-y^{\theta} L_1(y)\right) \le \mathbb{P}(|X| \ge y) \le \exp\left(-y^{\theta} L_2(y)\right)$, where $L_1$ and $L_2$ are slowly-varying functions, it implies

$$\exp\left(-x^{\theta/k} \tilde L_1(x)\right) \le \mathbb{P}\left(c|X|^{k} \ge x\right) \le \exp\left(-x^{\theta/k} \tilde L_2(x)\right),$$

where $\tilde L_1(x) = c^{-\theta/k} L_1\left((x/c)^{1/k}\right)$ and $\tilde L_2(x) = c^{-\theta/k} L_2\left((x/c)^{1/k}\right)$ are slowly-varying functions by Lemma A.1. This leads to the statement of the lemma.

Theorem 2.1 (Sum of GWT variables)

Let $X_1 \sim \mathrm{GWT}_{\mathbb{R}}(\theta_1), \dots, X_n \sim \mathrm{GWT}_{\mathbb{R}}(\theta_n)$ with tail parameters $\theta_1, \dots, \theta_n > 0$. If $X_1, \dots, X_n$ satisfy the PD condition of Definition 2.2, then $X_1 + \dots + X_n \sim \mathrm{GWT}_{\mathbb{R}}(\theta)$ with $\theta = \min(\theta_1, \dots, \theta_n)$.

Proof. Let us start with $n = 2$. For any random variables $X$ and $Y$, the following upper bound holds:

$$\mathbb{P}(X + Y \ge x) \le \mathbb{P}(X \ge x/2) + \mathbb{P}(Y \ge x/2).$$

The PD condition leads to a lower bound for the sum:

$$\mathbb{P}(X + Y \ge x) \ge \mathbb{P}(X \ge 0,\, Y \ge x) \ge C\,\mathbb{P}(Y \ge x),$$

where the constant $C > 0$ is the one of the PD condition. Thus, the sum survival function has the following bounds for the right tail:

$$C\,\bar F_Y(x) \le \mathbb{P}(X + Y \ge x) \le \bar F_X(x/2) + \bar F_Y(x/2),$$

where $\bar F_X$ and $\bar F_Y$ are the survival functions of $X$ and $Y$.

Let $X$ and $Y$ be generalized Weibull-tail on $\mathbb{R}$ with parameters $\theta_1$ and $\theta_2$ and, without loss of generality (Remark 2.1), let $Y$ carry the minimal parameter $\theta = \min(\theta_1, \theta_2)$. Then, for $x$ large enough,

$$\mathbb{P}(X + Y \ge x) \ge C \exp\left(-x^{\theta} L_Y(x)\right) \qquad \text{and} \qquad \mathbb{P}(X + Y \ge x) \le 2 \exp\left(-(x/2)^{\theta} L_{\min}(x/2)\right),$$

where $L_Y$ is the slowly-varying function appearing in the right-tail lower bound of the generalized Weibull-tail variable $Y$, and $L_{\min} = \min(L_1, L_2)$ is the minimum among slowly-varying functions, where $L_1$ and $L_2$ are the slowly-varying functions in the right-tail upper bounds of $X$ and $Y$. According to Lemma A.2, $L_{\min}$ is also slowly-varying. Similarly, we can get bounds for the left tail. Therefore, $X + Y$ is generalized Weibull-tail on $\mathbb{R}$ with tail parameter $\theta = \min(\theta_1, \theta_2)$.

Similarly as above, when $n > 2$, the bounds for the right tail of the sum $X_1 + \dots + X_n$, with survival function $\bar F$, are

$$C\,\bar F_{X_n}(x) \le \bar F(x) \le \sum_{i=1}^{n} \bar F_{X_i}(x/n),$$

where $\bar F_{X_i}$ is the survival function of $X_i$ and the constant $C > 0$ is the one of the PD condition. The rest of the proof is identical to the one of the case $n = 2$. The case when the distributions have only right tails (or only left tails) can be considered as a particular case of the last theorem: a sum of non-negative generalized Weibull-tail random variables is non-negative generalized Weibull-tail with tail parameter equal to the minimum among those of the terms.

Theorem 2.2 (Product of independent variables)

Let $X_1, \dots, X_n$ be independent symmetric $\mathrm{GWT}_{\mathbb{R}}$ random variables with tail parameters $\theta_1, \dots, \theta_n$. Then, the product $X_1 \cdots X_n \sim \mathrm{GWT}_{\mathbb{R}}(\theta)$ with $\theta$ such that $1/\theta = 1/\theta_1 + \dots + 1/\theta_n$.

Proof. Consider two independent symmetric generalized Weibull-tail random variables on $\mathbb{R}$ with tail parameters $\theta_1$ and $\theta_2$: $X \sim \mathrm{GWT}_{\mathbb{R}}(\theta_1)$ and $Y \sim \mathrm{GWT}_{\mathbb{R}}(\theta_2)$. From Lemma B.2 and since the random variables $X$ and $Y$ are symmetric, this is equivalent to $|X| \sim \mathrm{GWT}_{\mathbb{R}_+}(\theta_1)$ and $|Y| \sim \mathrm{GWT}_{\mathbb{R}_+}(\theta_2)$.

The product of independent symmetric distributions is symmetric since $XY \overset{d}{=} (-X)Y = -XY$. From Lemma B.2, $XY \sim \mathrm{GWT}_{\mathbb{R}}(\theta)$ if and only if $|XY| \sim \mathrm{GWT}_{\mathbb{R}_+}(\theta)$.

Our goal is to show that, for some slowly-varying functions $L_1$ and $L_2$, there exist upper and lower bounds for the survival function of $|XY|$, for $x$ large enough, as follows:

$$\exp\left(-x^{\theta} L_1(x)\right) \le \mathbb{P}(|XY| \ge x) \le \exp\left(-x^{\theta} L_2(x)\right), \qquad \text{with } \frac{1}{\theta} = \frac{1}{\theta_1} + \frac{1}{\theta_2}. \qquad (6)$$

  1. Upper bound. First, notice that from the concavity of the logarithm we have $\lambda \log a + (1 - \lambda)\log b \le \log\left(\lambda a + (1 - \lambda) b\right)$ for any $a, b > 0$ and $\lambda \in (0, 1)$. Then $\log|XY| = \log|X| + \log|Y|$. The change of variables $a = |X|^{1/\lambda}$, $b = |Y|^{1/(1-\lambda)}$ implies $|XY| \le \lambda |X|^{1/\lambda} + (1 - \lambda)|Y|^{1/(1-\lambda)}$. From the latter inequality, an upper bound of the product tail is

     $$\mathbb{P}(|XY| \ge x) \le \mathbb{P}\left(\lambda |X|^{1/\lambda} + (1 - \lambda)|Y|^{1/(1-\lambda)} \ge x\right). \qquad (7)$$

     Lemma B.3 implies that $\lambda |X|^{1/\lambda} \sim \mathrm{GWT}_{\mathbb{R}_+}(\lambda\theta_1)$ and $(1 - \lambda)|Y|^{1/(1-\lambda)} \sim \mathrm{GWT}_{\mathbb{R}_+}\left((1 - \lambda)\theta_2\right)$. Taking $\lambda = \theta_2 / (\theta_1 + \theta_2)$, so that $\lambda\theta_1 = (1 - \lambda)\theta_2 = \theta$ with $1/\theta = 1/\theta_1 + 1/\theta_2$, yields a sum of two independent non-negative generalized Weibull-tail random variables with tail parameter $\theta$ on the right-hand side of Equation (7). By Theorem 2.1, this sum is generalized Weibull-tail with the same tail parameter $\theta$. It means that there exists a slowly-varying function $L_2$ such that the tail of the product absolute value is upper-bounded by

     $$\mathbb{P}(|XY| \ge x) \le \exp\left(-x^{\theta} L_2(x)\right). \qquad (8)$$

  2. Lower bound. By the independence of $X$ and $Y$ we have, with the same $\lambda = \theta/\theta_1$ and $1 - \lambda = \theta/\theta_2$ as above,

     $$\mathbb{P}(|XY| \ge x) \ge \mathbb{P}\left(|X| \ge x^{\lambda}\right)\mathbb{P}\left(|Y| \ge x^{1-\lambda}\right).$$

     Since $|X|$ and $|Y|$ are generalized Weibull-tail on $\mathbb{R}_+$, we can define the function $L_1(x) = L_X(x^{\lambda}) + L_Y(x^{1-\lambda})$, with $L_X$ and $L_Y$ being the slowly-varying functions in the lower bounds of the generalized Weibull-tail variables $|X|$ and $|Y|$. Then, $L_1$ is slowly-varying by Lemma A.1 and we have

     $$\mathbb{P}(|XY| \ge x) \ge \exp\left(-x^{\theta} L_1(x)\right). \qquad (9)$$

Combining together Equations (8) and (9) and Definition B.1 with Lemma B.2 implies the statement of the theorem.

Appendix C Bayesian neural network properties

Proofs of Section 3.

Lemma 3.1

Let $h_1, \dots, h_n$ be some possibly dependent random variables and let $W_1, \dots, W_n$ be symmetric, mutually independent and independent from $h_1, \dots, h_n$. Then the random variables $W_1 h_1, \dots, W_n h_n$ satisfy the PD condition.

Proof. The joint probability for the right tail can be expressed as

$$\mathbb{P}\left(W_1 h_1 \ge 0, \dots, W_{n-1} h_{n-1} \ge 0,\, W_n h_n \ge x\right) = \mathbb{E}\left[\mathbb{P}\left(W_1 h_1 \ge 0, \dots, W_{n-1} h_{n-1} \ge 0,\, W_n h_n \ge x \mid h_1, \dots, h_n\right)\right]. \qquad (10)$$

Independence between $(W_1, \dots, W_n)$ and $(h_1, \dots, h_n)$ yields

$$\mathbb{P}\left(W_1 h_1 \ge 0, \dots, W_{n-1} h_{n-1} \ge 0,\, W_n h_n \ge x \mid h_1, \dots, h_n\right) = \prod_{j=1}^{n-1} \mathbb{P}\left(W_j h_j \ge 0 \mid h_j\right) \cdot \mathbb{P}\left(W_n h_n \ge x \mid h_n\right),$$

where the last equality is due to the mutual independence of the weights $W_1, \dots, W_n$. Let $j \in \{1, \dots, n-1\}$. If $h_j = 0$, the probability $\mathbb{P}(W_j h_j \ge 0 \mid h_j) = 1$. If $h_j \ne 0$, then, due to the symmetry of $W_j$, the probability $\mathbb{P}(W_j h_j \ge 0 \mid h_j) \ge 1/2$. Thus, the following lower bound holds:

$$\prod_{j=1}^{n-1} \mathbb{P}\left(W_j h_j \ge 0 \mid h_j\right) \ge \left(\frac{1}{2}\right)^{n-1}.$$

Notice that

$$\mathbb{E}\left[\mathbb{P}\left(W_n h_n \ge x \mid h_n\right)\right] = \mathbb{P}\left(W_n h_n \ge x\right).$$

Substituting the latter equations into Equation (10) leads to the lower bound

$$\mathbb{P}\left(W_1 h_1 \ge 0, \dots, W_{n-1} h_{n-1} \ge 0,\, W_n h_n \ge x\right) \ge \left(\frac{1}{2}\right)^{n-1} \mathbb{P}\left(W_n h_n \ge x\right).$$

By the conditional probability definition, we have

$$\mathbb{P}\left(W_1 h_1 \ge 0, \dots, W_{n-1} h_{n-1} \ge 0 \mid W_n h_n \ge x\right) \ge \left(\frac{1}{2}\right)^{n-1}.$$

The proof for the left tail is identical.

Theorem 3.1

Consider a Bayesian neural network as described in Equation (2) with the ReLU activation function. For every layer $\ell \ge 1$, let the $\ell$-th layer weights be independent, symmetric and generalized Weibull-tail on $\mathbb{R}$ with tail parameter $\theta_\ell$. Then, the $\ell$-th layer pre-activations are generalized Weibull-tail on $\mathbb{R}$ with tail parameter $\theta^{(\ell)}$ such that $1/\theta^{(\ell)} = \sum_{k=1}^{\ell} 1/\theta_k$.

Proof. The goal is to show that $g^{(\ell)} \sim \mathrm{GWT}_{\mathbb{R}}(\theta^{(\ell)})$ where $1/\theta^{(\ell)} = \sum_{k=1}^{\ell} 1/\theta_k$. We proceed by induction on the layer depth $\ell$.

  1. First hidden layer ($\ell = 1$).

    For the first hidden layer, the pre-activations are $g^{(1)}(x) = W^{(1)} x$: each unit is a sum of independent symmetric generalized Weibull-tail random variables with tail parameter $\theta_1$ multiplied by deterministic input features. By Lemma B.3, Remark 2.1 and Theorem 2.1, $g^{(1)} \sim \mathrm{GWT}_{\mathbb{R}}(\theta_1)$, that is, $1/\theta^{(1)} = 1/\theta_1$.