1 Introduction
Theoretical insights and the development of a comprehensive theory are often the driving force underlying the development of new and improved methods. Neural networks are powerful models, but they still lack comprehensive theoretical support. The Bayesian approach applied to neural networks is considered one of the best-suited frameworks to obtain theoretical explanations and improve the models.
Infinite-width Bayesian neural networks are well-studied. Induced priors in Bayesian neural networks with Gaussian weights are Gaussian processes when the number of hidden units per layer tends to infinity (Neal, 1996; Matthews et al., 2018; Lee et al., 2018; Garriga-Alonso et al., 2019). Stable distributions also lead to stable processes, which are generalizations of Gaussian ones (Favaro et al., 2020).
Since in some cases finite models perform better (Lee et al., 2018; Garriga-Alonso et al., 2019; Lee et al., 2020), there is a need for theoretical justifications. One way is to study induced priors. By analyzing the priors over representations, Aitchison (2020) suggests that finite Bayesian neural networks may generalize better than their infinite counterparts because of their ability to learn representations. Another idea is to find the induced priors in the functional space (Wilson and Izmailov, 2020).
By bounding moments of distributions induced on units, Vladimirova et al. (2019) proved that hidden units have heavier-tailed upper bounds and follow sub-Weibull distributions with an increasing tail parameter depending on the layer depth. Those bounds are optimal as they are achieved for shallow Bayesian neural networks; however, being upper bounds only, they do not describe the tails accurately. Recently, Zavatone-Veth and Pehlevan (2021); Noci et al. (2021) showed that there exists a precise description of induced unit priors through Meijer G-functions. These are full descriptions of priors at the unit level. The results are in accordance with the heavy-tailed nature and asymptotic expansions in a wide regime, but they have restrictions. First, the setting is simplified: linear or ReLU activation functions and Gaussian priors on weights. While Wilson and Izmailov (2020) argue that vague Gaussian priors in the parameter space induce useful function-space priors, in some cases, heavier-tailed priors can perform better (Fortuin et al., 2021). Second, it is hard to work with Meijer G-functions due to their complexity. Our goal is to obtain more general characterizations for hidden units.
We introduce a new concept for describing distributional properties of tails by extending the existing Weibull-tail characterization. A random variable $X$ is called Weibull-tail (Gardes et al., 2011) with tail parameter $\beta > 0$, which is denoted by $X \sim \mathrm{WT}(\beta)$, if its cumulative distribution function $F$ satisfies
$$\bar F(x) = 1 - F(x) = e^{-x^{\beta} l(x)}, \tag{1}$$
where $l$ is a slowly-varying function, i.e. it is a positive function such that $\lim_{x \to \infty} l(tx)/l(x) = 1$ for all $t > 0$. We note that the Weibull-tail property only considers the right tail of distributions. Here we adapt the Weibull-tail characterization to the whole space by taking into consideration the left tail as well. Additionally, we introduce generalized Weibull-tail random variables with tail parameter $\beta$, which have Weibull-tail upper and lower bounds for both tails (Definition 2). Such a characterization is easily interpretable and stable under basic operations, even for dependent random variables. The family of generalized Weibull-tail distributions covers a large variety of fundamental distributions such as Gaussian ($\beta = 2$), gamma ($\beta = 1$), Weibull ($\beta > 0$), to name a few, and turns out to be a key tool to describe distributional tails.
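As a small numerical illustration of the tail parameter, the sketch below assumes that Equation (1) has the form $\bar F(x) = e^{-x^{\beta} l(x)}$, so that the slope of $\log(-\log \bar F(x))$ against $\log x$ approaches $\beta$. The helper `local_tail_exponent` and the three survival functions are hypothetical names introduced only for this example; the two-point slope is a finite-$x$ approximation, not an exact value of $\beta$.

```python
import math


def local_tail_exponent(survival, x):
    """Two-point slope of log(-log S) against log x, taken between x and 2x."""
    lo = -math.log(survival(x))
    hi = -math.log(survival(2.0 * x))
    return math.log(hi / lo) / math.log(2.0)


def gaussian_sf(x):
    # Survival function of a standard Gaussian.
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def exponential_sf(x):
    # Survival function of a gamma variable with shape 1 and rate 1.
    return math.exp(-x)


def weibull_sf(x, shape=0.5):
    # Survival function of a Weibull variable with unit scale.
    return math.exp(-(x ** shape))


for name, sf in [("gaussian", gaussian_sf),
                 ("exponential", exponential_sf),
                 ("weibull(0.5)", weibull_sf)]:
    print(name, round(local_tail_exponent(sf, 10.0), 3))
```

For the exponential and Weibull cases the slowly-varying function is constant, so the slope matches the tail parameter up to floating-point error; for the Gaussian the slope only approaches 2 as $x$ grows.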
Contributions. We make the following contributions:

We introduce a new notion of tail characteristics called generalized Weibull-tail, a version of Weibull-tail characteristics on $\mathbb{R}_+$ extended to variables on $\mathbb{R}$ (Section 2). The additional advantage of this notion is stability under basic operations such as multiplication by a constant and summation.

With the results on generalized Weibull-tail characterization and dependence, we obtain an accurate characterization of the heavy-tailed nature of hidden units in Bayesian neural networks (Section 3). We establish these results under possibly heavy-tailed priors and relatively mild assumptions on the nonlinearity. The conclusions of Vladimirova et al. (2019); Zavatone-Veth and Pehlevan (2021); Noci et al. (2021), which consider only Gaussian priors, mostly follow as corollaries of the obtained characterization. The comparison of different characterizations and related works is deferred to Sections 4 and 5.
2 Generalized Weibull-tail random variables
The study of the distributional tail behavior arises in many applied probability models of different areas, such as hydrology (Strupczewski et al., 2011), finance (Rachev, 2003) and insurance risk theory (McNeil et al., 2015). Since exact distributions are not available in most cases, deriving asymptotic relationships for their tail probabilities becomes essential. In this context, an important role is played by so-called Weibull-tail distributions satisfying Equation (1) (Gardes et al., 2011; Gardes and Girard, 2016).
The majority of works focus on right tails of distributions and develop a theory only applicable to right tails, while it is essential to study both right and left tails of distributions. We extend the notion of Weibull-tail on $\mathbb{R}_+$ to $\mathbb{R}$ and introduce generalized Weibull-tail random variables on $\mathbb{R}$ in the following definition. They are stable (under basic operations) extensions of Weibull-tail random variables on $\mathbb{R}$ (Appendix B).
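[Generalized Weibull-tail on $\mathbb{R}$] A random variable $X$ is generalized Weibull-tail on $\mathbb{R}$ with tail parameter $\beta > 0$ if both its right and left tails are upper and lower bounded by some Weibull-tail functions with tail parameter $\beta$:
$$e^{-x^{\beta} l_1^{R}(x)} \le \bar F_X(x) \le e^{-x^{\beta} l_2^{R}(x)} \quad \text{for } x > 0 \text{ large enough},$$
$$e^{-|x|^{\beta} l_1^{L}(|x|)} \le F_X(x) \le e^{-|x|^{\beta} l_2^{L}(|x|)} \quad \text{for } x < 0 \text{ with } |x| \text{ large enough},$$
where $l_1^{R}$, $l_2^{R}$, $l_1^{L}$ and $l_2^{L}$ are slowly-varying functions. We note $X \sim \mathrm{GWT}_{\mathbb{R}}(\beta)$.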
This family includes widely used distributions such as Gaussian ($\beta = 2$), Laplace ($\beta = 1$) and generalized Gaussian distributions. These distributions are also symmetric (around 0), so their left and right tails are equal. Thus, tail characteristics for symmetric distributions can naturally be obtained by considering random variables whose absolute value is Weibull-tail on $\mathbb{R}_+$. See Appendix B for details. While Weibull-tail random variables are also generalized Weibull-tail, the opposite is not always true. Consider slowly-varying functions and . Then function satisfies but is not slowly varying. However, for any , function is the survival function (it is decreasing) of some random variable and it satisfies . Therefore, is but not .
Next, we aim to obtain a tail characterization for the sum of generalized Weibull-tail variables on $\mathbb{R}$, where we allow random variables to be dependent. Further, we show that under the following assumption of positive dependence, the sum has a tail parameter equal to the minimum among the considered ones (Theorem 2.1).
[Positive dependence condition] Random variables satisfy the positive dependence (PD) condition if the following inequalities hold for all and some constant :
Remark 2.1
The choice of zeros and in the PD condition is arbitrary: one can choose any instead such that . The choice of the th variable within is also arbitrary. Besides, if random variables are independent with nonzero right and left tails, then they satisfy the PD condition and the constant is equal to the minimum between and .
There is a great variety of dependent distributions that obey this dependence property, including pre-activations in Bayesian neural networks (see Lemma 3.1 for details). Note that the positive orthant dependence condition (POD, see Nelsen, 2007) implies the PD condition.
Theorem 2.1 (Sum of GWT variables)
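Let $X_1, \dots, X_N \sim \mathrm{GWT}_{\mathbb{R}}$ with tail parameters $\beta_1, \dots, \beta_N$. If $X_1, \dots, X_N$ satisfy the PD condition of Definition 2, then $X_1 + \dots + X_N \sim \mathrm{GWT}_{\mathbb{R}}(\beta)$, with $\beta = \min\{\beta_1, \dots, \beta_N\}$.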
All proofs are deferred to the Appendix. The intuition of the PD condition is to prevent the tail of a sum from becoming lighter due to negative dependence. The simplest example of such negative dependence comes with counter-monotonicity where is such that . Another less trivial example is for some : the sum is a version of truncated to the compact set . In both cases, it is easy to see that the PD condition does not hold. We conclude this section with a result on the product of independent generalized Weibull-tail random variables.
Theorem 2.2 (Product of variables)
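Let $X_1, \dots, X_N$ be independent symmetric $\mathrm{GWT}_{\mathbb{R}}$ with tail parameters $\beta_1, \dots, \beta_N$. Then, the product $X_1 \cdots X_N \sim \mathrm{GWT}_{\mathbb{R}}(\beta)$ with $\beta$ such that $\frac{1}{\beta} = \sum_{i=1}^{N} \frac{1}{\beta_i}$.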
Now the obtained results can be applied to Bayesian neural networks, showing that hidden units are generalized Weibull-tail on $\mathbb{R}$.
3 Bayesian neural networks induced priors
Neural networks are hierarchical models made of layers: an input, several hidden layers, and an output. Each layer following the input layer consists of units which are linear combinations of previous layer units transformed by a nonlinear function, often referred to as the nonlinearity or activation function and denoted by $\phi$. Given an input $x \in \mathbb{R}^{H_0}$, the $\ell$-th hidden layer consists of a vector whose size is called the width of the layer, denoted by $H_\ell$. The coefficients of linear combinations are called weights, denoted by $w_{ij}^{(\ell)}$ for $1 \le i \le H_{\ell-1}$, $1 \le j \le H_\ell$. In Bayesian neural networks, weights are assigned some prior distribution (Neal, 1996). The pre-activations and post-activations of layer $\ell$ are respectively defined as
$$g_j^{(\ell)}(x) = \sum_{i=1}^{H_{\ell-1}} w_{ij}^{(\ell)} h_i^{(\ell-1)}(x), \qquad h_j^{(\ell)}(x) = \phi\big(g_j^{(\ell)}(x)\big), \tag{2}$$
where $h_i^{(0)}(x) = x_i$ are elements of the input vector $x$, so they are deterministic numerical object features.
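The recursion described in this paragraph can be sketched in a few lines. This is a minimal illustration, assuming that each pre-activation is a weighted sum of the previous layer's post-activations and that the input plays the role of the zeroth layer; the function name `forward` and the nested-list weight layout `W[j][i]` are hypothetical choices made for this example.

```python
def relu(z):
    return max(0.0, z)


def forward(x, weights, phi=relu):
    """Return the pre-activations of every layer for input vector x.

    weights[l][j][i] is the weight linking unit i of the previous layer
    to unit j of layer l + 1 (layout chosen for this sketch only).
    """
    h = list(x)  # the input acts as the zeroth layer of post-activations
    preactivations = []
    for W in weights:
        g = [sum(w_ji * h_i for w_ji, h_i in zip(row, h)) for row in W]
        preactivations.append(g)
        h = [phi(g_j) for g_j in g]
    return preactivations


# Two inputs, a first layer of width 2 and a second layer of width 1.
example_weights = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 3.0]],
]
print(forward([1.0, -2.0], example_weights))
```

In a Bayesian neural network the entries of `weights` would be drawn from the prior rather than fixed, which turns every returned pre-activation into a random variable.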
The main ingredient for Theorem 2.1 is the positive dependence condition of Definition 2. The product of hidden units and weights satisfies the positive dependence condition:
Lemma 3.1
Let $X_1, \dots, X_N$ be some possibly dependent random variables and $W_1, \dots, W_N$ be symmetric, mutually independent and independent from $X_1, \dots, X_N$; then random variables $W_1 X_1, \dots, W_N X_N$ satisfy the PD condition.
Along with Theorem 2.1, the previous lemma implies that neural network hidden units are generalized Weibull-tail with tail parameter depending on those of the weights.
Theorem 3.1 (Hidden units are GWT)
Consider a Bayesian neural network as described in Equation (2) with ReLU activation function. Let $\ell$-th layer weights be independent symmetric generalized Weibull-tail on $\mathbb{R}$ with tail parameter $\beta_w^{(\ell)}$. Then, $\ell$-th layer pre-activations are generalized Weibull-tail on $\mathbb{R}$ with tail parameter $\beta^{(\ell)}$ such that $\frac{1}{\beta^{(\ell)}} = \sum_{k=1}^{\ell} \frac{1}{\beta_w^{(k)}}$.
Note that the most popular case of weight prior, iid Gaussian (Neal, 1996), corresponds to $\mathrm{GWT}_{\mathbb{R}}(2)$ weights. This leads to units of layer $\ell$ which are $\mathrm{GWT}_{\mathbb{R}}(2/\ell)$.
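To make the depth dependence concrete, the sketch below computes unit tail parameters layer by layer. It assumes that the relation in Theorem 3.1 is the reciprocal-sum rule $1/\beta^{(\ell)} = \sum_{k \le \ell} 1/\beta_w^{(k)}$ and that Gaussian and Laplace weights have tail parameters 2 and 1 respectively; the function name `unit_tail_parameters` is hypothetical.

```python
from fractions import Fraction


def unit_tail_parameters(weight_tail_parameters):
    """Tail parameter of the pre-activations of each layer.

    Assumes the reciprocal-sum rule: the inverse tail parameter of layer l
    is the sum of the inverse weight tail parameters of layers 1..l.
    """
    inverse_sum = Fraction(0)
    result = []
    for beta_w in weight_tail_parameters:
        inverse_sum += 1 / Fraction(beta_w)
        result.append(1 / inverse_sum)
    return result


# Gaussian weights (tail parameter 2) on four layers.
print(unit_tail_parameters([2, 2, 2, 2]))
# Laplace weights (tail parameter 1) on the first layer, Gaussian on the second.
print(unit_tail_parameters([1, 2]))
```

Under this rule the tail parameter decreases with depth whatever the weight priors are, which is the sense in which deeper units are heavier-tailed.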
To illustrate this theorem, we have built neural networks of 4 hidden layers, with 4 hidden units on each layer. We used a fixed input of size , which can be thought of as an image of dimension . This input was sampled once and for all with standard Gaussian entries. In order to obtain samples from the prior distribution of the neural network units, we have sampled the weights from independent centered Gaussians, from which units were obtained by forward evaluation with the ReLU nonlinearity. This process was iterated times. Note that $\log(-\log \bar F(x)) = \beta \log x + \log l(x)$ for a $\mathrm{WT}(\beta)$ random variable, so the tail parameter can be expressed as:
$$\beta = \lim_{x \to \infty} \frac{\log(-\log \bar F(x))}{\log x}. \tag{3}$$
In Figure 1, we plot as a function of . We see that the obtained tail parameters approximations are increasing for the increasing layer number and visually correspond to the theoretical tail parameter.
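The experiment can be imitated at a small scale with the standard library only. The sketch below is not the authors' code: the input (four entries equal to one), the weight variance (one over the fan-in), the number of draws and the quantile levels are all assumptions chosen for illustration. It estimates a two-point slope of $\log(-\log \hat S)$ against $\log t$ from empirical quantiles of the absolute pre-activations, a finite-sample proxy for the tail parameter.

```python
import math
import random


def sample_first_units(x, widths, rng):
    """One prior draw; returns the first pre-activation of every hidden layer."""
    h = list(x)
    firsts = []
    for width in widths:
        std = math.sqrt(1.0 / len(h))  # assumption: weight variance 1 / fan-in
        g = [sum(rng.gauss(0.0, std) * h_i for h_i in h) for _ in range(width)]
        firsts.append(g[0])
        h = [max(0.0, g_j) for g_j in g]  # ReLU nonlinearity
    return firsts


def quantile_slope(samples, p_hi=0.1, p_lo=0.001):
    """Slope of log(-log S) against log t between two empirical quantiles of |samples|."""
    a = sorted(abs(s) for s in samples)
    n = len(a)
    t1 = a[n - 1 - round(p_hi * n)]  # about a fraction p_hi of samples exceed t1
    t2 = a[n - 1 - round(p_lo * n)]  # about a fraction p_lo of samples exceed t2
    return math.log(math.log(p_lo) / math.log(p_hi)) / math.log(t2 / t1)


rng = random.Random(0)
x = [1.0, 1.0, 1.0, 1.0]   # hypothetical fixed input
widths = [4, 4, 4, 4]      # 4 hidden layers with 4 units each
draws = [sample_first_units(x, widths, rng) for _ in range(20000)]
slopes = [quantile_slope([d[l] for d in draws]) for l in range(len(widths))]
for layer, slope in enumerate(slopes, start=1):
    print(layer, round(slope, 2))
```

With these settings the first-layer pre-activation is exactly standard Gaussian, so its estimated slope serves as a reference point; the slopes at deeper layers come out smaller, reflecting heavier tails. At such moderate quantile levels the estimates remain far from their asymptotic values.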
4 Comparison of different characterizations
4.1 Generalized Weibull-tail vs sub-Weibull
A commonly used technique to study tail behavior is to consider probability tail bounds such as sub-Gaussian, sub-exponential, or their generalization to sub-Weibull distributions (Vladimirova et al., 2020; Kuchibhotla and Chakrabortty, 2018). A non-negative random variable $X$ is called sub-Weibull with tail parameter $\theta > 0$ if its survival function is upper-bounded by that of a Weibull distribution:
$$\bar F(x) = P(X \ge x) \le \exp\big(-(x/K)^{1/\theta}\big) \quad \text{for all } x \ge 0 \text{ and some } K > 0. \tag{4}$$
This property ensures the existence of the moment generating function as well as bounds on moments. In contrast, the Weibull-tail property characterizes the survival or density functions without giving a handle on moments. While tail parameters in Equations (1) and (4) of generalized Weibull-tail and sub-Weibull properties respectively are different, there exist connections. Notice that for any constants , function is slowly-varying for large enough and . It means that if a random variable is sub-Weibull with parameter , satisfying Equation (4), then the survival function of is upper-bounded by a Weibull-tail function with tail parameter and slowly-varying function , satisfying Equation (1). If random variable is generalized Weibull-tail with tail parameter , then from the last item of Proposition A.1, for we have or and for large enough and such that , as illustrated on Figure 2.
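As a worked illustration of the first connection, assume that Equation (4) has the usual sub-Weibull form $P(X \ge x) \le \exp(-(x/K)^{1/\theta})$ with a scale constant $K > 0$ (a symbol introduced here for the example), and that Equation (1) has the form $\bar F(x) = \exp(-x^{\beta} l(x))$. Under these assumed forms, the sub-Weibull bound can be rewritten as a Weibull-tail function:

```latex
\begin{align*}
P(X \ge x)
  &\le \exp\!\big(-(x/K)^{1/\theta}\big) \\
  &= \exp\!\big(-x^{1/\theta}\, K^{-1/\theta}\big) \\
  &= \exp\!\big(-x^{\beta}\, l(x)\big),
  \qquad \beta = 1/\theta, \quad l(x) \equiv K^{-1/\theta}.
\end{align*}
% A positive constant is slowly varying, since l(tx)/l(x) = 1 for all t > 0.
```

In this reading, a sub-Weibull parameter $\theta$ corresponds to a Weibull-tail upper bound with tail parameter $1/\theta$; for instance $\theta = \ell/2$ corresponds to $2/\ell$.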
It was recently shown in Vladimirova et al. (2019) that hidden units of Bayesian neural networks with iid Gaussian priors are sub-Weibull with tail parameter proportional to the hidden layer number, that is $\theta = \ell/2$. It means that the unit distributions of hidden layer $\ell$ can be upper-bounded by some Weibull distributions for all . For larger tail parameter $\theta$, the Weibull distribution is heavier-tailed, but being sub-Weibull does not guarantee the heaviness of the tails. However, this upper bound is optimal in the sense that it is achieved for neural networks with one hidden unit per layer.
From Theorem 3.1, for neural networks with independent Gaussian weights, hidden units of the $\ell$-th layer are generalized Weibull-tail with tail parameter $2/\ell$, so they have upper and lower bounds of the form $e^{-x^{2/\ell} l(x)}$ up to a constant, where $l$ is some slowly-varying function. Therefore, it proves that hidden units become heavier-tailed with depth for any finite number of hidden units per layer.
4.2 Meijer G-functions description
In Springer and Thompson (1970) it was shown that the probability density function of the product of independent normal variables could be expressed through a Meijer G-function. It resulted in an accurate description of induced unit priors given Gaussian priors on weights and linear or ReLU activation function (Zavatone-Veth and Pehlevan, 2021; Noci et al., 2021). It is a full description of function-space priors but under strong assumptions, requiring Gaussian priors on weights and linear or ReLU activation functions, and with convoluted expressions. In contrast, we provide results for many distributions, including heavy-tailed ones, and our results can be extended to smooth activation functions, such as PReLU, ELU, Softplus.
5 Future applications
Cold posterior effect and priors.
It was recently empirically found that Gaussian priors lead to the cold posterior effect, in which a tempered “cold” posterior, obtained by raising the posterior to a power much greater than one, performs better than an untempered one (Wenzel et al., 2020). The performed Bayesian inference is considered sub-optimal due to the need for cold posteriors, and the model is deemed misspecified. From that angle, Wenzel et al. (2020) suggested that Gaussian priors might not be a good choice for Bayesian neural networks. In some works, data augmentation is argued to be the main reason for this effect (Izmailov et al., 2021; Nabarro et al., 2021) as the increased amount of observed data naturally leads to higher posterior contraction (Izmailov et al., 2021). At the same time, even when data augmentation is taken into account for some models, the cold posterior effect is still present. In addition, Aitchison (2021) demonstrates that the problem might originate in the wrong likelihood of the models and that modifying only the likelihood based on data curation mitigates the cold posterior effect. Nabarro et al. (2021) hypothesize that using an appropriate prior incorporating knowledge of data augmentation might provide a solution. Moreover, heavy-tailed priors have been shown to mitigate the cold posterior effect (Fortuin et al., 2021). According to Theorem 3.1, heavier-tailed priors lead to even heavier-tailed induced priors in function-space. Thus, the heavy-tail property of distributions in function-space might be a highly beneficial feature. Fortuin et al. (2021) also proposed correlated priors for convolutional neural networks since trained weights are empirically strongly correlated. Correlated priors improve overall performance but do not alleviate the cold posterior effect. Our theory can be extended to correlated weight priors. This direction is promising for further uncovering the effect of weight prior on function-space prior.
Edge of Chaos.
An active line of research studies the propagation of deterministic inputs in neural networks (Poole et al., 2016; Schoenholz et al., 2017; Hayou et al., 2019). The main idea is to explore the covariance between pre-activations for two given different data points. Poole et al. (2016) and Schoenholz et al. (2017) obtained recurrence relations under the assumption of Gaussian initialization and Gaussian pre-activations. They conclude that there is a critical line, the so-called Edge of Chaos, separating signal propagation into two regions. The first one is an ordered phase in which all inputs end up asymptotically correlated. The second is a chaotic phase in which all inputs end up asymptotically independent. To propagate the information deeper in a neural network, one should choose Gaussian prior variances corresponding to the separating line. Hayou et al. (2019) show that the smoothness of the activation function also plays an important role. Since this line of work considers Gaussian priors not only on the weights but also on the pre-activations, it is closely related to a wide regime where the number of hidden units per layer tends to infinity. Given that hidden units are heavier-tailed with depth, we speculate that future research will focus on finding better approximations of the pre-activation functions in recurrence relations obtained for finite-width neural networks.
6 Conclusion
We extend the theory on induced distributions in Bayesian neural networks and establish an accurate and easily interpretable characterization of hidden units' tails. The obtained results confirm the heavy-tailed nature of hidden units for different weight priors.
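Appendix A Slowly- and regularly-varying functions theory
The set of regularly-varying functions with index $\rho$ is denoted by $\mathcal{R}_\rho$. Note that for $\rho = 0$, the set $\mathcal{R}_0$ boils down to the set of slowly-varying functions. In particular, any function $f \in \mathcal{R}_\rho$ can be written $f(x) = x^{\rho} l(x)$, where $l$ is slowly-varying.
[Regularly-varying function] Let $f$ be a positive function. Then $f \in \mathcal{R}_\rho$ if $\lim_{x \to \infty} f(tx)/f(x) = t^{\rho}$ for all $t > 0$.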
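The limit in this definition can be checked numerically on simple cases. The sketch below assumes the standard definition, namely that $f(tx)/f(x)$ tends to $t^{\rho}$, with $\rho = 0$ for slowly-varying functions; the helper `scaling_ratio` is a hypothetical name, and the evaluation point is only a finite stand-in for the limit.

```python
import math


def scaling_ratio(f, t, x):
    """Finite-x proxy for lim f(t * x) / f(x) as x grows."""
    return f(t * x) / f(x)


def slowly_varying(x):
    # The logarithm is a classical slowly-varying function (index 0).
    return math.log(x)


def regularly_varying(x, rho=0.5):
    # x^rho * log(x) is regularly varying with index rho.
    return (x ** rho) * math.log(x)


x_large = 1e300
print(scaling_ratio(slowly_varying, 2.0, x_large))     # close to 1
print(scaling_ratio(regularly_varying, 2.0, x_large))  # close to 2 ** 0.5
```

The convergence is only logarithmic in $x$, which is why a very large evaluation point is used here.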
Proposition A.1
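(Bingham et al., 1989, Proposition 1.3.6) Let $l, l_1, \dots, l_k$ be slowly-varying functions. Then:
(i) $\log l(x) / \log x \to 0$ as $x \to \infty$.
(ii) $(l(x))^{\alpha}$ varies slowly for every $\alpha \in \mathbb{R}$.
(iii) $l_1(x)\, l_2(x)$, $l_1(x) + l_2(x)$, and (if $l_2(x) \to \infty$ as $x \to \infty$) $l_1(l_2(x))$ vary slowly.
(iv) If $r(x_1, \dots, x_k)$ is a rational function with positive coefficients, $r(l_1(x), \dots, l_k(x))$ varies slowly.
(v) For any $\alpha > 0$, $x^{\alpha} l(x) \to \infty$ and $x^{-\alpha} l(x) \to 0$ as $x \to \infty$.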
Lemma A.1
If $l$ is slowly-varying, then $x \mapsto l(x^{\alpha})$ is slowly-varying for $\alpha > 0$.
Lemma A.2
If $l_1, l_2$ vary slowly, so do $\min(l_1, l_2)$ and $\max(l_1, l_2)$.
Lemma A.3
Let and be regularly-varying functions with parameters and . Then, the function such that , is regularly-varying with parameter .
Let us express function from the statement: . If , without loss of generality, let us assume that , then
Notice that . For , when . The expression in the exponent , so the exponent . Then and . It means that for the case , .
Let us consider the case with equal parameters , then , with slowly-varying and . With , we can write
Consider the logarithm of the latter expression
Since the function is slowly-varying by Lemma A.2 and , then .
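Appendix B Weibull-tail properties on $\mathbb{R}$
Let us firstly introduce a notion of generalized Weibull-tail random variable which has an additional property of stability:
[Generalized Weibull-tail on $\mathbb{R}_+$] A random variable $X$ is called generalized Weibull-tail with tail parameter $\beta > 0$ if its survival function $\bar F_X$ is bounded by Weibull-tail functions of tail parameter $\beta$ with possibly different slowly-varying functions $l_1$ and $l_2$:
$$e^{-x^{\beta} l_1(x)} \le \bar F_X(x) \le e^{-x^{\beta} l_2(x)} \quad \text{for } x > 0 \text{ large enough}. \tag{5}$$
We note $X \sim \mathrm{GWT}_{\mathbb{R}_+}(\beta)$.
Now we define a random variable whose both right and left tails are Weibull-tail on $\mathbb{R}$.
[Weibull-tail on $\mathbb{R}$] A random variable $X$ on $\mathbb{R}$ is Weibull-tail on $\mathbb{R}$ with tail parameter $\beta > 0$ if both its right and left tails are Weibull-tail with tail parameter $\beta$:
$$\bar F_X(x) = e^{-x^{\beta} l^{R}(x)} \quad \text{for } x > 0, \qquad F_X(x) = e^{-|x|^{\beta} l^{L}(|x|)} \quad \text{for } x < 0,$$
where $l^{R}$ and $l^{L}$ are slowly-varying functions. We note $X \sim \mathrm{WT}_{\mathbb{R}}(\beta)$.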
Lemma B.1

If random variable on is , then is .

If is asymmetric but both tails are Weibull-tail and is , then one of the tails (right or left) is and the other one is where .

For symmetric distributions, is if and only if is .

Let be and has Weibull left and right tails with different tail parameters. Without loss of generality, assume that is Weibull-tail with tail parameter . According to Lemma A.3, the sum survival function will be Weibull-tail with tail parameter . We obtained a contradiction and must have tail parameter greater or equal . If both tail parameters are greater than , then the sum of the tails has tail parameter equal to the minimum tail parameter among them, which is greater than . It means that at least one tail must have tail parameter .

For symmetric distributions for any , then and we have the equality.
Lemma B.2

If a random variable is , then is .

For symmetric distributions, is if and only if is .
Similarly as in Lemma B.1, we obtain for .

If is , then right and left tails are upper and lower-bounded by some Weibull-tail functions. Then, the sum of the right and left tails is upper and lower-bounded by these Weibull-tail functions and .

For symmetric distributions we have for all , then and we have the equality.
Lemma B.3 (Power and multiplication by a constant)
If and the distribution of is symmetric, then for .
According to Lemma B.1, . For , the tail of is
Since is generalized Weibull-tail on with tail parameter , , where and are slowly-varying functions, it implies
where , are slowly-varying functions by Lemma A.1. It leads to the statement of the lemma.
Theorem 2.1 (Sum of GWT variables)
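Let $X_1, \dots, X_N \sim \mathrm{GWT}_{\mathbb{R}}$ with tail parameters $\beta_1, \dots, \beta_N$. If $X_1, \dots, X_N$ satisfy the PD condition of Definition 2, then $X_1 + \dots + X_N \sim \mathrm{GWT}_{\mathbb{R}}(\beta)$, with $\beta = \min\{\beta_1, \dots, \beta_N\}$.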
Let us start with . For any random variables and , the following upper bound holds:
The PD condition leads to a lower bound for the sum:
where constant . Thus, the sum survival function has the following bounds for the right tail:
where and are the survival functions of and .
Let and be generalized Weibull-tail on of parameters and , then for large enough
where is the slowly-varying function appearing in the right tail lower bound of generalized Weibull-tail and is the minimum among slowly-varying functions, where and are slowly-varying functions in the right tails upper bounds of and . According to Lemma A.2, is also slowly-varying. Similarly, we can get bounds for the left tail. Therefore, is generalized Weibull-tail on with tail parameter .
Similarly as above when , bounds for the right tail of the sum with survival function are:
where is the survival functions of and constant . The rest of the proof is identical to the one of the case . The case when distributions have only right tails (or only left tails) can be considered as a particular case of the last theorem: a sum of non-negative generalized Weibull-tail random variables is non-negative generalized Weibull-tail with tail parameter equal to the minimum among those of the terms.
Theorem 2.2 (Product of independent variables)
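Let $X_1, \dots, X_N$ be independent symmetric $\mathrm{GWT}_{\mathbb{R}}$ with tail parameters $\beta_1, \dots, \beta_N$. Then, the product $X_1 \cdots X_N \sim \mathrm{GWT}_{\mathbb{R}}(\beta)$ with $\beta$ such that $\frac{1}{\beta} = \sum_{i=1}^{N} \frac{1}{\beta_i}$.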
Consider two independent symmetric generalized Weibull-tail random variables with tail parameters and , and . From Lemma B.1 and since random variables and are symmetric, it is equivalent to and .
The product of independent symmetric distributions is symmetric since . From Lemma B.1, if and only if .
Our goal is to show that for some slowly-varying functions and , there exist upper and lower bounds for the survival function of and large enough as follows:
(6) 

Upper bound. First, notice that from the concavity of the logarithm, we have for any and . Then . The change of variables , implies . From the latter equation, an upper bound of the product tail is
(7)
Lemma B.3 implies that and . Taking and , yields a sum of two independent non-negative generalized Weibull-tail random variables with tail parameter on the right-hand side of Equation (7). By Theorem 2.1, this sum is generalized Weibull-tail with the same tail parameter . It means that there exists a slowly-varying function such that the tail of the absolute value of the product is upper-bounded by
(8) 
Lower bound. By independence of and we have
Since and are generalized Weibull-tail on , we can define function with and being slowly-varying functions in the lower bounds of generalized Weibull-tail and . Then, is slowly-varying by Lemma A.1 and we have
(9)
Appendix C Bayesian neural network properties
Proofs of Section 3.
Lemma 3.1
Let $X_1, \dots, X_N$ be some possibly dependent random variables and $W_1, \dots, W_N$ be symmetric, mutually independent and independent from $X_1, \dots, X_N$; then random variables $W_1 X_1, \dots, W_N X_N$ satisfy the PD condition.
The joint probability for the right tail can be expressed as
(10) 
Independence between and yields
where the last equality is due to the mutual independence of weights . Let and . If , the probability . If , then, due to the symmetry of , the probability . Thus, the following lower bound holds:
Notice that
Substituting the latter equations into Equation (10) leads to the lower bound:
By the conditional probability definition, we have
The proof for the left tail is identical.
Theorem 3.1
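Consider a Bayesian neural network as described in Equation (2) with ReLU activation function. Let $\ell$-th layer weights be independent symmetric generalized Weibull-tail on $\mathbb{R}$ with tail parameter $\beta_w^{(\ell)}$. Then, $\ell$-th layer pre-activations are generalized Weibull-tail on $\mathbb{R}$ with tail parameter $\beta^{(\ell)}$ such that $\frac{1}{\beta^{(\ell)}} = \sum_{k=1}^{\ell} \frac{1}{\beta_w^{(k)}}$.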
The goal is to show that where . We proceed by induction on the layer depth .

First hidden layer
For
