## 1 Introduction

We introduce a family of complexity measures for the hypotheses of neural nets, based on a multilevel relative entropy. These complexity measures take into account the multilevel structure of neural nets, as opposed to the classical relative entropy (KL-divergence) term derived from PAC-Bayesian bounds [1] or mutual information bounds [2, 3]. We derive these complexity measures by combining the technique of chaining mutual information (CMI) [4], an algorithm-dependent extension of the classical chaining technique paired with the mutual information bound [2], with the multilevel architecture of neural nets. We observe that if a neural net is regularized in a multilevel manner as defined in Section 4, then one can readily construct hierarchical coverings with controlled diameters for its hypothesis set, and exploit this to obtain new multi-scale and algorithm-dependent generalization bounds and, in turn, new regularizers and training algorithms. The effect of such multilevel regularizations on the representation ability of neural nets has also been recently studied in [5, 6] for the special case where layers are nearly-identity functions, as for ResNets [7]. Here, we demonstrate the advantage of multilevel architectures by showing how one can obtain accessible hierarchical coverings for their hypothesis sets, introducing the notion of architecture-generated coverings in Section 3. Then we derive our generalization bound for arbitrary-depth feedforward neural nets by applying the CMI technique directly on their hierarchical sequence of generated coverings. Although such a sequence of coverings may not give the tightest possible generalization bound, it has the major advantage of being easily accessible, and hence can be exploited in devising multilevel training algorithms. Designing training algorithms based on hierarchical coverings of hypothesis sets was first studied in [8] and has recently regained traction in, e.g., [9, 10], all in the context of online learning and prediction of individual sequences. With such approaches, hierarchical coverings are no longer viewed merely as methods of proof for generalization bounds: they further allow for algorithms achieving low statistical error.

In our case, the derived generalization bound puts forward a multilevel relative entropy term (see Definition 1). We then turn to minimizing the empirical error with this induced regularization, called here the multilevel entropic regularization. Interestingly, we can solve this minimization problem exactly, obtaining a multi-scale generalization of the celebrated Gibbs posterior distribution; see Sections 5 and 6. The target distribution is obtained in a backwards manner by successive marginalization and tilting of distributions, as described in the Marginalize-Tilt algorithm introduced in Section 6. Unlike the classical Gibbs distribution, its multi-scale counterpart possesses a *temperature vector*.

This paper introduces the new concepts and main results behind this alternative approach to training neural nets. Many directions emerge from this approach, in particular regarding its applicability. It is worth noting that Markov chain Monte Carlo (MCMC) methods are known to often cope better with non-convexity issues than gradient descent approaches, since they are able to backtrack from local minima [11]. Furthermore, in contrast to gradient descent, MCMC methods take into account parameter uncertainty, which helps prevent overfitting [12]. However, compared to gradient-based methods, these methods are typically computationally more demanding.

### Further related literature

Information-theoretic approaches to statistical learning have been studied in the PAC-Bayesian theory; see [1, 13, 14] and references therein, and via the recent mutual information bound in e.g. [2, 3, 15, 16, 17, 18, 19]. Deriving generalization bounds for neural nets, based on the PAC-Bayesian theory, has been the focus of recent studies such as [20, 21, 22, 23]. The statistical properties of the Gibbs posterior distribution, also known as the Boltzmann distribution, or the exponential weights distribution in e.g. [24], have been studied in e.g. [25, 26, 27, 3, 15] via an information-theoretic viewpoint. Applications of the Gibbs distribution in devising and analyzing training algorithms have been the focus of recent studies such as [28, 29, 30]. Tilted distributions in unsupervised and semi-supervised statistical learning problems have also been studied in [31] in the context of community detection. For results on applying MCMC methods to large data sets, see [32] and references therein.

### Notation

In this paper, all logarithms are in natural base and all information-theoretic measures are in nats. Let , and denote the relative information, the relative entropy, and the Rényi divergence of order between probability measures and , and let denote conditional relative entropy (see Appendix A for precise definitions). In the framework of supervised statistical learning, denotes the instances domain, is the labels domain, denotes the examples domain and is the hypothesis set, where the hypotheses are indexed by an index set . Let be the loss function. A learning algorithm receives the training set of examples with i.i.d. random elements drawn from with an unknown distribution . Then it picks an element as the output hypothesis according to a random transformation . For any , let denote the statistical (or population) risk of hypothesis , where . For a given training set , the empirical risk of hypothesis is defined as and the generalization error of hypothesis (dependent on the training set) is defined as Averaging with respect to the joint distribution , we denote the expected generalization error by and the average statistical risk by Throughout the paper, denotes the spectral norm of matrix and denotes the Euclidean norm of vector . Let denote the Dirac measure centered at .

## 2 Preliminary: The CMI technique

Chaining, which originated in the work of Kolmogorov and Dudley, is a powerful technique in high-dimensional probability for bounding the expected suprema of random processes while taking into account the dependencies between their random variables in a multi-scale manner. Here we emphasize the core idea of the chaining technique: performing refined approximations by using a telescoping sum, called *the chaining sum*. If is a random process, then for any one can write

where are finer and finer approximations of the index . Each of the differences , , is called a *link* of the chaining sum. Informally speaking, if the approximations , , are close enough to each other and is close to , then, in many important applications, controlling the expected supremum of each of the links with union bounds and summing them up will give a much tighter bound than bounding the supremum of upfront with a union bound. (The idea is that the increments may capture the dependencies more efficiently.) For instance, the approximations may be the projections of on an increasing sequence of partitions of . For more information, see [33, 34, 35] and references therein.
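The telescoping identity behind the chaining sum can be checked numerically. The following is a minimal sketch under stated assumptions: the function `x` (standing in for a single realization of a process), the dyadic approximation `approx`, and the range of levels `k0` to `k1` are all hypothetical choices made for illustration and are not objects defined in the paper.

```python
import math
import random


def approx(t: float, k: int) -> float:
    """Dyadic approximation of t in [0, 1): round down to a multiple of 2**-k."""
    return math.floor(t * 2**k) / 2**k


def chaining_links(x, t: float, k0: int, k1: int) -> list:
    """Links x(approx(t, k)) - x(approx(t, k - 1)) for k = k0 + 1, ..., k1."""
    return [x(approx(t, k)) - x(approx(t, k - 1)) for k in range(k0 + 1, k1 + 1)]


def x(t: float) -> float:
    """Hypothetical smooth path standing in for one realization of a process."""
    return math.sin(2 * math.pi * t) + 0.5 * t


t = random.Random(0).random()   # an arbitrary index in [0, 1)
k0, k1 = 0, 20                  # coarsest and finest levels (assumed)

links = chaining_links(x, t, k0, k1)

# The chaining sum: coarsest approximation plus all links.
telescoped = x(approx(t, k0)) + sum(links)
```

By construction the links cancel pairwise, so `telescoped` equals the value at the finest approximation, which in turn is close to `x(t)` because the finest approximation is within `2**-k1` of `t`.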

The technique of chaining mutual information, recently introduced in [4], can be interpreted as an algorithm-dependent version of the above, extending a result of Fernique [36] by taking into account such dependencies. In brief, [4] asserts that one can replace the metric entropy in chaining with the mutual information between the input and the discretized output, to obtain an upper bound on the expected bias of an algorithm which selects its output from a random process . (The notion of metric entropy is similar to Hartley entropy in the information theory literature; to deal with the effect of noise in communication systems, Hartley entropy was generalized and replaced by mutual information by Shannon, see [37].) By writing the chaining sum with random index and after taking expectations, we obtain:

(1) |

With this technique, rather than bounding with a single mutual information term such as in [2, 3], one bounds each link , , and then sums them up.

In this paper, we first note that, unlike the classical chaining method, which requires finite-size partitions whose cardinalities appear in the bounds, the CMI technique has no such requirement. (Finite partitions are also not required in the theory of majorizing measures, i.e., generic chaining.) Therefore one may use a hierarchical sequence of coverings of the index set which includes covers of possibly uncountably infinite size. This fact will be useful for analyzing neural nets with continuous weight values in the next sections; for details, see Appendix B. (Using [19, Theorem 2], we also show that for empirical processes, one can replace the mutual information between the whole input set and the discretized output with mutual informations between individual examples and the discretized output, to obtain a tighter CMI bound; this is also detailed in Appendix B.)

The second important contribution is to design the coverings to meet the multilayer structure of neural nets. In the classical chaining and the CMI in [4], these are applied on an arbitrary infinite sequence of -partitions. In this paper, we take a different approach and use the hierarchical sequences of generated coverings associated with multilevel architectures, as defined in the next section.

## 3 Multilevel architectures and their generated coverings

Assume that in a statistical learning problem, the hypothesis set consists of multilevel functions, i.e., the index set consists of elements representable with components as . Examples for neural nets include: (1) the components are the layers; (2) the components are stacks of layers plus skip connections, such as in ResNets [7]. For all , let be the exact covering of determined by all possible values of the first components, i.e., any two indices are in the same set if and only if their first components match:

Notice that is a hierarchical sequence of exact coverings of the index set , and the projection set of any in , i.e., the unique set in which includes , is determined only by the values of the first components of . We call the hierarchical sequence of *generated coverings* of the index set , and will use the CMI technique on this sequence in the next sections. (Notice that for a given architecture, one can re-parameterize the components with different permutations of to give different generated coverings.)
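The construction of generated coverings can be illustrated on a toy finite index set. This is only a sketch: the index set below (three components, each taking two values) and the helper name `generated_covering` are assumptions introduced for the illustration; in the paper the components are weight matrices and the coverings may be uncountable.

```python
from collections import defaultdict
from itertools import product


def generated_covering(index_set, k):
    """Group indices by their first k components.

    Returns a dict mapping each length-k prefix to the block of indices
    sharing that prefix; the blocks form an exact covering (a partition).
    """
    blocks = defaultdict(list)
    for w in index_set:
        blocks[w[:k]].append(w)
    return dict(blocks)


# Toy index set (assumption): d = 3 components, each in {0, 1}.
d = 3
index_set = list(product((0, 1), repeat=d))

# coverings[k - 1] is the covering determined by the first k components.
coverings = [generated_covering(index_set, k) for k in range(1, d + 1)]

# The projection set of an index at level k is the block of its prefix.
w = (1, 0, 1)
projection_level_2 = coverings[1][w[:2]]
```

Each block at level k + 1 is contained in the block at level k that shares its prefix, which is the hierarchical property the chaining argument relies on.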

###### Remark 1.

The notion of generated coverings of is akin in nature to the notion of *generated filtrations* of random processes in probability theory (for a definition, see e.g. [38, p. 171]) and applying the CMI technique on this sequence is akin to the *martingale method*.

We provide the following simple yet useful example by revisiting Example 1 of [4]:

###### Example 1.

Consider a canonical Gaussian process where , has independent standard normal components and . The process can also be expressed according to the phase of each point , i.e. the unique number such that . Assume that the indices are in the phase form and define the following dyadic sequence of partitions of : For all integers ,

see Figure 1.

Can and the sequence be related to the hypothesis set of a multilevel architecture and its generated coverings? For all integers , let Notice that for each , one can write

where each is uniquely determined by . Fixing the values of and allowing the rest of the matrices to take arbitrary values in their corresponding gives one of the elements of . Therefore, the sequence of generated coverings associated with the index set of the infinite-depth linear neural net

is .

## 4 Multilevel regularization

The purpose of multilevel regularization is to control the diameters of the generated coverings and the links of their corresponding chaining sum. (The diameter of a covering for a metric space is defined as the supremum of the diameters of its blocks.) Consider a layer feed-forward neural net with parameters where for all , is a matrix between hidden layers and . Let denote any non-linearity which is -Lipschitz and satisfies , such as the entry-wise ReLU activation function, and let either be the soft-max function or the identity function. (One can readily replace the ReLU activation function with any other -Lipschitz activation function which maps the origin to the origin; our bounds in the next section will then depend on .) For a given , assume that the instances domain is . The feed-forward neural net with parameters is a function defined as For all , let be a fixed matrix such that , and for , define the following set of matrices:

(2) |

We assume that the domain of is restricted to . We are regularizing with and , for all , to constrain the links of the chaining sum , as we will see in Lemma 1. We name and the *reference* and *radius* of , respectively. (This is similar to the terminology of "reference matrices" in [39].) A common example used in practice is to let the references be identity matrices, such as for residual nets (see e.g. [5, 6, 39]). For instance, for the linear neural net in Example 1, we can take and , for all .

We define the projection of on the generated covering as . Let .

###### Lemma 1.

Let . Assume that and . Then, for all ,

For a proof, see Appendix C.

Notice that for any and any , if is the soft-max function, then , and if is the identity function, then from (2) and the triangle inequality, we derive . Let the loss function be chosen such that there exists for which for any and any we have . (This assumption is similar to the assumption of Lemma 17.6 in [40].) A commonly used example is the squared loss, i.e., for the net with parameters and for any example , define . For classification problems, assume that the labels are one-hot vectors; otherwise, let . Note that for this loss function, if is the soft-max function, then one can assume , and if is the identity function, then one can take .

## 5 Generalization and excess risk bounds

For all , let denote a *random* matrix and define and We can now state the following multi-scale and algorithm-dependent generalization bound derived from the CMI technique, in which mutual informations between the training set and the first layers appear:

###### Theorem 1.

Given the assumptions in the previous section, we have

(3) |

Proof outline. According to (1), one can write the chaining sum with respect to the sequence of generated coverings as

while, based on Lemma 1, observe that for all ,

For a complete proof, see Appendix C.

Notice that we can rewrite (3) as

(4) |

where . The goal in statistical learning is to find an algorithm which minimizes To that end, we derive an upper bound on from inequality (4) whose minimization over is algorithmically feasible. If for each , we define to be a fixed distribution on that does not depend on the training set , which we call the *prior distribution* (similar to the terminology in PAC-Bayes theory; see e.g. [1]), then from (4) we deduce

(5) |

(6) |

where (5) follows from the inequality for all , which upper-bounds the concave function by a tangent line, and (6) follows from the crucial difference decomposition of mutual information: ; see Lemma 4 in Appendix A. Given fixed parameters , , and for any fixed , let be the conditional distribution which minimizes the right side of (6), i.e.

(7) |
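The difference decomposition of mutual information used for (6) can be checked numerically on a small discrete example. The sketch below assumes the standard form of this identity, namely that the mutual information between input and output equals the conditional relative entropy to an arbitrary fixed prior minus the relative entropy between the output marginal and that prior; the toy distributions `p_s`, `p_w_given_s`, and the prior `q` are hypothetical and chosen only for illustration.

```python
import math


def kl(p, q):
    """Relative entropy D(p || q) in nats for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy setting (assumption): two possible inputs s, three possible outputs w.
p_s = [0.3, 0.7]
p_w_given_s = [
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]
q = [0.5, 0.25, 0.25]  # fixed prior, independent of the input

# Output marginal P_W.
p_w = [sum(p_s[s] * p_w_given_s[s][w] for s in range(2)) for w in range(3)]

# I(S; W) = E_S[ D(P_{W|S} || P_W) ].
mutual_info = sum(p_s[s] * kl(p_w_given_s[s], p_w) for s in range(2))

# Conditional relative entropy D(P_{W|S} || Q | P_S) and D(P_W || Q).
cond_kl = sum(p_s[s] * kl(p_w_given_s[s], q) for s in range(2))
marg_kl = kl(p_w, q)

gap = mutual_info - (cond_kl - marg_kl)
```

Since the subtracted relative entropy is nonnegative, the conditional relative entropy to any fixed prior upper-bounds the mutual information, which is what makes the prior-based bound usable without knowing the input distribution.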

Note that we made the expression in (7) linear in . This, in turn, implies that the algorithm does not depend on the unknown input distribution (recall that ), which is a desired property of . For discrete , the algorithm achieves the following excess risk bound:

###### Theorem 2.

Assume that is a discrete set and for a given input distribution , let denote the index of a hypothesis which achieves the minimum statistical risk among . Then

(8) |

Note that, for all , the relative entropies in Theorem 2 are computed as

For a proof of Theorem 2, a high-probability version, and a result for the case when is not discrete, see Appendix C. A case of special and practical interest is when the prior distributions are consistent, i.e., when there exists a single distribution such that for all . In this case, both (7) and (8) can be expressed with the following new divergence:

###### Definition 1 (Multilevel relative entropy).

For probability measures and , and a vector , define the *multilevel relative entropy* as

(9) |

The prior distributions may be given by Gaussian matrices truncated on bounded-norm sets.

It is shown in [3] (with related results in [27, 24]) that the Gibbs posterior distribution , as defined precisely in Definition 12 in Appendix D, is the unique solution to

where is called the *inverse temperature*.
Thus, based on (7), the desired distribution is a multi-scale generalization of the Gibbs distribution. In the next section, we obtain the functional form of . Inspired by the terminology for the Gibbs distribution, we call the vector of coefficients in (7) the *temperature vector* of . Note that for minimizing the excess risk bound (8), the optimal value for , for all , is

Furthermore, as a byproduct of the above analysis, we give new excess risk bounds for the Gibbs distribution in Propositions 3 and 4 in Appendix D (a related result has recently been obtained in [41], though using stability arguments). These results generalize Corollaries 2 and 3 in [3] to arbitrary subgaussian losses, and, unlike the proof in [3], which is based on the stability arguments of [15], our proof uses only the mutual information bound [2, 3].
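The variational characterization of the classical Gibbs posterior referred to above can be illustrated on a finite hypothesis set. This sketch assumes the standard form of the problem, minimizing the expected empirical risk plus the relative entropy to the prior divided by the inverse temperature, for one fixed training set; the prior, the empirical losses, and the value of `gamma` are hypothetical.

```python
import math
import random


def gibbs_posterior(prior, losses, gamma):
    """Gibbs posterior: proportional to prior(w) * exp(-gamma * loss(w))."""
    weights = [q * math.exp(-gamma * l) for q, l in zip(prior, losses)]
    z = sum(weights)
    return [w / z for w in weights]


def objective(p, prior, losses, gamma):
    """Expected empirical risk plus (1 / gamma) * D(p || prior)."""
    risk = sum(pi * l for pi, l in zip(p, losses))
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, prior) if pi > 0)
    return risk + kl / gamma


# Toy problem (assumption): four hypotheses, uniform prior.
prior = [0.25, 0.25, 0.25, 0.25]
losses = [0.9, 0.4, 0.1, 0.6]   # empirical risks on a fixed training set
gamma = 5.0                     # inverse temperature

p_star = gibbs_posterior(prior, losses, gamma)
best = objective(p_star, prior, losses, gamma)

# Compare against random candidate distributions.
rng = random.Random(1)


def random_dist(n):
    raw = [rng.random() for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]


candidate_values = [
    objective(random_dist(4), prior, losses, gamma) for _ in range(1000)
]
```

No random candidate attains a smaller objective than `p_star`, and the minimum value equals the negative log-partition function divided by `gamma`. The multi-scale distribution of the paper replaces the single `gamma` with a temperature vector, which this single-level sketch does not attempt to reproduce.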

## 6 The Marginalize-Tilt (MT) algorithm

The optimization problem (7), which was derived by *chaining* mutual information, can be solved via the *chain rule* of relative entropy and, based on a key property of conditional relative entropy (Lemma 7 in Appendix E), can be shown to have a unique solution. Note that if we know the solution to the following more general relative entropy sum minimization:

(10) |

where and distributions are given for all , then we can use that to solve for in (7) for any , by assuming the following: and for all , for all , and

where we combined the expected empirical risk with the last relative entropy in (7) and ignored the resulting term which does not depend on (such combination is similarly performed in [27, Section IV] for proving the optimality of the Gibbs distribution). The solution to (10), denoted as , is the output of Algorithm 1. If and are distributions on a set , then let the relative information denote the logarithm of the Radon–Nikodym derivative of with respect to for all . The algorithm uses the following:

###### Definition 2 (Tilted distribution).

Given distributions and , let be a dominating measure such that and . The tilted distribution for is defined with

for all . If , then is not defined for . (The tilted distribution is known as the *generalized escort distribution* in the statistical physics and the statistics literatures; see e.g. [42].)

###### Remark 2.

In the special case that and are distributions on a discrete set , for all , we have

In the case that and are distributions of real-valued absolutely continuous random variables with probability density functions and , the tilted random variable has probability density function Notice that traverses between and as traverses between and .

The following shows the useful role of tilted distributions in linearly combining relative entropies. For a proof, see [43, Theorem 30].

###### Lemma 2.

Let . For any and ,
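The way a tilted distribution linearly combines two relative entropies can be verified numerically in the discrete case. The sketch below fixes one convention as an assumption, with the tilted distribution proportional to `p**lam * q**(1 - lam)`, and checks the identity of [43, Theorem 30] in that convention: the weighted sum of relative entropies from any `r` equals the relative entropy from `r` to the tilted distribution plus a Rényi divergence term that does not depend on `r`. The paper's parameterization may assign the weights to the two distributions the other way around.

```python
import math


def kl(p, q):
    """Relative entropy D(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def tilted(p, q, lam):
    """Tilted distribution proportional to p**lam * q**(1 - lam) (assumed convention)."""
    weights = [pi**lam * qi ** (1 - lam) for pi, qi in zip(p, q)]
    z = sum(weights)
    return [w / z for w in weights]


def renyi(p, q, lam):
    """Renyi divergence of order lam in (0, 1) between p and q, in nats."""
    s = sum(pi**lam * qi ** (1 - lam) for pi, qi in zip(p, q))
    return math.log(s) / (lam - 1)


# Hypothetical distributions with full support on three points.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]
r = [0.1, 0.6, 0.3]
lam = 0.3

t = tilted(p, q, lam)

lhs = lam * kl(r, p) + (1 - lam) * kl(r, q)
rhs = kl(r, t) + (1 - lam) * renyi(p, q, lam)
```

Because the Rényi term is constant in `r`, minimizing the weighted sum over `r` is the same as minimizing the relative entropy to the tilted distribution, so the minimizer is the tilted distribution itself. At the endpoints `lam = 1` and `lam = 0` the tilted distribution reduces to `p` and `q`, respectively, for distributions with full support.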

Proof outline. Algorithm 1 solves for in a backwards manner: Starting from the last term in (10), the algorithm uses the chain rule of relative entropy (see Lemma 3 in Appendix A) to decompose it into two terms; a relative entropy and a conditional relative entropy:

Then, based on Lemma 2, it linearly combines the relative entropy with the previous term in (10) using the corresponding tilted distribution. The algorithm iterates these two steps to reduce solving (10) to a simple problem: minimizing a sum of conditional relative entropies which all can be set equal to zero, *simultaneously*. This is accomplished with given in line 7. For a complete proof, see Appendix E. The proof also implies that the minimum value of the expression in (10) is a summation of Rényi divergences between functions of distributions , .
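The two steps of the proof outline can be traced on a two-level discrete example. The following is a sketch derived from the outline, not a transcription of Algorithm 1: it assumes that (10) is a weighted sum of relative entropies between prefix marginals of the sought distribution and given reference distributions, here with weights `a1`, `a2`, a reference `q1` on the first component, and a joint reference `q2_joint` on both components. All names and numbers are hypothetical.

```python
import math
import random


def kl(p, q):
    """Relative entropy D(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def marginal_first(joint):
    """Marginal of the first component of a joint given as a list of rows."""
    return [sum(row) for row in joint]


def flatten(joint):
    return [x for row in joint for x in row]


def objective(p_joint, q1, q2_joint, a1, a2):
    """a1 * D(P_1 || Q1) + a2 * D(P_12 || Q2_12): assumed two-level form of (10)."""
    return (a1 * kl(marginal_first(p_joint), q1)
            + a2 * kl(flatten(p_joint), flatten(q2_joint)))


def marginalize_tilt_two_level(q1, q2_joint, a1, a2):
    # Marginalize: first-component marginal of the last reference.
    q2_first = marginal_first(q2_joint)
    # Tilt: combine the two first-level references with normalized weights.
    lam = a1 / (a1 + a2)
    weights = [x**lam * y ** (1 - lam) for x, y in zip(q1, q2_first)]
    z = sum(weights)
    p_first = [w / z for w in weights]
    # Copy the conditional of the last reference so that the conditional
    # relative entropy term vanishes.
    p_joint = [[p_first[i] * v / q2_first[i] for v in row]
               for i, row in enumerate(q2_joint)]
    return p_joint, z


q1 = [0.7, 0.3]
q2_joint = [[0.1, 0.2, 0.1], [0.3, 0.1, 0.2]]
a1, a2 = 2.0, 1.0

p_star, z = marginalize_tilt_two_level(q1, q2_joint, a1, a2)
best = objective(p_star, q1, q2_joint, a1, a2)

# Random candidate joints for comparison.
rng = random.Random(2)
candidate_values = []
for _ in range(1000):
    raw = [rng.random() for _ in range(6)]
    total = sum(raw)
    cand = [[raw[0] / total, raw[1] / total, raw[2] / total],
            [raw[3] / total, raw[4] / total, raw[5] / total]]
    candidate_values.append(objective(cand, q1, q2_joint, a1, a2))
```

In this two-level case the minimum equals `-(a1 + a2) * log(z)`, which is a scaled Rényi divergence between `q1` and the first-level marginal of `q2_joint`, in line with the remark that the minimum value of (10) is a sum of Rényi divergences.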

## 7 Multilevel training

By using the MT algorithm to solve (7), we obtain the “twisted distribution” for all . We now seek an efficient implementation of the MT algorithm. We define the multilevel training as simulating , given the training set . For a two-layer net, we implement this with Algorithm 2. Let , where and are the matrices of the first and second layer, respectively. (In this section, we denote matrices with lower case for clarity.) In the important case of having consistent product priors, i.e., when we can write and , assuming temperature vector , distribution is equal to:

(11) |

see Appendix F for more details.

Algorithm 2 consists of two Metropolis algorithms, one in an outer level to sample with distribution given by the first fraction in (11), and the other in the inner level at line 5 to sample given with conditional distribution equal to the second fraction in (11). Line 6, which can be run concurrently with line 5, shows how the inner level sampling is used in the outer level algorithm: Note that to compute the acceptance ratio of the outer level algorithm, we can write

where for any fixed ,