# Probabilistically True and Tight Bounds for Robust Deep Neural Network Training

Training Deep Neural Networks (DNNs) that are robust to norm-bounded adversarial attacks remains an elusive problem. While verification-based methods are generally too expensive to robustly train large networks, Gowal et al. demonstrated that bounded input intervals can be inexpensively propagated layer by layer through large networks. This interval bound propagation (IBP) approach led to high robustness and was the first to be employed on large networks. However, because the IBP bounds are very loose, particularly for large networks, the required training procedure is complex and involved. In this paper, we closely examine the bounds of a block of layers composed of an affine layer followed by a ReLU nonlinearity followed by another affine layer. In doing so, we propose probabilistic bounds, i.e. bounds that are true with overwhelming probability, that are provably tighter than IBP bounds in expectation. We then extend this result to deeper networks through blockwise propagation and show that we can achieve orders of magnitude tighter bounds compared to IBP. With such tight bounds, we demonstrate that a simple standard training procedure can achieve the best robustness-accuracy trade-off across several architectures on both MNIST and CIFAR10.

## 1 Introduction

Deep neural networks have demonstrated impressive performance in many fields of research with applications ranging from image classification krizhevsky2012imagenet ; he2016deep and semantic segmentation long2015fully to speech recognition hinton2012deep , just to name a few. Despite this success, DNNs are still susceptible to small imperceptible perturbations, which can lead to drastic degradation in performance, particularly in visual classification tasks. Such perturbations are commonly referred to as adversarial attacks. Early work showed that with simple algorithms (e.g. maximizing the classification loss with respect to the input using a single optimization iteration goodfellow2014explaining ), one can easily construct such adversaries. Since then, a surge of research has emerged to develop simple routines that consistently construct adversarial examples. For instance, moosavi2016deepfool proposed a simple algorithm, called DeepFool, which finds the smallest perturbation that fools a linearized version of the network. Interestingly, the work of moosavi2017universal demonstrated that such adversaries can be both network and input agnostic, i.e. there exist universal deterministic samples that fool a wide range of DNNs across a large number of input samples. More recently, it was shown that such adversaries can be as simple as Gaussian noise bibi2018analytic . Knowing that DNNs are easily susceptible to simple attacks can hinder public confidence in them, especially for real-world deployment, e.g. in self-driving cars and devices for the visually impaired.

Such a performance nuisance has prompted several active research directions, in particular, work towards network defense and verification. Network defense aims to train networks that are robust against adversarial attacks through means of robust training or procedures at inference time that dampen the effectiveness of the attack madry2017towards ; kolter2017provable ; raghunathan2018certified ; alfadly2019analytical . On the other hand, verification aims to certify/verify for a given DNN that there exist no small perturbations of a given input that can change its output prediction katz2017reluplex ; sankaranarayanan2016triplet ; weng2018towards . However, there are also works at the intersection of both, often referred to as robustness verification methods, which use verification methods to train robust networks. Such algorithms often try to minimize the exact value, or an upper bound, of the worst adversarial loss over all possible bounded energy (often measured in norm) perturbations around a given input.

Although verification methods proved to be effective in training robust networks kolter2017provable , they are very computationally expensive, thus limiting their applicability to small, or at best medium-sized, networks. However, Gowal et al. gowal2018effectiveness recently demonstrated that training large networks robustly is possible by leveraging the cheap-to-compute but very loose interval-based verifier, known as the interval domain from mirman2018differentiable . In particular, they propagate the $\epsilon$-$\ell_\infty$ norm bounded input centered at $x$, i.e. $[x - \epsilon 1_n, x + \epsilon 1_n]$, through the network one layer at a time. This interval bound propagation (IBP) is inexpensive and simple; however, it results in very loose output interval bounds, which in turn requires a complex and involved training procedure.

In this paper, we are interested in improving the tightness of the output interval bounds (referred to as bounds from now on). We do so by examining more closely the bounds for a block of layers that is composed of an affine layer followed by a ReLU nonlinearity followed by another affine layer under $\epsilon$-$\ell_\infty$ bounded input. In doing so, we propose new bounds for this block of layers, which we prove to be not only supersets of the true bounds of this block in a probabilistic sense, with an overwhelming probability, but also very tight to the true bounds. Then, we show how to leverage such a result and extend it to deeper networks through blockwise bound propagation, leading to several orders of magnitude tighter bounds as compared to IBP gowal2018effectiveness .

Contributions. Our contributions are three-fold. (i) We propose new bounds for the block of layers that is composed of an affine layer followed by a ReLU nonlinearity followed by another affine layer. We prove that these bounds are probabilistically true bounds in the sense that they are, with an overwhelming probability, a superset of the true bounds of this block. Moreover, we prove that these bounds are much tighter than the IBP bounds gowal2018effectiveness obtained by propagating the input bounds through every layer in the block. Our bounds get even tighter as the number of hidden nodes in the first affine layer increases. (ii) We show a practical and efficient approach to propagate our bounds (for the block of layers) through blocks, not through individual layers, of a deep network, thus resulting in tighter output bounds compared to IBP. (iii) Lastly, we conduct synthetic experiments to verify the theory as well as the factors of improvement over propagating the bounds layerwise. Moreover, and due to our tight bounds, we show that with a simple standard training procedure, one can robustly train large networks on both MNIST lecun1998mnist and CIFAR10 krizhevsky2009learning , achieving a state-of-the-art robustness-accuracy trade-off compared to IBP gowal2018effectiveness . In other words, with standard training and because of our tight bounds, we can consistently improve robustness by significant margins with very minimal effect on test accuracy as compared to IBP gowal2018effectiveness .

## 2 Related Work

Training accurate and robust DNNs remains an elusive problem, since several works have demonstrated that small imperceptible perturbations (adversarial attacks) to the DNN input can drastically affect their performance. Early works showed that with a very simple algorithm, as simple as maximizing the loss with respect to the input for a single iteration goodfellow2014explaining , one can easily construct such adversaries. This has strengthened the line of work towards network verification for both evaluating network robustness and for robust network training. In general, verification approaches can be coarsely categorized as exact or relaxed verifiers.

Exact Verification. Verifiers of this type try to find the exact largest adversarial loss over all possible bounded energy (usually in the norm sense) perturbations around a given input. They are often tailored for piecewise linear networks, e.g. networks with ReLU and LeakyReLU nonlinearities. They typically require solving a mixed integer problem cheng2017maximum ; lomuscio2017approach ; tjeng19 or Satisfiability Modulo Theory (SMT) solvers huang17 ; ehlers17 . The main advantage of these approaches is that they can reason about exact adversarial robustness; however, they generally are computationally intractable for verification purposes let alone any sort of robust network training. The largest network used for verification with such verifiers was with the work of tjeng19 , which employed a mixed integer solver applied to networks of at most 3 hidden layers. The verification is fast for such networks that are pretrained with a relaxed verifier as the method gets much slower on networks that are similar in size but normally trained.

Relaxed Verification. Verifiers of this type aim to find an upper bound on the worst adversarial loss across a range of bounded inputs. For instance, a general framework called CROWN was proposed in zhanghuan18 to certify robustness by bounding the activation with linear and quadratic functions, which enables the study of generic, not necessarily piecewise linear, activation functions. By utilizing the structure in ReLU based networks, the work of weng2018towards proposed two fast algorithms based on linear approximation of the ReLU units. Several other works utilize the dual view of the verification problem kolter2017provable ; wong18scaling . More recently, the method of salman2019convex unified a large number of recent works in a single convex relaxation framework and revealed several relationships between them. In particular, it was shown that convex relaxation based methods that fit this framework suffer from an inherent barrier when compared to exact verifiers.

For completeness, it is important to note that there are also hybrid methods that combine both exact and relaxed verifiers and have shown to be effective bunel17 .

Although relaxed verifiers are much more computationally friendly than exact verifiers, they are still too expensive for robust training of large networks (networks with more than 5 hidden layers). However, very loose relaxed verifiers can possibly still be exploited for this purpose. In particular, the work of gowal2018effectiveness proposed to use the extremely inexpensive but very loose interval bound propagation (IBP) certificate to train, for the first time, large robust networks with state-of-the-art robustness performance. This came at the expense of a complex and involved training routine due to the loose nature of the bounds. To remedy the training difficulties, we propose probabilistic bounds, not for each layer individually, but for a block of layers jointly. Such bounds are slightly more expensive to compute but are much tighter. We then propagate the bounds through every block in a deeper network to attain overall much tighter bounds compared to layerwise IBP. The tighter bounds allow for simple standard training routines to be employed for robust training of large networks, resulting in a state-of-the-art robustness-accuracy trade-off.

## 3 Probabilistically True and Tight Interval Bounds

We analyze the interval bounds of a DNN by proposing probabilistically true and tight bounds for a two-layer network (Affine-ReLU-Affine) and then we propose a mechanism to extend them for deeper networks. But first, we detail the interval bounds of gowal2018effectiveness to put our proposed bounds in context.

### 3.1 Interval Bounds for a Single Affine Layer

For a single affine layer parameterized by $A_1 \in \mathbb{R}^{k \times n}$ and $b_1 \in \mathbb{R}^{k}$, it is easy to show that its output lower and upper interval bounds for an $\epsilon$-$\ell_\infty$ norm bounded input $\tilde{x} \in [x - \epsilon 1_n, x + \epsilon 1_n]$ are:

$$l_1 = A_1 x + b_1 - \epsilon |A_1| 1_n, \qquad u_1 = A_1 x + b_1 + \epsilon |A_1| 1_n. \tag{1}$$

Note that $|\cdot|$ is an elementwise absolute operator. In the presence of any non-decreasing elementwise nonlinearity (e.g. ReLU), the bounds can then be propagated by applying the nonlinearity to $[l_1, u_1]$ directly. As such, the interval bounds can be propagated through the network one layer at a time, as in gowal2018effectiveness . While this interval bound propagation (IBP) mechanism is a very simple and inexpensive way to compute bounds, the resulting bounds can be very loose, which leads to a complex and involved robust network training procedure gowal2018effectiveness .
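As a minimal sketch of Eq. (1) followed by a ReLU, the snippet below propagates an $\ell_\infty$ box through one affine layer and then applies the nonlinearity to both bounds. It uses plain Python lists to stay dependency-free; the helper names `affine_bounds` and `relu_bounds`, the layer sizes, and the random weights are illustrative assumptions, not part of the original work.

```python
import random

def affine_bounds(A, b, x, eps):
    """Interval bounds of A @ x_tilde + b over the box [x - eps, x + eps] (Eq. 1)."""
    lower, upper = [], []
    for row, bias in zip(A, b):
        center = sum(a * xi for a, xi in zip(row, x)) + bias
        radius = eps * sum(abs(a) for a in row)  # one entry of eps * |A| 1_n
        lower.append(center - radius)
        upper.append(center + radius)
    return lower, upper

def relu_bounds(lower, upper):
    """ReLU is non-decreasing, so it is applied to each bound directly."""
    return [max(l, 0.0) for l in lower], [max(u, 0.0) for u in upper]

# Hypothetical toy layer: k = 4 hidden nodes, n = 6 inputs.
random.seed(0)
n, k, eps = 6, 4, 0.1
A1 = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]
b1 = [random.gauss(0.0, 1.0) for _ in range(k)]
x = [random.gauss(0.0, 1.0) for _ in range(n)]

l1, u1 = affine_bounds(A1, b1, x, eps)
relu_l1, relu_u1 = relu_bounds(l1, u1)

# Sanity check: random points of the box never leave the propagated interval.
inside = True
for _ in range(500):
    xt = [xi + random.uniform(-eps, eps) for xi in x]
    out = [max(sum(a * v for a, v in zip(row, xt)) + bias, 0.0)
           for row, bias in zip(A1, b1)]
    inside = inside and all(lo - 1e-9 <= o <= hi + 1e-9
                            for lo, o, hi in zip(relu_l1, out, relu_u1))
print(inside)
```

The check at the end only samples the box, so it illustrates rather than proves that the interval is a valid bound.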

### 3.2 Proposed Interval Bounds for an Affine-ReLU-Affine Block

Here, we consider a block of layers of the form Affine-ReLU-Affine in the presence of perturbations at the input. The functional form of this network is $g(x) = a_2^\top \max(A_1 x + b_1, 0_k) + b_2$, where $\max(\cdot)$ is an elementwise operator. The affine mappings can be of any size, and throughout the paper, $A_1 \in \mathbb{R}^{k \times n}$ and without loss of generality the second affine map is a single vector $a_2 \in \mathbb{R}^{k}$. Note that $g$ also includes convolutional layers, since they are also affine mappings.

#### Layerwise Interval Bound Propagation (IBP) on g.

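Here, we apply the layerwise propagation strategy of gowal2018effectiveness detailed in Section 3.1 on function $g$ with $\tilde{x} \in [x - \epsilon 1_n, x + \epsilon 1_n]$ to obtain bounds $[L_{IBP}, U_{IBP}]$. We use these bounds for comparison in what follows.

$$\begin{aligned} L_{IBP} &= a_2^\top \left( \frac{\max(u_1, 0_k) + \max(l_1, 0_k)}{2} \right) - |a_2^\top| \left( \frac{\max(u_1, 0_k) - \max(l_1, 0_k)}{2} \right) + b_2, \\ U_{IBP} &= a_2^\top \left( \frac{\max(u_1, 0_k) + \max(l_1, 0_k)}{2} \right) + |a_2^\top| \left( \frac{\max(u_1, 0_k) - \max(l_1, 0_k)}{2} \right) + b_2. \end{aligned}$$

Note that $\max(l_1, 0_k)$ and $\max(u_1, 0_k)$ are the result of propagating $[x - \epsilon 1_n, x + \epsilon 1_n]$ through the first affine map and then the ReLU nonlinearity, with $l_1$ and $u_1$ as given in (1).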
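The layerwise IBP expressions for the block can be sketched as follows. This is an illustrative plain-Python version under assumed toy dimensions; the function names `block_output` and `ibp_block_bounds` and the random parameters are hypothetical.

```python
import random

def block_output(A1, b1, a2, b2, x):
    """g(x) = a2^T max(A1 x + b1, 0) + b2."""
    hidden = [max(sum(a * v for a, v in zip(row, x)) + bias, 0.0)
              for row, bias in zip(A1, b1)]
    return sum(w * h for w, h in zip(a2, hidden)) + b2

def ibp_block_bounds(A1, b1, a2, b2, x, eps):
    """Layerwise IBP bounds [L_IBP, U_IBP] of g over the box [x - eps, x + eps]."""
    center, radius = b2, 0.0
    for row, bias, w in zip(A1, b1, a2):
        mid = sum(a * v for a, v in zip(row, x)) + bias
        rad = eps * sum(abs(a) for a in row)
        lo, hi = max(mid - rad, 0.0), max(mid + rad, 0.0)  # ReLU applied to l1 and u1
        center += w * (hi + lo) / 2.0       # a2^T times the interval midpoint
        radius += abs(w) * (hi - lo) / 2.0  # |a2^T| times the interval half-width
    return center - radius, center + radius

# Hypothetical toy block: n = 6 inputs, k = 5 hidden nodes.
random.seed(1)
n, k, eps = 6, 5, 0.1
A1 = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]
b1 = [random.gauss(0.0, 1.0) for _ in range(k)]
a2 = [random.gauss(0.0, 1.0) for _ in range(k)]
b2 = 0.3
x = [random.gauss(0.0, 1.0) for _ in range(n)]

L_ibp, U_ibp = ibp_block_bounds(A1, b1, a2, b2, x, eps)
samples = [block_output(A1, b1, a2, b2, [xi + random.uniform(-eps, eps) for xi in x])
           for _ in range(500)]
ibp_is_valid = all(L_ibp - 1e-9 <= s <= U_ibp + 1e-9 for s in samples)
print(L_ibp, U_ibp, ibp_is_valid)
```

Because IBP bounds are true bounds, every sampled output should fall inside $[L_{IBP}, U_{IBP}]$; the interval is typically much wider than the range the samples actually cover.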

#### Probabilistically True and Tight Interval Bounds on g.

Our goal is to propose new interval bounds on $g$, as a whole, which are tighter than the IBP bounds $[L_{IBP}, U_{IBP}]$, since we believe that such tighter bounds for a two-layer block, when propagated/extended to deeper networks, can be tighter than applying IBP layerwise. Denoting the true output interval bounds of $g$ as $[L_{true}, U_{true}]$, the following inequality holds: $L_{true} \le g(\tilde{x}) \le U_{true}$ for all $\tilde{x} \in [x - \epsilon 1_n, x + \epsilon 1_n]$. Deriving these true (and tight) bounds for $g$ in closed form is either hard or results in bounds that are generally very difficult to compute, making them impractical for applications such as robust network training. Instead, we propose new closed form expressions for the interval bounds, denoted as $[L_M, U_M]$, which we prove to be probabilistically true bounds and tighter than $[L_{IBP}, U_{IBP}]$ in expectation. As such, we make two main theoretical findings. (i) We prove that $[L_{true}, U_{true}] \subseteq [L_M, U_M]$ holds with high probability, i.e. $L_M \le L_{true}$ and $U_M \ge U_{true}$ hold with a high probability when the input dimension $n$ is large enough (probabilistically true bounds). (ii) We prove that $[L_M, U_M]$ can be arbitrarily tighter than the loose bounds $[L_{IBP}, U_{IBP}]$ in expectation, as the number of hidden nodes $k$ increases (probabilistically tighter bounds).

#### Analysis.

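To derive $L_M$ and $U_M$, we study the bounds of the following function instead.

$$\tilde{g}(\tilde{x}) = a_2^\top M (A_1 \tilde{x} + b_1) + b_2 = a_2^\top M A_1 \tilde{x} + a_2^\top M b_1 + b_2 \tag{2}$$

Note that $\tilde{g}$ is very similar to the Affine-ReLU-Affine map captured by $g$, with the ReLU replaced by a diagonal matrix $M$ constructed as follows. If we denote $u_1 = A_1 x + b_1 + \epsilon |A_1| 1_n$ as the upper bound resulting from the propagation of the input bounds $[x - \epsilon 1_n, x + \epsilon 1_n]$ through the first affine map $(A_1, b_1)$, then we have $M = \text{diag}(1\{u_1 \ge 0_k\})$, where $1\{\cdot\}$ is an indicator function. In other words, $M_{ii} = 1$ when the $i$-th element of $u_1$ is non-negative and zero otherwise. Note that for a given $M$, $\tilde{g}$ is an affine function with the following output interval bounds when $\tilde{x} \in [x - \epsilon 1_n, x + \epsilon 1_n]$:

$$L_M = a_2^\top M A_1 x + a_2^\top M b_1 + b_2 - \epsilon |a_2^\top M A_1| 1_n \tag{3}$$

$$U_M = a_2^\top M A_1 x + a_2^\top M b_1 + b_2 + \epsilon |a_2^\top M A_1| 1_n \tag{4}$$
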
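The construction of $M$ and the bounds in Eqs. (3) and (4) can be sketched as below. This is a hedged, dependency-free illustration: the function names `proposed_block_bounds` and `g_tilde` and the toy parameters are assumptions made for the example. The check at the end confirms only that $[L_M, U_M]$ bounds the affine surrogate $\tilde{g}$; whether it also contains the true bounds of $g$ is a probabilistic statement and is not guaranteed for any single draw.

```python
import random

def proposed_block_bounds(A1, b1, a2, b2, x, eps):
    """Bounds [L_M, U_M] of g_tilde from Eqs. (3)-(4); also returns diag(M)."""
    k, n = len(A1), len(x)
    mask = []
    for row, bias in zip(A1, b1):
        # i-th entry of u1 = A1 x + b1 + eps |A1| 1_n
        u1_i = sum(a * v for a, v in zip(row, x)) + bias + eps * sum(abs(a) for a in row)
        mask.append(1.0 if u1_i >= 0.0 else 0.0)
    # Row vector a2^T M A1 and scalar a2^T M b1 + b2.
    w = [sum(a2[i] * mask[i] * A1[i][j] for i in range(k)) for j in range(n)]
    offset = sum(a2[i] * mask[i] * b1[i] for i in range(k)) + b2
    center = sum(wj * xj for wj, xj in zip(w, x)) + offset
    radius = eps * sum(abs(wj) for wj in w)
    return center - radius, center + radius, mask

def g_tilde(A1, b1, a2, b2, mask, x_tilde):
    """Affine surrogate of Eq. (2): the ReLU is replaced by the fixed mask M."""
    pre = [sum(a * v for a, v in zip(row, x_tilde)) + bias for row, bias in zip(A1, b1)]
    return sum(w * m * p for w, m, p in zip(a2, mask, pre)) + b2

# Hypothetical toy block: n = 6 inputs, k = 5 hidden nodes.
random.seed(2)
n, k, eps = 6, 5, 0.1
A1 = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]
b1 = [random.gauss(0.0, 1.0) for _ in range(k)]
a2 = [random.gauss(0.0, 1.0) for _ in range(k)]
b2 = 0.3
x = [random.gauss(0.0, 1.0) for _ in range(n)]

L_M, U_M, mask = proposed_block_bounds(A1, b1, a2, b2, x, eps)
samples = [g_tilde(A1, b1, a2, b2, mask, [xi + random.uniform(-eps, eps) for xi in x])
           for _ in range(500)]
contains_g_tilde = all(L_M - 1e-9 <= s <= U_M + 1e-9 for s in samples)
print(L_M, U_M, contains_g_tilde)
```
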
###### Theorem 1.

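(Probabilistically True Bounds in Expectation) Consider an $\ell_\infty$ bounded uniform random variable input $\tilde{x}$, i.e. $\tilde{x} \in [x - \epsilon 1_n, x + \epsilon 1_n]$, to a block of layers in the form of Affine-ReLU-Affine (parameterized by $\{A_1, b_1\}$ and $\{a_2, b_2\}$ for the first and second affine layers, respectively). With the true output interval bounds being $[L_{true}, U_{true}]$, i.e. $L_{true} \le g(\tilde{x}) \le U_{true}$, the following holds with an overwhelming probability for a sufficiently large $m$:

$$\begin{aligned} L_{true} \ge L_{approx} &= \mathbb{E}_{\tilde{y}}\left[a_2^\top \max(\tilde{y}, 0_k) + b_2\right] - m \sqrt{\text{Var}_{\tilde{y}}\left(a_2^\top \max(\tilde{y}, 0_k) + b_2\right)}, \\ U_{true} \le U_{approx} &= \mathbb{E}_{\tilde{y}}\left[a_2^\top \max(\tilde{y}, 0_k) + b_2\right] + m \sqrt{\text{Var}_{\tilde{y}}\left(a_2^\top \max(\tilde{y}, 0_k) + b_2\right)}, \end{aligned}$$

where $\tilde{y} = A_1 \tilde{x} + b_1$ is approximately Gaussian, assuming that the Lyapunov Central Limit Theorem holds. For random matrices $A_1$ and $a_2$ with i.i.d. Gaussian elements of zero mean and $\sigma_{A_1}$ and $\sigma_{a_2}$ standard deviations, respectively, and for a sufficiently large input dimension $n$ and , we have:

$$\begin{aligned} \mathbb{E}_{A_1, a_2}[L_M] &\le \mathbb{E}_{A_1, a_2}[L_{approx}] \le \mathbb{E}_{A_1, a_2}[L_{true}], \\ \mathbb{E}_{A_1, a_2}[U_{true}] &\le \mathbb{E}_{A_1, a_2}[U_{approx}] \le \mathbb{E}_{A_1, a_2}[U_M]. \end{aligned} \tag{5}$$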

Theorem 1 states that the interval bounds $[L_M, U_M]$ to function $\tilde{g}$ are simply looser bounds to the function of interest $g$ in expectation, under a plausible distribution of $A_1$ and $a_2$. Now, we investigate the tightness of these bounds as compared to the IBP bounds $[L_{IBP}, U_{IBP}]$.

###### Theorem 2.

(Probabilistically Tighter Bounds in Expectation) Consider an bounded uniform random variable input , i.e. , to a block of layers in the form of Affine-ReLU-Affine (parameterized by and for the first and second affine layers respectively) and . Under the assumption that , we have: .

Theorem 2 states that under some assumptions on $A_1$ and under a plausible distribution for $a_2$, our proposed interval width can be much smaller than the IBP interval width, i.e. our proposed intervals are much tighter than the IBP intervals.

Next, we show that the inequality assumption in Theorem 2 is very mild. In fact, a wide range of $A_1$ satisfy it, and the following proposition gives an example that does so in expectation.

###### Proposition 1.

For a random matrix $A_1 \in \mathbb{R}^{k \times n}$ with i.i.d. elements $A_1(i,j) \sim \mathcal{N}(0, 1)$, we have

 EA1(∥A1(:,j)∥2−1√2π∥A1(:,j)∥1)=√2Γ(k+12)Γ(k2)−k√2π≈√k(1−√2π√k).

Proposition 1 implies that as the number of hidden nodes increases, the expectation of the right hand side of the inequality assumption in Theorem 2 grows more negative, while the left hand side of the inequality is zero in expectation when . In other words, for Gaussian zero-mean weights and with a large enough number of hidden nodes , the assumption is satisfied. All proofs and detailed analyses are provided in the supplementary material.
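The two expectations that Proposition 1 combines can be checked numerically. The sketch below assumes unit-variance Gaussian entries and treats one column of $A_1$ as a $k$-dimensional standard Gaussian vector; the function names and the Monte-Carlo settings are illustrative assumptions. It shows that the expected $\ell_2$ norm grows like $\sqrt{k}$ while the expected $\ell_1$ norm grows like $k$, which is why their difference becomes more negative as $k$ grows.

```python
import math
import random

def expected_l2_norm(k):
    """E||a||_2 for a ~ N(0, I_k): the mean of a chi distribution with k degrees of freedom."""
    return math.sqrt(2.0) * math.exp(math.lgamma((k + 1) / 2.0) - math.lgamma(k / 2.0))

def expected_l1_norm(k):
    """E||a||_1 for a ~ N(0, I_k): k times the folded-Gaussian mean sqrt(2/pi)."""
    return k * math.sqrt(2.0 / math.pi)

# Monte-Carlo estimate for one column with k = 100 entries (assumed setting).
random.seed(0)
k, trials = 100, 2000
l2_total = 0.0
l1_total = 0.0
for _ in range(trials):
    col = [random.gauss(0.0, 1.0) for _ in range(k)]
    l2_total += math.sqrt(sum(v * v for v in col))
    l1_total += sum(abs(v) for v in col)
mc_l2 = l2_total / trials
mc_l1 = l1_total / trials

print(expected_l2_norm(k), mc_l2, math.sqrt(k))
print(expected_l1_norm(k), mc_l1)
```
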

### 3.3 Extending our Probabilistically True and Tight Bounds to Deeper Networks

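To extend our proposed bounds to networks deeper than a two-layer block, we simply apply our bound procedure described in Section 3.2 recursively for every block. In particular, consider an $L$-layer neural network defined as $f(x) = A_L \text{ReLU}(A_{L-1} \text{ReLU}(\cdots A_2 \text{ReLU}(A_1 x)))$ and an $\epsilon$-$\ell_\infty$ norm bounded input centered at $x$, i.e. $\tilde{x} \in [x - \epsilon 1_n, x + \epsilon 1_n]$. Without loss of generality, we assume $f$ is bias-free for ease of notation. Then, the output lower and upper bounds of $f$ are $G_{L-1} x - \epsilon |G_{L-1}| 1_n$ and $G_{L-1} x + \epsilon |G_{L-1}| 1_n$, respectively. Here, $G_{L-1}$ is a linear map that can be obtained recursively as follows:

$$G_i = A_{i+1} M_i G_{i-1}, \quad \text{where} \quad G_0 = A_1, \quad M_i = \text{diag}\left(1\{(G_{i-1} x + \epsilon |G_{i-1}| 1_n) \ge 0\}\right)$$

Note that $G_{i-1} x + \epsilon |G_{i-1}| 1_n$ is the output upper bound through a linear layer parameterized by $G_{i-1}$ for input $\tilde{x} \in [x - \epsilon 1_n, x + \epsilon 1_n]$, as in (1). With this blockwise propagation, the output interval bounds of $f$ are now estimated by the output intervals of $\tilde{f}(\tilde{x}) = G_{L-1} \tilde{x}$.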
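The recursion for $G_i$ can be sketched in a few lines. The version below is a hedged, dependency-free illustration for a bias-free fully-connected ReLU network; the function names, layer sizes, and weights are assumptions. The toy network uses strictly positive weights and inputs so that every ReLU stays active on the whole box; in that special case the network is linear and the blockwise bounds are exact, which gives a deterministic sanity check. For general weights the bounds hold only with high probability.

```python
import random

def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def blockwise_bounds(weights, x, eps):
    """weights = [A_1, ..., A_L] of a bias-free ReLU network; returns (lower, upper)."""
    G = weights[0]  # G_0 = A_1
    for A_next in weights[1:]:
        # diag(M_i): keep unit j iff its upper bound (G x + eps |G| 1_n)_j is non-negative.
        upper = [c + eps * sum(abs(g) for g in row) for c, row in zip(matvec(G, x), G)]
        masked = [row if u >= 0.0 else [0.0] * len(row) for row, u in zip(G, upper)]
        G = matmul(A_next, masked)  # G_i = A_{i+1} M_i G_{i-1}
    center = matvec(G, x)
    radius = [eps * sum(abs(g) for g in row) for row in G]
    return ([c - r for c, r in zip(center, radius)],
            [c + r for c, r in zip(center, radius)])

def forward(weights, x):
    """f(x) = A_L ReLU(... ReLU(A_1 x))."""
    h = x
    for A in weights[:-1]:
        h = [max(v, 0.0) for v in matvec(A, h)]
    return matvec(weights[-1], h)

# Hypothetical 3-layer network with positive weights and inputs (all ReLUs active).
random.seed(3)
sizes = [5, 4, 4, 2]
weights = [[[random.uniform(0.1, 1.0) for _ in range(sizes[i])]
            for _ in range(sizes[i + 1])] for i in range(len(sizes) - 1)]
x = [random.uniform(1.0, 2.0) for _ in range(sizes[0])]
eps = 0.1

lower, upper = blockwise_bounds(weights, x, eps)
contained = True
for _ in range(300):
    out = forward(weights, [xi + random.uniform(-eps, eps) for xi in x])
    contained = contained and all(lo - 1e-9 <= o <= hi + 1e-9
                                  for lo, o, hi in zip(lower, out, upper))
print(contained)
```

Only the final map $G_{L-1}$ is kept, so the cost is dominated by the chain of matrix products rather than by per-layer interval bookkeeping.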

## 4 Experiments

Probabilistically True Bounds. In this experiment, we validate Theorem 1 with several controlled experiments. For a network of the form $g(x) = a_2^\top \max(A_1 x + b_1, 0_k) + b_2$ that has true bounds $[L_{true}, U_{true}]$ for $\tilde{x} \in [x - \epsilon 1_n, x + \epsilon 1_n]$, we empirically show that our proposed bounds $[L_M, U_M]$, under the mild assumptions of Theorem 1, are indeed true with a high probability, i.e. $[L_{true}, U_{true}] \subseteq [L_M, U_M]$. Moreover, we verify that the larger the network input dimension $n$ is, the higher the probability with which this holds (as predicted by Theorem 1).

We start by constructing a network $g$ where the biases $b_1$ and $b_2$ are initialized following the default PyTorch paszke2017automatic initialization. As for the elements of the weight matrices $A_1$ and $a_2$, they are sampled from and , respectively. We estimate $L_{true}$ and $U_{true}$ by taking the minimum and maximum of Monte-Carlo evaluations of $g$. For a given and with , we uniformly sample examples from the interval . We also sample all corners of the hyper cube . To probabilistically show that the proposed interval $[L_M, U_M]$ is a superset of $[L_{true}, U_{true}]$ (i.e. they are true bounds), we evaluate the length of the intersection of the two intervals over the length of the true interval, defined as $\frac{|[L_M, U_M] \cap [L_{true}, U_{true}]|}{|[L_{true}, U_{true}]|}$. Note that this ratio equals 1 if and only if $[L_M, U_M]$ is a superset of $[L_{true}, U_{true}]$. For a given , we conduct this experiment times with a varying , , , and and report the average ratio. Then, we run this for a varying number of input size and a varying number of hidden nodes , as reported in Figure 0(a). As predicted by Theorem 1, Figure 0(a) demonstrates that as $n$ increases, the proposed interval will be a superset of the true interval with a higher probability, regardless of the number of hidden nodes $k$. Note that networks that are as wide as , require no more than input dimensions for the proposed intervals to be a superset of the true intervals. In practice, $n$ is much larger than that, e.g.  in CIFAR10.
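The evaluation protocol described above, a Monte-Carlo estimate of the true interval plus an intersection-over-true-length ratio, can be sketched as follows. This is a hedged illustration: the names `monte_carlo_bounds` and `coverage_ratio`, the tiny network, and the sample counts are assumptions, and the enumeration of all $2^n$ corners is only feasible for small $n$. The Monte-Carlo interval is an inner estimate of the true bounds, so it can only under-report their extent.

```python
import itertools
import random

def block_output(A1, b1, a2, b2, x):
    """g(x) = a2^T max(A1 x + b1, 0) + b2."""
    hidden = [max(sum(a * v for a, v in zip(row, x)) + bias, 0.0)
              for row, bias in zip(A1, b1)]
    return sum(w * h for w, h in zip(a2, hidden)) + b2

def monte_carlo_bounds(f, x, eps, num_samples, rng):
    """Estimate [L_true, U_true] from uniform samples plus all 2^n box corners."""
    values = [f([xi + rng.uniform(-eps, eps) for xi in x]) for _ in range(num_samples)]
    for signs in itertools.product((-1.0, 1.0), repeat=len(x)):
        values.append(f([xi + s * eps for xi, s in zip(x, signs)]))
    return min(values), max(values)

def coverage_ratio(proposed, true):
    """Length of the intersection of the two intervals over the length of the true one.

    Assumes the true interval has non-zero length.
    """
    (lp, up), (lt, ut) = proposed, true
    overlap = max(0.0, min(up, ut) - max(lp, lt))
    return overlap / (ut - lt)

# Hypothetical tiny block: n = 4 inputs, k = 8 hidden nodes.
rng = random.Random(0)
n, k, eps = 4, 8, 0.1
A1 = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]
b1 = [rng.gauss(0.0, 1.0) for _ in range(k)]
a2 = [rng.gauss(0.0, 1.0) for _ in range(k)]
b2 = 0.0
x = [rng.gauss(0.0, 1.0) for _ in range(n)]

L_mc, U_mc = monte_carlo_bounds(lambda xt: block_output(A1, b1, a2, b2, xt),
                                x, eps, 2000, rng)
print(L_mc, U_mc)
print(coverage_ratio((0.0, 2.0), (0.5, 1.5)))  # proposed interval is a superset
print(coverage_ratio((0.0, 1.0), (0.5, 1.5)))  # proposed interval covers half
```
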

In Figure 0(b), we empirically show that the above behavior also holds for deeper networks. We propagate the bounds blockwise as discussed in Section 3.3 and conduct similar experiments on fully-connected networks. We construct networks with varying depth, where each layer has the same number of nodes equal to the input dimension .

These results indeed suggest that the proposed bounds are true bounds with high probability and this probability increases with larger input dimensions. Here, is better than across different network depths. The same behaviour holds for convolutional networks as well.

Probabilistically Tight Bounds. We experimentally affirm that our bounds can be much tighter than IBP bounds gowal2018effectiveness . In particular, we validate Theorem 2 by comparing interval lengths of our proposed bounds, , to those from IBP, , on networks with functional form . We compute both the difference and ratio of widths for varying values of , , and . Figure 2 reports the average width difference and ratio over runs in a similar setup to the previous section. Figures 1(a) and 1(b) show that the proposed bounds indeed get tighter than IBP, as increases across all values (as predicted by Theorem 2). Note that we show tightness results for in Figure 1(b) as the performance of were very similar to . Moreover, the improvement is consistently present when varying as shown in Figures 1(c) and 1(d).

We also compare the tightness of our bounds to those of IBP with increasing depth for both fully-connected networks (refer to Figures 2(a) and 2(b)) and convolutional networks (refer to Figures 2(c) and 2(d)). For all fully-connected networks, we take . Our proposed bounds get consistently tighter as the network depth increases over all choices of . In particular, the proposed bounds can be more than times tighter than IBP for a 10 layer DNN. A similar observation can also be made for convolutional networks. For convolutional networks, it is expensive to compute our bounds using the procedure described in Section 3.3, so instead we obtain matrices using the easy-to-compute IBP upper bounds. Despite this relaxation, we still obtain very tight and probabilistically true bounds. Note that this slightly modified approach reduces exactly to our bounds for two-layer networks.

Qualitative Results. Following previous work kolter2017provable ; gowal2018effectiveness , Figure 4 visualizes some examples of the proposed bounds and compares them to the true ones for several choices of and a random five-layer fully-connected network with architecture -----. We also show the results of the Monte-Carlo sampling for an input size . More qualitative visualizations for different values of are in the supplementary material.

Training Robust Networks. In this section, we conduct experiments showing that our proposed bounds can be used to robustly train DNNs. We compare our method against models trained nominally (i.e. only the nominal training loss is used), and those trained robustly with IBP gowal2018effectiveness . Given the well-known robustness-accuracy trade-off, robust models are often less accurate. Therefore, we compare all methods using an accuracy vs. robustness scatter plot. Following prior art, we use Projected Gradient Descent (PGD) madry2017towards to measure robustness. We use a loss function similar to the one proposed in gowal2018effectiveness . In particular, we use , where , , , and are the cross-entropy loss, output logits, the true class label, and regularization hyperparameter, respectively. represents the "adversarial" logits obtained by combining the lower bound of the true class label and the upper bound of all other labels, as in gowal2018effectiveness . When , nominal training is invoked. Due to the tightness of our bounds and in contrast to IBP gowal2018effectiveness , we follow a standard training procedure that avoids the need to vary or during training gowal2018effectiveness .
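The "adversarial" logits described above can be sketched as below: the true class takes its lower bound and every other class takes its upper bound. The function names are illustrative. The combined objective in `robust_loss`, a nominal cross-entropy term plus `kappa` times the cross-entropy on the adversarial logits, is an assumed form chosen only to make the example concrete; the exact loss and the hyperparameter symbol are not recoverable from the text here.

```python
import math

def worst_case_logits(lower, upper, true_label):
    """Lower bound for the true class, upper bound for every other class."""
    return [lower[c] if c == true_label else upper[c] for c in range(len(lower))]

def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for a single example."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[label]

def robust_loss(logits, adv_logits, label, kappa):
    """Assumed form: nominal loss plus kappa times the loss on adversarial logits."""
    return cross_entropy(logits, label) + kappa * cross_entropy(adv_logits, label)

# Hypothetical per-class output bounds for a 3-class problem.
lower = [1.0, -1.0, 0.0]
upper = [2.0, 0.0, 1.0]
label = 0
nominal = [(l + u) / 2.0 for l, u in zip(lower, upper)]

z = worst_case_logits(lower, upper, label)
print(z)
print(cross_entropy(nominal, label), cross_entropy(z, label))
print(robust_loss(nominal, z, label, kappa=0.5))
```

Because the cross-entropy decreases in the true-class logit and increases in all others, the loss on these logits upper-bounds the loss of any logit vector inside the output box, which is what makes tighter output bounds translate into a less pessimistic training signal.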

Experimental Setup. We train the three network models (small, medium and large) provided by gowal2018effectiveness on both MNIST and CIFAR10. See supplementary material for more details. Following the setup in gowal2018effectiveness , we train all models with and on MNIST and CIFAR10, respectively. Then, we compute the PGD robustness for every of every model architecture for all for MNIST and for all for CIFAR10. To compare training methods, we compute the average PGD robustness over all , the test accuracy, and report them in a 2D scatter plot. In all experiments, we grid search over learning rates and employ a temperature over the logits with a grid of as in hinton2015distilling .

We report the performance results on MNIST for the small, medium and large architectures in Figure 5. For all trained architectures, we only report the results for models that achieve at least a test accuracy of ; otherwise, it is an indication of failure in training. Interestingly, our training scheme can be used to train all architectures for all . This is unlike IBP, which for example was only able to successfully train the large architecture with . Moreover, the models trained with our bounds always achieve better PGD robustness than the nominally trained networks on all architectures (small, medium and large) while preserving similar if not a higher accuracy (large architecture). While the models trained with IBP achieve high robustness, their test accuracy is drastically affected. Note that over all architectures and for some , training with our bounds always yields models with comparable or better PGD robustness but with a much higher test accuracy.

Similar observations can be made when training on CIFAR10 as shown in Figure 6. We only report the performance of trained architectures that achieve at least a test accuracy of . All our trained models successfully train over all architectures and over all . They always achieve better PGD robustness, while maintaining similar or better test accuracy. Interestingly, all the models trained using IBP gowal2018effectiveness achieve a much lower test accuracy.

## 5 Conclusion

In this work, we proposed new interval bounds that are very tight, relatively cheap to compute, and probabilistically true. We analytically showed that for the block (Affine-ReLU-Affine) with large input and hidden layer sizes, our bounds are true with a high probability and several orders of magnitude tighter than the bounds obtained with IBP. We then proposed an approach to extend these results to deeper networks by means of blockwise propagation. We conducted extensive experiments verifying our theory on the Affine-ReLU-Affine block and demonstrating that the same behaviour persists for deeper networks. As a result, we are able to train large models with a standard training routine while achieving an excellent trade-off between accuracy and robustness.

#### Acknowledgments.

This work was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research.

## Appendix A Probabilistically True Bounds in Expectation

###### Theorem 1.

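(Probabilistically True Bounds in Expectation) Consider an $\ell_\infty$ bounded uniform random variable input $\tilde{x}$, i.e. $\tilde{x} \in [x - \epsilon 1_n, x + \epsilon 1_n]$, to a block of layers in the form of Affine-ReLU-Affine (parameterized by $\{A_1, b_1\}$ and $\{a_2, b_2\}$ for the first and second affine layers, respectively). With the true output interval bounds being $[L_{true}, U_{true}]$, i.e. $L_{true} \le g(\tilde{x}) \le U_{true}$, the following holds with an overwhelming probability for a sufficiently large $m$:

$$\begin{aligned} L_{true} \ge L_{approx} &= \mathbb{E}_{\tilde{y}}\left[a_2^\top \max(\tilde{y}, 0_k) + b_2\right] - m \sqrt{\text{Var}_{\tilde{y}}\left(a_2^\top \max(\tilde{y}, 0_k) + b_2\right)}, \\ U_{true} \le U_{approx} &= \mathbb{E}_{\tilde{y}}\left[a_2^\top \max(\tilde{y}, 0_k) + b_2\right] + m \sqrt{\text{Var}_{\tilde{y}}\left(a_2^\top \max(\tilde{y}, 0_k) + b_2\right)}, \end{aligned}$$

where $\tilde{y} = A_1 \tilde{x} + b_1$ is approximately Gaussian, assuming that the Lyapunov Central Limit Theorem holds. Then, for random matrices $A_1$ and $a_2$ with elements being i.i.d. Gaussian with zero mean and $\sigma_{A_1}$ and $\sigma_{a_2}$ standard deviations, respectively, and for a sufficiently large input dimension $n$ (holds as $n \to \infty$) and , we have that

$$\begin{aligned} \mathbb{E}_{A_1, a_2}[L_M] &\le \mathbb{E}_{A_1, a_2}[L_{approx}] \le \mathbb{E}_{A_1, a_2}[L_{true}], \\ \mathbb{E}_{A_1, a_2}[U_{true}] &\le \mathbb{E}_{A_1, a_2}[U_{approx}] \le \mathbb{E}_{A_1, a_2}[U_M]. \end{aligned} \tag{6}$$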
###### Proof.

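Since $\tilde{x}$ is uniformly distributed over $[x - \epsilon 1_n, x + \epsilon 1_n]$, it has mean $x$ and covariance matrix $\frac{\epsilon^2}{3} I_n$. The output of the first linear layer, $\tilde{y} = A_1 \tilde{x} + b_1$, therefore has mean $\mu_{\tilde{y}} = A_1 x + b_1$ and covariance matrix $\Sigma_{\tilde{y}} = \frac{\epsilon^2}{3} A_1 A_1^\top$. Assuming the Lyapunov CLT holds, $\tilde{y}$ can thus be approximated as $\mathcal{N}(\mu_{\tilde{y}}, \Sigma_{\tilde{y}})$ for relatively large $n$. Following the analytic expressions derived in bibi2018analytic for the first and second output moments of a network composed of an affine layer followed by a ReLU followed by another affine layer, the mean is given as follows: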

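$$\mathbb{E}_{\tilde{y}}\left[a_2^\top \max(\tilde{y}, 0_k) + b_2\right] = a_2^\top \left[\mu_{\tilde{y}} \odot \Phi(\mu_{\tilde{y}} \oslash \sigma_{\tilde{y}}) + \sigma_{\tilde{y}} \odot \phi(\mu_{\tilde{y}} \oslash \sigma_{\tilde{y}})\right] + b_2.$$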
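The closed-form mean above rests on the scalar identity $\mathbb{E}[\max(y, 0)] = \mu \Phi(\mu / \sigma) + \sigma \phi(\mu / \sigma)$ for $y \sim \mathcal{N}(\mu, \sigma^2)$, applied elementwise. The sketch below checks that identity against a Monte-Carlo estimate; the function names, the chosen $\mu$ and $\sigma$, and the sample count are illustrative assumptions.

```python
import math
import random

def relu_gaussian_mean(mu, sigma):
    """E[max(y, 0)] for y ~ N(mu, sigma^2)."""
    z = mu / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # Phi(z)
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # phi(z)
    return mu * cdf + sigma * pdf

def block_mean(a2, b2, mu, sigma):
    """a2^T E[max(y, 0)] + b2, using only the marginal mean and std of each entry of y."""
    return sum(w * relu_gaussian_mean(m, s) for w, m, s in zip(a2, mu, sigma)) + b2

# Monte-Carlo check of the scalar identity for one assumed (mu, sigma) pair.
random.seed(0)
mu, sigma, num_samples = 0.3, 1.2, 100000
mc_mean = sum(max(random.gauss(mu, sigma), 0.0) for _ in range(num_samples)) / num_samples

print(relu_gaussian_mean(mu, sigma), mc_mean)
print(block_mean([1.0, -2.0], 0.5, [0.3, 0.0], [1.2, 1.0]))
```

Only the diagonal of the covariance of $\tilde{y}$ enters the mean; the off-diagonal terms matter for the variance.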

Note that $\Phi$ and $\phi$ are the standard normal cumulative distribution and probability density functions, respectively, that $\sigma_{\tilde{y}} = \sqrt{\text{diag}(\Sigma_{\tilde{y}})}$, where diag extracts the diagonal elements of a matrix into a vector, and that $\oslash$ is an elementwise division. Moreover, note that where for ease of notation. As for the variance, it can be approximated following bibi2018analytic as follows:

 √Var~y([a⊤2(max(~y,0)+b2)))

Note that for ease of notation and that . Thus, we have that

 UM−Uapprox −b2−mϵ√6π(k∑i=1k∑j=1ai2aj2Ψi,j(Hi,jcos−1(−Hi,j)+√1−H2i,j)−(a⊤2√diag(A1A⊤1))2)12

By conditioning over , we have

 Ea2[UM−Uapprox] = Ea2[ϵ|a⊤2MA1|1]−mϵ√6πEa2[ k∑i=1k∑j=1ai2aj2Ψi,j(Hi,jcos−1(−Hi,j) + √1−H2i,j)−(a⊤2√diag(A1AT1))2]12 ≥ + √1−H2i,k)−Ea2[(a⊤2√diag(A1AT1))2]]12 = = ϵn∑j=1E[|a⊤2MA1(:,j)|]−mϵσa2√π−1√6π∥A1∥F = ϵ√2πn∑j=1√var(a⊤2MA1(:,j))−mϵσa2√π−1√6π∥A1∥F = ϵσa2√2πn∑j=1√A1(:,j)TMA1(:,j)−mϵσa2√π−1√6π∥A1∥F = ϵσa2√2πn∑j=1 ⎷k∑i=1A1(i,j)2M(i,i)−mϵσa2√π−1√6π∥A1∥F = ϵσa2√2πn∑j=1 ⎷k∑i=1A1(i,j)21{ui1≥0}−mϵσa2√π−1√6π∥A1∥F

The second equality follows from that are independent random variables. The first inequality follows by Jensen's inequality, where the fourth inequality follows from the mean of a folded Gaussian. Lastly, by taking the expectation over where is the set of indices where for all . Since is random, then is also random. Therefore, one can reparametrize the sum and as follows

 EA1,a2,|S|[UM−Uapprox] =E|S|[EA1[ϵσa2√2πn∑j=1√∑i∈SA1(i,j)2 ≈2ϵσa2σA1n√πE|S|[√|S|2]−mϵσa2σA1√π−1√3π√nk2

The second equality follows as a special case of Lemma 1. That is, if we have that . The last approximation follows from Stirling's formula for large $x$. Since , then as the input dimension increases, i.e. ( the theorem follows. ∎

## Appendix B Tightness Compared to Layerwise Propagation

###### Theorem 2.

(Tightness) Consider an bounded input , i.e. to the block of layers that is in the form of Affine-ReLU-Affine(parameterized by and for the first and second Affine layers, respectively) where the elements of i.i.d Gaussian with zero mean and standard deviation. Under the assumption that , we have that .

###### Proof.

Note that

 [(UDM−LDM)−(UM−LM)] =ϵ|a⊤2||A1|1n+12|a⊤2||u1|−12|a⊤2||l1|

Consider the coordinate splitting functions , , and such that for where is a vector of all zeros and 1 in the locations where both . However, since , then . Therefore it is clear that for any vector and an interval , we have that

 x=S++(x)+S+−(x)+S−−(x), (7)

since the sets , and are disjoints and their union . We will denote the difference in the interval lengths as for ease of notation. Thus, we have the following:

 WDM−WM =ϵS++(|a⊤2|)|A1|1n+ϵS+−(|a⊤2|)|A1|1n+ϵS−−(|a⊤2|)|A1|1n+12S++(|a⊤2|)|u1| +12S+−(|a⊤2|)|u1|+12S−−(|a⊤2|)|u1