1 Introduction
Deep neural networks have demonstrated impressive performance in many fields of research, with applications ranging from image classification krizhevsky2012imagenet ; he2016deep and semantic segmentation long2015fully to speech recognition hinton2012deep , to name a few. Despite this success, DNNs are still susceptible to small imperceptible perturbations, which can lead to drastic degradation in performance, particularly in visual classification tasks. Such perturbations are commonly referred to as adversarial attacks. Early work showed that simple algorithms (e.g. maximizing the classification loss with respect to the input for a single optimization iteration goodfellow2014explaining ) can easily construct such adversaries. Since then, a research surge has emerged to develop routines that consistently construct adversarial examples. For instance, moosavi2016deepfool proposed a simple algorithm, called DeepFool, which finds the smallest perturbation that fools a linearized version of the network. Interestingly, the work of moosavi2017universal demonstrated that such adversaries can be both network and input agnostic, i.e. there exist universal deterministic perturbations that fool a wide range of DNNs across a large number of input samples. More recently, it was shown that such adversaries can be as simple as Gaussian noise bibi2018analytic . Knowing that DNNs are easily susceptible to simple attacks can hinder public confidence in them, especially in real-world deployment, e.g. in self-driving cars and devices for the visually impaired.
Such a performance nuisance has prompted several active research directions, in particular work towards network defense and verification. Network defense aims to train networks that are robust against adversarial attacks, either through robust training or through inference-time procedures that dampen the effectiveness of the attack madry2017towards ; kolter2017provable ; raghunathan2018certified ; alfadly2019analytical . On the other hand, verification aims to certify, for a given DNN, that no small perturbation of a given input can change its output prediction katz2017reluplex ; sankaranarayanan2016triplet ; weng2018towards . There are also works at the intersection of both, often referred to as robustness verification methods, which use verification to train robust networks. Such algorithms typically minimize the exact value, or an upper bound, of the worst adversarial loss over all possible bounded-energy (often measured in norm) perturbations around a given input.
Although verification methods have proved effective in training robust networks kolter2017provable , they are very computationally expensive, limiting their applicability to small, or at best medium, sized networks. However, Gowal et al. gowal2018effectiveness recently demonstrated that robustly training large networks is possible by leveraging the cheap-to-compute but very loose interval-based verifier, known as the interval domain from mirman2018differentiable . In particular, they propagate the ℓ∞-norm ball around the input through the network one layer at a time. This interval bound propagation (IBP) is inexpensive and simple; however, it results in very loose output interval bounds, which in turn require a complex and involved training procedure.
In this paper, we are interested in improving the tightness of the output interval bounds (referred to as bounds from now on). We do so by examining more closely the bounds for a block of layers composed of an affine layer followed by a ReLU nonlinearity followed by another affine layer under ℓ∞-bounded input. In doing so, we propose new bounds for this block of layers, which we prove to be not only supersets of the true bounds of this block in a probabilistic sense with an overwhelming probability, but also very tight to the true bounds. Then, we show how to leverage this result and extend it to deeper networks through blockwise bound propagation, leading to bounds that are several orders of magnitude tighter than IBP gowal2018effectiveness .
Contributions. Our contributions are threefold. (i) We propose new bounds for the block of layers composed of an affine layer followed by a ReLU nonlinearity followed by another affine layer. We prove that these bounds are probabilistically true in the sense that they are, with an overwhelming probability, a superset of the true bounds of this block. Moreover, we prove that these bounds are much tighter than the IBP bounds gowal2018effectiveness obtained by propagating the input bounds through every layer in the block. Our bounds get even tighter as the number of hidden nodes in the first affine layer increases. (ii) We show a practical and efficient approach to propagate our bounds through blocks, not through individual layers, of a deep network, resulting in tighter output bounds compared to IBP. (iii) Lastly, we conduct synthetic experiments to verify the theory as well as the factors of improvement over propagating the bounds layerwise. Moreover, due to our tight bounds, we show that with a simple standard training procedure, one can robustly train large networks on both MNIST lecun1998mnist and CIFAR10 krizhevsky2009learning , achieving a state-of-the-art robustness-accuracy trade-off compared to IBP gowal2018effectiveness . In other words, with standard training and because of our tight bounds, we consistently improve robustness by significant margins with very minimal effect on test accuracy as compared to IBP gowal2018effectiveness .
2 Related Work
Training accurate and robust DNNs remains an elusive problem, since several works have demonstrated that small imperceptible perturbations (adversarial attacks) to the DNN input can drastically affect its performance. Early works showed that an algorithm as simple as maximizing the loss with respect to the input for a single iteration goodfellow2014explaining can easily construct such adversaries. This has strengthened the line of work towards network verification, both for evaluating network robustness and for robust network training. In general, verification approaches can be coarsely categorized as exact or relaxed verifiers.
Exact Verification. Verifiers of this type try to find the exact largest adversarial loss over all possible bounded-energy (usually in the norm sense) perturbations around a given input. They are often tailored for piecewise linear networks, e.g. networks with ReLU and LeakyReLU nonlinearities, and typically require solving a mixed integer program cheng2017maximum ; lomuscio2017approach ; tjeng19 or using Satisfiability Modulo Theory (SMT) solvers huang17 ; ehlers17 . The main advantage of these approaches is that they can reason about exact adversarial robustness; however, they are generally computationally intractable for verification, let alone for any sort of robust network training. The largest network verified with such methods was in the work of tjeng19 , which employed a mixed integer solver on networks of at most 3 hidden layers. Verification is fast for such networks when they are pretrained with a relaxed verifier, but the method becomes much slower on normally trained networks of similar size.
Relaxed Verification. Verifiers of this type aim to find an upper bound on the worst adversarial loss across a range of bounded inputs. For instance, a general framework called CROWN was proposed in zhanghuan18 to certify robustness by bounding the activations with linear and quadratic functions, which enables the study of generic, not necessarily piecewise linear, activation functions. By utilizing the structure in ReLU-based networks, the work of weng2018towards proposed two fast algorithms based on linear approximations of the ReLU units. Several other works utilize the dual view of the verification problem kolter2017provable ; wong18scaling . More recently, the method of salman2019convex unified a large number of recent works in a single convex relaxation framework and revealed several relationships between them. In particular, it was shown that convex relaxation based methods that fit this framework suffer from an inherent barrier when compared to exact verifiers. For completeness, it is important to note that there are also hybrid methods that combine both exact and relaxed verifiers and have been shown to be effective bunel17 .
Although relaxed verifiers are much more computationally friendly than exact verifiers, they are still too expensive for robust training of large networks (networks with more than 5 hidden layers). However, very loose relaxed verifiers can possibly still be exploited for this purpose. In particular, the work of gowal2018effectiveness proposed to use the extremely inexpensive but very loose interval bound propagation (IBP) certificate to train, for the first time, large robust networks with state-of-the-art robustness performance. This came at the expense of a complex and involved training routine due to the loose nature of the bounds. To remedy the training difficulties, we propose probabilistic bounds, not for each layer individually, but for a block of layers jointly. Such bounds are slightly more expensive to compute but are much tighter. We then propagate the bounds through every block in a deeper network to attain overall much tighter bounds compared to layerwise IBP. The tighter bounds allow simple standard training routines to be employed for robust training of large networks, resulting in a state-of-the-art robustness-accuracy trade-off.
3 Probabilistically True and Tight Interval Bounds
We analyze the interval bounds of a DNN by proposing probabilistically true and tight bounds for a two-layer network (Affine-ReLU-Affine), and then we propose a mechanism to extend them to deeper networks. But first, we detail the interval bounds of gowal2018effectiveness to put our proposed bounds in context.
3.1 Interval Bounds for a Single Affine Layer
For a single affine layer parameterized by a weight matrix and a bias vector, it is easy to show that its output lower and upper interval bounds for an ℓ∞-norm bounded input are:
(1) 
Note that the absolute value operator here is applied elementwise. In the presence of any nondecreasing elementwise nonlinearity (e.g. ReLU), the bounds can then be propagated by applying the nonlinearity to them directly. As such, the interval bounds can be propagated through the network one layer at a time, as in gowal2018effectiveness . While this interval bound propagation (IBP) mechanism is a very simple and inexpensive approach to compute bounds, the resulting bounds can be very loose, leading to a complex and involved robust network training procedure gowal2018effectiveness .
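As a concrete (and hedged) illustration of Eq. (1) and the layerwise propagation it enables, the following NumPy sketch computes interval bounds through one affine layer and a ReLU; the function names and the center/radius formulation are ours, not fixed by the paper:

```python
import numpy as np

def ibp_affine(W, b, lower, upper):
    """Propagate elementwise interval bounds through y = W x + b.

    With center mu = (lower + upper) / 2 and radius r = (upper - lower) / 2,
    the output interval has center W mu + b and radius |W| r, which matches
    Eq. (1) when the input box is an l_inf ball around a center point.
    """
    mu = (lower + upper) / 2.0
    r = (upper - lower) / 2.0
    mu_out = W @ mu + b
    r_out = np.abs(W) @ r
    return mu_out - r_out, mu_out + r_out

def ibp_relu(lower, upper):
    """A nondecreasing elementwise nonlinearity is applied to the bounds directly."""
    return np.maximum(lower, 0.0), np.maximum(upper, 0.0)
```

Chaining `ibp_affine` and `ibp_relu` layer by layer reproduces the IBP scheme of gowal2018effectiveness ; the looseness compounds with depth, which is what motivates the blockwise bounds proposed in this paper.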
3.2 Proposed Interval Bounds for an AffineReLUAffine Block
Here, we consider a block of layers of the form Affine-ReLU-Affine in the presence of ℓ∞ perturbations at the input, where the ReLU is an elementwise operator. The affine mappings can be of any size; throughout the paper, and without loss of generality, the second affine map is a single vector. Note that this setting also includes convolutional layers, since they are affine mappings as well.

Layerwise Interval Bound Propagation (IBP). Here, we apply the layerwise propagation strategy of gowal2018effectiveness detailed in Section 3.1 to this block to obtain bounds, which we use for comparison in what follows. Note that the hidden bounds are the result of propagating the input interval through the first affine map and then the ReLU nonlinearity, as shown in (1).
Probabilistically True and Tight Interval Bounds for the Block.
Our goal is to propose new interval bounds on the block as a whole that are tighter than the IBP bounds, since we believe that such tighter bounds for a two-layer block, when propagated/extended to deeper networks, can be tighter than applying IBP layerwise. Deriving the true (and tight) output bounds of the block in closed form is either hard or results in bounds that are generally very difficult to compute, rendering them impractical for applications such as robust network training. Instead, we propose new closed-form expressions for the interval bounds, which we prove to be probabilistically true bounds and tighter than the IBP bounds in expectation. As such, we make two main theoretical findings. (i) We prove that our bounds contain the true bounds with a high probability when the input dimension is large enough (probabilistically true bounds). (ii) We prove that our bounds can be arbitrarily tighter than the loose IBP bounds in expectation as the number of hidden nodes increases (probabilistically tighter bounds).
Analysis.
To derive and , we study the bounds of the following function instead.
(2) 
Note that this function is very similar to the Affine-ReLU-Affine map, with the ReLU replaced by a diagonal matrix constructed as follows. Denote the upper bound resulting from the propagation of the input bounds through the first affine map; then each diagonal entry is an indicator that equals 1 when the corresponding element of this upper bound is nonnegative and 0 otherwise. Note that, for a given diagonal matrix, this is an affine function with the following output interval bounds:
(3)  
(4) 
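To make the construction above concrete, here is a hedged NumPy sketch (W1, b1, w2, b2, D and the helper name are our notation, not the paper's): the ReLU mask D is decided by the IBP upper bound of the first layer, and once D is fixed the block is a single affine map whose interval over the ℓ∞ ball is exact and cheap:

```python
import numpy as np

def linearized_block_bounds(W1, b1, w2, b2, x0, eps):
    """Bounds of g(x) = w2^T D (W1 x + b1) + b2 over ||x - x0||_inf <= eps,
    where D is diagonal with D_ii = 1 iff the IBP upper bound of hidden
    unit i is nonnegative (a sketch of Eqs. (2)-(4) in our own notation)."""
    # IBP upper bound of the first affine layer decides the ReLU mask.
    upper1 = W1 @ x0 + b1 + eps * np.abs(W1).sum(axis=1)
    d = (upper1 >= 0).astype(float)        # diagonal of D
    v = (w2 * d) @ W1                      # effective linear map of the block
    center = (w2 * d) @ (W1 @ x0 + b1) + b2
    radius = eps * np.abs(v).sum()         # exact for an affine map on an l_inf ball
    return center - radius, center + radius
```

Because g is affine in x once D is fixed, `center ± radius` is the exact range of g; the probabilistic statements in Theorems 1 and 2 concern how this interval relates to the true bounds of the original ReLU block.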
Theorem 1.
(Probabilistically True Bounds in Expectation) Consider an ℓ∞ bounded uniform random variable input to a block of layers in the form Affine-ReLU-Affine (parameterized by the weights of the first and second affine layers, respectively). With the true output interval bounds as defined above, the following holds with an overwhelming probability for a sufficiently large input dimension, assuming that the Lyapunov Central Limit Theorem holds. For random weight matrices with i.i.d. Gaussian elements of zero mean and given standard deviations, and for sufficiently large input and hidden dimensions, we have:

(5)
Theorem 1 states that the interval bounds on the linearized function are simply looser bounds for the function of interest in expectation, under a plausible distribution of the weights. Now, we investigate the tightness of these bounds compared to the IBP bounds.
Theorem 2.
(Probabilistically Tighter Bounds in Expectation) Consider an ℓ∞ bounded uniform random variable input to a block of layers in the form Affine-ReLU-Affine (parameterized by the weights of the first and second affine layers, respectively). Under the stated inequality assumption on the weights, the proposed interval width is smaller than the IBP interval width in expectation.
Theorem 2 states that under some assumptions on and under a plausible distribution for , our proposed interval width can be much smaller than the IBP interval width, i.e. our proposed intervals are much tighter than the IBP intervals.
Next, we show that the inequality assumption in Theorem 2 is very mild. In fact, a wide range of weights satisfy it, and the following proposition gives an example that does so in expectation.
Proposition 1.
Proposition 1 implies that as the number of hidden nodes increases, the expectation of the right-hand side of the inequality assumption in Theorem 2 grows more negative, while the left-hand side of the inequality is zero in expectation. In other words, for zero-mean Gaussian weights and a large enough number of hidden nodes, the assumption is satisfied. All proofs and detailed analyses are provided in the supplementary material.
3.3 Extending our Probabilistically True and Tight Bounds to Deeper Networks
To extend our proposed bounds to networks deeper than a two-layer block, we simply apply the bound procedure described in Section 3.2 recursively for every block. In particular, consider a multi-layer neural network with an ℓ∞-norm bounded input. Without loss of generality, we assume the network is bias-free for ease of notation. Then, the output lower and upper bounds of the network are given through an effective linear map that can be obtained recursively as follows:
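A hedged NumPy sketch of this recursion (all names are ours; biases are included for generality and can be set to zero to match the bias-free description; each ReLU mask is decided by the upper bound of the effective affine map accumulated so far):

```python
import numpy as np

def blockwise_bounds(weights, biases, x0, eps):
    """Block-wise bound propagation through W_L . ReLU . ... . ReLU . W_1.

    Maintains an effective affine map (M, m) equal to the network with every
    ReLU replaced by its diagonal mask D, then returns exact bounds of that
    affine map over the l_inf ball of radius eps around x0."""
    M = np.array(weights[0], dtype=float)
    m = np.array(biases[0], dtype=float)
    for W, b in zip(weights[1:], biases[1:]):
        # Upper bound of the current effective affine map on the ball.
        upper = M @ x0 + m + eps * np.abs(M).sum(axis=1)
        d = (upper >= 0).astype(float)   # ReLU mask for this block
        M = W @ (d[:, None] * M)         # compose W . D . M
        m = W @ (d * m) + b
    center = M @ x0 + m
    radius = eps * np.abs(M).sum(axis=1)
    return center - radius, center + radius
```

At eps = 0 every mask matches the true activation pattern, so the interval collapses to the exact network output; for eps > 0 the interval is a superset of the true one only in the probabilistic sense of Theorem 1.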
4 Experiments
Probabilistically True Bounds. In this experiment, we validate Theorem 1 with several controlled experiments. For a network whose true bounds are estimated empirically, we show that our proposed bounds, under the mild assumptions of Theorem 1, are indeed true with a high probability. Moreover, we verify that the larger the network input dimension, the higher the probability with which the inequality holds (as predicted by Theorem 1).
We start by constructing a network where the biases are initialized following the default PyTorch paszke2017automatic initialization, and the elements of the weight matrices are sampled from zero-mean Gaussian distributions. We estimate the true bounds by taking the minimum and maximum of Monte-Carlo evaluations of the network. For a given input center and radius, we uniformly sample examples from the input interval, and we also sample all corners of the hypercube. To probabilistically show that the proposed interval is a superset of the true interval (i.e. the bounds are true), we evaluate the length of the intersection of the two intervals over the length of the true interval. Note that this ratio equals 1 if and only if the proposed interval is a superset of the true interval. For a given configuration, we conduct this experiment several times with varying network parameters and report the average ratio. Then, we repeat this for a varying input size and a varying number of hidden nodes, as reported in Figure 0(a). As predicted by Theorem 1, Figure 0(a) demonstrates that as the input dimension increases, the proposed interval is a superset of the true interval with a higher probability, regardless of the number of hidden nodes. Note that even the widest networks tested require only a modest input dimension for the proposed intervals to be a superset of the true intervals; in practice, the input dimension is much larger, e.g. in CIFAR10. In Figure 0(b), we empirically show that the above behaviour also holds for deeper networks. We propagate the bounds blockwise as discussed in Section 3.3 and conduct similar experiments on fully-connected networks. We construct networks with varying depth, where each layer has the same number of nodes, equal to the input dimension.
These results indeed suggest that the proposed bounds are true bounds with high probability and this probability increases with larger input dimensions. Here, is better than across different network depths. The same behaviour holds for convolutional networks as well.
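The containment metric used above can be sketched as follows (function names are ours); it returns 1 exactly when the proposed interval contains the Monte-Carlo estimate of the true interval:

```python
import numpy as np

def containment(interval, true_interval):
    """Length of the intersection over the length of the true interval;
    equals 1.0 iff `interval` is a superset of `true_interval`."""
    l, u = interval
    lt, ut = true_interval
    inter = max(0.0, min(u, ut) - max(l, lt))
    return inter / (ut - lt)

def mc_true_interval(f, x0, eps, n=10000, seed=0):
    """Monte-Carlo estimate of the true output interval of a scalar-valued
    f: evaluate it on uniform samples from the l_inf ball and take min/max
    (the paper's setup additionally samples the corners of the hypercube)."""
    rng = np.random.default_rng(seed)
    xs = x0 + rng.uniform(-eps, eps, size=(n, x0.size))
    ys = np.array([f(x) for x in xs])
    return float(ys.min()), float(ys.max())
```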
Probabilistically Tight Bounds. We experimentally affirm that our bounds can be much tighter than IBP bounds gowal2018effectiveness . In particular, we validate Theorem 2 by comparing interval lengths of our proposed bounds, , to those from IBP, , on networks with functional form . We compute both the difference and ratio of widths for varying values of , , and . Figure 2 reports the average width difference and ratio over runs in a similar setup to the previous section. Figures 1(a) and 1(b) show that the proposed bounds indeed get tighter than IBP, as increases across all values (as predicted by Theorem 2). Note that we show tightness results for in Figure 1(b) as the performance of were very similar to . Moreover, the improvement is consistently present when varying as shown in Figures 1(c) and 1(d).
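For reference, here is a hedged sketch of the width comparison for a single Affine-ReLU-Affine block (notation and names are ours): layer-wise IBP propagates the interval through the ReLU, while the proposed bound uses the mask D and exploits sign cancellations in w2^T D W1, which is where the tightening comes from:

```python
import numpy as np

def widths_ibp_vs_proposed(W1, b1, w2, b2, x0, eps):
    """Return (ibp_width, proposed_width) for w2^T ReLU(W1 x + b1) + b2."""
    z = W1 @ x0 + b1
    r1 = eps * np.abs(W1).sum(axis=1)
    # Layer-wise IBP: through the ReLU, then through the second layer.
    lo1, up1 = np.maximum(z - r1, 0.0), np.maximum(z + r1, 0.0)
    ibp_width = np.abs(w2) @ (up1 - lo1)
    # Proposed: replace the ReLU by the mask decided by the IBP upper bound.
    d = (z + r1 >= 0).astype(float)
    v = (w2 * d) @ W1
    proposed_width = 2.0 * eps * np.abs(v).sum()
    return float(ibp_width), float(proposed_width)
```

With zero-mean Gaussian weights, the entries of v concentrate around small values thanks to cancellation, so the proposed width typically shrinks relative to IBP as the hidden width grows, in line with Theorem 2.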
We also compare the tightness of our bounds to those of IBP with increasing depth, for both fully-connected networks (Figures 2(a) and 2(b)) and convolutional networks (Figures 2(c) and 2(d)). For all fully-connected networks, we use the same parameter settings as above. Our proposed bounds get consistently tighter than IBP as the network depth increases, over all choices of the perturbation size. In particular, the proposed bounds can be many times tighter than IBP for a 10-layer DNN. A similar observation can also be made for convolutional networks. For convolutional networks, it is expensive to compute our bounds using the procedure described in Section 3.3, so instead we obtain the masking matrices using the easy-to-compute IBP upper bounds. Despite this relaxation, we still obtain very tight and probabilistically true bounds. Note that this slightly modified approach reduces exactly to our bounds for two-layer networks.
Qualitative Results. Following previous work kolter2017provable ; gowal2018effectiveness , Figure 4 visualizes some examples of the proposed bounds and compares them to the true ones for several perturbation sizes on a random five-layer fully-connected network. We also show the results of the Monte-Carlo sampling used to estimate the true bounds. More qualitative visualizations for different perturbation sizes are in the supplementary material.
Training Robust Networks. In this section, we conduct experiments showing that our proposed bounds can be used to robustly train DNNs. We compare our method against models trained nominally (i.e. only the nominal training loss is used) and those trained robustly with IBP gowal2018effectiveness . Given the well-known robustness-accuracy trade-off, robust models are often less accurate; therefore, we compare all methods using an accuracy vs. robustness scatter plot. Following prior art, we use Projected Gradient Descent (PGD) madry2017towards to measure robustness. We use a loss function similar to the one proposed in gowal2018effectiveness , combining the nominal cross-entropy loss on the output logits with a cross-entropy term on the “adversarial” logits, weighted by a regularization hyperparameter. The adversarial logits are obtained by combining the lower bound of the true class logit with the upper bounds of all other logits, as in gowal2018effectiveness . When the regularization weight is zero, nominal training is invoked. Due to the tightness of our bounds, and in contrast to IBP gowal2018effectiveness , we follow a standard training procedure that avoids the need to vary the loss weighting or the perturbation size during training gowal2018effectiveness .

Experimental Setup. We train the three network models (small, medium and large) provided by gowal2018effectiveness on both MNIST and CIFAR10 (see the supplementary material for more details). Following the setup in gowal2018effectiveness , we train all models with the respective perturbation budgets on MNIST and CIFAR10. Then, we compute the PGD robustness of every model architecture over the full range of perturbation sizes for MNIST and CIFAR10. To compare training methods, we compute the average PGD robustness over all perturbation sizes and the test accuracy, and report them in a 2D scatter plot. In all experiments, we grid search over learning rates and employ a temperature over the logits as in hinton2015distilling .
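A minimal NumPy sketch of such a training loss (our naming; the additive weighting by `lam` is an assumption about the exact form, chosen so that a zero weight recovers nominal training as stated above):

```python
import numpy as np

def adversarial_logits(z_lower, z_upper, y):
    """Worst-case logits within the verified output box: the true class y
    keeps its lower bound, every other class its upper bound."""
    z_adv = z_upper.copy()
    z_adv[y] = z_lower[y]
    return z_adv

def cross_entropy(z, y):
    """Cross-entropy of a single logit vector z against label y."""
    z = z - z.max()                       # numerical stability
    return float(-z[y] + np.log(np.exp(z).sum()))

def robust_loss(z, z_lower, z_upper, y, lam):
    """Nominal cross-entropy plus a verified worst-case term weighted by
    the regularization hyperparameter lam; lam = 0 is nominal training."""
    return cross_entropy(z, y) + \
        lam * cross_entropy(adversarial_logits(z_lower, z_upper, y), y)
```

In training, `z_lower` and `z_upper` would come from propagating the bounds of this paper (or IBP) through the network; PGD is used only for evaluation.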
We report the performance results on MNIST for the small, medium and large architectures in Figure 5. For all trained architectures, we only report results for models that achieve a minimum test accuracy; otherwise, this is an indication of failure in training. Interestingly, our training scheme can be used to train all architectures for all perturbation sizes. This is unlike IBP, which, for example, was only able to successfully train the large architecture for some perturbation sizes. Moreover, the models trained with our bounds always achieve better PGD robustness than the nominally trained networks on all architectures (small, medium and large), while preserving a similar, if not higher, accuracy (large architecture). While the models trained with IBP achieve high robustness, their test accuracy is drastically affected. Note that, over all architectures and for some perturbation sizes, training with our bounds always yields models with comparable or better PGD robustness but with a much higher test accuracy.
Similar observations can be made when training on CIFAR10, as shown in Figure 6. We only report the performance of trained architectures that achieve a minimum test accuracy. Our models train successfully over all architectures and over all perturbation sizes, and they always achieve better PGD robustness while maintaining similar or better test accuracy. Interestingly, all the models trained using IBP gowal2018effectiveness achieve a much lower test accuracy.
5 Conclusion
In this work, we proposed new interval bounds that are very tight, relatively cheap to compute, and probabilistically true. We analytically showed that for the Affine-ReLU-Affine block with large input and hidden layer sizes, our bounds are true with a high probability and several orders of magnitude tighter than the bounds obtained with IBP. We then proposed an approach to extend these results to deeper networks through blockwise propagation. We conducted extensive experiments verifying our theory on the Affine-ReLU-Affine block and demonstrating that the same behaviour persists for deeper networks. As a result, we are able to train large models with a standard training routine while achieving an excellent trade-off between accuracy and robustness.
Acknowledgments.
This work was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research.
References
 (1) S. Gowal, K. Dvijotham, R. Stanforth, R. Bunel, C. Qin, J. Uesato, T. Mann, and P. Kohli, “On the effectiveness of interval bound propagation for training verifiably robust models,” arXiv preprint arXiv:1810.12715, 2018.

(2) A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
 (3) K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
 (4) J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440, 2015.
 (5) G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
 (6) I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.
 (7) S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, “Deepfool: A simple and accurate method to fool deep neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2574–2582, 2016.
 (8) S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard, “Universal adversarial perturbations,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1765–1773, 2017.

(9) A. Bibi, M. Alfadly, and B. Ghanem, “Analytic expressions for probabilistic moments of PL-DNN with Gaussian input,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9099–9107, 2018.
 (10) A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” arXiv preprint arXiv:1706.06083, 2017.
 (11) J. Z. Kolter and E. Wong, “Provable defenses against adversarial examples via the convex outer adversarial polytope,” arXiv preprint arXiv:1711.00851, vol. 1, no. 2, p. 3, 2017.
 (12) A. Raghunathan, J. Steinhardt, and P. Liang, “Certified defenses against adversarial examples,” arXiv preprint arXiv:1801.09344, 2018.
 (13) M. Alfadly, A. Bibi, and B. Ghanem, “Analytical moment regularizer for gaussian robust networks,” arXiv preprint arXiv:1904.11005, 2019.
 (14) G. Katz, C. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer, “Reluplex: An efficient smt solver for verifying deep neural networks,” in International Conference on Computer Aided Verification, pp. 97–117, Springer, 2017.
 (15) S. Sankaranarayanan, A. Alavi, C. D. Castillo, and R. Chellappa, “Triplet probabilistic embedding for face verification and clustering,” in 2016 IEEE 8th international conference on biometrics theory, applications and systems (BTAS), pp. 1–8, IEEE, 2016.
 (16) T.W. Weng, H. Zhang, H. Chen, Z. Song, C.J. Hsieh, D. Boning, I. S. Dhillon, and L. Daniel, “Towards fast computation of certified robustness for relu networks,” arXiv preprint arXiv:1804.09699, 2018.

(17) M. Mirman, T. Gehr, and M. Vechev, “Differentiable abstract interpretation for provably robust neural networks,” in International Conference on Machine Learning, pp. 3575–3583, 2018.
 (18) Y. LeCun, “The MNIST database of handwritten digits,” http://yann.lecun.com/exdb/mnist/, 1998.
 (19) A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” tech. rep., Citeseer, 2009.
 (20) C.H. Cheng, G. Nührenberg, and H. Ruess, “Maximum resilience of artificial neural networks,” in International Symposium on Automated Technology for Verification and Analysis, pp. 251–268, Springer, 2017.
 (21) A. Lomuscio and L. Maganti, “An approach to reachability analysis for feedforward relu neural networks,” arXiv preprint arXiv:1706.07351, 2017.
 (22) V. Tjeng, K. Xiao, and R. Tedrake, “Evaluating robustness of neural networks with mixed integer programming,” in International Conference on Learning Representations (ICLR), 2019.
 (23) X. Huang, M. Kwiatkowska, S. Wang, and M. Wu, “Safety verification of deep neural networks,” in International Conference on Computer Aided Verification, 2017.

(24) R. Ehlers, “Formal verification of piece-wise linear feed-forward neural networks,” in International Symposium on Automated Technology for Verification and Analysis, 2017.
 (25) H. Zhang, T.-W. Weng, P.-Y. Chen, C.-J. Hsieh, and L. Daniel, “Efficient neural network robustness certification with general activation functions,” in Neural Information Processing Systems, 2018.
 (26) E. Wong, F. Schmidt, J. H. Metzen, and J. Z. Kolter, “Scaling provable adversarial defenses,” in Neural Information Processing Systems (NeurIPS), 2018.
 (27) H. Salman, G. Yang, H. Zhang, C.J. Hsieh, and P. Zhang, “A convex relaxation barrier to tight robust verification of neural networks,” arXiv preprint arXiv:1902.08722, 2019.
 (28) R. Bunel, I. Turkaslan, P. H. S. Torr, P. Kohli, and M. P. Kumar, “A unified view of piecewise linear neural network verification,” arXiv preprint arXiv:1711.00455, 2018.
 (29) A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” NIPS Workshop, 2017.
 (30) G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
Appendix A Probabilistically True Bounds in Expectation
Theorem 1.
(Probabilistically True Bounds in Expectation) Consider an ℓ∞ bounded uniform random variable input to a block of layers in the form Affine-ReLU-Affine (parameterized by the weights of the first and second affine layers, respectively). With the true output interval bounds as defined above, the following holds with an overwhelming probability for a sufficiently large input dimension, assuming that the Lyapunov Central Limit Theorem holds. Then, for random weight matrices with elements being i.i.d. Gaussian with zero mean and given standard deviations, and for a sufficiently large input dimension and number of hidden nodes, we have that
(6)  
Proof.
Since the input is uniformly distributed over the ℓ∞ ball, its mean and covariance matrix are known in closed form, and so are the mean and covariance matrix of the output of the first linear layer. Assuming the Lyapunov CLT holds, this output can thus be approximated as Gaussian for relatively large n. Following the recent analytic expressions derived in bibi2018analytic for the output moments of a network of the form of an affine layer followed by a ReLU followed by another affine layer, the mean is given as follows. Note that the functions appearing here are the standard Gaussian cumulative distribution and probability density functions, that diag extracts the diagonal elements of a matrix into a vector, and that the division is elementwise; several quantities are introduced as shorthand for ease of notation. As for the variance, it can be approximated following bibi2018analytic . Thus, we have that
By conditioning over , we have
The second equality follows from the independence of the random variables involved. The first inequality follows from Jensen's inequality, while the fourth inequality follows from the mean of a folded Gaussian. Lastly, we take the expectation over the set of indices of the active units. Since the weights are random, this index set is also random. Therefore, one can reparametrize the sum as follows
The second equality follows as a special case of Lemma 1. The last approximation follows from Stirling's formula for large x. Since the residual term vanishes as the input dimension increases, the theorem follows. ∎
Appendix B Tightness Compared to Layerwise Propagation
Theorem 2.
(Tightness) Consider an ℓ∞ bounded input to the block of layers in the form Affine-ReLU-Affine (parameterized by the weights of the first and second affine layers, respectively), where the weight elements are i.i.d. Gaussian with zero mean and a given standard deviation. Under the stated inequality assumption, the proposed interval width is at most the IBP interval width.
Proof.
Note that
Consider the coordinate-splitting functions, each selecting a disjoint subset of coordinates, where the selector is a vector of all zeros with ones at the locations where both bounds satisfy the corresponding sign condition. Therefore, it is clear that for any vector and any interval, we have that
(7) 
since the sets are disjoint and their union covers all coordinates. We denote the difference in the interval lengths with a shorthand for ease of notation. Thus, we have the following: