1 Introduction
Neural networks have become increasingly effective at many difficult machine-learning tasks. However, the nonlinear and large-scale nature of neural networks makes them hard to analyze and, therefore, they are mostly used as black-box models without formal guarantees. In particular, neural networks are highly vulnerable to attacks, or, more generally, to uncertainty in the input. In the context of image classification, for example, neural networks can be easily fooled into changing their classification labels by slightly perturbing the input image. Indeed, it has been shown that even imperceptible perturbations in the input of state-of-the-art neural networks cause natural images to be misclassified with high probability
(Moosavi-Dezfooli et al., 2017). These perturbations can either be of an adversarial nature (Szegedy et al., 2013), or they can merely occur due to compression, resizing, and cropping (Zheng et al., 2016). As another example, in the space of malware classification, the existence of adversarial examples not only limits the potential application settings of neural networks but entirely defeats their purpose. These drawbacks limit the adoption of neural networks in safety-critical applications such as self-driving vehicles (Bojarski et al., 2016), aircraft collision avoidance procedures (Julian et al., 2016), speech recognition, and recognition of voice commands; see (Xiang et al., 2018a) for a survey. Motivated by the serious consequences of the fragility of neural networks to input uncertainties or adversarial attacks, there has been an increasing effort to develop tools to measure or improve the robustness of neural networks. Many results focus on specific adversarial attacks and attempt to harden the network by, for example, crafting hard-to-classify examples
(Goodfellow et al., 2014; Kurakin et al., 2016; Papernot et al., 2016; Moosavi-Dezfooli et al., 2016). Although these methods are scalable and work well in practice, they still suffer from false negatives. Safety-critical applications require provable robustness against any bounded variation in the input data. As a result, many tools have recently been used, adapted, or developed for this purpose, such as mixed-integer linear programming
(Bastani et al., 2016; Lomuscio & Maganti, 2017; Tjeng et al., 2017), robust optimization and duality theory (Kolter & Wong, 2017; Dvijotham et al., 2018), Satisfiability Modulo Theories (SMT) (Pulina & Tacchella, 2012), dynamical systems (Ivanov et al., 2018; Xiang et al., 2018b), Abstract Interpretation (Mirman et al., 2018), and many others (Hein & Andriushchenko, 2017). All these works aim at bounding the worst-case value of a performance measure when the input is perturbed within a specified range.

In this paper, we develop a semidefinite program (SDP) for safety verification and robustness analysis of neural networks against norm-bounded uncertainties in the input. Our main idea is to abstract the nonlinear activation functions by the constraints they impose on the pre- and post-activation values. In particular, we describe various properties of activation functions using quadratic constraints, such as bounded slope, bounded values, monotonicity, and repetition across layers. Using this abstraction, any property (e.g., safety or robustness) that we can guarantee for the "abstracted" network will automatically be satisfied by the original network as well. The quadratic form of these constraints allows us to formulate the problem as an SDP. Our main tool for developing the SDP is the S-procedure from robust control, which allows us to reason about multiple quadratic constraints. As a notable advantage, we can analyze networks with any combination of activation functions across the layers. In this paper, we focus on a canonical problem (formally stated in Section 2.1) that can be adapted to other closely related problems, such as sensitivity analysis with respect to input perturbations, output reachable set estimation, adversarial example generation, and near-duplicate detection.
1.1 Related Work
The performance of certification algorithms for neural networks can be measured along three axes: the first axis is the tightness of the certification bounds; the second is the computational complexity; and the third is applicability across various models (e.g., different activation functions). These axes conflict. For instance, the conservatism of an algorithm is typically at odds with its computational complexity, and generalizable algorithms tend to be more conservative. The relative advantage of any of these algorithms is application specific. For example, reachability analysis and safety verification applications call for less conservative algorithms, whereas in robust training, computationally fast algorithms are desirable
(Weng et al., 2018).

On the one hand, formal verification techniques such as Satisfiability Modulo Theories (SMT) solvers (Ehlers, 2017; Huang et al., 2017; Katz et al., 2017) or integer programming approaches (Lomuscio & Maganti, 2017; Tjeng et al., 2017) rely on combinatorial optimization to provide tight certification bounds for piecewise-linear networks, but their complexity scales exponentially with the size of the network in the worst case. A notable work to improve scalability is (Tjeng et al., 2017), where the authors perform exact verification of piecewise-linear networks using mixed-integer programming, with an order-of-magnitude reduction in computational cost via tight formulations for the nonlinearities and careful preprocessing.

On the other hand, certification algorithms based on continuous optimization are more scalable but less accurate. A notable work in this category is reported in (Kolter & Wong, 2017), where the authors propose a linear-programming (LP) relaxation of piecewise-linear networks and provide upper bounds on the worst-case loss using weak duality. The main advantage of this work is that the proposed algorithm relies solely on forward- and back-propagation operations on a modified network, and thus is easily integrable into existing learning algorithms. In (Raghunathan et al., 2018a), the authors propose an SDP relaxation of one-layer sigmoid-based neural networks based on bounding the worst-case loss with a first-order Taylor expansion. Finally, the closest work to the present one is (Raghunathan et al., 2018b), in which the authors propose a semidefinite relaxation (SDR) for certifying robustness of piecewise-linear multi-layer neural networks. This technique provides tighter bounds than that of (Kolter & Wong, 2017), although it is less scalable.
Our contribution. The present work, which also relies on an SDP relaxation, has the following features:

We use various forms of quadratic constraints (QCs) to abstract any type of activation function.

We can control the tradeoff between computational complexity and conservatism by systematically including or excluding different types of QCs.

For one-layer neural networks, the proposed SDP offers an order-of-magnitude reduction in computational complexity compared to (Raghunathan et al., 2018b) while preserving the accuracy. In particular, there are $O(n)$ decision variables (where $n$ is the total number of neurons), while the SDP of (Raghunathan et al., 2018b) has $O(n^2)$ decision variables.

For multi-layer neural networks, the SDP of the present work, with all possible QCs included, is more accurate than that of (Raghunathan et al., 2018b) with the same computational complexity.
The main drawback of our approach (and of all SDP-based approaches) is the limited scalability of SDPs in general. To overcome this issue for deep networks with more than (roughly) five thousand neurons, we propose to adopt a modular approach, in which we analyze the network layer by layer by solving a sequence of small SDPs, as opposed to a single large one. This approach mitigates the scalability issue but induces more conservatism.
1.2 Notation and Preliminaries
We denote the set of real numbers by $\mathbb{R}$, the set of real $n$-dimensional vectors by $\mathbb{R}^n$, the set of $m \times n$-dimensional matrices by $\mathbb{R}^{m \times n}$, and the $n$-dimensional identity matrix by $I_n$. We denote by $\mathbb{S}^n$, $\mathbb{S}^n_+$, and $\mathbb{S}^n_{++}$ the sets of $n$-by-$n$ symmetric, positive semidefinite, and positive definite matrices, respectively. The $p$-norm ($p \ge 1$) is displayed by $\|x\|_p$.
2 Safety and Robustness Analysis
2.1 Problem Statement
Consider the nonlinear vector-valued function $f \colon \mathbb{R}^{n_x} \to \mathbb{R}^{n_f}$ described by a multi-layer feedforward neural network. Given a bounded set $\mathcal{X} \subset \mathbb{R}^{n_x}$ of possible inputs (e.g., adversarial examples), the neural network maps $\mathcal{X}$ to an output set given by

(1) $\mathcal{Y} = f(\mathcal{X}) = \{f(x) : x \in \mathcal{X}\}.$
The desirable properties that we would like to verify can often be represented by a safety specification set $\mathcal{S}$ in the output space of the neural network. In this context, the network is safe if the output set lies within the safe region, i.e., if the inclusion $f(\mathcal{X}) \subseteq \mathcal{S}$ holds. In the context of image classification, for example, a popular choice is perturbations in the $\ell_\infty$ norm, i.e., $\mathcal{X} = \{x : \|x - x^\star\|_\infty \le \epsilon\}$, where $x^\star$ is a correctly classified test image and $\mathcal{X}$ is the set of all possible images obtained by perturbing each pixel of $x^\star$ by at most $\epsilon$. Then $f(\mathcal{X})$ is the set of all perturbed logit inputs to the classifier, and $\mathcal{S}$ is the set of all logit values that produce the same label as $x^\star$. The condition $f(\mathcal{X}) \subseteq \mathcal{S}$ then guarantees that the network will assign the same label to all images in $\mathcal{X}$ (local robustness).

Checking the condition $f(\mathcal{X}) \subseteq \mathcal{S}$, however, requires an exact computation of the nonconvex set $f(\mathcal{X})$, which is very difficult. Instead, our interest is in finding a nonconservative outer approximation $\tilde{\mathcal{Y}} \supseteq f(\mathcal{X})$ and verifying the safety properties by checking the condition $\tilde{\mathcal{Y}} \subseteq \mathcal{S}$. This approach detects all false negatives but can also produce false positives, whose rate depends on the tightness of the over-approximation; see Figure 1. The goal of this paper is to solve this problem for a broad class of input uncertainties and safety specification sets using semidefinite programming.
2.2 Neural Network Model
For the model of the neural network, we consider an $\ell$-layer feedforward neural network described by the following recursive equations:

(2) $x^0 = x, \quad x^{k+1} = \phi(W^k x^k + b^k), \; k = 0, \dots, \ell - 1, \quad f(x) = W^\ell x^\ell + b^\ell,$
where $x^0 = x \in \mathcal{X}$ is the input to the network, $\mathcal{X}$ is a bounded uncertainty set, and $W^k \in \mathbb{R}^{n_{k+1} \times n_k}$ and $b^k \in \mathbb{R}^{n_{k+1}}$ are the weight matrix and bias vector of the $k$-th layer. The nonlinear activation function $\phi$ (ReLU, sigmoid, tanh, leaky ReLU, etc.) is applied coordinate-wise to the pre-activation vectors, i.e., it is of the form

(3) $\phi(v) = [\varphi(v_1), \dots, \varphi(v_d)]^\top,$

where $\varphi$ is the activation function of each individual neuron. The meaning of the output $f(x)$ depends on the specific application we are considering. For example, in image classification with cross-entropy loss, $f(x)$ represents the logit input to the softmax function; in feedback control, $x$ is the input to the neural network controller (e.g., the tracking error) and $f(x)$ is the control input to the plant.
3 Problem Abstraction via Quadratic Constraints
In this section, our goal is to provide an abstraction of the verification problem that can be converted into a semidefinite program. Our main tool is Quadratic Constraints (QCs), which were first developed in the context of robust control (Megretski & Rantzer, 1997) for describing nonlinear, time-varying, or uncertain components of a system; we adapt them here for our purposes. We start with the abstraction of the uncertainty set $\mathcal{X}$ using QCs.
3.1 Input uncertainty
We now provide a particular way of representing the input set $\mathcal{X}$ that will prove useful for developing the SDP.
Definition
Let $\mathcal{X} \subseteq \mathbb{R}^d$ be a nonempty set. Suppose $\mathcal{P}$ is the set of all symmetric matrices $P$ such that

(4) $\begin{bmatrix} x \\ 1 \end{bmatrix}^\top P \begin{bmatrix} x \\ 1 \end{bmatrix} \ge 0 \quad \text{for all } x \in \mathcal{X}.$

We then say that $\mathcal{X}$ satisfies the QC defined by $\mathcal{P}$.
Note that by definition, $\mathcal{P}$ is a convex cone. Furthermore, we can write

(5) $\mathcal{X} \subseteq \bigcap_{P \in \mathcal{P}} \left\{ x : \begin{bmatrix} x \\ 1 \end{bmatrix}^\top P \begin{bmatrix} x \\ 1 \end{bmatrix} \ge 0 \right\}.$

In other words, we can over-approximate $\mathcal{X}$ by expressing it as a possibly infinite intersection of sets defined by quadratic inequalities.
Proposition (QC for hyper-rectangle)
The hyper-rectangle $\mathcal{X} = \{x \in \mathbb{R}^d : \underline{x} \le x \le \bar{x}\}$ satisfies the QC defined by $\mathcal{P}$, where

(6) $P = \begin{bmatrix} -2\Gamma & \Gamma(\underline{x} + \bar{x}) \\ (\underline{x} + \bar{x})^\top \Gamma & -2\underline{x}^\top \Gamma \bar{x} \end{bmatrix},$

where $\Gamma \in \mathbb{S}^d$ with $\Gamma_{ij} \ge 0$ for all $i, j$.
Our particular focus in this paper is on perturbations in the $\ell_\infty$ norm, $\mathcal{X} = \{x : \|x - x^\star\|_\infty \le \epsilon\}$, which are a particular class of hyper-rectangles with $\underline{x} = x^\star - \epsilon \mathbf{1}$ and $\bar{x} = x^\star + \epsilon \mathbf{1}$.
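To make the hyper-rectangle abstraction concrete, the sketch below (a hypothetical numpy illustration, not from the paper's implementation) builds one valid QC matrix for an $\ell_\infty$ ball using a diagonal, nonnegative multiplier matrix, and checks the quadratic constraint numerically on sampled points inside the ball:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
x_star = rng.normal(size=d)      # nominal (e.g., correctly classified) input
eps = 0.1                        # l_inf perturbation radius
lb, ub = x_star - eps, x_star + eps

# One valid QC matrix for the hyper-rectangle {x : lb <= x <= ub}: with a
# diagonal, nonnegative Gamma,
#   [x; 1]^T P [x; 1] = -2 * sum_i gamma_i (x_i - lb_i)(x_i - ub_i) >= 0
gamma = rng.uniform(0.1, 1.0, size=d)
Gamma = np.diag(gamma)
mid = (Gamma @ (lb + ub)).reshape(-1, 1)
P = np.block([[-2.0 * Gamma, mid],
              [mid.T, np.array([[-2.0 * lb @ Gamma @ ub]])]])

# Every point in the l_inf ball must satisfy the quadratic constraint.
worst = min(
    np.append(x, 1.0) @ P @ np.append(x, 1.0)
    for x in rng.uniform(lb, ub, size=(1000, d))
)
print("smallest QC value over samples:", worst)  # nonnegative
```

Each diagonal multiplier corresponds to one coordinate-wise interval constraint, which is what makes the diagonal parameterization so cheap.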
We can adapt the result of Proposition 3.1 to other uncertainty sets such as polytopes, zonotopes, and ellipsoids. We do not elaborate on these attack models in this paper. We instead assume that the uncertainty set can be abstracted by a quadratic constraint of the form (4). We will see in Section 4 that the matrix $P$ appears as a decision variable in the SDP. In this way, we can optimize the outer approximation of $\mathcal{X}$ to minimize the conservatism of the specific verification problem we want to solve.
3.2 Safety Specification Set
In our framework, we can consider specification sets that can be represented (or inner-approximated) by the intersection of finitely many quadratic inequalities:

(7) $\mathcal{S} = \bigcap_{i=1}^{m} \left\{ y \in \mathbb{R}^{n_f} : \begin{bmatrix} y \\ 1 \end{bmatrix}^\top S_i \begin{bmatrix} y \\ 1 \end{bmatrix} \le 0 \right\},$

where the $S_i$ are given symmetric matrices. This characterization includes ellipsoids and polytopes in particular. For instance, for a safety specification set described by the polytope $\mathcal{S} = \{y : Ay \le c\}$, the $S_i$ are given by

$S_i = \begin{bmatrix} 0 & a_i \\ a_i^\top & -2c_i \end{bmatrix},$

where $a_i^\top$ is the $i$-th row of $A$ and $c_i$ is the $i$-th entry of $c$.
3.3 Abstraction of Nonlinearities by Quadratic Constraints
One of the main difficulties in the analysis of neural networks is the presence of nonlinear activation functions. To simplify the analysis, instead of analyzing the network directly, our main idea is to remove the nonlinear activation functions from the network but retain the constraints they impose on the pre- and post-activation signals. Using this abstraction, any property (e.g., safety or robustness) that we can guarantee for the constrained network will automatically be satisfied by the original network as well. In the following, we show how we can encode many of the important properties of activation functions (e.g., monotonicity, bounded slope, and bounded values) using quadratic constraints. We first provide the formal definition below.
Definition
Let $\phi \colon \mathbb{R}^d \to \mathbb{R}^d$ and suppose $\mathcal{Q}$ is the set of all symmetric, possibly indefinite, matrices $Q$ such that the inequality

(8) $\begin{bmatrix} x \\ \phi(x) \\ 1 \end{bmatrix}^\top Q \begin{bmatrix} x \\ \phi(x) \\ 1 \end{bmatrix} \ge 0$

holds for all $x \in \mathbb{R}^d$. Then we say $\phi$ satisfies the QC defined by $\mathcal{Q}$.
We remark that our definition of a quadratic constraint differs slightly from the one used in robust control (Megretski & Rantzer, 1997) by including the constant $1$ in the vector surrounding the matrix $Q$, which allows us to incorporate affine constraints (e.g., bounded nonlinearities).
The derivation of quadratic constraints is function specific, but there are certain rules that apply to all of them, which we describe below.
3.3.1 Sloperestricted Nonlinearities
Consider the nonlinear function $\phi \colon \mathbb{R}^d \to \mathbb{R}^d$ with $\phi(0) = 0$. We say that $\phi$ is sector-bounded in the sector $[\alpha, \beta]$ ($\alpha \le \beta$) if the following condition holds for all $x \in \mathbb{R}^d$:

(9) $(\phi(x) - \alpha x)^\top (\phi(x) - \beta x) \le 0.$

Intuitively (and for the one-dimensional case $d = 1$), this inequality means that the function $y = \varphi(x)$ lies in the sector formed by the lines $y = \alpha x$ and $y = \beta x$. As an example, the ReLU function belongs to the sector $[0, 1]$. The sector condition, however, does not impose any restriction on the slope of the function. This motivates a more accurate description of nonlinearities that have bounded slope.
Definition (slope-restricted nonlinearity)
A nonlinear function $\phi \colon \mathbb{R}^d \to \mathbb{R}^d$ is slope-restricted on $[\alpha, \beta]$ ($\alpha \le \beta$) if

(10) $(u - v - \alpha(x - y))^\top (u - v - \beta(x - y)) \le 0$

for any two pairs $(x, u)$ and $(y, v)$, where $u = \phi(x)$ and $v = \phi(y)$ with an abuse of notation.

For the one-dimensional case ($d = 1$), the slope restriction condition in (10) states that the chord connecting any two points on the curve of the function has a slope that is at least $\alpha$ and at most $\beta$:

(11) $\alpha \le \frac{\varphi(x) - \varphi(y)}{x - y} \le \beta.$
Comparing (9) and (10), we see that the sector bound condition is a special case of the slope restriction condition with $(y, v) = (0, 0)$. As a result, a slope-restricted nonlinearity with $\phi(0) = 0$ is also sector-bounded; see Figure 2 for an illustration.
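The chord-slope interpretation in (11) is easy to check empirically. The sketch below (an illustrative numpy script; the sector values for ReLU and tanh are the standard $[0, 1]$ bounds) samples random scalar pairs and reports the range of chord slopes:

```python
import numpy as np

rng = np.random.default_rng(1)

def chord_slopes(phi, n=10000):
    """Slopes (phi(x) - phi(y)) / (x - y) over random scalar pairs."""
    x, y = rng.normal(size=n), rng.normal(size=n)
    keep = np.abs(x - y) > 1e-8
    return (phi(x[keep]) - phi(y[keep])) / (x[keep] - y[keep])

relu = lambda t: np.maximum(t, 0.0)
s_relu = chord_slopes(relu)
s_tanh = chord_slopes(np.tanh)

# Both functions are slope-restricted on [0, 1]: every chord slope lies there.
print("ReLU chord slopes in", (s_relu.min(), s_relu.max()))
print("tanh chord slopes in", (s_tanh.min(), s_tanh.max()))
```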
In the context of neural networks, our interest is in repeated nonlinearities of the form (3). Furthermore, the activation values might be bounded from below or above (e.g., the ReLU function, which outputs a nonnegative value). The quadratic inequality in (10) alone is too conservative and does not capture these properties. In the following, we discuss QCs for these properties.
3.3.2 Repeated Nonlinearities
Suppose $\varphi \colon \mathbb{R} \to \mathbb{R}$ is slope-restricted on $[\alpha, \beta]$ and let $\phi(x) = [\varphi(x_1), \dots, \varphi(x_d)]^\top$ be the vector-valued function constructed by component-wise repetition of $\varphi$. It is not hard to verify that $\phi$ is also slope-restricted in the same sector. However, this representation simply ignores the fact that all the nonlinearities that compose $\phi$ are the same. By taking advantage of this structure, we can refine the quadratic constraint that describes $\phi$. To be specific, for an input-output pair $(x, \phi(x))$, we can write the inequality

(12) $\big(\varphi(x_i) - \varphi(x_j) - \alpha(x_i - x_j)\big)\big(\varphi(x_i) - \varphi(x_j) - \beta(x_i - x_j)\big) \le 0$

for all $1 \le i < j \le d$. This particular QC considerably reduces conservatism, especially for deep networks, as it reasons about the coupling between the neurons throughout the entire network. By making an analogy to dynamical systems, we can interpret the neural network as a time-varying discrete-time dynamical system in which the same nonlinearity is repeated for all time indexes (the layer number). Then the QC in (12) couples all possible pairs of neurons.
Lemma (repeated nonlinearities)
Suppose $\varphi \colon \mathbb{R} \to \mathbb{R}$ is slope-restricted on $[\alpha, \beta]$. Then the vector-valued function $\phi(x) = [\varphi(x_1), \dots, \varphi(x_d)]^\top$ satisfies the QC defined by the matrices

(13) $Q = \sum_{1 \le i < j \le d} \lambda_{ij} \begin{bmatrix} -\alpha\beta (e_i - e_j)(e_i - e_j)^\top & \tfrac{\alpha + \beta}{2} (e_i - e_j)(e_i - e_j)^\top & 0 \\ \tfrac{\alpha + \beta}{2} (e_i - e_j)(e_i - e_j)^\top & -(e_i - e_j)(e_i - e_j)^\top & 0 \\ 0 & 0 & 0 \end{bmatrix},$

where $\lambda_{ij} \ge 0$, $1 \le i < j \le d$, and $e_i$ is the $i$-th unit vector in $\mathbb{R}^d$.
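The cross-coordinate constraint that underlies the repeated-nonlinearity QC can be verified numerically. The sketch below (illustrative; it assumes the product form of the slope condition with $\alpha = 0$, $\beta = 1$ for a repeated ReLU) exploits the fact that the same scalar nonlinearity maps every coordinate, so the chord between any two coordinates of one input-output pair must also satisfy the slope restriction:

```python
import numpy as np

rng = np.random.default_rng(2)
d, alpha, beta = 5, 0.0, 1.0          # repeated ReLU: slope-restricted on [0, 1]
relu = lambda t: np.maximum(t, 0.0)

violations = 0
for _ in range(200):
    x = rng.normal(size=d)
    y = relu(x)
    # The same scalar nonlinearity maps x_i -> y_i for every coordinate, so the
    # chord between (x_i, y_i) and (x_j, y_j) also has slope in [alpha, beta]:
    for i in range(d):
        for j in range(i + 1, d):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if (dy - alpha * dx) * (dy - beta * dx) > 1e-12:
                violations += 1
print("violations of the cross-coordinate slope condition:", violations)
```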
3.3.3 Bounded Nonlinearities
Finally, suppose the nonlinear function values are bounded, i.e., $\underline{\phi} \le \phi(x) \le \bar{\phi}$ for all $x \in \mathbb{R}^d$. This bound is equivalent to

(14) $(\phi(x) - \underline{\phi})^\top \Gamma \, (\bar{\phi} - \phi(x)) \ge 0, \quad \Gamma \in \mathbb{S}^d, \; \Gamma_{ij} \ge 0.$

We can write a similar inequality when the pre-activation values are known to be bounded.
3.4 Quadratic Constraints for Activation Functions
To connect the results of the previous two subsections to activation functions in neural networks, we recall the following result from (Heath & Wills, 2005).
Lemma (gradient of convex functions)
Consider a function $g \colon \mathbb{R}^d \to \mathbb{R}$ that is convex and $L$-smooth. Then the gradient function $\nabla g$ is slope-restricted in the sector $[0, L]$.
Notably, all activation functions used in deep neural networks are gradients of convex functions. They therefore belong to the class of sloperestricted nonlinearities, according to Lemma 3.4. We have the following result.
Proposition
The following statements hold true.

The ReLU function, $\varphi(x) = \max(0, x)$, is slope-restricted and sector-bounded in the sector $[0, 1]$.

The tanh function, $\varphi(x) = \tanh(x)$, is slope-restricted and sector-bounded in the sector $[0, 1]$.

The leaky ReLU function, $\varphi(x) = \max(ax, x)$ with $0 < a < 1$, is slope-restricted and sector-bounded in the sector $[a, 1]$.

The exponential linear unit (ELU), $\varphi(x) = x$ for $x \ge 0$ and $\varphi(x) = a(e^x - 1)$ for $x < 0$, with $0 < a \le 1$, is slope-restricted and sector-bounded in the sector $[0, 1]$.

The softmax function, $\phi(x)_i = e^{x_i} / \sum_{j} e^{x_j}$, is slope-restricted in the sector $[0, 1]$.
Proof
It is easy to show that each of the activation functions mentioned above is the gradient of a convex function.
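As a concrete instance of this observation, tanh is the derivative of the convex function $\log \cosh$. The sketch below (an illustrative numpy check, not part of the paper) verifies this by finite differences and also checks convexity via second differences:

```python
import numpy as np

# tanh is the derivative of the convex function g(x) = log(cosh(x)); since g is
# convex and 1-smooth, the lemma gives slope restriction on [0, 1] for tanh.
g = lambda t: np.log(np.cosh(t))
x = np.linspace(-4.0, 4.0, 801)

h = 1e-6                                   # step for the derivative check
num_grad = (g(x + h) - g(x - h)) / (2 * h)
grad_err = np.max(np.abs(num_grad - np.tanh(x)))

h2 = 1e-4                                  # larger step for the curvature check
second = g(x + h2) - 2 * g(x) + g(x - h2)  # ~ h2^2 * g''(x) >= 0 (convexity)

print("max |numerical g' - tanh| =", grad_err)
print("min second difference     =", second.min())
```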
Although the above rules can be used to guide the search for valid QCs for activation functions, a less conservative description requires a case-by-case treatment that further exploits the structure of the nonlinearity. For instance, the ReLU function lies precisely on the boundary of the sector $[0, 1]$. Indeed, it can be described by the following constraints (Raghunathan et al., 2018b):

(15) $y = \max(0, x) \iff y(y - x) = 0, \quad y \ge x, \quad y \ge 0.$

The first constraint confines $(x, y)$ to the boundary of the sector (the lines $y = 0$ and $y = x$), and the other two constraints prune these boundaries to recover the ReLU function. In the following lemma, we provide a full QC characterization of the ReLU function.
Lemma (QC for ReLU function)
The ReLU function, $\phi(x) = \max(0, x)$ (taken coordinate-wise), satisfies the QC defined by the matrices $Q$, where

(16) $Q = \begin{bmatrix} 0 & T & -\nu \\ T & -2T & \nu + \eta \\ -\nu^\top & (\nu + \eta)^\top & 0 \end{bmatrix}.$

Here $\nu, \eta \ge 0$ (element-wise) and $T \in \mathbb{S}^d$ is given by

$T = \sum_{i=1}^{d} \lambda_i e_i e_i^\top + \sum_{1 \le i < j \le d} \lambda_{ij} (e_i - e_j)(e_i - e_j)^\top,$

with $\lambda_i \in \mathbb{R}$ and $\lambda_{ij} \ge 0$.
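The exact characterization of ReLU in (15) is easy to confirm numerically. The following sketch (illustrative) checks the three constraints on random samples:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=1000)
y = np.maximum(x, 0.0)

# The three constraints in the exact ReLU characterization:
comp = np.max(np.abs(y * (y - x)))   # complementarity: y = 0 or y = x
geq_x = bool(np.all(y >= x))         # y >= x
geq_0 = bool(np.all(y >= 0.0))       # y >= 0
print("complementarity residual:", comp, "| y >= x:", geq_x, "| y >= 0:", geq_0)
```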
Deriving nonconservative QCs for activation functions other than ReLU is more complicated, as they do not lie on the boundary of any sector. However, by bounding these functions at multiple points with sector bounds of the form (10), we can obtain a substantially better over-approximation. In Figure 3, we illustrate this idea for the tanh function.
4 SDP for Onelayer Neural Networks
For simplicity of exposition, we start with the analysis of one-layer neural networks and then extend the results to the multi-layer case in Section 5. In the following theorem, we put all the pieces together to develop an SDP that can assert whether $f(\mathcal{X}) \subseteq \mathcal{S}$.
Theorem (SDP for one layer)
Consider a one-layer neural network described by the equations

(17) $x^1 = \phi(W^0 x^0 + b^0), \quad f(x^0) = W^1 x^1 + b^1.$

Suppose $x^0 \in \mathcal{X}$, where $\mathcal{X}$ is bounded and satisfies the QC defined by $\mathcal{P}$, i.e., for any $x^0 \in \mathcal{X}$ and any $P \in \mathcal{P}$,

(18) $\begin{bmatrix} x^0 \\ 1 \end{bmatrix}^\top P \begin{bmatrix} x^0 \\ 1 \end{bmatrix} \ge 0.$

Furthermore, suppose $\phi$ satisfies the QC defined by $\mathcal{Q}$, i.e., for any $x$ and any $Q \in \mathcal{Q}$,

(19) $\begin{bmatrix} x \\ \phi(x) \\ 1 \end{bmatrix}^\top Q \begin{bmatrix} x \\ \phi(x) \\ 1 \end{bmatrix} \ge 0.$
Consider the following matrix inequality:

(20) $M_{\mathrm{in}}(P) + M_{\mathrm{mid}}(Q) + M_{\mathrm{out}}(S) \preceq 0,$

where

(21a) $M_{\mathrm{in}}(P) = \begin{bmatrix} I & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}^\top P \begin{bmatrix} I & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix},$

(21b) $M_{\mathrm{mid}}(Q) = \begin{bmatrix} W^0 & 0 & b^0 \\ 0 & I & 0 \\ 0 & 0 & 1 \end{bmatrix}^\top Q \begin{bmatrix} W^0 & 0 & b^0 \\ 0 & I & 0 \\ 0 & 0 & 1 \end{bmatrix},$

(21c) $M_{\mathrm{out}}(S) = \begin{bmatrix} 0 & W^1 & b^1 \\ 0 & 0 & 1 \end{bmatrix}^\top S \begin{bmatrix} 0 & W^1 & b^1 \\ 0 & 0 & 1 \end{bmatrix},$

and $S$ is a given symmetric matrix. If (20) is feasible for some $(P, Q) \in \mathcal{P} \times \mathcal{Q}$, then

$\begin{bmatrix} f(x^0) \\ 1 \end{bmatrix}^\top S \begin{bmatrix} f(x^0) \\ 1 \end{bmatrix} \le 0 \quad \text{for all } x^0 \in \mathcal{X}.$
Theorem 4 states that if the matrix inequality (20) is feasible for some $(P, Q) \in \mathcal{P} \times \mathcal{Q}$, then we can certify that the network is safe with respect to the perturbation set $\mathcal{X}$ and the safety specification set $\mathcal{S}$, i.e., $f(\mathcal{X}) \subseteq \mathcal{S}$. Since $\mathcal{P}$ and $\mathcal{Q}$ are both convex cones, (20) is a linear matrix inequality (LMI) feasibility problem and, hence, can be efficiently solved via interior-point solvers for convex optimization.
The crux of our idea in the development of Theorem 4 is the S-procedure (Yakubovich, 1997), a technique to reason about multiple quadratic constraints that is frequently used in robust control and optimization (Boyd et al., 1994; Ben-Tal et al., 2009).
4.1 Certified Upper Bounds
In Theorem 4, we developed a feasibility problem to assert whether the output set is enclosed in the safe set $\mathcal{S}$. In particular, if $\mathcal{S}$ is described by the half-space

(22) $\mathcal{S} = \{y : c^\top y \le b\}$

with a given $c \in \mathbb{R}^{n_f}$ and $b \in \mathbb{R}$, then the feasibility of the LMI in Theorem 4 implies $c^\top f(x) \le b$ for all $x \in \mathcal{X}$, or equivalently,

(23) $\sup_{x \in \mathcal{X}} c^\top f(x) \le b.$

In other words, $b$ is a certified upper bound on the quantity $\sup_{x \in \mathcal{X}} c^\top f(x)$. Now, if we treat $b$ as a decision variable, we can optimize this bound by minimizing $b$ subject to the LMI constraint in (20). This is particularly useful for reachability analysis, where the goal is to over-approximate the output set by a polyhedron of the form

$\{y : c_i^\top y \le b_i, \; i = 1, \dots, m\},$

where the directions $c_i$ are given and the goal is to find the smallest value of $b_i$ for each $i$ such that $f(\mathcal{X})$ is contained in the polyhedron.
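An inexpensive, though unsound, complement to such certified bounds is random sampling: it produces empirical lower estimates of $\sup_{x \in \mathcal{X}} c_i^\top f(x)$ that any valid certificate must dominate. A hypothetical numpy sketch (names and sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
nx, nh, nf = 2, 8, 2
W0, b0 = rng.normal(size=(nh, nx)), rng.normal(size=nh)
W1, b1 = rng.normal(size=(nf, nh)), rng.normal(size=nf)
f = lambda x: W1 @ np.maximum(W0 @ x + b0, 0.0) + b1

x_star, eps = np.zeros(nx), 0.1
C = np.array([[1.0, 0.0], [0.0, 1.0],           # polyhedron directions c_i
              [-1.0, 0.0], [0.0, -1.0]])

# Empirical (unsound) lower estimates of sup_{x in X} c_i^T f(x): a valid
# certified bound b_i must be at least this large for every direction.
samples = np.vstack([x_star[None, :],
                     x_star + eps * (2.0 * rng.random((20000, nx)) - 1.0)])
outs = np.array([f(x) for x in samples])
b_emp = (outs @ C.T).max(axis=0)
print("empirical lower bounds per direction:", np.round(b_emp, 3))
```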
5 Multilayer Neural Networks
In this section, we turn to multi-layer neural networks. Assuming that all the activation functions are the same across the layers (repetition across layers), we can concatenate all the pre- and post-activation signals and form a more compact representation. To see this, we first introduce $\mathbf{x} = [x^{0\top}, x^{1\top}, \dots, x^{\ell\top}]^\top$, where $\ell$ is the number of hidden layers. Then, we can write (2) compactly as

(24a) $B\mathbf{x} = \phi(A\mathbf{x} + b),$

where

(24b) $A = \begin{bmatrix} W^0 & 0 & \cdots & 0 & 0 \\ 0 & W^1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & W^{\ell-1} & 0 \end{bmatrix}, \quad b = \begin{bmatrix} b^0 \\ b^1 \\ \vdots \\ b^{\ell-1} \end{bmatrix}, \quad B = \begin{bmatrix} 0 & I \end{bmatrix}.$
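One way to realize the compact form (24) is to place the layer weights on a block diagonal of $A$, stack the biases in $b$, and let $B$ select the post-activation coordinates of the concatenated vector. This is the assumption behind the sketch below (illustrative numpy), which checks that the stacked equation reproduces the layer recursion:

```python
import numpy as np

rng = np.random.default_rng(5)
relu = lambda t: np.maximum(t, 0.0)
dims = [2, 3, 3]                        # n0 and two hidden layer widths (l = 2)
Ws = [rng.normal(size=(dims[k + 1], dims[k])) for k in range(2)]
bs = [rng.normal(size=dims[k + 1]) for k in range(2)]

# layer-by-layer evaluation of the recursion (2)
x0 = rng.normal(size=dims[0])
x1 = relu(Ws[0] @ x0 + bs[0])
x2 = relu(Ws[1] @ x1 + bs[1])

# stacked form: with x = [x0; x1; x2], we should have B x = relu(A x + b)
x = np.concatenate([x0, x1, x2])
nh = dims[1] + dims[2]
A = np.zeros((nh, x.size))
A[:dims[1], :dims[0]] = Ws[0]                     # first block row: W0 on x0
A[dims[1]:, dims[0]:dims[0] + dims[1]] = Ws[1]    # second block row: W1 on x1
b = np.concatenate(bs)
B = np.zeros((nh, x.size))
B[:, dims[0]:] = np.eye(nh)             # B selects the post-activations [x1; x2]

gap = np.max(np.abs(B @ x - relu(A @ x + b)))
print("stacked-form residual:", gap)
```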
In the following result, we develop the multilayer counterpart of Theorem 4 for the model in (24).
Theorem (SDP for multiple layers)
Consider the multilayer neural network described by (24). Suppose and satisfy the quadratic constraints defined by and , respectively, as in (18) and (19). Consider the following LMI.
(25) 
where equationparentequation
(26a)  
(26b)  
(26c) 
and is a given symmetric matrix. If (25) is feasible for some , then
6 Discussion
In this section, we discuss the numerical aspects of our approach. To solve the SDPs, we used MOSEK (ApS, 2017) with CVX (CVX Research, 2012) on a 5-core personal computer with 8 GB of RAM. We start with the computational complexity of the proposed SDP.
6.1 Computational Complexity
Input set. The quadratic constraint that over-approximates the input set is parameterized by $n_x(n_x+1)/2$ decision variables (the entries of $\Gamma$), where $n_x$ is the input dimension. However, if we use a diagonal matrix $\Gamma$ in (6), we can reduce the number of decision variables to $n_x$ without significantly increasing the conservatism.
Activation functions. For a network with $n$ hidden neurons, if we use all possible quadratic constraints, the number of decision variables will be $O(n^2)$, which is the same number of decision variables as in (Raghunathan et al., 2018b). For the one-layer case, if we ignore repeated nonlinearities, we arrive at $O(n)$ decision variables. In our numerical experiments, we did not observe any additional conservatism after removing repeated nonlinearities across the neurons of the same layer. However, accounting for repeated nonlinearities was very effective for the case of multiple layers.
Safety specification set. The number of decision variables for the safety specification set depends on how we would like to bound the output set. For instance, for finding a single hyperplane, we have only one decision variable. For the case of ellipsoids, there will be $O(n_f^2)$ decision variables.
6.2 Experiments
In Figure 4, we compare the bounds for a network with a varying number of hidden layers and a fixed number of neurons per layer. We observe that the bounds obtained by the SDP remain relatively accurate, as a result of including repeated nonlinearities. In the supplementary material, we visualize the over-approximations for different scenarios. In Figure 5, we depict the effect of the number of layers on the quality of the approximation; in Figure 6, we show the effect of the number of hidden neurons on the quality of the approximation for a single-layer network; and in Figure 7, we vary the perturbation size $\epsilon$.
In Table 1, we report the computation time (CVX overhead included) for a network with one hidden layer and a varying number of neurons. We observe that the SDR of (Raghunathan et al., 2018b) runs out of memory for larger networks (at 1600 neurons on the computer used for this experiment), whereas the SDP of this paper can handle networks of up to 5000 neurons with the same memory.
Number of neurons   Solve time: SDP (this paper)   Solve time: SDR
200                 3.2                            2.7
400                 11.3                           20.4
800                 78.6                           149.1
1200                311.2                          799.1
1600                1072.6                         OOM
2000                1249.7                         OOM
3000                3126.5                         OOM
7 Conclusion
In this paper, we proposed an SDP for robustness analysis and safety verification of feedforward fully-connected neural networks with general activation functions. We used quadratic constraints to abstract various elements of the problem, namely, the input uncertainty set, the safety specification set, and the nonlinear activation functions. To reduce conservatism, we developed quadratic constraints that are able to reason about the coupling between neurons throughout the entire network. We focused on $\ell_\infty$-norm-bounded uncertainty sets; however, we can consider any set that can be over-approximated by quadratic constraints, such as ellipsoids, polytopes, and zonotopes.
References
 ApS (2017) ApS, M. The MOSEK optimization toolbox for MATLAB manual. Version 8.1., 2017. URL http://docs.mosek.com/8.1/toolbox/index.html.
 Bastani et al. (2016) Bastani, O., Ioannou, Y., Lampropoulos, L., Vytiniotis, D., Nori, A., and Criminisi, A. Measuring neural net robustness with constraints. In Advances in neural information processing systems, pp. 2613–2621, 2016.
 Ben-Tal et al. (2009) Ben-Tal, A., El Ghaoui, L., and Nemirovski, A. Robust optimization, volume 28. Princeton University Press, 2009.
 Bojarski et al. (2016) Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L. D., Monfort, M., Muller, U., Zhang, J., et al. End to end learning for selfdriving cars. arXiv preprint arXiv:1604.07316, 2016.
 Boyd et al. (1994) Boyd, S., El Ghaoui, L., Feron, E., and Balakrishnan, V. Linear matrix inequalities in system and control theory, volume 15. Siam, 1994.
 CVX Research (2012) CVX Research, I. CVX: Matlab software for disciplined convex programming, version 2.0. http://cvxr.com/cvx, August 2012.
 D’amato et al. (2001) D’amato, F., Rotea, M. A., Megretski, A., and Jönsson, U. New results for analysis of systems with repeated nonlinearities. Automatica, 37(5):739–747, 2001.
 Dvijotham et al. (2018) Dvijotham, K., Stanforth, R., Gowal, S., Mann, T., and Kohli, P. A dual approach to scalable verification of deep networks. arXiv preprint arXiv:1803.06567, 2018.
 Ehlers (2017) Ehlers, R. Formal verification of piecewise linear feedforward neural networks. In International Symposium on Automated Technology for Verification and Analysis, pp. 269–286. Springer, 2017.
 Goodfellow et al. (2014) Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
 Heath & Wills (2005) Heath, W. P. and Wills, A. G. Zames-Falb multipliers for quadratic programming. In Proceedings of the 44th IEEE Conference on Decision and Control and the European Control Conference (CDC-ECC'05), pp. 963–968. IEEE, 2005.
 Hein & Andriushchenko (2017) Hein, M. and Andriushchenko, M. Formal guarantees on the robustness of a classifier against adversarial manipulation. In Advances in Neural Information Processing Systems, pp. 2266–2276, 2017.
 Huang et al. (2017) Huang, X., Kwiatkowska, M., Wang, S., and Wu, M. Safety verification of deep neural networks. In International Conference on Computer Aided Verification, pp. 3–29. Springer, 2017.
 Ivanov et al. (2018) Ivanov, R., Weimer, J., Alur, R., Pappas, G. J., and Lee, I. Verisig: verifying safety properties of hybrid systems with neural network controllers. arXiv preprint arXiv:1811.01828, 2018.
 Julian et al. (2016) Julian, K. D., Lopez, J., Brush, J. S., Owen, M. P., and Kochenderfer, M. J. Policy compression for aircraft collision avoidance systems. In Digital Avionics Systems Conference (DASC), 2016 IEEE/AIAA 35th, pp. 1–10. IEEE, 2016.
 Katz et al. (2017) Katz, G., Barrett, C., Dill, D. L., Julian, K., and Kochenderfer, M. J. Reluplex: An efficient smt solver for verifying deep neural networks. In International Conference on Computer Aided Verification, pp. 97–117. Springer, 2017.
 Kolter & Wong (2017) Kolter, J. Z. and Wong, E. Provable defenses against adversarial examples via the convex outer adversarial polytope. arXiv preprint arXiv:1711.00851, 1(2):3, 2017.
 Kulkarni & Safonov (2002) Kulkarni, V. V. and Safonov, M. G. All multipliers for repeated monotone nonlinearities. IEEE Transactions on Automatic Control, 47(7):1209–1212, 2002.
 Kurakin et al. (2016) Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
 Lomuscio & Maganti (2017) Lomuscio, A. and Maganti, L. An approach to reachability analysis for feedforward relu neural networks. arXiv preprint arXiv:1706.07351, 2017.
 Megretski & Rantzer (1997) Megretski, A. and Rantzer, A. System analysis via integral quadratic constraints. IEEE Transactions on Automatic Control, 42(6):819–830, 1997.
 Mirman et al. (2018) Mirman, M., Gehr, T., and Vechev, M. Differentiable abstract interpretation for provably robust neural networks. In International Conference on Machine Learning, pp. 3575–3583, 2018.

 Moosavi-Dezfooli et al. (2016) Moosavi-Dezfooli, S.-M., Fawzi, A., and Frossard, P. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2574–2582, 2016.
 Moosavi-Dezfooli et al. (2017) Moosavi-Dezfooli, S.-M., Fawzi, A., Fawzi, O., and Frossard, P. Universal adversarial perturbations. arXiv preprint, 2017.

 Papernot et al. (2016) Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z. B., and Swami, A. The limitations of deep learning in adversarial settings. In Security and Privacy (EuroS&P), 2016 IEEE European Symposium on, pp. 372–387. IEEE, 2016.
 Pulina & Tacchella (2012) Pulina, L. and Tacchella, A. Challenging SMT solvers to verify neural networks. AI Communications, 25(2):117–135, 2012.
 Raghunathan et al. (2018a) Raghunathan, A., Steinhardt, J., and Liang, P. Certified defenses against adversarial examples. arXiv preprint arXiv:1801.09344, 2018a.
 Raghunathan et al. (2018b) Raghunathan, A., Steinhardt, J., and Liang, P. S. Semidefinite relaxations for certifying robustness to adversarial examples. In Advances in Neural Information Processing Systems, pp. 10900–10910, 2018b.
 Szegedy et al. (2013) Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
 Tjeng et al. (2017) Tjeng, V., Xiao, K., and Tedrake, R. Evaluating robustness of neural networks with mixed integer programming. arXiv preprint arXiv:1711.07356, 2017.
 Weng et al. (2018) Weng, T.W., Zhang, H., Chen, H., Song, Z., Hsieh, C.J., Boning, D., Dhillon, I. S., and Daniel, L. Towards fast computation of certified robustness for relu networks. arXiv preprint arXiv:1804.09699, 2018.
 Xiang et al. (2018a) Xiang, W., Musau, P., Wild, A. A., Lopez, D. M., Hamilton, N., Yang, X., Rosenfeld, J., and Johnson, T. T. Verification for machine learning, autonomy, and neural networks survey. arXiv preprint arXiv:1810.01989, 2018a.
 Xiang et al. (2018b) Xiang, W., Tran, H.D., and Johnson, T. T. Output reachable set estimation and verification for multilayer neural networks. IEEE transactions on neural networks and learning systems, (99):1–7, 2018b.
 Yakubovich (1997) Yakubovich, V. S-procedure in nonlinear control theory. Vestnik Leningrad Univ. Math., 4:73–93, 1997.
 Zheng et al. (2016) Zheng, S., Song, Y., Leung, T., and Goodfellow, I. Improving the robustness of deep neural networks via stability training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4480–4488, 2016.
Appendix A Appendix
A.1 Proof of Proposition 3.1
Note that the inequality $\underline{x} \le x \le \bar{x}$ implies

(27) $\Gamma_{ij}(x_i - \underline{x}_i)(\bar{x}_j - x_j) \ge 0$

for all $i, j$, where $\Gamma_{ij} \ge 0$. Summing the preceding inequality over all $i, j$ and denoting $\Gamma = [\Gamma_{ij}]$, we arrive at the claimed inequality

$-2(x - \underline{x})^\top \Gamma (x - \bar{x}) = \begin{bmatrix} x \\ 1 \end{bmatrix}^\top P \begin{bmatrix} x \\ 1 \end{bmatrix} \ge 0,$

with $P$ as in (6).
A.2 Proof of Lemma 3.3.2
A.3 Proof of Lemma 3.4
Consider the equivalence in (15) for the $i$-th coordinate of the activation function $y = \phi(x) = \max(0, x)$:

(30) $y_i(y_i - x_i) = 0, \quad y_i - x_i \ge 0, \quad y_i \ge 0.$

Multiplying these constraints by $\lambda_i \in \mathbb{R}$, $\nu_i \ge 0$, and $\eta_i \ge 0$, respectively, and adding them together, we obtain

(31) $\lambda_i y_i(x_i - y_i) + \nu_i(y_i - x_i) + \eta_i y_i \ge 0,$

where the first term is identically zero by complementarity. Substituting $y = \phi(x)$ back into (31) and summing over $i = 1, \dots, d$, we get

(32) $\sum_{i=1}^{d} \big( \lambda_i \phi(x)_i (x_i - \phi(x)_i) + \nu_i(\phi(x)_i - x_i) + \eta_i \phi(x)_i \big) \ge 0.$

On the other hand, since $\phi$ is a repeated nonlinearity that is slope-restricted on $[0, 1]$, it satisfies the inequality

(33) $\sum_{1 \le i < j \le d} \lambda_{ij} \big( \phi(x)_i - \phi(x)_j \big) \big( (x_i - x_j) - (\phi(x)_i - \phi(x)_j) \big) \ge 0$

according to Lemma 3.3.2 with $\alpha = 0$ and $\beta = 1$, where $\lambda_{ij} \ge 0$. Adding (33) and (32) yields the desired QC for the ReLU function, with the matrix $Q$ given in the lemma.
A.4 Proof of Theorem 4
Using the assumption that $\phi$ satisfies the QC defined by $\mathcal{Q}$, we can write the following QC from the identity $x^1 = \phi(W^0 x^0 + b^0)$:

(34) $\begin{bmatrix} W^0 x^0 + b^0 \\ x^1 \\ 1 \end{bmatrix}^\top Q \begin{bmatrix} W^0 x^0 + b^0 \\ x^1 \\ 1 \end{bmatrix} \ge 0.$

By substituting the identity

(35) $\begin{bmatrix} W^0 x^0 + b^0 \\ x^1 \\ 1 \end{bmatrix} = \begin{bmatrix} W^0 & 0 & b^0 \\ 0 & I & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x^0 \\ x^1 \\ 1 \end{bmatrix}$

back into (34) and denoting $z = [x^{0\top}, x^{1\top}, 1]^\top$, we can write the inequality

(36) $z^\top M_{\mathrm{mid}}(Q) z \ge 0$

for all $x^0 \in \mathcal{X}$. Next, by assumption, $\mathcal{X}$ satisfies the QC defined by $\mathcal{P}$. The corresponding quadratic inequality can be written as

$\begin{bmatrix} x^0 \\ 1 \end{bmatrix}^\top P \begin{bmatrix} x^0 \\ 1 \end{bmatrix} \ge 0$

for all $x^0 \in \mathcal{X}$. The above QC can be written as

(37) $z^\top M_{\mathrm{in}}(P) z \ge 0$

for all $x^0 \in \mathcal{X}$ and all $P \in \mathcal{P}$. Suppose (20) holds for some $(P, Q) \in \mathcal{P} \times \mathcal{Q}$. By left- and right-multiplying both sides of (20) by $z^\top$ and $z$, respectively, we obtain

$z^\top M_{\mathrm{in}}(P) z + z^\top M_{\mathrm{mid}}(Q) z + z^\top M_{\mathrm{out}}(S) z \le 0.$

By (36) and (37), the first two terms are nonnegative. Therefore, the last term on the left-hand side must be nonpositive for all $x^0 \in \mathcal{X}$, or, equivalently,

(38) $z^\top M_{\mathrm{out}}(S) z \le 0.$

Using the relation $f(x^0) = W^1 x^1 + b^1$, the above inequality can be written as

$\begin{bmatrix} f(x^0) \\ 1 \end{bmatrix}^\top S \begin{bmatrix} f(x^0) \\ 1 \end{bmatrix} \le 0.$

The proof is now complete.
A.5 Proof of Theorem 5
Since $\phi$ satisfies the QC defined by $\mathcal{Q}$, we can write the following QC from the identity $B\mathbf{x} = \phi(A\mathbf{x} + b)$:

(39) $\begin{bmatrix} A\mathbf{x} + b \\ B\mathbf{x} \\ 1 \end{bmatrix}^\top Q \begin{bmatrix} A\mathbf{x} + b \\ B\mathbf{x} \\ 1 \end{bmatrix} \ge 0.$

The preceding inequality is equivalent to

(40) $\begin{bmatrix} \mathbf{x} \\ 1 \end{bmatrix}^\top M_{\mathrm{mid}}(Q) \begin{bmatrix} \mathbf{x} \\ 1 \end{bmatrix} \ge 0$

for all $\mathbf{x}$ generated by the network with $x^0 \in \mathcal{X}$. Next, by assumption, $\mathcal{X}$ satisfies the QC defined by $\mathcal{P}$. The corresponding quadratic inequality can be written as

(41) $\begin{bmatrix} \mathbf{x} \\ 1 \end{bmatrix}^\top M_{\mathrm{in}}(P) \begin{bmatrix} \mathbf{x} \\ 1 \end{bmatrix} \ge 0$

for all $x^0 \in \mathcal{X}$ and all $P \in \mathcal{P}$. Suppose (25) holds for some $(P, Q) \in \mathcal{P} \times \mathcal{Q}$. By left- and right-multiplying both sides of (25) by $[\mathbf{x}^\top, 1]$ and $[\mathbf{x}^\top, 1]^\top$, respectively, we obtain

$\begin{bmatrix} \mathbf{x} \\ 1 \end{bmatrix}^\top \big( M_{\mathrm{in}}(P) + M_{\mathrm{mid}}(Q) + M_{\mathrm{out}}(S) \big) \begin{bmatrix} \mathbf{x} \\ 1 \end{bmatrix} \le 0$

for all $x^0 \in \mathcal{X}$. The first two quadratic terms are nonnegative by (40) and (41); therefore, the last quadratic term must be nonpositive, from where we can write

(42) $\begin{bmatrix} \mathbf{x} \\ 1 \end{bmatrix}^\top M_{\mathrm{out}}(S) \begin{bmatrix} \mathbf{x} \\ 1 \end{bmatrix} \le 0$

for all $x^0 \in \mathcal{X}$. Using the relation $f(x^0) = W^\ell x^\ell + b^\ell$ from (24), the above inequality can be written as

$\begin{bmatrix} f(x^0) \\ 1 \end{bmatrix}^\top S \begin{bmatrix} f(x^0) \\ 1 \end{bmatrix} \le 0.$

The proof is now complete.