Efficient and Accurate Estimation of Lipschitz Constants for Deep Neural Networks

06/12/2019 ∙ by Mahyar Fazlyab, et al.

Tight estimation of the Lipschitz constant for deep neural networks (DNNs) is useful in many applications ranging from robustness certification of classifiers to stability analysis of closed-loop systems with reinforcement learning controllers. Existing methods in the literature for estimating the Lipschitz constant suffer from either lack of accuracy or poor scalability. In this paper, we present a convex optimization framework to compute guaranteed upper bounds on the Lipschitz constant of DNNs both accurately and efficiently. Our main idea is to interpret activation functions as gradients of convex potential functions. Hence, they satisfy certain properties that can be described by quadratic constraints. This particular description allows us to pose the Lipschitz constant estimation problem as a semidefinite program (SDP). The resulting SDP can be adapted to increase either the estimation accuracy (by capturing the interaction between activation functions of different layers) or scalability (by decomposition and parallel implementation). We illustrate the utility of our approach with a variety of experiments on randomly generated networks and on classifiers trained on the MNIST and Iris datasets. In particular, we experimentally demonstrate that our Lipschitz bounds are the most accurate compared to those in the literature. We also study the impact of adversarial training methods on the Lipschitz bounds of the resulting classifiers and show that our bounds can be used to efficiently provide robustness guarantees.

1 Introduction

A function $f : \mathbb{R}^n \to \mathbb{R}^m$ is globally Lipschitz continuous on $\mathcal{X} \subseteq \mathbb{R}^n$ if there exists a nonnegative constant $L \geq 0$ such that

$$\|f(x) - f(y)\| \leq L \|x - y\| \quad \text{for all } x, y \in \mathcal{X}. \tag{1}$$

The smallest such $L$ is called the Lipschitz constant of $f$. The Lipschitz constant is the maximum ratio between variations in the output space and variations in the input space of $f$ and thus is a measure of sensitivity of the function with respect to input perturbations.
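As a quick illustration of this definition, the sketch below samples pairs of points and reports the largest observed ratio between output and input variations. Any such ratio is only a lower bound on the Lipschitz constant, not the guaranteed upper bound sought in this paper. The names `empirical_lipschitz` and `toy_map` are illustrative assumptions.

```python
import itertools
import math


def l2_norm(v):
    """Euclidean norm of a sequence of floats."""
    return math.sqrt(sum(c * c for c in v))


def empirical_lipschitz(f, points):
    """Largest ratio ||f(x)-f(y)|| / ||x-y|| over all pairs of sample points.

    This is a lower bound on the Lipschitz constant of f, never an upper bound.
    """
    best = 0.0
    for x, y in itertools.combinations(points, 2):
        dx = l2_norm([a - b for a, b in zip(x, y)])
        if dx == 0.0:
            continue  # identical points carry no information
        dy = l2_norm([a - b for a, b in zip(f(x), f(y))])
        best = max(best, dy / dx)
    return best


# Toy map (assumption for illustration): a linear map with gains 2 and 1,
# whose Lipschitz constant in the 2-norm is 2.
def toy_map(x):
    return [2.0 * x[0], -1.0 * x[1]]


grid = [(i * 0.5, j * 0.5) for i in range(-2, 3) for j in range(-2, 3)]
estimate = empirical_lipschitz(toy_map, grid)
```

For this linear map the sampled ratio reaches the true constant because the grid contains pairs that differ only in the first coordinate; for a general network the sampled value can fall well below it.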

When a function is characterized by a deep neural network (DNN), tight bounds on its Lipschitz constant can be extremely useful in a variety of applications. In classification tasks, for instance, the Lipschitz constant can be used as a certificate of robustness of a neural network classifier to adversarial attacks if it is estimated tightly [szegedy2013intriguing]. In deep reinforcement learning, tight bounds on the Lipschitz constant of a DNN-based controller can be directly used to analyze the stability of the closed-loop system. Lipschitz regularity can also play a key role in the derivation of generalization bounds [bartlett2017spectrally]. In these applications and many others, it is essential to have tight bounds on the Lipschitz constant of DNNs. However, as DNNs have highly complex and non-linear structures, estimating the Lipschitz constant both accurately and efficiently has remained a significant challenge.

Our contributions.

In this paper we propose a novel convex programming framework to derive tight bounds on the global Lipschitz constant of deep feed-forward neural networks. Our framework yields significantly more accurate bounds compared to the state-of-the-art and lends itself to a distributed implementation, leading to efficient computation of the bounds for large-scale networks.

Our approach. We use the fact that nonlinear activation functions used in neural networks are gradients of convex functions; hence, as operators, they satisfy certain properties that can be abstracted as quadratic constraints on their input-output values. This particular abstraction allows us to pose the Lipschitz estimation problem as a semidefinite program (SDP), which we call LipSDP. A striking feature of LipSDP is its flexibility to span the trade-off between estimation accuracy and computational efficiency by adding or removing extra decision variables. In particular, for a neural network with $\ell$ layers and a total of $n$ hidden neurons, the number of decision variables can vary from $\ell$ (least accurate but most scalable) to $O(n^2)$ (most accurate but least scalable). As such, we derive several distinct yet related formulations of LipSDP that span this trade-off. To scale each variant of LipSDP to larger networks, we also propose a distributed implementation.

Our results. We illustrate our approach in a variety of experiments on both randomly generated networks and networks trained on the MNIST [lecun1998mnist] and Iris [Dua:2019] datasets. First, we show empirically that our Lipschitz bounds are the most accurate compared to all other existing methods of which we are aware. In particular, our experiments on neural networks trained for MNIST show that our bounds almost coincide with the true Lipschitz constant and outperform all comparable methods. For details, see Figure 2(a). Furthermore, we investigate the effect of two robust training procedures [madry2017towards, kolter2017provable] on the Lipschitz constant for networks trained on the MNIST dataset. Our results suggest that robust training procedures significantly decrease the Lipschitz constant of the resulting classifiers. Moreover, we use the Lipschitz bound for two robust training procedures to derive non-vacuous lower bounds on the minimum adversarial perturbation necessary to change the classification of any instance from the test set. For details, see Figure 4.

Related work. The problem of estimating the Lipschitz constant for neural networks has been studied in several works. In [szegedy2013intriguing], the authors estimate the global Lipschitz constant of DNNs by the product of Lipschitz constants of individual layers. This approach is scalable and general but yields trivial bounds. We are only aware of two other methods that give non-trivial upper bounds on the global Lipschitz constant of fully-connected neural networks and can scale to networks with more than two hidden layers. In [combettes_lipschitz], Combettes and Pesquet derive bounds on Lipschitz constants by treating the activation functions as non-expansive averaged operators. The resulting algorithm scales well with the number of hidden units per layer, but very poorly (in fact, exponentially) with the number of layers. In [virmaux2018lipschitz], Virmaux and Scaman decompose the weight matrices of a neural network via singular value decomposition and approximately solve a convex maximization problem over the unit cube. Notably, estimating the Lipschitz constant using the method in [virmaux2018lipschitz] is intractable even for small networks; indeed, the authors of [virmaux2018lipschitz] use a greedy algorithm to compute a bound, which may underapproximate the Lipschitz constant. Bounding Lipschitz constants for the specific case of convolutional neural networks (CNNs) has also been addressed in [balan2017lipschitz, zou2018lipschitz, bartlett2017spectrally].

Using Lipschitz bounds in the context of adversarial robustness and safety verification has also been addressed in several works [weng2018evaluating, ruan2018reachability, weng2018towards]. In particular, in [weng2018evaluating], the authors convert the robustness analysis problem into a local Lipschitz constant estimation problem, where they estimate this local constant by a set of independently and identically sampled local gradients. This algorithm is scalable but is not guaranteed to provide upper bounds. In a similar work, the authors of [weng2018towards] exploit the piece-wise linear structure of ReLU functions to estimate the local Lipschitz constant of neural networks. In [fazlyab2019safety], the authors use quadratic constraints and semidefinite programming to analyze local (point-wise) robustness of neural networks. In contrast, our Lipschitz bounds can be used as a global certificate of robustness and are agnostic to the choice of the test data.

1.1 Motivating applications

We now enumerate several applications that highlight the importance of estimating the Lipschitz constant of DNNs accurately and efficiently.

Robustness certification of classifiers. In response to fragility of DNNs to adversarial attacks, there has been considerable effort in recent years to improve the robustness of neural networks against adversarial attacks and input perturbations [goodfellow6572explaining, papernot2016distillation, zheng2016improving, kurakin2016adversarial, madry2017towards, kolter2017provable]. In order to certify and/or improve the robustness of neural networks, one must be able to bound the possible outputs of the neural network over a region of input space. This can be done either locally around a specific input [bastani2016measuring, tjeng2017evaluating, gehr2018ai2, singh2018fast, dutta2018output, raghunathan2018certified, raghunathan2018semidefinite, fazlyab2019safety, kolter2017provable, jordan2019provable, wong2018scaling, zhang2018efficient], or globally by bounding the sensitivity of the function to input perturbations, i.e., the Lipschitz constant [huster2018limitations, szegedy2013intriguing, qian2018l2, weng2018evaluating]. Indeed, tight upper bounds on the Lipschitz constant can be used to derive non-vacuous lower bounds on the magnitudes of perturbations necessary to change the decision of neural networks. Finally, an efficient computation of these bounds can be useful in either assessing robustness after training [raghunathan2018certified, raghunathan2018semidefinite, fazlyab2019safety] or promoting robustness during training [kolter2017provable, tsuzuku2018lipschitz]. In the experiments section, we explore this application in depth.

Stability analysis of closed-loop systems with learning controllers. A central problem in learning-based control is to provide stability or safety guarantees for a feedback control loop when a learning-enabled component, such as a deep neural network, is introduced in the loop [aswani2013provably, berkenkamp2017safe, jin2018stability]. The Lipschitz constant of a neural network controller bounds its gain. Therefore a tight estimate can be useful for certifying the stability of the closed-loop system.

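Notation. We denote the set of real $n$-dimensional vectors by $\mathbb{R}^n$, the set of $m \times n$-dimensional matrices by $\mathbb{R}^{m \times n}$, and the $n$-dimensional identity matrix by $I_n$. We denote by $\mathbb{S}^n$, $\mathbb{S}^n_{+}$, and $\mathbb{S}^n_{++}$ the sets of $n$-by-$n$ symmetric, positive semidefinite, and positive definite matrices, respectively. The $p$-norm ($p \geq 1$) is denoted by $\|\cdot\|_p$. The $\ell_2$-norm of a matrix $W \in \mathbb{R}^{m \times n}$ is the largest singular value of $W$. We denote the $i$-th unit vector in $\mathbb{R}^n$ by $e_i$. We write $\mathrm{diag}(a_1, \dots, a_n)$ for a diagonal matrix whose diagonal entries starting in the upper left corner are $a_1, \dots, a_n$.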

2 LipSDP: Lipschitz certificates via semidefinite programming

2.1 Problem statement

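Consider an $\ell$-layer feed-forward neural network $f : \mathbb{R}^{n_0} \to \mathbb{R}^{n_{\ell+1}}$ described by the following recursive equations:

$$x^0 = x, \qquad x^{k+1} = \phi(W^k x^k + b^k) \quad \text{for } k = 0, \dots, \ell - 1, \qquad f(x) = W^\ell x^\ell + b^\ell. \tag{2}$$

Here $x \in \mathbb{R}^{n_0}$ is an input to the network and $W^k \in \mathbb{R}^{n_{k+1} \times n_k}$ and $b^k \in \mathbb{R}^{n_{k+1}}$ are the weight matrix and bias vector for the $k$-th layer. The function $\phi$ is the concatenation of activation functions at each layer, i.e., it is of the form $\phi(x) = [\varphi(x_1) \ \cdots \ \varphi(x_n)]^\top$. In this paper, our goal is to find tight bounds on the Lipschitz constant of the map $x \mapsto f(x)$ in the $\ell_2$-norm. More precisely, we wish to find the smallest constant $L_2 \geq 0$ such that $\|f(x) - f(y)\|_2 \leq L_2 \|x - y\|_2$ for all $x, y \in \mathbb{R}^{n_0}$.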

The main source of difficulty in solving this problem is the presence of the nonlinear activation functions. To combat this difficulty, our main idea is to abstract these activation functions by a set of constraints that they impose on their input and output values. Then any property (including Lipschitz continuity) that is satisfied by our abstraction will also be satisfied by the original network.

2.2 Description of activation functions by quadratic constraints

In this section, we introduce several definitions and lemmas that characterize our abstraction of nonlinear activation functions. These results are crucial to the formulation of an SDP that can bound the Lipschitz constants of networks in Section 2.3.

Definition (Slope-restricted non-linearity)

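A function $\varphi : \mathbb{R} \to \mathbb{R}$ is slope-restricted on $[\alpha, \beta]$, where $0 \leq \alpha < \beta < \infty$, if

$$\alpha \leq \frac{\varphi(y) - \varphi(x)}{y - x} \leq \beta \quad \text{for all } x, y \in \mathbb{R}, \ x \neq y. \tag{3}$$

The inequality in (3) simply states that the slope of the chord connecting any two points on the curve of the function is at least $\alpha$ and at most $\beta$ (see Figure 1). By multiplying all sides of (3) by $(y - x)^2$, we can write the slope restriction condition as $\alpha (y - x)^2 \leq (\varphi(y) - \varphi(x))(y - x) \leq \beta (y - x)^2$. By the left inequality, the operator $\varphi$ is strongly monotone with parameter $\alpha$ [ryu2016primer], or equivalently the anti-derivative function $\int \varphi(x)\,dx$ is strongly convex with parameter $\alpha$. By the right-hand side inequality, $\varphi$ is one-sided Lipschitz with parameter $\beta$. Altogether, the preceding inequalities state that the anti-derivative function $\int \varphi(x)\,dx$ is $\alpha$-strongly convex and $\beta$-smooth.

Note that all activation functions used in deep learning satisfy the slope restriction condition in (3) for some $0 \leq \alpha < \beta < \infty$. For instance, the ReLU, tanh, and sigmoid activation functions are all slope-restricted with $\alpha = 0$ and $\beta = 1$. More details can be found in [fazlyab2019safety].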
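The claim about ReLU, tanh, and sigmoid can be spot-checked numerically. The sketch below evaluates chord slopes on a finite grid and tests whether they all lie in the interval from 0 to 1. This is a necessary check on sampled points only, not a proof, and the helper names are illustrative assumptions.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def chord_slopes(phi, points):
    """Slopes of all chords of phi between distinct sample points."""
    return [
        (phi(y) - phi(x)) / (y - x)
        for i, x in enumerate(points)
        for y in points[i + 1:]
    ]


def is_slope_restricted(phi, points, alpha=0.0, beta=1.0, tol=1e-12):
    """Check alpha <= chord slope <= beta on the sample (a necessary check only)."""
    return all(alpha - tol <= s <= beta + tol for s in chord_slopes(phi, points))


samples = [k / 10.0 for k in range(-50, 51)]  # grid on [-5, 5]
results = {
    name: is_slope_restricted(phi, samples)
    for name, phi in [("relu", relu), ("tanh", math.tanh), ("sigmoid", sigmoid)]
}
```

A function such as `lambda x: 2 * x` fails the same check with the default interval, since every chord has slope 2.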

Definition (Incremental Quadratic Constraint [accikmecse2011observers])

A function satisfies the incremental quadratic constraint defined by if for any and ,

(4)

In the above definition, is the set of all multiplier matrices that characterize , and is a convex cone by definition. As an example, the softmax operator is the gradient of the convex function . This function is smooth and strongly convex with parameters and [boyd2004convex]. For this class of functions, it is known that the gradient function satisfies the quadratic inequality [nesterov2013introductory]

(5)

Therefore, the softmax operator satisfies the incremental quadratic constraint defined by , where the middle matrix in the above inequality.

To see the connection between incremental quadratic constraints and slope-restricted nonlinearities, note that (3) can be equivalently written as the single inequality

(6)

Multiplying through by and rearranging terms, we can write (6) as

(7)

which, in view of Definition 2.2, is an incremental quadratic constraint for . From this perspective, incremental quadratic constraints generalize the notion of slope-restricted nonlinearities to multi-variable vector-valued nonlinearities.
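For a scalar activation, the step from the slope restriction to a quadratic form can be written out explicitly. The following is a sketch that assumes the two-sided bound in (3) reads $\alpha \leq (\varphi(y) - \varphi(x))/(y - x) \leq \beta$; the shorthand $\Delta x$ and $\Delta\varphi$ is introduced here for illustration and is not notation from the text.

```latex
% Shorthand (assumed): \Delta x = y - x \neq 0, \quad \Delta\varphi = \varphi(y) - \varphi(x).
\begin{align*}
\alpha \le \frac{\Delta\varphi}{\Delta x} \le \beta
  &\iff \left(\frac{\Delta\varphi}{\Delta x} - \alpha\right)
        \left(\frac{\Delta\varphi}{\Delta x} - \beta\right) \le 0 \\
  &\iff (\Delta\varphi - \alpha\,\Delta x)(\Delta\varphi - \beta\,\Delta x) \le 0
     \qquad \text{(multiply by } \Delta x^2 > 0\text{)} \\
  &\iff -2\alpha\beta\,\Delta x^2 + 2(\alpha+\beta)\,\Delta x\,\Delta\varphi
        - 2\,\Delta\varphi^2 \ge 0 \\
  &\iff \begin{bmatrix} \Delta x \\ \Delta\varphi \end{bmatrix}^{\top}
        \begin{bmatrix} -2\alpha\beta & \alpha+\beta \\ \alpha+\beta & -2 \end{bmatrix}
        \begin{bmatrix} \Delta x \\ \Delta\varphi \end{bmatrix} \ge 0 .
\end{align*}
```

The last line has the shape of an incremental quadratic constraint: a fixed symmetric matrix sandwiched between the stacked input and output differences.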

Repeated nonlinearities. Now consider the vector-valued function obtained by applying a slope-restricted function component-wise to a vector . By exploiting the fact that the same function is applied to each component, we can characterize by incremental quadratic constraints. In the following lemma, we provide such a characterization.

Lemma

Suppose is slope-restricted on . Define the set

(8)

Then for any the vector-valued function satisfies

(9)

Figure 1: An illustrative description of encoding activation functions by quadratic constraints.

Concretely, this lemma captures the coupling between neurons in a neural network by taking advantage of two particular structures: (a) the same activation function is applied to each hidden neuron and (b) all activation functions are slope-restricted on the same interval $[\alpha, \beta]$. In this way, we can write the slope restriction condition in (3) for any pair of activation functions in a given neural network. A conic combination of these constraints would yield (9), where $\lambda_{ij} \geq 0$ are the coefficients of this combination. See Figure 1 for an illustrative description.

We will see in the next section that the matrix that parameterizes the multiplier matrix in (9) appears as a decision variable in an SDP, in which the objective is to find an admissible that yields the tightest bound on the Lipschitz constant.

2.3 LipSDP for single-layer neural network

To develop an optimization problem to estimate the Lipschitz constant of a fully-connected feed-forward neural network, the key insight is that the Lipschitz condition in (1) is in fact equivalent to an incremental quadratic constraint for the map characterized by the neural network. By coupling this to the incremental quadratic constraints satisfied by the cascade combination of the activation functions [fazlyab2018analysis], we can develop an SDP to minimize an upper bound on the Lipschitz constant of . This result is formally stated in the following theorem.

Theorem (Lipschitz certificates for single-layer neural networks)

Consider a single-layer neural network described by . Suppose , where is slope-restricted in the sector . Define as in (8). Suppose there exists a such that the matrix inequality

(10)

holds for some . Then for all .

Theorem 2.3 provides us with a sufficient condition for to be an upper bound on the Lipschitz constant of . In particular, we can find the tightest bound by solving the following optimization problem:

(11)

where the decision variables are . Note that is linear in and and the set is convex. Hence, (11) is an SDP, which can be solved numerically for its global minimum.
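To make the optimization in (11) concrete, consider the smallest possible case: one input, one hidden ReLU neuron, and one output, i.e. the map x ↦ w1·relu(w0·x). The sketch below assumes that the matrix in (10) has the block form [[−2αβ·W0ᵀ T W0 − ρI, (α+β)·W0ᵀ T], [(α+β)·T W0, −2T + W1ᵀ W1]] with ρ playing the role of the squared Lipschitz bound; this form is an assumption about the elided matrix inequality, not something shown in the text above. With scalars and α = 0, β = 1 the matrix is 2-by-2, so feasibility has a closed form and a grid search over the multiplier `t` stands in for an SDP solver. All function names are hypothetical.

```python
import math


def is_negative_semidefinite_2x2(a, b, d, tol=1e-9):
    """True if [[a, b], [b, d]] is negative semidefinite.

    A symmetric 2x2 matrix is NSD iff both diagonal entries are <= 0
    and its determinant is >= 0.
    """
    return a <= tol and d <= tol and a * d - b * b >= -tol


def lmi_feasible(rho, t, w0, w1, alpha=0.0, beta=1.0):
    """Assumed scalar form of the single-layer matrix inequality M(rho, t) <= 0."""
    a = -2.0 * alpha * beta * w0 * w0 * t - rho
    b = (alpha + beta) * w0 * t
    d = -2.0 * t + w1 * w1
    return is_negative_semidefinite_2x2(a, b, d)


def smallest_rho(t, w0, w1):
    """Smallest rho making the LMI feasible for a fixed t (ReLU: alpha=0, beta=1).

    From the determinant condition: rho * (2t - w1^2) >= (w0 * t)^2.
    """
    slack = 2.0 * t - w1 * w1
    if slack <= 0.0:
        return math.inf  # the (2,2) entry is not negative, so no finite rho works
    return (w0 * t) ** 2 / slack


def scalar_lipsdp(w0, w1, steps=20000):
    """Grid search over t > 0 standing in for an SDP solver."""
    t_max = 4.0 * w1 * w1 + 1.0
    candidates = (t_max * k / steps for k in range(1, steps + 1))
    best_t = min(candidates, key=lambda t: smallest_rho(t, w0, w1))
    return math.sqrt(smallest_rho(best_t, w0, w1)), best_t


w0, w1 = 2.0, 3.0
bound, t_star = scalar_lipsdp(w0, w1)
```

Under these assumptions the minimizing multiplier is close to w1², and the resulting bound is close to |w0·w1| = 6, which is the true Lipschitz constant of this toy map. For real networks the grid search would be replaced by an SDP solver over the matrix variable.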

2.4 LipSDP for multi-layer neural networks

We now consider the multi-layer case. Assuming that all the activation functions are the same, we can write the neural network model in (2) compactly as

(12)

where is the concatenation of the input and the activation values, and the matrices , , and are given by [fazlyab2019safety]

(13)

The particular representation in (12) facilitates the extension of LipSDP to multiple layers, as stated in the following theorem.

Theorem (Lipschitz certificates for multi-layer neural networks)

Consider an -layer fully connected neural network described by (2). Let be the total number of hidden neurons and suppose the activation functions are slope-restricted in the sector . Define as in (8). Define and as in (13). Consider the matrix inequality

(14)

If (14) is satisfied for some , then , .

In a similar way to the single-layer case, we can find the best bound on the Lipschitz constant by solving the SDP in (11) with defined as in (14).

Remark

We have only considered the norm in our exposition. By using the inequality , the -Lipschitz bound implies

or, equivalently, Hence, is a Lipschitz constant of when and norms are used in the input and output spaces, respectively. We can also extend our framework to accommodate quadratic norms , where .

2.5 Variants of LipSDP: reconciling accuracy and efficiency

In LipSDP, there are $O(n^2)$ decision variables $\lambda_{ij}$ ($1 \leq i, j \leq n$), where $n$ is the total number of hidden neurons. For $i \neq j$, the variable $\lambda_{ij}$ couples the $i$-th and $j$-th hidden neuron. For $i = j$, the variable $\lambda_{ii}$ constrains the input-output of the $i$-th activation function individually. Using all these decision variables would provide the tightest convex relaxation in our formulation. Interestingly, numerical experiments suggest that this tightest convex relaxation yields bounds on the Lipschitz constant that almost coincide with the naive lower bound on the Lipschitz constant. However, solving this SDP with all the decision variables included is impractical for large networks. Nevertheless, we can consider a hierarchy of relaxations of LipSDP by removing a subset of the decision variables. Below, we give a brief description of the efficiency and accuracy of each variant. Throughout, we let $n$ be the total number of neurons and $\ell$ the number of hidden layers.

  1. LipSDP-Network imposes constraints on all possible pairs of activation functions and has $O(n^2)$ decision variables. It is the least scalable but the most accurate method.

  2. LipSDP-Neuron ignores the cross coupling constraints among different neurons and has $O(n)$ decision variables. It is more scalable and less accurate than LipSDP-Network. For this case, we have $T = \mathrm{diag}(\lambda_{11}, \dots, \lambda_{nn})$.

  3. LipSDP-Layer considers only one constraint per layer, resulting in $\ell$ decision variables. It is the most scalable and least accurate method. For this variant, we have $T = \mathrm{blkdiag}(\lambda_1 I_{n_1}, \dots, \lambda_\ell I_{n_\ell})$.

Parallel implementation by splitting. The Lipschitz constant of the composition of two or more functions can be bounded by the product of the Lipschitz constants of the individual functions. By splitting a neural network up into small sub-networks, one can first bound the Lipschitz constant of each sub-network and then multiply these constants together to obtain a Lipschitz constant for the entire network. Because sub-networks do not share weights, it is possible to compute the Lipschitz constants for each sub-network in parallel. This greatly improves the scalability of each variant of LipSDP with respect to the total number of activation functions in the network.
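The splitting scheme can be sketched as follows. The per-sub-network bound used here is a deliberately simple stand-in (scalar layer gains followed by 1-Lipschitz activations), not LipSDP; `split_layers`, `subnetwork_bound`, and `split_and_bound` are hypothetical names, and any function that returns a valid upper bound for a sub-network could be passed as `bound_fn`.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def split_layers(layers, size):
    """Split a list of layers into consecutive sub-networks of at most `size` layers."""
    return [layers[i:i + size] for i in range(0, len(layers), size)]


def subnetwork_bound(gains):
    """Stand-in for a per-sub-network Lipschitz bound (NOT LipSDP).

    Each layer is modeled as a scalar gain followed by a 1-Lipschitz activation,
    so the product of absolute gains bounds the sub-network's Lipschitz constant.
    """
    return math.prod(abs(g) for g in gains)


def split_and_bound(layers, size, bound_fn=subnetwork_bound):
    """Bound each sub-network independently, then multiply the results."""
    chunks = split_layers(layers, size)
    with ThreadPoolExecutor() as pool:
        bounds = list(pool.map(bound_fn, chunks))
    return math.prod(bounds), bounds


layer_gains = [2.0, 0.5, 3.0, 1.0, 4.0]
total_bound, per_chunk = split_and_bound(layer_gains, size=2)
```

The sub-network bounds are independent of one another, so they can be dispatched to separate workers; only the final product needs all of them.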

3 Experiments

Hidden units    LipSDP-Neuron    LipSDP-Layer
500             5.22             2.85
1000            27.91            17.88
1500            82.12            58.61
2000            200.88           146.09
2500            376.07           245.94
3000            734.63           473.25

Table 1: Computation time in seconds for evaluating Lipschitz bounds of one-hidden-layer neural networks with a varying number of hidden units. A plot showing the Lipschitz constant for each network tested in this table has been provided in the Appendix.

Hidden layers   LipSDP-Neuron    LipSDP-Layer
5               20.33            3.41
10              32.18            7.06
50              87.45            25.88
100             135.85           40.39
200             221.2            64.90
500             707.56           216.49

Table 2: Computation time in seconds for computing Lipschitz bounds of $\ell$-hidden-layer neural networks with 100 activation functions per layer. For LipSDP-Neuron and LipSDP-Layer, we split each network up into 5-layer sub-networks.

In this section we describe several experiments that highlight the key aspects of this work. In particular, we show empirically that our bounds are much tighter than any comparable method, we study the impact of robust training on our Lipschitz bounds, and we analyze the scalability of our methods.

Experimental setup. For our experiments we used MATLAB, the CVX toolbox [grant2008cvx] and MOSEK [mosek] on a 9-core CPU with 16GB of RAM to solve the SDPs. All classifiers trained on MNIST used an 80-20 train-test split.

Training procedures. Several training procedures have recently been proposed to improve the robustness of neural network classifiers. Two prominent procedures are the LP-based method in [kolter2017provable] and projected gradient descent (PGD) based method in [madry2017towards]. We refer to these training methods as LP-Train and PGD-Train, respectively. Both procedures take as input a parameter that defines the perturbation of the training data points.

Baselines. Throughout the experiments, we will often show comparisons to what we call the naive lower and upper bounds. As has been shown in several previous works [combettes_lipschitz, virmaux2018lipschitz], trivial lower and upper bounds on the Lipschitz constant of a feed-forward neural network with $\ell$ hidden layers are given by $\|W^\ell W^{\ell-1} \cdots W^0\|_2$ and $\prod_{i=0}^{\ell} \|W^i\|_2$, which we refer to as the naive lower and upper bounds, respectively. We are aware of only two methods that bound the Lipschitz constant and can scale to fully-connected networks with more than two hidden layers; these methods are [combettes_lipschitz], which we will refer to as CPLip, and [virmaux2018lipschitz], which is called SeqLip.
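The two baselines can be computed with a few lines of linear algebra. The sketch below takes the naive lower bound to be the spectral norm of the product of the weight matrices and the naive upper bound to be the product of the per-layer spectral norms, and estimates spectral norms by power iteration in plain Python. The helper names and the toy weights are assumptions for illustration; a numerical library would normally be used instead.

```python
import math


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def spectral_norm(w, iters=500):
    """Largest singular value of w via power iteration on w^T w.

    Uses an all-ones start vector, so it assumes that vector is not orthogonal
    to the leading right singular vector.
    """
    gram = matmul(transpose(w), w)
    v = [1.0] * len(gram)
    for _ in range(iters):
        u = [sum(row[j] * v[j] for j in range(len(v))) for row in gram]
        norm = math.sqrt(sum(c * c for c in u))
        if norm == 0.0:
            return 0.0  # start vector lies in the null space (e.g. w is zero)
        v = [c / norm for c in u]
    gv = [sum(row[j] * v[j] for j in range(len(v))) for row in gram]
    return math.sqrt(sum(a * b for a, b in zip(v, gv)))


def naive_bounds(weights):
    """Return (norm of the product of weights, product of the norms).

    `weights` is ordered from the first layer to the last.
    """
    product = weights[0]
    for w in weights[1:]:
        product = matmul(w, product)  # later layers multiply from the left
    upper = math.prod(spectral_norm(w) for w in weights)
    return spectral_norm(product), upper


W0 = [[2.0, 0.0], [0.0, 1.0]]
W1 = [[1.0, 0.0], [0.0, 3.0]]
lower, upper = naive_bounds([W0, W1])
```

In this toy case the two layers stretch different coordinates, so the norm of the product is 3 while the product of the norms is 6; the gap between the two baselines is the range in which a tighter method can improve.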

(a) Comparison of Lipschitz bounds found by various methods for five-hidden-layer networks trained on MNIST with the Adam optimizer. Each network had a test accuracy above 97%.
(b) Lipschitz bounds obtained by splitting a 100-layer network into sub-networks. Each sub-network had six layers, and the weights were generated randomly by sampling from a normal distribution.

(c) LipSDP-Network Lipschitz bounds and computation time for a one-hidden-layer network with 100 neurons. The weights for this network were obtained by sampling from a normal distribution.
Figure 2: Comparison of the accuracy of LipSDP methods to other methods that compute the Lipschitz constant and scalability analysis of all three LipSDP methods.

We compare the Lipschitz bounds obtained by LipSDP-Neuron, LipSDP-Layer, CPLip, and SeqLip in Figure 2(a). It is evident from this figure that the bounds from LipSDP-Neuron are tighter than those from CPLip and SeqLip. Our results show that the true Lipschitz constants of the networks shown above are very close to the naive lower bound.

To demonstrate the scalability of the LipSDP formulations, we split a 100-hidden-layer neural network into sub-networks with six hidden layers each and computed the Lipschitz bounds using LipSDP-Neuron and LipSDP-Layer. The results are shown in Figure 2(b). Furthermore, in Tables 1 and 2, we show the computation time for scaling the LipSDP methods in the number of hidden units per layer and in the number of layers. In particular, the largest network we tested in Table 2 had 50,000 hidden neurons; LipSDP-Neuron took approximately 12 minutes to find a Lipschitz bound, and LipSDP-Layer took approximately 4 minutes.

To evaluate LipSDP-Network, we coupled random pairs of hidden neurons in a one-hidden-layer network and plotted the computation time and Lipschitz bound found by LipSDP-Network as we increased the number of paired neurons. Our results show that as the number of coupled neurons increases, the computation time increases quadratically. This shows that while this method is the most accurate of the three proposed LipSDP methods, it is intractable for even modestly large networks.

(a) Lipschitz bounds for one-hidden-layer neural networks trained on the MNIST dataset with the Adam optimizer and LP-Train and PGD-Train for two values of the robustness parameter $\epsilon$. Each network reached an accuracy of 95% or higher.
Figure 3: Histograms showing the local robustness (in norm) around each correctly-classified test instance from the MNIST dataset. The neural networks had three hidden layers with 100, 50, 20 neurons, respectively. All classifiers had a test accuracy of 97%.
Figure 4: Analysis of impact of robust training on the Lipschitz constant and the distance to misclassification for networks trained on MNIST

Impact of robust training. In Figure 4, we empirically demonstrate that the Lipschitz bound of a neural network is directly related to the robustness of the corresponding classifier. This figure shows that LP-Train and PGD-Train networks achieve lower Lipschitz bounds than standard training procedures. Figure 2(a) indicates that robust training procedures yield lower Lipschitz constants than networks trained with standard training procedures such as the Adam optimizer. Figure 3 shows the utility of sharply estimating the Lipschitz constant; a lower value of the Lipschitz bound guarantees that a neural network is more locally robust to input perturbations; see Proposition A.1 in the Appendix.

Figure 5: Trade-off between accuracy and Lipschitz constant for different values of the robustness parameter used for LP-Train and PGD-Train. All networks had one hidden layer with 50 hidden neurons.
Figure 6: Lipschitz constants for topologically identical three-hidden-layer networks with ReLU and leaky ReLU activation functions. All classifiers were trained until they reached 97% test accuracy.

In the same vein, Figure 5 shows the impact of varying the robustness parameter $\epsilon$ used in LP-Train and PGD-Train on the test accuracy of networks trained for a fixed number of epochs and the corresponding Lipschitz constants. In essence, these results quantify how much robustness a fixed classifier can handle before accuracy plummets. Interestingly, the drops in accuracy as $\epsilon$ increases coincide with corresponding drops in the Lipschitz constant for both LP-Train and PGD-Train.

Robustness for different activation functions. The framework proposed in this work allows us to examine the impact of using different activation functions on the Lipschitz constant of neural networks. We trained two sets of neural networks on the MNIST dataset. The first set used ReLU activation functions, while the second set used leaky ReLU activations. Figure 6 shows empirically that the networks with the leaky ReLU activation function have larger Lipschitz constants than networks of the same architecture with the ReLU activation function.

4 Conclusions and future work

In this paper, we proposed a hierarchy of semidefinite programs to derive tight upper bounds on the Lipschitz constant of feed-forward fully-connected neural networks. Some comments are in order. First, our framework can be directly used to certify convolutional neural networks (CNNs) by unrolling them to a large feed-forward neural network. A future direction is to exploit the special structure of CNNs in the resulting SDP. Second, we only considered one application of Lipschitz bounds in depth (robustness certification). Having an accurate upper bound on the Lipschitz constant can be useful in domains beyond robustness analysis, such as stability analysis of feedback systems with control policies updated by deep reinforcement learning. Furthermore, Lipschitz bounds can be utilized during training as a heuristic to promote out-of-sample generalization [tsuzuku2018lipschitz]. We intend to pursue these applications in future work.

References

Appendix A

A.1 Robustness certification of DNN-based classifiers

Consider a classifier described by a feed-forward neural network , where is the number of input features and is the number of classes. In this context, the function takes as input an instance or measurement and returns a -dimensional vector of scores – one for each class. The classification rule is based on assigning to the class with the highest score. That is, we define the classification to be . Now suppose that is an instance that is classified correctly by the neural network. To evaluate the local robustness of the neural network around , we consider a bounded set that represents the set of all possible -norm perturbations of . Then the classifier is locally robust at against if it assigns all the perturbed inputs to the same class as the unperturbed input, i.e., if

(15)

In the following proposition, we derive a sufficient condition to guarantee local robustness around for the perturbation set .

Proposition

Consider a neural-network classifier with Lipschitz constant in the -norm. Let be given and consider the inequality

(16)

Then (16) implies for all .

The inequality in (16) provides us with a simple and computationally efficient test for assessing the point-wise robustness of a neural network. According to (16), a more accurate estimation of the Lipschitz constant directly increases the maximum perturbation that can be certified for each test example. This makes the framework suitable for model selection, wherein one wishes to select the model that is most robust to adversarial perturbations from a family of proposed classifiers [peck2017lower].
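The test described above amounts to a one-line computation once a Lipschitz bound is available. The sketch below assumes the sufficient condition takes the form L·ε·√2 ≤ min over j ≠ i* of |f_{i*}(x*) − f_j(x*)|, which follows from the fact that the Euclidean distance from a score vector to the hyperplane where two scores are equal is their gap divided by √2. The function name and the example scores are hypothetical.

```python
import math


def certified_radius(scores, lipschitz_bound):
    """Radius of the l2-ball around the input on which the predicted class cannot change.

    Assumes the sufficient condition  L * eps * sqrt(2) <= min_j |f_i*(x) - f_j(x)|,
    where i* is the predicted class and L is an upper bound on the Lipschitz constant.
    """
    if lipschitz_bound <= 0.0:
        raise ValueError("lipschitz_bound must be positive")
    top = max(range(len(scores)), key=lambda i: scores[i])
    margin = min(scores[top] - s for i, s in enumerate(scores) if i != top)
    return margin / (math.sqrt(2.0) * lipschitz_bound)


scores = [3.0, 1.0, 0.0]  # hypothetical class scores f(x*)
radius_loose = certified_radius(scores, lipschitz_bound=4.0)
radius_tight = certified_radius(scores, lipschitz_bound=2.0)
```

Halving the Lipschitz bound doubles the certified radius, which is why a tighter estimate translates directly into a stronger robustness guarantee for the same classifier.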

A.2 Proof of Proposition A.1

Let be the class of . Define the polytope in the output space of :

which is the set of all outputs whose score is the highest for class ; and, of course, . The distance of to the boundary of the polytope is the minimum distance of to all edges of the polytope:

Note that the Lipschitz condition implies for all . The condition in (16) then implies

and for all . Therefore, the output of the classification would not change for any .

Figure 7: Illustration of local robustness certification using the Lipschitz bound.

A.3 Proof of Lemma 2.2

We first prove the following lemma, which is a slight variation of the lemma proved in [fazlyab2019safety].

Lemma

Suppose is slope-restricted on and satisfies . Define the set

(17)

Then the vector-valued function defined by satisfies

(18)

for all .

Proof of Lemma A.3. Note that the slope restriction condition implies that for any two pairs and , we can write the following incremental quadratic constraint:

(19)

where is arbitrary. Similarly, for any two pairs and , we can also write the incremental quadratic constraint

(20)

where is arbitrary. By adding (19) and (20) and vectorizing the notation, we would arrive at the compact representation (18).

Proof of Lemma 2.2. For a fixed , define the map by shifting , as follows,

It is not hard to verify that if is slope-restricted on , the map is also slope-restricted on the same interval for any fixed . Next, for a fixed define

Since is slope-restricted on and satisfies for any fixed , it follows from Lemma A.3 that satisfies the quadratic constraint

By substituting the definition of and setting , we obtain

(21)

A.4 Proof of Theorem 2.3

Define and for two arbitrary inputs . Using Lemma 2.2, we can write the quadratic inequality

where and is defined as in (8). The preceding inequality can be simplified to

(22)

By left and right multiplying in (10) by and , respectively, and rearranging terms, we obtain

(23)

By adding both sides of the preceding inequalities, we obtain

or, equivalently,

Finally, note that by definition of and , we have and . Therefore, the preceding inequality implies

(24)

A.5 Proof of Theorem 2.4

For two arbitrary inputs , define Using the compact notation in (12), we can write

Multiply both sides of the first matrix in (14) by and , respectively and use the preceding identities to obtain

(25)

where the last inequality follows from Lemma 2.2. On the other hand, by multiplying both sides of the second matrix in (14) by and , respectively, we can write

(26)

where we have used the fact that and . By adding both sides of (25) and (26), we get

(27)

When the LMI in (14) holds, the left-hand side of (27) is non-positive, implying that the right-hand side is non-positive.

A.6 Bounding the output set of a neural network classifier

Figure 8: Bounds on the image of an -ball under a neural network trained on the Iris dataset. The network was trained using the Adam optimizer and reached a test accuracy of 99%.

As we have shown, accurate estimation of the Lipschitz constant can provide tighter bounds on adversarial examples drawn from a perturbation set around a nominal point . Figure 8 shows these bounds for a neural network trained on the Iris dataset. Because this dataset has three classes, we project the output sets onto the coordinate axes. This figure clearly shows that our bound is quite close to the naive lower bound. Indeed, the naive upper bound and the constant computed by CPLip are considerably looser.

A.7 Further analysis of scalability

Figure 9: Lipschitz bounds for networks in Table 1.

Figure 9 shows the Lipschitz bounds found for the networks used in Table 1.