1 Introduction
A function $f\colon \mathbb{R}^n \to \mathbb{R}^m$ is globally Lipschitz continuous on $\mathcal{X} \subseteq \mathbb{R}^n$ if there exists a nonnegative constant $L \geq 0$ such that
(1) $\|f(x) - f(y)\| \leq L \|x - y\|$ for all $x, y \in \mathcal{X}$.
The smallest such $L$ is called the Lipschitz constant of $f$. The Lipschitz constant is the maximum ratio between variations in the output space and variations in the input space of $f$ and thus is a measure of sensitivity of the function with respect to input perturbations.
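To make the definition concrete, the following minimal sketch samples random input pairs and records the largest observed ratio between output and input variations. Such a sampled ratio can only underestimate the Lipschitz constant. The function name `empirical_lipschitz_lower_bound` and the linear test map are illustrative assumptions, chosen because the Lipschitz constant of a linear map in the Euclidean norm is its largest singular value.

```python
import numpy as np

def empirical_lipschitz_lower_bound(f, dim, num_pairs=1000, seed=0):
    """Largest observed ratio ||f(x) - f(y)||_2 / ||x - y||_2 over random pairs."""
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(num_pairs):
        x = rng.standard_normal(dim)
        y = rng.standard_normal(dim)
        ratio = np.linalg.norm(f(x) - f(y)) / np.linalg.norm(x - y)
        best = max(best, ratio)
    return best

# Hypothetical test map: a linear function whose Lipschitz constant is known.
W = np.array([[3.0, 0.0], [0.0, 1.0]])
linear_map = lambda x: W @ x

true_constant = np.linalg.norm(W, 2)  # largest singular value of W
estimate = empirical_lipschitz_lower_bound(linear_map, dim=2)
```

Sampling gives no guarantee in the other direction, which is why certified upper bounds are needed.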
When a function $f$ is characterized by a deep neural network (DNN), tight bounds on its Lipschitz constant can be extremely useful in a variety of applications. In classification tasks, for instance, $L$ can be used as a certificate of robustness of a neural network classifier to adversarial attacks if it is estimated tightly [szegedy2013intriguing]. In deep reinforcement learning, tight bounds on the Lipschitz constant of a DNN-based controller can be directly used to analyze the stability of the closed-loop system. Lipschitz regularity can also play a key role in the derivation of generalization bounds [bartlett2017spectrally]. In these applications and many others, it is essential to have tight bounds on the Lipschitz constant of DNNs. However, as DNNs have highly complex and nonlinear structures, estimating the Lipschitz constant both accurately and efficiently has remained a significant challenge.
Our contributions.
In this paper we propose a novel convex programming framework to derive tight bounds on the global Lipschitz constant of deep feedforward neural networks. Our framework yields significantly more accurate bounds compared to the state of the art and lends itself to a distributed implementation, leading to efficient computation of the bounds for large-scale networks.
Our approach. We use the fact that nonlinear activation functions used in neural networks are gradients of convex functions; hence, as operators, they satisfy certain properties that can be abstracted as quadratic constraints on their input-output values. This particular abstraction allows us to pose the Lipschitz estimation problem as a semidefinite program (SDP), which we call LipSDP. A striking feature of LipSDP is its flexibility to span the trade-off between estimation accuracy and computational efficiency by adding or removing extra decision variables. In particular, for a neural network with $\ell$ layers and a total of $n$ hidden neurons, the number of decision variables can vary from $\ell$ (least accurate but most scalable) to $O(n^2)$ (most accurate but least scalable). As such, we derive several distinct yet related formulations of LipSDP that span this trade-off. To scale each variant of LipSDP to larger networks, we also propose a distributed implementation.
Our results. We illustrate our approach in a variety of experiments on both randomly generated networks and networks trained on the MNIST [lecun1998mnist] and Iris [Dua:2019] datasets. First, we show empirically that our Lipschitz bounds are the most accurate compared to all other existing methods of which we are aware. In particular, our experiments on neural networks trained for MNIST show that our bounds almost coincide with the true Lipschitz constant and outperform all comparable methods. For details, see Figure 1(a). Furthermore, we investigate the effect of two robust training procedures [madry2017towards, kolter2017provable] on the Lipschitz constant for networks trained on the MNIST dataset. Our results suggest that robust training procedures significantly decrease the Lipschitz constant of the resulting classifiers. Moreover, we use the Lipschitz bound for two robust training procedures to derive non-vacuous lower bounds on the minimum adversarial perturbation necessary to change the classification of any instance from the test set. For details, see Figure 4.
Related work. The problem of estimating the Lipschitz constant for neural networks has been studied in several works. In [szegedy2013intriguing], the authors estimate the global Lipschitz constant of DNNs by the product of Lipschitz constants of individual layers. This approach is scalable and general but yields trivial bounds. We are only aware of two other methods that give nontrivial upper bounds on the global Lipschitz constant of fully-connected neural networks and can scale to networks with more than two hidden layers. In [combettes_lipschitz], Combettes and Pesquet derive bounds on Lipschitz constants by treating the activation functions as non-expansive averaged operators. The resulting algorithm scales well with the number of hidden units per layer, but very poorly (in fact exponentially) with the number of layers. In [virmaux2018lipschitz], Virmaux and Scaman decompose the weight matrices of a neural network via singular value decomposition and approximately solve a convex maximization problem over the unit cube. Notably, estimating the Lipschitz constant using the method in [virmaux2018lipschitz] is intractable even for small networks; indeed, the authors of [virmaux2018lipschitz] use a greedy algorithm to compute a bound, which may underapproximate the Lipschitz constant. Bounding Lipschitz constants for the specific case of convolutional neural networks (CNNs) has also been addressed in [balan2017lipschitz, zou2018lipschitz, bartlett2017spectrally].
Using Lipschitz bounds in the context of adversarial robustness and safety verification has also been addressed in several works [weng2018evaluating, ruan2018reachability, weng2018towards]. In particular, in [weng2018evaluating], the authors convert the robustness analysis problem into a local Lipschitz constant estimation problem, where they estimate this local constant by a set of independently and identically sampled local gradients. This algorithm is scalable but is not guaranteed to provide upper bounds. In a similar work, the authors of [weng2018towards] exploit the piecewise linear structure of ReLU functions to estimate the local Lipschitz constant of neural networks. In [fazlyab2019safety], the authors use quadratic constraints and semidefinite programming to analyze local (pointwise) robustness of neural networks. In contrast, our Lipschitz bounds can be used as a global certificate of robustness and are agnostic to the choice of the test data.
1.1 Motivating applications
We now enumerate several applications that highlight the importance of estimating the Lipschitz constant of DNNs accurately and efficiently.
Robustness certification of classifiers. In response to fragility of DNNs to adversarial attacks, there has been considerable effort in recent years to improve the robustness of neural networks against adversarial attacks and input perturbations [goodfellow6572explaining, papernot2016distillation, zheng2016improving, kurakin2016adversarial, madry2017towards, kolter2017provable]. In order to certify and/or improve the robustness of neural networks, one must be able to bound the possible outputs of the neural network over a region of input space. This can be done either locally around a specific input [bastani2016measuring, tjeng2017evaluating, gehr2018ai2, singh2018fast, dutta2018output, raghunathan2018certified, raghunathan2018semidefinite, fazlyab2019safety, kolter2017provable, jordan2019provable, wong2018scaling, zhang2018efficient], or globally by bounding the sensitivity of the function to input perturbations, i.e., the Lipschitz constant [huster2018limitations, szegedy2013intriguing, qian2018l2, weng2018evaluating]. Indeed, tight upper bounds on the Lipschitz constant can be used to derive nonvacuous lower bounds on the magnitudes of perturbations necessary to change the decision of neural networks. Finally, an efficient computation of these bounds can be useful in either assessing robustness after training [raghunathan2018certified, raghunathan2018semidefinite, fazlyab2019safety] or promoting robustness during training [kolter2017provable, tsuzuku2018lipschitz]. In the experiments section, we explore this application in depth.
Stability analysis of closed-loop systems with learning controllers. A central problem in learning-based control is to provide stability or safety guarantees for a feedback control loop when a learning-enabled component, such as a deep neural network, is introduced in the loop [aswani2013provably, berkenkamp2017safe, jin2018stability]. The Lipschitz constant of a neural network controller bounds its gain. Therefore, a tight estimate can be useful for certifying the stability of the closed-loop system.
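Notation. We denote the set of real $n$-dimensional vectors by $\mathbb{R}^n$, the set of $m \times n$-dimensional matrices by $\mathbb{R}^{m \times n}$, and the $n$-dimensional identity matrix by $I_n$. We denote by $\mathbb{S}^n$, $\mathbb{S}^n_{+}$, and $\mathbb{S}^n_{++}$ the sets of $n$-by-$n$ symmetric, positive semidefinite, and positive definite matrices, respectively. The $p$-norm ($p \geq 1$) is denoted by $\|\cdot\|_p$. The $\ell_2$-norm of a matrix $W \in \mathbb{R}^{m \times n}$ is the largest singular value of $W$. We denote the $i$-th unit vector in $\mathbb{R}^n$ by $e_i$. We write $\mathrm{diag}(a_1, \dots, a_n)$ for a diagonal matrix whose diagonal entries starting in the upper left corner are $a_1, \dots, a_n$.
2 LipSDP: Lipschitz certificates via semidefinite programming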
2.1 Problem statement
Consider an $\ell$-layer feedforward neural network $f$ described by the following recursive equations:
(2) $x^0 = x, \quad x^{k+1} = \phi(W^k x^k + b^k) \ \text{ for } k = 0, \dots, \ell - 1, \quad f(x) = W^\ell x^\ell + b^\ell.$
Here $x \in \mathbb{R}^{n_0}$ is an input to the network and $W^k \in \mathbb{R}^{n_{k+1} \times n_k}$ and $b^k \in \mathbb{R}^{n_{k+1}}$ are the weight matrix and bias vector for the $k$-th layer. The function $\phi$ is the concatenation of activation functions at each layer, i.e., it is of the form $\phi(x) = [\varphi(x_1) \ \cdots \ \varphi(x_n)]^\top$. In this paper, our goal is to find tight bounds on the Lipschitz constant of the map $x \mapsto f(x)$ in the $\ell_2$-norm. More precisely, we wish to find the smallest constant $L_2 \geq 0$ such that $\|f(x) - f(y)\|_2 \leq L_2 \|x - y\|_2$ for all $x, y \in \mathbb{R}^{n_0}$.
The main source of difficulty in solving this problem is the presence of the nonlinear activation functions. To combat this difficulty, our main idea is to abstract these activation functions by a set of constraints that they impose on their input and output values. Then any property (including Lipschitz continuity) that is satisfied by our abstraction will also be satisfied by the original network.
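As a minimal sketch of the network model, the code below evaluates a feedforward network layer by layer. It assumes the standard recursion in which each hidden layer applies the activation to an affine map of the previous layer and the final layer is affine. The helper names `relu` and `forward` and the layer sizes are illustrative choices, not taken from the paper.

```python
import numpy as np

def relu(z):
    """Componentwise ReLU activation."""
    return np.maximum(z, 0.0)

def forward(x, weights, biases, phi=relu):
    """Evaluate f(x) for weights [W^0, ..., W^l] and biases [b^0, ..., b^l].

    Hidden layers compute phi(W^k x^k + b^k); the last layer is affine.
    """
    activation = x
    for W, b in zip(weights[:-1], biases[:-1]):
        activation = phi(W @ activation + b)
    return weights[-1] @ activation + biases[-1]

# Hypothetical network: 4 inputs, two hidden layers of 8 neurons, 3 outputs.
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]
weights = [rng.standard_normal((sizes[k + 1], sizes[k])) for k in range(len(sizes) - 1)]
biases = [rng.standard_normal(sizes[k + 1]) for k in range(len(sizes) - 1)]

output = forward(rng.standard_normal(sizes[0]), weights, biases)
```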
2.2 Description of activation functions by quadratic constraints
In this section, we introduce several definitions and lemmas that characterize our abstraction of nonlinear activation functions. These results are crucial to the formulation of an SDP that can bound the Lipschitz constants of networks in Section 2.3.
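Definition (Slope-restricted nonlinearity)
A function $\varphi\colon \mathbb{R} \to \mathbb{R}$ is slope-restricted on $[\alpha, \beta]$ where $0 \leq \alpha < \beta < \infty$ if
(3) $\alpha \leq \dfrac{\varphi(y) - \varphi(x)}{y - x} \leq \beta \quad \text{for all } x, y \in \mathbb{R},\ x \neq y.$
The inequality in (3) simply states that the slope of the chord connecting any two points on the curve of the function is at least $\alpha$ and at most $\beta$ (see Figure 1). By multiplying all sides of (3) by $(y - x)^2$, we can write the slope restriction condition as $\alpha (y - x)^2 \leq (\varphi(y) - \varphi(x))(y - x) \leq \beta (y - x)^2$. By the left inequality, the operator $\varphi(\cdot)$ is strongly monotone with parameter $\alpha$ [ryu2016primer], or equivalently the antiderivative function $\int \varphi(x)\,dx$ is strongly convex with parameter $\alpha$. By the right-hand side inequality, $\varphi(\cdot)$ is one-sided Lipschitz with parameter $\beta$. Altogether, the preceding inequalities state that the antiderivative function $\int \varphi(x)\,dx$ is $\alpha$-strongly convex and $\beta$-smooth.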
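The chord-slope condition can be checked numerically. The sketch below samples random point pairs and reports the smallest and largest chord slope for three common activations, under the assumption that the relevant sector is $[0, 1]$. The helper `chord_slope_range` and the sampling interval are illustrative. A sampled check of this kind is a sanity test, not a proof.

```python
import numpy as np

def chord_slope_range(phi, num_pairs=10000, seed=0):
    """Return (min, max) chord slope of phi over random point pairs."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5.0, 5.0, num_pairs)
    y = rng.uniform(-5.0, 5.0, num_pairs)
    keep = np.abs(y - x) > 1e-3  # skip nearly coincident points
    slopes = (phi(y[keep]) - phi(x[keep])) / (y[keep] - x[keep])
    return float(slopes.min()), float(slopes.max())

activations = {
    "relu": lambda z: np.maximum(z, 0.0),
    "tanh": np.tanh,
    "sigmoid": lambda z: 1.0 / (1.0 + np.exp(-z)),
}

# Every sampled slope should lie in the assumed sector [0, 1].
slope_ranges = {name: chord_slope_range(phi) for name, phi in activations.items()}
```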
Note that all activation functions used in deep learning satisfy the slope restriction condition in (3) for some $0 \leq \alpha < \beta < \infty$. For instance, the ReLU, tanh, and sigmoid activation functions are all slope-restricted with $\alpha = 0$ and $\beta = 1$. More details can be found in [fazlyab2019safety].
Definition (Incremental Quadratic Constraint [accikmecse2011observers])
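A function $\phi\colon \mathbb{R}^n \to \mathbb{R}^n$ satisfies the incremental quadratic constraint defined by $\mathcal{Q} \subset \mathbb{S}^{2n}$ if for any $Q \in \mathcal{Q}$ and $x, y \in \mathbb{R}^n$,
(4) $\begin{bmatrix} x - y \\ \phi(x) - \phi(y) \end{bmatrix}^\top Q \begin{bmatrix} x - y \\ \phi(x) - \phi(y) \end{bmatrix} \geq 0.$
In the above definition, $\mathcal{Q}$ is the set of all multiplier matrices that characterize $\phi$, and is a convex cone by definition. As an example, the softmax operator $\phi(x) = \left(\sum_{i=1}^n \exp(x_i)\right)^{-1} [\exp(x_1) \ \cdots \ \exp(x_n)]^\top$ is the gradient of the convex function $\psi(x) = \log\left(\sum_{i=1}^n \exp(x_i)\right)$. This function is smooth and strongly convex with parameters $\alpha = 0$ and $\beta = 1$ [boyd2004convex]. For this class of functions, it is known that the gradient function $\phi(x) = \nabla \psi(x)$ satisfies the quadratic inequality [nesterov2013introductory]
(5) $\begin{bmatrix} x - y \\ \phi(x) - \phi(y) \end{bmatrix}^\top \begin{bmatrix} -2\alpha\beta I_n & (\alpha + \beta) I_n \\ (\alpha + \beta) I_n & -2 I_n \end{bmatrix} \begin{bmatrix} x - y \\ \phi(x) - \phi(y) \end{bmatrix} \geq 0.$
Therefore, the softmax operator satisfies the incremental quadratic constraint defined by $\mathcal{Q} = \{\lambda M \mid \lambda \geq 0\}$, where $M$ is the middle matrix in the above inequality.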
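To see the connection between incremental quadratic constraints and slope-restricted nonlinearities, note that (3) can be equivalently written as the single inequality
(6) $\left( \dfrac{\varphi(y) - \varphi(x)}{y - x} - \alpha \right) \left( \dfrac{\varphi(y) - \varphi(x)}{y - x} - \beta \right) \leq 0.$
Multiplying through by $(y - x)^2$ and rearranging terms, we can write (6) as
(7) $\begin{bmatrix} x - y \\ \varphi(x) - \varphi(y) \end{bmatrix}^\top \begin{bmatrix} -2\alpha\beta & \alpha + \beta \\ \alpha + \beta & -2 \end{bmatrix} \begin{bmatrix} x - y \\ \varphi(x) - \varphi(y) \end{bmatrix} \geq 0,$
which, in view of Definition 2.2, is an incremental quadratic constraint for $\varphi$. From this perspective, incremental quadratic constraints generalize the notion of slope-restricted nonlinearities to multivariable vector-valued nonlinearities.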
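Repeated nonlinearities. Now consider the vector-valued function $\phi(x) = [\varphi(x_1) \ \cdots \ \varphi(x_n)]^\top$ obtained by applying a slope-restricted function $\varphi\colon \mathbb{R} \to \mathbb{R}$ componentwise to a vector $x \in \mathbb{R}^n$. By exploiting the fact that the same function $\varphi$ is applied to each component, we can characterize $\phi(\cdot)$ by incremental quadratic constraints. In the following lemma, we provide such a characterization.
Lemma
Suppose $\varphi\colon \mathbb{R} \to \mathbb{R}$ is slope-restricted on $[\alpha, \beta]$. Define the set
(8) $\mathcal{T}_n = \Big\{ T \in \mathbb{S}^n \ \Big|\ T = \sum_{i=1}^{n} \lambda_{ii} e_i e_i^\top + \sum_{1 \leq i < j \leq n} \lambda_{ij} (e_i - e_j)(e_i - e_j)^\top,\ \lambda_{ij} \geq 0 \Big\}.$
Then for any $T \in \mathcal{T}_n$ the vector-valued function $\phi(x) = [\varphi(x_1) \ \cdots \ \varphi(x_n)]^\top\colon \mathbb{R}^n \to \mathbb{R}^n$ satisfies
(9) $\begin{bmatrix} x - y \\ \phi(x) - \phi(y) \end{bmatrix}^\top \begin{bmatrix} -2\alpha\beta T & (\alpha + \beta) T \\ (\alpha + \beta) T & -2T \end{bmatrix} \begin{bmatrix} x - y \\ \phi(x) - \phi(y) \end{bmatrix} \geq 0 \quad \text{for all } x, y \in \mathbb{R}^n.$
Concretely, this lemma captures the coupling between neurons in a neural network by taking advantage of two particular structures: (a) the same activation function is applied to each hidden neuron and (b) all activation functions are slope-restricted on the same interval $[\alpha, \beta]$. In this way, we can write the slope restriction condition in (3) for any pair of activation functions in a given neural network. A conic combination of these constraints would yield (9), where $\lambda_{ij} \geq 0$ are the coefficients of this combination. See Figure 1 for an illustrative description.
We will see in the next section that the matrix $T \in \mathcal{T}_n$ that parameterizes the multiplier matrix in (9) appears as a decision variable in an SDP, in which the objective is to find an admissible $T$ that yields the tightest bound on the Lipschitz constant.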
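The quadratic constraint for repeated nonlinearities can be sanity-checked numerically in its simplest case. The sketch below assumes the block multiplier has the form $\begin{bmatrix} -2\alpha\beta T & (\alpha+\beta) T \\ (\alpha+\beta) T & -2T \end{bmatrix}$ and restricts $T$ to a diagonal matrix with nonnegative entries, i.e., no coupling between different neurons. The function `qc_value` and the choice of tanh with sector $[0, 1]$ are illustrative assumptions.

```python
import numpy as np

def qc_value(phi, x, y, T, alpha=0.0, beta=1.0):
    """Evaluate [dx; dphi]^T [[-2ab T, (a+b) T], [(a+b) T, -2 T]] [dx; dphi]."""
    dx = x - y
    dphi = phi(x) - phi(y)
    return (-2.0 * alpha * beta * dx @ T @ dx
            + 2.0 * (alpha + beta) * dx @ T @ dphi
            - 2.0 * dphi @ T @ dphi)

rng = np.random.default_rng(0)
n = 6
# Diagonal multiplier with nonnegative entries (one weight per neuron).
T = np.diag(rng.uniform(0.0, 2.0, n))

values = [
    qc_value(np.tanh, rng.standard_normal(n), rng.standard_normal(n), T)
    for _ in range(1000)
]
min_value = min(values)  # should be nonnegative up to rounding
```

With a diagonal $T$, the quadratic form is a nonnegative weighted sum of the scalar constraints for each neuron, which is why every sampled value is nonnegative.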
2.3 LipSDP for singlelayer neural network
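To develop an optimization problem to estimate the Lipschitz constant of a fully-connected feedforward neural network, the key insight is that the Lipschitz condition in (1) is in fact equivalent to an incremental quadratic constraint for the map $x \mapsto f(x)$ characterized by the neural network. By coupling this to the incremental quadratic constraints satisfied by the cascade combination of the activation functions [fazlyab2018analysis], we can develop an SDP to minimize an upper bound on the Lipschitz constant of $f$. This result is formally stated in the following theorem.
Theorem (Lipschitz certificates for single-layer neural networks)
Consider a single-layer neural network described by $f(x) = W^1 \phi(W^0 x + b^0) + b^1$. Suppose $\phi(x) = [\varphi(x_1) \ \cdots \ \varphi(x_n)]^\top$, where $\varphi$ is slope-restricted in the sector $[\alpha, \beta]$. Define $\mathcal{T}_n$ as in (8). Suppose there exists a $\rho > 0$ such that the matrix inequality
(10) $M(\rho, T) := \begin{bmatrix} -2\alpha\beta\, (W^0)^\top T W^0 - \rho I_{n_0} & (\alpha + \beta) (W^0)^\top T \\ (\alpha + \beta)\, T W^0 & -2T + (W^1)^\top W^1 \end{bmatrix} \preceq 0$
holds for some $T \in \mathcal{T}_n$. Then $\|f(x) - f(y)\|_2 \leq \sqrt{\rho}\, \|x - y\|_2$ for all $x, y \in \mathbb{R}^{n_0}$.
Theorem 2.3 provides us with a sufficient condition for $\sqrt{\rho}$ to be an upper bound on the Lipschitz constant of $f$. In particular, we can find the tightest bound by solving the following optimization problem:
(11) $\text{minimize } \rho \quad \text{subject to } M(\rho, T) \preceq 0,\ T \in \mathcal{T}_n,$
where the decision variables are $(\rho, T) \in \mathbb{R}_{+} \times \mathcal{T}_n$. Note that $M(\rho, T)$ is linear in $\rho$ and $T$ and the set $\mathcal{T}_n$ is convex. Hence, (11) is an SDP, which can be solved numerically for its global minimum.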
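Solving the SDP requires a conic solver, but the certificate itself can be illustrated without one. The sketch below is not the paper's method: it fixes a hand-picked multiplier $T$ instead of optimizing over it, and it assumes $\alpha = 0$, $\beta = 1$ and the matrix inequality $\begin{bmatrix} -\rho I & (W^0)^\top T \\ T W^0 & -2T + (W^1)^\top W^1 \end{bmatrix} \preceq 0$. For fixed $T$ with $2T - (W^1)^\top W^1 \succ 0$, a Schur complement argument gives the smallest feasible $\rho$ as the largest eigenvalue of $(W^0)^\top T (2T - (W^1)^\top W^1)^{-1} T W^0$. All function names are hypothetical.

```python
import numpy as np

def lipschitz_bound_fixed_T(W0, W1, T):
    """Smallest sqrt(rho) certifying the assumed LMI for a fixed T (alpha=0, beta=1)."""
    D = 2.0 * T - W1.T @ W1
    if np.linalg.eigvalsh(D).min() <= 0.0:
        raise ValueError("T is infeasible: 2T - W1^T W1 must be positive definite")
    # Schur complement: rho * I >= W0^T T D^{-1} T W0.
    S = W0.T @ T @ np.linalg.solve(D, T @ W0)
    S = 0.5 * (S + S.T)  # symmetrize against rounding
    rho = max(np.linalg.eigvalsh(S).max(), 0.0)
    return np.sqrt(rho), rho

def lmi_matrix(W0, W1, T, rho):
    """Assemble the block matrix of the assumed LMI."""
    n0 = W0.shape[1]
    return np.block([[-rho * np.eye(n0), W0.T @ T],
                     [T @ W0, -2.0 * T + W1.T @ W1]])

rng = np.random.default_rng(0)
W0 = rng.standard_normal((5, 3))
W1 = rng.standard_normal((2, 5))
# Hand-picked (not optimized) multiplier: a scaled identity.
T = np.linalg.norm(W1, 2) ** 2 * np.eye(5)

bound, rho = lipschitz_bound_fixed_T(W0, W1, T)
naive_upper = np.linalg.norm(W0, 2) * np.linalg.norm(W1, 2)
max_eig = np.linalg.eigvalsh(lmi_matrix(W0, W1, T, rho)).max()
```

With this particular scaled-identity choice, the resulting bound is never worse than the product of the layers' spectral norms. Optimizing over $T$ with an SDP solver can only tighten it.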
2.4 LipSDP for multilayer neural networks
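We now consider the multi-layer case. Assuming that all the activation functions are the same, we can write the neural network model in (2) compactly as
(12) $B\mathbf{x} = \phi(A\mathbf{x} + b), \quad f(x) = C\mathbf{x} + b^\ell,$
where $\mathbf{x} = [(x^0)^\top \ (x^1)^\top \ \cdots \ (x^\ell)^\top]^\top$ is the concatenation of the input and the activation values, and the matrices $A$, $B$, and $C$ and the vector $b$ are given by [fazlyab2019safety]
(13) $A = \begin{bmatrix} W^0 & 0 & \cdots & 0 & 0 \\ 0 & W^1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & W^{\ell-1} & 0 \end{bmatrix}, \quad B = \begin{bmatrix} 0 & I_{n_1} & 0 & \cdots & 0 \\ 0 & 0 & I_{n_2} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & I_{n_\ell} \end{bmatrix}, \quad C = \begin{bmatrix} 0 & \cdots & 0 & W^\ell \end{bmatrix}, \quad b = \begin{bmatrix} b^0 \\ \vdots \\ b^{\ell-1} \end{bmatrix}.$
The particular representation in (12) facilitates the extension of LipSDP to multiple layers, as stated in the following theorem.
Theorem (Lipschitz certificates for multi-layer neural networks)
Consider an $\ell$-layer fully connected neural network described by (2). Let $n = \sum_{k=1}^{\ell} n_k$ be the total number of hidden neurons and suppose the activation functions are slope-restricted in the sector $[\alpha, \beta]$. Define $\mathcal{T}_n$ as in (8). Define $A$ and $B$ as in (13). Consider the matrix inequality
(14) $M(\rho, T) = \begin{bmatrix} A \\ B \end{bmatrix}^\top \begin{bmatrix} -2\alpha\beta T & (\alpha + \beta) T \\ (\alpha + \beta) T & -2T \end{bmatrix} \begin{bmatrix} A \\ B \end{bmatrix} + \begin{bmatrix} -\rho I_{n_0} & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & (W^\ell)^\top W^\ell \end{bmatrix} \preceq 0.$
If (14) is satisfied for some $(\rho, T) \in \mathbb{R}_{+} \times \mathcal{T}_n$, then $\|f(x) - f(y)\|_2 \leq \sqrt{\rho}\, \|x - y\|_2$ for all $x, y \in \mathbb{R}^{n_0}$.
In a similar way to the single-layer case, we can find the best bound on the Lipschitz constant by solving the SDP in (11) with $M(\rho, T)$ defined as in (14).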
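The compact representation can be made concrete with a short sketch. It assumes the layout in which the stacked vector contains the input followed by all hidden activations, $A$ places each hidden-layer weight matrix on a shifted block diagonal, $B$ selects the hidden activations, and $C$ applies the last weight matrix to the final activation. The function `compact_matrices` and the layer sizes are hypothetical.

```python
import numpy as np

def compact_matrices(weights, biases):
    """Build A, B, C, b such that B x = phi(A x + b) and f = C x + b^l.

    Here x stacks the input and all hidden activations (assumed layout).
    """
    hidden_W = weights[:-1]
    n0 = hidden_W[0].shape[1]
    n = sum(W.shape[0] for W in hidden_W)  # total number of hidden neurons
    total = n0 + n
    A = np.zeros((n, total))
    row, col = 0, 0
    for W in hidden_W:
        A[row:row + W.shape[0], col:col + W.shape[1]] = W
        row += W.shape[0]
        col += W.shape[1]
    B = np.hstack([np.zeros((n, n0)), np.eye(n)])
    W_last = weights[-1]
    C = np.hstack([np.zeros((W_last.shape[0], total - W_last.shape[1])), W_last])
    b = np.concatenate(biases[:-1])
    return A, B, C, b

relu = lambda z: np.maximum(z, 0.0)
rng = np.random.default_rng(0)
sizes = [3, 4, 5, 2]  # input, two hidden layers, output
weights = [rng.standard_normal((sizes[k + 1], sizes[k])) for k in range(3)]
biases = [rng.standard_normal(sizes[k + 1]) for k in range(3)]

# Run the layer recursion and stack the input with the hidden activations.
states = [rng.standard_normal(sizes[0])]
for W, bk in zip(weights[:-1], biases[:-1]):
    states.append(relu(W @ states[-1] + bk))
x_stacked = np.concatenate(states)

A, B, C, b = compact_matrices(weights, biases)
lhs = B @ x_stacked
rhs = relu(A @ x_stacked + b)
output = C @ x_stacked + biases[-1]
```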
Remark
We have only considered the norm in our exposition. By using the inequality , the Lipschitz bound implies
or, equivalently, Hence, is a Lipschitz constant of when and norms are used in the input and output spaces, respectively. We can also extend our framework to accommodate quadratic norms , where .
2.5 Variants of LipSDP: reconciling accuracy and efficiency
In LipSDP, there are $O(n^2)$ decision variables $\lambda_{ij}$ ($1 \leq i, j \leq n$), where $n$ is the total number of hidden neurons. For $i \neq j$, the variable $\lambda_{ij}$ couples the $i$-th and $j$-th hidden neurons. For $i = j$, the variable $\lambda_{ii}$ constrains the input-output of the $i$-th activation function individually. Using all these decision variables would provide the tightest convex relaxation in our formulation. Interestingly, numerical experiments suggest that this tightest convex relaxation yields bounds on the Lipschitz constant that almost coincide with the naive lower bound on the Lipschitz constant. However, solving this SDP with all the decision variables included is impractical for large networks. Nevertheless, we can consider a hierarchy of relaxations of LipSDP by removing a subset of the decision variables. Below, we give a brief description of the efficiency and accuracy of each variant. Throughout, we let $n$ be the total number of neurons and $\ell$ the number of hidden layers.

LipSDP-Network imposes constraints on all possible pairs of activation functions and has $O(n^2)$ decision variables. It is the least scalable but the most accurate method.

LipSDP-Neuron ignores the cross-coupling constraints among different neurons and has $O(n)$ decision variables. It is more scalable and less accurate than LipSDP-Network. For this case, we have $T = \mathrm{diag}(\lambda_{11}, \dots, \lambda_{nn})$.

LipSDP-Layer considers only one constraint per layer, resulting in $\ell$ decision variables. It is the most scalable and least accurate method. For this variant, we have $T = \mathrm{blkdiag}(\lambda_1 I_{n_1}, \dots, \lambda_\ell I_{n_\ell})$.
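The three variants differ only in how the multiplier matrix is parameterized. The sketch below builds the multiplier for the neuron and layer variants and counts the decision variables of each variant. It assumes a diagonal matrix with one weight per neuron for the neuron variant and one weight per layer, repeated across that layer's neurons, for the layer variant. The network count assumes one variable per neuron plus one per unordered pair. All function names are illustrative.

```python
import numpy as np

def neuron_multiplier(lambdas):
    """Diagonal T with one nonnegative weight per hidden neuron."""
    return np.diag(np.asarray(lambdas, dtype=float))

def layer_multiplier(lambdas, layer_sizes):
    """Diagonal T with one nonnegative weight shared by each hidden layer."""
    return np.diag(np.repeat(np.asarray(lambdas, dtype=float), layer_sizes))

def num_decision_variables(layer_sizes):
    """Multiplier variables per variant (rho is not counted)."""
    n = sum(layer_sizes)
    return {
        "network": n * (n + 1) // 2,  # n diagonal weights plus n(n-1)/2 pairs
        "neuron": n,
        "layer": len(layer_sizes),
    }

hidden_sizes = [3, 2]  # hypothetical network with two hidden layers
counts = num_decision_variables(hidden_sizes)
T_layer = layer_multiplier([1.0, 2.0], hidden_sizes)
T_neuron = neuron_multiplier([0.5, 1.0, 1.5, 2.0, 2.5])
```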
Parallel implementation by splitting. The Lipschitz constant of the composition of two or more functions can be bounded by the product of the Lipschitz constants of the individual functions. By splitting a neural network up into small subnetworks, one can first bound the Lipschitz constant of each subnetwork and then multiply these constants together to obtain a Lipschitz constant for the entire network. Because subnetworks do not share weights, it is possible to compute the Lipschitz constants for each subnetwork in parallel. This greatly improves the scalability of each variant of LipSDP with respect to the total number of activation functions in the network.
3 Experiments
In this section we describe several experiments that highlight the key aspects of this work. In particular, we show empirically that our bounds are much tighter than any comparable method, we study the impact of robust training on our Lipschitz bounds, and we analyze the scalability of our methods.
Experimental setup. For our experiments we used MATLAB, the CVX toolbox [grant2008cvx] and MOSEK [mosek] on a 9-core CPU with 16GB of RAM to solve the SDPs. All classifiers trained on MNIST used an 80-20 train-test split.
Training procedures. Several training procedures have recently been proposed to improve the robustness of neural network classifiers. Two prominent procedures are the LP-based method in [kolter2017provable] and the projected gradient descent (PGD) based method in [madry2017towards]. We refer to these training methods as LP-Train and PGD-Train, respectively. Both procedures take as input a parameter $\epsilon$ that defines the perturbation of the training data points.
Baselines. Throughout the experiments, we will often show comparisons to what we call the naive lower and upper bounds. As has been shown in several previous works [combettes_lipschitz, virmaux2018lipschitz], trivial lower and upper bounds on the Lipschitz constant of a feedforward neural network with $\ell$ hidden layers are given by $L_{\text{lower}} = \|W^\ell W^{\ell-1} \cdots W^0\|_2$ and $L_{\text{upper}} = \prod_{i=0}^{\ell} \|W^i\|_2$, which we refer to as the naive lower and upper bounds, respectively. We are aware of only two methods that bound the Lipschitz constant and can scale to fully-connected networks with more than two hidden layers; these methods are [combettes_lipschitz], which we will refer to as CPLip, and [virmaux2018lipschitz], which is called SeqLip.
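The two naive baselines are cheap to compute. The sketch below assumes they are, respectively, the spectral norm of the product of all weight matrices and the product of the individual spectral norms. The function name `naive_bounds` and the random weights are illustrative.

```python
import numpy as np
from functools import reduce

def naive_bounds(weights):
    """Return (lower, upper) naive bounds for weights [W^0, ..., W^l]."""
    # Product W^l ... W^1 W^0, applied in network order.
    product = reduce(lambda acc, W: W @ acc, weights)
    lower = float(np.linalg.norm(product, 2))
    upper = float(np.prod([np.linalg.norm(W, 2) for W in weights]))
    return lower, upper

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]
weights = [rng.standard_normal((sizes[k + 1], sizes[k])) for k in range(3)]
lower, upper = naive_bounds(weights)
```

Because the spectral norm is submultiplicative, the first quantity never exceeds the second.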
We compare the Lipschitz bounds obtained by LipSDP-Neuron, LipSDP-Layer, CPLip, and SeqLip in Figure 1(a). It is evident from this figure that the bounds from LipSDP-Neuron are tighter than those from CPLip and SeqLip. Our results show that the true Lipschitz constants of the networks shown above are very close to the naive lower bound.
To demonstrate the scalability of the LipSDP formulations, we split a 100-hidden-layer neural network into subnetworks with six hidden layers each and computed the Lipschitz bounds using LipSDP-Neuron and LipSDP-Layer. The results are shown in Figure 1(b). Furthermore, in Tables 1 and 2, we show the computation time for scaling the LipSDP methods in the number of hidden units per layer and in the number of layers. In particular, the largest network we tested in Table 2 had 50,000 hidden neurons; LipSDP-Neuron took approximately 12 minutes to find a Lipschitz bound, and LipSDP-Layer took approximately 4 minutes.
To evaluate LipSDP-Network, we coupled random pairs of hidden neurons in a one-hidden-layer network and plotted the computation time and Lipschitz bound found by LipSDP-Network as we increased the number of paired neurons. Our results show that as the number of coupled neurons increases, the computation time increases quadratically. This shows that while this method is the most accurate of the three proposed LipSDP methods, it is intractable for even modestly large networks.
Impact of robust training. In Figure 4, we empirically demonstrate that the Lipschitz bound of a neural network is directly related to the robustness of the corresponding classifier. This figure shows that LP-Train and PGD-Train networks achieve lower Lipschitz bounds than standard training procedures. Figure 2(a) indicates that robust training procedures yield lower Lipschitz constants than networks trained with standard training procedures such as the Adam optimizer. Figure 3 shows the utility of sharply estimating the Lipschitz constant; a lower value of $L_2$ guarantees that a neural network is more locally robust to input perturbations; see Proposition A.1 in the Appendix.
In the same vein, Figure 3 shows the impact of varying the robustness parameter $\epsilon$ used in LP-Train and PGD-Train on the test accuracy of networks trained for a fixed number of epochs and the corresponding Lipschitz constants. In essence, these results quantify how much robustness a fixed classifier can handle before accuracy plummets. Interestingly, the drops in accuracy as $\epsilon$ increases coincide with corresponding drops in the Lipschitz constant for both LP-Train and PGD-Train.
Robustness for different activation functions. The framework proposed in this work allows us to examine the impact of using different activation functions on the Lipschitz constant of neural networks. We trained two sets of neural networks on the MNIST dataset. The first set used ReLU activation functions, while the second set used leaky ReLU activations. Figure 6 shows empirically that the networks with the leaky ReLU activation function have larger Lipschitz constants than networks of the same architecture with the ReLU activation function.
4 Conclusions and future work
In this paper, we proposed a hierarchy of semidefinite programs to derive tight upper bounds on the Lipschitz constant of feedforward fully-connected neural networks. Some comments are in order. First, our framework can be directly used to certify convolutional neural networks (CNNs) by unrolling them to a large feedforward neural network. A future direction is to exploit the special structure of CNNs in the resulting SDP. Second, we only considered one application of Lipschitz bounds in depth (robustness certification). Having an accurate upper bound on the Lipschitz constant can be useful in domains beyond robustness analysis, such as stability analysis of feedback systems with control policies updated by deep reinforcement learning. Furthermore, Lipschitz bounds can be utilized during training as a heuristic to promote out-of-sample generalization [tsuzuku2018lipschitz]. We intend to pursue these applications in future work.
References
Appendix A Appendix
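A.1 Robustness certification of DNN-based classifiers
Consider a classifier described by a feedforward neural network $f\colon \mathbb{R}^{n_x} \to \mathbb{R}^{n_f}$, where $n_x$ is the number of input features and $n_f$ is the number of classes. In this context, the function $f$ takes as input an instance or measurement $x$ and returns an $n_f$-dimensional vector of scores, one for each class. The classification rule is based on assigning $x$ to the class with the highest score. That is, we define the classification $C(x)\colon \mathbb{R}^{n_x} \to \{1, \dots, n_f\}$ to be $C(x) = \mathrm{argmax}_{1 \leq i \leq n_f} f_i(x)$. Now suppose that $x^\star$ is an instance that is classified correctly by the neural network. To evaluate the local robustness of the neural network around $x^\star$, we consider a bounded set $\mathcal{X}_\epsilon = \{x \mid \|x - x^\star\|_2 \leq \epsilon\}$ that represents the set of all possible $\ell_2$-norm perturbations of $x^\star$. Then the classifier is locally robust at $x^\star$ against $\mathcal{X}_\epsilon$ if it assigns all the perturbed inputs to the same class as the unperturbed input, i.e., if
(15) $C(x) = C(x^\star) \quad \text{for all } x \in \mathcal{X}_\epsilon.$
In the following proposition, we derive a sufficient condition to guarantee local robustness around $x^\star$ for the perturbation set $\mathcal{X}_\epsilon$.
Proposition
Consider a neural-network classifier $f$ with Lipschitz constant $L_2$ in the $\ell_2$-norm. Let $\epsilon > 0$ be given, let $i^\star = C(x^\star)$, and consider the inequality
(16) $L_2 \leq \dfrac{1}{\sqrt{2}\, \epsilon} \min_{1 \leq j \leq n_f,\ j \neq i^\star} |f_j(x^\star) - f_{i^\star}(x^\star)|.$
Then (16) implies $C(x) = C(x^\star)$ for all $x \in \mathcal{X}_\epsilon$.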
The inequality in (16) provides us with a simple and computationally efficient test for assessing the pointwise robustness of a neural network. According to (16), a more accurate estimation of the Lipschitz constant directly increases the maximum perturbation that can be certified for each test example. This makes the framework suitable for model selection, wherein one wishes to select the model that is most robust to adversarial perturbations from a family of proposed classifiers [peck2017lower].
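The pointwise test can be turned into a certified radius. The sketch below assumes the sufficient condition has the form $L_2\, \epsilon \leq \frac{1}{\sqrt{2}} \min_{j \neq i^\star} |f_j(x^\star) - f_{i^\star}(x^\star)|$ and solves it for the largest admissible $\epsilon$. The function `certified_radius` and the example scores are hypothetical.

```python
import numpy as np

def certified_radius(scores, lipschitz_bound):
    """Largest l2 radius certified by the assumed margin condition.

    scores: class scores f(x*) at the nominal input.
    lipschitz_bound: an upper bound on the l2 Lipschitz constant of f.
    """
    scores = np.asarray(scores, dtype=float)
    top = int(np.argmax(scores))
    # Gap between the predicted class and every other class.
    margins = np.abs(np.delete(scores, top) - scores[top])
    return float(margins.min() / (np.sqrt(2.0) * lipschitz_bound))

# Hypothetical scores and Lipschitz bound.
radius = certified_radius([3.0, 1.0, 0.0], lipschitz_bound=2.0)
```

Halving the Lipschitz bound doubles the certified radius, which is why tighter bounds translate directly into stronger certificates.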
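A.2 Proof of Proposition A.1
Let $i^\star = \mathrm{argmax}_{1 \leq i \leq n_f} f_i(x^\star)$ be the class of $x^\star$. Define the polytope in the output space of $f$:
$\mathcal{P}_{i^\star} = \{ y \in \mathbb{R}^{n_f} \mid (e_j - e_{i^\star})^\top y \leq 0,\ 1 \leq j \leq n_f,\ j \neq i^\star \},$
which is the set of all outputs whose score is the highest for class $i^\star$; and, of course, $f(x^\star) \in \mathcal{P}_{i^\star}$. The distance of $f(x^\star)$ to the boundary $\partial \mathcal{P}_{i^\star}$ of the polytope is the minimum distance of $f(x^\star)$ to all edges of the polytope:
$\mathrm{dist}(f(x^\star), \partial \mathcal{P}_{i^\star}) = \dfrac{1}{\sqrt{2}} \min_{1 \leq j \leq n_f,\ j \neq i^\star} |f_j(x^\star) - f_{i^\star}(x^\star)|.$
Note that the Lipschitz condition implies $\|f(x) - f(x^\star)\|_2 \leq L_2 \epsilon$ for all $x \in \mathcal{X}_\epsilon$. The condition in (16) then implies
$\|f(x) - f(x^\star)\|_2 \leq \mathrm{dist}(f(x^\star), \partial \mathcal{P}_{i^\star}) \quad \text{for all } x \in \mathcal{X}_\epsilon,$
and hence $f(x) \in \mathcal{P}_{i^\star}$ for all $x \in \mathcal{X}_\epsilon$. Therefore, the output of the classification would not change for any $x \in \mathcal{X}_\epsilon$.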
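A.3 Proof of Lemma 2.2
We first prove the following lemma, which is a slight variation of the lemma proved in [fazlyab2019safety].
Lemma
Suppose $\varphi\colon \mathbb{R} \to \mathbb{R}$ is slope-restricted on $[\alpha, \beta]$ and satisfies $\varphi(0) = 0$. Define the set
(17) $\mathcal{T}_n = \Big\{ T \in \mathbb{S}^n \ \Big|\ T = \sum_{i=1}^{n} \lambda_{ii} e_i e_i^\top + \sum_{1 \leq i < j \leq n} \lambda_{ij} (e_i - e_j)(e_i - e_j)^\top,\ \lambda_{ij} \geq 0 \Big\}.$
Then the vector-valued function defined by $\phi(x) = [\varphi(x_1) \ \cdots \ \varphi(x_n)]^\top$ satisfies
(18) $\begin{bmatrix} x \\ \phi(x) \end{bmatrix}^\top \begin{bmatrix} -2\alpha\beta T & (\alpha + \beta) T \\ (\alpha + \beta) T & -2T \end{bmatrix} \begin{bmatrix} x \\ \phi(x) \end{bmatrix} \geq 0$
for all $x \in \mathbb{R}^n$ and all $T \in \mathcal{T}_n$.
Proof of Lemma A.3. Note that the slope restriction condition implies that for any two pairs $(x_i, \varphi(x_i))$ and $(x_j, \varphi(x_j))$, we can write the following incremental quadratic constraint:
(19) $\lambda_{ij} \begin{bmatrix} x_i - x_j \\ \varphi(x_i) - \varphi(x_j) \end{bmatrix}^\top \begin{bmatrix} -2\alpha\beta & \alpha + \beta \\ \alpha + \beta & -2 \end{bmatrix} \begin{bmatrix} x_i - x_j \\ \varphi(x_i) - \varphi(x_j) \end{bmatrix} \geq 0, \quad 1 \leq i < j \leq n,$
where $\lambda_{ij} \geq 0$ is arbitrary. Similarly, for any two pairs $(x_i, \varphi(x_i))$ and $(0, \varphi(0)) = (0, 0)$, we can also write the incremental quadratic constraint
(20) $\lambda_{ii} \begin{bmatrix} x_i \\ \varphi(x_i) \end{bmatrix}^\top \begin{bmatrix} -2\alpha\beta & \alpha + \beta \\ \alpha + \beta & -2 \end{bmatrix} \begin{bmatrix} x_i \\ \varphi(x_i) \end{bmatrix} \geq 0, \quad 1 \leq i \leq n,$
where $\lambda_{ii} \geq 0$ is arbitrary. By adding (19) and (20) over all indices and vectorizing the notation, we arrive at the compact representation (18).
Proof of Lemma 2.2. For a fixed $y \in \mathbb{R}$, define the map $\tilde{\varphi}$ by shifting $\varphi$, as follows,
$\tilde{\varphi}(x, y) = \varphi(x + y) - \varphi(y).$
It is not hard to verify that if $\varphi$ is slope-restricted on $[\alpha, \beta]$, the map $x \mapsto \tilde{\varphi}(x, y)$ is also slope-restricted on the same interval for any fixed $y$. Next, for a fixed $y \in \mathbb{R}^n$ define
$\tilde{\phi}(x, y) = [\tilde{\varphi}(x_1, y_1) \ \cdots \ \tilde{\varphi}(x_n, y_n)]^\top.$
Since $\tilde{\varphi}$ is slope-restricted on $[\alpha, \beta]$ and satisfies $\tilde{\varphi}(0, y) = 0$ for any fixed $y$, it follows from Lemma A.3 that $\tilde{\phi}$ satisfies the quadratic constraint
$\begin{bmatrix} x \\ \tilde{\phi}(x, y) \end{bmatrix}^\top \begin{bmatrix} -2\alpha\beta T & (\alpha + \beta) T \\ (\alpha + \beta) T & -2T \end{bmatrix} \begin{bmatrix} x \\ \tilde{\phi}(x, y) \end{bmatrix} \geq 0.$
By substituting the definition of $\tilde{\phi}$ and replacing $x$ with $x - y$, we obtain
(21) $\begin{bmatrix} x - y \\ \phi(x) - \phi(y) \end{bmatrix}^\top \begin{bmatrix} -2\alpha\beta T & (\alpha + \beta) T \\ (\alpha + \beta) T & -2T \end{bmatrix} \begin{bmatrix} x - y \\ \phi(x) - \phi(y) \end{bmatrix} \geq 0.$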
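A.4 Proof of Theorem 2.3
Define $x^1 = \phi(W^0 x + b^0)$ and $y^1 = \phi(W^0 y + b^0)$ for two arbitrary inputs $x, y \in \mathbb{R}^{n_0}$. Using Lemma 2.2, we can write the quadratic inequality
$\begin{bmatrix} W^0 (x - y) \\ x^1 - y^1 \end{bmatrix}^\top \begin{bmatrix} -2\alpha\beta T & (\alpha + \beta) T \\ (\alpha + \beta) T & -2T \end{bmatrix} \begin{bmatrix} W^0 (x - y) \\ x^1 - y^1 \end{bmatrix} \geq 0,$
where $T \in \mathcal{T}_n$ and $\mathcal{T}_n$ is defined as in (8). The preceding inequality can be simplified to
(22) $\begin{bmatrix} x - y \\ x^1 - y^1 \end{bmatrix}^\top \begin{bmatrix} -2\alpha\beta\, (W^0)^\top T W^0 & (\alpha + \beta) (W^0)^\top T \\ (\alpha + \beta)\, T W^0 & -2T \end{bmatrix} \begin{bmatrix} x - y \\ x^1 - y^1 \end{bmatrix} \geq 0.$
By left and right multiplying $M(\rho, T)$ in (10) by $[(x - y)^\top \ (x^1 - y^1)^\top]$ and its transpose, respectively, and rearranging terms, we obtain
(23) $\begin{bmatrix} x - y \\ x^1 - y^1 \end{bmatrix}^\top \begin{bmatrix} -2\alpha\beta\, (W^0)^\top T W^0 & (\alpha + \beta) (W^0)^\top T \\ (\alpha + \beta)\, T W^0 & -2T \end{bmatrix} \begin{bmatrix} x - y \\ x^1 - y^1 \end{bmatrix} \leq \rho\, \|x - y\|_2^2 - (x^1 - y^1)^\top (W^1)^\top W^1 (x^1 - y^1).$
Combining (22) and (23), we obtain
$(x^1 - y^1)^\top (W^1)^\top W^1 (x^1 - y^1) \leq \rho\, (x - y)^\top (x - y),$
or, equivalently,
$\|W^1 (x^1 - y^1)\|_2^2 \leq \rho\, \|x - y\|_2^2.$
Finally, note that by definition of $x^1$ and $y^1$, we have $f(x) = W^1 x^1 + b^1$ and $f(y) = W^1 y^1 + b^1$. Therefore, the preceding inequality implies
(24) $\|f(x) - f(y)\|_2 \leq \sqrt{\rho}\, \|x - y\|_2 \quad \text{for all } x, y \in \mathbb{R}^{n_0}.$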
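A.5 Proof of Theorem 2.4
For two arbitrary inputs $x^0, y^0 \in \mathbb{R}^{n_0}$, define $\mathbf{x} = [(x^0)^\top \ \cdots \ (x^\ell)^\top]^\top$ and $\mathbf{y} = [(y^0)^\top \ \cdots \ (y^\ell)^\top]^\top$. Using the compact notation in (12), we can write
$B\mathbf{x} = \phi(A\mathbf{x} + b), \quad B\mathbf{y} = \phi(A\mathbf{y} + b).$
Multiply both sides of the first matrix in (14) by $(\mathbf{x} - \mathbf{y})^\top$ and $(\mathbf{x} - \mathbf{y})$, respectively, and use the preceding identities to obtain
(25) $(\mathbf{x} - \mathbf{y})^\top \begin{bmatrix} A \\ B \end{bmatrix}^\top \begin{bmatrix} -2\alpha\beta T & (\alpha + \beta) T \\ (\alpha + \beta) T & -2T \end{bmatrix} \begin{bmatrix} A \\ B \end{bmatrix} (\mathbf{x} - \mathbf{y}) = \begin{bmatrix} A\mathbf{x} - A\mathbf{y} \\ \phi(A\mathbf{x} + b) - \phi(A\mathbf{y} + b) \end{bmatrix}^\top \begin{bmatrix} -2\alpha\beta T & (\alpha + \beta) T \\ (\alpha + \beta) T & -2T \end{bmatrix} \begin{bmatrix} A\mathbf{x} - A\mathbf{y} \\ \phi(A\mathbf{x} + b) - \phi(A\mathbf{y} + b) \end{bmatrix} \geq 0,$
where the last inequality follows from Lemma 2.2. On the other hand, by multiplying both sides of the second matrix in (14) by $(\mathbf{x} - \mathbf{y})^\top$ and $(\mathbf{x} - \mathbf{y})$, respectively, we can write
(26) $-\rho\, \|x^0 - y^0\|_2^2 + (x^\ell - y^\ell)^\top (W^\ell)^\top W^\ell (x^\ell - y^\ell) = -\rho\, \|x^0 - y^0\|_2^2 + \|f(x^0) - f(y^0)\|_2^2,$
where we have used the fact that $f(x) = W^\ell x^\ell + b^\ell$ and $x^0 = x$. By adding both sides of (25) and (26), we get
(27) $(\mathbf{x} - \mathbf{y})^\top M(\rho, T) (\mathbf{x} - \mathbf{y}) \geq -\rho\, \|x^0 - y^0\|_2^2 + \|f(x^0) - f(y^0)\|_2^2.$
When the LMI in (14) holds, the left-hand side of (27) is non-positive, implying that the right-hand side is non-positive.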
A.6 Bounding the output set of a neural network classifier
As we have shown, accurate estimation of the Lipschitz constant can provide tighter bounds on adversarial examples drawn from a perturbation set $\mathcal{X}_\epsilon$ around a nominal point $x^\star$. Figure 8 shows these bounds for a neural network trained on the Iris dataset. Because this dataset has three classes, we project the output sets onto the coordinate axes. This figure clearly shows that our bound is quite close to the naive lower bound. Indeed, the naive upper bound and the constant computed by CPLip are considerably looser.
A.7 Further analysis of scalability
Figure 9 shows the Lipschitz bounds found for the networks used in Table 1.