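A function $f\colon \mathbb{R}^n \to \mathbb{R}^m$ is globally Lipschitz continuous on $\mathcal{X} \subseteq \mathbb{R}^n$ if there exists a nonnegative constant $L \ge 0$ such that
$$\|f(x) - f(y)\| \le L\,\|x - y\| \quad \text{for all } x, y \in \mathcal{X}. \tag{1}$$
The smallest such $L$ is called the Lipschitz constant of $f$. The Lipschitz constant is the maximum ratio between variations in the output space and variations in the input space of $f$ and thus is a measure of the sensitivity of the function with respect to input perturbations.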
When a function $f$ is characterized by a deep neural network (DNN), tight bounds on its Lipschitz constant can be extremely useful in a variety of applications. In classification tasks, for instance, the Lipschitz constant can be used as a certificate of robustness of a neural network classifier to adversarial attacks if it is estimated tightly [szegedy2013intriguing]. In deep reinforcement learning, tight bounds on the Lipschitz constant of a DNN-based controller can be directly used to analyze the stability of the closed-loop system. Lipschitz regularity can also play a key role in the derivation of generalization bounds [bartlett2017spectrally]. In these applications and many others, it is essential to have tight bounds on the Lipschitz constant of DNNs. However, as DNNs have highly complex and non-linear structures, estimating the Lipschitz constant both accurately and efficiently has remained a significant challenge.
In this paper we propose a novel convex programming framework to derive tight bounds on the global Lipschitz constant of deep feed-forward neural networks. Our framework yields significantly more accurate bounds than the state of the art and lends itself to a distributed implementation, leading to efficient computation of the bounds for large-scale networks.
Our approach. We use the fact that nonlinear activation functions used in neural networks are gradients of convex functions; hence, as operators, they satisfy certain properties that can be abstracted as quadratic constraints on their input-output values. This particular abstraction allows us to pose the Lipschitz estimation problem as a semidefinite program (SDP), which we call LipSDP. A striking feature of LipSDP is its flexibility to span the trade-off between estimation accuracy and computational efficiency by adding or removing extra decision variables. In particular, for a neural network with $\ell$ layers and a total of $n$ hidden neurons, the number of decision variables can vary from $\ell$ (least accurate but most scalable) to $O(n^2)$ (most accurate but least scalable). As such, we derive several distinct yet related formulations of LipSDP that span this trade-off. To scale each variant of LipSDP to larger networks, we also propose a distributed implementation.
Our results. We illustrate our approach in a variety of experiments on both randomly generated networks as well as networks trained on the MNIST [lecun1998mnist] and Iris [Dua:2019] datasets. First, we show empirically that our Lipschitz bounds are the most accurate compared to all other existing methods of which we are aware. In particular, our experiments on neural networks trained for MNIST show that our bounds almost coincide with the true Lipschitz constant and outperform all comparable methods. For details, see Figure 1(a). Furthermore, we investigate the effect of two robust training procedures [madry2017towards, kolter2017provable] on the Lipschitz constant for networks trained on the MNIST dataset. Our results suggest that robust training procedures significantly decrease the Lipschitz constant of the resulting classifiers. Moreover, we use the Lipschitz bound for two robust training procedures to derive non-vacuous lower bounds on the minimum adversarial perturbation necessary to change the classification of any instance from the test set. For details, see Figure 4.
Related work. The problem of estimating the Lipschitz constant for neural networks has been studied in several works. In [szegedy2013intriguing], the authors estimate the global Lipschitz constant of DNNs by the product of Lipschitz constants of individual layers. This approach is scalable and general but yields trivial bounds. We are only aware of two other methods that give non-trivial upper bounds on the global Lipschitz constant of fully-connected neural networks and can scale to networks with more than two hidden layers. In [combettes_lipschitz], Combettes and Pesquet derive bounds on Lipschitz constants by treating the activation functions as non-expansive averaged operators. The resulting algorithm scales well with the number of hidden units per layer, but very poorly (in fact, exponentially) with the number of layers. In [virmaux2018lipschitz], Virmaux and Scaman decompose the weight matrices of a neural network via singular value decomposition and approximately solve a convex maximization problem over the unit cube. Notably, estimating the Lipschitz constant using the method in [virmaux2018lipschitz] is intractable even for small networks; indeed, the authors of [virmaux2018lipschitz] use a greedy algorithm to compute a bound, which may underapproximate the Lipschitz constant. Bounding Lipschitz constants for the specific case of convolutional neural networks (CNNs) has also been addressed in [balan2017lipschitz, zou2018lipschitz, bartlett2017spectrally].
Using Lipschitz bounds in the context of adversarial robustness and safety verification has also been addressed in several works [weng2018evaluating, ruan2018reachability, weng2018towards]. In particular, in [weng2018evaluating], the authors convert the robustness analysis problem into a local Lipschitz constant estimation problem, where they estimate this local constant by a set of independently and identically sampled local gradients. This algorithm is scalable but is not guaranteed to provide upper bounds. In a similar work, the authors of [weng2018towards] exploit the piece-wise linear structure of ReLU functions to estimate the local Lipschitz constant of neural networks. In [fazlyab2019safety], the authors use quadratic constraints and semidefinite programming to analyze local (point-wise) robustness of neural networks. In contrast, our Lipschitz bounds can be used as a global certificate of robustness and are agnostic to the choice of the test data.
1.1 Motivating applications
We now enumerate several applications that highlight the importance of estimating the Lipschitz constant of DNNs accurately and efficiently.
Robustness certification of classifiers. In response to fragility of DNNs to adversarial attacks, there has been considerable effort in recent years to improve the robustness of neural networks against adversarial attacks and input perturbations [goodfellow6572explaining, papernot2016distillation, zheng2016improving, kurakin2016adversarial, madry2017towards, kolter2017provable]. In order to certify and/or improve the robustness of neural networks, one must be able to bound the possible outputs of the neural network over a region of input space. This can be done either locally around a specific input [bastani2016measuring, tjeng2017evaluating, gehr2018ai2, singh2018fast, dutta2018output, raghunathan2018certified, raghunathan2018semidefinite, fazlyab2019safety, kolter2017provable, jordan2019provable, wong2018scaling, zhang2018efficient], or globally by bounding the sensitivity of the function to input perturbations, i.e., the Lipschitz constant [huster2018limitations, szegedy2013intriguing, qian2018l2, weng2018evaluating]. Indeed, tight upper bounds on the Lipschitz constant can be used to derive non-vacuous lower bounds on the magnitudes of perturbations necessary to change the decision of neural networks. Finally, an efficient computation of these bounds can be useful in either assessing robustness after training [raghunathan2018certified, raghunathan2018semidefinite, fazlyab2019safety] or promoting robustness during training [kolter2017provable, tsuzuku2018lipschitz]. In the experiments section, we explore this application in depth.
Stability analysis of closed-loop systems with learning controllers. A central problem in learning-based control is to provide stability or safety guarantees for a feedback control loop when a learning-enabled component, such as a deep neural network, is introduced in the loop [aswani2013provably, berkenkamp2017safe, jin2018stability]. The Lipschitz constant of a neural network controller bounds its gain. Therefore a tight estimate can be useful for certifying the stability of the closed-loop system.
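Notation. We denote the set of real $n$-dimensional vectors by $\mathbb{R}^n$, the set of $m \times n$-dimensional matrices by $\mathbb{R}^{m \times n}$, and the $n$-dimensional identity matrix by $I_n$. We denote by $\mathbb{S}^n$, $\mathbb{S}^n_{+}$, and $\mathbb{S}^n_{++}$ the sets of $n$-by-$n$ symmetric, positive semidefinite, and positive definite matrices, respectively. The $p$-norm ($p \ge 1$) is denoted by $\|\cdot\|_p \colon \mathbb{R}^n \to \mathbb{R}_{+}$. The $\ell_2$-norm of a matrix $W \in \mathbb{R}^{m \times n}$ is the largest singular value of $W$. We denote the $i$-th unit vector in $\mathbb{R}^n$ by $e_i$. We write $\mathrm{diag}(a_1, \ldots, a_n)$ for a diagonal matrix whose diagonal entries starting in the upper left corner are $a_1, \ldots, a_n$.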
2 LipSDP: Lipschitz certificates via semidefinite programming
2.1 Problem statement
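Consider an $\ell$-layer feed-forward neural network $f\colon \mathbb{R}^{n_0} \to \mathbb{R}^{n_{\ell+1}}$ described by the following recursive equations:
$$x^0 = x, \qquad x^{k+1} = \phi(W^k x^k + b^k), \quad k = 0, \ldots, \ell - 1, \qquad f(x) = W^{\ell} x^{\ell} + b^{\ell}. \tag{2}$$
Here $x \in \mathbb{R}^{n_0}$ is an input to the network and $W^k \in \mathbb{R}^{n_{k+1} \times n_k}$ and $b^k \in \mathbb{R}^{n_{k+1}}$ are the weight matrix and bias vector for the $k$-th layer. The function $\phi$ is the concatenation of activation functions at each layer, i.e., it is of the form $\phi(x) = [\varphi(x_1) \, \cdots \, \varphi(x_n)]^\top$. In this paper, our goal is to find tight bounds on the Lipschitz constant of the map $x \mapsto f(x)$ in $\ell_2$-norm. More precisely, we wish to find the smallest constant $L_2 \ge 0$ such that $\|f(x) - f(y)\|_2 \le L_2 \|x - y\|_2$ for all $x, y \in \mathbb{R}^{n_0}$.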
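As a concrete illustration of the recursion, the following is a minimal pure-Python sketch of a forward pass. It assumes the convention that every layer except the last applies the activation component-wise and the last layer is affine. The helper names (`affine`, `forward`), the ReLU choice, and the small 2-2-1 network are hypothetical and serve only as an example.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def relu(z: float) -> float:
    return max(0.0, z)


def affine(W: Matrix, x: Sequence[float], b: Sequence[float]) -> Vector:
    """Return W x + b for a dense matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def forward(weights: List[Matrix], biases: List[Vector], x: Sequence[float],
            activation: Callable[[float], float] = relu) -> Vector:
    """Evaluate the network: hidden layers apply the activation, the last layer is affine."""
    h = list(x)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = [activation(z) for z in affine(W, h, b)]
    return affine(weights[-1], h, biases[-1])


# Hypothetical 2-2-1 network, used only for illustration.
weights = [[[1.0, -1.0], [0.5, 2.0]], [[1.0, -2.0]]]
biases = [[0.0, -1.0], [0.5]]
print(forward(weights, biases, [1.0, 2.0]))
```

Storing the weights as a list ordered from the first to the last layer mirrors the layer index in the recursion, which keeps later bound computations (products of per-layer quantities) straightforward.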
The main source of difficulty in solving this problem is the presence of the nonlinear activation functions. To combat this difficulty, our main idea is to abstract these activation functions by a set of constraints that they impose on their input and output values. Then any property (including Lipschitz continuity) that is satisfied by our abstraction will also be satisfied by the original network.
2.2 Description of activation functions by quadratic constraints
In this section, we introduce several definitions and lemmas that characterize our abstraction of nonlinear activation functions. These results are crucial to the formulation of an SDP that can bound the Lipschitz constants of networks in Section 2.3.
Definition (Slope-restricted non-linearity)
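A function $\varphi\colon \mathbb{R} \to \mathbb{R}$ is slope-restricted on $[\alpha, \beta]$ where $0 \le \alpha < \beta < \infty$ if
$$\alpha \le \frac{\varphi(y) - \varphi(x)}{y - x} \le \beta \quad \text{for all } x, y \in \mathbb{R},\ x \neq y. \tag{3}$$
The inequality in (3) simply states that the slope of the chord connecting any two points on the curve of the function is at least $\alpha$ and at most $\beta$ (see Figure 1). By multiplying all sides of (3) by $(y - x)^2$, we can write the slope restriction condition as $\alpha (y - x)^2 \le (\varphi(y) - \varphi(x))(y - x) \le \beta (y - x)^2$. By the left inequality, the operator $\varphi(\cdot)$ is strongly monotone with parameter $\alpha$ [ryu2016primer], or equivalently the anti-derivative function $\int \varphi(x)\,dx$ is strongly convex with parameter $\alpha$. By the right-hand side inequality, $\varphi(\cdot)$ is one-sided Lipschitz with parameter $\beta$. Altogether, the preceding inequalities state that the anti-derivative function $\int \varphi(x)\,dx$ is $\alpha$-strongly convex and $\beta$-smooth.
Note that all activation functions used in deep learning satisfy the slope restriction condition in (3) for some $0 \le \alpha < \beta < \infty$. For instance, the ReLU, tanh, and sigmoid activation functions are all slope-restricted with $\alpha = 0$ and $\beta = 1$. More details can be found in [fazlyab2019safety].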
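The slope restriction can be sanity-checked numerically by sampling chords. The sketch below assumes the sector $[0, 1]$ for ReLU, tanh, and sigmoid; the function name `chord_slopes`, the sampling range, and the number of samples are illustrative choices. Sampling gives evidence only, not a proof.

```python
import math
import random


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def chord_slopes(fn, num_pairs=10_000, scale=5.0, seed=0):
    """Return the smallest and largest chord slope of fn over random point pairs."""
    rng = random.Random(seed)
    slopes = []
    while len(slopes) < num_pairs:
        x, y = rng.uniform(-scale, scale), rng.uniform(-scale, scale)
        if abs(y - x) < 1e-6:  # skip nearly coincident points to avoid round-off blow-up
            continue
        slopes.append((fn(y) - fn(x)) / (y - x))
    return min(slopes), max(slopes)


for name, fn in [("relu", relu), ("tanh", math.tanh), ("sigmoid", sigmoid)]:
    lo, hi = chord_slopes(fn)
    print(f"{name}: sampled chord slopes lie in [{lo:.4f}, {hi:.4f}]")
```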
Definition (Incremental Quadratic Constraint [accikmecse2011observers])
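A function $\phi\colon \mathbb{R}^n \to \mathbb{R}^n$ satisfies the incremental quadratic constraint defined by $\mathcal{Q} \subset \mathbb{S}^{2n}$ if for any $Q \in \mathcal{Q}$ and $x, y \in \mathbb{R}^n$,
$$\begin{bmatrix} x - y \\ \phi(x) - \phi(y) \end{bmatrix}^\top Q \begin{bmatrix} x - y \\ \phi(x) - \phi(y) \end{bmatrix} \ge 0. \tag{4}$$
In the above definition, $\mathcal{Q}$ is the set of all multiplier matrices that characterize $\phi$, and is a convex cone by definition. As an example, the softmax operator $\phi(x) = \left(\sum_{i=1}^n \exp(x_i)\right)^{-1} [\exp(x_1) \, \cdots \, \exp(x_n)]^\top$ is the gradient of the convex function $\psi(x) = \log\left(\sum_{i=1}^n \exp(x_i)\right)$. This function is $\alpha$-strongly convex and $\beta$-smooth with parameters $\alpha = 0$ and $\beta = 1$ [boyd2004convex]. For this class of functions, it is known that the gradient function $\phi(x) = \nabla \psi(x)$ satisfies the quadratic inequality [nesterov2013introductory]
$$\begin{bmatrix} x - y \\ \phi(x) - \phi(y) \end{bmatrix}^\top \begin{bmatrix} -2\alpha\beta I_n & (\alpha + \beta) I_n \\ (\alpha + \beta) I_n & -2 I_n \end{bmatrix} \begin{bmatrix} x - y \\ \phi(x) - \phi(y) \end{bmatrix} \ge 0. \tag{5}$$
Therefore, the softmax operator satisfies the incremental quadratic constraint defined by $\mathcal{Q} = \{\lambda M \mid \lambda \ge 0\}$, where $M$ is the middle matrix in the above inequality.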
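To see the connection between incremental quadratic constraints and slope-restricted nonlinearities, note that (3) can be equivalently written as the single inequality
$$\left(\frac{\varphi(y) - \varphi(x)}{y - x} - \alpha\right)\left(\frac{\varphi(y) - \varphi(x)}{y - x} - \beta\right) \le 0. \tag{6}$$
Multiplying through by $(y - x)^2$ and rearranging terms, we can write (6) as
$$\begin{bmatrix} x - y \\ \varphi(x) - \varphi(y) \end{bmatrix}^\top \begin{bmatrix} -2\alpha\beta & \alpha + \beta \\ \alpha + \beta & -2 \end{bmatrix} \begin{bmatrix} x - y \\ \varphi(x) - \varphi(y) \end{bmatrix} \ge 0, \tag{7}$$
which, in view of Definition 2.2, is an incremental quadratic constraint for $\varphi$. From this perspective, incremental quadratic constraints generalize the notion of slope-restricted nonlinearities to multi-variable vector-valued nonlinearities.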
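Repeated nonlinearities. Now consider the vector-valued function $\phi(x) = [\varphi(x_1) \, \cdots \, \varphi(x_n)]^\top$ obtained by applying a slope-restricted function $\varphi$ component-wise to a vector $x \in \mathbb{R}^n$. By exploiting the fact that the same function $\varphi$ is applied to each component, we can characterize $\phi(x)$ by incremental quadratic constraints. In the following lemma, we provide such a characterization.
Suppose $\varphi\colon \mathbb{R} \to \mathbb{R}$ is slope-restricted on $[\alpha, \beta]$. Define the set
$$\mathcal{T}_n = \Big\{ T \in \mathbb{S}^n \ \Big|\ T = \sum_{i=1}^n \lambda_{ii} e_i e_i^\top + \sum_{1 \le i < j \le n} \lambda_{ij} (e_i - e_j)(e_i - e_j)^\top,\ \lambda_{ij} \ge 0 \Big\}. \tag{8}$$
Then for any $T \in \mathcal{T}_n$ the vector-valued function $\phi(x) = [\varphi(x_1) \, \cdots \, \varphi(x_n)]^\top$ satisfies
$$\begin{bmatrix} x - y \\ \phi(x) - \phi(y) \end{bmatrix}^\top \begin{bmatrix} -2\alpha\beta T & (\alpha + \beta) T \\ (\alpha + \beta) T & -2T \end{bmatrix} \begin{bmatrix} x - y \\ \phi(x) - \phi(y) \end{bmatrix} \ge 0 \quad \text{for all } x, y \in \mathbb{R}^n. \tag{9}$$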
Concretely, this lemma captures the coupling between neurons in a neural network by taking advantage of two particular structures: (a) the same activation function is applied to each hidden neuron and (b) all activation functions are slope-restricted on the same interval $[\alpha, \beta]$. In this way, we can write the slope restriction condition in (2.2) for any pair of activation functions in a given neural network. A conic combination of these constraints would yield (9), where $\lambda_{ij}$ are the coefficients of this combination. See Figure 1 for an illustrative description.
We will see in the next section that the matrix $T \in \mathcal{T}_n$ that parameterizes the multiplier matrix in (9) appears as a decision variable in an SDP, in which the objective is to find an admissible $T$ that yields the tightest bound on the Lipschitz constant.
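To make the role of the multiplier concrete, the following sketch numerically checks the quadratic constraint in the special case of a diagonal multiplier $T = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ with $\lambda_i \ge 0$, where the quadratic form reduces to a weighted sum of per-neuron slope-restriction inequalities. The restriction to diagonal $T$, the ReLU activation with sector $[0, 1]$, and the name `qc_value` are assumptions made for this illustration.

```python
import random


def relu(z):
    return max(0.0, z)


def qc_value(x, y, lambdas, alpha=0.0, beta=1.0, phi=relu):
    """Quadratic form of the constraint for a diagonal multiplier T = diag(lambdas)."""
    total = 0.0
    for xi, yi, lam in zip(x, y, lambdas):
        dx = xi - yi
        dphi = phi(xi) - phi(yi)
        total += lam * (-2.0 * alpha * beta * dx * dx
                        + 2.0 * (alpha + beta) * dx * dphi
                        - 2.0 * dphi * dphi)
    return total


# Sample random input pairs and nonnegative multipliers; record the smallest value seen.
rng = random.Random(0)
n = 5
worst = min(
    qc_value([rng.gauss(0, 1) for _ in range(n)],
             [rng.gauss(0, 1) for _ in range(n)],
             [rng.random() for _ in range(n)])
    for _ in range(2000)
)
print("constraint held on all samples:", worst >= -1e-12)
```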
2.3 LipSDP for single-layer neural network
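To develop an optimization problem to estimate the Lipschitz constant of a fully-connected feed-forward neural network, the key insight is that the Lipschitz condition in (1) is in fact equivalent to an incremental quadratic constraint for the map $x \mapsto f(x)$ characterized by the neural network. By coupling this to the incremental quadratic constraints satisfied by the cascade combination of the activation functions [fazlyab2018analysis], we can develop an SDP to minimize an upper bound on the Lipschitz constant of $f$. This result is formally stated in the following theorem.
Theorem (Lipschitz certificates for single-layer neural networks)
Consider a single-layer neural network described by $f(x) = W^1 \phi(W^0 x + b^0) + b^1$. Suppose $\phi(x) = [\varphi(x_1) \, \cdots \, \varphi(x_n)]^\top \colon \mathbb{R}^n \to \mathbb{R}^n$, where $\varphi$ is slope-restricted in the sector $[\alpha, \beta]$. Define $\mathcal{T}_n$ as in (8). Suppose there exists a $\rho > 0$ such that the matrix inequality
$$M(\rho, T) := \begin{bmatrix} -2\alpha\beta\, W^{0\top} T W^0 - \rho I_{n_0} & (\alpha + \beta)\, W^{0\top} T \\ (\alpha + \beta)\, T W^0 & -2T + W^{1\top} W^1 \end{bmatrix} \preceq 0 \tag{10}$$
holds for some $T \in \mathcal{T}_n$. Then $\|f(x) - f(y)\|_2 \le \sqrt{\rho}\, \|x - y\|_2$ for all $x, y \in \mathbb{R}^{n_0}$.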
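Theorem 2.3 provides us with a sufficient condition for $L_2 = \sqrt{\rho}$ to be an upper bound on the Lipschitz constant of $f(x) = W^1 \phi(W^0 x + b^0) + b^1$. In particular, we can find the tightest bound by solving the following optimization problem:
$$\text{minimize } \rho \quad \text{subject to } M(\rho, T) \preceq 0 \ \text{ and } \ T \in \mathcal{T}_n, \tag{11}$$
where the decision variables are $(\rho, T) \in \mathbb{R}_{+} \times \mathcal{T}_n$. Note that $M(\rho, T)$ is linear in $\rho$ and $T$ and the set $\mathcal{T}_n$ is convex. Hence, (11) is an SDP, which can be solved numerically for its global minimum.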
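For intuition, the program can be solved by hand for the smallest possible network: one input, one hidden ReLU neuron, one output, $f(x) = w_1 \varphi(w_0 x + b_0) + b_1$. The sketch below assumes the matrix inequality then reduces to the $2 \times 2$ condition $\begin{bmatrix} -2\alpha\beta w_0^2 \lambda - \rho & (\alpha+\beta) w_0 \lambda \\ (\alpha+\beta) w_0 \lambda & -2\lambda + w_1^2 \end{bmatrix} \preceq 0$ with a single multiplier $\lambda \ge 0$ and $(\alpha, \beta) = (0, 1)$. A grid search over $\lambda$ with bisection on $\rho$ stands in for an SDP solver here; the function names are hypothetical, and this is not how larger instances would be solved.

```python
import math


def lmi_holds(rho, lam, w0, w1, alpha=0.0, beta=1.0, tol=1e-12):
    """Check that the 2x2 symmetric matrix M(rho, lam) is negative semidefinite."""
    m11 = -2.0 * alpha * beta * w0 * w0 * lam - rho
    m12 = (alpha + beta) * w0 * lam
    m22 = -2.0 * lam + w1 * w1
    # A symmetric 2x2 matrix is NSD iff both diagonal entries are <= 0 and det >= 0.
    return m11 <= tol and m22 <= tol and m11 * m22 - m12 * m12 >= -tol


def smallest_rho(lam, w0, w1, hi=1e6, iters=100):
    """Bisect on rho for a fixed multiplier lam; return None if no rho <= hi is feasible."""
    if not lmi_holds(hi, lam, w0, w1):
        return None
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if lmi_holds(mid, lam, w0, w1):
            hi = mid
        else:
            lo = mid
    return hi


def lipsdp_scalar(w0, w1, grid=2000, lam_max=20.0):
    """Return sqrt(rho*) minimized over a grid of multipliers lam in (0, lam_max]."""
    best = math.inf
    for k in range(1, grid + 1):
        rho = smallest_rho(lam_max * k / grid, w0, w1)
        if rho is not None:
            best = min(best, rho)
    return math.sqrt(best)


print(lipsdp_scalar(2.0, 3.0))
```

In this scalar case the minimization can also be carried out in closed form: feasibility requires $\rho \ge w_0^2 \lambda^2 / (2\lambda - w_1^2)$ for $\lambda > w_1^2/2$, which is minimized at $\lambda = w_1^2$ and gives $\sqrt{\rho} = |w_0 w_1|$, the Lipschitz constant of the one-neuron ReLU network.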
2.4 LipSDP for multi-layer neural networks
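We now consider the multi-layer case. Assuming that all the activation functions are the same, we can write the neural network model in (2) compactly as
$$B\mathbf{x} = \phi(A\mathbf{x} + b) \quad \text{and} \quad f(x) = C\mathbf{x} + b^{\ell}, \tag{12}$$
where $\mathbf{x} = [x^{0\top} \, x^{1\top} \, \cdots \, x^{\ell\top}]^\top$ is the concatenation of the input and the activation values, and the matrices $A$, $B$, and $C$ and the vector $b$ are given by [fazlyab2019safety]
$$A = \begin{bmatrix} W^0 & 0 & \cdots & 0 & 0 \\ 0 & W^1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & W^{\ell-1} & 0 \end{bmatrix}, \quad B = \begin{bmatrix} 0 & I_{n_1} & 0 & \cdots & 0 \\ 0 & 0 & I_{n_2} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & I_{n_\ell} \end{bmatrix}, \quad C = \begin{bmatrix} 0 & \cdots & 0 & W^{\ell} \end{bmatrix}, \quad b = \begin{bmatrix} b^{0\top} & \cdots & b^{(\ell-1)\top} \end{bmatrix}^\top. \tag{13}$$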
The particular representation in (12) facilitates the extension of LipSDP to multiple layers, as stated in the following theorem.
Theorem (Lipschitz certificates for multi-layer neural networks)
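Consider an $\ell$-layer fully connected neural network described by (2). Let $n = \sum_{k=1}^{\ell} n_k$ be the total number of hidden neurons and suppose the activation functions are slope-restricted in the sector $[\alpha, \beta]$. Define $\mathcal{T}_n$ as in (8). Define $A$ and $B$ as in (13). Consider the matrix inequality
$$M(\rho, T) = \begin{bmatrix} A \\ B \end{bmatrix}^\top \begin{bmatrix} -2\alpha\beta T & (\alpha + \beta) T \\ (\alpha + \beta) T & -2T \end{bmatrix} \begin{bmatrix} A \\ B \end{bmatrix} + \begin{bmatrix} -\rho I_{n_0} & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & (W^{\ell})^\top W^{\ell} \end{bmatrix} \preceq 0. \tag{14}$$
If (14) is satisfied for some $(\rho, T) \in \mathbb{R}_{+} \times \mathcal{T}_n$, then $\|f(x) - f(y)\|_2 \le \sqrt{\rho}\, \|x - y\|_2$, $\forall x, y \in \mathbb{R}^{n_0}$.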
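We have only considered the $\ell_2$ norm in our exposition. By using the inequality $\|x\|_q \le \|x\|_p \le n^{\frac{1}{p} - \frac{1}{q}} \|x\|_q$ for $1 \le p \le q$, the $\ell_2$-Lipschitz bound implies, for $p \le 2 \le q$,
$$n^{-(\frac{1}{p} - \frac{1}{2})} \|f(x) - f(y)\|_p \le \|f(x) - f(y)\|_2 \le L_2 \|x - y\|_2 \le n^{\frac{1}{2} - \frac{1}{q}} L_2 \|x - y\|_q,$$
or, equivalently, $\|f(x) - f(y)\|_p \le n^{\frac{1}{p} - \frac{1}{q}} L_2 \|x - y\|_q$. Hence, $n^{\frac{1}{p} - \frac{1}{q}} L_2$ is a Lipschitz constant of $f$ when $\ell_q$ and $\ell_p$ norms are used in the input and output spaces, respectively. We can also extend our framework to accommodate quadratic norms $\|x\|_P = \sqrt{x^\top P x}$, where $P \in \mathbb{S}^n_{++}$.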
2.5 Variants of LipSDP: reconciling accuracy and efficiency
In LipSDP, there are $O(n^2)$ decision variables $\lambda_{ij}$ ($1 \le i \le j \le n$), where $n$ is the total number of hidden neurons. For $i \neq j$, the variable $\lambda_{ij}$ couples the $i$-th and $j$-th hidden neuron. For $i = j$, the variable $\lambda_{ii}$ constrains the input-output of the $i$-th activation function individually. Using all these decision variables would provide the tightest convex relaxation in our formulation. Interestingly, numerical experiments suggest that this tightest convex relaxation yields bounds on the Lipschitz constant that almost coincide with the naive lower bound on the Lipschitz constant. However, solving this SDP with all the decision variables included is impractical for large networks. Nevertheless, we can consider a hierarchy of relaxations of LipSDP by removing a subset of the decision variables. Below, we give a brief description of the efficiency and accuracy of each variant. Throughout, we let $n$ be the total number of neurons and $\ell$ the number of hidden layers.
LipSDP-Network imposes constraints on all possible pairs of activation functions and has $O(n^2)$ decision variables. It is the least scalable but the most accurate method.
LipSDP-Neuron ignores the cross coupling constraints among different neurons and has $O(n)$ decision variables. It is more scalable and less accurate than LipSDP-Network. For this case, we have $T = \mathrm{diag}(\lambda_{11}, \ldots, \lambda_{nn})$.
LipSDP-Layer considers only one constraint per layer, resulting in $\ell$ decision variables. It is the most scalable and least accurate method. For this variant, we have $T = \mathrm{blkdiag}(\lambda_1 I_{n_1}, \ldots, \lambda_{\ell} I_{n_\ell})$.
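The trade-off among the three variants can be made tangible by counting multipliers for a given architecture. The sketch below assumes one multiplier per pair of hidden neurons $(i, j)$ with $i \le j$ for LipSDP-Network, one per hidden neuron for LipSDP-Neuron, and one per hidden layer for LipSDP-Layer; the scalar $\rho$ is not counted. The function names and the list-of-widths interface are illustrative assumptions.

```python
def num_decision_variables(layer_widths, variant):
    """Count the multipliers for a given variant; layer_widths lists hidden-layer sizes."""
    n = sum(layer_widths)
    if variant == "network":
        return n * (n + 1) // 2  # one multiplier for every pair i <= j
    if variant == "neuron":
        return n                 # one multiplier per hidden neuron
    if variant == "layer":
        return len(layer_widths)  # one multiplier per hidden layer
    raise ValueError(f"unknown variant: {variant!r}")


def layer_multiplier_diagonal(layer_widths, layer_lambdas):
    """Expand one multiplier per layer into the diagonal of the block-diagonal T."""
    diag = []
    for width, lam in zip(layer_widths, layer_lambdas):
        diag.extend([lam] * width)
    return diag


widths = [50, 50, 50]  # hypothetical network with three hidden layers
for variant in ("network", "neuron", "layer"):
    print(variant, num_decision_variables(widths, variant))
```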
Parallel implementation by splitting. The Lipschitz constant of the composition of two or more functions can be bounded by the product of the Lipschitz constants of the individual functions. By splitting a neural network up into small sub-networks, one can first bound the Lipschitz constant of each sub-network and then multiply these constants together to obtain a Lipschitz constant for the entire network. Because sub-networks do not share weights, it is possible to compute the Lipschitz constants for each sub-network in parallel. This greatly improves the scalability of each variant of LipSDP with respect to the total number of activation functions in the network.
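The splitting strategy can be sketched in a few lines. In the example below, the per-sub-network bound is a deliberately crude stand-in (the product of Frobenius norms of the weight matrices, which upper-bounds the Lipschitz constant when the activations are 1-Lipschitz); in practice one would substitute a LipSDP solve for `subnetwork_bound`. The function names, the thread pool, and the block size are assumptions made for illustration.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def frobenius_norm(W):
    return math.sqrt(sum(w * w for row in W for w in row))


def subnetwork_bound(weights):
    """Stand-in bound for one sub-network: product of Frobenius norms of its layers."""
    return math.prod(frobenius_norm(W) for W in weights)


def split(weights, layers_per_block):
    """Cut the list of weight matrices into consecutive blocks of layers."""
    return [weights[i:i + layers_per_block]
            for i in range(0, len(weights), layers_per_block)]


def parallel_bound(weights, layers_per_block, bound_fn=subnetwork_bound):
    """Bound each sub-network independently, then multiply the bounds together."""
    blocks = split(weights, layers_per_block)
    with ThreadPoolExecutor() as pool:
        bounds = list(pool.map(bound_fn, blocks))
    return math.prod(bounds)


# Hypothetical 2-2-1-1 network split into blocks of two layers.
example_weights = [[[3.0, 0.0], [0.0, 4.0]], [[1.0, 0.0]], [[2.0]]]
print(parallel_bound(example_weights, layers_per_block=2))
```

Because each block is bounded independently, the product is in general looser than a single bound computed for the whole network; the block size controls that trade-off.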
3 Experiments
In this section we describe several experiments that highlight the key aspects of this work. In particular, we show empirically that our bounds are much tighter than any comparable method, we study the impact of robust training on our Lipschitz bounds, and we analyze the scalability of our methods.
Experimental setup. For our experiments we used MATLAB, the CVX toolbox [grant2008cvx] and MOSEK [mosek] on a 9-core CPU with 16GB of RAM to solve the SDPs. All classifiers trained on MNIST used an 80-20 train-test split.
Training procedures. Several training procedures have recently been proposed to improve the robustness of neural network classifiers. Two prominent procedures are the LP-based method in [kolter2017provable] and the projected gradient descent (PGD) based method in [madry2017towards]. We refer to these training methods as LP-Train and PGD-Train, respectively. Both procedures take as input a parameter $\epsilon$ that defines the perturbation of the training data points.
Baselines. Throughout the experiments, we will often show comparisons to what we call the naive lower and upper bounds. As has been shown in several previous works [combettes_lipschitz, virmaux2018lipschitz], trivial lower and upper bounds on the Lipschitz constant of a feed-forward neural network with $\ell$ hidden layers are given by $L_{\mathrm{lower}} = \|W^{\ell} W^{\ell-1} \cdots W^0\|_2$ and $L_{\mathrm{upper}} = \prod_{i=0}^{\ell} \|W^i\|_2$, which we refer to as the naive lower and upper bounds, respectively. We are aware of only two methods that bound the Lipschitz constant and can scale to fully-connected networks with more than two hidden layers; these methods are [combettes_lipschitz], which we will refer to as CPLip, and [virmaux2018lipschitz], which is called SeqLip.
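A minimal pure-Python sketch of the two naive bounds follows, assuming they are the spectral norm of the product of all weight matrices (lower) and the product of the per-layer spectral norms (upper). The spectral norm is approximated by power iteration on $W^\top W$ with a fixed start vector, which can converge slowly when the two largest singular values are close; a linear-algebra library would normally be used instead. All function names are illustrative.

```python
import math


def matmul(A, B):
    """Dense matrix product for matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def spectral_norm(W, iters=500):
    """Approximate the largest singular value of W by power iteration on W^T W."""
    Wt = [list(col) for col in zip(*W)]
    G = matmul(Wt, W)
    v = [1.0 + 0.1 * i for i in range(len(G))]  # deterministic, generic start vector
    eig = 0.0
    for _ in range(iters):
        u = [sum(g * vi for g, vi in zip(row, v)) for row in G]
        norm = math.sqrt(sum(ui * ui for ui in u))
        if norm == 0.0:
            return 0.0
        v = [ui / norm for ui in u]
        eig = norm  # ||G v|| for unit v approaches the largest eigenvalue of G
    return math.sqrt(eig)


def naive_bounds(weights):
    """weights = [W0, ..., Wl]; return (norm of the product, product of the norms)."""
    product = weights[0]
    for W in weights[1:]:
        product = matmul(W, product)
    lower = spectral_norm(product)
    upper = math.prod(spectral_norm(W) for W in weights)
    return lower, upper


# Hypothetical two-layer example.
lower, upper = naive_bounds([[[2.0, 0.0], [0.0, 1.0]], [[0.0, 3.0]]])
print(lower, upper)
```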
We compare the Lipschitz bounds obtained by LipSDP-Neuron, LipSDP-Layer, CPLip, and SeqLip in Figure 1(a). It is evident from this figure that the bounds from LipSDP-Neuron are tighter than CPLip and SeqLip. Our results show that the true Lipschitz constants of the networks shown above are very close to the naive lower bound.
To demonstrate the scalability of the LipSDP formulations, we split a 100-hidden layer neural network into sub-networks with six hidden layers each and computed the Lipschitz bounds using LipSDP-Neuron and LipSDP-Layer. The results are shown in Figure 1(b). Furthermore, in Tables 1 and 2, we show the computation time for scaling the LipSDP methods in the number of hidden units per layer and in the number of layers. In particular, the largest network we tested in Table 2 had 50,000 hidden neurons; LipSDP-Neuron took approximately 12 minutes to find a Lipschitz bound, and LipSDP-Layer took approximately 4 minutes.
To evaluate LipSDP-Network, we coupled random pairs of hidden neurons in a one-hidden-layer network and plotted the computation time and Lipschitz bound found by LipSDP-Network as we increased the number of paired neurons. Our results show that as the number of coupled neurons increases, the computation time increases quadratically. This shows that while this method is the most accurate of the three proposed LipSDP methods, it is intractable for even modestly large networks.
Impact of robust training. In Figure 4, we empirically demonstrate that the Lipschitz bound of a neural network is directly related to the robustness of the corresponding classifier. This figure shows that LP-Train and PGD-Train networks achieve lower Lipschitz bounds than standard training procedures. Figure 2(a) indicates that robust training procedures yield lower Lipschitz constants than networks trained with standard training procedures such as the Adam optimizer. Figure 3 shows the utility of sharply estimating the Lipschitz constant; a lower value of $L_2$ guarantees that a neural network is more locally robust to input perturbations; see Proposition A.1 in the Appendix.
In the same vein, Figure 3 shows the impact of varying the robustness parameter $\epsilon$ used in LP-Train and PGD-Train on the test accuracy of networks trained for a fixed number of epochs and the corresponding Lipschitz constants. In essence, these results quantify how much robustness a fixed classifier can handle before accuracy plummets. Interestingly, the drops in accuracy as $\epsilon$ increases coincide with corresponding drops in the Lipschitz constant for both LP-Train and PGD-Train.
Robustness for different activation functions. The framework proposed in this work allows us to examine the impact of using different activation functions on the Lipschitz constant of neural networks. We trained two sets of neural networks on the MNIST dataset. The first set used ReLU activation functions, while the second set used leaky ReLU activations. Figure 6 shows empirically that the networks with the leaky ReLU activation function have larger Lipschitz constants than networks of the same architecture with the ReLU activation function.
4 Conclusions and future work
In this paper, we proposed a hierarchy of semidefinite programs to derive tight upper bounds on the Lipschitz constant of feed-forward fully-connected neural networks. Some comments are in order. First, our framework can be directly used to certify convolutional neural networks (CNNs) by unrolling them to a large feed-forward neural network. A future direction is to exploit the special structure of CNNs in the resulting SDP. Second, we only considered one application of Lipschitz bounds in depth (robustness certification). Having an accurate upper bound on the Lipschitz constant can be useful in domains beyond robustness analysis, such as stability analysis of feedback systems with control policies updated by deep reinforcement learning. Furthermore, Lipschitz bounds can be utilized during training as a heuristic to promote out-of-sample generalization [tsuzuku2018lipschitz]. We intend to pursue these applications in future work.
Appendix A
A.1 Robustness certification of DNN-based classifiers
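Consider a classifier described by a feed-forward neural network $f\colon \mathbb{R}^n \to \mathbb{R}^k$, where $n$ is the number of input features and $k$ is the number of classes. In this context, the function $f$ takes as input an instance or measurement $x$ and returns a $k$-dimensional vector of scores, one for each class. The classification rule is based on assigning $x$ to the class with the highest score. That is, we define the classification $C(x)\colon \mathbb{R}^n \to \{1, \ldots, k\}$ to be $C(x) = \arg\max_{1 \le i \le k} f_i(x)$. Now suppose that $x^\star$ is an instance that is classified correctly by the neural network. To evaluate the local robustness of the neural network around $x^\star$, we consider a bounded set $\mathcal{B}_\epsilon(x^\star) = \{x \mid \|x - x^\star\|_2 \le \epsilon\}$ that represents the set of all possible $\ell_2$-norm perturbations of $x^\star$. Then the classifier is locally robust at $x^\star$ against $\mathcal{B}_\epsilon(x^\star)$ if it assigns all the perturbed inputs to the same class as the unperturbed input, i.e., if
$$C(x) = C(x^\star) \quad \text{for all } x \in \mathcal{B}_\epsilon(x^\star). \tag{15}$$
In the following proposition, we derive a sufficient condition to guarantee local robustness around $x^\star$ for the perturbation set $\mathcal{B}_\epsilon(x^\star)$.
Consider a neural-network classifier $f\colon \mathbb{R}^n \to \mathbb{R}^k$ with Lipschitz constant $L_2$ in the $\ell_2$-norm. Let $\epsilon > 0$ be given and consider the inequality
$$L_2 \le \frac{1}{\epsilon\sqrt{2}} \min_{1 \le j \le k,\ j \neq i^\star} \left| f_j(x^\star) - f_{i^\star}(x^\star) \right|, \quad i^\star = C(x^\star). \tag{16}$$
Then (16) implies $C(x) = C(x^\star)$ for all $x \in \mathcal{B}_\epsilon(x^\star)$.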
The inequality in (16) provides us with a simple and computationally efficient test for assessing the point-wise robustness of a neural network. According to (16), a more accurate estimation of the Lipschitz constant directly increases the maximum perturbation that can be certified for each test example. This makes the framework suitable for model selection, wherein one wishes to select the model that is most robust to adversarial perturbations from a family of proposed classifiers [peck2017lower].
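The test can be turned into a certified radius by solving the inequality for the perturbation size. The sketch below assumes the certificate takes the form $\epsilon \le \min_{j \neq i^\star} |f_j(x^\star) - f_{i^\star}(x^\star)| / (\sqrt{2}\, L_2)$, which follows from the $\ell_2$ distance between the score vector and the nearest decision boundary $\{y : y_j = y_{i^\star}\}$. The function name and the example scores are hypothetical.

```python
import math


def certified_radius(scores, lipschitz_bound):
    """Largest l2 perturbation size for which the predicted class provably cannot change.

    scores: class scores f(x*) at the nominal input.
    lipschitz_bound: an upper bound on the l2 Lipschitz constant of f.
    """
    top = max(range(len(scores)), key=scores.__getitem__)
    margin = min(scores[top] - s for j, s in enumerate(scores) if j != top)
    return margin / (math.sqrt(2.0) * lipschitz_bound)


# Hypothetical three-class score vector and Lipschitz bound.
print(certified_radius([1.0, 3.0, 0.0], lipschitz_bound=2.0))
```

A tighter Lipschitz bound enters the denominator directly, so halving the bound doubles the radius that can be certified for the same score margin.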
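A.2 Proof of Proposition A.1
Let $i^\star = C(x^\star)$ be the class of $x^\star$. Define the polytope in the output space of $f$:
$$\mathcal{P}_{i^\star} = \{ y \in \mathbb{R}^k \mid y_j \le y_{i^\star} \ \text{for all } j \neq i^\star \},$$
which is the set of all outputs whose score is the highest for class $i^\star$; and, of course, $f(x^\star) \in \mathcal{P}_{i^\star}$. The distance of $f(x^\star)$ to the boundary $\partial\mathcal{P}_{i^\star}$ of the polytope is the minimum distance of $f(x^\star)$ to all edges of the polytope:
$$\mathrm{dist}(f(x^\star), \partial\mathcal{P}_{i^\star}) = \inf_{y \in \partial\mathcal{P}_{i^\star}} \|y - f(x^\star)\|_2 = \frac{1}{\sqrt{2}} \min_{j \neq i^\star} \left| f_j(x^\star) - f_{i^\star}(x^\star) \right|.$$
Note that the Lipschitz condition implies $\|f(x) - f(x^\star)\|_2 \le L_2 \epsilon$ for all $\|x - x^\star\|_2 \le \epsilon$. The condition in (16) then implies
$$\|f(x) - f(x^\star)\|_2 \le \mathrm{dist}(f(x^\star), \partial\mathcal{P}_{i^\star}),$$
and hence $f(x) \in \mathcal{P}_{i^\star}$, for all $\|x - x^\star\|_2 \le \epsilon$. Therefore, the output of the classification would not change for any $\|x - x^\star\|_2 \le \epsilon$.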
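A.3 Proof of Lemma 2.2
We first prove the following lemma, which is a slight variation of the lemma proved in [fazlyab2019safety].
Suppose $\varphi\colon \mathbb{R} \to \mathbb{R}$ is slope-restricted on $[\alpha, \beta]$ and satisfies $\varphi(0) = 0$. Define the set
$$\mathcal{T}_n = \Big\{ T \in \mathbb{S}^n \ \Big|\ T = \sum_{i=1}^n \lambda_{ii} e_i e_i^\top + \sum_{1 \le i < j \le n} \lambda_{ij} (e_i - e_j)(e_i - e_j)^\top,\ \lambda_{ij} \ge 0 \Big\}.$$
Then the vector-valued function $\phi(x)$ defined by $\phi(x) = [\varphi(x_1) \, \cdots \, \varphi(x_n)]^\top$ satisfies
$$\begin{bmatrix} x \\ \phi(x) \end{bmatrix}^\top \begin{bmatrix} -2\alpha\beta T & (\alpha + \beta) T \\ (\alpha + \beta) T & -2T \end{bmatrix} \begin{bmatrix} x \\ \phi(x) \end{bmatrix} \ge 0$$
for all $x \in \mathbb{R}^n$.
Proof of Lemma A.3. Note that the slope restriction condition implies that for any two pairs $(x_i, \varphi(x_i))$ and $(x_j, \varphi(x_j))$, we can write the following incremental quadratic constraint:
$$\lambda_{ij} \begin{bmatrix} x_i - x_j \\ \varphi(x_i) - \varphi(x_j) \end{bmatrix}^\top \begin{bmatrix} -2\alpha\beta & \alpha + \beta \\ \alpha + \beta & -2 \end{bmatrix} \begin{bmatrix} x_i - x_j \\ \varphi(x_i) - \varphi(x_j) \end{bmatrix} \ge 0, \quad 1 \le i \neq j \le n,$$
where $\lambda_{ij} \ge 0$ is arbitrary. Similarly, for any two pairs $(x_i, \varphi(x_i))$ and $(0, \varphi(0)) = (0, 0)$, we can also write the incremental quadratic constraint
$$\lambda_{ii} \begin{bmatrix} x_i \\ \varphi(x_i) \end{bmatrix}^\top \begin{bmatrix} -2\alpha\beta & \alpha + \beta \\ \alpha + \beta & -2 \end{bmatrix} \begin{bmatrix} x_i \\ \varphi(x_i) \end{bmatrix} \ge 0, \quad i = 1, \ldots, n.$$
Writing $x_i = e_i^\top x$ and $\varphi(x_i) = e_i^\top \phi(x)$ and summing these inequalities over all indices yields the claimed inequality with $T \in \mathcal{T}_n$.
Proof of Lemma 2.2. For a fixed $x \in \mathbb{R}$, define the map $\tilde{\varphi}_x\colon \mathbb{R} \to \mathbb{R}$ by shifting $\varphi$, as follows,
$$\tilde{\varphi}_x(\delta) = \varphi(x + \delta) - \varphi(x).$$
It is not hard to verify that if $\varphi$ is slope-restricted on $[\alpha, \beta]$, the map $\tilde{\varphi}_x$ is also slope-restricted on the same interval for any fixed $x$. Next, for a fixed $x \in \mathbb{R}^n$ define
$$\tilde{\phi}_x(\delta) = [\tilde{\varphi}_{x_1}(\delta_1) \, \cdots \, \tilde{\varphi}_{x_n}(\delta_n)]^\top.$$
Since $\tilde{\varphi}_{x_i}$ is slope-restricted on $[\alpha, \beta]$ and satisfies $\tilde{\varphi}_{x_i}(0) = 0$ for any fixed $x_i$, it follows from Lemma A.3 that $\tilde{\phi}_x$ satisfies the quadratic constraint
$$\begin{bmatrix} \delta \\ \tilde{\phi}_x(\delta) \end{bmatrix}^\top \begin{bmatrix} -2\alpha\beta T & (\alpha + \beta) T \\ (\alpha + \beta) T & -2T \end{bmatrix} \begin{bmatrix} \delta \\ \tilde{\phi}_x(\delta) \end{bmatrix} \ge 0.$$
By substituting the definition of $\tilde{\phi}_x$ and setting $\delta = y - x$, we obtain
$$\begin{bmatrix} y - x \\ \phi(y) - \phi(x) \end{bmatrix}^\top \begin{bmatrix} -2\alpha\beta T & (\alpha + \beta) T \\ (\alpha + \beta) T & -2T \end{bmatrix} \begin{bmatrix} y - x \\ \phi(y) - \phi(x) \end{bmatrix} \ge 0.$$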
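A.4 Proof of Theorem 2.3
Define $x^1 = \phi(W^0 x + b^0)$ and $y^1 = \phi(W^0 y + b^0)$ for two arbitrary inputs $x, y \in \mathbb{R}^{n_0}$. Using Lemma 2.2, we can write the quadratic inequality
$$\begin{bmatrix} (W^0 x + b^0) - (W^0 y + b^0) \\ x^1 - y^1 \end{bmatrix}^\top \begin{bmatrix} -2\alpha\beta T & (\alpha + \beta) T \\ (\alpha + \beta) T & -2T \end{bmatrix} \begin{bmatrix} (W^0 x + b^0) - (W^0 y + b^0) \\ x^1 - y^1 \end{bmatrix} \ge 0,$$
where $T \in \mathcal{T}_n$ and $\mathcal{T}_n$ is defined as in (8). The preceding inequality can be simplified to
$$\begin{bmatrix} x - y \\ x^1 - y^1 \end{bmatrix}^\top \begin{bmatrix} -2\alpha\beta\, W^{0\top} T W^0 & (\alpha + \beta)\, W^{0\top} T \\ (\alpha + \beta)\, T W^0 & -2T \end{bmatrix} \begin{bmatrix} x - y \\ x^1 - y^1 \end{bmatrix} \ge 0.$$
By left and right multiplying $M(\rho, T)$ in (10) by $[(x - y)^\top \ (x^1 - y^1)^\top]$ and $[(x - y)^\top \ (x^1 - y^1)^\top]^\top$, respectively, and rearranging terms, we obtain
$$\begin{bmatrix} x - y \\ x^1 - y^1 \end{bmatrix}^\top \begin{bmatrix} -2\alpha\beta\, W^{0\top} T W^0 & (\alpha + \beta)\, W^{0\top} T \\ (\alpha + \beta)\, T W^0 & -2T \end{bmatrix} \begin{bmatrix} x - y \\ x^1 - y^1 \end{bmatrix} \le \rho\, (x - y)^\top (x - y) - (x^1 - y^1)^\top W^{1\top} W^1 (x^1 - y^1).$$
By adding both sides of the preceding inequalities, we obtain
$$0 \le \rho\, (x - y)^\top (x - y) - (x^1 - y^1)^\top W^{1\top} W^1 (x^1 - y^1).$$
Finally, note that by definition of $x^1$ and $y^1$, we have $f(x) = W^1 x^1 + b^1$ and $f(y) = W^1 y^1 + b^1$. Therefore, the preceding inequality implies
$$\|f(x) - f(y)\|_2^2 \le \rho\, \|x - y\|_2^2.$$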
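A.5 Proof of Theorem 2.4
For two arbitrary inputs $x^0, y^0 \in \mathbb{R}^{n_0}$, define $\mathbf{x} = [x^{0\top} \, \cdots \, x^{\ell\top}]^\top$ and $\mathbf{y} = [y^{0\top} \, \cdots \, y^{\ell\top}]^\top$. Using the compact notation in (12), we can write
$$B\mathbf{x} = \phi(A\mathbf{x} + b), \qquad B\mathbf{y} = \phi(A\mathbf{y} + b).$$
Multiply both sides of the first matrix in (14) by $(\mathbf{x} - \mathbf{y})^\top$ and $(\mathbf{x} - \mathbf{y})$, respectively, and use the preceding identities to obtain
$$(\mathbf{x} - \mathbf{y})^\top \begin{bmatrix} A \\ B \end{bmatrix}^\top \begin{bmatrix} -2\alpha\beta T & (\alpha + \beta) T \\ (\alpha + \beta) T & -2T \end{bmatrix} \begin{bmatrix} A \\ B \end{bmatrix} (\mathbf{x} - \mathbf{y}) = \begin{bmatrix} A\mathbf{x} - A\mathbf{y} \\ \phi(A\mathbf{x} + b) - \phi(A\mathbf{y} + b) \end{bmatrix}^\top \begin{bmatrix} -2\alpha\beta T & (\alpha + \beta) T \\ (\alpha + \beta) T & -2T \end{bmatrix} \begin{bmatrix} A\mathbf{x} - A\mathbf{y} \\ \phi(A\mathbf{x} + b) - \phi(A\mathbf{y} + b) \end{bmatrix} \ge 0,$$
where the inequality follows from Lemma 2.2. Applying the same multiplication to the second matrix in (14) gives $-\rho \|x^0 - y^0\|_2^2 + \|W^{\ell}(x^{\ell} - y^{\ell})\|_2^2$, and $W^{\ell}(x^{\ell} - y^{\ell}) = f(x^0) - f(y^0)$. Since (14) requires the sum of the two quadratic forms to be nonpositive, we conclude that $\|f(x^0) - f(y^0)\|_2^2 \le \rho \|x^0 - y^0\|_2^2$.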
A.6 Bounding the output set of a neural network classifier
As we have shown, accurate estimation of the Lipschitz constant can provide tighter bounds on adversarial examples drawn from a perturbation set $\mathcal{B}_\epsilon(x^\star)$ around a nominal point $x^\star$. Figure 8 shows these bounds for a neural network trained on the Iris dataset. Because this dataset has three classes, we project the output sets onto the coordinate axes. This figure clearly shows that our bound is quite close to the naive lower bound. Indeed, the naive upper bound and the constant computed by CPLip are considerably looser.
A.7 Further analysis of scalability
Figure 9 shows the Lipschitz bounds found for the networks used in Table 1.