Machine learning models, such as deep neural networks (DNNs), have been remarkably successful in performing many tasks. However, it has been shown that they fail catastrophically when very small distortions are added to normal data examples. These adversarial examples are easy to produce, transfer from one model to another, and are very hard to detect.
Many methods have been proposed to address this problem, but most have been quickly overcome by new attacks. This cycle has happened regularly enough that the burden of proof is on defenders to show that their defense will hold up against future attacks. One promising approach to meet this burden is to compute and optimize a certificate: a guarantee that no attack of a certain magnitude can change the classifier’s decision for a large majority of examples.
In order to provide such a guarantee, one must be able to bound the possible outputs for a region of input space. This can be done for the region around a specific input or by globally bounding the sensitivity of the function to shifts on the input, i.e., the function’s Lipschitz constant. Once the output is bounded for a given input region, one can check whether the class changes. If not, there is no adversarial example in the region. If the class does change, the model can alert the user or safety mechanisms to the possibility of manipulation.
We argue in this paper that despite the achievements reported in , Lipschitz-based approaches suffer from some representational limitations that may prevent them from achieving higher levels of performance and being applicable to more complicated problems. We suggest that directly addressing these limitations may lead to further gains in robustness.
This paper is organized as follows: Section 2 defines the Lipschitz constant and shows that classifiers with strong Lipschitz-based guarantees exist. Section 3 describes a simple method for computing a Lipschitz constant for deep neural networks, while Section 4 presents experimental and theoretical limitations for this method. Section 5 describes an alternative method for computing a Lipschitz constant and presents some of its limitations. Finally, Section 6 presents conclusions and a long-term goal for future research.
2 Lipschitz Bounds
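We now define the Lipschitz constant referenced throughout this paper. Let a function $f: X \to Y$ be called $k$-Lipschitz continuous if
$$d_Y(f(x_1), f(x_2)) \le k \, d_X(x_1, x_2) \quad \forall x_1, x_2 \in X,$$
where $d_X$ and $d_Y$ are the metrics associated with vector spaces $X$ and $Y$, respectively.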
Loosely speaking, a Lipschitz constant $k$ is a bound on the slope of $f$: if the input changes by $\epsilon$, the output changes by at most $k\epsilon$. If there is no value $\hat{k}$ where $f$ is $\hat{k}$-Lipschitz continuous and $\hat{k} < k$, then we say $k$ is the minimal Lipschitz constant. In this paper, we restrict our analysis to Minkowski $L_p$ spaces with distance metric $\|\cdot\|_p$. We now show that global Lipschitz constants can in principle be used to provide certificates far exceeding the current state-of-the-art, and thus are worthy of further development. Let $D = \{(x_i, y_i)\}_{i=1}^{m}$ be a dataset where $y_i \in \{-1, 1\}$ for $i = 1, \ldots, m$. Let $c$ be a positive scalar such that
$$\|x_i - x_j\|_p > c \quad \text{whenever } y_i \neq y_j$$
for $p \geq 1$. There exists a $\frac{2}{c}$-Lipschitz function $f$ where $\operatorname{sign}(f(x_i)) = y_i$ for $i = 1, \ldots, m$. We relegate the full proof to appendix A.1, but we define a function meeting the criteria of the proposition that can be constructed for any dataset:
$$f(x) = \begin{cases} 1 - \frac{2}{c}\|x - x^+\|_p & \text{if } \|x - x^+\|_p < \frac{c}{2} \\ -1 + \frac{2}{c}\|x - x^-\|_p & \text{if } \|x - x^-\|_p < \frac{c}{2} \\ 0 & \text{otherwise} \end{cases}$$
where $x^+$ and $x^-$ are the closest vectors to $x$ in $D$ with $y = 1$ and $y = -1$, respectively.
The function described above shows that the Lipschitz method can be used to provide a robustness guarantee against any perturbation of magnitude less than $c/2$. This can be extended to a multi-class setting in a straightforward manner by using a set of one vs. all classifiers. Table 1 shows the distance to the closest out-of-class example for the 95th percentile of samples; i.e., 95% of samples are at least $c$ away from the nearest neighbor of a different class. Proposition 2 implies the existence of a classifier that is provably robust for 95% of samples against perturbations of magnitude $c/2$. This bound would far exceed the certifications offered by current methods, i.e.,   , and even the (non-certified) adversarial performance of .
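The construction can be checked numerically on a toy problem. The following is a minimal sketch, assuming labels in $\{-1, +1\}$, the $L_2$ norm, and a slope of $2/c$ inside each ball of radius $c/2$; the helper name `make_separating_function` and the three-point dataset are illustrative assumptions, not part of the original experiments.

```python
import numpy as np

def make_separating_function(X, y, c, p=2):
    """Build the piecewise function sketched above.

    Assumes labels in {-1, +1} and that every pair of points with
    different labels is more than c apart in the p-norm.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_pos, X_neg = X[y == 1], X[y == -1]

    def f(x):
        x = np.asarray(x, dtype=float)
        # Distances to the nearest positive and nearest negative example.
        d_pos = np.min(np.linalg.norm(X_pos - x, ord=p, axis=1))
        d_neg = np.min(np.linalg.norm(X_neg - x, ord=p, axis=1))
        if d_pos < c / 2:
            return 1.0 - (2.0 / c) * d_pos
        if d_neg < c / 2:
            return -1.0 + (2.0 / c) * d_neg
        return 0.0

    return f

# Toy data (hypothetical): the closest cross-class pair is 3 apart, so c = 2.5 is valid.
X = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
y = np.array([1, -1, -1])
c = 2.5
f = make_separating_function(X, y, c)

# Empirical check of the 2/c slope bound on random pairs of inputs.
rng = np.random.default_rng(0)
a = rng.uniform(-2, 6, (2000, 2))
b = rng.uniform(-2, 6, (2000, 2))
max_ratio = max(abs(f(u) - f(v)) / np.linalg.norm(u - v) for u, v in zip(a, b))
```

On this toy set every training point receives its own label, and the largest observed slope stays below $2/c$. The random-pair check is only a sanity test, not a proof.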
It is important to note that the existence of the Lipschitz function in Proposition 2 says nothing about how easy it is to learn, from examples, such a function that also generalizes to new ones. Indeed, the function described in the proof is likely to generalize poorly. However, we argue that current methods for optimizing the Lipschitz constant of a neural network suffer much more from underfitting than overfitting: training and validation certificates tend to be similar, and adding model capacity and training iterations does not appear to materially improve the training certificates. This suggests that we need more powerful models. The remainder of this paper is focused on how one might go about developing more powerful models.
3 Atomic Lipschitz Constants
The simplest method for constructing a Lipschitz constant for a neural network composes the Lipschitz constants of atomic components. If $f_1$ and $f_2$ are $k_1$- and $k_2$-Lipschitz continuous functions, respectively, and $f(x) = f_2(f_1(x))$, then $f$ is $k$-Lipschitz continuous where $k = k_1 k_2$. Applying this recursively provides a bound for an arbitrary neural network.
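For many components, we can compute the minimal Lipschitz constant exactly. For linear operators, $l_{W,b}(x) = Wx + b$, the minimal Lipschitz constant is given by the matrix norm of $W$ induced by $\|\cdot\|_p$:
$$\|W\|_p = \sup_{x \neq 0} \frac{\|Wx\|_p}{\|x\|_p}$$
For $p = \infty$, this is equivalent to the largest magnitude row of $W$:
$$\|W\|_\infty = \max_i \|w_i\|_1,$$
where $w_i$ is the $i$-th row of $W$. The $L_2$ norm of $W$ is known as its spectral norm and equals its largest singular value. The element-wise ReLU function has a Lipschitz constant of 1 regardless of the choice of $p$. Therefore, for a neural network composed of $n$ linear operators and ReLUs, a Lipschitz constant is provided by
$$k = \prod_{i=1}^{n} \|W_i\|_p \tag{6}$$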
Several recent papers have utilized this concept or an extension of it to additional layer types.  uses it to analyze the theoretical sensitivity of deep neural networks.  and  enforce constraints on the singular values of matrices as a way of increasing robustness to existing attacks. Finally,  penalizes the spectral norms of matrices and uses equation 6 to compute a Lipschitz constant for the network.
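The atomic bound is straightforward to compute. The sketch below assumes a bias-free ReLU network stored as a list of weight matrices; the function names and the example weights are hypothetical. It relies on `numpy.linalg.norm`, which returns the induced matrix norm for `ord` in `1`, `2`, and `np.inf`.

```python
import numpy as np

def atomic_lipschitz_bound(weights, p):
    """Product of induced matrix norms; each ReLU contributes a factor of 1."""
    bound = 1.0
    for W in weights:
        bound *= np.linalg.norm(W, ord=p)
    return float(bound)

def net(x, weights):
    """Bias-free network: ReLU after every layer except the last."""
    for W in weights[:-1]:
        x = np.maximum(W @ x, 0.0)
    return weights[-1] @ x

# Hypothetical weights for illustration only.
W1 = np.array([[1.0, -2.0], [0.5, 0.0], [3.0, 1.0]])  # layer 1: R^2 -> R^3
W2 = np.array([[1.0, -1.0, 0.5]])                      # layer 2: R^3 -> R
k_inf = atomic_lipschitz_bound([W1, W2], np.inf)       # 4.0 * 2.5
k_2 = atomic_lipschitz_bound([W1, W2], 2)
```

The bound is valid but not necessarily tight: the true Lipschitz constant of the composed network may be much smaller than the product of the layer norms.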
4 Limitations of Atomic Lipschitz Constants
One might surmise that this approach can solve the problem of adversarial examples: compose enough layers together with the right balance of objectives, overcoming whatever optimization difficulties arise, and one can train classifiers with high accuracy, guaranteed low variability, and improved robustness to attacks. Unfortunately, this does not turn out to be the case, as we will show first experimentally and then theoretically.
4.1 Experimental Limitations
First, we can observe the limits of this technique in a shallow setting. We train a two-layer fully connected neural network with 500 hidden units on the MNIST dataset. We penalize the atomic bound $\|W_1\|_p \|W_2\|_p$ with weight $\lambda_p$. We denote the score for class $i$ as $f_i(x)$ and the computed Lipschitz constant of the difference between $f_i(x)$ and $f_j(x)$ as $k_{ij}$. We certify the network for example $x$ with correct class $i$ against a perturbation of magnitude $\epsilon$ by verifying that $f_i(x) - f_j(x) - k_{ij}\epsilon > 0$ for all $j \neq i$.
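The certification test can be written in a few lines. This is a sketch under stated assumptions: the pairwise constant is taken to be the atomic bound of the score difference, i.e. the norm of the first layer times the induced norm of the row difference of the second layer, and the function names are hypothetical rather than taken from the original experiments.

```python
import numpy as np

def pairwise_constants(W1, W2, p):
    """K[i, j]: atomic Lipschitz bound for f_i - f_j in a two-layer ReLU network."""
    n_classes = W2.shape[0]
    k1 = np.linalg.norm(W1, ord=p)
    K = np.zeros((n_classes, n_classes))
    for i in range(n_classes):
        for j in range(n_classes):
            diff = (W2[i] - W2[j])[None, :]  # row operator of shape 1 x hidden
            K[i, j] = k1 * np.linalg.norm(diff, ord=p)
    return K

def is_certified(scores, K, true_class, eps):
    """True if f_i(x) - f_j(x) - K[i, j] * eps > 0 for every j != i."""
    i = true_class
    margins = scores[i] - scores - K[i] * eps
    return bool(np.all(np.delete(margins, i) > 0))

# Hypothetical scores and constants for a three-class example.
scores = np.array([2.0, 0.5, -1.0])
K = np.full((3, 3), 3.0)
small_eps_ok = is_certified(scores, K, true_class=0, eps=0.4)
large_eps_ok = is_certified(scores, K, true_class=0, eps=0.6)
```

With these numbers the smallest margin is 1.5, so the example is certified for $\epsilon = 0.4$ (since $3 \times 0.4 < 1.5$) but not for $\epsilon = 0.6$.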
Figures 1 (a) and (b) show results for and , respectively. In both cases, adding a penalty provides a larger region of certified robustness, but increasing the penalty hurts performance on unperturbed data and eventually ceases to improve the certified region. This was true for both test and training (not shown) data. This level of certification is considerably weaker than our theoretical limit from Proposition 2.
There also does not appear to be much certification benefit to adding more layers. We extended the methodology to multi-layer networks and show the results in figures 1 (c) and (d). Using the penalty proved difficult to optimize for deeper networks. The penalty was more successful, but only saw a mild improvement over the shallow model. The results in (d) also compare favorably to those of , which uses a 4 layer convolutional network.
4.2 Theoretical Limitations
We now consider the set of neural networks with a given atomic Lipschitz bound and the functions it can compute. This set of functions is important because it limits how well a neural network can split a dataset with particular margins, and thus how strong the certificate can be.
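Let $A_k^p$ be the set of neural networks with an atomic Lipschitz bound of $k$ in $L_p$ space:
$$A_k^p = \Big\{ f \;:\; \prod_{i=1}^{n} \|W_i\|_p \le k \Big\},$$
where $W_1, \ldots, W_n$ are the weight matrices of $f$.
We focus our analysis here on $L_\infty$ space. To show the limitations of $A_k^\infty$, consider the simple 1-Lipschitz function $f(x) = |x|$. Expressing $f$ with ReLUs and linear units is a simple exercise, shown in figure 2. However, since
$$\left\| \begin{bmatrix} 1 \\ -1 \end{bmatrix} \right\|_\infty \left\| \begin{bmatrix} 1 & 1 \end{bmatrix} \right\|_\infty = 2,$$
the neural network in figure 2 is a member of $A_2^\infty$, but not $A_1^\infty$. This is only one possible implementation of $f$, but as we will show, the atomic component method cannot express this function with a Lipschitz bound lower than 2, and the situation gets worse as more non-linear variations are added.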
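The gap between the true Lipschitz constant and the atomic bound can be reproduced directly. The sketch below assumes the standard two-unit construction of the absolute value, $\mathrm{relu}(x) + \mathrm{relu}(-x)$; whether this matches the network drawn in figure 2 exactly is an assumption.

```python
import numpy as np

W1 = np.array([[1.0], [-1.0]])  # hidden pre-activations: x and -x
W2 = np.array([[1.0, 1.0]])     # output: relu(x) + relu(-x)

def abs_net(x):
    """Evaluate the two-unit ReLU network on a scalar input."""
    hidden = np.maximum(W1 @ np.array([x], dtype=float), 0.0)
    return (W2 @ hidden).item()

def atomic_bound(p):
    """Product of the induced p-norms of the two layers."""
    return np.linalg.norm(W1, ord=p) * np.linalg.norm(W2, ord=p)

# The function itself is 1-Lipschitz, yet the atomic bound is 2 for each norm tried.
bounds = {p: atomic_bound(p) for p in (1, 2, np.inf)}
```

For this particular pair of weight matrices the product of norms comes out as 2 for $p = 1$, $p = 2$, and $p = \infty$, even though the function computed has slope of magnitude 1 everywhere it is differentiable.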
We now provide two definitions that will help delineate the functions that the neural networks in can compute.
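For a function $f: \mathbb{R} \to \mathbb{R}$, let the total variation be defined as
$$V_a^b(f) = \sup_{P \in \mathcal{P}} \sum_{i=0}^{n_P - 1} |f(x_{i+1}) - f(x_i)|,$$
where $\mathcal{P}$ is the set of partitions $P = \{x_0, \ldots, x_{n_P}\}$ of the interval $[a, b]$.
The total variation captures how much a function changes over its entire domain, which we will use on the gradients of neural networks. The total variation is finite for neural network gradients, as the gradient only changes when a ReLU switches states, and this can only happen a finite number of times for finite networks. Clearly, for the slope of the absolute value function, this quantity is 2: the slope changes from -1 to 1 at $x = 0$.
For a function $f: \mathbb{R} \to \mathbb{R}$, define a quantity
$$I(f) = V_{-\infty}^{\infty}(f) + |f(-\infty)| + |f(\infty)|,$$
where $f(\pm\infty)$ denotes the limiting values of $f$,
and call it the intrinsic variability of $f$.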
As we will show, the intrinsic variability is a quantity that is nonexpansive under the ReLU operation. The intrinsic variability of the slope of the absolute value function is 4: we add the magnitude of the slopes at the extreme points, 1 in each case, to the total variation of 2. We now begin a set of proofs to show that the set of networks with a given atomic Lipschitz bound is limited in the functions it can approximate. This limit does not come from the Lipschitz constant of a function $f$, but from the intrinsic variability of its derivative, $f'$.
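The arithmetic above can be reproduced on sampled slopes. This is a rough numerical sketch: it assumes the slope is sampled on an ordered grid whose end points lie where the slope has stopped changing, so that the first and last samples stand in for the limiting values. The function name is hypothetical.

```python
import numpy as np

def intrinsic_variability(samples):
    """Total variation of the samples plus the magnitudes of the two end values."""
    samples = np.asarray(samples, dtype=float)
    total_variation = np.sum(np.abs(np.diff(samples)))
    return float(total_variation + abs(samples[0]) + abs(samples[-1]))

# Slope of |x|: -1 left of zero, +1 right of zero.
x = np.linspace(-5.0, 5.0, 1000)  # even count, so x = 0 itself is not sampled
iv_abs = intrinsic_variability(np.sign(x))

# Slope of a zig-zag with three kinks: -1, +1, -1, +1 on successive segments.
iv_zigzag = intrinsic_variability([-1.0, 1.0, -1.0, 1.0])
```

The absolute value gives 4, as stated in the text, while the three-kink zig-zag gives 8. Both functions are 1-Lipschitz, so the intrinsic variability of the slope grows with each added kink even though the Lipschitz constant does not.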
For a linear combination of functions ,
Proof is relegated to appendix A.2 Let a function be called eventually constant if
Let be a function where
is eventually constant. For the ReLU activation function,
Proof is relegated to appendix A.3
Let be a scalar-valued function .
Let where , and . For any selection of and ,
Proof is relegated to appendix A.4
A function in has a hard limit on the intrinsic variability of its slope along a line through its input space. If we try to learn the absolute value function while penalizing the bound , we will inevitably end up with training objectives that are in direct competition with one another. One can imagine more difficult cases where there is some oscillation in the data manifold and the bounds deteriorate further: for instance is also 1-Lipschitz, but can only be approximated with arbitrarily small error by a member of . While this limit is specific to , since , it also provides a limit to .
5 Paired-layer Lipschitz Constants and Their Limitations
We have shown the limitations of the atomic bounding method both experimentally and theoretically, so naturally we look for other approaches to bounding the Lipschitz constant of neural network layers. A fairly successful approach was given by .  presents a method for bounding a fully connected neural network with one hidden layer and ReLU activations, which yielded impressive performance on the MNIST dataset. This approach optimizes the weights of the two layers in concert, so we call it the paired-layer approach. The paper does not attempt to extend the method to deeper neural networks, but it can be done in a relatively straightforward fashion.
5.1 Certifying a Two-layer Neural Network
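Ignoring biases for notational convenience, a two-layer neural network with weights $W_1$ and $W_2$ can be expressed
$$f(x) = W_2 \operatorname{diag}(s) W_1 x,$$
where $s = \mathbb{1}[W_1 x > 0]$ is the vector of ReLU activation indicators. We consider a single output, although extending to a multi-class setting is straightforward. If $s$ were fixed, such a network would be linear with Lipschitz constant $\|W_2 \operatorname{diag}(s) W_1\|_p$. The paired-layer method accounts for a changeable $s$ by finding the assignment of $s$ that maximizes the Lipschitz constant and using this as a bound for the real Lipschitz constant:
$$k \le \max_{s \in \{0,1\}^n} \|W_2 \operatorname{diag}(s) W_1\|_p \tag{17}$$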
They convert this problem to a mixed integer quadratic program and bound it in a tractable and differentiable manner using semi-definite programming, the details of which are explained in . We can add a penalty on this quantity to the objective function to find a model with relatively high accuracy and low Lipschitz constant. We did not have access to the training procedure developed by , but we were able to closely replicate their results on MNIST and compare them to the atomic bounding approach, shown in figure 3 (a).
5.2 Theoretical Benefits and Limitations of Paired-layer Approach
Figure 3 shows that there are practical benefits to the paired-layer approach, and we can also show a corresponding increase in expressive power. Similar to , we define a set of neural networks , although we will restrict the definition to 2 layer networks in space:
Let be the set of two-layer neural networks with a paired-layer Lipschitz bound of k in space:
can express functions that cannot. For example, we can apply the paired-layer method to the neural network in figure 2 by enumerating the different cases. In this case the bound is tight, meaning that the neural network is in . From Theorem 4.2, we know that this function cannot be expressed by any member of . It is easy to see that any two layer neural network in is also in , so we can say confidently that the paired-layer bounds are tighter than atomic bounds.
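For very small networks the enumeration of cases can be done by brute force. The sketch below loops over every activation pattern instead of using the semi-definite relaxation, so it is only practical for a handful of hidden units. It assumes the absolute value network has the weights $W_1 = [1, -1]^T$ and $W_2 = [1, 1]$, and the function name is hypothetical.

```python
import itertools
import numpy as np

def paired_layer_bound(W1, W2, p):
    """Maximum of ||W2 diag(s) W1||_p over all 0/1 activation patterns s."""
    hidden = W1.shape[0]
    best = 0.0
    for s in itertools.product((0.0, 1.0), repeat=hidden):
        best = max(best, np.linalg.norm(W2 @ np.diag(s) @ W1, ord=p))
    return float(best)

W1 = np.array([[1.0], [-1.0]])
W2 = np.array([[1.0, 1.0]])
paired = paired_layer_bound(W1, W2, np.inf)
atomic = float(np.linalg.norm(W1, np.inf) * np.linalg.norm(W2, np.inf))
```

Here the pattern with both units on gives $1 - 1 = 0$, and each single-unit pattern gives magnitude 1, so the paired-layer value is 1 while the atomic product is 2.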
This additional expressiveness is not merely academic. Figure 3 (b) shows the output of the networks from (a) along a particular line in input space, scaled by the given Lipschitz bound. The function learned by the paired-layer method does in fact exhibit an intrinsic variability larger than , meaning that function cannot be represented by a network in . This suggests that the gains in performance may be coming from the increased expressiveness of the model family.
It is still easy to construct functions for which the paired-layer bounds are loose, however. Figure 4 shows a 1-Lipschitz function and a corresponding neural network that is only in . The problem arises from the fact that the two hidden units cannot both be on, but the quadratic programming problem in equation 17 implies that they can. For a 1-D problem, the bound essentially adds up the magnitudes of the paths with positive weights and the paths with negative weights and takes the maximum. A higher dimensional problem can be reduced to a 1-D problem by considering arbitrary lines through the input space.
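A small example makes the looseness concrete. The network below is a hypothetical illustration and is not claimed to be the one in figure 4: it computes $\mathrm{relu}(x - 1) - \mathrm{relu}(-x - 1)$, whose two hidden units are never active together, yet the bias-free enumeration still counts the pattern with both units on.

```python
import itertools
import numpy as np

W1 = np.array([[1.0], [-1.0]])
b1 = np.array([-1.0, -1.0])   # unit 1 is on for x > 1, unit 2 for x < -1
W2 = np.array([[1.0, -1.0]])

def f(x):
    """Scalar network with a flat region between -1 and 1 and slope 1 outside it."""
    hidden = np.maximum(W1[:, 0] * x + b1, 0.0)
    return (W2 @ hidden).item()

# Paired-layer value by brute-force enumeration of activation patterns.
paired = max(
    np.linalg.norm(W2 @ np.diag(s) @ W1, ord=np.inf)
    for s in itertools.product((0.0, 1.0), repeat=2)
)

# Steepest slope actually observed on a grid.
xs = np.linspace(-4.0, 4.0, 801)
ys = np.array([f(x) for x in xs])
steepest = np.max(np.abs(np.diff(ys) / np.diff(xs)))
```

The enumeration returns 2 because the infeasible pattern with both units on contributes $1 \cdot 1 + (-1)(-1)$, whereas the observed slope never exceeds 1.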
The expressive limitations of are apparent when we consider its components. Any neural network in is a sum of combinations of the four basic forms in figure 5, with various biases and slopes. The sum of the slope magnitudes from the positive paths can be no greater than , and likewise for the negative paths. Each form has a characteristic way of affecting the slope at the extremes and changing the slope. For instance form (a) adds a positive slope at as well as a positive change in . From here we can see that there is still a connection between the total variation and extreme values of and the bound . While the paired-layer bounds are better than the atomic ones, they still become arbitrarily bad for e.g., oscillating functions.
6 Conclusions
We have presented a case that existing methods for computing a Lipschitz constant of a neural network suffer from representational limitations that may be preventing them from considerably stronger robustness guarantees against adversarial examples. Addressing these limitations should enable models that can, at a minimum, exhibit strong guarantees for training data and hopefully extend these to out-of-sample data. Ideally, we envision universal Lipschitz networks: a family of neural networks that can represent an arbitrary k-Lipschitz function with a tight bound. The development of such a family of models and methods for optimizing them carries the potential of extensive gains in adversarial robustness.
Acknowledgements
This research was partially sponsored by the U.S. Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-13-2-0045 (ARL Cyber Security CRA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on.
References
-  Athalye, A., Carlini, N., Wagner, D.A.: Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. CoRR abs/1802.00420 (2018)
-  Carlini, N., Wagner, D.A.: Adversarial examples are not easily detected: Bypassing ten detection methods. In: AISec@CCS (2017)
-  Carlini, N., Wagner, D.A.: Towards evaluating the robustness of neural networks. 2017 IEEE Symposium on Security and Privacy (SP) pp. 39–57 (2017)
-  Cissé, M., Bojanowski, P., Grave, E., Dauphin, Y., Usunier, N.: Parseval networks: Improving robustness to adversarial examples. In: ICML (2017)
-  Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.P.: Natural language processing (almost) from scratch. Journal of Machine Learning Research 12, 2493–2537 (2011)
-  Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. CoRR abs/1412.6572 (2014)
-  Hinton, G.E., Deng, L., Yu, D., Dahl, G.E., rahman Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N., Kingsbury, B.: Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29, 82–97 (2012)
-  Kolter, J.Z., Wong, E.: Provable defenses against adversarial examples via the convex outer adversarial polytope. CoRR abs/1711.00851 (2017)
-  In: F. Pereira, C.J.C. Burges, L. Bottou, K.Q. Weinberger (eds.) Advances in Neural Information Processing Systems 25, pp. 1097–1105. Curran Associates, Inc. (2012)
-  Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. CoRR abs/1706.06083 (2017)
-  Papernot, N., McDaniel, P.D., Goodfellow, I.J., Jha, S., Celik, Z.B., Swami, A.: Practical black-box attacks against machine learning. In: AsiaCCS (2017)
-  Qian, H., Wegman, M.N.: L2-nonexpansive neural networks. CoRR abs/1802.07896 (2018)
-  Raghunathan, A., Steinhardt, J., Liang, P.: Certified defenses against adversarial examples. CoRR abs/1801.09344 (2018)
-  Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I.J., Fergus, R.: Intriguing properties of neural networks. CoRR abs/1312.6199 (2013)
-  Tramèr, F., Papernot, N., Goodfellow, I.J., Boneh, D., McDaniel, P.D.: The space of transferable adversarial examples. CoRR abs/1704.03453 (2017)
-  Tsuzuku, Y., Sato, I., Sugiyama, M.: Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks. CoRR abs/1802.04034 (2018)
Appendix A Proofs
A.1 Proof of Proposition 2
Consider the function
where and are the closest vectors to in with and , respectively. Since , the conditions are mutually exclusive. When and ,
The inverse is true for , therefore holds for all . is continuous at the non-differentiable boundaries between the piecewise conditions of and the selections of and . Therefore, it suffices to show that each continuously differentiable piece is -Lipschitz. Using Definition 2, we must show
For the first condition of with a fixed , we get
which holds for due to the Minkowski inequality. The same holds for the second condition. Since the third condition is constant, must be -Lipschitz and the proof is complete.
A.2 Proof of Lemma 4.2
Using the chain rule, we get
The triangle inequality gives us the following two inequalities
Let be a maximal partition for , giving us
A.3 Proof of Lemma 4.2
Let be an interval outside of which is constant. Assume that for . In this case,
If then at some point , and transitions from to 0. Otherwise for , . Therefore,
Putting the different intervals together, we get
So the statement holds when our assumption about is met. To address cases where has negative values in , consider an interval where . We note that and . Since must transition from to , over ,
Since transitions from to 0 to over so,
Applying this to all such intervals gives us
A.4 Proof of Theorem 4.2
Combining the definition of with Definition 4.2, we can see that and
. We consider the additional linear transform as the zeroth layer of a modified network. Consider unitin the zeroth layer as a function . is constant, with
where is element of . Therefore
We also have , so by Definition 4.2
We recursively define functions for each unit in layers to :