Neural networks (NNs) and deep learning have lately been successful in many fields [1, 2], where they are mostly used for classification and segmentation problems, as well as in reinforcement learning. The main advantages of NNs are that they can be trained straightforwardly using backpropagation and that, as universal function approximators, they can represent complex nonlinearities. While NNs are powerful and broadly applicable, they lack rigorous guarantees, which is why they are not yet applied to safety-critical applications such as medical devices and autonomous driving. Adversarial attacks can easily deceive an NN by adding imperceptible perturbations to the input, a problem that has recently been tackled in a number of different ways, such as adversarial training [6]. Another promising approach is to show that an NN is provably robust against norm-bounded adversarial perturbations [7, 8], while yet another one is to use Lipschitz constants as a robustness measure that indicates the sensitivity of the output to perturbations in the input. Based on this notion of Lipschitz continuity, we propose a framework for training robust NNs that encourages a small Lipschitz constant by including a regularizer or, respectively, a constraint on the NN's Lipschitz constant.
In [8], the target values are manipulated based on the estimated Lipschitz constant, whereas we use the Lipschitz constant as a regularization functional, similar to [10]. However, [10] uses local Lipschitz constants, whereas we penalize the global one. In [11], Fazlyab et al. propose an interesting new estimation scheme that yields more accurate upper bounds on the Lipschitz constant than the weights' spectral norms by exploiting the structure of the nonlinear activation functions. Activation functions are gradients of convex potential functions, and hence monotonically increasing functions with bounded slopes, which is used in [11] to state the property of slope-restriction as an incremental quadratic constraint and then formulate a semidefinite program (SDP) that determines an upper bound on the Lipschitz constant. In [11], three variants of the Lipschitz constant estimation framework are proposed, trading off accuracy and computational tractability. In this paper, we disprove by counterexample the most accurate approach presented in [11], and we employ the other approaches for training of robust NNs. More specifically, we include the SDP-based Lipschitz bound characterization of [11] in the training procedure via an Alternating Direction Method of Multipliers (ADMM) scheme. We present two versions of the proposed training method: one is a regularizer rendering the Lipschitz constant small, and the other enforces guaranteed upper bounds on the Lipschitz constant during training.
The main contributions of this manuscript are the two training procedures for robust NNs based on the notion of Lipschitz continuity. In addition, we show that the method for Lipschitz constant estimation for NNs that was recently proposed in [11] requires a modification for the least conservative choice of decision variables. This manuscript is organized as follows. In Section II, we introduce Lipschitz constant estimation for NNs based on [11] but disprove their most accurate Lipschitz estimator. In Section III, we present a training procedure with Lipschitz regularization and outline the setup of the optimization problem that is solved using ADMM. Subsequently, we analyze the convergence of the ADMM scheme and finally introduce a variation of the proposed procedure that allows enforcing Lipschitz bounds on the NN. In Section IV, we provide two examples on which we successfully applied the proposed training procedures. We submitted preliminary ideas of this work as a late breaking result to the 21st IFAC World Congress [12].
II Lipschitz constant estimation
In this section, we briefly introduce robustness in the context of neural networks. We then state a method to estimate the Lipschitz constant of an NN based on [11] and finally argue why one of the methods for Lipschitz constant estimation proposed in [11] is incorrect.
II-A Robustness of NNs
A robust NN does not change its prediction if the input is perturbed imperceptibly. To quantify robustness, a suitable robustness measure has to be defined. One definition is that perturbations from a norm-bounded uncertainty set may not change the prediction. Alternatively, using probabilistic approaches, random perturbations do not change the prediction with a certain probability. A third alternative is the Lipschitz constant, a sensitivity measure. A function f : R^n → R^m is globally Lipschitz continuous if there exists an L ≥ 0 such that

‖f(x) − f(y)‖ ≤ L ‖x − y‖ for all x, y ∈ R^n. (1)
The smallest L for which (1) holds is the Lipschitz constant L*. If the input changes from x to y, the Lipschitz constant gives an upper bound on how much the output changes. Hence, a low Lipschitz constant indicates low sensitivity, which is equivalent to high robustness. In this work, we aim to minimize the Lipschitz constant or bound it from above during training, respectively, to increase the robustness of the resulting NN.
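Since (1) bounds every difference quotient, sampling such quotients gives a quick empirical lower bound on the Lipschitz constant. A minimal numpy sketch; the function, interval, and step size are illustrative choices, not part of the proposed method:

```python
import numpy as np

def lipschitz_lower_bound(f, lo, hi, n=10_000, eps=1e-4, seed=0):
    """Largest sampled finite-difference quotient of f on [lo, hi]:
    an empirical lower bound on the Lipschitz constant."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(lo, hi, n)
    return float(np.max(np.abs(f(x + eps) - f(x)) / eps))

# The true Lipschitz constant of cos is 1; the sampled bound approaches it.
L_lb = lipschitz_lower_bound(np.cos, -np.pi, np.pi)
```

Such sampled bounds certify nothing, which is why the certified upper bounds discussed below are needed.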
Regularization, i.e., adding a penalty term to the objective function of the NN, is a prevalent measure in NN training to prevent overfitting. L2 regularization penalizes the squared L2 norm of the weights and L1 regularization the weights' L1 norm. Bounding the weights counteracts the fitting of sudden peaks and outliers, promotes better generalization, and smoothens the resulting NN. Furthermore, the product of the spectral norms of the weights provides a trivial bound on an NN's Lipschitz constant; consequently, L1 and L2 regularization improve the robustness of an NN in the sense of Lipschitz continuity. In this paper, we penalize a more accurate estimate of the Lipschitz constant, leading to a more direct and potentially more effective approach.
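The trivial bound mentioned above, the product of the weights' spectral norms, takes a few lines of numpy. The weights here are random placeholders for a one-hidden-layer NN with a 1-Lipschitz activation such as tanh:

```python
import numpy as np

rng = np.random.default_rng(1)
W0 = rng.normal(size=(8, 2))   # hidden-layer weights (placeholder values)
W1 = rng.normal(size=(1, 8))   # output-layer weights (placeholder values)

# With a 1-Lipschitz activation, the product of the spectral norms
# (largest singular values) of the weights bounds the NN's Lipschitz constant.
trivial_bound = float(np.linalg.norm(W1, 2) * np.linalg.norm(W0, 2))
```

The SDP-based estimate discussed next is typically much tighter than this product.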
II-B Lipschitz constant estimation
In the following, we outline a method to estimate bounds on the Lipschitz constant of single-hidden-layer NNs exploiting the slope-restricted structure of the nonlinear activation functions, as it was shown in [11]. This method, named LipSDP, yields more accurate bounds than the trivial bound, i.e., the product of the spectral norms of the weights.
Continuous nonlinear activation functions can be interpreted as gradients of continuously differentiable, convex potential functions and consequently, they are inherently slope-restricted, meaning that their slope is at least α and at most β, 0 ≤ α < β < ∞:

α ≤ (φ(x) − φ(y)) / (x − y) ≤ β for all x, y ∈ R, x ≠ y. (2)
Consider the vector of activation functions φ(x) = [φ(x_1) ⋯ φ(x_{n_1})]^T and a fully-connected feed-forward NN with one hidden layer described by the equation

f(x) = W^1 φ(W^0 x + b^0) + b^1, (3)

where W^0 ∈ R^{n_1×n_0} and W^1 ∈ R^{n_2×n_1} are the weight matrices and b^0 ∈ R^{n_1} and b^1 ∈ R^{n_2} are the biases of the respective layers, with n_0, n_1, and n_2 being the dimension of the input, the number of neurons in the hidden layer, and the dimension of the output, respectively. For every neuron, the slope-restriction property (2) can be written as an incremental quadratic constraint. Weighting these quadratic constraints with λ_i ≥ 0, or respectively, the whole network with a diagonal weighting matrix T = diag(λ_1, …, λ_{n_1}),
results in an incremental quadratic constraint for the overall network: for all x, y ∈ R^{n_1},

[x − y; φ(x) − φ(y)]^T [−2αβT, (α+β)T; (α+β)T, −2T] [x − y; φ(x) − φ(y)] ≥ 0. (4)

If there exist ρ > 0 and a diagonal T ⪰ 0 such that

[−2αβ (W^0)^T T W^0 − ρI, (α+β) (W^0)^T T; (α+β) T W^0, −2T + (W^1)^T W^1] ⪯ 0, (5)

then L = √ρ is an upper bound on the Lipschitz constant of the NN (3). The proof directly follows from A.4 in [14]. The smallest value for the Lipschitz upper bound is determined by solving the SDP

min ρ subject to (5), (6)

where ρ and T serve as decision variables. The resulting Lipschitz bound L = √ρ then is a robustness certificate.
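As a plausibility check, the incremental quadratic constraint (4) for diagonal weighting matrices can be verified numerically, here for tanh, which is slope-restricted in [0, 1]. The dimensions and sample counts are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.0, 1.0                  # tanh is slope-restricted in [0, 1]
n = 4
T = np.diag(rng.uniform(0.1, 2.0, n))   # diagonal T with nonnegative entries

# Quadratic form acting on the stacked vector [x - y; tanh(x) - tanh(y)]
M = np.block([[-2 * alpha * beta * T, (alpha + beta) * T],
              [(alpha + beta) * T,    -2 * T]])

ok = True
for _ in range(1000):
    x, y = rng.normal(size=n), rng.normal(size=n)
    v = np.concatenate([x - y, np.tanh(x) - np.tanh(y)])
    ok &= v @ M @ v >= -1e-9            # the constraint value is nonnegative
```

For diagonal T the quadratic form decouples neuron-wise, which is exactly why the constraint holds; the counterexample below shows that this fails for coupled (non-diagonal) T.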
II-C Counterexample for LipSDP with coupling
In [11], three versions of the method LipSDP for Lipschitz constant estimation are stated that trade off accuracy of the Lipschitz bound and computational complexity of the method by adjusting the number of decision variables in the SDP. In this section, we give an illustrative counterexample to show that Theorem 1 in [11] does not hold for the most accurate variant of LipSDP.
In [11], a richer class of weighting matrices T that includes off-diagonal coupling terms between neurons is introduced, and Theorem 1 is stated with the set of diagonal matrices replaced by this larger class. In the following, we give a minimal counterexample to show that, as suggested in this manuscript, a further restriction of this class is required. For that purpose, consider an NN with one hidden layer, scalar input and output, and suitably chosen activation function, weights, and biases.
The resulting NN provides a good fit for the cosine function on the considered interval with a maximum deviation in the output of 0.0843. Therefore, the maximum slope of the cosine gives a good approximation of the Lipschitz constant of this NN, which is 1, such that L* ≈ 1. However, the linear matrix inequality (LMI) (5), with T taken from the richer class that includes coupling terms, is feasible for arbitrarily small ρ and a suitable choice of T. An arbitrarily small Lipschitz bound √ρ is obviously no upper bound on the Lipschitz constant of a cosine-like function.
To understand why the LipSDP method fails to provide an upper bound on the Lipschitz constant in this case, we look into the coupling of the neurons accounted for by the off-diagonal entries of T. Theorem 1 in [11] builds on the assumption that (4) holds for all T in the extended class (Lemma 1 in [11]). In the following, we show by counterexample that Lemma 1 in [11] is incorrect, i.e., that for a given slope-restricted function φ there are x, y, and T that violate (4). Choosing a function φ that is slope-restricted in a sector [α, β], together with suitable points x, y and a coupling matrix T, evaluating (4) yields a strictly negative value,
which disproves Lemma 1 of [11]; consequently, Theorem 1 of [11] needs to be modified. As we argued in the previous section, (4) holds for the more conservative choice of diagonal matrices T, such that the proof of Theorem 1 of this manuscript directly follows from the proof of Theorem 1 in [11] provided in A.4 of [14], the extended version of [11].
III Training robust NNs
In Section II-B, we stated a method that provides certificates on an NN's Lipschitz constant. In this section, we employ these certificates to design a training procedure for robust NNs. We present two versions of it: the first allows for minimization of the upper bound on the Lipschitz constant, and the second enforces a desired bound on the Lipschitz constant.
III-A Weights as decision variables
Eq. (6) can be used to assess an NN's robustness after training, whereas in this manuscript, to promote robustness during training, we use it to update the weights while minimizing the bound on the Lipschitz constant. Applying the Schur complement to (5) for α = 0, the LMI can be linearized in the weight matrices, yielding an LMI that is linear in W^0, W^1, and ρ for fixed T:

[−ρI, β (W^0)^T T, 0; β T W^0, −2T, (W^1)^T; 0, W^1, −I] ⪯ 0. (7)
For α ≠ 0, the underlying constraint is not convex in W^0. Consequently, we cannot state LMI constraints for α ≠ 0 and instead set α = 0. This is a conservative choice for some activation functions, yet the tight lower bound for the most common ones: e.g., for tanh and ReLU the tight bounds are α = 0 and β = 1, and the sigmoid function is slope-restricted with α = 0 and β = 1/4.
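These sector bounds can be checked numerically by evaluating the activation derivatives on a fine grid. A small sketch for tanh and the sigmoid; the grid range and resolution are arbitrary:

```python
import numpy as np

x = np.linspace(-10, 10, 200_001)   # fine symmetric grid, includes x = 0

tanh_slope = 1 - np.tanh(x) ** 2    # derivative of tanh
sig = 1 / (1 + np.exp(-x))
sig_slope = sig * (1 - sig)         # derivative of the sigmoid

# tanh slopes lie in [0, 1], maximum 1 attained at x = 0;
# sigmoid slopes lie in [0, 1/4], maximum 1/4 attained at x = 0.
max_tanh, max_sig = float(tanh_slope.max()), float(sig_slope.max())
```

Both derivatives are nonnegative everywhere, consistent with the choice α = 0.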
While the Lipschitz constant estimation scheme in [11] optimizes over T, throughout the manuscript, we choose T to be a fixed matrix. This introduces conservatism into the framework and necessitates a suitable choice of T in order to keep the introduced conservatism to a minimum; e.g., the matrix T may be determined from the Lipschitz constant estimation outlined in Section II-B on the vanilla NN or the L2-regularized NN of the same problem. Based on (7), we can now set up an SDP in W^0, W^1, and ρ,

min ρ subject to (7), (8)

that can be used to adjust the weights in order to minimize the Lipschitz bound.
Note that solving Eq. (8) independently leads to the trivial solution of vanishing weights that do not provide a satisfactory fit for the corresponding input-output data; hence, the trade-off between accuracy of the NN and its robustness [15] has to be taken into account in the problem formulation. We achieve this by including the Lipschitz constant in the training objective and minimizing both the NN's loss and its upper bound on the Lipschitz constant, using the Alternating Direction Method of Multipliers, which allows for optimization of two objectives.
III-B Lipschitz regularization
In general, NNs are trained on input-output data with the objective of minimizing a predefined loss that we assume to be continuous throughout this manuscript, e.g., the mean squared error, cross-entropy, or hinge loss. We propose to minimize not only the NN's loss but also its Lipschitz constant. This yields an optimization problem with two separate objectives that can be solved conveniently using ADMM.
ADMM is an algorithm that solves optimization problems by splitting them into smaller subproblems that are easier to handle individually [16]. In order to apply the ADMM algorithm, the objective must be separable. The resulting subobjectives are then defined on uncoupled convex sets and subject to linear equality constraints. The ADMM scheme solves the resulting optimization problem through independent minimization steps on the augmented Lagrangian of the optimization problem and a dual update step. The objectives at hand, i.e., the NN's loss and the Lipschitz bound, are indeed separable and defined on uncoupled convex sets. However, the problems are not completely independent and need to be connected through a linear constraint, which requires the introduction of additional copy variables W̄ of the same size as the weights W. The loss of the NN is an explicit function of the weights, and the Lipschitz bound depends on the copies through the LMI (7), yielding the following optimization problem:

min over W, W̄, ρ of Loss(W) + μρ subject to (7) in (W̄, ρ) and W = W̄, (9)
where μ > 0 is a weighting parameter adjusting the trade-off between accuracy and robustness. Applying the ADMM scheme to problem (9) results in the augmented Lagrangian function of (9), with Lagrange multipliers for the coupling constraint W = W̄ and a penalty parameter. The optimum for (9) is then determined via iterative ADMM update steps: a loss update step (10a) minimizing the augmented Lagrangian over W, a Lipschitz update step (10b) minimizing it over W̄ and ρ subject to (7), and a dual update step (10c) for the multipliers.
For training of robust NNs, we carry out the corresponding updates consecutively until convergence. The loss update step (10a) is carried out using backpropagation, whereas the Lipschitz update step (10b) requires solving an SDP in every iteration and thereby adds computations compared to training of a vanilla NN.
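The structure of the iteration, a loss step, a constrained step, and a dual update, can be illustrated on a small convex stand-in in which the SDP of the Lipschitz update is replaced by a projection onto a norm ball. This is a sketch of the ADMM mechanics only, not the paper's actual update (10b):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)
c = 0.5          # bound playing the role of the desired Lipschitz constraint

pen = 1.0                                            # ADMM penalty parameter
x = np.zeros(5); z = np.zeros(5); u = np.zeros(5)    # primal, copy, scaled dual

for _ in range(500):
    # "loss" step: argmin_x ||Ax - b||^2 + (pen/2)||x - z + u||^2 (closed form)
    x = np.linalg.solve(A.T @ A + (pen / 2) * np.eye(5),
                        A.T @ b + (pen / 2) * (z - u))
    # "Lipschitz" step: project the copy onto the infinity-norm ball of radius c
    z = np.clip(x + u, -c, c)
    # dual update on the consensus constraint x = z
    u = u + x - z
```

At convergence the copy z satisfies the constraint by construction and the consensus residual x − z vanishes, mirroring how the LMI step enforces (7) on the weight copies in every iteration.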
It is possible to extend the framework and optimize over W^0, W^1, and T at the same time, which requires a second LMI constraint in (8) and an additional update step, resulting in a multi-block ADMM scheme. This reduces conservatism but increases computation time.
It is also possible to extend the ADMM scheme outlined above to training of robust multi-layer NNs, using the corresponding Lipschitz bounds for multi-layer NNs in [11]. Similar to the single-layer case, the Lipschitz bound for multi-layer NNs is convex in the first and last weight matrices for α = 0 and fixed T.
In the following, we apply a general convergence result for ADMM (§3.2 and Appendix A in [16]) to (9) in order to prove that the proposed training procedure for robust NNs converges under certain convexity assumptions. Feasibility of the LMI (7) can be included in the optimization objective using the indicator function of the set of (W̄, ρ) satisfying (7), which is 0 if (7) is feasible and +∞ otherwise.
For convergence considerations, we then look at the unaugmented Lagrangian (11) of (9) and assume that the NN's loss is strictly convex. As we will comment at the end of this section, this assumption typically does not hold globally.
Assumption 5: The NN’s loss is strictly convex.
Lemma 6: If the Slater condition is fulfilled, then strong duality holds for (9), and strong duality indicates that the unaugmented Lagrangian (11) has a saddle point. ∎
Boyd et al. show convergence for ADMM if (i) both objectives are convex, closed, and proper functions and if (ii) the unaugmented Lagrangian has a saddle point [16]. In the following, we show that (9) fulfills these conditions. For vanishing weights and any ρ > 0, the LMI (7) is strictly feasible. According to Lemma 6, this fact and Assumption 5 imply that the unaugmented Lagrangian (11) has a saddle point. The input-output map of the NN (3) is continuous, as it consists of the sum of weighted activation functions that are continuous by design, and so is the loss function. Also, as a measure of the errors of the predictions with respect to the true labels, the loss is lower bounded and therefore never attains the value −∞. Together with the fact that its domain is nonempty, we conclude that the loss function is proper. Furthermore, continuity and convexity imply that the loss function is closed. The subobjective μρ is obviously convex, continuous in ρ, and defined on a set (7) that is convex in W̄ and ρ. Also, μρ → ∞ for ρ → ∞ while the LMI (7) remains feasible, such that the function is closed. It is lower bounded by 0 and therefore proper. The indicator function of a convex set is also convex. It is closed, as each of its sublevel sets is closed, and by definition it never attains −∞, which renders it proper. ∎
In general, NNs are highly nonlinear and non-convex, and so is the loss function, i.e., Assumption 5 is typically not satisfied. Yet in training of vanilla NNs, the loss reliably converges to a local minimum for an adequate choice of hyperparameters. Hence, within a small neighborhood of the minimum, the convexity assumption is reasonable. Moreover, the proposed ADMM scheme, which adds a convex objective to the usual training procedure, does not add complexity in the form of non-convexity to the optimization problem. A suitable initialization of the weights and the copy variables is recommended, not only for faster convergence but also to find a better solution. Nevertheless, only convergence to a local minimum can be expected.
III-D Enforcing Lipschitz bounds
In Section III-B, we suggested minimizing an upper bound on the Lipschitz constant of an NN. Using the proposed ADMM framework, it is also possible to enforce a desired upper bound on the Lipschitz constant during training of an NN. In that case, the Lipschitz bound is not minimized but instead fixed to a desired value. Judiciously, ρ does not appear in the optimization objective of this setup (12); instead, the loss is minimized subject to the LMI (7) with ρ fixed to the square of the desired bound.
Theorem 8 follows from the fact that, by design, the bound on the Lipschitz constant is enforced in every iteration of the ADMM scheme, more specifically, in every Lipschitz update step by the LMI constraint on the weights. Convergence can be shown exactly as in Theorem 6, where again, the loss of the NN is not convex in general; nevertheless, Assumption 5 allows for insights into convergence to local minima.
The training procedure based on (12) allows choosing the value of the Lipschitz bound and thus training NNs with Lipschitz guarantees. This way, a desired degree of robustness can be directly enforced. However, the choice of such a constraint is always connected to the trade-off between accuracy and robustness, as the fit generally deteriorates when decreasing the Lipschitz bound. In addition, it is helpful to initialize the weight parameters appropriately, which not only accelerates training but may also facilitate a better fit.
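Outside the SDP framework, a crude way to impose a hard bound is to clip the weights' singular values after each update, which bounds the trivial spectral-norm-product estimate rather than the tighter SDP-based bound. An illustrative alternative, not the proposed method:

```python
import numpy as np

def clip_spectral_norm(W, bound):
    """Clip the singular values of W so its spectral norm is at most `bound`."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.minimum(s, bound)) @ Vt

rng = np.random.default_rng(2)
W = rng.normal(size=(8, 2)) * 3.0      # placeholder weight matrix
W_clipped = clip_spectral_norm(W, 1.0)  # spectral norm now at most 1
```

Because the spectral-norm product is conservative, such clipping restricts the NN more than enforcing the same bound via the LMI (7).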
IV Simulation Results
In this section, we illustrate the benefits of the two presented variants for training of robust NNs on two toy examples. For that purpose, we design an NN for regression problems with scalar input and output. We use a feed-forward NN with one hidden layer, an activation function that is slope-restricted with α = 0 and β = 1, and the mean squared error (MSE) as the loss function. We train three NNs: a vanilla NN, an NN with L2 regularization (L2-NN) for benchmarking, and finally the Lipschitz-regularized NN (Lipschitz-NN), wherein the NN loss update step (10a) is solved using stochastic gradient descent and the SDP is solved using numerical SDP solvers [19, 20].
We first fit a classifying function that takes either the value zero or one; see Fig. 1 for a plot of the function. We apply the Lipschitz regularization scheme presented in Section III-B, wherein we initialize the Lipschitz-NN from the L2-regularized NN. The evaluation of the resulting MSE losses and bounds on the Lipschitz constant, summarized in Table I, shows that the loss of the nominal NN that was trained without any regularizer is the smallest. Naturally, there is a trade-off between accuracy and robustness. Yet, its Lipschitz bound of 17.65 is rather high and can be reduced with a regularizer. Comparing the two regularizers, the robustness certificate of the Lipschitz-NN is smaller than the one of the L2-NN, whereas the MSE loss is similar.
To illustrate the enforcement of Lipschitz bounds as proposed in Section III-D, we train NNs to fit data taken from a noisy sinusoidal function. We set the Lipschitz bound to the same value as the upper bound on the Lipschitz constant obtained with L2 regularization, and we initialize the parameters from the L2-NN. The three resulting NNs produce the graphs shown in Fig. 2. Note that the nominal NN is overfitting the sine, which is reflected in its high Lipschitz bound. The map obtained from the L2-NN notably provides an inferior fit, whereas the Lipschitz-NN provides a good fit while its Lipschitz bound is kept lower than the one of the L2-NN such that similar robustness is maintained. The Lipschitz-NN exceeds the other NNs in robustness while performing only marginally worse than the nominal NN, cf. the MSE loss on training and testing data, and on the underlying sine curve in Table II. The results show that Lipschitz regularization and enforcement of Lipschitz bounds can be used to effectively train robust NNs while trading off robustness and accuracy. Code to reproduce the examples can be found at https://github.com/ppauli/Robust_Lipschitz-NNs.
Table II: Training MSE, Testing MSE, and Sine MSE of the three NNs.
V Conclusion
We proposed a framework for training of single-hidden-layer NNs that encourages robustness, both by considering Lipschitz regularization and by enforcing Lipschitz bounds during training. The underlying SDP [11] estimates the upper bound on the Lipschitz constant more accurately than traditional methods, as it exploits the fact that activation functions are slope-restricted. We designed an optimization scheme based on this SDP that trains an NN to fit input-output data and at the same time increases its robustness in terms of Lipschitz continuity. We used ADMM to solve the underlying optimization problem and to therein conveniently incorporate the trade-off between accuracy and robustness. In addition, we presented a variation of the framework that allows for bounding the Lipschitz constant by a desired value, i.e., training NNs with robustness guarantees. We successfully tested our method on two toy examples, where we benchmarked it against L2 regularization.
Next steps include the extension of the method to multi-layer NNs, reducing conservatism by optimizing not only over the weights but also over the weighting matrix T, and the implementation on more conclusive, higher-dimensional problems. In addition, for benchmarking purposes, we plan to compare the proposed methods to other training procedures that improve robustness.
-  R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, “Natural language processing (almost) from scratch,” Journal of Machine Learning Research, vol. 12, pp. 2493–2537, 2011.
-  Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
-  C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” arXiv preprint arXiv:1312.6199, 2013.
-  I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.
-  N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami, “Distillation as a defense to adversarial perturbations against deep neural networks,” in 2016 IEEE Symposium on Security and Privacy (SP). IEEE, 2016, pp. 582–597.
-  E. Wong and Z. Kolter, “Provable defenses against adversarial examples via the convex outer adversarial polytope,” in International Conference on Machine Learning, 2018, pp. 5283–5292.
-  A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” arXiv preprint arXiv:1706.06083, 2017.
-  Y. Tsuzuku, I. Sato, and M. Sugiyama, “Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks,” in Advances in Neural Information Processing Systems, 2018, pp. 6541–6550.
-  M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier, “Parseval networks: Improving robustness to adversarial examples,” in International Conference on Machine Learning, 2017, pp. 854–863.
-  M. Hein and M. Andriushchenko, “Formal guarantees on the robustness of a classifier against adversarial manipulation,” in Advances in Neural Information Processing Systems, 2017, pp. 2266–2276.
-  M. Fazlyab, A. Robey, H. Hassani, M. Morari, and G. Pappas, “Efficient and accurate estimation of Lipschitz constants for deep neural networks,” in Advances in Neural Information Processing Systems, 2019, pp. 11 423–11 434.
-  P. Pauli, A. Koch, J. Berberich, and F. Allgöwer, “Robust neural networks via Lipschitz regularization and enforced Lipschitz bounds,” in 21st IFAC World Congress as late breaking result (submitted), 2020.
-  A. Krogh and J. A. Hertz, “A simple weight decay can improve generalization,” in Advances in Neural Information Processing Systems, 1992, pp. 950–957.
-  M. Fazlyab, A. Robey, H. Hassani, M. Morari, and G. J. Pappas, “Efficient and accurate estimation of Lipschitz constants for deep neural networks,” arXiv preprint arXiv:1906.04893, 2019.
-  D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry, “Robustness may be at odds with accuracy,” arXiv preprint arXiv:1805.12152, 2018.
-  S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein et al., “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine learning, vol. 3, no. 1, pp. 1–122, 2011.
-  S. Boyd, S. P. Boyd, and L. Vandenberghe, Convex optimization. Cambridge university press, 2004.
-  J. Löfberg, “YALMIP: A toolbox for modeling and optimization in MATLAB,” in Proceedings of the CACSD Conference, Taipei, Taiwan, 2004.
-  MOSEK ApS, The MOSEK optimization toolbox for MATLAB manual. Version 9.0., 2019. [Online]. Available: http://docs.mosek.com/9.0/toolbox/index.html