I Introduction
Neural networks (NNs) and deep learning have lately been successful in many fields
[1, 2], where they are mostly used for classification and segmentation problems, as well as in reinforcement learning
[3]. The main advantages of NNs are that they can be trained straightforwardly using backpropagation and that, as universal function approximators, they are able to represent complex nonlinearities. While NNs are powerful and broadly applicable, they lack rigorous guarantees, which is why they are not yet applied to safety-critical applications such as medical devices and autonomous driving. Adversarial attacks can easily deceive an NN by adding imperceptible perturbations to the input
[4], a problem that has recently been tackled in a number of different ways, such as adversarial training [5, 6]. Another promising approach is to show that an NN is provably robust against norm-bounded adversarial perturbations [7, 8], while yet another one is to use Lipschitz constants as a robustness measure that indicates the sensitivity of the output to perturbations in the input [9]. Based on this notion of Lipschitz continuity, we propose a framework for training robust NNs that encourages a small Lipschitz constant by including a regularizer or, respectively, a constraint on the NN's Lipschitz constant. Trivial Lipschitz bounds of NNs can be determined as the product of the spectral norms of the weights [4], which is used during training in [10]. In [9]
, the target values are manipulated based on the estimated Lipschitz constant, whereas we use the Lipschitz constant as a regularization functional, similar to
[11]. However, they use local Lipschitz constants, whereas we penalize the global one. In [12], Fazlyab et al. propose an interesting new estimation scheme for more accurate upper bounds on the Lipschitz constant than the weights' spectral norms, exploiting the structure of the nonlinear activation functions. Activation functions are gradients of convex potential functions, and hence monotonically increasing functions with bounded slopes, a property used in
[12] to state the slope-restriction property as an incremental quadratic constraint and then formulate a semidefinite program (SDP) that determines an upper bound on the Lipschitz constant. In [12], three variants of the Lipschitz constant estimation framework are proposed, trading off accuracy and computational tractability. In this paper, we disprove by counterexample the most accurate approach presented in [12], and we employ the other approaches for training robust NNs. More specifically, we include the SDP-based Lipschitz bound characterization of [12] in the training procedure via an Alternating Direction Method of Multipliers (ADMM) scheme. We present two versions of the proposed training method: one is a regularizer rendering the Lipschitz constant small, and the other one enforces guaranteed upper bounds on the Lipschitz constant during training. The main contributions of this manuscript are the two training procedures for robust NNs based on the notion of Lipschitz continuity. In addition, we show that the method for Lipschitz constant estimation for NNs that was recently proposed in [12] requires a modification for the least conservative choice of decision variables. This manuscript is organized as follows. In Section II, we introduce Lipschitz constant estimation for NNs based on [12] but disprove their most accurate Lipschitz estimator. In Section III, we present a training procedure with Lipschitz regularization and outline the setup of the optimization problem that is solved using ADMM. Subsequently, we analyze the convergence of the ADMM scheme and finally introduce a variation of the proposed procedure that allows one to enforce Lipschitz bounds on the NN. In Section IV, we provide two examples on which we successfully applied the proposed training procedures. We submitted preliminary ideas of this work as a late-breaking result to the 21st IFAC World Congress [13].
II Lipschitz constant estimation
In this section, we briefly introduce robustness in the context of neural networks. We then state a method to estimate the Lipschitz constant of an NN based on [12] and finally argue why one of the methods for Lipschitz constant estimation proposed in [12] is incorrect.
II-A Robustness of NNs
A robust NN does not change its prediction if the input is perturbed imperceptibly. To quantify robustness, a suitable robustness measure has to be defined. One definition requires that perturbations from a norm-bounded uncertainty set may not change the prediction. Alternatively, using probabilistic approaches, random perturbations must not change the prediction with a certain probability. A third alternative is the Lipschitz constant, a sensitivity measure. A function $f:\mathbb{R}^{n_0}\to\mathbb{R}^{n_2}$ is globally Lipschitz continuous if there exists an $L \ge 0$ such that

$$\|f(x)-f(y)\|_2 \le L\,\|x-y\|_2 \quad \forall x,y \in \mathbb{R}^{n_0}. \tag{1}$$

The smallest $L$ for which (1) holds is the Lipschitz constant $L^\ast$. If the input changes from $x$ to $y$, the Lipschitz constant gives an upper bound on how much the output changes. Hence, a low Lipschitz constant indicates low sensitivity, which we interpret as high robustness. In this work, we aim to minimize the Lipschitz constant or, respectively, bound it from above during training to increase the robustness of the resulting NN.
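To make the sensitivity interpretation concrete, the following sketch (our own illustration, not part of the original experiments; the network weights are random placeholders) samples input pairs to obtain an empirical lower bound on the Lipschitz constant, which any certified upper bound must exceed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder single-hidden-layer NN f(x) = W1 tanh(W0 x + b0) + b1
# with randomly chosen weights (illustrative only).
W0, b0 = rng.standard_normal((16, 2)), rng.standard_normal(16)
W1, b1 = rng.standard_normal((1, 16)), rng.standard_normal(1)

def f(x):
    return W1 @ np.tanh(W0 @ x + b0) + b1

# Empirical lower bound on the Lipschitz constant over sampled pairs:
# max_{x != y} ||f(x) - f(y)|| / ||x - y||.
pairs = rng.standard_normal((1000, 2, 2))
L_emp = max(
    np.linalg.norm(f(x) - f(y)) / np.linalg.norm(x - y)
    for x, y in pairs
)
print(f"empirical Lipschitz lower bound: {L_emp:.3f}")
```

Since the sampled quotient never exceeds the true Lipschitz constant, such an estimate is useful as a sanity check against any certified upper bound.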
Regularization, i.e., adding a penalty term to the objective function of the NN, is a prevalent measure in NN training to prevent overfitting. L2 regularization penalizes the squared L2 norm of the weights and L1 regularization the weights' L1 norm. Bounding the weights counteracts fitting sudden peaks and outliers, promotes better generalization, and smoothens the resulting NN [14]. Furthermore, the product of the spectral norms of the weights provides a trivial bound on an NN's Lipschitz constant; consequently, L1 and L2 regularization improve the robustness of an NN in the sense of Lipschitz continuity. In this paper, we penalize a more accurate estimate of the Lipschitz constant, leading to a more direct and potentially more effective approach.

II-B Lipschitz constant estimation
In the following, we outline a method to estimate bounds on the Lipschitz constant of single-hidden-layer NNs, exploiting the slope-restricted structure of the nonlinear activation functions, as shown in [12]. This method, named LipSDP, yields more accurate bounds than trivial ones, i.e., the product of the spectral norms of the weights.
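For reference, the trivial bound that LipSDP improves upon can be computed in a few lines. This is a sketch with random placeholder weights, assuming activation functions with slope in $[0,1]$ (e.g. tanh):

```python
import numpy as np

# Illustrative weights for a single-hidden-layer NN (random placeholders).
rng = np.random.default_rng(1)
W0 = rng.standard_normal((16, 2))  # hidden-layer weights
W1 = rng.standard_normal((1, 16))  # output-layer weights

# Trivial Lipschitz bound for 1-Lipschitz activations (e.g. tanh):
# the product of the spectral norms of the weight matrices.
L_trivial = np.linalg.norm(W1, 2) * np.linalg.norm(W0, 2)
print(f"trivial Lipschitz bound: {L_trivial:.3f}")
```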
Continuous nonlinear activation functions can be interpreted as gradients of continuously differentiable, convex potential functions and consequently, they are inherently slope-restricted, meaning that their slope is at least $\alpha$ and at most $\beta$:

$$\alpha \le \frac{\varphi(x)-\varphi(y)}{x-y} \le \beta \quad \forall x,y \in \mathbb{R},\ x \neq y, \tag{2}$$

where $0 \le \alpha < \beta < \infty$. Consider the vector of activation functions $\phi(x) = \begin{bmatrix}\varphi(x_1) & \cdots & \varphi(x_{n_1})\end{bmatrix}^\top$ and a fully-connected feedforward NN with one hidden layer described by the equation

$$f(x) = W^1 \phi(W^0 x + b^0) + b^1, \tag{3}$$

where $W^0 \in \mathbb{R}^{n_1 \times n_0}$ and $W^1 \in \mathbb{R}^{n_2 \times n_1}$ are the weight matrices, $b^0 \in \mathbb{R}^{n_1}$ and $b^1 \in \mathbb{R}^{n_2}$ are the biases of the respective layers, and $n_0$, $n_1$, and $n_2$ denote the dimension of the input, the number of neurons in the hidden layer, and the dimension of the output, respectively. For every neuron, the slope-restriction property (2) can be written as an incremental quadratic constraint. Weighting these quadratic constraints with $\lambda_i \ge 0$, or respectively, the whole network with a diagonal weighting matrix $T \in \mathcal{D}_{n_1} := \{\operatorname{diag}(\lambda_1,\ldots,\lambda_{n_1}) \mid \lambda_i \ge 0\}$, results in an incremental quadratic constraint for the overall network:

$$\begin{bmatrix} x-y \\ \phi(x)-\phi(y) \end{bmatrix}^\top \begin{bmatrix} -2\alpha\beta T & (\alpha+\beta)T \\ (\alpha+\beta)T & -2T \end{bmatrix} \begin{bmatrix} x-y \\ \phi(x)-\phi(y) \end{bmatrix} \ge 0 \tag{4}$$
for all $x,y \in \mathbb{R}^{n_1}$. We now formulate an SDP based on [12] that exploits (4) to estimate an upper bound on the Lipschitz constant of the map characterized by the underlying NN.
Theorem 1.
For a fully-connected feedforward neural network with one hidden layer (3) and slope-restricted activation functions in the sector $[\alpha,\beta]$, suppose there exist $\rho > 0$ and a diagonal matrix $T \in \mathcal{D}_{n_1}$ such that

$$F(\rho,T) = \begin{bmatrix} -2\alpha\beta\,(W^0)^\top T W^0 - \rho I & (\alpha+\beta)\,(W^0)^\top T \\ (\alpha+\beta)\,T W^0 & -2T + (W^1)^\top W^1 \end{bmatrix} \preceq 0. \tag{5}$$

Then, (3) is globally Lipschitz continuous with Lipschitz bound $\sqrt{\rho}$.
The proof directly follows from A.4 in [15]. The smallest value for the Lipschitz upper bound is determined by solving the SDP

$$\min_{\rho,\,T}\ \rho \quad \text{s.t.}\quad F(\rho,T) \preceq 0,\ T \in \mathcal{D}_{n_1}, \tag{6}$$

where $\rho$ and $T$ serve as decision variables. The resulting Lipschitz bound $\sqrt{\rho}$ then is a robustness certificate.
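The certificate behind (6) can be illustrated without an SDP solver by fixing $T = \lambda I$ and computing the smallest feasible $\rho$ in closed form via the Schur complement. The sketch below is our own simplification (with $\alpha = 0$, $\beta = 1$, and the heuristic choice $\lambda = \|W^1\|_2^2$), not the full LipSDP optimization over $T$:

```python
import numpy as np

def lipsdp_bound_fixed_T(W0, W1, lam, beta=1.0):
    """Certified Lipschitz upper bound sqrt(rho) from the LMI (5) with
    alpha = 0 and the fixed choice T = lam * I (lam a positive scalar).
    Valid whenever 2T - W1^T W1 is positive definite."""
    n1 = W0.shape[0]
    T = lam * np.eye(n1)
    S = 2.0 * T - W1.T @ W1
    if np.min(np.linalg.eigvalsh(S)) <= 0:
        raise ValueError("choose lam larger: 2T - W1^T W1 must be > 0")
    M = beta * (W0.T @ T)
    # Schur complement of the (2,2) block of (5): rho I >= M S^{-1} M^T.
    rho = np.max(np.linalg.eigvalsh(M @ np.linalg.solve(S, M.T)))
    return np.sqrt(rho)

rng = np.random.default_rng(2)
W0 = rng.standard_normal((16, 2))
W1 = rng.standard_normal((1, 16))

# Heuristic fixed T: lam = ||W1||_2^2 guarantees 2T - W1^T W1 > 0.
lam = np.linalg.norm(W1, 2) ** 2
L_sdp = lipsdp_bound_fixed_T(W0, W1, lam)
L_trivial = np.linalg.norm(W1, 2) * np.linalg.norm(W0, 2)
print(L_sdp, L_trivial)
```

For this particular choice of $\lambda$, one can show that the certified bound never exceeds the trivial spectral-norm product; optimizing over all of $\mathcal{D}_{n_1}$, as in (6), can only improve it further.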
II-C Counterexample for LipSDP with coupling
In [12], three versions of the method LipSDP for Lipschitz constant estimation are stated, which trade off accuracy of the Lipschitz bound against computational complexity by adjusting the number of decision variables in the SDP. In this section, we give an illustrative counterexample to show that Theorem 1 in [12] does not hold for the most accurate variant of LipSDP.
Theorem 1 in [12] resembles Theorem 1 of this manuscript, with the difference that in [12], instead of the diagonal class $\mathcal{D}_{n_1}$, a set of symmetric coupling matrices $\mathcal{T}_{n_1} \supset \mathcal{D}_{n_1}$ is introduced and Theorem 1 is stated with $\mathcal{D}_{n_1}$ replaced by $\mathcal{T}_{n_1}$. In the following, we give a minimal counterexample to show that, as suggested in this manuscript, a further restriction of the class of coupling matrices is required. For that purpose, consider an NN with one hidden layer, scalar input and output, and suitably chosen weights and biases.
The resulting NN provides a good fit for the cosine function, with a maximum deviation in the output of 0.0843. Therefore, the maximum slope of the cosine gives a good approximation of the Lipschitz constant of this NN, which is $1$ at $x = \pm\pi/2$, such that $L^\ast \approx 1$. However, the linear matrix inequality (LMI) (5), with $\mathcal{D}_{n_1}$ replaced by the coupling class $\mathcal{T}_{n_1}$, is feasible for arbitrarily small $\rho$ and a suitable $T \in \mathcal{T}_{n_1}$. An arbitrarily small $\sqrt{\rho}$ is obviously no upper bound on the Lipschitz constant of a cosine-like function.
To understand why the LipSDP method fails to provide an upper bound on the Lipschitz constant in this case, we look into the coupling of the neurons accounted for by $T$. Theorem 1 in [12] builds on the assumption that (4) holds for all $T \in \mathcal{T}_{n_1}$ (Lemma 1 in [12]). In the following, we show by counterexample that Lemma 1 in [12] is incorrect, i.e., that for a given slope-restricted function there are $T \in \mathcal{T}_{n_1}$ and points $x$, $y$ that violate (4). We choose $\varphi = \tanh$ as the function, slope-restricted in the sector $[0,1]$, together with a non-diagonal coupling matrix $T \in \mathcal{T}_{n_1}$ and suitable $x$, $y$. Evaluating (4) then yields a negative value,
which disproves Lemma 1 of [12]; consequently, Theorem 1 of [12] needs to be modified. As we argued in the previous section, (4) does hold for the more conservative choice of diagonal matrices $T \in \mathcal{D}_{n_1}$, such that the proof of Theorem 1 of this manuscript directly follows from the proof of Theorem 1 in [12] provided in A.4 of [15], the extended version of [12].
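This failure mode can be reproduced numerically. The values below are our own illustrative choice (they are not the specific counterexample of the manuscript): for the elementwise activation tanh, a rank-one coupling matrix of the form $(e_1-e_2)(e_1-e_2)^\top$, which is admissible in the coupling class used in [12], makes the quadratic form in (4) negative for suitably chosen $x$, $y$, while a diagonal $T$ keeps it nonnegative:

```python
import numpy as np

def quad_form(x, y, T, alpha=0.0, beta=1.0):
    """Left-hand side of the incremental quadratic constraint (4)
    for the elementwise activation phi = tanh."""
    d = x - y
    e = np.tanh(x) - np.tanh(y)
    z = np.concatenate([d, e])
    Q = np.block([[-2 * alpha * beta * T, (alpha + beta) * T],
                  [(alpha + beta) * T,    -2 * T]])
    return z @ Q @ z

x = np.array([6.0, 2.0])
y = np.array([2.0, -2.0])

# Coupling matrix (e1 - e2)(e1 - e2)^T from the class of [12]:
T_coupled = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(quad_form(x, y, T_coupled))  # negative: (4) is violated

# Diagonal T (the restriction used in this manuscript):
T_diag = np.eye(2)
print(quad_form(x, y, T_diag))     # nonnegative: (4) holds
```

The coupled case fails because the slope restriction constrains each coordinate individually, not differences across coordinates.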
III Training robust NNs
In Section II-B, we stated a method that provides certificates on an NN's Lipschitz constant. In this section, we employ these certificates to design a training procedure for robust NNs. We present two versions of it: the first allows for minimization of the upper bound on the Lipschitz constant, and the second allows one to enforce a desired bound on the Lipschitz constant.
III-A Weights as decision variables
Eq. (6) can be used to assess an NN's robustness after training, whereas in this manuscript, to promote robustness during training, we use Eq. (6) to update the weights while minimizing the bound on the Lipschitz constant. Applying the Schur complement to (5) for $\alpha = 0$, the LMI can be linearized in the weight matrices, yielding an LMI that is linear in $W^0$ and $W^1$, for fixed $T$:

$$\begin{bmatrix} -\rho I & \beta\,(W^0)^\top T & 0 \\ \beta\,T W^0 & -2T & (W^1)^\top \\ 0 & W^1 & -I \end{bmatrix} \preceq 0. \tag{7}$$
Remark 2.
For $\alpha > 0$, the underlying constraint is not convex in $W^0$. Consequently, we cannot state LMI constraints for $\alpha > 0$ and instead set $\alpha = 0$. This is a conservative choice for some activation functions, yet the tight lower bound for the most common ones: e.g., ReLU and $\tanh$ are slope-restricted with the tight bounds $\alpha = 0$ and $\beta = 1$, and the sigmoid function is slope-restricted with $\alpha = 0$ and $\beta = 1/4$. While the Lipschitz constant estimation scheme in [12] optimizes over $T$, throughout the manuscript, we choose $T$ to be a fixed matrix. This introduces conservatism into the framework and necessitates a suitable choice of $T$ in order to keep the introduced conservatism to a minimum; e.g., the matrix $T$ may be determined from the Lipschitz constant estimation outlined in Section II-B on the vanilla NN or the L2-regularized NN of the same problem. Based on (7), we can now set up an SDP in $\rho$, $W^0$, and $W^1$,
$$\min_{\rho,\,W^0,\,W^1}\ \rho \quad \text{s.t.}\quad \text{(7)}, \tag{8}$$
that can be used to adjust the weights in order to minimize the Lipschitz bound.
Note that solving Eq. (8) independently leads to the trivial zero-weight solution, i.e., weights that do not provide a satisfactory fit for the corresponding input-output data; hence, the trade-off between accuracy of the NN and its robustness [16] has to be taken into account in the problem formulation. We achieve this by including the Lipschitz constant in the training objective and minimizing both the NN's loss and its upper bound on the Lipschitz constant using the Alternating Direction Method of Multipliers, which allows for optimization of two objectives.
III-B Lipschitz regularization
In general, NNs are trained on input-output data with the objective of minimizing a predefined loss that we assume to be continuous throughout this manuscript, e.g., the mean squared error, cross-entropy, or hinge loss. We propose to minimize not only the NN's loss but also its Lipschitz constant. This yields an optimization problem with two separate objectives that can be solved conveniently using ADMM.
ADMM is an algorithm that solves optimization problems by splitting them into smaller subproblems that are easier to handle individually [17]. In order to apply the ADMM algorithm, the objective must be separable. The resulting sub-objectives are then defined on uncoupled convex sets and subject to linear equality constraints. The ADMM scheme solves the resulting optimization problem through independent minimization steps on the augmented Lagrangian of the optimization problem and a dual update step. The objectives at hand, i.e., the NN's loss and the Lipschitz bound, are indeed separable and defined on uncoupled convex sets. However, the problems are not completely independent and need to be connected through a linear constraint, which requires the introduction of additional variables $\bar W = (\bar W^0, \bar W^1)$ of equal size as the weights $W = (W^0, W^1)$. The loss $\mathcal{L}$ of the NN is an explicit function of the weights $W$, and the Lipschitz bound depends on $\bar W$ through the LMI (7), yielding the following optimization problem:
$$\min_{W,\,\bar W,\,\rho}\ \mathcal{L}(W) + \mu\,\rho \quad \text{s.t.}\quad \text{(7) holds for } (\bar W,\rho),\ \ W = \bar W, \tag{9}$$
where $\mu > 0$ is a weighting parameter adjusting the trade-off between accuracy and robustness. Applying the ADMM scheme to problem (9) results in the augmented Lagrangian function

$$L_\sigma(W,\bar W,\rho,\nu) = \mathcal{L}(W) + \mu\rho + \nu^\top\!\left(\operatorname{vec}(W)-\operatorname{vec}(\bar W)\right) + \tfrac{\sigma}{2}\,\big\|\operatorname{vec}(W)-\operatorname{vec}(\bar W)\big\|_2^2$$

with Lagrange multipliers $\nu$ and the penalty parameter $\sigma > 0$. The optimum for (9) is then determined via the following iterative ADMM update steps:
$$W^{k+1} = \arg\min_{W}\ L_\sigma(W,\bar W^{k},\rho^{k},\nu^{k}), \tag{10a}$$
$$(\bar W^{k+1},\rho^{k+1}) = \arg\min_{(\bar W,\rho)\ \text{s.t.}\ (7)}\ L_\sigma(W^{k+1},\bar W,\rho,\nu^{k}), \tag{10b}$$
$$\nu^{k+1} = \nu^{k} + \sigma\left(\operatorname{vec}(W^{k+1}) - \operatorname{vec}(\bar W^{k+1})\right). \tag{10c}$$
For training of robust NNs, we carry out the corresponding updates consecutively until convergence. The loss function is optimized via gradient descent using backpropagation (Eq. (10a)), whereas the Lipschitz update step (10b) requires solving an SDP in every iteration and thereby adds additional computations compared to training of a vanilla NN.

Remark 3.
It is possible to extend the framework and optimize over $\bar W$, $\rho$, and $T$ at the same time, which requires a second LMI constraint in (8) and an additional update step, resulting in a multi-block ADMM scheme. This reduces conservatism but increases computation time.
Remark 4.
It is also possible to extend the ADMM scheme outlined above to training of robust multi-layer NNs, using the corresponding Lipschitz bounds for multi-layer NNs in [12]. Similar to the single-layer case, the Lipschitz bound for multi-layer NNs is convex in the weights for $\alpha = 0$.
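The structure of the updates (10) can be illustrated on a small convex stand-in problem. The sketch below is our own toy example, not the NN training problem: the loss is replaced by a least-squares objective and the LMI-constrained Lipschitz step by a Euclidean projection onto a norm ball, keeping the same three-step pattern of primal update, constrained update, and dual update:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
c = 1.0       # radius of the constraint set ||w|| <= c (stand-in for the LMI)
sigma = 1.0   # ADMM penalty parameter

w_bar = np.zeros(5)   # plays the role of the copy variables in (10b)
nu = np.zeros(5)      # scaled dual variable, cf. (10c)

for _ in range(200):
    # (10a)-analogue: minimize loss + (sigma/2)||w - w_bar + nu||^2, in
    # closed form here since the stand-in loss is quadratic.
    w = np.linalg.solve(A.T @ A + sigma * np.eye(5),
                        A.T @ b + sigma * (w_bar - nu))
    # (10b)-analogue: project w + nu onto the constraint set.
    v = w + nu
    w_bar = v * min(1.0, c / np.linalg.norm(v))
    # (10c)-analogue: dual ascent on the consensus constraint w = w_bar.
    nu = nu + w - w_bar

print(np.linalg.norm(w_bar))  # within the constraint radius c
```

In the actual scheme, the closed-form quadratic step is replaced by backpropagation on the NN loss and the projection by the SDP-constrained update (10b).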
III-C Convergence
In the following, we apply a general convergence result for ADMM (§3.2 and Appendix A in [17]) to (9) in order to prove that the proposed training procedure for robust NNs converges under certain convexity assumptions. Feasibility of the LMI (7) can be included in the optimization objective using the indicator function

$$g(\bar W,\rho) = \begin{cases} 0 & \text{if } (\bar W,\rho) \text{ satisfies (7)},\\ \infty & \text{otherwise.} \end{cases}$$

For convergence considerations, we then look at the unaugmented Lagrangian

$$L_0(W,\bar W,\rho,\nu) = \mathcal{L}(W) + \mu\rho + g(\bar W,\rho) + \nu^\top\!\left(\operatorname{vec}(W)-\operatorname{vec}(\bar W)\right) \tag{11}$$
and assume that the loss $\mathcal{L}$ is strictly convex. As we comment at the end of this section, this assumption typically does not hold globally.
Assumption 5.
The NN's loss $\mathcal{L}$ is strictly convex.
Lemma 6.
Suppose the LMI (7) is strictly feasible and Assumption 5 holds. Then the unaugmented Lagrangian (11) has a saddle point.
Proof.
If the Slater condition is fulfilled, then strong duality holds for (9), and strong duality indicates that (11) has a saddle point [18]. ∎
Theorem 7.
Let Assumption 5 hold. Then the iterates of the ADMM scheme (10) converge, i.e., the residual $\operatorname{vec}(W^k)-\operatorname{vec}(\bar W^k)$ converges to zero and the objective of (9) converges to its optimal value.
Proof.
Boyd et al. show convergence of ADMM if (i) both objectives are convex, closed, and proper functions and if (ii) the unaugmented Lagrangian has a saddle point [17]. In the following, we show that (9) fulfills these conditions. For $T \succ 0$, any $\bar W^0$, $\bar W^1$ with $(\bar W^1)^\top \bar W^1 \prec 2T$, and sufficiently large $\rho$, the LMI (7) is strictly feasible. According to Lemma 6, this fact and Assumption 5 imply that the unaugmented Lagrangian (11) has a saddle point. The input-output map of the NN (3) is continuous, as it consists of sums of weighted activation functions that are continuous by design, and so is the loss function. Also, as a measure of the error between predictions and true labels, the loss is bounded from below by $0$ and therefore never attains the value $-\infty$. Together with the fact that its domain is nonempty, we conclude that the loss function is proper. Furthermore, continuity and convexity imply that the loss function is closed. The sub-objective $\mu\rho$ is obviously convex, continuous in $\rho$, and defined on a set (7) that is convex in $\bar W$ and $\rho$; it tends to $\infty$ only as $\rho \to \infty$ while the LMI (7) remains feasible, such that the function is closed, and it is bounded from below by $0$ and therefore proper. The indicator function of a convex set is also convex. It is closed, as each sublevel set of $g$ is closed, and by definition it never attains $-\infty$, which renders it proper. ∎
In general, NNs are highly nonlinear and nonconvex and so is the loss function $\mathcal{L}$, i.e., Assumption 5 is typically not satisfied. Yet, in training of vanilla NNs, the loss reliably converges to a local minimum for an adequate choice of hyperparameters. Hence, within a small neighborhood of the minimum, the convexity assumption is reasonable. Moreover, the proposed ADMM scheme, which adds a convex objective to the usual training procedure, does not add complexity in the form of nonconvexity to the optimization problem. A suitable initialization of $W$ and $\bar W$ is recommended, not only for faster convergence but also to find a better solution. Nevertheless, only convergence to a local minimum can be expected.

III-D Enforcing Lipschitz bounds
In Section III-B, we suggested minimizing an upper bound on the Lipschitz constant of an NN. Using the proposed ADMM framework, it is also possible to enforce a desired upper bound on the Lipschitz constant during training of an NN. In that case, the Lipschitz constant is not minimized but instead bounded by a desired value $L_{\mathrm{des}}$, i.e., we fix $\rho = L_{\mathrm{des}}^2$. Judiciously, $\rho$ then does not appear in the optimization objective of this setup:
$$\min_{W,\,\bar W}\ \mathcal{L}(W) \quad \text{s.t.}\quad \text{(7) holds for } (\bar W,\rho) \text{ with } \rho = L_{\mathrm{des}}^2,\ \ W = \bar W, \tag{12}$$
where $W$ and $\bar W$ serve as decision variables. For enforcement of Lipschitz bounds, we apply the ADMM algorithm as in (10) to (12) instead of (9).
Theorem 8.
The weights $\bar W^{k}$ obtained in every Lipschitz update step of the ADMM scheme applied to (12) certify the Lipschitz bound $L_{\mathrm{des}}$ for the corresponding NN.
Theorem 8 follows from the fact that, by design, the bound on the Lipschitz constant is enforced in every iteration of the ADMM scheme, more specifically, in every Lipschitz update step through the LMI constraint on the weights. Convergence can be shown exactly as in Section III-C, where again the loss of the NN is not convex in general; nevertheless, Assumption 5 allows for insights into convergence to local minima.
Corollary 9.
The training procedure based on (12) allows one to choose the value of the Lipschitz bound $L_{\mathrm{des}}$ and hence to train NNs with Lipschitz guarantees.
This way, a desired degree of robustness can be directly enforced. However, the choice of such a constraint is always subject to the trade-off between accuracy and robustness, as the fit generally deteriorates when the Lipschitz bound is decreased. In addition, it is helpful to initialize the weights appropriately, which not only accelerates training but may also facilitate a better fit.
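Once training has converged, the guarantee can be checked independently of the training loop: rebuild the LMI (7) with $\rho = L_{\mathrm{des}}^2$ and verify negative semidefiniteness. The following is our own helper sketch (with $T$ fixed, $\alpha = 0$, $\beta = 1$, and random placeholder weights):

```python
import numpy as np

def certifies_bound(W0, W1, T, L_des, beta=1.0, tol=1e-9):
    """Check the LMI (7) with rho = L_des**2: returns True iff the
    weights are certified with Lipschitz bound L_des (alpha = 0)."""
    n0, n1, n2 = W0.shape[1], W0.shape[0], W1.shape[0]
    F = np.block([
        [-L_des**2 * np.eye(n0), beta * W0.T @ T, np.zeros((n0, n2))],
        [beta * T @ W0,          -2.0 * T,        W1.T],
        [np.zeros((n2, n0)),     W1,              -np.eye(n2)],
    ])
    return np.max(np.linalg.eigvalsh(F)) <= tol

rng = np.random.default_rng(4)
W0 = 0.1 * rng.standard_normal((16, 2))
W1 = 0.1 * rng.standard_normal((1, 16))
T = np.linalg.norm(W1, 2) ** 2 * np.eye(16)

print(certifies_bound(W0, W1, T, L_des=10.0))   # loose desired bound
print(certifies_bound(W0, W1, T, L_des=1e-6))   # vanishingly small bound
```

Such a post-hoc check is a useful safeguard, since numerical tolerances of the SDP solver in the training loop could otherwise go unnoticed.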
IV Simulation Results
In this section, we illustrate the benefits of the two presented variants for training robust NNs on two toy examples. For that purpose, we design NNs for regression problems. We use a feedforward NN with one hidden layer, the activation function $\tanh$, which is slope-restricted with $\alpha = 0$ and $\beta = 1$, and the mean squared error (MSE) as the loss function. We train three NNs: a vanilla NN, an NN with L2 regularization (L2-NN) for benchmarking, and finally the Lipschitz-regularized NN (Lipschitz-NN), wherein the NN loss update step (10a) is solved using stochastic gradient descent and the Lipschitz update step is solved using numerical SDP solvers
[19, 20]. We first fit a classifying function that takes either the value zero or one; compare Fig. 1 for a plot of the function. We apply the Lipschitz regularization scheme presented in Section III-B, wherein we initialize the Lipschitz-NN from the L2-regularized NN. The evaluation of the resulting MSE losses and bounds on the Lipschitz constant, summarized in Table I, shows that the loss of the nominal NN that was trained without any regularizer is the smallest. Naturally, there is a trade-off between accuracy and robustness. Yet, the Lipschitz bound of 17.65 is rather high and can be restrained with a regularizer. Comparing the two regularizers, the robustness certificate of the Lipschitz-NN is smaller than that of the L2-NN, whereas the MSE losses are similar.

TABLE I: Training MSE and Lipschitz bound.

               Training MSE   Lipschitz bound
Nominal NN        0.0161          17.65
L2-NN             0.0457          7.815
Lipschitz-NN      0.0437          6.378
To illustrate the enforcement of Lipschitz bounds as proposed in Section III-D, we train NNs to fit data taken from a noisy sinusoidal function. We set the Lipschitz bound $L_{\mathrm{des}}$ to the same value as the upper bound on the Lipschitz constant obtained with L2 regularization, and initialize the parameters from the L2-NN. The three resulting NNs produce the graphs shown in Fig. 2. Note that the nominal NN is overfitting the sine, which is reflected in its high Lipschitz bound of 9.747. The map obtained from the L2-NN notably provides an inferior fit, whereas the Lipschitz-NN provides a good fit while its Lipschitz bound is kept lower than that of the L2-NN, such that similar robustness is maintained. The Lipschitz-NN exceeds the other NNs in robustness while performing only marginally worse than the nominal NN, cf. the MSE loss on training and testing data, and on the underlying sine curve in Table II. The results show that Lipschitz regularization and enforcement of Lipschitz bounds can be used to effectively train robust NNs while trading off robustness and accuracy. Code to reproduce the examples can be found at https://github.com/ppauli/Robust_LipschitzNNs.
TABLE II: Training, testing, and sine MSE with Lipschitz bound.

               Training MSE   Testing MSE   Sine MSE   Lipschitz bound
Nominal NN        0.0590         0.0990       0.0106        9.747
L2-NN             0.1159         0.1521       0.0681        2.403
Lipschitz-NN      0.0687         0.1009       0.0128        2.316
V Conclusion
We proposed a framework for training single-hidden-layer NNs that encourages robustness, both by Lipschitz regularization and by enforcing Lipschitz bounds during training. The underlying SDP [12] estimates the upper bound on the Lipschitz constant more accurately than traditional methods, as it exploits the fact that activation functions are slope-restricted. We designed an optimization scheme based on this SDP that trains an NN to fit input-output data and at the same time increases its robustness in terms of Lipschitz continuity. We used ADMM to solve the underlying optimization problem and to conveniently incorporate the trade-off between accuracy and robustness therein. In addition, we presented a variation of the framework that allows for bounding the Lipschitz constant by a desired value, i.e., training NNs with robustness guarantees. We successfully tested our method on two toy examples, where we benchmarked it against L2 regularization.
Next steps include the extension of the method to multi-layer NNs, reducing conservatism by optimizing not only over the weights but also over the weighting matrix $T$, and the application to more conclusive, higher-dimensional problems. In addition, for benchmarking purposes, we plan to compare the proposed methods to other training procedures that improve robustness.
References

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[2] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, "Natural language processing (almost) from scratch," Journal of Machine Learning Research, vol. 12, pp. 2493–2537, 2011.
[3] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[4] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," arXiv preprint arXiv:1312.6199, 2013.
[5] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," arXiv preprint arXiv:1412.6572, 2014.
[6] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami, "Distillation as a defense to adversarial perturbations against deep neural networks," in 2016 IEEE Symposium on Security and Privacy (SP), 2016, pp. 582–597.
[7] E. Wong and Z. Kolter, "Provable defenses against adversarial examples via the convex outer adversarial polytope," in International Conference on Machine Learning, 2018, pp. 5283–5292.
[8] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, "Towards deep learning models resistant to adversarial attacks," arXiv preprint arXiv:1706.06083, 2017.
[9] Y. Tsuzuku, I. Sato, and M. Sugiyama, "Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks," in Advances in Neural Information Processing Systems, 2018, pp. 6541–6550.
[10] M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier, "Parseval networks: Improving robustness to adversarial examples," in International Conference on Machine Learning, 2017, pp. 854–863.
[11] M. Hein and M. Andriushchenko, "Formal guarantees on the robustness of a classifier against adversarial manipulation," in Advances in Neural Information Processing Systems, 2017, pp. 2266–2276.
[12] M. Fazlyab, A. Robey, H. Hassani, M. Morari, and G. Pappas, "Efficient and accurate estimation of Lipschitz constants for deep neural networks," in Advances in Neural Information Processing Systems, 2019, pp. 11423–11434.
[13] P. Pauli, A. Koch, J. Berberich, and F. Allgöwer, "Robust neural networks via Lipschitz regularization and enforced Lipschitz bounds," late-breaking result submitted to the 21st IFAC World Congress, 2020.
[14] A. Krogh and J. A. Hertz, "A simple weight decay can improve generalization," in Advances in Neural Information Processing Systems, 1992, pp. 950–957.
[15] M. Fazlyab, A. Robey, H. Hassani, M. Morari, and G. J. Pappas, "Efficient and accurate estimation of Lipschitz constants for deep neural networks," arXiv preprint arXiv:1906.04893, 2019.
[16] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry, "Robustness may be at odds with accuracy," arXiv preprint arXiv:1805.12152, 2018.
[17] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
[18] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[19] J. Löfberg, "YALMIP: A toolbox for modeling and optimization in MATLAB," in Proceedings of the CACSD Conference, Taipei, Taiwan, 2004.
[20] MOSEK ApS, The MOSEK Optimization Toolbox for MATLAB Manual, Version 9.0, 2019. [Online]. Available: http://docs.mosek.com/9.0/toolbox/index.html