Training robust neural networks using Lipschitz bounds

by Patricia Pauli, et al.
University of Stuttgart

Due to their susceptibility to adversarial perturbations, neural networks (NNs) are hardly used in safety-critical applications. One measure of robustness to such perturbations in the input is the Lipschitz constant of the input-output map defined by an NN. In this work, we propose a framework to train NNs while at the same time encouraging robustness by keeping their Lipschitz constant small, thus addressing the robustness issue. More specifically, we design an optimization scheme based on the Alternating Direction Method of Multipliers that minimizes not only the training loss of an NN but also its Lipschitz constant, resulting in a semidefinite programming based training procedure that promotes robustness. We design two versions of this training procedure. The first one includes a regularizer that penalizes an accurate upper bound on the Lipschitz constant. The second one allows us to enforce a desired Lipschitz bound on the NN at all times during training. Finally, we provide two examples to show that the proposed framework successfully increases the robustness of NNs.



I Introduction

Neural networks (NNs) and deep learning have lately been successful in many fields [1, 2], where they are mostly used for classification and segmentation problems, as well as in reinforcement learning. The main advantages of NNs are that they can be trained straightforwardly using backpropagation and as universal function approximators they have the capability to represent complex nonlinearities. While NNs are powerful and broadly applicable they lack rigorous guarantees which is why they are not yet applied to safety-critical applications such as medical devices and autonomous driving. Adversarial attacks can easily deceive an NN by adding imperceptible perturbations to the input [4] which is a problem that has recently been tackled increasingly in a number of different ways such as adversarial training [5] and defensive distillation [6]. Another promising approach is to show that an NN is provably robust against norm-bounded adversarial perturbations [7, 8] while yet another one is to use Lipschitz constants as a robustness measure that indicate the sensitivity of the output to perturbations in the input [9]. Based on this notion of Lipschitz continuity, we propose a framework for training of robust NNs that encourages a small Lipschitz constant by including a regularizer or respectively, a constraint on the NN’s Lipschitz constant.

Trivial Lipschitz bounds of NNs can be determined by the product of the spectral norms of the weights [4] which is used during training in [10]. In [9], the target values are manipulated based on the estimated Lipschitz constant whereas we use the Lipschitz constant as a regularization functional similar to [11]. However, they use local Lipschitz constants whereas we penalize the global one. In [12], Fazlyab et al. propose an interesting new estimation scheme for more accurate upper bounds on the Lipschitz constant than the weights’ spectral norms exploiting the structure of the nonlinear activation functions. Activation functions are gradients of convex potential functions, and hence monotonically increasing functions with bounded slopes, which is used in [12] to state the property of slope-restriction as an incremental quadratic constraint and then formulate a semidefinite program (SDP) that determines an upper bound on the Lipschitz constant. In [12], three variants of the Lipschitz constant estimation framework are proposed trading off accuracy and computational tractability. In this paper, we disprove by counterexample the most accurate approach presented in [12], and we employ the other approaches for training of robust NNs. More specifically, we include the SDP-based Lipschitz bound characterization of [12] in the training procedure via an Alternating Direction Method of Multipliers (ADMM) scheme. We present two versions of the proposed training method, one is a regularizer rendering the Lipschitz constant small and the other one enforces guaranteed upper bounds on the Lipschitz constant during training.

The main contributions of this manuscript are the two training procedures for robust NNs based on the notion of Lipschitz continuity. In addition, we show that the method for Lipschitz constant estimation for NNs that was recently proposed in [12] requires a modification for the least conservative choice of decision variables. This manuscript is organized as follows. In Section II, we introduce Lipschitz constant estimation for NNs based on [12] but disprove their most accurate Lipschitz estimator. In Section III, we present a training procedure with Lipschitz regularization and outline the setup of the optimization problem that is solved using ADMM. Subsequently, we analyze the convergence of the ADMM scheme and finally, introduce a variation of the proposed procedure that allows enforcing Lipschitz bounds on the NN. In Section IV, we provide two examples on which we successfully applied the proposed training procedures. We submitted preliminary ideas of this work as a late breaking result to the 21st IFAC World Congress [13].

II Lipschitz constant estimation

In this section, we briefly introduce robustness in the context of neural networks. We then state a method to estimate the Lipschitz constant of an NN based on [12] and finally argue why one of the methods for Lipschitz constant estimation proposed in [12] is incorrect.

II-A Robustness of NNs

A robust NN does not change its prediction if the input is perturbed imperceptibly. To quantify robustness, a suitable robustness measure has to be defined. One definition is that perturbations from a norm-bounded uncertainty set may not change the prediction. Alternatively, using probabilistic approaches, random perturbations do not change the prediction with a certain probability. A third alternative is the Lipschitz constant, a sensitivity measure. A function $f:\mathbb{R}^n\to\mathbb{R}^m$ is globally Lipschitz continuous if there exists an $L \ge 0$ such that

$$\|f(x)-f(y)\|_2 \le L\,\|x-y\|_2 \quad \forall x,y\in\mathbb{R}^n. \tag{1}$$

The smallest $L$ for which (1) holds is the Lipschitz constant $L^*$. If the input changes from $x$ to $y$, the Lipschitz constant gives an upper bound on how much the output changes. Hence, a low Lipschitz constant indicates low sensitivity which is equivalent to high robustness. In this work, we aim to minimize the Lipschitz constant or bound the Lipschitz constant from above during training, respectively, to increase the robustness of the resulting NN.

Regularization, i.e., adding a penalty term to the objective function of the NN, is a prevalent measure in NN training in order to prevent overfitting. L2 regularization penalizes the squared norm of the weights and L1 regularization the weights’ L1 norm. Bounding the weights counteracts the fit of sudden peaks and outliers, promotes better generalization, and smoothens the resulting NN [14]. Furthermore, the product of the spectral norms of the weights provides a trivial bound on an NN’s Lipschitz constant and consequently, L1 and L2 regularization improve the robustness of an NN in the sense of Lipschitz continuity. In this paper, we penalize a more accurate estimate of the Lipschitz constant, leading to a more direct and potentially more effective approach.
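To make the trivial bound concrete, the following minimal sketch compares the product of spectral norms with a sampled lower bound on the Lipschitz constant. The network size, the random weights, the tanh activation, and all function names are assumptions made for illustration only; they are not taken from the manuscript.

```python
import numpy as np

def forward(x, W0, b0, W1, b1):
    """One-hidden-layer tanh network: f(x) = W1 tanh(W0 x + b0) + b1."""
    return W1 @ np.tanh(W0 @ x + b0) + b1

def trivial_lipschitz_bound(W0, W1):
    """Product of spectral norms; valid here because tanh has slope at most 1."""
    return np.linalg.norm(W0, 2) * np.linalg.norm(W1, 2)

def sampled_lower_bound(W0, b0, W1, b1, n_pairs=2000, seed=0):
    """Largest observed ratio ||f(x) - f(y)|| / ||x - y|| over random nearby pairs."""
    rng = np.random.default_rng(seed)
    n_in = W0.shape[1]
    best = 0.0
    for _ in range(n_pairs):
        x = rng.normal(size=n_in)
        y = x + 1e-3 * rng.normal(size=n_in)
        num = np.linalg.norm(forward(x, W0, b0, W1, b1) - forward(y, W0, b0, W1, b1))
        best = max(best, num / np.linalg.norm(x - y))
    return best

# Hypothetical random network with 2 inputs, 8 hidden neurons, 1 output.
rng = np.random.default_rng(1)
W0, b0 = rng.normal(size=(8, 2)), rng.normal(size=8)
W1, b1 = rng.normal(size=(1, 8)), rng.normal(size=1)

upper = trivial_lipschitz_bound(W0, W1)
lower = sampled_lower_bound(W0, b0, W1, b1)
print(f"sampled lower bound {lower:.3f} <= L* <= trivial upper bound {upper:.3f}")
```

The gap between the two numbers is the room that a more accurate estimate of the Lipschitz constant can exploit.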

II-B Lipschitz constant estimation

In the following, we outline a method to estimate bounds on the Lipschitz constant of single-hidden layer NNs exploiting the slope-restricted structure of the nonlinear activation functions, as it was shown in [12]. This method named LipSDP yields more accurate bounds than trivial bounds, i.e., the product of the spectral norms of the weights.

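Continuous nonlinear activation functions $\varphi:\mathbb{R}\to\mathbb{R}$ can be interpreted as gradients of continuously differentiable, convex potential functions and consequently, they are inherently slope-restricted, meaning that their slope is at least $\alpha$ and at most $\beta$,

$$\alpha \le \frac{\varphi(x)-\varphi(y)}{x-y} \le \beta \quad \forall x,y\in\mathbb{R},\ x\neq y. \tag{2}$$

Consider the vector of activation functions $\phi(x) = [\varphi(x_1)\ \cdots\ \varphi(x_n)]^\top$ and a fully-connected feed-forward NN with one hidden layer described by the following equation

$$f(x) = W^1 \phi(W^0 x + b^0) + b^1, \tag{3}$$

where $W^i$ are the weight matrices and $b^i$ are the biases of the $i$-th layers, $i=0,1$, with $n_x$, $n$, and $n_y$ being the dimension of the input, the neurons in the hidden layer, and the output, respectively. For every neuron, the slope-restriction property (2) can be written as an incremental quadratic constraint. Weighting these quadratic constraints with $\lambda_i \ge 0$, or respectively, the whole network with a diagonal weighting matrix $T\in\mathcal{D}_n := \{T=\mathrm{diag}(\lambda_1,\dots,\lambda_n) : \lambda_i\ge 0\}$ results in an incremental quadratic constraint for the overall network:

$$\begin{bmatrix} x-y\\ \phi(x)-\phi(y)\end{bmatrix}^\top \begin{bmatrix} -2\alpha\beta T & (\alpha+\beta)T\\ (\alpha+\beta)T & -2T\end{bmatrix}\begin{bmatrix} x-y\\ \phi(x)-\phi(y)\end{bmatrix} \ge 0 \tag{4}$$

for all $T\in\mathcal{D}_n$, $x,y\in\mathbb{R}^n$. We now formulate an SDP based on [12] that exploits (4) for the estimation of an upper bound on the Lipschitz constant of the map characterized by the underlying NN.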

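Theorem 1.

For a fully-connected feed-forward neural network with one hidden layer (3) and slope-restricted activation functions in the sector $[\alpha,\beta]$, suppose there exist $L^2>0$ and a diagonal matrix $T$ with nonnegative entries such that

$$P(L^2,T) := \begin{bmatrix} -2\alpha\beta\, W^{0\top} T W^0 - L^2 I & (\alpha+\beta)\, W^{0\top} T\\ (\alpha+\beta)\, T W^0 & -2T + W^{1\top} W^1 \end{bmatrix} \preceq 0. \tag{5}$$

Then, (3) is globally Lipschitz continuous with Lipschitz bound $L$.

The proof directly follows from A.4 in [15]. The smallest value for the Lipschitz upper bound is determined by solving the SDP

$$\min_{L^2,\,T}\ L^2 \quad \text{s.t.}\quad P(L^2,T)\preceq 0,\quad T \text{ diagonal},\ T\succeq 0, \tag{6}$$

where $L^2$ and $T$ serve as decision variables. The resulting Lipschitz bound then is a robustness certificate.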
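The certificate can be checked numerically without an SDP solver when the weighting matrix is fixed. The sketch below assumes the block form of the LMI used in LipSDP [12] with slope bounds 0 and 1, a hypothetical random network, and the simple (generally suboptimal) choice of a scaled identity for the weighting matrix; the helper names are illustrative. For a fixed weighting matrix, the smallest feasible squared bound follows from a Schur complement with respect to the lower-right block.

```python
import numpy as np

def lmi_matrix(W0, W1, T, L2, alpha=0.0, beta=1.0):
    """Block matrix of the LipSDP condition; it must be negative semidefinite."""
    n_in = W0.shape[1]
    top_left = -2.0 * alpha * beta * W0.T @ T @ W0 - L2 * np.eye(n_in)
    off_diag = (alpha + beta) * W0.T @ T
    bottom_right = -2.0 * T + W1.T @ W1
    return np.block([[top_left, off_diag], [off_diag.T, bottom_right]])

def smallest_L2_for_fixed_T(W0, W1, T):
    """Smallest feasible L^2 for alpha=0, beta=1 and a fixed T with 2T - W1^T W1 > 0."""
    S = 2.0 * T - W1.T @ W1
    if np.linalg.eigvalsh(S).min() <= 0:
        raise ValueError("T is too small: 2T - W1^T W1 must be positive definite")
    # Schur complement: L^2 >= lambda_max(W0^T T S^{-1} T W0)
    M = W0.T @ T @ np.linalg.solve(S, T @ W0)
    return np.linalg.eigvalsh((M + M.T) / 2).max()

# Hypothetical network: 2 inputs, 6 hidden neurons, 1 output.
rng = np.random.default_rng(0)
W0 = rng.normal(size=(6, 2))
W1 = rng.normal(size=(1, 6))

T = np.linalg.norm(W1, 2) ** 2 * np.eye(6)   # assumed, not optimized
L2 = smallest_L2_for_fixed_T(W0, W1, T)
certified = np.sqrt(L2)
trivial = np.linalg.norm(W0, 2) * np.linalg.norm(W1, 2)
max_eig = np.linalg.eigvalsh(lmi_matrix(W0, W1, T, L2)).max()

print(f"certified bound {certified:.3f}, trivial bound {trivial:.3f}")
print(f"largest LMI eigenvalue {max_eig:.2e} (should be <= 0 up to rounding)")
```

With this particular scaled identity the certified bound cannot exceed the product of spectral norms; optimizing over the diagonal entries with an SDP solver, as the manuscript does, can only tighten it further.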

II-C Counterexample for LipSDP with coupling

In [12], three versions of the method LipSDP for Lipschitz constant estimation are stated that reconcile accuracy of the Lipschitz bound and computational complexity of the method by adjusting the number of the decision variables in the SDP. In this section, we give an illustrative counterexample to show that Theorem 1 in [12] does not hold for the most accurate variant of LipSDP.

Theorem 1 in [12] resembles Theorem 1 of this manuscript with the difference that in [12] instead of a set of symmetric coupling matrices

is introduced and Theorem 1 is stated with replaced by . In the following, we give a minimal counterexample to show that, as suggested in this manuscript, a further restriction of the class of is required. For that purpose, consider an NN with one hidden layer of size , input and output size , activation function and weights and biases

The resulting NN provides a good fit for the cosine function on with a maximum deviation in the output of 0.0843. Therefore, the maximum slope of the cosine gives a good approximation of the Lipschitz constant of this NN, which is at , such that . The linear matrix inequality (LMI) (5) is feasible for arbitrarily small and

such that and . An arbitrarily small is obviously no upper bound on the Lipschitz constant of a cosine like function.

To understand why in this case the LipSDP method fails to provide an upper bound on the Lipschitz constant, we look into the coupling of the neurons accounted for by . Theorem 1 in [12] builds on the assumption that (4) holds for all (Lemma 1 in [12]). In the following, we show by counterexample that Lemma 1 in [12] is incorrect, i.e., that for a given slope-restricted function there are and that violate (4). We choose , as the function that is slope-restricted in the sector , and . Let and . Then evaluating (4) yields

which disproves Lemma 1 of [12] and consequently, Theorem 1 of [12] needs to be modified. As we argued in the previous section, Eq. (4) holds for the more conservative choice of diagonal matrices , such that the proof of Theorem 1 of this manuscript directly follows from the proof of Theorem 1 in [12] provided in A.4 of [15], the extended version of [12].
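The failure of the incremental quadratic constraint under coupling is easy to reproduce numerically. The sketch below does not use the values from the manuscript: the tanh activation, the points `x` and `y`, and the coupling matrix built from the difference of two unit vectors are an illustrative choice made here, with slope bounds 0 and 1 assumed.

```python
import numpy as np

def incremental_qc(x, y, T, alpha=0.0, beta=1.0, phi=np.tanh):
    """Value of the quadratic form of the incremental constraint for weighting T."""
    dx = x - y
    dphi = phi(x) - phi(y)
    v = np.concatenate([dx, dphi])
    Q = np.block([[-2 * alpha * beta * T, (alpha + beta) * T],
                  [(alpha + beta) * T, -2 * T]])
    return v @ Q @ v

# Illustrative points: neuron 1 moves through the linear region of tanh,
# neuron 2 moves the same distance inside the saturated region.
x = np.array([0.5, 3.0])
y = np.array([-0.5, 2.0])

T_coupled = np.array([[1.0, -1.0], [-1.0, 1.0]])   # (e1 - e2)(e1 - e2)^T
T_diag = np.eye(2)

coupled_value = incremental_qc(x, y, T_coupled)
diag_value = incremental_qc(x, y, T_diag)

# Random spot check: diagonal nonnegative weightings never violate the constraint.
rng = np.random.default_rng(0)
worst_diag = min(
    incremental_qc(3 * rng.normal(size=2), 3 * rng.normal(size=2),
                   np.diag(rng.uniform(0.0, 2.0, size=2)))
    for _ in range(1000)
)

print(f"coupled T: {coupled_value:.4f} (negative, so the constraint is violated)")
print(f"diagonal T: {diag_value:.4f}, worst random diagonal case: {worst_diag:.2e}")
```

The coupled weighting compares increments of two different neurons, which slope restriction of each neuron alone does not control; the diagonal weighting only sums per-neuron terms, each of which is nonnegative.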

III Training robust NNs

In Section II-B, we stated a method that provides certificates on an NN’s Lipschitz constant. In this section, we employ these certificates to design a training procedure for robust NNs. We present two versions of it: the first minimizes the upper bound on the Lipschitz constant, and the second enforces a desired bound on the Lipschitz constant.

III-A Weights as decision variables

Eq. (6) can be used to assess an NN’s robustness after training, whereas in this manuscript, to promote robustness during training, we use Eq. (6) to update the weights while minimizing the bound on the Lipschitz constant. Applying the Schur complement to (5) for , the LMI can be linearized in the weight matrices , yielding an LMI that is linear in and , for fixed :
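The Schur complement step can be sketched as follows. The notation is assumed here ($W^0$, $W^1$ for the weights, $T$ for the fixed diagonal weighting matrix, $L^2$ for the squared Lipschitz bound, lower slope bound set to zero and upper slope bound $\beta$), so the exact form printed in the manuscript may differ. The quadratic term $W^{1\top} W^1$ is moved into an additional block row and column, which leaves an inequality that is linear in $W^0$, $W^1$, and $L^2$ once $T$ is fixed:

```latex
\begin{bmatrix}
  -L^2 I & \beta\, W^{0\top} T \\
  \beta\, T W^0 & -2T + W^{1\top} W^1
\end{bmatrix} \preceq 0
\quad\Longleftrightarrow\quad
\begin{bmatrix}
  -L^2 I & \beta\, W^{0\top} T & 0 \\
  \beta\, T W^0 & -2T & W^{1\top} \\
  0 & W^1 & -I
\end{bmatrix} \preceq 0 .
```

The equivalence holds because the lower-right block $-I$ is negative definite, and its Schur complement in the right-hand matrix is exactly the left-hand matrix.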

Remark 2.

For , the underlying constraint is not convex in . Consequently, we cannot state LMI constraints for and instead set . This is a conservative choice for some activation functions, yet the tight lower bound for the most common ones. E.g. for and the tight bounds are and and the sigmoid function is slope-restricted with and .
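Slope bounds of common activation functions can be checked numerically. The sketch below estimates the smallest and largest finite-difference slope on a grid; the three activations, the grid range, and the helper name are choices made for this illustration. For tanh and ReLU the slopes lie in [0, 1], and for the sigmoid in [0, 1/4], so a lower slope bound of zero is tight for all three.

```python
import numpy as np

def slope_range(phi, lo=-10.0, hi=10.0, n=200001):
    """Smallest and largest finite-difference slope of phi on a uniform grid."""
    x = np.linspace(lo, hi, n)
    slopes = np.diff(phi(x)) / np.diff(x)
    return slopes.min(), slopes.max()

activations = {
    "tanh": np.tanh,
    "relu": lambda x: np.maximum(x, 0.0),
    "sigmoid": lambda x: 1.0 / (1.0 + np.exp(-x)),
}

ranges = {name: slope_range(phi) for name, phi in activations.items()}
for name, (a, b) in ranges.items():
    print(f"{name:8s} slope in [{a:.4f}, {b:.4f}]")
```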

While the Lipschitz constant estimation scheme in [12] optimizes over , throughout the manuscript, we choose to be a fixed matrix. This introduces conservatism into the framework and necessitates a suitable choice for in order to keep the introduced conservatism to a minimum, e.g., the matrix may be determined from the Lipschitz constant estimation outlined in Section II-B on the vanilla NN or the L2 regularized NN of the same problem. Based on (7), we can now set up an SDP in and


that can be used to adjust the weights in order to minimize the Lipschitz bound.

Note that solving Eq. (8) independently leads to the trivial solution and weights that do not provide a satisfactory fit for the corresponding input-output data and hence, the trade-off between accuracy of the NN and its robustness [16] has to be taken into account in the problem formulation. We achieve this by including the Lipschitz constant in the training objective and minimizing both, the NN’s loss and its upper bound on the Lipschitz constant, using the Alternating Direction Method of Multipliers that allows for optimization of two objectives.

III-B Lipschitz regularization

In general, NNs are trained on input-output data with the objective of minimizing a predefined loss that we assume to be continuous throughout this manuscript, e.g. the mean squared error, cross-entropy or hinge loss. We propose to not only minimize the NN’s loss but also its Lipschitz constant. This yields an optimization problem with two separate objectives that can be solved conveniently using ADMM.

ADMM is an algorithm that solves optimization problems by splitting them into smaller subproblems that are easier to handle individually [17]. In order to apply the ADMM algorithm, the objective must be separable. The resulting subobjectives are then defined on uncoupled convex sets and subject to linear equality constraints. The ADMM scheme solves the resulting optimization problem through independent minimization steps on the augmented Lagrangian of the optimization problem and a dual update step. The objectives at hand, i.e., the NN’s loss and the Lipschitz bound, are indeed separable and defined on uncoupled convex sets. However, the problems are not completely independent and need to be connected through a linear constraint that requires the introduction of additional variables of equal size as . The loss of the NN is an explicit function of the weights and the Lipschitz bound depends on through the LMI (7), yielding the following optimization problem:


where is a weighting parameter adjusting the trade-off between accuracy and robustness. Applying the ADMM scheme to problem (9), results in the augmented Lagrangian function

with Lagrange multipliers , and the penalty parameter . The optimum for (9) is then determined via the following iterative ADMM update steps:


For training of robust NNs, we carry out the corresponding updates consecutively until convergence. The loss function is optimized analytically using backpropagation (Eq. (10a)), whereas the Lipschitz update step (10b) requires solving an SDP in every iteration and thereby adds computation compared to training of a vanilla NN.
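For orientation, a standard two-block ADMM scheme of this kind can be written out as below. All symbols are assumptions made for this sketch and may differ from the notation of the manuscript: $\mathcal{L}(W)$ is the training loss, $\bar W$ the copy of the weights that enters the LMI, $M(L^2,\bar W)\preceq 0$ the LMI constraint, $\mu$ the trade-off weight, $Y$ the Lagrange multipliers, and $\rho$ the penalty parameter.

```latex
\begin{aligned}
&\min_{L^2,\,W,\,\bar W}\ \mathcal{L}(W) + \mu L^2
  \quad\text{s.t.}\quad M(L^2,\bar W)\preceq 0,\quad W=\bar W, \\[4pt]
&\mathcal{L}_\rho(L^2,W,\bar W,Y) = \mathcal{L}(W) + \mu L^2
  + \operatorname{tr}\!\big(Y^\top (W-\bar W)\big)
  + \tfrac{\rho}{2}\,\|W-\bar W\|_F^2, \\[4pt]
&W_{k+1} = \arg\min_{W}\ \mathcal{L}_\rho(L^2_k,\, W,\, \bar W_k,\, Y_k), \\
&(L^2_{k+1},\bar W_{k+1}) = \arg\min_{L^2,\,\bar W:\ M(L^2,\bar W)\preceq 0}\
  \mathcal{L}_\rho(L^2,\, W_{k+1},\, \bar W,\, Y_k), \\
&Y_{k+1} = Y_k + \rho\,(W_{k+1}-\bar W_{k+1}).
\end{aligned}
```

The first update only involves the loss and a quadratic penalty, so it can be handled by gradient-based training; the second update has a linear-plus-quadratic objective under an LMI constraint, which is where the SDP solver is needed.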

Remark 3.

It is possible to extend the framework and optimize over , , and at the same time which requires a second LMI constraint in (8) and an additional update step, resulting in a multi-block ADMM scheme. This reduces conservatism but increases computation time.

Remark 4.

It is also possible to extend the ADMM scheme outlined above to training of robust multi-layer NNs, using the corresponding Lipschitz bounds for multi-layer NNs in [12]. Similar to the single-layer case, the Lipschitz bound for -layer NNs is convex in the weights and for .

III-C Convergence

In the following, we apply a general convergence result for ADMM (§ 3.2 and appendix A in [17]) to (9) in order to prove that the proposed training procedure for robust NNs converges under certain convexity assumptions. Feasibility of the LMI (7) can be included into the optimization objective using the indicator function

For convergence considerations, we then look at the unaugmented Lagrangian


and assume that is strictly convex. As we will comment at the end of this section, this assumption typically does not hold globally.

Assumption 5.

The NN’s loss is strictly convex.

Lemma 6.

Let Assumption 5 hold. If there exists an , such that the LMI (7) is strictly feasible (Slater condition), i.e.,

then has a saddle point.


If the Slater condition is fulfilled, then strong duality holds for and strong duality indicates that has a saddle point [18]. ∎

Theorem 7.

Let Assumption 5 hold. Then the ADMM scheme (10) converges to the global minimum of (9).


Boyd et al. show convergence for ADMM if (i) both objectives are convex, closed, and proper functions and if (ii) the unaugmented Lagrangian has a saddle point [17]. In the following, we show that (9) fulfills these conditions. For , , and any , the LMI (7) is strictly feasible. According to Lemma 6, this fact and Assumption 5 imply that the unaugmented Lagrangian (11) has a saddle point. The input-output map of the NN (3) is continuous as it consists of the sum of weighted activation functions that are continuous by design, as well as the loss function. Also, as a measure of errors of the predictions to the true labels, the loss is lower bounded by and therefore, it never attains the value . Together with the fact that its domain is nonempty, we conclude that the loss function is proper. Furthermore, continuity and convexity imply that the loss function is closed. The subobjective is obviously convex, continuous in and defined on a set (7) that is convex in and . Also, for while the LMI (7) remains feasible such that the function is closed. As a quadratic function, it is lower bounded by and therefore proper. The indicator function of a convex set is also convex. It is closed as each sublevel set of is closed and by definition never attains which renders it proper. ∎

In general, NNs are highly nonlinear and non-convex and so is the loss function , i.e., Assumption 5 is typically not satisfied. Yet in training of vanilla NNs the loss converges reliably to a local minimum for an adequate choice of hyperparameters. Hence, within a small neighborhood of the minimum, the convexity assumption is reasonable. Moreover, the proposed ADMM scheme that includes an additional convex objective to the usual training procedure does not add complexity in the form of non-convexity to the optimization problem. A suitable initialization of and is recommended, not only for faster convergence but also to find a better solution. Nevertheless, only convergence to a local minimum can be expected.

III-D Enforcing Lipschitz bounds

In Section III-B, we suggested minimizing an upper bound on the Lipschitz constant of an NN. Using the proposed ADMM framework, it is also possible to enforce a desired upper bound on the Lipschitz constant during training of an NN. In that case, the Lipschitz constant is not minimized but instead set to a desired value . Accordingly, does not appear in the optimization objective of this setup


where and serve as decision variables. For enforcement of Lipschitz bounds, we apply the ADMM algorithm as in (10) to (12) instead of (9).

Theorem 8.

When training a fully-connected NN with one hidden layer (3) by executing the ADMM scheme (10) for (12), the upper bound on the Lipschitz constant is guaranteed to be with weights at all times during training.

Theorem 8 follows from the fact that, by design, the bound on the Lipschitz constant is enforced in every iteration of the ADMM scheme, more specifically, in every Lipschitz update step by the LMI constraint on the weights . Convergence can be shown exactly as in Theorem 7, where again the loss of the NN is not convex in general; nevertheless, Assumption 5 allows for insights into convergence to local minima.

Corollary 9.

Let Assumption 5 hold. Then the ADMM scheme (10) applied to (12) converges to the global minimum of (12).

The training procedure based on (12) allows choosing the value of the Lipschitz bound and training NNs with Lipschitz guarantees. This way, a desired degree of robustness can be directly enforced. However, the choice of such a constraint on is always connected to the trade-off between accuracy and robustness, as the fit generally deteriorates when decreasing the Lipschitz constant constraint. In addition, it is helpful to initialize the weight parameters appropriately, which not only accelerates training but may also facilitate a better fit.
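The structure of such an enforcement scheme can be illustrated on a deliberately simplified stand-in problem. The sketch below is not the SDP-based update of the manuscript: the NN is replaced by a linear map f(x) = Wx, whose Lipschitz constant is exactly the spectral norm of W, and the LMI-constrained step is replaced by a projection onto a spectral-norm ball via singular value clipping. The data, the bound, the penalty parameter, and all names are hypothetical.

```python
import numpy as np

def project_spectral_ball(W, bound):
    """Euclidean projection onto {W : ||W||_2 <= bound} by clipping singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.minimum(s, bound)) @ Vt

def admm_bounded_linear_fit(X, Y, bound, rho=1.0, n_iter=200):
    """ADMM for min_W (1/2N)||X W^T - Y||_F^2 subject to ||W||_2 <= bound."""
    N, d = X.shape
    m = Y.shape[1]
    W = np.zeros((m, d))
    W_bar = np.zeros((m, d))
    U = np.zeros((m, d))                    # scaled dual variable
    G = X.T @ X / N + rho * np.eye(d)       # Hessian of the loss subproblem
    C = Y.T @ X / N
    history = []
    for _ in range(n_iter):
        # Loss update: closed-form minimizer of loss + (rho/2)||W - W_bar + U||_F^2.
        W = np.linalg.solve(G, (C + rho * (W_bar - U)).T).T
        # Constraint update: the projection enforces the bound in every iteration.
        W_bar = project_spectral_ball(W + U, bound)
        # Dual update.
        U = U + W - W_bar
        history.append(np.linalg.norm(W_bar, 2))
    return W, W_bar, history

# Hypothetical data from a linear map whose spectral norm exceeds the bound.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
W_true = np.array([[2.0, -1.0, 0.5], [0.0, 1.5, -2.0]])
Y = X @ W_true.T + 0.05 * rng.normal(size=(200, 2))

bound = 1.0
W, W_bar, history = admm_bounded_linear_fit(X, Y, bound)
print(f"largest Lipschitz constant over all iterations: {max(history):.6f}")
print(f"final gap between W and its constrained copy: {np.linalg.norm(W - W_bar):.2e}")
```

The constrained copy satisfies the bound after every iteration, which mirrors the guarantee of Theorem 8, while the unconstrained weights are pulled toward it through the penalty and dual terms.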

IV Simulation Results

In this section, we illustrate the benefits of the two presented variants for training of robust NNs on two toy examples. For that purpose, we design an NN for regression problems with input and output . We use a feed-forward NN with one hidden layer, neurons in the hidden layer, the activation function that is slope-restricted with , , and the mean squared error (MSE) as the loss function. We train three NNs, a vanilla NN, an NN with L2 regularization (L2-NN) for benchmarking and finally the Lipschitz regularized NN (Lipschitz-NN), wherein the NN loss update step (10a) is solved using stochastic gradient descent and the SDP is solved using numerical SDP solvers [19, 20].

We first fit a classifying function that takes either the value zero or one, compare Fig. 1 for a plot of the function. We apply the Lipschitz regularization scheme presented in Section III-B with and , wherein we initialize the Lipschitz-NN from the L2 regularized NN with penalty parameter . The evaluation of the resulting MSE losses and bounds on the Lipschitz constant, summarized in Table I, shows that the loss of the nominal NN that was trained without any regularizer is the smallest. Naturally, there is a trade-off between accuracy and robustness. Yet, the Lipschitz bound of 17.65 is rather high and can be restrained with a regularizer. Comparing the two regularizers, the robustness certificate of the Lipschitz-NN is smaller than the one of the L2-NN whereas the MSE loss is similar.

Fig. 1: Graphs of nominal NN, L2-NN, and Lipschitz-NN with Lipschitz regularization.
              Training MSE   Lipschitz bound
Nominal NN    0.0161         17.65
L2-NN         0.0457         7.815
Lipschitz-NN  0.0437         6.378
TABLE I: MSE losses and Lipschitz upper bounds

To illustrate the enforcement of Lipschitz bounds as proposed in Section III-D, we train NNs to fit data that is taken from a noisy sinusoidal function. We set and the Lipschitz bound to , the same value as the upper bound on the Lipschitz constant obtained with L2 regularization, and initialize the parameters from the L2-NN with . The three resulting NNs produce the graphs shown in Fig. 2. Note that the nominal NN is overfitting the sine for which is reflected in the high Lipschitz bound of . The map obtained from the L2-NN notably provides an inferior fit, especially in the range , whereas the Lipschitz-NN provides a good fit while its Lipschitz bound is kept lower than the one of the L2-NN such that similar robustness is maintained. The Lipschitz-NN exceeds the other NNs in robustness () while performing only marginally worse than the nominal NN, cf. the MSE loss on training and testing data, and on the underlying sine curve in Table II. The results show that Lipschitz regularization and enforcement of Lipschitz bounds can be used to effectively train robust NNs while trading off robustness and accuracy. Code to reproduce the examples can be found at

Fig. 2: Graphs of nominal NN, L2-NN, and Lipschitz-NN with enforced Lipschitz bounds.
              Training MSE   Testing MSE   Sine MSE   Lipschitz bound
Nominal NN    0.0590         0.0990        0.0106     9.747
L2-NN         0.1159         0.1521        0.0681     2.403
Lipschitz-NN  0.0687         0.1009        0.0128     2.316
TABLE II: MSE losses and Lipschitz upper bounds

V Conclusion

We proposed a framework for training of single-hidden layer NNs that encourages robustness, by both considering Lipschitz regularization and by enforcing Lipschitz bounds during training. The underlying SDP [12] estimates the upper bound on the Lipschitz constant more accurately than traditional methods as it exploits the fact that activation functions are slope-restricted. We designed an optimization scheme based on this SDP that trains an NN to fit input-output data and at the same time increases its robustness in terms of Lipschitz continuity. We used ADMM to solve the underlying optimization problem and to therein conveniently incorporate the trade-off between accuracy and robustness. In addition, we presented a variation of the framework that allows for bounding the Lipschitz constant by a desired value, i.e., training NNs with robustness guarantees. We successfully tested our method on two toy examples where we benchmarked it with L2 regularization.

Next steps include the extension of the method to multi-layer NNs, reducing conservatism by not only optimizing over the weights but also over the weighting matrix , and the implementation on more conclusive higher dimensional problems. In addition, for benchmarking purposes, we plan to compare the proposed methods to other training procedures that improve robustness.