1 Introduction
Nowadays, image classifiers can achieve superhuman performance, but most of them are not robust against small, imperceptible, and adversarially chosen perturbations of their inputs [8, 55]. This sensitivity of neural networks to input perturbations has become a major issue, and over the past decade, research has played out as a cat-and-mouse game between the development of ever more powerful attacks [22, 36, 11, 16] and the design of empirical defense mechanisms [41, 43, 14]. Ending this game calls for certified adversarial robustness [47, 63]. While recent work has devised defenses with theoretical guarantees against adversarial perturbations, these defenses share the same limitations, namely trade-offs between expressivity and robustness, and between scalability and accuracy.

Many papers have explored the interpretation of neural networks as a parameter estimation problem for nonlinear dynamical systems
[25, 18, 40]. Reconsidering the ResNet architecture as an Euler discretization of a continuous dynamical system gave rise to the trend around Neural Ordinary Differential Equations [12]. For instance, in the seminal work of [25], the continuous formulation offers more flexibility to investigate the stability of neural networks during inference, knowing that the discretization will then be implemented by the architecture design. The notion of stability, in this context, quantifies how a small perturbation of the initial value impacts the trajectories of the dynamical system.

Since robustness to adversarial attacks is closely related to this notion of dynamical stability, we draw inspiration from this continuous and dynamical interpretation. This enables us to readily introduce convex potentials in the design of the gradient flow. We show that this choice of parametrization yields by-design Lipschitz neural networks, which in turn improves adversarial robustness. At the very core of our approach lies a new Lipschitz nonlinear operator, which we call the Stable Block layer, that allows us to adapt convex potential flows to the discretized case. These blocks enjoy the desirable property of stabilizing the training of the neural network by controlling the gradient norm, hence overcoming the exploding gradient issue. We experimentally demonstrate our approach by training very deep and wide neural networks on CIFAR-10 [35], reaching state-of-the-art results in terms of both standard and certified under-attack accuracy.
Outline of the paper.
In Section 2, we recall existing results on Lipschitz networks and Residual Networks. In Section 3, we prove that Residual Networks with convex potentials are Lipschitz. Section 4 presents how to build a Lipschitz residual neural network from the theoretical insights of the previous section. Finally, in Section 5, we validate our method with a comprehensive set of experiments exhibiting the stability and scalability of our approach, as well as state-of-the-art results both in terms of accuracy and robustness.
2 Background and Related Work
We consider a classification task from an input space $\mathcal{X} \subset \mathbb{R}^n$ to a label space $\mathcal{Y} = \{1, \dots, K\}$. To this end, we aim at learning a classifier $f : \mathcal{X} \to \mathbb{R}^K$ such that the predicted label for an input $x$ is $\arg\max_k f_k(x)$. For a given input-label pair $(x, y)$, we say $x$ is correctly classified if $\arg\max_k f_k(x) = y$.
In this paper, we aim at devising defense mechanisms against adversarial attacks, i.e., given a ball of radius $\varepsilon$ around an input $x$ with label $y$, an adversarial attack is a perturbation $\tau$ s.t. $\|\tau\| \le \varepsilon$ and $\arg\max_k f_k(x + \tau) \neq y$.
2.1 Lipschitz property of Neural Networks
The Lipschitz constant has seen a growing interest in the last few years in the field of deep learning
[61, 20, 15, 7]. Indeed, numerous results have shown that neural networks with a small Lipschitz constant exhibit better generalization [5], higher robustness to adversarial attacks [55, 19, 57], and better stability [64, 56]. Formally, we define the Lipschitz constant with respect to a norm $\|\cdot\|$ of a Lipschitz continuous function $f$ as follows:
$$\mathrm{Lip}(f) = \sup_{x \neq y} \frac{\|f(x) - f(y)\|}{\|x - y\|}.$$
Note that in this work, we consider the Lipschitz constant with respect to the $\ell_2$ norm, and we denote a Lipschitz continuous function "$L$-Lipschitz" where $L = \mathrm{Lip}(f)$.
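As an illustration of this definition, the following sketch (with a hypothetical `lipschitz_lower_bound` helper, not from the paper) estimates a lower bound on $\mathrm{Lip}(f)$ by sampling pairs of points; for a linear map, the true Lipschitz constant with respect to the $\ell_2$ norm is the spectral norm.

```python
import numpy as np

def lipschitz_lower_bound(f, dim, n_pairs=1000, seed=0):
    """Estimate a lower bound on the Lipschitz constant of f (w.r.t. the l2 norm)
    by sampling random pairs and taking the largest ratio ||f(x)-f(y)|| / ||x-y||."""
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(n_pairs):
        x, y = rng.standard_normal(dim), rng.standard_normal(dim)
        den = np.linalg.norm(x - y)
        if den > 0:
            best = max(best, np.linalg.norm(f(x) - f(y)) / den)
    return best

# A linear map's Lipschitz constant equals its spectral norm (here 3).
A = np.array([[3.0, 0.0], [0.0, 1.0]])
est = lipschitz_lower_bound(lambda v: A @ v, dim=2)
```

Since the supremum is only approached by sampling, `est` is a lower bound that gets close to, but never exceeds, the true constant.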
Intuitively, if a classifier is Lipschitz, one can bound the impact of a given input variation on the output, hence obtaining guarantees on the adversarial robustness. We can formally characterize the robustness of a neural network with respect to its Lipschitz constant with the following proposition:
Proposition 1 ([57]).
Let $f$ be an $L$-Lipschitz continuous classifier for the $\ell_2$ norm. Then, for $\varepsilon > 0$ and for every $x$ with label $y$ such that
$$f_y(x) - \max_{j \neq y} f_j(x) > \sqrt{2}\, L\, \varepsilon,$$
we have, for every $\tau$ such that $\|\tau\|_2 \le \varepsilon$: $\arg\max_k f_k(x + \tau) = y$.
Consequently, to obtain optimal robustness guarantees for neural networks, the margin needs to be large and the Lipschitz constant small.
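This margin-based guarantee directly yields a certification procedure: with a known Lipschitz bound, any perturbation of $\ell_2$ norm below $\mathrm{margin}/(\sqrt{2}\,L)$ cannot change the prediction. A minimal sketch, where the `certified_radius` helper is our own illustration rather than the paper's code:

```python
import numpy as np

def certified_radius(logits, label, lip_const):
    """Certified l2 robustness radius from the margin of an L-Lipschitz classifier:
    if the margin f_y(x) - max_{j != y} f_j(x) exceeds sqrt(2) * L * eps,
    the prediction cannot change within the eps-ball (cf. Proposition 1)."""
    logits = np.asarray(logits, dtype=float)
    others = np.delete(logits, label)
    margin = logits[label] - others.max()
    return max(margin, 0.0) / (np.sqrt(2.0) * lip_const)

# Margin is 4.0 - 1.0 = 3.0, so the certified radius is 3 / sqrt(2).
r = certified_radius([4.0, 1.0, 0.5], label=0, lip_const=1.0)
```

When the margin is non-positive (the point is misclassified or on the boundary), the certified radius is zero.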
2.2 Lipschitz Regularization for Robustness
Based on this theoretical insight, researchers have developed several techniques to regularize and constrain the Lipschitzness of neural networks. However, computing the Lipschitz constant of a neural network has been shown to be NP-hard [61]. Most methods therefore tackle the problem by reducing or constraining the Lipschitz constant at the layer level. For instance, the works of [13, 30, 62] exploit the orthogonality of the weight matrices to build Lipschitz layers. Other approaches [23, 33, 50, 52, 3] proposed to estimate or upper-bound the spectral norm of convolutional and dense layers using, for instance, the power iteration method [21]. While these methods have shown interesting results in terms of accuracy, empirical robustness, and efficiency, they cannot provide provable guarantees since the Lipschitz constant of the trained networks remains unknown or vacuous.
2.3 Certified Adversarial Robustness
Recently, two approaches have been developed to provide certified adversarial robustness. A classifier is said to be certifiably robust if, for any input $x$, one can easily obtain a guarantee that the classifier's prediction is constant within some set around $x$. A sample $x$ is said to be certifiable at level $\varepsilon$ for the classifier $f$ if one can certify that for all $\tau$ s.t. $\|\tau\| \le \varepsilon$, $\arg\max_k f_k(x + \tau) = \arg\max_k f_k(x)$.
The first category relies on randomization [37, 14, 46] and consists in adding isotropic noise to the input during both the training and inference phases. In order to get non-vacuous provable guarantees, such approaches often require querying the network hundreds of times to infer the label of a single image. This computational cost naturally limits the use of these methods in practice.
The second approach directly exploits the Lipschitzness property through the design of built-in Lipschitz layers providing deterministic guarantees. Following this line, one can either normalize the weight matrices by their largest singular values, making the layer 1-Lipschitz, e.g. [65, 42, 19, 2], or project the weight matrices onto the Stiefel manifold [39, 56]. However, these latter methods are computationally expensive because they use iterative algorithms [9, 34], which hinders their use in large-scale settings.

2.4 Residual Networks
To prevent vanishing-gradient issues in neural networks during the training phase [28], [27] proposed the Residual Network (ResNet) architecture. Based on this architecture, several works [25, 18, 40, 12] proposed a "continuous-time" interpretation inspired by dynamical systems as follows:
(1)   $\dot{x}(t) = f_t(x(t)), \qquad x(0) = x_0$
This continuous-time interpretation helps as it allows us to study the stability of the forward propagation through the stability of the associated dynamical system. A dynamical system is said to be stable if two trajectories, starting respectively from an input and a shifted input, remain sufficiently close to each other all along the propagation. This stability property takes on its full meaning in the context of adversarial classification.
It was argued by [25] that when $f_t$ does not depend on $t$ or varies slowly with time (this loose definition of "varying slowly" makes the property difficult to apply), the stability can be characterized by the eigenvalues of the Jacobian matrix $\partial f_t / \partial x$: the dynamical system is stable if the real parts of the eigenvalues of the Jacobian stay negative throughout the propagation. This property, however, only relies on intuition, and this condition might be difficult to verify in practice. In the following, in order to derive stability properties, we study gradient flows and convex potentials, which are subclasses of Residual Networks.

Other works [31, 38] also proposed to enhance adversarial robustness using dynamical systems interpretations of Residual Networks. Both works argue that using particular discretization schemes would make gradient attacks more difficult to compute due to numerical stability. These works did not provide any provable guarantees for such approaches.
3 Lipschitzness via Convex Potentials
Starting from the definition of the ResNet architecture and its associated flow, in this section, (i) we show that using convex potentials allows us to build Lipschitz networks, and (ii) we propose a simple method to parametrize such networks based on Input Convex Neural Networks [1].
3.1 ResNet gradient flows and potentials
Among dynamical systems, an important class is that of gradient flows, which play a central role in our work. Starting from Equation 1, we can define a gradient flow as follows:
Definition 1.
Let $(F_t)_{t \ge 0}$ be a family of differentiable functions on $\mathbb{R}^n$. The ResNet gradient flow associated with $(F_t)$ is defined as:
$$\dot{x}(t) = -\nabla F_t(x(t)), \qquad x(0) = x_0.$$
In this case, $F_t$ is called a potential function. This means that the family of functions $f_t$, with $t \ge 0$, derives from a simpler family of scalar-valued functions via the gradient operation: $f_t = -\nabla F_t$.

The discretized version of the ResNet gradient flow derives naturally from the explicit Euler discretization: $x_{t+1} = x_t - h \nabla F_t(x_t)$. Since deriving from a potential is a restrictive property, one may ask whether such flows can approximate all functions. The next proposition shows that every differentiable function can be expressed as a two-layer discrete ResNet gradient flow.
Proposition 2.
Let $f : \mathbb{R}^n \to \mathbb{R}^n$ be a continuously differentiable function. Then, there exist two differentiable potential functions $F_1$ and $F_2$ such that the discretized ResNet gradient flow $x_{t+1} = x_t - \nabla F_{t+1}(x_t)$ for $t \in \{0, 1\}$, starting from $x_0 = x$, satisfies $x_2 = f(x)$ for all $x$.
Gradient flows might thus be a useful tool to study the Lipschitzness of neural networks. In the next section, we propose to rely on convex potentials to build Lipschitz layers.
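The explicit Euler discretization above can be sketched in a few lines. The quadratic potential below is our own toy choice of convex potential, not one used in the paper:

```python
import numpy as np

def euler_gradient_flow(x0, grad_potentials, step):
    """Explicit Euler discretization of a ResNet gradient flow:
    x_{t+1} = x_t - step * grad F_t(x_t), one potential per layer."""
    x = np.asarray(x0, dtype=float)
    for grad_F in grad_potentials:
        x = x - step * grad_F(x)
    return x

# Toy example with the convex potential F(x) = 0.5 * ||x||^2, so grad F(x) = x.
# Each step then halves the state: x -> x - 0.5 * x = 0.5 * x.
grads = [lambda v: v] * 3
out = euler_gradient_flow([2.0, -1.0], grads, step=0.5)
```

With three such layers, the input is scaled by $0.5^3$, illustrating the contractive behavior that convex potentials induce.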
3.2 Convex Potentials
Following the continuous interpretation, we introduce a subclass of Residual Networks defined as the gradient flows deriving from a family of potential functions $(F_t)$. The next proposition shows that if the potentials are convex, then the gradient flow is Lipschitz with respect to its initial condition, i.e., the input of the network.
Proposition 3.
Let $(F_t)$ be a family of convex differentiable functions. Let $x(\cdot)$ and $y(\cdot)$ be two continuous ResNet gradient flows associated with $(F_t)$, differing in their respective initial points $x(0)$ and $y(0)$. Then, for all $t \ge 0$,
$$\|x(t) - y(t)\|_2 \le \|x(0) - y(0)\|_2.$$
This simple property suggests that if we could build a ResNet with convex potentials, it would be less sensitive to input perturbations and therefore more robust to adversarial examples. However, this property does not hold when we discretize the time steps with the explicit Euler scheme, as implied by the ResNet architecture. One needs an additional smoothness property on the potential functions to generalize it to discretized gradient flows. Recall that a function $F$ is said to be $L$-smooth if it is differentiable and $\nabla F$ is $L$-Lipschitz. We now prove an equivalent property for discretized time steps.
Proposition 4.
Let $(F_t)$ be a family of convex differentiable functions such that for all $t$, $F_t$ is $L$-smooth. Let us define the following discretized ResNet gradient flow using $h$ as a step size:
$$x_{t+1} = x_t - h \nabla F_t(x_t).$$
Consider now two flows $(x_t)$ and $(y_t)$ with initial points $x_0$ and $y_0$ respectively. If $h \le 2/L$, then for all $t$,
$$\|x_{t+1} - y_{t+1}\|_2 \le \|x_t - y_t\|_2.$$
Remark.
A Residual Network defined as in Proposition 4 may not express all Lipschitz functions. The convexity and smoothness assumptions required on the flow could indeed restrict the expressivity of the network. There are strong reasons to believe that convex potential flows cannot define universal approximators of Lipschitz functions. See the Appendix for further discussion.
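The step-size condition of Proposition 4 can be checked numerically on a toy convex quadratic potential (our own illustrative choice, not from the paper): the smoothness constant $L$ is the largest eigenvalue of the quadratic form, and with step size $h = 2/L$ the discrete step is empirically non-expansive.

```python
import numpy as np

def discrete_step(x, A, h):
    # One Euler step of the gradient flow of F(x) = 0.5 * x^T A x (grad F = A x).
    return x - h * (A @ x)

# Convex quadratic potential: A is PSD and its largest eigenvalue is the
# smoothness constant L of the potential.
A = np.diag([4.0, 1.0])
L = 4.0
h = 2.0 / L  # largest step size for which Proposition 4 applies

rng = np.random.default_rng(0)
ratios = []
for _ in range(100):
    x, y = rng.standard_normal(2), rng.standard_normal(2)
    num = np.linalg.norm(discrete_step(x, A, h) - discrete_step(y, A, h))
    ratios.append(num / np.linalg.norm(x - y))
max_ratio = max(ratios)
```

Here the step map is $\mathrm{diag}(-1, 0.5)$, whose spectral norm is exactly 1, so no pair of trajectories moves apart; taking $h > 2/L$ would make the first eigenvalue exceed 1 in magnitude and break non-expansiveness.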
3.3 Parametrizing Convex Potentials
Our previous result (Proposition 4) requires computing the gradient of convex functions. We can leverage one-layer Input Convex Neural Networks [1] to define an efficient parametrization. For any vectors $w_1, \dots, w_m \in \mathbb{R}^n$ and bias terms $b_1, \dots, b_m \in \mathbb{R}$, and for $\psi$ a convex function, the potential defined as
$$F(x) = \sum_{i=1}^{m} \psi(w_i^\top x + b_i)$$
is a convex function of $x$, as the composition of a linear map and a convex function. Its gradient with respect to its input is then:
$$\nabla F(x) = W^\top \sigma(W x + b),$$
where $W$ and $b$ are, respectively, the matrix and the vector obtained by stacking the $w_i$ and the $b_i$, and $\sigma = \psi'$ is applied element-wise. Moreover, assuming $\sigma$ is 1-Lipschitz, we have that $F$ is $\|W\|_2^2$-smooth, where $\|W\|_2$ denotes the spectral norm of $W$, i.e., the greatest singular value of $W$, defined as:
$$\|W\|_2 = \sup_{x \neq 0} \frac{\|W x\|_2}{\|x\|_2}.$$
The reciprocal also holds: if $\sigma$ is a non-decreasing 1-Lipschitz function, then for $g(x) = W^\top \sigma(W x + b)$, there exists a convex $\|W\|_2^2$-smooth function $F$ such that $g = \nabla F$, where $\sigma$ is applied element-wise. The next section shows how this property can be used to implement the building block and the training of such layers.
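The closed-form gradient $\nabla F(x) = W^\top \sigma(Wx + b)$ can be checked against finite differences. In the sketch below we pick, as an illustrative assumption, $\psi(u) = \tfrac{1}{2}\max(u, 0)^2$, so that $\sigma = \psi'$ is the ReLU; the weights are random stand-ins.

```python
import numpy as np

def potential(x, W, b):
    """Convex potential F(x) = sum_i psi(w_i . x + b_i), with the convex
    primitive psi(u) = 0.5 * max(u, 0)^2 so that psi'(u) = relu(u)."""
    pre = W @ x + b
    return 0.5 * np.sum(np.maximum(pre, 0.0) ** 2)

def grad_potential(x, W, b):
    # Closed-form gradient: W^T sigma(W x + b), with sigma = relu here.
    return W.T @ np.maximum(W @ x + b, 0.0)

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 3))
b = rng.standard_normal(4)
x = rng.standard_normal(3)

# Central finite-difference check of the closed-form gradient.
eps = 1e-6
num_grad = np.array([
    (potential(x + eps * e, W, b) - potential(x - eps * e, W, b)) / (2 * eps)
    for e in np.eye(3)
])
```

Since $\psi$ is $C^1$ and piecewise quadratic, the central difference matches the analytic gradient to high precision.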
4 Building Lipschitz Residual Networks
4.1 Stable Blocks
From the previous section, we derive the following Stable Block layer:
(2)   $z_{t+1} = z_t - \frac{2}{\|W_t\|_2^2}\, W_t^\top \sigma(W_t z_t + b_t)$
Written in matrix form, this layer can be implemented with any linear operation $W$. In the context of image classification, it is beneficial to use convolutions (for instance, one can leverage the Conv2D and Conv2D_transpose functions of the PyTorch framework [45]) instead of generic linear transforms represented by a dense matrix.
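As a hedged illustration, here is a dense (non-convolutional) sketch of the Stable Block of Equation 2, together with an empirical check that it is non-expansive; the ReLU activation and the random dense weight matrix are our own illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def stable_block(z, W, b):
    """Dense Stable Block (cf. Equation 2): z - (2 / ||W||_2^2) W^T relu(W z + b).
    The factor 2 / ||W||_2^2 plays the role of the step size h <= 2/L."""
    spec = np.linalg.norm(W, ord=2)  # spectral norm (largest singular value)
    return z - (2.0 / spec ** 2) * (W.T @ np.maximum(W @ z + b, 0.0))

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 5))
b = rng.standard_normal(8)

# Empirical 1-Lipschitz check on random pairs of inputs.
expansions = []
for _ in range(200):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    d_out = np.linalg.norm(stable_block(x, W, b) - stable_block(y, W, b))
    expansions.append(d_out / np.linalg.norm(x - y))
```

No sampled pair of inputs moves apart, consistent with the 1-Lipschitz guarantee of the layer.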
Residual Networks [27] are also composed of other types of layers which increase or decrease the dimensionality of the flow. Typically, in a classical setting, the number of input channels is gradually increased, while the size of the image is reduced with pooling layers. In order to build a Lipschitz Residual Network, all operations need to be properly scaled or normalized in order to maintain the overall Lipschitz constant.
First, to increase the number of channels, a concatenation of several stable blocks can easily be performed. Recall that if $f_1, \dots, f_k$ are Lipschitz functions, then their concatenation $x \mapsto (f_1(x), \dots, f_k(x))$ is also a Lipschitz function. With this property, it is straightforward to build a Lipschitz Stable Expansion Block by concatenating Stable Blocks.
Dimensionality reduction is another essential operation in neural networks. On the one hand, its goal is to reduce the number of parameters and thus the amount of computation required to run the network. On the other hand, it allows the model to progressively map the input space onto the output dimension, which in many cases corresponds to the number of different labels $K$. In this context, several operations exist: pooling layers are used to extract information present in a region of the feature map generated by a convolution layer, and one can easily adapt pooling layers (e.g., max and average pooling) to make them Lipschitz [5]. Finally, a simple method to reduce the dimension is the product with a non-square matrix. In this paper, we simply implement it as a truncation of the output, which obviously maintains the Lipschitz constant.
4.2 Computing spectral norms
Our Stable Block, described in Equation 2, can be adapted to any kind of linear transformation but requires the computation of the spectral norm of these transformations. Given that the exact computation of the norm of a linear operator is hard in general [54], an efficient approximate method is required during training in order to keep the complexity tractable.
Many techniques exist to approximate the spectral norm (i.e., the largest singular value), and most of them exhibit a trade-off between efficiency and accuracy. Several methods exploit the structure of convolutional layers to build an upper bound on the spectral norm of the linear transform performed by the convolution [33, 52, 3]. While these methods are generally efficient, they can be less relevant in certain settings. For instance, in our context, using a loose upper bound of the spectral norm would hinder the expressive power of the layer and make it too contractive.
We rely on the Power Iteration Method (PM). This method converges at a geometric rate towards the largest singular value of a matrix. While it can appear computationally expensive due to the large number of iterations required for convergence, it is possible to drastically reduce the number of iterations during training. Indeed, as in [42], by considering that the weight matrices change slowly during training, one can perform only one iteration of the PM for each step of the training and let the algorithm slowly converge along with the training process (note that a typical training run requires approximately 200K steps, whereas around 100 steps of PM would be enough for convergence). We describe in more detail in Algorithm 1 the operations performed during a forward pass with a Stable Block.
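A single power-iteration update, of the kind performed once per training step, can be sketched as follows (a NumPy stand-in, not the paper's implementation); iterating it many times illustrates the geometric convergence to the exact spectral norm.

```python
import numpy as np

def power_iteration_step(W, u):
    """A single power-iteration update, as done once per training step:
    maintains a running estimate u of the leading right singular vector of W
    and returns the current spectral-norm estimate."""
    v = W @ u
    v /= np.linalg.norm(v)
    u_new = W.T @ v
    sigma = np.linalg.norm(u_new)  # current estimate of the largest singular value
    return u_new / sigma, sigma

rng = np.random.default_rng(0)
W = rng.standard_normal((20, 10))
u = rng.standard_normal(10)
u /= np.linalg.norm(u)

# Run many single steps as a stand-in for "one step per training iteration";
# the estimate converges geometrically to the true spectral norm.
for _ in range(150):
    u, sigma = power_iteration_step(W, u)
exact = np.linalg.norm(W, ord=2)
```

The estimate approaches the true spectral norm from below, which is why a well-converged PM is needed before reporting certified accuracies.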
However, for evaluation purposes, we need to compute the certified adversarial robustness, and this requires ensuring the convergence of the PM. Therefore, at inference time, we perform a larger, fixed number of iterations for each layer (since the PM converges at a geometric rate, a moderate number of iterations is sufficient). Also note that at inference time, the computation of the spectral norm only needs to be performed once for each layer.
5 Experiments
             Total Depth       Depth Linear      Num. Features        Num. Channels
             5     30    70    5     10    15    1024   2048   4096   30    60    90
Clean        72.8  75.0  74.9  74.9  76.1  76.3  74.3   74.8   75.0   74.8  75.2  75.4
PGD          67.8  70.1  70.2  70.3  71.4  71.6  69.3   70.0   70.3   70.2  70.5  70.7
AutoAttack   66.1  68.3  68.5  68.6  69.7  69.8  67.5   68.3   68.7   68.5  68.7  68.9
Certified    55.1  57.5  57.6  57.9  58.8  58.8  56.9   57.5   57.6   57.6  58.2  58.1
Table 1: Clean, empirical robust (PGD, AutoAttack), and certified accuracies for different hyperparameters of our Stable Block architecture. We considered a baseline model composed of convolutional and fully-connected Stable Block layers. From left to right, we vary the depth, the number of Linear Stable Blocks, the number of feature maps in the convolutions, and the width of the fully-connected layers.

To evaluate our new Lipschitz Stable Block layers, we carry out an extensive set of experiments. In this section, we first recall the concurrent approaches that build Lipschitz Neural Networks and stress their limitations. Then, after describing the details of our setting, we summarize our experimental results. By computing the certified and empirical adversarial accuracy of our networks, we show that our architecture is more accurate and robust than other competing approaches. Finally, we demonstrate the stability and the scalability of our approach by training very deep neural networks, up to 1000 layers, without normalization tricks or gradient clipping.
5.1 Concurrent Approaches
Projection on an Orthogonal Space.
The works of [39] and [56] (denoted BCOP and Cayley, respectively) are the approaches most relevant to ours. Indeed, they consist of projecting the weight matrices onto an orthogonal space in order to preserve gradient norms and enhance adversarial robustness by guaranteeing low Lipschitz constants. While both works have similar objectives, their execution differs. The BCOP layer (Block Convolution Orthogonal Parameterization) uses an iterative algorithm proposed by [9] to orthogonalize the linear transform performed by a convolution. However, several iterations are necessary to guarantee orthogonality, making this method very costly. The method proposed by [56] suffers from similar limitations: it uses the Cayley transform to orthogonalize the weight matrices, and while this algorithm is not iterative, it involves a matrix inversion.
Reshaped Kernel Methods.
It has been shown by [13] and [57] that the spectral norm of a convolution can be upper-bounded by the norm of a reshaped kernel matrix. Consequently, orthogonalizing this reshaped matrix directly upper-bounds the spectral norm of the convolution. While this method is more computationally efficient than orthogonalizing the whole convolution, it lacks expressivity, as the other singular values of the convolution are likely too constrained. In the following, we denote by RKO and CRKO the "Reshaped Kernel Methods" performed with the BCOP orthogonalization method and the Cayley transform, respectively.
5.2 Training and Architectural Details
Experimental setting.
We demonstrate the effectiveness of our approach on a classification task with the CIFAR-10 dataset [35]. In order to make our results comparable with the Cayley method [56], we use the same training configuration: the same batch size and number of epochs, standard data augmentation (i.e., random cropping and flipping), the Adam optimizer [17] without weight decay, and a piecewise triangular learning rate scheduler.
Hyperparameter Study.
We now describe the impact of the hyperparameters on the accuracy of our models. In order to obtain the best-performing networks, we fine-tuned several hyperparameters: the depth, the number of feature maps of the convolutions, the width and the number of the linear layers at the end of the network, and finally the learning rate. To study the influence of the hyperparameters on our architecture, we considered a baseline model composed of convolutional and fully-connected Stable Block layers, and tested multiple variations of the hyperparameters around this baseline to capture their impact. We conclude that depth has an important impact on the performance of the model but saturates beyond a certain point. We report the results in Table 1. A very important performance gain is obtained by increasing the number of feature maps of the convolutions from 3 to 30, with little gain afterwards. On the other hand, the width of the linear layers has a small impact on performance.
                Ours    Cayley   0.85-Cayley   BCOP   RKO   CRKO
Clean           78.56
PGD             74.33
AutoAttack      72.72
Certified       61.13
Emp. Lipschitz
Table 2: Results are averaged over several runs; mean and standard deviation are reported. Results for other models are reported from [56]. Our model outperforms the other methods in both empirical and certified robust accuracy.

             Ours    ResNet9 (Cayley)   ResNet9 (BCOP)   WideResNet (Cayley)   WideResNet (BCOP)
Clean        83.41   82.99              81.39
PGD          77.23   76.02              74.56
AutoAttack   75.04   73.16              71.86
5.3 Results
In this section, we present our results on adversarial robustness, covering both provable and empirical robustness. Following the hyperparameter study above, we fixed the architecture to a stack of convolutional Stable Block layers followed by fully-connected Stable Block layers.
Certified Adversarial Robustness.
In order to train Lipschitz Residual Networks, we used the architecture and techniques presented in Section 4, and we did not normalize the inputs. We report our results in Table 2 as well as in Figure 2. Table 2 compares the performance of our architecture against other state-of-the-art Lipschitz networks (presented in Section 5.1). Our network outperforms every other approach on every measured metric: on clean accuracy, against the Projected Gradient Descent (PGD) attack proposed by [41], and against AutoAttack [16]. Furthermore, our model obtains a certified accuracy more than two points above the previous best approach. Finally, Figure 2 shows the evolution of the certified accuracy and the empirical accuracy under PGD attacks as a function of the perturbation budget $\varepsilon$. We observe a substantial gap between the empirical robustness and the certifiable robustness, probably due to the upper estimate of the Lipschitz constants.
Empirical Adversarial Robustness.
We report in Table 3 the results for models with normalized inputs. We note that our model outperforms the Cayley models on all tasks, both in clean accuracy and in robust accuracy under PGD attack. In comparison, with the same number of layers, an undefended ResNet reaches a higher clean accuracy but an accuracy close to zero under attack. Figure 2 shows that the models without normalized inputs reach better levels of robustness when $\varepsilon$ becomes large. For comparison, the best empirically robust model in the literature relies on a ResNet-50 [4] with Adversarial Training [41] and additional unlabeled data.
5.4 Training stability: scaling up to 1000 layers
While the Residual Network architecture limits, by design, vanishing-gradient issues, it still suffers from exploding gradients in many cases [26]. To prevent such scenarios, batch normalization layers [32] are used in most Residual Networks to stabilize training.

Recently, several works [42, 19] have proposed to normalize the linear transformation of each layer by its spectral norm. Such a method limits exploding gradients but again suffers from vanishing-gradient issues. Indeed, spectral normalization might be too restrictive: dividing by the spectral norm can make the other singular values vanishingly small. While more computationally expensive (spectral normalization can be done with Power Method iterations), orthogonal projections prevent both exploding and vanishing issues.

On the contrary, the architecture proposed in this paper naturally controls the gradient norm of the output with respect to a given layer. Therefore, our architecture can get the best of both worlds: limiting exploding and vanishing issues while maintaining scalability. To demonstrate the scalability of our approach, we test the ability to scale our architecture to very high depths (up to 1000 layers) without any additional normalization/regularization tricks such as Dropout [53], Batch Normalization [32], or gradient clipping [44]. Alongside the work of [64], which leverages Dynamical Isometry and Mean Field Theory to train a very deep neural network, we believe, to the best of our knowledge, that we are the second to perform such training. For the sake of computational efficiency, we limit this experiment to architectures with a reduced number of feature maps. We report the accuracy as a function of epochs for our architecture in Figure 2 for a varying number of convolutional layers. It is worth noting that for the deepest networks, it may take a few epochs before convergence starts. As in [64], we observe that there is no gain in using very deep architectures for this task.
6 Conclusion
In this paper, we presented a new generic method to build Lipschitz layers. We leveraged the dynamical system interpretation of Residual Networks and showed that using convex potential flows naturally defines Lipschitz neural networks. After proposing a parametrization based on Input Convex Neural Networks [1], we showed that our models are able to reach state-of-the-art results in classification and robustness in comparison with other existing Lipschitz approaches. We also experimentally showed that our layers provide a scalable approach to training very deep architectures without further regularization.
Exploiting the ResNet architecture for devising flows has been an important research topic. For example, in the context of generative modeling, Invertible Neural Networks [6] and Normalizing Flows [48, 60] are both important research directions. More recently, Sylvester Normalizing Flows [58] and Convex Potential Flows [29] explored ideas similar to the present work, but in very different settings and applications. In particular, they did not consider the contraction property of convex flows, and the link with adversarial robustness has remained underexploited.
Further work.
Our models may not express all Lipschitz functions. Knowing which functions can be approximated by our layers is difficult, even in the linear case. Nevertheless, this is clearly an important question that requires further investigation. One could also extend our work by studying other dynamical systems. For instance, recent architectures such as Hamiltonian Networks [24] and Momentum Networks [49] exhibit interesting properties. Finally, we hope to extend our approach to Recurrent Neural Networks [51] and Transformers [59].

References

Amos et al. [2017]
Brandon Amos, Lei Xu, and J Zico Kolter.
Input convex neural networks.
In
International Conference on Machine Learning
, 2017.  Anil et al. [2019] Cem Anil, James Lucas, and Roger Grosse. Sorting out lipschitz function approximation. In International Conference on Machine Learning, 2019.

Araujo et al. [2021]
Alexandre Araujo, Benjamin Negrevergne, Yann Chevaleyre, and Jamal Atif.
On lipschitz regularization of convolutional layers using toeplitz
matrix theory.
ThirtyFifth AAAI Conference on Artificial Intelligence
, 2021. 
Augustin et al. [2020]
Maximilian Augustin, Alexander Meinke, and Matthias Hein.
Adversarial robustness on inand outdistribution improves
explainability.
In
European Conference on Computer Vision
, pages 228–245. Springer, 2020.  Bartlett et al. [2017] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrallynormalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, 2017.
 Behrmann et al. [2019] Jens Behrmann, Will Grathwohl, Ricky TQ Chen, David Duvenaud, and JörnHenrik Jacobsen. Invertible residual networks. In International Conference on Machine Learning, 2019.
 Béthune et al. [2021] Louis Béthune, Alberto GonzálezSanz, Franck Mamalet, and Mathieu Serrurier. The many faces of 1lipschitz neural networks. arXiv preprint arXiv:2104.05097, 2021.
 Biggio et al. [2013] Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Joint European conference on machine learning and knowledge discovery in databases, 2013.

Björck et al. [1971]
Åke Björck et al.
An iterative algorithm for computing the best estimate of an orthogonal matrix.
SIAM Journal on Numerical Analysis, 1971.  Bosch [1987] AJ Bosch. Note on the factorization of a square matrix into two hermitian or symmetric matrices. SIAM Review, 29(3):463–468, 1987.
 Carlini et al. [2017] Nicholas Carlini et al. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, 2017.
 Chen et al. [2018] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, 2018.
 Cisse et al. [2017] Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In International Conference on Machine Learning, 2017.
 Cohen et al. [2019] Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, 2019.

Combettes and Pesquet [2020]
Patrick L Combettes and JeanChristophe Pesquet.
Lipschitz certificates for layered network structures driven by
averaged activation operators.
SIAM Journal on Mathematics of Data Science
, 2020.  Croce et al. [2020] Francesco Croce et al. Reliable evaluation of adversarial robustness with an ensemble of diverse parameterfree attacks. In International Conference on Machine Learning, 2020.
 Diederik P. Kingma [2014] Jimmy Ba Diederik P. Kingma. Adam: A method for stochastic optimization. In International Conference for Learning Representations, 2014.
 E [2017] Weinan E. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 2017.
 Farnia et al. [2019] Farzan Farnia, Jesse Zhang, and David Tse. Generalizable adversarial training via spectral normalization. In International Conference on Learning Representations, 2019.
 Fazlyab et al. [2019] Mahyar Fazlyab, Alexander Robey, Hamed Hassani, Manfred Morari, and George Pappas. Efficient and accurate estimation of lipschitz constants for deep neural networks. In Advances in Neural Information Processing Systems, 2019.
 Golub et al. [2000] Gene H Golub et al. Eigenvalue computation in the 20th century. Journal of Computational and Applied Mathematics, 2000.
 Goodfellow et al. [2015] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
 Gouk et al. [2021] Henry Gouk, Eibe Frank, Bernhard Pfahringer, and Michael J Cree. Regularisation of neural networks by enforcing lipschitz continuity. Machine Learning, 2021.
Greydanus et al. [2019] Samuel Greydanus, Misko Dzamba, and Jason Yosinski. Hamiltonian neural networks. In Advances in Neural Information Processing Systems, 2019.
 Haber et al. [2017] Eldad Haber et al. Stable architectures for deep neural networks. Inverse problems, 2017.
Hayou et al. [2021] Soufiane Hayou, Eugenio Clerico, Bobby He, George Deligiannidis, Arnaud Doucet, and Judith Rousseau. Stable ResNet. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, 2021.

He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition, 2016.
Hochreiter et al. [2001] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Jürgen Schmidhuber, et al. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.

Huang et al. [2021] Chin-Wei Huang, Ricky T. Q. Chen, Christos Tsirigotis, and Aaron Courville. Convex potential flows: Universal probability distributions with optimal transport and convex optimization. In International Conference on Learning Representations, 2021.
Huang et al. [2020a] Lei Huang, Li Liu, Fan Zhu, Diwen Wan, Zehuan Yuan, Bo Li, and Ling Shao. Controllable orthogonalization in training dnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020a.
 Huang et al. [2020b] Yifei Huang, Yaodong Yu, Hongyang Zhang, Yi Ma, and Yuan Yao. Adversarial robustness of stabilized neural ODEs might be from obfuscated gradients. Mathematical and Scientific Machine Learning, 2020b.
 Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
 Jia et al. [2017] Kui Jia, Dacheng Tao, Shenghua Gao, and Xiangmin Xu. Improving training of deep neural networks via singular value bounding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 Kautsky and Turcajová [1994] Jaroslav Kautsky and Radka Turcajová. A matrix approach to discrete wavelets. In Wavelet Analysis and Its Applications. Elsevier, 1994.
 Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
 Kurakin et al. [2016] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
 Lecuyer et al. [2019] M. Lecuyer, V. Atlidakis, R. Geambasu, D. Hsu, and S. Jana. Certified robustness to adversarial examples with differential privacy. In 2019 IEEE Symposium on Security and Privacy (SP), 2019.
 Li et al. [2020] Mingjie Li, Lingshen He, and Zhouchen Lin. Implicit euler skip connections: Enhancing adversarial robustness via numerical stability. In International Conference on Machine Learning, pages 5874–5883. PMLR, 2020.
 Li et al. [2019] Qiyang Li, Saminul Haque, Cem Anil, James Lucas, Roger B Grosse, and Joern-Henrik Jacobsen. Preventing gradient attenuation in lipschitz constrained convolutional networks. In Advances in Neural Information Processing Systems, 2019.
 Lu et al. [2018] Y. Lu, A. Zhong, Q. Li, and B. Dong. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. In Proceedings of the 35th International Conference on Machine Learning, 2018.
 Madry et al. [2018] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.

Miyato et al. [2018] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
Moosavi-Dezfooli et al. [2019] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Jonathan Uesato, and Pascal Frossard. Robustness via curvature regularization, and vice versa. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
 Pascanu et al. [2013] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International conference on machine learning, 2013.
 Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, highperformance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
 Pinot et al. [2019] Rafael Pinot, Laurent Meunier, Alexandre Araujo, Hisashi Kashima, Florian Yger, Cédric Gouy-Pailler, and Jamal Atif. Theoretical evidence for adversarial robustness through randomization. In Advances in Neural Information Processing Systems, 2019.
 Raghunathan et al. [2018] Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. In International Conference on Learning Representations, 2018.
 Rezende and Mohamed [2015] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International conference on machine learning, 2015.
 Sander et al. [2021] Michael E. Sander, Pierre Ablin, Mathieu Blondel, and Gabriel Peyré. Momentum residual neural networks. In International Conference on Machine Learning, 2021.
 Sedghi et al. [2018] Hanie Sedghi, Vineet Gupta, and Philip Long. The singular values of convolutional layers. In International Conference on Learning Representations, 2018.

Sherstinsky [2020] Alex Sherstinsky. Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Physica D: Nonlinear Phenomena, 404:132306, 2020.
Singla et al. [2021] Sahil Singla et al. Fantastic four: Differentiable and efficient bounds on singular values of convolution layers. In International Conference on Learning Representations, 2021.
 Srivastava et al. [2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014.
 Steinberg [2005] Daureen Steinberg. Computation of matrix norms with applications to robust optimization. Research thesis, Technion-Israel Institute of Technology, 2005.
 Szegedy et al. [2014] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
 Trockman et al. [2021] Asher Trockman et al. Orthogonalizing convolutional layers with the cayley transform. In International Conference on Learning Representations, 2021.
 Tsuzuku et al. [2018] Yusuke Tsuzuku, Issei Sato, and Masashi Sugiyama. Lipschitzmargin training: Scalable certification of perturbation invariance for deep neural networks. In Advances in Neural Information Processing Systems, 2018.
 van den Berg et al. [2018] Rianne van den Berg, Leonard Hasenclever, Jakub Tomczak, and Max Welling. Sylvester normalizing flows for variational inference. In proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2018.
 Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
 Verine et al. [2021] Alexandre Verine, Yann Chevaleyre, Fabrice Rossi, and benjamin negrevergne. On the expressivity of bilipschitz normalizing flows. In ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, 2021.
 Virmaux and Scaman [2018] Aladin Virmaux and Kevin Scaman. Lipschitz regularity of deep neural networks: analysis and efficient estimation. In Advances in Neural Information Processing Systems, 2018.

Wang et al. [2020] Jiayun Wang, Yubei Chen, Rudrasis Chakraborty, and Stella X. Yu. Orthogonal convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.
Wong et al. [2018] Eric Wong, Frank Schmidt, Jan Hendrik Metzen, and J. Zico Kolter. Scaling provable adversarial defenses. In Advances in Neural Information Processing Systems, 2018.
 Xiao et al. [2018] Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of cnns: How to train 10,000-layer vanilla convolutional neural networks. In International Conference on Machine Learning, 2018.
 Yoshida and Miyato [2017] Yuichi Yoshida and Takeru Miyato. Spectral norm regularization for improving the generalizability of deep learning. arXiv preprint arXiv:1705.10941, 2017.
Appendix A Proofs
A.1 Proof of Proposition 2
Before proving this proposition we recall the following lemma.
Lemma 1 ([10]).
Let $W \in \mathbb{R}^{n \times n}$. There exist two symmetric matrices $S_1, S_2$, with $S_1$ invertible, such that $W = S_1 S_2$.
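As a sanity check, the factorization of Lemma 1 can be exhibited explicitly in the special case of a real-diagonalizable matrix; this is only a minimal sketch of that special case (the result of [10] covers all real square matrices), with randomly drawn matrices standing in for a generic example:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# Build a real-diagonalizable W = P D P^{-1} with random invertible P
# and real diagonal D.
P = rng.standard_normal((n, n))
D = np.diag(rng.standard_normal(n))
W = P @ D @ np.linalg.inv(P)

# Explicit factorization: S1 = P P^T is symmetric and invertible,
# S2 = P^{-T} D P^{-1} is symmetric, and S1 @ S2 = P D P^{-1} = W.
S1 = P @ P.T
Pinv = np.linalg.inv(P)
S2 = Pinv.T @ D @ Pinv

assert np.allclose(S1, S1.T)       # S1 symmetric
assert np.allclose(S2, S2.T)       # S2 symmetric
assert np.allclose(S1 @ S2, W)     # product recovers W
```

In this special case the symmetric factors are built directly from the eigendecomposition; the general statement requires the argument of [10].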
We are now set to prove Proposition 2.
Proof.
Let be a continuously differentiable function. For all , is a matrix; then, thanks to the previous lemma, there exist two symmetric matrices, with invertible, such that .
If and are two twice differentiable functions, then the discretized flow defines the following function: . Then we have:
Being symmetric, there exists a twice differentiable function such that for all . Let be defined similarly. Let , then there exists such that . Then we set . Note that it is not necessary to define outside of . Then we get for such and : for all . Then there exists a constant such that . By choosing the constant over appropriately, we can get , hence the result.
∎
A.2 Proof of Proposition 3
Proof.
Consider the time derivative of the square difference between the two flows:
$$\frac{d}{dt}\left\|x_1(t) - x_2(t)\right\|_2^2 = 2\left\langle x_1(t) - x_2(t),\, \dot{x}_1(t) - \dot{x}_2(t)\right\rangle = -2\left\langle x_1(t) - x_2(t),\, \nabla F(x_1(t)) - \nabla F(x_2(t))\right\rangle \le 0.$$
The last inequality derives directly from the usual characterization of convexity: the gradient of a convex function is a monotone operator. Therefore the quantity $\|x_1(t) - x_2(t)\|_2^2$ is a decreasing function of time. In particular, $\|x_1(t) - x_2(t)\|_2 \le \|x_1(0) - x_2(0)\|_2$ for all $t \ge 0$, which concludes the proof. ∎
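The contraction property above can be observed numerically. The sketch below assumes, for illustration only, a quadratic convex potential $F(x) = \tfrac{1}{2} x^\top Q x$ with $Q = A^\top A$ positive semi-definite, and discretizes the flow $\dot{x} = -\nabla F(x)$ with a small explicit Euler step; the matrix and initial points are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5

# Convex potential F(x) = 0.5 * x^T Q x with Q = A^T A (PSD),
# so grad F(x) = Q x.
A = rng.standard_normal((n, n))
Q = A.T @ A

def grad_F(x):
    return Q @ x

# Explicit Euler on dx/dt = -grad F(x), with a step small enough
# that each update x -> (I - eta Q) x is a non-expansive map.
eta = 0.9 / np.linalg.norm(Q, 2)
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)

dists = [np.linalg.norm(x1 - x2)]
for _ in range(200):
    x1 = x1 - eta * grad_F(x1)
    x2 = x2 - eta * grad_F(x2)
    dists.append(np.linalg.norm(x1 - x2))

# The distance between the two trajectories never increases.
assert all(d1 <= d0 + 1e-9 for d0, d1 in zip(dists, dists[1:]))
```

The monotone decrease of `dists` is the discrete counterpart of the inequality in the proof.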
A.3 Proof of Proposition 4
Proof.
With , we can write:
This equality allows us to derive the equivalence between and:
Moreover, assuming that , we have:
We can see from this last inequality that if we enforce , we get , which concludes the proof. ∎
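The spectral argument underlying this proof can be checked numerically. The sketch below assumes, as a simplification, that the block has the form $x \mapsto x - h\, W \sigma(W^\top x + b)$ with an activation $\sigma$ whose slopes lie in $[0,1]$, so that its Jacobian writes $I - h\, W \Lambda W^\top$ for some diagonal $\Lambda$ with entries in $[0,1]$; the weights below are random placeholders, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 8, 16
W = rng.standard_normal((n, m))  # random placeholder weights
h = 2.0 / np.linalg.norm(W, 2) ** 2  # step size h = 2 / ||W||_2^2

# For any diagonal Lambda with entries in [0, 1] (possible activation
# slopes), h * W Lambda W^T is PSD with spectral norm at most
# h * ||W||_2^2 = 2, so I - h W Lambda W^T has eigenvalues in [-1, 1]
# and the Jacobian of the block is a contraction.
worst = 0.0
for _ in range(100):
    Lam = np.diag(rng.uniform(0.0, 1.0, size=m))
    J = np.eye(n) - h * W @ Lam @ W.T
    worst = max(worst, np.linalg.norm(J, 2))

assert worst <= 1.0 + 1e-9
```

Every sampled Jacobian has spectral norm at most one, which is exactly the 1-Lipschitz certificate enforced by the step-size choice.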
Appendix B Expressivity of discretized convex potential flows
Let us define the space of real symmetric matrices with singular values bounded by . Let us also define the space of real matrices with singular values bounded by in absolute value. Let . Then one can prove (a proof and justification of this result can be found at https://mathoverflow.net/questions/60174/factorization-of-a-real-matrix-into-hermitian-x-hermitian-is-it-stable) that . Thus there exists such that for all matrices , for all matrices such that .
Applied to the expressivity of discretized convex potential flows, the previous result means that there exists a Lipschitz linear function that cannot be approximated, as a discretized flow of any depth, by convex linear smooth potential flows as in Proposition 4. Indeed, such a flow would take the form: , where are symmetric matrices whose eigenvalues are in ; in other words, such transformations are exactly described by for some .
Appendix C Additional experiments
C.1 Relaxing linear layers
            h = 1.0   h = 0.1   h = 0.01
Clean         85.10     82.23      78.53
PGD           61.45     62.99      60.98
AutoAttack    57.03     58.82      57.63
The table above shows the results of the relaxed training of our StableBlock architecture, i.e., we fix the step size h in the discretized convex potential flow of Proposition 4. Increasing h yields an important improvement in clean accuracy, but we lose empirical robust accuracy. While computing the certified accuracy is not possible in this case, due to the unknown value of the Lipschitz constant, we can still note that the training of the network remains stable without normalization tricks and still offers a non-negligible level of robustness.
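The loss of the certificate under relaxation can be illustrated numerically. The sketch below assumes, for illustration only, a single block of the form $x \mapsto x - h\, W \sigma(W^\top x + b)$ with $\sigma = \mathrm{ReLU}$ and random placeholder weights (not the trained networks of the table); it compares the constrained step $h = 2/\|W\|_2^2$ against a fixed step $h = 1.0$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 8, 16
W = rng.standard_normal((n, m))  # random placeholder weights
b = rng.standard_normal(m)

def block(x, h):
    # Relaxed convex potential block: x - h * W sigma(W^T x + b).
    return x - h * W @ np.maximum(W.T @ x + b, 0.0)

def max_ratio(h, trials=1000):
    # Empirical lower bound on the Lipschitz constant of the block.
    best = 0.0
    for _ in range(trials):
        x, y = rng.standard_normal(n), rng.standard_normal(n)
        best = max(best, np.linalg.norm(block(x, h) - block(y, h))
                         / np.linalg.norm(x - y))
    return best

h_safe = 2.0 / np.linalg.norm(W, 2) ** 2
assert max_ratio(h_safe) <= 1.0 + 1e-9  # constrained step: certifiably 1-Lipschitz
assert max_ratio(1.0) > 1.0             # fixed step: expansive, no certificate
```

With the constrained step every pair of inputs contracts, whereas a fixed step h = 1.0 makes the block expansive on some directions, which is why no certified accuracy can be reported in the relaxed setting.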
C.2 Effect of longer trainings
            Epochs
            50     100    200    400
Clean       76.7   78.5   80.1   80.3
PGD         73.2   74.3   75.0   75.1
AutoAttack  71.9   72.7   73.2   73.1
Certified   59.5   61.1   60.6   59.9
The table above shows the evolution of the clean accuracy, empirical robust accuracy, and certified accuracy as a function of the training duration in epochs. We observe that the clean accuracy increases from 50 epochs to 400 without overfitting. However, the certified accuracy peaks at around 100 epochs and starts decreasing afterwards, suggesting a trade-off between higher standard accuracy and certified accuracy.