Scalable Lipschitz Residual Networks with Convex Potential Flows

10/25/2021
by   Laurent Meunier, et al.

The Lipschitz constant of neural networks has been established as a key property to enforce the robustness of neural networks to adversarial examples. However, recent attempts to build 1-Lipschitz Neural Networks have all shown limitations, and robustness has to be traded for accuracy and scalability, or vice versa. In this work, we first show that using convex potentials in a residual network gradient flow provides a built-in 1-Lipschitz transformation. From this insight, we leverage the work on Input Convex Neural Networks to parametrize efficient layers with this property. A comprehensive set of experiments on CIFAR-10 demonstrates the scalability of our architecture and the benefit of our approach for ℓ_2 provable defenses. Indeed, we train very deep and wide neural networks (up to 1000 layers) and reach state-of-the-art results in terms of standard and certified accuracy, along with empirical robustness, in comparison with other 1-Lipschitz architectures.



1 Introduction

Nowadays, image classifiers can achieve superhuman performance, but most of them are not robust against small, imperceptible and adversarially-chosen perturbations of their inputs [8, 55]. This sensitivity of neural networks to input perturbations has become a major issue, and over the past decade, research progress has played out as a cat-and-mouse game between the development of more and more powerful attacks [22, 36, 11, 16] and the design of empirical defense mechanisms [41, 43, 14]. Ending this game calls for certified adversarial robustness [47, 63]. While recent works have devised defenses with theoretical guarantees against adversarial perturbations, they share the same limitations, namely tradeoffs between expressivity and robustness, and between scalability and accuracy.

Many papers have explored the interpretation of neural networks as a parameter estimation problem for nonlinear dynamical systems [25, 18, 40]. Reconsidering the ResNet architecture as an Euler discretization of a continuous dynamical system gave rise to the trend around Neural Ordinary Differential Equations [12]. For instance, in the seminal work of [25], the continuous formulation offers more flexibility to investigate the stability of neural networks during inference, knowing that the discretization will then be implemented by the architecture design. The notion of stability, in this context, quantifies how a small perturbation of the initial value impacts the trajectories of the dynamical system.

Since robustness to adversarial attacks is closely related to this notion of dynamical stability, we draw inspiration from this continuous and dynamical interpretation. This enables us to readily introduce convex potentials in the design of the gradient flow. We show that this choice of parametrization yields by-design 1-Lipschitz neural networks, which translates into improved adversarial robustness. At the very core of our approach lies a new 1-Lipschitz non-linear operator, which we call the Stable Block layer, that allows us to adapt convex potential flows to the discretized case. These blocks enjoy the desirable property of stabilizing the training of the neural network by controlling the gradient norm, hence overcoming the exploding gradient issue. We experimentally demonstrate our approach by training very deep and wide neural networks on CIFAR-10 [35], reaching state-of-the-art results in terms of standard and certified under-attack accuracy.

Outline of the paper.

In Section 2, we recall existing results on Lipschitz networks and Residual Networks. In Section 3, we first prove that Residual Networks with convex potentials are 1-Lipschitz. Section 4 presents how to build 1-Lipschitz residual neural networks from the theoretical insight of the previous section. Finally, in Section 5, we validate our method with a comprehensive set of experiments exhibiting the stability and scalability of our approach as well as state-of-the-art results both in terms of accuracy and robustness.

2 Background and Related Work

We consider a classification task from an input space $\mathcal{X} \subset \mathbb{R}^d$ to a label space $\mathcal{Y}$. To this end, we aim at learning a classifier function $f : \mathcal{X} \to \mathbb{R}^{|\mathcal{Y}|}$ such that the predicted label for an input $x$ is $\arg\max_k f_k(x)$. For a given input-label couple $(x, y)$, we say $x$ is correctly classified if $\arg\max_k f_k(x) = y$.

In this paper, we aim at devising defense mechanisms against adversarial attacks, i.e., given a ball of radius $\varepsilon$ around an input $x$ with label $y$, an adversarial attack is a perturbation $\tau$ with $\|\tau\|_2 \le \varepsilon$ such that $\arg\max_k f_k(x + \tau) \ne y$.

2.1 Lipschitz property of Neural Networks

The Lipschitz constant has seen growing interest in the last few years in the field of deep learning [61, 20, 15, 7]. Indeed, numerous results have shown that neural networks with a small Lipschitz constant exhibit better generalization [5], higher robustness to adversarial attacks [55, 19, 57], and better stability [64, 56]. Formally, we define the Lipschitz constant with respect to the $\ell_2$ norm of a Lipschitz continuous function $f$ as follows:

$$\mathrm{Lip}(f) = \sup_{x \ne y} \frac{\|f(x) - f(y)\|_2}{\|x - y\|_2}.$$

Note that in this work, we consider the Lipschitz constant with respect to the $\ell_2$ norm, and we say a Lipschitz continuous function $f$ is “$L$-Lipschitz” when $\mathrm{Lip}(f) \le L$.

Intuitively, if a classifier is Lipschitz, one can bound the impact of a given input variation on the output, hence obtaining guarantees on the adversarial robustness. We can formally characterize the robustness of a neural network with respect to its Lipschitz constant with the following proposition:

Proposition 1 ([57]).

Let $f$ be an $L$-Lipschitz continuous classifier for the $\ell_2$ norm. Then, for $\varepsilon > 0$ and for every $x$ with label $y$ such that

$$f_y(x) - \max_{k \ne y} f_k(x) > \sqrt{2}\, L\, \varepsilon,$$

we have, for every $\tau$ such that $\|\tau\|_2 \le \varepsilon$: $\arg\max_k f_k(x + \tau) = y$.

Consequently, the margin needs to be large and the Lipschitz constant small to get optimal robustness guarantees for neural networks.
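As a concrete illustration (our own sketch, not code from the paper), the snippet below turns the margin condition of Proposition 1 into a certification routine: given a batch of logits and a known $\ell_2$ Lipschitz bound $L$, it returns the per-example certified radius $\text{margin} / (\sqrt{2}\, L)$. The function name and the NumPy implementation are ours.

```python
import numpy as np

def certified_radius(logits: np.ndarray, labels: np.ndarray, lip: float) -> np.ndarray:
    """Largest l2 radius for which Proposition 1 guarantees an unchanged prediction.
    logits: (n, k) class scores, labels: (n,) true labels, lip: Lipschitz bound.
    Misclassified examples get a radius of 0."""
    n = logits.shape[0]
    true_score = logits[np.arange(n), labels]
    others = logits.copy()
    others[np.arange(n), labels] = -np.inf
    margin = true_score - others.max(axis=1)         # f_y(x) - max_{k != y} f_k(x)
    return np.maximum(margin, 0.0) / (np.sqrt(2.0) * lip)

# Certified accuracy at radius eps is the fraction of examples whose radius exceeds eps.
logits = np.array([[3.0, 1.0, 0.5], [0.2, 0.1, 0.3]])
labels = np.array([0, 0])
print(certified_radius(logits, labels, lip=1.0))     # [1.414..., 0.]: second example is misclassified
```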

2.2 Lipschitz Regularization for Robustness

Based on this theoretical insight, researchers have developed several techniques to regularize and constrain the Lipschitzness of neural networks. However, the computation of the Lipschitz constant of neural networks has been shown to be NP-hard [61]. Most methods therefore tackle the problem by reducing or constraining the Lipschitz constant at the layer level. For instance, the works of [13, 30, 62] exploit the orthogonality of the weight matrices to build Lipschitz layers. Other approaches [23, 33, 50, 52, 3] proposed to estimate or upper-bound the spectral norm of convolutional and dense layers, using for instance the power iteration method [21]. While these methods have shown interesting results in terms of accuracy, empirical robustness and efficiency, they cannot provide provable guarantees since the Lipschitz constant of the trained networks remains unknown or vacuous.

2.3 Certified Adversarial Robustness

Recently, two families of approaches have been developed to achieve certified adversarial robustness. A classifier is said to be certifiably robust if, for any input $x$, one can easily obtain a guarantee that the classifier's prediction is constant within some set around $x$. A sample $x$ with label $y$ is said to be certifiable at level $\varepsilon$ for the classifier $f$ if one can certify that for every $\tau$ s.t. $\|\tau\|_2 \le \varepsilon$, $\arg\max_k f_k(x + \tau) = y$.

The first category relies on randomization [37, 14, 46] and consists in adding isotropic noise to the input during both the training and inference phases. In order to get non-vacuous provable guarantees, such approaches often require querying the network hundreds of times to infer the label of a single image. This computational cost naturally limits the use of these methods in practice.

The second approach directly exploits the Lipschitzness property through the design of built-in 1-Lipschitz layers providing deterministic guarantees. Following this line, one can either normalize the weight matrices by their largest singular values, making the layer 1-Lipschitz, e.g. [65, 42, 19, 2], or project the weight matrices onto the Stiefel manifold [39, 56]. However, these latter methods are computationally expensive because they use iterative algorithms [9, 34], which hinders their use in large-scale settings.

2.4 Residual Networks

To prevent vanishing gradient issues in neural networks during the training phase [28], [27] proposed the Residual Network (ResNet) architecture. Based on this architecture, several works [25, 18, 40, 12] proposed a “continuous time” interpretation inspired by dynamical systems as follows:

$$\frac{\mathrm{d}x(t)}{\mathrm{d}t} = f_t(x(t)), \qquad x(0) = x. \tag{1}$$

This continuous-time interpretation helps as it allows us to consider the stability of the forward propagation through the stability of the associated dynamical system. A dynamical system is said to be stable if two trajectories, starting respectively from an input and a shifted input, remain sufficiently close to each other all along the propagation. This stability property is particularly meaningful in the context of adversarial classification.

It was argued by [25] that when $f_t$ does not depend on $t$ or varies slowly with time (this blurry notion of “varies slowly” makes the property difficult to apply), the stability can be characterized by the eigenvalues of the Jacobian matrix $\partial f_t / \partial x$: the dynamical system is stable if the real part of the eigenvalues of the Jacobian stays negative throughout the propagation. This property however only relies on intuition, and this condition might be difficult to verify in practice. In the following, in order to derive stability properties, we study gradient flows and convex potentials, which are sub-classes of Residual Networks.

Other works [31, 38] also proposed to enhance adversarial robustness using dynamical system interpretations of Residual Networks. Both works argue that using particular discretization schemes would make gradient attacks more difficult to compute for numerical-stability reasons. However, these works did not provide any provable guarantees for such approaches.

3 Lipschitzness via Convex Potentials

Starting from the definition of the ResNet architecture and its associated flow, in this section, (i) we show that using convex potentials allows us to build 1-Lipschitz networks, and (ii) we propose a simple method to parametrize such networks based on Input Convex Neural Networks [1].

3.1 ResNet gradient flows and potentials

Among dynamical systems, gradient flows form an important class that plays a central role in our work. Starting from Equation 1, we can define a gradient flow as follows:

Definition 1.

Let $(f_t)_{t \ge 0}$ be a family of differentiable functions on $\mathbb{R}^d$; the ResNet gradient flow associated with $(f_t)$ is defined as:

$$\frac{\mathrm{d}x(t)}{\mathrm{d}t} = -\nabla_x f_t(x(t)), \qquad x(0) = x.$$

In this case, $f_t$ is called a potential function. This means that the family of vector fields driving the flow derives from a simpler family of scalar-valued functions $(f_t)$ via the gradient operation $f_t \mapsto \nabla_x f_t$.

The discretized version of the ResNet gradient flow derives naturally from the explicit Euler scheme: $x_{t+1} = x_t - h\,\nabla_x f_t(x_t)$. Since deriving from a potential is a restrictive property, one may ask whether such flows can approximate all functions. The next proposition shows that every continuously differentiable function can be expressed as a two-layer discrete ResNet gradient flow.

Proposition 2.

Let $g : \mathbb{R}^d \to \mathbb{R}^d$ be a continuously differentiable function. Then, there exist two differentiable potential functions $f_1$ and $f_2$ such that the discretized ResNet gradient flow associated with $(f_1, f_2)$, starting from $x_0 = x$, satisfies $x_2 = g(x)$ for all $x$.

Gradient flows are thus a useful tool to study the Lipschitzness of neural networks. In the next section, we propose to rely on convex potentials to build 1-Lipschitz layers.

3.2 Convex Potentials

Following the continuous interpretation, we introduce a subclass of Residual Networks defined as the gradient flow deriving from a family of potential functions $(f_t)$. The next proposition shows that if the potentials are convex, then the gradient flow is 1-Lipschitz with respect to its initial condition, i.e., the input of the network.

Proposition 3.

Let $(f_t)_{t \ge 0}$ be a family of convex differentiable functions. Let $x(\cdot)$ and $\tilde{x}(\cdot)$ be two continuous ResNet gradient flows associated with $(f_t)$, differing only in their respective initial points $x(0)$ and $\tilde{x}(0)$. Then, for all $t \ge 0$,

$$\|x(t) - \tilde{x}(t)\|_2 \le \|x(0) - \tilde{x}(0)\|_2.$$

This simple property suggests that if we could build a ResNet with convex potentials, it would be less sensitive to input perturbations and therefore more robust to adversarial examples. However, this property does not automatically hold when we discretize the time steps with the explicit Euler scheme, as implied by the ResNet architecture. One needs an additional smoothness property on the potential functions to generalize it to discretized gradient flows. Recall that a function $f$ is said to be $L$-smooth if it is differentiable and $\nabla f$ is $L$-Lipschitz. We now prove an equivalent property for discretized time steps.

Proposition 4.

Let $(f_t)_{1 \le t \le T}$ be a family of convex differentiable functions such that for all $t$, $f_t$ is $L_t$-smooth. Let us define the following discretized ResNet gradient flow using $h_t$ as a step size:

$$x_{t+1} = x_t - h_t \nabla_x f_t(x_t).$$

Consider now two flows $(x_t)$ and $(\tilde{x}_t)$ with initial points $x_0$ and $\tilde{x}_0$ respectively. If $h_t \le \frac{2}{L_t}$ for all $t$, then

$$\|x_T - \tilde{x}_T\|_2 \le \|x_0 - \tilde{x}_0\|_2.$$

Remark.

A Residual Network defined as in Proposition 4 may not express all 1-Lipschitz functions. The convexity and smoothness assumptions required on the flow could indeed restrict the expressivity of the network. There are strong reasons to believe that convex potential flows cannot define universal approximators of 1-Lipschitz functions. See Appendix B for further discussion.
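As a quick numerical sanity check of Proposition 4 (our own sketch, not an experiment from the paper), the snippet below uses the convex quadratic potential $f(x) = \frac{1}{2} x^\top A x$ with $A$ positive semi-definite, which is $L$-smooth with $L = \lambda_{\max}(A)$, and verifies that the explicit Euler step with the maximal admissible step size $h = 2/L$ never increases the distance between two trajectories.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
B = rng.standard_normal((d, d))
A = B @ B.T                               # PSD, so f(x) = 0.5 * x^T A x is convex
L = np.linalg.eigvalsh(A).max()           # f is L-smooth with L = lambda_max(A)
h = 2.0 / L                               # largest step size allowed by Proposition 4

def euler_step(x):
    return x - h * (A @ x)                # x_{t+1} = x_t - h * grad f(x_t)

x, y = rng.standard_normal(d), rng.standard_normal(d)
d0 = np.linalg.norm(x - y)
for _ in range(50):
    x, y = euler_step(x), euler_step(y)
print(np.linalg.norm(x - y) / d0)         # <= 1: the discretized flow is non-expansive
```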

3.3 Parametrizing Convex Potentials

Our previous result (Proposition 4) requires computing the gradient of convex functions. We can leverage one-layer Input Convex Neural Networks [1] to define an efficient parametrization. For any vectors $w_1, \dots, w_m \in \mathbb{R}^d$, bias terms $b_1, \dots, b_m \in \mathbb{R}$, and a convex function $\psi : \mathbb{R} \to \mathbb{R}$, the potential defined as

$$F(x) = \sum_{i=1}^{m} \psi(w_i^\top x + b_i)$$

is a convex function of $x$ as the composition of a linear and a convex function. Its gradient with respect to its input is then

$$\nabla_x F(x) = W^\top \sigma(Wx + b),$$

where $W$ and $b$ are respectively the matrix and the vector obtained by concatenating the $w_i$ and the $b_i$, $\sigma = \psi'$, and $\sigma$ is applied element-wise. Moreover, assuming $\sigma$ is 1-Lipschitz, we have that $F$ is $\|W\|_2^2$-smooth, where $\|W\|_2$ denotes the spectral norm of $W$, i.e., the greatest singular value of $W$, defined as

$$\|W\|_2 = \max_{\|x\|_2 = 1} \|Wx\|_2.$$

The converse also holds: if $\sigma$ is a non-decreasing 1-Lipschitz function, $W \in \mathbb{R}^{m \times d}$ and $b \in \mathbb{R}^m$, there exists a convex $\|W\|_2^2$-smooth function $F$ such that

$$\nabla_x F(x) = W^\top \sigma(Wx + b),$$

where $\sigma$ is applied element-wise. The next section shows how this property can be used to implement the building block and training of such layers.
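To illustrate this parametrization (a minimal sketch under the construction described above; the variable names and the choice $\sigma = \mathrm{relu}$ are ours), the snippet below builds the one-layer ICNN potential $F(x) = \sum_i \psi(w_i^\top x + b_i)$ with $\psi(u) = \frac{1}{2}\mathrm{relu}(u)^2$, whose derivative is $\sigma = \mathrm{relu}$, and checks with autograd that $\nabla_x F(x) = W^\top \sigma(Wx + b)$.

```python
import torch

torch.manual_seed(0)
d, m = 8, 32
W = torch.randn(m, d)
b = torch.randn(m)
x = torch.randn(d, requires_grad=True)

# One-layer ICNN potential: F(x) = sum_i psi(w_i^T x + b_i) with
# psi(u) = 0.5 * relu(u)^2, which is convex and has derivative sigma = relu.
F_x = 0.5 * torch.relu(W @ x + b).pow(2).sum()
(grad,) = torch.autograd.grad(F_x, x)

closed_form = W.t() @ torch.relu(W @ x + b)           # nabla_x F(x) = W^T sigma(W x + b)
print(torch.allclose(grad, closed_form, atol=1e-5))   # True

# With sigma 1-Lipschitz, F is L-smooth with L = ||W||_2^2.
print(float(torch.linalg.matrix_norm(W, ord=2) ** 2))
```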

4 Building 1-Lipschitz Residual Networks

4.1 Stable Blocks

From the previous section, we derive the following Stable Block layer:

$$z = x - \frac{2}{\|W\|_2^2}\, W^\top \sigma(Wx + b). \tag{2}$$

Although written in matrix form, this layer can be implemented with any linear operation $W$. In the context of image classification, it is beneficial to use convolutions instead of generic linear transforms represented by a dense matrix; for instance, one can leverage the Conv2d and ConvTranspose2d functions of the PyTorch framework [45].

Residual Networks [27] are also composed of other types of layers which increase or decrease the dimensionality of the flow. Typically, in a classical setting, the number of input channels is gradually increased, while the size of the image is reduced with pooling layers. In order to build a 1-Lipschitz Residual Network, all these operations need to be properly scaled or normalized in order to maintain the desired Lipschitz constant.

First, to increase the number of channels, a concatenation of several Stable Blocks can easily be performed. Recall that if $f_1, \dots, f_k$ are 1-Lipschitz functions, the concatenation $x \mapsto (f_1(x), \dots, f_k(x))$ is $\sqrt{k}$-Lipschitz, so a rescaling by $1/\sqrt{k}$ makes it 1-Lipschitz. With this property, it is straightforward to build a 1-Lipschitz Stable Expansion Block by concatenating Stable Blocks.

Dimensionality reduction is another essential operation in neural networks. On one hand, its goal is to reduce the number of parameters and thus the amount of computation required by the network. On the other hand, it allows the model to progressively map the input space onto the output dimension, which in many cases corresponds to the number of different labels. In this context, several operations exist: pooling layers are used to extract the information present in a region of the feature map generated by a convolution layer, and one can easily adapt pooling layers (e.g., max and average pooling) to make them 1-Lipschitz [5]. Finally, a simple method to reduce the dimension is the product with a non-square matrix; in this paper, we simply implement it as a truncation of the output, which clearly does not increase the Lipschitz constant.
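The sketch below illustrates these two operations (our own reading: the $1/\sqrt{k}$ rescaling of the concatenation is one way to keep it 1-Lipschitz and is not necessarily the exact construction used in the paper). It concatenates several 1-Lipschitz blocks, rescales the result, truncates the output, and checks the Lipschitz bound on a random pair of inputs.

```python
import torch

torch.manual_seed(0)
d = 16

def make_1lip_block(dim: int):
    # A random linear map rescaled to spectral norm 1, hence 1-Lipschitz.
    W = torch.randn(dim, dim)
    W = W / torch.linalg.matrix_norm(W, ord=2)
    return lambda x: x @ W.t()

def stable_expansion(blocks, x):
    # Concatenate k 1-Lipschitz blocks and rescale by 1/sqrt(k) so that the
    # resulting map stays 1-Lipschitz (the rescaling is our reading, see text).
    return torch.cat([f(x) for f in blocks], dim=-1) / len(blocks) ** 0.5

def truncate(x, out_dim: int):
    # Keeping the first out_dim coordinates is a projection, hence 1-Lipschitz.
    return x[..., :out_dim]

blocks = [make_1lip_block(d) for _ in range(4)]
x, y = torch.randn(d), torch.randn(d)
ratio = torch.linalg.vector_norm(stable_expansion(blocks, x) - stable_expansion(blocks, y)) \
        / torch.linalg.vector_norm(x - y)
print(float(ratio))                                            # always <= 1
print(truncate(stable_expansion(blocks, x), out_dim=10).shape)  # torch.Size([10])
```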

1: Input: activation $x$, power-iteration vector $u$, weights $W$, bias $b$
2: if training, update $u \leftarrow W^\top W u \,/\, \|W^\top W u\|_2$ (one step of the Power Method)
3: $s \leftarrow \|W u\|_2$ (running estimate of the spectral norm $\|W\|_2$)
4: return $z = x - \frac{2}{s^2}\, W^\top \sigma(W x + b)$
Algorithm 1 Computation of a StableBlock
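For reference, here is a minimal PyTorch sketch of a dense StableBlock following Equation 2 and Algorithm 1 as reconstructed above; it is our own illustration (class and attribute names are ours), not the authors' released code. During training it performs a single power-iteration step to track $\|W\|_2$; at evaluation the running estimate is reused.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StableBlock(nn.Module):
    """Dense version of Eq. (2): z = x - (2 / ||W||_2^2) * W^T relu(W x + b)."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(hidden, dim) / dim ** 0.5)
        self.bias = nn.Parameter(torch.zeros(hidden))
        # Persistent vector for the power iteration (estimate of the top
        # right singular vector of W).
        self.register_buffer("u", F.normalize(torch.randn(dim), dim=0))

    def spectral_norm_sq(self) -> torch.Tensor:
        if self.training:
            with torch.no_grad():  # one power-iteration step per training forward
                v = F.normalize(self.weight @ self.u, dim=0)
                self.u = F.normalize(self.weight.t() @ v, dim=0)
        return (self.weight @ self.u).pow(2).sum()  # running estimate of ||W||_2^2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pre = F.linear(x, self.weight, self.bias)   # W x + b
        return x - (2.0 / self.spectral_norm_sq()) * F.linear(F.relu(pre), self.weight.t())


block = StableBlock(dim=64, hidden=128)
for _ in range(100):                # let the power iteration converge (training mode)
    block.spectral_norm_sq()
block.eval()

x, y = torch.randn(8, 64), torch.randn(8, 64)
shrink = (block(x) - block(y)).norm(dim=1) / (x - y).norm(dim=1)
print(shrink.max().item())          # <= 1 (up to numerical error): the block is 1-Lipschitz
```

A convolutional variant would replace the two F.linear calls with F.conv2d and F.conv_transpose2d sharing the same kernel, as suggested by the footnote above.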

4.2 Computing spectral norms

Our StableBlock, described in Equation 2, can be adapted to any kind of linear transformation but requires the computation of the spectral norm of this transformation. Computing this norm exactly at every training step is expensive (and the computation of general matrix norms can even be NP-hard [54]), so an efficient approximate method is required during training in order to keep the complexity tractable.

Many techniques exist to approximate the spectral norm (i.e., the largest singular value), and most of them exhibit a trade-off between efficiency and accuracy. Several methods exploit the structure of convolutional layers to build an upper bound on the spectral norm of the linear transform performed by the convolution [33, 52, 3]. While these methods are generally efficient, they can be less relevant in certain settings: in our context, using a loose upper bound of the spectral norm would hinder the expressive power of the layer and make it overly contractive.

We instead rely on the Power Iteration Method (PM), which converges at a geometric rate towards the largest singular value of a matrix. While it can appear computationally expensive due to the large number of iterations required for convergence, it is possible to drastically reduce this number during training. Indeed, as in [42], by considering that the weight matrices change slowly during training, one can perform only one iteration of the PM at each training step and let the algorithm slowly converge along with the training process (a typical training requires approximately 200K steps, whereas about 100 steps of the PM would be enough for convergence). We describe in Algorithm 1 the operations performed during a forward pass with a StableBlock.

However, for evaluation purposes, we need to compute the certified adversarial robustness, which requires ensuring the convergence of the PM. Therefore, at inference time we run enough PM iterations for each layer to reach convergence, which is cheap given the geometric convergence rate. Also note that at inference time, the computation of the spectral norm only needs to be performed once per layer.
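A minimal sketch of the inference-time procedure (our illustration, assuming the dense StableBlock sketched above): run the power method to convergence once per layer before computing certificates.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def power_iteration(weight: torch.Tensor, u: torch.Tensor, n_iter: int = 100):
    """Run the power method for n_iter steps on a dense weight matrix and return
    the updated vector together with the spectral-norm estimate. Meant to be
    called once per layer at evaluation time, before certification."""
    for _ in range(n_iter):
        v = F.normalize(weight @ u, dim=0)
        u = F.normalize(weight.t() @ v, dim=0)
    sigma = torch.linalg.vector_norm(weight @ u)
    return u, sigma

# Usage with the hypothetical StableBlock sketched above:
# block.u, sigma = power_iteration(block.weight.detach(), block.u)
```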

5 Experiments

             Total Depth       Depth Linear      Num. Features         Num. Channels
             5     30    70    5     10    15    1024   2048   4096    30    60    90
Clean        72.8  75.0  74.9  74.9  76.1  76.3  74.3   74.8   75.0    74.8  75.2  75.4
PGD          67.8  70.1  70.2  70.3  71.4  71.6  69.3   70.0   70.3    70.2  70.5  70.7
AutoAttack   66.1  68.3  68.5  68.6  69.7  69.8  67.5   68.3   68.7    68.5  68.7  68.9
Certified    55.1  57.5  57.6  57.9  58.8  58.8  56.9   57.5   57.6    57.6  58.2  58.1
Table 1: Evolution of standard accuracy, certified accuracy and empirical robustness under PGD attack [41] and AutoAttack [16] for different hyperparameters of our Stable Block architecture. We considered a baseline model made of convolutional and fully-connected Stable Block layers, with a fixed number of feature maps on each convolution and a fixed width for each fully-connected Stable Block layer. From left to right, we vary the depth, the number of linear Stable Blocks, the number of feature maps in the convolutions and the width of the fully-connected layers.

To evaluate our new 1-Lipschitz Stable Block layers, we carry out an extensive set of experiments. In this section, we first recall the concurrent approaches that build 1-Lipschitz Neural Networks and stress their limitations. Then, after describing the details of our setting, we summarize our experimental results. By computing the certified and empirical adversarial accuracy of our networks, we show that our architecture is more accurate and robust than competing approaches. Finally, we demonstrate the stability and the scalability of our approach by training very deep neural networks, up to 1000 layers, without normalization tricks or gradient clipping.

5.1 Concurrent Approaches

Projection on an Orthogonal Space.

The works of [39] and [56] (denoted BCOP and Cayley respectively) are the most relevant approaches to ours. Indeed, both consist in projecting the weight matrices onto an orthogonal space in order to preserve gradient norms and enhance adversarial robustness by guaranteeing low Lipschitz constants. While both works have similar objectives, their execution is different. The BCOP layer (Block Convolution Orthogonal Parameterization) uses an iterative algorithm proposed by [9] to orthogonalize the linear transform performed by a convolution; however, several iterations are necessary to guarantee orthogonality, making this method very costly. The method proposed by [56] suffers from similar limitations: it uses the Cayley transform to orthogonalize the weight matrices and, while this algorithm is not iterative, it involves a matrix inversion.

Reshaped Kernel Methods.

It has been shown by [13] and [57] that the spectral norm of a convolution can be upper-bounded by the norm of a reshaped kernel matrix. Consequently, directly orthogonalizing this reshaped matrix upper-bounds the spectral norm of the convolution. While this method is more computationally efficient than orthogonalizing the whole convolution, it lacks expressivity, as the other singular values of the convolution are likely over-constrained. In the following, we denote by RKO and CRKO the “Reshaped Kernel Methods” performed with the BCOP orthogonalization method and the Cayley transform respectively.

5.2 Training and Architectural Details

Experimental setting.

We demonstrate the effectiveness of our approach on a classification task on the CIFAR-10 dataset [35]. In order to make our results comparable with the Cayley method [56], we use the same training configuration: the same batch size and number of epochs, standard data augmentation (i.e., random cropping and flipping), the Adam optimizer [17] without weight decay, and a piecewise triangular learning rate scheduler.

Hyperparameter Study.

We now describe the impact of the hyperparameters on the accuracy of our models. In order to obtain the best performing networks, we tuned several hyperparameters: the depth, the number of feature maps of the convolutions, the width and number of the linear layers at the end of the network, and the learning rate. To study their influence, we considered a baseline model composed of convolutional and fully-connected Stable Block layers, with a fixed number of feature maps on each convolution and a fixed width for each fully-connected layer, and tested multiple variations around this baseline to capture the impact of each hyperparameter. We report the results in Table 1. We observe that depth has an important impact on the performance of the model but saturates beyond a certain total depth. A very important performance gain is obtained by increasing the number of feature maps of the convolutions from 3 to 30, with little gain afterwards. On the other hand, the width of the linear layers has a small impact on performance.

                 Ours    Cayley   0.85-Cayley   BCOP   RKO   CRKO
Clean            78.56
PGD              74.33
AutoAttack       72.72
Certified        61.13
Emp. Lipschitz
Table 2: Accuracy of models trained without normalizing the inputs. Robust accuracy is computed for a fixed ℓ_2 perturbation radius ε. Experiments on our model are averaged over several runs: mean and standard deviation are reported. Results for the other models are reported from [56]. Our model outperforms the other methods both in empirical and certified robust accuracy.
Ours ResNet9 WideResnet
Cayley BCOP Cayley BCOP
Clean 83.41 82.99 81.39
PGD 77.23 76.02 74.56
AutoAttack 75.04 73.16 71.86
Table 3: Empirical adversarial robustness of models trained with normalized inputs. The results are averaged over 5 runs: mean and standard deviation are reported. Results for the other models are reported from [56]. This table shows that our model outperforms the best existing approaches.

5.3 Results

Figure 1: Certified accuracy and accuracy under PGD attack as a function of the perturbation radius ε for our model. Dashed and plain lines correspond to the model with normalized and unnormalized inputs respectively.
Figure 2: Standard test accuracy as a function of the number of epochs (log scale) for various depths of our neural networks.

In this section, we present our results on adversarial robustness, both provable and empirical. After investigating the impact of hyperparameters on our architecture, we fixed the architecture to a stack of convolutional Stable Block layers followed by fully-connected Stable Block layers.

Certified Adversarial Robustness.

In order to train 1-Lipschitz Residual Networks, we used the architecture and techniques presented in Section 4 and we did not normalize the inputs. We report our results in Table 2 as well as in Figure 1. Table 2 compares the performance of our architecture against other state-of-the-art 1-Lipschitz networks (presented in Section 5.1). Our network outperforms every other approach on every measured metric: we outperform the other methods by several points on clean accuracy as well as under the Projected Gradient Descent (PGD) attack of [41] and AutoAttack [16]. Furthermore, our model obtains a certified accuracy more than two points above the previous best approach. Finally, Figure 1 shows the evolution of the certified accuracy and of the empirical accuracy under PGD attack as a function of the radius ε. We observe a substantial gap between the empirical robustness and the certifiable robustness, probably due to the over-estimation of the Lipschitz constant.

Empirical Adversarial Robustness.

We report in Table 3 the results for models with normalized inputs. Our model outperforms the Cayley models on all tasks, gaining both in clean accuracy and in robust accuracy under PGD attack. In comparison, with the same number of layers, an undefended ResNet reaches a higher clean accuracy but collapses under attack. Figure 1 shows that the models without normalized inputs reach better levels of robustness when ε becomes large, maintaining a non-trivial accuracy under PGD attack even for large radii. For comparison, the best empirically robust model in the literature relies on a ResNet50 [4] with Adversarial Training [41] and additional unlabeled data.

5.4 Training stability: scaling up to 1000 layers

While the Residual Network architecture limits, by design, vanishing gradient issues, it still suffers from exploding gradients in many cases [26]. To prevent such scenarios, batch normalization layers [32] are used in most Residual Networks to stabilize training.

Recently, several works [42, 19] have proposed to normalize the linear transformation of each layer by its spectral norm. Such a method limits exploding gradients but can again suffer from vanishing gradients: spectral normalization might be too restrictive, since dividing by the spectral norm can make the other singular values vanishingly small. Orthogonal projections prevent both exploding and vanishing issues, but they are more computationally expensive (whereas spectral normalization can be done with Power Method iterations).

On the contrary, the architecture proposed in this paper naturally controls the norm of the gradient of the output with respect to any given layer. Therefore, our architecture can get the best of both worlds: limiting exploding and vanishing gradients while maintaining scalability. To demonstrate this, we test the ability of our architecture to scale to very high depths (up to 1000 layers) without any additional normalization or regularization tricks such as Dropout [53], Batch Normalization [32] or gradient clipping [44]. Alongside the work of [64], which leverages Dynamical Isometry and a Mean Field Theory to train a 10,000-layer neural network, we believe, to the best of our knowledge, to be the second to perform such a training. For the sake of computational efficiency, we limit this experiment to architectures with a reduced number of feature maps. We report the accuracy over epochs for our architecture in Figure 2 for a varying number of convolutional layers. It is worth noting that for the deepest networks, it may take a few epochs before convergence starts. As in [64], we remark that there is no gain in using very deep architectures for this task.

6 Conclusion

In this paper, we presented a new generic method to build 1-Lipschitz layers. We leverage the dynamical system interpretation of Residual Networks and show that using convex potential flows naturally defines 1-Lipschitz neural networks. After proposing a parametrization based on Input Convex Neural Networks [1], we show that our models are able to reach state-of-the-art results in classification accuracy and robustness in comparison with other existing 1-Lipschitz approaches. We also experimentally show that our layers provide a scalable approach for training very deep architectures without further regularization.

Exploiting the ResNet architecture to devise flows has been an important research topic. For example, in the context of generative modeling, Invertible Neural Networks [6] and Normalizing Flows [48, 60] are both important lines of research. More recently, Sylvester Normalizing Flows [58] and Convex Potential Flows [29] have developed ideas similar to the present work, but for very different settings and applications. In particular, they did not focus on the contraction property of convex flows, and the link with adversarial robustness has remained under-exploited.

Further work.

Our models may not express all 1-Lipschitz functions. Knowing which functions can be approximated by our layers is difficult even in the linear case; nevertheless, this is clearly an important question that requires further investigation. One could also extend our work through the study of other dynamical systems: for instance, recent architectures such as Hamiltonian Networks [24] and Momentum Networks [49] exhibit interesting properties. Finally, we hope to extend our approach to Recurrent Neural Networks [51] and Transformers [59].

References

  • Amos et al. [2017] Brandon Amos, Lei Xu, and J Zico Kolter. Input convex neural networks. In International Conference on Machine Learning, 2017.
  • Anil et al. [2019] Cem Anil, James Lucas, and Roger Grosse. Sorting out lipschitz function approximation. In International Conference on Machine Learning, 2019.
  • Araujo et al. [2021] Alexandre Araujo, Benjamin Negrevergne, Yann Chevaleyre, and Jamal Atif. On lipschitz regularization of convolutional layers using toeplitz matrix theory. In Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021.
  • Augustin et al. [2020] Maximilian Augustin, Alexander Meinke, and Matthias Hein. Adversarial robustness on in-and out-distribution improves explainability. In European Conference on Computer Vision, pages 228–245. Springer, 2020.
  • Bartlett et al. [2017] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, 2017.
  • Behrmann et al. [2019] Jens Behrmann, Will Grathwohl, Ricky TQ Chen, David Duvenaud, and Jörn-Henrik Jacobsen. Invertible residual networks. In International Conference on Machine Learning, 2019.
  • Béthune et al. [2021] Louis Béthune, Alberto González-Sanz, Franck Mamalet, and Mathieu Serrurier. The many faces of 1-lipschitz neural networks. arXiv preprint arXiv:2104.05097, 2021.
  • Biggio et al. [2013] Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Joint European conference on machine learning and knowledge discovery in databases, 2013.
  • Björck et al. [1971] Åke Björck et al. An iterative algorithm for computing the best estimate of an orthogonal matrix. SIAM Journal on Numerical Analysis, 1971.
  • Bosch [1987] AJ Bosch. Note on the factorization of a square matrix into two hermitian or symmetric matrices. SIAM Review, 29(3):463–468, 1987.
  • Carlini et al. [2017] Nicholas Carlini et al. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, 2017.
  • Chen et al. [2018] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, 2018.
  • Cisse et al. [2017] Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In International Conference on Machine Learning, 2017.
  • Cohen et al. [2019] Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, 2019.
  • Combettes and Pesquet [2020] Patrick L Combettes and Jean-Christophe Pesquet. Lipschitz certificates for layered network structures driven by averaged activation operators. SIAM Journal on Mathematics of Data Science, 2020.
  • Croce et al. [2020] Francesco Croce et al. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In International Conference on Machine Learning, 2020.
  • Kingma and Ba [2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2014.
  • E [2017] Weinan E. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 2017.
  • Farnia et al. [2019] Farzan Farnia, Jesse Zhang, and David Tse. Generalizable adversarial training via spectral normalization. In International Conference on Learning Representations, 2019.
  • Fazlyab et al. [2019] Mahyar Fazlyab, Alexander Robey, Hamed Hassani, Manfred Morari, and George Pappas. Efficient and accurate estimation of lipschitz constants for deep neural networks. In Advances in Neural Information Processing Systems, 2019.
  • Golub et al. [2000] Gene H Golub et al. Eigenvalue computation in the 20th century. Journal of Computational and Applied Mathematics, 2000.
  • Goodfellow et al. [2015] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
  • Gouk et al. [2021] Henry Gouk, Eibe Frank, Bernhard Pfahringer, and Michael J Cree. Regularisation of neural networks by enforcing lipschitz continuity. Machine Learning, 2021.
  • Greydanus et al. [2019] Samuel J Greydanus, Misko Dzamba, and Jason Yosinski. Hamiltonian neural networks. In Advances in Neural Information Processing Systems, 2019.
  • Haber et al. [2017] Eldad Haber et al. Stable architectures for deep neural networks. Inverse problems, 2017.
  • Hayou et al. [2021] Soufiane Hayou, Eugenio Clerico, Bobby He, George Deligiannidis, Arnaud Doucet, and Judith Rousseau. Stable resnet. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, 2021.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • Hochreiter et al. [2001] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Jürgen Schmidhuber, et al. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.
  • Huang et al. [2021] Chin-Wei Huang, Ricky T. Q. Chen, Christos Tsirigotis, and Aaron Courville. Convex potential flows: Universal probability distributions with optimal transport and convex optimization. In International Conference on Learning Representations, 2021.
  • Huang et al. [2020a] Lei Huang, Li Liu, Fan Zhu, Diwen Wan, Zehuan Yuan, Bo Li, and Ling Shao. Controllable orthogonalization in training dnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020a.
  • Huang et al. [2020b] Yifei Huang, Yaodong Yu, Hongyang Zhang, Yi Ma, and Yuan Yao. Adversarial robustness of stabilized neuralodes might be from obfuscated gradients. Mathematical and Scientific Machine Learning, 2020b.
  • Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
  • Jia et al. [2017] Kui Jia, Dacheng Tao, Shenghua Gao, and Xiangmin Xu. Improving training of deep neural networks via singular value bounding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • Kautsky and Turcajová [1994] Jaroslav Kautsky and Radka Turcajová. A matrix approach to discrete wavelets. In Wavelet Analysis and Its Applications. Elsevier, 1994.
  • Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Kurakin et al. [2016] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
  • Lecuyer et al. [2018] M. Lecuyer, V. Atlidakis, R. Geambasu, D. Hsu, and S. Jana. Certified robustness to adversarial examples with differential privacy. In 2019 IEEE Symposium on Security and Privacy (SP), 2018.
  • Li et al. [2020] Mingjie Li, Lingshen He, and Zhouchen Lin. Implicit euler skip connections: Enhancing adversarial robustness via numerical stability. In International Conference on Machine Learning, pages 5874–5883. PMLR, 2020.
  • Li et al. [2019] Qiyang Li, Saminul Haque, Cem Anil, James Lucas, Roger B Grosse, and Joern-Henrik Jacobsen. Preventing gradient attenuation in lipschitz constrained convolutional networks. In Advances in Neural Information Processing Systems, 2019.
  • Lu et al. [2018] Y. Lu, A. Zhong, Q. Li, and B. Dong. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. In Proceedings of the 35th International Conference on Machine Learning, 2018.
  • Madry et al. [2018] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
  • Miyato et al. [2018] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
  • Moosavi-Dezfooli et al. [2019] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Jonathan Uesato, and Pascal Frossard. Robustness via curvature regularization, and vice versa. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • Pascanu et al. [2013] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International conference on machine learning, 2013.
  • Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • Pinot et al. [2019] Rafael Pinot, Laurent Meunier, Alexandre Araujo, Hisashi Kashima, Florian Yger, Cédric Gouy-Pailler, and Jamal Atif. Theoretical evidence for adversarial robustness through randomization. In Advances in Neural Information Processing Systems, 2019.
  • Raghunathan et al. [2018] Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. In International Conference on Learning Representations, 2018.
  • Rezende and Mohamed [2015] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International conference on machine learning, 2015.
  • Sander et al. [2021] Michael E. Sander, Pierre Ablin, Mathieu Blondel, and Gabriel Peyré. Momentum residual neural networks. In International Conference on Machine Learning, 2021.
  • Sedghi et al. [2018] Hanie Sedghi, Vineet Gupta, and Philip Long. The singular values of convolutional layers. In International Conference on Learning Representations, 2018.
  • Sherstinsky [2020] Alex Sherstinsky. Fundamentals of recurrent neural network (rnn) and long short-term memory (lstm) network. Physica D: Nonlinear Phenomena, 404:132306, 2020.
  • Singla et al. [2021] Sahil Singla et al. Fantastic four: Differentiable and efficient bounds on singular values of convolution layers. In International Conference on Learning Representations, 2021.
  • Srivastava et al. [2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014.
  • Steinberg [2005] Daureen Steinberg. Computation of matrix norms with applications to robust optimization. Research thesis, Technion-Israel University of Technology, 2005.
  • Szegedy et al. [2014] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
  • Trockman et al. [2021] Asher Trockman et al. Orthogonalizing convolutional layers with the cayley transform. In International Conference on Learning Representations, 2021.
  • Tsuzuku et al. [2018] Yusuke Tsuzuku, Issei Sato, and Masashi Sugiyama. Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks. In Advances in Neural Information Processing Systems, 2018.
  • van den Berg et al. [2018] Rianne van den Berg, Leonard Hasenclever, Jakub Tomczak, and Max Welling. Sylvester normalizing flows for variational inference. In proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2018.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • Verine et al. [2021] Alexandre Verine, Yann Chevaleyre, Fabrice Rossi, and benjamin negrevergne. On the expressivity of bi-lipschitz normalizing flows. In ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, 2021.
  • Virmaux and Scaman [2018] Aladin Virmaux and Kevin Scaman. Lipschitz regularity of deep neural networks: analysis and efficient estimation. In Advances in Neural Information Processing Systems, 2018.
  • Wang et al. [2020] Jiayun Wang, Yubei Chen, Rudrasis Chakraborty, and Stella X. Yu. Orthogonal convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.
  • Wong et al. [2018] Eric Wong, Frank Schmidt, Jan Hendrik Metzen, and J. Zico Kolter. Scaling provable adversarial defenses. In Advances in Neural Information Processing Systems, 2018.
  • Xiao et al. [2018] Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of cnns: How to train 10,000-layer vanilla convolutional neural networks. In International Conference on Machine Learning, 2018.
  • Yoshida and Miyato [2017] Yuichi Yoshida and Takeru Miyato. Spectral norm regularization for improving the generalizability of deep learning. arXiv preprint arXiv:1705.10941, 2017.

Appendix A Proofs

A.1 Proof of Proposition 2

Before proving this proposition we recall the following lemma.

Lemma 1 ([10]).

Let $M \in \mathbb{R}^{d \times d}$. There exist two symmetric matrices $S_1, S_2 \in \mathbb{R}^{d \times d}$, with $S_1$ invertible, such that $M = S_1 S_2$.

We are now set to prove Proposition 2.

Proof.

Let $g$ be a continuously differentiable function. For all $x$, its Jacobian is a square matrix; then, thanks to the previous lemma, it can be written as the product of two symmetric matrices, one of them invertible.

If $f_1$ and $f_2$ are two twice differentiable functions, then the discretized flow defines the composition of the two corresponding gradient steps. These factors being symmetric, there exists a twice differentiable potential whose gradient step realizes the first one, and a second potential is defined similarly. It is not necessary to define these potentials outside of a bounded set containing the trajectories. For such choices, the construction matches $g$ up to an additive constant, and by choosing this constant appropriately we obtain $x_2 = g(x)$, hence the result.

A.2 Proof of Proposition 3

Proof.

Consider the time derivative of the squared difference between the two flows:

$$\frac{\mathrm{d}}{\mathrm{d}t}\,\frac{1}{2}\|x(t) - \tilde{x}(t)\|_2^2 = -\big\langle x(t) - \tilde{x}(t),\ \nabla_x f_t(x(t)) - \nabla_x f_t(\tilde{x}(t)) \big\rangle \le 0.$$

The last inequality derives directly from the usual characterization of convexity (monotonicity of the gradient). Therefore the quantity $\|x(t) - \tilde{x}(t)\|_2^2$ is a decreasing function of time. In particular, $\|x(t) - \tilde{x}(t)\|_2 \le \|x(0) - \tilde{x}(0)\|_2$, which concludes the proof. ∎

A.3 Proof of Proposition 4

Proof.

With $x_{t+1} = x_t - h_t \nabla f_t(x_t)$ and $\tilde{x}_{t+1} = \tilde{x}_t - h_t \nabla f_t(\tilde{x}_t)$, we can write:

$$\|x_{t+1} - \tilde{x}_{t+1}\|_2^2 = \|x_t - \tilde{x}_t\|_2^2 - 2 h_t \big\langle x_t - \tilde{x}_t,\ \nabla f_t(x_t) - \nabla f_t(\tilde{x}_t) \big\rangle + h_t^2 \|\nabla f_t(x_t) - \nabla f_t(\tilde{x}_t)\|_2^2.$$

This equality allows us to derive the equivalence between $\|x_{t+1} - \tilde{x}_{t+1}\|_2 \le \|x_t - \tilde{x}_t\|_2$ and:

$$h_t \|\nabla f_t(x_t) - \nabla f_t(\tilde{x}_t)\|_2^2 \le 2 \big\langle x_t - \tilde{x}_t,\ \nabla f_t(x_t) - \nabla f_t(\tilde{x}_t) \big\rangle.$$

Moreover, $f_t$ being convex and $L_t$-smooth, the co-coercivity of its gradient gives:

$$\|\nabla f_t(x_t) - \nabla f_t(\tilde{x}_t)\|_2^2 \le L_t \big\langle x_t - \tilde{x}_t,\ \nabla f_t(x_t) - \nabla f_t(\tilde{x}_t) \big\rangle.$$

We can see with this last inequality that if we enforce $h_t \le \frac{2}{L_t}$, we get $\|x_{t+1} - \tilde{x}_{t+1}\|_2 \le \|x_t - \tilde{x}_t\|_2$, which concludes the proof. ∎

Appendix B Expressivity of discretized convex potential flows

Let us denote by $\mathcal{S}$ the space of real symmetric $d \times d$ matrices with eigenvalues bounded by $1$ in absolute value, and by $\mathcal{M}$ the space of real $d \times d$ matrices with singular values bounded by $1$. One can prove that the set of finite products of elements of $\mathcal{S}$ is not the whole of $\mathcal{M}$: there exist a matrix $M \in \mathcal{M}$ and $\varepsilon > 0$ such that $\|M - S_k \cdots S_1\| \ge \varepsilon$ for all $k$ and all matrices $S_1, \dots, S_k \in \mathcal{S}$. (A proof and justification of this result can be found here: https://mathoverflow.net/questions/60174/factorization-of-a-real-matrix-into-hermitian-x-hermitian-is-it-stable)

Applied to the expressivity of discretized convex potential flows, the previous result means that there exists a 1-Lipschitz linear function that cannot be approximated by a discretized flow, of any depth, of convex $L$-smooth potentials with linear gradients as in Proposition 4. Indeed, such a flow would write $x \mapsto (I - h_T H_T) \cdots (I - h_1 H_1)\, x$, where the $H_t$ are symmetric positive semi-definite matrices with eigenvalues in $[0, L_t]$ and $h_t \le 2 / L_t$; in other words, such transformations are exactly described by products $S_T \cdots S_1$ for some $S_t \in \mathcal{S}$.

Appendix C Additional experiments

C.1 Relaxing linear layers

             h = 1.0   h = 0.1   h = 0.01
Clean        85.10     82.23     78.53
PGD          61.45     62.99     60.98
AutoAttack   57.03     58.82     57.63

The table above shows the results of the relaxed training of our StableBlock architecture, i.e., we fix the step size h in the discretized convex potential flow of Proposition 4 instead of using 2/‖W‖₂². Increasing the constant allows for an important improvement in clean accuracy, but we lose robust empirical accuracy. While computing the certified accuracy is not possible in this case due to the unknown value of the Lipschitz constant, we can still notice that the training of the networks remains stable without normalization tricks and that the resulting models offer a non-negligible level of robustness.
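For reference, the relaxation amounts to replacing the adaptive step 2/‖W‖₂² of the StableBlock with a fixed constant h (a sketch of our own, reusing PyTorch functional calls):

```python
import torch
import torch.nn.functional as F

def relaxed_stable_block(x, W, b, h=0.1):
    # z = x - h * W^T relu(W x + b): fixing h instead of 2 / ||W||_2^2 removes
    # the spectral normalization, so Proposition 4 no longer applies and the
    # Lipschitz constant of the resulting network is unknown.
    return x - h * F.linear(F.relu(F.linear(x, W, b)), W.t())
```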

C.2 Effect of longer trainings

             Epochs
             50     100    200    400
Clean        76.7   78.5   80.1   80.3
PGD          73.2   74.3   75.0   75.1
AutoAttack   71.9   72.7   73.2   73.1
Certified    59.5   61.1   60.6   59.9

The table above shows the evolution of the clean accuracy, empirical robust accuracy and certified accuracy as a function of the training duration in epochs. We observe that the clean accuracy increases from 50 epochs to 400 without overfitting. However, the certified accuracy peaks at around 100 epochs and starts decreasing afterwards, suggesting a trade-off between higher standard accuracy and certified accuracy.