The Many Faces of 1-Lipschitz Neural Networks

04/11/2021 · Louis Béthune et al.

Lipschitz constrained models have been used to solve specific deep learning problems, such as the estimation of the Wasserstein distance in GANs, or the training of neural networks robust to adversarial attacks. Despite novel and effective algorithms to build such 1-Lipschitz networks, their usage remains marginal, and they are commonly considered less expressive and less able to fit the data than their unconstrained counterparts. The goal of the paper is to demonstrate that, despite being empirically harder to train, 1-Lipschitz neural networks are theoretically better grounded than unconstrained ones when it comes to classification. To achieve that, we recall some results about 1-Lipschitz functions in the scope of deep learning, and we extend and illustrate them to derive general properties for classification. First, we show that 1-Lipschitz neural networks can fit arbitrarily difficult decision frontiers, making them as expressive as classical ones. When minimizing the log loss, we prove that the optimization problem under Lipschitz constraint is well posed and has a minimum, whereas regular neural networks can diverge even in remarkably simple situations. Then, we study the link between classification with 1-Lipschitz networks and optimal transport, thanks to regularized versions of Kantorovich-Rubinstein duality theory. Last, we derive preliminary bounds on their VC dimension.


1 Introduction

The Lipschitz constant of neural networks has attracted a great deal of attention in the last few years. 1-Lipschitz networks were first used to estimate the Wasserstein distance between two probability distributions, thanks to the Kantorovich-Rubinstein duality, in the seminal work of [3].

Deep neural networks are known to be vulnerable to adversarial attacks [29]: a carefully chosen small shift to the input, usually indistinguishable from noise, can change the class prediction. This is mainly due to the Lipschitz constant of neural networks, which can grow arbitrarily high when unconstrained. One possible defense against adversarial attacks is to constrain the Lipschitz constant of the network [20], which provides provable robustness guarantees, together with improved generalization [27] and interpretability of the model [31].

These approaches propose different ways to control the Lipschitz constant of the network precisely, such as gradient penalty, spectral normalization, and orthogonalization of the weight matrices. These algorithms greatly facilitate the learning of 1-Lipschitz networks, and 1-Lipschitz functions are known to have desirable properties for machine learning. However, they are primarily used in the areas described above and are not widely applied across deep learning applications. The reasons commonly invoked against 1-Lipschitz networks are that they are much more difficult to train and far less expressive than unconstrained networks, making them rarely competitive on challenging benchmarks.

The goal of the paper is to demonstrate that, despite being empirically harder to train, 1-Lipschitz neural networks are theoretically better grounded than unconstrained ones when it comes to classification. First, while the common belief that Lipschitz constrained neural networks are less expressive than their unconstrained counterparts is obviously true for regression, this intuition fades when it comes to classification. Indeed, if $f$ is a $k$-Lipschitz neural network, then $f/k$ is a 1-Lipschitz neural network with the same decision frontier. This corroborates the findings of [41] that high accuracy and high robustness are not necessarily antagonistic objectives. We also explore the statistical and optimization properties of 1-Lipschitz networks in Section 4 and their VC dimension in Section 5.
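As a minimal numerical check of this rescaling argument (a sketch only; the linear "network" and the names W, k, f, g below are illustrative placeholders, not the paper's code), dividing the logits of a k-Lipschitz model by k yields a 1-Lipschitz model with identical class predictions:

```python
import numpy as np

def predict(logits):
    """Binary decision from a scalar logit: the sign of f(x)."""
    return np.sign(logits)

# Hypothetical k-Lipschitz model: a random linear map with spectral norm k.
rng = np.random.default_rng(0)
W = rng.normal(size=(1, 5))
k = np.linalg.norm(W, ord=2)          # Lipschitz constant of x -> W @ x

f = lambda x: W @ x                   # k-Lipschitz logits
g = lambda x: (W / k) @ x             # 1-Lipschitz logits, same decision frontier

x = rng.normal(size=(5, 100))
assert np.array_equal(predict(f(x)), predict(g(x)))  # identical class predictions
```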

The main contribution of the paper is to recall known results about 1-Lipschitz functions in machine learning and to extend them in order to give a general view of the multiple interests of this class of functions for deep learning. The first section is devoted to the state of the art of 1-Lipschitz neural networks. Then we show that 1-Lipschitz neural classifiers are able to learn arbitrarily complex decision frontiers and are optimal when it comes to robustness. In the third section, we prove that unconstrained optimization of binary cross entropy is an ill-posed optimization problem that becomes well posed when considering the class of 1-Lipschitz networks. We demonstrate that, even on very simple toy examples, unconstrained neural networks may have an arbitrarily high Lipschitz constant, which leads to overfitting. In the last sections, we outline the link between classification with 1-Lipschitz functions and optimal transport, and we show that the class of 1-Lipschitz functions with margin has a finite VC dimension and hence provable error bounds.

Figure 1: HMF (see Proposition 1) on a two-dimensional example, with 3D (a) and 2D (b) plots. The frontier can be highly complex and irregular (here the Von Koch Snowflake); this does not prevent the classifier from being 1-Lipschitz. The support of one class is the interior ring, while the center and the exterior correspond to the other class. In (c) we train a 1-Lipschitz neural network with Mean Squared Error to fit the ground truth (160 000 pixels) for 20 epochs. The orange strip corresponds to the zone where the function is close to zero. With a Mean Absolute Error of 0.52, the learned function is visually indistinguishable from its ground truth.

2 Notations and related work

2.1 Notations

In the following sections we will focus on classification tasks defined over $\mathbb{R}^n$. The label set is $\{-1, +1\}$ for binary classification and $\{1, \ldots, K\}$ with $K > 2$ for multi-class. The observations $(x, y)$ are sampled from a joint distribution over inputs and labels. The support of the input distribution is denoted $\mathcal{X}$. The goal is to learn a classifier modeling $P(y \mid x)$.

The Lipschitz constant of a function $f$ is defined as the smallest $L \ge 0$ such that for all $x_1, x_2$ we have $\|f(x_1) - f(x_2)\| \le L \|x_1 - x_2\|$; in this case $f$ is said to be $L$-Lipschitz. For simplicity we will focus on the Euclidean norm in the rest of the paper. The set of $L$-Lipschitz functions over $\mathcal{X}$ will be denoted $\mathrm{Lip}_L(\mathcal{X})$. The gradient of $f$ w.r.t. $x$ will be written $\nabla_x f$, and its Jacobian $J_f$. The norm of any matrix must be understood as the operator norm.
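The definition suggests a simple (and crude) way to lower-bound a Lipschitz constant empirically by sampling pairs of points; a sketch, assuming a generic vector-valued callable `f`:

```python
import numpy as np

def lipschitz_lower_bound(f, dim, n_pairs=10000, scale=1.0, seed=0):
    """Monte-Carlo lower bound on the Lipschitz constant of f w.r.t. the Euclidean norm.

    Samples random pairs (x1, x2) and returns the largest observed ratio
    ||f(x1) - f(x2)|| / ||x1 - x2||.  This only certifies a lower bound:
    the true constant can be larger (computing it exactly is NP-hard).
    """
    rng = np.random.default_rng(seed)
    x1 = rng.normal(scale=scale, size=(n_pairs, dim))
    x2 = rng.normal(scale=scale, size=(n_pairs, dim))
    num = np.linalg.norm(f(x1) - f(x2), axis=-1)
    den = np.linalg.norm(x1 - x2, axis=-1)
    return float(np.max(num / den))

# Example: a linear map has Lipschitz constant equal to its spectral norm.
A = np.array([[3.0, 0.0], [0.0, 1.0]])
f = lambda x: x @ A.T
print(lipschitz_lower_bound(f, dim=2))   # close to, but never above, 3.0
```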

In the rest of the paper, “unconstrained neural network” must be understood as any feed-forward network of fixed depth (without recurrent mechanisms) with parametrized affine layers (including convolutions) and elementwise activation functions (such as ReLU, sigmoid, tanh, and other popular variants). Such neural networks are known to be Lipschitz [25], but with no knowledge of their constant. The last layer is assumed to have no activation function (so as to produce logits), since this last activation can be merged into the loss function.

2.2 Related work

A neural network is a composition of linear and non-linear functions. As a composition of functions, the Lipschitz constant of a multilayer network is upper bounded by the product of the Lipschitz constants of its layers. However, evaluating the Lipschitz constant exactly is known to be an NP-hard problem [36], and many approaches have been proposed to estimate it. Clipping the weights of a network, as in Wasserstein GAN [3], or weight regularisation, are ways to constrain the Lipschitz constant of a network; however, they give no guarantee about its real value, only a very crude upper bound. Gradient penalty [12] and spectral regularization [42] allow for better control of the constant, but still with no guarantees. Normalizing by the Frobenius norm [24] leads to a tighter upper bound, but spectral normalization, as proposed in [23], gives better guarantees and allows each dense layer to be exactly 1-Lipschitz. However, spectral normalization results in neural networks with an effective Lipschitz constant far smaller than 1, leading to vanishing gradients. While most activation functions are 1-Lipschitz (including ReLU, sigmoid, tanh), some other layers, such as Attention, are not even Lipschitz [16]. Some attempts have been made to propose Lipschitz recurrent units [9].
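To make the contrast between these normalization schemes concrete, here is a minimal sketch of spectral normalization by power iteration (an illustrative NumPy version, not the implementations of [23] or [24]): dividing a weight matrix by an estimate of its largest singular value makes the corresponding affine layer at most 1-Lipschitz.

```python
import numpy as np

def spectral_normalize(W, n_iter=50, eps=1e-12):
    """Return W / sigma_max(W), with sigma_max estimated by power iteration.

    The resulting matrix has spectral norm (approximately) 1, so the affine layer
    x -> W_sn @ x + b is (approximately) 1-Lipschitz.  Note that the *other*
    singular values are untouched: they may be much smaller than 1, which is
    one source of the vanishing-gradient behaviour mentioned above.
    """
    u = np.random.default_rng(0).normal(size=W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= (np.linalg.norm(v) + eps)
        u = W @ v
        u /= (np.linalg.norm(u) + eps)
    sigma = u @ W @ v                      # Rayleigh-quotient estimate of sigma_max
    return W / (sigma + eps)

W = np.random.default_rng(1).normal(size=(64, 32))
W_sn = spectral_normalize(W)
print(np.linalg.svd(W_sn, compute_uv=False)[0])   # largest singular value ~ 1.0
```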

In [12], the authors show that the optimal solution of the Kantorovich-Rubinstein dual transport problem verifies $\|\nabla f\| = 1$ almost everywhere. [2] proved that such functions are dense in the set of 1-Lipschitz functions w.r.t. uniform convergence. They also establish that if a neural network verifies $\|\nabla f\| = 1$ almost everywhere and uses elementwise non-linearities, then it is an affine function. They proposed a sorting activation function to circumvent this issue, in combination with Björck orthonormalization [6] to ensure that all eigenvalues of the weight matrices are close to 1. Note that this property does not hold directly when dealing with convolutional layers, but it can be guaranteed in this case by using Block Convolution Orthogonal Parameterization (BCOP) [20, 38, 21]. GroupSort [30] is another useful activation function for those architectures.
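For intuition, a minimal sketch of the first-order Björck orthonormalization iteration, one common way to push all singular values of a weight matrix towards 1 (an illustrative NumPy version under the stated assumptions, not the deel-lip implementation):

```python
import numpy as np

def bjorck_orthonormalize(W, n_iter=50):
    """First-order Bjorck iteration: W <- 1.5 W - 0.5 W W^T W.

    After rescaling so that the spectral norm is at most 1 (e.g. via spectral
    normalization), the iteration converges towards an orthonormal matrix:
    *all* singular values tend to 1, instead of only the largest one being
    bounded by 1.
    """
    W = W / np.linalg.norm(W, ord=2)      # ensure convergence of the iteration
    for _ in range(n_iter):
        W = 1.5 * W - 0.5 * W @ W.T @ W
    return W

W = np.random.default_rng(0).normal(size=(32, 32))
W_orth = bjorck_orthonormalize(W)
print(np.linalg.svd(W_orth, compute_uv=False))    # all singular values approach 1.0
```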

Orthogonal kernels are of special interest in the context of normalizing flows [13]. Optimization over the orthogonal group has been studied in [19, 15] with tools like the Stiefel manifold and the Cayley transform, or stochastic optimization [8].

Adversarial attacks (see [43] and references therein) are small perturbations in input space, invisible to humans, but nonetheless able to change the class prediction. This is a vulnerability of modern deep neural networks, making them unsuitable for critical applications. Adversarial training [22] leads to empirical improvements but fails to provide certificates. Certificates can be produced by bounding the Lipschitz constant, using extreme value theory [40], linear approximations [39] or polynomial optimization [17]. In [32], the control of the Lipschitz constant and of margins is used to guarantee robustness against attacks. In [26], the authors link classification with optimal transport by considering a hinge-regularized version of the Kantorovich-Rubinstein optimization. They build provably robust classifiers using 1-Lipschitz neural networks with $\|\nabla f\| = 1$ almost everywhere.

In [37] Lipschitz classifiers are shown to be isometrically isomorphic to linear large margin classifiers over some Banach space, into which the data have been embedded. Generalization bounds are provided by the work of [11] using Vapnik–Chervonenkis theory.

2.3 Experimental setting

All the experiments in the paper use the Deel.Lip library (https://github.com/deel-ai/deel-lip) developed for [26]. The networks verify $\|\nabla f\| = 1$ almost everywhere thanks to 1) orthogonal matrices in the affine layers and 2) layerwise sorting activation functions. GroupSort2 [30] sorts consecutive pairs of pre-activations: each pair $(x_{2i}, x_{2i+1})$ is mapped to $(\min(x_{2i}, x_{2i+1}), \max(x_{2i}, x_{2i+1}))$. Spectral normalization and the Björck orthonormalization algorithm ensure that most singular values are equal to 1. This implementation is based on the seminal work of [2], who first proved that those functions are dense in the space of 1-Lipschitz functions. With a dilation by a constant $L$ we can parameterize the set of $L$-Lipschitz functions. We minimize losses using the Adam optimizer. We did not use convolutions because we could not formally guarantee the gradient-norm property in the current implementation.
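A sketch of the GroupSort2 activation described above (a standalone NumPy illustration; the experiments themselves rely on the deel-lip implementation): it sorts consecutive pairs of pre-activations, is 1-Lipschitz, and preserves gradient norms.

```python
import numpy as np

def groupsort2(x):
    """Sort each consecutive pair (x_{2i}, x_{2i+1}) of the last axis in increasing order.

    On each linear region the operation is a permutation of the coordinates,
    hence 1-Lipschitz and gradient-norm preserving, yet non-linear overall.
    Assumes an even number of features on the last axis.
    """
    *batch, n = x.shape
    pairs = x.reshape(*batch, n // 2, 2)
    return np.sort(pairs, axis=-1).reshape(*batch, n)

z = np.array([3.0, -1.0, 0.5, 2.0])
print(groupsort2(z))          # [-1.  3.   0.5  2. ]
```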

3 1-Lipschitz classifier

In this section we show that 1-Lipschitz functions are as powerful as any other classifiers, including their unconstrained counterparts. In particular, when classes are separable they can achieve 100% accuracy. In the non-separable case, the optimal Bayes classifier can nonetheless be imitated.

We also recall the main property of 1-Lipschitz neural networks: their ability to produce robustness certificates against adversarial examples.

3.1 Decision frontier fitting

Proposition 1 (Lipschitz Binary classification).

For any binary classifier $c$ with closed pre-images ($c^{-1}(\{k\})$ is a closed set for $k \in \{-1, +1\}$), there exists a 1-Lipschitz function $f$ such that $\operatorname{sign}(f) = c$ wherever $c$ is defined. Moreover, everywhere the gradient is defined we have $\|\nabla f\| = 1$. Note that $c$ does not need to be defined everywhere on $\mathbb{R}^n$, while $f$ is.

Proof.

(Sketch, full proof in Appendix).

Definition 1 (Highest Mountain Function)

Let $F$ be the maximum margin frontier: the set of points equidistant to the two pre-images. We define the binary Highest Mountain Function (HMF) as the signed distance to $F$ (positive on one class, negative on the other); it implicitly depends on the classifier $c$.

Then this function verifies the aforementioned properties. ∎

Similar results can be obtained for multiclass classification.

Proposition 2 (Lipschitz Multiclass classification).

For any multiclass classifier with closed pre-images, there exists a 1-Lipschitz function $f$ with one component per class whose induced classifier (taking the argmax of the components) coincides with it wherever it is defined. Moreover, the unit gradient norm property of Proposition 1 holds for each component wherever the gradient is defined.

For simplicity, in the following sections we will focus on the binary classification case only, and $P_+$ (resp. $P_-$) will denote the distribution of class $+1$ (resp. $-1$), assuming that they have the same probability mass (balanced case).

With these propositions in mind, or re-using the rescaling argument sketched in Section 1, we can deduce Corollary 1.

Corollary 1 (1-Lipschitz Networks are as powerful as unconstrained ones).

For any neural network there exists a 1-Lipschitz neural network with the same decision frontier, hence the same accuracy.

The Error of a classifier is defined as its probability of misclassifying a sample drawn from the data distribution. The Risk of a classifier is defined as the excess of its Error over that of the optimal Bayes classifier. Empirical studies show that most datasets, such as CIFAR10 or MNIST, are indeed separable [41].

Corollary 2 (Separable classes imply zero error).

Classes are said to be $\epsilon$-separable if there exists $\epsilon > 0$ such that the distance between the supports of the two classes exceeds $\epsilon$. In this case, 1-Lipschitz neural networks can achieve zero error.

Even if the classes are not separable, 1-Lipschitz neural networks can nonetheless imitate the optimal Bayes classifier.

Corollary 3 (Arbitrary zero risk).

If the data distribution admits a probability density function (w.r.t. the Lebesgue measure) and the pre-images of the Bayes classifier are closed, then 1-Lipschitz neural networks can achieve zero risk.

The hypothesis class therefore contains a classifier with optimal accuracy, despite its hugely constrained Lipschitz constant. The difficulty lies in the optimization over such a class of functions (see Section 4).

Example 1 (Highest Mountain Function).

An example of HMF is depicted in Figures 1 (a) and (b), with the frontier chosen to be the fourth iteration of the Von Koch Snowflake. We train a 6-layer (5 hidden layers of width 128) 1-Lipschitz NN by regression to fit the ground truth (160 000 pixels, 20 epochs) and obtain the function in Figure 1 (c), with a mean absolute error of 0.52. The orange strip corresponds to the zone where the function is close to zero. This proves empirically that this classification task, associated with a very sharp (almost fractal) decision frontier, can be solved by a 1-Lipschitz NN. There is no constraint on the shape of the frontier: only on the “speed” at which we move away from it.
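As a toy illustration of this idea (a sketch under the assumption that the two classes are given as finite point clouds; this is not the exact HMF of Proposition 1, but it shares its key property), the signed-distance-like function below is 1-Lipschitz whatever the shape of the frontier, since it is half the difference of two 1-Lipschitz distance functions:

```python
import numpy as np

def signed_distance_classifier(pos_pts, neg_pts):
    """Return f(x) = (d(x, negatives) - d(x, positives)) / 2.

    Each distance-to-a-set function is 1-Lipschitz, so f is 1-Lipschitz as well,
    regardless of how irregular the boundary between the two point clouds is.
    sign(f) classifies x according to the closest class.
    """
    def f(x):
        d_pos = np.min(np.linalg.norm(x[:, None, :] - pos_pts[None], axis=-1), axis=1)
        d_neg = np.min(np.linalg.norm(x[:, None, :] - neg_pts[None], axis=-1), axis=1)
        return 0.5 * (d_neg - d_pos)
    return f

rng = np.random.default_rng(0)
pos = rng.normal(loc=[+2.0, 0.0], size=(200, 2))   # hypothetical class +1 samples
neg = rng.normal(loc=[-2.0, 0.0], size=(200, 2))   # hypothetical class -1 samples
f = signed_distance_classifier(pos, neg)

# Empirical check of the 1-Lipschitz property on random pairs.
x1, x2 = rng.normal(size=(2, 1000, 2)) * 4
ratios = np.abs(f(x1) - f(x2)) / np.linalg.norm(x1 - x2, axis=-1)
print(ratios.max())   # never exceeds 1.0
```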

3.2 Robustness

One of the most appealing properties of 1-Lipschitz neural networks is their ability to provide robustness certificates against adversarial attacks.

Definition 2 (Adversarial Example)

For any classifier $f$ and any input $x$, consider the following problem:

(1) $\delta^\ast = \arg\min_{\delta} \|\delta\| \quad \text{such that} \quad \operatorname{sign}(f(x + \delta)) \neq \operatorname{sign}(f(x)).$

$\delta^\ast$ is an adversarial attack, $x + \delta^\ast$ is an adversarial example, and $\epsilon(x) = \|\delta^\ast\|$ is the robustness radius of $f$ at $x$. The smallest robustness radius over the support is the minimum robustness radius of $f$.

While unconstrained neural networks usually have a very small robustness radius [29], 1-Lipschitz neural networks can provide certificates [32].

Property 1 (Robustness Certificates [20]).

For any 1-Lipschitz neural network $f$, the robustness radius at example $x$ verifies $\epsilon(x) \ge |f(x)|$.

Computing the certificate is straightforward and does not increase runtime, contrary to methods based on bounding boxes or abstract interpretation. There is no need for costly adversarial training [22], which fails to produce guarantees.
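In code, the certificate of Property 1 for a binary classifier with a 1-Lipschitz logit is a one-liner (a sketch with a hypothetical toy logit function `f`): no perturbation of norm smaller than |f(x)| can flip the sign of the prediction.

```python
import numpy as np

def certified_radius(f, x):
    """Robustness certificate for a binary classifier with a 1-Lipschitz logit f.

    Since |f(x + delta) - f(x)| <= ||delta||, the sign of f cannot change for any
    perturbation with ||delta|| < |f(x)|.  One forward pass, no attack needed.
    """
    return np.abs(f(x))

# Toy 1-Lipschitz logit: a linear map with unit-norm weight vector.
w = np.array([0.6, 0.8])                  # ||w|| = 1, so f is 1-Lipschitz
f = lambda x: x @ w
x = np.array([1.0, 2.0])
print(certified_radius(f, x))             # 2.2: no attack of norm < 2.2 can flip the class
```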

The HMF(b) function associated to the Bayes classifier b is the one providing the largest certificates among the classifiers of maximum accuracy.

Corollary 4.

Under the hypothesis of Corollary 3, for the HMF(b), the bound of Property 1 is tight: the robustness radius at $x$ is exactly $|f(x)|$, and a perturbation of that norm towards the frontier is guaranteed to be an adversarial attack. Its risk is the smallest possible, and there is no classifier with the same risk and better certificates. Said otherwise, the HMF(b) is the solution to:

(2) the maximization of the certified radii,
under the constraint of achieving minimal risk.

Unfortunately, the HMF(b) cannot be explicitly constructed since it relies on the (generally unknown) optimal Bayes classifier. We deduce that a robust 1-Lipschitz classifier must certainly try to maximize $|f(x)|$ for each example $x$ in the training set. One question remains: which loss should we choose to achieve this goal?

4 Binary Cross Entropy and 1-Lipschitz neural networks

The Binary Cross Entropy (BCE) loss (also called log loss) is among the most popular choices within the deep learning community. In this section we highlight some of its properties w.r.t. the Lipschitz constant.

(a) Evolution of the weight norm as a function of time throughout the optimization process, on a synthetic linearly separable task (same setting as Example 2). Newton's method diverges while the SGD curve flattens: on a 32-bit floating point architecture the norm of the gradient quickly vanishes below machine precision.
(b) L-Lipschitz functions minimizing the BCE of Example 3. The red bars correspond to the training points: their height is proportional to their weight and their orientation to their label (+1 or -1). For L high enough an inflexion point appears, whereas for small L the examples with small weights are treated as noise.

Let $f$ be a neural network. For an example $x$ with label $y \in \{-1, +1\}$, and with $\sigma$ the logistic function mapping logits to probabilities, the BCE is written $\mathcal{L}_{BCE}(f(x), y) = -\log \sigma(y f(x)) = \log(1 + e^{-y f(x)})$.

Vanishing and exploding gradients have been a long-standing issue in the training of neural networks. The latter is usually avoided by regularizing the weights of the network and using bounded losses, while the former can be avoided using residual connections (such ideas can be found in LSTM [10] or ResNet [14]). On 1-Lipschitz neural networks we can guarantee the absence of exploding gradients.

Proposition 3 (No exploding gradients [20]).

Assume that $f$ is a feed-forward neural network in which each layer is 1-Lipschitz, each layer being either a 1-Lipschitz affine transformation or a 1-Lipschitz activation function. Let $\mathcal{L}$ be the loss function applied to the logits. Then the gradients of the loss with respect to the intermediate activations and with respect to the layer parameters do not explode:

(3)
(4)

Vanishing gradients remain an issue with BCE, making the training tedious.

On unconstrained neural networks, minimization of this loss leads to saturation of the logits and uncontrolled growth of the Lipschitz constant.

Proposition 4 (Saturated Neural Networks have high Lipschitz constant).

Let $(f_n)_{n \in \mathbb{N}}$ be a sequence of neural networks that minimizes the BCE over a non-trivial (with more than one class) training set of size $N$, i.e.:

(5) $\lim_{n \to \infty} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_{BCE}(f_n(x_i), y_i) = 0.$

Let $L_n$ be the Lipschitz constant of $f_n$. Then $\lim_{n \to \infty} L_n = +\infty$.

This issue is especially important since the high Lipschitz constant of neural networks has been identified as the main cause of adversarial vulnerabilities. With saturated logits, the predicted probability will be either 0 or 1, which does not carry any useful information on the true confidence of the classifier, especially in the out-of-distribution setting.

Example 2 (Illustration on linear classifier).

Consider a binary classification task with, for simplicity, classes that are linearly separable. We use an affine model $f(x) = \langle w, x \rangle + b$ for the logits (with parameters $w$ and $b$), which can be seen as a one-layer neural network. Since the classes are linearly separable, there exist parameters such that $\operatorname{sign}(f)$ achieves 100% accuracy. However, as noticed in [5] (Section 4.3.2), the cross entropy loss will not be zero: it can only be driven towards its infimum along a diverging sequence of parameters $(\lambda w, \lambda b)$ with $\lambda \to \infty$. It turns out the infimum is not a minimum!

Even on this trivial example, with a hugely constrained model, the minimization problem is ill-defined. Without an L1 or L2 regularization term the minimizer cannot be attained. However, most order-1 methods struggle to saturate the logits, even on unconstrained neural networks, as depicted in Figure 2(a), whereas order-2 methods diverge as expected. The poor properties of the optimizer are one of the reasons why the ill-posed problem of BCE minimization does not lead to an explosion of the weights in unconstrained networks.
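The phenomenon of Example 2 is easy to reproduce numerically; a sketch (plain gradient descent on the logistic loss for a hypothetical linearly separable toy set, not the paper's experiment): the loss keeps decreasing while the parameter norm, hence the Lipschitz constant of the logit, grows without bound.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 50), rng.normal(2.0, 0.5, 50)])  # separable 1D data
y = np.concatenate([-np.ones(50), np.ones(50)])                            # labels in {-1, +1}

w, b, lr = 0.0, 0.0, 0.5
for step in range(20001):
    logits = w * x + b
    # BCE with +/-1 labels: log(1 + exp(-y * f(x))), written in a numerically stable form.
    loss = np.mean(np.logaddexp(0.0, -y * logits))
    grad = -y / (1.0 + np.exp(y * logits))          # d loss / d logits
    w -= lr * np.mean(grad * x)
    b -= lr * np.mean(grad)
    if step % 5000 == 0:
        # |w| is the Lipschitz constant of the logit: it keeps growing while the loss shrinks.
        print(f"step {step:6d}  loss {loss:.5f}  |w| {abs(w):.2f}")
```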

Conversely, a 1-Lipschitz neural network cannot reach zero loss. Yet the minimizer of BCE is well defined.

Proposition 5 (BCE minimization for 1-Lipschitz functions).

Let $\mathcal{X}$ be a compact set and $L > 0$. Then the minimum in Equation 6, the BCE risk minimized over the set of $L$-Lipschitz functions, is attained.

(6) $\min_{f \in \mathrm{Lip}_L(\mathcal{X})} \ \mathbb{E}_{(x,y)}\left[\mathcal{L}_{BCE}(f(x), y)\right]$

Proposition 5 shows that the minimum exists. Machine learning practitioners are mostly interested in the minimization of the empirical risk (i.e., maximization of the accuracy). However, one cannot guarantee that the minimizer of BCE will maximize accuracy (see Example 3). Moreover, a minimizer of Equation 6 for some value of L is in general not a minimizer of Equation 6 for a different value of L.

(c) L-Lipschitz neural network with BCE loss. Higher L ensures better fitting of the training set.
(d) L-Lipschitz neural network with Hinge loss. Small margins make the training easier.
Figure 2: Training loss and training error as functions of the epoch, on the CIFAR10 dataset (“dogs” versus “cats”). BCE in (c) and Hinge in (d). Metrics on the test set are not displayed on purpose, since our goal is to understand the optimization problem and not to evaluate generalization capabilities. Log scale on the y-axis for the loss.
Example 3.

We illustrate this phenomenon in Figure 2(b) by training various neural networks with different Lipschitz constants L. We chose a simple setting with only four weighted training points on the real line, with labels in {-1, +1} (the red bars in Figure 2(b)). We plot the learned functions to highlight the different shapes of the minimizer as a function of L. Higher values of L lead to better fitting.

We observe the same phenomenon on CIFAR10 with Lipschitz-constrained NNs, see Figure 2(c). We place ourselves in an over-parameterized regime with five hidden layers. We compare the loss and the error on the training set. We see that the network ends up severely under-fitting for small values of L: not because the optimal classifier is not part of the hypothesis space, but rather because the minimizer of binary cross entropy is not necessarily the minimizer of the error. As L grows, we get closer to the maximum accuracy. The loss itself is responsible for the poor score, not by any means the hypothesis space. A bigger Lipschitz constant might ultimately lead to overfitting, playing the same role as the (usually omitted) temperature scaling of the logits.

Nonetheless, the class of Lipschitz classifiers enjoys another remarkable property: it is a Glivenko-Cantelli class, so the empirical BCE is a consistent estimator. Said otherwise, as the size of the training set increases, the training loss becomes a proxy for the test loss: 1-Lipschitz neural networks will not overfit in the limit of (very) large sample size.

Proposition 6 (Train Loss is a proxy of Test Loss).

Let $P$ be a probability measure on $\mathcal{X} \times \{-1, +1\}$, where $\mathcal{X}$ is a bounded set. Let $(x_i, y_i)_{1 \le i \le n}$ be a sample of $n$ i.i.d. random variables with law $P$. Then we have (taking the limit $n \to \infty$):

(7) the empirical BCE converges almost surely to the expected BCE, uniformly over the relevant class of 1-Lipschitz functions.

It is another flavor of the bias-variance trade-off. We know thanks to Corollaries 2 and 3 that the class of Lipschitz functions does not suffer any bias when it comes to classification. With Proposition 6 we also know that the variance can be made as small as we want by increasing the size of the training set. While this statement seems rather trivial, we emphasize that it is not a property shared by unconstrained neural networks: increasing the size of the training set does not give any guarantee of generalization capabilities. Adversarial examples are an instance of such a failure to reduce variance.

Loss | Property | Unconstrained NN | L-Lipschitz NN
BCE | minimizer | ill-defined, Lipschitz constant diverges (Proposition 4) | attained (Proposition 5)
BCE | in practice | vanishing gradient (Figure 2(a)) | L must be tuned (Example 3)
BCE | consistent estimator | no | yes (Proposition 6)
Wasserstein | minimizer | ill-defined | attained (the minimum is the Wasserstein distance)
Wasserstein | in practice | diverges during training | weak classifier (Proposition 7)
Wasserstein | consistent estimator | no | yes (Appendix B)
Hinge | minimizer | attained | attained, high accuracy for a small margin
Hinge | in practice | no guarantees on the margin | large margin classifier for a big margin
Hinge | consistent estimator | no | yes (Appendix B)
hKR [26] | minimizer | ill-defined | attained
hKR [26] | consistent estimator | no | yes [26]
 | robustness certificates | no | yes (Property 1)
 | exploding gradient | yes, for some losses and architectures | no (Proposition 3)
 | vanishing gradient | yes, for some losses and architectures | yes, for some losses
Table 1: Summary of different candidate losses, and influence of the Lipschitz constraint on the minimum. BCE minimization is ill-posed for unconstrained NNs, but because of vanishing gradients the algorithm converges nonetheless. For a small enough margin, if the classes are separable, 0% training error is achievable with the hinge loss.

5 Alternative losses and link with Optimal Transport

We see that BCE is not necessarily the most suitable loss, because of its dependence on the Lipschitz constant and the vanishing gradient issue. The loss $\mathcal{L}(f(x), y) = -y f(x)$ might seem a good pick at first sight, because it cannot vanish and explicitly maximizes the logits. Unfortunately, its minimum is the Wasserstein distance [35] between $P_+$ and $P_-$ according to the Kantorovich-Rubinstein duality:

(8) $\mathcal{W}_1(P_+, P_-) = \sup_{f \in \mathrm{Lip}_1(\mathcal{X})} \ \mathbb{E}_{x \sim P_+}[f(x)] - \mathbb{E}_{x \sim P_-}[f(x)].$

The minimizer of Equation 8 is known to be a weak classifier, as demonstrated empirically in [26]. We make their observations precise in Proposition 7.

Proposition 7 (KR minimizer is a weak classifier).

For every $\epsilon > 0$ there exist distributions $P_+$ and $P_-$ with disjoint supports in $\mathbb{R}$ such that, for any minimizer $f$ of Equation 8, the error of the classifier $\operatorname{sign}(f)$ is greater than $\frac{1}{2} - \epsilon$.

The hinge loss allows, in principle, to reach maximum accuracy, as used in [20]. The combination of the KR objective with a hinge term (hKR) is still a regularized OT problem [26]. BCE minimization can also be seen through the lens of OT (Appendix E). Results are summarized in Table 1.
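For concreteness, a schematic version of such a hinge-regularized KR objective (a sketch only; the precise form, weighting and margin follow [26] and the deel-lip implementation, which may differ): the first term is the negated Kantorovich-Rubinstein objective, the second a hinge penalty at margin m that pushes training points outside the margin.

```python
import numpy as np

def hkr_loss(logits, labels, margin=1.0, alpha=10.0):
    """Schematic hinge-regularized Kantorovich-Rubinstein loss for balanced classes.

    logits: f(x) for a 1-Lipschitz network; labels in {-1, +1}.
    - The KR term  -mean(y * f(x))  is minimized by maximizing the expected gap
      between the two classes (the dual Wasserstein objective of Equation 8).
    - The hinge term  mean(max(0, margin - y * f(x)))  penalizes points on the
      wrong side of the margin, restoring a proper classifier.
    """
    kr_term = -np.mean(labels * logits)
    hinge_term = np.mean(np.maximum(0.0, margin - labels * logits))
    return kr_term + alpha * hinge_term

# Hypothetical batch of logits and labels.
logits = np.array([2.3, -0.4, -1.7, 0.1])
labels = np.array([1.0, 1.0, -1.0, -1.0])
print(hkr_loss(logits, labels))
```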

With a margin we can bound the VC dimension [33] of the hypothesis class. The value $|f(x)|$ can be understood as a confidence. Hence, we may be interested in a classifier that takes a decision only if $|f(x)|$ is above some threshold $m$, while $|f(x)| < m$ marks examples for which the classifier is unsure: their label may be flipped using attacks of norm smaller than $m$. In this setting, we fall back to PAC learnability.

Proposition 8 (1-Lipschitz Functions with margin are PAC learnable).

Consider a binary classification task with bounded support $\mathcal{X}$. Let $m > 0$ be the margin. Let $\mathcal{H}_m$ be the hypothesis class of 1-Lipschitz classifiers that abstain inside the margin:

(9) classifiers of the form $x \mapsto \operatorname{sign}(f(x))$ if $|f(x)| \ge m$ and $0$ otherwise, with $f \in \mathrm{Lip}_1(\mathcal{X})$.

Then the VC dimension of $\mathcal{H}_m$ is finite:

(10) it is bounded by the packing number of $\mathcal{X}$ with balls of radius $m$, which is itself bounded by a volume ratio involving the Minkowski sum $\mathcal{X} \oplus m B$, where $B$ is the unit ball [28].

Here 0 is a dummy symbol that the classifier may use to say "I don't feel confident"; using it is not allowed to shatter a set. Interestingly, if the classes are separable, choosing $m$ smaller than half the separation distance guarantees that maximal accuracy is reachable. A prior on the separability of the input space is thus turned into VC bounds over the space of hypotheses.

The previous result covers the whole class of 1-Lipschitz functions with margin. We can give another bound corresponding to a practical implementation of Lipschitz networks. With GroupSort2 activation functions (as in the work of [30]) we get the following rough upper bound:

Proposition 9 (VC dimension of 1-Lipschitz neural network with Sorting).

Let $f$ be a 1-Lipschitz neural network with GroupSort2 activation functions, a given number of parameters, and a given total number of neurons. Let $\mathcal{H}$ be the hypothesis class spanned by this architecture. Then the VC dimension of $\mathcal{H}$ is finite and bounded in terms of these quantities:

(11)

From Proposition 9 we can derive generalization bounds using PAC theory. Note that most results on the VC dimension of neural networks use the hypothesis that the activation function is applied element-wise (such as in [4]) and obtain asymptotically tighter lower bounds for the ReLU case. Such a hypothesis no longer applies here; however, we believe that this preliminary result can be strengthened.

6 Conclusion

We proved that 1-Lipschitz neural networks exhibit numerous attractive properties. They are easily certifiable and can reach high accuracy. However, the loss function must be chosen accordingly: Binary Cross Entropy is not necessarily the best choice because of vanishing gradients. Hinge and hKR have appealing properties.

Their training remains a challenge. The solutions of the optimization problem in Equation 6 still need to be characterized and understood, in particular the bias induced by the minimizer of the loss.

Most architectural innovations of the past years, such as Batch Normalization, Dropout and Attention, are not 1-Lipschitz (see [16]) and cannot benefit 1-Lipschitz neural networks in a straightforward manner. Alternatives need to be found. Orthogonal convolutions are still an active research area (see [38] or [21]), but are required to reach SOTA on image benchmarks.

If future works can overcome these challenges, they will open the path to neural networks that are both effective and provably robust. For these reasons we believe 1-Lipschitz networks are a promising direction of further research for the community.

7 Acknowledgments

This work received funding from the French Investing for the Future PIA3 program within the Artificial and Natural Intelligence Toulouse Institute (ANITI). A special thanks to Thibaut Boissin for the support with Deel-Lip library. We thank Sébastien Gerchinovitz for critical proof checking, and Jean-Michel Loubes for useful discussions.

References

  • [1] A. W. van der Vaart and J. A. Wellner (1996) Weak convergence and empirical processes. Springer-Verlag New York. Cited by: Appendix B.
  • [2] C. Anil, J. Lucas, and R. Grosse (2019) Sorting out lipschitz function approximation. In International Conference on Machine Learning, pp. 291–301. Cited by: §2.2, §2.3.
  • [3] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein generative adversarial networks. In International conference on machine learning, pp. 214–223. Cited by: §1, §2.2.
  • [4] P. L. Bartlett, N. Harvey, C. Liaw, and A. Mehrabian (2019) Nearly-tight vc-dimension and pseudodimension bounds for piecewise linear neural networks.. Journal of Machine Learning Research 20, pp. 63–1. Cited by: §5.
  • [5] C. M. Bishop (2006) Pattern recognition and machine learning. springer. Cited by: Example 2.
  • [6] Å. Björck and C. Bowie (1971) An iterative algorithm for computing the best estimate of an orthogonal matrix. SIAM Journal on Numerical Analysis 8 (2), pp. 358–364. Cited by: §2.2.
  • [7] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth (1989) Learnability and the vapnik-chervonenkis dimension. Journal of the ACM (JACM) 36 (4), pp. 929–965. Cited by: Appendix C.
  • [8] K. Choromanski, D. Cheikhi, J. Davis, V. Likhosherstov, A. Nazaret, A. Bahamou, X. Song, M. Akarte, J. Parker-Holder, J. Bergquist, et al. (2020) Stochastic flows and geometric optimization on the orthogonal group. In International Conference on Machine Learning, pp. 1918–1928. Cited by: §2.2.
  • [9] N. B. Erichson, O. Azencot, A. Queiruga, L. Hodgkinson, and M. W. Mahoney (2021) Lipschitz recurrent neural networks. In International Conference on Learning Representations, Cited by: §2.2.
  • [10] F. A. Gers, J. Schmidhuber, and F. Cummins (1999) Learning to forget: continual prediction with lstm. Cited by: §4.
  • [11] L. Gottlieb, A. Kontorovich, and R. Krauthgamer (2014) Efficient classification for metric data. IEEE Transactions on Information Theory 60 (9), pp. 5750–5759. Cited by: §2.2.
  • [12] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, Vol. 30, pp. 5767–5777. Cited by: §2.2, §2.2.
  • [13] L. Hasenclever, J. M. Tomczak, R. van den Berg, and M. Welling (2017) Variational inference with orthogonal normalizing flows. In Bayesian Deep Learning, NIPS 2017 workshop, Cited by: §2.2.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.
  • [15] L. Huang, X. Liu, B. Lang, A. Yu, Y. Wang, and B. Li (2018) Orthogonal weight normalization: solution to optimization over multiple dependent stiefel manifolds in deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §2.2.
  • [16] H. Kim, G. Papamakarios, and A. Mnih (2020) The lipschitz constant of self-attention. arXiv preprint arXiv:2006.04710. Cited by: §2.2, §6.
  • [17] F. Latorre, P. Rolland, and V. Cevher (2019) Lipschitz constant estimation of neural networks via sparse polynomial optimization. In International Conference on Learning Representations, Cited by: §2.2.
  • [18] E. León and G. M. Ziegler (2018) Spaces of convex n-partitions. In New Trends in Intuitive Geometry, pp. 279–306. Cited by: Appendix C.
  • [19] M. Lezcano-Casado and D. Martınez-Rubio (2019) Cheap orthogonal constraints in neural networks: a simple parametrization of the orthogonal and unitary group. In International Conference on Machine Learning, pp. 3794–3803. Cited by: §2.2.
  • [20] Q. Li, S. Haque, C. Anil, J. Lucas, R. B. Grosse, and J. Jacobsen (2019) Preventing gradient attenuation in lipschitz constrained convolutional networks. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 32, Cambridge, MA. Cited by: §1, §2.2, §5, Property 1, Proposition 3.
  • [21] S. Liu, X. Li, Y. Zhai, C. You, Z. Zhu, C. Fernandez-Granda, and Q. Qu (2021) Convolutional normalization: improving deep convolutional network robustness and training. arXiv preprint arXiv:2103.00673. Cited by: §2.2, §6.
  • [22] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018) Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, Cited by: §2.2, §3.2.
  • [23] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, Cited by: §2.2.
  • [24] T. Salimans and D. P. Kingma (2016) Weight normalization: a simple reparameterization to accelerate training of deep neural networks. In NIPS, Cited by: §2.2.
  • [25] K. Scaman and A. Virmaux (2018) Lipschitz regularity of deep neural networks: analysis and efficient estimation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 3839–3848. Cited by: §2.1.
  • [26] M. Serrurier, F. Mamalet, A. González-Sanz, T. Boissin, J. Loubes, and E. del Barrio (2020) Achieving robustness in classification using optimal transport with hinge regularization. arXiv preprint arXiv:2006.06520. Cited by: §2.2, §2.3, Table 1, §5, §5.
  • [27] J. Sokolić, R. Giryes, G. Sapiro, and M. R. D. Rodrigues (2017) Robust large margin deep neural networks. IEEE Transactions on Signal Processing 65 (16), pp. 4265–4280. External Links: Document Cited by: §1.
  • [28] S. J. Szarek (1998) Metric entropy of homogeneous spaces. Banach Center Publications 43 (1), pp. 395–410. Cited by: Appendix C, Proposition 8.
  • [29] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In International Conference on Learning Representations, Cited by: §1, §3.2.
  • [30] U. Tanielian and G. Biau (2021) Approximating lipschitz continuous functions with groupsort neural networks. In International Conference on Artificial Intelligence and Statistics, pp. 442–450. Cited by: §2.2, §2.3, §5.
  • [31] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry (2019) Robustness may be at odds with accuracy. In International Conference on Learning Representations, Cited by: §1.
  • [32] Y. Tsuzuku, I. Sato, and M. Sugiyama (2018) Lipschitz-margin training: scalable certification of perturbation invariance for deep neural networks. In Advances in Neural Information Processing Systems, Vol. 31, pp. 6541–6550. Cited by: §2.2, §3.2.
  • [33] V. N. Vapnik and A. Y. Chervonenkis (2015) On the uniform convergence of relative frequencies of events to their probabilities. In Measures of complexity, pp. 11–30. Cited by: §5.
  • [34] V. Vapnik (2013) The nature of statistical learning theory. Springer Science & Business Media. Cited by: Appendix C.
  • [35] C. Villani (2008) Optimal transport: old and new. Vol. 338, Springer Science & Business Media. Cited by: §5.
  • [36] A. Virmaux and K. Scaman (2018) Lipschitz regularity of deep neural networks: analysis and efficient estimation. Advances in Neural Information Processing Systems 31, pp. 3835–3844. Cited by: §2.2.
  • [37] U. von Luxburg and O. Bousquet (2004) Distance-based classification with lipschitz functions.. J. Mach. Learn. Res. 5, pp. 669–695. Cited by: §2.2.
  • [38] J. Wang, Y. Chen, R. Chakraborty, and S. X. Yu (2020) Orthogonal convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11505–11515. Cited by: §2.2, §6.
  • [39] L. Weng, H. Zhang, H. Chen, Z. Song, C. Hsieh, L. Daniel, D. Boning, and I. Dhillon (2018) Towards fast computation of certified robustness for relu networks. In International Conference on Machine Learning, pp. 5276–5285. Cited by: §2.2.
  • [40] T. Weng, H. Zhang, P. Chen, J. Yi, D. Su, Y. Gao, C. Hsieh, and L. Daniel (2018) Evaluating the robustness of neural networks: an extreme value theory approach. In International Conference on Learning Representations, Cited by: §2.2.
  • [41] Y. Yang, C. Rashtchian, H. Zhang, R. R. Salakhutdinov, and K. Chaudhuri (2020) A closer look at accuracy vs. robustness. Advances in Neural Information Processing Systems 33. Cited by: §1, §3.1.
  • [42] Y. Yoshida and T. Miyato (2017) Spectral norm regularization for improving the generalizability of deep learning. arXiv preprint arXiv:1705.10941. Cited by: §2.2.
  • [43] X. Yuan, P. He, Q. Zhu, and X. Li (2019) Adversarial examples: attacks and defenses for deep learning. IEEE transactions on neural networks and learning systems 30 (9), pp. 2805–2824. Cited by: §2.2.

Appendix A Proofs of Section 3

The proof of Proposition 1 is constructive; we first need to introduce the Highest Mountain Function.

Definition 3 (Highest Mountain Function)

Let $c$ be any classifier with closed pre-images. Let $A$ and $B$ denote the two pre-images, and let $d(x, S)$ denote the distance from a point $x$ to a closed set $S$. We define the HMF as follows:

(12)

Now we can prove that the function previously defined verifies all the properties.

Proof.

We start by proving that is 1-Lipschitz.

First, consider the case and . Then . Assume without loss of generality that . Let such that (guaranteed to exist since is closed). Then by definition of we have . So:

(13)

The case and is identical. Now consider the case and . We have . We will proceed by contradiction. Assume that . Let such that and . Let:

Then . So by definition of we have . But we also have:

(14)

So we have which is a contradiction. Consequently, we must have . The function is indeed 1-Lipschitz.

Now, we prove that $\|\nabla f\| = 1$ everywhere the gradient is defined. Let $x$ be a point for which the nearest point realizing the distance is unique, and consider a small displacement towards it. By the triangle inequality, the value of $f$ changes at unit rate along this direction, and knowing that $f$ is 1-Lipschitz yields $\|\nabla f(x)\| = 1$. For points at which the nearest point is not unique, the gradient is not defined, because several directions minimize the distance, which contradicts the uniqueness of the gradient vector.

Finally, note that the sign of $f$ agrees with the classifier on each of the two pre-images: in each case one of the two distance terms vanishes and the result is straightforward. ∎

For the multiclass case we must slightly change the definition to prove Proposition 2.

Definition 4 (Multiclass Highest Mountain Function)

Let $c$ be any multiclass classifier with closed pre-images. We define the multiclass HMF as follows:

(15)

Overall, the proof remains the same.

Proof.

We start by proving that is 1-Lipschitz.

We will now prove that for any -norm on with . First, consider the case . Then using the proof of Proposition 1. Now, consider the case and . Then:

(16)

Using the same technique as in the previous proof, if we assume then we can construct verifying both and which is a contradiction. Consequently .

Each row of is the gradient of some on which the reasoning of the case applies (like in previous proof). We conclude similarly that everywhere it is defined.

Finally, note that is equal to everywhere is defined, which concludes the proof. ∎

Proof of Corollary 2. If the classes are separable, the optimal Bayes classifier achieves zero error. Moreover, taking the topological closure yields a set of closed sets that are all disjoint (since the classes are separated by a positive distance), on which Proposition 2 can be applied, yielding a 1-Lipschitz neural network with the wanted properties.

Proof of Corollary 3. Straightforward application of Proposition 1 for optimal Bayes classifier.

Appendix B Proofs of Section 4

The proof of Proposition 4 only requires taking a look at the logits of two examples having different labels.

Proof.

Let . For the pair , as , by positivity of we must have:

(17)

As the right hand side has limit zero, we have:

(18)

Consequently . By definition so . ∎

The proof of Proposition 5 is an application of Arzelà–Ascoli theorem.

Proof.

Let . Consider a sequence of functions in such that .

Consider the sequence . We want to prove that is bounded. Proceed by contradiction and observe that if then . Indeed, for we can guarantee that is constant over and in this case one of the two classes is misclassified, knowing that yields the desired result. But if , then cannot not converges to . Consequently, must be upper bounded by some .

Hence the sequence is uniformly bounded. Moreover, each function is $L$-Lipschitz, so the sequence is uniformly equicontinuous. By applying the Arzelà–Ascoli theorem, we deduce that there exists a subsequence (with strictly increasing indices) that converges uniformly to some limit function, which is itself $L$-Lipschitz. Since the objective is continuous under uniform convergence, the infimum is indeed a minimum. ∎

The proof of Proposition 6 is an application of the Glivenko-Cantelli theorem.

Proof.

We proved in Proposition 5 that the minimum of Equation 6 is attained. We restrict ourselves to a subset of the 1-Lipschitz functions with uniformly bounded values, because the minimum lies in this subspace. Composing such functions with the BCE yields functions that are also Lipschitz and bounded on $\mathcal{X}$. The bracketing entropy (see [1], Chapter 2.1) of this class of functions is finite (see [1], Chapter 3.2). Consequently the class is Glivenko-Cantelli, which concludes the proof. ∎

To prove Proposition 3 we just need to write the chain rule.

Proof.

The gradient is computed using the chain rule. Consider any parameter of a given layer, and a dummy variable corresponding to the input of that layer, which is also the output of the previous layer. Then:

(19)

with . As the layers of the neural network are all 1-Lipschitz, then:

Hence:

(20)

Finally, for we replace by the appropriate parameter which yields the desired result. ∎

Results of Table 1. We can apply the same reasoning as in the proof of Proposition 6. If we replace BCE with the hinge loss, the resulting class is still Glivenko-Cantelli, so the theorem applies. Hence, hinge gives a consistent estimator in the space of 1-Lipschitz functions. The result is also straightforward for the KR loss: consistency of the Wasserstein distance is a textbook result.

Appendix C Proofs of Section 5

Proof of Proposition 7.

Proof.

We will build and as a finite collection of Diracs. Let and for some , where denotes the Dirac distribution in . A example is depicted in Figure 3 for . In dimension one, the optimal transportation plan is easy to compute: each atom of mass from at position is matched with the corresponding one in to its immediate right. Consequently we must have . The function is not uniquely defined on segments but it does not matter: since is 1-Lipschitz we must have . Consequently in every case for we must have and . Said otherwise, is strictly increasing on and .

The solutions of the problems are invariant by translations: if is solution, then with is also a solution. Let’s take a look at classifier . If is chosen such that and for some then points are correctly classified on a total of points. It corresponds to an error of . Take to conclude. ∎

Figure 3: Pathological distributions and of points each, on which the accuracy of the Wasserstein minimizer cannot be better than .

Proof of Proposition 8.

Proof.

The implication “finite VC dimension” ⇒ “PAC learnable” is a classical result from [7].

The VC dimension of $\mathcal{H}_m$ is the maximum size of a set shattered by $\mathcal{H}_m$. As the underlying functions are 1-Lipschitz, realizing opposite signs with margin $m$ on two points requires them to be at distance at least $2m$. Consequently, a finite set is shattered by $\mathcal{H}_m$ if and only if the open balls of radius $m$ centered at its points are pairwise disjoint.

The maximum number of disjoint balls of radius $m$ centered inside $\mathcal{X}$ is known as the packing number of $\mathcal{X}$ with radius $m$. Since $\mathcal{X}$ is bounded, its packing number is finite.

The bounds on the packing number are a direct application of [28] (Lemma 1). ∎

The proof of Proposition 9 uses the number of affine pieces generated by GroupSort2 activation function.

Proof.

We first prove that the network is piecewise affine and that the number of such pieces is bounded in terms of the number of neurons in each layer. We proceed by induction on the depth of the neural network. For a depth-one network we have an affine function, which contains only one affine piece by definition (the whole domain), so the result is true.

Now assume that a neural network of a given depth has a bounded number of affine pieces; we pursue the induction for a network with one additional layer (the shift in the enumeration of the widths is not a mistake). The composition of affine functions is affine, hence applying an affine transformation preserves the number of pieces. The analysis falls back to the number of distinct affine pieces created by the GroupSort2 activation function: if the activation creates a given number of pieces, we obtain an immediate multiplicative bound on the total count.

Consider the Jacobian of the GroupSort2 operation evaluated at a given point. The number of distinct such Jacobians is the number of distinct affine pieces. GroupSort2 is a combination of MinMax gates. Each MinMax gate is defined on a pair of coordinates and contains two pieces: one on which the gate behaves like the identity and one on which it behaves like a transposition. Consequently, the number of pieces multiplies accordingly, and unrolling the recurrence yields the desired result.

Now, we just need to apply Lemma 1 with the number of pieces computed above.

Lemma 1 (Piecewise affine function).

Let $\mathcal{H}$ be a class of classifiers that are piecewise affine, such that the pieces form a convex partition of the input space into a given number of pieces (each piece of the partition is a convex set). Then we have:

The proof of Lemma 1 is detailed below.