1 Introduction
The Lipschitz constant of neural networks has attracted great attention in the last few years. 1-Lipschitz networks were first used to estimate the Wasserstein distance between two probability distributions, thanks to Kantorovich-Rubinstein duality, in the seminal work of [3]. Deep neural networks are known to be vulnerable to adversarial attacks [29]: a carefully chosen small shift to the input, usually indistinguishable from noise, can change the class prediction. This is mainly due to the Lipschitz constant of neural networks, which can grow arbitrarily high when unconstrained. One possible defense against adversarial attacks is to constrain the Lipschitz constant of the network [20], which yields provable robustness guarantees, together with improved generalization [27] and interpretability of the model [31].
These approaches propose different ways to precisely control the Lipschitz constant of the network, such as gradient penalty, spectral normalization and orthogonalization of the weight matrices. These algorithms greatly facilitate the learning of 1-Lipschitz networks, and 1-Lipschitz functions are known to have desirable properties for machine learning. However, they are primarily used in the areas described above and are not widely applied in deep learning. The arguments commonly invoked against 1-Lipschitz networks are that they are much more difficult to train and far less expressive than unconstrained networks, making them rarely competitive on challenging benchmarks.
The goal of the paper is to demonstrate that, despite being empirically harder to train, 1-Lipschitz neural networks are theoretically better grounded than unconstrained ones when it comes to classification. First, while the common belief that Lipschitz-constrained neural networks are less expressive than their unconstrained counterparts is obviously true for regression, this intuition fades when it comes to classification. Indeed, if $f$ is an $L$-Lipschitz neural network, then $f/L$ is a 1-Lipschitz neural network with the same decision frontier. This corroborates the findings of [41] that high accuracy and high robustness are not necessarily antagonist objectives. We also explore the statistical and optimization properties of 1-Lipschitz networks in Section 4 and their VC dimension in Section 5.
The main contribution of the paper is to recall known results about 1-Lipschitz functions in machine learning and to extend them in order to give a general view of the multiple interests of this class of functions for deep learning. The first section is devoted to the state of the art of 1-Lipschitz neural networks. Then we show that 1-Lipschitz neural classifiers are able to learn arbitrarily complex decision frontiers and are optimal when it comes to robustness. In the third section, we prove that unconstrained optimization of the binary cross entropy is an ill-posed optimization problem that becomes well posed when considering the class of 1-Lipschitz networks. We demonstrate that even on very simple toy examples, unconstrained neural networks may have an arbitrarily high Lipschitz constant, which leads to overfitting. In the last sections, we outline the link between classification with 1-Lipschitz functions and optimal transport, and we show that the class of 1-Lipschitz functions with margin has a finite VC dimension and hence provable error bounds.
2 Notations and related work
2.1 Notations
In the following sections we will focus on classification tasks defined over an input space $\mathcal{X} \subset \mathbb{R}^n$. The label set is $\{-1, 1\}$ for binary classification and $\{1, \dots, K\}$ for multiclass. The observations are sampled from a joint distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$. The support of $\mathcal{D}$ is denoted $\mathrm{supp}(\mathcal{D})$. The goal is to learn a classifier modeling $\mathbb{P}(y \mid x)$. The Lipschitz constant of a function $f$ is defined as the smallest $L$ such that for all $x_1, x_2$ we have $\|f(x_1) - f(x_2)\| \le L \|x_1 - x_2\|$; in this case $f$ is said to be $L$-Lipschitz. For simplicity we will focus on the Euclidean norm in the rest of the paper. The set of $L$-Lipschitz functions over $\mathcal{X}$ will be denoted $\mathrm{Lip}_L(\mathcal{X})$. The gradient of $f$ w.r.t. $x$ will be written $\nabla_x f$, and its Jacobian $J_f$. The norm of any matrix must be understood as the operator norm.
In the rest of the paper, "unconstrained neural network" must be understood as any feed-forward network of fixed depth (without recurrent mechanisms) with parametrized affine layers (including convolutions) and element-wise activation functions (such as ReLU, sigmoid, tanh, and other popular variants). Such neural networks are known to be Lipschitz [25], but with no knowledge of their constant. The last layer is assumed to have no activation function (to produce logits in $\mathbb{R}$), since this last activation can be merged into the loss function.
2.2 Related work
A neural network is a composition of linear and nonlinear functions. As a composition of functions, the Lipschitz constant of a multilayer network is upper bounded by the product of the individual Lipschitz constants of each layer. However, it is known that evaluating the Lipschitz constant exactly is an NP-hard problem [36], and many approaches have been proposed to estimate it. Clipping the weights of a network as in Wasserstein GAN [3], or weight regularization, are ways to constrain the Lipschitz constant of a network; however, they give no guarantee about its real value, only a very crude upper bound. Gradient penalty [12] and spectral regularization [42] allow for a better control of the constant, but still no guarantees. Normalizing by the Frobenius norm [24] leads to a tighter upper bound, but spectral normalization as proposed in [23] gives better guarantees and allows each dense layer to be exactly 1-Lipschitz. However, spectral normalization results in neural networks with an effective Lipschitz constant far smaller than the prescribed bound, leading to vanishing gradients. While most activation functions are 1-Lipschitz (including ReLU, sigmoid, tanh), some other layers such as Attention are not even Lipschitz [16]. Some attempts have been made to propose Lipschitz recurrent units [9].
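As an illustration of the spectral-normalization idea, the largest singular value of a weight matrix can be estimated by power iteration and divided out. This is a sketch (iteration count and initialization are our choices), not the exact scheme of the cited works:

```python
import numpy as np

def spectral_normalize(W, n_iter=50, eps=1e-12):
    """Rescale W so its largest singular value is ~1, via power iteration."""
    u = np.random.default_rng(0).normal(size=W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v) + eps
        u = W @ v
        u /= np.linalg.norm(u) + eps
    sigma = u @ W @ v  # estimate of the largest singular value
    return W / sigma

W = np.array([[3.0, 0.0], [0.0, 0.5]])
W_sn = spectral_normalize(W)
print(np.linalg.norm(W_sn, 2))  # largest singular value is now ~1
```

Note that only the largest singular value becomes 1; the others may be much smaller, which is precisely the source of the vanishing-gradient issue mentioned above.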
In [12], the authors show that the optimal solution of the Kantorovich-Rubinstein dual transport problem verifies $\|\nabla f\| = 1$ almost everywhere. [2] proved that such functions are dense in the set of 1-Lipschitz functions w.r.t. uniform convergence. They also establish that if a neural network verifies $\|\nabla f\| = 1$ almost everywhere and uses element-wise non-linearities, then it is an affine function. They proposed a sorting activation function to circumvent this issue, in combination with Björck orthonormalization [6] to ensure that all singular values of the weight matrices are close to 1. Note this property doesn't hold when dealing with convolutional layers, but it can be guaranteed in this case by using Block Convolution Orthogonal Parameterization (BCOP) [20, 38, 21]. GroupSort [30] is another useful activation function for those architectures. Orthogonal kernels are of special interest in the context of normalizing flows [13]. Optimization over the orthogonal group has been studied in [19, 15] with tools like the Stiefel manifold and the Cayley transform, or stochastic optimization [8].
Adversarial attacks (see [43] and references therein) are small perturbations in input space, invisible to humans, but nonetheless able to change the class prediction. This is a vulnerability of modern deep neural networks, making them unsuitable for critical applications. Adversarial training [22] leads to empirical improvements but fails to provide certificates. Certificates can be produced by bounding the Lipschitz constant, using extreme value theory [40], linear approximations [39] or polynomial optimization [17]. In [32] the control of the Lipschitz constant and of margins is used to guarantee robustness against attacks. In [26], the authors link classification with optimal transport by considering a hinge-regularized version of the Kantorovich-Rubinstein optimization. They built provably robust classifiers using 1-Lipschitz neural networks with $\|\nabla f\| = 1$ almost everywhere.
2.3 Experimental setting
All the experiments done in the paper use the Deel.Lip^1 library developed for [26]. The network verifies $\|\nabla f\| = 1$ almost everywhere thanks to 1) orthogonal matrices in affine layers and 2) a layer-wise sorting activation function: GroupSort2 [30], defined on pairs of neurons as $(x_1, x_2) \mapsto (\min(x_1, x_2), \max(x_1, x_2))$. The spectral normalization and Björck orthonormalization algorithms ensure that most singular values equal 1. This implementation is based on the seminal work of [2], who first proved that those functions are dense in the space of 1-Lipschitz functions. With dilation by a constant $L$ we can parameterize the set $\mathrm{Lip}_L(\mathcal{X})$. We minimize losses using the Adam optimizer. We did not use convolutions because we could not formally guarantee $\|\nabla f\| = 1$ in the current implementation.
^1 https://github.com/deel-ai/deel-lip

3 1-Lipschitz classifiers
In this section we show that 1-Lipschitz functions are as powerful classifiers as their unconstrained counterparts. In particular, when the classes are separable they can achieve 100% accuracy. In the non-separable case the optimal Bayes classifier can nonetheless be imitated.
We also recall the main property of 1Lipschitz neural networks: their ability to produce robustness certificates against adversarial examples.
3.1 Frontier decision fitting
Proposition 1 (Lipschitz binary classification).
For any binary classifier $c$ with closed preimages ($c^{-1}(\{y\})$ is a closed set for $y \in \{-1, 1\}$) there exists a 1-Lipschitz function $f$ such that $\mathrm{sign}(f) = c$ on $\mathcal{X}$. Moreover, everywhere $\nabla f$ is defined we have $\|\nabla f\| = 1$. Note that $\nabla f$ does not need to be defined everywhere on $\mathcal{X}$, while $f$ is.
Proof.
(Sketch, full proof in Appendix).
Definition 1 (Highest Mountain Function)
Let $\Gamma$ be the maximum margin frontier: the set of points equidistant to $c^{-1}(-1)$ and $c^{-1}(1)$. We define the binary Highest Mountain Function (HMF) as $f(x) = c(x)\, d(x, \Gamma)$ on $\mathcal{X}$; it implicitly depends on $c$.
Then $f$ verifies the aforementioned properties. ∎
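The binary construction can be illustrated numerically. The sketch below (our illustration, not code from the paper) uses finite point clouds as stand-ins for the two closed preimages, and the classic half-difference of distances, which is 1-Lipschitz, positive on the positive class, and zero on the equidistant frontier:

```python
import numpy as np

def hmf_signed_distance(x, pos, neg):
    """Signed-distance sketch in the spirit of the HMF.

    Returns (d(x, neg) - d(x, pos)) / 2: each distance is 1-Lipschitz,
    so the half-difference is 1-Lipschitz too, and its sign recovers
    the class. `pos` and `neg` are finite samples standing in for the
    two closed preimages (an assumption of this sketch).
    """
    d_pos = np.min(np.linalg.norm(pos - x, axis=1))
    d_neg = np.min(np.linalg.norm(neg - x, axis=1))
    return 0.5 * (d_neg - d_pos)

pos = np.array([[1.0, 0.0]])
neg = np.array([[-1.0, 0.0]])
print(hmf_signed_distance(np.array([0.5, 0.0]), pos, neg))  # 0.5
print(hmf_signed_distance(np.array([0.0, 0.0]), pos, neg))  # 0.0 (on the frontier)
```

The value of the function at a point directly reads as a distance to the decision frontier, which is what makes the certificates of Section 3.2 possible.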
Similar results can be obtained for multiclass classification.
Proposition 2 (Lipschitz multiclass classification).
For any multiclass classifier $c$ with closed preimages there exists a 1-Lipschitz function $f$ such that the classifier induced by $f$ coincides with $c$. Moreover, everywhere the Jacobian of $f$ is defined, its rows have unit norm.
For simplicity, in the following sections we will focus on the binary classification case only ($K = 2$), and $P$ (resp. $Q$) will denote the distribution of the positive (resp. negative) class, assuming that they have the same probability mass (balanced case).
With these propositions in mind, or reusing the proof sketched in Section 1, we can deduce Corollary 1.
Corollary 1 (1-Lipschitz networks are as powerful as unconstrained ones).
For any neural network $f$ with Lipschitz constant $L$, there exists a 1-Lipschitz neural network $g$ (e.g. $g = f / L$) such that $\mathrm{sign}(g) = \mathrm{sign}(f)$.
The Error of a classifier $c$ is defined as $\mathcal{E}(c) = \mathbb{P}_{(x,y) \sim \mathcal{D}}(c(x) \ne y)$. The Risk of a classifier is defined as $\mathcal{R}(c) = \mathcal{E}(c) - \mathcal{E}(b)$, where $b$ denotes the optimal Bayes classifier. Some empirical studies show that most datasets, such as CIFAR10 or MNIST, are indeed separated [41].
Corollary 2 (Separable classes imply zero error).
Classes are said to be separable if there exists $\epsilon > 0$ such that the distance between $\mathrm{supp}(P)$ and $\mathrm{supp}(Q)$ exceeds $\epsilon$. In this case 1-Lipschitz neural networks can achieve zero error.
Even if the classes are not separable, 1-Lipschitz neural networks can nonetheless imitate the optimal Bayes classifier.
Corollary 3 (Arbitrary zero risk).
If $\mathcal{D}$ admits a probability density function (w.r.t. the Lebesgue measure) and the preimages of the Bayes classifier are closed, then 1-Lipschitz neural networks can achieve zero risk.
The hypothesis class contains a classifier with optimal accuracy, despite having a hugely constrained Lipschitz constant. The difficulty lies in the optimization over such a class of functions (see Section 4).
Example 1 (Highest Mountain Function).
An example of HMF is depicted in Figures 1 (a) and (b), with the frontier chosen to be the fourth iteration of the Von Koch snowflake. We train a 6-layer (5 hidden layers of width 128) 1-Lipschitz NN by regression to fit the ground truth (160 000 pixels, 20 epochs) and obtain the function in Figure 1 (c), with a mean absolute error of 0.52. The orange strip corresponds to the zone where the function is close to zero. This proves empirically that this classification task, associated with a very sharp (almost fractal) decision frontier, can be solved by a 1-Lipschitz NN. There is no constraint on the shape of the frontier: only on the "speed" at which we move away from it.
3.2 Robustness
One of the most appealing properties of 1Lipschitz neural networks is their ability to provide robustness certificates against adversarial attacks.
Definition 2 (Adversarial example)
For any classifier $c$ and any $x \in \mathcal{X}$, consider the following problem:
(1) $\epsilon(x) = \min_{\delta} \|\delta\| \quad \text{such that} \quad c(x + \delta) \neq c(x).$
$\delta$ is an adversarial attack, $x + \delta$ is an adversarial example, and $\epsilon(x)$ is the robustness radius of $c$ at $x$. The smallest $\epsilon(x)$ achievable over $\mathcal{X}$ is the minimum robustness radius of $c$.
While unconstrained neural networks usually have a very small robustness radius [29], 1-Lipschitz neural networks can provide certificates [32].
Property 1 (Robustness certificates [20]).
For any 1-Lipschitz neural network $f$, the robustness radius at example $x$ verifies $\epsilon(x) \ge |f(x)|$.
Computing the certificate is straightforward and does not increase runtime, contrary to methods based on bounding boxes or abstract interpretation. There is no need for costly adversarial training [22], which fails to produce guarantees.
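Turning Property 1 into a certificate is a one-liner. The sketch below assumes the user supplies a genuinely 1-Lipschitz binary classifier (here a toy signed distance to a hyperplane, our example):

```python
import numpy as np

def certified_radius(f, x):
    """Robustness certificate for a binary 1-Lipschitz classifier.

    If f is 1-Lipschitz and the prediction is sign(f(x)), then no
    perturbation of norm < |f(x)| can flip the sign, so |f(x)| is a
    certified radius. f is assumed 1-Lipschitz (not checked here).
    """
    return abs(f(x))

# toy 1-Lipschitz classifier: signed distance to the hyperplane x0 = 0
f = lambda x: x[0]
x = np.array([0.8, -0.3])
r = certified_radius(f, x)
print(r)  # 0.8: any perturbation of norm < 0.8 keeps sign(f) unchanged
```

The certificate costs one forward pass, which is why it does not increase runtime.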
The HMF($b$) function associated to the Bayes classifier $b$ is the one providing the largest certificates among the classifiers of maximal accuracy.
Corollary 4.
Under the hypotheses of Corollary 3, for the HMF($b$), the bound of Property 1 is tight: $\epsilon(x) = |f(x)|$. In particular the minimal perturbation reaching the frontier is guaranteed to be an adversarial attack. The risk is the smallest possible, and there is no classifier with the same risk and better certificates. Said otherwise, the HMF($b$) is the solution to:
(2) $\max_{f \in \mathrm{Lip}_1(\mathcal{X})} \ \mathbb{E}_{x \sim \mathcal{D}}[\epsilon(x)]$
under the constraint that the risk of the induced classifier is minimal.
Unfortunately the HMF($b$) cannot be explicitly constructed since it relies on the (generally unknown) optimal Bayes classifier. We deduce that a robust 1-Lipschitz classifier should try to maximize $|f(x)|$ for each $x$ in the training set. One question remains: which loss should we choose to achieve this goal?
4 Binary Cross Entropy and 1Lipschitz neural networks
The Binary Cross Entropy (BCE) loss (also called logloss) is among the most popular choices within the deep learning community. In the next section we highlight some of its properties w.r.t the Lipschitz constant.
Let $f$ be a neural network. For an example $x$ with label $y \in \{-1, 1\}$, with $\sigma(t) = (1 + e^{-t})^{-1}$ the logistic function mapping logits to probabilities, the BCE is written $\mathcal{L}_{BCE}(f(x), y) = -\log \sigma(y f(x))$.
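Numerically, this loss is usually evaluated in its softplus form, $-\log\sigma(yf(x)) = \mathrm{softplus}(-yf(x))$, to avoid overflow when logits saturate. A minimal sketch:

```python
import numpy as np

def bce_from_logit(logit, y):
    """Binary cross entropy -log(sigmoid(y * logit)) for y in {-1, +1}.

    Written as softplus(-y * logit) with the usual max trick, so it is
    stable even for very large |logit|.
    """
    z = -y * logit
    return np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z)))

print(bce_from_logit(0.0, 1))      # log(2) ~ 0.6931: uninformative logit
print(bce_from_logit(10.0, 1))     # ~4.5e-5: confident and correct
print(bce_from_logit(1000.0, -1))  # 1000.0: saturated and wrong, no overflow
```

The last line illustrates the linear growth of the loss in the magnitude of a wrongly-signed logit, which is used in the proofs of Section 4.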
Vanishing and exploding gradients have long been an issue in the training of neural networks. The latter is usually avoided by regularizing the weights of the network and using bounded losses, while the former can be mitigated using residual connections (such ideas can be found in LSTM [10] or ResNet [14]). On 1-Lipschitz neural networks we can guarantee the absence of exploding gradients.
Proposition 3 (No exploding gradients [20]).
Assume that $f = h_n \circ \dots \circ h_1$ is a feed-forward neural network where each layer $h_i$ is either a 1-Lipschitz affine transformation or a 1-Lipschitz activation function. Let $\mathcal{L}$ be a $K$-Lipschitz loss function, and let $a_i$ denote the intermediate activation fed to layer $h_i$. Then:
(3) $\|\nabla_x \mathcal{L}(f(x), y)\| \le K,$
(4) $\|\nabla_{a_i} \mathcal{L}(f(x), y)\| \le K.$
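The bound rests on the fact that a composition of 1-Lipschitz layers is itself 1-Lipschitz, which is easy to check empirically on a toy network (orthogonal weights and ReLU, our construction):

```python
import numpy as np

def one_lip_net(x, weights):
    """Toy feed-forward net: orthogonal affine layers followed by ReLU.

    Orthogonal matrices preserve norms and ReLU is 1-Lipschitz, so the
    whole composition is 1-Lipschitz by construction.
    """
    for W in weights:
        x = np.maximum(W @ x, 0.0)
    return x

rng = np.random.default_rng(0)
weights = [np.linalg.qr(rng.normal(size=(8, 8)))[0] for _ in range(4)]

# empirical check: the slope between any two points never exceeds 1
x1, x2 = rng.normal(size=8), rng.normal(size=8)
ratio = (np.linalg.norm(one_lip_net(x1, weights) - one_lip_net(x2, weights))
         / np.linalg.norm(x1 - x2))
print(ratio <= 1.0 + 1e-12)  # True
```

Since no intermediate quantity can amplify a perturbation, gradients propagated backward cannot explode either, whatever the depth.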
Vanishing gradients are still an issue with BCE: once the logits saturate, the gradient of the loss becomes negligible, making the training tedious.
On unconstrained neural networks, minimization of this loss leads to saturation of the logits and uncontrolled growth of the Lipschitz constant.
Proposition 4 (Saturated neural networks have high Lipschitz constant).
Let $(f_k)_k$ be a sequence of neural networks that minimizes the BCE over a non-trivial (with more than one class) training set of size $n$, i.e.:
(5) $\lim_{k \to \infty} \frac{1}{n} \sum_{i=1}^n \mathcal{L}_{BCE}(f_k(x_i), y_i) = 0.$
Let $L_k$ be the Lipschitz constant of $f_k$. Then $L_k \to +\infty$.
This issue is especially important since the high Lipschitz constant of neural networks has been identified as the main cause of adversarial vulnerability. With saturated logits, the predicted probability will be either 0 or 1, which does not carry any useful information on the true confidence of the classifier, especially in the out-of-distribution setting.
Example 2 (Illustration on a linear classifier).
Consider a binary classification task on $\mathbb{R}^n$ with, for simplicity, classes that are linearly separable. We use an affine model $f(x) = \langle w, x \rangle + b$ for the logits (with $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$), which can be seen as a one-layer neural network. Since the classes are linearly separable, there exists $(w, b)$ such that $f$ achieves 100% accuracy. However, as noticed in [5] (Section 4.3.2), the cross-entropy loss will not be zero. The loss can be decreased without bound only with a diverging sequence of parameters $(\lambda w, \lambda b)$ as $\lambda \to +\infty$. It turns out the infimum is not a minimum!
Even on this trivial example, with a hugely constrained model, the minimization problem is ill-defined. Without a regularization term the minimizer cannot be attained. However, most order-1 methods struggle to saturate the logits, even on unconstrained neural networks, as depicted in Figure 1(a), whereas order-2 methods diverge as expected. The poor properties of the optimizer are one of the reasons the ill-posed problem of BCE minimization does not lead to an explosion of the weights in unconstrained networks.
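Example 2 is easy to reproduce: plain gradient descent on BCE over two separable points keeps shrinking the loss while the weight norm grows without bound. A minimal numpy sketch (learning rate and step count are arbitrary choices of ours):

```python
import numpy as np

# BCE on linearly separable data has no minimizer: gradient descent
# drives the loss toward 0 only by letting ||w|| diverge.
X = np.array([[-1.0], [1.0]])   # two separable 1-D points
y = np.array([-1.0, 1.0])

w = np.array([0.1])
norms = []
for step in range(5000):
    margins = y * (X @ w)
    # gradient of mean softplus(-y <w, x>): mean of -y x sigmoid(-margin)
    grad = np.mean((-y * X[:, 0]) / (1.0 + np.exp(margins)))
    w -= 0.5 * grad
    norms.append(abs(w[0]))

print(norms[0] < norms[999] < norms[4999])  # True: ||w|| keeps growing
```

The gradient never vanishes exactly (it is a strictly positive multiple of the separating direction), so the iterates drift to infinity, slowing down only because the gradient decays with the margin.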
Conversely, a 1-Lipschitz neural network cannot reach zero loss. Yet the minimizer of BCE is well defined.
Proposition 5 (BCE minimization for 1-Lipschitz functions).
Let $\mathcal{X}$ be a compact set and $L > 0$. Then the minimum of Equation 6 is attained.
(6) $\min_{f \in \mathrm{Lip}_L(\mathcal{X})} \mathbb{E}_{(x,y) \sim \mathcal{D}}\left[\mathcal{L}_{BCE}(f(x), y)\right].$
Proposition 5 shows that the minimum exists. Machine learning practitioners are mostly interested in the minimization of the empirical risk (i.e. maximization of the accuracy). However, one cannot guarantee that the minimizer of BCE will maximize accuracy (see Example 3). Moreover, if $f^*$ is a minimum of Equation 6 for some $L$, in general its rescaling is not necessarily a minimum of Equation 6 for another $L'$.
Example 3.
We illustrate this phenomenon in Figure 1(b) by training various neural networks with different Lipschitz constants $L$. We chose a simple setting with only four weighted training points in $\mathbb{R}$. We plot the value of the learned function to highlight the different shapes of the minimizer as a function of $L$. High values of $L$ lead to better fitting.
We observe the same phenomenon on CIFAR10 with Lipschitz-constrained NN, see Figure 1(c). We place ourselves in the overparameterized regime with five wide hidden layers. We compare the loss and the error on the training set. We see that the network ends up severely underfitting for small values of $L$: not because the optimal classifier is not part of the hypothesis space, but because the minimizer of the binary cross entropy is not necessarily the minimizer of the error. As $L$ grows, we get closer to the maximum accuracy. The loss itself is responsible for the poor score, not by any means the hypothesis space. A bigger Lipschitz constant might ultimately lead to overfitting, $L$ playing the same role as the (usually omitted) temperature scaling parameter $\tau$ in $\sigma(\cdot / \tau)$.
Nonetheless, the class of Lipschitz classifiers enjoys another remarkable property: it is a Glivenko-Cantelli class, so the empirical BCE is a consistent estimator. Said otherwise, as the size of the training set increases, the training loss becomes a proxy for the test loss: 1-Lipschitz neural networks will not overfit in the limit of (very) large sample size.
Proposition 6 (Train loss is a proxy of test loss).
Let $\mathcal{D}$ be a probability measure on $\mathcal{X} \times \mathcal{Y}$ where $\mathcal{X}$ is a bounded set. Let $(x_i, y_i)_{1 \le i \le n}$ be a sample of $n$ i.i.d. random variables with law $\mathcal{D}$, and let $\hat f_n$ minimize the empirical BCE over $\mathrm{Lip}_L(\mathcal{X})$. Then we have (taking the limit $n \to \infty$):
(7) $\frac{1}{n} \sum_{i=1}^n \mathcal{L}_{BCE}(\hat f_n(x_i), y_i) \longrightarrow \min_{f \in \mathrm{Lip}_L(\mathcal{X})} \mathbb{E}_{(x,y) \sim \mathcal{D}}\left[\mathcal{L}_{BCE}(f(x), y)\right].$
It is another flavor of the bias-variance tradeoff. We know thanks to Corollaries 2 and 3 that the class of Lipschitz functions does not suffer any bias when it comes to classification. With Proposition 6 we also know that the variance can be made as small as we want by increasing the size of the training set. While this statement seems rather trivial, we emphasize that it is not a property shared by unconstrained neural networks: increasing the size of the training set gives no guarantee on generalization capabilities. Adversarial examples are an example of such a failure to reduce variance.

Table 1: Properties of common losses for unconstrained and Lipschitz-constrained networks.

| Loss | Property | Unconstrained NN | Lipschitz NN |
|---|---|---|---|
| BCE | minimizer | ill-defined (Proposition 4) | attained (Proposition 5) |
| | in practice | vanishing gradient (Figure 1(a)) | $L$ must be tuned (Example 3) |
| | consistent estimator | no | yes (Proposition 6) |
| Wasserstein | minimizer | ill-defined | attained |
| | in practice | diverges during training | weak classifier (Proposition 7) |
| | consistent estimator | no | yes (Appendix B) |
| Hinge | minimizer | attained | attained, high accuracy for small margin |
| | in practice | no guarantees on margin | large-margin classifier for big margin |
| | consistent estimator | no | yes (Appendix B) |
| hKR [26] | minimizer | ill-defined | attained |
| | consistent estimator | no | yes [26] |
| Robustness certificates | | no | yes (Property 1) |
| Exploding gradient | | yes, for some losses and architectures | no (Proposition 3) |
| Vanishing gradient | | yes, for some losses and architectures | yes, for some losses |
5 Alternative losses and link with Optimal Transport
We see that BCE is not necessarily the most suitable loss because of its dependence on the Lipschitz constant and the vanishing gradient issue. The loss $\mathcal{L}(f(x), y) = -y f(x)$ might seem a good pick at first sight, because its gradient cannot vanish and it explicitly maximizes the logits. Unfortunately its minimum over 1-Lipschitz functions is given by the Wasserstein distance [35] between $P$ and $Q$, according to the Kantorovich-Rubinstein duality:
(8) $\mathcal{W}_1(P, Q) = \sup_{f \in \mathrm{Lip}_1(\mathcal{X})} \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{x \sim Q}[f(x)].$
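The Kantorovich-Rubinstein objective of Equation 8 is straightforward to evaluate on empirical samples; the sketch below (our illustration) evaluates it for a candidate 1-Lipschitz potential:

```python
import numpy as np

def kr_objective(f, x_pos, x_neg):
    """Empirical Kantorovich-Rubinstein objective E_P[f] - E_Q[f].

    Maximizing this over 1-Lipschitz f gives the Wasserstein-1 distance
    between the two empirical distributions; here we simply evaluate
    the objective for a given candidate potential f.
    """
    return np.mean([f(x) for x in x_pos]) - np.mean([f(x) for x in x_neg])

# two 1-D point clouds; f(x) = x is 1-Lipschitz and attains the supremum
x_pos = np.array([1.0, 1.5, 2.0])
x_neg = np.array([-1.0, -1.5, -2.0])
print(kr_objective(lambda x: x, x_pos, x_neg))  # 3.0 = W1 between the clouds
```

Note that the optimal potential only cares about transporting mass, not about the decision frontier, which is the intuition behind Proposition 7.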
The minimizer of Equation 8 is known to be a weak classifier, as demonstrated empirically in [26]. We make their observation precise in Proposition 7.
Proposition 7 (KR minimizer is a weak classifier).
There exist distributions $P$ and $Q$ with disjoint supports in $\mathcal{X}$ such that, for any minimizer $f$ of Equation 8, the error of the associated classifier exceeds a prescribed level.
The hinge loss allows, in principle, reaching maximal accuracy, as used in [20]. The combination of hinge and KR losses is still a regularized OT problem [26]. BCE minimization can also be seen through the lens of OT (Appendix E). Results are summarized in Table 1.
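A hinge-regularized KR loss in the spirit of the hKR of [26] can be sketched as follows; the exact form and weighting used in [26] may differ, and `margin` and `alpha` are hypothetical hyperparameters of this sketch:

```python
import numpy as np

def hkr_loss(logits, y, margin=1.0, alpha=10.0):
    """Sketch of a hinge-regularized Kantorovich-Rubinstein (hKR) loss.

    The first term is the negated empirical KR objective (pushing the
    two classes apart in logit space); the hinge term penalizes points
    that fall inside the margin, restoring classification behavior.
    """
    kr = -np.mean(y * logits)                             # negated KR objective
    hinge = np.mean(np.maximum(0.0, margin - y * logits))  # margin violations
    return kr + alpha * hinge

logits = np.array([2.0, -2.0])
y = np.array([1.0, -1.0])
print(hkr_loss(logits, y))  # -2.0: all margins satisfied, only the KR term remains
```

Once every training point clears the margin, the hinge term vanishes and only the transport term keeps pushing the logits apart, which is what produces large certificates.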
With a margin we can bound the VC dimension [33] of the hypothesis class. The value $|f(x)|$ can be understood as a confidence. Hence, we may be interested in a classifier that takes a decision only if the logit exceeds some threshold $m$, while smaller logits can be understood as examples for which the classifier is unsure: the label may be flipped using attacks of norm smaller than $m$. In this setting, we fall back to PAC learnability.
Proposition 8 (1-Lipschitz functions with margin are PAC learnable).
Consider a binary classification task with bounded support $\mathcal{X}$. Let $m > 0$ be the margin. Let the hypothesis class of margin-$m$ 1-Lipschitz classifiers be defined as follows:
(9)
Then the VC dimension of this class is finite:
(10)
where $B$ is the unit ball and the sum must be understood as a Minkowski sum [28].
Here the additional output is a dummy symbol that the classifier may use to say "I don't feel confident"; using it is not allowed to shatter a set. Interestingly, if the classes are separable, choosing the margin accordingly guarantees that maximal accuracy is reachable. A prior on the separability of the input space is turned into VC bounds over the hypothesis space.
The previous result covers the whole class of 1-Lipschitz functions with margin. We can give another bound corresponding to a practical implementation of Lipschitz networks. With GroupSort2 activation functions (as in the work of [30]) we get the following rough upper bound:
Proposition 9 (VC dimension of a 1-Lipschitz neural network with sorting).
Let $f$ be a 1-Lipschitz neural network with GroupSort2 activation functions and a given number of parameters and neurons. Let $\mathcal{H}$ be the hypothesis class spanned by this architecture. Then the VC dimension of $\mathcal{H}$ is bounded:
(11)
From Proposition 9 we can derive generalization bounds using PAC theory. Note that most results on the VC dimension of neural networks use the hypothesis that the activation function is applied element-wise (such as in [4]) and get asymptotically tighter bounds in the ReLU case. Such a hypothesis no longer applies here; however, we believe this preliminary result can be strengthened.
6 Conclusion
We proved that 1-Lipschitz networks exhibit numerous attractive properties. They are easily certifiable and can reach high accuracy. However, the loss function must be chosen accordingly: the binary cross entropy is not necessarily the best choice because of vanishing gradients. Hinge and hKR have appealing properties.
Their training remains a challenge. The solutions of the optimization problem in Equation 6 still need to be characterized and understood, in particular the bias induced by the minimizer of the loss.
Most architectural innovations of the past years, such as Batch Normalization, Dropout and Attention, are not 1-Lipschitz (see [16]) and cannot benefit 1-Lipschitz neural networks in a straightforward manner. Alternatives need to be found. Orthogonal convolutions are still an active research area (see [38] or [21]) but are required to reach SOTA on image benchmarks. If future works can overcome these challenges, it will open the path to neural networks that are both effective and provably robust. For these reasons we believe they are a promising direction of further research for the community.
7 Acknowledgments
This work received funding from the French Investing for the Future PIA3 program within the Artificial and Natural Intelligence Toulouse Institute (ANITI). A special thanks to Thibaut Boissin for the support with the Deel.Lip library. We thank Sébastien Gerchinovitz for critical proof checking, and Jean-Michel Loubes for useful discussions.
References
[1] (1996) Weak convergence and empirical processes. Springer-Verlag New York.
[2] (2019) Sorting out Lipschitz function approximation. In International Conference on Machine Learning, pp. 291-301.
[3] (2017) Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214-223.
[4] (2019) Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research 20, pp. 63-1.
[5] (2006) Pattern Recognition and Machine Learning. Springer.
[6] (1971) An iterative algorithm for computing the best estimate of an orthogonal matrix. SIAM Journal on Numerical Analysis 8 (2), pp. 358-364.
[7] (1989) Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM) 36 (4), pp. 929-965.
[8] (2020) Stochastic flows and geometric optimization on the orthogonal group. In International Conference on Machine Learning, pp. 1918-1928.
[9] (2021) Lipschitz recurrent neural networks. In International Conference on Learning Representations.
[10] (1999) Learning to forget: continual prediction with LSTM.
[11] (2014) Efficient classification for metric data. IEEE Transactions on Information Theory 60 (9), pp. 5750-5759.
[12] (2017) Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, Vol. 30, pp. 5767-5777.
[13] (2017) Variational inference with orthogonal normalizing flows. In Bayesian Deep Learning, NIPS 2017 workshop.
[14] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778.
[15] (2018) Orthogonal weight normalization: solution to optimization over multiple dependent Stiefel manifolds in deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
[16] (2020) The Lipschitz constant of self-attention. arXiv preprint arXiv:2006.04710.
[17] (2019) Lipschitz constant estimation of neural networks via sparse polynomial optimization. In International Conference on Learning Representations.
[18] (2018) Spaces of convex n-partitions. In New Trends in Intuitive Geometry, pp. 279-306.
[19] (2019) Cheap orthogonal constraints in neural networks: a simple parametrization of the orthogonal and unitary group. In International Conference on Machine Learning, pp. 3794-3803.
[20] (2019) Preventing gradient attenuation in Lipschitz constrained convolutional networks. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 32, Cambridge, MA.
[21] (2021) Convolutional normalization: improving deep convolutional network robustness and training. arXiv preprint arXiv:2103.00673.
[22] (2018) Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations.
[23] (2018) Spectral normalization for generative adversarial networks. In International Conference on Learning Representations.
[24] (2016) Weight normalization: a simple reparameterization to accelerate training of deep neural networks. In NIPS.
[25] (2018) Lipschitz regularity of deep neural networks: analysis and efficient estimation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 3839-3848.
[26] (2020) Achieving robustness in classification using optimal transport with hinge regularization. arXiv preprint arXiv:2006.06520.
[27] (2017) Robust large margin deep neural networks. IEEE Transactions on Signal Processing 65 (16), pp. 4265-4280.
[28] (1998) Metric entropy of homogeneous spaces. Banach Center Publications 43 (1), pp. 395-410.
[29] (2014) Intriguing properties of neural networks. In International Conference on Learning Representations.
[30] (2021) Approximating Lipschitz continuous functions with GroupSort neural networks. In International Conference on Artificial Intelligence and Statistics, pp. 442-450.
[31] (2019) Robustness may be at odds with accuracy. In International Conference on Learning Representations.
[32] (2018) Lipschitz-margin training: scalable certification of perturbation invariance for deep neural networks. In Advances in Neural Information Processing Systems, Vol. 31, pp. 6541-6550.
[33] (2015) On the uniform convergence of relative frequencies of events to their probabilities. In Measures of Complexity, pp. 11-30.
[34] (2013) The Nature of Statistical Learning Theory. Springer Science & Business Media.
[35] (2008) Optimal Transport: Old and New. Vol. 338, Springer Science & Business Media.
[36] (2018) Lipschitz regularity of deep neural networks: analysis and efficient estimation. Advances in Neural Information Processing Systems 31, pp. 3835-3844.
[37] (2004) Distance-based classification with Lipschitz functions. J. Mach. Learn. Res. 5, pp. 669-695.
[38] (2020) Orthogonal convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11505-11515.
[39] (2018) Towards fast computation of certified robustness for ReLU networks. In International Conference on Machine Learning, pp. 5276-5285.
[40] (2018) Evaluating the robustness of neural networks: an extreme value theory approach. In International Conference on Learning Representations.
[41] (2020) A closer look at accuracy vs. robustness. Advances in Neural Information Processing Systems 33.
[42] (2017) Spectral norm regularization for improving the generalizability of deep learning. arXiv preprint arXiv:1705.10941.
[43] (2019) Adversarial examples: attacks and defenses for deep learning. IEEE Transactions on Neural Networks and Learning Systems 30 (9), pp. 2805-2824.
Appendix A Proofs of Section 3
The proof of Proposition 1 is constructive; we first need to introduce the Highest Mountain Function.
Definition 3 (Highest Mountain Function)
Let $c$ be any classifier with closed preimages. Let $A_1 = c^{-1}(1)$ and $A_{-1} = c^{-1}(-1)$. Let $d(x, A) = \min_{z \in A} \|x - z\|$ denote the distance to a closed set $A$. Let $\Gamma$ be the set of points equidistant to $A_1$ and $A_{-1}$. We define $f$ as follows:
(12) $f(x) = c(x)\, d(x, \Gamma).$
Now we can prove that the function previously defined verifies all the properties.
Proof.
We start by proving that $\tilde{f}$ is 1-Lipschitz.
First, consider the case $x \in A$ and $y \in A$. Then $\tilde{f}(x) = d(x, B)$ and $\tilde{f}(y) = d(y, B)$. Assume without loss of generality that $\tilde{f}(x) \geq \tilde{f}(y)$. Let $b \in B$ such that $\|y - b\| = d(y, B)$ (guaranteed to exist since $B$ is closed). Then by definition of $d$ we have $d(x, B) \leq \|x - b\|$. So:
(13) $|\tilde{f}(x) - \tilde{f}(y)| = d(x, B) - d(y, B) \leq \|x - b\| - \|y - b\| \leq \|x - y\|.$
The case $x \in B$ and $y \in B$ is identical. Now consider the case $x \in A$ and $y \in B$. We have $|\tilde{f}(x) - \tilde{f}(y)| = d(x, B) + d(y, A)$. We will proceed by contradiction. Assume that $d(x, B) + d(y, A) > \|x - y\|$. Let $\lambda \in [0, 1]$ such that $\lambda \|x - y\| < d(x, B)$ and $(1 - \lambda)\|x - y\| < d(y, A)$. Let:
$z = (1 - \lambda)x + \lambda y.$
Then $\|x - z\| = \lambda \|x - y\| < d(x, B)$. So by definition of $d(x, B)$ we have $z \notin B$, hence $z \in A$. But we also have:
(14) $\|y - z\| = (1 - \lambda)\|x - y\| < d(y, A).$
So we have $z \notin A$, which is a contradiction. Consequently, we must have $d(x, B) + d(y, A) \leq \|x - y\|$. The function $\tilde{f}$ is indeed 1-Lipschitz.
Now, we will prove that $\|\nabla \tilde{f}(x)\| = 1$ everywhere it is defined. Let $x$ such that the projection $\pi_B(x) = \arg\min_{b \in B} \|x - b\|$ is unique. Consider $x_h = x + h\,\frac{\pi_B(x) - x}{\|\pi_B(x) - x\|}$ with $h$ a small positive real. We have $d(x_h, B) \leq d(x, B) - h$; it follows by the triangle inequality that $d(x_h, B) = d(x, B) - h$. We see that the vector
$u = \frac{\pi_B(x) - x}{\|\pi_B(x) - x\|}$
is the (unique) direction for which the directional derivative of $\tilde{f}$ is minimal. Knowing that $\tilde{f}$ is 1-Lipschitz yields that $\|\nabla \tilde{f}(x)\| = 1$. For points for which the projection is not unique, the gradient is not defined because different directions minimize the directional derivative, which contradicts the uniqueness of the gradient vector. Finally, note that $\tilde{f}$ and $f$ have the same sign on $A$ and $B$. Indeed, in this case either $x \in A$ and $\tilde{f}(x) = d(x, B)$, or $x \in B$ and $\tilde{f}(x) = -d(x, A)$, and the result is straightforward. ∎
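As an illustration of this construction, the sketch below checks the 1-Lipschitz bound empirically on a toy problem. Since the closed pre-images are replaced by finite point clouds here, we use the simpler 1-Lipschitz surrogate $\frac{1}{2}(d(x, B) - d(x, A))$, which shares the decision frontier of the nearest-set classifier; all names in the code are hypothetical.

```python
import numpy as np

def signed_distance(x, A, B):
    """1-Lipschitz function with the same decision frontier as the
    nearest-set classifier: positive closer to A, negative closer to B.
    (A simplified surrogate of the Highest Mountain Function, used here
    because A and B are finite samples rather than closed pre-images.)"""
    d_A = np.min(np.linalg.norm(A - x, axis=1))
    d_B = np.min(np.linalg.norm(B - x, axis=1))
    return 0.5 * (d_B - d_A)  # half the difference of two 1-Lipschitz maps

rng = np.random.default_rng(0)
A = rng.normal(loc=-2.0, size=(50, 2))  # samples of class +1
B = rng.normal(loc=+2.0, size=(50, 2))  # samples of class -1

# Empirical check of |f(x) - f(y)| <= ||x - y|| on random pairs.
worst_ratio = 0.0
for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    gap = abs(signed_distance(x, A, B) - signed_distance(y, A, B))
    worst_ratio = max(worst_ratio, gap / np.linalg.norm(x - y))
print(worst_ratio)  # never exceeds 1, as guaranteed by the triangle inequality
```

The sign of this function reproduces the classifier's decision on the samples themselves, mirroring the last remark of the proof.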
For the multiclass case we must slightly change the definition to prove Proposition 2.
Definition 4 (Multiclass Highest Mountain Function)
Let $f$ be any classifier with closed pre-images. Let $A_k = f^{-1}(\{k\})$ for $1 \leq k \leq K$. We define $\tilde{f} = (\tilde{f}_1, \ldots, \tilde{f}_K)$ as follows:
(15) $\tilde{f}_k(x) = d\Big(x, \bigcup_{l \neq k} A_l\Big)\,\mathbb{1}_{\{x \in A_k\}} - d(x, A_k)\,\mathbb{1}_{\{x \notin A_k\}}.$
Overall, the proof remains the same.
Proof.
We start by proving that $\tilde{f}$ is 1-Lipschitz.
We will prove that $\|\tilde{f}(x) - \tilde{f}(y)\|_\infty \leq \|x - y\|$, i.e. that each coordinate $\tilde{f}_k$ is 1-Lipschitz. First, consider the case where $x$ and $y$ are both in $A_k$, or both outside $A_k$. Then $|\tilde{f}_k(x) - \tilde{f}_k(y)| \leq \|x - y\|$ using the proof of Proposition 1. Now, consider the case $x \in A_k$ and $y \notin A_k$. Then:
(16) $|\tilde{f}_k(x) - \tilde{f}_k(y)| = d\Big(x, \bigcup_{l \neq k} A_l\Big) + d(y, A_k).$
Using the same technique as in the previous proof, if we assume $d\big(x, \bigcup_{l \neq k} A_l\big) + d(y, A_k) > \|x - y\|$ then we can construct a point $z$ on the segment $[x, y]$ verifying both $z \in A_k$ and $z \notin A_k$, which is a contradiction. Consequently $|\tilde{f}_k(x) - \tilde{f}_k(y)| \leq \|x - y\|$.
Each row of the Jacobian of $\tilde{f}$ is the gradient of some coordinate $\tilde{f}_k$, to which the reasoning of the binary case applies (as in the previous proof). We conclude similarly that $\|\nabla \tilde{f}_k(x)\| = 1$ everywhere it is defined.
Finally, note that $\arg\max_k \tilde{f}_k$ is equal to $f$ everywhere $f$ is defined, which concludes the proof. ∎
Appendix B Proofs of Section 4
The proof of Proposition 4 only requires looking at the logits of two examples having different labels.
Proof.
Let . For the pair , as , by positivity of we must have:
(17) 
As the right-hand side tends to zero, we have:
(18) 
Consequently . By definition , so . ∎
The proof of Proposition 5 is an application of the Arzelà–Ascoli theorem.
Proof.
Let . Consider a sequence of functions in such that .
Consider the sequence . We want to prove that it is bounded. Proceed by contradiction and observe that if then . Indeed, for we can guarantee that is constant over , and in this case one of the two classes is misclassified; knowing that yields the desired result. But if , then cannot converge to . Consequently, must be upper bounded by some .
Hence the sequence is uniformly bounded. Moreover, each function is Lipschitz, so the sequence is uniformly equicontinuous. By applying the Arzelà–Ascoli theorem we deduce that there exists a subsequence (where is strictly increasing) that converges uniformly to some , and . As , the infimum is indeed a minimum. ∎
The proof of Theorem 6 is an application of the Glivenko–Cantelli theorem.
Proof.
We proved in Proposition 5 that the minimum of Equation 6 is attained, so we replace by . We restrict ourselves to a subset of on which , because the minimum lies in this subspace. We have:
Let . Note that is also Lipschitz and bounded on . The bracketing entropy (see [1], Chapter 2.1) of this class of functions is finite (see [1], Chapter 3.2). Consequently the class is Glivenko–Cantelli. Finally , which concludes the proof. ∎
To prove Proposition 3 we just need to write the chain rule.
Proof.
The gradient is computed using the chain rule. Let $\theta_t$ be any parameter of layer $t$. Let $z_t$ be a dummy variable corresponding to the input of layer $t + 1$, which is also the output of layer $t$. Then:
(19) $\frac{\partial \mathcal{L}}{\partial \theta_t} = \frac{\partial \mathcal{L}}{\partial z_T} \frac{\partial z_T}{\partial z_{T-1}} \cdots \frac{\partial z_{t+1}}{\partial z_t} \frac{\partial z_t}{\partial \theta_t},$
with $T$ the depth of the network. As the layers of the neural network are all 1-Lipschitz, then:
$\Big\| \frac{\partial z_T}{\partial z_t} \Big\| \leq \prod_{s=t}^{T-1} \Big\| \frac{\partial z_{s+1}}{\partial z_s} \Big\| \leq 1.$
Hence:
(20) $\Big\| \frac{\partial \mathcal{L}}{\partial \theta_t} \Big\| \leq \Big\| \frac{\partial \mathcal{L}}{\partial z_T} \Big\| \, \Big\| \frac{\partial z_t}{\partial \theta_t} \Big\|.$
Finally, for each layer $t$ we replace $\theta_t$ by the appropriate parameter, which yields the desired result. ∎
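This bound can be checked numerically. The sketch below (a toy stack of orthogonal linear layers followed by GroupSort2; all function names are hypothetical) backpropagates an arbitrary loss gradient by hand and verifies that its norm never grows; with orthogonal Jacobians it is in fact exactly preserved.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_orthogonal(n):
    q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return q  # spectral norm exactly 1, hence a 1-Lipschitz linear layer

def groupsort2(z):
    # Sorts coordinates pairwise (MinMax gates); 1-Lipschitz, Jacobian is a permutation.
    return np.sort(z.reshape(-1, 2), axis=1).reshape(-1)

def groupsort2_vjp(pre, g):
    # Vector-Jacobian product: undo the swap on pairs that were transposed forward.
    pre2, g2 = pre.reshape(-1, 2), g.reshape(-1, 2).copy()
    swapped = pre2[:, 0] > pre2[:, 1]
    g2[swapped] = g2[swapped][:, ::-1]
    return g2.reshape(-1)

T, n = 4, 8
layers = [random_orthogonal(n) for _ in range(T)]

# Forward pass, caching pre-activations for the backward pass.
z, cache = rng.normal(size=n), []
for W in layers:
    pre = W @ z
    cache.append((W, pre))
    z = groupsort2(pre)

# Backward pass starting from an arbitrary gradient dL/dz_T.
g = rng.normal(size=n)
norms = [np.linalg.norm(g)]
for W, pre in reversed(cache):
    g = W.T @ groupsort2_vjp(pre, g)
    norms.append(np.linalg.norm(g))

print(np.allclose(norms, norms[0]))  # gradient norm preserved at every layer
```

With merely 1-Lipschitz (rather than orthogonal) layers, the norms would be non-increasing instead of constant, which is exactly the inequality in the proof.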
Results of Table 1. We can apply the same reasoning as in the proof of Theorem 6. If we replace BCE with the hinge loss, the resulting class of functions is still Glivenko–Cantelli, so the theorem applies. Hence, the hinge loss yields a consistent estimator in the space of 1-Lipschitz functions. The result is also straightforward for the Wasserstein objective: consistency of the empirical Wasserstein distance is a textbook result.
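As a sanity check on the last claim, the one-dimensional empirical Wasserstein distance can be computed by sorting, and its consistency observed numerically (a toy sketch; the Gaussian distribution and the sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def w1_empirical(xs, ys):
    # 1D W1 between two empirical measures with the same number of atoms:
    # the optimal transport plan matches sorted samples.
    return np.mean(np.abs(np.sort(xs) - np.sort(ys)))

results = {}
for n in (10, 100, 1000, 10000):
    # Average over a few draws of two independent samples of the same Gaussian.
    results[n] = np.mean(
        [w1_empirical(rng.normal(size=n), rng.normal(size=n)) for _ in range(20)]
    )
    print(n, round(results[n], 3))  # shrinks toward 0 as n grows
```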
Appendix C Proofs of Section 5
Proof of Proposition 7.
Proof.
We will build and as finite collections of Diracs. Let and for some , where denotes the Dirac distribution at . An example is depicted in Figure 3 for . In dimension one, the optimal transportation plan is easy to compute: each atom of mass from at position is matched with the corresponding one in to its immediate right. Consequently we must have . The function is not uniquely defined on segments, but it does not matter: since it is 1-Lipschitz we must have . Consequently, in every case, for we must have and . In other words, the function is strictly increasing on and .
The solutions of the problem are invariant by translation: if is a solution, then with is also a solution. Let us take a look at the classifier . If is chosen such that and for some , then points are correctly classified out of a total of points. This corresponds to an error of . Take to conclude. ∎
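The one-dimensional transport argument used above is easy to reproduce: with equal numbers of atoms, the optimal plan matches sorted positions. A minimal sketch (the atom positions below are illustrative, not the paper's exact $P$ and $Q$):

```python
import numpy as np

def wasserstein_1d(xs, ys):
    """W1 between two uniform discrete measures with the same number of atoms.

    In dimension one the optimal plan matches sorted atoms, so W1 is the
    average distance between the k-th atoms of the two measures."""
    xs, ys = np.sort(np.asarray(xs, float)), np.sort(np.asarray(ys, float))
    assert xs.shape == ys.shape
    return np.mean(np.abs(xs - ys))

# Two interleaved Dirac collections: every atom of P is matched with the
# atom of Q to its immediate right.
P = np.array([0.0, 2.0, 4.0])
Q = np.array([1.0, 3.0, 5.0])
print(wasserstein_1d(P, Q))  # each atom travels one unit, so W1 = 1.0
```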
Proof of Proposition 8.
Proof.
The implication “finite VC dimension ⇒ PAC learnable” is a classical result from [7].
The VC dimension of the class is the maximum size of a set shattered by it. As the functions are 1-Lipschitz, if then , and . Consequently, a finite set is shattered if and only if for all pairs of points in the set we have , where is the open ball of center and radius .
The maximum number of disjoint balls of radius that fit inside the domain is known as its packing number for that radius. The domain is bounded, hence its packing number is finite.
The bounds on the packing number are a direct application of [28] (Lemma 1). ∎
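To make the packing argument concrete, a greedy sketch (a hypothetical helper, yielding only a lower bound) counts pairwise-disjoint balls in the unit square: centers kept at pairwise distance at least $2r$ give disjoint open balls of radius $r$.

```python
import numpy as np

def greedy_packing(points, radius):
    """Greedily keep centers at pairwise distance >= 2 * radius, so the open
    balls of that radius around them are disjoint. Returns a lower bound on
    the packing number of the sampled region."""
    kept = []
    for p in points:
        if all(np.linalg.norm(p - q) >= 2 * radius for q in kept):
            kept.append(p)
    return np.array(kept)

rng = np.random.default_rng(0)
candidates = rng.uniform(size=(2000, 2))  # candidate centers in [0, 1]^2

counts = {r: len(greedy_packing(candidates, r)) for r in (0.25, 0.1, 0.05)}
print(counts)  # smaller radius: more disjoint balls fit, but always finitely many
```

The counts stay finite for any fixed radius because the domain is bounded, which is exactly the mechanism behind the finite VC dimension above.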
The proof of Proposition 9 uses the number of affine pieces generated by the GroupSort2 activation function.
Proof.
We first prove that the network is piecewise affine and that the number of such pieces is not greater than $\prod_{l} 2^{n_l / 2}$, where $n_l$ is the number of neurons in layer $l$. We proceed by induction on the depth of the neural network. For depth $1$ we have an affine function, which contains only one affine piece by definition (the whole domain), so the result is true.
Now assume that a neural network of depth $T$ with widths $n_2, \ldots, n_T$ has at most $\prod_{l=2}^{T} 2^{n_l / 2}$ affine pieces. The enumeration starting at $l = 2$ is not a mistake: we pursue the induction for a neural network of depth $T + 1$ and widths $n_1, n_2, \ldots, n_T$. The composition of affine functions is affine, hence applying an affine transformation preserves the number of pieces. The analysis falls back to the number of distinct affine pieces created by the GroupSort2 activation function. If such an activation function creates at most $P$ pieces, then we have the immediate bound $P \prod_{l=2}^{T} 2^{n_l / 2}$ on the total number of pieces.
Let $J_x$ be the Jacobian of the GroupSort2 operation evaluated at $x$. The cardinality of the set of Jacobians is the number of distinct affine pieces. For GroupSort2 we have a combination of $n_1 / 2$ MinMax gates. Each MinMax gate is defined on $\mathbb{R}^2$ and contains two pieces: one on which the gate behaves like the identity and the other on which the gate behaves like a transposition. Consequently we have $P \leq 2^{n_1 / 2}$, and unrolling the recurrence yields the desired result.
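The counting step can be checked empirically: the Jacobian of GroupSort2 on $\mathbb{R}^n$ is determined by which of the $n/2$ pairs get transposed, so at most $2^{n/2}$ distinct affine pieces exist, and random inputs realize all of them. A small sketch (the dimension is chosen arbitrarily):

```python
import numpy as np

n = 6  # input dimension: GroupSort2 applies n // 2 = 3 MinMax gates

def swap_pattern(z):
    # Which pairs the forward pass transposes; this tuple determines the Jacobian.
    pairs = z.reshape(-1, 2)
    return tuple(bool(b) for b in pairs[:, 0] > pairs[:, 1])

rng = np.random.default_rng(0)
patterns = {swap_pattern(rng.normal(size=n)) for _ in range(10_000)}
print(len(patterns), 2 ** (n // 2))  # both counts agree: every pattern is realized
```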
Now, we just need to apply Lemma 1 with the number of affine pieces computed above.
Lemma 1 (Piecewise affine function).
Let be a class of classifiers that are piecewise affine, such that the pieces form a convex partition of the input space with pieces (each piece of the partition is a convex set). Then we have:
The proof of Lemma 1 is detailed below.
Let denote the growth function [34] of the class. According to Sauer's lemma [34], if it grows polynomially with the number of points, then the degree of the polynomial is an upper bound on the VC dimension. We will show that this is indeed the case by computing a crude upper bound on the degree. Assume that we are given points, and