1 Introduction
Neural networks, as universal approximators (of e.g. continuous functions), may represent highly complex functions, which in practice results in them being quite sensitive to their inputs, as evidenced e.g. by adversarial examples (Szegedy et al., 2013). One simple measure of a network's sensitivity to its inputs is its Lipschitz constant. Indeed, an increasing number of applications call for the control of this Lipschitz constant, which yields guarantees for the input-output sensitivity. One particular way of controlling the Lipschitz constant of a network is by spectral normalization of its convolutions and weight matrices, which explicitly limits the network's Lipschitz constant in terms of the 2-norm.
Since it determines a tradeoff between expressiveness and robustness, controlling a network's Lipschitz constant is a form of regularization (Oberman and Calder, 2018). A direct application of this principle is the limiting of the network's Lipschitz constant in order to robustify it against adversarial attacks (Tsuzuku et al., 2018a). A version of this idea is also found in spectrally normalized GANs (Miyato et al., 2018b), which regularize the discriminator, resulting in better-looking generated images. Note that in practice this is similar to Wasserstein GANs (Arjovsky et al., 2017), which require enforcing the Lipschitz constant of their critic network to be less than or equal to one (in order to approximate a Wasserstein distance). Another instance in which spectral normalization can be used to enforce necessary guarantees is found in invertible residual networks (Behrmann et al., 2019), which require the Lipschitz constant of the residual blocks to be below one. Furthermore, the Lipschitz constant of an invertible network influences the stability of inversion, as shown in (Behrmann et al., 2020). For example, in additive coupling layers (Dinh et al., 2014), limiting the layers' Lipschitz constant will result in an increase in both the forward and inverse numerical stability. This in particular has implications for training with restored (instead of stored) activations (Gomez et al., 2017).
To calculate (and thus enforce) spectral norms of convolutional layers, common methods either use the power method to calculate the norms, or use (typically non-tight) upper bounds as an approximation. Enforcing these spectral norms is usually done by performing a normalization step (i.e. dividing the layers' weights by the computed Lipschitz constant) after each step of (stochastic) gradient descent, which means that the calculation of the spectral norms has to be performed many times during training. Spectral normalization can therefore strongly influence the overall speed of training, potentially bottlenecking the training duration. Details on this are given in section 2.3.

In this work, we propose a novel method for spectral normalization tailored to depthwise separable convolutions (Sifre and Mallat, 2014). Depthwise separable convolutions are a form of convolutional layers in which the channel-wise spatial convolution and the inter-channel combination of channels are decoupled. Our proposed method makes use of the specific structure of depthwise separable convolutions to achieve significant speedups compared to other methods, while having practically no additional memory overhead. In a realistic training scenario (training MobileNetV2 on ImageNet), each epoch takes only 2% additional training time, thus making spectral normalization essentially 'for free' for depthwise separable convolutional networks.
2 Problem setting
2.1 Depthwise separable convolutions
Let $k_{i,j}$ denote a filter corresponding to an input's $i$-th channel (denoted $a_i$) and an output's $j$-th channel (denoted $b_j$). Then

$$b_j = \sum_{i} k_{i,j} \star a_i \tag{1}$$

describes the conventional multi-channel convolution operation, where $\star$ signifies a spatial filtering (a convolution, respectively a cross-correlation). The spatial and inter-channel computations of these multi-channel convolutions are hence inherently linked. Recently, depthwise separable convolutions have arisen as a family of computationally cheaper (and yet expressive) alternatives, in which the spatial and inter-channel computations are decoupled in the form of depthwise respectively pointwise convolutions. Depthwise convolutions apply a filtering to each input channel independently (resulting in the same number of output channels), which can be equivalently represented by setting $k_{i,j} = 0$ for all $i \neq j$ in equation (1). These serve as the spatial operations of depthwise separable convolutions. Pointwise convolutions, on the other hand, can be represented by multi-channel convolutions as in equation (1) in which each filter has spatial size one (e.g. $1 \times 1$ convolutions in 2D), i.e. for non-strided filterings the output channels are merely linear combinations of the input channels. They constitute the inter-channel computations of depthwise separable convolutions.
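To make this decoupling concrete, the two operations can be sketched with naive NumPy loops. This is an illustrative sketch only (the function names and the fixed 3×3 filter size are our own choices, not an efficient implementation):

```python
import numpy as np

def depthwise_conv(x, filters):
    """Apply one 3x3 filter per channel independently ('same' zero padding).
    x: (C, H, W), filters: (C, 3, 3). The spatial operation, one per channel."""
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))  # zero-pad spatial dims
    out = np.zeros_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                # cross-correlation, as implemented by most frameworks
                out[c, i, j] = np.sum(xp[c, i:i+3, j:j+3] * filters[c])
    return out

def pointwise_conv(x, weight):
    """1x1 convolution: each output channel is a linear combination of
    input channels, applied identically at every pixel.
    x: (C_in, H, W), weight: (C_out, C_in)."""
    return np.tensordot(weight, x, axes=([1], [0]))
```

A depthwise separable convolution then composes the two, optionally with a nonlinearity in between.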
The various realizations of depthwise separable convolutions in the literature mainly differ in their number and order of operations and in where (and how often) a nonlinearity is applied. While a depthwise convolution followed by a pointwise convolution is a low-rank factorization of a conventional convolution (Sifre and Mallat, 2014), recent approaches usually incorporate nonlinearities between the two types of operations (Chollet, 2017; Howard et al., 2017; Sandler et al., 2018; Tan and Le, 2019). It is worth highlighting that EfficientNet (Tan and Le, 2019), which employs the depthwise separable convolutions defined by MobileNetV2 (Sandler et al., 2018), is currently state-of-the-art in the ILSVRC (ImageNet) challenge (Pham et al., 2020).

2.2 Mathematical preliminaries
A function $f: X \to Y$ between two normed spaces $X$ and $Y$ is called Lipschitz continuous if there is a constant $L \geq 0$ such that

$$\|f(x) - f(y)\|_Y \leq L \, \|x - y\|_X$$

for all $x, y \in X$. The smallest such constant is referred to as the Lipschitz constant of $f$ and will be denoted by $\mathrm{Lip}(f)$.
For continuous, almost everywhere differentiable functions, the Lipschitz constant of a function $f$ is given by the supremum of the operator norm (w.r.t. the norms of $X$, $Y$) of all possible derivatives, i.e. $\mathrm{Lip}(f) = \sup_x \|Df(x)\|$. In the following, we will assume that $X$ and $Y$ are Euclidean spaces (equipped with the 2-norm), in which case the above operator norm becomes the spectral norm of $Df(x)$ (where defined), i.e. the largest singular value of $Df(x)$. This highlights the difficulty of actually calculating the Lipschitz constant of a neural network: even in the case of locally linear functions, such as e.g. ReLU networks, this would require the evaluation of the operator norm in each locally linear region. In fact, even for a two-layer neural network, calculating the Lipschitz constant is NP-hard (Scaman and Virmaux, 2018). Approximating this Lipschitz constant can be achieved by phrasing the Lipschitz estimation problem as an optimization problem (Fazlyab et al., 2019; Latorre et al., 2020; Chen et al., 2020). These approaches are, however, still very costly.

On the other hand, calculating upper bounds of the Lipschitz constants of neural networks is tractable, as long as finding upper bounds of the individual layers' Lipschitz constants is tractable. In particular, if $f = f_n \circ \cdots \circ f_1$ for Lipschitz continuous functions $f_1, \ldots, f_n$, then $\mathrm{Lip}(f) \leq \prod_{l=1}^{n} \mathrm{Lip}(f_l)$ (chain rule for Lipschitz constants). While the bounds this chain rule yields may be too loose for approximating the true Lipschitz constant of the whole network, it may still be possible to make approximate statements about individual layers' Lipschitz constants. For layers of the type $x \mapsto \varphi(Ax + b)$, where $A$ is linear, $b$ is a bias vector and $\varphi$ is an activation function with bounded Lipschitz constant, it holds that

$$\mathrm{Lip}\big(\varphi(A \cdot + b)\big) \leq \mathrm{Lip}(\varphi) \cdot \|A\|_2, \tag{2}$$

i.e. the Lipschitz constant of a nonlinear layer can typically be upper bounded by the spectral norm of its linear portion. This is true in particular for convolutional layers or dense layers with e.g. ReLU, tanh, sigmoid, softmax or ELU nonlinearity (all of which have $\mathrm{Lip}(\varphi) \leq 1$).
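Inequality (2) can be checked numerically for a small ReLU layer; a minimal sketch with random weights, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(20, 10))
b = rng.normal(size=20)

def layer(x):
    # ReLU layer x -> relu(Wx + b); ReLU is 1-Lipschitz
    return np.maximum(W @ x + b, 0.0)

sigma = np.linalg.norm(W, ord=2)  # spectral norm = largest singular value of W

# Check ||f(x) - f(y)|| <= ||W||_2 * ||x - y|| on random pairs of inputs
for _ in range(1000):
    x, y = rng.normal(size=10), rng.normal(size=10)
    lhs = np.linalg.norm(layer(x) - layer(y))
    assert lhs <= sigma * np.linalg.norm(x - y) + 1e-9
```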
2.3 Spectral normalization
One particularly important insight about Lipschitz constants is the fact that the ability to calculate them often enables one to enforce a specific Lipschitz constant in a function.
In fact, it holds that $\mathrm{Lip}(c \cdot f) = c \cdot \mathrm{Lip}(f)$ for any $c > 0$, so that $\tilde{f} := L \cdot f / \mathrm{Lip}(f)$ has $\mathrm{Lip}(\tilde{f}) = L$ for some desired $L > 0$. Instead of scaling the function itself, most types of neural network layers admit a scaling of their associated parameters: since for example convolutions are positively 1-homogeneous in their kernels, it suffices to scale their filter kernel accordingly. That means that for a convolutional operator $K$ with filter kernel $k$ (and for our setting of spaces equipped with the 2-norm), rescaling this kernel via $\tilde{k} := k / \|K\|_2$ (and keeping a possible bias) yields a new convolutional operator $\tilde{K}$ with $\|\tilde{K}\|_2 = 1$. In the following, we will refer to this normalization procedure as spectral normalization. A common way of training neural networks with spectrally normalized convolutional kernels is to apply spectral normalization after each optimization step. Note that if we overestimate the spectral norm, then still $\|\tilde{K}\|_2 \leq 1$.
If one wishes to enforce some specific Lipschitz constant, multiplying the output of the normalized convolution by some desired $c$ then makes the overall mapping $c$-Lipschitz. We call this method hard scaling and refer to $c$ as the scaling constant.

One can make this scaling a bit more flexible by instead multiplying with a factor that depends on a learnable parameter and is guaranteed to lie in $(0, c)$. Since this factor never exceeds $c$, the enforced Lipschitz constant remains bounded below $c$, while the dependence on the learnable parameter allows the 'right' Lipschitz constant (up to $c$) to be learned. We will refer to this variant as soft scaling.
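In code, both variants amount to a rescaling of the kernel. A minimal matrix-valued sketch; the sigmoid parametrization of the learnable soft factor is our own assumption here (the text only requires the factor to stay below the scaling constant):

```python
import numpy as np

def hard_scale(kernel, sigma, c):
    """Normalize by the (over)estimated spectral norm sigma, then scale by c,
    so that the resulting operator has spectral norm (at most) c."""
    return (c / sigma) * kernel

def soft_scale(kernel, sigma, c, theta):
    """Multiply by c * s(theta) instead, where s is a map into (0, 1) -- here
    assumed to be a sigmoid. Since s(theta) < 1, the norm stays below c while
    theta remains free to be learned."""
    s = 1.0 / (1.0 + np.exp(-theta))  # assumed parametrization of the factor
    return (c * s / sigma) * kernel
```

Since convolutions are 1-homogeneous in their kernels, scaling the kernel scales the operator norm by the same factor.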
2.4 Spectral norm estimation for convolutional layers
Unlike in the case of dense layers, whose parameters directly allow for a computation of their spectral norm via matrix methods, it is more difficult to assess a convolution's norm solely from its parameters. In fact, the same kernel generally defines not one, but a whole family of operators (depending e.g. on the input resolution and stride). The estimation of their norms can either be performed with generic methods or make use of the operators' specific structure. One approach is to restrict the convolutional operators to be exactly orthogonal (Xiao et al., 2018; Li et al., 2019), which strictly enforces unit-norm convolutions. By relaxing this constraint to approximately orthogonal convolutions, the spectral norm is likewise relaxed to values close to (Cisse et al., 2017) or up to (Qian and Wegman, 2018) one. Another class of methods exploits the structure of multi-channel convolution operators to gain insights into their singular values. In particular, by making use of the theory of doubly block circulant matrices, Sedghi et al. (2019) are able to derive an exact formula for the singular values of circulant 2D convolutions, using Fourier transforms and a singular value decomposition. While this is exact, the values are only approximate for the more commonly used zero-padded or 'valid' convolutions. Furthermore, the computation is quite slow. Similarly derived and faster-to-compute upper bounds exist for circulant convolutions (Singla and Feizi, 2019) as well as padded convolutions (Araujo et al., 2020). As these methods are designed for 2D convolutions (using the structure of doubly block circulant/Toeplitz matrices), it is furthermore not immediately clear how to extend them to 3D.

A very general method for numerically calculating spectral norms of linear operators is the power method (Yoshida and Miyato, 2017; Miyato et al., 2018a; Virmaux and Scaman, 2018; Tsuzuku et al., 2018b; Farnia et al., 2018; Gouk et al., 2018; Behrmann et al., 2019), which approximates the spectral norm of a linear operator by an iterative procedure. Here, for a randomly initialized vector $v_0$ from the domain of the convolutional operator $K$, the iteration

$$v_{k+1} = \frac{K^\top K v_k}{\|K^\top K v_k\|_2} \tag{3}$$

converges to the leading right-singular vector of $K$ (almost surely), while $\|K v_k\|_2$ converges to the desired spectral norm $\|K\|_2$. The iteration (3) highlights the computational burden of performing the power method during training, as the most computationally demanding operation of a convolutional neural network has to be performed multiple times per gradient step until some desired precision is reached. The training time is thus effectively multiplied by (almost) the number of iterations that have to be performed. The desired precision level can be controlled by a user-specified parameter, which defines the stopping criterion of the iteration. The necessary number of iterations can be greatly reduced by not initializing $v_0$ randomly, but instead initializing with the estimated leading right-singular vector from the previous gradient descent step (henceforth called the warm-start power method), assuming the most recent optimization step did not change the leading right-singular vector too much. This, however, requires keeping those vectors in memory, and they are of the same size as the input features. This can be prohibitive in memory-constrained applications such as 3D medical segmentation (where batch sizes of 1 or 2 are common).

3 Proposed solution
The main difficulty in determining the spectral norm of multi-channel convolutions lies in the fact that the spatial convolutions and the interplay between different channels are inherently linked. In the case of depthwise separable convolutions, however, the spatial and inter-channel computations are decoupled. This allows for a separate calculation (and thus normalization, cf. section 2.3) of the individual operations' norms, which, as it turns out, is much faster as well as more memory-efficient than the warm-start power method.
3.1 Spectral norms of depthwise convolutions
Depthwise convolutions, the spatial component of depthwise separable convolutions, apply one (usually zero-padded) filtering to each input channel. While common methods for conventional multi-channel convolutions are either slow (e.g. a randomly initialized power method or the method from Sedghi et al. (2019)), produce non-tight upper bounds (Singla and Feizi, 2021; Araujo et al., 2020) or require a lot of additional memory (warm-start power method), the following method is accurate, fast and can be computed on-the-fly, but is only valid for depthwise convolutions.
For this, we will make use of a form of the well-known convolution theorem (since most frameworks actually implement cross-correlation, we derive our formulas using the corresponding Cross-Correlation Theorem; the final statement is, however, independent of whether we use cross-correlations or actual convolutions). It states that taking the discrete Fourier transform (DFT) of a circulant convolution of two signals is equivalent to the product of both signals' DFTs (up to a complex conjugation), such that we only need to search for the largest absolute value among the filter's Fourier coefficients. Furthermore, our derivation will be independent of the dimensionality of the data, i.e. of whether we are dealing with 1D, 2D or 3D convolutions. Note that the following theorem has significant overlap with Theorem 5 from Sedghi et al. (2019), but we derive it independently of the dimensionality of the data and provide additional context as to why the Fourier transform leads to the correct spectral norm, even in the case of a real-to-real convolution.
Theorem 1 (Norms of Circulant Convolutions, SingleChannel).
Let $n_1, \ldots, n_d \in \mathbb{N}$ and let $k \in \mathbb{R}^{m_1 \times \cdots \times m_d}$ with $m_i \leq n_i$ be a filter. Let $K: \mathbb{R}^{n_1 \times \cdots \times n_d} \to \mathbb{R}^{n_1 \times \cdots \times n_d}$ be a (single-channel, unit-stride) circulant convolution with filter $k$. Let $P$ denote a zero-padding operator to the size $n_1 \times \cdots \times n_d$. Then $\|K\|_2 = \max |\mathcal{F}_d(\hat{k})|$, where $\hat{k} := Pk$ and $\mathcal{F}_d$ denotes the discrete $d$-dimensional Fourier transform.
Proof.
In the following, we will derive an expression for the spectral norm of the complex cross-correlation (Appendix A.2), after which we will show that this is an upper bound for the spectral norm of the real cross-correlation. Let $\hat{k} := Pk$ denote the zero-padded filter. Let further $K_{\mathbb{C}}: \mathbb{C}^{n_1 \times \cdots \times n_d} \to \mathbb{C}^{n_1 \times \cdots \times n_d}$ denote the complex cross-correlation, such that the restriction of $K_{\mathbb{C}}$ to $\mathbb{R}^{n_1 \times \cdots \times n_d}$ describes the same mapping as $K$. Then, for all $a \in \mathbb{C}^{n_1 \times \cdots \times n_d}$,

$$\mathcal{F}_d(K_{\mathbb{C}} a) = \overline{\mathcal{F}_d(\hat{k})} \odot \mathcal{F}_d(a) \tag{4}$$

according to the Cross-Correlation Theorem (Appendix A.3). Rewriting this in matrix-vector formulation (for a rigorous derivation, cf. Appendix A.4) yields

$$F \, \mathrm{vec}(K_{\mathbb{C}} a) = \mathrm{diag}\Big(\mathrm{vec}\big(\overline{\mathcal{F}_d(\hat{k})}\big)\Big) \, F \, \mathrm{vec}(a), \tag{5}$$

where $\mathrm{vec}$ is a reordering into a column vector and $F$ is the matrix representation of $\mathcal{F}_d$. Since this holds for arbitrary $a$ (and $F$ is invertible), this generalizes to the matrix representation $M_{K_{\mathbb{C}}}$ of $K_{\mathbb{C}}$, such that

$$M_{K_{\mathbb{C}}} = F^{-1} \, \mathrm{diag}\Big(\mathrm{vec}\big(\overline{\mathcal{F}_d(\hat{k})}\big)\Big) \, F,$$

i.e. the entries of $\overline{\mathcal{F}_d(\hat{k})}$ constitute the eigenvalues of $M_{K_{\mathbb{C}}}$. Because $F/\sqrt{n_1 \cdots n_d}$ is unitary, i.e. $F^{-1} = F^H / (n_1 \cdots n_d)$ (where $F^H$ is the Hermitian transpose of $F$), it holds that $M_{K_{\mathbb{C}}} M_{K_{\mathbb{C}}}^H = M_{K_{\mathbb{C}}}^H M_{K_{\mathbb{C}}}$, i.e. $M_{K_{\mathbb{C}}}$ is a normal matrix. According to the spectral theorem, this implies that the singular values of $M_{K_{\mathbb{C}}}$ (and, by isomorphism, of $K_{\mathbb{C}}$) are the absolute values of its eigenvalues, which coincide with the absolute values of the entries of $\mathcal{F}_d(\hat{k})$.

Now, it remains to show that the spectral norms of the real-to-real operator $K$ and the complex-to-complex operator $K_{\mathbb{C}}$ are the same. As both are represented by the matrix $M_{K_{\mathbb{C}}}$ (which has only real entries due to the use of a real filter $k$), its spectral norm is the same whether we view it as a real matrix or as a complex matrix (see Appendix Lemma 1). Hence,

$$\|K\|_2 = \|K_{\mathbb{C}}\|_2 = \max \big| \mathcal{F}_d(\hat{k}) \big|, \tag{6}$$

finishing our proof.
∎
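The 1D case of the theorem is easy to verify numerically: build the circulant cross-correlation matrix explicitly and compare its spectral norm against the largest magnitude among the DFT coefficients of the zero-padded filter. An illustrative sketch (note that `np.fft.fft(k, n)` zero-pads the filter to length `n`):

```python
import numpy as np

def circ_corr_matrix(k, n):
    """Matrix of the circular cross-correlation (K a)[i] = sum_t k[t] a[(i+t) % n]."""
    M = np.zeros((n, n))
    for i in range(n):
        for t in range(len(k)):
            M[i, (i + t) % n] += k[t]
    return M

def fft_norm(k, n):
    """Theorem 1 (1D case): spectral norm of the circulant convolution equals
    the largest magnitude among the DFT coefficients of the zero-padded filter."""
    return np.max(np.abs(np.fft.fft(k, n)))

rng = np.random.default_rng(1)
k = rng.normal(size=5)
M = circ_corr_matrix(k, 16)
assert np.isclose(np.linalg.norm(M, ord=2), fft_norm(k, 16))
```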
While the above theorem yields exact spectral norms, it has two shortcomings, as it only deals with circulant and single-channel convolutions. In practice, however, depthwise convolutions are typically zero-padded (and multi-channel). As it turns out, by zero-padding the filters to just a bit bigger than the input image (namely to the size of the padded image), one obtains an upper bound of the actual spectral norm of the depthwise, zero-padded convolution that is quite tight in practice.
Theorem 2 (Norms of ZeroPadded and MultiChannel Depthwise Convolutions).
Let $n_1, \ldots, n_d \in \mathbb{N}$ and let $a \in \mathbb{R}^{C \times n_1 \times \cdots \times n_d}$ be a $C$-channel input. For odd-dimensional filters $k^{(1)}, \ldots, k^{(C)} \in \mathbb{R}^{(2 p_1 + 1) \times \cdots \times (2 p_d + 1)}$ (for $p_i \in \mathbb{N}$), let $K^{(c)}$ denote the single-channel zero-padded convolution with filter $k^{(c)}$. We define the depthwise convolution

$$K a := \big( K^{(1)} a_1, \ldots, K^{(C)} a_C \big). \tag{7}$$

Let $P$ denote the zero-padding to the size of the padded input, $(n_1 + 2 p_1) \times \cdots \times (n_d + 2 p_d)$. Then, with $\hat{k}^{(c)} := P k^{(c)}$, it holds that

$$\|K\|_2 \leq \max_{c \in \{1, \ldots, C\}} \max \big| \mathcal{F}_d(\hat{k}^{(c)}) \big|,$$

where $\mathcal{F}_d$ denotes the discrete $d$-dimensional Fourier transform.
Since there are very fast and highly parallelizable algorithms for computing the DFT (the Fast Fourier Transform), the above computation can be performed very quickly.
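A sketch of the resulting bound computation in NumPy (the function name and array shapes are our own choices): `np.fft.fftn(f, s=...)` zero-pads each channel's filter to the padded input size, and the bound is the largest DFT magnitude over all channels.

```python
import numpy as np

def depthwise_norm_bound(filters, padded_shape):
    """Upper bound on the spectral norm of a zero-padded depthwise convolution.

    filters:      array of shape (C, k_1, ..., k_d), one filter per channel.
    padded_shape: spatial shape of the zero-padded input, e.g. (H + 2p, W + 2p).
    """
    return max(np.max(np.abs(np.fft.fftn(f, s=padded_shape))) for f in filters)
```

For a stride of $s$ in all directions, the 'educated guess' described below would divide this bound by $s^{d/2}$.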
Note that the above theorems are derived for unit-stride depthwise convolutions. For strided convolutions, the upper bound still holds, but it will in reality be more vacuous. To see this, let $K_s$ denote the strided convolution and let $K$ denote the corresponding unit-stride convolution. Then there is a subsampling operator $D_s$ with $\|D_s\|_2 = 1$, such that $K_s = D_s \circ K$. Due to $\|D_s \circ K\|_2 \leq \|D_s\|_2 \cdot \|K\|_2$, it holds that $\|K_s\|_2 \leq \|K\|_2$. In the case of stride $s$ in all directions, one can only make the 'educated guess' (based on the assumption that all components of $K$ contribute to its overall norm evenly) that $\|K_s\|_2 \approx \|K\|_2 / s^{d/2}$. This e.g. means that for 2D data, our estimation is about 2 times too high for $s = 2$.
3.2 Calculating Lipschitz constants for pointwise convolutions
As discussed in section 2.1, pointwise convolutions are simply multi-channel convolutions whose filters have spatial size one, which is why in 2D they are often called $1 \times 1$ convolutions. Here, we propose a variation of the warm-start power method applied to a specific small matrix, which, compared to the naïve power method for the full operator, is typically much faster and more memory-efficient. In particular, its cost is independent of the size of the convolved data.
Theorem 3.
Let $K$ be a pointwise convolution with kernel $k$, such that $(K a)_j = \sum_{i=1}^{c_{\mathrm{in}}} k_{j,i} \, a_i$ for all $j \in \{1, \ldots, c_{\mathrm{out}}\}$, where $c_{\mathrm{in}}$ and $c_{\mathrm{out}}$ denote the number of input respectively output channels. Then we call

$$W := (k_{j,i})_{j,i} \in \mathbb{R}^{c_{\mathrm{out}} \times c_{\mathrm{in}}} \tag{8}$$

the connectivity matrix of $K$, and it holds that $\|K\|_2 = \|W\|_2$.
Proof.
The matrix representation of $K$ is similar (via a permutation of rows and columns) to a block diagonal matrix whose blocks all equal the connectivity matrix $W$ (one block per spatial location). As the singular values of a block diagonal matrix are the union of the blocks' singular values, the spectral norm (i.e. the largest singular value) of $K$ coincides with the spectral norm of $W$. ∎
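The theorem can be verified directly for a small example: flattening the input channel-major, a pointwise convolution is the Kronecker product of its connectivity matrix with an identity, and both share the same spectral norm. An illustrative check with assumed shapes:

```python
import numpy as np

rng = np.random.default_rng(2)
c_out, c_in, H, W = 4, 3, 5, 5
conn = rng.normal(size=(c_out, c_in))   # connectivity matrix of a 1x1 convolution

# Acting on inputs flattened channel-major (length c_in * H * W), the full
# pointwise convolution is kron(conn, I): each output channel is a linear
# combination of the input channels, applied identically at every pixel.
full_op = np.kron(conn, np.eye(H * W))

assert np.isclose(np.linalg.norm(full_op, ord=2),
                  np.linalg.norm(conn, ord=2))
```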
Performing the power method with the connectivity matrix has the additional advantage that the warm-start version of the power method only requires storing a small right-singular vector of $W$ (alternatively, the left-singular vector can be stored). Compare this to the naïve power method, which requires storing a vector of the size of the input (respectively output) tensor of the layer. We refer to the thus-improved warm-start power method as the efficient warm-start power method. An illustration of this method is found in Figure 1.

For square images of resolution $n$ by $n$, both the computational and the memory demand are divided by a factor of $n^2$. This effect is even more pronounced in the case of 3D (or even higher-order) data, where the memory cost of the naïve method can easily be several gigabytes, whereas our method is typically limited to kilobytes of storage for the leading singular vector (for at most a few thousand channels).
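A sketch of the efficient warm-start power method on the connectivity matrix (the exact form of the stopping criterion is our assumption; only the small vector `v` needs to persist between optimizer steps):

```python
import numpy as np

def efficient_warm_start_power_method(conn, v, eps=0.01, max_iter=100):
    """Estimate the spectral norm of the connectivity matrix conn (c_out x c_in).

    v is the (normalized) right-singular-vector estimate carried over from the
    previous gradient step; only this length-c_in vector is kept in memory.
    """
    for _ in range(max_iter):
        v_new = conn.T @ (conn @ v)           # one step of iteration (3)
        v_new /= np.linalg.norm(v_new)
        if np.linalg.norm(v_new - v) < eps:   # stopping criterion (assumed form)
            v = v_new
            break
        v = v_new
    return np.linalg.norm(conn @ v), v        # norm estimate, next warm start
```

Between training steps, the returned `v` is fed back in as the warm start, so only a handful of iterations are usually needed.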
4 Experiments
In this section, we perform several experiments on classical benchmarks in order to evaluate our proposed method for spectral normalization. Note that the goal here is not to convince the reader of the usefulness of spectral normalization in general (interested readers are referred to the many examples in sections 1 and 2.4), but to benchmark the accuracy and speed of our method and to uncover the interplay with other variables, such as the learning rate.
4.1 Accuracy of depthwise spectral norm upper bound
While the accuracy of our efficient power method for pointwise convolutions is governed by the user-specified precision parameter, for depthwise convolutions we only have access to upper bounds of the true spectral norm. As shown in Theorem 2, the upper bound comes into play solely because of zero-padding, whose relative effect should become smaller for large resolutions of the input images (in comparison to small kernel sizes). In order to check how tight this upper bound is (and thus how close our approximation is), we calculate the relative overestimation (approximation/actual value) for randomly generated (Gaussian distributed) filters, depending on the feature resolution. Here, the 'true' spectral norm was approximated with 30 iterations of the power method. We varied the feature resolution from the smallest resolution found in e.g. MobileNetV2 (Sandler et al., 2018), VGG (Simonyan and Zisserman, 2015) and ResNet (He et al., 2016) up to 128, and generated 1000 random filters per resolution. The results are shown in Figure 2. One can observe that the relative error indeed decreases with an increase in the resolution, from (in median) 17% to 2%. Note that due to the linearity of the investigated methods, the results of this experiment are independent of the standard deviation of the chosen Gaussian distribution.
The above analysis can be seen as an investigation into the approximation accuracy at initialization. So how good is the approximation in a fullytrained network, where the filters do not follow a Gaussian distribution? For this, we did the above comparison for a MobileNetV2, which was pretrained on ImageNet Russakovsky et al. (2015). MobileNetV2 contains 17 depthwise convolutional layers, 13 of which are unitstride and 4 of which have a stride of 2 in each direction. The correct feature resolution from the original article was used for the calculation of the spectral norms. For the 13 unitstride convolutions, the average relative overestimation was 2.80%, with a standard deviation of 1.55%. In the case of the strided convolutions (for which we expect more vacuous bounds, as explained in section 3.1), the relative overestimation of our method was 75.6% (standard deviation: 20.6%). In summary, in a realistic setting, our estimation of the spectral norm is highly accurate for unitstride depthwise convolutions, but less accurate for strided convolutions (but still better than the ’educated guess’ of a factor of 2, which we laid out in section 3.1).
4.2 Lipschitz constant and learning rate
Since the Lipschitz constant of a neural network determines the magnitude of its gradients, training with gradient-based methods requires adapting the learning rate accordingly, i.e. a lower Lipschitz constant necessitates a larger learning rate. In order to evaluate the interplay of the learning rate and the scaling constant (both for soft and hard scaling), we perform a study on CIFAR10 (Krizhevsky et al., 2009), in which we test different settings for both variables. For this, we train a ResNet34 (He et al., 2016), in which we exchanged all convolutions for depthwise separable convolutions (one depthwise convolution, followed by a pointwise convolution without an intermediate activation function). We therefore end up with a network architecture consisting of 33 depthwise separable convolutional layers (of which 3 are strided convolutions with a stride of 2) and one fully connected (softmax) layer. We trained all networks for 300 epochs, using scaling constants from 1 to 10 and learning rates between 1e-5 and 1e-1 for stochastic gradient descent with a fixed momentum. Since batch normalization influences the Lipschitz constant (both through the running means of the batches' standard deviations and the learnable scale parameter), we train our models without batch normalization. We compute and restrict the spectral norm of every layer, using the proposed method to spectrally normalize the depthwise separable convolutions, as well as the warm-start power method to restrict the spectral norm of the final classification layer. Additionally, we choose the precision parameter to be 0.01 for all pointwise convolutional layers and initialize the soft scaling parameter to a fixed starting value.

The results of this experiment are depicted in Table 1. As predicted, lower scaling constants tend to require higher learning rates. Furthermore, the known regularizing effect of Lipschitz constraints can be observed, since increasing the scaling constant initially increases the prediction accuracy (underfitting regime), before it decreases again (overfitting regime). A careful tuning of the learning rate is needed to achieve the best performance. In particular, too high a learning rate may collapse the training. Moreover, there is no clear winner between hard and soft scaling.
Table 1: Prediction accuracy (in %) on CIFAR10 for hard (H) and soft (S) scaling, per scaling constant and learning rate.

                               Learning rate
Scaling     1e-5           1e-4           1e-3           1e-2           1e-1
constant    H      S       H      S       H      S       H      S       H      S
1           15.38  14.98   16.64  15.56   16.34  16.28   10.00  10.00   10.00  10.00
2           17.23  17.05   19.93  19.00   45.01  42.17   54.70  55.65   10.00  10.00
3           18.02  17.48   45.88  42.08   67.37  72.51   74.73  72.33   10.00  10.00
4           35.45  35.34   62.90  62.73   81.07  80.94   84.71  85.65   10.00  10.00
5           50.68  50.36   74.85  74.73   86.68  86.72   90.53  90.74   10.00  10.00
6           59.73  59.65   80.45  80.19   88.60  87.63   10.00  10.00   10.00  10.00
7           62.10  62.39   81.10  81.06   85.24  85.36   10.00  10.00   10.00  10.00
8           61.53  62.74   81.21  81.17   80.16  10.00   10.00  10.00   10.00  10.00
9           45.52  52.25   32.69  41.92   10.00  10.00   10.00  10.00   10.00  10.00
10          10.00  10.00   10.00  10.00   10.00  10.00   10.00  10.00   10.00  10.00
4.3 Spectral normalization on ImageNet
In addition to the previous CIFAR10 experiments, we conducted a study on the much higher-resolution ImageNet dataset (Russakovsky et al., 2015). For this, we trained MobileNetV2 (Sandler et al., 2018), a standard architecture for image classification using depthwise separable convolutions, over a range of scaling constants, again without batch normalization. Lower and higher values of the scaling constant in our case led to a strong breakdown in performance. The models were trained with soft scaling for 150 epochs, using the same precision parameter for every pointwise convolutional layer. All models were trained with a batch size of 128, divided over two NVIDIA V100 GPUs. Due to the interplay of scaling constant and learning rate (see subsection 4.2), the learning rate had to be adapted for each scaling constant.

In Figure 3, the Top-1 as well as the Top-5 validation accuracy per scaling constant are plotted. One can observe that the accuracy increases as one increases the scaling constant, indicating that we never reach the overfitting regime. Moreover, it can be seen that the increase in accuracy is steeper for lower scaling constants, later leveling off.

The use of soft scaling offers an additional degree of flexibility, as the training can slightly adjust the Lipschitz constant. We noticed that in the ImageNet experiments, the initialization of the learnable soft scaling parameter had to be tuned. For lower values of the scaling constant, the initialization from the previous CIFAR10 experiments still worked very well, whereas higher values required somewhat reducing the initial value, as it otherwise resulted in exploding gradients or missing convergence. We also experimented with two further initializations (one of which initializes the network as if there were no spectral normalization at initialization), but both did not generally work well. However, in all cases it was enough to monitor the training for a few gradient steps to judge whether an adjustment of the initial value was necessary.
4.4 Time complexity
In the following, we aim to show the impact of our method on training times, which we measured both with and without spectral normalization. Instead of benchmarking only the linear layers, this includes the time for data loading, augmentation, backpropagation etc., which allows for a realistic assessment of the method's overhead in a real training scenario. In addition to the absolute times per epoch, we show the relative factor between our approach and training the network without spectral normalization.
For this, we observe the average training times per epoch of the last two experimental cohorts (ResNet34 on CIFAR10 and MobileNetV2 on ImageNet, best accuracy model each), and compare these to the case of the same network without spectral normalization (and still without batch normalization). The results are summarized in Table 2.
In the case of CIFAR10, our additional overhead amounted to an increase of about 63% (where a considerable amount was spent on the normalization of the final, fully connected layer). In the case of ImageNet, the time per epoch only increased by 2%, which can be seen as negligible. Since the cost of our normalization is small in comparison to convolutions on large features, it is to be expected that our method has even less of an impact on the training time for higher-dimensional data.
Table 2: Time per epoch with and without spectral normalization.

              SpecNorm    no SpecNorm    Relative
CIFAR10       59.59 s     36.63 s        1.63
ImageNet      3074 s      3016 s         1.02
5 Conclusion & future work
Mathematical guarantees in the form of Lipschitz constants have increasingly come into the focus of modern neural network research. While there are simple and generic ways of calculating (and thus enforcing) spectral norms (resulting in said Lipschitz guarantees), these are typically quite costly in time or memory, or result in inaccurate upper bounds. For applications in which efficiency is of interest, depthwise separable convolutions have emerged as a fast form of convolutions, which nonetheless allow for state-of-the-art results in challenging benchmarks (cf. Pham et al. (2020)). In this work, we introduce a very simple procedure with which the spectral normalization of depthwise separable convolutions can be performed with negligible computational and memory overhead, while being quite accurate in practice.
We imagine future work based on our proposed method from an application as well as a methodological perspective. From a methodological standpoint, going from the spectral norm to different norms is another possible avenue. In the case of the pointwise convolution, this is straightforward, as it just entails computing matrix norms (which are often fast to compute, e.g. in the case of the 1-norm or the ∞-norm). In terms of applications, research into architectures that make full use of depthwise separable convolutions for any of the applications named in section 1 is needed (e.g. spectrally normalized GANs), which can then benefit from our proposed method. Making the step to 3D will yield even bigger performance improvements compared to the conventional approaches. One use case is the stabilization of invertible networks for memory-efficient 3D segmentation (Etmann et al., 2020), which can be achieved by limiting the layers' Lipschitz constants (Behrmann et al., 2020), e.g. via our method.
Acknowledgements
CE acknowledges support from the Wellcome Innovator Award RG98755. CBS acknowledges support from the Philip Leverhulme Prize, the Royal Society Wolfson Fellowship, the EPSRC grants EP/S026045/1, EP/T003553/1, EP/N014588/1 and EP/T017961/1, the Wellcome Innovator Award RG98755, the European Union Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 777826 NoMADS, the Cantab Capital Institute for the Mathematics of Information and the Alan Turing Institute.
References
On Lipschitz regularization of convolutional layers using Toeplitz matrix theory. arXiv preprint arXiv:2006.08391.
Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223.
Invertible residual networks. In International Conference on Machine Learning, pp. 573–582.
Understanding and mitigating exploding inverses in invertible neural networks. arXiv preprint arXiv:2006.09347.
Semialgebraic optimization for Lipschitz constants of ReLU networks. In Conference on Neural Information Processing Systems.
Xception: deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Parseval networks: improving robustness to adversarial examples. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 854–863.
NICE: non-linear independent components estimation. arXiv preprint arXiv:1410.8516.
iUNets: learnable invertible up- and downsampling for large-scale inverse problems. In 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6.
Generalizable adversarial training via spectral normalization. In International Conference on Learning Representations.
Efficient and accurate estimation of Lipschitz constants for deep neural networks. In Advances in Neural Information Processing Systems, pp. 11427–11438.
The reversible residual network: backpropagation without storing activations. arXiv preprint arXiv:1707.04585.
Regularisation of neural networks by enforcing Lipschitz continuity. arXiv preprint arXiv:1804.04368.
Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
Learning multiple layers of features from tiny images. Technical report.
Lipschitz constant estimation of neural networks via sparse polynomial optimization. In International Conference on Learning Representations.
Preventing gradient attenuation in Lipschitz constrained convolutional networks. In Advances in Neural Information Processing Systems, pp. 15390–15402.
Spectral normalization for generative adversarial networks. In International Conference on Learning Representations. arXiv preprint arXiv:1802.05957.
Lipschitz regularized deep neural networks converge and generalize. arXiv preprint arXiv:1808.09540.
Meta pseudo labels. arXiv preprint arXiv:2003.10580.
L2-nonexpansive neural networks. In International Conference on Learning Representations.
ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3), pp. 211–252.
MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
Lipschitz regularity of deep neural networks: analysis and efficient estimation. In Advances in Neural Information Processing Systems, pp. 3835–3844. arXiv preprint arXiv:1805.10965.
The singular values of convolutional layers. In 7th International Conference on Learning Representations (ICLR), pp. 1–12. arXiv preprint arXiv:1805.10408.
Rigid-motion scattering for image classification. PhD thesis, École Polytechnique.
Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.
Bounding singular values of convolution layers. arXiv preprint arXiv:1911.10258.
Fantastic four: differentiable and efficient bounds on singular values of convolution layers. In International Conference on Learning Representations.
Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
EfficientNet: rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, PMLR Vol. 97, pp. 6105–6114.
Lipschitz-margin training: scalable certification of perturbation invariance for deep neural networks. In Advances in Neural Information Processing Systems, Vol. 31, pp. 6541–6550.
Dynamical isometry and a mean field theory of CNNs: how to train 10,000-layer vanilla convolutional neural networks. arXiv preprint arXiv:1806.05393.
Spectral norm regularization for improving the generalizability of deep learning. arXiv preprint arXiv:1705.10941.
Appendix A Additional Details and Proofs
In the following, we give additional details and proofs for the theory in Section 3. For compactness, we use a multi-index notation indicated by bold letters and numbers. All multi-indices have $d$ entries, meant to represent $d$-dimensional data (e.g. $d=2$ for images), i.e. $\mathbf{N} = (N_1, \dots, N_d)$. We also use the shorthand $\mathbb{R}^{\mathbf{N}} := \mathbb{R}^{N_1 \times \cdots \times N_d}$, respectively $\mathbb{C}^{\mathbf{N}} := \mathbb{C}^{N_1 \times \cdots \times N_d}$. For some $\mathbf{m}, \mathbf{n} \in \mathbb{Z}^d$ (respectively $\mathbb{N}^d$), we write $\mathbf{m} \leq \mathbf{n}$ if the inequality holds in every entry. Division and addition are meant entrywise. Furthermore, we define the sum notations
$$\sum_{\mathbf{n}=\mathbf{0}}^{\mathbf{N}} := \sum_{n_1=0}^{N_1} \cdots \sum_{n_d=0}^{N_d}$$
and $\sum_{\mathbf{n}} := \sum_{\mathbf{n}=\mathbf{0}}^{\mathbf{N}-\mathbf{1}}$. We denote the imaginary unit by $i$, as well as the complex conjugate of a complex vector $z$ by $\bar{z}$.
a.1 Discrete Fourier Transform
Recall the following definition of the (unnormalized) discrete Fourier transform (DFT).
Definition 1.
We call the operator
$$\mathcal{F} : \mathbb{C}^{\mathbf{N}} \to \mathbb{C}^{\mathbf{N}} \qquad (9)$$
defined by
$$(\mathcal{F}x)_{\mathbf{k}} = \sum_{\mathbf{n}=\mathbf{0}}^{\mathbf{N}-\mathbf{1}} x_{\mathbf{n}}\, e^{-2\pi i \langle \mathbf{n}, \mathbf{k}/\mathbf{N} \rangle} \qquad (10)$$
the discrete Fourier transform.
Compared to the 'normalized' version of the discrete Fourier transform (which is multiplied by $1/\sqrt{N_1 \cdots N_d}$ and is unitary), this unnormalized formulation of the DFT results in slightly simpler statements for the main results.
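The distinction between the two conventions can be checked numerically. The sketch below (our own illustration, for $d=1$) builds the unnormalized DFT matrix and compares it with NumPy's FFT, which uses the same unnormalized forward convention; dividing by $\sqrt{N}$ yields the unitary version:

```python
import numpy as np

N = 8
n = np.arange(N)
# Unnormalized DFT matrix: F[k, m] = exp(-2*pi*i*k*m / N), as in Definition 1
F = np.exp(-2j * np.pi * np.outer(n, n) / N)
U = F / np.sqrt(N)   # the 'normalized' DFT, which is unitary

# A test signal to compare against numpy's (also unnormalized) FFT
x = np.cos(2 * np.pi * n / N) + 0.5 * np.sin(4 * np.pi * n / N)
```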
a.2 Circulant Cross-Correlation
Here, we define the complex version of the cross-correlation operation. Note that virtually all neural network libraries implement convolutions as cross-correlations.
Definition 2.
For $x, k \in \mathbb{C}^{\mathbf{N}}$, we call the vector $x \star k \in \mathbb{C}^{\mathbf{N}}$ defined by
$$(x \star k)_{\mathbf{m}} = \sum_{\mathbf{n}=\mathbf{0}}^{\mathbf{N}-\mathbf{1}} \overline{k_{\mathbf{n}}}\, x_{\mathbf{m}+\mathbf{n}} \qquad (11)$$
the circulant cross-correlation of $x$ with $k$, if $x$ is circularly continued, i.e. $x_{\mathbf{j}} = x_{\mathbf{j} \bmod \mathbf{N}}$ for all $\mathbf{j} \geq \mathbf{0}$. We denote this cross-correlation by $x \star k$.
a.3 Circulant Cross-Correlation Theorem
Theorem 4.
Let $x, k \in \mathbb{C}^{\mathbf{N}}$. Then
$$\mathcal{F}(x \star k) = \overline{\mathcal{F}k} \odot \mathcal{F}x,$$
or, written differently,
$$x \star k = \mathcal{F}^{-1}\left(\overline{\mathcal{F}k} \odot \mathcal{F}x\right),$$
where $\odot$ denotes the componentwise multiplication.
Proof.
Using the fact that $x_{\mathbf{m}+\mathbf{n}} = x_{(\mathbf{m}+\mathbf{n}) \bmod \mathbf{N}}$ for the circularly continued $x$:
$$\left(\mathcal{F}(x \star k)\right)_{\mathbf{j}} = \sum_{\mathbf{m}=\mathbf{0}}^{\mathbf{N}-\mathbf{1}} e^{-2\pi i \langle \mathbf{m}, \mathbf{j}/\mathbf{N}\rangle} \sum_{\mathbf{n}=\mathbf{0}}^{\mathbf{N}-\mathbf{1}} \overline{k_{\mathbf{n}}}\, x_{\mathbf{m}+\mathbf{n}} = \sum_{\mathbf{n}=\mathbf{0}}^{\mathbf{N}-\mathbf{1}} \overline{k_{\mathbf{n}}}\, e^{2\pi i \langle \mathbf{n}, \mathbf{j}/\mathbf{N}\rangle} \sum_{\mathbf{m}=\mathbf{0}}^{\mathbf{N}-\mathbf{1}} x_{\mathbf{m}+\mathbf{n}}\, e^{-2\pi i \langle \mathbf{m}+\mathbf{n}, \mathbf{j}/\mathbf{N}\rangle}.$$
Due to the circular continuation of $x$ and the fact that the exponential function is $2\pi i$-periodic on the imaginary axis, we can simplify the rightmost sum to the Fourier transform of $x$:
$$\sum_{\mathbf{m}=\mathbf{0}}^{\mathbf{N}-\mathbf{1}} x_{\mathbf{m}+\mathbf{n}}\, e^{-2\pi i \langle \mathbf{m}+\mathbf{n}, \mathbf{j}/\mathbf{N}\rangle} = (\mathcal{F}x)_{\mathbf{j}}.$$
In summary, this proves the statement
$$\left(\mathcal{F}(x \star k)\right)_{\mathbf{j}} = \overline{(\mathcal{F}k)_{\mathbf{j}}}\, (\mathcal{F}x)_{\mathbf{j}}.$$
∎
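The identity can be verified numerically. The following check (our own, for $d=2$, assuming the conjugating convention $(x \star k)_{\mathbf{m}} = \sum_{\mathbf{n}} \overline{k_{\mathbf{n}}}\, x_{\mathbf{m}+\mathbf{n}}$) compares a brute-force evaluation of the circulant cross-correlation with the right-hand side $\overline{\mathcal{F}k} \odot \mathcal{F}x$:

```python
import numpy as np

rng = np.random.default_rng(42)
N1, N2 = 4, 6                        # a small 2D example
x = rng.standard_normal((N1, N2)) + 1j * rng.standard_normal((N1, N2))
k = rng.standard_normal((N1, N2)) + 1j * rng.standard_normal((N1, N2))

# Brute-force circulant cross-correlation: (x * k)_m = sum_n conj(k_n) x_{m+n}
corr = np.zeros((N1, N2), dtype=complex)
for m1 in range(N1):
    for m2 in range(N2):
        for n1 in range(N1):
            for n2 in range(N2):
                corr[m1, m2] += np.conj(k[n1, n2]) * x[(m1 + n1) % N1, (m2 + n2) % N2]

# Cross-correlation theorem: F(x * k) = conj(F k) entrywise-times F x
lhs = np.fft.fft2(corr)
rhs = np.conj(np.fft.fft2(k)) * np.fft.fft2(x)
```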
a.4 Matrix Formulation of the Cross-Correlation Theorem
Here, we make the reformulation of the cross-correlation theorem into matrix multiplications (which is used to derive the statement in Theorem 1) more precise.
Remark 1.
Let $x, k \in \mathbb{C}^{\mathbf{N}}$. Let further $R$ be an operation that reorders a tensor from $\mathbb{C}^{\mathbf{N}}$ into a (column) vector in $\mathbb{C}^{N_1 \cdots N_d}$, which is a unitary operator (i.e. $R^* R = R R^* = \mathrm{Id}$, where $R^*$ is the adjoint of $R$). Representing an operator in a different coordinate system entails first transforming to the required domain of the operation (here: via the transition operator $R^*$), and then transforming the output to the desired codomain (here: via $R$). This means that for the linear operators $\mathcal{F}$ and $\cdot \star k$ to be reformulated into matrices, we define $F := R \mathcal{F} R^*$ and $C := R (\cdot \star k) R^*$. Likewise, the entrywise multiplication in $\mathbb{C}^{\mathbf{N}}$ (formulated as a bilinear operator)
$$\odot : \mathbb{C}^{\mathbf{N}} \times \mathbb{C}^{\mathbf{N}} \to \mathbb{C}^{\mathbf{N}} \qquad (12)$$
is equivalently represented as an operation $\odot'$ in $\mathbb{C}^{N_1 \cdots N_d}$ via
$$(Ra) \odot' (Rb) := R(a \odot b). \qquad (13)$$
According to the cross-correlation theorem (Section A.3), it holds that $\mathcal{F}(x \star k) = \overline{\mathcal{F}k} \odot \mathcal{F}x$ for all $x \in \mathbb{C}^{\mathbf{N}}$. Due to the unitarity of $R$, it holds that $\mathcal{F}(x \star k) = R^* F C (Rx)$ and $\overline{\mathcal{F}k} \odot \mathcal{F}x = R^*\big((R\,\overline{\mathcal{F}k}) \odot' F(Rx)\big)$, meaning that
$$R^* F C (Rx) = R^*\big((R\,\overline{\mathcal{F}k}) \odot' F(Rx)\big).$$
Left-multiplying by $R$ then leads to
$$F C (Rx) = (R\,\overline{\mathcal{F}k}) \odot' F(Rx)$$
due to the unitarity of $R$. Note that entrywise multiplication in $\mathbb{C}^{N_1 \cdots N_d}$ can be represented by the multiplication of a diagonal matrix with a column vector, which results in the statement
$$F C = \mathrm{diag}\big(R\,\overline{\mathcal{F}k}\big)\, F.$$
a.5 Matrix Spectral Norm Equality
The spectral norm of the real-to-real convolution (according to Theorem 1) is derived using its eigenvalues, when viewed as a complex-to-complex operator. Since the eigenvalues of operators defined on spaces over the real and the complex numbers may differ, we now assert that the spectral norm of a matrix with only real entries (and by extension, of our real-to-real convolutional operator) does not depend on whether we view it as a real or as a complex matrix.
Lemma 1.
We denote the spectral norm for real matrices by
$$\|A\|_2^{\mathbb{R}} := \max_{x \in \mathbb{R}^n,\ \|x\|_2 = 1} \|Ax\|_2,$$
as well as the spectral norm for complex matrices by
$$\|A\|_2^{\mathbb{C}} := \max_{z \in \mathbb{C}^n,\ \|z\|_2 = 1} \|Az\|_2.$$
Then for any $A \in \mathbb{R}^{m \times n}$, it holds that
$$\|A\|_2^{\mathbb{R}} = \|A\|_2^{\mathbb{C}}.$$
Proof.
We prove the above statement by showing that both $\|A\|_2^{\mathbb{R}} \leq \|A\|_2^{\mathbb{C}}$ and $\|A\|_2^{\mathbb{C}} \leq \|A\|_2^{\mathbb{R}}$.
It is easy to see that
$$\|A\|_2^{\mathbb{R}} \leq \|A\|_2^{\mathbb{C}},$$
since $\mathbb{R}^n \subset \mathbb{C}^n$.
On the other hand, since $A$ only has real entries, $A^T A$ is symmetric and real, which means that there is an orthogonal matrix $V \in \mathbb{R}^{n \times n}$ which diagonalizes $A^T A$, i.e.
$$V^T (A^T A) V = \mathrm{diag}(\sigma_1^2, \dots, \sigma_n^2),$$
where the $\sigma_i$ are the singular values of $A$. Then, when viewed as a real matrix,
$$\left(\|A\|_2^{\mathbb{R}}\right)^2 = \max_{x \in \mathbb{R}^n,\ \|x\|_2=1} \langle A^T A x, x \rangle = \max_i \sigma_i^2.$$
Note that $V$, as an orthogonal matrix, is in particular a unitary matrix, i.e. it also diagonalizes $A^T A$ over $\mathbb{C}$. Thus,
$$\left(\|A\|_2^{\mathbb{C}}\right)^2 = \max_{z \in \mathbb{C}^n,\ \|z\|_2=1} \langle A^T A z, z \rangle = \max_i \sigma_i^2 = \left(\|A\|_2^{\mathbb{R}}\right)^2. \qquad \blacksquare$$
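Numerically, the lemma can be sanity-checked as follows (our own sketch): the spectral norm of a real matrix is its largest singular value, a complex unit vector cannot exceed that gain, and a real right singular vector already attains it:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))
sigma = np.linalg.norm(A, 2)             # largest singular value of A

# A random complex unit vector never beats the real spectral norm ...
z = rng.standard_normal(3) + 1j * rng.standard_normal(3)
z /= np.linalg.norm(z)
gain_complex = np.linalg.norm(A @ z)

# ... and a *real* right singular vector already attains it.
_, _, Vt = np.linalg.svd(A)
gain_real = np.linalg.norm(A @ Vt[0])
```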
a.6 Zero-padded multichannel convolutions
A single-channel zero-padded convolution's spectral norm is upper bounded by the spectral norm of the corresponding circulant convolution on an enlarged domain.
Lemma 2.
Let $k$ be a filter of odd spatial dimensions, i.e. $k \in \mathbb{R}^{2\mathbf{s}+\mathbf{1}}$ for some $\mathbf{s} \in \mathbb{N}^d$. Let $A^{\mathrm{zero}}_k$ denote the zero-padded single-channel convolution with filter $k$ on the domain $\mathbb{R}^{\mathbf{N}}$, and let $A^{\mathrm{circ}}_k$ denote the respective circulant convolution on the enlarged domain $\mathbb{R}^{\mathbf{N}+2\mathbf{s}}$. Then
$$\|A^{\mathrm{zero}}_k\|_2 \leq \|A^{\mathrm{circ}}_k\|_2.$$
Proof.
Let $P_{\mathrm{zero}}$ and $P_{\mathrm{circ}}$ be the zero-padding respectively circulant padding operation from $\mathbb{R}^{\mathbf{N}}$ to $\mathbb{R}^{\mathbf{N}+2\mathbf{s}}$. Both can easily be verified to be linear.
Let $A^{\mathrm{valid}}_k : \mathbb{R}^{\mathbf{N}+2\mathbf{s}} \to \mathbb{R}^{\mathbf{N}}$ be the valid convolution with filter $k$ on the domain $\mathbb{R}^{\mathbf{N}+2\mathbf{s}}$, such that $A^{\mathrm{zero}}_k = A^{\mathrm{valid}}_k P_{\mathrm{zero}}$.
Note that for any $x \in \mathbb{R}^{\mathbf{N}}$, it holds that
$$\|P_{\mathrm{zero}} x\|_2 = \|x\|_2 \quad \text{and} \quad \|A^{\mathrm{valid}}_k P_{\mathrm{zero}} x\|_2 \leq \|A^{\mathrm{circ}}_k P_{\mathrm{zero}} x\|_2, \qquad (14)$$
since in each inequality only zeros respectively additional squares are added to the sum of squares when calculating the 2-norms: every entry of the valid convolution of $P_{\mathrm{zero}} x$ also appears as an entry of the circulant convolution of $P_{\mathrm{zero}} x$. It follows that
$$\|A^{\mathrm{zero}}_k x\|_2 = \|A^{\mathrm{valid}}_k P_{\mathrm{zero}} x\|_2 \leq \|A^{\mathrm{circ}}_k P_{\mathrm{zero}} x\|_2 \leq \|A^{\mathrm{circ}}_k\|_2\, \|P_{\mathrm{zero}} x\|_2 = \|A^{\mathrm{circ}}_k\|_2\, \|x\|_2, \qquad (15)$$
finishing our proof. ∎
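A one-dimensional numerical sketch of this bound (our own construction; B and C are the matrices of the zero-padded correlation and of the circulant correlation on the enlarged domain, and the FFT bound equals the spectral norm of C because C is circulant and hence normal):

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 6, 3                    # signal length, odd filter length
s = (K - 1) // 2
M = N + 2 * s                  # enlarged domain
k = rng.standard_normal(K)

# Matrix B of the zero-padded correlation on R^N: y_m = sum_j k_j x_{m+j-s}
B = np.zeros((N, N))
for m in range(N):
    for j in range(K):
        n = m + j - s
        if 0 <= n < N:
            B[m, n] = k[j]

# Matrix C of the circulant correlation with the same filter on R^M
C = np.zeros((M, M))
for m in range(M):
    for j in range(K):
        C[m, (m + j - s) % M] = k[j]

# Largest DFT magnitude of the embedded filter, i.e. the cheap spectral bound
fft_bound = np.abs(np.fft.fft(np.r_[k, np.zeros(M - K)])).max()
```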
Theorem 2 (restated).
Proof.
The zero-padded multichannel depthwise convolution is isometrically isomorphic to a block-diagonal matrix $A = \mathrm{diag}(A_1, \dots, A_c)$, in which each block $A_i$ is isometrically isomorphic to a matrix representing a zero-padded single-channel convolution. With Lemma 2 and the fact that the set of singular values of $A$ is the union of the singular values of all $A_i$, the statement follows immediately. ∎
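The block-diagonal argument can be sketched numerically (our own illustration): the spectral norm of a block-diagonal matrix is the maximum of the per-block spectral norms, since its singular values are the union of the blocks' singular values.

```python
import numpy as np

rng = np.random.default_rng(7)
blocks = [rng.standard_normal((4, 4)) for _ in range(3)]  # one block per channel

# Block-diagonal matrix representing the multichannel depthwise convolution
D = np.zeros((12, 12))
for i, b in enumerate(blocks):
    D[4 * i:4 * (i + 1), 4 * i:4 * (i + 1)] = b

per_channel = [np.linalg.norm(b, 2) for b in blocks]
```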
Appendix B Experimental details
This section details the hyperparameters, settings and data preprocessing steps used for the experiments in Section 4.
b.1 Lipschitz constant and learning rate
An overview of the training settings for the CIFAR10 Lipschitz constant and learning rate experiment can be found in Table 3. We trained the networks on an NVIDIA V100 GPU with scaling constants from 1 to 10 and the learning rates listed in Table 3, both with soft and hard scaling.
Parameter  CIFAR10 

Network architecture  ResNet34 
Layers  Replaced conventional convs by depthwise separable convs 
Dataset  CIFAR10 
Loss function  Cross Entropy Loss 
Optimizer  SGD (Momentum: 0.9) 
Learning rate  , , , , 
Learning rate schedule  Multiplication of learning rate with at epoch milestones 
Scheduler milestones  150, 250 
Epochs  300 
Batch size  128 
Spectral normalization  True 
Scaling constant  1, 2, 3, 4, 5, 6, 7, 8, 9, 10 
Pointwise convolution  0.01 
Soft/Hard scaling  Soft / Hard 
Initialization learnable parameter  3 /  
We use the CIFAR10 dataset (Krizhevsky et al., 2009) with the default train/test split and the following data preprocessing steps during training and testing:

Training:
- random cropping (with a padding of size 4) to images of size 32×32
- random horizontal flipping
- normalizing per channel with mean and standard deviation

Testing:
- normalizing per channel with mean and standard deviation
b.2 Spectral normalization on ImageNet
An overview of the hyperparameters used for the ImageNet classification experiments is shown in Table 4. We trained the networks on two NVIDIA V100 GPUs with scaling constants 10, 12, 15, 20, 30 and 40, each with the corresponding learning rate listed in Table 4.
Parameter  ImageNet 

Network architecture  MobileNetV2 
Layers  as in (Sandler et al., 2018) 
Dataset  ImageNet 
Loss function  Cross Entropy Loss 
Optimizer  SGD (Momentum: 0.9) 
Learning rate  / / / / / 
Learning rate schedule  Multiplication of learning rate with at epoch milestones 
Scheduler milestones  50, 100 
Epochs  150 
Batch size  128 
Spectral normalization  True 
Scaling constant  10/ 12/ 15/ 20/ 30/ 40 
Pointwise convolution  0.01 
Soft/Hard scaling  Soft 
Initialization learnable parameter  3/ 3/ 3/ 3/ 0.5/ 0.5 
We use the ImageNet dataset (Russakovsky et al., 2015) with the default train/val/test split and the following data preprocessing during training and validation:

Training:
- cropping of random size and resizing to images of size 224×224
- random horizontal flipping
- normalizing per channel with mean and standard deviation

Validation:
- resizing the image to size 256×256
- center cropping to size 224×224
- normalizing per channel with mean and standard deviation
b.3 Time complexity
For the time complexity experiments, we trained identical networks (with and without spectral normalization) for CIFAR10 and ImageNet classification, using the hyperparameters shown in Table 5 and Table 6. The train/test splits as well as the preprocessing steps for CIFAR10 and ImageNet can be found in Subsections B.1 and B.2, respectively.
Parameter  CIFAR10 w SpecNorm  CIFAR10 w/o SpecNorm 

Network architecture  ResNet34  ResNet34 
Layers  Standard with DS Convs  Standard with DS Convs 
Dataset  CIFAR10  CIFAR10 
Loss function  Cross Entropy Loss  Cross Entropy Loss 
Optimizer  SGD (Momentum: 0.9)  SGD (Momentum: 0.9) 
Learning rate  
Learning rate schedule  Multiply lr with at milestones  Multiply lr with at milestones 
Scheduler milestones  150, 250  150, 250 
Epochs  300  300 
Batch size  128  128 
Spectral normalization  True  False 
Scaling constant  5   
Pointwise convolution  0.01   
Soft/Hard scaling  Hard   
Initialization learnable parameter     
Activations depthwise separable convolutions  None  None 
Parameter  ImageNet w SpecNorm  ImageNet w/o SpecNorm 

Network architecture  MobileNetV2  MobileNetV2 
Layers  as in (Sandler et al., 2018)  as in (Sandler et al., 2018) 
Dataset  ImageNet  ImageNet 
Loss function  Cross Entropy Loss  Cross Entropy Loss 
Optimizer  SGD (Momentum: 0.9)  SGD (Momentum: 0.9) 
Learning rate  
Learning rate schedule  Multiply lr with at milestones   
Scheduler milestones  50, 100   
Epochs  150  5 
Batch size  128  128 
Spectral normalization  True  False 
Scaling constant  40   
Pointwise convolution  0.01   
Soft/Hard scaling  Soft   
Initialization learnable parameter  0.5   
Activations depthwise separable convolutions  ReLU6  ReLU6 