Depthwise Separable Convolutions Allow for Fast and Memory-Efficient Spectral Normalization

An increasing number of models require the control of the spectral norm of convolutional layers of a neural network. While there is an abundance of methods for estimating and enforcing upper bounds on those during training, they are typically costly in either memory or time. In this work, we introduce a very simple method for spectral normalization of depthwise separable convolutions, which introduces negligible computational and memory overhead. We demonstrate the effectiveness of our method on image classification tasks using standard architectures like MobileNetV2.


1 Introduction

Neural networks, as universal approximators (of e.g. continuous functions), may represent highly complex functions, which in practice makes them quite sensitive to their inputs, as evidenced e.g. by adversarial examples (Szegedy et al., 2013). One simple measure of a network's sensitivity to its inputs is its Lipschitz constant. Indeed, an increasing number of applications call for the control of this Lipschitz constant, which yields guarantees on the input-output sensitivity. One particular way of controlling the Lipschitz constant of a network is spectral normalization of its convolutions and weight matrices, which explicitly limits the network's Lipschitz constant in terms of the 2-norm.

Since it determines a trade-off between expressiveness and robustness, controlling a network's Lipschitz constant is a form of regularization (Oberman and Calder, 2018). A direct application of this principle is limiting the network's Lipschitz constant in order to robustify it against adversarial attacks (Tsuzuku et al., 2018a). A version of this idea is also found in spectrally normalized GANs (Miyato et al., 2018b), which regularize the discriminator, resulting in better-looking images. Note that in practice this is similar to Wasserstein GANs (Arjovsky et al., 2017), which require the Lipschitz constant of their critic network to be at most one (in order to approximate a Wasserstein distance). Another instance in which spectral normalization can be used to enforce necessary guarantees is found in invertible residual networks (Behrmann et al., 2019), which require the Lipschitz constant of the residual blocks to be below one. Furthermore, the Lipschitz constant of an invertible network influences the stability of inversion, as shown in (Behrmann et al., 2020). For example, in additive coupling layers (Dinh et al., 2014), limiting the layers' Lipschitz constant will increase both the forward and inverse numerical stability. This in particular has implications for training with restored (instead of stored) activations (Gomez et al., 2017).

To calculate (and thus enforce) spectral norms on convolutional layers, common methods either use the power method to calculate the norms, or use (typically non-tight) upper bounds as an approximation. Enforcing these spectral norms is usually done by performing a normalization step (i.e. dividing the layers' weights by the computed Lipschitz constant) after each step of (stochastic) gradient descent, which means that the calculation of the spectral norms has to be performed many times during training. The spectral normalization can therefore strongly influence the overall speed of training, potentially bottlenecking the training duration. Details on this are given in section 2.3.

In this work, we propose a novel method for spectral normalization tailored to depthwise separable convolutions (Sifre and Mallat, 2014). Depthwise separable convolutions are a form of convolutional layers in which the channel-wise spatial convolution and the inter-channel combination of channels are decoupled. Our proposed method makes use of the specific structure of depthwise separable convolutions to achieve significant speedups compared to other methods, while having practically no additional memory overhead. In a realistic training scenario (training MobileNetV2 on ImageNet), each epoch takes only 2% additional training time – thus making spectral normalization essentially 'for free' for depthwise separable convolutional networks.

2 Problem setting

2.1 Depthwise separable convolutions

Let θ⟨i,j⟩ denote a filter corresponding to an input's j-th channel (denoted x⟨j⟩) and an output's i-th channel (denoted y⟨i⟩). Then

 y⟨i⟩ = ∑_j θ⟨i,j⟩ ∗ x⟨j⟩ (1)

describes the conventional multi-channel convolution operation, where ∗ signifies a spatial filtering (a convolution, respectively a cross-correlation). The spatial and inter-channel computations of these multi-channel convolutions are hence inherently linked. Recently, depthwise separable convolutions have arisen as a family of computationally cheaper (and yet expressive) alternatives, in which the spatial and inter-channel computations are decoupled in the form of depthwise respectively pointwise convolutions. Depthwise convolutions apply one filtering to each input channel independently (resulting in the same number of output channels), which can be equivalently represented by setting θ⟨i,j⟩ = 0 for all i ≠ j in equation (1). These serve as the spatial operations of depthwise separable convolutions. Pointwise convolutions on the other hand can be represented by multi-channel convolutions as in equation (1) whose filters have spatial size one (e.g. 1×1 convolutions in 2D), i.e. for non-strided filterings the output channels are merely linear combinations of the input channels. They constitute the inter-channel computations of depthwise separable convolutions.
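The efficiency gain of this decoupling can be seen from a simple parameter count. The following sketch (illustrative only, not from the paper) compares a conventional multi-channel convolution against its depthwise separable counterpart:

```python
# Hypothetical parameter-count comparison for a k x k convolution;
# function names are illustrative, not part of the paper.
def regular_conv_params(c_in, c_out, k):
    # one k x k filter per (input channel, output channel) pair
    return c_out * c_in * k * k

def separable_conv_params(c_in, c_out, k):
    # depthwise: one k x k filter per input channel
    # pointwise: one scalar per (input channel, output channel) pair
    return c_in * k * k + c_out * c_in

print(regular_conv_params(64, 128, 3))    # 73728
print(separable_conv_params(64, 128, 3))  # 8768
```

For a 3×3 convolution with 64 input and 128 output channels, the separable variant uses roughly an order of magnitude fewer parameters (and, analogously, fewer multiply-accumulates).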

The various realizations of depthwise separable convolutions in the literature mainly differ by their number and order of operations and where (and how often) a nonlinearity is applied. While a depthwise convolution followed by a pointwise convolution is a low-rank factorization of a conventional convolution (Sifre and Mallat, 2014), recent approaches usually incorporate nonlinearities between the two types of operations (Chollet, 2017; Howard et al., 2017; Sandler et al., 2018; Tan and Le, 2019). It is worth highlighting that EfficientNet (Tan and Le, 2019), which employs the depthwise separable convolutions defined by MobileNetV2 (Sandler et al., 2018), is currently state-of-the-art in the ILSVRC (ImageNet) challenge (Pham et al., 2020).

2.2 Mathematical preliminaries

A function g : X → Y between two normed spaces (X, ∥·∥_X) and (Y, ∥·∥_Y) is called Lipschitz-continuous, if there is a constant L such that

 ∥g(x1) − g(x2)∥_Y ≤ L ∥x1 − x2∥_X

for all x1, x2 ∈ X. The smallest such constant is referred to as the Lipschitz constant of g and will be denoted Lip(g).

For continuous, almost everywhere differentiable functions, the Lipschitz constant of a function g is given by the supremum of the operator norm (w.r.t. the norms of X, Y) of all possible derivatives, i.e. Lip(g) = sup_x ∥Dg(x)∥. In the following, we will assume that X and Y are Euclidean spaces (equipped with the 2-norm), in which case the above operator norm becomes the spectral norm of the derivative (where defined), i.e. its largest singular value. This highlights the difficulty of actually calculating the Lipschitz constant of a neural network – even in the case of locally linear functions, such as e.g. ReLU networks, this would require the evaluation of the operator norm in each locally linear region. In fact, even for a two-layer neural network, calculating the Lipschitz constant is NP-hard (Scaman and Virmaux, 2018). Approximating this Lipschitz constant can be achieved by phrasing the Lipschitz estimation problem as an optimization problem (Fazlyab et al., 2019; Latorre et al., 2020; Chen et al., 2020). These approaches are, however, still very costly.

On the other hand, calculating upper bounds of Lipschitz constants of neural networks is tractable, if finding an upper bound of the individual layers' Lipschitz constants is tractable. In particular, if g = f_n ∘ ⋯ ∘ f_1 for Lipschitz continuous functions f_i, then Lip(g) ≤ ∏_{i=1}^{n} Lip(f_i) (chain rule for Lipschitz constants). While the bounds this chain rule yields may be too loose for approximating the true Lipschitz constant of the whole network, it may still be possible to make approximate statements about individual layers' Lipschitz constants. For layers of the type f(x) = σ(Ax + b), where A is linear, b is a bias vector and σ is an activation function with bounded Lipschitz constant, it holds that

 Lip(f) ≤ Lip(σ) · Lip(A) = Lip(σ) ∥A∥_2, (2)

i.e. the Lipschitz constant of a nonlinear layer can typically be upper bounded by the spectral norm of its linear portion. This is true in particular for convolutional layers or dense layers with e.g. ReLU, tanh, sigmoid, softmax or ELU nonlinearity (all of which have Lip(σ) ≤ 1).
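The chain rule above can be checked numerically on a toy network. The following sketch (our own illustration, with assumed shapes) bounds the Lipschitz constant of a two-layer ReLU network by the product of its weight matrices' spectral norms and verifies that an empirical difference quotient never exceeds the bound:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer ReLU network (biases omitted); shapes are illustrative.
weights = [rng.standard_normal((16, 8)), rng.standard_normal((4, 16))]

# Chain rule for Lipschitz constants: since ReLU is 1-Lipschitz,
# Lip(net) <= prod_i ||W_i||_2.
lip_upper_bound = 1.0
for W in weights:
    lip_upper_bound *= np.linalg.norm(W, ord=2)  # largest singular value

def net(x):
    for W in weights:
        x = np.maximum(W @ x, 0.0)  # ReLU(W x)
    return x

# Empirical check: the difference quotient is bounded by the product.
x1, x2 = rng.standard_normal(8), rng.standard_normal(8)
ratio = np.linalg.norm(net(x1) - net(x2)) / np.linalg.norm(x1 - x2)
assert ratio <= lip_upper_bound
```

Note that the bound is generally loose: random inputs rarely align with the leading singular directions of every layer simultaneously.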

2.3 Spectral normalization

One particularly important insight about Lipschitz constants is the fact that the ability to calculate them often enables one to enforce a specific Lipschitz constant in a function.

In fact, it holds that Lip(c · g) = |c| · Lip(g) for any c, so that g̃ := (L / Lip(g)) · g has Lip(g̃) = L for some desired L. Instead of scaling the function itself, most types of neural network layers admit a scaling of their associated parameters: since for example convolutions are positively 1-homogeneous in their kernels, it suffices to scale their filter kernel accordingly. That means that for a convolutional operator K_θ with filter kernel θ (and for our setting of spaces equipped with the 2-norm), rescaling this kernel via θ̃ := θ / ∥K_θ∥_2 (and keeping a possible bias) yields a new convolutional operator K_θ̃ with ∥K_θ̃∥_2 = 1. In the following, we will refer to this normalization procedure as spectral normalization. A common way of training neural networks with spectrally normalized convolutional kernels is to apply spectral normalization after each optimization step. Note that if we overestimate the spectral norm, then still ∥K_θ̃∥_2 ≤ 1.

If one wishes to enforce some specific Lipschitz constant, multiplying the output of the normalized convolution by some desired c then makes the overall mapping c-Lipschitz. We call this method hard scaling and refer to c as the scaling constant.
One can make this scaling a bit more flexible by instead multiplying with c · σ(γ), where γ is a learnable parameter and σ denotes the sigmoid function. Since σ(γ) ∈ (0, 1), this will bound the enforced Lipschitz constant to be below c. Due to the dependence on the learnable parameter γ, one allows the 'right' Lipschitz constant up to c to be learned. We will refer to this variant as soft scaling.
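A minimal numpy sketch of one normalization step for a dense layer illustrates both variants; the sigmoid gating for soft scaling is our assumed parameterization and `gamma` an illustrative learnable parameter:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 16))  # weight of a dense layer (toy example)
c = 5.0                            # desired scaling constant
gamma = 0.0                        # learnable soft-scaling parameter (assumed)

sigma = np.linalg.norm(W, ord=2)   # spectral norm of the linear map
W_normalized = W / sigma           # now ||W_normalized||_2 == 1

# Hard scaling: Lipschitz constant of the linear part is exactly c.
W_hard = c * W_normalized

# Soft scaling (assumed sigmoid gating): norm lies strictly below c.
W_soft = c / (1.0 + np.exp(-gamma)) * W_normalized

assert np.isclose(np.linalg.norm(W_hard, ord=2), c)
assert np.linalg.norm(W_soft, ord=2) < c
```

The same rescaling applies to a convolution kernel once ∥K_θ∥_2 has been estimated, since the operator is 1-homogeneous in its kernel.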

2.4 Spectral norm estimation for convolutional layers

Unlike in the case of dense layers, whose parameters directly allow for a computation of their spectral norm via matrix methods, it is more difficult to assess a convolution's norm solely from its parameters. In fact, the same kernel generally defines not one, but a whole family of operators (depending e.g. on the input resolution and stride). The estimation of their norms can either be performed with generic methods or make use of the operators' specific structure. One approach is to restrict the convolutional operators to be exactly orthogonal (Xiao et al., 2018; Li et al., 2019) to strictly enforce unit-norm convolutions. By relaxing this constraint to approximately orthogonal convolutions, the spectral norm is likewise relaxed to values close to one (Cisse et al., 2017) or at most one (Qian and Wegman, 2018). Another class of methods exploits the structure of multi-channel convolution operators to gain insights on their singular values. In particular, by making use of the theory of doubly block circulant matrices, Sedghi et al. (2019) are able to derive an exact formula for the singular values of circulant 2D convolutions, using Fourier transforms and a singular value decomposition. While this is exact, the values are only approximate for the more commonly used zero-padded or 'valid' convolutions. Furthermore, the computation is quite slow. Similarly-derived and faster-to-compute upper bounds exist for circulant convolutions (Singla and Feizi, 2019) and padded convolutions (Araujo et al., 2020). As these methods are designed for 2D convolutions (using the structure of doubly block circulant/Toeplitz matrices), it is furthermore not immediately clear how to extend them to 3D.

A very general method for numerically calculating spectral norms of linear operators is the power method (Yoshida and Miyato, 2017; Miyato et al., 2018a; Virmaux and Scaman, 2018; Tsuzuku et al., 2018b; Farnia et al., 2018; Gouk et al., 2018; Behrmann et al., 2019), where the spectral norm of a linear operator is approximated by an iterative procedure. Here, for a randomly initialized vector v_0 from the domain of the convolutional operator A, the iteration

 v_{i+1} = A^T A v_i / ∥A^T A v_i∥_2 (3)

converges to the leading right-singular vector of A (almost surely), while ∥A v_i∥_2 converges to the desired spectral norm ∥A∥_2. The iteration (3) highlights the computational burden of performing the power method during training, as the most computationally demanding operation of a convolutional neural network has to be performed multiple times per gradient step until some desired precision is reached. The training time is thus effectively multiplied by (almost) the number of iterations that have to be performed. This desired precision level can be controlled by a user-specified parameter ε, which defines the stopping criterion. The necessary number of iterations can be greatly reduced by not initializing v_0 randomly, but instead initializing with the estimated leading right-singular vector from the previous gradient descent step (henceforth called the warm-start power method), assuming the most recent optimization step did not change the leading right-singular vector too much. This, however, requires keeping those vectors in memory, which are of the same size as the input features. This can be prohibitive in memory-constrained applications such as 3D medical segmentation (where batch sizes of 1 or 2 are common).
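The iteration and the warm-start idea can be sketched as follows for a dense matrix; names, the stopping criterion, and default tolerances are our own choices, not the paper's exact implementation:

```python
import numpy as np

def power_method(A, v0=None, eps=1e-6, max_iter=500):
    """Estimate ||A||_2 by power iteration on A^T A.

    Pass the singular-vector estimate from the previous training step
    as v0 to 'warm-start' the iteration (illustrative sketch).
    """
    rng = np.random.default_rng(0)
    v = rng.standard_normal(A.shape[1]) if v0 is None else v0.copy()
    v /= np.linalg.norm(v)
    sigma_old = 0.0
    for _ in range(max_iter):
        w = A.T @ (A @ v)          # one forward + one adjoint application
        v = w / np.linalg.norm(w)  # normalized iterate, cf. equation (3)
        sigma = np.linalg.norm(A @ v)
        if abs(sigma - sigma_old) < eps * sigma:  # relative stopping rule
            break
        sigma_old = sigma
    return sigma, v  # return v for warm-starting the next call

A = np.random.default_rng(1).standard_normal((50, 30))
sigma, v = power_method(A)
```

For a convolution, `A @ v` corresponds to applying the convolution and `A.T @ w` to its transposed convolution, which is what makes each iteration as expensive as a forward pass through that layer.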

3 Proposed solution

The main difficulty in determining the spectral norm for multi-channel convolutions lies in the fact that the spatial convolutions and the interplay between different channels are inherently linked. In the case of depthwise separable convolutions however, the spatial and inter-channel computations are decoupled. This now also allows for a separate calculation (and thus normalization, cf. section 2.3) of the individual operations’ norms, which – as it turns out – is much faster, as well as more memory-efficient than the warm-start power method.

3.1 Spectral norms of depthwise convolutions

Depthwise convolutions - the spatial component of depthwise separable convolutions - apply one (usually zero-padded) filtering to each input channel. While common methods for conventional multi-channel convolutions are either slow (e.g. a randomly initialized power method or the method from (Sedghi et al., 2019)), produce non-tight upper bounds (Singla and Feizi, 2021; Araujo et al., 2020) or require a lot of additional memory (warm-start power method), the following method is accurate, fast and can be computed on-the-fly – but is only valid for depthwise convolutions.

For this, we will make use of a form of the well-known convolution theorem (since most frameworks actually implement cross-correlation, we derive our formulas using the corresponding cross-correlation theorem; the final statement is, however, independent of whether we use cross-correlation or actual convolutions), which states that the discrete Fourier transform (DFT) of a circulant convolution of two signals equals the product of both signals' DFTs (up to a complex conjugation), such that we only need to search for the largest absolute value of the filter's Fourier coefficients. Furthermore, our derivation will be independent of the dimensionality of the data, i.e. whether we are dealing with 1D, 2D or 3D convolutions. Note that the following theorem has significant overlap with Theorem 5 from (Sedghi et al., 2019), but we derive it independently of the dimensionality of the data and provide additional context on why the Fourier transform leads to the correct spectral norm, even in the case of a real-to-real convolution.

Theorem 1 (Norms of Circulant Convolutions, Single-Channel).

Let d ∈ ℕ and X = ℝ^{N_1×⋯×N_d}. Let K_θ : X → X be a (single-channel, unit-stride) circulant convolution with filter θ. Let P denote a zero-padding operator, padding the filter to the size of the input. Then ∥K_θ∥_2 = max_i |Θ̂_i|, where Θ̂ = F(Pθ) and F denotes the d-dimensional discrete Fourier transform.

Proof.

In the following, we will derive an expression for the spectral norm of the complex cross-correlation (Appendix A.2), after which we will show that this coincides with the spectral norm of the real cross-correlation. Let Θ := Pθ and let X_ℂ = ℂ^{N_1×⋯×N_d}. Let K̃_θ : X_ℂ → X_ℂ denote the complex cross-correlation, such that the restriction of K̃_θ to X describes the same mapping as K_θ. Then, for all x ∈ X_ℂ,

 F(K̃_θ x) = conj(FΘ) ⊙ Fx, (4)

according to the cross-correlation theorem (Appendix A.3), where ⊙ denotes the entrywise product and conj(·) the complex conjugate. Rewriting this in matrix-vector formulation (for a rigorous derivation, cf. Appendix A.4) yields

 F · K_θ · Vec(x) = diag(Vec(conj(FΘ))) · F · Vec(x), (5)

where Vec is a reordering into a column vector and F and K_θ are the matrix representations of F and K̃_θ, respectively. Since this holds for arbitrary x (and F is invertible), this generalizes to

 K_θ = F^{−1} · diag(Vec(conj(FΘ))) · F,

i.e. the entries of conj(FΘ) constitute the eigenvalues of K_θ. Because diagonal matrices commute with their Hermitian transposes, it holds that K_θ K_θ^H = K_θ^H K_θ (where ·^H is the Hermitian transpose), i.e. K_θ is a normal matrix. According to the spectral theorem, this implies that the singular values of K_θ (and, by isomorphism, of K̃_θ) are the absolute values of its eigenvalues, which coincide with the absolute values of the entries of Θ̂ = FΘ.

Now, it remains to show that the spectral norms of the real-to-real operator K_θ and the complex-to-complex operator K̃_θ are the same. As both are isometrically isomorphic to a matrix with only real entries (due to the use of a real filter θ), the spectral norm is the same, whether we view this matrix as a real matrix or as a complex matrix (see Appendix Lemma 1). Hence,

 ∥K_θ∥_2 = ∥K̃_θ∥_2 = max_i |Θ̂_i|, (6)

finishing our proof. ∎

While the above theorem yields exact spectral norms, it has two shortcomings: it only deals with circulant and single-channel convolutions. In practice however, depthwise convolutions are typically zero-padded (and multi-channel). As it turns out, by zero-padding the filters to just a bit bigger than the input image (namely, to the size of the padded image), one obtains an upper bound of the actual spectral norm of the depthwise, zero-padded convolution that is quite tight in practice.

Theorem 2 (Norms of Zero-Padded and Multi-Channel Depthwise Convolutions).

Let d, C ∈ ℕ and X_C = ℝ^{C×N_1×⋯×N_d}. For odd-dimensional filters θ_j ∈ ℝ^{(2k_1+1)×⋯×(2k_d+1)} (for j = 1, …, C), let θ = (θ_1, …, θ_C). We define

 K_θ : X_C → X_C, x⟨j⟩ ↦ κ_{θ_j} x⟨j⟩ for each channel j, (7)

where κ_{θ_j} denotes the single-channel zero-padded convolution with filter θ_j. Let P denote the zero-padding to the size of the padded input. Then, with Θ̂^j = F(Pθ_j), it holds that

 ∥K_θ∥_2 ≤ max_{i,j} |Θ̂^j_i|,

where F denotes the d-dimensional discrete Fourier transform.

The proof is found in Appendix A.6. Our proposed method based on this Theorem is found in Figure 1.

Since there are very fast and highly parallelizable algorithms for computing the DFT (the Fast Fourier Transform), the above computation can be performed very quickly.
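The single-channel, circulant case of the bound can be verified numerically: the largest magnitude of the DFT of the zero-padded filter equals the spectral norm of the explicitly constructed circulant operator. The following sketch (our own verification code, with an assumed 3×3 filter and 32×32 resolution) builds the operator column by column, which is slow but exact:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32                               # feature resolution (assumed)
theta = rng.standard_normal((3, 3))  # random 3x3 filter

# DFT-based norm: zero-pad the filter to the input size, take max |DFT|.
theta_padded = np.zeros((n, n))
theta_padded[:3, :3] = theta
norm_fft = np.abs(np.fft.fft2(theta_padded)).max()

# Reference: build the circulant convolution matrix explicitly
# (each column is the image of one standard basis vector).
K = np.zeros((n * n, n * n))
for i in range(n):
    for j in range(n):
        out = np.zeros((n, n))
        for di in range(3):
            for dj in range(3):
                out[(i + di) % n, (j + dj) % n] += theta[di, dj]
        K[:, i * n + j] = out.ravel()

assert np.isclose(np.linalg.norm(K, ord=2), norm_fft)
```

The DFT computation costs O(n² log n) here, versus an SVD of an n²×n² matrix for the explicit route; for zero-padded (non-circulant) convolutions the same quantity is an upper bound rather than an equality, as stated in Theorem 2.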

Note that the above theorems are derived for unit-stride depthwise convolutions. For strided convolutions, the upper bound still holds, but it will in reality be more vacuous. To see this, let K^s_θ denote the strided convolution and let K_θ denote the corresponding unit-stride convolution. Then there is a subsampling operator S with ∥S∥_2 = 1, such that K^s_θ = S K_θ, from which it follows that ∥K^s_θ∥_2 ≤ ∥K_θ∥_2. In the case of stride s in all d directions, one can only make the 'educated guess' based on the assumption that all components of K_θ x contribute to its overall norm evenly, in which case ∥K^s_θ∥_2 ≈ ∥K_θ∥_2 / √(s^d). This e.g. means that for 2D data, our estimation is about 2 times too high for s = 2.

3.2 Calculating Lipschitz constants for pointwise convolutions

As discussed in section 2.1, pointwise convolutions are simply multi-channel convolutions

 y⟨i⟩ = ∑_j θ⟨i,j⟩ ∗ x⟨j⟩

with filters θ⟨i,j⟩ of spatial size one, which is why in 2D they are often called 1×1-convolutions. Here, we propose a variation of the warm-start power method for a specific small matrix, which – compared to the naïve power method for the full operator – is typically much faster and more memory-efficient. In particular, it is independent of the size of the convolved data.

Theorem 3.

Let K_θ be a pointwise convolution with kernel θ, such that θ⟨i,j⟩ ∈ ℝ for all i ∈ {1, …, C_out} and j ∈ {1, …, C_in}, where C_in and C_out denote the number of input respectively output channels. Then we call

 Θ := [ θ⟨1,1⟩ ⋯ θ⟨1,C_in⟩ ; ⋮ ⋱ ⋮ ; θ⟨C_out,1⟩ ⋯ θ⟨C_out,C_in⟩ ] ∈ ℝ^{C_out×C_in} (8)

the connectivity matrix of K_θ, and it holds that

 ∥K_θ∥_2 = ∥Θ∥_2.
Proof.

K_θ is similar to a block diagonal matrix whose blocks all equal Θ. As the singular values of a block diagonal matrix are the union of the blocks' singular values, the spectral norm (i.e. the largest singular value) of K_θ coincides with the spectral norm of Θ. ∎

Performing the power method with the connectivity matrix has the additional advantage that the warm-start version of the power method only requires storing a small right-singular vector from ℝ^{C_in} (alternatively, the left-singular vector from ℝ^{C_out} can be stored). Compare that to the naïve power method, which requires storing a vector of the size of the input (respectively output) tensor of the layer. We refer to the thus-improved warm-start power method as the efficient warm-start power method. An illustration of this method is found in Figure 1.

For square images of resolution N-by-N, both the computational and memory demand is divided by a factor of N². This effect is even more pronounced in the case of 3D (or even higher-order) data, where the memory cost for the naïve method can easily be several gigabytes, whereas our method is typically limited to kilobytes of storage for the leading singular vector (for at most a few thousand channels).
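Theorem 3 can be checked empirically: a pointwise convolution applies the same C_out×C_in matrix at every pixel, so no input can achieve a gain larger than ∥Θ∥_2, and a spatially constant input built from the leading right-singular vector attains it. The sketch below (our own illustration, with assumed channel counts and resolution) demonstrates both:

```python
import numpy as np

rng = np.random.default_rng(0)
c_in, c_out, n = 8, 16, 32                   # assumed channels/resolution
Theta = rng.standard_normal((c_out, c_in))   # connectivity matrix

def pointwise_conv(x):
    # Apply the same c_out x c_in matrix at every spatial position;
    # x has shape (c_in, n, n), the output (c_out, n, n).
    return np.einsum('oi,ihw->ohw', Theta, x)

# Random input: gain never exceeds ||Theta||_2.
x = rng.standard_normal((c_in, n, n))
gain = np.linalg.norm(pointwise_conv(x)) / np.linalg.norm(x)
assert gain <= np.linalg.norm(Theta, ord=2) + 1e-12

# Spatially constant input along the leading right-singular vector
# of Theta attains the norm exactly.
_, S, Vt = np.linalg.svd(Theta)
x_best = Vt[0][:, None, None] * np.ones((1, n, n))
gain_best = np.linalg.norm(pointwise_conv(x_best)) / np.linalg.norm(x_best)
assert np.isclose(gain_best, S[0])
```

Note that the power method on `Theta` (a 16×8 matrix here) replaces iterating on the full operator acting on 8×32×32 tensors, which is the source of the N² savings described above.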

4 Experiments

In this section, we perform several experiments on classical benchmarks in order to evaluate our proposed method for spectral normalization. Note that the goal here is not to convince the reader of the usefulness of spectral normalization in general (interested readers are referred to the many examples in sections 1 and 2.4), but to benchmark the accuracy and speed of our method and to uncover the interplay with other variables, such as the learning rate.

4.1 Accuracy of depthwise spectral norm upper bound

While the accuracy of our efficient power method for pointwise convolutions is governed by the user-specified ε-parameter, for depthwise convolutions we only have access to upper bounds of the true spectral norm. As shown in Theorem 2, the upper bound comes into play solely because of zero-padding, whose relative effect should become smaller for large resolutions of the input images (in comparison to small kernel sizes). In order to check how tight this upper bound is (and thus how close our approximation is), we calculate the relative overestimation (approximation/actual value) for randomly generated (Gaussian distributed) filters, depending on the feature resolution. Here, the 'true' spectral norm was approximated with 30 iterations of the power method. We varied the feature resolution from 7×7 (the smallest resolution in e.g. MobileNetV2 Sandler et al. (2018), VGG Simonyan and Zisserman (2015) and ResNet He et al. (2016)) to 128×128 and generated 1000 random filters per resolution. The results are shown in Figure 2. One can observe that the relative error indeed decreases with increasing resolution, from (in median) 17% to 2%. Note that due to the linearity of the investigated methods, the results of this experiment are independent of the standard deviation of the chosen Gaussian distribution.

The above analysis can be seen as an investigation into the approximation accuracy at initialization. So how good is the approximation in a fully-trained network, where the filters do not follow a Gaussian distribution? For this, we did the above comparison for a MobileNetV2, which was pretrained on ImageNet Russakovsky et al. (2015). MobileNetV2 contains 17 depthwise convolutional layers, 13 of which are unit-stride and 4 of which have a stride of 2 in each direction. The correct feature resolution from the original article was used for the calculation of the spectral norms. For the 13 unit-stride convolutions, the average relative overestimation was 2.80%, with a standard deviation of 1.55%. In the case of the strided convolutions (for which we expect more vacuous bounds, as explained in section 3.1), the relative overestimation of our method was 75.6% (standard deviation: 20.6%). In summary, in a realistic setting, our estimation of the spectral norm is highly accurate for unit-stride depthwise convolutions, but less accurate for strided convolutions (but still better than the ’educated guess’ of a factor of 2, which we laid out in section 3.1).

4.2 Lipschitz constant and learning rate

Since the Lipschitz constant of a neural network determines the magnitude of its gradients, training with gradient-based methods requires adapting the learning rate accordingly – i.e. a lower Lipschitz constant necessitates a larger learning rate. In order to evaluate the interplay of the learning rate and the scaling constant c (both for soft and hard scaling), we perform a study on CIFAR-10 Krizhevsky et al. (2009), in which we test different settings for both variables. For this, we train a ResNet-34 (He et al., 2016), for which we exchanged all convolutions by depthwise separable convolutions (one depthwise convolution, followed by a pointwise convolution without intermediate activation function). We therefore end up with a network architecture consisting of 33 depthwise separable convolutional layers (of which 3 are strided convolutions with a stride of 2) and one fully connected (softmax) layer. We trained all networks for 300 epochs, using scaling constants c from 1 to 10 and a range of learning rates for stochastic gradient descent with fixed momentum. Since batch normalization influences the Lipschitz constant (both through the running means of the batches' standard deviations and the learnable scale parameter), we train our models without batch normalization. We compute and restrict the spectral norm of every layer, using the proposed method to spectrally normalize the depthwise separable convolutions, as well as the warm-start power method to restrict the spectral norm of the final classification layer. Additionally, we choose the precision parameter ε to be 0.01 for all pointwise convolutional layers and initialize the soft scaling parameter γ to the same fixed value for all runs.

The results of this experiment are depicted in Table 1. As predicted, lower scaling constants tend to require higher learning rates. Furthermore, the known regularizing effect of the Lipschitz constraints can be observed, since increasing c initially increases the prediction accuracy (underfitting regime), before it decreases again (overfitting regime). A careful tuning of the learning rate is needed to achieve the best performance. In particular, too high a learning rate may collapse the training. Moreover, there is no clear winner between hard and soft scaling.

4.3 Spectral normalization on ImageNet

In addition to the previous CIFAR-10 experiments, we conducted a study on the much higher-resolution ImageNet dataset (Russakovsky et al., 2015). For this, we trained MobileNetV2 (Sandler et al., 2018), a standard architecture for image classification using depthwise separable convolutions, with a range of scaling constants c, and again without batch normalization. Lower and higher values of c in our case led to a strong breakdown in performance. The models were trained with soft scaling for 150 epochs, using the same precision parameter ε for every pointwise convolutional layer. All models were trained with a batch size of 128, divided over two NVIDIA V100 GPUs. Due to the interplay of scaling constant and learning rate (see subsection 4.2), the learning rate had to be adapted for each c.

In Figure 3, the Top-1 as well as the Top-5 validation accuracy per scaling constant are plotted. One can observe that the accuracy increases as one increases the scaling constant, indicating that we never reach the overfitting regime. Moreover, it can be seen that the increase in accuracy is steeper in the lower regions of c, later leveling off.

The use of soft scaling offers an additional degree of flexibility, as the training can slightly adjust the Lipschitz constant. We noticed that in the ImageNet experiments, the initialization of the learnable parameter γ had to be tuned. For lower values of c, the initialization from the previous CIFAR-10 experiments still worked very well, whereas higher values of c required somewhat reducing the initial value of γ, as it otherwise resulted in exploding gradients or missing convergence. We also experimented with initializations for which the network behaves as if there were no spectral normalization at initialization, but these did not generally work well. However, in all cases it was enough to monitor the training for a few gradient steps to judge whether an adjustment of the initial value of γ was necessary.

4.4 Time complexity

In the following, we aim to show the impact of our method on training times, which we measured both with and without spectral normalization. Instead of benchmarking only linear layers, this includes the time for data loading, augmentation, backpropagation etc., which allows for a realistic assessment of the method's overhead in a real training scenario. In addition to the absolute times per epoch, we show the relative factor between our approach and training the network without spectral normalization.

For this, we observe the average training times per epoch of the last two experimental cohorts (ResNet-34 on CIFAR-10 and MobileNetV2 on ImageNet, best accuracy model each), and compare these to the case of the same network without spectral normalization (and still without batch normalization). The results are summarized in Table 2.

In the case of CIFAR-10, our additional overhead amounted to an increase of about 63% (where a considerable amount was spent on the normalization of the final, fully-connected layer). In the case of ImageNet, the time per epoch only increased by 2%, which can be seen as negligible. Since the cost of our normalization is small in comparison to convolutions on large features, it is to be expected that our method has less of an impact on the training time for higher-dimensional data.

5 Conclusion & future work

Mathematical guarantees in the form of Lipschitz constants have increasingly come into the focus of modern neural network research. While there are simple and generic ways of calculating (and thus enforcing) spectral norms (resulting in said Lipschitz guarantees), these are typically quite costly in time or memory, or result in inaccurate upper bounds. For applications in which efficiency is of interest, depthwise separable convolutions have emerged as a fast form of convolutions, which nonetheless allow for state-of-the-art results in challenging benchmarks (cf. Pham et al. (2020)). In this work, we introduce a very simple procedure with which the spectral normalization of depthwise separable convolutions can be performed with negligible computational and memory overhead, while being quite accurate in practice.

We imagine future work based on our proposed method from both an application and a methodological perspective. From a methodological standpoint, going from the spectral norm to different norms is another possible avenue. In the case of the pointwise convolution, this is straightforward, as it just entails computing matrix norms (which are often fast to compute, e.g. in the case of the 1-norm or ∞-norm). In terms of applications, research into architectures that make full use of depthwise separable convolutions for any of the applications named in Section 1 is needed (e.g. spectrally normalized GANs), which can then benefit from our proposed method. Making the step to 3D will yield even bigger performance improvements compared to the conventional approaches. One use case is the stabilization of invertible networks for memory-efficient 3D segmentation (Etmann et al., 2020), which can be achieved by limiting each layer's Lipschitz constant (Behrmann et al., 2020), e.g. via our method.
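To illustrate the pointwise case, the 1-norm and ∞-norm of a 1×1 convolution reduce to matrix norms of its weight matrix, since the same matrix is applied at every pixel. A minimal sketch (the helper name is ours):

```python
import numpy as np

def pointwise_norms(W):
    """Operator 1-norm and inf-norm of a 1x1 ("pointwise") convolution.

    A pointwise convolution applies the same (c_out x c_in) matrix W at
    every spatial position, so its operator p-norm equals the matrix
    p-norm of W. The 1-norm and inf-norm are cheap to compute: the
    maximum absolute column sum and row sum, respectively.
    """
    one_norm = np.abs(W).sum(axis=0).max()   # max absolute column sum
    inf_norm = np.abs(W).sum(axis=1).max()   # max absolute row sum
    return one_norm, inf_norm

W = np.array([[1.0, -2.0], [3.0, 4.0]])
print(pointwise_norms(W))  # (6.0, 7.0)
```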

Acknowledgements

CE acknowledges support from the Wellcome Innovator Award RG98755. CBS acknowledges support from the Philip Leverhulme Prize, the Royal Society Wolfson Fellowship, the EPSRC grants EP/S026045/1 and EP/T003553/1, EP/N014588/1, EP/T017961/1, the Wellcome Innovator Award RG98755, the European Union Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 777826 NoMADS, the Cantab Capital Institute for the Mathematics of Information and the Alan Turing Institute.

References

• A. Araujo, B. Negrevergne, Y. Chevaleyre, and J. Atif (2020) On lipschitz regularization of convolutional layers using toeplitz matrix theory. arXiv preprint arXiv:2006.08391. Cited by: §2.4, §3.1.
• M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223. Cited by: §1.
• J. Behrmann, W. Grathwohl, R. T. Chen, D. Duvenaud, and J. Jacobsen (2019) Invertible residual networks. In International Conference on Machine Learning, pp. 573–582. Cited by: §1, §2.4.
• J. Behrmann, P. Vicol, K. Wang, R. Grosse, and J. Jacobsen (2020) Understanding and mitigating exploding inverses in invertible neural networks. arXiv preprint arXiv:2006.09347. Cited by: §1, §5.
• T. Chen, J. Lasserre, V. Magron, and E. Pauwels (2020) Semialgebraic optimization for lipschitz constants of relu networks. In Conference on Neural Information Processing Systems, Cited by: §2.2.
• F. Chollet (2017) Xception: deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.1.
• M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier (2017) Parseval networks: improving robustness to adversarial examples. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 854–863. Cited by: §2.4.
• L. Dinh, D. Krueger, and Y. Bengio (2014) Nice: non-linear independent components estimation. arXiv preprint arXiv:1410.8516. Cited by: §1.
• C. Etmann, R. Ke, and C. Schönlieb (2020) IUNets: learnable invertible up-and downsampling for large-scale inverse problems. In 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6. Cited by: §5.
• F. Farnia, J. Zhang, and D. Tse (2018) Generalizable adversarial training via spectral normalization. In International Conference on Learning Representations, Cited by: §2.4.
• M. Fazlyab, A. Robey, H. Hassani, M. Morari, and G. Pappas (2019) Efficient and accurate estimation of lipschitz constants for deep neural networks. In Advances in Neural Information Processing Systems, pp. 11427–11438. Cited by: §2.2.
• A. N. Gomez, M. Ren, R. Urtasun, and R. B. Grosse (2017) The reversible residual network: backpropagation without storing activations. arXiv preprint arXiv:1707.04585. Cited by: §1.
• H. Gouk, E. Frank, B. Pfahringer, and M. Cree (2018) Regularisation of neural networks by enforcing lipschitz continuity. arXiv preprint arXiv:1804.04368. Cited by: §2.4.
• K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.1, §4.2.
• A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. External Links: 1704.04861 Cited by: §2.1.
• A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §B.1, §4.2.
• F. Latorre, P. Rolland, and V. Cevher (2020) Lipschitz constant estimation of neural networks via sparse polynomial optimization. In International Conference on Learning Representations, Cited by: §2.2.
• Q. Li, S. Haque, C. Anil, J. Lucas, R. B. Grosse, and J. Jacobsen (2019) Preventing gradient attenuation in lipschitz constrained convolutional networks. In Advances in neural information processing systems, pp. 15390–15402. Cited by: §2.4.
• T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018a) Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, Cited by: §2.4.
• T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018b) Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957. Cited by: §1.
• A. M. Oberman and J. Calder (2018) Lipschitz regularized deep neural networks converge and generalize. arXiv preprint arXiv:1808.09540. Cited by: §1.
• H. Pham, Q. Xie, Z. Dai, and Q. V. Le (2020) Meta pseudo labels. arXiv preprint arXiv:2003.10580. Cited by: §2.1, §5.
• H. Qian and M. N. Wegman (2018) L2-nonexpansive neural networks. In International Conference on Learning Representations, Cited by: §2.4.
• O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §B.2, §4.1, §4.3.
• M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520. Cited by: Table 4, Table 6, §2.1, §4.1, §4.3.
• K. Scaman and A. Virmaux (2018) Lipschitz regularity of deep neural networks: Analysis and efficient estimation. Advances in Neural Information Processing Systems 2018-December (1), pp. 3835–3844. External Links: 1805.10965, ISSN 10495258 Cited by: §2.2.
• H. Sedghi, V. Gupta, and P. M. Long (2019) The singular values of convolutional layers. 7th International Conference on Learning Representations, ICLR 2019, pp. 1–12. External Links: 1805.10408 Cited by: §2.4, §3.1, §3.1.
• L. Sifre and S. Mallat (2014) Rigid-motion scattering for image classification [PhD thesis]. Ecole Polytechnique. Cited by: §1, §2.1.
• K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, Cited by: §4.1.
• S. Singla and S. Feizi (2019) Bounding singular values of convolution layers. arXiv preprint arXiv:1911.10258. Cited by: §2.4.
• S. Singla and S. Feizi (2021) Fantastic four: differentiable and efficient bounds on singular values of convolution layers. In International Conference on Learning Representations, External Links: Link Cited by: §3.1.
• C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §1.
• M. Tan and Q. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 6105–6114. Cited by: §2.1.
• Y. Tsuzuku, I. Sato, and M. Sugiyama (2018a) Lipschitz-margin training: scalable certification of perturbation invariance for deep neural networks. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31, pp. 6541–6550. External Links: Link Cited by: §1.
• Y. Tsuzuku, I. Sato, and M. Sugiyama (2018b) Lipschitz-margin training: scalable certification of perturbation invariance for deep neural networks. In Advances in neural information processing systems, pp. 6541–6550. Cited by: §2.4.
• A. Virmaux and K. Scaman (2018) Lipschitz regularity of deep neural networks: analysis and efficient estimation. In Advances in Neural Information Processing Systems, pp. 3835–3844. Cited by: §2.4.
• L. Xiao, Y. Bahri, J. Sohl-Dickstein, S. S. Schoenholz, and J. Pennington (2018) Dynamical isometry and a mean field theory of cnns: how to train 10,000-layer vanilla convolutional neural networks. arXiv preprint arXiv:1806.05393. Cited by: §2.4.
• Y. Yoshida and T. Miyato (2017) Spectral norm regularization for improving the generalizability of deep learning. arXiv preprint arXiv:1705.10941. Cited by: §2.4.

Appendix A Additional Details and Proofs

In the following, we will give additional details and proofs for the theory in Section 3. For compactness, we will use a multi-index notation indicated by bold letters and numbers. All multi-indices have $d$ entries, meant to represent $d$-dimensional data (e.g. $d=2$ for images), i.e. $\mathbf{n} = (n_1, \ldots, n_d)$. We will also use the shorthand $\mathbf{0} := (0, \ldots, 0)$, respectively $\mathbf{1} := (1, \ldots, 1)$. For some $\mathbf{N} \in \mathbb{N}^d$ (respectively $\bar{\mathbf{N}} \in \mathbb{N}^d$), we write $\mathbb{C}^{\mathbf{N}} := \mathbb{C}^{N_1 \times \cdots \times N_d}$. Division and addition are meant entry-wise. Furthermore, we define the sum notation

$$\sum_{\mathbf{n}=\mathbf{0}}^{\mathbf{N}-\mathbf{1}} f[\mathbf{n}] := \sum_{n_1=0}^{N_1-1} \cdots \sum_{n_d=0}^{N_d-1} f[n_1, \ldots, n_d]$$

and analogously for other multi-index bounds. We denote the imaginary unit by $i$, as well as the complex conjugate of a complex vector $z$ by $\bar{z}$.

a.1 Discrete Fourier Transform

Recall the following definition of the (unnormalized) discrete Fourier transform (DFT).

Definition 1.

We call

$$\mathcal{F}: \mathbb{C}^{\mathbf{N}} \to \mathbb{C}^{\mathbf{N}}, \quad f \mapsto \hat{f}, \tag{9}$$

defined by

$$\hat{f}[\mathbf{j}] := \sum_{\mathbf{n}=\mathbf{0}}^{\mathbf{N}-\mathbf{1}} f[\mathbf{n}]\, e^{-i 2\pi \langle \mathbf{n}/\mathbf{N},\, \mathbf{j} \rangle} \tag{10}$$

the discrete Fourier transform.

Compared to the 'normalized' version of the discrete Fourier transform (which is multiplied by $1/\sqrt{N_1 \cdots N_d}$ and is unitary), this unnormalized formulation of the DFT results in slightly simpler statements for the main results.
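For concreteness, the unnormalized convention of Definition 1 coincides with the default convention of common FFT implementations. The following check against `numpy.fft.fftn` is our own illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal((4, 5))  # d = 2, N = (4, 5)

def dft(f):
    """Unnormalized DFT computed directly from Definition 1:
    hat_f[j] = sum_n f[n] * exp(-i * 2*pi * <n/N, j>)."""
    N1, N2 = f.shape
    n1, n2 = np.meshgrid(np.arange(N1), np.arange(N2), indexing="ij")
    out = np.zeros((N1, N2), dtype=complex)
    for j1 in range(N1):
        for j2 in range(N2):
            phase = n1 * j1 / N1 + n2 * j2 / N2
            out[j1, j2] = (f * np.exp(-2j * np.pi * phase)).sum()
    return out

# numpy's fftn implements exactly this unnormalized convention
assert np.allclose(dft(f), np.fft.fftn(f))
```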

a.2 Circulant Cross-Correlation

Here, we will define the complex version of the cross-correlation operation. Note that virtually all neural network libraries implement convolutions as cross-correlation.

Definition 2.

For $\Theta, f \in \mathbb{C}^{\mathbf{N}}$, we call the vector $g \in \mathbb{C}^{\mathbf{N}}$ defined by

$$g[\mathbf{n}] = \sum_{\mathbf{k}=\mathbf{0}}^{\mathbf{N}-\mathbf{1}} \overline{\Theta[\mathbf{k}]}\, f[\mathbf{n}+\mathbf{k}] \tag{11}$$

the circulant cross-correlation of $f$ with $\Theta$, if $f$ is circularly continued, i.e. $f[\mathbf{n} + \mathbf{m} \odot \mathbf{N}] = f[\mathbf{n}]$ for all $\mathbf{n}$ and $\mathbf{m} \in \mathbb{Z}^d$. We denote this cross-correlation by $\Theta * f$.

a.3 Circulant Cross-Correlation Theorem

Theorem 4.

Let $\Theta, f \in \mathbb{C}^{\mathbf{N}}$ and $g = \Theta * f$. Then

$$\hat{g}[\mathbf{j}] = \overline{(\mathcal{F}\Theta)[\mathbf{j}]} \cdot (\mathcal{F}f)[\mathbf{j}],$$

or, written differently,

$$\mathcal{F}(\Theta * f) = \overline{(\mathcal{F}\Theta)} \odot (\mathcal{F}f),$$

where $\odot$ denotes the component-wise multiplication.

Proof.
$$\begin{aligned} \hat{g}[\mathbf{j}] &= \sum_{\mathbf{n}=\mathbf{0}}^{\mathbf{N}-\mathbf{1}} g[\mathbf{n}]\, e^{-i 2\pi \langle \mathbf{n}/\mathbf{N},\, \mathbf{j} \rangle} \\ &= \sum_{\mathbf{n}=\mathbf{0}}^{\mathbf{N}-\mathbf{1}} \sum_{\mathbf{k}=\mathbf{0}}^{\mathbf{N}-\mathbf{1}} \overline{\Theta[\mathbf{k}]}\, f[\mathbf{n}+\mathbf{k}]\, e^{-i 2\pi \langle \mathbf{n}/\mathbf{N},\, \mathbf{j} \rangle} \\ &= \sum_{\mathbf{n}=\mathbf{0}}^{\mathbf{N}-\mathbf{1}} \sum_{\mathbf{k}=\mathbf{0}}^{\mathbf{N}-\mathbf{1}} \overline{\Theta[\mathbf{k}]}\, f[\mathbf{n}+\mathbf{k}]\, e^{-i 2\pi \langle (\mathbf{n}+\mathbf{k}-\mathbf{k})/\mathbf{N},\, \mathbf{j} \rangle} \\ &= \sum_{\mathbf{n}=\mathbf{0}}^{\mathbf{N}-\mathbf{1}} \sum_{\mathbf{k}=\mathbf{0}}^{\mathbf{N}-\mathbf{1}} \overline{\Theta[\mathbf{k}]}\, f[\mathbf{n}+\mathbf{k}]\, e^{-i 2\pi \langle (\mathbf{n}+\mathbf{k})/\mathbf{N},\, \mathbf{j} \rangle}\, e^{-i 2\pi \langle -\mathbf{k}/\mathbf{N},\, \mathbf{j} \rangle} \\ &= \sum_{\mathbf{k}=\mathbf{0}}^{\mathbf{N}-\mathbf{1}} \overline{\Theta[\mathbf{k}]}\, e^{-i 2\pi \langle -\mathbf{k}/\mathbf{N},\, \mathbf{j} \rangle} \sum_{\mathbf{n}=\mathbf{0}}^{\mathbf{N}-\mathbf{1}} f[\mathbf{n}+\mathbf{k}]\, e^{-i 2\pi \langle (\mathbf{n}+\mathbf{k})/\mathbf{N},\, \mathbf{j} \rangle} \end{aligned}$$

Using the fact that $\overline{e^{ix}} = e^{-ix}$ for $x \in \mathbb{R}$:

$$\sum_{\mathbf{k}=\mathbf{0}}^{\mathbf{N}-\mathbf{1}} \overline{\Theta[\mathbf{k}]}\, e^{-i 2\pi \langle -\mathbf{k}/\mathbf{N},\, \mathbf{j} \rangle} = \sum_{\mathbf{k}=\mathbf{0}}^{\mathbf{N}-\mathbf{1}} \overline{\Theta[\mathbf{k}]\, e^{-i 2\pi \langle \mathbf{k}/\mathbf{N},\, \mathbf{j} \rangle}} = \overline{\sum_{\mathbf{k}=\mathbf{0}}^{\mathbf{N}-\mathbf{1}} \Theta[\mathbf{k}]\, e^{-i 2\pi \langle \mathbf{k}/\mathbf{N},\, \mathbf{j} \rangle}} = \overline{(\mathcal{F}\Theta)[\mathbf{j}]}$$

Due to the circular continuation of $f$ and the fact that the exponential function is $2\pi i$-periodic on the imaginary axis, we can simplify the rightmost sum to the Fourier transform of $f$:

$$\sum_{\mathbf{n}=\mathbf{0}}^{\mathbf{N}-\mathbf{1}} f[\mathbf{n}+\mathbf{k}]\, e^{-i 2\pi \langle (\mathbf{n}+\mathbf{k})/\mathbf{N},\, \mathbf{j} \rangle} = (\mathcal{F}f)[\mathbf{j}]$$

In summary, this proves the statement

$$\hat{g}[\mathbf{j}] = \overline{(\mathcal{F}\Theta)[\mathbf{j}]} \cdot (\mathcal{F}f)[\mathbf{j}]. \qquad \blacksquare$$
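Theorem 4 can be checked numerically. The following sketch (our own) builds the circulant cross-correlation directly from Definition 2 and compares both sides of the identity via the FFT:

```python
import numpy as np

rng = np.random.default_rng(1)
N = (6, 7)
theta = rng.standard_normal(N)
f = rng.standard_normal(N)

# Circulant cross-correlation g[n] = sum_k conj(theta[k]) f[n+k mod N]
# (theta is real here, so the conjugation is a no-op).
g = np.zeros(N)
for n1 in range(N[0]):
    for n2 in range(N[1]):
        acc = 0.0
        for k1 in range(N[0]):
            for k2 in range(N[1]):
                acc += theta[k1, k2] * f[(n1 + k1) % N[0], (n2 + k2) % N[1]]
        g[n1, n2] = acc

# Theorem 4: F(theta * f) = conj(F theta) ⊙ (F f)
lhs = np.fft.fftn(g)
rhs = np.conj(np.fft.fftn(theta)) * np.fft.fftn(f)
assert np.allclose(lhs, rhs)
```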

a.4 Matrix Formulation of the Cross-Correlation Theorem

Here, we make the reformulation of the cross-correlation theorem into matrix multiplications (which are used to derive the statement in Theorem 1) more precise.

Remark 1.

Let $\tilde{X} := \mathbb{C}^{\mathbf{N}}$, $\vec{X} := \mathbb{C}^{N_1 \cdots N_d}$. Let further $\mathrm{Vec}$ be an operation that reorders a vector from $\tilde{X}$ into a (column) vector in $\vec{X}$, which is a unitary operator with matrix representation $V$ (i.e. $V^* V = V V^* = I$, where $V^*$ is the adjoint of $V$). Representing an operator in a different coordinate system entails first transforming to the required domain of the operation (here: via the transition operator $V^*$), and then transforming the output to the desired codomain (here: via $V$). This means that for the linear operators $\mathcal{F}$ and $K_\theta$ to be reformulated into matrices, we define $\vec{\mathcal{F}} := V \mathcal{F} V^*$ and $\vec{K}_\theta := V K_\theta V^*$. Likewise, the entry-wise multiplication in $\tilde{X}$ (formulated as a bilinear operator)

$$M: \tilde{X} \times \tilde{X} \to \tilde{X}, \quad (a, b) \mapsto a \odot b \tag{12}$$

is equivalently represented as an operation in $\vec{X}$ via

$$\vec{M}: \vec{X} \times \vec{X} \to \vec{X}, \quad (c, d) \mapsto V \cdot M(V^* c, V^* d). \tag{13}$$

According to the Cross-Correlation Theorem A.3, it holds that $\mathcal{F}(K_\theta x) = M(\overline{\mathcal{F}\Theta}, \mathcal{F}x)$ for all $x \in \tilde{X}$. Due to the unitarity of $V$, it holds that

$$\vec{\mathcal{F}} \vec{K}_\theta\, \mathrm{Vec}(x) = (V \mathcal{F} V^*)(V K_\theta V^*)\, \mathrm{Vec}(x) = V \mathcal{F} K_\theta x$$

and

$$V\, M(\overline{\mathcal{F}\Theta}, \mathcal{F}x) = \vec{M}(V \overline{\mathcal{F}\Theta}, V \mathcal{F}x) = \vec{M}\big(\overline{\vec{\mathcal{F}}\, \mathrm{Vec}(\Theta)}, \vec{\mathcal{F}}\, \mathrm{Vec}(x)\big),$$

meaning that

$$\vec{\mathcal{F}} \vec{K}_\theta\, \mathrm{Vec}(x) = \vec{M}\big(\overline{\vec{\mathcal{F}}\, \mathrm{Vec}(\Theta)}, \vec{\mathcal{F}}\, \mathrm{Vec}(x)\big),$$

due to the unitarity of $V$. Note that entry-wise multiplication in $\vec{X}$ can be represented by the multiplication of a diagonal matrix with a column vector, which results in the statement

$$\vec{\mathcal{F}} \vec{K}_\theta\, \mathrm{Vec}(x) = \mathrm{diag}\big(\overline{\vec{\mathcal{F}}\, \mathrm{Vec}(\Theta)}\big)\, \vec{\mathcal{F}}\, \mathrm{Vec}(x).$$
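This diagonalization implies that the singular values of a circulant cross-correlation are the moduli of the filter's DFT, so the spectral norm costs a single FFT plus a maximum. A 1D numerical sketch (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 8
theta = rng.standard_normal(N)  # 1D filter, so the matrix stays small

# Explicit matrix of the circulant cross-correlation x -> theta * x:
# (K x)[n] = sum_k theta[k] x[(n + k) mod N]
K = np.zeros((N, N))
for n in range(N):
    for k in range(N):
        K[n, (n + k) % N] += theta[k]

# The singular values of K are |F theta|, so the spectral norm is
# max_j |(F theta)[j]|.
sigma_fft = np.abs(np.fft.fft(theta))
sigma_svd = np.linalg.svd(K, compute_uv=False)
assert np.allclose(np.sort(sigma_fft), np.sort(sigma_svd))
assert np.isclose(sigma_fft.max(), np.linalg.norm(K, 2))
```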

a.5 Matrix Spectral Norm Equality

The spectral norm of the real-to-real convolution (according to Theorem 1) is derived using its eigenvalues, when viewed as a complex-to-complex operator. Since eigenvalues of operations defined on spaces over $\mathbb{R}$ and $\mathbb{C}$ differ, we now assert that the spectral norm of a matrix $K$ with only real entries (and by extension, our real-to-real convolutional operator) does not depend on whether we view it as a matrix in $\mathbb{R}^{n \times n}$ or as a matrix in $\mathbb{C}^{n \times n}$.

Lemma 1.

We denote the spectral norm for real matrices by

$$\|K\|_{2,\mathbb{R}^n} := \max_{\substack{x \in \mathbb{R}^n \\ \|x\|_2 = 1}} \|Kx\|_2,$$

as well as the spectral norm for complex matrices by

$$\|\tilde{K}\|_{2,\mathbb{C}^n} := \max_{\substack{x \in \mathbb{C}^n \\ \|x\|_2 = 1}} \|\tilde{K}x\|_2.$$

Then for any $K \in \mathbb{R}^{n \times n}$, it holds that

$$\|K\|_{2,\mathbb{R}^n} = \|K\|_{2,\mathbb{C}^n}.$$
Proof.

We will prove the above statement by showing that both $\|K\|_{2,\mathbb{R}^n} \le \|K\|_{2,\mathbb{C}^n}$ and $\|K\|_{2,\mathbb{C}^n} \le \|K\|_{2,\mathbb{R}^n}$.

It is easy to see that

$$\|K\|_{2,\mathbb{R}^n} = \max_{\substack{x \in \mathbb{R}^n \\ \|x\|_2 = 1}} \|Kx\|_2 \le \max_{\substack{x \in \mathbb{C}^n \\ \|x\|_2 = 1}} \|Kx\|_2 = \|K\|_{2,\mathbb{C}^n},$$

since $\mathbb{R}^n \subset \mathbb{C}^n$.

On the other hand, since $K$ only has real entries, $K^T K$ is symmetric and real, which means that there is an orthogonal matrix $S$ which diagonalizes $K^T K$, i.e.

$$K^T K = S^T \underbrace{\mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2)}_{=:D}\, S,$$

where $\sigma_1 \ge \cdots \ge \sigma_n \ge 0$ are the singular values of $K$.

Then, when viewed as a real matrix,

$$\|K\|^2_{2,\mathbb{R}^n} = \max_{\substack{x \in \mathbb{R}^n \\ \|x\|_2 = 1}} \|Kx\|^2_2 = \max_{\substack{x \in \mathbb{R}^n \\ \|x\|_2 = 1}} x^T K^T K x = \max_{\substack{x \in \mathbb{R}^n \\ \|x\|_2 = 1}} (Sx)^T D (Sx) = \max_{\substack{y \in \mathbb{R}^n \\ \|y\|_2 = 1}} y^T D y = \max_{\substack{y \in \mathbb{R}^n \\ \|y\|_2 = 1}} \sum_{k=1}^n \sigma_k^2 y_k^2 = \sigma_1^2.$$

Note that $S$, as an orthogonal matrix, is in particular a unitary matrix, i.e. $S^H S = I$. Thus,

$$\|K\|^2_{2,\mathbb{C}^n} = \max_{\substack{x \in \mathbb{C}^n \\ \|x\|_2 = 1}} \|Kx\|^2_2 = \max_{\substack{x \in \mathbb{C}^n \\ \|x\|_2 = 1}} x^H K^H K x = \max_{\substack{x \in \mathbb{C}^n \\ \|x\|_2 = 1}} (Sx)^H D (Sx) = \max_{\substack{y \in \mathbb{C}^n \\ \|y\|_2 = 1}} y^H D y = \max_{\substack{y \in \mathbb{C}^n \\ \|y\|_2 = 1}} \sum_{k=1}^n \sigma_k^2 |y_k|^2 \le \max_{\substack{y \in \mathbb{C}^n \\ \|y\|_2 = 1}} \sum_{k=1}^n \sigma_1^2 |y_k|^2 = \sigma_1^2 = \|K\|^2_{2,\mathbb{R}^n}. \qquad \blacksquare$$
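A quick numerical sanity check of Lemma 1 (our own illustration): no complex unit vector can push $\|Kx\|_2$ above the real spectral norm of a real matrix $K$:

```python
import numpy as np

rng = np.random.default_rng(5)
K = rng.standard_normal((6, 6))

# Spectral norm over R^n: the largest singular value of K.
norm_real = np.linalg.norm(K, 2)

# By Lemma 1, maximizing over complex unit vectors gives the same value;
# in particular, random complex unit vectors never exceed it.
for _ in range(100):
    x = rng.standard_normal(6) + 1j * rng.standard_normal(6)
    x /= np.linalg.norm(x)
    assert np.linalg.norm(K @ x) <= norm_real + 1e-12
```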

A single-channel zero-padded convolution’s spectral norm is upper bounded by the spectral norm of the circulant convolution on an enlarged domain.

Lemma 2.

Let $\theta \in \mathbb{R}^{\mathbf{s}}$ be a filter of odd spatial dimensions, i.e. with $s_l = 2p_l + 1$ for some $p_l \in \mathbb{N}$ for all $l \in \{1, \ldots, d\}$. Let

$$K^{\mathrm{zero}}_{\theta,\mathbf{N}}: \mathbb{R}^{\mathbf{N}} \to \mathbb{R}^{\mathbf{N}}$$

denote the zero-padded single-channel convolution with filter $\theta$ on the domain $\mathbb{R}^{\mathbf{N}}$ and let for $\bar{\mathbf{N}} := \mathbf{N} + 2\mathbf{p}$

$$K^{\mathrm{circ}}_{\theta,\bar{\mathbf{N}}}: \mathbb{R}^{\mathbf{N}+2\mathbf{p}} \to \mathbb{R}^{\mathbf{N}+2\mathbf{p}}$$

denote the respective circulant convolution on the domain $\mathbb{R}^{\bar{\mathbf{N}}}$. Then

$$\|K^{\mathrm{zero}}_{\theta,\mathbf{N}}\|_2 \le \|K^{\mathrm{circ}}_{\theta,\bar{\mathbf{N}}}\|_2.$$
Proof.

Let $P_{\mathrm{zero}}$ and $P_{\mathrm{circ}}$ be the zero-padding respectively circulant padding operation from $\mathbb{R}^{\mathbf{N}}$ to $\mathbb{R}^{\bar{\mathbf{N}}}$. Both can be easily verified to be linear.

Let $K^{\mathrm{valid}}_{\theta,\bar{\mathbf{N}}}: \mathbb{R}^{\bar{\mathbf{N}}} \to \mathbb{R}^{\mathbf{N}}$ be the valid convolution with filter $\theta$ on the domain $\mathbb{R}^{\bar{\mathbf{N}}}$, such that $K^{\mathrm{zero}}_{\theta,\mathbf{N}} = K^{\mathrm{valid}}_{\theta,\bar{\mathbf{N}}} P_{\mathrm{zero}}$.

Note that for any $x \in \mathbb{R}^{\mathbf{N}}$, it holds that

$$\|K^{\mathrm{zero}}_{\theta,\mathbf{N}} x\|_2 = \|K^{\mathrm{valid}}_{\theta,\bar{\mathbf{N}}} P_{\mathrm{zero}} x\|_2 \le \|K^{\mathrm{circ}}_{\theta,\bar{\mathbf{N}}} P_{\mathrm{zero}} x\|_2 \quad \text{and} \quad \|P_{\mathrm{zero}} x\|_2 = \|x\|_2,$$

since at every equation, only zeros are added to the sum-of-squares when calculating the 2-norms. It follows that

$$\|K^{\mathrm{zero}}_{\theta,\mathbf{N}}\|_2 = \max_{\|x\|_2 = 1} \|K^{\mathrm{zero}}_{\theta,\mathbf{N}} x\|_2 \le \max_{\|x\|_2 = 1} \|K^{\mathrm{circ}}_{\theta,\bar{\mathbf{N}}} P_{\mathrm{zero}} x\|_2 \le \|K^{\mathrm{circ}}_{\theta,\bar{\mathbf{N}}}\|_2,$$

finishing our proof. ∎
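Lemma 2 can likewise be checked numerically in 1D by building both convolution matrices explicitly (the helper names below are ours):

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 8, 1                          # signal length, filter radius
theta = rng.standard_normal(2 * p + 1)  # odd filter size 2p + 1

def conv_matrix(size, pad_circular):
    """Matrix of the single-channel cross-correlation with filter theta,
    either with circular wrap-around or with zero padding."""
    K = np.zeros((size, size))
    for n in range(size):
        for k in range(-p, p + 1):
            m = n + k
            if pad_circular:
                K[n, m % size] += theta[k + p]
            elif 0 <= m < size:
                K[n, m] += theta[k + p]   # out-of-range taps hit zeros
    return K

K_zero = conv_matrix(N, pad_circular=False)
K_circ = conv_matrix(N + 2 * p, pad_circular=True)   # enlarged domain

# Lemma 2: the zero-padded norm is bounded by the circulant norm
assert np.linalg.norm(K_zero, 2) <= np.linalg.norm(K_circ, 2) + 1e-12
```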

Theorem 2 (restated).

Proof.

The zero-padded multi-channel depthwise convolution is isometrically isomorphic to a block diagonal matrix $B = \mathrm{diag}(B_1, \ldots, B_M)$, where each block $B_m$ is isometrically isomorphic to a matrix representing the zero-padded single-channel convolution. With Lemma 2 and the fact that the set of singular values of $B$ is the union of all singular values of the blocks $B_m$, the statement follows immediately. ∎
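Combining Lemma 2 with the FFT characterization of circulant singular values yields a cheap upper bound on the spectral norm of a depthwise convolution, in the spirit of Theorem 2. A sketch under our own naming, for 2D filters of odd size:

```python
import numpy as np

def depthwise_spectral_bound(theta, input_shape):
    """Upper bound on the spectral norm of a zero-padded depthwise
    convolution with per-channel filters theta (shape (C, s, s), s odd),
    acting on inputs of spatial shape input_shape.

    Per channel, bound the zero-padded convolution by the circulant
    convolution on the padded domain (Lemma 2), whose singular values
    are the moduli of the filter's FFT; then take the max over channels
    (block-diagonal structure).
    """
    C, s, _ = theta.shape
    p = (s - 1) // 2
    padded = (input_shape[0] + 2 * p, input_shape[1] + 2 * p)
    # fft2 with s=padded zero-pads each filter to the enlarged domain
    sigmas = [np.abs(np.fft.fft2(theta[c], s=padded)).max() for c in range(C)]
    return max(sigmas)

theta = np.random.default_rng(4).standard_normal((3, 3, 3))
print(depthwise_spectral_bound(theta, (8, 8)))
```
Dividing each filter by this bound then enforces a spectral norm of at most one, which is the normalization step the paper builds on.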

Appendix B Experimental details

This section details the hyperparameters, settings and data preprocessing steps used for the experiments in Section 4.

b.1 Lipschitz constant and learning rate

An overview of the training settings for the CIFAR-10 Lipschitz constant and learning rate experiment can be found in Table 3. We trained the networks on an NVIDIA V100 GPU with scaling constants from 1 to 10 and various learning rates, both with soft and hard scaling.

We use the CIFAR-10 dataset Krizhevsky et al. (2009) with the default train/test split and the following data pre-processing steps during training and testing:

1. Training:

• random cropping (with a padding of size 4) to images of size 32×32

• random horizontal flipping

• normalizing per channel with mean and standard deviation

2. Testing:

• normalizing per channel with mean and standard deviation
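The training pipeline above corresponds to a standard torchvision composition. A sketch, assuming the commonly used CIFAR-10 channel statistics (the exact values are not stated here):

```python
from torchvision import transforms

# Commonly used CIFAR-10 per-channel statistics -- an assumption here,
# not values stated in the paper.
mean, std = (0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)

train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),   # random 32x32 crop, padding 4
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean, std),        # per-channel normalization
])

test_tf = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean, std),
])
```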

b.2 Spectral normalization on ImageNet

An overview of the hyperparameters used for the ImageNet classification experiments is shown in Table 4. We trained the networks on two NVIDIA V100 GPUs with scaling constants 10, 12, 15, 20, 30 and 40, each paired with its own learning rate.

We use the ImageNet dataset Russakovsky et al. (2015) with the default train/val/test split and the following data pre-processing during training and validation:

1. Training:

• cropping of random size and resizing to images of size 224×224

• random horizontal flipping

• normalizing per channel with mean and standard deviation

2. Validation:

• resizing the image to size 256×256

• center cropping to size 224×224

• normalizing per channel with mean and standard deviation

b.3 Time complexity

For the time complexity experiments, we trained identical networks with the hyperparameters shown in Table 5 and Table 6 – with and without spectral normalization for CIFAR-10 and ImageNet classification. The train/test splits as well as the pre-processing steps for CIFAR-10 and ImageNet can be found in Subsections B.1 and B.2, respectively.