The remarkable success of deep neural networks at generalizing to natural data is at odds with the traditional notions of model complexity and their empirically demonstrated ability to fit arbitrary random data to perfect accuracy(Zhang et al., 2017a; Arpit et al., 2017). This has prompted recent investigations of possible implicit regularization mechanisms inherent in the learning process which induce a bias towards low complexity solutions (Neyshabur et al., 2014; Soudry et al., 2017; Poggio et al., 2018; Neyshabur et al., 2017).
In this work, we take a slightly shifted view on implicit regularization by suggesting that low-complexity functions are learned faster during training by gradient descent. We expose this bias by taking a closer look at neural networks through the lens of Fourier analysis. While they can approximate arbitrary functions, we find that these networks prioritize learning the low frequency modes, a phenomenon we call the spectral bias
. This bias manifests itself not just in the process of learning, but also in the parameterization of the model itself: in fact, we show that the lower frequency components of trained networks are more robust to random parameter perturbations. Finally, we also expose and analyze the rather intricate interplay between the spectral bias and the geometry of the data manifold by showing that high frequencies get easier to learn when the data lies on a lower-dimensional manifold of complex shape embedded in the input space of the model. We focus the discussion on networks with rectified linear unit (ReLU) activations, whose continuous piece-wise linear structure enables an analytic treatment.
We exploit the continuous piecewise-linear structure of ReLU networks to evaluate its Fourier spectrum (Section 2).
We find empirical evidence of a spectral bias: i.e. lower frequencies are learned first. We also show that lower frequencies are more robust to random perturbations of the network parameters (Section 3).
We study the role of the shape of the data manifold: we show how complex manifold shapes can facilitate the learning of higher frequencies and develop a theoretical understanding of this behavior (Section 4).
2 Fourier analysis of ReLU networks
Throughout the paper we call ‘ReLU network’ a scalar function defined by a neural network with hidden layers of widths
and a single output neuron:
where each is an affine function ( and ) and. In the standard basis, for some weight matrix
and bias vector.
ReLU networks are known to be continuous piece-wise linear (CPWL) functions, where the linear regions are convex polytopes (Raghu et al., 2016; Montufar et al., 2014; Zhang et al., 2018; Arora et al., 2018). Remarkably, the converse also holds: every CPWL function can be represented by a ReLU network (Arora et al., 2018, Theorem 2.1), which in turn endows ReLU networks with universal approximation properties. Given the ReLU network from Eqn. 1, we can make the piecewise linearity explicit by writing,
where is an index for the linear regions and is the indicator function on . As shown in Appendix B in more detail, each region corresponds to an activation pattern222We adopt the terminology of Raghu et al. (2016); Montufar et al. (2014). of all hidden neurons of the network, which is a binary vector with components conditioned on the sign of the input of the respective neuron. The matrix is given by
where is obtained from the original weight by setting its column to zero whenever the neuron of the layer is inactive.
2.2 Fourier Spectrum
In the following, we study the structure of ReLU networks in terms of their Fourier representation, , where
is the Fourier transform333
Note that general ReLU networks need not be squared integrable: for instance, the class of two-layer ReLU networks represent an arrangement of hyperplanes(Montufar et al., 2014) and hence grow linearly as . In such cases, the Fourier transform is to be understood in the sense of tempered distributions acting on rapidly decaying smooth functions as . See Appendix C for a formal treatment. . Lemmas 1 and 2 yield the explicit form of the Fourier components (we refer to Appendix C for the proofs and technical details).
The Fourier transform of ReLU networks decomposes as,
where and is the Fourier transform of the indicator function of .
The Fourier transform of the indicator over linear regions appearing in Eqn. 4 are fairly intricate mathematical objects. Diaz et al. (2016) develop an elegant procedure for evaluating it in arbitrary dimensions via a recursive application of Stokes theorem. We describe this procedure in detail444We also generalize the construction to tempered distributions. in Appendix C.2, and present here its main corollary.
Let be a full dimensional polytope in . Its Fourier spectrum takes the form:
where is the union of -dimensional subspaces that are orthogonal to some -codimensional face of , is in and the indicator over .
The Fourier components of the ReLU network with parameters is given by the rational function:
where is the union of -dimensional subspaces that are orthogonal to some -codimensional faces of some polytope and is .
Note that Eqn 6 applies to general ReLU networks with arbitrary width and depth555Symmetries that might arise due to additional assumptions can be used to further develop Eqn 6, see e.g. Eldan & Shamir (2016) for 2-layer networks..
Discussion. We make the following two observations. First, the spectral decay of ReLU networks is highly anisotropic in large dimensions. In almost all directions of , we have a decay. However, the decay can be as slow as in specific directions orthogonal to the dimensional faces bounding the linear regions666Note that such a rate is not guaranteed by piecewise smoothness alone. For instance, the function is continuous and smooth everywhere except at , yet it decays as in the Fourier domain..
Second, the numerator in Eqn 6 is bounded by (cf. Appendix C.3), where is the number of linear regions and is the Lipschitz constant of the network. Further, the Lipschitz constant can be bounded as (cf. Appendix C.6):
where is the spectral norm and the max norm, and is the number of units in the -th layer. This makes the bound on scale exponentially in depth and polynomial in width. As for the number of linear regions, Montufar et al. (2014) and Raghu et al. (2016) obtain tight bounds that exhibit the same scaling behaviour (Raghu et al., 2016, Theorem 1). In Appendix A.5, we qualitatively ablate over the depth and width of the network to expose how this reflects on the Fourier spectrum of the network.
3 Lower Frequencies are Learned First
We now present experiments showing that networks tend to fit lower frequencies first during training. We refer to this phenomenon as the spectral bias, and discuss it in light of the results of Section 2.
3.1 Synthetic Experiments
The setup is as follows777More experimental details and additional plots are provided in Appendix A.1.: Given frequencies with corresponding amplitudes , and phases , we consider the mapping given by
A 6-layer deep 256-unit wide ReLU network is trained to regress with and input samples spaced equally over ; its spectrum in expectation over is monitored as training progresses. In the first setting, we set equal amplitude for all frequencies and in the second setting, the amplitude increases from to . Figure 1 shows the normalized magnitudes at various frequencies, as training progresses with full-batch gradient descent. Further, Figure 2 shows the learned function at intermediate training iterations. The result is that lower frequencies (i.e. smaller ’s) are regressed first, regardless of their amplitudes.
Our goal here is to illustrate a phenomenon that complements the one highlighted above: lower frequencies are more robust to parameter perturbations. The set up is the same as in Experiment 1. The network is trained to regress a target function with frequencies and amplitudes . After convergence to , we consider random (isotropic) perturbations of given magnitude , where is a random unit vector in parameter space. We evaluate the network function at the perturbed parameters, and compute the magnitude of its discrete Fourier transform at frequencies to obtain . We also average over 100 samples of to obtain , which we normalize by . Finally, we average over the phases (see Eqn 8). The result, shown in Figure 3, demonstrates that higher frequencies are significantly less robust than the lower ones, guiding the intuition that expressing higher frequencies requires the parameters to be finely-tuned to work together. In other words, parameters that contribute towards expressing high-frequency components occupy a small volume in the parameter space. We formalize this in Appendix D.
Discussion . Multiple theoretical aspects may underlie these observations. First, for a fixed architecture, recall that the numerator in Theorem 1 is888The tightness of this bound is verified empirically in appendix A.5. (where is the Lipschitz constant of the function). However, is bounded by the parameter norm, which can only increase gradually during training by gradient descent. This leads to the higher frequencies being learned999This assumes that the Lipschitz constant of the (noisy) target function is larger than that of the network at initialization. late in the optimization process. To confirm that the bound indeed increases as the model fits higher frequencies, we plot in Fig 1 the spectral norm of weights of each layer during training for both cases of constant and increasing amplitudes.
Second (cf. Appendix C.4), the exact form of the Fourier spectrum yields that for a fixed direction , the spectral decay rate of the parameter gradient is at most one exponent of lower than that of . If for a fixed we have where , we obtain for the residual and (continuous) training step :
where we use the fact that is just the learning rate times the parameter gradient of the loss which is independent101010Note however that the loss term might involve a sum or an integral over all frequencies, but the summation is over a different variable. of , and assume that the target function is fixed. Eqn 9 shows that the rate of change of the residual decays with increasing frequency, which is what we find in Experiment 1.
3.2 Real-Data Experiments
While Experiments 1 and 2 establish the spectral bias by explicitly evaluating the Fourier coefficients, doing so becomes prohibitively expensive for larger (e.g. on MNIST). To tackle this, we propose the following set of experiments to measure the effect of spectral bias indirectly on MNIST.
In this experiment, we investigate how the validation performance dependent on the frequency of noise added to the training target. We find that the best validation performance on MNIST is particularly insensitive to the magnitude of high-frequency noise, yet it is adversely affected by low-frequency noise. We consider a target (binary) function defined on the space of MNIST inputs. Samples form a subset of the MNIST dataset comprising samples belonging to two classes. Let be a noise function:
corresponding to a radial wave defined on the -dimensional input space111111The rationale behind using a radial wave is that it induces oscillations (simultaneously) along all spatial directions. Another viable option is to induce oscillations along the principle axes of the data: we have verified that the key trends of interest are preserved.. The final target function is then given by , where is the effective amplitude of the noise. We fit the same network as in Experiment 1 to the target with the MSE loss. In the first set of experiments, we ablate over for a pair of fixed s, while in the second set we ablate over for a pair of fixed s. In Figure 4, we show the respective validation loss curves, where the validation set is obtained by evaluating on a separate subset of the data, i.e. . Figure 11 (in appendix A.3) shows the respective training curves.
Discussion. The profile of the loss curves varies significantly with the frequency of noise added to the target. In Figure 3(a), we see that the validation performance is adversely affected by the amplitude of the low-frequency noise, whereas Figure 3(b) shows that the amplitude of high-frequency noise does not significantly affect the best validation score. This is explained by the fact that the network readily fits the noise signal if it is low frequency, whereas the higher frequency noise is only fit later in the training. In the latter case, the dip in validation score early in the training is when the network has learned the low frequency true target function ; the remainder of the training is spent learning the higher-frequencies in the training target , as we shall see in the next experiment. Figures 3(c) and 3(d) confirm that the dip in validation score exacerbates for increasing frequency of the noise. Further, we observe that for higher frequencies (e.g. ), increasing the amplitude does not significantly degrade the best performance at the dip, confirming that the network is fairly robust to the amplitude of high-frequency noise.
Finally, we note that the dip in validation score was also observed by Arpit et al. (2017) with i.i.d. noise121212 Recall that i.i.d. noise is white-noise, which has a constant Fourier spectrum magnitude in expectation, i.e. it also contains high-frequency components.
Recall that i.i.d. noise is white-noise, which has a constant Fourier spectrum magnitude in expectation, i.e. it also contains high-frequency components.in a classification setting.
To investigate the dip observed in Experiment 3 , we now take a more direct approach by considering a generalized notion of frequency. To that end, we project the network function to the space spanned by the orthonormal eigenfunctions (sorted by decreasing eigenvalues) resemble sinusoids
, we now take a more direct approach by considering a generalized notion of frequency. To that end, we project the network function to the space spanned by the orthonormal eigenfunctionsof the Gaussian RBF kernel (Braun et al., 2006). These eigenfunctions
(sorted by decreasing eigenvalues) resemble sinusoids(Fasshauer, 2011), and the index can be thought of as being a proxy for the frequency, as can be seen from Figure 6 (see Appendix A.4 for additional details and supporting plots). While we will call as the spectrum of the function , it should be understood as , where and on the MNIST samples . This allows us to define a noise function as:
where is the number of available samples and . Like in Experiment 3, the target function is given by , and the same network is trained to regress . Figure 5 shows the (generalized) spectrum and , and that of as training progresses. Figure 13 (in appendix) shows the corresponding dip in validation loss, where the validation set is same as the training set but with true target function instead of the noised target .
Discussion. From Figure 5, we learn that the drop in validation score observed in Figure 4 is exactly when the higher-frequencies of the noise signal are yet to be learned. As the network gradually learns the higher frequency eigenfunctions, the validation loss increases while the training loss continues to decrease. Thus these experiments show that the phenomenon of spectral bias persists on non-synthetic data and in high dimensional input spaces.
4 Not all Manifolds are Learned Equal
In this section, we investigate subtleties that arise when the data lies on a lower dimensional manifold embedded in the higher dimensional input space of the model. We find that the shape of the data-manifold impacts the learnability of high frequencies in a non-trivial way. As we shall see, this is because low frequency functions in the input space may have high frequency components when restricted to lower dimensional manifolds of complex shapes. We demonstrate results in an illustrative minimal setting131313We include additional experiments on MNIST and CIFAR-10 in appendices A.6 and A.7., free from unwanted confounding factors, and present a theoretical analysis of the phenomenon.
Manifold hypothesis. We consider the case where the data lies on a lower dimensional data manifold embedded in input space (Goodfellow et al., 2016), which we assume to be the image of some injective mapping defined on a lower dimensional latent space . Under this hypothesis and in the context of the standard regression problem, a target function defined on the data manifold can be identified with a function defined on the latent space. Regressing is therefore equivalent to finding such that matches
. Further, assuming that the data probability distributionsupported on is induced by
from the uniform distributionin the latent space , the mean square error can be expressed as:
Observe that there is a vast space of degenerate solutions that minimize the mean squared error – namely all functions on that yield the same function when restricted to the data manifold .
Our findings from the previous section suggest that neural networks are biased towards expressing a particular subset of such solutions, namely those that are low frequency. It is also worth noting that there exist methods that restrict the space of solutions: notably adversarial training (Goodfellow et al., 2014) and Mixup (Zhang et al., 2017b).
Experimental set up. The experimental setting is designed to afford control over both the shape of the data manifold and the target function defined on it. We will consider the family of curves in generated by mappings given by
Functions learned by two identical networks (up to initialization) to classify the binarized value of a sine wave of frequencydefined on a
manifold. Both yield close to perfect accuracy for the samples defined on the manifold (scatter plot), yet they differ significantly elsewhere. The shaded regions show the predicted class (Red or Blue) whereas contours show the confidence (absolute value of logits).
Here, defines the data-manifold and corresponds to a flower-shaped curve with petals, or a unit circle when (see e.g. Fig 7). Given a signal defined on the latent space , the task entails learning a network such that matches the signal .
The set-up is similar to that of Experiment 1, and is as defined in Eqn. 8 with frequencies , and amplitudes . The model is trained on the dataset with uniformly spaced samples between and . The spectrum of in expectation over is monitored as training progresses, and the result is shown in Fig 8 for various . Fig 7(e) shows the corresponding mean squared error curves. More experimental details in appendix A.2.
The results demonstrate a clear attenuation of the spectral bias as grows. Moreover, Fig 7(e) suggests that the larger the , the easier the learning task.
Here, we adapt the setting of Experiment 5 to binary classification by simply thresholding the function at to obtain a binary target signal. To simplify visualization, we only use signals with a single frequency mode , such that . We train the same network on the resulting classification task with cross-entropy loss141414 We use Pytorch’s
We use Pytorch’sBCEWithLogitsLoss. Internally, it takes a sigmoid of the network’s output (the logits) before evaluating the cross-entropy. for and . The heatmap in Fig 9 shows the classification accuracy for each pair. Fig 7 shows visualizations of the functions learned by the same network, trained on under identical conditions up to random initialization.
Discussion. These experiments hint towards a rich interaction between the shape of the manifold and the effective difficulty of the learning task. The key mechanism underlying this phenomenon (as we formalize below) is that the relationship between frequency spectrum of the network and that of the fit is mediated by the embedding map . In particular, we argue that a given signal defined on the manifold is easier to fit when the coordinate functions of the manifold embedding itself has high frequency components. Thus, in our experimental setting, the same signal embedded in a flower with more petals can be captured with lower frequencies of the network.
To understand this mathematically, we address the following questions: given a target function , how small can the frequencies of a solution be such that ? And further, how does this relate to the geometry of the data-manifold induced by ? To find out, we write the Fourier transform of the composite function,
The kernel depends on only and elegantly encodes the correspondence between frequencies in input space and frequencies in the latent space . Following a procedure from Bergner et al. , we can further investigate the behaviour of the kernel in the regime where the stationary phase approximation is applicable, i.e. when (cf. section 3.2. of Bergner et al. ). In this regime, the integral is dominated by critical points of its phase, which satisfy
where is the Jacobian matrix of . Non-zero values of the kernel correspond to pairs such that Eqn 15 has a solution. Further, given that the components of (i.e. its coordinate functions) are defined on an interval , one can use their Fourier series representation together with Eqn 15 to obtain a condition on their frequencies (shown in appendix C.7). More precisely, we find that the -th component of the RHS in Eqn 15 is proportional to where is the frequency of the coordinate function . This yields that we can get arbitrarily large frequencies if is large151515Consider that the data-domain is bounded, implying that cannot be arbitrarily scaled. enough for large , even when is fixed.
This is precisely what Experiments 5 and 6 demonstrate in a minimal setting. From Eqn 4, observe that the coordinate functions have a frequency mode at . For increasing , it is apparent that the frequency magnitudes (in the latent space) that can be expressed with the same frequency (in the input space) increases with increasing . This allows the remarkable interpretation that the neural network function can express large frequencies on a manifold () with smaller frequencies w.r.t its input domain (), provided that the coordinate functions of the data manifold embedding itself has high-frequency components.
5 Related Work
A number of works have focused on showing that neural networks are capable of approximating arbitrarily complex functions. Hornik et al. (1989); Cybenko (1989); Leshno et al. (1993) have shown that neural networks can be universal approximators when given sufficient width; more recently, Lu et al. (2017) proved that this property holds also for width-bounded networks. Montufar et al. (2014) showed that the number of linear regions of deep ReLU networks grows polynomially with width and exponentially with depth; Raghu et al. (2016) generalized this result and provided asymptotically tight bounds. There have been various results of the benefits of depth for efficient approximation (Poole et al., 2016; Telgarsky, 2016; Eldan & Shamir, 2016). These analysis on the expressive power of deep neural networks can in part explain why over-parameterized networks can perfectly learn random input-output mappings (Zhang et al., 2017a).
Our work more directly follows the line of research on implicit regularization in neural networks trained by gradient descent (Neyshabur et al., 2014; Soudry et al., 2017; Poggio et al., 2018; Neyshabur et al., 2017). In fact, while our Fourier analysis of deep ReLU networks also reflects the width and depth dependence of their expressivity, we focused on showing a learning bias of these networks towards simple functions with dominant lower frequency components. We view our results as a first step towards formalizing the findings of Arpit et al. (2017), where it is empirically shown that deep networks prioritize learning simple patterns of the data during training.
A few other works studied neural networks through the lens of harmonic analysis. For example, Candès (1999) used the ridgelet transform to build constructive procedures for approximating a given function by neural networks, in the case of oscillatory activation functions. This approach has been recently generalized to unbounded activation functions by Sonoda & Murata (2017). Eldan & Shamir (2016) use insights on the support of the Fourier spectrum of two-layer networks to derive a worse-case depth-separation result. Barron (1993) makes use of Fourier space properties of the target function to derive an architecture-dependent approximation bound. In a concurrent and independent work, Xu et al. (2018) make the same observation that lower frequencies are learned first. The subsequent work by Xu (2018)
proposes a theoretical analysis of the phenomenon in the case of 2-layer networks with sigmoid activation, based on the spectrum of the sigmoid function.
In light of our findings, it is worth comparing the case of neural networks and other popular algorithms such that kernel machines (KM) and -nearest neighbor classifiers. We refer to the Appendix E for a detailed discussion and references. In summary, our discussion there suggests that 1. DNNs strike a good balance between function smoothness and expressivity/parameter-efficiency compared with KM; 2. DNNs learn a smoother function compared with NNs since the spectrum of the DNN decays faster compared with NNs in the experiments shown there.
We studied deep ReLU networks through the lens of Fourier analysis. Several conclusions can be drawn from our analysis. While neural networks can approximate arbitrary functions, we find that they favour low frequency ones – hence they exhibit a bias towards smooth functions – a phenomenon that we called spectral bias. We also illustrated how the geometry of the data manifold impacts expressivity in a non-trivial way, as high frequency functions defined on complex manifolds can be expressed by lower frequency network functions defined in input space.
We view future work that explore the properties of neural networks in Fourier domain as promising. For example, the Fourier transform affords a natural way of measuring how fast a function can change within a small neighborhood in its input domain; as such, it is a strong candidate for quantifying and analyzing the sensitivity of a model – which in turn provides a natural measure of complexity (Novak et al., 2018). We hope to encourage more research in this direction.
The authors would like to thank Joan Bruna, Rémi Le Priol, Vikram Voleti, Ullrich Köthe, Steffen Wolf, Lorenzo Cerrone, Sebastian Damrich, as well as the anonymous reviewers for their valuable feedback.
- Arora et al. (2018) Arora, R., Basu, A., Mianjy, P., and Mukherjee, A. Understanding deep neural networks with rectified linear units. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1J_rgWRW.
- Arpit et al. (2017) Arpit, D., Jastrzebski, S., Ballas, N., Krueger, D., Bengio, E., Kanwal, M. S., Maharaj, T., Fischer, A., Courville, A., Bengio, Y., et al. A closer look at memorization in deep networks. arXiv preprint arXiv:1706.05394, 2017.
- Barron (1993) Barron, A. R. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory, 39(3):930–945, 1993.
- Bengio et al. (2009) Bengio, Y. et al. Learning deep architectures for ai. Foundations and trends® in Machine Learning, 2(1):1–127, 2009.
- (5) Bergner, S., Möller, T., Weiskopf, D., and Muraki, D. J. A spectral analysis of function concatenations and its implications for sampling in direct volume visualization.
Braun et al. (2006)
Braun, M. L., Lange, T., and Buhmann, J. M.
Model selection in kernel methods based on a spectral analysis of
Joint Pattern Recognition Symposium, pp. 344–353. Springer, 2006.
- Candès (1999) Candès, E. J. Harmonic analysis of neural networks. Applied and Computational Harmonic Analysis, 6(2):197–218, 1999.
- Cybenko (1989) Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS), 2(4):303–314, 1989.
- Devroye et al. (1996) Devroye, L., Györfi, L., and Lugosi, G. Consistency of the k-nearest neighbor rule. In A Probabilistic Theory of Pattern Recognition, pp. 169–185. Springer, 1996.
- Diaz et al. (2016) Diaz, R., Le, Q.-N., and Robins, S. Fourier transforms of polytopes, solid angle sums, and discrete volume. arXiv preprint arXiv:1602.08593, 2016.
- Draxler et al. (2018) Draxler, F., Veschgini, K., Salmhofer, M., and Hamprecht, F. A. Essentially no barriers in neural network energy landscape. arXiv preprint arXiv:1803.00885, 2018.
- Eldan & Shamir (2016) Eldan, R. and Shamir, O. The power of depth for feedforward neural networks. In Conference on Learning Theory, pp. 907–940, 2016.
- Fasshauer (2011) Fasshauer, G. E. Positive definite kernels: past, present and future. Dolomite Research Notes on Approximation, 4:21–63, 2011.
- Goodfellow et al. (2016) Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
- Goodfellow et al. (2014) Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
Hammer & Gersmann (2003)
Hammer, B. and Gersmann, K.
A note on the universal approximation capability of support vector machines.Neural Processing Letters, 17(1):43–53, 2003.
- Hornik et al. (1989) Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
- Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Kolsbjerg et al. (2016) Kolsbjerg, E. L., Groves, M. N., and Hammer, B. An automated nudged elastic band method. The Journal of chemical physics, 145(9):094107, 2016.
- Leshno et al. (1993) Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural networks, 6(6):861–867, 1993.
- Lu et al. (2017) Lu, Z., Pu, H., Wang, F., Hu, Z., and Wang, L. The expressive power of neural networks: A view from the width. In Advances in Neural Information Processing Systems, pp. 6231–6239, 2017.
- Ma & Belkin (2017) Ma, S. and Belkin, M. Diving into the shallows: a computational perspective on large-scale shallow learning. In Advances in Neural Information Processing Systems, pp. 3781–3790, 2017.
- Miyato et al. (2018) Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1QRgziT-.
- Montufar et al. (2014) Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. On the number of linear regions of deep neural networks. In Advances in neural information processing systems, pp. 2924–2932, 2014.
- Neyshabur et al. (2014) Neyshabur, B., Tomioka, R., and Srebro, N. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014.
- Neyshabur et al. (2017) Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pp. 5949–5958, 2017.
- Novak et al. (2018) Novak, R., Bahri, Y., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. Sensitivity and generalization in neural networks: an empirical study. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HJC2SzZCW.
- Poggio et al. (2018) Poggio, T., Kawaguchi, K., Liao, Q., Miranda, B., Rosasco, L., Boix, X., Hidary, J., and Mhaskar, H. Theory of deep learning iii: the non-overfitting puzzle. Technical report, Technical report, CBMM memo 073, 2018.
- Poole et al. (2016) Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., and Ganguli, S. Exponential expressivity in deep neural networks through transient chaos. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 3360–3368. Curran Associates, Inc., 2016.
- Raghu et al. (2016) Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., and Sohl-Dickstein, J. On the expressive power of deep neural networks. arXiv preprint arXiv:1606.05336, 2016.
- Rasmussen (2004) Rasmussen, C. E. Gaussian processes in machine learning. In Advanced lectures on machine learning, pp. 63–71. Springer, 2004.
- Serov (2017) Serov, V. Fourier series, Fourier transform and their applications to mathematical physics. Springer, 2017.
- Sonoda & Murata (2017) Sonoda, S. and Murata, N. Neural network with unbounded activation functions is universal approximator. Applied and Computational Harmonic Analysis, 43(2):233–268, 2017.
- Soudry et al. (2017) Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., and Srebro, N. The implicit bias of gradient descent on separable data. arXiv preprint arXiv:1710.10345, 2017.
- Spivak (2018) Spivak, M. Calculus On Manifolds: A Modern Approach To Classical Theorems Of Advanced Calculus. CRC press, 2018.
- Telgarsky (2016) Telgarsky, M. Benefits of depth in neural networks. Conference on Learning Theory (COLT), 2016, 2016.
- Xu (2018) Xu, Z. J. Understanding training and generalization in deep learning by fourier analysis. arXiv preprint arXiv:1808.04295, 2018.
- Xu et al. (2018) Xu, Z.-Q. J., Zhang, Y., and Xiao, Y. Training behavior of deep neural network in frequency domain. arXiv preprint arXiv:1807.01251, 2018.
- Zhang et al. (2017a) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. International Conference on Learning Representations (ICLR), 2017a.
- Zhang et al. (2017b) Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017b.
- Zhang et al. (2018) Zhang, L., Naitzat, G., and Lim, L.-H. Tropical geometry of deep neural networks. arXiv preprint arXiv:1805.07091, 2018.
Appendix A Experimental Details
a.1 Experiment 1
We fit a 6 layer ReLU network with 256 units per layer to the target function , which is a superposition of sine waves with increasing frequencies:
where , and is sampled from the uniform distribution . In the first setting, we set equal amplitude for all frequencies, i.e. , while in the second setting we assign larger amplitudes to the higher frequencies, i.e. . We sample on 200 uniformly spaced points in and train the network for steps of full-batch gradient descent with Adam (Kingma & Ba, 2014)
. Note that we do not use stochastic gradient descent to avoid the stochasticity in parameter updates as a confounding factor. We evaluate the network on the same 200 point grid every 100 training steps and compute the magnitude of its (single-sided) discrete fourier transform at frequencieswhich we denote with . Finally, we plot in figure 1 the normalized magnitudes averaged over 10 runs (with different sets of sampled phases ). We also record the spectral norms of the weights at each layer as the training progresses, which we plot in figure 1 for both settings (the spectral norm is evaluated with 10 power iterations). In figure 2, we show an example target function and the predictions of the network trained on it (over the iterations), and in figure 10 we plot the loss curves.
a.2 Experiment 5
We use the same 6-layer deep 256-unit wide network and define the target function
where , and . We sample on a grid with 1000 uniformly spaced points between 0 and 1 and map it to the input domain via to obtain a dataset , on which we train the network with 50000 full-batch gradient descent steps of Adam. On the same 1000-point grid, we evaluate the magnitude of the (single-sided) discrete Fourier transform of every 100 training steps at frequencies and average over 10 runs (each with a different set of sampled ’s). Fig 8 shows the evolution of the spectrum as training progresses for , and Fig 7(e) shows the corresponding loss curves.
a.3 Experiment 3
a.4 Experiment 4
Consider the Gaussian Radial Basis Kernel, given by:
where is a compact subset of and is defined as the width of the kernel161616We drop the subscript to simplify the notation.. Since is positive definite (Fasshauer, 2011), Mercer’s Theorem can be invoked to express it as:
where is the eigenfunction of satisfying:
Due to positive definiteness of the kernel, the eigenvalues are non-negative and the eigenfunctions form an orthogonal basis of , i.e. . The analogy to the final case is easily seen: let be the set of samples, a function. One obtains (cf. Chapter 4 (Rasmussen, 2004)):
where . Now, defining as the positive definite kernel matrix with elements , we consider it’s eigendecomposition where is the diagonal matrix of (w.l.o.g sorted) eigenvalues and the columns of are the corresponding eigenvectors. This yields:
Like in (Braun et al., 2006), we define the spectrum of the function as:
where . The value can be thought of a generalized notion of frequency. Indeed, it is known (Fasshauer, 2011; Rasmussen, 2004), for instance, that the eigenfunctions resemble sinusoids with increasing frequencies (for increasing or decreasing ). In Figure 6, we plot the eigenvectors and for uniformly spaced between . Further, in Figure ? we evaluate the discrete Fourier transform of all eigenvectors, and find that the eigenfunction index does indeed coincide with frequency . Finally, we remark that the link between signal complexity and the spectrum is extensively studied in (Braun et al., 2006).
a.4.1 Loss Curves Accompanying Figure 5
a.5 Qualitative Ablation over Architectures
Theorem 1 exposes the relationship between the fourier spectrum of a network and its depth, width and max-norm of parameters. The following experiment is a qualitative ablation study over these variables.
In this experiment, we fit various networks to the -function at (see Fig 13(a)). Its spectrum is constant for all frequencies (Fig 13(b)), which makes it particularly useful for testing how well a given network can fit large frequencies. Fig 17 shows the ablation over weight clip (i.e. max parameter max-norm), Fig 15 over depth and Fig 16 over width. Fig 18 exemplarily shows how the network prediction evolves with training iterations. All networks are trained for 60K iterations of full-batch gradient descent under identical conditions (Adam optimizer with , no weight decay).
We make the following observations.
Fig 15 shows that increasing the depth (for fixed width) significantly improves the network’s ability to fit higher frequencies (note that the depth increases linearly).
Fig 16 shows that increasing the width (for fixed depth) also helps, but the effect is considerably weaker (note that the width increases exponentially).
Fig 17 shows that increasing the weight clip (or the max parameter max-norm) also helps the network fit higher frequencies.
The above observations are all consistent with Theorem 1, and further show that lower frequencies are learned first (i.e. the spectral bias, cf. Experiment 1). Further, Figure 17 shows that constraining the Lipschitz constant (weight clip) prevents the network from learning higher frequencies, furnishing evidence that the bound can be tight.
a.6 MNIST: A Proof of Concept
In the following experiment, we show that given two manifolds of the same dimension – one flat and the other not – the task of learning random labels is harder to solve if the input samples lie on the same manifold. We demonstrate on MNIST under the assumption that the manifold hypothesis is true, and use the fact that the spectrum of the target function we use (white noise) is constant in expectation, and therefore independent of the underlying coordinate system when defined on the manifold.
In this experiment, we investigate if it is easier to learn a signal on a more realistic data-manifold like that of MNIST (assuming the manifold hypothesis is true), and compare with a flat manifold of the same dimension. To that end, we use the -dimensional feature-space of a denoising171717This experiment yields the same result if variational autoencoders are used instead.autoencoder as a proxy for the real data-manifold of unknown number of dimensions. The decoder functions as an embedding of in the input space , which effectively amounts to training a network on the reconstructions of the autoencoder. For comparision, we use an injective embedding181818The xy-plane is an injective embedding of a subset of in . of a 64-dimensional hyperplane in . The latter is equivalent to sampling -dimensional vectors from and setting all but the first 64 components to zero. The target function is white-noise, sampled as scalars from the uniform distribution . Two identical networks are trained under identical conditions, and Fig 19 shows the resulting loss curves, each averaged over 10 runs.
This result complements the findings of (Arpit et al., 2017) and (Zhang et al., 2017a), which show that it’s easier to fit random labels to random inputs if the latter is defined on the full dimensional input space (i.e. the dimension of the flat manifold is the same as that of the input space, and not that of the underlying data-manifold being used for comparison).
a.7 Cifar-10: It’s All Connected
We have seen that deep neural networks are biased towards learning low frequency functions. This should have as a consequence that isolated bubbles of constant prediction are rare. This in turn implies that given any two points in the input space and a network function that predicts the same class for the said points, there should be a path connecting them such that the network prediction does not change along the path. In the following, we present an experiment where we use a path finding method to find such a path between all Cifar-10 input samples indeed exist.
Using AutoNEB (Kolsbjerg et al., 2016), we construct paths between (adversarial) Cifar-10 images that are classified by a ResNet20 to be all of the same target class. AutoNEB bends a linear path between points in some space so that some maximum energy along the path is minimal. Here, the space is the input space of the neural network, i.e. the space of images and the logit output of the ResNet20 for a given class is minimized. We construct paths between the following points in image space:
From one training image to another,
from a training image to an adversarial,
from one adversarial to another.
We only consider pairs of images that belong to the same class (or, for adversarials, that originate from another class , but that the model classifies to be of the specified class ). For each class, we randomly select 50 training images and select a total of 50 random images from all other classes and generate adversarial samples from the latter. Then, paths between all pairs from the whole set of images are computed.
The AutoNEB parameters are chosen as follows: We run four NEB iterations with 10 steps of SGD with learning rate and momentum . This computational budget is similar to that required to compute the adversarial samples. The gradient for each NEB step is computed to maximize the logit output of the ResNet-20 for the specified target class . We use the formulation of NEB without springs (Draxler et al., 2018).
The result is very clear: We can find paths between all pairs of images for all CIFAR10 labels that do not cross a single decision boundary.
This means that all paths belong to the same connected component regarding the output of the DNN. This holds for all possible combinations of images in the above list. Figure 21 shows connecting training to adversarial images and Figure 20 paths between pairs of adversarial images. Paths between training images are not shown, they provide no further insight. Note that the paths are strikingly simple: Visually, they are hard to distinguish from the linear interpolation. Quantitatively, they are essentially (but not exactly) linear, with an average length
paths between pairs of adversarial images. Paths between training images are not shown, they provide no further insight. Note that the paths are strikingly simple: Visually, they are hard to distinguish from the linear interpolation. Quantitatively, they are essentially (but not exactly) linear, with an average lengthlonger than the linear connection.
Appendix B The Continuous Piecewise Linear Structure of Deep ReLU Networks
We consider the class of ReLU network functions defined by Eqn. 1. Following the terminology of (Raghu et al., 2016; Montufar et al., 2014), each linear region of the network then corresponds to a unique activation pattern, wherein each hidden neuron is assigned an activation variable , conditioned on whether its input is positive or negative. ReLU networks can be explictly expressed as a sum over all possible activation patterns, as in the following lemma.
Given binary vectors with , let the affine function defined by if , and otherwise. ReLU network functions, as defined in Eqn. 1, can be expressed as
where denotes the indicator function of the subset , and is the polytope defined as the set of solutions of the following linear inequalities (for all ):
is therefore affine on each of the polytopes , which finitely partition the input space to convex polytopes. Remarkably, the correspondence between ReLU networks and CPWL functions goes both ways: Arora et al. (2018) show that every CPWL function can be represented by a ReLU network, which in turn endows ReLU networks with the universal approximation property.
Finally, in the standard basis, each affine map is specified by a weight matrix and a bias vector . In the linear region , can be expressed as , where in particular
where is obtained from by setting its th column to zero whenever .
Appendix C Fourier Analysis of ReLU Networks
c.1 Proof of Lemma 1
Case 1: The function has compact support. The vector-valued function is continuous everywhere and has well-defined and continuous gradients almost everywhere. So by Stokes’ theorem (see e.g Spivak (2018)), the integral of its divergence is a pure boundary term. Since we restricted to functions with compact support, the theorem yields
The integrand is , so we deduce,
Case 2: The function does not have compact support. Without the assumption of compact support, the function is not squared-integrable. The Fourier transform therefore only exists in the sense of distributions, as defined below.
Let be the Schwartz space over of rapidly decaying test functions which together with its derivatives decay to zero as faster than any power of . A tempered distribution is a continuous linear functional on . A function that doesn’t grow faster than a polynomial at infinity can be identified with a tempered distribution as:
In the following, we shall identify with . The Fourier transform of the tempered distribution is defined as:
where is the Fourier transform of . In this sense, the standard notion of the Fourier transform is generalized to functions that are not squared-integrable.
Consider the continuous piecewise-linear ReLU network . Since it can grow at most linearly, we interpret it as a tempered distribution on . Recall that the linear regions are enumerated by . Let be the restriction of to , making . The distributional derivative of is given by:
where is the indicator over and we used . It then follows from elementary properties of Schwartz spaces (see e.g. Chapter 16 of Serov (2017)) that:
c.2 Fourier Transform of Polytopes
c.2.1 Theorem 1 of Diaz et al. (2016)
Let be a dimensional polytope in , such that . Denote by a vector in the Fourier space, by the linear phase function, by the Fourier transform of the indicator function on , by the boundary of and by the -dimensional (Hausdorff) measure. Let be the orthogonal projection of on to (obtained by removing all components of orthogonal to ). Given a dimensional facet of , let be the unit normal vector to that points out of . It then holds:
1. If , then is constant on , and we have:
2. But if , then:
The above theorem provides a recursive relation for computing the Fourier transform of an arbitrary polytope. More precisely, the Fourier transform of a -dimensional polytope is expressed as a sum of fourier transforms over the dimensional boundaries of the said polytope (which are themselves polytopes) times a weight term (with ). The recursion terminates if , which then yields a constant.
To structure this computation, Diaz et al. (2016) introduce a book-keeping device called the face poset of the polytope. It can be understood as a weighted directed acyclic graph (DAG) with polytopes of various dimensions as its nodes. We start at the root node which is the full dimensional polytope (i.e. we initially set ). For all of the codimension-one boundary faces of , we then draw an edge from the root to node and weight it with a term given by:
and repeat the process iteratively for each . Note that the weight term is where . This process yields tree paths where each has one dimension less than . For a given path and , the terminal node for this path, , is the first polytope for which . The final Fourier transform is obtained by multiplying the weights along each path and summing over all tree paths:
where for an arbitrary point in .
To write this as a weighted sum of indicator functions, as in Lemma 2, let denote the set of all tree paths of length , i.e. . For a tree path , let be the orthogonal to the terminal node , i.e the vectors such that . The sum over in Eqn (35) can be split as:
where and . In words, is the set of all vectors that are orthogonal to some -codimensional face of the polytope. We identify: