Neural networks have recently been applied to a number of diverse problems with impressive results (van den Oord et al., 2016; Silver et al., 2017; Berthelot et al., 2017). These breakthroughs largely appear to be driven by application rather than an understanding of the capabilities and training of neural networks. Recently, significant work has been done to increase understanding of neural networks (Choromanska et al., 2015; Haeffele & Vidal, 2015; Poole et al., 2016; Schoenholz et al., 2017; Zhang et al., 2016; Martin & Mahoney, 2017; Shwartz-Ziv & Tishby, 2017; Balduzzi et al., 2017; Raghu et al., 2017). However, there is still work to be done to bring theoretical understanding in line with the results seen in practice.
The connection between neural networks and kernel machines has long been studied (Neal, 1994). Much past work has investigated the equivalent kernel of certain neural networks, either experimentally (Burgess, 1997), through sampling (Sinha & Duchi, 2016; Livni et al., 2017; Lee et al., 2017), or analytically by assuming some random distribution over the weight parameters in the network (Williams, 1997; Cho & Saul, 2009; Pandey & Dukkipati, 2014a;b; Daniely et al., 2016; Bach, 2017a). One motivation for studying networks with random weights dates back to Neal (1992), which laid a strong mathematical foundation for a Bayesian approach to training networks. Another reason may be that some researchers hold the intuitive (but not necessarily principled) view that the Central Limit Theorem (CLT) should somehow apply.
In this work, we investigate the equivalent kernels for networks with Rectified Linear Unit (ReLU), Leaky ReLU (LReLU), or other activation functions, one hidden layer, and more general weight distributions. Our analysis carries over to deep networks. We investigate the consequences that weight initialization has on the equivalent kernel at the beginning of training. While initialization schemes that mitigate exploding/vanishing gradient problems (Hochreiter, 1991; Bengio et al., 1994; Hochreiter et al., 2001) for other activation function and weight distribution combinations have been explored in earlier works (Glorot & Bengio, 2010; He et al., 2015), we discuss an initialization scheme for Multi-Layer Perceptrons (MLPs) with LReLUs and weights coming from distributions with zero mean and finite absolute third moment. The derived kernels also allow us to analyze the loss of information as an input is propagated through the network, offering a complementary view to the shattered gradient problem (Balduzzi et al., 2017).
Consider a fully connected (FC) feedforward neural network with $d$ inputs and a hidden layer with $n$ neurons. Let $\sigma$ be the activation function of all the neurons in the hidden layer. Further assume that the biases are $0$, as is common when initializing neural network parameters. For any two inputs $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$ propagated through the network, the dot product in the hidden layer is

$\mathbf{h}(\mathbf{x}) \cdot \mathbf{h}(\mathbf{y}) = \sum_{i=1}^{n} \sigma(\mathbf{w}_i \cdot \mathbf{x})\, \sigma(\mathbf{w}_i \cdot \mathbf{y}), \qquad (1)$

where $\mathbf{h}(\mathbf{x})$ denotes the $n$-dimensional vector of hidden-layer activations and $\mathbf{w}_i$ is the weight vector into the $i^{th}$ neuron. Assuming an infinite number of hidden neurons, the sum in (1), suitably scaled by $1/n$, has an interpretation as an inner product in feature space, which corresponds to the kernel of a Hilbert space. We have

$k(\mathbf{x}, \mathbf{y}) = \int \sigma(\mathbf{w} \cdot \mathbf{x})\, \sigma(\mathbf{w} \cdot \mathbf{y})\, f(\mathbf{w})\, d\mathbf{w}, \qquad (2)$

where $f(\mathbf{w})$ is the probability density function (PDF) for the identically distributed weight vectors $\mathbf{w}_i$ in the network. The connection of (2) to the kernels in kernel machines is well-known (Neal, 1994; Williams, 1997; Cho & Saul, 2009).
Probabilistic bounds for the error between (1) and (2) have been derived in special cases (Rahimi & Recht, 2008) when the kernel is shift-invariant. Two specific random feature mappings are considered. (1) Random Fourier features are taken for $\sigma$ in (1). Calculating the approximation error in this way requires being able to sample from the PDF defined by the Fourier transform of the target kernel: the weight distribution $f(\mathbf{w})$ is the Fourier transform of the target kernel, and the features are some appropriate scale of $\cos(\mathbf{w} \cdot \mathbf{x} + b)$. (2) A random bit string is associated to each input according to a grid with random pitch imposed on the input space. This method requires having access to the second derivative of the target kernel to sample from the corresponding pitch distribution.
Other work (Bach, 2017b) has focused on the smallest error between a target function $g$ in the reproducing kernel Hilbert space (RKHS) defined by (2) and an approximate function $\hat{g}$ expressible by the RKHS with the kernel (1). More explicitly, let $\hat{g} = \sum_{i=1}^{n} c_i\, \sigma(\mathbf{w}_i \cdot \mathbf{x})$ be the representation of the approximation in the RKHS. The quantity $\|g - \hat{g}\|$ (with some suitable norm) is studied for the best set of coefficients $c_i$ and random weights $\mathbf{w}_i$ with an optimized distribution.
Yet another measure of kernel approximation error is investigated by Rudi & Rosasco (2017). Let $\hat{f}$ and $f^{*}$ be the optimal solutions to the ridge regression problem of minimizing a regularized cost function using the kernel (1) and the kernel (2), respectively. The number of random features required to probabilistically bound the excess risk of $\hat{f}$ relative to $f^{*}$ is found to be $O(\sqrt{N} \log N)$ for $N$ datapoints, under a suitable set of assumptions. This work notes the connection between kernel machines and one-layer Neural Networks with ReLU activations and Gaussian weights by citing Cho & Saul (2009). We extend this connection by considering other weight distributions and activation functions.
In this work our focus is on deriving expressions for the target kernel, not the approximation error. Additionally, we consider random mappings that have not been considered elsewhere. Our work is related to that of Poole et al. (2016) and Schoenholz et al. (2017). However, our results apply to the unbounded (L)ReLU activation function and more general weight distributions, whereas their work considers random biases as well as weights.
3 Equivalent Kernels for Infinite Width Hidden Layers
In particular, the equivalent kernel for a one-hidden layer network with spherical Gaussian weights of variance $\sigma_w^2$ and mean $0$ is the Arc-Cosine Kernel (Cho & Saul, 2009)

$k(\mathbf{x}, \mathbf{y}) = \frac{\sigma_w^2}{2\pi} \|\mathbf{x}\| \|\mathbf{y}\| \big( \sin\theta_0 + (\pi - \theta_0)\cos\theta_0 \big), \qquad (3)$
where $\theta_0$ is the angle between the inputs $\mathbf{x}$ and $\mathbf{y}$ and $\|\cdot\|$ denotes the Euclidean norm. Noticing that the Arc-Cosine Kernel depends on $\mathbf{x}$ and $\mathbf{y}$ only through their norms and the angle $\theta_0$, with an abuse of notation we will henceforth write $k(\theta_0)$. Define the normalized kernel

$\cos\theta_1 = \frac{k(\mathbf{x}, \mathbf{y})}{\sqrt{k(\mathbf{x}, \mathbf{x})}\sqrt{k(\mathbf{y}, \mathbf{y})}}$

to be the cosine similarity between the signals in the hidden layer. The normalized Arc-Cosine Kernel is given by

$\cos\theta_1 = \frac{1}{\pi} \big( \sin\theta_0 + (\pi - \theta_0)\cos\theta_0 \big),$

where $\theta_1$ is the angle between the signals in the first layer. Figure 1 shows a plot of the normalized Arc-Cosine Kernel.
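As a quick numerical sanity check (a sketch of ours, not from the paper; the function names are illustrative), the closed-form Arc-Cosine Kernel can be compared against the finite-width inner product (1) with ReLU activations and spherical Gaussian weights:

```python
import numpy as np

def arc_cosine_kernel(x, y, sigma_w=1.0):
    """Closed-form equivalent kernel (3) for ReLU with spherical Gaussian weights."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    theta0 = np.arccos(np.clip(x @ y / (nx * ny), -1.0, 1.0))
    return sigma_w**2 / (2 * np.pi) * nx * ny * (
        np.sin(theta0) + (np.pi - theta0) * np.cos(theta0))

def empirical_kernel(x, y, n_hidden=200_000, sigma_w=1.0, seed=0):
    """Finite-width Monte Carlo estimate (1/n) h(x).h(y) of the sum in (1)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, sigma_w, size=(n_hidden, x.size))
    relu = lambda t: np.maximum(t, 0.0)
    return relu(W @ x) @ relu(W @ y) / n_hidden

x = np.array([1.0, 0.5, -0.2])
y = np.array([0.3, -1.0, 0.8])
print(arc_cosine_kernel(x, y), empirical_kernel(x, y))  # the two agree closely
```

With 200,000 hidden units, the Monte Carlo estimate typically matches the closed form to two or three decimal places.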
One might ask how the equivalent kernel changes for a different choice of weight distribution. We investigate the equivalent kernel for networks with (L)ReLU activations and general weight distributions in Sections 3.1 and 3.2. The equivalent kernel can be composed and applied to deep networks. The kernel can also be used to choose good weights for initialization. These, as well as other implications for practical neural networks, are investigated in Section 5.
3.1 Kernels for Rotationally-Invariant Weights
In this section we show that (3) holds more generally than for the case where the weight distribution is Gaussian. Specifically, (3) holds whenever the weights follow any rotationally-invariant distribution. We do this by casting (2) as the solution to an ODE and then solving the ODE. We then extend this result, using the same technique, to the case where the activation is LReLU.
A rotationally-invariant PDF $f$ is one with the property $f(\mathbf{w}) = f(\mathbf{Q}\mathbf{w})$ for all $\mathbf{w} \in \mathbb{R}^d$ and orthogonal matrices $\mathbf{Q}$. Recall that the class of rotationally-invariant distributions (Bryc, 1995), as a subclass of elliptically contoured distributions (Johnson, 2013), includes the Gaussian distribution, the multivariate t-distribution, the symmetric multivariate Laplace distribution, and symmetric multivariate stable distributions.
Suppose we have a one-hidden layer feedforward network with ReLU activations and random weights with uncorrelated and identically distributed rows drawn from a rotationally-invariant PDF with finite marginal variance $\sigma_w^2$. The equivalent kernel of the network is (3).
First, we require the following.
With the conditions in Proposition 1 and inputs $\mathbf{x}, \mathbf{y}$, the equivalent kernel of the network is the solution to the Initial Value Problem (IVP)

$k''(\theta_0) + k(\theta_0) = F(\theta_0), \qquad k(\pi) = 0, \qquad k'(\pi) = 0, \qquad (4)$

where $\theta_0 \in [0, \pi]$ is the angle between the inputs $\mathbf{x}$ and $\mathbf{y}$. The derivatives are meant in the distributional sense; they are functionals applying to all test functions in $\mathcal{D}(0, \pi)$. The forcing term $F(\theta_0)$ is given by the $d$-dimensional integral

$F(\theta_0) = \|\mathbf{x}\| \|\mathbf{y}\| \int \Theta(w_1)\, \delta(w_1 \cos\theta_0 + w_2 \sin\theta_0)\, (w_2 \cos\theta_0 - w_1 \sin\theta_0)^2\, w_1\, f(\mathbf{w})\, d\mathbf{w}, \qquad (5)$

where $\Theta$ is the Heaviside step function and $\delta$ is the Dirac delta.
Now differentiating twice with respect to $\theta_0$ yields the second order ODE (4). The usefulness of the ODE in its current form is limited, since the forcing term $F$ as in (5) is difficult to interpret. However, regardless of the underlying distribution on the weights, as long as the PDF in (5) corresponds to any rotationally-invariant distribution, the integral enjoys a much simpler representation.
The proof is given in Appendix B.
Note that in the simpler representation of the forcing term, $F(\theta_0) = \frac{\sigma_w^2}{\pi}\|\mathbf{x}\|\|\mathbf{y}\|\sin\theta_0$, the underlying distribution appears only through the constant $\sigma_w^2$. For all rotationally-invariant distributions, the forcing term in (4) therefore results in an equivalent kernel of the same form. We can combine Propositions 2 and 3 to find the equivalent kernel for all rotationally-invariant weight distributions.
One can apply the same technique to the case of LReLU activations $\sigma(u) = \max(u, au)$, where $a$ specifies the gradient of the activation for $u < 0$.
This is just a slightly more involved calculation than the ReLU case; we defer our proof to the supplementary material.
3.2 Asymptotic Kernels
In this section we approximate the kernel for large $d$ and more general weight PDFs. We invoke the CLT as $d \to \infty$, which requires a condition that we discuss briefly before presenting it formally. The dot product $\mathbf{w} \cdot \mathbf{x}$ can be seen as a linear combination of the weights, with the coefficients corresponding to the coordinates of $\mathbf{x}$. Roughly, such a linear combination will obey the CLT if many coefficients are non-zero. To let $d \to \infty$, we construct a sequence of inputs of increasing dimension. This may appear unusual in the context of neural networks, since $d$ is fixed and finite in practice. The sequence is used only for asymptotic analysis.
As an example, if the dataset were CelebA (Liu et al., 2015), the inputs would be fixed-size images. To generate an artificial sequence, one could down-sample each image to successively smaller sizes. At each point in the sequence, one could normalize the point so that its norm stays fixed. One could similarly up-sample the image.
Intuitively, if the up-sampled image does not just insert zeros, then as $d$ increases we expect the ratio $\max_i |x_i| / \|\mathbf{x}\|$ to decrease, because the denominator stays fixed and the largest coordinate gets smaller. In our proof, the application of the CLT requires this ratio to decrease sufficiently fast in $d$. Hypothesis 5 states this condition precisely.
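This intuition is easy to probe numerically (an illustrative sketch with a synthetic smooth signal, not data from the paper): as a fixed smooth signal is sampled at higher resolution $d$, the ratio $\max_i |x_i| / \|\mathbf{x}\|$ decays roughly like $d^{-1/2}$:

```python
import numpy as np

ratios = {}
for d in [16, 64, 256, 1024]:
    t = np.linspace(0.0, 1.0, d)
    x = np.sin(2 * np.pi * t) + 0.5 * np.cos(6 * np.pi * t)  # smooth "image" at resolution d
    x = x / np.linalg.norm(x)            # fix the norm, as in the constructed sequence
    ratios[d] = np.abs(x).max()          # ratio max_i |x_i| / ||x||
    print(d, ratios[d])
```

Each fourfold increase in resolution roughly halves the ratio, consistent with the condition needed for the CLT to take hold.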
For $d \in \mathbb{N}$, define sequences of inputs $\mathbf{x}^{(d)}$ and $\mathbf{y}^{(d)}$ with fixed norms $\|\mathbf{x}^{(d)}\|$, $\|\mathbf{y}^{(d)}\|$ and fixed angle $\theta_0$ for all $d$. Letting $x_i^{(d)}$ be the $i^{th}$ coordinate of $\mathbf{x}^{(d)}$, assume that $\max_i |x_i^{(d)}| / \|\mathbf{x}^{(d)}\|$ and $\max_i |y_i^{(d)}| / \|\mathbf{y}^{(d)}\|$ both decay to $0$ sufficiently fast as $d \to \infty$.
Figure 2 plots this ratio against dimensionality for two datasets, suggesting that Hypothesis 5 makes reasonable assumptions on high dimensional data such as images and audio. (Left) CelebA (Liu et al., 2015) images, compressed using bicubic interpolation. (Right) CHiME3_embedded_et05_real live speech data from The 4th CHiME Speech Separation and Recognition Challenge (Vincent et al., 2017; Barker et al., 2017), with each clip trimmed to a fixed length and compression achieved through subsampling by integer factors.
Consider an infinitely wide FC layer with almost everywhere continuous activation functions $\sigma$. Suppose the random weights $\mathbf{w}$ come from an IID distribution with PDF $f$ such that $\mathbb{E}[w_i] = 0$ and $\mathbb{E}\big[|w_i|^3\big] < \infty$. Suppose that the conditions in Hypothesis 5 are satisfied. Then

$\big( \mathbf{w} \cdot \mathbf{x}^{(d)},\ \mathbf{w} \cdot \mathbf{y}^{(d)} \big) \xrightarrow{\,D\,} \mathbf{Z}, \qquad d \to \infty,$

where $\xrightarrow{\,D\,}$ denotes convergence in distribution and $\mathbf{Z}$ is a Gaussian random vector with zero mean and the same covariance matrix as $\big( \mathbf{w} \cdot \mathbf{x}^{(d)},\ \mathbf{w} \cdot \mathbf{y}^{(d)} \big)$.
Convergence in distribution is a weak form of convergence, so we cannot expect in general that all kernels should converge asymptotically. For some special cases however, this is indeed possible to show. We first present the ReLU case.
Let $\mathbf{w}$, $f$, $\mathbf{x}^{(d)}$, and $\mathbf{y}^{(d)}$ be as defined in Theorem 6, and define the corresponding kernel to be $k^{(d)}$. Consider a second infinitely wide FC layer with the same inputs, whose random weights come from a spherical Gaussian with mean $0$ and the same finite variance, with PDF $g$; define the corresponding kernel to be $k_g^{(d)}$. Suppose that the conditions in Hypothesis 5 are satisfied and the activation functions are ReLU. Then for all $\theta_0$,

$\lim_{d \to \infty} \Big( k^{(d)}\big(\mathbf{x}^{(d)}, \mathbf{y}^{(d)}\big) - k_g^{(d)}\big(\mathbf{x}^{(d)}, \mathbf{y}^{(d)}\big) \Big) = 0.$
4 Empirical Verification of Results
We empirically verify our results using two families of weight distributions. First, consider the $d$-dimensional t-distribution

$f(\mathbf{w}) = \frac{\Gamma\big( (\nu + d)/2 \big)}{\Gamma(\nu/2)\, (\nu\pi)^{d/2}} \Big( 1 + \frac{\|\mathbf{w}\|^2}{\nu} \Big)^{-(\nu + d)/2} \qquad (6)$

with $\nu$ degrees of freedom and identity shape matrix. The multivariate t-distribution approaches the multivariate Gaussian as $\nu \to \infty$. Random variables drawn from the multivariate t-distribution are uncorrelated but not independent. This distribution is rotationally-invariant and satisfies the conditions in Propositions 1 and 4.
Second, consider the multivariate distribution with IID coordinates

$f(\mathbf{w}) \propto \prod_{i=1}^{d} \exp\big( -|w_i|^{\beta} \big), \qquad (7)$

which is not rotationally-invariant (except when $\beta = 2$, which coincides with a Gaussian distribution) but whose random variables are IID and satisfy the conditions in Theorem 6. As $\beta \to \infty$, this distribution converges pointwise to the uniform distribution on $[-1, 1]^d$.
In Figure 3, we empirically verify Propositions 1 and 4. In the one hidden layer case, the samples follow the blue curve given by the normalized Arc-Cosine Kernel, regardless of the specific multivariate t weight distribution, which varies with $\nu$. We also observe that the universality of the equivalent kernel appears to hold for the distribution (7) regardless of the value of $\beta$, as predicted by theory. We discuss the relevance of the remaining curves in Section 5.
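The t-distribution experiment can be sketched in a few lines (our own illustration; the dimension and degrees of freedom are arbitrary). A multivariate t sample is a Gaussian vector divided by an independent $\sqrt{\chi^2_\nu / \nu}$ factor, and the empirical normalized kernel should follow the normalized Arc-Cosine Kernel:

```python
import numpy as np

def sample_multivariate_t(n, d, nu, rng):
    """n draws from the d-dim t-distribution (identity shape): Gaussian / sqrt(chi2/nu)."""
    g = rng.standard_normal((n, d))
    s = np.sqrt(rng.chisquare(nu, size=(n, 1)) / nu)
    return g / s

def normalized_arc_cosine(theta0):
    return (np.sin(theta0) + (np.pi - theta0) * np.cos(theta0)) / np.pi

rng = np.random.default_rng(1)
d, n_hidden, nu = 10, 200_000, 5.0
x, y = rng.standard_normal(d), rng.standard_normal(d)
theta0 = np.arccos(np.clip(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)), -1, 1))

W = sample_multivariate_t(n_hidden, d, nu, rng)   # heavy-tailed, rotationally-invariant
relu = lambda t: np.maximum(t, 0.0)
hx, hy = relu(W @ x), relu(W @ y)
cos_theta1 = hx @ hy / (np.linalg.norm(hx) * np.linalg.norm(hy))
print(cos_theta1, normalized_arc_cosine(theta0))  # empirical vs. predicted normalized kernel
```

Because the normalized kernel cancels the scale of the weight distribution, the agreement does not depend on $\nu$ (so long as second moments exist), which is exactly the invariance the propositions predict.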
5 Implications for Practical Networks
5.1 Composed Kernels in Deep Networks
A recent advance in understanding the difficulty of training deep neural networks is the identification of the shattered gradients problem (Balduzzi et al., 2017).
A simple observation that complements this view is obtained through repeated composition of the normalized kernel. As the depth $l \to \infty$, the angle $\theta_l$ between the signals of any two inputs in the $l^{th}$ layer of a LReLU network with random weights satisfying the conditions of Proposition 4 approaches $0$.
A result similar to the following is hinted at by Lee et al. (2017), citing Schoenholz et al. (2017). Their analysis, which considers biases in addition to weights (Poole et al., 2016), yields insights on the trainability of random neural networks that our analysis cannot. However, their argument does not appear to provide a complete formal proof for the case when the activation functions are unbounded, e.g., ReLU. The degeneracy of the composed kernel with more general activation functions is also proved by Daniely (2016), with the assumption that the weights are Gaussian distributed.
The normalized kernel corresponding to LReLU activations converges to a unique fixed point at $\theta^* = 0$.
Let $\theta_l \in [0, \pi]$ and define $g$ to be the map taking the angle in one layer to the angle in the next through the normalized kernel, $\theta_{l+1} = g(\theta_l)$.

The magnitude of the derivative of $g$ is bounded above by a constant less than $1$ on the domain of interest. Therefore, $g$ is a contraction mapping. By Banach's fixed point theorem there exists a unique fixed point $\theta^*$. Setting $\theta^* = 0$ verifies that it is a solution, and it is unique. ∎
Corollary 8 implies that for this deep network, the angle between any two signals at a deep layer approaches $0$. No matter what the input is, the kernel “sees” the same thing after accounting for the scaling induced by the norm of the input. Hence, it becomes increasingly difficult to train deeper networks, as much of the information is lost and the outputs will depend merely on the norm of the inputs; the signals decorrelate as they propagate through the layers.
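To illustrate (a sketch using the LReLU normalized kernel obtained from the bivariate Gaussian expectation; the slope value is arbitrary), iterating the angle map drives any initial angle toward $0$, at a roughly $O(1/\text{depth})$ rate:

```python
import numpy as np

def next_angle(theta, a=0.2):
    """Angle after one infinitely wide LReLU layer (normalized kernel map);
    a is the LReLU negative slope (a = 0 recovers the ReLU Arc-Cosine map)."""
    cos_next = ((1 + a)**2 / (2 * (1 + a**2)) * np.cos(theta)
                + (1 - a)**2 / (np.pi * (1 + a**2))
                * (np.sin(theta) + (np.pi / 2 - theta) * np.cos(theta)))
    return np.arccos(np.clip(cos_next, -1.0, 1.0))

theta = np.pi / 2            # start from orthogonal inputs
for _ in range(500):         # 500 layers
    theta = next_angle(theta)
print(theta)                 # small: deep signals become nearly collinear
```

The derivative of the map approaches $1$ at the fixed point, so the decay is slow (algebraic rather than geometric), but the angle still collapses to $0$ with depth.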
At first this may seem counter-intuitive. An appeal to intuition can be made by considering the corresponding linear network with deterministic and equal weight matrices in each layer, which amounts to the celebrated power iteration method. In this case, repeated application of the matrix transformation converges in direction to the dominant eigenvector of the matrix, regardless of the input vector.
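The analogy can be made concrete (an illustrative sketch of ours): under power iteration with a fixed symmetric matrix, two different starting vectors both align with the dominant eigenvector:

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((5, 5))
A = G @ G.T                      # symmetric PSD: a real dominant eigenvector exists

x = rng.standard_normal(5)
y = rng.standard_normal(5)
for _ in range(50):              # repeated application of the same matrix
    x = A @ x; x /= np.linalg.norm(x)
    y = A @ y; y /= np.linalg.norm(y)
print(abs(x @ y))                # near 1: both directions collapse onto one eigenvector
```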
Figure 3 shows that the theoretical normalized kernel for networks of increasing depth closely follows empirical samples from randomly initialized neural networks.
In addition to convergence of direction, by also requiring that the weight variance be chosen to preserve the scale of the signals, it can be shown that, after accounting for scaling, the magnitudes of the signals converge as the signals propagate through the network. This is analogous to having the dominant eigenvalue equal to $1$ in the power iteration comparison.
The quantity $k^{(L)}(\mathbf{x}, \mathbf{x}) / \|\mathbf{x}\|^2$ in an $L$-layer random (L)ReLU network of infinite width, with random uncorrelated and identically distributed rotationally-invariant weights with $\sigma_w^2 = \frac{2}{1 + a^2}$, approaches $1$ as $L \to \infty$.
Denote the output of one neuron in the $l^{th}$ layer of a network with input $\mathbf{x}$ by $h^{(l)}(\mathbf{x})$, and let $k^{(L)}$ be the kernel of the $L$-layer network. Then

$k^{(L)}(\mathbf{x}, \mathbf{x}) = \mathbb{E}\big[ h^{(L)}(\mathbf{x})^2 \big] = \frac{(1 + a^2)\,\sigma_w^2}{2}\, k^{(L-1)}(\mathbf{x}, \mathbf{x}) = \Big( \frac{(1 + a^2)\,\sigma_w^2}{2} \Big)^{L} \|\mathbf{x}\|^2,$

which equals $\|\mathbf{x}\|^2$ for every $L$ when $\sigma_w^2 = \frac{2}{1 + a^2}$, so the quantity in question approaches $1$ as $L \to \infty$. ∎
Contrary to the shattered gradients analysis, which applies to gradient based optimizers, our analysis relates to any optimizers that initialize weights from some distribution satisfying conditions in Proposition 4 or Corollary 7. Since information is lost during signal propagation, the network’s output shares little information with the input. An optimizer that tries to relate inputs, outputs and weights through a suitable cost function will be “blind” to relationships between inputs and outputs.
Our results can be used to argue against the utility of controversial Extreme Learning Machines (ELM) (Huang et al., 2004), which randomly initialize hidden layers from symmetric distributions and only learn the weights in the final layer. A single layer ELM can be replaced by kernel ridge regression using the equivalent kernel. Furthermore, a Multi-Layer ELM (Tang et al., 2016) with (L)ReLU activations utilizes a pathological kernel as shown in Figure 3. It should be noted that ELM bears resemblance to early works (Schmidt et al., 1992; Pao et al., 1994).
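To make the replacement concrete, here is a sketch (the toy data and hyperparameters are ours, not from the paper): a wide random-ReLU ELM, trained only in its output layer by ridge regression, makes nearly the same predictions as kernel ridge regression with the equivalent Arc-Cosine kernel. Scaling the random features by $1/\sqrt{n}$ makes the feature Gram matrix approximate the kernel, so the two solutions coincide up to Monte Carlo error (via the push-through identity $H(H^T H + \lambda I)^{-1} H^T = H H^T (H H^T + \lambda I)^{-1}$):

```python
import numpy as np

def arc_cosine_gram(X, Y):
    """Gram matrix of the equivalent (degree-1) Arc-Cosine kernel, sigma_w = 1."""
    nx = np.linalg.norm(X, axis=1, keepdims=True)
    ny = np.linalg.norm(Y, axis=1, keepdims=True)
    th = np.arccos(np.clip((X @ Y.T) / (nx * ny.T), -1.0, 1.0))
    return (nx * ny.T) / (2 * np.pi) * (np.sin(th) + (np.pi - th) * np.cos(th))

rng = np.random.default_rng(0)
n, d, n_feat, lam = 100, 5, 2000, 1.0
X = rng.standard_normal((n, d))
y_target = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)   # toy regression target

# ELM: random ReLU features (scaled by 1/sqrt(n_feat)); learn only the output layer.
W = rng.standard_normal((d, n_feat))
H = np.maximum(X @ W, 0.0) / np.sqrt(n_feat)
beta = np.linalg.solve(H.T @ H + lam * np.eye(n_feat), H.T @ y_target)
elm_pred = H @ beta

# Kernel ridge regression with the equivalent kernel.
K = arc_cosine_gram(X, X)
alpha = np.linalg.solve(K + lam * np.eye(n), y_target)
krr_pred = K @ alpha

print(np.corrcoef(elm_pred, krr_pred)[0, 1])   # close to 1
```

The random-feature solution is thus a noisy stand-in for a deterministic kernel method, which is the sense in which the ELM's random hidden layer adds little beyond its equivalent kernel.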
This applies whenever the conditions in Proposition 4 or Corollary 12 are satisfied. This agrees with the well-known case when the elements of the weight matrix are IID (He et al., 2015) and $a = 0$. For small values of $a$, (8) is well approximated by the known result $\sigma_w^2 = \frac{2}{(1 + a^2)d}$ (He et al., 2015). For larger values of $a$, this approximation breaks down, as shown in Figure 4.
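As a sketch of the practical consequence (using the He et al. (2015) LReLU scaling $\sigma_w^2 = 2/\big((1+a^2)d\big)$, which the text above describes as accurate for small $a$; the width, depth and slope here are arbitrary), this choice keeps the signal norm roughly constant through many random layers:

```python
import numpy as np

def lrelu(t, a=0.2):
    return np.where(t > 0.0, t, a * t)

def forward_norms(d=512, depth=30, a=0.2, seed=0):
    """Propagate one input through `depth` random LReLU layers with
    per-weight variance 2 / ((1 + a**2) * d); return the per-layer norms."""
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(2.0 / ((1 + a**2) * d))
    x = rng.standard_normal(d)
    norms = []
    for _ in range(depth):
        W = rng.normal(0.0, sigma, size=(d, d))
        x = lrelu(W @ x, a)
        norms.append(np.linalg.norm(x))
    return norms

norms = forward_norms()
print(norms[0], norms[-1])   # comparable magnitudes: neither exploding nor vanishing
```

The scaling works because each LReLU layer multiplies the expected squared norm by $\sigma_w^2 d (1 + a^2)/2$, which this variance choice sets to $1$.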
An alternative approach to weight initialization is the data-driven approach (Mishkin & Matas, 2016), which can be applied to more complicated network structures such as the convolutional and max-pooling layers commonly used in practice. As parameter distributions change during training, batch normalization inserts layers with learnable scaling and centering parameters, at the cost of increased computation and complexity (Ioffe & Szegedy, 2015).
We have considered universal properties of MLPs with weights coming from a large class of distributions. We have theoretically and empirically shown that the equivalent kernel for networks with an infinite number of hidden ReLU neurons and all rotationally-invariant weight distributions is the Arc-Cosine Kernel. The CLT can be applied to approximate the kernel for high dimensional input data. When the activations are LReLUs, the equivalent kernel has a similar form. The kernel converges to a fixed point, showing that information is lost as signals propagate through the network.
One avenue for future work is to study the equivalent kernel for different activation functions, noting that the kernels for some activations, such as the ELU, may not be expressible in closed form (we do show in the supplementary material, however, that the ELU has an asymptotically universal kernel).
Since wide networks with centered weight distributions have approximately the same equivalent kernel, powerful trained deep and wide MLPs with (L)ReLU activations should have asymmetric, non-zero mean, non-IID parameter distributions. Future work may consider analyzing the equivalent kernels of trained networks and more complicated architectures. We should not expect the equivalent kernel to be expressible in a neat closed form in these cases. This work is a crucial first step in identifying invariant properties in neural networks and sets a foundation from which we hope to expand in future.
Appendix A Proof of Proposition 2
The kernel with weight PDF $f(\mathbf{w})$ and ReLU activations is

$k(\mathbf{x}, \mathbf{y}) = \int \Theta(\mathbf{w} \cdot \mathbf{x})\, \Theta(\mathbf{w} \cdot \mathbf{y})\, (\mathbf{w} \cdot \mathbf{x})(\mathbf{w} \cdot \mathbf{y})\, f(\mathbf{w})\, d\mathbf{w}.$
Let $\theta_0$ be the angle between $\mathbf{x}$ and $\mathbf{y}$, and let $\mathbf{e}_1$ and $\mathbf{e}_2$ denote the first two standard basis vectors, with $\mathbf{e}_1 \cdot \mathbf{e}_2 = 0$. Following Cho & Saul (2009), there exists some rotation matrix $\mathbf{R}$ such that $\mathbf{R}\mathbf{x} = \|\mathbf{x}\|\mathbf{e}_1$ and $\mathbf{R}\mathbf{y} = \|\mathbf{y}\|(\cos\theta_0\, \mathbf{e}_1 + \sin\theta_0\, \mathbf{e}_2)$. We have
Let $\mathbf{u} = \mathbf{R}\mathbf{w}$ and note that the dot product is invariant under rotations and the determinant of the Jacobian of the transformation is $1$ since $\mathbf{R}$ is orthogonal. We have

$k = \|\mathbf{x}\| \|\mathbf{y}\| \int \Theta(u_1)\, \Theta(u_1 \cos\theta_0 + u_2 \sin\theta_0)\, u_1 (u_1 \cos\theta_0 + u_2 \sin\theta_0)\, f(\mathbf{u})\, d\mathbf{u}. \qquad (9)$
One may view the integrand as a functional acting on test functions of $\theta_0$. Denote the set of infinitely differentiable, compactly supported test functions on $(0, \pi)$ by $\mathcal{D}(0, \pi)$. The linear functional acting over $\mathcal{D}(0, \pi)$ is a Generalized Function, and we may take distributional derivatives under the integral by Theorem 7.40 of Jones (1982). Differentiating twice,

$k'' = \|\mathbf{x}\| \|\mathbf{y}\| \int \Theta(u_1) \big[ \delta(u)\, (u')^2 - \Theta(u)\, u \big]\, u_1\, f(\mathbf{u})\, d\mathbf{u} = F(\theta_0) - k,$

where $u = u_1 \cos\theta_0 + u_2 \sin\theta_0$ and $u' = u_2 \cos\theta_0 - u_1 \sin\theta_0$.
The initial condition $k(\pi) = 0$ is obtained by putting $\theta_0 = \pi$ in (9) and noting that the resulting integrand contains a factor of $\Theta(u_1)\Theta(-u_1)$, which is $0$ almost everywhere. Similarly, the integrand of $k'(\pi)$ contains a factor of $\Theta(u_1)\Theta(-u_1)$.
The ODE is meant in a distributional sense: $\langle k'' + k, \phi \rangle = \langle F, \phi \rangle$ for all $\phi \in \mathcal{D}(0, \pi)$, where $k$ is a distribution with a distributional second derivative $k''$. ∎
Appendix B Proof of Proposition 3
Denote the marginal PDF of the first two coordinates of $\mathbf{w}$ by $f_2(w_1, w_2)$. Due to the rotational invariance of $f$, $f_2(w_1, w_2) = f_2\big( \mathbf{Q} (w_1, w_2)^T \big)$ for any $2 \times 2$ orthogonal matrix $\mathbf{Q}$, so $f_2$ depends on its argument only through the radius $r = \sqrt{w_1^2 + w_2^2}$; write $f_2(w_1, w_2) = g(r)$. Evaluating the Dirac delta in (5) and changing variables,

$F(\theta_0) = \frac{\|\mathbf{x}\| \|\mathbf{y}\|}{\sin^3\theta_0} \int_0^{\infty} w_1^3\, g(w_1 / \sin\theta_0)\, dw_1 = \|\mathbf{x}\| \|\mathbf{y}\| \sin\theta_0 \int_0^{\infty} r^3 g(r)\, dr = \frac{\sigma_w^2}{\pi} \|\mathbf{x}\| \|\mathbf{y}\| \sin\theta_0.$
It remains to check that $\int_0^{\infty} r^3 g(r)\, dr$ is finite. It is integrable since

$\pi \int_0^{\infty} r^3 g(r)\, dr = \mathbb{E}[w_1^2] = \sigma_w^2 < \infty.$

Therefore, $F$ is finite almost everywhere. $F$ must be a function, so the distributional and classical derivatives coincide. ∎
Appendix C Proof of Theorem 6
There exist some orthonormal $\mathbf{e}_1, \mathbf{e}_2$ such that $\mathbf{x}^{(d)} = \|\mathbf{x}\|\mathbf{e}_1$ and $\mathbf{y}^{(d)} = \|\mathbf{y}\|(\cos\theta_0\, \mathbf{e}_1 + \sin\theta_0\, \mathbf{e}_2)$. We would like to examine the asymptotic distribution of the pair

$(X_d, Y_d) = \big( \mathbf{w} \cdot \mathbf{x}^{(d)},\ \mathbf{w} \cdot \mathbf{y}^{(d)} \big).$

Note that $\mathbb{E}[X_d] = \mathbb{E}[Y_d] = 0$. Also note that $\mathbf{w} \cdot \mathbf{e}_1$ and $\mathbf{w} \cdot \mathbf{e}_2$ are uncorrelated since the coordinates of $\mathbf{w}$ are IID with mean $0$.
Let $\mathbf{V}$ be the covariance matrix of $(X_d, Y_d)$, let $\mathbf{I}$ be the identity matrix, and let $\mathbf{Z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. Then for any convex set $A \subseteq \mathbb{R}^2$ and some constant $C > 0$, by the Berry-Esseen Theorem,

$\Big| P\big( \mathbf{V}^{-1/2} (X_d, Y_d)^T \in A \big) - P(\mathbf{Z} \in A) \Big| \le C \gamma_d,$

where $\gamma_d$ is given by a sum of third absolute moments of the normalized summands $\mathbf{V}^{-1/2} \big( w_i x_i^{(d)}, w_i y_i^{(d)} \big)^T$. The bound is finite because the weights have finite absolute third moment.
Now $\gamma_d \to 0$ under Hypothesis 5, so $\mathbf{V}^{-1/2} (X_d, Y_d)^T$ converges in distribution to the bivariate spherical Gaussian with unit variance. Then the random vector $(X_d, Y_d)$ converges in distribution to the bivariate Gaussian random variable with covariance matrix $\mathbf{V}$. Since $\sigma$ is continuous almost everywhere, by the Continuous Mapping Theorem,
If $\theta_0 = 0$ or $\theta_0 = \pi$, the covariance matrix $\mathbf{V}$ is singular; we may treat $(X_d, Y_d)$ as a one-dimensional random variable and the above still holds. ∎
Appendix D Proof of Corollary 7
We have $k^{(d)} = \mathbb{E}\big[ \sigma(X_d)\, \sigma(Y_d) \big]$ and would like to bring the limit $d \to \infty$ inside the expected value. By Theorem 6 and Theorem 25.12 of Billingsley (1995), it suffices to show that $\sigma(X_d)\, \sigma(Y_d)$ is uniformly integrable. Define $f_d$ to be the joint PDF of $(X_d, Y_d)$. We have

$\mathbb{E}\Big[ \big| \sigma(X_d)\, \sigma(Y_d) \big|^{1+\epsilon} \Big] = \int \Theta(s)\, \Theta(t)\, (st)^{1+\epsilon}\, f_d(s, t)\, ds\, dt,$

since the integrand is $0$ whenever $s \le 0$ or $t \le 0$. So
We may raise the Heaviside functions to any power without changing the value of the integral. Squaring the Heaviside functions and applying Hölder’s inequality, we have
Examining the first of these factors,