randrelu
This repository maintains the code of random ReLU features method.
We propose the random ReLU features model in this work. Its motivation is rooted in both kernel methods and neural networks. We prove the universality and generalization performance of random ReLU features. Parallel to Barron's theorem, we consider the ReLU feature class, extended from the reproducing kernel Hilbert space of random ReLU features, and prove a strong quantitative approximation theorem in which both the inner and outer weights of the approximating ReLU network are bounded by constants. We also prove a similar approximation theorem for compositions of functions in the ReLU feature class by multi-layer ReLU networks. A separation theorem between the ReLU feature class and its compositions follows as a consequence of the separation between shallow and deep networks. These results reveal nice properties of ReLU nodes from the viewpoint of approximation theory, supporting both weight regularization for ReLU networks and the use of random ReLU features in practice. Our experiments confirm that the performance of random ReLU features is comparable with that of random Fourier features.
The idea of applying random non-linear functions to generate features to improve regression and classification algorithms has been around for at least two decades; see, e.g., Igelnik and Pao (1995) and Huang et al. (2006b). Consider a model where the function of interest is written as a linear combination of non-linear functions,
The non-linear nodes
can be either chosen according to some probability distribution or selected by optimizing some objective function. Inspired by neural networks, a common choice of
is , where is a non-linear function on (often a sigmoidal function). In classic neural network training methods, all weights are optimized. However, we are interested in settings where the inner weights,
and , are generated by a probability distribution and only the outer weights are tuned by some optimization method. Huang et al. (2006a) prove that when is bounded piecewise continuous and are generated by some continuous probability distribution, there exists that converges to any given continuous function with respect to the norm over a compact subset of almost surely as . This result demonstrates the capability of models using these random features. However, it is weaker than the universal approximation property established for neural networks (Cybenko (1989); Hornik (1991); Leshno et al. (1993)) or kernel methods (Micchelli et al. (2006)), where the approximation is established under the supremum norm.

Rahimi and Recht (2008b) show that certain kernel functions can be approximated by random Fourier features with an appropriate probability distribution over parameters. In later work (Rahimi and Recht, 2009), they provide the convergence rate of such approximations by applying Maurey's sparsification lemma (see Pisier (1980-1981)). In this paper, we begin with the observation that choosing a family of non-linear functions and a probability distribution over them always implicitly defines a kernel and the corresponding reproducing kernel Hilbert space (RKHS). An important issue is then whether the kernel defined in this way has the universal approximation property. This perspective gives us more freedom to design powerful new random features beyond random Fourier ones. Since we can always apply Maurey's sparsification lemma to establish the universality of the random features method once the universality of the corresponding kernel method is established, we do not distinguish these two slightly different concepts and simply call both the universality of the random features method.
The development of the random features method stems from a desire to take advantage of both kernel methods and neural networks. On the one hand, random features models have the same structure as two-layer (i.e., single hidden layer) neural networks, and are thus scalable to large data sets, yield fast test-time predictions, and can use a variety of optimization methods designed for neural networks. On the other hand, the method approximates kernel methods and leads to convex optimization problems.
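As a concrete illustration, the random features recipe just described, sampling the inner weights once and then fitting only the outer weights through a convex problem, can be sketched in a few lines of NumPy. This is a toy example under illustrative assumptions (the target function, feature count, and ridge solver are our choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression target.
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0])

# Random features model: f(x) = sum_k a_k * relu(w_k . x + b_k),
# with (w_k, b_k) sampled once and then kept fixed.
N = 100                                  # number of random nodes
W = rng.normal(size=(1, N))              # random inner weights
b = rng.normal(size=N)                   # random biases
Phi = np.maximum(0.0, X @ W + b)         # ReLU features, shape (200, N)

# Only the outer weights are fit, here by ridge regression,
# which is a convex problem (unlike also training W and b).
lam = 1e-3
a = np.linalg.solve(Phi.T @ Phi + lam * np.eye(N), Phi.T @ y)

mse = np.mean((Phi @ a - y) ** 2)
assert mse < 0.05
```

With the inner weights frozen, the fit reduces to linear regression in feature space, which is what makes the method scalable and convex.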
During the current revival of neural networks, branded as deep learning, much simpler non-linear functions, such as rectified linear unit (ReLU) and leaky ReLU, have been widely used. We call neural networks using these types of nodes
ReLU networks for convenience. When the inner weights of a two-layer ReLU network are chosen according to some probability distribution, the network becomes a new random features model. The universal approximation property of standard ReLU networks has been confirmed by the powerful theorem in Leshno et al. (1993). However, the approximation property of the random ReLU features model has not been studied. In this paper, we show that the universal approximation property of standard ReLU networks actually implies that of random ReLU features models. This result justifies the use of random ReLU features in various learning tasks.

Beyond the universal approximation property, we further study quantitative aspects of the approximation property of ReLU networks and random ReLU features models. Barron (1993) shows that any function with
norm of the Fourier transform of its gradient bounded by
can be approximated by a linear combination of sigmoidal activation nodes with error under the norm with respect to the data distribution . The norm of the coefficients of each activation node is bounded by . Lee et al. (2017) extend the result on approximation of Barron's functions by two-layer neural networks to the approximation of compositions of Barron's functions by multi-layer neural networks. However, neither Barron's nor Lee et al.'s results provide constraints on the weights inside the activation nodes, which is important both for regularization in practice and for the generalization analysis of neural networks.
Both approximation results can be naturally extended to ReLU neural networks without effort since the difference between a ReLU node and its shift is a sigmoidal function (see Lee et al. (2017)
). However, we show that by considering a more natural function class and exploiting the fact that the ReLU activation function is homogeneous of degree 1, we can obtain stronger results compared to those of
Barron (1993) and Lee et al. (2017). In particular, the inner weights of the approximating network can be controlled. This provides a connection between the approximation analysis and the generalization analysis, and it further supports regularization in practice.

Various authors have considered approximation by random features models of function classes with bounded VC dimension (Girosi, 1995), bounded Rademacher complexity (Gnecco and Sanguineti, 2008), or functions expressed as the expectation of certain non-linear features (Rahimi and Recht, 2008a). However, these works all used characterizations quite different from Barron's and ours, and none of them considered the special case of ReLU features.
The overall contributions of this paper are summarized as follows.
We propose the random ReLU features model. We establish its universal approximation property, obtain its learning rate when used with support vector machines or logistic regression, and compare its performance with the random Fourier features method in experiments.
In analogy with Barron’s class, we define a class of functions , called the ReLU feature class, and establish a quantitative approximation theorem for functions in using ReLU neural networks. In our approximation theorem, both outer and inner weights are controlled. The connection between and the kernel-based random ReLU features is revealed.
We prove that any composition of functions in can be approximated by multi-layer ReLU neural networks, with the norm of weight matrices bounded by constants that depend on . We use the results of Eldan and Shamir (2016) to establish a separation theorem showing the essential difference in capacities between and compositions of functions in when .
Our work shows how the universality of the RKHS of random features is implied by that of the corresponding neural networks. Moreover, using ReLU features as an example, we build a bridge between kernel methods and neural networks.
In Section 2, we review some basic functional analysis results that are useful for establishing universal approximation results. We also describe the random features method and fix the notation used in the paper. The universal approximation property of random ReLU features is given in Section 3. The quantitative approximation results and the separation result for functions in and their compositions are presented in Section 4. We describe the random ReLU features algorithm and give a simple generalization error bound in Section 5. The performance of random ReLU features on several benchmark data sets and their advantages over random Fourier features (obtained by approximating Gaussian kernels) are discussed in Section 6. All proofs can be found in the appendices.
When a subset of is dense, we call it universal. To show that a subset is dense in a Banach space, we need only consider its annihilator as described by the following lemma.
For a Banach space and its subset , the linear span of is dense in if and only if , the annihilator of , is .
The proof can be easily derived from Theorem 8 in Chapter 8 of Lax (2002); it is a consequence of the Hahn-Banach theorem. Throughout the paper, we assume that is a bounded subset of . The dual space of , the space of all real continuous functions on , is the space of all signed measures equipped with the total variation norm, denoted by (see Theorem 14 in Chapter 8 of Lax (2002)). As a consequence of Lemma 1 and the duality between and , Micchelli et al. (2006) use the following criterion for justifying the density of a class of continuous functions in .
A set of continuous functions on a compact set is universal if and only if for a signed measure ,
To understand machine learning at a theoretical level, we are interested in the approximation properties of the hypothesis classes accessible to learning algorithms. The hypothesis class of one-hidden-layer neural networks consists of linear combinations of finitely many non-linear activation nodes composed with affine transforms of data points. For kernel-based methods, the hypothesis class is the RKHS determined by the kernel function. Both the universality of the RKHS and that of neural networks can be established by applying Lemma 2.

For any kernel function , we call , where is a Hilbert space, a feature map of , if
(1)
Feature representations of kernels are very useful for understanding the approximation property of the RKHS, and also for scaling kernel methods up to large data sets. In practice, one can first choose a kernel function for the problem and then pick a feature map based on some transform of the kernel function. However, the process can be reversed: one can first design a map from to a Hilbert space, and then a kernel function is defined by Equation 1. A particularly useful feature map chooses to be , where is a probability distribution over some parameter space . The RKHS determined by such a feature map is
Because is a probability measure, we can approximate the function in the RKHS by
with s sampled according to . Then the coefficients
s can be determined by a training process minimizing the empirical risk with respect to some loss function, such as the hinge loss in the support vector machine case or the softmax cross entropy in logistic regression; this is known as the random features method. The generalization error of the random features method has been well studied; see
Rudi and Rosasco (2017). Its performance on several practical problems has also attracted attention (see Huang et al. (2014)). Note that in the whole process of applying the random features method, the kernel function never appears explicitly.

As we noted in the introduction, random features models of the special form , where denotes the vector consisting of the first coordinates of , are naturally connected to two-layer neural networks. We will see in Section 3 that the universality of these two models is also closely connected.
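The implicit kernel behind a feature map and a distribution can be estimated directly by Monte Carlo, without ever writing the kernel down. The sketch below (with ReLU features and a standard Gaussian as one illustrative choice of the feature distribution) checks that the resulting empirical kernel matrix is a valid Gram matrix:

```python
import numpy as np

rng = np.random.default_rng(1)

# A feature map phi(x; w) = relu(w . x) together with a distribution
# over w implicitly defines the kernel k(x, y) = E_w[phi(x; w) phi(y; w)].
# We estimate it by Monte Carlo over sampled weights.
d, N = 3, 50_000
X = rng.normal(size=(5, d))            # a few sample points
W = rng.normal(size=(N, d))            # w ~ N(0, I), one choice of mu

Phi = np.maximum(0.0, X @ W.T)         # shape (5, N)
K = Phi @ Phi.T / N                    # empirical kernel matrix

# K is symmetric positive semi-definite by construction,
# exactly as a kernel Gram matrix must be.
assert np.allclose(K, K.T)
assert np.min(np.linalg.eigvalsh(K)) > -1e-8
```

In practice one skips forming K entirely and works with Phi directly, which is the point of the random features method.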
A function on is homogeneous of degree 1 if it satisfies for any . We simply call such a function homogeneous. Such a function is fully determined by its value at as follows,
This class includes the ReLU and leaky ReLU activation nodes. For simplicity, we call them ReLU nodes in the rest of the paper and denote them by . We explore the approximation property of the RKHS determined by the feature map and that of neural networks using as the activation node.
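The degree-1 homogeneity that this definition relies on is easy to verify numerically; the small check below also illustrates its key consequence used throughout the paper, namely that a ReLU node's inner weights can be rescaled onto the unit sphere with the scale absorbed elsewhere:

```python
import numpy as np

relu = lambda t: np.maximum(0.0, t)

rng = np.random.default_rng(2)
v = rng.normal(size=10)
for c in (0.5, 2.0, 7.3):
    # Positive homogeneity of degree 1: relu(c * v) == c * relu(v).
    assert np.allclose(relu(c * v), c * relu(v))

# Consequence: a ReLU node with inner weights (w, b) equals
# ||(w, b)|| times the node with (w, b) rescaled to the unit sphere,
# so inner weights can be restricted to the sphere w.l.o.g.
w, b = rng.normal(size=3), rng.normal()
x = rng.normal(size=3)
r = np.sqrt(w @ w + b * b)
ok = np.isclose(relu(w @ x + b), r * relu((w / r) @ x + b / r))
assert ok
```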
Throughout the paper, denotes the slice of a vector from the th to the th coordinate. Without subscripts, denotes the Euclidean norm; for a measure , denotes its total variation norm. denotes the composition of functions .
In this section, we show that the random ReLU features method defines a kernel whose corresponding RKHS has the universal approximation property for many feature distributions. The following proposition is in fact even stronger: it shows that any non-linear function eventually bounded by a ReLU function defines a universal RKHS when an appropriate feature distribution is selected.
Assume that the absolute value of the continuous function is upper bounded by a homogeneous function outside a bounded interval. is a bounded subset of . is a probability distribution over whose support is dense in
. If the second moment of
is bounded and is not linear, the RKHSis dense in .
First, it is easy to see that . Indeed, let us assume that for , and for . Denote the upper bound of the second moment of by and that of the radius of by . Then
The functions in the RKHS are all continuous. So we can use Lemma 2 to justify the universality. For a signed measure with finite total variation, assume that
for all . We want to show that must be the measure. Since the function is integrable over , by Fubini’s theorem we have
equals for all . Then
(2)
Indeed, the function defined on the left-hand side of Equation 2 must be 0 everywhere by continuity. Since is not a polynomial, by Leshno et al. (1993), we know that must be the zero measure. If it were not, there would exist in such that , where . Because the linear span of is dense in , there must exist nodes such that
This contradicts Equation 2. ∎
Although we may not have an explicit kernel function for , Proposition 3 guarantees that the RKHS we defined using and is rich enough to approximate any continuous target function. Note that the RKHS defined by and is different from the function class of ReLU networks, but the universality of the RKHS can be derived from that of ReLU networks. This relation extends to many other feature maps, for example, bounded continuous ones. The proof remains nearly the same, and the universality of random features can always be derived as a corollary of the universality of the hypothesis class of neural networks using the corresponding activation functions.
Under the assumptions of Proposition 3, the random features may produce large outputs in some cases. Using ReLU nodes, we can avoid this by constraining the features to the unit sphere, as shown by the following more practical result.
Assume that is homogeneous and continuous, and is a probability distribution over the unit sphere whose support is dense over the whole sphere. If is not linear, the RKHS
is dense in .
Therefore, to make use of random ReLU features, we can simply sample the inner weights uniformly over the sphere. We give a detailed description of the random ReLU features method in the machine learning context and analyze its performance in Sections 5 and 6. For the rest of the paper, we always assume that is the ReLU function.
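Sampling the inner weights uniformly on the unit sphere is straightforward: normalize standard Gaussian draws. The helper below is a hypothetical sketch of the construction described above (the function name and sizes are ours):

```python
import numpy as np

def random_relu_features(X, n_features, rng):
    """Random ReLU features with inner weights (w, b) drawn uniformly
    from the unit sphere in R^{d+1}: a hypothetical helper sketching
    the construction in the text."""
    n, d = X.shape
    V = rng.normal(size=(n_features, d + 1))
    V /= np.linalg.norm(V, axis=1, keepdims=True)   # uniform on sphere
    W, b = V[:, :d], V[:, d]
    return np.maximum(0.0, X @ W.T + b)

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
Phi = random_relu_features(X, 64, rng)
assert Phi.shape == (100, 64)
assert np.all(Phi >= 0.0)
```

Because the inner weights lie on the unit sphere, each feature is bounded by a constant multiple of the data radius, avoiding the large outputs discussed above.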
In this section, we prove a quantitative approximation theorem for ReLU networks similar to Barron’s result and extend it to the multi-layer case.
Barron (1993) considers the following function class
where is a complex measure on and . He shows that any function in can be approximated by a 1-hidden-layer neural network of sigmoidal nodes,
with and error less than . However, his work does not derive any constraints on the inner weights, s and s.
We note that
has a form very similar to that of the functions in the RKHS in Proposition 3. This motivates us to consider a similar construction using the ReLU node .
Denote by the set
Then we define the function class to be the set of functions of the following form
And we further define that .
Note that in our definition, we use the Euclidean norm of instead of . For compact with radius , . So our definition admits fewer functions into the class than a definition using would. Note also that the integral representation of in the definition is not unique: the same may be defined by different and , only one of which satisfies the criterion of . The original Barron class exhibits a similar situation: the Fourier transform of can only be defined once it is extended to the whole space, such an extension is not unique, and Barron's constant is defined to be the smallest such integral.
Some properties of and are given below. The proofs are simple and therefore omitted.
consists of continuous functions.
Assume that . Then functions in are Lipschitz.
, where is the second moment of .
is universal.
Usually, is strictly included in , and so is in . Indeed, when is absolutely continuous with respect to the Lebesgue measure, for any fixed , belongs to but not to . Moreover, only contains Lipschitz functions, but there exist continuous functions over that are not Lipschitz. We will see later that is also strictly smaller than the class of -Lipschitz functions.
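The Lipschitz property mentioned above can be checked empirically for finite ReLU combinations: a sum of ReLU nodes is Lipschitz with constant at most the sum of |outer weight| times the inner-weight norm, since each ReLU node is 1-Lipschitz in its affine argument. A small numerical illustration (sizes and distributions are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(8)

# f(x) = sum_k a_k * relu(w_k . x + b_k) is Lipschitz with constant
# at most L = sum_k |a_k| * ||w_k||.
K, d = 20, 5
a = rng.normal(size=K)
W = rng.normal(size=(K, d))
b = rng.normal(size=K)
f = lambda x: a @ np.maximum(0.0, W @ x + b)
L = np.sum(np.abs(a) * np.linalg.norm(W, axis=1))

for _ in range(100):
    x, y = rng.normal(size=d), rng.normal(size=d)
    assert abs(f(x) - f(y)) <= L * np.linalg.norm(x - y) + 1e-9
```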
For the functions in , we can approximate them by ReLU networks. Stronger than Barron’s approximation theorem, our theorem provides upper bounds for the weights both outside and inside the non-linear nodes.
Assume that is contained in a ball of radius and that is a probability measure on . For any , there exists with , and for all , such that .
Our theorem shows that for ReLU networks to approximate functions in within an error of , only nodes are required. Moreover, all the outer weights in front of the nodes are bounded by and differ only in sign, and the inner weights have unit length. Compared with the theorem in Barron (1993), where only the outer weights are bounded under the norm, our extra constraints greatly shrink the search space for approximators. This improvement comes from the fact that is the ReLU function, which is homogeneous of degree 1. The proof can be found in Appendix B.
Even though Barron’s class also consists of Lipschitz functions, Barron’s proof requires piecewise constant functions in an intermediate step when constructing the sigmoidal networks to approximate target functions, and thus loses the Lipschitz property. Even if we instead consider the more direct approximators such as linear combination of cosine and sine nodes, it is still unclear how the inner weights can be controlled.
Networks of cosine and sine nodes have the same form as random Fourier features models. The relation between random Fourier features models and Barron's class is exactly parallel to the relation between random ReLU features models and the ReLU feature class.
Inspired by Lee et al. (2017)'s work, we also extend our result to compositions of functions in . First, we need to define the class for vector-valued functions.
For a function from to , we say that it belongs to the class if each component for and .
Note that Proposition 6 (2) still holds for the vector valued function class .
The following theorem, parallel to Theorem 3.5 in Lee et al. (2017), shows that any composition of functions in can be approximated by a multi-layer ReLU network, with all weights controlled by constants related to . The proof can be found in Appendix C.
Assume that for all , is a compact set with radius in and denote the unit ball in by . is a probability measure on . belongs to for an and any . Then for , there exists a set with
and an -layer neural network where
such that
Moreover, the th layer of the neural network contains nodes. The weight matrix from layer to layer , denoted by , has Frobenius norm bounded by . Each bias term is bounded by .
The significance of Theorem 9 rests on the separation, that is, the difference in capacity, between and compositions of functions in . Eldan and Shamir (2016) prove that there exist functions expressible by a 3-layer ReLU network with many nodes that nevertheless cannot be approximated by any 2-layer (1-hidden-layer) ReLU network with many nodes. By checking the weights of the neural network constructed in their work, we can show that it is also a composition of functions in . This proves the following proposition; the details of the proof can be found in the appendix.
There exists universal constants such that for any , there exists a probability measure on and two functions and with the following properties:
and belong to and , respectively,
and every function in with satisfies
This means that the composition of two ReLU feature classes contains substantially more complicated functions than either class alone. Lee et al. (2017)'s separation result shows that there exists a function with Barron constant greater than that is the composition of two functions whose Barron constants are . However, their separation theorem does not rule out the possibility that can be approximated by some function with a small Barron constant, and hence by a 1-hidden-layer neural network with polynomially many sigmoidal nodes. In this sense, our separation theorem is stronger: some compositions of functions in cannot even be approximated by functions.
The proposition above also implies that the ReLU feature class does not include all -Lipschitz functions, at least in the high-dimensional case: the composite function constructed in Proposition 10 is -Lipschitz, but it does not belong to any for .
In this section, we show that the generalization error of support vector machines (SVMs) and logistic regression (LR) using random ReLU features can be made arbitrarily small given sufficiently many samples and random features. The derivation makes use of Bach (2017)'s Proposition 1, which can be viewed as a refined version of Maurey's sparsification lemma. We assume that the 2-norm of the outer weights is bounded by some constant during training. This assumption greatly simplifies the proof and is not impractical. Analysis of the usual regularized optimization formulation of SVM or LR is possible (see Sun et al. (2018) and Rudi et al. (2016)), but we do not want to distract readers with such a technical analysis.
To state our result clearly, we denote that
We assume that there exists a function such that
where is the target function. By Proposition 4, for any and , we can always find a so that such an exists in .
For samples generated by and random weights generated by , denote by the solution of the following optimization problem.
where
and is hinge or logistic loss.
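The constrained formulation above, empirical risk minimization with the 2-norm of the outer weights bounded, can be sketched with projected subgradient descent on the hinge loss. Everything below (data, feature count, step size, epochs) is an illustrative assumption, not the paper's experimental setup:

```python
import numpy as np

def project_l2(a, R):
    """Project a onto the l2 ball of radius R."""
    n = np.linalg.norm(a)
    return a if n <= R else a * (R / n)

def train_constrained_svm(Phi, y, R=1000.0, lr=0.01, epochs=50, seed=0):
    """Hinge-loss SGD with outer weights constrained to ||a|| <= R:
    a sketch of the constrained formulation in the text."""
    rng = np.random.default_rng(seed)
    n, N = Phi.shape
    a = np.zeros(N)
    for _ in range(epochs):
        for i in rng.permutation(n):
            if y[i] * (Phi[i] @ a) < 1.0:      # hinge subgradient step
                a = project_l2(a + lr * y[i] * Phi[i], R)
    return a

# Toy separable data passed through random ReLU features.
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + X[:, 1])
W = rng.normal(size=(50, 2))
b = rng.normal(size=50)
Phi = np.maximum(0.0, X @ W.T + b)
a = train_constrained_svm(Phi, y)
acc = np.mean(np.sign(Phi @ a) == y)
assert acc > 0.85
```

The projection step enforces the norm constraint directly, mirroring the assumption made for the analysis instead of adding a regularization term.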
Then we have the following theorem.
Assume the loss function is hinge loss or logistic loss. With probability greater than , we have
if
and
This bound guarantees that when the number of samples and the number of features are large enough, the learning algorithm described above will return a solution whose performance is no worse than the best one in the space . In particular, Theorem 11 together with Proposition 4
implies the universal consistency of the random ReLU features method. The proof is straightforward, based on Bach's approximation results and basic statistical learning theory; see Appendix
D.

We compared the performance of the random ReLU features method with the popular random Fourier features with Gaussian feature distribution on four synthetic data sets and three real data sets: MNIST (Lecun and Cortes), adult, and covtype (Dheeru and Karra Taniskidou (2017)). Our purpose is to show that in practice, random ReLU features models display performance comparable to random Fourier features models and have several advantages in computational efficiency.
First, note that for the random Fourier features method, the bandwidth parameter plays an important role: the performance is very sensitive to its scale. Because the scale of the data may vary widely across problems, introducing a bandwidth parameter is necessary. Therefore, we also introduce a bandwidth parameter into the random ReLU features method as follows:
(3)
We choose to divide the bias term by rather than multiply the slope vector by , because the two operations are equivalent under the ReLU activation node and our choice is computationally cheaper. This form also prevents the outputs of the random features from becoming too large. Note that for random Fourier features the two forms are not equivalent.
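The claimed equivalence of the two bandwidth parameterizations is a direct consequence of the homogeneity of ReLU: scaling the slope by the bandwidth equals scaling the whole node, and that scalar is absorbed into the trainable outer weight. A quick check (the bandwidth value is an arbitrary illustration):

```python
import numpy as np

relu = lambda t: np.maximum(0.0, t)
rng = np.random.default_rng(5)

w, b, x = rng.normal(size=3), rng.normal(), rng.normal(size=3)
gamma = 2.5   # hypothetical bandwidth value

# Multiplying the slope by gamma vs. dividing the bias by gamma:
# relu(gamma * w @ x + b) == gamma * relu(w @ x + b / gamma),
# so after rescaling the trainable outer weight the two
# parameterizations define the same model class.
lhs = relu(gamma * (w @ x) + b)
rhs = gamma * relu(w @ x + b / gamma)
assert np.isclose(lhs, rhs)
```

For cosine features no such identity holds, which is why the two forms are not equivalent for random Fourier features.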
For all four synthetic data sets, we used 20 random features for each method; for real data sets we used 2000 random features. For the binary classification tasks, we used hinge loss. For the multi-class classification tasks like MNIST and covtype, we chose logistic loss. Even though adding a regularization term is popular in practice, we chose to constrain the 2-norm of the outer weights by a large constant (1000 for synthetic data sets and 10000 for real data sets) as described in Section 5
. The optimization method was plain stochastic gradient descent, and the models were implemented in TensorFlow
(Abadi et al., 2015). The learning rate and bandwidth were screened carefully for both models through grid search.

In Figure 1
, we present the dependence of the two methods on the bandwidth parameter in the screening step. Each point displays the best 5-fold cross-validation accuracy over all learning rates. We can see that the performance of random Fourier features with Gaussian distribution is more sensitive to the choice of bandwidth than that of random ReLU features.
We list the accuracy and training time of the two methods in Tables 1 and 2, respectively. On all data sets, the random ReLU features method requires shorter training time. It outperforms random Fourier features in accuracy on the adult and MNIST data sets. Its performance is similar to random Fourier features on sine, checkboard, and square, but significantly worse on strips and covtype.
The training and testing time (not listed) of random ReLU features is always shorter than that of random Fourier features. This is mainly because about half of the coordinates of random ReLU feature vectors are zero, which we cannot expect of random Fourier features.
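The sparsity claim is easy to verify: by symmetry of the weight and data distributions, a ReLU feature is exactly zero about half the time, while cosine features essentially never are. A small simulation (sizes and distributions are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 10))
W = rng.normal(size=(200, 10))
b = rng.normal(size=200)

relu_feats = np.maximum(0.0, X @ W.T + b)
fourier_feats = np.cos(X @ W.T + b)      # one common RFF form

# Roughly half of the ReLU coordinates are exactly zero, which
# speeds up training and prediction; cosine features are
# essentially never exactly zero.
relu_sparsity = np.mean(relu_feats == 0.0)
fourier_sparsity = np.mean(fourier_feats == 0.0)
assert 0.4 < relu_sparsity < 0.6
assert fourier_sparsity < 0.01
```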
Table 1: Accuracy (standard deviation in parentheses).

| Data set | Fourier | ReLU |
|---|---|---|
| sine | 0.993(0.007) | 0.984(0.005) |
| strips | 0.834(0.084) | 0.732(0.006) |
| square | 0.948(0.038) | 0.934(0.015) |
| checkboard | 0.716(0.045) | 0.743(0.027) |
| adult | 0.838(0.002) | 0.846(0.002) |
| mnist | 0.937(0.001) | 0.951(0.001) |
| covtype | 0.816(0.001) | 0.769(0.002) |
Table 2: Training time (standard deviation in parentheses).

| Data set | Fourier | ReLU |
|---|---|---|
| sine | 1.597(0.050) | 1.564(0.052) |
| strips | 1.598(0.056) | 1.565(0.052) |
| square | 1.769(0.061) | 1.743(0.057) |
| checkboard | 1.581(0.078) | 1.545(0.073) |
| adult | 6.648(0.181) | 5.849(0.216) |
| mnist | 70.438(0.321) | 69.229(1.080) |
| covtype | 125.719(0.356) | 112.613(1.558) |
The universal approximation property, guaranteed generalization performance, and sparser feature vectors support the use of ReLU nodes in the random features method. Investigation of the RKHS of random ReLU features and the larger ReLU feature class reveals the connection between kernel-based learning methods and neural networks. In particular, by considering the ReLU feature class and using the homogeneity property, we obtain 2-layer and multi-layer neural network approximation results with all parameters constrained, extending previous studies of Barron's class. Our experiments on three real data sets show that the performance of random ReLU features is comparable with random Fourier features in many cases, with fewer hyper-parameters to tune and lower training and testing time cost. Further systematic investigation can help us better understand its performance in practice.
The MNIST database of handwritten digits. URL http://yann.lecun.com/exdb/mnist/.

Here is the proof of Proposition 4.
To prove the desired result, we only need to show that can be expressed as in Proposition 3. Note that for any
It can be written as
where is a probability measure supported over such that
Then set and . It is easy to see that the second moment of is bounded and is integrable with respect to . ∎
Here is the proof of Proposition 6.
This is clear because is continuous.
Suppose that and are bounded by . Then is -Lipschitz. For in and in ,
For any in the ball of radius of the RKHS , we have that
Then
∎
Unsurprisingly, as in Barron's work, the key step of the proof is the following sparsification lemma of Maurey.
Assume that are
i.i.d. random variables with values in the unit ball of a Hilbert space. Then with probability greater than
, we have

(4)
Furthermore, there exists in the unit ball such that
(5)
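The content of Maurey's lemma, that an average of m i.i.d. Hilbert-space elements approximates the mean at rate O(1/sqrt(m)), can be illustrated empirically. The sketch below treats vectors in the unit ball of R^d as the Hilbert-space elements (dimensions and sample sizes are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)

# Draw a large population of elements in the unit ball of R^d and
# compare subsample averages of size m against the population mean.
d, n_big = 50, 50_000
F = rng.normal(size=(n_big, d))
F /= np.maximum(1.0, np.linalg.norm(F, axis=1, keepdims=True))  # unit ball
mean = F.mean(axis=0)

errs = []
for m in (10, 100, 1000):
    idx = rng.choice(n_big, size=m, replace=False)
    errs.append(np.linalg.norm(F[idx].mean(axis=0) - mean))

# The error shrinks roughly like 1/sqrt(m), so the m=1000 average is
# much closer to the mean than the m=10 average.
assert errs[0] > errs[2]
```

This is the mechanism that lets a function in the RKHS, an expectation over features, be replaced by a finite combination of sampled features.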
With this lemma, we can prove our main theorem.
If is identically 0, we need only choose to be for all . Now assume that is not identically 0. Then we can always restrict the feature space to and have . Then can be written in the following form
where equals 1 when belongs to the positive set of and -1 when it belongs to the negative set of . Since is a probability measure and
we can apply Lemma 12 and conclude that there exist s such that
By setting and , the conclusion is proved. ∎
The main proof techniques are the same as in Lee et al. (2017), but we need to obtain upper bounds on the weight matrices. Here we prove a more general theorem that directly implies Theorem 9.
Assume that for all , is a compact set with radius in and is the unit ball in . is a probability measure on . belongs to for . Then for any and , there exists a set with
and an -layer neural network where
such that
Moreover, the th layer of the neural network contains nodes. The weight matrix from layer to layer , denoted by , has Frobenius norm bounded by . Each bias term is bounded by .
For , we construct the approximation for by applying Theorem 7 to each component of . First, set .
Set
and the conclusion holds for . Assume that there exist and as described in the theorem. Define
Then by Markov's inequality and the induction hypothesis,
Then we want to construct on to approximate , again by applying Theorem 7 to each component of . Note that the measure we consider here is the push-forward of by , which is a positive measure with total measure less than .
By the triangle inequality,

(6)
(7)
(8)