1. Introduction
We consider a supervised learning problem for a data set $\{(x_n,y_n)\in\mathbb{R}^d\times\mathbb{R} : n=1,\dots,N\}$,
with the data independent identically distributed (i.i.d.) samples from an unknown probability distribution
$\rho$. The distribution $\rho$ is not known a priori but it is accessible from samples of the data. We assume that there exists a function $f:\mathbb{R}^d\to\mathbb{R}$ such that $y_n=f(x_n)+\epsilon_n$, where the noise is represented by i.i.d. random variables $\epsilon_n$ with $\mathbb{E}[\epsilon_n]=0$ and $\mathbb{E}[\epsilon_n^2]<\infty$. We assume that the target function $f$ can be approximated by a single layer neural network which defines an approximation
$\beta(x)=\sum_{k=1}^{K}\hat\beta_k\, s(\omega_k,x),$ (1)
where we use the notation $(\hat\beta,\omega)$ for the parameters of the network,
$\hat\beta=(\hat\beta_1,\dots,\hat\beta_K)\in\mathbb{C}^K$ and $\omega=(\omega_1,\dots,\omega_K)\in\mathbb{R}^{d\times K}$. We consider a particular activation function $s(\omega_k,x)=e^{i\omega_k\cdot x}$ that is also known as Fourier features.
Here $\omega_k\cdot x$ is the Euclidean scalar product in $\mathbb{R}^d$. The goal of the neural network training is to minimize, over the set of parameters $(\hat\beta,\omega)$, the risk functional
$\mathbb{E}_{(x,y)\sim\rho}\big[|y-\beta(x)|^2\big].$ (2)
Since the distribution $\rho$ is not known in practice, the minimization problem is solved for the empirical risk
$\frac{1}{N}\sum_{n=1}^{N}|y_n-\beta(x_n)|^2.$ (3)
The network is said to be overparametrized if the width $K$ is greater than the number of training points, i.e., $K>N$. We shall assume a fixed width $K$ when we study the dependence on the size $N$ of the training data sets. We focus on the reconstruction with the regularized least squares type risk function
$\frac{1}{N}\sum_{n=1}^{N}|y_n-\beta(x_n)|^2+\lambda\sum_{k=1}^{K}|\hat\beta_k|^2.$
The least-squares functional is augmented by the regularization term $\lambda\sum_{k=1}^{K}|\hat\beta_k|^2$ with a Tikhonov regularization parameter $\lambda>0$. For the sake of brevity we often omit the arguments and use the notation $\beta$ for $\beta(x)$. We also use $|\cdot|$ for the Euclidean norm on $\mathbb{R}^d$.
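As a concrete numerical illustration of this regularized empirical risk, the following sketch fits the complex amplitudes of a random Fourier features network by solving the normal equations of the convex amplitude problem. The data set, the width, and all parameter values below are illustrative assumptions, not choices made in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data set: y_n = f(x_n) + noise with f a Gaussian bump (d = 1).
N = 200
x = rng.uniform(-3.0, 3.0, N)
y = np.exp(-x**2 / 2) + 0.01 * rng.normal(size=N)

K = 64        # network width (number of Fourier features)
lam = 1e-3    # Tikhonov regularization parameter
omega = rng.normal(size=K)   # frequencies sampled from N(0, 1)

# Feature matrix with entries s(omega_k, x_n) = exp(i * omega_k * x_n).
S = np.exp(1j * np.outer(x, omega))   # shape (N, K)

# Minimize (1/N) sum_n |y_n - beta(x_n)|^2 + lam * sum_k |beta_k|^2
# over the complex amplitudes: a convex problem solved via its normal equations.
A = S.conj().T @ S / N + lam * np.eye(K)
beta_hat = np.linalg.solve(A, S.conj().T @ y / N)

train_risk = np.mean(np.abs(y - S @ beta_hat)**2)
```

For fixed frequencies the amplitude problem is a ridge regression, which is why a single Hermitian linear solve suffices.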
To approximately reconstruct $f$
from the data based on the least squares method is a common task in statistics and machine learning, cf.
[15], which in a basic setting takes the form of the minimization problem
$\min_{(\hat\beta,\omega)}\Big\{\frac{1}{N}\sum_{n=1}^{N}|\beta(x_n)-y_n|^2+\lambda\sum_{k=1}^{K}|\hat\beta_k|^2\Big\},$ (4)
where
$\beta(x)=\sum_{k=1}^{K}\hat\beta_k e^{i\omega_k\cdot x}$ (5)
represents an artificial neural network with one hidden layer.
Suppose now that the frequencies $\omega_k$ are random, and denote by $\mathbb{E}_\omega[\cdot]$ the conditional expectation with respect to the distribution of $\omega$ conditioned on the data. Since a minimum is always less than or equal to its mean, there holds
$\min_{(\hat\beta,\omega)}\Big\{\frac{1}{N}\sum_{n=1}^{N}|\beta(x_n)-y_n|^2+\lambda\sum_{k=1}^{K}|\hat\beta_k|^2\Big\}\le\mathbb{E}_\omega\Big[\min_{\hat\beta}\Big\{\frac{1}{N}\sum_{n=1}^{N}|\beta(x_n)-y_n|^2+\lambda\sum_{k=1}^{K}|\hat\beta_k|^2\Big\}\Big].$ (6)
The minimization in the right-hand side of (6) is also known as the random Fourier features problem, see [10, 5, 14]. In order to obtain a better bound in (6) we assume that $\omega_k$, $k=1,\dots,K$, are i.i.d. random variables with the common probability distribution $p$ and introduce a further minimization
$\min_{p}\,\mathbb{E}_\omega\Big[\min_{\hat\beta}\Big\{\frac{1}{N}\sum_{n=1}^{N}|\beta(x_n)-y_n|^2+\lambda\sum_{k=1}^{K}|\hat\beta_k|^2\Big\}\Big].$ (7)
An advantage of this splitting into two minimizations is that the inner optimization is a convex problem, so that several robust solution methods are available. The question is: how can the density $p$ in the outer minimization be determined?
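The splitting into an inner convex amplitude problem and an outer choice of frequency density can be sketched numerically as follows; the target, the two candidate densities, and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative target whose Fourier content sits near |omega| = 4,
# so the choice of frequency density matters.
N = 300
x = rng.uniform(-np.pi, np.pi, N)
y = np.cos(4 * x)

K, lam = 64, 1e-4

def inner_risk(omega):
    """Inner minimization: regularized least squares in the amplitudes
    for fixed frequencies (a convex problem); returns the achieved value."""
    S = np.exp(1j * np.outer(x, omega))
    beta = np.linalg.solve(S.conj().T @ S / N + lam * np.eye(K),
                           S.conj().T @ y / N)
    return np.mean(np.abs(y - S @ beta)**2) + lam * np.sum(np.abs(beta)**2)

# Outer level: compare two frequency sampling densities p.
risk_narrow = inner_risk(rng.normal(scale=0.5, size=K))  # mass far from +-4
risk_wide = inner_risk(rng.normal(scale=4.0, size=K))    # covers +-4
```

A density whose mass misses the relevant frequencies yields a much larger inner minimum, which is precisely what the outer minimization over $p$ should avoid.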
The goal of this work is to formulate a systematic method to approximately sample from an optimal distribution $p_*$. The first step is to determine the optimal distribution. Following Barron’s work [4] and [8], we first derive in Section 2
the known error estimate
$\mathbb{E}_\omega\Big[\min_{\hat\beta}\int_{\mathbb{R}^d}|f(x)-\beta(x)|^2\rho(dx)\Big]\le\frac{1}{K}\int_{\mathbb{R}^d}\frac{|\hat f(\omega)|^2}{p(\omega)}\,d\omega$ (8)
based on independent samples $\omega_k$ from the distribution $p$. Then, as in importance sampling, it is shown that the right-hand side is minimized by choosing $p=p_*=|\hat f|/\|\hat f\|_{L^1}$, where $\hat f$
is the Fourier transform of
$f$. Our next step is to formulate an adaptive method that approximately generates independent samples from the density $p_*$, thereby following the general convergence (8). We propose to use the Metropolis sampler:
given frequencies $\omega_k$ with corresponding amplitudes $\hat\beta_k$, a proposal $\omega_k'$ is suggested and the corresponding amplitudes $\hat\beta_k'$ are determined by the minimum in (7); then
the Metropolis test is, for each $k$, to accept $\omega_k'$ with probability $\min\big(1,|\hat\beta_k'|^\gamma/|\hat\beta_k|^\gamma\big)$.
The choice of the Metropolis criterion and the selection of the exponent $\gamma$ are explained in Remark 3.2. This adaptive algorithm (Algorithm 1) is motivated mainly by two properties based on the regularized empirical measure related to the amplitudes $\hat\beta_k$:

Property (a) implies that the optimal density $p_*$ will asymptotically equidistribute the amplitudes, i.e., $|\hat\beta_k|$ becomes constant in $k$ since $|\hat f(\omega_k)|/p_*(\omega_k)=\|\hat f\|_{L^1}$ is constant.
The proposed adaptive method aims to equidistribute the amplitudes: if $|\hat\beta_k|$ is large, more frequencies will be sampled Metropolis-wise in the neighborhood of $\omega_k$, and if $|\hat\beta_k|$ is small, then fewer frequencies will be sampled in the neighborhood. Algorithm 1 includes the dramatic simplification to compute all amplitudes in one step for the proposed frequencies, so that the computationally costly step of solving the convex minimization problem for the amplitudes is not repeated for each individual Metropolis test. A reason that this simplification works is the asymptotic independence shown in Proposition 3.1. We note that the regularized amplitude measure is impractical to compute in high dimension $d$. Therefore Algorithm 1 uses the amplitudes $\hat\beta_k$ instead, and consequently Proposition 3.1 serves only as a motivation that the algorithm can work.
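A compressed sketch of this adaptive Metropolis resampling follows, with all amplitudes recomputed in one least squares solve per sweep. The proposal step size, the exponent gamma, the number of sweeps, and the data are illustrative assumptions rather than the parameter choices of Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data with frequency content at +-3 and +-5.
N = 400
x = rng.uniform(-np.pi, np.pi, N)
y = np.cos(3 * x) + 0.5 * np.sin(5 * x)

K, lam = 64, 1e-4
delta, gamma, sweeps = 2.0, 3.0, 50   # assumed proposal step, exponent, sweeps

def amplitudes(omega):
    """All K amplitudes from one regularized least squares solve."""
    S = np.exp(1j * np.outer(x, omega))
    return np.linalg.solve(S.conj().T @ S / N + lam * np.eye(K),
                           S.conj().T @ y / N)

def risk(omega):
    S = np.exp(1j * np.outer(x, omega))
    return np.mean(np.abs(y - S @ amplitudes(omega))**2)

omega = rng.normal(size=K)   # initial frequencies N(0,1): poor for this target
risk_start = risk(omega)
beta = amplitudes(omega)
for _ in range(sweeps):
    proposal = omega + delta * rng.normal(size=K)   # random walk proposal
    beta_prop = amplitudes(proposal)                # one solve for all proposals
    # Accept each frequency with probability min(1, (|beta'_k|/|beta_k|)^gamma).
    accept = rng.uniform(size=K) < (np.abs(beta_prop) / np.abs(beta))**gamma
    omega = np.where(accept, proposal, omega)
    beta = amplitudes(omega)
risk_end = risk(omega)
```

Frequencies whose proposed amplitudes are larger are kept more often, so the sample drifts toward regions where the Fourier content of the data is large.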
In some sense, the adaptive random features Metropolis method is a stochastic generalization of deterministic adaptive computational methods for differential equations, where the optimal efficiency is obtained for equidistributed error indicators, pioneered in [2]. In the deterministic case, additional degrees of freedom are added where the error indicators are large, e.g., by subdividing finite elements or time steps. The random features Metropolis method analogously adds frequency samples where the indicators $|\hat\beta_k|$
are large. A common setting is, for a fixed number of data points $N$, to find the number of Fourier features $K$
with approximation errors similar to those of kernel ridge regression. Previous such results on kernel learning improving the sampling for random Fourier features are presented, e.g., in
[3], [17], [9] and [1]. Our focus is somewhat different, namely, for a fixed number of Fourier features, to find an optimal method by adaptively adjusting the frequency sampling density for each data set. In [17] the Fourier features are adaptively sampled based on a density parametrized as a linear combination of Gaussians. The works [3] and [9] determine the optimal density as a leverage score for sampling random features, based on a singular value decomposition of an integral operator related to the reproducing kernel Hilbert space, and formulate a method to optimally resample given samples. Our adaptive random feature method, on the contrary, is not based on a parametric description or resampling, and we are not aware of other nonparametric adaptive methods generating samples for random Fourier features for general kernels. The work
[1] studies how to optimally choose the number of Fourier features for a given number of data points and provides upper and lower error bounds. In addition, [1] presents a method to effectively sample from the leverage score in the case of Gaussian kernels. We demonstrate computational benefits of the proposed adaptive algorithm by including a simple example that provides explicitly the computational complexity of the adaptive sampling Algorithm 1. Numerical benchmarks in Section 5 then further document gains in efficiency and accuracy in comparison with the standard random Fourier features that use a fixed distribution of frequencies.
Although our analysis is carried out for the specific activation function $s(\omega,x)=e^{i\omega\cdot x}$, thus directly related to random Fourier features approximations, we note that in the numerical experiments (see Experiment 5 in Section 5) we also tested the sigmoid activation function
often used in the definition of neural networks. With such a change of the activation function the concept of sampling frequencies turns into sampling weights. Numerical results in Section 5 suggest that Algorithm 1 performs well also in this case. A detailed study of a more general class of activation functions is the subject of ongoing work.
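To indicate how a change of activation function affects only the feature matrix, the sketch below fits the same illustrative target with real cosine features and with sigmoid features; the target, the weight and bias distributions, and the parameters are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative 1-d data set.
N = 300
x = rng.uniform(-3.0, 3.0, N)
y = np.tanh(2 * x) + 0.3 * np.exp(-x**2)

K, lam = 64, 1e-4

def fit_risk(activation):
    """Regularized least squares over the amplitudes for fixed random
    weights and biases; only the activation changes between the two fits."""
    w = rng.normal(scale=2.0, size=K)
    b = rng.uniform(-3.0, 3.0, K)
    S = activation(np.outer(x, w) + b)   # feature matrix, shape (N, K)
    beta = np.linalg.solve(S.T @ S / N + lam * np.eye(K), S.T @ y / N)
    return np.mean((y - S @ beta)**2)

risk_cos = fit_risk(np.cos)                          # Fourier-type features
risk_sig = fit_risk(lambda z: 1 / (1 + np.exp(-z)))  # sigmoid features
```

In both cases the inner amplitude problem stays a ridge regression; only the sampled parameters change their interpretation from frequencies to weights.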
Theoretical motivations of the algorithm are given in Sections 2 and 3. In Section 3 we formulate and prove the weak convergence of the scaled amplitudes. In Section 2 we derive the optimal density for sampling the frequencies, under the assumption that the frequencies are independent. Section 4 describes the algorithms. Practical consequences of the theoretical results and numerical tests with different data sets are described in Section 5.
2. Optimal frequency distribution
2.1. Approximation rates using a Monte Carlo method
The purpose of this section is to derive a bound for
$\min_{\hat\beta\in\mathbb{C}^K}\int_{\mathbb{R}^d}\Big|f(x)-\sum_{k=1}^{K}\hat\beta_k e^{i\omega_k\cdot x}\Big|^2\rho(dx)$ (9)
and apply it to estimating the approximation rate for random Fourier features.
The Fourier transform
$\hat f(\omega)=(2\pi)^{-d}\int_{\mathbb{R}^d}f(x)e^{-i\omega\cdot x}\,dx$
has the inverse representation
$f(x)=\int_{\mathbb{R}^d}\hat f(\omega)e^{i\omega\cdot x}\,d\omega,$
provided $f$ and $\hat f$ are $L^1(\mathbb{R}^d)$ functions. We assume $\omega_k$, $k=1,\dots,K$, are independent samples from a probability density $p$. Then the Monte Carlo approximation of this representation yields the neural network approximation with the estimator defined by the empirical average
$\beta(x)=\frac{1}{K}\sum_{k=1}^{K}\frac{\hat f(\omega_k)}{p(\omega_k)}e^{i\omega_k\cdot x}.$ (10)
To assess the quality of this approximation we study the variance of the estimator
$\beta(x)$. By construction and the i.i.d. sampling of $\omega_k$ the estimator is unbiased, that is,
$\mathbb{E}_\omega[\beta(x)]=f(x),$ (11)
and we define
Using this Monte Carlo approximation we obtain a bound on the error which reveals a rate of convergence with respect to the number of features $K$.
Theorem 2.1.
Suppose the frequencies $\omega_k$, $k=1,\dots,K$, are i.i.d. random variables with the common distribution $p$; then
$\mathbb{E}_\omega\Big[\min_{\hat\beta}\int_{\mathbb{R}^d}|f(x)-\beta(x)|^2\rho(dx)\Big]\le\frac{1}{K}\int_{\mathbb{R}^d}\frac{|\hat f(\omega)|^2}{p(\omega)}\,d\omega$ (12)
and
(13) 
If there is no measurement error, i.e., $\epsilon_n=0$ and $y_n=f(x_n)$, then
$\mathbb{E}_\omega\Big[\min_{\hat\beta}\frac{1}{N}\sum_{n=1}^{N}|\beta(x_n)-y_n|^2\Big]\le\frac{1}{K}\int_{\mathbb{R}^d}\frac{|\hat f(\omega)|^2}{p(\omega)}\,d\omega.$ (14)
Proof.
Direct calculation shows that the variance of the Monte Carlo approximation satisfies
$\mathbb{E}_\omega\big[|f(x)-\beta(x)|^2\big]=\frac{1}{K}\Big(\int_{\mathbb{R}^d}\frac{|\hat f(\omega)|^2}{p(\omega)}\,d\omega-|f(x)|^2\Big)$ (15)
and since a minimum is less than or equal to its average we obtain the random feature error estimate in the case without a measurement error, i.e., $\epsilon_n=0$ and $y_n=f(x_n)$.
Including the measurement error yields, after a straightforward calculation, an additional term.
∎
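The unbiased estimator (10) and its Monte Carlo sampling error can be checked numerically in one dimension; the target and the sampling density below are illustrative, chosen so that both the Fourier transform and the density are explicit.

```python
import numpy as np

rng = np.random.default_rng(3)

# f(x) = exp(-x^2/2); with the convention f(x) = int fhat(w) exp(iwx) dw
# its Fourier transform is fhat(w) = exp(-w^2/2) / sqrt(2*pi).
def fhat(w):
    return np.exp(-w**2 / 2) / np.sqrt(2 * np.pi)

def p(w):
    return np.exp(-w**2 / 2) / np.sqrt(2 * np.pi)  # N(0,1) sampling density

def mc_estimate(x, K):
    """Estimator (10): (1/K) sum_k fhat(w_k) exp(i w_k x) / p(w_k), w_k ~ p."""
    w = rng.normal(size=K)
    return np.mean(fhat(w) * np.exp(1j * w * x) / p(w)).real

x0 = 0.7
exact = np.exp(-x0**2 / 2)
err_small = abs(mc_estimate(x0, 10) - exact)
err_large = abs(mc_estimate(x0, 10_000) - exact)
```

For this Gaussian target the chosen density is proportional to |fhat|, i.e., it coincides with the optimal density of Section 2.3, so the ratio fhat/p is constant and only the oscillating factor contributes to the variance.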
2.2. Comments on the convergence rate and its complexity
The bounds (14) and (13) reveal the rate of convergence with respect to $K$. To demonstrate the computational complexity and the importance of using the adaptive sampling of frequencies, we fix the approximated function to be a simple Gaussian
and we consider two cases, described in the examples below. Furthermore, we choose a particular distribution $p$ by assuming the frequencies $\omega_k$, $k=1,\dots,K$, are drawn
from the standard normal distribution
(i.e., the Gaussian density with mean zero and variance one). Example I (large ). In the first example the choice is such that the integral $\int_{\mathbb{R}^d}|\hat f(\omega)|^2/p(\omega)\,d\omega$ is unbounded. The error estimate (14) therefore indicates no convergence. Algorithm 1, on the other hand, has the optimal convergence rate for this example.
Example II (small ). In the second example the choice is such that the convergence rate in (14) holds but with a large constant factor, while the rate is $\mathcal{O}(K^{-1})$ with a moderate constant for the optimal distribution $p_*$, as $K\to\infty$. The purpose of the adaptive random feature algorithm is to avoid the large factor.
To have the loss function bounded by a given tolerance
therefore requires that the non-adaptive random feature method uses a large number of features $K$, and the computational work to solve the linear least squares problem grows correspondingly with $K$. In contrast, the proposed adaptive random features Metropolis method solves the least squares problem several times with a smaller $K$ to obtain the same bound for the loss. The number of Metropolis steps is asymptotically determined by the diffusion approximation in [11]. Therefore the computational work is smaller for the adaptive method.
2.3. Optimal Monte Carlo sampling.
This section determines the optimal density for independent Monte Carlo samples in (12) by minimizing, with respect to $p$, the right-hand side in the variance estimate (12).
Theorem 2.2.
The probability density
$p_*(\omega)=\frac{|\hat f(\omega)|}{\|\hat f\|_{L^1(\mathbb{R}^d)}}$ (16)
is the solution of the minimization problem
$\min_{p}\int_{\mathbb{R}^d}\frac{|\hat f(\omega)|^2}{p(\omega)}\,d\omega=\int_{\mathbb{R}^d}\frac{|\hat f(\omega)|^2}{p_*(\omega)}\,d\omega=\|\hat f\|_{L^1(\mathbb{R}^d)}^2.$ (17)
Proof.
The perturbation $p_\varepsilon=p+\varepsilon q$, with $\int_{\mathbb{R}^d}q(\omega)\,d\omega=0$, preserves the normalization $\int_{\mathbb{R}^d}p_\varepsilon(\omega)\,d\omega=1$ for any such $q$. Define for any such $q$ and $\varepsilon$ close to zero
$I(\varepsilon):=\int_{\mathbb{R}^d}\frac{|\hat f(\omega)|^2}{p_\varepsilon(\omega)}\,d\omega.$
At the optimum we have
$0=I'(0)=-\int_{\mathbb{R}^d}\frac{|\hat f(\omega)|^2}{p(\omega)^2}\,q(\omega)\,d\omega,$
and the optimality condition for all admissible perturbations $q$ implies that $|\hat f|^2/p^2$ is constant. Consequently the optimal density becomes
$p_*=\frac{|\hat f|}{\|\hat f\|_{L^1(\mathbb{R}^d)}}.$
∎
We note that the optimal density $p_*$ does not depend on the number of Fourier features, $K$, or the number of data points, $N$, in contrast to the optimal density for the least squares problem (28) derived in [3].
As mentioned at the beginning of this section, sampling from the distribution $p_*$ leads to the tight upper bound on the approximation error in (14).
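The minimization property (17) can be verified by numerical quadrature in one dimension; the Gaussian target and the alternative wider density below are illustrative choices.

```python
import numpy as np

# Gaussian target: fhat(w) = exp(-w^2/2)/sqrt(2*pi), so ||fhat||_{L^1} = 1.
w = np.linspace(-20.0, 20.0, 400001)
dw = w[1] - w[0]
fhat = np.exp(-w**2 / 2) / np.sqrt(2 * np.pi)

def variance_bound(p):
    """Quadrature of the right-hand side factor int |fhat(w)|^2 / p(w) dw."""
    return np.sum(fhat**2 / p) * dw

l1 = np.sum(fhat) * dw                            # ||fhat||_{L^1}
p_opt = fhat / l1                                 # optimal density (16)
p_wide = np.exp(-w**2 / 8) / np.sqrt(8 * np.pi)   # N(0,4), wider than optimal

b_opt = variance_bound(p_opt)    # equals ||fhat||_{L^1}^2 = 1 here
b_wide = variance_bound(p_wide)  # strictly larger (4/sqrt(7) for this choice)
```

Any density other than (16) inflates the variance factor, which is the quantitative reason for aiming the adaptive sampler at $p_*$.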
3. Asymptotic behavior of amplitudes
The optimal density can be related to data as follows: by considering the problem (9) and letting $\hat\beta$ be a least squares minimizer of
(18) 
the vanishing gradient at a minimum yields the normal equations, for $k=1,\dots,K$,
Thus if the data points are distributed according to a distribution with a density we have
(19) 
and the normal equations can be written in the Fourier space as
(20) 
Given the solution of the normal equation (20) we define
(21) 
Given a sequence of samples drawn independently from a density we impose the following assumptions:

there exists a constant such that
(A1) for all ,

as we have
(A2) 
there is a bounded open set such that
(A3) 
the sequence is dense in the support of , i.e.
(A4)
We note that (A4) almost follows from (A3), since (A3) implies that the density
has a bounded first moment. Hence the law of large numbers implies that with probability one the sequence
is dense in the support of the density. In order to treat the limiting behaviour of the amplitudes as $N\to\infty$ we introduce the empirical measure (22)
Thus we have for
(23) 
so that the normal equations (20) take the form
(24) 
By the assumption (A1) the empirical measures are uniformly bounded in the total variation norm
We note that by (A3) the measures have their support in a fixed compact set. We obtain the weak convergence result stated as the following proposition.
Proposition 3.1.
Let be the solution of the normal equation (20) and the empirical measures defined by (22). Suppose that the assumptions (A1), (A2), (A3), and (A4) hold, and that the density of data, , has support on all of and satisfies , then
(25) 
where the functions are non-negative smooth mollifiers with support in a ball and with unit mass.
Proof.
To simplify the presentation we introduce
The proof consists of three steps:
Step 1. As the functions are standard mollifiers, we have, for a fixed mollification parameter, that the smooth functions (we omit the parameter in the notation)
have derivatives uniformly bounded with respect to $N$. Let the set be given by the Minkowski sum. By compactness, see [6], there is a converging subsequence of functions, i.e., converging as $N\to\infty$. Since, as a consequence of the assumption (A3), the supports are contained in a fixed compact set, the limit has its support in this set. Hence the limit can be extended to zero on the complement. Thus we obtain
(26) 
for all .
Step 2. The normal equations (24) can be written as a perturbation of the convergence (26) using that we have
Thus we rewrite the term in (24) as
now considering a general point instead of and the change of measure from to , and by Taylor’s theorem
where
since by assumption the set is dense in the support of the density. Since the remainder vanishes as $N\to\infty$ by assumption (A2), the normal equation (24) implies that the limit is determined by
(27) 
We have here used that the function is continuous, and the denseness of the sequence.
Step 3. From the assumption (A1) all the measures are uniformly bounded in the total variation norm and supported on a compact set; therefore there is a weakly converging subsequence, i.e., for all test functions
This subsequence of can be chosen as a subsequence of the converging sequence . Consequently we have
As in we obtain by (27)
and we conclude, by the inverse Fourier transform, that
for and in the Schwartz class. If the support of is we obtain that . ∎
The approximation in Proposition 3.1 is in the sense of the limit of a large data set, $N\to\infty$. Then by the result of the proposition the regularized empirical measure related to the amplitudes satisfies
which shows that converges weakly to as and we have
. We remark that this argument gives a heuristic justification for the proposed adaptive algorithm to work; in particular, it explains the idea behind the choice of the likelihood ratio in the Metropolis accept-reject criterion.
Remark 3.2.
By Proposition 3.1 converges weakly to as . If it also converged strongly, the asymptotic sampling density for in the random feature Metropolis method would satisfy which has the fixed point solution