# Adaptive random Fourier features with Metropolis sampling

The supervised learning problem to determine a neural network approximation ℝ^d ∋ x ↦ ∑_{k=1}^K β̂_k e^{iω_k·x} with one hidden layer is studied as a random Fourier features algorithm. The Fourier features, i.e., the frequencies ω_k ∈ ℝ^d, are sampled using an adaptive Metropolis sampler. The Metropolis test accepts proposal frequencies ω_k′, having corresponding amplitudes β̂_k′, with the probability min{1, (|β̂_k′|/|β̂_k|)^γ}, for a certain positive parameter γ determined by minimizing the approximation error for given computational work. This adaptive, non-parametric stochastic method leads asymptotically, as K → ∞, to equidistributed amplitudes |β̂_k|, analogous to deterministic adaptive algorithms for differential equations. The equidistributed amplitudes are shown to asymptotically correspond to the optimal density for independent samples in random Fourier features methods. Numerical evidence demonstrates the approximation properties and efficiency of the proposed algorithm, which is tested both on synthetic data and a real-world high-dimensional benchmark.


## 1. Introduction

We consider a supervised learning problem for a data set {(xₙ, yₙ) : n = 1, …, N}, with the data points (xₙ, yₙ) ∈ ℝ^d × ℝ independent identically distributed (i.i.d.) samples from an unknown probability distribution ρ̃(dx, dy). The distribution ρ̃ is not known a priori but is accessible through samples of the data. We assume that there exists a function f : ℝ^d → ℝ such that

  yₙ = f(xₙ) + εₙ, n = 1, …, N,

where the noise is represented by i.i.d. random variables εₙ with mean zero and finite variance.

We assume that the target function f can be approximated by a single-layer neural network, which defines an approximation

  (1) β(x; ω, β̂) = ∑_{k=1}^K β̂_k s(ω_k, x),

where we use the notation ω = (ω_1, …, ω_K) and β̂ = (β̂_1, …, β̂_K) for the parameters of the network. We consider the particular activation function

  s(ω, x) = e^{iω·x},

also known as Fourier features. Here ω·x is the Euclidean scalar product in ℝ^d. The goal of the neural network training is to minimize, over the set of parameters (β̂, ω), the risk functional

  (2) L(β̂, ω) = E_ρ̃[ℓ(β(x; β̂, ω), y)] ≡ ∫ ℓ(β(x; β̂, ω), y) ρ̃(dx, dy).

Since the distribution ρ̃ is not known in practice, the minimization problem is solved for the empirical risk

  (3) L̂_N(β̂, ω) = (1/N) ∑_{n=1}^N ℓ(β(xₙ; β̂, ω), yₙ).

The network is said to be over-parametrized if the width K is greater than the number of training points, i.e., K > N. We shall assume a fixed width K, with K ≤ N, when we study the dependence on the size of training data sets. We focus on the reconstruction with the regularized least-squares type risk function

  ℓ(β(xₙ; β̂, ω), yₙ) = |yₙ − β(xₙ; β̂, ω)|² + λ ∑_{k=1}^K |β̂_k|².

The least-squares functional is augmented by a regularization term with Tikhonov regularization parameter λ > 0. For the sake of brevity we often omit the arguments and write β(x) for β(x; β̂, ω). We also use |·| for the Euclidean norm on ℂ^K.
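For concreteness, the regularized empirical risk (3) with this loss can be written down directly; a minimal NumPy sketch, where the function and variable names are our own:

```python
import numpy as np

def empirical_risk(beta, omega, X, y, lam):
    """Regularized empirical risk (3) with the least-squares loss above;
    beta: (K,) complex amplitudes, omega: (K, d) frequencies."""
    pred = np.exp(1j * X @ omega.T) @ beta   # network output beta(x_n)
    return np.mean(np.abs(y - pred) ** 2) + lam * np.sum(np.abs(beta) ** 2)

# usage: the zero network has risk equal to the mean of y^2
rng = np.random.default_rng(7)
X, y = rng.standard_normal((50, 2)), rng.standard_normal(50)
omega = rng.standard_normal((3, 2))
risk0 = empirical_risk(np.zeros(3, dtype=complex), omega, X, y, lam=0.1)
```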

To approximately reconstruct f from the data by the least-squares method is a common task in statistics and machine learning, cf. [15], which in a basic setting takes the form of the minimization problem

  (4) min_{β ∈ N_K} { E_ρ̃[|yₙ − β(xₙ)|²] + λ ∑_{k=1}^K |β̂_k|² },

where

  (5) N_K := { β(x) = ∑_{k=1}^K β̂_k s(ω_k, x) }

represents an artificial neural network with one hidden layer.

Suppose now that the frequencies ω are random, and denote by E_ω[·] the conditional expectation with respect to the distribution of ω conditioned on the data. Since a minimum is always less than or equal to its mean, there holds

  (6) min_{β̂, ω} { E_ρ̃[|yₙ − β(xₙ; β̂, ω)|²] + λ|β̂|² } ≤ E_ω[ min_{β̂ ∈ ℂ^K} { E_ρ̃[|yₙ − β(xₙ; β̂, ω)|²] + λ|β̂|² } ].

The minimization on the right-hand side of (6) is also known as the random Fourier features problem, see [10, 5, 14]. In order to obtain a better bound in (6) we assume that ω_k, k = 1, …, K, are i.i.d. random variables with a common probability density p and introduce the further minimization

  (7) min_p E_ω[ min_{β̂ ∈ ℂ^K} { E_ρ̃[|yₙ − β(xₙ; β̂, ω)|²] + λ|β̂|² } ].

An advantage of this splitting into two minimizations is that the inner optimization is a convex problem, so that several robust solution methods are available. The question is: how can the density p in the outer minimization be determined?
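Since the inner optimization in (7) is convex, the amplitudes for fixed frequencies follow from a single linear solve of the normal equations (SᴴS/N + λI)β̂ = Sᴴy/N, with feature matrix S_{nk} = e^{iω_k·x_n}. A minimal sketch, with names of our own choosing:

```python
import numpy as np

def fit_amplitudes(omega, X, y, lam):
    """Inner convex problem of (7): solve (S^H S / N + lam*I) beta = S^H y / N
    for the amplitudes, with feature matrix S[n, k] = e^{i w_k . x_n}."""
    N = len(y)
    S = np.exp(1j * X @ omega.T)
    A = S.conj().T @ S / N + lam * np.eye(omega.shape[0])
    return np.linalg.solve(A, S.conj().T @ y / N)

# usage: cos(x) = (e^{ix} + e^{-ix})/2, so with frequencies +1 and -1 the
# two amplitudes should both come out close to 1/2
rng = np.random.default_rng(0)
X = rng.uniform(-np.pi, np.pi, size=(4000, 1))
y = np.cos(X[:, 0])
beta = fit_amplitudes(np.array([[1.0], [-1.0]]), X, y, lam=1e-8)
```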

The goal of this work is to formulate a systematic method to approximately sample from an optimal distribution p*. The first step is to determine this optimal distribution. Following Barron's work [4] and [8], we first derive in Section 2 the known error estimate

  (8) E_ω[ min_{β̂ ∈ ℂ^K} { E_ρ̃[|yₙ − β(xₙ)|²] + λ|β̂|² } ] ≤ K^{−1} ( (2π)^{−d} ∫_{ℝ^d} |f̂(ω)|²/p(ω) dω − f²(x) ),

based on independent samples ω_k from the distribution p. Then, as in importance sampling, it is shown that the right-hand side is minimized by choosing p proportional to |f̂|, where f̂ is the Fourier transform of f. Our next step is to formulate an adaptive method that approximately generates independent samples from the density p*, thereby attaining the general convergence (8). We propose to use the Metropolis sampler:

• given frequencies ω = (ω_1, …, ω_K) with corresponding amplitudes β̂ = (β̂_1, …, β̂_K), a proposal ω′ is suggested and its corresponding amplitudes β̂′ are determined by the minimum in (7), then

• the Metropolis test is, for each k, to accept ω_k′ with probability min{1, (|β̂_k′|/|β̂_k|)^γ}.
The choice of the Metropolis criterion and the selection of γ are explained in Remark 3.2. This adaptive algorithm (Algorithm 1) is motivated mainly by two properties of the regularized empirical measure related to the amplitudes ζ̂_k of Section 3, where ẑ_k := K p(ω_k) ζ̂_k:

• (a) The quantities |ẑ_k| converge to |f̂(ω_k)| asymptotically, as K → ∞, as shown in Proposition 3.1. For the proof of Proposition 3.1 we consider a simplified setting where the support of the x-data is all of ℝ^d.

• (b) Property (a) implies that the optimal density p* ∝ |f̂| will asymptotically equidistribute the amplitudes, i.e., |ζ̂_k| = |ẑ_k|/(K p*(ω_k)) becomes constant in k since |f̂(ω_k)|/p*(ω_k) is constant.

The proposed adaptive method aims to equidistribute the amplitudes |β̂_k|: if |β̂_k| is large, more frequencies will be sampled Metropolis-wise in the neighborhood of ω_k, and if |β̂_k| is small, fewer frequencies will be sampled in that neighborhood. Algorithm 1 includes the dramatic simplification of computing all amplitudes in one step for the proposed frequencies, so that the computationally costly step of solving the convex minimization problem for the amplitudes is not done for each individual Metropolis test. A reason this simplification works is the asymptotic independence shown in Proposition 3.1. We note that the regularized amplitude measure is impractical to compute in high dimension d. Therefore Algorithm 1 uses the amplitudes β̂_k instead, and consequently Proposition 3.1 serves only as a motivation that the algorithm can work.
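The steps above can be sketched as follows. This is our own minimal reading of the method, not the paper's exact Algorithm 1: the proposal scale delta, the number of sweeps M, and the parameter defaults are illustrative choices, while the single least-squares solve for all proposed frequencies reflects the simplification just discussed.

```python
import numpy as np

def fit_amplitudes(omega, X, y, lam):
    # inner convex problem of (7): regularized least squares for the amplitudes
    N = len(y)
    S = np.exp(1j * X @ omega.T)                     # S[n, k] = e^{i w_k . x_n}
    A = S.conj().T @ S / N + lam * np.eye(omega.shape[0])
    return np.linalg.solve(A, S.conj().T @ y / N)

def adaptive_metropolis_rff(X, y, K, M=10, gamma=3.0, delta=0.5, lam=1e-4, seed=0):
    """Sketch of adaptive Metropolis sampling of frequencies: propose
    omega' = omega + delta*xi, recompute all amplitudes in one least-squares
    solve, and accept frequency-wise with prob. min{1, (|b'_k|/|b_k|)^gamma}."""
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((K, X.shape[1]))     # initial frequencies
    beta = fit_amplitudes(omega, X, y, lam)
    for _ in range(M):
        omega_prop = omega + delta * rng.standard_normal(omega.shape)
        beta_prop = fit_amplitudes(omega_prop, X, y, lam)
        ratio = (np.abs(beta_prop) + 1e-300) / (np.abs(beta) + 1e-300)
        accept = rng.random(K) < np.minimum(1.0, ratio ** gamma)
        omega[accept] = omega_prop[accept]
        beta = fit_amplitudes(omega, X, y, lam)      # amplitudes for accepted mix
    return omega, beta

# usage: learn f(x) = cos(2x) from samples
rng = np.random.default_rng(5)
X = rng.uniform(-np.pi, np.pi, size=(400, 1))
y = np.cos(2 * X[:, 0])
omega, beta = adaptive_metropolis_rff(X, y, K=20)
pred = np.real(np.exp(1j * X @ omega.T) @ beta)
```

For real-valued targets one uses the real part of the network output, as above; the selection of γ is discussed in Remark 3.2.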

In some sense, the adaptive random features Metropolis method is a stochastic generalization of deterministic adaptive computational methods for differential equations, where optimal efficiency is obtained for equidistributed error indicators, pioneered in [2]. In the deterministic case, additional degrees of freedom are added where the error indicators are large, e.g., by subdividing finite elements or time steps. The random features Metropolis method analogously adds frequency samples where the indicators |β̂_k| are large.

A common setting is, for a fixed number of data points N, to find the number of Fourier features K giving approximation errors similar to kernel ridge regression. Previous results on kernel learning that improve the sampling for random Fourier features are presented, e.g., in [3], [17], [9] and [1]. Our focus is somewhat different: for a fixed number of Fourier features, find an optimal method by adaptively adjusting the frequency sampling density for each data set. In [17] the Fourier features are adaptively sampled based on a density parametrized as a linear combination of Gaussians. The works [3] and [9] determine the optimal density as a leverage score for sampling random features, based on a singular value decomposition of an integral operator related to the reproducing kernel Hilbert space, and formulate a method to optimally resample given samples. Our adaptive random features method, on the contrary, is not based on a parametric description or resampling, and we are not aware of other non-parametric adaptive methods generating samples for random Fourier features for general kernels. The work [1] studies how to optimally choose the number of Fourier features for a given number of data points and provides upper and lower error bounds. In addition, [1] presents a method to efficiently sample from the leverage score in the case of Gaussian kernels.

We demonstrate the computational benefits of the proposed adaptive algorithm by including a simple example that makes explicit the computational complexity of the adaptive sampling Algorithm 1. Numerical benchmarks in Section 5 then further document gains in efficiency and accuracy in comparison with standard random Fourier features that use a fixed distribution of frequencies.

Although our analysis is carried out for the specific activation function s(ω, x) = e^{iω·x}, and is thus directly related to random Fourier features approximations, we note that in the numerical experiments (see Experiment 5 in Section 5) we also tested the activation function

  s(ω, x) = 1/(1 + e^{−ω·x}),

often used in the definition of neural networks and called the sigmoid activation. With such a change of the activation function the concept of sampling frequencies turns into sampling weights. Numerical results in Section 5 suggest that Algorithm 1 performs well in this case too. A detailed study of a more general class of activation functions is the subject of ongoing work.

Theoretical motivations of the algorithm are given in Sections 2 and 3. In Section 3 we formulate and prove the weak convergence of the scaled amplitudes ẑ_k. In Section 2 we derive the optimal density for sampling the frequencies, under the assumption that the frequencies are independent samples from a common density. Section 4 describes the algorithms. Practical consequences of the theoretical results and numerical tests with different data sets are described in Section 5.

## 2. Optimal frequency distribution

### 2.1. Approximation rates using a Monte Carlo method

The purpose of this section is to derive a bound for

  (9) E_ω[ min_{β̂ ∈ ℂ^K} { E_ρ̃[|β(x) − y|²] + λ|β̂|² } ]

and apply it to estimating the approximation rate for random Fourier features.

The Fourier transform

  f̂(ω) := (2π)^{−d/2} ∫_{ℝ^d} f(x) e^{−iω·x} dx

has the inverse representation

  f(x) = (2π)^{−d/2} ∫_{ℝ^d} f̂(ω) e^{iω·x} dω,

provided f and f̂ are L¹(ℝ^d) functions. We assume ω_k, k = 1, …, K, are independent samples from a probability density p. Then the Monte Carlo approximation of this representation yields the neural network approximation with the estimator α defined by the empirical average

  (10) α(x, ω) = (1/K) ∑_{k=1}^K (2π)^{−d/2} (f̂(ω_k)/p(ω_k)) e^{iω_k·x}.

To assess the quality of this approximation we study the variance of the estimator α(x, ω). By construction and the i.i.d. sampling of ω_k, the estimator is unbiased, that is,

  (11) E_ω[α(x, ω)] = f(x),

and we define

  α̂_k := (2π)^{−d/2} f̂(ω_k) / (K p(ω_k)).

Using this Monte Carlo approximation we obtain a bound on the error which reveals a rate of convergence with respect to the number of features K.
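The unbiasedness (11) can be checked numerically in one dimension, where under the (2π)^{−d/2} convention above the pair f(x) = e^{−x²/2}, f̂(ω) = e^{−ω²/2} is explicit; the standard-normal sampling density and all names below are our own choices:

```python
import numpy as np

rng = np.random.default_rng(1)

f = lambda x: np.exp(-x**2 / 2)                       # target, d = 1
f_hat = lambda w: np.exp(-w**2 / 2)                   # its Fourier transform
p = lambda w: np.exp(-w**2 / 2) / np.sqrt(2 * np.pi)  # frequency density N(0,1)

def alpha(x, omega):
    # Monte Carlo estimator (10) of the inverse Fourier representation
    return np.mean(f_hat(omega) / p(omega) * np.exp(1j * omega * x)) / np.sqrt(2 * np.pi)

K, R, x0 = 64, 2000, 0.7
# averaging R independent K-sample estimators: close to f(x0) by unbiasedness
est = np.mean([alpha(x0, rng.standard_normal(K)) for _ in range(R)])
```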

###### Theorem 2.1.

Suppose the frequencies ω_k, k = 1, …, K, are i.i.d. random variables with common probability density p. Then

  (12) Var_ω[α(x, ω)] = (1/K) E_ω[ |f̂(ω)|² / ((2π)^d p(ω)²) − f²(x) ],

and

  (13) E_ω[ min_{β̂ ∈ ℂ^K} { E_ρ̃[|β(x) − y|²] + λ|β̂|² } ] ≤ ((1+λ)/K) E_ω[ |f̂(ω)|² / ((2π)^d p(ω)²) ] + E_ρ̃[|y − f(x)|²].

If there is no measurement error, i.e., y = f(x), then

  (14) E_ω[ min_{β̂ ∈ ℂ^K} { E_ρ̃[|β(x) − f(x)|²] + λ|β̂|² } ] ≤ ((1+λ)/K) E_ω[ |f̂(ω)|² / ((2π)^d p(ω)²) ].
###### Proof.

Direct calculation shows that the variance of the Monte Carlo approximation satisfies

  (15) E_ω[|α(x, ω) − f(x)|²]
    = K^{−2} E_ω[ ∑_{k=1}^K ∑_{ℓ=1}^K ( f̂(ω_k) e^{iω_k·x} / ((2π)^{d/2} p(ω_k)) − f(x) )* ( f̂(ω_ℓ) e^{iω_ℓ·x} / ((2π)^{d/2} p(ω_ℓ)) − f(x) ) ]
    = K^{−1} E_ω[ | f̂(ω) e^{iω·x} / ((2π)^{d/2} p(ω)) − f(x) |² ]
    = K^{−1} E_ω[ |f̂(ω)|² / ((2π)^d p(ω)²) − f²(x) ],

and since a minimum is less than or equal to its average we obtain the random features error estimate in the case without measurement error, i.e., y = f(x):

  E_ω[ min_{β̂ ∈ ℂ^K} { E_ρ̃[|β(x) − f(x)|²] + λ|β̂|² } ]
    ≤ E_ω[ E_ρ̃[|α(x, ω) − f(x)|²] + λ|α̂|² ]
    ≤ (1/K) E_ω[ |f̂(ω)|² / ((2π)^d p(ω)²) − f²(x) ] + (λ/K) E_ω[ |f̂(ω)|² / ((2π)^d p(ω)²) ]
    ≤ ((1+λ)/K) E_ω[ |f̂(ω)|² / ((2π)^d p(ω)²) ].

Including the measurement error yields, after a straightforward calculation, the additional term E_ρ̃[|y − f(x)|²]:

  E_ω[ min_{β̂ ∈ ℂ^K} { E_ρ̃[|β(x) − y|²] + λ|β̂|² } ] ≤ ((1+λ)/K) E_ω[ |f̂(ω)|² / ((2π)^d p(ω)²) ] + E_ρ̃[|y − f(x)|²]. ∎

### 2.2. Comments on the convergence rate and its complexity

The bounds (13) and (14) reveal the rate of convergence with respect to K. To demonstrate the computational complexity and the importance of using adaptive sampling of the frequencies, we fix the approximated function to be a simple Gaussian,

  f(x) = e^{−|x|²σ²/2}, with f̂(ω) = (2πσ²)^{−d/2} e^{−|ω|²/(2σ²)},

and we consider the two cases σ ≫ 1 and σ ≪ 1. Furthermore, we choose the particular distribution p by sampling the frequencies ω_k, k = 1, …, K, from the standard normal distribution (i.e., the Gaussian density with mean zero and variance one).

Example I (large σ). In the first example we assume that σ is large, so that the integral E_ω[|f̂(ω)|²/((2π)^d p(ω)²)] = (2π)^{−d} ∫_{ℝ^d} |f̂(ω)|²/p(ω) dω is unbounded. The error estimate (14) therefore indicates no convergence. Algorithm 1, on the other hand, exhibits the optimal convergence rate for this example.

Example II (small σ). In the second example we choose σ ≪ 1; then the convergence rate in (14) becomes 𝒪(σ^{−d} K^{−1}), while the rate is 𝒪(K^{−1}) for the optimal distribution p*, as σ → 0+. The purpose of the adaptive random features algorithm is to avoid the large factor σ^{−d}.

To have the loss function bounded by a given tolerance TOL therefore requires that the non-adaptive random features method uses K of order σ^{−d} TOL^{−1}, and the computational work to solve the linear least-squares problem grows accordingly with this large K.

In contrast, the proposed adaptive random features Metropolis method solves the least-squares problem several times with a smaller K to obtain the same bound for the loss. The number of Metropolis steps is asymptotically determined by the diffusion approximation in [11]. Therefore the total computational work is smaller for the adaptive method.
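The large factor in Example II can be observed numerically. The sketch below (one-dimensional, with our own choice of σ and grid) compares the variance factor (2π)^{−d} ∫ |f̂|²/p dω appearing in (12) for frequencies drawn from the fixed standard normal density and from the optimal density (16):

```python
import numpy as np

sigma = 0.2                                  # small sigma, as in Example II
w = np.linspace(-3.0, 3.0, 6001)
dw = w[1] - w[0]

f_hat = np.exp(-w**2 / (2 * sigma**2))       # shape of the Fourier transform, d = 1
p_fixed = np.exp(-w**2 / 2) / np.sqrt(2 * np.pi)      # standard normal frequencies
p_opt = np.abs(f_hat) / (np.sum(np.abs(f_hat)) * dw)  # optimal density (16)

# variance factor (2 pi)^{-d} int |f_hat|^2 / p dw from (12) and (14)
V_fixed = np.sum(f_hat**2 / p_fixed) * dw / (2 * np.pi)
V_opt = np.sum(f_hat**2 / p_opt) * dw / (2 * np.pi)
# V_opt equals (2 pi)^{-1} (int |f_hat| dw)^2, while V_fixed is larger here
```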

### 2.3. Optimal Monte Carlo sampling.

This section determines the optimal density p* for independent Monte Carlo samples by minimizing, with respect to the density p, the right-hand side in the variance estimate (12).

###### Theorem 2.2.

The probability density

  (16) p*(ω) = |f̂(ω)| / ∫_{ℝ^d} |f̂(ω′)| dω′

is the solution of the minimization problem

  (17) min_{p : ∫_{ℝ^d} p(ω) dω = 1} { (2π)^{−d} ∫_{ℝ^d} |f̂(ω)|²/p(ω) dω }.
###### Proof.

The change of variables p = q / ∫_{ℝ^d} q(ω′) dω′ implies ∫_{ℝ^d} p(ω) dω = 1 for any non-negative q. Define, for any bounded perturbation v and ε close to zero,

  H(ε) := ∫_{ℝ^d} |f̂(ω)|² / (q(ω) + εv(ω)) dω · ∫_{ℝ^d} (q(ω) + εv(ω)) dω.

At the optimum we have H′(0) = 0, and this optimality condition, holding for all perturbations v, implies that |f̂(ω)|/q(ω) is constant. Consequently the optimal density becomes

  p*(ω) = |f̂(ω)| / ∫_{ℝ^d} |f̂(ω′)| dω′. ∎

We note that the optimal density p* does not depend on the number of Fourier features K or on the number of data points N, in contrast to the optimal density for the least-squares problem (28) derived in [3].

As mentioned at the beginning of this section, sampling from the distribution p* leads to the tight upper bound on the approximation error in (14).
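Given f̂ on a grid, independent samples from the optimal density (16) can be drawn by inverse-transform sampling in one dimension; a sketch under these assumptions (grid bounds and names are ours):

```python
import numpy as np

def sample_from_p_star(f_hat, K, w_grid, rng):
    """Draw K i.i.d. frequencies from p*(w) = |f_hat(w)| / int |f_hat(w')| dw'
    by inverse-transform sampling on a uniform grid (d = 1)."""
    cdf = np.cumsum(np.abs(f_hat(w_grid)))
    cdf /= cdf[-1]
    return w_grid[np.searchsorted(cdf, rng.random(K))]

rng = np.random.default_rng(2)
w_grid = np.linspace(-6.0, 6.0, 10001)
f_hat = lambda w: np.exp(-w**2 / 2)   # Gaussian transform: p* is then N(0, 1)
omega = sample_from_p_star(f_hat, 10000, w_grid, rng)
```

Here the drawn frequencies should be approximately standard normal, since |f̂| is Gaussian.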

## 3. Asymptotic behavior of the amplitudes β̂_k

The optimal density can be related to the data as follows: considering the problem (9) and letting ζ be a least-squares minimizer of

  (18) min_{ζ ∈ N_K} { E_ρ̃[|ζ(x) − y|² | ω] + λ|ζ̂|² },

the vanishing gradient at a minimum yields the normal equations, for ℓ = 1, …, K,

  ∑_{k=1}^K E_ρ̃[e^{i(ω_k − ω_ℓ)·x} ζ̂_k] + λζ̂_ℓ = E_ρ̃[y e^{−iω_ℓ·x}] = E_x[f(x) e^{−iω_ℓ·x}].

Thus, if the x-data points are distributed according to a distribution with density ρ, we have

  (19) ∑_{k=1}^K ∫_{ℝ^d} e^{i(ω_k − ω_ℓ)·x} ζ̂_k ρ(x) dx + λζ̂_ℓ = ∫_{ℝ^d} f(x) e^{−iω_ℓ·x} ρ(x) dx,

and the normal equations can be written in Fourier space as

  (20) ∑_{k=1}^K ρ̂(ω_ℓ − ω_k) ζ̂_k + λζ̂_ℓ = (fρ)ˆ(ω_ℓ), ℓ = 1, …, K.

Given the solution ζ̂ of the normal equations (20) we define

  (21) ẑ_k := K p(ω_k) ζ̂_k.
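The normal equations (20) and the scaling (21) can be formed explicitly in one dimension with Gaussian ρ and f. The sketch below uses the unnormalized transforms ∫ρ(x)e^{−iωx}dx and ∫f(x)ρ(x)e^{−iωx}dx, which match (19) directly and sidestep the (2π)-convention; all names and parameter values are our own:

```python
import numpy as np

rng = np.random.default_rng(3)
K, lam = 100, 1e-3

# rho = N(0,1) data density, f(x) = e^{-x^2/2}; unnormalized transforms:
rho_ft = lambda w: np.exp(-w**2 / 2)                  # int rho(x) e^{-iwx} dx
frho_ft = lambda w: np.exp(-w**2 / 4) / np.sqrt(2)    # int f(x) rho(x) e^{-iwx} dx
p = lambda w: np.exp(-w**2 / 2) / np.sqrt(2 * np.pi)  # frequency sampling density

omega = rng.standard_normal(K)
A = rho_ft(omega[:, None] - omega[None, :]) + lam * np.eye(K)
zeta = np.linalg.solve(A, frho_ft(omega))             # normal equations (20)
z = K * p(omega) * zeta                               # scaled amplitudes (21)

# sanity check: the minimizer zeta reconstructs f in the rho-weighted L2 sense
x = np.linspace(-5, 5, 2001)
dx = x[1] - x[0]
rho = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
recon = np.real(np.exp(1j * np.outer(x, omega)) @ zeta)
err = np.sum((recon - np.exp(-x**2 / 2)) ** 2 * rho) * dx
```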

Given a sequence {ω_k}_{k=1}^∞ of samples drawn independently from a density p, we impose the following assumptions:

• there exists a constant C such that

  (A1) ∑_{k=1}^K |ζ̂_k| ≡ (1/K) ∑_{k=1}^K |ẑ_k|/p(ω_k) ≤ C

for all K,

• as K → ∞ we have

  (A2) lim_{K→∞} max_{k ∈ {1,…,K}} |ζ̂_k| ≡ lim_{K→∞} max_{k ∈ {1,…,K}} |ẑ_k|/(K p(ω_k)) = 0,

• there is a bounded open set U ⊂ ℝ^d such that

  (A3) supp f̂ ⊂ supp p ⊂ U,

• the sequence {ω_k}_{k=1}^∞ is dense in the support of p, i.e.,

  (A4) closure({ω_k}_{k=1}^∞) = supp p.

We note that (A4) follows, with probability one, from (A3), since (A3) implies that the density p has bounded support and hence bounded first moment. The law of large numbers then implies that with probability one the sequence {ω_k}_{k=1}^∞ is dense in the support of p. In order to treat the limiting behaviour of ζ̂ as K → ∞ we introduce the empirical measure

  (22) Ẑ_K(ω) := (1/K) ∑_{k=1}^K (ẑ_k/p(ω_k)) δ(ω − ω_k).

Thus we have, for ℓ = 1, …, K,

  (23) ∑_{k=1}^K ρ̂(ω_ℓ − ω_k) ζ̂_k = ∫_{ℝ^d} ρ̂(ω_ℓ − ω) Ẑ_K(dω),

so that the normal equations (20) take the form

  (24) ∫_{ℝ^d} ρ̂(ω_ℓ − ω) Ẑ_K(dω) + λζ̂_ℓ = (fρ)ˆ(ω_ℓ), ℓ = 1, …, K.

By assumption (A1) the empirical measures Ẑ_K are uniformly bounded in the total variation norm:

  ∫_{ℝ^d} |Ẑ_K|(dω) = (1/K) ∑_{k=1}^K |ẑ_k|/p(ω_k) ≤ C.

We note that by (A3) the measures Ẑ_K have their support in U. We obtain the weak convergence result stated as the following proposition.

###### Proposition 3.1.

Let ζ̂ be the solution of the normal equations (20) and Ẑ_K the empirical measures defined by (22). Suppose that the assumptions (A1), (A2), (A3), and (A4) hold, and that the density ρ of the x-data has support on all of ℝ^d. Then

  (25) lim_{ε→0+} lim_{K→∞} ∫_{ℝ^d} ϕ_ε(· − ω′) Ẑ_K(dω′) = f̂, in L¹(ℝ^d),

where ϕ_ε are non-negative smooth functions with support in the ball {|ω| ≤ ε} and satisfying ∫_{ℝ^d} ϕ_ε(ω) dω = 1.

###### Proof.

To simplify the presentation we introduce

  χ̂_{K,ε}(ω) := (Ẑ_K ∗ ϕ_ε)(ω) = ∫_{ℝ^d} ϕ_ε(ω − ω′) Ẑ_K(dω′).

The proof consists of three steps:

• compactness yields a convergent subsequence of the regularized empirical measures χ̂_{K,ε} = Ẑ_K ∗ ϕ_ε,

• the normal equations (20) imply a related equation for the subsequence limit, and

• a subsequence of the empirical measures Ẑ_K converges weakly, and as ε → 0+ the limit normal equation establishes (25).

Step 1. As ϕ_ε are standard mollifiers we have, for fixed ε, that the smooth functions (we omit ε in the notation)

  χ̂_K ≡ Ẑ_K ∗ ϕ_ε : ℝ^d → ℂ

have derivatives uniformly bounded with respect to K. Let V be the Minkowski sum U + {ω : |ω| ≤ 1}. By compactness, see [6], there is a convergent subsequence of the functions χ̂_K, i.e., χ̂_K → χ̂ uniformly as K → ∞. As a consequence of assumption (A3) we have χ̂_K(ω) = 0 for all ω ∉ V, and hence the limit χ̂ has its support in V. Hence χ̂ can be extended by zero on ℝ^d ∖ V. Thus we obtain

  (26) lim_{K→∞} ∫_{ℝ^d} g(ω) χ̂_K(ω) dω = ∫_{ℝ^d} g(ω) χ̂(ω) dω

for all continuous bounded functions g.

Step 2. The normal equations (24) can be written as a perturbation of the convergence (26): using the uniform total variation bound on Ẑ_K, we have, for Lipschitz continuous g,

  ∫_{ℝ^d} g(ω) Ẑ_K(dω) − ∫_{ℝ^d} g(ω) χ̂_K(ω) dω = ∫_{ℝ^d} (g(ω) − (g ∗ ϕ_ε)(ω)) Ẑ_K(dω) = O(ε).

Thus we rewrite the first term in (24) as

  ∫_{ℝ^d} ρ̂(ω − ω′) Ẑ_K(dω′) = ∫_{ℝ^d} ρ̂(ω − ω′) χ̂_K(ω′) dω′ + O(ε),

now considering a general point ω instead of ω_ℓ and the change of measure from Ẑ_K to χ̂_K dω′. By Taylor's theorem,

  ρ̂(ω − ω′) = ρ̂(ω_p − ω′) + ρ̂(ω − ω′) − ρ̂(ω_p − ω′) = ρ̂(ω_p − ω′) + ∫_0^1 ∇ρ̂(sω + (1−s)ω_p − ω′) ds · (ω − ω_p),

where

  min_{p ∈ {1,…,K}} | ∫_0^1 ∇ρ̂(sω + (1−s)ω_p − ω′) ds · (ω − ω_p) | → 0, as K → ∞,

since by assumption (A4) the set {ω_k}_{k=1}^∞ is dense in the support of p. Since λζ̂_ℓ → 0 as K → ∞, by assumption (A2), the normal equations (24) imply that the limit χ̂ is determined by

  (27) (fρ)ˆ(ω) = ∫_{ℝ^d} ρ̂(ω − ω′) χ̂(ω′) dω′ + O(ε), ω ∈ ℝ^d.

We have here used that the limit function χ̂ is continuous, and the denseness of the sequence {ω_k}_{k=1}^∞.

Step 3. From assumption (A1) all Ẑ_K are uniformly bounded in the total variation norm and supported on a compact set; therefore there is a weakly convergent subsequence Ẑ_K ⇀ Ẑ, i.e., for all g ∈ C(V)

  lim_{K→∞} ∫_V g(ω) Ẑ_K(dω) = ∫_V g(ω) Ẑ(dω).

This subsequence of Ẑ_K can be chosen as a subsequence of the convergent sequence χ̂_K. Consequently we have

  lim_{K→∞} Ẑ_K ∗ ϕ_ε = Ẑ ∗ ϕ_ε = χ̂.

Letting ε → 0+ in (27) we obtain

  (fρ)ˆ(ω) = ∫_{ℝ^d} ρ̂(ω − ω′) Ẑ(dω′), ω ∈ ℝ^d,

and we conclude, by the inverse Fourier transform, that

  Z(x) ρ(x) = f(x) ρ(x), x ∈ ℝ^d,

for ρ and f in the Schwartz class. Since the support of ρ is all of ℝ^d we obtain Z = f, that is, Ẑ = f̂. ∎

The approximation in Proposition 3.1 is in the sense of the limit of a large data set, N → ∞. By the result of the proposition, the regularized empirical measure for ẑ_k, namely Ẑ_K ∗ ϕ_ε, satisfies

  Ẑ_K ∗ ϕ_ε → f̂ ∗ ϕ_ε as K → ∞, and f̂ ∗ ϕ_ε → f̂ as ε → 0+, in L¹(ℝ^d),

which shows that Ẑ_K converges weakly to f̂ as K → ∞, so that heuristically |ẑ_k| ≈ |f̂(ω_k)|. We remark that this argument gives a heuristic justification for the proposed adaptive algorithm; in particular, it explains the idea behind the choice of the likelihood ratio in the Metropolis accept-reject criterion.

###### Remark 3.2.

By Proposition 3.1, Ẑ_K converges weakly to f̂ as K → ∞. If it also converged strongly, the asymptotic sampling density for ω_k in the random features Metropolis method would satisfy a relation which has the fixed point solution