Online Optimization with Costly and Noisy Measurements using Random Fourier Expansions

03/31/2016 ∙ by Laurens Bliek, et al. ∙ Delft University of Technology 0

This paper analyzes DONE, an online optimization algorithm that iteratively minimizes an unknown function based on costly and noisy measurements. The algorithm maintains a surrogate of the unknown function in the form of a random Fourier expansion (RFE). The surrogate is updated whenever a new measurement is available, and then used to determine the next measurement point. The algorithm is comparable to Bayesian optimization algorithms, but its computational complexity per iteration does not depend on the number of measurements. We derive several theoretical results that provide insight on how the hyper-parameters of the algorithm should be chosen. The algorithm is compared to a Bayesian optimization algorithm for a benchmark problem and three applications, namely, optical coherence tomography, optical beam-forming network tuning, and robot arm control. It is found that the DONE algorithm is significantly faster than Bayesian optimization in the discussed problems, while achieving a similar or better performance.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

Code Repositories

pyDONEc

A python implementation of the DONE optimization algorithm. Wraps on a C++ dll


view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Many optimization algorithms use the derivative of an objective function, but often this information is not available in practice. Regularly, a closed form expression for the objective function is not available and function evaluations are costly. Examples are objective functions that rely on the outcome of a simulation or an experiment. Approximating derivatives with finite differences is costly in high-dimensional problems, especially if the objective function is costly to evaluate. More efficient algorithms for derivative-free optimization (DFO) problems exist. Typically, in DFO algorithms a model is used that can be optimized without making use of the derivative of the underlying function [conn2009introduction, rios2013derivative]. Some examples of commonly used DFO algorithms are the simplex method  [nelder1965simplex], NEWUOA [powell2006newuoa], BOBYQA [powell2009bobyqa], and DIRECT [jones1993lipschitzian]. Additionally, measurements of a practical problem are usually corrupted by noise. Several techniques have been developed to cope with a higher noise level and make better use of the expensive objective functions evaluations. Filtering and pattern search optimization algorithms such as implicit filtering [gilmore1995implicit] and SID-PSM [custodio2007]

can handle local minima resulting from high frequency components. Bayesian optimization, also known as sequential Kriging optimization, deals with heteroscedastic noise and perturbations very well. One of the first and best known Bayesian optimization algorithms is EGO 

[jones1998efficient]

. Bayesian optimization relies on a surrogate model that represents a probability distribution of the unknown function under noise, for example Gaussian processes or Student’s-t processes

[krige1951, bergstra2011algorithms, hutter2011sequential, martinez2014]. In these processes different kernels and kernel learning methods are used for the covariance function [roustant2012dicekriging, snoek2012practical]

. The surrogate model is used to decide where the next measurement should be taken. New measurements are used to update the surrogate model. Bayesian optimization has been successfully used in various applications, including active user modeling and reinforcement learning

[brochu2010tutorial], robotics [Martinez2009], hyper-parameter tuning [bergstra2011algorithms], and optics [Rehman2015].

Recently, the Data-based Online Nonlinear Extremum-seeker (DONE) algorithm was proposed in [Verstraete15]. It is similar to Bayesian optimization, but simpler and faster. The DONE algorithm uses random Fourier expansions [rahimi2007] (RFEs) as a surrogate model. The nature of the DONE algorithm makes the understanding of the hyper-parameters easier. In RFE models certain parameters are chosen randomly. In this paper, we derive a close-to-optimal probability distribution for some of these parameters. We also derive an upper bound for the regularization parameter used in the training of the RFE model.

The advantages of the DONE algorithm are illustrated in an analytic benchmark problem and three applications. We numerically compare DONE to BayesOpt [martinez2014], a Bayesian optimization library that was shown to outperform many other similar libraries in [martinez2014]. The first application is optical coherence tomography (OCT), a 3D imaging method based on interference often used to image the human retina [nasiri2009, Bonora13, Verstraete15]. The second application we consider is the tuning of an optical beam-forming network (OBFN). OBFNs are used in wireless communication systems to steer phased array antennas in the desired direction by making use of positive interference of synchronized signals [hansen2009phased, roeloffzen2005ring, meijerink2010novel, zhuang2006single, zhuang2010ring, Bliek2015166]. The third application is a robot arm of which the tip has to be directed to a desired position [de2009method].

This paper is organized as follows. Section II gives a short overview and provides new theoretical insights on random Fourier expansions, the surrogate model on which the DONE algorithm is based. We have noticed a gap in the literature, where approximation guarantuees are given for ideal, but unknown RFE weights, while in practice RFE weights are computed via linear least squares. We investigate several properties of the ideal weights and combine these results with existing knowledge of RFEs to obtain approximation guarantees for least-square weights. Section III explains the DONE algorithm. Theoretically optimal as well as more practical ways to choose the hyper-parameters of this algorithm are given in Section IV. In Section V the DONE algorithm and BayesOpt are compared for a benchmark problem and for the three aforementioned applications. We conclude the paper in Section VI.

Ii Random Fourier Expansions

In this section, we will describe the surrogate model that we will use for optimization. There is a plethora of black-box modeling techniques to approximate a function from measurements available in the literature, with neural networks, kernel methods, and of course classic linear models probably being the most popular 

[hofmann2008kernel, suykens2012nonlinear, theodoridis2015machine]. In this paper, we use random Fourier expansions (RFEs) [rahimi2007] to model the unknown function because they offer a unique mix of computational efficiency, theoretical guarantees and ease of use that make them ideal for online processing. While general neural networks are more expressive than random Fourier features, they are difficult to use and come without theoretical guarantees. Standard kernel methods suffer from high computational complexity because the number of kernels equals the number of measurements. RFEs have been originally introduced to reduce the computational burden that comes with kernel methods, as will be explained next [rahimi2007, rahimi2009weighted, singh2012online].

Assume that we are provided scalar measurements taken at measurement points as well as a kernel that, in a certain sense, measures the closeness of two measurement points. To train the kernel expansion

(1)

a linear system involving the kernel matrix has to be solved for the coefficients . The computational costs of training and evaluating (1) grow cubicly and linearly in the number of datapoints , respectively. This can be prohibitive for large values of . We now explain how RFEs can be used to reduce the complexity [rahimi2007]. Assuming the kernel

is shift-invariant and has Fourier transform

, it can be normalized such that is a probability distribution [rahimi2007]. That is, we have

(2)
We will use several trigonometric properties and the fact that is real to continue the derivation. This gives
(3)

if

are independent samples of the random variable

with probability distribution function (p.d.f.) , and are independent samples of the random variable

with a uniform distribution. For

we thus have:

(4)

Note that the number of coefficients is now independent of the number of measurements . This is especially advantageous in online applications where the number of measurements keeps increasing. We use the following definition of a random Fourier expansion.

Definition 1.

A Random Fourier Expansion (RFE) is a function of the form ,

(5)

with , the being realizations of independent and identically distributed (i.i.d.) uniformly distributed random variables on , and with the

being realizations of i.i.d. random vectors

with an arbitrary continuous p.d.f. . The and the are assumed to be mutually independent.

We finally remark that there are other approaches to reduce the complexity of kernel methods and make them suitable for online processing, which are mainly based on sparsity [burges1996simplified, scholkopf2002sampling, quinonero2005unifying, chen2013quantized]. However, these are much more difficult to tune than using RFEs [singh2012online]. It is also possible to use other basis functions instead of the cosine, but the cosine was among the top performers in an exhaustive comparison with similar models [zhang2015comprehensive]. Moreover, the parameters of the cosines have intuitive interpretations in terms of the Fourier transform.

Ii-a Ideal RFE Weights

In this section, we deal with the problem of fitting a RFE to a given function . We derive ideal but in practice unknown weights . We start with the case of infinitely many samples and basis functions (see also [girosi1992convergence, barron1993universal]), which corresponds to turning the corresponding sums into integrals.

Theorem 1.

Let be a real-valued function and let

(6)

Then, for all ,

(7)

Here, and denote the magnitude and phase of the Fourier transform . The sets and denote the space of square integrable functions and the space of all essentially bounded functions, respectively.

Proof.

For , we have

(8)

Using that is real, we find that

(9)

For , we have another useful expression for the ideal weights that is used later on in this section, namely

(10)

The function in Theorem 1 is not unique. However, of all functions that satisfy (7), the given is the one with minimum norm.

Theorem 2.

Let be as in Theorem 1. If satisfies

(11)

then with equality if and only if in the sense.

Proof.

First, using Parseval’s theorem and for any real constant , note that

(12)

Assume that . Then we get

(13)

Following the above equality we can conclude that . The following now holds:

(14)

Furthermore, equality holds if and only if . That is, the minimum norm solution is unique in .

These results will be used to derive ideal weights for a RFE with a finite number of basis functions as in Definition 1 by sampling the weights in (6

). We prove unbiasedness in the following theorem, while variance properties are analyzed in Appendix 

B.

Theorem 3.

For any continuous p.d.f. with if , the choice

(15)

makes the (stochastic) RFE

an unbiased estimator, i.e.,

for any .

Proof.

Using Theorem 1, we have

(16)

These ideal weights enjoy many other nice properties such as infinity norm convergence [rahimi2008uniform]. In practice, however, a least squares approach is used for a finite . This is investigated in the next subsection.

Ii-B Convergence of the Least Squares Solution

The ideal weights depend on the Fourier transform of the unknown function that we wish to approximate. Of course, this knowledge is not available in practice. We therefore assume a finite number of measurement points that have been drawn independently from a p.d.f. that is defined on a compact set , and corresponding measurements , with , where

have been drawn independently from a zero-mean normal distribution with finite variance

. The input and noise terms are assumed independent of each other. We determine the weights by minimizing the squared error

(17)

Here,

(18)

and is a regularization parameter added to deal with noise, over-fitting and ill-conditioning.

Since the parameters are drawn from continuous probability distributions, only the weights need to be determined, making the problem a linear least squares problem. The unique minimizer of is

(19)

The following theorem shows that RFEs whose coefficient vector have been obtained through a least squares fit as in (19) can approximate the function arbitrarily well. Similar results were given in [girosi1992convergence, barron1993universal, rahimi2008uniform, jones1992simple], but we emphasize that these convergence results did concern RFEs employing the ideal coefficient vector given earlier in Theorem 3 that is unknown in practice. Our theorem, in contrast, concerns the practically relevant case where the coefficient vector has been obtained through a least-squares fit to the data.

Theorem 4.

The difference between the function and the RFE trained with linear least squares can become arbitrarily small if enough measurements and basis functions are used. More precisely, suppose that and that . Then, for every and , there exist constants and such that

(20)

for all , , with probability at least . Here, is the -th element of the random vector corresponding to the weight vector given in (19), and is the solution to

(21)

The proof of this theorem is given in Appendix A. In Section IV-B we show how to obtain in practice.

Iii Online Optimization Algorithm

In this section, we will investigate the DONE algorithm, which locates a minimum of an unknown function based on noisy evaluations of this function. Each evaluation, or measurement, is used to update a RFE model of the unknown function, based on which the next measurement point is determined. Updating this model has a constant computation time of order per iteration, with being the number of basis functions. We emphasize that this is in stark contrast to Bayesian optimization algorithms, where the computational cost of adding a new measurement increases with the total number of measurements so far. We also remark that the DONE algorithm operates online because the model is updated after each measurement. The advantage over offline methods, in which first all measurements are taken and only then processed, is that the number of required measurements is usually lower as measurement points are chosen adaptively.

Iii-a Recursive Least Squares Approach for the Weights

In the online scenario, a new measurement taken at the point becomes available at each iteration These are used to update the RFE. Let , then we aim to find the vector of RFE weights by minimizing the regularized mean square error

(22)

Let be the minimum of ,

(23)

Assuming we have found , we would like to use this information to find without solving (23) again. The recursive least squares algorithm is a computationally efficient method that determines from as follows [sayed1998recursive, Sec. 21]:

(24)
(25)
(26)
(27)

with initialization , .

We implemented a square-root version of the above algorithm, also known as the inverse QR algorithm [sayed1998recursive, Sec. 21], which is known to be especially numerically reliable. Instead of performing the update rules (24)-(27) explicitly, we find a rotation matrix that lower triangularizes the upper triangular matrix in Eq. (28) below and generates a post-array with positive diagonal entries:

(28)

The rotation matrix

can be found by performing a QR decomposition of the transpose of the matrix on the left hand side of (

28), or by the procedure explained in [sayed1998recursive, Sec. 21]. The computational complexity of this update is per iteration.

Iii-B DONE Algorithm

We now explain the different steps of the DONE algorithm. The DONE algorithm is used to iteratively find a minimum of a function on a compact set by updating a RFE at each new measurement, and using this RFE as a surrogate of for optimization. It is assumed that the function is unknown and only measurements perturbed by noise can be obtained: . The algorithm consists of four steps that are repeated for each new measurement: 1) take a new measurement, 2) update the RFE, 3) find a minimum of the RFE, 4) choose a new measurement point. We now explain each step in more detail.


Initialization

Before running the algorithm, an initial starting point and the number of basis functions have to be chosen. The parameters and of the RFE expansion are drawn from continuous probability distributions as defined in Definition 1. The p.d.f. and the regularization parameter have to be chosen a priori as well. Practical ways for choosing the hyper-parameters will be discussed later in Sect. IV. These hyper-parameters stay fixed over the whole duration of the algorithm. Let , and .


Step 1: New measurement

Unlike in Section II-B, it is assumed that measurements are taken in a recursive fashion. At the start of iteration , a new measurement is taken at the point .


Step 2: Update the RFE

As explained in Section III-A, we update the RFE model based on the new measurement from Step 1 by using the inverse QR algorithm given in  (24)-(27). Only the weights are updated. The parameters and stay fixed through-out the whole algorithm.


Step 3: Optimization on the RFE

After updating the RFE, an iterative optimization algorithm is used to find a (possibly local) minimum of the RFE. All derivatives of the RFE can easily be calculated. Using an analytic expression of the Jacobian will increase the performance of the optimization method used in this step, while not requiring extra measurements of as in the finite difference method. For functions that are costly to evaluate, this is a big advantage. The method used in the proposed algorithm is an L-BFGS method [nocedal1980updating, nocedal2006numerical]. Other optimization methods can also be used. The initial guess for the optimization is the projection of the current measurement point plus a random perturbation:

(29)

where is the projection onto . The random perturbation prevents the optimization algorithm from starting exactly in the point where the model was trained. Increasing its value will increase the exploration capabilities of the DONE algorithm but might slow down convergence. In the proposed algorithm, is chosen to be white Gaussian noise.


Step 4: Choose a new measurement point

The minimum found in the previous step is used to update the RFE again. A perturbation is added to the current minimum to avoid the algorithm getting trapped unnecessarily in insignificant local minima or saddle points [pogu1994global]:

(30)

The random perturbations can be seen as an exploration strategy and are again chosen to be white Gaussian noise. Increasing their variance increases the exploration capabilities of the DONE algorithm but might slow down convergence. In practice, we typically use the same distribution for and . Finally, the algorithm increases and returns to Step .


The full algorithm is shown below in Algorithm 1 for the case .

1:procedure DONE()
2:     Draw from independently.
3:     Draw from independently.
4:     
5:     
6:     
7:     for   do
8:         
9:         
10:         
11:         Draw from .
12:         
13:         
14:         Draw from .
15:               
16:     return
Algorithm 1 DONE Algorithm
1:procedure updateRFE()
2:     Retrieve , and from (28)
3:     
4:     
5:     return
Algorithm 2 updateRFE

Iv Choice of Hyper-parameters

In this section, we will analyze the influence of the hyper-parameters of the DONE algorithm and, based on these results, provide practical ways of choosing them. The performance of DONE depends on the following hyper-parameters:

  • number of basis functions ,

  • p.d.f. ,

  • regularization parameter ,

  • exploration parameters and .

The influence of is straight-forward: increasing will lead to a better performance (a better RFE fit) of the DONE algorithm at the cost of more computation time. Hence, should be chosen high enough to get a good approximation, but not too high to avoid unnecessarily high computation times. It should be noted that does not need to be very precise. Over-fitting should not be a concern for this parameter since we make use of regularization. The exploration parameters determine the trade-off between exploration and exploitation, similar to the use of the acquisition function in Bayesian optimization [brochu2010tutorial, snoek2012practical]. The parameter influences the exploration of the RFE surrogate in Step 3 of the DONE algorithm, while determines exploration of the original function. Assuming both to be close to each other, and are usually chosen to be equal. If information about local optima of the RFE surrogate or of the original function is available, this could be used to determine good values for these hyper-parameters. Alternatively, similar to Bayesian optimization the expected improvement could be used for that purpose, but this remains for future work. The focus of this section will be on choosing and .

Iv-a Probability Distribution of Frequencies

Recall the parameters and from Definition 1, which are obtained by sampling independently from the continuous probability distributions and

, respectively. In the following, we will investigate the first and second order moments of the RFE and try to find a distribution

that minimizes the variance of the RFE.

Unfortunately, as shown in Theorem 7 in Appendix B, it turns out that the optimal p.d.f. is

(31)

This distribution depends on the input and both the phase and magnitude of the Fourier transform of . But if both and were known, then the function itself would be known, and standard optimization algorithms could be used directly. Furthermore, we would like to use a p.d.f. for that does not depend on the input , since the parameters are chosen independently from the input in the initialization step of the algorithm.

In calibration problems, the objective function suffers from an unknown offset, . This unknown offset does not change the magnitude in the Fourier domain, but it does change the phase. Since the phase is thus unknown, we choose a uniform distribution for such that . However, the magnitude can be measured in this case. Section V-B describes an example of such a problem. We will now derive a way to choose for calibration problems.

In order to get a close to optimal p.d.f. for that is independent of the input and of the phase of the Fourier transform of , we look at a complex generalization of the RFE. In this complex problem, it turns out we can circumvent the disadvantages mentioned above by using a p.d.f. that depends only on .

Theorem 5.

Let , with being i.i.d. random vectors with a continuous p.d.f. over that satisfies if , and being random variables with uniform distribution from . Then is an unbiased estimator of for all if

(32)

For this choice of , the variance of is minimal if

(33)

giving a variance of

Proof.

The unbiasedness follows directly from the Fourier inversion theorem,

(35)

The proof of minimum variance is similar to the proof of [rubinstein2011simulation, Thm. 4.3.1]. ∎

Note that the coefficients can be complex in this case. Next, we show that the optimal p.d.f. for a complex RFE, , is still close-to-optimal (in terms of the second moment) when used in the real RFE from Definition 1.

Theorem 6.

Let be as in (33) and let with weights be as in Theorem 3. Let be the set of probability distribution functions for that are positive when . Then, we have

(36)

The proof is given in Appendix B. We now discuss how to choose in practice.

If no information of is available, the standard approach of choosing as a zero-mean normal distribution can be used. The variance is an important hyper-parameter in this case, and any method of hyper-parameter tuning can be used to find it. However, most hyper-parameter optimization methods are computationally expensive because they require running the whole algorithm multiple times. In the case that is not exactly known, but some information about it is available (because it can be estimated or measured for example), this can be circumvented. The variance