Learning to be Global Optimizer

03/10/2020 ∙ by Haotian Zhang, et al. ∙ Xi'an Jiaotong University

The advancement of artificial intelligence has cast a new light on the development of optimization algorithms. This paper proposes to learn a two-phase (including a minimization phase and an escaping phase) global optimization algorithm for smooth non-convex functions. For the minimization phase, a model-driven deep learning method is developed to learn the update rule of the descent direction, which is formalized as a nonlinear combination of historical information, for convex functions. We prove that the resultant algorithm with the proposed adaptive direction guarantees convergence for convex functions. An empirical study shows that the learned algorithm significantly outperforms some well-known classical optimization algorithms, such as gradient descent, conjugate gradient and BFGS, and performs well on ill-posed functions. The escaping phase from a local optimum is modeled as a Markov decision process with a fixed escaping policy. We further propose to learn an optimal escaping policy by reinforcement learning. The effectiveness of the escaping policies is verified by optimizing synthesized functions and training a deep neural network for CIFAR image classification. The learned two-phase global optimization algorithm demonstrates a promising global search capability on some benchmark functions and machine learning tasks.




I Introduction

This paper considers the unconstrained continuous global optimization problem:

$$\min_{x \in \mathbb{R}^n} f(x), \tag{1}$$

where $f: \mathbb{R}^n \to \mathbb{R}$ is smooth and non-convex. The study of continuous global optimization dates back to the 1950s [1]. The field has been fruitful: see [2] for a basic reference on most aspects of global optimization, [3] for a comprehensive archive of online information, and [4] for practical applications.

Numerical methods for global optimization can be classified into four categories according to their available guarantees, namely, incomplete, asymptotically complete, complete, and rigorous methods [5]. We make no attempt to reference or review the large body of literature. Interested readers may refer to the WWW surveys by Hart [6] and Neumaier [3]. Instead, this paper focuses on a sub-category of incomplete methods, the two-phase approach [7, 8].

A two-phase optimization approach is composed of a sequence of cycles. Each cycle consists of two phases: a minimization phase and an escaping phase. In the minimization phase, a minimization algorithm is used to find a local minimum from a given starting point. The escaping phase aims to obtain a good starting point for the next minimization phase, so that the search can escape from the current local minimum.
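The cycle structure described above can be sketched in a few lines of Python. This is only an illustrative skeleton under stated assumptions: the function names (`two_phase_search`, `gradient_descent`, `probe_escape`), the 1-D test function `double_well`, and the naive escaping rule that probes a few fixed offsets are all hypothetical choices made for this example. They are not the minimization or escaping methods developed in this paper.

```python
def num_grad(f, x, h=1e-6):
    """Central-difference derivative of a 1-D function."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def gradient_descent(f, x, step=0.01, iters=2000):
    """Minimization phase: plain fixed-step gradient descent from x."""
    for _ in range(iters):
        x -= step * num_grad(f, x)
    return x

def probe_escape(f, x_star, offsets=(0.5, 1.0, 1.5, 2.0, 2.5, 3.0)):
    """Escaping phase (naive stand-in): return the first probed point
    whose value is below f(x_star), or None if no probe improves."""
    for h in offsets:
        for cand in (x_star + h, x_star - h):
            if f(cand) < f(x_star):
                return cand
    return None

def two_phase_search(f, x0, max_cycles=10):
    """Alternate minimization and escaping phases until escaping fails."""
    best = gradient_descent(f, x0)
    for _ in range(max_cycles):
        start = probe_escape(f, best)      # escaping phase
        if start is None:                  # no better starting point found
            break
        cand = gradient_descent(f, start)  # minimization phase
        if f(cand) < f(best):
            best = cand
    return best

def double_well(x):
    """Toy function with a local minimum near 1.35 and a global one near -1.47."""
    return x**4 - 4.0 * x**2 + x

x_best = two_phase_search(double_well, 2.0)
```

Starting from `2.0`, the first minimization phase ends in the shallower well; the probe then finds a lower point on the other side of the barrier, and the second minimization phase reaches the deeper well.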

I-A The Minimization Phase

Classical line search iterative optimization algorithms, such as gradient descent, conjugate gradient descent, Newton's method, and quasi-Newton methods like DFP and BFGS, have flourished for decades since the 1940s [9, 10]. These algorithms can be readily used in the minimization phase.

At each iteration, these algorithms usually take the following location update formula:

$$x_{k+1} = x_k + \Delta x_k, \tag{2}$$

where $k$ is the iteration index, $x_k$ are the iterates, and $\Delta x_k$ is often taken as $\alpha_k d_k$, where $\alpha_k$ is the step size and $d_k$ is the descent direction. It is the choice of $d_k$ that largely determines the performance of these algorithms in terms of convergence guarantees and rates.

In these algorithms, $d_k$ is computed from first-order or second-order derivatives. For example, $d_k = -\nabla f(x_k)$ in gradient descent (GD), and $d_k = -[\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$ in Newton's method, where $\nabla^2 f(x_k)$ is the Hessian matrix. These algorithms usually come with mathematical guarantees on their convergence for convex functions. Further, it has been proven that first-order methods such as gradient descent usually converge slowly (with a linear convergence rate), while methods that exploit curvature information, such as conjugate gradient and quasi-Newton, can be faster (with a superlinear convergence rate), but their numerical performance can be poor in some cases (e.g. quadratic programming with an ill-conditioned Hessian due to poorly chosen initial points).
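To make the two directions concrete, the following minimal sketch applies the location update with a gradient descent direction and with a Newton direction to an ill-conditioned quadratic. The test function $f(x) = \frac{1}{2}(x_0^2 + 100 x_1^2)$, the fixed step sizes, and all function names are assumptions chosen for illustration.

```python
def grad(x):
    """Gradient of f(x) = 0.5 * (x0^2 + 100 * x1^2)."""
    return [x[0], 100.0 * x[1]]

def gd_direction(x):
    """Gradient descent direction: d = -grad f(x)."""
    g = grad(x)
    return [-g[0], -g[1]]

def newton_direction(x):
    """Newton direction: d = -H^{-1} grad f(x); here H = diag(1, 100)."""
    g = grad(x)
    return [-g[0] / 1.0, -g[1] / 100.0]

def step(x, d, alpha):
    """Location update x + alpha * d."""
    return [x[0] + alpha * d[0], x[1] + alpha * d[1]]

x0 = [1.0, 1.0]

# One Newton step with unit step size lands on the minimizer of a quadratic.
x_newton = step(x0, newton_direction(x0), 1.0)

# Gradient descent must use a step size limited by the largest curvature,
# so progress along the flat coordinate x0 is slow.
x_gd = x0
for _ in range(100):
    x_gd = step(x_gd, gd_direction(x_gd), 0.01)
```

After 100 gradient steps the flat coordinate has only shrunk by a factor of $0.99^{100}$, whereas the single Newton step is exact for this quadratic.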

For a specific optimization problem, it is usually hard to tell which of these algorithms is more appropriate. Further, the no-free-lunch theorem [11] states that “for any algorithm, any elevated performance over one class of problems is offset by performance over another class”. In light of this theorem, efforts have been made on developing optimization algorithms with adaptive descent directions.

The study of combinations of various descent directions dates back to the 1960s. For example, the Broyden family [12] uses a linear combination of DFP and BFGS updates for the approximation to the inverse Hessian. In the Levenberg-Marquardt (LM) algorithm [13] for nonlinear least squares problems, a linear combination of the Hessian and the identity matrix with a non-negative damping factor is employed to avoid slow convergence in the direction of small gradients. In the accelerated gradient method and recently proposed stochastic optimization algorithms, such as momentum [14], AdaGrad [15], AdaDelta [16], ADAM [17] and so on, moments of the first-order and second-order gradients are combined and estimated iteratively to obtain the location update.
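As a small illustration of how a damping factor blends two classical directions, the sketch below solves $(H + \lambda I)\, d = -g$ for a 2x2 example. With $\lambda = 0$ the result is the Newton direction; for large $\lambda$ it approaches a scaled gradient descent direction. The matrix `H`, the vector `g`, and the helper names are assumptions for this example, and the sketch uses the Hessian directly rather than the Gauss-Newton approximation used in practical LM implementations.

```python
def solve2(a, b):
    """Solve the 2x2 linear system a @ d = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

def damped_direction(hess, grad, lam):
    """Direction d solving (H + lam * I) d = -g, with lam >= 0."""
    a = [[hess[0][0] + lam, hess[0][1]],
         [hess[1][0], hess[1][1] + lam]]
    return solve2(a, [-grad[0], -grad[1]])

H = [[2.0, 0.0], [0.0, 200.0]]   # illustrative Hessian
g = [2.0, 200.0]                 # illustrative gradient

d_newton = damped_direction(H, g, 0.0)   # pure Newton direction
d_damped = damped_direction(H, g, 1e6)   # nearly parallel to -g
```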

Besides these works, only recently has the location update been proposed to be adaptively learned by considering it as a parameterized function of appropriate historical information:

$$\Delta x_k = \phi(\mathcal{I}_k; \theta), \tag{3}$$

where $\mathcal{I}_k$ represents the information gathered up to $k$ iterations, such as iterates, gradients, function criteria, Hessians and so on, and $\theta$ is the parameter.

Neural networks are used to model the update function in recent literature simply because they are capable of approximating any smooth function. For example, Andrychowicz et al. [18] proposed to model the update by a long short-term memory (LSTM) neural network [19] for differentiable $f$, in which the input of the LSTM includes the gradient $\nabla f(x_k)$ and the hidden states of the LSTM. Li et al. [20] used neural networks to model the location update for some machine learning tasks such as logistic/linear regression and neural net classifiers. Chen et al. [21] proposed to obtain the iterate directly for black-box optimization problems, where the iterate is produced by an LSTM that takes previous queries, function evaluations, and hidden states as inputs.

Neural networks in existing learning-to-learn approaches are simply used as a black box. The interpretability issue of deep learning is thus inherited. A model-driven method with prior knowledge from hand-crafted classical optimization algorithms is thus appealing. Model-driven deep learning [22, 23] has shown its ability to learn hyper-parameters for a compressed sensing problem in MRI image analysis, and for stochastic gradient descent methods [24, 25].

I-B The Escaping Phase

A few methods, including tunneling [26] and filled function [8], have been proposed to escape from a local optimum. The tunneling method was first proposed by Levy and Montalvo [26]. The core idea is to use the zero of an auxiliary function, called the tunneling function, as the new starting point for the next minimization phase. The filled function method was first proposed by Ge and Qin [8]. The method aims to find a point that falls into the attraction basin of a local minimizer better than the current one by minimizing an auxiliary function, called the filled function. The tunneling and filled function methods are both based on the construction of an auxiliary function, and the auxiliary functions are built upon the local minimum obtained from the previous minimization phase. Both were originally proposed for smooth global optimization.

Existing research on tunneling and filled functions is either on developing better auxiliary functions or on extending them to constrained and non-smooth optimization problems [27, 28, 29]. In general, these methods have similar drawbacks. First, finding the zero or the optimizer of the auxiliary function is itself a hard optimization problem. Second, minimizing the auxiliary function is not always guaranteed to find a better starting point [30]. Third, there often exist hyper-parameters that are critical to the methods' escaping performance but are difficult to control [31]. Fourth, some proposed auxiliary functions are built with exponential or logarithmic terms, which can cause ill-conditioning in the minimization phase [30]. Last but not least, it has been found that although the filled and tunneling function methods have desirable theoretical properties, their numerical performance is far from satisfactory [30].

I-C Main Contributions

In this paper, we first propose a model-driven learning approach to learn adaptive descent directions for locally convex functions. A local-convergence guaranteed algorithm is then developed based on the learned directions. We further model the escaping phase within the filled function method as a Markov decision process (MDP) and propose two policies, namely a fixed policy and a policy learned by policy gradient, on deciding the new starting point. Combining the learned local algorithm and the escaping policy, a two-phase global optimization algorithm is finally formed.

We prove that the learned local search algorithm is convergent, and we explain the insight behind the fixed policy, which can have a higher probability of finding promising starting points than random sampling. Extensive experiments are carried out to justify the effectiveness of the learned local search algorithm, the two policies and the learned two-phase global optimization algorithm.

The rest of the paper is organized as follows. Section II briefly discusses reinforcement learning and the policy gradient method to be used in the escaping phase. Section III presents the model-driven learning to learn approach for convex optimization. The escaping phase is presented in Section IV, in which the fixed escaping policy under the MDP framework is presented in Section IV-B, while the details of the learned policy are presented in Section IV-C. A controlled experimental study is presented in Section V. Section VI concludes the paper and discusses future work.

II Brief Introduction of Reinforcement Learning

In reinforcement learning (RL), the learner (agent) chooses to take an action at each time step; the action changes the state of environment; (possibly delayed) feedback (reward) returns as the response of the environment to the learner’s action and affects the learner’s next decision. The learner aims to find an optimal policy so that the actions decided by the policy maximize cumulative rewards along time.

Fig. 1: Illustration of a finite horizon Markov decision process.

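Consider a finite-horizon MDP with continuous state and action spaces defined by the tuple $(\mathcal{S}, \mathcal{A}, \mu_0, r, T)$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ the action space, $\mu_0$ the initial distribution of the state, $r$ the reward, and $T$ the time horizon, respectively. At each time $t$, there are $s_t \in \mathcal{S}$, $a_t \in \mathcal{A}$, and a transition probability $p(s_{t+1} \mid s_t, a_t)$, which denotes the probability of $s_{t+1}$ conditioned on $s_t$ and $a_t$. The policy is $\pi_\theta$, where $\pi_\theta(a_t \mid s_t)$ is the probability of choosing action $a_t$ when observing the current state $s_t$, with $\theta$ as the parameter.

As shown in Fig. 1, starting from a state $s_t$, the agent chooses $a_t$; after executing the action, the agent arrives at state $s_{t+1}$. Meanwhile, the agent receives a reward $r(s_t, a_t)$ (or $r_t$) from the environment. Iteratively, a trajectory $\tau = \{s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T\}$ can be obtained. The optimal policy $\pi^*$ is to be found by maximizing the expectation of the cumulative reward $R(\tau) = \sum_{t=0}^{T-1} r_t$:

$$\max_\theta J(\theta) = \mathbb{E}_{\tau \sim q(\tau;\theta)}\big[R(\tau)\big], \tag{4}$$

where the expectation is taken over the trajectory $\tau \sim q(\tau;\theta)$, where

$$q(\tau;\theta) = \mu_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t). \tag{5}$$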
A variety of reinforcement learning algorithms have been proposed for different scenarios of the state and action spaces; see [32] for recent advancements. RL algorithms have been highly successful at playing games such as Go [33], Atari [34] and many others.

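We briefly introduce the policy gradient method for continuous state space [35], which will be used in our study. Taking the derivative of $J(\theta)$ w.r.t. $\theta$ and discarding unrelated terms, we have

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim q(\tau;\theta)}\Big[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\Big]. \tag{6}$$

Eq. 6 can be calculated by sampling $N$ trajectories in practice:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t} \nabla_\theta \log \pi_\theta(a_t^n \mid s_t^n)\, R(\tau^n), \tag{7}$$

where $a_t^n$ ($s_t^n$) denotes the action (state) at time $t$ in the $n$-th trajectory, and $R(\tau^n)$ is the cumulative reward of the $n$-th trajectory.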

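For continuous state and action spaces, one normally assumes a Gaussian policy

$$\pi_\theta(a \mid s) = \mathcal{N}\big(a;\, \phi(s;\theta),\, \sigma^2 I\big),$$

where the mean $\phi(s;\theta)$ can be any smooth function, such as a radial basis function, a linear function, or even a neural network.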
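The sampled policy gradient can be sketched for a scalar Gaussian policy with a linear mean. Everything specific here is an assumption made for illustration: the toy dynamics $s_{t+1} = s_t + a_t$, the reward $r_t = -s_{t+1}^2$, the linear mean $\theta s$, the fixed $\sigma$, and the function names. For this policy, $\nabla_\theta \log \pi_\theta(a \mid s) = (a - \theta s)\, s / \sigma^2$.

```python
import random

def sample_trajectory(theta, sigma, horizon, rng):
    """Roll out a toy MDP: s_{t+1} = s_t + a_t, reward r_t = -(s_{t+1})^2."""
    s, traj = 1.0, []
    for _ in range(horizon):
        a = rng.gauss(theta * s, sigma)   # a_t ~ N(theta * s_t, sigma^2)
        s_next = s + a
        traj.append((s, a, -s_next ** 2))
        s = s_next
    return traj

def policy_gradient(theta, sigma, horizon, n_traj, rng):
    """Monte Carlo estimate of dJ/dtheta from n_traj sampled trajectories."""
    total = 0.0
    for _ in range(n_traj):
        traj = sample_trajectory(theta, sigma, horizon, rng)
        ret = sum(r for _, _, r in traj)                       # R(tau)
        score = sum((a - theta * s) * s / sigma ** 2           # sum of
                    for s, a, _ in traj)                       # grad log pi
        total += score * ret
    return total / n_traj

grad_est = policy_gradient(theta=0.0, sigma=1.0, horizon=1,
                           n_traj=4000, rng=random.Random(0))
```

With a horizon of one step, the exact gradient at $\theta = 0$ is $-2$ for this toy problem, so the estimate should be negative: moving $\theta$ below zero pushes the next state towards the origin and increases the reward.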

III Model-driven Learning to Learn for Local Search

In this section, we first summarize some well-known first- and second-order classical optimization algorithms. Then the proposed model-driven learning to optimize method for locally convex functions is presented.

III-A Classical Optimization Methods

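In the sequel, denote $g_k = \nabla f(x_k)$, $s_k = x_{k+1} - x_k$, $y_k = g_{k+1} - g_k$. The descent direction $d_k$ at the $k$-th iteration of some classical methods is of the following form [12]:

$$d_k = -H_k g_k + \beta_k d_{k-1},$$

where $H_k$ is an approximation to the inverse of the Hessian matrix, and $\beta_k$ is a coefficient that varies for different conjugate GDs. For example, $\beta_k$ could take $g_k^\top y_{k-1} / (d_{k-1}^\top y_{k-1})$ for the Crowder-Wolfe conjugate gradient method [12].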

The update of also varies for different quasi-Newton methods. In the Huang family, is updated as follows:




The Broyden family is a special case of the Huang family in case , and .

III-B Learning the descent direction: d-Net

We propose to consider the descent direction as a nonlinear function of with parameter for the adaptive computation of descent search direction . Denote


We propose


where is the identity matrix.

At each iteration, rather than updating directly, we update the multiplication of and like in the Huang family  [12]:


It can be seen that with different parameter and settings, can degenerate to different directions:

  • when , the denominator of is not zero, and , the update degenerates to conjugate gradient.

  • when , and the denominator of is not zero, and , the update becomes the preconditioned conjugate gradient.

  • when , and , the update degenerates to the Huang family.

  • when , the denominator of is not zero, and the update becomes the steepest GD.

Based on Eq. 15, a new optimization algorithm, called the adaptive gradient descent algorithm (AGD), can be established. It is summarized in Alg. 1. To obtain a new direction by Eq. 15, information from the two preceding steps is required. To initiate the computation of a new direction, Alg. 1 first applies a steepest gradient descent step (lines 3-5) and then a non-linear descent step (lines 7-10). With this information prepared, AGD iterates (lines 14-18) until the norm of the gradient at the solution is less than a prescribed positive tolerance.

1:  initialize , and
2:  # a steep gradient descent step
4:  Choose through line search; 
5:  ;
6:  # a non-linear descent step
7:  , ;
8:  Compute and ;
9:  Choose through line search; 
10:  ;
11:  Set ;
12:  repeat
13:     Compute , , ; 
14:     Gather ; 
15:     Compute ; 
16:     Choose through line search; 
17:     Update ;
20:  until 
Algorithm 1 The adaptive gradient descent algorithm (AGD)
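The control flow of Alg. 1 (an initial steepest descent step, then repeated direction computation, line search and update until the gradient norm is small) can be sketched as follows. The learned direction rule of Eq. 15 is not reproduced here. As a loudly labeled stand-in, the sketch uses the classical Polak-Ribière+ conjugate gradient rule, and it replaces the exact line search with Armijo backtracking. All function names and the test quadratic are assumptions for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def axpy(x, alpha, d):
    """Return x + alpha * d."""
    return [xi + alpha * di for xi, di in zip(x, d)]

def backtracking(f, x, d, g, alpha=1.0, shrink=0.5, c=1e-4):
    """Armijo backtracking line search along a descent direction d."""
    fx, slope = f(x), dot(g, d)
    while f(axpy(x, alpha, d)) > fx + c * alpha * slope:
        alpha *= shrink
    return alpha

def pr_plus_direction(g_new, g_old, d_old):
    """Stand-in direction rule (Polak-Ribiere+), NOT the learned rule."""
    diff = [a - b for a, b in zip(g_new, g_old)]
    beta = max(0.0, dot(g_new, diff) / dot(g_old, g_old))
    return [-gn + beta * do for gn, do in zip(g_new, d_old)]

def descent_loop(f, grad, x, direction=pr_plus_direction,
                 tol=1e-6, max_iter=1000):
    g = grad(x)
    d = [-gi for gi in g]                 # first step: steepest descent
    for _ in range(max_iter):
        if math.sqrt(dot(g, g)) < tol:    # stop on small gradient norm
            break
        if dot(g, d) >= 0.0:              # safeguard: fall back to -g
            d = [-gi for gi in g]
        alpha = backtracking(f, x, d, g)  # choose the step size
        x_new = axpy(x, alpha, d)
        g_new = grad(x_new)
        d = direction(g_new, g, d)        # direction from recent history
        x, g = x_new, g_new
    return x

def quad(x):
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)

def quad_grad(x):
    return [x[0], 10.0 * x[1]]

x_min = descent_loop(quad, quad_grad, [3.0, -2.0])
```

Passing a different `direction` callable is the hook where a learned, parameterized rule would be plugged in.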

To specify the parameters in the direction update function, like [18], we unfold AGD into a fixed number of iterations. Each iteration can be considered as a layer in a neural network. We thus have a ‘deep’ neural network with as many layers as unfolded iterations. The resultant network is called d-Net. Fig. 2 shows the unfolding.

Like normal neural networks, d-Net needs to be trained for its parameters. To learn the parameters, the loss function is defined as


That is, we expect these parameters to be optimal not only for a single function but for a whole class of functions, and for all the criteria along the iterations.

We hereby choose to be the Gaussian function family:


There are two reasons to choose the Gaussian function family. First, any is locally convex. That is, let represents the Hessian matrix of , it is seen that

Second, it is known that finite mixture Gaussian model can approximate a Riemann integrable function with arbitrary accuracy [36]. Therefore, to learn an optimization algorithm that guarantees convergence to local optima, it is sufficient to choose functions that are locally convex.

Given , when optimizing , the expectation can be obtained by Monte Carlo approximation with a set of functions sampled from , that is

where . can then be optimized by the steepest GD algorithm.
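The Monte Carlo approximation of an expected loss over a function class can be sketched as below. The sketch is deliberately simplified and rests on several assumptions: the function class is a 1-D Gaussian-shaped family $f(x) = -\exp(-(x-\mu)^2 / (2\sigma^2))$ with randomly drawn $\mu$ and $\sigma$, the unrolled optimizer is plain gradient descent, and its only trainable parameter is a shared step size. The loss sums the function values along the unrolled steps and averages over the sampled functions.

```python
import math
import random

def make_gaussian_bump(mu, sigma):
    """Return f(x) = -exp(-(x - mu)^2 / (2 sigma^2)) and its derivative."""
    def f(x):
        return -math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2))
    def df(x):
        return -f(x) * (x - mu) / sigma ** 2
    return f, df

def unrolled_loss(step, funcs, x0=0.0, layers=5):
    """Monte Carlo estimate: average over sampled functions of the summed
    function values along `layers` unrolled gradient steps."""
    total = 0.0
    for f, df in funcs:
        x = x0
        for _ in range(layers):
            x -= step * df(x)     # one unrolled layer
            total += f(x)         # criterion after this layer
    return total / len(funcs)

rng = random.Random(0)
funcs = [make_gaussian_bump(rng.uniform(-1.0, 1.0), rng.uniform(1.0, 2.0))
         for _ in range(20)]

loss_no_step = unrolled_loss(0.0, funcs)    # parameter that never moves x
loss_half_step = unrolled_loss(0.5, funcs)  # parameter that descends
```

Comparing the two values shows how the averaged loss ranks candidate parameter settings; a training loop would adjust the parameter to reduce this quantity.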

Note: The contribution of the proposed d-Net can be summarized as follows. First, there is a significant difference between the proposed learning to learn approach and existing methods, such as [18, 21]. In existing methods, an LSTM is used as a ‘black-box’ for the determination of the descent direction, and the parameters of the LSTM are shared across the time horizon. In our approach, by contrast, the direction is a combination of known and well-studied directions, i.e. a ‘white-box’, which means that our model is interpretable. This is a clear advantage over black-box models.

Second, in classical methods, such as the Broyden and Huang families and LM, descent directions are constructed through a linear combination. In contrast, the proposed method is nonlinear and subsumes a wide range of classical methods. This may result in better directions.

Further, the combination parameters used in classical methods are considered to be hyper-parameters. They are normally set by trial and error. In AGD, these parameters are learned from optimization experience on a class of functions, so that the directions can adapt to new optimization problems.

Fig. 2: The unfolding of the AGD.

III-C Group d-Net

To further improve the search ability of d-Net, we employ a group of d-Nets, dubbed Gd-Net. These d-Nets are connected sequentially, with shared parameters among them. The input of the $i$-th ($i > 1$) d-Net is the gradient from the $(i-1)$-th d-Net. To apply Gd-Net, an initial point is taken as the input and is brought forward through these d-Nets until the gradient norm is less than a predefined small positive real number.
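The chaining of identical blocks can be sketched as follows. Each block here is a fixed number of plain gradient steps with per-layer step sizes, which is only a stand-in for a trained d-Net; the names `d_net_block` and `gd_net`, the step sizes, and the test quadratic are assumptions for this example.

```python
import math

def grad_norm(g):
    return math.sqrt(sum(v * v for v in g))

def d_net_block(x, grad, step_sizes):
    """One unrolled block: len(step_sizes) layers, one update per layer.
    Each layer is a plain gradient step (a stand-in for a d-Net layer)."""
    for alpha in step_sizes:
        g = grad(x)
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x

def gd_net(x, grad, step_sizes, tol=1e-6, max_blocks=1000):
    """Chain identical blocks (shared parameters) until the gradient
    norm falls below tol or max_blocks is reached."""
    blocks = 0
    while grad_norm(grad(x)) >= tol and blocks < max_blocks:
        x = d_net_block(x, grad, step_sizes)
        blocks += 1
    return x, blocks

def quad_grad(x):
    """Gradient of 0.5 * (x0^2 + 4 * x1^2)."""
    return [x[0], 4.0 * x[1]]

x_final, n_blocks = gd_net([1.0, 1.0], quad_grad, step_sizes=[0.2, 0.2, 0.2])
```

Because the same `step_sizes` are reused in every block, the number of trainable parameters does not grow with the number of blocks that are chained at run time.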

In the following we show that Gd-Net guarantees convergence to optimum for convex functions. We first prove that AGD is convergent. Theorem 1 summarizes the result. Please see Appendix A for proof.

Theorem 1.

Assume $f$ is continuous and differentiable and the sublevel set $\{x : f(x) \le f(x_0)\}$ of the initial point $x_0$ is bounded. The sequence obtained by AGD with exact line search converges to a stable point.

Since d-Net is the unfolding of AGD, Theorem 1 ensures that the function values along the iterate sequence obtained by d-Net are non-increasing for any initial point, given properly learned parameters. Therefore, applying a sequence of d-Nets (i.e. Gd-Net) to a bounded function from any initial point results in a sequence of non-increasing function values. This ensures the convergence of the sequence, which indicates that Gd-Net is convergent under the assumption of Theorem 1.

IV Escaping from local optimum

Gd-Net guarantees convergence for locally convex functions. To approach global optimality, we present a method to escape from the local optimum once trapped. Our method is based on the filled-function method, and is embedded within the MDP framework.

IV-A The Escaping Phase in the Filled Function Method

In the escaping phase of the filled function method, a local search method is applied to minimize the filled function in order to obtain a good starting point for the next minimization phase. To apply the local search method, the starting point is set as $x^* + \delta d$, where $x^*$ is the local minimizer obtained from the previous minimization phase, $\delta$ is a small constant and $d$ is the search direction.

Many filled functions have been constructed (please see [30] for a survey). One of the popular filled-functions [8] is defined as follows


where is a hyper-parameter. It is expected that minimizing the filled function leads to a local minimizer that lies away from the current one, owing to the exponential term.

Theoretical analysis has been conducted on the filled function methods in terms of its escaping ability [8]. However, the filled function methods have many practical weaknesses yet to overcome.

First, the hyper-parameter is critical to algorithm performance. Roughly speaking, if it is small, the method struggles to escape from the current local minimizer; otherwise it may miss some local minima. It is very hard to determine its optimal value: there are neither theoretical results nor rules of thumb on how to choose it.

Second, the search direction is also very important to the algorithmic performance. Different directions may lead to different local minimizers, and these local minimizers are not necessarily better than the current one. In the literature, a trial-and-error procedure is usually applied to find the best direction from a set of pre-fixed directions, e.g. along the coordinates [8]. This is apparently not effective. To the best of our knowledge, no work has been done in this avenue.

Third, minimizing the filled function is itself hard and may lead not to a local optimum but to a saddle point [8], even when a promising search direction is used. Unfortunately, there are no studies on how to deal with this scenario in the literature. Fig. 3 demonstrates this phenomenon. In the figure, the contour of $f$ is shown in red lines, while the negative gradients of the filled function are shown in blue arrows. From Fig. 3, it is seen that minimizing the filled function from a local minimizer of $f$ leads to the saddle point at (12,15).

Fig. 3: Red lines show the contour of the three-hump function , blue arrows are the gradients of . There is a saddle point at (12,15) for .

IV-B The Proposed Escaping Scheme

The goal of an escaping phase is to find a new starting point such that can escape from the attraction basin of (the local minimizer obtained from previous minimization phase) if a minimization procedure is applied, where , is the direction and is called the escaping length in this paper.

Rather than choosing from a pre-fixed set, we could sample some directions, either randomly or sequentially following certain rules. In this section, we propose an effective way to sample directions, or more precisely speaking ’s.

In our approach, the sampling of is modeled as a finite-horizon MDP. That is, the sampling is viewed as the execution of a policy : at each time step , given the current state , and reward , an action , i.e. the increment , is obtained by the policy. The policy returns by deciding a search direction and an escaping length .

At each time step , the state is composed of a collection of previously used search directions and their scores , where is a hyper-parameter. Here the score of a search direction measures how promising a direction is in terms of the quality of the new starting point that it can lead to. A new starting point is of high quality if applying local search from it can lead to a better minimizer than current one. The initial state includes a set of directions sampled uniformly at random, and their corresponding scores.

In the following, we first define ‘score’, then present the policy on deciding and , and the transition probability . Without causing confusion, we omit the subscript in the sequel.

IV-B1 Score

Given a search direction , a local minimizer , define


where is the step size along .

Since is a local minimizer, by definition, we cannot find a solution with smaller along if is within the attraction basin of . However, if there is a such that is a point with smaller criterion than (i.e. ), and there is no other local minimizer within , we can prove that there exists a , such that when , and when (proof can be seen in Appendix B). Theorem 2 summarizes the result under the following assumptions:

  • $f$ has a finite number of local minimizers.

  • For every local optimum , there exists a such that is convex in .

  • The attractive basin of each local optimum is convex.

Theorem 2.

If is a point outside the boundary of ’s attraction basin, there is no other local minimizer within . Then there exists a such that

obtains its maximum at . And is monotonically increasing in , and monotonically decreasing in .

If we let , then . This implies that is actually along the direction from pointing to . This tells whether a direction can lead to a new minimizer or not. A direction with a positive indicates that it could lead to a local minimizer different to present one.

We therefore define the score of a direction , , to be the greatest along , i.e.


For such that , we say it is promising.

In the following, we present the policy on finding (or new starting point). The policy includes two sub-policies. One is to find the new point given a promising direction, i.e. to find the escaping length. The other is to decide the promising direction.

IV-B2 Policy on finding the escaping length

First we propose to use a simple filled function as follows:


The parameter of this filled function is called the ‘escaping length controller’ since it controls how far a solution can escape from the current local optimum. Alg. 2 summarizes the policy proposed to determine the optimal controller value and the new starting point.

0:  a local minimum , a direction , a bound , an initial escaping length controller , a learning rate , some constants , and
0:  a new starting point and (the score of )
1:  repeat
2:     set ;
3:     optimize along starting from for iterations, i.e. evaluate the criteria of a sequence of points defined by ;
4:     compute ;
5:     ;
6:  until  or .
7:  ;
8:  if , then set ;
9:  return and .
Algorithm 2 Policy on finding a new starting point

In Alg. 2, given a direction , the filled function is optimized for steps (line 3). The sum of the iterates’ function values, denoted as (line 4), is maximized w.r.t. by gradient ascent (line 5). The algorithm terminates if a stable point of is found (), or the search is out of bound (). When the search is out of bound, a negative score is set for the direction (line 8). As a by-product, Alg. 2 also returns the score of the given direction .

We prove that can escape from the attraction basin of and ends up in another attraction basin of a local minimizer with smaller criterion if exists. Theorem 3 summarizes the result.

Theorem 3.

Suppose that is a point such that , and there are no other points that are with smaller or equal criterion than within . If the learning rate is sufficiently small, then there exists an such that .

According to this theorem, we have the following corollary.

Corollary 1.

Suppose that is the solution obtained by optimizing along starting from at the -th iteration, then will be in an attraction basin of , if the basin ever exists.

Theorem 3 can be explained intuitively as follows. Consider pushing a ball down the peak of a mountain with height (it can be regarded as the ball’s gravitational potential energy) along a direction . The ball will keep moving until it arrives at a point for some such that . For any , the ball has a positive velocity, i.e. where . But the ball has a zero velocity at , and negative at , Hence reaches its maximum in . The integral is approximated by its discrete sum, i.e. , in Alg. 2.

Further, according to the law of the conservation of energy, the ball will keep moving until at some , in which case . This means that the ball falls into the attraction basin of a smaller criterion than as shown in Fig. 4(b). Fig. 4(a) shows when is small, in iterations, the ball reaches some but .

Moreover, if there is no smaller local minimizers in search region, the ball will keep going until it rolls outside the restricted search region bounded by as shown in Fig. 4(e) which means Alg. 2 fails to find . Fig. 4(c)(d) show the cases when there are more than one local minimizers within the search region.

Fig. 4: Possible scenarios encountered when estimating . (a) shows the case when is not large enough, while (b) shows when is appropriate. (c) shows that reaches a local minimum, but because ; (d) shows the case when there are more than one local minimizer. (e) shows when there is no smaller local minimizer within .

Once such has been found, the corresponding will enter an attraction basin of a local minimum with smaller criterion than . If we cannot find such an in the direction of within a distance to , we consider that there is no another smaller local minimum along . If it is the case, is non-promising. We thus set a negative score for it as shown in line 8 of Alg. 2.

Running line 5 of Alg. 2 requires computing gradients at each iteration. This makes Alg. 2 time-consuming. We hereby propose to accelerate this procedure by fixing the escaping length controller and instead finding a proper number of iterations. Alg. 3 summarizes the fast policy. Given a direction, the learning rate and the escaping length controller are fixed during the search. At each iteration of Alg. 3, an iterate is obtained by applying gradient descent to the filled function, and the quantities needed for the stopping test are computed (lines 6-7). Alg. 3 terminates when the stopping condition in line 8 is met or the search goes beyond the bound. During the search, the gradient needs to be computed only once per iteration, which significantly reduces the computational cost in comparison with Alg. 2.

0:  a local minimizer , a search direction , a bound and positive scalars and ;
0:  score and
1:  set and ;
2:  compute and ;
3:  repeat
4:     set ;
6:     compute and ;
7:     compute
8:  until  or
9:  compute ;
10:  if , set ;
11:  return and .
Algorithm 3 Fast policy on finding a new starting point

Alg. 3 aims to find an integer such that but . The existence of such an can be illustrated as follows. It is seen that (please see Eq. 31 in Appendix B). This implies that . When , we have and so that and for any , and and . Thus, there exists an such that and .

Corollary 1 proves that if there exists a better local minimum along , then applying Alg. 2 or Alg. 3, we are able to escape from the local attraction basin of .

IV-B3 Policy on the sampling of promising directions

In the following, we show how to sample directions that have a high probability of being promising. We first present a fixed policy, then propose to learn an optimal policy by policy gradient.

0:  a local minimizer , an integer and
0:  a set of candidate directions and starting points
1:  sample directions uniformly at random; apply Alg. 2 or Alg. 3 to obtain their scores and ;
2:  set .
3:  repeat
4:     sample
5:     apply Alg. 2 or Alg. 3 to obtain and .
6:     if  then
7:        set and .
8:     end if
9:     set and ;
10:     set and ;
11:     .
12:  until .
13:  return and .
Algorithm 4 Fixed policy on sampling promising direction

Alg. 4 summarizes the fixed policy method. In Alg. 4, first a set of directions is sampled uniformly at random (line 1). Their scores are computed by Alg. 2 or Alg. 3. Archives used to store the directions and starting points are initialized (line 2). A direction is sampled by using a linear combination of previous directions with their respective scores as coefficients (line 4). If the sampled direction has a positive score, its score and the obtained starting point are included in the archive. The sets of scores and directions are updated accordingly in a FIFO manner (lines 9-10). The algorithm terminates once the number of sampled directions exceeds the prescribed limit.
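The sampling loop of the fixed policy can be sketched as below. Several details are assumptions rather than the paper's specification: the scoring routine is passed in as a callable `score_fn` (standing in for Alg. 2 or Alg. 3), randomness enters through additive Gaussian noise on the score-weighted sum, directions are renormalised to unit length, and the toy score in the usage example simply measures alignment with a hypothetical target direction.

```python
import math
import random
from collections import deque

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v] if n > 0.0 else list(v)

def random_direction(dim, rng):
    return normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])

def combine(directions, scores, rng, noise=0.1):
    """Score-weighted sum of the stored directions plus Gaussian noise
    (the noise term is an assumption), renormalised to unit length."""
    dim = len(directions[0])
    v = [sum(u * d[i] for u, d in zip(scores, directions))
         + rng.gauss(0.0, noise) for i in range(dim)]
    return normalize(v)

def fixed_policy(score_fn, dim, n_init, n_samples, rng):
    # Initial state: n_init random directions and their scores.
    dirs = deque((random_direction(dim, rng) for _ in range(n_init)),
                 maxlen=n_init)
    scores = deque((score_fn(d) for d in dirs), maxlen=n_init)
    promising = []                       # archive of promising directions
    for _ in range(n_samples):
        d = combine(dirs, scores, rng)   # sample a new direction
        u = score_fn(d)
        if u > 0:                        # positive score: keep it
            promising.append(d)
        dirs.append(d)                   # FIFO: the oldest entry drops out
        scores.append(u)
    return promising

def toy_score(d):
    """Hypothetical score: positive within 60 degrees of the x-axis."""
    return d[0] - 0.5

found = fixed_policy(toy_score, dim=2, n_init=4, n_samples=20,
                     rng=random.Random(1))
```

Using `deque(maxlen=...)` makes the FIFO update a single `append`: once the state is full, adding the newest direction discards the oldest one.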

We expect the developed sampling algorithm to be more efficient than random sampling in terms of finding promising directions. Let $p_r$ denote the probability of finding a promising direction by random sampling, and $p_f$ the probability by the fixed policy. Appendix C explains why $p_f > p_r$.

IV-B4 The transition

In our MDP model, the state transition is deterministic. The determination of the new starting point depends on the sampling of a new direction and its score. The new state is then updated in a FIFO manner. That is, at each time step, the oldest direction and score in the state are replaced by the newly sampled ones.

All the proofs in this section are given in Appendix B.

IV-C Learning the Escaping Policy by Policy Gradient

In the presented policy, a linear combination of previous directions with their scores as coefficients is applied to sample a new direction. However, this policy is not necessarily optimal. In this section, we propose to learn an optimal policy by the policy gradient algorithm [35].

The learning is based on the same foregoing MDP framework. The goal is to learn the optimal coefficients for combining previously sampled directions. We assume that at time , the coefficients are obtained as follows:

where and .

is the output of a feed-forward neural network

with parameter , and is the coefficients. The current state is the composition of and .

Fig. 5 shows the framework of estimating the coefficients and sampling a new direction at a certain time step. For the next time step, and are updated

and , .

The policy gradient algorithm is used to learn for the neural network . We assume that and the policy can be stated as follows:

The reward is defined to be


where is a constant, is the score of the sampled direction and is the indicator function.

Alg. 5 summarizes the policy gradient learning procedure for . is updated in epochs. At each epoch, first a sample of trajectories is obtained (lines 3-23). Given , a trajectory can be sampled as follows. First, a set of initial directions is randomly generated and their scores are computed by Alg. 3 (lines 6-7). new directions and their corresponding scores are then obtained (lines 9-22). At each step, the obtained direction , the policy function and the reward are gathered in the current trajectory (line 20). After the trajectory sampling, and are updated in lines 24-28 and line 29, respectively.

0:  a local minimum , an integer , the number of training epochs , the number of trajectories , and learning rate .
0:  the optimal network parameter .
1:  randomly initialize ;
2:  for  do
3:     // create trajectories
4:     for  do
5:        set ;
6:        sample directions uniformly at random;
7:        apply Alg. 3 to obtain their scores and ;
8:        set ;
9:        repeat
10:           // sample new direction
11:           sample ;
12:           compute and ;
13:           apply Alg. 3 to obtain ;
14:           // update the state
15:           set by Eq. 23
16:            and ;
17:            and ;
18:           set , and ;
19:           // update the trajectory