Meta-descent for Online, Continual Prediction

07/17/2019 ∙ by Andrew Jacobsen, et al. ∙ University of Alberta

This paper investigates different vector step-size adaptation approaches for non-stationary online, continual prediction problems. Vanilla stochastic gradient descent can be considerably improved by scaling the update with a vector of appropriately chosen step-sizes. Many methods, including AdaGrad, RMSProp, and AMSGrad, keep statistics about the learning process to approximate a second order update---a vector approximation of the inverse Hessian. Another family of approaches use meta-gradient descent to adapt the step-size parameters to minimize prediction error. These meta-descent strategies are promising for non-stationary problems, but have not been as extensively explored as quasi-second order methods. We first derive a general, incremental meta-descent algorithm, called AdaGain, designed to be applicable to a much broader range of algorithms, including those with semi-gradient updates or even those with accelerations, such as RMSProp. We provide an empirical comparison of methods from both families. We conclude that methods from both families can perform well, but in non-stationary prediction problems the meta-descent methods exhibit advantages. Our method is particularly robust across several prediction problems, and is competitive with the state-of-the-art method on a large-scale, time-series prediction problem on real data from a mobile robot.


1 Introduction

In this paper we consider continual, non-stationary prediction problems. Consider a learning system whose objective is to learn a large collection of predictions about an agent’s future interactions with the world. The predictions specify the value of some signal many steps in the future, given that the agent follows some specific course of action. There are many examples of such prediction learning systems, including Predictive State Representations [Littman, Sutton, and Singh2001], Observable Operator Models [Jaeger2000], Temporal-difference Networks [Sutton and Tanner2004], and General Value Functions [Sutton et al.2011]. In our setting, the agent continually interacts with the world, making new predictions about the future, and revising its previous predictions as new outcomes are revealed. Occasionally, partially due to changes in the world and partially due to changes in the agent’s own behaviour, the targets may change and the agent must refine its predictions.¹

¹We exclude recent meta-learning frameworks (MAML [Finn, Abbeel, and Levine2017], LTLGDGD [Andrychowicz et al.2016]) because they assume access to a collection of tasks that can be sampled independently, enabling the agent to learn how to select meta-parameters for a new problem. In our setting, the agent must instead solve a large collection of non-stationary prediction problems in parallel using off-policy learning methods.

Stochastic gradient descent (SGD) is a natural choice for our setting because gradient descent methods work well when paired with abundant training data. The performance of SGD is dependent on the step-size parameter (scalar, vector or matrix), which scales the gradient to mitigate sample variance and improve data efficiency. Most modern large-scale learning systems make use of optimization algorithms that attempt to approximate stochastic second-order gradient descent to adjust both the direction and magnitude of the descent direction, with early work indicating the benefits of such quasi-second order methods if used carefully in the stochastic case [Schraudolph, Yu, and Günter2007, Bordes, Bottou, and Gallinari2009]. Many of these algorithms attempt to approximate the diagonal of the inverse Hessian, which describes the curvature of the loss function, and so maintain a vector of step-sizes—one for each parameter. Starting from AdaGrad [McMahan and Streeter2010, Duchi, Hazan, and Singer2011], several diagonal approximations have been proposed, including RMSProp [Tieleman and Hinton2012], AdaDelta [Zeiler2012], vSGD [Schaul, Zhang, and LeCun2013], Adam [Kingma and Ba2015] and AMSGrad [Reddi, Kale, and Kumar2018]. Stochastic quasi-second order updates have been derived specifically for temporal difference learning, with some empirical success [Meyer et al.2014], particularly in terms of parameter sensitivity [Pan, White, and White2017, Pan, Azer, and White2017]. On the other hand, second order methods, by design, assume the loss and thus the Hessian are fixed, and so non-stationary dynamics or drifting targets could be problematic.

A related family of optimization algorithms, called meta-descent algorithms, was developed for continual, online prediction problems. These algorithms perform meta-gradient descent, adapting a vector of step-size parameters to minimize the error of the base learner, instead of approximating the Hessian. Meta-descent applied to the step-size was first introduced for online least-mean squares methods [Jacobs1988, Sutton1992b, Sutton1992a, Almeida et al.1998, Mahmood et al.2012], including the linear complexity method IDBD [Sutton1992b]. IDBD was later extended to more general losses [Schraudolph1999] and to support (semi-gradient) temporal difference methods [Dabney and Barto2012, Dabney2014, Kearney et al.2018]. These methods are well-suited to non-stationary problems, and have been shown to ignore irrelevant features. The main limitation of several of these meta-descent algorithms, however, is that the derivations are heuristic, making it difficult to extend them to new settings beyond linear temporal difference learning. The more general approaches, like Stochastic Meta-Descent (SMD) [Schraudolph1999], require the update to be a stochastic gradient descent update and have some issues in biasing towards smaller step-sizes [Wu et al.2018]. It remains an open challenge to make these meta-descent strategies as broadly and easily applicable as the AdaGrad variants.

In this paper we introduce a new meta-descent algorithm, called AdaGain, that attempts to optimize the stability of the base learner, rather than convergence to a fixed point. AdaGain is built on a generic derivation scheme that allows it to be easily combined with a variety of base learners, including SGD, (semi-gradient) temporal-difference learning and even optimized SGD updates, like AMSGrad. Our goal is to investigate the utility of both meta-descent methods and the more widely used quasi-second order optimizers in online, continual prediction problems. We provide an extensive empirical comparison on (1) canonical optimization problems with large flat regions that are difficult to optimize, (2) an online, supervised tracking problem where the optimal step-sizes can be computed, (3) a finite Markov Decision Process with linear features that cause conventional temporal difference learning to diverge, and (4) a high-dimensional time-series prediction problem using data generated from a real mobile robot. In problems with non-stationary dynamics the meta-descent methods can exhibit an advantage over the quasi-second order methods. On the difficult optimization problems, however, meta-descent methods fail, which, retrospectively, is unsurprising given the meta-optimization problem for step-sizes is similarly difficult to optimize. We show that AdaGain can possess the advantages of both families — performing well on both optimization problems with flat regions and non-stationary problems — by selecting an appropriate base learner, such as RMSProp.

2 Background and Notation

In this paper we consider online continual prediction problems modeled as non-stationary, uncontrolled dynamical systems. On each discrete time step $t$, the agent observes the internal state of the system through an imperfect summary vector $\mathbf{o}_t \in \mathbb{R}^m$ for some $m \in \mathbb{N}$, such as the sensor readings of a mobile robot. On each step, the agent makes a prediction about a target signal $Y_t$. In the simplest case, the target of the prediction is a component of the observation vector on the next step, $Y_t = o_{t+1,j}$—the classic one-step prediction. In the more general case, the target is constructed by mapping the entire future of the observation time series to a scalar, such as the discounted sum formulation used in reinforcement learning: $Y_t = \sum_{i=0}^{\infty} \gamma^i\, o_{t+1+i,j}$, where $\gamma \in [0,1)$ discounts the contribution of future observations to the infinite sum. The prediction $\hat{y}_t$ is generated by a parametrized function, with modifiable parameter vector $\mathbf{w}_t \in \mathbb{R}^d$.

In online continual prediction problems the agent updates its predictions (via $\mathbf{w}_t$) with each new sample, unlike the more common batch and stochastic settings. The agent’s objective is to minimize the error between the prediction given by $\mathbf{w}_t$ and the target $Y_t$ before it is observed, over all time steps. Online continual prediction problems are typically solved using stochastic updates to adapt the parameter vector after each time step to reduce the error (retroactively) between $\hat{y}_t$ and $Y_t$. Generically, for a stochastic update vector $\Delta_t(\mathbf{w}_t) \in \mathbb{R}^d$, the weights are modified as

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \vec{\alpha} \circ \Delta_t(\mathbf{w}_t) \qquad (1)$$

for a vector step-size $\vec{\alpha} \in \mathbb{R}^d$, where the operator $\circ$ denotes element-wise multiplication. Given an update vector, the goal is to select $\vec{\alpha}$ to reduce error into the future. Semi-gradient methods like temporal difference learning follow a similar scheme, but $\Delta_t$ is not the gradient of an objective function.
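To make this interface concrete, here is a minimal sketch (our own illustration, not code from the paper) of the generic update in Equation (1), using a least-mean-squares (LMS) update as a stand-in for the update vector $\Delta_t$.

```python
import numpy as np

def lms_update(w, x, y):
    """Stand-in update vector: Delta_t(w) = (y - w^T x) x, the negative
    gradient of the squared prediction error 0.5 * (y - w^T x)^2."""
    return (y - w @ x) * x

def step(w, alpha, delta):
    """Generic update of Equation (1): w <- w + alpha o Delta(w),
    with an element-wise (vector) step-size alpha."""
    return w + alpha * delta

# toy usage: one sample at a time, with a fixed vector step-size
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
alpha = np.full(3, 0.1)          # one step-size per weight
for _ in range(200):
    x = rng.normal(size=3)
    y = x @ w_true + 0.01 * rng.normal()
    w = step(w, alpha, lms_update(w, x, y))
print(w)
```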

Step-size adaptation for the stationary setting is often based on estimating second-order updates.²

²A related class of algorithms are natural gradient methods, which aim to be robust to the functional parametrization. Incremental natural gradient methods have been proposed [Amari, Park, and Fukumizu2000], including for policy evaluation with gradient TD methods [Dabney and Thomas2014]. However, these algorithms do not remove the need to select a step-size, and so we do not consider them further here.

The idea is to estimate the loss function locally around the current weights $\mathbf{w}_t$ using a second-order Taylor series approximation—which requires the Hessian $H_t$ of the loss. A closed-form solution can then be obtained for the approximation, because it is a quadratic function, giving the next candidate solution $\mathbf{w}_{t+1}$. If instead the Hessian is approximated—such as with a diagonal approximation—then we obtain quasi-second order updates. Taken to the extreme, with the Hessian approximated by the scalar matrix $\frac{1}{\alpha} I$, we obtain first-order gradient descent with a step-size of $\alpha$. For the batch setting, the gains from second order methods are clear, with a superlinear convergence rate,³ as opposed to the $O(1/t)$ rate of first-order descent.

³There is a large literature on accelerated first-order descent methods, starting from early work on momentum [Nesterov1983] and many since focused mainly on variance reduction (c.f. [Roux, Schmidt, and Bach2012]). These methods can complement step-size adaptation, but are not well-suited to non-stationary problems because many of the algorithms are designed for a batch of data and focus on increasing convergence rate to a fixed minimum.

These gains are not as clear in the stochastic setting, but diagonal approximations appear to provide an effective balance between computation and convergence rate improvements [Bordes, Bottou, and Gallinari2009]. Duchi, Hazan, and Singer [Duchi, Hazan, and Singer2011] provide a general regret analysis for diagonal approximation methods, proving sublinear regret if step-sizes decrease to zero over time. One algorithm, AdaGrad, uses the vector step-size $\vec{\alpha}_t = \eta \big(\sqrt{\sum_{i=1}^{t} \mathbf{g}_i \circ \mathbf{g}_i} + \epsilon\big)^{-1}$ for a fixed $\eta > 0$ and a small $\epsilon > 0$, where $\mathbf{g}_i$ is the gradient on step $i$ and the square root, addition and inversion are element-wise. RMSProp and Adam—which are not guaranteed to obtain sublinear regret—use a running average rather than a sum of gradients, with Adam additionally including a momentum term for faster convergence. AMSGrad is a modification of Adam that satisfies the regret criteria without decaying the step-sizes as aggressively as AdaGrad.
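For reference, the sketch below (our own, with placeholder parameter values) shows the per-weight step-size vectors maintained by AdaGrad, RMSProp, and AMSGrad as described above; the momentum terms of Adam and AMSGrad are omitted to keep the focus on the step-sizes themselves.

```python
import numpy as np

class AdaGradStepSizes:
    """alpha_t = eta / (sqrt(sum of squared gradients) + eps), element-wise;
    the sum grows without bound, so step-sizes decay toward zero."""
    def __init__(self, d, eta=0.1, eps=1e-8):
        self.eta, self.eps, self.sum_sq = eta, eps, np.zeros(d)

    def __call__(self, g):
        self.sum_sq += g * g
        return self.eta / (np.sqrt(self.sum_sq) + self.eps)

class RMSPropStepSizes:
    """Replaces the sum with an exponential moving average, so step-sizes
    need not decay to zero (and there is no sublinear-regret guarantee)."""
    def __init__(self, d, eta=0.01, beta=0.9, eps=1e-8):
        self.eta, self.beta, self.eps, self.avg_sq = eta, beta, eps, np.zeros(d)

    def __call__(self, g):
        self.avg_sq = self.beta * self.avg_sq + (1 - self.beta) * g * g
        return self.eta / (np.sqrt(self.avg_sq) + self.eps)

class AMSGradStepSizes:
    """Normalizes by the running maximum of the averaged statistic, which
    restores the regret guarantee without AdaGrad's aggressive decay."""
    def __init__(self, d, eta=0.01, beta=0.9, eps=1e-8):
        self.eta, self.beta, self.eps = eta, beta, eps
        self.avg_sq, self.max_sq = np.zeros(d), np.zeros(d)

    def __call__(self, g):
        self.avg_sq = self.beta * self.avg_sq + (1 - self.beta) * g * g
        self.max_sq = np.maximum(self.max_sq, self.avg_sq)
        return self.eta / (np.sqrt(self.max_sq) + self.eps)
```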

The meta-descent strategies instead directly learn step-sizes that minimize the same objective as the base learner. A simpler set of such methods, called hypergradient methods [Jacobs1988, Almeida et al.1998, Baydin et al.2018], only adjust the step-size based on its impact on the weights on a single step. Hypergradient Descent (HD) [Baydin et al.2018] takes the gradient of the loss $L_t(\mathbf{w}_t)$ w.r.t. a scalar step-size $\alpha$, to get the meta-gradient for the step-size as $-\nabla L_t(\mathbf{w}_t)^\top \nabla L_{t-1}(\mathbf{w}_{t-1})$. The update simply requires storing the vector $\nabla L_{t-1}(\mathbf{w}_{t-1})$ and updating $\alpha \leftarrow \alpha + \bar{\alpha}\, \nabla L_t(\mathbf{w}_t)^\top \nabla L_{t-1}(\mathbf{w}_{t-1})$, for a meta step-size $\bar{\alpha}$. More generally, meta-descent methods, like IDBD [Sutton1992b] and SMD [Schraudolph1999], consider the impact of the step-size back in time, through the weights, with the $j$-th element in the meta-gradient vector given by

$$\frac{\partial L_t(\mathbf{w}_t)}{\partial \alpha_j} = \nabla_{\mathbf{w}} L_t(\mathbf{w}_t)^\top\, \frac{\partial \mathbf{w}_t}{\partial \alpha_j}. \qquad (2)$$

The goal is to approximate this gradient efficiently, usually using a recursive strategy. We derive such a strategy for AdaGain below using a different meta-descent objective, and for completeness include the derivation for the SMD objective in the appendix (as the original contains an error).
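A minimal sketch of Hypergradient Descent for a scalar step-size, assuming a plain SGD base update and a user-supplied gradient function (both our own choices, for illustration only):

```python
import numpy as np

def hypergradient_sgd(grad_fn, w, alpha=0.01, meta=1e-4, steps=1000):
    """Hypergradient Descent sketch: adapt a scalar step-size alpha by a
    gradient step on the loss w.r.t. alpha, using only the previous gradient.
    grad_fn(w) must return the gradient of the loss at w."""
    g_prev = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        # d L_t / d alpha = -g_t . g_{t-1}, so descend alpha by its negative
        alpha += meta * np.dot(g, g_prev)
        w = w - alpha * g
        g_prev = g
    return w, alpha

# usage on a simple quadratic loss 0.5 * ||w - w_star||^2
w_star = np.array([1.0, -2.0, 0.5])
w, alpha = hypergradient_sgd(lambda w: w - w_star, np.zeros(3))
print(w, alpha)
```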

2.1 Illustrative example

To make the problem more concrete, consider a simple state-less tracking problem driven by two interacting Gaussians:

$$Y_t \sim \mathcal{N}(\mu_t, \sigma_n^2), \qquad \mu_t \sim \mathcal{N}(\mu_{t-1}, \sigma_d^2), \qquad (3)$$

where the agent only observes the sequence $Y_1, Y_2, Y_3, \ldots$. The objective is to minimize the mean squared error (MSE) between a scalar prediction $\hat{y}_t$ and the target $Y_t$. This problem is non-stationary because $\sigma_d^2$ and $\sigma_n^2$ change periodically and the agent has no knowledge of the schedule. Since $\sigma_d^2$ and $\sigma_n^2$ govern how quickly the mean drifts and the sampling variance in $Y_t$, the agent must adapt its step-size accordingly: a larger $\sigma_d^2$ requires a larger step-size, and a larger $\sigma_n^2$ requires a smaller step-size. The agent must continually change its scalar step-size value in order to achieve low MSE. The optimal constant scalar step-size can be computed in this simple domain [Sutton1992b], and is shown by the black dashed line in Figure 1. We compared the step-sizes learned by several well-known quasi-second order methods (AdaGrad, RMSProp, AdaDelta) and three meta-descent strategies, including our own AdaGain. We ran the experiment for over 24 hours to test the robustness of these methods in a long-running continual prediction task. Several methods, including AdaGain, were able to match the optimal step-size. However, several well-known methods, including AdaGrad and AdaDelta, completely fail in this problem. In addition, the meta-descent strategy SMD diverged after 8183817 time steps, highlighting the special challenges of online, continual prediction problems.
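To make the setup concrete, the sketch below simulates a drifting mean observed through noise and tracks it with a constant scalar step-size; the variance schedule and constants are placeholders, not the values used in the experiment above.

```python
import numpy as np

rng = np.random.default_rng(1)

def run(drift_std, noise_std, alpha, steps=200_000, period=50_000):
    """Stateless tracking: the mean mu drifts as a random walk, the agent
    sees only the noisy sample y and tracks it with prediction p.
    drift_std and noise_std are switched every `period` steps to make the
    problem non-stationary (the schedule here is a placeholder)."""
    mu, p, se = 0.0, 0.0, 0.0
    for t in range(steps):
        i = (t // period) % len(drift_std)     # current phase of the schedule
        mu += rng.normal(0.0, drift_std[i])    # drifting target mean
        y = rng.normal(mu, noise_std[i])       # observed sample
        se += (p - y) ** 2                     # error before seeing y
        p += alpha * (y - p)                   # constant-step-size tracker
    return se / steps

# larger drift favours a larger step-size; larger noise favours a smaller one
for alpha in (0.01, 0.1, 0.5):
    print(alpha, run(drift_std=[0.05, 0.5], noise_std=[0.5, 0.05], alpha=alpha))
```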

Figure 1: Optimal Gain Experiment. Depicted are the last 500,000 steps of the run. AdaGrad and AdaDelta fail to learn the correct progression of step-sizes, and SMD diverges.

3 Adaptive Gain for Stability

Tracking—continually updating the weights with recent experience—contrasts with the typical goal of convergence. Much of the previous algorithm development for step-size adaptation, however, has been towards the aim of convergence, with algorithms like AdaGrad and AMSGrad that decay step-sizes over time. Assuming finite representational capacity, there may be aspects of the problem that can never be accurately modeled or predicted by the agent. In these partially observable problems, tracking, and thus treating the problem as if it were non-stationary, can improve prediction accuracy compared with methods that converge [Sutton, Koop, and Silver2007]. In continual learning we assume the agent’s task is partially observable in this way, and develop a new step-size method that can facilitate tracking.

We treat the learning system as a dynamical system—where the weight update is based on stochastic updates known to suitably track the targets—and consider the choice of step-size as the control input to the system, used to maintain stability. Such a view has been previously considered under adaptive gain for least-mean squares (LMS) [Benveniste, Metivier, and Priouret1990, Chapter 4], where the weights are treated as state following a random drift. To generalize this idea to other incremental algorithms, we propose a more general criterion based on the magnitude of the update vector.

A criterion on $\vec{\alpha}$ to maintain stability in the system is to keep the norm of the update vector small:

$$\min_{\vec{\alpha} \in \mathbb{R}^d_+}\; \mathbb{E}\Big[\big\|\Delta_{t+1}(\mathbf{w}_{t+1})\big\|_2^2 \;\Big|\; \vec{\alpha}, \mathbf{w}_0\Big]. \qquad (4)$$

The update on this time step is dependent on the step-size $\vec{\alpha}$ because that step-size influences $\mathbf{w}_{t+1}$ and past updates. The expected value is over all possible update vectors for the given step-size, assuming the system started with some $\mathbf{w}_0$. If the dynamics are ergodic, the objective does not depend on the initial $\mathbf{w}_0$, and is only driven by the underlying state dynamics and the choice of $\vec{\alpha}$. The step-size can be seen as a control input for this system, with the goal to maintain a stable dynamical system by minimizing (4) over time.

We derive an algorithm to estimate $\vec{\alpha}$ for this dynamical system, which we call AdaGain: Adaptive Gain for Stability. The algorithm is derived for a generic update $\Delta_t(\mathbf{w}_t)$ that is differentiable w.r.t. the weights $\mathbf{w}_t$; we provide specific examples for particular updates in the appendix, including for linear TD.

3.1 Generic algorithm with quadratic-complexity

We derive the full quadratic-complexity algorithm to start, and then introduce approximations to obtain a linear-complexity algorithm. To minimize (4), we use stochastic gradient descent, and thus need to compute the gradient of $\|\Delta_{t+1}(\mathbf{w}_{t+1})\|_2^2$ w.r.t. the step-size $\vec{\alpha}$. For step-size $\alpha_j$, the $j$-th element in the vector $\vec{\alpha}$,

$$\frac{\partial}{\partial \alpha_j}\, \big\|\Delta_{t+1}(\mathbf{w}_{t+1})\big\|_2^2 = 2\, \Delta_{t+1}(\mathbf{w}_{t+1})^\top\, \frac{\partial \Delta_{t+1}(\mathbf{w}_{t+1})}{\partial \mathbf{w}_{t+1}}\, \frac{\partial \mathbf{w}_{t+1}}{\partial \alpha_j}.$$

The key, then, is to track how a change in the weights impacts the update and how changes in the step-size impact the weights. The first term, the Jacobian $J_{t+1} = \frac{\partial \Delta_{t+1}(\mathbf{w}_{t+1})}{\partial \mathbf{w}_{t+1}}$, can be computed instantaneously on this step. For the second term, however, the impact of the step-size on the weights goes back further to previous updates. We show how to obtain a recursive form for this step-size gradient, $\vec{\psi}_{t+1,j} = \frac{\partial \mathbf{w}_{t+1}}{\partial \alpha_j}$. Differentiating the update (1) gives

$$\vec{\psi}_{t+1,j} = \frac{\partial \mathbf{w}_t}{\partial \alpha_j} + \mathbf{e}_j\, \Delta_{t,j}(\mathbf{w}_t) + \operatorname{diag}(\vec{\alpha})\, \frac{\partial \Delta_t(\mathbf{w}_t)}{\partial \mathbf{w}_t}\, \frac{\partial \mathbf{w}_t}{\partial \alpha_j}
= \big(I + \operatorname{diag}(\vec{\alpha})\, J_t\big)\, \vec{\psi}_{t,j} + \mathbf{e}_j\, \Delta_{t,j}(\mathbf{w}_t),$$

where $\mathbf{e}_j$ is the $j$-th standard basis vector and $\Delta_{t,j}(\mathbf{w}_t)$ is the $j$-th element of the update vector. Therefore, $\vec{\psi}_{t+1,j}$ represents a sum of updates, with a recursive weighting on previous $\vec{\psi}_{t,j}$ adjusting the weight of previous updates in the sum.

We can approximate the gradient using this recursive relationship, without storing all previous samples. Though the above updates are exact, we obtain an approximation when implementing such a recursive form in practice. When using $\vec{\psi}_{t,j}$ computed on the last time step, this gradient estimate is in fact w.r.t. the previous step-size, rather than the current one. Because these step-sizes are slowly changing, this gradient still provides a reasonable estimate; however, for many steps into the past, the accumulated gradients in $\vec{\psi}_{t,j}$ are likely inaccurate. To improve the approximation, and forget old gradients, we introduce a forgetting parameter $\lambda \in [0,1)$, which focuses the accumulation of gradients in $\vec{\psi}_{t,j}$ to a more recent window.

The gradient update to the step-size also needs to ensure that the step-sizes remain positive. Similarly to IDBD, we use an exponential form for the step-size, where $\alpha_j = \exp(\beta_j)$ and $\beta_j$ is updated with (unconstrained) stochastic gradient descent. Conveniently, as we show in the appendix, we do not need to maintain this auxiliary variable, and can simply update $\alpha_j$ directly.

The resulting generic updates for quadratic-complexity AdaGain, with meta step-size $\bar{\alpha} > 0$, are

(5)

where the exponential is applied element-wise, the matrix $\Psi_t$ collects the vectors $\vec{\psi}_{t,j}$ as its columns (initialized to zero, or some other initial value), and the forgetting parameter $\lambda$ discounts older contributions. For computational efficiency, to avoid a matrix-matrix multiplication, the order of multiplication for the meta-gradient $\Psi_{t+1}^\top J_{t+1}^\top \Delta_{t+1}(\mathbf{w}_{t+1})$ should start from the right, as $\Psi_{t+1}^\top \big(J_{t+1}^\top \Delta_{t+1}(\mathbf{w}_{t+1})\big)$. The key complexity in deriving an AdaGain update, then, is simply in computing the Jacobian $J_t = \frac{\partial \Delta_t(\mathbf{w}_t)}{\partial \mathbf{w}_t}$; given this, the remainder of the algorithm is fixed. For each update $\Delta_t$, the Jacobian will be different, but is straightforward to compute.
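Because the displayed update (5) did not survive extraction, the following sketch gives our reading of the quadratic-complexity procedure for a concrete base learner. It assumes an LMS update $\Delta_t(\mathbf{w}) = (y_t - \mathbf{w}^\top \mathbf{x}_t)\,\mathbf{x}_t$, whose Jacobian is $-\mathbf{x}_t\mathbf{x}_t^\top$; the placement of the forgetting parameter and all constants are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def adagain_quadratic_lms(X, Y, meta=1e-3, lam=0.9, alpha0=0.05):
    """Sketch of quadratic-complexity AdaGain (our reading of Section 3.1),
    with an LMS base update Delta(w) = (y - w^T x) x, whose Jacobian w.r.t.
    w is J = -x x^T. Psi[:, j] approximates d w / d alpha_j."""
    n, d = X.shape
    w = np.zeros(d)
    alpha = np.full(d, alpha0)
    Psi = np.zeros((d, d))
    delta = (Y[0] - w @ X[0]) * X[0]
    for t in range(n - 1):
        x = X[t]
        J = -np.outer(x, x)                       # Jacobian of the LMS update
        # recursive sensitivity of the weights to each step-size,
        # with forgetting lam focusing on a recent window (our placement)
        Psi = lam * (Psi + (alpha[:, None] * J) @ Psi) + np.diag(delta)
        w = w + alpha * delta                     # base update, Equation (1)
        # next update and its Jacobian, evaluated at the new weights
        x1 = X[t + 1]
        delta = (Y[t + 1] - w @ x1) * x1
        J1 = -np.outer(x1, x1)
        # gradient of ||delta||^2 w.r.t. alpha: 2 Psi^T J1^T delta,
        # computed right-to-left to avoid a matrix-matrix product
        grad = 2.0 * (Psi.T @ (J1.T @ delta))
        # exponential form, updated directly on alpha (extra alpha factor,
        # see Appendix B.1), keeping step-sizes positive
        alpha = alpha * np.exp(-meta * alpha * grad)
    return w, alpha

# toy usage on a small regression stream
rng = np.random.default_rng(2)
X = rng.normal(size=(5000, 4))
Y = X @ np.array([1.0, -1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=5000)
print(adagain_quadratic_lms(X, Y))
```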

3.2 Generic AdaGain algorithm with linear-complexity

Maintaining the entire matrix $\Psi_t$ can be prohibitively expensive. As was done in IDBD [Sutton1992b], one way to avoid maintaining this matrix is to assume that $\frac{\partial w_{t,i}}{\partial \alpha_j} = 0$ for $i \neq j$. This heuristic reflects that $\alpha_j$ is likely to have the largest impact on $w_{t,j}$, and less impact on the other entries in $\mathbf{w}_t$.

The modification above for this heuristic is straightforward, simply by setting the entries $[\vec{\psi}_{t,j}]_i = 0$ for $i \neq j$. Since we then assume $\frac{\partial w_{t,i}}{\partial \alpha_j} = 0$ for $i \neq j$, there is no purpose in computing the full vector $\vec{\psi}_{t,j}$. Instead, we only need to compute its $j$-th entry. We can then define $\psi_{t,j}$ to be a scalar approximating $\frac{\partial w_{t,j}}{\partial \alpha_j}$, with $\vec{\psi}_t \in \mathbb{R}^d$ the vector of these, and define the recursion as

$$\psi_{t+1,j} = \lambda \big(1 + \alpha_j\, [J_t]_{jj}\big)\, \psi_{t,j} + \Delta_{t,j}(\mathbf{w}_t), \qquad \text{with } \psi_{0,j} = 0.$$

The gradient using this approximation, with off-diagonals zero, is $2\, [J_{t+1}^\top \Delta_{t+1}(\mathbf{w}_{t+1})]_j\, \psi_{t+1,j}$ for step-size $\alpha_j$. To compute this approximation, for all $j$, we still need to be able to compute the Jacobian-vector product and the diagonal of the Jacobian. In some cases this is straightforward, as is the case for linear TD (found in the appendix). More generally, we can use R-operators [Pearlmutter1994] to compute this Jacobian-vector product, or a simple finite difference approximation, as we do in the appendix. Therefore, because we can compute this Jacobian-vector product in linear time, the only approximation is to $\vec{\psi}_t$. The update is

(6)

These approximations parallel diagonal approximations for second-order techniques, which similarly assume off-diagonal elements are zero. Further, the Jacobian $J_t$ is itself a gradient of the update w.r.t. the weights, where this update was already likely the gradient of the loss w.r.t. the weights. This Jacobian, therefore, contains similar information to the Hessian. The AdaGain update thus contains some information about curvature, but allows for updates that are not necessarily (true) gradient updates.

This AdaGain update is generic, but does require computing the Jacobian of a given update, which could be onerous in certain settings. We provide an update in the appendix, based on finite differences, that only requires differences between updates, and that we have found works well in practice.
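Since the body of Equation (6) is likewise missing here, the sketch below shows our reading of the linear-complexity variant for the same LMS base learner, where the Jacobian-vector product and the diagonal of the Jacobian are available analytically; the forgetting placement and constants are again our assumptions.

```python
import numpy as np

def adagain_linear_lms(X, Y, meta=1e-3, lam=0.9, alpha0=0.05):
    """Sketch of linear-complexity AdaGain (our reading of Section 3.2) for
    an LMS base update Delta(w) = (y - w^T x) x. For this update the
    Jacobian is J = -x x^T, so diag(J) = -x*x and J^T v = -x (x . v)."""
    n, d = X.shape
    w = np.zeros(d)
    alpha = np.full(d, alpha0)
    psi = np.zeros(d)                       # scalar sensitivity per weight
    x, y = X[0], Y[0]
    delta = (y - w @ x) * x
    for t in range(n - 1):
        jac_diag = -x * x                   # diagonal approximation of J_t
        # linear-complexity recursion for psi, with forgetting lam (our placement)
        psi = lam * (1.0 + alpha * jac_diag) * psi + delta
        w = w + alpha * delta               # base update, Equation (1)
        x, y = X[t + 1], Y[t + 1]
        delta = (y - w @ x) * x             # update at the new weights
        jvp = -x * (x @ delta)              # J_{t+1}^T delta, in linear time
        grad = 2.0 * jvp * psi              # diagonal meta-gradient estimate
        # exponential form updated directly on alpha (see Appendix B.1)
        alpha = alpha * np.exp(-meta * alpha * grad)
    return w, alpha

rng = np.random.default_rng(3)
X = rng.normal(size=(5000, 4))
Y = X @ np.array([1.0, -1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=5000)
print(adagain_linear_lms(X, Y))
```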

4 Experiments in synthetic tasks

We conduct experiments in several simulation domains to highlight the performance characteristics of meta-descent and quasi-second order methods. In our first experiment we investigate AdaGain and several meta-descent and quasi-second order approaches on a notoriously difficult stationary optimization task. Next we return to the simple state-less tracking problem described in the introduction, and investigate the parameter sensitivity of each method. Our third experiment investigates how different optimization algorithms can stabilize the iterates in sequential off-policy learning problems, which cause SGD-based methods to diverge. We conclude with a comparison of AdaGain and AMSGrad (the best performing quasi-second order method in the first three experiments) for online prediction on data generated by a mobile robot.

In all the experiments, we use AdaGain layered on top of an RMSProp update, rather than a vanilla SGD update. As motivated earlier, meta-descent methods are not robust on difficult optimization surfaces, such as those with flat or sharp regions. AdaGain provides a practical way to pursue meta-descent strategies that are robust to such realistic optimization problems. We motivate the importance of this choice in our first experiment on a difficult optimization task.

Figure 2: Optimization paths of a single run (with tuned meta-parameters) for several algorithms on the Rosenbrock function. The paths represent how each algorithm changes the weights while searching for the minimum. A white marker indicates where in the input space each algorithm converged; a second marker indicates the optimal value for the weights—if the two markers overlap, the algorithm has reached the global minimum of the function. Although SGD and SMD appear to quickly approach the minimum, the valley is in fact easy to find, but reaching the minimum within it is difficult. Neither method achieves a low final value, and both converge slowly. The AdaGain algorithms with RMSProp—including the full quadratic AdaGain algorithm, AdaGain with the linear approximation, and AdaGain with the linear approximation and finite differences—outperform the other methods in this problem. The finite-differences AdaGain algorithm is a generic strategy that does not require knowledge of the Jacobian, and so can be easily applied to any update (provided in the appendix). This result highlights that there is not a significant loss in using this approximation over AdaGain with analytic Jacobians. AdaGain without RMSProp, on the other hand, converges much more slowly, though interestingly it does still outperform SMD. Note that although the run shown of AdaGain without RMSProp did reach the minimum, this was not true in general, as reflected by the learning curve.

Function optimization. The aim of our first experiment is to investigate how AdaGain performs on optimization problems designed to be difficult for gradient descent. The Rosenbrock function is a two-dimensional non-convex function whose minimum is inside a flat parabolic-shaped valley. We compared against AMSGrad, SGD, and SMD, in each case extensively searching the meta-parameters of each method and averaging performance over 100 runs and 6000 optimization steps. The results are summarized in Figure 2, with trajectory plots of a single run of each algorithm and the learning curves for all methods. AdaGain both learns faster and gets closer to the global optimum than all other methods considered. Further, two meta-descent methods, SMD and AdaGain without RMSProp, perform poorly. This result highlights issues with applying meta-descent approaches without considering the optimization surface, and the importance of having an algorithm like AdaGain that can be combined with quasi-second order methods.

Figure 3: Parameter sensitivity plot for the first 500,000 steps of the stateless tracking problem. Each circle denotes the average MSE for a single parameter combination of an algorithm. The parameter combinations and respective performance are grouped in vertical columns for each method. The circles in each column are randomly offset within the column horizontally as many parameter settings may achieve almost identical MSE. Circles near the bottom of the plot represent low MSE. Circles arranged in a line in the top-most part of the plot are parameter combinations that either diverged or exceeded a minimum performance threshold, with the percentage of such parameter combinations given in the graph.

Stateless tracking problem. Recall from Figure 1, that several methods performed well in the stateless tracking problem; sensitivity to parameter settings, however, is also important. To help better understand these methods, we constructed a parameter sensitivity graph (Figure 3). IDBD can outperform AdaGain on this problem (lower MSE), but only a tiny fraction of IDBD’s parameter settings achieve good performance. None of AdaGrad’s parameter combinations exceeded the threshold, but all combinations resulted in high error compared with AdaGain. Many of the parameter combinations allowed AdaGain to achieve low error, suggesting AdaGain with a simple manual parameter tuning is likely to achieve good performance on this problem, while IDBD likely requires a comprehensive parameter sweep.

Baird’s counterexample. Our final synthetic-domain experiment tests the stability of AdaGain’s update when combined with the TD($\lambda$) algorithm for off-policy state-value prediction in a Markov Decision Process. We use Baird’s counterexample, which causes the weights learned by off-policy TD($\lambda$) [Sutton and Barto1998] to diverge if a global step-size parameter is used (decaying or otherwise) [Baird1995, Sutton and Barto1998, Maei2011]. The key challenge is the feature representation, and the difference between the target and behavior policies. There is a shared redundant feature, and the weight associated with the seventh feature is initialized to a high value. The target policy always chooses to go to state seven and stay there forever. The behavior policy, on the other hand, only visits state seven 1/7th of the time, causing large importance sampling corrections.

We applied AdaGain, AMSGrad, RMSProp, SMD, and TIDBD [Kearney et al.2018]—a recent extension of the IDBD algorithm—to adapt the step-sizes of linear TD($\lambda$) on Baird’s counterexample. As before, the meta-parameters were extensively swept and the best performing parameters were used to generate the results for comparison. Figure 5 shows the learning curves of each method. Only AdaGain and AMSGrad are able to prevent divergence. SMD’s performance is typical of Baird’s counterexample: the meta-parameter search simply found parameters that caused extremely slow divergence. AdaGain learns significantly faster than AMSGrad, and achieves lower error.

To understand how AdaGain prevents divergence, consider Figure 4. The left graph shows the step-size values as they evolve over time, and the right graph shows the corresponding weights. Recall, the weight for feature seven is initialized to a high value. AdaGain initially increases feature seven’s step-size, causing weight seven to quickly fall. In parallel, AdaGain reduces the step-size for the redundant feature, preventing incorrect generalization. Over time the weights converge to one of many valid solutions, and the value error, plotted in black on the right side, converges to zero. The left plots of Figure 5 show the same evolution of the weights and step-sizes for AMSGrad. AMSGrad is successful in reducing the step-size for the redundant feature; however, the step-sizes of the other features decay quickly and then begin growing again, preventing convergence to low value error.

Figure 4: The step-size parameter values over time, and the corresponding weights learned by AdaGain in Baird’s counterexample, with results averaged over 1000 independent runs. AdaGain is able to adapt the step-sizes of each feature in such a way that off-policy TD($\lambda$) converges.
Figure 5: The step-size parameter values over time, the corresponding weights learned by AMSGrad, and learning curves for several methods in Baird’s counterexample. Results averaged over 1000 independent runs. TD($\lambda$) combined with AdaGain achieves the best performance. AMSGrad also prevents divergence, but converges to worse value error.
Figure 6: The median symmetric mean absolute percentage error (SMAPE) across all 53 sensors (left), with a plot of the predictions for the heat sensor versus the ideal prediction in early learning (right). The ideal predictions are computed offline using all future data (as described in [Modayil, White, and Sutton2014]), but the predictions are learned online and incrementally. The learning curve shows that the predictions learned by AdaGain achieve good accuracy more quickly than those learned by AMSGrad. The right plot highlights early learning performance on the heat sensor—from time zero—illustrating that AdaGain’s prediction more quickly approaches the desired magnitude and then maintains good stability. This is particularly notable because the heat sensor targets in this case are unnormalized, reaching values over 1 million. We also include the optimal predictions computed by solving a system of equations offline (again as in [Modayil, White, and Sutton2014]). The optimal solution makes use of only the first 40,000 data points for each sensor, reflecting the realistic scenario of computing predictions from a limited batch of data and later using the offline solution for online prediction. As expected, the SMAPE for these offline optimal predictions is low on the training data (first 40,000 time steps), and much higher on later data.

5 Experiments on robot data

In our final experiment we recreate nexting [Modayil, White, and Sutton2014], using TD($\lambda$) to make dozens of predictions about the future values of robot sensor readings. We formulate each prediction as estimating the discounted sum of future sensor readings, treating each sensor as a reward signal, with a discount factor corresponding to approximately 80-second predictions. Using the freely available nexting data set (144,000 samples, corresponding to 3.4 hours of runtime on the robot), we incrementally processed the data, on each step constructing a feature vector from the sensor vector and making one prediction for each sensor. At the end of learning we computed the "ideal" predictions offline, computed the symmetric mean absolute percentage error of each prediction, and aggregated the 50 learning curves using the median. We used the same non-linear coarse recoding of the sensor inputs described in the original work, giving 6065 binary feature components for use as a linear representation.
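As an illustration of how the offline targets and errors can be computed for one sensor stream, here is a small sketch; the discount value, the synthetic signal, and this particular SMAPE definition are our own assumptions.

```python
import numpy as np

def discounted_returns(signal, gamma):
    """Ideal prediction targets: G_t = sum_i gamma^i * signal[t+1+i],
    computed offline with a backward pass (truncated at the end of data)."""
    G = np.zeros_like(signal, dtype=float)
    running = 0.0
    for t in range(len(signal) - 2, -1, -1):
        running = signal[t + 1] + gamma * running
        G[t] = running
    return G

def smape(pred, target):
    """One common symmetric mean absolute percentage error, in [0, 1]."""
    denom = np.abs(pred) + np.abs(target)
    denom = np.where(denom == 0, 1.0, denom)
    return np.mean(np.abs(pred - target) / denom)

# illustrative usage: gamma chosen so the horizon is roughly 1/(1-gamma) steps
sensor = np.abs(np.random.default_rng(4).normal(size=10_000))
ideal = discounted_returns(sensor, gamma=0.99)
print(smape(np.full_like(ideal, ideal.mean()), ideal))
```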

For this experiment we reduced the number of algorithms, using AMSGrad as the best performing quasi-second order method based on our synthetic task experiments and AdaGain as the representative meta-descent algorithm. The meta step-size was optimized for both algorithms.

Figure 7: Three snapshots in time of the predictions learned by AdaGain compared with the offline ideal predictions. Each of the three plots highlights a different part of the dataset to give an alternative perspective on the accuracy of AdaGain’s learned predictions. In the leftmost plot we see a situation where the robot stalled unexpectedly directly in front of a bright light source, saturating the light sensor. Due to this sudden unpredictable event, the predictions of both AdaGain and AMSGrad became incorrect. However, AdaGain adapts more quickly, adjusting its predictions to reflect the new reality and matching the ideal predictions (black line). Otherwise, these plots show that, in general, AdaGain and AMSGrad track the ideal prediction similarly.

The learning curves in Figure 6 show a clear advantage for AdaGain in terms of aggregate error over all predictions. Inspecting the predictions for one of the heat sensors reveals why. In early learning, AdaGain much more quickly increases its prediction to near the ideal prediction, whereas AMSGrad reaches this point much more slowly—over 12000 steps. AdaGain and AMSGrad then both track the ideal heat prediction similarly, and so obtain similar error for the remainder of learning. This advantage in initial learning is also demonstrated in Figure 7, which depicts predictions on two different sensors. For example, AdaGain adapts its predictions more quickly in reaction to the unexpected stall event, but otherwise AdaGain and AMSGrad obtain similar errors. This result also serves as a sanity check for AdaGain, validating that it scales to more realistic problems and remains stable in the face of high levels of noise and high-magnitude prediction targets.

6 Conclusion

In this work, we proposed a new general meta-descent strategy to adapt a vector of step-sizes for online, continual prediction problems. We defined a new meta-descent objective that enables a broader class of incremental updates for the base learner, generalizing beyond work specialized to least-mean squares, temporal difference learning, and vanilla stochastic gradient descent updates. We derived a recursive update for the step-sizes, and provided a linear-complexity approximation. In a series of experiments, we highlighted that meta-descent strategies are not robust to the shape of the optimization surface. The ability to use AdaGain for generic updates enabled us to overcome this issue, by layering AdaGain on RMSProp, a simple quasi-second order approach. We then showed that, with this modification, meta-descent methods can perform better than the more commonly used quasi-second order updates, adapting more quickly in non-stationary tasks.

References

  • [Almeida et al.1998] Almeida, L. B.; Langlois, T.; Amaral, J. D.; and Plakhov, A. 1998. Parameter adaptation in stochastic optimization. In Saad, D., ed., On-Line Learning in Neural Networks, 111–134. New York, NY, USA: Cambridge University Press.
  • [Amari, Park, and Fukumizu2000] Amari, S.-i.; Park, H.; and Fukumizu, K. 2000. Adaptive Method of Realizing Natural Gradient Learning for Multilayer Perceptrons. Neural Computation.
  • [Andrychowicz et al.2016] Andrychowicz, M.; Denil, M.; Gómez, S.; Hoffman, M. W.; Pfau, D.; Schaul, T.; and de Freitas, N. 2016. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems.
  • [Baird1995] Baird, L. 1995. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995. Elsevier. 30–37.
  • [Baydin et al.2018] Baydin, A. G.; Cornish, R.; Rubio, D. M.; Schmidt, M.; and Wood, F. 2018. Online Learning Rate Adaptation with Hypergradient Descent. In International Conference on Learning Representations.
  • [Benveniste, Metivier, and Priouret1990] Benveniste, A.; Metivier, M.; and Priouret, P. 1990. Adaptive Algorithms and Stochastic Approximations. Springer.
  • [Bordes, Bottou, and Gallinari2009] Bordes, A.; Bottou, L.; and Gallinari, P. 2009. SGD-QN: Careful quasi-Newton stochastic gradient descent. Journal of Machine Learning Research.
  • [Dabney and Barto2012] Dabney, W., and Barto, A. G. 2012. Adaptive step-size for online temporal difference learning. In AAAI.
  • [Dabney and Thomas2014] Dabney, W., and Thomas, P. S. 2014. Natural Temporal Difference Learning. In AAAI Conference on Artificial Intelligence.
  • [Dabney2014] Dabney, W. C. 2014. Adaptive Step-sizes for Reinforcement Learning. Ph.D. Dissertation, University of Massachusetts - Amherst.
  • [Duchi, Hazan, and Singer2011] Duchi, J.; Hazan, E.; and Singer, Y. 2011. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research.
  • [Finn, Abbeel, and Levine2017] Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In International Conference on Machine Learning.
  • [Jacobs1988] Jacobs, R. 1988. Increased rates of convergence through learning rate adaptation. Neural Networks.
  • [Jaeger2000] Jaeger, H. 2000. Observable Operator Processes and Conditioned Continuation Representations. Neural Computation.
  • [Kearney et al.2018] Kearney, A.; Veeriah, V.; Travnik, J. B.; Sutton, R. S.; and Pilarski, P. M. 2018. Tidbd: Adapting temporal-difference step-sizes through stochastic meta-descent. arXiv preprint arXiv:1804.03334.
  • [Kingma and Ba2015] Kingma, D. P., and Ba, J. 2015. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations.
  • [Littman, Sutton, and Singh2001] Littman, M. L.; Sutton, R. S.; and Singh, S. 2001. Predictive representations of state. In Advances in Neural Information Processing Systems.
  • [Maei2011] Maei, H. R. 2011. Gradient Temporal-Difference Learning Algorithms. Ph.D. Dissertation, University of Alberta, Edmonton, Alberta.
  • [Mahmood et al.2012] Mahmood, A. R.; Sutton, R. S.; Degris, T.; and Pilarski, P. M. 2012. Tuning-free step-size adaptation. ICASSP.
  • [McMahan and Streeter2010] McMahan, H. B., and Streeter, M. 2010. Adaptive Bound Optimization for Online Convex Optimization. In Conference on Learning Theory.
  • [Meyer et al.2014] Meyer, D.; Degenne, R.; Omrane, A.; and Shen, H. 2014. Accelerated gradient temporal difference learning algorithms. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning.
  • [Modayil, White, and Sutton2014] Modayil, J.; White, A.; and Sutton, R. S. 2014. Multi-timescale nexting in a reinforcement learning robot. Adaptive Behavior 22(2):146–160.
  • [Nesterov1983] Nesterov, Y. 1983. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady.
  • [Pan, Azer, and White2017] Pan, Y.; Azer, E. S.; and White, M. 2017. Effective sketching methods for value function approximation. In Conference on Uncertainty in Artificial Intelligence, Amsterdam, Netherlands.
  • [Pan, White, and White2017] Pan, Y.; White, A.; and White, M. 2017. Accelerated Gradient Temporal Difference Learning. In International Conference on Machine Learning.
  • [Pearlmutter1994] Pearlmutter, B. A. 1994. Fast Exact Multiplication by the Hessian. Neural Computation.
  • [Reddi, Kale, and Kumar2018] Reddi, S. J.; Kale, S.; and Kumar, S. 2018. On the Convergence of Adam and Beyond. In International Conference on Learning Representations.
  • [Roux, Schmidt, and Bach2012] Roux, N. L.; Schmidt, M.; and Bach, F. R. 2012. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems.
  • [Schaul, Zhang, and LeCun2013] Schaul, T.; Zhang, S.; and LeCun, Y. 2013. No More Pesky Learning Rates. In International Conference on Artificial Intelligence and Statistics.
  • [Schraudolph, Yu, and Günter2007] Schraudolph, N.; Yu, J.; and Günter, S. 2007. A stochastic quasi-Newton method for online convex optimization. In International Conference on Artificial Intelligence and Statistics.
  • [Schraudolph1999] Schraudolph, N. N. 1999. Local gain adaptation in stochastic gradient descent. International Conference on Artificial Neural Networks: ICANN ’99.
  • [Spall1992] Spall, J. C. 1992. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control 37(3):332–341.
  • [Sutton and Barto1998] Sutton, R. S., and Barto, A. G. 1998. Introduction to Reinforcement Learning. Cambridge, MA, USA: MIT Press, 1st edition.
  • [Sutton and Tanner2004] Sutton, R. S., and Tanner, B. 2004. Temporal-Difference Networks. In Advances in Neural Information Processing Systems.
  • [Sutton et al.2011] Sutton, R. S.; Modayil, J.; Delp, M.; Degris, T.; Pilarski, P.; White, A.; and Precup, D. 2011. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In International Conference on Autonomous Agents and Multiagent Systems.
  • [Sutton, Koop, and Silver2007] Sutton, R.; Koop, A.; and Silver, D. 2007. On the role of tracking in stationary environments. In International Conference on Machine Learning.
  • [Sutton1992a] Sutton, R. S. 1992a. Gain Adaptation Beats Least Squares? In Seventh Yale Workshop on Adaptive and Learning Systems.
  • [Sutton1992b] Sutton, R. 1992b. Adapting bias by gradient descent: An incremental version of delta-bar-delta. In AAAI Conference on Artificial Intelligence.
  • [Tieleman and Hinton2012] Tieleman, T., and Hinton, G. 2012. RmsProp: Divide the gradient by a running average of its recent magnitude. In COURSERA Neural Networks for Machine Learning.
  • [Wu et al.2018] Wu, Y.; Ren, M.; Liao, R.; and Grosse, R. B. 2018. Understanding Short-Horizon Bias in Stochastic Meta-Optimization. In International Conference on Learning Representations.
  • [Zeiler2012] Zeiler, M. D. 2012. ADADELTA: An Adaptive Learning Rate Method. arXiv:1212.5701 [cs.LG].

Appendix A Stochastic Meta-Descent algorithm

We recreate the SMD derivation, in our notation, for easier reference.

We compute the gradient of the loss function $L_t(\mathbf{w}_t)$ w.r.t. the stepsize. We derive the full quadratic-complexity algorithm to start, and then introduce approximations to obtain a linear-complexity algorithm. For stepsize $\alpha_j$, the $j$-th element in the vector $\vec{\alpha}$,

Define the following two vectors, where $i$ indexes the $i$-th element of each,

(7)
(8)

We can obtain this vector recursively as

The resulting generic updates for quadratic-complexity SMD, with meta stepsize $\bar{\alpha}$, are

(9)

with the matrix of stepsize gradients initialized to zero (or some other initial value). As with AdaGain, the Hessian-vector product can be computed efficiently using R-operators. Here, however, it is irrelevant, because we maintain the full quadratic-complexity matrix.

For the linear-complexity algorithm, again we set the entries $[\vec{\psi}_{t,j}]_i = 0$ for $i \neq j$. Let $\mathbf{h}_{t,j}$ be the $j$-th column of the Hessian. This results in the simplification

Further, since we will then assume that $\frac{\partial w_{t,i}}{\partial \alpha_j} = 0$ for $i \neq j$, there is no purpose in computing the full vector $\vec{\psi}_{t,j}$. Instead, we only need to compute its $j$-th entry. We can then define $\psi_{t,j}$ to be a scalar approximating $\frac{\partial w_{t,j}}{\partial \alpha_j}$, with $\vec{\psi}_t$ the vector of these, and the diagonal of the Hessian

(10)

to define the recursion for $\psi_{t+1,j}$, with $\psi_{0,j} = 0$. The gradient using this approximation, with off-diagonals zero, is

The resulting update to the stepsize is

(11)

Difference from the original SMD algorithm: Now, surprisingly, the above algorithm differs from the algorithm given for SMD. But that derivation appears to have a flaw: the gradient of the weights taken w.r.t. a vector of stepsizes is assumed to be a vector. Rather, with the same off-diagonal approximation we use, it should be a diagonal matrix, and then they would also only get a diagonal Hessian. For completeness, we include their algorithm, which uses a full Hessian-vector product.

(12)

Note that a follow-up paper that tested SMD [Wu et al.2018] uses this update, but does not have an error, because they use a scalar step size. In fact, in the SMD paper, if the step size had been a scalar, then their derivation would be correct.

The addition of forgetting: The original SMD algorithm did not use a forgetting parameter. In our experiments, however, we consider SMD with forgetting—which performs significantly better—since our goal is not to compare directly with SMD, but rather to compare the choice of objectives.

Appendix B Derivations for AdaGain updates

Consider again the generic update

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \vec{\alpha} \circ \Delta_t(\mathbf{w}_t) \qquad (13)$$

where $\Delta_t(\mathbf{w}_t)$ is the update for this step, for weights $\mathbf{w}_t$ and constant vector stepsize $\vec{\alpha}$, and the operator $\circ$ denotes element-wise multiplication.

b.1 Maintaining non-negative stepsizes in AdaGain

One straightforward option to maintain non-negative stepsizes is to define a constraint on the stepsize. We can prevent the stepsize from going below a small threshold $\epsilon > 0$, ensuring positive stepsizes. The projection onto this constraint set after each gradient descent step simply involves applying the operator $\max(\epsilon, \cdot)$, which thresholds any values below $\epsilon$ to $\epsilon$. We experimented with this strategy compared to the mentioned exponential form, and found it performed relatively similarly, but required an extra parameter to tune.

Another option—and the one we use in this work—is to use an exponential form for the stepsize, so that it remains positive. One form, used also by IDBD, is $\alpha_j = \exp(\beta_j)$. The algorithm, with or without an exponential form, remains essentially identical to the thresholded version.

Therefore, we can still recursively estimate the gradient with the same approach, regardless of how the stepsize is constrained. For the thresholded form, we simply use the gradient and then project (i.e., threshold). For the exponential form, the gradient update for is simply used within an exponential function, as described below.

Consider directly maintaining $\beta_j$, which is unconstrained. For the functional form $\alpha_j = \exp(\beta_j)$, the partial derivative $\frac{\partial \alpha_j}{\partial \beta_j}$ is simply equal to $\alpha_j$, and so the gradient update to $\beta_j$ includes an additional factor of $\alpha_j$ in front. This can be maintained more explicitly, without an additional variable, by noticing that a gradient step on $\beta_j$ corresponds to a multiplicative update on $\alpha_j = \exp(\beta_j)$.

Therefore, we can still directly maintain $\alpha_j$. The resulting update to $\alpha_j$ is simply

(14)

Other multiplicative updates are also possible. Schraudolph [Schraudolph1999] uses an exponential update, but with an approximation involving a maximum, to avoid the expensive computation of the exponential function. Baydin et al. [Baydin et al.2018] use a similar multiplicative update, but without a maximum.
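A minimal sketch of the two options described above, with placeholder constants: a thresholded projection, and the exponential form maintained directly on the step-size (the extra factor of $\alpha$ coming from $\partial \alpha_j / \partial \beta_j = \alpha_j$).

```python
import numpy as np

def stepsize_thresholded(alpha, grad, meta=1e-3, floor=1e-6):
    """Projected gradient step: update alpha, then clip below at a small floor."""
    return np.maximum(alpha - meta * grad, floor)

def stepsize_exponential(alpha, grad, meta=1e-3):
    """Exponential form alpha = exp(beta), maintained directly on alpha:
    the chain rule gives d obj / d beta = alpha * grad, so
    alpha <- exp(beta - meta * alpha * grad) = alpha * exp(-meta * alpha * grad)."""
    return alpha * np.exp(-meta * alpha * grad)

# illustrative usage with an arbitrary meta-gradient vector
alpha = np.full(3, 0.1)
grad = np.array([0.5, -0.2, 0.0])
print(stepsize_thresholded(alpha, grad), stepsize_exponential(alpha, grad))
```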

b.2 AdaGain for linear TD

In this section, we derive the AdaGain update for a particular algorithm, namely linear TD($\lambda$). LMS updates can be obtained as a special case, by setting the discount and trace parameters to zero. We then provide a more general update algorithm—which does not require knowledge of the form of the update—in the next section. One advantage of AdaGain is that it is derived generically, allowing extensions to many online algorithms, unlike IDBD and its variants, which are derived specifically for the squared TD-error.

We first provide the AdaGain updates for linear TD($\lambda$), and then provide the derivation below. For TD($\lambda$), the update is

(15)

where $\delta_t = r_{t+1} + \gamma\, \mathbf{x}_{t+1}^\top \mathbf{w}_t - \mathbf{x}_t^\top \mathbf{w}_t$ is the TD error, $\mathbf{e}_t = \gamma \lambda\, \mathbf{e}_{t-1} + \mathbf{x}_t$ is the eligibility trace, and the update vector is $\Delta_t(\mathbf{w}_t) = \delta_t\, \mathbf{e}_t$.

To derive the update for $\vec{\alpha}$, we need to compute the gradients of the updates, particularly the Jacobian-vector product and the diagonal of the Jacobian, or, for the full algorithm, the Jacobian $J_t = \frac{\partial \Delta_t(\mathbf{w}_t)}{\partial \mathbf{w}_t}$.

Letting $\mathbf{d}_t = \gamma\, \mathbf{x}_{t+1} - \mathbf{x}_t$, the Jacobian is $J_t = \mathbf{e}_t \mathbf{d}_t^\top$ and the diagonal approximation is $\mathbf{e}_t \circ \mathbf{d}_t$. Because of the rank-one form of the Jacobian, we can actually use it in the update to $\vec{\psi}_t$, though not in maintaining the full matrix $\Psi_t$, if we want to maintain linearity. The quadratic complexity algorithm uses $J_t$ as given.

The linear complexity algorithm uses $\mathbf{e}_t \circ \mathbf{d}_t$ to update $\vec{\psi}_t$, giving the stepsize update in (15).
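The body of Equation (15) is missing above; as a reference, the sketch below implements the standard linear TD($\lambda$) update written as an update vector $\Delta_t = \delta_t \mathbf{e}_t$, along with the rank-one Jacobian quantities $\mathbf{e}_t \mathbf{d}_t^\top$ that AdaGain would use. The environment interface and constants are hypothetical.

```python
import numpy as np

def td_lambda_delta(w, e, x, x_next, r, gamma, lam):
    """One step of linear TD(lambda): returns the update vector
    Delta_t = delta_t * e_t and the new eligibility trace."""
    e = gamma * lam * e + x                          # accumulating trace
    delta_td = r + gamma * (x_next @ w) - (x @ w)    # TD error
    return delta_td * e, e

def td_jacobian_parts(e, x, x_next, gamma):
    """For Delta = delta * e and d = gamma * x_next - x, the Jacobian of the
    update w.r.t. w is the rank-one matrix e d^T; AdaGain only needs its
    diagonal (e * d) and products with it."""
    d = gamma * x_next - x
    return e * d, d

# hypothetical usage with a fixed vector step-size and random binary features
rng = np.random.default_rng(5)
n_feat, gamma, lam = 8, 0.9, 0.8
w, e = np.zeros(n_feat), np.zeros(n_feat)
alpha = np.full(n_feat, 0.05)
x = rng.integers(0, 2, n_feat).astype(float)
for _ in range(1000):
    x_next = rng.integers(0, 2, n_feat).astype(float)
    r = float(x[0])                                  # placeholder reward signal
    delta, e = td_lambda_delta(w, e, x, x_next, r, gamma, lam)
    w = w + alpha * delta                            # Equation (1) with the TD update
    x = x_next
print(w)
```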

b.3 Generic AdaGain algorithm

To avoid requiring knowledge about the algorithm update and its derivatives, we can provide an approximation to the Jacobian-vector product and the diagonal of the Jacobian, using finite differences. As long as the update function for the algorithm can be queried multiple times, this algorithm can be easily applied to any update.

To compute the Jacobian-vector product, we use the fact that it corresponds to a directional derivative. Notice that $J_t\, \Delta_t(\mathbf{w}_t)$ is the vector of directional derivatives of each component (function) in the update $\Delta_t$, in the direction of $\Delta_t(\mathbf{w}_t)$, because the dot-product separates over the components. Therefore, for update function $\Delta_t$ (such as the gradient of the loss), we get for small $\epsilon > 0$,

$$J_t\, \Delta_t(\mathbf{w}_t) \approx \frac{\Delta_t\big(\mathbf{w}_t + \epsilon\, \Delta_t(\mathbf{w}_t)\big) - \Delta_t(\mathbf{w}_t)}{\epsilon}. \qquad (16)$$

For the diagonal of the Jacobian, we can again use finite differences. An efficient finite difference computation is proposed within the simultaneous perturbation stochastic approximation algorithm [Spall1992], which uses a random perturbation vector $\boldsymbol{\eta}$ to compute the centered difference $\big(\Delta_{t,i}(\mathbf{w}_t + \epsilon\boldsymbol{\eta}) - \Delta_{t,i}(\mathbf{w}_t - \epsilon\boldsymbol{\eta})\big) / (2\epsilon\, \eta_i)$. This formula provides an approximation to the gradient of the $i$-th entry in the update with respect to weight $w_i$; when computed for all $i$, this approximates the diagonal of the Jacobian. To avoid additional computation, we can re-use the above difference with perturbation $\epsilon\, \Delta_t(\mathbf{w}_t)$, rather than a random vector. To avoid division by zero, if $\Delta_t(\mathbf{w}_t)$ contains a zero entry, we threshold the normalization with a small constant to give

(17)

where division is element-wise. Another approach would be to sample a random direction for this finite difference and divide by the absolute value of each element of that direction. We found empirically that using the same direction $\Delta_t(\mathbf{w}_t)$ was actually more effective, and more computationally efficient, so we propose that approach.
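As a concrete illustration of these finite-difference approximations (Equations (16) and (17) as we read them), the sketch below computes the directional difference in the direction of the update itself and re-uses it for the diagonal estimate; the thresholding constant and the LMS update used in the check are our own choices.

```python
import numpy as np

def fd_jacobian_vector(update_fn, w, eps=1e-4):
    """Finite-difference approximation of J(w) v with direction v = Delta(w):
    (Delta(w + eps*v) - Delta(w)) / eps.  update_fn(w) returns Delta(w)."""
    delta = update_fn(w)
    diff = update_fn(w + eps * delta) - delta
    return diff / eps, delta, diff

def fd_jacobian_diag(delta, diff, eps=1e-4, floor=1e-8):
    """Re-use the same difference, divided element-wise by the perturbation
    eps*Delta(w), as a cheap stand-in for the diagonal of the Jacobian; the
    magnitude of Delta is thresholded below by `floor` to avoid division by
    zero.  This coincides with the true diagonal only when off-diagonal
    contributions are negligible."""
    sign = np.where(delta < 0, -1.0, 1.0)
    denom = eps * sign * np.maximum(np.abs(delta), floor)
    return diff / denom

# illustrative usage with an LMS update, where the true Jacobian is -x x^T
x, y = np.array([1.0, 2.0, 0.0]), 1.5
update_fn = lambda w: (y - w @ x) * x
jv, delta, diff = fd_jacobian_vector(update_fn, np.zeros(3))
print(jv, fd_jacobian_diag(delta, diff))
```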

Using these approximations, we can compute the update to the stepsize as in Equation (6).

Figure 8: Depicted above are the offline optimal predictions during the light sensor stall, and during the light sensor’s normal operation (see Figure 7). The optimal offline solution was trained by computing the linear least-squares solution for the first 40,000 data points, and using that solution to make predictions on the rest of the dataset.