1 Introduction
In this paper we consider continual, nonstationary prediction problems. Consider a learning system whose objective is to learn a large collection of predictions about an agent's future interactions with the world. The predictions specify the value of some signal many steps in the future, given that the agent follows some specific course of action. There are many examples of such prediction learning systems, including Predictive State Representations [Littman, Sutton, and Singh 2001], Observable Operator Models [Jaeger 2000], Temporal-difference Networks [Sutton and Tanner 2004], and General Value Functions [Sutton et al. 2011]. In our setting, the agent continually interacts with the world, making new predictions about the future, and revising its previous predictions as new outcomes are revealed. Occasionally, partially due to changes in the world and partially due to changes in the agent's own behaviour, the targets may change and the agent must refine its predictions.¹ In our setting, the agent must solve a large collection of nonstationary prediction problems in parallel using off-policy learning methods.

¹We exclude recent meta-learning frameworks (MAML [Finn, Abbeel, and Levine 2017], LTLGDGD [Andrychowicz et al. 2016]) because they assume access to a collection of tasks that can be sampled independently, enabling the agent to learn how to select meta-parameters for a new problem.
Stochastic gradient descent (SGD) is a natural choice for our setting, because gradient descent methods work well when paired with abundant training data. The performance of SGD depends on the stepsize parameter (scalar, vector or matrix), which scales the gradient to mitigate sample variance and improve data efficiency. Most modern large-scale learning systems use optimization algorithms that attempt to approximate stochastic second-order gradient descent, adjusting both the direction and magnitude of the descent direction, with early work indicating the benefits of such quasi-second-order methods if used carefully in the stochastic case [Schraudolph, Yu, and Günter 2007; Bordes, Bottou, and Gallinari 2009]. Many of these algorithms attempt to approximate the diagonal of the inverse Hessian, which describes the curvature of the loss function, and so maintain a vector of stepsizes, one for each parameter. Starting from AdaGrad [McMahan and Streeter 2010; Duchi, Hazan, and Singer 2011], several diagonal approximations have been proposed, including RMSProp [Tieleman and Hinton 2012], AdaDelta [Zeiler 2012], vSGD [Schaul, Zhang, and LeCun 2013], Adam [Kingma and Ba 2015] and AMSGrad [Reddi, Kale, and Kumar 2018]. Stochastic quasi-second-order updates have been derived specifically for temporal difference learning, with some empirical success [Meyer et al. 2014], particularly in terms of parameter sensitivity [Pan, White, and White 2017; Pan, Azer, and White 2017]. On the other hand, second-order methods, by design, assume the loss and thus the Hessian are fixed, and so nonstationary dynamics or drifting targets could be problematic.

A related family of optimization algorithms, called meta-descent algorithms, was developed for continual, online prediction problems. These algorithms perform meta-gradient descent, adapting a vector of stepsize parameters to minimize the error of the base learner, instead of approximating the Hessian. Meta-descent applied to the stepsize was first introduced for online least-mean-squares methods [Jacobs 1988; Sutton 1992b; Sutton 1992a; Almeida et al. 1998; Mahmood et al. 2012], including the linear-complexity method IDBD [Sutton 1992b]. IDBD was later extended to more general losses [Schraudolph 1999] and to support (semi-gradient) temporal difference methods [Dabney and Barto 2012; Dabney 2014; Kearney et al. 2018]. These methods are well suited to nonstationary problems, and have been shown to ignore irrelevant features. The main limitation of several of these meta-descent algorithms, however, is that the derivations are heuristic, making them difficult to extend to new settings beyond linear temporal difference learning. The more general approaches, like Stochastic Meta-Descent (SMD) [Schraudolph 1999], require the update to be a stochastic gradient descent update and have some issues with biasing towards smaller stepsizes [Wu et al. 2018]. It remains an open challenge to make these meta-descent strategies as broadly and easily applicable as the AdaGrad variants.

In this paper we introduce a new meta-descent algorithm, called AdaGain, that attempts to optimize the stability of the base learner, rather than convergence to a fixed point. AdaGain is built on a generic derivation scheme that allows it to be easily combined with a variety of base learners, including SGD, (semi-gradient) temporal-difference learning and even optimized SGD updates, like AMSGrad. Our goal is to investigate the utility of both meta-descent methods and the more widely used quasi-second-order optimizers in online, continual prediction problems. We provide an extensive empirical comparison on (1) canonical optimization problems that are difficult to optimize, with large flat regions, (2) an online, supervised tracking problem where the optimal stepsizes can be computed, (3) a finite Markov Decision Process with linear features that cause conventional temporal difference learning to diverge, and (4) a high-dimensional time-series prediction problem using data generated from a real mobile robot. In problems with nonstationary dynamics the meta-descent methods can exhibit an advantage over the quasi-second-order methods. On the difficult optimization problems, however, meta-descent methods fail, which, retrospectively, is unsurprising given the meta-optimization problem for stepsizes is similarly difficult to optimize. We show that AdaGain can possess the advantages of both families, performing well on optimization problems with flat regions as well as on nonstationary problems, by selecting an appropriate base learner, such as RMSProp.
2 Background and Notation
In this paper we consider online continual prediction problems modeled as nonstationary, uncontrolled dynamical systems. On each discrete time step $t$, the agent observes the internal state of the system through an imperfect summary vector $o_t \in \mathbb{R}^m$, for some $m \in \mathbb{N}$, such as the sensor readings of a mobile robot. On each step, the agent makes a prediction about a target signal $y_t$. In the simplest case, the target of the prediction is a component of the observation vector on the next step, $y_t = o_{t+1,j}$: the classic one-step prediction. In the more general case, the target is constructed by mapping the entire future of the observation time series to a scalar, such as the discounted sum formulation used in reinforcement learning: $y_t = \sum_{k=0}^{\infty} \gamma^k o_{t+k+1,j}$, where $\gamma \in [0,1)$ discounts the contribution of future observations to the infinite sum. The prediction $\hat{y}_t$ is generated by a parametrized function, with modifiable parameter vector $w_t \in \mathbb{R}^d$.

In online continual prediction problems the agent updates its predictions (via $w_t$) with each new sample, unlike the more common batch and stochastic settings. The agent's objective is to minimize the error between the prediction given by $w_t$ and the target $y_t$ before it is observed, over all time steps. Online continual prediction problems are typically solved using stochastic updates to adapt the parameter vector after each time step to reduce the error (retroactively) between $\hat{y}_t$ and $y_t$. Generically, for stochastic update vector $\Delta_t \in \mathbb{R}^d$, the weights are modified as

w_{t+1} = w_t + \alpha_t \circ \Delta_t \qquad (1)

for a vector stepsize $\alpha_t \in \mathbb{R}^d$, where the operator $\circ$ denotes elementwise multiplication. Given an update vector, the goal is to select $\alpha_t$ to reduce error into the future. Semi-gradient methods like temporal difference learning follow a similar scheme, but $\Delta_t$ is not the gradient of an objective function.
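As a minimal sketch of this generic weight update, the following Python fragment applies a per-weight (vector) stepsize to an update vector; the quadratic loss and all numeric values are illustrative choices of ours, not from the paper:

```python
import numpy as np

def update_weights(w, delta, alpha):
    """Generic base-learner step: w_{t+1} = w_t + alpha * delta,
    where alpha is a per-weight (vector) stepsize and * is the
    elementwise product."""
    return w + alpha * delta

# Example base learner: SGD on the loss 0.5*||w||^2, so delta = -w
w = np.array([1.0, -2.0])
alpha = np.array([0.1, 0.5])        # one stepsize per weight
w = update_weights(w, -w, alpha)    # -> [0.9, -1.0]
```

Note that each coordinate moves at its own rate, which is the degree of freedom the stepsize-adaptation methods discussed below exploit.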
Stepsize adaptation for the stationary setting is often based on estimating second-order updates.
The idea is to estimate the loss function $\ell$ locally around the current weights $w_t$ using a second-order Taylor series approximation, which requires the Hessian $H_t = \nabla^2 \ell(w_t)$.² A closed-form solution can then be obtained for the approximation, because it is a quadratic function, giving the next candidate solution $w_{t+1} = w_t - H_t^{-1} \nabla \ell(w_t)$. If instead the Hessian is approximated, such as with a diagonal approximation, then we obtain quasi-second-order updates. Taken to the extreme, with the Hessian approximated by the scalar matrix $\frac{1}{\alpha} I$, we obtain first-order gradient descent with a stepsize of $\alpha$. For the batch setting, the gains from second-order methods are clear, with a locally quadratic convergence rate, as opposed to the linear rate of first-order descent.³ These gains are not as clear in the stochastic setting, but diagonal approximations appear to provide an effective balance between computation and convergence-rate improvements [Bordes, Bottou, and Gallinari 2009]. Duchi, Hazan, and Singer [2011] provide a general regret analysis for diagonal approximation methods, proving sublinear regret if stepsizes decrease to zero over time.

²A related class of algorithms are natural gradient methods, which aim to be robust to the functional parametrization. Incremental natural gradient methods have been proposed [Amari, Park, and Fukumizu 2000], including for policy evaluation with gradient TD methods [Dabney and Thomas 2014]. However, these algorithms do not remove the need to select a stepsize, and so we do not consider them further here.

³There is a large literature on accelerated first-order descent methods, starting from early work on momentum [Nesterov 1983], with many methods since focused mainly on variance reduction (cf. [Roux, Schmidt, and Bach 2012]). These methods can complement stepsize adaptation, but are not well suited to nonstationary problems, because many of the algorithms are designed for a batch of data and focus on increasing the convergence rate to a fixed minimum.
One algorithm, AdaGrad, uses the vector stepsize $\alpha_t = \eta \big(\sqrt{\sum_{k=1}^{t} g_k \circ g_k} + \epsilon\big)^{-1}$ for a fixed $\eta > 0$ and a small $\epsilon > 0$, where $g_k$ is the stochastic gradient on step $k$ and the square root, addition and division are elementwise. RMSProp and Adam, which are not guaranteed to obtain sublinear regret, use a running average rather than a sum of squared gradients, with Adam additionally including a momentum term for faster convergence. AMSGrad is a modification of Adam that satisfies the regret criteria without decaying the stepsizes as aggressively as AdaGrad.
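The contrast between the accumulated-sum and running-average stepsizes can be sketched as follows; $\eta$, $\epsilon$, the decay $\rho$ and the gradient values are our own illustrative choices:

```python
import numpy as np

def adagrad_stepsize(grads, eta=0.1, eps=1e-8):
    """AdaGrad: per-parameter stepsize eta / (sqrt(sum of squared
    gradients) + eps); stepsizes only shrink as gradients accumulate."""
    return eta / (np.sqrt(np.sum(np.square(grads), axis=0)) + eps)

def rmsprop_stepsize(grads, eta=0.1, rho=0.9, eps=1e-8):
    """RMSProp: same form, but with an exponential running average of
    squared gradients, so stepsizes can recover after large gradients."""
    v = np.zeros(grads.shape[1])
    for g in grads:
        v = rho * v + (1 - rho) * np.square(g)
    return eta / (np.sqrt(v) + eps)

# coordinate 0 sees consistently large gradients, coordinate 1 small ones
grads = np.array([[1.0, 0.1], [1.0, 0.1], [1.0, 0.1]])
a_ada = adagrad_stepsize(grads)   # large-gradient coordinate: small stepsize
a_rms = rmsprop_stepsize(grads)
```

Both assign the smaller stepsize to the coordinate with larger gradients, but only AdaGrad's stepsize is guaranteed to keep shrinking as the sum grows, which is the decay behaviour discussed later in the paper.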
The meta-descent strategies instead directly learn stepsizes that minimize the same objective as the base learner. A simpler set of such methods, called hypergradient methods [Jacobs 1988; Almeida et al. 1998; Baydin et al. 2018], only adjust the stepsize based on its impact on the weights on a single step. Hypergradient Descent (HD) [Baydin et al. 2018] takes the gradient of the loss $\ell(w_t)$ w.r.t. a scalar stepsize $\alpha$, to get the meta-gradient for the stepsize as $\frac{\partial \ell(w_t)}{\partial \alpha} = -\nabla \ell(w_t)^\top \nabla \ell(w_{t-1})$. The update simply requires storing the vector $\nabla \ell(w_{t-1})$ and updating $\alpha_{t+1} = \alpha_t + \bar{\theta}\, \nabla \ell(w_t)^\top \nabla \ell(w_{t-1})$, for a meta stepsize $\bar{\theta}$. More generally, meta-descent methods, like IDBD [Sutton 1992b] and SMD [Schraudolph 1999], consider the impact of the stepsize back in time, through the weights, with the $i$th element in the vector $\psi_t$ of stepsize gradients given by

\psi_{t,i} = \frac{\partial w_{t,i}}{\partial \alpha_i} \qquad (2)
The goal is to approximate this gradient efficiently, usually using a recursive strategy. We derive such a strategy for AdaGain below using a different metadescent objective, and for completeness include the derivation for the SMD objective in the appendix (as the original contains an error).
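The one-step hypergradient rule described above can be sketched as follows; the quadratic objective, meta stepsize value and loop length are our own illustrative choices, and this is a simplification of the full HD algorithm:

```python
import numpy as np

def hd_sgd(grad_fn, w, alpha=0.01, meta=1e-4, steps=100):
    """Hypergradient Descent for a scalar stepsize: only the previous
    gradient is stored, and alpha moves along the one-step
    meta-gradient g_t . g_{t-1} (Baydin et al., 2018)."""
    g_prev = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        alpha += meta * (g @ g_prev)   # grow alpha when gradients agree
        w = w - alpha * g
        g_prev = g
    return w, alpha

# strongly convex quadratic: the gradient of 0.5*||w||^2 is w itself
w, alpha = hd_sgd(lambda v: v, np.array([5.0, -3.0]))
```

On this objective, consecutive gradients point in the same direction, so the stepsize grows above its initial value while the weights shrink toward the minimum; methods like IDBD and SMD instead propagate the sensitivity $\psi_t$ through all past steps rather than just one.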
2.1 Illustrative example
To make the problem more concrete, consider a simple stateless tracking problem driven by two interacting Gaussians:
y_t \sim \mathcal{N}(\mu_t, \sigma_y^2), \qquad \mu_{t+1} \sim \mathcal{N}(\mu_t, \sigma_\mu^2), \qquad (3)

where the agent only observes the sequence $y_1, y_2, y_3, \ldots$. The objective is to minimize the mean squared error (MSE) between a scalar prediction $\hat{y}_t$ and the target $y_t$. This problem is nonstationary because $\sigma_\mu$ and $\sigma_y$ change periodically, and the agent has no knowledge of the schedule. Since $\sigma_\mu$ and $\sigma_y$ govern how quickly the mean drifts and the sampling variance in $y_t$, the agent must set its stepsize accordingly: a larger $\sigma_\mu$ requires a larger stepsize; a larger $\sigma_y$ requires a smaller stepsize. The agent must continually change its scalar stepsize value in order to achieve low MSE. The optimal constant scalar stepsize can be computed in this simple domain [Sutton 1992b], and is shown by the black dashed line in Figure 1. We compared the stepsizes learned by several well-known quasi-second-order methods (AdaGrad, RMSProp, AdaDelta) and three meta-descent strategies, including our own AdaGain. We ran the experiment for over 24 hours to test the robustness of these methods in a long-running continual prediction task. Several methods, including AdaGain, were able to match the optimal stepsize. However, several well-known methods, including AdaGrad and AdaDelta, completely fail in this problem. In addition, the meta-descent strategy SMD diverged after 8,183,817 time steps, highlighting the special challenges of online, continual prediction problems.
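The stepsize tradeoff in this domain can be reproduced with a few lines of simulation; the symbols for the drift and observation noise, and all numeric values, are our own illustrative choices rather than the paper's schedule:

```python
import numpy as np

def tracking_mse(drift_std, noise_std, alpha, steps=20000, seed=0):
    """Mean squared error of a constant-stepsize LMS tracker on a
    drifting mean: mu follows a Gaussian random walk (drift_std), and
    the observed target y is mu plus Gaussian noise (noise_std)."""
    rng = np.random.default_rng(seed)
    mu, pred, se = 0.0, 0.0, 0.0
    for _ in range(steps):
        mu += rng.normal(0.0, drift_std)
        y = mu + rng.normal(0.0, noise_std)
        se += (pred - y) ** 2
        pred += alpha * (y - pred)   # LMS: move the prediction toward y
    return se / steps

# fast drift with low observation noise: a large stepsize tracks better
slow = tracking_mse(drift_std=0.5, noise_std=0.1, alpha=0.05)
fast = tracking_mse(drift_std=0.5, noise_std=0.1, alpha=0.8)
```

With the drift dominating the noise, the large stepsize yields much lower error; reversing the two standard deviations reverses the conclusion, which is why a fixed stepsize cannot serve both regimes.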
3 Adaptive Gain for Stability
Tracking, continually updating the weights with recent experience, contrasts with the typical goal of convergence. Much of the previous algorithm development for stepsize adaptation, however, has been aimed at convergence, with algorithms like AdaGrad and AMSGrad that decay stepsizes over time. Assuming finite representational capacity, there may be aspects of the problem that can never be accurately modeled or predicted by the agent. In such partially observable problems, tracking, and thus treating the problem as if it were nonstationary, can improve prediction accuracy compared with methods that converge [Sutton, Koop, and Silver 2007]. In continual learning we assume the agent's task is partially observable in this way, and develop a new stepsize method that can facilitate tracking.
We treat the learning system as a dynamical system, where the weight update is based on stochastic updates known to suitably track the targets, and consider the choice of stepsize as the input to the system used to maintain stability. Such a view has been previously considered under adaptive gain for least-mean-squares (LMS) [Benveniste, Metivier, and Priouret 1990, Chapter 4], where the weights are treated as state following a random drift. To generalize this idea to other incremental algorithms, we propose a more general criterion based on the magnitude of the update vector.
A criterion on $\alpha$ for maintaining stability in the system is to keep the norm of the update vector small:

\min_{\alpha \in \mathbb{R}^d_{>0}} \mathbb{E}\big[ \|\Delta_{t+1}(\alpha)\|_2^2 \mid w_0 \big] \qquad (4)

The update $\Delta_{t+1}(\alpha)$ on this time step is dependent on the stepsize because the stepsize influences $w_{t+1}$ and past updates. The expected value is over all possible update vectors for the given stepsize, assuming the system started with some $w_0$. If the dynamics are ergodic, the objective does not depend on the initial $w_0$, and is only driven by the underlying state dynamics and the choice of $\alpha$. The stepsize can be seen as a control input for this system, with the goal to maintain a stable dynamical system by minimizing $\|\Delta_{t+1}\|_2^2$ over time.
We derive an algorithm to estimate $\alpha$ for this dynamical system, which we call AdaGain: Adaptive Gain for Stability. The algorithm is derived for a generic update $\Delta_t(w_t)$ that is differentiable w.r.t. the weights $w_t$; we provide specific examples for particular updates in the appendix, including for linear TD.
3.1 Generic algorithm with quadratic complexity
We derive the full quadratic-complexity algorithm to start, and then introduce approximations to obtain a linear-complexity algorithm. To minimize (4), we use stochastic gradient descent, and thus need to compute the gradient of $\|\Delta_{t+1}\|_2^2$ w.r.t. the stepsize $\alpha$. For stepsize $\alpha_i$ the $i$th element of the vector $\alpha$, and $\psi_{t+1,i} \in \mathbb{R}^d$ the gradient of the weights w.r.t. that element,

\frac{\partial \|\Delta_{t+1}\|_2^2}{\partial \alpha_i} = 2\, \Delta_{t+1}^\top \frac{\partial \Delta_{t+1}}{\partial \alpha_i} = 2\, \Delta_{t+1}^\top \frac{\partial \Delta_{t+1}}{\partial w_{t+1}} \frac{\partial w_{t+1}}{\partial \alpha_i}.

The key, then, is to track how a change in the weights impacts the update and how changes in the stepsize impact the weights. The first term can be computed instantaneously on this step. For the second term, however, the impact of the stepsize on the weights goes back further to previous updates. We show how to obtain a recursive form for this stepsize gradient, $\psi_{t+1,i} = \frac{\partial w_{t+1}}{\partial \alpha_i}$:

\psi_{t+1,i} = \frac{\partial}{\partial \alpha_i}\big(w_t + \alpha \circ \Delta_t\big) = \psi_{t,i} + e_i \Delta_{t,i} + \alpha \circ (G_t \psi_{t,i}) = \big(I + \mathrm{diag}(\alpha)\,G_t\big)\psi_{t,i} + e_i \Delta_{t,i},

where $e_i \in \mathbb{R}^d$ is the $i$th standard basis vector, $G_t = \frac{\partial \Delta_t}{\partial w_t} \in \mathbb{R}^{d \times d}$ is the Jacobian of the update, and $\Delta_{t,i}$ is the $i$th element of $\Delta_t$. Therefore, $\psi_{t+1,i}$ represents a sum of updates $e_i \Delta_{t,i}$, with a recursive weighting on previous $\psi_{t,i}$ adjusting the weight of previous updates in the sum.
We can approximate the gradient using this recursive relationship, without storing all previous samples. Though the above updates are exact, we obtain an approximation when implementing such a recursive form in practice. When using $\psi_{t,i}$ computed on the last time step, this gradient estimate is in fact w.r.t. the previous stepsize $\alpha_{t-1}$, rather than $\alpha_t$. Because these stepsizes are slowly changing, this gradient still provides a reasonable estimate; however, for many steps into the past, the accumulated gradients in $\psi_{t,i}$ are likely inaccurate. To improve the approximation, and forget old gradients, we introduce a forgetting parameter $\beta \in [0,1)$, which focuses the accumulation of gradients in $\psi_{t,i}$ to a more recent window.
The gradient update to the stepsize also needs to ensure that the stepsizes remain positive. Similarly to IDBD, we use an exponential form for the stepsize, where $\alpha_i = \exp(v_i)$ and $v_i$ is updated with (unconstrained) stochastic gradient descent. Conveniently, as we show in the appendix, we do not need to maintain this auxiliary variable, and can simply directly update $\alpha$.
The resulting generic updates for quadratic-complexity AdaGain, with meta stepsize $\bar{\theta}$, are

\psi_{t+1,i} = \beta\big(I + \mathrm{diag}(\alpha_t)\,G_t\big)\psi_{t,i} + e_i \Delta_{t,i}, \qquad \alpha_{t+1} = \alpha_t \circ \exp\!\big(-\bar{\theta}\,\Psi_{t+1}^\top G_{t+1}^\top \Delta_{t+1}\big), \qquad (5)

where the exponential is applied elementwise, $\Psi_{t+1} \in \mathbb{R}^{d \times d}$ is the matrix with columns $\psi_{t+1,i}$, $\psi_{0,i} = \mathbf{0}$ (or some initial value), and $G_t = \frac{\partial \Delta_t}{\partial w_t}$. For computational efficiency, to avoid matrix-matrix multiplication, the order of multiplication for $\Psi_{t+1}^\top G_{t+1}^\top \Delta_{t+1}$ should start from the right, as $\Psi_{t+1}^\top (G_{t+1}^\top \Delta_{t+1})$. The key complexity in deriving an AdaGain update, then, is simply in computing the Jacobian $G_t$; given this, the remainder of the algorithm is fixed. For each update $\Delta_t$, the Jacobian will be different, but is straightforward to compute.
3.2 Generic AdaGain algorithm with linear complexity
Maintaining the entire $d \times d$ matrix of stepsize gradients can be prohibitively expensive. As was done in IDBD [Sutton 1992b], one way to avoid maintaining this matrix is to assume that $\frac{\partial w_{t,j}}{\partial \alpha_i} = 0$ for $j \neq i$. This heuristic reflects that $\alpha_i$ is likely to have the largest impact on $w_{t,i}$, and less impact on the other entries in $w_t$.
The modification above for this heuristic is straightforward, simply by setting the entries $\psi_{t,i,j} = 0$ for $j \neq i$. This results in the simplification

\psi_{t+1,i,i} = \beta\big(\psi_{t,i,i} + \alpha_{t,i}\,(G_t \psi_{t,i})_i\big) + \Delta_{t,i}.

Further, since we will then assume that $\psi_{t+1,i,j} = 0$ for $j \neq i$, there is no purpose in computing the full vector $G_t \psi_{t,i}$. Instead, we only need to compute the $i$th entry, $(G_t \psi_{t,i})_i$. We can then instead define $\psi_{t,i}$ to be a scalar approximating $\frac{\partial w_{t,i}}{\partial \alpha_i}$, with $\psi_t \in \mathbb{R}^d$ the vector of these scalars, and define the recursion as $\psi_{t+1} = \beta\big(\psi_t + \alpha_t \circ (G_t \psi_t)\big) + \Delta_t$, with $\psi_0 = \mathbf{0}$. The gradient using this approximation, with off-diagonals zero, is

\frac{\partial \|\Delta_{t+1}\|_2^2}{\partial \alpha} \approx 2\,\Delta_{t+1} \circ (G_{t+1}\,\psi_{t+1}).
To compute this approximation, we still need to be able to compute the Jacobian-vector product $G_t \psi_t$. In some cases this is straightforward, as is the case for linear TD (found in the appendix). More generally, we can use R-operators [Pearlmutter 1994] to compute this Jacobian-vector product, or a simple finite-difference approximation, as we do in the appendix. Therefore, because we can compute this Jacobian-vector product in linear time, the only approximation is to $\psi_t$. The update is
\alpha_{t+1} = \alpha_t \circ \exp\!\big(-\bar{\theta}\,\Delta_{t+1} \circ (G_{t+1}\,\psi_{t+1})\big), \qquad \psi_{t+1} = \beta\big(\psi_t + \alpha_t \circ (G_t \psi_t)\big) + \Delta_t \qquad (6)
These approximations parallel the diagonal approximations used in second-order techniques, which similarly assume the off-diagonal elements are zero. Further, $G_t$ is itself a gradient of the update w.r.t. the weights, where this update was already likely the gradient of the loss w.r.t. the weights. This $G_t$, therefore, contains similar information as the Hessian. The AdaGain update thus contains some information about curvature, but allows for updates that are not necessarily (true) gradient updates.
This AdaGain update is generic, but does require computing the Jacobian of a given update, which could be onerous in certain settings. We provide an update based on finite differences in the appendix, which only requires differences between updates and which we have found works well in practice.
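A sketch of one plausible linear-complexity implementation follows, built from the quantities described in this section: a generic update $\Delta(w)$, a Jacobian-vector product estimated by finite differences, an exponentiated stepsize, and forgetting. The exact update ordering, base learner and constants are our own choices, not the authors' reference implementation:

```python
import numpy as np

def adagain_step(delta_fn, w, alpha, psi, meta=1e-3, beta=0.9, h=1e-6):
    """One linear-complexity AdaGain-style step for a generic update
    delta_fn(w). The Jacobian-vector product (dDelta/dw) psi is
    estimated with a finite difference, keeping the whole step O(d)."""
    delta = delta_fn(w)
    g_psi = (delta_fn(w + h * psi) - delta) / h   # approximates G_t psi_t
    # exponentiated-gradient stepsize update keeps alpha strictly positive
    alpha = alpha * np.exp(-meta * delta * g_psi)
    # recursive stepsize-gradient trace with forgetting
    psi = beta * (psi + alpha * g_psi) + delta
    w = w + alpha * delta
    return w, alpha, psi

# base learner: SGD on 0.5*||w||^2, i.e., delta(w) = -w
w = np.array([2.0, -1.0])
alpha = np.full(2, 0.1)
psi = np.zeros(2)
for _ in range(300):
    w, alpha, psi = adagain_step(lambda v: -v, w, alpha, psi)
```

On this simple quadratic the weights contract toward zero while the stepsizes stay positive and bounded; the finite-difference trick is what lets the same code wrap any differentiable update, not only true gradients.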
4 Experiments in synthetic tasks
We conduct experiments in several simulation domains to highlight the performance characteristics of meta-descent and quasi-second-order methods. In our first experiment we investigate AdaGain and several meta-descent and quasi-second-order approaches on a notoriously difficult stationary optimization task. Next we return to the simple stateless tracking problem described in the introduction, and investigate the parameter sensitivity of each method. Our third experiment investigates how different optimization algorithms can stabilize the iterates in sequential off-policy learning problems, which cause SGD-based methods to diverge. We conclude with a comparison of AdaGain and AMSGrad (the best-performing quasi-second-order method in the first three experiments) for online prediction on data generated by a mobile robot.
In all the experiments, we use AdaGain layered on top of an RMSProp update, rather than a vanilla SGD update. As motivated earlier, meta-descent methods are not robust on difficult optimization surfaces, such as those with flat or sharp regions. AdaGain provides a practical way to pursue meta-descent strategies that are robust to such realistic optimization problems. We motivate the importance of this choice in our first experiment on a difficult optimization task.
Function optimization. The aim of our first experiment is to investigate how AdaGain performs on optimization problems designed to be difficult for gradient descent. The Rosenbrock function is a two-dimensional nonconvex function whose minimum is inside a flat, parabola-shaped valley. We compared AdaGain with AMSGrad, SGD, and SMD, in each case extensively searching the metaparameters of each method and averaging performance over 100 runs and 6000 optimization steps. The results are summarized in Figure 2, with trajectory plots of a single run of each algorithm, and the learning curves for all methods. AdaGain both learns faster and gets closer to the global optimum than all other methods considered. Further, the meta-descent methods SMD and AdaGain without RMSProp perform poorly. This result highlights the issues with applying meta-descent approaches without considering the optimization surface, and the importance of having an algorithm like AdaGain that can be combined with quasi-second-order methods.
Stateless tracking problem. Recall from Figure 1 that several methods performed well on the stateless tracking problem; sensitivity to parameter settings, however, is also important. To better understand these methods, we constructed a parameter sensitivity graph (Figure 3). IDBD can outperform AdaGain on this problem (lower MSE), but only a tiny fraction of IDBD's parameter settings achieve good performance. None of AdaGrad's parameter combinations reached the performance threshold: all combinations resulted in high error compared with AdaGain. Many of the parameter combinations allowed AdaGain to achieve low error, suggesting that AdaGain with simple manual parameter tuning is likely to achieve good performance on this problem, while IDBD likely requires a comprehensive parameter sweep.
Baird’s counterexample. Our final synthetic-domain experiment tests the stability of AdaGain’s update when combined with the TD(λ) algorithm for off-policy state-value prediction in a Markov Decision Process. We use Baird’s counterexample, which causes the weights learned by off-policy TD(λ) [Sutton and Barto 1998] to diverge if a global stepsize parameter is used (decaying or otherwise) [Baird 1995; Sutton and Barto 1998; Maei 2011]. The key challenge lies in the feature representation and the difference between the target and behavior policies. There is a shared redundant feature, and the weight associated with the seventh feature is initialized to a high value. The target policy always chooses to go to state seven and stay there forever. The behavior policy, on the other hand, only visits state seven 1/7th of the time, causing large importance sampling corrections.
We applied AdaGain, AMSGrad, RMSProp, SMD, and TIDBD [Kearney et al. 2018], a recent extension of the IDBD algorithm, to adapt the stepsizes of linear TD(λ) on Baird’s counterexample. As before, the metaparameters were extensively swept and the best-performing parameters were used to generate the results for comparison. Figure 5 shows the learning curves of each method. Only AdaGain and AMSGrad are able to prevent divergence. SMD’s performance is typical for Baird’s counterexample: the metaparameter search simply found parameters that caused extremely slow divergence. AdaGain learns significantly faster than AMSGrad, and achieves lower error.
To understand how AdaGain prevents divergence, consider Figure 4. The left graph shows the stepsize values as they evolve over time, and the right graph shows the corresponding weights. Recall that the weight for feature seven is initialized to a high value. AdaGain initially increases feature seven’s stepsize, causing weight seven to fall quickly. In parallel, AdaGain reduces the stepsize for the redundant feature, preventing incorrect generalization. Over time the weights converge to one of many valid solutions, and the value error, plotted in black on the right side, converges to zero. The left plots of Figure 5 show the same evolution of the weights and stepsizes for AMSGrad. AMSGrad successfully reduces the stepsize for the redundant feature; however, the stepsizes of the other features decay quickly and then begin growing again, preventing convergence to low value error.
5 Experiments on robot data
In our final experiment we recreate nexting [Modayil, White, and Sutton 2014], using TD(λ) to make dozens of predictions about the future values of robot sensor readings. We formulate each prediction as estimating the discounted sum of future sensor readings, treating each sensor as a reward signal, with a discount factor corresponding to approximately 80-second predictions. Using the freely available nexting data set (144,000 samples, corresponding to 3.4 hours of runtime on the robot), we incrementally processed the data on each step, constructing a feature vector from the sensor vector and making one prediction for each sensor. At the end of learning we computed the "ideal" prediction offline, computed the symmetric mean absolute percentage error of each prediction, and aggregated the 50 learning curves using the median. We used the same nonlinear coarse recoding of the sensor inputs described in the original work, giving 6065 binary feature components for use as a linear representation.
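The offline "ideal" predictions can be computed with a single backward pass over the recorded signal; a sketch of the standard discounted-return recursion $G_t = s_t + \gamma G_{t+1}$ follows (the paper's targets may index from the next observation, which simply shifts the result by one step, and the tail is truncated at the end of the recording):

```python
import numpy as np

def ideal_predictions(signal, gamma):
    """Discounted sum of the signal from each time step onward,
    G_t = s_t + gamma * G_{t+1}, computed backward in one pass.
    The recursion is truncated at the end of the data."""
    g = np.zeros(len(signal))
    ret = 0.0
    for t in range(len(signal) - 1, -1, -1):
        ret = signal[t] + gamma * ret
        g[t] = ret
    return g

ideal = ideal_predictions(np.array([1.0, 1.0, 1.0]), gamma=0.5)
# -> [1.75, 1.5, 1.0]
```

Comparing a learner's online predictions against this offline target is what makes the error curves in the robot experiment possible, even though the target itself depends on the entire future of the signal.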
For this experiment we reduced the number of algorithms, using AMSGrad as the best-performing quasi-second-order method based on our synthetic-task experiments and AdaGain as the representative meta-descent algorithm. The meta stepsize was optimized for both algorithms.
The learning curves in Figure 6 show a clear advantage for AdaGain in terms of aggregate error over all predictions. Inspecting the predictions of one of the heat sensors reveals why. In early learning, AdaGain increases the prediction to near the ideal prediction much more quickly, whereas AMSGrad takes over 12,000 steps to reach this point. AdaGain and AMSGrad then both track the ideal heat prediction similarly, and so obtain similar error for the remainder of learning. This advantage in initial learning is also demonstrated in Figure 7, which depicts predictions on two different sensors. For example, AdaGain adapts the predictions more quickly in reaction to the unexpected stall event, but otherwise AdaGain and AMSGrad obtain similar errors. This result also serves as a sanity check for AdaGain, validating that AdaGain scales to more realistic problems and remains stable in the face of high levels of noise and high-magnitude prediction targets.
6 Conclusion
In this work, we proposed a new general meta-descent strategy to adapt a vector of stepsizes for online, continual prediction problems. We defined a new meta-descent objective that enables a broader class of incremental updates for the base learner, generalizing beyond work specialized to least-mean squares, temporal difference learning and vanilla stochastic gradient descent updates. We derived a recursive update for the stepsizes, and provided a linear-complexity approximation. In a series of experiments, we highlighted that meta-descent strategies are not robust to the shape of the optimization surface. The ability to use AdaGain with generic updates enabled us to overcome this issue, by layering AdaGain on RMSProp, a simple quasi-second-order approach. We then showed that, with this modification, meta-descent methods can perform better than the more commonly used quasi-second-order updates, adapting more quickly in nonstationary tasks.
References

 [Almeida et al. 1998] Almeida, L. B.; Langlois, T.; Amaral, J. D.; and Plakhov, A. 1998. Parameter adaptation in stochastic optimization. In Saad, D., ed., On-Line Learning in Neural Networks, 111–134. New York, NY, USA: Cambridge University Press.
 [Amari, Park, and Fukumizu 2000] Amari, S.-i.; Park, H.; and Fukumizu, K. 2000. Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation.
 [Andrychowicz et al. 2016] Andrychowicz, M.; Denil, M.; Gómez, S.; Hoffman, M. W.; Pfau, D.; Schaul, T.; and de Freitas, N. 2016. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems.
 [Baird1995] Baird, L. 1995. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995. Elsevier. 30–37.
 [Baydin et al.2018] Baydin, A. G.; Cornish, R.; Rubio, D. M.; Schmidt, M.; and Wood, F. 2018. Online Learning Rate Adaptation with Hypergradient Descent. In International Conference on Learning Representations.
 [Benveniste, Metivier, and Priouret1990] Benveniste, A.; Metivier, M.; and Priouret, P. 1990. Adaptive Algorithms and Stochastic Approximations. Springer.
 [Bordes, Bottou, and Gallinari2009] Bordes, A.; Bottou, L.; and Gallinari, P. 2009. SGDQN: Careful quasiNewton stochastic gradient descent. Journal of Machine Learning Research.
 [Dabney and Barto2012] Dabney, W., and Barto, A. G. 2012. Adaptive stepsize for online temporal difference learning. In AAAI.

 [Dabney and Thomas 2014] Dabney, W., and Thomas, P. S. 2014. Natural temporal difference learning. In AAAI Conference on Artificial Intelligence.
 [Dabney 2014] Dabney, W. C. 2014. Adaptive Step-sizes for Reinforcement Learning. Ph.D. Dissertation, University of Massachusetts Amherst.
 [Duchi, Hazan, and Singer2011] Duchi, J.; Hazan, E.; and Singer, Y. 2011. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research.
 [Finn, Abbeel, and Levine2017] Finn, C.; Abbeel, P.; and Levine, S. 2017. ModelAgnostic MetaLearning for Fast Adaptation of Deep Networks. In International Conference on Machine Learning.
 [Jacobs1988] Jacobs, R. 1988. Increased rates of convergence through learning rate adaptation. Neural Networks.
 [Jaeger2000] Jaeger, H. 2000. Observable Operator Processes and Conditioned Continuation Representations. Neural Computation.
 [Kearney et al. 2018] Kearney, A.; Veeriah, V.; Travnik, J. B.; Sutton, R. S.; and Pilarski, P. M. 2018. TIDBD: Adapting temporal-difference step-sizes through stochastic meta-descent. arXiv preprint arXiv:1804.03334.
 [Kingma and Ba 2015] Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
 [Littman, Sutton, and Singh2001] Littman, M. L.; Sutton, R. S.; and Singh, S. 2001. Predictive representations of state. In Advances in Neural Information Processing Systems.
 [Maei 2011] Maei, H. R. 2011. Gradient Temporal-Difference Learning Algorithms. Ph.D. Dissertation, University of Alberta, Edmonton, Alberta.
 [Mahmood et al.2012] Mahmood, A. R.; Sutton, R. S.; Degris, T.; and Pilarski, P. M. 2012. Tuningfree stepsize adaptation. ICASSP.
 [McMahan and Streeter 2010] McMahan, H. B., and Streeter, M. 2010. Adaptive bound optimization for online convex optimization. In Conference on Learning Theory.
 [Meyer et al.2014] Meyer, D.; Degenne, R.; Omrane, A.; and Shen, H. 2014. Accelerated gradient temporal difference learning algorithms. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning.
 [Modayil, White, and Sutton2014] Modayil, J.; White, A.; and Sutton, R. S. 2014. Multitimescale nexting in a reinforcement learning robot. Adaptive Behavior 22(2):146–160.
[Nesterov1983] Nesterov, Y. 1983. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady.
 [Pan, Azer, and White2017] Pan, Y.; Azer, E. S.; and White, M. 2017. Effective sketching methods for value function approximation. In Conference on Uncertainty in Artificial Intelligence, Amsterdam, Netherlands.
 [Pan, White, and White2017] Pan, Y.; White, A.; and White, M. 2017. Accelerated Gradient Temporal Difference Learning. In International Conference on Machine Learning.
[Pearlmutter1994] Pearlmutter, B. A. 1994. Fast Exact Multiplication by the Hessian. Neural Computation.
 [Reddi, Kale, and Kumar2018] Reddi, S. J.; Kale, S.; and Kumar, S. 2018. On the Convergence of Adam and Beyond. In International Conference on Learning Representations.
 [Roux, Schmidt, and Bach2012] Roux, N. L.; Schmidt, M.; and Bach, F. R. 2012. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems.
[Schaul, Zhang, and LeCun2013] Schaul, T.; Zhang, S.; and LeCun, Y. 2013. No More Pesky Learning Rates. In International Conference on Machine Learning.
[Schraudolph, Yu, and Günter2007] Schraudolph, N.; Yu, J.; and Günter, S. 2007. A stochastic quasi-Newton method for online convex optimization. In International Conference on Artificial Intelligence and Statistics.
 [Schraudolph1999] Schraudolph, N. N. 1999. Local gain adaptation in stochastic gradient descent. International Conference on Artificial Neural Networks: ICANN ’99.
 [Spall1992] Spall, J. C. 1992. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control 37(3):332–341.
[Sutton and Barto1998] Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1st edition.
[Sutton and Tanner2004] Sutton, R. S., and Tanner, B. 2004. Temporal-Difference Networks. In Advances in Neural Information Processing Systems.
 [Sutton et al.2011] Sutton, R. S.; Modayil, J.; Delp, M.; Degris, T.; Pilarski, P.; White, A.; and Precup, D. 2011. Horde: A scalable realtime architecture for learning knowledge from unsupervised sensorimotor interaction. In International Conference on Autonomous Agents and Multiagent Systems.
 [Sutton, Koop, and Silver2007] Sutton, R.; Koop, A.; and Silver, D. 2007. On the role of tracking in stationary environments. In International Conference on Machine Learning.
 [Sutton1992a] Sutton, R. S. 1992a. Gain Adaptation Beats Least Squares? In Seventh Yale Workshop on Adaptive and Learning Systems.
[Sutton1992b] Sutton, R. 1992b. Adapting bias by gradient descent: An incremental version of delta-bar-delta. In AAAI Conference on Artificial Intelligence.
[Tieleman and Hinton2012] Tieleman, T., and Hinton, G. 2012. RMSProp: Divide the gradient by a running average of its recent magnitude. In COURSERA Neural Networks for Machine Learning.
[Wu et al.2018] Wu, Y.; Ren, M.; Liao, R.; and Grosse, R. B. 2018. Understanding Short-Horizon Bias in Stochastic Meta-Optimization. In International Conference on Learning Representations.
[Zeiler2012] Zeiler, M. D. 2012. ADADELTA: An Adaptive Learning Rate Method. arXiv preprint arXiv:1212.5701.
Appendix A Stochastic Meta-Descent algorithm
We recreate the SMD derivation, in our notation, for easier reference.
We compute the gradient of the loss function with respect to the stepsize. We first derive the full quadratic-complexity algorithm, and then introduce approximations to obtain a linear-complexity algorithm. For stepsize $\alpha_i$, the $i$th element of the vector $\alpha$,
Define the following two vectors, for the $i$th element,
(7)  
(8) 
We can obtain this vector recursively as
The resulting generic updates for quadratic-complexity SMD, with meta stepsize $\mu$, are
(9)  
with the initial value set to zero (or some other initial value). As with AdaGain, the Hessian-vector product can be computed efficiently using R-operators. Here, however, that efficiency is irrelevant, because we maintain a quadratic-size quantity regardless.
For the linear-complexity algorithm, we again zero the off-diagonal entries. Let $H_{:,i}$ be the $i$th column of the Hessian. This results in the simplification
Further, since we then assume the off-diagonal entries are zero, there is no purpose in computing the full vector; instead, we only need to compute its $i$th entry. We can then define a scalar approximating that entry, with a vector collecting these scalars, and use the diagonal of the Hessian
(10) 
to define the recursion, with an appropriate initial value. The gradient using this approximation, with off-diagonals zero, is
The resulting update to the stepsize is
(11)  
Difference from the original SMD algorithm: Surprisingly, the above algorithm differs from the algorithm given for SMD. However, that derivation appears to have a flaw: the gradient of the weights taken w.r.t. a vector of stepsizes is assumed to be a vector. Rather, with the same off-diagonal approximation we use, it should be a diagonal matrix, in which case they would also obtain only a diagonal Hessian. For completeness, we include their algorithm, which uses a full Hessian-vector product.
(12)  
Note that a follow-up paper that tested SMD [Wu et al.2018] uses this update, but does not contain an error, because it uses a scalar stepsize. In fact, if the stepsize in the SMD paper had been a scalar, then their derivation would have been correct.
The addition of forgetting: The original SMD algorithm did not use forgetting. In our experiments, however, we consider SMD with forgetting, which performs significantly better, since our goal is not to compare directly with SMD, but rather to compare the choice of objectives.
Appendix B Derivations for AdaGain updates
Consider again the generic update
(13) $w_{t+1} = w_t + \alpha \circ \Delta_t$
where $\Delta_t$ is the update for this step, for weights $w_t$ and constant vector stepsize $\alpha$, and the operator $\circ$ denotes elementwise multiplication.
B.1 Maintaining nonnegative stepsizes in AdaGain
One straightforward option to maintain nonnegative stepsizes is to define a constraint on the stepsize. We can prevent the stepsize from going below a small threshold $\epsilon > 0$, ensuring positive stepsizes. The projection onto this constraint set after each gradient descent step simply involves applying the operator $\max(\cdot, \epsilon)$, which thresholds any values below $\epsilon$ to $\epsilon$. We experimented with this strategy compared with the exponential form described below, and found it performed similarly, but it required an extra parameter to tune.
Another option, and the one we use in this work, is to use an exponential form for the stepsize, so that it remains positive. One such form, used also by IDBD, is $\alpha_i = \exp(\beta_i)$. The algorithm, with or without an exponential form, remains essentially identical to the thresholded version, because
Therefore, we can still recursively estimate the gradient with the same approach, regardless of how the stepsize is constrained. For the thresholded form, we simply use the gradient and then project (i.e., threshold). For the exponential form, the gradient update for the exponent is simply used within an exponential function, as described below.
Consider directly maintaining the exponent $\beta_i$, which is unconstrained. For the functional form $\alpha_i = \exp(\beta_i)$, the partial derivative $\partial \alpha_i / \partial \beta_i$ is simply equal to $\exp(\beta_i) = \alpha_i$, and so the gradient update includes an additional factor of $\alpha_i$ in front. This can be maintained more explicitly, without an additional variable, by noticing that the gradient for $\beta_i$ can be expressed in terms of $\alpha_i$ itself:
Therefore, we can still directly maintain $\alpha_i$. The resulting update to $\alpha_i$ is simply
(14) 
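To make the two constraint strategies concrete, here is a minimal Python sketch; the threshold `EPS`, meta stepsize `MU`, and the toy inputs are hypothetical illustrations, not values from this paper.

```python
import math

EPS = 1e-6   # hypothetical lower threshold for the projected variant
MU = 0.01    # hypothetical meta stepsize

def thresholded_update(alpha, grad):
    # Gradient step on the stepsize, then project back onto [EPS, inf)
    return [max(a - MU * g, EPS) for a, g in zip(alpha, grad)]

def exponential_update(alpha, grad):
    # Maintain alpha = exp(beta): the chain rule adds a factor of alpha
    # to the gradient, giving a multiplicative update that stays positive
    return [a * math.exp(-MU * a * g) for a, g in zip(alpha, grad)]

alpha = [0.1, 0.1]
grad = [50.0, -2.0]  # a large gradient would drive the first entry negative
print(thresholded_update(alpha, grad))   # first entry clipped at EPS
print(exponential_update(alpha, grad))   # first entry shrunk, still positive
```

Both variants shrink the stepsize of the first weight and grow the second; they differ only in how positivity is enforced.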
Other multiplicative updates are also possible. Schraudolph [Schraudolph1999] uses an exponential update, but with a maximum-based approximation, to avoid the expensive computation of the exponential function. Baydin et al. [2018] use a similar multiplicative update, but without a maximum.
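The maximum-based approximation can be sketched as follows; the floor value `0.5` and meta stepsize `MU` are illustrative choices, not values taken from the cited papers.

```python
import math

MU = 0.05  # hypothetical meta stepsize

def exp_update(alpha, v):
    # Exact multiplicative update
    return alpha * math.exp(MU * v)

def max_approx_update(alpha, v, floor=0.5):
    # First-order approximation exp(x) ~= 1 + x, floored so a single
    # step can shrink the stepsize by at most a constant factor
    return alpha * max(floor, 1.0 + MU * v)

for v in (-100.0, -1.0, 0.0, 1.0):
    print(v, exp_update(0.1, v), max_approx_update(0.1, v))
```

For small meta-gradients the two updates agree closely; the floor only matters for large negative values, where it prevents the stepsize from collapsing or turning negative.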
B.2 AdaGain for linear TD
In this section, we derive the AdaGain update for a particular algorithm, namely linear TD. LMS updates can be obtained as a special case by setting $\gamma = 0$. We then provide a more general update algorithm, which does not require knowledge of the form of the update, in the next section. One advantage of AdaGain is that it is derived generically, allowing extensions to many online algorithms, unlike IDBD and its variants, which are derived specifically for the squared TD-error.
We first provide the AdaGain updates for linear TD($\lambda$), and then provide the derivation below. For TD($\lambda$), the update is
(15)  
where $\delta_t = r_{t+1} + \gamma w_t^\top x_{t+1} - w_t^\top x_t$ is the TD-error and $e_t = \gamma \lambda e_{t-1} + x_t$ is the eligibility trace.
To derive the update for the stepsize, we need to compute the gradients of the update: the diagonal approximation or, for the full algorithm, the Jacobian of the update with respect to the weights.
Letting $d_t = \gamma x_{t+1} - x_t$, the Jacobian is the outer product $e_t d_t^\top$ and the diagonal approximation is $e_t \circ d_t$. Because of this outer-product form, Jacobian-vector products cost only linear time, so we can use the full Jacobian in the gradient-estimate update while maintaining linearity. The quadratic-complexity algorithm uses the Jacobian as given above.
The linear-complexity algorithm uses the diagonal approximation to update the gradient estimate, giving the stepsize update in (15).
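For reference, a single linear TD(lambda) step with a per-weight (vector) stepsize can be sketched as below; the discount, trace parameter, and toy inputs are illustrative, and the AdaGain stepsize adaptation itself is omitted.

```python
GAMMA, LAM = 0.9, 0.8  # hypothetical discount and trace parameters

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def td_lambda_step(w, e, x, x_next, reward, alpha):
    # TD error for the transition (x, reward, x_next)
    delta = reward + GAMMA * dot(w, x_next) - dot(w, x)
    # Accumulating eligibility trace
    e = [GAMMA * LAM * ei + xi for ei, xi in zip(e, x)]
    # Elementwise stepsize: each weight has its own alpha_i
    w = [wi + ai * delta * ei for wi, ai, ei in zip(w, alpha, e)]
    return w, e

w, e = [0.0, 0.0], [0.0, 0.0]
w, e = td_lambda_step(w, e, [1.0, 0.0], [0.0, 1.0], 1.0, [0.1, 0.2])
print(w, e)  # only the first weight moves on this transition
```

Setting `GAMMA = 0` reduces the step to an LMS-style update, matching the special case noted above.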
B.3 Generic AdaGain algorithm
To avoid requiring knowledge about the algorithm update and its derivatives, we can provide an approximation to the Jacobian-vector product and the diagonal of the Jacobian, using finite differences. As long as the update function for the algorithm can be queried multiple times, this algorithm can be easily applied to any update.
To compute the Jacobian-vector product, we use the fact that this corresponds to a directional derivative. Notice that $J\Delta$ corresponds to the vector of directional derivatives of each component (function) of the update, in the direction of $\Delta$, because the dot-product in each component, $(J\Delta)_j = \nabla u_j(w)^\top \Delta$, is exactly a directional derivative. Therefore, for update function $u$ (such as the gradient of the loss), we get for small $\epsilon > 0$,
(16) $\dfrac{u(w + \epsilon \Delta) - u(w)}{\epsilon} \approx J(w)\,\Delta$
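This product requires only two queries of the update function, as in the following sketch; `toy_update` and the constants below are illustrative stand-ins for an actual algorithm's update.

```python
EPSILON = 1e-5  # hypothetical perturbation scale

def jvp(u, w, delta):
    # Forward difference in the direction delta:
    # (u(w + eps * delta) - u(w)) / eps ~= J(w) delta
    w_pert = [wi + EPSILON * di for wi, di in zip(w, delta)]
    return [(up - uo) / EPSILON for up, uo in zip(u(w_pert), u(w))]

def toy_update(w):
    # Linear toy update whose Jacobian is diag(2, 3)
    return [2.0 * w[0], 3.0 * w[1]]

print(jvp(toy_update, [1.0, 1.0], [1.0, 1.0]))  # approximately [2, 3]
```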
For the diagonal of the Jacobian, we can again use finite differences. An efficient finite-difference computation is proposed within the simultaneous perturbation stochastic approximation algorithm [Spall1992], which uses a random perturbation vector $\tilde{\Delta}$ to compute, for update function $u$, the centered difference $\big(u_i(w + \epsilon \tilde{\Delta}) - u_i(w - \epsilon \tilde{\Delta})\big) / (2 \epsilon \tilde{\Delta}_i)$. This formula provides an approximation to the gradient of the $i$th entry of the update with respect to weight $i$; when computed for all $i$, this approximates the diagonal of the Jacobian. To avoid additional computation, we can reuse the above difference with perturbation $\Delta$, rather than a random vector $\tilde{\Delta}$. To avoid division by zero, if $\Delta$ contains a zero entry, we threshold the normalization with a small constant to give
(17) 
where division is elementwise. Another approach would be to sample a random direction for this finite difference and use it, divided by the absolute value of each of its elements. We found empirically that using the same direction $\Delta$ was actually more effective, and more computationally efficient, so we propose that approach.
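Reusing the same forward difference, the diagonal estimate can be sketched elementwise as below; `TAU`, `toy_update`, and the inputs are illustrative assumptions, and this sketch simply thresholds small-magnitude denominator entries while preserving their sign.

```python
import math

EPSILON = 1e-5  # hypothetical perturbation scale
TAU = 1e-8      # hypothetical threshold guarding against division by zero

def diag_jacobian(u, w, delta):
    w_pert = [wi + EPSILON * di for wi, di in zip(w, delta)]
    diff = [up - uo for up, uo in zip(u(w_pert), u(w))]
    # Reuse the update direction delta as the perturbation; threshold
    # each denominator entry away from zero, preserving its sign
    denom = [EPSILON * math.copysign(max(abs(di), TAU),
                                     di if di != 0 else 1.0)
             for di in delta]
    return [dfi / dni for dfi, dni in zip(diff, denom)]

def toy_update(w):
    # Linear toy update whose Jacobian is diag(2, 3)
    return [2.0 * w[0], 3.0 * w[1]]

print(diag_jacobian(toy_update, [1.0, 1.0], [1.0, 2.0]))  # approximately [2, 3]
```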
Using these approximations, we can compute the update to the stepsize as in Equation (6), repeated here for easy reference