Boosting for Dynamical Systems

06/20/2019 ∙ by Naman Agarwal, et al. ∙ Google ∙ Princeton University

We propose a framework of boosting for learning and control in environments that maintain a state. Leveraging methods for online learning with memory and for online boosting, we design an efficient online algorithm that can provably improve the accuracy of weak learners in stateful environments. As a consequence, we give efficient boosting algorithms for both prediction and the control of dynamical systems. Empirical evaluation on simulated and real data for both control and prediction supports our theoretical findings.


1 Introduction

In many learning scenarios it is significantly easier to come up with a mildly accurate rule of thumb than with a predictor achieving state-of-the-art performance. This motivation led to the development of ensemble methods and boosting [27], a theoretically sound methodology for combining rules of thumb into an accurate learner.

The application of boosting has transformed machine learning across a variety of applications, mostly in supervised learning: classification [16], regression [25], online learning [9, 8], agnostic learning [22], recommendation systems [15] and many more.

Exactly the same motivation for boosting exists in dynamical systems: it is often easy to come up with a reasonable controller for a dynamical system, or a reasonable predictor in time series analysis. However, the theory and practice of boosting face significant challenges in these settings due to the existence of state. Taking control of dynamical systems as an example, a controller affects the state of the system, and it is not a priori clear how to obtain a meaningful guarantee when shifting between different controllers.

In this paper we study a framework for theoretically sound boosting in the presence of state. We show how using techniques from online learning with memory and online gradient boosting gives rise to provable guarantees for learning and control in stateful frameworks. Our methods enlarge the expressivity of weak learners in two ways. Following classical results, we show how to compete with the best committee (or convex combination) of weak learners. We also give an alternate boosting method, more efficient in terms of the number of weak learners required, that turns weak learners designed for quadratic losses into strong learners that can learn any strongly convex and smooth loss.

We conclude with experimental evaluation of our methods on a variety of control and learning tasks in dynamical systems. As weak learners, we use both provable controllers, such as recent improvements of the Linear Quadratic Regulator, as well as deep recurrent neural networks, and show how boosting improves the performance of both.

1.1 Our contributions

Below we describe algorithms that are provably capable of improving the performance of prediction and control algorithms in dynamical systems. We design two methods that enhance the expressivity of the weak learner in different ways, as summarized below.

The setting considered is that of online regression where loss functions have memory, capturing the state of the dynamical system. Every round the player is presented with an example x_t and is asked to output a prediction y_t. An adversarial cost is then presented, and the loss incurred by the player depends not only on y_t but also on the past predictions y_{t-1}, …, y_{t-H}. We denote by T the time horizon, by W the class of weak learners, and by N the number of weak learners used.

Algorithm      Reference class    Loss of WL    Regret
Dyna-Boost 1   CH(W)              linear        R(T) + additive O(T/N) term
Dyna-Boost 2   W                  quadratic     R(T) + term decaying exponentially in N

The first boosting algorithm takes as input weak learners which have low regret assuming linear loss functions against a reference class of predictors W. The algorithm guarantees low regret against the class of more general convex and smooth loss functions and the expanded reference class CH(W), i.e. the convex hull of W. We show that the penalty in terms of regret is an additive term inversely proportional to the number of learners. This result is formally stated in Theorem 3.3.

For the case when the loss functions are strongly convex, we present a faster online boosting algorithm which is significantly more efficient in terms of the number of weak learners it requires. The weak learners required are those that can achieve low regret against well-conditioned quadratic loss functions. The boosting algorithm competes with the same class of functions as the weak learner, although it can cope with more general convex loss functions. This is formally stated in Theorem 3.6.

1.2 Applications

We apply our algorithms to systems with a dynamic state in both prediction and control. The results below follow as corollaries from our results on the general setting of online boosting with memory.

As a first case study, we design a boosting algorithm for the control of dynamical systems. The main idea is to reduce the control of a dynamical system to online learning with memory, and then apply our previous results. Such a reduction is achieved naturally on a stable dynamical system. We describe this setting formally in Section 4. While we present our reduction in generality, as a concrete example we provide precise regret guarantees on controlling a linear dynamical system, allowing general loss functions and adversarial noise.

Another application we explore is that of prediction in time series. The problems we consider are prediction in ARMA models and online portfolio selection with transaction costs.

1.3 Related work

Boosting

The importance of ensemble methods was recognized in the statistics literature for decades [10]. The seminal work of Freund and Schapire [16] introduced the theoretical framework of boosting in the context of computational learning theory, and notably the AdaBoost algorithm. We refer the reader to the text [27] for an in-depth survey of the fundamentals and various applications of boosting. Of particular interest to this paper is the generalization of boosting to the online setting; see [9, 8] and references therein.

Online learning with memory

To abstract out a general boosting algorithm for dynamical systems we leverage the online convex optimization with memory framework introduced by [6].

Prediction and Control in Dynamical Systems

The main application of our methods is to the control of dynamical systems. For a survey of linear dynamical systems (LDS), as well as learning, prediction and control problems for them, see [24], as well as a recent literature review in [18]. Recently, there has been a renewed interest in learning dynamical systems in the machine learning literature. This includes refined sample complexity bounds for control [2, 12, 1], the new spectral filtering technique for learning and open-loop control of non-observable systems [21, 7, 20], and provable control [14, 4, 11].

2 Setting and definitions

Online convex optimization.

In the online convex optimization framework (see [19]) a learner iteratively predicts a point x_t in a convex decision set K, and suffers loss according to an adversarial loss function f_t : K → R. The goal of the learner is to minimize regret, defined as

Regret = Σ_{t=1}^T f_t(x_t) − min_{x ∈ K} Σ_{t=1}^T f_t(x).
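For concreteness, the following is a minimal Python sketch of an OCO learner of the kind discussed above (online gradient descent with a decaying step size); the helper names and the toy losses are our own illustration, not code from the paper.

```python
import numpy as np

def project_l2_ball(x, radius=1.0):
    # Euclidean projection onto an L2 ball, a simple convex decision set K.
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def ogd(grad_fns, dim, eta=0.5):
    """Online gradient descent: play x_t, observe the gradient of the
    adversarial loss f_t at the played point, take a projected step."""
    x = np.zeros(dim)
    plays = []
    for t, grad in enumerate(grad_fns):
        plays.append(x.copy())
        x = project_l2_ball(x - (eta / np.sqrt(t + 1)) * grad(x))
    return plays

# Toy run: quadratic losses f_t(x) = ||x - z_t||^2 with shifting targets z_t.
rng = np.random.default_rng(0)
targets = [project_l2_ball(rng.normal(size=3)) for _ in range(100)]
plays = ogd([lambda x, z=z: 2.0 * (x - z) for z in targets], dim=3)
```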

Online learning with memory.

An important generalization of the above setting that we will need is the ability to minimize regret in an OCO instance in which the functions have memory. That is, the loss functions depend not only on the current decision, but also on several previous decisions. We say that a function has memory H if it depends on the previous H decisions. Such a setting was introduced in the work of [6], which defined the notion of regret as

Regret = Σ_{t=H}^T f_t(x_{t−H}, …, x_t) − min_{x ∈ K} Σ_{t=H}^T f_t(x, …, x).
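The sketch below makes the with-memory regret concrete: the comparator is the best fixed decision replayed in every slot of the memory window. The helper names and the toy losses are ours, for illustration only.

```python
import numpy as np

def total_loss(losses, plays, H):
    # losses[t] consumes the window (x_{t-H}, ..., x_t); early rounds are
    # padded by repeating the first decision.
    total = 0.0
    for t in range(len(plays)):
        window = [plays[max(t - H + i, 0)] for i in range(H + 1)]
        total += losses[t](window)
    return total

def memory_regret(losses, plays, candidates, H):
    # Compare against the best fixed decision replayed in every slot,
    # matching the with-memory comparator above.
    best = min(total_loss(losses, [x] * len(plays), H) for x in candidates)
    return total_loss(losses, plays, H) - best

# Toy check: losses penalizing the gap between decisions H steps apart.
losses = [lambda w: float(np.sum((w[-1] - 0.5 * w[0]) ** 2))] * 50
plays = [np.full(2, 0.1)] * 50
grid = [np.full(2, c) for c in np.linspace(0.0, 1.0, 11)]
print(memory_regret(losses, plays, grid, H=2))
```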

OCO with memory is a natural setup in dynamical systems where the memory can be used to represent the associated state. This connection was recently leveraged in [4] which provided robust regret bounds for Linear Quadratic Regulators. We make this connection precise in Section 4.

Our main result is to provide boosting algorithms for online learners with memory. Inspired by the setting of Online Gradient Boosting for regression functions introduced by [8] we define a generalization of OCO with memory called online regression with memory.

Formally, consider an input space X, a label space Y, and a function space W of regressors f : X → Y. At every time t, the player is presented with an example x_t ∈ X and makes a decision y_t ∈ Y. An adversary simultaneously chooses a convex loss function ℓ_t from a class L which potentially penalizes the last H decisions made by the learner, i.e. ℓ_t : Y^H → R, and the player suffers the loss ℓ_t(y_{t−H+1}, …, y_t). The objective here is to minimize regret with respect to the best regression function in hindsight, i.e.

Regret = Σ_{t=1}^T ℓ_t(y_{t−H+1}, …, y_t) − min_{f ∈ W} Σ_{t=1}^T ℓ_t(f(x_{t−H+1}), …, f(x_t)).

Online boosting.

Consider the above setting of online regression and suppose we have access to a weak learner with regret R(T) with respect to a given function class W. Online boosting refers to a meta-learning algorithm which takes as input access to (potentially many copies of) such a learner and produces a sequence of outputs which have low regret when compared to the expanded function class CH(W), i.e. the convex hull of W.

For a given class of loss functions L to be learned, the notion of weak learners we require is one that competes with the class of linear functions given by the gradients of losses in L. Formally, define the following set of all possible gradients:

∇L = { ∇ℓ(y) : ℓ ∈ L, y ∈ Y^H }.

We overload the notation ∇L to also represent all possible linear functions y ↦ ⟨g, y⟩ given by the vectors g in the above set.

As an example, let L be the set of all Lipschitz differentiable functions. In this case the set ∇L consists of all linear functions with bounded norm.

3 Main results

We propose two algorithms (Algorithms 1 and 2) for boosting when the loss functions are smooth, and when they are smooth as well as strongly convex, respectively. As described above, we consider the setting of online regression with memory with a loss function class L and a class of regression functions W. Let D_X and D_Y represent the diameters of the sets X and Y respectively.

3.1 Boosting for Smooth Losses

Assumption 3.1.

Loss functions ℓ ∈ L are convex and β-smooth, i.e. for all y, y' ∈ Y^H,

ℓ(y') ≤ ℓ(y) + ⟨∇ℓ(y), y' − y⟩ + (β/2) ‖y' − y‖².

Assumption 3.2.

There exists an algorithm A which, when run with losses chosen from the class of linear functions ∇L, produces a sequence of predictions with regret at most R(T) against the class of predictors W. Formally, for any sequence of linear loss functions g_t ∈ ∇L,

Σ_{t=1}^T g_t(y_{t−H+1}, …, y_t) − min_{f ∈ W} Σ_{t=1}^T g_t(f(x_{t−H+1}), …, f(x_t)) ≤ R(T).

Under the above assumptions we show the following theorem.

Theorem 3.3.

Let the class of loss functions L satisfy Assumption 3.1 and suppose we have access to multiple copies of a weak learner satisfying Assumption 3.2. There exists a boosting algorithm (Algorithm 1) which produces a sequence of predictions enjoying the following regret bound with respect to CH(W):

Σ_{t=1}^T ℓ_t(y_{t−H+1}, …, y_t) − min_{f ∈ CH(W)} Σ_{t=1}^T ℓ_t(f(x_{t−H+1}), …, f(x_t)) ≤ R(T) + O(β D_Y² T / N).

1:  Input: weak learners A^1, …, A^N, step lengths η_1, …, η_N, initial prediction y^0 ∈ Y
2:  for t = 1, …, T do
3:     Receive the state x_t
4:     Define y_t^0 = y^0
5:     for i = 1 to N do
6:        Define y_t^i = (1 − η_i) y_t^{i−1} + η_i A^i(x_t)
7:     end for
8:     Output the decision y_t = y_t^N
9:     Receive loss function ℓ_t and suffer loss ℓ_t(y_{t−H+1}, …, y_t)
10:     Define linear loss function ℓ_t^i(·) = ⟨g_t^i, ·⟩, where g_t^i = ∇ℓ_t(y_{t−H+1}^{i−1}, …, y_t^{i−1})
11:     Pass loss function ℓ_t^i to weak learner A^i, for each i ∈ [N].
12:  end for
Algorithm 1 Dyna-Boost 1
Proof.

First, note that since each ℓ_t^i is a linear function with g_t^i ∈ ∇L, the regret guarantee of Assumption 3.2 applies to every weak learner A^i.

Now let f be any function in CH(W). Define

Δ_i = Σ_{t=1}^T ℓ_t(y_{t−H+1}^i, …, y_t^i) − Σ_{t=1}^T ℓ_t(f(x_{t−H+1}), …, f(x_t)).

Consider the following calculation for stage i: combining the β-smoothness of the losses (Assumption 3.1) with the weak learner's regret bound yields

Δ_i ≤ (1 − η_i) Δ_{i−1} + η_i R(T) + O(η_i² β D_Y² T).

For i = 0, since the predictions and the comparator both lie in a set of diameter D_Y, the above bound together with smoothness implies that Δ_0 = O(β D_Y² T). Starting from this base case, by induction on i it follows that Δ_i ≤ R(T) + O(β D_Y² T / i). Applying this bound for i = N completes the proof. ∎
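The following is a minimal Python sketch of the boosting loop of Algorithm 1, under our reconstruction above. The weak-learner interface (predict/update) is an assumption we invent for illustration; the step schedule η_i = 2/(i+2) is a common choice in online gradient boosting and may differ from the paper's.

```python
import numpy as np

class DynaBoost1:
    """Sketch of Algorithm 1 (our reconstruction). Each weak learner is
    assumed to expose predict(x) -> ndarray and update(grad)."""

    def __init__(self, learners, etas=None):
        self.learners = learners
        n = len(learners)
        self.etas = etas if etas is not None else [2.0 / (i + 2) for i in range(n)]

    def predict(self, x):
        # Partial predictions y^0, ..., y^N; learner i is later updated
        # with the loss linearized at y^{i-1}.
        y = np.zeros_like(np.asarray(self.learners[0].predict(x), dtype=float))
        self.partials = [y]
        for lrn, eta in zip(self.learners, self.etas):
            y = (1.0 - eta) * y + eta * np.asarray(lrn.predict(x), dtype=float)
            self.partials.append(y)
        return y  # the decision y_t = y_t^N

    def update(self, grad_fn):
        # grad_fn(y) should return the gradient of the round's loss at y
        # (in the with-memory case, the gradient in the current slot).
        for i, lrn in enumerate(self.learners):
            lrn.update(grad_fn(self.partials[i]))
```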

3.2 Efficient Online Boosting for Strongly Convex Losses

We now present our results for the case when the losses are strongly convex. In this case we prove that the excess regret of boosting goes down exponentially in the number of weak learners. The weak learners required for this result are stronger in the sense that they achieve low regret against quadratic functions, as opposed to linear functions in the previous part. Due to this, the boosted algorithm does not compete with an expanded class of predictors but rather with the original class of predictors W.

In addition to the assumptions of the previous section, we will need the following.

Assumption 3.4.

We assume that all loss functions ℓ ∈ L are λ-strongly convex, i.e. for all y, y' ∈ Y^H we have that

ℓ(y') ≥ ℓ(y) + ⟨∇ℓ(y), y' − y⟩ + (λ/2) ‖y' − y‖².

Furthermore we assume that for all ℓ ∈ L and y ∈ Y^H we have that ‖∇ℓ(y)‖ ≤ G.

Assumption 3.5.

Consider the class of quadratic functions Q_L which contains, for every ℓ ∈ L and any y_0 ∈ Y^H, the following function:

q(y) = ⟨∇ℓ(y_0), y⟩ + (λ/2) ‖y − y_0‖².

There exists an algorithm A which, when run with losses chosen from the class of functions Q_L, produces a sequence of predictions with regret at most R(T) against the class of predictors W. Formally, for any sequence q_t ∈ Q_L,

Σ_{t=1}^T q_t(y_{t−H+1}, …, y_t) − min_{f ∈ W} Σ_{t=1}^T q_t(f(x_{t−H+1}), …, f(x_t)) ≤ R(T).

Our main result is the following theorem.

Theorem 3.6.

Let the class of loss functions L satisfy Assumptions 3.1 and 3.4, and suppose we have access to multiple copies of a weak learner satisfying Assumption 3.5. There exists a boosting algorithm (Algorithm 2) which produces a sequence of predictions whose regret with respect to W exceeds O(R(T)) by an additive term that decays exponentially with the number of weak learners N.

1:  Input: weak learners A^1, …, A^N, step length η
2:  for t = 1, …, T do
3:     Receive the state x_t
4:     Define y_t^0 = 0
5:     for i = 1 to N do
6:        Define y_t^i = (1 − η) y_t^{i−1} + η A^i(x_t)
7:     end for
8:     Output the decision y_t = y_t^N
9:     Receive loss function ℓ_t and suffer loss ℓ_t(y_{t−H+1}, …, y_t)
10:     Define quadratic loss function q_t^i(·) = ⟨g_t^i, ·⟩ + (λ/2) ‖· − y_t^{i−1}‖², where g_t^i = ∇ℓ_t(y_{t−H+1}^{i−1}, …, y_t^{i−1})
11:     Pass loss function q_t^i to weak learner A^i, for each i ∈ [N].
12:  end for
Algorithm 2 Dyna-Boost 2
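The strongly convex case changes only the proxy loss handed to each weak learner. Below is a small hedged sketch of that proxy under our reading of Assumption 3.5; the function name and interface are ours.

```python
import numpy as np

def quadratic_proxy(grad, anchor, lam):
    """Quadratic proxy loss for stage i (our reconstruction): a linear term
    from the gradient at the previous partial prediction plus a
    lam-strongly-convex regularizer anchored there."""
    grad = np.asarray(grad, dtype=float)
    anchor = np.asarray(anchor, dtype=float)
    def q(y):
        y = np.asarray(y, dtype=float)
        return float(grad @ y + 0.5 * lam * np.sum((y - anchor) ** 2))
    def q_grad(y):
        return grad + lam * (np.asarray(y, dtype=float) - anchor)
    return q, q_grad
```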

4 Boosting for Controlling Dynamical Systems

In this section we describe a general methodology to apply our algorithms to prediction and control of dynamical systems. Consider the following dynamical system:

x_{t+1} = f(x_t, u_t) + w_t,

where x_t ∈ X represents the state, which is assumed to be fully observable, u_t ∈ U represents the controls which the player is supposed to choose, f represents a potentially non-linear but known transition function, and w_t represents potentially adversarial disturbances. Since the system is known and the state is fully observable, the noise w_t is assumed to also be fully observable.

In the online setting (see e.g. [11, 4]), at every time step an adversary selects a convex cost function c_t(x_t, u_t) which penalizes the state and the action taken by the player. The cost function is then fully revealed to the player. The task of the player is to minimize regret against a given policy class Π. Given a policy π ∈ Π, let u_t^π be the actions generated by the policy and x_t^π be the states encountered by executing these actions. Regret in this setting is formally defined as

Regret = Σ_{t=1}^T c_t(x_t, u_t) − min_{π ∈ Π} Σ_{t=1}^T c_t(x_t^π, u_t^π).

In order to reduce online control of dynamical systems to online learning with memory, we need to define the notion of stable systems. Let H be an integer representing the memory length and let u_{t−H}, …, u_{t−1} be any sequence of bounded actions. Define the state x̃_t to be the state reached by the system if we artificially set x_{t−H} = 0 and simulate the system with the actions u_{t−H}, …, u_{t−1}. A system is considered stable with memory length H if for all t we have that

|c_t(x_t, u_t) − c_t(x̃_t, u_t)| ≤ ε(H),

for an error ε(H) that decays rapidly (e.g. geometrically) with H.

For an example of a stable system, consider a linear dynamical system (C.1) with the matrix A similar to a matrix with spectral norm less than 1. For stable systems we can now define a proxy cost function which penalizes the last H actions, defined as f̃_t(u_{t−H}, …, u_t) = c_t(x̃_t, u_t).

Stability now ensures that minimizing regret over the proxy costs (which have finite memory) is sufficient to minimize the overall regret. Having reduced the control of dynamical systems to minimizing regret over functions with memory, we can apply Algorithm 1 directly. In the appendix we show a concrete example of this on a linear system, borrowing the requisite methodology from [4].
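A minimal Python sketch of this reduction follows: the proxy cost re-simulates H steps from the zero state using only the last H actions and disturbances. The function names are ours, and the linear example is just one instance of a stable system.

```python
import numpy as np

def proxy_cost(cost_t, dynamics, controls, noises, H):
    """Truncated counterfactual cost: simulate H steps from the zero state
    with the last H actions, then charge this round's cost at the simulated
    state. `dynamics(x, u, w)` is the known transition f(x, u) + w."""
    assert len(noises) == H and len(controls) == H + 1
    # controls = (u_{t-H}, ..., u_t); noises = (w_{t-H}, ..., w_{t-1})
    x = np.zeros_like(np.asarray(noises[0], dtype=float))
    for u, w in zip(controls[:-1], noises):
        x = dynamics(x, u, w)
    return cost_t(x, controls[-1])

# Stable linear example: x_{t+1} = A x_t + B u_t + w_t with ||A|| < 1, so
# states older than H steps influence x_t only up to a ||A||^H factor.
A, B = 0.9 * np.eye(2), np.eye(2)
step = lambda x, u, w: A @ x + B @ u + w
cost = lambda x, u: float(x @ x + u @ u)
us = [np.zeros(2)] * 6           # the H+1 = 6 most recent actions
ws = [0.1 * np.ones(2)] * 5      # the H = 5 most recent disturbances
print(proxy_cost(cost, step, us, ws, H=5))
```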

5 Experimental Evaluation

5.1 Boosting for Online Control.

We experiment with various control systems. We start with a standard LDS, i.e., x_{t+1} = A x_t + B u_t + w_t, and in the last setting we experiment with a non-linear control system.

Sanity check:

We experiment with a simple LDS in which each noise term w_t is normally distributed, independently. We demonstrate results for systems with different dimensions.


Correlated noise: We now consider more challenging LDS settings in which the noise values are correlated across time. In the "Gaussian random walk" setting, each noise term is distributed normally, with the previous noise term as its mean. In the "Sine noise" setting, the sine function is applied to the time index, i.e., w_t = sin(t).

(Figure: (d) Gaussian random walk, (e) Sine noise.)

Inverted pendulum: The inverted pendulum, a highly nonlinear unstable system, is a commonly used benchmark for control methods. The objective of the control system is to balance the inverted pendulum by applying torque that will stabilize it in a vertically upright position.

5.2 Boosting for Time-Series Prediction

In this section, we experiment with online time-series prediction. The data is assumed to be generated according to an autoregressive moving average (ARMA) model, parameterized by horizon terms p and q and coefficient vectors α ∈ R^p and β ∈ R^q, such that

x_t = Σ_{i=1}^p α_i x_{t−i} + Σ_{j=1}^q β_j ε_{t−j} + ε_t,

where ε_t is a zero-mean random noise term. We apply online boosting to weak learners that predict a linear combination of previously observed states, as in ARMA-OGD and ARMA-ONS [26]. That is, each weak learner makes a prediction of the form

x̃_t = Σ_{i=1}^m γ_i x_{t−i}.

When applicable, we also compare against the ARIMA-ONS method [23]. In all plots, boosted and fast-boosted results correspond to Dyna-Boost 1 and Dyna-Boost 2, respectively. We follow the same 4 simulated experimental settings as in [26], as detailed in the Appendix. In addition, we also demonstrate results on real-world data.
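A hedged sketch of such an autoregressive weak learner follows, in the spirit of ARMA-OGD [26]: it predicts a fixed-length linear combination of past observations and updates the coefficients by a gradient step on the squared error. The class name and interface are ours, not the paper's.

```python
import numpy as np

class ARWeakLearner:
    """Autoregressive weak learner sketch: x_hat_t = gamma . (x_{t-1}, ..., x_{t-m})."""

    def __init__(self, m, eta=0.01):
        self.gamma = np.zeros(m)  # learned AR coefficients
        self.m, self.eta = m, eta

    def predict(self, history):
        past = np.asarray(history[-self.m:][::-1], dtype=float)  # most recent first
        past = np.pad(past, (0, self.m - len(past)))             # pad early rounds
        self._past = past
        return float(self.gamma @ past)

    def update(self, y_true):
        # Gradient of (prediction - y_true)^2 with respect to gamma.
        err = float(self.gamma @ self._past) - y_true
        self.gamma -= self.eta * 2.0 * err * self._past

# Usage: predict x_t from the prefix x_1..x_{t-1}, then reveal x_t.
lrn, xs = ARWeakLearner(m=5), [np.sin(0.3 * t) for t in range(200)]
for t in range(1, 200):
    y_hat = lrn.predict(xs[:t])
    lrn.update(xs[t])
```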

(Figure: (f) Setting 1, sanity check; (g) Setting 2, changing coefficients; (h) Setting 3, abrupt change; (i) Setting 4, correlated noise.)

Real-world data. We provide results on real-world time-series data from the UCI Machine Learning Repository [13]. The data contains 9358 instances of hourly averaged measurements of air quality properties from an Italian city throughout one year, as measured by chemical sensors. Specifically, our goal is to predict the level of CO contaminant concentration.

(Figure: (j) time-series, real-world data; (k) portfolio selection.)

5.3 Boosting for Portfolio with Transaction Loss

The data we use is Yahoo! S&P 500 data, with 490 tickers over 1000 trading days in the period 2001-2005. A transaction loss is used, combining the negative log-return of the chosen portfolio with a penalty on the movement between consecutive portfolios (the precise form is given in Appendix B).

We test the MW (multiplicative weights) algorithm, OGD (online gradient descent, as the weak learner), and boosted OGD for comparison. The figure of averaged gain over time is shown above, where gain is simply defined as the negative of the loss at each round.

6 Conclusions

We have described a framework for boosting of algorithms that maintain state information, together with two efficient algorithms that provably enhance weak learnability in different ways. These can be applied to a host of learning and control problems in dynamical systems. Preliminary experiments look very promising across a host of datasets in time series prediction as well as simulated control.

7 Acknowledgements

Elad Hazan acknowledges funding from the NSF, award number 1704860.

References

  • [1] Yasin Abbasi-Yadkori, Nevena Lazic, and Csaba Szepesvári. Model-free linear quadratic control via reduction to expert prediction. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3108–3117, 2019.
  • [2] Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages 1–26, 2011.
  • [3] Amit Agarwal, Elad Hazan, Satyen Kale, and Robert E Schapire. Algorithms for portfolio management based on the newton method. In Proceedings of the 23rd international conference on Machine learning, pages 9–16. ACM, 2006.
  • [4] Naman Agarwal, Brian Bullins, Elad Hazan, Sham M Kakade, and Karan Singh. Online control with adversarial disturbances. arXiv preprint arXiv:1902.08721, 2019.
  • [5] Naman Agarwal, Brian Bullins, Elad Hazan, Sham M. Kakade, and Karan Singh. Online control with adversarial disturbances. CoRR, abs/1902.08721, 2019.
  • [6] Oren Anava, Elad Hazan, and Shie Mannor. Online learning for adversaries with memory: price of past mistakes. In Advances in Neural Information Processing Systems, pages 784–792, 2015.
  • [7] Sanjeev Arora, Elad Hazan, Holden Lee, Karan Singh, Cyril Zhang, and Yi Zhang. Towards provable control for unknown linear dynamical systems. 2018.
  • [8] Alina Beygelzimer, Elad Hazan, Satyen Kale, and Haipeng Luo. Online gradient boosting. In Advances in neural information processing systems, pages 2458–2466, 2015.
  • [9] Alina Beygelzimer, Satyen Kale, and Haipeng Luo. Optimal and adaptive algorithms for online boosting. In International Conference on Machine Learning, pages 2323–2331, 2015.
  • [10] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, Aug 1996.
  • [11] Alon Cohen, Avinatan Hassidim, Tomer Koren, Nevena Lazic, Yishay Mansour, and Kunal Talwar. Online linear quadratic control. arXiv preprint arXiv:1806.07104, 2018.
  • [12] Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, pages 4188–4197, 2018.
  • [13] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
  • [14] Maryam Fazel, Rong Ge, Sham M Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for linearized control problems. arXiv preprint arXiv:1801.05039, 2018.
  • [15] Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. J. Mach. Learn. Res., 4:933–969, December 2003.
  • [16] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci., 55(1):119–139, August 1997.
  • [17] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
  • [18] Moritz Hardt, Tengyu Ma, and Benjamin Recht. Gradient descent learns linear dynamical systems. The Journal of Machine Learning Research, 19(1):1025–1068, 2018.
  • [19] Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
  • [20] Elad Hazan, Holden Lee, Karan Singh, Cyril Zhang, and Yi Zhang. Spectral filtering for general linear dynamical systems. arXiv preprint arXiv:1802.03981, 2018.
  • [21] Elad Hazan, Karan Singh, and Cyril Zhang. Learning linear dynamical systems via spectral filtering. In Advances in Neural Information Processing Systems, pages 6702–6712, 2017.
  • [22] Varun Kanade and Adam Kalai. Potential-based agnostic boosting. In Advances in neural information processing systems, pages 880–888, 2009.
  • [23] Chenghao Liu, Steven C. H. Hoi, Peilin Zhao, and Jianling Sun. Online ARIMA algorithms for time series prediction. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • [24] Lennart Ljung. System Identification: Theory for the User. Prentice Hall, Upper Saddle River, NJ, 2 edition, 1998.
  • [25] Llew Mason, Jonathan Baxter, Peter L Bartlett, and Marcus R Frean. Boosting algorithms as gradient descent. In Advances in neural information processing systems, pages 512–518, 2000.
  • [26] Oren Anava, Elad Hazan, Shie Mannor, and Ohad Shamir. Online learning for time series prediction. In Proceedings of the 26th Annual Conference on Learning Theory, pages 172–184, 2013.
  • [27] Robert E Schapire and Yoav Freund. Boosting: Foundations and algorithms. MIT press, 2012.

Appendix A Appendix

A.1 Proof of Theorem 3.6

Proof of Theorem 3.6.

Since each A^i satisfies Assumption 3.5, we get that its regret on the quadratic proxy losses q_t^i is at most R(T). (A.1)

Now let f be any function in W. Define

Δ_i = Σ_{t=1}^T ℓ_t(y_{t−H+1}^i, …, y_t^i) − Σ_{t=1}^T ℓ_t(f(x_{t−H+1}), …, f(x_t)).

Notice that by the λ-strong convexity of ℓ_t, as long as we choose the step length η sufficiently small, each stage contracts the excess loss up to the weak-learner regret. Thus by summing these per-round bounds up we get

Δ_i ≤ (1 − cη) Δ_{i−1} + η R(T), (A.2)

for a constant c > 0 depending on λ and β. Therefore, unrolling the recursion,

Δ_N ≤ (1 − cη)^N Δ_0 + O(R(T)).

Choosing η appropriately, then by noticing that Δ_0 is always bounded since the predictions and the comparator lie in a set of diameter D_Y, plugging in i = N finishes our proof. ∎

Appendix B Boosting for Universal Portfolio Selection

This section illustrates the application of our methods to a problem of time series prediction with memory. The problem we choose is that of universal portfolio selection, see e.g. [3] and related papers, with an additional transaction loss. This is a basic setting that already exhibits time dependence in the loss functions and requires our machinery for boosting.

Formally speaking, at every round the adversary selects a price-relative vector r_t and the player selects a portfolio x_t in the n-dimensional simplex and incurs loss given by

ℓ_t(x_{t−1}, x_t) = −log(⟨r_t, x_t⟩) + λ ‖x_t − x_{t−1}‖²,

for some fixed λ > 0. The loss above can be seen to be convex and has a memory dependency on x_{t−1}. Therefore we can apply Algorithm 1 to it.¹ We now show that the above class of loss functions satisfies Assumption 3.1. As standard in the literature, we assume that every coordinate satisfies r_t(i) ∈ [ε, 1], where the upper bound is without loss of generality (see e.g. [3]).

¹ The setting can be converted to online regression by a degenerate reduction: the example can be taken to be constant and the decision is the portfolio itself, so the class of regression functions reduces to constant functions parameterized by points in the simplex.
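A small Python sketch of this transaction loss and its gradient in the current portfolio, under the squared-movement form written above (our reading; the paper's exact penalty may differ in scaling):

```python
import numpy as np

def transaction_loss(x_t, x_prev, r_t, lam):
    # Negative log-return of the chosen portfolio plus a movement penalty
    # between consecutive portfolios (the memory term).
    return float(-np.log(r_t @ x_t) + lam * np.sum((x_t - x_prev) ** 2))

def transaction_loss_grad(x_t, x_prev, r_t, lam):
    # Gradient with respect to the current portfolio x_t.
    return -r_t / (r_t @ x_t) + 2.0 * lam * (x_t - x_prev)

# Toy check on a 3-asset simplex point.
r = np.array([0.9, 1.0, 1.1])
x_new, x_old = np.array([0.2, 0.3, 0.5]), np.ones(3) / 3
print(transaction_loss(x_new, x_old, r, lam=0.1))
```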

Proposition B.1.

The loss ℓ_t is O(√n/ε + λ)-Lipschitz, O(n/ε² + λ)-smooth, and bounded by O(log(1/ε) + λ).

Proof.

The Euclidean part of ℓ_t is clearly smooth, Lipschitz, and bounded over the simplex. The gradient and Hessian of the first part are given by

∇ = −r_t / ⟨r_t, x⟩,    ∇² = r_t r_tᵀ / ⟨r_t, x⟩².

Therefore, since ⟨r_t, x⟩ ≥ ε over the simplex, we have ‖∇‖ ≤ √n/ε and ‖∇²‖ ≤ n/ε². ∎
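As a quick numerical sanity check of the gradient formula above (ours, not part of the paper), a finite-difference comparison at an interior point of the simplex:

```python
import numpy as np

# Finite-difference check of the gradient of -log(<r, x>).
rng = np.random.default_rng(1)
n, eps = 5, 1e-6
r = rng.uniform(0.2, 1.0, size=n)
x = np.ones(n) / n
g = -r / (r @ x)
num = np.array([
    (-np.log(r @ (x + eps * e)) + np.log(r @ x)) / eps
    for e in np.eye(n)
])
assert np.allclose(g, num, atol=1e-4)
```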

Corollary B.2.

Suppose we have a weak learner A with regret guarantee R(T) on linear loss functions of bounded norm.

Then applying Algorithm 1 with A as the weak learner gives the regret bound of Theorem 3.3 against the convex hull of the reference class, with the Lipschitz, smoothness, and diameter constants instantiated from Proposition B.1.

Appendix C Boosting the Linear Quadratic Regulator

We follow the setting in [5], and use the algorithm presented therein as the weak learner. A linear dynamical system is governed by the dynamics equation

x_{t+1} = A x_t + B u_t + w_t, (C.1)

where x_t is the state, u_t is the control, and w_t is a disturbance to the system which could potentially be adversarial. The goal of the controller is to minimize the cumulative cost, which at time t is given by a convex function c_t(x_t, u_t).

The algorithm presented in [5] parameterizes the control as

u_t = −K x_t + Σ_{i=1}^H M_i w_{t−i}, (C.2)

where K is a pre-fixed matrix and M_1 to M_H are parameters governing the controller. Under standard assumptions, it is shown in [4] that the system is stable with a modest memory length H (logarithmic in the horizon), thereby allowing the reduction to online convex optimization with memory. The following corollary presents the regret bound for the application of Algorithm 1 to the above system.
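A hedged Python rendering of the control law (C.2) from [4, 5]; the function name and the toy parameter values are ours.

```python
import numpy as np

def disturbance_action_control(x_t, past_noises, K, Ms):
    """Control law u_t = -K x_t + sum_i M_i w_{t-i}. `Ms` holds the H
    learned matrices M_1..M_H and `past_noises` the last H disturbances,
    most recent first."""
    u = -K @ x_t
    for M, w in zip(Ms, past_noises):
        u = u + M @ w
    return u

# Toy usage on a 2-dimensional system with memory length H = 3.
K = 0.5 * np.eye(2)
Ms = [0.1 * np.eye(2) for _ in range(3)]
ws = [0.05 * np.ones(2) for _ in range(3)]
print(disturbance_action_control(np.array([1.0, -1.0]), ws, K, Ms))
```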

Corollary C.1.

Assume the controller presented in [4] has regret R(T) on bounded-norm linear loss functions with memory length H. Then, by applying Algorithm 1 on an LDS (C.1) with this controller as the weak learner, we get the regret guarantee of Theorem 3.3: regret at most R(T) plus an additive term of order T/N against the convex hull of such controllers.

Note that while the controller proposed in [4] enjoys a regret bound for general convex loss functions, the regret bound required of the weak learner by our algorithm holds only over linear loss functions. This could be much smaller, and hence the boosted result could be significantly better. We verify this empirically across multiple settings in the next section.

Appendix D Experimental Settings

D.1 Boosting for Online Control

For the LDS experiments, the system matrices are generated randomly, such that their spectral norms are smaller than 1. The cost function penalizes the magnitudes of the state and the control. The weak-learner baseline is designed as in Equation C.2, following [5], with the pre-fixed matrix K set in advance. In all figures, we plot the averaged results for a fixed random system in each setting, over repeated experiment runs.

Sanity check experiment. In this LDS setting, the noise terms are normally distributed with zero mean and fixed variance. We set the memory length separately for the small and the larger dimensions, and use the same number of weak learners in all the experiments.

Correlated noise experiment. In the first setting, the noise terms follow a Gaussian random walk, and are then clipped to a bounded range. Here we also test our method with a Recurrent Neural Network, using an LSTM architecture. We plot the performance of the RNN weak learner (denoted RNN-WL), as well as our online boosting technique applied to it.

Inverted Pendulum experiment. Here we follow the dynamics implemented in [17]. The LQR baseline solution is obtained from the linear approximation of the system dynamics, whereas our baseline and boosted controllers are not restricted to that approximation. We add correlated noise obtained from a Gaussian random walk, as above, with the noise values clipped to a bounded range.

D.2 Boosting for Time-Series Prediction

In these experiments, the x-axis is time (number of samples), and the y-axis is the average squared loss. We averaged the results over repeated runs for stability, and fixed the memory length and the number of weak learners across all our experiments. The ARIMA baseline is not included in settings in which it performs poorly; thus it is only included in the correlated noise setting and for the real data. In all the simulated experiments, although we follow [26], unlike in their experiments we randomly initialize all predictors far from the optimal parameters, and our results demonstrate a large improvement over the weak learners. Note that state is not directly applicable in all the time-series settings we consider. In these cases, we implement Dyna-Boost 1 and Dyna-Boost 2 with the original loss functions.

Setting 1. We start with a simple sanity check, in which we generate stationary time-series data assuming the ARMA model with fixed coefficient vectors α and β, and normally distributed noise terms ε_t.

Setting 2. We generate non-stationary time-series data assuming the ARMA model, such that the coefficient vectors α and β are slowly changing with the time index. The noise terms ε_t are uniformly distributed on a bounded range.

Setting 3. We again generate a non-stationary ARMA process, but here the coefficients change abruptly: for the first half of the time steps we use one pair of coefficient vectors α and β, and for the second half we switch to a different pair. The noise terms are uniformly distributed on a bounded range.

Setting 4. We consider an ARMA process generated by fixed coefficient vectors α and β. The noise terms are now correlated: each noise term is normally distributed, with expectation equal to the value of the previous noise term, and is then clipped to a bounded range.
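To make these settings concrete, here is a hedged data-generation sketch covering the stationary and abrupt-change variants; the coefficient values below are illustrative stand-ins, not the paper's.

```python
import numpy as np

def arma_series(T, alphas, betas, noise, switch_to=None):
    """Generate an ARMA(p, q) series; if `switch_to` is given, swap to a
    second coefficient pair halfway through, as in Setting 3."""
    rng = np.random.default_rng(0)
    a, b = list(alphas), list(betas)
    xs, eps = [], []
    for t in range(T):
        if switch_to is not None and t == T // 2:
            a, b = switch_to
        e = noise(rng)
        ar = sum(ai * xs[t - 1 - i] for i, ai in enumerate(a) if t - 1 - i >= 0)
        ma = sum(bj * eps[t - 1 - j] for j, bj in enumerate(b) if t - 1 - j >= 0)
        xs.append(ar + ma + e)
        eps.append(e)
    return np.array(xs)

# Stand-in coefficients with an abrupt switch halfway (Setting 3 style).
series = arma_series(
    1000, [0.6, -0.3], [0.2],
    noise=lambda rng: rng.uniform(-0.5, 0.5),
    switch_to=([-0.4, 0.5], [0.3]),
)
```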

D.3 Boosting for Portfolio

We apply Algorithm 1 to portfolio selection with transaction loss. The data we use is Yahoo! S&P 500 data, with 490 tickers over 1000 trading days in the period 2001-2005. Translating into our model, the price vectors r_t are length-490 vectors containing the prices of all stocks at time t. The loss function and boosting algorithm follow the setting in Appendix B, where a transaction loss is used.

For weak learners, we test the MW (multiplicative weights) algorithm, OGD (online gradient descent), and boosted OGD for comparison. We tuned the number of weak learners, and set the learning rate of the MW algorithm to the value achieving its best performance.

The initial strategy used for all algorithms is the uniform distribution over all stocks. This explains the curves in the figure. At first, the prices of different stocks vary a lot, so a uniform strategy is far from the best and there is much room to learn. OGD tends to learn faster in the beginning while MW changes more slowly; therefore at first even a single OGD learner beats MW. The average gain decreases for both OGD and boosted OGD because the prices of stocks are decreasing over time: the sum of the prices of all stocks starts at around 22k, but keeps decreasing and reaches its lowest point at around 12k.