Online learning is an important domain in machine learning in which a learner predicts a sequence of outcomes and improves its predictions over time by learning from previous outcomes. Every round a prediction is made, after which the outcome is revealed. Using the knowledge about this outcome, the strategy for predicting the outcome in the next round can be adjusted. The learner wants to maximize the number of good predictions made.
In a specific case of online learning the learner receives advice from so-called experts. It maintains a weight vector, which is a probability distribution over the experts. Every round the learner randomly picks an expert using this distribution and follows its advice. When the outcome is revealed, the learner knows which experts were wrong and adjusts the weight vector accordingly to improve its chances of making a good prediction in the next round. This is also the form of online learning which we will study in this thesis.
The main question in this field of research is: what is the best way to adjust the weight vector? Multiple algorithms have been designed to determine the weight vector, but we will specifically look at the Squint algorithm. This algorithm is designed to always perform at least as well as other known algorithms, and in specific cases it performs much better. Squint was made for a non-changing environment. This means that it was designed for a setting in which the probability of a certain expert making a good prediction does not change over time. This thesis is concerned with making Squint function well in a changing environment, i.e. where experts can start performing better or worse at some point.
In Chapter 2 we will introduce the mathematical setting, present two algorithms, including Squint, for the non-changing environment and compare their properties. Then in Chapter 3 we will analyze how algorithms for the non-changing environment are usually made suitable for the changing environment. However, this conventional method makes the desired properties Squint has in a non-changing environment vanish. In order to find a way to retain Squint’s properties, we will first dive into the design of the Squint algorithm and the proof of its properties in Chapter 4. Finally, in Chapter 5 we will make our contribution by combining all the gathered information, designing a method which makes Squint function well in a changing environment and making sure its properties are preserved. We summarize our results in the concluding Chapter 6.
2 Prediction with Expert Advice
In this thesis we will study the Squint algorithm, which is used for a specific case of online learning. We will now introduce the setting and make some definitions as used in the article on the Squint algorithm.
Each round $t$ a learner wants to predict an outcome $y_t$, which is determined by the environment. The loss of a prediction is denoted by $\ell_t$ and indicates how good the prediction was, where a low loss corresponds to a good prediction and a high loss to a bad one. For example, $y_t \in \{0, 1\}$ can indicate whether it rains or not on day $t$. The learner predicts a probability $p_t$ of it raining that day. The loss can then be defined as a measure of the distance between $p_t$ and $y_t$.
Each round the learner obtains advice from $K$ experts. For each expert $k \in \{1, \dots, K\}$ this advice is in the form of a prediction of the outcome $y_t$. However, their predictions are not necessarily right. Hence, every expert suffers a loss $\ell_t^k \in [0, 1]$. Based on the experts’ losses in previous rounds the learner will decide which expert’s advice to adopt. The learner’s decisions are randomized. Thus, they are made using a probability vector $w_t$ (the components are non-negative and add up to 1) on the experts, which can be adjusted every round by the learner.
We now define $\ell_t = (\ell_t^1, \dots, \ell_t^K)$ as the losses of the experts in round $t$. Then the dot product $w_t^\top \ell_t$ is the expected loss of the learner. Since a low loss means a good performance, we define the learner’s performance compared to expert $k$ by $r_t^k = w_t^\top \ell_t - \ell_t^k$. This value measures how much the learner is expected to regret using its probability distribution on the experts instead of deterministically picking expert $k$, and hence is called the instantaneous regret compared to expert $k$. Finally, this leads us to defining the total regret by
$$R_T^k = \sum_{t=1}^{T} r_t^k. \tag{1}$$
The goal of the learner is to perform as well as the best expert or, more specifically, the expert with the lowest accumulated loss. Hence, the goal is to have ‘small’ regret $R_T^k$ simultaneously for all experts $k$ after any number of rounds $T$. The total number of rounds $T$ is known to the learner before it starts the learning task and thus can be used when determining $w_t$.
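The regret definitions above can be illustrated with a minimal sketch. The function name `regrets` and the array layout (one row per round, one column per expert) are assumptions made for this illustration only; losses are assumed to lie in $[0, 1]$.

```python
import numpy as np

def regrets(weights, losses):
    """Instantaneous and total regret against every expert.

    weights, losses: arrays of shape (T, K); row t holds the learner's
    probability vector and the experts' losses in round t (assumed layout).
    """
    weights = np.asarray(weights, dtype=float)
    losses = np.asarray(losses, dtype=float)
    # Expected loss of the learner in each round: dot product of w_t and l_t.
    learner_loss = np.sum(weights * losses, axis=1)
    # Instantaneous regret: learner's expected loss minus expert k's loss.
    inst = learner_loss[:, None] - losses
    # Total regret: sum of the instantaneous regrets over all rounds.
    total = inst.sum(axis=0)
    return inst, total

# Toy data: two experts that alternate being right, learner stays uniform.
toy_losses = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
toy_weights = np.full((3, 2), 0.5)
toy_inst, toy_total = regrets(toy_weights, toy_losses)
```

In the toy data the uniform learner suffers expected loss 0.5 every round, so its total regret is positive against the expert that was right twice and negative against the other.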
One could question when we can call the regret ‘small’. We assume that every expert will always have the same probability of making a good prediction. This means that an expert’s total loss will grow linearly over time. We want our total regret to grow slower, such that the average instantaneous regret approaches zero as the number of time steps increases. For this reason we call the regret ‘small’ if it grows sublinearly in $T$. We will now look into an algorithm which guarantees this property for the regret.
2.2 Hedge Algorithm
The question we aim to answer is: how should the learner adjust the probability vector $w_t$ each round in order to have ‘small’ regret? One way to do this is described by Freund and Schapire in their so-called Hedge algorithm. This algorithm depends on a parameter $\eta > 0$, which can be chosen by the user. We define $w_t^k$ to be the $k$’th component of the probability vector $w_t$, which in essence equals the weight put on expert $k$. Then the Hedge algorithm determines the probability vector for the next round by, for each $k$, setting
$$w_{t+1}^k = \frac{\pi(k)\, e^{-\eta \sum_{s=1}^{t} \ell_s^k}}{\sum_{j=1}^{K} \pi(j)\, e^{-\eta \sum_{s=1}^{t} \ell_s^j}}.$$
Here $\pi$ is the prior distribution on the experts. So $\pi(k)$ equals the weight put on expert $k$ before the algorithm starts. When we set $\pi$ to be the uniform distribution, so $\pi(k) = 1/K$ for all $k$, and tune $\eta$ using $T$ and $K$, Freund and Schapire show that the algorithm attains an explicit bound on the total regret $R_T^k$ for each total number of rounds $T$ and every expert $k$. Other algorithms also obtain explicit bounds like the one for Hedge. However, those bounds look more complex. Hence, from now on we use the following definition:
Let $\mathcal{X}$ be the domain of the functions $f$ and $g$. We write $f \lesssim g$ if there exists an absolute constant $c > 0$ such that for all $x \in \mathcal{X}$ we have $f(x) \le c\, g(x)$.
With this definition we can write a cleaner expression for the bound obtained by the Hedge algorithm:
$$R_T^k \lesssim \sqrt{T \ln K}.$$
Thus, the Hedge algorithm guarantees that the total regret grows sublinearly over time, which is exactly what we wished for. So using this algorithm we obtain ‘small’ regret. But there exist algorithms which yield even smaller regret, as is shown in the next section.
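A minimal sketch of Hedge with a uniform prior follows. The tuning $\eta = \sqrt{8 \ln K / T}$ is the standard textbook choice and is an assumption here, not necessarily the value used in this thesis; with it, the regret is known to be at most $\sqrt{(T/2) \ln K}$ for losses in $[0, 1]$. The function names are hypothetical.

```python
import numpy as np

def hedge_weights(cum_losses, eta, prior=None):
    """Hedge probability vector given the cumulative losses of the experts."""
    cum_losses = np.asarray(cum_losses, dtype=float)
    K = cum_losses.shape[0]
    if prior is None:
        prior = np.full(K, 1.0 / K)  # uniform prior on the experts
    prior = np.asarray(prior, dtype=float)
    # Work in log space and shift by the maximum for numerical stability.
    log_w = np.log(prior) - eta * cum_losses
    w = np.exp(log_w - log_w.max())
    return w / w.sum()

def run_hedge(losses, eta=None):
    """Run Hedge on a (T, K) loss matrix; return the total regret per expert."""
    losses = np.asarray(losses, dtype=float)
    T, K = losses.shape
    if eta is None:
        eta = np.sqrt(8.0 * np.log(K) / T)  # assumed textbook tuning
    cum = np.zeros(K)
    learner_loss = 0.0
    for t in range(T):
        w = hedge_weights(cum, eta)
        learner_loss += w @ losses[t]  # expected loss of the learner
        cum += losses[t]               # losses are revealed after predicting
    return learner_loss - cum
```

Calling `run_hedge` on any loss matrix with entries in $[0, 1]$ returns a vector whose largest entry is the regret compared to the best expert.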
2.3 Squint Algorithm
More recently, Koolen and Van Erven have come up with another algorithm, named Squint. This algorithm makes use of $V_T^k = \sum_{t=1}^{T} (r_t^k)^2$, the cumulative uncentered variance of the instantaneous regrets. Furthermore, it involves a learning rate $\eta$. However, the optimal learning rate is not known to the learner, so it uses a so-called prior distribution $\gamma$ on the learning rate. Finally, the prior distribution on the experts is again denoted by $\pi$. Using these distributions, Squint determines the probability vector for the next round by setting
$$w_{t+1} = \frac{\mathbb{E}_{\pi(k)\gamma(\eta)}\left[e^{\eta R_t^k - \eta^2 V_t^k}\, \eta\, e_k\right]}{\mathbb{E}_{\pi(k)\gamma(\eta)}\left[e^{\eta R_t^k - \eta^2 V_t^k}\, \eta\right]},$$
where $e_k$ is the unit vector in the $k$’th direction.
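For different choices of $\gamma$ they obtain bounds for $R_T^{\mathcal{K}} = \mathbb{E}_{\pi(k \mid \mathcal{K})}[R_T^k]$, with $\mathcal{K}$ a subset of the set of experts and $\pi(\cdot \mid \mathcal{K})$ the prior conditioned on $\mathcal{K}$. This is the expected total regret compared to experts in the subset $\mathcal{K}$. Likewise, they define $V_T^{\mathcal{K}} = \mathbb{E}_{\pi(k \mid \mathcal{K})}[V_T^k]$. For a certain choice of $\gamma$, on which we will go into further detail in Chapter 4, they obtain the bound
$$R_T^{\mathcal{K}} \lesssim \sqrt{V_T^{\mathcal{K}} \left(\ln\frac{1}{\pi(\mathcal{K})} + \ln\ln T\right)} + \ln\frac{1}{\pi(\mathcal{K})} + \ln\ln T \tag{4}$$
for each subset of experts $\mathcal{K}$ and every number of rounds $T$. Here $\pi(\mathcal{K})$ is the weight put on the elements in the subset $\mathcal{K}$ by the prior distribution. Usually $\mathcal{K}$ is chosen to be the set, unknown to the algorithm, of the best experts, such that the bound tells us something about the expected regret compared to the best experts. If this regret is ‘small’, then the regret compared to the worse experts definitely is small as well. However, we now do not have a guarantee on our regret compared to the very best expert, which is what we initially were after.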
Hence, one might wonder what makes this bound better than the one (2) of the Hedge algorithm. Instead of the time $T$, the new bound involves the variance $V_T^{\mathcal{K}}$ under the square root. Since the instantaneous regret satisfies $r_t^k \in [-1, 1]$, we conclude that $V_T^{\mathcal{K}} \le T$. Often the variance is strictly smaller than $T$, which makes it a tighter bound. On top of that, the factor $\ln K$ is replaced by $\ln(1/\pi(\mathcal{K}))$. When the prior distribution is chosen uniformly, this factor is again smaller. Lastly, the factor $\ln\ln T$ practically behaves like a small constant and thus can be neglected.
In the worst case there is only a single good expert and we have to choose $\mathcal{K}$ to be a single expert. When using the uniform prior, the term $\ln(1/\pi(\mathcal{K}))$ then equals $\ln K$. Also, in the worst case $V_T^{\mathcal{K}}$ will be close to $T$. Since $\ln\ln T$ is negligible, the Squint bound now matches the bound of Hedge in some sense. But again, this holds for the worst case. In practice there are often multiple good experts, and in specific cases the expected variance is much smaller than $T$. That makes this bound significantly better than the Hedge bound. Another advantage is that the bound holds for every prior distribution on the experts and not only for the uniform prior. For non-uniform priors it could also yield favorable bounds.
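The Squint update can be sketched numerically by replacing the prior on the learning rate with a finite grid. The uniform grid on $(0, 1/2]$ used below is an assumption made only for this illustration and is not the prior for which the bound discussed above is proven; the function names are hypothetical as well.

```python
import numpy as np

def squint_weights(R, V, prior, etas, gamma):
    """Weights proportional to pi(k) * sum_eta gamma(eta) * eta * exp(eta*R_k - eta^2*V_k)."""
    R = np.asarray(R, dtype=float)
    V = np.asarray(V, dtype=float)
    # exponent[i, k] = eta_i * R_k - eta_i^2 * V_k
    exponent = np.outer(etas, R) - np.outer(etas ** 2, V)
    exponent -= exponent.max()  # common shift, cancels after normalization
    mass = (gamma * etas) @ np.exp(exponent)
    w = prior * mass
    return w / w.sum()

def run_squint(losses, n_grid=50):
    """Run the discretized Squint sketch on a (T, K) loss matrix."""
    losses = np.asarray(losses, dtype=float)
    T, K = losses.shape
    prior = np.full(K, 1.0 / K)                     # uniform prior on experts
    etas = np.linspace(0.5 / n_grid, 0.5, n_grid)   # assumed grid of learning rates
    gamma = np.full(n_grid, 1.0 / n_grid)           # assumed uniform prior on the grid
    R = np.zeros(K)  # total regret per expert
    V = np.zeros(K)  # cumulative uncentered variance per expert
    for t in range(T):
        w = squint_weights(R, V, prior, etas, gamma)
        r = w @ losses[t] - losses[t]  # instantaneous regrets
        R += r
        V += r ** 2
    return R, V, squint_weights(R, V, prior, etas, gamma)
```

With zero regret and variance the update returns the prior on the experts, and as an expert accumulates positive regret its weight grows.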
In Section 2.1 we assumed that every expert always has the same probability of making a good prediction in order to define what we mean by ‘small’ regret. However, in reality experts might notice over time that their predictions are not very accurate. This could lead to them changing their strategy of predicting and hence they could start making better or worse predictions. Moreover, if the sequence of outcomes changes this could also lead to certain experts gaining an advantage and making better predictions. How do we then define ‘small’ regret and how can we make the Squint algorithm adapt to changes in performance of the experts? After all, we want our algorithm to put the most weight on the best experts, but when one expert performed poorly at first and now starts performing very well, we want the algorithm to forget about the first few bad predictions. So even though the current regret bound still holds, we now want to bound the total regret obtained since the last time an expert’s performance changed and not necessarily since $t = 1$. In the remainder of this thesis, we will focus on these problems caused by varying performances of the experts.
3 Changing Environment
In this chapter we will be looking at a changing environment for the prediction with expert advice setting, i.e. the experts’ performances change over time. Jun et al. studied this with a tool named coin betting. We will use their research as a starting point and apply it to our learning task.
3.1 (Strongly) Adaptive Algorithms
When the environment changes, we want the learner to adapt to this change. Originally, the probability vector $w_t$ was based on the data from the first round until the current round. However, when the environment has changed, we want the learner to forget about the data from before that point. So we only want to use data from an interval, starting at the last time the environment changed and ending at the current round. Applying this approach, we now look for algorithms which have ‘small’ regret on every possible interval, since we do not know when the environment changes and thus which intervals to look at. First, we will adjust our definition of the total regret (1) to this situation. We define the total regret on a contiguous interval $I = [I_1, I_2]$ with $1 \le I_1 \le I_2 \le T$ compared to expert $k$ by
$$R_I^k = \sum_{t \in I} r_t^k. \tag{5}$$
We want an algorithm which makes $R_I^k$ grow sublinearly over time on its interval. Often used definitions are the following: we call an algorithm adaptive to a changing environment if $R_I^k$ grows with $\sqrt{T}$ (up to logarithmic factors) for every contiguous interval $I$, and strongly adaptive if it grows with $\sqrt{|I|}$ (up to logarithmic factors), where $|I|$ is the length of the interval. Note that the stopping time $T$ is known beforehand and hence can be used by the algorithm, but the interval $I$, on which we measure the regret, is not known and cannot be used by the algorithm. This is because we want to have ‘small’ regret on all possible intervals $I$.
If we knew beforehand when the environment changes, we could apply Hedge or Squint to every separate interval between the times the environment changes. However, we do not know when the environment changes and thus cannot tell what the starting points of these intervals are. That is where a so-called meta algorithm which learns these starting points comes in.
3.2 Meta Algorithms
The idea of a meta algorithm is that it uses so-called black-box algorithms like the Hedge algorithm. For each possible starting point a black-box algorithm is introduced. These algorithms compute $w_t$ for every time step $t$ of their intervals and can only use data of the environment available from their starting points onwards. In our case the only difference between the black-box algorithms is their interval. We choose the type of algorithm (e.g. Hedge or Squint) to be the same for all black-boxes.
The meta algorithm then keeps track of a probability vector on the active black-box algorithms (i.e. the algorithms that produce an output at time $t$). It uses this probability vector to randomly determine which black-box algorithm it should follow. Based on knowledge from previous rounds the probability vector is adjusted. This is actually very similar to our prediction with expert advice setting, where the black-box algorithms are the experts and the meta algorithm is the learner.
When using this meta algorithm we should reformulate our regret on an interval . Let be the set of all black-box algorithms and let be the set of active black-box algorithms at time . For the algorithm we define as its computed probability vector at time . Similar to (5), the regret for this black-box algorithm on the interval is denoted by . Finally, is the probability vector with components on the set of black-box algorithms , which is determined by the meta algorithm . It has the properties if , if and . Since the regret is the difference between the expected loss of the learner and the loss of the expert, we conclude that the regret of the meta algorithm in combination with the black-box algorithms is given by
Before we go into the precise formulation of the algorithm, we would like to look at the computational complexity. When we introduce a black-box algorithm for every possible starting point, we would have to keep track of a lot of algorithms. At time $t$ there would be $t$ active algorithms (one for each possible starting point, including the current round), which would sum to $\sum_{t=1}^{T} t = T(T+1)/2$ over all $T$ rounds. So the computation time would scale quadratically with $T$, whereas we want it to scale linearly with $T$, which is the case for algorithms in a non-changing environment. In order to reduce the computation time, we will only look at the so-called geometric covering intervals. Using these we will take at most $t$, and typically far fewer, black-box algorithms into account at time $t$.
3.3 Geometric Covering Intervals
Daniely et al.  prove Lemma 1 below, which states that every possible contiguous interval can be partitioned into the geometric covering intervals. Hence, if we can guarantee a ‘small’ sum of regrets on these specific intervals, we can guarantee a ‘small’ regret on every possible interval. We will now make this more precise.
For every $n \in \{0, 1, 2, \dots\}$, define $\mathcal{J}_n = \{[i \cdot 2^n, (i+1) \cdot 2^n - 1] : i = 1, 2, \dots\}$ to be the collection of intervals of length $2^n$ with starting points $i \cdot 2^n$. The geometric covering intervals are
$$\mathcal{J} = \bigcup_{n \ge 0} \mathcal{J}_n.$$
Since $t$ is an element of an interval $J \in \mathcal{J}_n$ only if $n \le \lfloor \log_2 t \rfloor$, we see that at time $t$ there are $\lfloor \log_2 t \rfloor + 1$ active intervals. We will only use black-box algorithms which function solely on these intervals and not outside of them. So at time $t$ we also have $\lfloor \log_2 t \rfloor + 1$ active black-box algorithms, which sums to $O(T \log T)$ over all rounds. This improves the previous quadratic scaling to nearly linear in $T$, except for the minor overhead of an extra logarithmic factor. Thus, using the geometric covering intervals requires considerably less computation time than looking at all possible intervals with endpoint $t$. As mentioned before, the following lemma allows us to use the geometric covering intervals:
Lemma 1 ([5, Lemma 1.2]).
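Let $I = [I_1, I_2]$ be an arbitrary interval. Then $I$ can be partitioned into a finite number of disjoint and consecutive intervals denoted $J^{(-a)}, \dots, J^{(0)}, J^{(1)}, \dots, J^{(b)}$ with $a, b \ge 0$ such that $J^{(i)} \in \mathcal{J}$ for all $i \in \{-a, \dots, b\}$ and such that
$$\frac{|J^{(-i)}|}{|J^{(-i+1)}|} \le \frac{1}{2} \text{ for } i = 1, \dots, a \quad \text{and} \quad \frac{|J^{(i)}|}{|J^{(i-1)}|} \le \frac{1}{2} \text{ for } i = 2, \dots, b.$$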
The partitioning from the lemma is a sequence of smaller intervals which successively double and then successively halve in length. With this partitioning we can now decompose the total regret compared to an expert , obtained by the meta algorithm with the black-box algorithms. To do so, we introduce the notation for the black-box algorithm which operates only on the interval and not outside of it.
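The geometric covering intervals and the partitioning can be made concrete with a short sketch. It assumes the intervals have the form $[i \cdot 2^n, (i+1) \cdot 2^n - 1]$ with $i \ge 1$, as in the work of Daniely et al. and Jun et al.; the function names and the greedy construction are illustrative and not taken from the thesis.

```python
def active_intervals(t):
    """Geometric covering intervals [i*2^n, (i+1)*2^n - 1] (i >= 1) containing t."""
    intervals = []
    n = 0
    while 2 ** n <= t:
        i = t // 2 ** n  # the unique block of length 2^n that contains t
        intervals.append((i * 2 ** n, (i + 1) * 2 ** n - 1))
        n += 1
    return intervals

def partition(start, end):
    """Greedily split [start, end] into consecutive geometric covering intervals."""
    parts = []
    s = start
    while s <= end:
        n = 0
        # Grow the block while s stays aligned to 2^(n+1) and the block still fits.
        while s % 2 ** (n + 1) == 0 and s + 2 ** (n + 1) - 1 <= end:
            n += 1
        parts.append((s, s + 2 ** n - 1))
        s += 2 ** n
    return parts
```

For example, `partition(3, 12)` yields blocks of lengths 1, 4, 4 and 1, which first grow and then shrink by factors of at least two, and `active_intervals(5)` returns three intervals, matching $\lfloor \log_2 5 \rfloor + 1$.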
From now on we use the following terms for the different regrets: the -regret (denoted ) is the total regret on of compared to a black-box . The -regret (denoted ) is the total regret on of compared to expert . Finally, the -regret (denoted ) is the total regret on of with its black-box algorithms compared to expert . This gives the following equation:
So in order to have ‘small’ regret on every interval , we need the following properties:
the sum over the intervals of the -regret must be ‘small’ for all combinations of the
the sum over the intervals of the -regret must be ‘small’ for all combinations of the
We will first go into further detail on the meta algorithm that uses the geometric covering intervals, after which we will show these properties are satisfied.
3.4 CBCE Algorithm
We will now introduce the meta algorithm, called Coin Betting for Changing Environment (CBCE), which was established by Jun et al. and based on the work of Orabona and Pál. We will present a slight adjustment of this algorithm such that it matches our notation. However, the functioning and corresponding properties stay the same.
For , let be the instantaneous regret of compared to . Let be a prior distribution on the set of all (geometric covering interval) black-box algorithms and let be the prior restricted to all active black-box algorithms at time . So if and if where is the weight put on the active algorithms by . Then each round the probability vector on these black-box algorithms is computed by
To clarify the functioning of the algorithm, we will also describe it in pseudocode (see Algorithm 1).
The creators prove a bound on the -regret of CBCE for the following choice of prior:
Here is a normalization factor. The bound, given on interval , is
Using this result, we will now show that the two necessary properties in order to have ‘small’ regret on an interval are satisfied when one applies Hedge for the black-box algorithms. This proof is also adopted from .
3.5 Applying CBCE to Hedge
Using (6) we will bound the -regret on an arbitrary interval compared to expert where we take CBCE as our meta algorithm and Hedge for the black-box algorithms. Since the -regret on an interval is equal to the standard regret (1) for , the -regret for Hedge is bounded by
This gives the bound
Since the intervals form the partitioning from Lemma 1, we know that and hence for all . This gives
We know that and that the lengths successively halve when goes down from 0 or up from 1. So we obtain
Now we bound the regret by
We conclude that the two properties required for a ‘small’ regret on are satisfied. Since , we find that CBCE combined with Hedge is a strongly adaptive algorithm. In the previous chapter we saw that Squint gave a better bound than Hedge. Consequently, we want to try to obtain a better bound in a changing environment by applying CBCE to Squint.
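The geometric-series step used when summing over the partition can be sketched as follows. The notation $J^{(-a)}, \dots, J^{(b)}$ for the intervals of the partition of $I$ is an assumption for this illustration, and the constant is not optimized. Since $|J^{(0)}| \le |I|$ and $|J^{(1)}| \le |I|$, and the lengths at least halve when moving away from these two intervals, each side contributes at most a geometric series:

```latex
\sum_{i=-a}^{b} \sqrt{|J^{(i)}|}
  \;\le\; 2 \sum_{n=0}^{\infty} \sqrt{2^{-n} |I|}
  \;=\; \frac{2\sqrt{2}}{\sqrt{2} - 1} \sqrt{|I|}
  \;\le\; 7 \sqrt{|I|}
```

So a bound of order $\sqrt{|J|}$ on each interval of the partition sums to a bound of order $\sqrt{|I|}$ on the whole interval.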
3.6 Applying CBCE to Squint
The creators of Squint  obtained bounds for , the expected regret compared to a subset of experts. Likewise, we now define this regret on the interval : . Furthermore, we define and similarly.
For the expected regret compared to a subset we can make the same decomposition as displayed in equation (6). We then obtain
This is very similar to equation (6). The analysis of the sum over the first term stays the same as in Section 3.5. What is left is the analysis of the sum over the second term, which is different than before. To do this, we would like to know over how many intervals we are summing (i.e. what is ?). Lemma 1 tells us that the first successively at least double and then successively at most halve in length. This enables us to bound the length of from below by
This finally gives
Remember that the total regret for Squint in a non-changing environment is bounded by
We will now determine the bound for the sum of the Squint regret over the intervals . In the summation over the intervals , we first only look at the second and third term of the Squint regret.
In order to prove a good bound on the first term, we need the following lemma:
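For all $x_1, \dots, x_n$ with $x_i \ge 0$:
$$\sum_{i=1}^{n} \sqrt{x_i} \le \sqrt{n \sum_{i=1}^{n} x_i}.$$
Let $U$ denote the uniform distribution on the set $\{1, \dots, n\}$. Since the square root is a concave function on $[0, \infty)$, we can use Jensen’s inequality and obtain the following:
$$\sum_{i=1}^{n} \sqrt{x_i} = n\, \mathbb{E}_{i \sim U}\left[\sqrt{x_i}\right] \le n \sqrt{\mathbb{E}_{i \sim U}[x_i]} = \sqrt{n \sum_{i=1}^{n} x_i}.$$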
Using this lemma, we will derive a bound on the first term of the Squint regret:
Conclusively, the bound of CBCE combined with Squint is
Remember that and that can be neglected, since it practically behaves like a small constant. As and thus , we can conclude that the first term grows faster over the interval length than the second term. Moreover, because grows slower than , the first term also grows faster than the third term. Hence, the CBCE regret dominates the Squint regret in this expression. Likewise, the CBCE regret dominates the Hedge regret in the expression from Section 3.5. So the advantage Squint has over Hedge in a non-changing environment vanishes in a changing environment.
Therefore, our goal is to find a way to retain the advantages of Squint when applying it in a changing environment. In the following chapters we will focus on adjusting Squint for this matter. We will first give a detailed proof of the regular bound, so we can later on build on the same techniques and construct a bound in a changing environment.
After having discussed the effects of a changing environment, we will now go back to a non-changing environment. In order to make Squint work well under varying circumstances, we first have to go deeper into its properties for the non-changing case. Specifically, we will show a proof of its regret bound (4) from Chapter 2, as this will be an important stepping stone for proving a bound in a changing environment.
4.1 Reduction to a Surrogate Task
First, we will introduce a new task, the so-called surrogate task, to which we can reduce our original learning task, as done similarly by Van der Hoeven et al. Remember that, when we introduced Squint, we used a prior $\gamma$ on the learning rate $\eta$, since we did not know what the optimal learning rate was. For the surrogate task we try to find the best expert, but also the best learning rate. As a consequence, we now keep track of a probability distribution $P_t$ on the learning rates and the experts instead of our previous probability vector $w_t$. And instead of our previous loss $\ell_t^k$, which only depended on the expert, we now have a so-called surrogate loss $f_t(\eta, k) = -\eta r_t^k + \eta^2 (r_t^k)^2$, which depends on the learning rate and the expert. Here, $r_t^k$ is the instantaneous regret compared to expert $k$, as defined in Section 2.1. The definition of this surrogate loss will later turn out to be useful, since the sum of these losses yields the total regret $R_T^k$ and variance $V_T^k$. Next, we make some general definitions, which we will use multiple times in the remainder of this thesis:
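Let $\Theta$ be a measurable space, let $P_t$ be a probability distribution on $\Theta$ for all $t$ and let $f_t(\theta)$ be the loss of $\theta \in \Theta$ at time $t$. Then the mix loss of $f_t$ under $P_t$ is defined by
$$-\ln \mathbb{E}_{\theta \sim P_t}\left[e^{-f_t(\theta)}\right].$$
Let $\Theta$ be a measurable space, let $P_t$ and $Q$ be probability distributions on $\Theta$ for all $t$ and let $f_t(\theta)$ be the loss of $\theta \in \Theta$ at time $t$. Then the surrogate regret of the set of distributions $\{P_t\}_{t=1}^{T}$ compared to the distribution $Q$ under the losses $f_1, \dots, f_T$ is defined by
$$\sum_{t=1}^{T} -\ln \mathbb{E}_{\theta \sim P_t}\left[e^{-f_t(\theta)}\right] - \mathbb{E}_{\theta \sim Q}\left[\sum_{t=1}^{T} f_t(\theta)\right].$$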
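For a finite space the mix loss and the surrogate regret can be computed directly. The sketch below assumes the standard definition of the mix loss, $-\ln \mathbb{E}[e^{-f}]$, and takes the surrogate regret to be the total mix loss minus the total expected loss under the comparator. The function names and the array layout are hypothetical.

```python
import numpy as np

def mix_loss(p, losses):
    """Mix loss -ln E_{theta ~ p}[exp(-loss(theta))] on a finite space."""
    p = np.asarray(p, dtype=float)
    losses = np.asarray(losses, dtype=float)
    m = losses.min()  # shift by the smallest loss for numerical stability
    return m - np.log(np.sum(p * np.exp(-(losses - m))))

def surrogate_regret(P, L, q):
    """Total mix loss of the distributions P[t] minus the expected total loss under q.

    P, L: arrays of shape (T, N) with distributions and losses per round (assumed layout).
    q: comparator distribution of shape (N,).
    """
    P = np.asarray(P, dtype=float)
    L = np.asarray(L, dtype=float)
    total_mix = sum(mix_loss(P[t], L[t]) for t in range(L.shape[0]))
    return total_mix - float(np.dot(q, L.sum(axis=0)))
```

By Jensen's inequality the mix loss never exceeds the expected loss, and both coincide when the distribution is a point mass.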
For now, we set and . Our loss is . For the surrogate regret under these losses we obtain the expression
The surrogate regret now represents the difference between the total mix loss of the learner (who uses as its distributions) and the total expected loss of any other distribution . The goal of the new task is to keep the surrogate regret small. We can now reduce our original task to this new task by only keeping track of a probability distribution on the experts and marginalizing out the learning rate:
To derive the Squint algorithm, we will take a look at the definition of the Exponential Weights algorithm (EW) , which we will also use more often in the remainder of this thesis.
Let be a measurable space, let be a prior distribution on and let be the loss of at time . Then the Exponential Weights algorithm sets the densities of the probability distributions to be equal to