Contextual Bandits under Delayed Feedback

Delayed feedback is a ubiquitous problem in many industrial systems employing bandit algorithms. Most of these systems seek to optimize binary indicators such as clicks. In that case, when the reward is not sent immediately, the learner cannot distinguish a negative signal from a not-yet-sent positive one: she might be waiting for feedback that will never come. In this paper, we define and address the contextual bandit problem with delayed and censored feedback by providing a new UCB-based algorithm. To demonstrate its effectiveness, we provide a finite-time regret analysis and an empirical evaluation that compares it against a baseline commonly used in practice.





1 Introduction

Content optimization for websites and online advertising are among the main industrial applications of bandit algorithms. The services at stake sequentially choose an option among several possible ones and display it on a web page to a particular customer. For that purpose, contextual bandits are among the most widely adopted approaches, as they make it possible to take into account the structure of the space in which the actions lie.

Moreover, a key aspect of these interactions through displays on webpages is the time a customer needs to make a decision and perform an action. The customers’ reactions are often treated as feedback and provided to the learning algorithm. For example, a mid-size e-commerce website can serve hundreds of recommendations per second, but customers need minutes, or even hours, to perform a purchase. This means that the feedback the learner requires for its internal updates is always delayed by several thousands of steps. In Chapelle (2014), the authors ran multiple tests on proprietary industrial datasets, providing a good example of how delays affect the performance of click-through rate estimation.

The problem becomes even more relevant when the delay is large compared to the considered time horizon. For example, several e-commerce websites provide special deals that only last for a few hours, yet customers still need several minutes to make their decision, which is a significant amount of time compared to the time horizon. On the other extreme of the time scale, some companies optimize long-term metrics for customer engagement (e.g., accounting for returned products in the sales results), which by definition can be computed only several weeks after the bandit performed the action. In many of these applications, the ratio between the time horizon of the learner and the average delay is between 5 and 10, which makes the delayed feedback problem extremely relevant in practice.

Two major requirements for bandit algorithms to be ready to run in a real online service are the ability to leverage contextual information and to handle delayed feedback. Many approaches are available to deal with contextual information Abbasi-Yadkori et al. (2011); Agarwal et al. (2014); Auer et al. (2002); Neu (2015); Beygelzimer et al. (2011); Chu et al. (2011); Zhou (2015), and delays have been identified as a major problem in online applications Chapelle (2014). We give an overview of the existing literature in Section 6. However, to the best of our knowledge, no algorithm addresses this problem under the requirements defined above. In the following, we consider a censored contextual bandit problem: after a piece of content is displayed on a page, the user may or may not decide to react (e.g., click, buy a product). In the negative case, no signal will ever be sent to the system, which will therefore wait for feedback even if it never comes. On top of preventing the update of the parameters, delays also imply non-negligible memory costs. Indeed, as long as the learner is awaiting feedback, the context and action must be stored to allow the future possible update. For that reason, it is common practice to impose a cut-off time after which the stored objects are discarded, so that rewards associated with too-long delays are never considered. This additional censoring of the feedback has only been addressed in a recent work on non-contextual stochastic bandits Vernade et al. (2017), which does not generalize to the contextual case.

In this paper, we define and formalize the contextual censored problem under bandit feedback. We notice that a simple baseline algorithm can be easily implemented but has poor non-asymptotic performance. We then propose a carefully modified UCB-style algorithm that handles delays more efficiently. We provide a regret analysis for the new algorithm and an empirical evaluation against the naive baseline.

The setting and notation are defined in Section 2 and we present existing works in Section 6. Our algorithm is described in Section 3 which also features our new concentration results. The regret analysis is in Section 4 and experiments are shown in Section 5.

2 Setting and Notation

We introduce contextual delayed bandits as a stochastic contextual bandit problem with independent stochastic delays. This setting is inspired by Abbasi-Yadkori et al. (2011); Joulani et al. (2013) and Vernade et al. (2017). Upper-case and lower-case letters are used for random variables and constants, respectively.

At each round, a set of contextualized actions is available. In practice, these action vectors are constructed by combining the user’s features with fixed action vectors through some nonlinear projection. The system constraints fix a cut-off time that corresponds to the longest waiting time allowed for each action taken.

The learning proceeds as follows:

  1. The learner observes the contextualized actions ;

  2. An action is chosen from ;

  3. An acknowledgment indicator is generated independently of the past; it accounts for the event that the chosen action was actually evaluated by the customer. Its parameter does not depend on the action taken and is also unknown.

  4. Conditionally on , two random variables are generated but not immediately observed:

    4.a.  a reward following the linear assumption: where is an unknown vector and is an independent centered random noise discussed later;

    4.b.  an action-independent delay. As in Vernade et al. (2017), the delay distribution is characterized in a non-parametric fashion. However, as opposed to their setting, we do not require prior knowledge of its parameters.

  5. The observation of the reward is postponed until the delay elapses, provided this happens within the cut-off window; we then say that the action converts. Otherwise the reward will never be observed and the system will have to process it accordingly. As long as the reward has not been observed, the learner sees a zero, which can be mistaken for a true zero reward.
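The interaction protocol above can be sketched as a small simulator. This is a minimal illustration under assumptions of our own (Bernoulli rewards, geometric delays; the names `theta`, `rho` and `cutoff_m` are ours, not the paper's notation):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 5                            # dimension of the contextualized actions
theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)   # unknown reward parameter
rho = 0.8                        # acknowledgment probability (action-independent)
cutoff_m = 50                    # longest waiting time allowed by the system

def step(t, chosen_action):
    """Generate the hidden quantities for an action played at round t."""
    ack = rng.random() < rho                  # acknowledgment indicator
    mean = float(np.clip(chosen_action @ theta, 0.0, 1.0))
    reward = float(rng.random() < mean)       # Bernoulli reward
    delay = int(rng.geometric(0.1))           # stochastic, action-independent delay
    # The reward becomes visible at t + delay only if the action converts,
    # i.e. it was acknowledged and the delay fits within the cut-off window;
    # otherwise the learner keeps seeing a zero that looks like a null reward.
    observed_at = t + delay if (ack and delay <= cutoff_m) else None
    return reward, delay, ack, observed_at
```

Until `observed_at` is reached (or forever, if it is `None`), the learner's observation for this action is indistinguishable from a true zero reward.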

Remark 1.

The acknowledgment variables stand for actions that are never converted for external reasons unrelated to the action taken. This is equivalent to assuming that the delay associated with these actions is larger than the cut-off. We made this choice to align with the model of Chapelle (2014), but our bandit algorithm is agnostic to it.

From the moment an action is taken, a sequence of random variables is generated that models the observation available at each subsequent round:

Indeed, as long as , the action chosen at time is still awaiting conversion and the conditional expectation of is


The goal of the learner is to sequentially choose actions so as to minimize the expected cumulative regret after a given number of rounds,

where the optimal action is defined as the maximizer of the expected reward.

A main difficulty here is that when one observes a zero at a given time, there is ambiguity as to whether the reward itself is zero, or whether the reward is positive but has not yet been observed. If we were to know the censoring indicator, the problem would be much easier. This is illustrated as a warm-up in Section 3.1.

The setting of this paper addresses a harder problem where rewards are ambiguous. Specifically, the learner does not observe the censoring indicator. For this case, we propose two alternative ways of constructing and controlling an estimator of the unknown parameter, each making use of the sequential observations in a different manner. The first one, our baseline, simply waits for the cut-off. This avoids any delay-related bias due to the awaiting observations mentioned in Eq. 1. The second, more efficient in practice, updates the estimate on-the-fly as data comes in and handles the bias conveniently.

Assumption 1.

Without loss of generality and following usual practices in the literature on linear bandits Abbasi-Yadkori et al. (2011); Lattimore and Szepesvári (2016), we make the following assumptions:

  • Bounded scalar reward: , , ;

  • Bounded actions: we assume that each coefficient of any action is bounded by such that .

  • Bounded noise: for all rounds, the noise is bounded. Typically we will consider the case where the rewards are Bernoulli random variables. We comment on sub-Gaussian noise in Section 3.1 below.

3 Algorithm

In this section, we define confidence intervals in order to build UCB-like algorithms for the contextualized delayed feedback setting. We build on existing results by Lattimore and Szepesvári (2016) and make use of the exploration function therein, defined for some universal constant as follows:

For any matrix, we denote its pseudo-inverse and the corresponding weighted norm.

3.1 Warm-up: the non-ambiguous case.

We first consider a special case where the learner observes whether a sample is censored: she receives the censoring indicator as extra information. Note that when the reward is continuously distributed, typically when the noise is Gaussian, an observation of exactly zero is necessarily a censored one, since a continuous reward equals zero with probability zero.

This setting is much simpler than the general case where one does not observe the censoring indicator. The learner can simply update the covariance matrix and the estimator of the unknown parameter as soon as the reward is received. If the reward is censored, the action is just discarded. We define the least squares estimator in this non-ambiguous (NA) case by


The notation follows Abbasi-Yadkori et al. (2011). The following theorem defines a confidence interval for this estimator.

Theorem 1.

For any , and such that is almost surely non-singular,


Let such a vector be fixed; we have

The rest of the proof follows the lines of Theorem 8 in Lattimore and Szepesvári (2016). It mostly relies on bounding the deviations of the martingale

where the noise is 1-sub-Gaussian. Their analysis gives a global bound for any design and any direction. More details on this result are given in Appendix A.  

With this confidence interval, one can run a UCB-type algorithm in the spirit of Chu et al. (2011); Abbasi-Yadkori et al. (2011): actions are chosen uniformly at random until the first uncensored reward is observed, and then


Since Theorem 1 holds and is similar to Theorem 8 in Lattimore and Szepesvári (2016), which is the main ingredient for bounding the regret of their algorithm, an algorithm that proceeds as described in Equation (3) will have a similar regret bound, as derived in Lattimore and Szepesvári (2016).
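To make the non-ambiguous case concrete, here is a hedged sketch of the estimator (our own simplified version; we add a small ridge term `lam` for numerical stability, whereas the analysis above works with the unregularized least-squares estimator):

```python
import numpy as np

def na_least_squares(actions, rewards, converted, lam=1e-6):
    """Least-squares estimate in the non-ambiguous case: the learner
    observes the censoring indicator, so censored samples are discarded."""
    d = actions.shape[1]
    V = lam * np.eye(d)      # (slightly regularized) design matrix
    b = np.zeros(d)
    for x, r, ok in zip(actions, rewards, converted):
        if ok:               # keep only uncensored, converted samples
            V += np.outer(x, x)
            b += r * x
    return np.linalg.solve(V, b)
```

With noiseless linear rewards and all samples converted, the estimate recovers the parameter; discarding every sample returns the zero vector, illustrating that censored actions carry no information in this setting.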

3.2 General case: ambiguous rewards

In many applications, the extra information of the censoring is not available: one does not observe the censoring indicator. A user may decide to click or not for any reason, and the system can never know whether a null observation means that the user did not acknowledge the display, that the delay was too long and the reward was censored, or that the reward is truly zero. This distinction would be crucial in the case of Bernoulli rewards, which is important in classical web applications where the reward models a click. The warm-up estimation strategy presented above cannot be applied here since we do not observe the censoring, and, e.g., removing all null rewards would result in an estimator whose bias would not vanish.

We present two strategies to address delayed feedback in this setting. One major improvement compared to the existing algorithm of Vernade et al. (2017) is that our estimator does not require any prior knowledge of the delay distribution or of the conversion probability. We start by presenting a baseline estimator that simply waits for each action to reach the cut-off. While a good linear bandit strategy using this estimator would be asymptotically efficient, we argue that it suffers from poor non-asymptotic performance, especially when the time horizon is short with respect to the cut-off, as it effectively discards all the last observations.

To overcome this pitfall, we design a better estimator that also makes use of the not-yet-converted actions, and build on it a linear bandit algorithm based on a new concentration result. It has better non-asymptotic performance, as it does not discard any information.

Waiting as a baseline

We present a simple heuristic baseline that builds on existing concentration results. It is based on the aforementioned observation that after the timeout of the system has passed, no conversion can be observed anymore: once an action’s conversion window has elapsed, its observation no longer changes.

This means that at any fixed round, one can build an unbiased estimate of the unknown parameter by computing the least-squares solution using only the data whose conversion window has already elapsed:


The following theorem defines the corresponding confidence interval.

Theorem 2.

For any , and such that is almost surely non-singular,


The proof relies on the following decomposition:

We rewrite each term above and bound each of their absolute values using Theorem 8 from Lattimore and Szepesvári (2016). Details are in Appendix A.  

Then, the baseline simply proceeds like the warm-up algorithm: actions are chosen uniformly at random until the first uncensored reward is observed, and then


As in the warm-up case, since Theorem 2 holds, this baseline will enjoy a regret bound similar to that of the warm-up algorithm.
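A hedged sketch of this waiting baseline (our own code and storage format; `cutoff_m` and the small ridge term are assumptions of ours):

```python
import numpy as np

def waiting_estimator(t, cutoff_m, actions, obs_rounds, rewards, lam=1e-6):
    """'Wait for the cut-off' estimate at round t: only actions whose
    conversion window has fully elapsed are used, so their observation is
    final and the estimate carries no delay-related bias."""
    d = actions.shape[1]
    V = lam * np.eye(d)
    b = np.zeros(d)
    for s in range(len(actions)):
        if s + cutoff_m < t:        # window closed: no conversion can arrive anymore
            # obs_rounds[s] is None when the reward never converted
            y = rewards[s] if obs_rounds[s] is not None else 0.0
            V += np.outer(actions[s], actions[s])
            b += y * actions[s]
    return np.linalg.solve(V, b)
```

The price of unbiasedness is visible in the loop condition: the most recent actions are always ignored, which is exactly the non-asymptotic weakness discussed above.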

Better approach: progressive updates

Input: Horizon (optional), confidence level.
Initialization: Pick the first action uniformly at random.
for each round do
     Get the contextualized actions,
     Compute the estimator as in (6) and the confidence bonuses for all actions,
     Select the optimistic arm and update the covariance matrix.
     Receive the feedback that is due on this round and update the estimator accordingly.
Algorithm 1: a contextual bandit algorithm for ambiguous delayed feedback

We now describe an algorithm that takes into account all the data at hand, including observations received within the conversion window.

We define another estimator that suffers a small, controllable, and vanishing bias, but is much more data-efficient:


This estimator includes all the observations received up to the current round and has the same precision matrix as the baseline estimator. As described in detail in Algorithm 1, this makes it possible to start updating the internal parameters after each action. The algorithm updates the covariance after each action is taken, but updates the estimator only after rewards are received.
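The update scheme can be sketched as follows (a minimal illustration of the idea; the class and method names are ours, and a small ridge term stands in for the pseudo-inverse used in the analysis):

```python
import numpy as np

class ProgressiveEstimator:
    """On-the-fly estimator: the covariance matrix includes every action as
    soon as it is played, while observations enter b only when they arrive.
    Pending (not-yet-converted) rewards implicitly count as zeros, which is
    the small, vanishing bias controlled by Theorem 3."""
    def __init__(self, d, lam=1e-6):
        self.V = lam * np.eye(d)
        self.b = np.zeros(d)

    def play(self, x):
        self.V += np.outer(x, x)   # updated after every action

    def observe(self, x, r):
        self.b += r * x            # updated only when a reward is received

    def estimate(self):
        return np.linalg.solve(self.V, self.b)
```

When all rewards eventually arrive, the estimate coincides with the ordinary least-squares solution; before that, it is shrunk toward zero by the pending observations.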

Our technique comes at the cost of a bounded bias that we carefully take into account in the confidence interval that drives our algorithm presented in the theorem below.

Theorem 3.

Let , and large enough such that is invertible. Then,


The proof works as follows: we notice that this estimator is “close” to the baseline one, and we bound the bias due to the additional observations. We write

Decomposing according to the initial remark, we obtain

The second term is handled by Theorem 2, so it suffices to bound the first one:

And so by Cauchy-Schwarz, and under our bounded-noise assumption, we have

The remaining quantity can finally be bounded using Theorem 8 in Lattimore and Szepesvári (2016), carefully taking into account the delay and censorship, as described in Appendix A.  

To give an intuition of this result, we state it in the case where the arms form the canonical basis, which corresponds to the non-contextual bandit setting.

Corollary 1.

Let the arms be the canonical basis vectors, and denote the number of pulls of each action after a given number of rounds. Then,
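To make the canonical-basis case concrete, here is a small numeric check (our own illustration, with hypothetical pull counts): with arms e_1, …, e_d the design matrix is diagonal with the pull counts on its diagonal, so the estimator reduces to per-arm empirical means in which not-yet-converted rewards count as zero.

```python
import numpy as np

pulls = np.array([10.0, 20.0, 5.0])            # pulls of each canonical arm
obs_reward_sums = np.array([4.0, 15.0, 1.0])   # sums of *observed* rewards only
# With arms e_1..e_d the design matrix is V = diag(pulls), so solving the
# least-squares system gives per-arm means; rewards still awaiting conversion
# contribute zero, shrinking the estimate (the bias handled by Theorem 3).
theta_hat = np.linalg.solve(np.diag(pulls), obs_reward_sums)
assert np.allclose(theta_hat, obs_reward_sums / pulls)
```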

4 Regret analysis

4.1 Problem-independent lower bound

We first present a lower bound that sheds light on the fact that the censoring probability has to appear in the regret. Intuitively, the heavier the censoring, the larger the regret must be.

Theorem 4.

Consider the set of all contextual delayed bandit problems (as defined in the setting) in dimension with rewards bounded by , horizon , and censoring parameters . We have for any bandit policy

where is the expected regret of policy on the bandit problem .

The proof of this result is deferred to Appendix B.

4.2 Problem-independent Upper bound

By design, our algorithms suffer regret bounds very similar to those shown e.g. in Abbasi-Yadkori et al. (2011). We derive the bound below only for the progressive-update algorithm, as it is the most interesting one here; mutatis mutandis, a similar bound can be proved for the baseline.

Theorem 5.

Let sufficiently large such that is invertible. With probability , the expected regret of after rounds is bounded by


For any round , let us denote the vector such that . A consequence of Theorem 3 is that

So, it suffices to bound the regret on the most likely event

The last term is handled by the concentration result stated in Theorem 3. The first sum is bounded using the classical analysis of linear UCB algorithms that can be found for instance in Abbasi-Yadkori et al. (2011):

Finally, by Jensen’s and Cauchy-Schwarz inequalities,

The last inequality comes from the so-called Elliptical Potential Lemma (see e.g. Lemma 11 in Abbasi-Yadkori et al. (2011)):


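The lemma can be checked numerically. Here is a sketch under assumptions of our own (unit-norm actions, regularization equal to one); the bound below is the standard 2 d log(1 + T/(λ d)) form of the lemma:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, lam = 4, 2000, 1.0
V = lam * np.eye(d)
total = 0.0
for _ in range(T):
    x = rng.normal(size=d)
    x /= np.linalg.norm(x)                        # unit-norm action
    total += min(1.0, x @ np.linalg.solve(V, x))  # truncated elliptical potential
    V += np.outer(x, x)
bound = 2 * d * np.log(1 + T / (lam * d))         # standard upper bound
assert total <= bound
```

The sum of squared weighted norms grows only logarithmically in the number of rounds, which is what turns the per-round confidence widths into a sublinear regret bound.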
We make two remarks on this bound. First, there is a gap with the lower bound: this is a common gap suffered by optimistic approaches Abbasi-Yadkori et al. (2011). Second, there is another gap, which we believe could be fixed with a tighter control of the bias term, but that would imply quite involved computations that we leave for future work.

5 Experiments

To validate the effectiveness of our algorithm, we run experiments in the censored setting, which is the most common scenario in real-world applications. In this section, rewards are simulated in order to show the behavior of our algorithm compared to the baseline when facing a specific feedback environment. For a given time horizon, we test the non-asymptotic behavior of both policies. We expect that accurate handling of the delays will provide better performance. Even if in theory the improvement is a constant factor, it matters in practice, especially since the factor can be fairly large from a non-asymptotic perspective.

We fix the horizon and choose a geometric delay distribution whose mean, in a real setting, would correspond to an experiment that lasts 3 hours with average delays of 6 minutes. Then, we let the cut-off vary over values corresponding to 15, 30 and 60 minutes: a reasonable range for a waiting time in an online system. We also fix the remaining parameters. We only show the results in the more interesting and realistic Bernoulli case.
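As a back-of-the-envelope check of how much the cut-off censors, assume (hypothetically, to match the 3h/6min reading of the setup and the round counts used in Appendix C) that a 6-minute mean delay is about 100 rounds and that the 15/30/60-minute cut-offs are about 250/500/1000 rounds; the probability that a geometric delay fits in the window is then:

```python
mean_delay = 100                 # rounds; hypothetical mapping of 6 minutes
p = 1.0 / mean_delay             # geometric parameter with this mean
for m in (250, 500, 1000):       # hypothetical cut-offs: ~15, 30, 60 minutes
    prob = 1 - (1 - p) ** m      # P(delay <= m): chance the reward converts in time
    print(f"cutoff m={m}: conversion probability within window = {prob:.3f}")
```

Even the shortest window already captures most conversions under these assumptions, yet the remaining censored mass is precisely what separates the waiting baseline from the progressive estimator.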

Figure 1: Results of our simulations. From left to right, the plots report the performance of the two algorithms with .

The results in Figure 1 show a clear trend: when the size of the conversion window grows and the average delay is within the window, the advantage of properly handling the delay becomes more and more evident. In Appendix C, we report more experiments which confirm the same trend.

6 Related Work

Delays have been identified early Chapelle and Li (2011) as an incompatibility between the usual assumptions of the bandit framework and concrete applications. However, in many works on bandit algorithms, delays are ignored as a first approximation.

Delayed feedback occurs in various situations. For instance, Desautels et al. (2014); Grover et al. (2018) consider the problem of running parallel experiments that do not all end simultaneously. They propose a Bayesian way of handling uncertain outcomes to make decisions: they sample hallucinated results according to the current posterior.

In online advertising, delays are due to the natural latency in users’ responses. In the famous empirical study of Thompson Sampling by Chapelle and Li (2011), a section is dedicated to analyzing the impact of delays on both Thompson Sampling and UCB. While this is an early treatment of the problem, they only consider fixed, non-random delays of 10, 30 or 60 minutes. Similarly, in Mandel et al. (2015), the authors conclude that randomized policies are more robust to this type of latency. The general problem of online learning under delayed feedback is addressed in Joulani et al. (2013), including full information settings and partial monitoring, and we refer the interested reader to their references on those topics. The idea of ambiguous feedback is introduced in Vernade et al. (2017).

Many models of delays for online advertising have been proposed to estimate conversion rates in an offline fashion: e.g. Yoshikawa and Imai (2018) (non-parametric) or Chapelle (2014) (generalized linear parametric model). The learning setting of the present work builds on the latter reference. Our goal is different, though, as we want to build a bandit algorithm that minimizes the regret under such assumptions.

Finally, we mention a recent work on anonymous feedback Pike-Burke et al. (2017) that considers an even harder setting where the rewards, when observed, cannot be directly linked to the actions that triggered them in the past.

7 Discussions and conclusions

This paper frames and models a relevant and recurrent problem in several industrial systems employing contextual bandits. After noticing that the problem can be solved by a simple heuristic, we investigate a more efficient strategy, which provides a significant practical advantage and stronger theoretical guarantees. An interesting aspect not investigated in this work is the use of randomized policies, often preferable in practical applications for their positive impact on customer engagement.


  • Abbasi-Yadkori et al. [2011] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
  • Agarwal et al. [2014] Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pages 1638–1646, 2014.
  • Auer et al. [2002] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002.
  • Beygelzimer et al. [2011] Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 19–26, 2011.
  • Bubeck et al. [2012] Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
  • Chapelle [2014] Olivier Chapelle. Modeling delayed feedback in display advertising. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1097–1105. ACM, 2014.
  • Chapelle and Li [2011] Olivier Chapelle and Lihong Li. An empirical evaluation of thompson sampling. In Advances in neural information processing systems, pages 2249–2257, 2011.
  • Chu et al. [2011] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
  • Desautels et al. [2014] Thomas Desautels, Andreas Krause, and Joel W Burdick. Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization. The Journal of Machine Learning Research, 15(1):3873–3923, 2014.
  • Grover et al. [2018] Aditya Grover, Todor Markov, Peter Attia, Norman Jin, Nicolas Perkins, Bryan Cheong, Michael Chen, Zi Yang, Stephen Harris, William Chueh, et al. Best arm identification in multi-armed bandits with delayed feedback. In International Conference on Artificial Intelligence and Statistics, pages 833–842, 2018.
  • Joulani et al. [2013] Pooria Joulani, Andras Gyorgy, and Csaba Szepesvári. Online learning under delayed feedback. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1453–1461, 2013.
  • Lattimore and Szepesvári [2016] Tor Lattimore and Csaba Szepesvári. The end of optimism? an asymptotic analysis of finite-armed linear bandits. arXiv preprint arXiv:1610.04491, 2016.
  • Mandel et al. [2015] Travis Mandel, Emma Brunskill, and Zoran Popović. Towards more practical reinforcement learning. In 24th International Joint Conference on Artificial Intelligence, IJCAI 2015. International Joint Conferences on Artificial Intelligence, 2015.
  • Neu [2015] Gergely Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In Advances in Neural Information Processing Systems, pages 3168–3176, 2015.
  • Pike-Burke et al. [2017] Ciara Pike-Burke, Shipra Agrawal, Csaba Szepesvari, and Steffen Grunewalder. Bandits with delayed anonymous feedback. arXiv preprint arXiv:1709.06853, 2017.
  • Vernade et al. [2017] Claire Vernade, Olivier Cappé, and Vianney Perchet. Stochastic bandit models for delayed conversions. In Conference on Uncertainty in Artificial Intelligence, 2017.
  • Yoshikawa and Imai [2018] Yuya Yoshikawa and Yusaku Imai. A nonparametric delayed feedback model for conversion rate prediction. arXiv preprint arXiv:1802.00255, 2018.
  • Zhou [2015] Li Zhou. A survey on contextual multi-armed bandits. arXiv preprint arXiv:1508.03326, 2015.

Appendix A Concentration results

Concentration of the scaled estimator: proof of Theorem 6.

For completeness, we report here the result of Theorem 8 from Lattimore and Szepesvári [2016]. It gives a high-probability bound on the deviations of the absolute value of for any vector and any sequence of actions .

Fix sufficiently large such that is almost surely non-singular. Concretely, they prove that for any , and for any sub-Gaussian noise ,


We will use this result to bound similar deviations in all the concentration results of this paper. Note that this is a refinement of the original result from Abbasi-Yadkori et al. [2011] that could also be used instead.

In this section we define and study a scaled estimator of :

This estimator uses the complete covariance matrix – built with all the past action vectors – but only the received observations; the unreceived ones are counted as zeros. It is not exactly the least squares estimator, which would use either the covariance restricted to converted actions or all the rewards; the latter is not possible because part of them are unobserved. This effect will tend to shrink the norm of the estimator. The next theorem controls the deviations of the estimator defined above.

Theorem 6.

For any , sufficiently large and such that is almost surely non-singular,

where for some universal constant,


We start by noticing that

Let such a vector be fixed. We have


We thus have two noises that can be bounded individually.

The right term can be rewritten as

The last term can be bounded with high probability using the inequality in Eq. (7).

On the other hand, we bound the deviation term on the left of Eq.(8):

Taking the scalar product with some vector , we get a sum of two noises again:

The absolute value of each of the terms above can be bounded using Eq. (7).

Summing the three upper bounds derived above, we finally obtain,


Appendix B Lower Bound: proof of Theorem 4

The proof follows the lines of Theorem 3.5 in Bubeck et al. [2012]. Their result is actually a special case of a more general result stated in their Lemma 3.6. It gives a problem-independent lower bound that depends on a parameter characterizing the considered changes of distributions. Here, the number of arms is fixed, and in the hardest case the arms are the vectors of the canonical basis.

The idea is to consider hard problems where all arms share a common mean except one that has a slightly higher mean. This defines one hard problem for each of the arms. The goal is then to lower bound the worst expected regret under each of those models:

It then suffices to bound the right-hand side. Pinsker’s inequality provides a first bound :

where the reference model is the one in which no arm stands out and all arms have the same mean.

We prove the following bound that adapts their Lemma 3.6 to our delayed feedback setting:


The main difference of our setting compared to theirs is that we assume we are given a collection of i.i.d. Bernoulli random variables, independent of the generated rewards, and that a reward is only observed when the corresponding Bernoulli variable equals one.

The main step that we modify is the computation of the empirical divergence corresponding to the expectation of the likelihood ratio of the rewards under two alternative models. We prove that

This allows us to obtain the desired result using the concavity of the square root:

where we used the fact that the censoring variables and the rewards are independent.

Taking , we get from Lemma 3.6 that


Finally, we recall and prove the lower bound of Section 4.

Theorem 7.

Consider the set of all contextual delayed bandit problems (as defined in the setting) in dimension with rewards bounded by , horizon , and censoring parameters . We have for any bandit algorithm

where is the expected regret of algorithm on the bandit problem .


Taking this value in the result proved above (the equivalent of Lemma 3.6 in Bubeck et al. [2012]), we get


Appendix C Additional experiments

As in the main paper, we fix the horizon and let the cut-off vary in {250, 500, 1000}. The delay still follows a geometric distribution, but in this set of experiments its mean is larger. In our hypothetical 3-hour experiment, the average delay now corresponds to 15 minutes and the cut-off time is set to 15, 30 or 60 minutes.

Figure 2: Results of our simulations. From left to right, the plots report the performance of the two algorithms with .

As for the experiments in the main paper, the trend is clear: the higher the cut-off time, the more advantageous it is to properly handle the delay. The main difference with the previous experiments is the speed at which the algorithms learn: as expected, the higher the average delay, the slower the learning process and the higher the regret.