1 Introduction
Recently, bandit algorithms have found practical application in areas ranging from dynamic pricing (Misra2019) and healthcare (Durand2018a) to finance (Shen2015) and recommender systems (Spotify2018). For many of these application areas, generalised linear bandit algorithms are among the most commonly used approaches, as they are able to capture the structure of the rewards and actions often seen in practice. Additionally, there exist algorithms with near-optimal theoretical guarantees for this problem (Filippi2010; AbbasiYadkori2011; Li2017). However, these algorithms require that the learner receives feedback from the environment immediately. This strict requirement for immediate rewards often goes unmet in practice. Consequently, we consider learning the stochastic generalised linear bandit under delayed feedback.
For example, a feature of many recommender systems is that the user must provide feedback to the learner. In many of these applications, the user and the learner operate on vastly different time scales (Chapelle2014). In some cases, the learner can serve thousands of recommendations per second. Conversely, it is not uncommon for users to take anywhere from several minutes to a couple of days to provide feedback, or to give no feedback at all. Alternatively, practitioners might want to optimise for a longer-term measure of success (Lyft2021). In such cases, the reward is not observable or even defined immediately. Another setting where delayed feedback arises is in clinical trials, where it is rare that patients respond to their prescribed treatment immediately. On top of this, obtaining medical feedback is often a time-consuming task.
In any of the above settings, the reward for any given action returns at some unknown, possibly random, future timepoint. Regardless of how delays arise, the learner must continue to select actions, despite a high likelihood of not seeing feedback from many of its past choices. Indeed, this abundance of missingness poses significant theoretical challenges, as standard tools for analysing bandit algorithms no longer apply. Further complicating the matter is that we would like practical algorithms that operate without a priori knowledge of the delay distribution, without careful hyperparameter tuning and without assumptions on the action sets. To the best of our knowledge, no existing algorithms meet these requirements. Therefore, we propose an algorithm based on the optimistic principle that meets these requirements and develop new theoretical tools to derive high-probability regret bounds.
1.1 Related Work
The multi-armed bandit literature covers stochastically delayed feedback extensively. Both Joulani2013 and Mandel2015 propose queue-based approaches to adapt existing multi-armed bandit algorithms so they can handle the delays, each proving that the regret bound of the chosen algorithm increases only by an additive, delay-dependent term. Different versions of delayed feedback have also been studied from a theoretical perspective. For example, PikeBurke2018 study the case where the feedback from various rounds is aggregated, and Vernade2017 consider the case where some rewards can be censored if the delay is too large.
Comparatively, fewer theoretical results quantify the impact of delays beyond multi-armed bandits. Vernade2020 consider a Bernoulli bandit whose expected reward is linear in some unknown feature vector, and allow delayed and censored rewards. Thus, the challenges they face are different to ours. Nevertheless, they deal with the delays by inflating the exploration bonus and handle the censoring by introducing a windowing parameter that sets rewards taking too long to return equal to zero.
Dudik2011 developed a policy elimination algorithm capable of handling contextual information, and show that its regret, up to polylogarithmic factors, depends on the number of actions and on a constant delay between playing an action and observing the corresponding reward. However, implementing their algorithm is challenging and requires perfect knowledge of the distribution of the contextual information, and their results only hold for the setting of constant delays. Zhou2019 consider learning in the same setting as us. They propose an optimistic algorithm that inflates the exploration bonus to account for the missing rewards. Inflating the exploration bonus in this way leads to a multiplicative increase in the regret, and Zhou2019 prove a regret upper bound, up to polylogarithmic factors, that depends on a non-negative delay-dependent constant beyond which the delays have well-behaved tails. However, their algorithm requires prior knowledge of the delays, and their theoretical guarantees only hold for light-tailed delays.
1.2 Challenges & Contributions
Challenges.
We consider learning the stochastic generalised linear bandit under delayed feedback. As previously mentioned, existing approaches for this setting require constant or light-tailed delays to derive performance guarantees (Dudik2011; Zhou2019). These assumptions are a limiting factor in their theory, as empirical evidence suggests that the delays have a heavy-tailed nature (Chapelle2014).
The only existing work that considers learning generalised linear bandits with delayed rewards utilises the optimistic principle and inflates the exploration bonus by the square root of the number of missing rewards to account for uncertainty in the missing rewards (Zhou2019). Combining this with an unbiased estimator, Zhou2019 present an elegant argument which allows them to use standard theoretical tools, namely the elliptical potential lemma, to handle the leading-order terms. The usage of this lemma is widespread in the analysis of linear bandit algorithms and is provably tight (Carpentier2020). However, when coupled with an exploration bonus which depends additively on the expected delay, this unavoidably introduces a delay-dependent term that grows with the horizon into the regret bound.
Intuitively, one would anticipate that the delays become irrelevant once the learner has observed enough rewards to obtain a "good" estimate of the underlying expected reward function. Therefore, any delay-related terms in the regret bound are expected to be independent of the horizon. Indeed, this is the case in the multi-armed bandit setting (Joulani2013). A similar result in the contextual bandit problem was shown by Dudik2011. However, these results hold only for constant delays, a fixed number of actions and when the learner has perfect knowledge of the distribution of the contextual information, which is a strong assumption that is unmet in most practical applications.
Contributions.
In this paper we present a natural approach based on optimism that achieves regret bounds scaling with . This approach also addresses many of the limitations of existing works and reduces the dependence on the delay from a multiplicative penalty to an additive one.
Our algorithm only utilises the rounds with observed rewards in the estimation procedure, and introduces a regularisation term into the loss function, biasing the otherwise unbiased estimator. This biasing step is helpful because it bypasses the lengthy period of exploration used by existing methods, which requires prior knowledge of the delay distribution. As such, the regularisation allows our algorithm to leverage new information immediately, a crucial step in obtaining a "good" estimator as quickly as possible. We then derive delay-adapted confidence sets for our biased estimator by extending the theory from the linear to the generalised linear bandit setting. Subsequently, we prove worst-case bounds on the regret of our optimistic algorithm.
Improving the delay dependence from a multiplicative penalty to an additive one requires the introduction of new techniques for bounding the leading-order term in the delayed feedback setting, as the existing theoretical tools no longer hold. These techniques allow us to relax existing assumptions on the delays from light-tailed to subexponential, aligning the theory with reality. We note that the arguments within our bound on the leading-order term might be of independent interest for bandit algorithms with complex feedback structures.
2 Problem Formulation
We consider learning in the stochastic generalised linear bandit problem under delayed feedback. Here, the learner interacts with the environment over a fixed number of rounds by selecting an action from some decision set. Then, the standard assumption is that the learner immediately observes a reward after each interaction with the environment (Filippi2010; Li2017). In contrast, we assume that there is a delay between the learner playing an action and observing the corresponding reward, which more accurately reflects the feedback structure in many practical applications.
2.1 Generalised Linear Bandits
Throughout, we assume the conditional distribution of the reward given an action belongs to the exponential family. Letting $X_t$ and $Y_t$ be the action and reward associated with the $t$-th round, we assume that the density of the rewards given the action has the following form:
$$f(Y_t \mid X_t) = \exp\left(\frac{Y_t \langle X_t, \theta^* \rangle - b(\langle X_t, \theta^* \rangle)}{\phi} + c(Y_t, \phi)\right) \tag{1}$$
where $\theta^* \in \mathbb{R}^d$ is an unknown parameter vector; $b(\cdot)$ and $c(\cdot)$ are distribution-specific functions; and $\phi$ is a known constant that is often referred to as the dispersion parameter. For distributions belonging to the exponential family, one can verify that:
$$\mathbb{E}[Y_t \mid X_t] = b'(\langle X_t, \theta^* \rangle) = \mu(\langle X_t, \theta^* \rangle).$$
Here, $\mu$ is a strictly increasing link function that relates the inner product of the action vector and the unknown parameter to the expected reward. For example, if the rewards are normally distributed, $\mu(z) = z$ and we recover the standard linear model. If the rewards are Bernoulli, then $\mu(z) = 1/(1 + e^{-z})$ and we have a logistic regression model.
In the stochastic setting, the learner selects an action and receives noisy observations of the unknown expected reward function of the form:
$$Y_t = \mu(\langle X_t, \theta^* \rangle) + \eta_t,$$
where $\eta_t$ is random noise. Section 2.3 states the assumptions we make on the link function and the noise.
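The two example models can be sketched in a few lines. This is only an illustration, assuming NumPy; the function names, the `bernoulli` flag and the noise level `noise_sd` are hypothetical choices rather than anything specified in the paper.

```python
import numpy as np

def identity_link(z):
    """Link for Gaussian rewards: the expected reward is the inner product itself."""
    return z

def logistic_link(z):
    """Link for Bernoulli rewards: the expected reward is the sigmoid of the inner product."""
    return 1.0 / (1.0 + np.exp(-z))

def sample_reward(x, theta, link, rng, bernoulli=False, noise_sd=0.1):
    """Draw one noisy reward whose conditional mean is link(<x, theta>)."""
    mean = link(np.dot(x, theta))
    if bernoulli:
        return float(rng.random() < mean)          # Bernoulli(mean) draw
    return float(mean + noise_sd * rng.normal())   # Gaussian noise around the mean
```

In both cases the reward equals its conditional mean plus zero-mean noise, which is the observation model used throughout.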
The ultimate goal of the learner in the generalised linear bandit setting is to minimise the pseudo-regret. Intuitively, this compares the expected reward of the action selected by the learner to that of the action with the highest expected reward. Mathematically, we define the pseudo-regret of an algorithm in the generalised linear bandit setting as follows:
$$R_T = \sum_{t=1}^{T} \Big( \mu(\langle X_t^*, \theta^* \rangle) - \mu(\langle X_t, \theta^* \rangle) \Big) \tag{2}$$
where $X_t^*$ is the action maximising the expected reward in the $t$-th round.
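As a small worked example, the pseudo-regret can be computed directly when the parameter is known, as in a simulation. The sketch below assumes NumPy and finite decision sets stored as arrays with one action per row; the function name and argument layout are illustrative only.

```python
import numpy as np

def pseudo_regret(decision_sets, chosen, theta, link):
    """Sum over rounds of (best expected reward - expected reward of the chosen action)."""
    total = 0.0
    for actions, idx in zip(decision_sets, chosen):
        means = link(actions @ theta)       # expected reward of every available action
        total += means.max() - means[idx]   # per-round gap, never negative
    return total
```

With the identity link, two orthogonal unit actions and a parameter aligned with the first one, every round in which the second action is played adds exactly one unit of regret.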
2.2 Delayed Feedback Learning Setting
Let $\tau_t$ denote the random delay associated with the decision made in the $t$-th round. Then, the sequential decision-making procedure for generalised linear bandits under stochastically delayed feedback is as follows. For $t = 1, \dots, T$:

1. The learner receives a decision set containing the context vectors: $\mathcal{A}_t \subset \mathbb{R}^d$.

2. The learner selects a $d$-dimensional feature vector from the decision set: $X_t \in \mathcal{A}_t$.

3. Unbeknownst to the learner, the environment generates a random delay, a random reward and then schedules an observation time:

(a) The random reward has the form: $Y_t = \mu(\langle X_t, \theta^* \rangle) + \eta_t$.

(b) The random delay comes from some unknown distribution: $\tau_t \sim \mathcal{D}_\tau$.

(c) The environment schedules the observation time of the reward: $t + \tau_t$.

From the above decision-making procedure, it is clear that the learner only has access to the rewards of the actions whose observation times are less than or equal to the current round. More specifically, the reward corresponding to the th round is only available to the learner making decisions in rounds . Otherwise, it is missing. To that end, we define the set of observable information just before observing the rewards at the end of the th round as:
where
indicates whether the reward associated with the th round is observable at the end of the th round.
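The bookkeeping implied by the procedure above can be sketched as follows. The sketch assumes NumPy and assumes that the reward of round s is observable at the end of round t exactly when s plus its delay is at most t; the function name is hypothetical.

```python
import numpy as np

def observed_mask(delays, t):
    """Boolean mask over rounds 1..t: True if the reward of round s has arrived
    by the end of round t, i.e. s + delay_s <= t (assumed observation rule)."""
    rounds = np.arange(1, t + 1)
    return rounds + np.asarray(delays)[:t] <= t
```

For instance, with delays `[0.0, 3.5, 1.0, 0.2]`, at the end of round 4 the rewards of rounds 1 and 3 are observable, while those of rounds 2 and 4 are still missing.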
2.3 Assumptions
Now, we are ready to state our assumptions on the noise and the link function. The following assumptions are standard in the literature on linear and generalised linear bandits (Filippi2010; AbbasiYadkori2011; Li2017).
Assumption 1 (Subgaussian Noise).
Let
. Then, the moment generating function of the reward distribution conditional on the observed information must satisfy the following inequality:
for all .
Assumption 2 (Link Function).
The link function is known a priori and is twice differentiable with first and second derivatives bounded by and , respectively. Further,
where is the set of all possible parameter vectors.
Assumption 1 implies that the noise distribution has nicely behaved tails. Assumption 2 implies that the link function is Lipschitz. One can interpret the condition on as guaranteeing that it is possible to distinguish between two actions whose expected rewards are arbitrarily close to one another. Indeed, , and all feature in the theoretical analysis and regret bounds.
As will later become clear, it is also necessary for the delays to satisfy some assumptions. In particular, we assume the following holds.
Assumption 3 (Subexponential Delays).
The delays are non-negative, independent and identically distributed subexponential random variables. That is, their moment generating function satisfies the following inequality:
for some and , and all .
The class of distributions with subexponential tail behaviour contains the class of light-tailed distributions. Thus, Assumption 3 is a relaxation of the assumptions made in existing works, which assume that the delays have light tails beyond some known point on the support (Zhou2019). The subexponential assumption is also broad enough to include many heavy-tailed distributions, such as the Exponential distribution. Importantly, Assumption 3 also agrees with the empirical evidence suggesting that the delays have Exponential-like tails in practice (Chapelle2014).
2.4 Notation
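Throughout, $\|x\|_2$ denotes the $\ell_2$-norm of an arbitrary vector $x \in \mathbb{R}^d$. Further, $\langle x, y \rangle$ denotes the inner product between the two vector arguments. For a positive-definite matrix $A \in \mathbb{R}^{d \times d}$, we denote $\|x\|_A = \sqrt{x^\top A x}$ and adopt the following notation for positive (semi-)definite matrices:

$A \succ 0$ (positive-definite) if $x^\top A x > 0$ for all $x \in \mathbb{R}^d \setminus \{0\}$.

$A \succeq 0$ (positive semi-definite) if $x^\top A x \geq 0$ for all $x \in \mathbb{R}^d$.

$A \succeq B$ if $x^\top A x \geq x^\top B x$ for all $x \in \mathbb{R}^d$.
Additionally, $\lambda_i(A)$ and $\sigma_i(A)$ denote the $i$-th largest eigenvalue and the $i$-th largest singular value of a matrix $A$, respectively. Finally, for a real-valued function $f$, $\dot{f}$ and $\ddot{f}$ denote its first and second derivatives, respectively.
3 Delayed GLM-UCB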
We develop a provably efficient algorithm for generalised linear bandits with stochastic delays. We base our approach on the optimistic principle and show that delays only cause an additive increase in regret.
Due to the delays, it is necessary to introduce some additional notation that discriminates between rounds whose feedback has or has not been observed. Denote the number of missing rewards at the end of the $t$-th round and the maximum number of missing rewards across all rounds by:
$$G_t = \sum_{s=1}^{t} \mathbb{1}\{s + \tau_s > t\} \quad \text{and} \quad G_T^* = \max_{1 \leq t \leq T} G_t,$$
respectively. Further, we introduce the total, observed and missing design matrices:
$$V_t = \lambda I + \sum_{s=1}^{t} X_s X_s^\top, \tag{3}$$
$$W_t = \lambda I + \sum_{s=1}^{t} \mathbb{1}\{s + \tau_s \leq t\} X_s X_s^\top, \tag{4}$$
$$M_t = \sum_{s=1}^{t} \mathbb{1}\{s + \tau_s > t\} X_s X_s^\top, \tag{5}$$
respectively. Here, $\lambda > 0$ is a regularisation parameter. Briefly, $V_t$ is the total design matrix and contains information relating to all past choices, whereas $W_t$ and $M_t$ include information about actions with and without observed rewards, respectively. It is easy to see that the total, observed and missing design matrices must satisfy the following relationship:
$$V_t = W_t + M_t. \tag{6}$$
Thus, when there are no delays, the total and observed design matrices are equivalent to each other, and the missing design matrix is full of zeros.
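The decomposition of the total design matrix into observed and missing parts is easy to check numerically. The sketch below assumes NumPy, and assumes that the regularisation term sits in the total and observed matrices but not in the missing one, which is what makes the missing matrix vanish when there are no delays; the function and argument names are illustrative.

```python
import numpy as np

def design_matrices(actions, observed, lam):
    """Return (total, observed, missing) design matrices.

    actions  : array of shape (t, d), one played action per row
    observed : boolean array of length t, True where the reward has arrived
    lam      : regularisation parameter (assumed to enter total and observed only)
    """
    d = actions.shape[1]
    obs, mis = actions[observed], actions[~observed]
    total = lam * np.eye(d) + actions.T @ actions
    seen = lam * np.eye(d) + obs.T @ obs
    missing = mis.T @ mis
    return total, seen, missing
```

Calling it with a mask that is `True` everywhere returns a zero missing matrix and identical total and observed matrices.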
3.1 Estimation Procedure
We use maximum likelihood estimation to estimate the unknown parameter of the environment. Within the estimation procedure, we only use rounds with observed rewards. Under immediate feedback, generalised linear bandit algorithms use a phase of pure exploration (Filippi2010; Li2017). This exploration phase lasts until the observed design matrix is of full rank, which ensures that a unique maximiser of the likelihood function exists. (Under Assumption 2, the Hessian is of full rank whenever the observed design matrix is of full rank.)
However, the length of this exploration phase increases under delayed feedback, as the reward associated with each action returns at some unknown time in the future. Since no updates occur in this phase, the learner cannot utilise information from the rounds with observed rewards. Ideally, we would like to leverage new information as soon as it becomes available. Thus, we introduce a penalisation term into the objective function, an idea that we borrow from the linear and contextual linear bandits, where one can derive a closed-form maximum likelihood estimator (AbbasiYadkori2011; Chu2011). In the generalised linear setting, this equates to penalising the log-likelihood. From Equation (1) and the conditional independence of the rewards given past actions, one can write the penalised log-likelihood as follows:
(7) 
Equation (7) always has a unique maximiser due to the introduction of the penalisation term. Using simple calculus, one can verify that the maximiser is the solution of the following equation:
(8) 
where $\phi$ is the dispersion parameter of the reward distribution. To implement the optimistic principle, we must construct confidence sets around our estimators and prove that these sets contain the unknown parameter with high probability.
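To make the estimation step concrete, the following sketch solves a penalised maximum likelihood problem for Bernoulli rewards using only the rounds with observed rewards. It assumes NumPy, a ridge penalty of the form (lam / 2) times the squared norm of the parameter, a dispersion parameter of one, and plain Newton iterations; these are illustrative assumptions and need not match the exact form of Equations (7) and (8).

```python
import numpy as np

def penalised_mle_logistic(X, y, observed, lam, iters=50):
    """Newton's method for a ridge-penalised Bernoulli log-likelihood,
    restricted to rounds whose rewards have been observed."""
    Xo, yo = X[observed], y[observed]
    d = X.shape[1]
    theta = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xo @ theta))
        # Gradient of the penalised log-likelihood (assumed penalty: lam/2 * ||theta||^2).
        grad = Xo.T @ (yo - p) - lam * theta
        # Negative Hessian; the lam * I term keeps it invertible even with no data.
        hess = (Xo * (p * (1.0 - p))[:, None]).T @ Xo + lam * np.eye(d)
        step = np.linalg.solve(hess, grad)
        theta = theta + step
        if np.linalg.norm(step) < 1e-10:
            break
    return theta
```

Because of the penalty, the routine returns a well-defined estimate (the zero vector) even before any reward has arrived, so no separate exploration phase is needed to make the estimator unique.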
Lemma 1.
Let and assume that . Then, with probability at least :
across all rounds .
Proof Sketch.
Proving the lemma requires several adjustments to existing arguments for deriving confidence sets in the immediate feedback setting. Firstly, we account for regularising the log-likelihood function, which we do in Lemmas 6 and 7 in Appendix A. These lemmas allow us to separate noise-related terms from those introduced by biasing our estimator with the regularisation term. Subsequently, we show that the noise-related terms satisfy the martingale property under the information structure created by the delays. This result allows us to apply existing results for self-normalising processes (Pena2004; AbbasiYadkori2011). See Appendix A for a full proof. ∎
3.2 Delayed OFU for Generalised Linear Bandits
Algorithm 1 presents the pseudocode for our algorithm, Delayed OFU-GLM. It requires several input parameters that we briefly discuss below.
Firstly, the algorithm requires knowledge of the dispersion parameter of the reward distribution. For Bernoulli and Poisson rewards, one can show that this parameter always equals one. In the Gaussian case, this parameter is the variance of the reward distribution, which all optimistic algorithms also require to define the confidence sets.
Secondly, the algorithm requires an upper bound on the norm of the environment’s parameter vector. Since Zhou2019 use a period of explicit exploration, they do not need this hyperparameter. Instead, they require knowledge of the delay distribution to define the length of the exploration phase.
Finally, measures how close the expected reward of two actions can get. For Gaussian rewards, . For other distributions, one can replace this quantity with a lower bound and our theoretical results will still hold. If the rewards are Bernoulli, one can utilise the fact that the first derivative of the link function is symmetric about zero and decreasing to show that: . For Poisson rewards, by the definition of the inner product, . Indeed, many optimistic algorithms for generalised linear bandits require this hyperparameter, as it features in the definition of the confidence sets. However, recent work removes the need for this hyperparameter for the logistic bandit (Faury20a).
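For Bernoulli rewards, the symmetry and monotonicity argument above can be checked numerically. The sketch assumes NumPy and a hypothetical bound `B` on the absolute value of the inner product between any action and any admissible parameter; the function names are illustrative.

```python
import numpy as np

def sigmoid_derivative(z):
    """First derivative of the logistic link, s(z) * (1 - s(z))."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def derivative_lower_bound(B):
    """Smallest value of the logistic link's derivative over |z| <= B.

    The derivative is symmetric about zero and decreasing in |z|,
    so the minimum is attained at |z| = B.
    """
    return sigmoid_derivative(B)
```

For example, `derivative_lower_bound(0.0)` equals 0.25, the maximum slope of the logistic function, and the bound shrinks as `B` grows.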
3.3 Regret Bounds for Delayed OFU-GLM
We prove the following worstcase regret guarantees for Algorithm 1.
Theorem 1.
Proof.
First, we bound the per-round pseudo-regret. In Algorithm 1, the action selection procedure implies that the action selected by any optimistic algorithm satisfies the following inequalities:
where the final inequality holds with probability at least due to the definition of the confidence sets and the action-selection procedure in Algorithm 1. Adding and subtracting the maximum likelihood estimator gives:
(Hölder’s)  
(Definition of )  
() 
Therefore, we have that the pseudoregret has the following upper bound:
(11) 
with probability at least .
Usually, one would employ the so-called elliptical potential lemma to handle the remaining summation. This algebraic argument completes the proof in the immediate feedback setting and provides a tight upper bound on the term in question (Carpentier2020). However, the elliptical potential lemma requires that the learner updates the design matrix at the end of every round with the most recent action.
Sadly, this is not the case for the summation in (11), as the feedback associated with the most recent action is not necessarily observable immediately and is, therefore, not used to increment the observed design matrix. Moreover, there will likely be rounds where no feedback arrives at all. Consequently, we introduce the following technical lemmas that aid in bounding the summation.
Lemma 2.
Let . Then, and are invertible and have inverses that satisfy the following relationship:
Proof.
See Appendix B. ∎
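A generic identity of this flavour, relating the inverses of two invertible matrices through a product of three matrices, can be verified numerically. The sketch below assumes NumPy and is offered only as an illustration; it may differ from the exact statement of Lemma 2, and the function name is hypothetical.

```python
import numpy as np

def inverse_gap(V, W):
    """Return W^{-1} - V^{-1} and V^{-1} (V - W) W^{-1}.

    For any invertible V and W these two matrices coincide, which expresses
    the gap between the inverses as a product of three matrices.
    """
    V_inv, W_inv = np.linalg.inv(V), np.linalg.inv(W)
    return W_inv - V_inv, V_inv @ (V - W) @ W_inv
```

When `V` equals `W` plus a positive semi-definite matrix, as for a total and an observed design matrix, the gap is itself positive semi-definite, so norms weighted by the inverse observed matrix are at least as large as those weighted by the inverse total matrix.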
Lemma 3.
Let be an arbitrary sequence of nonnegative random variables. Then, for :
Proof.
See Appendix B. ∎
Lemma 2 relates the inverse of the observed design matrix to the inverse of the total design matrix and a product of three matrices. This allows us to separate the usual elliptical potential from terms involving the delays by application of the triangle inequality. Then, Lemma 3 shows that we can relate the remaining summation to a lower-order term. More concretely,
(Triangle Inequality)  
where the final inequality follows from Lemma 3. The above reveals that we must bound , the maximum number of missing rewards, and the sum involving the delays, which we do in the following lemmas.
Lemma 4.
Define and let be a sequence of independent and identically distributed random variables with a finite expectation and define:
Then,
Proof.
The proof follows the same arguments used in multi-armed bandits (Joulani2013). However, we include a simple extension to accommodate continuous delay distributions. See Appendix B. ∎
Lemma 5.
Let be subexponential random variables and define:
Then,
Proof.
The above follows from a standard tail bound for subexponential random variables (Wainwright2019) and a union bound. ∎
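The quantity controlled by these lemmas, the number of rewards still missing at the end of each round, is straightforward to simulate. The sketch assumes NumPy and the same observation rule as before (a reward from round s is missing at the end of round t when s plus its delay exceeds t); the function name is hypothetical.

```python
import numpy as np

def missing_counts(delays):
    """Number of rewards still missing at the end of each round t = 1..T."""
    delays = np.asarray(delays, dtype=float)
    T = len(delays)
    arrival = np.arange(1, T + 1) + delays   # observation time of each round's reward
    counts = np.empty(T, dtype=int)
    for t in range(1, T + 1):
        counts[t - 1] = int(np.sum(arrival[:t] > t))   # played by round t, not yet arrived
    return counts
```

Taking the maximum of the returned array gives the largest number of simultaneously missing rewards over the horizon, which can be compared informally against the expected delay for a chosen delay distribution.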
From Lemmas 4 and 5, and . Substituting this into our current upper bound on the summation in (11) gives:
which holds with probability at least . Applying Cauchy-Schwarz to the first term gives:
Now, here the total design matrix is incremented by the most recent action at the end of every round. Therefore, we can apply the elliptical potential lemma, which bounds the remaining summation by:
where . For completeness, we present a proof of the elliptical potential lemma in Appendix C. Therefore, we have that:
From Equation (11), it is clear that all that remains is to upper bound the width of the confidence set at the end of the final round. Recall , because the observed design matrix is a partial sum of positive semidefinite matrices that make up the total design matrix. Therefore,
where the inequality follows from Lemma 14 of Appendix C. Bringing everything together, we have that:
Omitting polylogarithmic factors gives:
where , completing the proof. ∎
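The elliptical potential lemma used in the proof can be checked numerically in its common log-determinant form, which requires the design matrix to be updated with every action. The sketch assumes NumPy and this particular form of the lemma (sum of truncated squared norms bounded by twice the log-determinant ratio); the version in Appendix C may be stated differently, and the function name is illustrative.

```python
import numpy as np

def elliptical_potential(actions, lam):
    """Return (sum_t min(1, ||x_t||^2 in the V_{t-1}^{-1} norm),
               2 * log(det V_T / det V_0)) with V_0 = lam * I."""
    d = actions.shape[1]
    V = lam * np.eye(d)
    logdet0 = np.linalg.slogdet(V)[1]
    total = 0.0
    for x in actions:
        total += min(1.0, float(x @ np.linalg.solve(V, x)))
        V += np.outer(x, x)   # the design matrix is incremented with every action
    return total, 2.0 * (np.linalg.slogdet(V)[1] - logdet0)
```

The first returned value never exceeds the second. If the update inside the loop were skipped in some rounds, as happens to the observed design matrix under delays, this guarantee would no longer follow, which is why the proof routes the argument through the total design matrix.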
4 Experimental Results
This section investigates the impact of delayed feedback in the Linear and Logistic bandits, which we believe are two settings of particular practical interest. We compare our algorithm to DUCB-GLCB (Zhou2019), which inflates the exploration bonus by the number of missing rewards.
In our experiments, we fix the dimension of the action vectors and the horizon to and , respectively. Further, was randomly generated from the unit ball for the Linear and Logistic bandit environments. For the linear environment, we set the subgaussian parameter of the noise .
The decision set for all rounds is the dimensional unit ball whose fixed size is: . Additionally, the delay between playing an action and observing the reward is generated from the Uniform and Exponential distributions. Indeed, Theorem 1 holds for both of these delay distributions. For each delay distribution, we consider expected values .
For Delayed OFU-GLM, we set by setting in both of our experiments. We implement DUCB-GLCB as described in the original paper (Zhou2019). For each algorithm, we set so that the theoretical results hold with probability at least . Figures 1 and 2 illustrate the results of our experiments, showing that Delayed OFU-GLM outperforms DUCB-GLCB for all combinations of environments, delay distributions, and expected delays considered, as our theory would suggest.
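The ingredients of such a simulation can be sketched as follows. The code assumes NumPy; the sampling routines, the parameterisation of the Uniform delays on the interval from zero to twice the mean, and all names are hypothetical choices for illustration and are not taken from the experimental code.

```python
import numpy as np

def sample_unit_ball(n, d, rng):
    """Draw n points uniformly from the d-dimensional unit ball."""
    g = rng.normal(size=(n, d))
    g /= np.linalg.norm(g, axis=1, keepdims=True)   # uniform direction on the sphere
    r = rng.random(n) ** (1.0 / d)                  # radius with density proportional to r^(d-1)
    return g * r[:, None]

def sample_delays(n, mean, kind, rng):
    """Draw n delays with the requested expected value."""
    if kind == "uniform":
        return rng.uniform(0.0, 2.0 * mean, size=n)   # assumed support [0, 2 * mean]
    if kind == "exponential":
        return rng.exponential(mean, size=n)
    raise ValueError(f"unknown delay distribution: {kind}")
```

A parameter vector can then be drawn with `sample_unit_ball(1, d, rng)[0]`, and one delay per round with `sample_delays(T, mean, kind, rng)`.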
5 Discussion
In this work, we studied the impact of delayed feedback on algorithms for generalised linear bandits. Under Assumption 3, we were able to design an optimistic algorithm whose regret increases by an additive term involving the expected delay. This theoretical result is a significant improvement over existing work in the setting, whose algorithms suffer from a multiplicative penalty under stronger assumptions on the delay distribution. Obtaining improved theoretical results required the introduction of novel arguments for bounding leading-order terms. These techniques might find use in other bandit problems with complex feedback structures.
Further, our result nearly recovers the penalty from multi-armed bandits, where is the number of available actions, despite the additional difficulties of our setting. We believe the extra factor of that impacts the delay terms in our bounds might be necessary because the generalised linear bandit setting is intrinsically more challenging than the multi-armed bandit setting (Book). However, this is only a conjecture, and proving lower bounds for delayed feedback remains an open problem, even in multi-armed bandits.
Another open question relates to whether we can relax the subexponential assumption even further, perhaps only requiring that the delays have a finite expected value. However, Lemma 3 suggests that this might be challenging, as the delays unavoidably appear in our regret analysis through this lemma. Relaxing this assumption on the delays may require adjustments to our theoretical techniques or different algorithms.
Finally, we expect that one can obtain similar results for a Thompson Sampling version of our algorithm. Combining techniques found in Russo2014 with those found in this paper will likely give similar guarantees for the Bayesian regret.
References
Appendix A Confidence Sets
Here, we show that the confidence sets are valid under delayed feedback. To that end, we define the following $\sigma$-algebra:
(12) 
Consequently, is measurable. Further, is measurable. For notational purposes, we find it useful to define:
as well as the second derivative of the negative log likelihood:
Now, is the vector satisfying the following equality:
(13) 
See Lemma 1.
Proof.
Further, Lemma 8 of Appendix A.1 reveals that the last term in the above is a non-negative supermartingale. Therefore, we are able to use known methods for bounding self-normalised vector-valued martingales (Pena2004; AbbasiYadkori2011). Let be a stopping time with respect to the filtration. Applying Lemma 9 of AbbasiYadkori2011 to the stopped martingale gives:
A.1 Supporting Lemmas
Proving Lemma 1 requires several supporting lemmas. Firstly, the confidence sets are in terms of and . Conversely, Equation (13) reveals that our estimation procedure involves and . The following lemma allows us to relate these two quantities to one another.
Lemma 6.
Let and be arbitrary vectors, and . Then, the following inequality holds:
Proof.
Similarly to Filippi2010, we apply the mean value theorem to the terms inside the norm on the right-hand side of the above, which allows us to relate them to the original vectors. Expanding and reveals that:
(15) 
where the second equality follows from the mean value theorem for some . Rewriting the Hessian for some and recalling that reveals that:
(16) 
From Equation (16), we immediately have that . Combining Equation (15) with the partial ordering of Equation (16) gives:
()  
()  
(Equation (15))  
()  
() 
Therefore, using the homogeneity property of norms on the first and last terms of the above reveals that:
Bringing all ’s to the left-hand side gives the stated result. ∎
Lemma 7.
Let be the unknown parameter of the environment and be the solution to (13). Further, define and
Then,