The application of stochastic bandit optimization algorithms to safety-critical systems has received significant attention in the past few years. In such settings, the learner repeatedly interacts with a system whose reward function and operational constraints are uncertain. Yet, in spite of this uncertainty, the learner needs to ensure that her actions do not violate the operational constraints at any round of the learning process. As such, especially in the earlier rounds, she must choose actions with caution, while at the same time learning more about the set of possible safe actions. Notably, the estimated safe set at each stage of the algorithm might not initially include the optimal action. This uncertainty about safety and the resulting conservative behavior mean that the learner can incur additional regret in such environments.
In this paper, we focus on a special class of stochastic bandit optimization problems where the reward is a linear function of the actions. This class of problems, referred to as linear stochastic bandits (LB), generalizes multi-armed bandit (MAB) problems to the setting where each action is associated with a feature vector, and the expected reward of playing each action is equal to the inner product of its feature vector and an unknown parameter vector $\theta^*$. There exist several variants of linear stochastic bandits that study finite [auer2002finite] or infinite [Dani08stochasticlinear, Tsitsiklis, abbasi2011improved] action sets, as well as the case where the set of feature vectors can change over time [pmlr-v15-chu11a, li2010contextual].
Two efficient approaches have been developed for LB: linear UCB (LUCB) and linear Thompson Sampling (LTS). For LUCB, [abbasi2011improved] provides a regret bound of order $\tilde{\mathcal{O}}(d\sqrt{T})$. From a Bayesian point of view, [agrawal2012analysis] show that LTS achieves an expected regret of order $\tilde{\mathcal{O}}(d\sqrt{T})$, while [agrawal2013thompson, abeille2017linear] adopt a frequentist view and show a regret of order $\tilde{\mathcal{O}}(d^{3/2}\sqrt{T})$ for LTS.
Here we provide an LTS algorithm that respects linear safety constraints and study its performance. Let us formally define our problem setting before summarizing our contributions.
1.1 Safe Stochastic Linear Bandit Model
The setting. The learner is given a convex and compact set of actions $\mathcal{D}_0 \subset \mathbb{R}^d$. At each round $t$, playing an action $x_t \in \mathcal{D}_0$ results in observing a reward
$$r_t = x_t^\top \theta^* + \xi_t,$$
where $\theta^* \in \mathbb{R}^d$ is a fixed and unknown parameter and $\xi_t$ is a zero-mean additive noise.
Safety constraint. We further assume that the environment is subject to a linear side constraint (referred to as the safety constraint) which needs to be satisfied at every round $t$:
$$x_t^\top \mu^* \le C. \qquad (2)$$
Here, $\mu^* \in \mathbb{R}^d$ is a fixed and unknown parameter and the positive constant $C$ is known to the learner. The set of “safe actions” that satisfy the safety constraint (2), denoted by $\mathcal{D}^s(\mu^*)$, is defined as
$$\mathcal{D}^s(\mu^*) = \{x \in \mathcal{D}_0 : x^\top \mu^* \le C\}. \qquad (3)$$
The set of safe actions depends on the parameter $\mu^*$, which is unknown to the learner. Hence, the learner, who is not able to identify the safe action set, must play conservatively in order not to violate the safety constraint at any round (at least with high probability). When playing an action, the learner does not observe $x_t^\top \mu^*$ directly, but rather a noisy measurement of it:
$$y_t = x_t^\top \mu^* + \zeta_t,$$
where $\zeta_t$ is a zero-mean additive noise. For ease of notation, we refer to the safe action set by $\mathcal{D}^s$ and drop the dependence on $\mu^*$.
Regret. The cumulative pseudo-regret of the learner up to round $T$ is defined as $R(T) = \sum_{t=1}^{T} \big(x^{*\top}\theta^* - x_t^\top\theta^*\big)$, where $x^*$ is the optimal safe action that maximizes the expected reward, i.e., $x^* = \arg\max_{x \in \mathcal{D}^s} x^\top \theta^*$.
Goal. The learner’s objective is to control the growth of the pseudo-regret. Moreover, we require that the chosen actions $x_t$ for all $t$ be safe with high probability, i.e., belong to the safe set $\mathcal{D}^s$ in (3). For simplicity, we use regret to refer to the pseudo-regret $R(T)$.
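As an illustration, the pseudo-regret can be computed directly when the action set is a finite grid and the true parameters are available for evaluation. This is our sketch, not code from the paper; the function name and the toy parameters are illustrative.

```python
import numpy as np

def pseudo_regret(theta, mu, C, actions, played):
    """Cumulative pseudo-regret R(T) over a finite candidate action set.

    actions: (n, d) candidate actions; played: (T, d) actions the learner chose.
    The true parameters theta, mu are known here only for evaluation purposes.
    """
    safe = actions[actions @ mu <= C]   # true (unknown-to-learner) safe set
    opt = (safe @ theta).max()          # reward of the optimal safe action x*
    return float((opt - played @ theta).sum())

# Toy instance: [1, 0] is the optimal safe action (reward 1.0).
theta = np.array([1.0, 0.0])
mu = np.array([0.0, 1.0])
C = 0.5
actions = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
played = np.array([[0.5, 0.5], [1.0, 0.0]])
R = pseudo_regret(theta, mu, C, actions, played)  # (1 - 0.5) + (1 - 1) = 0.5
```

Note that the evaluator filters by the true safe set before taking the maximum, exactly as in the definition of $x^*$ above.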
1.2 Major Contributions
We provide the first safe LTS (Safe-LTS) algorithm with provable regret guarantees for the linear bandit problem with linear safety constraints.
Our regret analysis shows that Safe-LTS achieves the same order of cumulative regret, $\tilde{\mathcal{O}}(d^{3/2}\sqrt{T})$, as the original LTS (without safety constraints) analyzed in [abeille2017linear]. Hence, the dependence of the regret of Safe-LTS on the time horizon $T$ cannot be improved modulo logarithmic factors (see lower bounds for LB in [Dani08stochasticlinear, Tsitsiklis]).
We compare Safe-LTS to Safe-LUCB of [amani2019linear], a recent UCB-based safe algorithm with provable regret bounds for linear stochastic bandits with linear stage-wise safety constraints. Compared to the latter, we show that our algorithm has: better regret in the worst case ($\tilde{\mathcal{O}}(\sqrt{T})$ vs. $\tilde{\mathcal{O}}(T^{2/3})$), fewer parameters to tune, and superior empirical performance.
1.3 Related Work
Safety - A diverse body of related work on stochastic optimization and control has considered the effect of safety constraints that need to be met during the run of the algorithm. For example, [Krause, sui2018stagewise] study the problem of nonlinear bandit optimization with nonlinear constraints through a UCB approach using Gaussian processes (GPs) as non-parametric models. The algorithms in [Krause, sui2018stagewise] come with convergence guarantees but no regret bounds. Such approaches for safety-constrained optimization using GPs have shown great promise in robotics applications [ostafew2016robust, 7039601]. Without the GP assumption, [kamgar] proposes and analyzes a safe variant of the Frank-Wolfe algorithm for a smooth optimization problem with an unknown convex objective function and unknown linear constraints. Their main theoretical result provides convergence guarantees for the proposed algorithm. A large body of work has considered safety in the context of model-predictive control; see, e.g., [aswani2013provably, koller2018learning] and references therein. Focusing specifically on linear stochastic bandits, extensions of UCB-type algorithms that provide safety guarantees with provable regret bounds have been considered recently. For example, in [vanroy], safety refers to the requirement that the cumulative reward up to each round of the algorithm stay above a given percentage of the performance of a known baseline. Recently, [amani2019linear] considered the effect of safety constraints similar, but not identical, to ours on the regret of LUCB, and provided a problem-dependent upper bound on the regret. The algorithm’s regret depends on the location of the optimal action in the safe action set, increasing significantly in problem instances for which the safety constraint is active.
Thompson Sampling - Even though Thompson sampling (TS)-based algorithms [thompson1933likelihood] are computationally easier to implement than UCB-based algorithms and have shown great empirical performance, they were largely ignored by the academic community until a few years ago, when a series of papers (e.g., [russo, abeille2017linear, agrawal2012analysis, kaufmann2012thompson]) showed that TS achieves optimal performance in both frequentist and Bayesian settings. Most of the literature has focused on the analysis of the Bayesian regret of TS in general settings such as linear bandits or reinforcement learning (see, e.g., [osband2015bootstrapped]). More recently, [russo2016information, dong2018information, dong2019performance] provided an information-theoretic analysis of TS, where the key tool is the information ratio, which quantifies the trade-off between exploration and exploitation. Additionally, [gopalan2015thompson] provides regret guarantees for TS in the finite and infinite MDP settings. Another notable paper is [gopalan2014thompson], which studies the stochastic multi-armed bandit problem in complex action settings and provides a regret bound that scales logarithmically in time with improved constants. However, none of the aforementioned papers study the performance of TS for linear bandits with safety constraints. Our proof for Safe-LTS is inspired by the proof technique in [abeille2017linear].
2 Safe Linear Thompson Sampling
Our proposed algorithm is a safe variant of Linear Thompson Sampling (LTS). At any round $t$, given a regularized least-squares (RLS) estimate $\hat\theta_t$, the algorithm samples a perturbed parameter $\tilde\theta_t$ that is appropriately distributed to guarantee sufficient exploration. Then, for the sampled $\tilde\theta_t$, the algorithm chooses the optimal action while making sure that the safety constraint (2) holds. The presence of the safety constraint complicates the learner’s choice of actions. To ensure that actions remain safe at all rounds, the algorithm constructs a confidence region $\mathcal{E}_t$, which contains the unknown parameter $\mu^*$ with high probability. It then ensures that the chosen action satisfies the safety constraint for every parameter in $\mathcal{E}_t$. The summary is presented in Algorithm 1 and a detailed description follows in the rest of this section.
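To make the loop concrete, here is a minimal end-to-end sketch on a finite action grid. It is our illustration, not the paper's implementation: the confidence radius is a fixed constant `beta` rather than the round-dependent radius of Theorem 2.1, the perturbation is Gaussian, and on an empty estimated safe set the sketch falls back to the origin (assumed safe because the action set contains it).

```python
import numpy as np

def safe_lts(theta, mu, C, actions, T, beta=0.5, lam=1.0, noise=0.1, seed=0):
    """Simplified Safe-LTS loop over a finite set of candidate actions."""
    rng = np.random.default_rng(seed)
    d = actions.shape[1]
    V = lam * np.eye(d)                 # regularized Gram matrix
    b_r = np.zeros(d)                   # sum of r_s * x_s (reward channel)
    b_y = np.zeros(d)                   # sum of y_s * x_s (constraint channel)
    played = []
    for t in range(T):
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b_r         # RLS estimate of theta*
        mu_hat = V_inv @ b_y            # RLS estimate of mu*
        # Conservative safe set: constraint must hold for the worst parameter
        # in an ellipsoid of radius beta around mu_hat.
        width = np.sqrt(np.einsum('ij,jk,ik->i', actions, V_inv, actions))
        safe = actions[actions @ mu_hat + beta * width <= C]
        if len(safe) == 0:
            safe = np.zeros((1, d))     # fall back to the origin (known safe)
        # TS step: perturb the reward estimate, then act greedily on the sample.
        w, Q = np.linalg.eigh(V)
        eta = rng.standard_normal(d)
        theta_tilde = theta_hat + beta * (Q @ ((1.0 / np.sqrt(w)) * (Q.T @ eta)))
        x = safe[np.argmax(safe @ theta_tilde)]
        r = x @ theta + noise * rng.standard_normal()   # noisy reward
        y = x @ mu + noise * rng.standard_normal()      # noisy constraint value
        V = V + np.outer(x, x)
        b_r = b_r + r * x
        b_y = b_y + y * x
        played.append(x)
    return np.array(played)
```

The two confidence widths (for reward exploration and for constraint caution) are coupled through the same Gram matrix, which is the structural point of the algorithm.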
2.1 Model assumptions
Notation. For a positive integer $n$, $[n]$ denotes the set $\{1, 2, \ldots, n\}$. The Euclidean norm of a vector $x$ is denoted by $\|x\|$. The weighted 2-norm of the vector $x$ with respect to a positive semidefinite matrix $A$ is defined by $\|x\|_A = \sqrt{x^\top A x}$. We also use the standard $\tilde{\mathcal{O}}$ notation that ignores poly-logarithmic factors.
Let $\mathcal{F}_t$ denote the filtration that represents the accumulated information up to round $t$. In the following, we introduce standard assumptions on the problem.
For all $t$, $\xi_t$ and $\zeta_t$ are conditionally zero-mean $\sigma$-sub-Gaussian noises with a constant $\sigma > 0$, i.e., $\mathbb{E}[\xi_t \mid \mathcal{F}_{t-1}] = 0$ and $\mathbb{E}[e^{\lambda \xi_t} \mid \mathcal{F}_{t-1}] \le e^{\lambda^2 \sigma^2 / 2}$ for all $\lambda \in \mathbb{R}$ (and similarly for $\zeta_t$).
There exists a positive constant $S$ such that $\|\theta^*\| \le S$ and $\|\mu^*\| \le S$.
The action set $\mathcal{D}_0$ is a compact and convex subset of $\mathbb{R}^d$ that contains the origin. We assume $\|x\| \le L$ for all $x \in \mathcal{D}_0$ and, for simplicity, take $L = 1$.
It is straightforward to generalize our results to the case where the noises $\xi_t$ and $\zeta_t$ are sub-Gaussian with different constants and the unknown parameters $\theta^*$ and $\mu^*$ have different upper bounds. In this paper, for brevity of notation, we assume these are equal.
2.2 Algorithm description
Let $(x_1, \ldots, x_{t-1})$ be the sequence of actions and $(r_1, \ldots, r_{t-1})$ and $(y_1, \ldots, y_{t-1})$ be the corresponding rewards and measurements, respectively. Then, $\theta^*$ and $\mu^*$ can be estimated by $\ell_2$-regularized least squares. For any $\lambda > 0$, the Gram matrix and the RLS estimates $\hat\theta_t$ of $\theta^*$ and $\hat\mu_t$ of $\mu^*$ are as follows:
$$V_t = \lambda I + \sum_{s=1}^{t-1} x_s x_s^\top, \qquad \hat\theta_t = V_t^{-1}\sum_{s=1}^{t-1} r_s x_s, \qquad \hat\mu_t = V_t^{-1}\sum_{s=1}^{t-1} y_s x_s. \qquad (5)$$
Based on $\hat\theta_t$ and $\hat\mu_t$, Algorithm 1 constructs two confidence regions $\mathcal{C}_t$ and $\mathcal{E}_t$ as follows:
$$\mathcal{C}_t = \{\theta \in \mathbb{R}^d : \|\theta - \hat\theta_t\|_{V_t} \le \beta_t\}, \qquad \mathcal{E}_t = \{\mu \in \mathbb{R}^d : \|\mu - \hat\mu_t\|_{V_t} \le \beta_t\}.$$
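In code, the two RLS estimates share the same Gram matrix and differ only in the targets they regress on. A small self-contained sketch (ours; `lam` plays the role of the regularization parameter):

```python
import numpy as np

def rls_estimates(X, r, y, lam=1.0):
    """Regularized least-squares estimates for the reward and constraint parameters.

    X: (t, d) matrix of past actions; r: (t,) rewards; y: (t,) constraint measurements.
    Returns the Gram matrix V and the two RLS estimates.
    """
    d = X.shape[1]
    V = lam * np.eye(d) + X.T @ X            # Gram matrix V_t
    theta_hat = np.linalg.solve(V, X.T @ r)  # estimate of the reward parameter
    mu_hat = np.linalg.solve(V, X.T @ y)     # estimate of the constraint parameter
    return V, theta_hat, mu_hat
```

With noiseless targets and a tiny regularizer, both estimates recover the true parameters up to a negligible bias of order `lam`.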
The ellipsoid radius $\beta_t$ is chosen according to Theorem 2.1 in [abbasi2011improved] in order to guarantee that $\theta^* \in \mathcal{C}_t$ and $\mu^* \in \mathcal{E}_t$ for all $t$ with high probability.
The algorithm proceeds in two steps which we describe next.
2.2.1 Sampling from the posterior: the safe setting
The LTS algorithm is a randomized algorithm built on the RLS estimate of the unknown parameter $\theta^*$. At any round $t$, the LTS algorithm samples the parameter from the posterior as
$$\tilde\theta_t = \hat\theta_t + \beta_t V_t^{-1/2}\eta_t, \qquad \eta_t \sim \mathcal{D}^{TS}.$$
In the standard (non-safe) LTS setting, [agrawal2012analysis] define TS as a Bayesian algorithm. As such, they consider a Gaussian prior over the unknown parameter $\theta^*$, which is updated according to reward observations. At each round, they play an optimal action corresponding to a random sample drawn from the posterior, and provide regret guarantees for this Bayesian approach. More recently, [agrawal2013thompson, abeille2017linear] showed that TS can be defined as a randomized algorithm over the RLS estimate of the unknown parameter $\theta^*$. They show that the same guarantees hold as long as the parameter is sampled from a distribution $\mathcal{D}^{TS}$ which satisfies certain concentration and anti-concentration properties, a family that includes more distributions than the Gaussian prior.
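The sampling step itself is a one-liner once the inverse square root of the Gram matrix is available. The sketch below (our own interface) uses a Gaussian perturbation, one distribution satisfying the concentration/anti-concentration requirements in the unconstrained setting of [abeille2017linear]:

```python
import numpy as np

def sample_perturbed(theta_hat, V, beta, rng):
    """Draw a TS sample theta_hat + beta * V^{-1/2} * eta with Gaussian eta."""
    w, Q = np.linalg.eigh(V)   # V is symmetric positive definite
    eta = rng.standard_normal(theta_hat.shape[0])
    # Apply V^{-1/2} via the eigendecomposition: Q diag(w^{-1/2}) Q^T
    return theta_hat + beta * (Q @ ((1.0 / np.sqrt(w)) * (Q.T @ eta)))
```

The perturbation shrinks in directions that have been explored often (large eigenvalues of $V$), which is exactly the shape of the RLS confidence ellipsoid.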
Unfortunately, the presence of safety constraints leads to a more challenging problem, in which the distributional assumptions set forth in [abeille2017linear] do not provide the exploration needed to expand the safe set fast enough. As such, in order to obtain our regret guarantees for Algorithm 1, we need to appropriately modify the distributional properties of [abeille2017linear]. We state the new properties next and discuss them immediately after.
$\mathcal{D}^{TS}$ is a multivariate distribution on $\mathbb{R}^d$, absolutely continuous with respect to the Lebesgue measure, which satisfies the following properties:
(anti-concentration) there exists a strictly positive probability $p$ such that for any $u \in \mathbb{R}^d$ with $\|u\| = 1$,
(concentration) there exist positive constants $c$, $c'$ such that,
As observed in [abeille2017linear], these properties ensure that the algorithm explores far enough from $\hat\theta_t$ (anti-concentration), but not too much (concentration). Compared to [abeille2017linear], we have added an extra term in (11) and (12). While inconspicuous, this term is critical for guaranteeing small regret in the Safe-LTS setting; see Section 3.1 for details. We note that, in the slightly more general setting where we do not make Assumption 3, this constant needs to be adjusted accordingly (see Section D in the Appendix for details).
2.2.2 Choosing the safe action
Since the learner does not know the safe action set $\mathcal{D}^s$, she acts conservatively in order to satisfy the safety constraint (2). This is achieved by first building a confidence region $\mathcal{E}_t$ around the RLS estimate $\hat\mu_t$ at each round $t$. Based on $\mathcal{E}_t$, the algorithm creates the so-called safe decision set, denoted $\mathcal{D}_t^s$:
$$\mathcal{D}_t^s = \{x \in \mathcal{D}_0 : x^\top v \le C \ \text{ for all } v \in \mathcal{E}_t\} = \{x \in \mathcal{D}_0 : x^\top \hat\mu_t + \beta_t \|x\|_{V_t^{-1}} \le C\},$$
and chooses from this set the safe action that maximizes the expected reward given the sampled parameter $\tilde\theta_t$. Note that $\mathcal{D}_t^s$ contains actions which are safe with respect to all the parameters in the confidence region $\mathcal{E}_t$, not only $\mu^*$. Therefore, this safe decision set is an inner approximation of the safe action set $\mathcal{D}^s$, which may lead to extra regret for Safe-LTS that is otherwise absent from LTS.
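For an ellipsoidal confidence region, the "safe for every parameter in the region" requirement reduces to a single closed-form test per action, since the worst case of $x^\top \mu$ over $\{\mu : \|\mu - \hat\mu_t\|_{V_t} \le \beta_t\}$ equals $x^\top \hat\mu_t + \beta_t \|x\|_{V_t^{-1}}$. A sketch over a finite candidate set (our illustration, not the paper's code):

```python
import numpy as np

def safe_decision_set(actions, mu_hat, V, beta, C):
    """Keep only actions safe for every mu in the confidence ellipsoid.

    actions: (n, d) candidates; mu_hat: RLS estimate; V: Gram matrix;
    beta: ellipsoid radius; C: known safety threshold.
    """
    V_inv = np.linalg.inv(V)
    # worst-case constraint value of each action over the ellipsoid
    worst = actions @ mu_hat + beta * np.sqrt(
        np.einsum('ij,jk,ik->i', actions, V_inv, actions))
    return actions[worst <= C]
```

As `beta` grows, the pessimistic margin grows and the estimated safe set shrinks toward the origin, which is the source of the extra conservatism discussed above.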
3 Regret Analysis
In this section, we present our main result, a tight regret bound for Safe-LTS, and discuss key proof ideas.
For any $\theta$, denote the optimal value of playing the optimal action from the safe action set $\mathcal{D}^s$ in (3) by
$$J(\theta) = \max_{x \in \mathcal{D}^s} x^\top \theta.$$
At each round $t$, for the sampled parameter $\tilde\theta_t$, denote the optimal safe action that the algorithm plays and its optimal value by
$$x_t = \arg\max_{x \in \mathcal{D}_t^s} x^\top \tilde\theta_t, \qquad J_t^s(\tilde\theta_t) = \max_{x \in \mathcal{D}_t^s} x^\top \tilde\theta_t.$$
Since $\theta^*$ and $\mu^*$ are unknown, the learner does not know the safe action set $\mathcal{D}^s$. Therefore, in order to satisfy the safety constraint (2), the learner chooses her actions from $\mathcal{D}_t^s$, which is an inner approximation of $\mathcal{D}^s$. Intuitively, the better this approximation, the more likely that safe linear Thompson sampling leads to a good regret, ideally the same regret as that of LTS in the original linear bandit setting. To see the changes due to the safety constraint, let us consider the following standard decomposition of the instantaneous regret:
$$r_t = \underbrace{x^{*\top}\theta^* - x_t^\top\tilde\theta_t}_{\text{Term I}} + \underbrace{x_t^\top\tilde\theta_t - x_t^\top\theta^*}_{\text{Term II}}.$$
Despite the addition of safety constraints, controlling Term II remains straightforward, closely following previous results (e.g., [abbasi2011improved]); see Appendix C.2 for more details. As such, here we focus our attention on bounding Term I. To see how the safety constraints affect the proofs, let us review the treatment of Term I in the original setting. Specifically, in the absence of safety constraints, the learner optimizes over the full action set $\mathcal{D}_0$, as all actions are always “safe” to play. On the one hand, if an LUCB algorithm is adopted, e.g., [abbasi2011improved], Term I is non-positive with high probability at all time steps since, by the algorithm’s construction, the chosen action is optimistic. As such, Term I does not contribute to the growth of the regret. On the other hand, for the LTS algorithms studied in [agrawal2013thompson, abeille2017linear], this term can be positive. As such, the authors in [abeille2017linear] use the fact that the sampled parameter is optimistic with constant probability, and they further show how to bound the resulting terms. With these, they obtain a regret bound of order $\tilde{\mathcal{O}}(d^{3/2}\sqrt{T})$. Unfortunately, this approach cannot be adopted in the presence of safety constraints, as (18) no longer holds: $J_t^s(\theta)$ is in general less than or equal to $J(\theta)$, since $\mathcal{D}_t^s \subseteq \mathcal{D}^s$. Our main contribution towards establishing regret guarantees is upper bounding Term I in the presence of the safety constraint. Specifically, we obtain the same regret bound as that of [abeille2017linear] (i.e., $\tilde{\mathcal{O}}(d^{3/2}\sqrt{T})$) in spite of the additional safety constraints imposed on the problem.
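The distinction between the two value functions can be summarized compactly (a restatement in the notation above, with $J$ the value over the true safe set and $J_t^s$ the value over the conservative set):

```latex
J(\theta) = \max_{x \in \mathcal{D}^s} x^\top \theta,
\qquad
J_t^s(\theta) = \max_{x \in \mathcal{D}_t^s} x^\top \theta,
\qquad
\mathcal{D}_t^s \subseteq \mathcal{D}^s
\;\Longrightarrow\;
J_t^s(\theta) \le J(\theta) \quad \text{for all } \theta .
```

It is this one-sided inequality that invalidates the unconstrained argument: an optimistic sample for $J$ need not be optimistic for $J_t^s$.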
3.1 Sketch of the proof for bounding Term I
We provide the sketch of the proof for bounding Term I. The idea is inspired by [abeille2017linear]: we wish to show that TS has a constant probability of being optimistic, but now in the presence of safety constraints. First, we define the set of the optimistic parameters as
To see why this set is relevant, note that if $\tilde\theta_t$ belongs to it, then Term I at time $t$ is non-positive. Additionally, we define the ellipsoid such that
It is not hard to see, by combining Theorem 2.1 and the concentration property, that the sampled parameter belongs to this ellipsoid with high probability. For the purpose of this proof sketch, we assume that at each round $t$, the safe decision set contains the safe actions previously played by the algorithm, i.e., $x_s \in \mathcal{D}_t^s$ for all $s < t$. For the formal proof in Appendix C.1, we do not need such an assumption.
Let $t$ be a round at which the sampled parameter is optimistic, i.e., $\tilde\theta_t$ belongs to the set defined above. Then, we have
The last inequality comes from the assumption that at each round $t$, the safe decision set contains the previously played safe actions. To continue from (23), we use Cauchy–Schwarz and obtain
Since the Gram matrix is non-decreasing over time (i.e., $V_t \preceq V_{t+1}$, so that $\|x\|_{V_{t+1}^{-1}} \le \|x\|_{V_t^{-1}}$), we can write
Therefore, we can upper bound Term I in terms of the $V_t^{-1}$-norm of the optimal safe action at time $t$. Bounding this term is standard, based on the analysis provided in [abbasi2011improved] (see Proposition A.1 in the Appendix).
It only remains to show that TS samples a parameter belonging to the optimistic set with some constant probability. The next lemma informally states this claim (see the formal statement of the lemma and its proof in Section D of the Appendix).
(informal) Let be the set of optimistic parameters, with , then , .
Simply stated, we need to show that
where $p$ denotes the strictly positive probability defined in (11). In order to do so, we introduce an enlarged confidence region centered at the RLS estimate as
and the shrunk safe decision set as
Define such that belongs to the shrunk safe decision set , i.e.,
In particular, this is satisfied if we choose as follows (see Appendix D for more details)
Then, by optimality of and by feasibility of :
Now, recall that . Thus, we have
Further using the fact that we need
Let us define a vector such that (note that, by definition, ). Then, a direct application of the Cauchy–Schwarz inequality and (7) gives:
4 Numerical Results
In this section, we present details of our numerical experiments on synthetic data. First, we show how the presence of safety constraints affects the performance of LTS in terms of regret. Next, we evaluate Safe-LTS by comparing it against Safe-LUCB presented in [amani2019linear]. In all the implementations, we used the same fixed choices of the confidence and regularization parameters. We considered a time-independent decision set in $\mathbb{R}^d$. The reward and constraint parameters $\theta^*$ and $\mu^*$ are drawn at random; the constraint threshold $C$ is drawn uniformly at random.
In view of the discussion in [Dani08stochasticlinear] regarding computational issues of LUCB algorithms whose confidence regions are specified with 2-norms, we implement a modified version of Safe-LUCB which uses 1-norms instead (see also [amani2019linear]). This highlights a well-known benefit of TS-based algorithms: they are easier to implement and more computationally efficient than UCB-based algorithms, since action selection in the latter involves solving optimization problems with bilinear objective functions, whereas the former leads to linear objectives by first sampling the unknown parameter vector (e.g., [abeille2017linear]).
4.1 The effect of safety constraints on LTS
We compare the performance of our algorithm to an oracle that has access to the set of safe actions and hence simply applies the LTS algorithm of [abeille2017linear]. This experiment highlights the additional contribution of the safety constraint to the growth of the regret. Specifically, Fig. 1 compares the average cumulative regret of Safe-LTS with that of the standard LTS algorithm with oracle access to the safe set, over 20 problem realizations. As shown, even though Safe-LTS requires that the chosen actions belong to the more restricted set $\mathcal{D}_t^s$ (i.e., an inner approximation of the unknown safe set $\mathcal{D}^s$), it achieves a regret of the same order as the oracle.
4.2 Comparison with a safe version of LUCB
We compare the performance of our algorithm with two safe versions of LUCB: 1) Naive Safe-LUCB: an extension of the LUCB algorithm of [Dani08stochasticlinear, abbasi2011improved] that respects the safety constraints by choosing actions from the estimated safe set $\mathcal{D}_t^s$ instead. 2) Safe-LUCB: inspired by a recent paper of [amani2019linear] on safe LUCB in a similar but non-identical setting, we consider a modification of Naive Safe-LUCB that proceeds in two phases: starting with a so-called pure exploration phase, the algorithm randomly chooses actions from a seed safe set for a given period of time before switching to the same decision rule as Naive Safe-LUCB. The additional randomized exploration phase allows the algorithm to first learn a good representation of the safe action set before applying the LUCB action selection rule.
(See the Appendix for plots with standard deviations.) The reader can observe that Naive Safe-LUCB leads to very poor (almost linear) regret. For Safe-LUCB, a general regret bound of $\tilde{\mathcal{O}}(T^{2/3})$ was provided by [amani2019linear]. As our numerical experiments suggest, this is likely not a mere artifact of the proof in [amani2019linear]. We observe that the LUCB action selection rule alone does not provide sufficient exploration towards safe set expansion, thus requiring the algorithm to 1) have access to a seed safe set, and 2) start with a pure exploration phase in order to guarantee safe set expansion. As pointed out in [amani2019linear], this lack of exploration is especially costly for Safe-LUCB in problem instances where the safety constraint is active, i.e., $x^{*\top}\mu^* = C$. We highlight this by comparing the regret of Safe-LUCB and Safe-LTS and their corresponding estimated safe sets at different rounds for such a problem instance in Figure 3. The left and middle figures depict the safe sets at different rounds for a specific choice of the parameters. Comparing the left and middle plots demonstrates that the inherent randomness of TS leads to a natural exploration ability that is much faster at expanding the estimated safe set towards $\mathcal{D}^s$ than Safe-LUCB. This results in undesirable regret performance for Safe-LUCB, especially in instances where the safety constraint is active.
4.3 Sampling from a dynamic noise distribution
In order for Safe-LTS to be optimistic, i.e., for the sampled parameter to belong to the optimistic set with a constant probability, our theoretical results require the anti-concentration property assumed in (11). This requires that the noise be sampled from a distribution that satisfies:
The extra factor compared to the results of [abeille2017linear] is needed in our theoretical results due to the presence of the safety constraint, which restricts the choice of actions to the set $\mathcal{D}_t^s$, the inner approximation of the safe set $\mathcal{D}^s$ to which $x^*$ belongs.
However, here we highlight the performance of a heuristic modification of the algorithm in which the TS distribution does not satisfy the above anti-concentration property at all rounds. We empirically observe that if the TS distribution satisfies (28), the TS algorithm explores more than it needs to obtain a good approximation of the unknown parameter, which can increase the regret. Instead, Fig. 1 shows that if the TS distribution satisfies the following dynamic property
where the inflation factor is a linearly decreasing function of time $t$, the TS algorithm attains a smaller regret than with the static choice. It is an intriguing direction to develop a theoretical understanding of this heuristic.
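As a sketch of the heuristic, one simple choice is to multiply the TS perturbation by a factor that decays linearly from an initial level $s_0$ down to 1 over the horizon. The schedule below is our illustrative stand-in; the exact schedule used, and its analysis, are left open by the discussion above.

```python
def noise_scale(t, T, s0=3.0):
    """Linearly decreasing inflation applied to the TS perturbation at round t.

    Starts at s0 at t = 0 and decays to 1.0 at t = T (never dropping below 1.0),
    so early rounds explore aggressively and late rounds match the static choice.
    """
    return max(1.0, s0 - (s0 - 1.0) * t / T)

schedule = [noise_scale(t, 100) for t in range(101)]
```

Plugging `noise_scale(t, T)` in as a multiplier on the sampled perturbation reproduces the qualitative behavior described above: extra exploration early, standard behavior late.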
In this paper, we study a linear stochastic bandit (LB) problem in which the environment is subject to unknown linear safety constraints that need to be satisfied at each round. As such, the learner must make necessary modifications to ensure that the chosen actions belong to the unknown safe set. We propose Safe-LTS which, to the best of our knowledge, is the first safe linear TS algorithm with provable regret guarantees for this problem. We show that appropriate modifications of the distributional properties of [abeille2017linear] allow us to design an efficient algorithm for the more challenging LB problem with linear safety constraints. Moreover, we show that Safe-LTS achieves the same frequentist regret of order $\tilde{\mathcal{O}}(d^{3/2}\sqrt{T})$ as the original LTS studied in [abeille2017linear]. We also compare Safe-LTS with Safe-LUCB of [amani2019linear], a UCB-based safe algorithm for LB with linear safety constraints. We show that our algorithm has: better regret in the worst case ($\tilde{\mathcal{O}}(\sqrt{T})$ vs. $\tilde{\mathcal{O}}(T^{2/3})$), fewer parameters to tune, and superior empirical performance. Interesting directions for future work include gaining a theoretical understanding of the regret of the algorithm when the TS distribution satisfies the dynamic property in (29), which empirically leads to regret of smaller order.
Appendix A Useful Results
The following result is standard and plays an important role in most proofs for linear bandits problems.
([abbasi2011improved]) Let $\lambda > 0$. For any arbitrary sequence of actions $(x_1, \ldots, x_t)$, let $V_t$ be the corresponding Gram matrix (5). Then,
$$\sum_{s=1}^{t} \min\!\left(1, \|x_s\|^2_{V_s^{-1}}\right) \le 2\log\frac{\det V_{t+1}}{\det(\lambda I)}.$$
In particular, we have
$$\log\frac{\det V_{t+1}}{\det(\lambda I)} \le d\,\log\!\left(1 + \frac{t L^2}{\lambda d}\right).$$
Also, we recall Azuma’s concentration inequality for super-martingales.
(Azuma’s inequality [boucheron2013concentration]) If a super-martingale $(M_t)_{t \ge 0}$, corresponding to a filtration $\mathcal{F}_t$, satisfies $|M_t - M_{t-1}| \le c_t$ for some positive constants $c_t$ for all $t$, then for any $a > 0$,
$$\mathbb{P}(M_T - M_0 \ge a) \le \exp\!\left(-\frac{a^2}{2\sum_{t=1}^{T} c_t^2}\right).$$
Appendix B Least-Square Confidence Regions
We start by constructing the following confidence regions for the RLS-estimates.
Let , , and . We define the following events:
is the event that the RLS estimate $\hat\theta_t$ concentrates around $\theta^*$ for all steps $t$, i.e., $\|\hat\theta_t - \theta^*\|_{V_t} \le \beta_t$ for all $t$;
is the event that the RLS estimate $\hat\mu_t$ concentrates around $\mu^*$, i.e., $\|\hat\mu_t - \mu^*\|_{V_t} \le \beta_t$ for all $t$. Moreover, define such that .
is the event that the sampled parameter $\tilde\theta_t$ concentrates around $\hat\theta_t$ for all steps $t$, i.e., $\|\tilde\theta_t - \hat\theta_t\|_{V_t} \le \gamma_t$ for all $t$, with $\gamma_t$ as in [abeille2017linear, Lemma 1]. Let be such that .
The proof is similar to the one in [abeille2017linear, Lemma 1] and is omitted for brevity. ∎
Appendix C Proof of Theorem 3.1
We use the following decomposition for bounding the regret:
C.1 Bounding Term I.
The proof steps closely follow [abeille2017linear]. Our exposition is somewhat simplified by avoiding relating the regret bounds to the gradient of $J(\theta)$.
On event , belongs to which leads to
Here and onwards, we denote by $\mathbb{1}\{E\}$ the indicator function applied to an event $E$. We can bound (34) by the expectation over any random choice of the sampled parameter that leads to
where and . Then, using Cauchy–Schwarz and the definition of in (21)
This property shows that the regret is upper bounded by