Double-Linear Thompson Sampling for Context-Attentive Bandits

10/15/2020 ∙ by Djallel Bouneffouf, et al. ∙ IBM ∙ Orange ∙ Montréal Institute of Learning Algorithms

In this paper, we analyze and extend an online learning framework known as Context-Attentive Bandit, motivated by various practical applications, from medical diagnosis to dialog systems, where due to observation costs only a small subset of a potentially large number of context variables can be observed at each iteration; however, the agent has the freedom to choose which variables to observe. We derive a novel algorithm, called Context-Attentive Thompson Sampling (CATS), which builds upon the Linear Thompson Sampling approach, adapting it to the Context-Attentive Bandit setting. We provide a theoretical regret analysis and an extensive empirical evaluation demonstrating the advantages of the proposed approach over several baseline methods on a variety of real-life datasets.




1 Introduction

The contextual bandit problem is a variant of the extensively studied multi-armed bandit problem Lai and Robbins (1985); Gittins (1979); Auer et al. (2002); Lin et al. (2018), where at each iteration, the agent observes an $N$-dimensional context (feature vector) and uses it, along with the rewards of the arms played in the past, to decide which arm to play Langford and Zhang (2008); Agarwal et al. (2009); Auer et al. (2003); Bouneffouf and Féraud (2016). The objective of the agent is to learn the relationship between the context and reward, in order to find the best arm-selection policy for maximizing cumulative reward over time.

Recently, a promising variant of contextual bandits, called a Context Attentive Bandit (CAB), was proposed in Bouneffouf et al. (2017), where no context is given by default, but the agent can request to observe (to focus its "attention" on) a limited number of context variables at each iteration. We propose here an extension of this problem setting: a small subset of $V$ context variables is revealed at each iteration (i.e., a partially observable context), followed by the agent's choice of $U$ additional features, where $V$, $U$, and the set of immediately observed features are fixed at all iterations. The agent must learn to select both the best additional features and, subsequently, the best arm to play, given the resulting observed features. (The original Context Attentive Bandit corresponds to $V = 0$.)

The proposed setting is motivated by several real-life applications. For instance, in a clinical setting, a doctor may first look at a patient's medical record (partially observed context) to decide which medical tests (additional context variables) to perform, before choosing a treatment plan (selecting an arm to play). It is often too costly or even impossible to conduct all possible tests (i.e., observe the full context); therefore, given the limit on the number of tests, the doctor must decide which subset of tests will result in the most effective treatment choice (maximize the reward). Similar problems can arise in multi-skill orchestration for AI agents. For example, in dialog orchestration, a user's query is first directed to a number of domain-specific agents, each providing a different response, and then the best response is selected. However, it might be too costly to request the answers from all domain-specific experts, especially in multi-purpose dialog systems with a very large number of domain experts. Given a limit on the number of experts to use for each query, the orchestration agent must choose the best subset of experts to use. In this application, the query is the immediately observed part of the overall context, while the responses of domain-specific experts are the initially unobserved features from which a limited subset must be selected and observed before choosing an arm, i.e., deciding on the best of the available responses. For multi-purpose dialog systems, such as personal home assistants, retrieving features or responses from every domain-specific agent is computationally expensive or intractable, with the potential to cause a poor user experience, again underscoring the need for effective feature selection.

Overall, the main contributions of this paper include: (1) a generalization of Context Attentive Bandit, and a first lower bound for this problem, (2) an algorithm called Context Attentive Thompson Sampling for stationary and non-stationary environments, and its regret bound in the case of stationary environment, and (3) an extensive empirical evaluation demonstrating advantages of our proposed algorithm over the previous context-attentive bandit approach Bouneffouf et al. (2017), on a range of datasets in both stationary and non-stationary settings.

2 Related Work

The contextual bandit (CB) problem has been extensively studied in the past, and a variety of solutions have been proposed. In LINUCB Li et al. (2010); Abbasi-Yadkori et al. (2011); Li et al. (2011); Bouneffouf and Rish (2019), Neural Bandit Allesiardo et al. (2014), and linear Thompson Sampling Agrawal and Goyal (2013); Balakrishnan et al. (2019a, b), a linear dependency is assumed between the expected reward given the context and an action taken after observing this context; the representation space is modeled using a set of linear predictors. However, the context is assumed to be fully observable, which is not the case in this work. Motivated by dimensionality reduction tasks, Abbasi-Yadkori et al. (2012) studied a sparse variant of stochastic linear bandits, with an application to optimizing a function that depends on many features, of which only a relatively small, initially unknown subset is relevant. Similarly, Carpentier and Munos (2012) considered high-dimensional stochastic linear bandits with sparsity, combining ideas from compressed sensing and bandit theory to derive a novel algorithm. Oswal et al. (2020) explore a new form of the linear bandit problem in which the algorithm receives the usual stochastic rewards as well as stochastic feedback about which features are relevant to the rewards, and propose an algorithm whose regret guarantee requires no prior knowledge of which features are relevant. In Bastani and Bayati (2015), the problem is formulated as a multi-armed bandit (MAB) problem with high-dimensional covariates, and a new efficient bandit algorithm based on the LASSO estimator is presented. However, the above work, unlike ours, assumes fully observable context variables, which is not always the case in applications, as discussed in the previous section. In Bouneffouf et al. (2017), the authors proposed the novel framework of contextual bandit with restricted context, where observing the whole feature vector at each iteration is too costly or impossible; however, the agent can request to observe the values of an arbitrary subset of features within a given budget, i.e., the limit on the number of features observed. This paper explores a more general problem and, unlike Bouneffouf et al. (2017), provides a theoretical analysis of the proposed problem and the proposed algorithm.

The Context Attentive Bandit problem is related to the budgeted learning problem, where a learner can access only a limited number of attributes from the training set or from the test set (see for instance Cesa-Bianchi et al. (2011)). In Foster et al. (2016), the authors studied the online budgeted learning problem. They showed a significant negative result: for any $\delta > 0$, no algorithm can achieve regret bounded by $O(T^{1-\delta})$ in polynomial time. To overcome this negative result, an additional assumption is necessary. Here, following Durand and Gagné (2014), we assume that the expected reward of selecting a subset of features is the sum of the expected rewards of selecting each feature individually. We obtain an efficient algorithm whose algorithmic complexity is linear in the time horizon.
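The practical effect of the additivity assumption can be illustrated with a small sketch: if the value of a subset is the sum of the values of its features, the best subset of a given size is simply the set of top-ranked features, so no search over all subsets is needed. The per-feature values and function names below are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def best_subset_bruteforce(feature_rewards, budget):
    """Exhaustively search all subsets of size `budget` (exponential cost)."""
    indices = range(len(feature_rewards))
    return max(
        combinations(indices, budget),
        key=lambda subset: sum(feature_rewards[i] for i in subset),
    )


def best_subset_additive(feature_rewards, budget):
    """With additive rewards, keeping the top-`budget` features is optimal."""
    ranked = sorted(
        range(len(feature_rewards)),
        key=lambda i: feature_rewards[i],
        reverse=True,
    )
    return tuple(sorted(ranked[:budget]))


# Hypothetical expected rewards of five individual features.
rewards = [0.10, 0.45, 0.30, 0.05, 0.40]
print(best_subset_additive(rewards, 2), best_subset_bruteforce(rewards, 2))
```

Both functions return the same subset here; only the second approach scales to large feature sets, since it needs one sort instead of an enumeration of all subsets.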

3 Context Attentive Bandit Problem

We now introduce the problem setting, outlined in Algorithm 1. Consider a set of $N$ features. At each time point $t \in \{1, \dots, T\}$, the environment generates a feature vector that the agent cannot observe fully; only a partial observation of the context is allowed: the values of a fixed subset of $V$ observed features are revealed. Based on this partially observed context, the agent is allowed to request an additional subset of $U$ unobserved features. The goal of the agent is to maximize its total reward over time via (1) the optimal choice of the additional set of features, given the initially observed features, and (2) the optimal choice of an arm based on both the initially observed and the additionally requested features. We assume an unknown probability distribution of the reward given the context and the action taken in that context. In the following, the expectations are taken over this probability distribution.


1:  for $t = 1$ to $T$ do
2:      The context is chosen by the environment
3:      The values of the fixed subset of $V$ known features are revealed
4:      The agent selects a subset of $U$ additional features
5:      The values of the selected features are revealed
6:      The agent chooses an arm
7:      The reward of the chosen arm is sampled from the reward distribution and revealed
8:      The agent updates its policy
9:  end for
Algorithm 1 Context Attentive Bandit Problem (CAB)
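The interaction protocol of Algorithm 1 can be sketched as a simulation loop. This is only an illustration under stated assumptions: the agent below acts uniformly at random and never learns, the contexts are uniform random numbers, and the Bernoulli reward rule is made up; none of these choices come from the paper.

```python
import random


def run_cab(num_features, known, budget, num_arms, horizon, seed=0):
    """Simulate the CAB interaction loop with a uniformly random agent."""
    rng = random.Random(seed)
    known = list(known)
    hidden = [i for i in range(num_features) if i not in known]
    total_reward = 0.0
    for _ in range(horizon):
        # The environment draws the full context; the agent never sees all of it.
        context = [rng.random() for _ in range(num_features)]
        # The fixed, known part of the context is revealed first.
        observed = {i: context[i] for i in known}
        # The agent requests `budget` additional features (here: at random).
        requested = rng.sample(hidden, budget)
        observed.update({i: context[i] for i in requested})
        # The agent chooses an arm (here: at random, ignoring `observed`).
        arm = rng.randrange(num_arms)
        # Hypothetical reward: Bernoulli with mean tied to one context entry.
        reward = 1.0 if rng.random() < context[arm % num_features] else 0.0
        total_reward += reward
        # A learning agent would update its policy here; this one does not.
    return total_reward / horizon


print(run_cab(num_features=6, known=[0, 1], budget=2, num_arms=3, horizon=200))
```

A learning algorithm such as CATS would replace the two random choices and add a policy update at the end of each iteration.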

The contextual bandit problem. Following Langford and Zhang (2008), this problem is defined as follows. At each time point $t \in \{1, \dots, T\}$, an agent is presented with a context (i.e., a feature vector) before choosing an arm. Each arm is associated with a reward at time $t$, and these rewards together form a reward vector. A policy maps a context to an arm. We assume that the expected reward is a linear function of the context.

Assumption 1 (linear contextual bandit): Whatever the subset of selected features the expected reward is a linear function of the context: , where is an unknown parameter, , and .

Contextual Combinatorial Bandit. The contextual combinatorial bandit problem Qin et al. (2014) can be viewed as a game where the agent sequentially observes a context , selects a subset and observes the reward corresponding to the selected subset. The goal is to maximize the reward over time. Let be the reward associated with the set of selected features knowing the context vector and the policy . We have , where . Each feature

is associated with the corresponding random variable

which indicates the reward obtained when choosing the -th feature at time .

Assumption 2 (linear contextual combinatorial bandit): the mean reward of selecting the set of features is: , and the expectation of the reward of selecting the feature is a linear function of the context vector : , where is an unknown weight vector associated with the feature , , and . Let be the set of linear policies such that only the features coming from are used, where is a fixed subset of , and is any subset of . The objective of Contextual Attentive Bandit (Algorithm 1) is to find an optimal policy , over iterations or time points, so that the total reward is maximized.

Definition 1 (Optimal Policy for CAB).

The optimal policy for handling the CAB problem is selecting the arm at time : , where .

Definition 2 (Cumulative regret).

The cumulative regret over iterations of the policy , is defined as .
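As a rough illustration of how cumulative regret is usually computed in bandit experiments, the sketch below accumulates the per-step gap between the expected reward of the optimal policy and that of the evaluated policy. This follows the standard notion of regret and is an assumption, not a transcription of the definition above; the function name and inputs are hypothetical.

```python
def cumulative_regret(optimal_means, chosen_means):
    """Running sum of per-step gaps between optimal and achieved expected reward."""
    total = 0.0
    curve = []
    for best, got in zip(optimal_means, chosen_means):
        total += best - got
        curve.append(total)
    return curve


# Three rounds: the policy is suboptimal in rounds 1 and 3.
print(cumulative_regret([1.0, 1.0, 1.0], [0.5, 1.0, 0.25]))
```

A curve that flattens over time indicates that the policy is converging to the optimal choices.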

Property 1 (Regret decomposition).

The cumulative regret over iterations of the policy can be rewritten as following:

Remark 1.

CAB problem generalizes contextual bandit with restricted context problem (Bouneffouf et al. (2017)). Indeed, when the subset of observed context is empty, the reward of selecting the feature is given by , which is the coordinate of the vector .

Before introducing an algorithm for solving the above CAB problem, we will derive a lower bound on the expected regret of any algorithm used to solve this problem.

Theorem 1.

For any policy solving Context Attentive Bandit problem (Algorithm 1) under Assumption 1.1 and 1.2, there exists probability distribution and , such that the lower bound of the regret accumulated by over iterations is :


The left term of the regret (see Property 1) is lower bounded by the lower bound of linear bandits in dimension (Theorem 2 in Chu et al. (2011)), while the right term is lower bounded by the lower bound of linear bandits in dimension . ∎

4 Context Attentive Thompson Sampling (CATS)

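1:  Require: the total number of features, the number of initially observed features $V$, the number of additional features $U$, the set of observed features, the number of arms, the time horizon $T$, the Thompson Sampling distribution parameter, and the function of time used for non-stationary environments
2:  Initialize: the posterior parameters of every arm and of every feature
3:  Foreach time step $t = 1, \dots, T$ do
4:   Observe the values of the initially observed features
5:    Foreach context feature do
6:    Sample a parameter vector from the posterior distribution of the feature
7:    End do
8:   Sort the features by estimated value in decreasing order
9:   Select the $U$ best features
10:   Observe the values of the selected features
11:   Foreach arm do
12:    Sample a parameter vector from the posterior distribution of the arm
13:   End do
14:   Select the arm with the highest sampled expected reward
15:   Observe the reward of the selected arm
16:   Update the posterior parameters of the selected arm
17:   Foreach selected feature do
18:    Update the posterior parameters of the feature
19:   End do
20:  End do
Algorithm 2 Context Attentive Thompson Sampling (CATS)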

We now propose an algorithm for solving the CAB problem, called Context-Attentive Thompson Sampling (CATS), and summarize it in Algorithm 2. The basic idea of CATS is to use linear Thompson Sampling Agrawal and Goyal (2013) to solve two linear bandit problems: selecting the set of additional relevant features given the initially observed features, and selecting the best arm given all observed features. Linear Thompson Sampling assumes a Gaussian likelihood for the reward and a Gaussian prior on the parameter vector of each arm and of each feature, so that the posterior at time $t$ is again Gaussian for each arm and for each feature.

The algorithm takes as input the total number of features, the number of initially observed features $V$, the number of additional features to observe $U$, the set of observed features, the number of actions, the time horizon $T$, the distribution parameter used in linear Thompson Sampling, and a function of time, which is used for adapting the algorithm to non-stationary linear bandits.

At each iteration, the values of the features in the initially observed subset are observed (line 4 of Algorithm 2). Then a parameter vector is sampled for each feature (lines 5-7) from its posterior distribution (line 6). Then the subset of the best estimated features at the current time is selected (lines 8-9). Once the feature vector is observed (line 10), linear Thompson Sampling is applied in lines 11-15 to choose an arm. When the reward of the selected arm is observed (line 15), the parameters are updated (lines 16-19).
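The two-level procedure described above can be sketched in NumPy as follows. This is a minimal sketch under several assumptions that are not taken from the paper: the class and method names are hypothetical, at least one feature is initially known, each candidate feature keeps a Gaussian posterior over the known part of the context, each arm keeps a Gaussian posterior over the full context with unobserved entries set to zero, the observed reward is credited to every selected feature, and the environment is stationary (no time-dependent weighting).

```python
import numpy as np


class CATSSketch:
    """Two-level linear Thompson Sampling: first features, then arms."""

    def __init__(self, n_features, known, budget, n_arms, v=0.25, seed=0):
        self.rng = np.random.default_rng(seed)
        known_set = set(known)
        self.known = np.array(sorted(known_set))
        self.hidden = np.array([i for i in range(n_features) if i not in known_set])
        self.budget, self.v, self.n = budget, v, n_features
        d = len(self.known)
        # One Gaussian posterior per candidate feature, over the known context.
        self.Bf = np.stack([np.eye(d)] * len(self.hidden))
        self.gf = np.zeros((len(self.hidden), d))
        # One Gaussian posterior per arm, over the masked full context.
        self.Ba = np.stack([np.eye(n_features)] * n_arms)
        self.ga = np.zeros((n_arms, n_features))

    def _sample(self, B, g):
        # Posterior mean is B^{-1} g; covariance is v^2 B^{-1}.
        cov = np.linalg.inv(B)
        return self.rng.multivariate_normal(cov @ g, self.v ** 2 * cov)

    def select_features(self, context):
        z = context[self.known]
        scores = [self._sample(B, g) @ z for B, g in zip(self.Bf, self.gf)]
        # Indices (into self.hidden) of the best estimated features.
        top = np.argsort(scores)[::-1][: self.budget]
        return top, z

    def select_arm(self, context, top):
        # Build the observed context: unrevealed entries stay at zero.
        x = np.zeros(self.n)
        x[self.known] = context[self.known]
        x[self.hidden[top]] = context[self.hidden[top]]
        values = [self._sample(B, g) @ x for B, g in zip(self.Ba, self.ga)]
        return int(np.argmax(values)), x

    def update(self, arm, x, top, z, reward):
        # Update the played arm, then every feature that was selected.
        self.Ba[arm] += np.outer(x, x)
        self.ga[arm] += reward * x
        for i in top:
            self.Bf[i] += np.outer(z, z)
            self.gf[i] += reward * z
```

One iteration then consists of calling `select_features`, `select_arm`, and `update` in that order, mirroring the observe, sample, select, and update steps of the description.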

Remark 2 (Algorithmic complexity).

At each time set, Algorithm 2 sorts a set of size and inverts matrices in dimensions that leads to an algorithmic complexity in .

Owing to Assumption 2, the CATS algorithm has an algorithmic complexity that is linear in the time horizon, overcoming the negative result stated in Foster et al. (2016). Before providing an upper bound on the regret of CATS (up to logarithmic factors), we need an additional assumption on the noise.

Assumption 3 (Sub-Gaussian noise): and , the noise is conditionally -sub-Gaussian with , that is for all ,

Lemma 1.

(Theorem 2 in Agrawal and Goyal (2013)) When the measurement noise satisfies Assumption 2, , , , and the regret of Thompson Sampling in the Linear bandit problem with parameters is upper bounded with a probability by:

We can now derive the following result.

Theorem 2.

When the measurement noise satisfies Assumption 2, , , , and the regret of CATS (Algorithm 2) is upper bounded with a probability by:


For upper bounding the left term of the regret (see Property 1), we apply Lemma 1, and for upper bounding the right term, which is the regret of Thompson Sampling in linear bandit problems in dimensions with respectively parameters, we apply Lemma 1. ∎

Theorem 2 states that the regret of CATS depends on the following two terms: the left term is the regret due to selecting a sub-optimal arm, while the right term is the regret of selecting a sub-optimal subset of features. We can see that there is still a gap between the lower bound of the Context Attentive Bandit problem and the upper bound of the proposed algorithm. The left term of the lower bound scales in , while the left term of the upper bound of CATS scales in , where hides logarithmic factor. The right term of the lower bound scales in , while the right term of the upper bound of CATS scales in . These gaps are due to the upper bound of regret of CATS, which uses Lemma 1. This suggests that the use of linear bandits based on an upper confidence balls, which scale in Abbasi-Yadkori et al. (2011) ( is the dimension of contexts), could reduce this theoretical gap. As we show in the next section, we choose the Thompson Sampling approach for its better empirical performances.

5 Experiments

We compare the proposed CATS algorithm with: (1) Random-EI: in addition to the observed features, this algorithm selects a Random subset of features of the specified size at each Iteration (thus, Random-EI), and then invokes the linear Thompson sampling algorithm. (2) Random-fix: this algorithm invokes linear Thompson sampling on a subset of features, where the subset is randomly selected once prior to seeing any data samples, and remains fixed. (3) The state-of-the-art method for context-attentive bandits proposed in Bouneffouf et al. (2017), Thompson Sampling with Restricted Context (TSRC): TSRC solves the CBRC (contextual bandit with restricted context) problem discussed earlier: at each iteration, the algorithm decides which features to observe (referred to as unobserved context). In our setting, however, $V$ of the features are immediately observed at each iteration (referred to as known context); the TSRC decision mechanism is then used to select $U$ additional unknown features to observe, followed by linear Thompson sampling on the $U+V$ features. (4) CALINUCB: we replace the contextual TS in CATS with LINUCB. (5) CATS-fix: a heuristic in which we stop the feature exploration after some iterations (we report here an average over the best results). Empirical evaluation of Random-fix, Random-EI, CATS-fix, CALINUCB, CATS, and TSRC was performed on several publicly available datasets, as well as on a proprietary corporate dialog orchestration dataset. (For the TS- and LINUCB-type algorithms, we used the same exploration parameter values as in Chapelle and Li (2011).) The publicly available Covertype and CNAE-9 datasets were featured in the original TSRC paper, and Warfarin Sharabiani et al. (2015) is a historically popular dataset for evaluating bandit methods.
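The difference between the two random baselines lies only in when the feature subset is drawn. A minimal sketch, with hypothetical function names and with the subsequent linear Thompson sampling step omitted:

```python
import random


def make_random_fix(hidden_features, budget, seed=0):
    """Random-fix: draw the subset once, then always return the same one."""
    fixed = sorted(random.Random(seed).sample(hidden_features, budget))
    return lambda: fixed


def make_random_ei(hidden_features, budget, seed=0):
    """Random-EI: draw a fresh random subset at every call (iteration)."""
    rng = random.Random(seed)
    return lambda: sorted(rng.sample(hidden_features, budget))


hidden = list(range(2, 12))  # hypothetical indices of the unknown features
select_fix = make_random_fix(hidden, budget=3)
select_ei = make_random_ei(hidden, budget=3)
print(select_fix(), select_fix())  # identical subsets
print(select_ei(), select_ei())    # usually different subsets
```

In both baselines, the selected features would be combined with the known context before the arm is chosen.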

To simulate the known and unknown context space, we randomly fix a portion of the context feature space of each dataset to be known at the onset and explore a subset of $U$ unknown features. To consider the possibility of nonstationarity in the unknown context space over time, we introduce a weight decay parameter that reduces the effect of past examples when updating the CATS parameters. We refer to the stationary case as CATS and keep this parameter fixed. For the nonstationary setting, we simulate nonstationarity in the unknown feature space by duplicating each dataset, randomly fixing the known context in the same manner as above, and shuffling the (unknown feature set, label) pairs. Then we stochastically replace events in the original dataset with their shuffled counterparts, with the probability of replacement increasing uniformly with each additional event. We refer to the nonstationary case as NCATS and set the decay parameter as defined by the GP-UCB algorithm Srinivas et al. (2009). We compare NCATS to NCATS-fix and NCALINUCB, which are the nonstationary versions of CATS-fix and CALINUCB. We also compare NCATS to Weighted TSRC (WTSRC), the nonstationary version of TSRC also developed by Bouneffouf et al. (2017). WTSRC updates its feature selection model based only on recent events, where recent events are defined by a time period, or "window", whose size is fixed in our experiments. We report the total average reward divided by $T$ over 200 trials, across a range of $U$ corresponding to various percentages of the feature space, for each algorithm in Table 1.
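The nonstationarity simulation described above can be sketched as follows. The sketch assumes that each event is a `(known, unknown, label)` tuple and interprets "increasing uniformly" as a replacement probability that grows linearly with the event index; both the data layout and this schedule are assumptions made for illustration.

```python
import random


def simulate_nonstationarity(events, seed=0):
    """events: list of (known, unknown, label) tuples, in stream order."""
    rng = random.Random(seed)
    # Shuffle the (unknown features, label) pairs of a copy of the dataset;
    # the known parts stay attached to their original positions.
    pairs = [(unknown, label) for _, unknown, label in events]
    rng.shuffle(pairs)
    n = len(events)
    stream = []
    for t, (known, unknown, label) in enumerate(events):
        # Replacement probability grows linearly from 0 to (n - 1) / n.
        if rng.random() < t / n:
            unknown, label = pairs[t]
        stream.append((known, unknown, label))
    return stream


toy_events = [((i,), (10 * i,), i % 2) for i in range(50)]
drifted = simulate_nonstationarity(toy_events)
print(sum(a != b for a, b in zip(toy_events, drifted)), "events replaced")
```

Early events are almost always left untouched, while late events are mostly replaced, so the relationship between the unknown features and the label drifts over the stream.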

Warfarin
Method        U = 20%         U = 40%         U = 60%
TSRC          53.28 ± 1.08    57.60 ± 1.16    59.87 ± 0.69
CATS          53.65 ± 1.21    58.55 ± 0.67    60.40 ± 0.74
CATS-fix      53.99 ± 1.02    58.67 ± 0.65    60.07 ± 0.54
CALINUCB      52.17 ± 0.89    57.23 ± 0.53    60.29 ± 0.66
Random-fix    51.05 ± 1.31    53.55 ± 0.97    55.15 ± 0.83
Random-EI     43.65 ± 1.21    48.55 ± 1.67    50.40 ± 1.33

Covertype
Method        U = 20%         U = 40%         U = 60%
TSRC          54.64 ± 1.87    63.35 ± 1.87    69.59 ± 1.72
CATS          65.57 ± 2.17    72.58 ± 2.36    78.58 ± 2.35
CATS-fix      65.88 ± 2.01    72.67 ± 2.13    78.55 ± 2.25
CALINUCB      61.99 ± 1.53    72.54 ± 1.76    79.69 ± 1.82
Random-fix    53.11 ± 1.45    59.67 ± 1.07    64.18 ± 1.03
Random-EI     46.15 ± 2.61    52.55 ± 1.81    55.45 ± 1.5

CNAE-9
Method        U = 20%         U = 40%         U = 60%
TSRC          33.57 ± 2.43    38.62 ± 1.68    42.05 ± 2.14
CATS          29.84 ± 1.82    39.10 ± 1.41    40.52 ± 1.42
CATS-fix      29.82 ± 1.70    39.57 ± 1.23    41.43 ± 1.39
CALINUCB      28.53 ± 1.65    38.88 ± 1.35    39.73 ± 1.36
Random-fix    33.01 ± 1.82    37.67 ± 1.68    39.18 ± 1.52
Random-EI     32.05 ± 2.01    36.65 ± 1.90    37.47 ± 1.75

(a) Stationary setting

Warfarin
Method        U = 20%         U = 40%         U = 60%
WTSRC         55.83 ± 0.55    58.00 ± 0.83    59.85 ± 0.60
NCATS         59.47 ± 2.89    59.34 ± 2.04    63.26 ± 0.75
NCATS-fix     59.01 ± 3.09    59.14 ± 2.33    62.42 ± 0.98
NCLINUCB      58.64 ± 2.77    58.43 ± 1.89    63.01 ± 0.66
Random-fix    43.91 ± 1.17    47.67 ± 1.08    54.18 ± 1.03
Random-EI     47.78 ± 2.11    52.55 ± 1.83    55.45 ± 1.54

Covertype
Method        U = 20%         U = 40%         U = 60%
WTSRC         50.26 ± 1.58    58.99 ± 1.81    64.91 ± 1.38
NCATS         48.50 ± 1.05    68.17 ± 3.14    83.78 ± 5.51
NCATS-fix     49.87 ± 1.20    68.04 ± 3.24    82.98 ± 5.83
NCLINUCB      48.12 ± 0.99    68.20 ± 3.11    83.91 ± 5.21
Random-fix    43.11 ± 3.05    49.67 ± 2.77    53.18 ± 2.33
Random-EI     44.45 ± 4.44    46.65 ± 3.88    53.45 ± 3.61

CNAE-9
Method        U = 20%         U = 40%         U = 60%
WTSRC         19.91 ± 2.67    30.86 ± 2.92    36.01 ± 2.88
NCATS         30.88 ± 0.96    34.91 ± 1.93    42.04 ± 1.52
NCATS-fix     29.92 ± 1.06    33.43 ± 1.83    40.04 ± 1.51
NCLINUCB      31.07 ± 0.87    34.61 ± 1.73    41.81 ± 1.62
Random-fix    13.01 ± 3.45    21.77 ± 3.08    24.18 ± 2.43
Random-EI     16.15 ± 2.44    22.55 ± 2.18    25.45 ± 2.15

(b) Nonstationary setting
Table 1: Total average reward,

The results in Table 1 are promising, with our methods outperforming the state of the art in the majority of cases across both settings. The most notable exception is the CNAE-9 dataset in the stationary setting, where CATS only sometimes outperforms or nearly matches TSRC. This outcome is somewhat expected, since in the original work on TSRC Bouneffouf et al. (2017), the mean error rate of TSRC on CNAE-9 was only marginally lower than the error obtained by randomly fixing a subset of unknown features to reveal for each event. This observation suggests that, for this particular dataset, there may not be a subset of features that is noticeably more predictive of the reward than the rest of the features. We also observe that the LINUCB version of CATS performs comparably to CATS, with a slight advantage to CATS. Another observation is that CATS-fix performs better than CATS in some situations; a possible explanation is that, after finding the best features, the algorithm does not need to explore anymore and can focus on finding the best arms based on these features.

Figure 1: Total Average Reward for Covertype. (a) Stationary Setting. (b) Nonstationary Setting.

We perform a deeper analysis of the Covertype dataset, examining multi-staged selection of the unknown context feature sets. In CATS, the known context is used to select all additional context feature sets at once. In a multi-staged approach, the known context grows and is used to select each of the additional context features incrementally (one feature at a time). For the stationary case, keeping the decay parameter fixed, we denote these two variants of the CATS algorithm as CATS and CATS-Staged, respectively, and report their performance when 10% of the context is randomly fixed, across various $U$, in Figure 1(a). Note that when the remaining 90% of features are revealed, the CATS and TSRC methods all reduce to simple linear Thompson sampling with the full feature set. Similarly, when 0 additional feature sets are revealed, the methods all reduce to linear Thompson sampling with a sparsely represented known context. Observe that CATS consistently outperforms CATS-Staged across all $U$ tested. CATS-Staged likely suffers because incremental feature selection adds nonstationarity to the known context: CATS learns relationships between the known and unknown features, while CATS-Staged learns relationships between them as the known context grows. Nonetheless, both methods outperform TSRC. In the nonstationary case, we use the GP-UCB algorithm for the decay parameter, refer to the single and multi-staged cases as NCATS and NCATS-Staged, and illustrate their performance in Figure 1(b). Here we observe that NCATS and NCATS-Staged have comparable performance, and the improvement over the baseline, in this case WTSRC, is even greater than in the stationary case.

Next we evaluate our methods on Customer Assistant, a proprietary multi-skill dialog orchestration dataset. Recall that this kind of application motivates the CAB setting because there is a natural divide between the known and unknown context spaces: the query and its associated features are known at the onset, whereas the potential responses and their associated features are only known for the domain-specific agents the query is posed to. The Customer Assistant orchestrates 9 domain-specific agents, to which we assign arbitrary labels in the discussion that follows. In this application, example skills lie in the domains of payroll, compensation, travel, health benefits, and so on. In addition to a textual response to a user query, the skills orchestrated by Customer Assistant also return the following features: an intent, a short string descriptor that categorizes the perceived intent of the query, and a confidence, a real value between 0 and 1 indicating how confident a skill is that its response is relevant to the query. Skills have multiple intents associated with them. The orchestrator uses all the features associated with the query and the candidate responses from all the skills to choose which skill should carry the conversation.

The Customer Assistant dataset contains 28,412 events associated with a correct skill response. We encode each query by averaging the 50-dimensional GloVe word embeddings Pennington et al. (2014) of its words, and for each skill we create a feature set consisting of its confidence and a one-hot encoding of its intent. The skill feature set sizes for the nine skills are 181, 9, 4, 7, 6, 27, 110, 297, and 30, respectively. We concatenate the query features and all of the skill features to form a 721-dimensional context feature vector for each event in this dataset. Recall that there is no need for simulation of the known and unknown contexts; in a live setting the query features are immediately calculable or known, whereas the confidence and intent necessary to build a skill's feature set are unknown until a skill is executed. Because the confidence and intent for a skill are both accessible after execution, we reveal them together. We accommodate this by slightly modifying the objective of CATS to reveal unknown skill feature sets instead of unknown individual features for each event. As for Covertype, we examine multi-staged selection of the unknown context feature sets on the Customer Assistant dataset. For the stationary case, the results are summarized in Figure 2(a). Here both the CATS-Staged and CATS methods outperform TSRC by a large margin.
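The construction of the context vector described above can be sketched as follows. The helper names, the toy two-dimensional embedding table, and the rule for skipping unknown words are hypothetical; real GloVe vectors would be loaded from a pre-trained file. The sketch also assumes that each skill's feature set is its confidence followed by a one-hot intent vector, so a skill with a feature set of size 9 would have 8 intents.

```python
def encode_query(tokens, embeddings, dim=50):
    """Average the word vectors of the tokens (unknown words are skipped)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def encode_skill(confidence, intent, intents):
    """Confidence followed by a one-hot encoding of the skill's intent."""
    return [confidence] + [1.0 if intent == name else 0.0 for name in intents]


# Feature set sizes reported in the text; together with the 50 query
# dimensions they give the 721-dimensional context vector.
SKILL_SIZES = [181, 9, 4, 7, 6, 27, 110, 297, 30]
TOTAL_DIM = 50 + sum(SKILL_SIZES)

# Toy 2-dimensional embedding table, for illustration only.
toy_embeddings = {"pay": [1.0, 0.0], "day": [0.0, 1.0]}
query_vector = encode_query(["pay", "day", "unknownword"], toy_embeddings, dim=2)
skill_vector = encode_skill(0.9, "b", ["a", "b", "c"])
print(TOTAL_DIM, query_vector, skill_vector)
```

The full context for an event would be the query vector followed by the feature sets of all skills, with the entries of unexecuted skills left unobserved.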

Figure 2: Total Average Reward for Customer Assistant. (a) Stationary Setting. (b) Nonstationary Setting.

For the nonstationary case, we simulate nonstationarity in the same manner as for the publicly available datasets, except that we use the natural partition of the query features as the known context and the skill feature sets as the unknown context instead of simulated percentages. We use the GP-UCB algorithm for the decay parameter and illustrate the performance of NCATS and NCATS-Staged alongside WTSRC in Figure 2(b). Here we observe that NCATS slightly outperforms NCATS-Staged, and both outperform the WTSRC baseline.

6 Conclusions and Future Work

We have introduced here a novel bandit problem with only partially observable context and the option of requesting a limited number of additional observations. We also proposed an algorithm designed to take advantage of the initial partial observations in order to improve its choice of which additional features to observe, and demonstrated its advantages over the prior art, a standard context-attentive bandit with no partial observation of the context prior to the feature selection step. Our problem setting is motivated by several realistic scenarios, including medical applications as well as multi-domain dialog systems. Note that our current formulation assumes that all unobserved features have equal observation cost. However, a more practical assumption is that some features may be more costly than others; thus, in future work, we plan to expand this notion of budget to accommodate more scenarios involving different feature costs.

7 Broader Impact

This problem has broader impacts in several domains such as voice assistants, healthcare and e-commerce.

  • Better medical diagnosis. In a clinical setting, it is often too costly or infeasible to conduct all possible tests; therefore, given the limit on the number of tests, the doctor must decide, in an iterative manner, which subset of tests will result in the most effective treatment choice. A doctor may first look at a patient's medical record to decide which medical test to perform, before choosing a treatment plan.

  • Better user preference modeling. Our approach can help to develop better chatbots and automated personal assistants. For example, following a request such as "play music", an AI-based home assistant must learn to ask several follow-up questions (from a list of possible questions) to better understand the intent of a user and to remove ambiguities: e.g., what type of music do you prefer (jazz, pop, etc.)? Would you like it on the hi-fi system or on the TV? And so on. Another example: a support desk chatbot, in response to a user's complaint ("My Internet connection is bad"), must learn to ask a sequence of appropriate questions (from a list of possible connection issues): how far is your WIFI hotspot? Do you have a 4G subscription? These scenarios are well handled by the framework we proposed in this paper.

  • Better recommendations. Voice assistants and recommendation systems in general tend to lock us into our preferences, which can have deleterious effects: e.g., recommendations based only on the past history of a user's choices may reinforce certain undesirable tendencies, such as suggesting online content to a user with a particular bias (e.g., racist, sexist, etc.). On the contrary, our approach could potentially help a user to break out of this loop, by suggesting items (e.g., news) based on additional questions (additional features), which can be used to broaden the user's horizons.


References

  • Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári (2011) Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312–2320.
  • Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári (2012) Online-to-confidence-set conversions and application to sparse stochastic bandits. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2012, pp. 1–9.
  • D. Agarwal, B. Chen, and P. Elango (2009) Explore/exploit schemes for web content optimization. In Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, ICDM '09, Washington, DC, USA, pp. 1–10. ISBN 978-0-7695-3895-2.
  • S. Agrawal and N. Goyal (2013) Thompson sampling for contextual bandits with linear payoffs. In ICML (3), pp. 127–135.
  • R. Allesiardo, R. Féraud, and D. Bouneffouf (2014) A neural networks committee for the contextual bandit problem. In Neural Information Processing - 21st International Conference, ICONIP 2014, Kuching, Malaysia, November 3-6, 2014. Proceedings, Part I, pp. 374–381.
  • P. Auer, N. Cesa-Bianchi, and P. Fischer (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learning 47 (2-3), pp. 235–256.
  • P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire (2003) The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32 (1), pp. 48–77. ISSN 0097-5397.
  • A. Balakrishnan, D. Bouneffouf, N. Mattei, and F. Rossi (2019a) Incorporating behavioral constraints in online AI systems. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pp. 3–11.
  • A. Balakrishnan, D. Bouneffouf, N. Mattei, and F. Rossi (2019b) Using multi-armed bandits to learn ethical priorities for online AI systems. IBM J. Res. Dev. 63 (4/5), pp. 1:1–1:13.
  • H. Bastani and M. Bayati (2015) Online decision-making with high-dimensional covariates. Available at SSRN 2661896.
  • D. Bouneffouf and R. Féraud (2016) Multi-armed bandit problem with known trend. Neurocomputing 205, pp. 16–21.
  • D. Bouneffouf, I. Rish, G. A. Cecchi, and R. Féraud (2017) Context attentive bandits: contextual bandit with restricted context. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pp. 1468–1475.
  • D. Bouneffouf and I. Rish (2019) A survey on practical applications of multi-armed and contextual bandits. CoRR abs/1904.10040.
  • A. Carpentier and R. Munos (2012) Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2012, pp. 190–198.
  • N. Cesa-Bianchi, S. Shalev-Shwartz, and O. Shamir (2011) Efficient learning with partially observed attributes. J. Mach. Learn. Res. 12, pp. 2857–2878. External Links: ISSN 1532-4435 Cited by: §2.
  • O. Chapelle and L. Li (2011) An empirical evaluation of Thompson sampling. In Advances in neural information processing systems, pp. 2249–2257. Cited by: footnote 1.
  • W. Chu, L. Li, L. Reyzin, and R. E. Schapire (2011) Contextual bandits with linear payoff functions. In AISTATS, G. J. Gordon, D. B. Dunson, and M. Dudik (Eds.), JMLR Proceedings, Vol. 15, pp. 208–214. External Links: Link Cited by: §3.
  • A. Durand and C. Gagné (2014) Thompson sampling for combinatorial bandits and its application to online feature selection. In AAAI, Workshop Sequential Decision-Making with Big Data. Cited by: §2.
  • D. Foster, S. Kale, and H. Karloff (2016) Online sparse linear regression. In COLT. Cited by: §2, §4.
  • J. C. Gittins (1979) Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society. Series B (Methodological) 41 (2), pp. 148–177. Cited by: §1.
  • T. L. Lai and H. Robbins (1985) Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6 (1), pp. 4–22. External Links: Link Cited by: §1.
  • J. Langford and T. Zhang (2008) The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in neural information processing systems, pp. 817–824. Cited by: §1, §3.
  • L. Li, W. Chu, J. Langford, and R. E. Schapire (2010) A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, WWW ’10, USA, pp. 661–670. Cited by: §2.
  • L. Li, W. Chu, J. Langford, and X. Wang (2011) Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In WSDM, I. King, W. Nejdl, and H. Li (Eds.), pp. 297–306. External Links: Link Cited by: §2.
  • B. Lin, D. Bouneffouf, G. A. Cecchi, and I. Rish (2018) Contextual bandit with adaptive feature extraction. In 2018 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 937–944. Cited by: §1.
  • U. Oswal, A. Bhargava, and R. Nowak (2020) Linear bandits with feature feedback. AAAI. Cited by: §2.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. External Links: Link Cited by: §5.
  • L. Qin, S. Chen, and X. Zhu (2014) Contextual combinatorial bandit and its application on diversified online recommendation. In Proceedings of the 2014 SIAM International Conference on Data Mining, pp. 461–469. Cited by: §3.
  • A. Sharabiani, A. Bress, E. Douzali, and H. Darabi (2015) Revisiting warfarin dosing using machine learning techniques. Computational and mathematical methods in medicine 2015. Cited by: §5.
  • N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger (2009) Gaussian process optimization in the bandit setting: no regret and experimental design. arXiv preprint arXiv:0912.3995. Cited by: §5.