Linear stochastic bandit algorithms are used to sequentially select actions to maximize rewards. The linear bandit model assumes that the expected reward of each action is an (unknown) linear function of a (known) finite-dimensional feature associated with the action. Mathematically, if $x_t \in \mathbb{R}^d$ is the feature associated with the action chosen at time $t$, then the stochastic reward is
$$y_t = \langle x_t, \theta^* \rangle + \eta_t, \qquad (1)$$
where $\theta^* \in \mathbb{R}^d$ is the unknown linear functional and $\eta_t$ is a zero mean random variable. The goal is to adaptively select actions to maximize the rewards. This involves (approximately) learning $\theta^*$ and exploiting this knowledge. Linear bandit algorithms that exploit this special structure have been extensively studied and applied Rusmevichientong and Tsitsiklis (2010); Abbasi-Yadkori et al. (2011). Unfortunately, standard linear bandit algorithms suffer from the curse of dimensionality. The regret grows linearly with the feature dimension $d$. The dimension may be quite large in modern applications (e.g., thousands of features in NLP or image/vision applications). However, in many cases the linear function may only involve a sparse subset of $k \ll d$ features, and this can be exploited to partially reduce dependence on $d$. In such cases, the regret of sparse linear bandit algorithms over a time horizon $T$ scales like $\sqrt{kdT}$ Abbasi-Yadkori et al. (2012); Lattimore and Szepesvári (2018).
We tackle the problem of linear bandits from a new perspective that incorporates feature feedback in addition to reward feedback, mitigating the curse of dimensionality. Specifically, we consider situations in which the algorithm receives a stochastic reward and stochastic feedback indicating which, if any, feature-dimensions were relevant to the reward value. For example, consider a situation in which users rate recommended text documents and additionally highlight keywords or phrases that influenced their ratings. Figure 1 illustrates the idea. The additional “feature feedback” may significantly improve an algorithm’s ability to home in on the relevant features. The focus of this paper is the development of new theory and algorithms for linear bandits with feature feedback. We show that the regret of linear bandits with feature feedback scales linearly in $k$, the number of relevant features, without prior knowledge of which features are relevant or of the value of $k$. This leads to large improvements in theory and practice.
Perhaps the most natural and simple way to leverage the feature feedback is an explore-then-commit strategy. In the first $T_0$ steps the algorithm selects actions at random and receives rewards and feature feedback. If $T_0$ is sufficiently large, then the algorithm will have learned all or most of the relevant features and it can then switch to a standard linear bandit algorithm operating in the lower-dimensional subspace defined by those features. There are two major problems with such an approach:
The correct choice of $T_0$ depends on the prevalence of relevant features in randomly selected actions, which is generally unknown. If $T_0$ is too small, then many relevant features will be missed and the long-run regret will scale linearly with the time horizon. If $T_0$ is too large, then the initial exploration period will suffer excess regret. This is depicted in Figure 2.
Regardless of the choice of $T_0$, the regret will grow linearly during the exploration period. The new FF-OFUL algorithm that we propose combines exploration and exploitation from the start and can lead to smaller regret both initially and asymptotically, as shown in Figure 2.
These observations motivate our proposed approach, which dynamically adjusts the trade-off between exploration and exploitation. A key aspect of the approach is that it is automatically adaptive to the unknown number of relevant features $k$. Our theoretical analysis shows that its regret scales like $k\sqrt{T}$. Experimentally, we show that the algorithm generally outperforms traditional linear bandits and the explore-then-commit strategy. This is because the dynamic algorithm exploits knowledge of relevant features as soon as they are identified, rather than waiting until all or most are found. A key consequence is that our proposed algorithm yields significantly better rewards at early stages of the process, as shown in Figure 2 and in more comprehensive experiments later in the paper. The intuition for this is that estimating $\theta^*$ on a fraction of the relevant coordinates can be exploited to recover a fraction of the optimal reward. Similar ideas are explored in linear bandits (without feature feedback) in Deshpande and Montanari (2012).
1.1 Motivating Application
Consider the application of recommending news articles. At every time instant, the algorithm recommends an article to the user from a large database containing articles about topics like “politics”, “technology”, “sports”. The user provides a numerical reward corresponding to her assessment of the document’s value. The goal of the algorithm is to maximize the cumulative reward over time. This can be challenging if the majority of the documents in the database are not of interest to the user. Linear bandit algorithms strike a balance between exploration of the database to ascertain the user’s interests and exploitation by retrieving documents similar to those that have received the highest rewards (in this paper, we also use exploration-exploitation to refer to the knowledge of relevant features). Typical word models such as TF-IDF result in features whose dimension ($d$) is on the order of thousands. The high dimensionality makes it challenging to employ state-of-the-art algorithms, since they involve maintaining and updating a $d \times d$ matrix at every stage. The approach taken in this work is to augment the usual reward feedback with additional feature feedback by allowing the user to highlight words or phrases to help orient the search. As an example, suppose the user is looking for articles about NFL football. They can highlight words such as “Patriots”, “Football”, “Rams” to reinforce search in that direction, and also negative words such as “politics”, “stocks” to steer the document search away from them. However, words such as “grass” and “air” may be common words and therefore less relevant to the search. The goal is to give the user a tool to speed up their search with nearly effortless feedback.
For round $t$, let $\mathcal{X}_t \subseteq \mathbb{R}^d$ be the set of actions/items provided to the learner. We assume the standard linear model for rewards with a hidden weight vector $\theta^* \in \mathbb{R}^d$. If the learner selects an action $x_t \in \mathcal{X}_t$, it receives reward $y_t$, defined in (1), where $\eta_t$ is sub-Gaussian noise with parameter $R$.
For the set of actions $\mathcal{X}_t$, the optimal action is given by $x_t^* = \arg\max_{x \in \mathcal{X}_t} \langle x, \theta^* \rangle$, which is unknown. We define regret as
$$R_T = \sum_{t=1}^{T} \left( \langle x_t^*, \theta^* \rangle - \langle x_t, \theta^* \rangle \right). \qquad (2)$$
This is also called cumulative regret but, unless stated otherwise, we will refer to it as regret. We refer to the quantity $\langle x_t^*, \theta^* \rangle - \langle x_t, \theta^* \rangle$ as the instantaneous regret, which is the difference between the optimal reward and the reward received at that instant. We make the standard assumption that the algorithm is provided with an enormous action set which changes only slowly over time, for instance, from sampling the actions without replacement.
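The regret definition can be made concrete with a small simulation. The following is a minimal sketch, assuming a fixed action set (the setup above allows it to change over time) and a policy that picks actions uniformly at random; the function name `cumulative_regret` and all sizes are hypothetical choices for illustration.

```python
import numpy as np


def cumulative_regret(actions, chosen_idx, theta_star):
    """Cumulative regret of a sequence of choices from a fixed action set.

    actions: (n, d) array of feature vectors; chosen_idx: length-T indices;
    theta_star: (d,) hidden weight vector. All names are illustrative.
    """
    expected = actions @ theta_star           # expected reward of every action
    best = expected.max()                     # reward of the optimal action
    instantaneous = best - expected[chosen_idx]
    return np.cumsum(instantaneous)


rng = np.random.default_rng(0)
d, n, T, k = 20, 50, 200, 3                   # illustrative sizes, not from the paper
theta_star = np.zeros(d)
theta_star[:k] = 1.0                          # k-sparse hidden weight vector
actions = rng.normal(size=(n, d))
actions /= np.linalg.norm(actions, axis=1, keepdims=True)  # unit-norm actions

random_policy = rng.integers(0, n, size=T)    # pure exploration
regret = cumulative_regret(actions, random_policy, theta_star)
```

Because the random policy never uses past rewards, its cumulative regret grows roughly linearly in the number of rounds, which is the behaviour a bandit algorithm aims to improve on.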
1.3 Related Work
The area of contextual bandits was introduced by Ginebra and Clayton (1995). The first algorithms for linear bandits appeared in Abe and Long (1999), followed by those using the optimism in the face of uncertainty principle, Auer and Long (2002); Dani et al. (2008). Rusmevichientong and Tsitsiklis (2010) showed matching upper and lower bounds when the action (feature) set is a unit hypersphere. Finally, Abbasi-Yadkori et al. (2011) gave a tight regret bound using new martingale techniques. We use their algorithm, OFUL, as a subroutine in our work. In the area of sparse linear bandits, regret bounds are known to scale like $\sqrt{kdT}$, Abbasi-Yadkori et al. (2012); Lattimore and Szepesvári (2018), when operating in a $d$ dimensional feature space with $k$ relevant features. The strong dependence on the ambient dimension $d$ is unavoidable without further (often strong and unrealistic) assumptions. For instance, if the distribution of feature vectors is isotropic or otherwise favorably distributed, then the regret may scale like $k\sqrt{T}$, e.g., by using incoherence based techniques from compressed sensing Carpentier and Munos (2012). These results also assume knowledge of the sparsity parameter $k$, and without it no algorithm can satisfy these regret bounds for all $k$ simultaneously.
In contrast, we propose a new algorithm that automatically adapts to the unknown sparsity level $k$ and removes the dependence of regret on $d$ by exploiting additional feature feedback. In terms of feature feedback in text-based applications, Croft and Das (1989) have proposed a method to reorder documents based on the relative importance of words using feedback from users. Poulis and Dasgupta (2017) consider a similar problem but for learning a linear classifier. We use a similar feedback model but focus on the bandit setting, where such feedback can be naturally collected along with rewards to improve search while striking a balance between exploration and exploitation, leading to interesting tradeoffs. The idea of allowing users to provide richer forms of feedback has been studied in the active learning literature Raghavan et al. (2006); Druck et al. (2009) and has also been considered in other (interactive) learning tasks, such as cognitive science Roads et al. (2016), machine teaching Chen et al. (2018), and NLP tasks Yessenalina et al. (2010).
2 Model for Feature Feedback
The algorithm presents the user with an item (e.g., document) and the user provides feedback in terms of whether they like the item or not (logistic model) or how much they like it (inner product model). The user also selects a few features (e.g., words), if they can find them, to help orient the search. The reasonable assumption in the high-dimensional setting is that the linear bandit weight vector $\theta^*$ is sparse (or approximately sparse). Suppose one is searching for articles about machine learning. It is easy to see how one may pay attention to words like pattern, recognition, and networks, but the vast majority of words may not help at all in determining if that article is about machine learning.
Assumption 1 (Sparsity).
The hidden weight vector $\theta^*$ is $k$-sparse and $k$ is unknown. In other words, $\theta^*$ has at most $k$ non-zero entries, or if $S = \mathrm{supp}(\theta^*)$ then $|S| \le k$.
Assumption 1 ensures that there are at most $k$ relevant features; however, we stress that the value of $k$ is unknown (it is possible that all $d$ features are relevant). We make the following underlying assumptions about feature feedback.
Assumption 2 (Discoverability).
For an action selected uniformly at random, the probability that a relevant feature is present and is selected is at least $p > 0$ ($p$ unknown).
Assumption 2 ensures that while every item may not have relevant features, we are able to find them with a non-zero probability when searching through items at random. This assumption can be viewed as a (possibly pessimistic) lower bound on the rate at which relevant features are discovered. For example, it is possible that exploitative actions may yield relevant features at a higher rate (e.g., relevant features may be correlated with higher rewards). We do not attempt to model such possibilities since this would involve making additional assumptions that may not hold in practice.
Assumption 3 (Noise).
Users may report irrelevant features. The number of reported irrelevant features (denoted by $k'$) is unknown in advance.
Assumption 3 accounts for ambiguous features that are irrelevant but that users, erring on the side of inclusion, mark as relevant.
The setup is as follows: we have a set of items or actions $\mathcal{X}$ that we can propose to the users. There is a hidden weight vector $\theta^* \in \mathbb{R}^d$ that is $k$-sparse. We further assume that $\theta^*$ and the action vectors are bounded in norm: $\|\theta^*\|_2 \le S$ and $\|x\|_2 \le L$ for all $x \in \mathcal{X}$. Besides the reward $y_t$, defined in (1), at each time-step the learner receives relevance feedback indicating which features, if any, the user marked as relevant. The model further specifies that, for an action selected at random, the probability that a relevant feature is selected is at least $p$. We need this assumption to make sure that we can find all the relevant features.
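The feedback model can be simulated in a few lines. This is a minimal sketch under the assumption that each relevant feature present in the chosen action is marked independently with a fixed probability; the helper name `feature_feedback` and the parameter `p_mark` are hypothetical, and irrelevant features are never marked here (the noiseless case).

```python
import numpy as np


def feature_feedback(x, relevant, p_mark, rng):
    """Simulated user feedback for one action (hypothetical helper).

    x: (d,) feature vector of the presented item.
    relevant: indices of the truly relevant features.
    p_mark: assumed probability that a present relevant feature is marked.
    Returns the indices the simulated user marks as relevant.
    """
    present = np.flatnonzero(x)                      # features that appear in the item
    candidates = np.intersect1d(present, relevant)   # relevant features that are present
    keep = rng.random(candidates.size) < p_mark      # each is marked independently
    return candidates[keep]


rng = np.random.default_rng(0)
x = np.array([0.0, 2.0, 1.0, 0.0, 3.0])   # item containing features 1, 2 and 4
relevant = np.array([1, 3, 4])            # feature 3 is relevant but absent from x
marked = feature_feedback(x, relevant, p_mark=0.5, rng=rng)
```

An item that contains no relevant feature yields no feedback, which is why some amount of random exploration is needed to discover the support.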
In this section, we introduce an algorithm that makes use of feature relevance feedback in order to start with a small feature space and gradually increase the space over time without knowledge of $k$. We begin by recalling the following theorem, which bounds the regret (2) of the OFUL algorithm (stated as Algorithm 1), based on the principle of optimism in the face of uncertainty. The algorithm constructs ellipsoidal confidence sets centered around the ridge regression estimate, using observed data, such that the sets contain the unknown $\theta^*$ with high probability, and selects the action/item that maximizes the inner product with any $\theta$ from the confidence set.
Theorem 1 (Abbasi-Yadkori et al. (2011)).
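Assume that $\|\theta^*\|_2 \le S$ and $\|x\|_2 \le L$, $\forall x \in \mathcal{X}_t$. Then with probability at least $1 - \delta$, the regret of OFUL satisfies:
$$R_T \le 4\sqrt{T d \log\left(\lambda + \frac{TL}{d}\right)} \left( \lambda^{1/2} S + R\sqrt{2\log\frac{1}{\delta} + d \log\left(1 + \frac{TL}{\lambda d}\right)} \right),$$
where $\lambda > 0$ is the ridge regression parameter of OFUL and $R$ is the sub-Gaussian parameter of the noise.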
Roughly, this theorem bounds the regret of OFUL (Algorithm 1) by order $d\sqrt{T}$, ignoring constants and logarithmic terms. We will combine this with a form of $\epsilon$-greedy algorithm due to Sutton and Barto (1998) to prove a result similar to Theorem 1 but reduce the dependence on the dimension from $d$ to $k$.
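The optimistic selection rule of OFUL can be sketched as follows. Maximizing the inner product over an ellipsoid centered at the ridge estimate has a closed form: the estimated reward plus a confidence width. This is a simplified illustration rather than the exact procedure of Abbasi-Yadkori et al. (2011); in particular the radius `beta` is treated here as a fixed input, and the function names are hypothetical.

```python
import numpy as np


def oful_select(actions, V, b, beta):
    """Pick the optimistic action for the ellipsoid
    {theta : (theta - theta_hat)^T V (theta - theta_hat) <= beta}.

    actions: (n, d) candidate feature vectors.
    V: (d, d) regularized Gram matrix, lambda * I + sum of x x^T.
    b: (d,) sum of y * x over past rounds.
    beta: squared confidence radius (assumed given; a simplification).
    """
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b                              # ridge regression estimate
    # max over the ellipsoid of <x, theta> = <x, theta_hat> + sqrt(beta) * ||x||_{V^{-1}}
    widths = np.sqrt(np.einsum("ij,jk,ik->i", actions, V_inv, actions))
    ucb = actions @ theta_hat + np.sqrt(beta) * widths
    return int(np.argmax(ucb)), theta_hat


def oful_update(V, b, x, y):
    """Rank-one update of the statistics after observing reward y for action x."""
    return V + np.outer(x, x), b + y * x
```

A typical loop starts from `V = lam * np.eye(d)` and `b = np.zeros(d)`, calls `oful_select`, observes the reward, and then calls `oful_update`. Maintaining `V` is what makes the method costly when the dimension is large.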
In order to do so, we must discover the support of $\theta^*$. The idea is to apportion a set of actions to random plays in order to guarantee that we find all the relevant features, and to run OFUL on the identified relevant dimensions the remaining time. Reducing the proportion of random actions over time guarantees that the regret remains sub-linear in time. We propose Algorithm 2 to exploit feature feedback. Here, at each time $t$, with probability proportional to $1/\sqrt{t}$, the algorithm selects an action/item to present at random; otherwise it selects the item recommended by feature-restricted OFUL.
All updates are made only in the dimensions that have been marked as relevant, and the space is dynamically increased as new relevant features are revealed. If nothing is marked as relevant, then by default the actions are selected at random, potentially suffering the worst possible reward but, at the same time, increasing our chances of getting relevance feedback, which leads to a trade-off. As time goes on, more relevance information is revealed. Note that the algorithm is adaptive to the unknown number of relevant features $k$. If $k$ were known, we could stop looking for features once all relevant ones have been selected. We find that in practice this algorithm has the additional benefit of being more robust to changes in the ridge parameter ($\lambda$), because restricting the parameter space acts as an intrinsic regularizer.
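The overall loop can be sketched as below. This is an illustrative simplification, not a transcription of Algorithm 2: it assumes an exploration probability of `min(1, 1/sqrt(t))`, a fixed confidence radius `beta`, a fixed action set, noiseless feature feedback collected on every round, and a ridge estimate recomputed from scratch on the marked coordinates. The function name and all parameters are hypothetical.

```python
import numpy as np


def ff_oful_sketch(actions, theta_star, T, p_mark=0.5, lam=1.0, beta=1.0,
                   noise_sd=0.1, seed=0):
    """Simplified feature-feedback bandit loop (illustrative assumptions only)."""
    rng = np.random.default_rng(seed)
    n, d = actions.shape
    relevant = np.flatnonzero(theta_star)
    marked = np.zeros(d, dtype=bool)          # features flagged relevant so far
    X_hist, y_hist = [], []
    best = (actions @ theta_star).max()
    inst_regret = np.zeros(T)
    for t in range(1, T + 1):
        # Explore when nothing is marked yet, else with decaying probability.
        explore = (not marked.any()) or (rng.random() < min(1.0, 1.0 / np.sqrt(t)))
        if explore:
            i = int(rng.integers(n))
        else:
            S = np.flatnonzero(marked)        # current restricted feature space
            A = actions[:, S]
            Xs = np.array(X_hist)[:, S]       # all past actions, restricted to S
            V_inv = np.linalg.inv(lam * np.eye(S.size) + Xs.T @ Xs)
            theta_hat = V_inv @ (Xs.T @ np.array(y_hist))
            width = np.sqrt(np.einsum("ij,jk,ik->i", A, V_inv, A))
            i = int(np.argmax(A @ theta_hat + np.sqrt(beta) * width))
        x = actions[i]
        y = x @ theta_star + noise_sd * rng.normal()
        X_hist.append(x)
        y_hist.append(y)
        # Each relevant feature present in x is marked with probability p_mark.
        cand = np.intersect1d(np.flatnonzero(x), relevant)
        marked[cand[rng.random(cand.size) < p_mark]] = True
        inst_regret[t - 1] = best - x @ theta_star
    return np.cumsum(inst_regret), np.flatnonzero(marked)
```

Note that the matrix being inverted has the size of the marked set rather than the ambient dimension, which is the source of the computational savings. A production version would use rank-one updates instead of recomputing the estimate each round.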
4 Regret Analysis
In this section, we state the regret bounds for the FF-OFUL algorithm along with a sketch of the proof and discussion on approaches to improve or generalize the bounds. The more subtle proof details are deferred to the appendix.
4.1 Regret Bound for Algorithm 2 (FF-OFUL)
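Recall that the norm of the actions is bounded by $L$ and the hidden weight vector is also bounded in norm by $S$. Therefore, for any action $x \in \mathcal{X}_t$, the worst-case instantaneous regret can be derived using Cauchy-Schwarz as follows:
$$\langle x_t^*, \theta^* \rangle - \langle x, \theta^* \rangle = \langle x_t^* - x, \theta^* \rangle \le \|x_t^* - x\|_2 \, \|\theta^*\|_2 \le 2LS.$$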
In other words, with high probability, the regret of Algorithm 2 (FF-OFUL) scales like by ignoring constants and logarithmic terms and using the taylor series expansion of . The three terms in the total regret come from the following events. Regret due to:
Exploration to guarantee observing all the relevant features (with high probability).
Exploration after observing all relevant features (due to lack of knowledge of or ).
Exploitation and exploration running OFUL (after having observed all the relevant features).
In practice, feature feedback may be noisy. Sometimes, features that are irrelevant may be marked as relevant. To account for this, we can relax our assumption to allow for a subset of irrelevant features that are mistakenly marked as relevant. Including these features will increase the regret, but the algorithm will still work and the theory goes through without much difficulty, as stated in the following corollary.
The corollary follows by observing that the exploration is not affected by this noise and that the regret of exploitation on the weight vector restricted to the $k + k'$ marked dimensions, where $k'$ is the number of irrelevant features marked as relevant, scales like $(k + k')\sqrt{T}$. This accounts for some features being ambiguous and users erring on the side of marking them as relevant. This only results in slightly higher regret so long as $k + k'$ is still smaller than $d$. One could improve this regret by making additional assumptions on the probabilities of feature selection to weed out the irrelevant features.
4.1.1 Proof Sketch of Main Result
We provide a sketch of the proof here and defer the details to the appendix. Recall that the cumulative regret is the sum of the instantaneous regrets for $t = 1, \dots, T$. We divide the cumulative regret across epochs of doubling size.
This ensures that the last epoch dominates the regret, up to a multiplicative factor. For each epoch, we bound the regret under two events: either all relevant features have been identified (via user feedback) up to that epoch, or not. First, we bound the regret conditioned on the event that all the relevant features have been identified in Lemma 3. This is further, in expectation, broken down into the portion of random actions for pure exploration (Lemma 1) and modified OFUL actions on the $k$-dimensional feature space for exploitation-exploration (Lemma 2). For the pure exploration part, we use the worst case regret bound, but since the exploration probability is decreasing this does not dominate the OFUL term. Second, we bound the probability that some of the relevant features have not been identified so far (Proposition 3); this contributes a constant depending on $k$ and $p$, since the probability becomes zero after enough epochs have passed. We need pure exploration to ensure that the probability that some features are not identified decreases with each passing epoch. The regret in this case is bounded with the worst case regret.
A subtle issue of bounding regret of the actions selected by the OFUL subroutine is that, unlike OFUL, the confidence sets in our algorithm are constructed using additional actions from exploration rounds and past epochs. To accommodate this we prove a regret bound for this variation in Lemma 2. Putting all this together gives us the final result.
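The claim that the last epoch dominates under doubling can be checked with a short calculation. As a sketch, assume (purely for illustration) that epoch $s$ has length $2^s$ and contributes regret at most $C\sqrt{2^s}$ for some constant $C$; neither this indexing nor the constant is taken from the analysis above.

```latex
\sum_{s=0}^{m} C\sqrt{2^{s}}
  = C\,\frac{(\sqrt{2})^{m+1}-1}{\sqrt{2}-1}
  \le \frac{\sqrt{2}}{\sqrt{2}-1}\, C\sqrt{2^{m}}
  = (2+\sqrt{2})\, C\sqrt{2^{m}} .
```

Under this assumption the total over all epochs is within a constant factor (about 3.41) of the bound for the final epoch alone.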
Lower bound. We can use the arguments from (Dani et al., 2008; Rusmevichientong and Tsitsiklis, 2010) to get a lower bound of $\Omega(k\sqrt{T})$. To see this, assume that we know the support. Then any linear bandit algorithm that is run on that support must incur regret of order $k\sqrt{T}$. We do not know the support, but we estimate it with high probability, and therefore the lower bound also applies here. Our algorithm is optimal up to log factors in terms of the dimension.
4.2 Better Early-Regret Bounds
In our analysis, we bound the regret in the rounds before observing all relevant features with the worst case regret. This may be too pessimistic in practice. We present some results to support the idea of restricting the feature space in the short-term horizon and growing the feature space over time. The results also suggest that an additional assumption on the behavior of early regret could lead to better constants in our bounds. Any linear bandit algorithm restricted to the support of $\theta^*$ must incur regret of order $k\sqrt{T}$, so one can only hope to improve the constants of the bound.
Figure 4(a) shows that the average regret of pure exploration has a worse slope than that of OFUL restricted to a subset of the relevant features. We randomly sampled actions from the unit sphere in dimensions and generated with sparsity. The only regret bound one can derive for a pure exploration algorithm that picks actions uniformly at random, independent of the problem instance, is a worst-case cumulative regret bound of . Let be the expected regret of algorithm run on the subset of relevant features . For example, could be the OFUL algorithm. Then represents the expected regret of OFUL only restricted to features in . Suppose the explore-then-commit algorithm first explores for roughly time instances to discover relevant features () followed by an exploitation stage such as OFUL only restricted to features in . The rewards in the exploitation stage can be divided in two parts,
where is the portion of restricted to and . Similarly, the regret can be divided in two parts. Roughly the regret on can be bounded by under certain conditions using the OFUL regret bound. For the regret on , suppose each relevant component of has a mean square value of (for example, this can be achieved with a sparse gaussian model such as those described in Deshpande and Montanari (2012)). This yields where . The worst-case instantaneous regret bound on becomes leading to an improvement in the slope of linear regret by a factor of over pure exploration as seen in Figure 4.
Figure 4(b) shows the average regret of OFUL restricted to feature subsets of different sizes with synthetic data with actions, and . For , we randomly picked subsets of size from the support of . We report the average regret of OFUL for a short horizon, , restricted to random subsets. We also plot average regret of OFUL on the full dimensional data. Figure 4(c) depicts the same with real data from Poulis and Dasgupta (2017) with and sparsity, , we choose random subsets of size from the set of relevant features marked by users (see Section 5 for more details). We report the average regret of OFUL restricted to the features from random subsets for a relatively short time horizon, .
The plots show that, in the short horizon, it may be more beneficial to use a subset of the relevant features than to use the total feature set, which may include many irrelevant features. The intuition is that when OFUL has not seen many samples, it does not have enough information to separate the irrelevant dimensions from the relevant ones. As time goes on (i.e., for longer horizons), OFUL’s relative performance improves since it enjoys sublinear regret, but it would ultimately be a factor of $d/k$ worse than that of the low-dimensional model that includes all relevant features.
In this section, we demonstrate the performance of our algorithm with synthetic and real human-labeled data.
5.1 Results with Synthetic Data
For the synthetic dataset, we simulate a text categorization dataset as follows. Each action corresponds to an article. Generally an article contains only a small subset of the words from the dictionary. Therefore, to simulate documents we generate sparse actions in dimensions. A -sparse reward generating vector, , is chosen at random. This is representative of the fact that in reality a document category probably contains only a few relevant words. The features represent word counts and hence are always positive. Here we have access to therefore for any action , we use the standard linear model (1) for the reward with . The support of is taken as the set of oracle relevant words. For every round, each word from the intersection of the support of the action and oracle relevant words is marked as relevant with probability . Figure 5(a) shows the results of an average of 100 random trials where is sparse with , with actions. As expected, the FF-OFUL algorithm outperforms standard OFUL significantly. Figure 5(b) also shows that the feedback does not hurt the performance much for non-sparse with . Figure 2 compares the performance of FF-OFUL with an explore-then-commit strategy.
5.2 Results with 20Newsgroup Dataset
For real data experiments, we use the 20Newsgroup dataset from Lang (1995). It has documents covering topics such as politics, sports. We choose a subset of 5 topics (misc.forsale, rec.autos, sci.med, comp.graphics, talk.politics.mideast) with approximately 4800 documents posted under these topics. For the word counts, we use the TF-IDF features for the documents which give us approximately features. For the sake of comparing our method with OFUL, we first report and dimensional experiments and then on the full
dimensional data. To do this, we use logistic regression to train a high accuracy sparse classifier to selectfeatures. Then select an additional features at random in order to simulate high dimensional features. We compared OFUL and FF-OFUL algorithms on this data. This is similar to the way Poulis and Dasgupta (2017) ran experiments in the classification setting. We ran only our algorithm on the full dimension data since it was infeasible to run OFUL. For the reward model, we pick one of the articles from the database at random as and the linear reward model in (1) or use the labels to generate binary, one vs many rewards to simulate search for articles from a certain category. In order to come close to simulating a noisy setting, we used the logistic model, with .
5.2.1 Oracle Feedback.
We used the support of the one vs many sparse logistic regression to get an “oracle set of relevant features” for each class. Each word from the intersection of the support of the action and oracle relevant words was marked as relevant with probability . There were about relevant features for each category. Figure 6, shows the performance of OFUL, Explore-then-commit and FF-OFUL on the Newsgroup dataset with oracle feedback. In these simulations averaged over random , FF-OFUL outperforms OFUL and Explore-then-commit significantly. OFUL parameter was tuned to .
5.2.2 Human Feedback.
Poulis and Dasgupta (2017) took of the 20Newsgroup articles from categories and had users annotate relevant words. These are the same categories that we used in the 20Newsgroup results. This is closer to simulating human feedback since we are not using sparse logistic regression to estimate the sparse vectors. Instead, we take the words that users indicated as relevant to be the relevant dimensions. There were relevant features for each category. In Figure 7(a), we can see that FF-OFUL is already outperforming OFUL and Explore-then-commit. This is despite the fact that it is not a very sparse regime. Surprisingly, we found that tuning had little effect on the performance of FF-OFUL whereas it had a significant effect on OFUL (see Figure 7). We believe that this behavior is due to the gradual growth in the number of relevant dimensions as we receive new feedback, which implicitly regularizes the number of free parameters. FF-OFUL also yields significantly better rewards at early stages by exploiting knowledge of relevant features as soon as they are identified, rather than waiting until all or most are found.
5.2.3 Parameter Tuning.
For OFUL we tune the ridge parameter () in the range to pick the one with best performance. All the tuned parameters that were selected for OFUL were strictly inside this range. For and , . For (Newsgroup), . Figure 7(b) demonstrates the sensitivity of OFUL to change in tuning parameter. For FF-OFUL, the remarkable feature is that it does not require parameter tuning so for all experiments.
5.2.4 Full dimension experiments.
Remarkably the performance of our algorithm barely drops in full () feature dimensions as seen in Figure 7(c). It is important to note that the ridge regression parameter () for all the experiments was set to and was not tuned. FF-OFUL is robust to changes in the ambient dimensions and the parameter . Recall that we do not compare the results with OFUL on dimensional data since it would require storing and updating a matrix at each stage.
In this paper we provide an algorithm that incorporates feature feedback in addition to the standard reward feedback. We would like to underline that since this algorithm incrementally grows the feature space, it makes it possible to use the new algorithm in high-dimensional settings where conventional linear bandits are impractical and also makes it less sensitive to the choice of tuning parameters. This behavior could be beneficial in practice since tuning bandit algorithms could be sped up. In the future, it might prove fruitful to augment the feature feedback provided by the user with ideas from compressed sensing to facilitate faster recognition of relevant features.
- Abbasi-Yadkori et al. (2011) Abbasi-Yadkori, Y., Pal, D., and Szepesvari, C. (2011). Improved Algorithms for Linear Stochastic Bandits. Advances in Neural Information Processing Systems (NIPS), pages 1–19.
- Abbasi-Yadkori et al. (2012) Abbasi-Yadkori, Y., Pal, D., and Szepesvari, C. (2012). Online-to-Confidence-Set Conversions and Application to Sparse Stochastic Bandits. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS).
- Abe and Long (1999) Abe, N. and Long, P. M. (1999). Associative reinforcement learning using linear probabilistic concepts. In Proceedings of the International Conference on Machine Learning (ICML), pages 3–11.
- Auer and Long (2002) Auer, P. and Long, M. (2002). Using Confidence Bounds for Exploitation-Exploration Trade-offs. Journal of Machine Learning Research, 3:2002.
- Carpentier and Munos (2012) Carpentier, A. and Munos, R. (2012). Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In International Conference on Artificial Intelligence and Statistics, pages 190–198.
- Chen et al. (2018) Chen, Y., Mac Aodha, O., Su, S., Perona, P., and Yue, Y. (2018). Near-optimal machine teaching via explanatory teaching sets. In International Conference on Artificial Intelligence and Statistics, pages 1970–1978.
- Croft and Das (1989) Croft, W. B. and Das, R. (1989). Experiments with query acquisition and use in document retrieval systems. In Proceedings of the 13th annual international ACM SIGIR conference on Research and development in information retrieval, pages 349–368. ACM.
- Dani et al. (2008) Dani, V., Hayes, T. P., and Kakade, S. M. (2008). Stochastic linear optimization under bandit feedback.
- Deshpande and Montanari (2012) Deshpande, Y. and Montanari, A. (2012). Linear bandits in high dimension and recommendation systems. In 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1750–1754. IEEE.
- Druck et al. (2009) Druck, G., Settles, B., and McCallum, A. (2009). Active learning by labeling features. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1, pages 81–90. Association for Computational Linguistics.
- Ginebra and Clayton (1995) Ginebra, J. and Clayton, M. K. (1995). Response surface bandits. Journal of the Royal Statistical Society. Series B (Methodological), pages 771–784.
- Lang (1995) Lang, K. (1995). Newsweeder: Learning to filter netnews. In Machine Learning Proceedings 1995, pages 331–339. Elsevier.
- Lattimore and Szepesvári (2018) Lattimore, T. and Szepesvári, C. (2018). Bandit algorithms.
- Poulis and Dasgupta (2017) Poulis, S. and Dasgupta, S. (2017). Learning with feature feedback: from theory to practice. In Artificial Intelligence and Statistics, pages 1104–1113.
- Raghavan et al. (2006) Raghavan, H., Madani, O., and Jones, R. (2006). Active learning with feedback on features and instances. Journal of Machine Learning Research, 7(Aug):1655–1686.
- Roads et al. (2016) Roads, B., Mozer, M. C., and Busey, T. A. (2016). Using highlighting to train attentional expertise. PloS one, 11(1):e0146266.
- Rusmevichientong and Tsitsiklis (2010) Rusmevichientong, P. and Tsitsiklis, J. N. (2010). Linearly Parameterized Bandits. Math. Oper. Res., 35(2):395–411.
- Sutton and Barto (1998) Sutton, R. S. and Barto, A. G. (1998). Reinforcement learning: An introduction, volume 1. MIT press Cambridge.
- Yessenalina et al. (2010) Yessenalina, A., Yue, Y., and Cardie, C. (2010). Multi-level structured models for document-level sentiment classification. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1046–1056. Association for Computational Linguistics.
Appendix A Feature Feedback Epoch OFUL
This second algorithm, Feature Feedback Epoch OFUL (Algorithm 3), is an epoch version of Algorithm 2 which runs in epochs of doubling length so the last epoch dominates the regret. It is essentially the same as Algorithm 2 written in a different format which facilitates proving the main result. The main difference in the algorithms is the choice of depicted in Figure 8.
Appendix B Proof of Theorem 2
We begin by proving intermediate results for three different events, followed by the proof details:
- The number of times we pull a random arm during an epoch is close to its expectation.
- We have seen all the relevant arms before the current epoch.
- Modified OFUL regret bound using arms from both exploration and exploitation.
B.1 Bounding the number of times we pull a random arm
During epoch , there are time steps. Let be the number of random arm pulls during epoch . Given that the probability of pulling a random arm during epoch is , then for any :
We can view as a sum of i.i.d. Bernoulli random variables with success probability . It is easy to see that . The result follows by applying Hoeffding’s inequality to this sum of Bernoulli random variables. ∎
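The concentration step can be checked numerically. The sketch below compares the standard two-sided Hoeffding bound, 2·exp(−2t²/n) for a sum of n independent [0, 1]-valued variables, with a Monte Carlo estimate. The epoch length `n`, the random-pull probability `q`, and the deviation `t` are hypothetical values chosen for illustration; they are not the quantities used in Lemma 1.

```python
import math
import random


def hoeffding_tail_bound(n, t):
    """Two-sided Hoeffding bound on P(|N - E[N]| >= t), where N is a sum
    of n independent random variables taking values in [0, 1]."""
    return 2.0 * math.exp(-2.0 * t * t / n)


def empirical_tail(n, q, t, trials, rng):
    """Monte Carlo estimate of P(|N - n*q| >= t) for N ~ Binomial(n, q)."""
    hits = 0
    for _ in range(trials):
        # Each round is a random-arm pull independently with probability q.
        pulls = sum(1 for _ in range(n) if rng.random() < q)
        if abs(pulls - n * q) >= t:
            hits += 1
    return hits / trials


rng = random.Random(0)
n, q, t = 256, 0.1, 20  # hypothetical epoch length, pull probability, deviation
bound = hoeffding_tail_bound(n, t)
estimate = empirical_tail(n, q, t, trials=2000, rng=rng)
print(f"Hoeffding bound: {bound:.4f}, empirical frequency: {estimate:.4f}")
```

The empirical frequency is typically far below the bound, since Hoeffding’s inequality ignores the small variance of a Bernoulli variable with a small success probability.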
With probability :
This is a simple consequence of taking in Lemma 1. ∎
B.2 Probability of having identified all the relevant arms
Let and for . Then:
The number of random arms pulled before an epoch can be bounded as:
with probability .
Let be a random variable:
Let be the event that all the relevant features are marked.
The probability that we have not seen all the relevant arms goes down quickly. Here we characterize how quickly. Note the assumption here: at every round, each relevant feature is revealed with probability at least , independently of the other relevant features.
The probability that not all the relevant features have been marked up to epoch is bounded as follows.
The proof follows by a union bound.
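The union bound argument has the following generic shape, sketched here under explicitly assumed notation: `k` relevant features, each marked independently with probability exactly `p` on each feedback round, over `n` rounds. These symbols and the function names are stand-ins for illustration; the constants in the lemma above may differ.

```python
import math


def miss_probability_union_bound(k, p, n):
    """Union bound on P(some relevant feature is unmarked after n rounds):
    each of the k features is missed with probability (1 - p)**n."""
    return min(1.0, k * (1.0 - p) ** n)


def miss_probability_exact(k, p, n):
    """Exact value of the same probability when every feature is marked
    independently with probability exactly p on each round."""
    return 1.0 - (1.0 - (1.0 - p) ** n) ** k


def rounds_needed(k, p, delta):
    """Smallest n such that the union bound k * (1 - p)**n is at most delta."""
    return max(0, math.ceil(math.log(delta / k) / math.log(1.0 - p)))


k, p, delta = 5, 0.2, 0.01  # hypothetical values
n = rounds_needed(k, p, delta)
print(n, miss_probability_union_bound(k, p, n), miss_probability_exact(k, p, n))
```

Because the miss probability decays geometrically in the number of feedback rounds, the number of rounds needed grows only logarithmically in k/δ.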
Now we can find the number of epochs that need to pass after which we have observed all the features with high probability:
epochs, we have observed all the relevant features with probability .
B.3 Regret for a modification of OFUL after epoch
We cannot use the OFUL regret bound directly since our algorithm involves additional random arms sampled during the epoch along with arms sampled in previous epochs. To bound the regret of arms pulled using OFUL, we prove the following regret bound for the modified OFUL algorithm, stated as Algorithm 4, where some additional arms are sampled in addition to the OFUL ones:
Assume that and , . Then with probability at least , the regret of Extended OFUL (Algorithm 4) satisfies:
where is the ridge regression parameter of OFUL.
This lemma shows that the additional arms sampled between OFUL turns do not harm the regret of OFUL.
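One way to build intuition for this claim is a standard linear-algebra fact: adding positive semidefinite terms to a positive definite design matrix V can only shrink the quadratic form x^T V^{-1} x that controls the confidence width. The two-dimensional sketch below checks this numerically. It is an illustration only, not the paper’s proof, and all names and parameter values in it are hypothetical.

```python
import random


def outer(x):
    """Outer product x x^T of a 2-dimensional vector."""
    return [[x[0] * x[0], x[0] * x[1]], [x[1] * x[0], x[1] * x[1]]]


def add(a, b):
    """Entrywise sum of two 2x2 matrices."""
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]


def inverse(m):
    """Closed-form inverse of an invertible 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


def width_sq(v, x):
    """Squared confidence width x^T V^{-1} x."""
    inv = inverse(v)
    return sum(x[i] * inv[i][j] * x[j] for i in range(2) for j in range(2))


rng = random.Random(1)


def random_arm():
    return [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]


lam = 1.0  # ridge parameter (hypothetical value)
v_oful = [[lam, 0.0], [0.0, lam]]
for _ in range(20):  # design matrix built from the OFUL-selected arms only
    v_oful = add(v_oful, outer(random_arm()))

v_extended = v_oful
for _ in range(10):  # the same matrix after adding extra (random) arms
    v_extended = add(v_extended, outer(random_arm()))

test_arms = [random_arm() for _ in range(100)]
never_wider = all(
    width_sq(v_extended, x) <= width_sq(v_oful, x) + 1e-12 for x in test_arms
)
print(never_wider)
```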
We will require the following result to prove the theorem.
For symmetric positive definite matrices and , we have
and eigenvalue decomposition be. Then we have
The remaining proof follows the proof of Theorem 3 and we state it here for the sake of completeness.
We follow the proof of Theorem 3 in Abbasi-Yadkori et al. (2011), which is divided into two parts. First, they prove that with high probability lies inside the confidence set constructed by OFUL at that time. Notice that the supermartingale arguments used to prove that is inside the confidence set with high probability make no assumption on how the previous arms were sampled, so the argument goes through without modification.
As in Abbasi-Yadkori et al. (2011) we can decompose the instantaneous regret as follows:
where we use the fact that is optimistic and that all lie in the confidence set with high probability. Thus with probability at least , for all
where we used Proposition 5 stated above.
By Lemma 11 in Abbasi-Yadkori et al. (2011) we have,
B.4 Regret after epoch
During each epoch after , we have at most random arm pulls.
For epochs , the cumulative regret is bounded by:
with probability .
The regret during the epoch is the sum of the regret incurred when we pull random arms and the regret incurred when we pull OFUL arms.
For the random arms, we use the upper bound on the number of random arm pulls from Corollary 2. During each random arm pull the worst-case regret is .
The number of times we pull an OFUL arm in epoch , , is trivially upper bounded by . Apply Lemma 2 stated above with , , to get the result. Recall, we cannot apply the OFUL regret bound directly here since our algorithm involves additional random arms sampled during the epoch along with arms sampled in previous epochs. ∎
B.5 Proof of main result
We are now ready to prove the regret bound of Feature Feedback Epoch OFUL.
The regret can be summed over the epochs as:
Now, note that:
Now, setting and , we get the final regret expression using Lemma 3. The multiplicative factor of comes from bounding the sum of the regrets over the epochs by the maximum regret over all epochs (which occurs during the last epoch) multiplied by the number of epochs, which is .
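The last step has the following generic form. The notation is assumed for illustration only: $R_s$ denotes the regret incurred in epoch $s$, epochs are indexed $s = 0, \dots, S$, epoch $s$ is taken to have $2^s$ rounds, and $T \ge 1$ is the horizon. The exact constants in the paper’s statement may differ.

```latex
\sum_{s=0}^{S} R_s \;\le\; (S+1)\,\max_{0 \le s \le S} R_s,
\qquad
S + 1 \;=\; \lceil \log_2 (T+1) \rceil \;=\; \lfloor \log_2 T \rfloor + 1 .
```

The second identity holds because $S+1$ epochs of lengths $1, 2, \dots, 2^{S}$ cover $2^{S+1} - 1$ rounds, so $S+1$ is the smallest integer with $2^{S+1} - 1 \ge T$.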
The proof for Feature Feedback OFUL follows similarly, by noticing that the two algorithms are essentially the same with different and using the fact that for .