The multi-armed bandit (MAB) is a framework for sequential decision making where, at every time step, the learner selects (or “pulls”) one of several possible actions (or “arms”), and receives a reward based on the selected action. The regret of the learner is the difference between the maximum possible reward and the reward resulting from the chosen action. In the classical MAB setting, the goal is to minimize the sum of all regrets, or cumulative regret, which naturally leads to an exploration/exploitation trade-off problem (Auer et al., 2002a). If the learner explores too little, it may never find an optimal arm which will increase its cumulative regret. If the learner explores too much, it may select sub-optimal arms too often which will also increase its cumulative regret. There are a variety of algorithms that solve this exploration/exploitation trade-off problem (Auer et al., 2002a; Auer, 2002; Auer et al., 2002b; Agrawal and Goyal, 2012; Bubeck et al., 2012).
The contextual bandit problem extends the classical MAB setting, with the addition of time-varying side information, or context, made available at every time step. The best arm at every time step depends on the context, and intuitively the learner seeks to determine the best arm as a function of context. To date, work on contextual bandits has studied cumulative regret minimization, which is motivated by applications in health care, web advertisement recommendations and news article recommendations (Li et al., 2010). The contextual bandit setting is also called associative reinforcement learning (Auer, 2002) and linear bandits (Agrawal and Goyal, 2012; Abbasi-Yadkori et al., 2011).
In classical (non-contextual) MABs, the goal of the learner isn’t always to minimize the cumulative regret. In some applications, there is a pure exploration phase during which the learner incurs no regret (i.e., no penalty for sub-optimal decisions), and performance is measured in terms of simple regret, which is the regret assessed at the end of the pure exploration phase. For example, in top-arm identification, the learner must guess the arm with highest expected reward at the end of the exploration phase. Simple regret minimization clearly motivates different strategies, since there is no penalty for sub-optimal decisions during the exploration phase. Fixed budget and fixed confidence are the two main theoretical frameworks in which simple regret is generally analyzed (Gabillon et al., 2012; Jamieson and Nowak, 2014; Garivier and Kaufmann, 2016; Carpentier and Valko, 2015).
In this paper, we extend the idea of simple regret minimization to contextual bandits. In this setting, there is a pure exploration phase during which no regret is incurred, followed by a pure exploitation phase during which regret is incurred, but there is no feedback so the learner cannot update its policy. To our knowledge, previous work has not proposed algorithms tailored to this setting. Guan and Jiang (2018) provide simple regret guarantees for the policy of uniform sampling of arms in the i.i.d. setting. The contextual bandit algorithm of Tekin and van der Schaar (2015) also has distinct exploration and exploitation phases, but unlike our setting, the agent has control over which phase it is in, i.e., when it wants to receive feedback. In the work of Hoffman et al. (2014); Soare et al. (2014); Libin et al. (2017); Xu et al. (2018) there is a single best arm even when contexts are observed (directly or indirectly). Our algorithm, Contextual-Gap, generalizes the idea of Gabillon et al. (2012) and Hoffman et al. (2014) to the contextual bandits setting.
We make the following contributions: 1. We formulate a novel problem: that of simple regret minimization for contextual bandits. 2. We develop an algorithm, Contextual-Gap, for this setting. 3. We present performance guarantees on the simple regret in the fixed budget framework. 4. We present experimental results for multiclass online classification with partial feedback, and for adaptive sensor selection in nano-satellites.
The paper is organized as follows. In section 2, we motivate the new problem based on the real-life application of magnetometer selection in spacecraft. In section 3, we state the problem formally, and to solve this new problem, we present the Contextual-Gap algorithm in section 4. In section 5, we present the learning theoretic analysis and in section 6, we present and discuss experimental results. Section 7 concludes.
Our work is motivated by autonomous systems that go through an initial training phase (the pure exploration phase) where they learn how to accomplish a task without being penalized for sub-optimal decisions, and then are deployed in an environment where they no longer receive feedback, but regret is incurred (the pure exploitation phase).
An example scenario arises in the problem of estimating weak interplanetary magnetic fields (Figure 1) in the presence of noise using resource-constrained spacecraft known as nano-satellites or CubeSats. Spacecraft systems generate their own spatially localized magnetic field noise due to large numbers of time-varying current paths in the spacecraft. Historically, with large spacecraft, such noise was minimized by physically separating the sensor from the spacecraft using a rigid boom. In highly resource-constrained satellites such as nano-satellites, however, structural constraints limit the use of long rigid booms, requiring sensors to be close to or inside the CubeSat (Figure 2). Thus, recent work has focused on nano-satellites equipped with multiple magnetic field sensors (magnetometers) (Sheinker and Moldwin, 2016).
A natural problem in nano-satellites is that of determining the best sensor to actuate at any given time. Power constraints motivate the selection of a single sensor at each time step. Furthermore, the best sensor changes with time. This stems from the time-varying localization of noise in the spacecraft, which in turn results from different operational events such as data transmission, spacecraft maneuvers, and power generation. This dynamic sensor selection problem is readily cast as a contextual bandit problem. The context is given by the spacecraft’s telemetry system which provides real-time measurements related to spacecraft operation, including solar panel currents, temperatures, momentum wheel information, and real-time current consumption (Springmann and Cutler, 2012).
In this application, however, conventional contextual bandit algorithms are not applicable because feedback is not always available. Feedback requires knowledge of sensor noise, which in turn requires knowledge of the true magnetic field. Yet the true magnetic field is known only during certain portions of a spacecraft’s orbit (e.g., when the satellite is near other spacecraft, or when the earth shields the satellite from sun-induced magnetic fields). Moreover, when the true magnetic field is known, there is no need to estimate the magnetic field in the first place! This suggests a learning scenario where the agent (the sensor scheduler) operates in two phases, one where it has feedback but incurs no regrets (because the field being estimated is known), and another where it does not receive feedback, but nonetheless needs to produce estimates. This is precisely the problem we study.
In the magnetometer problem defined above, the exploration and exploitation times occur in phases, as the satellite moves into and out of regions where the true magnetic field is known. For simplicity, we will address the problem in which the first time steps belong to the exploration phase, and all subsequent time steps to the exploitation phase. Nonetheless, the algorithm we introduce can switch between phases indefinitely, and does not need to know in advance when a new phase is beginning.
Sensor management, adaptive sensing, and sequential resource allocation have historically been viewed in the decision process framework where the learner takes actions on selecting the sensor based on previously collected data. There have been many proposed solutions based on Markov decision processes (MDPs) and partially observable MDPs, with optimality bounds for cumulative regret (Hero and Cochran, 2011; Castanon, 1997; Evans and Krishnamurthy, 2001; Krishnamurthy, 2002; Chong et al., 2009). In fact, sensor management and sequential resource allocation were among the original motivating settings for the classical MAB problem (Mahajan and Teneketzis, 2008; Bubeck et al., 2012; Hero and Cochran, 2011), again with the goal of cumulative regret minimization. We are interested in an adaptive sensing setting where the optimal decisions and rewards also depend on the context, but where the actions can be separated into pure exploration and pure exploitation phases, with no regret during exploration, and with no feedback during pure exploitation.
3 Formal Setting
We denote the context space as . Let denote the sequence of observed contexts. Let the total number of arms be . For each , the learner is required to choose an arm , where .
For arm , let be a function that maps context to expected reward when arm is selected. Let denote the arm selected at time , and assume the reward at time obeys , where is noise (described in more detail below). We assume that for each , belongs to a reproducing kernel Hilbert space (RKHS) defined on . The first time steps belong to the exploration phase where the learner observes context , chooses arm and obtains reward . The time steps after belong to an exploitation phase where the learner observes context , chooses arm and earns an implicit reward that is not returned to the learner.
We assume that the noise is a zero-mean, -conditionally sub-Gaussian random variable, i.e., is such that for some and ,
Here is the history at time (see supplementary material for additional details).
We also define the following terms. Let be the set of all time indices when arm was selected up to time and set . Let be the data matrix whose columns are and similarly let
denote the column vector of rewards. Thus, and .
3.1 Problem Statement
At every time step , the learner observes context . During the exploration phase , the learner chooses a series of actions to explore and learn the mappings from context to reward. During the exploitation phase , the goal is to select the best arm as a function of context. We define the simple regret associated with choosing arm , given context , as:
where is the expected reward for the best arm for context . The learner aims to minimize the simple regret for . To be more precise, let be the fixed policy mapping context to arm during the exploitation phase. The goal is to determine policies for the exploration and exploitation phases such that for all and
where is an expression that decreases to 0 as .
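As a small illustration of the quantity being minimized, the following Python sketch computes the simple regret for a single context as the gap between the best expected reward and the expected reward of the recommended arm. The function name `simple_regret` and the reward values are hypothetical and serve only to illustrate the definition above.

```python
import numpy as np

def simple_regret(expected_rewards, chosen_arm):
    """Simple regret for one context.

    expected_rewards[a] plays the role of the expected reward of arm a
    at the given context (values here are made up for illustration).
    """
    expected_rewards = np.asarray(expected_rewards, dtype=float)
    return float(expected_rewards.max() - expected_rewards[chosen_arm])

# Hypothetical expected rewards of three arms for one context.
f_x = [0.2, 0.9, 0.5]
regret_best = simple_regret(f_x, chosen_arm=1)   # recommending the best arm
regret_other = simple_regret(f_x, chosen_arm=2)  # recommending a sub-optimal arm
print(regret_best, regret_other)
```

The regret is zero exactly when the recommended arm attains the maximum expected reward for that context.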
The following section presents an algorithm to solve this problem.
4.1 Estimating Expected Rewards
A key ingredient of our extension is an estimate of , for each , based on the current history. We use kernel methods to estimate . Let be a symmetric positive definite kernel function on , be the corresponding RKHS and be the associated canonical feature map. Let . We define the kernel matrix associated with as and the kernel vector of context as . Let be the identity matrix of size . We estimate at time , via kernel ridge regression, i.e.,
The solution to this optimization problem is . Furthermore, Durand et al. (2018) establish a confidence interval for in terms of and the “variance” .
Theorem 4.1 (Restatement of Theorem 2.1 in Durand et al. (2018)).
In the supplementary material we show that . For convenience, we denote the width of the confidence interval . Thus, the upper and lower confidence bounds of are and . The upper confidence bound is the most optimistic estimate of the reward and the lower confidence bound is the most pessimistic estimate of the reward.
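The per-arm estimate and its confidence bounds can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the estimate is the standard kernel ridge regression solution with a Gaussian kernel, the "variance" is taken to be the usual regularized kernel variance, and the width multiplier `beta` is a hypothetical placeholder rather than the constant from Theorem 4.1. The names `gaussian_kernel` and `krr_bounds` are introduced here for illustration only.

```python
import numpy as np

def gaussian_kernel(X, Z, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Z."""
    sq = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Z**2, axis=1)[None, :]
          - 2.0 * X @ Z.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def krr_bounds(X_a, y_a, x, lam=1.0, beta=1.0, bandwidth=1.0):
    """Return (estimate, lower, upper) for one arm at context x.

    X_a: (n, d) contexts on which this arm was pulled.
    y_a: (n,) rewards observed for those pulls.
    beta: hypothetical width multiplier (an assumption, see lead-in).
    """
    x = np.atleast_2d(np.asarray(x, dtype=float))
    K = gaussian_kernel(X_a, X_a, bandwidth)        # kernel matrix of the arm
    k_x = gaussian_kernel(X_a, x, bandwidth)[:, 0]  # kernel vector of x
    A = K + lam * np.eye(len(X_a))
    mean = k_x @ np.linalg.solve(A, y_a)            # kernel ridge regression estimate
    var = gaussian_kernel(x, x, bandwidth)[0, 0] - k_x @ np.linalg.solve(A, k_x)
    width = beta * np.sqrt(max(var, 0.0) / lam)     # assumed form of the interval width
    return mean, mean - width, mean + width

# Usage on synthetic data: the interval is narrower near observed contexts.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.sin(X[:, 0])
print(krr_bounds(X, y, X[0]))
print(krr_bounds(X, y, np.array([10.0, 10.0])))
```

Solving the linear system with `np.linalg.solve` avoids forming the matrix inverse explicitly, which is generally the more stable choice.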
4.2 Contextual-Gap Algorithm
During the exploration phase, the Contextual-Gap algorithm proceeds as follows. First, the algorithm has a burn-in period where it cycles through the arms (ignoring context) and pulls each one times. Following this burn-in phase, when the algorithm is presented with context at time , the algorithm identifies two candidate arms, and , as follows. For each arm the contextual gap is defined as . is the arm that minimizes and is the arm (excluding ) whose upper confidence bound is maximized. Among these two candidates, the one with the widest confidence interval is selected. Note that one can rewrite which clearly shows that is the best arm considering a pessimistic estimate of the reward.
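The exploration procedure described above can be sketched in Python as follows. This is an illustrative sketch, not the authors' Algorithm 1. It assumes the gap of an arm takes the form used by Gabillon et al. (2012), i.e., the largest upper confidence bound among the other arms minus the arm's own lower confidence bound. The callbacks `pull` and `bounds_fn`, and the parameter name `burn_in`, are hypothetical.

```python
import numpy as np

def select_arm(lower, upper):
    """Choose between the two candidate arms given per-arm confidence bounds."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    n_arms = len(lower)
    # Assumed gap: best upper bound among the other arms minus own lower bound.
    gaps = np.array([np.max(np.delete(upper, a)) - lower[a] for a in range(n_arms)])
    J = int(np.argmin(gaps))            # arm with the smallest gap
    masked = upper.copy()
    masked[J] = -np.inf
    j = int(np.argmax(masked))          # best upper bound, excluding J
    widths = upper - lower
    return J if widths[J] >= widths[j] else j  # pull the more uncertain candidate

def explore(contexts, pull, bounds_fn, n_arms, burn_in):
    """Run the exploration phase and return the per-arm history.

    pull(arm, x) returns a reward; bounds_fn(history_a, x) returns (lower, upper).
    Both are hypothetical callbacks supplied by the caller.
    """
    history = [[] for _ in range(n_arms)]
    for t, x in enumerate(contexts):
        if t < n_arms * burn_in:
            arm = t % n_arms            # burn-in: cycle through arms, ignoring context
        else:
            bounds = [bounds_fn(history[a], x) for a in range(n_arms)]
            lower, upper = zip(*bounds)
            arm = select_arm(lower, upper)
        history[arm].append((x, pull(arm, x)))
    return history

# Toy usage with context-free bounds (sample mean plus/minus 1/sqrt(n)).
def mean_bounds(history_a, x):
    rewards = [r for _, r in history_a]
    m, w = float(np.mean(rewards)), 1.0 / np.sqrt(len(rewards))
    return m - w, m + w

toy_rewards = [0.2, 0.8, 0.5]
hist = explore(range(30), lambda a, x: toy_rewards[a], mean_bounds, n_arms=3, burn_in=2)
print([len(h) for h in hist])
```

In a kernelized version, `bounds_fn` would be replaced by the kernel ridge regression estimate and confidence width for the arm at the current context.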
In the exploitation phase, for a given context , the contextual gaps for all time steps in the exploration phase are evaluated. The arm with the smallest gap over the entire exploration phase for the given context is chosen as the best arm associated with context . Because there is no feedback during the exploitation phase, the algorithm moves to the next exploitation step without modification to the learning history. The exact description is presented in Algorithm 1.
During the exploitation phase, looking back at all history may be computationally prohibitive. Thus, in practice, we just select the best arm as . As described in the experimental section, this works well in practice. Theoretically, has to be bigger than a certain number defined in Lemma 5.2, but for experimental results we keep .
4.3 Comparison of Contextual-Gap and Kernel-UCB
In this section, we illustrate the difference between the policies of Kernel-UCB (which minimizes cumulative regret) and the exploration phase of Contextual-Gap (which aims to minimize simple regret). At each time step, Contextual-Gap selects one of two arms: , the arm with highest pessimistic reward estimate, or , the arm excluding with highest optimistic reward estimate. Kernel-UCB, in contrast, selects the arm with the highest optimistic reward estimate (i.e., with the maximum upper confidence bound).
Case 1 (Figure 3): In this case, Kernel-UCB would pick arm 1, because it has the maximum upper confidence bound. Kernel-UCB’s policy is designed to be optimistic in the case of uncertainty. In the Contextual-Gap, we first calculate which minimizes . Note that , and . In this case, and hence . Finally, Contextual-Gap would choose among arm 1 and arm 2, and would finally choose arm 1 because it has the largest confidence interval. Hence, in case 1, Contextual-Gap chooses the same arm as that of Kernel-UCB.
Case 2 (Figure 4): In this case, Kernel-UCB would pick arm 1. Note that , and . Then and hence . Finally, Contextual-Gap chooses arm 2, because it has the widest confidence interval. Hence, in case 2, Contextual-Gap chooses a different arm compared to that of Kernel-UCB.
Clearly, the use of the lower confidence bound along with the upper confidence bound allows Contextual-Gap to explore more than Kernel-UCB. However, Contextual-Gap doesn’t explore just any arm, but rather it explores only among arms with some likelihood of being optimal. The following section details high probability bounds on the simple regret of the Contextual-Gap algorithm.
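The two cases can be mimicked numerically with the short Python sketch below. The confidence bounds are hypothetical values chosen to reproduce the qualitative behavior described in the text; they are not read off Figures 3 and 4. The gap is again assumed to be the largest upper bound among the other arms minus the arm's own lower bound, and the function names are introduced here for illustration.

```python
import numpy as np

def kernel_ucb_choice(upper):
    """Kernel-UCB style rule: pick the arm with the largest upper confidence bound."""
    return int(np.argmax(upper))

def contextual_gap_choice(lower, upper):
    """Contextual-Gap style rule under the assumed gap definition."""
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    gaps = np.array([np.delete(upper, a).max() - lower[a] for a in range(len(upper))])
    J = int(np.argmin(gaps))
    rest = np.where(np.arange(len(upper)) == J, -np.inf, upper)
    j = int(np.argmax(rest))
    widths = upper - lower
    return J if widths[J] >= widths[j] else j

# Hypothetical (lower, upper) bounds for three arms in each case.
cases = {
    "case 1": ([0.1, 0.5, 0.0], [0.9, 0.7, 0.3]),
    "case 2": ([0.6, 0.0, 0.1], [0.9, 0.8, 0.3]),
}
for name, (lower, upper) in cases.items():
    ucb = kernel_ucb_choice(upper) + 1            # report arms 1-indexed
    gap = contextual_gap_choice(lower, upper) + 1
    print(name, "Kernel-UCB -> arm", ucb, "| Contextual-Gap -> arm", gap)
```

With these numbers both rules select arm 1 in the first case, while in the second case the upper-confidence rule still selects arm 1 and the gap-based rule selects arm 2, whose interval is wider.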
5 Learning Theoretic Analysis
We now analyze high probability simple regret bounds which depend on the gap quantity . The bounds are presented in the non-i.i.d. setting described in Section 3. For the confidence interval to be useful, it needs to shrink to zero with high probability over the feature space as each arm is pulled more and more. This requires the smallest non-zero eigenvalue of the sample covariance matrix of the data for each arm to be lower bounded by a certain value. We make an assumption that allows for such a lower bound, and use it to prove that the confidence intervals shrink with high probability under certain assumptions. Finally, we bound the simple regret using the result of shrinking confidence interval, the gap quantity, and the special exploration strategy described in Algorithm 1. We now make additional assumptions to the problem setting.
, is a random process on compact space endowed with a finite positive Borel measure.
Kernel is bounded by a constant , the canonical feature map of is a continuous function, and is separable.
We denote and by the largest eigenvalue of a compact self adjoint operator . For a context , the operator is a compact self-adjoint operator. Based on this notation, we make the following assumption:
There exists a subspace of dimension with projection , and a constant , such that , for and .
eigenvalue of the cumulative second moment operator so that it is possible to learn the reward behavior in the low energy directions of the context at the same rate as the high energy ones with high probability.
We now provide a lower bound on the eigenvalue of a compact self-adjoint operator. There are similar results in the setting where reward is a linear function of context, including Lemma 2 in Gentile et al. (2014) and Lemma 7 in Li and Zhang (2018), which provide lowest eigenvalue bounds under the assumption of linear reward and full rank covariance, and Theorem 2.2 in Tu and Recht (2017), which assumes more structure in the generated contexts. We extend these results to the setting of a compact self-adjoint operator with data occupying a finite dimensional subspace. Let . By construction and Assumption 3 we can show that has non-zero eigenvalues (see Section 4.1 in the supplementary material).
Lemma 5.1 (Lower bound on Eigen-value of compact self-adjoint operators).
Lemma 5.1 provides high probability lower bounds on the minimum nonzero eigenvalue of the cumulative second moment operator . Using the preceding lemma and the confidence interval defined in Theorem 4.1, it is possible to provide high probability monotonic bounds on the confidence interval widths .
Lemma 5.2 (Monotonic upper bound of ).
The condition results in a minimum number of tries that arm has to be selected before any bound will hold. In , the first and third term in the are needed so that we can give concentration bounds on eigenvalues and prove that the confidence width shrinks. The second term is needed because one has to get at least contexts for every arm so that at least some energy is added to the lowest eigenvalues.
These high probability monotonic upper bounds on the confidence estimate can be used to upper bound the simple regret. The upper bound depends on a context-based hardness quantity defined for each arm (similar to Hoffman et al. (2014)) as
Denote its lowest value as . Let total hardness be defined as (Note that ). The recommended arm after time is defined as
from Algorithm 1. We now upper bound the simple regret as follows:
Note that the term in (5) grows logarithmically in (see supplementary material). For to be positive, should be greater than . We compare the term in our bound with the uniform sampling technique in Guan and Jiang (2018) which leads to a bound that decays like , where , is the context dimension, and and are constants. In our case, the decay rate has the form for constants . Clearly, our bound is superior for .
6 Experimental Results and Discussion
We present results from two different experimental setups, first from online multiclass classification with partial feedback, and second from a lab generated non-i.i.d. spacecraft magnetic field as described in Section 2. The datasets were split into cross-validation and evaluation datasets and each of those datasets was further split into exploration and exploitation phases. Cross validation was performed to minimize average simple regret for the exploitation phase while training with the exploration phase, both from the cross validation dataset. The value of selected in both the cross validation and evaluation datasets were of similar magnitude. Evaluation of the algorithm for average simple regret behavior is performed with the evaluation dataset.
We present average simple regret comparisons of the Contextual-Gap algorithm against four baselines:
Uniform Sampling: We equally divide the exploration budget among arms and learn a reward estimating function for each arm during the exploration phase. During the exploitation phase, we select the best arm based on estimated reward function .
Epsilon Greedy: At every step, we select the best arm (according to estimated ) with probability and other arms with probability . We use , where is the time step.
Kernel-UCB: We implement kernel-UCB from Valko et al. (2013).
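The two simplest baselines can be sketched in Python as follows. This is only an illustration under stated assumptions: the greedy arm is taken with probability one minus epsilon and a uniformly random other arm otherwise, and the decay schedule in the usage example is a hypothetical placeholder, since the schedule used in the experiments is not recoverable from the text above. The function names are introduced here for illustration.

```python
import numpy as np

def uniform_sampling_arm(t, n_arms):
    """Uniform sampling: cycle through the arms so the budget is split equally."""
    return t % n_arms

def epsilon_greedy_arm(estimates, epsilon, rng):
    """Pick the arm with the highest estimated reward with probability 1 - epsilon,
    otherwise pick uniformly among the remaining arms (assumed exploration rule)."""
    estimates = np.asarray(estimates, dtype=float)
    best = int(np.argmax(estimates))
    if len(estimates) == 1 or rng.random() >= epsilon:
        return best
    others = [a for a in range(len(estimates)) if a != best]
    return int(rng.choice(others))

# Usage with a hypothetical decaying schedule epsilon_t = min(1, 1 / sqrt(t)).
rng = np.random.default_rng(0)
estimates = [0.1, 0.7, 0.4]
picks = [epsilon_greedy_arm(estimates, min(1.0, 1.0 / np.sqrt(t)), rng)
         for t in range(1, 101)]
print(np.bincount(picks, minlength=3))
```

In the experiments described here, the estimates would come from the kernel ridge regression fit of each arm at the current context.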
For all the algorithms, we use the Gaussian kernel and tune the bandwidth of the kernel, and the regularization parameter. The exploration parameter is set to for the results in this section and we show results for different values of in the supplementary material. The code to reproduce our results is available at https://www.dropbox.com/sh/0f6ycz6x9kaprl3/AACUFHyNgT6eSBl5s2VhuM5ga?dl=0.
6.1 Multi-class Classification
We present results of contextual simple regret minimization for multiclass datasets. At every time step, we observe a feature vector and need to select the class to which the example belongs. Each class is treated like an arm or action. If we select the best arm (true class) we get a reward of one, otherwise we get a reward of zero. This setting is different from standard online multiclass classification, because we don’t learn the true class if our selection is wrong. We present results over three multiclass datasets: MNIST (LeCun et al., 1998), USPS (Hull, 1994) and Letter (Hsu and Lin, 2002). Figure 5 shows the variation of the average simple regret with increasing exploration phase for five algorithms. The dataset for evaluation of simple regret was kept constant. Since the datasets are i.i.d in nature, multiple simple regret evaluations are performed by shuffling the evaluation datasets, and the average curves are reported. Note that the algorithms have been cross validated for simple regret minimization. The plots are generated by varying the length of the exploration phase and keeping the exploitation dataset constant for evaluation of simple regret. It can be seen that the simple regret of the Contextual-Gap converges faster than the simple regret of other baselines.
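The partial-feedback protocol described above can be sketched in Python as follows. The sketch assumes noise-free labels, so the best arm for every example earns reward one and the simple regret of a recommendation is one minus its reward. The function names and the toy policy are hypothetical.

```python
import numpy as np

def bandit_feedback(true_label, chosen_class):
    """Reward 1 if the chosen class is the true class, else 0.

    Only this scalar is returned to the learner; the true label is never revealed.
    """
    return 1.0 if chosen_class == true_label else 0.0

def average_simple_regret(policy, X_eval, y_eval):
    """Average simple regret of a fixed exploitation policy on held-out data.

    With 0/1 rewards and noise-free labels, the per-example regret is
    1 - reward, so the average equals the misclassification rate.
    """
    regrets = [1.0 - bandit_feedback(y, policy(x)) for x, y in zip(X_eval, y_eval)]
    return float(np.mean(regrets))

# Toy usage: a policy that always recommends class 0.
X_eval = np.zeros((4, 2))
y_eval = [0, 1, 0, 2]
print(average_simple_regret(lambda x: 0, X_eval, y_eval))
```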
6.2 Experimental Spacecraft Magnetic Field Dataset
We present the experimental setup and results associated with a lab generated, realistic spacecraft magnetic field dataset with non-i.i.d contexts. In spacecraft magnetic field data, we are interested in identifying the least noisy sensor for every time step (see Section 2). The dataset was generated with contexts consisting of measured variables associated with the electrical behavior of the GRIFEX spacecraft (Norton et al., 2012; Cutler et al., 2015), and reward is the negative of the magnitude of the sensor noise measured at every time step.
Data were collected using 3 sensors (arms), and sensor readings were downloaded for all three sensors at all time steps, although the algorithm does not know these in advance and must select one sensor at each time step. The context information was used in conjunction with a realistic simulator to generate spacecraft magnetic field, and hence a realistic model of sensor noise, as a function of context. The true magnetic field was computed using models of the earth’s magnetic field.
Figure 4(d) shows the simple regret minimization curves for the spacecraft dataset, and even in this case Contextual-Gap converges faster than the other algorithms. Note that, in addition to the non-i.i.d. nature, there exists large variability in reward for certain regions of the context space.
In this work, we present a novel problem: that of simple regret minimization in the contextual bandit setting. We propose the Contextual-Gap algorithm, give a regret bound for the simple regret, and show empirical results on three multiclass datasets and one lab-based spacecraft magnetometer dataset. It can be seen that in this scenario persistent and efficient exploration of the best and second best arms with the Contextual-Gap algorithm provides improved results compared against algorithms designed to optimize cumulative regret.
- Abbasi-Yadkori et al. (2011) Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
- Agrawal and Goyal (2012) Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39–1, 2012.
- Auer (2002) Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
- Auer et al. (2002a) Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002a.
- Auer et al. (2002b) Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002b.
- Bercovici et al. (2009) H Bercovici, WS Li, and Dan Timotin. The horn conjecture for sums of compact self-adjoint operators. American Journal of Mathematics, 131(6):1543–1567, 2009.
- Bubeck et al. (2012) Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
- Carpentier and Valko (2015) Alexandra Carpentier and Michal Valko. Simple regret for infinitely many armed bandits. In International Conference on Machine Learning, pages 1133–1141, 2015.
- Castanon (1997) David A Castanon. Approximate dynamic programming for sensor management. In Decision and Control, 1997., Proceedings of the 36th IEEE Conference on, volume 2, pages 1202–1207. IEEE, 1997.
- Chong et al. (2009) Edwin KP Chong, Christopher M Kreucher, and Alfred O Hero. Partially observable Markov decision process approximations for adaptive sensing. Discrete Event Dynamic Systems, 19(3):377–422, 2009.
- Chowdhury and Gopalan (2017) Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In International Conference on Machine Learning, pages 844–853, 2017.
- Cutler et al. (2015) James. W Cutler, Charles Lacy, Tyler Rose, So-hee Kang, David Rider, and Charles Norton. An update on the GRIFEX mission. Cubesat Developer’s Workshop, 2015.
- Durand et al. (2018) Audrey Durand, Odalric-Ambrym Maillard, and Joelle Pineau. Streaming kernel regression with provably adaptive mean, variance, and regularization. Journal of Machine Learning Research, 19(August), 2018.
- England et al. (2018) Nathanael England, James. W Cutler, and Srinagesh Sharma. Tandem beacon experiment-tbex design overview and lessons learned. Cubesat Developer’s Workshop, 2018.
- Evans and Krishnamurthy (2001) Jamie Evans and Vikram Krishnamurthy. Optimal sensor scheduling for hidden Markov model state estimation. International Journal of Control, 74(18):1737–1742, 2001.
- Gabillon et al. (2012) Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric. Best arm identification: A unified approach to fixed budget and fixed confidence. In Advances in Neural Information Processing Systems, pages 3212–3220, 2012.
- Garivier and Kaufmann (2016) Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pages 998–1027, 2016.
- Gentile et al. (2014) Claudio Gentile, Shuai Li, and Giovanni Zappella. Online clustering of bandits. In International Conference on Machine Learning, pages 757–765, 2014.
- Guan and Jiang (2018) Melody Y Guan and Heinrich Jiang. Nonparametric stochastic contextual bandits. The 32nd AAAI Conference on Artificial Intelligence, 2018.
- Hero and Cochran (2011) Alfred O Hero and Douglas Cochran. Sensor management: Past, present, and future. IEEE Sensors Journal, 11(12):3064–3075, 2011.
- Hoffman et al. (2014) Matthew Hoffman, Bobak Shahriari, and Nando Freitas. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In Artificial Intelligence and Statistics, pages 365–374, 2014.
- Hsu and Lin (2002) Chih-Wei Hsu and Chih-Jen Lin. A comparison of methods for multiclass support vector machines. IEEE transactions on Neural Networks, 13(2):415–425, 2002.
- Hull (1994) Jonathan J. Hull. A database for handwritten text recognition research. IEEE Transactions on pattern analysis and machine intelligence, 16(5):550–554, 1994.
- Jamieson and Nowak (2014) Kevin Jamieson and Robert Nowak. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In Information Sciences and Systems (CISS), 2014 48th Annual Conference on, pages 1–6. IEEE, 2014.
- Krishnamurthy (2002) Vikram Krishnamurthy. Algorithms for optimal scheduling and management of hidden Markov model sensors. IEEE Transactions on Signal Processing, 50(6):1382–1397, 2002.
- LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Li et al. (2010) Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670. ACM, 2010.
- Li and Zhang (2018) Shuai Li and Shengyu Zhang. Online clustering of contextual cascading bandits. In The 32nd AAAI Conference on Artificial Intelligence, 2018.
- Libin et al. (2017) Pieter Libin, Timothy Verstraeten, Diederik M Roijers, Jelena Grujic, Kristof Theys, Philippe Lemey, and Ann Nowé. Bayesian best-arm identification for selecting influenza mitigation strategies. arXiv preprint arXiv:1711.06299, 2017.
- Mahajan and Teneketzis (2008) Aditya Mahajan and Demosthenis Teneketzis. Multi-armed bandit problems. In Foundations and Applications of Sensor Management, pages 121–151. Springer, 2008.
- Micchelli et al. (2006) Charles A Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. Journal of Machine Learning Research, 7(Dec):2651–2667, 2006.
- Minsker (2017) Stanislav Minsker. On some extensions of Bernstein’s inequality for self-adjoint operators. Statistics & Probability Letters, 127(C):111–119, 2017.
- Motwani and Raghavan (1995) Rajeev Motwani and Prabhakar Raghavan. Tail Inequalities, page 67–100. Cambridge University Press, 1995. doi: 10.1017/CBO9780511814075.005.
- Norton et al. (2012) Charles D Norton, Michael P Pasciuto, Paula Pingree, Steve Chien, and David Rider. Spaceborne flight validation of NASA ESTO technologies. In Geoscience and Remote Sensing Symposium (IGARSS), 2012 IEEE International, pages 5650–5653. IEEE, 2012.
- Sheinker and Moldwin (2016) Arie Sheinker and Mark B Moldwin. Adaptive interference cancelation using a pair of magnetometers. IEEE Transactions on Aerospace and Electronic Systems, 52(1):307–318, 2016.
- Soare et al. (2014) Marta Soare, Alessandro Lazaric, and Rémi Munos. Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems, pages 828–836, 2014.
- Springmann and Cutler (2012) John C Springmann and James W Cutler. Attitude-independent magnetometer calibration with time-varying bias. Journal of Guidance, Control, and Dynamics, 35(4):1080–1088, 2012.
- Steele (2004) J Michael Steele. The Cauchy-Schwarz master class: an introduction to the art of mathematical inequalities. Cambridge University Press, 2004.
- Tekin and van der Schaar (2015) Cem Tekin and Mihaela van der Schaar. Releaf: An algorithm for learning and exploiting relevance. IEEE Journal of Selected Topics in Signal Processing, 9(4):716–727, 2015.
- Tsunoda (2016) RT Tsunoda. Tilts and wave structure in the bottomside of the low-latitude f layer: Recent findings and future opportunities. In AGU Fall Meeting Abstracts, 2016.
- Tu and Recht (2017) Stephen Tu and Benjamin Recht. Least-squares temporal difference learning for the linear quadratic regulator. arXiv preprint arXiv:1712.08642, 2017.
- Valko et al. (2013) Michal Valko, Nathan Korda, Rémi Munos, Ilias Flaounas, and Nello Cristianini. Finite-time analysis of kernelised contextual bandits. In Uncertainty in Artificial Intelligence, page 654. Citeseer, 2013.
- Xu et al. (2018) Liyuan Xu, Junya Honda, and Masashi Sugiyama. A fully adaptive algorithm for pure exploration in linear bandits. In International Conference on Artificial Intelligence and Statistics, pages 843–851, 2018.
- Zi-Zong (2009) Yan Zi-Zong. Schur complements and determinant inequalities. 2009.
9 Probabilistic Setting and Martingale Lemma
For the theoretical results, we adopt the following general probabilistic framework, following Abbasi-Yadkori et al. (2011) and Durand et al. (2018). We formalize the notion of history defined in Section 3 of the main paper using a filtration. A filtration is a sequence of σ-algebras such that . Let be a filtration such that is measurable, and is measurable. For example, one may take , i.e., is the σ-algebra generated by .
We assume that is a zero-mean, -conditionally sub-Gaussian random variable, i.e., is such that for some and ,
Definition 9.1 (Definition 4.11 in Motwani and Raghavan (1995)).
Let be a probability space with filtration . Suppose that are random variables such that for all , is measurable. The sequence is a martingale provided for all ,
Lemma 9.2 (Theorem 4.12 in Motwani and Raghavan (1995)).
Any subsequence of a martingale is also a martingale (relative to the corresponding subsequence of the underlying filter).
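For reference, the standard textbook forms of the three notions used in this section can be sketched as follows. The notation here is assumed and need not match the main paper: $\mathcal{F}_t$ stands for the filtration, $\eta_t$ for the noise, $R$ for the sub-Gaussian constant, $\gamma$ for the free real parameter, and $X_i$ for the martingale sequence.

```latex
% Filtration: a nested sequence of sigma-algebras (assumed symbols)
\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots \subseteq \mathcal{F}

% Conditional R-sub-Gaussianity of the zero-mean noise \eta_t
\mathbb{E}\left[ e^{\gamma \eta_t} \,\middle|\, \mathcal{F}_{t-1} \right]
  \le \exp\!\left( \frac{\gamma^2 R^2}{2} \right)
  \quad \text{for all } \gamma \in \mathbb{R}

% Martingale property of (X_i) relative to (\mathcal{F}_i)
\mathbb{E}\left[ X_{i+1} \,\middle|\, \mathcal{F}_i \right] = X_i
  \quad \text{for all } i \ge 0
```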
The above lemma is important because we construct confidence intervals for each arm separately. For each arm, we define the subset of time indices ( for each arm ) at which that arm was selected. Based on these indices, we can form subsequences of the main context and noise sequences such that the assumptions on the main sequence also hold for the subsequences.
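The per-arm index sets described above can be illustrated with a minimal sketch. The function names, the toy sequence of pulled arms, and the noise values are all hypothetical and chosen only for illustration; they are not taken from the main paper.

```python
from collections import defaultdict


def per_arm_indices(pulled_arms):
    """Map each arm to the increasing list of time steps at which it was pulled."""
    indices = defaultdict(list)
    for t, arm in enumerate(pulled_arms, start=1):  # time steps are 1-indexed
        indices[arm].append(t)
    return dict(indices)


def subsequence(sequence, time_indices):
    """Extract the entries of a 1-indexed sequence at the given time steps."""
    return [sequence[t - 1] for t in time_indices]


# Hypothetical toy run: arms pulled at t = 1..6 and the noise seen at each step.
pulled = [0, 1, 0, 2, 1, 0]
noise = [0.3, -0.1, 0.2, -0.4, 0.5, -0.2]

arm_times = per_arm_indices(pulled)
arm_noise = {arm: subsequence(noise, ts) for arm, ts in arm_times.items()}
```

Each per-arm list keeps the original time order, so it is a subsequence of the main sequence in the sense of Lemma 9.2.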
9.1 Theorem 4.1 in Main Paper
Theorem 4.1 is a slight modification of Theorem 2.1 in Durand et al. (2018). In the contextual bandit setting in Durand et al. (2018), for any , Theorem 2.1 in Durand et al. (2018) establishes that with probability at least , it holds simultaneously over all and ,
For , one can replace in the log terms with . Then , we have
Let . In that case,
Using the triangle inequality, for any ,
Let and . Hence, we have
10 Lower Bound on Eigenvalue
First, we state the lemmas that we use to prove Lemma 5.1 in the main paper.
Lemma 10.1 (Lemma 9 in Li and Zhang (2018)).
If , then for all ,
Lemma 10.2 (Lemma 1.1 in Zi-Zong (2009)).
Let be a symmetric positive definite matrix partitioned according to
where and . Then .
Lemma 10.3 (Special case of extended Horn’s inequality (Theorem 4.5 of Bercovici et al. (2009))).
Let be compact self-adjoint operators. Then for any ,
Theorem 10.4 (Freedman’s inequality for self-adjoint operators, Theorem 3.2 and Section 3.2 in Minsker (2017)).
Let be a sequence of self-adjoint Hilbert-Schmidt operators acting on a separable Hilbert space ( is an operator such that for any ). Additionally, assume that is a martingale difference sequence of self-adjoint operators such that almost surely for all and some positive . Denote by and . Then for any ,
where is the operator norm and .
Note that is a function of , but it is upper bounded by , which is the rank of .
10.1 Proof of Lemma 5.1 in the main paper
Lemma 7 in Li and Zhang (2018) gives a lower bound on the minimum eigenvalue (finite-dimensional case) when the reward depends linearly on the context. We extend it to the largest eigenvalue (infinite-dimensional case) and to the case when the reward depends non-linearly on the context.
 is a compact space endowed with a finite positive Borel measure. For a continuous kernel, the canonical feature map is a continuous function , where is a separable Hilbert space (see Section 2 of Micchelli et al. (2006) for a construction such that is separable). In such a setting, is also a compact space with a finite positive Borel measure (Micchelli et al., 2006). We now define a few terms on .
Define the random variable . Let .
By construction, is a martingale and is the martingale difference sequence. Notice that . To use Freedman’s inequality, we lower bound the operator norm of , and upper bound the largest eigenvalue of , . Let be the spectral radius of operator . We work with the spectral radius because is not necessarily a positive definite operator. It is well known that
By Assumption 3, lies in a fixed-dimensional subspace with its eigenvalues for . Thus, for , .
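To illustrate why the spectral radius, rather than the largest eigenvalue, is the natural quantity for an operator that is not positive definite, here is a minimal finite-dimensional sketch. A 2x2 symmetric matrix stands in for the self-adjoint operator, and the example values are hypothetical. For a symmetric matrix the spectral radius is the largest absolute eigenvalue, which coincides with the operator norm.

```python
import math


def sym2_eigenvalues(a, b, d):
    """Eigenvalues (smallest, largest) of the symmetric matrix [[a, b], [b, d]]."""
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean - radius, mean + radius


def spectral_radius(a, b, d):
    """Largest absolute eigenvalue of [[a, b], [b, d]]."""
    smallest, largest = sym2_eigenvalues(a, b, d)
    return max(abs(smallest), abs(largest))


# Hypothetical indefinite example: the diagonal matrix [[1, 0], [0, -3]].
lam_min, lam_max = sym2_eigenvalues(1.0, 0.0, -3.0)
rho = spectral_radius(1.0, 0.0, -3.0)
```

Here the largest eigenvalue is smaller than the spectral radius, because the eigenvalue of largest magnitude is negative.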
Bound on : By definition, . Hence, by using Horn’s inequality (Lemma 10.3).
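As a rough illustration of the kind of eigenvalue bound that Horn-type inequalities provide, the sketch below numerically checks Weyl's inequality, lambda_max(A + B) <= lambda_max(A) + lambda_max(B), for 2x2 symmetric matrices. Weyl's inequality is a classical member of the Horn family; whether it is the exact special case stated in Lemma 10.3 is an assumption, and the matrices are hypothetical.

```python
import math


def lambda_max_sym2(a, b, d):
    """Largest eigenvalue of the symmetric matrix [[a, b], [b, d]]."""
    return (a + d) / 2.0 + math.hypot((a - d) / 2.0, b)


def weyl_gap(A, B):
    """Return lambda_max(A) + lambda_max(B) - lambda_max(A + B).

    Each matrix is given as a tuple (a, b, d) encoding [[a, b], [b, d]].
    Weyl's inequality says the returned value is non-negative.
    """
    (a1, b1, d1), (a2, b2, d2) = A, B
    lam_sum = lambda_max_sym2(a1 + a2, b1 + b2, d1 + d2)
    return lambda_max_sym2(a1, b1, d1) + lambda_max_sym2(a2, b2, d2) - lam_sum


# Hypothetical symmetric matrices, one of them indefinite.
A = (2.0, 1.0, 0.0)    # [[2, 1], [1, 0]]
B = (-1.0, 0.5, 3.0)   # [[-1, 0.5], [0.5, 3]]
gap = weyl_gap(A, B)
```

The gap closes when the two matrices share a top eigenvector, for example when B equals A.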
Bound on : To bound the term , write
Expanding the square,
Taking the norm on both sides,