## I Introduction

In this paper we propose a novel framework for online learning by multiple cooperative and decentralized learners. We assume that an instance (a data unit), characterized by context (side) information, arrives at a learner (processor), which needs to process it either by using one of its own processing functions or by requesting another learner (processor) to process it. The learner’s goal is to learn online which processing function it should use to maximize its total expected reward for that instance. A data stream is an ordered sequence of instances that can be read only once or a small number of times using limited computing and storage capabilities. For example, in a stream mining application, an instance can be the data unit extracted by a sensor or camera; in a wireless communication application, an instance can be a packet that needs to be transmitted. The context can be anything that provides information about the rewards to the learners. For example, in stream mining, the context can be the type of the extracted instance; in wireless communications, the context can be the channel signal-to-noise ratio (SNR). The processing functions in the stream mining application can be the various classification functions, while in wireless communications they can be the transmission strategies for sending the packet. (Note that the learners can select processing functions based on the context and not necessarily on the instance.) The rewards in stream mining can be the accuracy associated with the selected classification function, and in wireless communications they can be the resulting goodput and expended energy associated with a selected transmission strategy.

To solve such distributed online learning problems, we define
a new class of multi-armed bandit solutions, which we refer to as
cooperative contextual bandits. In the considered scenario,
there is a set of cooperative learners, each equipped with
a set of processing functions (arms; we use the terms action and arm interchangeably) which can be used
to process the instance. By definition, cooperative learners agree to follow the rules of a prescribed algorithm provided by a designer, given that the prescribed algorithm meets the set of constraints imposed by the learners. For instance, these constraints can be privacy constraints, which limit the amount of information a learner knows about the arms of the other learners. We assume a discrete time model,
where different instances and associated context information
arrive to a learner. (Assuming synchronous agents/learners is common in the decentralized
multi-armed bandit literature [1, 2].
Although our formulation is for synchronous learners, our results
directly apply to asynchronous learners, for which the times of instance
and context arrivals can differ. An asynchronous learner may not receive an
instance and context at every time slot; then, instead of the
final time, our performance bounds for that learner depend
on its total number of arrivals by the final time.) Upon the arrival of an instance, a learner needs to either select
one of its own arms to process the instance or call another
learner, which can select one of its own arms to process the instance, and
incur a cost (e.g., delay cost, communication cost, processing
cost, money).
Based on the selected arm, the learner receives a random
reward, which is drawn from some unknown distribution that depends
on the context information characterizing the instance. The goal of
a learner is to maximize its total undiscounted reward up to any time
horizon . A learner does not know the expected reward (as a function
of the context) of its own arms or of the other learners’ arms. In
fact, we go one step further and assume that a learner does not know
anything about the set of arms available to other learners except
an upper bound on the number of their arms.
The learners are cooperative because they obtain mutual benefits from cooperation:
a learner’s benefit from calling another learner may be an increased
reward compared to the case when it uses solely its own arms, while the
benefit of the learner asked to perform the processing by another
learner is that it can learn about the performance of its own arm
based on its reward for the calling learner.
This is especially beneficial when certain instances and associated contexts are less
frequent, or when gathering labels (observing the reward) is costly.

The problem defined in this paper is a generalization of the well-known contextual bandit problem [3, 4, 5, 6, 7, 8], in which there is a single learner who has access to all the arms. However, the considered distributed online learning problem is significantly more challenging because a learner cannot observe the arms of other learners and cannot directly estimate the expected rewards of those arms. Moreover, the heterogeneous contexts arriving at each learner lead to different learning rates for the various learners. We design distributed online learning algorithms whose long-term average rewards converge to the best distributed solution which can be obtained if we assumed complete knowledge of the expected arm rewards of each learner for each context.

To rigorously quantify the learning performance, we define the regret of an online learning algorithm for a learner as the difference between the expected total reward of the best decentralized arm selection scheme given complete knowledge about the expected arm rewards of all learners and the expected total reward of the algorithm used by the learner. Simply, the regret of a learner is the loss incurred due to the unknown system dynamics compared to the complete knowledge benchmark. We prove a sublinear upper bound on the regret, which implies that the average reward converges to the optimal average reward. The upper bound on regret gives a lower bound on the convergence rate to the optimal average reward. We show that when the contexts arriving to a learner are uniformly distributed over the context space, the regret depends on the dimension of the context space, while when the contexts arriving to the same learner are concentrated in a small region of the context space, the regret is independent of the dimension of the context space.

The proposed framework can be used in numerous applications including the ones given below.

###### Example 1

Consider a distributed recommender system in which there is a group of agents (learners) that are connected via a fixed network, each of which experiences inflows of users to its page. Each time a user arrives, an agent chooses from among a set of items (arms) to offer to that user, and the user will either reject or accept each item. When choosing among the items to offer, the agent is uncertain about the user’s acceptance probability of each item, but the agent is able to observe specific background information about the user (context), such as the user’s gender, location, age, etc. Users with different backgrounds will have different probabilities of accepting each item, and so the agent must learn this probability over time by making different offers. In order to promote cooperation within this network, we let each agent also recommend items of other agents to its users in addition to its own items. Hence, if the agent learns that a user with a particular context is unlikely to accept any of the agent’s items, it can recommend to the user items of another agent that the user might be interested in. The agent can get a commission from the other agent if it sells the item of the other agent. This provides the necessary incentive to cooperate. However, since agents are decentralized, they do not directly share the information that they learn over time about user preferences for their own items. Hence the agents must learn about other agents’ acceptance probabilities through their own trial and error.

###### Example 2

Consider a network security scenario in which autonomous systems (ASs) collaborate with each other to detect cyber-attacks [9]. Each AS has a set of security solutions which it can use to detect attacks. The contexts are the characteristics of the data traffic in each AS. These contexts can provide valuable information about the occurrence of cyber-attacks. Since the nature of the attacks is dynamic, non-stochastic and context dependent, the efficiency of the various security solutions is dynamically varying, context dependent and unknown a priori. Based on the extracted contexts (e.g., key properties of its traffic, the originator of the traffic, etc.), an AS may route its incoming data stream (or only the context information) to another AS, and if the called AS detects malicious activity based on its own security solutions, it warns the calling AS. Due to privacy or security concerns, the calling AS may not know what security applications the called AS is running. This problem can be modeled as a cooperative contextual bandit problem in which the various ASs cooperate with each other to learn online which actions they should take or which other ASs they should request to take actions in order to accurately detect attacks (e.g., minimize the mis-detection probability of cyber-attacks).

The remainder of the paper is organized as follows. In Section II we describe the related work and highlight the differences from our work. In Section III we describe the choices of learners, rewards, complete knowledge benchmark, and define the regret of a learning algorithm. A cooperative contextual learning algorithm that uses a non-adaptive partition of the context space is proposed and a sublinear bound on its regret is derived in Section IV. Another learning algorithm that adaptively partitions the context space of each learner is proposed in Section V, and its regret is bounded for different types of context arrivals. In Section VI we discuss the necessity of the training phase, which is a property of both algorithms, and compare the two algorithms. Finally, the concluding remarks are given in Section VII.

## II Related Work

Contextual bandits have been studied before in [5, 6, 7, 8] in a single agent setting, where the agent sequentially chooses from a set of arms with unknown rewards, and the rewards depend on the context information provided to the agent at each time slot. The goal of the agent is to maximize its reward by balancing exploration of arms with uncertain rewards and exploitation of the arm with the highest estimated reward. The algorithms proposed in these works are shown to achieve sublinear in time regret with respect to the complete knowledge benchmark, and the sublinear regret bounds are proved to match lower bounds on the regret up to logarithmic factors. In all the prior work, the context space is assumed to be large and a known similarity metric over the contexts is exploited by the algorithms to estimate arm rewards together for groups of similar contexts. Groups of contexts are created by partitioning the context space. For example, [7] proposed an epoch-based uniform partition of the context space, while [5] proposed a non-uniform adaptive partition. In [10], contextual bandit methods are developed for personalized news article recommendation and a variant of the UCB algorithm [11] is designed for linear payoffs. In [12], contextual bandit methods are developed for data mining and a perceptron based algorithm that achieves sublinear regret when the instances are chosen by an adversary is proposed. To the best of our knowledge, our work is the first to provide rigorous solutions for online learning by multiple cooperative learners when context information is present and to propose a novel framework for cooperative contextual bandits to solve this problem.

Another line of work [3, 4] considers a single agent with a large set of arms (often uncountable). Given a similarity structure on the arm space, they propose online learning algorithms that adaptively partition the arm space to get sublinear regret bounds. The algorithms we design in this paper also exploit the similarity information, but in the context space rather than the action space, to create a partition and learn through the partition. However, the distributed problem formulation, the creation of the partitions and how learning is performed are very different from related prior work [5, 6, 7, 8, 3, 4].

Previously, distributed multi-user learning has only been considered for multi-armed bandits with a finite number of arms and no context. In [13, 1] distributed online learning algorithms that converge to the optimal allocation with logarithmic regret are proposed for the i.i.d. arm reward model, given that the optimal allocation is an orthogonal allocation in which each user selects a different arm. Considering a similar model but with Markov arm rewards, logarithmic regret algorithms are proposed in [14, 15], where the regret is with respect to the best static policy, which is not generally optimal for Markov rewards. This is generalized in [2] to dynamic resource sharing problems and logarithmic regret results are also proved for this case.

A multi-armed bandit approach is proposed in [16] to solve decentralized constraint optimization problems (DCOPs) with unknown and stochastic utility functions. The goal in this work is to maximize the total cumulative reward, where the cumulative reward is given as a sum of local utility functions whose values are controlled by variable assignments made (actions taken) by a subset of agents. The authors propose a message passing algorithm to efficiently compute a global upper confidence bound on the joint variable assignment, which leads to logarithmic in time regret. In contrast, in our formulation we consider a problem in which rewards are driven by contexts, and the agents do not know the set of actions of the other agents. In [17] a combinatorial multi-armed bandit problem is proposed in which the reward is a linear combination of a set of coefficients of a multi-dimensional action vector and an instance vector generated by an unknown i.i.d. process. They propose an upper confidence bound algorithm that computes a global confidence bound for the action vector which is the sum of the upper confidence bounds computed separately for each dimension. Under the proposed i.i.d. model, this algorithm achieves regret that grows logarithmically in time and polynomially in the dimension of the vector.

We provide a detailed comparison between our work and related work in multi-armed bandit learning in Table I. Our cooperative contextual learning framework can be seen as an important extension of the centralized contextual bandit framework [3, 4, 5, 6, 7, 8]. The main differences are: (i) a training phase, which is required due to the informational asymmetries between learners; (ii) separation of exploration and exploitation over time instead of using an index for each arm to balance them, resulting in three-phase learning algorithms with training, exploration and exploitation phases; (iii) coordinated context space partitioning in order to balance the differences in reward estimation due to heterogeneous context arrivals to the learners. Although we consider a three-phase learning structure, our learning framework can work together with index-based policies such as the ones proposed in [5], by restricting the index updates to time slots that are not in the training phase. Our three-phase learning structure separates exploration and exploitation into distinct time slots, while they take place concurrently for an index-based policy. We will discuss the differences between these methods in Section VI. We will also show in Section VI that the training phase is necessary for the learners to form correct estimates about each other’s rewards in cooperative contextual bandits.

Different from our work, distributed learning is also considered in the online convex optimization setting [18, 19, 20]. In all of these works local learners choose their actions (parameter vectors) to minimize the global total loss by exchanging messages with their neighbors and performing subgradient descent. In contrast to these works, in which learners share information about their actions, the learners in our model do not share any information about their own actions. The information shared in our model is the context information of the calling learner and the reward generated by the arm of the called learner. However, this information is not shared at every time slot, and the rate of information sharing between learners who cannot help each other to gain higher rewards goes to zero asymptotically.

In addition to the aforementioned prior work, in our recent work [21] we consider online learning in a decentralized social recommender system. In this related work, we address the challenges of decentralization, cooperation, incentives and privacy that arise in a network of recommender systems. We model the item recommendation strategy of a learner as a combinatorial learning problem, and prove that learning is much faster when the purchase probabilities of the items are independent of each other. In contrast, in this work we propose the general theoretical model of cooperative contextual bandits, which can be applied in a variety of decentralized online learning settings including wireless sensor surveillance networks, cognitive radio networks, network security applications, recommender systems, etc. We show how the context space partition can be adapted based on the context arrival process and prove the necessity of the training phase.

| | [5, 6, 7, 8] | [22, 13, 2] | This work |
|---|---|---|---|
| Multi-user | no | yes | yes |
| Cooperative | N/A | yes | yes |
| Contextual | yes | no | yes |
| Context arrival process | arbitrary | N/A | arbitrary |
| Synchronous (syn) / asynchronous (asn) | N/A | syn | both |
| Regret | sublinear | logarithmic | sublinear |

## III Problem Formulation

The system model is shown in Fig. 1. There are learners which are indexed by the set . Let be the set of learners learner can choose from to receive a reward. Let denote the set of arms of learner . Let denote the set of all arms. Let . We call the set of choices for learner . We use index to denote any choice in , to denote arms of the learners, to denote other learners in . Let , and , where is the cardinality operator. A summary of notations is provided in Appendix B.

The learners operate under the following privacy constraint: A learner’s set of arms is its private information. This is important when the learners want to cooperate to maximize their rewards, but do not want to reveal their technology/methods. For instance in stream mining, a learner may not want to reveal the types of classifiers it uses to make predictions, or in network security a learner may not want to reveal how many nodes it controls in the network and what types of security protocols it uses. However, each learner knows an upper bound on the number of arms the other learners have. Since the learners are cooperative, they can follow the rules of any learning algorithm as long as the proposed learning algorithm satisfies the privacy constraint. In this paper, we design such a learning algorithm and show that it is optimal in terms of average reward.

These learners work in a discrete time setting , where the following events happen sequentially, in each time slot:
(i) an instance with context arrives to each learner ;
(ii) based on , learner either chooses one of its arms or calls another learner and sends (an alternative formulation is that learner selects multiple choices from at each time slot and receives the sum of the rewards of the selected choices; all of the ideas and results in this paper can be extended to this case as well);
(iii) for each learner who called learner at time , learner chooses one of its arms ;
(iv) learner observes the rewards of all the arms it had chosen both for its own contexts and for other learners;
(v) learner either obtains directly the reward of its own arm it had chosen, or a reward that is passed from the learner that it had called for its own context. (Although in our problem description the learners are synchronized, our model also works for the case where instances and contexts arrive asynchronously to each learner. We discuss this further in [9].)

The contexts come from a bounded dimensional space , which is taken to be without loss of generality.
When selected, an arm generates a random reward sampled from an unknown, context dependent distribution with support in . (Our results can be generalized to rewards with bounded support for ; this will only scale our performance bounds by a constant factor.)
The expected reward of arm for context is denoted by .
Learner incurs a known deterministic and fixed cost for selecting choice . (Alternatively, we can assume that the costs are random variables with bounded support whose distribution is unknown. In this case, the learners will not learn the reward itself but the reward minus cost, which is essentially the same thing; however, our performance bounds will be scaled by a constant factor.)

For example for , can represent the cost of activating arm , while for , can represent the cost of communicating with learner and/or the payment made to learner . Although in our system model we assume that each learner can directly call another learner , our model can be generalized to learners over a network where calling learners that are away from learner has a higher cost for learner . Learner knows the set of other learners and costs of calling them, i.e., , but does not know the set of arms , , but only knows an upper bound on the number of arms that each learner has, i.e., on , . Since the costs are bounded, without loss of generality we assume that costs are normalized, i.e., for , . The net reward of learner from a choice is equal to the obtained reward minus the cost of selecting the choice. The net reward of a learner is always in . The learners are cooperative, which implies that when called by learner , learner will choose one of its own arms which it believes to yield the highest expected reward given the context of learner .

The expected reward of an arm is similar for similar contexts, which is formalized in terms of a Hölder condition given in the following assumption.

###### Assumption 1

There exists , such that for all and for all , we have , where denotes the Euclidean norm in .

We assume that is known by the learners. In the contextual bandit literature this is referred to as similarity information [5], [23]. Different from prior works on contextual bandits, we do not require to be known by the learners. However, will appear in our performance bounds.
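As an illustration only, the following Python sketch checks a Hölder-type similarity condition numerically on a finite set of contexts. The names `L` (Hölder constant), `alpha` (Hölder exponent), `holder_violations` and `example_reward` are assumptions introduced here for the sketch, and the reward function is a hypothetical example rather than anything taken from the paper.

```python
import math
from itertools import combinations

def holder_violations(reward_fn, contexts, L, alpha):
    """Return the context pairs (x, y) that violate
    |reward_fn(x) - reward_fn(y)| <= L * ||x - y||**alpha,
    where ||.|| is the Euclidean norm. L and alpha are assumed names."""
    bad = []
    for x, y in combinations(contexts, 2):
        dist = math.dist(x, y)  # Euclidean distance between the two contexts
        if abs(reward_fn(x) - reward_fn(y)) > L * dist ** alpha + 1e-12:
            bad.append((x, y))
    return bad

def example_reward(x):
    """Hypothetical expected reward on [0, 1]^2: the mean of the coordinates."""
    return sum(x) / len(x)

# A coarse grid of contexts in the unit square.
grid = [(i / 4, j / 4) for i in range(5) for j in range(5)]
no_violation = holder_violations(example_reward, grid, L=1.0, alpha=1.0)
some_violation = holder_violations(example_reward, grid, L=0.5, alpha=1.0)
```

Such a check can only refute the condition on the sampled contexts; it does not prove that the condition holds over the whole context space.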

The goal of learner is to maximize its total expected reward. In order to do this, it needs to learn the rewards from its choices. Thus, learner should concurrently explore the choices in to learn their expected rewards, and exploit the best believed choice for its contexts which maximizes the reward minus cost. In the next subsection we formally define the complete knowledge benchmark. Then, we define the regret which is the performance loss due to uncertainty about arm rewards.

### III-A Optimal Arm Selection Policy with Complete Information

We define learner ’s expected reward for context as , where . This is the maximum expected reward learner can provide when called by a learner with context . For learner , denotes the net reward of choice for context . Our benchmark when evaluating the performance of the learning algorithms is the optimal solution which selects the choice with the highest expected net reward for learner for its context . This is given by

(1) |

Since knowing requires knowing for , knowing the optimal solution means that learner knows the arm in that yields the highest expected reward for each .
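The complete knowledge benchmark described above can be sketched in Python as follows. This is a minimal illustration for a single context, assuming that all expected arm rewards and costs are given as dictionaries; the function names, the dictionary layout and the numerical values are hypothetical and are not part of the paper.

```python
def best_arm_reward(arm_rewards):
    """Maximum expected reward a learner can provide with its own arms."""
    return max(arm_rewards.values())

def optimal_choice(learner, expected, cost):
    """Pick the choice with the highest expected net reward for `learner`.

    expected[j][f]: expected reward of arm f of learner j for the current
                    context (assumed known, as in the benchmark).
    cost[k]:        cost paid by `learner` for choice k, where k is either
                    one of its own arms or the id of another learner.
    """
    net = {}
    # Own arms: expected reward of the arm minus its activation cost.
    for arm, reward in expected[learner].items():
        net[arm] = reward - cost[arm]
    # Other learners: their best arm's expected reward minus the calling cost.
    for other, arms in expected.items():
        if other != learner:
            net[other] = best_arm_reward(arms) - cost[other]
    best = max(net, key=net.get)
    return best, net[best]

# Hypothetical two-learner example for one fixed context.
expected = {"A": {"a1": 0.25, "a2": 0.5}, "B": {"b1": 0.875, "b2": 0.25}}
cheap_call = optimal_choice("A", expected, {"a1": 0.0, "a2": 0.0, "B": 0.125})
costly_call = optimal_choice("A", expected, {"a1": 0.0, "a2": 0.0, "B": 0.5})
```

With the low calling cost learner A prefers to call learner B, while with the high calling cost it prefers its own arm, which illustrates why the costs enter the benchmark.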

### III-B The Regret of Learning

Let be the choice selected by learner at time . Since learner has no a priori information, this choice is only based on the past history of selections and reward observations of learner . The rule that maps the history of learner to its choices is called the learning algorithm of learner . Let be the choice vector at time . We let denote the arm selected by learner when it is called by learner at time . If does not call at time , then . Let and . The regret of learner with respect to the complete knowledge benchmark given in (1) is given by

where denotes the random reward of choice for context at time for learner , and the expectation is taken with respect to the selections made by the distributed algorithm of the learners and the statistics of the rewards. For example, when and , this random reward is sampled from the distribution of arm .

Regret gives the convergence rate of the total expected reward of the learning algorithm to the value of the optimal solution given in (1). Any algorithm whose regret is sublinear, i.e., such that , will converge to the optimal solution in terms of the average reward. In the subsequent sections we will propose two different distributed learning algorithms with sublinear regret.
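To make the link between sublinear regret and average reward concrete, the following Python sketch accumulates the per-slot gap between a benchmark net reward and an achieved expected net reward. The per-slot loss of 1/sqrt(t) is a hypothetical choice made only for illustration; it is not a bound derived in the paper, and the function names are assumptions.

```python
import math

def cumulative_regret(benchmark_net, achieved_net):
    """Running sum of (benchmark net reward - achieved net reward) per slot."""
    total, curve = 0.0, []
    for best, got in zip(benchmark_net, achieved_net):
        total += best - got
        curve.append(total)
    return curve

def average_regret(curve):
    """Regret divided by the number of elapsed time slots."""
    return [r / (t + 1) for t, r in enumerate(curve)]

T = 10_000
benchmark = [1.0] * T
# Hypothetical learner whose per-slot loss shrinks like 1 / sqrt(t).
achieved = [1.0 - 1.0 / math.sqrt(t) for t in range(1, T + 1)]
curve = cumulative_regret(benchmark, achieved)
avg = average_regret(curve)
```

In this example the cumulative regret keeps growing, roughly like the square root of the horizon, while the time-averaged regret shrinks toward zero, which is the sense in which a sublinear-regret algorithm converges to the benchmark in terms of average reward.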

## IV A Distributed Uniform Context Partitioning Algorithm

The algorithm we consider in this section forms at the beginning a uniform partition of the context space for each learner. Each learner estimates its choice rewards based on the past history of arrivals to each set in the partition independently from the other sets in the partition. This distributed learning algorithm is called Contextual Learning with Uniform Partition (CLUP) and its pseudocode is given in Fig. 2, Fig. 3 and Fig. 4. For learner , CLUP is composed of two parts. The first part is the maximization part (see Fig. 3), which is used by learner to maximize its reward from its own contexts. The second part is the cooperation part (see Fig. 4), which is used by learner to help other learners maximize their rewards for their own contexts.

Let be the slicing parameter of CLUP that determines the number of sets in the partition of the context space .
When is small, the number of sets in the partition is small, hence the number of contexts from the past observations which can be used to form reward estimates in each set is large.
However, when is small, the size of each set is large, hence the variation of the expected choice rewards over each set is high.
First, we will analyze the regret of CLUP for a fixed and then optimize over it to balance the aforementioned tradeoff.
CLUP forms a partition of consisting of sets where each set is a -dimensional hypercube with dimensions .
We use index to denote a set in .
For learner let be the set in which belongs to. (If is an element of the boundary of multiple sets, then it is randomly assigned to one of these sets.)
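A uniform partition of this kind can be sketched in Python as below, assuming the context space is the unit hypercube and writing the slicing parameter as `m` and the dimension as `dim` (both names are assumptions of this sketch). As a simplification, boundary points are assigned deterministically to the higher-indexed hypercube instead of randomly. The helper `max_reward_variation` uses the fact that a hypercube of side 1/m has Euclidean diameter sqrt(dim)/m, so a Hölder condition with assumed constant `L` and exponent `alpha` bounds the variation of an expected reward inside one set.

```python
import math

def partition_index(context, m):
    """Map a context in [0, 1]^dim to the index tuple of its hypercube of side 1/m.

    Boundary points go to the higher-indexed cell (a deterministic
    simplification); the coordinate value 1.0 is clipped into the last cell."""
    return tuple(min(int(x * m), m - 1) for x in context)

def num_sets(m, dim):
    """Number of hypercubes in the uniform partition."""
    return m ** dim

def max_reward_variation(m, dim, L, alpha):
    """Upper bound on the expected-reward variation inside one hypercube,
    assuming a Hölder condition with constant L and exponent alpha."""
    diameter = math.sqrt(dim) / m  # largest Euclidean distance within a cell
    return L * diameter ** alpha

cell = partition_index((0.1, 0.9), m=4)
corner = partition_index((1.0, 0.5), m=4)
```

Increasing `m` shrinks the variation bound but multiplies the number of sets, so fewer past observations fall into each set; this is the tradeoff that the choice of the slicing parameter has to balance.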

First, we will describe the maximization part of CLUP. At time slot learner can be in one of the three phases: training phase in which learner calls another learner with its context such that when the reward is received, the called learner can update the estimated reward of its selected arm (but learner does not update the estimated reward of the selected learner), exploration phase in which learner selects a choice in and updates its estimated reward, and exploitation phase in which learner selects the choice with the highest estimated net reward.

Recall that the learners are cooperative. Hence, when called by another learner, learner will choose its arm with the highest estimated reward for the calling learner’s context. To gain the highest possible reward in exploitations, learner must have an accurate estimate of other learners’ expected rewards without observing the arms selected by them. In order to do this, before forming estimates about the expected reward of learner , learner needs to make sure that learner will almost always select its best arm when called by learner . Thus, the training phase of learner helps other learners build accurate estimates about rewards of their arms, before learner uses any rewards from these learners to form reward estimates about them. In contrast, the exploration phase of learner helps it to build accurate estimates about rewards of its choices. These two phases indirectly help learner to maximize its total expected reward in the long run.

Next, we define the counters learner keeps for each set in for each choice in , which are used to decide its current phase. Let be the number of context arrivals to learner in by time (its own arrivals and arrivals to other learners who call learner ) except the training phases of learner . For , let be the number of times arm is selected in response to a context arriving to set by learner by time (including times other learners select learner for their contexts in set ). Other than these, learner keeps two counters for each other learner in each set in the partition, which it uses to decide training, exploration or exploitation. The first one, i.e., , is an estimate of the number of context arrivals to learner from all learners except the training phases of learner and exploration, exploitation phases of learner . This is an estimate because learner updates this counter only when it needs to train learner . The second one, i.e., , counts the number of context arrivals to learner only from the contexts of learner in set at times learner selected learner in its exploration and exploitation phases by time . Based on the values of these counters at time , learner either trains, explores or exploits a choice in . This three-phase learning structure is one of the major components of our learning algorithm, which makes it different from the algorithms proposed for contextual bandits in the literature, which assign an index to each choice and select the choice with the highest index.

At each time slot , learner first identifies . Then, it chooses its phase at time by giving highest priority to exploration of its own arms, second highest priority to training of other learners, third highest priority to exploration of other learners, and lowest priority to exploitation. The reason that exploration of own arms has a higher priority than training of other learners is that it can reduce the number of trainings required by other learners, which we will describe below.
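The priority order among the phases can be sketched in Python as follows. This is only a minimal illustration: the three input collections are assumed to have been computed elsewhere for the current set of the partition (from the counters and control functions), and the function name, phase labels and `rng` parameter are hypothetical.

```python
import random

def choose_phase(under_explored_arms, under_trained_learners,
                 under_explored_learners, rng=random):
    """Return (phase, target) following the priority order described above.

    The target is drawn uniformly at random from the first non-empty
    collection; it is None in the exploitation phase, where the choice is
    made from the estimated net rewards instead.
    """
    if under_explored_arms:          # 1) explore own arms
        return "explore_own_arm", rng.choice(sorted(under_explored_arms))
    if under_trained_learners:       # 2) train other learners
        return "train", rng.choice(sorted(under_trained_learners))
    if under_explored_learners:      # 3) explore other learners
        return "explore_learner", rng.choice(sorted(under_explored_learners))
    return "exploit", None           # 4) exploit

phase_a = choose_phase({"f1"}, {"B"}, {"C"})
phase_b = choose_phase(set(), {"B"}, {"C"})
phase_c = choose_phase(set(), set(), {"C"})
phase_d = choose_phase(set(), set(), set())
```

Checking the collections in a fixed order with early returns is one simple way to encode the priorities; the sorting before the random draw only makes the draw reproducible for a seeded generator.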

First, learner identifies its set of under-explored arms:

(2) |

where is a deterministic, increasing function of which is called the control function. We will specify this function later, when analyzing the regret of CLUP. The accuracy of reward estimates of learner for its own arms increases with , hence it should be selected to balance the tradeoff between accuracy and the number of explorations. If this set is non-empty, learner enters the exploration phase and randomly selects an arm in this set to explore it. Otherwise, learner identifies the set of training candidates:

(3) |

where is a control function similar to . The accuracy of other learners’ reward estimates of their own arms increases with , hence it should be selected to balance the possible reward gain of learner due to this increase with the reward loss of learner due to the number of trainings. If this set is non-empty, learner asks the learners to report . Based on the reported values it recomputes as . Using the updated values, learner identifies the set of under-trained learners:

(4) |

If this set is non-empty, learner enters the training phase and randomly selects a learner in this set to train it. (Most of the regret bounds proposed in this paper can also be achieved by setting to be the number of times learner trains learner by time , without considering other context observations of learner . However, by recomputing , learner can avoid many unnecessary trainings, especially when the own context arrivals of learner are adequate for it to form accurate estimates about its arms for set or when learners other than learner have already helped learner to build accurate estimates for its arms in set .)
When or is empty, this implies that there is no under-trained learner, hence learner checks if there is an under-explored choice.
The set of learners that are under-explored by learner is given by

(5) |

where is also a control function similar to . If this set is non-empty, learner enters the exploration phase and randomly selects a choice in this set to explore it. Otherwise, learner enters the exploitation phase in which it selects the choice with the highest estimated net reward, i.e.,

(6) |

where is the sample mean estimate of the rewards learner observed (not only collected) from choice by time , which is computed as follows. For , let be the set of rewards collected by learner at times it selected learner while learner ’s context is in set in its exploration and exploitation phases by time . For estimating the rewards of its own arms, learner can also use the rewards obtained by other learners at times they called learner . In order to take this into account, for , let be the set of rewards collected by learner at times it selected its arm for its own contexts in set , together with the set of rewards observed by learner when it selected its arm for other learners calling it with contexts in set by time . Therefore, the sample mean reward of choice in set for learner is defined as . An important observation is that the computation of does not take into account the costs related to selecting choice . The reward generated by an arm depends only on the context at which it is selected, not on the identity of the learner for whom that arm is selected. However, the costs incurred depend on the identity of the learner. Let be the estimated net reward of choice for set . Note that when there is more than one maximizer of (6), one of them is randomly selected. In order to run CLUP, learner does not need to keep the sets in its memory. can be computed by using only and the reward at time .
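The memory remark above, that the sample mean can be maintained from its previous value and the newest reward, together with the exploitation rule in (6), can be sketched as follows. This is a minimal illustration under assumptions: the names `update_sample_mean` and `exploit_choice` are hypothetical, the per-choice observation count is assumed to be stored alongside the mean, and the net reward is assumed to be the sample mean reward minus a known per-choice cost.

```python
import random
from typing import Dict, Hashable, Optional, Tuple


def update_sample_mean(mean: float, count: int, reward: float) -> Tuple[float, int]:
    """Fold one new reward into a running mean without storing past rewards."""
    new_count = count + 1
    new_mean = mean + (reward - mean) / new_count
    return new_mean, new_count


def exploit_choice(
    mean_reward: Dict[Hashable, float],
    cost: Dict[Hashable, float],
    rng: Optional[random.Random] = None,
) -> Hashable:
    """Pick a choice maximizing (sample mean reward - cost).

    Assumption: a choice with no entry in `cost` has zero cost.
    Ties between maximizers are broken uniformly at random.
    """
    rng = rng or random.Random()
    net = {k: mean_reward[k] - cost.get(k, 0.0) for k in mean_reward}
    best = max(net.values())
    maximizers = [k for k, v in net.items() if v == best]
    return rng.choice(maximizers)
```

Keeping the cost out of `update_sample_mean` mirrors the observation that reward estimates are cost-free and can therefore be shared across callers, while the cost enters only when a specific learner ranks its choices.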

The cooperation part of CLUP operates as follows. Let be the learners who call learner at time . For each , learner first checks if it has any under-explored arm for , i.e., such that . If so, it randomly selects one of its under-explored arms and provides its reward to learner . Otherwise, it exploits its arm with the highest estimated reward for learner ’s context, i.e.,

(7) |
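The cooperation step can be sketched in the same spirit. The sketch assumes that the called learner keeps, for the hypercube containing the caller's context, a per-arm observation count and a per-arm sample mean, and that an arm is under-explored when its count does not exceed the current value of the control function. The name `respond_to_caller`, the dictionary layout, and the exact form of the threshold test are hypothetical.

```python
import random
from typing import Dict, Hashable, Optional


def respond_to_caller(
    arm_counts: Dict[Hashable, int],
    arm_means: Dict[Hashable, float],
    threshold: float,
    rng: Optional[random.Random] = None,
) -> Hashable:
    """Select the arm a called learner uses for the caller's context.

    arm_counts / arm_means refer to the hypercube that contains the
    caller's context; threshold is the control function value (assumed).
    """
    rng = rng or random.Random()
    # Explore first: any arm observed too few times in this hypercube.
    under_explored = [f for f, n in arm_counts.items() if n <= threshold]
    if under_explored:
        return rng.choice(under_explored)
    # Otherwise exploit the arm with the highest estimated reward.
    # Costs are deliberately ignored here: they depend on the caller.
    best = max(arm_means[f] for f in arm_counts)
    return rng.choice([f for f in arm_counts if arm_means[f] == best])
```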

### IV-A Analysis of the Regret of CLUP

Let , and let denote logarithm in base . For each set (hypercube) let , , for , and , , for . Let be the context at the center (center of symmetry) of the hypercube . We define the optimal choice of learner for set as . When the set is clear from the context, we will simply denote the optimal choice for set with . Let

be the set of suboptimal choices for learner for hypercube at time , where , are parameters that are only used in the analysis of the regret and do not need to be known by the learners. First, we will give regret bounds that depend on the values of and , and then we will optimize over these values to find the best bound. Relatedly, let

be the set of suboptimal arms of learner for hypercube at time , where . Again, when the set is clear from the context, we will simply use . The arms in are the ones that learner should not select when called by another learner.

The regret given in (1) can be written as a sum of three components: , where is the regret due to trainings and explorations by time , is the regret due to suboptimal choice selections in exploitations by time and is the regret due to near optimal choice selections in exploitations by time , which are all random variables. In the following lemmas we will bound each of these terms separately. The following lemma bounds .

###### Lemma 1

When CLUP is run by all learners with parameters , , and (footnote 9: for a number , let be the smallest integer that is greater than or equal to ), where and , we have

where

(8) |

Proof: Since time slot is a training or an exploration slot for learner if and only if , up to time , there can be at most exploration slots in which an arm in is selected by learner , training slots in which learner selects learner , and exploration slots in which learner selects learner . Since for all , the realized (hence expected) one-slot loss due to any choice is bounded above by . Hence, the result follows from summing the above terms and multiplying by , and the fact that for any .

From Lemma 1, we see that the regret due to explorations is linear in the number of hypercubes , hence exponential in parameter and .
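To get a feel for how quickly this term grows, the sketch below counts hypercubes. It assumes, since the partition is not restated in this section, that the context space is a unit cube of dimension `dim` split uniformly into `slices` intervals along each axis, so that the partition contains `slices ** dim` sets. The function and parameter names are hypothetical.

```python
def num_hypercubes(slices: int, dim: int) -> int:
    """Number of sets in a uniform partition of a dim-dimensional unit cube."""
    if slices < 1 or dim < 1:
        raise ValueError("slices and dim must be positive integers")
    return slices ** dim


# With 4 slices per axis the count grows exponentially in the dimension.
for dim in (1, 2, 4, 8):
    print(dim, num_hypercubes(4, dim))
# prints:
# 1 4
# 2 16
# 4 256
# 8 65536
```

Because every hypercube carries its own exploration and training budget, a finer partition or a higher-dimensional context multiplies the number of slots spent outside exploitation.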

For any and , the sample mean represents a random variable which is the average of the independent samples in set . Let be the event that a suboptimal arm is selected by learner , when it is called by learner for a context in set for the th time in the exploitation phases of learner . Let denote the random variable which is the number of times learner selects a suboptimal arm when called by learner in exploitation slots of learner when the context is in set by time . Clearly, we have

(9) |

where is the indicator function which is equal to if the event inside is true and otherwise. The following lemma bounds .

###### Lemma 2

Consider all learners running CLUP with parameters , , and , where and . For any if holds for all , then we have

Proof: Consider time . Let be the event that learner exploits at time .

First, we will bound the probability that learner selects a suboptimal choice in an exploitation slot. Then, using this we will bound the expected number of times a suboptimal choice is selected by learner in exploitation slots. Note that every time a suboptimal choice is selected by learner , since for all , the realized (hence expected) loss is bounded above by . Therefore, times the expected number of times a suboptimal choice is selected in an exploitation slot bounds . Let be the event that choice is chosen at time by learner . We have . Adopting the standard probabilistic notation, for two events and , is equal to . Taking the expectation

(10) |

Let be the event that at most samples in are collected from suboptimal arms of learner in hypercube . Let . For a set , let denote the complement of that set. For any , we have

(11) |

for some . This implies that

Since for any ,
