# Regret Bounds for Opportunistic Channel Access

We consider the task of opportunistic channel access in a primary system composed of independent Gilbert-Elliot channels where the secondary (or opportunistic) user does not have a priori information regarding the statistical characteristics of the system. It is shown that this problem may be cast into the framework of model-based learning in a specific class of Partially Observed Markov Decision Processes (POMDPs), for which we introduce an algorithm aimed at striking an optimal tradeoff between the exploration (or estimation) and exploitation requirements. We provide finite-horizon regret bounds for this algorithm as well as a numerical evaluation of its performance in the single channel model and in the case of stochastically identical channels.


## 1 Introduction

In recent years, opportunistic spectrum access for cognitive radio has been the focus of significant research efforts [akyildiz:won-yeol:vuran:mohanty:2008, haykin:2005, mitola:2000]. These works propose to improve spectral efficiency by making smarter use of the large portion of the frequency bands that remains unused. In Licensed Band Cognitive Radio, the goal is to share the bands licensed to primary users with non-primary users, called secondary users or cognitive users. These secondary users must carefully identify available spectrum resources and communicate while avoiding disturbance to the primary network. Opportunistic spectrum access thus has the potential for significantly increasing the spectral efficiency of wireless networks.

In this paper, we focus on the opportunistic communication model previously considered in [Liu:Zhao:08, Zhao:al:07bis], in which a single secondary user searches, among N independent channels, for idle channels temporarily unused by primary users. The channels are modeled as Gilbert-Elliot channels: at each time slot, a channel is either idle or occupied, and its availability evolves in a Markovian way. Assuming that the secondary user can only sense a limited number of channels simultaneously [Lai:al:08, Liu:Zhao:08, Zhao:al:08], its main task is to choose which channels to sense at each time, aiming to maximize its expected long-term transmission efficiency. Under this model, channel allocation may be interpreted as a planning task in a particular class of Partially Observed Markov Decision Processes (POMDPs), also called restless bandits [Liu:Zhao:08, Zhao:al:07bis].

In the works of [Liu:Zhao:08, Zhao:al:08, Zhao:al:07bis], it is assumed that the statistical information about the primary users' traffic is fully available to the secondary user. In practice however, the statistical characteristics of the traffic are not fixed a priori and must be somehow estimated by the secondary user. As the secondary user selects channels to sense, we are not faced with a simple parameter estimation problem but with a task which is closer to reinforcement learning [Sutton:92]. We consider scenarios in which the secondary user first carries out an exploration phase, in which statistical information regarding the model is gathered, and then follows with an exploitation phase, in which the optimal sensing policy, based on the estimated parameters, is applied. The key issue is to reach the proper balance between exploration and exploitation. This issue has been considered before by [Long:al:08], who proposed an asymptotic rule to set the length of the exploration phase, but without a precise evaluation of the performance of this approach. Lai et al. [Lai:al:08] also considered this problem in the multiple secondary users case, but in a simpler model where each channel is modeled as an independent and identically distributed source. In the field of reinforcement learning, this class of problems is known as model-based reinforcement learning, for which several approaches have been proposed recently [Auer:Ortner:07, Strehl:Littman:08, Tewari:Bartlett:08]. However, none of these directly applies to the channel allocation model, in which the state of the channels is only partially observed.

Our contribution consists in proposing a strategy, termed the Tiling Algorithm, for adaptively setting the length of the exploration phase. Under this strategy, the length of the exploration phase is not fixed beforehand: the exploration phase is terminated as soon as enough statistical evidence has been accumulated to determine the optimal sensing policy. The distinctive feature of this approach is that it comes with strong performance guarantees in the form of finite-horizon regret bounds. For the sake of clarity, this strategy is described in the general abstract framework of parametric POMDPs. Remark that the channel access model corresponds to a specific example of a POMDP parameterized by the transition probabilities of the availability of each channel. As the approach relies on the restrictive assumption that, for each possible parameter value, the solution of the planning problem be fully known, it is not applicable to POMDPs at large, but it is well suited to the channel allocation model. We provide a detailed account of the use of the approach for two simple instances of the opportunistic channel access model, including the case of stochastically identical channels considered by [Zhao:al:08].

The article is organized as follows. The channel allocation model is formally described in Section 2. In Section 3, the tiling algorithm is presented and its performance in terms of finite-horizon regret bounds is analyzed. The application to opportunistic channel access is detailed in Section 4, both in the one channel model and in the case of stochastically identical channels.

## 2 Channel Access Model

Consider a network consisting of N independent channels with time-varying states and bandwidths B(i), for i = 1, …, N. These channels are licensed to a primary network whose users communicate according to a synchronous slot structure. At each time slot, channels are either free or occupied (see Fig. 1). Consider now a secondary user seeking opportunities to transmit in the free slots of these channels without disturbing the primary network. With limited sensing capability, a secondary user can only access a subset of the channels. The aim of the secondary user is to leverage this partial observation of the channels so as to maximize its long-term opportunities of transmission.

Introduce the state vector X_t = (X_t(1), …, X_t(N)) which describes the network at time t, where X_t(i) is equal to 0 when channel i is occupied and to 1 when it is idle. The states of different channels are assumed to be independent. Let α(i) (resp. β(i)) be the transition probability of channel i from state 0 (resp. 1) to state 1 (see Fig. 2). Additionally, denote by ν(i) = α(i)/(1 − β(i) + α(i)) the stationary probability of the Markov chain X_t(i). At each time slot, the secondary user selects a set of M channels to sense. This choice corresponds to an action A_t ∈ {0,1}^N, where A_t(i) = 1 if the i-th channel is sensed and A_t(i) = 0 otherwise. Since only M channels can be sensed, the components of A_t sum to M. The observation Y_t is an N-dimensional vector such that Y_t(i) = X_t(i) for the selected channels and takes an arbitrary value not in {0,1} for the other channels. The reward gained at each time slot is equal to the aggregated bandwidth available for transmission. In addition, a reward equal to λ is received for each unobserved channel. At each time t, the received reward is the sum over the channels of r(X_t(i), A_t(i)), where

\[ r(X_t(i), A_t(i)) = \begin{cases} B(i) & \text{if } A_t(i)=1,\ X_t(i)=Y_t(i)=1, \\ 0 & \text{if } A_t(i)=1,\ X_t(i)=Y_t(i)=0, \\ \lambda & \text{otherwise,} \end{cases} \]

which depends on X_t(i) only through the observation Y_t(i). The gain λ associated with the action of not observing a channel may also be interpreted as a penalty for sensing occupied channels. Indeed, this model is equivalent to the one where a positive reward B(i) − λ is received for available sensed channels, a penalty λ is incurred for occupied sensed channels, and no reward is received for non-sensed channels.

Note that this model is a particular POMDP in which the state transition probabilities do not depend on the actions. Moreover, the independence between the channels may be exploited to construct an N-dimensional sufficient internal state p_t which summarizes all past decisions and observations. The internal state is defined as follows: for all i, p_t(i) is the conditional probability that channel i is idle at time t given all past actions and observations. This internal state enables the secondary user to select the channels to sense. The internal state recursion is

\[ p_{t+1}(i) = \begin{cases} \alpha(i) & \text{if } A_t(i)=1,\; Y_t(i)=0, \\ \beta(i) & \text{if } A_t(i)=1,\; Y_t(i)=1, \\ p_t(i)\,\beta(i) + (1-p_t(i))\,\alpha(i) & \text{otherwise.} \end{cases} \tag{1} \]

Moreover, remark that at each time t, the internal state is completely defined by the pair (Y, K), where Y(i) denotes the last observed state of channel i and K(i) is the number of time slots during which channel i has not been observed. Denote by p^{α(i),β(i)}_{k,y} the probability that channel i is free given that it has not been observed for k time slots and that the last observation was y. Using equation (1), these probabilities may be written as follows:

\[ p^{\alpha(i),\beta(i)}_{k(i),0} = \frac{\alpha(i)\left(1-(\beta(i)-\alpha(i))^{k(i)}\right)}{1-\beta(i)+\alpha(i)}, \tag{2} \]
\[ p^{\alpha(i),\beta(i)}_{k(i),1} = \frac{(1-\beta(i))\,(\beta(i)-\alpha(i))^{k(i)} + \alpha(i)}{1-\beta(i)+\alpha(i)}. \tag{3} \]
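As a quick sanity check on equations (1)–(3), the following sketch (ours, not from the paper) iterates the belief recursion for a single channel and compares it with the closed forms; the function names and numerical values are illustrative assumptions.

```python
# Sketch: belief that a channel is idle k slots after last being observed in
# state y, for a Gilbert-Elliot channel with alpha = P(0 -> 1), beta = P(1 -> 1).

def belief_recursion(alpha, beta, y, k):
    """Iterate the internal-state recursion (1) k times, starting from p = y."""
    p = float(y)
    for _ in range(k):
        p = p * beta + (1.0 - p) * alpha
    return p

def belief_closed_form(alpha, beta, y, k):
    """Closed forms (2) and (3)."""
    if y == 0:
        return alpha * (1.0 - (beta - alpha) ** k) / (1.0 - beta + alpha)
    return ((1.0 - beta) * (beta - alpha) ** k + alpha) / (1.0 - beta + alpha)

if __name__ == "__main__":
    alpha, beta = 0.3, 0.7
    for y in (0, 1):
        for k in (1, 2, 5, 20):
            r = belief_recursion(alpha, beta, y, k)
            c = belief_closed_form(alpha, beta, y, k)
            assert abs(r - c) < 1e-12, (y, k, r, c)
```

For large k, both expressions converge to the stationary probability α/(1 − β + α), as expected.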

The channel allocation model may also be interpreted as an instance of the restless multi-armed bandit framework introduced by [Whittle:88]. Papadimitriou and Tsitsiklis [Papadimitriou:Tsitsiklis:94] have established that the planning task in the restless bandit model is PSPACE-hard, and hence that optimal planning is not practically achievable when the number of channels becomes large. Nevertheless, recent works have focused on near-optimal so-called index strategies [LeNy:al:08bis, Guha:Munagala:07, Liu:Zhao:08], which have a reduced implementation cost. An index strategy consists in separating the optimization task into channel-specific sub-problems, following the idea originally proposed by Whittle [Whittle:88]. Interestingly, to determine the Whittle index pertaining to each channel, one has to solve the planning problem in the single channel model for arbitrary values of the subsidy λ. Using this interpretation, explicit expressions of the Whittle indices as a function of the channel transition probabilities have been provided by [LeNy:al:08bis, Liu:Zhao:08].

## 3 The Tiling Algorithm

Here, we focus on determining the sensing policy when the secondary user does not have any statistical information about the primary users' traffic. A common approach is to learn the transition probabilities in a first phase and then to act optimally according to the estimated model. If the learning phase is sufficiently long, the estimates of the transition probabilities are quite precise and the policy followed during the exploitation phase is more likely to be the optimal one. On the other hand, blindly sensing channels to learn the model parameters does not necessarily coincide with the optimal policy and thus has a cost in terms of performance. The question is hence: how long should the secondary user learn the model (explore) before applying an exploitation policy such as Whittle's policy?

This problem is the well-known dilemma between exploration and exploitation [Sutton:92]. Here, we propose an algorithm that balances exploration and exploitation by adaptively setting the duration of the exploration phase. We present this algorithm in a more abstract framework for generality. We assume that the optimal policy is a known function of a low-dimensional parameter. This condition can be restrictive, but it is verified in simple cases such as finite state space MDPs or in particular classes of POMDPs like the channel access model (see also Section 4).

### 3.1 The Parametric POMDP Model

Consider a POMDP defined by (X, Y, A, Q_θ, g_θ, r), where X is the discrete state space, Y is the observation space, A is the finite set of actions, Q_θ is the transition probability, g_θ is the observation function, r is the bounded reward function, and θ ∈ Θ denotes an unknown parameter. Given the current hidden state X_t = x of the system and a control action A_t = a, the probability of the next state x' is given by Q_θ(x, a; x'). At each time step t, one chooses an action A_t according to a policy π, and hence observes Y_t and receives the reward r(X_t, A_t). Without loss of generality, we assume that 0 ≤ r(x, a) ≤ 1 for all x ∈ X and all a ∈ A.

Since we are interested in rewards accumulated over finite but large horizons, we will consider the average (or long-term) reward criterion defined by

\[ V^{\pi}_{\theta} = \lim_{n\to\infty} \frac{1}{n}\, \mathbb{E}^{\pi}_{\theta}\!\left( \sum_{t=1}^{n} r(X_t, A_t) \right), \]

where π denotes a fixed policy. The notation V^π_θ is meant to highlight the fact that the average reward depends on both the policy π and the actual parameter value θ. For a given parameter value θ, the optimal long-term reward is defined as V*_θ = sup_π V^π_θ, and π*_θ denotes the associated optimal policy. We assume that the dependence of π*_θ and V*_θ with respect to θ is fully known. In addition, there exists a particular default policy π₀ under which the parameter θ can be consistently estimated.

Given the above, one can partition the parameter space Θ into non-intersecting subsets Z_1, …, Z_I, such that each policy zone Z_i corresponds to a single optimal policy, which we denote by π*_i. In other words, for any θ ∈ Z_i, π*_θ = π*_i. In each policy zone Z_i, the corresponding optimal policy π*_i is assumed to be known, as well as the long-term reward function V^{π*_i}_θ for any θ ∈ Z_i.

### 3.2 The Tiling Algorithm (TA)

We denote by θ̂_t the parameter estimate obtained after t steps of the exploration policy π₀ and by Δ_t the associated confidence region, whose construction will be made more precise below. The principle of the tiling algorithm is to use the policy zones to determine the length of the exploration phase: basically, the exploration phase lasts until the estimated confidence region fully enters one of the policy zones. It turns out however that this naive principle does not allow for a sufficient control of the expected duration of the exploration phase, and, hence, of the algorithm's regret. In order to deal with parameter values located close to the borders of policy zones, one needs to introduce additional frontier zones F_j(n) that shrink at a suitable rate with the time horizon n. Let

\[ T_n = \inf\left\{ t \geq 1 : \exists i,\ \Delta_t \subset Z_i \ \text{ or } \ \exists j,\ \Delta_t \subset F_j(n) \right\} \tag{4} \]

denote the random instant at which the exploration terminates. Note that the frontier zones depend on the horizon n. Indeed, the larger n, the smaller the frontier zones can be, in order to balance the length of the exploration phase against the loss due to the possible choice of a suboptimal policy.

In Figure 3, we represent the tiling of the parameter space for a hypothetical example with three distinct optimal policy zones. In this case, there are four frontier zones: one between each pair of policy zones and another for the intersection of all the policy zones. In the following, we shall assume that there exist only finitely many distinct frontier and policy zones.

The tiling algorithm consists in using the default exploratory policy π₀ until the occurrence of the stopping time T_n defined in (4). From T_n onward, the algorithm selects a policy to use during the remaining time as follows: if, at the end of the exploration phase, the confidence region is fully included in a policy zone Z_i, then the selected policy is π*_i; otherwise, the confidence region is included in a frontier zone F_j(n) and the selected policy is any optimal policy compatible with this frontier zone. An optimal policy π*_i is said to be compatible with the frontier zone F_j(n) if the intersection between the policy zone Z_i and the frontier zone is non-empty. In the example of Figure 3, for instance, exactly two of the optimal policies are compatible with each pairwise frontier zone, while all three optimal policies are compatible with the central frontier zone. If the exploration terminates in a frontier zone, then one basically does not have enough statistical evidence to favor a particular optimal policy, and the tiling algorithm simply selects one of the optimal policies compatible with the frontier zone. Hence, the purpose of frontier zones is to guarantee that the exploration phase will stop even for parameter values for which discriminating between several neighboring optimal policies is challenging. Of course, in practice, there may be other considerations that suggest selecting one compatible policy rather than another, but the general regret bound below simply assumes that any compatible policy is selected at the termination of the exploration phase.
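The stopping rule (4) can be illustrated by a small sketch; the rectangular zone geometry, the zone names and all numerical values below are invented for the example and are not from the paper.

```python
# Illustrative sketch of the stopping rule (4): exploration stops as soon as
# the confidence rectangle is fully contained in a policy zone or a frontier
# zone. Zones are axis-aligned rectangles (x_min, x_max, y_min, y_max).

def contains(outer, inner):
    return (outer[0] <= inner[0] and inner[1] <= outer[1]
            and outer[2] <= inner[2] and inner[3] <= outer[3])

def exploration_should_stop(conf_region, policy_zones, frontier_zones):
    """Return the selected zone if the confidence region is contained in a
    policy zone or frontier zone, else None (keep exploring)."""
    for name, zone in policy_zones.items():
        if contains(zone, conf_region):
            return ("policy", name)
    for name, zone in frontier_zones.items():
        if contains(zone, conf_region):
            return ("frontier", name)
    return None

# Two policy zones split at x = 0.5, with a frontier strip of width 2 * eps_n.
eps_n = 0.05
zones = {"Z1": (0.0, 0.5 - eps_n, 0.0, 1.0), "Z2": (0.5 + eps_n, 1.0, 0.0, 1.0)}
frontiers = {"F12": (0.5 - eps_n, 0.5 + eps_n, 0.0, 1.0)}

# A wide confidence region keeps exploration going; a narrow one stops it.
assert exploration_should_stop((0.3, 0.6, 0.2, 0.4), zones, frontiers) is None
assert exploration_should_stop((0.1, 0.2, 0.2, 0.4), zones, frontiers) == ("policy", "Z1")
assert exploration_should_stop((0.47, 0.53, 0.2, 0.4), zones, frontiers) == ("frontier", "F12")
```

In the second case the algorithm would commit to the optimal policy of Z1; in the third, to any policy compatible with the frontier, exactly as described above.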

### 3.3 Performance Analysis

To evaluate the performance of this algorithm, we consider the regret for the prescribed time horizon n, defined as the difference between the expected cumulated reward obtained under the optimal policy and the one obtained by following the algorithm,

\[ R_n(\theta^\star) = \mathbb{E}^{\pi^\star_{\theta^\star}}_{\theta^\star}\!\left[ \sum_{t=1}^{n} r(X_t, A_t) \right] - \mathbb{E}^{\mathrm{TA}}_{\theta^\star}\!\left[ \sum_{t=1}^{n} r(X_t, A_t) \right], \tag{5} \]

where θ* denotes the unknown parameter value. To obtain bounds on R_n(θ*) that do not depend on θ*, we will need the following assumptions.

###### Assumption 1.

The confidence region Δ_t is constructed so that there exist constants c₁, c₂ > 0 such that, for all horizons n, for all t ≤ n and for all θ ∈ Θ, P_θ( θ ∉ Δ_t or diam(Δ_t) > c₂ √(log n / t) ) ≤ c₁ exp(−(1/3) log n), where diam(Δ_t) is the diameter of the confidence region.

###### Assumption 2.

Given a size ε > 0, one may construct the frontier zones F_j(n) such that there exists a constant c₃ > 0 for which

• diam(Δ) ≤ ε implies that there exists either an index i such that Δ ⊂ Z_i or an index j such that Δ ⊂ F_j(n),

• if θ ∈ F_j(n), there exists θ' ∈ Z_i such that ‖θ − θ'‖ ≤ c₃ ε, for all policy zones Z_i compatible with F_j(n) (i.e., such that Z_i ∩ F_j(n) ≠ ∅).

###### Assumption 3.

For all i, there exists a constant L_i such that, for all θ, θ' ∈ Z_i, |V^{π*_i}_θ − V^{π*_i}_{θ'}| ≤ L_i ‖θ − θ'‖.

Assumption 1 pertains to the construction of the confidence region and may usually be met by standard applications of the Hoeffding inequality. The constant in the exponent is meant to match the worst-case rate given in Theorem 1 below. Assumption 2 formalizes the idea that the frontier zones should allow any confidence region of diameter less than ε to be fully included either in an original policy zone or in a frontier zone, while at the same time ensuring that, locally, the size of the frontier is of order ε. The applicability of the tiling algorithm crucially depends on the construction of these frontiers. Finally, Assumption 3 is a standard regularity condition (Lipschitz continuity) which is met in most applications. The performance of the tiling approach is given by the following theorem, which is proved in Appendix A.

###### Theorem 1.

Under Assumptions 1, 2 and 3, and for all θ* ∈ Θ, the duration of the exploration phase is bounded, in expectation, by

\[ \mathbb{E}_{\theta^\star}(T_n) \leq \frac{c \log n}{\epsilon^2(n)}, \tag{6} \]

and the regret by

\[ R_n(\theta^\star) \leq \mathbb{E}_{\theta^\star}(T_n) + c'\, n\, \epsilon(n) + c''\, n \exp\!\left\{-\tfrac{1}{3}\log n\right\}, \tag{7} \]

where c, c' and c'' are positive constants. The minimal worst-case regret is obtained when selecting ε(n) of the order of (log n / n)^{1/3}, which yields the bound R_n(θ*) ≤ C n^{2/3} (log n)^{1/3} for some constant C.

The duration bound in (6) follows from the observation that exploration is guaranteed to terminate only when the confidence region defined by Assumption 1 reaches a size of the order of the diameter of the frontier, that is, ε(n). The second term on the right-hand side of (7) corresponds to the maximal regret incurred if the exploration terminates in a frontier zone. The rate is obtained by balancing these two terms, the exploration term (6) and the frontier term of (7). A closer examination of the proof in Appendix A shows that if one can ensure that the exploration indeed terminates in one of the policy regions Z_i, then the regret may be bounded by an expression similar to (7) but without the frontier term. In this case, by using a constant strictly larger than 1, instead of 1/3, in Assumption 1, one can obtain logarithmic regret bounds. To do so, one however needs to introduce additional constraints to guarantee that the exploration terminates in a policy region rather than in a frontier. These constraints typically take the form of an assumed sufficient margin between the actual parameter value and the borders of the associated policy zone. This is formalized in Theorem 2, which is proved in Appendix B. First, we introduce an alternative to Assumption 1.
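The choice of ε(n) in Theorem 1 can be spelled out by balancing the two dominant terms of (7); the following short derivation is consistent with (6) and (7):

```latex
% Balance the exploration cost E[T_n] <= c log n / eps^2(n)
% against the frontier cost c' n eps(n):
\[
  \frac{c \log n}{\epsilon^2(n)} \;\asymp\; c'\, n\, \epsilon(n)
  \quad\Longleftrightarrow\quad
  \epsilon^3(n) \asymp \frac{\log n}{n}
  \quad\Longleftrightarrow\quad
  \epsilon(n) \asymp \left(\frac{\log n}{n}\right)^{1/3}.
\]
% With this choice, both terms are of order n^{2/3} (log n)^{1/3}, and the
% third term c'' n exp(-\tfrac{1}{3}\log n) = c'' n^{2/3} is of the same order,
% whence R_n(\theta^\star) = O\big(n^{2/3} (\log n)^{1/3}\big).
```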

###### Assumption 4.

The confidence region Δ_t is constructed so that there exist constants c₁, c₂ > 0 and γ > 1 such that, for all horizons n, for all t ≤ n and for all θ ∈ Θ, P_θ( θ ∉ Δ_t or diam(Δ_t) > c₂ √(log n / t) ) ≤ c₁ exp(−γ log n).

###### Theorem 2.

Consider θ* in a policy zone Z_i such that there exists ρ > 0 for which the ball of radius ρ centered at θ* is included in Z_i. Under Assumption 4, the regret is bounded by R_n(θ*) ≤ C₁ log n + C₂ for all n, for some constants C₁ and C₂ which decrease with ρ.

## 4 Application to Channel Access

In the following, we consider two specific instances of the opportunistic channel access model introduced in Section 2. First, we study the single channel case, which is an interesting illustration of the tiling algorithm: in this model, there are many distinct policy zones, and both the optimal policy and the long-term reward can be explicitly computed in each of them. In addition, the one channel model plays a crucial role in determining the Whittle index policy. Next, we apply the tiling algorithm to a model with stochastically identical channels.

### 4.1 One Channel Model

Consider a single channel with bandwidth B (normalized to B = 1 in the sequel). At each time slot, the secondary user can choose to sense the channel, hoping to receive a reward equal to B if the channel is idle and taking the risk of receiving no reward if the channel is occupied. The user can also decide not to observe the channel and then receives a reward equal to λ.

#### 4.1.1 Optimal policies, long-term rewards and policy zones

Studying the form of the optimal policy as a function of (α, β, λ) brings to light several optimal policy zones. In each zone, the optimal policy is different and is characterized by a pair (k₀, k₁) which defines how long the secondary user needs to wait (i.e., not observe the channel) before observing the channel again, depending on the outcome of the last observation. Denote by π(k₀, k₁) the policy which consists in waiting k₀ (resp. k₁) time slots before observing the channel again if, the last time the channel was sensed, it was occupied (resp. idle), and by Z(k₀, k₁) the corresponding policy zone. Let π∞ be the policy which consists in never observing the channel; this policy is optimal when α and β are such that the probability that the channel is idle is always lower than λ. We represent the policy zones in Figure 4.

The long-term reward of each policy can be exactly computed:

\[ V^{\pi(1,1)}_{\alpha,\beta} = \frac{\alpha}{1-\beta+\alpha}, \qquad V^{\pi(1,2)}_{\alpha,\beta} = \frac{\alpha(1+\lambda)}{1+\alpha+\beta(\alpha-\beta)}, \]
\[ V^{\pi(k_0,1)}_{\alpha,\beta} = \frac{(k_0-1)(1-\beta)\lambda + p^{\alpha,\beta}_{k_0,0}}{k_0(1-\beta) + p^{\alpha,\beta}_{k_0,0}} \quad \text{for } k_0 \geq 2, \qquad V^{\pi_\infty}_{\alpha,\beta} = \lambda. \]
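As an illustration of how the policy zones arise, one can numerically compare the closed-form long-term rewards above and pick the best policy for a given (α, β, λ). The sketch below is ours, not the paper's: it normalizes B = 1 and restricts the search to the policies listed above, with k₀ capped at an arbitrary finite bound.

```python
# Sketch: compare the closed-form long-term rewards of the one-channel
# policies (bandwidth normalized to B = 1) and select the best one.

def p_k0(alpha, beta, k):
    # Probability the channel is idle k slots after being observed busy, eq. (2).
    return alpha * (1.0 - (beta - alpha) ** k) / (1.0 - beta + alpha)

def long_term_reward(alpha, beta, lam, policy):
    if policy == (1, 1):
        return alpha / (1.0 - beta + alpha)
    if policy == (1, 2):
        return alpha * (1.0 + lam) / (1.0 + alpha + beta * (alpha - beta))
    if policy == "never":        # the policy pi_infinity
        return lam
    k0, _ = policy               # policies of the form (k0, 1), k0 >= 2
    p = p_k0(alpha, beta, k0)
    return ((k0 - 1) * (1.0 - beta) * lam + p) / (k0 * (1.0 - beta) + p)

def best_policy(alpha, beta, lam, k0_max=50):
    candidates = [(1, 1), (1, 2), "never"] + [(k, 1) for k in range(2, k0_max)]
    return max(candidates, key=lambda pol: long_term_reward(alpha, beta, lam, pol))

if __name__ == "__main__":
    # Mostly idle channel with a tiny subsidy: sensing every slot should win.
    print(best_policy(0.8, 0.9, 0.01))
```

Scanning this decision over a grid of (α, β) values for a fixed λ recovers a tiling of the parameter square analogous to Figure 4.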

#### 4.1.2 Applying the tiling algorithm

Applying the tiling algorithm to this model is not straightforward, as there is an infinity of policy zones. We introduce frontier zones between the policy zones, as shown in Figure 4. Moreover, to address the problem of the infinite number of zones, we propose to aggregate the policy zones corresponding to large waiting times. For example, all the zones Z(k₀, 1) with k₀ larger than some threshold are aggregated, as are the non-observation zone and the zones with sufficiently large waiting times, the thresholds being variables to be tuned according to the time horizon n. Thus, Theorem 1 still applies.

Recall that the tiling algorithm consists in learning the parameter until the estimated confidence region fully enters either one of the policy zones or one of the frontier zones. The exploration policy, denoted by π₀ in Section 3, consists in always sensing the channel. At time t, the estimated parameter θ̂_t = (α̂_t, β̂_t) is given by

\[ \hat{\alpha}_t = \frac{N^{0,1}_t}{N^0_t} \quad \text{and} \quad \hat{\beta}_t = \frac{N^{1,1}_t}{N^1_t}, \tag{8} \]

where N⁰_t (resp. N¹_t) is the number of visits to state 0 (resp. 1) up to time t, and N^{0,1}_t (resp. N^{1,1}_t) is the number of visits to state 0 (resp. 1) followed by a visit to state 1 up to time t.

In order to verify that this model satisfies the conditions of Theorem 1, we need to make an irreducibility assumption on the Markov chain.

###### Assumption 5.

There exists a constant η > 0 such that min(α, 1 − β) ≥ η.

This condition ensures that, during the time horizon n, the Markov chain visits the two states sufficiently often to estimate the parameter θ = (α, β). We define the confidence region Δ_t as the rectangle

\[ \Delta_t = \left[ \hat{\alpha}_t \pm \sqrt{\frac{\log n}{6 N^0_t}} \right] \times \left[ \hat{\beta}_t \pm \sqrt{\frac{\log n}{6 N^1_t}} \right]. \tag{9} \]

To prove that the regret of the tiling algorithm in the single channel model is bounded, we need to verify the three assumptions of Theorem 1. First, it is shown in Appendix C that Assumption 1 holds. Secondly, away from the aggregated zones, Assumption 2 is clearly satisfied, since the confidence region and the policy and frontier zones are all rectangles (see Fig. 4); the relevant size is half the smallest width of the frontier zones. Additionally, in the aggregated zones, if the central frontier zone is large enough, the aggregation can be done in such a way that the second condition of Assumption 2 also holds. Finally, for each optimal policy, the long-term reward is a Lipschitz continuous function of (α, β) on the region allowed by Assumption 5, so the third condition is also satisfied.
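The estimator (8) and the confidence rectangle (9) are straightforward to implement. The following sketch (ours) simulates the exploration policy on a single channel; the parameter values, horizon and use of Python's seeded `random` module are illustrative assumptions.

```python
import math
import random

# Sketch of the exploration phase for one channel: estimate (alpha, beta) from
# transition counts, eq. (8), and form the confidence rectangle of eq. (9).

def explore(alpha, beta, t_max, horizon_n, rng):
    x = 1  # arbitrary initial channel state
    n0 = n1 = n01 = n11 = 0
    for _ in range(t_max):
        x_next = 1 if rng.random() < (beta if x == 1 else alpha) else 0
        if x == 0:
            n0 += 1
            n01 += x_next   # transition 0 -> 1 observed
        else:
            n1 += 1
            n11 += x_next   # transition 1 -> 1 observed
        x = x_next
    alpha_hat, beta_hat = n01 / n0, n11 / n1
    w0 = math.sqrt(math.log(horizon_n) / (6 * n0))
    w1 = math.sqrt(math.log(horizon_n) / (6 * n1))
    return (alpha_hat - w0, alpha_hat + w0), (beta_hat - w1, beta_hat + w1)

if __name__ == "__main__":
    (a_lo, a_hi), (b_lo, b_hi) = explore(0.3, 0.6, 20000, 100000, random.Random(0))
    print((a_lo, a_hi), (b_lo, b_hi))
```

In the tiling algorithm, this rectangle would be tested after each step against the policy and frontier zones of Figure 4, and exploration would stop at the first full inclusion.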

#### 4.1.3 Experimental results

As suggested by Theorems 1 and 2, the length of the exploration phase under the tiling algorithm depends on the value of the true parameter θ*. In addition, for a fixed value of θ*, the length of the exploration varies from one run to another, depending on the size of the confidence region. To illustrate these effects, we consider two different values of the parameter: one which is included in a policy zone and far from any frontier zone, and one which lies in a frontier zone, close to its border. The corresponding empirical distributions of the length of the exploration phase are represented in Figure 5. Remark that the shapes of these two distributions are quite different, and that the empirical mean of the length of the exploration phase is lower for a parameter which is far from any frontier zone than for a parameter which is close to the border of a frontier zone.

In Figure 6, we compare the cumulated regret of the tiling algorithm to the regret of an algorithm with a deterministic exploration length. Both algorithms are run with the same parameter value. We use two deterministic lengths: one lower and the other larger than the average length of the exploration phase under the tiling algorithm for this value of the parameter (see Fig. 5). The algorithms are run four times independently, and all the cumulated regrets are represented in Figure 6.

Note that, the parameter being in the interior of a policy zone (i.e., not in a frontier zone), the regret of the tiling algorithm is null during the exploitation phase, since the optimal policy for the true parameter is used. Similarly, when the deterministic length of the exploration phase is sufficiently large, the estimate of the parameter is quite precise, and the regret during the exploitation phase is also null. On the other hand, too large a deterministic exploration length increases the regret during the exploration phase: we observe in Figure 6 that the regret with the larger deterministic length exceeds that of the tiling algorithm. When the deterministic length of the exploration phase is smaller than the average exploration length of the tiling algorithm, either the parameter is estimated precisely enough and the resulting regret is smaller, or the estimated value is too far away from the actual value and the policy followed during the exploitation phase is not the optimal one. In the latter case, the regret is not null during the exploitation phase and becomes noticeably large. This can be observed in Figure 6: in three of the four runs, the cumulated regret with the shorter deterministic length (dashed line) is small, whereas in the remaining run it increases sharply and steadily.

### 4.2 Stochastically Identical Channels Case

In this section, consider the full channel allocation model where all the channels have equal bandwidth (normalized to 1) and are stochastically identical in terms of primary usage, i.e., all the channels have the same transition probabilities: α(i) = α and β(i) = β for all i. In addition, let ν₁ = α/(1 − β + α) denote the common stationary probability that a channel is idle.

#### 4.2.1 Optimal policies, long-term rewards and policy zones

Under these assumptions, the near-optimal Whittle index policy has been shown to be equivalent to the myopic policy (see [Liu:Zhao:08]), which consists in selecting the channels to be sensed according to the expected one-step reward p^{α,β}_{k(i),y(i)}, given that channel i has not been observed for k(i) time slots and that the last observation was y(i). Recall that the action set is the set of N-dimensional binary vectors with M components equal to 1 and N − M components equal to 0. Following this policy, the secondary user senses the M channels that have the highest probabilities of being free.

The resulting policy depends only on whether the channels are positively correlated (α ≤ β) or negatively correlated (β ≤ α) (see [Liu:Zhao:08] for details). To explain an important difference between the positively and negatively correlated cases, we represent in Figure 7 the probability p^{α,β}_{k,y} that a channel is idle, as a function of k and for both values of the last observation y, in the two cases. We observe that, for all k ≥ 1 and all y ∈ {0, 1},

\[ \begin{cases} p^{\alpha,\beta}_{1,0} = \alpha \leq p^{\alpha,\beta}_{k,y} \leq \beta = p^{\alpha,\beta}_{1,1} & \text{if } \alpha \leq \beta, \\ p^{\alpha,\beta}_{1,1} = \beta \leq p^{\alpha,\beta}_{k,y} \leq \alpha = p^{\alpha,\beta}_{1,0} & \text{if } \beta \leq \alpha. \end{cases} \tag{10} \]

Then, in the positively correlated case, according to equation (10), if a channel has just been observed to be idle (k = 1, y = 1), the optimal action is to observe it once more, since this channel has the highest (or equal) probability of being free: p^{α,β}_{1,1} ≥ p^{α,β}_{k,y} for all k and y. On the contrary, if a channel has just been observed to be occupied (k = 1, y = 0), it is optimal not to observe it, since this channel has the lowest probability of being free. When the system is negatively correlated, the policy is reversed.

Let π+ denote the myopic policy in the positively correlated case and π− the myopic policy in the negatively correlated one.

The long-term rewards of the policies π+ and π− cannot be computed exactly. However, one may use the approach of [Zhao:al:08] to compute approximations of V^{π+} and V^{π−} and obtain:

\[ V^{\pi_+}_{\alpha,\beta} \approx \frac{M\,\nu_1}{1-\beta+\nu_1}, \qquad V^{\pi_-}_{\alpha,\beta} \approx \frac{M\,\alpha}{1-\nu_1+\alpha}. \tag{11} \]
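A quick Monte Carlo sketch (ours; all parameter values and the simulation details are illustrative assumptions) can be used to compare the myopic policy's empirical throughput with the approximation (11) in the positively correlated case, with M = 1 sensed channel:

```python
import random

# Monte Carlo sketch: a secondary user senses one of N stochastically
# identical channels with the myopic policy in the positively correlated case
# (stay on an idle channel; on a busy observation, switch to the channel
# unobserved for the longest time). Compare with the approximation (11).

def simulate_myopic(alpha, beta, n_channels, n_steps, rng):
    states = [1] * n_channels
    queue = list(range(n_channels))    # front = channel currently sensed
    reward = 0
    for _ in range(n_steps):
        states = [1 if rng.random() < (beta if s else alpha) else 0 for s in states]
        sensed = queue[0]
        if states[sensed]:
            reward += 1                # idle: transmit and keep sensing it
        else:
            queue.append(queue.pop(0)) # busy: move on to the least recently observed
    return reward / n_steps

if __name__ == "__main__":
    alpha, beta = 0.2, 0.8             # positively correlated: alpha <= beta
    nu1 = alpha / (1.0 - beta + alpha)
    approx = nu1 / (1.0 - beta + nu1)  # approximation (11) with M = 1, unit bandwidth
    emp = simulate_myopic(alpha, beta, 3, 200000, random.Random(1))
    print(emp, approx)
```

For moderately persistent channels such as these, the empirical throughput lands close to the approximation, which is what makes (11) usable inside the tiling algorithm.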

#### 4.2.2 Applying the tiling algorithm

The secondary user thus needs to distinguish between values of the parameter that lead to positive or negative one-step correlation in the chain. Knowing which of these two alternatives applies is sufficient to determine the optimal policy. Let Z+ and Z− be the policy zones corresponding to the two optimal policies π+ and π− (see Figure 8). Between these zones, we introduce a frontier zone.

The estimation of the parameter and the confidence region are similar to the one channel case (see Section 4.1). Assumption 1 of Theorem 1 is thus satisfied. Moreover, given the simple geometry of the frontier zone, Assumption 2 is easily verified. Indeed, any confidence rectangle whose diameter is smaller than the width of the frontier zone is included either in the frontier zone or in one of the policy zones. Moreover, for any point in the frontier zone, there exists, in each of the two policy zones, a point at a distance smaller than a constant times the width of the frontier zone. Finally, the approximations of the long-term rewards V^{π+} and V^{π−} defined in (11) are Lipschitz functions, and hence the third condition of Theorem 1 is satisfied.

#### 4.2.3 Experimental Results

To illustrate the performance of the approach, we ran the tiling algorithm for a grid of values of (α, β) regularly covering the parameter space. For each value of the parameter, 10 Monte Carlo replications of the data were processed. For a fixed time horizon n, the width of the frontier zone is taken equal to 0.15. The resulting cumulated regret has an empirical distribution which does not vary much with the actual value of the parameter and remains small on average. However, it may be observed that the average length of the exploration phase, represented in Figure 9, depends on the value of (α, β). First observe that the exploration is quite long for parameters close to the frontier zone and short otherwise. Indeed, when the actual parameter is far from the policy frontier, the exploration phase only runs until the confidence region is included in the corresponding policy zone, which is achieved very rapidly. On the contrary, when the true parameter is inside the frontier zone, the exploration phase lasts longer. Remark that for parameter values that sit exactly on the policy frontier, both policies are indeed equivalent. This observation is captured, to some extent, by the algorithm, as the maximal durations of the exploration phase do not occur exactly on the policy frontier. The second important observation is that the exploration phase is the longest when α is close to 0 or β is close to 1. Indeed, when α is close to 0 (resp. β is close to 1), the channel is very often busy (resp. idle), and hence it is difficult to estimate β (resp. α).

The latter effect is partially predicted by the asymptotic approach of [Long:al:08], who used the Central Limit Theorem to show that the length of the exploration phase, for a channel with transition probabilities $(\alpha^*, \beta^*)$, has to be equal to

$$l_{\mathrm{expl}}(\alpha^*,\beta^*,\delta,P_C)=\frac{\left(\Phi^{-1}\!\left(\tfrac{P_C+1}{2}\right)\right)^2}{\delta^2}\,(1-\alpha^*)\left(\frac{1}{\alpha^*}+\frac{1}{1-\beta^*}\right) \qquad (12)$$

in order to guarantee that $\alpha^*$ is properly estimated (with a similar result holding for $\beta^*$). In (12), $\Phi$ stands for the standard normal cumulative distribution function, and $\delta$ and $P_C$ are, respectively, the required estimation accuracy and confidence level. This formula rightly suggests that when $\alpha^*$ is very small, there are very few observed transitions from the busy to the idle state and hence that estimating $\alpha^*$ is a difficult task. However, it can be seen in Figure 9 that with the tiling algorithm, the length of the exploration phase is actually longer when both $\alpha^*$ and $\beta^*$ are very small, but is not particularly long when $\alpha^*$ is small and $\beta^*$ is close to one (upper left corner in Figure 9). Indeed, in the latter case, the channel state is very persistent, which implies few observed transitions and, correspondingly, that estimating either $\alpha^*$ or $\beta^*$ would require many observations. On the other hand, in this case the channel is strongly positively correlated, and even a few observations suffice to decide which of the two policies is appropriate.

## 5 Conclusion

The tiling algorithm is a model-based reinforcement learning algorithm applicable to opportunistic channel access. It is meant to adequately balance exploration and exploitation by adaptively monitoring the duration of the exploration phase, so as to guarantee a worst-case regret bound for a pre-specified finite horizon. Furthermore, it has been shown in Theorem 2 that in large regions of the parameter space, the regret can indeed be guaranteed to be logarithmic. In numerical experiments on the single channel and stochastically identical channels models, the tiling algorithm was indeed able to adapt the length of the exploration phase to the sequence of observations. In particular, we observed in the stochastically identical model that the algorithm was able to interrupt the exploration phase rapidly in cases where the nature of the optimal policy is obvious.

Looking ahead, the tiling algorithm also holds promise for other applications, for example in wireless communications. Concerning opportunistic channel access, the algorithm as it stands cannot handle the general channel model presented in Section 2 (with stochastically non-identical channels). Adapting our approach so that its main principles apply to this general model is an interesting direction for future work.

## Appendix A Appendix: Proof of Theorem 1

At the end of the exploration phase, if the true parameter θ* is in the confidence region, there are two possibilities: either the confidence zone is included in a policy zone or it is included in a frontier zone. If the confidence zone is in a policy zone, the regret is equal to the sum of the duration of the exploration phase and of the loss corresponding to the event that the confidence region does not contain θ*. If the confidence zone is in a frontier zone, an additional regret term is the loss due to the fact that the policy selected at the end of the exploration phase is not necessarily the optimal one for the true parameter θ*. Denote by π* the optimal policy for θ* and by π the selected policy, and note that both are compatible with the frontier zone. The corresponding loss splits into three terms: the last term is negative, since π* is the optimal policy for θ*, and the two other terms can be bounded using Assumption 3. According to Assumption 2, the width of the frontier zone can then be chosen so that this loss is controlled.

The maximal regret is obtained when the confidence region belongs to a frontier zone. According to Assumptions 1 and 2, if the length of the exploration phase is chosen appropriately, then with large probability the confidence region has diameter of order ε(n). The regret is then bounded by

$$\max_{\theta^*} R_n(\theta^*) \le \frac{c_1^2 \log n}{c_2^2\,\epsilon^2(n)} + n\,c'\,\epsilon(n) + c'_1\, n \exp\left\{-\tfrac{1}{3}\log n\right\},$$

which is minimized by taking ε(n) proportional to (log n / n)^{1/3}.
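The trade-off between the first two terms of this bound can be checked numerically. In the hedged sketch below the constants c₁, c₂, c' are set to 1 for illustration, and the closed-form minimizer follows from setting the derivative of A/ε² + n c' ε to zero (the third term of the bound does not depend on ε).

```python
import math

def bound(eps, n, c1=1.0, c2=1.0, cp=1.0):
    """First two terms of the regret bound, as a function of eps.
    Constants are illustrative placeholders, not the paper's values."""
    return (c1**2 * math.log(n)) / (c2**2 * eps**2) + n * cp * eps

def eps_star(n, c1=1.0, c2=1.0, cp=1.0):
    """Closed-form minimizer: solving -2A/eps^3 + n*cp = 0
    with A = c1^2 log(n) / c2^2 gives eps = (2A / (n*cp))^(1/3),
    which is proportional to (log n / n)^(1/3)."""
    a = c1**2 * math.log(n) / c2**2
    return (2 * a / (n * cp)) ** (1 / 3)
```

Plugging ε(n) ∝ (log n / n)^{1/3} back into the bound gives a worst-case regret of order n^{2/3} (log n)^{1/3}, consistent with the frontier-zone case being the hardest.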

## Appendix B Appendix: Proof of Theorem 2

The condition means that the distance between θ* and any border of the policy zone is larger than the width of the frontier zone. Hence, as soon as the confidence region is small enough, it is included in the corresponding policy zone, and the regret of the tiling algorithm is then equal to the duration of the exploration phase plus the loss incurred in the event that the confidence region does not contain θ*. According to Assumption 4, if the length of the exploration phase is chosen appropriately, the confidence region is small enough with large probability. The regret is then bounded accordingly, and optimizing over the length of the exploration phase yields the logarithmic regret bound of Theorem 2.

## Appendix C Appendix: Confidence interval for Markov Chains

In this appendix, we prove that the confidence region defined in equation (9) satisfies Assumption 1. First, the event of interest is decomposed according to the number of visits to each state. Using Hoeffding's inequality, we can then bound the probability that the empirical transition frequencies deviate from the true transition probabilities. Moreover, we need to bound the probability that a state is visited too rarely. To this end, we apply Theorem 2 of [Glynn:Ormoneit:02], remarking that the minorization constant is lower-bounded. We then have

 P(N1t

where the last inequality holds for t large enough. A similar bound can be shown for the number of visits to the other state, and combining the two bounds with the Hoeffding step above yields the claimed result.
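For concreteness, the Hoeffding step above yields a confidence half-width of the familiar form sqrt(log(2/η) / (2N)) for a transition probability estimated from N visits to a state. The sketch below uses this standard two-sided form as an illustrative assumption; it is not claimed to reproduce the exact constants of equation (9).

```python
import math

def hoeffding_halfwidth(n_visits, eta):
    """Half-width of a (1 - eta) two-sided Hoeffding confidence interval for
    a transition probability estimated from n_visits visits to the state.
    Standard form sqrt(log(2/eta) / (2 * n_visits)); constants illustrative."""
    return math.sqrt(math.log(2 / eta) / (2 * n_visits))
```

The half-width shrinks as 1/sqrt(N), which is why rarely visited states (e.g. the idle state of a mostly busy channel) dominate the exploration length.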