Advanced channel sensing technologies have enabled cognitive radio systems to acquire the channel status in real time and exploit the temporal, spatial and spectral diversity of wireless communication channels for performance improvements. Various opportunistic spectrum access strategies have been investigated, under both offline settings where the channel statistics are known a priori [2, 3, 4, 5, 6, 7, 8, 9], and online settings where the users do not possess a priori channel statistics but have to infer them from observations [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20].
While the main objective in such works is to improve the spectrum usage efficiency by leveraging the channel status measurements, the inherent uncertainty in channel sensing results, and the costs of channel sensing and transmission, are rarely investigated. In practice, even if a channel is sensed to be idle, the transmission rate it can support is still uncertain, due to the inherent randomness of the wireless medium. Thus, the reward of each transmission is random in general. Meanwhile, both sensing and transmission consume energy, and channel sensing also causes delay. For cognitive radio systems that operate under stringent energy and power constraints, or communication applications that can only tolerate short end-to-end delays, such costs become critical in determining the optimal operation of the cognitive radio system. Intuitively, the optimal spectrum access strategy depends on the intricate relationship between the costs and reward, as well as the channel statistics. What makes the problem even more complicated is that the statistics of such quantities are often time-varying and unknown beforehand in the fast-changing radio environment.
Within this context, we investigate cost-aware learning and optimization for multi-channel opportunistic spectrum access in a cognitive radio system. Our objective is to analytically characterize the impact of the costs and reward of sensing and transmission on the optimal behavior of the user, and to develop optimal cost-aware opportunistic spectrum access strategies in both offline and online settings. To this end, we adopt a discrete-time model, where the state of each channel evolves according to an independent Bernoulli process from time frame to time frame. The user is allowed to sequentially sense the channels at the beginning of each time frame to get measurements of the instantaneous channel states. The user then uses the measurements to decide its actions, i.e., whether to continue sensing, to transmit over one channel, or to quit the current frame. We associate random costs with sensing and transmission, and assign a positive random reward to each transmission. For the offline setting, we leverage the finite-horizon dynamic programming (DP) formulation to identify the optimal policy of the user to maximize the expected net reward (reward minus cost) in each frame. For the online setting, we cast the problem in the multi-armed bandit (MAB) framework, and propose a cost-aware online learning strategy to infer the statistics of the channels, costs and reward, and asymptotically achieve the maximum per-frame net reward.
I-A Main Contributions
The main contributions of this paper are four-fold:
First, we identify the optimal offline spectrum access policy with a priori statistics of the channel states, and of the costs and reward of sensing and transmission. The optimal offline policy exhibits a unique recursive double-threshold structure. The thresholds depend on the statistical information of the system, and can be determined in a recursive fashion. Such structural properties enable an efficient way to identify the optimal actions of the transceiver, and serve as the benchmark for the online algorithm developed in the sequel.
Second, we propose an online algorithm to infer the statistics from past measurements, and track the optimal offline policy at the same time. In order to make the algorithm analytically tractable, we decouple the exploration stage and the exploitation stage. We judiciously control the length of the exploration stage to ensure that the sensing and transmission policy in the exploitation stage is identical to the optimal offline policy with high probability. We theoretically analyze the cumulative regret, and show that it scales in.
Third, we establish a lower bound on the regret for any -consistent online strategies. -consistent strategies are those that perform reasonably well with high probability. The lower bound scales in , which matches the upper bound in terms of scaling and implies that our online algorithm is order-optimal. To the best of our knowledge, this is the first online opportunistic spectrum access strategy achieving order-optimal regret when sequential sensing is considered.
Fourth, the online setting discussed in this paper is closely related to the cascading multi-armed bandits model in machine learning. The cost-aware learning strategy proposed in this paper extends the standard cost-oblivious cascading bandits, and can be adapted and applied to a wide range of applications where the cost of pulling arms is non-negligible and the rank of arms affects the system performance, such as web search and dynamic medical treatment allocation.
I-B Relation to the State of the Art
Learning for multi-channel dynamic spectrum access is often cast in the MAB framework. In general, the classic non-Bayesian MAB assumes that there exist independent arms, each generating i.i.d. rewards over time from a given family of distributions with an unknown parameter. The objective is to play the arms for a time horizon to minimize the regret, i.e., the difference between the expected reward by always playing the best arm, and that without such prior knowledge. It has been shown that logarithmic regret is optimal [24, 25].
Within the MAB framework, order-optimal sensing and transmission policies have been developed for both the single-user scenario and the multiple-user scenario [11, 12, 13, 14]. In those works, the objectives are mainly to identify the best channel or channel-user match and access them most of the time in order to maximize the expected throughput.
Although the online strategy developed in this paper falls in the MAB framework, the sequential sensing model and the intricate impact of the sensing and transmission costs and reward on the system operation make our problem significantly different from existing works [10, 11, 12]. Since the user will stop sensing if a certain condition is satisfied, the random stopping time implies that only a random subset of channels will be observed in each sensing phase. Such a partial observation model makes the corresponding theoretical analysis very challenging. Moreover, the error in estimating the mean values of the costs and reward will affect the correctness of the online policy and propagate in a recursive fashion, which makes the regret analysis extremely difficult.
The sequential sensing model and analytical approach adopted in this paper are similar to those in [8, 15]. In those works, a constant sensing cost is considered for each channel, the optimal offline probing and transmission scheduling policy is obtained through a DP formulation, and a corresponding online algorithm is proposed. Compared with [8, 15], our model takes the randomness of the sensing/transmission costs and reward into consideration, which is a non-trivial extension. Besides, the Bernoulli channel status model adopted in our paper enables us to obtain the explicit structure of the optimal offline policy and the order-optimal online algorithm.
I-C Paper Outline
This paper is organized as follows. Section II describes the system model. Section III and Section IV describe the optimal offline policy and the online algorithm, respectively. Section V evaluates the proposed algorithms through simulations. Concluding remarks are provided in Section VI. Important proofs are deferred to the Appendix.
II System Model
We consider a single wireless communication link consisting of channels, indexed by the set and a user who would like to send information to a receiver using exactly one of the channels. We partition the time axis into frames, where each frame consists of a channel sensing phase followed by a transmission phase. The channel sensing phase consists of multiple time slots, where in each slot, the user is able to sense one of the channels in , and obtain a measurement of the instantaneous channel condition. A similar sequential sensing mechanism has been discussed in [3, 4, 8]. In this work, we assume that the sensing phase is at most time slots, which corresponds to the scenario in which the user senses each channel once in the time frame. As we will see, the actual length of the probing phase depends on the parameters of the system, and will be automatically adjusted to optimize the system performance. We adopt the constant data time (CDT) model studied in [4, 8], where the transmitter has a fixed amount of time for data transmission, regardless of how many channels it senses. In general, the length of the transmission phase is much larger than the duration of a time slot in the sensing phase.
The communication over the link proceeds as follows. Within each frame, at the beginning of each time slot in the sensing phase, the user must choose between two actions: 1) sense: sample a channel that has not been sensed before and get its status; 2) stop: end the sensing phase in the current frame. Once the user stops sensing, it must choose one of the following actions based on up-to-date sensing results: 1) access: transmit over one of the channels already sensed in the current frame using a predefined transmission power; 2) guess: transmit over one of the channels that have not been sensed in the current frame; 3) quit: give up the current frame and wait until the next frame.
In the following, we use to index the time frames, and use to index the time slots within a frame, where is the last time slot in frame . refers to the -th time slot in the sensing phase of frame . Let be the channel the user senses at time for . We use to denote the channel the user decides to use in the transmission phase in frame . indicates that it quits the transmission opportunity in frame . Let , , and be the probing cost, communication reward, and transmission cost associated with the decisions and , respectively. These costs can refer to the energy consumed for sensing/transmission, the interference caused by the actions, etc., and can be adjusted according to the resource constraints or quality of service (QoS) requirements in the system. The reward may correspond to the information bits successfully delivered during the transmission phase. For clarity of exposition, we do not assign any cost to the action quit. We can always extend our current model to include a positive cost for quit, which can be used to capture certain QoS requirements (such as delay) in the system. As we will see in the rest of this paper, this will not change the structure of the optimal offline policy, or the design and analysis of the online algorithm.
We make the following assumptions on the distributions of the channel statistics, the sensing and transmission costs, and the communication reward.
The state of each channel stays constant within frame (denoted as ), and varies across frames according to an independent and identically distributed (i.i.d.) Bernoulli random process with parameter . Without loss of generality, we assume that .
If , and , is an i.i.d. random variable distributed over with mean ; Otherwise, if , or if and , .
is an i.i.d. random variable distributed over with mean .
If , is an i.i.d. random variable distributed over with mean ; If , .
, and .
Assumption 1.1 indicates that the status of a channel alternates between two states: idle and busy, which is a common assumption in existing works [3, 11, 10]. Assumption 1.2 is related to the fact that the maximum transmission rate supported by an available channel is random, due to the uncertain link condition in wireless medium. Assumptions 1.3 and 1.4 correspond to the sensing and transmission costs in the system. We assume they are random variables in general. In practice, the costs may be related to the physical resources (e.g., energy/power) available in the system, or QoS (e.g., delay) requirements of different applications. Therefore, they are usually not fixed but adaptively changing in order to satisfy the instantaneous constraints. When , they become two positive constants. We impose Assumption 1.5 to make the problem reasonable and non-trivial.
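The frame model above can be illustrated with a small Monte Carlo sketch. It is only an illustration under stated assumptions: the uniform ranges for the costs and reward, the idle probabilities in `theta`, and the fixed rule "sense in the listed order, access the first idle channel, quit otherwise" are all hypothetical choices, not values taken from this paper.

```python
import random

def simulate_frame(theta, rng):
    """One frame: sense channels in the given order until one is idle, then access it."""
    net = 0.0
    for p in theta:
        net -= rng.uniform(0.0, 0.2)        # assumed random sensing cost, mean 0.1
        if rng.random() < p:                # Bernoulli channel state: idle with probability p
            net += rng.uniform(0.5, 1.5)    # assumed random reward, mean 1.0
            net -= rng.uniform(0.1, 0.5)    # assumed random transmission cost, mean 0.3
            return net
    return net                              # every sensed channel was busy: quit at no extra cost

def expected_net(theta, mu_s=0.1, mu_r=1.0, mu_t=0.3):
    """Expected net reward of the same fixed rule, computed from the means only."""
    value = 0.0
    for p in reversed(theta):
        value = -mu_s + p * (mu_r - mu_t) + (1 - p) * value
    return value

rng = random.Random(0)
theta = [0.9, 0.5]                          # hypothetical idle probabilities
average = sum(simulate_frame(theta, rng) for _ in range(100_000)) / 100_000
```

The empirical `average` should be close to `expected_net(theta)`, which reflects the fact that the expected net reward of a fixed rule depends on the costs and reward only through their means.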
In the following, we will first identify the structure of the optimal offline policy with all of the statistical information in Assumptions 1 known a priori, and then develop an online scheme to learn the statistics and track the optimal offline policy progressively.
III Optimal Offline Policy
In this section, we assume that the statistics , , , are known a priori. However, the instantaneous realizations of the corresponding random variables remain unknown until actions are taken and observations are made. Thus, the user needs to make sensing and transmission decisions based on up-to-date observations, as well as the statistics. In the following, we use policy to refer to the rules that the user would follow in a frame. Specifically, this includes an order to sample the channels, a stopping rule to stop probing (and determine ), and a transmission rule to decide which channel to use, all based on past measurements. We note that due to the randomness in the system, the same policy may lead to different observations. Accordingly, the user may take different actions in the sequel.
Under Assumptions 1, when all of the statistics are known beforehand, the observations made in one frame would not provide any extra information about other frames. Thus, the optimal offline policy should be the same in each frame. In this section, we will drop the frame index and focus on one individual frame. Let be the sequence of sensing and transmission actions the user takes following the policy . Then, the optimization problem can be formulated as follows:
We will use to denote the optimal policy that achieves . Such a policy is guaranteed to exist since there are a finite number of channels. We have the following observations.
Under the optimal policy, if the transmitter senses a channel and finds it is available, it should stop sensing and then transmit over it in the upcoming transmission phase.
First, we note that if the user transmits over a sampled and available channel, the first term in (III) would be , which cannot be improved further if the transmitter continues to sense, guess or quit. However, continued sensing would increase the cost involved in the second term in (III). Thus, transmitting over the channel would be the best action given that the transmitter samples a channel and finds it is available. ∎
There exist only two possible structures of the optimal policy: 1) The transmitter chooses to “guess” without sensing any channel; 2) The user senses an ordered subset of channels sequentially, until it finds the first available channel. If none of them are available, it will decide to “guess” or “quit”. Whenever the transmitter chooses to “guess”, it transmits over the best unsampled channel, i.e., the one with the maximum .
Lemma 1 indicates the structure of the optimal policy would be a pure “guess”, or a sequence of probing followed by a “guess” or “quit”. It then becomes clear that the unsampled channel utilized for transmission should be the best unsampled channel, which gives the maximum expected reward. ∎
Under the optimal policy, at any time slot, a sufficient information state is given by the tuple , where is the set of unsensed channels and is the index of the last sensed channel in the frame. When , i.e., no channel has been sensed yet, the first two parameters equal zero.
When no channel has been examined yet, the information state can be represented as without any ambiguity. When some channels have been sensed, state implies that , . According to Corollary 1, the probing order of those channels will not affect the reward/cost of any future actions, thus is redundant. Therefore, contains sufficient information for future decision-making. ∎
Based on Corollary 2, in the following, we will use dynamic programming (DP) to represent the optimal decision process. Let denote the value function, i.e., the maximum expected remaining net reward given the system state is . Then,
where the expectation is taken with respect to . The first term on the right hand side of (III) represents the expected net reward if the user decides to access the last sensed channel, the second term represents the maximum expected net reward of probing a channel in , the third term represents the maximum expected net reward of guess, i.e., transmitting over the best unsampled channel, and the last term represents the net reward of quitting the current frame. Thus, given the information state , the user needs to choose the action to maximize the expected remaining net reward, while gives the expected total net reward in a frame under the optimal offline policy. This is a standard finite-horizon DP, which can be solved through backward induction. Roughly speaking, without any structural information of the optimal solution, the state space of this DP is , which quickly becomes prohibitive as increases. Therefore, in the following, we will first identify the structural properties of the optimal policy through theoretical analysis, and then leverage those properties to obtain the optimal offline policy.
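As a rough illustration of this four-way maximization, the sketch below evaluates a value function over all subsets of unsensed channels. It is a sketch under assumptions rather than the exact recursion of this paper: the names `theta`, `mu_s`, `mu_r` and `mu_t` (idle probabilities and mean sensing cost, reward and transmission cost) are hypothetical, and the state is simplified to whether the last sensed channel was idle plus the set of unsensed channels.

```python
from functools import lru_cache

def frame_value(theta, mu_s, mu_r, mu_t):
    """Maximum expected net reward of one frame, by exhaustive DP over subsets."""
    @lru_cache(maxsize=None)
    def value(last_idle, unsensed):
        options = [0.0]                                  # quit
        if last_idle:
            options.append(mu_r - mu_t)                  # access the idle channel just sensed
        if unsensed:
            best = max(theta[j] for j in unsensed)
            options.append(best * mu_r - mu_t)           # guess on the best unsensed channel
            for j in unsensed:                           # sense channel j next
                rest = unsensed - {j}
                cont = theta[j] * value(True, rest) + (1 - theta[j]) * value(False, rest)
                options.append(-mu_s + cont)
        return max(options)

    # No channel has been sensed yet, so there is nothing to access.
    return value(False, frozenset(range(len(theta))))
```

The cache holds one entry per reachable (flag, subset) pair, so the number of states grows exponentially with the number of channels, which is the scaling issue that motivates looking for structure in the optimal policy.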
We observe that the optimal policy has the following structure.
Denote as the subset of channels to sense under the optimal policy. Then, the transmitter should sense the channels sequentially according to a descending order of , until it finds the first available channel. Moreover, , i.e., the channels in are better than any other channel outside .
The optimal offline policy is to sequentially examine the channels starting from channel 1. At time slot , there exist two thresholds , , , such that:
if , , , i.e., transmit over channel without probing.
if , , i.e., sense channel . The user will transmit over channel if , and it will move on to the next channel if .
if , , i.e., stop sensing and quit transmission in the current frame.
The values of the thresholds and can be determined recursively by solving (III) through backward induction.
The proof of Theorem 1 is presented in Appendix -B. Compared with the DP formulation in (III), the state space of this offline policy now is reduced to . Thus, the optimal policy can be identified in a more computationally-efficient manner.
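The reduced recursion can be sketched as a single backward pass over the channels sorted by idle probability. This is an illustrative reading of the structure in Theorem 1, not the authors' implementation: it compares the three candidate actions directly instead of evaluating closed-form thresholds, and the parameter names (`theta`, `mu_s`, `mu_r`, `mu_t`) are assumptions introduced for the example.

```python
def offline_policy(theta, mu_s, mu_r, mu_t):
    """Return the per-step actions and the expected net reward of one frame.

    The action for channel i applies only if every channel ranked before it
    was sensed and found busy.
    """
    order = sorted(range(len(theta)), key=lambda i: -theta[i])
    value_next = 0.0                      # value after the last channel: nothing left to do
    actions = {}
    for i in reversed(order):             # backward induction from the worst channel
        options = {
            "quit": 0.0,
            "guess": theta[i] * mu_r - mu_t,
            "sense": -mu_s + theta[i] * (mu_r - mu_t) + (1 - theta[i]) * value_next,
        }
        actions[i] = max(options, key=options.get)
        value_next = options[actions[i]]
    return [(i, actions[i]) for i in order], value_next
```

For example, `offline_policy([0.9, 0.5, 0.1], 0.1, 1.0, 0.3)` assigns guess to the best channel, sense to the middle one and quit to the worst one, so the three regions separated by the two thresholds all appear. The pass visits each channel once, so its cost is linear in the number of channels after sorting.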
Remark: We point out that the optimal offline policy only depends on the mean of the costs and reward, thus, it can be directly applied to the scenario where the costs and reward are constants. Besides, it can also be applied to the case where the costs and reward are random but the instantaneous realizations are known beforehand. For this case, we can simply treat the costs and reward in each frame as constants and obtain the optimal policy in each frame individually. It can also be extended to handle the case where the costs and reward vary for different channels.
is a monotonically decreasing sequence, and is a monotonically increasing sequence.
Corollary 3 is implied by Lemma 4 in Appendix -C and the expressions of and in (14)(15) in Appendix -B. The monotonicity of the thresholds indicates that , i.e., channels with higher s are more prone to sense, while channels with lower are more prone to guess or quit. This is because the potential reward gain from sensing a bad channel is small compared with the cost of sensing.
As increases, the maximum number of channels to sense in each frame decreases.
Corollary 4 can be proved using the monotonicity of in according to (13) in Appendix -B. It indicates that by adjusting the cost of probing, the user is able to adaptively choose the number of channels to sense in each frame, thus achieving the optimal tradeoff between the cost and reward incurred by the sensing actions.
IV Online Optimization with Learning
In this section, we assume the statistics , , , are unknown beforehand. Then, our objective is to design an online strategy to decide based on up-to-date observations, so as to minimize the following regret measure:
where , i.e., the states of the sensed channels, the costs of sensing those channel, and the corresponding transmission cost and reward; ; and is the maximum expected reward as if the statistics were known beforehand.
If , it is called sub-linear in total regret and zero-regret in time average. Our objective is to design an online strategy to make converge to as quickly as possible. Intuitively, as more measurements become available, the user is able to infer the channel statistics more accurately and make more informed decisions accordingly. As observed in many previous works [10, 11, 12, 13, 14], the user faces an exploration-exploitation dilemma: on one hand, the user would take more sensing and transmission actions in order to get more measurements and refine its estimation accuracy; on the other hand, the user would exploit available information to track the optimal offline policy and optimize its net reward. The user should judiciously balance those two conflicting objectives in order to minimize the regret.
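For a simulated run, the regret can be tracked with a short helper such as the one below. It is a sample-path proxy under assumptions: the regret in this section is defined in expectation, whereas the sketch uses realized per-frame net rewards, and the function names are hypothetical.

```python
from itertools import accumulate

def cumulative_regret(optimal_value, net_rewards):
    """Running sum of (optimal per-frame net reward - realized net reward)."""
    return list(accumulate(optimal_value - r for r in net_rewards))

def average_regret(optimal_value, net_rewards):
    """Cumulative regret divided by the number of frames elapsed."""
    totals = cumulative_regret(optimal_value, net_rewards)
    return [total / (t + 1) for t, total in enumerate(totals)]
```

A strategy is zero-regret in time average when the last entries of `average_regret` shrink toward zero as the number of frames grows.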
What makes the problem different and much more challenging than those existing works is the recursive structure of the optimal offline policy. As a result, the error in estimating would affect the ranking of the channels, as well as the decision that the user may take over each channel. Tracking the impact of the estimation error on the overall regret thus becomes very complicated. Moreover, due to the randomness of the channel status realizations, the user would stop sensing after observing a random number of channels, even if the user sticks to the same offline policy. Thus, if the user tries to track the optimal offline policy during exploitation, the channels are observed in a random and non-uniform fashion: the channels ranked high are observed with larger probability, while the channels ranked low may have limited chance to be observed. Therefore, the algorithm should take the sampling bias into consideration and adjust the exploration in a more sophisticated fashion.
We detail our joint learning and optimization strategy in Algorithm 1. In order to tackle the aforementioned challenges, we decouple exploration and exploitation into two separate phases. We use to track the number of times that channel has been sampled during the exploration phase up to frame . In the exploration phase, the channels that have been sampled fewer than times up to frame will be sampled, with being positive constant parameters. The specific values of and that ensure the optimal convergence rate of will be discussed in Lemma 7 in Appendix -C. In the exploitation phase, the user first estimates (denoted as ) by calculating the empirical averages of the channel states, the sensing cost, and the transmission reward and cost based on historical observations . It then executes the optimal offline policy using the estimates according to Theorem 1.
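The decoupled exploration/exploitation loop can be sketched as follows. This is not Algorithm 1 itself but a simplified illustration under several loudly stated assumptions: the exploration budget is taken to be `ell * log(t)` (the exact threshold is specified in the paper, not here), an exploration frame senses every under-sampled channel and transmits once if any of them is idle, a guess does not update the channel estimate, and the uniform cost and reward ranges and the 0.5 placeholder estimates are hypothetical.

```python
import math
import random

class RunningMean:
    """Empirical average with a placeholder value before any sample arrives."""
    def __init__(self, default=0.5):
        self.total, self.count, self.default = 0.0, 0, default

    def add(self, x):
        self.total += x
        self.count += 1

    def value(self):
        return self.total / self.count if self.count else self.default

def plan(theta, mu_s, mu_r, mu_t):
    """Backward induction over channels sorted by descending estimated idle probability."""
    order = sorted(range(len(theta)), key=lambda i: -theta[i])
    value_next, actions = 0.0, {}
    for i in reversed(order):
        options = {"quit": 0.0,
                   "guess": theta[i] * mu_r - mu_t,
                   "sense": -mu_s + theta[i] * (mu_r - mu_t) + (1 - theta[i]) * value_next}
        actions[i] = max(options, key=options.get)
        value_next = options[actions[i]]
    return order, actions

def run_online(theta, frames, ell=5.0, seed=0):
    rng = random.Random(seed)
    n = len(theta)
    explored = [0] * n                                  # exploration-phase sample counts
    theta_hat = [RunningMean() for _ in range(n)]
    s_hat, r_hat, t_hat = RunningMean(), RunningMean(), RunningMean()
    exploration_frames = 0

    def sense(i):
        s_hat.add(rng.uniform(0.0, 0.2))                # assumed sensing cost distribution
        idle = rng.random() < theta[i]
        theta_hat[i].add(1.0 if idle else 0.0)
        return idle

    def transmit(idle):
        t_hat.add(rng.uniform(0.1, 0.5))                # assumed transmission cost distribution
        if idle:
            r_hat.add(rng.uniform(0.5, 1.5))            # reward is observed only on idle channels

    for t in range(1, frames + 1):
        under_sampled = [i for i in range(n) if explored[i] < ell * math.log(t)]
        if under_sampled:                               # exploration frame
            exploration_frames += 1
            found_idle = False
            for i in under_sampled:
                explored[i] += 1
                if sense(i):
                    found_idle = True
            if found_idle:
                transmit(True)
        else:                                           # exploitation frame
            estimates = [m.value() for m in theta_hat]
            order, actions = plan(estimates, s_hat.value(), r_hat.value(), t_hat.value())
            for i in order:
                if actions[i] == "quit":
                    break
                if actions[i] == "guess":
                    transmit(rng.random() < theta[i])
                    break
                if sense(i):
                    transmit(True)
                    break
    return explored, exploration_frames, [m.value() for m in theta_hat]
```

Calling `run_online([0.9, 0.5, 0.1], 2000)` returns the exploration counts, the number of exploration frames and the final channel estimates. With a logarithmic budget, exploration frames make up a vanishing fraction of the horizon, which is the intuition behind controlling the exploration length.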
The performance of the online algorithm is theoretically characterized in the following theorem.
There exist sufficiently large constants , such that the regret under Algorithm 1 is bounded by
where , .
Theorem 2 indicates that the cumulative regret is sub-linear in , thus achieving zero-regret asymptotically. The first term of the regret is due to the exploration while the last term comes from exploitation. The proof is sketched as follows: we first relate the error in estimating the thresholds with the estimation errors , , , . Based on this relationship, we derive an upper bound on the number of samples required to ensure all estimation errors are sufficiently small, so that the user will choose the optimal offline policy in the exploitation phase with high probability. Finally, we explicitly bound the regret by examining the regrets from exploration and exploitation separately. A detailed proof of Theorem 2 is presented in Appendix -C.
We also establish a lower bound on the regret under any -consistent  online strategy in the following.
Definition 1 (-consistent strategy)
Let be the number of times that the user takes the optimal offline policy in a frame over the first frames under an online strategy. Then, , if
the strategy is -consistent.
Roughly speaking, an -consistent strategy represents a category of “good” online strategies under which the user selects the optimal offline policy in each frame with high probability.
as the KL-distance between two Bernoulli distributions with parameters, , respectively. Then, we have the following lower bound for -consistent strategies.
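For reference, the KL distance between two Bernoulli distributions has the standard closed form p log(p/q) + (1 - p) log((1 - p)/(1 - q)). A minimal sketch, with the hypothetical function name `bernoulli_kl` and natural logarithms assumed:

```python
import math

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), in nats, for 0 < q < 1."""
    total = 0.0
    for a, b in ((p, q), (1 - p, 1 - q)):
        if a > 0:                         # the term 0 * log(0 / b) is taken as 0
            total += a * math.log(a / b)
    return total
```

The distance is zero when the two parameters coincide and grows as they separate, so channels whose idle probabilities are close to a decision threshold are the hardest to distinguish from samples.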
Assume the optimal offline policy with is to sense/guess the first channels sequentially according to Theorem 1 and quit the remaining channels, where . Then, under any -consistent strategy, we have
where is the set of policies under which channel might be sampled or used with guess, is the per-frame regret under policy , and is the upper threshold for channel under the optimal offline policy, assuming is the best channel.
Theorem 3 indicates that for most system settings where at least one channel is left out under the optimal offline policy (i.e., ), the regret lower bound under any -consistent online strategy scales in . This scaling matches the upper bound in Theorem 2, thus our online strategy is order-optimal. We restrict to the situation because, under this setting, we are able to easily identify some sub-optimal policies adopted under an online strategy, i.e., any policy that involves sense or guess over channel , , is strictly sub-optimal. By showing that the probability of choosing a policy that involves channel cannot be too small, we obtain a lower bound on the regret. The proof of Theorem 3 can be found in Appendix -D.
V Simulation Results
In this section, we evaluate the offline and online spectrum access policies through numerical results.
V-A Offline Policy
We first study how the system parameters affect the optimal offline policy. We set , , and change the values of and separately. As shown in Table I, for fixed , the maximum number of channels to sense or access (i.e., ) is monotonically decreasing as increases. However, when is fixed, we do not observe such monotonicity in . This can be explained as follows: When is small, the cost of a wrong guess is small compared with the cost of sensing, thus the system would choose to transmit over the best channel directly without sensing. When is large, the cost of a wrong guess outweighs that of sensing, thus the user should choose to sense instead of guess. As grows, the user will probe fewer channels because the potential gain from sensing a channel (i.e., ) may not cover the sensing cost.
[Table I: Action over channel]
V-B Online Algorithm
Then, we resort to numerical simulations to verify the effectiveness of Algorithm 1 in the online learning situation.
Before we present the simulation results, we first introduce two baseline learning algorithms for comparison, namely, the -greedy learning algorithm, and the Thompson Sampling (TS) based algorithm. The only difference between the -greedy algorithm and Algorithm 1 is that the condition for exploration is replaced with , where is an independent sample of a Bernoulli distribution with parameter . Thus, the -greedy algorithm explores the system at a fixed rate. TS is a randomized algorithm based on the Bayesian principle, and has generated significant interest due to its superior empirical performance. We tailor the TS algorithm to our setting, where the key idea of the algorithm is to sample the channel statistics according to Beta distributions, whose parameters are determined by past observations. Though we are not able to characterize its performance theoretically, we do observe significant performance improvement through simulations. Therefore, we include it in this section for comparison purposes. Obtaining an upper bound on the regret under the TS algorithm is one of our future directions.
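The two baseline ingredients can be sketched in a few lines. This is an illustration under assumptions, not the tailored algorithms used in the simulations: the parameter name `epsilon`, the uniform Beta(1, 1) prior and the per-channel idle/busy counters are hypothetical choices.

```python
import random

def fixed_rate_explores(epsilon, rng):
    """Fixed-rate baseline: explore in a frame with probability epsilon."""
    return rng.random() < epsilon

def thompson_sample(idle_counts, busy_counts, rng):
    """Draw one plausible idle probability per channel from its Beta posterior.

    Assumes a Beta(1, 1) prior, so the posterior after a idle and b busy
    observations is Beta(1 + a, 1 + b).
    """
    return [rng.betavariate(1 + a, 1 + b) for a, b in zip(idle_counts, busy_counts)]
```

In a TS-style loop, the sampled probabilities would take the place of the empirical averages when planning a frame, so rarely observed channels are occasionally drawn optimistically and get explored without a separate exploration phase.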
We now compare the performance of Algorithm 1, the -greedy algorithm, and the TS algorithm through simulations. We set , , , , and let the sensing cost, transmission cost and reward be uniform random variables with . According to Theorem 1, the optimal offline policy is to sense the first three channels sequentially until it finds the first available channel, and quit the current frame if all of them are unavailable. The expected per-frame net reward under the optimal policy is . We first set , for Algorithm 1, and for the -greedy algorithm. We run each algorithm 100 times and plot the sample-path average in Fig. 1(a) and Fig. 1(b). As we can see in Fig. 1(a), Algorithm 1 and the TS algorithm outperform the -greedy algorithm significantly as time becomes sufficiently large. In addition, the -greedy algorithm achieves smaller regret than Algorithm 1 when is small. This is because Algorithm 1 explores more aggressively initially, resulting in a larger regret. In Fig. 1(b), we also notice that the average regrets of Algorithm 1 and the TS algorithm approach zero as increases, indicating that both algorithms perform better after gaining information about the system. In contrast, the average regret under the -greedy algorithm levels off quickly and never converges to zero, which implies that it does not balance exploration and exploitation properly.
Next, we evaluate the per-frame net reward under those three algorithms, and compare them against the per-frame net reward under the optimal offline policy in Fig. 1(c). We notice that the per-frame net rewards under both Algorithm 1 and the TS algorithm converge to the upper bound. The per-frame net reward under Algorithm 1 drops significantly in certain frames, as indicated by the sharp spikes. This is because Algorithm 1 has separate exploration stages. Whenever some of the channels have not been observed a sufficient number of times, the user will enter the exploration stage. We also note that the interval between two consecutive spikes grows as increases, indicating that the portion of the exploration stage decays in time.
Finally, we focus on the impact of parameter selection on the performance under Algorithm 1 in Fig. 1(d). We set to be 10, 15, 20 respectively while keeping according to Lemma 7 in Appendix -C. Interestingly, we note that the performance does not change monotonically as varies. Specifically, when is 10, it results in the smallest regret initially. However, as increases, the regret becomes even larger than those with and 20. This can be interpreted as follows: the algorithm explores less with a smaller , thus saving the sensing cost in exploration. However, it also converges to the optimal offline policy at a lower rate, as it has fewer observations. In contrast, the algorithm explores more aggressively with a larger , leading to a larger cost in exploration but a faster convergence rate. Initially, the cost of exploration outweighs the reward of transmission, thus a smaller regret can be observed for smaller . When grows, the reward of transmission outweighs the cost of exploration, and the regret is mainly determined by the estimation accuracy. Therefore, the regret with smaller grows faster as increases, and eventually becomes greater than that with larger .
VI Discussions and Conclusions
In this paper, we investigated a discrete-time multi-channel cognitive radio system with random sensing and transmission costs and reward. We started with an offline setting and explicitly identified the recursive double-threshold structure of the optimal solution. With insight drawn from the optimal offline policy, we then studied the online setting and proposed an order-optimal online learning and optimization algorithm. We further compared our algorithm with other baseline algorithms through simulations.
The problem studied in this paper is essentially a cascading bandit problem with a “soft” cost constraint, which itself is a non-trivial extension of . We believe that the design and analysis of the algorithms in this paper advance the state of the art in both cognitive radio systems and MAB, and have the potential to impact a broader class of cost-aware learning and optimization problems.
-A Proof of Lemma 2
We first prove the first half of Lemma 2 through contradiction. Assume that under the optimal policy, the transmitter senses a channel right ahead of channel , where . We construct an alternative policy by switching the probing order of and . Consider a fixed realization of all , . Then, those two policies would result in different actions only when all channels sampled ahead of channel are unavailable, and .
Case I: . This event happens with probability . Under the original policy, the user would transmit over channel after probing. The instantaneous net reward in this step would be . After switching, the user will transmit over channel after probing channel and then channel . This results in additional probing cost with the same reward.
Case II: . This event happens with probability . Under the original policy, the user would transmit over channel after probing both channels and . After switching, the user will transmit over channel after probing it, without probing channel . The probing cost is therefore lower, with the same reward.
Thus, the expected difference in probing cost would be
Therefore, by switching the probing order of channel and channel , we save probing cost without reducing the reward in expectation. Thus, under the optimal policy, the transmitter must sense the channels sequentially according to a descending order of their means.
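The exchange argument above can be checked numerically on a toy instance. The sketch below is only an illustration under simplifying assumptions that are not taken from the paper: the sensing cost is a constant (`sensing_cost`), the user keeps probing until the first available channel is found, and the guess and quit actions are ignored. The function names and the example means are hypothetical.

```python
from itertools import permutations


def expected_probing_cost(means, sensing_cost=1.0):
    """Expected total sensing cost when channels are probed in the given
    order until the first available one is found (or none is left)."""
    cost = 0.0
    prob_all_unavailable = 1.0  # P(every earlier channel was unavailable)
    for theta in means:
        # This probe is paid for only if all earlier probes failed.
        cost += prob_all_unavailable * sensing_cost
        prob_all_unavailable *= 1.0 - theta
    return cost


def best_order(means, sensing_cost=1.0):
    """Brute-force search for the probing order with the least expected cost."""
    return min(
        permutations(means),
        key=lambda order: expected_probing_cost(order, sensing_cost),
    )


if __name__ == "__main__":
    means = (0.3, 0.8, 0.5)  # assumed availability probabilities
    print(best_order(means))
```

In this simplified setting the probability of finding an available channel does not depend on the order, so only the probing cost changes, and the brute-force search returns the channels sorted by descending mean.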
We then prove the second half of Lemma 2 through contradiction as well. Assume the worst channel in , denoted as , is worse than the best channel in , denoted as (i.e., ). Under the original policy, there might be two possible actions after probing channel : guess, i.e., transmit over without probing, or quit. We construct an alternative policy by switching the role of and . Consider a fixed realization of all , . Again, those two policies would result in different actions only when all channels sampled ahead of channel are unavailable, and .
Case I: . Under the original policy, the user would transmit over channel after probing. After switching, depending on the action on channel under the original policy, the user will first sense channel , and then guess or quit. This will not incur any additional probing cost, however, the corresponding reward minus transmission cost would be or .
Case II: . Under the original policy, the user would guess or quit after probing channel . The corresponding reward would be or . After switching, the user will transmit over channel after probing. Again, this will not incur any additional probing cost, however, the corresponding reward minus transmission cost would be .
In conclusion, if the original policy is to guess after probing channel , there will be no difference in reward and cost after switching in either case; if the original policy is to quit after probing channel , then the difference in reward minus transmission cost would be
Therefore, by switching the role of channel and channel , we increase net reward in expectation. Thus, under the optimal policy, any channel in should be better than any other channel outside .
-B Proof of Theorem 1
We prove the theorem through induction. As shown in Lemma 1, if , then the transmitter should transmit over channel . If , the transmitter needs to decide among three actions: continue probing the next best channel, transmit over the next best channel without probing, or quit.
Then, for ,
and the corresponding actions that the user should take after finding are quit, sense, and guess, respectively.
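The three-way choice among quit, sense, and guess can be illustrated with a small backward induction. The following is a minimal sketch under assumptions that are not stated in this excerpt: costs and reward are replaced by their means (`sensing_cost`, `tx_cost`, `reward`), a guess always pays the transmission cost but earns the reward only if the channel turns out to be available, and a channel sensed available is used immediately. The function name and all parameter values are hypothetical, and the sketch does not reproduce the thresholds of Theorem 1.

```python
def optimal_frame_policy(means, reward, sensing_cost, tx_cost):
    """Backward induction over channels sorted by descending availability.

    Returns (value, actions): value is the expected net reward of a frame,
    and actions[i] is the best action once the i best channels have all
    been sensed unavailable.
    """
    order = sorted(means, reverse=True)
    value_next = 0.0  # no channel left: the user can only quit
    actions = [None] * len(order)
    for i in range(len(order) - 1, -1, -1):
        theta = order[i]
        candidates = {
            "quit": 0.0,
            # Transmit blindly: pay tx_cost, succeed with probability theta.
            "guess": theta * reward - tx_cost,
            # Probe first: transmit if available, otherwise move on.
            "sense": -sensing_cost
            + theta * (reward - tx_cost)
            + (1.0 - theta) * value_next,
        }
        actions[i] = max(candidates, key=candidates.get)
        value_next = candidates[actions[i]]
    return value_next, actions


if __name__ == "__main__":
    value, actions = optimal_frame_policy(
        [0.9, 0.5, 0.1], reward=1.0, sensing_cost=0.1, tx_cost=0.3
    )
    print(value, actions)
```

With these assumed numbers, the best channel is reliable enough that guessing beats paying for a probe, the middle channel is worth sensing, and the worst channel is not worth either cost, so all three actions appear in the resulting policy.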
-C Proof of Theorem 2
The proof consists of three main steps. We first relate the difference between the thresholds and their estimates with , , , . Then, we derive an upper bound on the number of samples required to ensure the correct ordering of the channels, as well as the right sensing and transmission decisions, with high probability. Finally, we explicitly bound the regret by examining the regret from exploration and exploitation separately.
To facilitate our analysis, we first introduce the following two lemmas without proof.
Let , , where . Then,
monotonically decreases as increases from to .
If for a sufficiently small , we have , , , , , then there exist constants , such that , , .
Based on (16), the estimate of , denoted as , can be expressed as follows
Denote . According to Lemma 3, we have
We note that
Besides, we have
where (18) follows from the fact that .
Similarly, we can show that . Plugging into (17), we have
Multiplying to both sides of (19), we have
Applying (-C) recursively, we have
Next, we will use the relationship between and in (22) to bound and . Toward this end, we first note that for any satisfying and ,
Similarly, we have
where (29) follows from the fact that