Developing efficient resource allocation algorithms plays a central role in wireless networks research. Over the years, elegant solutions with performance guarantees have been designed for various tasks, including link scheduling, rate adaptation, power allocation, routing, and network utility optimization. A key assumption in many of these approaches is that the decision maker has perfect prior knowledge of the channel conditions. However, obtaining accurate channel conditions often requires substantial measurements, which can be very time consuming for both outdoor environments and dense indoor deployments. This is especially the case for mobile networks, where the capacity of a link varies significantly with device locations and with environmental factors such as interference from other links and blockages. Thus, traditional approaches based on fixed channel conditions cannot achieve the expected performance in unknown or uncertain environments or quickly adapt to changing environments.
To cope with the various uncertainties in network resource allocation and to obtain adaptive scheduling policies, learning based approaches have been intensively studied recently. In particular, various online learning based algorithms have been developed for link scheduling [17, 12], rate adaptation, and beam selection, just to name a few. These works consider the challenging setting where the capacity of a link follows an unknown distribution that can only be sampled when the link is activated (i.e., bandit feedback). Instead of using an offline learning approach with separate data collection and decision making stages, they adopt a multi-armed bandit based online learning framework that integrates exploration and exploitation. By carefully balancing the two aspects, they obtain no-regret adaptive policies whose long-run performance approaches what can be achieved by the best offline policies with prior knowledge of channel conditions.
In many real settings, a decision maker may obtain additional observations beyond pure bandit feedback. For example, in next generation millimeter-wave 802.11ad/ay WLANs, beamforming can be used to infer the real-time link quality before a scheduling decision is made. However, beamforming between all APs and clients can be time consuming for a densely deployed WLAN. Thus it is more realistic to assume that only a subset of APs can be selected for beamforming in each round, which reduces channel uncertainty but does not completely eliminate it. In this case, it is crucial to jointly optimize AP selection for beamforming (probing) and link scheduling for serving clients (play).
In this paper, we present a novel extension of the bandit learning framework to incorporate joint probing and play. We assume that before the decision maker chooses an arm to play in each round, it can probe a subset of arms and observe their rewards (in that round). The decision maker then picks an arm to play according to the observations obtained in the probing stage and historical data. Our framework can be directly applied to the joint beamforming and scheduling problem when multiple APs collaboratively serve a single client (detailed system model in Sec. III). Given that the data rate a client can obtain from an AP is highly correlated with the client location, we consider a contextual bandit model and treat the client location (or an approximation of it) as the context and learn a context-dependent joint probing and play policy. To solve the problem, we first derive useful structural properties of the offline optimal solution and then develop an online learning algorithm by extending the contextual zooming algorithm in . We further establish the regret bound of our algorithm in the special case when the reward distributions are Bernoulli.
Our bandit learning model and its extensions can potentially be applied to a large body of sequential decision making problems that involve joint probing and play under uncertainty. For example, by integrating probing with combinatorial multi-armed bandits, where the decision maker can pick multiple arms to play, we can model the joint beamforming and scheduling problem in the more general multi-AP multi-client setting. As another example, consider the problem of finding the shortest path between a source and a destination in a road network with unknown traffic, where a path searcher can query a travel server to obtain hints about real-time travel latency. Since each query consumes server resources and incurs delay, the path searcher can only make a limited number of queries before picking a path. Further, the path searcher may utilize contextual information such as the current time to assist decision making. This problem can again be modeled as a contextual combinatorial bandit problem with probing.
Our main contributions can be summarized as follows.
We propose contextual bandit with probing (CBwP) as a novel bandit learning framework that incorporates joint probing and play into sequential decision making under uncertainty.
We derive structural properties of the optimal offline solution and develop an efficient online learning algorithm for CBwP. We further derive the regret bound of the online algorithm when the rewards are Bernoulli.
We apply our framework to the joint beamforming and scheduling problem in 802.11ay WLAN where a set of APs collaboratively serve a single mobile client. Simulations using real data traces demonstrate the efficiency of our solution.
II Related Work
The classic multi-armed bandit (MAB) model provides a clean framework for studying the exploration vs. exploitation tradeoff in sequential decision making under uncertainty. Since the seminal work of Lai and Robbins , MAB and its variants have been intensively studied [3, 6, 5] and applied to various domains including wireless resource allocation. In particular, a combinatorial sleeping multi-armed bandit model with fairness constraints is considered in , which has been used to model single-AP scheduling where multiple clients compete for sending packets to the common AP. In , the problem of link rate selection for a single wireless link is considered, and a constrained Thompson sampling algorithm is developed to exploit the structural property that a higher data rate is associated with a lower transmission success probability. In , online learning based scheduling for general ad hoc wireless networks with unknown channel statistics is considered, and the classic greedy maximal matching based algorithm is extended by using UCB-based link weights. The work closest to ours is , where a contextual multi-armed bandit algorithm is applied to the beam selection problem in mmWave vehicular systems. However, none of the above works considers the joint probing and scheduling problem that we study in this paper.
Our work is also related to the line of research on probing under uncertainty. A common setting is that, given a set of random variables with known distributions, a limited number of probes (observations) can be made about these distributions, and a selection decision is then made according to the observations. This corresponds to the offline problem in our setting. Various objective functions have been considered, including maximizing the largest value found minus the total probing cost spent. Since the general problem is NP-hard, various approximation algorithms have been developed . However, these results do not apply to online settings with unknown distributions.
III System Model and Problem Formulation
In this section, we define contextual bandits with probing (CBwP) as a novel extension of the classic contextual bandits model . To make it concrete, we use the joint AP probing and selection problem as an example when presenting the model. Our formulation applies to a large class of sequential decision problems that involve joint probing and selection under uncertainty. We further derive some important properties of CBwP.
III-A Contextual Bandits with Probing (CBwP)
We consider a set of APs connected to a high-speed backhaul that collaboratively serve a set of mobile clients. AP collaboration helps boost wireless performance in both indoor and outdoor environments and is especially useful for directional mmWave communications, which are susceptible to blockage . For simplicity, we assume that the beamforming process determines the best beam (i.e., the beam with the highest SNR) from an AP to the client. Hence, we do not distinguish AP selection from beam selection. Our framework readily applies to the more general setting of joint AP and beam selection.
To simplify the problem, we focus on the single-client setting in this work. Let X be a set of contexts that correspond to the location (or a rough estimate of it) of a moving client. In general, X can be either discrete or continuous. Let A be a discrete set of arms that correspond to the set of APs, with |A| = N. We consider a fixed time horizon T that is known to the decision maker. In each time step t, the decision maker first receives a context x_t ∈ X, then plays an arm a_t ∈ A, and receives a reward r_t, which is sampled from an unknown distribution that depends on both the context x_t and the arm a_t. We assume that the expected reward exists for any context-arm pair (x, a) and denote it by μ(x, a). The sequence of contexts is assumed to be external to the decision making process. In the AP selection problem, the reward corresponds to the data rate that a client at a certain location can receive from an AP.
In the classic contextual bandit problem, the instantaneous reward of an arm is revealed only when it is played, and the decision maker receives no side observations. In contrast, we consider a more general setting where, after receiving the context, the decision maker can first probe a subset of at most k arms and observe their rewards, and then pick an arm to play (which may or may not be among the probed arms). In general, probing an arm reduces the uncertainty about that arm. We assume that the probing period (for k arms) is short enough that if an arm a is probed in time step t with observed reward r, then r is also the reward obtained if a is played in t. However, probing does reduce the time available for packet transmission; hence, we require k to be relatively small. The problem of choosing a proper k, either statically or dynamically, is left to our future work. We further assume that the probing results are independent across arms; that is, the observed reward of a probed arm is independent of the other arms probed in or before t. Extension to correlated arms is left to future work.
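As a concrete illustration of this interaction model, the following minimal Python sketch (all names are hypothetical, assuming Bernoulli rewards) implements an environment in which rewards are realized once per round, so a probed arm's observation equals the reward of playing that arm in the same round:

```python
import random

class CBwPEnvironment:
    """Minimal CBwP environment sketch (hypothetical names).

    Each (context, arm) pair has an unknown Bernoulli reward
    distribution with mean mu[(context, arm)].  In each round the
    decision maker may probe up to k arms; a probed arm's observed
    reward equals the reward obtained by playing it in that round.
    """

    def __init__(self, mu, k, seed=0):
        self.mu = mu          # dict: (context, arm) -> mean reward
        self.k = k            # probing budget per round
        self.rng = random.Random(seed)
        self._realized = {}   # this round's realized rewards

    def start_round(self, context, arms):
        # Rewards are realized once per round, so probing an arm
        # reveals exactly the reward that playing it would yield.
        self._realized = {
            a: 1 if self.rng.random() < self.mu[(context, a)] else 0
            for a in arms
        }
        self._probes_left = self.k

    def probe(self, arm):
        assert self._probes_left > 0, "probing budget exhausted"
        self._probes_left -= 1
        return self._realized[arm]

    def play(self, arm):
        return self._realized[arm]
```

The key design point is that `start_round` draws the rewards before probing begins, which encodes the assumption that the probing period is short relative to the channel coherence time.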
Let f_t(π) denote the expected reward in time step t under a (time-varying) joint probing and play policy π, where the expectation is over the randomness of the observations in time step t. Similar to the classic contextual bandit model, our goal is to maximize the expected cumulative reward Σ_{t=1}^T f_t(π). As we discuss below, when the reward distribution is known a priori for each context-arm pair (x, a), the single-stage problem at each time step can be modeled as a Markov decision process with an optimal policy. Let f*_t denote the expected reward in time step t when the optimal offline policy is adopted. Define the total regret as

R(T) = Σ_{t=1}^T f*_t − Σ_{t=1}^T f_t(π).

The goal of maximizing the expected cumulative reward then converts to minimizing R(T).
Similar to , we assume that the context set X is associated with a distance metric D such that the expected reward satisfies the following Lipschitz condition:

|μ(x, a) − μ(x′, a)| ≤ D(x, x′) for all x, x′ ∈ X and a ∈ A.

Without loss of generality, we assume that the diameter of X under D is at most 1. This condition helps us capture the similarity between context-arm pairs. In the joint AP probing and selection problem, D is defined as the Euclidean distance between locations.
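As a small numerical illustration of the Lipschitz condition, the snippet below uses a hypothetical 1-Lipschitz mean-reward function (a clipped linear form with a unit-norm weight vector, which is not from the paper) under a Euclidean context metric:

```python
import math

# Hypothetical mean-reward function: mu(x) = clip(0.5 + w . x) with
# ||w|| = 1, so |mu(x) - mu(x')| <= D(x, x') under the Euclidean
# metric D (clipping to [0, 1] only shrinks differences).
def mu(x, w=(0.6, 0.8)):  # (0.6, 0.8) has unit norm
    return max(0.0, min(1.0, 0.5 + sum(wi * xi for wi, xi in zip(w, x))))

x1, x2 = (0.1, 0.0), (0.0, 0.2)
# The reward gap between two contexts is bounded by their distance.
assert abs(mu(x1) - mu(x2)) <= math.dist(x1, x2)
```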
III-B Offline Problem as an MDP
We first consider the offline setting where the reward distributions are known to the decision maker a priori. We show that the joint probing and play problem in each time step can itself be modeled as a Markov decision process (MDP). We further derive important properties of this MDP.
Consider any time step t with a context x. To simplify the notation, we omit the time step subscript in this section. At each probing step i ≤ k, the decision maker observes the current state s_i = ((a_1, r_1), ..., (a_{i−1}, r_{i−1})), where a_j is the arm probed in round j and r_j is the observed reward of arm a_j under context x, and then chooses the next arm to probe. We define s_1 = ∅. Further, the decision maker can decide at any round to stop probing and pick an arm to play according to the probing result, receiving the reward of the played arm. We observe that more information always helps in our problem; thus it never hurts to wait until all k probes are made before choosing an arm to play. Let S denote the set of all possible states and Δ(A) the set of distributions over A. The joint problem can be solved using a pair of policies: a probing policy π_probe : S → A that maps an arbitrary probing history to the next arm to probe, and a play policy π_play : S → Δ(A) that chooses an arm to play according to the probing result. Let π = (π_probe, π_play) denote a joint probing and play policy.
We first observe that in the offline setting, there is a simple deterministic play policy that is optimal. For any arm a, let r̂(a) = r_a if a has been probed (with observed reward r_a) and r̂(a) = μ(x, a) otherwise. Let g(π_play, s) denote the expected reward obtained from playing an arm using policy π_play given the probing result s after k rounds. We have

g(π_play, s) = Σ_{a ∈ A} p(a | s) · r̂(a),

where p(a | s) denotes the probability of playing arm a given the probing result s. Then we observe that the deterministic policy that always plays an arm with maximum r̂(a) is optimal and obtains the following optimal reward:

g*(s) = max_{a ∈ A} r̂(a).
We summarize this observation as a lemma:
Given any context x and state s, the deterministic policy that plays an arm with maximum r̂(a) is optimal.
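A minimal sketch of this deterministic play policy (hypothetical names; `r_hat` mirrors the quantity defined above):

```python
def optimal_play(probed, means):
    """Deterministic play policy of Lemma 1 (sketch, hypothetical names).

    probed: dict mapping each probed arm to its observed reward r_a.
    means:  dict mapping every arm to its known mean mu(x, a).
    Plays an arm maximizing r_hat(a): the observed reward if a was
    probed this round, and the mean mu(x, a) otherwise.
    """
    def r_hat(a):
        return probed[a] if a in probed else means[a]
    return max(means, key=r_hat)
```

For example, with means {a0: 0.3, a1: 0.7, a2: 0.5} and arm a1 probed with observed reward 0, the policy plays a2: the unprobed mean 0.5 beats both the observation 0 and the mean 0.3.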
We then consider the problem of finding an optimal probing policy. For any given play policy π_play, the probing problem can be formulated as a finite-horizon MDP (S, A, R, P), where S is the set of states defined above and A is the set of actions, which correspond to the set of arms. The reward function R is defined so that the reward of reaching a state after the k-th probing step equals the expected play reward at that state, and the reward is 0 otherwise. The transition dynamics P(s′ | s, a) gives the probability of reaching state s′ given the current state s and action a, which can be derived from the reward distributions.
We consider the standard objective of maximizing the expected cumulative reward for the MDP. Given the way the reward function is defined, this equals the expected reward of the joint policy. Thus, to find the optimal joint policy, it suffices to adopt an optimal play policy (such as the deterministic policy defined above) and solve the MDP for an optimal probing policy π_probe. Let π* denote the optimal joint (offline) policy.
The MDP defined above uses the complete history of probing results as the state. We now show that, assuming an optimal play policy is adopted, it suffices to keep the set of arms probed and the maximum reward observed. This allows us to obtain a smaller MDP without loss of optimality. Given any state s, we derive a new state s′ = (A(s), r_max(s)), where A(s) is the set of arms probed in s and r_max(s) is the maximum reward observed in s. Let S′ denote the set of such states. We further say that s is similar to s′ (denoted by s ∼ s′) if the latter can be derived from the former. We then define a new MDP on S′ whose reward and transition functions agree with those of the original MDP on similar states. Note that the reward function is well defined as the optimal play reward only depends on the maximum probed value in s (see Equation (3)). Further, the new transition dynamics can be derived from the observation that probing a new arm only adds that arm to the probed set and possibly updates the maximum observed reward.
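The state reduction can be sketched as follows (hypothetical names): two histories that probe the same set of arms and share the same maximum observation map to the same reduced state.

```python
def reduce_state(history):
    """Reduced MDP state (sketch): a sufficient statistic of the history.

    history: list of (arm, observed_reward) pairs from this round's probes.
    Under an optimal play policy, only the set of probed arms and the
    maximum observed reward matter, so distinct histories with the same
    reduction are equivalent.
    """
    arms = frozenset(a for a, _ in history)
    best = max((r for _, r in history), default=None)
    return arms, best
```

For instance, probing (a1, a3, a2) with rewards (0.2, 0.8, 0.5) and probing (a2, a3, a1) with rewards (0.8, 0.1, 0.3) both reduce to ({a1, a2, a3}, 0.8).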
We then show that the two MDPs have the same optimal value. Let Q(s, a) denote the optimal state-action value function of the original MDP for any state s and action a. Assuming that the optimal deterministic play policy is adopted after the k-th probe, Q satisfies the Bellman optimality equation

Q(s, a) = E_{r_a} [ V(s ⊕ (a, r_a)) ],

where s ⊕ (a, r_a) denotes the state reached from s by probing a and observing r_a, V(s′) = max_{a′} Q(s′, a′) if fewer than k arms have been probed in s′, and V(s′) = g*(s′) otherwise.
The value function Q′ of the reduced MDP is defined analogously. We then have the following result, which can be derived using the Bellman optimality equation and mathematical induction: for any states s ∼ s′ and any arm a, Q(s, a) = Q′(s′, a).
We prove the result by mathematical induction. For the base case, note that at states where all k probes have been made, both value functions equal the optimal play reward, which depends only on the set of probed arms and the maximum observed value; thus, the base case holds. Assume the result holds for all states with i + 1 probed arms. From the Bellman optimality equation of the original MDP, we have
where (a) follows from (4) and (b) follows from the definition of the reward function and the inductive hypothesis.
III-C Offline Problem with Bernoulli Rewards
We now consider the special case where the reward of each context-arm pair (x, a) follows a Bernoulli distribution (fully defined by its mean μ(x, a)). In this case, there is a simple non-adaptive probing policy that is optimal.
Consider the following non-adaptive probing policy for arms with Bernoulli rewards: given any context x, find the k + 1 arms with the largest means μ(x, a) among all the arms, and then probe any k of them. This policy, together with the deterministic play policy given in Lemma 1, gives an optimal joint policy for the offline problem.
We first observe that for arms with Bernoulli rewards, there is an optimal joint probing and play policy of the following form: at the initial state, probe an arm a_1. If the observed reward is 1, play a_1 and stop. Otherwise, probe another arm a_2. Repeat this process until a probed arm gives a reward of 1 or k arms have been probed. In the latter case, play an arm that has not been probed.
Consider any policy π of the above form. Let A_π be the set of arms involved in the policy. We construct a new joint policy π′ that first probes all of the probing arms involved in π and then plays an arm according to the optimal deterministic play policy. We observe that the expected reward of π′ is no less than that of π. Thus, there is a non-adaptive probing policy that is optimal. Further, the expected reward of π′ equals 1 minus the probability that all k + 1 involved arms realize reward 0. Thus, the expected reward is maximized when the k + 1 arms probed and played have the maximum expected rewards among all the arms, irrespective of the order in which these arms are probed and played.
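In the Bernoulli case the argument above yields a closed form: the optimal expected reward is 1 minus the product of (1 − μ_i) over the top k + 1 means. The sketch below (hypothetical helper names) computes this closed form and checks it against a Monte Carlo simulation of the probe-then-play policy:

```python
import random

def offline_value(mus, k):
    """Expected reward of the non-adaptive offline policy (sketch).

    For Bernoulli arms with means mus and probing budget k, probing k
    of the top-(k+1) arms and playing per Lemma 1 yields expected
    reward 1 - prod(1 - mu_i) over the top k+1 means, independent of
    the probing order.
    """
    value = 1.0
    for m in sorted(mus, reverse=True)[:k + 1]:
        value *= 1.0 - m
    return 1.0 - value

def simulate(mus, k, rounds=200_000, seed=1):
    """Monte Carlo check of the closed form above."""
    rng = random.Random(seed)
    top = sorted(range(len(mus)), key=lambda i: -mus[i])[:k + 1]
    probe, backup = top[:k], top[k]   # probe k arms, keep one unprobed
    total = 0
    for _ in range(rounds):
        draws = {i: int(rng.random() < mus[i]) for i in top}
        # play the first probed arm that observed 1, else the backup arm
        hit = next((i for i in probe if draws[i] == 1), None)
        total += 1 if hit is not None else draws[backup]
    return total / rounds
```

For means (0.9, 0.5, 0.3, 0.2) and k = 1, the closed form gives 1 − 0.1 × 0.5 = 0.95, and the simulation agrees up to sampling error.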
IV The CBwP Algorithm
Lemma 3 indicates that in the offline setting, greedy probing plus greedy play is optimal for Bernoulli rewards. However, this approach cannot be directly applied to the online setting, where the means μ(x, a) are unknown and there is a fundamental exploration vs. exploitation tradeoff. In this section, we consider the online setting and design an algorithm for the CBwP problem. We further derive its regret for the special case with Bernoulli rewards.
IV-A Algorithm Description
Our algorithm is based on the contextual zooming algorithm in , which adaptively partitions the similarity space to exploit the Lipschitz condition (Equation (2)). As we consider a finite set of arms in our problem, we apply adaptive partitioning to the context space only. The main contribution of our work is extending the contextual zooming technique to the joint probing and play setting, which brings new challenges in both algorithm design and analysis.
The algorithm (see Algorithm 1) maintains a finite set B_a of active balls for each arm a. We require that the balls in B_a collectively cover the similarity space X. Initially, B_a contains a single ball of radius 1. A ball is activated once it is added to B_a and remains active thereafter. These balls correspond to a partition of the context space from arm a's perspective.
In each time round t, a context x_t is revealed, and the algorithm selects up to k arms to probe according to the "probing rule". In each probing step, after observing the reward of the probed arm, the algorithm may activate a new ball according to the "activation rule". The probing stage terminates when either k arms have been probed or a reward of 1 has been observed for some probed arm. The algorithm then selects an arm to play according to the "playing rule", and may again activate a new ball according to the "activation rule".
We then specify the three rules used in the algorithm. Both the probing and playing rules are inspired by Lemma 3. Since the true reward distributions are unknown, the algorithm picks arms according to their estimated rewards together with a confidence term. Let r(B) denote the radius of a ball B. The confidence radius of B at time round t is defined as

conf_t(B) = 4 sqrt(log T / (1 + n_t(B))),

where n_t(B) is the number of times B has been selected from round 1 to round t − 1. Let v_t(B) denote the total reward from all rounds up to t − 1 in which B has been selected, and ν_t(B) = v_t(B) / max(n_t(B), 1) the average reward of B. The pre-index of B is defined as

I_t^pre(B) = ν_t(B) + r(B) + conf_t(B).

The index of B is obtained by taking a minimum over all active balls of AP a:

I_t(B) = r(B) + min_{B′ ∈ B_a} ( I_t^pre(B′) + D(B, B′) ),

where D(B, B′) is the distance between the centers of the two balls.
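A sketch of the index computation (hypothetical names; the exact confidence-radius constant follows the contextual zooming form and is an assumption here):

```python
import math

def conf_radius(n, T):
    # Confidence radius of a ball selected n times, horizon T
    # (constant 4 assumed, following the contextual zooming index).
    return 4.0 * math.sqrt(math.log(T) / (1.0 + n))

def pre_index(ball, T):
    # ball: {'avg': average reward, 'n': selections, 'r': radius, 'center': c}
    return ball["avg"] + ball["r"] + conf_radius(ball["n"], T)

def index(ball, active_balls, T, dist):
    # Index: own radius plus the minimum, over the arm's active balls,
    # of pre-index plus center-to-center distance.
    return ball["r"] + min(pre_index(b, T) + dist(ball["center"], b["center"])
                           for b in active_balls)
```

For a single active ball with average reward 0.5, 3 selections, radius 0.25, and horizon T = e (so log T = 1), the confidence radius is 4·sqrt(1/4) = 2, the pre-index is 2.75, and the index is 3.0.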
In each round t and for each AP a, let B_{t,a} denote the set of active balls that contain x_t and have the minimum radius. Let B^sel_{t,a} be an arbitrary ball in B_{t,a} with maximal index. We then state the three rules as follows:
probing rule: At each probing step in time round t, the algorithm probes the arm a with the maximal index I_t(B^sel_{t,a}) (breaking ties arbitrarily) among the set of unprobed arms and obtains an observation r_a. The probing stage ends once k arms have been probed or r_a = 1 for some probed arm a.
playing rule: In each time round t, after the probing stage is done, let r̂(a) = r_a if arm a has been probed and r̂(a) = I_t(B^sel_{t,a}) otherwise. If r_a = 1 for some arm a probed in the probing stage, play a. Otherwise, play an arm with maximal r̂(a).
activation rule: If arm a is probed or played in time round t, the algorithm updates n_t(B) and ν_t(B) for the selected ball B = B^sel_{t,a}. Further, a new ball with center x_t and radius r(B)/2 is activated if conf_t(B) ≤ r(B); B is called the parent ball of this newly activated ball.
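The activation rule reduces to a simple threshold test (hypothetical names; the confidence-radius form is an assumption, as above):

```python
import math

def should_activate(n, radius, T):
    """Activation rule (sketch): once a ball's confidence radius drops
    to its own radius, a child ball of half the radius is activated."""
    conf = 4.0 * math.sqrt(math.log(T) / (1.0 + n))
    return conf <= radius

# Under this form, with horizon T = e (so log T = 1), a radius-1 ball
# first spawns a child of radius 1/2 after being selected 15 times,
# since 4 / sqrt(1 + 15) = 1.
```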
We note that the index defined above includes both the average reward and a confidence radius, similar to classic upper confidence bound (UCB) based approaches . Further, exploration is included in both the probing and the playing stages. One may wonder whether this is necessary, since probing provides free observations, which may seem to remove the need for exploration. However, as we show in our simulations, replacing the index by the empirical average reward (so that no exploration is used) leads to suboptimal decisions. Intuitively, this is because although probing reduces uncertainty, it does not completely remove it for a small k. Thus, it is crucial to judiciously utilize the limited probing resource.
IV-B Theoretical Analysis for CBwP with Bernoulli Rewards
In this section, we analyze the regret of Algorithm 1 in the special case when the rewards of arms follow Bernoulli distributions.
Before presenting our analysis, we first state the following claims. Proofs of these claims can be found in .
The following properties are maintained in every time round t:
For any active ball B, conf_t(B) ≤ r(B) if and only if B is a parent ball.
For every arm a, the set B_a covers the similarity space X.
For any two active balls of the same radius r that are associated with the same arm, their centers are at distance at least r.
If a ball B associated with arm a is active in round t, then with probability at least 1 − T^{−3} we have that

|ν_t(B) − μ(x_B, a)| ≤ conf_t(B),    (8)

where x_B is the center of B.
We call a run of the algorithm clean if (8) is satisfied for every active ball in every time round, and bad otherwise. Since at most k + 1 balls are activated in each round, the total number of active balls is bounded by (k + 1)T. By applying the union bound over all active balls and all rounds, the probability that a bad run happens is at most (k + 1)/T.
To derive the regret bound of our algorithm, we consider the optimal policy π* defined in Lemma 3. Consider any time round t with context x_t. Let a*_1, ..., a*_{k+1} denote the k + 1 arms with the largest means μ(x_t, ·) chosen by π*. Without loss of optimality, we assume that a*_1, ..., a*_{k+1} are sorted non-increasingly by μ(x_t, ·). Let a_{t,i} denote the i-th (1 ≤ i ≤ k + 1) arm probed or played in time round t by Algorithm 1. For each probing step, we have the following results.
Consider a clean run of the algorithm. We have
Fix a time round t and consider any probing step i. We have
where (a) is due to the probing and playing rules, (b) follows from Eq. (7) for some active ball B, (c) follows from Eq. (6) and the clean run assumption (8), (d) holds by the Lipschitz property (2), and (e) holds by the Lipschitz property (2) and the fact that the selected ball contains x_t. Let B^par be the parent of B; by the activation rule, we have
It follows that
Finally, we have
Consider a clean run of our algorithm. If ball B is activated in round t and B^par is the parent of B, then we have
As B^par is the parent of B, by Lemma 4 and the activation rule, we have conf_t(B^par) ≤ r(B^par). To prove the corollary, we replace step (13) in the proof of Lemma 4 with a tighter bound that holds when B^par is a parent ball. This can be shown as follows:
where (a) is due to Eq. (7), (b) follows from Eq. (6), (c) follows from the clean run assumption (8), and (d) holds by the Lipschitz property (2) and the activation rule. The rest of the proof follows the same reasoning as in Lemma 4.
For any round t, we have
Consider any round t. Recall that the arms a*_1, ..., a*_{k+1} are sorted non-increasingly by μ(x_t, ·). We have
To simplify the notation, let , , and . We have
We claim that