I Introduction
To meet the exponentially growing demand for mobile data, wireless networks are migrating to higher frequencies combined with an increasing number of antennas per device and per base station. For instance, it is envisioned that 5G cellular systems will use certain portions of the millimeter wave (mmWave) band, which spans the spectrum from 30 GHz to 300 GHz. However, propagation loss at mmWave frequencies is much higher due to a variety of factors including atmospheric absorption, the basic Friis transmission effect, and low penetration. When the users and/or surrounding objects are mobile, this effect is more pronounced, such that different propagation paths become highly variable with intermittent on-off periods. Thus, unlike existing communication schemes, mmWave systems require highly directional communications to compensate for large channel losses. Thanks to recent advances in antenna technologies, large directional antenna arrays with much smaller form factors can be deployed in relatively small chip areas. Such arrays have the potential to focus the signal energy toward a specific direction, “making up” for the channel losses.
In order to fully utilize directional communications, the transmitter and receiver beams need to be aligned. The experimental results in [1] demonstrate that in a system with a degree beam width, a misalignment of degree reduces the link budget by around dB, which can reduce the maximum throughput by up to Gbps or break the link entirely [2]. On the other hand, as a result of such “pencil beams” at the transmitter and receiver, beam alignment incurs a large overhead that scales with device mobility and with the product of the transmitter and receiver beam resolutions. In exhaustive search methods, both users and base stations have a predefined codebook of beam directions that cover the entire angular space and are used sequentially to transmit and receive. Thus, the complexity of this exhaustive search is $O(N^2)$, where $N$ is the number of possible beam directions. To improve the search efficiency, the transmitter and receiver steering is decoupled in the 802.11ad standard such that the transmitter starts with a quasi-omnidirectional beam, while the receiver scans the space for the best beam direction. The process is then reversed [3]. This approach reduces the search complexity to $O(N)$. Still, for a beam of a few degrees, the delay can be hundreds of milliseconds to seconds [4], which would easily stall real-time applications.
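As a rough illustration of the gap between the two sweep strategies, the following Python sketch counts beam measurements for an exhaustive pairwise search and for an 802.11ad-style decoupled sweep. The function names, the 64-beam codebooks, and the per-measurement duration are hypothetical values chosen only for illustration; they are not taken from the experiments cited above.

```python
def exhaustive_measurements(n_tx: int, n_rx: int) -> int:
    """Every transmit beam is tried against every receive beam."""
    return n_tx * n_rx


def decoupled_measurements(n_tx: int, n_rx: int) -> int:
    """Receiver sweeps while the transmitter is quasi-omnidirectional,
    then the roles are reversed."""
    return n_rx + n_tx


def sweep_delay(num_measurements: int, slot_duration: float) -> float:
    """Total sweep time if each measurement occupies one slot."""
    return num_measurements * slot_duration


# Hypothetical codebook sizes and slot duration (in milliseconds).
N_TX, N_RX, SLOT_MS = 64, 64, 0.1
full = exhaustive_measurements(N_TX, N_RX)    # 4096 measurements
split = decoupled_measurements(N_TX, N_RX)    # 128 measurements
print(full, sweep_delay(full, SLOT_MS))
print(split, sweep_delay(split, SLOT_MS))
```

Even the decoupled sweep grows linearly with the codebook size, which is the overhead the structured search in this paper aims to avoid.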
Dynamic conditions make beam alignment more challenging, since the beams must be realigned frequently. Under such scenarios, we pose the following question: given the outcome of past beam alignments, is it possible to extract some information and reduce the search space for the subsequent beam alignment procedures? In particular, our work is based on the fact that successive beam alignments are stochastically correlated, and thus the outcome of the previous “beam matching” provides contextual information for the subsequent matchings, eliminating the need to search the entire angular domain. We exploit correlation and unimodality properties across the various beam matchings. Specifically, for a given beam matching, we call the difference between the transmitter and receiver directions the misalignment. Because of this correlation, if a matching at a larger misalignment is successful (i.e., the received energy is above a threshold), then with high probability a matching at a smaller misalignment will be successful as well. Furthermore, the directivity gain (or received energy) can be approximated as a unimodal function of the misalignment value. We exploit this contextual information in order to obtain a beam search scheme that quickly identifies the best beam direction and maximizes the directivity gain. We formulate the problem of finding the best beam pair as an online stochastic optimization where the objective is to maximize the expected amount of received energy within a given time period.
To find the optimal solution, we show that this problem can be considered as an instance of the Multi-Armed Bandit (MAB) model in which each transmit and receive beam pair is considered as a single arm. Thus, the objective is to design a sequential arm selection (or, equivalently, beam alignment) strategy that maximizes the expected reward (received energy) over a given time horizon. The performance of MAB models is usually expressed in terms of regret, defined as the total expected reward loss compared with an oracle policy. The regret of the best algorithm is of the form $O(K \log T)$, in which $K$ is the number of arms and $T$ is the time horizon [5]. In contrast to the classical MAB models, we exploit the contextual information of beam alignment, which leads to a structured MAB model, and prove that the regret does not scale with the number of arms, which is equal to the number of beam matchings. This is a crucial property in (massive) MIMO systems, and provides a fundamental performance limit satisfied by any exhaustive beam selection algorithm. This limit quantifies the inevitable performance loss due to the need to explore suboptimal beam pairs. It also characterizes the performance gains that can be achieved by devising beam pair selection schemes that optimally exploit the correlations and the structural properties of the MAB problem.
Therefore, in contrast to the previous works that explore the suboptimal beam pairs by heuristics, our method optimally explores the suboptimal beam pairs.
The following example illustrates how we achieve this goal.
Example: Let us consider a scenario where the transmitter beam direction is fixed at angle with respect to the receiver. For the sake of exposition, we assume a 2D setting. Assume that there are 16 possible directions at the receiver, as shown in Fig. 1. Using the exhaustive beam selection scheme, each of the 16 directions will be examined one at a time, and the direction with the largest received energy (from beacon messages) is picked. However, under dynamic conditions (e.g., with mobility), the optimal beam direction can potentially change within a short period of time. In this case, we consider maximizing the received energy within a given period of time. Using our proposed scheme, the receiver assigns an index to each beam direction, and the beam with the highest index will be selected. The important point is that due to the correlation and unimodality properties, the search space will be limited to the neighborhood of the beam with the maximum index. As a result, it avoids the need for a uniform exploration over the entire angular domain, thus mitigating the overhead of beam alignment when the number of beam directions becomes large.
In summary, our contributions are as follows:

We consider the beam alignment problem, and investigate the fundamental performance limits of search-based beam alignment between the transmitter and receiver antennas. We model the problem of finding the best beam alignment as an online stochastic optimization problem.

We exploit contextual information of the problem and formulate an equivalent structured Multi-Armed Bandit model. We experimentally demonstrate that the received power (approximately) follows a unimodal pattern.

We derive a lower bound on the regret of any search-based algorithm, and demonstrate that the regret does not scale with the number of transmit and receive beams, thanks to the underlying structure of the problem. Finally, based on the OSUB algorithm in [21], we propose the Unimodal Beam Alignment (UBA) algorithm that is shown to be asymptotically optimal.
II Related Work
II-A Beam Alignment
The authors in [6] perform initial access for clustered mmWave small cells using the power delay profile. In this case, base stations are coordinated in clusters, and communicate through a backhaul network. Base stations share their measurement reports obtained from the mobile devices, and the location of the mobile is estimated based on the shared measurements. This enables the base stations to point at the estimated mobile location. Although this method is limited to line-of-sight scenarios, the probability of having at least three line-of-sight links (needed for mobile localization) increases with larger cluster sizes. In another line of research, the authors in [7] proposed a fast-discovery hierarchical search method, while [8] exploits the sparse multipath structure of the mmWave channel to optimize the choice of beamforming directions. A cell discovery method is proposed in [9] in which the base station periodically transmits synchronization signals to scan the entire angular space in time-varying random directions. In [10], a beam alignment technique is designed based on adaptive subspace sampling and hierarchical beam codebooks. The authors in [11] use spatial information extracted at sub-6 GHz to help estimate the best beam pairs at the transmitter and receiver at mmWave frequencies. In [12], a beam alignment scheme based on scanning several directions in one shot is proposed. In contrast to the previous works, we focus on exploiting contextual information in standalone mmWave systems in order to reduce the search space, and thus the overhead of the beam alignment operation. Note that there are other related works that reduce the overhead of beam search in integrated sub-6 GHz/mmWave systems [13, 14].
II-B Multi-Armed Bandit
The Multi-Armed Bandit (MAB) framework formulates sequential decision problems where an agent (i.e., decision maker) has to strike an optimal tradeoff between exploitation and exploration by sequentially selecting an action (or an arm), and observing the corresponding reward. Rewards of a given arm are random variables with unknown distributions. The objective is to maximize the expected reward over a given time horizon by selecting the optimal arm at each time slot. Most of the existing works focus on unstructured MAB problems in which the rewards associated with different arms are not related [15, 5]. In contrast to the unstructured MAB models, when the average rewards are structured, deriving the optimal regret bound and designing optimal decision algorithms is more challenging [16]. Unimodal bandits are specific instances of bandit models in which the average rewards of the arms are correlated. In [17], unimodal bandits with a continuous set of arms are studied, and the authors show that the regret of the order of is achievable under some strong regularity assumptions on the reward functions. For the same problem, the authors in [18] provide an algorithm that achieves regret under relaxed regularity assumptions. In this paper, we cast the problem of mmWave beam alignment as a unimodal bandit model, and derive the regret bound.
III Model and Objective
III-A System Model
In mmWave systems, once the connection is lost, there are two options for connection establishment and subsequent beamforming: digital or analog. Digital beamforming is highly efficient in delay: with the observations from all receive antennas, beamforming can be done by one-shot processing of the observed beacons. However, digital beamforming requires a separate analog-to-digital converter (ADC) for each antenna, which may not be feasible even for a small to mid-sized antenna array due to high energy consumption. On the other hand, while analog beamforming requires only one ADC, it can focus on one direction at a time, making the search process costly in delay. Given the fragility of the mmWave channel, the need to scan the entire space leads to the loss of opportunities to utilize the mmWave channel upon its availability. In order to avoid high energy consumption by mmWave components, we focus on analog beamforming in which a single RF chain is deployed at the transmitter and receiver. Other implementations (e.g., hybrid architectures) may look at combinations of directions, which is out of the scope of this work.
Figure 2 depicts a schematic of analog beamforming, and the directional beams at the transmitter and receiver. We assume that the transmitter and receiver are equipped with phased array antennas with and identical antennas respectively, equally spaced by a distance along an axis. For the sake of exposition, we only consider the receiver side, while a similar argument holds for the transmitter. Due to the use of an analog architecture, the signal at the input of the decoder is a scalar, identical to a weighted combination of the signals across all antennas. Thus, the received signal at the mmWave receiver can be written as:
(1) 
where and are the beamforming vectors. The white Gaussian noise is normalized to have unit variance. If the transmitter uses training precoding vectors , and the receiver uses training combining vectors , then the collected signals (divided through by the training signal) are given by:
where is the combining matrix, and is the precoding matrix. Furthermore, is the post-processing noise matrix. Hence, at each time slot, the problem of finding the best beam pair boils down to finding the largest value of matrix . In the exhaustive search, one should examine all elements of to find the largest entry, whose index determines the optimal beam index at the transmitter and receiver. The authors in [11] use spatial information extracted at sub-6 GHz to help estimate the largest index of . Applying the same framework, our beam alignment method exploits correlation across the elements of in order to reduce the search space to submatrices of .
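To make the contrast between the full search and a restricted search concrete, the sketch below finds the largest entry of a measurement matrix in two ways: over all entries, and over a small window around a previously good beam pair. The row/column convention (rows for receive beams, columns for transmit beams), the function names, and the toy energy values are assumptions made for illustration only.

```python
from typing import List, Tuple


def best_beam_pair(energy: List[List[float]]) -> Tuple[int, int]:
    """Exhaustive search: scan every (rx, tx) entry and return the largest."""
    best, best_val = (0, 0), float("-inf")
    for i, row in enumerate(energy):
        for j, val in enumerate(row):
            if val > best_val:
                best, best_val = (i, j), val
    return best


def best_in_window(energy: List[List[float]],
                   center: Tuple[int, int],
                   radius: int) -> Tuple[int, int]:
    """Restricted search: scan only the submatrix within `radius` of `center`."""
    ci, cj = center
    rows = range(max(0, ci - radius), min(len(energy), ci + radius + 1))
    cols = range(max(0, cj - radius), min(len(energy[0]), cj + radius + 1))
    return max(((i, j) for i in rows for j in cols),
               key=lambda ij: energy[ij[0]][ij[1]])


# Toy 3x3 matrix of measured energies (hypothetical values).
toy = [[0.1, 0.2, 0.1],
       [0.3, 0.9, 0.4],
       [0.2, 0.5, 0.3]]
print(best_beam_pair(toy))               # (1, 1)
print(best_in_window(toy, (0, 0), 1))    # (1, 1)
```

The windowed search examines 4 entries instead of 9 here; the saving grows with the codebook size, provided the previous best pair is a useful starting point.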
III-B Problem Statement
In order to establish a mmWave link, the transmitter selects a beam direction that determines the phase shifter weights to steer the beam in a certain direction. Similarly, the receiver selects a receive beam index to receive the signals in a certain direction. To obtain a high beamforming gain, the transmitter and receiver beams should be well aligned with each other. We let denote a pair of beam directions in which is the beam index at the transmitter, and denotes the receiver beam index. There are and beams at the transmitter and receiver, respectively. Further, we define as the set of all possible beam pairs such that there exist distinct matchings between the transmit and receive beams. For each pair of transmit and receive beams, misalignment is defined as follows.
Definition.
(Misalignment) Given the pair of transmitter and receiver beams, the misalignment captures the angular mismatch between the th transmitter beam and th receiver beam.
Set contains all possible misalignment values such that is partially ordered. With an abuse of notation, we let denote the () misalignment value.
In order to detect the transmitted beacon signals, the received energy level should lie above a certain threshold , which is determined based on the quality of service (QoS) needed. Once the received energy is larger than , we call it a successful matching. For a fixed transmit power, the received signal energy can be expressed as a function of the misalignment value, i.e., such that is non-increasing. For notational simplicity, we use . The probability of success for the misalignment is given by . We assume a fixed situation (e.g., LOS) such that from one run of the alignment algorithm to another, the situation remains fixed and only the orientations and distances change, i.e., the success probabilities are time-invariant. Furthermore, we let a binary random variable represent the success () or failure () of the matching. The optimal matching is unique, i.e., there exists such that , for all .
Problem formulation: Time is slotted, and we let denote the length of the beam alignment phase followed by the data communication phase. Further, denotes the length of the pilot signal by which each beam is measured. This value captures the amount of time that it takes to examine a single pair of beams, as shown in Fig. 3. We formulate the problem of finding the best beam pair as an online stochastic optimization problem such that an optimal beam selection policy maximizes the expected amount of energy received from beacon messages up to a certain finite time . In this case, we let denote the number of times that the misalignment is selected under policy and within the time period . Therefore, the optimal beam selection policy solves the following optimization problem:
(2a)  
(2b) 
In this formulation, a small misalignment and a large probability of success are desirable. In addition, given that examining each beam matching takes on average, the total number of beam examinations is upper-bounded by , which is reflected in the first constraint. The second constraint implies that the number of examinations of each matching should be an integer.
IV Equivalent Multi-Armed Bandit Model
The beam alignment formulation implies an MAB model such that each combination of the transmitter and receiver beam is considered as an arm, which leads to total arms. In this work, we use the terms “arm” and “beam direction” interchangeably. In this case, denotes the number of times that arm has been selected under policy . Moreover, the reward of arm has a Bernoulli distribution with parameter such that it is with probability and with probability . The average reward of arm is denoted by .
IV-A Contextual Information
We explore a new type of contextual information that correlates the misalignment and the received energy. In particular, due to physics of signal propagation, if matching at a larger misalignment is successful, a matching at a smaller misalignment will be successful with a high probability. On the other hand, if matching at a smaller misalignment fails, then a matching at a larger misalignment will fail with a high probability as well. Hence, we have:
(3) 
(4) 
Equivalently, we can define the vector of success probabilities to satisfy the following condition: where . In addition to the correlation property, we note that the amount of received energy can be approximated as a unimodal function of the misalignment. In this case, such that . Note that a similar unimodal model has been used for other applications such as rate adaptation in 802.11 systems [19] and channel selection in cognitive networks [20].
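The unimodality assumption on the success probabilities can be checked mechanically. The sketch below tests whether a vector strictly increases up to a single peak and strictly decreases afterwards; the strictness (which enforces a unique optimum, as assumed for the optimal matching) and the name `is_unimodal` are choices made for this illustration rather than part of the formal model.

```python
from typing import Sequence


def is_unimodal(theta: Sequence[float]) -> bool:
    """True if theta strictly rises to one peak and strictly falls after it."""
    if not theta:
        return False
    peak = max(range(len(theta)), key=lambda k: theta[k])
    rising = all(theta[k] < theta[k + 1] for k in range(peak))
    falling = all(theta[k] > theta[k + 1]
                  for k in range(peak, len(theta) - 1))
    return rising and falling


print(is_unimodal([0.1, 0.4, 0.9, 0.5, 0.2]))  # True: single peak in the middle
print(is_unimodal([0.1, 0.6, 0.3, 0.7, 0.2]))  # False: two local maxima
```

A measured pattern with noticeable sidelobes would fail this strict test, which is why the received power is described as only approximately unimodal.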
Graph representation: In order to demonstrate the implications of contextual information, we can utilize a graph representation to capture the order of arms (beam pairs) with respect to each other. In this model, each arm corresponds to a node of a graph and each edge is associated with a relationship specifying which node of the edge gives the largest expected reward, thus providing a partial ordering over the arm space. Furthermore, from any node there is a path leading to the unique node with the maximum expected reward along which the expected reward is monotonically increasing. Under the assumption of unimodal expected reward, we can move from low expected rewards to high ones just by climbing them in the graph, preventing the need of a uniform exploration over all the graph nodes. This assumption reduces the complexity in the search for the optimal arm, since the optimal policy can avoid pulling the arms corresponding to some subset of nonoptimal nodes.
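The climbing argument can be sketched on the simplest possible graph, a line in which each arm is adjacent to the arms immediately before and after it. The code below assumes the expected rewards are known exactly, so it only illustrates the structure of the search and not the learning problem; the graph shape, function names, and reward values are illustrative assumptions.

```python
from typing import Dict, List, Sequence


def line_graph(n: int) -> Dict[int, List[int]]:
    """Each arm k is adjacent to k - 1 and k + 1 when they exist."""
    return {k: [j for j in (k - 1, k + 1) if 0 <= j < n] for k in range(n)}


def climb(mean_reward: Sequence[float],
          neighbors: Dict[int, List[int]],
          start: int) -> List[int]:
    """Move to the best neighbor while it strictly improves the expected reward."""
    path = [start]
    current = start
    while True:
        best = max(neighbors[current], key=lambda j: mean_reward[j],
                   default=current)
        if mean_reward[best] <= mean_reward[current]:
            return path
        current = best
        path.append(current)


means = [0.1, 0.3, 0.6, 0.9, 0.5, 0.2]   # hypothetical unimodal rewards
graph = line_graph(len(means))
print(climb(means, graph, 0))   # [0, 1, 2, 3]
print(climb(means, graph, 5))   # [5, 4, 3]
```

From either end, the path reaches the peak without visiting the arms on the far side of it, which is the source of the reduced exploration cost.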
Experimental observations: For the purpose of illustration, we provide experimental results to observe the typical propagation pattern as a function of misalignment. In particular, we consider the case of clear line of sight, and run a set of experiments in which two horn antennas are placed on tripods, facing each other symmetrically. The transmitter antenna is kept stationary, while the receiver antenna is rotated throughout the experiment. Two software-defined radios (NI USRP-2901) are set up as the transmitter and receiver at GHz. The transmit antenna is connected to an upconverter with the output at GHz. The receiver antenna takes the GHz carrier and sends it to the downconverter with an output of GHz. A power spectrum of gain (dB) vs. frequency is displayed in real time for data collection, and we record the average peak value of the gain. The receive horn antenna sweeps degrees incrementally in both the clockwise and counterclockwise directions until the gain is indistinguishable from the thermal noise floor (about dB).
Our experimental setup is shown in Fig. 4, and Fig. 5 demonstrates the received power as a function of misalignment angle between the transmitter and receiver antennas. We observe that received power approximately (due to the sidelobes) follows the unimodal pattern. In this work, we assume that the transmitter and receiver deploy highly directional antenna arrays in which the effect of sidelobes is negligible and thus the received power can be approximated as a unimodal function. Moreover, although we would expect the misalignment of degree provides the highest signal energy, under blockage and reflection scenarios a misalignment of may provide a larger gain. Our model on contextual information is general and captures these conditions as well.
V Performance Analysis and Optimal Algorithm
V-A Regret Analysis
In order to assess the performance of policy for beam alignment, or equivalently arm selection, we consider regret as the performance metric, defined as follows:
(5) 
From this definition, regret measures the expected reward loss (over a time period of ) compared with an oracle policy that knows the best arm in advance. In the case that the expected rewards of the various arms are not correlated, the regret of the best algorithm is of the form $O(K \log T)$, in which $K$ is the number of arms and $T$ is the time horizon [5]. As a result, regret scales linearly with the number of arms. In beam alignment, for a beam of a few degrees the total number of arms (i.e., all combinations of transmitter and receiver beam pairs) becomes very large. Therefore, in order to avoid this scaling factor, we exploit the structural properties of the arms and their reward functions, and show that due to contextual information, the scaling factor is constant and does not scale with the number of beam matchings (i.e., the size of the decision space). To this end, for any arm , we denote the set of its neighbors by:
Furthermore, given that the arm rewards follow Bernoulli distributions, the Kullback-Leibler (KL) divergence of two Bernoulli distributions with respective parameters $p$ and $q$ is defined as: $I(p, q) = p \log\frac{p}{q} + (1-p) \log\frac{1-p}{1-q}$. It has been shown in [19, 21] that the problem of learning in a unimodal bandit setting admits a lower bound on the regret of the following form:
Theorem 1.
For any beam alignment algorithm , the lower bound on the regret is given by:
in which is a function of arms reward, and is given by:
(6) 
From (6), we observe that is equal to a summation over a constant number of terms (i.e., independent of ). On the other hand, in the case that the structural properties are not exploited, the regret is lower bounded by:
where linearly increases with the number of possible arms, or equivalently, number of beam pairs at the transmitter and receiver.
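The difference between the two bounds can be illustrated numerically. The sketch below assumes the standard forms of the constants for Bernoulli arms: a sum of gap-over-KL terms over all suboptimal arms in the unstructured case, and the same sum restricted to the neighbors of the best arm in the unimodal case. These forms, the function names, and the example vectors are assumptions for illustration and are not copied from equation (6).

```python
import math
from typing import Iterable, Sequence


def kl_bernoulli(p: float, q: float, eps: float = 1e-12) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped for safety."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def unstructured_constant(theta: Sequence[float]) -> float:
    """Sum of gap / KL over every suboptimal arm."""
    best = max(theta)
    return sum((best - t) / kl_bernoulli(t, best) for t in theta if t < best)


def unimodal_constant(theta: Sequence[float],
                      neighbors_of_best: Iterable[int]) -> float:
    """Sum of gap / KL over the neighbors of the best arm only."""
    best = max(theta)
    return sum((best - theta[k]) / kl_bernoulli(theta[k], best)
               for k in neighbors_of_best)


small = [0.2, 0.5, 0.9, 0.5, 0.2]                          # best arm at index 2
large = [0.05, 0.1, 0.2, 0.5, 0.9, 0.5, 0.2, 0.1, 0.05]    # best arm at index 4
print(unstructured_constant(small), unstructured_constant(large))
print(unimodal_constant(small, [1, 3]), unimodal_constant(large, [3, 5]))
```

Adding weaker arms on both sides increases the unstructured constant but leaves the unimodal one unchanged, since the latter depends only on the arms adjacent to the optimum.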
V-B Unimodal Beam Alignment (UBA) Algorithm
Next, we consider an algorithm whose regret matches the lower bound given in Theorem 1. The first part of this algorithm is identical to the OSUB algorithm proposed in [21], which we briefly describe here. This algorithm is asymptotically optimal, and is based on the UCB algorithm with the KL divergence used to compute an index for each arm. In particular, each arm is assigned an index that resembles the KL-UCB index, but the arm selected at a given time is the arm with the maximal index within the neighborhood of the arm that yields the highest empirical reward. Let be the arm selected at time , and denote the number of times arm has been selected up to time . The empirical reward of arm at time is:
(7) 
At any time slot , we denote by the index of the arm with the highest empirical reward. is referred to as the leader at time . Further, we define the number of times that arm has been the leader up to time . Now, the index of arm at time is defined as:
(8) 
in which and is a positive number. At any time slot, the algorithm selects the arm “close” to arm and with the maximum index. Next, we provide the finite time analysis of this algorithm, noting that the authors in [19, 21] have presented similar results.
Theorem 2.
Let fix . For all , the regret under the proposed UBA policy and at time is bounded by:
Proof.
Proof is provided in Appendix A. ∎
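For readers who want to experiment with the selection rule, the following is a minimal sketch of an OSUB-style leader-neighborhood policy on a line graph of Bernoulli arms. It is not a reproduction of Algorithm 1: the line-graph topology, the bisection tolerance, the default `c = 3.0`, the periodic re-sampling of the leader, and all function names are assumptions made for this illustration, and the termination condition is omitted.

```python
import math
import random


def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped for safety."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_ucb_index(mean, pulls, budget, tol=1e-6):
    """Largest q in [mean, 1] with pulls * KL(mean, q) <= budget (bisection)."""
    if pulls == 0:
        return 1.0  # untried arms receive the largest possible index
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if pulls * kl_bernoulli(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


def run_uba_sketch(theta, horizon, c=3.0, seed=0):
    """Simulate the rule; return (pulls per arm, cumulative pseudo-regret)."""
    rng = random.Random(seed)
    n = len(theta)
    neighbors = {k: [j for j in (k - 1, k + 1) if 0 <= j < n] for k in range(n)}
    gamma = 2  # maximum degree of a line graph (assumed topology)
    pulls, wins, led = [0] * n, [0] * n, [0] * n
    best, regret = max(theta), 0.0
    for _ in range(horizon):
        means = [wins[k] / pulls[k] if pulls[k] else 0.0 for k in range(n)]
        leader = max(range(n), key=lambda k: means[k])
        led[leader] += 1
        times_led = led[leader]
        if (times_led - 1) % (gamma + 1) == 0:
            arm = leader  # periodically re-sample the leader itself
        else:
            # times_led >= 2 on this branch, so log(log(times_led)) is defined
            budget = math.log(times_led) + c * math.log(math.log(times_led))
            candidates = [leader] + neighbors[leader]
            arm = max(candidates,
                      key=lambda k: kl_ucb_index(means[k], pulls[k], budget))
        reward = 1 if rng.random() < theta[arm] else 0
        pulls[arm] += 1
        wins[arm] += reward
        regret += best - theta[arm]  # pseudo-regret uses the true means
    return pulls, regret


pulls, regret = run_uba_sketch([0.1, 0.3, 0.6, 0.9, 0.5, 0.2], horizon=500)
print(pulls, regret)
```

Only the leader and its neighbors are ever compared in a round, so the per-round work and the exploration cost depend on the graph degree rather than on the total number of arms.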
In order to guarantee that the algorithm runs in finite time, we add a termination condition to Algorithm 1 and continue with the data communication phase thereafter. We use the peak-to-average ratio as the termination condition in order to detect when the best beam direction is found. Therefore, as the UBA algorithm proceeds, we evaluate the ratio of the received energy to the average energy of the previously received signals. If the ratio is higher than a threshold , we terminate the UBA algorithm and declare the beam as the best beam direction. Specifically, at time slot , we calculate the peak-to-average ratio as follows:
in which denotes the energy level of beam direction selected at time . Therefore, when the condition is satisfied, we declare the beam with index as the optimal direction, and the UBA algorithm stops at time . The authors in [1] have experimentally evaluated the peak-to-average ratio for LOS and NLOS situations such that is acceptable for detecting LOS. It should be noted that the proposed UBA scheme does not rely on the existence of LOS scenarios, while the threshold can be different under various environmental conditions. In particular, environmental conditions (e.g., blockage or reflection) alter the success probability of beam matchings, while the proposed UBA is oblivious to the underlying “physical layer” condition. In fact, based on the past observations, the UBA biases the search space towards the best beam direction. Therefore, the transmitter and receiver are able to refine the search space through successive rounds of beam alignment. The pseudocode is provided in Algorithm 1 where is the maximum degree of the graph representing the relation between arms.
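The termination test can be sketched as follows. The code assumes that the ratio compares the most recent energy sample with the plain mean of all earlier samples, and that energies are expressed in linear (not dB) units; the exact normalization, the function names, and the threshold of 3.0 used in the example are assumptions for illustration.

```python
from typing import List, Optional


def peak_to_average(energies: List[float]) -> Optional[float]:
    """Latest energy divided by the mean of all earlier ones (None if undefined)."""
    if len(energies) < 2:
        return None
    previous = energies[:-1]
    average = sum(previous) / len(previous)
    if average <= 0:
        return None
    return energies[-1] / average


def first_stop_slot(energies: List[float], threshold: float) -> Optional[int]:
    """Zero-based slot at which the ratio first exceeds the threshold."""
    for n in range(2, len(energies) + 1):
        ratio = peak_to_average(energies[:n])
        if ratio is not None and ratio > threshold:
            return n - 1
    return None


samples = [1.0, 1.0, 1.0, 5.0, 1.0]       # hypothetical energy readings
print(peak_to_average([2.0, 4.0, 6.0]))   # 2.0
print(first_stop_slot(samples, 3.0))      # 3
```

Because the average is taken over past samples only, a single strong reading after a run of weak ones triggers the stop, which is the behavior one would expect when the search first lands on the dominant direction.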
In order to provide a complete beam alignment algorithm, similar to the IEEE 802.11ad standard, we decouple the transmitter and receiver steering such that the transmitter starts with a quasi-omnidirectional beam, while the receiver uses the UBA algorithm (instead of exhaustive search) to find the best beam direction. The process is then reversed to have the transmitter scan the space while keeping the receiver quasi-omnidirectional. As a result, we enhance the 802.11ad standard beam alignment by using the UBA algorithm instead of exhaustive search.
VI Numerical Results
VI-A Setup
We compare the performance of the UBA algorithm with the exhaustive search scheme in which the receiver scans all directions and samples the beam in each of them. The combination of transmitter and receiver beams that delivers the maximum power is picked as the direction of the signal. We compare the UBA algorithm with the exhaustive search method under two different scenarios: directional and quasi-directional. Under directional conditions, the probability of success is either very high or very low (e.g., in LOS scenarios). On the other hand, the quasi-directional scenario occurs when the variance of the success probabilities is smaller than in the directional situation (e.g., NLOS conditions). We evaluate the performance of beam alignment in terms of regret, which measures the performance loss compared with the optimal alignment (i.e., an oracle policy). In this case, a lower regret implies a higher amount of received energy, and thus a higher accuracy in beam alignment. We also compare the beam alignment accuracy and delay overhead when the peak-to-average ratio termination condition is used. In simulations, we fix the transmitter beam direction, and the receiver scans the angular domain to find the optimal beam direction.
VI-B Regret Performance
We set the vector of success probabilities as follows:
Figure 6 demonstrates the regret of the UBA method compared with the exhaustive beam sampling method under the directional and quasi-directional scenarios with beam directions. From the results, we observe that the regret increases over time since, compared with an oracle policy, the total performance loss keeps accumulating. However, the regret curve is concave and its rate of increase decreases with time (i.e., the error decreases). In addition, exploiting the structural properties using the UBA algorithm greatly reduces the regret, which is equivalent to a higher amount of received energy. This implies a higher beam alignment accuracy, which is proportional to the received energy. In addition, both methods achieve a lower regret under the directional scenario, as expected.
VI-C Scaling with the Size of System
Due to recent advances in antenna technologies, large directional antenna arrays with much smaller form factors can be deployed in relatively small chip areas. As a result, spatial resolution and number of the beams can be very large at the transmitter and receiver. Within this context, we investigate the effect of number of beam pairs on the performance of UBA. Figure 7 demonstrates the regret metric for and beam pairs. From the results, we observe that the performance of UBA scheme does not degrade with the number of beams that is a function of the number of antennas at the transmitter and receiver. This is a crucial property in massive antenna systems. Similar to Fig. 6, UBA scheme achieves a better performance compared with the exhaustive beam sampling method for both and beam pairs.
VI-D Beam Alignment Accuracy and Delay Overhead
Next, we investigate the accuracy and delay overhead of the proposed UBA algorithm combined with the peak-to-average ratio termination condition. We set the number of beams to be equal to beams at the receiver, and the goal is to find the best beam (e.g., misalignment angle of zero). Using the exhaustive search method, time slots are needed to examine all beams and pick the one with the highest received energy. This method is deterministic in the sense that the output is correct with a guaranteed delay overhead of slots. On the other hand, our method finds the optimal beam direction with a high probability while its delay overhead is smaller than that of the exhaustive method.
We set [1], and consider a scenario in which the beam alignment success probability for the optimal beam is relatively small, i.e., . In this case, Fig. 8(a) reports the CDF of the optimal beam detection. From the results, we observe that in more than of iterations, we correctly predict the optimal beam direction. The important point, however, is that our method significantly reduces the delay overhead. Figure 8(b) depicts the scatter plot for detecting each beam as the optimal vs. the amount of time it takes. The size of each scatter point represents the density of data. From the results, we observe that most of the beam alignment operations lead to beam 1 (i.e., high accuracy) with a delay of less than time slots (i.e., low overhead). Figure 8(c) also shows the CDF of the delay overhead in detecting beam 1 as the optimal beam. We extend the simulation to receiver beams. From the results shown in Fig. 9, we observe that the delay overhead is significantly improved at the cost of some error in detecting the optimal beam direction. In fact, the delay overhead of time slots (to examine each direction) is reduced to time slots (averaged over 1000 iterations).
VII Conclusion
In this paper, we investigated the beam alignment problem in mmWave systems where the transmit and receive antenna arrays need to frequently find the optimal beam pair that maximizes the received energy from beacon messages.
In order to reduce the overhead of exhaustive search methods, we investigated an online stochastic optimization problem and proposed an equivalent structured Multi-Armed Bandit model. In this case, the problem of finding the best beam pair is reduced to finding the optimal arm at each time slot such that the overall regret is minimized. We exploit the contextual information in order to reduce the search space, and thus the overhead of exhaustive beam selection. Thanks to the structural properties, we demonstrated that the regret bound does not depend on the size of the decision space, which is equal to the product of the numbers of transmit and receive beams. This is a crucial property in MIMO settings in which the number of all combinations of transmit and receive beams grows quickly. We further proposed an asymptotically optimal algorithm for the beam alignment problem and demonstrated its performance via simulations.
Acknowledgment
This work was supported in part by the following NSF grants: CNS1518916, CNS1314822, CNS1618566, CNS1514260, CNS1518829, and ONR grants: N000141512166 and N000141712417.
The authors would like to thank Hongliang Si and Nathan Weirich for performing the experiments.
Appendix A Proof of Theorem 2
Proof.
Similar to [21, 22], we split the rounds in two sets: those rounds in which the best arm is the leader , i.e., , and those in which the leader is another arm, i.e., . Therefore:
If we consider the first term, the proposed algorithm behaves like the UCB algorithm restricted to the optimal arm and its neighborhood, and the regret upper bound is the one presented in [23], i.e., for every :
where is a constant. For the second part, we have:
or . Next, we provide an upper bound on the number of times that arm has been the leader, i.e., , with that is the number of rounds spent with arm as leader in the case only its neighborhood is considered during the whole time horizon . Therefore, we have:
(10) 
where, with an abuse of notation, denotes the leader at round in this modified problem where only is considered. Since arm is the leader, its empirical mean is the maximum in its neighborhood, i.e., in which . Thus, we have: Defining as the expected loss incurred in choosing arm instead of its best adjacent one , we have:
(11) 
For the first term, we have:
(12) 
where the last inequality is due to the Chernoff-Hoeffding inequality expressed as follows:
Lemma 1.
(Chernoff-Hoeffding inequality) Let be random variables with common range and such that . Let . Then for all , we have:
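For reference, one standard form of this bound, as commonly used in finite-time bandit analyses, is sketched below. The symbols are assumptions introduced here for illustration: $X_1,\dots,X_n$ take values in $[0,1]$ with $\mathbb{E}[X_t \mid X_1,\dots,X_{t-1}] = \mu$, $S_n = X_1 + \dots + X_n$, and $a \ge 0$.

```latex
\[
  \Pr\{S_n \ge n\mu + a\} \le e^{-2a^2/n}
  \quad\text{and}\quad
  \Pr\{S_n \le n\mu - a\} \le e^{-2a^2/n}.
\]
```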
Therefore, we have:
(13) 
since . Then, we have: where we dropped the conditioning. Since the expected number of times that the nonoptimal arm has been played is bounded and its variance is bounded as well, using Bernstein’s inequality (provided below), we have: and since is lower bounded, we conclude that is bounded by a constant, i.e., .
Lemma 2.
(Bernstein inequality) Let be random variables with range in and Let . Then for all , we have:
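For reference, one standard form of this bound is sketched below. The symbols are assumptions introduced here for illustration: $X_1,\dots,X_n$ take values in $[0,1]$, $\sum_{t=1}^{n} \operatorname{Var}[X_t \mid X_{t-1},\dots,X_1] = \sigma^2$, $S_n = X_1 + \dots + X_n$, and $a \ge 0$.

```latex
\[
  \Pr\{S_n \ge \mathbb{E}[S_n] + a\}
  \le \exp\!\left(-\frac{a^2/2}{\sigma^2 + a/2}\right).
\]
```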
For the term, we have:
(14) 
It is straightforward to show that is bounded by a constant as well. By considering the three partial results on , , and , we have:
and the theorem statement follows. ∎
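The proof separates rounds according to which arm is the leader and argues that sampling is confined to the leader's neighborhood. The sketch below illustrates that structure under stated assumptions: arms lie on a line graph, rewards are Bernoulli, and a plain UCB1 index stands in for the index used by the proposed algorithm. All function and variable names are hypothetical, and the sketch samples every arm once at the start, which the analyzed algorithm need not do.

```python
import math
import random


def neighborhood(k, n_arms):
    """Arm k together with its adjacent arms on a line graph."""
    return [j for j in (k - 1, k, k + 1) if 0 <= j < n_arms]


def run_leader_ucb(means, horizon, seed=0):
    """Play Bernoulli arms, sampling only within the current leader's neighborhood."""
    rng = random.Random(seed)
    n_arms = len(means)
    pulls = [0] * n_arms          # number of times each arm was played
    sums = [0.0] * n_arms         # cumulative reward of each arm
    leader_count = [0] * n_arms   # rounds in which each arm was the leader

    def pull(k):
        reward = 1.0 if rng.random() < means[k] else 0.0
        pulls[k] += 1
        sums[k] += reward

    # Initialization (a simplification of this sketch): one sample per arm.
    for k in range(n_arms):
        pull(k)

    for t in range(n_arms, horizon):
        emp = [sums[k] / pulls[k] for k in range(n_arms)]
        # The leader is the arm with the highest empirical mean.
        leader = max(range(n_arms), key=lambda k: emp[k])
        leader_count[leader] += 1
        # UCB1 index, evaluated only on the leader and its neighbors.
        candidates = neighborhood(leader, n_arms)
        choice = max(
            candidates,
            key=lambda k: emp[k] + math.sqrt(2 * math.log(t + 1) / pulls[k]),
        )
        pull(choice)

    return pulls, leader_count


# Hypothetical unimodal mean rewards over 7 beam directions.
pulls, leader_count = run_leader_ucb([0.1, 0.3, 0.5, 0.7, 0.9, 0.7, 0.5], horizon=5000)
```

Here `leader_count` plays the role of the per-arm leader counts that the proof bounds for suboptimal arms, and `pulls` shows how the samples are distributed across arms.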
References
 [1] T. Nitsche, A. B. Flores, E. W. Knightly, and J. Widmer, “Steering with eyes closed: mm-wave beam steering without in-band measurement,” in Computer Communications (INFOCOM), IEEE Conference on. IEEE, 2015, pp. 2416–2424.
 [2] IEEE, “IEEE 802.11ad, amendment 3: Enhancements for very high throughput in the 60 GHz band,” IEEE 802.11 Working Group, 2012.
 [3] L. Zhou and Y. Ohashi, “Efficient codebook-based MIMO beamforming for millimeter-wave WLANs,” in Personal Indoor and Mobile Radio Communications (PIMRC), 2012 IEEE 23rd International Symposium on. IEEE, 2012, pp. 1885–1889.
 [4] Y. Zhu, Z. Zhang, Z. Marzi, C. Nelson, U. Madhow, B. Y. Zhao, and H. Zheng, “Demystifying 60GHz outdoor picocells,” in Proceedings of the 20th annual international conference on Mobile computing and networking. ACM, 2014, pp. 5–16.
 [5] T. L. Lai and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Advances in applied mathematics, vol. 6, no. 1, pp. 4–22, 1985.
 [6] Y. Qi and M. Nekovee, “Coordinated initial access in millimetre wave standalone networks,” arXiv preprint arXiv:1605.03337v1, 2016.
 [7] V. Desai, L. Krzymien, P. Sartori, W. Xiao, A. Soong, and A. Alkhateeb, “Initial beamforming for mmwave communications,” in Signals, Systems and Computers, 2014 48th Asilomar Conference on. IEEE, 2014, pp. 1926–1930.
 [8] J. Singh and S. Ramakrishna, “On the feasibility of codebook-based beamforming in millimeter wave systems with multiple antenna arrays,” IEEE Transactions on Wireless Communications, vol. 14, no. 5, pp. 2670–2683, 2015.
 [9] C. N. Barati, S. A. Hosseini, S. Rangan, P. Liu, T. Korakis, S. S. Panwar, and T. S. Rappaport, “Directional cell discovery in millimeter wave cellular networks,” IEEE Transactions on Wireless Communications, vol. 14, no. 12, pp. 6664–6678, 2015.
 [10] S. Hur, T. Kim, D. J. Love, J. V. Krogmeier, T. A. Thomas, and A. Ghosh, “Millimeter wave beamforming for wireless backhaul and access in small cell networks,” Communications, IEEE Transactions on, vol. 61, no. 10, pp. 4391–4403, 2013.
 [11] A. Ali, N. González-Prelcic, and R. W. Heath Jr, “Millimeter wave beam-selection using out-of-band spatial information,” arXiv preprint arXiv:1702.08574, 2017.
 [12] H. Hassanieh, O. Abari, M. Rodriguez, M. Abdelghany, D. Katabi, and P. Indyk, “Agile millimeter wave networks with provable guarantees,” arXiv preprint arXiv:1706.06935, 2017.
 [13] M. Hashemi, C. E. Koksal, and N. B. Shroff, “Energy-efficient power and bandwidth allocation in an integrated sub-6 GHz–millimeter wave system,” arXiv preprint arXiv:1710.00980, 2017.
 [14] M. Hashemi, C. E. Koksal, and N. B. Shroff, “Out-of-band millimeter wave beamforming and communications to achieve low latency and high energy efficiency in 5G systems,” IEEE Transactions on Communications, 2017.
 [15] H. Robbins, “Some aspects of the sequential design of experiments,” in Herbert Robbins Selected Papers. Springer, 1985, pp. 169–177.
 [16] S. Bubeck, N. Cesa-Bianchi et al., “Regret analysis of stochastic and nonstochastic multi-armed bandit problems,” Foundations and Trends® in Machine Learning, vol. 5, no. 1, pp. 1–122, 2012.
 [17] E. W. Cope, “Regret and convergence bounds for a class of continuum-armed bandit problems,” IEEE Transactions on Automatic Control, vol. 54, no. 6, pp. 1243–1253, 2009.
 [18] J. Y. Yu and S. Mannor, “Unimodal bandits,” in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 41–48.
 [19] R. Combes, A. Proutiere, D. Yun, J. Ok, and Y. Yi, “Optimal rate sampling in 802.11 systems,” in INFOCOM, 2014 Proceedings IEEE. IEEE, 2014, pp. 2760–2767.
 [20] R. Combes and A. Proutiere, “Dynamic rate and channel selection in cognitive radio systems,” IEEE Journal on Selected Areas in Communications, vol. 33, no. 5, pp. 910–921, 2015.
 [21] R. Combes and A. Proutiere, “Unimodal bandits: Regret lower bounds and optimal algorithms,” in International Conference on Machine Learning, 2014, pp. 521–529.
 [22] S. Paladino, F. Trovò, M. Restelli, and N. Gatti, “Unimodal Thompson sampling for graph-structured arms,” in AAAI, 2017, pp. 2457–2463.
 [23] O. Cappé, A. Garivier, O.-A. Maillard, R. Munos, G. Stoltz et al., “Kullback–Leibler upper confidence bounds for optimal sequential allocation,” The Annals of Statistics, vol. 41, no. 3, pp. 1516–1541, 2013.