Machine-Learning Beam Tracking and Weight Optimization for mmWave Multi-UAV Links

by Hsiao-Lan Chiang, et al.

Millimeter-wave (mmWave) hybrid analog-digital beamforming is a promising approach to satisfying the low-latency constraint in multiple unmanned aerial vehicle (UAV) systems, which serve as flexibly deployable network infrastructure. In highly dynamic multi-UAV environments, however, analog beam tracking becomes a critical challenge: achieving high operational resilience requires additional pilot transmissions, which come at the price of spectral efficiency. An efficient way to cope with the high dynamics of UAVs is to apply machine learning, in particular Q-learning, to analog beam tracking. The proposed Q-learning-based beam tracking scheme uses current and past observations to design rewards from the environment and thereby facilitate prediction, which significantly increases the efficiency of data transmission and beam switching. Given the selected analog beams, the goal of digital beamforming is to maximize the SINR. The received pilot signals are used to approximate the desired signal and interference power, which yields the SINR measurements as well as the optimal digital weights. Since analog beams selected on the basis of received power do not guarantee that hybrid beamforming achieves the maximum SINR, we reserve additional analog beams as candidates during beam tracking. The combination of analog beams and digital weights achieving the maximum SINR then provides the optimal hybrid beamforming solution.







I Introduction

Civil applications of unmanned aerial vehicles (UAVs) have become popular in recent years, for example in post-disaster operations. A UAV is capable of carrying a network device as an access point that uses an intelligent reflecting surface with beamforming to reflect incident signals [1]. A group of UAVs forms an aerial radio access network (aerial-RAN), which serves as short-term network infrastructure, either as an independent wireless network or as a long-term extension of existing mobile communication networks [2, 3]. An aerial-RAN can perform tasks such as (i) jointly transmitting or receiving signals from different directions to detect weak signals from victims and (ii) separately serving as independent wireless networks that provide a wide range of services, see Fig. 1. In this example, two followers collect data from ground users and then report the information to the lead UAV, which passes the data to a remote ground anchor node [4, 5, 6]. Such an operation is often characterized by low-latency and high-resilience constraints. The former is defined as the time to get a response to information sent, while the latter is the ability to provide and maintain an acceptable quality of link service in highly dynamic operations.

Millimeter-wave (mmWave) communication is one of the candidates to satisfy the low-latency requirement due to the availability of large chunks of spectrum in unlicensed mmWave frequency bands [7, 8]. Compared with sub-6 GHz communications, mmWave propagation suffers from more severe environmental conditions, such as higher path loss and a small number of scattering events [9, 10]. In order to improve the data rates and quality of service, beamforming with large antenna arrays is a promising approach. At mmWave frequencies, analog beamforming via a passive phased array is considered due to cost and power consumption concerns [11, 12, 13]. With more than one analog beamforming vector, linear combinations of multiple analog beamforming vectors, with the weights of digital beamformers as coefficients, provide more degrees of freedom for beamforming designs. Such a beamforming architecture is called hybrid analog-digital beamforming [14, 15].

Fig. 1: An example of multi-UAV scenarios. The UAVs are deployed in an area of interest for search and rescue works, where the lead UAV transmits the users’ data collected from the followers to the ground anchor node.

In hybrid beamforming systems, although both analog and digital beamforming matrices use the same word beamforming, only the former has a specific geometrical meaning in the sense of transmitting or receiving signals towards specific directions in 3-D space using antenna arrays. In contrast, the digital weights act in the sense of optimum linear combining, given some cost criterion. According to the functions of analog and digital beamforming, hybrid beamforming can be viewed as first converting a MIMO channel matrix (in the spatial domain) into an effective channel (in the angular domain) using analog beamforming vectors [16, 17]. Then, one can further design the weights of the digital beamformers to linearly combine the analog beamforming vectors based on some optimality criteria. Clearly, the performance of hybrid beamforming is dominated by the analog beam search. In highly dynamic UAV environments with speeds up to 100 m/s [18], this challenge (specifically, analog beam tracking) becomes a critical problem.

One of the key performance indicators for dynamic beam tracking is network resilience [19]. In dynamic environments, the UAVs may have to switch the analog beams rapidly in order to stably provide an acceptable link quality. Given codebooks that consist of candidates for the analog beams, the work in [20] presented a gradient-based algorithm to find a better beam next to the currently used beam, and in [21] the beam tracking problem is formulated as a multi-armed bandit problem. One can also use the extended Kalman filter to recursively track the beams based on the estimated angles of departure and arrival (AoDs/AoAs). In addition, a conventional object tracking method using reinforcement learning in computer vision [23] has attracted attention and has been used in beam tracking [24, 25, 26]. All the above-mentioned methods try to find the beam which can achieve an acceptable link quality. However, implementing beam tracking for highly dynamic channels needs a large number of observations (that is, received pilot signals), at the cost of spectral efficiency. When pursuing highly resilient multi-UAV communication, the transmission overhead of pilots is another issue. In this paper, we attempt to strike a balance between system resilience and efficiency.

To handle the beamforming problem for a time-varying channel, we let the UAVs learn how to interact with the highly dynamic environment during beam tracking using Q-learning [27, 28]. Q-learning is a model-free reinforcement learning algorithm that uses experience, current measurements, and rewards from the environment to solve the prediction problem without knowing a model of the environment. When applying Q-learning to beam tracking, the crucial problem is to design the reward function based on noisy observations. Note that the reward function also influences the experience in Q-learning. Some prior works [24, 29] used true values of the signal-to-interference-plus-noise ratio (SINR) or true values of the received power to define the reward function, which cannot faithfully show the performance of Q-learning-based beam tracking in practical cases. In the proposed method, we use the noisy observations to design the reward function and take current and past observations as its arguments so as to reduce the pilot overhead.

In the analog beam tracking, the analog beams are selected according to the power of the observations. (Precisely, the power of the observations determines the rewards from the environment in Q-learning, and we then use the rewards to find favorable beams.) These beams together yield (nearly) the maximum received power. However, the spatial-domain interference from different UAVs can seriously degrade the throughput. Essentially, what really matters to multi-UAV hybrid beamforming is SINR maximization [30, 31]. To this end, given the selected analog beams, one can design the corresponding digital weights to maximize the SINR. To obtain the SINR measurements, we use the received coupling coefficients (a coupling coefficient is a measure of a pair of analog beamforming vectors selected on both sides of the channel [17]) associated with the beams assigned to different UAVs to approximate the desired signal and interference power, which facilitates the design of the digital weights. Moreover, it is worth noting that the analog beams leading to the maximum received power may not lead to the maximum SINR [17]. We therefore reserve more candidates for the analog beams during beam tracking. It turns out that the analog beams have to be determined after linear combinations of the analog beamforming vectors with the digital weights.

The contributions of the proposed method are summarized as follows:

  • The proposed method only requires the received coupling coefficients as observations to implement both the analog beam tracking and the digital weight optimization. Compared with prior works in the literature, which need detailed knowledge such as channel state information, we provide a more feasible, low-complexity solution for connecting multiple UAVs.

  • We formulate the beam tracking problem using a Q-learning model and introduce how to use the coupling coefficients to design the rewards. The proposed method can stably track the beams in highly dynamic environments.

  • To track the beams in highly dynamic UAV environments, a burden of pilot transmission is inevitable. The proposed beam tracking method uses current and past observations to solve the prediction problem, which significantly increases the efficiency of data transmission and beam switching.

  • The selected analog beams based on the received power do not ensure that hybrid beamforming achieves the maximum SINR. We therefore reserve additional analog beams as candidates during the beam tracking and then determine which combination of analog beams with their digital weights achieves the maximum SINR. This idea can be simply implemented given the coupling coefficients.

The rest of this paper is organized as follows: Section II describes the multi-UAV beamforming system and the time-varying AoDs/AoAs. Section III states the objectives and challenges of the hybrid beamforming problem in highly dynamic environments. To efficiently track the analog beams with a limited number of observations, Q-learning is applied to the beam tracking problem for one and for multiple links in Section IV. Given the selected beam pairs, the corresponding optimal digital weights are derived in Section V. Simulation results are presented in Section VI, and we conclude our work in Section VII.

We use the following notations throughout this paper.

A scalar.
A column vector.
A matrix.
A set.
The entry of .
The complex conjugate of .
The Hermitian transpose of .
The identity matrix.

II System Model

Fig. 2: A multi-UAV hybrid beamforming system has a lead with a hybrid analog-digital beamformer and followers equipped with analog beamformers.

A clustered multi-UAV beamforming system shown in Fig. 2 has one lead and followers. We assume that these UAVs are perfectly synchronized in time and frequency, and the lead communicates data streams to followers at the same time and frequency. That is, we consider space-division multiple access (SDMA) with beamforming to enable data transmission/reception for multiple UAVs [32, 33], and let each UAV be equipped with a uniform rectangular array (URA) of antennas.

The goal of multi-UAV beamforming in a highly dynamic environment is to maximize the system throughput in a discrete time interval . At the cluster lead, the signals are received from specific directions using analog beamformers at time , denoted by , . The analog beamformers are implemented in the passband as part of the RF front end. Due to the concerns of high implementation costs and power consumption, they have some limitations, e.g., the weights of analog beamformers have unit magnitude because analog beamformers are typically implemented by phase shifters [12]. The analog beamforming vectors together are denoted by the matrix , and these vectors can be further combined with the weights of the baseband digital beamformer .

Given a pre-defined codebook , the analog beamforming vectors at the lead are selected from the set . Beam of the URA, i.e., the member of can be represented by the Kronecker product (denoted by ) of the beamforming vectors and in - and -direction respectively [34]:


and the element of and can be represented by


where and are the indices of antenna elements in - and -direction respectively. Also, and are respectively the candidate for the azimuth and elevation steering angles at the lead (see Fig. 3), is the distance between neighboring antenna elements, and is the wavelength at the carrier frequency.
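As a concrete illustration of the Kronecker construction of a URA beam, a minimal NumPy sketch follows. The phase convention, half-wavelength spacing, and array size are illustrative assumptions, not the paper's exact parameters:

```python
import numpy as np

def ura_beam(nx, ny, azimuth, elevation, d_over_lambda=0.5):
    """URA beamforming vector as the Kronecker product of the x- and
    y-direction steering vectors (phase-shifter weights: unit magnitude)."""
    kx = np.arange(nx)
    ky = np.arange(ny)
    # Assumed phase convention for a URA lying in the xy-plane.
    ax = np.exp(-2j * np.pi * d_over_lambda * kx * np.sin(elevation) * np.cos(azimuth))
    ay = np.exp(-2j * np.pi * d_over_lambda * ky * np.sin(elevation) * np.sin(azimuth))
    return np.kron(ax, ay) / np.sqrt(nx * ny)  # normalized to unit norm

v = ura_beam(4, 4, azimuth=0.3, elevation=1.0)
```

Every entry has the same magnitude before normalization, consistent with a phase-shifter implementation of the analog beamformer.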

Fig. 3: An array geometry of the URA.

For the followers, each only uses a single analog beamformer with phase shifters to communicate with the lead. (We assume that all the UAVs are equipped with a hybrid beamforming architecture, since the leading UAV may change over time; the lead is randomly selected from the UAVs at the beginning.) Similar to the analog beams at the lead, each follower selects an analog beam from codebook . (Essentially, these two codebooks are the same, i.e., . We specify the beamforming problem in terms of two different notations of codebooks for generality.)

Via a time-varying channel between the lead and follower , the received signal at the lead after the hybrid beamformer is the superposition of the desired signal, interference from other UAVs, and combined noise [30, 31]:


where is the pilot signal satisfying and , is an -dimensional circularly symmetric complex Gaussian (CSCG) random noise vector with mean and covariance matrix , i.e., , and is the column of .

The link between the lead and follower is modeled as a line-of-sight (LoS) path. According to the relative position and orientation between the transmitter and receiver, the MIMO channel matrix can be determined by the complex path gain and the outer product of two array response vectors and , which are functions of AoA and AoD [15, 35]. Thus, the channel matrix is expressed by


In a manner similar to the steering vector in (1), the array response vectors can be represented by the Kronecker product of the array response vectors in - and -direction. Take as an example:


and the entries of and are given by


where the random variables

and stand for the azimuth and elevation angles of departure at time . Given the azimuth and elevation angles of arrival (denoted by , ), the array response vector at the receiver (i.e., ) has a similar form as (5).

To model a highly dynamic environment for the angles under an observed LoS path, a Gaussian random walk is used to generate the time-varying angles , , , and . For instance, the azimuth angle of arrival can be defined by


where is a randomly selected initial angle of

and follows a uniform distribution, and

is the disturbance (or white noise) following a normal distribution. The other three time-varying angles are generated in a similar way.
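The Gaussian random walk for the time-varying angles can be sketched as follows; the uniform initial range and the disturbance standard deviation are illustrative assumptions:

```python
import numpy as np

def random_walk_angle(T, sigma=np.deg2rad(1.0), seed=0):
    """Time-varying angle: a uniformly drawn initial value plus an
    accumulated white Gaussian disturbance (a Gaussian random walk)."""
    rng = np.random.default_rng(seed)
    phi = np.empty(T)
    phi[0] = rng.uniform(-np.pi / 2, np.pi / 2)   # random initial angle (assumed range)
    steps = rng.normal(0.0, sigma, size=T - 1)    # white-noise disturbance per slot
    phi[1:] = phi[0] + np.cumsum(steps)
    return phi

phi = random_walk_angle(1000)
```

The other three angle trajectories would be generated by independent calls of the same form.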

III Problem Statement

The goal of hybrid beamforming in the multi-UAV system is to maximize the SINR (or system throughput) during the time interval . Meanwhile, after the combiner

, the variance of the combined noise signal is enforced to remain constant, i.e.,


which leads to a power constraint on the combiner as


Then, by introducing two sets and that include promising candidates for the analog beamforming matrices, we seek , , and that together achieve the maximum SINR and satisfy the power constraint from to :


where and are the power of the desired and interference signals given by


In this paper, we do not assume that the channel state information or any knowledge of the AoAs/AoDs is known to the lead. Instead, the required observations are the estimates of the coupling coefficients associated with a beam pair , where and . By correlating the received pilot signals with the known transmitted ones, we can obtain such observations, given by (the notation of the observation is simplified from its formal expression given by )


where denotes the superposition of the combined interference and noise, and we assume that it follows a complex normal distribution, i.e., .

Given the observations , the strategy for solving problem (10) could be, first, to use the observations to find the sets and that ideally consist of the optimal analog beamforming matrices. However, due to the hardware constraint on the analog beamformer, the beam probing is time-consuming. When the channel is highly dynamic, observations acquired early may become unreliable. How to use the observations to interact with the highly dynamic environment during the beam probing becomes a crucial problem. As a result, the idea of the Q-learning algorithm [28] is borrowed to find appropriate beams (i.e., the members of and ) for time-varying channels. The concept of Q-learning is to let the UAVs learn the optimal behavior directly from interaction with the environment. Once we determine the candidate sets and , the observations associated with the members of and are used to generate the corresponding digital weights and the SINR measurement.

IV Analog Beam Tracking Using Q-Learning

Fig. 4: All the candidates for the beam pairs are represented by the grid map, where the red ones are trained during the initial beam search. An example of the Q-learning-based beam selection is given in Example 1. According to the updated Q-values (see Table I), it converges to beam pair after a few iterations.

In this section, we introduce an analog beam tracking algorithm for highly dynamic environments. Starting from a single link between the lead and a follower, we adopt Q-learning to deal with the beam tracking problem. The idea can be easily extended to multiple links with additional constraints.

IV-A Beam Selection Using Q-Learning for One Link

To begin with, let us focus on the link between the lead and follower . That is, we seek the candidates for and . When the codebook size is large, an efficient way of beam tracking is to start from some specific directions that cover the 3-D environment. This phase is called the initial beam search. For example, Fig. 4 shows candidates for the analog beam pair, where and are the numbers of elements in codebooks and , respectively. In this example, the four beam pairs highlighted in red are initially explored. Formally, we define two sets, and , that consist of the beams used in the initial search, and assume that both the lead and the follower have the same initial beam search pattern. After the beam probing using these four beam pairs, the one having the maximum received power is selected as the starting point of the beam tracking in the next phase.

The beam tracking is conventionally implemented by searching a better choice next to the currently used beam pair [20, 36]. Both the initial beam search and beam tracking in the above-mentioned work only explore the environment rather than interact with the environment. The concept of “interaction with the environment” can be viewed as a beam selection algorithm that can explore uncharted territory and, meanwhile, exploit the searching experience. Concerning a highly dynamic environment, the exploration-exploitation balance becomes more important to the beam tracking. The idea of Q-learning is to let an agent (e.g., a UAV) learn to strike the balance between exploration and exploitation.

In Q-learning, the experience is recorded in a Q-learning table (or Q-table), see Table I, which is updated according to the current measurements. The Q-table is constructed according to three components: states, actions, and state-action values (also known as Q-values). Before the learning begins, the state-action values in the Q-table are initialized to zero. In a state at time , the UAV always implements the following four steps: select an action from the action set , go to the next state , observe a reward , and update the Q-value, given by [28, Ch. 6]


where is the learning rate (or step size), is the discount factor determining the importance of future rewards. The Q-value update can be described as a weighted average between the old value and new information.
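The update rule above is the standard temporal-difference form; a minimal sketch with illustrative step size and discount factor:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """One Q-learning step: move Q(s, a) toward the reward plus the
    discounted best Q-value of the next state, i.e. a weighted average
    of the old value and the new information."""
    td_target = r + gamma * np.max(Q[s_next])   # reward + discounted best future value
    Q[s, a] += alpha * (td_target - Q[s, a])    # step toward the target by alpha
    return Q

Q = np.zeros((2, 4))                  # 2 states (beam pairs), 4 actions (illustrative)
q_update(Q, s=0, a=1, r=1.0, s_next=1)
```

With all Q-values initialized to zero, a single positive reward moves only the visited state-action entry.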

Time () Episode Step State Action
0 0 0 1 0 0 0
1 1 0 0 1 0
2 2 0 0 1 0
3 3 1 0 0 0
4 1 0 0 0 1 0
5 1 0 1 0 0
6 2 0 1 0 0
7 3 0 0 2 0
TABLE I: The Q-values are updated according to the states and actions given in Example 1 and Fig. 4. Here we let the Q-values be updated by either 0 or 1 for simplicity.

The reward can be regarded as the feedback from the environment given an action. In terms of maximizing the SINR, the reward should be a function of the SINR. Nevertheless, we only have the coupling coefficients as measurements, which suffer from noise and interference. We therefore define the reward function as follows. According to the received power of the coupling coefficients corresponding to the trained beam pairs at time and , the reward is defined, in terms of thresholds, by functions of the received power


where is the beam index pair used at time . Due to the noise and interference, the observations and may be unreliable for determining the reward. To reduce the uncertainty, we define a lower threshold and an upper threshold . If the ratio of to lies between and , the measurement is treated as ambiguous and the reward is set to zero. A more detailed discussion of the upper and lower thresholds is provided in Appendix A.
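The thresholding idea can be sketched as follows: the ratio of the current to the previous received power is mapped to a positive, negative, or zero reward, with the band between the two thresholds treated as ambiguous. The ±1 reward values and the threshold choices are assumptions for illustration, since the exact values are stated in the paper's equation:

```python
def power_ratio_reward(p_now, p_prev, thr_lo=0.9, thr_hi=1.1):
    """Reward from noisy received-power observations with an ambiguity band
    between the lower and upper thresholds (illustrative values)."""
    ratio = p_now / p_prev
    if ratio >= thr_hi:
        return 1.0    # power clearly increased
    if ratio <= thr_lo:
        return -1.0   # power clearly decreased
    return 0.0        # ambiguous: reward is set to zero
```

Widening the ambiguity band trades learning speed for robustness against noise-induced reward errors.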

To elaborate the Q-learning-based beam selection, let us consider the example in Fig. 4 and Table I.

Example 1.

When starting from a state , one of the neighboring beam pairs will be explored by choosing an action from according to the state-action values, i.e., . Since all the Q-values at are initialized to zero, an action will be selected randomly (or according to some predefined criteria). We assume that the action “up” is selected so that the next state becomes . The corresponding reward and Q-value will be updated accordingly, see Table I. In the example, we simply let the Q-values be updated by either 0 or 1, where a value of 1 implies that the agent chooses the action and gets a positive reward. In Q-learning, a sequence of time slots (also called steps) is defined as an episode. Each episode starts from a state, which could be pre-defined or determined by the received power. Fig. 4 shows that the initial beam search needs in total four episodes with starting states at , , , and respectively. In each episode, the beam probing takes time slots to update the Q-values. When finishing the first episode, the agent starts the next episode using beam pair . With a sufficiently large number of significant Q-values, Q-learning will converge to the beam pair corresponding to the maximum received power.∎

After the initial beam search, some beam pairs have been explored and the beam tracking will start from the beam pair with the maximum received power during the initial beam search, which is denoted by (i.e., the state or beam pair with respect to the maximum power).

According to the updated Q-values, an agent exploits what it has already experienced in order to obtain a positive reward, but it also has to explore the uncharted or changed environment to see whether it can make better action selections in the future. One of the challenges in reinforcement learning is the trade-off between exploration and exploitation. By introducing a parameter , an -greedy action is obtained to better balance exploration and exploitation:


The agent chooses the action that it believes yields the best long-term effect with probability , or it chooses an action uniformly at random with probability .
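An ε-greedy selection over the currently available actions can be sketched as:

```python
import numpy as np

def epsilon_greedy(Q, s, actions, eps=0.1, rng=None):
    """With probability eps explore a uniformly random action; otherwise
    exploit the action with the largest Q-value in state s."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < eps:
        return int(rng.choice(actions))            # exploration
    return int(actions[np.argmax(Q[s, actions])])  # exploitation (ties -> first)

Q = np.array([[0.0, 1.0, 0.2]])
a = epsilon_greedy(Q, s=0, actions=[0, 1, 2], eps=0.0)
```

Restricting `actions` to a subset also covers the multi-link case later, where some actions are masked out in real time.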

The pseudocode of the Q-learning-based beam tracking algorithm is shown in Algorithm 1, which includes two phases: the initial beam search and the beam tracking. The difference between these two phases is the decision of the starting state of each episode. During the initial beam search, the starting state is selected from the pre-defined sets and . In Example 1, and . During the beam tracking, the starting state is selected according to the maximum received power. Moreover, the selected beam pair at time is denoted by . We assume that the analog beam pairs are determined at the UAV lead, and the time division duplex (TDD) technique, which separates the transmit and receive signals in the time domain, can be used to inform the followers to update their beams.

Input:  Observations
Output:  Selected beam pairs
1:  Initialize Q-table
3:  for : number of episodes
4:   if initial beam search
6:   else if beam tracking
8:   end if
9:   for : number of steps
10:    choose and go to
11:    obtain according to observations
12:    update
13:    update according to observations
15:   end step
16:  end episode
Algorithm 1 Q-learning beam tracking for a single link.

IV-B Overhead Reduction Using Offline Q-Learning

In Algorithm 1, the observations are available at each time slot . This implies that the beam switching and pilot transmission/reception are executed in every time slot, which is inefficient from a system perspective. To reduce the overhead, we retain all observations so that the Q-learning can be executed offline. When past observations are used to obtain the rewards and update the Q-values, we call the algorithm offline Q-learning; otherwise, it is called online Q-learning.

For the offline Q-learning, only the observations associated with large received power have to be updated regularly. Therefore, at the end of each episode, the beam pair with the maximum received power (i.e., ) is chosen and employed at the beginning of each episode in order to update the corresponding observations. For the other steps in an episode, pilot transmission and beam switching are not necessary unless a specific state has not yet been explored.

IV-C Beam Selection Using Q-Learning for Multiple Links

The idea of Q-learning-based beam tracking for one link can be easily extended to the case of multiple links, similar to multi-agent systems [37, 38]. For multi-UAV beam probing, the lead receives the observations from different followers simultaneously in an SDMA manner. In this case, the members of at the lead UAV’s side should not be selected repeatedly. As a result, the action set in (16) has to be updated in real time.

In each beam probing, which may be in the stage of initial beam search or beam tracking, the Q-learning-based beam selection starts from the follower with the maximum received power at that moment. We further define a set that includes the actions which would make different followers go to the same state. Thus, the action selection given in (16) can be reformulated as


After making the decision about the next state for a follower, the lead has to update accordingly.
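The real-time update of the action set can be sketched as masking out any action that would move a follower onto a beam-pair state already occupied by another link. The 1-D state/transition model below is a simplified stand-in for the 2-D grid of Fig. 4:

```python
def masked_actions(state, all_actions, occupied, step):
    """Keep only the actions whose next state is not used by another follower."""
    return [a for a in all_actions if step(state, a) not in occupied]

# Toy 1-D example: the state is a beam index and an action shifts it.
step = lambda s, a: s + a
acts = masked_actions(5, [-1, 0, +1], occupied={6}, step=step)
```

The surviving list would then feed the ε-greedy selection in place of the full action set.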

V Digital Beamforming

In the previous section, we used Q-learning to find the members of the sets and in problem (10). However, the selected beam pairs may not be the optimal solution to the problem, for the reasons that (i) Q-learning usually provides only a good enough solution (Q-learning uses experience to solve a prediction problem, which can be viewed as a Monte Carlo method) and (ii) the digital beamformer weights are not taken into account during the analog beam selection. In the sense of hybrid beamforming, a better solution is the one whose linear combination with the digital weights leads to the maximum SINR. This issue can be solved by keeping more than one promising member with large received power in and [17]. We use Example 2 to explain the idea.

Example 2.

Two selected beam pairs with large received power for each follower are collected in the following two sets:


Given these two sets, we can generate all the members of and , given by

which has a cardinality of 3 because the members of at the lead UAV should not be selected repeatedly, and the other set can be represented by

which has a cardinality of 8. In this example, given the above and , we have to evaluate a total of 24 combinations with their digital weights to maximize the SINR.∎
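The candidate enumeration in Example 2 can be sketched generically: take the Cartesian product of each follower's lead-side candidates and drop any tuple that reuses a lead beam. The beam labels below are hypothetical:

```python
from itertools import product

def lead_beam_tuples(candidates_per_follower):
    """All assignments of one lead-side beam per follower such that
    no beam is selected repeatedly across followers."""
    return [tup for tup in product(*candidates_per_follower)
            if len(set(tup)) == len(tup)]

# Two followers with two candidate lead beams each, one beam shared:
combos = lead_beam_tuples([[1, 2], [2, 3]])
```

Each surviving tuple, paired with a follower-side tuple, is one analog-beam combination whose digital weights must be evaluated for the maximum SINR.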

The above-mentioned idea differs from the work presented in [25], which keeps candidates in a subspace. In our opinion, a better solution should keep candidates with large received power, because the idea in [25] only takes the main lobes of the analog beams into account, while the proposed method considers both the main and side lobes.

V-A Digital Weight Optimization

To simplify the following descriptions of digital beamforming, we assume that and only include one member respectively, i.e.,


In the numerical results, we will provide more discussion of the idea. Given and , the hybrid beamforming problem (10) becomes a digital beamforming problem subject to the power constraint, which can be formulated as


where . The signal and interference power are subject to the selected analog beams


To satisfy the power constraint on the combiner, one can define unit vectors that obey the relation [17]


Upon replacing with in the problem, the received signal and interference power can be written as


Then, we can find that the problem (19) is equivalent to seeking vectors that maximize the SINR for followers. As a result, the maximization problem (19) can be reformulated as


V-B SINR Approximation Using Coupling Coefficients

In (23) and (24), the couplings of the channel and analog beams, such as and , can be viewed as effective channel vectors. Since the observations, given in (13), are the coupling of the channel and one analog beam pair, we can use them to construct the estimates of effective channel vectors, defined by




where the entries of can be obtained from as well. The collected observations suffice to generate the estimates of and represented by




Using (28) and (29), the SINR for follower conditional on and can be approximated by the following equation


Using the property that is a positive definite matrix, the optimal solution of that attains the maximum SINR can be stated as follows (also see Appendix B):
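The quantity being maximized is a generalized Rayleigh quotient, whose maximizer is the familiar minimum-variance-style combiner proportional to the inverse covariance times the effective channel. A numerical sketch under that assumption (the names `h_hat` and `R_hat` stand for the estimates in (28) and (29)):

```python
import numpy as np

def max_sinr_weights(h_hat, R_hat):
    """SINR-maximizing digital weights: proportional to R^{-1} h,
    scaled to unit norm to satisfy the power constraint."""
    w = np.linalg.solve(R_hat, h_hat)
    return w / np.linalg.norm(w)

def sinr(w, h, R):
    """Generalized Rayleigh quotient |w^H h|^2 / (w^H R w)."""
    return np.abs(w.conj() @ h) ** 2 / np.real(w.conj() @ R @ w)
```

For any positive definite interference-plus-noise covariance, no other unit-norm weight vector attains a larger quotient, which is why positive definiteness is invoked in the statement above.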