I Introduction
Applications of unmanned aerial vehicles (UAVs) to civil uses, such as post-disaster relief, have become popular in recent years. A UAV is capable of carrying a network device as an access point, for example an intelligent reflecting surface that reflects incident signals with beamforming [1]. A group of UAVs forms an aerial radio access network (aerial RAN), which serves as short-term network infrastructure in the form of an independent wireless network or as a long-term extension of existing mobile communication networks [2, 3]. An aerial RAN can perform tasks such as (i) jointly transmitting or receiving signals from different directions to detect weak signals from victims and (ii) operating the UAVs separately as independent wireless networks to provide a wide range of services; see Fig. 1. In this example, two followers collect data from ground users and then report the information to the lead UAV, which passes the data to a remote ground anchor node [4, 5, 6]. Such an operation is often characterized by low-latency and high-resilience constraints. The former is defined as the time to get a response to transmitted information, while the latter is the ability to provide and maintain an acceptable link quality of service in highly dynamic operations.
Millimeter-wave (mmWave) communication is a candidate to satisfy the low-latency requirement due to the availability of large chunks of spectrum in unlicensed mmWave frequency bands [7, 8]. Compared with sub-6 GHz communications, mmWave propagation suffers from more severe environmental conditions, such as higher path loss and a smaller number of scattering events [9, 10]. To improve the data rates and quality of service, beamforming with large antenna arrays is a promising approach. At mmWave frequencies, analog beamforming via a passive phased array is considered because of cost and power consumption concerns [11, 12, 13]. With more than one analog beamforming vector, linear combinations of multiple analog beamforming vectors, with the weights of digital beamformers as coefficients, provide more degrees of freedom for beamforming designs. Such a beamforming architecture is called hybrid analog-digital beamforming [14, 15].
In hybrid beamforming systems, although both analog and digital beamforming matrices use the same word beamforming, only the former has a specific geometrical meaning, in the sense of transmitting or receiving signals towards specific directions in 3D space using antenna arrays. In contrast, the digital weights act in the sense of optimum linear combining, given some cost criterion. According to these functions of analog and digital beamforming, hybrid beamforming can be viewed as first converting a MIMO channel matrix (in the spatial domain) into an effective channel (in the angular domain) using analog beamforming vectors [16, 17]. Then, one can further design the weights of the digital beamformers to linearly combine the analog beamforming vectors based on some optimality criterion. Clearly, the performance of hybrid beamforming is dominated by the analog beam search. In highly dynamic UAV environments with speeds up to 100 m/s [18], this challenge (specifically, analog beam tracking) becomes a critical problem.
One of the key performance indicators for dynamic beam tracking is network resilience [19]. In dynamic environments, the UAVs may have to switch the analog beams rapidly in order to stably provide acceptable link quality. Given codebooks that consist of candidates for the analog beams, the work in [20] presented a gradient-based algorithm to find a better beam next to the currently used beam, and in [21] the beam tracking problem is formulated as a multi-armed bandit problem. One can also use the extended Kalman filter to recursively track the beams based on the estimated angles of departure and arrival (AoDs/AoAs) [22]. In addition, a conventional object tracking method using reinforcement learning in computer vision [23] has attracted attention and has been applied to beam tracking [24, 25, 26]. All the above-mentioned methods try to find the beam that achieves acceptable link quality. However, implementing beam tracking for highly dynamic channels requires a large number of observations (that is, received pilot signals), which sacrifices spectral efficiency. When we pursue highly resilient multi-UAV communication, the transmission overhead of pilots is another issue. In this paper, we attempt to strike a balance between system resilience and efficiency.
To handle the beamforming problem for a time-varying channel, we let the UAVs learn how to interact with the highly dynamic environment during beam tracking using Q-learning [27, 28]. Q-learning is a model-free reinforcement learning algorithm that uses experience, current measurements, and rewards from the environment to solve the prediction problem without knowing a model of the environment. When applying Q-learning to beam tracking, the crucial problem is to design the reward function based on the noisy observations. Note that the reward function also influences the experience in Q-learning. Some prior works [24, 29] used true values of the signal-to-interference-plus-noise ratio (SINR) or of the received power to define the reward function, which cannot faithfully show the performance of Q-learning-based beam tracking in practical cases. In the proposed method, we use the noisy observations to design the reward function and take current and past observations as its arguments in such a way as to reduce the pilot overhead.
In analog beam tracking, the analog beams are selected according to the power of the observations. (Precisely, the power of the observations determines the rewards from the environment in Q-learning, and we then use the rewards to find favorable beams.) These beams together yield (nearly) the maximum received power. However, the spatial-domain interference from different UAVs can seriously degrade the throughput. Essentially, what really matters to multi-UAV hybrid beamforming is SINR maximization [30, 31]. To this end, given the selected analog beams, one can design the corresponding digital weights to maximize the SINR. To obtain measurements of the SINR, we use the received coupling coefficients (a coupling coefficient is a measure of a pair of analog beamforming vectors selected on both sides of the channel [17]) associated with the beams assigned to different UAVs to approximate the desired signal and interference powers, which facilitates the design of the digital weights. Moreover, it is worth noting that the analog beams leading to the maximum received power may not lead to the maximum SINR [17]. We therefore reserve more candidates for the analog beams during the beam tracking. It turns out that the analog beams have to be determined after linear combinations of the analog beamforming vectors with the digital weights.
The contributions of the proposed method are summarized as follows:

The proposed method only requires the received coupling coefficients as observations to implement both the analog beam tracking and the digital weight optimization. Compared with prior works in the literature, which need detailed knowledge such as channel state information, we provide a more feasible solution to connect multiple UAVs with low complexity.

We formulate the beam tracking problem using a Q-learning model and show how to use the coupling coefficients to design the rewards. The proposed method can stably track the beams in highly dynamic environments.

To track the beams in highly dynamic UAV environments, the burden of pilot transmission is inevitable. The proposed beam tracking method uses current and past observations to solve the prediction problem. In this way, it significantly increases the efficiency of data transmission and beam switching.

The analog beams selected based on the received power do not ensure that hybrid beamforming achieves the maximum SINR. We therefore reserve additional analog beams as candidates during the beam tracking and then determine which combination of analog beams and digital weights achieves the maximum SINR. This idea can be implemented simply, given the coupling coefficients.
The rest of this paper is organized as follows: Section II describes the multi-UAV beamforming system and the time-varying AoDs/AoAs. Section III states the objectives and challenges of the hybrid beamforming problem in highly dynamic environments. To efficiently track the analog beams with a limited number of observations, Q-learning is applied to the beam tracking problem for one and for multiple links in Section IV. Given the selected beam pairs, we pursue the corresponding optimal digital weights, and the solution is provided in Section V. Simulation results are presented in Section VI, and we conclude our work in Section VII.
We use the following notations throughout this paper: $a$ denotes a scalar; $\mathbf{a}$ denotes a column vector; $\mathbf{A}$ denotes a matrix; $\mathcal{A}$ denotes a set; $[\mathbf{A}]_{i,j}$ denotes the $(i,j)$-th entry of $\mathbf{A}$; $a^*$ denotes the complex conjugate of $a$; $\mathbf{A}^H$ denotes the Hermitian transpose of $\mathbf{A}$; and $\mathbf{I}$ denotes the identity matrix.
II System Model
A clustered multi-UAV beamforming system, shown in Fig. 2, has one lead and a number of followers. We assume that these UAVs are perfectly synchronized in time and frequency, and the lead communicates data streams with the followers at the same time and frequency. That is, we consider space-division multiple access (SDMA) with beamforming to enable data transmission/reception for multiple UAVs [32, 33], and let each UAV be equipped with a uniform rectangular array (URA) of antennas.
The goal of multi-UAV beamforming in a highly dynamic environment is to maximize the system throughput over a discrete time interval. At the cluster lead, the signals are received from specific directions using analog beamformers at each time slot. The analog beamformers are implemented in the passband as part of the RF front end. Because of high implementation cost and power consumption, they have some limitations; e.g., the weights of the analog beamformers have unit magnitude because analog beamformers are typically implemented with phase shifters [12]. The analog beamforming vectors are collected as the columns of an analog beamforming matrix, and these vectors can be further combined with the weights of the baseband digital beamformer.
Given a predefined codebook, the analog beamforming vectors at the lead are selected from a finite set of candidate beams. A beam of the URA can be represented by the Kronecker product (denoted by $\otimes$) of the beamforming vectors $\mathbf{f}_x$ and $\mathbf{f}_z$ in the $x$- and $z$-directions, respectively [34]:

$\mathbf{f} = \mathbf{f}_x \otimes \mathbf{f}_z$,  (1)

and the elements of $\mathbf{f}_x$ and $\mathbf{f}_z$ can be represented by

$[\mathbf{f}_x]_p = e^{j 2\pi \frac{d}{\lambda}(p-1)\sin\theta\cos\phi}$, $\quad [\mathbf{f}_z]_q = e^{j 2\pi \frac{d}{\lambda}(q-1)\cos\theta}$,  (2)

where $p$ and $q$ are the indices of the antenna elements in the $x$- and $z$-directions, respectively. Also, $\phi$ and $\theta$ are the candidate azimuth and elevation steering angles at the lead (see Fig. 3), $d$ is the distance between neighboring antenna elements, and $\lambda$ is the wavelength at the carrier frequency.
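The Kronecker construction of a URA beam can be sketched in a few lines of numpy. This is a minimal illustration, assuming half-wavelength element spacing and the conventional directional-cosine parameterization of the azimuth and elevation angles; the exact normalization and angle conventions in the paper's codebook may differ.

```python
import numpy as np

def ula_response(n, spacing_wavelengths, spatial_angle):
    """Steering vector of an n-element uniform linear array.
    spatial_angle is the directional cosine along the array axis."""
    k = np.arange(n)  # element indices 0..n-1
    return np.exp(1j * 2 * np.pi * spacing_wavelengths * k * spatial_angle) / np.sqrt(n)

def ura_beam(nx, nz, spacing, az, el):
    """URA beamforming vector as the Kronecker product of the
    x- and z-direction steering vectors, in the spirit of Eq. (1)."""
    fx = ula_response(nx, spacing, np.sin(el) * np.cos(az))  # x-direction cosine
    fz = ula_response(nz, spacing, np.cos(el))               # z-direction cosine
    return np.kron(fx, fz)

f = ura_beam(4, 4, 0.5, az=np.pi / 6, el=np.pi / 3)
# every entry has the same magnitude, consistent with the
# phase-shifter (unit-modulus) constraint up to normalization
```

Note that all entries of the resulting vector share a common magnitude, which is what makes the beam realizable with phase shifters only.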
Each follower uses only a single analog beamformer with phase shifters to communicate with the lead. (We assume that all the UAVs are equipped with a hybrid beamforming architecture, since the leading UAV may change over time; the lead is randomly selected from the UAVs at the beginning.) Similar to the analog beams at the lead, each follower selects an analog beam from its own codebook. (Essentially, the two codebooks are the same; we specify the beamforming problem in terms of two different codebook notations for generality.)
Via a time-varying channel between the lead and each follower, the received signal at the lead after the hybrid beamformer is the superposition of the desired signal, the interference from other UAVs, and the combined noise [30, 31]:
(3) 
where the pilot signals have unit power, the noise is a circularly symmetric complex Gaussian (CSCG) random vector with zero mean and a scaled identity covariance matrix, and each follower's stream is combined with the corresponding column of the digital beamforming matrix.
The link between the lead and each follower is modeled as a line-of-sight (LoS) path. According to the relative position and orientation of the transmitter and receiver, the MIMO channel matrix is determined by the complex path gain and the outer product of two array response vectors, which are functions of the AoA and AoD [15, 35]. Thus, the channel matrix is expressed by
(4) 
In a manner similar to the steering vector in (1), each array response vector can be represented by the Kronecker product of the array response vectors in the $x$- and $z$-directions. Taking the transmit-side response as an example:
(5) 
and the entries of the two component vectors are given by
(6) 
where the random variables denote the azimuth and elevation angles of departure at time $t$. Given the azimuth and elevation angles of arrival, the array response vector at the receiver has a form similar to (5). To model a highly dynamic environment for the angles under an observed LoS path, a Gaussian random walk is used to generate the four time-varying angles. For instance, the azimuth angle of arrival can be defined by

$\phi_t = \phi_{t-1} + \Delta_t$,  (7)

where the initial angle $\phi_0$ is selected randomly and follows a uniform distribution, and the disturbance (or white noise) $\Delta_t$ follows a normal distribution. The other three time-varying angles are generated in a similar way.
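A Gaussian random walk of this kind is straightforward to simulate. The sketch below is illustrative: the uniform support of the initial angle and the step standard deviation are assumptions, since the paper's numerical values were not preserved here.

```python
import numpy as np

def random_walk_angles(T, sigma, rng=None):
    """Generate a time-varying angle via a Gaussian random walk as in Eq. (7):
    a uniformly drawn initial angle plus accumulated white Gaussian steps."""
    rng = np.random.default_rng(rng)
    angles = np.empty(T)
    angles[0] = rng.uniform(-np.pi, np.pi)       # random initial angle (assumed support)
    steps = rng.normal(0.0, sigma, size=T - 1)   # disturbance terms
    angles[1:] = angles[0] + np.cumsum(steps)
    return angles

# e.g., azimuth AoA over 100 slots with a 1-degree step deviation
az_aoa = random_walk_angles(T=100, sigma=np.deg2rad(1.0), rng=0)
```

The other three angle processes (azimuth AoD, elevation AoA/AoD) would be generated by independent calls with their own initial angles.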
III Problem Statement
The goal of hybrid beamforming in the multi-UAV system is to maximize the SINR (or system throughput) over the considered time interval. Meanwhile, after combining, the variance of the combined noise signal is enforced to remain constant, i.e.,
(8) 
which leads to a power constraint on the combiner as
(9) 
Then, by introducing two candidate sets that include promising candidates for the analog beamforming matrices on both sides, we seek the analog beamforming matrices and digital weights that together achieve the maximum SINR and satisfy the power constraint over the whole time interval:
(10)  
where the powers of the desired and interference signals are given by
(11)  
(12) 
In this paper, we do not assume that the channel state information or any knowledge of the AoAs/AoDs is known to the lead. Instead, the required observations are the estimates of the coupling coefficients associated with the selected beam pairs. By correlating the received pilot signals with the known transmitted ones, we can obtain such observations, given by (the notation of an observation is simplified from its formal expression)
(13) 
where the last term denotes the superposition of the combined interference and noise, which we assume to follow a complex normal distribution.
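The pilot-correlation step can be illustrated with a short numpy sketch. This is a simplified single-pair model under assumed parameters (unit-power pilots, i.i.d. CSCG noise, no co-scheduled interferers), not the paper's full observation model in (13).

```python
import numpy as np

def coupling_coefficient(H, w, f, pilots, noise_std, rng=None):
    """Estimate the coupling coefficient of the beam pair (w, f) through
    channel H by correlating the received samples with the known
    unit-power pilot sequence, in the spirit of Eq. (13)."""
    rng = np.random.default_rng(rng)
    c = w.conj() @ H @ f                        # true coupling coefficient
    est = 0.0 + 0.0j
    for s in pilots:
        n = noise_std * (rng.standard_normal(H.shape[0])
                         + 1j * rng.standard_normal(H.shape[0])) / np.sqrt(2)
        y = c * s + w.conj() @ n                # combined pilot sample plus noise
        est += y * np.conjugate(s)              # correlate with the known pilot
    return est / len(pilots)                    # average suppresses the noise

rng = np.random.default_rng(1)
H = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
w = np.ones(4) / 2.0                            # unit-norm combining vector
f = np.ones(4) / 2.0                            # unit-norm transmit beam
pilots = np.exp(1j * 2 * np.pi * rng.random(64))
est = coupling_coefficient(H, w, f, pilots, noise_std=0.01, rng=0)
```

Averaging over the pilot sequence reduces the noise variance by the sequence length, which is why longer pilots give more reliable (but more costly) observations.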
Given the observations, the strategy for solving problem (10) could be, first, to use the observations to find the two candidate sets that ideally contain the optimal analog beamforming matrices. However, due to the hardware constraints on the analog beamformer, beam probing is time-consuming. When the channel is highly dynamic, observations acquired early may become unreliable. How to use the observations to interact with the highly dynamic environment during the beam probing becomes a crucial problem. We therefore borrow the idea of the Q-learning algorithm [28] to find appropriate beams (i.e., the members of the candidate sets) for time-varying channels. The concept of Q-learning is to let the UAVs learn the optimal behavior directly from interaction with the environment. Once we determine the candidate sets, the observations associated with their members are used to generate the corresponding digital weights and the SINR measurement.
IV Analog Beam Tracking Using Q-Learning
In this section, we introduce an analog beam tracking algorithm for highly dynamic environments. Starting from a single link between the lead and a follower, we adopt Q-learning to deal with the beam tracking problem. The idea can easily be extended to multiple links with additional constraints.
IV-A Beam Selection Using Q-Learning for One Link
To begin with, let us focus on the link between the lead and one follower; that is, we seek the candidate beams on both sides of this link. When the codebook size is large, an efficient way to start the beam tracking is from some specific directions that cover the 3D environment. This phase is called the initial beam search. For example, Fig. 4 shows the candidates for the analog beam pair over a grid indexed by the two codebooks. In the example, the four beam pairs highlighted in red are explored initially. Formally, we define two sets that consist of the beams used in the initial search and assume that both the lead and the follower have the same initial beam search pattern. After the beam probing using these four beam pairs, the one having the maximum received power is selected as the starting point of the beam tracking in the next phase.
The beam tracking is conventionally implemented by searching for a better choice next to the currently used beam pair [20, 36]. Both the initial beam search and the beam tracking in the above-mentioned works only explore the environment rather than interact with it. The concept of "interaction with the environment" can be viewed as a beam selection algorithm that can explore uncharted territory and, meanwhile, exploit the search experience. In a highly dynamic environment, the exploration-exploitation balance becomes even more important to the beam tracking. The idea of Q-learning is to let an agent (e.g., a UAV) learn to strike the balance between exploration and exploitation.
In Q-learning, the experience is recorded in a Q-learning table (or Q-table), see Table I, which is updated according to the current measurements. The Q-table is constructed from three components: states, actions, and state-action values (also known as Q-values). Before the learning begins, the state-action values in the Q-table are initialized to zero. In a state $s_t$ at time $t$, the UAV always implements the following four steps: select an action $a_t$ from the action set, go to the next state $s_{t+1}$, observe a reward $r_{t+1}$, and update the Q-value, given by [28, Ch. 6]

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$,  (14)

where $\alpha$ is the learning rate (or step size) and $\gamma$ is the discount factor determining the importance of future rewards. The Q-value update can be described as a weighted average between the old value and the new information.
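The update rule above can be sketched as a small tabular function. The state and action encodings below are illustrative, not the paper's beam-pair indexing.

```python
from collections import defaultdict

def q_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update, Eq. (14):
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])

Q = defaultdict(float)                     # Q-values initialized to zero
actions = ["up", "down", "left", "right"]  # neighboring-beam moves (illustrative)
q_update(Q, (2, 3), "up", 1.0, (2, 4), actions)
# starting from zero, the new value is alpha * reward = 0.1 * 1.0 = 0.1
```

With all values initialized to zero, the first positive reward moves the corresponding Q-value to $\alpha \cdot r$, exactly the weighted-average interpretation described above.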
Time (t) | Episode | Step | State | Action | Q-values
0 | 0 | 0 | – | – | 1 0 0 0
1 | 0 | 1 | – | – | 0 0 1 0
2 | 0 | 2 | – | – | 0 0 1 0
3 | 0 | 3 | – | – | 1 0 0 0
4 | 1 | 0 | – | – | 0 0 1 0
5 | 1 | 1 | – | – | 0 1 0 0
6 | 1 | 2 | – | – | 0 1 0 0
7 | 1 | 3 | – | – | 0 0 2 0
The reward can be regarded as the feedback from the environment given an action. In terms of maximizing the SINR, the reward should ideally be a function of the SINR. Nevertheless, we only have the coupling coefficients as measurements, which suffer from noise and interference. We therefore define the reward function as follows. Let $P_t$ denote the received power of the coupling coefficient corresponding to the beam pair trained at time $t$. In terms of thresholds on the ratio of $P_t$ to $P_{t-1}$, the reward is defined by

$r_{t+1} = \begin{cases} 1, & P_t / P_{t-1} > \eta_U \\ 0, & \eta_L \le P_t / P_{t-1} \le \eta_U \\ -1, & P_t / P_{t-1} < \eta_L \end{cases}$  (15)

Due to the noise and interference, the observations $P_t$ and $P_{t-1}$ may be unreliable for determining the reward. To reduce the uncertainty, we define a lower threshold $\eta_L$ and an upper threshold $\eta_U$: if the ratio of $P_t$ to $P_{t-1}$ lies between $\eta_L$ and $\eta_U$, the measurement is treated as ambiguous, and the reward is set to zero. A more detailed discussion of the upper and lower thresholds is provided in Appendix A.
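The threshold-based reward can be written directly as a small function. The numeric threshold values below are illustrative placeholders; the paper discusses how to choose them in Appendix A.

```python
def beam_reward(p_now, p_prev, eta_low=0.8, eta_high=1.25):
    """Reward from the ratio of current to previous received power of the
    coupling coefficient, following the threshold rule of Eq. (15).
    The threshold values are illustrative assumptions."""
    ratio = p_now / p_prev
    if ratio > eta_high:   # clear power improvement
        return 1.0
    if ratio < eta_low:    # clear degradation
        return -1.0
    return 0.0             # ambiguous measurement: no reward
```

The dead zone between the two thresholds is what keeps the noisy power measurements from generating spurious positive or negative rewards.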
Example 1.
When starting from a state, one of the neighboring beam pairs is explored by choosing an action from the action set according to the state-action values. Since all the Q-values are initialized to zero, an action is selected randomly (or according to some predefined criteria). We assume that the action "up" is selected, so the next state is the beam pair above the current one. The corresponding reward and Q-value are updated accordingly; see Table I. In the example, we simply let the Q-values be updated by either 0 or 1, where a value of 1 implies that the agent chooses the action and gets a positive reward. In Q-learning, a sequence of time slots (also called steps) is defined as an episode. Each episode starts from a state, which can be predefined or determined by the received power. Fig. 4 shows that the initial beam search needs four episodes in total, with four different predefined starting states. In each episode, the beam probing takes a fixed number of time slots to update the Q-values. After finishing the first episode, the agent starts the next episode from the next predefined beam pair. With a sufficiently large number of significant Q-values, Q-learning converges to the beam pair corresponding to the maximum received power.∎
After the initial beam search, some beam pairs have been explored, and the beam tracking starts from the beam pair with the maximum received power during the initial beam search (i.e., the state or beam pair with respect to the maximum power).
According to the updated Q-values, an agent exploits what it has already experienced in order to obtain a positive reward, but it also has to explore the uncharted or changed environment to see whether it can make better action selections in the future. One of the challenges in reinforcement learning is the trade-off between exploration and exploitation. By introducing a parameter $\epsilon$, an $\epsilon$-greedy action is obtained to better balance exploration and exploitation:

$a_t = \begin{cases} \arg\max_{a} Q(s_t, a), & \text{with probability } 1-\epsilon \\ \text{an action chosen uniformly at random}, & \text{with probability } \epsilon \end{cases}$  (16)

That is, with probability $1-\epsilon$ the agent chooses the action that it believes yields the best long-term effect, and with probability $\epsilon$ it chooses an action uniformly at random. The pseudocode of the Q-learning-based beam tracking algorithm is shown in Algorithm 1, which includes two phases: the initial beam search and the beam tracking. The difference between the two phases is the decision of the starting state of each episode. During the initial beam search, the starting state is selected from the predefined sets, as in Example 1. During the beam tracking, the starting state is selected according to the maximum received power. Moreover, we assume that the analog beam pairs are determined at the UAV lead, and the time division duplex (TDD) technique, which separates the transmit and receive signals in the time domain, can be used to inform the followers to update their beams.
IV-B Overhead Reduction Using Offline Q-Learning
In Algorithm 1, the observations are available at each time slot. This implies that the beam switching and pilot transmission/reception are executed in every time slot, which is not well designed in the sense of system efficiency. To reduce the overhead, we retain all observations so that the Q-learning can be executed offline. When past observations are used to obtain the rewards and update the Q-values, we call the algorithm offline Q-learning; otherwise, it is called online Q-learning.
For the offline Q-learning, only the observations associated with large received power have to be updated regularly. Therefore, at the end of each episode, the beam pair with the maximum received power is chosen and employed at the beginning of the next episode in order to update the corresponding observations. For the other steps in an episode, the pilot transmission and beam switching are not necessary unless a specific state has not yet been explored.
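The offline variant can be sketched as a replay over stored transitions, so that Q-values are refreshed without new pilot transmissions. The record layout and the simple sign-based reward below are illustrative assumptions, not the paper's exact bookkeeping.

```python
def offline_q_updates(Q, history, actions, alpha=0.1, gamma=0.9):
    """Replay stored transitions to update the Q-table offline.
    Each record holds (state, action, next_state, current power,
    previous power); no beam switching or pilots are needed here."""
    for state, action, next_state, p_now, p_prev in history:
        reward = 1.0 if p_now > p_prev else (-1.0 if p_now < p_prev else 0.0)
        best_next = max(Q.get((next_state, a), 0.0) for a in actions)
        old = Q.get((state, action), 0.0)
        Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)

Q = {}
history = [((0, 0), "up", (0, 1), 2.0, 1.0)]   # one stored transition
offline_q_updates(Q, history, ["up", "down", "left", "right"])
```

Only the states associated with the strongest beams need fresh pilot observations; everything else in the replay uses measurements already stored.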
IV-C Beam Selection Using Q-Learning for Multiple Links
The idea of Q-learning-based beam tracking for one link can easily be extended to the case of multiple links, in a manner similar to multi-agent systems [37, 38]. For multi-UAV beam probing, the lead receives the observations from different followers simultaneously in an SDMA manner. In this case, the analog beams at the lead UAV's side must not be selected repeatedly. As a result, the action set in (16) has to be updated in real time.
In each beam probing step, whether in the stage of initial beam search or beam tracking, the Q-learning-based beam selection starts from the follower with the maximum received power at the moment. We further define a set that includes the actions that would make different followers go to the same state. Thus, the action selection given in (16) is reformulated by excluding these colliding actions:

(17)
After making the decision about the next state for a follower, the lead has to update this set accordingly.
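The multi-link selection rule amounts to $\epsilon$-greedy over a masked action set. A minimal sketch, assuming the lead maintains a set of currently colliding actions for the follower being served:

```python
import random

def masked_epsilon_greedy(Q, state, actions, forbidden, epsilon=0.1, rng=random):
    """Action selection in the spirit of Eq. (17): epsilon-greedy over the
    actions that do not steer a follower into a state already taken by
    another follower; `forbidden` collects those colliding actions."""
    allowed = [a for a in actions if a not in forbidden]
    if rng.random() < epsilon:
        return rng.choice(allowed)                             # explore
    return max(allowed, key=lambda a: Q.get((state, a), 0.0))  # exploit
```

After each follower's next state is fixed, the lead would rebuild `forbidden` for the remaining followers, which is the real-time action-set update described above.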
V Digital Beamforming
In the previous section, we used Q-learning to find the members of the candidate sets in problem (10). However, the selected beam pairs may not be the optimal solution to the problem, because (i) Q-learning usually provides only a good enough solution (Q-learning uses experience to solve a prediction problem, which can be viewed as a Monte Carlo method) and (ii) the digital beamformer weights are not taken into account during the analog beam selection. In the sense of hybrid beamforming, a better solution is the one whose linear combination with the digital weights leads to the maximum SINR. This issue can be addressed by keeping more than one promising member with large received power in the candidate sets [17]. We use Example 2 to explain the idea.
Example 2.
Two selected beam pairs with large received power for each follower are collected into per-follower candidate sets. Given these sets, we can generate all the members of the joint candidate sets on the lead and follower sides. In this example, the lead-side set has a cardinality of 3, because the analog beams at the lead UAV must not be selected repeatedly, and the follower-side set has a cardinality of 8. Thus, given these two joint sets, we have to evaluate a total of 24 combinations with their digital weights to maximize the SINR.∎
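The enumeration in Example 2 can be sketched with `itertools`: take the Cartesian product of the per-follower candidate beam pairs and discard any joint assignment that reuses a lead-side beam. The candidate indices below are made up for illustration and do not reproduce the cardinalities of Example 2.

```python
from itertools import product

def candidate_combinations(per_follower):
    """Enumerate joint assignments of (lead_beam, follower_beam) pairs,
    keeping only those in which all lead-side beams are distinct."""
    combos = []
    for choice in product(*per_follower):
        lead_beams = [lead for lead, _ in choice]
        if len(set(lead_beams)) == len(lead_beams):  # lead beams must differ
            combos.append(choice)
    return combos

# two candidate beam pairs kept for each of three followers (made-up indices)
cands = [[(1, 5), (2, 6)], [(1, 7), (3, 8)], [(2, 9), (4, 10)]]
combos = candidate_combinations(cands)
```

Each surviving assignment is one combination whose digital weights would then be computed and compared by SINR.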
The above-mentioned idea is different from the work presented in [25], which keeps candidates in a subspace. In our opinion, the better solution is to keep candidates with large received power, because the idea in [25] only takes the main lobes of the analog beams into account, while the proposed method considers both the main and side lobes.
V-A Digital Weight Optimization
To simplify the following description of digital beamforming, we assume that each of the two candidate sets includes only one member, i.e.,
(18) 
In the numerical results, we will provide more discussion of this idea. Given the selected analog beamforming matrices, the hybrid beamforming problem (10) becomes a digital beamforming problem subject to the power constraint, which can be formulated as
(19) 
where the signal and interference powers depend on the selected analog beams:
(20)  
(21) 
To satisfy the power constraint on the combiner, one can define unit vectors that obey the relation [17]
(22) 
Upon substituting these unit vectors into the problem, the received signal and interference powers can be written as
(23)  
(24) 
Then, we find that problem (19) is equivalent to seeking the unit vectors that maximize the SINR for the followers. As a result, the maximization problem (19) can be reformulated as
(25) 
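Per follower, the reformulated problem is a generalized Rayleigh quotient in the unit-norm weight vector, and a standard closed-form maximizer is proportional to the interference-plus-noise covariance inverse applied to the desired effective channel. The sketch below assumes the effective channel vectors are available (in practice they would be the estimates built from the coupling coefficients); it illustrates the structure of the solution rather than the paper's exact algorithm.

```python
import numpy as np

def sinr_maximizing_weight(h_k, H_interf, noise_var):
    """Unit-norm digital weight maximizing the per-follower SINR in the
    spirit of Eq. (25). h_k is follower k's effective channel; the columns
    of H_interf are the interfering effective channels. The maximizer of
    the generalized Rayleigh quotient is v ~ R^{-1} h_k with
    R = H_interf H_interf^H + noise_var * I."""
    n = h_k.shape[0]
    R = H_interf @ H_interf.conj().T + noise_var * np.eye(n)
    v = np.linalg.solve(R, h_k)
    return v / np.linalg.norm(v)   # satisfy the unit-power constraint

def sinr(v, h_k, H_interf, noise_var):
    desired = np.abs(v.conj() @ h_k) ** 2
    interf = np.sum(np.abs(v.conj() @ H_interf) ** 2)
    return desired / (interf + noise_var * np.linalg.norm(v) ** 2)

rng = np.random.default_rng(0)
h = rng.standard_normal(4) + 1j * rng.standard_normal(4)
Hi = rng.standard_normal((4, 2)) + 1j * rng.standard_normal((4, 2))
v_opt = sinr_maximizing_weight(h, Hi, 0.1)
```

This is the familiar MVDR-type structure: the weight whitens the interference-plus-noise before matching to the desired effective channel.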
V-B SINR Approximation Using Coupling Coefficients
In (23) and (24), the couplings of the channel and the analog beams can be viewed as effective channel vectors. Since the observations given in (13) are couplings of the channel with one analog beam pair, we can use them to construct estimates of the effective channel vectors, defined by
(26) 
and
(27) 
where the remaining entries can likewise be obtained from the observations. The collected observations suffice to generate the estimates of the signal and interference powers, represented by
(28) 
and
(29) 