
Deep Reinforcement Learning for Joint Beamwidth and Power Optimization in mmWave Systems

by   Jiabao Gao, et al.
Zhejiang University

This paper studies the joint beamwidth and transmit power optimization problem in millimeter wave communication systems. A deep reinforcement learning based approach is proposed. Specifically, a customized deep Q network is trained offline, which is able to make real-time decisions when deployed online. Simulation results show that the proposed approach significantly outperforms conventional approaches in terms of both performance and complexity. Besides, strong generalization ability to different system parameters is also demonstrated, which further enhances the practicality of the proposed approach.





I Introduction

Due to the abundant spectrum at the millimeter frequency band, millimeter wave (mmWave) communication [1] has been envisioned as a key enabling technology to overcome the spectrum shortage challenge in next generation communication systems. With high operating frequencies, mmWave communications suffer from severe path loss and sensitivity to blockage [2]. Fortunately, highly directional beams can be formed with a large number of antennas, which can effectively alleviate the above issues [3].

To fully unleash the potential of mmWave communications, fine alignment of the transmitting and receiving beams is of crucial importance, which gives rise to the intrinsic alignment gain and data transmission time trade-off. In addition, for the scenario with multiple transmitter-receiver pairs, interference management is essential to further improve the system performance, which requires sophisticated design of beamwidth and power control mechanism.

Thus far, several works have proposed different approaches to address the above issues. In [4], two suboptimal algorithms were proposed, based on underestimation and overestimation of the interference, respectively. The former neglects interference and decomposes the joint optimization problem into several single-pair subproblems, each of which is convex and can be easily solved. The latter overestimates interference and only activates a subset of pairs without severe mutual interference. In addition, [5] studied the transmit power control problem in uplink mmWave cellular networks, while a heuristic algorithm based on simulated annealing was proposed in [6] to jointly optimize the power and beamwidth. Due to the various simplifying assumptions adopted in the aforementioned works, the proposed approaches in general yield suboptimal solutions.

Responding to this, we aim to tackle the joint beamwidth and power control problem. Specifically, motivated by its great potential for handling complex non-convex problems in communications [7, 8, 9], we pursue an artificial intelligence (AI) based design. However, the popular supervised learning paradigm is not suitable here, mainly due to the prohibitive cost of labeling. Therefore, we propose to use deep reinforcement learning (DRL), a mechanism which naturally requires no labels. Recently, [10] also proposed to use a deep Q network (DQN) to address decision making problems in beam optimization. However, the current work differs from [10] in terms of both objective function and action design. Besides, [10] only considers beam selection, while this work jointly optimizes the transmit power and beamwidth to further enhance system performance. In particular, a customized DQN is designed to solve the decision making problem. We carefully preprocess the channel state information (CSI) and noise power density to formulate the state tuple, which enhances effective learning of the DRL framework. After offline training, the DQN is deployed online for real-time decisions on the beamwidths and transmit power. Extensive simulation results demonstrate the superior performance of the proposed method over conventional approaches.

The rest of this paper is organized as follows. Section II introduces the system model and formulates the optimization problem. Section III presents the detailed design of the proposed DRL based approach, and its superiority is validated by extensive simulation results in Section IV. Finally, the paper is concluded in Section V.

II System Model and Problem Formulation

We consider a mmWave system consisting of N transmitter-receiver pairs. Each time slot is divided into two phases, namely beam alignment and data transmission, as shown in Fig. 1. In the beam alignment phase^1, each pair decides the optimal transmitting and receiving beam directions that maximize the signal-to-noise ratio (SNR) within their sectors by searching over all possible combinations, as specified in IEEE 802.15.3c [4]. In the data transmission phase, data is transmitted and received using the selected beams.

^1 Beam alignment can be further divided into sector-level and beam-level alignment. In this paper, we assume that sector-level alignment has already been established, as in [11].

Fig. 1: Time slot segmentation of link i. T denotes the time slot duration and τ_i denotes the beam alignment time.

Denote ψ^s_{t,i} and ψ^s_{r,i} as the sector-level beamwidths, and ψ_{t,i} and ψ_{r,i} as the beam-level beamwidths, at the transmitter and receiver of link i, respectively. Then, the number of possible beam combinations is (ψ^s_{t,i}/ψ_{t,i})(ψ^s_{r,i}/ψ_{r,i}). Denote T_p as the pilot transmission time of each combination; the total alignment time is then

τ_i = (ψ^s_{t,i}/ψ_{t,i})(ψ^s_{r,i}/ψ_{r,i}) T_p.
We adopt the antenna model presented in [12], under which the transmission and reception gains of transmitter i and receiver i toward each other can be expressed as

g(θ, ψ) = (2π − (2π − ψ)z)/ψ if |θ| ≤ ψ/2, and z otherwise,

where θ ∈ {θ_{t,i}, θ_{r,i}} is the angle between the boresight of transmitter i and receiver i and the angle bisector of the transmitting (respectively, receiving) beam, ψ is the corresponding beamwidth, and z is the side lobe gain, as illustrated in Fig. 2.

Fig. 2: The black solid line denotes the boresight between transmitter i and receiver i, and the dotted lines denote the angle bisectors of the transmitting and receiving beams.

Denote h_{j,i} as the channel gain between transmitter j and receiver i, p_i as the transmit power of transmitter i, N_0 as the thermal noise power spectral density, and W as the system bandwidth. Then, the signal-to-interference-plus-noise ratio (SINR) of the i-th link can be given as

SINR_i = g_{t,i} g_{r,i} h_{i,i} p_i / (N_0 W + Σ_{j≠i} g_{j,i} h_{j,i} p_j),

where g_{j,i} denotes the combined transmission and reception gain from transmitter j toward receiver i.
Therefore, the joint beamwidth selection and power allocation optimization problem can be formulated as

(P1)  max over {ψ_{t,i}, ψ_{r,i}, p_i}:  Σ_{i=1}^{N} (1 − τ_i/T) W log₂(1 + SINR_i)
      s.t.  ψ_{t,i} < ψ^s_{t,i},  ψ_{r,i} < ψ^s_{r,i},  τ_i ≤ T,  p_i ≤ p_max,  ∀i,

where the first two constraints ensure that the beam-level beamwidths are strictly smaller than the sector-level ones, the third constraint means that the beam alignment time cannot exceed the entire time slot duration, and the fourth constraint specifies the maximal transmit power limit p_max. Since this problem is non-convex, it is challenging to solve with conventional optimization approaches.
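To make the objective concrete, the effective sum rate can be sketched in Python under the sectored antenna model, assuming equal transmit and receive beamwidths per link and side-lobe-only cross-link gains; the function and variable names here are illustrative, not from the paper:

```python
import math

def effective_sum_rate(h, p, psi, psi_s, z, T, T_p, N0, W):
    """Effective sum rate of N links (a sketch, not the paper's exact code).

    h      : N x N list, h[j][i] = channel gain from transmitter j to receiver i
    p      : transmit powers; psi, psi_s: beam- and sector-level beamwidths
    z      : side lobe gain; T, T_p: slot duration and per-combination pilot time
    N0, W  : noise power spectral density and system bandwidth
    """
    N = len(p)
    rate = 0.0
    for i in range(N):
        tau_i = (psi_s[i] / psi[i]) ** 2 * T_p                      # alignment time
        g_i = (2 * math.pi - (2 * math.pi - psi[i]) * z) / psi[i]   # main-lobe gain
        signal = g_i ** 2 * h[i][i] * p[i]
        # Assume interfering links reach receiver i through side lobes only.
        interference = sum(z ** 2 * h[j][i] * p[j] for j in range(N) if j != i)
        sinr = signal / (N0 * W + interference)
        rate += (1.0 - tau_i / T) * W * math.log2(1.0 + sinr)
    return rate
```

The (1 − τ_i/T) factor is what encodes the alignment-gain versus transmission-time trade-off: narrowing the beam raises the gain but leaves less of the slot for data.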

III Deep Reinforcement Learning Based Approach

In this section, we propose a DQN based approach to solve the above problem, starting with a brief introduction of the DQN algorithm.

A Brief introduction to DQN

Fig. 3 illustrates the typical agent-environment interaction in a Markov decision process (MDP). At time step t, the agent observes the current state s_t ∈ S and takes an action a_t ∈ A to interact with the environment, where S and A are the sets of states and actions, respectively. One time step later, as the consequence of its action, the agent receives a reward r_{t+1} and moves into a new state s_{t+1}. The goal of reinforcement learning (RL) is to maximize the long-term reward [13]. Specifically, it aims to learn the policy π that yields the maximal cumulative discounted reward

G_t = Σ_{k=0}^{∞} γ^k r_{t+k+1},

where γ ∈ [0, 1] is the discount rate, which discounts the rewards of later time slots.

Fig. 3: The agent-environment interaction in MDP.

Q learning, one of the most popular RL algorithms, maintains a Q table recording the Q values of all (state, action) pairs. Under policy π, the Q function for taking action a in state s is defined as

Q^π(s, a) = E_π[G_t | s_t = s, a_t = a],

where the expectation is taken with respect to the environment and the policy.

Through learning from trajectories, the Q table is updated to approach the true table under the optimal policy π*, which is achieved when the following Bellman optimality equation is satisfied [13]:

Q*(s, a) = E[r_{t+1} + γ max_{a'} Q*(s_{t+1}, a') | s_t = s, a_t = a].
To further tackle the curse of dimensionality and handle continuous state variables, DQN [14] was proposed to directly predict the Q value of any (state, action) pair with a deep neural network. The policy is then parameterized by the weights θ of the Q network, which can be updated as^2

θ ← θ + α (q* − Q(s, a; θ)) ∇_θ Q(s, a; θ),

where q* is the target Q value, α is the learning rate, and ∇_θ denotes the gradient with respect to θ.

^2 As will be explained later, the discount rate γ is set to 0 here, so some critical components of DQN, such as the target network, are not involved.
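This update rule can be sketched with a linear function approximator standing in for the deep network (for a linear Q, the gradient with respect to the weights is just the feature vector); this is an illustrative simplification, not the paper's architecture:

```python
def q_linear(theta, phi):
    # Q(s, a; theta) = theta . phi(s, a) for a linear approximator.
    return sum(t * f for t, f in zip(theta, phi))

def q_update(theta, phi, q_target, alpha):
    # theta <- theta + alpha * (q* - Q(s, a; theta)) * grad_theta Q,
    # where grad_theta Q = phi for the linear case.
    err = q_target - q_linear(theta, phi)
    return [t + alpha * err * f for t, f in zip(theta, phi)]
```

With the paper's discount rate of 0, the target q* is simply the immediate reward, which is why no target network is needed.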

B Customized DQN design

In this subsection, we present the details about the customized DQN design for P1, including state, action, reward, network architecture and training strategy.

B1 State

As in [4], we assume that perfect CSI is available. The state is the observation that the agent obtains from the environment. In the current problem, the observation of a particular link i contains the channel gain of the link, the interfering channel gains from this link's transmitter to other links' receivers (ITO), the interfering channel gains from other links' transmitters to this link's receiver (IFO), and the noise power. Intuitively, one simple option is to directly use the raw data as the state. However, it turns out that the resulting performance is not satisfactory, which implies that proper preprocessing is of critical importance. Motivated by this, we first normalize all elements by the channel gain of the link itself to better expose the relative interference and noise levels. Then, we express the values in dB to further reduce the variance of the input and facilitate efficient training. Therefore, the state tuple of link i at a certain time slot can be expressed as

s_i = [{ĥ_{i,j}}_{j≠i}, {ĥ_{j,i}}_{j≠i}, n̂_i],

where ĥ_{i,j} = 10 log₁₀(h_{i,j}/h_{i,i}) are the normalized ITO terms, ĥ_{j,i} = 10 log₁₀(h_{j,i}/h_{i,i}) are the normalized IFO terms, and n̂_i is the normalized noise power in dB.
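The preprocessing step can be sketched as follows, with h[i][j] denoting the gain from transmitter i to receiver j (function and variable names are illustrative):

```python
import math

def build_state(i, h, noise_power):
    """Normalize interference and noise by the link's own channel gain,
    then convert to dB, as described above."""
    own = h[i][i]
    n = len(h)
    ito = [10 * math.log10(h[i][j] / own) for j in range(n) if j != i]  # this Tx -> other Rx
    ifo = [10 * math.log10(h[j][i] / own) for j in range(n) if j != i]  # other Tx -> this Rx
    noise = 10 * math.log10(noise_power / own)
    return ito + ifo + [noise]
```

Dividing by the link's own gain means the state directly encodes relative interference-to-signal and noise-to-signal levels, which is scale-invariant across links.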

B2 Action

To achieve the optimal performance, the actions of all links should be jointly optimized. In that case, however, the number of joint actions grows exponentially with the number of links, making effective learning impossible. Therefore, to avoid this problem, we let the DQN make a decision for every link separately. For each link, the action is the combination of the beamwidths of the transmitter and receiver and the transmit power. Assuming the transmitter and receiver use the same beamwidth as in [4], i.e., ψ_{t,i} = ψ_{r,i} = ψ_i, the action of link i at a certain time step can be expressed as a_i = (ψ_i, p_i).
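With discretized beamwidth and power levels, the per-link action set is simply their Cartesian product; a sketch with illustrative names:

```python
from itertools import product

def build_action_set(beamwidths, powers):
    # One action per (beamwidth, power) combination; the same beamwidth
    # is used at the transmitter and the receiver, as assumed above.
    return list(product(beamwidths, powers))
```

With 8 levels each (the setting used in Section IV), this yields 64 actions per link, so the Q network needs 64 output neurons.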

B3 Reward

Since the objective is to maximize the instantaneous effective sum rate, the discount rate γ should be set to 0. The cumulative discounted reward then reduces to the immediate reward r_{t+1} = R(t), where R(t) is the effective sum rate at time slot t.

B4 Q Network architecture

The neural network architecture for Q value estimation is designed by cross validation. In our experiments, a fully-connected network consisting of two hidden layers with 128 and 64 neurons and an output layer with as many neurons as the total number of actions works well.

B5 DQN training

The initial learning rate is 0.001, the batch size is 256, and the weights are updated by the Adam optimizer. First, we generate data for 2000 episodes to fill the replay buffer. Then, in order to balance exploration and exploitation during training, an ε-greedy policy is adopted [13]. Specifically, the DQN randomly chooses an action from the action set with a small probability ε, rather than always choosing the action with the maximal Q value. The initial ε is set to 0.2, and it gradually decreases to 0 over 100000 episodes. After that, we continue to train with ε = 0 for 10000 episodes. Notice that, in practice, a large amount of training data can be generated automatically based on the model. However, a mismatch may exist between the assumed channel model and the actual propagation environment. In such cases, online fine-tuning can be adopted to compensate for the model mismatch by continuing training with real environmental data.

IV Simulation Results

In this section, extensive simulation results are provided to demonstrate the performance of the proposed DRL based approach. The random selection algorithm and the underestimation of interference algorithm [4] are used as benchmarks for performance comparison. We assume that all the transmitter-receiver pairs are distributed randomly in a square area with side length D m. As in [2], the following mmWave channel pathloss model is used:

PL(d) = 1(B = 1) PL_LoS(d) + 1(B = 0) PL_NLoS(d),

where PL_LoS(d) and PL_NLoS(d) account for the line-of-sight (LoS) and non-line-of-sight (NLoS) loss, respectively. Also, 1(·) is the indicator function, which returns 1 when its argument holds and 0 otherwise, B is a Boolean random variable that equals 1 with the LoS probability, and d denotes the distance between the transmitter and the receiver. The following set of parameters is used in the simulation, as in [2].

Parameter name Value
Carrier frequency 28 GHz
System bandwidth 1 GHz
Reference distance 5 m
LoS path loss ,
NLoS path loss ,
Blockage model
NLoS small-scale fading Nakagami fading of parameter 3
Noise power density
Sector-level beamwidth for all
Side lobe gain
Pilot transmission time
TABLE I: Simulation parameters
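The blockage-based pathloss model can be sketched as below. The intercepts and exponents here are placeholder 28 GHz-style values, since not all entries of Table I are recoverable from the text; they are assumptions, not the paper's exact settings:

```python
import math
import random

def pathloss_db(d, p_los, a_los=61.4, n_los=2.0, a_nlos=72.0, n_nlos=2.92):
    """Pathloss in dB at distance d (m). With probability p_los the link
    is LoS, otherwise NLoS (the Boolean blockage variable); the numeric
    defaults are assumed placeholder values, not Table I's entries."""
    is_los = random.random() < p_los
    a, n = (a_los, n_los) if is_los else (a_nlos, n_nlos)
    return a + 10.0 * n * math.log10(d)   # intercept + 10 * exponent * log10(d)
```

In a full simulation, the LoS probability p_los would itself be a decreasing function of d per the blockage model in [2].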

Next, we investigate the impact of several key factors and parameters, including action discretization, network density and area size. All the tables and curves about testing performance are obtained by averaging over 500 independent experiments.

A Impact of action discretization

For DQN training, the actions are discretized according to how they affect the system performance. From the alignment time, gain, and SINR expressions in Section II, the SINR is approximately linear in the transmit power and in the reciprocal of the square of the beamwidth, since both the transmission and reception gains scale inversely with the beamwidth. Therefore, we propose to uniformly discretize the transmit power in [p_min, p_max] and the reciprocal of the square of the beamwidth in [1/ψ_max², 1/ψ_min²], where p_min and p_max are the minimal and maximal transmit power, and ψ_min and ψ_max are the minimal and maximal beamwidth. We use only 8 values for both the transmit power and beamwidth sets to balance training complexity and performance. To illustrate the benefit of this choice, a heuristic uniform approach is used as a benchmark, where both the transmit power and the beamwidth itself are uniformly discretized. As can be observed in Fig. 4, the proposed scheme achieves better performance than the uniform scheme.
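The proposed discretization can be sketched as follows: a uniform grid over the power range and over 1/ψ², with the latter mapped back to beamwidths (a sketch with illustrative names):

```python
def discretize_actions(p_min, p_max, psi_min, psi_max, levels=8):
    """Uniform grid over transmit power and over 1/psi^2 (not over psi
    itself), following the SINR-linearity argument above."""
    powers = [p_min + k * (p_max - p_min) / (levels - 1) for k in range(levels)]
    lo, hi = 1.0 / psi_max ** 2, 1.0 / psi_min ** 2
    # Uniform in 1/psi^2, then mapped back to a beamwidth value.
    beamwidths = [(lo + k * (hi - lo) / (levels - 1)) ** -0.5 for k in range(levels)]
    return powers, beamwidths
```

Note that the resulting beamwidth grid is denser near ψ_min, where the SINR is most sensitive, which is exactly what the uniform-in-1/ψ² design intends.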

Fig. 4: Training histories with different discretization patterns. The vertical coordinate is the mean reward from the first episode to the current episode, which smooths the curve and highlights the growth trend.

To demonstrate the superiority of the proposed approach, it is necessary to see the performance gap when compared with the performance of exhaustive search (ES) scheme. Due to the complexity of the ES scheme, it is only feasible to make the comparison in relatively small scale systems. As shown in Table II, the performance of the DQN approach is close to that of the ES scheme, and is far superior to that of the random scheme.

N   ES throughput (Gbit/slot)   DQN (% of ES)   Random (% of ES)
2   19.58                       98.72           40.42
3   28.74                       97.56           38.65
4   35.18                       96.09           39.35
5   45.28                       95.02           39.80
TABLE II: Network throughput performance of different approaches. The transmit power and beamwidth are discretized in the same way as above, with 4 optional values each.

B Impact of network density

Fig. 5 illustrates the impact of network density on the network throughput for different numbers of links N. It can be observed that the throughput increases with N, while the slope gradually decreases due to accumulated mutual interference. Also, the proposed DQN based approach consistently outperforms the underestimation of interference baseline [4], and the performance gap becomes larger as N increases. The reason is that the proposed DQN method takes the interference into account during the design, and hence achieves superior performance in crowded scenarios with severe interference.

Fig. 5: The impact of network density on throughput.

In practice, the network density may change over time. Therefore, it is of paramount importance to investigate the generalization ability of the proposed approach, which is also a key challenge when implementing AI based wireless communication systems [15]. Given a network well trained for N links, if the actual link number is smaller than N, we simply treat the extra links as virtual links and pad each real link's state tuple with zeros. On the other hand, if the actual link number is greater than N, we sort the interfering channel gains of each link and only keep the largest ones. As illustrated in Fig. 5, where the red curve corresponds to the case with customized training for each N while the yellow curve is trained for a single fixed N, the two curves almost overlap across different N, which demonstrates the superior generalization capability of the proposed DQN.
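The padding/truncation trick can be sketched as follows (the zero-padding convention follows the text; the function names are ours):

```python
def adapt_state(ito, ifo, noise_db, n_trained):
    """Fit a link's state to a DQN trained for n_trained links: pad the
    interference lists with zeros for virtual links, or keep only the
    strongest interferers when there are too many."""
    k = n_trained - 1                            # interference entries per list
    def fit(x):
        if len(x) < k:
            return x + [0.0] * (k - len(x))      # pad with virtual links
        return sorted(x, reverse=True)[:k]       # keep strongest interferers
    return fit(ito) + fit(ifo) + [noise_db]
```

This keeps the input dimension of the trained Q network fixed regardless of the actual number of links, which is what enables reuse without retraining.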

C Impact of area size

Fig. 6 illustrates the impact of the area side length D on the network throughput for a fixed number of links. As can be readily observed, the throughput decreases with D, which is straightforward due to the increased pathloss. Also, the proposed DQN based approach consistently outperforms the underestimation of interference baseline under various D, while the performance gap decreases for larger D since the impact of interference becomes insignificant. The generalization ability with respect to D is also investigated. As we can see, the network trained with a fixed D works well over a wide range of D with very little performance deterioration.

Fig. 6: The impact of area side length on throughput.

D Complexity

The typical offline training time of the proposed approach on a GeForce GTX 1080 Ti GPU is about one hour. However, the online testing phase only needs to execute a simple forward computation, which is much faster than the gradient descent process involved in the underestimation of interference baseline approach. For the two largest system sizes considered, the average running time of the baseline approach is 207.40 and 513.04 ms, while the DQN based approach takes 0.98 and 0.99 ms, respectively, i.e., hundreds of times faster.

V Conclusion and Future Work

In this paper, we have proposed a DRL based approach to solve the joint beamwidth and power allocation problem in mmWave communication systems. A customized DQN is designed, and heuristic tricks are used to tackle the generalization issue. Simulation results show that the proposed approach significantly outperforms conventional suboptimal approaches in terms of both performance and complexity. Besides, the proposed DQN has strong generalization ability, which makes it highly desirable for practical implementation. In the future, we will consider advanced DRL algorithms such as deep deterministic policy gradient [16] to optimize in the continuous action domain.


  • [1] A. M. Al-samman, M. H. Azmi, and T. A. Rahman, “A survey of millimeter wave (mm-Wave) communications for 5G: Channel measurement below and above 6 GHz,” in Proc. Int. Conf. Reliable Inf. Commun. Technol. (IRICT), Berlin, Germany: Springer, 2018, pp. 451-463.
  • [2] T. Bai, V. Desai, and R. W. Heath, “Millimeter wave cellular channel models for system evaluation,” in Proc. IEEE Int. Conf. Netw. Commun. (ICNC), Honolulu, United States, Feb. 2014, pp. 178-182.
  • [3] S. Singh, R. Mudumbai, and U. Madhow, “Interference analysis for highly directional 60-GHz mesh networks: The case for rethinking medium access control,” IEEE/ACM Trans. Netw., vol. 19, no. 5, pp. 1513-1527, Oct. 2011.
  • [4] H. Shokri-Ghadikolaei, L. Gkatzikis, and C. Fischione, “Beam-searching and transmission scheduling in millimeter wave communications,” in Proc. IEEE Int. Conf. Commun. (ICC), London, United Kingdom, Jun. 2015, pp. 1292-1297.
  • [5] O. Onireti, A. Imran, and M. A. Imran, “Coverage, capacity, and energy efficiency analysis in the uplink of mmWave cellular networks,” IEEE Trans. Veh. Technol., vol. 67, no. 5, pp. 3982-3997, May 2018.
  • [6] R. Ismayilov, B. Holfeld, R. L. G. Cavalcante, and M. Kaneko, “Power and beam optimization for uplink millimeter-wave hotspot communication systems,” in Proc. IEEE Wireless Commun. Netw. Conf. (WCNC), Marrakesh, Morocco, Apr. 2019, pp. 1-8.
  • [7] F. Meng, P. Chen, and L. Wu, “Power allocation in multi-user cellular networks with deep Q learning approach,” in Proc. IEEE Int. Conf. Commun. (ICC), Shanghai, China, May 2019, pp.1-6.
  • [8] J. Gao, C. Zhong, X. Chen, H. Lin, and Z. Zhang, “Unsupervised learning for passive beamforming,” IEEE Commun. Lett., vol. 24, no. 5, pp. 1052-1056, May 2020.
  • [9] Y. Yang, F. Gao, C. Qian, and G. Liao, “Model-aided deep neural network for source number detection,” IEEE Signal Process. Lett., vol. 27, no. 12, pp. 91-95, Dec. 2019.
  • [10] R. Shafin, et al., “Self-tuning sectorization: Deep reinforcement learning meets broadcast beam optimization,” IEEE Trans. Wireless Commun., early access.
  • [11] S. Singh, et al., “Blockage and directivity in 60 GHz wireless personal area networks: from cross-layer model to multihop MAC design,” IEEE J. Sel. Areas Commun., vol. 27, no. 8, pp. 1400-1413, Oct. 2009.
  • [12] H. Shokri-Ghadikolaei, C. Fischione, G. Fodor, P. Popovski, and M. Zorzi, “Millimeter wave cellular networks: A MAC layer perspective,” IEEE Trans. Commun., vol. 63, no. 10, pp. 3437-3458, Oct. 2015.
  • [13] R. Sutton and R. Barto, Reinforcement learning: An introduction. Cambridge, United Kingdom: MIT press, 2018.
  • [14] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, pp. 529-533, Feb. 2015.
  • [15] R. Shafin, et al., “Artificial intelligence-enabled cellular networks: A critical path to beyond-5G and 6G,” IEEE Wireless Commun., vol. 27, no. 2, pp. 212-217, Apr. 2020.
  • [16] T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” 2019, [Online]. Available: