I Introduction
Owing to the abundant spectrum available in the millimeter frequency band, millimeter wave (mmWave) communication [1] has been envisioned as a key enabling technology for overcoming the spectrum shortage in next-generation communication systems. Because of the high operating frequency, mmWave communications suffer from severe path loss and sensitivity to blockage [2]. Fortunately, highly directional beams can be formed with a large number of antennas, which effectively alleviates these issues [3].
To fully unleash the potential of mmWave communications, fine alignment of the transmitting and receiving beams is crucial, which gives rise to an intrinsic tradeoff between alignment gain and data transmission time. In addition, in scenarios with multiple transmitter-receiver pairs, interference management is essential to further improve system performance, which requires careful design of the beamwidth and power control mechanism.
Thus far, several works have proposed different approaches to address the above issues. In [4], two suboptimal algorithms were proposed, based on underestimation and overestimation of the interference, respectively. The former neglects interference and decomposes the joint optimization problem into several single-pair subproblems, which are convex and can be easily solved. The latter overestimates interference and activates only a subset of pairs without severe mutual interference. In addition, [5] studied the transmit power control problem in uplink mmWave cellular networks, while a heuristic algorithm based on simulated annealing was proposed in [6] to jointly optimize the power and beamwidth. Because of the various simplifying assumptions adopted in these works, the proposed approaches in general yield suboptimal solutions.
In response, we aim to tackle the joint beamwidth and power control problem. Specifically, motivated by the great potential of artificial intelligence (AI) for handling complex nonconvex problems in communications [7, 8, 9], we pursue an AI-based design here. However, the most popular supervised learning paradigm is not suitable, mainly because of the prohibitive cost of labeling. Therefore, we propose to use deep reinforcement learning (DRL), which inherently does not require labels. Recently, [10] also proposed to use a deep Q network (DQN) to address decision-making problems in beam optimization. However, the current work differs from [10] in both the objective function and the action design. Moreover, [10] only considers beam selection, while this work jointly optimizes the transmit power and beamwidth to further enhance system performance. In particular, a customized DQN is designed to solve the decision-making problem. We carefully preprocess the channel state information (CSI) and noise power density to form the state tuple, which facilitates effective learning in the DRL framework. After offline training, the DQN is deployed online for real-time decisions on the beamwidths and transmit power. Extensive simulation results demonstrate the superior performance of the proposed method over conventional approaches.
The rest of this paper is organized as follows. Section II introduces the system model and formulates the optimization problem. Section III presents the detailed design of the proposed DRL-based approach, and its superiority is validated by extensive simulation results in Section IV. Finally, the paper is concluded in Section V.
II System Model and Problem Formulation
We consider a mmWave system consisting of transmitter-receiver pairs. Each time slot is divided into two phases, namely beam alignment and data transmission, as shown in Fig. 1. In the beam alignment phase¹, each pair determines the transmitting and receiving beam directions that maximize the signal-to-noise ratio (SNR) within their sectors by searching over all possible combinations, as specified in IEEE 802.15.3c [4]. In the data transmission phase, data is transmitted and received using the selected beams.
¹ Beam alignment can be further divided into sector-level and beam-level alignment. In this paper, we assume that sector-level alignment has already been established as in [11].
Denote and as sector-level beamwidths, and as beam-level beamwidths at the transmitter and receiver of link respectively. Then, the number of possible combinations is . Denote as the pilot transmission time of each combination; then the total alignment time is
(1) 
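The alignment-time computation described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact Eq. (1): it assumes that the number of beam combinations is the product of the rounded-up ratios of sector-level to beam-level beamwidth at the transmitter and at the receiver, and the function and parameter names are hypothetical.

```python
import math


def alignment_time(sector_tx, sector_rx, beam_tx, beam_rx, pilot_time):
    """Total beam-alignment time of one link under exhaustive beam search.

    Assumption (not taken from the text): the number of transmit/receive
    beam combinations is ceil(sector_tx / beam_tx) * ceil(sector_rx / beam_rx).
    Beamwidths may use any consistent angular unit; pilot_time is in seconds.
    """
    if not (0 < beam_tx <= sector_tx and 0 < beam_rx <= sector_rx):
        raise ValueError("beam-level beamwidths must be positive and within the sector")
    n_combinations = math.ceil(sector_tx / beam_tx) * math.ceil(sector_rx / beam_rx)
    return n_combinations * pilot_time


# Narrower beams need more pilots: 9 x 3 = 27 combinations in this example.
example = alignment_time(90, 90, 10, 30, pilot_time=1e-6)
```

Under this assumption, halving both beamwidths roughly quadruples the alignment time, which is the source of the tradeoff between alignment gain and data transmission time.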
We adopt the antenna model presented in [12]. The transmission and reception gains at transmitter and receiver toward each other can then be expressed as
(2) 
(3) 
where and are the angles between the boresight of transmitter and receiver and the angle bisectors of transmitting and receiving beams, and is the side lobe gain, as illustrated in Fig. 2.
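For intuition, the sketch below implements the ideal sectored antenna pattern that is commonly used in the mmWave literature: a constant main-lobe gain inside the beam and a constant side lobe gain outside it, normalized so that the total radiated power is preserved. Whether Eqs. (2) and (3) take exactly this form is an assumption, and the names `sector_gain`, `beamwidth`, `offset` and `side_lobe_gain` are hypothetical.

```python
import math


def sector_gain(beamwidth, offset, side_lobe_gain):
    """Gain of an ideal sectored antenna (angles in radians).

    Assumed model: the gain is constant inside the main lobe and equals
    side_lobe_gain elsewhere, with the main-lobe level chosen so that
    beamwidth * main_lobe + (2*pi - beamwidth) * side_lobe_gain == 2*pi.
    """
    if not 0 < beamwidth <= 2 * math.pi:
        raise ValueError("beamwidth must lie in (0, 2*pi]")
    main_lobe = (2 * math.pi - (2 * math.pi - beamwidth) * side_lobe_gain) / beamwidth
    # 'offset' is the angle between the boresight and the direction of interest.
    return main_lobe if abs(offset) <= beamwidth / 2 else side_lobe_gain


aligned = sector_gain(math.pi / 6, 0.0, 0.1)             # inside the main lobe
misaligned = sector_gain(math.pi / 6, math.pi / 2, 0.1)  # side lobe only
```

In this model the main-lobe gain grows as the beamwidth shrinks, so a narrower beam buys a higher gain at the price of a longer alignment phase.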
Denote as the channel gain between transmitter and receiver , as the transmission power of transmitter , as the thermal noise power spectral density and as the system bandwidth. Then, the signal-to-interference-plus-noise ratio (SINR) of the th link can be given as
(4) 
Therefore, the joint beamwidth selection and power allocation optimization problem can be formulated as
(5)  
(6) 
where the first two constraints ensure that the beam-level beamwidths are strictly smaller than the sector-level beamwidths, the third constraint ensures that the beam alignment time cannot exceed the time slot duration, and the fourth constraint specifies the maximal transmission power limit . Since this problem is nonconvex, it is challenging to solve with conventional optimization approaches.
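To make the objective concrete, the following sketch evaluates an effective sum rate from per-link powers, end-to-end gains and alignment times. It assumes that each link's rate is the Shannon rate scaled by the fraction of the slot left for data transmission; the exact objective in (5) may differ, and all names are hypothetical.

```python
import math


def effective_sum_rate(power, gain, noise_psd, bandwidth, align_time, slot_time):
    """Effective sum rate (bit/s) of all links, under assumed definitions.

    gain[k][i] is the end-to-end gain (antenna gains times channel gain)
    from transmitter k to receiver i, so gain[i][i] is the desired link.
    Assumption: rate_i = (1 - align_time[i] / slot_time) * W * log2(1 + SINR_i).
    """
    n_links = len(power)
    noise_power = noise_psd * bandwidth
    total = 0.0
    for i in range(n_links):
        signal = power[i] * gain[i][i]
        interference = sum(power[k] * gain[k][i] for k in range(n_links) if k != i)
        sinr = signal / (interference + noise_power)
        # Fraction of the slot that remains for data after beam alignment.
        data_fraction = max(0.0, 1.0 - align_time[i] / slot_time)
        total += data_fraction * bandwidth * math.log2(1.0 + sinr)
    return total


# Two isolated links with unit SINR and no alignment overhead: 1 bit/s each.
rate = effective_sum_rate([1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]],
                          noise_psd=1.0, bandwidth=1.0,
                          align_time=[0.0, 0.0], slot_time=1.0)
```

The coupling through the interference term is what makes the problem nonconvex: changing one link's power or beamwidth changes every other link's SINR.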
III Deep Reinforcement Learning Based Approach
In this section, we propose a DQN-based approach to solve the above problem, starting with a brief introduction to the DQN algorithm.
A Brief introduction to DQN
Fig. 3 illustrates the typical agent-environment interaction in a Markov decision process (MDP). At time step , the agent takes an action by observing the current state to interact with the environment, where and are the sets of states and actions, respectively. One time step later, as a consequence of its action, the agent receives a reward and moves into a new state . The goal of reinforcement learning (RL) is to maximize the long-term reward [13]. Specifically, it aims to learn the policy that maximizes the cumulative discounted reward function, given by
(7) 
where is called the discount rate, which discounts the rewards of later time slots.
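As a small worked example of the cumulative discounted reward, the sketch below sums a finite reward sequence with a given discount rate. The function name and the finite-horizon truncation are illustrative assumptions.

```python
def discounted_return(rewards, discount):
    """Sum of rewards[k] * discount**k over a finite reward sequence."""
    total = 0.0
    # Accumulate from the last reward backwards: G_k = r_k + discount * G_{k+1}.
    for reward in reversed(rewards):
        total = reward + discount * total
    return total


# With discount 0.5: 1 + 0.5 * 2 + 0.25 * 3 = 2.75.
g_half = discounted_return([1.0, 2.0, 3.0], 0.5)
# With discount 0 only the immediate reward counts.
g_zero = discounted_return([1.0, 2.0, 3.0], 0.0)
```

A discount rate of 0 makes the agent purely myopic, which is the setting adopted later for the reward design.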
Q-learning, one of the most popular RL algorithms, maintains a Q table that records the Q values of all (state, action) pairs. Under policy , the Q function of the agent with action in state is defined as
(8) 
where the expectation is taken with respect to the environment and policy.
Through learning from trajectories, the Q table is updated to approach the true Q table under the optimal policy , which is achieved when the following Bellman optimality equation is satisfied [13]
(9) 
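The table update that drives the Q values toward the Bellman optimality condition can be sketched as follows. This is the textbook tabular Q-learning step from [13] written with hypothetical names; it is shown for background and is not the customized DQN of this paper.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state, actions, lr, discount):
    """One tabular Q-learning step; returns the updated Q value."""
    # Bootstrapped target: immediate reward plus discounted best next value.
    best_next = max(q_table[(next_state, a)] for a in actions)
    td_target = reward + discount * best_next
    # Move the current estimate a fraction lr toward the target.
    q_table[(state, action)] += lr * (td_target - q_table[(state, action)])
    return q_table[(state, action)]


q_table = defaultdict(float)   # unseen (state, action) pairs start at 0
q_table[("s1", "a")] = 2.0
new_q = q_learning_update(q_table, "s0", "a", reward=1.0, next_state="s1",
                          actions=["a", "b"], lr=0.5, discount=0.9)
```

At a fixed point of this update the temporal-difference error vanishes, which is exactly the Bellman optimality condition.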
To further tackle the curse of dimensionality with continuous state variables, DQN [14] was proposed to directly predict the Q value of any (state, action) pair with a deep neural network. The policy is then parameterized by the weights of the Q network , which can be updated as²
(10) 
where is the optimal Q value, is the learning rate, and denotes the derivative operator.
² As will be explained later, the discount rate is set to 0, so some critical parts of DQN, such as the target network, are not involved.
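To see why a zero discount rate removes the need for a target network, the standard DQN regression step can be written out explicitly. The notation below ($\theta$ for the network weights, $\alpha$ for the learning rate, $\gamma$ for the discount rate, $y_t$ for the regression target) is assumed for illustration and may not match the symbols of Eq. (10).

```latex
% Assumed notation: \theta (weights), \alpha (learning rate),
% \gamma (discount rate), y_t (regression target).
\begin{align}
  y_t &= r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'; \theta), \\
  \theta &\leftarrow \theta + \alpha \bigl( y_t - Q(s_t, a_t; \theta) \bigr)
            \nabla_{\theta} Q(s_t, a_t; \theta), \\
  \gamma = 0 \;&\Longrightarrow\; y_t = r_{t+1}.
\end{align}
```

With $\gamma = 0$ the target no longer depends on the network's own predictions, so training reduces to regressing the Q value onto the observed immediate reward.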
B Customized DQN design
In this subsection, we present the details about the customized DQN design for P1, including state, action, reward, network architecture and training strategy.
B1 State
As in [4], we assume that perfect CSI is available. The state is the observation that the agent obtains from the environment. In the current problem, the observation of a particular link contains the channel gain of the link, the interfering channel gains from this link’s transmitter to other links’ receivers (ITO), the interfering channel gains from other links’ transmitters to this link’s receiver (IFO), and the noise power. Intuitively, one simple option is to use the raw data directly as the state. However, the resulting performance turns out to be unsatisfactory, which implies that proper preprocessing is critical. Motivated by this, we first normalize all elements by the channel gain of the link to better expose the relative interference and noise levels. Then, we express the elements in dB to further reduce the variance of the input and facilitate efficient training. Therefore, the state tuple of link at a certain time slot contains elements in total, which can be expressed as
(11) 
where , and .
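The two preprocessing steps (normalization by the link's own channel gain, then conversion to dB) can be sketched as below. The ordering of the elements and the choice to leave out the direct gain itself, which would always be 0 dB after normalization, are assumptions made for illustration; the names are hypothetical.

```python
import math


def build_state(direct_gain, ito_gains, ifo_gains, noise_power):
    """State tuple of one link: relative interference and noise levels in dB.

    direct_gain : channel gain of the link itself (linear scale, > 0)
    ito_gains   : gains from this link's transmitter to other receivers
    ifo_gains   : gains from other transmitters to this link's receiver
    noise_power : noise power (linear scale, > 0)
    All inputs must be strictly positive because a logarithm is taken.
    """
    def relative_db(value):
        # Normalize by the direct gain, then express the ratio in dB.
        return 10.0 * math.log10(value / direct_gain)

    return ([relative_db(g) for g in ito_gains]
            + [relative_db(g) for g in ifo_gains]
            + [relative_db(noise_power)])


state = build_state(2.0, ito_gains=[0.2], ifo_gains=[0.02], noise_power=0.002)
```

Working with ratios in dB compresses values that span many orders of magnitude into a narrow numeric range, which is typically easier for a neural network to fit.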
B2 Action
To achieve optimal performance, the actions of all links should be jointly optimized. In that case, however, the number of actions grows exponentially with the number of links, making effective learning impossible. To avoid this problem, we let the DQN make decisions for each link separately. For each link, the action is the combination of the transmitter beamwidth, the receiver beamwidth, and the transmit power. Assuming that the transmitter and receiver use the same beamwidth as in [4], i.e., , the action of link at a certain time step can be expressed as .
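The per-link action design can be illustrated with a small index-to-action mapping, together with a count of how large the joint action space would be. The flat indexing scheme and the function names are hypothetical choices for this sketch.

```python
def decode_action(index, beamwidths, powers):
    """Map a flat action index to a (beamwidth, transmit power) pair.

    Assumes one shared beamwidth for transmitter and receiver, and a
    row-major layout: index = beamwidth_index * len(powers) + power_index.
    """
    n_actions = len(beamwidths) * len(powers)
    if not 0 <= index < n_actions:
        raise IndexError("action index out of range")
    beam_idx, power_idx = divmod(index, len(powers))
    return beamwidths[beam_idx], powers[power_idx]


def joint_action_count(n_links, actions_per_link):
    """Size of the joint action space if all links were decided together."""
    return actions_per_link ** n_links


pair = decode_action(5, beamwidths=[1, 2], powers=[10, 20, 30])
# With 64 actions per link, 5 links already give 64**5 joint actions.
joint = joint_action_count(5, 64)
```

Deciding per link keeps the output layer at a fixed, small size regardless of the number of links, at the cost of not coordinating the links explicitly.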
B3 Reward
Since the objective is to maximize the instantaneous effective sum rate, the discount rate should be set to 0. Therefore, in Equation 7 reduces to , where is the effective sum rate at time slot .
B4 Q Network architecture
The neural network architecture for Q value estimation is selected by cross-validation. In our experiments, a fully connected network with two hidden layers of 128 and 64 neurons and an output layer whose number of neurons equals the total number of actions works well.
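A dependency-free sketch of the forward pass of such a network is given below. The layer sizes follow the description above, while the ReLU activation, the random initialization and the example state dimension are assumptions; a practical implementation would use a deep learning framework instead.

```python
import random


def init_layer(n_in, n_out, rng):
    """Random weights (He-style scaling, an illustrative choice) and zero biases."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_q_network(state_dim, n_actions, seed=0):
    """Fully connected network: state_dim -> 128 -> 64 -> n_actions."""
    rng = random.Random(seed)
    sizes = [state_dim, 128, 64, n_actions]
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]


def q_values(network, state):
    """Forward pass: ReLU on hidden layers (assumed), linear output layer."""
    x = list(state)
    for idx, (weights, biases) in enumerate(network):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if idx < len(network) - 1:
            x = [max(0.0, v) for v in x]
    return x


def greedy_action(network, state):
    """Index of the action with the largest predicted Q value."""
    q = q_values(network, state)
    return max(range(len(q)), key=q.__getitem__)


net = build_q_network(state_dim=9, n_actions=64)   # sizes chosen for illustration
best = greedy_action(net, [-10.0] * 9)
```

Online decisions then amount to one forward pass and an argmax per link, which is why inference is cheap compared with iterative optimization.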
B5 DQN training
The initial learning rate of the neural network is 0.001, the batch size is 256, and the weights are updated by the Adam optimizer. First, we only generate data for 2000 episodes to fill the replay buffer. Then, to balance exploration and exploitation during training, the ε-greedy policy is adopted [13]. Specifically, the DQN randomly chooses an action from the action set with a small probability ε, rather than always choosing the action with the maximal Q value. The initial ε is set to 0.2 and gradually decreases to 0 over 100000 episodes. After that, we continue to train with ε = 0 for 10000 episodes. Notice that, in practice, a large amount of training data can be generated automatically based on the model. However, a mismatch may exist between the assumed channel model and the actual propagation environment. In that case, online fine-tuning can be adopted to compensate for the model mismatch by continuing training with data from the real environment.
IV Simulation Results
In this section, extensive simulation results are provided to demonstrate the performance of the proposed DRL-based approach. The random selection algorithm and the underestimation of interference algorithm are used as benchmarks for performance comparison. We assume that all the transmitter-receiver pairs are distributed randomly in a square area with a side length of m. As in [2], the following mmWave channel path loss model is used:
(12) 
where and account for the line-of-sight (LoS) and non-line-of-sight (NLoS) loss, respectively. Also, is the indicator function which returns 1 when and 0 otherwise, while is a Boolean random variable with probability being 1, and denotes the distance between the transmitter and receiver. The following set of parameters is used in the simulation, as in [2].
Parameter name  Value 

Carrier frequency  28 GHz 
System bandwidth  1 GHz 
Reference distance  5 m 
LoS path loss  , 
NLoS path loss  , 
Blockage model  
NLoS small-scale fading  Nakagami fading of parameter 3 
Noise power density  
Sector-level beamwidth  for all 
Side lobe gain  
Pilot transmission time 
Next, we investigate the impact of several key factors and parameters, including action discretization, network density and area size. All the tables and curves about testing performance are obtained by averaging over 500 independent experiments.
A Impact of action discretization
For DQN training, the actions are discretized according to how they affect the system performance. According to equations (1)-(4), the SINR is a linear function of the transmit power and the reciprocal of the beamwidth. Therefore, we propose to uniformly discretize the transmit power and the reciprocal of the square of the beamwidth in and , respectively, where and are the minimal and maximal transmit power, and and are the minimal and maximal beamwidth. In simulation, we set . Besides, we use only 8 values for both the transmit power set and the beamwidth set to balance training complexity and performance. To illustrate the performance of the proposed approach, the heuristic uniform approach is used as a benchmark, where both the transmit power and the beamwidth are uniformly discretized. As can be observed in Fig. 4, the proposed scheme achieves better performance than the uniform scheme.
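The discretization rule described above can be sketched as follows: the transmit power levels are equally spaced, while the beamwidth levels are chosen so that the reciprocal of the squared beamwidth is equally spaced. The function names are hypothetical, the default of 8 levels follows the text, and the numeric limits in the example are placeholders rather than the values used in the simulations.

```python
def uniform_grid(low, high, n_levels):
    """n_levels equally spaced values from low to high (inclusive)."""
    if n_levels < 2:
        raise ValueError("need at least two levels")
    step = (high - low) / (n_levels - 1)
    return [low + k * step for k in range(n_levels)]


def power_levels(p_min, p_max, n_levels=8):
    """Transmit powers, uniform in the linear power domain."""
    return uniform_grid(p_min, p_max, n_levels)


def beamwidth_levels(bw_min, bw_max, n_levels=8):
    """Beamwidths whose values of 1 / beamwidth**2 are uniformly spaced.

    The widest beam comes first because it has the smallest reciprocal.
    """
    inv_sq = uniform_grid(1.0 / bw_max ** 2, 1.0 / bw_min ** 2, n_levels)
    return [x ** -0.5 for x in inv_sq]


powers = power_levels(0.0, 7.0)          # placeholder limits
beamwidths = beamwidth_levels(1.0, 2.0)  # placeholder limits
```

Compared with a grid that is uniform in the beamwidth itself, this choice places more levels at narrow beamwidths, where a small change in beamwidth has a larger effect on the gain.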
To demonstrate the superiority of the proposed approach, it is necessary to examine the performance gap relative to the exhaustive search (ES) scheme. Owing to the complexity of the ES scheme, the comparison is only feasible in relatively small-scale systems. As shown in Table II, the performance of the DQN approach is close to that of the ES scheme and far superior to that of the random scheme.





2  19.58  98.72  40.42  
3  28.74  97.56  38.65  
4  35.18  96.09  39.35  
5  45.28  95.02  39.80 
B Impact of network density
Fig. 5 illustrates the impact of network density on the network throughput with and different . It can be observed that the throughput increases with , while the slope gradually decreases due to accumulated mutual interference. Also, the proposed DQN-based approach consistently outperforms the underestimation of interference baseline [4] when , and the performance gap becomes larger as increases. The reason is that the proposed DQN method takes the interference into account during the design, and hence achieves superior performance in crowded scenarios with severe interference.
In practice, the network density may change over time. Therefore, it is of paramount importance to investigate the generalization ability of the proposed approach, which is also a key challenge when implementing AI-based wireless communication systems [15]. With a well-trained network with , if the actual link number is smaller than , we simply consider the extra links as virtual links and pad each real link’s state tuple with zeros. On the other hand, if is greater than , we sort the interfering channel gains of each link and only keep the largest ones. As illustrated in Fig. 5, where the red curve corresponds to the case with customized training for different , while the yellow curve is trained for , we observe that the two curves almost overlap for different , which demonstrates the superior generalization capability of the proposed DQN.
C Impact of area size
Fig. 6 illustrates the impact of area side length on network throughput with and different . As can be readily observed, the throughput decreases with , which is expected because of the increased path loss. Also, the proposed DQN-based approach consistently outperforms the underestimation of interference baseline under various , while the performance gap decreases with larger since the impact of interference becomes insignificant. The generalization ability on is also investigated. As we can see, the network trained with works well for a wide range of with very little performance deterioration.
D Complexity
When , , the typical offline training time of the proposed approach on a GeForce GTX 1080 Ti GPU is about one hour. However, the online testing phase only needs to execute simple forward computation, which is much faster than the gradient descent process involved in the underestimation of interference baseline approach. When and , the average running time of the baseline approach is 207.40 and 513.04 ms, while the DQN based approach takes 0.98 and 0.99 ms, respectively, which is hundreds of times faster.
V Conclusion and Future Work
In this paper, we have proposed a DRL-based approach to solve the joint beamwidth and power allocation problem in mmWave communication systems. A customized DQN is designed, and heuristic tricks are used to tackle the generalization issue. Simulation results show that the proposed approach significantly outperforms the conventional suboptimal approaches in terms of both performance and complexity. Moreover, the proposed DQN has strong generalization ability, which makes it attractive for practical implementation. In the future, we will consider advanced DRL algorithms such as deep deterministic policy gradient [16] to optimize over a continuous action domain.
References
 [1] A. M. Al-Samman, M. H. Azmi, and T. A. Rahman, “A survey of millimeter wave (mmWave) communications for 5G: Channel measurement below and above 6 GHz,” in Proc. Int. Conf. Reliable Inf. Commun. Technol. (IRICT), Berlin, Germany: Springer, 2018, pp. 451-463.
 [2] T. Bai, V. Desai, and R. W. Heath, “Millimeter wave cellular channel models for system evaluation,” in Proc. IEEE Int. Conf. Netw. Commun. (ICNC), Honolulu, United States, Feb. 2014, pp. 178-182.
 [3] S. Singh, R. Mudumbai, and U. Madhow, “Interference analysis for highly directional 60-GHz mesh networks: The case for rethinking medium access control,” IEEE/ACM Trans. Netw., vol. 19, no. 5, pp. 1513-1527, Oct. 2011.
 [4] H. Shokri-Ghadikolaei, L. Gkatzikis, and C. Fischione, “Beam-searching and transmission scheduling in millimeter wave communications,” in Proc. IEEE Int. Conf. Commun. (ICC), London, United Kingdom, Jun. 2015, pp. 1292-1297.
 [5] O. Onireti, A. Imran, and M. A. Imran, “Coverage, capacity, and energy efficiency analysis in the uplink of mmWave cellular networks,” IEEE Trans. Veh. Technol., vol. 67, no. 5, pp. 3982-3997, May 2018.
 [6] R. Ismayilov, B. Holfeld, R. L. G. Cavalcante, and M. Kaneko, “Power and beam optimization for uplink millimeter-wave hotspot communication systems,” in Proc. IEEE Wireless Commun. Netw. Conf. (WCNC), Marrakesh, Morocco, Apr. 2019, pp. 1-8.
 [7] F. Meng, P. Chen, and L. Wu, “Power allocation in multi-user cellular networks with deep Q learning approach,” in Proc. IEEE Int. Conf. Commun. (ICC), Shanghai, China, May 2019, pp. 1-6.
 [8] J. Gao, C. Zhong, X. Chen, H. Lin, and Z. Zhang, “Unsupervised learning for passive beamforming,” IEEE Commun. Lett., vol. 24, no. 5, pp. 1052-1056, May 2020.
 [9] Y. Yang, F. Gao, C. Qian, and G. Liao, “Model-aided deep neural network for source number detection,” IEEE Signal Process. Lett., vol. 27, no. 12, pp. 91-95, Dec. 2019.
 [10] R. Shafin, et al., “Self-tuning sectorization: Deep reinforcement learning meets broadcast beam optimization,” IEEE Trans. Wireless Commun., early access.
 [11] S. Singh, et al., “Blockage and directivity in 60 GHz wireless personal area networks: From cross-layer model to multihop MAC design,” IEEE J. Sel. Areas Commun., vol. 27, no. 8, pp. 1400-1413, Oct. 2009.
 [12] H. Shokri-Ghadikolaei, C. Fischione, G. Fodor, P. Popovski, and M. Zorzi, “Millimeter wave cellular networks: A MAC layer perspective,” IEEE Trans. Commun., vol. 63, no. 10, pp. 3437-3458, Oct. 2015.
 [13] R. Sutton and R. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
 [14] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, pp. 529-533, Feb. 2015.
 [15] R. Shafin, et al., “Artificial intelligence-enabled cellular networks: A critical path to beyond-5G and 6G,” IEEE Wireless Commun., vol. 27, no. 2, pp. 212-217, Apr. 2020.
 [16] T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” 2019, [Online]. Available: https://arxiv.org/abs/1509.02971