With huge amounts of traffic data constantly pouring into cellular mobile systems, mmWave communication is considered a promising solution to the shortage of frequency resources while also improving system spectral efficiency compared with the current 4G-LTE system . mmWave signals typically attenuate quickly during transmission and penetrate obstacles poorly because of their extremely short wavelength. However, conventional massive multiple-input multiple-output (MIMO) beamforming technologies can provide sufficient antenna array gain for mmWave systems to cope with the high penetration loss in practical implementations. In turn, the short wavelength makes it physically feasible to pack more antennas into a massive MIMO array, which yields non-trivial antenna array gain and consequently better performance. For example, a 128-antenna array with half-wavelength antenna spacing spans about 6.4 m at the 3 GHz frequency point of LTE, whereas the same array at the 28 GHz millimeter-wave frequency spans only about 0.68 m, a drastic reduction.
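The array sizes quoted above follow directly from half-wavelength spacing. The short sketch below reproduces the arithmetic; the function name `ula_length_m` is illustrative, and the length is counted as one half-wavelength per element, which matches the figures in the text.

```python
SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def ula_length_m(num_antennas: int, carrier_hz: float) -> float:
    """Length of an array with one half-wavelength of spacing per element."""
    wavelength = SPEED_OF_LIGHT / carrier_hz
    return num_antennas * wavelength / 2.0


if __name__ == "__main__":
    for carrier_hz in (3e9, 28e9):
        length = ula_length_m(128, carrier_hz)
        print(f"{carrier_hz / 1e9:.0f} GHz: {length:.3f} m")
```

The ratio of the two lengths equals the ratio of the carrier frequencies, so moving from 3 GHz to 28 GHz shrinks the array by a factor of about 9.3.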
However, the design of the beamforming matrix in mmWave systems is constrained by expensive millimeter-wave radio-frequency (RF) chains. When combined with massive MIMO, a traditional full-digital beamformer needs a dedicated RF chain, used for AD/DA conversion and up/down conversion, for every transmit and receive antenna, which imposes intolerable power consumption and hardware cost on the system. To address these problems, research on hybrid beamforming (HBF) design follows two main approaches. The first regards HBF as a matrix factorization problem and minimizes the Frobenius norm or, equivalently, the Euclidean distance  and . The authors in  proposed an HBF architecture called spatially sparse precoding (SSP), based on orthogonal matching pursuit (OMP), which jointly designs and optimizes the precoding matrices of the transceiver. It converts the full-digital beamformer  into a low-dimensional baseband digital beamformer requiring RF chains and a high-dimensional analog beamformer realized only by phase shifters, which significantly reduces the number of expensive RF chains required while achieving near full-digital performance. They decouple the original precoding problem into hybrid precoding and combining sub-problems and assume a high transmit signal-to-noise ratio (SNR) to simplify the problem relaxation, which results in suboptimal solutions.
The second approach directly optimizes the original objective using the minimum mean square error (MMSE) criterion , , . The MMSE-based massive MIMO hybrid precoding design in  proves that hybrid beamforming can realize any fully digital beamformer when the number of RF chains is greater than or equal to twice the number of transmitted streams. Despite the viable performance, the higher algorithm complexity and RF chain count introduce non-negligible processing delays in most 5G application scenarios. The same delay problem also arises in manifold optimization (MO) based hybrid precoding algorithms  and . Because of the hardware and constant modulus constraints on the analog precoding matrix, it remains challenging to design an optimal and practical mmWave hybrid beamformer, especially one with low complexity and low processing delay.
In the previous literature, deep supervised learning (DSL) has been used to reduce the algorithm complexity of hybrid beamformer design , , . Supervised learning based on deep neural networks performs well during online testing but often requires an extensive sample library for offline training, and the trained model is sensitive to the environment, i.e., the channel conditions in mmWave systems. In this paper, we demonstrate that deep reinforcement learning (DRL) can be used to design hybrid beamforming matrices. DRL has been shown to approach or even surpass human performance in the game of Go and in robot control  owing to its powerful ability to deal with nonlinear non-convex problems. A model-free DRL agent recasts the action prediction problem as a Markov decision process (MDP): it obtains feedback and the current state from the environment and learns, from a small number of interactions, a behavioral policy for complex problems based on the principle of maximizing the long-term expected reward. The value-based DRL algorithm deep Q-networks (DQN)  focuses on discrete control problems, while the policy-based deep deterministic policy gradient (DDPG) handles continuous action control problems . We use DDPG to design the hybrid beamformer in this paper because of the continuous action space and sparse reward of the problem.
In our work, we propose a novel DDPG-based hybrid beamforming design algorithm called PrecoderNet. The current channel information is taken as the state, performance indicators such as spectral efficiency and bit error rate (BER) serve as the reward function, and the real and imaginary parts of the precoder matrix elements are selected as the action. The mmWave hybrid precoding problem can therefore be modeled as an MDP that can be solved effectively by DRL. The state is the input of the DRL agent, and the output of the agent is exactly the vectorized form of the HBF matrix. More specifically, we develop a novel network architecture, PrecoderNet, based on the DDPG algorithm to reduce the performance gap at low computational complexity. We remark that the DDPG-based PrecoderNet can efficiently reuse previously generated samples to train the agent, without building the large offline training dataset that DSL requires. Thus, our algorithm is more energy-efficient and tractable than the DSL method. Furthermore, the DRL algorithm is essentially a gradient descent algorithm, so a good initial point has a significant impact on convergence. Inspired by , we utilize external knowledge from the hybrid precoding designs in  to significantly accelerate the learning process of the precoding design problem.
Following this idea, we initialize PrecoderNet with the OMP solution in , which improves the performance and helps ensure the convergence of our algorithm, and explore the HBF solution space with DDPG in search of the global optimum. We evaluate the approach empirically on a narrowband single-user massive MIMO mmWave HBF communication scenario. Simulation results show that both the spectral efficiency (rate) and the BER of the proposed method outperform the benchmarks and exhibit a smaller gap to the full-digital upper bound. It is worth noting that our algorithm can also be extended to multi-user (MU) large-scale MIMO systems and wideband scenarios.
The rest of this paper is organized as follows. Section II introduces the mmWave system under study. After the RL background in Section III, we describe the proposed algorithm in Section IV. The experimental results are given in Section V. Finally, Section VI concludes the paper and discusses the approach.
II System Model
II-A Network Model
Consider a mmWave single-cell multiuser downlink large-scale MIMO system in which the base station (BS) is equipped with transmitting antennas, and independent data streams are up-converted by RF chains () and then transmitted simultaneously to serve users with receiving antennas per user. The numbers of transceiver antennas satisfy . Each data stream at the BS is processed by a baseband digital beamforming matrix and then converted from digital to analog (DA) by a dedicated RF chain. At the user side, the receiving antennas are connected to RF chains () for analog-to-digital (AD) conversion and decode the received signal. Because the number of RF chains is limited on both sides, full-digital beamforming, which requires one RF chain per transceiver antenna, is impractical under mmWave conditions. Instead, we consider a hybrid beamforming architecture, as shown in Figure 1.
Based on the aforementioned hardware constraints, the equivalent beamforming matrix consists of one baseband digital beamforming matrix and one analog beamforming matrix connected after the RF chains. The low-dimensional digital matrix needs only a small number of RF chains, and the high-dimensional analog matrix can be built from simple phase shifters (PSs), which greatly reduces the hardware complexity. The analog beamforming matrix, consisting of PSs, is subject to constant modulus constraints, i.e., , or . Although the PSs can provide only limited beamforming gain, the large-scale antenna arrays compensate for this limitation.
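A constant modulus constraint means that a phase shifter can change only the phase of each entry, never its magnitude. The following minimal sketch projects an arbitrary complex matrix onto that constraint set. The function names and the choice of a single common magnitude `modulus` are assumptions made for illustration (a common normalization is one over the square root of the number of antennas); they are not taken from the paper.

```python
import cmath


def project_constant_modulus(matrix, modulus):
    """Keep the phase of every entry and force its magnitude to `modulus`."""
    return [
        [modulus * cmath.exp(1j * cmath.phase(entry)) for entry in row]
        for row in matrix
    ]


def satisfies_constant_modulus(matrix, modulus, tol=1e-9):
    """Check that every entry has magnitude `modulus` up to `tol`."""
    return all(abs(abs(entry) - modulus) <= tol for row in matrix for entry in row)


if __name__ == "__main__":
    unconstrained = [[1 + 1j, -2 + 0j], [0.5j, 3 - 4j]]
    analog = project_constant_modulus(unconstrained, modulus=0.5)
    print(satisfies_constant_modulus(analog, modulus=0.5))
```

An entry that is exactly zero has no defined phase; `cmath.phase` returns zero in that case, so the sketch maps it to a real entry of magnitude `modulus`.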
In our hybrid beamforming architecture, we use a fully connected structure between the transmitting RF chains and the transmit antennas, similar to . For simplicity of presentation, we consider a point-to-point mmWave MIMO single-user HBF scenario, as shown in Fig. 2. The output signal of each RF chain is fed to every transmit antenna via phase shifters; the phase-shifted signals are then combined and transmitted by the antenna. Therefore, the transmitter needs a total of PSs. At the user side, the signal received by each antenna is divided into streams by a splitter and processed by the receiving analog precoding matrix ; the data streams are then combined and passed to the RF chains. The analog precoding matrix also satisfies the constant modulus constraints. A total of PSs are required at the user side. After the RF chains perform down-conversion and analog-to-digital conversion (ADC), the receiving digital beamforming matrix recovers the data streams for subsequent demodulation.
II-B Channel Model
According to , mmWave systems suffer from severe attenuation, strong penetration loss and a limited number of scattering paths. In addition, large-scale MIMO antenna arrays are integrated in a much smaller physical size, so the spatial correlation between antennas cannot be ignored. Therefore, we adopt the geometric Saleh-Valenzuela (S-V) channel model  similar to  and . Consider a uniform linear array (ULA) with half-wavelength antenna spacing . Assume that there are scattering clusters in the environment and that each cluster contributes scattering rays. The discrete narrowband channel is given in (1):
where denotes the complex path gain of the ray in the cluster, and , are the normalized receive and transmit array responses, respectively, with the angles of arrival and departure denoted as and , respectively. The array response of a ULA with antennas can be expressed as (2):
The average power of all clusters must satisfy the power constraints of the channel: , where is a constant to .
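The channel model of this subsection can be sketched in a few lines. The code below is an illustration under stated assumptions, not the simulation setup of the paper: the names `ula_response` and `sv_channel` are hypothetical, path gains are drawn as unit-variance complex Gaussians, and angles are drawn uniformly rather than from per-cluster distributions with an angular spread.

```python
import cmath
import math
import random


def ula_response(num_antennas, angle_rad):
    """Normalized response of a half-wavelength ULA toward `angle_rad`."""
    scale = 1.0 / math.sqrt(num_antennas)
    phase_step = math.pi * math.sin(angle_rad)  # 2*pi*d/lambda with d = lambda/2
    return [scale * cmath.exp(1j * phase_step * n) for n in range(num_antennas)]


def sv_channel(n_t, n_r, n_clusters, n_rays, rng):
    """Narrowband geometric channel: scaled sum of rank-one ray contributions."""
    channel = [[0j] * n_t for _ in range(n_r)]
    scale = math.sqrt(n_t * n_r / (n_clusters * n_rays))
    for _ in range(n_clusters * n_rays):
        # Assumption: unit-variance complex Gaussian path gain.
        gain = complex(rng.gauss(0.0, math.sqrt(0.5)), rng.gauss(0.0, math.sqrt(0.5)))
        # Assumption: uniform angles; a cluster model would add angular spread.
        a_r = ula_response(n_r, rng.uniform(-math.pi / 2, math.pi / 2))
        a_t = ula_response(n_t, rng.uniform(-math.pi / 2, math.pi / 2))
        for i in range(n_r):
            for j in range(n_t):
                channel[i][j] += scale * gain * a_r[i] * a_t[j].conjugate()
    return channel


if __name__ == "__main__":
    h = sv_channel(n_t=8, n_r=4, n_clusters=2, n_rays=3, rng=random.Random(0))
    print(len(h), len(h[0]))
```

With unit-variance gains and unit-norm array responses, the scaling factor makes the expected squared Frobenius norm of the channel equal to the product of the antenna counts, which is the usual normalization for this model.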
II-C Signal Model
The discrete-time transmit signal is denoted by , where represents the transmitted data streams satisfying the power constraint . The received signal at the user side can then be written as (3):
where is the additive white Gaussian noise with zero mean and covariance matrix , i.e., . When circularly symmetric complex Gaussian signals s are transmitted in the system, the spectral efficiency is given by (4):
where is the interference and noise covariance matrix after combining at the receiver.
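For orientation, the spectral efficiency of a hybrid transceiver is commonly written in the following form in the spatially sparse precoding literature cited in this paper. The symbols are assumptions introduced here for illustration ($\mathbf{F}_{RF}$, $\mathbf{F}_{BB}$ for the transmit analog and digital beamformers, $\mathbf{W}_{RF}$, $\mathbf{W}_{BB}$ for the receive ones, $N_s$ data streams, average received power $\rho$, noise variance $\sigma_n^2$), and the exact scaling in (4) may differ.

```latex
R = \log_2 \left| \mathbf{I}_{N_s} + \frac{\rho}{N_s} \mathbf{R}_n^{-1}
    \mathbf{W}_{BB}^{H} \mathbf{W}_{RF}^{H} \mathbf{H} \mathbf{F}_{RF} \mathbf{F}_{BB}
    \mathbf{F}_{BB}^{H} \mathbf{F}_{RF}^{H} \mathbf{H}^{H} \mathbf{W}_{RF} \mathbf{W}_{BB} \right|,
\qquad
\mathbf{R}_n = \sigma_n^2 \, \mathbf{W}_{BB}^{H} \mathbf{W}_{RF}^{H} \mathbf{W}_{RF} \mathbf{W}_{BB}
```

In this form, $\mathbf{R}_n$ is the noise covariance seen after the receive combiners, which is why it depends on the combiners but not on the precoders.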
In this section, we briefly introduce the basics of the MDP dynamic programming model and of the DRL algorithm used, DDPG, for the reader's reference.
Markov Decision Process (MDP): An MDP describes the interaction between one agent and its environment, which can be represented by a quintuple $\langle \mathcal{S}, \mathcal{A}, r, \gamma, \pi \rangle$. $\mathcal{S}$ is the state space and $\mathcal{A}$ is the action space. The reward $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the feedback from the environment that measures the chosen action under the current state. $\gamma$ is a discount factor that turns an infinite-sequence problem into one with a finite upper bound, so that the MDP can converge within finite steps. $\pi$ is the policy according to which the agent selects actions, and the chosen action is $a = \pi(s)$.
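The role of the discount factor can be made concrete with a small sketch. The function name `discounted_return` is illustrative; the code simply accumulates rewards backwards so that each step is weighted by one more power of the discount factor.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the k-th reward is weighted by gamma**k."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # Three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

For rewards bounded by some maximum value and a discount factor below one, the return can never exceed that maximum divided by one minus the discount factor, which is the bound referred to in the text.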
Deep Q-networks (DQN): DQN approximates the action-value function $Q(s,a)$ of value-based Q-learning with a parameterized deep neural network, where $Q(s,a)$ is the expected discounted return of the current state-action pair. DQN aims to maximize the target Q-value of the state-action pair and updates the Q-value with the Bellman equation from dynamic programming. Gradient descent is then carried out on minibatches sampled at random from the experience replay, and the action with the largest Q-value is selected with probability $1-\epsilon$, or a random action is selected with probability $\epsilon$.
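The action selection rule described above is usually called epsilon-greedy. A minimal sketch, with the hypothetical helper name `epsilon_greedy` and a list of Q-values indexed by action:

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda action: q_values[action])


if __name__ == "__main__":
    rng = random.Random(0)
    picks = [epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.1, rng=rng) for _ in range(10)]
    print(picks)
```

Setting `epsilon` to zero gives purely greedy behavior, while setting it to one gives uniform random exploration.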
Deep Deterministic Policy Gradient (DDPG): DDPG is an actor-critic (AC) algorithm that uses a parameterized deterministic policy network (the actor) to generate a deterministic action. It uses the Q-network of DQN as the critic and updates the actor parameters with gradient steps that maximize the Q-value output by the critic.
For convenience, the symbols and notation are summarized in Table I.
IV-A Problem Formulation
In this work, we consider a narrowband mmWave point-to-point downlink massive MIMO system, as shown in Fig. 2. In such a communication system, we aim to maximize the spectral efficiency (4) through hybrid beamforming while ensuring acceptable user quality of service (QoS), measured by the BER, under the aforementioned hardware constraints. Perfect instantaneous channel state information (CSI), which can be accurately estimated by the zero-forcing method , is assumed to be known at both the transmitter and the receiver. Thus the HBF design problem can be written as
(5a) and (5b) are the constant modulus constraints on the transceiver analog beamforming matrices, and (5c) is the total transmit power constraint. The joint optimization of the four precoders under non-convex constraints is generally difficult to solve , . A tractable, sub-optimal but efficient method is to decouple the transmitter and receiver HBF designs and solve them in a sequential manner , , , . The previous literature indicates that this approach can achieve near full-digital performance. Following this line, we further use deep reinforcement learning to search for a near-globally optimal solution, exploiting the ability of DRL algorithms to handle nonlinear non-concave problems, and propose the so-called PrecoderNet, which designs the HBF by combining DRL and the MMSE criterion.
IV-B DDPG-based Transmitting Hybrid Beamformer Design
In this section, we first focus on the design of the hybrid beamforming matrix at the transmitter side. Without loss of generality, we assume identical numbers of transmit and receive RF chains, i.e., , to simplify the notation. According to , the original problem (5) with fixed and can be converted to the following Euclidean distance minimization problem (6):
is the full-digital solution as well as the right unitary matrix of the singular value decomposition. This conversion assumes that is approximately diagonal and that the transmit SNR is high, i.e., , which results in a suboptimal design for . However, it has been found in  that the optimal is exactly a linear combination of the array response vectors in (2), so designing amounts to selecting the best complex weighting factors for these constant-modulus vectors.
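The distance being minimized is the Frobenius norm of the difference between the full-digital solution and the product of the analog and digital precoders. A minimal sketch of that objective follows; the argument names `f_opt`, `f_rf` and `f_bb` are assumptions chosen for readability, and matrices are plain nested lists of complex numbers.

```python
import math


def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def frobenius_distance(f_opt, f_rf, f_bb):
    """Frobenius norm of f_opt - f_rf @ f_bb."""
    approx = matmul(f_rf, f_bb)
    squared = sum(
        abs(f_opt[i][j] - approx[i][j]) ** 2
        for i in range(len(f_opt))
        for j in range(len(f_opt[0]))
    )
    return math.sqrt(squared)


if __name__ == "__main__":
    f_rf = [[1 + 0j], [1j]]   # one RF chain feeding two antennas
    f_bb = [[2 + 0j]]         # one data stream
    print(frobenius_distance([[2 + 0j], [2j]], f_rf, f_bb))
```

A value of zero means the hybrid pair reproduces the full-digital precoder exactly; any remaining distance is the approximation error that the hybrid design tries to shrink.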
Traditional approaches to designing , such as OMP in  and MO in , both assume good sparsity of the digital precoding matrix, which is not satisfied in practical implementations. In addition, the hypothesis of an approximately infinite number , makes the dimension of go to infinity. This property inspires us to use the continuous-control DRL approach DDPG  to deal with such a high-dimensional problem. To the best of our knowledge, this is the first time that DRL has been successfully applied to hybrid beamforming design.
As introduced in Section III, DDPG consists of an actor net that generates actions and a critic net that evaluates the output of the actor. It produces continuous action values, which matches the continuity of the HBF elements. We propose a DDPG-based mmWave HBF architecture, called PrecoderNet, to design the transmitting hybrid precoding matrix; each part of the architecture has a specific meaning, as shown in Fig. 3.
As illustrated in Fig. 3, the beamforming agent first receives and the estimated channel H and interacts with the environment to obtain the spectral efficiency as the reward. The agent then reshapes the complex-valued matrices into vectors and separates the real and imaginary parts to form the final input of the neural networks. The input series expressed in (7) are denoted as and abbreviated as and , respectively.
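The reshaping step can be illustrated as follows. This is a sketch under assumptions: the helper names are hypothetical, entries are flattened in row-major order (the paper's vectorization order is not recoverable from the text), and all real parts are placed before all imaginary parts.

```python
def complex_matrix_to_real_vector(matrix):
    """Flatten a complex matrix and stack real parts, then imaginary parts."""
    flat = [entry for row in matrix for entry in row]
    return [z.real for z in flat] + [z.imag for z in flat]


def real_vector_to_complex_matrix(vector, rows, cols):
    """Inverse of complex_matrix_to_real_vector for a rows x cols matrix."""
    count = rows * cols
    if len(vector) != 2 * count:
        raise ValueError("vector length must be 2 * rows * cols")
    flat = [complex(vector[k], vector[count + k]) for k in range(count)]
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


if __name__ == "__main__":
    m = [[1 + 2j, 3 - 4j], [0j, -1j]]
    v = complex_matrix_to_real_vector(m)
    print(v)
    print(real_vector_to_complex_matrix(v, 2, 2) == m)
```

The inverse mapping is what turns the real-valued output of an actor network back into a complex beamforming matrix of the required shape.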
where denotes the current communication state, composed of and at time slot t, and K equals . The baseband digital beamforming design strategy is based on the quality value (Q-function) of state , expressed in (8):
The notation is the finite set of all actions a, and is the discount factor that keeps the MDP a bounded, iteratively solvable problem. Taking the state as excitation, the actor net A, which consists of neural networks, outputs a vector as the selected action. This vector is reshaped into a matrix to form the new state . Afterwards, the agent stores the tuple in an experience replay D . The critic net C evaluates the actor net by sampling a minibatch of N prior experiences from the replay buffer D as an approximation of , and the loss function of C is given in (9) and (10).
C and A are then updated with the policy gradient according to (11) and (12). The actor comprises an evaluated network A and a target network, parameterized by and respectively, which are used to mitigate the over-fitting problem. The critic net likewise contains an evaluated network C and a target network C' parameterized by and respectively, as shown in Fig. 3. We soft update all the target networks by according to . The algorithm converges within a few time slots, as shown in Section V.
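The critic update and the soft target update follow the standard DDPG recipe from the cited continuous-control work. The sketch below shows that generic recipe rather than the exact expressions (9) to (12): the function names, the flat-list representation of network parameters and the symbol `tau` for the soft update rate are assumptions made for illustration.

```python
def td_targets(rewards, next_q_values, gamma):
    """Bootstrapped targets: reward plus discounted target-critic value."""
    return [r + gamma * q_next for r, q_next in zip(rewards, next_q_values)]


def critic_loss(q_values, targets):
    """Mean squared error between critic outputs and their targets."""
    return sum((y - q) ** 2 for q, y in zip(q_values, targets)) / len(q_values)


def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau toward the online one."""
    return [
        tau * online + (1.0 - tau) * target
        for online, target in zip(online_params, target_params)
    ]


if __name__ == "__main__":
    targets = td_targets([1.0, 0.0], [2.0, 4.0], gamma=0.5)
    print(targets, critic_loss([1.0, 3.0], targets))
    print(soft_update([0.0, 1.0], [1.0, 1.0], tau=0.1))
```

A small `tau` makes the target networks trail the evaluated networks slowly, which keeps the bootstrapped targets stable while the evaluated networks are being trained.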
In this way, PrecoderNet can learn an optimal digital beamformer online and use samples from previously stored experience to update its parameters, which improves the learning efficiency while reducing the computational complexity compared with deep supervised learning ,  and . Finally, we transmit the signals and then design the receiver beamforming matrix via the MMSE criterion.
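Reusing stored experience relies on a replay buffer. The class below is a minimal sketch of such a buffer, assuming a fixed capacity with oldest-first eviction and uniform sampling; the class name, method names and tuple layout are illustrative and not taken from the paper.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity, seed=None):
        self._buffer = deque(maxlen=capacity)  # oldest entries are evicted first
        self._rng = random.Random(seed)

    def store(self, state, action, reward, next_state):
        self._buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Uniformly sample up to batch_size distinct stored transitions."""
        size = min(batch_size, len(self._buffer))
        return self._rng.sample(list(self._buffer), size)

    def __len__(self):
        return len(self._buffer)


if __name__ == "__main__":
    buffer = ReplayBuffer(capacity=3, seed=0)
    for step in range(5):
        buffer.store(step, step * 0.1, float(step), step + 1)
    print(len(buffer), buffer.sample(2))
```

Sampling uniformly from past transitions breaks the correlation between consecutive samples and lets each interaction with the environment contribute to several parameter updates.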
IV-C MMSE-based Receiving Hybrid Beamformer Design
In the second part of this section, we solve the receiver hybrid combiner design problem, i.e., and , based on the learned and , via the MMSE criterion. The received signal at the receiver antennas is , and the processed received signal is given in (3). With fixed hybrid precoders , we can minimize the mean square error (MSE) between the transmitted and processed signals, which can be stated as (13):
where represents the product of and ; the analog beamformer still has to satisfy the constant modulus constraint (13a). The solution of such a minimum MSE problem without hardware constraints is well known  as (14):
In the hybrid receiving beamformer design, we use the same idea as for the transmitting beamformer and convert this problem into a Euclidean distance minimization problem, as proved in . The optimal analog beamformer is a linear combination of the receiver array responses, similar to , and thus the optimal can be obtained by the OMP method. Note that after the update of , PrecoderNet iterates a new until the agent converges. In addition, if the CSI changes, our proposed algorithm can automatically learn a new optimal solution in almost no time.
V-A Hyperparameters of PrecoderNet
In our experiments, we construct the DDPG-based PrecoderNet from four-layer feedforward neural networks and use the Adam optimizer  to perform the gradient descent of the evaluated network and the target network. The size of the input layer is and the output layer has neurons. The networks have two hidden layers with 400 and 300 neurons, respectively. Each of the first three layers is followed by a ReLU activation, while the output layer uses the tanh function to provide suitable gradients . The learning rate is 1e-4 and the discount factor is set empirically to . We set to soft update the target network. As in DDPG, the additive exploration noise is Gaussian and obeys .
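To make the layer sizes concrete, here is a dependency-free sketch of an actor with hidden layers of 400 and 300 units, ReLU activations and a tanh output. Everything else is an assumption for illustration: the uniform weight initialization, the example state and action dimensions, the noise standard deviation `sigma`, and the clipping of the noisy action to the tanh range. It shows only the forward pass, not training.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Assumed initialization: uniform weights in +-1/sqrt(n_in), zero biases."""
    bound = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, weights, biases):
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def build_actor(state_dim, action_dim, rng, hidden=(400, 300)):
    sizes = [state_dim, *hidden, action_dim]
    return [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def actor_forward(layers, state):
    """ReLU after each hidden layer, tanh on the output layer."""
    x = state
    for weights, biases in layers[:-1]:
        x = [max(0.0, v) for v in dense(x, weights, biases)]
    weights, biases = layers[-1]
    return [math.tanh(v) for v in dense(x, weights, biases)]


def explore(action, sigma, rng):
    """Add zero-mean Gaussian noise, then clip back to the tanh range."""
    return [min(1.0, max(-1.0, a + rng.gauss(0.0, sigma))) for a in action]


if __name__ == "__main__":
    rng = random.Random(0)
    actor = build_actor(state_dim=8, action_dim=4, rng=rng)
    action = actor_forward(actor, [0.1] * 8)
    print(explore(action, sigma=0.1, rng=rng))
```

Because tanh bounds every output to the interval from minus one to one, the real and imaginary parts produced by such an actor would need to be rescaled if the beamformer entries are expected to have a different range.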
Consider a narrowband mmWave massive MIMO point-to-point hybrid beamforming system consisting of a BS with transmitting antennas and a user with receiving antennas. Without loss of generality, we set , meaning that six data streams are sent, and the number of RF chains at both the transmitter and the receiver is . The environment noise obeys a complex Gaussian distribution with zero mean and covariance , i.e., . The angular spreads of the transmitter and receiver in the azimuth domain are equal, i.e., . We assume scattering clusters of equal power, i.e., , and rays per cluster in the limited-scattering mmWave environment. We first compare the performance of the proposed PrecoderNet with the traditional hybrid beamforming algorithms  and the fully digital beamforming algorithm ; the theoretical upper bound, in which signals are sent via the eigenmodes of the channel, is also provided. The horizontal axis is the signal-to-noise ratio, given as SNR.
(Footnote 1: We remark that our proposed algorithm is model-free because it evaluates the value function directly, and it can thus be easily extended to wideband scenarios regardless of the concrete environment model, which is more general in mmWave systems.)
The spectral efficiency of a system with uniform linear arrays (ULAs) at both the BS and the user side is shown in Fig. 4. Overall, our proposed algorithm obtains a higher rate than the optimal unconstrained MMSE-based full-digital beamformer  and the OMP-based hybrid beamformer . In the low-SNR region from -15 dB to -10 dB, the rate achieved by our method is slightly lower than that of MMSE but higher than that of SSP-OMP. When the SNR is larger than 10 dB, our method obtains the best spectral efficiency. In addition, the results of PrecoderNet are much closer to the upper bound than those of the compared algorithms.
We further compare the bit error rate (BER) achieved after processing by the receiver hybrid beamformers of the above three algorithms, as shown in Fig. 5. The BER is defined as the ratio of the number of erroneously demodulated signals to the total number of transmitted signals, as expressed in (15). We transmit symbols per data stream and use quadrature phase shift keying (QPSK) to map the data onto four constellation points in the Cartesian plane. With the additive white Gaussian noise , the received signals at the user side are preprocessed by and then demodulated according to the maximum likelihood criterion . The simulation results indicate that our proposed approach achieves the best BER performance. For example, at the same SNR of -5 dB, our algorithm obtains a BER of while the benchmarks are both at about .
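The BER measurement can be sketched as follows. The bit-to-symbol mapping is an assumption (a Gray mapping in which the first bit sets the sign of the real part and the second bit the sign of the imaginary part), and the function names are illustrative. For QPSK with equally likely symbols in additive white Gaussian noise, the maximum likelihood decision is the nearest constellation point, which reduces to the two sign tests used below.

```python
import math

AMPLITUDE = 1.0 / math.sqrt(2.0)  # unit-energy QPSK symbols


def modulate(bits):
    """Map bit pairs to QPSK symbols (bit 0 -> positive, bit 1 -> negative)."""
    if len(bits) % 2:
        raise ValueError("QPSK needs an even number of bits")
    return [
        complex(AMPLITUDE * (1 - 2 * bits[i]), AMPLITUDE * (1 - 2 * bits[i + 1]))
        for i in range(0, len(bits), 2)
    ]


def demodulate(symbols):
    """Nearest-point detection, i.e. a sign decision on each component."""
    bits = []
    for symbol in symbols:
        bits.append(0 if symbol.real >= 0 else 1)
        bits.append(0 if symbol.imag >= 0 else 1)
    return bits


def bit_error_rate(sent_bits, received_bits):
    """Fraction of positions where the received bit differs from the sent bit."""
    errors = sum(s != r for s, r in zip(sent_bits, received_bits))
    return errors / len(sent_bits)


if __name__ == "__main__":
    bits = [0, 0, 0, 1, 1, 0, 1, 1]
    print(bit_error_rate(bits, demodulate(modulate(bits))))
```

In a full simulation, the symbols would pass through the precoders, the channel, the noise and the receive combiners before demodulation; the sketch covers only the mapping and the error count.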
We then extend these methods to a system and observe the same performance indicators as above. From Fig. 6, we can see that PrecoderNet always achieves better spectral efficiency than the other two algorithms in both the low-SNR and high-SNR regions. We also examine the BER of the three approaches in this scenario with larger antenna arrays, as shown in Fig. 7, with the same settings as in Fig. 5. The results show that our method obtains a better BER, and when SNR > 15 dB the BER of PrecoderNet is nearly zero, which dramatically outperforms the benchmarks.
Finally, we analyze the runtime complexity by comparing the time consumed by the proposed algorithm with that of the benchmarks, as summarized in Table II.
|Algorithms|Time (averaged over 2000 episodes)|
|---|---|
|PrecoderNet|2.3934 ms (0.0023934 s)|
|SSP-OMP|254 ms (0.254 s)|
|MMSE|466 ms (0.466 s)|
The times above are averaged over 2000 simulations. They show that a trained PrecoderNet computes a usable digital precoding matrix in about 2.4 ms, roughly one hundredth of the time required by SSP-OMP (254 ms) and about one two-hundredth of the time required by MMSE (466 ms), which makes it more energy efficient and well suited to mmWave systems.
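The speed-up implied by Table II can be checked with a few lines of arithmetic. The dictionary below simply restates the reported averages in milliseconds; the helper name `speedup_over` is illustrative.

```python
# Average times from Table II, in milliseconds.
TIMES_MS = {"PrecoderNet": 2.3934, "SSP-OMP": 254.0, "MMSE": 466.0}


def speedup_over(baseline, method="PrecoderNet"):
    """How many times faster `method` is than `baseline`."""
    return TIMES_MS[baseline] / TIMES_MS[method]


if __name__ == "__main__":
    for name in ("SSP-OMP", "MMSE"):
        print(f"{name}: {speedup_over(name):.1f}x")
```

Both ratios are on the order of one hundred or more, so the trained network is about two orders of magnitude faster than either benchmark on the reported timings.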
In this paper, we focus on the hybrid beamforming design problem for mmWave massive MIMO systems and propose a novel HBF design algorithm called PrecoderNet, which uses DRL at the transmitter side and the MMSE criterion at the receiver side. The system spectral efficiency and BER are used to demonstrate the performance of the proposed algorithm. Numerical results reveal that the proposed algorithm outperforms the benchmarks in the high-SNR region and comes closer to the upper bound as the SNR increases. Moreover, the spectral efficiency gain over the benchmarks becomes more pronounced in the high-SNR regime. As for system reliability, with PrecoderNet the BER of the entire system decreases to nearly zero as the SNR increases, which indicates that deep reinforcement learning is a promising approach to the (hybrid) beamforming design problem.
-  Y. Niu, Y. Li, D. Jin, L. Su, and A. V. Vasilakos, “A survey of millimeter wave communications (mmWave) for 5G: opportunities and challenges,” Wireless Networks, vol. 21, no. 8, pp. 2657–2676, 2015.
-  W. Hong, K.-H. Baek, Y. Lee, Y. Kim, and S.-T. Ko, “Study and prototyping of practically large-scale mmWave antenna systems for 5G cellular devices,” IEEE Communications Magazine, vol. 52, no. 9, pp. 63–69, 2014.
-  O. El Ayach, S. Rajagopal, S. Abu-Surra, Z. Pi, and R. W. Heath, “Spatially sparse precoding in millimeter wave MIMO systems,” IEEE Transactions on Wireless Communications, vol. 13, no. 3, pp. 1499–1513, 2014.
-  X. Gao, L. Dai, S. Han, I. Chih-Lin, and R. W. Heath, “Energy-efficient hybrid analog and digital precoding for mmWave MIMO systems with large antenna arrays,” IEEE Journal on Selected Areas in Communications, vol. 34, no. 4, pp. 998–1009, 2016.
-  H. Sampath, P. Stoica, and A. Paulraj, “Generalized linear precoder and decoder design for MIMO channels using the weighted MMSE criterion,” IEEE Transactions on Communications, vol. 49, no. 12, pp. 2198–2206, 2001.
-  F. Sohrabi and W. Yu, “Hybrid digital and analog beamforming design for large-scale antenna arrays,” IEEE Journal of Selected Topics in Signal Processing, vol. 10, no. 3, pp. 501–513, 2016.
-  D. H. Nguyen, L. B. Le, and T. Le-Ngoc, “Hybrid MMSE precoding for mmWave multiuser MIMO systems,” in 2016 IEEE International Conference on Communications (ICC), pp. 1–6, IEEE, 2016.
-  X. Yu, J.-C. Shen, J. Zhang, and K. B. Letaief, “Alternating minimization algorithms for hybrid precoding in millimeter wave MIMO systems,” IEEE Journal of Selected Topics in Signal Processing, vol. 10, no. 3, pp. 485–500, 2016.
-  W. Xia, G. Zheng, Y. Zhu, J. Zhang, J. Wang, and A. P. Petropulu, “A deep learning framework for optimization of MISO downlink beamforming,” arXiv preprint arXiv:1901.00354, 2019.
-  A. Alkhateeb, S. Alex, P. Varkey, Y. Li, Q. Qu, and D. Tujkovic, “Deep learning coordinated beamforming for highly-mobile millimeter wave systems,” IEEE Access, vol. 6, pp. 37328–37348, 2018.
-  H. Huang, Y. Song, J. Yang, G. Gui, and F. Adachi, “Deep-learning-based millimeter-wave massive MIMO for hybrid precoding,” IEEE Transactions on Vehicular Technology, vol. 68, no. 3, pp. 3027–3032, 2019.
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
-  S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates,” in 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3389–3396, IEEE, 2017.
-  D. S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein, “The complexity of decentralized control of Markov decision processes,” Mathematics of Operations Research, vol. 27, no. 4, pp. 819–840, 2002.
-  T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
-  X. Gao, S. Jin, C.-K. Wen, and G. Y. Li, “ComNet: Combination of deep learning and expert knowledge in OFDM receivers,” IEEE Communications Letters, vol. 22, no. 12, pp. 2627–2630, 2018.
-  V. Raghavan and A. M. Sayeed, “Sublinear capacity scaling laws for sparse MIMO channels,” IEEE Transactions on Information Theory, vol. 57, no. 1, pp. 345–364, 2010.
-  T. Lin, J. Cong, Y. Zhu, J. Zhang, and K. B. Letaief, “Hybrid beamforming for millimeter wave systems using the MMSE criterion,” IEEE Transactions on Communications, vol. 67, no. 5, pp. 3693–3708, 2019.
-  Q. H. Spencer, A. L. Swindlehurst, and M. Haardt, “Zero-forcing methods for downlink spatial multiplexing in multiuser MIMO channels,” IEEE Transactions on Signal Processing, vol. 52, no. 2, pp. 461–471, 2004.
-  T. Kailath, A. H. Sayed, and B. Hassibi, Linear Estimation. Prentice Hall, 2000.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  H. V. Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” Computer Science, 2015.
-  D. J. Zwickl, Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion. PhD thesis, 2006.