I. Introduction
Recent years have witnessed the successful deployment of massive multiple-input multiple-output (massive MIMO) in fifth-generation (5G) wireless communication systems, as a promising approach to simultaneously and efficiently support a massive number of users with high data rates, low latency, and secure transmission [1]-[3]. However, implementing a massive MIMO base station (BS) is challenging: when a conventional large-scale antenna array is used at the BS, scaling conventional MIMO systems up by many orders of magnitude incurs high hardware cost, constrained physical size, and increased power consumption.
On the other hand, the reconfigurable intelligent surface (RIS), benefiting from breakthroughs in the fabrication of programmable metamaterials, has been envisioned as one of the key enabling technologies for future sixth-generation (6G) wireless communication systems, scaling up beyond massive MIMO to achieve smart radio environments [4]-[10]. Metamaterial based RISs make compact wideband antennas possible, such that large-scale antenna arrays can be easily deployed at both the user devices and the BS, achieving massive MIMO gains with a significant reduction in power consumption. With the help of varactor diodes or other micro-electromechanical systems (MEMS) technologies, the electromagnetic (EM) properties of the RIS are fully defined by its microstructure and can be programmed to vary the phase, amplitude, frequency, and even orbital angular momentum of an EM wave, effectively modulating a radio signal without a mixer or radio frequency (RF) chain.
The RIS can be deployed as a reconfigurable transmitter, receiver, or passive reflecting array. As a reflecting array, the RIS is usually placed between the BS and single-antenna receivers and consists of a vast number of nearly passive, low-cost, low-energy reflecting elements, each of which introduces a certain phase shift to the signals impinging on it. By reconfiguring the phase shifts of the RIS elements, the reflected signals can be added constructively at the desired receiver to enhance the received signal power, or destructively at non-intended receivers to reduce co-channel interference. Due to its low power consumption, a reflecting RIS can be fabricated in a very compact size with light weight, allowing easy installation on building facades, ceilings, moving trains, lamp poles, road signs, etc., as well as ready integration into existing communication systems with minor hardware modifications [10]-[14].
Note that passive reflecting surfaces have been used in radar systems for many years. However, the phase shifts of passive radars cannot be adjusted once fabricated, so signal propagation cannot be programmed by controlling the phase shifts of the antenna elements. The reflecting RIS also differs from relaying systems: the RIS reflecting array only alters signal propagation by reconfiguring the constituent meta-atoms of its metasurfaces, without RF chains and without additional thermal noise added during reflection, whereas a relay requires active RF components for signal reception and emission. Consequently, the beamforming design at relay nodes is classified as active, while it is passive in reflecting-RIS-assisted systems.
I-A. Prior Works
Although the RIS has gained considerable attention in recent years, most reported works focus primarily on implementing hardware testbeds, e.g., reflectarrays and metasurfaces, and on point-to-point experimental tests [9], [10]. More recently, several works have investigated optimizing the performance of RIS-assisted MIMO systems. The optimal receiver and matched filter (MF) were investigated for uplink RIS-assisted MIMO systems in [8], where the RIS is deployed as a MIMO receiver. An index modulation (IM) scheme exploiting the programmable nature of the RIS was proposed in [13], where it was shown that RIS based IM enables high data rates with remarkably low error rates.
When RISs are utilized as reflecting arrays, the error performance achieved by a reflecting-RIS-assisted single-antenna transmitter/receiver system was derived in [14]. A locally optimal joint design of transmit beamforming at the BS and discrete phase shifts at the reflecting RIS was proposed in [15] for reflecting-RIS-assisted single-user multiple-input single-output (MISO) systems, by solving the transmit power minimization problem with an alternating optimization technique. The received signal power maximization problem for MISO systems with a reflecting RIS was formulated and studied in [16] through the design of transmit beamforming and phase shifts, employing efficient fixed-point iteration and manifold optimization techniques. The authors in [17] derived a closed-form expression for the phase shifts of reflecting-RIS-assisted MISO systems when only statistical channel state information (CSI) is available. Compressive-sensing-based channel estimation was studied in [18] for reflecting-RIS-assisted MISO systems with a single-antenna transmitter/receiver, and a deep learning based algorithm was proposed to obtain the phase shifts. In [19] and [20], the transmit beamforming and the phase shifts were designed to maximize the secrecy rate of reflecting-RIS-assisted MIMO systems with one legitimate receiver and one eavesdropper, employing various optimization techniques.

All of the above-mentioned works focus on single-user MISO systems. Where multiple users and massive access are concerned, the transmit beamforming and the phase shifts were studied in [21] and [22] by solving sum-rate/energy-efficiency maximization problems, assuming a zero-forcing (ZF) based algorithm at the BS, while stochastic gradient descent (SGD) search and sequential fractional programming were utilized to obtain the phase shifts. In [23], the transmit beamforming and phase shifts were obtained by minimizing the total transmit power while guaranteeing each user's signal-to-interference-plus-noise ratio (SINR) constraint, utilizing semidefinite relaxation and alternating optimization techniques. In [24], the fractional programming method was used to find the transmit beamforming matrix, and three efficient algorithms were developed to optimize the phase shifts. In [25], large-system analysis was exploited to derive a closed-form expression for the minimum SINR when only the spatial correlation matrices of the RIS elements are available; the authors then maximized the minimum SINR by optimizing the phase shifts based on the derived expression. In [26], the weighted sum rate of all users in a multicell MIMO setting was investigated by jointly optimizing the transmit beamforming and the phase shifts subject to each BS's power constraint and the unit-modulus constraint.

Recently, model-free artificial intelligence (AI) has emerged as a remarkable technology for handling explosive volumes of data, mathematically intractable nonlinear non-convex problems, and high-computation issues [27]-[30]. Overwhelming interest in applying AI to the design and optimization of wireless communication systems has been witnessed recently, and it is a consensus that AI will be at the heart of future wireless communication systems (e.g., 6G and beyond) [31]-[40]. AI is most appealing for large-scale MIMO systems with massive numbers of array elements, where optimization problems become non-trivial due to the extremely large optimization dimensions involved. In particular, deep learning (DL) has been used to obtain the beamforming matrix for MIMO systems by building a mapping between channel information and the precoding design [34]-[37]. DL based approaches are able to significantly reduce complexity and computation time via offline prediction, but often require an exhaustive sample library for training. Meanwhile, the deep reinforcement learning (DRL) technique, which embraces the advantage of DL in neural network training and improves the learning speed and performance of reinforcement learning (RL) algorithms, has also been adopted in designing wireless communication systems [29], [32], [38]-[40]. DRL is particularly beneficial for wireless communication systems whose radio channels vary over time: it allows such systems to learn and build knowledge about the radio channels without knowing the channel model or mobility pattern, leading to efficient algorithm designs that observe rewards from the environment and find solutions to sophisticated optimization problems. In [38], the hybrid beamforming matrices at the BS were obtained by applying DRL, where the sum rate and the elements of the beamforming matrices are taken as states and actions. In [40], the cell vectorization problem is cast as an optimal beamforming matrix selection to optimize network coverage, utilizing DRL to track the user distribution pattern. In [39], the joint design of beamforming, power control, and interference coordination was formulated as a non-convex SINR maximization problem and solved by DRL.

I-B. Contributions
In this paper, we investigate the joint design of transmit beamforming at the BS and phase shifts at the reflecting RIS to maximize the sum rate of multi-user downlink MISO systems utilizing DRL, assuming that direct transmission between the BS and the users is totally blocked. This optimization problem is non-convex due to the multi-user interference, and the optimal solution is unknown. We develop a DRL based algorithm to find a feasible solution without resorting to sophisticated mathematical formulations or numerical optimization techniques. Specifically, we use the policy-based deep deterministic policy gradient (DDPG) method, derived from the Markov decision process framework, to handle the continuous beamforming matrix and phase shifts [41]. The main contributions of this paper are summarized as follows:

We propose a new joint design of transmit beamforming and phase shifts based on recent advances in the DRL technique. This paper is a very early attempt to formulate a framework that incorporates DRL into optimal designs for reflecting-RIS-assisted MIMO systems, in order to address large-dimension optimization problems.
The proposed DRL based algorithm has a standard formulation and low implementation complexity, and requires neither an explicit model of the wireless environment nor specific mathematical formulations, so that it is easily scaled to various system settings. Moreover, in contrast to DL based algorithms, which rely on sample labels obtained from mathematically formulated algorithms, DRL based algorithms are able to learn knowledge about the environment and adapt to it.
Unlike reported works that use alternating optimization techniques to obtain the transmit beamforming and the phase shifts alternately, the proposed algorithm obtains the transmit beamforming matrix and the phase shifts jointly, as outputs of the DRL algorithm. Specifically, the sum rate is used as the instant reward to train the DRL based algorithm. The transmit beamforming matrix and the phase shifts are jointly obtained by gradually maximizing the sum rate through observing the reward and iteratively adjusting the parameters of the proposed DRL algorithm accordingly. Since the transmit beamforming matrix and the phase shifts are continuous, we resort to DDPG to develop our algorithm, in contrast to designs that address a discrete action space.
Simulations show that the proposed algorithm is able to learn from the environment by observing the instant rewards and to improve its behavior step by step toward the optimal transmit beamforming matrix and phase shifts. It is also observed that appropriate neural network parameter settings significantly increase the performance and convergence rate of the proposed algorithm.
The rest of the paper is organized as follows. The system model is described in Section II. Preliminary knowledge of DRL is given in Section III, and the DRL based algorithm for the joint design of transmit beamforming and phase shifts is presented in Section IV. Simulation results are provided in Section V to verify the performance of the proposed algorithm, and conclusions are drawn in Section VI.
The notations used in this paper are listed as follows. E[·] denotes the statistical expectation. For any general matrix A, A_{i,j} denotes the entry at the i-th row and j-th column, and A^T and A^H represent the transpose and conjugate transpose of A, respectively. A(t) is the value of A at time t, a_k is the k-th column vector of A, and tr(·) is the trace of the enclosed item. For any column vector a (all vectors in this paper are column vectors), a_i is the i-th entry, while h_k is the channel vector for the k-th user. ||a|| denotes the magnitude of the vector. |x| denotes the absolute value of a complex number x, and Re(x) and Im(x) denote its real and imaginary parts, respectively.
II. System Model and Problem Formulation
We consider a MISO system comprised of a BS, one reflecting RIS, and multiple users, as shown in Fig. 1. The BS has M antennas and communicates with K single-antenna users. The reflecting RIS is equipped with N reflecting elements and one microcontroller. K data streams are transmitted simultaneously from the M antennas of the BS, each targeted at one of the K users. The signals first arrive at the reflecting RIS and are then reflected by it. Direct signal transmission between the BS and the users is assumed to be negligible. This is reasonable since, in practice, the reflecting RIS is generally deployed to overcome situations where severe signal blockage occurs between the BS and the users. The RIS functions as a reflecting array, equivalent to introducing phase shifts to impinging signals. Being an intelligent surface, the reflecting RIS can be programmed to vary the phase shifts based on the wireless environment, through electronic circuits integrated in the metasurfaces.
We assume that the channel matrix G from the BS to the reflecting RIS and the channel vectors h_k, for all k = 1, ..., K, from the RIS to the users are perfectly known at both the BS and the RIS, with the aid of pilot signal transmission and feedback channels. It should be noted that obtaining CSI at the RIS is a challenging task, which requires that the RIS have the capability to transmit and receive signals; this is contradictory to the claim that the RIS does not need RF chains. One solution is to install RF chains dedicated to channel estimation. To this end, the system should be carefully designed to trade off performance and cost, which is beyond the scope of this paper.
Assume frequency-flat channel fading. The signal received at the k-th user is given as

y_k = h_k^H Φ G W x + n_k,    (1)

where y_k denotes the signal received at the k-th user, x is a column vector of dimension K consisting of the data streams transmitted to all the K users, with zero-mean, unit-variance entries, E[x x^H] = I_K, G is the N × M channel matrix from the BS to the RIS, and h_k is the N × 1 channel vector from the RIS to the k-th user. W is the M × K beamforming matrix applied at the BS, while Φ is the N × N phase shift matrix applied at the reflecting RIS. n_k is the zero-mean additive white Gaussian noise (AWGN) with variance σ².

Note that Φ = diag(φ_1, ..., φ_N) is a diagonal matrix whose entries are given by φ_n = e^{jθ_n}, where θ_n is the phase shift induced by the n-th element of the RIS. Here we assume ideal reflection by the RIS, such that the signal power is lossless at each reflecting element, i.e., |φ_n| = 1; the reflection then results in a phase shift of the impinging signals only. In this paper, we consider continuous phase shifts, θ_n ∈ [0, 2π), for the development of the DRL based algorithm.
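As an illustration, the diagonal phase-shift matrix and the noiseless part of the received-signal model in (1) can be sketched as follows; the dimensions M, K, N and the random channels below are illustrative assumptions, not the simulation settings of this paper:

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 4, 2, 16  # BS antennas, users, RIS elements (illustrative sizes)

# Rayleigh-fading channels: G (RIS x BS), h[:, k] is the RIS -> user-k channel
G = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)
h = (rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))) / np.sqrt(2)
W = (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))) / np.sqrt(2)

theta = rng.uniform(0.0, 2 * np.pi, N)   # continuous phase shifts in [0, 2*pi)
Phi = np.diag(np.exp(1j * theta))        # ideal lossless reflection: |phi_n| = 1

x = rng.standard_normal(K) + 0j          # unit-variance data symbols (illustrative)
y = h.conj().T @ Phi @ G @ W @ x         # noiseless received signals, one per user
```

Each entry of y corresponds to one user's received signal before the AWGN term n_k is added.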
From (1), it can be seen that, compared to MISO relaying systems, reflecting-RIS-assisted MISO systems do not introduce AWGN at the RIS. This is because the RIS acts as a passive mirror, simply reflecting the signals incident on it without signal decoding and encoding; the phases of signals impinging on the RIS are reconfigured through the microcontroller connected to the RIS. It is also clear that the signals arriving at the users experience the composite channel fading h_k^H Φ G. Compared to point-to-point wireless communications, this composite channel fading results in more severe signal loss if there is no signal compensation at the RIS.
To maintain the transmission power at the BS, the following constraint is considered:

tr(W W^H) ≤ P_t,    (2)

where P_t is the total transmission power allowed at the BS.
The received signal model (1) can be further written as

y_k = h_k^H Φ G w_k x_k + Σ_{j≠k} h_k^H Φ G w_j x_j + n_k,    (3)

where w_k is the k-th column vector of the matrix W.
Without joint detection of the data streams for all users, the second term of (3) is treated as co-channel interference. The SINR at the k-th user is given by

SINR_k = |h_k^H Φ G w_k|² / ( Σ_{j≠k} |h_k^H Φ G w_j|² + σ² ).    (4)
In this paper, we adopt the sum rate, as given in (5), as the metric to evaluate system performance,

C(Φ, W) = Σ_{k=1}^{K} R_k,    (5)

where R_k is the data rate of the k-th user, given by R_k = log₂(1 + SINR_k). Unlike traditional beamforming design and phase shift optimization algorithms for RIS based systems, which require full up-to-date CSI at every iteration of an offline optimization, our objective is to find the optimal W and Φ that maximize the sum rate for a given CSI realization, leveraging recent advances in DRL. Unlike conventional deep neural networks (DNNs), which need two phases, an offline training phase and an online learning phase, our proposed DRL method uses each CSI realization to construct the state and runs the algorithm to obtain the two matrices continuously. The optimization problem can be formulated as
max_{W, Φ} C(Φ, W),  s.t. tr(W W^H) ≤ P_t and |φ_n| = 1, n = 1, ..., N.    (6)

It can be seen that (6) is a non-convex, non-trivial optimization problem, due to the non-convex objective function and the unit-modulus constraint. If classical mathematical tools were used, exhaustive search would be required to obtain the optimal solution, which is infeasible, particularly for large-scale networks. Instead, algorithms are generally developed to find suboptimal solutions employing alternating optimization techniques, where in each iteration a suboptimal W is found with Φ fixed [15]-[20], and a suboptimal Φ is then derived with W fixed, until the algorithm converges. In this paper, rather than directly solving this challenging optimization problem mathematically, we formulate the sum rate optimization problem in the context of an advanced DRL method to obtain feasible W and Φ.
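To make the objective concrete, the sum rate in (5) can be evaluated directly from the channels, the beamforming matrix, and the phase shifts; the following sketch (the function name and a normalized noise power are our own assumptions) computes the SINR of (4) for each user:

```python
import numpy as np

def sum_rate(G, h, W, Phi, sigma2=1.0):
    """Sum rate of (5): sum_k log2(1 + SINR_k), with SINR_k from (4).

    G: N x M channel (BS -> RIS), h: N x K channels (RIS -> users),
    W: M x K transmit beamforming, Phi: N x N diagonal phase-shift matrix.
    """
    K = h.shape[1]
    H_eff = h.conj().T @ Phi @ G            # K x M composite channels h_k^H Phi G
    rate = 0.0
    for k in range(K):
        gains = np.abs(H_eff[k] @ W) ** 2   # |h_k^H Phi G w_j|^2 for all j
        sinr = gains[k] / (gains.sum() - gains[k] + sigma2)
        rate += np.log2(1.0 + sinr)
    return rate
```

In the proposed algorithm, this quantity serves as the instant reward observed after each action.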
III. Preliminary Knowledge of DRL
In this section, we briefly describe the background of DRL which builds up the foundation for the proposed joint design of transmit beamforming and phase shifts.
III-A. Overview of DRL
In typical RL, the agent gradually derives its best action through trial-and-error interactions with the environment over time: applying actions to the environment, and observing the instant rewards and the state transitions of the environment, as shown in Fig. 2. A few basic elements fully characterize the RL learning process: the state, the action, the instant reward, the policy, and the value function.
(1) State: a set of observations characterizing the environment. The state s^(t) denotes the observation at time step t.
(2) Action: a set of choices. The agent takes one action at each step of the learning process. Once the agent takes an action a^(t) at time instant t following a policy π, the state of the environment transits from the current state s^(t) to the next state s^(t+1). As a result, the agent receives a reward r^(t).
(3) Reward: the return r^(t) the agent acquires by taking action a^(t) given state s^(t). It is also a performance metric evaluating how good the action is for a given state at time instant t.
(4) Policy: the policy π(s^(t), a^(t)) denotes the probability of taking action a^(t) conditioned on the state s^(t). Note that the policy function satisfies Σ_a π(s^(t), a) = 1.

(5) State-action value function: the value of being in state s^(t) and taking action a^(t). The reward measures the immediate return from action a^(t) given state s^(t), whereas the value function measures the potential future rewards the agent may obtain by taking action a^(t) in state s^(t).
(6) Experience: defined as (s^(t), a^(t), r^(t), s^(t+1)).
We adopt the function Q(s^(t), a^(t)) as the state-action value function. Given the state s^(t), the action a^(t), and the instant reward r^(t) at time t, the value function is given as

Q_π(s^(t), a^(t)) = E_π[ Σ_{i=0}^{∞} γ^i r^(t+i) | s^(t), a^(t) ],    (7)

where γ ∈ (0, 1] is the discount rate. The Q function is a metric evaluating the impact of the choice of action a^(t) on the expected future cumulative discounted reward achieved by the learning process under the policy π.
The Q function satisfies the Bellman equation given by

Q_π(s^(t), a^(t)) = E[ r^(t) + γ Q_π(s^(t+1), a^(t+1)) | s^(t), a^(t) ],    (8)

where the expectation is taken over the transition probability P(s^(t+1) | s^(t), a^(t)) from state s^(t) to state s^(t+1) with action a^(t) being taken, and over the policy π.
The Q-learning algorithm searches for the optimal policy π*. From (8), the optimal Q function associated with the optimal policy becomes

Q_{π*}(s^(t), a^(t)) = E[ r^(t) + γ max_{a'} Q_{π*}(s^(t+1), a') | s^(t), a^(t) ].    (9)
The Bellman equation (9) can be solved recursively to obtain the optimal Q function, without knowledge of the exact reward model or the state transition model. The update of the Q function is given as

Q(s^(t), a^(t)) ← (1 − μ) Q(s^(t), a^(t)) + μ ( r^(t) + γ max_{a'} Q(s^(t+1), a') ),    (10)

where μ is the learning rate for the update of the Q function.
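For a finite state/action space, the update in (10) reduces to a one-line rule applied to a Q table; a minimal sketch with toy sizes (names and values are illustrative):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, mu=0.1, gamma=0.99):
    """One application of (10):
    Q(s,a) <- (1 - mu) * Q(s,a) + mu * (r + gamma * max_a' Q(s',a'))."""
    Q[s, a] = (1 - mu) * Q[s, a] + mu * (r + gamma * Q[s_next].max())
    return Q

Q = np.zeros((3, 2))                         # toy table: 3 states, 2 actions
Q = q_update(Q, s=0, a=1, r=1.0, s_next=2)   # Q[0, 1] becomes 0.1
```

Repeating this update while visiting all state-action pairs drives Q toward the fixed point of (9).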
If Q(s, a) is updated at every time instant, it will converge to the optimal state-action value function Q*(s, a). However, this is not easily achieved, particularly with large-dimension state and action spaces. Instead, function approximation is usually used to address problems with enormous state/action spaces. Popular function approximators include feature representations, neural networks, and functions directly relating the value function to the state variables. Rather than utilizing explicit mathematical modeling, a DNN approximates the state-action value function, the policy function, and the system model as a composition of many nonlinear functions, as shown in Fig. 2, where both the Q function and the action are approximated by DNNs. However, neural network based approximation does not provide any interpretation, and the resulting DRL based algorithm may also converge to a local optimum due to sample correlation and non-stationary targets.
One key issue when using a neural network as the Q function approximator is that consecutive states are highly correlated in time, which reduces the randomness of the training samples since they are all extracted from the same episode. Experience replay, a buffer storing the last few experiences, can considerably improve the performance of DRL: instead of updating from the last state only, the DNN updates from a mini-batch of experiences sampled at random from the replay buffer.
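The replay mechanism described above amounts to a fixed-size buffer sampled uniformly at random; a minimal sketch (the class name and capacity are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay: store (s, a, r, s') tuples and sample random
    mini-batches to break the temporal correlation of consecutive states."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

buf = ReplayBuffer(capacity=100_000)
for t in range(100):
    buf.add((t, 0, 1.0, t + 1))   # toy (s, a, r, s') entries
batch = buf.sample(16)
```

Bounding the buffer with `maxlen` keeps only recent experiences, which matters when the channel realization changes over time.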
With DRL, the value function is completely determined by a parameter vector θ,

Q(s^(t), a^(t)) ≈ Q(s^(t), a^(t); θ),    (11)

where θ collects the weighting and bias parameters of the neural network. Rather than updating the Q function directly as in (10), with DRL the optimal Q value function can be approached by updating θ using stochastic optimization algorithms,

θ^(t+1) = θ^(t) − μ_θ ∇_θ L(θ^(t)),    (12)

where μ_θ is the learning rate for the update of θ and ∇_θ L(θ) is the gradient of the loss function L(θ) with respect to θ.

The loss function is generally given as the difference between the value predicted by the neural network and the actual target value. However, since reinforcement learning is a process of learning to approach the optimal value function, the actual target value is not known. To address this problem, two neural networks with identical architectures are defined, the training neural network and the target neural network, whose value functions are respectively given by Q(s, a; θ_train) and Q(s, a; θ_target). The target neural network is synchronized to the training neural network at a predetermined frequency. The actual target value is estimated as
y^(t) = r^(t) + γ max_{a'} Q(s^(t+1), a'; θ_target).    (13)

The loss function is thus given by

L(θ_train) = ( y^(t) − Q(s^(t), a^(t); θ_train) )².    (14)
III-B. DDPG
As the proposed joint design of transmit beamforming and phase shifts is cast as a DRL optimization problem, its most challenging aspect is the continuous state and action spaces. To address this issue, we exploit the DDPG neural network to solve our optimization problem, as shown in Fig. 2. There are two DNNs in the DDPG network: the actor network and the critic network. The actor network takes the state as input and outputs the continuous action, which is in turn input to the critic network together with the state. The actor network approximates the action, thus eliminating the need to find the action maximizing the Q value function for the next state, which would involve non-convex optimization.
The updates of the training critic network are given as follows:

q^(t) = r^(t) + γ Q(s^(t+1), a'; θ_c^target),  a' = π(s^(t+1); θ_a^target),    (15)

θ_c^train ← θ_c^train − μ_c ∇_{θ_c^train} ( q^(t) − Q(s^(t), a^(t); θ_c^train) )²,    (16)

where μ_c is the learning rate for the update of the training critic network, a' is the action output by the target actor network, and ∇_{θ_c^train} denotes the gradient with respect to the training critic network parameters θ_c^train. Q(·; θ_c^train) and Q(·; θ_c^target) denote the training and the target critic networks, respectively, where the parameters of the target network are updated toward those of the training network at certain time slots; the update of the target network is much slower than that of the training network. The update of the training actor network is given as
θ_a^train ← θ_a^train + μ_a ∇_a Q(s^(t), a; θ_c^target)|_{a = π(s^(t); θ_a^train)} ∇_{θ_a^train} π(s^(t); θ_a^train),    (17)

where μ_a is the learning rate for the update of the training actor network, π(s^(t); θ_a^train) denotes the training actor network with DNN parameters θ_a^train and input s^(t), ∇_a Q(s^(t), a; θ_c^target) is the gradient of the target critic network with respect to the action, and ∇_{θ_a^train} π is the gradient of the training actor network with respect to its parameters θ_a^train. It can be seen from (17) that the update of the training actor network is driven by the target critic network, through the gradient of the target critic network with respect to the action, which steers the next action selection in the direction that improves the value function.
The updates of the target critic network and the target actor network are given as follows, respectively:

θ_c^target ← τ_c θ_c^train + (1 − τ_c) θ_c^target,   θ_a^target ← τ_a θ_a^train + (1 − τ_a) θ_a^target,    (18)

where τ_c and τ_a are the learning rates for updating the target critic network and the target actor network, respectively.
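The slow target-network updates in (18) are simple convex combinations of the training and target parameters; a sketch using scalar "parameters" (the τ value below is illustrative):

```python
def soft_update(target_params, train_params, tau=0.001):
    """Target update of (18): theta_target <- tau * theta_train
    + (1 - tau) * theta_target, applied element-wise to every parameter."""
    return [tau * w_train + (1.0 - tau) * w_target
            for w_train, w_target in zip(train_params, target_params)]

target = [0.0, 1.0]
train = [1.0, 0.0]
target = soft_update(target, train, tau=0.1)   # -> [0.1, 0.9]
```

A small τ makes the target networks trail the training networks slowly, which stabilizes the bootstrapped targets in (15).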
IV. DRL Based Joint Design of Transmit Beamforming and Phase Shifts
In this section, we present the proposed DRL based algorithm for the joint design of transmit beamforming and phase shifts, utilizing the DDPG neural network structure shown in Fig. 3. The DRL algorithm is driven by two DNNs together with the state s^(t), the action a^(t), and the instant reward r^(t). First we introduce the structure of the proposed DNNs, followed by a detailed description of the state, the action, the reward, and the algorithm.
IV-A. Construction of DNN
The structures of the DNNs utilized in this paper are shown in Fig. 3. As can be seen, both the critic network and the actor network are fully connected deep neural networks with identical structures, comprised of one input layer, one output layer, and two hidden layers. The input and output dimensions of the critic network equal the cardinality of the state set together with the action set, and the dimension of the Q value function, respectively. The input and output dimensions of the actor network are the cardinality of the state and of the action, respectively. The number of neurons in the hidden layers depends on the number of users, the number of antennas at the BS, and the number of elements at the RIS; in general, it must be larger than the input and output dimensions. The action output from the actor network is input to hidden layer 2 of the critic network, to avoid implementation issues in the computation of the gradient of the critic output with respect to the action.

Note that correlation between entries of the state will degrade the efficiency of the neural network as a function approximator. To overcome this problem, prior to being input to both the critic and the actor networks, the state s^(t) goes through a whitening process to remove the correlation between its entries.
To overcome the variation in the distribution of each layer's inputs caused by parameter changes in the previous layers, batch normalization is utilized at the hidden layers. Batch normalization allows much higher learning rates and less careful initialization, and in some cases eliminates the need for dropout.
The activation function utilized here is tanh, in order to accommodate negative inputs. The optimizer used for both the training critic network and the training actor network is Adam, with adaptive learning rates in which the learning rates of the training critic and actor networks decay at rates λ_c and λ_a, respectively.

Note that W should satisfy the power constraint defined in (6). To implement this, a normalization layer is employed at the output of the actor network, where W is scaled such that tr(W W^H) = P_t. For Φ, |φ_n| = 1 is maintained to ensure signal reflection without power consumption.
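The constraint handling described above can be sketched as a projection applied to the raw actor output; the function name and the exact scaling are our assumptions, since the text only states that a normalization layer enforces (2) and the unit-modulus condition:

```python
import numpy as np

def project_action(W_raw, theta, P_t):
    """Scale W so that tr(W W^H) = P_t, and map phases to unit-modulus
    reflection coefficients phi_n = exp(j * theta_n)."""
    power = np.real(np.trace(W_raw @ W_raw.conj().T))
    W = W_raw * np.sqrt(P_t / power)       # power normalization for constraint (2)
    Phi = np.diag(np.exp(1j * theta))      # unit-modulus phase-shift matrix
    return W, Phi
```

Applying this projection at the actor output guarantees that every action explored by the agent is feasible for problem (6).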
IV-B. Algorithm Description
Assume there exists a central controller, or agent, which is able to instantaneously collect the channel information G and h_k. At time step t, given the channel information and the action W and Φ of the previous state, the agent constructs the state s^(t) for time step t following Section IV-B1.
At the beginning of the algorithm, the experience replay buffer, the critic and actor network parameters θ_c and θ_a, and the action W and Φ need to be initialized. In this paper, we simply adopt identity matrices to initialize W and Φ. The algorithm is run over a number of episodes, and each episode iterates over a number of steps. Each episode terminates when the algorithm converges or reaches the maximum number of allowable steps. The optimal W and Φ are obtained as the action with the best instant reward. Note that the purpose of this algorithm is to obtain the optimal W and Φ utilizing DRL, rather than to train a neural network for online processing. The details of the proposed method are shown in Algorithm 1.
The construction of the state s^(t), the action a^(t), and the instant reward r^(t) is described in detail as follows.
IV-B1. State
The state s^(t) at time step t is determined by the transmission power at the t-th step, the received power of the users at the t-th step, the action from the (t−1)-th step, and the channels G and h_k. Since the neural network can only take real rather than complex numbers as input, whenever a complex number is involved in the construction of the state s^(t), its real part and imaginary part are separated into independent input ports. Given transmit symbols with unit variance, the transmission power for the k-th user is given by ||Re(w_k)||² + ||Im(w_k)||²; the first term is the contribution from the real part, the second from the imaginary part, and both are used as independent input ports to the critic and actor networks. In total, 2K entries of the state are formed by the transmission powers. The received power at the k-th user contributed by the j-th user is given by |Re(h_k^H Φ G w_j)|² + |Im(h_k^H Φ G w_j)|². Likewise, the powers contributed by the real part and by the imaginary part are used as independent input ports, so the total number of entries formed here is 2K². The real and imaginary parts of each entry of W and Φ are also used as entries of the state; the total number of state entries constructed from the action at the (t−1)-th step is 2(MK + N). Finally, the entries contributed by G and h_k, k = 1, ..., K, number 2N(M + K).

In summary, the dimension of the state space is 2K + 2K² + 2(MK + N) + 2N(M + K). The reason we differentiate the power contributions of the real and imaginary parts is that both W and Φ are matrices with complex entries, and using only the total transmission and received powers would lose information due to the absolute-value operator.
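The bookkeeping above gives the state and action dimensions as simple functions of the system sizes; a sketch using our reconstructed counts (2K transmit-power, 2K² received-power, 2(MK+N) action, and 2N(M+K) channel entries):

```python
def state_dim(M, K, N):
    """State-space dimension: 2K transmit-power entries, 2K^2 received-power
    entries, 2(MK + N) action entries, and 2N(M + K) channel entries."""
    return 2 * K + 2 * K ** 2 + 2 * (M * K + N) + 2 * N * (M + K)

def action_dim(M, K, N):
    """Action-space dimension: real/imaginary parts of W and of the N phases."""
    return 2 * M * K + 2 * N
```

These dimensions fix the input layer of the actor network and, together, the input layer of the critic network.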
IV-B2. Action
The action a^(t) is simply constructed from the transmit beamforming matrix W and the phase shift matrix Φ. Likewise, to satisfy the real-input requirement, W and Φ are separated into real and imaginary parts, both of which form entries of the action. The dimension of the action space is 2MK + 2N.
IV-B3. Reward
At the t-th step of the DRL algorithm, the reward r^(t) is determined as the sum rate C(Φ, W), given the instantaneous channels G and h_k and the action W and Φ obtained from the actor network.
TABLE I: Hyperparameters of the proposed DRL algorithm

Description                                                              | Value
-------------------------------------------------------------------------|--------
discount rate for future reward                                          | 0.99
learning rate for training critic network update                         | 0.001
learning rate for training actor network update                          | 0.001
learning rate for target critic network update                           | 0.001
learning rate for target actor network update                            | 0.001
decaying rate for training critic network update                         | 0.00001
decaying rate for training actor network update                          | 0.00001
buffer size for experience replay                                        | 100000
number of episodes                                                       | 5000
number of steps in each episode                                          | 20000
number of experiences in the mini-batch                                  | 16
number of steps for synchronizing the target network with the training network | 1
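For reference, the hyperparameters in Table I can be collected into a single configuration object; the key names below are illustrative conveniences, not identifiers from the paper:

```python
# Table I hyperparameters as a plain configuration dictionary
# (a convenience mirror of the table above).
DDPG_HYPERPARAMS = {
    "gamma": 0.99,             # discount rate for future reward
    "lr_critic": 0.001,        # learning rate, training critic network
    "lr_actor": 0.001,         # learning rate, training actor network
    "lr_target_critic": 0.001, # learning rate, target critic network
    "lr_target_actor": 0.001,  # learning rate, target actor network
    "decay_critic": 0.00001,   # decaying rate, training critic network
    "decay_actor": 0.00001,    # decaying rate, training actor network
    "buffer_size": 100_000,    # experience replay buffer size
    "num_episodes": 5_000,
    "steps_per_episode": 20_000,
    "minibatch_size": 16,
    "target_sync_steps": 1,    # steps between target/training network sync
}
```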
V Numerical Results and Analysis
In this section, we present a performance evaluation of the proposed DRL-based algorithm. In the simulations, we randomly generate the channel matrices following the Rayleigh distribution. We assume that the large-scale path loss and the shadowing effects have been compensated, because the objective of this paper is to develop a framework for the optimal design of the beamforming and phase-shift matrices by employing an advanced DRL technique. Once the framework is in place, the effects of the path loss, the shadowing, the distribution of users, and the direct link from the BS to the users can easily be investigated by scaling the DNNs and reconstructing the state, the action, and the reward. All presented results are averaged over 500 independent realizations.
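Channel realizations of the kind described above can be drawn as i.i.d. circularly-symmetric complex Gaussian entries, whose magnitudes follow the Rayleigh distribution; a minimal sketch (function name is illustrative):

```python
import numpy as np

def rayleigh_channel(rows, cols, rng=None):
    """Draw an i.i.d. Rayleigh-fading channel matrix: entries are
    CN(0, 1), so their magnitudes are Rayleigh distributed."""
    rng = np.random.default_rng(rng)
    return (rng.standard_normal((rows, cols))
            + 1j * rng.standard_normal((rows, cols))) / np.sqrt(2.0)
```

Dividing by sqrt(2) normalizes each entry to unit average power, so no large-scale path loss or shadowing is modeled, matching the assumption stated above.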
V-A Setting and Benchmarks
The hyperparameters used in the algorithm are shown in Table I. We select two state-of-the-art algorithms as benchmarks: the weighted minimum mean square error (WMMSE) algorithm [42], [43] and an iterative algorithm based on fractional programming (FP) with ZF beamforming [4]. In their original forms, both algorithms are centralized and iterative and require full, up-to-date cross-cell CSI. The iterative FP algorithm with ZF beamforming used in this paper is formulated in [4, Algorithm 3]; similarly, a detailed explanation and pseudocode of the WMMSE algorithm is given in [43, Algorithm 1]. The performance of the proposed DRL-based algorithm in comparison with these state-of-the-art benchmarks is illustrated in the following.
V-B Comparisons with Benchmarks
We have evaluated the proposed DRL-based approach described in Algorithm 1 as well as the two benchmarks. Fig. 4 shows the sum rate versus the maximum transmit power for two sets of system parameters. It can be seen that our proposed DRL-based algorithm achieves sum-rate performance comparable to these state-of-the-art benchmarks (WMMSE and the FP optimization algorithm with ZF), and that the sum rates increase with the transmit power under all considered algorithms and scenarios.
To further verify our proposed algorithm in wider application scenarios, we perform another simulation, shown in Fig. 5, which compares the sum rate as a function of the number of RIS elements. It is observed that the average sum rate increases with the number of elements, resulting from the increase in the sum power reflected by the RIS as the number of elements grows. This is achieved at the cost of a more complex RIS implementation. It further indicates that our proposed algorithm is robust across the considered application scenarios and approaches the optimal performance.
V-C Impact of SNR on DRL
To get a better understanding of our proposed DRL-based method, we investigate the impact of the SNR on it in Fig. 6, in which we consider two SNR settings and plot the rewards (instant rewards and average rewards) as a function of time steps. In the simulations, we use the following method to calculate the average rewards:
\bar{r}_t = \frac{1}{t} \sum_{\tau=1}^{t} r_\tau, \quad t = 1, \ldots, T,   (19)
where T is the maximum number of steps and r_\tau denotes the instant reward at step \tau.
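A minimal sketch of this computation, assuming the average reward in Eq. (19) is the cumulative mean of the instant rewards up to each step (this interpretation is an assumption):

```python
import numpy as np

def average_rewards(instant_rewards):
    """Running average of the instant rewards up to each time step:
    r_bar[t] = (1/t) * sum of the first t instant rewards."""
    r = np.asarray(instant_rewards, dtype=float)
    return np.cumsum(r) / np.arange(1, len(r) + 1)
```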
It can be seen that the rewards converge as the time step increases, and that convergence is faster at low SNR than at high SNR. The reason is that, at higher SNR, the dynamic range of the instant rewards is larger, resulting in more fluctuations and slower convergence. These two figures also show that, starting from identity matrices, the DRL-based algorithm is able to learn from the environment and adjust the beamforming and phase-shift matrices to approach the optimal solutions. Furthermore, the average rewards versus time steps under different SNRs are shown in Fig. 7. It can be seen that the SNR has a significant effect on the convergence rate and performance, especially in low-SNR scenarios: the performance gap between the higher SNRs is far smaller than that at the low end. In other words, the proposed DRL method is highly sensitive to low SNR, although it takes less time to converge there.
V-D Impact of System Settings
Similarly, we investigate the impact of the number of RIS elements on the performance of DRL in Fig. 8, which plots the rewards versus time steps. Compared with the transmit power, DRL is more robust to this change of system setting. Specifically, as the number of elements increases, the average rewards also increase gradually, as expected, but this does not increase the convergence time of the DRL method.
Fig. 9 presents the average sum rate as a function of the maximum transmit power. From this figure, we see that the average sum rate increases with the transmit power: as more transmit power is available at the BS, a higher average sum rate can be achieved by the proposed DRL-based algorithm. This observation is aligned with that of conventional multi-user MISO systems. With the joint design of transmit beamforming and phase shifts, the co-channel interference of multi-user MISO systems can be efficiently reduced, resulting in performance that improves with the transmit power.
In Fig. 10, we plot the cumulative distribution function (CDF) of the sum rate over different snapshots for different system settings. The CDF curves confirm the observations from Fig. 9, where the average sum rates improve with the transmission power and the number of RIS elements.
V-E Impact of Learning and Decaying Rates
In our proposed DRL algorithm, we use constant learning and decaying rates for the critic and actor neural networks, and investigate their impact on the performance and convergence rate of the DRL-based method. Fig. 11 shows the average rewards versus time steps under different learning rates, i.e., {0.01, 0.001, 0.0001, 0.00001}. It can be seen that the learning rate has a great influence on the performance of the DRL algorithm. Specifically, the DRL with a learning rate of 0.001 achieves the best performance, although it takes longer to converge than with the 0.0001 and 0.00001 learning rates, while the largest learning rate, 0.01, yields the worst performance. This is because too large a learning rate increases the oscillation, causing the performance to drop dramatically. In short, the learning rate should be selected properly, neither too large nor too small. Fig. 12 compares the average rewards versus time steps under different decaying rates, i.e., {0.001, 0.0001, 0.00001}. The conclusion is similar to that for the learning rate, but the decaying rate exerts less influence on the DRL's performance and convergence rate: although the 0.00001 decaying rate achieves the best performance, the gaps between the curves are significantly narrower.
Finally, we should also point out that the performance of DRL-based algorithms is very sensitive to the initialization of the DNNs and to the other hyperparameters, e.g., the mini-batch size. The hyperparameters need to be tuned carefully for a given system setting, and an appropriate neural-network hyperparameter setting significantly improves both the performance of the proposed DRL algorithm and its convergence rate.
VI Conclusions
In this paper, a new joint design of transmit beamforming and phase shifts based on recent advances in DRL was proposed, formulating a framework that incorporates the DRL technique into the optimal design of reflecting-RIS-assisted MIMO systems to address large-dimension optimization problems. The proposed DRL-based algorithm has a standard formulation and low implementation complexity, and requires no explicit mathematical model of the wireless system; it is therefore easy to scale to various system settings. Moreover, the proposed DRL-based algorithm is able to learn knowledge about the environment, and is robust to it, through trial-and-error interactions with the environment by observing predefined rewards. Unlike most reported works, which utilize alternating optimization techniques to obtain the transmit beamforming and phase shifts in turn, the proposed DRL-based algorithm obtains the joint design simultaneously as the output of the DNNs. Simulation results show that the proposed DRL algorithm is able to learn from the environment by observing the instant rewards, and to improve its behavior step by step toward the optimal transmit beamforming matrix and phase shifts. It is also observed that appropriate neural-network parameter settings significantly improve the performance and convergence rate of the proposed algorithm.
References
 [1] S. Yang and L. Hanzo, “Fifty Years of MIMO Detection: The Road to Large-Scale MIMOs,” IEEE Commun. Surveys Tuts., vol. 17, no. 4, pp. 1941–1988, Sep. 2015.
 [2] E. G. Larsson, F. Tufvesson, O. Edfors, and T. L. Marzetta, “Massive MIMO for Next Generation Wireless Systems,” IEEE Commun. Mag., vol. 52, no. 2, pp. 186–195, Feb. 2014.
 [3] F. Rusek, D. Persson, B. K. Lau, E. G. Larsson, T. L. Marzetta, O. Edfors, and F. Tufvesson, “Scaling up MIMO: Opportunities and Challenges with Very Large Arrays,” IEEE Signal Process. Mag., vol. 30, no. 1, pp. 40–46, Jan. 2013.
 [4] C. Huang, A. Zappone, G. C. Alexandropoulos, M. Debbah, and C. Yuen, “Reconfigurable Intelligent Surfaces for Energy Efficiency in Wireless Communication,” IEEE Trans. Wireless Commun., vol. 18, no. 8, pp. 4157–4170, Aug. 2019.
 [5] J. Zhao, “A Survey of Intelligent Reflecting Surfaces (IRSs): Towards 6G Wireless Communication Networks,” [Online] Available: https://arxiv.org/abs/1907.04789.
 [6] C. Huang, S. Hu, G. C. Alexandropoulos, A. Zappone, C. Yuen, R. Zhang, M. D. Renzo, and M. Debbah, “Holographic MIMO Surfaces for 6G Wireless Networks: Opportunities, Challenges, and Trends,” [Online] Available: https://arxiv.org/abs/1911.12296.
 [7] K. B. Letaief, W. Chen, Y. Shi, J. Zhang, and Y. A. Zhang, “The Roadmap to 6G: AI Empowered Wireless Networks,” IEEE Commun. Mag., vol. 57, no. 8, pp. 84–90, Aug. 2019.
 [8] S. Hu, F. Rusek, and O. Edfors, “Beyond Massive MIMO: The Potential of Data Transmission With Large Intelligent Surfaces,” IEEE Trans. Signal Process., vol. 66, no. 10, pp. 2746–2758, May 2018.
 [9] T. J. Cui, M. Q. Qi, X. Wan, J. Zhao, and Q. Cheng, “Coding metamaterials, digital metamaterials and programmable metamaterials,” Light: Sci. Appl., vol. 3, no. 10, p. e218, Oct. 2014.
 [10] C. Liaskos, S. Nie, A. Tsioliaridou, A. Pitsillides, S. Ioannidis, and I. Akyildiz, “A New Wireless Communication Paradigm through Software-Controlled Metasurfaces,” IEEE Commun. Mag., vol. 56, no. 9, pp. 162–169, Sept. 2018.
 [11] S. Hu, F. Rusek, and O. Edfors, “Beyond Massive MIMO: The Potential of Positioning with Large Intelligent Surfaces,” IEEE Trans. Signal Process., vol. 66, no. 7, pp. 1761–1774, Apr. 2018.
 [12] C. Huang, G. C. Alexandropoulos, C. Yuen, and M. Debbah, “Indoor Signal Focusing with Deep Learning Designed Reconfigurable Intelligent Surfaces,” Proc. IEEE SPAWC, Cannes, France, pp. 1–5, 2019.
 [13] E. Basar, “Reconfigurable Intelligent Surface-Based Index Modulation: A New Beyond MIMO Paradigm for 6G,” [Online] Available: https://arxiv.org/abs/1904.06704.
 [14] E. Basar, “Transmission Through Large Intelligent Surfaces: A New Frontier in Wireless Communications,” Proc. European Conference on Networks and Communications (EuCNC), Valencia, Spain, pp. 112–117, 2019.
 [15] Q. Wu and R. Zhang, “Intelligent Reflecting Surface Enhanced Wireless Network via Joint Active and Passive Beamforming,” IEEE Trans. Wireless Commun., vol. 18, no. 11, pp. 5394–5409, Nov. 2019.
 [16] Y. Gao, C. Yong, Z. Xiong, D. Niyato, Y. Xiao, and J. Zhao, “Reconfigurable Intelligent Surface for MISO Systems with Proportional Rate Constraints,” to appear in Proc. IEEE ICC 2020, Dublin, Ireland, Jun. 2020.
 [17] Y. Han, W. Tang, S. Jin, C. Wen, and X. Ma, “Large Intelligent Surface-Assisted Wireless Communication Exploiting Statistical CSI,” IEEE Trans. Veh. Technol., vol. 68, no. 8, pp. 8238–8242, Aug. 2019.
 [18] A. Taha, M. Alrabeiah, and A. Alkhateeb, “Enabling Large Intelligent Surfaces with Compressive Sensing and Deep Learning,” [Online] Available: https://arxiv.org/abs/1904.10136.
 [19] M. Cui, G. Zhang, and R. Zhang, “Secure Wireless Communication via Intelligent Reflecting Surface,” IEEE Wireless Commun. Lett., vol. 8, no. 5, pp. 1410–1414, Oct. 2019.
 [20] H. Shen, W. Xu, S. Gong, Z. He, and C. Zhao, “Secrecy Rate Maximization for Intelligent Reflecting Surface Assisted Multi-Antenna Communications,” IEEE Commun. Lett., vol. 23, no. 9, pp. 1488–1492, Sept. 2019.
 [21] M. Fu, Y. Zhou, and Y. Shi, “Reconfigurable Intelligent Surface Empowered Downlink Non-Orthogonal Multiple Access,” [Online] Available: https://arxiv.org/abs/1910.07361.
 [22] C. Huang, A. Zappone, M. Debbah, and C. Yuen, “Achievable Rate Maximization by Passive Intelligent Mirrors,” Proc. IEEE ICASSP, pp. 3714–3718, Apr. 2018.
 [23] Q. Wu and R. Zhang, “Beamforming Optimization for Intelligent Reflecting Surface with Discrete Phase Shifts,” Proc. IEEE ICASSP, Brighton, United Kingdom, pp. 7830–7833, 2019.
 [24] H. Guo, Y.-C. Liang, J. Chen, and E. G. Larsson, “Weighted Sum-Rate Optimization for Intelligent Reflecting Surface Enhanced Wireless Networks,” [Online] Available: https://arxiv.org/abs/1905.07920.
 [25] Q. U. A. Nadeem, A. Kammoun, A. Chaaban, M. Debbah, and M.-S. Alouini, “Asymptotic Analysis of Large Intelligent Surface Assisted MIMO Communication,” [Online] Available: https://arxiv.org/abs/1903.08127.
 [26] C. Pan, H. Ren, K. Wang, W. Xu, M. Elkashlan, A. Nallanathan, and L. Hanzo, “Multicell MIMO Communications Relying on Intelligent Reflecting Surface,” [Online] Available: https://arxiv.org/abs/1907.10864.
 [27] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, A Bradford Book, 1998.
 [28] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA: MIT Press, 2016.
 [29] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-Level Control through Deep Reinforcement Learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
 [30] C. Huang, G. C. Alexandropoulos, A. Zappone, C. Yuen, and M. Debbah, “Deep Learning for UL/DL Channel Calibration in Generic Massive MIMO Systems,” Proc. IEEE ICC, Shanghai, China, pp. 1–6, 2019.
 [31] C. Jiang, H. Zhang, Y. Ren, Z. Han, K. Chen, and L. Hanzo, “Machine Learning Paradigms for Next-Generation Wireless Networks,” IEEE Wireless Commun., vol. 24, no. 2, pp. 98–105, Apr. 2017.
 [32] N. C. Luong, D. T. Hoang, S. Gong, D. Niyato, P. Wang, Y.-C. Liang, and D. I. Kim, “Applications of Deep Reinforcement Learning in Communications and Networking: A Survey,” IEEE Commun. Surveys Tuts., vol. 21, no. 4, pp. 3133–3174, 2019.
 [33] T. Lin and Y. Zhu, “Beamforming Design for Large-Scale Antenna Arrays Using Deep Learning,” IEEE Wireless Commun. Lett., vol. 9, no. 1, pp. 103–107, Jan. 2020.
 [34] F. Zhou, G. Lu, M. Wen, Y. Liang, Z. Chu, and Y. Wang, “Dynamic Spectrum Management via Machine Learning: State of the Art, Taxonomy, Challenges, and Open Research Issues,” IEEE Netw., vol. 33, no. 4, pp. 54–62, Jul./Aug. 2019.
 [35] H. Huang, Y. Song, J. Yang, G. Gui, and F. Adachi, “Deep-Learning-Based Millimeter-Wave Massive MIMO for Hybrid Precoding,” IEEE Trans. Veh. Technol., vol. 68, no. 3, pp. 3027–3032, Mar. 2019.
 [36] X. Li and A. Alkhateeb, “Deep Learning for Direct Hybrid Precoding in Millimeter Wave Massive MIMO Systems,” [Online] Available: https://arxiv.org/abs/1905.13212.
 [37] H. Huang, W. Xia, J. Xiong, J. Yang, G. Zheng, and X. Zhu, “Unsupervised Learning-Based Fast Beamforming Design for Downlink MIMO,” IEEE Access, vol. 7, pp. 7599–7605, 2019.
 [38] Y. Zhou, F. Zhou, Y. Wu, R. Q. Hu, and Y. Wang, “Subcarrier Assignment Schemes Based on Q-Learning in Wideband Cognitive Radio Networks,” IEEE Trans. Veh. Technol., vol. 69, no. 1, pp. 1168–1172, Jan. 2020.
 [39] F. B. Mismar, B. L. Evans, and A. Alkhateeb, “Deep Reinforcement Learning for 5G Networks: Joint Beamforming, Power Control, and Interference Coordination,” accepted by IEEE Trans. Commun.
 [40] R. Shafin, H. Chen, Y. H. Nam, S. Hur, J. Park, J. Zhang, J. Reed, and L. Liu, “Self-Tuning Sectorization: Deep Reinforcement Learning Meets Broadcast Beam Optimization,” [Online] Available: https://arxiv.org/abs/1906.06021.
 [41] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous Control with Deep Reinforcement Learning,” [Online] Available: https://arxiv.org/abs/1509.02971.
 [42] Q. Shi, M. Razaviyayn, Z.-Q. Luo, and C. He, “An Iteratively Weighted MMSE Approach to Distributed Sum-Utility Maximization for a MIMO Interfering Broadcast Channel,” IEEE Trans. Signal Process., vol. 59, no. 9, pp. 4331–4340, Sep. 2011.
 [43] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, “Learning to Optimize: Training Deep Neural Networks for Interference Management,” IEEE Trans. Signal Process., vol. 66, no. 20, pp. 5438–5453, Oct. 2018.