Ultra-Reliable Indoor Millimeter Wave Communications using Multiple Artificial Intelligence-Powered Intelligent Surfaces

In this paper, a novel framework for guaranteeing ultra-reliable millimeter wave (mmW) communications using multiple artificial intelligence (AI)-enabled reconfigurable intelligent surfaces (RISs) is proposed. The use of multiple AI-powered RISs allows changing the propagation direction of the signals transmitted from a mmW access point (AP), thereby improving coverage, particularly for non-line-of-sight (NLoS) areas. However, due to the possibility of highly stochastic blockage over mmW links, designing an intelligent controller to jointly optimize the mmW AP beam and the RIS phase shifts is a daunting task. In this regard, first, a parametric risk-sensitive episodic return is proposed to maximize the expected bit rate and mitigate the risk of mmW link blockage. Then, a closed-form approximation of the policy gradient of the risk-sensitive episodic return is analytically derived. Next, the problem of joint beamforming for the mmW AP and phase shift control for the mmW RISs is modeled as an identical payoff stochastic game within a cooperative multi-agent environment, in which the agents are the mmW AP and the RISs. Two centralized and distributed controllers are proposed to control the policies of the mmW AP and RISs. To directly find an optimal solution, the parametric functional-form policies for these controllers are modeled using deep recurrent neural networks (RNNs). Simulation results show that the error between the policies of the optimal and the RNN-based controllers is less than 1.5%, and that the variance of the achievable rates resulting from the deep RNN-based controllers is 60% lower than that of a baseline that does not account for the risk of blockage.


I Introduction

Millimeter wave (mmW) communication is a promising solution to enable high-speed wireless access in 5G wireless networks and beyond [1, 2, 3]. Nevertheless, the high attenuation and scattering of mmW propagation make guaranteeing the coverage of mmW wireless networks very challenging [2]. To overcome these challenges, integrating massive antenna arrays for highly directional beamforming at both the mmW access point (AP) and the user equipment (UE) has been proposed [1, 2]. However, applying beamforming techniques renders directional mmW links very sensitive to random blockage caused by people and objects in a dense environment. This, in turn, gives rise to unstable line-of-sight (LoS) mmW links and unreliable mmW communications [2, 3]. To provide robust LoS coverage, one proposed solution is to deploy ultra-dense APs and active relay nodes that improve link quality by providing multi-connectivity for a given UE [3, 4, 5, 6]. However, the deployment of multiple mmW APs and active relay nodes is not economically feasible and can lead to high control signalling overhead. To decrease signalling overhead and alleviate economic costs while also establishing reliable communications, the use of reconfigurable intelligent surfaces (RISs) has recently been proposed [1, 7, 8, 9, 10, 11, 12]. Note that, when an RIS is used as a passive reflector, it is customary to use the term intelligent reflecting surface (IRS) to indicate this mode of operation. Meanwhile, when an RIS is used as a large surface with an active transmission, the term large intelligent surface (LIS) is commonly used [7, 9, 10, 11].

I-A Prior Works

RISs are man-made surfaces, including conventional reflect-arrays, liquid crystal surfaces, and software-defined meta-surfaces, that are electronically controlled [10], [13]. In an RIS-enabled mmW network, the mmW RISs become software-reconfigurable entities whose operation is optimized to increase the availability of mmW LoS connectivity. Thus, the RISs reflect the mmW signals whenever possible so as to bypass blockages [10]. One of the main challenges in using reconfigurable RISs is how to adjust the phases of the waves reflected from different RISs so that the LoS and reflected mmW signals add coherently and the signal strength of their sum is maximized. In this regard, several recent works, such as [7, 4, 14, 15, 8, 16], have proposed ways to establish reliable mmW links. In [7], the authors present efficient designs for both transmit power allocation and RIS phase shift control. Their goal is to optimize spectrum or energy efficiency subject to individual rate requirements for the UEs. However, the work in [7] does not consider stochastic blockage over mmW LoS links and, thus, its results cannot be generalized to a real-world mmW system. In [4], the authors implement a smart mmWave reflector to provide high data rates between a virtual reality headset and game consoles. To handle beam alignment and tracking between the mmWave reflector and the headset, their proposed method must try every possible combination of mirror transmit beam angle and headset receive beam angle, incurring significant overhead due to this brute-force beam alignment. In [15], the authors designed an RIS consisting of 224 reconfigurable meta-surfaces together with a two-stage phase shift-control algorithm for 802.11ad networks. Their proposed phase shift-control algorithm uses exhaustive search to find the optimal beam angle of the AP and the phase shift of the reflector. However, this exhaustive search is too complex for dynamic mmW networks.

The existing works in [7, 4, 14, 15] and [8] assume static reflectors and do not provide efficient solutions to intelligently and dynamically control the configuration of smart reflectors. Moreover, the goal of [4, 7, 14, 15] and [8] is to increase the coverage probability or the signal-to-noise ratio without mitigating the risk of NLoS mmW links. In practice, an intelligent solution, such as one based on machine learning (ML), is required at the edge of RIS-assisted mmW networks in 5G and beyond [17]. Edge ML empowers mmW RISs and APs to predict unknown future blockages, adaptively control their beamforming and phase shift configurations, and guarantee ultra-reliable mmW communication. Towards this vision, in [16], an intelligent controller based on deep neural networks for configuring mmW RISs is studied. The approach proposed in [16] guarantees ultra-reliable mmW communication and captures the unknown stochastic blockages, but it is limited to a scenario with only one RIS. In practice, however, multiple RISs are needed to cooperatively guarantee LoS coverage for all NLoS areas in a distributed manner. Thus, a new framework for coordinated beamforming and phase-shift control across multiple RISs that guarantees a robust, stable, and near-optimal solution is needed.

I-B Contributions

The main contribution of this paper is a novel framework for guaranteeing ultra-reliable mobile mmW communications using artificial intelligence (AI)-powered RISs. The proposed approach allows the network to autonomously form the transmission beams of the mmW AP and control the phase of the reflected mmW signal in the presence of stochastic blockage over the mmW links. To solve the problem of joint beamforming and phase shift control in an RIS-assisted mmW network while guaranteeing ultra-reliable mmW communications, we formulate a stochastic optimization problem whose goal is to maximize a parametric risk-sensitive episodic return. The parametric risk-sensitive episodic return not only captures the expected bit rate but is also sensitive to the risk of NLoS mmW links over future time slots. Subsequently, we use deep and risk-sensitive reinforcement learning (RL) to solve the problem in an online manner. Next, we model the risk-sensitive reinforcement learning problem as an identical payoff stochastic game in a cooperative multi-agent environment in which the agents are the mmW AP and the RISs [18]. Two centralized and distributed controllers are proposed to control the policies of the mmW AP and RISs in the identical payoff stochastic game. To find an optimal solution, the parametric functional-form policies are implemented using a deep RNN [19] which directly searches for the optimal policies of the beamforming and phase shift controllers. In this regard, we analytically derive a closed-form approximation for the gradient of the risk-sensitive episodic return, and the RNN-based policies are subsequently trained using this derived closed-form gradient. We prove that if the centralized and distributed controllers start from the same strategy profile in the policy space of the proposed identical payoff stochastic game, then the gradient update algorithm will converge to the same locally optimal solution for the deep RNNs. Simulation results show that the error between the policies of the optimal and RNN-based controllers is small, and that the performance of the deep RNN-based centralized and distributed controllers is essentially identical. Moreover, for a high value of the risk sensitivity parameter, the variance of the achievable rates resulting from the deep RNN-based controllers is lower than that of the non-risk-based solution. The main contributions of this paper are summarized as follows:

  • We propose a novel smart control framework based on artificial intelligence for guaranteeing ultra-reliable mobile mmW communications when multiple RISs are used in an indoor scenario. The proposed approach allows the network to autonomously form the transmission beams of the mmW AP and control the phase of the mmW signal reflected from each mmW RIS in the presence of unknown stochastic blockage. In this regard, we formulate a new joint stochastic beamforming and phase shift-control problem in an RIS-assisted mmW network under an ultra-reliable mmW communication constraint. Our objective is to maximize a parametric risk-sensitive episodic return that not only captures the expected bit rate but is also sensitive to the risk of NLoS mmW links over future time slots.

  • We apply both risk-sensitive deep reinforcement learning (RL) and a cooperative multi-agent system to find a solution for the joint stochastic beamforming and phase shift-control problem in an online manner. We model the risk-sensitive reinforcement learning problem as an identical payoff stochastic game in a cooperative multi-agent environment in which the agents are the mmW AP and the RISs. Then, we propose two centralized and distributed control policies for the transmission beams of the mmW AP and the phase shifts of the RISs.

  • To find an optimal solution for our proposed centralized and distributed control policies, we implement parametric functional-form policies using a deep RNN which can directly search for the optimal policies of the beamforming and phase shift controllers. We analytically derive a closed-form approximation for the gradient of the risk-sensitive episodic return, and the RNN-based policies are subsequently trained using this derived closed-form gradient.

  • We mathematically prove that, if the centralized and distributed controllers start from the same strategy profile in the policy space of the proposed identical payoff stochastic game, then the gradient update algorithm will converge to the same locally optimal solution for the deep RNNs. Moreover, we mathematically show that, at the convergence of the gradient update algorithm for the RNN-based policies, the policy profile under the distributed controllers is a Nash equilibrium of the cooperative multi-agent game.

  • Simulation results show that the error between the policies of the optimal and RNN-based controllers is small, and that the performance of the deep RNN-based centralized and distributed controllers is essentially identical. Moreover, for a high value of the risk sensitivity parameter, the variance of the achievable rates resulting from the deep RNN-based controllers is 60% less than that of the non-risk-based solution.

The rest of the paper is organized as follows. Section II presents the system model and the stochastic and risk-sensitive optimization problem in the smart reflector-assisted mmW networks. In Section III, based on the framework of deep and risk-sensitive RL, we propose a deep RNN to solve the stochastic and risk-sensitive optimization problem for the optimal reflector configuration. Then, in Section IV, we numerically evaluate the proposed policy-gradient approach. Finally, conclusions are drawn in Section V.

II System Model and Problem Formulation

II-A System Model

Consider the downlink between a UE and an indoor RIS-assisted mmW network composed of one mmW AP and multiple AI-powered RISs. In this network, due to the blockage of mmW signals, there exist areas where it is not possible to establish LoS mmW links between the mmW AP and the UE, particularly for a mobile user. We call these areas dark areas. The mmW AP and the UE have, respectively, and antennas to form their beams. In our model, there are AI-powered mmW RISs that intelligently reflect the mmW signals from the mmW AP toward the mobile UE located in the dark areas. Each mmW RIS has meta-surfaces. Since the size of an RIS at mmW bands is smaller than the size of a typical indoor environment and than the distance between the user and the RISs, which is often more than 1 meter in an indoor scenario, we can consider far-field characteristics for the mmW signals reflected from an RIS, as in recent works on mmW communication such as [4, 14] and [15]. Here, we consider discrete time slots indexed by . Without loss of generality, we consider the mmW network on a bi-dimensional plane.

The angle of the mmW AP directional antenna at time slot is represented by , where index represents the mmW AP. If the user is in the LoS coverage of the mmW AP, then is matched to the direction of the user toward the mmW AP. In this case, we assume that the AP beam angle is chosen from a set of discrete values. Let the angle between the mmW AP and reflector be . However, if the user moves into a dark area, the mmW AP steers its antenna toward one of the mmW RISs , . Hence, . When mmW RIS receives a signal from the mmW AP, it establishes a LoS mmW link using a controlled reflected path to cover the user in the dark area. Let be the angle of the reflected mmW signal from mmW RIS at time slot . We assume that the reflection angles , for , are chosen from a set of discrete values.

Here, we consider a well-known multi-ray mmW channel model [20]. In this channel model for mmW links, there are rays between the mmW transmitter and receiver, and each ray can be blocked by an obstacle. Thus, for the angle-of-departure (AoD) of the transmission beams from the mmW AP, , and the angle-of-arrival (AoA) of the received beams, , of the mmW rays, the channel matrix over the mmW AP-to-UE link, , at time slot is given by:

(1)

where denotes the array response vector for the AoD at the mmW AP and is the array response vector for the AoA at the UE for ray . Moreover, , where is the path loss and is the complex channel gain of path from the mmW AP to the UE [20]. Here, and , where is the distance between the mmW AP and the UE, is the path loss at a distance of 1 meter, and and represent the slopes of the best linear fit to the propagation measurements in the mmW frequency band for LoS and NLoS mmW links, respectively.

For the AoD of the reflection beams from mmW RIS , of the mmW rays, the channel matrix over the RIS-to-UE mmW path at time slot , , is given by [20]:

(2)

where denotes the array response vector for the AoD at the reflector for ray . Moreover, , where is the path loss and is the complex channel gain of path from the mmW RIS to the UE [20]. Here, and , where is the distance between mmW RIS and the UE.

For the AoA of the received rays at RIS , of the mmW rays, the channel matrix over the mmW AP-to-RIS link at time slot , , is given by [20]:

(3)

where represents the array response vector for the AoA at the reflector for ray . Here, , in which is the path loss and is the complex channel gain of path from the mmW AP to the reflector [20]. Moreover, and , where is the distance between the mmW AP and RIS . Consequently, for a given transmission direction, , and reflection directions, , at time slot , the channel matrix between the mmW AP and the UE over the single mmW AP-to-UE link and the mmW AP-to-RIS-to-UE links is defined as , where is the channel matrix over the mmW AP-to-UE link, and is the channel matrix over the mmW AP-to-reflector-to-UE link resulting from reflector at time slot [13].

In our model, there are two links: a transmission link and a control link. The transmission link uses the mmW band to send data over the AP-to-UE or AP-to-RIS-to-UE links. The sub-6 GHz control link is only used to transmit control signals to the controllers of the mmW AP and RISs. At the beginning of the downlink transmission, since the exact location of the UE is not known to the controller, we apply the three-step low-complexity beam search algorithm presented in [21] to find the AoD of the transmission beam from the mmW AP for , and the complex path gains of either the LoS or NLoS links to the UE. As a result, at the beginning of transmission, if a LoS complex path gain is found, the mmW AP forms its beam directly toward the UE based on the acquired AoD at the mmW AP. However, if an NLoS complex path gain is found, the mmW AP sequentially forms its beam toward the mmW reflectors and the beam search algorithm is applied again. In this case, each mmW reflector changes its reflection angle to sweep the entire dark area, until the LoS path gain between the mmW reflector and the UE, and the initial reflection angles of the RISs, for , are detected. However, in future time slots, the availability of the LoS link as well as the channel gain are random variables with unknown distributions due to the mobility of the user, and the AoA at the UE is a stochastic variable that randomly changes due to unknown factors such as the user's orientation. The operators and denote the transpose and the Hermitian transpose of a vector or matrix, respectively. Consequently, for a given beam angle of the mmW AP, , and reflection directions of the mmW reflectors, , at time slot , the total achievable bit rate over all the paths between the mmW AP and the UE through the mmW RISs is given by [20]:

(4)

where is the transmission power, is the mmW bandwidth, and is the noise density.
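To make the structure of the channel and rate expressions in (1)-(4) concrete, the following minimal NumPy sketch builds a multi-ray channel as a sum of rank-one ray contributions and evaluates a beamformed Shannon-type rate. The array geometry, helper names, and all numerical values are illustrative assumptions and not the exact expressions defined in (1)-(4).

```python
import numpy as np

def ula_response(n_antennas, angle_rad, d=0.5):
    """Uniform-linear-array response vector (antenna spacing d in wavelengths); an assumed geometry."""
    k = np.arange(n_antennas)
    return np.exp(1j * 2 * np.pi * d * k * np.sin(angle_rad)) / np.sqrt(n_antennas)

def multiray_channel(n_rx, n_tx, gains, aoa, aod):
    """Multi-ray channel: sum of rank-one contributions, one per (possibly blocked) ray."""
    H = np.zeros((n_rx, n_tx), dtype=complex)
    for g, th_a, th_d in zip(gains, aoa, aod):
        H += g * np.outer(ula_response(n_rx, th_a), ula_response(n_tx, th_d).conj())
    return H

def achievable_rate(H, w_tx, w_rx, p_tx=1.0, bandwidth=1e9, noise_density=1e-17):
    """Beamformed SNR plugged into a Shannon-type rate, mirroring the structure of (4)."""
    snr = p_tx * np.abs(w_rx.conj() @ H @ w_tx) ** 2 / (noise_density * bandwidth)
    return bandwidth * np.log2(1.0 + snr)

# Illustrative usage: 3 rays between a 128-antenna AP and a 64-antenna UE.
rng = np.random.default_rng(0)
gains = 1e-3 * (rng.normal(size=3) + 1j * rng.normal(size=3))   # path loss times complex gain
aoa = rng.uniform(-np.pi / 2, np.pi / 2, 3)
aod = rng.uniform(-np.pi / 2, np.pi / 2, 3)
H = multiray_channel(64, 128, gains, aoa, aod)
rate = achievable_rate(H, ula_response(128, aod[0]), ula_response(64, aoa[0]))
print(f"achievable rate: {rate / 1e9:.2f} Gbit/s")
```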

Fig. 1 is an illustrative example that shows how one mmW AP and two mmW RISs are used to bypass the blockage during four consecutive time slots and . As seen in Fig. 1, since the user is in the dark area during time slots and for mmW AP, the mmW RISs are used to provide coverage for the user. Here, during two time slots and , the mmW AP transmits the signal toward the reflector , and then this reflector reflects the signal toward the user moving in the dark area . Thus, the beam angles of mmW AP signals are and at time slots and . Then, since the user is moving in the dark area during time slot , the mmW AP transmits the signal toward the mmW RIS , . In this case, the user is not in the LoS coverage of reflector and the reflector reflects the signal toward to cover the user at time slot . As shown in Fig. 1, the user is not in any dark area at time slot and mmW AP can directly transmit the signal over LoS link toward the user, . The list of our main notations is given in Table I.

Figure 1: An illustrative example of the system model with one mmW AP and two mmW RISs.

As a result, the phase shift-control policy must not only consider the unknown future trajectory of mobile users but also adapt itself to the possible stochastic blockages in future time slots. In this case, an intelligent policy for the phase shift-controller, which can predict unknown stochastic blockages, is required for mmW AP and mmW RISs to guarantee ultra-reliable mmW communication, particularly for indoor scenarios with many dark areas.

Symbol Definition
Number of mmW AP antennas
Number of UE antennas
Number of meta-surfaces per mmW RIS
The beam angle of AP at timeslot
The reflection angle of RIS at timeslot
The AoA at RIS
The stochastic AoA at UE at timeslot
The channel matrix of the AP-to-UE mmW path
The channel matrix of the AP-to-reflector-to-UE path
The stochastic complex gain over path at time slot
The beamforming-control policy of the mmW AP
The phase shift-control policy of the mmW RIS
The achievable bit rate
The risk sensitivity parameter
The set of agents
The set of joint action space of the agents
The state of POISG at time slot
Action of agent at time slot
Number of future consecutive time slots
Trajectory of the POIPSG during time slots
Rate summation during consecutive time slots
The global history at time slot
The history for agent at time slot
The parametric functional-form policy of agent
The risk-sensitive episodic return at time slot
Probability of trajectory during time slots under parametric policies
Table I: List of our notations.

II-B Phase-shift controller for RIS-assisted mmW networks

We define as the beamforming-control policy of the mmW AP at time slot , where is essentially the probability that the mmW AP selects the -th beam angle from set at time slot . Next, we define as the phase shift-control policy of the mmW RIS, where is the probability that the mmW RIS selects the -th reflection angle to reflect the received signal from the mmW AP toward the UE at time slot .

Due to the stochastic changes of the mmW blockage between the mmW AP or reflector and the UE, and the random changes in the user's orientation, the transmission and phase shift-control policies at a given slot will depend on unknown future changes in the LoS mmW links. Consequently, to guarantee ultra-reliable mmW links subject to future stochastic changes over the mmW links, we mitigate the risk of blockage instead of merely maximizing the expected future rate. Concretely, we adopt the entropic value-at-risk (EVaR) concept defined as [22]. Here, the operator denotes the expectation. Expanding the Maclaurin series of the and functions shows that EVaR takes into account higher-order moments of the stochastic sum rate during future consecutive time slots [23]. Consequently, we formulate the joint beamforming and phase shift-control problem for an RIS-assisted mmW network as follows:

(5)
(6)
(7)
(8)
(9)

where the parameter denotes the risk sensitivity parameter [23]. In (5), the objective is to maximize the average episodic sum of the future bit rate, , while minimizing its variance to capture the rate distribution, using the joint beamforming and phase shift-control policies of the mmW AP and reflectors during future time slots. The risk sensitivity parameter penalizes the variance and skewness of the episodic sum of the future bit rate. In (5), depends on the beam angle of the mmW AP, the phase shift angles of the mmW RISs, and the unknown AoA from the user's location during the future consecutive time slots.
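For reference, the following is a minimal sketch of the entropic (exponential-utility) risk measure underlying (5), written with assumed symbols since the exact notation was defined in the omitted expressions: μ denotes the risk sensitivity parameter and R the episodic sum rate over the future time slots. Its cumulant (Maclaurin) expansion shows how the mean, variance, and higher-order moments of the rate enter the objective.

```latex
% Entropic (exponential-utility) risk objective in assumed notation:
% \mu is the risk sensitivity parameter, R the episodic sum rate over the future slots.
J_{\mu} \;=\; \frac{1}{\mu}\,\log \mathbb{E}\!\left[e^{\mu R}\right]
\;=\; \mathbb{E}[R] \;+\; \frac{\mu}{2}\,\mathrm{Var}[R] \;+\; \mathcal{O}(\mu^{2})
```

Depending on the sign convention chosen for μ, the higher-order terms penalize or reward the variance and skewness of the rate, which is exactly the risk sensitivity discussed above.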

The joint beamforming and phase shift-control problem in (5) is a stochastic optimization problem that does not admit a closed-form solution and has exponential complexity [24]. The complexity of the stochastic optimization problem in (5) becomes even more significant due to the unknown probabilities of possible random network changes, such as the mmW link blockages and the user's locations [24], as well as the large size of the state-action space. Therefore, we seek a low-complexity control policy for solving (5) that can intelligently adapt to the mmW link dynamics over future time slots. In this regard, we propose a framework based on the principles of risk-sensitive deep RL and cooperative multi-agent systems to solve the optimization problem in (5) with low complexity and in an adaptive manner.

III Intelligent Beamforming and Phase Shift-Control Policy

In this section, we present the proposed gradient-based and adaptive policy search method based on a new deep and risk-sensitive RL framework to solve the joint beamforming and phase shift-control problem in (5) in a coordinated and distributed manner. We model the problem in (5) as an identical payoff stochastic game (IPSG) in a cooperative multi-agent environment [18]. An IPSG describes the interaction of a set of agents in a Markovian environment in which the agents receive the same payoffs [25].

An IPSG is defined as a tuple , where is the state space and is the set of agents, in which index refers to the mmW AP and indexes 1 to refer to the mmW RISs. is the set of joint action spaces of the agents, in which is the set of possible transmission directions for the mmW AP and is the set of possible reflection directions for the mmW RISs. The observation space is the bit rate over the mmW link, . Here, is the stochastic state transition function from states of the environment and joint actions of the agents to probability distributions over states of the environment, . is the immediate identical reward function, and is the initial observation for the controllers of the mmW AP and reflectors [26].

Here, the state includes the complex path gains of all paths and the AoA at the UE at time slot . Due to the dynamics over the mmW paths, the state, , and the state transition function, , are not given to the beamforming controller of the mmW AP and the phase shift-controllers of the mmW RISs. Since the agents in do not have an observation function for all , the game is a partially observable IPSG (POIPSG). Due to the partial observability of the IPSG, a strategy for agent is a mapping from the history of all observations from the beginning of the game to the current action . Hereinafter, we limit our consideration to cases in which the agent has a finite internal memory holding the history of agent at time slot , , which is the set of actions and observations of agent during the previous consecutive time slots. We also define as the global history.

Next, we define a policy as the probability of an action given the past history, expressed as a continuously differentiable function of a set of parameters. Hence, we represent the policy of each agent of the proposed POIPSG in a parametric functional form , where is the parameter vector of agent . If is a trajectory of the POIPSG during consecutive time slots, then the stochastic episodic reward function during the future consecutive time slots is defined as . Here, we are interested in implementing a distributed controller in which the mmW AP and the RISs act independently. Thus, the unknown probability of trajectory is equal to if the agents in act independently.

In what follows, we define the risk-sensitive episodic return for the parametric functional-form policies at time slot as [23]. Given the parametric functional-form policies, , the goal of the transmission and phase shift controllers is to solve the following optimization problem:

(10)
(11)
(12)
(13)

where . We will define the parameter vector and the value of in Subsection III-A.

To solve the optimization problem in (13), the controller needs full knowledge of the transition probability , and of all possible values of for all trajectories of the POIPSG during under policies . Since the explicit characterization of the transition probability and of the episodic reward values for all trajectories is not feasible in highly dynamic mmW networks, we use an RL framework to solve (13). More specifically, we use a policy-search approach to find the optimal transmission angle and phase shift-control policies in (13), for the following reasons. First, value-based approaches such as Q-learning are oriented toward finding deterministic policies. However, the optimal policy is often stochastic, and policy-search approaches can select different phase shifts with specific probabilities by adaptively tuning the parameters in [24]. Second, value-based RL methods use a parameter, , to manage the exploration-exploitation tradeoff when applying other possible policies [24], whereas in the policy-search approach the exploration-exploitation tradeoff is handled explicitly through the direct modeling of a probabilistic policy [24]. Third, in value-based approaches, any small change in the estimated value of an action can cause it to be (or not to be) selected. In this regard, the most popular policy-search method is the policy-gradient method, in which the gradient of the objective function is calculated and used in a gradient-ascent algorithm. The gradient of the risk-sensitive objective function is approximated as follows.

Proposition 1.

The gradient of the objective function, , in (13) is approximated by:

(14)

where . Under the distributed controller, in which the mmW AP and RISs act independently, .

Proof.

See Appendix A. ∎
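Because the displayed form of (14) was not preserved in this version of the text, the following is a hedged reconstruction of the kind of score-function (likelihood-ratio) gradient suggested by the exponential-utility expansion sketched in Appendix A; the symbols θ, μ, R(τ), π, and P_θ(τ) are assumptions standing for the policy parameters, the risk sensitivity parameter, the episodic reward of trajectory τ, the per-agent policies, and the trajectory probability, and are not necessarily the paper's notation.

```latex
% Hedged reconstruction of a risk-sensitive policy gradient of the form implied by
% the Taylor expansion in Appendix A (assumed notation, not the paper's).
\nabla_{\boldsymbol{\theta}} J_{\mu}(\boldsymbol{\theta})
\;\approx\;
\mathbb{E}_{\tau \sim P_{\boldsymbol{\theta}}}\!\left[
\left( R(\tau) + \tfrac{\mu}{2}\, R(\tau)^{2} \right)
\nabla_{\boldsymbol{\theta}} \log P_{\boldsymbol{\theta}}(\tau)
\right],
\qquad
\nabla_{\boldsymbol{\theta}} \log P_{\boldsymbol{\theta}}(\tau)
\;=\; \sum_{t} \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}\!\left(a_{t}\mid h_{t}\right)
```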

Following Proposition 1, we can use (14) in a gradient-ascent algorithm to solve the optimization problem in (13) and, then, find the optimal control policies. To calculate (14), we would need a lookup table of all trajectories of risk-sensitive values and policies over time. However, such a lookup table is not available for highly dynamic indoor mmW networks. To overcome this challenge, we combine a deep neural network (DNN) with the RL policy-search method. Such a combination was studied in [26], where a DNN learns a mapping from the partially observed state to an action without requiring any lookup table of all trajectories of the risk-sensitive values and policies over time. Next, we propose an RL algorithm that uses a DNN-based policy gradient for solving (13).

III-A Implementation of Phase-shift controller with DNN

We use a DNN to approximate the policy for solving (13). Here, the parameters include the weights over all connections of the proposed DNN where is equal to the number of connections [26]. We consider two implementations of the beamforming and phase shift-controllers: centralized and distributed.

1) Centralized controller: The centralized controller has enough memory to record the global history and enough computational power to train the proposed DNN in Fig. 2. Thus, the deep RNN directly implements the independent beamforming and phase shift-control policies for given the global history and . Then, the policy is transmitted from the centralized controller to the mmW AP and RISs through the control links. Indeed, the centralized controller is a policy mapping observations to the complete joint distribution over the joint action space . The deep RNN that implements the centralized controller is shown in Fig. 2. This deep RNN includes 3 long short-term memory (LSTM) layers, 3 fully connected layers, 3 rectified linear unit (ReLU) layers, and Softmax layers. The 3 LSTM layers have , , and memory cells, respectively.

The main reason for using an RNN to implement the controller is that, unlike feedforward neural networks (NNs), RNNs can use their internal state to process sequences of inputs. This allows RNNs to capture the dynamic temporal behavior of a system, such as the highly dynamic changes over the mmW links between the mmW AP and reflectors in an indoor scenario [24]. Thus, we implement the controller using LSTM networks. An LSTM is an artificial RNN architecture used in the field of deep learning. In this case, the LSTM-based controller has enough memory cells in its LSTM layers to learn policies that require memories of events over the previous discrete time slots. These events are the blockages of mmW links due to the stochastic state transitions of the environment in the proposed POIPSG during the last time slots. Moreover, the LSTM-based architecture allows us to avoid the problem of vanishing gradients in the training phase. Hence, compared to other DNNs, the LSTM-based architecture provides a faster RL algorithm [24].

Figure 2: The deep RNN for implementing the centralized controller. Input is and output is .

2) Distributed controllers: In a highly dynamic mmW network, the channel state may change even during the transmission of policy signals over the backhaul link from the central controller to the mmW AP and RISs. Therefore, we also propose a distributed control policy in which the mmW AP and each RIS optimize their control policies in a distributed manner without requiring the central policy to be sent over the backhaul link. Indeed, unlike the central controller, the distributed controller does not suffer from the backhaul link delay. Consequently, the distributed controller is faster than the centralized solution. Under the distributed controllers, the mmW AP and all the RISs act independently. In this case, since each agent acts independently, , and the deep RNN in the controller of each agent implements the policy because of its limited computational power. Although the mmW AP and RISs act independently, the agents share their previous consecutive actions with the other agents using synchronized coordinating links between them. A synchronized coordinating link can be a microwave wireless link or a wired backhaul link between the mmW AP and RISs. The deep RNN that implements the distributed controller of each agent is shown in Fig. 3. This deep RNN includes 2 LSTM layers, 3 fully connected layers, 2 ReLU layers, and one Softmax layer. The 2 LSTM layers have and memory cells, respectively.

Figure 3: The deep RNN for implementing the distributed phase shift-controller. Input is and output is .
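As a concrete illustration of one possible realization of the controller architectures in Figs. 2 and 3, the following PyTorch sketch stacks LSTM layers, fully connected and ReLU layers, and a Softmax output over the discrete beam or phase-shift angles. The layer widths, the history encoding, the dropout value, and the class name RNNPolicy are illustrative assumptions rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class RNNPolicy(nn.Module):
    """LSTM-based stochastic policy: maps a history of observations/actions to a
    probability distribution over a discrete set of beam or phase-shift angles."""
    def __init__(self, obs_dim, n_angles, lstm_sizes=(64, 32), fc_size=64, dropout=0.2):
        super().__init__()
        # Stacked LSTM layers (the sizes are illustrative assumptions).
        self.lstms = nn.ModuleList()
        in_dim = obs_dim
        for hidden in lstm_sizes:
            self.lstms.append(nn.LSTM(in_dim, hidden, batch_first=True))
            in_dim = hidden
        self.dropout = nn.Dropout(dropout)
        # Fully connected + ReLU head followed by a Softmax over the angle set.
        self.head = nn.Sequential(
            nn.Linear(in_dim, fc_size), nn.ReLU(),
            nn.Linear(fc_size, n_angles), nn.Softmax(dim=-1),
        )

    def forward(self, history):            # history: (batch, T, obs_dim)
        x = history
        for lstm in self.lstms:
            x, _ = lstm(x)
        x = self.dropout(x[:, -1, :])       # last hidden state summarizes the finite history
        return self.head(x)                 # (batch, n_angles) action probabilities

# Illustrative usage: a distributed controller for one RIS with 16 phase-shift angles.
policy = RNNPolicy(obs_dim=8, n_angles=16)
probs = policy(torch.randn(4, 10, 8))       # batch of 4 histories, 10 slots each
actions = torch.distributions.Categorical(probs).sample()
```

Taking the output of the last time step of the stacked LSTMs plays the role of the finite agent memory discussed above, while the Softmax layer yields the stochastic beam or phase-shift selection probabilities.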

One technique that can be used to prevent an NN from overfitting the training data is the so-called dropout technique [27]. Dropout is a technique in which randomly selected neurons are ignored during training. The dropout probability is a hyperparameter that must be tuned by trial and error; the NN has to be retrained for each candidate value of the dropout probability to find the best setting [27]. We will find the values of the dropout probabilities and for the proposed deep NNs in Figs. 2 and 3 using such a trial-and-error procedure in the simulation section. Since the payoff is identical for all agents and the observations of the environment changes are drawn from the same distribution for all agents, the gradient updating rules of the distributed and centralized controllers will be the same in the considered POIPSG. This fact is stated formally as follows:

Theorem 1.

Starting from the same point in the search space of policies for the proposed POIPSG and given the identical payoff function, , the gradient update algorithm will converge to the same locally optimal parameter setting for the distributed controllers and centralized controller.

Proof.

See Appendix B. ∎

Following Theorem 1, if the architectures of the centralized controller in Fig. 2 and the distributed controllers in Fig. 3 are designed correctly and the proposed deep RNNs are trained with enough data, the performance of the distributed controllers should approach that of the centralized controller in RIS-assisted mmW networks. In this case, instead of using a central server with high computational cost and signaling overhead to send the control policies to all agents across the network, one can use distributed coordinated controllers with low computational power. The distributed controllers only need to share the policies with the agents that cooperate to cover the same dark area. Thus, under the distributed controller setting, the signaling overhead is also limited to the local area. In addition, the policy profile under the distributed controllers is a Nash equilibrium of the POIPSG. We state this more precisely in the following.

Theorem 2.

At the convergence of the gradient update algorithm in (15), the policy profile under the distributed controllers is a Nash equilibrium of the POIPSG.

Proof.

See Appendix C. ∎

Consider a training set of samples that is available to train the deep RNN. Each training sample includes the policies and bit rates during the consecutive time slots before time slot , , and the policies and bit rates during the future consecutive time slots after time slot , . Consequently, based on Proposition 1 and by replacing the expectation with a sample-based estimator for , we use the gradient-ascent algorithm to train the RNN as follows:

(15)

where , and . Here, is the learning rate. In summary, to solve the optimization problem in (5), we model the problem as the deep and risk-sensitive RL problem in (13). Then, to solve problem (13), we implement the two centralized and distributed policies using the deep RNNs shown in Figs. 2 and 3. Finally, based on the gradient-ascent algorithm, we use (15) to iteratively train the proposed deep RNNs and optimize .
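The sketch below illustrates how the sample-based gradient-ascent update in (15) could be realized on top of the hypothetical RNNPolicy class sketched after Fig. 3. The weighting R + (μ/2)R² follows the hedged reconstruction of the gradient given after Proposition 1; the interfaces, the reward scaling, and the optimizer choice are assumptions, not the exact training procedure of the paper.

```python
import torch

def policy_gradient_step(policy, optimizer, histories, actions, episodic_rates, mu):
    """One risk-sensitive REINFORCE-style update (a sketch of (15) under assumed notation).

    histories:      (batch, T, obs_dim) float tensor of agent histories
    actions:        (batch,) long tensor of sampled beam/phase-shift indices
    episodic_rates: (batch,) float tensor of episodic sum rates R over the future window
    mu:             risk sensitivity parameter
    """
    probs = policy(histories)                                    # (batch, n_angles)
    log_probs = torch.log(probs.gather(1, actions.unsqueeze(1)).squeeze(1) + 1e-12)
    # Risk-sensitive weight ~ R + (mu/2) R^2 from the exponential-utility expansion.
    weights = episodic_rates + 0.5 * mu * episodic_rates ** 2
    loss = -(weights.detach() * log_probs).mean()                # ascent on J via descent on -J
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative usage with the assumed RNNPolicy from the earlier sketch:
# optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
# policy_gradient_step(policy, optimizer, histories, actions, episodic_rates, mu=-0.1)
```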

III-B Complexity of deep RNN-based policies

The complexity of an NN depends on the number of hidden layers, training examples, features, and nodes in each layer [28]. The complexity of training a neural network that has layers and nodes in layer is with training examples and epochs. Meanwhile, the complexity of one feedforward propagation is . On the other hand, an LSTM is local in space and time, which means that the input length does not affect the storage requirements of the network [29]. In practice, after training the RNN-based policy, our proposed solution uses the feedforward propagation algorithm to find the solution. In this case, following the proposed NN architectures in Figs. 2 and 3, the complexities of the centralized and distributed controllers are and , respectively. These complexities are polynomial functions of key parameters such as the history length, , the number of mmW APs and RISs, , the number of phase shift angles, , and the number of future time slots, . On the other hand, the complexity of the optimal solution using a brute-force algorithm is . Consequently, for a given history length , the optimal solution has the highest complexity, , while our proposed distributed solution, with complexity , is the least complex.
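As a small worked illustration of these complexity expressions, the following sketch counts the multiplications of one feed-forward pass for assumed layer widths and contrasts that polynomial cost with the exponential growth of a brute-force search over all joint angle choices; every number (layer widths, number of angles, number of agents, horizon) is an illustrative assumption.

```python
def feedforward_multiplies(layer_widths):
    """Multiplications in one forward pass: sum of n_{l-1} * n_l over consecutive layers."""
    return sum(a * b for a, b in zip(layer_widths, layer_widths[1:]))

def brute_force_evaluations(n_angles, n_agents, horizon):
    """Exhaustive search: one evaluation per joint angle choice per future time slot."""
    return (n_angles ** n_agents) ** horizon

# Assumed distributed-controller widths (input, hidden layers, output over 16 angles).
print(feedforward_multiplies([8, 64, 32, 64, 16]))                   # polynomial in the widths
print(brute_force_evaluations(n_angles=16, n_agents=3, horizon=4))   # exponential in N * T
```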

IV Simulation Results and Analysis

For our simulations, the carrier frequency is set to 73 GHz and the mmW bandwidth is 1 GHz. The numbers of transmit antennas at the mmW AP and receive antennas at the UE are set to 128 and 64, respectively. The duration of each time slot is 1 millisecond, which is consistent with the mmW channel coherence time in typical indoor environments [30]. The transmission power of the mmW AP is 46 dBm and the noise power density is -88 dBm. We assume that each mmW RIS assigns a square of meta-surfaces to reflect the mmW signals. Each meta-surface shifts the phase of the mmW signals with a step of radians within the range . In our simulations, one mmW AP and two mmW RISs are mounted on the walls of the room and controlled using our proposed framework to guarantee reliable transmission. To evaluate our proposed RNN-based control policies, we use two datasets of users' trajectories in an indoor environment: a model-based dataset and a real-world dataset. To generate the model-based dataset, we consider a 35-sq. meter office environment with a static wall blockage at the center. In this regard, we assume a given probability distribution for the users' locations in the room. This location probability distribution can be calculated using well-known indoor localization techniques such as the one in [26]. To generate the dataset of mobile users' trajectories, we use a modified random walk model in which the direction of each user's next step is chosen based on the probability of the user's presence at the next location. Fig. 4 shows the probability distribution of the user's locations in the office, the locations of the mmW RISs, and an illustrative example of a user trajectory. We further evaluate our proposed solution using the real-world OMNI1 dataset [31], which includes trajectories of humans walking through a lab, captured using an omni-directional camera. The natural trajectories were collected over 24 hours on a single Saturday, and the dataset contains 1600 trajectories over 56 time slots. For comparison purposes, we consider, as a benchmark, the optimal solution in which the exact user locations and the optimal strategies for the reflectors during the next future time slots are known.
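To clarify how the model-based trajectory dataset described above can be generated, the following is a minimal sketch of a modified random walk in which each next step is drawn in proportion to a user-presence probability map over a grid. The grid size, the presence map, and the step set are illustrative assumptions rather than the exact generator used for the simulations.

```python
import numpy as np

def modified_random_walk(presence_prob, start, n_steps, rng=None):
    """Random walk on a grid where each step is sampled in proportion to the
    probability of the user's presence at the candidate next locations."""
    rng = np.random.default_rng() if rng is None else rng
    moves = [(0, 1), (0, -1), (1, 0), (-1, 0)]        # 4-neighbour steps (an assumption)
    pos, traj = start, [start]
    for _ in range(n_steps):
        candidates = [(pos[0] + dx, pos[1] + dy) for dx, dy in moves]
        candidates = [c for c in candidates
                      if 0 <= c[0] < presence_prob.shape[0] and 0 <= c[1] < presence_prob.shape[1]]
        weights = np.array([presence_prob[c] for c in candidates], dtype=float)
        if weights.sum() == 0:                         # surrounded by zero-probability cells
            break
        pos = candidates[rng.choice(len(candidates), p=weights / weights.sum())]
        traj.append(pos)
    return traj

# Illustrative usage: a 7x5 grid approximating a 35-sq. meter office with a uniform presence map.
presence = np.ones((7, 5)) / 35.0
trajectory = modified_random_walk(presence, start=(0, 0), n_steps=56)
```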

Figure 4: The probability distribution of the mobile user's locations.

IV-A Performance evaluation of deep RNN training

To evaluate the performance of the proposed controllers implemented with the deep RNNs depicted in Figs. 2 and 3, Fig. 5 shows the RMSE between the predicted and optimal policies of the centralized and distributed controllers when the dropout probabilities are and . On average, the difference between the RMSEs over the training and validation sets is less than , which shows that the deep RNN model is not over-fitted to the training dataset. In addition, on average, the difference between the RMSEs over the training and test sets is less than , which shows that the implemented deep RNN model is not under-fitted and can adequately capture the underlying structure of new dynamic changes over the mmW links. Thus, the structures of the proposed deep RNN models depicted in Figs. 2 and 3 are chosen correctly, and the hyper-parameters, such as the dropout probabilities and , are tuned correctly in the training phase. On average, the RMSE for future consecutive time slots is for and for . This shows that predicting the correct control strategy becomes harder as the window length of future consecutive time slots increases, but even for the deep RNN can capture the unknown future dynamics over the mmW links and correctly predict the control strategy in of the cases. Besides, the differences in the RMSEs between the centralized and distributed controllers are , , and over the training, validation, and test sets, respectively. This shows that the performance of the centralized and distributed controllers is almost the same.

Figure 5: RMSE for the parametric functional-form policy.

IV-B Achievable rate under proposed RNN-based controllers

In Fig. 6, we show the achievable rate, , under the centralized and distributed controller policies over time for the model-based dataset presented in the simulation setup. As we can see from Figs. 5(a) and 5(b), when the risk sensitivity parameter is set to zero, i.e., the non-risk scenario, a higher rate with highly dynamic changes is achieved under the optimal solution. However, when the risk sensitivity parameter increases from to , i.e., the risk-based scenario, the policy resulting from the centralized and distributed controllers achieves a lower average rate with a lower variance, which is more reliable. For the model-based dataset, on average, the mean and variance of the achievable rate for the non-risk scenario are and higher than those of the risk-based scenarios for different future time slot lengths, respectively. Moreover, we can also see that controlling over a wider window of future consecutive time slots leads to a more reliable achievable rate but with a lower average rate for the risk-based scenario. For example, when , the mean and variance of the achievable rate are and , respectively, but they decrease to and when . The reason is that controlling the beam angle of the mmW AP and the phase shifts of the RISs over a larger window of future time slots gives the centralized and distributed controllers more available strategies to decrease the variance, compared with controlling the beam angle and phase shifts over a tighter window of future time slots. In addition, on average, the mean of the rate achieved by the distributed controllers is higher than that of the centralized controller, and the difference in the variance of the achieved rate between the centralized and distributed controllers is . This result shows that the performance of the centralized and distributed controllers is essentially identical.

(a) Distributed controllers.
(b) Centralized controller.
Figure 6: Achievable rate, , for model-based dataset.

In Fig. 7, we show the achievable rate, , under the centralized and distributed controller policies for the real-world dataset in [31]. From Figs. 6(a) and 6(b), we observe that, in the non-risk scenario, , a high rate with high variance is achieved under the optimal solution. However, in the risk-based scenario, , the policy resulting from the centralized and distributed controllers achieves a lower data rate but with a lower variance, which is more reliable. For the real-world dataset, on average, the mean and variance of the achievable rate for the non-risk scenario are and higher than those of the risk-based scenarios for different future time slot lengths, respectively. Moreover, when , the mean and variance of the achievable rate are and , respectively, but they decrease to and when . In addition, on average, the differences in the variance and the mean of the rate achieved by the centralized and distributed controllers are and , respectively. This result shows that the performance of the centralized controller is close to that of the distributed controllers for the real-world dataset.

(a) Distributed controllers.
(b) Centralized controller.
Figure 7: Achievable rate, , for real-world dataset.

In Fig. 8, we show the impact of the risk sensitivity parameter on the reliability of the achievable rate. In particular, Fig. 8 shows the variance of the received rate versus different values of the risk sensitivity parameter under our proposed distributed RNN-based policy for the real-world dataset in [31] and the model-based dataset presented in the simulation setup. As we can see from this figure, a larger risk sensitivity parameter leads to less variance in the data rate. When we change from to , the rate variance, on average, is reduced by and for the real-world and the model-based datasets, respectively.

Figure 8: Impact of the risk sensitivity parameter on the achievable rate.

IV-C Robustness and complexity of RNN-based controllers

Fig. 9 shows the average policies resulting from the centralized and distributed controllers, , and the optimal joint beamforming and phase shift-controller for the mmW AP and RISs over different future consecutive time slots for the risk-sensitive approach when . From Fig. 9, the errors between the policies of the distributed controllers and the optimal solution are , , and for the mmW AP, RIS 1, and RIS 2 on average, respectively. This is due to the fact that, during the time slots, the deep RNN, which has enough memory cells, can capture the previous dynamics over the mmW links and predict the future mobile user's trajectory in a given indoor scenario. Thus, the policies of the proposed deep RNN-based phase shift-controller are near the optimal solution. Fig. 9 also shows that the controller steers the AP beam toward mmW RIS 1 at radians and mmW RIS 2 at radians with probabilities and , respectively. Moreover, the controller of RIS 1 reflects the mmW signal from to radians most of the time, and the controller of RIS 2 shifts the phase of the mmW signal to cover from to radians with higher probability. Given the locations of the mmW AP and RISs in the simulation scenario depicted in Fig. 4, these results are reasonable because they show that the distributed controllers implemented with deep RNNs coordinate the beam angle of the mmW AP and the phase shifts of the RISs to cover the dark areas with high probability.

Figure 9: Optimal and policy-based strategies of joint beamforming and phase shift-controllers.

In Fig. 10, we show the gap between the suboptimal and optimal solutions. As we can see, the gap between the RNN-based and optimal policies for the real-world dataset is slightly different from that for the model-based dataset. On average, the gaps between the RNN-based and optimal policies of the mmW AP and RISs are and for the real-world and the model-based datasets, respectively. Consequently, it is clear that our proposed RNN-based solution is near-optimal.

Figure 10: Gap between the RNN-based and optimal policies.

To show the robustness of our proposed scheme, we change the mobility pattern of the users by adding random obstacles in the room while using an RNN-based policy that was previously trained on a scenario without the additional obstacles. This scenario allows us to evaluate the robustness of our solution with respect to new, unknown random changes in the mobility pattern of the users and in the blockages over the mmW channel that were not considered in the training dataset. For this simulation, we randomly add obstacles of size in the 35-sq. meter office environment. All results are averaged over a large number of independent simulation runs. To evaluate the robustness of our proposed RNN-based policy, in Fig. 11, we show the percentage of deviation in the data rate achieved in the new environment with respect to the scenario without additional obstacles. From Fig. 11, we can see that the percentage of rate deviation increases as we add more obstacles to the room. However, when the controller predicts the policies for the next two slots, the deviation percentage is less than , which means that our proposed control policy is robust with respect to the new environmental changes in the room. Moreover, when the RNN-based controller predicts the control policy for 3 or 4 future time slots in the new environment, the robustness of our proposed RNN-based controller decreases. Hence, when or , the RNN-based control policy, which is trained using the dataset of the previous environment, is robust enough to be used in a new environment. In contrast, when or , we need to retrain the RNN-based control policy using a dataset from the new environment.

Figure 11: Percentage of rate deviation vs. number of obstacles.

V Conclusion

In this paper, we have proposed a novel framework for guaranteeing ultra-reliable mmW communications using multiple AI-enabled RISs. First, based on risk-sensitive RL, we have defined a parametric risk-sensitive episodic return to maximize the expected bit rate and mitigate the risk of mmW link blockage. Then, we have analytically derived a closed-form approximation for the gradient of the risk-sensitive episodic return. Next, we have modeled the problem of joint beamforming for the mmW AP and phase shift control for the mmW RISs as an identical payoff stochastic game in a cooperative multi-agent environment, in which the agents are the mmW AP and the RISs. We have proposed two centralized and distributed controllers using deep RNNs, and we have trained the proposed deep RNN-based controllers using the derived closed-form gradient of the risk-sensitive episodic return. Moreover, we have proved that the gradient updating algorithm converges to the same locally optimal parameters for the deep RNN-based centralized and distributed controllers. Simulation results show that the error between the policies of the optimal and proposed controllers is less than . Moreover, the difference between the performance of the proposed centralized and distributed controllers is less than . On average, for a high value of the risk sensitivity parameter, the variance of the achievable rates resulting from the deep RNN-based controllers is lower than that of the non-risk-based solution.

Appendix A

A-A Proof of Proposition 1

Let be a trajectory during -consecutive time slots which leads to the episodic reward . The Taylor expansion of the utility function for small values of yields . Since , we can rewrite

The probability of the trajectory is . Thus, we can write . Due to the fact that , we have . By performing additional simplifications, we have .

Moreover, when the agents act independently, the probability of the trajectory is equal to . Due to the fact that , and , we can write .
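Since the inline expressions of this proof were lost in extraction, the following is a hedged sketch of the independence argument it describes, in assumed notation: because the environment dynamics do not depend on the policy parameters, the trajectory probability factorizes across agents and the per-agent score function reduces to a sum over that agent's own policy terms.

```latex
% Hedged sketch in assumed notation: factorization of the trajectory probability
% under independently acting agents, and the resulting per-agent score function.
P_{\boldsymbol{\theta}}(\tau)
\;=\; P_{0}\prod_{t}\Big[\Pr\!\big(s_{t+1}\mid s_{t},\boldsymbol{a}_{t}\big)\prod_{i}\pi_{\theta_{i}}\!\big(a^{i}_{t}\mid h^{i}_{t}\big)\Big]
\quad\Longrightarrow\quad
\nabla_{\theta_{i}}\log P_{\boldsymbol{\theta}}(\tau)
\;=\; \sum_{t}\nabla_{\theta_{i}}\log \pi_{\theta_{i}}\!\big(a^{i}_{t}\mid h^{i}_{t}\big)
```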

A-B Proof of Theorem 1

Since the agents act independently, for two agents and , where , we have . Thus, we can write . Then, if the agents, which are synchronized by coordinating links, act independently in a distributed manner, we have:

(16)

By comparing (16) with Proposition 1, we can see that (16) yields the result of Proposition 1 when . Whether the centralized controller is executed by a central server or is implemented by the agents individually executing their policies synchronously, the joint histories, , are generated from the same distribution, and the identical payoff is achieved by the mmW AP and all RISs in the POIPSG. This shows that the distributed algorithm samples from the same distribution as the centralized algorithm. Thus, starting from the same point in the policy search space, on the same history sequence, the gradient updating algorithm will be stepwise the same for the distributed controllers and the centralized one.

A-C Proof of Theorem 2

Assume that, for a given global history sequence for and at the convergence of the gradient update algorithm using (15), the policy profile under the distributed controllers is . At this policy profile, since all agents have an identical payoff function, the best response of agent to the given strategies of all other agents is defined as , where . In this case, due to the fact that the agents act independently, the gradient updating rule for agent to find its best response is given by (25). Since the global history sequence for