Multi-IRS-assisted Multi-Cell Uplink MIMO Communications under Imperfect CSI: A Deep Reinforcement Learning Approach

Applications of intelligent reflecting surfaces (IRSs) in wireless networks have attracted significant attention recently. Most of the relevant literature is focused on the single cell setting where a single IRS is deployed and perfect channel state information (CSI) is assumed. In this work, we develop a novel methodology for multi-IRS-assisted multi-cell networks in the uplink. We consider the scenario in which (i) channels are dynamic and (ii) only partial CSI is available at each base station (BS); specifically, scalar effective channel powers from only a subset of user equipments (UE). We formulate the sum-rate maximization problem aiming to jointly optimize the IRS reflect beamformers, BS combiners, and UE transmit powers. In casting this as a sequential decision making problem, we propose a multi-agent deep reinforcement learning algorithm to solve it, where each BS acts as an independent agent in charge of tuning the local UE transmit powers, the local IRS reflect beamformer, and its combiners. We introduce an efficient information-sharing scheme that requires limited information exchange among neighboring BSs to cope with the non-stationarity caused by the coupling of actions taken by multiple BSs. Our numerical results show that our method obtains substantial improvement in average data rate compared to baseline approaches, e.g., fixed UE transmit power and maximum ratio combining.



I Introduction

Intelligent reflecting surfaces (IRSs) are one of the innovative technologies for 6G and beyond [zhang2020prospective, hosseinalipour2020federated]. An IRS is an array of passive reflecting elements with a control unit. It manipulates the propagation of an incident signal by providing an abrupt phase shift, which can control the communication channel. IRSs are utilized to provide enhanced communication efficiency without building extra infrastructure [8811733, 9013288, pan2020multicell, 8990007, zhi2020uplink, 9162705, zhang2020intelligent]. In this paper, we study a scenario where multiple IRSs are deployed in a multi-cell cellular setting to provide enhanced data rates to the users.

I-A Related Work

Work on exploiting IRSs in cellular networks began with applications of this technology in the downlink (DL); studies of IRS use cases in the uplink (UL) are thus comparatively more recent.

I-A1 Utilizing IRSs in the DL

Most of the relevant literature has considered a single-cell system with a single IRS [8811733, 9013288]. Specific investigations include quality-of-service (QoS)-constrained transmit power minimization [8811733] and weighted sum-rate maximization [9013288] to obtain the base station (BS) beamformer and IRS reflect beamformer/precoder in the DL. Unlike these approaches, the work in [pan2020multicell] considers a multi-cell scenario with a single IRS, where the BS precoders and IRS reflect beamformer are designed to maximize the sum-rate.

I-A2 Utilizing IRSs in the UL

Most of the works on UL design are likewise focused on single-cell systems with a single IRS [8990007, zhi2020uplink, 9162705]. Several of these works have studied IRS reflect beamformer design and uplink user equipment (UE) power control, and the impact of quantized IRS phase values [8990007] and compressed sensing-based user detection [9162705] on uplink throughput has also been investigated. The concept of an IRS resembles analog beamforming in millimeter-wave (mmWave)-based systems [zhi2020uplink]. Recently, systems with two IRSs have been considered, focusing on SINR fairness [zhang2020intelligent].

Despite its potential to improve multi-cell-wide performance, multi-IRS deployment in multi-cell UL scenarios has not been thoroughly modeled and studied, owing to the added optimization complexity involved in controlling multiple IRSs.

I-B Overview of Methodology and Contributions

In this work, we develop an architecture for multi-IRS-assisted multi-cell UL networks. Our methodology explicitly considers multi-order reflections among IRSs, which is rarely done in the existing literature. We address the scenario where (i) channels are time-varying, and (ii) only partial/imperfect CSI is available, in which each BS only has knowledge of scalar effective channel powers from a subset of UEs. This is more practical and realistic than the prior approaches [8990007, zhi2020uplink, 9162705, zhang2020intelligent], which assume perfect knowledge of all channel matrices. We formulate the sum-rate maximization problem aiming to jointly optimize UE transmit powers, IRS reflect beamformers, and BS combiners across cells.

Given the interdependencies between the design variables across different cells, we cast the problem as one of sequential decision making and tailor a multi-agent deep reinforcement learning (DRL) algorithm to solve it. We treat each BS as an independent learning agent that controls the local UE transmit powers, the local IRS reflect beamformer, and its combiners using only index gradient variables. We design the state, action, and reward function for each BS to capture the interdependencies among the design choices made at different BSs. We further develop an information-sharing scheme in which only limited information is exchanged among neighboring BSs to cope with the non-stationarity issue caused by the coupling among the actions taken at different BSs. Through numerical simulations, we show that our proposed scheme outperforms the conventional baselines for data rate maximization.

II Multi-cell Systems with Multiple IRSs

In this section, we first introduce the signal model under consideration (Sec. II-A). Then, we formulate the optimization and discuss the challenges associated with solving it (Sec. II-B).

II-A Signal Model

We consider a multi-cell system with multiple IRSs operating in the uplink (UL), as depicted in Fig. 1. The system comprises a set of cells and IRSs . For simplicity, we assume that each cell has one IRS, i.e., , though our method readily generalizes to the case where . The IRSs are indexed such that cell contains IRS .

Each cell contains (i) UEs with a single antenna, denoted by , (ii) an IRS with reflecting elements, denoted by , and (iii) a BS with antennas, denoted by . We let UE refer to UE in cell . The received signal vector at BS at the th channel instance is given by


where is the direct channel from UE to BS , is the channel from UE to IRS , is the channel from IRS to BS , and is the channel from IRS to IRS , . Also, is the transmit power and is the transmit symbol of UE , where . The noise vector at BS is assumed to be distributed as zero-mean complex Gaussian with covariance matrix , i.e., , where denotes the identity matrix and is the noise variance. Finally, is a diagonal matrix whose diagonal entries represent the beamforming vector of IRS ; each entry , , is modeled as , incurring the signal attenuation and phase shift .

In (1), we consider the channels with three different paths from UE to BS : (i) the direct channel, (ii) the channel after one reflection from the IRSs (the sum over ), and (iii) the channel after two reflections from the IRSs (the sum over ). Higher order reflections can also be incorporated in (1), i.e., signals reflected from more than two IRSs; we focus on up to the second-order reflections due to a large attenuation induced by multiple reflections between IRSs.
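As a concrete illustration of the three propagation paths in (1), the following sketch builds the effective uplink channel for a single UE with the direct path, first-order reflections, and second-order reflections. All dimensions, channel draws, and variable names are illustrative assumptions, not the paper's simulation values:

```python
import numpy as np

rng = np.random.default_rng(0)
L, M, N = 3, 4, 8        # cells/IRSs, BS antennas, IRS elements (assumed)

def crandn(*shape):
    """i.i.d. unit-variance complex Gaussian draws (illustrative channels)."""
    return (rng.normal(size=shape) + 1j * rng.normal(size=shape)) / np.sqrt(2)

h_direct = crandn(M)                                   # UE -> BS
h_ui = [crandn(N) for _ in range(L)]                   # UE -> IRS l
G_ib = [crandn(M, N) for _ in range(L)]                # IRS l -> BS
G_ii = {(l, m): crandn(N, N)                           # IRS m -> IRS l
        for l in range(L) for m in range(L) if l != m}

# Each IRS applies a diagonal unit-modulus phase-shift matrix.
Theta = [np.diag(np.exp(1j * rng.uniform(0, 2 * np.pi, N))) for _ in range(L)]

# Effective channel: direct path + first-order + second-order reflections.
h_eff = h_direct.copy()
for l in range(L):
    h_eff += G_ib[l] @ Theta[l] @ h_ui[l]
for l in range(L):
    for m in range(L):
        if l != m:
            h_eff += G_ib[l] @ Theta[l] @ G_ii[(l, m)] @ Theta[m] @ h_ui[m]

p = 0.1                                                # UE transmit power (W)
y = np.sqrt(p) * h_eff * (1.0 + 0j) + crandn(M)        # received signal at BS
```

Truncating at second-order reflections mirrors the modeling choice above: each extra bounce multiplies in another strongly attenuated channel matrix.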

We assume that a linear combiner is employed at BS to restore from , which yields


where superscript denotes the conjugate transpose.

Fig. 1: Depiction of a multi-IRS-aided multi-cell system in the UL.

II-B Problem Formulation and Challenges

We aim to maximize the sum-rate over all the UEs in the network through the design of the UE powers , BS combiners , and IRS beamformers , where is the IRS beamforming vector on the diagonal of , i.e., . With denoting the signal-to-interference-plus-noise ratio (SINR) of UE , we propose the following optimization problem:

maximize (3)
subject to

where is the set of power values, is the codebook for BS combiners, and is the codebook for IRS beamformers. (A codebook structure can be employed for the IRS because an IRS is in practice controlled by a field-programmable gate array (FPGA) circuit that stores a set of coding sequences [cui2014coding].)

The problem in (3) is an optimization problem at time , where , i.e., the optimization of the variables is performed once every time instances. If the instantaneous channels , , and in (1) were all known, then conventional optimization methods, e.g., successive convex approximation or integer programming, could be applied, since in (3) could be formulated as (4) (shown at the top of the next page) with the known channels. However, IRS-assisted wireless networks face the following challenges in practice:


  • IRS channel acquisition: Although most works, e.g., [8990007, zhi2020uplink, 9162705, zhang2020intelligent], assume that channels are perfectly known, this assumption is impractical because an IRS is passive and often lacks sensing capabilities. While special IRS hardware with the ability to estimate the concatenated channels does exist [wang2020channel], the time overhead could easily overwhelm the coherent channel resources, especially when there are multiple IRSs.

  • Dynamic channels: Channel dynamics in wireless environments add another layer of difficulty to channel acquisition and estimation, making the optimization in (3) intractable for conventional model-based optimization approaches.

  • Centralization: A centralized implementation to solve (3) would require gathering all the information at a central point, which is impractical in our setting. Given the interdependencies among the design variables taken by different cells and their impact on the overall objective function, distributed optimization of the variables in (3) is challenging.

To address these challenges, we convert (3) into a sequential decision making problem, where the variables are designed via successive interactions with the environment through deep reinforcement learning (DRL). While conventional DRL assumes a centralized implementation, we develop a multi-agent DRL approach, where each BS acts as an independent agent in charge of tuning its local UEs' transmit powers, local IRS beamformer, and combiners. To cope with the non-stationarity issue of multi-agent DRL [marinescu2017prediction], we carry out the learning through limited information-sharing among neighboring BSs.


III Multi-agent DRL Framework Design

In this section, we first introduce the information collection process at the BSs and design an information-sharing scheme (Sec. III-A). We then formulate a Markov decision process (MDP) (Sec. III-B) and propose a dynamic control scheme (Sec. III-C) to solve our optimization from Sec. II-B.

III-A Local Observations and Information Exchange

We consider a setting where each BS only acquires scalar effective channel powers from a subset of UEs. When UE transmits a pilot symbol with power , BS measures the scalar effective channel power (after combining with ), , which is given by


where is the scalar effective channel. The vector is the effective channel from UE to BS (before combining), which is expressed as follows:


BS collects the scalar effective channel powers of the links (i) from local UEs (in cell ) to BS , (ii) from neighboring UEs (not in cell ) to BS , and (iii) from local UEs to neighboring BSs. BS measures (i) and (ii) as local observations, but must receive (iii), which it cannot measure itself, from neighboring BSs. Additionally, BS receives a penalty value from neighboring BSs, which is used in designing the reward function and will be formalized in Sec. III-B3. Note that concurrent estimation of the scalar effective channel powers of multiple UEs can be performed via UE-specific reference signals in the Long-Term Evolution (LTE) standard [LTEstandard]. Acquiring scalar effective channel powers from only a subset of UEs lowers the CSI acquisition overhead compared to the conventional approach of acquiring large-dimensional vector or matrix CSI from each UE for each IRS.
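The scalar effective channel power in (5) is the only channel quantity each BS observes per link. A minimal sketch of this measurement, with a hypothetical antenna count, pilot power, and a random unit-norm combiner:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 4                                       # BS antennas (assumed)

# Effective channel from one UE (the BS never sees this vector in Scenario 1).
h_eff = rng.normal(size=M) + 1j * rng.normal(size=M)

w = rng.normal(size=M) + 1j * rng.normal(size=M)
w /= np.linalg.norm(w)                      # unit-norm combiner

p_pilot = 0.1                               # pilot power (assumed, in W)
# The only quantity the BS records: scalar effective channel power after combining.
g = p_pilot * np.abs(np.vdot(w, h_eff)) ** 2
```

Note that g is a single nonnegative real number per UE-BS link, which is what makes the exchange between neighboring BSs lightweight.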

To clarify which neighbouring UEs are included in (ii) and which neighbouring BSs are included in (iii), we define two sets of cell indices. First, we define the set of indices of dominantly interfering neighboring cells, . UEs in cell are dominantly interfering with the data link of local UEs (in cell ). Formally, , we have . The size of this set is a control variable . For (ii), then, we include neighbouring UEs in cell .

Second, we define the set of indices of dominantly interfered neighboring cells, . The data links of UEs in cell are dominantly interfered by local UEs (in cell ). Formally, , we have . The size of this set is a control variable . For (iii), then, we include neighbouring BSs of cell .
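The two neighbor sets defined above can be viewed as top-k selections over aggregate effective channel gains. The sketch below assumes a gain matrix indexed as gain[i, j] (interference from cell i's UEs into BS j); the matrix values, cell count, and set sizes are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
num_cells, b = 7, 0          # 7 hexagonal cells; agent of interest is BS 0
k_in, k_out = 2, 2           # set-size control variables (assumed)

# gain[i, j]: aggregate effective channel gain from cell i's UEs into BS j.
# Random values stand in for the AGC-based measurements described in the text.
gain = rng.random((num_cells, num_cells))

others = [c for c in range(num_cells) if c != b]
# Dominantly interfering cells: those whose UEs hit BS b hardest.
L_in = sorted(others, key=lambda c: gain[c, b], reverse=True)[:k_in]
# Dominantly interfered cells: those whose BSs suffer most from cell b's UEs.
L_out = sorted(others, key=lambda c: gain[b, c], reverse=True)[:k_out]
```

The sizes k_in and k_out directly control how much state information and penalty feedback each agent handles.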

The effective channel gain, used in defining and , can be acquired by the antenna circuit before digital processing (e.g., from the automatic gain control (AGC) circuit [mo2017channel]), without explicitly forming the effective channel vector or implementing the combiner. BS also measures the SINRs of all local UEs by measuring the received signal strength indicator (RSSI) and the reference signal received power (RSRP), the conventional measures of signal quality in LTE standards [LTEstandard]. Using the SINRs, BS then calculates the achievable data rate of UE as . Here, we omit the bandwidth parameter, assuming the same bandwidth for all data links.

III-B Markov Decision Process Model

We formulate the decision making process of each BS as an MDP with states, actions, and rewards:

III-B1 State

We define the state space of BS as


where each constituent set is described below.

(i) Local channel information. consists of the scalar effective channel powers from local UEs observed at two consecutive times and , given by

Here, can be obtained from (5) at time , and is a version of (5) obtained at time using previous-time variables , , and . Having them enables us to capture the effect of channel variation over time.

(ii) From-neighbor channel information. contains the scalar effective channel powers from UE in neighboring cell , and the index , for , . Formally,

This set captures the interference from neighbor UEs to cell .

(iii) To-neighbor channel information. contains the scalar effective channel powers from local UE to BS , and the index , for , . Formally,

This set captures the amount of interference that local UEs in cell inflict on neighboring cells. This information enables BS to adjust the transmit powers of local UEs to reduce interference to the neighboring cells.

(iv) Previous local variables and local sum-rate. consists of previous local variables, i.e., , , and , and the local sum-rate . Formally,

III-B2 Action

The action space is defined as


where , , are the index gradient variables used for updating the local UE transmit power, the combiner of BS , and the local IRS reflect beamformer, respectively. These index gradient variables are defined over a binary or ternary alphabet, as we will describe in Sec. IV.

Once BS determines the action in (8), the BS forwards to UE , which then updates its power index as . The power of UE is set to , , where denotes the -th element of the power set in (3). The BS also forwards to IRS , which then updates its beamformer index as , and its beamformer as , where is the -th vector in the codebook in (3). Finally, the combiner index is updated as , and the combiner of BS is set to .
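The index updates above amount to moving each index by a small signed step and mapping it back into the valid range. A minimal sketch, where clipping at the boundaries is our assumption (the paper does not specify the boundary behavior) and the power values are placeholders:

```python
import numpy as np

# Hypothetical power set (in W); the actual set P is defined in (3).
P = [0.0, 0.05, 0.1, 0.15, 0.2]

def apply_index_gradient(index, delta, size):
    """Shift an index by delta from a binary/ternary alphabet, e.g.
    {-1, +1} or {-1, 0, +1}. Clipping at the ends of the range is an
    assumption made here for illustration."""
    return int(np.clip(index + delta, 0, size - 1))

power_index = 2
power_index = apply_index_gradient(power_index, +1, len(P))   # 2 -> 3
power = P[power_index]
```

Because each action only nudges an index, the action space stays tiny regardless of how large the power set or codebooks are.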

III-B3 Reward

Aiming only to maximize the local sum-rate at each BS could increase the interference to neighboring cells. To account for the overall system performance, we design the reward to include penalty terms as


where the first term is the sum-rate of cell and the second term is the sum of penalties. The penalty is the rate loss of the dominantly interfered cell caused by the interference of local UEs (in cell ), which is calculated at BS as


where is the rate loss of UE caused by the interference of the local UEs in cell . A similar reward function was found to be effective for multi-agent DRL-based beamforming [ge2020deep]. The term in (10) denotes the data rate of UE without the interference of the local UEs in cell , while is the data rate including that interference. If there is no interference, the two terms cancel each other, yielding zero penalty; otherwise, the penalty is positive.
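The reward in (9)-(10) can be sketched as the local sum-rate minus the rate-loss penalties reported by the dominantly interfered cells. The function below assumes, for illustration, that the penalty terms arrive as paired rate lists (rate without our interference, rate with it):

```python
def reward(local_rates, rates_wo_interf, rates_w_interf):
    """Reward of one agent: local sum-rate minus the sum of rate-loss
    penalties reported by the dominantly interfered neighboring cells.
    The paired-list message format is an assumption for illustration."""
    penalty = sum(r0 - r1 for r0, r1 in zip(rates_wo_interf, rates_w_interf))
    return sum(local_rates) - penalty
```

With no interference the paired rates coincide, the penalty vanishes, and the reward reduces to the local sum-rate, matching the discussion above.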

1:  Establish a train DQN with random weights , a target DQN with random weights , an empty experience pool with , and a pool size . Set the discount factor , initial -greedy value , mini-batch size , and DQN-aligning period , .
2:  Agent (BS ) randomly initializes the design variables , , and , and informs local UEs and local IRS of the initial variables, .
3:  Agent selects its action randomly and executes it, .
4:  . Agent observes the next state , .
5:  repeat
6:     Agent selects its action at time based on -greedy policy, : With probability , agent selects a random action , and with probability , agent selects .
7:     Agent executes its action, .
8:     . Agent observes the next state and gets the reward , .
9:     Agent stores the new experience in its own experience pool , .
10:     if  then
11:         Agent samples a mini-batch consisting of experiences from its experience pool , .
12:         Agent updates the weights of its train DQN using back propagation, .
13:         Agent updates the weights of its target DQN every , .
14:     end if
15:  until Process terminates
Algorithm 1 Dynamic control based on multi-agent DRL.

III-C Dynamic Control Scheme based on Multi-agent DRL

In the proposed MDP, the channel values used as states are continuous, which makes conventional RL, i.e., tabular Q-learning, inapplicable. We thus adopt deep Q-networks (DQNs) [mnih2015human]. BS possesses its own train DQN, , with weights , and target DQN, , with weights , where the state and action are defined in Sec. III-B. The pseudocode of the proposed dynamic control scheme based on multi-agent DRL is provided in Algorithm 1. Our algorithm follows a decentralized training with decentralized execution (DTDE) framework, where both training and execution are carried out independently at each agent; each agent thus operates independently of the UEs served by the surrounding BSs. Further, our algorithm uses the index gradient approach for codebook-based BS combining and IRS beamforming, which makes the action space independent of the number of antennas/elements and the codebook size.
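To make the structure of Algorithm 1 concrete, the sketch below implements its loop (epsilon-greedy selection, an experience pool, mini-batch TD updates on a train network, and periodic target-network syncing) with a linear Q-function and a toy stand-in environment instead of real DQNs and the cellular system. Every dimension, hyperparameter, and the environment itself are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
state_dim, num_actions = 6, 4
gamma, eps, batch, C, lr = 0.5, 0.1, 8, 20, 0.01

W = rng.normal(scale=0.1, size=(num_actions, state_dim))   # "train DQN" (linear stand-in)
W_target = W.copy()                                        # "target DQN"
pool = []                                                  # experience replay pool

def env_step(state, action):
    """Toy stand-in environment: random next state; action 0 is secretly best."""
    next_state = rng.normal(size=state_dim)
    return next_state, (1.0 if action == 0 else 0.0)

s = rng.normal(size=state_dim)
for t in range(200):
    # epsilon-greedy action selection on the train network
    if rng.random() < eps:
        a = int(rng.integers(num_actions))
    else:
        a = int(np.argmax(W @ s))
    s_next, r = env_step(s, a)
    pool.append((s, a, r, s_next))
    if len(pool) >= batch:
        # sample a mini-batch and take a TD step on the train network only
        for j in rng.choice(len(pool), size=batch, replace=False):
            si, ai, ri, sni = pool[j]
            target = ri + gamma * np.max(W_target @ sni)
            td_error = target - W[ai] @ si
            W[ai] += lr * td_error * si
    if t % C == 0:
        W_target = W.copy()                                # periodic target sync
    s = s_next
```

In the paper's setting, each BS would run one such loop over the state, action, and reward of Sec. III-B, with a neural network in place of the linear Q-function.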

IV Numerical Evaluation and Discussion

In this section, we first describe the simulation setup (Sec. IV-A) and evaluation scenarios (Sec. IV-B). Then, we present and discuss the results (Sec. IV-C).

IV-A Simulation Setup

IV-A1 Parameter settings

We consider a cellular network with hexagonal cells, as shown in Fig. 2(a). We assume , , and , , similar to [pan2020multicell]. The BSs are located at the center of each cell at a height of 10 m, and the distance between adjacent BSs is 100 m. Each IRS is deployed near its BS, and the UEs are randomly placed in the cells. The set for UE power control is given by , where dBm and dBm are the minimum and maximum transmit powers, and . For the BS combiner and IRS beamformer codebooks, we use random vector quantization (RVQ) [au2007performance] codebooks of size . We set dBm, .
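An RVQ codebook is simply a set of i.i.d. isotropic unit-norm complex vectors. A sketch with illustrative sizes; for the IRS beamformer codebook one could instead draw unit-modulus phase vectors (e.g., np.exp(1j * angles)) to respect the reflect-element constraint:

```python
import numpy as np

def rvq_codebook(size, dim, seed=0):
    """Random vector quantization (RVQ) codebook: i.i.d. isotropic,
    unit-norm complex vectors. Sizes here are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    cb = rng.normal(size=(size, dim)) + 1j * rng.normal(size=(size, dim))
    return cb / np.linalg.norm(cb, axis=1, keepdims=True)

F = rvq_codebook(16, 4)   # e.g., 16 candidate combiners for a 4-antenna BS
```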

IV-A2 Channel modeling

We consider a single frequency band with flat fading and adopt a temporally correlated block fading channel model. Following a common cellular standard [ieeep80216], we assume coherence time ms and center frequency GHz. The channel vector is modeled as


where denotes the large-scale fading coefficient from UE to BS , modeled as ; is the path-loss at the reference distance , is the distance between UE and BS , and is the path-loss exponent between them. We set dB and m. denotes the Rayleigh fading vector, modeled by a first-order Gauss-Markov process [sklar2001digital]:


where , , and . The time correlation coefficient obeys the Jakes model [sklar2001digital], i.e., , where is the zeroth-order Bessel function of the first kind, and is the maximum Doppler frequency, with the velocity of the UE and m/s. The same modeling as for is applied to the channels between the UEs and the IRSs, i.e., , , with path-loss exponent . Since the IRSs are placed at locations chosen so that the IRS-BS/IRS-IRS channels vary less than the UE-BS/UE-IRS channels [pan2020multicell], and are assumed to be stationary. Each entry of these channels is distributed according to and , respectively, where and denote the large-scale fading coefficients with path-loss exponents and , respectively.

We assume , , , , , , and , . To model the presence of extensive obstacles and scatterers, the path-loss exponent between the UEs and BS is taken to be . Because the IRS-aided link can have less path loss than the direct UE-BS channel when the IRS location is properly chosen, we set the path-loss exponents of the UE-IRS link, of the IRS-BS link, and of the IRS-IRS link to , , and , respectively [pan2020multicell]. We assume , and adopt ( km/h), ( km/h), and ( km/h), where is the UE speed.
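The temporally correlated fading described above can be sketched as a first-order Gauss-Markov recursion whose correlation coefficient follows the Jakes model. The carrier frequency and slot duration below are assumptions for illustration, not necessarily the paper's values:

```python
import numpy as np
from scipy.special import j0  # zeroth-order Bessel function of the first kind

def jakes_rho(v_kmh, fc_hz=2.5e9, slot_s=0.02):
    """Jakes-model time correlation rho = J0(2*pi*f_D*T).
    Carrier (2.5 GHz) and slot length (20 ms) are illustrative assumptions."""
    f_doppler = (v_kmh / 3.6) * fc_hz / 3e8   # maximum Doppler frequency (Hz)
    return j0(2 * np.pi * f_doppler * slot_s)

def evolve(h, rho, rng):
    """First-order Gauss-Markov update; preserves unit per-entry power."""
    innovation = (rng.normal(size=h.shape) + 1j * rng.normal(size=h.shape)) / np.sqrt(2)
    return rho * h + np.sqrt(1 - rho ** 2) * innovation
```

Slower UEs yield rho close to 1 (nearly static channels), which is consistent with the trend in the speed-to-rho mapping above.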

IV-B Evaluation Scenarios

IV-B1 Scenario 1. The effective channels from local UEs are not known

In this scenario, each BS measures the scalar effective channel powers directly from the received signals, without explicitly obtaining the effective channels in the vector form of (6). We introduce two baselines in this scenario: RRR = (random, random, random) and MRR = (maximum, random, random). The name of each baseline indicates how it selects its (UE power, IRS beamformer, BS combiner) variables as a tuple. We propose DQN1, whose action space consists of elements for the UE powers, the IRS beamformer, and the BS combiners. The index gradient variables are binary, i.e., .

IV-B2 Scenario 2. The effective channels from local UEs are known

In this scenario, each BS measures the effective channels from local UEs in the vector form of (6). Each BS is assumed to adopt a maximum ratio combiner (MRC) by finding the index , where is the effective channel from the local UE. We introduce several baselines: MRM = (maximum, random, MRC), FRM = (25% of maximum, random, MRC), RRM = (random, random, MRC), and MM with no IRS = (maximum, N/A, MRC). MM with no IRS assumes the IRSs are turned off. In this scenario, we propose DQN2 and DQN3. In DQN2, the action space consists of elements for the UE powers and the IRS beamformer (the action space does not include the elements , in (8)); the BS combiner is designed as MRC and the index gradient variable is binary, i.e., . The action space of DQN3 is the same as that of DQN2, except that it uses a ternary index gradient variable, i.e., .
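The codebook-constrained MRC rule used in Scenario 2 reduces to picking the codeword best aligned with the measured effective channel. A minimal sketch (codebook and channel are illustrative):

```python
import numpy as np

def mrc_index(codebook, h_eff):
    """Codebook-constrained MRC: return the combiner index maximizing
    |w_i^H h| over the codebook rows. Names are illustrative."""
    return int(np.argmax(np.abs(codebook.conj() @ h_eff)))
```

By Cauchy-Schwarz, the best unit-norm combiner is the normalized effective channel itself, so if that vector appears in the codebook it is the one selected.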

In both scenarios, the DQNs are composed of an input layer, an output layer, and two fully-connected hidden layers. (All DQNs use the same state space and reward function given in Sec. III-B. For the state information group (iv) in Sec. III-B1, the indices of the previous local variables are stored in the state.) The input size is . The output size is , , and for DQN1, DQN2, and DQN3, respectively. For DQN1, the number of neurons in the two hidden layers is 70 and 100; for DQN2, 40 and 30; and for DQN3, 70 and 70. The rectified linear unit (ReLU) activation function is employed. In Algorithm 1, we adopt the -greedy method with , where and , . We consider , , and , . We set , i.e., the target DQN is updated with the weights of the train DQN every time instances. We employ the RMSProp optimizer for training.

Fig. 2: (a): Cellular network with hexagonal cells and 100 m distance between adjacent BSs used in our simulations. (b),(c),(d): Average achievable data rates over all UEs obtained by each method. The dotted and solid lines show the performance of the schemes in Scenario 1 and Scenario 2, respectively. Each data point in the plots is a moving average over the previous 1000 time slots.

IV-C Simulation Results and Discussion

Fig. 2(b),(c),(d) depict the average achievable data rate over all 21 UEs with different values of . The dotted lines show the performance of the schemes in Scenario 1. With varying channels, RRR and MRR select random or fixed indices for variables, and therefore have low average data rates over time. On the other hand, DQN1 learns and adapts to the varying channels over time by exploiting the local observations and information-sharing in our sequential decision making.

The solid lines represent the performance of the schemes in Scenario 2. MM with no IRS performs better than the baselines that use an IRS, implying that random IRS beamforming is worse than not deploying an IRS at all; this also reveals the vulnerability of IRS-assisted systems to adversarial IRS utilization. Our DQN2 and DQN3 methods outperform the baselines, which emphasizes the benefit of carefully optimizing the IRS configuration jointly with the rest of the cellular network. DQN2 yields slightly better performance and converges faster than DQN3: the faster convergence is because neural networks train faster with fewer outputs, and the better overall performance is consistent with the observation [mnih2015human] that DQNs are more successful with smaller action spaces.

Comparing Scenario 1 with Scenario 2, i.e., the dotted lines with the solid lines in Fig. 2(b),(c),(d), we note that the performance of DQN1, which uses only scalar effective channel powers, is comparable to that of the Scenario 2 baselines, which use vectorized local effective channels for MRC. Also, with higher values, the DQNs converge faster, which is particularly noticeable for DQN1. The fluctuation of the DQN curves is due to the -greedy policy, which occasionally explores random actions to avoid getting trapped in local optima. Overall, in each case, our MDP-based algorithms obtain significant performance improvements, emphasizing the benefit of our multi-agent DRL method.

V Conclusion

We developed a novel methodology for uplink multi-IRS-assisted multi-cell systems. Owing to temporal channel variations and the difficulty of channel acquisition, we considered a setting in which BSs acquire only scalar effective channel powers from a subset of UEs. We developed an information-sharing scheme among neighboring BSs and proposed a dynamic control scheme based on multi-agent DRL, in which each BS acts as an agent and adaptively designs its local UE powers, local IRS beamformer, and combiners. Through numerical simulations, we verified that our algorithm outperforms conventional baselines.


D.J. Love was supported in part by the National Science Foundation (NSF) under grants CNS1642982 and CCF1816013.