I Introduction
Intelligent reflecting surfaces (IRSs) are one of the innovative technologies for 6G and beyond [zhang2020prospective, hosseinalipour2020federated]. An IRS is an array of passive reflecting elements with a control unit. It manipulates the propagation of an incident signal by providing an abrupt phase shift, which can control the communication channel. IRSs are utilized to provide enhanced communication efficiency without building extra infrastructure [8811733, 9013288, pan2020multicell, 8990007, zhi2020uplink, 9162705, zhang2020intelligent]. In this paper, we study a scenario where multiple IRSs are deployed in a multicell cellular setting to provide enhanced data rates to the users.
I-A Related Work
The use of IRSs in cellular networks began with applications of this technology in the downlink (DL). Studies of IRS use cases in the uplink (UL) are thus comparatively more recent.
I-A1 Utilizing IRSs in the DL
Most of the relevant literature has considered a single-cell system with a single IRS [8811733, 9013288]. Specific investigations have included quality of service (QoS)-constrained transmit power minimization [8811733] and weighted sum-rate maximization [9013288] to obtain the base station (BS) beamformer and IRS reflect beamformer/precoder in the DL. Unlike the prior approaches, the work in [pan2020multicell] considers a multicell scenario with a single IRS, where the BS precoders and IRS reflect beamformer are designed to maximize the sum-rate.
I-A2 Utilizing IRSs in the UL
Most of the works on UL design also focus on single-cell systems with a single IRS [8990007, zhi2020uplink, 9162705]. Several of these works have studied IRS reflect beamformer design and uplink user equipment (UE) power control problems, where the impact of quantized IRS phase values [8990007] and compressed sensing-based user detection [9162705] on the uplink throughput has also been investigated. The concept of IRS resembles analog beamforming in millimeter-wave (mmWave)-based systems [zhi2020uplink]. Recently, systems with two IRSs have been considered, focusing on SINR fairness [zhang2020intelligent].
Despite the potential benefit of improving multicell-wide performance, multi-IRS deployment in multicell UL scenarios has not been thoroughly modeled and studied due to the added optimization complexity involved in controlling multiple IRSs.
I-B Overview of Methodology and Contributions
In this work, we develop an architecture for multi-IRS-assisted multicell UL networks. Our methodology explicitly considers multi-order reflections among IRSs, which is rarely done in existing literature. We address the scenario where (i) channels are time-varying, and (ii) only partial/imperfect channel state information (CSI) is available, in which each BS only has knowledge of scalar effective channel powers from a subset of UEs. This is more practical and realistic than the prior approaches [8990007, zhi2020uplink, 9162705, zhang2020intelligent] that assume perfect knowledge of all channel matrices. We formulate the sum-rate maximization problem aiming to jointly optimize UE transmit powers, IRS reflect beamformers, and BS combiners across cells.
Given the interdependencies between the design variables across different cells, we cast the problem as one of sequential decision making and tailor a multiagent deep reinforcement learning (DRL) algorithm to solve it. We consider each BS as an independent learning agent that controls the local UE transmit powers, the local IRS reflect beamformer, and its combiners via only index gradient variables. We design the state, action, and reward function for each BS to capture the interdependencies among the design choices made at different BSs. We further develop an information-sharing scheme where only limited information is exchanged among neighboring BSs to cope with the nonstationarity issue caused by the coupling between the actions at other BSs. Through numerical simulations, we show that our proposed scheme outperforms the conventional baselines for data rate maximization.
II Multicell Systems with Multiple IRSs
In this section, we first introduce the signal model under consideration (Sec. II-A). Then, we formulate the optimization and discuss the challenges associated with solving it (Sec. II-B).
II-A Signal Model
We consider a multicell system with multiple IRSs for the uplink (UL) as depicted in Fig. 1. The system is comprised of a set of cells and IRSs . For simplicity we assume that each cell has one IRS, i.e., , though our method can be readily generalized to the case where . The IRSs are indexed such that cell contains IRS .
Each cell contains (i) UEs with a single antenna, denoted by , (ii) an IRS with reflecting elements, denoted by , and (iii) a BS with antennas denoted by . We let UE refer to UE in cell . The received signal vector at BS at the th channel instance is given by
(1) 
where is the direct channel from UE to BS , is the channel from UE to IRS , is the channel from IRS to BS , and is the channel from IRS to IRS , . Also, is the transmit power and is the transmit symbol of UE , where . The noise vector at BS is assumed to be distributed according to zero-mean complex Gaussian with covariance matrix , i.e., , where denotes the identity matrix and is the noise variance. Finally, is a diagonal matrix with its diagonal entries representing the beamforming vector of IRS . , , is modeled as , incurring the signal attenuation and phase shift .
In (1), we consider the channels with three different paths from UE to BS : (i) the direct channel, (ii) the channel after one reflection from the IRSs (the sum over ), and (iii) the channel after two reflections from the IRSs (the sum over ). Higher-order reflections can also be incorporated in (1), i.e., signals reflected from more than two IRSs; we focus on up to the second-order reflections due to the large attenuation induced by multiple reflections between IRSs.
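To make the three path types in (1) concrete, the following Python sketch accumulates the direct, single-reflection, and double-reflection contributions to the effective channel of one UE. It is an illustration under assumed conventions rather than the simulation code used in this work: the function and argument names are hypothetical, channels are plain nested lists of complex numbers, and each IRS element is assumed to apply a common amplitude and an individual phase shift.

```python
import cmath


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def irs_apply(phases, vector, amplitude=1.0):
    """Apply a diagonal IRS reflection matrix to an incident signal vector.

    Assumption: every element uses the same amplitude and its own phase shift.
    """
    return [amplitude * cmath.exp(1j * p) * x for p, x in zip(phases, vector)]


def effective_channel(h_direct, h_ue_irs, g_irs_bs, f_irs_irs, phases):
    """Sum the direct, one-reflection, and two-reflection paths from a UE to a BS.

    h_direct : length-N list, UE -> BS
    h_ue_irs : list over IRSs of length-M lists, UE -> IRS r
    g_irs_bs : list over IRSs of N x M matrices, IRS r -> BS
    f_irs_irs: dict mapping (r, s) to an M x M matrix, IRS r -> IRS s (r != s)
    phases   : list over IRSs of length-M phase lists (radians)
    """
    n_irs = len(h_ue_irs)
    h_eff = list(h_direct)  # (i) direct path
    for r in range(n_irs):
        reflected = irs_apply(phases[r], h_ue_irs[r])
        # (ii) one reflection: UE -> IRS r -> BS
        for i, v in enumerate(matvec(g_irs_bs[r], reflected)):
            h_eff[i] += v
        # (iii) two reflections: UE -> IRS r -> IRS s -> BS
        for s in range(n_irs):
            if s == r:
                continue
            twice = irs_apply(phases[s], matvec(f_irs_irs[(r, s)], reflected))
            for i, v in enumerate(matvec(g_irs_bs[s], twice)):
                h_eff[i] += v
    return h_eff
```

With two single-element IRSs, unit channels, and zero phase shifts, the five paths (one direct, two single-reflection, two double-reflection) add coherently, which is a quick sanity check on the path count.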
We assume that a linear combiner is employed at BS to restore from , which yields
(2) 
where superscript denotes the conjugate transpose.
II-B Problem Formulation and Challenges
We aim to maximize the sum-rate over all the UEs in the network through design of the UE powers , BS combiners , and IRS beamformers , where is the IRS beamforming vector on the diagonal of , i.e., . With as the signal-to-interference-plus-noise ratio (SINR) of UE , we propose the following optimization problem:
maximize  (3)  
subject to  
variables 
where is the set of power values, is the codebook for BS combiners, and is the codebook for IRS beamformers.¹
¹ A codebook structure can be employed for an IRS because, in practice, an IRS is controlled by a field-programmable gate array (FPGA) circuit that stores a set of coding sequences [cui2014coding].
The problem in (3) is an optimization problem at time , where , i.e., the optimization of variables is performed once every time instances. If the instantaneous channels , , and in (1) are all known, then conventional optimization methods, e.g., successive convex approximation or integer programming, could be applied, since in (3) can be formulated as (4) with the known channels. However, IRS-assisted wireless networks face the following challenges in practice:


IRS channel acquisition: Although most of the works, e.g., [8990007, zhi2020uplink, 9162705, zhang2020intelligent], assume that channels are perfectly known, this assumption is impractical because an IRS is passive and often does not have sensing capabilities. While special IRS hardware with the ability to estimate the concatenated channels does exist [wang2020channel], the time overhead could easily overwhelm the coherent channel resources, especially when there are multiple IRSs.
Dynamic channels: Channel dynamics in wireless environments add another degree of difficulty to channel acquisition and estimation. Because the channels are dynamic and unknown, the optimization in (3) cannot be solved with conventional model-based optimization approaches.

Centralization: A centralized implementation to solve (3) would require gathering all the information at a central point, which is impractical in our setting. Given the interdependencies among the design variables taken by different cells and their impact on the overall objective function, distributed optimization of the variables in (3) is challenging.
To address these challenges, we convert (3) into a sequential decision making problem, where the variables are designed via successive interactions with the environment through deep reinforcement learning (DRL). While conventional DRL assumes a centralized implementation, we develop a multiagent DRL approach, where each BS acts as an independent agent in charge of tuning its local UEs' transmit powers, local IRS beamformer, and combiners. To cope with the nonstationarity issue of multiagent DRL [marinescu2017prediction], we carry out the learning through limited information-sharing among neighbouring BSs.
(4) 
III Multiagent DRL Framework Design
In this section, we first introduce the information collection process at the BSs and design an information-sharing scheme (Sec. III-A). We then formulate a Markov decision process (MDP) (Sec. III-B) and propose a dynamic control scheme (Sec. III-C) to solve our optimization from Sec. II-B.
III-A Local Observations and Information Exchange
We consider a setting where each BS only acquires scalar effective channel powers from a subset of UEs. When UE transmits a pilot symbol with power , BS measures the scalar effective channel power (after combining with ), , which is given by
(5) 
where is the scalar effective channel. The vector is the effective channel from UE to BS (before combining), which is expressed as follows:
(6) 
BS collects the scalar effective channel powers of the links (i) from local UEs (in cell ) to BS , (ii) from neighbouring UEs (not in cell ) to BS , and (iii) from local UEs to neighbouring BSs. BS measures (i) and (ii) as local observations, but needs to receive (iii), which cannot be measured by BS , from neighbouring BSs. Additionally, BS receives a penalty value from neighbouring BSs, where the penalty value is used for designing the reward function and will be formalized in Sec. III-B3. Note that concurrent estimation of the scalar effective channel powers of multiple UEs can be performed by UE-specific reference signals in the Long-Term Evolution (LTE) standard [LTEstandard]. Acquiring scalar effective channel powers from only a subset of UEs lowers the CSI acquisition overhead compared to the conventional method of acquiring large-dimensional vector or matrix CSI from each individual UE for each IRS.
To clarify which neighbouring UEs are included in (ii) and which neighbouring BSs are included in (iii), we define two sets of cell indices. First, we define the set of indices of dominantly interfering neighboring cells, . UEs in cell are dominantly interfering with the data link of local UEs (in cell ). Formally, , we have . The size of this set is a control variable . For (ii), then, we include neighbouring UEs in cell .
Second, we define the set of indices of dominantly interfered neighboring cells, . The data links of UEs in cell are dominantly interfered by local UEs (in cell ). Formally, , we have . The size of this set is a control variable . For (iii), then, we include neighbouring BSs of cell .
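One way to read the two set definitions above is as a top-k selection over measured effective channel gains. The sketch below follows that reading; it is an assumption-laden illustration, since the function name, the dictionary layout, and the tie-breaking order are hypothetical and not taken from the formal definitions.

```python
def dominant_cells(gain_by_cell, own_cell, set_size):
    """Return the `set_size` neighbouring cell indices with the largest gains.

    gain_by_cell: dict mapping a cell index to the effective channel gain
                  associated with that cell (assumed to be measured already).
    own_cell    : index of the local cell, which is excluded from the result.
    set_size    : the control variable that fixes the size of the set.
    """
    candidates = [cell for cell in gain_by_cell if cell != own_cell]
    ranked = sorted(candidates, key=lambda cell: gain_by_cell[cell], reverse=True)
    return ranked[:set_size]


# Hypothetical gains seen from cell 0; the two strongest neighbours are kept.
interfering = dominant_cells({0: 5.0, 1: 0.2, 2: 0.9, 3: 0.5}, own_cell=0, set_size=2)
```

The same helper could serve both sets: for the dominantly interfering cells the gains would be those from neighbouring UEs to the local BS, and for the dominantly interfered cells those from local UEs to neighbouring BSs.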
The effective channel gain, used in defining and , can be acquired by the antenna circuit before digital processing (e.g., from the automatic gain control (AGC) circuit [mo2017channel]), without the explicit effective channel vector or combiner implementation. BS also measures of all local UEs, by measuring the received signal strength indicator (RSSI) and the reference signal received power (RSRP), which are the conventional measures to evaluate the signal quality in LTE standards [LTEstandard]. Using the SINRs, BS then calculates the achievable data rate of UE as . Here, we omit the bandwidth parameter, assuming the same bandwidth for all the data links.
III-B Markov Decision Process Model
We formulate the decision making process of each BS as an MDP with states, actions, and rewards:
III-B1 State
We define the state space of BS as
(7) 
where each constituent set is described below.
(i) Local channel information. consists of the scalar effective channel powers from local UEs observed at two consecutive times and , given by
Here, can be obtained from (5) at time , and is a version of (5) obtained at time using previous-time variables , , and . Including both enables us to capture the effect of channel variation over time.
(ii) From-neighbor channel information. contains the scalar effective channel powers from UE in neighboring cell , and the index , for , . Formally,
This set captures the interference from neighbor UEs to cell .
(iii) To-neighbor channel information. contains the scalar effective channel powers from local UE to BS , and the index , for , . Formally,
This set captures the amount of interference that local UEs in cell inflict on neighboring cells. This information enables BS to adjust the transmit powers of local UEs to reduce interference to the neighboring cells.
(iv) Previous local variables and local sum-rate. consists of previous local variables, i.e., , , and , and the local sum-rate . Formally,
III-B2 Action
The action space is defined as
(8) 
where , , are the index gradient variables used for updating the local UE transmit power, combiner of BS , and local IRS reflect beamformer. These index gradient variables are defined over a binary , or ternary alphabet as we will describe in Sec. IV.
Once BS determines the action in (8), the BS feeds forward to UE , which then updates its power index as . The power of UE is set to , , where denotes th element of the power set in (3). The BS also feeds forward to IRS , which then updates its beamformer index as , and its beamformer as where is the th vector in the codebook in (3). Finally, the combiner index is updated as . The combiner of BS is set to .
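The index update described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the names are hypothetical, the binary and ternary alphabets are assumed to be {-1, +1} and {-1, 0, +1}, and clipping at the ends of the power set or codebook is an assumed boundary rule (wrapping around would be another option).

```python
def update_index(index, step, size):
    """Shift an index by `step` and clip it to the valid range [0, size - 1]."""
    return max(0, min(size - 1, index + step))


def apply_action(indices, steps, sizes):
    """Apply one index gradient step to every design variable.

    indices: current index per variable, e.g. {"power_ue0": 3, "irs": 10, "combiner": 0}
    steps  : index gradient per variable, each drawn from the assumed alphabet
    sizes  : number of entries in the power set or codebook per variable
    """
    return {
        name: update_index(indices[name], steps[name], sizes[name])
        for name in indices
    }


# Hypothetical example: one UE power index, the IRS beamformer index, and the combiner index.
new_indices = apply_action(
    {"power_ue0": 3, "irs": 10, "combiner": 0},
    {"power_ue0": 1, "irs": -1, "combiner": -1},
    {"power_ue0": 8, "irs": 16, "combiner": 16},
)
```

Because each action only moves an index by at most one position, the number of possible actions depends on the alphabet size and the number of controlled variables, not on the codebook size.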
III-B3 Reward
Aiming to only maximize the local sum-rate at each BS could increase the interference to the neighboring cells. To incorporate the entire system performance, we design the reward including penalty terms as
(9) 
where the first term is the sum-rate of cell and the second term is the sum of penalties. The penalty is the rate loss of the dominantly interfered cell caused by the interference of local UEs (in cell ), which is calculated at BS as
(10) 
where is the rate loss of UE caused by the interference of local UEs in cell . A similar reward function was found to be effective for multiagent DRL-based beamforming [ge2020deep]. The term in (10) denotes the data rate of UE without the interference of the local UEs in cell , while is the data rate including the interference. If there is no interference, the two terms cancel each other, leading to zero penalty. Otherwise, the penalty is positive.
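The structure of the reward in (9) and the penalty in (10) can be illustrated with the short Python sketch below. It is written under stated assumptions: rates are computed as log2(1 + SINR) with the bandwidth omitted, powers are given in linear scale, and all function and argument names are hypothetical.

```python
import math


def rate(signal, interference, noise):
    """Achievable rate log2(1 + SINR), with the bandwidth factor omitted."""
    return math.log2(1.0 + signal / (interference + noise))


def rate_loss(signal, total_interference, interference_from_cell, noise):
    """Rate of a victim UE without a given cell's interference minus its rate with it."""
    without_cell = rate(signal, total_interference - interference_from_cell, noise)
    with_cell = rate(signal, total_interference, noise)
    return without_cell - with_cell


def penalty(victim_ues, noise):
    """Sum the rate losses over the UEs of one dominantly interfered cell.

    victim_ues: iterable of (signal, total_interference, interference_from_cell).
    """
    return sum(rate_loss(s, i_tot, i_cell, noise) for s, i_tot, i_cell in victim_ues)


def reward(local_rates, penalties):
    """Local sum-rate minus the penalties reported by neighbouring BSs."""
    return sum(local_rates) - sum(penalties)
```

When `interference_from_cell` is zero the two rates coincide and the rate loss vanishes, which matches the zero-penalty case described above.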
III-C Dynamic Control Scheme Based on Multiagent DRL
In the proposed MDP, the channel values used as states are continuous variables, which makes conventional RL, i.e., Q-learning based on a Q-table, inapplicable. We thus adopt deep Q-networks (DQN) [mnih2015human]. BS possesses its own train DQN, , with weights , and target DQN, , with weights , where the state and action are defined in Sec. III-B. The pseudocode of the proposed dynamic control scheme based on multiagent DRL is provided in Algorithm 1. Our algorithm follows a decentralized training with decentralized execution (DTDE) framework, where both training and execution are independently carried out at each agent. Therefore, our algorithm is independent of the UEs in other surrounding agents (BSs). Further, our algorithm incorporates the index gradient approach for codebook-based BS combining and IRS beamforming, which is independent of the number of antennas/elements and the size of the codebook.
IV Numerical Evaluation and Discussion
In this section, we first describe the simulation setup (Sec. IV-A) and evaluation scenarios (Sec. IV-B). Then, we present and discuss the results (Sec. IV-C).
IV-A Simulation Setup
IV-A1 Parameter settings
We consider a cellular network with hexagonal cells, as shown in Fig. 2(a). We assume , , and , , similar to [pan2020multicell]. The BSs are located at the center of each cell with 10 m height, and the distance between adjacent BSs is 100 m. Each IRS is deployed near the BS, and UEs are randomly placed in the cells. The set for UE power control is given by , where dBm and dBm are the minimum and maximum transmit powers, and . For BS combiner and IRS beamformer codebooks, we use a random vector quantization (RVQ) [au2007performance] codebook with size . We set dBm, .
IV-A2 Channel modeling
We consider a single frequency band with flat fading and adopt a temporally correlated block fading channel model. Following a common cellular standard [ieeep80216], we assume coherence time ms and center frequency GHz. The channel vector is modeled as
(11) 
where denotes the large-scale fading coefficient from UE to BS , modeled as ; is the pathloss at the reference distance , is the distance between UE and BS , and is the pathloss exponent between them. We set dB and m. denotes the Rayleigh fading vector, modeled by a first-order Gauss-Markov process [sklar2001digital]:
(12) 
where , , and . The time correlation coefficient obeys the Jakes model [sklar2001digital], i.e., , where is the zeroth-order Bessel function of the first kind, and is the maximum Doppler frequency, with velocity of UE and m/s. The same modeling for is applied for the channels between the UEs and the IRSs, i.e., , , with pathloss exponent . Since IRSs are placed at the desired locations to have fewer variations of IRS-BS/IRS-IRS channels as compared to UE-BS/UE-IRS channels [pan2020multicell], and are assumed to be stationary. Each entry for the channels is distributed according to and , respectively. and denote the large-scale fading coefficients with pathloss exponents and , respectively.
We assume , , , , , , and , . To model the presence of extensive obstacles and scatterers, the pathloss exponent between the UEs and BS is taken to be . Because the IRS-aided link can have less path loss than that of the direct UE-BS channel by properly choosing the location of the IRS, we set the pathloss exponents of the UE-IRS link, of the IRS-BS link, and of the IRS-IRS link to , , and , respectively [pan2020multicell]. We assume , and adopt ( km/h), ( km/h), and ( km/h), where is the UE speed.
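The temporally correlated fading described above can be sketched as follows. The code assumes the standard first-order Gauss-Markov recursion h(t) = rho * h(t-1) + sqrt(1 - rho^2) * e(t) with unit-variance complex Gaussian innovations, and the standard Jakes coefficient rho = J0(2 * pi * f_d * T) with f_d = v * f_c / c. The function names and the numerical values in the usage lines are hypothetical and not taken from the simulation settings.

```python
import math
import random


def bessel_j0(x, steps=2000):
    """Zeroth-order Bessel function of the first kind.

    Uses J0(x) = (1/pi) * integral_0^pi cos(x * sin(theta)) d(theta),
    evaluated with the midpoint rule.
    """
    width = math.pi / steps
    total = sum(math.cos(x * math.sin((k + 0.5) * width)) for k in range(steps))
    return total * width / math.pi


def max_doppler(speed_mps, carrier_hz, light_speed=3.0e8):
    """Maximum Doppler frequency f_d = v * f_c / c."""
    return speed_mps * carrier_hz / light_speed


def jakes_rho(doppler_hz, coherence_s):
    """Time correlation coefficient under the Jakes model."""
    return bessel_j0(2.0 * math.pi * doppler_hz * coherence_s)


def complex_gaussian(rng, variance=1.0):
    """Draw one circularly symmetric complex Gaussian sample."""
    sigma = math.sqrt(variance / 2.0)
    return complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))


def gauss_markov_step(h_prev, rho, rng):
    """Advance a small-scale fading vector by one channel instance."""
    scale = math.sqrt(1.0 - rho * rho)
    return [rho * h + scale * complex_gaussian(rng) for h in h_prev]


# Hypothetical usage: 4-antenna fading vector, 1 m/s UE, 2 GHz carrier, 5 ms coherence time.
rng = random.Random(0)
rho = jakes_rho(max_doppler(1.0, 2.0e9), 5.0e-3)
h = [complex_gaussian(rng) for _ in range(4)]
h = gauss_markov_step(h, rho, rng)
```

A correlation coefficient close to one corresponds to slowly varying channels, while smaller values make consecutive channel instances less predictable from past observations.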
IV-B Evaluation Scenarios
IV-B1 Scenario 1. The effective channels from local UEs are not known
In this scenario, each BS measures the scalar effective channel powers directly from received signals without explicitly obtaining the effective channels in the vector form of (6). We introduce two baselines in this scenario: RRR=(random, random, random) and MRR=(maximum, random, random). The name of each baseline indicates how it selects its (UE power, IRS beamformer, BS combiner) variables as a tuple. We propose DQN1, where the action space consists of elements for UE powers, the IRS beamformer, and BS combiners. The index gradient variables are binary, i.e., .
IV-B2 Scenario 2. The effective channels from local UEs are known
In this scenario, each BS measures the effective channels from local UEs in the vector form of (6). Each BS is assumed to adopt a maximum ratio combiner (MRC) by finding the index , where is the effective channel from local UE. We introduce several baselines: MRM=(maximum, random, MRC), FRM=(25% of maximum, random, MRC), RRM=(random, random, MRC), and MM with no IRS=(maximum, N/A, MRC). MM with no IRS assumes the IRSs to be turned off. In this scenario, we propose DQN2 and DQN3. In DQN2, the action space consists of elements for UE powers and the IRS beamformer (the action space does not have the elements , in (8)). The BS combiner is designed as MRC and the index gradient variable is binary, i.e., . The action space of DQN3 is the same as that of DQN2, except that it uses a ternary index gradient variable, i.e., .
In both scenarios, the DQNs² are composed of an input layer, an output layer, and two fully-connected hidden layers. The input size is . The output size is , , and for DQN1, DQN2, and DQN3, respectively. For DQN1, the number of neurons in the two hidden layers is 70 and 100; for DQN2, 40 and 30; and for DQN3, 70 and 70. The rectified linear unit (ReLU) activation function is employed. In Algorithm 1, we adopt the ε-greedy method with , where and , . We consider , , and , . We set , i.e., the target DQN is updated with the weights of the train DQN after a time of . We employ the RMSProp optimizer for training.
² All DQNs use the same state space and reward function given in Sec. III-B. For the state information group (iv) in Sec. III-B1, the indices of previous local variables are stored in the state.
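The exploration and target-update mechanics mentioned above can be sketched generically as follows. This is not Algorithm 1: the multiplicative decay schedule, the function names, and the numerical values are assumptions chosen for illustration, since the exact schedule and parameters are not reproduced here.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decay_epsilon(epsilon, epsilon_min, decay):
    """Assumed schedule: multiplicative decay with a lower bound."""
    return max(epsilon_min, epsilon * decay)


def should_sync_target(step, period):
    """True when the target DQN should copy the weights of the train DQN."""
    return step > 0 and step % period == 0


# Hypothetical loop skeleton with placeholder Q-values for a four-action agent.
rng = random.Random(0)
epsilon = 0.3
sync_steps = []
for step in range(1, 301):
    action = epsilon_greedy([0.1, 0.7, 0.2, 0.0], epsilon, rng)
    epsilon = decay_epsilon(epsilon, epsilon_min=0.01, decay=0.99)
    if should_sync_target(step, period=100):
        sync_steps.append(step)  # here the target weights would be overwritten
```

Occasional random actions account for the small fluctuations in learning curves, while the periodic weight copy keeps the regression target stable between updates.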
IV-C Simulation Results and Discussion
Fig. 2(b),(c),(d) depict the average achievable data rate over all 21 UEs with different values of . The dotted lines show the performance of the schemes in Scenario 1. With varying channels, RRR and MRR select random or fixed indices for variables, and therefore have low average data rates over time. On the other hand, DQN1 learns and adapts to the varying channels over time by exploiting the local observations and information-sharing in our sequential decision making.
The solid lines represent the performance of the schemes in Scenario 2. The MM with no IRS gives better performance than the baselines using IRS, implying that random IRS beamforming is worse than not deploying it at all. This also reveals the vulnerability of IRS-assisted systems to adversarial IRS utilization. Our DQN2 and DQN3 methods outperform the baselines, which emphasizes the benefit of carefully optimizing the IRS configuration with the rest of the cellular network. DQN2 yields slightly better performance and converges faster than DQN3: the faster convergence is due to neural networks training faster with a smaller number of outputs, and the better overall performance is consistent with the observation [mnih2015human] that DQNs are more successful with smaller action spaces.
Comparing Scenario 1 with 2, i.e., the dotted lines with the solid lines in Fig. 2(b),(c),(d), we note that the performance of DQN1, which only uses scalar effective channel powers, is comparable with the baselines in Scenario 2, which use vectorized local effective channels for MRC. Also, with higher values, the DQNs experience faster convergence, which is particularly noticeable in DQN1. The fluctuation of the DQN plots occurs due to the ε-greedy policy, which occasionally selects random actions to avoid getting trapped in local optima. Overall, in each case, we see that our MDP-based algorithms obtain significant performance improvements, emphasizing the benefit of our multiagent DRL method.
V Conclusion
We developed a novel methodology for uplink multi-IRS-assisted multicell systems. Due to temporal channel variations and difficulties of channel acquisition, we considered that BSs only acquire scalar effective channel powers from a subset of UEs. We developed an information-sharing scheme among neighboring BSs and proposed a dynamic control scheme based on multiagent DRL, in which each BS acts as an agent and adaptively designs its local UE powers, local IRS beamformer, and its combiners. Through numerical simulations, we verified that our algorithm outperforms conventional baselines.
Acknowledgment
D.J. Love was supported in part by the National Science Foundation (NSF) under grants CNS-1642982 and CCF-1816013.