I Introduction
Future networks must provide higher channel capacity, lower latency, and better quality of service than contemporary networks [1]. These goals can only be achieved by drastic improvements to the wireless network architecture [2]. Among potential candidates, Massive MIMO (multiple-input multiple-output) is an emerging physical-layer technology that allows a base station (BS) equipped with many antennas to serve tens of users on the same time and frequency resource [3]. Utilizing the same spectrum and power budget, Massive MIMO can increase both spectral and energy efficiency by orders of magnitude compared with the conventional systems used today. This is because the propagation channels decorrelate as the number of antennas at each BS increases, so strong array gains are achievable with little inter-user interference.
Resource allocation is important in Massive MIMO networks to deal with inter-user interference and, particularly, so-called pilot contamination [4]. Many resource allocation problems in Massive MIMO are easier to solve than in conventional systems: thanks to channel hardening, the utility functions depend only on the large-scale fading coefficients, which are stable over a long time period [5], whereas conventional systems must adapt to the quickly varying small-scale fading. Table I categorizes the existing works on power control for cellular Massive MIMO in terms of utility functions and optimization variables. There are only a few works that jointly optimize the pilot and data powers, which is of key importance to deal with pilot contamination in multi-cell systems. In this paper, we optimize the sum spectral efficiency (SE) with respect to the pilot and data powers. To the best of our knowledge, this is the first paper that considers this problem in cellular Massive MIMO systems where each BS serves a varying number of users. Note that we did not include single-cell papers in Table I. For example, the paper [6] exploits the special structure arising from imperfect channel state information (CSI) in single-cell systems to maximize the sum SE using an efficient algorithm. This algorithm finds the globally optimal pilot and data powers, but it does not extend to multi-cell systems since the structure is entirely different.
Deep learning [7] is a popular data-driven approach to solving complicated problems and has shown superior performance in various applications such as image restoration and pattern recognition. Despite its complicated and rather heuristic training phase, deep learning has recently shown promising results in communication applications [8]. By the universal approximation theorem [9], deep learning can learn to approximate functions for which we have no closed-form expression. The authors in [10] construct a fully-connected deep neural network for the sum SE maximization problem in a wireless system serving a few tens of users. This network structure is reused in [11] to solve an energy-efficiency problem. Standard fully-connected feedforward networks with many layers are used, but since the considered problems are challenging, the prediction performance is much lower than when directly solving the optimization problems, with a loss that varies depending on the system setting. Moreover, previous neural network designs for resource allocation in wireless communications utilize instantaneous CSI, which is practically questionable, especially in cellular Massive MIMO systems. This is because the small-scale fading varies very quickly and the deep neural networks have very limited time to process the collection of all the instantaneous channel vectors, each having a number of parameters proportional to the number of BS antennas. The recent work in
[18] designs a neural network utilizing only statistical channel information to predict transmit powers in an equally-loaded cellular Massive MIMO system with spatially correlated fading. Although the prediction performance is good, the drawback of the proposed approach is that one specific neural network needs to be trained for each combination of the number of active users in the cells. With multiple cells and a varying number of users per cell, a different neural network would be needed for every combination that can appear. Even in the small setup considered in [18], this requires 1 million different neural networks, which is not practical. In this paper, we consider the joint optimization of the pilot and data powers for maximum sum SE in multi-cell Massive MIMO systems. Our main contributions are:

We formulate a sum ergodic SE maximization problem, with the data and pilot powers as variables, where each cell may have a different number of active users. To overcome the inherent non-convexity, an equivalent problem with an element-wise convex structure is derived. An alternating optimization algorithm is proposed to find a stationary point, where each iteration is solved in closed form.

We design a deep convolutional neural network (CNN) that learns the solution to the alternating optimization algorithm. The inputs to the CNN are the large-scale fading coefficients between each user and BS, while the outputs are the pilot and data powers. Hence, the number of inputs/outputs is independent of the number of antennas. Our deep CNN, named PowerNet, has a residual structure and is densely connected.

We exploit the structure of the sum SE maximization problem to train PowerNet to handle a varying number of users per cell. Hence, in contrast to prior works, a single PowerNet is sufficient irrespective of the number of active users, and no retraining is needed.

Numerical results demonstrate the effectiveness of the proposed alternating optimization algorithm compared with the baseline of full transmit power. Meanwhile, PowerNet achieves highly accurate power prediction with a sub-millisecond runtime.
The remainder of this paper is organized as follows: Section II introduces our cellular Massive MIMO system model, with a varying number of users per cell, and the basic ergodic SE analysis. We formulate and solve the joint pilot and data power control problem for maximum sum SE in Section III. The proposed low-complexity deep learning solution is given in Section IV. Finally, numerical results are shown in Section V and the main conclusions are provided in Section VI.
Notation: Upper (lower) bold letters are used to denote matrices (vectors). denotes the expectation of a random variable, denotes the Hermitian transpose, and the cardinality of a set is denoted by . We let denote the identity matrix. , , and denote the complex, real, and nonnegative real fields, respectively. The floor operator is denoted by and the Frobenius norm by . Finally, denotes the circularly symmetric complex Gaussian distribution.
II Dynamic Massive MIMO System Model
We consider a multi-cell Massive MIMO system comprising cells, each having a BS equipped with antennas. We call it a dynamic system model since each BS is able to serve users, but perhaps only a subset of the users is active at any given point in time. We will later model the active subset of users randomly and exploit this structure when training a neural network. Since the wireless channels vary over time and frequency, we consider the standard block-fading model [16], where the time-frequency resources are divided into coherence intervals of modulation symbols for which the channels are static and frequency flat. In an arbitrary coherence interval, BS serves a subset of active users. We define a set containing the indices of all active users in cell , for which . The channel between active user in cell and BS is denoted as and follows an independent and identically distributed (i.i.d.) Rayleigh fading distribution:
(1) 
where
is the large-scale fading coefficient that models geometric path loss and shadow fading. The distributions are known at the BSs, but the realizations are unknown and need to be estimated in every coherence interval using a pilot transmission phase.
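Since the equation display (1) did not survive extraction, the following minimal sketch illustrates the i.i.d. Rayleigh fading model the text describes: each channel vector is drawn as CN(0, beta * I), where beta is the large-scale fading coefficient. The symbol names and the antenna count M are our assumptions.

```python
import numpy as np

def rayleigh_channel(beta, M, rng):
    """Draw h ~ CN(0, beta * I_M): i.i.d. Rayleigh fading with
    large-scale fading coefficient beta (assumed symbol name)."""
    real = rng.standard_normal(M)
    imag = rng.standard_normal(M)
    return np.sqrt(beta / 2.0) * (real + 1j * imag)

# The per-antenna power E[|h_m|^2] equals beta, which a sample
# average over many antennas should confirm.
```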
II-A Uplink Pilot Transmission Phase
We assume that a set of orthonormal pilot signals is used in the system. User in each cell is pre-assigned the pilot with , no matter if the user is active or not in the given coherence interval, but this pilot is only transmitted when the user has data to transmit (or receive). This pilot assignment guarantees that there is no intra-cell pilot contamination. The channel estimation of a user is interfered by the users in other cells that use the same pilot signal, which is called pilot contamination. The received baseband pilot signal at BS is
(2) 
where is the additive noise with i.i.d. elements. Meanwhile, is the pilot power that active user in cell allocates to its pilot transmission. The channel between a particular user in cell and BS is estimated from
(3) 
where the set contains the indices of the cells in which user is active; it is formed from the user activity sets of the cells as
(4) 
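Since display (4) is missing, a minimal sketch of how the set of cells sharing an active user index can be formed from the per-cell activity sets (the data layout is our assumption):

```python
def cells_with_active_user(active_sets, k):
    """Return the indices of the cells whose user k is active,
    in the spirit of (4); active_sets[l] holds the active user
    indices of cell l (hypothetical representation)."""
    return {l for l, users in enumerate(active_sets) if k in users}
```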
By using minimum mean square error (MMSE) estimation [25], the channel estimate of an arbitrary active user is as follows.
Lemma 1.
If BS uses MMSE estimation, the channel estimate of active user in cell is
(5) 
which follows a complex Gaussian distribution as
(6) 
Denoting the estimation error by , it is independently distributed as
(7) 
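The displays (5)–(7) did not survive extraction. For reference, under i.i.d. Rayleigh fading the classical MMSE estimate has the following well-known structure; the symbols below (pilot power, pilot length, contaminating set, noise power) are our assumptions, not the paper's notation:

```latex
\hat{\mathbf{h}}^{j}_{l,k}
  = \frac{\sqrt{\hat{p}_{l,k}\,\tau_p}\;\beta^{j}_{l,k}}
         {\sum_{l' \in \mathcal{P}_k} \hat{p}_{l',k}\,\tau_p\,\beta^{j}_{l',k} + \sigma^2}\,
    \mathbf{y}^{j}_{k},
\qquad
\hat{\mathbf{h}}^{j}_{l,k} \sim \mathcal{CN}\!\left(\mathbf{0},\, \gamma^{j}_{l,k}\,\mathbf{I}_M\right),
\qquad
\gamma^{j}_{l,k}
  = \frac{\hat{p}_{l,k}\,\tau_p\,\big(\beta^{j}_{l,k}\big)^2}
         {\sum_{l' \in \mathcal{P}_k} \hat{p}_{l',k}\,\tau_p\,\beta^{j}_{l',k} + \sigma^2}.
```

Here the sum runs over the cells whose user shares the pilot, which is where the pilot contamination enters the estimate.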
The statistical information in Lemma 1 about each channel estimate and estimation error is used to construct the linear combining vectors and to derive a closed-form expression of the uplink SE.
II-B Uplink Data Transmission Phase
During the uplink data transmission phase, every active user in cell transmits data symbol with . The received signal at BS is the superposition of signals from all users across cells:
(8) 
where is the power that active user in cell allocates to the data symbol and is complex Gaussian noise distributed as . Each BS uses maximum ratio combining (MRC) to detect the desired signals from its users. In particular, BS selects the combining vector for its user as
(9) 
and we will quantify the achievable spectral efficiency using the use-and-then-forget capacity bounding technique [16]. The closed-form expression of the lower bound on the uplink capacity is shown in Lemma 2.
Lemma 2.
If each BS uses MRC for data detection, a closed-form expression for the uplink ergodic SE of active user in cell is
(10) 
where the effective SINR value of this user is
(11) 
and
(12) 
Proof.
The proof follows along the lines of Corollary in [13], except for the different notation and the fact that every user can assign different powers to the pilot and data. ∎
The numerator of the SINR expression in (11) represents the array gain, which is directly proportional to the number of antennas at the serving BS. The first part of the denominator represents the pilot contamination effect and is also proportional to the number of BS antennas. Interestingly, active user in cell will have unbounded capacity as the number of antennas grows without bound if all other users using the same pilot sequence are silent (i.e., inactive or allocated zero transmit power). The remaining terms are non-coherent mutual interference and noise, whose impact vanishes as the number of antennas grows. Furthermore, the SE of a user is proportional to , which is the pre-log factor in (10). This is the fraction of symbols per coherence interval that are used for data transmission, which thus decreases when the number of pilots is increased. In the special case of , the analytical results in Lemma 2 particularize to equally-loaded systems as in previous works. That special case is unlikely to occur in practice since the data traffic is generated independently for each user.
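As a reading aid for the garbled displays (10)–(11), the qualitative structure described above can be written as follows, with placeholder symbols a, b, c (our notation, not the paper's) for the array-gain, pilot-contamination, and non-coherent-interference coefficients:

```latex
\mathrm{SE}_{l,k} = \frac{\tau_c - \tau_p}{\tau_c}\,\log_2\!\left(1 + \mathrm{SINR}_{l,k}\right),
\qquad
\mathrm{SINR}_{l,k}
  = \frac{M\, a_{l,k}}
         {M \sum_{l' \neq l} b_{l',k} + c_{l,k} + \sigma^2},
```

so that letting M grow without bound gives unbounded SINR exactly when the pilot-contamination coefficients b vanish, i.e., when all other users sharing the pilot are silent.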
III Joint Pilot and Data Power Control for Sum Spectral Efficiency Optimization
We are concerned with sum SE maximization since high SE is important for future networks, and (weighted) sum SE maximization is also the core problem to be solved in practical algorithms for dynamic resource allocation [26]. The previous works [6, 24] consider this problem for single-cell systems with joint pilot and data power control, and for multi-cell systems with only data power control, respectively. In contrast, we formulate and solve a sum SE maximization problem with joint pilot and data power control. This optimization problem has not been tackled before in the Massive MIMO literature due to its inherently non-convex structure. In this section, we develop an iterative algorithm that reaches a stationary point in polynomial time by solving a series of convex subproblems in closed form.
III-A Problem Formulation
We consider the optimization problem that maximizes the sum SE of all active users in the system, with limited power for each transmitted symbol:
(13)  
subject to  
where is the maximum power that user in cell can allocate to each transmitted symbol. Problem (13) is independent of the small-scale fading, so it allows for long-term performance optimization if the users are continuously active and there is no large-scale user mobility. However, in practical systems, some users move quickly and new scheduling decisions are made every few milliseconds based on the users' traffic. It is therefore important to be able to solve (13) very quickly to adapt to these changes.¹

¹ Note that the ergodic SE is a reasonable performance metric also in this scenario, since long codewords can span the frequency domain and the channel hardening makes the channel after MRC almost deterministic. The simulations in [5] show that coding over 1 kB of data is sufficient to operate close to the ergodic SE.
Inspired by the weighted MMSE methodology [27], we now propose an iterative algorithm to find a stationary point of (13). By removing the pre-log factor and introducing and as new optimization variables, we formulate a new problem that is equivalent to (13).
Theorem 1.
Proof.
The proof consists of two main steps: first, the mean square error is formulated by considering a single-input single-output (SISO) communication system with deterministic channels having the same SE as in Lemma 2, where is the beamforming coefficient used in this SISO system and is the weight at the receiver. The equivalence of the two problems (13) and (14) is then obtained by finding the optimal solutions of and given the other optimization variables. The detailed proof is given in Appendix A. ∎
The new problem formulation in Theorem 1 is still non-convex, but it has an important desirable property: if we consider one of the sets , , , and as the only optimization variables, while the other variables are held constant, then problem (15) is convex. Note that the set of optimization variables and the SE expressions are different from those in the previous works [28, 29] that followed similar paths when reformulating their sum SE problems, which is why Theorem 1 is a main contribution of this paper. In particular, in our case we obtain closed-form solutions in each iteration, leading to a particularly efficient implementation. We exploit this property to derive an iterative algorithm that finds a local optimum (stationary point) of (15), as shown in the following subsection.
III-B Iterative Algorithm
This subsection provides an iterative algorithm that obtains a stationary point of problem (14) by alternating between updates of the different sets of optimization variables. The procedure is established by the following theorem.
Theorem 2.
From an initial point satisfying the constraints, a stationary point to problem (14) is obtained by updating in an iterative manner. At iteration , the variables are updated as follows:
Proof.
The proof derives the closed-form optimal solutions in (16)–(21) for each of the optimization variables, with the others fixed, by taking the first derivative of the Lagrangian function of (14) and equating it to zero. The fact that problems (13) and (14) have the same set of stationary points is further confirmed by the chain rule. The proof is given in Appendix B. ∎

Theorem 2 provides an iterative algorithm that obtains a local optimum of (13) and (14) with low computational complexity thanks to the closed-form solutions in each iteration. Algorithm 1 summarizes this iterative process. Starting from any feasible initial set of powers, each optimization variable is updated in every iteration according to (16)–(21). The iterative process terminates when the variation between two consecutive iterations is small. For instance, the stopping condition for a given accuracy may be defined as
(22) 
Counting the multiplications, divisions, and logarithms as the dominating operations, the number of arithmetic operations needed for Algorithm 1 to reach accuracy is
(23) 
where is the number of iterations required for convergence. From Theorem 2, we further observe the following relationship between the data and pilot powers allocated to a user.
Corollary 1.
If an active user has a large-scale fading coefficient equal to zero, then it will always be assigned zero transmit powers by the algorithm in Theorem 2. Hence, an equivalent way of managing inactive users is to set their large-scale fading coefficients to zero and use .
In addition, the system may reject some active users that have small but nonzero large-scale fading coefficients, since Algorithm 1 can assign zero power to them, similar to the behavior of standard water-filling algorithms. This is a key benefit of sum SE maximization compared with max-min fairness power control [16, 17, 14, 13, 18, 19, 20] and maximum product-SINR power control [13, 18, 21], which always allocate nonzero power to all users and therefore require an additional heuristic user admission control step to select which users to drop from service due to poor channel conditions. If a particular user in cell is not served, this implies that and . Hence, this user transmits neither in the pilot nor in the data phase. Corollary 1 will enable us to design a single neural network that can mimic Algorithm 1 for any number of active users.
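The alternating pattern of Algorithm 1 (cycle through the variable blocks, apply each closed-form update with the other blocks held fixed, and stop when consecutive iterates agree to a tolerance as in (22)) can be sketched as follows. The toy bi-convex objective and its per-block minimizers below are our own stand-ins, since the actual updates (16)–(21) are not reproduced here.

```python
def alternating_minimization(x0, y0, eps=1e-9, max_iter=1000):
    """Block-coordinate descent on the bi-convex toy objective
    f(x, y) = (x - 1)^2 + (y - 2)^2 + x*y.  Each block update is the
    closed-form minimizer with the other block fixed, mirroring how
    Algorithm 1 updates one variable set per step."""
    x, y = x0, y0
    for _ in range(max_iter):
        x_new = 1.0 - y / 2.0        # argmin_x f(x, y): from 2(x-1) + y = 0
        y_new = 2.0 - x_new / 2.0    # argmin_y f(x, y): from 2(y-2) + x = 0
        # stopping rule in the spirit of (22): small change between iterations
        if max(abs(x_new - x), abs(y_new - y)) < eps:
            return x_new, y_new
        x, y = x_new, y_new
    return x, y
```

Each subproblem can only improve the objective, so the sequence converges to a stationary point; for this toy function that point is (0, 2).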
IV A Low-Complexity Solution with a Convolutional Neural Network
We introduce a deep learning framework for joint pilot and data power allocation in dynamic cellular Massive MIMO systems, which uses supervised learning to mimic the power control obtained by Algorithm 1. We stress that for non-convex optimization problems, a supervised learning approach with high prediction accuracy is useful both for achieving a low-complexity implementation, harnessing the advances in implementing neural networks on GPUs, and for providing a good baseline for further work, e.g., supervised learning as a warm start for unsupervised learning or to improve the performance of the testing phase [30]. We first make an explicit assumption on how the large-scale fading coefficients are generated for each realization of the Massive MIMO network, by exploiting Corollary 1.
Assumption 1.
We consider an -cell system where the activation of each user is determined by an independent Bernoulli distribution with activity probability . The large-scale fading coefficients associated with a user in cell have the probability density function (PDF) , in which and , for . In each realization of the system, i.i.d. users are generated in each cell. User in cell is active (i.e., ) with probability . Inactive users have , and the large-scale fading coefficients of an active user are obtained as i.i.d. realizations with a PDF that satisfies
(24) 
such that it has its strongest channel from the serving BS.
The process of generating system realizations is illustrated in Fig. 1. Note that all users in cell have the same , which represents the user distribution over the coverage area of this cell, but this function is different for each cell. For notational convenience, each cell has the same maximum number of users and the activity probability is independent of the cell and location, but these assumptions can be easily generalized.
Assumption 1 indicates that a user should be handled equally irrespective of its index in the cell. The fact that all large-scale fading coefficients belong to the compact set originates from the law of conservation of energy, and fits well with the structural conditions required to construct a neural network [9]. There are many ways to define the PDFs of the large-scale fading coefficients. One option is to match them to channel measurements obtained in a practical setup [31]. Another option is to define the BS locations and user distributions and then specify a path-loss model with shadow fading. In the numerical part of this paper, we take the latter approach and follow the 3GPP LTE standard [32], which utilizes a Rayleigh-lognormal fading model that matches well with channel measurements in non-line-of-sight conditions. The following model is used in Section V.
Example 1.
Consider a setup with square cells. In each cell, the users are uniformly distributed in the serving cell at distances to the serving BS larger than m. Each user has activity probability . For an active user in cell , we generate the large-scale fading coefficient to BS as

(25)

where is the physical distance and is shadow fading that follows a normal distribution with zero mean and standard deviation dB. If the conditions (24) and/or are not satisfied for a particular user, we simply redraw all the shadow fading realizations for that user.

In a cellular network with users there are different realizations of the user activities, which is a huge number (up to in the simulation part with 90 users). If we had to design one specific neural network for each of these realizations, the solution would be practically meaningless. A main contribution of our framework is that we build a single neural network that can handle any activity/inactivity pattern and has a unified structure for all training samples. Note that the proposed network might have more parameters than actually needed, since our main goal is to provide a proof of concept. The network with the lowest number of parameters is different for every propagation environment and is therefore not considered in this work, which focuses on the general properties rather than the fine-tuning.
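A minimal sketch of the realization-generation procedure in Assumption 1 and Example 1 follows. The cell layout, path-loss constant, path-loss exponent, and the array layout beta[l, j, k] (coefficient from user k of cell l to BS j) are our assumptions, since the concrete numbers in (25) are missing; shadow fading is redrawn until the serving BS gives the strongest channel, as required by (24), and inactive users get zero coefficients in line with Corollary 1.

```python
import numpy as np

def generate_realization(L_side=2, K=5, p_active=0.5, cell_km=0.25,
                         sigma_sf_db=7.0, alpha=3.76, rng=None):
    """Generate one system realization: Bernoulli activity flags and
    large-scale fading beta[l, j, k] from user k of cell l to BS j.
    All numerical constants are illustrative assumptions."""
    rng = rng if rng is not None else np.random.default_rng()
    L = L_side * L_side
    # BSs at the centers of an L_side x L_side grid of square cells
    bs = np.array([[(i + 0.5) * cell_km, (j + 0.5) * cell_km]
                   for i in range(L_side) for j in range(L_side)])
    active = rng.random((L, K)) < p_active
    beta = np.zeros((L, L, K))
    for l in range(L):
        for k in range(K):
            if not active[l, k]:
                continue  # inactive users keep beta = 0 (Corollary 1)
            # user uniform in cell l, at least 35 m from its serving BS
            while True:
                pos = bs[l] + rng.uniform(-cell_km / 2, cell_km / 2, 2)
                if np.linalg.norm(pos - bs[l]) >= 0.035:
                    break
            while True:  # redraw shadowing until (24) holds
                d = np.maximum(np.linalg.norm(bs - pos, axis=1), 0.035)
                shadow = rng.normal(0.0, sigma_sf_db, L)
                beta_db = -35.0 - 10.0 * alpha * np.log10(d) + shadow
                if np.argmax(beta_db) == l:  # serving BS is strongest
                    beta[l, :, k] = 10.0 ** (beta_db / 10.0)
                    break
    return active, beta
```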
IV-A Existence of a Neural Network for Joint Pilot and Data Power Control
The input to the proposed feedforward neural network is only the large-scale fading coefficients, and the output is the data and pilot powers. This is fundamentally different from previous works [11, 10] that use deep learning methods to predict the data power allocation based on perfect instantaneous CSI (i.e., small-scale fading), in which case no channel estimation is involved. Specifically, we define a tensor containing all the large-scale fading coefficients. We let denote the tensor with optimized data powers and denote the tensor with pilot powers. PowerNet learns the continuous mapping²

² The process is a continuous mapping if all are continuous functions.

(26)
where represents the continuous mapping of Algorithm 1 that produces the stationary point from the input set of large-scale fading coefficients together with an initial set of pilot and data powers. Lemma 3 first establishes the existence of a feedforward network that imitates the continuous mapping in (26).
Lemma 3.
For any given accuracy , there exists an integer and a feedforward neural network with hidden units for which the mapping process in (26) achieves similar performance to Algorithm 1, in the sense that
(27) 
where denotes the set of network parameters comprising kernels and biases. If we stack the data and pilot powers into tensors such that and , then the objective function on the left-hand side of (27) can be rewritten as
(28) 
Proof.
Lemma 3 proves that there exists a feedforward network that can predict the data and pilot powers for all users in the coverage area, no matter if the users are active or not, as long as Assumption 1 is satisfied. To achieve highly accurate prediction performance, we base our design on deep architectures with multiple hidden layers, as in [33].
IV-B Convolutional Neural Network Architecture
Among all neural network structures in the literature, the CNN is currently the most popular family since it achieves higher performance than fully-connected deep neural networks in many applications [34, 35]. One main reason reported in [34] is that a CNN effectively reduces the spectral variation existing in a dataset. To demonstrate why a CNN is suitable for power control in Massive MIMO, let us consider a square area of km with square cells, each serving users. The large-scale fading coefficients are generated as in Example 1, but all users are assumed to be in active mode. The interference in a real cellular system is imitated by wrap-around. We gather all the large-scale fading coefficients in a tensor of size . For visualization, we first map this tensor to a matrix of size by averaging over the third dimension and plot the result in Fig. 2. The number of horizontal and vertical elements is equal to .
The color map in Fig. 2 represents the large-scale fading coefficients. For example, the color of square represents the average large-scale fading coefficient from a user in cell to BS . Since there is a grid of cells, and the cells are numbered row by row, the large-scale fading coefficients exhibit a certain pattern. Users in neighboring cells have larger large-scale fading coefficients than those in cells that are further away. The strong intensity around the main diagonal represents the cell itself and the directly neighboring cells to the left or right on the same line, while the sub-diagonals with strong intensities represent neighboring cells on other lines. The other strong intensities in the lower-left and upper-right corners are due to the wrap-around topology. A CNN can extract these patterns and utilize them to significantly reduce the number of learned parameters, compared with a conventional fully-connected network, by sharing weights and biases. Moreover, since all users in a cell have large-scale fading coefficients generated from the same distribution, a CNN can exploit this structure to reduce the number of parameters further.
We adopt the state-of-the-art residual dense block (ResDense) [36], which combines densely connected convolutions [37] with residual learning [38]. As shown in Fig. 3, a ResDense block inherits the densely connected block in [37] with a residual connection to prevent vanishing-gradient problems [38]. Compared with the ResDense block in [36], we use an additional rectified linear unit (ReLU) activation, i.e., , after the residual connection since our mapping process only involves nonnegative values.

IV-B1 The forward propagation
From an initial set , the first component of the forward propagation is the convolutional layer
(29) 
where is the epoch index. The operator denotes a series of convolutions [39], each using a kernel and a bias to extract large-scale fading features of the input tensor .³

³ A convolutional layer defined for the tensor involves a set of kernels and optional biases, each producing an output matrix (often called a feature map) from the input . Each element is computed as , where the integer parameter is called the stride. Here , in which is the number of zero-padding elements. Note that is padded with zeros for all . The final feature map of the convolutional layer is obtained by stacking all together, i.e., .

All convolutions apply stride and zero padding to guarantee the same height and width for the inputs and outputs. After the first layer in (29), the feature map is a tensor of size . Our proposed PowerNet is then constructed from sequentially connected ResDense blocks to extract the special features of the large-scale fading coefficients. Each ResDense block uses four sets of convolutional kernels to extract better propagation features. The first convolution begins with , and the output of each module of the th ResDense block is computed as

(30)
(31)  
(32)  
(33) 
where each operator denotes a series of convolutions. In the first three modules, each kernel is , while the remaining one has . In the first three modules, the ReLU activation function is used for each element.

We stress that since the input and output sizes of the neural network are different, multiple 1D convolutions are used to make the sizes equal. In addition, to exploit correlation in both the horizontal and vertical directions of the intermediate data, both horizontal and vertical 1D convolutions are used. A regular transpose layer is applied after the vertical 1D convolution to ensure the data size . The outputs of these two 1D convolutions are summed to obtain the final prediction. This prediction is made for both the pilot and data powers, as depicted in Fig. 3, and is mathematically expressed as
(34)  
(35) 
where and denote the vertical and horizontal series of convolution operators dedicated to predicting the pilot powers by using the convolutional kernels and their related biases . Similar definitions apply to the convolutional layer in (35) for the data powers. The feature maps from (34) and (35) are restricted to the closed unit interval by
(36) 
where the elementwise sigmoid activation function is
(37) 
Finally, the predicted pilot and data powers at epoch are obtained by scaling up and as
(38) 
where is a collection of the maximum power budgets of all users, with . The operator denotes the element-wise product of two tensors. We emphasize that the forward propagation is applied in both the training and testing phases.
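The output stage in (36)–(38) can be sketched as follows: squash the two raw feature maps into (0, 1) with an element-wise sigmoid and then scale them element-wise by each user's power budget. The array and function names below are our assumptions.

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid as in (37), mapping raw outputs into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_powers(raw_pilot, raw_data, p_max):
    """Scale the squashed feature maps by the power budget tensor, as in
    (36) and (38); the element-wise product enforces 0 <= power <= p_max."""
    return sigmoid(raw_pilot) * p_max, sigmoid(raw_data) * p_max
```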
IV-B2 The back propagation
The back propagation is only applied in the training phase. We first adopt the Frobenius norm to define the loss function as
(39) 
with respect to the parameters in , where are nonnegative weights that balance the total transmit powers of the pilot and data symbols. The loss in (39) is averaged over the training dataset, where is the total number of large-scale fading realizations, i.e.,
The back propagation utilizes (39) to update all weights and biases in (29)–(35). PowerNet uses stochastic gradient descent [7] to obtain a good local solution to . Beginning with a random initial value and keeping track of the current at each epoch , the update is

(40)
(41) 
where is the so-called momentum and is the learning rate. We stress that the computational complexity of the back propagation can be significantly reduced if a random mini-batch with is properly selected [7], rather than processing all the training data at once.
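A sketch of the momentum update described by (40)–(41) on a scalar toy loss. Since the displays are missing, the heavy-ball form below (a velocity that accumulates a decayed gradient history) is one standard reading; the loss function and hyperparameter values are our assumptions.

```python
def sgd_momentum(grad, theta0, eta=0.1, mu=0.9, iters=1000):
    """Gradient descent with momentum: v accumulates past gradients
    (decayed by mu) and theta moves along v, in the spirit of (40)-(41)."""
    theta, v = theta0, 0.0
    for _ in range(iters):
        v = mu * v - eta * grad(theta)   # momentum update (assumed form of (40))
        theta = theta + v                # parameter update (assumed form of (41))
    return theta

# toy quadratic loss (w - 3)^2 with gradient 2(w - 3); the iterates
# should approach the minimizer w = 3
```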
IV-C Dataset, Training, and Testing Phases
In order to train PowerNet, we use Algorithm 1 to generate pairs of user realizations and the corresponding jointly optimized pilot and data powers.