Future networks must provide higher channel capacity, lower latency, and better quality of service than contemporary networks. These goals can only be achieved by drastic improvements of the wireless network architecture. Among potential candidates, Massive MIMO (multiple-input multiple-output) is an emerging physical layer technology which allows a base station (BS) equipped with many antennas to serve tens of users on the same time and frequency resource. Utilizing the same spectrum and power budget, Massive MIMO can increase both spectral and energy efficiency by orders of magnitude compared with the conventional systems in use today. This is because the propagation channels decorrelate as the number of antennas at each BS increases, so that strong array gains are achievable with little inter-user interference.
Resource allocation is important in Massive MIMO networks to deal with inter-user interference and, in particular, so-called pilot contamination. Many resource allocation problems in Massive MIMO are easier to solve than in conventional systems, since channel hardening makes the utility functions depend only on the large-scale fading coefficients, which are stable over a long time period, whereas conventional systems need to adapt to the quickly varying small-scale fading. Table I categorizes the existing works on power control for cellular Massive MIMO in terms of utility functions and optimization variables. Only a few works jointly optimize the pilot and data powers, which is of key importance for dealing with pilot contamination in multi-cell systems. In this paper, we optimize the sum spectral efficiency (SE) with respect to the pilot and data powers. To the best of our knowledge, this is the first paper that considers this problem in cellular Massive MIMO systems where each BS serves a varying number of users. Note that we did not include single-cell papers in Table I. For example, one such paper exploits the special structure arising from imperfect channel state information (CSI) in single-cell systems to maximize the sum SE using an efficient algorithm. This algorithm finds the globally optimal pilot and data powers, but it does not extend to multi-cell systems since the structure is entirely different.
Deep learning is a popular data-driven approach to solving complicated problems and has shown superior performance in various applications such as image restoration and pattern recognition. Despite its complicated and rather heuristic training phase, deep learning has recently shown promising results in communication applications. By the universal approximation theorem, deep learning can learn to approximate functions for which we have no closed-form expression. Prior work constructs a fully-connected deep neural network for the sum SE maximization problem in a wireless system serving a few tens of users, and the same network structure has been reused to solve an energy-efficiency problem. Standard fully-connected feed-forward networks with many layers are used, but since the considered problems are challenging, the prediction performance is much lower than when directly solving the optimization problems, with a loss that varies with the system setting.
Moreover, previous neural network designs for resource allocation in wireless communications utilize the instantaneous CSI, which is practically questionable, especially in cellular Massive MIMO systems. This is because the small-scale fading varies very quickly and the deep neural networks have very limited time to process the collection of all the instantaneous channel vectors, each having a number of parameters proportional to the number of BS antennas. One recent work designs a neural network utilizing only statistical channel information to predict transmit powers in an equally-loaded cellular Massive MIMO system with spatially correlated fading. Although the prediction performance is good, the drawback of that approach is that one specific neural network needs to be trained for each combination of the numbers of users in the cells. If there are cells and between and users per cell, one needs to train different neural networks to cover all cases that can appear. Even in the small setup of and considered in , this requires 1 million different neural networks, which is not practical.
In this paper, we consider the joint optimization of the pilot and data powers for maximum sum SE in multi-cell Massive MIMO systems. Our main contributions are:
We formulate a sum ergodic SE maximization problem, with the data and pilot powers as variables, where each cell may have a different number of active users. To overcome the inherent non-convexity, an equivalent problem with element-wise convex structure is derived. An alternating optimization algorithm is proposed to find a stationary point. Each iteration is solved in closed form.
We design a deep convolutional neural network (CNN) that learns the solution to the alternating optimization algorithm. The inputs to the CNN are the large-scale fading coefficients between each user and BS, while the outputs are the pilot and data powers. Hence, the number of inputs/outputs is independent of the number of antennas. Our deep CNN is named PowerNet, has residual structure, and is densely connected.
We exploit the structure of the sum SE maximization problem to train PowerNet to handle a varying number of users per cell. Hence, in contrast to prior works, a single PowerNet is sufficient irrespective of the number of active users, and no retraining is needed.
Numerical results demonstrate the effectiveness of the proposed alternating optimization algorithm compared with the baseline of full transmit power. Meanwhile, PowerNet achieves highly accurate power prediction with a sub-millisecond runtime.
The remainder of this paper is organized as follows: Section II introduces our cellular Massive MIMO system model, with a varying number of users per cell, and the basic ergodic SE analysis. We formulate and solve the joint pilot and data power control problem for maximum sum SE in Section III. The proposed low complexity deep learning solution is given in Section IV. Finally, numerical results are shown in Section V and we provide the main conclusions in Section VI.
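Notation: Upper (lower) case bold letters are used to denote matrices (vectors). $\mathbb{E}\{\cdot\}$ is the expectation of a random variable. $(\cdot)^H$ is the Hermitian transpose and the cardinality of a set $\mathcal{A}$ is $|\mathcal{A}|$. We let $\mathbf{I}_M$ denote the $M \times M$ identity matrix. $\mathbb{C}$, $\mathbb{R}$, and $\mathbb{R}_+$ denote the complex, real, and non-negative real fields, respectively. The floor operator is denoted as $\lfloor \cdot \rfloor$ and the Frobenius norm as $\| \cdot \|_F$. Finally, $\mathcal{CN}(\cdot, \cdot)$ is the circularly symmetric complex Gaussian distribution.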
II Dynamic Massive MIMO System Model
We consider a multi-cell Massive MIMO system comprising cells, each having a BS equipped with antennas. We call it a dynamic system model since each BS is able to serve users, but possibly only a subset of these users is active at any given point in time. We will later model the active subset of users randomly and exploit this structure when training a neural network. Since the wireless channels vary over time and frequency, we consider the standard block fading model, where the time-frequency resources are divided into coherence intervals of modulation symbols for which the channels are static and frequency flat. In an arbitrary coherence interval, BS is serving a subset of active users. We define a set containing the indices of all active users in cell , for which . The channel between active user in cell and BS is denoted as and follows an independent and identically distributed (i.i.d.) Rayleigh fading distribution:
is the large-scale fading coefficient that models geometric pathloss and shadow fading. The distributions are known at the BSs, but the realizations are unknown and need to be estimated in every coherence interval using a pilot transmission phase.
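The channel model above can be illustrated with a minimal sketch, assuming the usual form of i.i.d. Rayleigh fading in which every antenna coefficient is circularly symmetric complex Gaussian with variance equal to the large-scale fading coefficient. The function names and the numerical values are hypothetical and chosen only for illustration. The sketch also shows why the realizations must be estimated while the statistics can be treated as known: the normalized gain of a single realization fluctuates for few antennas but concentrates around the large-scale fading coefficient for many antennas.

```python
import math
import random

def sample_rayleigh_channel(beta, num_antennas, rng):
    """Draw one channel vector with i.i.d. CN(0, beta) entries.

    Real and imaginary parts are independent Gaussians with variance beta / 2.
    """
    std = math.sqrt(beta / 2.0)
    return [complex(rng.gauss(0.0, std), rng.gauss(0.0, std))
            for _ in range(num_antennas)]

def normalized_gain(h):
    """Return ||h||^2 / M, the per-antenna average channel gain."""
    return sum(abs(x) ** 2 for x in h) / len(h)

rng = random.Random(0)
beta = 0.5  # hypothetical large-scale fading coefficient
gain_small = normalized_gain(sample_rayleigh_channel(beta, 4, rng))
gain_large = normalized_gain(sample_rayleigh_channel(beta, 20000, rng))
print(gain_small, gain_large)  # the second value is close to beta
```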
II-A Uplink Pilot Transmission Phase
We assume that a set of orthonormal pilot signals is used in the system. User in each cell is preassigned the pilot with , regardless of whether the user is active in the given coherence interval, but this pilot is only transmitted when the user has data to transmit (or receive). This pilot assignment guarantees that there is no intra-cell pilot contamination. The channel estimation of a user is, however, subject to interference from the users in other cells that use the same pilot signal, which is called pilot contamination. The received baseband pilot signal at BS is
where is the additive noise with i.i.d. elements. Meanwhile, is the pilot power that active user in cell allocates to its pilot transmission. The channel between a particular user in cell and BS is estimated from
where the set contains the indices of cells having user in active mode, which is formulated from the user activity set of each cell as
By using minimum mean square error (MMSE) estimation, the channel estimate of an arbitrary active user is given in Lemma 1.
Lemma 1. If BS uses MMSE estimation, the channel estimate of active user in cell is
which follows a complex Gaussian distribution as
The estimation error, denoted as , is independent of the channel estimate and distributed as
The statistical information in Lemma 1 about each channel estimate and estimation error is used to construct the linear combining vectors and to derive a closed-form expression of the uplink SE.
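As a hedged illustration of how pilot contamination degrades the estimate, the sketch below evaluates the per-antenna variance of an MMSE channel estimate under the standard textbook result for uncorrelated Rayleigh fading. It assumes pilots of length `tau_p` with squared norm `tau_p`; the exact expressions of Lemma 1 are not reproduced here, and the function name, argument layout, and numbers are hypothetical.

```python
def mmse_estimate_variance(own_idx, pilot_powers, betas, tau_p, noise_var):
    """Per-antenna variance of the MMSE estimate of one user's channel.

    pilot_powers[i] and betas[i] are the pilot power and the large-scale
    fading coefficient (to the considered BS) of the i-th user that shares
    the pilot; own_idx selects the user being estimated.
    Assumed standard form: p * tau_p * beta^2 / (tau_p * sum_i p_i beta_i + noise).
    """
    denom = tau_p * sum(p * b for p, b in zip(pilot_powers, betas)) + noise_var
    return pilot_powers[own_idx] * tau_p * betas[own_idx] ** 2 / denom

# No pilot-sharing interferer versus one interferer with half the channel gain.
gamma_clean = mmse_estimate_variance(0, [1.0], [1.0], tau_p=2, noise_var=1.0)
gamma_cont = mmse_estimate_variance(0, [1.0, 1.0], [1.0, 0.5], tau_p=2, noise_var=1.0)

# The estimation error variance is beta minus the estimate variance.
error_clean = 1.0 - gamma_clean
error_cont = 1.0 - gamma_cont
print(gamma_clean, gamma_cont, error_clean, error_cont)
```

Under these assumptions, the interferer lowers the estimate variance and raises the error variance, which is the effect that joint pilot and data power control is meant to manage.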
II-B Uplink Data Transmission Phase
During the uplink data transmission phase, every active user in cell transmits data symbol with . The received signal at BS is the superposition of signals from all users across cells:
where is the power that active user in cell allocates to the data symbol and is complex Gaussian noise distributed as . Each BS uses maximum ratio combining (MRC) to detect the desired signals from its users. In particular, BS selects the combining vector for its user as
and we will quantify the achievable spectral efficiency by using the use-and-then-forget capacity bounding technique. The closed-form expression of the lower bound on the uplink capacity is shown in Lemma 2.
Lemma 2. If each BS uses MRC for data detection, a closed-form expression for the uplink ergodic SE of active user in cell is
where the effective SINR value of this user is
The proof follows along the lines of Corollary in  except for the different notation and the fact that every user can assign different power to pilot and data. ∎
The numerator of the SINR expression in (11) represents the array gain, which is directly proportional to the number of antennas at the serving BS. The first part of the denominator represents the pilot contamination effect and is also proportional to the number of BS antennas. Interestingly, active user in cell will have unbounded capacity as the number of BS antennas grows without bound if all other users using the same pilot sequence are silent (i.e., inactive or allocated zero transmit power). The remaining terms are non-coherent mutual interference and noise, whose impact vanishes when the number of antennas grows. Furthermore, the SE of a user is proportional to , which is the pre-log factor in (10). This is the fraction of symbols per coherence interval that are used for data transmission, which thus reduces when the number of pilots is increased. In the special case of , the analytical results in Lemma 2 particularize to equally-loaded systems as in the previous works. That special case is unlikely to occur in practice since the data traffic is generated independently for each user.
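The structure described above (coherent array gain, coherent pilot contamination, non-coherent interference, and noise) can be sketched numerically. The following is a minimal sketch that assumes the standard closed form for maximum ratio combining with uncorrelated Rayleigh fading and pilot length `tau_p`; it is not claimed to be term-by-term identical to (10) and (11), and all names and values are hypothetical. Inactive users are modeled by zero powers.

```python
import math

def uplink_se(beta, p_pilot, p_data, num_antennas, tau_p, tau_c, noise_var=1.0):
    """Return se[l][k], an ergodic SE lower bound in bit/s/Hz.

    beta[l][i][k]: large-scale fading from user k in cell i to BS l.
    p_pilot[i][k], p_data[i][k]: pilot and data powers (zero if inactive).
    """
    num_cells, num_users = len(beta), len(beta[0][0])
    prelog = 1.0 - tau_p / tau_c  # fraction of symbols carrying data
    se = [[0.0] * num_users for _ in range(num_cells)]
    for l in range(num_cells):
        # Non-coherent interference plus noise, common to all users of BS l.
        noncoh = noise_var + sum(p_data[i][t] * beta[l][i][t]
                                 for i in range(num_cells) for t in range(num_users))
        for k in range(num_users):
            # Received pilot energy plus noise on pilot k at BS l.
            pilot_rx = noise_var + tau_p * sum(p_pilot[i][k] * beta[l][i][k]
                                               for i in range(num_cells))
            signal = (num_antennas * tau_p * p_data[l][k] * p_pilot[l][k]
                      * beta[l][l][k] ** 2)
            # Coherent interference caused by pilot contamination.
            coherent = num_antennas * tau_p * sum(
                p_data[i][k] * p_pilot[i][k] * beta[l][i][k] ** 2
                for i in range(num_cells) if i != l)
            sinr = signal / (coherent + pilot_rx * noncoh)
            se[l][k] = prelog * math.log2(1.0 + sinr)
    return se

# One cell with one user versus two symmetric cells sharing the pilot.
se_single = uplink_se([[[1.0]]], [[1.0]], [[1.0]], 100, tau_p=1, tau_c=200)
beta_two = [[[1.0], [0.1]], [[0.1], [1.0]]]
ones = [[1.0], [1.0]]
se_two = uplink_se(beta_two, ones, ones, 100, tau_p=1, tau_c=200)
print(se_single, se_two)
```

In the single-user case the assumed SINR evaluates to 100 / (2 * 2) = 25, and adding a pilot-sharing user in a second cell lowers the SE of both users.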
III Joint Pilot and Data Power Control for Sum Spectral Efficiency Optimization
We are concerned with sum SE maximization since high SE is important for future networks, and the (weighted) sum SE maximization is also the core problem to be solved in practical algorithms for dynamic resource allocation. The previous works [6, 24] consider this problem for single-cell systems with joint pilot and data power control or multi-cell systems with only data power control, respectively. In contrast, we formulate and solve a sum SE maximization problem with joint pilot and data power control. This optimization problem has not been tackled before in the Massive MIMO literature due to its inherently non-convex structure. In this section, we develop an iterative algorithm that achieves a stationary point in polynomial time by solving a series of convex sub-problems in closed form.
III-A Problem Formulation
We consider the optimization problem that maximizes the sum SE of all active users in the system with limited power at each transmitted symbol as
where is the maximum power that user in cell can supply to each transmitted symbol. Problem (13) is independent of the small-scale fading, so it allows for long-term performance optimization if the users are continuously active and there is no large-scale user mobility. However, in practical systems, some users are moving quickly and new scheduling decisions are made every few milliseconds based on the users' traffic. It is therefore important to be able to solve (13) very quickly to adapt to these changes. (Footnote 1: Note that the ergodic SE is a reasonable performance metric also in this scenario, since long codewords can span over the frequency domain and the channel hardening makes the channel after MRC almost deterministic. The simulations in  show that coding over 1 kB of data is sufficient to operate closely to the ergodic SE.)
Inspired by the weighted MMSE methodology, we will now propose an iterative algorithm to find a stationary point to (13). By removing the pre-log factor and setting and , as the new optimization variables, we formulate a new problem that is equivalent to (13).
Proof: The proof consists of two main steps. The mean square error is first formulated by considering a single-input single-output (SISO) communication system with deterministic channels having the same SE as in Lemma 2, where is the beamforming coefficient utilized in such a SISO system and is the weight value in the receiver. After that, the equivalence of the two problems (13) and (14) is obtained by finding the optimal solution of and given the other optimization variables. The detailed proof is given in Appendix A. ∎
The new problem formulation in Theorem 1 is still non-convex, but it has an important desired property: if we consider one of the sets , , , and as the only optimization variables, while the other variables are constant, then problem (15) is convex. Note that the set of optimization variables and SE expressions are different than in the previous works [28, 29] that followed similar paths of reformulating their sum SE problems, which is why Theorem 1 is a main contribution of this paper. In particular, in our case we can get closed-form solutions in each iteration, leading to a particularly efficient implementation. We exploit this property to derive an iterative algorithm to find a local optimum (stationary point) to (15) as shown in the following subsection.
III-B Iterative Algorithm
This subsection provides an iterative algorithm to obtain a stationary point to problem (14) by alternating between updating the different sets of optimization variables. This procedure is established by the following theorem.
Theorem 2. From an initial point satisfying the constraints, a stationary point to problem (14) is obtained by updating in an iterative manner. At iteration , the variables are updated as follows:
Proof: The proof derives the closed-form optimal solutions in (16)–(21) to each of the optimization variables, when the others are fixed, by taking the first derivative of the Lagrangian function of (14) and equating it to zero. The fact that problems (13) and (14) have the same set of stationary points is further confirmed by the chain rule. The proof is given in Appendix B. ∎
Theorem 2 provides an iterative algorithm that obtains a local optimum to (13) and (14) with low computational complexity because of the closed-form solutions in each iteration. Algorithm 1 gives a summary of this iterative process. From any feasible initial set of powers , in each iteration, we update each optimization variable according to (16)–(21). This iterative process is terminated when the variation between two consecutive iterations is small. For instance, the stopping condition may be defined for a given accuracy as
By considering the multiplications, divisions, and logarithms as the dominant complexity, the number of arithmetic operations needed for Algorithm 1 to reach -accuracy is
where is the number of iterations required for convergence. From Theorem 2, we further observe the following relationship between the data and pilot powers allocated to a user.
Corollary 1. If an active user has a large-scale fading coefficient equal to zero, then it will always get zero transmit powers when using the algorithm in Theorem 2. Hence, an equivalent way of managing inactive users is to set their large-scale fading coefficients to zero and use .
In addition, the system may reject some active users that have small but non-zero large-scale fading coefficients, since Algorithm 1 can assign zero power to them, similar to the behavior of standard waterfilling algorithms. This is a key benefit of sum SE maximization compared with max-min fairness power control [16, 17, 14, 13, 18, 19, 20] and maximum product-SINR power control [13, 18, 21], which always allocate non-zero power to all users and, therefore, require an additional heuristic user admission control step for selecting which users to drop from service due to their poor channel conditions. If a particular user in cell is not served, this implies that and . Hence, this user transmits in neither the pilot phase nor the data phase. Corollary 1 will enable us to design a single neural network that can mimic Algorithm 1 for any number of active users.
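The control flow of such an alternating procedure can be sketched generically. The closed-form updates (16)–(21) are not reproduced here, so the sketch below uses a toy block-wise convex problem as a stand-in; the function `alternating_optimization`, the relative-change stopping rule, and the toy objective are all assumptions made for illustration and are not the updates of Algorithm 1.

```python
def alternating_optimization(update_fns, objective, x0, eps=1e-12, max_iter=1000):
    """Cycle through closed-form block updates until the objective stalls.

    update_fns: list of (name, fn) pairs; fn(x) returns the optimal value of
    block `name` with all other blocks in the dict x held fixed.
    Stops when the objective changes by at most eps (relative) between
    two consecutive iterations.
    """
    x = dict(x0)
    prev = objective(x)
    for n in range(1, max_iter + 1):
        for name, fn in update_fns:
            x[name] = fn(x)
        cur = objective(x)
        if abs(cur - prev) <= eps * max(1.0, abs(prev)):
            return x, n
        prev = cur
    return x, max_iter

# Toy stand-in: f(a, b) = (a - 1)^2 + (b - 2)^2 + a * b is convex in each
# block, and each block update follows from a zero first derivative.
def toy_objective(x):
    return (x["a"] - 1.0) ** 2 + (x["b"] - 2.0) ** 2 + x["a"] * x["b"]

toy_updates = [
    ("a", lambda x: 1.0 - x["b"] / 2.0),  # solves 2(a - 1) + b = 0
    ("b", lambda x: 2.0 - x["a"] / 2.0),  # solves 2(b - 2) + a = 0
]

solution, iterations = alternating_optimization(
    toy_updates, toy_objective, {"a": 5.0, "b": 5.0})
print(solution, iterations)
```

Because every block update is in closed form, the per-iteration cost is a fixed number of arithmetic operations, which mirrors the reason given above for the low complexity of Algorithm 1.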
IV A Low-Complexity Solution with Convolutional Neural Network
We introduce a deep learning framework for joint pilot and data power allocation in dynamic cellular Massive MIMO systems, which uses supervised learning to mimic the power control obtained by Algorithm 1. We stress that for non-convex optimization problems, a supervised learning approach with high prediction accuracy is useful for achieving a low-complexity implementation that harnesses the advances in implementing neural networks on GPUs, and it also provides a good baseline for further activities, e.g., supervised learning as a warm start for unsupervised learning or to improve the performance of the testing phase.
We first make an explicit assumption on how the large-scale fading coefficients are generated for each realization of the Massive MIMO network, by exploiting Corollary 1.
Assumption 1. We consider an . The large-scale fading coefficients associated with a user in cell have the probability density function (PDF) , in which and , for .
In each realization of the system, i.i.d. users are generated in each cell. User in cell is active (i.e., ) with the probability . Inactive users have and the large-scale fading coefficients of an active user are obtained as an i.i.d. realization with the PDF that satisfies
such that it has its strongest channel from the serving BS.
The process of generating system realizations is illustrated in Fig. 1. Note that all users in cell have the same , which represents the user distribution over the coverage area of this cell, but this function is different for each cell. For notational convenience, each cell has the same maximum number of users and the activity probability is independent of the cell and location, but these assumptions can be easily generalized.
Assumption 1 indicates that a user should be handled equally irrespective of which number it has in the cell. The fact that all large-scale fading coefficients belong to the compact set originates from the law of conservation of energy, and fits well with the structural conditions required to construct a neural network. There are many ways to define the PDFs of the large-scale fading coefficients. One option is to match them to channel measurements obtained in a practical setup. Another option is to define the BS locations and user distributions and then define a pathloss model with shadow fading. In the numerical part of this paper, we take the latter approach and follow the 3GPP LTE standard, which utilizes a Rayleigh-lognormal fading model that matches well with channel measurements in non-line-of-sight conditions. The following model is used in Section V.
Example 1. Consider a setup with square cells. In each cell, the users are uniformly distributed in the serving cell at distances to the serving BS that are larger than m. Each user has the activity probability . For an active user in cell , we generate the large-scale fading coefficient to BS as
where is the physical distance and is shadow fading that follows a normal distribution with zero mean and standard deviation dB. If the conditions (24) and/or are not satisfied for a particular user, we simply rerun all the shadow fading realizations for that user.
In a cellular network with users there are different realizations of the user activities, which is a huge number (up to in the simulation part with 90 users). If we had to design one specific neural network for each of these realizations, the solution would be practically meaningless. A main contribution of our framework is that we can build a single neural network that can handle the activity/inactivity pattern and has a unified structure for all training samples. Note that the proposed network might have more parameters than actually needed, since our main goal is to provide a proof-of-concept. The network with the lowest number of parameters is different for every propagation environment and is therefore not considered in this work, which focuses on the general properties and not the fine-tuning.
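To give a sense of scale, the count can be worked out under the assumption that each user is independently either active or inactive, so that a network with N users in total has 2^N activity patterns. The helper name below is hypothetical, and the assumption may differ from the exact counting used in the paper.

```python
def num_activity_patterns(total_users):
    """Number of on/off patterns when each user is either active or inactive."""
    return 2 ** total_users

count_90 = num_activity_patterns(90)
print(f"{count_90:.3e}")  # on the order of 10^27
```

Under this assumption, 90 users give roughly 1.24 x 10^27 patterns, which makes one network per pattern infeasible and motivates a single network that handles all patterns.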
IV-A Existence of a Neural Network for Joint Pilot and Data Power Control
The input to the proposed feedforward neural network is only the large-scale fading coefficients and the output is the data and pilot powers. This is fundamentally different from previous works [11, 10] that use deep learning methods to predict the data power allocation based on perfect instantaneous CSI (i.e., small-scale fading), in which case no channel estimation is involved. Specifically, we define a tensor containing all the large-scale fading coefficients. We let denote the tensor with optimized data powers and denote the tensor with pilot powers. PowerNet learns the continuous mapping (Footnote 2: The process is a continuous mapping if all are continuous functions.)
where represents the continuous mapping process in Algorithm 1 to obtain the stationary point from the input set of large-scale fading together with an initial set of pilot and data powers. Lemma 3 first proves the existence of a feedforward network which imitates the continuous mapping in (26).
where are the set of network parameters comprising kernels and biases. If we stack the data and pilot powers into the tensors such that and , then the objective function in the left-hand side of (27) can be rewritten as
Lemma 3 proves that there exists a feedforward network that can predict the data and pilot powers for all users in the coverage area, no matter if the users are active or not as long as Assumption 1 is satisfied. In order to achieve highly accurate prediction performance, we base our contribution on the deep architectures of a multiple hidden layer structure as in .
IV-B Convolutional Neural Network Architecture
Among all neural network structures in the literature, the CNN is currently the most popular family since it achieves higher performance than fully-connected deep neural networks for many applications [34, 35]. One main reason reported is that a CNN effectively deduces the spectral variation existing in a dataset. In order to demonstrate why the use of a CNN is suitable for power control in Massive MIMO, let us consider a square area of km with square cells, each serving users. The large-scale fading coefficients are generated as in Example 1, but all users are assumed to be in active mode. The interference in a real cellular system is imitated by wrap-around. We gather all the large-scale fading coefficients in a tensor of size . For visualization, we first map this tensor to a matrix of size by averaging over the third dimension and plot the result in Fig. 2. The number of horizontal and vertical elements is equal to .
The color map in Fig. 2 represents the large-scale fading coefficients. For example, the color of square represents the average large-scale fading coefficient from a user in cell to BS . Since there is a grid of cells, and the cells are numbered row by row, the large-scale fading coefficients have a certain pattern. Users in neighboring cells have larger large-scale fading coefficients than users in cells that are further away. The strong intensity around the main diagonal represents the cell itself and the directly neighboring cells to the left or right on the same row, while the sub-diagonals with strong intensities represent neighboring cells on other rows. The other strong intensities in the lower-left and upper-right corners are due to the wrap-around topology. A CNN can extract these patterns and utilize them to reduce the number of learned parameters significantly, compared to a conventional fully-connected network, by sharing weights and biases. Moreover, since each of the users in a cell has large-scale fading coefficients generated from the same distribution, a CNN can exploit this structure to reduce the number of parameters.
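The origin of the pattern described above can be reproduced with a small geometric sketch. It assumes a square grid of cells numbered row by row with BSs at the cell centers, and the common definition of wrap-around in which distances are measured on a torus; the 4 x 4 grid, unit area side, and function names are hypothetical.

```python
import math

def wraparound_distance(p, q, area_side):
    """Shortest distance between p and q when the square area wraps around."""
    dx = abs(p[0] - q[0])
    dy = abs(p[1] - q[1])
    dx = min(dx, area_side - dx)
    dy = min(dy, area_side - dy)
    return math.hypot(dx, dy)

def bs_grid(cells_per_side, area_side):
    """BS positions at the cell centers, numbered row by row."""
    step = area_side / cells_per_side
    return [((c + 0.5) * step, (r + 0.5) * step)
            for r in range(cells_per_side) for c in range(cells_per_side)]

bs = bs_grid(4, 1.0)
dist = [[wraparound_distance(a, b, 1.0) for b in bs] for a in bs]

# Cell 0 is adjacent to cell 1 (same row), to cell 3 (same row, via
# wrap-around), and to cell 12 (last row, via wrap-around), but not to cell 2.
print(dist[0][1], dist[0][2], dist[0][3], dist[0][12])
```

Small distances between cell 0 and cells 3 and 12 correspond to the strong entries far from the main diagonal, which is consistent with the corner intensities attributed to the wrap-around topology.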
We will adopt the state-of-the-art residual dense block (ResDense), which consists of densely connected convolutions with residual learning. As shown in Fig. 3, a ResDense block inherits the densely connected block with a residual connection to prevent the vanishing gradient problem. Compared with the original ResDense block, a ReLU activation is applied after the residual connection since our mapping process only concerns non-negative values.
IV-B1 The Forward Propagation
From an initial set , the first component of the forward propagation is the convolutional layer
is the epoch index. The operator denotes a series of convolutions , each using a kernel and a bias to extract large-scale fading features of the input tensor . (Footnote 3: A convolutional layer defined for the tensor involves a set of kernels and optional biases, each producing an output matrix (often called a feature map) from the input . Each element is computed as , where the integer parameter is called the stride. Here , in which is the amount of zero padding. Notice that are padded with zeros for all . The final feature map of the convolutional layer is obtained by stacking all together, i.e., .)
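The roles of the stride and the zero padding can be made concrete with a single-channel sketch. It assumes the convention used by common deep learning frameworks, in which a "convolution" is a sliding inner product without kernel flipping and the output side length is floor((H + 2P - K) / S) + 1 for input size H, kernel size K, padding P, and stride S. The function name and the 3 x 3 example are hypothetical.

```python
def conv2d_single(x, kernel, bias=0.0, stride=1, pad=0):
    """Single-channel 2D convolution (cross-correlation) with zero padding."""
    height, width = len(x), len(x[0])
    kh, kw = len(kernel), len(kernel[0])
    # Surround the input with `pad` rows and columns of zeros.
    padded = [[0.0] * (width + 2 * pad) for _ in range(height + 2 * pad)]
    for r in range(height):
        for c in range(width):
            padded[r + pad][c + pad] = x[r][c]
    out_h = (height + 2 * pad - kh) // stride + 1
    out_w = (width + 2 * pad - kw) // stride + 1
    out = [[bias] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            acc = bias
            for a in range(kh):
                for b in range(kw):
                    acc += kernel[a][b] * padded[r * stride + a][c * stride + b]
            out[r][c] = acc
    return out

ones3 = [[1.0] * 3 for _ in range(3)]
same = conv2d_single(ones3, ones3, stride=1, pad=1)     # output stays 3 x 3
strided = conv2d_single(ones3, ones3, stride=2, pad=1)  # output shrinks to 2 x 2
print(same, strided)
```

With a 3 x 3 kernel, stride 1 and one ring of zeros keep the height and width unchanged, which is the property needed to stack layers without changing the spatial size.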
All convolutions apply stride and zero padding to guarantee the same height and width between the inputs and outputs. After the first layer in (29), the feature map is a tensor with the size . Our proposed PowerNet is then constructed from sequentially connected ResDense blocks to extract special features of the large-scale fading coefficients. Each ResDense block uses four sets of convolutional kernels to extract better propagation features. The first convolution begins with , then the output signal at each module of the -th ResDense block is simultaneously computed as
where each operator denotes a series of convolutions. In the first three modules, each kernel , while the remaining one has . In the first three modules, the ReLU activation function is used for each element.
We stress that since the input and output sizes of the neural network are different, multiple 1D convolutions are used to make the sizes equal. In addition, to exploit correlation in both the horizontal and vertical directions in the intermediate data, both horizontal and vertical 1D convolutions are used. A regular transpose layer is applied following the vertical 1D convolution to ensure the data size of . The outputs of these two 1D convolutions are summed up to obtain the final prediction output. This prediction is used for both pilot and data power as depicted in Fig. 3 and is mathematically expressed as
where and denote the vertical and horizontal series of convolution operators dedicated to predicting the pilot powers by using convolutional kernels and their related biases . Similar definitions for the convolution layer in (35) are made for the data powers. The feature maps from (34) and (35) are restricted to the closed unit interval by
where the element-wise sigmoid activation function is
Finally, the predicted pilot and data powers at epoch are obtained by scaling up and as
where is a collection of the maximum power budget from all users with . The operator denotes the dot product of two tensors. We emphasize that the forward propagation is applied for both the training and testing phases.
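The last two steps of the forward propagation, squashing the feature maps and scaling them by the power budgets, can be sketched as below. The sketch uses the standard logistic sigmoid and assumes that the scaling is an element-wise product between the squashed features and the per-user maximum powers; the function names and values are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def scale_to_power(features, p_max):
    """Map raw features to powers in [0, p_max], element by element."""
    return [[p_max[i][k] * sigmoid(features[i][k])
             for k in range(len(features[i]))]
            for i in range(len(features))]

# Two cells with two users each; the budgets are hypothetical.
features = [[0.0, 4.0], [-4.0, -1000.0]]
p_max = [[0.2, 0.2], [0.2, 0.1]]
powers = scale_to_power(features, p_max)
print(powers)
```

A feature of zero maps to half of the budget, while strongly negative features map to powers close to zero, which is how a network of this kind can effectively switch off inactive or rejected users.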
IV-B2 The Back Propagation
The back propagation is only applied in the training phase. We first adopt the Frobenius norm to define the loss function as
with respect to the parameters in , where are non-negative weights that balance between the total transmit power of pilot and data symbols. The loss in (39) is averaged over the training dataset, where is the total number of large-scale fading realizations.
PowerNet uses stochastic gradient descent to obtain a good local solution to . Beginning with a random initial value and remembering the current at each epoch , the update is
where is the so-called momentum and is the learning rate. We stress that the computational complexity of the back propagation can be significantly reduced if a random mini-batch with is properly selected, rather than processing all the training data at once.
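The training step can be illustrated with a minimal sketch of a weighted Frobenius-norm loss and the classical momentum update. It assumes a squared-norm loss with one weight per power type and the common heavy-ball form of momentum (velocity = momentum x velocity - learning rate x gradient); the exact expressions in (39) and (41) are not reproduced here. The toy quadratic and all names are hypothetical, and the gradient is computed on the full objective rather than on a random mini-batch.

```python
def weighted_frobenius_loss(pred_pilot, pred_data, ref_pilot, ref_data,
                            w_pilot=1.0, w_data=1.0):
    """Weighted sum of squared Frobenius norms of the prediction errors."""
    def sq_norm(a, b):
        return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return w_pilot * sq_norm(pred_pilot, ref_pilot) + w_data * sq_norm(pred_data, ref_data)

def sgd_momentum(grad_fn, theta0, lr=0.1, momentum=0.9, steps=500):
    """Gradient descent with classical (heavy-ball) momentum."""
    theta = list(theta0)
    velocity = [0.0] * len(theta)
    for _ in range(steps):
        grad = grad_fn(theta)
        velocity = [momentum * v - lr * g for v, g in zip(velocity, grad)]
        theta = [t + v for t, v in zip(theta, velocity)]
    return theta

# Toy quadratic: minimize the sum over j of (theta_j - target_j)^2.
target = [1.0, -2.0]
theta = sgd_momentum(lambda th: [2.0 * (t - g) for t, g in zip(th, target)],
                     [0.0, 0.0])
loss = weighted_frobenius_loss([[1.0, 2.0]], [[0.0, 0.0]],
                               [[1.0, 1.0]], [[0.0, 2.0]])
print(theta, loss)
```

In a mini-batch variant, `grad_fn` would evaluate the gradient on a random subset of the training pairs at each step, which is the complexity reduction mentioned above.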
IV-C Dataset, Training, and Testing Phases
In order to train PowerNet, we use Algorithm 1 to generate training pairs of user realizations and the corresponding outputs that are jointly optimized by our method presented in Algorithm