I Introduction
Beamforming is an important multi-antenna technique for managing interference and improving the capacity of multiuser wireless communication systems. Most beamforming design problems are nonconvex, so early efforts to optimize beamforming mainly relied on iterative numerical algorithms [1], [2], whose high computational complexity hinders practical implementation. Recently, deep learning has been recognized as a new “learn to optimize” approach to beamforming design [3, 4, 5, 6, 7]. The deep learning approach significantly reduces the optimization complexity, so the resulting beamforming solution can potentially be implemented in real time.
The optimization and performance of beamforming critically depend on the availability of perfect channel state information (CSI) at the transmitter, and existing deep learning based beamforming solutions mostly assume that perfect CSI is available. In the downlink, however, perfect CSI is difficult to obtain at the base station (BS) for several reasons. In time division duplex (TDD) systems, channel reciprocity is normally assumed, so the downlink CSI can be estimated from the pilots sent by the mobile users in the uplink. In practice, however, channel reciprocity does not hold because the analog radio front-ends at the BS and the mobile users exhibit nonreciprocity due to nonidentical behavior of the individual transmit and receive chains [8]. In frequency division duplex (FDD) systems, the BS first needs to transmit pilots to the users for downlink channel estimation, and the users then feed back the estimated downlink channel to the BS. This channel acquisition process incurs a large overhead and reduces the effective system spectral efficiency.
There have been significant efforts to obtain the CSI using deep learning techniques. CsiNet was developed in [9] to learn CSI sensing and recovery in FDD-based massive multiple-input multiple-output (MIMO) systems using the channel structure. A learned denoising-based approximate message passing neural network was proposed in [10] for beamspace channel estimation in millimeter-wave (mmWave) massive MIMO systems. Based on the fact that the uplink and downlink channels share the same propagation environment, a deep neural network for channel calibration between the two directions was designed in [11] for a generic massive MIMO system. A sparse complex-valued neural network was introduced in [12] to approximate the uplink-to-downlink mapping function in FDD massive MIMO systems. Convolutional neural networks and generative adversarial networks were used in [13] to infer the downlink CSI by observing the uplink CSI. The feasibility of channel mapping in space and frequency was demonstrated in [14], where the channels at one set of antennas using one frequency band can be learned from the channels at another set of antennas that use a different frequency band. A comprehensive joint channel estimation and feedback framework based on deep learning was proposed in [15], which realizes the estimation, compression, and reconstruction of downlink channels in FDD massive MIMO systems.
There is an emerging direction of studies that aims to learn the beamforming solutions rather than improve the CSI accuracy, which is close to the main idea of this work. For instance, the work in [16] proposed a deep learning based CSI feedback framework for beamforming design in FDD massive MIMO systems to maximize the beamforming performance gain. A deep neural network using unsupervised training was proposed in [17] to map the received uplink pilots to the beamforming matrix at the BS for the intelligent reflecting surface, assuming uplink-downlink channel reciprocity. A channel sensing and hybrid precoding design was proposed in [18], which uses the received pilots without the intermediate channel estimation step for TDD massive MIMO systems.
While existing deep learning approaches have achieved success in the individual tasks of channel estimation and beamforming design, they normally require massive amounts of data and computational resources, and simply combining them does not guarantee satisfactory end results. This is due to their inherent limitation of being data-driven and model-agnostic. The optimization of each separate task focuses only on its own objective and assumes that the other task is ideal. In reality, each task introduces some error or imperfection, so the overall performance can deteriorate when the tasks are simply put together.
A promising direction to remedy this problem is model-driven deep learning, which combines the data-driven approach with the underlying domain knowledge, mathematical models and problem structures to achieve better inference with less data. Recent advancements in model-driven deep learning approaches in physical layer communications were discussed in [19]. In our previous works [7], [20], we proposed a model-driven neural network design for beamforming optimization by exploiting the problem structure. Deep neural networks that adopt the algorithmic structure and constraints of adaptive signal processing techniques were proposed in [21]; they can efficiently learn to perform fast, high-quality ultrasound beamforming using very little training data. Model-driven learning is a new concept that can be broadly applied in engineering design, and a comprehensive review of leading approaches for combining model-based algorithms with deep learning can be found in [22], with detailed signal processing and communications oriented examples.
In this paper, we aim to design a model-driven deep learning approach that jointly tackles the challenge of channel estimation and optimizes the downlink beamforming to maximize the sum rate of generic multiuser MIMO systems. Different from the literature, we assume that only information about the uplink channel is available, without explicit knowledge of the downlink CSI, and that the relation between the downlink and uplink channels is unknown. The uplink channel information could take the form of either perfect CSI or received pilots. This method alleviates the burden of channel estimation at the user side and reduces the feedback overhead, and it is flexible enough to be used in both TDD and FDD systems, and in massive MIMO and multicell settings. Especially for FDD systems, the proposed approach makes it possible to learn the downlink beamforming directly without sending downlink pilots, uplink feedback or explicit channel estimation. The novelty of our work is twofold: first, we propose to optimize the beamforming in order to maximize the end performance and therefore bypass the explicit intermediate channel estimation step; second, we introduce a model-driven deep learning based approach. Compared to existing data-driven beamforming learning, our proposed approach specifies the most appropriate features to be learned, which improves both the inference and the end performance. Our main contributions are summarized as follows:

We exploit the algorithmic structure of beamforming solutions as the useful model information, and propose a hybrid method for joint learning of the downlink channel and optimization of beamforming to guarantee the sum rate performance. To be specific, we design a neural network consisting of two subnets for learning the downlink channel and the auxiliary power vector, respectively, from which the downlink beamforming solution can be constructed. The overall loss function is hybrid and chosen to be a weighted sum of the loss functions of channel and power learning using supervised training, and the sum rate using unsupervised training.

We investigate techniques to further reduce the problem dimension and achieve near-optimal low-complexity learning by using zero-forcing (ZF) beamforming in the loss function, which is especially appealing for massive MIMO systems.

We extend the proposed method to multicell massive MIMO systems in which a BS in each cell is able to learn the beamforming solution in a distributed manner without signalling exchange with other cells. To the best of our knowledge, this is the first distributed learning solution for the optimization of multicell beamforming.

Extensive simulations are carried out to evaluate the performance of the proposed algorithms. The results show that the proposed algorithms can achieve a sum rate close to that of the weighted minimum mean squared error (WMMSE) algorithm [2], and significantly outperform existing learning methods.
The remainder of this paper is organized as follows. Section II introduces the system model and the problem formulation. The uplink to downlink channel mapping is discussed in Section III. The model-driven hybrid learning approach for a general downlink is proposed in Section IV. Section V presents techniques to further reduce the training complexity for single-cell massive MIMO systems. Section VI extends the result to allow distributed learning of the beamforming solution in a multicell massive MIMO scenario. Simulation results and conclusions are given in Section VII and Section VIII, respectively.
Notation: Boldface lower case letters and capital letters are used to represent column vectors and matrices, respectively. The notations $\mathbf{A}^H$ and $\|\mathbf{A}\|_F$ denote the conjugate transpose and the Frobenius norm of a complex matrix $\mathbf{A}$, respectively. $\mathbb{C}$ denotes the complex field. The operator $\mathcal{CN}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ represents a complex Gaussian vector with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. $\mathbf{I}$ denotes an identity matrix. $\mathbb{E}[\cdot]$ denotes the expectation of a random variable.
II System Model and Problem Formulation
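We start with a single-cell multi-input single-output (MISO) downlink system where a BS with $N_t$ antennas serves $K$ single-antenna users. The received signal at user $k$ can be written as
$$y_k = \mathbf{h}_k^H \sum_{j=1}^{K} \mathbf{w}_j s_j + n_k, \tag{1}$$
where $\mathbf{h}_k \in \mathbb{C}^{N_t \times 1}$ denotes the downlink channel vector from the BS to user $k$, and $\mathbf{w}_k$ and $s_k$ denote the transmit beamforming vector and the information signal for user $k$ with normalized power, respectively. $n_k$ is the additive white Gaussian noise with zero mean and variance $\sigma^2$. The beamforming matrix is $\mathbf{W} = [\mathbf{w}_1, \ldots, \mathbf{w}_K]$ and we collect the downlink CSI into $\mathbf{H} = [\mathbf{h}_1, \ldots, \mathbf{h}_K]$. The received SINR at user $k$ is expressed as
$$\gamma_k = \frac{|\mathbf{h}_k^H \mathbf{w}_k|^2}{\sum_{j \neq k} |\mathbf{h}_k^H \mathbf{w}_j|^2 + \sigma^2}. \tag{2}$$
The sum rate is then written as $R = \sum_{k=1}^{K} \log_2(1 + \gamma_k)$. Based on the above model, the sum rate maximization problem under the total transmit power constraint $P$ can be formulated as
$$\max_{\mathbf{W}} \; \sum_{k=1}^{K} \log_2(1 + \gamma_k) \quad \text{s.t.} \quad \|\mathbf{W}\|_F^2 \le P. \tag{3}$$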
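As a concrete illustration of the SINR in (2) and the sum rate objective in (3), the following minimal NumPy sketch evaluates the sum rate for a given channel and beamforming matrix. The function name `sum_rate`, the column-wise storage of channels and beams, and the matched-filter beams used in the demo are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def sum_rate(H, W, sigma2=1.0):
    """Sum rate in bits/s/Hz of a MISO downlink (illustrative helper).

    H: (N, K) complex array, column k is the downlink channel of user k.
    W: (N, K) complex array, column k is the beamforming vector of user k.
    """
    # G[k, j] = |h_k^H w_j|^2 is the power received at user k from beam j.
    G = np.abs(H.conj().T @ W) ** 2
    signal = np.diag(G)
    interference = G.sum(axis=1) - signal
    sinr = signal / (interference + sigma2)
    return float(np.sum(np.log2(1.0 + sinr)))

rng = np.random.default_rng(0)
N, K, P = 8, 4, 10.0
H = (rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))) / np.sqrt(2)
# Matched-filter beams with an equal power split, scaled to total power P.
W = H / np.linalg.norm(H, axis=0) * np.sqrt(P / K)
print(sum_rate(H, W))
```

The same function can serve as the unsupervised part of a training loss once it is rewritten in a framework with automatic differentiation.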
Extensions of the system model to massive MIMO and multicell scenarios are discussed in Sections V and VI, respectively, where specific techniques to reduce the training complexity and enable distributed learning are also introduced.
When the downlink CSI is available, there exist numerical algorithms that can find a locally optimal beamforming solution of problem (3), such as the WMMSE algorithm [2]. In our recent work [7], we proposed a deep learning method to solve this problem with perfect downlink CSI. However, without the downlink CSI, existing numerical or deep learning algorithms cannot be applied. Therefore, we focus on the design of downlink beamforming algorithms when only the uplink channel information is available, either in the form of perfect CSI or the received pilot signal.
III Uplink to Downlink Channel Mapping
In this paper, we rely on the uplink channel information to infer the downlink channel and optimize the downlink beamforming. We therefore assume that there exists a deterministic and unique mapping from the uplink channel to the downlink channel, whose explicit form is unknown but can be learned by a deep neural network. This assumption is based on the fact that a wireless channel between a transmitter-receiver pair is determined by the positions of the pair, the antennas, the carrier frequency and the environment in which the signals propagate, including the objects within the environment and their materials and shapes. Because both uplink and downlink channels share the same propagation environment, given the positions of the transceiver pair, there is an intrinsic mapping between the uplink and the downlink channels. Below we give details for the FDD and TDD cases, respectively.

In an FDD system, consider the single-antenna uplink channel $h_U$ and downlink channel $h_D$ that operate at frequencies $f_U$ and $f_D$, respectively. Assuming that there are $N$ distinct propagation paths in the environment, the uplink channel can be written as
$$h_U = \sum_{n=1}^{N} \alpha_n e^{-j 2\pi f_U \tau_n + j \phi_n}, \tag{4}$$
where $\alpha_n$ is the path attenuation, $\tau_n$ is the path delay and $\phi_n$ is a frequency-independent phase shift that captures the reflection and attenuation effects of the signal along path $n$.
The path attenuation depends on the distance between the transceiver pair, their antenna gains, the carrier frequency and the environment; the phase shift depends on the scattering; and the path delay depends on the propagation distance. Therefore, when the environment and other factors are unchanged, there is a deterministic mapping from the positions to the channel [14]. Next we look at the mapping from the channel to the positions. Although this mapping may not always be unique, it is unique with a high probability in many practical wireless communication scenarios, especially as the number of antennas increases, which is widely exploited in wireless fingerprinting [25] and positioning [26]. In other words, the mapping between positions and channels can be assumed bijective, and so can the mapping between the uplink and downlink channels.
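To make the shared-environment argument concrete, the sketch below evaluates a multipath channel of the form described around (4) at two carrier frequencies using the same path parameters. The exponent convention $e^{-j 2\pi f \tau_n + j\phi_n}$, the function name `multipath_channel`, the numerical path values and the two carrier frequencies are assumptions made for illustration; for simplicity the path attenuations are also held fixed across the two carriers, although the text notes that they depend on the carrier frequency.

```python
import numpy as np

def multipath_channel(freq_hz, alpha, tau, phi):
    """Narrowband single-antenna channel at carrier freq_hz (illustrative)."""
    alpha, tau, phi = map(np.asarray, (alpha, tau, phi))
    return np.sum(alpha * np.exp(-1j * 2 * np.pi * freq_hz * tau + 1j * phi))

# Hypothetical three-path environment shared by the uplink and the downlink.
alpha = np.array([1.0, 0.5, 0.2])           # path attenuations
tau = np.array([53e-9, 127e-9, 311e-9])     # path delays in seconds
phi = np.array([0.3, -1.2, 2.0])            # frequency-independent phase shifts

h_ul = multipath_channel(2.4e9, alpha, tau, phi)   # uplink carrier
h_dl = multipath_channel(2.5e9, alpha, tau, phi)   # downlink carrier
print(h_ul, h_dl)
```

Both channel values are fully determined by the same set of path parameters, which is the sense in which the downlink channel is a function of the uplink environment even though the two values differ.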
In a TDD system, channel reciprocity is usually assumed, but in reality the analog radio front-ends at different wireless nodes, such as the BS and the mobile users, exhibit nonreciprocity due to nonidentical behavior of the individual transmit and receive chains. This is caused by mismatches in the frequency responses of both the BS and user side radio front-ends between the transmit and receive modes, and by differences in the mutual coupling of BS antenna units and the associated RF transceivers under transmit and receive modes [27], [28].
Specifically, consider the channels between a BS with antennas and a single-antenna user in linear TDD systems. The uplink channel and downlink channel can be written as [8], [11]
(5) where is the physical reciprocal channel, and are the frequency responses at the user side in the transmitting and receiving modes, respectively. Denote as the frequency-response matrix and as the mutual coupling matrix of the BS, and then and where the subscripts and denote the transmitting and receiving modes, respectively. The frequency-response matrix is diagonal but the mutual coupling matrix is not diagonal. In general, , so the uplink and downlink channels are nonreciprocal, and their relation can be described as
(6) which is a deterministic and unique mapping.
Once the bijective mapping between the uplink and downlink channels is established, it can be learned by deep neural networks based on the universal approximation theorem [29]. Note that in the above, we have adopted explicit parametric modelling of the uplink and downlink channels with simplifying assumptions (e.g., not all hardware impairment sources, such as power amplifier distortion, phase noise and quantization noise, are considered), but in practice such parametric models may not be accurate. In this paper we do not assume any specific parametric channel model in our theoretical development; instead, we use the model-driven learning approach to learn the best mapping from the uplink to the downlink channel in order to maximize the end performance.
IV The Proposed General Algorithm Framework
Since we consider a generic system which could be either TDD or FDD, the exact form of the mapping between the uplink and downlink channels is unknown, although it is assumed to be bijective. The most viable way to obtain the downlink CSI without user feedback is therefore to learn it from data first, after which the downlink beamforming is learned or optimized to maximize the sum rate. This is the traditional method that treats the channel learning and the end performance optimization separately. The main drawback of separate learning is that the explicit channel learning process does not take into account the ultimate objective of maximizing the sum rate, and it also causes error propagation when optimizing the beamforming. In this paper, we use a deep learning approach to solve problem (3) directly from the uplink channel information. We still include learning the downlink channel from the uplink channel information, but it is only an intermediate step and the focus is not on achieving the best channel learning performance. The key idea of our proposed method is to exploit the optimal structure of the beamforming solution as the useful model information, which then guides the design of a highly efficient neural network to solve (3), with the assistance of the learned downlink CSI. More details are given below.
IV-A Structure of beamforming
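According to [23], the optimal downlink beamforming vectors that maximize the sum rate possess the structure below
$$\mathbf{w}_k = \sqrt{p_k}\, \frac{\left(\mathbf{I} + \sum_{j=1}^{K} \frac{\lambda_j}{\sigma^2} \mathbf{h}_j \mathbf{h}_j^H\right)^{-1} \mathbf{h}_k}{\left\| \left(\mathbf{I} + \sum_{j=1}^{K} \frac{\lambda_j}{\sigma^2} \mathbf{h}_j \mathbf{h}_j^H\right)^{-1} \mathbf{h}_k \right\|}, \quad \forall k, \tag{7}$$
where $p_k$ and $\lambda_k$ are positive parameters and satisfy $\sum_{k=1}^{K} p_k = \sum_{k=1}^{K} \lambda_k = P$. The parameter vector $\mathbf{p} = [p_1, \ldots, p_K]^T$ represents the downlink power allocation. $\boldsymbol{\lambda} = [\lambda_1, \ldots, \lambda_K]^T$ is an auxiliary vector variable which is useful to determine the direction of beamforming. Because it needs to satisfy the same total power constraint as the downlink power vector $\mathbf{p}$, $\boldsymbol{\lambda}$ can be interpreted as the virtual uplink power vector. The advantage of this representation is that the power vectors $\mathbf{p}$ and $\boldsymbol{\lambda}$ can be regarded as the key features of the beamforming solution. Instead of learning the high-dimensional beamforming matrix directly, (7) allows us to learn these low-dimensional features, which greatly improves the learning efficiency and accuracy, and reduces the training complexity.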
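The recovery of the beamforming matrix from the two power vectors can be sketched as follows. The sketch assumes the commonly used form of the structure in (7), in which the direction of each beam is $(\mathbf{I} + \sum_j (\lambda_j/\sigma^2)\mathbf{h}_j\mathbf{h}_j^H)^{-1}\mathbf{h}_k$ normalized to unit norm and then scaled by $\sqrt{p_k}$; the function name `beamforming_from_powers` and the equal-power demo values are hypothetical.

```python
import numpy as np

def beamforming_from_powers(H, p, lam, sigma2=1.0):
    """Build W (N, K) from downlink powers p and virtual uplink powers lam."""
    H = np.asarray(H, dtype=complex)
    p = np.asarray(p, dtype=float)
    lam = np.asarray(lam, dtype=float)
    n_ant = H.shape[0]
    # A = I + (1/sigma2) * sum_k lam_k h_k h_k^H  (an N x N matrix)
    A = np.eye(n_ant) + (H * lam) @ H.conj().T / sigma2
    D = np.linalg.solve(A, H)             # unnormalized beam directions
    D = D / np.linalg.norm(D, axis=0)     # unit-norm columns
    return D * np.sqrt(p)                 # scale column k by sqrt(p_k)

rng = np.random.default_rng(1)
n_ant, n_users, P = 8, 3, 5.0
H = (rng.standard_normal((n_ant, n_users))
     + 1j * rng.standard_normal((n_ant, n_users))) / np.sqrt(2)
p = np.full(n_users, P / n_users)      # equal split, for illustration only
lam = np.full(n_users, P / n_users)
W = beamforming_from_powers(H, p, lam)
print(np.linalg.norm(W, axis=0) ** 2)  # per-user transmit powers
```

Only the $2K$ real numbers in `p` and `lam` need to be predicted; the $N \times K$ complex matrix follows deterministically, which is the dimensionality argument made above.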
IV-B Key modules of the proposed neural network
Based on the expression in (7), we propose a neural network to jointly learn the downlink channel and the power feature vector with the end objective of maximizing the system sum rate, as illustrated in Fig. 1.
The proposed neural network takes uplink channel information as input, and its output is the beamforming matrix . The input to the neural network can be either the perfect uplink CSI with a dimension of by stacking the real and imaginary parts, or the uplink pilot signal which will be discussed later in this section. The proposed neural network structure consists of the following three modules:

CSINet. This subnet aims to learn the downlink channel from the uplink channel and its output is separated into the real part and the imaginary part . When the input is the uplink CSI, it performs uplink-downlink calibration for TDD systems, while in FDD systems it maps the channel in the uplink band to the channel in the downlink band. It is not necessary to specify whether the system is TDD or FDD, as this module learns the downlink channel automatically in either case. When the input is the uplink pilot signal, it additionally estimates or refines the uplink channel, but this is embedded in the module implicitly. Suppose we have a channel training dataset in which there are uplink-downlink CSI pairs. Given the predicted output result of the th sample in the CSINet is and the target result in the training dataset is , the mean squared error (MSE) based loss function of CSINet is defined as
(8) The structure of CSINet depends on the specific system of interest. In this paper, we adopt fully connected layers, and details are given in Section VII. Note that for the uplink CSI input, when the channels of different users are uncorrelated and their statistics are similar, we can use the same CSINet to learn the downlink channel in a single-user manner. This effectively increases the amount of training data by a factor of $K$, while reducing the complexity of the neural networks.

PowerNet. This subnet aims to learn the concatenated power vector , which is the key feature of the beamforming solution. In the literature, there is no method available that can find the optimal and in (7) to maximize the sum rate with a reasonable complexity. The WMMSE algorithm [2] is a well-known iterative method to find locally optimal solutions. It ensures the continuity of the mapping from the channel to the solution, which can be learned by a neural network. Therefore, we generate samples of the power allocation vectors and for training by using the WMMSE algorithm. Supervised learning with the following loss function based on the MSE metric is used to train the PowerNet,
(9) where and are the th training samples of the downlink and uplink power vectors in the power training database obtained from the WMMSE algorithm, respectively, and and are the predicted results of PowerNet. Similar to CSINet, the neural network structure is system-dependent, and in this paper, we adopt fully connected layers or convolutional neural network (CNN) layers, which are specified in the simulation results of Section VII.

Beamforming Recovery Module. This module has two functions. First, it constructs the downlink beamforming matrix from the downlink channel output by the CSINet and the uplink and downlink power vectors output by the PowerNet, using the structure specified in (7); there is no parameter to optimize in this module. Second, this module is used to calculate the sum rate for unsupervised learning, which then forms part of the overall loss function for effective hybrid training, as described in the next subsection.
IV-C Hybrid training
To train the proposed neural network, we construct the overall loss function as the weighted sum of the losses of the CSINet, the PowerNet and the sum rate, i.e.,
(10) 
where , and is the weight for each loss component. and can be obtained by supervised learning from the CSINet and the PowerNet, respectively, while can be calculated by using the beamforming matrix obtained from the above Beamforming Recovery Module and the learned downlink channel from the CSINet based on (2); overall, the training adopts a hybrid supervised and unsupervised approach. Note that both channel learning and power learning are auxiliary in our proposed algorithm, and the focus of the overall learning is to maximize the sum rate but not to achieve the best learning performance of the downlink channel matrices and power vectors. The incorporation of the sum rate into the loss function is important because this ensures that the training of the neural network is guided by the end performance; this is in stark contrast to the separate training approach which only focuses on learning the channel or the power vectors but cannot guarantee the sum rate performance. In addition, the inclusion of the and is also important because the downlink channel is unknown and it is difficult to learn the overall mapping from the uplink channel directly to the downlink beamforming which leads to unsatisfactory training performance.
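The weighted combination in (10) can be sketched numerically as below. This is a minimal sketch under stated assumptions: the sum-rate term is taken to enter with a negative sign (so that minimizing the loss maximizes the rate), the MSE terms use a plain mean over all entries, and the names `hybrid_loss` and `weights` are hypothetical. An actual training loop would implement the same expression in an automatic-differentiation framework.

```python
import numpy as np

def mse(pred, target):
    """Mean squared error over all (possibly complex) entries."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(target)) ** 2))

def hybrid_loss(H_pred, H_true, q_pred, q_true, rate, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of channel loss, power loss and negative sum rate."""
    w_csi, w_power, w_rate = weights
    loss_csi = mse(H_pred, H_true)      # supervised: downlink channel
    loss_power = mse(q_pred, q_true)    # supervised: concatenated power vector
    loss_rate = -rate                   # unsupervised: maximize the sum rate
    return w_csi * loss_csi + w_power * loss_power + w_rate * loss_rate

# Toy numbers only, to show how the three weights select the training mode.
H_true = np.ones((4, 2)); H_pred = 0.9 * H_true
q_true = np.ones(4);      q_pred = 1.1 * q_true
print(hybrid_loss(H_pred, H_true, q_pred, q_true, rate=3.2))
print(hybrid_loss(H_pred, H_true, q_pred, q_true, rate=3.2, weights=(1, 1, 0)))
```

Setting the rate weight to zero gives purely supervised training, while setting the two MSE weights to zero gives purely unsupervised training; the hybrid scheme keeps all three terms active.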
We illustrate the advantage of the proposed algorithm over the supervised learning () and unsupervised learning methods () in Fig. 2, where the uplink CSI follows distribution and the relation between the th elements of the uplink and downlink CSI is . As can be seen, the proposed algorithm outperforms the supervised learning method, and the performance gap increases as the number of antennas grows. When , the proposed algorithm achieves about 10% higher sum rate than the supervised learning method, while the unsupervised learning cannot achieve satisfactory sum rate performance as explained above. The results show that in sharp contrast to the case where perfect CSI is available, existing methods of supervised and unsupervised learning could not achieve satisfactory performance when the CSI needs to be learned.
IV-D Learning from the uplink pilot signal
In this subsection, we introduce the preprocessing when the uplink channel information is in the form of the received pilot signal. Assume that users transmit pilot symbols. Suppose the pilot signals sent by all users are collected in , then the received pilot signal in the uplink can be written as
(11) 
where is the received noise and the elements of have zero mean and variance of . In order to recover the uplink channel from the pilot signal, we choose the pilot to be a submatrix of a scaled discrete Fourier transform (DFT) matrix of dimension (multiplied by the square root of the transmit power), which has orthogonal columns and rows and whose elements have unit amplitude.
For our proposed algorithm, instead of using the received pilot , we use the least-squares version as the input to the neural network:
(12) 
Clearly, when the pilot length is at least the number of users, the pilot sequences of different users are orthogonal; when the pilot length is smaller than the number of users, there exists pilot contamination among users, which degrades the performance of estimating the uplink channel. Note that in our proposed approach, we do not need to estimate the uplink channel explicitly; instead, we learn the downlink beamforming from the uplink pilot directly.
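The pilot preprocessing can be sketched as follows, assuming the uplink model $\mathbf{Y} = \mathbf{H}\mathbf{X} + \mathbf{N}$ with one pilot sequence per row of $\mathbf{X}$, pilots taken from a DFT matrix of size equal to the larger of the number of users and the pilot length, and a least-squares input obtained by correlating with the pilots and dividing by the pilot energy. These modelling choices, the normalization, and the names `dft_pilots` and `ls_input` are assumptions made for illustration and may differ from the exact expressions in (11) and (12).

```python
import numpy as np

def dft_pilots(n_users, pilot_len, power=1.0):
    """Pilot matrix (n_users, pilot_len) cut from a scaled DFT matrix."""
    m = max(n_users, pilot_len)
    idx = np.arange(m)
    F = np.exp(-2j * np.pi * np.outer(idx, idx) / m)   # unit-amplitude entries
    return np.sqrt(power) * F[:n_users, :pilot_len]

def ls_input(Y, X, power=1.0):
    """Correlate the received pilots with X and normalize by pilot energy."""
    pilot_len = X.shape[1]
    return Y @ X.conj().T / (pilot_len * power)

rng = np.random.default_rng(0)
n_ant, n_users = 6, 4
H_ul = rng.standard_normal((n_ant, n_users)) + 1j * rng.standard_normal((n_ant, n_users))

for pilot_len in (4, 2):   # enough pilots vs. too few pilots
    X = dft_pilots(n_users, pilot_len)
    Y = H_ul @ X           # noise-free reception, to isolate contamination
    err = np.linalg.norm(ls_input(Y, X) - H_ul)
    print(pilot_len, err)
```

With four pilot symbols the noise-free least-squares input reproduces the uplink channel, whereas with two symbols the residual is nonzero, which is the pilot contamination effect that the network has to cope with.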
For comparison, the traditional separate approach would first estimate the uplink channel, e.g., by using a linear minimum meansquared error (MMSE) estimator, and then use it as the input of the neural network. Suppose the linear MMSE estimator is in the form of
(13) 
where and are the weighting coefficients. Then the corresponding channel estimation can be obtained by solving the following optimization problem:
(14) 
The optimal and are given by
(15)  
(16) 
and the linear MMSE estimate of is given by
(17) 
where is the mean value of , i.e., , has zero mean, and which can be estimated from the training data of the uplink channel. The derivation can be found in Appendix A.
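For reference, a generic linear MMSE estimator of the affine form suggested by (13) can be fitted from training pairs as sketched below. The sketch uses the textbook solution with sample means and sample covariances; it is not claimed to reproduce the exact expressions in (15) to (17), and the names `fit_lmmse` and `apply_lmmse` and the synthetic data are hypothetical.

```python
import numpy as np

def fit_lmmse(Y, H):
    """Fit h_hat = A y + b from rows of Y (observations) and H (targets)."""
    mu_y, mu_h = Y.mean(axis=0), H.mean(axis=0)
    Yc, Hc = Y - mu_y, H - mu_h
    C_yy = Yc.T @ Yc.conj() / len(Y)           # sample covariance of y
    C_hy = Hc.T @ Yc.conj() / len(Y)           # sample cross-covariance
    A = np.linalg.solve(C_yy.T, C_hy.T).T      # A = C_hy C_yy^{-1}
    b = mu_h - A @ mu_y
    return A, b

def apply_lmmse(Y, A, b):
    """Apply the fitted estimator to each row of Y."""
    return Y @ A.T + b

def crandn(rng, *shape):
    """Unit-variance circular complex Gaussian samples."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

rng = np.random.default_rng(0)
S, D = 2000, 4                                  # samples, channel dimension
H_train = 0.5 + crandn(rng, S, D)               # channels with nonzero mean
Y_train = H_train + 0.7 * crandn(rng, S, D)     # noisy least-squares inputs
A, b = fit_lmmse(Y_train, H_train)
mse_ls = np.mean(np.abs(Y_train - H_train) ** 2)
mse_lmmse = np.mean(np.abs(apply_lmmse(Y_train, A, b) - H_train) ** 2)
print(mse_ls, mse_lmmse)
```

On the data used for fitting, the affine estimator cannot have a larger empirical MSE than the raw least-squares input, because the identity map is itself one of the affine candidates.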
V Low-complexity Implementation for Massive MIMO Downlink
The proposed algorithm framework in Section IV is general so it can be applied to massive MIMO downlink, straightforwardly. However, the large number of antennas may introduce a high computational complexity, especially in the calculation of the loss function during the training. In this section, we propose two techniques that can reduce the complexity when calculating the loss function of the sum rate in the training process for massive MIMO systems, without compromising the end performance, especially when .
V-A Massive MIMO channel model
We assume that the BS is equipped with uniform linear array (ULA) antennas, without loss of generality. The channel between the BS and a user (the user index is omitted for simplicity) that consists of paths is modelled as
(18) 
where and are the attenuation, the path delay, the phase shift and the angle of arrival (AoA, for the uplink) or the angle of departure (AoD, for the downlink) of the th path, respectively. Moreover, is the array response vector defined as
(19) 
where is the antenna spacing and is the wavelength.
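A minimal sketch of the ULA channel described around (18) and (19) is given below. It assumes the array response $[1, e^{j 2\pi (d/\lambda)\sin\theta}, \ldots, e^{j 2\pi (N-1)(d/\lambda)\sin\theta}]^T$ and per-path gains of the form $\alpha e^{-j 2\pi f \tau + j\phi}$; sign and normalization conventions vary across the literature, and the function names and numerical values are hypothetical.

```python
import numpy as np

def ula_response(theta, n_ant, d_over_lambda=0.5):
    """Array response of a ULA for angle theta (radians)."""
    n = np.arange(n_ant)
    return np.exp(1j * 2 * np.pi * d_over_lambda * n * np.sin(theta))

def ula_channel(freq_hz, alpha, tau, phi, theta, n_ant, d_over_lambda=0.5):
    """Sum of path gains times array responses, returning an (n_ant,) vector."""
    alpha, tau, phi, theta = map(np.asarray, (alpha, tau, phi, theta))
    gains = alpha * np.exp(-1j * 2 * np.pi * freq_hz * tau + 1j * phi)
    A = np.stack([ula_response(t, n_ant, d_over_lambda) for t in theta], axis=1)
    return A @ gains

# Hypothetical two-path channel seen by a 16-antenna array.
h = ula_channel(2.5e9, alpha=[1.0, 0.4], tau=[40e-9, 95e-9],
                phi=[0.0, 1.0], theta=[0.2, -0.6], n_ant=16)
print(h.shape, np.linalg.norm(h))
```

In an FDD system the ratio of antenna spacing to wavelength differs slightly between the uplink and downlink carriers, which is why `d_over_lambda` is kept as an explicit argument.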
From the structure of the optimal beamforming (7), we can see that constructing the beamforming vectors involves the inversion of an $N_t \times N_t$ matrix. Because the sum rate in the loss function (10) is calculated from the beamforming constructed according to (7), this inversion has to be repeated throughout training, which causes a high complexity for massive MIMO systems.
V-B Dimension reduction
In the first technique, we aim to reduce the dimension of the matrix to be inverted when calculating the optimal beamforming using (7). The following proposition, adapted from the results for interference channels [30], [31], is useful.
Proposition 1: Suppose and the users’ channels are linearly independent and that . Then if is a beamforming vector for user that corresponds to a rate point on the Pareto boundary, there exist complex numbers such that
(20) 
Proposition 1 can be proved using the same method as that in [30], and therefore the proof is omitted.
Recall that , and define . Suppose , then the SINR expression becomes
(21) 
Define the eigenvalue decomposition , where is the unitary eigenmatrix and is the diagonal eigenvalue matrix. Now define the new beamforming vector and the new channel vector . Then the sum rate maximization can be written equivalently as
(22)
We can see from the above new problem formulation (22) that the size of the new channel matrix reduces from to . The structure of the optimal beamforming is revised to
(23) 
and as a result the size of the matrix inversion is reduced from to .
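One construction consistent with this description is sketched below: when every beamforming vector lies in the span of the user channels (as in Proposition 1), the eigenvalue decomposition of the $K \times K$ Gram matrix $\mathbf{H}^H\mathbf{H} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^H$ yields reduced channels $\boldsymbol{\Lambda}^{1/2}\mathbf{U}^H$ that preserve all inner products $\mathbf{h}_k^H\mathbf{w}_j$ and all beam norms. The exact definitions used in (21) to (23) are not reproduced here, so the mapping below, including the names `reduce_channel` and `lift`, is an assumption made for illustration.

```python
import numpy as np

def reduce_channel(H):
    """Return the K x K reduced channel and an N x K map back to full size."""
    lam, U = np.linalg.eigh(H.conj().T @ H)   # Gram matrix is K x K Hermitian
    root = np.sqrt(lam)                       # positive if H has full column rank
    H_red = root[:, None] * U.conj().T        # Lambda^{1/2} U^H
    lift = H @ (U / root)                     # H U Lambda^{-1/2}
    return H_red, lift

rng = np.random.default_rng(0)
N, K = 64, 4
H = (rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))) / np.sqrt(2)
H_red, lift = reduce_channel(H)

# Any beamformer designed in the reduced K-dimensional problem ...
W_red = rng.standard_normal((K, K)) + 1j * rng.standard_normal((K, K))
# ... maps to an N-dimensional beamformer with identical effective gains.
W = lift @ W_red
print(np.allclose(H.conj().T @ W, H_red.conj().T @ W_red))
print(np.isclose(np.linalg.norm(W), np.linalg.norm(W_red)))
```

Because the SINR and the power constraint depend only on these inner products and norms, the sum rate can be evaluated entirely in the reduced problem, at the cost of one extra matrix product and one small eigenvalue decomposition per channel realization.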
Note that although the above dimension reduction technique significantly reduces the dimension of the matrix inversion, it involves an extra matrix multiplication and eigenvalue decomposition when constructing the new channel vectors . It was shown in [32] that standard linear algebra operations for a square matrix of dimension $n$, including matrix inversion and eigenvalue decomposition, have the same time complexity as matrix multiplication, for which the Coppersmith-Winograd algorithm achieves a complexity of $\mathcal{O}(n^{2.376})$ [34]. Therefore, the benefit of the dimension reduction technique on the overall training time is only obvious when , and this is verified by the simulation results in Section VII.
V-C ZF beamforming in the loss function
Another technique to reduce the complexity of the matrix inversion in (7) when calculating the sum rate in the loss function is to use ZF beamforming in the following form
(24) 
where is chosen to satisfy the power constraint, i.e., . It has been shown in the seminal work on massive MIMO [33] that ZF beamforming is near-optimal when the number of antennas is large. From a complexity viewpoint, the advantage of ZF beamforming is that it only involves the inversion of the Gram matrix of the user channels, which is a $K \times K$ matrix, and thus reduces the complexity of the matrix inversion in (7).
Similar to the dimension reduction technique, the advantage of the ZF beamforming is more prominent when and diminishes as increases, but its performance is always close to the WMMSE solution as long as . These properties are verified by the simulation results in Section VII.
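A minimal sketch of ZF beamforming with a single scaling factor is given below. It assumes the pseudo-inverse form $\mathbf{H}(\mathbf{H}^H\mathbf{H})^{-1}$ with a common scalar chosen so that the total transmit power equals $P$; the exact form of (24), including any per-user power allocation, may differ, and the name `zf_beamforming` is hypothetical.

```python
import numpy as np

def zf_beamforming(H, total_power=1.0):
    """ZF beams (N, K) that null inter-user interference, scaled to total_power."""
    gram_inv = np.linalg.inv(H.conj().T @ H)   # only a K x K inversion
    W = H @ gram_inv
    # ||H (H^H H)^{-1}||_F^2 equals trace((H^H H)^{-1}).
    scale = np.sqrt(total_power / np.real(np.trace(gram_inv)))
    return scale * W

rng = np.random.default_rng(0)
N, K, P = 64, 4, 10.0
H = (rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))) / np.sqrt(2)
W = zf_beamforming(H, P)
G = H.conj().T @ W                     # effective channel gains
print(np.round(np.abs(G), 6))          # diagonal: no inter-user interference
```

Since the effective channel is diagonal, the SINR in the loss function reduces to a per-user signal-to-noise ratio, which also simplifies the sum-rate computation during training.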
VI Multicell Massive MIMO Downlink
In this section, we apply the proposed model-based learning to multicell massive MIMO systems; an example with seven cells is illustrated in Fig. 3. Specifically, we first introduce the distributed multicell massive MIMO downlink beamforming, which does not need signal or data exchange between cells, and then describe the uplink channel estimation via pilots, followed by the proposed model-based learning method.
VI-A Distributed Optimization of Beamforming in Multicell Massive MIMO Downlink
Consider an cell massive MIMO system, and in each cell a BS with antennas serves singleantenna users, . Denote the beamforming vector for the th user in the th cell as , then the received scalar signal in the th cell is expressed as follows:
(25)  
where is the downlink channel from the BS in the th cell to the th user in the th cell, is the signal for the th user in the th cell. Therefore the SINR of the th user in the th cell is written as follows
(26) 
Based on (26), to calculate for the th user in the th cell, it requires not only the beamforming and downlink channels , but also the knowledge of any possible interfering cell with and . This includes interfering BSs’ downlink channels to the considered user in the th cell and all users’ beamforming vectors in those interfering cells with . This means to optimize the SINR in any single cell, the solution of the beamforming vector in this cell is coupled with the solutions in other cells. Traditional methods, such as the coordinated transmission, treat the multicell as a large single cell and use ZF beamforming method to cancel out the interference term in (26). These methods require the cooperation between cells such as the exchange of CSI and centralized joint optimization of beamforming vectors, which results in high signaling overhead [35].
In order to decouple the beamforming between different cells, the signal to leakage plus noise ratio (SLNR) is used in this paper as an alternative, where the SLNR of the th user in the th cell is defined as follows [36]:
(27) 
Note that the key difference between the SLNR in (27) and the SINR in (26) is the interference term in the denominator: the SLNR in (27) considers the ‘leaked’ interference power due to the beamforming from the th user in the th cell to all other cells’ users, instead of the received interference in the SINR in (26). Traditionally the data rate is defined based on the SINR in (26), but this requires cooperation between different cells to estimate the data rate performance of the user. Since the SLNR definition in (27) shares similarity with the SINR definition in (26), here we define an approximate data rate based on the SLNR of the th user in the th cell as follows:
(28) 
where the subscript “SLNR” is used to differentiate it from the original data rate definition, and the approximation is based on the high SNR assumption.
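To make the difference between received interference and leaked interference concrete, the following minimal NumPy sketch evaluates both ratios for a given set of channels and beamformers. The array layout (`H[n, m, k]` for the channel from BS n to user k in cell m), the function names, and the rate form log2(1 + SLNR) are assumptions made for illustration; the leakage here also counts the other users in the same cell, which may differ from the exact convention of (27).

```python
import numpy as np

def sinr_and_slnr(H, W, noise_power):
    """Per-user SINR and SLNR in a multicell downlink (illustrative layout).

    H[n, m, k] : channel vector from BS n to user k in cell m (assumed layout).
    W[m, k]    : beamforming vector that BS m uses for its own user k.
    """
    num_cells, _, num_users, _ = H.shape
    # G[n, j, m, k] = |h_{n,m,k}^H w_{n,j}|^2: power of beam (n, j) seen at user (m, k).
    G = np.abs(np.einsum("nmka,nja->njmk", H.conj(), W)) ** 2
    sinr = np.empty((num_cells, num_users))
    slnr = np.empty((num_cells, num_users))
    for m in range(num_cells):
        for k in range(num_users):
            desired = G[m, k, m, k]
            # SINR: interference arriving at user (m, k) from every other beam.
            received = G[:, :, m, k].sum() - desired
            # SLNR: interference that beam (m, k) causes at every other user.
            leaked = G[m, k, :, :].sum() - desired
            sinr[m, k] = desired / (received + noise_power)
            slnr[m, k] = desired / (leaked + noise_power)
    return sinr, slnr

def approx_sum_rate(slnr):
    """SLNR-based approximate sum rate, assuming the form log2(1 + SLNR)."""
    return float(np.log2(1.0 + slnr).sum())
```

Note that the leakage of beam (m, k) only involves channels from BS m, which is what allows each cell to evaluate its own SLNR without information from the other BSs.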
With the definition of SLNR in (27), the optimization of the beamforming vector for the th cell depends only on the downlink channels from the BS in the th cell to users in all cells, which can be estimated via the uplink estimation. More importantly, since the required channel information can be obtained by the BS in a single cell, the use of SLNR instead of SINR decouples the beamforming in individual cells. Therefore the multicell sum rate maximization problem can be decoupled into a per-cell sum rate maximization problem to be addressed in each cell as
(29) 
where is the transmit power limit for each BS. To solve the problem in (29), the beamforming vector is rewritten as , where satisfies for all , and represents the direction of the beamforming vector, while the power is characterized by . Then with (28), the objective function in (29) can be further rewritten as follows:
(30)  
while the percell sum rate maximization problem in (29) can be further rewritten as follows
(31)  
s.t. 
The problem (31) is nonconvex, so we propose to use the alternating optimization to solve it. Specifically, with a given power allocation , the optimal beamforming direction is solved as:
(32) 
When the beamforming directions are fixed, the objective function of (31) is a posynomial with regard to , so (31) can be solved via geometric programming [37]. Since in each step, the objective function is nondecreasing, such an alternating optimization algorithm will converge. Therefore we can use the alternating optimization algorithm to generate the downlink power solutions as the labelled data in the training process.
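As an illustration of the direction update for a fixed power, the sketch below uses the well-known closed form of the SLNR-maximizing unit-norm direction, which is proportional to the inverse of the leakage-plus-noise matrix applied to the desired channel. This is a standard result and is only assumed to correspond to (32) up to notation; the names `h`, `G`, `p` and the function names are hypothetical.

```python
import numpy as np

def slnr_direction(h, G, p, noise_power):
    """Unit-norm direction maximizing the SLNR for a fixed power p.

    h : (Nt,) channel to the intended user.
    G : (Nt, J) matrix whose columns are the channels to the J users that
        receive leakage from this beam.
    """
    num_antennas = h.shape[0]
    # Leakage-plus-noise matrix: noise_power * I + p * sum_j g_j g_j^H.
    A = noise_power * np.eye(num_antennas) + p * (G @ G.conj().T)
    w = np.linalg.solve(A, h)
    return w / np.linalg.norm(w)

def slnr(h, G, w, p, noise_power):
    """SLNR of a unit-norm direction w transmitted with power p."""
    signal = p * np.abs(np.vdot(h, w)) ** 2
    leakage = p * np.sum(np.abs(G.conj().T @ w) ** 2)
    return signal / (leakage + noise_power)
```

Because the SLNR is a generalized Rayleigh quotient in the direction, no iterative solver is needed for this step; only the power update requires geometric programming.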
When , the dimension reduction techniques in Section V.B can also be applied to the multicell scenario to reduce the training complexity.
Note that the optimal beamforming based on SLNR involves only the downlink channels from the BS in a single cell to the users in all cells. This enables the distributed beamforming at each single cell and no cooperation is required between cells, which helps to reduce the signaling overhead in the multicell scenario. More importantly, this enables the proposed learningbased method to be decoupled in the multicell scenario and applicable when the system scales up.
VI-B Learning From Uplink Pilots
The optimization based on SLNR requires the BS in each cell to learn the downlink channels to users from the uplink channel information, which can be estimated from the uplink pilots.
Consider the th cell surrounded by neighbouring cells. Assume that each user in the multicell system transmits pilot symbols of length , then for the BS in the th cell, the received pilot signal in the uplink is written as
(33) 
where denotes the uplink channels from all users to the BS in the th cell, is the received noise whose elements have zero mean and variance of , and denotes the pilots sent by all users. Similar to the single-cell scenario, the pilot adopts the submatrix of a DFT matrix with dimension , which is scaled by the square root of the transmit power. Then the least-squares estimate of the uplink channel is used as the input of the neural network, and can be obtained from the received pilot signal as follows
(34) 
Clearly, to avoid pilot contamination between users, the pilot length must be no less than the total number of users, i.e., .
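The pilot construction and the least-squares step can be sketched as follows. The signal model Y = H X + N with one pilot row per user, the function names, and the use of a pseudo-inverse (so that the short-pilot case can also be evaluated) are assumptions for illustration rather than the exact formulation of (33) and (34).

```python
import numpy as np

def dft_pilots(num_users, pilot_len, tx_power):
    """Pilot matrix (num_users x pilot_len) taken from DFT sequences.

    Row k is the k-th DFT sequence of length pilot_len, scaled by the square
    root of the transmit power. Rows are mutually orthogonal only when
    pilot_len >= num_users.
    """
    k = np.arange(num_users)[:, None]
    n = np.arange(pilot_len)[None, :]
    return np.sqrt(tx_power) * np.exp(-2j * np.pi * k * n / pilot_len)

def ls_estimate(Y, X):
    """Least-squares channel estimate from Y = H X + N.

    For orthogonal pilots this equals Y X^H (X X^H)^{-1}; the pseudo-inverse
    also returns a (contaminated) estimate when the pilots are too short.
    """
    return Y @ np.linalg.pinv(X)

# Example: 7 users, 16 BS antennas, noiseless reception.
rng = np.random.default_rng(1)
H = rng.standard_normal((16, 7)) + 1j * rng.standard_normal((16, 7))
X = dft_pilots(num_users=7, pilot_len=7, tx_power=2.0)
H_ls = ls_estimate(H @ X, X)
```

With a pilot length below the number of users, some users share identical pilot sequences and their channels can no longer be separated, which is the pilot contamination mentioned above.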
For the traditional separate approach, the uplink channel will be estimated using the linear MMSE method.
VI-C The Proposed Distributed Learning
In this subsection, we will adapt the proposed model-based learning approach in Section IV to the multicell scenario. We will use a different SLNR beamforming structure below based on (32), i.e.,
(35) 
From (35), we can see that in order to construct the multicell beamforming, we need the downlink channel information and downlink power allocation. Therefore, we can still use the hybrid loss function (10) which is rewritten below
(36) 
except the loss of PowerNet only involves the downlink power, i.e., for the th cell, it becomes
(37) 
Note that the proposed method based on SLNR and learning from uplink pilots can be generalized to larger systems. The above analysis is based on the th cell, which treats all other cells as interfering cells. With a homogeneous assumption that the conditions are similar in each cell in the whole multicell massive MIMO system, the analysis of the th cell is applicable to any cell in the system. This means the trained neural network’s parameters are applicable to all cells in a distributed manner, which is demonstrated in the simulation results in Section VII.C.
VII Simulation Results
In this section, numerical simulations are carried out to evaluate the performance of the proposed algorithms. training samples and testing samples are used with a batch size of 100 and 200 epochs. We use Keras with TensorFlow 1.15 as the backend to train the proposed neural network on an Nvidia GP100 GPU card in a High Performance Computing (HPC) system. The weight factors used in the loss function are hyperparameters and chosen as , by using a trial-and-error approach, unless otherwise specified. The much lower value of is to balance each component in the loss function considering typical values of the sum rate. For the proposed model-driven solution, when calculating the sum rate in the loss function, both the original algorithm that uses the beamforming solution in (7), and the one that uses the simplified ZF beamforming in (24) are included. For comparison, we consider the following benchmark algorithms:

The WMMSE solution: this is the solution obtained by using the iterative algorithm proposed in [2] assuming the downlink channel is available, therefore it serves as a performance upper bound.

The Learned Channel and Beamforming Solution: this solution is obtained by first learning the downlink channel via supervised learning, then using the learned downlink channel to infer the beamforming solution using unsupervised training.

The Learned Channel and ZF Beamforming: this solution is used as the benchmark solution in the singlecell scenario. It first learns the downlink channel via supervised learning as the above solution, and then constructs a ZF beamforming (24) by using the learned channel.

The Learned Channel and SLNR Beamforming: this solution is used as the benchmark solution in the multicell scenario. It first learns the downlink channel via supervised learning as the above two solutions, and then constructs a SLNR beamforming (32) by using the learned channel. Note that the SLNR based approximate data rate is used only for the purpose of the neural network training, while the sum rate performance is calculated based on the actual SINR defined in (26).
Note that the closed-form non-iterative ZF and SLNR based beamforming solutions are chosen as the benchmarks to ensure the low complexity comparable to our proposed algorithm. In the following, we will present the simulation results and analysis for three scenarios: single-cell small-scale MIMO, single-cell massive MIMO, and multicell massive MIMO systems, as well as the generalization results.
VII-A Small-scale MIMO Scenario
In this scenario, a TDD downlink system in which one BS with equal numbers of transmit antennas and users is considered, i.e., . We assume the uplink channel elements follow an independent Rayleigh distribution with zero mean and unit variance. Following the result in [11], the relation between the downlink channel and the uplink channel due to the radio frontend mismatch is modelled as:
(38) 
where the unitary matrix and are used to model the mismatches in the frequency responses of the BS and the user sides, respectively. This mapping will be learned by the CSINet. Suppose the learned downlink channel is . The learning performance is characterized by the normalized MSE (NMSE), which is defined as
(39)
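For reference, one common way to evaluate the NMSE over a batch of channel realizations is sketched below. The per-sample normalization followed by averaging is an assumption; the exact normalization in (39) may differ, and the function names are hypothetical.

```python
import numpy as np

def nmse(h_true, h_est):
    """Average per-sample normalized squared error.

    h_true, h_est : arrays of shape (num_samples, ...) holding the true and
    learned channels; all axes after the first are treated as one channel.
    """
    axes = tuple(range(1, h_true.ndim))
    err = np.sum(np.abs(h_true - h_est) ** 2, axis=axes)
    ref = np.sum(np.abs(h_true) ** 2, axis=axes)
    return float(np.mean(err / ref))

def nmse_db(h_true, h_est):
    """NMSE expressed in dB."""
    return 10.0 * np.log10(nmse(h_true, h_est))
```

A lower NMSE does not by itself guarantee a higher sum rate, which is why the end performance is reported alongside the NMSE in the results that follow.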
The structure and hyperparameters of the proposed neural network are as follows. For the CSINet, we use four fully connected layers, each with neurons and the ‘tanh’ activation function. No activation function is employed at the output layer. The PowerNet employs four fully connected layers, each with neurons, the ‘relu’ activation function and batch normalization. The output layer of the PowerNet uses ‘softmax’ as the activation function. Beamforming learning for the ‘Learned Channel and Beamforming’ solution uses the same neural network as CSINet, except that batch normalization is included at each layer. training samples are used for this scenario. We also provide a brief analysis of the complexity of the online prediction. For fully connected layers, suppose the input dimension is and the number of neurons in the hidden layer is , then the numbers of multiplication and addition operations are equal to . Therefore the overall neural network has an approximate complexity for the online prediction.
We first show the sum rate results of various algorithms in Fig. 4, when the transmit power is 20 dB and the number of users/BS antennas varies from 2 to 8. We assume that perfect uplink CSI is available. It can be seen that the proposed solution achieves a sum rate close to that of the WMMSE algorithm, and it significantly outperforms the benchmark schemes. The performance of both benchmark methods that first learn the downlink channel is not satisfactory. The learned channel and beamforming scheme achieves the worst performance, while the one using ZF beamforming achieves a higher data rate that is still much lower than that of the proposed solutions. To further investigate the reason, we plot the downlink channel learning results in Fig. 4. The benchmark schemes that explicitly learn the channel first outperform the proposed solution in terms of achieving a lower estimation NMSE. However, their end performance, i.e., the sum rate, is much lower than that of the proposed solution. This confirms the advantage of the proposed model-driven learning with a hybrid training that can better exploit the available uplink channel information to optimize the end performance; the traditional approach that separately learns the channel and then designs the beamforming is not adequate since it is purely data-driven.
VII-B Single-cell Massive MIMO Scenario
In this scenario, we adopt the channel model specified in Section V for an FDD massive MIMO system and assume . The uplink and downlink operate at 2.5 GHz and 2.4 GHz, respectively. The antenna spacing is half the wavelength of the downlink signal. Since the DoA or the AoD is limited within a certain region of the mean angle , it is modelled as . The path delay and the phase shift are uniformly distributed in the ranges of and , respectively. We assume the downlink channel attenuation of the th path follows an independent Rayleigh distribution with zero mean and unit variance. Given the ULA channel model in (18), the nonlinear relation between the uplink channel and the downlink channel is characterized by:
(40)
where the unitary matrix and are used to model the mismatches in the frequency responses of the BS and the user sides, respectively, which will be learned by the CSINet.
The structure and the hyperparameters of the proposed neural network are as follows. The CSINet contains three fully connected layers, each with neurons and the ‘tanh’ activation function. The PowerNet uses two 1D convolutional layers with filter sizes of 16 and 8, batch normalization, the ‘relu’ activation function and a dropout rate of 0.3, followed by a fully connected layer with 256 neurons and the ‘relu’ activation function, and an output layer that uses ‘softmax’ as the activation function. We provide a brief complexity analysis for the online prediction. For 1D convolutional layers, suppose there are kernels of size in the th convolutional layer and the input dimension is
(padding is added such that the first input and output dimensions remain the same across layers, i.e.,
), then the output dimension of is , and the numbers of multiplication and addition operations are equal to when and when . Thus, the total complexity of all convolutional layers measured by the number of multiplications is . Therefore for the PowerNet, the complexity for the online prediction is approximately , while the overall complexity is still considering the CSINet.
The sum rate results of various algorithms are shown in Fig. 5, when the transmit power is 10 dB and the number of users varies from 2 to 10. It can be seen that the proposed solution achieves a sum rate very close to that of the WMMSE solution and the use of ZF beamforming in the loss function causes almost no performance loss. The proposed solutions outperform the learned channel and ZF beamforming solution by about 10%, although the latter performs much better than in the small-scale MIMO scenario. The learned channel and beamforming solution is still the worst because it uses a data-driven approach, without taking into account the overall design objective when learning the downlink channel.
Next, we evaluate the required training time of the proposed algorithms when using the low-complexity implementation discussed in Section V. The training time of the proposed original algorithm and the percentage of the time required by the reduced dimension and the ZF beamforming techniques in relation to are shown in Table I. When the number of users is small, the reduced dimension in matrix inversion of the two low-complexity schemes leads to a much shorter training time. However, as the number of users increases, the gain of the low-complexity schemes diminishes. This is because the reduced dimension scheme involves an extra eigenvalue decomposition and matrix multiplications. In addition, as the number of users increases, the training time is dominated by the width of the fully connected layers which is , therefore the time saved by matrix inversion becomes insignificant.
The sum rate performance versus the number of pilots is shown in Fig. 5, when the number of users is . Similar to Fig. 5, it is confirmed again that the use of ZF beamforming in the loss function achieves almost the same performance as the original algorithm and both are close to the WMMSE solution when the number of pilot symbols . The learned channel and ZF beamforming achieves good performance in this scenario, which is different from the small-scale scenario. There is still a significant gap between the proposed solutions and the learned channel and beamforming solution, which verifies the superior performance of the proposed model-driven approach.
Table I: Training time of the proposed algorithms versus the number of users
Proposed Algorithms \ No. of Users  
Original formulation, (s)  1570  2314  3388  4716  6683 
Reduced dimension (%)  67.24%  80.98%  102.86%  95.68%  99.98% 
ZF Loss (%)  60.11%  73.69%  84.02%  87.78%  93.13% 
VII-C Multicell Massive MIMO Scenario
In the multicell massive MIMO scenario, we consider a TDD system in which the uplink and the downlink operate at different frequencies of 2.5 GHz and 2.4 GHz, respectively. In the simulations, a multicell system with cells is considered, where each cell has one active user as illustrated in Fig. 3. The total transmit power of the BS is 10 dBm, the bandwidth is 20 MHz and the noise power spectral density is -174 dBm/Hz. The radius of each cell is 200 m, and the users are randomly located in each cell by following a uniform distribution. The minimum distance between each user and its serving BS is 10 m.
The channel attenuation includes both the small and large scale fading effects. The relation between the small scale channel attenuations is the same as (40) in the single-cell scenario. The large scale fading (measured in dB, and the uplink/downlink subscript is omitted) between the BS and the user is given as follows
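As a quick sanity check of these link-budget parameters, the short sketch below converts the noise power spectral density and the bandwidth into a total noise power. It assumes the usual thermal noise floor of -174 dBm/Hz; the helper names are hypothetical.

```python
import math

def noise_power_dbm(psd_dbm_per_hz, bandwidth_hz):
    """Total noise power in dBm over the given bandwidth."""
    return psd_dbm_per_hz + 10.0 * math.log10(bandwidth_hz)

def dbm_to_watt(power_dbm):
    """Convert a power level from dBm to watts."""
    return 10.0 ** ((power_dbm - 30.0) / 10.0)

# Assumed values: -174 dBm/Hz noise floor, 20 MHz bandwidth, 10 dBm BS power.
noise_dbm = noise_power_dbm(-174.0, 20e6)   # about -101 dBm
tx_watt = dbm_to_watt(10.0)                 # 0.01 W
```

With these values the transmit power exceeds the noise power by roughly 111 dB before path loss is applied, so the received SNR is determined mainly by the large scale fading.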
(41) 
where is the reference path loss gain measured in dB, which includes the effect of the central frequency , is related to the path loss exponent, and is the distance between the BS and the user measured in kilometers. In the simulation, we assume that the uplink and downlink path loss gains are expressed as, respectively,
(42) 
The structure and the hyperparameters of the proposed neural network are as follows. Three fully connected layers are used in CSINet, each with neurons and the ‘tanh’ activation function. The weight factors used in the loss function are . During the training procedure, this CSINet is reused for each cell to learn the downlink channels based on their uplink channels. Since the constructed CSINet and the proposed method are for a single cell and do not rely on signalling exchange with other cells, they can be deployed in a distributed way at each BS once the training is completed. The complexity for the online prediction is approximately .
In the multicell massive MIMO scenarios, the number of transmit antennas is one of the key factors that determines the users’ end performance, so we first evaluate the sum rate performance of all users versus different numbers of antennas in Fig. 6. As seen from Fig. 6, the proposed solution achieves a sum rate close to that of the WMMSE solution. The proposed solution outperforms all benchmark algorithms, while the performance comparison of each benchmark algorithm follows a similar trend as in the single-cell scenario. It is also noticed that the proposed solution based on the ZF loss function shows a close performance to the proposed solution based on the SLNR in the loss function, and the performance gap generally narrows as the number of antennas increases. It is also clear that the data-driven approaches are worse than the model-driven approaches, while the learned channel and beamforming solution is still the worst.
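The per-layer operation count used in the complexity estimates (input dimension times number of neurons for each fully connected layer) can be tallied with a few lines of code. The layer sizes in the example are hypothetical placeholders and are not the sizes used in the simulations.

```python
def dense_multiplications(layer_sizes):
    """Number of multiplications in a stack of fully connected layers.

    layer_sizes lists the input dimension followed by the number of neurons
    in each layer; each layer costs (inputs x neurons) multiplications.
    Bias additions and activation functions are ignored.
    """
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# Hypothetical example: 64 inputs, three hidden layers of 256 neurons, 64 outputs.
example_cost = dense_multiplications([64, 256, 256, 256, 64])
```

The count grows with the product of consecutive layer widths, so for wide networks the hidden layers dominate the online prediction cost.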
Next we investigate the sum rate performance versus different numbers of pilot symbols in Fig. 6, when and the uplink channels are estimated via the pilots. It can be seen that the performance of all algorithms improves as the number of pilots increases, but there is a significant gap from the WMMSE solution when . When , the proposed solutions with SLNR and ZF loss functions achieve close performance to the WMMSE solution and significantly outperform other benchmark schemes. This observation is expected as there are a total of seven users in the considered scenario. When the number of pilots is less than seven, there will be users with nonorthogonal pilots, which results in a deteriorated accuracy of the recovered uplink channel from the pilots.
VII-D Generalization
In this subsection, we study the generalization performance of the proposed joint training method in the multicell massive MIMO scenario when . The model to be tested is trained with a total transmit power dBm for the originally considered 7-cell scenario and no Doppler effect is considered. First, we evaluate the performance of the trained model under different levels of transmit power in Fig. 7. It is seen that the trained model still achieves a performance close to the WMMSE solution when the transmit power is near 10 dBm, but the performance gap increases as the testing power deviates further from this value. This shows the proposed method can generalize to scenarios where there is a small variation of the testing power.
Secondly, we consider the scenario where the system has different numbers of cells and the results are presented in Fig. 7. It is seen that both proposed solutions achieve a sum rate performance close to the WMMSE solution under different cell numbers, which confirms that the proposed distributed learning solution generalizes well as the number of cells increases. Finally, we consider the user mobility characterized by the Doppler effect in Fig. 8. Note that in practical systems, the Doppler effect may be estimated and compensated at the receiver side [38]. Here we assume the Doppler effect is not compensated so that it will influence the estimated CSI and the sum rate performance. We consider the maximum Doppler frequency to be in the range of 10 to 150 Hz, which corresponds to a speed range from 1 to 20 m/s. Fig. 8 shows the inference results of the trained model evaluated by considering different maximum Doppler frequencies. It is seen that the sum rate performance of the proposed solutions shows moderate degradation compared to the WMMSE solution as the Doppler frequency increases, but is still significantly higher than the benchmark solutions.
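The correspondence between the maximum Doppler frequency and the user speed follows from the standard relation f_d = v f_c / c. The sketch below evaluates it under the assumption that the 2.4 GHz downlink carrier is the relevant frequency; the function names are hypothetical.

```python
SPEED_OF_LIGHT = 299_792_458.0  # m/s

def max_doppler_hz(speed_mps, carrier_hz):
    """Maximum Doppler shift f_d = v * f_c / c."""
    return speed_mps * carrier_hz / SPEED_OF_LIGHT

def speed_from_doppler(doppler_hz, carrier_hz):
    """User speed (m/s) that produces the given maximum Doppler shift."""
    return doppler_hz * SPEED_OF_LIGHT / carrier_hz

# Assumed carrier: the 2.4 GHz downlink frequency.
low_speed = speed_from_doppler(10.0, 2.4e9)    # about 1.25 m/s
high_speed = speed_from_doppler(150.0, 2.4e9)  # about 18.7 m/s
```

Under this assumption the 10 to 150 Hz range maps to roughly 1.25 to 18.7 m/s, consistent with the approximate 1 to 20 m/s range quoted above.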
VIII Conclusions
In this paper, we have proposed a new downlink beamforming optimization algorithm to maximize the sum rate using deep learning when only the uplink channel information is available, but its mapping to the downlink channel is unknown. We introduced a model-driven learning approach by exploiting the structure of the optimal beamforming solution to facilitate an effective neural network design. The proposed approach was extended to massive MIMO and distributed multicell scenarios. Simulation results demonstrated that our proposed algorithm can approach the performance of the conventional WMMSE algorithm and achieve a much higher sum rate than the benchmark schemes. These results show the importance of model-driven and holistic learning approaches to optimize downlink beamforming in practical systems.
Appendix A
From (14), the MSE of channel estimation can be expressed as
(43)  
Because , where has zero mean and , (43) can be further written as
MSE  (44) 
Apparently the optimal should satisfy
(45) 
Then