Deep Learning for Massive MIMO CSI Feedback

by Chao-Kai Wen, et al.

In frequency division duplex mode, the downlink channel state information (CSI) should be conveyed to the base station through feedback links so that the potential gains of massive multiple-input multiple-output (MIMO) can be realized. However, the excessive feedback overhead remains a bottleneck in this regime. In this letter, we use deep learning technology to develop CsiNet, a novel CSI sensing and recovery network that learns to effectively use channel structure from training samples. In particular, CsiNet learns a transformation from CSI to a near-optimal number of representations (codewords) and an inverse transformation from codewords to CSI. Experiments demonstrate that CsiNet can recover CSI with significantly improved reconstruction quality compared with existing compressive sensing (CS)-based methods. Even at excessively low compression ratios where CS-based methods cannot work, CsiNet retains effective beamforming gain.





I. Introduction

The massive multiple-input multiple-output (MIMO) system is widely regarded as a major technology for fifth-generation wireless communication systems. By equipping a base station (BS) with hundreds or even thousands of antennas in a centralized [1] or distributed [2] manner, such a system can substantially reduce multiuser interference and provide a multifold increase in cell throughput. This potential benefit is mainly obtained by exploiting channel state information (CSI) at BSs. In current frequency division duplex (FDD) MIMO systems (e.g., long-term evolution Release-8), the downlink CSI is acquired at the user equipment (UE) during the training period and returned to the BS through feedback links. Vector quantization or codebook-based approaches are usually adopted to reduce feedback overhead. However, the feedback quantities resulting from these approaches scale linearly with the number of transmit antennas and are prohibitive in a massive MIMO regime.

The challenge of CSI feedback in massive MIMO systems has motivated numerous studies [3, 4]. These works have mainly focused on reducing feedback overhead by using the spatial and temporal correlation of CSI. In particular, correlated CSI can be transformed into an uncorrelated sparse vector in some bases; thus, one can use compressive sensing (CS) to obtain a sufficiently accurate estimate of a sparse vector from an underdetermined linear system. This concept has inspired the establishment of CSI feedback protocols based on CS [3] and distributed compressive channel estimation [4]. Several algorithms, including the LASSO $\ell_1$-solver [5] and AMP [6], have also been proposed for CS recovery. However, these algorithms [5, 6] struggle to recover compressive CSI because they use a simple sparsity prior, whereas the channel matrix is only approximately sparse. Moreover, the changes among most adjacent elements in the channel matrix are subtle. These properties make the prior difficult to model. Although researchers have designed advanced algorithms (e.g., TVAL3 [7] and BM3D-AMP [8]) that can impose elaborate priors on reconstruction, these algorithms do not significantly boost CSI recovery quality because hand-crafted priors still match practical channels poorly.

In summary, three central problems are inherent in CS-based methods. First, they rely heavily on the assumption that channels are sparse in some bases. However, channels are not exactly sparse in any basis and may not even have an interpretable structure. Second, CS uses random projection and does not fully exploit channel structures. Third, existing CS algorithms for signal reconstruction are often iterative, which makes reconstruction slow. In the present study, we address the above problems using deep learning (DL). DL attempts to mimic the human brain to accomplish a specific task by training large multilayered neural networks with vast numbers of training samples. Our developed CSI sensing (or encoder) and recovery (or decoder) network is hereafter called CsiNet. CsiNet has the following features.

  • Encoder. Rather than using random projection, CsiNet learns a transformation from original channel matrices to compressed representations (codewords) through training data. The algorithm is agnostic to human knowledge on channel distribution and instead directly learns to effectively use the channel structure from training data.

  • Decoder. CsiNet learns an inverse transformation from codewords to original channels. The inverse transformation is non-iterative and multiple orders of magnitude faster than iterative algorithms.

A UE uses the encoder to transform channel matrices into codewords. Once the codewords are returned to the BS, the BS recovers the original channel matrices by using the decoder. The methodology can be used in FDD MIMO systems as a feedback protocol. In fact, CsiNet is closely related to the autoencoder [9, Ch. 14] in DL, which is used to learn a representation (encoding) for a set of data, typically for dimensionality reduction. Recently, several DL architectures have been proposed to reconstruct natural images from CS measurements [10, 11, 12]. Although DL exhibits state-of-the-art performance in natural-image reconstruction, whether DL can also show its ability in wireless channel reconstruction is unclear because this reconstruction is more sophisticated than image reconstruction. The present work is the first to suggest a DL-based CSI reduction and recovery approach. (For an overview of applying DL to the wireless physical layer, we refer interested readers to [13].) The most relevant work appears to be [14], in which DL-based CSI encoding has been used in a closed-loop MIMO system. Different from [14], which has not considered CSI recovery, we show that CSI can be recovered with significantly improved reconstruction quality through DL compared with existing CS-based approaches. Even reconstructions at an excessively low compression rate retain sufficient content that allows effective beamforming gain.

II. System Model and CSI Feedback

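We consider a simple single-cell downlink massive MIMO system with $N_t \gg 1$ transmit antennas at a BS and a single receiver antenna at a UE. The system is operated in OFDM over $\tilde{N}_c$ subcarriers. The received signal at the $n$th subcarrier is provided as follows:

$$y_n = \tilde{\mathbf{h}}_n^H \mathbf{v}_n x_n + z_n, \tag{1}$$

where $\tilde{\mathbf{h}}_n \in \mathbb{C}^{N_t \times 1}$, $\mathbf{v}_n \in \mathbb{C}^{N_t \times 1}$, $x_n \in \mathbb{C}$, and $z_n \in \mathbb{C}$ denote the channel vector, precoding vector, data-bearing symbol, and additive noise of the $n$th subcarrier, respectively. Let $\tilde{\mathbf{H}} = [\tilde{\mathbf{h}}_1, \ldots, \tilde{\mathbf{h}}_{\tilde{N}_c}]^H \in \mathbb{C}^{\tilde{N}_c \times N_t}$ be the CSI stacked in the spatial-frequency domain. The BS can design the precoding vectors $\{\mathbf{v}_n, n = 1, \ldots, \tilde{N}_c\}$ once it receives feedback of $\tilde{\mathbf{H}}$. In the FDD system, the UE should return $\tilde{\mathbf{H}}$ to the BS through feedback links. The total number of feedback parameters is $N_t \tilde{N}_c$, which is prohibitive for limited feedback links. Although downlink channel estimation is challenging, this topic is beyond the scope of this paper. We assume that perfect CSI has been acquired through pilot-based training [15] and focus on the feedback scheme.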
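As an illustration of why the BS needs the downlink CSI, the per-subcarrier signal model can be sketched in NumPy. This is a minimal sketch that assumes the standard form $y_n = \tilde{\mathbf{h}}_n^H \mathbf{v}_n x_n + z_n$ and a unit-norm matched precoder; the function names and the antenna count are hypothetical and chosen only for the example.

```python
import numpy as np

def received_signal(h, v, x, z=0.0):
    """Received signal y = h^H v x + z on one subcarrier (single-antenna UE)."""
    # np.vdot conjugates its first argument, which gives the inner product h^H v.
    return np.vdot(h, v) * x + z

def matched_precoder(h_est):
    """Unit-norm precoding vector aligned with the channel estimate held at the BS."""
    return h_est / np.linalg.norm(h_est)

rng = np.random.default_rng(0)
n_t = 32  # number of BS antennas (illustrative value)
h = (rng.standard_normal(n_t) + 1j * rng.standard_normal(n_t)) / np.sqrt(2)

# With perfect CSI at the BS, the effective channel gain equals ||h||.
gain_perfect = abs(received_signal(h, matched_precoder(h), x=1.0))

# A precoder chosen without CSI (random direction) cannot exceed that gain.
v_random = matched_precoder(rng.standard_normal(n_t) + 1j * rng.standard_normal(n_t))
gain_random = abs(received_signal(h, v_random, x=1.0))
```

The gap between `gain_perfect` and `gain_random` is the beamforming gain that accurate feedback is meant to preserve.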

To reduce feedback overhead, we propose that $\tilde{\mathbf{H}}$ can be sparsified in the angular-delay domain using a 2D discrete Fourier transform (DFT) as follows:

$$\mathbf{H} = \mathbf{F}_d \tilde{\mathbf{H}} \mathbf{F}_a^H, \tag{2}$$

where $\mathbf{F}_d$ and $\mathbf{F}_a$ are $\tilde{N}_c \times \tilde{N}_c$ and $N_t \times N_t$ DFT matrices, respectively. To clarify this concept, a realization of the absolute values of $\mathbf{H}$ with the COST 2100 channel model [16] is depicted in Fig. 1(a). Parameterization is performed using a uniform linear array (ULA) with half-wavelength spacing in an indoor environment. Only a small fraction of the elements of $\mathbf{H}$ are large, and the other elements are close to zero. In the delay domain, only the first $N_c$ rows of $\mathbf{H}$ contain significant values because the time delay between multipath arrivals lies within a limited period. Therefore, we can retain the first $N_c$ rows of $\mathbf{H}$ and remove the remaining rows. By an abuse of notation, we continue to use $\mathbf{H}$ to denote the $N_c \times N_t$ truncated matrix. The total number of feedback parameters can be reduced to $N = 2 N_c N_t$, which remains a large number in the massive MIMO regime.
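The transform and truncation described above can be sketched in NumPy. This is a minimal illustration under stated assumptions: the DFT matrices are taken to be unitary, the transform is written as $\mathbf{F}_d \tilde{\mathbf{H}} \mathbf{F}_a^H$, and the helper names (`dft_matrix`, `to_angular_delay`, `to_spatial_frequency`) and the toy dimensions are hypothetical rather than taken from the original implementation.

```python
import numpy as np

def dft_matrix(n):
    """Unitary n x n DFT matrix."""
    return np.fft.fft(np.eye(n)) / np.sqrt(n)

def to_angular_delay(h_sf, n_keep):
    """Apply the 2D DFT to a (subcarriers x antennas) channel matrix and
    keep only the first n_keep delay rows."""
    n_c, n_t = h_sf.shape
    f_d, f_a = dft_matrix(n_c), dft_matrix(n_t)
    return (f_d @ h_sf @ f_a.conj().T)[:n_keep, :]

def to_spatial_frequency(h_trunc, n_c):
    """Zero-pad the truncated matrix back to n_c rows and invert the 2D DFT."""
    n_keep, n_t = h_trunc.shape
    padded = np.zeros((n_c, n_t), dtype=complex)
    padded[:n_keep, :] = h_trunc
    f_d, f_a = dft_matrix(n_c), dft_matrix(n_t)
    return f_d.conj().T @ padded @ f_a

# Toy channel whose angular-delay energy sits in the first 4 delay rows.
rng = np.random.default_rng(0)
n_c, n_t, n_keep = 64, 8, 8
taps = rng.standard_normal((4, n_t)) + 1j * rng.standard_normal((4, n_t))
h_sf = to_spatial_frequency(taps, n_c)       # spatial-frequency channel

h_trunc = to_angular_delay(h_sf, n_keep)     # 8 x 8 instead of 64 x 8
h_back = to_spatial_frequency(h_trunc, n_c)  # lossless here, since the
                                             # discarded rows were zero
```

For measured channels the discarded rows are only approximately zero, so the truncation introduces a small error that is accepted in exchange for the reduced dimension.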

In this study, we are interested in designing the encoder

$$\mathbf{s} = f_{\mathrm{en}}(\mathbf{H}), \tag{3}$$

which can transform the channel matrix into an $M$-dimensional vector (codeword), where $M < N$. The data compression ratio is $\gamma = M/N$. In addition, we have to design the inverse transformation (decoder) from the codeword to the original channel, that is,

$$\hat{\mathbf{H}} = f_{\mathrm{de}}(\mathbf{s}). \tag{4}$$

The CSI feedback approach is as follows. Once the channel matrix $\tilde{\mathbf{H}}$ is acquired at the UE side, we perform the 2D DFT in (2) to obtain the truncated matrix $\mathbf{H}$ and then use the encoder (3) to generate a codeword $\mathbf{s}$. Next, $\mathbf{s}$ is returned to the BS, and the BS uses the decoder (4) to obtain $\hat{\mathbf{H}}$. The final channel matrix in the spatial-frequency domain can be obtained by performing the inverse DFT.
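To make the relation between the compression ratio and the codeword size concrete, the short sketch below computes the codeword length for the four ratios reported in the experiments. It assumes that the uncompressed dimension counts real and imaginary parts separately, i.e., twice the number of retained matrix entries, and it uses a 32 x 32 truncated matrix as an illustrative size; the function name `codeword_length` is hypothetical.

```python
from fractions import Fraction

def codeword_length(n_rows, n_antennas, ratio):
    """Codeword dimension M for a compression ratio M / N, where
    N = 2 * n_rows * n_antennas (real and imaginary parts counted separately)."""
    n_real = 2 * n_rows * n_antennas
    m = Fraction(ratio) * n_real
    if m.denominator != 1:
        raise ValueError("compression ratio does not yield an integer codeword length")
    return int(m)

# Codeword lengths for an assumed 32 x 32 truncated channel matrix.
lengths = {
    str(ratio): codeword_length(32, 32, ratio)
    for ratio in (Fraction(1, 4), Fraction(1, 16), Fraction(1, 32), Fraction(1, 64))
}
print(lengths)
```

Under these assumptions a ratio of 1/64 leaves only 32 real numbers to describe 1,024 complex channel coefficients, which indicates how aggressive the lowest ratios are.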

Fig. 1: (a) Pseudo-color plot of the strength of $\mathbf{H}$. (b) Architecture of CsiNet, which includes the encoder and decoder.

III. CsiNet

We use the recent and popular convolutional neural networks (CNNs) for the encoder and decoder because they can exploit spatial local correlation by enforcing a local connectivity pattern among the neurons of adjacent layers. The overview of the proposed DL architecture, named CsiNet, is shown in Fig. 1(b), in which the values $S_1 \times S_2 \times S_3$ denote the length, width, and number of feature maps, respectively. The first layer of the encoder is a convolutional layer with the real and imaginary parts of $\mathbf{H}$ being its input. This layer uses kernels with dimensions of $3 \times 3$ to generate two feature maps. Following the convolutional layer, we reshape the feature maps into a vector and use a fully connected layer to generate the codeword $\mathbf{s}$, which is a real-valued vector of size $M$. The first two layers mimic the projection of CS and serve as an encoder. However, in contrast to random projections in CS, CsiNet attempts to translate the extracted feature maps into a codeword.

Once we obtain the codeword $\mathbf{s}$, we use several layers (as a decoder) to map it back into the channel matrix $\mathbf{H}$. The first layer of the decoder is a fully connected layer that takes $\mathbf{s}$ as input and outputs two matrices of size $N_c \times N_t$, which serve as an initial estimate of the real and imaginary parts of $\mathbf{H}$. The initial estimate is then fed into several “RefineNet units” that continuously refine the reconstruction. Each RefineNet unit consists of four layers, as shown in Fig. 1(b). In a RefineNet unit, the first layer is the input layer. The remaining three layers use $3 \times 3$ kernels. The second and third layers generate 8 and 16 feature maps, respectively, and the final layer generates the final reconstruction of $\mathbf{H}$. Using appropriate zero padding, the feature maps produced by the three convolutional layers are set to the same size as the input channel matrix, $N_c \times N_t$. The rectified linear unit (ReLU), $\mathrm{ReLU}(x) = \max(x, 0)$, is used as the activation function, and we introduce batch normalization to each layer.

Two features of a RefineNet unit are as follows. First, the output size of the RefineNet unit is equal to the channel matrix size. This concept is inspired by [10, 11]. To reduce dimensionality, nearly all conventional implementations of CNNs involve pooling layers, which are a form of down-sampling. In contrast to conventional implementations, our target is refinement rather than dimensionality reduction. Second, in the RefineNet unit, we introduce identity shortcut connections that directly pass data flow to later layers. This approach is inspired by the deep Residual Network [17, 12], which avoids the vanishing gradient problem caused by multiple stacked non-linear transformations.
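The two properties of a RefineNet unit (size-preserving convolutions and an identity shortcut) can be sketched in plain NumPy. This is an illustrative forward pass only, not the authors' implementation: batch normalization is omitted, the feature-map counts 2 → 8 → 16 → 2 and the placement of the shortcut before the last ReLU are assumptions, and all function names are hypothetical.

```python
import numpy as np

def conv2d_same(x, w, b):
    """Zero-padded 'same' convolution (cross-correlation, as in most DL libraries).

    x: (c_in, height, width); w: (c_out, c_in, k, k) with odd k; b: (c_out,).
    """
    c_out, _, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))  # zero padding keeps the size
    _, height, width = x.shape
    out = np.zeros((c_out, height, width))
    for i in range(k):
        for j in range(k):
            patch = xp[:, i:i + height, j:j + width]
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], patch)
    return out + b[:, None, None]

def relu(x):
    return np.maximum(x, 0.0)

def init_params(rng, channels=(2, 8, 16, 2), k=3, scale=0.1):
    """Random kernels and zero biases for the three convolutional layers."""
    return [(scale * rng.standard_normal((c_out, c_in, k, k)), np.zeros(c_out))
            for c_in, c_out in zip(channels[:-1], channels[1:])]

def refine_unit(x, params):
    """Three size-preserving conv layers plus an identity shortcut."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h = relu(conv2d_same(x, w1, b1))
    h = relu(conv2d_same(h, w2, b2))
    h = conv2d_same(h, w3, b3)
    return relu(x + h)  # the shortcut passes the input directly to the output

rng = np.random.default_rng(0)
x = rng.random((2, 32, 32))            # real and imaginary parts as two maps
y = refine_unit(x, init_params(rng))   # same shape as x: refinement, not reduction
```

Because no pooling is used, several such units can be stacked without changing the matrix size, and the shortcut means each unit only has to learn a correction to its input.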

Experiments reveal that two RefineNet units produce good performance. Adding further RefineNet units does not significantly boost reconstruction quality but adds to computational complexity. Once the channel matrix has been refined by a series of RefineNet units, it is input into the final convolutional layer, and the sigmoid function is used to scale values to the $[0, 1]$ range. CsiNet can be extended to deal with cases involving multiple antennas at the UE by increasing the number of feature maps. We leave the exploitation of the spatial correlation across UE antennas as a topic for future studies.

To train CsiNet, we use end-to-end learning for all the kernel and bias values of the encoder and decoder. This training procedure differs from the two-step approach used in [12]. The set of parameters is denoted as $\Theta = \{\Theta_{\mathrm{en}}, \Theta_{\mathrm{de}}\}$. The input to CsiNet is $\mathbf{H}_i$, and the reconstructed channel matrix is denoted by $\hat{\mathbf{H}}_i = f(\mathbf{H}_i; \Theta) \triangleq f_{\mathrm{de}}(f_{\mathrm{en}}(\mathbf{H}_i; \Theta_{\mathrm{en}}); \Theta_{\mathrm{de}})$ for the $i$th patch. Notably, the input and output of CsiNet are normalized channel matrices, whose elements are scaled to the $[0, 1]$ range. Similar to the autoencoder, CsiNet is an unsupervised learning algorithm. The set of parameters is updated by the ADAM algorithm. The loss function is the mean squared error (MSE), which is calculated as follows:

$$L(\Theta) = \frac{1}{T} \sum_{i=1}^{T} \left\| f(\mathbf{H}_i; \Theta) - \mathbf{H}_i \right\|_2^2, \tag{5}$$

where $\|\cdot\|_2$ is the Euclidean norm, and $T$ is the total number of samples in the training set.
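The training objective can be written in a few lines of NumPy. The sketch below assumes that each sample is stored as a real array with the real and imaginary parts stacked along the second axis, and that the loss is the squared Euclidean error per sample averaged over the batch; the array layout and the function names are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    """Logistic function; maps any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def mse_loss(h_hat, h):
    """Average over T samples of the squared Euclidean reconstruction error.

    h_hat, h: real arrays of shape (T, 2, n_rows, n_antennas).
    """
    diff = (h_hat - h).reshape(h.shape[0], -1)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

rng = np.random.default_rng(0)
h = rng.random((4, 2, 32, 32))                 # normalized targets in [0, 1)
logits = rng.standard_normal((4, 2, 32, 32))   # stand-in for the last layer's output
h_hat = sigmoid(logits)                        # network output, also within (0, 1)

loss = mse_loss(h_hat, h)
```

Since the input is also the target, no labels are needed, which is why the procedure is described as unsupervised.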

IV. Experiments

To generate the training and testing samples, we create two types of channel matrices through the COST 2100 channel model [16]: 1) the indoor picocellular scenario at the 5.3 GHz band, and 2) the outdoor rural scenario at the 300 MHz band. All parameters follow their default setting in [16]. The BS is positioned at the center of a square area with lengths of 20 and 400 m for indoor and outdoor scenarios, respectively, whereas the UEs are randomly positioned in the square area per sample. We use the ULA with $N_t = 32$ antennas at the BS and $\tilde{N}_c = 1024$ subcarriers. When transforming the channel matrix into the angular-delay domain, we retain the first 32 rows of the channel matrix. That is, $\mathbf{H}$ is $32 \times 32$ in size. The training, validation, and testing sets contain 100,000, 30,000, and 20,000 samples, respectively. All testing samples are excluded from the training and validation samples. We train several parameter sets with Glorot uniform initialization and then select the parameter set that provides minimal loss on the validation set. The epochs, learning rate, and batch size are set as 1000, 0.001, and 200, respectively.
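The initialization and model-selection steps mentioned above can be sketched as follows. Glorot uniform initialization draws weights from a uniform distribution on $[-l, l]$ with $l = \sqrt{6 / (\text{fan}_{\text{in}} + \text{fan}_{\text{out}})}$; the layer sizes, the helper names, and the stand-in validation loss below are hypothetical and serve only to illustrate the procedure.

```python
import numpy as np

def glorot_uniform(rng, fan_in, fan_out):
    """Sample a (fan_in, fan_out) weight matrix from U(-limit, limit),
    with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def select_best(candidates, validation_loss):
    """Return the candidate parameter set with the smallest validation loss."""
    return min(candidates, key=validation_loss)

rng = np.random.default_rng(0)

# Several independently initialized weight matrices for one dense layer
# (256 inputs, 64 outputs; sizes chosen only for the example).
candidates = [glorot_uniform(rng, 256, 64) for _ in range(3)]

def toy_validation_loss(w):
    """Stand-in for 'train this parameter set, then evaluate on the validation set'."""
    return float(np.mean(w ** 2))

best = select_best(candidates, toy_validation_loss)
```

In an actual experiment each candidate would be trained to convergence before its validation loss is measured; the selection step itself stays the same.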

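| $\gamma$ | Method | Indoor NMSE (dB) | Indoor $\rho$ | Outdoor NMSE (dB) | Outdoor $\rho$ |
|---|---|---|---|---|---|
| 1/4 | LASSO | -7.59 | 0.91 | -5.08 | 0.82 |
| 1/4 | BM3D-AMP | -4.33 | 0.80 | -1.33 | 0.52 |
| 1/4 | TVAL3 | -14.87 | 0.97 | -6.90 | 0.88 |
| 1/4 | CS-CsiNet | -11.82 | 0.96 | -6.69 | 0.87 |
| 1/4 | CsiNet | -17.36 | 0.99 | -8.75 | 0.91 |
| 1/16 | LASSO | -2.72 | 0.70 | -1.01 | 0.46 |
| 1/16 | BM3D-AMP | 0.26 | 0.16 | 0.55 | 0.11 |
| 1/16 | TVAL3 | -2.61 | 0.66 | -0.43 | 0.45 |
| 1/16 | CS-CsiNet | -6.09 | 0.87 | -2.51 | 0.66 |
| 1/16 | CsiNet | -8.65 | 0.93 | -4.51 | 0.79 |
| 1/32 | LASSO | -1.03 | 0.48 | -0.24 | 0.27 |
| 1/32 | BM3D-AMP | 24.72 | 0.04 | 22.66 | 0.04 |
| 1/32 | TVAL3 | -0.27 | 0.33 | 0.46 | 0.28 |
| 1/32 | CS-CsiNet | -4.67 | 0.83 | -0.52 | 0.37 |
| 1/32 | CsiNet | -6.24 | 0.89 | -2.81 | 0.67 |
| 1/64 | LASSO | -0.14 | 0.22 | -0.06 | 0.12 |
| 1/64 | BM3D-AMP | 0.22 | 0.04 | 25.45 | 0.03 |
| 1/64 | TVAL3 | 0.63 | 0.11 | 0.76 | 0.19 |
| 1/64 | CS-CsiNet | -2.46 | 0.68 | -0.22 | 0.28 |
| 1/64 | CsiNet | -5.84 | 0.87 | -1.93 | 0.59 |

TABLE I: NMSE in dB and cosine similarity $\rho$.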

Fig. 2: Reconstruction images for different compression ratios by different algorithms in indoor picocellular scenarios.

We compare CsiNet with three state-of-the-art CS-based methods, namely, the LASSO $\ell_1$-solver [5], TVAL3 [7], and BM3D-AMP [8]. In all experiments, we assume that the optimal regularization parameter of LASSO is given by an oracle. Among these algorithms, LASSO provides the baseline result of the CS problem by considering only the simplest sparsity prior. TVAL3 is a remarkably fast total variation-based recovery algorithm that considers increasingly elaborate priors. BM3D-AMP is the most accurate compressive recovery algorithm in natural-image reconstruction. We also provide the corresponding results for CS-CsiNet, which only learns to recover CSI from CS measurements (or random linear measurements). The architecture of CS-CsiNet is identical to that of the decoder of CsiNet.

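The difference between the recovered channel $\hat{\mathbf{H}}$ and the original $\mathbf{H}$ is quantified by the normalized MSE (NMSE), which is defined as follows:

$$\mathrm{NMSE} = \mathbb{E}\left\{ \frac{\|\mathbf{H} - \hat{\mathbf{H}}\|_2^2}{\|\mathbf{H}\|_2^2} \right\}. \tag{6}$$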
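The feedback CSI serves as a beamforming vector. Let $\hat{\mathbf{h}}_n$ be the reconstructed channel vector of the $n$th subcarrier. If $\mathbf{v}_n = \hat{\mathbf{h}}_n / \|\hat{\mathbf{h}}_n\|_2$ is used as a beamforming vector, then we achieve the equivalent channel $\tilde{\mathbf{h}}_n^H \hat{\mathbf{h}}_n / \|\hat{\mathbf{h}}_n\|_2$ at the UE side. To measure the quality of the beamforming vector, we also consider the cosine similarity

$$\rho = \mathbb{E}\left\{ \frac{1}{\tilde{N}_c} \sum_{n=1}^{\tilde{N}_c} \frac{|\tilde{\mathbf{h}}_n^H \hat{\mathbf{h}}_n|}{\|\tilde{\mathbf{h}}_n\|_2 \, \|\hat{\mathbf{h}}_n\|_2} \right\}. \tag{7}$$

Notably, when evaluating NMSE and $\rho$, we recover the output of CsiNet (i.e., the normalized channel matrix) back to its original level.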
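Both metrics are straightforward to compute once the reconstructed channels are available. The following NumPy sketch assumes complex arrays with one sample per leading index and, for the cosine similarity, one channel vector per subcarrier row; the expectation is replaced by a sample mean, and the function names are hypothetical.

```python
import numpy as np

def nmse_db(h, h_hat):
    """Sample-mean NMSE in dB. h, h_hat: complex arrays of shape (T, rows, cols)."""
    err = np.sum(np.abs(h - h_hat) ** 2, axis=(1, 2))
    power = np.sum(np.abs(h) ** 2, axis=(1, 2))
    return float(10.0 * np.log10(np.mean(err / power)))

def cosine_similarity(h_sf, h_sf_hat):
    """Mean over samples and subcarriers of |<h_hat, h>| / (||h_hat|| ||h||).

    h_sf, h_sf_hat: complex arrays of shape (T, n_subcarriers, n_antennas).
    """
    inner = np.abs(np.sum(np.conj(h_sf_hat) * h_sf, axis=-1))
    norms = np.linalg.norm(h_sf_hat, axis=-1) * np.linalg.norm(h_sf, axis=-1)
    return float(np.mean(inner / norms))

rng = np.random.default_rng(0)
h = rng.standard_normal((5, 8, 4)) + 1j * rng.standard_normal((5, 8, 4))

# A reconstruction with the right direction but half the amplitude:
print(nmse_db(h, 0.5 * h))            # 10*log10(0.25), about -6.02 dB
print(cosine_similarity(h, 0.5 * h))  # 1.0 up to rounding
```

The example shows why both metrics are reported: the cosine similarity ignores amplitude and phase scaling, which do not affect the beamforming direction, whereas NMSE penalizes them.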

The corresponding NMSE and $\rho$ of all the concerned methods are summarized in Table I, with the best results presented in bold font. CsiNet obtains the lowest NMSE values and significantly outperforms CS-based methods at all compression ratios. Compared with CS-CsiNet, CsiNet also provides significant gains, which are due to the sophisticated DL architecture in the encoder and decoder. When the compression ratio is reduced to 1/16, the CS-based methods can no longer function, whereas CsiNet and CS-CsiNet continue to perform well. Fig. 2 shows some reconstruction samples at different compression ratios along with the corresponding pseudo-gray plots of the strength of $\mathbf{H}$. CsiNet clearly outperforms the other algorithms.

Furthermore, CSI recovery through CsiNet can be executed with lower overhead than that through CS-based algorithms because CsiNet requires only several layers of simple matrix-vector multiplications. Specifically, the average running times (in seconds) of LASSO, BM3D-AMP, TVAL3, and CsiNet are 0.1828, 0.5717, 0.3155, and 0.0035, respectively. CsiNet performs approximately 52 to 163 times faster than CS-based methods.

Finally, we provide some other observations. First, the DFT matrix $\mathbf{F}_a$ that is used to transform $\tilde{\mathbf{H}}$ from the spatial domain into the angular domain is unnecessary. Table II shows the NMSE and $\rho$ results using CsiNet with $\mathbf{F}_a$ compared to CsiNet without $\mathbf{F}_a$. CsiNet can also exhibit good performance without employing $\mathbf{F}_a$ when all layers are retrained. This finding demonstrates that CsiNet can learn a proper basis by itself without preprocessing the channel matrix into the angular domain and thus implies that CsiNet can be applied in other antenna configurations. Second, angular (or spatial) resolution increases with the number of antennas at the BS. The corresponding NMSE and $\rho$ of all the concerned methods when $N_t = 16$, $32$, and $48$ are summarized in Table III. The reconstruction performances of all the algorithms improve because $\mathbf{H}$ becomes sparser. CsiNet improves the most because it is more capable of exploiting subtle changes among adjacent elements than CS-based methods.

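| $\gamma$ | Domain | Indoor NMSE (dB) | Indoor $\rho$ | Outdoor NMSE (dB) | Outdoor $\rho$ |
|---|---|---|---|---|---|
| 1/4 | Spatial (without $\mathbf{F}_a$) | -24.57 | 1.00 | -9.42 | 0.92 |
| 1/4 | Angular (with $\mathbf{F}_a$) | -17.36 | 0.99 | -8.75 | 0.91 |
| 1/16 | Spatial (without $\mathbf{F}_a$) | -9.20 | 0.94 | -4.14 | 0.77 |
| 1/16 | Angular (with $\mathbf{F}_a$) | -8.65 | 0.93 | -4.51 | 0.79 |
| 1/32 | Spatial (without $\mathbf{F}_a$) | -8.77 | 0.93 | -2.96 | 0.69 |
| 1/32 | Angular (with $\mathbf{F}_a$) | -6.24 | 0.89 | -2.81 | 0.67 |
| 1/64 | Spatial (without $\mathbf{F}_a$) | -5.83 | 0.86 | -1.78 | 0.56 |
| 1/64 | Angular (with $\mathbf{F}_a$) | -5.84 | 0.87 | -1.93 | 0.59 |

TABLE II: The comparison of the spatial domain and angular domain.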
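
| $\gamma$ | Method | $N_t = 16$ NMSE (dB) | $N_t = 16$ $\rho$ | $N_t = 32$ NMSE (dB) | $N_t = 32$ $\rho$ | $N_t = 48$ NMSE (dB) | $N_t = 48$ $\rho$ |
|---|---|---|---|---|---|---|---|
| 1/4 | LASSO | -4.55 | 0.80 | -5.08 | 0.82 | -5.28 | 0.83 |
| 1/4 | BM3D-AMP | -1.06 | 0.47 | -1.33 | 0.52 | -1.61 | 0.62 |
| 1/4 | TVAL3 | -3.87 | 0.77 | -6.90 | 0.88 | -6.09 | 0.85 |
| 1/4 | CsiNet | -6.13 | 0.85 | -8.75 | 0.91 | -12.38 | 0.94 |
| 1/16 | LASSO | -0.65 | 0.44 | -1.01 | 0.46 | -1.23 | 0.51 |
| 1/16 | BM3D-AMP | 1.92 | 0.27 | 0.55 | 0.11 | 0.35 | 0.23 |
| 1/16 | TVAL3 | 0.03 | 0.40 | -0.43 | 0.45 | -0.79 | 0.50 |
| 1/16 | CsiNet | -3.44 | 0.74 | -3.34 | 0.72 | -5.54 | 0.83 |
| 1/32 | LASSO | -0.13 | 0.27 | -0.24 | 0.27 | -0.38 | 0.34 |
| 1/32 | BM3D-AMP | 21.53 | 0.23 | 22.66 | 0.04 | 23.64 | 0.13 |
| 1/32 | TVAL3 | 0.65 | 0.28 | 0.46 | 0.28 | 0.28 | 0.31 |
| 1/32 | CsiNet | -2.30 | 0.65 | -2.81 | 0.67 | -3.76 | 0.74 |
| 1/64 | LASSO | -0.06 | 0.12 | -0.06 | 0.12 | -0.057 | 0.16 |
| 1/64 | BM3D-AMP | 23.26 | 0.04 | 25.45 | 0.03 | 26.78 | 0.13 |
| 1/64 | TVAL3 | 1.02 | 0.23 | 0.76 | 0.19 | 0.72 | 0.18 |
| 1/64 | CsiNet | -1.24 | 0.48 | -1.93 | 0.58 | -2.74 | 0.67 |

TABLE III: NMSE (dB) and $\rho$ for different $N_t$ in outdoor rural scenarios.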

V. Conclusion

We used DL to develop CsiNet, a novel CSI sensing and recovery mechanism. CsiNet performed well at low compression ratios and reduced time complexity. We believe that its reconstruction quality can be further improved by applying advanced DL techniques, and we hope this study encourages future research in this direction.