I. Introduction
The massive multiple-input multiple-output (MIMO) system is widely regarded as a major technology for fifth-generation wireless communication systems. By equipping a base station (BS) with hundreds or even thousands of antennas in a centralized [1] or distributed [2] manner, such a system can substantially reduce multiuser interference and provide a multifold increase in cell throughput. This potential benefit is mainly obtained by exploiting channel state information (CSI) at the BSs. In current frequency division duplexing (FDD) MIMO systems (e.g., long-term evolution Release-8), the downlink CSI is acquired at the user equipment (UE) during the training period and returned to the BS through feedback links. Vector quantization or codebook-based approaches are usually adopted to reduce feedback overhead. However, the feedback quantities resulting from these approaches must scale linearly with the number of transmit antennas and are prohibitive in the massive MIMO regime.
The challenge of CSI feedback in massive MIMO systems has motivated numerous studies [3, 4]. These works have mainly focused on reducing feedback overhead by exploiting the spatial and temporal correlation of CSI. In particular, correlated CSI can be transformed into an uncorrelated sparse vector in some bases; one can then use compressive sensing (CS) to obtain a sufficiently accurate estimate of a sparse vector from an underdetermined linear system. This concept has inspired CS-based CSI feedback protocols [3] and distributed compressive channel estimation [4]. Several CS recovery algorithms, including the LASSO solver [5] and AMP [6], have also been proposed. However, these algorithms [5, 6] struggle to recover compressive CSI because they use a simple sparsity prior, whereas the channel matrix is only approximately, not perfectly, sparse. Moreover, the changes among most adjacent elements of the channel matrix are subtle, which complicates modeling their priors. Although researchers have designed advanced algorithms (e.g., TVAL3 [7] and BM3D-AMP [8]) that can impose elaborate priors on the reconstruction, these algorithms do not significantly boost CSI recovery quality because hand-crafted priors remain far from practice.
In summary, three central problems are inherent in CS-based methods. First, they rely heavily on the assumption that channels are sparse in some basis. However, channels are not exactly sparse in any basis and may not even have an interpretable structure. Second, CS uses random projection and does not fully exploit channel structures. Third, existing CS algorithms for signal reconstruction are often iterative and therefore slow.
In the present study, we address the above problems using deep learning (DL). DL attempts to mimic the human brain in accomplishing a specific task by training large multilayered neural networks with vast numbers of training samples. Our developed CSI sensing (or encoder) and recovery (or decoder) network is hereafter called CsiNet. CsiNet has the following features.

Encoder. Rather than using random projection, CsiNet learns a transformation from the original channel matrices to compressed representations (codewords) through training data. The algorithm is agnostic to human knowledge of channel distributions and instead learns to effectively use the channel structure directly from training data.

Decoder. CsiNet learns the inverse transformation from codewords to the original channels. The inverse transformation is non-iterative and multiple orders of magnitude faster than iterative algorithms.
A UE uses the encoder to transform channel matrices into codewords. Once the codewords are returned to the BS, the BS recovers the original channel matrices by using the decoder. The methodology can be used in FDD MIMO systems as a feedback protocol. In fact, CsiNet is closely related to the autoencoder [9, Ch. 14] in DL, which is used to learn a representation (encoding) for a set of data, typically for dimensionality reduction. Recently, several DL architectures have been proposed to reconstruct natural images from CS measurements [10, 11, 12]. Although DL exhibits state-of-the-art performance in natural-image reconstruction, whether DL can also demonstrate this ability in wireless channel reconstruction is unclear because such reconstruction is more sophisticated than image reconstruction. The present work is the first to suggest a DL-based CSI reduction and recovery approach.¹ The most relevant work appears to be [14], in which DL-based CSI encoding has been used in a closed-loop MIMO system. Different from [14], which has not considered CSI recovery, we show that CSI can be recovered with significantly improved reconstruction quality through DL compared with existing CS-based approaches. Even reconstructions at an excessively low compression rate retain sufficient content to allow effective beamforming gain.
¹For an overview of applying DL to the wireless physical layer, we refer the interested reader to [13].
II. System Model and CSI Feedback
We consider a simple single-cell downlink massive MIMO system with $N_t$ transmit antennas at the BS and a single receiver antenna at the UE. The system is operated in OFDM over $\tilde{N}_c$ subcarriers. The received signal at the $n$th subcarrier is given as follows:
$y_n = \tilde{\mathbf{h}}_n^H \mathbf{v}_n x_n + z_n,$   (1)
where $\tilde{\mathbf{h}}_n \in \mathbb{C}^{N_t}$, $\mathbf{v}_n \in \mathbb{C}^{N_t}$, $x_n$, and $z_n$ denote the channel vector, precoding vector, data-bearing symbol, and additive noise of the $n$th subcarrier, respectively. Let $\tilde{\mathbf{H}} = [\tilde{\mathbf{h}}_1, \ldots, \tilde{\mathbf{h}}_{\tilde{N}_c}]^H \in \mathbb{C}^{\tilde{N}_c \times N_t}$ be the CSI stacked in the spatial-frequency domain. The BS can design the precoding vectors $\{\mathbf{v}_n\}$ once it receives the feedback $\tilde{\mathbf{H}}$. In the FDD system, the UE should return $\tilde{\mathbf{H}}$ to the BS through feedback links. The total number of feedback parameters is $2 \tilde{N}_c N_t$ (counting real and imaginary parts), which is prohibitive for limited feedback links. Although downlink channel estimation is challenging, this topic is beyond the scope of this paper. We assume that perfect CSI has been acquired through pilot-based training [15] and focus on the feedback scheme.
To reduce feedback overhead, we propose that $\tilde{\mathbf{H}}$ can be sparsified in the angular-delay domain using a 2D discrete Fourier transform (DFT) as follows:
$\mathbf{H} = \mathbf{F}_d \tilde{\mathbf{H}} \mathbf{F}_a^H,$   (2)
where $\mathbf{F}_d$ and $\mathbf{F}_a$ are $\tilde{N}_c \times \tilde{N}_c$ and $N_t \times N_t$ DFT matrices, respectively. To clarify this concept, a realization of the absolute values of $\mathbf{H}$ under the COST 2100 channel model [16] is depicted in Fig. 1(a). Parameterization is performed using a uniform linear array (ULA) with half-wavelength spacing in an indoor environment. The elements of $\mathbf{H}$ contain only a small fraction of large components, and the other components are close to zero. In the delay domain, only the first $N_c$ rows of $\mathbf{H}$ contain distinctly non-zero values because the time delay between multipath arrivals lies within a limited period. Therefore, we can retain the first $N_c$ rows of $\mathbf{H}$ and remove the remaining rows. With a slight abuse of notation, we continue to use $\mathbf{H}$ to denote the $N_c \times N_t$ truncated matrix. The total number of feedback parameters can thereby be reduced to $N = 2 N_c N_t$ (real and imaginary parts), which remains large in the massive MIMO regime.
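As a concrete sketch of the transform in (2) and the delay-domain truncation, the following NumPy snippet builds unitary DFT matrices and keeps the leading delay rows. The matrix sizes and function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def dft_matrix(n):
    """Unitary n x n DFT matrix."""
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n) / np.sqrt(n)

def to_angular_delay(H_sf, n_keep):
    """Map spatial-frequency CSI (subcarriers x antennas) to the
    angular-delay domain via H = F_d @ H_sf @ F_a^H, as in Eq. (2),
    then keep the first n_keep delay rows (where the multipath
    energy concentrates)."""
    n_sub, n_tx = H_sf.shape
    F_d = dft_matrix(n_sub)          # delay-domain DFT
    F_a = dft_matrix(n_tx)           # angular-domain DFT
    H_ad = F_d @ H_sf @ F_a.conj().T
    return H_ad[:n_keep, :]

def to_spatial_frequency(H_ad, n_sub):
    """Inverse step: zero-fill the truncated delay rows and invert
    the unitary DFTs to recover the spatial-frequency CSI."""
    n_keep, n_tx = H_ad.shape
    H_full = np.zeros((n_sub, n_tx), dtype=complex)
    H_full[:n_keep, :] = H_ad
    F_d = dft_matrix(n_sub)
    F_a = dft_matrix(n_tx)
    return F_d.conj().T @ H_full @ F_a
```

Because the DFT matrices are unitary, the round trip is exact when no rows are truncated; truncation discards only the near-zero delay taps described above.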
In this study, we are interested in designing the encoder
$\mathbf{s} = f_{\mathrm{en}}(\mathbf{H}),$   (3)
which can transform the channel matrix into an $M$-dimensional vector (codeword), where $M < N$. The data compression ratio is $\gamma = M/N$. In addition, we have to design the inverse transformation (decoder) from the codeword to the original channel, that is,
$\hat{\mathbf{H}} = f_{\mathrm{de}}(\mathbf{s}).$   (4)
The CSI feedback approach is as follows. Once the channel matrix $\tilde{\mathbf{H}}$ is acquired at the UE side, we perform the 2D DFT in (2) to obtain the truncated matrix $\mathbf{H}$ and then use the encoder (3) to generate a codeword $\mathbf{s}$. Next, $\mathbf{s}$ is returned to the BS, and the BS uses the decoder (4) to obtain $\hat{\mathbf{H}}$. The final channel matrix in the spatial-frequency domain can be obtained by performing the inverse DFT.
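The data flow of this feedback protocol can be sketched as below. A random linear projection and its pseudo-inverse stand in as placeholders for the encoder (3) and decoder (4) purely to show the round trip and the compression ratio; CsiNet itself, described in the next section, replaces both. All sizes are illustrative assumptions.

```python
import numpy as np

# Illustrative sizes: a truncated Nc x Nt angular-delay matrix and
# a codeword of length M (gamma = M / N = 1/4 here).
Nc, Nt = 32, 32
N = 2 * Nc * Nt                            # real-valued parameters
M = N // 4                                 # codeword length

rng = np.random.default_rng(0)
A = rng.normal(size=(M, N)) / np.sqrt(M)   # placeholder linear encoder

def encode(H):
    """Placeholder for s = f_en(H), Eq. (3): stack real/imag parts
    and project down to an M-dimensional codeword."""
    x = np.concatenate([H.real.ravel(), H.imag.ravel()])
    return A @ x

def decode(s):
    """Placeholder for H_hat = f_de(s), Eq. (4): minimum-norm
    linear estimate via the pseudo-inverse."""
    x = np.linalg.pinv(A) @ s
    re, im = x[:Nc * Nt], x[Nc * Nt:]
    return re.reshape(Nc, Nt) + 1j * im.reshape(Nc, Nt)

# UE side: encode and "feed back" the codeword; BS side: decode.
H = rng.normal(size=(Nc, Nt)) + 1j * rng.normal(size=(Nc, Nt))
s = encode(H)
H_hat = decode(s)
```

Note that the minimum-norm decoder cannot exploit any channel structure, which is exactly the gap that the learned decoder is designed to fill.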
III. CsiNet
We exploit the recent and popular convolutional neural networks (CNNs) for the encoder and decoder because they can exploit spatial local correlation by enforcing a local connectivity pattern among the neurons of adjacent layers. An overview of the proposed DL architecture, named CsiNet, is shown in Fig. 1(b), in which the values denote the length, width, and number of feature maps, respectively. The first layer of the encoder is a convolutional layer with the real and imaginary parts of $\mathbf{H}$ as its input. This layer uses $3 \times 3$ kernels to generate two feature maps. Following the convolutional layer, we reshape the feature maps into a vector and use a fully connected layer to generate the codeword $\mathbf{s}$, which is a real-valued vector of size $M$. The first two layers mimic the projection of CS and serve as an encoder. However, in contrast to the random projection of CS, CsiNet attempts to translate the extracted feature maps into a codeword.
Once the codeword $\mathbf{s}$ is obtained, we use several layers (as a decoder) to map it back to the channel matrix $\mathbf{H}$. The first layer of the decoder is a fully connected layer that takes $\mathbf{s}$ as input and outputs two matrices of size $N_c \times N_t$, which serve as initial estimates of the real and imaginary parts of $\mathbf{H}$. The initial estimate is then fed into several "RefineNet units" that progressively refine the reconstruction. Each RefineNet unit consists of four layers, as shown in Fig. 1(b). The first layer of a RefineNet unit is the input layer; the remaining three layers all use $3 \times 3$ kernels. The second and third layers generate 8 and 16 feature maps, respectively, and the final layer generates the final reconstruction of $\mathbf{H}$. With appropriate zero padding, the feature maps produced by the three convolutional layers retain the same size as the input channel matrix, $N_c \times N_t$. The rectified linear unit, $\mathrm{ReLU}(x) = \max(x, 0)$, is used as the activation function, and batch normalization is applied to each layer.
Two features of a RefineNet unit are as follows. First, the output size of the RefineNet unit is equal to the channel matrix size. This concept is inspired by [10, 11]. To reduce dimensionality, nearly all conventional implementations of CNNs involve pooling layers, which are a form of downsampling. In contrast to conventional implementations, our target is refinement rather than dimensionality reduction. Second, in the RefineNet unit, we introduce identity shortcut connections that directly pass data flow to later layers. This approach is inspired by the deep residual network [17, 12] and avoids the vanishing gradient problem caused by multiple stacked nonlinear transformations.
Experiments reveal that two RefineNet units produce good performance; adding further RefineNet units does not significantly boost reconstruction quality but adds computational complexity. Once the channel matrix has been refined by the series of RefineNet units, it is input into a final convolutional layer, and the sigmoid function is used to scale values to the $[0, 1]$ range. CsiNet can be extended to cases involving multiple antennas at the UE by increasing the number of input and output feature maps. We leave the exploitation of the spatial correlation across UE antennas as a topic for future studies.
To train CsiNet, we use end-to-end learning for all the kernel and bias values of the encoder and decoder. This training procedure differs from the two-step approach used in [12]. The set of parameters is denoted as $\Theta$. The input to CsiNet is $\mathbf{H}_i$, and the reconstructed channel matrix is denoted by $\hat{\mathbf{H}}_i = f(\mathbf{s}_i; \Theta)$ for the $i$th sample. Notably, the input and output of CsiNet are normalized channel matrices, whose elements are scaled into the $[0, 1]$ range. Similar to the autoencoder, CsiNet is an unsupervised learning algorithm. The set of parameters is updated by the ADAM algorithm. The loss function is the mean squared error (MSE), which is calculated as follows:
$L(\Theta) = \frac{1}{T} \sum_{i=1}^{T} \left\| f(\mathbf{s}_i; \Theta) - \mathbf{H}_i \right\|_2^2,$   (5)
where $\|\cdot\|_2$ is the Euclidean norm, and $T$ is the total number of samples in the training set.
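To make the architecture concrete, the following NumPy sketch implements one forward pass of a CsiNet-like network with randomly initialized weights: a conv-plus-dense encoder, a dense initial estimate, two residual RefineNet-style units, and a sigmoid output layer. The sizes, the omission of bias terms and batch normalization, and the weight scales are simplifying assumptions for illustration; this is not the trained model, and training would minimize the MSE loss in (5) with ADAM.

```python
import numpy as np

def conv2d(x, w):
    """'Same'-padded 2-D convolution; x: (H, W, Cin), w: (k, k, Cin, Cout)."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.empty(x.shape[:2] + (w.shape[3],))
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.tensordot(xp[i:i + k, j:j + k, :], w, axes=3)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

Nc, Nt, M = 32, 32, 512            # assumed sizes; N = 2*Nc*Nt, gamma = 1/4
rng = np.random.default_rng(0)
g = lambda *shape: rng.normal(scale=0.05, size=shape)

# Encoder: 3x3 conv (2 feature maps) -> flatten -> dense codeword.
w_conv = g(3, 3, 2, 2)
W_en = g(M, 2 * Nc * Nt)

# Decoder: dense initial estimate -> two RefineNet units -> final conv.
W_de = g(2 * Nc * Nt, M)
units = [dict(w1=g(3, 3, 2, 8), w2=g(3, 3, 8, 16), w3=g(3, 3, 16, 2))
         for _ in range(2)]
w_out = g(3, 3, 2, 2)

def refinenet(x, u):
    """Conv stack (8 -> 16 -> 2 feature maps) with an identity shortcut."""
    y = relu(conv2d(x, u["w1"]))
    y = relu(conv2d(y, u["w2"]))
    y = conv2d(y, u["w3"])
    return relu(x + y)             # residual connection

def csinet(H):
    """H: (Nc, Nt, 2) real/imag planes scaled to [0, 1]."""
    s = W_en @ relu(conv2d(H, w_conv)).ravel()    # codeword, Eq. (3)
    x = (W_de @ s).reshape(Nc, Nt, 2)             # initial estimate
    for u in units:
        x = refinenet(x, u)
    return sigmoid(conv2d(x, w_out)), s           # reconstruction, Eq. (4)
```

The identity shortcut in `refinenet` is what keeps each unit a refinement of its input rather than a fresh reconstruction, mirroring the residual design described above.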
IV. Experiments
To generate the training and testing samples, we create two types of channel matrices through the COST 2100 channel model [16]: 1) the indoor picocellular scenario at the 5.3 GHz band, and 2) the outdoor rural scenario at the 300 MHz band. All parameters follow their default settings in [16]. The BS is positioned at the center of a square area with side lengths of 20 and 400 m for the indoor and outdoor scenarios, respectively, whereas the UEs are randomly positioned within the square area for each sample. We use a ULA with $N_t = 32$ antennas at the BS and $\tilde{N}_c = 1024$ subcarriers. When transforming the channel matrix into the angular-delay domain, we retain the first $N_c = 32$ rows of the channel matrix; that is, $\mathbf{H}$ is $32 \times 32$ in size. The training, validation, and testing sets contain 100,000, 30,000, and 20,000 samples, respectively. All testing samples are excluded from the training and validation samples. We train several parameter sets with Glorot uniform initialization and then select the parameter set that provides the minimal loss on the validation set. The epochs, learning rate, and batch size are set as 1000, 0.001, and 200, respectively.

γ      Methods       Indoor             Outdoor

                     NMSE      ρ        NMSE      ρ
1/4    LASSO         -7.59     0.91     -5.08     0.82
       BM3D-AMP      -4.33     0.80     -1.33     0.52
       TVAL3         -14.87    0.97     -6.90     0.88
       CS-CsiNet     -11.82    0.96     -6.69     0.87
       CsiNet        -17.36    0.99     -8.75     0.91
1/16   LASSO         -2.72     0.70     -1.01     0.46
       BM3D-AMP      0.26      0.16     0.55      0.11
       TVAL3         -2.61     0.66     -0.43     0.45
       CS-CsiNet     -6.09     0.87     -2.51     0.66
       CsiNet        -8.65     0.93     -4.51     0.79
1/32   LASSO         -1.03     0.48     -0.24     0.27
       BM3D-AMP      24.72     0.04     22.66     0.04
       TVAL3         -0.27     0.33     0.46      0.28
       CS-CsiNet     -4.67     0.83     -0.52     0.37
       CsiNet        -6.24     0.89     -2.81     0.67
1/64   LASSO         -0.14     0.22     -0.06     0.12
       BM3D-AMP      0.22      0.04     25.45     0.03
       TVAL3         0.63      0.11     0.76      0.19
       CS-CsiNet     -2.46     0.68     -0.22     0.28
       CsiNet        -5.84     0.87     -1.93     0.59
Table I. NMSE (in dB) and cosine similarity ρ.

We compare CsiNet with three state-of-the-art CS-based methods, namely, the LASSO solver [5], TVAL3 [7], and BM3D-AMP [8]. In all experiments, we assume that the optimal regularization parameter of LASSO is given by an oracle. Among these algorithms, LASSO provides the baseline result for the CS problem by considering only the simplest sparsity prior. TVAL3 is a remarkably fast total-variation-based recovery algorithm that considers a more elaborate prior. BM3D-AMP is among the most accurate compressive recovery algorithms for natural-image reconstruction. We also provide the corresponding results for CS-CsiNet, which only learns to recover CSI from CS measurements (i.e., random linear measurements). The architecture of CS-CsiNet is identical to that of the decoder of CsiNet.
The difference between the recovered channel $\hat{\mathbf{H}}$ and the original $\mathbf{H}$ is quantified by the normalized MSE (NMSE), which is defined as follows:
$\mathrm{NMSE} = \mathrm{E}\left\{ \|\mathbf{H} - \hat{\mathbf{H}}\|_2^2 \,/\, \|\mathbf{H}\|_2^2 \right\}.$   (6)
The feedback CSI serves as a beamforming vector. Let $\hat{\tilde{\mathbf{h}}}_n$ be the reconstructed channel vector of the $n$th subcarrier. If $\mathbf{v}_n = \hat{\tilde{\mathbf{h}}}_n / \|\hat{\tilde{\mathbf{h}}}_n\|_2$ is used as the beamforming vector, then the equivalent channel at the UE side is $\tilde{\mathbf{h}}_n^H \hat{\tilde{\mathbf{h}}}_n / \|\hat{\tilde{\mathbf{h}}}_n\|_2$. To measure the quality of the beamforming vector, we also consider the cosine similarity
$\rho = \mathrm{E}\left\{ \frac{1}{\tilde{N}_c} \sum_{n=1}^{\tilde{N}_c} \frac{|\hat{\tilde{\mathbf{h}}}_n^H \tilde{\mathbf{h}}_n|}{\|\hat{\tilde{\mathbf{h}}}_n\|_2 \|\tilde{\mathbf{h}}_n\|_2} \right\}.$   (7)
Notably, when evaluating NMSE and $\rho$, we scale the output of CsiNet (i.e., the normalized channel matrix) back to its original levels.
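The two evaluation metrics in (6) and (7) can be computed per sample as follows; the dB conversion is shown explicitly, and the expectation $\mathrm{E}\{\cdot\}$ is simply an average of these values over the test set. The function names are ours.

```python
import numpy as np

def nmse_db(H, H_hat):
    """Normalized MSE of Eq. (6) for one sample, in dB."""
    err = np.linalg.norm(H - H_hat) ** 2 / np.linalg.norm(H) ** 2
    return 10.0 * np.log10(err)

def cosine_similarity(H_sf, H_sf_hat):
    """Eq. (7): average beamforming-direction match over subcarriers.
    Rows of H_sf are the per-subcarrier channel vectors h_n."""
    num = np.abs(np.sum(H_sf_hat.conj() * H_sf, axis=1))
    den = (np.linalg.norm(H_sf_hat, axis=1) *
           np.linalg.norm(H_sf, axis=1))
    return np.mean(num / den)
```

Note that ρ is invariant to the scale of the reconstruction, whereas the NMSE is not, which is why both metrics are reported.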
The corresponding NMSE and $\rho$ of all the considered methods are summarized in Table I, with the best results presented in bold font. CsiNet obtains the lowest NMSE values and significantly outperforms the CS-based methods at all compression ratios. Compared with CS-CsiNet, CsiNet also provides significant gains, which are due to the sophisticated DL architecture used in both the encoder and decoder. At the lowest compression ratios (1/32 and 1/64), the CS-based methods can no longer function, whereas CsiNet and CS-CsiNet continue to perform well. Fig. 2 shows some reconstruction samples at different compression ratios along with the corresponding pseudo-gray plots of the strength of $\mathbf{H}$. CsiNet clearly outperforms the other algorithms.
Furthermore, CSI recovery through CsiNet can be executed with considerably lower overhead than through the CS-based algorithms because CsiNet requires only a few layers of simple matrix-vector multiplications. In our experiments, the average running time of CsiNet is orders of magnitude shorter than those of LASSO, BM3D-AMP, and TVAL3.
Finally, we provide some further observations. First, the DFT matrices used to transform $\tilde{\mathbf{H}}$ from the spatial-frequency domain into the angular-delay domain are unnecessary. Table II shows the NMSE and $\rho$ results of CsiNet with and without the 2D DFT. CsiNet can also exhibit good performance without the DFT when the entire network is retrained. This finding demonstrates that CsiNet can learn a proper basis by itself without preprocessing the channel matrix into the angular-delay domain, and thus implies that CsiNet can be applied to other antenna configurations. Second, angular (or spatial) resolution increases with the number of antennas at the BS. The corresponding NMSE and $\rho$ of all the considered methods for three increasing numbers of BS antennas are summarized in Table III. The reconstruction performances of all the algorithms improve because $\mathbf{H}$ becomes sparser. CsiNet improves the most because it is more capable than the CS-based methods of exploiting subtle changes among adjacent elements.
γ      Domain                  Indoor             Outdoor
                               NMSE      ρ        NMSE      ρ
1/4    Spatial (without DFT)   -24.57    1.00     -9.42     0.92
       Angular (with DFT)      -17.36    0.99     -8.75     0.91
1/16   Spatial (without DFT)   -9.20     0.94     -4.14     0.77
       Angular (with DFT)      -8.65     0.93     -4.51     0.79
1/32   Spatial (without DFT)   -8.77     0.93     -2.96     0.69
       Angular (with DFT)      -6.24     0.89     -2.81     0.67
1/64   Spatial (without DFT)   -5.83     0.86     -1.78     0.56
       Angular (with DFT)      -5.84     0.87     -1.93     0.59
Table II. NMSE (in dB) and cosine similarity ρ of CsiNet with and without the 2D DFT in (2).
γ      Methods      NMSE      ρ        NMSE      ρ        NMSE      ρ
1/4    LASSO        -4.55     0.80     -5.08     0.82     -5.28     0.83
       BM3D-AMP     -1.06     0.47     -1.33     0.52     -1.61     0.62
       TVAL3        -3.87     0.77     -6.90     0.88     -6.09     0.85
       CsiNet       -6.13     0.85     -8.75     0.91     -12.38    0.94
1/16   LASSO        -0.65     0.44     -1.01     0.46     -1.23     0.51
       BM3D-AMP     1.92      0.27     0.55      0.11     0.35      0.23
       TVAL3        0.03      0.40     -0.43     0.45     -0.79     0.50
       CsiNet       -3.44     0.74     -3.34     0.72     -5.54     0.83
1/32   LASSO        -0.13     0.27     -0.24     0.27     -0.38     0.34
       BM3D-AMP     21.53     0.23     22.66     0.04     23.64     0.13
       TVAL3        0.65      0.28     0.46      0.28     0.28      0.31
       CsiNet       -2.30     0.65     -2.81     0.67     -3.76     0.74
1/64   LASSO        -0.06     0.12     -0.06     0.12     -0.057    0.16
       BM3D-AMP     23.26     0.04     25.45     0.03     26.78     0.13
       TVAL3        1.02      0.23     0.76      0.19     0.72      0.18
       CsiNet       -1.24     0.48     -1.93     0.58     -2.74     0.67
Table III. NMSE (in dB) and cosine similarity ρ for an increasing number of BS antennas (left to right).
V. Conclusion
We used DL to develop CsiNet, a novel CSI sensing and recovery mechanism. CsiNet performed well at low compression ratios and with reduced time complexity. We believe that its reconstruction quality can be further improved by applying advanced DL technology, and we hope this study encourages future research in this direction.
References
 [1] T. L. Marzetta, “Noncooperative cellular wireless with unlimited numbers of base station antennas,” IEEE Trans. Wireless Commun., vol. 9, no. 11, pp. 3590–3600, Nov. 2010.
 [2] J. Zhang et al., “On capacity of largescale MIMO multiple access channels with distributed sets of correlated antennas,” IEEE J. Sel. Areas Commun., vol. 31, no. 2, pp. 133–148, Feb. 2013.
 [3] P. H. Kuo, H. Kung, and P. A. Ting, “Compressive sensing based channel feedback protocols for spatiallycorrelated massive antenna arrays,” in Proc. IEEE WCNC, Shanghai, China, Apr. 2012, pp. 492–497.
 [4] X. Rao and V. K. Lau, “Distributed compressive CSIT estimation and feedback for FDD multiuser massive MIMO systems,” IEEE Trans. Signal Process., vol. 62, no. 12, pp. 3261–3271, Jun. 2014.
 [5] I. Daubechies, M. Defrise, and C. D. Mol, “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint,” Comm. Pure Appl. Math., vol. 57, no. 11, pp. 1413–1457, 2004.
 [6] D. L. Donoho, A. Maleki, and A. Montanari, “Message passing algorithms for compressed sensing,” Proc. Natl. Acad. Sci., vol. 106, no. 45, pp. 18 914–18 919, 2009.
 [7] C. Li, W. Yin, and Y. Zhang, “User’s guide for TVAL3: TV minimization by augmented Lagrangian and alternating direction algorithms,” CAAM Report, vol. 20, pp. 46–47, 2009.
 [8] C. A. Metzler, A. Maleki, and R. G. Baraniuk, “From denoising to compressed sensing,” IEEE Trans. Inf. Theory, vol. 62, no. 9, pp. 5117–5144, 2016.
 [9] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
 [10] S. Lohit et al., “Convolutional neural networks for noniterative reconstruction of compressively sensed images,” preprint, 2017. [Online]. Available: http://arxiv.org/abs/1708.04669.
 [11] A. Mousavi, G. Dasarathy, and R. G. Baraniuk, “DeepCodec: Adaptive sensing and recovery via deep convolutional neural networks,” preprint, 2017. [Online]. Available: http://arxiv.org/abs/1707.03386.
 [12] H. Yao et al., “DR2-Net: Deep residual reconstruction network for image compressive sensing,” preprint, 2017. [Online]. Available: http://arxiv.org/abs/1702.05743.
 [13] T. Wang et al., “Deep learning for wireless physical layer: opportunities and challenges,” preprint, 2017. [Online]. Available: https://arxiv.org/abs/1710.05312.
 [14] T. J. O’Shea et al., “Deep learning based MIMO communications,” preprint, 2017. [Online]. Available: https://arxiv.org/abs/1707.07980.
 [15] J. Choi et al., “Downlink training techniques for FDD massive MIMO systems: openloop and closedloop training with memory,” IEEE J. Sel. Topics Signal Process, vol. 8, no. 5, pp. 802–814, Oct. 2014.
 [16] L. Liu et al., “The COST 2100 MIMO channel model,” IEEE Wireless Commun., vol. 19, no. 6, pp. 92–99, Dec. 2012.
 [17] K. He et al., “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.