I Introduction
Massive multiple-input multiple-output (MIMO) can significantly enhance spectral efficiency and reduce interference in cellular networks, and has thus been regarded as a key enabler for 5G and beyond networks [1]. However, these benefits rely heavily on accurate channel state information (CSI), which is a major challenge, especially for systems operating at higher frequencies, e.g., millimeter wave (mmWave) massive MIMO systems [2]. This is largely due to the faster temporal variations and the large dimension of the channel matrix. A common approach is to send sufficient pilot symbols for reliable channel estimation, which then enables coherent data detection. However, with the limited coherence time and the large dimension of the channel matrix, the pilot overhead can easily occupy too much radio resource. In addition, due to the limited number of orthogonal pilot sequences in multi-cell systems, pilot contamination further jeopardizes the performance of coherent massive MIMO systems [3]. To overcome this difficulty, blind data detection methods have been proposed to recover the data from the received signal without training pilots [4, 5, 6, 7]. The blind scheme can achieve degrees of freedom (DoF) of for a massive MIMO system with transmit antennas and receive antennas in a rich scattering environment, where is the coherence time [4]. This is the same as that achieved by a coherent massive MIMO system once the pilot overhead is taken into account [8]. Recently, it has been shown that by exploiting the sparsity structure of massive MIMO channels, the performance can be further improved [8, 9, 10, 7]. In particular, Zhang et al. [8] showed that, under some regularity conditions, blind detection can achieve a DoF arbitrarily close to for sparse massive MIMO channels. Approximate message passing (AMP)-based algorithms that exploit the channel sparsity were proposed in [8] and [9] for blind data detection and showed superior performance. However, AMP-based approaches generally rely on strict assumptions on the prior distributions of the channel and data, and require an iterative message passing algorithm whose complexity is significant due to the large problem size in massive MIMO systems. In [10], a subspace-based method, which decomposes the covariance of the received signal into the subspace of the channel and refines the decomposition by exploiting the channel sparsity, was proposed for blind detection in massive MIMO. However, a long sample sequence of the received signal is required to estimate the covariance, which is applicable only with a very long channel coherence time and very low mobility. In addition, all of the aforementioned works focused on algorithm development and did not provide a theoretical analysis of the achievable performance.
In this paper, we propose a novel formulation of blind data detection for sparse massive MIMO channels, supported by an efficient algorithm that does not require knowledge of the prior distributions of the data and the channel. In addition, the proposed scheme contains no tuning parameters and admits a fast implementation. The theoretical justification of the formulation and the convergence rate of the algorithm are also developed. The main contributions are summarized as follows:

Data Concentration Property: We exploit the statistical information of the data transmitted in general communication systems for a novel formulation of the blind detection problem. Specifically, we show that under mild conditions there is a data concentration phenomenon, which enables a simple blind sparse recovery formulation of a massive MIMO system over the Stiefel manifold. (In this paper, the term concentration refers to the concentration of measure phenomenon, which can be informally expressed as "A random variable that depends in a Lipschitz way on many independent variables is essentially constant." [11].)
Blind Detection via norm Maximization: It was shown in [8] that channel sparsity leads to a fundamental gain in the degrees of freedom of massive MIMO systems. Hence, our proposed formulation leverages the sparsity structure of massive MIMO channels. Traditionally, the norm is the most widely used formulation to induce sparsity [12]. However, the non-smooth nature of the norm results in low convergence speed and high complexity [13]. In this paper, inspired by the smoothness of high-order norms and the fact that a high-order norm promotes sparsity [14, 15, 16, 17, 18], we consider norm maximization over the Stiefel manifold for blind data recovery of massive MIMO systems with sparse channels. Furthermore, we show that the global maximizer of this formulation is arbitrarily close to the true data, up to a phase and permutation ambiguity, for a sufficiently large number of antennas.

Efficient Algorithm and Convergence Rate Analysis: By taking advantage of the smoothness of the norm and the geometric structure of the Stiefel manifold, a parameter-free algorithm with low complexity is proposed for blind data recovery in massive MIMO systems. We show that under mild conditions, the proposed algorithm converges to a stationary point of the norm maximization problem. We further show that the convergence rate is given by for a large-scale receive array, where is the number of transmitters, is the number of receivers, and and are the large-scale fading coefficient of the th transmitter and the variance of the channel noise, respectively. The coefficient represents the sparsity level of the channel, and is the number of iterations.
The rest of the paper is organized as follows. In Section II, we present the system model, the sparse MIMO channel and the data concentration property. In Section III, we present the norm-based manifold optimization problem as a new formulation of the blind data detection problem. We further discuss the structural properties of the optimal solution to the new formulation in the noiseless and noisy cases. A fast and parameter-free algorithm is proposed and the corresponding convergence analysis is given in Section IV. Numerical simulation results are provided in Section V. Finally, Section VI summarizes the work.
Notations: and denote the inverse and conjugate transpose of matrix , respectively. represents the null space of . is used to take the element-wise absolute value. represents a diagonal matrix constructed by using as the diagonal elements. The th row vector and th column vector of are and , respectively. represents the element in the th row and th column of . denotes the Hadamard product. denotes the ceiling operator. is the general inner product of and . Finally, , and are, respectively, the Frobenius norm, the spectral norm and the induced norm of matrix .

II System Model
In this section, we introduce the system model of the considered massive MIMO system, followed by the sparse channel model and the data concentration property.
II-A System Model
Consider a single-cell mmWave massive MIMO communication system, as illustrated in Fig. 1. There are single-antenna users transmitting to a base station (BS) that is equipped with receive antennas, with . Denote by the symbols transmitted by user within one frame. The aggregate transmit symbols of the users are denoted by . For simplicity, we consider a block flat-fading channel, but the framework can be easily extended to OFDM systems. The received signal at the BS is given by
(1) 
where is the transmit power of the users, is the aggregate MIMO channel matrix, is the channel matrix between the th user and the BS, is the aggregate large-scale fading coefficients of the users to the BS, and is the aggregate additive channel noise with independent and identically distributed (i.i.d.) elements. We assume that the BS has knowledge of , which can be obtained in practice with very low signaling overhead due to the slowly varying path gains.
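As a numerical illustration of the signal model in (1), the sketch below builds one received block. Since the displayed symbols were lost in extraction, all dimensions, the QPSK alphabet, and the exact placement of the power and large-scale-fading factors are illustrative assumptions rather than the paper's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(5)
K, N, T = 4, 64, 32      # users, BS antennas, frame length (illustrative)
P = 1.0                  # per-user transmit power (assumed)

g = rng.uniform(0.5, 1.5, K)   # large-scale fading, assumed known at the BS
H = (rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))) / np.sqrt(2)
S = (rng.choice([-1, 1], (K, T)) + 1j * rng.choice([-1, 1], (K, T))) / np.sqrt(2)
W = 0.01 * (rng.standard_normal((N, T)) + 1j * rng.standard_normal((N, T))) / np.sqrt(2)

# Received block, cf. (1): scaled channel times data plus additive noise
Y = np.sqrt(P) * H @ np.diag(g) @ S + W
```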
II-B Sparsity of the mmWave MIMO Channel
The mmWave propagation environment is well characterized by a clustered channel model [19], which can be parameterized by the paths of the th user. The small-scale mmWave channel matrix between user and the BS during the coherence time is given by
(2) 
where denotes the normalized path gain of the th path for the th user. We assume that
are i.i.d. random variables following the complex Gaussian distribution
, and denote the azimuth and zenith angles of arrival (AoA) of the th path for the th user, and represents the receive array response vector. For simplicity, we assume that the BS is equipped with a uniform rectangular planar array (URPA) with and elements () in the horizontal and vertical directions, respectively. The array response vector is given by [20]
(3)
where and .
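To make the clustered model in (2)-(3) concrete, the following NumPy sketch generates a single-antenna user's URPA channel as a sum of a few response vectors and checks that its energy concentrates on a small number of coefficients under a Kronecker-DFT angular basis. The half-wavelength geometry, path count, and normalization here are our assumptions, not the paper's exact parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def urpa_response(az, zen, n1, n2):
    """Steering vector of an n1 x n2 half-wavelength URPA (assumed geometry)."""
    u = np.sin(zen) * np.cos(az)    # horizontal direction cosine
    v = np.sin(zen) * np.sin(az)    # vertical direction cosine
    ah = np.exp(1j * np.pi * u * np.arange(n1))
    av = np.exp(1j * np.pi * v * np.arange(n2))
    return np.kron(ah, av) / np.sqrt(n1 * n2)

def clustered_channel(n1, n2, n_paths):
    """Sum of n_paths response vectors with i.i.d. CN(0,1) gains, cf. (2)."""
    h = np.zeros(n1 * n2, dtype=complex)
    for _ in range(n_paths):
        alpha = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)
        h += alpha * urpa_response(rng.uniform(-np.pi, np.pi),
                                   rng.uniform(0, np.pi / 2), n1, n2)
    return h

n1 = n2 = 16
h = clustered_channel(n1, n2, n_paths=3)

# Unitary Kronecker-DFT angular basis and the angular-domain channel
F = np.kron(np.fft.fft(np.eye(n1)) / np.sqrt(n1),
            np.fft.fft(np.eye(n2)) / np.sqrt(n2))
ha = F.conj().T @ h

# Participation ratio: effective number of active angular coefficients
pr = np.sum(np.abs(ha) ** 2) ** 2 / np.sum(np.abs(ha) ** 4)
```

With only 3 paths and 256 antennas, the participation ratio stays far below the ambient dimension, i.e., the angular-domain channel is approximately sparse (exact sparsity fails because off-grid angles leak energy into neighboring bins).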
The spatial aggregate channel of users can be expressed by a “virtual angular domain” representation [21] as follows:
(4) 
where is the steering matrix for the receive array. With an receive URPA, we have [22], where and are the unitary discrete Fourier transform (DFT) matrices. can be interpreted as the normalized channel gain between the th user and the th discrete receive angle [21]. From [23], the number of clusters is quite limited in the mmWave band. Furthermore, is usually quite large to mitigate the path loss effect at mmWave frequencies [24]. Hence, the number of paths is usually much smaller than the channel dimension, i.e., . Therefore, can be regarded as approximately sparse. (The reason why is not exactly sparse is the mismatch between the exact receive angles and those predefined by , a.k.a. the energy leakage phenomenon.) In this paper, we define the sparsity level as the average number of nonzero elements in the sparse channel , i.e., . Fig. 2 shows a realization of for a massive MIMO mmWave channel according to the statistical spatial channel model (SSCM) implemented in NYUSIM [23]. As illustrated, most of the entries of are quite small (here we count a value less than 1% of as zero) due to the limited number of scatterers in mmWave channels [19]. The channel sparsity can be further promoted through a more accurate sparse basis , which can be found using the offline learning method proposed in [25].

II-C Data Concentration on the Stiefel Manifold
The aggregate data matrix transmitted by the users is assumed to have independent entries with zero mean and a normalized covariance . The following proposition establishes that approaches exponentially fast with respect to . (One could use Chebyshev's inequality to show that , where is the variance of the square of each element of , via the standard law-of-large-numbers argument [26, Section 8.2, Theorem 2.1]. However, our result is stronger, as it shows that the concentration is exponentially fast with respect to under some mild conditions.)

Proposition 1.
(Exponential concentration of data)
Assume the elements of matrix are independent, zero-mean, and have bounded support given by with . Then there exists a constant such that, for any , we have
Proof:
See Appendix A. ∎
Remark 1.
The conditions in Proposition 1 are satisfied by the constellations commonly used for the transmit data of massive MIMO systems. For example, the two types of constellations used in 5G systems, i.e., phase-shift keying (PSK) and quadrature amplitude modulation (QAM), both have zero mean and bounded support; in particular, quadrature PSK (QPSK) symbols have after power normalization. From Proposition 1, as long as , the probability of decays exponentially fast to , as illustrated in Fig. 3. Based on the data concentration property in Proposition 1, we assume that lies on a Stiefel manifold, as defined below.
Definition 1.
(Complex Stiefel manifold) The complex Stiefel manifold is defined as the subspace of orthonormal frames in , namely,
(5) 
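The data concentration claimed in Proposition 1 is easy to check empirically. The sketch below draws unit-power QPSK frames and measures how far the normalized sample covariance is from the identity as the frame length grows; the choices K = 8 and the two frame lengths are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def qpsk_frame(k, t):
    """k x t matrix of i.i.d. unit-power QPSK symbols (zero mean, bounded support)."""
    return (rng.choice([-1, 1], (k, t)) + 1j * rng.choice([-1, 1], (k, t))) / np.sqrt(2)

K = 8
devs = {}
for T in (50, 5000):
    S = qpsk_frame(K, T)
    # Spectral-norm distance of S S^H / T from the identity
    devs[T] = np.linalg.norm(S @ S.conj().T / T - np.eye(K), 2)

# Equivalently, Q = S^H / sqrt(T) is close to the Stiefel manifold St(T, K):
# Q^H Q = S S^H / T is close to I_K, and the deviation shrinks as T grows.
print(devs)
```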
III Problem Formulation: norm Maximization over the Stiefel Manifold
In this section, we formulate blind data recovery for massive MIMO systems with sparse channels as an norm maximization problem over the Stiefel manifold, and discuss the structural properties of the optimal solution.
III-A Problem Formulation
Following the discussion in Section II-B, the received signal in (1) can be transformed to
(6) 
where . We first briefly review some conventional approaches for blind data detection in massive MIMO systems exploiting both the channel sparsity and the data concentration, i.e., .

Conventional norm-based formulation [27]
To exploit the data concentration property of , we observe that is also a sparse quantity with sparsity induced by . Hence, one may directly use the norm to promote sparsity and formulate the blind data recovery as
(7) where the solution is an estimate of . However, Problem (7) may lead to trivial solutions when . When , will have a null space of rank . Hence is a trivial solution to Problem (7) with . When , solving Problem (7) will recover the data . However, due to the non-smooth nature of the norm, algorithms such as subgradient descent [27] will have high complexity. Furthermore, it was shown in [14] that the formulation in (7) is sensitive to noise, because the norm essentially encourages all small entries to be 0 [16].

Complete dictionary learning approach [28]
Another approach to exploit the property and the channel sparsity is to apply the complete dictionary learning method [28]. Specifically, this approach recovers each column of sequentially [29, Section 3]. To recover the th column of , we solve the following problem:
(8) where is used to promote sparsity. Let be the solution of (8); then is an estimate of , where is an orthonormal basis for . However, this method requires the dictionary to be a square (complete) matrix [28]. In massive MIMO systems, the coherence time is usually larger than the number of users . Hence, is usually not square. Therefore, this formulation cannot be directly adopted for blind data recovery in massive MIMO systems.
As illustrated above, conventional approaches are either inefficient or impose strict requirements on the system parameters. Thus, an alternative formulation that exploits both data concentration and channel sparsity is needed. Recently, it has been shown in the machine learning literature [16, 14] that maximizing a high-order norm (norm, ) leads to sparse solutions. An intuitive explanation is that the sparsest points on the unit sphere, e.g., points , , and in , have the largest norm , as shown in Fig. 4. Inspired by this observation, we propose a smooth non-convex alternative to Problem (7). The new formulation avoids the aforementioned issues of the conventional norm-based formulation, and it also relaxes the requirement on the coherence time. The proposed problem is given by
(9) 
In Problem (9), we choose the cube of the norm to promote sparsity, with the following justifications. First, maximization of does not lead to the trivial solutions caused by minimizing the norm, i.e., , for . Second, the smoothness of the objective function makes it possible to design a fast-converging algorithm. Third, is a milder sparsity-promoting function: it is very flat around 0, so it does not encourage all small entries to be exactly 0 and is thus insensitive to small noise in the signal. Finally, achieves the smallest sample complexity for exact recovery compared to other choices of , [30]. (From a machine learning perspective, the sample complexity of an algorithm is the number of training samples needed to successfully learn a target function.) In the following, we provide theoretical support for formulation (9) by showing that solving it recovers the data matrix. We start with the noiseless case, and then generalize the analysis to the practical noisy case.
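The sparsity-promoting effect of the objective in (9) can be verified with a toy example. Since the norm symbol was lost in extraction, we use the cubed l3-norm here, matching the "cube of the norm" described above, and compare unit-l2-norm vectors of different sparsity:

```python
import numpy as np

n = 100
sparse = np.zeros(n); sparse[0] = 1.0     # 1-sparse point on the unit sphere
mid = np.zeros(n);    mid[:4] = 0.5       # 4-sparse unit vector
dense = np.ones(n) / np.sqrt(n)           # maximally spread unit vector

def l3_cubed(x):
    return np.sum(np.abs(x) ** 3)

# All three have unit l2 norm; the sparser the vector, the larger the objective
print(l3_cubed(sparse), l3_cubed(mid), l3_cubed(dense))   # 1.0, 0.5, 0.1
```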
III-B Theoretical Analysis for the Noiseless Case
Without loss of generality, we assume and for the noiseless case. We first analyze the properties of Problem (9) and show that the global optimal solution is arbitrarily close to the true data up to a phase-permutation ambiguity. Specifically, two solutions and are called equivalent up to a phase-permutation ambiguity if , where is a phase-permutation matrix as defined below.
Definition 2.
(Phase-permutation matrix) The dimensional phase-permutation matrix is defined as:
(10) 
where with and , with being a standard basis vector, and being any permutation of the elements.
Note that a phase-permutation ambiguity can be resolved with very small signaling overhead in massive MIMO systems. We defer the discussion to Section IV-C.
The following theorem summarizes the key result, namely, that the optimal solution of Problem (9) is arbitrarily close to the true data (up to a phase-permutation ambiguity) for a sufficiently large number of antennas , thus demonstrating the correctness of our formulation.
Theorem 1.
(Blind data detection in the noiseless case) Let with , , and , where is the product of an independent Bernoulli random variable with parameter and a circularly symmetric complex normal random variable, i.e., , where , ; that is, Theorem 1 assumes the elements of are i.i.d. Bernoulli-complex Gaussian random variables. Define as the set of optimal solutions of Problem (9) and . For any , there exists a constant , for , such that
where is a phase-permutation matrix.
Proof:
See Appendix B. ∎
III-C Theoretical Analysis for the Noisy Case
In this section, we study the robustness of the solution to Problem (9) with respect to noise. The result is summarized in the following theorem.
Theorem 2.
(Blind data detection in the noisy case) Let with , , , with , and . Define as the set of optimal solutions of Problem (9) and . For any , there exists a constant , for , such that
where and is a phase-permutation matrix.
Proof:
See Appendix C. ∎
Remark 2.
From Theorem 2, with high probability the optimal solution to Problem (9) lies within an uncertainty ball centered at (up to a phase-permutation ambiguity) with radius , where is the noise variance. The radius of the uncertainty ball decreases with decreasing according to . When , the result reduces to Theorem 1 for the noiseless case.
IV A Low-Complexity Parameter-Free Algorithm
In this section, we propose a low-complexity parameter-free algorithm to solve Problem (9) and provide the corresponding convergence analysis. The resolution of the phase-permutation ambiguity is also given.
IV-A Review of the Gradient Method over the Stiefel Manifold
Problem (9) is an optimization problem over the Stiefel manifold. As such, one can apply gradient search over the Stiefel manifold [31] to solve it. Specifically, the gradient iteration over the Stiefel manifold is given by
where is the iteration index and denotes the gradient of the objective function at , which is given by the orthogonal projection of the Euclidean gradient onto the tangent space at [31]. is the step size in the direction , and is the retraction on the manifold, which maps from the tangent space onto the manifold itself. However, directly applying the gradient method to Problem (9) suffers from a high per-iteration complexity due to the two maps in each iteration: to compute , one needs to project the Euclidean gradient onto the tangent space of and then map the tangent vector onto the Stiefel manifold. Moreover, finding the optimal step size requires a curvilinear search in each iteration, which is usually time-consuming.
IV-B Parameter-Free Algorithm to Solve Problem (9)
To resolve the limitations of the gradient method, we propose a low-complexity, parameter-free algorithm to solve Problem (9). The algorithm is derived from the Frank-Wolfe method [32], which considers a linear approximation of the objective function at each iteration, thereby substantially simplifying the per-iteration computation.
IV-B1 Derivation of the proposed parameter-free algorithm
The Frank-Wolfe method for Problem (9) iterates according to
(11) 
(12)  
for 
where is the Euclidean gradient at of the objective function of Problem (9). We shall elaborate on these two steps next.

Step 1: Simple computation of (11) by exploiting the Stiefel manifold constraint. We first focus on the computation of (11). This step aims to find an such that the inner product is maximized, which can be achieved by projecting onto the Stiefel manifold . This is similar to the retraction in the gradient method, , which maps the moving direction onto the manifold. The difference is that retraction is a mapping from the tangent space of to the manifold; for the Stiefel manifold, however, polar-decomposition-based retraction, , is a projection-like retraction [33] which does not require the direction to lie in the tangent space. Therefore, we can directly use polar-decomposition-based retraction to obtain
(13) That is, returns the matrix with orthonormal columns from the right polar decomposition [34] of matrix . The right polar decomposition admits many fast computation methods [35, 36]. In this paper, we use the compact singular value decomposition (SVD) to implement the polar factorization. By the compact SVD, we have
(14)
(15) 

Step 2: Optimal step size in (12) by exploiting the convex objective function. In general, the update step (12) requires a time-consuming line search. By exploiting the convexity of the objective function of Problem (9) as well as the geometric structure of the Stiefel manifold, we show in Lemma 1 that the optimal step size in (12) is .
Now, using (15) to solve (11) and fixing the step size as in (12), we summarize the main procedure of the proposed parameter-free algorithm for Problem (9) as
(16) 
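A minimal NumPy sketch of the iteration in (16): the polar factor is computed via a compact SVD as in (13)-(15), and the step size is fixed per Step 2. Since the displayed formulas were lost in extraction, the objective f(Q) = sum of |Y Q| cubed, the variable shapes, and the gradient expression below are our reconstruction and should be read as assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def polar_factor(G):
    """Orthonormal factor of the right polar decomposition via compact SVD."""
    U, _, Vh = np.linalg.svd(G, full_matrices=False)
    return U @ Vh

def blind_detect(Y, K, n_iter=30):
    """Iterate Q <- Polar(grad f(Q)) for f(Q) = sum |Y Q|^3, cf. (16)."""
    T = Y.shape[1]
    Q = polar_factor(rng.standard_normal((T, K)) + 1j * rng.standard_normal((T, K)))
    hist = []
    for _ in range(n_iter):
        Z = Y @ Q
        G = Y.conj().T @ (np.abs(Z) * Z)   # Wirtinger gradient of f, up to a 3/2 factor
        Q = polar_factor(G)
        hist.append(np.sum(np.abs(Y @ Q) ** 3))
    return Q, hist

# Synthetic noiseless data: sparse channel times QPSK symbols
N, K, T = 64, 3, 40
G_true = np.zeros((N, K), dtype=complex)
for k in range(K):
    idx = rng.choice(N, 4, replace=False)
    G_true[idx, k] = rng.standard_normal(4) + 1j * rng.standard_normal(4)
S = (rng.choice([-1, 1], (K, T)) + 1j * rng.choice([-1, 1], (K, T))) / np.sqrt(2)

Q, hist = blind_detect(G_true @ S, K)
```

Each iterate stays exactly on the Stiefel manifold (the polar factor has orthonormal columns), and by the convexity argument in Step 2 the objective sequence is non-decreasing.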
IV-B2 Convergence analysis
Define the first-order optimality metric as
(17) 
The metric in (17) is a measure of the optimality of due to the following lemma.
Lemma 2.
(Optimality Measure) is a stationary point of Problem (9), if and only if .
Proof:
See Appendix E. ∎
Based on this, the following theorem summarizes the convergence of the proposed algorithm to solve Problem (9).
Theorem 3.
(Convergence of the proposed algorithm) Let be the sequence generated by the proposed algorithm in (16) with a random initial point . We have

is monotonically increasing.

.

.
Proof:
See Appendix F. ∎
Remark 3.
To evaluate the impact of the key system parameters on the convergence rate of the proposed algorithm in (16), we present the following theorem.
Theorem 4.
Proof:
See Appendix G. ∎
Remark 4.
Theorem 4 shows that the convergence rate is , with high probability. This suggests that a smaller number of receive antennas , a smaller number of users , a lower sparsity level , or a smaller noise variance leads to faster convergence. The convergence rate in (18) is also consistent with the simulation results in Fig. 6, as will be shown later.
IV-C Resolving the Phase-Permutation Ambiguity
According to Theorem 2, there is a phase-permutation ambiguity in the solutions of Problem (9). This ambiguity can be resolved with little overhead [9, 8, 37]. Specifically, after we obtain the output of Algorithm (16), , the phase-permutation ambiguity can be resolved by the following two steps [8, 37]:

Step 1: Use one common reference symbol to eliminate the phase ambiguity. Without loss of generality, we assume that the first transmitted symbol of each user is the reference symbol, i.e., = (as illustrated in Fig. 5(a)). The result after eliminating the phase ambiguity is given by
(19) where is an estimate of with , as illustrated in Fig. 5(b).

Step 2: Use symbols as the user ID to eliminate the permutation ambiguity. After eliminating the phase ambiguity, the permutation ambiguity can be eliminated by comparing the user IDs encoded with symbols, where is the size of the modulation alphabet, as illustrated in Fig. 5.
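The two steps above can be checked end-to-end on a toy example. The QPSK alphabet, a single ID symbol per user (possible here since the alphabet size M = 4 covers K = 4 users), and the scrambling model (per-row phases plus a row permutation) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
alphabet = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)

K, T = 4, 20
S = rng.choice(alphabet, (K, T))
S[:, 0] = alphabet[0]          # common reference symbol for every user
S[:, 1] = alphabet[:K]         # one ID symbol per user (M = 4 >= K)

# Apply an unknown phase-permutation ambiguity to the true data
theta = rng.uniform(0, 2 * np.pi, K)
perm = rng.permutation(K)
S_hat = np.exp(1j * theta)[:, None] * S[perm]

# Step 1: undo the per-row phase using the common reference symbol
S1 = S_hat * (alphabet[0] / S_hat[:, :1])
# Step 2: undo the permutation by decoding each row's ID symbol
ids = np.argmin(np.abs(S1[:, 1:2] - alphabet[None, :]), axis=1)
recovered = S1[np.argsort(ids)]
```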
Table I compares the overheads of the different schemes, which, together with the achievable rate comparison in Fig. 8, shows that the proposed blind detection scheme achieves a higher achievable rate by saving the pilot overhead.
Scheme | Overhead
Pilot-based coherent detection | 
Proposed blind data detection | 1+
The overall algorithm is summarized by Algorithm 1.
V Simulation Results
In this section, we first numerically verify the convergence property of the algorithm proposed in Section IV, and then present comprehensive simulation results to show the superiority of the proposed scheme for blind data detection in massive MIMO systems.
V-A Convergence Property
The convergence property is verified with randomly generated and such that and . ( can be generated via the QR decomposition of a random matrix.) Without loss of generality, we fix in the following results. Fig. 6 considers the noisy case, i.e., , where is additive Gaussian channel noise with zero mean and variance , to illustrate how the key system parameters influence the convergence rate. The value of the objective function is normalized by , which is the theoretical maximum value of the objective function for sufficiently large (further details can be found in Appendix C). The curves are plotted for individual trials with different experimental settings when . The results indicate that the convergence rate is influenced by , , and the noise variance . Specifically, a smaller , a smaller , a smaller , or a smaller noise variance accelerates convergence, consistent with Theorem 4.

V-B Performance Evaluation
In this section, we evaluate the proposed method in comparison with existing ones. In the following results, the spatial channel matrix is generated according to Eq. (2) with paths and i.i.d. Gaussian with unit variance for all users. Each user has independent azimuth and elevation AoAs, and , which are assumed to be uniformly distributed in and . The antenna elements in the URPA are separated by a half-wavelength distance. The "virtual angular domain" channel is obtained according to Eq. (4). Each element of is drawn from i.i.d. QPSK symbols and normalized by . All the results are averaged over Monte Carlo trials, unless otherwise specified.

V-B1 Performance and runtime comparison with baselines
To demonstrate the benefits of the proposed method, we introduce the following four state-of-the-art blind data detection schemes as baselines. Note that we apply the ambiguity elimination scheme described in Section IV-C to all the approaches.

Baseline 1 (ProBiGAMP) [8]: This is an approximate probabilistic message passing algorithm for blind data detection. In the simulation, we use EMBiGAMP_DL [38] from the GAMP Matlab package (available at https://sourceforge.net/projects/gampmatlab/) and add the projection operation described in [8]. This baseline is introduced to show the robustness of the proposed method to different distributions of the channel and data.

Baseline 2 (norm-based method): For norm-based methods, we adopt the one proposed in [27] for complete (orthogonal) dictionary learning. In the simulation, we divide the data matrix into adjacent square matrices and apply the norm-based method to each square matrix. This baseline is introduced to show the effectiveness of the proposed problem formulation.

Baseline 3 (norm-based method): This baseline solves the norm-based problem, i.e., in Problem (9) is replaced by . A similar scheme has been used in image processing to solve an orthogonal dictionary learning problem [14, 18] in the real-valued domain. This baseline is introduced to show the importance of the choice of for the high-order-norm-based problem formulation.
We use the average error vector magnitude (EVM) as the performance metric for data detection. The specific expression is given by
We also compare the proposed scheme with the following trainingbased sparsityexploiting data detection scheme.

Baseline 5 (Pilot-based method) [40]: This is a pilot-based method which leverages the sparsity of the channel. In the channel estimation phase, randomly generated training symbols of length are sent to the BS. The sparse channel is then estimated by solving the popular regularized least-squares problem: , where is the received signal during the training period. After is estimated, the transmitted data is detected via zero-forcing (ZF). This baseline is presented to show the superiority of the proposed blind scheme over the sparsity-exploiting training-based method.
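A compact sketch of this pilot-based baseline under simplifying assumptions (noiseless pilots, illustrative dimensions and regularization weight, none of which come from [40]): the l1-regularized least-squares problem is solved with plain ISTA, followed by ZF detection and nearest-symbol rounding:

```python
import numpy as np

rng = np.random.default_rng(4)
N, K, Tp, Td = 32, 4, 16, 100   # antennas, users, pilot and data lengths (illustrative)

# Sparse channel: a few nonzero coefficients per user's column
G = np.zeros((N, K), dtype=complex)
for k in range(K):
    idx = rng.choice(N, 3, replace=False)
    G[idx, k] = (rng.standard_normal(3) + 1j * rng.standard_normal(3)) / np.sqrt(2)

alphabet = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
Sp = rng.choice(alphabet, (K, Tp))     # random training symbols
Sd = rng.choice(alphabet, (K, Td))     # payload data
Yp, Yd = G @ Sp, G @ Sd                # noiseless received blocks

# ISTA for min_G 0.5 * ||Yp - G Sp||_F^2 + lam * ||G||_1
lam = 1e-4
L = np.linalg.norm(Sp, 2) ** 2         # Lipschitz constant of the smooth part
G_hat = np.zeros_like(G)
for _ in range(500):
    Z = G_hat - ((G_hat @ Sp - Yp) @ Sp.conj().T) / L   # gradient step
    mag = np.abs(Z)
    G_hat = Z * np.maximum(mag - lam / L, 0) / np.maximum(mag, 1e-12)  # soft threshold

rel_err = np.linalg.norm(G_hat - G) / np.linalg.norm(G)

# Zero-forcing detection with the estimated channel, then nearest-symbol rounding
S_zf = np.linalg.pinv(G_hat) @ Yd
detected = alphabet[np.argmin(np.abs(S_zf[..., None] - alphabet), axis=-1)]
```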
To facilitate the comparison with the training-based method, we introduce the following achievable rate metrics [9] for the blind and training-based schemes, respectively. For the blind schemes, the overhead is caused by the common reference symbol used to eliminate the phase ambiguity, and the loss is caused by the permutation ambiguity. We compare the performance of the proposed method with the aforementioned baselines in terms of EVM, computation time and achievable rate with , , where , , and is generated as white Gaussian noise with variance .
Fig. 7 shows the performance comparison with , and . We see that the proposed scheme (the rounding technique in [41] is adopted for the based method) exhibits the best performance in terms of EVM. The proposed scheme outperforms Baselines 1 and 2 because it is more robust to noise and the power leakage phenomenon. Specifically, the sparsity penalty is milder: it is very flat around 0 and thus insensitive to noise in the signal, while the strict sparsity penalties used in Baselines 1 and 2 essentially encourage all small entries to be 0 [16]. Baseline 3 shows inferior performance since it requires a higher sample complexity than the proposed norm-based formulation [30]. Although Baseline 4 shows performance competitive with the proposed one, the runtime comparison given in Table II indicates its