Massive multiple-input multiple-output (MIMO) can significantly enhance spectral efficiency and reduce interference in cellular networks, and thus has been regarded a key enabler for 5G and beyond networks. However, the benefits heavily rely on accurate channel state information (CSI), which represents a major challenge, especially for systems with higher frequencies, e.g., millimeter wave (mmWave) massive MIMO systems . It is largely due to the faster temporal variations and the large dimension of the channel matrix. A common approach is to send sufficient pilot symbols for reliable channel estimation, which then enables coherent data detection. However, with the limited coherence time and large dimension of the channel matrix, the pilot overhead will easily occupy too much radio resource. In addition, due to the limited number of orthogonal pilot sequences in multi-cell systems, pilot contamination will further jeopardize the performance of coherent massive MIMO systems . To overcome this difficulty, blind data detection methods have been proposed to recover the data from the received signal without training pilots [4, 5, 6, 7]
. The blind scheme can achieve a degrees of freedom (DoF) offor a massive MIMO system with transmit antennas and receive antennas in a rich scattering environment where is the coherence time . This is the same as that achieved by a coherent massive MIMO system taking account of the pilot overhead .
Recently, it has been shown that by exploiting the sparsity structure of massive MIMO channels, the performance can be further improved [8, 9, 10, 7]. Particularly, Zhang et. al  showed that under some regularity conditions, blind detection can achieve a DoF arbitrarily close to for sparse massive MIMO channels. Approximate message passing (AMP)-based algorithms were proposed in  and  for blind data detection by exploiting the sparsity of the channel, which showed a superior performance. However, AMP-based approaches generally rely on strict assumptions on the prior distributions of the channel and data, and require an iterative message passing algorithm that leads to significant complexity due to the large problem size in massive MIMO systems. In , a subspace-based method, which decomposes the covariance of the received signal into the subspace of the channel and refines the decomposition by exploiting the channel sparsity, was proposed in blind detection for massive MIMO. However, a long sample sequence of the received signal is required to estimate the covariance, which is applicable only with a very long channel coherence time and very low mobility. In addition, all of the aforementioned works focused on algorithm development, but did not provide a theoretical analysis of the achievable performance.
In this paper, we propose a novel formulation for blind data detection for sparse massive MIMO channels, supported by an efficient algorithm which does not require the knowledge of the prior distributions of the data and the channel. In addition, the proposed scheme does not contain tuning parameters, and enjoys fast implementation. The theoretical justification of the formulation and the convergence rate of the algorithm are also developed. The main contributions are summarized as follows:
Data Concentration Property:111In this paper, the term concentration represents the concentration of measure phenomenon. This phenomenon can be informally expressed as
“A random variable that depends in a Lipschitz way on many independent variables is essentially constant.”. In this paper, we exploit the statistical information of the data transmitted in general communication systems for a novel formulation of the blind detection problem. Specifically, we show that under mild conditions, there is a data concentration phenomenon, which enables a simple blind sparse recovery formulation of a massive MIMO system over the Stiefel manifold.
Blind Detection via -norm Maximization: It was shown in  that the channel sparsity leads to a fundamental performance gain for the degrees of freedom of massive MIMO systems. Hence our proposed formulation leverages the sparsity structure of massive MIMO channels. Traditionally, -norm is the most widely used formulation to induce sparsity . However, the non-smooth nature of the -norm results in low-convergence speed and high complexity . In this paper, inspired by the smoothness and the fact that a high-order norm promotes sparsity [14, 15, 16, 17, 18], we consider -norm maximization over the Stiefel manifold for blind data recovery of massive MIMO systems with sparse channels. Furthermore, we show that the global maximizer of the formulation is arbitrarily close to the true data up to a phase and permutation ambiguity for a sufficiently large number of antennas.
Efficient Algorithm and Convergence Rate Analysis: By taking advantage of the smooth property of the -norm and the geometric structure of the Stiefel manifold, a parameter-free algorithm with low complexity is proposed for blind data recovery for massive MIMO systems. We show that under mild conditions, the proposed algorithm converges to the stationary point of the -norm maximization problem. We further show that the convergence rate is given by for a large-scale receive array, where is the number of transmitters, is the number of receivers, and are the large-scale fading coefficient of the
-th transmitter and the variance of the channel noise. The coefficientrepresents the sparsity level of the channel and is the number of iterations.
The rest of the paper is organized as follows. In Section II, we present the system model, the sparse MIMO channel and the data concentration property. In Section III, we present the -norm-based manifold optimization problem as a new formulation for blind data detection problem. We further discuss the structural properties of the optimal solution to the new formulation in the noiseless and noisy cases. A fast and parameter-free algorithms is proposed and the corresponding convergence analysis are given in Section IV. Sample numerical simulation results are provided in Section V. Finally, Section VI summarizes the work.
Notations: and denote the inverse and conjugate transpose of matrix , respectively. represents the null space of . is used to take an element-wise abstract value. represents a diagonal matrix constructed by using as the diagonal elements. The
-th row vector and-th column vector in are and , respectively. represents the element in the -th row and -th column of . denotes the Hadamard product. denotes the ceiling operator. is the general inner product of and . Finally, , and are respectively, the Frobenius norm, spectral norm and induced norm of matrix .
Ii System Model
In this section, we introduce the system model of the considered massive MIMO system, followed by the sparse channel model and the data concentration property.
Ii-a System Model
Consider a single-cell mmWave massive MIMO communication system, as illustrated in Fig. 1. There are single-antenna users transmitting to a base station (BS) that is equipped with receive antennas with . Denote as the symbols transmitted by user within one frame. The aggregate transmit symbols of the users are denoted by . For simplicity, we consider a block flat fading channel, but the framework can be easily extended to OFDM systems. The received signal at the BS is given by
where is the transmit power of the users, is the aggregate MIMO channel matrix, is the channel matrix between the -th user and the BS, is the aggregate large-scale fading coefficients of the users to the BS, and is the aggregate additive channel noise of independent and identically distributed (i.i.d.) elements. We assume that the BS has knowledge of , which can be obtained in practice with a very low signaling overhead due to the slowly varying path gain.
Ii-B Sparsity of the mmWave MIMO Channel
The mmWave propagation environment is well characterized by a clustered channel model , which can be parameterized by paths of the -th user. The small-scale mmWave channel matrix between user and the BS during the coherence time is given by
where denotes the normalized path gain of the -th path for the -th user. We assume that
are i.i.d. random variables following the complex Gaussian distribution, and denote the azimuth and zenith angles of arrival (AoA) of the -th path for the -th user, and represents the receive and transmit array response vectors. For simplicity, we assume that the BS is equipped with a uniform rectangular planar array (URPA) with and elements () in the horizontal and vertical direction, respectively. The array response vector is given by 
where and .
The spatial aggregate channel of users can be expressed by a “virtual angular domain” representation  as follows:
where is the steering matrix for the receive array. With an receive URPA, we have , where and
are the unitary discrete Fourier transform (DFT) matrices.can be interpreted as the normalized channel gain between the -th user and the -th discrete receive angle . From , the number of clusters is quite limited in the mmWave band. Furthermore, is usually quite large to mitigate the path loss effect in mmWave frequencies . Hence, the number of paths is usually much smaller than the channel dimension, i.e., . Therefore, can be regarded as approximately sparse.222The reason why is not exactly sparse is the mismatch between the exact receive angle and the predefined one by , a.k.a. the energy leakage phenomenon. In this paper, we define the sparsity level as the average number of non-zero elements in the sparse channel , i.e., . Fig. 2 shows a realization of for a massive MIMO mmWave channel according to the statistical spatial channel model (SSCM) implemented in NYUSIM . As illustrated, most of the entries of are quite small (with 333Here we count a value that is less than 1% of as zero. due to the limited number of scatterers in mmWave channels . The channel sparsity can be further promoted through a more accurate sparse basis , which can be found using the offline learning method proposed in .
Ii-C Data Concentration on the Stiefel Manifold
The aggregate data matrix transmitted by the users is assumed
to be independent with zero mean and a normalized covariance .
The following proposition establishes that
is approaching exponentially fast444One can use the simple Chebyshev’s inequality to show that ,
where is the variance of the square of each element
of , from the standard argument of the law of large numbers
, from the standard argument of the law of large numbers[26, Section 8.2, Theorem 2.1]. However, our result is stronger as it shows that the concentration is exponentially fast with respect to under some mild conditions. with respect to .
(Exponential concentration of data)
Assume the elements of matrix
are independent, zero-mean and have bounded support given by
with . Then,
there exists a constant such that, for any ,
See Appendix -A. ∎
The conditions in Proposition 1 can be satisfied by a general constellation in the transmit data of massive MIMO systems. For example, the two types of constellations used in 5G systems, i.e., phase-shift keying (PSK) and quadrature amplitude modulation (QAM), both have zero mean and bounded support. For example, quadrature PSK (QPSK) symbols have after the power normalization. From Proposition 1, as long as
, the probability ofdecays exponentially fast to , as illustrated in Fig. 3.
Based on the data concentration property in Proposition 1, we assume that lies on a Stiefel manifold as defined below.
(Complex Stiefel manifold) The complex Stiefel manifold is defined as the subspace of orthonormal - frames in , namely,
Iii Problem Formulation: -norm Maximization over the Stiefel Manifold
In this section, we formulate blind data recovery for massive MIMO systems with sparse channels as an -norm maximization problem over the Stiefel manifold, and discuss the structural properties of the optimal solution.
Iii-a Problem Formulation
where . We first briefly review some conventional approaches for blind data detection in massive MIMO systems exploiting both the channel sparsity and the data concentration, i.e., .
Conventional -norm-based formulation
To exploit the data concentration property of , we observe that is also a sparse quantity with sparsity induced by . Hence, one may directly use the -norm to promote sparsity and formulate the blind data recovery as
where the solution is an estimate of . However, Problem (7) may lead to trivial solutions when . When , will have a null space with rank . Hence is a trivial solution to Problem (7) with . When , solving Problem (7) will recover the data . However, due to the non-smooth nature of the -norm, algorithms such as subgradient descent  will have a high complexity. Furthermore, it was shown in  that the formulation in (7) is sensitive to noise because the -norm essentially encourages all small entries to be 0 .
Complete dictionary learning approach
Another approach to exploit the property and the channel sparsity is to apply the complete dictionary method . Specifically, the approach recovers each column of sequentially[29, Section 3]. To recover the -th column of , we solve the following problem:
where is to promote sparsity. Consider as the solution of (8), then is an estimation of , where is an orthonormal basis for . However, this method requires that the dictionary must be a square matrix (complete). In massive MIMO systems, we usually have a coherence time larger than the number of users . Hence, is usually not square. Therefore, this formulation cannot be directly adopted for the blind data recovery of massive MIMO systems.
As illustrated above, conventional approaches are either inefficient or have strict requirements on the system parameters. Thus, an alternative formulation to exploit both data concentration and channel sparsity is needed. Recently, it has been shown in the machine learning literatures[16, 14] that maximizing a high-order norm (-norm, ) leads to sparse solutions. An intuitive explanation is that the sparsest points on the unit -sphere, e.g., points , , and in , have the largest -norm , as shown in Fig. 4.
Inspired by this observation, we propose a smooth non-convex alternative formulation of Problem (7). The new formulation can avoid the aforementioned issues caused by the conventional -norm-based formulation. It also relaxes the requirement on the coherence time. The proposed problem is given by
In Problem (9), we choose the cube of the -norm to promote sparsity, with the following justifications. First, maximization of will not lead to trivial solutions caused by minimizing the -norm formulation, i.e., , for . Second, the smoothness of the object function will make it possible to design a fast-convergence algorithm. Third, is a milder sparsity promoting function since it is very flat around 0, which will not encourage all small entries to be 0 and thus is insensitive to small noise in the signal. Finally, can achieve the smallest sample complexity555From a machine learning perspective, the sample complexity of a machine learning algorithm represents the number of training-samples needed to successfully learn a target function. for exact recovery compared to other choices of , . In the following part, we shall provide theoretical support for formulation (9) by showing that solving it will recover the data matrix. We start with the noiseless case, and then generalize the analysis to the practical noisy case.
Iii-B Theoretical Analysis for Noiseless Case
Without loss of generality, we assume and for the noiseless case. We first analyze the properties of Problem (9) and show that the global optimal solution is arbitrarily close to the true data up to a phase-permutation ambiguity. Specifically, the two solutions, and , are called equivalent up to a phase permutation ambiguity if , where is a phase-permutation matrix as defined below.
(Phase-permutation matrix) The dimensional phase-permutation matrix is defined as:
where with and , with being a standard basis vector, and being any permutations of the elements.
Note that a phase-permutation ambiguity can be resolved with very small signaling overhead in massive MIMO systems. We shall defer the discussion to Section IV-C.
The following theorem summarizes the key result, namely, that the optimal solution of Problem (9) is arbitrarily close to the true data (up to a phase-permutation ambiguity) for a sufficiently large number of antennas , thus, demonstrating the correctness of our formulation.
(Blind data detection in the noiseless case) Let with 666 is a product of an independent Bernoulli random variable with parameter and a circular symmetric complex normal random variable, i.e., , where , . Theorem 1 is based on the assumption that elements of are i.i.d. Bernoulli-complex Gaussian random variables., , and . Define as the set of optimal solutions of Problem (9) and . For any , there exists a constant , for , such that
where is a phase-permutation matrix.
See Appendix -B. ∎
Iii-C Theoretical Analysis for Noisy Case
In this section, we study the robustness of the solution to Problem (9) with respect to noise. The result is summarized in the following theorem.
(Blind data detection in the noisy case) Let with , , , with , and . Define as the set of optimal solutions of Problem (9) and . For any , there exists a constant , for , such that
where and is a phase-permutation matrix.
See Appendix -C. ∎
From Theorem 2, there is a high probability that the optimal solution to Problem (9) will be within an uncertainty ball centered at (up to a phase-permutation ambiguity) with a radius of , where is the noise variance. The radius of the uncertainty ball decreases with decreasing according to . When , the result reduces to Theorem 1 in the noiseless case.
Iv A Low-complexity Parameter-free Algorithm
In this section, we propose a low-complexity parameter-free algorithm to solve Problem (9) and provide the corresponding convergence analysis. The resolution of the phase-permutation ambiguity will also be given.
Iv-a Review of the Gradient Method over the Stiefel Manifold
Problem (9) is an optimization problem over the Stefiel manifold. As such, one can apply gradient search over the Stiefel manifold  to solve Problem (9). Specifically, the gradient iteration over the Stiefel manifold is given by
where is the iteration index and denotes the gradient of the objective function at , which is given by the orthogonal projection of the Euclidean gradient onto the tangent space at . is the stepsize to move in the direction ; and is the retraction on the manifold, which maps from the tangent space onto the manifold itself. However, directly applying the gradient method to Problem (9) suffers from a high per iteration complexity due to the two maps in each iteration. Specifically, to compute one needs to projects the Euclidean gradient onto the tangent space of and maps the tangent vector onto the Stiefel manifold. Moreover, to find the optimal stepsize , a curvilinear search is required in each iteration, which is usually time-consuming.
Iv-B Parameter-free Algorithm to Solve Problem (9)
To resolve the limitations of the gradient method, we propose a low-complexity and parameter-free algorithm to solve Problem (9). The algorithm is derived based on the Frank-Wolfe method , which considers a linear approximation of the objective function at each iteration, hence substantially simplifying the per iteration computation.
Iv-B1 Derivation of the proposed parameter-free algorithm
The Frank-Wolfe method for Problem (9) iterates according to
where is the Euclidean gradient at of the objective function of Problem (9). We shall elaborate on these two steps next.
Step 1: Simple computation of (11) by exploiting the Stiefel manifold constraint. We first focus on the computation of (11). This step aims to find an such that the inner product achieves the maximum, which can be obtained by projecting onto the Stiefel manifold . This is similar to retraction in the gradient method, , which maps the moving direction onto the manifold. The difference is that retraction is a mapping from the tangent space of to the manifold. However, for the Stiefel manifold, polar-decomposition-based retraction, , is a projection-like retraction  which does not require the direction to be in the tangent space. Therefore, we can directly use polar-decomposition-based retraction to obtain
. In this paper, we use compact singular value decomposition (SVD) to implement polar factorization. By compact SVD, we have
Step 2: Optimal step size in (12) by exploiting the convex objective function. In general, the update Step (12) requires a time-consuming line search. By exploiting the convexity of the objective function of Problem (9) as well as the geometric structure of the Stiefel manifold, we show in Lemma 1 that the optimal step size in (12) is given by .
Iv-B2 Convergence analysis
Define the first-order optimality metric as
The metric in (17) is a measure of the optimality
of due to the following lemma.
(Optimality Measure) is a stationary point of Problem (9), if and only if .
See Appendix -E. ∎
Based on this, the following theorem summarizes the convergence of the proposed algorithm to solve Problem (9).
(Convergence of the proposed algorithm) Let be the sequence generated by the proposed algorithm in (16) with a random initial point . We have
is monotonically increasing.
See Appendix -F. ∎
To evaluate the impact of the key system parameters on the convergence rate of the proposed algorithm in (16), we present the following theorem.
See Appendix -G. ∎
Theorem 4 shows that the convergence rate is , with high probability. This suggests that a smaller number of receive antennas , a smaller number of users , a lower sparsity level , or a smaller noise variance will lead to a faster convergence rate. The convergence rate in (18) is also consistent with the simulation results in Fig. 6 as will be shown later.
Iv-C Resolving Phase-permutation Ambiguity
According to Theorem 2, there is a phase-permutation ambiguity in the solutions of Problem (9). This ambiguity can be resolved with little overhead [9, 8, 37]. Specifically, after we obtain the output of Algorithm (16), , the phase-permutation ambiguity can be resolved by the following two steps [8, 37]:
Step 1: Using one common reference symbol to eliminate the phase ambiguity. Without loss of generality, we assume that the first transmitted symbol of each user is the reference symbol, i.e., = (as illustrated in Fig. (a)a). The result after eliminating the phase ambiguity is given by
where is an estimation of with , as illustrated in Fig. (b)b.
Step 2: Using symbols as the user ID to eliminate the permutation ambiguity. After eliminating the phase ambiguity, the permutation ambiguity can be eliminated by comparing the user ID with symbols, where is the size of the modulation alphabet ,as illustrated in Fig. 5.
Table I compares overheads for different schemes, which, together with the achievable rate comparison in Fig. 8 shows that the proposed blind detection scheme achieves a higher achievable rate by saving the pilot overhead.
|Pilot-based coherent detection|
|Proposed blind data detection||1+|
The overall algorithm is summarized by Algorithm 1.
V Simulation results
In this section, we first numerically verify the convergence property of the algorithm proposed in Section IV, and then present comprehensive simulation results to show the superiority of the proposed scheme for blind data detection in massive MIMO systems.
V-a Convergence Property
The convergence property is verified with randomly generated and , such that777 and . Without loss of generality, we fix in the following results. Fig. 6 considers the noisy case, i.e., and is the additive Gaussian channel noise with zero mean and variance , to illustrate how the key system parameters influence the convergence rate. The value of the objective function is normalized by , which is the theoretical maximum value of the objective function with a sufficiently large (further details about this value can be found in Appendix -C). The curves are plotted for individual trials with different experimental settings when . The results imply that the convergence rate is influenced by , , and the noise variance . Specifically, a smaller , a smaller , a smaller , or a smaller noise variance will accelerate the convergence rate consistent with Theorem 4.
V-B Performance Evaluation
In this section, we evaluate the proposed method, in comparison with existing ones. In the following results, the spatial channel matrix is generated according to Eq. (2) with paths and i.i.d. Gaussian with unit variance for all users. Each user has independent azimuth and elevation AoAs, and
, which are assumed to be uniformly distributed inand . The antenna elements in the URPA are separated by a half-wavelength distance. The “virtual angular domain” channel is obtained according to Eq. (4). Each element of is drawn from the i.i.d. QPSK symbols and normalized by . All the results are conducted by averaging over Monte Carlo trials, unless otherwise specified.
V-B1 Performance and runtime comparison with baselines
To demonstrate the benefits of the proposed method, we introduce the following four state-of-the-art blind data detection schemes as baselines. Note that we apply the ambiguity elimination scheme described in Section IV-C to all the approaches.
Baseline 1 (Pro-Bi-GAMP) : This is an approximate probabilistic message passing algorithm for blind data detection. In the simulation, we use EMBiGAMP_DL  in the GAMP Matlab package888The Matlab code can be downloaded at https://sourceforge.net/projects/gampmatlab/. and add the projection operation illustrated in . This baseline is introduced to show the robustness of the proposed method to different distributions of the channel and data.
Baseline 2 (-norm-based method): For -norm-based methods, we adopt the one proposed in  for complete (orthogonal) dictionary learning. In the simulation, we divide the data matrix into adjacent squared matrices and implement the -norm-based method for each squared matrix. This baseline is introduced to show the effectiveness of the proposed problem formulation.
Baseline 3 (-norm-based method): This baseline solves the -norm-based problem; i.e., in Problem (9) is changed into . A similar scheme has been used in image processing to solve an orthogonal dictionary learning problem[14, 18] in the real-value domain. This baseline is introduced to show the importance of the choice of for the high-order-norm-based problem formulation.
We use the average error vector magnitude (EVM) as the performance metric for data detection. The specific expression is given by
We also compare the proposed scheme with the following training-based sparsity-exploiting data detection scheme.
Baseline 5 (Pilot-based method) : This baseline is a pilot-based method which leverages the sparsity of channel. In the channel estimation phase, randomly generated training symbols with length are sent to the BS. The sparse channel is then estimated by solving the popular regularized least-squares problem: , where is the received signal in the training period. After is estimated, the transmitted data is detected via zero-forcing (ZF). This baseline is presented to show the superiority of the proposed blind scheme compared to the sparsity-exploiting training-based method.
To facilitate the comparison with the training-based method, we introduce the following achievable rate metrics
for the blind and training-based schemes, respectively. For the blind schemes, the overhead is caused by the common reference symbol to eliminate the phase ambiguity, and the loss is caused by the permutation ambiguity. We compare the performance of the proposed method with the aforementioned baselines in terms of the EVM, computation time and achievable rate with , , where ,, and is generated as the white Gaussian noise with variance .
Fig. 7 shows the performance comparison with , and . We see that the proposed scheme999The rounding technique in  is adopted for the -based method. exhibits the best performance in terms of EVM. The proposed scheme outperforms Baselines 1 and 2 because that it is more robust to noise and the power leakage phenomena. Specifically, the sparsity penalty is a milder sparsity penalty which is very flat around 0 and insensitive to noise in the signal, while the strict sparsity penalties used in Baselines 1 and 2 essentially encourage all small entries to be 0 . Baseline 3 shows inferior performance since it requires a higher sample complexity than the proposed -norm-based formulation . Although Baseline 4 shows a competitive performance to the proposed one, the runtime comparison given in Table II indicates its