Cell-free massive multi-user (MU) multiple-input multiple-output (MIMO) wireless systems promise significant enhancements in spectral efficiency compared to traditional cellular systems[2, 3, 4, 5]. The distributed nature of such systems assures that every user equipment (UE) is able to communicate with multiple nearby access points (APs) [6, 7]. Cell-free massive MU-MIMO systems are envisioned to operate in time-division duplex (TDD) mode. The ideal case for channel estimation in the uplink would be to use orthogonal pilot sequences. However, densely-populated cell-free massive MU-MIMO systems, in which hundreds or even thousands of UEs communicate in the same time-frequency resource, prevent the use of orthogonal training sequences as it would reduce the achievable data rates. While nonorthogonal pilots can certainly mitigate this issue, without taking special precautions, the accuracy of the extracted channel estimates will be severely compromised, resulting in poor spectral efficiency.
In order to address this issue, recent research has mainly focused on pilot reuse, and maximizing the signal-to-interference-plus-noise ratio (SINR) with linear estimation and equalization methods while taking pilot contamination into account [8, 9, 10, 11, 12]. While such approaches have relatively low complexity, they allow for high spectral efficiency only in scenarios in which a large number of AP antennas serve a far smaller number of UEs [13, 14, 7, 15, 16]. In other words, as the number of UEs approaches or even exceeds the number of AP antennas, the performance of such densely-populated cell-free massive MU-MIMO systems degrades considerably, especially when relying on linear channel estimation and data detection methods .
We propose a novel joint channel estimation and data detection (JED) algorithm tailored to densely populated cell-free massive MU-MIMO systems in which the number of UEs is close to or larger than the number of AP antennas. The distributed placement of UEs and APs results in sparse channel matrices, as every UE is only nearby to a small number of APs. The proposed JED algorithm simultaneously exploits the sparsity of cell-free massive MU-MIMO channels and the boundedness of constellation sets in order to minimize the pilot overhead while providing high spectral efficiency. Our algorithm approximately solves a relaxed version of the maximum a-posteriori (MAP) JED problem using forward backward splitting (FBS). To improve the performance and reduce the complexity of our JED algorithm, we combine nonorthogonal pilot sequences with novel permutation strategies of AP and UE indices, which enable us to find better initializers. We present simulation results that demonstrate the advantages of JED compared to traditional methods that separate channel estimation from data detection in terms of the root-mean-squared-symbol error (RMSSE), bit error-rate (BER), mutual information (MI), and channel-estimation mean-square error (MSE).
I-B Relevant Prior Art
I-B1 Channel Estimation and Data Detection
The majority of research on uplink transmission in cell-free massive MU-MIMO systems has focused on linear methods that separate channel estimation from data detection, such as maximum ratio combining (MRC), zero forcing (ZF), and linear minimum mean-square error (L-MMSE) equalization [18, 15, 7, 19, 20, 21, 22]. The optimization targets of such linear methods are typically the MSE of channel estimation and/or signal estimation, or maximizing post-equalization SINR. Albeit computationally efficient and easy to analyze, linear methods do not perform well in systems (i) that use nonorthogonal pilot sequences or (ii) where the number of UE antennas approaches the number of AP antennas . The situation is further aggravated in overloaded systems, where the number of UE antennas exceeds the number of AP antennas. In contrast, our nonlinear JED algorithm enables reliable transmission in densely-populated systems with (often significantly) fewer pilots than UEs. In addition, our JED algorithm will not cause an increase in fronthaul data rates compared to the centralized data detectors put forward in . Concretely, given a cell-free massive MU-MIMO system with APs, each equipped with antennas (we assume ), and single antenna UEs transmitting pilots and payload for time slots within one coherence block, the total amount of fronthaul signaling is complex scalars. This is the same as that of the level centralized method in [22, Tbl. I] (where and
). Furthermore, our JED algorithm does not require knowledge of second-order statistics on the UEs channel vectors, which further reduces the pilot and fronthaul overheads compared to the methods in. As a drawback, the complexity of our JED algorithm is substantially higher than that of linear methods. Nonetheless, as we will show in sec:numerical results, linear methods that separate channel estimation from data detection, even when performed in a centralized manner, perform only poorly in densely-populated scenarios. Moreover, decentralized linear data detectors as proposed in [21, 22], which excel in complexity and scalability, perform even worse than their centralized counterparts and are not suitable for densely-populated scenarios.
I-B2 Joint Channel Estimation and Data Detection
JED has been studied in the small-scale MIMO literature [23, 24, 25, 26, 27]. While the complexity of such methods does not scale well to large systems, an efficient JED algorithm has been proposed in  for massive single-input multiple-output (SIMO) systems. For massive MU-MIMO systems, JED algorithms have been proposed only recently in [29, 30, 31, 32, 33, 34, 35]. To the best of our knowledge, none of these methods exploit the specifics of cell-free massive MU-MIMO systems. For example, reference  maximizes the -norm to exploit beamspace sparsity of millimeter-wave (mmWave) MIMO systems. Message passing (MP) algorithms have also been used for JED in [29, 32]. In contrast, our method exploits the sparse nature of cell-free massive MU-MIMO channels in combination with the boundedness of QAM constellations. Furthermore, our UE and AP permutation methods discussed in sec: initialization could also improve the performance of MP-based algorithms.
I-B3 Sparsity in Cell-free Massive MU-MIMO systems
Sparsity of the channel matrices in cell-free massive MU-MIMO systems has, up to now, not been exploited extensively. In [36, 37, 38, 39], the authors exploit the sparsity of beamforming vectors during downlink transmission by only serving a small portion of UEs. Exploiting sparsity to identify active UEs was proposed in . Reference  formulates channel estimation as a convex optimization problem using a sparsity-inducing -norm penalty. In [42, 43], channel sparsity in beamspace domain has been exploited for cellular mmWave communication systems. In contrast, we exploit sparsity for the JED algorithm in cell-free massive MU-MIMO systems, which comes from the distributed placement of APs and UEs, and the fact that the path loss between UEs and APs causes only a small number of strong links to be present.
I-B4 Pilot Design and Reuse
In densely-populated cell-free massive MU-MIMO systems, the shortage of pilots has been identified as a major concern. A straightforward approach is to assume that the number of UEs is smaller than the number of available pilot sequences, which enables the use of orthogonal training [44, 37, 45]. If the number of UEs exceeds the number of available pilots, either pilot reuse or nonorthogonal training is necessary. Reference [10, 46] propose to divide the UEs into fixed groups, each assigned with one pilot, whereas reference  proposes a dynamic pilot assignment strategy. Nonorthogonal pilot sequence design has been studied in [8, 11] aiming at minimizing channel estimation MSE. Nonorthogonal pilot reuse strategies have been proposed in [17, 47]. JED algorithm natively enables the use of nonorthogonal pilot sequences as the data symbols are also used to estimate the channel matrix.
I-B5 Clustering for UEs and APs
UE-centric and AP-centric clustering schemes that aim at reducing the backhaul data transfer have been studied in network MIMO  and rediscovered in cell-free massive MU-MIMO . UE-centric clustering enables the UEs to communicate with only a few nearby APs, which has been studied in [48, 37, 49, 39, 50, 21]. AP-centric clustering only serves a few nearby UEs and has been studied in [18, 7, 6]. As mentioned above, clustering strategies for pilot reuse have been discussed in [10, 46]. Clustering to facilitate channel estimation has been proposed in . In contrast, our approach only clusters AP and UE indices and dynamically constructs virtual cells in which we perform orthogonal channel training as interference among virtual cells is minimized—to minimize intra-virtual-cell interference, we propose to use mutually unbiased bases  as pilot sequences.
Lower case and upper case boldface letters denote matrices and vectors, respectively. We use , , and to represent the entry in the th row and th column of the matrix , the th column of the matrix , and the th element of the vector , respectively. We use , , and for the identity, all-ones, and
all-zeros matrix, respectively. The superscripts, , and refer to the complex conjugate, transpose, and Hermitian transpose, respectively. For the matrices and , we define the real-valued inner product as , where extracts the real part of and is the matrix trace. For the vectors and , we define . Consequently, we have , where is the vectorization operator. The matrix operators and denote the Kronecker product and Hadamard product, respectively. The operator denotes element-wise larger-equal-than. For a matrix , we will use the following entrywise norms: , , and with .
I-D Paper Outline
The rest of the paper is organized as follows. sec:sys model introduces the system model. sec:JED formulates the JED problem and details our FBS algorithm. sec: initialization proposes principled initialization schemes for the nonconvex JED problem. sec:perm discusses UE and AP permutation based on CSI and on physical locations. sec:numerical results analyzes the computational complexity and demonstrates the efficacy of our method via simulation results. We conclude in sec:conclusion.
We now introduce the cell-free massive MU-MIMO system and summarize the channel model.
Ii-a System Model
We focus on the uplink in a cell-free massive MU-MIMO system with distributed single-antenna APs and single-antenna UEs. As shown in fig:cfree mimo, all the APs are connected to a central processing unit (CPU) via a backhaul network. Due to the distributed nature of cell-free massive MU-MIMO systems and the channel’s sparsity, the area can be divided into virtual cells shown with different colors in fig:cfree mimo. Each virtual cell will be constructed dynamically (i.e., dependent on the channel matrix or the physical UE/AP locations) to minimize inter-cell interference—this approach will be detailed in sec:perm. We assume a block-fading scenario with TDD and a coherence time of time slots, where time slots are reserved for pilot-based channel training and time slots for payload data. The input-output relation of the considered frequency-flat111For frequency-selective channels, we can use orthogonal frequency-division multiplexing (OFDM) to obtain an equivalent system model per subcarrier. cell-free massive MU-MIMO system is given by 
where is the receive-signal matrix, is the MIMO channel matrix, contains two parts and will be introduced below, and
models noise, with entries assumed to be i.i.d. circularly-symmetric complex Gaussian with varianceper complex entry. To simplify notation, we separate training from payload by rewriting eq:mimo model as follows:
Here, the matrices and contain training pilots and data symbols, respectively; the pilot sequences are designed as tight frames (see sec:tight frames for the details) and the entries of are chosen from the constellation ; the matrices and contain the received pilot and data symbols, respectively. Our goal is to jointly estimate the channel matrix and detect the entries in from the received signals in and the known training-pilot matrix .
Ii-B Cell-free Massive MU-MIMO Channels
To develop a JED algorithm for cell-free massive MU-MIMO communication, we use the channel model put forward in  and consider single-antenna APs (see Remark 1 for a possible generalization to multi-antenna APs). For this model, the channel matrix in eq:mimo model is decomposed as
denotes the normalized uplink transmit signal-to-noise ratio (SNR),is the cell-free channel matrix, and is a diagonal power control matrix. The average power of each transmit symbol in is normalized so that , and the entries of are normalized so that . Following the model in , the entries of are modeled as where and characterize large-scale and small-scale fading between the th receive antenna and the th UE, respectively. We assume and is detailed in sec:comp under cfree. The power control matrix is used to attenuate the transmit symbols in and we absorb its effect in the channel matrix .
Since in cell-free massive MU-MIMO systems with random placement of UEs and APs, the UEs are only close to a few APs, most of the entries in will be small—a central property which we will discuss further in sec:channel sparsity. Since , the total received power for the th UE is , which may vary substantially among UEs. To this end, we use a per-UE power control scheme that limits the maximum receive power by restricting the transmission power of some UEs based on their channel condition. Concretely, we set an upper limit on the received power such that UEs whose received power would exceed the limit have to transmit with lower power—weak users continue transmitting at their nominal power. To achieve this goal, we define the entries of the diagonal power-control matrix as follows:
Here, defines the maximum dynamic range between the weakest and strongest UE received power in decibels. This power control scheme relies on the magnitudes of the channel matrix entries, which change only significantly at the timescale of large-scale fading. Hence, it is possible that APs transmit the power level to the UEs in the downlink phase and the UEs could back-off accordingly. While this power control scheme is merely to confine the dynamic range of the received signals, it does not fundamentally alter the channel’s sparsity property discussed next.
Ii-C Channel Sparsity of Cell-free Massive MU-MIMO Systems
Due to the distributed and random placement of APs and UEs, each UE is likely to be close to only a few APs—this property causes most links to be weak and the channel matrix in eq: channel matrix to be sparse. fig: chnl_none illustrates this key property, where we show the absolute values of a channel matrix for APs and UEs placed randomly in a km square area.
Since the enumeration of APs and UEs is arbitrary, and only a few entries of the channel matrix contain most of the energy, it is key to realize that one can permute the rows (APs) and columns (UEs) of the channel matrix to approximate a block-diagonal structure, by merely re-indexing the UEs and APs from the CPU’s viewpoint. As an example, fig: chnl_csi and fig: chnl_phy show block-diagonal structures that can be obtained by leveraging either channel-state information (CSI) or physical UE location, respectively. Interestingly, the UEs within each diagonal block will experience strong inter-UE interference, whereas UEs in different blocks will experience only little interference. Effectively, such clustering strategies create virtual cells, which can be used to perform orthogonal training within each virtual cell and nonorthogonal training among virtual cells where interference is minimized. See sec:perm for the details.
Iii Joint Channel Estimation and Data Detection
We now formulate the JED problem and then relax it so that it can be solved approximately using FBS .
Iii-a The MAP-JED Problem
Using Bayes’ theorem and the assumption made in sec:sys model, the channel matrixand data matrix
can be recovered jointly by maximizing the posterior probability density function (PDF) as follows:
Here, we assume that and are independent, and the entries of are i.i.d. taken from the constellation set . Since the entries of are assumed to be i.i.d. circularly-symmetric complex normal, the conditional PDF in eq:MAP original is given by
Due to channel sparsity, we assume that the channel coefficients in follow a sparsity-inducing complex-valued Laplace prior. With the definition in [55, Eq. 14] and the assumption that the entries in are i.i.d., the joint PDF is
By inserting eq:MAP pdf1 and eq:Laplace into eq:MAP original, we obtain the following equivalent MAP-JED problem:
Here, the parameter controls the channel’s sparsity, where larger values promote sparser channel matrices.
For multi-antenna APs, we can generalize our problem formulation by leveraging block-sparsity . This requires us to replace the Laplace prior in eq:Laplace by , where , , and . Here, is the channel vector of UE to AP , where is the number of antennas for each AP. The permutation technique and the JED solver can also be adapted to this block-sparsity prior. While the multi-antenna AP case might be more practical, we stick to the single-antenna AP case for simplicity of exposition.
For cell-free massive MU-MIMO systems that have access to second-order statistics for each UE channel vector, as, e.g., in [21, 22], one can adapt our JED problem formulation with a suitable Gaussian prior for
instead of the Laplace prior in eq:Laplace. While our JED formulation in eq:opt1 requires only one hyperparameter (namely), a MAP-JED method that exploits such second-order statistics would require additional pilot resources . For the sake of brevity, a detailed comparison between the two approaches is left for future work.
We note that JED in cell-free massive MU-MIMO systems is different from that in cellular massive MIMO systems for two reasons. First, channel sparsity naturally arises in cell-free massive MU-MIMO channels as a result of APs and UEs placement in space. The collocated antennas at BSs in cellular massive MU-MIMO system results in approximately the same path loss, which eliminates channel sparsity. The second reason is that channel sparsity in cell-free massive MU-MIMO systems enables us to group APs and UEs in such a way that the sparse channel matrices are approximately block diagonal (cf. fig:channel plots). This structure enables us to deploy fewer pilots than UEs, while the UEs within each virtual cell (corresponding to a block in the block-diagonal matrix) can still be furnished with orthogonal pilots and UEs among virtual cells with near-orthogonal pilots. We will further detail this idea in sec: initialization.
Iii-B Biconvex Relaxation of the JED Problem
We now provide means that enable us to approximately solve the MAP-JED problem in eq:opt1 with manageable complexity. We start by relaxing the discrete constellation set to its convex hull, which is defined as 
where is the th symbol in . Note that for QPSK with , the convex hull is a box around the four constellation points. This relaxation enables us to find solutions in a continuous region instead of a discrete set and has been used recently for massive MIMO data detection which separates channel estimation from data detection [57, 58, 59].
To improve the performance of the relaxed problem, we additionally use a strategy put forward in . Intuitively, for QPSK, we are favoring solutions near the four corner points. Thus, we add a concave regularizer with parameter to the objective of the relaxed problem which pushes the solution towards the corners of the convex hull:
While the problem eq:opt3 remains nonconvex, the following lemma establishes conditions for which the problem is biconvex in and . A short proof is given in app:biconvexity.
The problem in eq:opt3 is biconvex in and if , where is the smallest eigenvalue of
is the smallest eigenvalue of.
Iii-C Uniqueness of the JED Solution
Since our goal is to simultaneously recover the channel matrix and the data symbols in , certain nonuniqueness issues of the solution may arise. We now show how such ambiguities can be avoided with suitable pilot matrices . Define a diagonal phase-shift matrix with , , and a permutation matrix where is a standard basis vector, which is one in the th entry and zero otherwise. Let be a solution to eq:opt3. Then, the alternative tuple can also be a solution as it has exactly the same cost in eq:opt3 and satisfies the constraints, as long as the phase shifts satisfy , where are entries of the training matrix . Such nonuniqueness issues have been studied in [34, 33, 61, 35, 32] and can be resolved in various ways. Pilot-based systems with orthogonal pilots avoid such issues entirely. Since our goal is to undertrain channels with nonorthogonal pilots, uniqueness of a solution to eq:opt3 is no longer guaranteed. We now provide a simple condition for which no phase-permutation ambiguity can arise. A short proof is given in app:phase-permutation-ambiguity.
Fix a pilot matrix , where has normalized columns so that . Let
be the coherence of the matrix . If , then no phase-permutation ambiguity can exist.
In sec:tight frames, we provide pilot matrices that avoid the phase-permutation ambiguity and enable accurate channel estimation even in heavily undertrained systems.
Iii-D JED via Forward-Backward Splitting
We now show an FBS-based approach to approximately solve the problem in eq:opt3 at low complexity. Due to the nonconvex nature of eq:opt3, FBS is not guaranteed to find an optimal solution. Nevertheless, we show in sec:convergence that FBS is guaranteed to converge to a stationary point with a proper stepsize. Furthermore, we show in sec:numerical results that our algorithm performs well for various performance metrics.
FBS is an efficient numerical method to iteratively solve convex optimization problems of the following form :
Here, the function is differentiable and convex, and is a more general (not necessarily smooth or bounded) convex function. Given a non-analytic function , we use the Wirtinger derivatives  to define the gradient. To this end, after initializing the algorithm with , FBS solves the problem in eq:fbsproblem via iterations by computing
where is the gradient of with respect to , and is a per-iteration stepsize. We use the adaptive stepsize selection proposed in [54, Sec. 4.1] to accelerate the convergence. The proximal operator for is defined as
Instead of performing alternating optimization in and , we use FBS to solve for both matrices simultaneously. We group the two matrices together by defining , where and is known and fixed throughout the iterations; the matrices and contain the optimization variables. We define the functions and in eq:fbsproblem as
where we define the indicator function as
With the above definitions, the objective function consisting of a sum of eq:fterm and eq:gterm is not analytic. That said, the objective function is not only dependent on the complex matrix , but also implicitly on and hence the quantities and are both gradients of with respect to and . According to [62, Eq. 4.49], the steepest descent direction is simply and thus the complex-valued gradient of is given by
The proximal operator for is given by
where is the shrinkage operator  defined as
where division and multiplication are interpreted entry-wise. Here, is the sparsity parameter, is the per-iteration stepsize, and we define for .
The proximal operator for can be derived from eq:proximal def as
where we moved the indicator function in back to the constraint. By completing the square, the solution to eq: NC proximal S is
where . Note, for stability we must choose to be small enough that . The right-hand side proximal operator is the projection onto the convex hull . For complex-valued QAM constellations, each element of the projection is applied independently to real and imaginary parts as
where and defines the radius of the box around .
Iii-E Convergence of FBS for the JED Problem in eq:opt3
Since the problem in eq:opt3 is nonconvex, we now analyze the convergence properties of FBS which depend on the initial choice of the initialization variable . For this reason, rather than identifying a specific stepsize to use, it is simpler to guarantee convergence when a simple backtracking line search is used . The following result is proven in app:fbsproof.
Let be the objective function given by eq:fterm and eq:gterm. Suppose that the stepsizes of FBS are bounded away from zero, and selected small enough to satisfy the following backtracking line search condition:
where is the gradient for at . Then, the objective decreases monotonically, i.e., we have
In addition, if is further restricted to satisfy , then the sequence of iterates converges.
Note that while FBS for solving eq:opt3 is guaranteed to converge if the stepsizes are chosen appropriately, it is not guaranteed to converge to an optimal solution. We reiterate that our FBS solver only approximately solves the formulated MAP-JED problem in eq:opt3 but our simulation results in sec:numerical results demonstrate that it converges to excellent stationary points. Establishing stronger optimality guarantees is extremely challenging and left for future work.
Iv Initializing FBS-JED
We now show methods to initialize our FBS-JED algorithm that improve performance and reduce complexity.
Iv-a Pilot Sequence Design
One key aspect for JED is designing suitable pilot sequences, which is particularly important as we focus on undertraining the channel matrix with nonorthogonal pilots. Concretely, we will use pilot matrices with low coherence as defined in eq:coherence.
As in lem:phasepermutationambiguity, let the pilot matrix be , where has unit-norm rows and normalized columns . A prominent instance of matrices with near-orthogonal columns are equiangular tight frames (ETFs)  for which all pairs of inner products, i.e., for achieve the same coherence , given by the Welch lower bound . Furthermore, ETFs have orthonormal rows, i.e., . Another useful class of matrices with low-coherence are mutually unbiased bases (MUBs). MUBs have the following block structure , where with blocks. Besides having orthonormal rows, MUBs also have orthogonal blocks, i.e., , , and the coherence between any two columns of two different blocks is .
From the perspective of sparse signal recovery, nonorthogonal pilots where and noise may cause ambiguity and render estimation difficult. ETFs and MUBs have low coherence and thus enable stable recovery of sparse signals using -norm penalty as shown in . Even though MUBs have higher coherence than ETFs, the orthogonal structure within each block matrix is particularly helpful for block-permuted channel matrices that form virtual cells (cf. fig:channel plots). We refer the interested readers to [63, 52] for detailed properties and construction steps of MUBs and ETFs. Note that we use ETFs for the -norm-based channel estimator eq:l1 chEst for unpermuted channels, and we use MUBs to estimate each block on the diagonal separately for permuted channels; see sec: chEst JS for the details.
Iv-B Permutation of APs and UEs
The distributed nature of APs and UEs in cell-free massive MU-MIMO systems promotes sparsity in the channel matrix. Given this property together with the fact that enumeration of APs and UEs is arbitrary, we can permute the rows and columns of the channel matrix to attain approximately block-diagonal structure (cf. fig:channel plots). Mathematically, we introduce two permutation matrices and to reformulate input-output relation as:
By defining , , and , we see that all of the assumptions for in sec:sys model and also the cost function in eq:opt3 are invariant to such permutation. Note that all of the assumptions on , , and in sec:sys model are invariant to such permutations, i.e., the problem eq:opt3 can be posed equivalently as
where , , and . For the assumptions on and , the JED problem remains unaffected by such AP and UE permutations, and the i.i.d. Laplace distribution of each entry in and the i.i.d. circular symmetry of the Gaussian noise also remains. In fact, the new variables and are merely two new matrices in and , respectively. Nonetheless, the goal behind such AP and UE permutations are to (i) assign suitable pilot sequences to the UEs and (ii) find better initializers for both and , which matters as we are solving a nonconvex problem using FBS. By permuting the channel matrix to obtain approximately block-diagonal structures as illustrated in fig: chnl_csi and fig: chnl_phy, we can separately perform channel estimation and data detection within these blocks, which we call “virtual cells" in fig:channel plots, with MUBs that approximately decouple the interference among both blocks.
We note that clustering methods in papers on UE-centric cell-free systems, e.g., , typically form overlapping cells, which is cruicial for the design of decentralized and scalable data detection methods. However, we do not perform our permutation approach in that way for two reasons. First, our virtual cells only serve the purpose of assigning pilots to UEs and simplifying channel estimation. Overlapping cells would not help us in accomplishing this goal. Second, the vectors in each sub-block of MUBs are orthogonal, while the vectors from different sub-blocks are correlated. If we were to form overlapping cells, then the block-wise orthogonality of MUBs can no longer be exploited.
Iv-C Initialization of the Channel Matrix
In our conference paper , we used a least-square (LS) channel estimator to initialize . Here, we show that we can improve upon this approach using the permutation idea introduced above. Consider an example with two virtual cells in , where the input-output relation during the training phase is given by the following block structure:
Here, we use the permutation approach to create two virtual cells and whose entries are much stronger than those in the block-off-diagonal matrices and . See fig: chnl_csi and 2(c) for an illustration of four virtual cells. Assume that the pilot matrix is constructed by an MUB with , where and the inter-cell correlation is . Due to the facts that (i) the entries in the block-off-diagonal matrices are much weaker than the block-diagonal matrices and (ii) the coherence between and is low, we can perform independent training within the two virtual cells, assuming the off-diagonal blocks are zero.
We now illustrate this approach for estimating ; the case for is analogous. Since is orthogonal and the entries in are close to zero, we can perform least-squares (LS) channel estimation222We utilize LS channel estimation instead of the linear mean-square error (L-MMSE) estimator as we lack the necessary statistical knowledge of the interference caused by UEs from an adjacent virtual cell, i.e., the distribution of is unknown.
where . The property of MUBs helps to mitigate the noise and the interference present in . To see this, recall that the entries in are i.i.d. circularly-symmetric complex Gaussian with variance . Furthermore, we have . Hence, the covariance and the interference of the noise after LS channel estimation are and , respectively. The interference is small as long as the entries in are small. The presented permutation strategy is designed to ensure this property, i.e., inter-virtual-cell interference is minimized. Consequently, the use of MUBs for initial channel training is sensible for the following reasons: (i) Orthogonal pilots are used within each virtual cell and (ii) noise and inter-virtual-cell interference are further suppressed due to the incoherence between MUB blocks. We reiterate that our clustering approach only improves initialization of our FBS algorithm—sec: complexity shows that this approach results in low MSE and can significantly reduce our algorithm’s complexity. The CPU still solves the JED problem in eq:opt4 as a whole.
Iv-D James-Stein Estimator and Median Absolute Deviation
To further reduce the MSE of initial channel estimation, we propose to use the James-Stein (JS) estimator . By treating as a matrix consisting of circularly-symmetric complex Gaussian random entries with variance per complex entry, we can improve the initial channel estimate as follows. Let represent one column of in eq:chest block. Assuming that is circularly-symmetric complex Gaussian with variance , the complex-valued version of the JS estimator is given by
which results in lower channel estimation MSE compared to the traditional LS estimator if .
The remaining piece of the puzzle is to identify the unknown variance of the noise and interference term . Fortunately, reference  recently provided a computationally efficient way to estimate the noise variance in systems where sparse signals are observed in complex Gaussian noise. By exploiting the sparsity of , we can estimate the noise variance [67, Eq. 4] where refers to the sample median and refers to the entry-wise absolute value square of the vector .
Iv-E Initialization of the Data Matrix
We now show how to initialize the data matrix . We first define a vector that is the vectorized and then compute followed by L-MMSE estimation from the received payload data matrix as since . We reiterate that the proposed methods to initialize the channel and data matrices can also improve the performance and complexity of other JED algorithms.
V AP and UE Permutation
We now propose two algorithms that perform AP and UE permutation with the goal of constructing virtual cells.
V-a CSI-based Channel Matrix Permutation
We start by focusing on a CSI-based permutation approach. Our goal is to cluster the entries in the channel matrix into an approximately block-diagonal structure as shown in fig: chnl_csi. To this end, we define the auxiliary matrix where for all . Given , we can permute its rows and columns by where and are permutation matrices for the rows and columns of . Permuting into an approximately block-diagonal matrix can be formulated as a nonconvex optimization problem:
Here, is the set of all possible permutation matrices, is a mask which determines the structure of the permuted matrix , and indicates there are virtual cells to be constructed. To arrive at an approximately block-diagonal structure with virtual cells on the diagonal, we set to be block-diagonal with diagonal blocks . In our simulations, we set for different modulation schemes which are divisible by and .
Since the complexity of enumerating all possible pair of permutation matrices in eq:combinatorial problem is prohibitive, we use a convexification method put forward in . Specifically, we relax the set of permutation matrices to the set of doubly-stochastic matrices:
While a solution in eq:doubly set is not necessarily a pair of permutation matrices (as the entries may lie in the set ), we use the technique from  to gently push the results to the corners of . The resulting problem to solve therefore becomes
with the parameter . Such a problem can be solved with FBS as well. The gradient can be calculated by exploiting the property of Hadamard product. The proximal can be solved with Douglas-Rachford splitting (DRS). We omit the details of such a solver due to the lack of space.
V-B Physical Location-Based Channel Matrix Permutation
The block-diagonal structure can also be attained by permuting the channel matrix using information on physical locations as shown in fig: chnl_phy. Given Euclidean distances, we group fixed APs into balanced-sized clusters where we assign all of the UEs accordingly.
To obtain balanced-sized clusters, we employ the algorithm put forward in , which consists of an assignment step and an update step for cluster grouping and centroid updating, respectively. For AP clustering, we iterate the two steps until convergence, whereas we only run the assignment step once for UE clustering since the centroids are already obtained in the AP clustering. In what follows, we only demonstrate how to formulate the AP assignment problem in the form that is solvable with FBS and DRS since the update step is obvious.
In order to minimize the overall Euclidean distance between APs and their corresponding centroids, the relaxed version of AP assignment problem can be formulated as
Here, is the partition matrix where indicates the th AP belongs to the th cluster, is the distance matrix consisting of the Euclidean distances between all the APs and their corresponding centroids, and the second term with a parameter indicates that we are favoring solutions close to the corners of . Consequently, eq: phy obj is basically a simplified version of eq:penalty 1.
Vi Numerical Results
We now demonstrate the efficacy of our JED algorithm.
Vi-a Simulation Setup
We evaluate our algorithm with the cell-free channel model detailed in sec:cell free model and consider a square area of km with randomly positioned UEs. As in our previous study , we assess the performance of BPSK, QPSK, 16-QAM with , , and randomly positioned APs, respectively. The maximum UE transmission power is mW and we use the per-UE power control with dB discussed in sec:cell free model. The carrier frequency is GHz and the bandwidth MHz. The receive and UE antennas are at a height of m and m, respectively. We use the three-slope path-loss model defined in . The small-scale fading and large-scale fading parameters between the th antenna and the th UE are and where is the path loss, is dB, and is shadow fading with variance . We permute the channel matrices using CSI-based and physical locations-based methods shown in sec:perm. Pilots are tailored for different channels. We design with ETFs and MUBs for unpermuted and permuted channels, respectively. We perform the -norm-based channel estimator (cf. eq:l1 chEst) for unpermuted channels. For permuted channels, we estimate channels in the way discussed in sec: initialization.
Vi-B Performance Metrics and Baseline Algorithms
In a cell-free massive MU-MIMO system, the UEs are experiencing different SNRs which prevents us from generating conventional BER vs. SNR plots. Thus, we characterize the per-UE cumulative density function (CDF) for the RMSSE, BER, and channel estimation MSE to examine our algorithm’s efficacy from different aspects. Also, instead of providing a spectral efficiency (SE) analysis, which is difficult due to the nonlinearity of our JED algorithm and would require Gaussian codebooks instead of discrete transmit constellations (which is what our JED algorithm exploits), we numerically calculate the mutual information (MI) between the discrete transmit signals and soft-symbol estimates generated by the data detector for each UE individually, and we show the resulting distribution. Our performance metrics are as follows.
Vi-B1 Per-UE BER
We define the BER for the th UE as , where is the total number of bit errors for UE over payload data slots and is the number of bits per symbol.
Vi-B2 Per-UE RMSSE
We define the RMMSE for UE over payload data slots as
where and denote the estimated and transmitted data symbols of the th UE at time slot , respectively.
Vi-B3 Per-UE MI
In the interest of a SE analysis, we numerically simulate the MI for each UE given the discrete input constellation . The MI for the th UE is defined as
Here, and are the pilot time slots and total time slots, respectively. The prefactor takes the pilot overhead into account and decreases the per-UE MI by the fraction of used pilots (as they do not carry any payload data). The quantities and denote the transmitted and quantized estimated data symbols of the th UE over data slots, is the empirical source entropy of the th UE over data slots, and is the empirical conditional entropy of the th UE over data slots. Since the output of JED is continuous, we quantize the output of JED and numerically compute the empirical entropies.
Vi-B4 Per-UE MSE
We define the channel estimation MSE of the th UE as , where and are the estimated and true channel vectors, respectively.
By treating all of the above performance quantities as random variables, we use Monte-Carlo simulations to characterize their CDFs over multiple UE and antenna placements, noise realizations, and data transmissions. The fraction of Monte-Carlo trials for which the per-UE RMSSE was belowis defined as ; the quantities and are defined analogously. To ensure consistency among all performance metrics (i.e., good performance is indicated by a curve in the upper-left of the respective plot), we define the per-UE MI as (which is technically a complementary CDF) on the y-axis and show the largest per-UE MI value on the left-hand-side of the x-axis—this is in stark contrast to classical CDF plots for the Gaussian SE in the literature (see, e.g., ). We note that all of the above performance metrics come with their own shortcomings. We thus demonstrate the efficacy of our JED problem with all four metrics.
Vi-B6 Baseline Algorithms
To characterize the performance of our JED algorithm, we introduce two baseline algorithms for comparison. The first one is the L-MMSE symbol detector defined in sec: init Sd. To obtain a channel estimate for this detector, we employ the -norm-based channel estimator to exploit the sparsity in cell-free massive MU-MIMO systems:
where we tune the sparsity parameter for each scenario. The other benchmark is the single-input multiple-output (SIMO) lower bound, which perfectly cancels MU interference in a genie-aided fashion . Both baselines are simulated with permuted channels.
Vi-C RMSSE, BER, MI, and MSE Results
We now compare the performance of JED with unpermuted and permuted channel matrices to our baseline algorithms. Figure 3 shows simulation results for a antenna system with UEs transmitting pilots and BPSK payload data over time slots, where (%) pilots are used. The permuted channels have two virtual cells. Figure 4 shows simulation results for a antenna system with UEs transmitting pilots and QPSK payload data over time slots, where (%) pilots are used. The permuted channels have four virtual cells. Figure 5 shows simulation results for a antenna system with UEs transmitting pilots and -QAM payload data over time slots, where (%) pilots are used. The permuted channels have two virtual cells. Note that the differences among the three setups are the number of APs, modulation scheme, and the number of virtual cells. We do not further investigate the case where the available time slots is smaller than the number of UEs ; the interested readers are encouraged to simulate such cases using our code that will be made available on GitHub after possible acceptance of the paper. To understand the performance in terms of RMSSE, it is instructive to compare the resulting RMSSEs to the error vector magnitudes (EVMs) allowed in UE implementation. The EVM characterizes the distortion caused by transmitter hardware on the digital constellation. The 3GPP 5G NR technical specification [73, Tbl. 22.214.171.124-1] allows UEs to distort the BPSK, QPSK, and -QAM constellations by an EVM of , and %, respectively. In fig:BPSK_64x128x128(a), fig:QPSK_128x128x128(a) and fig:16QAM_256x128x128(a), we see that JED enables more than %, % and % of the UEs to have an RMSSE at the receiver that is smaller than the EVM allowed to UE hardware. In fig:BPSK_64x128x128(b), fig:QPSK_128x128x128(b) and fig:16QAM_256x128x128(b), we see that JED enables an uncoded BER for %, %, and % of the UEs for BPSK, QPSK, and 16-QAM, respectively. In fig:BPSK_64x128x128(c), fig:QPSK_128x128x128(c) and fig:16QAM_256x128x128(c), we see that the JED supports % of the UEs to achieve a transmission rate at bits/symbol, bits/symbol, and bits/symbol for BPSK, QPSK, and -QAM, respectively. Particularly, the per-UE MI results suggest that if cell-free massive MU-MIMO systems are equipped with strong error correction codes, over % of the UEs could transmit at high data rates, without the common assumption [13, 14, 7, 15, 16] that the number of UEs should be far lower than the number of receive antennas. In fig:BPSK_64x128x128(d), fig:QPSK_128x128x128(d), and fig:16QAM_256x128x128(d), we see that JED provides dB, dB, and dB lower channel estimation MSE than the -norm channel estimator in eq:l1 chEst for BPSK, QPSK, and 16QAM, respectively. Notably, our JED algorithm achieves satisfactory performance in overloaded () and fully-loaded () MU-MIMO systems. More specifically, we observe that the L-MMSE detector, even when performed in a centralized manner with an -norm-based channel estimator, still performs significantly worse than our JED algorithm. Decentralized data detectors, although attractive due to their low complexity and scalability [21, 22], perform even worse. In fact, this observation is particularly valid in densely-populated systems as shown in fig:BPSK_64x128x128 and 4 where centralized L-MMSE data detection completely fails in the overloaded case () and is almost worse than our JED algorithm in the fully-loaded case ().
We also note that in user-centric cell-free massive MU-MIMO systems , the number of APs that serve each UE is does not necessarily depend on the ratio . However, we observe from the above simulations that increasing the ratio from to significantly improves the efficacy of linear methods. This property applies for both centralized and distributed data detectors—corresponding simulations are omitted due to the page limit.
Vi-D Computational Complexity Analysis
We now analyze the complexity of our JED algorithm. We start by measuring the complexity of the L-MMSE equalizer and our FBS solver by counting the number of real-valued multiplications (and ignore the complexity of additions, square roots, reciprocals, etc.). We assume that one complex-valued multiplication requires four real-valued multiplications. In what follows, the numbers in parentheses refer to the complexity.
The L-MMSE equalizer corresponds to computing . From , we have that the total complexity is . In each iteration of FBS, we first compute . Then, we multiply it with and . The next step is to scale with . Therefore, the total computational complexity in each iteration of FBS is Note that we ignore the complexity of the permutation problem for two reasons. First, this problem only needs to be solved when the large-scale fading components of the UEs change, which is at lower rate than the JED problem and mainly depends on UE locations—this observation is even more obvious for the position-based permutation problem. Second, solving these permutation problems mainly requires additions and other simple operations. Since we measure complexity by counting the number of multiplications, it is challenging to relate the complexity of such operations in a fair manner.
Due to the nonconvexity of JED, better initialization methods improve the performance and require fewer iterations for FBS to converge. Solving the LASSO problem eq:l1 chEst indeed yields a good initializer, but is also computationally intensive. We therefore propose to use the initialization techniques proposed in sec: initialization to reduce complexity while still enabling excellent performance. Figure 8 and 8 show the required iterations for convergence and the initialized channel estimation MSE results for a antenna system with UEs transmitting with QPSK over time slots, where (%) are used for training. The stopping condition requires the ratio between the norm of the estimated gradient in the current iteration and the maximum of the norm of the estimated gradient throughout all the iterations to be smaller than the tolerance. For the unpermuted channels (shown as “None” in fig:QPSK_iterations), ETFs are used for pilots; LS and -norm channel estimator are used to initialize ; the L-MMSE equalizer using the variance of noise is selected to initialize . For the permuted channels (shown as CSI and PHY in fig:QPSK_iterations), MUBs are used as pilots; the block-wise James-Stein estimator aided with the MAD technique is used to initialize (permuted channel matrix); the L-MMSE equalizer is used to initialize as shown in sec: init Sd. Analogously, we use the -norm channel estimator as the benchmark in both plots.
In fig:QPSK_iterations, our proposed initialization techniques enable % of the UEs to converge after iterations which achieves the same convergence speed as the -norm benchmark. In stark contrast, we see that the poor initialization generated by least square (LS) channel estimation requires more than iterations for of UEs to converge. Specifically, after clustering the large entries into blocks on the diagonal, both of CSI and PHY permutation methods are able to halve the required iterations to . Besides, the block-diagonal channel matrix also enables distributed processing with JED in each block in future work, which could be the key to significantly reduce interconnect data rates and algorithm complexity. Clearly, the development of new methods that further reduce the complexity of JED are necessary to enable a successful deployment in practice.
In fig:complexity_reduction_CHEST_MSE, we show the channel estimation MSE for different initialization methods instead of the MSE of the JED algorithm. Analogously, we use the -norm-based method in unpermuted channels as the benchmark. We see that the -norm method provides the lowest MSE for channel estimation and thus has the fastest convergence speed in the complexity comparison. In the same channel, the least square (LS) channel estimation in a fully-loaded system with only nonorthogonal pilots provides the worst channel estimation MSE and yields the lowest convergence speed. The permuted channel matrices have four virtual cells on the diagonal which form the block-wise structure. Such a structure and the usage of mutually unbiased bases (MUBs) enable local orthogonality in each virtual cell with which we perform LS and our proposed initialization techniques in each virtual cell to get dB gain for MSE. Lower initialized channel estimation MSE and better shaped channel matrix together make the convergence of FBS comparable with the -norm-based benchmark.
We emphasize that the computational complexity of our JED algorithm, even when reduced by the proposed initialization methods, remains to be the main bottleneck in practice. One of the goals of our paper is to demonstrate that densely populated scenarios benefit significantly from more sophisticated data-detection algorithms (cf. fig:BPSK_64x128x128 and 4). However, even in conventional cell-free massive MU-MIMO scenarios with more AP than UE antennas, our JED algorithm significantly outperforms linear data detectors (cf. fig:16QAM_256x128x128). As it can be seen in sec:MI results, if we try to serve more UEs, even the centralized L-MMSE data detector fails at providing satisfying performance—alternative methods that enable decentralized data detection in cell free systems  would struggle even more in such scenarios. In short, JED buys performance advantages at higher complexity and issues with scalability to more UEs, but realizes a clear advantage in such densely-populated scenarios. On the bright side, our JED algorithm requires essentially only matrix-vector multiplications, which is key to enabling efficient and parallel hardware implementations.
Vi-E Pilot Overhead vs. MI Trade-off
Figure 8 shows the trade-off between the per-UE MI achieved by % of UEs and the amount of used pilots (as a fraction of orthogonal training). We show the per-UE MI for the system with BPSK, system with QPSK, and the system with 16-QAM; all the systems have UEs transmitting over time slots. The three dashed lines correspond to upper bounds, assuming that a rate loss is only caused by pilot transmission. For example, the point in the blue dashed line shows that the maximum transmission rate with -QAM modulation is bits/symbol/UE with % pilot overhead. We see that our JED algorithm can asymptotically achieve a per-UE MI close to the upper bound for all the scenarios. For pilot overhead, all three systems reach the upper bound. The highest per-UE MI is achieved by with ,