MIMO is a technique that exploits the spatial dimension, by adding more antennas, to increase spectral efficiency and network capacity. However, conventional MIMO configurations fall short of providing the spatial diversity required by the upcoming fifth-generation (5G) mobile communication standard, which promises to connect billions of devices and to achieve data rates of several gigabits per second. To this end, massive MIMO has been introduced, in which a few hundred antennas serve tens of terminals in the same time-frequency resources.
Despite the extensive work on massive MIMO, large MIMO will also play an important role in the future. Large MIMO systems use tens of antennas in communication terminals and can afford a large number of antennas on both the transmitter and the receiver sides, for example the , , , and configurations. Large point-to-point MIMO wireless links are of particular interest in 5G for high-speed wireless backhaul connectivity between base stations (BSs). Also, multipoint-to-point large multiuser MIMO can be used in the 5G uplink when the number of served transmitting users is less than, but comparable to, the number of BS antennas. Moreover, large MIMO can also be considered for point-to-multipoint downlink multiuser MIMO (MU-MIMO), whether in enhanced versions of the current wireless communication standards or in 5G, where users sharing the same physical resource blocks are chosen based on the degree of orthogonality of their cascaded precoder and channel.
After being traditionally driven by diversity-multiplexing tradeoffs, recent wireless communication system designs have been driven by two factors: system performance, in terms of throughput and bit error rate, and system complexity, in terms of processing latency and computational complexity. The performance of MIMO systems is largely determined by the detection scheme at the receiver side, and different schemes offer different performance-complexity tradeoffs. Linear detectors, such as zero forcing (ZF) and minimum mean square error (MMSE) detectors, are the least complex but also the furthest from optimal. On the other hand, maximum likelihood (ML) detectors are optimal but the most computationally intensive, with complexity that grows exponentially with the number of antennas. Several sub-optimal detectors fill the spectrum in between, including sphere decoders (SD) and their variants [6, 7, 8, 9]. Moreover, in addition to conventional hard-output (HO) detectors, soft-output (SO) detectors play an important role in capacity-approaching systems, but they are more complex because they must process significantly more signal combinations to generate reliability information.
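For reference, the linear end of this spectrum can be sketched in a few lines of Python/NumPy: ZF applies the pseudo-inverse of the channel, MMSE regularizes the Gram matrix by the noise variance, and both then slice each stream independently. The sketch assumes unit-energy QPSK symbols, an i.i.d. Rayleigh channel, and a known noise variance `noise_var`; the function names are illustrative only.

```python
import numpy as np

# Unit-energy QPSK constellation (an illustrative assumption; any normalized QAM works).
CONST = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)


def slice_to(const, z):
    """Map each entry of z to its nearest constellation point."""
    idx = np.argmin(np.abs(z[:, None] - const[None, :]), axis=1)
    return const[idx]


def zf_detect(H, y, const=CONST):
    """Zero forcing: apply the pseudo-inverse of H, then slice each stream."""
    return slice_to(const, np.linalg.pinv(H) @ y)


def mmse_detect(H, y, noise_var, const=CONST):
    """MMSE: regularize the Gram matrix by the noise variance (unit-energy symbols)."""
    Nt = H.shape[1]
    G = np.linalg.solve(H.conj().T @ H + noise_var * np.eye(Nt), H.conj().T)
    return slice_to(const, G @ y)


rng = np.random.default_rng(0)
Nr, Nt = 4, 4
H = (rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))) / np.sqrt(2)
s = rng.choice(CONST, Nt)
y = H @ s  # noiseless received vector
print(zf_detect(H, y), mmse_detect(H, y, noise_var=1e-6))
```

Both filters cost one N_t × N_t inversion (or solve) per channel realization, independently of the constellation size, which is what places them at the low-complexity end.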
In massive MIMO systems, linear detectors achieve near-optimal performance by exploiting the channel hardening effect, and matrix inversions are approximated via Neumann series expansions in practical implementations. However, large MIMO systems do not have very large receive-to-transmit antenna ratios. Hence, they cannot achieve the performance gains of asymmetric massive MIMO systems, nor do they admit similar practical implementations, since Neumann series expansions fail to converge. For large MIMO systems, the detection schemes in the literature fall into several groups: detection based on local search [12, 13]; detection based on meta-heuristics [14, 15]; detection via message passing on graphical models [16, 17]; lattice reduction (LR) aided detection [18, 19]; and detection using Monte Carlo sampling. However, for these schemes to achieve near-ML performance with large numbers of antennas and high-order modulation constellations, the entailed complexity would be prohibitive.
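The convergence argument can be made concrete with a short sketch. Splitting the Gram matrix as A = H^H H = D + E, with D its diagonal, the truncated Neumann series approximates inv(A) by the sum over k = 0..K of (-D^-1 E)^k D^-1, and it converges only if the spectral radius of the iteration matrix -D^-1 E is below 1. The antenna counts (128 receive / 8 transmit versus 16 / 16) and the function names are illustrative assumptions.

```python
import numpy as np


def iteration_matrix(A):
    """Return -D^{-1} E for the split A = D + E, where D is the diagonal of A."""
    d = np.diag(A)
    return -(A - np.diag(d)) / d[:, None]


def neumann_inverse(A, K):
    """Approximate inv(A) by sum_{k=0}^{K} (-D^{-1} E)^k D^{-1}."""
    X = iteration_matrix(A)
    term = np.diag(1.0 / np.diag(A))
    approx = term.copy()
    for _ in range(K):
        term = X @ term
        approx = approx + term
    return approx


def spectral_radius(A):
    """The Neumann series converges if and only if this value is below 1."""
    return np.max(np.abs(np.linalg.eigvals(iteration_matrix(A))))


def gram(rng, Nr, Nt):
    """Gram matrix H^H H of an i.i.d. Rayleigh channel with unit-variance entries."""
    H = (rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))) / np.sqrt(2)
    return H.conj().T @ H


def rel_error(A, K):
    """Relative Frobenius error of the (K+1)-term approximation."""
    exact = np.linalg.inv(A)
    return np.linalg.norm(neumann_inverse(A, K) - exact) / np.linalg.norm(exact)


rng = np.random.default_rng(1)
A_massive = gram(rng, 128, 8)  # large receive-to-transmit ratio
A_large = gram(rng, 16, 16)    # ratio of one, typical of large MIMO
for name, A in [("128x8", A_massive), ("16x16", A_large)]:
    print(name, spectral_radius(A), rel_error(A, 1), rel_error(A, 8))
```

With a ratio of one, the eigenvalues of the normalized Gram matrix spread well beyond 2, so adding terms makes the approximation worse rather than better.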
A popular family of MIMO detectors that achieves good performance-complexity tradeoffs employs non-linear subset-stream detection. The nulling-and-cancellation (N/C) detector is a low-complexity member of this family; it consists of linear nulling followed by successive interference cancellation (SIC). The chase detector (CD) is a more complex member of this family; it first creates a list of candidate decision vectors, and then chooses the best candidate from this list as the final decision. Chase detection is considered a special case of list detection. However, it differs from list sphere decoding (LSD), for example, in the way the list is generated and administered: in LSD, list admission is based on proximity to an initial solution, while in CD, list generation is deterministic: it spans all possible symbols at a specific root layer of interest and extends each of them into a full candidate vector. Furthermore, other popular subset-stream detectors (e.g., [24, 25, 26]) decompose the channel matrix into lower-order sub-channels to reduce the number of jointly detected streams.
All aforementioned subset-stream detectors make use of QR decomposition (QRD). However, the SO sub-space detector (SSD) transforms the channel matrix via a punctured QRD, which we refer to in this paper as WR decomposition (WRD). In [28, 29, 30], WRD-based SSD is generalized to allow joint detection of arbitrary-sized subsets of decoupled streams, and efficient implementation methods are presented. The QRD-based version of this detector is called the layered orthogonal lattice detector (LORD) [31, 32], and both are special cases of CD. To the best of our knowledge, the use of punctured QRD in MIMO detectors has not been studied analytically in the literature, and its applicability to large MIMO systems has not been addressed.
The contributions of this paper are summarized as follows:
We present a family of WRD-based detectors that build on popular QRD-based detectors. In particular, we propose a punctured ML (PML) detector, a punctured N/C (PN/C) detector, a punctured CD (PCD), as well as a hard-output sub-space detector.
We mathematically analyze the bit error rate (BER) performance of the proposed HO detectors. First, the diversity gain is characterized and used to show that channel matrix puncturing does not reduce the diversity gain in HO detection. Second, the performance of these detectors is studied via a probabilistic BER characterization.
We extend the study to several variants of SO detection schemes, and show that significant performance gains can be achieved with channel puncturing.
We propose efficient architectures and analyze the computational complexity of the proposed detectors. We show that the computational savings are much more pronounced with large MIMO dimensions.
We study the performance of the proposed detectors in the context of large MIMO with high-order modulations, and in the presence of spatial channel correlation. We show that the performance of these schemes scales up efficiently with high antenna and modulation orders, and that they are superior to their QRD-based counterparts in the presence of channel correlation.
The remainder of the paper is organized as follows. The system model and basic reference detectors are presented in Sec. II. The proposed WRD-based ML detector, N/C detector, CD, and SSD detection algorithms are presented in Sec. III. The achievable diversity gains of these detectors are derived in Sec. IV, followed by a probabilistic BER characterization that describes the behaviour of the proposed approaches in Sec. V. The SO versions of the detectors are then proposed in Sec. VI, and an efficient architecture is proposed in Sec. VII alongside a complexity study. Finally, simulation results are presented in Sec. VIII.
Regarding notation, bold upper case, bold lower case, and lower case letters correspond to matrices, vectors, and scalars, respectively. Scalar norms, vector norms, and Frobenius norms are denoted by , , and , respectively. , , , , and , stand for the expected value, trace function, real part, transpose, and conjugate transpose, respectively.
refers to the normal distribution, and refers to the Q-function, where . is a punctured matrix with entries , and is an identity matrix of size . The ML optimality of detectors is understood in the log-max sense.
II System Model and Reference Detectors
II-A System Model
We consider spatial multiplexing in a MIMO system with transmit antennas and receive antennas. The equivalent complex baseband input-output system relation is given by
where is the received complex vector, is the channel matrix with entries that are assumed to be i.i.d. complex, circularly symmetric Gaussian random variables, is the transmitted symbol vector, and is a complex-valued, circularly symmetric Gaussian random vector with zero mean and variance (). Each symbol , , belongs to a normalized complex constellation (), and we have , where is the finite set of points on a -dimensional lattice generated by all possible symbol vectors. For simplicity, we assume a uniform modulation constellation on all layers, and hence . The coded bit-representation of a symbol is denoted by , where and for . The signal-to-noise ratio () is defined in terms of the noise variance as .
At the receiver side, and assuming perfect knowledge of the channel, QRD decomposes as , where has orthonormal columns and , and is a square upper-triangular matrix (UTM) with real and positive diagonal entries. The transformed receive symbol vector can then be equivalently expressed as
where and are statistically identical since has orthonormal columns.
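The system model and the QRD-based transformation can be checked numerically with the following Python/NumPy sketch. Since `np.linalg.qr` does not enforce a real, positive diagonal in R, the sketch rotates the phases of the columns of Q (and the rows of R) to obtain that form. The dimensions, the noise variance, the QPSK constellation, and the symbol names (H, s, n, Q, R) are illustrative assumptions.

```python
import numpy as np


def qr_positive(H):
    """Thin QRD H = Q R with a real, positive diagonal in R."""
    Q, R = np.linalg.qr(H)  # reduced mode: Q is Nr x Nt, R is Nt x Nt
    phase = np.diag(R) / np.abs(np.diag(R))
    # Q D and D^H R leave the product unchanged, since |phase| = 1.
    return Q * phase, phase.conj()[:, None] * R


rng = np.random.default_rng(2)
Nr, Nt, noise_var = 6, 4, 0.1  # illustrative dimensions and noise variance
const = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)

# i.i.d. unit-variance Rayleigh channel, QPSK symbols, CN(0, noise_var) noise.
H = (rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))) / np.sqrt(2)
s = rng.choice(const, Nt)
n = np.sqrt(noise_var / 2) * (rng.standard_normal(Nr) + 1j * rng.standard_normal(Nr))
y = H @ s + n

Q, R = qr_positive(H)
y_t = Q.conj().T @ y  # transformed receive vector
n_t = Q.conj().T @ n  # transformed noise
print(np.allclose(y_t, R @ s + n_t))
```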
An “exhaustive” log-max ML detector searches the complete lattice , computing Euclidean distance metrics, to solve for
Note that the SD achieves exact log-max ML performance with fewer computations by executing a tree-based search on a subset of , pruning any branch whose partial distance already exceeds the current best distance.
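The equivalence between exhaustive ML and SD can be illustrated on a small system. The sketch below compares a search over all M^N_t symbol vectors with a plain depth-first SD on Q^H y = R s + noise. Because the two metrics differ only by a term that does not depend on s, both should return the same vector. QPSK, N_r = N_t = 4, and the function names are illustrative assumptions; practical refinements such as Schnorr-Euchner child ordering are omitted.

```python
import itertools

import numpy as np

CONST = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)


def ml_exhaustive(H, y, const=CONST):
    """Search all M**Nt symbol vectors for the smallest ||y - H s||^2."""
    best, best_d = None, np.inf
    for cand in itertools.product(const, repeat=H.shape[1]):
        v = np.array(cand)
        d = np.linalg.norm(y - H @ v) ** 2
        if d < best_d:
            best, best_d = v, d
    return best


def sphere_decode(H, y, const=CONST):
    """Depth-first SD on Q^H y = R s + noise, pruning partial distances >= best."""
    Q, R = np.linalg.qr(H)
    yt = Q.conj().T @ y
    Nt = R.shape[0]
    s = np.zeros(Nt, dtype=complex)
    best, best_d = None, np.inf

    def search(layer, partial):
        nonlocal best, best_d
        if partial >= best_d:
            return  # prune: no completion of this branch can beat the current best
        if layer < 0:
            best, best_d = s.copy(), partial  # complete vector with a new best distance
            return
        # Residual at this layer after removing the already-fixed layers below it.
        resid = yt[layer] - R[layer, layer + 1:] @ s[layer + 1:]
        for c in const:
            s[layer] = c
            search(layer - 1, partial + abs(resid - R[layer, layer] * c) ** 2)

    search(Nt - 1, 0.0)
    return best


rng = np.random.default_rng(3)
Nr = Nt = 4
agree = 0
for _ in range(20):
    H = (rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))) / np.sqrt(2)
    s_tx = rng.choice(CONST, Nt)
    y = H @ s_tx + 0.5 * (rng.standard_normal(Nr) + 1j * rng.standard_normal(Nr))
    agree += np.array_equal(ml_exhaustive(H, y), sphere_decode(H, y))
print(agree, "of 20 trials agree")
```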
II-B Nulling-and-Cancellation (N/C) Detector
The N/C detector is used in the widely known vertical Bell Labs layered space-time (V-BLAST) architecture. When combined with QRD, N/C becomes a computationally efficient procedure that is highly sensitive to layer ordering. Nulling is performed by linearly pre-multiplying the received vector with , which suppresses the interference from , , at the layer. This is followed by SIC (back-substitution and slicing) to cancel co-antenna interference; hence, is computed as
for , where is the slicing operator on the constellation . The BER of N/C therefore serves as an upper bound on the BER of the other detection schemes considered here.
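A minimal Python/NumPy sketch of QRD-based N/C follows. It uses the natural layer order (no ordering optimization), so it illustrates the nulling and back-substitution steps rather than an optimized V-BLAST receiver. QPSK and the function name are illustrative assumptions.

```python
import numpy as np

CONST = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)


def nc_detect(H, y, const=CONST):
    """QRD-based nulling-and-cancellation in the natural layer order."""
    Q, R = np.linalg.qr(H)
    yt = Q.conj().T @ y  # nulling: layer i now depends only on layers j >= i
    Nt = R.shape[0]
    s_hat = np.zeros(Nt, dtype=complex)
    for i in range(Nt - 1, -1, -1):  # back-substitution from the last layer upwards
        # Cancel the already-detected layers, normalize, then slice.
        z = (yt[i] - R[i, i + 1:] @ s_hat[i + 1:]) / R[i, i]
        s_hat[i] = const[np.argmin(np.abs(z - const))]
    return s_hat


rng = np.random.default_rng(4)
Nr = Nt = 4
H = (rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))) / np.sqrt(2)
s = rng.choice(CONST, Nt)
print(np.array_equal(nc_detect(H, H @ s), s))  # noiseless sanity check
```

A wrong decision at a lower layer propagates upwards through the cancellation term, which is why the error rate of N/C depends so strongly on which layer is detected first.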
II-C Chase Detector (CD)
where , , , , , , is a vector of zero-valued entries, and . Then, for each at the root layer, a candidate vector is calculated as in (4) and added to . The maximum number of candidate vectors in is , and the final HO decision vector is chosen from to be
Note that CD differs from LSD in several aspects. For example, LSD list admission depends on run-time channel conditions, which makes it nondeterministic and more complex. Also, in an SO setting, LSD does not guarantee that all the required distance metrics are computed.
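Following the list-generation rule described in the introduction (one candidate per symbol of the root layer, completed by SIC on the remaining layers), a CD can be sketched as below. The root layer is taken to be the last layer after QRD, as in N/C, and the candidate with the smallest distance is returned. QPSK and the helper names are illustrative assumptions. Since the N/C solution is the candidate obtained from the sliced root symbol, it is always in the list, so CD can never do worse than N/C in distance.

```python
import numpy as np

CONST = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)


def slice1(const, z):
    """Nearest constellation point to the scalar z."""
    return const[np.argmin(np.abs(z - const))]


def sic_from_root(R, yt, root, const=CONST):
    """Fix the root (last) layer to `root` and complete the vector by SIC."""
    Nt = R.shape[0]
    s = np.zeros(Nt, dtype=complex)
    s[-1] = root
    for i in range(Nt - 2, -1, -1):
        s[i] = slice1(const, (yt[i] - R[i, i + 1:] @ s[i + 1:]) / R[i, i])
    return s


def chase_detect(H, y, const=CONST):
    """Build one candidate per root symbol (M candidates) and keep the closest."""
    Q, R = np.linalg.qr(H)
    yt = Q.conj().T @ y
    candidates = [sic_from_root(R, yt, c, const) for c in const]
    dists = [np.linalg.norm(yt - R @ c) ** 2 for c in candidates]
    return candidates[int(np.argmin(dists))]


rng = np.random.default_rng(5)
Nr = Nt = 4
H = (rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))) / np.sqrt(2)
s = rng.choice(CONST, Nt)
y = H @ s + 0.5 * (rng.standard_normal(Nr) + 1j * rng.standard_normal(Nr))

Q, R = np.linalg.qr(H)
yt = Q.conj().T @ y
s_nc = sic_from_root(R, yt, slice1(CONST, yt[-1] / R[-1, -1]))  # plain N/C solution
s_cd = chase_detect(H, y)
d_nc = np.linalg.norm(y - H @ s_nc) ** 2
d_cd = np.linalg.norm(y - H @ s_cd) ** 2
print(d_cd, d_nc)
```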
II-D Layered Orthogonal Lattice Detector (LORD)
Instead of executing the CD routine once, LORD repeats chase detection with different layer orderings, each time with a different layer as root, by cyclically shifting the columns of . The best output from these trials is the final solution. Each permuted at step , , is QR-decomposed into and according to (5). Let denote the output CD solution from step . Then, the final solution is , where
Since distances are preserved under different layer orderings with QRD, the accumulated candidate vectors across different partitions form an “extended” candidate list, despite the potential overlap of lists from each partition. Therefore, the added gain with LORD compared to CD is significant.
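LORD can be sketched by wrapping a CD in a loop over cyclic column shifts, so that each layer serves as the root once, then undoing each permutation and comparing the results by their Euclidean distance in the original space. The shift direction, QPSK, and the function names are illustrative assumptions. Because the zero shift reproduces the plain CD, LORD can only match or improve on it in distance.

```python
import numpy as np

CONST = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)


def chase_detect(H, y, const=CONST):
    """CD with the last layer as root: one SIC-completed candidate per root symbol."""
    Q, R = np.linalg.qr(H)
    yt = Q.conj().T @ y
    Nt = R.shape[0]
    best, best_d = None, np.inf
    for root in const:
        s = np.zeros(Nt, dtype=complex)
        s[-1] = root
        for i in range(Nt - 2, -1, -1):
            z = (yt[i] - R[i, i + 1:] @ s[i + 1:]) / R[i, i]
            s[i] = const[np.argmin(np.abs(z - const))]
        d = np.linalg.norm(yt - R @ s) ** 2
        if d < best_d:
            best, best_d = s, d
    return best


def lord_detect(H, y, const=CONST):
    """Run CD once per cyclic shift so that every layer serves as root once."""
    Nt = H.shape[1]
    best, best_d = None, np.inf
    for k in range(Nt):
        perm = np.roll(np.arange(Nt), k)  # column perm[-1] becomes the root layer
        s = np.empty(Nt, dtype=complex)
        s[perm] = chase_detect(H[:, perm], y, const)  # undo the column permutation
        d = np.linalg.norm(y - H @ s) ** 2
        if d < best_d:
            best, best_d = s, d
    return best


rng = np.random.default_rng(6)
Nr = Nt = 4
H = (rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))) / np.sqrt(2)
s = rng.choice(CONST, Nt)
y = H @ s + 0.5 * (rng.standard_normal(Nr) + 1j * rng.standard_normal(Nr))
d_cd = np.linalg.norm(y - H @ chase_detect(H, y)) ** 2
d_lord = np.linalg.norm(y - H @ lord_detect(H, y)) ** 2
print(d_lord, d_cd)
```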
III Detection Schemes Based on Punctured Channel Matrix
III-A Punctured QR Decomposition (WRD)
WRD transforms into a punctured UTM with , by puncturing the entries between the diagonal and the last column through a matrix , such that . A brute-force approach for computing involves matrix inversions, which is computationally expensive and prone to round-off error. Alternatively, QRD followed by elementary matrix operations can be used to derive and .
Let be QR-decomposed such that . Clearly, and for all ; hence, . Now assume that the entry in row of is to be nulled, for and . We have and , from which it follows that . Hence, with , the equations
when repeated for , would puncture the required entry and update the entry in row of , as well as update the column of accordingly, while
would normalize in and update the non-zero entries in row of accordingly. All these operations are carried out for . The resultant is , and the resultant is . The transformed received symbol vector after applying can then be expressed as
where in this case is a diagonal matrix. For example, in the special case of MIMO, is obtained from by puncturing entries :
Note that the column at the root layer in (layer here) remains orthogonal to all other columns. Hence, taking the expectation of over , we have:
Therefore, although the resultant noise after puncturing is colored, WRD preserves the noise variance at the layer of interest. However, the statistical properties of the elements of are distorted by puncturing. The non-zero elements of (given i.i.d. Rayleigh fading) are known to be independent random variables with the following distributions [34, 21]:
The off-diagonal elements are circularly symmetric complex Gaussian with unit variance.
where the chi-squared distribution arises from the sum of squares of Rayleigh-distributed random variables. While the distributions of the non-zero off-diagonal elements remain intact, the distributions of the diagonal elements at the upper layers , lose degrees of freedom, from down to , as depicted in Fig. 1 for a channel matrix. This is because each puncturing operation at layer renders the column of dependent on one of the remaining columns, thus eliminating two degrees of freedom from the corresponding distribution of .
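The QRD-plus-elementary-operations route can be sketched as follows (Python/NumPy, 0-based indices, root layer = last column). Rows are cleaned bottom-up: entry (i, j) is zeroed by subtracting a multiple of row j, which is already punctured, so only the last column of row i changes as a side effect. The same row operations, accumulated in a matrix T, give W^H = T Q^H; the columns of W are then normalized and the rows of R_w rescaled to match. The processing order and the names `wrd`, `T`, and `Rw` are our own assumptions, not necessarily the paper's exact procedure. As a cross-check against the brute-force route, one can verify that each non-root column of W is the unit-norm ZF nulling vector of H_1 (H without its last column), so its squared diagonal entry equals 1/[(H_1^H H_1)^-1]_ii.

```python
import numpy as np


def wrd(H):
    """Punctured QRD sketch: W^H H = Rw, where Rw is nonzero only on its diagonal
    and in its last column, and W has unit-norm columns (0-based indices)."""
    Q, R = np.linalg.qr(H)
    ph = np.diag(R) / np.abs(np.diag(R))
    Q, R = Q * ph, ph.conj()[:, None] * R  # make diag(R) real and positive
    Nt = R.shape[0]
    T = np.eye(Nt, dtype=complex)  # accumulated row operations: Rw = T R, W^H = T Q^H
    Rw = R.copy()
    for i in range(Nt - 3, -1, -1):  # rows Nt-2 and Nt-1 need no puncturing
        for j in range(i + 1, Nt - 1):  # entries between the diagonal and last column
            c = Rw[i, j] / Rw[j, j]
            Rw[i] -= c * Rw[j]  # row j is already punctured, so only columns
            T[i] -= c * T[j]    # j and Nt-1 of row i are modified
    W = Q @ T.conj().T
    norms = np.linalg.norm(W, axis=0)
    return W / norms, Rw / norms[:, None]  # unit-norm columns; rows rescaled to match


rng = np.random.default_rng(7)
Nr, Nt = 6, 4
H = (rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))) / np.sqrt(2)
W, Rw = wrd(H)
print(np.round(np.abs(Rw), 3))  # nonzeros only on the diagonal and in the last column
```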
Fig. 1: Empirical cumulative distribution functions (CDFs) of the diagonal elements of (a) and (b), shown as dotted lines, compared to theoretical chi-squared CDFs (solid lines).
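The loss of degrees of freedom can also be checked by Monte Carlo simulation. With unit-variance i.i.d. Rayleigh entries, the squared i-th diagonal entry of R (1-based) is a scaled chi-squared variable with 2(N_r - i + 1) degrees of freedom, i.e., mean N_r - i + 1. From the ZF view of the punctured rows, the squared diagonal entries of R_w above the root layer should instead have mean N_r - N_t + 2, while the root layer keeps the statistics of R. The sketch below assumes a 4 × 4 channel and 5000 trials (illustrative choices) and includes a compact WRD helper built from QRD and row operations, which is our own construction.

```python
import numpy as np


def wrd(H):
    """Punctured QRD sketch: W^H H = Rw with Rw nonzero only on its diagonal
    and in its last column; W has unit-norm columns."""
    Q, R = np.linalg.qr(H)
    ph = np.diag(R) / np.abs(np.diag(R))
    Q, R = Q * ph, ph.conj()[:, None] * R  # real, positive diag(R)
    Nt = R.shape[0]
    T, Rw = np.eye(Nt, dtype=complex), R.copy()
    for i in range(Nt - 3, -1, -1):
        for j in range(i + 1, Nt - 1):
            c = Rw[i, j] / Rw[j, j]
            Rw[i] -= c * Rw[j]
            T[i] -= c * T[j]
    W = Q @ T.conj().T
    norms = np.linalg.norm(W, axis=0)
    return W / norms, Rw / norms[:, None]


rng = np.random.default_rng(8)
Nr = Nt = 4
trials = 5000
qrd_sq = np.zeros(Nt)
wrd_sq = np.zeros(Nt)
for _ in range(trials):
    H = (rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))) / np.sqrt(2)
    qrd_sq += np.abs(np.diag(np.linalg.qr(H)[1])) ** 2
    wrd_sq += np.abs(np.diag(wrd(H)[1])) ** 2
qrd_mean, wrd_mean = qrd_sq / trials, wrd_sq / trials
print("QRD mean |r_ii|^2:", np.round(qrd_mean, 2))  # expected about [4, 3, 2, 1]
print("WRD mean |r_ii|^2:", np.round(wrd_mean, 2))  # expected about [2, 2, 2, 1]
```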
Similar to the ML detector, an “exhaustive” PML detector searches to find
Pre-multiplying by , unlike , modifies Euclidean distances; hence, we have
Note that this minimum distance detector is not optimal due to the presence of colored noise.
III-B Punctured N/C Detector (PN/C)
With PN/C, we null by pre-multiplying by instead of , and perform SIC as
for , where , and . Note that slicing on layers can be done in parallel since is diagonal.
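The PN/C steps can be sketched as follows: null with W^H instead of Q^H, slice the root layer, cancel its contribution through the last column of R_w, and slice all remaining layers in one vectorized step, since the upper-left block of R_w is diagonal. The block includes a compact WRD helper (our own construction via QRD and row operations); QPSK and the function names are illustrative assumptions.

```python
import numpy as np

CONST = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)


def slice_vec(const, z):
    """Nearest-constellation-point slicing, applied entry-wise."""
    z = np.atleast_1d(z)
    return const[np.argmin(np.abs(z[:, None] - const[None, :]), axis=1)]


def wrd(H):
    """Punctured QRD sketch: W^H H = Rw, nonzero only on the diagonal and last column."""
    Q, R = np.linalg.qr(H)
    ph = np.diag(R) / np.abs(np.diag(R))
    Q, R = Q * ph, ph.conj()[:, None] * R
    Nt = R.shape[0]
    T, Rw = np.eye(Nt, dtype=complex), R.copy()
    for i in range(Nt - 3, -1, -1):
        for j in range(i + 1, Nt - 1):
            c = Rw[i, j] / Rw[j, j]
            Rw[i] -= c * Rw[j]
            T[i] -= c * T[j]
    W = Q @ T.conj().T
    norms = np.linalg.norm(W, axis=0)
    return W / norms, Rw / norms[:, None]


def pnc_detect(H, y, const=CONST):
    """Punctured N/C: null with W^H, detect the root layer, cancel it, slice the rest."""
    W, Rw = wrd(H)
    z = W.conj().T @ y
    s = np.empty(H.shape[1], dtype=complex)
    s[-1] = slice_vec(const, z[-1] / Rw[-1, -1])[0]  # root layer first
    # Each upper layer sees only itself and the root layer, so all are sliced at once.
    s[:-1] = slice_vec(const, (z[:-1] - Rw[:-1, -1] * s[-1]) / np.diag(Rw)[:-1])
    return s


rng = np.random.default_rng(9)
Nr = Nt = 4
H = (rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))) / np.sqrt(2)
s = rng.choice(CONST, Nt)
print(np.array_equal(pnc_detect(H, H @ s), s))  # noiseless sanity check
```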
III-C Punctured Chase Detector (PCD)
For a given , the distance in (22) is minimized as
where , which is a vectorized slicing operation, and . The symbol vector is then added to , together with its distance . The final HO symbol vector is found from as the one with the smallest distance.
Although the PCD computes distances only to candidate symbol vectors for a given layer ordering and channel partition, it is clear from (23) that it achieves exactly the same performance as the PML detector. In other words, no vector in the lattice outside the set can have a smaller distance metric than the PCD solution. The proof goes as follows:
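The exactness claim can be checked by brute force on a small system. For a fixed root symbol, the punctured metric splits into independent per-layer terms, each minimized by slicing, so minimizing over the M root symbols yields the global minimum. The sketch implements PCD with vectorized slicing and compares it to an exhaustive search of the punctured metric ||W^H y - R_w s||^2 over all M^N_t vectors. QPSK, N_r = N_t = 4, the WRD helper, and the function names are illustrative assumptions.

```python
import itertools

import numpy as np

CONST = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)


def slice_vec(const, z):
    """Nearest-constellation-point slicing, applied entry-wise."""
    z = np.atleast_1d(z)
    return const[np.argmin(np.abs(z[:, None] - const[None, :]), axis=1)]


def wrd(H):
    """Punctured QRD sketch: W^H H = Rw, nonzero only on the diagonal and last column."""
    Q, R = np.linalg.qr(H)
    ph = np.diag(R) / np.abs(np.diag(R))
    Q, R = Q * ph, ph.conj()[:, None] * R
    Nt = R.shape[0]
    T, Rw = np.eye(Nt, dtype=complex), R.copy()
    for i in range(Nt - 3, -1, -1):
        for j in range(i + 1, Nt - 1):
            c = Rw[i, j] / Rw[j, j]
            Rw[i] -= c * Rw[j]
            T[i] -= c * T[j]
    W = Q @ T.conj().T
    norms = np.linalg.norm(W, axis=0)
    return W / norms, Rw / norms[:, None]


def pcd_detect(H, y, const=CONST):
    """One candidate per root symbol; upper layers sliced in a single vector step."""
    W, Rw = wrd(H)
    z = W.conj().T @ y
    best, best_d = None, np.inf
    for c in const:
        upper = slice_vec(const, (z[:-1] - Rw[:-1, -1] * c) / np.diag(Rw)[:-1])
        cand = np.append(upper, c)
        d = np.linalg.norm(z - Rw @ cand) ** 2  # punctured distance metric
        if d < best_d:
            best, best_d = cand, d
    return best, best_d


def pml_exhaustive(H, y, const=CONST):
    """Brute-force minimizer of the punctured metric over all M**Nt vectors."""
    W, Rw = wrd(H)
    z = W.conj().T @ y
    best, best_d = None, np.inf
    for cand in itertools.product(const, repeat=H.shape[1]):
        v = np.array(cand)
        d = np.linalg.norm(z - Rw @ v) ** 2
        if d < best_d:
            best, best_d = v, d
    return best, best_d


rng = np.random.default_rng(10)
Nr = Nt = 4
matches = 0
for _ in range(10):
    H = (rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))) / np.sqrt(2)
    s = rng.choice(CONST, Nt)
    y = H @ s + 0.5 * (rng.standard_normal(Nr) + 1j * rng.standard_normal(Nr))
    (v1, d1), (v2, d2) = pcd_detect(H, y), pml_exhaustive(H, y)
    matches += bool(np.array_equal(v1, v2) and np.isclose(d1, d2))
print(matches, "of 10 trials: PCD equals exhaustive PML")
```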
III-D Vector-Based Sub-Space Detector (VSSD)
The VSSD extends PCD in the same way that LORD extends CD. The columns of are cyclically shifted, and punctured UTMs are generated as shown in Fig. 2. Each permuted at step , , is WR-decomposed into and according to (15). Let denote the PCD solution from step . The final solution is , where is defined as:
Note that we revert to the original space of to compute the true Euclidean distance metrics in (30). The gain achieved by VSSD compared to PCD is limited, since each generates an independent space, and hence we end up taking the best output from independent trials. The VSSD is in effect the HO version of the reference SO SSD, and we refer to it simply as SSD in the remainder of this paper.
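A sketch of VSSD under the same assumptions as before (QPSK, compact WRD helper, illustrative names): PCD is run once per cyclic column shift, each output is mapped back through its permutation, and the candidates are compared by their true Euclidean distance ||y - H s||^2 rather than by the punctured metric. The zero shift reproduces the plain PCD, so VSSD can only match or improve on it in distance.

```python
import numpy as np

CONST = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)


def slice_vec(const, z):
    """Nearest-constellation-point slicing, applied entry-wise."""
    z = np.atleast_1d(z)
    return const[np.argmin(np.abs(z[:, None] - const[None, :]), axis=1)]


def wrd(H):
    """Punctured QRD sketch: W^H H = Rw, nonzero only on the diagonal and last column."""
    Q, R = np.linalg.qr(H)
    ph = np.diag(R) / np.abs(np.diag(R))
    Q, R = Q * ph, ph.conj()[:, None] * R
    Nt = R.shape[0]
    T, Rw = np.eye(Nt, dtype=complex), R.copy()
    for i in range(Nt - 3, -1, -1):
        for j in range(i + 1, Nt - 1):
            c = Rw[i, j] / Rw[j, j]
            Rw[i] -= c * Rw[j]
            T[i] -= c * T[j]
    W = Q @ T.conj().T
    norms = np.linalg.norm(W, axis=0)
    return W / norms, Rw / norms[:, None]


def pcd_detect(H, y, const=CONST):
    """Punctured chase detection with the last column as root (vector only)."""
    W, Rw = wrd(H)
    z = W.conj().T @ y
    best, best_d = None, np.inf
    for c in const:
        cand = np.append(slice_vec(const, (z[:-1] - Rw[:-1, -1] * c) / np.diag(Rw)[:-1]), c)
        d = np.linalg.norm(z - Rw @ cand) ** 2
        if d < best_d:
            best, best_d = cand, d
    return best


def ssd_detect(H, y, const=CONST):
    """VSSD: PCD per cyclic shift, best result by true Euclidean distance."""
    Nt = H.shape[1]
    best, best_d = None, np.inf
    for k in range(Nt):
        perm = np.roll(np.arange(Nt), k)  # column perm[-1] is the root layer
        s = np.empty(Nt, dtype=complex)
        s[perm] = pcd_detect(H[:, perm], y, const)  # undo the permutation
        d = np.linalg.norm(y - H @ s) ** 2  # distance in the original space
        if d < best_d:
            best, best_d = s, d
    return best


rng = np.random.default_rng(11)
Nr = Nt = 4
H = (rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))) / np.sqrt(2)
s = rng.choice(CONST, Nt)
y = H @ s + 0.5 * (rng.standard_normal(Nr) + 1j * rng.standard_normal(Nr))
d_pcd = np.linalg.norm(y - H @ pcd_detect(H, y)) ** 2
d_ssd = np.linalg.norm(y - H @ ssd_detect(H, y)) ** 2
print(d_ssd, d_pcd)
```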
III-E Symbol-Based Sub-Space Detector (SSSD)
As a variation of SSD, the SSSD selects, at each step , only the root symbol of the output vector as a component of the final output vector. Thus, the output vector is assembled one symbol at a time over executions of PCD, where
For example, in a MIMO system, we have , where is the HO solution of a PCD following the partition in Fig. 2(a). Similarly, , , and are obtained following the partitions in Fig. 2(b), (c), and (d), respectively. Note that symbol-based LORD (SLORD) can be defined in a similar manner:
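The symbol-by-symbol assembly can be sketched as follows, under the same assumptions as the other punctured-detector sketches (QPSK, compact WRD helper, cyclic shifts, illustrative names): for each shift, PCD is run, but only the decision at its root layer is kept and written to the corresponding position of the output vector. No distance comparison across shifts is needed.

```python
import numpy as np

CONST = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)


def slice_vec(const, z):
    """Nearest-constellation-point slicing, applied entry-wise."""
    z = np.atleast_1d(z)
    return const[np.argmin(np.abs(z[:, None] - const[None, :]), axis=1)]


def wrd(H):
    """Punctured QRD sketch: W^H H = Rw, nonzero only on the diagonal and last column."""
    Q, R = np.linalg.qr(H)
    ph = np.diag(R) / np.abs(np.diag(R))
    Q, R = Q * ph, ph.conj()[:, None] * R
    Nt = R.shape[0]
    T, Rw = np.eye(Nt, dtype=complex), R.copy()
    for i in range(Nt - 3, -1, -1):
        for j in range(i + 1, Nt - 1):
            c = Rw[i, j] / Rw[j, j]
            Rw[i] -= c * Rw[j]
            T[i] -= c * T[j]
    W = Q @ T.conj().T
    norms = np.linalg.norm(W, axis=0)
    return W / norms, Rw / norms[:, None]


def pcd_detect(H, y, const=CONST):
    """Punctured chase detection with the last column as root (vector only)."""
    W, Rw = wrd(H)
    z = W.conj().T @ y
    best, best_d = None, np.inf
    for c in const:
        cand = np.append(slice_vec(const, (z[:-1] - Rw[:-1, -1] * c) / np.diag(Rw)[:-1]), c)
        d = np.linalg.norm(z - Rw @ cand) ** 2
        if d < best_d:
            best, best_d = cand, d
    return best


def sssd_detect(H, y, const=CONST):
    """SSSD: keep only the root-layer decision of each shifted PCD run."""
    Nt = H.shape[1]
    s = np.empty(Nt, dtype=complex)
    for k in range(Nt):
        perm = np.roll(np.arange(Nt), k)  # column perm[-1] is the root layer
        s[perm[-1]] = pcd_detect(H[:, perm], y, const)[-1]  # root symbol only
    return s


rng = np.random.default_rng(12)
Nr = Nt = 4
H = (rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))) / np.sqrt(2)
s = rng.choice(CONST, Nt)
y = H @ s + 0.5 * (rng.standard_normal(Nr) + 1j * rng.standard_normal(Nr))
s_sssd = sssd_detect(H, y)
print(s_sssd)
```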
IV Analysis of Achievable Diversity Gain
It is known that ML detection achieves full receive diversity , and it can be shown that the N/C and PN/C detectors, being special cases of ZF with decision feedback, can only achieve a receive diversity gain of . Moreover, it can be argued that both SSD (VSSD) and LORD also achieve full diversity, since they exploit the full channel matrix to compute distance metrics. In what follows, we study the achievable diversity gains of PML (PCD), SSSD, and SLORD.
IV-A Punctured ML Detector / Punctured Chase Detector (PML/PCD)
To capture the diversity order of PML, we derive the pairwise error probability (PEP). Suppose that is transmitted while is erroneously detected; the PEP can then be expressed as
where is the probability that event occurs, and . Since consists of circularly symmetric complex Gaussian random variables, so does . It is easy to show that
where is introduced since is a scalar. Hence, we have
where the inequality holds since (Section 5.2 in ). Moreover, using the union bound, we have
where , and . Finally, using the Chernoff bound, the average PEP is upper bounded as
where since the columns of were normalized in (11).
where the expected value over the elements of results in full receive diversity , because each column of contains independent Rayleigh-distributed random variables, whose squares are exponentially distributed. However, with instead of in PML detection, the first columns have single diagonal elements whose squares are chi-squared distributed with degrees of freedom, which corresponds to two exponentially distributed complex random variables, and hence to a receive diversity order equal to . Only column of provides a diversity equal to . Therefore, by analogy with (41), the average PEP for the PML detector is
and hence PML detection cannot achieve a receive diversity gain of order greater than 2. However, noting that PML and PCD are identical (Sec. III-C), and knowing that the regular CD achieves a receive diversity order of 2 (see Sec. V-B), we conclude that channel puncturing does not reduce the diversity gain of the CD.
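The step from "squared entries are exponentially distributed" to "diversity order k" rests on a standard identity: if X is the sum of k independent unit-mean exponential variables (a chi-squared variable with 2k degrees of freedom, scaled by 1/2), then E[exp(-gamma X)] = (1 + gamma)^(-k), which decays as gamma^(-k) at high effective SNR gamma. The sketch below checks this identity by Monte Carlo simulation and reads off the log-log slope for k = N_r (an unpunctured column) and k = N_r - N_t + 2 (a punctured upper-layer diagonal) with N_r = N_t = 4. The function names and the SNR points used for the slope are illustrative assumptions.

```python
import numpy as np


def avg_chernoff_term(gamma, k):
    """E[exp(-gamma * X)] for X ~ Gamma(k, 1), i.e. a sum of k Exp(1) variables."""
    return (1.0 + gamma) ** (-k)


def diversity_slope(k, g1=1e3, g2=1e4):
    """Log-log slope of the averaged bound between two high effective-SNR values."""
    return np.log(avg_chernoff_term(g2, k) / avg_chernoff_term(g1, k)) / np.log(g2 / g1)


rng = np.random.default_rng(13)
Nr = Nt = 4
for k in (Nr, Nr - Nt + 2):  # unpunctured column vs. punctured upper-layer diagonal
    X = rng.gamma(shape=k, scale=1.0, size=200_000)
    mc = np.mean(np.exp(-1.0 * X))  # Monte Carlo estimate at gamma = 1
    print(k, mc, avg_chernoff_term(1.0, k), round(-diversity_slope(k), 2))
```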
IV-B Symbol-Based Sub-Space Detector (SSSD)
To capture the diversity order of SSSD, we derive a modified PEP. Without loss of generality, we assume that layer is the root layer of interest. Hence, an error occurs when is transmitted and is erroneously detected, with probability
where , is computed as in Sec. III-C, and are the Nth columns of and , and and are the first columns of and , respectively. Let , and let ; we have
Since and the columns of are circularly symmetric complex Gaussian, so is . Thus, it can be shown that
Hence, continuing from (IV-B), we have
Then, using union and Chernoff bounds, with (), the average PEP can be upper bounded as
where the last approximation holds since the second exponential term is less than , with equality at high SNR. Finally, taking the expectation over all squared elements of , which are exponentially distributed, we obtain
The denominator represents noise plus interference; hence, SSSD appears to achieve full receive diversity at the layer of interest when BERs are plotted in terms of the signal-to-interference-plus-noise ratio (SINR). In the case of SLORD, following a similar derivation, the average PEP can be expressed as
where , and consists of the first columns of . Note that can be expressed as