On the Low-Complexity, Hardware-Friendly Tridiagonal Matrix Inversion for Correlated Massive MIMO Systems

02/21/2018 ∙ by Chuan Zhang, et al. ∙ Shanghai University 0

In massive MIMO (M-MIMO) systems, one of the key challenges in the implementation is the large-scale matrix inversion operation, as widely used in channel estimation, equalization, detection, and decoding procedures. Traditionally, to handle this complexity issue, several low-complexity matrix inversion approximation methods have been proposed, including the classic Cholesky decomposition and the Neumann series expansion (NSE). However, the conventional approaches failed to exploit neither the special structure of channel matrices nor the critical issues in the hardware implementation, which results in poorer throughput performance and longer processing delay. In this paper, by targeting at the correlated M-MIMO systems, we propose a modified NSE based on tridiagonal matrix inversion approximation (TMA) to accommodate the complexity as well as the performance issue in the conventional hardware implementation, and analyze the corresponding approximation errors. Meanwhile, we investigate the VLSI implementation for the proposed detection algorithm based on a Xilinx Virtex-7 XC7VX690T FPGA platform. It is shown that for correlated massive MIMO systems, it can achieve near-MMSE performance and 630 Mb/s throughput. Compared with other benchmark systems, the proposed pipelined TMA detector can get high throughput-to-hardware ratio. Finally, we also propose a fast iteration structure for further research.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Massive MIMO (or large-scale MIMO, M-MIMO), as an emerging technology employing hundreds of antennas at the base station (BS) to serve a relatively small number of users simultaneously, is a promising technology for the next generation wireless communications [1, 2]

. By adding multiple antennas, degrees of freedom in wireless channel can be offered to accommodate more information data. Hence, a significant performance improvement can be obtained in terms of data rate, reliability, spectral efficiency, and energy efficiency


Unfortunately, the advantages of M-MIMO systems come at the cost of the significantly increased computational complexity at the BS. One of the most challenging issues is the low-complexity signal detection algorithm in both uplink and downlink directions [3]. For the optimal signal detection algorithm such as the maximum-likelihood (ML) detection [4], its complexity increases exponentially with the number of terminal antennas [5], which entails prohibitive complexity for the M-MIMO detection. The fixed-complexity sphere decoding (SD) [6, 7] would also incur unaffordable complexity with the scaling up dimension or the high modulation order. In [3], low-complexity sub-optimal linear detection schemes such as zero-forcing (ZF) [8] and minimum mean square error (MMSE) [9] have been investigated. Compared to other more sophisticated methods such as belief propagation [10, 11] and iterative solvers [12, 13, 14], the linear detection methods are favorable especially for practical system designers not only for their low complexity and implementation convenience, but also for their close-to-optimal performance [3]. When the number of antennas increases without bound, thermal noise, intra-cell interference, fast fading, and channel estimation errors [15]. Moreover, if the number of BS antennas () is much larger than that of user antennas (), i.e., , the simplest linear detectors are one of the best choices with respect to performance/complexity tradeoff [3].

Therefore, recent papers [16, 17, 18, 19, 20] studied linear detection and linear precoding under the optimistic assumption about the propagation conditions and the number of antennas. Some prototype M-MIMO systems also adopt linear algorithms considering the performance and implementation convenience [21, 22, 23, 24, 25]. All those linear algorithms have to address the key issue of an unfavorable filtering matrix inversion, for its increasing computational complexity in the M-MIMO systems. With the increased size of the channel gain matrix , the property of channel hardening, that the concern matrix tends to be diagonal dominant [3, 26], has inspired recent research groups [16, 18] to propose the low-complexity matrix inverse approximation (MIA) methods.

Most of the studies [16, 17, 18, 19, 20]

referred to above assume that the channels are independent and identically distributed (i.i.d.) Gaussian MIMO channels. However, in practice, the channel vectors for different users are generally correlated because the antennas are not sufficiently separated or the propagation environment does not offer enough scattering

[15, 27]. Spatial correlation between antennas is an important factor that affects the design and performance of MIMO systems, which specially matters for M-MIMO systems, owing to the number of antennas scaling up to an unprecedented magnitude [3]. For the scenarios of less favorable propagation environment where the channel is of high correlation or the ratio () is low, the study in the filtering matrix inversion and its efficient hardware implementation have not been considered based on the best knowledge of the authors.

In this work, we address the complexity issue of the filtering matrix inversion associated with linear detection in the M-MIMO systems. We consider a single-cell multi-user system model, which accounts for antenna correlation, ratio between BS antennas and users, and path loss in the uplink. We analyze the residual error induced by the matrix inversion approximation (MIA) method relying on NSE under the less favorable propagation environment. We propose an iterative MIA method also based on NSE, which can converge to an arbitrary -terms approximation without extra hardware cost compared to the recent works. Considering the residual error and the computational complexity, we analyze the initial matrix and then develop an optimizational tridiagonal matrix approximation (TMA) method to improve the convergence performance. We present corresponding VLSI architecture and show the reference FPGA results, which enables us to show the hardware efficiency compared to conventional methods. Finally, we provide the solution considering the performance/complexity tradeoff via presented designs.

Notation: The operations, , , , , , and , denote the conjugate, transpose, conjugate transpose, trace, absolute operator, and expectation, respectively. For a matrix , the entry in the row and column of is denoted by . denotes the Frobenius norm of matrix . For a vector , the entry in the is denoted by ; denotes the -norm of vector , such that


, , and refer to the identity, full-ones, and zero matrices, respectively. A matrix consists of the entries of is denoted as 111Then ..

Ii Preliminaries

ii.a System Model

Consider an uplink M-MIMO system with a single base station (BS) and user terminals (UTs), where the BS is equipped with antennas and each UT is equipped with single antenna only222UTs with multiple antennas can be decomposed into several single antenna cases or can be generalized from single-antenna UT case directly [28, 29, 30].. Denote to be the transmitted signal of the UT with a normalized expected total transmit power . The received signal at the BS side, , can be expressed as


where denotes the channel condition between the BS and the UT.

is the additive white Gaussian noise (AWGN) with zero mean and normalized variance


In the conventional uplink M-MIMO system analysis [1], only the geometric attenuation is adopted, whereas the antenna correlation issue for massive receiving antennas is ignored. In this paper, we apply the Kronecker model as proposed in [31] and obtain


where is the path-loss component between the BS and the UT, and denotes the correlation matrix for massive receiving antennas. is the fast fading coefficients with i.i.d. elements. Substituting Eq. (3) into Eq. (2), we have the following matrix form




Without loss of generality, we assume the classical MMSE equalizer is applied for the uplink M-MIMO system, and the estimated symbol at the receiver side is thus given by


The most challenging task in MMSE is to compute the inversion operation of the filtering matrix . In the traditional MIMO systems, since the number of received antennas is usually small, e.g. , the matrix inversion operation can be efficiently computed through the standard exact decomposition methods, such as QR-Gram Schmidt method [32], QR-Givens Rotation method [33], and Gauss-Jordan method [34]. If we note that is Hermitian, we can further apply Cholesky decomposition method to reduce the implementation complexity as shown in [35]. However, when the numbers of received antennas and of UTs , become significant (e.g. and ), the brute-force implementation may not be able to suffer due to the high power consumption and the massive storage requirement333In general, the conventional exact inversion methods (including Cholesky decomposition) require complicated data-flow mechanisms with sequential computation order, and involve hardware operations such as square-roots and divisions. As a result, those methods consume expensive implementation costs in terms of area, power, and latency, especially when the active UTs are large and varying..

ii.b Neumann Series Approximation

As an alternative approach to compute the large-scale matrix inversion, the Neumann series expansion (NSE) [36] takes the advantage of iterative structure to gradually improve the accuracy, which avoids the prohibitive power and storage cost for general exact decomposition methods [37]. For illustration purpose, we summarize the theoretical background in Lemma 1 and describe the benchmark implementation in what follows.

Lemma 1 (Neumann series).

For a non-singular matrix with , we have is also non-singular and its inverse is given by


Applying Lemma 1 and with some mathematical manipulations, we can rewrite the inverse of using NSE and obtain , where is an arbitrary matrix satisfying


To write them into the iterative form, define . We have


where is the multiplication coefficient.

The following assumptions are adopted through the rest of the paper. First, we generate the receiving correlation matrix as and , according to [38]. Here denotes the correlation coefficient between consecutive antennas and denotes a given phase. Second, perfect channel-state information is assumed to be available at the receiver side. Third, we assume non-correlation among different UTs in the uplink.

Iii Low-Complexity Tridiagonal Matrix Approximation Algorithm

In the conventional approach to deal with M-MIMO systems [1], we usually rely on the fact that to simplify the filtering matrix under the uncorrelated channel, e.g., when . The remaining channel inversion operation is rather straightforward. Even for general large value of , the off-diagonal elements are relatively small and we can rely on to efficiently compute using NSE [26, 3]. However, for correlated M-MIMO environment, there is still lack of systematical ways for efficient implementation. In this section, we first analyze the characteristics of correlated M-MIMO channels and then propose a tridiagonal approximation algorithm to improve the accuracy of NSE.

iii.a Characteristics of Correlated M-MIMO Channels

With the receiving antenna correlation, one of the most important issues is to obtain the statistical information of , from which we can analyze and possibly simplify the corresponding receiving structure using NSE. In what follows, we derive the statistical information of with general receiving correlation and summarize the main results in Theorem 1.

Theorem 1.

In the correlated MIMO environment with the receiving antenna correlation matrix , the filtering matrix, , has the following statistical property, where its mean and variance matrices are given by Eq.s (10) and (11), respectively.

Remark 1.

Please refer to Appendix A for the proof.

From Theorem 1, the channel variance matrix has been greatly affected by the receiving correlation , whereas the statistical mean does not change if compared with uncorrelated case. In this case, the magnitude of off-diagonal elements in the filtering matrix actually increases with the Frobenius norm [39] of the receiving correlation matrix , and decreases with the number of antenna elements , as shown in Fig. 1. The recent work has proven that a larger () would be favorable propagation, so we next mainly analyze the effect of correlation on the property of in this paper.

If we further plot the -norm [39] of diagonal elements versus off diagonal elements as shown in Fig. 2, we can observe that the -norm of the main () diagonal decreases with the correlation coefficient , whereas -norms of other diagonals increase. This is partially because of the ever increasing ratio . The conventional NSE way to compute relying on may result in inaccuracy and slow convergence.

As shown in Fig. 3, the ratio of , where , is at least larger than the ratio in various correlation condition. So NSE relying on would help to improve the convergence stability and speed according to [36].

Fig. 1:

The normalized distribution of

with a fixed number of UTs, . tends to be strongly diagonally dominant in a favorable condition with large or small .
Fig. 2: Comparison of the normalized -norm between main diagonal and other diagonals with a fixed ratio, , , for different propagation channels.
Fig. 3: Comparison of and , with , .

iii.b Error Analysis

In practical M-MIMO systems, the number of NSE terms cannot be very large because of the considerations on latency and complexity. In this case, we next analyze the error resulted from the truncated polynomial, , for MMSE estimation in the massive systems. We define the approximation error matrix as , and compute the estimated symbol at the receiver side with the -term approximation , i.e.,


where is the residual estimation error. We denote the square of -norm of residual estimation error by and bound by:


In M-MIMO systems, let the number of receive antennas grow large while keeping the number of transmit antennas constant, we further assume the uncorrelated noise is averaged out, which is corresponding to [3]. In this case, the expect of can be derived as:

Remark 2.

Please refer to Appendix B for the proof.

From Eq.s (13) and (14), the size of is determined by . Considering the convergence condition in NSE, i.e., , one intuitive method to reduce is to realize a large -term approximation. As a limit case, the caused by the approximation error is zero when .

Another method to reduce is to decrease the consecutive norm of the inner matrix, . Concretely, in view of , can reduce faster if the initial matrix is more close to the filtering matrix . As shown in Fig. 4

, the probability that the inner matrix

has a smaller Frobenius norm when using is always higher than the method relying on , which means an improved convergence speed and stability. Moreover, the scenarios where the correlation is high, i.e., , with a small may not satisfy the convergence condition of NSE, which means a much unfavorable propagation environment. Thus, in the remainder of this paper, we mainly concern the propagation environment where .

Fig. 4: when and , .

iii.c Tridiagonal Approximation Algorithm

To further improve the stability and convergence for the NSE method accommodating less favorable propagation environment, a valid method is to acquire more off-diagonal elements, i.e., . Though setting up requires more computations as compared to choosing , the overall complexity can be expected to reduce because of the improved convergence rate, which in-turn would require smaller iteration number in Eq. (9). Therefore, one would enjoy a better performance/complexity tradeoff. Shown in Fig. 3, it is obvious that occupies a dominated proportion in magnitude, whereas other s account for relatively small proportion. Then we initialize the matrix as . For lower complexity and better convergence for NSE, a modified method to compute the approximate inversion of is developed. It should be noted that a tri-band matrix inversion for pre-coding was proposed in [18].

Inspired by existing literature [40], the straightforward inversion of the tridiagonal matrix could be carried out as follows. For better explanation, we denote by


The inverse matrix can be given by


where , . Also


Other parameters are given by the second-order linear recurrences as follows:


where the initialization conditions are and .

One critical issue of this straightforward inversion operation is the prohibitive complexity involved, which is . Note that the complexity here is estimated by counting the number of complex multiplications. For M-MIMO systems with considerable large , the resulting high complexity may hinder the application of such inversion method in real systems. We again note that the filtering matrix is Hermitian. Then vector is the conjugate transpose of . Since the inverse of a Hermitian matrix is still Hermitian, only half of the required entries have to be calculated with the sophisticated equations of Eq.s (16) and (17). The other half could be obtained conveniently by simple conjugate operation. Admittedly, the Hermitian property does help to reduce about half complexity, the complexity stays in the same order and fails to meet our initial expectation for introducing NSE. For a better balance of performance and complexity, especially in correlated massive MIMO systems, further simplifications are expected.

To this end, in what follows, we first decompose the straightforward tridiagonal approach into two steps: 1) Step 1, which includes Eq.s (16) and (17), mainly deals with the entry calculation; 2) Step 2, which includes Eq. (18), mainly deals with the parameter calculation. Then, stepwise optimizations are carried out by looking into details of each step. Statistical property of the filtering matrix has been made full use to achieve a better performance/complexity tradeoff.

iii.c1 Simplification for Step 1

Since the inverse of a tridiagonal matrix is a full matrix, the complexity of Step 1 is . However, in correlated M-MIMO channels, -norm of the main diagonal for is higher than that of the secondary diagonals, which implies further simplification.

Let us only consider the lower triangular part of , i.e., . According to the formula , the calculation starts with the main diagonal, then recursively computes the off-diagonal entries in a column-wise manner. The entries of vector will be multiplied recursively. If we understand how the magnitude of entries of change with channels, we will be able to depict the off-diagonal entries’ attenuation accordingly. We evaluate the overall magnitude of by its mean . The tendency of with correlation coefficient changing is given in Table I.



TABLE I: The tendency of with correlation coefficient changing

It is observed that when , stays less than , which indicates that for channels with , the magnitudes of outer diagonals shrink at the scale of layer by layer. In other words, the dominated portion of fall into the region between two secondary diagonals. The rest part of , whose magnitude is less than of the main diagonal, can be omitted as a result. Thus, the complexity will be effectively reduced to .

iii.c2 Simplification for Step 2

We recall that in correlated M-MIMO channels, -norm of the main diagonal for is higher than that of the secondary diagonals. Therefore, the second term at the right-hand-side of the following equation can be reasonably removed:

Then, we have the following approximation:


With this approximation, the formula for calculating the main diagonal of can be simplified as:


For previous calculation in Eq. (18), two sequences and are obtained in ascending and descending orders of or , respectively. Thanks to the approximation, there is no more need to calculate and the processing latency has been successfully halved.

To evaluate the performance degradation, numerical results are given in Table II, by denoting the error percentage as . Since observation shows that It is observed that when the error percentage stays below , it is safe to introduce this approximation within this region. For clearer denotation, we use instead of in following.


Error ()


TABLE II: Error percentage with proposed approximation when changes

With the aforementioned simplification steps, now the new algorithm to calculate is given by Alg. 1 as follows:

0:  , ,
1:  Initialize:
2:  for  do
7:  end for
8:  for  do
12:  end for
12:  , ,
Algorithm 1 Proposed Tridiagonal Matrix Inversion
Remark 3.

Please refer to Appendix D for the derivation.

The statistical property of the filtering matrix inspires us to reduce the computational complexity to calculate . Though is a full matrix, the entries outside the three central diagonals are notably small in magnitude. Therefore, we mainly focus on the three dominating diagonals, i.e., . For the purpose of parallel computation, some units are modified accordingly. The resulting hardware overhead is usually small compared to the matrix-matrix multiplications. Furthermore, we only directly calculate the lower-triangular part of the Hermitian matrix.

We denote this tridiagonal matrix inversion approximation by TMA. Simulation results are in Section V.A. The detailed hardware implementation is shown in Section IV as follows.

Iv Hardware-Friendly Implementation

iv.a Efficient Pipelining Structure

The TMA formula can be written as follows,


where is the multiplication coefficient. According to the linear detection formula Eq. (6), general framework of hardware can be designed as in Fig. 5. The framework mainly consists of the following parts:

iv.a1 Orange Block

It is the pre-processing module (PM), where , and is the normalized variance of white Gaussian noise (AWGN) with zero mean. The output of PM is , which can be split as . For the diagonal Neumann series (DNS) method, matrix includes all the diagonal elements of ; whereas for the TMA, matrix only includes the tridiagonal elements in .

iv.a2 Blue Block

It is the main-computing module (MM), which uses TMA to calculate with iterations. The main structure of this part is IIR-like, whose iterations can be controlled to realize different accuracy requirements.

iv.a3 Green Block

It is the estimation module (EM), which computes , where .

Fig. 5: Proposed linear detection architecture with TMA.

iv.b Diagonal/Tri-Diagonal Matrix Inversion

This module calculates the inversion matrix of , which is a diagonal matrix in the DNS scheme. So the inversion matrix of is used to compute the reciprocals of diagonal elements. In order to compute , a reciprocal unit based on look-up table (LUT), which is particularly suitable for FPGA implementation, is employed.

The tridiagonal matrix inversion has been simplified in Alg. 1, and the correspondent hardware design can be seen in Fig. 6. This architecture is realized in a pipelining way, in which each clock increases from to and and change accordingly. A switch is employed for scheduling: 1) when , the switch is open and the output is ; 2) when , the switch is closed and outputs and . Different blocks have different functions.

iv.b1 Red Block

It operates iteratively to realize States 3 and 5 in Alg. 1.

iv.b2 Blue Block

It completes States 6 and 9 to compute the main diagonal elements of .

iv.b3 Green Block

It computes the secondary diagonal elements of lower triangle. For is Hermitian, is Hermitian as well. With conjugate operation we can get the entire matrix .

The inversion needs real adders, complex multipliers, and LUTs. The latency of this module is clocks, where is the size of .

Fig. 6: Architecture of tridiagonal inversion based on Alg. 1.

Alg. 1 can be converted to Alg. 2 for lower implementation complexity. We combine the common hardware of two identical algorithm states, which could save half of the hardware cost. It is noted that States 3 and 5 in Alg. 1 share the same iterative operation as States 6 and 9. Therefore, the common hardware can be reused. The resulting architecture also works in a pipelining way but increases one every two clocks. Now, the work done in one clock previously needs two clocks to complete, and latency doubles. The folded architecture is in Fig. 7, which executes States 3 and 5 (or States 4 and 6) in Alg. 2 when (or ). It needs real adder, complex multipliers, LUTs, and clocks.

0:  , ,
1:  Initialize:
2:  for  do
8:  end for
9:  for  do
13:  end for
13:  , ,
Algorithm 2 Proposed Tridiagonal Matrix Approximation
Fig. 7: Architecture of tridiagonal inversion based on Alg. 2.
Remark 4.

Architectures in both Fig.s 6 and 7 can take care both diagonal and tri-diagonal matrix inversions.

iv.c Diagonal/Tri-Diagonal Multiplier

This module calculates matrix , which is the product of matrices and . No matter is a diagonal matrix or tridiagonal matrix , the channel hardening property makes the multiplication module simple. To calculate , every diagonal element of is multiplied with the corresponding row of . Furthermore, when calculating , three lines of diagonal elements needs to product corresponding row of the latter matrix, and using registers to add three sets of results in correct combination. The hardware design can be seen in Fig. 8. By using complex multipliers, registers and complex adders, the multiplication between tridiagonal matrix and any matrix can be realized. Meanwhile, when we input the main diagonal elements into the first set of multiplications, the product between diagonal matrix and any matrix can be realized as well. Therefore, we have:

Remark 5.

Architecture in Fig. 8 can implement multiplication of diagonal or tri-diagonal matrix with any matrix.

Fig. 8: Architecture of tridiagonal matrix multiplier, where is corresponding element of the matrix and denotes corresponding row in the matrix.

iv.d Neumann Series Iteration Module

This module is IIR-like, including a matrix multiplier and a matrix adder, which implements the matrix approximation inverse. With and inputted, it outputs in the iteration. By controlling the iteration number, different accuracies can be achieved.

Matrix multiplier is implemented by a systolic array. For is Hermitian, once we calculate the lower-triangular part, the upper-triangular part can be obtained by conjugating. The systolic array has complex adders and complex multipliers. For is also Hermitian, the matrix multipliers in PM and MM can be multiplexed with one lower-triangular systolic array. Matrix adder needs one real adder for diagonal matrix addition, and two complex adders and one real adder for tridiagonal addition.

Remark 6.

Please refer to Appendix C for the proof that is Hermitian.

Remark 7.

Remarks 4 and 5 indicate that the design in Fig. 5 can implement detection based on either DNS or TMA.

iv.e Complexity Analysis

In this section, complexity comparison is given in terms of numbers of real adders and real multipliers. One complex adder includes two real adders, and one complex multiplier includes four real multipliers and two real adders. Then, the complexity of the proposed detector is:

Lemma 2.

For one -user TMA detection, the numbers of real adders () and real multipliers () are,


PM module mainly involves a low-triangular systolic array, which has complex adders and complex multipliers. The total cost is real adders and real multipliers.

Sharing the low-triangular systolic array with PM, MM becomes the main matrix inversion part: 1) for DNS, it costs real adders and complex multipliers ( real adders and real multipliers in total), 2) for TMA, it costs real adders, complex adders, and complex multipliers ( real adders and real multipliers in total).

EM module only involves a vector multiplier, which costs complex adders and complex multipliers ( real adders and real multipliers in total). ∎

The complexity comparison of Cholesky decomposition (CD) [37, 41], DNS, and TMA is in Table III.


Method Latency (clocks) Adder # Multiplier #


TABLE III: Complexity comparison on different detection methods

iv.f Timing Analysis

When , the timing analyses is as follows:

Lemma 3.

For one users, antennas MIMO system, the latency for TMA detection is,


Using the lower-triangular systolic array, PM generates its output from Clock to . In systolic array, when is calculated in Clock , and are calculated in Clock . For is obtained in Clock , is obtained in Clock .

To calculate , each needs clock cycles by Alg. 2, which fits the output frequency of . Thus, is obtained in Clock , and in Clock .

According to the rule of systolic array, () is calculated from Clock to . After pipelining, the final estimation vector completes in Clock . ∎

Lemma 4.

For one users antennas MIMO system, the latency for DNS detection is,


Compared to TMA, dia inversion requires one less clock for each . Dia multiplier needs one less clock as well. Thus, the latency reduces by . ∎

Since the latency is mainly spent for the iterative multiplication, as the iteration number grows, the latency becomes longer and throughput drops. Based on numerical results in Section V.A, different ’s are set for different ’s and ’s.

V Architecture Analysis

v.a Numerical Results

We propose the Neumann series approximation to reduce the complexity of MIMO linear detection method. However, different iterations lead to different latencies, if becomes too large, the latency will be too high and result in low throughput. However, when is not big enough, the approximate result will not reach the required accuracy. Here, the accuracy metric is defined as: the performance degradation is less than dB compared with the Cholesky decomposition based exact MMSE detection when . Different sets of and correlated coefficients result in quite different iteration counts. We compare on the iteration counts for diagonal scheme (DNS), tridiagonal scheme (TNS), and simplified tridiagonal scheme (TMA).




TABLE IV: Comparison on iteration counts for different ’s and ’s
Fig. 9: BER comparison among exact MMSE method, DNS, TNS, and TMA. Set in channel by -QAM.
Fig. 10: BER comparison among exact MMSE method, DNS, TNS, and TMA. Set in channel by -QAM.
Fig. 11: BER comparison among exact MMSE method, DNS, TNS, and TMA. Set in channel by -QAM.
Fig. 12: BER comparison among exact MMSE method, DNS, TNS, and TMA. Set in channel by -QAM.
Fig. 13: BER comparison among exact MMSE method, DNS, TNS, and TMA. Set in channel by -QAM.
Fig. 14: BER comparison among exact MMSE method, DNS, TNS, and TMA. Set in channel by -QAM.

Fig. 9-14 show the performance comparison of the aforementioned methods in different systems, and the results are summarized in Table IV. When grows, both methods need more iteration counts to reach required accuracy. According to error analysis in Section III.B, when the Neumann series scheme will not be suitable for such channel situation, our simulation results also show this in Table IV. Another observation is that larger requires fewer iteration counts. From the comparison we can clearly find that TMA saves to iterations than DNS. Also, our improved approximate tridiagonal matrix inversion in Section III.C1 does not bring extra performance loss.

In order to better show the advantage of proposed methods, it is necessary to compare their performance with other algorithms. Rather than put 1 or 2 other algorithms in each of Fig.s 9-11, we would like to show that under different correlation scenarios, the state-of-the-art linear iteration detectors such as CG or SD (adaptive preconditioned SD, APSD) methods could not outperform the exact MMSE detector based on Cholesky decomposition. As long as our proposed detectors can achieve comparable performance as the exact MMSE detector based on Cholesky decomposition, their performance is guaranteed even compared with the state-of-the-art linear iterative detectors. In Fig. 15, BER performances under different correlation conditions are compared. Specifically, we have: i) User Correlated case , ii) BS Correlated case , and iii) Fully correlated case . It is shown that, the proposed TMA detector can effectively approach the exact MMSE detector in different correlation conditions. Therefore, the advantages of proposed decoders can be guaranteed over the state-of-the-art detectors, regarding different correlation conditions.

Fig. 15: Performance comparison with different correlation conditions.

v.b VLSI Implementation Results

We investigate the VLSI implementation architecture for the proposed (TMA and DNS friendly) detection method based on a Xilinx Virtex-7 XC7VX690T FPGA and compare the hardware cost in Table V. For fair comparison with NSE [16], GS [42], CG [43], and OCD [44], the quantization length is set as . The fixed-point simulation results of proposed quantization scheme are shown in Fig.s 9-11, and the performance loss is negligible. Here, “fd” is short for “fixed”. We assume BS antennas and users. Although TMA detection requires LUTs as DNS, it requires lower and enjoys higher throughput. Shown in Tables VI and VII, with decreased and increased , the throughput advantage is more obvious.


Detection algorithm TMA DNS
Modulation -QAM -QAM
Block RAMs
Maximum frequency MHz MHz


TABLE V: FPGA results for TMA and DNS with


Method Latency (clocks) Throughput (Mbps)


TABLE VI: Throughput for different ’s, when , , -QAM


Method Latency (clocks) Throughput (Mbps)


TABLE VII: Throughput for different ’s, when , , -QAM

To achieve Mb/s data rate for each user specified in 3GPP LTE-Advanced [45], we use instances of TMA detector. Table VIII lists comparison with other detection implementation results with Xilinx Virtex-7 XC7VX690T FPGA for MIMO system: NSE [16], GS [42], CG [43], and OCD [44]. It is shown that for correlated massive MIMO systems, it can achieve near-MMSE performance and Mb/s throughput. For fair comparison, hardware efficiency is compared in terms of throughput/LUTs and throughput/FFs. The throughput/LUTs ratio of proposed TMS is better than NSE [16], GS [42], and CG [43]. If we consider another metric throughput/FF, it would be noted that this work will have the highest value of compared to the state-of-the-art.


Detector M. Wu [16] Z. Wu [42] B. Yin [43] M. Wu [44] This Work
[JSTSP’Nov. 14] [ISCAS’16] [ISCAS’15] [TCAS-I’Dec. 16]
Performance near-MMSE near-MMSE near-MMSE near-MMSE near-MMSE
Modulation scheme -QAM -QAM -QAM -QAM -QAM
Iteration count
Instance or subcarrier #
Latency [clks]
Clock freq. [MHz]
Throughput [Mb/s]


TABLE VIII: Implementation results and comparison on a Xilinx Virtex-7 XC7VX690T FPGA for MIMO system

v.c Comments and Further Work

Advantages: The hardware cost of DNS is lower than that of accurate inversion method (Cholesky decomposition). With weaker requirement of iteration number, DNS enjoys a higher throughput. With the same performance, TMA saves iterations than DNS. Since the iteration number can be easily controlled, it is feasible for different applications.

Drawbacks: On bad channel condition, both DNS and TMA detections need higher iteration counts , which increases the detecting latency and lowers the throughput.

An improved Neumann series: In order to solve the problem, an improved Neumann series inverse framework is proposed for higher convergence rate. The formula of Neumann series can be rewritten as follows:


Then a formula based on multiple iterations is obtained:

Fig. 16: Architecture of the improved approximate matrix inversion.

To obtain the iteration’s output of original Neumann series, now we only need iterations. Convergence comparison after such alteration is listed in Table IX. Compared with the original iteration inverse framework, the modified scheme adds a systolic multiplication module. Though it requires more hardware, a great rise of convergence and throughput appears. Also, the hardware can be parallelized for lower latency per iteration. This framework applies to situations with harsh channel condition and relatively high accuracy requirement. The iteration part of modified hardware framework is in Fig. 16.

There are mainly two differences from the former structure. Firstly, the former constant multiplication coefficient matrix will be replaced by continuous self-squares. Secondly, the bias increment matrix, which used to be fixed in the original iteration, is now replaced by the last result of iteration operation.


Iteration Original Neumann series Order Improved Neumann series Order