I. Introduction
Massive MIMO (or large-scale MIMO, MMIMO), an emerging technology that employs hundreds of antennas at the base station (BS) to serve a relatively small number of users simultaneously, is a promising candidate for next-generation wireless communications [1, 2]. By adding multiple antennas, extra degrees of freedom in the wireless channel can be offered to accommodate more information data. Hence, significant performance improvements can be obtained in terms of data rate, reliability, spectral efficiency, and energy efficiency [3].
Unfortunately, the advantages of MMIMO systems come at the cost of significantly increased computational complexity at the BS. One of the most challenging issues is low-complexity signal detection in both the uplink and downlink directions [3]. For optimal signal detection such as maximum-likelihood (ML) detection [4], the complexity increases exponentially with the number of terminal antennas [5], which is prohibitive for MMIMO detection. Fixed-complexity sphere decoding (SD) [6, 7] would also incur unaffordable complexity as the dimension or the modulation order scales up. In [3], low-complexity suboptimal linear detection schemes such as zero-forcing (ZF) [8] and minimum mean square error (MMSE) [9] have been investigated. Compared to more sophisticated methods such as belief propagation [10, 11] and iterative solvers [12, 13, 14], linear detection methods are favorable, especially for practical system designers, not only for their low complexity and implementation convenience, but also for their close-to-optimal performance [3]. When the number of antennas increases without bound, the effects of thermal noise, intra-cell interference, fast fading, and channel estimation errors can be averaged out [15]. Moreover, if the number of BS antennas () is much larger than that of user antennas (), i.e., , the simplest linear detectors are among the best choices with respect to the performance/complexity tradeoff [3].
Therefore, recent papers [16, 17, 18, 19, 20] have studied linear detection and linear precoding under optimistic assumptions about the propagation conditions and the number of antennas. Some prototype MMIMO systems also adopt linear algorithms in view of their performance and implementation convenience [21, 22, 23, 24, 25]. All these linear algorithms have to address the key issue of an unfavorable filtering matrix inversion, whose computational complexity grows rapidly in MMIMO systems. With the increased size of the channel gain matrix , the property of channel hardening, namely that the matrix of concern tends to be diagonally dominant [3, 26], has inspired recent research groups [16, 18] to propose low-complexity matrix inverse approximation (MIA) methods.
Most of the studies [16, 17, 18, 19, 20]
referred to above assume that the channels are independent and identically distributed (i.i.d.) Gaussian MIMO channels. However, in practice, the channel vectors for different users are generally correlated because the antennas are not sufficiently separated or the propagation environment does not offer enough scattering
[15, 27]. Spatial correlation between antennas is an important factor that affects the design and performance of MIMO systems, and it especially matters for MMIMO systems, owing to the number of antennas scaling up to an unprecedented magnitude [3]. For scenarios with a less favorable propagation environment, where the channel is highly correlated or the ratio () is low, the filtering matrix inversion and its efficient hardware implementation have not been studied, to the best of the authors' knowledge.
In this work, we address the complexity issue of the filtering matrix inversion associated with linear detection in MMIMO systems. We consider a single-cell multiuser system model, which accounts for antenna correlation, the ratio between BS antennas and users, and path loss in the uplink. We analyze the residual error induced by the matrix inversion approximation (MIA) method relying on NSE under a less favorable propagation environment. We propose an iterative MIA method, also based on NSE, which can converge to an arbitrary-term approximation without extra hardware cost compared to recent works. Considering the residual error and the computational complexity, we analyze the initial matrix and then develop an optimized tridiagonal matrix approximation (TMA) method to improve the convergence performance. We present the corresponding VLSI architecture and report reference FPGA results, which allow us to demonstrate the hardware efficiency compared to conventional methods. Finally, we provide a solution considering the performance/complexity tradeoff via the presented designs.
Notation: The operations , , , , , and denote the conjugate, transpose, conjugate transpose, trace, absolute value, and expectation, respectively. For a matrix , the entry in the row and column of is denoted by . denotes the Frobenius norm of the matrix . For a vector , the entry is denoted by ; denotes the norm of the vector , such that
(1) 
, , and refer to the identity, all-ones, and zero matrices, respectively. A matrix consisting of the entries of is denoted as .
II. Preliminaries
II-A. System Model
Consider an uplink MMIMO system with a single base station (BS) and user terminals (UTs), where the BS is equipped with antennas and each UT is equipped with a single antenna only. (UTs with multiple antennas can be decomposed into several single-antenna cases or can be generalized directly from the single-antenna UT case [28, 29, 30].) Let denote the transmitted signal of the UT, with a normalized expected total transmit power . The received signal at the BS side, , can be expressed as
(2) 
where denotes the channel condition between the BS and the UT.
is the additive white Gaussian noise (AWGN) with zero mean and normalized variance .
In the conventional uplink MMIMO system analysis [1], only the geometric attenuation is considered, whereas the antenna correlation of massive receiving antenna arrays is ignored. In this paper, we apply the Kronecker model as proposed in [31] and obtain
(3) 
where is the path-loss component between the BS and the UT, and denotes the correlation matrix of the massive receiving antenna array. is the matrix of fast-fading coefficients with i.i.d. elements. Substituting Eq. (3) into Eq. (2), we obtain the matrix form
(4) 
where
(5) 
Without loss of generality, we assume the classical MMSE equalizer is applied for the uplink MMIMO system, and the estimated symbol at the receiver side is thus given by
(6) 
The most challenging task in MMSE detection is to compute the inverse of the filtering matrix . In traditional MIMO systems, since the number of receive antennas is usually small, e.g. , the matrix inversion can be efficiently computed through standard exact decomposition methods, such as the QR-Gram-Schmidt method [32], the QR-Givens-rotation method [33], and the Gauss-Jordan method [34]. Noting that is Hermitian, we can further apply the Cholesky decomposition to reduce the implementation complexity, as shown in [35]. However, when the numbers of receive antennas and of UTs become large (e.g. and ), the brute-force implementation may no longer be affordable due to its high power consumption and massive storage requirement. (In general, the conventional exact inversion methods, including Cholesky decomposition, require complicated dataflow with a sequential computation order and involve hardware operations such as square roots and divisions. As a result, these methods incur expensive implementation costs in terms of area, power, and latency, especially when the number of active UTs is large and varying.)
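For reference, the exact MMSE detector used as the benchmark here can be sketched in a few lines of NumPy; the symbol names (`H`, `y`, `n_var`) are ours, standing in for the paper's stripped notation, and the triangular solves stand in for the hardware decomposition unit.

```python
import numpy as np

def mmse_detect_cholesky(H, y, n_var):
    """Exact MMSE detection: solve (H^H H + n_var I) x = H^H y
    via a Cholesky factorization instead of an explicit inverse."""
    K = H.shape[1]
    A = H.conj().T @ H + n_var * np.eye(K)   # Hermitian filtering matrix
    L = np.linalg.cholesky(A)                # A = L L^H
    b = H.conj().T @ y                       # matched-filter output
    z = np.linalg.solve(L, b)                # forward substitution
    return np.linalg.solve(L.conj().T, z)    # backward substitution
```

Avoiding the explicit inverse is the standard numerical practice; the approximation methods below exist because even this factorization becomes costly in hardware at MMIMO dimensions.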
II-B. Neumann Series Approximation
As an alternative approach to compute the large-scale matrix inversion, the Neumann series expansion (NSE) [36] takes advantage of an iterative structure to gradually improve the accuracy, which avoids the prohibitive power and storage costs of general exact decomposition methods [37]. For illustration purposes, we summarize the theoretical background in Lemma 1 and describe the benchmark implementation in what follows.
Lemma 1 (Neumann series).
For a nonsingular matrix with , the matrix is also nonsingular and its inverse is given by
(7) 
Applying Lemma 1 and with some mathematical manipulations, we can rewrite the inverse of using NSE and obtain , where is an arbitrary matrix satisfying
(8) 
To write this in an iterative form, define . We have
(9) 
where is the multiplication coefficient.
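A minimal software model of the truncated series in its iterative form can make the recursion concrete; the variable names are ours, and `X0inv` plays the role of the inverse of the initial matrix in the text.

```python
import numpy as np

def neumann_inverse(A, X0inv, n_terms):
    """Truncated Neumann series inverse of A around an initial matrix X0:
    sum_{n=0}^{N-1} (I - X0^{-1} A)^n X0^{-1}, computed iteratively as
    P_{k+1} = X0^{-1} + E @ P_k with inner matrix E = I - X0^{-1} A."""
    E = np.eye(A.shape[0]) - X0inv @ A
    P = X0inv.copy()
    for _ in range(n_terms - 1):
        P = X0inv + E @ P       # add one more series term per iteration
    return P
```

Each iteration costs one matrix-matrix product, so the accuracy/latency tradeoff is controlled directly by `n_terms`, exactly the knob the hardware iteration module exposes later.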
The following assumptions are adopted throughout the rest of the paper. First, we generate the receiving correlation matrix as and , according to [38], where denotes the correlation coefficient between consecutive antennas and denotes a given phase. Second, perfect channel-state information is assumed to be available at the receiver side. Third, we assume no correlation among different UTs in the uplink.
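The simulation setup can be reproduced along the following lines; the exponential form R[i,j] = (rho e^{j theta})^(j-i) for j >= i is our reading of the model in [38], and the Cholesky factor is one valid choice of matrix square root for the Kronecker model.

```python
import numpy as np

def exp_correlation(M, rho, theta=0.0):
    """Exponential receive-correlation matrix with coefficient rho and
    phase theta between consecutive antennas (Hermitian by construction)."""
    t = rho * np.exp(1j * theta)
    d = np.subtract.outer(np.arange(M), np.arange(M))   # i - j
    base = t ** np.abs(d)
    return np.where(d <= 0, base, np.conj(base))        # conjugate lower triangle

def correlated_channel(R, beta, rng):
    """Kronecker-model uplink channel H = R^{1/2} G D^{1/2}, with i.i.d.
    CN(0,1) fast fading G and per-user path losses beta."""
    M, K = R.shape[0], len(beta)
    G = (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))) / np.sqrt(2)
    return np.linalg.cholesky(R) @ G @ np.diag(np.sqrt(beta))
```

For rho < 1 this correlation matrix is Hermitian positive definite, so the Cholesky square root always exists.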
III. Low-Complexity Tridiagonal Matrix Approximation Algorithm
In the conventional approach to MMIMO systems [1], one usually relies on the fact that to simplify the filtering matrix under an uncorrelated channel, e.g., when . The remaining channel inversion operation is then rather straightforward. Even for a general large value of , the off-diagonal elements are relatively small and we can rely on to efficiently compute using NSE [26, 3]. However, for the correlated MMIMO environment, there is still a lack of systematic ways to achieve an efficient implementation. In this section, we first analyze the characteristics of correlated MMIMO channels and then propose a tridiagonal approximation algorithm to improve the accuracy of NSE.
III-A. Characteristics of Correlated MMIMO Channels
With receiving antenna correlation, one of the most important issues is to obtain the statistical information of , from which we can analyze and possibly simplify the corresponding receiver structure using NSE. In what follows, we derive the statistical information of under general receiving correlation and summarize the main results in Theorem 1.
Theorem 1.
In the correlated MIMO environment with receiving antenna correlation matrix , the filtering matrix has the following statistical properties: its mean and variance matrices are given by Eqs. (10) and (11), respectively.
(10) 
(11)  
Remark 1.
Please refer to Appendix A for the proof.
From Theorem 1, the channel variance matrix is strongly affected by the receiving correlation , whereas the statistical mean does not change compared with the uncorrelated case. In this case, the magnitude of the off-diagonal elements of the filtering matrix actually increases with the Frobenius norm [39] of the receiving correlation matrix , and decreases with the number of antenna elements , as shown in Fig. 1. Recent work has shown that a larger () yields more favorable propagation, so in the rest of this paper we mainly analyze the effect of correlation on the properties of .
If we further plot the norm [39] of the diagonal elements versus the off-diagonal elements, as shown in Fig. 2, we can observe that the norm of the main () diagonal decreases with the correlation coefficient , whereas the norms of the other diagonals increase. This is partially because of the ever-increasing ratio . The conventional NSE approach of computing by relying on may therefore result in inaccuracy and slow convergence.
III-B. Error Analysis
In practical MMIMO systems, the number of NSE terms cannot be very large because of latency and complexity considerations. We therefore analyze the error resulting from the truncated polynomial, , for MMSE estimation in massive systems. We define the approximation error matrix as , and compute the estimated symbol at the receiver side with the term approximation , i.e.,
(12)  
where is the residual estimation error. We denote the squared norm of the residual estimation error by and bound it as follows:
(13) 
In MMIMO systems, letting the number of receive antennas grow large while keeping the number of transmit antennas constant, we further assume the uncorrelated noise is averaged out, which corresponds to [3]. In this case, the expectation of can be derived as:
(14) 
Remark 2.
Please refer to Appendix B for the proof.
From Eqs. (13) and (14), the size of is determined by . Considering the convergence condition of NSE, i.e., , one intuitive way to reduce is to use a larger number of approximation terms. In the limiting case, the error caused by the approximation is zero when .
Another way to reduce is to decrease the consecutive norms of the inner matrix, . Concretely, in view of , decreases faster if the initial matrix is closer to the filtering matrix . As shown in Fig. 4, the probability that the inner matrix has a smaller Frobenius norm when using is always higher than with the method relying on , which implies improved convergence speed and stability. Moreover, scenarios with high correlation, i.e., , and a small may not satisfy the convergence condition of NSE, corresponding to a much less favorable propagation environment. Thus, in the remainder of this paper, we mainly consider propagation environments where .

III-C. Tridiagonal Approximation Algorithm
To further improve the stability and convergence of the NSE method in less favorable propagation environments, an effective method is to retain more off-diagonal elements, i.e., . Though setting up requires more computations than choosing , the overall complexity can be expected to decrease because of the improved convergence rate, which in turn requires a smaller iteration number in Eq. (9). Therefore, one enjoys a better performance/complexity tradeoff. As shown in Fig. 3, occupies a dominant proportion in magnitude, whereas the other s account for relatively small proportions. We therefore initialize the matrix as . For lower complexity and better NSE convergence, a modified method to compute the approximate inverse of is developed. It should be noted that a triband matrix inversion for precoding was proposed in [18].
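The intuition that a tridiagonal initial matrix tightens the NSE residual can be checked numerically. The banded test matrix below, with geometrically decaying off-diagonals, is our stand-in for the filtering matrix, not the paper's data.

```python
import numpy as np

def band_part(A, bw):
    """Zero out all entries more than `bw` diagonals from the main one."""
    d = np.abs(np.subtract.outer(np.arange(A.shape[0]), np.arange(A.shape[0])))
    return np.where(d <= bw, A, 0.0)

# Surrogate filtering matrix with geometrically decaying off-diagonals.
n, rho = 8, 0.3
A = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

# Frobenius norm of the NSE inner matrix I - X^{-1} A for both choices.
res = {}
for bw in (0, 1):                  # 0: diagonal init, 1: tridiagonal init
    Xinv = np.linalg.inv(band_part(A, bw))
    res[bw] = np.linalg.norm(np.eye(n) - Xinv @ A)
```

In this toy run the tridiagonal initialization yields the smaller residual norm (`res[1] < res[0]`), matching the convergence argument above: a smaller inner-matrix norm means fewer Neumann terms for the same accuracy.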
Inspired by existing literature [40], the straightforward inversion of the tridiagonal matrix could be carried out as follows. For better explanation, we denote by
(15) 
The inverse matrix can be given by
(16) 
where , . Also
(17) 
The other parameters are given by the following second-order linear recurrences:
(18) 
where the initialization conditions are and .
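A direct software transcription of this closed-form tridiagonal inverse (in the spirit of the recurrences cited from [40]; the indexing conventions are ours) can serve as a reference model. Note the double loop over entries, each with an inner diagonal product, which reflects the cubic cost discussed next.

```python
import numpy as np

def tridiag_inverse(a, b, c):
    """Closed-form inverse of a tridiagonal matrix via second-order
    recurrences. a: sub-diagonal (len n-1), b: main diagonal (len n),
    c: super-diagonal (len n-1). theta[n] is the determinant."""
    n = len(b)
    theta = np.empty(n + 1, dtype=complex)
    theta[0], theta[1] = 1.0, b[0]
    for i in range(2, n + 1):
        theta[i] = b[i-1] * theta[i-1] - a[i-2] * c[i-2] * theta[i-2]
    phi = np.empty(n + 2, dtype=complex)
    phi[n + 1], phi[n] = 1.0, b[n-1]
    for i in range(n - 1, 0, -1):
        phi[i] = b[i-1] * phi[i+1] - a[i-1] * c[i-1] * phi[i+2]
    T = np.empty((n, n), dtype=complex)
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            if i <= j:   # upper triangle uses super-diagonal products
                T[i-1, j-1] = (-1)**(i+j) * np.prod(c[i-1:j-1]) * theta[i-1] * phi[j+1] / theta[n]
            else:        # lower triangle uses sub-diagonal products
                T[i-1, j-1] = (-1)**(i+j) * np.prod(a[j-1:i-1]) * theta[j-1] * phi[i+1] / theta[n]
    return T
```

For a Hermitian input only the lower triangle need be computed, as the text exploits below.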
One critical issue of this straightforward inversion is the prohibitive complexity involved, which is . Note that the complexity here is estimated by counting the number of complex multiplications. For MMIMO systems with considerably large , the resulting high complexity may hinder the application of such an inversion method in real systems. We again note that the filtering matrix is Hermitian. Then the vector is the conjugate transpose of . Since the inverse of a Hermitian matrix is still Hermitian, only half of the required entries have to be calculated with the sophisticated equations Eqs. (16) and (17); the other half can be obtained conveniently by a simple conjugate operation. Although the Hermitian property does help to remove roughly half the computations, the complexity stays in the same order and fails to meet our initial expectation for introducing NSE. For a better balance of performance and complexity, especially in correlated massive MIMO systems, further simplifications are needed.
To this end, in what follows, we first decompose the straightforward tridiagonal approach into two steps: 1) Step 1, which includes Eqs. (16) and (17), mainly deals with the entry calculation; 2) Step 2, which includes Eq. (18), mainly deals with the parameter calculation. Then, stepwise optimizations are carried out by looking into the details of each step. The statistical properties of the filtering matrix are fully exploited to achieve a better performance/complexity tradeoff.
III-C1. Simplification for Step 1
Since the inverse of a tridiagonal matrix is a full matrix, the complexity of Step 1 is . However, in correlated MMIMO channels, the norm of the main diagonal of is higher than that of the secondary diagonals, which suggests further simplification.
Let us only consider the lower triangular part of , i.e., . According to the formula , the calculation starts with the main diagonal and then recursively computes the off-diagonal entries in a column-wise manner. The entries of the vector are multiplied recursively. If we understand how the magnitudes of the entries of change with the channel, we will be able to depict the off-diagonal entries' attenuation accordingly. We evaluate the overall magnitude of by its mean . The tendency of as the correlation coefficient changes is given in Table I.
It is observed that when , stays less than , which indicates that for channels with , the magnitudes of the outer diagonals shrink layer by layer at the scale of . In other words, the dominant portion of falls into the region between the two secondary diagonals. The rest of , whose magnitude is less than that of the main diagonal, can be omitted as a result. Thus, the complexity is effectively reduced to .
III-C2. Simplification for Step 2
We recall that in correlated MMIMO channels, the norm of the main diagonal of is higher than that of the secondary diagonals. Therefore, the second term on the right-hand side of the following equation can reasonably be removed:
Then, we have the following approximation:
(19) 
With this approximation, the formula for calculating the main diagonal of can be simplified as:
(20) 
In the previous calculation in Eq. (18), the two sequences and are obtained in ascending and descending order of , respectively. Thanks to the approximation, there is no longer any need to calculate , and the processing latency is halved.
To evaluate the performance degradation, numerical results are given in Table II, where the error percentage is denoted as . It is observed that when , the error percentage stays below , so it is safe to introduce this approximation within this region. For clearer notation, we use instead of in the following.
Error ()
With the aforementioned simplification steps, the new algorithm for calculating is given in Alg. 1 as follows:
Remark 3.
Please refer to Appendix D for the derivation.
The statistical properties of the filtering matrix inspire us to reduce the computational complexity of calculating . Though is a full matrix, the entries outside the three central diagonals are notably small in magnitude. Therefore, we mainly focus on the three dominant diagonals, i.e., . For the purpose of parallel computation, some units are modified accordingly. The resulting hardware overhead is usually small compared to the matrix-matrix multiplications. Furthermore, we only directly calculate the lower-triangular part of the Hermitian matrix.
IV. Hardware-Friendly Implementation
IV-A. Efficient Pipelining Structure
The TMA formula can be written as follows,
(21) 
where is the multiplication coefficient. According to the linear detection formula Eq. (6), the general hardware framework can be designed as in Fig. 5. The framework mainly consists of the following parts:
IV-A1. Orange Block
It is the preprocessing module (PM), where , and is the normalized variance of white Gaussian noise (AWGN) with zero mean. The output of PM is , which can be split as . For the diagonal Neumann series (DNS) method, matrix includes all the diagonal elements of ; whereas for the TMA, matrix only includes the tridiagonal elements in .
IV-A2. Blue Block
It is the main computing module (MM), which uses the TMA to calculate with iterations. The main structure of this part is IIR-like, and its iteration count can be controlled to meet different accuracy requirements.
IV-A3. Green Block
It is the estimation module (EM), which computes , where .
IV-B. Diagonal/Tridiagonal Matrix Inversion
This module calculates the inverse of , which is a diagonal matrix in the DNS scheme, so its inversion reduces to computing the reciprocals of the diagonal elements. To compute , a reciprocal unit based on a lookup table (LUT), which is particularly suitable for FPGA implementation, is employed.
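A software stand-in for such a LUT-based reciprocal unit is sketched below; the table depth and the normalization of inputs to [1, 2) are our assumptions, since the paper's fixed-point parameters are not given here. A hardware unit would index the table with the top mantissa bits.

```python
import numpy as np

# Precomputed reciprocal table for normalized inputs in [1, 2).
BITS = 10
TABLE = 1.0 / (1.0 + np.arange(2 ** BITS) / 2 ** BITS)

def lut_reciprocal(x):
    """Approximate 1/x for x in [1, 2) by table lookup: no run-time
    division, accuracy roughly one table step (~2^-BITS)."""
    idx = np.minimum(((x - 1.0) * 2 ** BITS).astype(int), 2 ** BITS - 1)
    return TABLE[idx]
```

The worst-case error is bounded by the table step divided by the square of the input, i.e. about 2^-10 here, which is why such units pair naturally with short fixed-point word lengths.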
The tridiagonal matrix inversion has been simplified in Alg. 1, and the corresponding hardware design is shown in Fig. 6. This architecture is realized in a pipelined way, in which increases from to with each clock, and and change accordingly. A switch is employed for scheduling: 1) when , the switch is open and the output is ; 2) when , the switch is closed and outputs and . Different blocks have different functions.
IV-B1. Red Block
IV-B2. Blue Block
IV-B3. Green Block
It computes the secondary diagonal elements of the lower triangle. Since is Hermitian, is Hermitian as well, so with a conjugate operation we can obtain the entire matrix .
The inversion needs real adders, complex multipliers, and LUTs. The latency of this module is clocks, where is the size of .
Alg. 1 can be converted into Alg. 2 for lower implementation complexity. We combine the common hardware of two identical algorithm states, which saves half of the hardware cost. Note that States 3 and 5 in Alg. 1 share the same iterative operation as States 6 and 9; therefore, the common hardware can be reused. The resulting architecture also works in a pipelined way, but increments once every two clocks. The work previously done in one clock now needs two clocks to complete, and the latency doubles. The folded architecture, shown in Fig. 7, executes States 3 and 5 (or States 4 and 6) of Alg. 2 when (or ). It needs real adders, complex multipliers, LUTs, and clocks.
IV-C. Diagonal/Tridiagonal Multiplier
This module calculates the matrix , the product of the matrices and . Whether is a diagonal matrix or a tridiagonal matrix , the channel hardening property keeps the multiplication module simple. To calculate , every diagonal element of is multiplied with the corresponding row of . Furthermore, when calculating , each of the three diagonals is multiplied with the corresponding rows of the latter matrix, and registers are used to add the three sets of results in the correct combination. The hardware design is shown in Fig. 8. Using complex multipliers, registers, and complex adders, the multiplication between a tridiagonal matrix and any matrix can be realized. Meanwhile, when we feed only the main-diagonal elements into the first set of multipliers, the product between a diagonal matrix and any matrix can be realized as well. Therefore, we have:
Remark 5.
Architecture in Fig. 8 can implement multiplication of diagonal or tridiagonal matrix with any matrix.
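The row-combining trick behind this multiplier is easy to model in software: a tridiagonal-by-dense product only ever mixes three rows of the right factor. The naming below (`a`, `b`, `c` for the sub-, main, and super-diagonals) is ours.

```python
import numpy as np

def tridiag_matmul(a, b, c, B):
    """Multiply a tridiagonal matrix T (sub-diagonal a, main diagonal b,
    super-diagonal c) by a dense matrix B using only three row-scaled
    shifts of B, i.e. ~3nK multiplications instead of n^2 K."""
    out = b[:, None] * B                 # main-diagonal contribution
    out[1:] += a[:, None] * B[:-1]       # sub-diagonal: shift rows down
    out[:-1] += c[:, None] * B[1:]       # super-diagonal: shift rows up
    return out
```

Setting `a` and `c` to zero recovers the diagonal-matrix case, mirroring how the same hardware serves both the DNS and TMA schemes.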
IV-D. Neumann Series Iteration Module
This module is IIR-like, comprising a matrix multiplier and a matrix adder, and implements the approximate matrix inversion. With and as inputs, it outputs at the iteration. By controlling the iteration number, different accuracies can be achieved.
The matrix multiplier is implemented by a systolic array. Since is Hermitian, once we calculate the lower-triangular part, the upper-triangular part can be obtained by conjugation. The systolic array has complex adders and complex multipliers. Since is also Hermitian, the matrix multipliers in the PM and MM can be multiplexed onto one lower-triangular systolic array. The matrix adder needs one real adder for diagonal matrix addition, and two complex adders and one real adder for tridiagonal addition.
Remark 6.
Please refer to Appendix C for the proof that is Hermitian.
IV-E. Complexity Analysis
In this subsection, the complexity comparison is given in terms of the numbers of real adders and real multipliers. One complex adder comprises two real adders, and one complex multiplier comprises four real multipliers and two real adders. The complexity of the proposed detector is then:
Lemma 2.
For one user TMA detection, the numbers of real adders () and real multipliers () are,
(22) 
Proof:
The PM module mainly involves a lower-triangular systolic array, which has complex adders and complex multipliers. The total cost is real adders and real multipliers.
Sharing the lower-triangular systolic array with the PM, the MM becomes the main matrix inversion part: 1) for the DNS, it costs real adders and complex multipliers ( real adders and real multipliers in total); 2) for the TMA, it costs real adders, complex adders, and complex multipliers ( real adders and real multipliers in total).
The EM module only involves a vector multiplier, which costs complex adders and complex multipliers ( real adders and real multipliers in total). ∎


Method  Latency (clocks)  Adder #  Multiplier # 
CD  
DNS  
TMA  

IV-F. Timing Analysis
When , the timing analysis is as follows:
Lemma 3.
For a MIMO system with users and antennas, the latency for TMA detection is
(23) 
Proof:
Using the lower-triangular systolic array, the PM generates its output from Clock to . In the systolic array, when is calculated in Clock , and are calculated in Clock . Since is obtained in Clock , is obtained in Clock .
To calculate , each needs clock cycles by Alg. 2, which matches the output rate of . Thus, is obtained in Clock , and in Clock .
According to the scheduling of the systolic array, () is calculated from Clock to . After pipelining, the final estimation vector is completed in Clock . ∎
Lemma 4.
For a MIMO system with users and antennas, the latency for DNS detection is
(24) 
Proof:
Compared to the TMA, the diagonal inversion requires one fewer clock for each , and the diagonal multiplier needs one fewer clock as well. Thus, the latency is reduced by . ∎
Since the latency is dominated by the iterative multiplication, the latency grows and the throughput drops as the iteration number increases. Based on the numerical results in Section V-A, different values of are set for different values of and .
V. Architecture Analysis
V-A. Numerical Results
We adopt the Neumann series approximation to reduce the complexity of linear MIMO detection. However, different iteration counts lead to different latencies: if becomes too large, the latency will be too high, resulting in low throughput; when is not large enough, the approximate result will not reach the required accuracy. Here, the accuracy metric is defined as a performance degradation of less than dB compared with the Cholesky-decomposition-based exact MMSE detection when . Different sets of and correlation coefficients result in quite different iteration counts. We compare the iteration counts of the diagonal scheme (DNS), the tridiagonal scheme (TNS), and the simplified tridiagonal scheme (TMA).


Method  
DNS  
TMA  

Figs. 9-14 show the performance comparison of the aforementioned methods in different systems, and the results are summarized in Table IV. When grows, both methods need more iterations to reach the required accuracy. According to the error analysis in Section III-B, when , the Neumann series scheme is not suitable for such channel conditions; our simulation results in Table IV confirm this. Another observation is that a larger requires fewer iterations. From the comparison we can clearly see that the TMA saves to iterations compared with the DNS. Also, our improved approximate tridiagonal matrix inversion in Section III-C1 does not bring extra performance loss.
To better show the advantages of the proposed methods, it is necessary to compare their performance with other algorithms. Rather than adding one or two other algorithms to each of Figs. 9-11, we would like to show that under different correlation scenarios, state-of-the-art linear iterative detectors such as the CG or SD (adaptive preconditioned SD, AP-SD) methods cannot outperform the exact MMSE detector based on Cholesky decomposition. As long as our proposed detectors achieve performance comparable to the exact MMSE detector based on Cholesky decomposition, their performance is guaranteed even against the state-of-the-art linear iterative detectors. In Fig. 15, the BER performance under different correlation conditions is compared. Specifically, we consider: i) the user-correlated case ; ii) the BS-correlated case ; and iii) the fully correlated case . It is shown that the proposed TMA detector can effectively approach the exact MMSE detector under all these correlation conditions. Therefore, the advantages of the proposed detectors over the state-of-the-art detectors hold under different correlation conditions.
V-B. VLSI Implementation Results
We investigate the VLSI implementation architecture for the proposed (TMA and DNS) detection methods on a Xilinx Virtex-7 XC7VX690T FPGA and compare the hardware cost in Table V. For fair comparison with NSE [16], GS [42], CG [43], and OCD [44], the quantization length is set as . The fixed-point simulation results of the proposed quantization scheme are shown in Figs. 9-11, and the performance loss is negligible. Here, "fd" is short for "fixed". We assume BS antennas and users. Although the TMA detection requires as many LUTs as the DNS, it requires a lower and enjoys a higher throughput. As shown in Tables VI and VII, with decreased and increased , the throughput advantage is more obvious.


Detection algorithm  TMA  DNS 
Modulation  QAM  QAM 
LUTs  
FFs  
DSP48s  
Block RAMs  
Maximum frequency  MHz  MHz 



Method  Latency (clocks)  Throughput (Mbps)  
DNS  
TMA  



Method  Latency (clocks)  Throughput (Mbps)  
DNS  
TMA  

To achieve the Mb/s per-user data rate specified in 3GPP LTE-Advanced [45], we use instances of the TMA detector. Table VIII lists a comparison with other detector implementations on the Xilinx Virtex-7 XC7VX690T FPGA for a MIMO system: NSE [16], GS [42], CG [43], and OCD [44]. It is shown that for correlated massive MIMO systems, our design achieves near-MMSE performance and Mb/s throughput. For fair comparison, the hardware efficiency is compared in terms of throughput/LUTs and throughput/FFs. The throughput/LUTs ratio of the proposed TMA is better than those of NSE [16], GS [42], and CG [43]. Considering the metric throughput/FFs, this work achieves the highest value compared to the state of the art.


Detector  M. Wu [16]  Z. Wu [42]  B. Yin [43]  M. Wu [44]  This Work 
[JSTSP’Nov. 14]  [ISCAS’16]  [ISCAS’15]  [TCASI’Dec. 16]  
Method  NSE  GS  CG  OCD  TMA 
Performance  nearMMSE  nearMMSE  nearMMSE  nearMMSE  nearMMSE 
Modulation scheme  QAM  QAM  QAM  QAM  QAM 
Iteration count  
Instance or subcarrier #  
LUTs  
FFs  
DSP48s  
BRAMs  
Latency [clks]  
Clock freq. [MHz]  
Throughput [Mb/s]  
Throughput/LUTs  

V-C. Comments and Further Work
Advantages: The hardware cost of the DNS is lower than that of the exact inversion method (Cholesky decomposition). With a weaker requirement on the iteration number, the DNS enjoys a higher throughput. With the same performance, the TMA saves iterations compared with the DNS. Since the iteration number can easily be controlled, the design is feasible for different applications.
Drawbacks: Under bad channel conditions, both the DNS and TMA detectors need higher iteration counts , which increases the detection latency and lowers the throughput.
An improved Neumann series: To address this problem, an improved Neumann series inversion framework is proposed for a higher convergence rate. The Neumann series formula can be rewritten as follows:
(25) 
(26) 
Then a formula based on multiple iterations is obtained:
(27) 
To obtain the output of iterations of the original Neumann series, we now only need iterations. A convergence comparison after this alteration is listed in Table IX. Compared with the original iterative inversion framework, the modified scheme adds a systolic multiplication module. Though it requires more hardware, a great improvement in convergence and throughput is obtained. Also, the hardware can be parallelized for lower latency per iteration. This framework applies to situations with harsh channel conditions and relatively high accuracy requirements. The iteration part of the modified hardware framework is shown in Fig. 16.
There are mainly two differences from the former structure. First, the former constant multiplication coefficient matrix is replaced by continuous self-squares. Second, the bias increment matrix, which was fixed in the original iteration, is now replaced by the last result of the iteration operation.
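Our reading of this modified scheme is a squared-remainder iteration, in which step k reproduces the 2^k-term truncated Neumann series; the sketch below follows that interpretation rather than the paper's exact hardware dataflow.

```python
import numpy as np

def improved_neumann(A, X0inv, n_iter):
    """Improved Neumann iteration: P <- (I + E) P, then E <- E^2.
    After k steps, P equals the 2^k-term truncated Neumann series,
    so the approximation order doubles with every iteration."""
    E = np.eye(A.shape[0]) - X0inv @ A
    P = X0inv.copy()
    for _ in range(n_iter):
        P = P + E @ P      # absorb the next block of series terms
        E = E @ E          # square the remainder matrix
    return P
```

Each step costs two matrix products instead of one, but the exponential growth in absorbed terms is exactly the "great improvement in convergence" described above.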


Iteration  Original Neumann series  Order  Improved Neumann series  Order 