I Introduction
With the explosive growth of mobile applications, cloud synchronization services and the rapid development of highquality multimedia services such as high resolution image and 4Kresolution high dynamic range (HDR) video streaming, the existing 4G mobile communication technology could not meet the needs of enterprises and consumers for wireless communication networks any more. As a result, 5G mobile communication technology [2, 3, 4] has been raised up with higher transmission speed, stronger bearing capacity, and a wider range of applications. Massive multipleinput multipleoutput (MMIMO) is one of the key technologies of 5G [5, 6], owing to its plenty advantages, such as high spectral efficiency, high power efficiency, and high robustness. The performance of MMIMO systems relies heavily on the acquisition of the uplink (UL) and downlink (DL) channel state information (CSI). However, the largescale antenna array of MMIMO brings new challenges to channel estimation [7] :

The growing dimension of channel matrices (CMs) or channel covariance matrices (CCMs) makes the complexity and resource consumption of the traditional UL and DL channel estimation algorithm greatly increased, limiting the MMIMO system to play its superiority.

The amount of CSI that users feed back to the base station (BS) during DL channel estimation is growing with the increase of the number of antennas at the BS, which is a great burden of the feedback channel.

Channel reciprocity makes it easy to acquire DL CSI from UL CSI for time division duplexing (TDD) systems, while the nonreciprocity characteristic causes that the DL channel estimation of frequency division duplexing (FDD) systems cannot be predigested, which is a great burden for user’s devices.
In order to reduce the computational complexity, we need to take advantage of the lowrank properties of the channel, which can reduce the dimension of CMs and CCMs significantly and acquire the valid information. Many works point out that the directions of arrivals (DOA) as well as the directions of departures (DOD) of propagation signals are limited in a narrow region (i.e., the angle spread (AS) is small) because the BS equipped with largescale antenna array has to be located on the top of high buildings, which is known as the finite scattering model [12]. Another similar scenario is the mmWave communication, where the channel is naturally sparse and the AS equals [13]. Meanwhile, due to the largescale antenna array of MMIMO, the spatial resolution of the BS is significantly improved, which means the BS can distinguish users from different directions more easily so that the representation of channel can be strongly sparse and there are relatively few nonzero elements in CMs and CCMs. As a result, a lot of new or optimized methods to acquire CSI have been proposed [14, 15, 16, 17, 18], especially for DL channel estimation of FDD system due to its nonreciprocity characteristic. [14]
proposed an approach under joint spatial division and multiplexing (JSDM) scheme for DL channel estimation of FDD system, where the sparsity of CCMs is exploited and the eigenvalue decomposition (EVD) algorithm is required, which is a challenge for implementation.
[15] proposed a lowrank matrix approximation based on compressed sensing (CS) and solved via a quadratic semidefine programming (SDP), which is novel but far too complex to implement. [16] deployed a CSbased method with the joint channel sparsity model (JCSM) which utilizes virtual angular domain representation of the CM and limited local scattering in order to reduce the training and feedback overhead. To this end, we first proposed a transmission strategy based on spatial based expansion model (SBEM) in [19], which comes from array signal processing theory. This scheme is also known as angledivision multiple access (ADMA) scheme. ADMA scheme has some particular advantages:
Due to the increased angle resolution of antenna arrays at the BS, the angular information can be easily obtained by the discrete Fourier transform (DFT) of CMs under ADMA scheme.

The angular information is corresponding to the real directions of users with the array signal processing theory, while the others’ methods only have a virtual angular representation.

As a result of the reciprocity brought by the DOA and DOD, the complexity and overhead of DL channel estimation and feedback can be reduced.

The estimation algorithm mainly contains DFT calculation, matrix multiplication, sorting and grouping, which is convenient for implementation.
As it has shown in [19], the performance of ADMA is better than [14] and [16], especially at low signal noise ratio (SNR). In addition, there are also blind and semiblind estimation methods to be explored. Those methods have higher transmission efficiency because they need fewer (or no) training sequences. But the result of those methods may be not accurate at the start of transmission because the BS needs some time to accumulate channel statistics information. Moreover, the efficient implementation of ADMA is very challenging due to the nonlinear computation involved in algorithm level, therefore hinder its application for channel estimation.
In order to bridge the aforementioned gap between algorithm and implementation, this paper devotes itself in proposing the hardware architecture for channel estimation under ADMA scheme for the first time. Hardwareaware partition of the algorithm is conducted. Our main technical contributions can be listed as follows:

We propose a hardwareefficient channel estimator under ADMA, which takes the advantage of the sparsity of MMIMO systems in order to reduce the complexity, save the amount of training sequences, and speed up the channel estimation of large amount of users.

We discuss the approximation of algorithm and transmission strategy as well as the quantization optimization in order to make the our channel estimator suitable for hardware implementation.

We propose the first channel estimator architecture with ADMA scheme, successfully achieve higher hardware efficiency and higher processing speed for channel estimation of MMIMO systems.

We develop an optimized architecture to simplify our original channel estimator, with little performance loss but huge resources reduction and higher hardware efficiency.

We present the FPGA implementation of our ADMA channel estimator on Xilinx Virtex xcvuflgae, to demonstrate its suitability for G wireless. The advantages have been verified by FPGA implementations.
The remainder of this paper is organized as follows. Section II proposes the implementationaware partition of ADMA algorithm. The hardwarefriendly approximation and the simulation results are presented in Section III. The detailed pipelined hardware architecture is presented in Section IV. FPGA implementation is also given in the same section to demonstrate the advantages. Finally, Section V concludes the entire paper.
Notations. The notations employed in this paper are listed in Table I for clearer representation.


Symbol  Definition  
number of antennas at the BS,  
number of users that the BS serves,  
length of training sequences,  
number of parameters the BS can handle,  
/  vector / matrix ,  
the th element of vector ,  
the th element of matrix ,  
/  the transpose of vector / matrix ,  
/  the Hermitian of vector / matrix ,  
set of continuous integers,  
the cardinality of the set ,  
subvector of by keeping the elements indexed by ,  
submatrix of by collecting the columns indexed by ,  


the statistical expectation.  

Ii ImplementationAware Partition of ADMA Channel Estimation
Implementationaware partition of ADMA channel estimation is first conducted in this section.
Iia SettingUp of ADMA
For the ease of illustration, we consider a multiuser MMIMO system, where the BS is equipped with () antennas in the form of uniform linear array (ULA) and serves users. We assume that the number of parameters which the BS can handle is . In addition, as we presume that each user is equipped with only one antenna, the CM of user can be described as a vector . From array signal processing theory, the UL channel vector of user has the form
(1) 
where is the number of beamforming rays, is the complex gain of the th ray and is the array manifold vector which can be expressed as
(2) 
Remark 1.
In this paper, we do not discuss in the situation that users are equipped with multiple antennas or the propagation signal contains multiple subcarriers in orthogonal frequency duplex division multiplexing (OFDM) systems. In fact, the sparsity of the vectors which can be obtained by collecting the row or column of channel matrices. And so do the sparsity of channel matrices of different subcarriers. So when we obtain the sparsity under ADMA scheme, it can be promoted to plenty of scenarios.
IiB Channel Sparsity Revealed by ADMA
To grantee the performance of the proposed channel estimator, the sparsity reveal by ADMA must be kept during the implementation process. The ADMA presents a sparse channel representation for the channel of a MMIMO system via the Discrete Fourier Transform (DFT) of channel vector, i.e., . which can be calculated by
(3) 
where is the DFT matrix whose element is . For the ease of description, there are two lemmas which can be proved from paper [19] :
Lemma 1.
If (i.e., AS is zero) and , there will be only one nonzero element in and the index of this nonzero element is relative to its DOA or DOD.
Proof:
For , can be simplified to , then the th element of can be calculated as
(4)  
If , we can get that
(5) 
Eq. 5 denotes the relationship between the index of the nonzero element (i.e., ) in and the DOA when , which can be described as
(6) 
∎
Since we have discussed the situation with and , we can move onto the more complex and realistic scheme:

when and , each propagation ray is corresponding to a nonzero element in . The index of the middle element is corresponding to the DOA of user while the number of the nonzero elements is corresponding to the AS of user.

when and is large but finite, the power leakage emerges because the resolution of the BS is relatively limited, which causes that is not always an integer. However, there are only a few nonzero elements concentrated around since is large. In fact, denotes the sample precision of the Discrete Time Fourier Transform (DTFT) of in the frequency domain. Since the index of the nonzero elements in is corresponding to the DOA and AS of user, can also determine the spatial resolution of the BS.

when and is large but finite, it is similar to the situation with and is large but finite, but the amount of nonzero elements in will be larger, which is interrelated to the AS of user.
From the above we can see that we can simply get a sparse channel representation by applying DFT to the channel vector and pick the nonzero elements with their indexes. In practical scene, since the BS can handle () parameters at most, we can use nonzero points of the DFT channel vector instead of all points to represent the CSI, which can reduce quite a lot of calculating and feedback overhead.
IiC Sparsity Enhancer for ADMA
To enhance the channel sparsity under ADMA scheme, we define:
Definition 1.
Define as the rotation matrix for user which can be expressed as
(7) 
where . Then we can add this rotateoperation to the DFT calculating. Define as the new channel representation with rotation given by
(8) 
In this way, we can use less nonzero elements to represent the channel vector. Or in practical scene, the nonzero elements we pick will contain more energy of the channel, which is a great benefit for the training overhead.
Remark 2.
Notice that the rotation is actually the translation of in the frequency domain. Since the spatial resolution of the BS is relatively limited, we can get the sample points aligned with the middle of the peak of the DTFT of to the greatest extent via the rotation operation. Since the sampling interval in the frequency domain is , it is only necessary to search over
In this case, we can define the index set to describe the signature of channel vectors with rotation as following:
Definition 2.
Define as the spatial signature of user which can be determined according to
(9) 
Now we have two parameters for each user to be determined under ADMA scheme: and . The main benefit of this sparse channel representation is that we only need a few training sequences because users from different directions whose DOAs do not overlap can share the same training sequence. In practical scene, we usually use orthogonal training sequence which can make full use of the BS.
Meanwhile, we can explain the transmission strategy under ADMA scheme which can be divided into three stages: the preamble stage, the UL training stage and the DL training stage. The aim of the preamble stage is to collect the two parameters of all users and divide them into different groups according to their spatial signatures. Then in the UL training stage and the DL training stage we can perform faster estimation than conventional channel estimation methods due to the grouping in the preamble stage. The preamble stage is not necessary after each UL and DL training stage and the times for UL and DL training stages after one preamble stage is corresponding to the mobility of users.
IiD Preamble Module
In the preamble period, we need to find and for each user so that we can allocate all users into different groups in which the index sets, i.e., of users do not overlap each other’s.
First we allocate all users into groups, each containing users as the BS can handle up to training sequences. Then we apply the conventional UL training for each group, and the receiving signals matrix of each group in the BS is given by
(10) 
where , , , and is used to satisfy the energy constraint ( is the UL training energy constraint of user, and is the pilot signal training power), is the additive white Gaussian noise matrix. Then can be calculated through linear square (LS) method as
(11) 
Then we can find and for each user by adopting Eq. (9). The specific method is discussed in Section III. After that, we need to allocate all users into groups in which the index sets of users do not overlap each other’s so that the users in the same group can share the same training consequence, which can be described as
(12) 
where is a certain guard interval which depends on the tolerance of users for the interference due to pilot reusing. Here we present a grouping strategy that is easy for VLSI implementation in Section IV.
IiE UL Training Module
In the UL training, all users send their training sequences to the BS. The received signals matrix in the BS is given by
(13) 
So first we extract the channel vector for group through a conventional LS method:
(14) 
Since the two parameters of each user is different, we should extract for each user through
(15) 
Finally we can recover the for user by
(16) 
IiF DL Training Module and Its Reciprocity
Based on the reciprocity of ADMA, the DL CSI can be easily obtained from the UL training as shown in [19]. The reciprocity of ADMA comes from that the propagation path of electromagnetic wave is reciprocal. As a result, the DOA (DOD) of DL signal is the same as the DOD (DOA) of the UL signal. Assume that the DL spatial signature of user is which can determined by the UL spatial signature by applying Eq. (6):
(17) 
where and are the elements in and , while and denote the UL and DL carrier wavelengths. Since is a monotonic function with , the minimum and maximum elements of and have an onetoone correspondence, i.e.,:
(18) 
where . Meanwhile, can be calculated by similarly.
The DL training is mostly the same with UL training except the Grouping strategy. In DL training module, since users with identical spatial signatures can be carried out with the same beamforming vectors simultaneously, they can share the same training sequence. Meanwhile, users whose spatial signatures do not overlap each other’s can share the same training sequence, which is the same with the UL training Grouping strategy. As a result, we denote our DL training strategy in two steps. First we allocate users with identical spatial signatures into the same cluster. Then we allocate these clusters in to different groups through Eq. (12). The rest of transmission and estimation is the same with the UL training module.
With the successful algorithm partition, we are now able to carry out the detailed implementationaware algorithm optimization and modulewise architecture design as follows.
Iii Approximation and Quantization
For simulations, the mean square error (MSE) is calculated as follows:
(19) 
For comparison, the system parameters are set as: , which are consistent with those in [19].
Iiia Approximation for Sliding Window Method
The authors of [19] proposed a basic way to find for user by adopting a one dimensional search over and for each possible by sliding a window of size over the elements in to determine that maximizes the channel power ratio. However, there are two main problems if we operate through this method. The first one is that searching over can not be carried out since it is a continuous interval. Fig. 1 shows the corresponding MSE of the simulations with separate elements we choose in the . As we can see from that, the MSE of is nearly the same as those of , so turns out to be suitable for VLSI implementation.
The second problem is that this method will introduce quite a lot latency and increase the computation complexity as the accumulator and divider are needed. Here we have some approximations to lower the complexity:

The first one is to find the maximum element in for each possible and determine the best and for user from the largest elements of all the possible .

The second one is to find the maximum element and the second largest element in and calculate the quadratic sum of the largest two elements for each possible . Determine the best and from the largest quadratic sum of all the possible (the index will be the mean value of the indexes of the largest two elements in with ) .

The third one is to find the maximum element in and calculate the quadratic sum of continuous elements which center on the maximum element for each possible . Determine the best and from the largest quadratic sum of all the possible (the index comes from the largest elements in with ) .
Fig. 2 shows the MSE increase for the approximations above. It is obvious that the performance of the third approximation is nearly the same as the basic method with a divider economized. Meanwhile, the performance loss of the first or the second method is higher but relatively acceptable with only one or two comparators needed.
IiiB Quantization Scheme
For quantization, the variables are quantified with sign bit, integral bits, and fractional bits which is expressed as fixed []. The width of integer
is usually determined by the Probability Density Function (PDF) of the data. But in our algorithm the largest data must be less than
since the channel state information contains the largest element in . Here we show the statistics of large amount of the largest data in , and . According to our statistics, the largest data is less than so that the width of integral part of variables is set as .In order to determine the width of fractional part of the variables, the corresponding MSE of double floating and fixed simulations are illustrated in Fig. 4. From Fig. 4 the MSE of fixed [] and fixed [] simulation keep almost the same, with a slight degradation compared with the double floating simulation. However, the MSE of fixed [] simulation is a relatively far from double floating simulation. As a result, the quantization scheme with fixed [] may be preferred for hardware implementation.
Iv Pipelined Architecture
For channel estimation under ADMA scheme, the operations are conducted on the dimensional vectors and matrices, where is large. For lowcomplexity and high processing speed, the pipelined architecture is demonstrated in Fig. 5. In addition, the quantization scheme with “fixed []” is employed, together with , respectively.
Our design has two stages controlled by a 1to2 switch. Stage 1 consists of pretreatment module, preamble processing module and ULgrouping module corresponding to Eq.s (11) and (3). Stage 2 comprises pretreatment module and ULEstimation module corresponding to Eq.s (15) and (16).
Iva Module Design
IvA1 Pretreatment Module
The pretreatment module can be reused since preamble module and ULestimation module are processed in different time slots. Pretreatment module consists of data buffer and LSbased estimation module. The LSbased estimation module corresponding to Eq. (11) can be implemented by a systolic structure [20] whose data flow graph is shown in Fig. 7, which is an efficient processing method for matrixvector multiplication. The processing element (PE) performs one complex multiplication and one complex addition. Each PE is corresponding to one elements of training sequence and one column of receiving data matrix so that the data buffer is needed to get the data transmission proper because is received by column (i.e., we receive elements in one column of in one clock period).
IvA2 Fast Fourier Transform (FFT) Module
Eq. (3) can be divided into two steps. One is a diagonal matrix and vector multiplication which can be implemented by a complex multiplier and a generator which outputs the diagonal elements of in pipeline. The other is DFT which can be implemented by Fast Fourier Transform (FFT) processors, reducing the computational complexity to . These are plenty of structures of FFT which emphasize either higher processing speed or less resources overhead [21, 22, 23]. For higher hardware efficiency, the singlepath feedback pipelined hardware architecture is employed as it is shown in Fig. 9, where the number of registers is the smallest as a result of the application of multiplexers.
IvA3 Uplink Grouping module
In the Uplink Grouping module, there are two main submodules: sorting and grouping. The sorting module is implemented by merging network [24] in pipeline which is shown in Fig. 8. This sorting network is mainly based on recursion, merging from 2element comparison to element comparison (assuming is a positive integer). Meanwhile, the merger module is a combination of symmetric comparing network and two bitonic sorter of elements. Then the bitonic sorter can be implemented by a half cleaner module and two bitonic sorter. The grouping module is implemented by a systolic structure shown in Fig. 10 where each comparing PE is corresponding to a group and decide if the latest input is suitable for the group by comparing it with the latest in this group. Since the outputs of sorting module is paralleled and the input of grouping module is serial, a paralleltoserial module is necessary.
Remark 3.
Notice that the grouping messages are sent to users through a independent feedback channel which is not contained in our hardware design.
IvA4 Uplink estimation module
In the Uplink estimation module, the implementation of Eq. (15) is a combination of a complex multiplier, an FFT module and an extraction module. Besides, the implementation of Eq. (16) consists of an Inverse Fast Fourier Transform (IFFT) module and a complex multiplier. Due to the sparsity of , the IFFT module can be treated as an matrix and vector multiplication which can be implemented by systolic structure which consists of PEs for higher efficiency.
IvB Optimized Architecture without Rotation
As we can see from the Fig. 5, the FFT modules of preamble processing module and Uplink estimation module could be reused since they are not deployed at the same time. However, the spatial signatures of users in the same group are different, leading to the waste of FFT module. Here we find that we can simply omit the rotation operations as the architecture shown in Fig. 6, which reuses the FFT module and reduces the number of FFT modules from to , saving the resources a lot.


Modules  Complex Multipliers  Complex Adders  Real Comparators  Registers 
LSbased Estimation  0  
FFT  0  
ABS  1  0  0  0 
Maxselection  0  0  1  1 
Sorting  0  0  
Grouping  0  1  
Extraction  0  0  1  0 
IFFT  0  

IvC Processing Schedule and Overhead Analysis
For channel estimation under ADMA scheme, the timing of the entire design is shown in Fig. 11, where is the clock cycle. we can see that each module is processed in pipeline except the ULgrouping module. The timing of the optimized architecture without rotation is the same with it is shown in Fig. 11. The resource statistics of each module is listed in Table II. In addition, the latency and processing time of each module is listed in Table III. Here, “Latency” is associated with one data package, and “Processing time” is associated with data packages. Notice that is an integer between and which is determined by the spatial signature of each user.


Modules  Latency ()  Processing time () 
LSbased Estimation  
FFT  
MaxSelection  
Sorting    
Grouping    
Extraction  
IFFT  

IvD FPGA Implementation Results
In order to demonstrate the advantage of channel estimation under ADMA scheme, our architectures are implemented with Xilinx Virtex7 Ultrascale vu440flga28922e FPGA. For the ease of Implementation, the parameters are set as . The resources overhead and maximum frequency are shown in Table IV. We can see that the omission of rotation operations brings us reduction in LUTs, reduction in registers, reduction in block RAMs and reduction in DSPs. And for the timing constraints, since the critical path lies in the FFT module, the maximum frequency of these two architecture can both reach 217.39 megahertz.


Structures  With Rotation  Without Rotation 
LUTs  
Registers  
Block RAMs  
DSPs  
Frequency (MHz)  

V Conclusions
In this paper, the hardwareefficient channel estimator based on ADMA scheme is first proposed. The corresponding optimizations on quantization and approximation are presented as well. To achieve high efficiency and low complexity, the pipelining technique and systolic structure have been employed to tailor the architecture for regularity. Finally, FPGA implementations are given. Suggestions on the choice of rotation are listed. Future work will be directed towards its application in our 5G Cloud Testbed.
References
 [1] X. Liu, H. Xie, J. Sha, F. Gao, S. Jin, X. You, and C. Zhang, “The VLSI architecture for channel estimation based on ADMA,” in Proc. IEEE International Conference on ASIC (ASICON), 2017, pp. 1073–1076.
 [2] J. G. Andrews, S. Buzzi, W. Choi, S. V. Hanly, A. Lozano, A. C. Soong, and J. C. Zhang, “What will 5G be?” IEEE J. Sel. Areas Commun., vol. 32, no. 6, pp. 1065–1082, 2014.
 [3] M. Shafi, A. F. Molisch, P. J. Smith, T. Haustein, P. Zhu, P. De Silva, F. Tufvesson, A. Benjebbour, and G. Wunder, “5G: A tutorial overview of standards, trials, challenges, deployment, and practice,” IEEE J. Sel. Areas Commun., vol. 35, no. 6, pp. 1201–1221, 2017.
 [4] F. Rusek, D. Persson, B. K. Lau, E. G. Larsson, T. L. Marzetta, O. Edfors, and F. Tufvesson, “Scaling up MIMO: Opportunities and challenges with very large arrays,” IEEE signal processing magazine, vol. 30, no. 1, pp. 40–60, 2013.
 [5] F. Boccardi, R. W. Heath, A. Lozano, T. L. Marzetta, and P. Popovski, “Five disruptive technology directions for 5G,” IEEE Commun. Mag., vol. 52, no. 2, pp. 74–80, 2014.
 [6] E. G. Larsson, O. Edfors, F. Tufvesson, and T. L. Marzetta, “Massive MIMO for next generation wireless systems,” IEEE Commun. Mag., vol. 52, no. 2, pp. 186–195, 2014.
 [7] H. Xie, F. Gao, and S. Jin, “An overview of lowrank channel estimation for massive MIMO systems,” IEEE Access, vol. 4, pp. 7313–7321, 2016.
 [8] T. L. Marzetta, “Noncooperative cellular wireless with unlimited numbers of base station antennas,” IEEE Trans. Wireless Commun., vol. 9, no. 11, pp. 3590–3600, 2010.
 [9] J. Jose, A. Ashikhmin, T. L. Marzetta, and S. Vishwanath, “Pilot contamination problem in multicell TDD systems,” in Information Theory, 2009. ISIT 2009. IEEE International Symposium on. IEEE, 2009, pp. 2184–2188.
 [10] F. Fernandes, A. Ashikhmin, and T. L. Marzetta, “Intercell interference in noncooperative TDD large scale antenna systems,” IEEE J. Sel. Areas Commun., vol. 31, no. 2, pp. 192–201, 2013.
 [11] L. You, X. Gao, X.G. Xia, N. Ma, and Y. Peng, “Pilot reuse for massive MIMO transmission over spatially correlated Rayleigh fading channels,” IEEE Trans. Wireless Commun., vol. 14, no. 6, pp. 3352–3366, 2015.
 [12] A. G. Burr, “Capacity bounds and estimates for the finite scatterers MIMO wireless channel,” IEEE J. Sel. Areas Commun., vol. 21, no. 5, pp. 812–818, 2003.
 [13] X. Gao, L. Dai, S. Han, I. ChihLin, and R. W. Heath, “Energyefficient hybrid analog and digital precoding for mmWave MIMO systems with large antenna arrays,” IEEE J. Sel. Areas Commun., vol. 34, no. 4, pp. 998–1009, 2016.
 [14] A. Adhikary, J. Nam, J.Y. Ahn, and G. Caire, “Joint spatial division and multiplexing The largescale array regime,” IEEE Trans. Inf. Theory, vol. 59, no. 10, pp. 6441–6463, 2013.
 [15] S. L. H. Nguyen and A. Ghrayeb, “Compressive sensingbased channel estimation for massive multiuser MIMO systems,” in Proc. IEEE Wireless Communications and Networking Conference (WCNC). IEEE, 2013, pp. 2890–2895.
 [16] X. Rao and V. K. Lau, “Distributed compressive CSIT estimation and feedback for FDD multiuser massive MIMO systems,” IEEE Trans. Signal Process., vol. 62, no. 12, pp. 3261–3271, 2014.
 [17] C. Sun, X. Gao, S. Jin, M. Matthaiou, Z. Ding, and C. Xiao, “Beam division multiple access transmission for massive MIMO communications,” IEEE Trans. Commun., vol. 63, no. 6, pp. 2170–2184, 2015.
 [18] J. Fang, X. Li, H. Li, and F. Gao, “Lowrank covarianceassisted downlink training and channel estimation for FDD massive MIMO systems,” IEEE Trans. Wireless Commun., vol. 16, no. 3, pp. 1935–1947, 2017.
 [19] H. Xie, F. Gao, S. Zhang, and S. Jin, “A unified transmission strategy for TDD/FDD massive MIMO systems with spatial basis expansion model,” IEEE Trans. Veh. Technol., vol. 66, no. 4, pp. 3170–3184, 2017.
 [20] R. Urquhart and D. Wood, “Systolic matrix and vector multiplication methods for signal processing,” in Proc. Insr. Elec. Eng., vol. 131, no. 6, 1984, pp. 623–631.
 [21] M. Ayinala, M. Brown, and K. K. Parhi, “Pipelined parallel FFT architectures via folding transformation,” IEEE Trans. VLSI Syst., vol. 20, no. 6, pp. 1068–1081, 2012.
 [22] C. Cheng and K. K. Parhi, “Highthroughput VLSI architecture for FFT computation,” IEEE Trans. Circuits Syst. II, vol. 54, no. 10, pp. 863–867, 2007.
 [23] Y.N. Chang, “An efficient VLSI architecture for normal I/O order pipeline FFT design,” IEEE Trans. Circuits Syst. II, vol. 55, no. 12, pp. 1234–1238, 2008.
 [24] K. E. Batcher, “Sorting networks and their applications,” in Proc. AFIPS Spring Joint Comput. Conf, vol. 32, 1968, pp. 307–314.
Comments
There are no comments yet.