I Introduction
Due to ultrahighresolution of spatial direction, and superhighspectral efficiency, massive multipleinput multipleoutput (MIMO) has drawn tremendous research activities from academia and industry world. It has made great progress on several important aspects like channel modeling, lowcomplexity beamforming, channel estimation, pilot optimization, pilot contamination controlling, etc. [1], [2], [3], [4]. Direction of arrival (DOA) estimation has been an active area since its applications include wireless communications, radar, radio astronomy, sonar, navigation, tracking of various objects, and rescue and other emergency assistance devices [5]. If massive MIMO behaves as a receive array, then DOA estimation precision will be dramatically improved due to its ultrahighresolution of spatial direction.
Wireless direction finding has a long history tracing back to the very beginnings of wireless communications. In the coming future, demand for directionfinding will arise in many potential engineering applications including internet of things (IoT) [6], directional modulation systems [7, 8, 9, 10, 11, 12, 13], unmanned aerial vehicle (UAV) [14], intelligent transportation, wireless sensor networks (WSNs) [15] and millimeterwavebased massive MIMO for 5G and beyond so on [16]. Many DOA estimation algorithms have been proposed and analyzed. Capon algorithm [17] is maximum likelihood estimation of power which aims at maximizing the signaltointerference ratio (SINR). Schmidt developed a more popular method, i.e., the multiple signal classification (MUSIC) [18] algorithm, which is a highresolution eigenstructurebased DOAfinding method. To reduce the complexity of MUSIC with linear search, its lowcomplexity version, called RootMUSIC [19], was proposed to solve the roots of the polynomial around the unit circle to find DOA. Many authors proposed several different direction finding methods, analyzed and improved their performances. In [20], five methods of combining ML and MUSIC were proposed, which could achieve both good performance and computational simplicity. In [21], the authors exploited the first derivative of the cost function in RootMUSIC, which performed better than traditional methods. By examining the disturbance of the root of the polynomial formed on the RootMUSIC intermediate step, the authors in [22]
provided its analysis and proved that it outperforms the MUSIC algorithm in a uniformly spaced linear array (ULA). However, the above research all assume that array response and noise variance are known perfectly, which is unfeasible in practice. Therefore, Friedlander modeled and solved the problem of direction finding when there existed inaccurate mutual coupling, gain, and phase among array elements
[23, 24, 25].However, as the number of antennas tends to largescale, the beamforming computational amount, and circuit complexity and cost of digital implementation become too high for commercial applications. Therefore, a hybrid analog and digital (HAD) beamforming structure is a natural choice, which will strike a good balance among beamforming computational amount, circuit cost, and circuit implementation complexity. Concerning HAD precoding in mmWave massive MIMO systems, a mixed analogtodigital converter (ADC) receiver architecture [26] was presented, as combining costly high and less expensive low resolution ADCs, had worse in performance than fullresolution ADC structure. Therefore, in [27], a HAD precoding algorithm is firstly proposed to make a balance between hardware cost and system performance.
Several research activities on HAD structure focus on transmitter not receiver. In [28], the authors developed a lowcomplexity precoder of alternately iterative minimization by enforcing an orthogonal constraint on the digital precoder. In [29], the authors proposed two precoders based on the principle of manifold optimisation and particle swarm optimisation. An energyefficient hybrid precoding for subconnected architecture was proposed in [30]. To make a balance between energy efficiency and spectrum efficiency in [31], the authors analyzed the green point for fixed product of the transceivers number and the active antennas number per transceiver, and independent transceivers number and active antennas number per transceiver. Due to the HAD structure, the achievable sumrate inevitably decreases compared to fullydigital beamforming, in [32] the sumrate degradation was proved to be compensated by simply employing more transmit antennas. Also taking the rate into account, the authors in [33] developed an iterative HAD beamforming algorithm for the single user mmWave channel, which can approach the rates achieved by unconstrained digital beamforming solutions. In [34], the authors presented receive baseband combiners with the target of minimizing meansquarederror between transmitted and processed received signals.
Mediumscale or largescale receive antenna array with digital beamforming can be employed at receiver to achieve a highresolution DOA estimation. Therefore, considering the hardware cost and performance, it is necessary to apply the hybrid structure in the direction finding. In [35], the authors proposed two iteration methods, i.e., differential beam search and differential beam tracking beamforming algorithms for side by side subarray configuration.
To the best of our knowledge, how to use a massive HAD beamforming structure to make an estimate of DOA direction based on concept of spatial spectrum is an open challenging problem. In this paper, each subarray output of the HAD structure is viewed as a virtual large antenna output, and the total HAD antenna array can be modelled as a large digital virtual array when we do digital beamforming/PA operation. we will focus on the aspect research and make our effort to solve this problem, our main contributions are summarized as follows:

By fully exploiting the subarray structure, two hybrid DOAfinding methods of using linear search, hybrid analog and digital phase alignment (HADPA) and hybrid digital and analog phase alignment (HDAPA), are proposed to estimate DOA based on parametric method. Compared to conventional analog phase alignment (APA), they are much lowercomplexity. By reducing the size of stepsize, their estimate accuracy can be improved but at the same time their complexities increase accordingly. Compared to APA, the proposed HADPA can reduce the complexity from to , where and are the numbers of subarrays and antennas per subarray. Furthermore, the proposed HDAPA dramatically reduces the search complexity by confining the searching set of feasible solutions to the limited finite number by exploiting the the periodic characteristic of digitally large virtual array with spacing being multiple of half wavelength.

To avoid the highcomplexity of HADPA and APA due to pure linear searching with small stepsize, based on spatial spectral estimation, a RootMUSICHDAPA method is proposed to achieve an extremely lowcomplexity with an approximately close form. Due to the periodic property of virtual array direction pattern, there exists the effect of directionfinding ambiguity effect, i.e., optimal solutions for the estimated direction. A method of maximizing the average receive power by a limited linear searching over a set of finite feasible directions predetermined by RootMUSIC, called HDAPA, is adopted to find the true optimal solution and delete those spurious ones. As shown in mathematic analysis and simulation results in Section V, RootMUSIC plus HDAPA can make a significant reduction in computational complexity compared with APA, HADPA, and HDAPA.

To assess the performance of the proposed two methods, the hybrid CramerRao lower bound (CRLB) for HAD structure is derived by statistic theory and matrix theory. Simulation results verify that the proposed hybrid RootMUSICHDAPA scheme is shown to achieve the CRLB as signaltonoise ratio (SNR) increases up to medium and large SNR regions.
The remainder of this paper is organized as follows. Section II describes system model. In Section III, two methods, HADPA and HDAPA, are proposed to realize a lowercomplexity compared to conventional APA. In Section IV, compared to HADPA and HDAPA, a lowercomplexity RootMUSICHDAPA is proposed by providing an approximately analytical solution, and the corresponding hybrid CRLB is also derived to verify its performance. Simulation results are presented in Section V. Finally, we make our conclusions in Section VI.
Notation: throughout the paper, matrices, vectors, and scalars are denoted by letters of bold upper case, bold lower case, and lower case, respectively. Signs
, and denote transpose, conjugate, and conjugate transpose, respectively. Notation stands for the expectation operation. Matrices denotes the identity matrix and denotes matrix of all zeros. denotes matrix trace. Operation denotes the Kronecker product of two matrices.Ii System Model
Fig. 1 sketches the receive HAD beamforming structure. A farfield emitter transmit a narrowband signal , where is the baseband signal, and is the carrier frequency. The signal impinges on the HAD antenna array. Uniformlyspaced linear array (ULA) are divided into subarrays, and each subarray is composed of antenna elements. Consider analog beamforming (AB), the th subarray output is
(1) 
where is the timedomain block index with each block consisting of snapsots, i.e., is the number of snapshots per block, and are the propagation delays determined by the direction of the source relative to the array given by
(2) 
where is the propagation delay from the emitter to a reference point on the array, is the speed of light, and denotes the antenna spacing. In (1), is the corresponding phase for analog beamforming/phase alignment corresponding to the th antenna of subarray . Stacking all subarray outputs in (1) forms the matrixvector notation
(3) 
where is an additive white Gaussian noise (AWGN), whose entries are independent identically distributed. , and the column vector is the socalled array manifold defined by
(4) 
and the AB matrix is a block diagonal matrix
(5) 
where is the AB vector of subarray . The radio frequency (RF) signal in (3) passes through parallel RF chains and is downconverted to the following baseband signal vector
(6) 
which experiences analogtodigital convertor (ADC) and yields
(7) 
Via digital beamforming (DB) operation, the above signal vector becomes
(8) 
where the DB vector .
Iii Proposed LowComplexity PhaseAlignmentbased DOA Estimation
In this section, by maximizing the output receive power, we firstly present the APA method in Section IIIA. The APA requires to compute different values of phase per step search as the inputs of phase shifters at the RF chain. To reduce the number of phase values needed to be computed, a lowcomplexity HADPAbased DOA estimator is proposed in Section IIIB with only calculating different values for phase shifters on RF chain. This significantly alleviates the computational load at receive terminal. In order to further reduce complexity, we finally propose a HDAPA DOA estimator in Section IIIC, making eachstep search only requires values of phases. By exploiting the periodic characteristics of the large virtual digital array, we obtain a feasible set of estimated angles, and then the APA is used to delete the false angles and keep the true optimal angle.
Iiia Conventional APA
It is assumed that the emitter direction in Fig. 1 is . From the previous section, after AB and ADC, we have the output summation signal of the th subarray as follows
(9)  
where is the array manifold of subarray ,
(10) 
Since only APA is used, the DB vector is set and fixed to a vector of all ones, i.e., .
Below, we will maximize the output power of receive signal in Fig. 1 by optimizing the vector . Firstly, let us define the average output power
(11) 
where . The above equation can further be expanded as
(12)  
where
(13) 
By adjusting the value of in (13), we can optimize the receive power in (12) to reach its maximum value. Observing the last line of (12), we find the analog optimizing vector is exactly aligned with the array manifold produced by the direction under the condition
(14) 
which forces all signals of antenna elements to coherently combine at the output and form the maximum value of output power. To implement linear exhaustive searching, we split the range of direction angle from to into subintervals or bins. Let us define the phase searching stepsize as follows
(15) 
In (13), the angle is chosen from the angle set . As the search direction angle varies from to , the APA before ADC at the receiver in Fig. 1 cannot save the receive data unlike DPA. In other words, the new APA phases should be computed and sent towards phase shifters per stepsearch and the new block of signal will be received to compute the output outcome of the new search point. The total number of values of all are floatingpoint operations (FLOPs). Thus the complexity of APA is
(16) 
FLOPs. Finally, the maximum receive power are found by comparison. Obviously, to approach the CRLB, the stepsize should be chosen such small that it is close to the root of CRLB. This implies a large value of and a high computational amount.
IiiB Proposed LowComplexity HADPA DOA Estimator
The above APA algorithm needs to do exhaustive linear search from to and compute values at the same time, which will cause a highcomplexity. In the subsection, we present a lowcomplex hybrid phase alignment DOA estimation. Firstly, we decompose the PA phase into two parts:
(17) 
where the first part is to cancel the phase of element for each subarray and the second part is to cancel the common phase of subarray . This means PA consists of two steps: APA in the first step and DPA in the second step.
After the APA, the output of subarray is described as follows
(18)  
where
(19) 
To remove the common factor of subarray , we design the following DPA vector
(20) 
where
(21) 
Therefore, in Fig. 1 is represented as
(22)  
Similar to (12), we have the average receive power as follows
(23)  
According to the APA mentioned above, we find that, when
(24) 
and
(25) 
we obtain the maximum power . Because of APA, the number of blocks should be chosen to be . Thus, the computational amount of the proposed method in the subsection is
(26) 
FLOPs.
IiiC Proposed LowComplexity HDAPA DOA Estimator
In this subsection, we will provide another lowercomplexity HPA alternative scheme with a reverse PA order: DPA, and APA. Firstly, we use the first block of data to perform the DPA by exhaustive linear search. Once we find the feasible set of optimal directions where some pseudosolutions are included and the number of all solutions are . Secondly, the next blocks of data are utilized to perform APA. This means the total number of blocks for PA is . Given the initial phases of all analog phase shifters are zeros, the discrete output summation signal of the th subarray corresponding to block is
(27)  
where
(28)  
which are used as the input of digital beamformer in Fig. 1. After passing through DPA, we have
(29)  
which could be stored in memory at reciever. Furthermore, (23) is represented as
(30)  
where
(31) 
where the angle is chosen from the angle set . Due to DB, the stepsize could be set to arbitrarily small. It is assumed that the optimal direction is attained by an exhaustive linear search over the set . Clearly, satisfies the following approximate identity
(32) 
where , and . From (32), we can obtain the set of feasible solutions for the estimated emitter direction as follows
(33) 
The estimation angles in the above equation are substituted into (19) which produce matrix , i.e.,
(34) 
where corresponds to according to (19), i.e.,
(35) 
and
(36) 
We substitute each column of the above two equations into (23) which will bring s. At last, we determine which yields the maximum value of . In the same manner as shown in (26), the computational amount of the proposed HDAPA is
(37) 
FLOPs. In particular, we need to clarify what the main differences are between two steps APA and DPA in HADPA and HDAPA. Since APA operates in the analog domain, eachstep search corresponding to one bin needs one new block of data because analog signal cannot be stored before ADC operation in Fig. 1. In other words, if the search interval of direction angle is divided into bins, then APA requires blocks of data to complete an exhaustive linear search over the total search range. Conversely, for the case of DPA, the sampled and quantized signal can be saved in memory. Only one block of data is required to complete an exhaustive linear search over the interval direction angle . This means that DPA has a shorter delay and length of receive data compared with APA. This is the benefit from DPA.
Iv Proposed Lowcomplexity Hybrid RootMUSICHDAPA Estimator and Hybrid CRLB
In Section III, we present how to estimate DOA from the aspect of pure linear search. Below, we will use the concept of spatial spectral estimation method to estimate DOA by RootMUSIC criteria with the aid of HDAPA in Section IVA, which will achieve a faster estimation speed compared to those methods based on pure linear search. Fig. 2 briefly describe the schematic diagram of the proposed RootMUSICHDAPA.
Iva Proposed RootMUSICHDAPA DOA Estimator
Here, each subarray will be still viewed as a large virtual antenna, initially, like HDAPA, assume all phases of analog beamforming vector are equal to zeros, i.e.,
(38) 
According to (27), the output vector of all subarrays at block is
(39)  
where
(40) 
can be viewed as the array manifold vector of the virtual array with each subarray as its virtual antenna elements, and is the common factor due to the summation of all elements per subarray. Let us define
(41) 
Thus, in (39) is written as
(42) 
Now, we adopt the RootMUSIC algorithm in digital part to estimate DOA. The covariance matrix of the output vector of virtual antenna array in Fig. 1 is
(43)  
where
represents the variance of the receive signal, which equals the average receive signal power. Furthermore, similar to the conventional RootMUSIC method, the singularvalue decomposition (SVD) of
is expressed as(44) 
where denotes the column vector consisting of the singular vector corresponding to the largest singular value, the matrix contains the singular vectors corresponding to smallest singular values, and the diagonal matrix has the following form
(45) 
Using the definition of pseudo spectrum of MUSIC algorithm in [5], we have the corresponding pseudo spectrum
(46)  
By maximizing the above , we have obtain the emitter direction. In general, there are two kinds of ways to estimate the emitter direction: linear search and RootMUSIC. The latter is attractive due to its lowcomplexity and nearanalytic solution. In what follows, we will design a modified RootMUSIC algorithm to find the optimal direction in the case of our hybrid structure, which is different from fullydigital structure. Considering that the denominator in the right side of equation (46) is close to zero for , we define the polynomial equation
(47)  
where , is the element in the th column of the th row of ,
(48) 
and
(49) 
then (47) is rewritten in the simple form
(50)  
Observing the above polynomial equation, we find its highest degree is . This means that this equation has roots. When is a root of , is its root as well. Now, we define the set of its roots as follows
(51) 
which yields the set of associated emitter directions as follows
(52) 
where
(53) 
Now, we use the digital beamformer (30) to keep the true optimal solution by deleting other pseudosolutions in , which is formulated as the following optimization problem
(54) 
which yields
(55) 
and
(56) 
from (47). Observing (47), (48), and (49), it is evident that the function is a periodic function of with period . In other words, , and , for , form all feasible solutions to (50). Thus, we have the extended feasible set as follows
(57) 
where
(58) 
Considering the objective function in (54) is also a periodic function of with period . Excluding the pseudosolution in feasible set of solutions requires APA. Before ADC, it is impossible to store the analog signal, then we need to use the next new blocks of signals.
Therefore, similar to the HADPA, we compute the set of all s in (30) corresponding to all phases in as follows
(59) 
The value of emitter direction associated with the largest element in set is the resulting estimated direction angle. This completes the estimate process of the proposed RootMUSICHDAPA scheme.
IvB Hybrid CRLB
To evaluate the proposed HDAPA and RootMUSICHDAPA methods above, the CRLB for hybrid structure, based on (43), is derived in Appendix A and is described in the following theorem.
Theorem 1: For the HAD beamforming structure in Fig. 1, with single emission source and ULA, the variance of unbiased DOA estimator is lower bounded by the following hybrid CRLB
(60) 
where
(61)  
Proof: See Appendix A.
IvC Complexity Analysis and Comparison
According to (37), the computational amount of the RootMUSICHDAPA is
(62)  
FLOPs. Regardless of computational complexity, APA and HADPA require more timedomain blocks to implement onetime phase alignment so as to estimate the highresolution DOA compared with HDAPA and RootMUSICHDAPA. The required numbers of time blocks for RootMUSICHDAPA, HDAPA, HADPA, and APA are as follows: , , , and , respectively. Obviously, the numbers of time blocks for RootMUSICHDAPA and HDAPA are , independent of stepsize, smaller than , i.e., the number of HADPA and APA depending on stepsize. In general, is far smaller than . Actually, the number of time blocks has a profound impact on the computational complexity as listed in Table I. Reversely, as shown in Table I, the computational complexity of each method is a linear function of the corresponding number of time blocks.
Algorithms  Complexity  

Conventional APA  
Proposed HADPA  
Proposed HDAPA  
Proposed RootMUSICHDAPA 

V Simulation Results
In this section, we present simulation results to demonstrate the performance of the three DOA estimators proposed by us: HADPA, HDAPA, and RootMUSICHDAPA. Simulation parameters are chosen as follows: the direction of emitter , , and . In mediumscale and largescale MIMO scenarios, the number of antennas at receive array is set to and , respectively.
Firstly, Fig. 3 and Fig. 4 plot the curves of root mean squared error (RMSE) versus stepsize of the four DOA estimators APA, HADPA, and HDAPA in Section III, and the proposed RootMUSICHDAPA in Section IV for different values of : 32 (mediumscale) and 128 (largescale), respectively. Here, , the number of subaarays, is set to , and SNR is equal to . It is seen from the two figures that the RMSE performance of all three methods of linear searching proposed in Section III improve as stepsize decreases. In particular, when stepsize is small enough, APA and HADPA will be closer to the fullydigital CRLB while the proposed HDAPA and RootMUSICHDAPA can converge to the hybrid CRLB. In large scale case, we observe that, when stepsize exceeds , the proposed RootMUSICHDAPA performs better than two pure linear searching algorithms: APA, and HADPA. Conversely, it is worsen than APA, and HADPA. However, a small stepsize means high complexity. In what follows, we will compare their complexity. The proposed RootMUSICHDAPA owns an extremely lower computational complexity than other methods. Thus, below, we will make a deep and extensive investigation on the proposed RootMUSICHDAPA.
Fig. 5 shows the performance curves of RMSE versus SNR of the proposed RootMUSICHDAPA algorithm with , , and , where the corresponding CRLBs are used as a performance benchmark. From Fig. 5, it is obvious that the proposed RootMUSICHDAPA method can achieve the corresponding CRLBs as SNR exceeds a fixed threshold. For example, and , the proposed method can reach the CRLB curve when SNR is larger than 5dB. Also, we find that as increases, the RMSE performance of the proposed method degrades gradually, and the corresponding CRLB value increases.
To observe the impact of the total number of array antennas on the proposed RootMUSICHDAPA scheme, in Fig. 6, we change the value of from 32 to 128, given fixed , and . Similar to Fig. 5, Fig. 6 still plots the RMSE verus SNR curves of the proposed RootMUSICHDAPA method. From this figure, we obtain the same performance trend as Fig. 5. Particularly, we note that, as the total number of antennas increases, the accuracy of the proposed algorithm improves accordingly.
Fig.7 illustrates the RMSE performance versus the number of snapshots for three different values of SNR: 0dB, 5dB, and 10dB. Regardless of the value of SNR and the number of snapshots/sampling points, the RMSE performance will always reach the corresponding CRLB. Additionally, as the number of snapshots increases, the RMSE performance becomes better and better.