# Blind Over-the-Air Computation and Data Fusion via Provable Wirtinger Flow

Over-the-air computation (AirComp) shows great promise to support fast data fusion in Internet-of-Things (IoT) networks. AirComp typically computes desired functions of distributed sensing data by exploiting superposed data transmission in multiple access channels. To overcome its reliance on channel state information (CSI), this work proposes a novel blind over-the-air computation (BlairComp) scheme without requiring CSI access, particularly for low-complexity and low-latency IoT networks. To solve the resulting nonconvex optimization problem without the initialization dependency exhibited by a number of recently proposed efficient algorithms, we develop a Wirtinger flow solution to the BlairComp problem based on random initialization. To analyze the resulting efficiency, we prove its statistical optimality and global convergence guarantee. Specifically, in the first stage of the algorithm, given sufficient data samples, the iterates of randomly initialized Wirtinger flow enter a local region that enjoys strong convexity and strong smoothness within a few iterations. We also prove that the estimation error of BlairComp in the local region is sufficiently small. We show that, in the second stage of the algorithm, the estimation error decays exponentially at a linear convergence rate.


## I Introduction

The broad range of Internet-of-Things (IoT) applications continues to contribute substantially to economic development and the improvement of our lives [1]. In particular, wirelessly networked sensors are growing at an unprecedented rate, making data aggregation highly critical for IoT services [2]. For large-scale wireless networking of sensor nodes, orthogonal multiple access protocols are highly impractical because of their low spectrum utilization efficiency and excessive network latency [3]. In response, the concept of over-the-air computation (AirComp) has recently been considered for computing a class of nomographic functions, such as the arithmetic mean, weighted sum, geometric mean, and Euclidean norm of distributed sensor data, via concurrent, instead of sequential, node transmissions [4]. AirComp exploits the natural superposition of co-channel transmissions from multiple data source nodes [5].

There have already been a number of published works related to AirComp. Among them, one research thread takes an information-theoretic view and focuses on the achievable computation rate under structured coding schemes. Specifically, in the seminal work of [6], linear source coding was designed to reliably compute a function of distributed sensor data transmitted over multiple-access channels (MACs). Lattice codes were adopted in [6, 7] to efficiently compute the sum of source signals over MACs. Leveraging lattice coding, a compute-and-forward relaying scheme [5] was proposed for relay-assisted networks. On the other hand, a different line of studies [8, 9] investigates the error of distributed estimation in wireless sensor networks. In particular, linear decentralized estimation was investigated in [8] for coherent multiple access channels, and power control was investigated in [9] to optimize the estimation distortion. It was shown in [10] that pre- and post-processing functions enable the optimization of computation performance by harnessing interference for function computation. A more recent line of studies focuses on designing transmitter and receiver matrices in order to minimize the distortion error when computing desired functions. Among others, MIMO-AirComp equalization and channel feedback techniques for spatially multiplexed multi-function computation have been proposed [3]. Another work developed a novel transmitter design for multi-antenna IoT devices with zero-forcing beamforming [11].

However, the main limitation of current AirComp is its dependence on channel state information (CSI), which leads to high latency and significant overhead in massive Internet-of-Things networks with a large number of devices. Even though the work [12] has proposed a type of CSI assumption at sensor nodes, called No CSI, the receiver still needs to obtain statistical channel knowledge. Recently, blind demixing has become a powerful tool to elude channel state information (i.e., without channel estimation at either transmitters or receivers), thereby enabling low-latency communications [13, 14, 15]. Specifically, in blind demixing, a sequence of source signals can be recovered from the sum of bilinear measurements without knowledge of the channel information [16]. Inspired by the recent progress of blind demixing, in this paper we shall propose a novel blind over-the-air computation (BlairComp) scheme for low-latency data aggregation, thereby computing the desired function (e.g., arithmetic mean) of sensing data vectors without prior knowledge of channel information. However, the BlairComp problem turns out to be a highly intractable nonconvex optimization problem due to the bilinear signal model.

There is a growing body of recent works on taming the nonconvexity in solving high-dimensional bilinear systems. Specifically, semidefinite programming was developed in [14] to solve the blind demixing problem by lifting the bilinear model into the matrix space. However, it is computationally prohibitive for large-scale problems due to high computation and storage costs. To address this issue, nonconvex algorithms, e.g., regularized gradient descent with spectral initialization [13], were further developed to optimize the variables in the natural vector space. Nevertheless, the theoretical guarantees for regularized gradient descent [13] provide a pessimistic convergence rate and require carefully designed initialization. The Riemannian trust-region optimization algorithm without regularization was further proposed in [15] to improve the convergence rate. However, this second-order algorithm brings unique challenges in providing statistical guarantees. Recently, theoretical guarantees concerning regularization-free Wirtinger flow with spectral initialization for blind demixing were provided in [16]. However, this regularization-free method still calls for spectral initialization. To find a natural implementation for practitioners that works equally well as spectral initialization, in this paper we shall propose to solve the BlairComp problem via randomly initialized Wirtinger flow with provable optimality guarantees.

Based on the random initialization strategy, one line of research studies the benign global landscapes of high-dimensional nonconvex estimation problems and then designs generic saddle-point escaping algorithms, e.g., noisy stochastic gradient descent [17], the trust-region method [18], and perturbed gradient descent [19]. With sufficient samples, these algorithms are guaranteed to converge globally for phase retrieval [18], matrix recovery [20], matrix sensing [21], robust PCA [21] and shallow neural networks [22], where all local minima are provably as good as global minima and all saddle points are strict. However, the theoretical results developed in [17, 18, 19, 20, 21, 22] are fairly general and may yield pessimistic convergence rate guarantees. Moreover, these saddle-point escaping algorithms are more complicated to implement than natural vanilla gradient descent or Wirtinger flow. To advance the theoretical analysis of gradient descent with random initialization, a fast global convergence guarantee for randomly initialized gradient descent applied to phase retrieval was recently provided in [23].

In this paper, our main contribution is to provide the global convergence guarantee concerning Wirtinger flow with random initialization for solving the nonconvex BlairComp problem. It turns out that, for BlairComp, the procedure of Wirtinger flow with random initialization can be separated into two stages:

• Stage I: the estimation error remains nearly flat, and this stage takes only a few iterations,

• Stage II: the estimation error decays exponentially at a linear convergence rate.

In addition, we identify the exponential growth of the magnitude ratios of the signal components to the perpendicular components, which explains why Stage I lasts only a few iterations.

#### Notations

Throughout this paper, $f(n) \lesssim g(n)$ or $f(n) = O(g(n))$ denotes that there exists a constant $c > 0$ such that $|f(n)| \le c|g(n)|$, whereas $f(n) \gtrsim g(n)$ means that there exists a constant $c > 0$ such that $|f(n)| \ge c|g(n)|$. $f(n) \gg g(n)$ denotes that there exists some sufficiently large constant $c > 0$ such that $|f(n)| \ge c|g(n)|$. In addition, the notation $f(n) \asymp g(n)$ means that there exist constants $c_1, c_2 > 0$ such that $c_1|g(n)| \le |f(n)| \le c_2|g(n)|$. Let superscripts $(\cdot)^T$ and $(\cdot)^H$ denote the transpose and conjugate transpose of a matrix/vector, respectively, and let the superscript $(\cdot)^*$ denote the conjugate of a complex number.

## II Problem Formulation

Blind over-the-air computation (BlairComp) aims to facilitate low-latency data aggregation in IoT networks without a priori knowledge of CSI. This is achieved by computing the desired functions of the distributed sensing data based on the natural signal superposition of transmission over multi-access channels.

### II-A Blind Over-the-Air Computation

We consider a wireless sensor network consisting of $s$ active sensor nodes and a single fusion center. Let $d_i$ denote the sensor data vector collected at the $i$-th node, with $\ell$-th entry $d_{i\ell}$. The fusion center, through AirComp, aims to compute nomographic functions of the distributed data that can be decomposed as [10]

  $H_\ell(d_{1\ell}, \cdots, d_{s\ell}) = F_\ell\Big(\sum_{i=1}^{s} G_{i\ell}(d_{i\ell})\Big), \quad \ell = 1, \cdots, N.$ (1)

The function $G_{i\ell}$ denotes the pre-processing at the sensor nodes and $F_\ell$ denotes the post-processing at the fusion center. Typical nomographic functions computable by AirComp include the arithmetic mean, weighted sum, geometric mean, polynomial, and Euclidean norm [10].
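As a concrete illustration of the decomposition (1), the geometric mean is nomographic: choosing the pre-processing $G_{i\ell}(d) = \ln d$ and the post-processing $F_\ell(y) = \exp(y/s)$ recovers it from a plain sum, which is exactly what the channel superposition provides. A minimal sketch (the function name is ours, not from the paper):

```python
import math

def geometric_mean_aircomp(data):
    """Compute the geometric mean as a nomographic function:
    pre-process with G(d) = ln(d) at each node, superpose the
    transmissions (a plain sum here stands in for the channel),
    then post-process with F(y) = exp(y / s) at the fusion center."""
    s = len(data)
    superposed = sum(math.log(d) for d in data)  # what the channel adds up
    return math.exp(superposed / s)

data = [1.0, 2.0, 8.0]
print(geometric_mean_aircomp(data))  # (1 * 2 * 8) ** (1/3)
```

The same pattern covers the weighted sum ($G_{i\ell}$ a scaling, $F_\ell$ the identity) and the Euclidean norm ($G_{i\ell}(d) = d^2$, $F_\ell(y) = \sqrt{y}$).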

In this work, we focus on a specific nomographic function

  $\bar{\theta} = \sum_{i=1}^{s} \bar{x}_i,$ (2)

where $\bar{x}_i \in \mathbb{C}^N$ is the preprocessed data vector transmitted by the $i$-th node. Over $m$ channel access opportunities (e.g., time slots), the received signals at the fusion center in the frequency domain can be written as [13, 15]

  $y_j = \sum_{i=1}^{s} b_j^H \bar{h}_i \bar{x}_i^H a_{ij} + e_j, \quad 1 \le j \le m,$ (3)

where $a_{ij} \in \mathbb{C}^N$ are the access vectors, $\bar{h}_i \in \mathbb{C}^K$ are the CSI vectors, and $e_j$ is independent circularly symmetric complex Gaussian measurement noise.

To compute the desired functions via BlairComp without knowledge of $\{\bar{h}_i\}$, we can consider a precoding scheme with randomly selected known vectors: each $a_{ij}$ follows the i.i.d. circularly symmetric complex normal distribution $\mathcal{CN}(0, I_N)$ for $1 \le i \le s$, $1 \le j \le m$. Furthermore, the first $K$ columns of the unitary discrete Fourier transform (DFT) matrix form the known matrix $B := [b_1, \cdots, b_m]^H \in \mathbb{C}^{m \times K}$ [13]. The target of BlairComp is to compute the desired function vector $\bar{\theta}$ via concurrent transmissions without channel information, thereby providing low-latency data aggregation in IoT networks.
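The measurement model (3) with the precoding just described can be simulated directly. The sketch below (with illustrative problem sizes and a noise level of our own choosing) draws the $a_{ij}$'s as i.i.d. $\mathcal{CN}(0, I_N)$ vectors, takes the rows of the first $K$ columns of the unitary DFT matrix as the $b_j$'s, and superposes the bilinear terms:

```python
import numpy as np

rng = np.random.default_rng(0)
s, N, K, m = 3, 8, 8, 200  # nodes, data dim, channel dim, slots (illustrative)

# b_j: rows of the first K columns of the unitary m x m DFT matrix.
F = np.fft.fft(np.eye(m)) / np.sqrt(m)
B = F[:, :K]  # B[j] is the row vector b_j

# i.i.d. CN(0, I_N) access vectors a_{ij}; random ground truth channels/data.
A = (rng.standard_normal((s, m, N)) + 1j * rng.standard_normal((s, m, N))) / np.sqrt(2)
h_bar = (rng.standard_normal((s, K)) + 1j * rng.standard_normal((s, K))) / np.sqrt(2 * K)
x_bar = (rng.standard_normal((s, N)) + 1j * rng.standard_normal((s, N))) / np.sqrt(2 * N)

sigma = 1e-3  # assumed noise level
e = sigma * (rng.standard_normal(m) + 1j * rng.standard_normal(m)) / np.sqrt(2)

# y_j = sum_i b_j^H hbar_i xbar_i^H a_{ij} + e_j, per (3)
y = np.zeros(m, dtype=complex)
for j in range(m):
    for i in range(s):
        y[j] += (B[j].conj() @ h_bar[i]) * (x_bar[i].conj() @ A[i, j])
    y[j] += e[j]
print(y.shape)  # (200,)
```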

### II-B Multi-Dimensional Nonconvex Estimation

One way to estimate the result vector $\bar{\theta}$ from the received signals in (3) is to use $\theta = \sum_{i=1}^{s} \omega_i x_i$ with $\omega_i$ as the ambiguity alignment parameter, for which one could first solve the bilinear optimization problem

  $\min_{\{h_i, x_i\}} f(z) := \sum_{j=1}^{m} \Big| \sum_{i=1}^{s} b_j^H h_i x_i^H a_{ij} - y_j \Big|^2,$ (4)

which is highly nonconvex. To measure the computation accuracy of the BlairComp problem, we define the following metric (the relative error)

  $\mathrm{error}(\theta, \bar{\theta}) = \|\theta - \bar{\theta}\|_2 / \|\bar{\theta}\|_2.$ (5)

Note that the $\omega_i$'s are ambiguity alignment parameters such that

  $\omega_i = \arg\min_{\omega_i \in \mathbb{C}} \big( \|(\omega_i^*)^{-1} h_i - \bar{h}_i\|_2^2 + \|\omega_i x_i - \bar{x}_i\|_2^2 \big).$ (6)

To estimate $\omega_i$, one reference symbol in $\bar{x}_i$ is needed.
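The alignment problem (6) is a two-dimensional real optimization over the complex scalar $\omega_i$. A simple numerical sketch (using a generic derivative-free optimizer; the helper name `align` is ours) recovers a known scaling ambiguity:

```python
import numpy as np
from scipy.optimize import minimize

def align(h, x, h_bar, x_bar):
    """Numerically solve (6): find the complex scalar w minimizing
    ||(w^*)^{-1} h - h_bar||^2 + ||w x - x_bar||^2."""
    def obj(p):
        w = p[0] + 1j * p[1]
        if abs(w) < 1e-12:  # keep (w^*)^{-1} well defined
            return 1e12
        r1 = h / np.conj(w) - h_bar
        r2 = w * x - x_bar
        return np.linalg.norm(r1) ** 2 + np.linalg.norm(r2) ** 2
    res = minimize(obj, x0=[1.0, 0.0], method="Nelder-Mead")
    return res.x[0] + 1j * res.x[1]

rng = np.random.default_rng(1)
h_bar = rng.standard_normal(4) + 1j * rng.standard_normal(4)
x_bar = rng.standard_normal(4) + 1j * rng.standard_normal(4)
# h = 2 h_bar, x = 0.5 x_bar is explained exactly by the ambiguity w = 2
w = align(2.0 * h_bar, 0.5 * x_bar, h_bar, x_bar)
print(w)
```

This scaling ambiguity is intrinsic to the bilinear model: $(h_i, x_i)$ and $((\omega_i^*)^{-1} h_i, \omega_i x_i)$ produce identical measurements in (3).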

In this paper, we shall propose to solve the high-dimensional BlairComp problem via Wirtinger flow with random initialization. Our main contribution is to provide the statistical optimality and convergence guarantee for the randomly initialized Wirtinger flow algorithm by exploiting the benign geometry of the high-dimensional BlairComp problem.

## III Main Approach

In this section, we first propose an algorithm based on randomly initialized Wirtinger flow to solve the BlairComp problem (4). We then present a statistical analysis demonstrating the optimality of this algorithm for solving the high-dimensional nonconvex estimation problem.

### III-A Randomly Initialized Wirtinger Flow Algorithm

Wirtinger flow with random initialization is an iterative algorithm with a simple gradient descent update procedure without regularization. Specifically, the gradient step of Wirtinger flow is represented by the notion of Wirtinger derivatives [24], i.e., the derivatives of real valued functions over complex variables.

To simplify the notation, we denote the aggregated variable

  $z = [z_1^T, \cdots, z_s^T]^T \in \mathbb{C}^{s(N+K)}$ with $z_i = [h_i^T, x_i^T]^T \in \mathbb{C}^{N+K}.$ (7)

For each $i = 1, \cdots, s$, let $\nabla_{h_i} f(z)$ and $\nabla_{x_i} f(z)$ denote the Wirtinger gradients of $f(z)$ with respect to $h_i$ and $x_i$, respectively:

  $\nabla_{h_i} f(z) = \sum_{j=1}^{m} \Big( \sum_{k=1}^{s} b_j^H h_k x_k^H a_{kj} - y_j \Big) b_j \, a_{ij}^H x_i,$ (8a)
  $\nabla_{x_i} f(z) = \sum_{j=1}^{m} \Big( \sum_{k=1}^{s} h_k^H b_j a_{kj}^H x_k - y_j^* \Big) a_{ij} \, b_j^H h_i.$ (8b)

In light of the Wirtinger gradients (8), the update rule of Wirtinger flow with stepsize $\eta > 0$ is

  $\begin{bmatrix} h_i^{t+1} \\ x_i^{t+1} \end{bmatrix} = \begin{bmatrix} h_i^t \\ x_i^t \end{bmatrix} - \eta \begin{bmatrix} \frac{1}{\|x_i^t\|_2^2} \nabla_{h_i} f(z^t) \\ \frac{1}{\|h_i^t\|_2^2} \nabla_{x_i} f(z^t) \end{bmatrix}, \quad i = 1, \cdots, s.$ (9)
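The gradients (8) and the update rule (9) translate directly into code. The sketch below (illustrative sizes, noiseless measurements, and a step size of our own choosing) runs the randomly initialized iteration and checks that the loss (4) decreases:

```python
import numpy as np

rng = np.random.default_rng(2)
s, N, K, m, eta = 2, 4, 4, 120, 0.1

F = np.fft.fft(np.eye(m)) / np.sqrt(m)
B = F[:, :K]  # rows b_j of the known matrix
A = (rng.standard_normal((s, m, N)) + 1j * rng.standard_normal((s, m, N))) / np.sqrt(2)
h_bar = (rng.standard_normal((s, K)) + 1j * rng.standard_normal((s, K))) / np.sqrt(2 * K)
x_bar = (rng.standard_normal((s, N)) + 1j * rng.standard_normal((s, N))) / np.sqrt(2 * N)

def forward(h, x):
    # the bilinear map: sum_k b_j^H h_k x_k^H a_{kj} for each j
    out = np.zeros(m, dtype=complex)
    for j in range(m):
        for k in range(s):
            out[j] += (B[j].conj() @ h[k]) * (x[k].conj() @ A[k, j])
    return out

y = forward(h_bar, x_bar)  # noiseless measurements

def loss(h, x):
    return np.sum(np.abs(forward(h, x) - y) ** 2)

def wf_step(h, x):
    c = forward(h, x) - y  # residuals c_j
    h_new, x_new = h.copy(), x.copy()
    for i in range(s):
        gh = np.zeros(K, dtype=complex)
        gx = np.zeros(N, dtype=complex)
        for j in range(m):
            gh += c[j] * (A[i, j].conj() @ x[i]) * B[j]            # (8a)
            gx += np.conj(c[j]) * (B[j].conj() @ h[i]) * A[i, j]   # (8b)
        h_new[i] = h[i] - eta / np.linalg.norm(x[i]) ** 2 * gh     # (9)
        x_new[i] = x[i] - eta / np.linalg.norm(h[i]) ** 2 * gx
    return h_new, x_new

# random initialization as in (11)
h = (rng.standard_normal((s, K)) + 1j * rng.standard_normal((s, K))) / np.sqrt(2 * K)
x = (rng.standard_normal((s, N)) + 1j * rng.standard_normal((s, N))) / np.sqrt(2 * N)
l0 = loss(h, x)
for _ in range(200):
    h, x = wf_step(h, x)
print(f"initial loss {l0:.3e}, final loss {loss(h, x):.3e}")
```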

Before proceeding to the theoretical analysis, we first present an example to illustrate the practical efficiency of Wirtinger flow with random initialization for solving problem (4). The ground truth values and initial points are randomly generated according to

  $\bar{h}_i \sim \mathcal{N}(0, K^{-1} I_K), \quad \bar{x}_i \sim \mathcal{N}(0, N^{-1} I_N),$ (10)
  $h_i^0 \sim \mathcal{N}(0, K^{-1} I_K), \quad x_i^0 \sim \mathcal{N}(0, N^{-1} I_N),$ (11)

for $i = 1, \cdots, s$. In all our simulations, the ground-truth signals are normalized. Specifically, for each choice of $m$, $K$ and $N$, the design vectors $b_j$'s and $a_{ij}$'s are generated according to the descriptions in Section II. With the same step size in all settings, Fig. 1 shows the relative error, i.e., (5), versus the iteration count. We observe that the convergence of Wirtinger flow with random initialization exhibits two stages: in Stage I, within dozens of iterations, the relative error remains nearly flat; in Stage II, the relative error decays exponentially despite the different problem sizes.

In practical scenarios, the estimation error of the ambiguity alignment parameters influences the relative error, i.e., (5). Hence, we illustrate the relationship between the estimation error of the ambiguity alignment parameters and the relative error via the following experiment. In each iteration, for each $i$, the estimated ambiguity alignment parameter is given by the minimizer of (6) perturbed by additive noise, with the noise level varied over a range. Fig. 1 shows the relative error versus the noise parameter, with both shown on the dB scale. As we can see, the relative error scales linearly with the noise parameter.

### III-B Theoretical Analysis

To present the main theorem, we first introduce several fundamental definitions, beginning with the incoherence parameter [13], which characterizes the incoherence between the design vectors $b_j$ and the channels $\bar{h}_i$.

###### Definition 1 (Incoherence for BlairComp).

Let the incoherence parameter $\mu$ be the smallest number such that

  $\max_{1 \le i \le s, \, 1 \le j \le m} \frac{|b_j^H \bar{h}_i|}{\|\bar{h}_i\|_2} \le \frac{\mu}{\sqrt{m}}.$

Let $\tilde{h}_i^t$ and $\tilde{x}_i^t$, respectively, denote

  $\tilde{h}_i^t = (\omega_i^{t*})^{-1} h_i^t \quad \text{and} \quad \tilde{x}_i^t = \omega_i^t x_i^t, \quad i = 1, \cdots, s,$ (12)

where the $\omega_i^t$'s are the alignment parameters defined in (6). We further define the norms of the signal component and the perpendicular component of $\tilde{h}_i^t$ with respect to $\bar{h}_i$, for $i = 1, \cdots, s$, as

  $\alpha_{h_i^t} := \langle \bar{h}_i, \tilde{h}_i^t \rangle / \|\bar{h}_i\|_2,$ (13)
  $\beta_{h_i^t} := \Big\| \tilde{h}_i^t - \frac{\langle \bar{h}_i, \tilde{h}_i^t \rangle}{\|\bar{h}_i\|_2^2} \bar{h}_i \Big\|_2,$ (14)

respectively. Similarly, the norms of the signal component and the perpendicular component of $\tilde{x}_i^t$ with respect to $\bar{x}_i$, for $i = 1, \cdots, s$, can be represented as

  $\alpha_{x_i^t} := \langle \bar{x}_i, \tilde{x}_i^t \rangle / \|\bar{x}_i\|_2,$ (15)
  $\beta_{x_i^t} := \Big\| \tilde{x}_i^t - \frac{\langle \bar{x}_i, \tilde{x}_i^t \rangle}{\|\bar{x}_i\|_2^2} \bar{x}_i \Big\|_2,$ (16)

respectively.
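For intuition, the quantities (13)-(16) are just the decomposition of the aligned iterate along, and orthogonal to, the ground truth direction. A small numeric check (helper name ours):

```python
import numpy as np

def components(bar_v, tilde_v):
    """Signal component alpha = <bar_v, tilde_v> / ||bar_v||_2 and
    perpendicular component beta = ||tilde_v - proj_{bar_v}(tilde_v)||_2,
    following (13)-(16)."""
    inner = np.vdot(bar_v, tilde_v)  # <bar_v, tilde_v> = bar_v^H tilde_v
    alpha = inner / np.linalg.norm(bar_v)
    beta = np.linalg.norm(tilde_v - (inner / np.linalg.norm(bar_v) ** 2) * bar_v)
    return alpha, beta

bar_v = np.array([1.0, 0.0, 0.0])
tilde_v = np.array([0.6, 0.8, 0.0])
a, b = components(bar_v, tilde_v)
print(a, b)  # approximately 0.6 and 0.8
```

Note that $\alpha^2 + \beta^2 = \|\tilde{v}\|_2^2$ (here $0.36 + 0.64 = 1$), a Pythagorean identity used repeatedly in the analysis below.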

Without loss of generality, we assume $\|\bar{h}_i\|_2 = \|\bar{x}_i\|_2 = q_i$ ($q_i > 0$) for $i = 1, \cdots, s$. Define the condition number $\kappa := \max_i q_i / \min_i q_i$. Then the main theorem is presented in the following.

###### Theorem 1.

Assume that the initial points obey (11) for $i = 1, \cdots, s$ and that the stepsize $\eta > 0$ is sufficiently small. Suppose that the sample size $m$ is sufficiently large relative to $s \max\{K, N\}$, up to logarithmic and condition-number factors. Then, with high probability, there exist a sufficiently small constant $\gamma > 0$ and a threshold $T_\gamma$ such that

1. The randomly initialized Wirtinger flow makes the estimation error decay exponentially, i.e.,

  $\mathrm{error}(\theta^t, \bar{\theta}) \le \gamma \Big(1 - \frac{\eta}{16\kappa}\Big)^{t - T_\gamma}, \quad t \ge T_\gamma,$ (17)
2. The magnitude ratios of the signal components to the perpendicular components with respect to $\tilde{h}_i^t$ and $\tilde{x}_i^t$ obey

  $\max_{1 \le i \le s} \frac{\alpha_{h_i^t}}{\beta_{h_i^t}} \gtrsim \frac{1}{\sqrt{K \log K}} (1 + c_3 \eta)^t,$ (18a)
  $\max_{1 \le i \le s} \frac{\alpha_{x_i^t}}{\beta_{x_i^t}} \gtrsim \frac{1}{\sqrt{N \log N}} (1 + c_4 \eta)^t,$ (18b)

respectively, for some constants $c_3, c_4 > 0$.

3. The normalized root mean square error for $x_i^t$ obeys

  $\mathrm{RMSE}(x_i^t, \bar{x}_i) \lesssim \sqrt{N \log N} \, (1 + c_4 \eta)^{-t},$ (19)

for some constant $c_4 > 0$.

Theorem 1 provides a precise statistical analysis of the computational efficiency of Wirtinger flow with random initialization. Specifically, in Stage I, it takes only a few iterations for randomly initialized Wirtinger flow to reach a sufficiently small relative error, i.e., $\mathrm{error}(\theta^{T_\gamma}, \bar{\theta}) \le \gamma$, where $\gamma$ is some sufficiently small constant. The short duration of Stage I is owing to the exponential growth of the magnitude ratio of the signal component to the perpendicular component. Moreover, in Stage II, the algorithm reaches an $\epsilon$-accurate solution at a linear convergence rate, in a number of iterations proportional to $\log(1/\epsilon)$. Thus, the iteration complexity of randomly initialized Wirtinger flow is guaranteed as long as the sample size is sufficiently large.

To further illustrate the relationship between the signal components and the perpendicular components of $h_i^t$ and $x_i^t$, we provide simulation results under the settings above. In particular, the signal and perpendicular components versus the iteration count are demonstrated in Fig. 2. Considering Fig. 1 and Fig. 2 collectively, we see that despite the nearly flat estimation error during Stage I, the sizes of the signal components, i.e., $\alpha_{h_i^t}$ and $\alpha_{x_i^t}$ for each $i$, increase exponentially, and the signal component becomes the dominant component at the end of Stage I. Furthermore, the exponential growth of the ratios $\alpha_{h_i^t}/\beta_{h_i^t}$ and $\alpha_{x_i^t}/\beta_{x_i^t}$ for each $i$ is illustrated in Fig. 2.

## IV Dynamics Analysis

In this section, we prove the main theorem by investigating the dynamics of the iterates of Wirtinger flow with random initialization. The steps of proving Theorem 1 are summarized as follows.

1. Stage I:

• Dynamics of population-level state evolution. Derive the population-level state evolution (24a), (24b), (25a) and (25b), where the sample size approaches infinity. We then develop the approximate state evolution (27), which is remarkably close to the population-level state evolution in the finite-sample regime. See details in Section IV-A.

• Dynamics of approximate state evolution. Show that there exists some $T_\gamma$ such that condition (29) holds, provided (13), (14), (15) and (16) satisfy the approximate state evolution (27). The exponential growth of the ratios $\alpha_{h_i^t}/\beta_{h_i^t}$ and $\alpha_{x_i^t}/\beta_{x_i^t}$ is further demonstrated under the same assumption. Please refer to Lemma 1.

• Leave-one-out arguments. Prove that, with high probability, (13), (14), (15) and (16) satisfy the approximate state evolution (27) provided the iterates are nearly independent of the design vectors. Please refer to Lemma 2. To achieve this, the "near-independence" between the iterates and the design vectors is established by exploiting leave-one-out arguments and some variants thereof. Specifically, leave-one-out sequences and random-sign sequences are constructed in Section IV-C. Concentration bounds between the original and these auxiliary sequences are then provided in Lemma 4-Lemma 9.

2. Stage II: Local geometry in the region of incoherence and contraction. We invoke the theory developed in [16] to show local convergence of the randomly initialized Wirtinger flow in Stage II.

Claims (18) and (19) are further proven in Section IV-F.

### IV-A Dynamics of Population-level State Evolution

In this subsection, we investigate the dynamics of population-level (where we have infinite samples) state evolution of (13), (14), (15) and (16).

Without loss of generality, we assume that $\bar{x}_i = q_i e_1$ for $i = 1, \cdots, s$, where the $q_i$'s are some constants and $e_1$ denotes the first standard basis vector. This assumption is based on the rotational invariance of Gaussian distributions. Due to the deterministic nature of $B$, the ground truth signals (channel vectors) cannot be transformed to a simple form, which yields a more tedious analysis procedure. For simplification, for $i = 1, \cdots, s$, we denote

  $x_{i1}^t \quad \text{and} \quad x_{i\perp}^t := [x_{ij}^t]_{2 \le j \le N}$ (20)

as the first entry and the second through the $N$-th entries of $x_i^t$, respectively. Based on the assumption that $\bar{x}_i = q_i e_1$ for $i = 1, \cdots, s$, (15) and (16) can be reformulated as

  $\alpha_{x_i^t} = \tilde{x}_{i1}^t \quad \text{and} \quad \beta_{x_i^t} = \|\tilde{x}_{i\perp}^t\|_2.$ (21)

To study the population-level state evolution, we start by considering the case where the sequences $\{z_i^t\}$ (refer to (7)) are generated via the population gradient, i.e., for $i = 1, \cdots, s$,

  $\begin{bmatrix} h_i^{t+1} \\ x_i^{t+1} \end{bmatrix} = \begin{bmatrix} h_i^t \\ x_i^t \end{bmatrix} - \eta \begin{bmatrix} \frac{1}{\|x_i^t\|_2^2} \nabla_{h_i} F(z^t) \\ \frac{1}{\|h_i^t\|_2^2} \nabla_{x_i} F(z^t) \end{bmatrix},$ (22)

where

  $\nabla_{h_i} F(z) := \mathbb{E}[\nabla_{h_i} f(z)] = \|x_i\|_2^2 h_i - (\bar{x}_i^H x_i) \bar{h}_i,$
  $\nabla_{x_i} F(z) := \mathbb{E}[\nabla_{x_i} f(z)] = \|h_i\|_2^2 x_i - (\bar{h}_i^H h_i) \bar{x}_i.$

Here, the population gradients are computed based on the assumption that $h_i$ and $x_i$ are independent of each other. With simple calculations, the dynamics of both the signal and the perpendicular components with respect to $x_i^t$ are given as

  $\tilde{x}_{i1}^{t+1} = (1 - \eta) \tilde{x}_{i1}^t + \frac{\eta q_i^2}{\|\tilde{h}_i^t\|_2^2} \bar{h}_i^H \tilde{h}_i^t,$ (23a)
  $\tilde{x}_{i\perp}^{t+1} = (1 - \eta) \tilde{x}_{i\perp}^t.$ (23b)

Assuming that $\eta$ is sufficiently small and $\|\bar{h}_i\|_2 = \|\bar{x}_i\|_2 = q_i$ ($q_i > 0$) for $i = 1, \cdots, s$, and recognizing that $\|\tilde{h}_i^t\|_2^2 = \alpha_{h_i^t}^2 + \beta_{h_i^t}^2$, we arrive at the following population-level state evolution for both $\alpha_{x_i^t}$ and $\beta_{x_i^t}$:

  $\alpha_{x_i^{t+1}} = (1 - \eta) \alpha_{x_i^t} + \frac{\eta q_i \alpha_{h_i^t}}{\alpha_{h_i^t}^2 + \beta_{h_i^t}^2},$ (24a)
  $\beta_{x_i^{t+1}} = (1 - \eta) \beta_{x_i^t}.$ (24b)

Likewise, the population-level state evolution for both $\alpha_{h_i^t}$ and $\beta_{h_i^t}$ is:

  $\alpha_{h_i^{t+1}} = (1 - \eta) \alpha_{h_i^t} + \frac{\eta q_i \alpha_{x_i^t}}{\alpha_{x_i^t}^2 + \beta_{x_i^t}^2},$ (25a)
  $\beta_{h_i^{t+1}} = (1 - \eta) \beta_{h_i^t}.$ (25b)
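The population-level recursions (24)-(25) can be iterated numerically to see the two-stage behavior: the perpendicular component contracts by $(1-\eta)$ each step, the signal component grows once the perpendicular component shrinks, and the pair eventually converges to $(q_i, 0)$. A scalar simulation (initial values chosen by us to mimic random initialization, with $q_i = 1$):

```python
eta, q, T = 0.1, 1.0, 200
# tiny signal components, O(1) perpendicular components at initialization
a_h, b_h = 1e-3, 1.0
a_x, b_x = 1e-3, 1.0

ratios = []
for t in range(T):
    a_x_new = (1 - eta) * a_x + eta * q * a_h / (a_h ** 2 + b_h ** 2)  # (24a)
    b_x_new = (1 - eta) * b_x                                           # (24b)
    a_h_new = (1 - eta) * a_h + eta * q * a_x / (a_x ** 2 + b_x ** 2)  # (25a)
    b_h_new = (1 - eta) * b_h                                           # (25b)
    a_h, b_h, a_x, b_x = a_h_new, b_h_new, a_x_new, b_x_new
    ratios.append(a_x / b_x)  # signal-to-perpendicular ratio, cf. (18)

print(a_x, b_x)  # alpha approaches q = 1, beta decays to ~0
```

The ratio `a_x / b_x` stays tiny for the first iterations (Stage I) and then blows up geometrically, mirroring the behavior claimed in (18).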

In the finite-sample case, the dynamics of the randomly initialized Wirtinger flow iterates can be represented as

  $z_i^{t+1} = \begin{bmatrix} h_i^{t+1} \\ x_i^{t+1} \end{bmatrix} = \begin{bmatrix} h_i^t - \frac{\eta}{\|x_i^t\|_2^2} \nabla_{h_i} F(z^t) \\ x_i^t - \frac{\eta}{\|h_i^t\|_2^2} \nabla_{x_i} F(z^t) \end{bmatrix} - \begin{bmatrix} \frac{\eta}{\|x_i^t\|_2^2} \big( \nabla_{h_i} f(z^t) - \nabla_{h_i} F(z^t) \big) \\ \frac{\eta}{\|h_i^t\|_2^2} \big( \nabla_{x_i} f(z^t) - \nabla_{x_i} F(z^t) \big) \end{bmatrix}.$ (26)

Under the assumption that the last term in (26) is well controlled, which will be justified in Appendix B, we arrive at the approximate state evolution:

  $\alpha_{h_i^{t+1}} = \Big( 1 - \eta + \frac{\eta q_i \psi_{h_i^t}}{\alpha_{x_i^t}^2 + \beta_{x_i^t}^2} \Big) \alpha_{h_i^t} + \frac{\eta (1 - \rho_{h_i^t}) q_i \alpha_{x_i^t}}{\alpha_{x_i^t}^2 + \beta_{x_i^t}^2},$ (27a)
  $\beta_{h_i^{t+1}} = \Big( 1 - \eta + \frac{\eta q_i \varphi_{h_i^t}}{\alpha_{x_i^t}^2 + \beta_{x_i^t}^2} \Big) \beta_{h_i^t},$ (27b)
  $\alpha_{x_i^{t+1}} = \Big( 1 - \eta + \frac{\eta q_i \psi_{x_i^t}}{\alpha_{h_i^t}^2 + \beta_{h_i^t}^2} \Big) \alpha_{x_i^t} + \frac{\eta (1 - \rho_{x_i^t}) q_i \alpha_{h_i^t}}{\alpha_{h_i^t}^2 + \beta_{h_i^t}^2},$ (27c)
  $\beta_{x_i^{t+1}} = \Big( 1 - \eta + \frac{\eta q_i \varphi_{x_i^t}}{\alpha_{h_i^t}^2 + \beta_{h_i^t}^2} \Big) \beta_{x_i^t},$ (27d)

where $\psi_{h_i^t}, \psi_{x_i^t}, \varphi_{h_i^t}, \varphi_{x_i^t}$ and $\rho_{h_i^t}, \rho_{x_i^t}$ represent the perturbation terms.

### IV-B Dynamics of Approximate State Evolution

To begin with, we define the discrepancy between the estimate $z$ and the ground truth $\bar{z}$ as the distance function

  $\mathrm{dist}(z, \bar{z}) = \Big( \sum_{i=1}^{s} \mathrm{dist}^2(z_i, \bar{z}_i) \Big)^{1/2},$ (28)

where $\mathrm{dist}^2(z_i, \bar{z}_i) = \min_{\omega_i \in \mathbb{C}} \big( \|(\omega_i^*)^{-1} h_i - \bar{h}_i\|_2^2 + \|\omega_i x_i - \bar{x}_i\|_2^2 \big)$ for $i = 1, \cdots, s$. Here, each $\omega_i$ is the alignment parameter. It is easily seen that if (13), (14), (15) and (16) obey

  $|\alpha_{h_i^t} - q_i| \le \frac{\gamma}{2\kappa\sqrt{s}}, \quad \beta_{h_i^t} \le \frac{\gamma}{2\kappa\sqrt{s}}, \quad |\alpha_{x_i^t} - q_i| \le \frac{\gamma}{2\kappa\sqrt{s}}, \quad \beta_{x_i^t} \le \frac{\gamma}{2\kappa\sqrt{s}},$ (29)

for $i = 1, \cdots, s$, then $\mathrm{dist}(z^t, \bar{z}) \le \gamma$. Moreover, by the triangle inequality, $\mathrm{error}(\theta^t, \bar{\theta}) \lesssim \mathrm{dist}(z^t, \bar{z})$.

In this subsection, we shall show that as long as the approximate state evolution (27) holds, there exists some $T_\gamma$ satisfying condition (29). This is demonstrated in the following lemma. Prior to that, we first list several conditions and definitions used in the lemma.

• The initial points obey

  $\alpha_{h_i^0} \ge \frac{q_i}{K \log K} \quad \text{and} \quad \alpha_{x_i^0} \ge \frac{q_i}{N \log N},$ (30a)
  $\sqrt{\alpha_{h_i^0}^2 + \beta_{h_i^0}^2} \in \Big[ 1 - \frac{1}{\log K}, \, 1 + \frac{1}{\log K} \Big] q_i,$ (30b)
  $\sqrt{\alpha_{x_i^0}^2 + \beta_{x_i^0}^2} \in \Big[ 1 - \frac{1}{\log N}, \, 1 + \frac{1}{\log N} \Big] q_i,$ (30c)

for $i = 1, \cdots, s$.
• Define

  $T_\gamma := \min\{ t : \text{condition (29) is satisfied} \},$ (31)

where $\gamma > 0$ is some sufficiently small constant.

• Define

  $T_1 := \min\Big\{ t : \min_i \frac{\alpha_{h_i^t}}{q_i} \ge \frac{c_7}{\log^5 m}, \ \min_i \frac{\alpha_{x_i^t}}{q_i} \ge \frac{c_7'}{\log^5 m} \Big\},$ (32)
  $T_2 := \min\Big\{ t : \min_i \frac{\alpha_{h_i^t}}{q_i} > c_8, \ \min_i \frac{\alpha_{x_i^t}}{q_i} > c_8' \Big\},$ (33)

for some small absolute positive constants $c_7, c_7', c_8, c_8'$.

• For $t \le T_\gamma$, it holds that

  $\frac{1}{2\sqrt{K \log K}} \le \frac{\alpha_{h_i^t}}{q_i} \le 2, \quad c_5 \le \frac{\beta_{h_i^t}}{q_i} \le 1.5, \quad \frac{\alpha_{h_i^{t+1}} / \alpha_{h_i^t}}{\beta_{h_i^{t+1}} / \beta_{h_i^t}} \ge 1 + c_5 \eta, \quad i = 1, \cdots, s,$ (34)
  $\frac{1}{2\sqrt{N \log N}} \le \frac{\alpha_{x_i^t}}{q_i} \le 2, \quad c_6 \le \frac{\beta_{x_i^t}}{q_i} \le 1.5, \quad \frac{\alpha_{x_i^{t+1}} / \alpha_{x_i^t}}{\beta_{x_i^{t+1}} / \beta_{x_i^t}} \ge 1 + c_6 \eta, \quad i = 1, \cdots, s,$ (35)

for some constants $c_5, c_6 > 0$.

###### Lemma 1.

Assume that the initial points obey condition (30) and that the perturbation terms in the approximate state evolution (27) are sufficiently small for $i = 1, \cdots, s$.

1. Then, for any sufficiently large $K$ and $N$ and a sufficiently small stepsize $\eta$, the threshold $T_\gamma$ satisfies $T_\gamma \lesssim \eta^{-1} \log(\max\{K, N\})$, and (34), (35) hold.

2. Then, with a sufficiently small stepsize $\eta$, one has $T_1 \le T_2 \le T_\gamma$, with each threshold growing at most logarithmically in the problem dimensions.

###### Proof.

The proof of Lemma 1 is inspired by the proof of Lemma 1 in [23]. ∎

The random initialization (11) satisfies condition (30) with high probability [23]. According to this fact, Lemma 1 ensures that, under both the random initialization (11) and the approximate state evolution (27) with a sufficiently small stepsize, Stage I lasts only a few iterations, i.e.,