
# Distributed Parameter Estimation in Randomized One-hidden-layer Neural Networks

This paper addresses distributed parameter estimation in randomized one-hidden-layer neural networks. A group of agents sequentially receive measurements of an unknown parameter that is only partially observable to them. In this paper, we present a fully distributed estimation algorithm where agents exchange local estimates with their neighbors to collectively identify the true value of the parameter. We prove that this distributed update provides an asymptotically unbiased estimator of the unknown parameter, i.e., the first moment of the expected global error converges to zero asymptotically. We further analyze the efficiency of the proposed estimation scheme by establishing an asymptotic upper bound on the variance of the global error. Applying our method to a real-world dataset related to appliances energy prediction, we observe that our empirical findings verify the theoretical results.


## I Introduction

Supervised learning is a fundamental machine learning problem, where given input-output data samples, a learner aims to find a mapping (or function) from inputs to outputs [1]. A good mapping is one that can be used for prediction of outputs corresponding to previously unseen inputs. Recently, deep neural networks have dominated the task of supervised learning in various applications, including computer vision [2], speech recognition [3], robotics [4], and biomedical image analysis [5]. These methods, however, are data hungry, and their application to domains with few or sparse labeled samples remains an active field of research [6]. An effective alternative for supervised learning is a shallow architecture with one hidden layer. This architecture was motivated by the classical results of Cybenko [7] and Barron [8], showing that (under some technical assumptions) one can use sigmoidal basis functions to approximate any output that is a continuous function of the input. These results later motivated researchers to develop algorithmic frameworks to leverage shallow networks for data representation. The seminal work of Rahimi and Recht is a prominent case in point [9]. In their approach, the nonlinear basis functions are selected using Monte-Carlo sampling with a theoretical guarantee that the approximated function converges asymptotically with respect to the number of data samples and basis functions.

The problem of function approximation in supervised learning (both in shallow and deep neural networks) is often formulated via empirical risk minimization [1], which amounts to solving an optimization problem over a high-dimensional parameter. Due to the computational challenges associated with high-dimensional optimization, decentralized training of neural networks has emerged as an appealing solution [10]. On the other hand, recent advances in distributed computing within the control and signal processing communities [11, 12, 13, 14, 15, 16] have provided novel decentralized techniques for parameter estimation over multi-agent networks. In these scenarios, each individual agent receives partially informative measurements about the parameter and engages in local communications with other agents to collaboratively accomplish the global task. A crucial component of these methods is a consensus protocol [17], allowing collective information aggregation and estimation. Distributed algorithms have gained popularity due to their ability to handle large datasets, their low computational burden on individual agents, and their robustness to the failure of a central agent.

Motivated by the importance of distributed computing in high-dimensional parameter estimation, in this paper, we consider distributed parameter estimation in randomized one-hidden-layer neural networks. A group of agents sequentially obtain low-dimensional measurements of the parameter (in various locations at different randomized frequencies). Despite the parameter being partially observable to each individual agent, the global spread of measurements is informative enough for a collective estimation. We propose a fully distributed update where each agent engages in local interactions with its neighboring agents to construct iterative estimates of the parameter. The update is akin to consensus+innovation algorithms in the distributed estimation literature [11, 13, 18].

Our main theoretical contribution is to characterize the first and second moments of the global estimation error. In particular, we prove that the distributed update provides an asymptotically unbiased estimator of the unknown parameter once the expectation is taken over all sources of randomness, i.e., the first moment of the global error converges to zero asymptotically. This result also allows us to characterize the convergence rate and derive an optimal innovation rate to speed up the convergence. We further analyze the efficiency of the proposed estimation scheme by establishing an asymptotic upper bound on the variance of the global error. We finally simulate our method on a real-world dataset related to appliances energy prediction, where we observe that our empirical findings verify the theoretical results.

## II Problem Statement

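Notation: We adhere to the following notation table throughout the paper:

| Symbol | Meaning |
|---|---|
| $[n]$ | set $\{1,2,\ldots,n\}$ for any integer $n$ |
| $\mathbf{x}^\top$ | transpose of vector $\mathbf{x}$ |
| $\mathbf{I}_d$ | identity matrix of size $d$ |
| $\mathbf{1}_d$ | vector of all ones with dimension $d$ |
| $\mathbf{0}$ | vector of all zeros |
| $\lVert\cdot\rVert$ | $2$-norm operator |
| $\lambda_i(\mathbf{A})$ | $i$-th largest eigenvalue of matrix $\mathbf{A}$ |
| $\mathbb{E}[\cdot]$ | expectation operator |
| $\rho(\mathbf{A})$ | spectral radius of matrix $\mathbf{A}$ |
| $\mathrm{Tr}[\cdot]$ | trace operator |
| $\mathbf{A}\succeq 0$ | $\mathbf{A}$ is positive semi-definite |

The vectors are in column format. Boldface lowercase variables (e.g., $\mathbf{x}$) are used for vectors, and boldface uppercase variables (e.g., $\mathbf{A}$) are used for matrices.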

### II-A One-Hidden-Layer Neural Networks: The Centralized Problem

Let us consider a regression problem of the form

$$y = f(\mathbf{x}) + v,$$

where $y \in \mathbb{R}$ is the output, $\mathbf{x} \in \mathbb{R}^d$ is the input, and $v$ is the noise term with zero mean and constant variance. The objective is to find the unknown mapping (or function) $f$ based on available input-output pairs $\{(\mathbf{x}_i, y_i)\}$. Various regression methods assume different functional forms to approximate $f$. For example, in linear regression, the input-output relationship is assumed to follow a linear model.

In this work, we focus on one-hidden-layer neural networks [7], where the approximated function is a nonlinear function of the input, and

$$\hat{f}(\mathbf{x}) = \sum_{l=1}^{M} \theta_l\, \phi(\mathbf{x}, \boldsymbol{\omega}_l), \tag{1}$$

where $\phi(\cdot,\cdot)$ is called a basis function (or feature map) parameterized by $\boldsymbol{\omega}$. In the above model, the parameters $\{\theta_l\}_{l=1}^{M}$ and $\{\boldsymbol{\omega}_l\}_{l=1}^{M}$ are unknown and should be learned from data (i.e., input-output pairs). The underlying intuition behind this model is that the feature map transforms the original data from dimension $d$ to $M$, where we often have $M > d$. Since the new space has a higher dimension, it provides more flexibility for approximation of the unknown function (as opposed to a linear model that is restrictive). It turns out that approximations of form (1) are dense in the space of continuous functions [7], i.e., they can be used to approximate any continuous function (on the unit cube).

However, from an algorithmic perspective, learning both $\{\theta_l\}_{l=1}^{M}$ and $\{\boldsymbol{\omega}_l\}_{l=1}^{M}$ is computationally expensive. For a nonlinear feature map (e.g., cosine feature map), the problem is indeed non-convex and thus hard to solve. An alternative approach was proposed in [9], where one-hidden-layer neural networks are thought of as Monte-Carlo approximations of kernel expansions. In particular, if we assume that $\boldsymbol{\omega}$ is a random variable with support $\Omega$ and probability measure $\tau$, the corresponding kernel can be obtained via [19]

$$k(\mathbf{x}, \mathbf{x}') = \int_{\Omega} \phi(\mathbf{x}, \boldsymbol{\omega})\, \phi(\mathbf{x}', \boldsymbol{\omega})\, d\tau(\boldsymbol{\omega}). \tag{2}$$

Hence, if $\{\boldsymbol{\omega}_l\}_{l=1}^{M}$ are independent samples from $\tau$, the approximated kernel expansion corresponds to (1) and learning becomes a convex optimization problem with a modest computational cost. $\{\boldsymbol{\omega}_l\}_{l=1}^{M}$ are then called random features in this model.

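One such example is using the cosine feature map to approximate a Gaussian kernel with unit width. In this case, (1) will be as follows

$$\hat{f}(\mathbf{x}) = \sum_{l=1}^{M} \theta_l \sqrt{2} \cos(\boldsymbol{\nu}_l^\top \mathbf{x} + b_l), \tag{3}$$

where $\{\boldsymbol{\nu}_l\}_{l=1}^{M}$ come from a multi-variate Gaussian distribution $\mathcal{N}(\mathbf{0}, \mathbf{I}_d)$ and $\{b_l\}_{l=1}^{M}$ come from a uniform distribution $\mathcal{U}(0, 2\pi)$. In this paper, we will focus on the approximated function of form (3) and propose a distributed algorithm for learning the parameter $\boldsymbol{\theta} = [\theta_1, \ldots, \theta_M]^\top$.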
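The random-feature construction in (2) and (3) can be checked numerically. The following is a minimal sketch, assuming NumPy; the helper names (`sample_random_features`, `feature_map`, `gaussian_kernel`) and the chosen sizes are illustrative and do not come from the paper. It draws the frequencies and phases as described above and compares the Monte-Carlo average of feature products with the unit-width Gaussian kernel.

```python
import numpy as np

def sample_random_features(d, M, rng):
    """Draw M frequencies nu_l ~ N(0, I_d) and phases b_l ~ U(0, 2*pi)."""
    nu = rng.standard_normal((M, d))
    b = rng.uniform(0.0, 2.0 * np.pi, size=M)
    return nu, b

def feature_map(X, nu, b):
    """Matrix with entries sqrt(2) * cos(nu_l^T x + b_l), one row per input."""
    return np.sqrt(2.0) * np.cos(X @ nu.T + b)

def gaussian_kernel(x, x_prime):
    """Unit-width Gaussian kernel exp(-||x - x'||^2 / 2)."""
    return np.exp(-0.5 * np.sum((x - x_prime) ** 2))

rng = np.random.default_rng(0)
d, M = 3, 20000                      # illustrative sizes, not from the paper
nu, b = sample_random_features(d, M, rng)
x, x_prime = rng.standard_normal(d), rng.standard_normal(d)

Z = feature_map(np.stack([x, x_prime]), nu, b)
k_approx = Z[0] @ Z[1] / M           # Monte-Carlo estimate of the integral in (2)
k_exact = gaussian_kernel(x, x_prime)
print(k_approx, k_exact)
```

With a large number of features the two printed values should be close, which is the sense in which the cosine feature map approximates the kernel.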

### II-B Local Measurements in Multi-agent Networks

The proposed scenario in the previous section was centralized in the sense that the estimation task was done only by one agent that has all the data. In this section, we propose an iterative distributed scheme where we have a network of $n$ agents, each of which has access to a subset of data. In particular, agent $i$ has access to only $c_i$ data points at each iteration.

###### Assumption 1

Without loss of generality, we assume each agent observes the same number of data points at each time, i.e., $c_i = c$ throughout the paper.

This assumption is only for the sake of presentation clarity. Our main results can be extended to the case where different agents have various numbers of measurements.

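Now, in the distributed model, the observation matrix at time $t$ will be as follows

$$\mathbf{H}_{i,t} = \begin{bmatrix} \phi(\mathbf{x}_{1,i,t}, \boldsymbol{\omega}_{1,i,t}) & \cdots & \phi(\mathbf{x}_{1,i,t}, \boldsymbol{\omega}_{M,i,t}) \\ \vdots & \ddots & \vdots \\ \phi(\mathbf{x}_{c,i,t}, \boldsymbol{\omega}_{1,i,t}) & \cdots & \phi(\mathbf{x}_{c,i,t}, \boldsymbol{\omega}_{M,i,t}) \end{bmatrix}, \tag{4}$$

with any agent $i \in [n]$ having access to $\mathbf{H}_{i,t} \in \mathbb{R}^{c \times M}$. We then have the following measurement model

$$\mathbf{y}_{i,t} = \mathbf{H}_{i,t}\boldsymbol{\theta} + \mathbf{v}_{i,t},$$

where $\boldsymbol{\theta} \in \mathbb{R}^{M}$ is the unknown parameter that needs to be learned, and $\mathbf{v}_{i,t} \in \mathbb{R}^{c}$ denotes the observation noise at agent $i$. The above local measurement model can be interpreted as iteratively collecting low-dimensional measurements of parameter $\boldsymbol{\theta}$ at different locations using distinct frequencies.

We follow the general assumptions of zero mean and constant variance on the noise term, i.e., we have $\mathbb{E}[\mathbf{v}_{i,t}] = \mathbf{0}$ and $\mathbb{E}[\mathbf{v}_{i,t}\mathbf{v}_{i,t}^\top] = \sigma_v^2 \mathbf{I}_c$. We further denote by $\hat{\boldsymbol{\theta}}_{i,t}$ the estimate of $\boldsymbol{\theta}$ for agent $i$ at time $t$.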

### II-C Multi-agent Network Model

The interactions of agents, which in turn define the network, are captured by the matrix $\mathbf{P}$. Formally, we denote by $P_{ij}$ the $ij$-th entry of the matrix $\mathbf{P}$. When $P_{ij} > 0$, agent $i$ communicates with agent $j$. We assume that $\mathbf{P}$ is symmetric and doubly stochastic with positive diagonal elements. The assumption simply guarantees the information flow in the network. Alternatively, from the technical point of view, we respect the following hypothesis.

###### Assumption 2

(connectivity) The network is connected, i.e., there is a path from any agent $i \in [n]$ to any other agent $j \in [n]$.

The assumption implies that the Markov chain $\mathbf{P}$ is irreducible and aperiodic, thus having a unique stationary distribution, i.e., $\mathbf{1}_n$ is the unique (unnormalized) left eigenvector corresponding to $\lambda_1(\mathbf{P}) = 1$. It also entails that the eigenvalue $\lambda_1(\mathbf{P}) = 1$ is unique, and the other eigenvalues of $\mathbf{P}$ are less than one in magnitude [20].
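The properties required of the network matrix can be verified on a small example. The following is a minimal sketch, assuming NumPy and a ring topology with a self-weight of 0.5; the function name `ring_weight_matrix` and these weights are hypothetical choices for illustration, not the network used in the paper's experiments.

```python
import numpy as np

def ring_weight_matrix(n, self_weight=0.5):
    """Symmetric, doubly stochastic matrix for a ring of n agents.

    Each agent keeps `self_weight` on itself and splits the remainder
    equally between its two ring neighbours (hypothetical weights).
    """
    P = np.zeros((n, n))
    neighbour_weight = (1.0 - self_weight) / 2.0
    for i in range(n):
        P[i, i] = self_weight
        P[i, (i - 1) % n] += neighbour_weight
        P[i, (i + 1) % n] += neighbour_weight
    return P

P = ring_weight_matrix(n=10)
eigvals = np.sort(np.linalg.eigvalsh(P))[::-1]   # lambda_1 >= ... >= lambda_n

print("row sums:", P.sum(axis=1))
print("largest eigenvalue:", eigvals[0])
print("second largest eigenvalue:", eigvals[1])
print("smallest eigenvalue:", eigvals[-1])
```

Because the ring is connected and the diagonal is positive, the largest eigenvalue equals one and all others are strictly smaller in magnitude, in line with the discussion above.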

### II-D Distributed Estimation Update

To construct an iterative estimate of the parameter $\boldsymbol{\theta}$, each agent $i \in [n]$ at time $t$ performs the following distributed update

$$\hat{\boldsymbol{\theta}}_{i,t+1} = \sum_{j=1}^{n} P_{ij}\, \hat{\boldsymbol{\theta}}_{j,t} + \alpha\, \mathbf{H}_{i,t}^\top \left(\mathbf{y}_{i,t} - \mathbf{H}_{i,t}\hat{\boldsymbol{\theta}}_{i,t}\right), \tag{5}$$

where $\alpha > 0$ is the step size. The update is akin to consensus+innovation schemes in the distributed estimation literature [11, 13, 18], and we analyze this update in Section III in the context of one-hidden-layer neural networks. Intuitively, the first part of the update (consensus) allows agents to keep their estimates close to each other, and the second part (innovation) takes into account the new measurements.
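To make the consensus and innovation terms of update (5) concrete, the following is a minimal sketch, assuming NumPy, synthetic Gaussian inputs, a ring network, and noise-free measurements. All sizes, the step size, and the helper names (`distributed_step`, `random_observations`) are illustrative assumptions rather than the paper's experimental setup.

```python
import numpy as np

def distributed_step(theta_hat, P, H, y, alpha):
    """One round of update (5) for all agents at once.

    theta_hat: (n, M) array, row i is agent i's current estimate.
    P:         (n, n) doubly stochastic weight matrix.
    H:         (n, c, M) array of local observation matrices.
    y:         (n, c) array of local measurements.
    """
    consensus = P @ theta_hat                                # sum_j P_ij * theta_hat_j
    residual = y - np.einsum("icm,im->ic", H, theta_hat)     # y_i - H_i theta_hat_i
    innovation = np.einsum("icm,ic->im", H, residual)        # H_i^T residual_i
    return consensus + alpha * innovation

def random_observations(n, c, M, d, rng):
    """Fresh cosine-feature observation matrices for every agent (synthetic inputs)."""
    X = rng.standard_normal((n, c, d))
    nu = rng.standard_normal((n, M, d))
    b = rng.uniform(0.0, 2.0 * np.pi, size=(n, 1, M))
    return np.sqrt(2.0) * np.cos(np.einsum("icd,imd->icm", X, nu) + b)

rng = np.random.default_rng(0)
n, M, c, d, alpha = 6, 4, 2, 3, 0.05        # hypothetical sizes and step size
theta = rng.standard_normal(M)              # synthetic ground-truth parameter
eye = np.eye(n)
P = 0.5 * eye + 0.25 * (np.roll(eye, 1, axis=1) + np.roll(eye, -1, axis=1))

theta_hat = np.zeros((n, M))
for t in range(500):
    H = random_observations(n, c, M, d, rng)
    y = H @ theta                           # noise-free measurements for clarity
    theta_hat = distributed_step(theta_hat, P, H, y, alpha)

max_error = np.max(np.abs(theta_hat - theta))
print(max_error)
```

In this noise-free setting every agent's estimate should approach the true parameter, and the true parameter is a fixed point of the update because the rows of the weight matrix sum to one and the residual vanishes.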

## III Main Theoretical Results

In this section, we provide our main theoretical results. We show that the local update (5) is an asymptotically unbiased estimator of the global parameter $\boldsymbol{\theta}$. Based on this result, we characterize the optimal step size to obtain the fastest convergence rate. We then prove that the asymptotic second moment of the collective estimation error is bounded.

### III-A First Moment

Let us define the local error for each agent as

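$$\mathbf{e}_{i,t} \triangleq \hat{\boldsymbol{\theta}}_{i,t} - \boldsymbol{\theta}. \tag{6}$$

Subtracting $\boldsymbol{\theta}$ from both sides of the local update (5), using the measurement model $\mathbf{y}_{i,t} = \mathbf{H}_{i,t}\boldsymbol{\theta} + \mathbf{v}_{i,t}$ and the fact that the rows of $\mathbf{P}$ sum to one, we can write the iterative local error process as follows

$$\mathbf{e}_{i,t+1} = \sum_{j=1}^{n} P_{ij}\, \mathbf{e}_{j,t} - \alpha\, \mathbf{H}_{i,t}^\top \mathbf{H}_{i,t}\, \mathbf{e}_{i,t} + \alpha\, \mathbf{H}_{i,t}^\top \mathbf{v}_{i,t}. \tag{7}$$

Stacking the local errors in a vector, we denote the global error by

$$\mathbf{e}_t \triangleq \left[\mathbf{e}_{1,t}^\top, \ldots, \mathbf{e}_{n,t}^\top\right]^\top. \tag{8}$$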

We now characterize the global error process with the following proposition.

###### Proposition 1

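Given Assumptions 1-2, the expected global error can be expressed as an LTI system that takes the form

$$\mathbb{E}[\mathbf{e}_t] = \mathbf{Q}\, \mathbb{E}[\mathbf{e}_{t-1}],$$

where

$$\mathbf{Q} = \mathbf{P} \otimes \mathbf{I}_M - \alpha c\, \mathbf{I}_{Mn}, \tag{9}$$

and $\otimes$ denotes the Kronecker product. The expectation is taken over the stochasticity of the random features $\{\boldsymbol{\omega}_{l,i,t}\}$ and the observation noise $\{\mathbf{v}_{i,t}\}$.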

The proof of Proposition 1 is given in the Appendix. It shows that the agents will collectively generate estimates of the parameter that are asymptotically unbiased as long as the spectral radius of $\mathbf{Q}$ is less than 1.

### III-B Step Size Tuning

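According to Proposition 1, the convergence rate depends on the choice of the step size. If one wants to speed up the convergence rate of the process, it is necessary to shrink the spectral radius of $\mathbf{Q}$ as much as possible. This corresponds to solving the following problem

$$\alpha^\star = \arg\min_{\alpha > 0} \left\{ \max\left\{ |\lambda_1(\mathbf{Q})|, |\lambda_{Mn}(\mathbf{Q})| \right\} \right\}. \tag{10}$$

According to Assumption 2, $\mathbf{1}_n$ is the unique (un-normalized) eigenvector of the matrix $\mathbf{P}$ associated with $\lambda_1(\mathbf{P}) = 1$, because $\mathbf{P}\mathbf{1}_n = \mathbf{1}_n$. It is then immediate that

$$\lambda_1(\mathbf{Q}) = 1 - \alpha c. \tag{11}$$

On the other hand, we have that

$$\lambda_{Mn}(\mathbf{Q}) = \lambda_n(\mathbf{P}) - \alpha c. \tag{12}$$

Plotting $|\lambda_1(\mathbf{Q})|$ and $|\lambda_{Mn}(\mathbf{Q})|$ in terms of $\alpha$, we can notice that the optimal $\alpha$ would occur exactly where $|\lambda_1(\mathbf{Q})| = |\lambda_{Mn}(\mathbf{Q})|$, in which case we have the following relationship

$$\alpha c - \lambda_n(\mathbf{P}) = 1 - \alpha c \;\Rightarrow\; \alpha^\star = \frac{1 + \lambda_n(\mathbf{P})}{2c}. \tag{13}$$

Plugging the optimal step size (13) into (11) and (12), we get

$$|\lambda_1(\mathbf{Q})| = |\lambda_{Mn}(\mathbf{Q})| = \frac{1 - \lambda_n(\mathbf{P})}{2},$$

and achieve the fastest convergence rate. This result suggests that when $\lambda_n(\mathbf{P})$ is close to one, we have the fastest convergence rate. Since $\lambda_n(\mathbf{P})$ is the smallest eigenvalue of $\mathbf{P}$, this would also imply that other eigenvalues are close to one in this scenario since $\lambda_n(\mathbf{P}) \le \lambda_i(\mathbf{P}) \le 1$ for every $i \in [n]$. Intuitively, this indicates that $\mathbf{P}$ is close to identity and agents have high self-reliance, i.e., they do not rely highly on their neighbors. Indeed, $\mathbf{P} \neq \mathbf{I}_n$ since otherwise the connectivity constraint is violated. Notice that in this paper, we are not concerned with network design, i.e., we assume that $\mathbf{P}$ is given, and we can choose $\alpha$ based on (13) accordingly.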
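The step-size rule (13) can be checked numerically on a small network. The following is a minimal sketch, assuming NumPy and a hypothetical six-agent ring; the helper names `optimal_step_size` and `spectral_radius_Q` and all sizes are illustrative. It builds the matrix in (9), evaluates its spectral radius at the step size from (13), and compares against nearby step sizes.

```python
import numpy as np

def optimal_step_size(P, c):
    """Step size from (13): (1 + lambda_n(P)) / (2c)."""
    lam_n = np.min(np.linalg.eigvalsh(P))
    return (1.0 + lam_n) / (2.0 * c)

def spectral_radius_Q(P, M, c, alpha):
    """Spectral radius of Q = P kron I_M - alpha * c * I_{Mn} from (9)."""
    n = P.shape[0]
    Q = np.kron(P, np.eye(M)) - alpha * c * np.eye(M * n)
    return np.max(np.abs(np.linalg.eigvalsh(Q)))   # Q is symmetric because P is

n, M, c = 6, 4, 2                                  # hypothetical sizes
eye = np.eye(n)
P = 0.5 * eye + 0.25 * (np.roll(eye, 1, axis=1) + np.roll(eye, -1, axis=1))

lam_n = np.min(np.linalg.eigvalsh(P))
alpha_star = optimal_step_size(P, c)
rho_star = spectral_radius_Q(P, M, c, alpha_star)

print("alpha_star:", alpha_star)
print("rho(Q) at alpha_star:", rho_star, "predicted:", (1.0 - lam_n) / 2.0)
print("rho(Q) at 0.8 * alpha_star:", spectral_radius_Q(P, M, c, 0.8 * alpha_star))
print("rho(Q) at 1.2 * alpha_star:", spectral_radius_Q(P, M, c, 1.2 * alpha_star))
```

Moving the step size in either direction away from the value in (13) should increase the spectral radius, which is the balancing argument made above.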

### III-C Asymptotic Second Moment

To capture the efficiency of the collective estimation, we should also study the variance of the error, which (asymptotically) amounts to the second moment in view of Proposition 1. In the next theorem, we present an asymptotic upper bound on the second moment for a feasible range of step size $\alpha$.

###### Theorem 2

Given Assumptions 1-2, and the further assumption that and , the expected second moment of the estimation error is bounded as follows

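$$\lim_{t \to \infty} \mathbb{E}\left[\mathbf{e}_t^\top \mathbf{e}_t\right] \le \frac{\alpha M n \sigma_v^2}{2 - \alpha c (M+1)},$$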

for any . The expectation is taken over the stochasticity of random features and observation noise .

The proof of theorem 2 is given in the Appendix. It shows that the (asymptotic) expected second moment of the estimation error is bounded by a finite value that scales linearly with respect to the number of agents for a certain range of step size . It also suggests that the optimal step size in (13) will work whenever .
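As a worked check of the last remark, the following sketch assumes that the admissible step-size range is the one obtained in the Appendix, $\alpha < 2/(c(M+1))$, and asks when the optimal step size (13) falls inside it. The resulting threshold on $\lambda_n(\mathbf{P})$ is our own rearrangement and is not quoted from the theorem statement.

```latex
\alpha^\star = \frac{1+\lambda_n(\mathbf{P})}{2c} < \frac{2}{c(M+1)}
\iff 1+\lambda_n(\mathbf{P}) < \frac{4}{M+1}
\iff \lambda_n(\mathbf{P}) < \frac{3-M}{M+1}.
```

For instance, with five basis functions ($M = 5$) this rearrangement gives the threshold $-1/3$.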

## IV Numerical Experiments

We now provide empirical evidence in support of our algorithm by applying it to a regression dataset from the UCI Machine Learning Repository. In this dataset, the input includes a number of attributes, such as temperature in the kitchen area, humidity in the kitchen area, temperature in the living room area, humidity in the laundry room area, temperature outside, and pressure. The regression model aims at representing appliances energy use in terms of these features. More details about this dataset can be found in [21] as well as the UCI Machine Learning Repository. We randomly choose 16000 observations out of its 19735 observations for our simulation.

We consider observation matrices of form (4), where the bases are cosine functions as follows

$$\phi(\mathbf{x}, \boldsymbol{\omega}) = \phi(\mathbf{x}, \boldsymbol{\nu}, b) = \sqrt{2}\cos(\mathbf{x}^\top \boldsymbol{\nu} + b), \tag{14}$$

as described in Section II-A, where $\{\boldsymbol{\nu}\}$ come from a multi-variate Gaussian distribution $\mathcal{N}(\mathbf{0}, \mathbf{I}_d)$ and $\{b\}$ come from a uniform distribution $\mathcal{U}(0, 2\pi)$. Without loss of generality, we set $M = 5$, i.e., we use five basis functions in the approximation model (3). One can consider other values for $M$ and perform cross-validation to find the best one, but this is outside of the scope of this paper, as our focus is on estimation rather than model selection.

Network Structure: We consider a network of agents. Each agent has access to observation matrix with data points at time . Also, each agent is connected to agents (with a circular shift for any number outside of the range ). The matrix is such that agent is connected to itself with weight , connected to agents with weight , and connected to agents with weight . The smallest eigenvalue of our network is less than , so according to the step size constraint in Theorem 2, we can use the optimal step size (13) for this simulation. Therefore, the step size is set to be as in (13).

Benchmark: Since this dataset is from real-world and the ground truth value is unknown, we consider the solution of the centralized problem as the baseline. The local error at time is then calculated as the difference between local estimates and the centralized estimates as given in (6). We run update (5) for iterations such that the process reaches a steady state. To verify our results, we need to repeat the update process using Monte-Carlo simulations on random features to estimate the expectations.

Performance: We visualize the error process in Proposition 1 by presenting the plot of norm- of the expected global error, i.e., the norm- of given in Proposition 1 at . The vertical axis in Fig. 1 represents the average global error obtained by repeating Monte-Carlo simulations to form an estimate of the expected global error. The horizontal axis shows the number of Monte-Carlo simulations indexed by where . As the number of Monte-Carlo simulations increases, the norm- of the average global error will converge to the norm- of the expected global error in Proposition 1. As we can observe, the estimation of the expected global error converges to zero verifying that agents form asymptotically unbiased estimators of the parameter.

We next plot the expected norm- square of global error, i.e., given in Theorem 2 at . The vertical axis in Fig. 2 represents the norm- square of the global error averaged over Monte-Carlo simulations. The horizontal axis shows the number of Monte-Carlo simulations index by where . As the number of Monte-Carlo simulations increases, the average norm- square of the global error will converge to the expected norm- square of the global error in Theorem 2. The expected norm- square of the global error is upper bounded by according to Theorem 2 for this simulation set up and as we can observe, the average norm- square of global error is always less than verifying the accuracy of the upper bound in Theorem .
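The Monte-Carlo procedure described above can be sketched on synthetic data. The following is a minimal illustration, assuming NumPy, Gaussian synthetic inputs, a known synthetic parameter in place of the centralized baseline, and a hypothetical six-agent ring; none of the sizes, the step size, or the noise level are taken from the paper's experiment on the UCI dataset. The step size is chosen below $2/(c(M+1))$, the range used in the Appendix.

```python
import numpy as np

def simulate(R, T, n, M, c, d, alpha, sigma, P, theta, rng):
    """Run R independent replications of update (5) for T steps.

    Returns an (R, n*M) array whose rows are the global errors e_T.
    """
    theta_hat = np.zeros((R, n, M))
    for _ in range(T):
        X = rng.standard_normal((R, n, c, d))                 # synthetic inputs
        nu = rng.standard_normal((R, n, M, d))                # random frequencies
        b = rng.uniform(0.0, 2.0 * np.pi, size=(R, n, 1, M))  # random phases
        H = np.sqrt(2.0) * np.cos(np.einsum("rncd,rnmd->rncm", X, nu) + b)
        y = H @ theta + sigma * rng.standard_normal((R, n, c))
        residual = y - np.einsum("rncm,rnm->rnc", H, theta_hat)
        innovation = np.einsum("rncm,rnc->rnm", H, residual)
        theta_hat = np.einsum("ij,rjm->rim", P, theta_hat) + alpha * innovation
    return (theta_hat - theta).reshape(R, n * M)

rng = np.random.default_rng(0)
R, T = 200, 150                                  # replications and iterations
n, M, c, d = 6, 4, 2, 3                          # hypothetical sizes
alpha, sigma = 0.05, 0.1                         # alpha < 2 / (c * (M + 1)) = 0.2
theta = rng.standard_normal(M)
eye = np.eye(n)
P = 0.5 * eye + 0.25 * (np.roll(eye, 1, axis=1) + np.roll(eye, -1, axis=1))

errors = simulate(R, T, n, M, c, d, alpha, sigma, P, theta, rng)
first_moment_norm = np.linalg.norm(errors.mean(axis=0))   # estimate of ||E[e_T]||
second_moment = np.mean(np.sum(errors ** 2, axis=1))      # estimate of E[e_T^T e_T]
bound = alpha * M * n * sigma ** 2 / (2.0 - alpha * c * (M + 1))

print("norm of averaged error:", first_moment_norm)
print("averaged squared error:", second_moment, "bound:", bound)
```

Under these assumptions the norm of the averaged error should shrink as the number of replications grows, and the averaged squared error should stay below the bound of Theorem 2.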

## V Conclusion

In this paper, we considered a distributed scheme for parameter estimation in randomized one-hidden-layer neural networks. A network of agents exchange local estimates of the parameter, formed using partial observations, to collaboratively identify the true value of the parameter. Our main contribution is to characterize the behavior of this distributed estimation scheme. We showed that the global estimation error is asymptotically unbiased and its second moment is finite under mild assumptions. Interestingly, our results shed light on the interplay of step size and network structure, which can be used for optimal design in practice. We verified this empirically by applying our method to a real-world dataset. Future directions include studying the estimation problem when the parameter has some dynamics [22] or the random frequencies are generated from a time-varying distribution. Due to the non-stationary nature of the problem in these two cases, the theoretical analysis becomes challenging and interesting to explore.

## Appendix

For presentation clarity, we use the following definitions in the proofs:

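$$\begin{aligned} \mathbf{U}_t &\triangleq \mathrm{diag}\left[\mathbf{H}_{1,t}^\top \mathbf{H}_{1,t}, \ldots, \mathbf{H}_{n,t}^\top \mathbf{H}_{n,t}\right] \\ \mathbf{E}_{i,t} &\triangleq \mathbf{H}_{i,t}^\top \mathbf{v}_{i,t} \\ \mathbf{E}_t &\triangleq \left[\mathbf{E}_{1,t}^\top, \ldots, \mathbf{E}_{n,t}^\top\right]^\top. \end{aligned} \tag{15}$$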

### V-A Proof of Proposition 1

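To prove Proposition 1, we first need to show that

$$\mathbb{E}\left[\mathbf{H}_{i,t}^\top \mathbf{H}_{i,t}\right] = c\, \mathbf{I}_M, \tag{16}$$

for any $i \in [n]$. Recall that $\phi(\mathbf{x}, \boldsymbol{\omega}) = \sqrt{2}\cos(\boldsymbol{\nu}^\top \mathbf{x} + b)$ where $\boldsymbol{\nu} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_d)$ and $b \sim \mathcal{U}(0, 2\pi)$, and thus

$$\mathbb{E}[\phi(\mathbf{x}, \boldsymbol{\omega})] = 0,$$

since cosine is a periodic function and $b$ is uniform over a full period. Therefore, we can conclude that for any $\mathbf{x}$ and $\mathbf{x}'$,

$$\mathbb{E}[\phi(\mathbf{x}, \boldsymbol{\omega})\phi(\mathbf{x}', \boldsymbol{\omega}')] = \mathbb{E}[\phi(\mathbf{x}, \boldsymbol{\omega})]\, \mathbb{E}[\phi(\mathbf{x}', \boldsymbol{\omega}')] = 0, \tag{17}$$

whenever $\boldsymbol{\omega}$ is independent from $\boldsymbol{\omega}'$. Notice that given the observation model (4), the $pq$-th entry of the matrix $\mathbf{H}_{i,t}^\top \mathbf{H}_{i,t}$ can be written as

$$\left[\mathbf{H}_{i,t}^\top \mathbf{H}_{i,t}\right]_{pq} = \sum_{j=1}^{c} \phi(\mathbf{x}_{j,i,t}, \boldsymbol{\omega}_p)\, \phi(\mathbf{x}_{j,i,t}, \boldsymbol{\omega}_q). \tag{18}$$

When $p \neq q$, we have $\mathbb{E}\big[[\mathbf{H}_{i,t}^\top \mathbf{H}_{i,t}]_{pq}\big] = 0$ according to (17); otherwise, $\mathbb{E}\big[[\mathbf{H}_{i,t}^\top \mathbf{H}_{i,t}]_{pp}\big] = c$, since for any $p \in [M]$ we have

$$\mathbb{E}[\phi^2(\mathbf{x}, \boldsymbol{\omega}_p)] = k(\mathbf{x}, \mathbf{x}) = \exp\left(-\frac{\lVert \mathbf{x} - \mathbf{x} \rVert^2}{2}\right) = e^0 = 1.$$

Hence, $\mathbb{E}[\mathbf{H}_{i,t}^\top \mathbf{H}_{i,t}] = c\, \mathbf{I}_M$, entailing that

$$\mathbb{E}[\mathbf{U}_t] = c\, \mathbf{I}_{Mn}, \tag{19}$$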

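in view of (15). Following the lines of the proof of Lemma 1 in [18], the error process can be expressed as the following

$$\mathbf{e}_{t+1} = \mathbf{Q}_t'\, \mathbf{e}_t + \alpha\, \mathbf{E}_t, \tag{20}$$

where

$$\mathbf{Q}_t' = \mathbf{P} \otimes \mathbf{I}_M - \alpha\, \mathbf{U}_t. \tag{21}$$

Taking expectation over random features on both sides and noting (19), we have

$$\mathbf{Q} \triangleq \mathbb{E}[\mathbf{Q}_t'] = \mathbf{P} \otimes \mathbf{I}_M - \alpha\, \mathbb{E}[\mathbf{U}_t] = \mathbf{P} \otimes \mathbf{I}_M - \alpha c\, \mathbf{I}_{Mn}.$$

Recalling (15), we can also immediately see from the zero-mean assumption on the noise that $\mathbb{E}[\mathbf{E}_{i,t}] = \mathbf{0}$ for every $i \in [n]$. Combining this with the above and returning to (20) will finish the proof of Proposition 1.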
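Identity (16) lends itself to a quick Monte-Carlo sanity check. The following is a minimal sketch, assuming NumPy, fixed synthetic inputs, and illustrative sizes that are not taken from the paper. It averages the Gram matrix of the observation matrix over many independent draws of the random features and compares the average with $c$ times the identity.

```python
import numpy as np

rng = np.random.default_rng(1)
c, M, d, num_draws = 3, 4, 5, 50000        # hypothetical sizes

X = rng.standard_normal((c, d))            # fixed inputs; only the features are random
nu = rng.standard_normal((num_draws, M, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=(num_draws, 1, M))

# H[k] is one realisation of the c x M observation matrix in (4).
H = np.sqrt(2.0) * np.cos(np.einsum("cd,kmd->kcm", X, nu) + b)

# Average of H^T H over the draws, an estimate of E[H^T H].
gram_mean = np.einsum("kcp,kcq->pq", H, H) / num_draws

print(np.round(gram_mean, 2))
```

The printed matrix should be close to $c\,\mathbf{I}_M$: the diagonal entries approach $c$ and the off-diagonal entries approach zero.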

### V-B Proof of Theorem 2

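To prove Theorem 2, we first need to show a recursive relationship for the error process based on (20) where

$$\begin{aligned} \mathbb{E}\left[\mathbf{e}_{t+1}^\top \mathbf{e}_{t+1}\right] &= \mathbb{E}\left[(\mathbf{Q}_t' \mathbf{e}_t + \alpha \mathbf{E}_t)^\top (\mathbf{Q}_t' \mathbf{e}_t + \alpha \mathbf{E}_t)\right] \\ &= \mathbb{E}\left[\mathbf{e}_t^\top \mathbf{Q}_t'^{\top} \mathbf{Q}_t' \mathbf{e}_t\right] + \alpha^2\, \mathbb{E}\left[\mathbf{E}_t^\top \mathbf{E}_t\right] \\ &\le \rho\left(\mathbb{E}\left[\mathbf{Q}_t'^{\top} \mathbf{Q}_t'\right]\right) \mathbb{E}\left[\mathbf{e}_t^\top \mathbf{e}_t\right] + \alpha^2\, \mathbb{E}\left[\mathbf{E}_t^\top \mathbf{E}_t\right] \\ &= \lambda_1\left(\mathbb{E}\left[\mathbf{Q}_t'^{\top} \mathbf{Q}_t'\right]\right) \mathbb{E}\left[\mathbf{e}_t^\top \mathbf{e}_t\right] + \alpha^2\, \mathbb{E}\left[\mathbf{E}_t^\top \mathbf{E}_t\right], \end{aligned} \tag{22}$$

where we used the fact $\mathbb{E}[\mathbf{E}_t] = \mathbf{0}$, resulting in zero cross-terms in the second line. To further bound $\lambda_1(\mathbb{E}[\mathbf{Q}_t'^{\top} \mathbf{Q}_t'])$, let us recall (21). As $\mathbf{P}$ and $\mathbf{U}_t$ are both symmetric and $\mathbb{E}[\mathbf{U}_t] = c\, \mathbf{I}_{Mn}$, we have that

$$\begin{aligned} \mathbb{E}\left[\mathbf{Q}_t'^{\top} \mathbf{Q}_t'\right] &= \mathbb{E}\left[(\mathbf{P} \otimes \mathbf{I}_M)(\mathbf{P} \otimes \mathbf{I}_M) - \alpha \mathbf{U}_t (\mathbf{P} \otimes \mathbf{I}_M) - (\mathbf{P} \otimes \mathbf{I}_M)\alpha \mathbf{U}_t + \alpha^2 \mathbf{U}_t^2\right] \\ &= (\mathbf{P} \otimes \mathbf{I}_M)(\mathbf{P} \otimes \mathbf{I}_M) - 2\alpha c\, (\mathbf{P} \otimes \mathbf{I}_M) + \alpha^2\, \mathbb{E}\left[\mathbf{U}_t^2\right]. \end{aligned}$$

Now, we apply Lemma 3 to bound the above as

$$\begin{aligned} \mathbb{E}\left[\mathbf{Q}_t'^{\top} \mathbf{Q}_t'\right] &\preceq (\mathbf{P} \otimes \mathbf{I}_M)(\mathbf{P} \otimes \mathbf{I}_M) - 2\alpha c\, (\mathbf{P} \otimes \mathbf{I}_M) + \alpha^2 (M+1) c^2\, \mathbf{I}_{Mn} \\ &= \mathbf{P}^2 \otimes \mathbf{I}_M - 2\alpha c\, (\mathbf{P} \otimes \mathbf{I}_M) + \alpha^2 (M+1) c^2\, \mathbf{I}_{Mn} \\ &= (\mathbf{P}^2 - 2\alpha c\, \mathbf{P}) \otimes \mathbf{I}_M + \alpha^2 (M+1) c^2\, \mathbf{I}_{Mn}. \end{aligned}$$

Then, the largest eigenvalue of $\mathbb{E}[\mathbf{Q}_t'^{\top} \mathbf{Q}_t']$ can be bounded as follows

$$\begin{aligned} \lambda_1\left(\mathbb{E}\left[\mathbf{Q}_t'^{\top} \mathbf{Q}_t'\right]\right) &\le \lambda_1\left((\mathbf{P}^2 - 2\alpha c\, \mathbf{P}) \otimes \mathbf{I}_M + \alpha^2 (M+1) c^2\, \mathbf{I}_{Mn}\right) \\ &= \lambda_1\left(\mathbf{P}^2 - 2\alpha c\, \mathbf{P}\right) + \alpha^2 (M+1) c^2. \end{aligned} \tag{23}$$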

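Now, let $\mathbf{K}_{i,t}$ denote the kernel matrix formed with measurements at agent $i$ at time $t$ where its $jk$-th entry is $k(\mathbf{x}_{j,i,t}, \mathbf{x}_{k,i,t})$. Recalling (15), we can then bound the additive term in the recursive relation (22) as follows

$$\begin{aligned} \alpha^2\, \mathbb{E}\left[\mathbf{E}_t^\top \mathbf{E}_t\right] &= \alpha^2\, \mathbb{E}\left[\sum_{i=1}^{n} \mathbf{E}_{i,t}^\top \mathbf{E}_{i,t}\right] \\ &= \alpha^2\, \mathbb{E}\left[\sum_{i=1}^{n} \mathbf{v}_{i,t}^\top \mathbf{H}_{i,t} \mathbf{H}_{i,t}^\top \mathbf{v}_{i,t}\right] \\ &= \alpha^2 M\, \mathbb{E}\left[\sum_{i=1}^{n} \mathbf{v}_{i,t}^\top \mathbf{K}_{i,t} \mathbf{v}_{i,t}\right] \\ &= \alpha^2 M \sum_{i=1}^{n} \mathrm{Tr}\left[\mathbf{K}_{i,t}\, \mathbb{E}\left[\mathbf{v}_{i,t} \mathbf{v}_{i,t}^\top\right]\right] \\ &= \alpha^2 M \sum_{i=1}^{n} \mathrm{Tr}\left[\mathbf{K}_{i,t}\right] \sigma_v^2 = \alpha^2 c M n \sigma_v^2. \end{aligned} \tag{24}$$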

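Letting

$$\begin{aligned} \Phi_a &\triangleq \lambda_1\left(\mathbf{P}^2 - 2\alpha c\, \mathbf{P}\right) + \alpha^2 (M+1) c^2 \\ \Phi_b &\triangleq \alpha^2 c M n \sigma_v^2, \end{aligned} \tag{25}$$

and using (23) and (24), we can re-write the recursive relation in (22) as

$$\mathbb{E}\left[\mathbf{e}_{t+1}^\top \mathbf{e}_{t+1}\right] \le \Phi_a\, \mathbb{E}\left[\mathbf{e}_t^\top \mathbf{e}_t\right] + \Phi_b. \tag{26}$$

We can find the feasible range of $\alpha$ through the inequality $\Phi_a < 1$, which ensures that the recursive process (26) will converge.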

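First, we have the following fact

$$\lambda_1\left(\mathbf{P}^2 - 2\alpha c\, \mathbf{P}\right) = \max\left\{1 - 2\alpha c,\; \lambda_n^2(\mathbf{P}) - 2\alpha c\, \lambda_n(\mathbf{P})\right\}.$$

One can show that $\lambda_1(\mathbf{P}^2 - 2\alpha c\, \mathbf{P}) = 1 - 2\alpha c$ when $\alpha \le \frac{1 + \lambda_n(\mathbf{P})}{2c}$ and $\lambda_1(\mathbf{P}^2 - 2\alpha c\, \mathbf{P}) = \lambda_n^2(\mathbf{P}) - 2\alpha c\, \lambda_n(\mathbf{P})$ otherwise.

For the case when $\alpha \le \frac{1 + \lambda_n(\mathbf{P})}{2c}$, we have the following

$$\begin{aligned} \Phi_a < 1 &\iff 1 - 2\alpha c + \alpha^2 c^2 (M+1) < 1 \\ &\iff \alpha^2 c^2 (M+1) < 2\alpha c \\ &\iff \alpha < \frac{2}{c(M+1)}. \end{aligned}$$

Therefore, given $\alpha < \frac{2}{c(M+1)}$, we have that

$$\begin{aligned} \mathbb{E}\left[\mathbf{e}_{t+1}^\top \mathbf{e}_{t+1}\right] &\le \Phi_a\, \mathbb{E}\left[\mathbf{e}_t^\top \mathbf{e}_t\right] + \Phi_b \\ &\le \Phi_a^t\, \mathbb{E}\left[\mathbf{e}_1^\top \mathbf{e}_1\right] + \Phi_b \left(\Phi_a^{t-1} + \cdots + \Phi_a + 1\right) \\ &= \Phi_a^t\, \mathbb{E}\left[\mathbf{e}_1^\top \mathbf{e}_1\right] + \frac{\Phi_b (1 - \Phi_a^t)}{1 - \Phi_a}. \end{aligned}$$

This upper bound will converge to $\frac{\Phi_b}{1 - \Phi_a}$ as $t \to \infty$, and noting definitions of $\Phi_a$ and $\Phi_b$ in (25), we derive the upper bound in the statement of Theorem 2.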
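For completeness, the limit above can be simplified explicitly. The following worked step assumes the first case, where $\Phi_a = 1 - 2\alpha c + \alpha^2 c^2 (M+1)$, and substitutes the definitions in (25); it is our own spelled-out algebra rather than text from the original proof.

```latex
\begin{aligned}
1 - \Phi_a &= 2\alpha c - \alpha^2 c^2 (M+1) = \alpha c \bigl(2 - \alpha c (M+1)\bigr), \\
\frac{\Phi_b}{1 - \Phi_a} &= \frac{\alpha^2 c M n \sigma_v^2}{\alpha c \bigl(2 - \alpha c (M+1)\bigr)}
= \frac{\alpha M n \sigma_v^2}{2 - \alpha c (M+1)}.
\end{aligned}
```

The last expression matches the right-hand side of the bound in Theorem 2.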

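For the case when $\alpha > \frac{1 + \lambda_n(\mathbf{P})}{2c}$, we have the following

$$\begin{aligned} \Phi_a < 1 &\iff \lambda_n^2(\mathbf{P}) - 2\alpha c\, \lambda_n(\mathbf{P}) + \alpha^2 c^2 (M+1) < 1 \\ &\iff \left(\lambda_n^2(\mathbf{P}) - 1\right) - 2\alpha c\, \lambda_n(\mathbf{P}) + \alpha^2 c^2 (M+1) < 0. \end{aligned} \tag{27}$$

Considering the LHS of the last line in (27) as a quadratic function of $\alpha c$, one can show that

$$\alpha c < \frac{2\lambda_n(\mathbf{P}) + \sqrt{4(M+1) - 4M\lambda_n^2(\mathbf{P})}}{2(M+1)},$$

must be true for (27) to hold. Therefore, the following must be true as well

$$\begin{aligned} &\frac{2\lambda_n(\mathbf{P}) + \sqrt{4(M+1) - 4M\lambda_n^2(\mathbf{P})}}{2(M+1)} > \frac{1 + \lambda_n(\mathbf{P})}{2} \\ &\iff \left(2 - (M+1)\right)\lambda_n(\mathbf{P}) + \sqrt{4(M+1) - 4M\lambda_n^2(\mathbf{P})} - (M+1) > 0. \end{aligned} \tag{28}$$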

Viewing the LHS of (28) as a function of , one can immediately verify that the function is always non-positive for any as long as . Therefore, contradicts . The only feasible region for is , finishing the proof of Theorem 2.

### V-C Statement and Proof of Lemma 3

###### Lemma 3

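Under the same assumptions as Theorem 2,

$$\mathbb{E}\left[\mathbf{U}_t^2\right] \preceq c^2 (M+1)\, \mathbf{I}_{Mn}, \tag{29}$$

where $\mathbf{U}_t$ is defined in (15).

In the proof, we omit the time index and agent index for presentation clarity, i.e., we write $\mathbf{H}$ for $\mathbf{H}_{i,t}$, $\mathbf{x}_k$ for $\mathbf{x}_{k,i,t}$ for any $k \in [c]$, and $\boldsymbol{\omega}_l$ for $\boldsymbol{\omega}_{l,i,t}$ for any $l \in [M]$, respectively. We will show that $\mathbb{E}[\mathbf{U}_t^2]$ is a diagonal matrix and all of its diagonal entries are upper bounded by $c^2 (M+1)$.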

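Let us start by observing that the $pq$-th entry of the matrix $\mathbf{H}^\top \mathbf{H} \mathbf{H}^\top \mathbf{H}$ (for any agent) can be written as

$$\sum_{j=1}^{M} \left( \sum_{k=1}^{c} \phi(\boldsymbol{\omega}_p, \mathbf{x}_k)\, \phi(\boldsymbol{\omega}_j, \mathbf{x}_k) \sum_{k'=1}^{c} \phi(\boldsymbol{\omega}_j, \mathbf{x}_{k'})\, \phi(\boldsymbol{\omega}_q, \mathbf{x}_{k'}) \right). \tag{30}$$

We now consider a single term in the previous summation:

$$\phi(\boldsymbol{\omega}_p, \mathbf{x}_k)\, \phi(\boldsymbol{\omega}_j, \mathbf{x}_k)\, \phi(\boldsymbol{\omega}_j, \mathbf{x}_{k'})\, \phi(\boldsymbol{\omega}_q, \mathbf{x}_{k'}), \tag{31}$$

and analyze its expectation case by case.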

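Case 1: $p \neq q$ and ($j \neq p$ and $j \neq q$).

Since $\phi(\boldsymbol{\omega}_p, \mathbf{x}_k)$ and $\phi(\boldsymbol{\omega}_q, \mathbf{x}_{k'})$ are independent, the expectation of the product of these two functions is zero as previously discussed in (17), so the expectation of (31) would be zero.

Case 2: $p \neq q$ and ($j = p$ or $j = q$).

In this case, three out of four product terms in (31) will include $\boldsymbol{\omega}_p$ or $\boldsymbol{\omega}_q$. Then, the expectation of the other term will be zero again as cosine is periodic. Thus, the expectation of (31) will still be zero.

Case 3: $p = q$ and $j \neq q$.

Now, the expectation of (31) will become a product of two expectations of unbiased approximations of the kernel function in view of (2). Thus, the expectation of (31) will become $k^2(\mathbf{x}_k, \mathbf{x}_{k'})$, which is at most 1. There are $c^2 (M-1)$ terms of this form in (30), which implies that their sum is upper bounded by $c^2 (M-1)$.

Case 4: $p = q = j$.

In this case, (31) becomes

$$\phi^2(\boldsymbol{\omega}_q, \mathbf{x}_k)\, \phi^2(\boldsymbol{\omega}_q, \mathbf{x}_{k'}).$$

So, the expectation of (31) with $p = q = j$ will become the following where $\boldsymbol{\nu}_q \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_d)$ and $b_q \sim \mathcal{U}(0, 2\pi)$:

$$\begin{aligned} \mathbb{E}\left[\phi^2(\boldsymbol{\omega}_q, \mathbf{x}_k)\, \phi^2(\boldsymbol{\omega}_q, \mathbf{x}_{k'})\right] &= 4\, \mathbb{E}\left[\cos^2(\boldsymbol{\nu}_q^\top \mathbf{x}_k + b_q) \cos^2(\boldsymbol{\nu}_q^\top \mathbf{x}_{k'} + b_q)\right] \\ &= \mathbb{E}\left[\left(\cos(2\boldsymbol{\nu}_q^\top \mathbf{x}_k + 2b_q) + 1\right)\left(\cos(2\boldsymbol{\nu}_q^\top \mathbf{x}_{k'} + 2b_q) + 1\right)\right] \\ &= \mathbb{E}\left[1 + \cos(2\boldsymbol{\nu}_q^\top \mathbf{x}_k + 2b_q) \cos(2\boldsymbol{\nu}_q^\top \mathbf{x}_{k'} + 2b_q)\right] \\ &\quad + \mathbb{E}\left[\cos(2\boldsymbol{\nu}_q^\top \mathbf{x}_k + 2b_q)\right] + \mathbb{E}\left[\cos(2\boldsymbol{\nu}_q^\top \mathbf{x}_{k'} + 2b_q)\right] \\ &= \mathbb{E}\left[1 + \cos(2\boldsymbol{\nu}_q^\top \mathbf{x}_k + 2b_q) \cos(2\boldsymbol{\nu}_q^\top \mathbf{x}_{k'} + 2b_q)\right] \le 2, \end{aligned}$$

simply because cosine is bounded by 1, and its integral over $[0, 2\pi]$ is equal to zero. Notice that there are $c^2$ terms like the above for every $q$, and thus for a specific $q$, the summation of term (31) where $p = q = j$ is upper bounded by $2c^2$.

We can then conclude that the expectation of term (30) is nonzero only for $p = q$, and the diagonal entries of $\mathbb{E}[\mathbf{H}^\top \mathbf{H} \mathbf{H}^\top \mathbf{H}]$ are upper bounded by $c^2 (M-1) + 2c^2 = c^2 (M+1)$. Recalling the definition of $\mathbf{U}_t$ from (15) and combining it with the fact that

$$\mathbb{E}\left[\mathbf{H}_{i,t}^\top \mathbf{H}_{i,t} \mathbf{H}_{i,t}^\top \mathbf{H}_{i,t}\right] \preceq (M+1) c^2\, \mathbf{I}_M,$$

concludes the proof.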
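The bound in Lemma 3 can also be probed numerically for a single agent. The following is a minimal sketch, assuming NumPy, fixed synthetic inputs, and illustrative sizes that are not from the paper. It estimates the expectation of the squared Gram matrix by Monte-Carlo sampling of the random features and compares its largest eigenvalue with $c^2(M+1)$.

```python
import numpy as np

rng = np.random.default_rng(2)
c, M, d, num_draws = 3, 4, 5, 20000        # hypothetical sizes

X = rng.standard_normal((c, d))            # fixed inputs for one agent
nu = rng.standard_normal((num_draws, M, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=(num_draws, 1, M))
H = np.sqrt(2.0) * np.cos(np.einsum("cd,kmd->kcm", X, nu) + b)

G = np.einsum("kcp,kcq->kpq", H, H)                    # H^T H for each draw
G2_mean = np.einsum("kpj,kjq->pq", G, G) / num_draws   # estimate of E[(H^T H)^2]

largest_eig = np.max(np.linalg.eigvalsh(G2_mean))
bound = c ** 2 * (M + 1)
print("largest eigenvalue:", largest_eig, "bound:", bound)
```

Since $\mathbf{U}_t$ is block diagonal, checking one agent's block is enough to illustrate the matrix inequality; the estimated largest eigenvalue should sit below the bound, typically with some slack because the case analysis above uses worst-case values.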

## Acknowledgments

We gratefully acknowledge the support of Texas A&M Triads for Transformation (T3) program.

## References

• [1] J. Friedman, T. Hastie, and R. Tibshirani, The elements of statistical learning.   New York, NY, USA: Springer series in statistics, 2001, vol. 1.
• [2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
• [3] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., “Deep speech 2: End-to-end speech recognition in english and mandarin,” in International conference on machine learning, 2016, pp. 173–182.
• [4] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” Journal of Machine Learning Research, vol. 17, pp. 1334–1373, 2016.
• [5] D. Shen, G. Wu, and H.-I. Suk, “Deep learning in medical image analysis,” Annual review of biomedical engineering, vol. 19, pp. 221–248, 2017.
• [6] M. Noroozi, A. Vinjimoor, P. Favaro, and H. Pirsiavash, “Boosting self-supervised learning via knowledge transfer,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9359–9367.
• [7] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals, and Systems (MCSS), vol. 2, no. 4, pp. 303–314, 1989.
• [8] A. R. Barron, “Universal approximation bounds for superpositions of a sigmoidal function,” IEEE Transactions on Information Theory, vol. 39, no. 3, pp. 930–945, 1993.
• [9] A. Rahimi and B. Recht, “Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning,” in Advances in Neural Information Processing Systems, 2009, pp. 1313–1320.
• [10] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial Intelligence and Statistics, 2017, pp. 1273–1282.
• [11] U. Khan, S. Kar, A. Jadbabaie, J. M. Moura et al., “On connectivity, observability, and stability in distributed estimation,” in IEEE Conference on Decision and Control (CDC), 2010, pp. 6639–6644.
• [12] S. S. Stanković, M. S. Stanković, and D. M. Stipanović, “Decentralized parameter estimation by consensus based stochastic approximation,” IEEE Transactions on Automatic Control, vol. 56, no. 3, pp. 531–543, 2011.
• [13] S. Kar, J. M. Moura, and K. Ramanan, “Distributed parameter estimation in sensor networks: Nonlinear observation models and imperfect communication,” IEEE Transactions on Information Theory, vol. 58, no. 6, pp. 3575–3605, 2012.
• [14] S. Shahrampour and A. Jadbabaie, “Exponentially fast parameter estimation in networks using distributed dual averaging,” in IEEE Conference on Decision and Control, 2013, pp. 6196–6201.
• [15] N. Atanasov, R. Tron, V. M. Preciado, and G. J. Pappas, “Joint estimation and localization in sensor networks,” in IEEE Conference on Decision and Control (CDC), 2014, pp. 6875–6882.
• [16] A. Mitra and S. Sundaram, “An approach for distributed state estimation of lti systems,” in 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton).   IEEE, 2016, pp. 1088–1093.
• [17] A. Jadbabaie, J. Lin, and A. S. Morse, “Coordination of groups of mobile autonomous agents using nearest neighbor rules,” IEEE Transactions on Automatic Control, vol. 48, no. 6, pp. 988–1001, 2003.
• [18] S. Shahrampour, A. Rakhlin, and A. Jadbabaie, “Distributed estimation of dynamic parameters: Regret analysis,” in 2016 American Control Conference (ACC), 2016, pp. 1066–1071.
• [19] A. Rahimi and B. Recht, “Random features for large-scale kernel machines,” in Advances in neural information processing systems, 2008, pp. 1177–1184.
• [20] R. A. Horn and C. R. Johnson, Matrix analysis.   Cambridge university press, 2012.
• [21] L. M. Candanedo, V. Feldheim, and D. Deramaix, “Data driven prediction models of energy use of appliances in a low-energy house,” Energy and buildings, vol. 140, pp. 81–97, 2017.
• [22] S. Shahrampour, A. Rakhlin, and A. Jadbabaie, “Online learning of dynamic parameters in social networks,” in Advances in Neural Information Processing Systems, 2013.