I Introduction
Supervised learning is a fundamental machine learning problem where, given input-output data samples, a learner aims to find a mapping (or function) from inputs to outputs [1]. A good mapping is one that can be used to predict outputs corresponding to previously unseen inputs. Recently, deep neural networks have dominated the task of supervised learning in various applications, including computer vision
[2], speech recognition [3], robotics [4], and biomedical image analysis [5]. These methods, however, are data hungry, and their application to domains with few or sparse labeled samples remains an active field of research [6]. An effective alternative for supervised learning is a shallow architecture with one hidden layer. This architecture was motivated by the classical results of Cybenko [7] and Barron [8], showing that (under some technical assumptions) one can use sigmoidal basis functions to approximate any output that is a continuous function of the input. These results later motivated researchers to develop algorithmic frameworks that leverage shallow networks for data representation. The seminal work of Rahimi and Recht is a prominent case in point [9]. In their approach, the nonlinear basis functions are selected using Monte-Carlo sampling, with a theoretical guarantee that the approximated function converges asymptotically with respect to the number of data samples and basis functions.

The problem of function approximation in supervised learning (both in shallow and deep neural networks) is often formulated via empirical risk minimization [1], which amounts to solving an optimization problem over a high-dimensional parameter. Due to the computational challenges associated with high-dimensional optimization, an appealing solution turns out to be decentralized training of neural networks [10]. On the other hand, recent advances in distributed computing within the control and signal processing communities [11, 12, 13, 14, 15, 16] have provided novel decentralized techniques for parameter estimation over multi-agent networks. In these scenarios, each individual agent receives partially informative measurements of the parameter and engages in local communications with other agents to collaboratively accomplish the global task. A crucial component of these methods is a consensus protocol [17], allowing collective information aggregation and estimation.
Distributed algorithms have gained popularity due to their ability to handle large data sets, the low computational burden they place on individual agents, and their robustness to failure of a central agent.
Motivated by the importance of distributed computing in high-dimensional parameter estimation, in this paper we consider distributed parameter estimation in randomized one-hidden-layer neural networks. A group of agents sequentially obtain low-dimensional measurements of the parameter (at various locations with different randomized frequencies). Although the parameter is only partially observable to each individual agent, the global spread of measurements is informative enough for collective estimation. We propose a fully distributed update in which each agent engages in local interactions with its neighboring agents to construct iterative estimates of the parameter. The update is akin to consensus+innovation algorithms in the distributed estimation literature [11, 13, 18].
Our main theoretical contribution is to characterize the first and second moments of the global estimation error. In particular, we prove that the distributed update provides an asymptotically unbiased estimator of the unknown parameter when all the randomness is averaged out, i.e., the first moment of the global error converges to zero asymptotically. This result also allows us to characterize the convergence rate and derive an optimal innovation rate that speeds up convergence. We further analyze the efficiency of the proposed estimation scheme by establishing an asymptotic upper bound on the variance of the global error. We finally evaluate our method on a real-world dataset on appliances energy prediction, where the empirical findings verify our theoretical results.
II Problem Statement
Notation: We adhere to the following notation table throughout the paper:
[n] : the set {1, ..., n} for any integer n
x^T : transpose of vector x
I_d : identity matrix of size d
1_d : vector of all ones with dimension d
0 : vector of all zeros
‖·‖ : norm operator
λ_i(A) : the i-th largest eigenvalue of matrix A
E[·] : expectation operator
ρ(A) : spectral radius of matrix A
Tr(·) : trace operator
A ⪰ 0 : matrix A is positive semi-definite
Vectors are in column format. Boldface lowercase variables (e.g., x) denote vectors, and boldface uppercase variables (e.g., A) denote matrices.
II-A One-Hidden-Layer Neural Networks: The Centralized Problem
Let us consider a regression problem of the form

y = f(x) + ε,

where y is the output, x is the input, and ε is the noise term with zero mean and constant variance. The objective is to find the unknown mapping (or function) f based on the available input-output pairs. Various regression methods assume different functional forms to approximate f. For example, in linear regression, the input-output relationship is assumed to follow a linear model.
In this work, we focus on one-hidden-layer neural networks [7], where the approximated function is a nonlinear function of the input, and

f̂(x) = Σ_{k=1}^{M} w_k φ(x; θ_k),   (1)

where φ is called a basis function (or feature map) parameterized by θ_k. In the above model, the parameters {w_k} and {θ_k} are unknown and should be learned from data (i.e., input-output pairs). The underlying intuition behind this model is that the feature map transforms the original data from dimension d to M, where often M ≫ d. Since the new space has a higher dimension, it provides more flexibility for approximating the unknown function (as opposed to a linear model, which is restrictive). It turns out that approximations of the form (1) are dense in the space of continuous functions [7], i.e., they can be used to approximate any continuous function (on the unit cube).
However, from an algorithmic perspective, learning both {w_k} and {θ_k} is computationally expensive. For a nonlinear feature map (e.g., the cosine feature map), the problem is indeed nonconvex and thus hard to solve. An alternative approach was proposed in [9], where one-hidden-layer neural networks are thought of as Monte-Carlo approximations of kernel expansions. In particular, if we assume that θ is a random variable with support Θ and probability distribution p(θ), the corresponding kernel can be obtained via [19]

k(x, x') = ∫_Θ φ(x; θ) φ(x'; θ) p(θ) dθ.   (2)

Hence, if {θ_k} are independent samples from p(θ), the approximated kernel expansion corresponds to (1), and learning becomes a convex optimization problem with a modest computational cost. {θ_k} are then called random features in this model.
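As a concrete illustration of this construction, the sketch below draws random cosine features (of the kind described next) and checks both claims numerically: inner products of the feature maps approximate a Gaussian kernel in the spirit of (2), and with the random features frozen, learning the outer weights reduces to convex least squares as in (1). All sizes, the toy target function, and variable names here are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M = 2, 2000                       # input dimension, number of random features
omega = rng.standard_normal((M, d))  # frequencies ~ N(0, I) -> Gaussian kernel, unit width
b = rng.uniform(0.0, 2.0 * np.pi, M) # phases ~ U[0, 2*pi]

def z(X):
    # feature map: z(x) @ z(x') approximates exp(-||x - x'||^2 / 2)
    return np.sqrt(2.0 / M) * np.cos(X @ omega.T + b)

# (i) Monte-Carlo kernel approximation, in the spirit of (2)
x, xp = rng.standard_normal(d), rng.standard_normal(d)
approx = (z(x[None, :]) @ z(xp[None, :]).T).item()
exact = np.exp(-np.linalg.norm(x - xp) ** 2 / 2.0)

# (ii) with the random features frozen, fitting the outer weights is
# ordinary least squares -- a convex problem
N = 500
X = rng.uniform(-1.0, 1.0, (N, d))
y = np.sin(X[:, 0]) + 0.5 * np.cos(2.0 * X[:, 1])  # smooth toy target
w, *_ = np.linalg.lstsq(z(X), y, rcond=None)
rmse = np.sqrt(np.mean((z(X) @ w - y) ** 2))

print(approx, exact, rmse)
```

The kernel estimate tightens at the Monte-Carlo rate as the number of features M grows, which is the asymptotic guarantee referenced above.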
One such example uses the cosine feature map to approximate a Gaussian kernel with unit width. In this case, (1) takes the form

f̂(x) = Σ_{k=1}^{M} w_k cos(ω_k^T x + b_k),   (3)

where the frequencies {ω_k} come from a multivariate Gaussian distribution and the phases {b_k} come from a uniform distribution. In this paper, we focus on the approximated function of the form (3) and propose a distributed algorithm for learning the parameter w = [w_1, ..., w_M]^T.

II-B Local Measurements in Multi-agent Networks
The scenario proposed in the previous section was centralized in the sense that the estimation task is carried out by a single agent that has access to all the data. In this section, we propose an iterative distributed scheme in which a network of agents, each with access to a subset of the data, performs the task. In particular, agent i has access to only s_i data points at each iteration.
Assumption 1
Without loss of generality, we assume each agent observes the same number of data points at each time, i.e., s_i = s throughout the paper.
This assumption is only for the sake of presentation clarity. Our main results can be extended to the case where different agents have different numbers of measurements.
Now, in the distributed model, the observation matrix at time t is

X_t = [X_{1,t}^T, X_{2,t}^T, ..., X_{n,t}^T]^T,   (4)

with agent i having access only to the block X_{i,t}. We then have the following measurement model

y_{i,t} = X_{i,t} w* + v_{i,t},

where w* is the unknown parameter that needs to be learned, and v_{i,t} denotes the observation noise at agent i. The above local measurement model can be interpreted as iteratively collecting low-dimensional measurements of the parameter w* at different locations using distinct frequencies.

We follow the general assumptions of zero mean and constant variance on the noise term, i.e., E[v_{i,t}] = 0 and E[v_{i,t} v_{i,t}^T] = σ² I. We further denote by ŵ_{i,t} the estimate of w* formed by agent i at time t.
II-C Multi-agent Network Model
The interactions of the agents, which in turn define the network, are captured by the matrix P. Formally, we denote by [P]_{ij} the (i, j)-th entry of P. When [P]_{ij} > 0, agent i communicates with agent j. We assume that P is symmetric and doubly stochastic with positive diagonal elements. This assumption simply guarantees information flow in the network. From a technical point of view, we impose the following hypothesis.
Assumption 2
(connectivity) The network is connected, i.e., there is a path from any agent i to any other agent j.
This assumption implies that the Markov chain associated with P is irreducible and aperiodic, thus having a unique stationary distribution; i.e., 1_n is the unique (unnormalized) left eigenvector corresponding to the eigenvalue 1. It also entails that the eigenvalue 1 of P is simple, and the other eigenvalues of P are less than unity in magnitude [20].

II-D Distributed Estimation Update
To construct an iterative estimate of the parameter w*, each agent i at time t performs the following distributed update

ŵ_{i,t+1} = Σ_{j=1}^{n} [P]_{ij} ŵ_{j,t} + α X_{i,t}^T (y_{i,t} − X_{i,t} ŵ_{i,t}),   (5)
where α > 0 is the step size. The update is akin to consensus+innovation schemes in the distributed estimation literature [11, 13, 18], and we analyze it in Section III in the context of one-hidden-layer neural networks. Intuitively, the first part of the update (consensus) keeps the agents' estimates close to each other, while the second part (innovation) takes the new measurements into account.
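A minimal simulation of an update of this consensus+innovation type is sketched below. The network, sizes, noise level, and step size are my own illustrative choices, and the observation blocks are synthetic Gaussian matrices rather than the cosine features of Section II; the point is only that agents recover the parameter collectively even though each local measurement is low-dimensional.

```python
import numpy as np

rng = np.random.default_rng(2)
n, M, s = 8, 10, 2          # agents, parameter dimension, measurements per agent
w_true = rng.standard_normal(M)

# symmetric, doubly stochastic ring network with positive diagonal
P = np.zeros((n, n))
for i in range(n):
    P[i, (i - 1) % n], P[i, i], P[i, (i + 1) % n] = 0.25, 0.5, 0.25

alpha = 0.02                # small constant step size
W = np.zeros((n, M))        # row i holds agent i's current estimate
for t in range(5000):
    X = rng.standard_normal((n, s, M))              # fresh local observation blocks
    y = np.einsum('isk,k->is', X, w_true) + 0.01 * rng.standard_normal((n, s))
    preds = np.einsum('isk,ik->is', X, W)
    innov = np.einsum('isk,is->ik', X, y - preds)   # innovation: X^T (y - X w)
    W = P @ W + alpha * innov                       # consensus + innovation

err = np.max(np.linalg.norm(W - w_true, axis=1))
print(err)  # every agent's error is small although s < M locally
```

Note that each agent sees only s = 2 scalar measurements per round for an M = 10 dimensional parameter; the consensus term is what propagates the globally available information.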
III Main Theoretical Results
In this section, we provide our main theoretical results. We show that the local update (5) yields an asymptotically unbiased estimator of the global parameter w*. Based on this result, we characterize the optimal step size that achieves the fastest convergence rate. We then prove that the asymptotic second moment of the collective estimation error is bounded.
III-A First Moment
Let us define the local error for each agent i as

e_{i,t} = ŵ_{i,t} − w*.   (6)

Subtracting w* from both sides of the local update (5), we can write the iterative local error process as

e_{i,t+1} = Σ_{j=1}^{n} [P]_{ij} e_{j,t} − α X_{i,t}^T X_{i,t} e_{i,t} + α X_{i,t}^T v_{i,t}.   (7)

Stacking the local errors in a vector, we denote the global error by

e_t = [e_{1,t}^T, e_{2,t}^T, ..., e_{n,t}^T]^T.   (8)
We now characterize the global error process with the following proposition.
Proposition 1
The proof of Proposition 1 is given in the Appendix. It shows that the agents collectively generate estimates of the parameter that are asymptotically unbiased as long as the spectral radius of the matrix governing the mean error dynamics is less than 1.
III-B Step Size Tuning
According to Proposition 1, the convergence rate depends on the choice of the step size. To speed up the convergence of the process, one must shrink the spectral radius of the matrix governing the mean error dynamics as much as possible. This corresponds to solving the following problem
(10) 
According to Assumption 2, 1_n is the unique (unnormalized) eigenvector of the matrix P associated with the eigenvalue 1, because P is doubly stochastic. It is then immediate that
(11) 
On the other hand, we have that
(12) 
Plotting (11) and (12) in terms of the step size α, we notice that the optimal α occurs exactly where the two curves intersect, in which case we have the following relationship
(13) 
Plugging the optimal step size (13) into (11) and (12), we achieve the fastest convergence rate. This result suggests that when the smallest eigenvalue λ_n(P) is close to one, we obtain the fastest convergence rate. Since λ_n(P) is the smallest eigenvalue of P, this also implies that the other eigenvalues are close to one in this scenario. Intuitively, this indicates that P is close to the identity matrix and agents have high self-reliance, i.e., they do not rely heavily on their neighbors. Note that P cannot be exactly the identity, since otherwise the connectivity constraint is violated. Notice that in this paper we are not concerned with network design, i.e., we assume that P is given, and we choose the step size based on (13) accordingly.
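The self-reliance intuition above can be checked numerically. In the toy sketch below (my own example, not the paper's network), P is a symmetric, doubly stochastic ring; raising the agents' self-weight moves P toward the identity and pushes the smallest eigenvalue toward one, while the eigenvalue 1 itself stays simple.

```python
import numpy as np

def lazy_ring(n, self_weight):
    # symmetric, doubly stochastic ring: self-loop plus two equal neighbour weights
    w = (1.0 - self_weight) / 2.0
    P = np.zeros((n, n))
    for i in range(n):
        P[i, i] = self_weight
        P[i, (i - 1) % n] += w
        P[i, (i + 1) % n] += w
    return P

for a in (0.5, 0.9, 0.99):
    lam = np.sort(np.linalg.eigvalsh(lazy_ring(10, a)))
    print(a, lam[0], lam[-1])   # smallest eigenvalue rises toward 1; largest stays 1
```

For this circulant example the eigenvalues are a + (1 − a)·cos(2πk/n), so the smallest one is exactly 2a − 1 for even n, making the trade-off between self-reliance and connectivity explicit.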
III-C Asymptotic Second Moment
To capture the efficiency of the collective estimation, we should also study the variance of the error, which (asymptotically) amounts to the second moment in view of Proposition 1. In the next theorem, we present an asymptotic upper bound on the second moment for a feasible range of the step size α.
Theorem 2
The proof of Theorem 2 is given in the Appendix. It shows that the (asymptotic) expected second moment of the estimation error is bounded by a finite value that scales linearly with the number of agents for a certain range of the step size α. It also suggests that the optimal step size in (13) works whenever it falls within this feasible range.
IV Numerical Experiments
We now provide empirical evidence in support of our algorithm by applying it to a regression dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction). In this dataset, the input includes a number of attributes, such as the temperature in the kitchen area, humidity in the kitchen area, temperature in the living room area, humidity in the laundry room area, outside temperature, and pressure. The regression model aims at representing appliances energy use in terms of these features. More details about this dataset can be found in [21] as well as on the UCI Machine Learning Repository. We randomly choose 16,000 of its 19,735 observations for our simulation.
We consider observation matrices of the form (4), where the bases are cosine functions,

cos(ω_k^T x + b_k),   (14)

as described in Section II-A, where the frequencies {ω_k} come from a multivariate Gaussian distribution and the phases {b_k} come from a uniform distribution. Without loss of generality, we set M = 5, i.e., we use five basis functions in the approximation model (3). One can consider other values of M and perform cross-validation to find the best one, but this is outside the scope of this paper, as our focus is on estimation rather than model selection.
Network Structure: We consider a network of n agents, where each agent i has access to the observation matrix X_{i,t} at time t. Each agent is also connected to its nearby agents (with a circular shift for any index outside the range). The matrix P is such that each agent is connected to itself and to its neighbors with fixed positive weights. The smallest eigenvalue of our network matrix satisfies the step size constraint in Theorem 2, so we can use the optimal step size (13) for this simulation. Therefore, the step size is set as in (13).
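A circulant topology of this kind can be sketched as follows. Since the paper's exact weights and network size are not reproduced here, the values below are placeholders chosen only so that P is symmetric, doubly stochastic, and has a positive diagonal; the smallest eigenvalue printed at the end is the quantity entering the step size constraint of Theorem 2.

```python
import numpy as np

n = 20
P = np.zeros((n, n))
for i in range(n):
    P[i, i] = 0.4                              # self-weight (placeholder value)
    for hop, w in ((1, 0.2), (2, 0.1)):        # 1- and 2-hop neighbours, circular shift
        P[i, (i - hop) % n] += w
        P[i, (i + hop) % n] += w

assert np.allclose(P.sum(axis=1), 1.0) and np.allclose(P, P.T)
lam = np.sort(np.linalg.eigvalsh(P))
print(lam[0], lam[-1])  # smallest eigenvalue (used in the step size rule) and 1.0
```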
Benchmark: Since this dataset comes from the real world and the ground-truth parameter is unknown, we consider the solution of the centralized problem as the baseline. The local error at time t is then calculated as the difference between the local estimate and the centralized estimate, as given in (6). We run update (5) for enough iterations that the process reaches a steady state. To verify our results, we repeat the update process using Monte-Carlo simulations over the random features to estimate the expectations.
Performance: We visualize the error process in Proposition 1 by plotting the norm of the expected global error at steady state. The vertical axis in Fig. 1 represents the average global error obtained by repeating Monte-Carlo simulations to form an estimate of the expected global error. The horizontal axis shows the number of Monte-Carlo simulations. As the number of Monte-Carlo simulations increases, the norm of the average global error converges to the norm of the expected global error in Proposition 1. As we can observe, the estimate of the expected global error converges to zero, verifying that the agents form asymptotically unbiased estimators of the parameter.
We next plot the expected squared norm of the global error, i.e., the quantity bounded in Theorem 2, at steady state. The vertical axis in Fig. 2 represents the squared norm of the global error averaged over Monte-Carlo simulations. The horizontal axis shows the number of Monte-Carlo simulations. As the number of Monte-Carlo simulations increases, the average squared norm of the global error converges to its expectation in Theorem 2. For this simulation setup, the expected squared norm of the global error is upper bounded according to Theorem 2, and as we can observe, the average squared norm of the global error always remains below this bound, verifying the accuracy of the upper bound in Theorem 2.
V Conclusion
In this paper, we considered a distributed scheme for parameter estimation in randomized one-hidden-layer neural networks. A network of agents exchange local estimates of the parameter, formed using partial observations, to collaboratively identify its true value. Our main contribution is to characterize the behavior of this distributed estimation scheme. We showed that the global estimation error is asymptotically unbiased and that its second moment is finite under mild assumptions. Interestingly, our results shed light on the interplay between the step size and the network structure, which can be used for optimal design in practice. We verified this empirically by applying our method to a real-world dataset. Future directions include studying the estimation problem when the parameter has some dynamics [22] or when the random frequencies are generated from a time-varying distribution. Due to the nonstationary nature of the problem in these two cases, the theoretical analysis becomes challenging and interesting to explore.
Appendix
For presentation clarity, we use the following definitions in the proofs:
(15) 
A Proof of Proposition 1
To prove Proposition 1, we first need to show that

E[cos(ω^T x + b)] = 0   (16)

for any x. Recall that ω and b are independent, with b uniform on [0, 2π], and thus

E_b[cos(ω^T x + b)] = (1/2π) ∫_0^{2π} cos(ω^T x + b) db = 0,

since cosine is a periodic function. Therefore, we can conclude that for any x and x',

E[cos(ω^T x + b) cos(ω'^T x' + b')] = 0   (17)

whenever (ω, b) is independent of (ω', b'). Notice that given the observation model (4), the (k, l)-th entry of the matrix X_{i,t}^T X_{i,t} can be written as

[X_{i,t}^T X_{i,t}]_{kl} = Σ_j cos(ω_k^T x_j + b_k) cos(ω_l^T x_j + b_l).   (18)

When k ≠ l, the expectation of (18) is zero according to (17); otherwise, each summand has expectation 1/2, since for any u we have cos²(u + b) = (1 + cos(2u + 2b))/2 and the expectation of cos(2u + 2b) over b is zero. Hence, E[X_{i,t}^T X_{i,t}] = (s/2) I, entailing that

(19)
in view of (15). Following the lines of the proof of Lemma 1 in [18], the error process can be expressed as follows
(20) 
where
(21) 
Taking expectation over random features on both sides and noting (19), we have
Recalling (15), we can also immediately see from the zero-mean assumption on the noise that the noise term has zero expectation for every t. Combining this with the above and returning to (20) finishes the proof of Proposition 1.
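The feature expectations used in this proof can be verified by direct Monte-Carlo sampling. The check below is my own sketch: it confirms the zero mean in (16)-(17), and that for a shared pair (ω, b) the product of two features has mean k(x, x')/2 for the unit-width Gaussian kernel k, which with x = x' gives the constant 1/2 on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(3)
d, S = 3, 500_000
omega = rng.standard_normal((S, d))            # omega ~ N(0, I)
b = rng.uniform(0.0, 2.0 * np.pi, S)           # b ~ U[0, 2*pi]
x, xp = rng.standard_normal(d), rng.standard_normal(d)

# E[cos(omega^T x + b)] = 0, cf. (16)-(17)
mean_feat = np.mean(np.cos(omega @ x + b))

# for a shared (omega, b): E[cos(omega^T x + b) cos(omega^T x' + b)] = k(x, x') / 2
prod = np.mean(np.cos(omega @ x + b) * np.cos(omega @ xp + b))
half_kernel = 0.5 * np.exp(-np.linalg.norm(x - xp) ** 2 / 2.0)

print(mean_feat, prod, half_kernel)
```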
B Proof of Theorem 2
To prove Theorem 2, we first derive a recursive relationship for the error process based on (20), where

(22)

where we used the fact that the cross-terms in the second line are zero in expectation. To further bound the remaining term, let us recall (21). As both matrices are symmetric, we have that
Now, we apply Lemma 3 to bound the above.
Then, the largest eigenvalue can be bounded as follows
(23)  
Now, consider the kernel matrix formed with the measurements at agent i at time t, whose entries are the corresponding kernel evaluations. Recalling (15), we can then bound the additive term in the recursive relation (22) as follows
(24)  
Letting
(25) 
and using (23) and (24), we can rewrite the recursive relation in (22) as
(26) 
We can find the feasible range of the step size α through an inequality that ensures the recursive process (26) converges.
First, we have the following fact
One can show that when and otherwise.
For the case when , we have the following
Therefore, given , we have that
This upper bound converges to a finite limit as t → ∞, and noting the definitions in (25), we derive the upper bound in the statement of Theorem 2.
For the case when , we have the following
(27)  
Considering the LHS of the last line in (27) as a quadratic function of , one can show that
must be true for (27) to hold. Therefore, the following must be true as well
C Statement and Proof of Lemma 3
In the proof, we omit the time and agent indices for presentation clarity. We will show that the matrix in question is diagonal and that all of its diagonal entries are upper bounded.
Let us start by observing that the (k, l)-th entry of the matrix (for any agent) can be written as
(30) 
We now consider a single term in the previous summation:
(31) 
and analyze its expectation case by case.
Case 1: and ( and ).
Since and are independent, the expectation of the product of these two functions is zero as previously discussed in (17), so (31) would be zero.
Case 2: and ( or ).
In this case, three out of four product terms in (31) will include or . Then, the expectation of the other term will be zero again as cosine is periodic. Thus, the expectation of (31) will still be zero.
Case 3: and .
Now, (31) will become a product of two expectations of unbiased approximates of the kernel function in view of (2). Thus, the expectation of (31) will become which is less than 1. There are terms of this form in (30), which implies that it is upper bounded by .
Case 4: .
In this case, (31) becomes
So, the expectation of (31) with will become the following where and :
simply because cosine is bounded by 1 and its integral over a full period is equal to zero. Notice that there are terms of the above form for every index, and thus for a specific index, the summation of the terms (31) in this case is bounded accordingly.
Acknowledgments
We gratefully acknowledge the support of Texas A&M Triads for Transformation (T3) program.
References
[1] J. Friedman, T. Hastie, and R. Tibshirani, The Elements of Statistical Learning. Springer Series in Statistics, New York, NY, USA, 2001, vol. 1.
[2] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[3] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., "Deep speech 2: End-to-end speech recognition in English and Mandarin," in International Conference on Machine Learning, 2016, pp. 173–182.
[4] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," Journal of Machine Learning Research, vol. 17, pp. 1334–1373, 2016.
[5] D. Shen, G. Wu, and H.-I. Suk, "Deep learning in medical image analysis," Annual Review of Biomedical Engineering, vol. 19, pp. 221–248, 2017.
[6] M. Noroozi, A. Vinjimoor, P. Favaro, and H. Pirsiavash, "Boosting self-supervised learning via knowledge transfer," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9359–9367.
[7] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals, and Systems (MCSS), vol. 2, no. 4, pp. 303–314, 1989.
[8] A. R. Barron, "Universal approximation bounds for superpositions of a sigmoidal function," IEEE Transactions on Information Theory, vol. 39, no. 3, pp. 930–945, 1993.
[9] A. Rahimi and B. Recht, "Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning," in Advances in Neural Information Processing Systems, 2009, pp. 1313–1320.
[10] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Artificial Intelligence and Statistics, 2017, pp. 1273–1282.
[11] U. Khan, S. Kar, A. Jadbabaie, J. M. Moura et al., "On connectivity, observability, and stability in distributed estimation," in IEEE Conference on Decision and Control (CDC), 2010, pp. 6639–6644.
[12] S. S. Stanković, M. S. Stanković, and D. M. Stipanović, "Decentralized parameter estimation by consensus based stochastic approximation," IEEE Transactions on Automatic Control, vol. 56, no. 3, pp. 531–543, 2011.
[13] S. Kar, J. M. Moura, and K. Ramanan, "Distributed parameter estimation in sensor networks: Nonlinear observation models and imperfect communication," IEEE Transactions on Information Theory, vol. 58, no. 6, pp. 3575–3605, 2012.
[14] S. Shahrampour and A. Jadbabaie, "Exponentially fast parameter estimation in networks using distributed dual averaging," in IEEE Conference on Decision and Control, 2013, pp. 6196–6201.
[15] N. Atanasov, R. Tron, V. M. Preciado, and G. J. Pappas, "Joint estimation and localization in sensor networks," in IEEE Conference on Decision and Control (CDC), 2014, pp. 6875–6882.
[16] A. Mitra and S. Sundaram, "An approach for distributed state estimation of LTI systems," in 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2016, pp. 1088–1093.
[17] A. Jadbabaie, J. Lin, and A. S. Morse, "Coordination of groups of mobile autonomous agents using nearest neighbor rules," IEEE Transactions on Automatic Control, vol. 48, no. 6, pp. 988–1001, 2003.
[18] S. Shahrampour, A. Rakhlin, and A. Jadbabaie, "Distributed estimation of dynamic parameters: Regret analysis," in 2016 American Control Conference (ACC), 2016, pp. 1066–1071.
[19] A. Rahimi and B. Recht, "Random features for large-scale kernel machines," in Advances in Neural Information Processing Systems, 2008, pp. 1177–1184.
[20] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 2012.
[21] L. M. Candanedo, V. Feldheim, and D. Deramaix, "Data driven prediction models of energy use of appliances in a low-energy house," Energy and Buildings, vol. 140, pp. 81–97, 2017.
[22] S. Shahrampour, A. Rakhlin, and A. Jadbabaie, "Online learning of dynamic parameters in social networks," in Advances in Neural Information Processing Systems, 2013.