A good mapping is one that can be used to predict outputs corresponding to previously unseen inputs. Recently, deep neural networks have dominated the task of supervised learning in various applications, including computer vision, speech recognition, robotics, and biomedical image analysis. These methods, however, are data hungry, and their application to domains with few or sparse labeled samples remains an active field of research. An alternative effective method for supervised learning is shallow architectures with one hidden layer. This architecture was motivated by the classical results of Cybenko and Barron, showing that (under some technical assumptions) one can use sigmoidal basis functions to approximate any output that is a continuous function of the input. These results later motivated researchers to develop algorithmic frameworks that leverage shallow networks for data representation. The seminal work of Rahimi and Recht is a case in point. In their approach, the nonlinear basis functions are selected using Monte-Carlo sampling, with a theoretical guarantee that the approximated function converges asymptotically with respect to the number of data samples and basis functions.
The problem of function approximation in supervised learning (in both shallow and deep neural networks) is often formulated via empirical risk minimization, which amounts to solving an optimization problem over a high-dimensional parameter. Due to the computational challenges associated with high-dimensional optimization, an appealing solution turns out to be decentralized training of neural networks. On the other hand, recent advancements in distributed computing within the control and signal processing communities [11, 12, 13, 14, 15, 16] have provided novel decentralized techniques for parameter estimation over multi-agent networks. In these scenarios, each individual agent receives partially informative measurements about the parameter and engages in local communications with other agents to collaboratively accomplish the global task. A crucial component of these methods is a consensus protocol, allowing collective information aggregation and estimation. Distributed algorithms have gained popularity due to their ability to handle large data sets, the low computational burden they place on individual agents, and their robustness to failure of a central agent.
Motivated by the importance of distributed computing in high-dimensional parameter estimation, in this paper, we consider distributed parameter estimation in randomized one-hidden-layer neural networks. A group of agents sequentially obtain low-dimensional measurements of the parameter (in various locations at different randomized frequencies). Despite the parameter being partially observable to each individual agent, the global spread of measurements is informative enough for a collective estimation. We propose a fully distributed update where each agent engages in local interactions with its neighboring agents to construct iterative estimates of the parameter. The update is akin to consensus+innovation algorithms in the distributed estimation literature [11, 13, 18].
Our main theoretical contribution is to characterize the first and second moments of the global estimation error. In particular, we prove that the distributed update provides an asymptotically unbiased estimator of the unknown parameter when all the randomness is expected out, i.e., the first moment of the global error converges to zero asymptotically. This result also allows us to characterize the convergence rate and derive an optimal innovation rate that speeds up the convergence. We further analyze the efficiency of the proposed estimation scheme by establishing an asymptotic upper bound on the variance of the global error. We finally simulate our method on a real-world dataset related to appliances energy prediction, where we observe that the empirical findings verify the theoretical results.
II Problem Statement
Notation: We adhere to the following notation table throughout the paper:
| [n]       | the set {1, …, n} for any integer n      |
| x^⊤       | transpose of vector x                    |
| I_n       | identity matrix of size n                |
| 1_n       | vector of all ones with dimension n      |
| 0         | vector of all zeros                      |
| λ_i(A)    | i-th largest eigenvalue of matrix A      |
| ρ(A)      | spectral radius of matrix A              |
| A ⪰ 0     | A is positive semi-definite              |
The vectors are in column format. Boldface lowercase variables are used for vectors, and boldface uppercase variables are used for matrices.
II-A One-Hidden-Layer Neural Networks: The Centralized Problem
Let us consider a regression problem of the form

y = f(x) + ε,

where y is the output, x is the input, and ε is the noise term with zero mean and constant variance. The objective is to find the unknown mapping (or function) f based on available input-output pairs. Various regression methods assume different functional forms to approximate f. For example, in linear regression, the input-output relationship is assumed to follow a linear model.
In this work, we focus on one-hidden-layer neural networks, where the approximated function is a nonlinear function of the input, and

f(x) ≈ Σ_{k=1}^{D} a_k φ(x; w_k),   (1)

where φ(·; w) is called a basis function (or feature map) parameterized by w. In the above model, the parameters {a_k} and {w_k} are unknown and should be learned from data (i.e., input-output pairs). The underlying intuition behind this model is that the feature map transforms the original data from dimension d to dimension D, where often D ≫ d. Since the new space has a higher dimension, it provides more flexibility for approximating the unknown function (as opposed to a linear model, which is restrictive). It turns out that approximations of form (1) are dense in the space of continuous functions, i.e., they can be used to approximate any continuous function (on the unit cube).
However, from an algorithmic perspective, learning both {a_k} and {w_k} is computationally expensive. For a nonlinear feature map (e.g., the cosine feature map), the problem is indeed non-convex and thus hard to solve. An alternative approach was proposed in the work of Rahimi and Recht, where one-hidden-layer neural networks are thought of as Monte-Carlo approximations of kernel expansions. In particular, if we assume that w is a random variable with a support Ω and a probability distribution p(w), the corresponding kernel can be obtained via

k(x, x') = E_w[ φ(x; w) φ(x'; w) ].   (2)

Hence, if {w_k} are independent samples from p(w), the approximated kernel expansion corresponds to (1), and learning becomes a convex optimization problem with a modest computational cost. The samples {w_k} are then called random features in this model.
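For the cosine feature map discussed next, the kernel identity (2) can be made explicit. Writing w for a frequency and b ~ Uniform[0, 2π] for a phase (these symbols are assumed here, as the original notation is not reproduced in this excerpt), the product-to-sum identity gives

```latex
\begin{aligned}
\mathbb{E}_{b}\!\left[\,2\cos(\mathbf{w}^{\top}\mathbf{x}+b)\cos(\mathbf{w}^{\top}\mathbf{x}'+b)\,\right]
  &= \mathbb{E}_{b}\!\left[\cos\!\big(\mathbf{w}^{\top}(\mathbf{x}-\mathbf{x}')\big)
       + \cos\!\big(\mathbf{w}^{\top}(\mathbf{x}+\mathbf{x}')+2b\big)\right] \\
  &= \cos\!\big(\mathbf{w}^{\top}(\mathbf{x}-\mathbf{x}')\big),
\end{aligned}
```

since the second cosine averages to zero over a full period of the uniform phase. Taking the further expectation over w ~ N(0, I) yields E_w[cos(w^⊤(x − x'))] = exp(−‖x − x'‖²/2), i.e., the Gaussian kernel with unit width.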
One such example is using the cosine feature map to approximate a Gaussian kernel with unit width. In this case, (1) takes the following form:

f̂(x) = Σ_{k=1}^{D} a_k cos(w_k^⊤ x + b_k),   (3)

where the frequencies {w_k} come from a multivariate Gaussian distribution and the phases {b_k} come from a uniform distribution. In this paper, we focus on the approximated function of form (3) and propose a distributed algorithm for learning the parameter θ = [a_1, …, a_D]^⊤.
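A minimal numerical sketch of this construction, under assumed variable names (the dimensions, seed, and scaling are illustrative, not taken from the paper): inner products of cosine random features approximate the unit-width Gaussian kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 3, 5000                     # input dimension, number of random features

W = rng.standard_normal((D, d))            # frequencies w_k ~ N(0, I_d)
b = rng.uniform(0.0, 2.0 * np.pi, size=D)  # phases b_k ~ Uniform[0, 2*pi]

def phi(x):
    """Cosine feature map sqrt(2/D) * cos(W x + b); inner products approximate k."""
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
approx = phi(x1) @ phi(x2)                           # Monte-Carlo kernel estimate
exact = np.exp(-np.linalg.norm(x1 - x2) ** 2 / 2.0)  # Gaussian kernel, unit width
```

The Monte-Carlo error decays like 1/sqrt(D), so with D = 5000 the two values agree to roughly two decimal places.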
II-B Local Measurements in Multi-agent Networks
The scenario proposed in the previous section was centralized in the sense that the estimation task was performed by a single agent with access to all the data. In this section, we propose an iterative distributed scheme where we have a network of agents, each of which has access to a subset of the data. In particular, each agent has access to only a few data points at each iteration.
Without loss of generality, we assume each agent observes the same number of data points at each time throughout the paper. This assumption is only for the sake of presentation clarity; our main results can be extended to the case where different agents have various numbers of measurements.
Now, in the distributed model, the observation matrix at time t takes the form

H(t) = [H_1^⊤(t), …, H_n^⊤(t)]^⊤,   (4)

whose rows are cosine features of the corresponding inputs, with agent i having access only to its own block H_i(t). We then have the following measurement model

y_i(t) = H_i(t) θ + v_i(t),

where θ is the unknown parameter that needs to be learned, and v_i(t) denotes the observation noise at agent i. The above local measurement model can be interpreted as iteratively collecting low-dimensional measurements of the parameter θ at different locations using distinct frequencies.
We follow the general assumptions of zero mean and constant variance on the noise term v_i(t). We further denote by θ_i(t) the estimate of θ formed by agent i at time t.
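The local measurement model above can be sketched as follows. All names (the shared random features, the noise level, the dimensions) are hypothetical stand-ins, assuming observation rows built from the cosine feature map of Section II-A:

```python
import numpy as np

rng = np.random.default_rng(1)
d, D, s = 3, 5, 4    # input dim, parameter dim, data points per agent per step

# Random features shared by the model (assumed names): Gaussian frequencies
# and uniform phases, as in the cosine feature map.
W = rng.standard_normal((D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)
theta = rng.standard_normal(D)            # unknown parameter to be estimated

def local_measurement(X, noise_std=0.1):
    """One agent's model: y = H theta + v, where rows of H are cosine features."""
    H = np.cos(X @ W.T + b)               # s x D local observation matrix
    v = noise_std * rng.standard_normal(X.shape[0])
    return H, H @ theta + v

H_i, y_i = local_measurement(rng.standard_normal((s, d)))
```

Each agent thus sees only an s x D slice of information about the D-dimensional parameter per iteration, which is why collaboration is needed for identifiability.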
II-C Multi-agent Network Model
The interactions of agents, which in turn define the network, are captured by the matrix W. Formally, we denote by [W]_{ij} the (i,j)-th entry of W. When [W]_{ij} > 0, agent i communicates with agent j. We assume that W is symmetric and doubly stochastic with positive diagonal elements. This assumption simply guarantees the information flow in the network. More formally, we adopt the following hypothesis.
(Connectivity) The network is connected, i.e., there is a path from any agent i to any other agent j.
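These structural assumptions can be checked numerically. The 3-agent matrix below is hypothetical, used only to illustrate the checks; for a symmetric stochastic weight matrix, connectivity is equivalent to the eigenvalue 1 being simple:

```python
import numpy as np

# Hypothetical 3-agent weight matrix: any symmetric doubly stochastic
# matrix with positive diagonal entries would do.
Wmat = np.array([[0.50, 0.50, 0.00],
                 [0.50, 0.25, 0.25],
                 [0.00, 0.25, 0.75]])

symmetric = bool(np.allclose(Wmat, Wmat.T))
doubly_stochastic = bool(np.allclose(Wmat.sum(axis=0), 1.0)
                         and np.allclose(Wmat.sum(axis=1), 1.0))
positive_diagonal = bool(np.all(np.diag(Wmat) > 0))

# Connectivity check: the graph is connected iff the second-largest
# eigenvalue of W is strictly below 1.
eigs = np.linalg.eigvalsh(Wmat)          # ascending order
connected = bool(eigs[-2] < 1.0 - 1e-9)
```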
II-D Distributed Estimation Update
To construct an iterative estimate of the parameter θ, each agent i at time t performs the following distributed update

θ_i(t+1) = Σ_{j=1}^{n} [W]_{ij} θ_j(t) + α H_i^⊤(t) ( y_i(t) − H_i(t) θ_i(t) ),   (5)

where α is the step size.
where is the step size. The update is akin to consensus+innovation schemes in the distributed estimation literature [11, 13, 18], and we analyze this update in Section III in the context of one-hidden-layer neural networks. Intuitively, the first part of the update (consensus) allows agents to keep their estimates close to each other, and the second part (innovation) takes into account the new measurements.
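The consensus+innovation mechanism described above can be sketched numerically. Everything below (dimensions, step size, noise level, the ring weight matrix) is an assumed illustration, not the paper's experimental setup; it shows the two parts of the update acting together:

```python
import numpy as np

rng = np.random.default_rng(2)
n, D, s = 4, 5, 3            # agents, parameter dimension, measurements per step
alpha = 0.05                 # innovation step size (assumed value)
theta = rng.standard_normal(D)   # unknown parameter (hypothetical)

# Symmetric doubly stochastic weights on a 4-agent ring (one valid choice)
Wmat = 0.5 * np.eye(n) + 0.25 * (np.roll(np.eye(n), 1, axis=0)
                                 + np.roll(np.eye(n), -1, axis=0))

est = np.zeros((n, D))       # each row is one agent's current estimate
for _ in range(2000):
    est = Wmat @ est         # consensus: mix with neighbors' estimates
    for i in range(n):
        H = rng.standard_normal((s, D))            # fresh low-dim observation
        y = H @ theta + 0.01 * rng.standard_normal(s)
        est[i] += alpha * H.T @ (y - H @ est[i])   # innovation: residual correction

err = np.linalg.norm(est - theta, axis=1).max()    # worst local error
```

Although each agent observes only s < D dimensions per step, the consensus term spreads information through the network and all local estimates approach the true parameter.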
III Main Theoretical Results
In this section, we provide our main theoretical results. We show that the local update (5) yields an asymptotically unbiased estimator of the global parameter. Based on this result, we characterize the optimal step size that yields the fastest convergence rate. We then prove that the asymptotic second moment of the collective estimation error is bounded.
III-A First Moment
Let us define the local error for each agent i as

e_i(t) ≜ θ_i(t) − θ,   (6)

where θ_i(t) is agent i's local estimate and θ is the true parameter. Subtracting θ from both sides of the local update (5), we can write the iterative local error process as follows:

e_i(t+1) = Σ_{j=1}^{n} [W]_{ij} e_j(t) − α H_i^⊤(t) H_i(t) e_i(t) + α H_i^⊤(t) v_i(t).

Stacking the local errors in a vector, we denote the global error by e(t) ≜ [e_1^⊤(t), …, e_n^⊤(t)]^⊤.
We now characterize the global error process with the following proposition.
The proof of Proposition 1 is given in the Appendix. It shows that the agents collectively generate estimates of the parameter that are asymptotically unbiased as long as the spectral radius of the error-propagation matrix is less than 1.
III-B Step Size Tuning
According to Proposition 1, the convergence rate depends on the choice of the step size. To speed up the convergence of the process, it is necessary to shrink the spectral radius of the error-propagation matrix as much as possible, which corresponds to minimizing that spectral radius over the step size α. According to Assumption 2, the all-ones vector is the unique (un-normalized) eigenvector of the matrix W associated with the eigenvalue 1, since W is doubly stochastic and the network is connected. The spectral radius is then governed by two eigenvalue curves, one decreasing and one increasing in α. Plotting the two curves in terms of α, we can notice that the optimal α occurs exactly where they intersect, in which case we achieve the fastest convergence rate; this yields the relationship in (13). This result suggests that the convergence is fastest when the smallest eigenvalue λ_n(W) is close to one. Since λ_n(W) is the smallest eigenvalue of W, this would also imply that the other eigenvalues are close to one in this scenario as well. Intuitively, this indicates that W is close to the identity and agents have high self-reliance, i.e., they do not rely heavily on their neighbors. Indeed, W cannot be exactly the identity, since otherwise the connectivity constraint would be violated. Notice that in this paper we are not concerned with network design, i.e., we assume that W is given, and we choose α based on (13) accordingly.
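The intersection argument above can be reproduced numerically. The paper's exact error-propagation matrix is not shown in this excerpt, so as a stand-in we sweep the step size for M(α) = W − αI, which exhibits the same mechanism: a curve decreasing in α (from the eigenvalue 1) crossing a curve increasing in α (driven by the smallest eigenvalue of W):

```python
import numpy as np

# Symmetric doubly stochastic ring with eigenvalues {1, 0.5, 0.5, 0}
Wmat = np.array([[0.50, 0.25, 0.00, 0.25],
                 [0.25, 0.50, 0.25, 0.00],
                 [0.00, 0.25, 0.50, 0.25],
                 [0.25, 0.00, 0.25, 0.50]])

def spectral_radius(M):
    return float(np.max(np.abs(np.linalg.eigvals(M))))

# Sweep the step size and pick the minimizer of the spectral radius
alphas = np.linspace(0.0, 1.0, 1001)
radii = [spectral_radius(Wmat - a * np.eye(4)) for a in alphas]
alpha_star = float(alphas[int(np.argmin(radii))])
```

For this stand-in, ρ(W − αI) = max(|1 − α|, |λ_min(W) − α|), and the minimizer sits exactly where the two curves cross (here α* = 0.5), mirroring the crossing condition described in the text.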
III-C Asymptotic Second Moment
To capture the efficiency of the collective estimation, we should also study the variance of the error, which (asymptotically) amounts to the second moment in view of Proposition 1. In the next theorem, we present an asymptotic upper bound on the second moment for a feasible range of the step size α.
The proof of Theorem 2 is given in the Appendix. It shows that the (asymptotic) expected second moment of the estimation error is bounded by a finite value that scales linearly with respect to the number of agents for a certain range of the step size α. It also gives a condition under which the optimal step size in (13) remains feasible.
IV Numerical Experiments
We now provide empirical evidence in support of our algorithm by applying it to a regression dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction). In this dataset, the input includes a number of attributes such as temperature in the kitchen area, humidity in the kitchen area, temperature in the living room area, humidity in the laundry room area, temperature outside, and pressure. The regression model aims at representing appliances energy use in terms of these features. More details about this dataset can be found in the original reference as well as the UCI Machine Learning Repository. We randomly choose 16000 of its 19735 observations for our simulation.
We consider observation matrices of form (4), where the bases are cosine functions as described in Section II-A: the frequencies come from a multivariate Gaussian distribution and the phases come from a uniform distribution. Without loss of generality, we use five basis functions in the approximation model (3). One can consider other values for the number of basis functions and perform cross-validation to find the best one, but this is outside the scope of this paper, as our focus is on estimation rather than model selection.
Network Structure: We consider a ring-like network of agents, where each agent has access to an observation matrix with a few data points at each time, and each agent is connected to its nearby agents (with a circular shift for indices outside the range). The matrix W assigns a positive weight to each agent's own estimate and smaller weights to its neighbors. The smallest eigenvalue of this network satisfies the step-size constraint in Theorem 2, so we can use the optimal step size (13) for this simulation. Therefore, the step size is set as in (13).
Benchmark: Since this dataset comes from the real world and the ground-truth parameter is unknown, we consider the solution of the centralized problem as the baseline. The local error at time t is then calculated as the difference between the local estimates and the centralized estimate, as given in (6). We run update (5) for enough iterations that the process reaches a steady state. To verify our results, we repeat the update process using Monte-Carlo simulations over the random features to estimate the expectations.
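The centralized baseline can be sketched as a least-squares fit of the coefficients on the random features. The data and variable names below are synthetic stand-ins for the real dataset, used only to illustrate the computation:

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples, d, D = 200, 3, 5

# Synthetic stand-in for the dataset: random cosine features as in Sec. II-A
W = rng.standard_normal((D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)
X = rng.standard_normal((n_samples, d))
Phi = np.cos(X @ W.T + b)                 # n_samples x D feature matrix

a_true = rng.standard_normal(D)
y = Phi @ a_true + 0.01 * rng.standard_normal(n_samples)

# Centralized least-squares solution, used as the benchmark estimate
a_central, *_ = np.linalg.lstsq(Phi, y, rcond=None)
```

Because the basis functions are fixed once sampled, the centralized problem is an ordinary linear least-squares fit, which is what makes the random-feature formulation convex.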
Performance: We visualize the error process in Proposition 1 by presenting the plot of norm- of the expected global error, i.e., the norm- of given in Proposition 1 at . The vertical axis in Fig. 1 represents the average global error obtained by repeating Monte-Carlo simulations to form an estimate of the expected global error. The horizontal axis shows the number of Monte-Carlo simulations indexed by where . As the number of Monte-Carlo simulations increases, the norm- of the average global error will converge to the norm- of the expected global error in Proposition 1. As we can observe, the estimation of the expected global error converges to zero verifying that agents form asymptotically unbiased estimators of the parameter.
We next plot the expected squared norm of the global error given in Theorem 2 at the final iteration. The vertical axis in Fig. 2 represents the squared norm of the global error averaged over Monte-Carlo simulations; the horizontal axis shows the number of Monte-Carlo simulations. As this number increases, the average converges to the expected squared norm of the global error in Theorem 2. The expected squared norm of the global error is upper bounded according to Theorem 2 for this simulation setup, and as we can observe, the average squared norm of the global error always remains below the bound, verifying the accuracy of the upper bound in Theorem 2.
In this paper, we considered a distributed scheme for parameter estimation in randomized one-hidden-layer neural networks. A network of agents exchange local estimates of the parameter, formed using partial observations, to collaboratively identify its true value. Our main contribution is to characterize the behavior of this distributed estimation scheme. We showed that the global estimation error is asymptotically unbiased and its second moment is finite under mild assumptions. Interestingly, our results shed light on the interplay of step size and network structure, which can be used for optimal design in practice. We verified this empirically by applying our method to a real-world dataset. Future directions include studying the estimation problem when the parameter has some dynamics or when the random frequencies are generated from a time-varying distribution. Due to the non-stationary nature of the problem in these two cases, the theoretical analysis becomes challenging and interesting to explore.
For presentation clarity, we use the following definitions in the proofs:
V-A Proof of Proposition 1
To prove Proposition 1, we first need to show that the expected cosine feature is zero for any input. Recall that the phase is drawn uniformly over an interval of length equal to the period of the cosine, independently of the frequency, and thus the expectation of each cosine feature vanishes, since cosine is a periodic function. Therefore, we can conclude that the cross-expectations of cosine features vanish whenever the phase is independent of the frequency and the input. Notice that given the observation model (4), each entry of the relevant matrix can be written as a product of such cosine features.
When , we have according to (17); otherwise, , since for any we have
Hence, , entailing that
Taking expectation over random features on both sides and noting (19), we have
V-B Proof of Theorem 2
where we used the fact , resulting in zero cross-terms in the second line. To further bound , let us recall (21). As and are both symmetric and , we have that
Now, we apply Lemma 3 to bound above as
Then, the largest eigenvalue of can be bounded as follows
We can find the feasible range of through the inequality which ensures that the recursive process (26) will converge.
First, we have the following fact
One can show that when and otherwise.
For the case when , we have the following
Therefore, given , we have that
For the case when , we have the following
Considering the LHS of the last line in (27) as a quadratic function of , one can show that
must be true for (27) to hold. Therefore, the following must be true as well
V-C Statement and Proof of Lemma 3
In the proof, we omit the time and agent indices for presentation clarity. We will show that the matrix under consideration is diagonal and that all of its diagonal entries are uniformly upper bounded.
Let us start by observing that the -th entry of the matrix (for any agent) can be written as
We now consider a single term in the previous summation:
and analyze its expectation case by case.
Case 1: and ( and ).
Case 2: and ( or ).
In this case, three out of four product terms in (31) will include one of the mismatched indices. Then, the expectation of the remaining term will again be zero, since cosine is periodic. Thus, the expectation of (31) will still be zero.
Case 3: and .
Now, (31) will become a product of two expectations of unbiased approximates of the kernel function in view of (2). Thus, the expectation of (31) will become which is less than 1. There are terms of this form in (30), which implies that it is upper bounded by .
Case 4: .
In this case, (31) becomes
So, the expectation of (31) with will become the following where and :
simply because cosine is bounded by 1, and its integral over is equal to zero. Notice that there are terms like above for every , and thus for a specific , the summation of term (31) where is upper bounded by .
We gratefully acknowledge the support of Texas A&M Triads for Transformation (T3) program.
References
[1] J. Friedman, T. Hastie, and R. Tibshirani, The Elements of Statistical Learning. New York, NY, USA: Springer Series in Statistics, 2001, vol. 1.
[2] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[3] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., "Deep speech 2: End-to-end speech recognition in English and Mandarin," in International Conference on Machine Learning, 2016, pp. 173–182.
[4] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," Journal of Machine Learning Research, vol. 17, pp. 1334–1373, 2016.
[5] D. Shen, G. Wu, and H.-I. Suk, "Deep learning in medical image analysis," Annual Review of Biomedical Engineering, vol. 19, pp. 221–248, 2017.
[6] M. Noroozi, A. Vinjimoor, P. Favaro, and H. Pirsiavash, "Boosting self-supervised learning via knowledge transfer," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9359–9367.
[7] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals, and Systems (MCSS), vol. 2, no. 4, pp. 303–314, 1989.
[8] A. R. Barron, "Universal approximation bounds for superpositions of a sigmoidal function," IEEE Transactions on Information Theory, vol. 39, no. 3, pp. 930–945, 1993.
[9] A. Rahimi and B. Recht, "Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning," in Advances in Neural Information Processing Systems, 2009, pp. 1313–1320.
[10] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Artificial Intelligence and Statistics, 2017, pp. 1273–1282.
[11] U. Khan, S. Kar, A. Jadbabaie, J. M. Moura et al., "On connectivity, observability, and stability in distributed estimation," in IEEE Conference on Decision and Control (CDC), 2010, pp. 6639–6644.
[12] S. S. Stanković, M. S. Stanković, and D. M. Stipanović, "Decentralized parameter estimation by consensus based stochastic approximation," IEEE Transactions on Automatic Control, vol. 56, no. 3, pp. 531–543, 2011.
[13] S. Kar, J. M. Moura, and K. Ramanan, "Distributed parameter estimation in sensor networks: Nonlinear observation models and imperfect communication," IEEE Transactions on Information Theory, vol. 58, no. 6, pp. 3575–3605, 2012.
[14] S. Shahrampour and A. Jadbabaie, "Exponentially fast parameter estimation in networks using distributed dual averaging," in IEEE Conference on Decision and Control (CDC), 2013, pp. 6196–6201.
[15] N. Atanasov, R. Tron, V. M. Preciado, and G. J. Pappas, "Joint estimation and localization in sensor networks," in IEEE Conference on Decision and Control (CDC), 2014, pp. 6875–6882.
[16] A. Mitra and S. Sundaram, "An approach for distributed state estimation of LTI systems," in 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2016, pp. 1088–1093.
[17] A. Jadbabaie, J. Lin, and A. S. Morse, "Coordination of groups of mobile autonomous agents using nearest neighbor rules," IEEE Transactions on Automatic Control, vol. 48, no. 6, pp. 988–1001, 2003.
[18] S. Shahrampour, A. Rakhlin, and A. Jadbabaie, "Distributed estimation of dynamic parameters: Regret analysis," in American Control Conference (ACC), 2016, pp. 1066–1071.
[19] A. Rahimi and B. Recht, "Random features for large-scale kernel machines," in Advances in Neural Information Processing Systems, 2008, pp. 1177–1184.
[20] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 2012.
[21] L. M. Candanedo, V. Feldheim, and D. Deramaix, "Data driven prediction models of energy use of appliances in a low-energy house," Energy and Buildings, vol. 140, pp. 81–97, 2017.
[22] S. Shahrampour, A. Rakhlin, and A. Jadbabaie, "Online learning of dynamic parameters in social networks," in Advances in Neural Information Processing Systems, 2013.