## I Introduction

We consider decentralized online optimization problems: a network of agents aims to minimize a global objective that is a sum of local convex objectives available only to each node. The problem is online and distributed because data samples upon which the local objectives depend are sequentially and locally observed by each agent. In this setting, agents aim to make inferences as well as one which has access to all data at a centralized location in advance. Instead of assuming agents seek a common parameter vector , we focus on the case where agents seek to learn a common *decision function* that belong to a reproducing kernel Hilbert space (RKHS). Such functions represent, e.g., nonlinear statistical models [2] or trajectories in a continuous space [3]

. Learning in multi-agent settings arises predominately in two technological settings: industrial-scale machine learning, where optimizing statistical model parameters is decentralized across a parallel processing architecture to attain computational speedup; and networked intelligent systems such as sensor networks

[4], multi-robot teams [5, 6], and Internet of Things [7, 8]. In the later setting, decentralized processing justified as opposed to using a fusion center when the communication cost of centralization exceeds the cost of distributed information protocols. This is true of multi-agent systems with streaming data considered here.Efforts to develop optimization tools for multi-agent online learning have thus far been restricted to the case where each agent learns a linear statistical model [9] or a task-driven dictionary [10]

that is as good as one with data aggregated across the network. However, these efforts exclude the state of the art tools for statistical learning based on nonlinear interpolators: namely, kernel methods

[11, 12]and neural networks

[13, 14]. We note that instabilities associated with non-convexity which are only a minor issue in centralized settings [15] become both theoretically and empirically difficult to overcome in settings with consensus constraints [10], and therefore efforts to extend neural network learning to multi-agent online learning likely suffer the same drawbacks.^{1}

^{1}1In general, globally convergent decentralized online training of neural networks is an open problem, whose solution requires fundamentally new approaches to stochastic global optimization. Therefore, we focus on extending kernel methods to decentralized online settings, motivated both by its advantageous empirical performance, as well as the theoretical and practical benefits of the fact that the optimization problem defined by their training is convex. This stochastic convex problem, however, is defined over an infinite dimensional space, and therefore it is not enough to solve the optimization problem, but one must also solve it in an optimally sparse way. Doing so in multi-agent settings is the goal of this work.

To contextualize our solution methodology, consider centralized vector-valued stochastic convex programming, which has classically been solved with stochastic gradient descent (SGD) [16]. SGD involves descending along the negative of the stochastic gradient rather than the true gradient to avoid the fact that computing the gradient of the average objective has complexity comparable to the training sample size, which could be infinite. In contrast, the setting considered in this work is a stochastic program defined over a function space, which is in general an intractable variational inference problem. However, when the function space is a RKHS [17], the Representer Theorem allows us to transform a search over an infinite space into one over a set of weights and data samples [18]. Unfortunately, the feasible set of the resulting problem has complexity comparable to the sample size , and thus is intractable for [19]. Compounding this problem is that the storage required to construct the functional generalization of SGD is comparable to the iteration index of the algorithm, which is untenable for online settings.

Efforts to mitigate the complexity of the function representation (“the curse of kernelization”) have been previously developed. These combine functional extensions of stochastic gradient method with compressions of the function parameterization independently of the optimization problem to which they are applied [20, 21, 22, 23, 24] or approximate the kernel during training [25, 26, 27, 28, 29], and at best converge on average. In contrast, a method was recently proposed that combines greedily constructed [30] sparse subspace projections with functional stochastic gradient method and guarantees exact convergence to the minimizer of the average risk functional. This technique, called parsimonious online learning with kernels (POLK), tailors the parameterization compression to preserve the descent properties of the underlying RKHS-valued stochastic process [31], and inspires the approach considered here.

In this work, we extend the ideas in [31] to multi-agent settings. Multiple tools from distributed optimization may be used to do so; however, we note that the Representer Theorem [18] has not been established for general stochastic saddle point problems in RKHSs. Therefore, we adopt an approximate primal-only approach based on penalty methods [32, 33]

, which in decentralized optimization is known as distributed gradient descent (DGD). Using functional stochastic extensions of DGD, together with the greedy Hilbert subspace projections designed in POLK, we develop a method such that each agent, through its local data stream and message passing with only its neighbors, learns a memory-efficient approximation to the globally optimal regression function with probability

. Such global stability guarantees are in contrast to specialized results for multi-agent kernel learning [34, 35] and alternative distributed online nonlinear function estimation methods such as dictionary learning [15, 36, 10] or neural networks [14], which suffer from instability due to the non-convexity of the optimization problem their training defines.The result of the paper is organized as follows. In Section II we clarify the problem setting of stochastic programming in RKHSs in the centralized and decentralized case. In Section III, we propose a new penalty functional that permits deriving a decentralized online method for kernel regression without any complexity bottleneck by making use of functional stochastic gradient method (Section III-A) combined with greedy subspace projections (Section III-B). In Section IV we present our main theoretical results, which establishes that the function sequence of each agent generated by the proposed technique converges to a neighborhood of the globally optimal function with probability . In Section V

, we present numerical examples of decentralized online multi-class kernel logistic regression and kernel support vector machines with data generated from Gaussian mixtures, and observe a state of the art trade-off between Lyapunov stability and statistical accuracy. We then apply the resulting method to the benchmark Brodatz texture dataset

[37] and observe state of the art decentralized online multi-class classification performance.## Ii Problem Formulation

### Ii-a Decentralized Functional Stochastic Programming

Consider the problem of expected risk minimization, where the goal is to learn a regressor that minimizes a loss function quantifying the merit of a statistical model averaged over a data set. We focus on the case when the number of training examples

is very large or infinite. In this work, input-output examples,, are i.i.d. realizations drawn from a stationary joint distribution over the random pair

, where and . Here, we consider finding regressors that are not vector valued parameters, but rather functions in a hypothesized function class , which allows for learning nonlinear statistical models rather than generalized linear models that rarely achieve satisfactory statistical error rates in practice [38, 12]. The merit of the function is evaluated by the convex loss function that quantifies the merit of the estimator evaluated at feature vector . This loss is averaged over all possible training examples to define the statistical loss , which we combine with a Tikhonov regularizer to construct the regularized loss [39, 40]. We then define the optimal function as(1) |

In this work, we focus on extensions of the formulation in (1) to the case where data is scattered across an interconnected network that represents, for instance, robotic teams [10], communication systems [41], or sensor networks [4]. To do so, we define a symmetric, connected, and directed network with nodes and edges and denote as the neighborhood of agent . For simplicity we assume that the number of edges is even. Each agent observes a local data sequence as realizations from random pair and seeks to learn a common globally optimal regression function . This setting may be mathematically captured by associating to each node a convex loss functional that quantifies the merit of the estimator evaluated at feature vector , and defining the goal for each node as the minimization of the common global loss

(2) |

Observe that this global loss is a network-wide average (scaled by ) of all local losses, and therefore the minimizers of (1) and (2) coincide when have a common joint distribution for each . However, in multi-agent optimization, this is not generally the case, thus when selecting a regression function with only local data, different agents will learn a different decision function that it is not optimal as compared to one selected in a centralized manner, i.e., with the data gathered by all agents. To overcome this limitation we allow message passing between agents and we impose a consensus constraint on the regression function among neighbors . Thus we consider the nonparametric decentralized stochastic program:

(3) |

For further define the product Hilbert space of functions aggregated over the network whose elements are stacked functions that yield vectors of length when evaluated at local random vectors . Moreover, define the stacked random vectors and that represents labels or physical measurements, for instance.

The goal of this paper is to develop an algorithm to solve (II-A) in distributed online settings where nodes do not know the distribution of the random pair but observe local independent training examples sequentially.

### Ii-B Function Estimation in Reproducing Kernel Hilbert Spaces

The optimization problem in (1), and hence (II-A), is intractable in general, since it defines a variational inference problem integrated over the unknown joint distribution .
However, when is equipped with a *reproducing kernel* (see [42, 12]), a function estimation problem of the form (1) may be reduced to a parametric form via the Representer Theorem [43, 19].
Thus, we restrict the Hilbert space in Section II-A to be one equipped with a kernel that satisfies for all functions in :

(4) |

for all . Here denotes the Hilbert inner product for . Further assume that the kernel is positive semidefinite, i.e. for all . Function spaces of this type are called reproducing kernel Hilbert spaces (RKHS).

In (4), property (i) is the reproducing property (via Riesz Representation Theorem [43]). Replacing by in (4) (i) yields which is the origin of the term “reproducing kernel.” This property induces a nonlinear transformation of the input space : denote by a nonlinear map of the feature space that assigns to each the kernel function . The reproducing property yields that the inner product of the image of distinct feature vectors and under the map requires only kernel evaluations: (the ’kernel trick’).

Moreover, property (4) (ii) states that functions may be written as a linear combination of kernel evaluations. For kernelized and regularized empirical risk minimization (ERM), the Representer Theorem [17, 18] establishes that the optimal in hypothesized function class admit an expansion in terms of kernel evaluations *only* over training examples

(5) |

where denotes a set of weights. The upper index in (5) is referred to as the model order, and for ERM the model order and training sample size are equal. Common choices include the polynomial and radial basis kernels, i.e., and , respectively, where .

Suppose, for the moment, that we have access to

i.i.d. realizations of the random pairs for each agent such that the expectation in (II-A) is computable, and we further ignore the consensus constraint. Then the objective in (II-A) becomes:(6) |

Then, by substituting the Representer Theorem [cf. (5)] into (II-A), we obtain that optimizing in reduces to optimizing over the set of weights:

(7) |

where we have defined the Gram (or kernel) matrix , with entries given by the kernel evaluations between and as . We further define the vector of kernel evaluations , which are related to the kernel matrix as . The dictionary of training points associated with the kernel matrix is defined as .

By exploiting the Representer Theorem, we transform a nonparametric infinite dimensional optimization problem in (6) into a finite -dimensional parametric problem (7). Thus, for empirical risk minimization, the RKHS provides a principled framework to solve nonparametric regression problems as a search over for an optimal set of coefficients.

However, to solve problems of the form (6) when training examples become sequentially available or their total number is not finite, the objective in (6) becomes an expectation over random pairs as [11]

(8) |

where we substitute the Representer Theorem generalized to the infinite sample-size case established in [19] into the objective (II-A) with as some countably infinite indexing set. That is, as the data sample size , the representation of becomes infinite as well. Thus, our goal is to solve (II-B) in an approximate manner such that each admits a finite representation near , while satisfying the consensus constraints for (which were omitted for the sake of discussion between (6) - (II-B)).

## Iii Algorithm Development

We turn to developing an online iterative and decentralized solution to solving (II-A) when the functions are elements of a RKHS, as detailed in Section II-B. To exploit the properties of this function space, we require the applicability of the Representer Theorem [cf. (5)], but this result holds for any regularized minimization problem with a convex functional. Thus, we may address the consensus constraint in (II-A) by enforcing approximate consensus on estimates in expectation. This specification may be met by introducing the penalty functional

(9) |

The reasoning for the definition (III) rather than one that directly addresses the consensus constraint deterministically is given in Remark 1, motivated by following the algorithm derivation. For future reference, we also define the local penalty as

(10) |

and we observe from (III) - (III) that . Further define . We note that in the vector-valued decision variable case, other techniques to address the constraint in (II-A) are possible such as primal-dual methods [9] or dual methods [44], but the Representer Theorem has not been established for RKHS-valued stochastic saddle point problems. It is an open question whether expressions of the form (5) apply to problems with general functional constraints, but this matter is beyond the scope of this work. Therefore, these other approaches which make use of Lagrange duality do not readily extend to the nonparametric setting considered here.

### Iii-a Functional Stochastic Gradient Method

Given that the data distribution is unknown, minimizing directly via variational inference is not possible. Rather than postulate a specific distribution for , we only assume access to sequentially available (streaming) independent and identically distributed samples from their joint density. Then, we may wield tools from stochastic approximation to minimize (III), which in turn yields a solution to (II-A). Begin by defining, , the stochastic approximation of the penalty function , evaluated at a realization of the stacked random pair :

(11) |

and the local instantaneous penalty function similarly. To compute the functional stochastic gradient of evaluated at a sample point , we first address the local loss in (III-A) as [22, 31]:

(12) |

where we have applied the chain rule. Now, define the short-hand notation

for the derivative of with respect to its first scalar argument evaluated at . To evaluate the second term on the right-hand side of (12), differentiate both sides of the expression defining the reproducing property of the kernel [cf. (4)(i)] with respect to to obtain

(13) |

Then, given (12) - (13), we may compute the overall gradient of the instantaneous penalty function in (III-A) as

(14) | ||||

where on the right-hand side of (14), we have defined the vector stacking notation to denote the stacking of component-wise functional gradients, each associated with function , , and used the fact that the variation of the instantaneous approximate of the cross-node term, , by the same reasoning as (12) - (13), is . With this computation in hand, we present the stochastic gradient method for the -regularized multi-agent expected risk minimization problem in (II-A) as

(15) |

where is an algorithm step-size either chosen as diminishing with or a small constant – see Section IV. We may glean from (III-A) that the update for the network-wide function decouples into ones for each agent , using the node-separability of the penalty , i.e.,

(16) |

We further require that, given , the step-size satisfies and the global sequence is initialized as . With this initialization, the Representer Theorem (5) implies that, at time , the function admits an expansion in terms of feature vectors observed thus far as

(17) |

On the right-hand side of (17) we have introduced the notation , , and . Moreover, observe that the kernel expansion in (17), taken together with the functional update (III-A), yields the fact that performing the stochastic gradient method in amounts to the following parallel parametric updates on the kernel dictionaries and coefficients :

(18) | ||||

where the second case on the last line of (18) is for .
Observe that this update causes to have one more column than . We define the *model order* as number of data points in the dictionary of agent at time (the number of columns of ). FSGD is such that , and hence grows unbounded with iteration index . Next we address this intractable memory growth such that we may execute stochastic descent through low-dimensional projections of the stochastic gradient, inspired by [31]. First, we clarify the motivation for the choice of the penalty function (III).

###### Remark 1.

In principle, it is possible to address the RKHS-valued consensus constraint in (II-A) directly, through primal-only stochastic methods, by introducing the penalty function

(19) |

Observe, however, that FSGD applied to (19), using comparable reasoning to that which leads to (III-A) from (III), yields

(20) |

Unfortunately, we cannot inductively define a parametric representation of (1) for node in terms of its own kernel dictionaries and weights independently of the *entire function* associated to node , since the last term in (1) lives directly in the Hilbert space. Thus, to implement (1) each agent would need to store the entire kernel dictionary and weights of all its neighbors at each step, which is impractically costly. The use of (III) rather than (19) is further justified that under a hypothesis regarding the mean transformation of the local data spaces, , consensus with respect to the Hilbert norm, in addition to the mean square sense, is achieved when the penalty coefficient is (see Section IV for details).

### Iii-B Sparse Subspace Projections

To mitigate the complexity growth noted in Section III-A, we approximate the function sequence (III-A) by one that is orthogonally projected onto subspaces that consist only of functions that can be represented using some dictionary , i.e., , and . For convenience we define , and as the resulting kernel matrix from this dictionary. We enforce function parsimony by selecting dictionaries with for each [31].

To be specific, we propose replacing the local update (III-A) in which the dictionary grows at each iteration by its projection onto subspace as

(21) |

where we define the projection operator onto subspace by the update (III-B).

Coefficient update The update (III-B), for a fixed dictionary , yields one in the coefficient space only. This fact may be observed by defining the un-projected stochastic gradient step starting at function parameterized by dictionary and coefficients :

(22) |

This update may be represented using dictionary and weights

(23) | ||||

where the last coefficient is for . Note that has columns, which is also the length of . For a fixed , the stochastic projection (III-B) is a least-squares update on the coefficient vector: the Representer Theorem allows us to rewrite (III-B) in terms of kernel expansions as in Section 3.2 of [31], which yields

(24) |

where we define the cross-kernel matrix whose entry is given by . The other kernel matrices and are defined similarly. Observe that is the number of columns in , while is the number of columns in [cf. (23)]. Given that the local projections of onto stochastic subspaces , for a fixed node-specific dictionaries , is a least-squares problem, we now detail the kernel dictionary selection from past data .

Dictionary Update The selection procedure for the kernel dictionary is based upon greedy compression [45]: function defined by the stochastic gradient method without projection is parameterized by dictionary [cf. (23)] of model order . We form by selecting a subset of columns from that best approximate in terms of Hilbert norm error, which may be done by executing *kernel orthogonal matching pursuit* (KOMP) [30, 46] with error tolerance to find a kernel dictionary matrix based on the one which adds the latest sample point . This choice is due to the fact that we can tune its stopping criterion to guarantee stochastic descent, and guarantee the model order of the learned function remains finite – see Section IV for details.

We now describe the variant of KOMP we propose using, called Destructive KOMP with Pre-Fitting (see [46], Section 2.3). Begin with an input a candidate function of model order parameterized by kernel dictionary and coefficients . The method then approximates by a function with a lower model order. Initially, this sparse approximation is the original function so that its dictionary is initialized with that of the original function , with corresponding coefficients . Then, the algorithm sequentially removes dictionary elements from the initial dictionary , yielding a sparse approximation of , until the error threshold is violated, in which case it terminates. See Appendix A for further details.

We summarize the key steps of the proposed method in Algorithm 1 for solving (II-A) while maintaining a finite model order, thus allowing for the memory-efficient learning of nonparametric regression functions online in multi-agent systems. The method, Greedy Projected Penalty Method, executes the stochastic projection of the functional stochastic gradient iterates onto sparse subspaces stated in (III-B). Initial functions are set to null , i.e., it has empty dictionary and coefficient vector . The notation is used to denote the empty matrix or vector respective size or . Then, at each step, given an independent training example and step-size , we compute the *unconstrained* functional stochastic gradient iterate (22) with respect to the instantaneous penalty function (III-A) which admits the parameterization and as stated in (23). These parameters are then fed into KOMP with approximation budget , such that .

## Iv Convergence Analysis

We turn to establishing that the method presented in Algorithm 1 converges with probability to the minimizer of the penalty function [cf. (III)] when attenuating algorithm step-sizes are used, and to a neighborhood of the minimizer along a subsequence when constant step-sizes are used. Moreover, for the later case, the kernel dictionary that parameterizes the regression function for each agent remains finite in the worst case. This analysis is an application of Section IV of [31], but these results, together with the properties of the penalty function allow us to establish bounds on the deviation for each individual in the network from the common globally optimal regression function.

Before analyzing the proposed method developed in Section III, we define key quantities to simplify the analysis and introduce standard assumptions which are necessary to establish convergence. Define the local projected stochastic functional gradient associated with the update in (III-B) as

(25) | ||||

such that the local update of Algorithm 1 [cf. (III-B)] may be expressed as a stochastic descent using projected functional gradients The definitions of (25) and the local stochastic gradient may be stacked to analyze the global convergence behavior of the algorithm. For further reference, we define the stacked projected functional stochastic gradient of the penalty function as . Then the stacked global update of the algorithm is

(26) |

Moreover, observe that the stochastic functional gradient in (14), based upon the fact that are independent and identically distributed realizations of the random pair

, is an unbiased estimator of the true functional gradient of the penalty function

in (III), i.e.(27) |

for all . In (27), we denote as the sigma algebra which measures the algorithm history for times , i.e. . Next, we formally state technical conditions on the loss functions, data domain, and stochastic approximation errors that are necessary to establish convergence.

###### Assumption 1.

The feature space and target domain are compact, and the kernel map may be bounded as