
# COKE: Communication-Censored Kernel Learning for Decentralized Non-parametric Learning

This paper studies the decentralized optimization and learning problem where multiple interconnected agents aim to learn an optimal decision function defined over a reproducing kernel Hilbert (RKH) space by jointly minimizing a global objective function, with access to locally observed data only. As a non-parametric approach, kernel learning faces a major challenge in distributed implementation: the decision variables of local objective functions are data-dependent with different sizes and thus cannot be optimized under the decentralized consensus framework without any raw data exchange among agents. To circumvent this major challenge and preserve data privacy, we leverage the random feature (RF) approximation approach to map the large-volume data represented in the RKH space into a smaller RF space, which facilitates the same-size parameter exchange and enables distributed agents to reach consensus on the function decided by the parameters in the RF space. For fast convergent implementation, we design an iterative algorithm for Decentralized Kernel Learning via Alternating direction method of multipliers (DKLA). Further, we develop a COmmunication-censored KErnel learning (COKE) algorithm to reduce the communication load in DKLA. To do so, we apply a communication-censoring strategy, which prevents an agent from transmitting at every iteration unless its local updates are deemed informative. Theoretical results in terms of linear convergence guarantee and generalization performance analysis of DKLA and COKE are provided. Comprehensive tests with both synthetic and real datasets are conducted to verify the communication efficiency and learning effectiveness of COKE.


## 1 Introduction

Decentralized learning has attracted extensive interest in recent years, largely due to the explosion of data generated every day from mobile sensors, social media services, and other networked multi-agent applications (Worden and Manson, 2006; Ilyas et al., 2013; Facchinei et al., 2015; Demarie and Sabia, 2019). In many of these applications, the observed data are usually kept private at local sites without being aggregated to a fusion center, either due to the prohibitively high cost of raw data transmission or privacy concerns. Meanwhile, each agent in the network only communicates with its one-hop neighbors within its local area to save transmission power. Such localized data processing and transmission preclude the implementation of any centralized learning technique. Under this circumstance, this article focuses on the decentralized learning problem where a network of distributed agents aims to collaboratively learn a functional model describing the global data, with each agent having access only to its own locally observed dataset.

To learn the functional model that is often nonlinear and complex, nonparametric kernel methods are widely appreciated thanks to the “kernel trick” that makes some well-behaved linear learning algorithms applicable in a high-dimensional implicit feature space, without explicit mapping from data to that feature space (Shawe-Taylor et al., 2004; Hofmann et al., 2008; Pérez-Cruz and Bousquet, 2004). However, it is challenging to apply them directly to a decentralized multi-agent setting and solve the resulting problem under the consensus optimization framework, using algorithms such as the decentralized alternating direction method of multipliers (ADMM) (Shi et al., 2014), without any raw data sharing or aggregation. This is because decentralized learning relies on solving local optimization problems and then aggregating the updates on the local decision variables over the network through one-hop communications in an iterative manner (Nedić et al., 2016). Unfortunately, the decision variables of the local objective functions resulting from the kernel trick are data-dependent and thus cannot be optimized in the absence of raw data exchange under the decentralized consensus framework.

There are several works applying kernel methods in decentralized learning for various applications under different settings (Predd et al., 2006; Mitra and Bhatia, 2014; Gao et al., 2015; Chouvardas and Draief, 2016; Shin et al., 2016, 2018; Koppel et al., 2018). These works, however, either assume that agents have access to their neighbors’ observed raw data (Predd et al., 2006) or require agents to transmit their raw data to their neighbors (Koppel et al., 2018) to ensure consensus through collaborative learning. These assumptions may not be valid in many practical applications that involve users’ private data. Moreover, standard kernel learning for big data faces the curse of dimensionality when the number of training examples increases (Shawe-Taylor et al., 2004). For example, in (Mitra and Bhatia, 2014; Chouvardas and Draief, 2016), the nonlinear function learned at each node is represented as a weighted combination of kernel functions centered on its local observed data. As a result, each agent needs to transmit both the weights of kernel functions and its local data to its neighbors at every iterative step to guarantee consensus of the common prediction function. Thus, both the computation and communication resources are demanding in the distributed implementation. Although Gao et al. (2015) and Koppel et al. (2018) have developed techniques such as data selection and sparse subspace projection, respectively, to alleviate the curse of dimensionality problem, these techniques typically incur considerable extra computation in addition to the data privacy concern. Furthermore, when computation cost is more affordable than the communication in the big data scenario, communication cost of the iterative learning algorithms becomes the bottleneck for efficient distributed learning (McMahan et al., 2016). Therefore, it is crucial to design communication-efficient distributed kernel learning algorithms that preserve privacy.

### 1.1 Related work

This work lies at the intersection of centralized non-parametric kernel methods, decentralized learning, and communication-efficient implementation. Related work to these three subjects is reviewed below.

Centralized kernel methods. Kernel methods have been widely applied in centralized learning problems where data are assumed to be collected and processed by a single server and are known to suffer from the curse of dimensionality for large-scale learning tasks. To mitigate the computational complexity of kernel methods, various techniques are developed, including stochastic approximation (Bucak et al., 2010; Dai et al., 2014; Gu et al., 2018), restricting the number of function parameters (Gomes and Krause, 2010; Wang et al., 2012; Zhang et al., 2013; Le et al., 2016; Koppel et al., 2017), and approximating the kernel during training (Honeine, 2015; Drineas and Mahoney, 2005; Lu et al., 2016; Sheikholeslami et al., 2018; Rahimi and Recht, 2008; Băzăvan et al., 2012; Nguyen et al., 2017). Among them, random feature (RF) mapping methods have gained popularity thanks to their ability to map the large-scale data into a RF space of much reduced dimension by approximating the kernel with a fixed (small) number of random features, which thus circumvents the curse of dimensionality problem (Rahimi and Recht, 2008; Băzăvan et al., 2012; Nguyen et al., 2017). Enforcing orthogonality on random features can greatly reduce the error in kernel approximation (Yu et al., 2016; Shen et al., 2018), and the learning performance of RF-based methods is evaluated in (Bach, 2017; Rudi and Rosasco, 2017; Li et al., 2018).

Decentralized kernel learning. For the decentralized kernel learning problem relevant to our work (Mitra and Bhatia, 2014; Gao et al., 2015; Chouvardas and Draief, 2016; Koppel et al., 2018), gradient descent is conducted locally for each agent to update its learning model, followed by diffusion-based information exchange among agents. However, these methods either assume that agents have access to their neighbors’ observed raw data or require agents to transmit their raw data to their neighbors to ensure convergence on the prediction function. For the problem studied in this article, where the observed data are only locally available, these methods are not applicable since there are no common decision parameters for consensus without any raw data exchange. Moreover, these methods still encounter the curse of dimensionality when the local dataset grows large. Though data selection (Gao et al., 2015) and subspace projection (Koppel et al., 2018) are adopted to alleviate the curse of dimensionality problem, they typically require significant extra computational resources. The RF mapping (Rahimi and Recht, 2008) offers a viable approach to overcome these issues, by having all agents map their datasets of various sizes onto the same RF space. For instance, Bouboulis et al. (2018) propose a diffusion-based combine-then-adapt (CTA) method that achieves consensus on the model parameters in the RF space for the online learning problem, without the exchange of raw data. However, the convergence speed of the diffusion-based method is relatively slow compared with higher-order methods such as ADMM (Liu et al., 2019).

Communication-efficient optimization. Communication-efficient algorithms for decentralized optimization and learning problems have attracted attention when data movement among computing nodes becomes a bottleneck due to the high latency and limited bandwidth of decentralized networks. To reduce the communication cost, one way is to transmit the compressed information by quantization (Zhu et al., 2016; Alistarh et al., 2017; Zhang et al., 2019) or sparsification (Stich et al., 2018; Alistarh et al., 2018; Wangni et al., 2018). However, these methods only reduce the required bandwidth at each communication round, not the number of rounds or the number of transmissions. Alternatively, some works randomly select a number of nodes for broadcasting and operate asynchronous updating to reduce the number of transmissions per iteration (Mota et al., 2013; Li et al., 2014; Jaggi et al., 2014; McMahan et al., 2016; Yin et al., 2018; Yu et al., 2019). In contrast to random nodes selection, a more intuitive way is to evaluate the importance of a message in order to avoid unnecessary transmissions (Chen et al., 2018; Liu et al., 2019; Li et al., 2019b). This is usually implemented by adopting a censoring scheme to adaptively decide if a message is informative enough to be transmitted during the iterative optimization process. Other efforts to improve the communication efficiency are made by accelerating the convergence speed of the iterative algorithm implementation (Shamir et al., 2014; Reddi et al., 2016; Li et al., 2019a).

### 1.2 Contributions

This paper develops communication-efficient privacy-preserving decentralized kernel learning algorithms under the consensus optimization framework. Relative to prior art, our contributions are summarized as follows.

• We first formulate the decentralized multi-agent kernel learning problem as a decentralized consensus optimization problem in the RF space. Since most machine learning scenarios can afford plentiful computational capability but only limited communication resources, we solve this problem with ADMM, which has shown fast convergence at the expense of relatively high computation cost per iteration (Shi et al., 2014). To the best of our knowledge, this is the first work to solve the decentralized kernel learning problem in the RF space by ADMM without any raw data exchange, which preserves privacy. The key to our proposed Decentralized Kernel Learning via ADMM (DKLA) algorithm is to apply the RF mapping, which not only reduces the computational complexity but also enables consensus on a set of model parameters of fixed size in the RF space.

• To increase the communication efficiency, we further develop a COmmunication-censored KErnel learning (COKE) algorithm, which achieves desired learning performance given limited communication resources and energy supply. Specifically, we devise a simple yet powerful censoring strategy to allow each user to autonomously skip unnecessary communications when its local update is not informative enough for transmission, without aid of a central coordinator. In this way, the communication efficiency can be boosted at almost no sacrifice of the learning performance. When the censoring strategy is absent, COKE degenerates to DKLA.

• In addition, we conduct theoretical analysis in terms of both functional convergence and generalization performance to provide guidelines for practical implementations of our proposed algorithms. We show that the functionals learned individually at each agent through DKLA and COKE both converge to the optimal one at a linear rate under mild conditions. For the generalization performance, we characterize how many random features are sufficient to ensure a target learning risk for the decentralized kernel ridge regression problem, in terms of the number of effective degrees of freedom that will be defined in Section 4.2.

• Finally, we test the performance of our proposed DKLA and COKE algorithms on both synthetic and real datasets. The results corroborate that both DKLA and COKE exhibit attractive learning performance and COKE is highly communication-efficient.

### 1.3 Organization and notation of the paper

Organization. Section 2 formulates the problem of non-parametric learning and highlights the challenges in applying traditional kernel methods in the decentralized setting. Section 3 develops the decentralized kernel learning algorithms, including both DKLA and COKE. Section 4 presents the theoretical results and Section 5 reports the numerical tests using both synthetic data and real datasets. Concluding remarks are provided in Section 6.

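Notation. $\mathbb{R}$ denotes the set of real numbers. $\|\cdot\|_2$ denotes the Euclidean norm of vectors and $\|\cdot\|_F$ denotes the Frobenius norm of matrices. $|\cdot|$ denotes the cardinality of a set. A bold uppercase letter such as $\mathbf{A}$ denotes a matrix, a bold lowercase letter such as $\mathbf{a}$ denotes a vector, and a plain letter such as $a$ denotes a scalar.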

## 2 Problem Statement

This section reviews basics of kernel-based learning and decentralized optimization, introduces notation, and provides background needed for our novel DKLA and COKE schemes.

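Consider a network of $N$ agents interconnected over a fixed topology $\mathcal{G} = (\mathcal{N}, \mathcal{E}, \mathbf{A})$, where $\mathcal{N} = \{1, \dots, N\}$, $\mathcal{E} \subseteq \mathcal{N} \times \mathcal{N}$, and $\mathbf{A} \in \mathbb{R}^{N \times N}$ denote the agent set, the edge set, and the adjacency matrix, respectively. The elements of $\mathbf{A}$ are $a_{in} = a_{ni} = 1$ when the unordered pair of distinct agents $(i, n) \in \mathcal{E}$, and $a_{in} = a_{ni} = 0$ otherwise. For agent $i$, its one-hop neighbors are in the set $\mathcal{N}_i = \{n \mid (n, i) \in \mathcal{E}\}$. The term agent used here can be a single computational system (e.g. smart phone, database, etc.) or a collection of co-located computational systems (e.g. data centers, computer clusters, etc.). Each agent $i$ only has access to its locally observed data composed of $T_i$ input-label pairs $\{(\mathbf{x}_{i,t}, y_{i,t})\}_{t=1}^{T_i}$ that are independently and identically distributed (i.i.d.) samples obeying an unknown probability distribution $p$ on $\mathcal{X} \times \mathcal{Y}$, with $\mathbf{x}_{i,t} \in \mathcal{X}$ and $y_{i,t} \in \mathcal{Y}$. The kernel learning task is to find a prediction function $f$ that best describes the ensemble of all data from all agents. Suppose that $f$ belongs to the reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ induced by a positive semidefinite kernel $\kappa(\mathbf{x}, \mathbf{x}')$ that measures the similarity between $\mathbf{x}$ and $\mathbf{x}'$, for all $\mathbf{x}, \mathbf{x}' \in \mathcal{X}$. In a decentralized setting, this means that each agent has to be able to learn the global function $f \in \mathcal{H}$ such that $y_{i,t} = f(\mathbf{x}_{i,t}) + e_{i,t}$ for all $i$ and $t$, without exchange of any raw data and in the absence of a fusion center, where the error terms $e_{i,t}$ are minimized according to a chosen optimality metric.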

To evaluate the learning performance, a nonnegative loss function $\ell(f(\mathbf{x}), y)$ is utilized to measure the difference between the true label value $y$ and the predicted value $f(\mathbf{x})$. For regression problems, a common loss function is the quadratic loss $\ell(f(\mathbf{x}), y) = (y - f(\mathbf{x}))^2$, and the risk is the mean-squared error (MSE). For binary classification, the common loss functions are the hinge loss $\ell(f(\mathbf{x}), y) = \max(0, 1 - y f(\mathbf{x}))$ and the logistic loss $\ell(f(\mathbf{x}), y) = \log(1 + e^{-y f(\mathbf{x})})$, all of which are convex with respect to $f(\mathbf{x})$. The learning problem is then formulated to minimize the expected risk of the prediction function:

$$R(f) = \int_{\mathcal{X} \times \mathcal{Y}} \ell(f(\mathbf{x}), y)\, dp(\mathbf{x}, y), \tag{1}$$

which indicates the generalization ability of $f$ to new data.
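As a small illustration of the losses named above, the following Python sketch evaluates the quadratic, hinge, and logistic losses and averages them over a few samples. The function names and the toy predictions are illustrative assumptions, and the hinge and logistic forms are the standard textbook definitions for labels in {-1, +1}, not code from the paper.

```python
import numpy as np

def quadratic_loss(pred, y):
    # Squared error between label and prediction.
    return (y - pred) ** 2

def hinge_loss(pred, y):
    # Standard hinge loss for labels in {-1, +1}.
    return np.maximum(0.0, 1.0 - y * pred)

def logistic_loss(pred, y):
    # log(1 + exp(-y * pred)), evaluated in a numerically stable way.
    return np.logaddexp(0.0, -y * pred)

def sample_average(loss, pred, y):
    # Sample average of a loss: the data-fitting part of an empirical risk.
    return float(np.mean(loss(pred, y)))

pred = np.array([0.5, -2.0, 1.5])   # hypothetical predictions f(x)
y = np.array([1.0, -1.0, -1.0])     # hypothetical labels

mse = sample_average(quadratic_loss, pred, y)
hinge = sample_average(hinge_loss, pred, y)
logistic = sample_average(logistic_loss, pred, y)
```

All three losses are convex in the prediction, which is the property the later convergence statements rely on.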

However, the distribution $p$ is unknown in most learning tasks, so $R(f)$ cannot be minimized directly. Instead, given the finite number of training examples, the problem turns to minimizing the empirical risk:

$$\min_{f \in \mathcal{H}} \hat{R}(f) := \sum_{i=1}^{N} \hat{R}_i(f), \tag{2}$$

where $\hat{R}_i(f)$ is the local empirical risk for agent $i$ given by

$$\hat{R}_i(f) = \frac{1}{T_i} \sum_{t=1}^{T_i} \ell(f(\mathbf{x}_{i,t}), y_{i,t}) + \lambda_i \|f\|_{\mathcal{H}}^2, \tag{3}$$

with $\|\cdot\|_{\mathcal{H}}$ being the norm associated with $\mathcal{H}$, and $\lambda_i > 0$ being a regularization parameter that controls over-fitting.

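The Representer Theorem states that the minimizer of a regularized empirical risk functional defined over an RKHS can be represented as a finite linear combination of kernel functions evaluated on the data pairs from the training dataset (Schölkopf et al., 2001). If the data $\{(\mathbf{x}_{i,t}, y_{i,t})\}$ of all agents are centrally available at a fusion center, then the minimizer of (2) admits

$$f^\star(\mathbf{x}) = \sum_{i=1}^{N} \sum_{t=1}^{T_i} \alpha_{i,t}\, \kappa(\mathbf{x}, \mathbf{x}_{i,t}) := \boldsymbol{\alpha}^\top \boldsymbol{\kappa}(\mathbf{x}), \tag{4}$$

where $\boldsymbol{\alpha} = [\alpha_{1,1}, \dots, \alpha_{N,T_N}]^\top \in \mathbb{R}^{T}$ is the coefficient vector to be learned, $T = \sum_{i=1}^{N} T_i$ is the total number of samples, and $\boldsymbol{\kappa}(\mathbf{x}) = [\kappa(\mathbf{x}, \mathbf{x}_{1,1}), \dots, \kappa(\mathbf{x}, \mathbf{x}_{N,T_N})]^\top$. In the RKHS, since $\langle \kappa(\cdot, \mathbf{x}_t), \kappa(\cdot, \mathbf{x}_\tau) \rangle_{\mathcal{H}} = \kappa(\mathbf{x}_t, \mathbf{x}_\tau)$, it follows that $\|f^\star\|_{\mathcal{H}}^2 = \boldsymbol{\alpha}^\top \mathbf{K} \boldsymbol{\alpha}$, where $\mathbf{K}$ is the $T \times T$ kernel matrix that measures the similarity between any two data points in the training dataset. In this way, the local empirical risk function in (3) can be reformulated as a function of $\boldsymbol{\alpha}$:

$$\hat{R}_i(\boldsymbol{\alpha}) := \frac{1}{T_i} \sum_{t=1}^{T_i} \ell(f^\star(\mathbf{x}_{i,t}), y_{i,t}) + \lambda_i \|f\|_{\mathcal{H}}^2 = \frac{1}{T_i} \sum_{t=1}^{T_i} \ell(\boldsymbol{\alpha}^\top \boldsymbol{\kappa}_i(\mathbf{x}_{i,t}), y_{i,t}) + \lambda_i \boldsymbol{\alpha}^\top \mathbf{K} \boldsymbol{\alpha}, \tag{5}$$

where $\boldsymbol{\kappa}_i(\mathbf{x}_{i,t}) \in \mathbb{R}^{T}$ is a vector that stores the computed similarity between $\mathbf{x}_{i,t}$ and all data in the ensemble training set. Then, (2) becomes

$$\min_{\boldsymbol{\alpha} \in \mathbb{R}^{T}} \sum_{i=1}^{N} \hat{R}_i(\boldsymbol{\alpha}). \tag{6}$$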
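To make the size of the coefficient vector concrete, the sketch below solves the centralized problem in the simplest special case: a single pooled dataset ($N = 1$) with the quadratic loss, for which minimizing $\frac{1}{T}\|\mathbf{y} - \mathbf{K}\boldsymbol{\alpha}\|_2^2 + \lambda \boldsymbol{\alpha}^\top \mathbf{K} \boldsymbol{\alpha}$ is solved by $\boldsymbol{\alpha} = (\mathbf{K} + \lambda T \mathbf{I})^{-1} \mathbf{y}$. The Gaussian kernel bandwidth, the toy data, and all function names are assumptions made for illustration only.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    # Pairwise Gaussian kernel exp(-||x - z||^2 / (2 sigma^2)).
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def kernel_ridge_fit(X, y, lam, sigma=1.0):
    # One coefficient per training sample: alpha has length T.
    T = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    alpha = np.linalg.solve(K + lam * T * np.eye(T), y)
    return alpha, K

def kernel_ridge_predict(X_train, alpha, X_new, sigma=1.0):
    # f(x) = sum_t alpha_t * kappa(x, x_t): needs the raw training inputs.
    return gaussian_kernel(X_new, X_train, sigma) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(50, 1))            # toy inputs
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)     # toy labels
lam = 1e-2

alpha, K = kernel_ridge_fit(X, y, lam)
# Gradient of the regularized objective at alpha; it should vanish.
grad = (2.0 / len(y)) * K @ (K @ alpha - y) + 2.0 * lam * K @ alpha
preds = kernel_ridge_predict(X, alpha, X[:3])
```

Two points from the text show up directly: `alpha` grows with the number of samples, and prediction requires the raw training inputs, which is exactly what agents cannot share in the decentralized setting.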

Relating the decentralized kernel learning problem with the decentralized consensus optimization problem, solving (6) is equivalent to solving

$$\min_{\{\boldsymbol{\alpha}_i \in \mathbb{R}^{T}\}_{i=1}^{N}} \sum_{i=1}^{N} \hat{R}_i(\boldsymbol{\alpha}_i) \quad \text{s.t.} \quad \boldsymbol{\alpha}_i = \boldsymbol{\alpha}_n, \ \forall i, \ \forall n \in \mathcal{N}_i, \tag{7}$$

where $\boldsymbol{\alpha}_i$ and $\boldsymbol{\alpha}_n$ are the local copies of the common optimization parameter vector $\boldsymbol{\alpha}$ at agent $i$ and agent $n$, respectively. The problem can then be solved by ADMM (Shi et al., 2014) or other primal-dual methods (Terelius et al., 2011). However, it is worth noting that (7) reveals a subtle yet profound difference from an optimization problem with a summable objective function, namely, each local function $\hat{R}_i$ depends not only on the global decision variable $\boldsymbol{\alpha}$, but also on the global data because of the kernel terms $\boldsymbol{\kappa}_i(\mathbf{x}_{i,t})$ and $\mathbf{K}$. As a result, solving the local objective for agent $i$ requires raw data from all other agents to obtain $\boldsymbol{\kappa}_i(\mathbf{x}_{i,t})$ and $\mathbf{K}$, which contradicts the situation that private data are only locally available. Moreover, notice that $\boldsymbol{\alpha}$ has the same size $T$ as the ensemble dataset, which incurs the curse of dimensionality and insurmountable computational cost when $T$ becomes large, even when the data are available to all agents.

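To resolve this issue, an alternative formulation is to associate a local prediction model $\bar{f}_i(\mathbf{x}) = \bar{\boldsymbol{\alpha}}_i^\top \bar{\boldsymbol{\kappa}}_i(\mathbf{x})$ with each agent $i$, with $\bar{\boldsymbol{\alpha}}_i \in \mathbb{R}^{T_i}$ being the local optimal solution and $\bar{\boldsymbol{\kappa}}_i(\mathbf{x}) = [\kappa(\mathbf{x}, \mathbf{x}_{i,1}), \dots, \kappa(\mathbf{x}, \mathbf{x}_{i,T_i})]^\top$ (Ji et al., 2016). In this way, the local cost function becomes

$$\hat{R}_i(\bar{\boldsymbol{\alpha}}_i) = \frac{1}{T_i} \sum_{t=1}^{T_i} \ell(\bar{\boldsymbol{\alpha}}_i^\top \bar{\boldsymbol{\kappa}}_i(\mathbf{x}_{i,t}), y_{i,t}) + \lambda_i \bar{\boldsymbol{\alpha}}_i^\top \bar{\mathbf{K}}_i \bar{\boldsymbol{\alpha}}_i, \tag{8}$$

where $\bar{\boldsymbol{\kappa}}_i(\mathbf{x}_{i,t})$ is of size $T_i$ and $\bar{\mathbf{K}}_i$ is of size $T_i \times T_i$, and they both depend on local data only. With (8), the optimization problem (7) is then modified to

$$\min_{\{\bar{\boldsymbol{\alpha}}_i \in \mathbb{R}^{T_i}\}_{i=1}^{N}} \sum_{i=1}^{N} \hat{R}_i(\bar{\boldsymbol{\alpha}}_i) \quad \text{s.t.} \quad \bar{f}_n(\mathbf{x}_{i,t}) = \bar{f}_i(\mathbf{x}_{i,t}), \ \forall i, \ \forall n \in \mathcal{N}_i, \ t = 1, \dots, T_i, \tag{9}$$

and can be solved distributedly by ADMM. Note that the consensus constraint is imposed on the learned prediction values $\bar{f}_i(\mathbf{x}_{i,t})$, not on the parameters $\bar{\boldsymbol{\alpha}}_i$. This is because the $\bar{\boldsymbol{\alpha}}_i$ are data-dependent and may have different sizes at different agents (the dimension of $\bar{\boldsymbol{\alpha}}_i$ is equal to the number of training examples at agent $i$), and therefore cannot be directly forced to agree.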

Still, this method has four drawbacks. To begin with, it is necessary to associate a local learning model with each agent for the decentralized implementation. However, the local learning model and the global optimal model in (2) may not be the same since different local training data are used. Therefore, the optimization problem (9) is only an approximation of (2). Even with the equality constraint to minimize the gap between the decentralized learning and the optimal centralized one, the approximation performance is not guaranteed. Besides, the equality constraint still requires raw data exchange among agents in order for agent $n$ to be able to compute the values $\bar{f}_n(\mathbf{x}_{i,t})$ from agent $i$’s data $\mathbf{x}_{i,t}$. Clearly, this violates the privacy-preserving requirement of many practical applications. In addition, with $T_i$ being large, both the storage and computational costs are high for each agent due to the curse of dimensionality problem at the local sites as well. Lastly, the frequent local communication is resource-consuming under communication constraints. To circumvent all these obstacles, the goal of this paper is to develop efficient decentralized algorithms that preserve privacy and conserve communication resources.

## 3 Algorithm Development

In this section, we leverage the RF approximation and ADMM to develop our algorithms. We first introduce the RF mapping method. Then, we devise the DKLA algorithm that globally optimizes a shared learning model for the multi-agent system. Further, we take the limited communication resources in large-scale decentralized networks into consideration and develop the COKE algorithm. Both DKLA and COKE are computationally efficient and preserve data privacy at the same time. In addition, COKE is also communication efficient.

### 3.1 RF based kernel learning

As stated in previous sections, standard kernel methods incur the curse of dimensionality issue when the data size grows large. To make kernel methods scalable for a large dataset, RF mapping is adopted for approximation by using the shift-invariance property of kernel functions (Rahimi and Recht, 2008).

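For a shift-invariant kernel that satisfies $\kappa(\mathbf{x}_t, \mathbf{x}_\tau) = \kappa(\mathbf{x}_t - \mathbf{x}_\tau)$, if $\kappa(\mathbf{x}_t - \mathbf{x}_\tau)$ is absolutely integrable, then its Fourier transform $p_\kappa(\boldsymbol{\omega})$ is guaranteed to be nonnegative ($p_\kappa(\boldsymbol{\omega}) \ge 0$), and hence can be viewed as a probability density function (pdf) when $\kappa$ is scaled to satisfy $\kappa(\mathbf{0}) = 1$ (Bochner, 2005). Therefore, we have

$$\kappa(\mathbf{x}_t, \mathbf{x}_\tau) = \int p_\kappa(\boldsymbol{\omega})\, e^{j \boldsymbol{\omega}^\top (\mathbf{x}_t - \mathbf{x}_\tau)}\, d\boldsymbol{\omega} := \mathbb{E}_{\boldsymbol{\omega}}\big[e^{j \boldsymbol{\omega}^\top (\mathbf{x}_t - \mathbf{x}_\tau)}\big] = \mathbb{E}_{\boldsymbol{\omega}}\big[\phi(\mathbf{x}_t, \boldsymbol{\omega})\, \phi^*(\mathbf{x}_\tau, \boldsymbol{\omega})\big], \tag{10}$$

where $\mathbb{E}$ denotes the expectation operator, $\phi(\mathbf{x}, \boldsymbol{\omega}) := e^{j \boldsymbol{\omega}^\top \mathbf{x}}$, and $*$ is the complex conjugate operator. In (10), the first equality is the result of the Fourier inversion theorem, and the second equality arises by viewing $p_\kappa(\boldsymbol{\omega})$ as the pdf of $\boldsymbol{\omega}$. In this paper, we adopt a Gaussian kernel $\kappa(\mathbf{x}_t, \mathbf{x}_\tau) = \exp\big(-\|\mathbf{x}_t - \mathbf{x}_\tau\|_2^2 / (2\sigma^2)\big)$, whose pdf is a normal distribution with $p_\kappa(\boldsymbol{\omega}) \sim \mathcal{N}(\mathbf{0}, \sigma^{-2} \mathbf{I})$.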

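The main idea of the RF mapping method is to approximate the kernel function by the sample average

$$\hat{\kappa}_L(\mathbf{x}_t, \mathbf{x}_\tau) := \frac{1}{L} \sum_{l=1}^{L} \phi(\mathbf{x}_t, \boldsymbol{\omega}_l)\, \phi^*(\mathbf{x}_\tau, \boldsymbol{\omega}_l) := \boldsymbol{\phi}_L^\dagger(\mathbf{x}_\tau)\, \boldsymbol{\phi}_L(\mathbf{x}_t), \tag{11}$$

where $\boldsymbol{\phi}_L(\mathbf{x}) := \sqrt{1/L}\, [\phi(\mathbf{x}, \boldsymbol{\omega}_1), \dots, \phi(\mathbf{x}, \boldsymbol{\omega}_L)]^\top$ with $\{\boldsymbol{\omega}_l\}_{l=1}^{L}$ randomly drawn from the distribution $p_\kappa(\boldsymbol{\omega})$, and $\dagger$ is the conjugate transpose operator.

To obtain a real-valued approximation for $\kappa$, the following two real-valued mappings can be adopted, both satisfying the condition $\mathbb{E}_{\boldsymbol{\omega}}\big[\phi_r(\mathbf{x}_t, \boldsymbol{\omega})^\top \phi_r(\mathbf{x}_\tau, \boldsymbol{\omega})\big] = \kappa(\mathbf{x}_t, \mathbf{x}_\tau)$ (Rahimi and Recht, 2008):

$$\phi_r(\mathbf{x}, \boldsymbol{\omega}) = [\cos(\boldsymbol{\omega}^\top \mathbf{x}), \sin(\boldsymbol{\omega}^\top \mathbf{x})]^\top, \tag{12}$$

$$\phi_r(\mathbf{x}, \boldsymbol{\omega}) = \sqrt{2} \cos(\boldsymbol{\omega}^\top \mathbf{x} + b), \tag{13}$$

where $b$ is drawn uniformly from $[0, 2\pi]$.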

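With the real-valued RF mapping, the minimizer of (2) then admits the following representation:

$$\hat{f}^\star(\mathbf{x}) = \sum_{i=1}^{N} \sum_{t=1}^{T_i} \alpha_{i,t}\, \boldsymbol{\phi}_L^\top(\mathbf{x}_{i,t})\, \boldsymbol{\phi}_L(\mathbf{x}) = \boldsymbol{\theta}^\top \boldsymbol{\phi}_L(\mathbf{x}), \tag{14}$$

where $\boldsymbol{\theta}^\top := \sum_{i=1}^{N} \sum_{t=1}^{T_i} \alpha_{i,t}\, \boldsymbol{\phi}_L^\top(\mathbf{x}_{i,t})$ denotes the new decision vector to be learned in the RF space and $\boldsymbol{\phi}_L(\mathbf{x}) = \sqrt{1/L}\, [\phi_r(\mathbf{x}, \boldsymbol{\omega}_1), \dots, \phi_r(\mathbf{x}, \boldsymbol{\omega}_L)]^\top$. If (12) is adopted, then $\boldsymbol{\phi}_L(\mathbf{x})$ and $\boldsymbol{\theta}$ are of size $2L$. Otherwise, if (13) is adopted, then $\boldsymbol{\phi}_L(\mathbf{x})$ and $\boldsymbol{\theta}$ are of size $L$. In either case, the size of $\boldsymbol{\theta}$ is fixed and does not increase with the number of data samples.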
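The approximation described in this subsection can be checked numerically. The sketch below uses the real-valued mapping (13) with the $\sqrt{1/L}$ scaling and compares the inner product of random features against an exact Gaussian kernel. The bandwidth `sigma`, the number of features `L`, the toy inputs, and the function names are assumptions chosen for illustration.

```python
import numpy as np

def rf_map(X, omegas, b):
    # Real-valued random features of (13), scaled by sqrt(1/L):
    # each row is sqrt(2/L) * cos(omega_l^T x + b_l), l = 1..L.
    L = omegas.shape[0]
    return np.sqrt(2.0 / L) * np.cos(X @ omegas.T + b)

def gaussian_kernel(X, Z, sigma):
    # Exact Gaussian kernel exp(-||x - z||^2 / (2 sigma^2)).
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

rng = np.random.default_rng(0)
d, L, sigma = 3, 2000, 1.0

# For the Gaussian kernel, frequencies follow N(0, sigma^-2 I).
omegas = rng.normal(0.0, 1.0 / sigma, size=(L, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=L)

X = rng.normal(size=(5, d))          # toy inputs
Phi = rf_map(X, omegas, b)           # shape (5, L), independent of sample count
K_exact = gaussian_kernel(X, X, sigma)
K_rf = Phi @ Phi.T                   # random-feature estimate of the kernel matrix
max_err = np.max(np.abs(K_exact - K_rf))
```

The feature dimension is set by `L` alone, so datasets of different sizes are all mapped into the same space; increasing `L` tightens the approximation at the cost of a longer parameter vector.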

### 3.2 DKLA: Decentralized kernel learning via ADMM

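Consider the decentralized kernel learning problem described in Section 2 and adopt the RF mapping described in Section 3.1. Let all agents in the network have the same set of random features, i.e., the same $\{\boldsymbol{\omega}_l\}_{l=1}^{L}$. Plugging (14) into the local cost function in (3) gives

$$\hat{R}_i(\boldsymbol{\theta}) := \frac{1}{T_i} \sum_{t=1}^{T_i} \ell(\hat{f}^\star(\mathbf{x}_{i,t}), y_{i,t}) + \lambda_i \|f\|_{\mathcal{H}}^2 = \frac{1}{T_i} \sum_{t=1}^{T_i} \ell(\boldsymbol{\theta}^\top \boldsymbol{\phi}_L(\mathbf{x}_{i,t}), y_{i,t}) + \lambda_i \|\boldsymbol{\theta}\|_2^2. \tag{15}$$

In (15), we have

$$\|\boldsymbol{\theta}\|_2^2 := \Big(\sum_{i=1}^{N} \sum_{t=1}^{T_i} \alpha_{i,t}\, \boldsymbol{\phi}_L^\top(\mathbf{x}_{i,t})\Big) \Big(\sum_{n=1}^{N} \sum_{\tau=1}^{T_n} \alpha_{n,\tau}\, \boldsymbol{\phi}_L(\mathbf{x}_{n,\tau})\Big) = \sum_{i=1}^{N} \sum_{t=1}^{T_i} \sum_{n=1}^{N} \sum_{\tau=1}^{T_n} \alpha_{i,t}\, \alpha_{n,\tau}\, \kappa(\mathbf{x}_{i,t}, \mathbf{x}_{n,\tau}) := \|f\|_{\mathcal{H}}^2.$$

Therefore, with the RF mapping, the centralized benchmark (2) becomes

$$\min_{\boldsymbol{\theta} \in \mathbb{R}^{L}} \sum_{i=1}^{N} \hat{R}_i(\boldsymbol{\theta}). \tag{16}$$

Here, for notational simplicity, we denote the size of $\boldsymbol{\theta}$ by $L$, which corresponds to adopting the real-valued mapping in (13). Adopting the real-valued mapping in (12) only changes the size of $\boldsymbol{\theta}$, while the algorithm development is the same. The RF mapping is essential, because it results in a common optimization parameter of fixed size for all agents.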

To solve (16) in a decentralized manner via ADMM, we associate a model parameter $\boldsymbol{\theta}_i$ with agent $i$, which is a local copy of $\boldsymbol{\theta}$. Enforcing the consensus constraint $\boldsymbol{\theta}_n = \boldsymbol{\theta}_i$ for $n \in \mathcal{N}_i$ such that all agents reach consensus on the prediction function parameterized by $\boldsymbol{\theta}$, the decentralized kernel learning problem based on the RF mapping becomes

$$\min_{\{\boldsymbol{\theta}_i \in \mathbb{R}^{L}\}_{i=1}^{N}} \sum_{i=1}^{N} \hat{R}_i(\boldsymbol{\theta}_i) \quad \text{s.t.} \quad \boldsymbol{\theta}_n = \boldsymbol{\theta}_i, \ \forall n \in \mathcal{N}_i, \ \forall i. \tag{17}$$

Note that the new decision variables $\{\boldsymbol{\theta}_i\}$ to be optimized are local copies of the global optimization parameter $\boldsymbol{\theta}$ and are of the same size for all agents. On the contrary, the decision variables $\bar{\boldsymbol{\alpha}}_i$ in (9) are data-dependent and may have different sizes. In addition, the size of $\boldsymbol{\theta}$ is $L$, which can be much smaller than that of $\boldsymbol{\alpha}$ (equal to $T$) in (6). For big data scenarios where $L \ll T$, RF mapping greatly reduces the computational complexity. Moreover, as shown in the following, the updating of $\boldsymbol{\theta}_i$ does not involve any raw data exchange, and the RF mapping from $\mathbf{x}$ to $\boldsymbol{\phi}_L(\mathbf{x})$ is not a one-to-one mapping, so the scheme preserves privacy. Further, it is easy to set the regularization parameters that control over-fitting. Specifically, since the parameters $\boldsymbol{\theta}_i$ are of the same length among agents, we can set $\lambda_i = \lambda / N$, where $\lambda$ is the corresponding over-fitting control parameter assuming all data are collected at a center. On the other hand, the regularization parameters in (5) depend on local data, which makes them relatively difficult to tune in a large-scale network.

Accordingly, the augmented Lagrangian function of problem (17) is

 (18)

where the dual variables correspond to the equality constraint in (17) and $\rho > 0$ is the penalty parameter.

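We then apply ADMM to solve (17) and develop the DKLA algorithm such that all $\boldsymbol{\theta}_i$ converge to the global optimum $\boldsymbol{\theta}^*$ of (16) in the RF space. Following Shi et al. (2014), the updates for $\boldsymbol{\theta}_i$ and $\boldsymbol{\gamma}_i$ are distributed to agent $i$ as follows:

$$\boldsymbol{\theta}_i^{k} := \arg\min_{\boldsymbol{\theta}_i} \Big\{ \hat{R}_i(\boldsymbol{\theta}_i) + \rho |\mathcal{N}_i| \|\boldsymbol{\theta}_i\|_2^2 + \boldsymbol{\theta}_i^\top \Big[ \boldsymbol{\gamma}_i^{k-1} - \rho \sum_{n \in \mathcal{N}_i} \big(\boldsymbol{\theta}_i^{k-1} + \boldsymbol{\theta}_n^{k-1}\big) \Big] \Big\}, \tag{19a}$$

$$\boldsymbol{\gamma}_i^{k} = \boldsymbol{\gamma}_i^{k-1} + \rho \sum_{n \in \mathcal{N}_i} \big(\boldsymbol{\theta}_i^{k} - \boldsymbol{\theta}_n^{k}\big), \tag{19b}$$

where $|\mathcal{N}_i|$ is the cardinality of $\mathcal{N}_i$. The learning algorithm DKLA is outlined in Algorithm 1. It is fully decentralized since the updates of $\boldsymbol{\theta}_i^{k}$ and $\boldsymbol{\gamma}_i^{k}$ depend only on local and neighboring information.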
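The updates (19a) and (19b) can be sketched for the quadratic loss, where the primal step has a closed form obtained by setting the gradient of the bracketed objective to zero. This is a minimal illustration rather than the paper's Algorithm 1: the zero initialization of the dual variables, the ring topology, the choice $\lambda_i = \lambda/N$, the parameter values, and all function names are assumptions.

```python
import numpy as np

def rf_map(X, omegas, b):
    # Real-valued random features of (13) with sqrt(1/L) scaling.
    return np.sqrt(2.0 / omegas.shape[0]) * np.cos(X @ omegas.T + b)

def dkla_ridge(Phi, y, neighbors, lam, rho, iters):
    # Phi[i]: (T_i, L) local features, y[i]: (T_i,) local labels.
    N, L = len(Phi), Phi[0].shape[1]
    theta = np.zeros((N, L))
    gamma = np.zeros((N, L))               # assumed zero dual initialization
    A_inv, c = [], []
    for i in range(N):
        T_i, deg = len(y[i]), len(neighbors[i])
        # Hessian of the (19a) objective for the quadratic loss.
        A = (2.0 / T_i) * Phi[i].T @ Phi[i] + 2.0 * (lam / N + rho * deg) * np.eye(L)
        A_inv.append(np.linalg.inv(A))
        c.append((2.0 / T_i) * Phi[i].T @ y[i])
    for _ in range(iters):
        prev = theta.copy()
        for i in range(N):                 # primal update (19a), closed form
            s = sum(prev[i] + prev[n] for n in neighbors[i])
            theta[i] = A_inv[i] @ (c[i] - gamma[i] + rho * s)
        for i in range(N):                 # dual update (19b)
            gamma[i] += rho * sum(theta[i] - theta[n] for n in neighbors[i])
    return theta

rng = np.random.default_rng(0)
N, L, lam, rho = 4, 20, 0.4, 0.5
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}   # ring network
omegas = rng.normal(size=(L, 1))           # shared random features
b = rng.uniform(0.0, 2.0 * np.pi, size=L)
Phi, y = [], []
for T_i in (30, 40, 50, 60):               # unequal local dataset sizes
    X_i = rng.uniform(-3.0, 3.0, size=(T_i, 1))
    Phi.append(rf_map(X_i, omegas, b))
    y.append(np.sin(X_i[:, 0]) + 0.1 * rng.normal(size=T_i))

theta = dkla_ridge(Phi, y, neighbors, lam, rho, iters=500)

# Centralized minimizer of (16) for the same data, used as a reference.
G = sum(P.T @ P / len(t) for P, t in zip(Phi, y)) + lam * np.eye(L)
r = sum(P.T @ t / len(t) for P, t in zip(Phi, y))
theta_star = np.linalg.solve(G, r)
gap = np.max(np.abs(theta - theta_star))
```

Only the length-`L` vectors `theta[n]` cross agent boundaries; the local features and labels never leave the agent that holds them. In this toy setting every agent's parameter vector ends up close to the centralized reference.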

Theorem 1 For a connected network with convex local objective functions $\hat{R}_i(\boldsymbol{\theta}_i)$ and dual variables initialized as in Algorithm 1, DKLA converges to an optimal solution of (16). Further, when the local objective functions $\hat{R}_i(\boldsymbol{\theta}_i)$ are strongly convex, DKLA converges to the optimal solution of (16) at a linear rate.

Remark 1. For the decentralized kernel learning problem in the RF space, choosing the loss function to be the quadratic loss for each agent in a regression problem gives a strongly convex local objective function, while choosing the loss function to be the logistic loss in a classification problem gives a convex local objective function. It should be noted that the kernel transformation with RF mapping is essential in enabling a convex consensus formulation with a convergence guarantee. For example, in a regular optimization problem whose local cost function wraps a loss around a nonlinear model $f$, even if the loss is quadratic, the nonlinear function $f$ inside could destroy the convexity. In contrast, with kernel mapping, $f$ of any form is expressed as a linear function of $\boldsymbol{\theta}$, and hence the local cost function is guaranteed to be convex.

Remark 2. Because the RF mapping converts the decentralized kernel learning problem into a standard consensus optimization problem in the RF domain, the convergence results for the model parameters follow directly from (Shi et al., 2014, Theorem 1) for such standard problems.

### 3.3 COKE: Communication-censored decentralized kernel learning

From Sections 3.1 and 3.2, we can see that decentralized kernel learning in the RF space under the consensus optimization framework has much reduced computational complexity, thanks to the RF mapping technique that transforms the learning model into a smaller RF space. In this subsection, we further consider the case when the communication resource is limited and aim to reduce the communication cost of DKLA. To start, we notice that in the DKLA iteration (19), each agent $i$ maintains $2 + |\mathcal{N}_i|$ local variables at iteration $k$, i.e., its local primal variable $\boldsymbol{\theta}_i^{k}$, its local dual variable $\boldsymbol{\gamma}_i^{k}$, and $|\mathcal{N}_i|$ state variables $\boldsymbol{\theta}_n^{k}$ received from its neighbors. While the dual variable $\boldsymbol{\gamma}_i^{k}$ is kept locally at agent $i$, the transmission of its updated local variable $\boldsymbol{\theta}_i^{k}$ to its one-hop neighbors happens in every iteration, which consumes a large amount of communication bandwidth and energy over the iterations in large-scale networks. In order to improve the communication efficiency, we develop the COKE algorithm by employing a censoring function at each agent to decide if a local update is informative enough to be transmitted.

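To evaluate the importance of a local update and enforce the communication censoring function at iteration $k$ for agent $i$, we introduce a new state variable $\hat{\boldsymbol{\theta}}_i^{k-1}$ to record agent $i$’s latest broadcast primal variable up to time $k-1$. Then, at iteration $k$, we define the difference between agent $i$’s current state $\boldsymbol{\theta}_i^{k}$ and its previously transmitted state $\hat{\boldsymbol{\theta}}_i^{k-1}$ as

$$\boldsymbol{\xi}_i^{k} = \hat{\boldsymbol{\theta}}_i^{k-1} - \boldsymbol{\theta}_i^{k}, \tag{20}$$

and choose a censoring function as

$$H_i(k, \boldsymbol{\xi}_i^{k}) = \|\boldsymbol{\xi}_i^{k}\|_2 - h_i(k), \tag{21}$$

where $\{h_i(k)\}$ is a non-increasing non-negative sequence. A typical choice for the censoring threshold is $h_i(k) = v \mu^{k}$, where $v > 0$ and $\mu \in (0, 1)$ are constants.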

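Then, when executing the COKE algorithm, each agent $i$ maintains $3 + |\mathcal{N}_i|$ local variables at each iteration $k$. Compared with the DKLA update in (19), the additional local variable is the state variable $\hat{\boldsymbol{\theta}}_i^{k}$ that records its latest broadcast primal variable up to time $k$. Moreover, the state variables from its neighbors are $\hat{\boldsymbol{\theta}}_n^{k}$, which record the latest received primal variables from its neighbors, instead of the timely updated and broadcast variables $\boldsymbol{\theta}_n^{k}$ of its neighbors. While in COKE each agent computes local updates at every step, its transmission to neighbors does not always occur, but is determined by the censoring criterion (21). To be specific, at each iteration $k$, if $H_i(k, \boldsymbol{\xi}_i^{k}) \ge 0$, then $\hat{\boldsymbol{\theta}}_i^{k} = \boldsymbol{\theta}_i^{k}$, and agent $i$ is allowed to transmit its local primal variable $\boldsymbol{\theta}_i^{k}$ to its neighbors. Otherwise, $\hat{\boldsymbol{\theta}}_i^{k} = \hat{\boldsymbol{\theta}}_i^{k-1}$ and no information will be transmitted. If agent $i$ receives $\boldsymbol{\theta}_n^{k}$ from any neighbor $n$, then that neighbor’s state variable kept by agent $i$ becomes $\hat{\boldsymbol{\theta}}_n^{k} = \boldsymbol{\theta}_n^{k}$; otherwise, $\hat{\boldsymbol{\theta}}_n^{k} = \hat{\boldsymbol{\theta}}_n^{k-1}$. Consequently, agent $i$’s local parameters are updated as follows:

$$\boldsymbol{\theta}_i^{k} := \arg\min_{\boldsymbol{\theta}_i} \Big\{ \hat{R}_i(\boldsymbol{\theta}_i) + \rho |\mathcal{N}_i| \|\boldsymbol{\theta}_i\|_2^2 + \boldsymbol{\theta}_i^\top \Big[ \boldsymbol{\gamma}_i^{k-1} - \rho \sum_{n \in \mathcal{N}_i} \big(\hat{\boldsymbol{\theta}}_i^{k-1} + \hat{\boldsymbol{\theta}}_n^{k-1}\big) \Big] \Big\}, \tag{22a}$$

$$\boldsymbol{\gamma}_i^{k} = \boldsymbol{\gamma}_i^{k-1} + \rho \sum_{n \in \mathcal{N}_i} \big(\hat{\boldsymbol{\theta}}_i^{k} - \hat{\boldsymbol{\theta}}_n^{k}\big), \tag{22b}$$

with a censoring step conducted between (22a) and (22b). We outline the COKE algorithm in Algorithm 2.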

The key feature of COKE is that agent $i$’s local variables $\boldsymbol{\theta}_i^{k}$ and $\boldsymbol{\gamma}_i^{k}$ are updated at every iteration, but the transmission of $\boldsymbol{\theta}_i^{k}$ happens only when the censoring condition is met. By skipping unnecessary transmissions, the communication efficiency of COKE is improved. A large threshold $h_i(k)$ saves more communication but may lead to divergence from the optimal solution of (16), while a small $h_i(k)$ does not contribute much to communication saving. Notably, DKLA is the special case of COKE in which the communication-censoring strategy is absent, obtained by setting $h_i(k) = 0$ for all $k$.
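The censoring step between (22a) and (22b) can be sketched in the same quadratic-loss setting. The code below is a hedged illustration, not the paper's Algorithm 2: the geometric threshold `v * mu**k`, the zero initialization of all variables, the ring topology, the choice $\lambda_i = \lambda/N$, and the parameter values are assumptions, and `sent` simply counts broadcast events.

```python
import numpy as np

def rf_map(X, omegas, b):
    # Real-valued random features of (13) with sqrt(1/L) scaling.
    return np.sqrt(2.0 / omegas.shape[0]) * np.cos(X @ omegas.T + b)

def coke_ridge(Phi, y, neighbors, lam, rho, iters, v, mu):
    N, L = len(Phi), Phi[0].shape[1]
    theta = np.zeros((N, L))
    theta_hat = np.zeros((N, L))           # latest broadcast states
    gamma = np.zeros((N, L))               # assumed zero dual initialization
    sent = 0
    A_inv, c = [], []
    for i in range(N):
        T_i, deg = len(y[i]), len(neighbors[i])
        A = (2.0 / T_i) * Phi[i].T @ Phi[i] + 2.0 * (lam / N + rho * deg) * np.eye(L)
        A_inv.append(np.linalg.inv(A))
        c.append((2.0 / T_i) * Phi[i].T @ y[i])
    for k in range(1, iters + 1):
        for i in range(N):                 # (22a): uses broadcast states from k-1
            s = sum(theta_hat[i] + theta_hat[n] for n in neighbors[i])
            theta[i] = A_inv[i] @ (c[i] - gamma[i] + rho * s)
        h_k = v * mu**k                    # assumed geometric threshold
        for i in range(N):                 # censoring test (20)-(21)
            xi = theta_hat[i] - theta[i]
            if np.linalg.norm(xi) - h_k >= 0.0:
                theta_hat[i] = theta[i]    # informative update: broadcast it
                sent += 1
        for i in range(N):                 # (22b): uses current broadcast states
            gamma[i] += rho * sum(theta_hat[i] - theta_hat[n] for n in neighbors[i])
    return theta, sent

rng = np.random.default_rng(0)
N, L, lam, rho, iters = 4, 20, 0.4, 0.5, 500
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}   # ring network
omegas = rng.normal(size=(L, 1))
b = rng.uniform(0.0, 2.0 * np.pi, size=L)
Phi, y = [], []
for T_i in (30, 40, 50, 60):
    X_i = rng.uniform(-3.0, 3.0, size=(T_i, 1))
    Phi.append(rf_map(X_i, omegas, b))
    y.append(np.sin(X_i[:, 0]) + 0.1 * rng.normal(size=T_i))

theta_coke, sent_coke = coke_ridge(Phi, y, neighbors, lam, rho, iters, v=1.0, mu=0.9)
theta_dkla, sent_dkla = coke_ridge(Phi, y, neighbors, lam, rho, iters, v=0.0, mu=0.9)

# Centralized minimizer of (16) as a reference.
G = sum(P.T @ P / len(t) for P, t in zip(Phi, y)) + lam * np.eye(L)
r = sum(P.T @ t / len(t) for P, t in zip(Phi, y))
theta_star = np.linalg.solve(G, r)
gap_coke = np.max(np.abs(theta_coke - theta_star))
```

Setting `v=0.0` makes the threshold zero, so every update is broadcast and the routine reduces to the uncensored iteration; comparing `sent_coke` with `sent_dkla` gives a rough sense of the transmissions saved in this toy run.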

## 4 Theoretical Guarantee

In this section, we perform theoretical analyses to address two questions about the COKE algorithm. First, does it converge to the globally optimal point, and if so, at what rate? Second, what generalization performance does it achieve? Since DKLA is a special case of COKE, the results, especially the second one, extend to DKLA straightforwardly. For the theoretical analysis, we make the following assumptions.

###### Assumption 1

The network with topology is undirected and connected.

###### Assumption 2

The local cost functions are strongly convex with constants such that , , given any . The minimum convexity constant is . The gradients of the local cost functions are Lipschitz continuous with constants . That is, for any agent given any . The maximum Lipschitz constant is .

###### Assumption 3

The number of training samples of different agents is of the same order of magnitude, i.e., .

###### Assumption 4

There exists , such that for all estimators , , where is the expected risk to measure the generalization ability of the estimator .

###### Assumption 5

The estimates are bounded, i.e., such that .

Assumptions 1 and 2 are standard for optimization over decentralized networks (Shi et al., 2014), Assumption 4 is standard in the generalization performance analysis of kernel learning (Li et al., 2018), Assumption 5 is valid for most of the popular loss functions (Bouboulis et al., 2018), and Assumption 3 is enforced to exclude the case where data are extremely unbalanced across the network.

### 4.1 Linear convergence of DKLA and COKE

We first establish that DKLA enables agents in the decentralized network to reach consensus on the prediction function at a linear rate. We then show that when the censoring function is properly chosen and the penalty parameter satisfies certain conditions, COKE also guarantees that the individually learned functional on the same sample linearly converges to the optimal solution.

Theorem 2 [Linear convergence of DKLA] Initialize the dual variables as , with Assumptions 1 - 3, the learned functional at each agent through DKLA is R-linearly convergent to the optimal functional for any , where denotes the optimal solution to (16) obtained in the centralized case. That is,

$$\lim_{k \to \infty} \hat{f}_{\theta_i^k}(x) = \hat{f}_{\theta^*}(x), \quad \forall i. \tag{23}$$

Proof. See Appendix A.

Theorem 3 [Linear convergence of COKE] Initialize the dual variables as , set the censoring thresholds to be , with and , and choose the penalty parameter such that

 0<ρ

where , , and are arbitrary constants, and are the minimum strong convexity constant of the local cost functions and the maximum Lipschitz constant of the local gradients, respectively. and are the maximum singular value of the unsigned incidence matrix and the minimum non-zero singular value of the signed incidence matrix of the network, respectively. Then, with Assumptions 1-3, the learned functional at each agent through COKE is R-linearly convergent to the optimal one for any , where denotes the optimal solution to (16) obtained in the centralized case. That is,

$$\lim_{k \to \infty} \hat{f}_{\theta_i^k}(x) = \hat{f}_{\theta^*}(x), \quad \forall i. \tag{25}$$

Proof. See Appendix A.
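The two network quantities that enter the penalty-parameter condition of Theorem 3, the maximum singular value of the unsigned incidence matrix and the minimum non-zero singular value of the signed incidence matrix, can be computed numerically for a given topology. The sketch below assumes the standard edge-by-node incidence convention with one row per undirected edge; the paper's own convention (for example, counting each edge as two directed arcs) may differ by a constant factor. The function names are hypothetical.

```python
import numpy as np

def incidence_matrices(num_nodes, edges):
    """Return (unsigned, signed) edge-by-node incidence matrices, shape (E, N)."""
    unsigned = np.zeros((len(edges), num_nodes))
    signed = np.zeros((len(edges), num_nodes))
    for e, (i, j) in enumerate(edges):
        unsigned[e, i] = unsigned[e, j] = 1.0   # both endpoints get +1
        signed[e, i], signed[e, j] = 1.0, -1.0  # arbitrary orientation i -> j
    return unsigned, signed

def incidence_singular_values(num_nodes, edges, tol=1e-10):
    """Max singular value (unsigned) and min non-zero singular value (signed)."""
    unsigned, signed = incidence_matrices(num_nodes, edges)
    sigma_max_unsigned = np.linalg.svd(unsigned, compute_uv=False).max()
    s_signed = np.linalg.svd(signed, compute_uv=False)
    # Singular values below tol are treated as numerically zero.
    sigma_min_signed = s_signed[s_signed > tol].min()
    return sigma_max_unsigned, sigma_min_signed

# Example: a ring of four agents.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
sigma_max, sigma_min = incidence_singular_values(4, ring)
```

Under this convention the squared singular values of the signed matrix are the eigenvalues of the graph Laplacian, so a poorly connected network yields a small minimum non-zero singular value.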

The above theorems establish the exact convergence of the functional learned in the multi-agent system for the decentralized kernel regression problem via DKLA and COKE. Different from previous works (Koppel et al., 2018; Shin et al., 2018), our analytic results are obtained by converting the non-parametric data-dependent learning model into a parametric data-independent model in the RF space and solving it under the consensus optimization framework. In this way, we not only reduce the computational complexity of the standard kernel method and make the RF-based kernel methods scalable to large datasets, but also preserve privacy since no raw data exchange among agents is required. RF mapping is crucial in our algorithms: it lets us show the linear convergence of the functional by showing the linear convergence of the iteratively updated decision variables in the RF space. While the linear convergence of the decision variables in DKLA can be directly derived from (Shi et al., 2014), the corresponding proof for COKE is more challenging because of the communication-censoring strategy, as addressed in our previous work (Liu et al., 2019). Building on that work on communication-efficient optimization and on the RF mapping technique that enables consensus optimization in the RF space, we prove that the learned functional also converges linearly under the communication-censoring strategy when Assumptions 1-3 hold; see (Liu et al., 2019) and Appendix A for more details.

### 4.2 Generalization property of COKE

The ultimate goal of decentralized learning is to find a function that generalizes well for the ensemble of all data from all agents. To evaluate the generalization property of the predictive function learned by COKE, we bound the difference between the expected risk of the predictive function learned by COKE at the $k$-th iteration, defined as $\mathcal{E}(\hat{f}^k)$, and the expected risk $\mathcal{E}(f_{\mathcal{H}})$ in the RKHS. This is different from bounding the approximation error between the kernel and its approximation by random features, as in the literature (Rahimi and Recht, 2008; Sutherland and Schneider, 2015; Sriperumbudur and Szabó, 2015). As DKLA is a special case of COKE, the generalization performance of COKE extends to DKLA straightforwardly.

To illustrate our finding, we focus on the kernel regression problem whose loss function is least squares, i.e., . With the RF mapping, the objective function (16) of the regression problem can be formulated as

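$$\hat{R}(\theta) = \sum_{i=1}^{N} \hat{R}_i(\theta) = \sum_{i=1}^{N} \left( \frac{1}{T_i} \left\| y_i - (\Phi_L^i)^{\top} \theta \right\|_2^2 + \frac{\lambda}{N} \|\theta\|_2^2 \right), \tag{26}$$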

where , , and is the data mapped to the RF space.

The optimal solution of (26) is given in closed form by

$$\theta^* = \left( \tilde{\Phi}^{\top} \tilde{\Phi} + \lambda I \right)^{-1} \tilde{\Phi}^{\top} \tilde{y}, \tag{27}$$

where with , and with . The optimal prediction model is then expressed by

$$\hat{f}_{\theta^*}(x) = (\theta^*)^{\top} \phi_L(x). \tag{28}$$
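The closed-form solution (27) and the predictor (28) can be sketched numerically as follows. This is an illustration under stated assumptions, not the authors' code: the random feature map is taken to be the common cosine form $\phi_L(x) = \sqrt{2/L}\cos(Wx + b)$, and the stacked quantities in (27) are assumed to scale each agent's features and labels by $1/\sqrt{T_i}$, which is what setting the gradient of (26) to zero implies. The names `rff_map`, `centralized_rf_ridge`, and `predict` are hypothetical.

```python
import numpy as np

def rff_map(X, W, b):
    """Map inputs X of shape (T, d) to random features of shape (T, L).

    Assumes the cosine feature map sqrt(2/L) * cos(W x + b).
    """
    L = W.shape[0]
    return np.sqrt(2.0 / L) * np.cos(X @ W.T + b)

def centralized_rf_ridge(features, labels, lam):
    """Closed-form minimizer of the ridge objective in the RF space.

    features[i]: (T_i, L) random features of agent i, labels[i]: (T_i,).
    Each agent's block is scaled by 1/sqrt(T_i) before stacking (assumed).
    """
    L = features[0].shape[1]
    Phi_tilde = np.vstack([F / np.sqrt(F.shape[0]) for F in features])
    y_tilde = np.concatenate([y / np.sqrt(len(y)) for y in labels])
    gram = Phi_tilde.T @ Phi_tilde + lam * np.eye(L)
    return np.linalg.solve(gram, Phi_tilde.T @ y_tilde)

def predict(theta, X, W, b):
    """Evaluate the learned function theta^T phi_L(x) at the rows of X."""
    return rff_map(X, W, b) @ theta
```

Solving the $L \times L$ linear system with `np.linalg.solve` avoids forming an explicit inverse, and its cost depends on the number of random features $L$ rather than on the total number of samples.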

In the following theorem, we give a general result on the generalization performance of the predictive function learned by COKE for the kernel regression problem, which builds on the linear convergence result of Theorem 3 and takes into account the number of random features adopted.

Theorem 4 Let be the largest eigenvalue of the kernel matrix , and choose the regularization parameter so as to control overfitting. Under Assumptions 1-4, with the censoring function and other parameters given in Theorem 3, for all and , if the number of random features satisfies

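$$L \geq \frac{1}{\lambda} \left( \frac{1}{\epsilon^2} + \frac{2}{3\epsilon} \right) \log \frac{16\, d_K^{\lambda}}{\delta_p},$$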

then with probability at least , the excess risk of obtained by Algorithm 2 converges to an upper bound, i.e.,

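$$\lim_{k \to \infty} \left( \mathcal{E}(\hat{f}^k) - \mathcal{E}(f_{\mathcal{H}}) \right) \leq 3\lambda + O\!\left( \frac{1}{\sqrt{T}} \right), \tag{29}$$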

where , and is the number of effective degrees of freedom that is known to be an indicator of the number of independent parameters in a learning problem (Avron et al., 2017).

Proof. See Appendix B.

Theorem 4 states the trade-off between the computational efficiency and the statistical efficiency through the regularization parameter , effective dimension , and the number of random features adopted. To bound the excess risk with a higher probability, we need more random features, which results in a higher computational complexity. The regularization parameter is usually determined by the number of training data and one common practice is to set for the regression problem (Caponnetto and De Vito, 2007). Therefore, with features, COKE achieves a learning risk at a linear rate. We also notice that different sampling strategies affect the number of random features required to achieve a given generalization error. For example, importance sampling is studied for centralized kernel learning in the RF space by Li et al. (2018). Interested readers are referred to (Li et al., 2018) and references therein.
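To make the feature-count condition of Theorem 4 concrete, the lower bound on $L$ can be evaluated for given parameter values. The sketch below is a worked-arithmetic illustration only: it assumes the logarithm's argument reads $16\, d_K^{\lambda} / \delta_p$, treats the effective degrees of freedom as a supplied number `d_eff` instead of computing it from a kernel matrix, and uses the hypothetical function name `min_random_features`.

```python
import math

def min_random_features(lam, eps, d_eff, delta_p):
    """Smallest integer L satisfying the assumed bound

        L >= (1/lam) * (1/eps**2 + 2/(3*eps)) * log(16 * d_eff / delta_p).

    lam: regularization parameter, eps: accuracy parameter in the bound,
    d_eff: effective degrees of freedom, delta_p: failure probability.
    """
    scale = (1.0 / lam) * (1.0 / eps ** 2 + 2.0 / (3.0 * eps))
    return math.ceil(scale * math.log(16.0 * d_eff / delta_p))

# The bound grows as the failure probability or the regularization shrinks.
loose = min_random_features(lam=0.1, eps=0.5, d_eff=10.0, delta_p=0.05)
tight = min_random_features(lam=0.1, eps=0.5, d_eff=10.0, delta_p=0.005)
```

The dependence on `delta_p` is only logarithmic, whereas the dependence on `1/lam` is linear, so the regularization parameter has the larger effect on the required number of features.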

## 5 Experiments

This section evaluates the performance of our COKE algorithm in regression tasks using both synthetic and real-world datasets. Since we consider the case that data are only locally available and cannot be shared among agents, we use the following benchmarks where the RF mapping is adopted for comparison with our COKE algorithm.

CTA. This method is devised in (Bouboulis et al., 2018) to cope with online streaming data. At each time instant, each agent combines information from its neighbors i.e., and updates its own parameter in the RF space with the gradient descent method. Here, we adopt it for the decentralized learning problem with batch-form data at each agent.

DKLA. Algorithm 1 proposed in Section 3.2, where ADMM is applied and the communication among agents happens at every iteration without being censored.
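For intuition about the CTA benchmark, a combine-then-adapt iteration on batch data can be sketched as below. This is a hedged illustration and not the implementation used in the experiments: the combination weights are assumed to be a uniform average over each agent's closed neighborhood, the local cost is assumed to be the ridge cost $\frac{1}{T_i}\|y_i - Z_i\theta\|_2^2 + \frac{\lambda}{N}\|\theta\|_2^2$ in the RF space, and the name `cta_ridge` and the fixed step size `step` are hypothetical.

```python
import numpy as np

def cta_ridge(Z, y, neighbors, lam, step, iters):
    """Sketch of combine-then-adapt gradient descent in the RF space.

    Z[i]: (T_i, L) feature matrix of agent i, y[i]: (T_i,) labels,
    neighbors[i]: list of agent i's neighbors.
    """
    N, L = len(Z), Z[0].shape[1]
    theta = np.zeros((N, L))
    for _ in range(iters):
        # Combine: average own and neighbors' parameters (assumed uniform weights).
        psi = np.stack([
            np.mean(theta[[i] + list(neighbors[i])], axis=0) for i in range(N)
        ])
        # Adapt: one gradient step on the local cost from the combined point.
        for i in range(N):
            T_i = Z[i].shape[0]
            grad = (2.0 / T_i) * Z[i].T @ (Z[i] @ psi[i] - y[i]) + (2.0 * lam / N) * psi[i]
            theta[i] = psi[i] - step * grad
    return theta
```

Unlike the ADMM-based updates, every agent here exchanges its parameter vector at every iteration, and with a constant step size the iterates generally settle near, not exactly at, the centralized solution.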

The performance of all algorithms is evaluated using both synthetic and real-world datasets, where the entries of data samples are normalized to lie in and each agent uses of its data for training and the rest for testing. The generalization performance at each iteration is evaluated using MSE given by