Decentralized online learning has been widely studied in recent decades, motivated mostly by its broad applications in networked multi-agent systems such as wireless sensor networks, robotics, and the internet of things [26, 35]. In these systems, a number of agents collect their own online streaming data and aim to learn a common functional model through local information exchange. This objective is usually achieved by decentralized online convex optimization [29, 42, 21, 48, 39]. With an online gradient descent based algorithm, or through the online alternating direction method of multipliers (ADMM), a static regret of $\mathcal{O}(\sqrt{T})$ can be achieved over a time horizon $T$. Further, if the cost functions are strictly convex, an efficient algorithm based on the Newton method achieves a regret bound of $\mathcal{O}(\log T)$. In addition to static environments, online learning in dynamic environments has attracted increasing attention recently [19, 38, 2, 12, 47]. However, all these works assume that the functional model to be learned by agents is linear, which may not always hold in practical applications.
Motivated by the universality of kernel methods in approximating nonlinear functions, this paper aims to solve the decentralized online kernel learning problem, where the common function to be learned by agents is assumed to be nonlinear and to belong to a reproducing kernel Hilbert space (RKHS). However, directly applying kernel methods to decentralized online learning is formidably challenging because they adopt nonparametric models in which the number of model variables grows proportionally to the data size, which incurs the curse of dimensionality as the data size grows large over time. In addition, the data-dependent decision variables prevent consensus optimization when the data sizes vary across agents and across time, as well as in circumstances where raw data exchange is prohibited.
To alleviate the computational complexity of kernel methods, various dimensionality reduction techniques have been developed, including stochastic approximation , restricting the number of function parameters [23, 22], and approximating the kernel during training [36, 10, 34, 31]. Among them, random feature (RF) mapping methods [34, 10, 31] not only circumvent the curse of dimensionality problem but also enable consensus optimization without any raw data exchange among agents, which makes them popular in many decentralized kernel learning works, including batch-form learning [43, 37] and online streaming learning [5, 40, 17].
Another key problem in decentralized learning is that it relies on iterative local communications for computational feasibility and efficiency. This incurs frequent communications among agents to exchange their locally computed updates of the shared learning model, which can cause tremendous communication overhead in terms of both link bandwidth and transmission power. Therefore, communication-efficient algorithms are desired in decentralized learning. To improve the communication efficiency, we can harness the function smoothness or the Nesterov gradient to achieve fast convergence [32, 33], transmit the compressed information by quantization [49, 46, 40] or sparsification [41, 15], randomly select a number of nodes for broadcasting/communication, and operate asynchronous updating to reduce the number of transmissions per iteration [24, 1, 30, 44, 45]. In contrast to random node selection, a more intuitive way is to evaluate the importance of a message in order to avoid unnecessary transmissions. This is usually implemented by adopting a communication censoring/event-triggering scheme to adaptively decide if a message is informative enough to be transmitted during the iterative optimization process [8, 28, 25, 43, 7].
In this article, we thus focus on the decentralized online kernel learning problem in networked multi-agent systems and aim to develop algorithms that are both communication- and computation-efficient. We first utilize RF mapping to transform the original nonparametric, data-dependent learning problem into a parametric, fixed-size, data-independent learning problem, which circumvents the curse of dimensionality in traditional kernel methods and enables consensus optimization in the RF space in a decentralized setting. Different from existing gradient descent based methods [5, 40] or the standard ADMM algorithm, we propose to solve the decentralized kernel learning problem by linearized ADMM and develop the Online Decentralized Kernel learning via Linearized ADMM (ODKLA) algorithm. In ODKLA, the local cost function of each agent is replaced by its first-order approximation centered at the current iterate, which results in a closed-form primal update if the local cost function is convex. In this way, ODKLA is more computationally efficient than standard ADMM, where the primal update requires solving a suboptimization problem at every iteration, while still enjoying fast convergence. To further reduce the communication cost, we develop the Quantized and Communication-censored Online Decentralized Kernel learning via Linearized ADMM (QC-ODKLA) algorithm by introducing a communication censoring strategy and a quantization strategy. The communication censoring strategy allows each agent to autonomously skip unnecessary communications when its local update is not informative enough for transmission, while the quantization strategy restricts the total number of bits transmitted in the learning process. The communication efficiency can thus be boosted at almost no sacrifice in learning performance. Our key contributions are summarized as follows.
We develop the ODKLA that utilizes linearized ADMM to solve the online decentralized multi-agent kernel learning problem in the RF space. ODKLA is fully decentralized and does not involve solving sub-optimization problems, which is thus more computationally efficient than standard ADMM. Moreover, ODKLA is essentially a variant of the higher-order ADMM and thus achieves faster convergence compared with the diffusion-based first-order gradient descent methods .
Utilizing both communication-censoring and quantization strategies, we develop the QC-ODKLA algorithm, which achieves desired learning performance given limited communication resources and energy supply. When both strategies are absent, QC-ODKLA degenerates to ODKLA.
In addition, we analyze the regret bound of QC-ODKLA. We show that when all techniques are adopted (linearized ADMM, quantization, and communication censoring), QC-ODKLA is still able to achieve the optimal sublinear regret $\mathcal{O}(\sqrt{T})$ over $T$ time slots under mild conditions, namely, that the communication censoring thresholds decay over time.
Finally, we test the performance of our proposed ODKLA and QC-ODKLA algorithms on extensive real datasets. The results corroborate that both ODKLA and QC-ODKLA exhibit attractive learning performance and computation efficiency, while QC-ODKLA is highly communication-efficient. Such salient features make it an attractive solution for broad applications where decentralized learning from streaming data is at its core.
The remainder of this paper is organized as follows. Section II provides some preliminaries for decentralized kernel learning. Section III formulates the online decentralized kernel learning problem. Section IV develops the online decentralized kernel learning algorithms, including both ODKLA and QC-ODKLA. Section V presents the theoretical results. Section VI tests the proposed methods on real datasets. Concluding remarks are given in Section VII.
Notation. $\mathbb{R}$ denotes the set of real numbers. $\|\cdot\|_2$ denotes the Euclidean norm of vectors and $\|\cdot\|_F$ denotes the Frobenius norm of matrices. $|\cdot|$ denotes the cardinality of a set. $\mathbf{A}$ denotes a matrix, $\mathbf{a}$ denotes a vector, and $a$ denotes a scalar.
II-A Network and communication models
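Network Model. Consider a bidirectionally connected network of $N$ agents and $M$ arcs, whose underlying undirected communication graph is denoted as $\mathcal{G} = (\mathcal{N}, \mathcal{A})$, where $\mathcal{N}$ is the set of agents with cardinality $|\mathcal{N}| = N$ and $\mathcal{A}$ is the set of undirected arcs with cardinality $|\mathcal{A}| = M$. Two agents $i$ and $j$ are called neighbors when $(i, j) \in \mathcal{A}$ and, by the symmetry of the network, $(j, i) \in \mathcal{A}$. For agent $i$, its one-hop neighbors are in the set $\mathcal{N}_i$ with cardinality $d_i = |\mathcal{N}_i|$, which is also known as the degree of agent $i$. The degree matrix of the communication graph is $\mathbf{D}$, which is diagonal with the $i$th diagonal element being $d_i$. Define the symmetric adjacency matrix associated with the communication graph as $\mathbf{W}$, whose $(i, j)$th entry is 1 if agents $i$ and $j$ are neighbors and 0 otherwise. Define the unsigned incidence matrix and the signed incidence matrix of the communication graph as $\mathbf{S}_+ \in \mathbb{R}^{N \times M}$ and $\mathbf{S}_- \in \mathbb{R}^{N \times M}$, respectively. According to , we have
$$\mathbf{D} = \tfrac{1}{2}\left(\mathbf{S}_+ \mathbf{S}_+^\top + \mathbf{S}_- \mathbf{S}_-^\top\right), \qquad \mathbf{W} = \tfrac{1}{2}\left(\mathbf{S}_+ \mathbf{S}_+^\top - \mathbf{S}_- \mathbf{S}_-^\top\right).$$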
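As a quick sanity check of the graph matrices defined above, the following minimal sketch builds the unsigned and signed incidence matrices of a small graph and verifies the standard identities $\mathbf{S}_+\mathbf{S}_+^\top = \mathbf{D} + \mathbf{W}$ and $\mathbf{S}_-\mathbf{S}_-^\top = \mathbf{D} - \mathbf{W}$. It assumes that each undirected arc contributes one column and that the signed matrix uses an arbitrary orientation (+1 at the first endpoint, -1 at the second); the 4-agent graph and all function names are hypothetical.

```python
def incidence_matrices(num_agents, arcs):
    """Return (unsigned, signed) incidence matrices as lists of rows.

    Assumption: one column per undirected arc; the sign orientation is arbitrary.
    """
    s_plus = [[0] * len(arcs) for _ in range(num_agents)]
    s_minus = [[0] * len(arcs) for _ in range(num_agents)]
    for col, (i, j) in enumerate(arcs):
        s_plus[i][col] = 1
        s_plus[j][col] = 1
        s_minus[i][col] = 1
        s_minus[j][col] = -1
    return s_plus, s_minus


def gram(mat):
    """Compute mat @ mat^T for a list-of-rows matrix."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in mat] for r1 in mat]


def degree_and_adjacency(num_agents, arcs):
    """Return the diagonal degree matrix and the symmetric adjacency matrix."""
    adj = [[0] * num_agents for _ in range(num_agents)]
    for i, j in arcs:
        adj[i][j] = 1
        adj[j][i] = 1
    deg = [[sum(adj[i]) if i == j else 0 for j in range(num_agents)]
           for i in range(num_agents)]
    return deg, adj


def check_identities(num_agents, arcs):
    """Check S+ S+^T = D + W and S- S-^T = D - W entry by entry."""
    s_plus, s_minus = incidence_matrices(num_agents, arcs)
    deg, adj = degree_and_adjacency(num_agents, arcs)
    unsigned, signed = gram(s_plus), gram(s_minus)
    idx = range(num_agents)
    return all(
        unsigned[i][j] == deg[i][j] + adj[i][j]
        and signed[i][j] == deg[i][j] - adj[i][j]
        for i in idx for j in idx
    )


# Hypothetical 4-agent example: a path 0-1-2-3 plus the chord 0-2.
EXAMPLE_ARCS = [(0, 1), (1, 2), (2, 3), (0, 2)]
print(check_identities(4, EXAMPLE_ARCS))
```

Adding and subtracting the two identities recovers the degree and adjacency matrices from the incidence matrices, which is how these quantities typically enter the matrix form of decentralized ADMM updates.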
Communication Model. In this paper, we consider synchronous communications. That is, the iterative process of algorithm implementation consists of three stages: communication, observation, and computation. In the communication stage, each agent broadcasts its state variable to its neighbors and receives state variables from its neighbors according to the communication censoring rule, which shall be introduced later. After communicating with its neighbors, each agent collects its streaming data and formulates its own local objective function in the observation stage. In the computation stage, each agent carries out local updates based on the observed data, local objective function, and state variables.
II-B Random feature mapping
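Random feature (RF) mapping is proposed to make kernel methods scalable for large datasets. For a shift-invariant kernel that satisfies $\kappa(\mathbf{x}, \mathbf{x}') = \kappa(\mathbf{x} - \mathbf{x}')$, if $\kappa(\mathbf{x} - \mathbf{x}')$ is absolutely integrable, then its Fourier transform $p_\kappa(\boldsymbol{\omega})$ is guaranteed to be nonnegative ($p_\kappa(\boldsymbol{\omega}) \geq 0$), and hence can be viewed as its probability density function (pdf) when $\kappa$ is scaled to satisfy $\kappa(\mathbf{0}) = 1$. Therefore, we have
$$\kappa(\mathbf{x}, \mathbf{x}') = \int p_\kappa(\boldsymbol{\omega})\, e^{j \boldsymbol{\omega}^\top (\mathbf{x} - \mathbf{x}')}\, d\boldsymbol{\omega} = \mathbb{E}_{\boldsymbol{\omega}}\big[\phi(\mathbf{x}, \boldsymbol{\omega})\, \phi^*(\mathbf{x}', \boldsymbol{\omega})\big], \tag{1}$$
where $\mathbb{E}$ denotes the expectation operator, $\phi(\mathbf{x}, \boldsymbol{\omega}) := e^{j \boldsymbol{\omega}^\top \mathbf{x}}$ with $\boldsymbol{\omega} \in \mathbb{R}^d$, and $*$ is the complex conjugate operator. In (1), the first equality is the result of the Fourier inversion theorem, and the second equality arises by viewing $p_\kappa(\boldsymbol{\omega})$ as the pdf of $\boldsymbol{\omega}$. In this paper, we adopt a Gaussian kernel $\kappa(\mathbf{x}, \mathbf{x}') = \exp\big(-\|\mathbf{x} - \mathbf{x}'\|_2^2 / (2\sigma^2)\big)$, whose pdf is the normal distribution $p_\kappa(\boldsymbol{\omega}) = \mathcal{N}(\mathbf{0}, \sigma^{-2}\mathbf{I})$. The main idea of the RF mapping method is to approximate the kernel function by the sample average
$$\hat{\kappa}_L(\mathbf{x}, \mathbf{x}') := \frac{1}{L} \sum_{l=1}^{L} \phi(\mathbf{x}, \boldsymbol{\omega}_l)\, \phi^*(\mathbf{x}', \boldsymbol{\omega}_l) := \boldsymbol{\phi}_L^\dagger(\mathbf{x}')\, \boldsymbol{\phi}_L(\mathbf{x}), \tag{2}$$
where $\{\boldsymbol{\omega}_l\}_{l=1}^{L}$ are randomly drawn from the distribution $p_\kappa(\boldsymbol{\omega})$, $\boldsymbol{\phi}_L(\mathbf{x}) := \sqrt{1/L}\,[\phi(\mathbf{x}, \boldsymbol{\omega}_1), \ldots, \phi(\mathbf{x}, \boldsymbol{\omega}_L)]^\top$, and $\dagger$ is the conjugate transpose operator. For implementation, the following real-valued mapping is usually adopted:
$$\phi_r(\mathbf{x}, \boldsymbol{\omega}) = \big[\cos(\boldsymbol{\omega}^\top \mathbf{x}),\ \sin(\boldsymbol{\omega}^\top \mathbf{x})\big]^\top. \tag{3}$$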
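To make the RF approximation concrete, here is a minimal sketch that compares a Gaussian kernel value with the inner product of real-valued random features. It assumes the cosine/sine form of the real-valued mapping and frequencies drawn from a zero-mean normal distribution with standard deviation $1/\sigma$; the function names, the number of features, and the test points are illustrative choices, not values from the paper.

```python
import math
import random


def draw_frequencies(num_features, dim, sigma, rng):
    """Draw frequency vectors from N(0, sigma^{-2} I), the Fourier pair of the Gaussian kernel."""
    return [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
            for _ in range(num_features)]


def rf_map(x, freqs):
    """Real-valued RF mapping: sqrt(1/L) * [cos(w_l^T x), sin(w_l^T x)] for every w_l."""
    scale = math.sqrt(1.0 / len(freqs))
    feats = []
    for w in freqs:
        proj = sum(wi * xi for wi, xi in zip(w, x))
        feats.append(scale * math.cos(proj))
        feats.append(scale * math.sin(proj))
    return feats


def gaussian_kernel(x, y, sigma):
    """Exact Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def rf_kernel(x, y, freqs):
    """Kernel estimate given by the inner product of the two feature vectors."""
    return sum(a * b for a, b in zip(rf_map(x, freqs), rf_map(y, freqs)))


rng = random.Random(0)                      # fixed seed for reproducibility
freqs = draw_frequencies(2000, 3, 1.0, rng)  # hypothetical L = 2000, d = 3, sigma = 1
x_a, x_b = [0.2, -0.1, 0.4], [0.0, 0.3, -0.2]
print(gaussian_kernel(x_a, x_b, 1.0), rf_kernel(x_a, x_b, freqs))
```

The feature vector has a fixed length that depends only on the number of random frequencies, not on the number of data samples, which is what makes a fixed-size parametric model possible in the RF space.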
III Problem Statement
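Consider the network model described in Section II-A. Each agent $i \in \mathcal{N}$ in the network only has access to its locally observed data composed of independently and identically distributed (i.i.d.) input-label pairs $\{\mathbf{x}_{i,t}, y_{i,t}\}_{t=1}^{T_i}$ obeying an unknown probability distribution $p$ on $\mathcal{X} \times \mathcal{Y}$, with $\mathbf{x}_{i,t} \in \mathbb{R}^d$ and $y_{i,t} \in \mathbb{R}$. The decentralized learning task is to find a nonlinear prediction function $f$ such that $y_{i,t} = f(\mathbf{x}_{i,t}) + e_{i,t}$ for all $i$ and $t$, where the error term $e_{i,t}$ is minimized according to a certain optimality metric. This is usually achieved by minimizing the empirical risk:
$$\min_{f \in \mathcal{H}} \sum_{i=1}^{N} \left( \frac{1}{T_i} \sum_{t=1}^{T_i} \ell\big(f(\mathbf{x}_{i,t}), y_{i,t}\big) + \lambda \|f\|_{\mathcal{H}}^2 \right), \tag{4}$$
where $\ell$ is a nonnegative loss function, $\mathcal{H}$ is the function space that $f$ belongs to, and $\lambda > 0$ is a regularization parameter that controls over-fitting. For regression problems, a common loss function is the quadratic loss $\ell(f(\mathbf{x}), y) = (y - f(\mathbf{x}))^2$. For binary classification with $y \in \{-1, 1\}$, common loss functions are the hinge loss $\ell(f(\mathbf{x}), y) = \max(0, 1 - y f(\mathbf{x}))$ and the logistic loss $\ell(f(\mathbf{x}), y) = \log(1 + e^{-y f(\mathbf{x})})$.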
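Assume $f$ belongs to the RKHS $\mathcal{H}$ induced by a shift-invariant positive semidefinite kernel $\kappa$, and adopt the RF mapping method described in Section II-B. Then, the function to be learned in (4) can be approximated by the following representation:
$$\hat{f}(\mathbf{x}) = \boldsymbol{\theta}^\top \boldsymbol{\phi}_L(\mathbf{x}), \tag{5}$$
where $\boldsymbol{\theta} \in \mathbb{R}^{2L}$ is the decision vector to be learned in the RF space, and $\boldsymbol{\phi}_L(\mathbf{x})$ is the mapped data in the RF space using (3):
$$\boldsymbol{\phi}_L(\mathbf{x}) = \sqrt{\tfrac{1}{L}}\, \big[\cos(\boldsymbol{\omega}_1^\top \mathbf{x}),\ \sin(\boldsymbol{\omega}_1^\top \mathbf{x}),\ \ldots,\ \cos(\boldsymbol{\omega}_L^\top \mathbf{x}),\ \sin(\boldsymbol{\omega}_L^\top \mathbf{x})\big]^\top. \tag{6}$$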
With the approximation (5), the decentralized kernel learning problem is formulated as
where is the local copy of the global parameter associated with each agent . The constraint in (7) enforces the consensus constraint on neighboring agents and using an auxiliary variable . The optimization problem can then be solved using DKLA proposed in . A communication-censored algorithm (COKE) is also proposed in  to improve the communication efficiency of DKLA.
However, both DKLA and COKE operate in batch form, where all data are available at once. In many real-life applications, in contrast, function learning tasks are expected to be performed in an online fashion with sequentially arriving data. In this article, we consider the case where each agent $i$ collects the data points $\{\mathbf{x}_{i,t}, y_{i,t}\}$ in an online fashion, and the parameter is estimated based on instantaneous data samples. To achieve an optimal sublinear regret relative to the optimal performance of (7), we customize the general online decentralized alternating direction method of multipliers algorithm proposed in  to decentralized online kernel learning, so as to efficiently solve the online kernel learning problem over a decentralized network. At every time $t$, decentralized online kernel learning (approximately) solves an optimization problem to obtain the update from the current decision and the newly arrived data:
where is the local instantaneous cost function dependent of the new data only, whereas captures the influence of all the past data.
In the next section, we first propose a computation-efficient algorithm to solve (8). We then utilize communication-censoring and quantization strategies to improve the communication efficiency of the proposed algorithm.
IV Algorithm Development
In this section, we first utilize linearized ADMM to efficiently solve (8) and then add the censoring and quantization techniques to develop a communication-efficient decentralized online kernel learning algorithm.
For notational clarity, we define that contains all the local copies and . We further define the aggregated function as . With these definitions, we rewrite (8) in a matrix form for the update:
where and .
IV-A ODKLA: online decentralized kernel learning via linearized ADMM
where is the penalty parameter, is the Lagrange multiplier associated with the constraint . Then, at time , the updates of the primal variables , and the dual variable are sequentially given by
Note that given the instantaneous loss , iterates (11)-(13) run only once, and thus the optimization problem in (9) is only approximately solved. It has been proven in  that with initializations , and , the update of the auxiliary variable is not necessary and the Lagrange multiplier can be replaced by a lower-dimensional variable . The simplified ADMM updates for general online decentralized optimization can be found in . Though simplified, the general decentralized ADMM still requires solving a local optimization problem for the primal variable update, and is thus computationally intensive. To reduce the computational complexity of ADMM, we replace in (11) by its linear approximation at , and develop the Online Decentralized Kernel learning via Linearized ADMM (ODKLA) algorithm, where the iterates of and are generated by the simplified recursions
The ODKLA algorithm can be implemented distributedly. Specifically, each agent only needs to update a primal variable and a dual variable with the following iterations
Note that with linearized ADMM, at each time , ODKLA has closed-form solutions for all agents to update their primal variables, instead of solving optimization problems as in (11). Thus, the computational efficiency is improved. The ODKLA algorithm is outlined in Algorithm 1.
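The closed-form nature of a linearized ADMM round can be illustrated with a small simulation. The sketch below does not reproduce the recursions (16)-(17); it assumes a generic decentralized linearized ADMM form in which each agent takes a gradient-like step with stepsize $1/(2 c d_i + \rho)$ on its instantaneous loss plus a disagreement term and a dual variable, followed by a dual ascent step. The parameters `c` and `rho`, the quadratic loss, the path graph, and the synthetic noise-free data stream are all hypothetical.

```python
import math
import random


def linearized_admm_step(theta, lam, grads, neighbors, c, rho):
    """One synchronous round of an assumed linearized-ADMM-style update.

    Primal: theta_i <- theta_i - (grad_i + c * sum_j (theta_i - theta_j) + lam_i) / (2 c d_i + rho)
    Dual:   lam_i   <- lam_i + c * sum_j (theta_i_new - theta_j_new)
    """
    dim = len(theta[0])
    new_theta = []
    for i, nbrs in enumerate(neighbors):
        step = 1.0 / (2.0 * c * len(nbrs) + rho)
        row = []
        for k in range(dim):
            disagreement = sum(theta[i][k] - theta[j][k] for j in nbrs)
            row.append(theta[i][k] - step * (grads[i][k] + c * disagreement + lam[i][k]))
        new_theta.append(row)
    new_lam = [
        [lam[i][k] + c * sum(new_theta[i][k] - new_theta[j][k] for j in nbrs)
         for k in range(dim)]
        for i, nbrs in enumerate(neighbors)
    ]
    return new_theta, new_lam


def run_demo(num_rounds=2000, seed=0):
    """Three agents on a path graph learn a shared 2-d vector from streaming samples."""
    rng = random.Random(seed)
    neighbors = [[1], [0, 2], [1]]          # path graph 0 - 1 - 2
    target = [1.0, -2.0]                    # hypothetical ground-truth parameter
    theta = [[0.0, 0.0] for _ in neighbors]
    lam = [[0.0, 0.0] for _ in neighbors]
    for _ in range(num_rounds):
        grads = []
        for i in range(len(neighbors)):
            angle = rng.uniform(0.0, 2.0 * math.pi)
            z = [math.cos(angle), math.sin(angle)]   # unit-norm feature vector
            y = sum(t * zi for t, zi in zip(target, z))
            residual = sum(t * zi for t, zi in zip(theta[i], z)) - y
            grads.append([residual * zi for zi in z])  # gradient of 0.5 * residual^2
        theta, lam = linearized_admm_step(theta, lam, grads, neighbors, c=0.5, rho=2.0)
    return theta, lam, target


if __name__ == "__main__":
    final_theta, _, target = run_demo()
    print(final_theta, target)
```

Each round costs only a gradient evaluation and a few vector additions per agent, in contrast to solving a local optimization problem as standard ADMM would require.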
Remark 1. Our paper shares similar problem formulation (8) as . However, our methods differ from  in two ways. First, we utilize linearized ADMM to solve the decentralized kernel learning problem while  adopts the standard ADMM. Compared with , our algorithms enjoy light computation. Second, we also develop the communication efficient algorithms in the next section using quantization and communication censoring strategies while the communication efficiency is not discussed in .
IV-B QC-ODKLA: quantized and communication-censored ODKLA
ODKLA resolves the challenges caused by streaming data in a decentralized network setting in a computationally efficient manner. However, as seen in (16) - (17), agents communicate at every iteration, which lowers communication efficiency. Thus, we introduce communication censoring and quantization strategies to cope with limited communication resources and develop the Quantized and Communication-censored ODKLA algorithm (QC-ODKLA).
To start, we introduce a new state variable for each agent to record its latest broadcast primal variable up to time . Then, the difference between agent ’s updated state and its previously transmitted state at time is defined as
We then introduce an evaluation function
to evaluate whether the local updates are informative enough to be transmitted, with predefined positive constants and . If , then is deemed informative, and agent is allowed to transmit a quantized update to its neighbors. Here, quantization is introduced to reduce the communication cost in terms of the number of bits per transmission. To facilitate the measurement and analysis of the impact of quantization, we adopt the difference-based quantization scheme proposed in . That is, at time , instead of quantizing , we quantize the difference . Specifically, for each element within the range of , if we restrict the number of transmission bits to be , then we can evenly divide the range into intervals of equal length . Then the rounding quantizer applied to outputs
where is the floor operation. In practice, it is not necessary to transmit , instead, we can simply transmit the integer using the bits. Thus, the total number of bits for agent to transmit the quantized difference to its neighbors is only bits.
The whole communication process thus involves three parts, evaluation, quantization, and state update. If , then is deemed informative, and agent is allowed to transmit a quantized difference to its neighbors and updates its local state as . Otherwise, is censored, agent sets , and no information is transmitted. Similarly, upon receiving from its neighbor , agent updates the state variables of its neighbor’s as , otherwise, .
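The three parts of the communication stage can be sketched as follows. Because the evaluation function and the quantizer are not reproduced here, this sketch makes explicit assumptions: the update difference is compared against a decaying threshold of the form `alpha * beta**t`, and each element is rounded onto a uniform grid that splits `[-half_range, half_range]` into `2**bits - 1` equal intervals, so the transmitted integer index fits in `bits` bits. All names and numbers are hypothetical.

```python
import math


def should_transmit(diff, alpha, beta, t):
    """Assumed censoring rule: transmit only if ||diff|| >= alpha * beta**t."""
    norm = math.sqrt(sum(v * v for v in diff))
    return norm >= alpha * beta ** t


def quantize(diff, bits, half_range):
    """Uniform rounding quantizer on [-half_range, half_range] (assumed form).

    Returns the integer indices to transmit, the reconstructed values,
    and the interval length delta.
    """
    intervals = 2 ** bits - 1
    delta = 2.0 * half_range / intervals
    indices = []
    for v in diff:
        clipped = min(max(v, -half_range), half_range)
        indices.append(int(math.floor((clipped + half_range) / delta + 0.5)))
    values = [-half_range + k * delta for k in indices]
    return indices, values, delta


def communicate(theta_new, state_old, t, alpha, beta, bits, half_range):
    """Evaluation, quantization, and state update for one agent at time t."""
    diff = [a - b for a, b in zip(theta_new, state_old)]
    if not should_transmit(diff, alpha, beta, t):
        return None, list(state_old)          # censored: nothing is sent
    indices, values, _ = quantize(diff, bits, half_range)
    state_new = [s + q for s, q in zip(state_old, values)]
    return indices, state_new


# Hypothetical usage: 3 bits per element, differences assumed to lie in [-1, 1].
message, state = communicate([0.9, -0.4], [0.0, 0.0], t=1,
                             alpha=0.5, beta=0.9, bits=3, half_range=1.0)
print(message, state)
```

A neighbor that receives the integer indices can rebuild the same quantized difference locally and apply the same state update, so sender and receivers keep identical copies of the latest broadcast state.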
V Regret Analysis
In this section, we analyze the regret bound of QC-ODKLA. As in , we define the cumulative network regret of online decentralized learning as
where is the optimal solution of (7) that assumes all data are available. We prove that QC-ODKLA achieves the optimal sublinear regret for convex local cost functions . Since ODKLA is a special case of QC-ODKLA where both the quantization and communication-censoring strategies are absent, the regret analysis of QC-ODKLA extends to ODKLA straightforwardly. The following commonly used assumptions are adopted.
Assumption V.1. The local cost functions are convex and differentiable with respect to . Also, assume the gradients of the local cost functions are Lipschitz continuous with constants . That is, . The maximum Lipschitz constant is .
Assumption V.2. The estimates and the optimal solution of (7) are bounded. That is, , and .
Note that all assumptions are standard in online decentralized kernel learning [5, 17, 40]. The convexity of the local cost functions is easily satisfied in learning problems if the local cost functions are the square loss or the hinge loss.
To study the regret bound for QC-ODKLA, we notice that the difference between QC-ODKLA and ODKLA lies in the communication censoring step and the quantization step in the communication stage, which introduce an error if an update is censored and/or quantized in a transmission. We define the introduced error for agent at time as
and the overall introduced error at time as . We first show that the overall introduced error in QC-ODKLA is upper bounded by the quantization error and the pre-defined threshold parameters.
Lemma V.3. For the updates (21) and (22), under Assumptions V.1 and V.2, if the quantized difference is only allowed to be transmitted when for the pre-defined threshold parameters and , then, for any time , the overall error introduced in QC-ODKLA is upper bounded by
where is the length of the quantization interval.
Proof. Define , the introduced error for each agent can be represented as
According to the censoring rule, if for , we have , which implies . Otherwise, if for , we have , which implies since . Therefore, the overall introduced error .
With Lemma V.3, we are ready to establish the network regret bound of QC-ODKLA.
Proof. See Appendix A.
Remark 2. Note that in addition to the network size () and topology (), the communication censoring and quantization strategies (incorporated in ) also affect the cumulative network regret, which creates a trade-off between the communication efficiency and the online learning performance.
This section evaluates the performance of our ODKLA and QC-ODKLA algorithms in regression tasks for streaming data from real-world datasets.
Benchmarks. Since we consider the case where data are only locally available and cannot be shared among agents, we compare the proposed ODKLA and QC-ODKLA algorithms with two benchmarks: the RFF-DOKL algorithm, which is based on online gradient descent and a diffusion strategy, and the DOKL algorithm, which is based on online ADMM.
The regression tasks are carried out on 6 datasets available at the UCI machine learning repository. The detailed descriptions of the six datasets are listed below.
Tom’s hardware. This dataset contains samples with including the number of created discussions and the number of authors interacting on a topic, and representing the average number of displays of that topic to a visitor .
Twitter. This dataset consists of samples with being a feature vector reflecting the number of new interactive authors and the lengths of discussions on a given topic, etc., and representing the average number of active discussions on a certain topic. The learning task is to predict the popularity of these topics .
Energy. This dataset contains samples with describing the humidity and temperature in different areas of the house, as well as the pressure, wind speed, and visibility outside, while denotes the total energy consumption in the house .
Air quality. This dataset contains samples measured by a gas multi-sensor device in an Italian city, where represents the hourly concentration of CO, NOx, NO2, etc, while denotes the concentration of polluting chemicals in the air .
Conductivity. This dataset contains samples extracted from superconductors, where represents critical information for constructing a superconductor, such as the density and mass of atoms. The task is to predict the critical temperature of the superconductor .
Blood data. This dataset contains samples recorded by patient monitors at different hospitals where and the goal is to predict the blood pressure based on several physiological parameters from Photoplethysmography and Electrocardiogram signals .
Settings and parameter tuning. All experiments are conducted using MATLAB 2021 on a desktop with an Intel CPU @ 3.6 GHz and 32 GB of RAM. For each dataset, the data samples are randomly shuffled and then partitioned among nodes so that each node has samples. The features are normalized so that all values are between and . The number of random features adopted for RF approximation is throughout the simulations. The Gaussian kernel bandwidth is fine-tuned to be for the Tom’s hardware, Twitter, Air quality, and Blood datasets. For the Conductivity and Energy datasets, and , respectively. The regularization parameter . The stepsize and are fine-tuned via grid search for each method and each dataset individually. The connected graph is randomly generated with 5 or 10 nodes. For the Twitter, Conductivity, and Blood datasets, we use a 10-node network. The remaining datasets use a 5-node network. The censoring threshold parameters are for the Energy data, and for all the other datasets.
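The preprocessing described above can be sketched in a few lines. This is only an illustration under stated assumptions: the normalization range is taken to be [0, 1] via min-max scaling, each node receives the same number of samples (any remainder is dropped), and the function names and seed are hypothetical.

```python
import random


def min_max_normalize(rows):
    """Scale every feature column to [0, 1] (assumed range); constant columns map to 0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def shuffle_and_partition(samples, num_nodes, seed=0):
    """Randomly shuffle the samples and give each node an equal-sized share."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    per_node = len(shuffled) // num_nodes
    return [shuffled[k * per_node:(k + 1) * per_node] for k in range(num_nodes)]


# Hypothetical toy data: 4 samples with 2 features, then 23 sample ids over 5 nodes.
features = min_max_normalize([[1.0, 10.0], [3.0, 30.0], [2.0, 20.0], [5.0, 50.0]])
node_data = shuffle_and_partition(range(23), num_nodes=5)
print(features, [len(part) for part in node_data])
```

In an online experiment, each node would then reveal its share one sample at a time to emulate a data stream.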
MSE performance. We first evaluate the learning performance of all algorithms by the mean-squared error (MSE), which is commonly adopted in online learning problems [5, 17]. Figures 1 (a) - 6 (a) show that the learning performance of ODKLA, RFF-DOKL, and DOKL is very close, with only minor differences across datasets. Further, after introducing the communication censoring and quantization strategies, the learning performance of QC-ODKLA remains comparable to that of ODKLA. The quantization level is set to 3 bits per element.
Communication efficiency. We then evaluate the communication efficiency of the different algorithms. We present the MSE performance versus trigger counts in Figures 1 (b) - 6 (b) and the MSE performance versus communication bits in Figures 1 (c) - 6 (c). Figures 1 (b) - 6 (b) show that QC-ODKLA triggers only a few transmissions in the early learning stage, which greatly improves the communication efficiency. Further, thanks to quantization, QC-ODKLA needs only 3 bits to transmit an element, so the total number of communication bits is also greatly reduced. For the other methods, supposing that each agent operates in a 32-bit CPU mode, the communication cost is 32 bits per iteration per agent per element. Therefore, the results corroborate that QC-ODKLA greatly reduces the communication cost.
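The bit accounting behind this comparison is simple arithmetic, sketched below. The 3-bit and 32-bit costs per element come from the text; the number of transmissions and the number of elements per message are hypothetical placeholders chosen only to show the calculation.

```python
def total_bits(num_transmissions, num_elements, bits_per_element):
    """Total communication cost of one agent, in bits."""
    return num_transmissions * num_elements * bits_per_element


# Hypothetical scenario: 1000 iterations, 100 elements per message.
full_precision = total_bits(1000, 100, 32)   # transmit at every iteration, 32 bits/element
quantized_only = total_bits(1000, 100, 3)    # transmit at every iteration, 3 bits/element
censored_quantized = total_bits(400, 100, 3)  # assume censoring skips 600 of 1000 rounds

print(full_precision, quantized_only, censored_quantized)
print(full_precision / quantized_only)        # saving from quantization alone: 32 / 3
```

Quantization alone reduces the cost by a factor of 32/3, and every censored round removes an entire message on top of that.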
Computation efficiency. Finally, we evaluate the computation efficiency of all algorithms by their running time on six datasets, which is recorded in Table I. RFF-DOKL is a gradient descent-based first-order algorithm, which achieves the highest computation efficiency. Comparing ODKLA with the ADMM based DOKL method, we see that the linearization step reduces a large amount of computation of standard ADMM. Under the circumstance that online streaming data vary fast, a computation-efficient algorithm is preferred, reflecting the advantages of the proposed ODKLA and QC-ODKLA algorithms. Also, note that QC-ODKLA is computationally slower than ODKLA since the communication censoring and quantization steps consume computation resources.
This paper studies the online decentralized kernel learning problem under communication constraints for multi-agent systems. We utilize RF mapping to circumvent the curse of dimensionality caused by the increasing size of sequentially arriving data. To efficiently solve this challenging problem, we then develop a novel online decentralized kernel learning algorithm via linearized ADMM (ODKLA). We further integrate communication-censoring and quantization strategies into the proposed ODKLA algorithm, yielding QC-ODKLA, to save communication overhead. We derive a sublinear regret bound for QC-ODKLA theoretically, and verify the effectiveness of both algorithms in terms of learning performance, communication efficiency, and computation efficiency via simulations on various real datasets. Future work will be devoted to multi-kernel learning and dynamic kernel learning.
Appendix A Proof of Theorem V.4
Proof. Define , which is the stack of copies of , and , we rewrite (23) as
Observe from (32) that stays in the column space of if is also initialized therein. Therefore, we introduce variables , which stay in the column space of , and let for any . Then, (32) is equivalent to
where is the optimal primal-dual triplet.
Rearrange terms in (34) to place at the left side, we have
where the second equality utilizes and such that . We first bound the instantaneous regret at time . With Assumption V.1, it holds