Decentralized Online Learning: Take Benefits from Others' Data without Sharing Your Own to Track Global Trend

01/29/2019, by Yawei Zhao et al.

Decentralized online learning (online learning in decentralized networks) attracts more and more attention, since it is believed that it can help data providers cooperatively solve their online problems better without sharing their private data with a third party or other providers. Typically, the cooperation is achieved by letting the data providers exchange their models (e.g., recommendation models) with their neighbors. However, the best known regret bound for a decentralized online learning algorithm is n√(T), where n is the number of nodes (or users) and T is the number of iterations. This is clearly unsatisfying, since this bound can be achieved without any communication in the network. This motivates a fundamental question: can people really benefit from decentralized online learning by exchanging information? In this paper, we study when and why communication can help decentralized online learning reduce the regret. Specifically, each loss function is characterized by two components: the adversarial component and the stochastic component. Under this characterization, we show that decentralized online gradient (DOG) enjoys a regret bound n√(T)G + √(nT)σ, where G measures the magnitude of the adversarial component in the private data (or equivalently the local loss function) and σ measures the randomness within the private data. This regret suggests that people can benefit from the randomness in the private data by exchanging information. Another important contribution of this paper is to consider the dynamic regret -- a more practical notion of regret for tracking users' interest dynamics. Empirical studies are also conducted to validate our analysis.


1 Introduction

Decentralized online learning (or, online learning in decentralized networks) has received extensive attention in recent years (Shahrampour and Jadbabaie, 2018; Kamp et al., 2014; Koppel et al., 2018; Zhang et al., 2018a, 2017b; Xu et al., 2015; Akbari et al., 2017; Lee et al., 2016; Nedić et al., 2015; Lee et al., 2018; Benczúr et al., 2018; Yan et al., 2013). It assumes that computational nodes in a network can communicate with their neighbors to minimize an overall cumulative regret. Each computational node, which could be a user in practice, receives a stream of online losses that are usually determined by a sequence of examples arriving sequentially. Formally, we denote by $f_t^i$ the loss received by the $i$-th computational node in the network at the $t$-th iteration. The goal of decentralized online learning is usually to minimize the static regret, which is defined as the difference between the cumulative loss (the sum of the online losses over all nodes $i$ and steps $t$) suffered by the learning algorithm and that of the best model, which can observe all the loss functions beforehand.

Decentralized online learning has attracted increasing attention recently, mainly because the community believes that it enjoys the following advantages for real-world large-scale applications:

  • (Utilize all computational resources) It can utilize the computational resources of edge devices by avoiding collecting all the loss functions (or equivalently, data) at one central node and putting all of the computational burden on a single node.

  • (Protect data privacy) It can help many data providers collaborate to better minimize their cumulative loss, while at the same time protecting the data privacy as much as possible.

However, the current theoretical study does not explain why people need to use decentralized online learning, since the currently best regret bound for decentralized online learning with convex loss functions (Hosseini et al., 2013; Yan et al., 2013) is $O(n\sqrt{T})$, which is equal to the overall regret if each node (user) only runs local online gradient descent without any communication with others. (Here $n$ is the number of nodes or users and $T$ is the total number of iterations. The regret of an online algorithm is $O(\sqrt{T})$ for convex loss functions (Hazan, 2016; Shalev-Shwartz, 2012); therefore, the overall regret is $O(n\sqrt{T})$ if all users do not communicate.)

This motivates us to ask a fundamental question:

Can people really benefit, in terms of regret, from decentralized online learning by exchanging information?

In this paper, we mainly study when communication can really help decentralized online learning to minimize its regret. Specifically, we distinguish two components in the loss function $f_t^i$: the adversarial component and the stochastic component. Then we prove that decentralized online gradient can achieve a static regret bound of $O(n\sqrt{T}G + \sqrt{nT}\sigma)$, where $G$ measures the magnitude of the adversarial component, $\sigma$ measures the randomness of the private data, and the first term of the bound is due to the adversarial loss while the second term is due to the stochastic loss. Moreover, if a dynamic sequence of models with a budget $B$ is used as the reference points, the dynamic regret of decentralized online gradient is $O\left(n\sqrt{T(1+B)}\,G + \sqrt{nT(1+B)}\,\sigma\right)$. The dynamic regret is a more suitable performance metric for real-world applications where the optimal model changes over time; for example, people's favorite styles of pop music usually change over time with the global environment. This analysis shows that communication can help to minimize the stochastic losses, rather than the adversarial losses. This result is further verified empirically by extensive experiments on several real datasets.

Notations and definitions

In this paper, we use the following notation.

  • For any $i \in \{1,\dots,n\}$ and $t \in \{1,\dots,T\}$, the random variable $\xi_t^i$ is subject to a distribution $\mathcal{D}_t^i$, that is, $\xi_t^i \sim \mathcal{D}_t^i$. Besides, the set of random variables $\xi_t := \{\xi_t^i\}_{i=1}^n$ and the corresponding set of distributions $\mathcal{D}_t := \{\mathcal{D}_t^i\}_{i=1}^n$ are defined accordingly. For brevity, we use the notation $\xi \sim \mathcal{D}$ to represent that $\xi_t^i \sim \mathcal{D}_t^i$ holds for any $i$ and $t$. $\mathbb{E}$ represents mathematical expectation.

  • For a decentralized network, we use $W \in \mathbb{R}^{n\times n}$ to represent its confusion matrix. It is a symmetric doubly stochastic matrix, which implies that every element of $W$ is non-negative, $W = W^{\top}$, and $W\mathbf{1} = \mathbf{1}$. We use $\lambda_i(W)$ with $\lambda_1(W) \ge \lambda_2(W) \ge \cdots \ge \lambda_n(W)$ to represent its eigenvalues. Note that $\lambda_1(W) = 1$.

  • $\nabla$ represents the gradient operator. $\|\cdot\|$ represents the $\ell_2$ norm by default.

  • $\lesssim$ represents "less than or equal to, up to a constant factor".

  • $\mathcal{A}$ represents the set of all online algorithms.

  • $\mathbf{1}$ and $\mathbf{0}$ represent the vectors whose elements are all $1$ and all $0$, respectively.

2 Related work

Online learning has been studied for decades. The static regret of a sequential online convex optimization method can achieve $O(\sqrt{T})$ and $O(\log T)$ bounds for convex and strongly convex loss functions, respectively (Hazan, 2016; Shalev-Shwartz, 2012; Bubeck, 2011). Recently, both decentralized online learning and the dynamic regret have drawn much attention due to their wide presence in practical big-data scenarios.

2.1 Decentralized online learning

Online learning in a decentralized network has been studied in (Shahrampour and Jadbabaie, 2018; Kamp et al., 2014; Koppel et al., 2018; Zhang et al., 2018a, 2017b; Xu et al., 2015; Akbari et al., 2017; Lee et al., 2016; Nedić et al., 2015; Lee et al., 2018; Benczúr et al., 2018; Yan et al., 2013). Shahrampour and Jadbabaie (2018) provide a dynamic regret bound (as defined in Eq. (2)) for decentralized online mirror descent in terms of $n$, $T$, and $B$, which represent the number of nodes in the network, the number of iterations, and the budget of dynamics, respectively. When the Bregman divergence in decentralized online mirror descent is chosen appropriately, the method becomes identical to decentralized online gradient descent. In this paper, we achieve a better dynamic regret bound for a decentralized online gradient descent method, which mainly benefits from a better bound on the network error (see Lemma 5). Moreover, Kamp et al. (2014) present a static regret bound for decentralized online prediction. However, they assume that all the loss functions are generated from an unknown identical distribution; this assumption is too strong to hold in a dynamic environment or to apply to a general online learning task. Additionally, many decentralized online optimization methods have been proposed, for example, decentralized online multi-task learning (Zhang et al., 2018a), decentralized online ADMM (Xu et al., 2015), decentralized online gradient descent (Akbari et al., 2017), decentralized continuous-time online saddle-point methods (Lee et al., 2016), decentralized online Nesterov's primal-dual method (Nedić et al., 2015; Lee et al., 2018), and online distributed dual averaging (Hosseini et al., 2013). However, these previous works only studied the static regret bounds of decentralized online learning algorithms and did not provide any theoretical analysis for dynamic environments. Besides, Yan et al. (2013) provide necessary and sufficient conditions to preserve privacy for decentralized online learning methods, which will be studied to extend our method in future work.

2.2 Dynamic regret

The dynamic regret of online learning algorithms has been widely studied for decades (Zinkevich, 2003; Hall and Willett, 2015, 2013; Jadbabaie et al., 2015; Yang et al., 2016; Bedi et al., 2018; Zhang et al., 2017a; Mokhtari et al., 2016; Zhang et al., 2018b; György and Szepesvári, 2016; Wei et al., 2016; Zhao et al., 2018). The first dynamic regret is defined with respect to a sequence of reference points $\{z_t\}_{t=1}^{T}$ subject to $\sum_{t=2}^{T}\|z_t - z_{t-1}\| \le B$, where $B$ is a budget for the change of the reference points (Zinkevich, 2003). For this definition, online gradient descent can achieve a dynamic regret of $O(\sqrt{T(1+B)})$ by selecting an appropriate learning rate. Later, other types of dynamic regret were introduced by using different types of reference points. For example, Hall and Willett (2015, 2013) choose reference points satisfying $\sum_{t=2}^{T}\|z_t - \Phi(z_{t-1})\| \le B$, where $\Phi$ predicts the optimal model. When the function $\Phi$ predicts accurately, the budget can decrease significantly so that the dynamic regret effectively decreases. Jadbabaie et al. (2015); Yang et al. (2016); Bedi et al. (2018); Zhang et al. (2017a); Mokhtari et al. (2016); Zhang et al. (2018b) choose the reference points $z_t \in \arg\min_x f_t(x)$, where $f_t$ is the loss function at the $t$-th iteration. György and Szepesvári (2016) provide a new analysis framework that yields dynamic regret bounds for all the above reference points (they use the term "shifting regret" instead of "dynamic regret"; in this paper, we keep using "dynamic regret" as in most previous literature). Recently, Zhao et al. (2018) prove that the lower bound of the dynamic regret is $\Omega(\sqrt{T(1+B)})$, which indicates that the above algorithms are optimal in terms of dynamic regret. In this paper, we propose a new definition of dynamic regret, which covers all the previous ones as special cases. In addition, our regret bound degenerates to the dynamic regret of online gradient descent, $O(\sqrt{T(1+B)})$, when $n=1$ and $\sigma=0$.

In some of the literature, the regret in a dynamic environment is measured by the number of changes of the reference points over time, which is usually termed shifting regret or tracking regret (Herbster and Warmuth, 1998; György et al., 2005; Gyorgy et al., 2012; György and Szepesvári, 2016; Mourtada and Maillard, 2017; Adamskiy et al., 2016; Wei et al., 2016; Cesa-Bianchi et al., 2012; Mohri and Yang, 2018; Jun et al., 2017). Both the shifting regret and the tracking regret are usually studied in the setting of "learning with expert advice", while the dynamic regret is more often studied in the general setting of online learning.

3 Problem formulation

In decentralized online learning, the topological structure of the network can be represented by an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with vertex set $\mathcal{V} = \{1, 2, \dots, n\}$ and edge set $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$. In real applications, each node is associated with a separate learner, for example a mobile device of one user, which maintains a local predictive model. Users would like to cooperatively minimize their regret without sharing their private data, so they typically share their private models with their neighbors (or friends), i.e., the directly adjacent nodes in $\mathcal{G}$.

Let $x_t^i$ denote the local model for user $i$ at iteration $t$. At iteration $t$, user $i$ predicts with the local model $x_t^i$ for an unknown loss, and then receives the loss function $f_t^i(\cdot;\xi_t^i)$. As a result, the decentralized online learning algorithm suffers an instantaneous loss $f_t^i(x_t^i;\xi_t^i)$. The random variables $\xi_t^i$ are independent of each other in terms of $i$ and $t$, characterizing the stochastic component in the function $f_t^i$, while the subscript $t$ and superscript $i$ of $f_t^i$ indicate the adversarial component, for example, the user's profile, location, local time, etc. The stochastic component in the function $f_t^i$ is usually caused by the potential relation among local models. For example, users' preferences for music may be impacted by a popular trend on the Internet at the same time.

To measure the efficacy of a decentralized online learning algorithm $\mathcal{A}$, a commonly used performance measure is the static regret, which is defined as

$\mathrm{Regret}^{s}_{\mathcal{A}} := \sum_{t=1}^{T}\sum_{i=1}^{n} f_t^i(x_t^i;\xi_t^i) - \sum_{t=1}^{T}\sum_{i=1}^{n} f_t^i(x^{\star};\xi_t^i), \qquad (1)$

where $x^{\star} \in \arg\min_{x} \sum_{t=1}^{T}\sum_{i=1}^{n} f_t^i(x;\xi_t^i)$.

The static regret essentially assumes that the optimal model does not change over time. However, in many practical online learning scenarios, the optimal model may evolve over time. For example, when we conduct music recommendation for a user, the user's preference for music may change over time with his/her situation, so the optimal model should change over time. This leads to dynamics of the optimal recommendation model. Therefore, for any online algorithm $\mathcal{A}$, we choose to use the dynamic regret as the metric:

$\mathrm{Regret}^{d}_{\mathcal{A}} := \sum_{t=1}^{T}\sum_{i=1}^{n} f_t^i(x_t^i;\xi_t^i) - \sum_{t=1}^{T}\sum_{i=1}^{n} f_t^i(z_t;\xi_t^i), \qquad (2)$

where the sequence of reference points $\{z_t\}_{t=1}^{T}$ is defined by

$\{z_t\}_{t=1}^{T} \in \arg\min_{\{z_t\}:\, \sum_{t=2}^{T}\|z_t - z_{t-1}\| \le B}\; \sum_{t=1}^{T}\sum_{i=1}^{n} f_t^i(z_t;\xi_t^i),$

and the budget $B \ge 0$ restricts how much the optimal model may change over time. Obviously, when $B = 0$, the dynamic regret degenerates to the static regret.
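To make definitions (1) and (2) concrete, here is a minimal Python sketch (not from the paper; the data layout and the callable loss interface are illustrative assumptions) that evaluates the regret of a played model sequence against a given reference sequence and measures the budget that the reference sequence consumes:

```python
import numpy as np

def dynamic_regret(loss_fns, played, references):
    """Dynamic regret of the played models against a reference sequence.

    loss_fns   : list over t of lists over i of callables f_t^i(x) -> scalar loss.
    played     : array (T, n, d) of the models x_t^i actually played.
    references : array (T, d) of reference points z_t (shared by all nodes).
    """
    T = len(loss_fns)
    n = len(loss_fns[0])
    alg = sum(loss_fns[t][i](played[t, i]) for t in range(T) for i in range(n))
    ref = sum(loss_fns[t][i](references[t]) for t in range(T) for i in range(n))
    return alg - ref

def path_length(references):
    """Budget consumed by a reference sequence: sum_t ||z_t - z_{t-1}||."""
    diffs = np.diff(references, axis=0)
    return np.linalg.norm(diffs, axis=1).sum()
```

Setting all reference points equal (so `path_length` is zero, i.e., $B=0$) recovers the static regret of definition (1).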

4 Decentralized Online Gradient (DOG) algorithm

In this section, we introduce the DOG algorithm, followed by an analysis of its dynamic regret.

4.1 Algorithm description

1:Learning rate , number of iterations , and the confusion matrix .
2:Initialize for all ;
3:for  do
4:      For all users (say the -th node )
5:     Query the neighbors’ local models ;
6:     Observe the loss function , and suffer loss ;
7:     Query the gradient ;
8:     Update the local model by
9:end for
Algorithm 1 DOG: Decentralized Online Gradient method.

In the DOG algorithm, users exchange their local models periodically. In each iteration, each user runs the following steps:

  • (Query) Query the local models from all of his/her neighbors;

  • (Gradient) Apply the local model to the received loss function and obtain the gradient;

  • (Update) Update the local model by taking average with neighbors’ models followed by a gradient step.

The detailed description of the DOG algorithm can be found in Algorithm 1. $W$ is the confusion matrix of the graph $\mathcal{G}$, and it is a doubly stochastic matrix (Wu et al., 2018; Zeng and Yin, 2018; Yuan et al., 2016), but may not be symmetric. Given an adjacency matrix of a decentralized network, there are many methods, including Push-Sum (Assran et al., 2018) and Sinkhorn matrix scaling (Knight, 2008), to obtain such a $W$.
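As a small illustration of the latter option, the following sketch (not from the paper; the self-loop handling and the stopping rule are assumptions, and a connected graph is assumed) applies Sinkhorn-style alternating row/column normalization to an adjacency matrix to obtain an approximately doubly stochastic $W$:

```python
import numpy as np

def sinkhorn_doubly_stochastic(adjacency, num_iters=1000, tol=1e-9):
    """Scale a nonnegative matrix toward a doubly stochastic one by alternately
    normalizing rows and columns (Sinkhorn-Knopp style iteration)."""
    # Add self-loops so every row/column has positive mass.
    W = adjacency.astype(float) + np.eye(adjacency.shape[0])
    for _ in range(num_iters):
        W /= W.sum(axis=1, keepdims=True)   # make rows sum to 1
        W /= W.sum(axis=0, keepdims=True)   # make columns sum to 1
        if np.abs(W.sum(axis=1) - 1.0).max() < tol:
            break
    return W

# Example: ring topology over 5 nodes.
n = 5
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[i, (i - 1) % n] = 1.0
W = sinkhorn_doubly_stochastic(A)
print(np.allclose(W.sum(axis=0), 1.0), np.allclose(W.sum(axis=1), 1.0))
```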

Specifically, denote

$X_t := \left[x_t^1, x_t^2, \dots, x_t^n\right]^{\top} \quad\text{and}\quad \nabla F_t(X_t;\xi_t) := \left[\nabla f_t^1(x_t^1;\xi_t^1), \dots, \nabla f_t^n(x_t^n;\xi_t^n)\right]^{\top},$

so that the $i$-th row of $X_t$ is $(x_t^i)^{\top}$. From a global point of view, the updating rule of DOG can be cast into the form

$X_{t+1} = W X_t - \gamma \nabla F_t(X_t;\xi_t).$

Denote $\bar{x}_t := \frac{1}{n}\sum_{i=1}^{n} x_t^i$. Since $W$ is doubly stochastic, we can verify that

$\bar{x}_{t+1} = \bar{x}_t - \frac{\gamma}{n}\sum_{i=1}^{n} \nabla f_t^i(x_t^i;\xi_t^i).$

The proof is presented in Lemma 1. It shows that the average of the local models, i.e., $\bar{x}_t$, is trained in the style of gradient descent in a centralized network.
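A minimal Python sketch of this update (not the authors' implementation; the array shapes, the gradient-oracle interface grad_fn(i, t, x), and the quadratic example loss are illustrative assumptions) is given below:

```python
import numpy as np

def dog_step(X, W, grad_fn, t, gamma):
    """One round of Decentralized Online Gradient (DOG).

    X       : (n, d) array; row i is node i's current local model x_t^i.
    W       : (n, n) doubly stochastic confusion matrix.
    grad_fn : callable (i, t, x) -> gradient of node i's loss at iteration t.
    gamma   : learning rate.
    Returns the (n, d) array of updated local models x_{t+1}^i.
    """
    n, _ = X.shape
    grads = np.stack([grad_fn(i, t, X[i]) for i in range(n)])
    # Average neighbors' models (row i of W weights node i's neighbors),
    # then take a local gradient step.
    return W @ X - gamma * grads

# Example usage: 5 nodes, fully connected mixing, quadratic losses
# f_t^i(x) = 0.5 * ||x - target_i||^2, whose gradient is x - target_i.
n, d, gamma = 5, 3, 0.1
W = np.full((n, n), 1.0 / n)
targets = np.random.randn(n, d)
X = np.zeros((n, d))
for t in range(200):
    X = dog_step(X, W, lambda i, t, x: x - targets[i], t, gamma)
# The averaged model tracks the minimizer of the averaged loss.
print(np.allclose(X.mean(axis=0), targets.mean(axis=0), atol=1e-2))
```

In the example, the averaged model $\bar{x}_t$ behaves like centralized gradient descent on the averaged loss, mirroring the observation above.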

4.2 Dynamic regret of DOG

Next we show the dynamic regret of DOG. Before that, we first state the common assumptions used in our analysis. Denote the expected loss function by $F_t^i(x) := \mathbb{E}_{\xi_t^i}\, f_t^i(x;\xi_t^i)$.

Assumption 1.

We make the following assumptions throughout this paper:

  • For any $i$, $t$, and $x$, there exist constants $G$ and $\sigma$ such that $\left\|\nabla F_t^i(x)\right\| \le G$ and $\mathbb{E}_{\xi_t^i}\left\|\nabla f_t^i(x;\xi_t^i) - \nabla F_t^i(x)\right\|^2 \le \sigma^2$.

  • For the given reference vectors in the regret definition, we assume their norms are bounded by a constant $R$.

  • For any $i$ and $t$, we assume the function $f_t^i(\cdot;\xi_t^i)$ is convex and has an $L$-Lipschitz continuous gradient.

  • $W$ is a doubly stochastic matrix. Let $\lambda := \max\{|\lambda_2(W)|, |\lambda_n(W)|\}$ denote the second largest absolute value of the eigenvalues of $W$. We assume $\lambda < 1$.

$G$ essentially gives the upper bound for the adversarial component in $\nabla f_t^i$, while the stochastic component is bounded by $\sigma$. Note that if there is no stochastic component, $G$ is nothing but the upper bound on the gradient, as in the setting of much of the online learning literature. It is important for our analysis to split these two components, which will become clear very soon.

The last assumption, about $W$, is essential for the decentralized setting. The largest eigenvalue of a doubly stochastic matrix is $1$, and $1-\lambda$ is the spectral gap, measuring how fast information can propagate within the network (the larger, the faster).
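For illustration, the quantity $\lambda$ (and hence the spectral gap $1-\lambda$) can be computed numerically. The following sketch does so for a small ring network; the uniform 1/3 neighbor weights are an illustrative assumption, not the paper's choice:

```python
import numpy as np

def second_largest_abs_eigenvalue(W):
    """Return lambda = the largest absolute eigenvalue of a doubly stochastic W
    after excluding the leading eigenvalue 1."""
    abs_eigs = np.sort(np.abs(np.linalg.eigvals(W)))[::-1]  # descending
    return abs_eigs[1]

# Ring topology with uniform 1/3 weights over {left neighbor, self, right neighbor}.
n = 20
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
lam = second_largest_abs_eigenvalue(W)
print(f"lambda = {lam:.3f}, spectral gap = {1 - lam:.3f}")
```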

Now we are ready to present the dynamic regret for DOG.

Theorem 1.

Denote constants and by

Choosing in Algorithm 1, under Assumption 1 we have

By choosing an appropriate learning rate $\gamma$, we obtain a sublinear regret as follows.

Corollary 1.

Choosing

in Algorithm 1, under Assumption 1 we have

(3)

For simpler discussion, let us treat $L$, $\lambda$, and the other problem-dependent quantities as constants. The dynamic regret can then be simplified into $O\left(n\sqrt{T(1+B)}\,G + \sqrt{nT(1+B)}\,\sigma\right)$. If $B=0$, the dynamic regret degenerates to the static regret $O\left(n\sqrt{T}\,G + \sqrt{nT}\,\sigma\right)$. The discussion of the dynamic regret is conducted in the following aspects.

  • (Tightness.) To see the tightness, we consider a few special cases:

    • ($n=1$ and $\sigma=0$.) This degenerates to the vanilla online learning setting but with dynamic regret. The implied regret $O(\sqrt{T(1+B)}\,G)$ is consistent with the dynamic regret result in Zhao et al. (2018), which is proven to be optimal.

    • ($G=0$ and $B=0$.) This degenerates to the decentralized stochastic optimization scenario of Tang et al. (2018). The static regret $O(\sqrt{nT}\,\sigma)$ implies the convergence rate $O(\sigma/\sqrt{nT})$, which is consistent with the result in Tang et al. (2018).

  • (Insight.) Consider the baseline in which all users do not communicate but only run local online gradient descent. It is not hard to verify that the static regret of this baseline is $O\left(n\sqrt{T}(G+\sigma)\right)$. Compared to the static regret $O\left(n\sqrt{T}\,G + \sqrt{nT}\,\sigma\right)$ with iterative communication, the improvement is only on the stochastic component. Recall that $G$ measures the magnitude of the adversarial component and $\sigma$ measures the stochastic component. This result reveals an important observation: communication does not really help improve the adversarial component; only the stochastic component benefits from communication. This observation makes good sense, since if the users' private data are totally arbitrary, there is no reason they can benefit from each other by exchanging anything.

  • (Improving the existing dynamic regret in the decentralized setting.) Shahrampour and Jadbabaie (2018) only consider the adversarial loss and provide a dynamic regret bound for decentralized online learning. Compared with their result, our regret enjoys the state-of-the-art dependence on $n$ and $T$, and meanwhile improves the dependence on the budget $B$. The improvement comes from a better bound on the difference between the local models and their average at time $t$ (see Lemma 5).

  • (Improving the existing static regret in the decentralized setting.) When using the dynamic regret defined in (4), but choosing a local model $x_t^j$, instead of $x_t^i$, to feed the local loss function $f_t^i$, Zhang et al. (2017b) provide an $O(T^{3/4})$ static regret for the decentralized online conditional gradient method. Our analysis shows that the regret can be improved to $O(\sqrt{T})$ by using DOG.

Next, we discuss how close all the local models $x_t^i$ are to their average $\bar{x}_t$ at each time. The following result suggests that the $x_t^i$'s get closer and closer to $\bar{x}_t$ over the iterations.

Theorem 2.

Denote . Choosing

in Algorithm 1, under Assumption 1 we have

The result suggests that $x_t^i$ approaches $\bar{x}_t$ roughly at the stated rate (treating the problem-dependent quantities as constants), which is faster than the convergence of the averaged regret from Corollary 1.

4.3 Dynamic regret under different definition

For any online algorithm $\mathcal{A}$, the existing research (Zhang et al., 2017b) defines the regret by using any local model, e.g., $x_t^j$, instead of $x_t^i$, on the $i$-th node. It is defined by

$\mathrm{Regret}^{d}_{\mathcal{A},j} := \sum_{t=1}^{T}\sum_{i=1}^{n} f_t^i(x_t^j;\xi_t^i) - \sum_{t=1}^{T}\sum_{i=1}^{n} f_t^i(z_t;\xi_t^i), \qquad (4)$

where $x_t^j$ is the local model of any node $j \in \{1,\dots,n\}$ at time $t$. Inspired by Theorem 1 and Theorem 2, we find that this regret can be bounded as in the following theorem.

Theorem 3.

Denote constants and by

Choosing in Algorithm 1, under Assumption 1 we have

Choosing an appropriate learning rate $\gamma$, we can bound the regret as follows.

Corollary 2.

Choosing

in Algorithm 1, under Assumption 1 we have

As illustrated in Corollary 2, we obtain a dynamic regret bound for DOG under this definition. Zhang et al. (2017b) present an $O(T^{3/4})$ static regret (the case of $B=0$) for the decentralized online conditional gradient method. Compared with it, our analysis extends the regret to the dynamic setting and shows that the dependence on $T$ can be improved by using DOG.

Figure 1: An illustration of the dynamics caused by time-varying data distributions. The two data distributions change over time; to classify data drawn from these two distributions, the optimal classification model should also change over time.

5 Empirical studies

For simplicity, in the experiments we only consider online logistic regression with squared $\ell_2$-norm regularization. The objective function is

$f_t^i(x;\xi_t^i) = \log\left(1+\exp\left(-b\,a^{\top}x\right)\right) + \frac{\mu}{2}\|x\|^2,$

where $\mu$ is a given hyper-parameter and $\xi_t^i = (a, b)$ is the randomness of the function $f_t^i$, caused by the randomness of the data in the experiment. Under this setting, we compare the performance of the proposed Decentralized Online Gradient method (DOG) with that of the Centralized Online Gradient method (COG).
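For reference, here is a minimal sketch of this objective and its gradient (not the authors' code; the label convention $b \in \{-1,+1\}$ and the symbol mu follow the notational assumptions above):

```python
import numpy as np

def logistic_loss(x, a, b, mu):
    """l2-regularized logistic loss for one example (a, b), with b in {-1, +1}."""
    margin = b * np.dot(a, x)
    return np.logaddexp(0.0, -margin) + 0.5 * mu * np.dot(x, x)

def logistic_grad(x, a, b, mu):
    """Gradient of the loss above with respect to the model x."""
    margin = b * np.dot(a, x)
    weight = 1.0 / (1.0 + np.exp(margin))     # = 1 - sigmoid(margin)
    return -b * weight * a + mu * x

# Quick check at x = 0: loss is log(2), gradient is -0.5 * b * a + 0.
rng = np.random.default_rng(0)
a, b, x, mu = rng.normal(size=4), 1.0, np.zeros(4), 0.1
print(logistic_loss(x, a, b, mu), logistic_grad(x, a, b, mu))
```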

The dynamic budget $B$ is fixed to determine the space of reference points. The learning rate is tuned to be optimal for each dataset separately. We evaluate the learning performance by measuring the average loss

$\frac{1}{nt}\sum_{s=1}^{t}\sum_{i=1}^{n} f_s^i(x_s^i;\xi_s^i),$

instead of using the dynamic regret directly, since the optimal reference point is the same for both DOG and COG.
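A small sketch of this evaluation metric (the (T, n) array layout of the recorded losses is an assumption):

```python
import numpy as np

def average_loss(instant_losses):
    """Running average loss (1/(n*t)) * sum_{s<=t, i} f_s^i(x_s^i; xi_s^i) for each t.

    instant_losses : (T, n) array; entry [t, i] is node i's instantaneous loss at iteration t.
    Returns an array of length T with the average loss up to each iteration.
    """
    T, n = instant_losses.shape
    cumulative = np.cumsum(instant_losses.sum(axis=1))
    return cumulative / (n * np.arange(1, T + 1))
```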

5.1 Datasets

To test the proposed algorithm, we utilized a toy dataset and three real-world datasets, whose details are presented as follows.

Synthetic Data For the $i$-th node, a data matrix is generated by mixing an adversarial part and a stochastic part, where a parameter $\theta$ with $0 \le \theta \le 1$ balances the two components: a large $\theta$ means the adversarial component is significant, and the stochastic component becomes significant as $\theta$ decreases. Specifically, the elements of the adversarial part are uniformly sampled from a fixed interval, while the stochastic parts at different iterations are drawn from different (time-varying) distributions. The label is generated uniformly; when the label is positive, the stochastic part is sampled from one time-varying distribution, and when the label is negative, it is sampled from another time-varying distribution. Due to this correlation, the label can be considered as the label of the corresponding instance. The above dynamics of the time-varying distributions are illustrated in Figure 1, which shows the change of the optimal learning model over time and the importance of studying the dynamic regret.
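Since the exact distributions are not fully specified above, the following generator is only an illustrative sketch of such a mixture of adversarial and time-varying stochastic data; the uniform interval, the sinusoidal drift, and the Gaussian stochastic part are assumptions:

```python
import numpy as np

def make_synthetic(n_nodes, T, d, theta, seed=0):
    """Toy generator mixing a fixed adversarial part with a time-varying stochastic part.

    theta in [0, 1] balances the two parts: a large theta makes the adversarial
    component dominant; a small theta makes the stochastic component dominant.
    Returns features A of shape (T, n_nodes, d) and labels y in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    adversarial = rng.uniform(-1.0, 1.0, size=(n_nodes, d))   # fixed per node
    A = np.empty((T, n_nodes, d))
    y = np.empty((T, n_nodes))
    for t in range(T):
        drift = np.sin(2.0 * np.pi * t / T)                   # time-varying mean
        for i in range(n_nodes):
            label = rng.choice([-1.0, 1.0])
            stochastic = rng.normal(loc=label * (1.0 + drift), scale=1.0, size=d)
            A[t, i] = theta * adversarial[i] + (1.0 - theta) * stochastic
            y[t, i] = label
    return A, y
```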

Real Data Three real public datasets are room-occupancy (https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+), usenet2, and spam (both from http://mlkd.csd.auth.gr/concept_drift.html). room-occupancy is a time-series dataset collected from a natural dynamic environment. Both usenet2 and spam are "concept drift" datasets (Katakis et al., 2010), for which the optimal model changes over time. For all datasets, the values of every feature have been normalized to zero mean and unit variance.

Figure 2: Average loss on synthetic data with a ring topology, panels (a)-(d) using different values of $\theta$. Since the stochastic component becomes more significant relative to the adversarial component as $\theta$ decreases, DOG is more effective at reducing the average loss than local online gradient descent.
Figure 3: Average loss on (a) synthetic data with a random topology, and (b) room-occupancy, (c) usenet2, and (d) spam with a ring topology. The average loss yielded by DOG is comparable to that yielded by COG.
Figure 4: Average loss for different network sizes on (a) synthetic data with a random topology, and (b) room-occupancy, (c) usenet2, and (d) spam with a ring topology. The average loss yielded by DOG is insensitive to the network size.

5.2 Results

First, we test whether DOG is effective in reducing the stochastic component of the regret. It is compared with local online gradient descent (Local OGD), where every node trains a local model and there is no communication between any two nodes. We vary $\theta$ to generate different kinds of data, obtaining a different balance between the adversarial and stochastic components. As shown before, the stochastic component of the data becomes significant as $\theta$ decreases. Figure 2 shows that DOG becomes significantly more effective at reducing the stochastic component of the regret when $\theta$ is small. It validates that exchanging models in a decentralized network is necessary and important to reduce regret, which matches our theoretical result. In the following empirical studies, we generate data with a fixed value of $\theta$.

Second, DOG yields performance comparable to COG. Figure 3 summarizes the performance of DOG compared with COG on all the datasets. For the synthetic dataset, we simulated a decentralized network in which every node is randomly connected with several other nodes. For the three real datasets, we simulated networks whose nodes are connected by a ring topology. Under these settings, we observe that both DOG and COG are effective for the online learning tasks on all the datasets, while DOG achieves slightly worse performance.

Third, the performance of DOG is not sensitive to the network size, but is sensitive to the variance of the stochastic data. Figure 4 summarizes the effect of the network size on the performance of DOG. We vary the number of nodes on both the synthetic dataset (random topology) and the real datasets (ring topology). Figure 4 draws the curves of average loss over time. We observe that the average loss curves mostly overlap for different numbers of nodes. It shows that DOG is robust to the network size (or number of users), which validates our theory, namely that the average regret does not increase with the number of nodes. Furthermore, we observe that the average loss becomes larger as the variance of the stochastic data increases, which validates our theoretical result nicely.

Finally, the performance of DOG improves in a well-connected network. Figure 5 shows the effect of the network topology on the performance of DOG, where five different topologies are used. Besides the ring topology, the Disconnected topology means there are no edges in the network and no node shares its local model with others; the Fully connected topology means all nodes are connected, in which case DOG degenerates to COG; and the WattsStrogatz topology represents a Watts-Strogatz small-world graph, for which a parameter controls the number of stochastic edges (two settings of this parameter are used in this paper). The result shows that Fully connected enjoys the best performance, because $\lambda = 0$ for it while $\lambda > 0$ for the other topologies. Specifically, the value of $\lambda$ for each topology is presented in Table 1. A small $\lambda$ leads to good performance of DOG, which validates our theoretical result nicely.

                  DC    FC    Ring   WS (setting 1)   WS (setting 2)
synthetic data    1     0     0.99   0.37             0.58
real data         1     0     0.96   0.83             0.76
Table 1: $\lambda$ for the different topologies used in our experiments. A small $\lambda$ (i.e., a large spectral gap $1-\lambda$) represents good connectivity of the communication network. "DC" represents the Disconnected topology, "FC" the Fully connected topology, "Ring" the ring topology, and "WS" the WattsStrogatz topology.
Figure 5: Average loss under different network topologies on (a) synthetic data, (b) room-occupancy, (c) usenet2, and (d) spam. The average loss yielded by DOG is insensitive to the topology of the network.

6 Conclusion

We investigate the online learning problem in a decentralized network, where the loss is incurred by both adversarial and stochastic components. We define a new dynamic regret and propose a decentralized online gradient method. Using the new analysis framework, the decentralized online gradient method achieves an $O\left(n\sqrt{T(1+B)}\,G + \sqrt{nT(1+B)}\,\sigma\right)$ regret. It shows that communication is only effective in decreasing the regret caused by the stochastic component, and thus users can benefit from sharing their private models instead of their private data. Extensive empirical studies validate the theoretical results.

References

  • Adamskiy et al. (2016) D. Adamskiy, W. M. Koolen, A. Chernov, and V. Vovk. A closer look at adaptive regret. Journal of Machine Learning Research, 17(23):1–21, 2016.
  • Akbari et al. (2017) M. Akbari, B. Gharesifard, and T. Linder. Distributed online convex optimization on time-varying directed graphs. IEEE Transactions on Control of Network Systems, 4(3):417–428, Sep. 2017.
  • Assran et al. (2018) M. Assran, N. Loizou, N. Ballas, and M. Rabbat. Stochastic Gradient Push for Distributed Deep Learning. arXiv.org, Nov. 2018.
  • Bedi et al. (2018) A. S. Bedi, P. Sarma, and K. Rajawat. Tracking moving agents via inexact online gradient descent algorithm. IEEE Journal of Selected Topics in Signal Processing, 12(1):202–217, Feb 2018.
  • Benczúr et al. (2018) A. A. Benczúr, L. Kocsis, and R. Pálovics. Online Machine Learning in Big Data Streams. CoRR, 2018.
  • Bubeck (2011) S. Bubeck. Introduction to online optimization, December 2011.
  • Cesa-Bianchi et al. (2012) N. Cesa-Bianchi, P. Gaillard, G. Lugosi, and G. Stoltz. Mirror Descent Meets Fixed Share (and feels no regret). In NIPS 2012, page Paper 471, 2012.
  • Dufossé and Uccar (2016) F. Dufossé and B. Uccar. Notes on Birkhoff-von Neumann decomposition of doubly stochastic matrices. Research Report RR-8852, Inria - Research Centre Grenoble – Rhône-Alpes, Feb. 2016.
  • György and Szepesvári (2016) A. György and C. Szepesvári. Shifting regret, mirror descent, and matrices. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pages 2943–2951. JMLR.org, 2016.
  • György et al. (2005) A. György, T. Linder, and G. Lugosi. Tracking the Best of Many Experts. Proceedings of Conference on Learning Theory (COLT), 2005.
  • Gyorgy et al. (2012) A. Gyorgy, T. Linder, and G. Lugosi. Efficient tracking of large classes of experts. IEEE Transactions on Information Theory, 58(11):6709–6725, Nov 2012.
  • Hall and Willett (2013) E. C. Hall and R. Willett. Dynamical Models and tracking regret in online convex programming. In Proceedings of International Conference on International Conference on Machine Learning (ICML), 2013.
  • Hall and Willett (2015) E. C. Hall and R. M. Willett. Online Convex Optimization in Dynamic Environments. IEEE Journal of Selected Topics in Signal Processing, 9(4):647–662, 2015.
  • Hazan (2016) E. Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.
  • Herbster and Warmuth (1998) M. Herbster and M. K. Warmuth. Tracking the best expert. Machine Learning, 32(2):151–178, Aug 1998.
  • Hosseini et al. (2013) S. Hosseini, A. Chapman, and M. Mesbahi. Online distributed optimization via dual averaging. In 52nd IEEE Conference on Decision and Control, pages 1484–1489, Dec 2013.
  • Jadbabaie et al. (2015) A. Jadbabaie, A. Rakhlin, S. Shahrampour, and K. Sridharan. Online Optimization: Competing with Dynamic Comparators. In Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), pages 398–406, 2015.
  • Jun et al. (2017) K.-S. Jun, F. Orabona, S. Wright, and R. Willett. Improved strongly adaptive online learning using coin betting. In A. Singh and J. Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 54, pages 943–951, 20–22 Apr 2017.
  • Kamp et al. (2014) M. Kamp, M. Boley, D. Keren, A. Schuster, and I. Sharfman. Communication-efficient distributed online prediction by dynamic model synchronization. In Proceedings of the 2014th European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I, ECMLPKDD’14, pages 623–639, Berlin, Heidelberg, 2014. Springer-Verlag.
  • Katakis et al. (2010) I. Katakis, G. Tsoumakas, and I. Vlahavas. Tracking recurring contexts using ensemble classifiers: An application to email filtering. Knowledge and Information Systems, 22(3):371–391, 2010.
  • Knight (2008) P. Knight. The sinkhorn–knopp algorithm: Convergence and applications. SIAM Journal on Matrix Analysis and Applications, 30(1):261–275, 2008.
  • Koppel et al. (2018) A. Koppel, S. Paternain, C. Richard, and A. Ribeiro. Decentralized online learning with kernels. IEEE Transactions on Signal Processing, 66(12):3240–3255, June 2018.
  • Lee et al. (2016) S. Lee, A. Ribeiro, and M. M. Zavlanos. Distributed continuous-time online optimization using saddle-point methods. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 4314–4319, Dec 2016.
  • Lee et al. (2018) S. Lee, A. Nedić, and M. Raginsky. Coordinate dual averaging for decentralized online optimization with nonseparable global objectives. IEEE Transactions on Control of Network Systems, 5(1):34–44, March 2018.
  • Mohri and Yang (2018) M. Mohri and S. Yang. Competing with automata-based expert sequences. In A. Storkey and F. Perez-Cruz, editors, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84, pages 1732–1740, 09–11 Apr 2018.
  • Mokhtari et al. (2016) A. Mokhtari, S. Shahrampour, A. Jadbabaie, and A. Ribeiro. Online optimization in dynamic environments: Improved regret rates for strongly convex problems. In Proceedings of IEEE Conference on Decision and Control (CDC), pages 7195–7201. IEEE, 2016.
  • Mourtada and Maillard (2017) J. Mourtada and O.-A. Maillard. Efficient tracking of a growing number of experts. arXiv.org, Aug. 2017.
  • Nedić et al. (2015) A. Nedić, S. Lee, and M. Raginsky. Decentralized online optimization with global objectives and local communication. In 2015 American Control Conference (ACC), pages 4497–4503, July 2015.
  • Shahrampour and Jadbabaie (2018) S. Shahrampour and A. Jadbabaie. Distributed online optimization in dynamic environments using mirror descent. IEEE Transactions on Automatic Control, 63(3):714–725, March 2018.
  • Shalev-Shwartz (2012) S. Shalev-Shwartz. Online Learning and Online Convex Optimization. Foundations and Trends® in Machine Learning, 4(2):107–194, 2012.
  • Tang et al. (2018) H. Tang, S. Gan, C. Zhang, T. Zhang, and J. Liu. Communication Compression for Decentralized Training. arXiv.org, Mar. 2018.
  • Wei et al. (2016) C.-Y. Wei, Y.-T. Hong, and C.-J. Lu. Tracking the best expert in non-stationary stochastic environments. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Proceedings of Advances in Neural Information Processing Systems, pages 3972–3980, 2016.
  • Wu et al. (2018) T. Wu, K. Yuan, Q. Ling, W. Yin, and A. H. Sayed. Decentralized consensus optimization with asynchrony and delays. IEEE Transactions on Signal and Information Processing over Networks, 4(2):293–307, June 2018.
  • Xu et al. (2015) H.-F. Xu, Q. Ling, and A. Ribeiro. Online learning over a decentralized network through admm. Journal of the Operations Research Society of China, 3(4):537–562, Dec 2015.
  • Yan et al. (2013) F. Yan, S. Sundaram, S. V. N. Vishwanathan, and Y. Qi. Distributed autonomous online learning: Regrets and intrinsic privacy-preserving properties. IEEE Transactions on Knowledge and Data Engineering, 25(11):2483–2493, Nov 2013.
  • Yang et al. (2016) T. Yang, L. Zhang, R. Jin, and J. Yi. Tracking Slowly Moving Clairvoyant - Optimal Dynamic Regret of Online Learning with True and Noisy Gradient. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2016.
  • Yuan et al. (2016) K. Yuan, Q. Ling, and W. Yin. On the Convergence of Decentralized Gradient Descent. SIAM Journal on Optimization, 2016.
  • Zeng and Yin (2018) J. Zeng and W. Yin. On nonconvex decentralized gradient descent. IEEE Transactions on Signal Processing, 66(11):2834–2848, June 2018.
  • Zhang et al. (2018a) C. Zhang, P. Zhao, S. Hao, Y. C. Soh, B. S. Lee, C. Miao, and S. C. H. Hoi. Distributed multi-task classification: a decentralized online learning approach. Machine Learning, 107(4):727–747, Apr 2018a.
  • Zhang et al. (2017a) L. Zhang, T. Yang, J. Yi, R. Jin, and Z.-H. Zhou. Improved Dynamic Regret for Non-degenerate Functions. In Proceedings of Neural Information Processing Systems (NIPS), 2017a.
  • Zhang et al. (2018b) L. Zhang, T. Yang, R. Jin, and Z.-H. Zhou. Dynamic regret of strongly adaptive methods. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 5882–5891, 10–15 Jul 2018b.
  • Zhang et al. (2017b) W. Zhang, P. Zhao, W. Zhu, S. C. H. Hoi, and T. Zhang. Projection-free distributed online learning in networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, pages 4054–4062, International Convention Centre, Sydney, Australia, 06–11 Aug 2017b.
  • Zhao et al. (2018) Y. Zhao, S. Qiu, and J. Liu. Proximal Online Gradient is Optimum for Dynamic Regret. CoRR, cs.LG, 2018.
  • Zinkevich (2003) M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of International Conference on Machine Learning (ICML), pages 928–935, 2003.

Appendix

Proof of Theorem 1:

Proof.

From the regret definition, we have

Now, we begin to bound .

For , we have

The inequality holds because the function has $L$-Lipschitz gradients.

For , we have

holds due to

holds because the function has $L$-Lipschitz gradients.

Therefore, we obtain