I. Problem Formulation
Most online learning algorithms compute an estimate $\boldsymbol{w}_i$ at time $i$ by recursively updating the prior estimate $\boldsymbol{w}_{i-1}$ using the data observed at that same time instant. We consider in this work a general mapping (i.e., learning rule) of the form:
$$\boldsymbol{w}_i = \boldsymbol{\mathcal{T}}_i\left(\boldsymbol{w}_{i-1}\right) \tag{1}$$
where $\boldsymbol{\mathcal{T}}_i(\cdot)$ maps the iterate $\boldsymbol{w}_{i-1}$ to $\boldsymbol{w}_i$ using the data $\boldsymbol{x}_i$. Throughout this manuscript, we allow the mapping $\boldsymbol{\mathcal{T}}_i(\cdot)$ to be stochastic and time-varying due to the potentially time-varying distribution of the random variable $\boldsymbol{x}_i$. One popular instance of this recursion is the stochastic gradient algorithm [11]:
$$\boldsymbol{w}_i = \boldsymbol{w}_{i-1} - \mu\, \widehat{\nabla J}_i\left(\boldsymbol{w}_{i-1}; \boldsymbol{x}_i\right) \tag{2}$$
which can be used to estimate the minimizer of stochastic risks of the form:
$$w_i^o \triangleq \operatorname*{arg\,min}_{w}\; J_i(w) \triangleq \operatorname*{arg\,min}_{w}\; \mathbb{E}\, Q\left(w; \boldsymbol{x}_i\right) \tag{3}$$
where we write $w_i^o$, with a subscript $i$, to allow for the possibility of the minimizer drifting with time due to changes in the distribution of the streaming data $\boldsymbol{x}_i$. Of course, description (1) captures many more algorithmic variations besides the stochastic gradient algorithm (2), such as proximal [18, 10], empirical [22], variance-reduced [15, 12], distributed [9, 2, 1, 5], and second-order constructions [16]. We restrict ourselves in this work to the important class of mappings that satisfy the following mean-square contractive property, and illustrate later by means of examples that several popular learning mappings already satisfy this condition.

Definition 1 (Mean-square contraction). We say that a mapping $\boldsymbol{\mathcal{T}}_i(\cdot)$ is "mean-square contractive" around a "mean-square fixed-point" $w_i^o$ if for any iterate $\boldsymbol{w}_{i-1}$ generated by the mapping it holds that:
$$\mathbb{E}\left\|\boldsymbol{\mathcal{T}}_i\left(\boldsymbol{w}_{i-1}\right) - w_i^o\right\|^2 \le \gamma_i^2\, \mathbb{E}\left\|\boldsymbol{w}_{i-1} - w_i^o\right\|^2 + \sigma_i^2 \tag{4}$$
with $\gamma_i^2 < 1$. In general, the point $w_i^o$, the rate of contraction $\gamma_i^2$, and the additive term $\sigma_i^2$ will be functions of the distribution of $\boldsymbol{x}_i$, and are hence allowed to be time-varying to account for non-stationarity. ∎

We refer to the point $w_i^o$ as the "mean-square fixed-point" of the mapping $\boldsymbol{\mathcal{T}}_i(\cdot)$, since applying $\boldsymbol{\mathcal{T}}_i(\cdot)$ at $w_i^o$ yields, in light of (4):
$$\mathbb{E}\left\|\boldsymbol{\mathcal{T}}_i\left(w_i^o\right) - w_i^o\right\|^2 \le \sigma_i^2 \tag{5}$$
and hence $\boldsymbol{\mathcal{T}}_i\left(w_i^o\right) \approx w_i^o$ in the mean-square sense for small $\sigma_i^2$.
If the mapping happens to be deterministic and $\sigma_i^2 = 0$, we can drop the additive term as well as the expectation, and recover after taking square roots:
$$\left\|\boldsymbol{\mathcal{T}}_i\left(w_{i-1}\right) - w_i^o\right\| \le \gamma_i \left\|w_{i-1} - w_i^o\right\| \tag{6}$$
which corresponds to the traditional definition of a contractive mapping [7]
. As we shall show, a number of stochastic algorithms are mean-square contractive, allowing our exposition to cover them all. In the case of the stochastic gradient algorithm (2), the point $w_i^o$ will correspond to the minimizer of (3), in which case the two notions can be used interchangeably. In general, however, such as for the decentralized strategies (21)–(22) listed further ahead, we will need to make a subtle distinction. In addition to the stochastic nature of the mapping resulting from its dependence on the random variable $\boldsymbol{x}_i$, we allow $\boldsymbol{\mathcal{T}}_i(\cdot)$ to be time-varying due to drifts in the distribution of $\boldsymbol{x}_i$, which result in a drift of the fixed-point over time (this explains the subscript $i$ in $w_i^o$). Relations similar to (4) frequently appear as intermediate results in the performance analysis of stochastic algorithms in stationary environments, although stationarity is not necessary for establishing (4). By establishing a general tracking result for mean-square contractive mappings, and subsequently appealing to prior results establishing (4), we can recover known results, and also establish some new results on the tracking performance of stochastic learners for general loss functions.
I-A. Related Works
The tracking performance of adaptive filters, focusing primarily on mean-square-error designs, is fairly well established (see, e.g., [4, 16]). In the decentralized setting, though generally restricted to deterministic optimization with exact gradients, the tracking performance of primal and primal-dual algorithms has been studied in [8, 21, 19]. In the stochastic setting, the tracking performance of the diffusion strategy is established in [20], while the work [3] considers a federated learning architecture. The purpose of this work is to establish a unified tracking analysis for the broad class of mean-square contractive mappings, which includes many algorithms as special cases and allows us to efficiently recover new tracking results as well.
II. Tracking Analysis
II-A. Non-stationary Environments
We consider a time-varying environment, where the fixed-point $w_i^o$ evolves according to a random-walk model. Such models are prevalent in the study of non-stationary effects.

Assumption 1 (Random walk). We assume that the mean-square fixed-point of the mapping (1) evolves according to a random walk:
$$w_i^o = w_{i-1}^o + \boldsymbol{q}_i \tag{7}$$
where $\boldsymbol{q}_i$ is independent of $\boldsymbol{w}_{i-1}$. We allow the random variable $\boldsymbol{q}_i$ to be non-stationary, with potentially non-zero mean, and only require a global bound on its second-order moment, namely $\mathbb{E}\left\|\boldsymbol{q}_i\right\|^2 \le \sigma_q^2$. ∎

Note that, by allowing $\boldsymbol{q}_i$ to be non-stationary with non-zero mean, this assumption is more relaxed than what is typically assumed in the adaptive filtering literature [16, 20]. On the other hand, by only imposing a bound on the second-order moment of $\boldsymbol{q}_i$, rather than on its norm with probability one, condition (7) is also more relaxed than in related works on deterministic dynamic optimization (e.g., [6]). Letting $\widetilde{\boldsymbol{w}}_i \triangleq \boldsymbol{w}_i - w_i^o$ and using (4), we have:
$$\mathbb{E}\left\|\widetilde{\boldsymbol{w}}_i\right\|^2 \le \gamma_i^2\, \mathbb{E}\left\|\widetilde{\boldsymbol{w}}_{i-1} - \boldsymbol{q}_i\right\|^2 + \sigma_i^2 \stackrel{(a)}{\le} \gamma_i\, \mathbb{E}\left\|\widetilde{\boldsymbol{w}}_{i-1}\right\|^2 + \frac{\gamma_i^2\, \sigma_q^2}{1-\gamma_i} + \sigma_i^2 \tag{8}$$
where in step $(a)$ we used Jensen's inequality $\left\|a+b\right\|^2 \le \frac{1}{t}\left\|a\right\|^2 + \frac{1}{1-t}\left\|b\right\|^2$ for $t \in (0,1)$ with $t = \gamma_i$, along with the bound $\mathbb{E}\left\|\boldsymbol{q}_i\right\|^2 \le \sigma_q^2$ from the random-walk model (7) and $\gamma_i < 1$.
If the random variable $\boldsymbol{q}_i$ happens to be zero-mean and independent of $\widetilde{\boldsymbol{w}}_{i-1}$, the inequality can be sharpened by avoiding the use of Jensen's inequality in step $(a)$ of (8) and instead appealing to the independence of $\boldsymbol{q}_i$ from $\widetilde{\boldsymbol{w}}_{i-1}$ and $\mathbb{E}\,\boldsymbol{q}_i = 0$, so that the cross-term vanishes. This results in:
$$\mathbb{E}\left\|\widetilde{\boldsymbol{w}}_i\right\|^2 \le \gamma_i^2\, \mathbb{E}\left\|\widetilde{\boldsymbol{w}}_{i-1}\right\|^2 + \gamma_i^2\, \sigma_q^2 + \sigma_i^2 \le \gamma_i^2\, \mathbb{E}\left\|\widetilde{\boldsymbol{w}}_{i-1}\right\|^2 + \sigma_i^2 + \sigma_q^2 \tag{9}$$
In order to continue with the analysis, we assume the following.

Assumption 2 (Global bounds). The rate of contraction $\gamma_i^2$ as well as the driving term $\sigma_i^2$ are bounded from above for all $i$, i.e., $\gamma_i^2 \le \gamma^2 < 1$ and $\sigma_i^2 \le \sigma^2$. ∎

As we will see in Section III-A, Assumption 2 generalizes conditions typically imposed in the study of adaptive filters in non-stationary environments. After iterating (8) and (9), we arrive at the next result.

Theorem 1 (Tracking performance). Suppose $\boldsymbol{\mathcal{T}}_i(\cdot)$ is a mean-square contractive mapping according to Definition 1. Then, we have:
$$\mathbb{E}\left\|\widetilde{\boldsymbol{w}}_i\right\|^2 \le \gamma^i\, \mathbb{E}\left\|\widetilde{\boldsymbol{w}}_0\right\|^2 + \frac{\sigma^2}{1-\gamma} + \frac{\gamma^2\, \sigma_q^2}{\left(1-\gamma\right)^2} \tag{10}$$
In the case when $\mathbb{E}\,\boldsymbol{q}_i = 0$ for all $i$, we have the tighter relation:
$$\mathbb{E}\left\|\widetilde{\boldsymbol{w}}_i\right\|^2 \le \gamma^{2i}\, \mathbb{E}\left\|\widetilde{\boldsymbol{w}}_0\right\|^2 + \frac{\sigma^2 + \sigma_q^2}{1-\gamma^2} \tag{11}$$
Proof: Iterate (8) and (9), apply the bounds of Assumption 2, and sum the resulting geometric series. ∎

We note that in steady state, the transient terms $\gamma^i\, \mathbb{E}\|\widetilde{\boldsymbol{w}}_0\|^2$ and $\gamma^{2i}\, \mathbb{E}\|\widetilde{\boldsymbol{w}}_0\|^2$ vanish exponentially, and we are left with a drift term proportional to $\sigma_q^2$ and a second term proportional to $\sigma^2$. Furthermore, we note that the non-stationary result (11) can be obtained from the stationary result with $\sigma_q^2 = 0$ by merely adding the drift term $\sigma_q^2/(1-\gamma^2)$.
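The exponential decay of the transient and the resulting steady-state level can be checked by iterating the scalar recursion underlying the zero-mean case; the constants below are illustrative placeholders, not values from the text:

```python
# Iterate a_i = gamma^2 * a_{i-1} + sigma^2 + sigma_q^2, the scalar recursion
# behind the zero-mean tracking bound, and compare against its fixed point.
gamma2, sigma2, sigma_q2 = 0.9, 0.01, 0.001   # assumed: contraction, noise, drift
a = 10.0                                      # initial mean-square deviation
for _ in range(500):
    a = gamma2 * a + sigma2 + sigma_q2
steady = (sigma2 + sigma_q2) / (1 - gamma2)   # predicted steady-state level
assert abs(a - steady) < 1e-9                 # transient has vanished exponentially
```

Setting `sigma_q2 = 0` recovers the stationary level, confirming that the drift enters purely as an additive penalty.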
III. Application to Learning Algorithms
We now show how the tracking result of Theorem 1 can be used to recover the tracking performance of several well-known algorithms under the random-walk model (7). We begin by re-deriving and generalizing some known tracking results, both to illustrate the implications of the global bounds and to verify the theorem, and then proceed to derive new tracking results for the multitask diffusion algorithm [1, 13, 14].
III-A. Least-Mean-Squares (LMS) Algorithm
For illustration purposes, we begin with the least-mean-squares algorithm, which takes the form:
$$\boldsymbol{w}_i = \boldsymbol{w}_{i-1} + \mu\, \boldsymbol{u}_i \left(\boldsymbol{d}_i - \boldsymbol{u}_i^{\mathsf{T}} \boldsymbol{w}_{i-1}\right) \tag{12}$$
where the data $\left\{\boldsymbol{d}_i, \boldsymbol{u}_i\right\}$ arises from the linear model:
$$\boldsymbol{d}_i = \boldsymbol{u}_i^{\mathsf{T}} w_i^o + \boldsymbol{v}_i \tag{13}$$
in which $\boldsymbol{u}_i$ denotes an independent sequence of regressors and $\boldsymbol{v}_i$ denotes measurement noise. As is standard in the study of the transient behavior of adaptive filters (see, e.g., [16, Part V]), we subtract (12) from $w_i^o$, take squares and expectations to obtain:
$$\mathbb{E}\left\|\boldsymbol{w}_i - w_i^o\right\|^2 \le \gamma_i^2\, \mathbb{E}\left\|\boldsymbol{w}_{i-1} - w_i^o\right\|^2 + \sigma_i^2 \tag{14}$$
with $R_{u,i} \triangleq \mathbb{E}\, \boldsymbol{u}_i \boldsymbol{u}_i^{\mathsf{T}}$, $\gamma_i^2 = 1 - 2\mu\, \lambda_{\min}\left(R_{u,i}\right) + O\left(\mu^2\right)$, and $\sigma_i^2 = \mu^2\, \sigma_{v,i}^2\, \mathrm{Tr}\left(R_{u,i}\right)$. Examination of $\gamma_i^2$ and $\sigma_i^2$ shows that the LMS algorithm (12) satisfies the global bounds of Assumption 2 whenever the moments of the regressor and measurement noise are time-invariant (or bounded). This does not restrict the drift of the objective $w_i^o$, and the measurements $\boldsymbol{d}_i$ will, of course, be non-stationary as a result. This assumption is also consistent with the modeling conditions typically applied when studying the tracking performance of adaptive filters [16, Eq. (20.16)]. Assuming stationarity of the regressor and measurement noise, we find for small step-sizes $\mu$:
$$\gamma^2 \approx 1 - 2\mu\, \lambda_{\min}\left(R_u\right) \tag{15}$$
$$\sigma^2 = \mu^2\, \sigma_v^2\, \mathrm{Tr}\left(R_u\right) \tag{16}$$
Hence, we have from (11):
$$\limsup_{i \to \infty}\, \mathbb{E}\left\|\widetilde{\boldsymbol{w}}_i\right\|^2 \le \frac{\mu\, \sigma_v^2\, \mathrm{Tr}\left(R_u\right)}{2\, \lambda_{\min}\left(R_u\right)} + \frac{\sigma_q^2}{2\mu\, \lambda_{\min}\left(R_u\right)} \tag{17}$$
The result is consistent with [16, Lemma 21.1], with the factor $\lambda_{\min}^{-1}\left(R_u\right)$ appearing in (17) since we are considering here the mean-square deviation of $\boldsymbol{w}_i$ around $w_i^o$, rather than the excess mean-square error studied in [16, Lemma 21.1]. When the drift term $\boldsymbol{q}_i$ is no longer zero-mean, we can bound:
$$1 - \gamma \approx \mu\, \lambda_{\min}\left(R_u\right) \tag{18}$$
and find from (10):
$$\limsup_{i \to \infty}\, \mathbb{E}\left\|\widetilde{\boldsymbol{w}}_i\right\|^2 \le \frac{\mu\, \sigma_v^2\, \mathrm{Tr}\left(R_u\right)}{\lambda_{\min}\left(R_u\right)} + \frac{\sigma_q^2}{\mu^2\, \lambda_{\min}^2\left(R_u\right)} \tag{19}$$
We observe that the drift penalty incurred when $\boldsymbol{q}_i$ has non-zero mean is $O\left(\sigma_q^2/\mu^2\right)$, which is significantly larger than the $O\left(\sigma_q^2/\mu\right)$ penalty in the case where $\mathbb{E}\,\boldsymbol{q}_i = 0$. This is to be expected, as the cumulative effect of $\boldsymbol{q}_i$ in the recursive relation (7) is no longer equal to zero when the mean is non-zero.
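These scalings can be checked in simulation. The following sketch runs LMS against a random-walk model under assumed parameters; the predicted level uses the small-step-size approximation with white regressors, so only the order of magnitude is meaningful:

```python
import numpy as np

rng = np.random.default_rng(1)
d, mu, sigma_v, sigma_q = 4, 0.05, 0.1, 0.01
w_o = np.zeros(d)                  # drifting true model
w = np.zeros(d)                    # LMS estimate
msd = []
for i in range(20000):
    w_o = w_o + sigma_q * rng.standard_normal(d)      # zero-mean random-walk drift
    u = rng.standard_normal(d)                        # white regressor
    y = u @ w_o + sigma_v * rng.standard_normal()     # noisy linear measurement
    w = w + mu * u * (y - u @ w)                      # LMS update
    msd.append(np.sum((w - w_o) ** 2))

steady = np.mean(msd[5000:])       # empirical steady-state mean-square deviation
# With R_u = I, the small-mu prediction is mu*sigma_v^2*d/2 + sigma_q^2*d/(2*mu)
predicted = mu * sigma_v ** 2 * d / 2 + sigma_q ** 2 * d / (2 * mu)
assert 0.2 * predicted < steady < 5 * predicted
```

Sweeping `mu` would reveal the characteristic trade-off: the noise term grows with the step-size while the drift term decays with it, so an optimal step-size exists.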
III-B. Decentralized Stochastic Optimization
We now consider the problem of general decentralized stochastic optimization. We associate with each agent $k$ a cost:
$$J_k(w) \triangleq \mathbb{E}\, Q_k\left(w; \boldsymbol{x}_{k}\right) \tag{20}$$
In this section, we consider the diffusion algorithm for decentralized stochastic optimization [2, 17]:
$$\boldsymbol{\psi}_{k,i} = \boldsymbol{w}_{k,i-1} - \mu\, \widehat{\nabla J}_k\left(\boldsymbol{w}_{k,i-1}; \boldsymbol{x}_{k,i}\right) \tag{21}$$
$$\boldsymbol{w}_{k,i} = \sum_{\ell \in \mathcal{N}_k} a_{\ell k}\, \boldsymbol{\psi}_{\ell,i} \tag{22}$$
for pursuing the minimizer of the aggregate cost:
$$J(w) \triangleq \sum_{k=1}^{K} p_k\, J_k(w) \tag{23}$$
where $p \triangleq \mathrm{col}\left\{p_k\right\}$ denotes the right Perron eigenvector associated with the left-stochastic combination matrix $A = \left[a_{\ell k}\right]$
[2]. If we collect $\boldsymbol{w}_i \triangleq \mathrm{col}\left\{\boldsymbol{w}_{k,i}\right\}$ and $\boldsymbol{x}_i \triangleq \mathrm{col}\left\{\boldsymbol{x}_{k,i}\right\}$, the diffusion recursion (21)–(22) can be viewed as an instance of (1). Note that by setting the number of agents to one we recover ordinary centralized stochastic gradient descent (2), and as such the results in this section apply to that case as well. We impose the following standard assumptions on the costs as well as the stochastic gradient approximation [17].

Assumption 3 (Bounded Hessian). Each cost $J_k(w)$ is twice-differentiable with bounded Hessian, i.e., $0 < \nu\, I \le \nabla^2 J_k(w) \le \delta\, I$. ∎

Note that this condition ensures that each $J_k(w)$ is strongly convex with Lipschitz gradients, and that the respective parameters are bounded independently of $i$. Independence of the bounds on problem parameters over time is common in the study of optimization algorithms in non-stationary and dynamic environments [20, 6] and will ensure that the global bounds of Assumption 2 are satisfied. We additionally assume that the objectives of the agents do not drift too far apart.

Assumption 4 (Bounded disagreement). The distance between each pair of local minimizers is bounded independently of $i$, i.e.:
$$\left\|w_{k,i}^o - w_{\ell,i}^o\right\| \le \xi \tag{24}$$
for all pairs $(k, \ell)$ and times $i$. ∎

We also make the following common assumption on the quality of the gradient estimate.

Assumption 5 (Gradient noise). The construction $\widehat{\nabla J}_k\left(w; \boldsymbol{x}_{k,i}\right)$ approximates the true gradient of (20) sufficiently well, i.e.:
$$\boldsymbol{s}_{k,i}(w) \triangleq \widehat{\nabla J}_k\left(w; \boldsymbol{x}_{k,i}\right) - \nabla J_k(w) \tag{25}$$
$$\mathbb{E}\left[\boldsymbol{s}_{k,i}(w) \mid \boldsymbol{\mathcal{F}}_{i-1}\right] = 0 \tag{26}$$
$$\mathbb{E}\left[\left\|\boldsymbol{s}_{k,i}(w)\right\|^2 \mid \boldsymbol{\mathcal{F}}_{i-1}\right] \le \beta^2 \left\|w_{k,i}^o - w\right\|^2 + \sigma_s^2 \tag{27}$$
where $\boldsymbol{\mathcal{F}}_{i-1}$ denotes the filtration of random variables up to time $i-1$, for all $k$ and some constants $\beta^2, \sigma_s^2$ independent of $i$. ∎

It has already been established that the diffusion recursion (21)–(22) is a mean-square contractive mapping according to Definition 1 for some $\gamma_i^2$ and $\sigma_i^2$ in stationary environments [2, Eq. (58)]. In order to recover tracking performance through Theorem 1, we need to ensure that the rate of contraction and driving term can be bounded independently of time $i$, i.e., that Assumption 2 holds under the conditions above.

Corollary 1 (Tracking performance of diffusion). The diffusion algorithm (21)–(22) is mean-square contractive around $w_i^o$ with $\gamma^2 = 1 - O\left(\mu\nu\right)$, and
$$\sigma^2 = O\left(\mu^2\, \sigma_s^2\right) + O\left(\frac{\mu^2\, \lambda_2^2\, \xi^2}{\left(1-\lambda_2\right)^2}\right) \tag{28}$$
where $\lambda_2$ denotes the second largest magnitude eigenvalue of the combination matrix $A$, and the $O(\cdot)$ factors hide problem-independent constants. The quantity $w_i^o$ denotes the fixed-point from Definition 1 which, in light of (28), is within $O(\mu)$ of the minimizer of (23). The tracking performance is given by:
$$\limsup_{i \to \infty}\, \mathbb{E}\left\|\widetilde{\boldsymbol{w}}_i\right\|^2 = O\left(\mu\, \sigma_s^2\right) + O\left(\frac{\sigma_q^2}{\mu^2}\right) \tag{29}$$
where $\widetilde{\boldsymbol{w}}_i \triangleq \boldsymbol{w}_i - w_i^o$. When $\mathbb{E}\,\boldsymbol{q}_i = 0$, we have:
$$\limsup_{i \to \infty}\, \mathbb{E}\left\|\widetilde{\boldsymbol{w}}_i\right\|^2 = O\left(\mu\, \sigma_s^2\right) + O\left(\frac{\sigma_q^2}{\mu}\right) \tag{30}$$
∎ When the gradient approximation is exact, i.e., $\sigma_s^2 = 0$, we recover from (29) a tracking error of $O\left(\sigma_q^2/\mu^2\right)$, which aligns with the result [6, Remark 1], where deterministic dynamic optimization with exact gradients is considered. On the other hand, when $\mathbb{E}\,\boldsymbol{q}_i = 0$, we find $O\left(\mu\, \sigma_s^2\right) + O\left(\sigma_q^2/\mu\right)$ from (30) and recover [20, Eq. (80)] up to problem-independent factors.
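A minimal sketch of the adapt-then-combine diffusion recursion on assumed quadratic agent costs follows; the ring topology, common minimizer, and noise level are illustrative choices, so this exercises only the stationary contraction behavior, not the tracking bounds:

```python
import numpy as np

rng = np.random.default_rng(2)
K, d, mu = 5, 3, 0.05
# Doubly stochastic combination matrix on a ring (assumed topology)
A = np.zeros((K, K))
for k in range(K):
    A[k, k] = 0.5
    A[k, (k - 1) % K] = A[k, (k + 1) % K] = 0.25

w_star = rng.standard_normal(d)     # common minimizer shared by all agents
W = np.zeros((K, d))                # row k holds agent k's iterate
for i in range(3000):
    # adapt: local stochastic-gradient step on the quadratic cost
    G = (W - w_star) + 0.1 * rng.standard_normal((K, d))
    Psi = W - mu * G
    # combine: convex combination of neighboring intermediate iterates
    W = A @ Psi

# every agent ends up in a small neighborhood of the minimizer
assert np.max(np.linalg.norm(W - w_star, axis=1) ** 2) < 10 * mu
```

Setting the number of agents to one and the combination matrix to the scalar one collapses the recursion to centralized stochastic gradient descent, consistent with the remark above.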
III-C. Multitask Decentralized Learning
In this section, we continue to consider a collection of $K$ agents, each with an associated local cost (20). However, instead of pursuing the Pareto solution (23), we pursue the multitask problem [13]:
$$\min_{\left\{w_k\right\}}\; \sum_{k=1}^{K} J_k\left(w_k\right) + \frac{\eta}{2} \sum_{k=1}^{K} \sum_{\ell \in \mathcal{N}_k} c_{\ell k} \left\|w_k - w_\ell\right\|^2 \tag{31}$$
where the regularizer corresponds to the weighted Laplacian matrix associated with the graph adjacency matrix $C = \left[c_{\ell k}\right]$. The formulation (31), in contrast to (23), does not force the agents in the network to reach consensus, and instead allows for the independent minimization of each $J_k\left(w_k\right)$ subject to a coupling smoothness regularizer with strength $\eta$. We refer the reader to [13, 14] for a more detailed motivation for minimizing (31) instead of (23), and focus here on the tracking performance of the resulting algorithm. A solution to (31) can be pursued via the multitask diffusion strategy [1, 13]:
$$\boldsymbol{\psi}_{k,i} = \boldsymbol{w}_{k,i-1} - \mu\, \widehat{\nabla J}_k\left(\boldsymbol{w}_{k,i-1}; \boldsymbol{x}_{k,i}\right) \tag{32}$$
$$\boldsymbol{w}_{k,i} = \sum_{\ell \in \mathcal{N}_k} a_{\ell k}\, \boldsymbol{\psi}_{\ell,i} \tag{33}$$
where $a_{\ell k} = \mu\,\eta\, c_{\ell k}$ if $\ell \ne k$ and $a_{kk} = 1 - \mu\,\eta \sum_{\ell \ne k} c_{\ell k}$ otherwise. Comparing the diffusion strategy (21)–(22) to the multitask strategy (32)–(33), we note a structural similarity, with the subtle difference that the combination weights $a_{\ell k}$ in (33), in contrast to those in (22), are not constant and depend on the step-size $\mu$ and the regularization parameter $\eta$. The multitask diffusion strategy (32)–(33) has also been shown to be mean-square contractive [13, Eq. (54)], and hence we can verify Assumption 2 and appeal to Theorem 1 to infer its tracking performance.

Corollary 2 (Tracking performance of multitask diffusion). The multitask diffusion algorithm (32)–(33) is mean-square contractive around $w_i^o$ with $\gamma^2 = 1 - O\left(\mu\nu\right)$, and
$$\sigma^2 = O\left(\mu^2\, \sigma_s^2\right) + O\left(\mu^2\, \eta^2\, \xi^2\right) \tag{34}$$
where the $O(\cdot)$ factors hide problem-independent constants. The quantity $w_i^o$ denotes the fixed-point from Definition 1 which, in light of (34), is within $O(\mu)$ of the minimizer of (31). The tracking performance is hence given by:
$$\limsup_{i \to \infty}\, \mathbb{E}\left\|\widetilde{\boldsymbol{w}}_i\right\|^2 = O\left(\mu\, \sigma_s^2\right) + O\left(\frac{\sigma_q^2}{\mu^2}\right) \tag{35}$$
where $\widetilde{\boldsymbol{w}}_i \triangleq \boldsymbol{w}_i - w_i^o$. When $\mathbb{E}\,\boldsymbol{q}_i = 0$, we have:
$$\limsup_{i \to \infty}\, \mathbb{E}\left\|\widetilde{\boldsymbol{w}}_i\right\|^2 = O\left(\mu\, \sigma_s^2\right) + O\left(\frac{\sigma_q^2}{\mu}\right) \tag{36}$$
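A sketch of the multitask recursion with step-size- and regularization-dependent combination weights; the ring graph, the distinct local minimizers, and all parameter values are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
K, d, mu, eta = 5, 2, 0.05, 0.2
C = np.zeros((K, K))                          # adjacency weights on a ring
for k in range(K):
    C[k, (k - 1) % K] = C[k, (k + 1) % K] = 1.0
L = np.diag(C.sum(axis=1)) - C                # weighted graph Laplacian
A = np.eye(K) - mu * eta * L                  # combination weights depend on mu, eta

targets = np.stack([0.1 * k * np.ones(d) for k in range(K)])  # distinct minimizers
W = np.zeros((K, d))
for i in range(5000):
    G = (W - targets) + 0.1 * rng.standard_normal((K, d))     # noisy local gradients
    Psi = W - mu * G                                          # adapt
    W = A @ Psi                                               # smoothing combine

# agents remain near their own minimizers, pulled slightly toward neighbors
assert np.mean(np.linalg.norm(W - targets, axis=1) ** 2) < 0.02
```

Taking `eta` to zero decouples the agents entirely, while larger `eta` pulls them toward consensus, reflecting the role of the smoothness regularizer in (31).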
IV. Simulation Results
IV-A. Tracking Multitask Problems
We illustrate the tracking performance of the multitask diffusion strategy (32)–(33), established in Corollary 2, in Fig. 1. We consider a collection of agents connected by a randomly generated graph. Each agent observes feature vectors and labels following a logistic regression model with a separating hyperplane [17, Appendix G]. The collection of initial hyperplanes is generated to be smooth over the graph using the procedure of [13, Sec. VI] and subsequently subjected to a common drift term. Performance is displayed in Fig. 1. We observe that an optimal step-size choice exists for both drift rates, with smaller drift variance allowing for smaller step-sizes, resulting in a smaller effect of the gradient noise and overall better tracking performance. The trends align with the results of Corollary 2.

IV-B. Illustration of Theorem 1 in the Presence of Drift Bias
We next verify one of the main conclusions of Theorem 1, namely that the dominant drift term in the expressions for tracking performance deteriorates from $O\left(\sigma_q^2/\mu\right)$ when $\mathbb{E}\,\boldsymbol{q}_i = 0$ (Eq. (30)) to $O\left(\sigma_q^2/\mu^2\right)$ in the non-zero-mean case (Eq. (29)). We consider a collection of agents observing independent data originating from a common linear model according to (13), subjected to a drift term $\boldsymbol{q}_i$. All agents construct local least-squares cost functions and pursue the common model by means of the resulting diffusion strategy (21)–(22). The tracking performance in both the zero-mean and biased drift settings, for various choices of the step-size parameter $\mu$, is displayed in Fig. 2.
References
[1] (2014) Multitask diffusion adaptation over networks. IEEE Transactions on Signal Processing 62 (16), pp. 4129–4144.
[2] (2013) Distributed Pareto optimization via diffusion strategies. IEEE Journal of Selected Topics in Signal Processing 7 (2), pp. 205–220.
[3] (2020) Dynamic federated learning. Available as arXiv:2002.08782.
[4] (2014) Adaptive Filter Theory. Pearson.
[5] (2014) Communication-efficient distributed dual coordinate ascent. In Proc. International Conference on Neural Information Processing Systems, Montreal, Canada, pp. 3068–3076.
[6] (2020) Can primal methods outperform primal-dual methods in decentralized dynamic optimization? Available as arXiv:2003.00816.
[7] (1989) Introductory Functional Analysis with Applications. John Wiley & Sons.
[8] (2014) Decentralized dynamic optimization through the alternating direction method of multipliers. IEEE Transactions on Signal Processing 62 (5), pp. 1185–1197.
[9] (2009) Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control 54 (1), pp. 48–61.
[10] (2013) Proximal algorithms. Foundations and Trends in Optimization 1 (3), pp. 127–239.
[11] (1997) Introduction to Optimization. Optimization Software.
[12] (2019) Stabilized SVRG: Simple variance reduction for nonconvex optimization. Available as arXiv:1905.00529.
[13] (2018) Learning over multitask graphs, Part I: Stability analysis. Available as arXiv:1805.08535.
[14] (2020) Multitask learning over graphs. Submitted for publication, available as arXiv:2001.02112.
[15] (2016) Stochastic variance reduction for nonconvex optimization. In Proc. ICML, New York, NY, USA, pp. 314–323.
[16] (2008) Adaptive Filters. John Wiley & Sons.
[17] (2014) Adaptation, learning, and optimization over networks. Foundations and Trends in Machine Learning 7 (4–5), pp. 311–801.
[18] (2011) Convergence rates of inexact proximal-gradient methods for convex optimization. In Proc. Advances in Neural Information Processing Systems 24, Granada, Spain, pp. 1458–1466.
[19] (2017) Decentralized prediction-correction methods for networked time-varying convex optimization. IEEE Transactions on Automatic Control 62 (11), pp. 5724–5738.
[20] (2013) On distributed online classification in the midst of concept drifts. Neurocomputing 112, pp. 138–152.
[21] (2016) Distributed dynamic optimization over directed graphs. In Proc. IEEE Conference on Decision and Control (CDC), Las Vegas, USA, pp. 245–250.
[22] (2016) On the convergence of decentralized gradient descent. SIAM Journal on Optimization 26 (3), pp. 1835–1854.