Optimization and Learning with Information Streams: Time-varying Algorithms and Applications

10/17/2019 · Emiliano Dall'Anese, Andrea Simonetto, Stephen Becker, and Liam Madden

There is a growing cross-disciplinary effort in the broad domain of optimization and learning with streams of data, applied to settings where traditional batch optimization techniques cannot produce solutions at time scales that match the inter-arrival times of the data points due to computational and/or communication bottlenecks. Special types of online algorithms can handle this situation, and this article focuses on such time-varying optimization algorithms, with emphasis on Machine Learning and Signal Processing, as well as data-driven control. Approaches for the design of time-varying first-order methods are discussed, with emphasis on algorithms that can handle errors in the gradient, as may arise when the gradient is estimated. Insights on performance metrics and accompanying claims are provided, along with evidence of cases where algorithms that are provably convergent in batch optimization perform poorly in an online regime. The role of distributed computation is discussed. Illustrative numerical examples for a number of applications of broad interest are provided to convey key ideas.


I Optimization and Learning With Information Streams

Convex optimization underpins many important Statistical Learning, Signal Processing (SP), and Machine Learning (ML) applications. From the dawn of these fields, where least squares and kernel-based regression were prevalent across many research domains [friedman2001elements],¹ to contemporary Big Data applications for online social media, the Internet, and complex infrastructures, convex optimization has enabled the development of core SP and ML algorithms and provided means to uncover fundamental insights on implementation trade-offs. Modern Big Data problems, involving tasks as diverse as kernel-based learning, sparse subspace clustering, support vector machines, and subspace tracking via low-rank models, have stimulated a rich set of research efforts that led to breakthrough approaches for the development of scalable, efficient, and parallelizable data-processing and learning algorithms. Many of these algorithms come with a precise analysis of the convergence rates in various settings [cevher2014convex].

¹Due to strict submission policies, we provide only a set of representative references.

Despite these advances, the advent of streaming data sources in many engineering and science domains poses severe computational strains on existing algorithmic solutions. The ability to store, process, and leverage information from heterogeneous and high-dimensional data streams using solutions that are grounded in batch optimization methods [NocedalWright, Be17] can no longer be taken for granted. Timely application areas that strive for real-time and distributed data processing and learning methods include networked autonomous vehicles, power grids, communication systems, and the emerging Internet-of-Things (IoT) infrastructure, among many others.

In this article, we review the recent time-varying optimization framework [popkov2005gradient, Zavala2010, Bernstein2019feedback, Paper3], which poses a sequence of optimization problems and departs from batch optimization on a central processor in favor of computationally light algorithmic solutions where data points are processed on-the-fly and without central storage. A first goal is to illustrate how the time-varying optimization formalism naturally provides means to model SP and ML tasks with streams of data (with a number of instances provided in Table I). The paper then addresses key questions pertaining to the design and analysis of first-order algorithms that can effectively solve these time-varying optimization problems. In particular, we highlight: i) challenges in the design of online algorithms, along with concepts that relate the inter-arrival time of the data and the computational time of the algorithms; ii) relevant metrics for analyzing the performance of the algorithms, with guidelines for selecting a given performance metric based on the mathematical structure of the time-varying problem; and iii) challenges related to the distributed implementation of time-varying algorithms.

Before proceeding, we also point out that, beyond SP and ML, the time-varying optimization setting considered here is relevant also for emerging data-driven control (DDC) architectures, where learning applications are tightly-integrated components of closed-loop control systems. In this case, learning components may provide means to evaluate or approximate on-the-fly constraints and costs [Bernstein2019feedback, koller2018learning], or to drive the output of dynamical systems to solutions of time-varying optimization problems by learning first-order information of the cost functions [colombino2019online, poveda2017framework] (see Table I for some examples of instances); this is an area that is rooted at the crossroads of SP, ML, Optimization and Control, with a natural cross-fertilization of tools and methods developed by different communities (and the divisions can be somewhat arbitrary).

Machine Learning / Signal Processing                                                        | Data-driven Closed-loop Control
--------------------------------------------------------------------------------------------|--------------------------------------------------------
Subspace tracking, robust subspace tracking [dixit2019online]                                | Measurement-based online algorithms [Bernstein2019feedback]
Subspace clustering, sparse subspace clustering [friedman2001elements, elhamifar2013sparse]  | Extremum seeking [poveda2017framework]
Sparse [cevher2014convex], kernel-based [friedman2001elements], robust, and linear regression [friedman2001elements] | Online optimization as feedback controller [colombino2019online]
Zeroth-order methods [Hajinezhad19], bandit methods [chen2018bandit, besbes2015non]          | Predictive control with Gaussian Processes [koller2018learning]
Support vector machines [friedman2001elements]                                               |
Learning problems over networks [onlineSaddle]                                               |
TABLE I: Examples of instances of robust online algorithms (several of which are illustrated in this article).

Batch/convergence vs online/tracking. A central question related to ML and DDC applications pertains to the ability of existing iterative optimization algorithms — especially first-order methods [cevher2014convex, Be17] — to handle data streams effectively. Suppose that data points arrive sequentially at intervals of $h$ seconds; discretize the temporal axis as $t_k = kh$, $k \in \mathbb{N}$, where $h$ can be selected as the inter-arrival time of data, and suppose that a given ML task involves the solution of the following problem at time $t_k$, based on data $\mathcal{D}_k$ gathered over a (possibly sliding) window:

    x_k^\star \in \arg\min_{x \in \mathcal{X}_k} \; F_k(x; \mathcal{D}_k)        (1)

where $x$ is the parameter of interest, $\mathcal{X}_k$ is a non-empty closed convex set, and the time-varying function may take the form $F_k(x; \mathcal{D}_k) = f_k(x; \mathcal{D}_k) + g_k(x)$, with $f_k$ convex and $L$-smooth (i.e., $\nabla f_k$ is $L$-Lipschitz continuous) and $g_k$ convex but not necessarily smooth. For example, to illustrate the temporal variability of the function as well as the explicit dependency of the cost on the data stream, for an $\ell_1$-regularized least-squares problem one may have $f_k(x; \mathcal{D}_k) = \frac{1}{2} \sum_{\tau \in \mathcal{W}_k} \|y_\tau - A_\tau x\|_2^2$ and $g_k(x) = \lambda_k \|x\|_1$, with $\lambda_k$ a time-varying sparsity-promoting parameter, $\mathcal{W}_k$ the sliding window, and $\mathcal{D}_k = \{(A_\tau, y_\tau)\}_{\tau \in \mathcal{W}_k}$. As another example, problem (1) could be used for subspace tracking based on a sliding window of (vectorized) video images, by setting $f_k$ to be a least-squares term and $g_k$ a nuclear norm regularization [dixit2019online] (these examples will be illustrated shortly). Hereafter, to simplify the notation, we drop $\mathcal{D}_k$ from the cost function, letting the dependency of $F_k$ on the data be implicit.
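To make the streaming setup concrete, the following Python sketch builds the sliding-window $\ell_1$-regularized least-squares cost $F_k = f_k + g_k$ from a simulated data stream; all variable names and the synthetic data are ours, for illustration only:

    import numpy as np

    rng = np.random.default_rng(0)
    n, W = 20, 30                       # parameter dimension, window length
    lam_k = 0.1                         # (possibly time-varying) l1 weight

    def make_cost(window, lam):
        """Return f_k, grad f_k, and g_k for the data in the current window."""
        A = np.vstack([a for (a, y) in window])      # stacked regressors
        y = np.array([y for (a, y) in window])
        f = lambda x: 0.5 * np.sum((A @ x - y) ** 2)
        grad_f = lambda x: A.T @ (A @ x - y)
        g = lambda x: lam * np.sum(np.abs(x))
        return f, grad_f, g

    # Simulate the stream: one data point (a_k, y_k) arrives per interval h.
    x_true = np.zeros(n); x_true[:3] = 1.0           # sparse ground truth
    window = []
    for k in range(100):
        a = rng.standard_normal(n)
        y = float(a @ x_true) + 0.01 * rng.standard_normal()
        window.append((a, y))
        if len(window) > W:
            window.pop(0)                            # slide the window
        f_k, grad_f_k, g_k = make_cost(window, lam_k)
        # ...an online algorithm now performs one (or a few) steps on F_k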

Supposing that (1) is solved using a proximal gradient method or an accelerated proximal gradient method (proximal gradient methods are generalizations of projected gradient methods, see [cevher2014convex, Be17]), it is known that when $F_k$ is convex and $f_k$ is $L$-smooth, the number of iterations required to obtain an objective value within an error $\epsilon$ of the optimum is $O(L\|x_0 - x_k^\star\|_2^2/\epsilon)$ and $O(\|x_0 - x_k^\star\|_2\sqrt{L/\epsilon})$, respectively, with $x_0$ the starting point for the algorithm, and $x_k^\star$ the optimal solution at time $t_k$ [cevher2014convex, Be17] ($O(\cdot)$ refers to the big-O notation). If one can computationally afford a number of proximal gradient steps in the above order within an interval $h$, then it is clear that $x_k^\star$ can be identified, within an acceptable error, at each step $k$. However, what if data points arrive at a rate such that only a few steps (or even just one step) can be performed before a new datum arrives?

Continuing to use the proximal gradient method as a running example, and taking the extreme (yet realistic) case where only one step can be performed within an interval $h$, conventional wisdom would suggest to utilize the following online implementation:

    x_{k+1} = \operatorname{prox}_{\alpha_{k+1} g_{k+1}}\!\left( x_k - \alpha_{k+1} \nabla f_{k+1}(x_k) \right)        (2)

where $\operatorname{prox}_{\alpha g}(\cdot)$ is the proximal operator [Be17], $\{\alpha_k\}$ is a given step-size sequence, and $F_{k+1}$ is built based on new data up to time $t_{k+1}$ (and possibly discarding older data, as in a sliding-window mode). Relevant questions in this setting revolve around the definition of suitable metrics to analytically characterize the performance of the algorithm (2), since the classical notions of convergence and $\epsilon$-accurate solutions for batch optimization are no longer suitable. Viewing (1) under the lens of a time-varying optimization formalism [Zavala2010, Paper3, Bernstein2019feedback, onlineSaddle, dixit2019online], metrics that are related to tracking of the sequence $\{x_k^\star\}$ (or sequences) of optimal solutions will be discussed in Section III.

Before proceeding, we bring up a point that highlights another challenge in designing and analyzing online algorithms. One may surmise that a naïve online implementation of algorithms conceived for batch computation will just work well, with algorithms that are faster in batch optimization remaining faster in time-varying optimization. However, this is not always the case. Surprisingly, the best algorithms in the static case may be the worst algorithms in the dynamic case, as shown in our illustrative numerical results in Fig. 1: the heavy-ball method can even diverge on a simple time-varying least-squares problem.

Fig. 1: Example of a 50-dimensional time-varying least-squares problem, defined using a sliding window of 50 data points, over 950 time points; two big jumps in the solution occur near time indices 250 and 550 (by design). Left: convergence in the static case. Right: performance of various algorithms in tracking the optimal objective value. Nesterov ver. 1 does not use knowledge of strong convexity, while ver. 2 does. The non-linear conjugate gradient method exploits the quadratic objective to perform an exact line search (usually impractical), and is the variant from [NocedalWright, Eq. (5.49)].

Time-varying optimization vis-à-vis online learning. The time-varying optimization formalism is closely aligned with online learning [shalev2012online, Jadbabaie2015, hall2015online, hazan2016introduction] from a basic mathematical standpoint. However, a key conceptual difference is that the time-varying algorithms described here are "computation limited," whereas online learning is "data limited" or "information limited." Taking as an example the learner-environment framework outlined in [Jadbabaie2015, hall2015online, dixit2019online], the online learning counterpart of (2) could be written as $x_{k+1} = \operatorname{prox}_{\alpha_k g_k}(x_k - \alpha_k \nabla f_k(x_k))$; that is, a typical online learning algorithm produces a "prediction" $x_{k+1}$ based on information of the cost function (in the form of functions or functional evaluations) and data that are available up to time $t_k$. Once the prediction is computed, partial or full feedback regarding the function $F_{k+1}$ is revealed. The performance of online learning algorithms is therefore evaluated relative to the best action in hindsight; that is, relative to the case where $F_k$ is available to the learner; for example, the so-called regret after $k$ steps is given by $\mathrm{Reg}_k = \sum_{\tau=1}^{k} F_\tau(x_\tau) - \sum_{\tau=1}^{k} F_\tau(x_\tau^\star)$, with the first term the cost achieved by the algorithm and the second the cost in hindsight. The majority of online learning research assumes static constraints, full or partial (e.g., bandit) feedback, and uses static regret as a metric, which compares $\sum_\tau F_\tau(x_\tau)$ to the cost of the optimal fixed optimizer [shalev2012online] ("shifting," "drifting," or "dynamic" regret analysis is sometimes used, and we will discuss it in Section III). It is also worth mentioning that, while the considered example involved a proximal-gradient step (to highlight the subtleties relative to (2)), computationally-heavier online learning algorithms can be utilized to produce the prediction $x_{k+1}$, with the computational effort between time slots not necessarily being a concern [shalev2012online, hazan2016introduction]. For example, the popular follow-the-leader method requires a batch solution of an optimization problem at each time step if the cost functions are not linear [shalev2012online].

On the other hand, the time-varying optimization setting outlined here is mainly driven by computational bottlenecks. At time $t_k$, information regarding the function $F_k$ to be minimized is available in terms of either its functional form or (an estimate of) its gradients. The algorithm then seeks to obtain an optimal solution $x_k^\star$; however, because of complexity and data-rate considerations, only one or a few algorithmic steps can be performed within $h$ seconds, before a new datum arrives (and the process then restarts in order to seek a new optimizer $x_{k+1}^\star$). As explained shortly in Section III, the performance of a time-varying algorithm is measured against the solution that would have been obtained had we had the time to run an algorithm to convergence within each interval. In this time-varying setting, concepts that relate the inter-arrival time of the data, the sampling time $h$, and the computational time of the algorithms play a key role, and clearly assumptions must be made about how $F_k$ varies over time. For example, there might be enough time to compute two gradients of $f_k$, or one gradient and two function evaluations (for a line search), so an algorithm can choose how to spend its computational resources.

We focus here on algorithms implemented with a constant step-size; this is a natural choice for cases where the optimal solution remains transient and the algorithm runs indefinitely. This is another distinction relative to online learning with a time-invariant optimizer, where the step-size may depend on the time horizon, or a "doubling trick" [shalev2012online, Sec. 2.3.1] is utilized (with the latter still involving changes in the algorithm based on how many iterations have been taken). An example that distinguishes standard online learning from time-varying optimization is spam filtering. At time $t_k$, a user receives an email message with features $d_k$, and their email software must decide whether to label the email as a spam message or a legitimate message. An online learning problem in this scenario is to make the best use of prior emails (and their correct labels) to make a prediction for the new email, after which the user will supply the correct answer, and the software will take this into account at time $t_{k+1}$. A time-varying problem in this scenario is the case where we assume that all users receive the same type of spam email and thus do not need an individually trained classifier, but that the nature of email spam evolves over time, and hence the email provider must update the spam classifier every day. The email provider has access to all the emails of their users, and thus plenty of data, but might use a complicated classifier that cannot be completely trained in one day.

Fig. 2: Online learning and time-varying optimization are both sequential, but in online learning, the information at a given round is restricted, and often full or partial feedback is revealed. In time-varying optimization, there is no restriction of information, but the full problem cannot be solved within a single time step due to computational cost.

II Time-Varying Algorithms

We will overview online algorithms to track solutions of time-varying problems of the form (1) that are designed based on three key principles:

[P1] First-order methods. First-order methods are computationally light, they facilitate the derivation of parallel and distributed architectures, and they can handle non-smooth objectives by leveraging the proximal mapping [cevher2014convex, Be17]. In the context of this article, they exhibit robustness to inaccuracies in the gradient information — an important feature further explained next.²

²In some scenarios, second-order information is computationally cheap to obtain and then second-order methods are competitive (cf. [Paper3] in the context of prediction-correction methods), but we do not pursue these methods here.

[P2] Approximate first-order information. We consider first-order algorithms that are robust to inaccurate gradient estimates; more precisely, the online algorithm is executed using a sequence of approximate gradients $\{\hat\nabla f_k(x_k)\}$, with well-posed assumptions on the sequence of errors $e_k := \hat\nabla f_k(x_k) - \nabla f_k(x_k)$. For example, assumptions on $e_k$ in the existing literature involve boundedness of $\|e_k\|$ (in a given norm) [Dallanese2018feedback, schmidt2011convergence], as explained in Section III. In a stochastic case, boundedness of $\mathbb{E}[\|e_k\|]$ (where $\mathbb{E}$ denotes expectation) is presumed [dixit2019online]. This setting finds important applications in ML and DDC with streaming data, with prime examples outlined shortly.

[P3] Distributed computation. We cover problems of the form (1) or suitable reformulations that are to be solved collaboratively by a network.

We start by revisiting the proximal gradient method (2) under the lens of [P1]–[P2]. This algorithm is relevant for a number of instances listed in Table I, in particular when the proximal operator of $g_k$ or the projection onto the set $\mathcal{X}_k$ is computationally cheap. We then turn our attention to primal-dual methods and variants [Koshal11]; these methods are naturally applicable to the case where the set $\mathcal{X}_k$ in (1) is expressed as $\mathcal{X}_k = \{x \in \mathcal{X}_0 : c_k(x) \le 0\}$, with $\mathcal{X}_0$ convex (and involving a computationally cheap projection), and $c_k$ a vector-valued convex function; here, the constraint $c_k(x) \le 0$ is dualized to construct the Lagrangian function. This setting is relevant, for example, in network optimization problems with data streams [onlineSaddle, Koshal11, Bernstein2019feedback]. On the other hand, a similar structure emerges in consensus-based reformulations of (1), where consensus among nodes is imposed through linear constraints defined by a consensus matrix constructed based on the communication graph [Ling14admm, dimakis2010gossip, boyd2011distributed].

II-A Proximal gradient method

The online algorithm (2) with approximate first-order information amounts to the sequential execution of the following step:

    x_{k+1} = \operatorname{prox}_{\alpha g_{k+1}}\!\left( x_k - \alpha\, \hat\nabla f_{k+1}(x_k) \right)        (3)

where we recall that $\hat\nabla f_{k+1}(x_k)$ is an approximate gradient. If $g_k$ is the indicator function of $\mathcal{X}_k$, (3) reduces to the online projected gradient method with approximate gradient information. We focus our attention on algorithms with a constant step-size, that is, $\alpha_k = \alpha$ for all $k$; this is reasonable when no prior on the evolution of the problem is available and the algorithm is executed indefinitely (as opposed to over a given finite interval). The algorithm (3) is the starting point for all variants we consider, and it handles the key issues of the time-varying setting, namely functions and constraints (and hence the optimizers) possibly changing at each step $k$, and inexact gradients. The algorithm (3) is utilized in the examples of applications presented in Section II-B. Section II-C will discuss a more general algorithm based on Lagrangian functions, and Section III will elaborate on performance metrics.
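A minimal Python sketch of the update (3) is given below, with soft-thresholding as the proximal operator of $g_k(x) = \lambda\|x\|_1$ and an additive Gaussian perturbation standing in for the gradient error $e_k$; the function and parameter names are ours:

    import numpy as np

    def soft_threshold(v, t):
        """Proximal operator of t * ||.||_1 (entrywise soft-thresholding)."""
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def online_prox_grad(grad_stream, n, alpha, lam, noise_std=0.0, seed=0):
        """One inexact proximal-gradient step (3) per time index k.

        grad_stream yields the gradient map of f_k at each k; a Gaussian
        perturbation of standard deviation noise_std mimics the error e_k.
        """
        rng = np.random.default_rng(seed)
        x = np.zeros(n)
        for grad_f_k in grad_stream:
            e_k = noise_std * rng.standard_normal(n)        # gradient error
            x = soft_threshold(x - alpha * (grad_f_k(x) + e_k), alpha * lam)
            yield x.copy()               # one step, then wait for new datum

For instance, grad_stream could yield the grad_f_k maps produced by the sliding-window sketch of Section I, so that exactly one step is taken per arriving datum.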

II-B Examples of applications of the proximal gradient method

ML example #1: Subspace tracking for video streaming. Robust principal component analysis (PCA) can be used to separate foreground from background in video, among many other applications. The model is that a matrix $Y \in \mathbb{R}^{n \times f}$, which encodes a video as $n$ pixels by $f$ video frames, can be decomposed as $Y = S + L$, where $S$ is sparse (foreground) and $L$ is low-rank (background). Now suppose $Y_k$ is a video clip, and $Y_{k+1}$ is the subsequent video clip, and the objective is to decompose all video clips into foreground and background in a streaming and real-time fashion. This form of robust subspace tracking was considered by [dixit2019online], and is modeled by solving the following problem (for parameters $\lambda_S, \lambda_L > 0$):

    \min_{L}\ \left\{ \min_{S}\ \tfrac{1}{2} \|Y_k - L - S\|_F^2 + \lambda_S \|S\|_1 \right\} + \lambda_L \|L\|_*        (4)
Fig. 3: Results of robust PCA on traffic footage. Left: 2 iterations of proximal gradient per video clip. Right: 10 iterations of proximal gradient per video clip.

where $\|L\|_*$ is the nuclear norm. Identifying $f_k$ and $g$ as in Eq. (4), with $f_k(L)$ the inner minimization over $S$ and $g(L) = \lambda_L \|L\|_*$, then $f_k$ is differentiable and $\nabla f_k$ is $1$-Lipschitz continuous [CombettesBook2, Prop. 12.30], with $\nabla f_k(L) = L + S^\star(L) - Y_k$, where $S^\star(L)$ solves the inner minimization problem; the proximal operator of $g$ amounts to soft-thresholding the singular values and can be computed using the singular value decomposition (SVD) [CombettesBook2, Ex. 24.69]. To speed up computation, the SVD algorithm may be allowed to produce small errors, such as in randomized SVD methods [halko2011finding], iterative methods like Lanczos, or methods with large roundoff error. In particular, if $Z \in \mathbb{R}^{n \times f}$ with $n \gg f$, an efficient SVD algorithm is to compute the eigenvalue decomposition (EVD) of the $f \times f$ Gram matrix $Z^\top Z$; this multiplication and EVD operation have the same asymptotic flop count as the usual SVD, but are faster in practice due to smaller constants and the ability to exploit well-tuned matrix-multiply routines (especially on a GPU). The approach has higher numerical error because it squares the condition number of $Z$. There are other choices for defining $f_k$ and $g$, such as those in [dixit2019online], but our choices fit into the approximate-gradient framework, which comes with guarantees. Eq. (4) fits into Eq. (1) by letting $x$ be a vectorized version of $L$, with $f_k$ and $g_k = g$ as above, and $\mathcal{X}_k$ the entire space.
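The Gram-matrix shortcut can be sketched as follows; svt_via_gram is our own illustrative name, and the routine computes the proximal operator of $\tau\|\cdot\|_*$ for a tall matrix $Z$ via the EVD of $Z^\top Z$:

    import numpy as np

    def svt_via_gram(Z, tau):
        """prox of tau*||.||_* at Z (singular-value soft-thresholding),
        using the EVD of the small f x f Gram matrix Z^T Z (n >> f)."""
        evals, V = np.linalg.eigh(Z.T @ Z)          # ascending eigenvalues
        sig = np.sqrt(np.maximum(evals, 0.0))       # singular values of Z
        keep = sig > tau                            # components that survive
        if not np.any(keep):
            return np.zeros_like(Z)
        V, sig = V[:, keep], sig[keep]
        U = (Z @ V) / sig                           # left singular vectors
        return U @ np.diag(sig - tau) @ V.T         # shrink and reassemble

    Z = np.random.default_rng(1).standard_normal((10000, 50))
    L_new = svt_via_gram(Z, tau=50.0)               # one nuclear-norm prox step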

To illustrate the example, we took a dataset of 254 traffic video clips from [trafficDBpaper2005], with $f$ (the number of frames per clip) in the range of 48 to 52, and chose fixed values for the regularization parameters $\lambda_S$ and $\lambda_L$. Most video clips are from subsequent times, but there is a jump between the last clip in the evening and the first clip in the morning. On a 2-core laptop, a single proximal gradient step takes about 0.1 seconds. Thus for real-time processing, assuming a frame rate of 25 frames/second, one could take just under 20 iterations per video clip. Fig. 3 illustrates the proximal gradient algorithm taking just 2 iterations per clip (left) and 10 iterations per clip (right). Obviously, taking more iterations per video clip leads to better results. The quality of the background component is much better for clip 160 than for clip 40, also as expected.

ML example #2: Online sparse subspace clustering. Subspace tracking identifies a shared low-dimensional subspace that explains most of the data [friedman2001elements]; subspace clustering, on the other hand, is a key ML application utilized to group data points that lie in low-dimensional affine spaces [friedman2001elements]. Subspace clustering is useful when data points lying near different low-dimensional affine spaces represent qualitatively different real-world objects.

Fig. 4: The average image of data points labeled as representing a certain person, as "time" increases and the tracking error converges. "Time" refers to which sliding window of the data set is being used.

Here we illustrate an approach referred to as sparse subspace clustering [elhamifar2013sparse]; it involves the solution of a sparse representation problem, followed by spectral clustering applied to the graph corresponding to the similarity matrix formed using the minimizer of the sparse representation problem. In particular, the sparse representation problem [elhamifar2013sparse] is

    \min_{C}\ \|C\|_1 + \tfrac{\lambda}{2} \|Y_k - Y_k C\|_F^2 \quad \text{s.t.} \quad \operatorname{diag}(C) = \mathbf{0},\ \ C^\top \mathbf{1} = \mathbf{1}        (5)

where $\|C\|_1$ is the (entrywise) vector $\ell_1$-norm, $\operatorname{diag}(C)$ is the vector consisting of the diagonal elements of $C$, and $\mathbf{0}$ and $\mathbf{1}$ are vectors of all zeros and all ones, respectively. Letting $x$ be a vectorized version of $C$, with $Y_k$ the matrix of data points in the current window, this fits in the framework of (1). $Y_k$ is a sliding window, so that the clustering problem does not grow in the number of data points that need to be labeled, in order to avoid creating a growing computational demand. Fig. 4 visually represents the evolution of the center of one subspace, by averaging the data points of one cluster and showing how it changes over time, starting with a mixture of faces and then converging on the identity of a single person.
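As an illustration, a proximal-gradient step for a simplified variant of (5) that drops the affine constraint $C^\top \mathbf{1} = \mathbf{1}$ (handling it would require an additional projection) can be sketched as follows; since the $\ell_1$-norm and the $\operatorname{diag}(C) = \mathbf{0}$ constraint act on disjoint entries, the proximal operator soft-thresholds the entries and zeroes the diagonal:

    import numpy as np

    def ssc_prox_grad_step(C, Y, lam, alpha):
        """One step on f(C) = (lam/2)||Y - Y C||_F^2 plus
        g(C) = ||C||_1 + indicator{diag(C) = 0} (affine constraint dropped)."""
        grad = lam * (Y.T @ (Y @ C - Y))                     # gradient of f
        Z = C - alpha * grad                                 # gradient step
        Z = np.sign(Z) * np.maximum(np.abs(Z) - alpha, 0.0)  # prox of ||.||_1
        np.fill_diagonal(Z, 0.0)                             # enforce diag = 0
        return Z

    rng = np.random.default_rng(0)
    Y = rng.standard_normal((30, 100))   # sliding window of 100 data points
    C = np.zeros((100, 100))
    for _ in range(5):                   # a few steps per window, as in (3)
        C = ssc_prox_grad_step(C, Y, lam=10.0, alpha=1e-4)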

Fig. 5: Example of a measurement-based online algorithm, with application to power systems (adapted from [Dallanese2018feedback]). The numerical results correspond to the case where $x$ collects power commands for distributed energy resources (DERs), $w$ collects the powers consumed by non-controllable loads, and $y$ is in this case a scalar representing the setpoint for the aggregate power of both DERs and loads. Left: example of tracking performance for the real power of three representative DERs (in W). Right: with the gradient proxy computed by a central network operator, the update (3) decouples across the DERs.

DDC example #1: Measurement-based online network optimization. Consider a physical network (e.g., a power system, a transportation network, or a communication network) described by an input-output map $y_k = A x_k + B w_k$, with $x_k$ a vector of control inputs, $w_k$ a vector of (possibly unknown) exogenous inputs, and $A$, $B$ known network matrices; the vector $y_k$ collects the network outputs. For example, in power systems, $w_k$ collects the power consumed by uncontrollable loads throughout the (possibly very large) network, $x_k$ collects controllable power injections, $y_k$ collects power flows, and the network map is based on a linearized power flow model [Dallanese2018feedback]. As an illustrative example, suppose that the function $f_k$ in (1) is $f_k(x) = \frac{1}{2}\|A x + B w_k - y_k^{\mathrm{ref}}\|_2^2$, in an effort to drive the network output towards a time-varying reference point $y_k^{\mathrm{ref}}$. The gradient of $f_k$ in this case reads $\nabla f_k(x) = A^\top(A x + B w_k - y_k^{\mathrm{ref}})$. Evaluating the gradient requires one to measure or estimate the vector $w_k$ at each step of the algorithm, and this task can be problematic (if not impossible) in many real-world applications. If, on the other hand, sensors are deployed to measure the network output $y_k$, then a proxy of the gradient can be constructed as $\hat\nabla f_k(x_k) = A^\top(\hat y_k - y_k^{\mathrm{ref}})$, with $\hat y_k$ a measurement of $y_k$ [Bernstein2019feedback]. Because of the inherent measurement errors, but also because of possibly inaccurate knowledge of the model matrix $A$, $\hat\nabla f_k$ is a noisy version of $\nabla f_k$. An example of a measurement-based architectural framework is illustrated in Figure 5, for an application to power systems (adapted from [Dallanese2018feedback]).
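A minimal sketch contrasting the model-based gradient with the measurement-based proxy, on a synthetic linear network map (all matrices, dimensions, and noise levels below are illustrative choices of ours):

    import numpy as np

    rng = np.random.default_rng(0)
    n_y, n_x, n_w = 5, 3, 8
    A = rng.standard_normal((n_y, n_x))         # known network matrices
    B = rng.standard_normal((n_y, n_w))

    def grad_model(x, w, y_ref):
        """Model-based gradient: requires the unknown exogenous input w."""
        return A.T @ (A @ x + B @ w - y_ref)

    def grad_measured(x, y_hat, y_ref):
        """Measurement-based proxy: requires only a sensor reading y_hat."""
        return A.T @ (y_hat - y_ref)

    x = np.zeros(n_x); w = rng.standard_normal(n_w); y_ref = np.ones(n_y)
    y_hat = A @ x + B @ w + 0.01 * rng.standard_normal(n_y)  # noisy output
    e_k = grad_measured(x, y_hat, y_ref) - grad_model(x, w, y_ref)  # error e_k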

II-C Handling Lagrangian functions

Consider now the case where $\mathcal{X}_k = \{x \in \mathcal{X}_0 : c_k(x) \le 0\}$; we recall that this setting is relevant, for example, in network optimization problems with data streams [Koshal11, Bernstein2019feedback] or in SP and ML applications where the projection onto $\mathcal{X}_k$ can be computationally intensive. Focusing first on the case where $f_k$ is strongly convex, we consider the design of first-order algorithms based on the time-varying saddle-point problem [Koshal11]:

    \max_{\lambda \ge 0}\ \min_{x \in \mathcal{X}_0}\ \mathcal{L}_{\nu,k}(x, \lambda) := f_k(x) + g_k(x) + \lambda^\top c_k(x) - \tfrac{\nu}{2} \|\lambda\|_2^2        (6)

where $\mathcal{L}_{\nu,k}$ is a regularized Lagrangian function, $\nu \ge 0$ is a regularization parameter, and $\lambda$ is the vector of multipliers associated with the constraint $c_k(x) \le 0$. Accordingly, based on the principles [P1]–[P2], an approximate online primal-dual method is of the form [Bernstein2019feedback]:

    x_{k+1} = \operatorname{prox}_{\alpha g_{k+1}}\!\left( x_k - \alpha\, \hat\nabla_x \mathcal{L}_{\nu,k+1}(x_k, \lambda_k) \right), \qquad \lambda_{k+1} = \left[ \lambda_k + \alpha \left( \hat c_{k+1}(x_k) - \nu \lambda_k \right) \right]_+        (7)

where $\hat\nabla_x \mathcal{L}_{\nu,k+1}(x_k, \lambda_k)$ is a proxy for the gradient of the smooth part of $\mathcal{L}_{\nu,k+1}$ with respect to $x$, and is an estimate of $\nabla f_{k+1}(x_k) + J_{k+1}(x_k)^\top \lambda_k$, with $J_{k+1}$ denoting the Jacobian of $c_{k+1}$ (i.e., an estimate of the gradient of the smooth part of the regularized Lagrangian), and $\hat c_{k+1}(x_k)$ an estimate of the constraint function. It is important to notice that if $\nu = 0$, then (6) reverts to the standard Lagrangian function; then, (7) can be utilized to track optimal primal-dual trajectories of (1) based on metrics grounded on the notion of regret, but there are no linear convergence results, due to the lack of strong concavity of the dual problem [Koshal11, Du2019]. When $\nu > 0$, then $\mathcal{L}_{\nu,k}$ becomes $\nu$-strongly concave in $\lambda$, and linear convergence results are available at the cost of tracking an approximate solution. This will be discussed in Section III.
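A sketch of one iteration of (7) for a cost with an $\ell_1$ non-smooth part and affine constraints is given below (names are ours; in a measurement-based implementation, grad_f and c would be replaced by sensor-driven estimates):

    import numpy as np

    def soft_threshold(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def primal_dual_step(x, lam_dual, grad_f, c, Jc, alpha, nu, l1_weight):
        """One iteration of (7) with g(x) = l1_weight * ||x||_1."""
        gx = grad_f(x) + Jc(x).T @ lam_dual          # smooth part of grad_x L
        x_new = soft_threshold(x - alpha * gx, alpha * l1_weight)
        glam = c(x) - nu * lam_dual                  # regularized ascent step
        lam_new = np.maximum(lam_dual + alpha * glam, 0.0)  # project on >= 0
        return x_new, lam_new

    # Example: f(x) = 0.5||x - b||^2, constraint c(x) = Gx - d <= 0.
    rng = np.random.default_rng(0)
    n, m = 4, 2
    b = rng.standard_normal(n); G = rng.standard_normal((m, n)); d = np.ones(m)
    x, lam_dual = np.zeros(n), np.zeros(m)
    for _ in range(200):     # b, G, d would be re-built from data at each k
        x, lam_dual = primal_dual_step(
            x, lam_dual,
            grad_f=lambda z: z - b, c=lambda z: G @ z - d, Jc=lambda z: G,
            alpha=0.1, nu=0.05, l1_weight=0.01)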

III Performance Analysis: Which Metrics?

Given the temporal variability of the underlying optimization problems, the so-called dynamic regret [besbes2015non, hall2015online, onlineSaddle, Jadbabaie2015, dixit2019online, li2018using] and (a slightly modified notion of) linear convergence [Paper3, Bernstein2019feedback, dixit2019online] are metrics that can be used to characterize the performance of time-varying algorithms. This section discusses these metrics and relevant bounding techniques. Further, it provides guidelines on the adoption of a given performance metric, based on the mathematical structure of the time-varying problem.

It is first necessary to make assumptions on (i) a "measure" of the temporal variability of (1), and (ii) the "correctness" of the first-order information $\hat\nabla f_k$. Measures of the latter include [schmidt2011convergence]:

    e_k := \hat\nabla f_k(x_k) - \nabla f_k(x_k), \qquad E_k := \sum_{\tau=1}^{k} \|e_\tau\|_2        (8)

with $e_k$ the gradient error at time $t_k$ and $E_k$ representing the error accumulated up to time $t_k$. Stochastic counterparts of (8) of the form $\mathbb{E}[\|e_k\|_2]$ are also considered, as discussed in, e.g., [dixit2019online].

Temporal variability of the problem (1) can be measured based on how fast its optimal solutions evolve; more precisely, assuming first that the cost in (1) is strongly convex at all times (and, therefore, the trajectory $\{x_k^\star\}$ is unique), one can consider [Paper3, Bernstein2019feedback, dixit2019online]:

    \sigma_k := \sum_{\tau=1}^{k} \|x_\tau^\star - x_{\tau-1}^\star\|_2        (9)

with $\sigma_k$ referred to as the "path length" or "cumulative drifting." The metric (9) can be utilized also when $F_k$ is a function of random parameters drawn from a time-varying distribution; see, e.g., [Cao2019]. When the cost is convex but not strongly convex, (9) refers to a non-unique path and its respective length; however, an alternative measure that resolves this ambiguity involves a notion of worst-case path length [besbes2015non]:

    \bar\sigma_k := \sum_{\tau=1}^{k}\ \sup_{x \in \mathcal{X}_\tau^\star,\, x' \in \mathcal{X}_{\tau-1}^\star} \|x - x'\|_2        (10)

where $\mathcal{X}_\tau^\star$ denotes the set of optimal solutions at time $t_\tau$.

Additional metrics have been considered to capture the temporal variability of the underlying problem; for example, a variant of (9) involving a dynamical model is proposed in [hall2015online]. As another example, one can use the metric $\sum_{\tau=1}^{k} \|u_\tau - u_{\tau-1}\|_2$, where $\{u_\tau\}$ is a suitable "comparator" sequence; for example, $u_\tau$ could be the center of the solution trajectory, to capture the diameter of the minimizer sequence. For completeness, we also mention that, assuming that the constraint set is static (that is, $\mathcal{X}_k = \mathcal{X}$ for all $k$), an additional metric is [besbes2015non, Jadbabaie2015]

    V_k := \sum_{\tau=1}^{k}\ \sup_{x \in \mathcal{X}} \left| F_\tau(x) - F_{\tau-1}(x) \right|        (11)

where, however, compactness of $\mathcal{X}$ is needed to ensure a finite value of $V_k$.

III-A Strongly convex time-varying functions: Linear convergence

We start by considering the case where the function $f_k$ in the cost of (1) is $\mu$-strongly convex³ and $L$-smooth, with $0 < \mu \le L$. Strong convexity implies that the sequence of optimal solutions $\{x_k^\star\}$ is unique; therefore, a pertinent performance assessment involves the analysis of the tracking error sequence $\{\|x_k - x_k^\star\|_2\}$ [Bernstein2019feedback, Paper3]. To this end, one can obtain a slightly modified notion of linear convergence of the form [Bernstein2019feedback, Paper3]:

    \|x_{k+1} - x_{k+1}^\star\|_2 \le \varrho\, \|x_k - x_k^\star\|_2 + \beta_k, \qquad k \ge \bar k        (12)

for some $\bar k \ge 0$ (that is, without considering the transient behavior of the algorithm), where $\varrho \in (0,1)$ is the contraction coefficient, and $\beta_k$ is a function of the drifting $\|x_{k+1}^\star - x_k^\star\|_2$ and the gradient error $\|e_k\|_2$. Assuming that there exists a scalar $\beta$ that upper bounds the elements of the sequence $\{\beta_k\}$ — where $\bar\sigma$ and $\bar e$ are bounds for $\|x_{\tau+1}^\star - x_\tau^\star\|_2$ and $\|e_\tau\|_2$ for all $\tau$, respectively — (12) naturally leads to the following asymptotic result:

    \limsup_{k \to \infty} \|x_k - x_k^\star\|_2 \le \frac{\beta}{1 - \varrho}        (13)

which bounds the maximum tracking error.

³If $F_k$ is $\mu$-strongly convex, we can assume $f_k$ is $\mu$-strongly convex without loss of generality, since we can add $\frac{\mu}{2}\|x\|_2^2$ to $f_k$ and subtract it from $g_k$; the gradients and proximity operators of the new functions are given by standard formulas [CombettesBook2].

Concrete expressions for (13) are provided in the following two examples.

Online proximal method and projected gradient method. For the example (3), one has that $\varrho = \max\{|1 - \alpha\mu|, |1 - \alpha L|\}$ [Be17]; therefore, (12) is a contraction (i.e., $\varrho < 1$) when $0 < \alpha < 2/L$. Taking a constant step-size in this range, one has that the bound (13) is given by

    \limsup_{k \to \infty} \|x_k - x_k^\star\|_2 \le \frac{\bar\sigma + \alpha \bar e}{1 - \varrho}        (14)

When $\bar\sigma = 0$, one recovers results for the convergence of batch algorithms with errors in the gradient. Note also that this result allows for an approximate gradient ($\bar e > 0$) calculation.
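The bound (14) is easy to probe numerically; the following sketch runs an online gradient method on a drifting quadratic with $\mu = L = 1$ (so the proximal step is trivial), with synthetic drift and gradient noise chosen by us to respect the bounds $\bar\sigma$ and $\bar e$:

    import numpy as np

    rng = np.random.default_rng(0)
    n, alpha = 10, 0.5
    sigma_bar, e_bar = 0.05, 0.02          # drift and gradient-error bounds
    rho = abs(1 - alpha)                   # contraction factor (mu = L = 1)
    bound = (sigma_bar + alpha * e_bar) / (1 - rho)      # right side of (14)

    b = np.zeros(n); x = np.ones(n); errs = []
    for k in range(500):
        drift = rng.standard_normal(n)
        b += sigma_bar * drift / np.linalg.norm(drift)   # optimizer x_k* = b_k
        e = rng.standard_normal(n)
        e *= e_bar / np.linalg.norm(e)                   # gradient error e_k
        x = x - alpha * ((x - b) + e)                    # inexact gradient step
        errs.append(np.linalg.norm(x - b))
    print(max(errs[100:]), "<=", bound)                  # tail error vs. (14)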

We refer the reader to [necoara2019linear] for more results on linear convergence in the static setting when the cost function satisfies some relaxed strong convexity conditions.

Online primal-dual method. We now turn our attention to primal-dual gradient methods based on the Lagrangian function (6). The linear convergence results above are modified in this case to account for both primal and dual variables; that is, we consider the tracking error sequence $\{\|z_k - z_k^\star\|_2\}$, where $z_k := (x_k, \lambda_k)$ and $z_k^\star := (x_k^\star, \lambda_k^\star)$. In this case, the definition of the drifting is also modified accordingly as $\bar\sigma := \sup_k \|z_{k+1}^\star - z_k^\star\|_2$.

Still assuming that $f_k$ is strongly convex, the traditional (un-regularized, $\nu = 0$) Lagrangian function is not strongly concave in $\lambda$, and thus no linear convergence results of the form (13) may be possible [Koshal11, Du2019] (we notice also that in [Du2019] the cost is not strongly convex, but a special structure of the regularized Lagrangian is assumed). When $\nu > 0$, the regularized Lagrangian in (6) is a strongly-convex strongly-concave function, and linear convergence results become available for both static [Koshal11, Du2019] and time-varying optimization problems [Bernstein2019feedback]; the price to pay, though, is tracking of the unique saddle-point of the regularized Lagrangian function, which does not in general coincide with an optimal primal-dual pair of (1); see, for example, the results in [Koshal11]. Sacrificing optimality for convergence is often appropriate in a time-varying setting: if the regularization error is small compared to the drift $\bar\sigma$ and the gradient error $\bar e$, then a time-varying regularized algorithm achieves performance very similar to a non-regularized one, with the added value of linear convergence.

Letting $z_{\nu,k}^\star$ be the saddle-point of the regularized Lagrangian when $\nu > 0$, and focusing first on the case where $g_k \equiv 0$, the primal-dual gradient operator is strongly monotone with strong-monotonicity constant $\eta = \min\{\mu, \nu\}$ (i.e., the operator minus $\eta I$, with $I$ the identity, is monotone), and Lipschitz continuous with a given constant $L_\Phi$ whenever $c_k$ is convex and continuously differentiable with a Lipschitz continuous gradient (a precise derivation of $L_\Phi$ is available in, e.g., [Koshal11, Bernstein2019feedback]). Under these premises, $\|z_k - z_{\nu,k}^\star\|_2$ can be bounded as in (14), with $\bar e$ replaced by an upper bound on the norm of the error in the computation of the gradients of both the primal and dual steps [Bernstein2019feedback], and with the corresponding contraction coefficient. Clearly, the recursion is a contraction if $\alpha$ is selected sufficiently small relative to $\eta$ and $L_\Phi$. These results hold also for the case with a non-differentiable function $g_k$ and the proximal operator in the primal step, as shown in Eq. (7).

III-B Dynamic regret

The dynamic regret is defined as [hall2015online, Jadbabaie2015, besbes2015non, chen2018bandit]:

    \mathrm{Reg}_k := \sum_{\tau=1}^{k} \left( F_\tau(x_\tau) - F_\tau(x_\tau^\star) \right)        (15)

where we recall that $F_\tau(x_\tau^\star)$ is in our case the optimal value that would be obtained by utilizing a batch algorithm. For strongly convex functions, boundedness of the tracking error implies boundedness of the instantaneous regret $F_\tau(x_\tau) - F_\tau(x_\tau^\star)$ when $F_\tau$ is Lipschitz continuous uniformly in $\tau$ (e.g., the (sub-)gradient of $F_\tau$ is bounded over the set $\mathcal{X}_\tau$) [dixit2019online]; therefore, a recursive application of (12) gives a bound on the dynamic regret. For the algorithm (3) and its projected gradient counterpart, it holds that $\mathrm{Reg}_k = O(1 + \sigma_k + E_k)$ [dixit2019online]. If the path length and the cumulative gradient error are linear in $k$, no sublinear regret is possible, as confirmed by the lower bounds provided in, e.g., [besbes2015non].
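As an illustration, the dynamic regret (15) can be accumulated along the run of an online algorithm whenever the per-step optimizer is known in closed form, as in this synthetic example of ours:

    import numpy as np

    rng = np.random.default_rng(0)
    n, alpha = 5, 0.5
    x = np.zeros(n); b = np.zeros(n); reg = 0.0
    for k in range(1000):
        b += 0.02 * rng.standard_normal(n)       # drift: new optimizer x_k* = b_k
        x = x - alpha * (x - b)                  # one online gradient step
        reg += 0.5 * np.linalg.norm(x - b) ** 2  # F_k(x_k) - F_k(x_k*), F_k* = 0
    print("dynamic regret after 1000 steps:", reg)  # grows with the path length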

When the cost function is not strongly convex and the relaxed conditions for linear convergence explained in, e.g., [necoara2019linear] (for example, a quadratic functional-growth condition) are not satisfied, the dynamic regret becomes a key performance metric. This is also the case for constrained problems when the un-regularized Lagrangian is utilized to design the algorithm (that is, when $\nu = 0$ in (6)), since even if the cost is strongly convex, the primal-dual operator is then monotone (but not strongly monotone). A number of results in the literature are available for diminishing step-sizes; here, on the other hand, we recall that we consider regret bounds for algorithms with a constant step-size, since they more naturally fit the time-varying setting. As an example, consider a smooth cost with $g_k \equiv 0$, and set $\mathcal{X}_k = \mathcal{X}$, with $\mathcal{X}$ compact, for all $k$; then, for an arbitrary "comparator" sequence $\{u_\tau\} \subset \mathcal{X}$ (i.e., a reference for the performance analysis), a bound for the dynamic regret amounts to:

    \sum_{\tau=1}^{k} \left( F_\tau(x_\tau) - F_\tau(u_\tau) \right) \le \frac{D^2}{2\alpha} + \frac{D}{\alpha} \sum_{\tau=2}^{k} \|u_\tau - u_{\tau-1}\|_2 + \frac{\alpha G^2}{2}\, k        (16)

where $D$ is the diameter of $\mathcal{X}$ and $G$ bounds the norm of the gradients over $\mathcal{X}$. For example, if $u_\tau$ is taken to be the center of the solution trajectory for all $\tau$, then the relevant quantity is the diameter of the minimizing sequence. Other bounds can be derived for approximate first-order information by extending the results of [besbes2015non, hall2015online]; they are close in spirit to (16), and they imply that no sublinear regret is achievable if the metric utilized to capture the time-variability of the problem grows linearly. For completeness, we refer the reader to the lower bounds in, e.g., [besbes2015non, li2018using], and to the bounds for primal-dual methods designed based on the standard Lagrangian function in, e.g., [chen2018bandit] and [Bernstein2019feedback].

IV Distributed Computation for Information Streams

Another key aspect of data streams is that they can be distributed across different locations and sources. With the sheer amount of data increasing, possibly coupled with privacy concerns, distributed computation plays a crucial role in ensuring that data points are processed as close as possible to where they are generated. We focus on two key features and challenges of time-varying optimization with distributed information streams: (i) step-size conditions; and (ii) asynchronicity of the updates. Other aspects (e.g., communication vs. convergence trade-offs, quantization, federated architectures) are also important in distributed optimization, on par with standard static processing; we comment on these aspects in Section V as part of the outlook.

Step-size conditions and synchronicity of the updates are two key differentiators between traditional static and online distributed optimization; if not handled properly, they may jeopardize performance and even convergence of standard distributed algorithms when applied to information streams. Take, for example, decentralized (sub)gradient descent (DGD): unless the step-size vanishes, convergence to the optimizer is not assured. On the other hand, as we discussed, if the step-size vanishes, then tracking a time-varying optimizer becomes challenging. When considering cost functions that change over time, synchronicity becomes an even more important aspect; in principle, nodes at different locations are required to sample the cost functions at the same instant, otherwise we would be solving problems that pertain to different time instances at different nodes (jeopardizing performance at best, and convergence at worst).⁴

⁴We focus on DGD, instead of distributed methods with possibly superior performance, for its simplicity, its chronological precedence, and because a number of current methods still employ it in some way at their core.

IV-A Example of a time-varying consensus problem

To outline the ideas, consider a simplified version of (1) for a prototypical consensus problem:

    \min_{x \in \mathbb{R}^n}\ \sum_{i=1}^{N} f_{i,k}(x)        (17)

Consider $N$ spatially distributed nodes, labeled as $i = 1, \ldots, N$, each one with a private cost $f_{i,k}$, which for simplicity of exposition we take to be $L$-smooth and strongly convex. The nodes can communicate via a communication graph, and we will be looking at algorithms that allow the nodes to agree on an optimizer $x_k^\star$ of (17) at any time $t_k$, while communicating only with their neighboring nodes. In the static setting, where the cost functions do not change over time, many algorithms have been developed to solve (17) [boyd2011distributed, dimakis2010gossip], such as gradient tracking, the exact first-order algorithm (EXTRA), dual averaging, dual decomposition, and ADMM (see, e.g., [Shi2015] and references therein), in addition to DGD. We emphasize that the convergence claims of [boyd2011distributed, dimakis2010gossip, Shi2015] are for static optimization; a goal of this section is to highlight challenges in the design and analysis of distributed algorithms when moving from static to time-varying optimization. Because of space limitations, we refer the reader to [Maros2019] for a comprehensive literature review of several aspects of time-varying distributed optimization, as well as to [Hosseini2016, Akbari2017, Shahrampour2018] for examples of online and time-varying algorithms over networks.

DGD assigns copies of the variable $x$ to each node, denoted here as $x_i$, and it generates a sequence $\{x_{i,k}\}$ as

    x_{i,k+1} = \sum_{j \in \mathcal{N}_i \cup \{i\}} w_{ij}\, x_{j,k} - \alpha\, \hat\nabla f_{i,k}(x_{i,k})        (18)

where $j \in \mathcal{N}_i \cup \{i\}$ means that the sum is carried over all the neighbors of node $i$ and node $i$ itself, $w_{ij}$ are non-negative weights (which are often chosen based on the relative degrees of the nodes), $\alpha$ is the step-size, and $\hat\nabla f_{i,k}(x_{i,k})$ is in this case the gradient of $f_{i,k}$ at $x_{i,k}$, or a proxy of the gradient as discussed in Section II. In the static setting, even in the strongly convex and $L$-smooth case, the sequence can be proven to converge, in the sense that $\|x_{i,k} - x^\star\|_2 \to 0$ for all $i$, only when the step-size is vanishing, e.g., $\alpha_k \to 0$ with $\sum_k \alpha_k = \infty$ (under some extra but standard and mild assumptions on the communication graph, e.g., connectedness). When the step-size is constant, convergence is achieved only to within a ball around the optimizer. To understand this result, stack the variables in a vector $\bar x = [x_1^\top, \ldots, x_N^\top]^\top$ and rewrite the recursion (18) as

    \bar x_{k+1} = \bar W \bar x_k - \alpha\, \hat\nabla \bar f_k(\bar x_k)        (19)

where now $\bar W$ is the matrix that contains the weights $w_{ij}$, while $\hat\nabla \bar f_k(\bar x_k)$ is the vector that stacks all the local gradients. In particular, the matrix $\bar W$ has maximum eigenvalue equal to $1$, with the corresponding eigenvector having all (block) entries equal. Then, one can interpret (19) as a standard gradient algorithm to solve the modified problem

    \min_{\bar x}\ \sum_{i=1}^{N} f_{i,k}(x_i) + \frac{1}{2\alpha}\, \bar x^\top (I - \bar W)\, \bar x        (20)

whose optimizer is different from that of (17) if $\alpha$ stays constant, showing that exact convergence can never be achieved with a constant $\alpha$. It is also apparent that synchronicity must be enforced, otherwise it is not clear which objective is being minimized; if the costs were sampled at different time indices $k_i$ at different nodes, the first term would read $\sum_{i=1}^{N} f_{i,k_i}(x_i)$, which is not the original objective.

In Figure 6, we illustrate the average tracking error for a time-varying problem defined over $N = 20$ nodes (connectivity shown in the upper left corner). The optimization problem is strongly convex and smooth uniformly in time; the local costs are quadratic, with coefficients drawn i.i.d. from uniform probability distributions of bounded support and with a time-varying term that induces the drift. We study the performance of DGD with vanishing step-size, DGD with constant step-size, EXTRA, dual decomposition on the adjacency matrix of the graph, and distributed ADMM. Note that, in this setting, the latter three methods converge linearly in the static setting, and the latter two converge to an error bound in the time-varying setting [Ling14admm], [Jakubiec2013]. In the static setting, EXTRA, dual decomposition, and ADMM maintain their theoretical promises and converge linearly. DGD with vanishing step-size converges more slowly, while DGD with constant step-size converges to an error bound. When we consider a time-varying setting, DGD with vanishing step-size diverges, while the other methods converge to an error bound, as expected. Note that, even though EXTRA has not been shown to converge in the time-varying setting, it is expected to do so, since it converges linearly in the static setting. Better performance in the static setting does not clearly predict better performance in the time-varying setting: for example, dual decomposition does much better in the time-varying scenario, while in the static setting it is worse than EXTRA and ADMM.

Fig. 6: Numerical simulations for a time-varying optimization problem solved in a distributed way. Top left: the communication graph consisting of 20 nodes; Top right: tracking error in the time-invariant case; Bottom right: tracking error in the time-varying synchronous case; Bottom left: tracking error in the time-varying asynchronous case.

Finally, the lower left corner of Fig. 6 illustrates the case where we introduce asynchronicity in the sampling of the cost functions (in this case, nodes can sample the cost functions asynchronously up to a few time instances in the past, meaning that node $i$ uses a cost sampled at a randomly generated delay). The error is higher for all the algorithms, but DGD with constant step-size appears to be the most robust. This is striking, since DGD with constant step-size is the worst-performing algorithm in the static setting, and it shows once more that results in a static scenario cannot be easily translated to time-varying settings.

V Outlook

Streams of heterogeneous and spatially distributed data impose significant communication and computational strains on existing algorithmic solutions. Deploying hardware with more powerful computational means is simply not a viable choice in many applications, and communication constraints still create severe bottlenecks in massively distributed settings. Time-varying optimization is rapidly emerging as an attractive solution. This article has emphasized that we must revisit key design principles of batch optimization to enable online processing of data without losing information or optimization capabilities.

Fig. 7: Outlook for analytical tools and application domains

Figure 7 lists a number of key open questions in the time-varying optimization domain. For example, motivated by the representative results in Figure 1, we expect research efforts to explore the feasibility of accelerated methods in time-varying settings, along with efforts to characterize the performance of time-varying accelerated methods. Based on our discussion in Section IV, we also expect lines of research dealing with the performance analysis of asynchronous time-varying algorithms, possibly operating over dynamic communication graphs. Lastly, while the present paper focuses on time-varying convex programs, we point out that a number of key research problems pertain to the development of approximate online algorithms for time-varying nonconvex problems; this is driven by emerging applications such as the IoT and connected vehicles. Figure 7 also lists a number of application domains beyond the ML and SP areas.

References

VI Authors

Emiliano Dall’Anese (emiliano.dallanese@colorado.edu) is an Assistant Professor within the department of Electrical, Computer, and Energy Engineering at the University of Colorado Boulder. Previously, he was a postdoctoral associate at the University of Minnesota and a senior scientist at the National Renewable Energy Laboratory. He received the Ph.D. in Information Engineering from the University of Padova, Italy, in 2011. His research interests focus on optimization, decision systems, and statistical learning, with applications to networked systems and energy systems.

Andrea Simonetto (andrea.simonetto@ibm.com) is a research staff member in the optimization and control group of IBM Research Ireland. He received the Ph.D. degree in systems and control from Delft University of Technology, Delft, The Netherlands, in 2012. Previously, he was a postdoc at Delft University of Technology and at the Université catholique de Louvain, Louvain-la-Neuve, Belgium. His research interests include large-scale centralized and distributed optimization with applications in smart energy, intelligent transportation systems, and personalized health.

Stephen Becker (stephen.becker@colorado.edu) is an Assistant Professor in the department of Applied Mathematics at the University of Colorado Boulder. Previously, he was a Goldstine postdoctoral fellow at IBM Research T.J. Watson Lab and a postdoctoral fellow at the Laboratoire Jacques-Louis Lions at Paris 6 University. He received the Ph.D. in Applied & Computational Mathematics at the California Institute of Technology in 2011. His work focuses on large-scale continuous optimization for signal processing and machine-learning, as well as numerical analysis.

Liam Madden (liam.madden@colorado.edu) is a Ph.D. student in the department of Applied Mathematics at the University of Colorado Boulder.