## I Introduction and Problem Formulation

The proximal-gradient method is a powerful framework for solving optimization problems with objectives that consist of a differentiable convex function and a nonsmooth convex function [1, 2, 3]. By taking advantage of this composite structure, proximal-gradient methods are known to exhibit the same convergence rates of the gradient method for smooth problems [4]. Accordingly, proximal-gradient methods can be leveraged to efficiently solve a number of problems that arise in the broad areas of, e.g., statistical learning, network optimization, and design of optimal controllers for distributed systems [5, 6, 7, 8].

This paper investigates the design of *online* proximal-gradient methods for composite convex functions that continuously evolves over time. To outline the setting concretely, discretize the temporal axis as , with a given interval. Let be a close convex and proper function with a Lipschitz-continuous gradient at each time , with a given set; further, let be a lower semi-continuous proper convex function for all . Consider then the following time-varying optimization problem [9, 10, 11]:

(1) |

Let be an optimal solution of (1) at time (which is unique if is strongly convex). In principle, a proximal method or an accelerated proximal method can be utilized to attain ; for example, it is known that when
is convex -smooth,
the number of iterations required to obtain an objective function within an error is and for a proximal method and its accelerated counterpart, respectively, with the starting point for the algorithm [4]^{1}^{1}1*Notation*

: For a given vector

, , with denoting transposition; for and , denotes the inner product. For a differentiable function , is the gradient vector of with respect to . If is non-differentiable, denotes the subdifferential of at ; in particular, a vector is a subgradient of at if for all in the domain of . On the other hand, denotes the -subdifferential of at ; a vector is an -subgradient of at if for all in the domain of . Finally, refers to the big O and is the little o notation. . Results for strongly convex functions can be found in e.g., [4, 1].In contrast, this paper targets an online (or “running”, or “catching-up” [12]) case where only one or a few steps of the proximal-gradient method can be performed within an interval (i.e., before the underlying optimization problem may change); further, the paper considers the case when the implementation of the algorithmic steps is *inexact*. Taking the case where only one step can be performed within an interval , an online proximal-gradient algorithm amounts to the execution of the following steps [11] at each time :

(2a) | ||||

(2b) |

where is the step size, and the proximal operator is defined as [1]:

(3) |

with a given parameter. Notice that constraints of the form , with a convex set, can be handled via indicator functions [1]; i.e., by setting , where is a lower-semicontinuous convex function, and if and otherwise.

Inexactness of the steps (2) may emerge because of the following two aspects: (i) only an approximate first-order information of may be available [13, 5]; and, (ii) the proximal operator may be computed only up to a certain precision [8, 14, 15]. Before proceeding, examples of applications that motivate the proposed setting are briefly explained.

*Example 1: Feedback-based network optimization.* Online algorithms are, in this case, utilized to produce decisions to nodes of a networked system (i.e., a power system, a transportation network, or a communication network); temporal variability emerge from time-varying problem inputs (i.e., non-controllable power injections in a power system) or time-varying engineering objectives [13, 16]

. Measurements of the network state are utilized to obtain an estimate of the gradient of

at each time step. Inexactness of the proximal operator captures the case where the projection is performed onto an inner approximation of the actual feasibility region (i.e., when one has an approximate region for aggregations of energy resources) [17]; or, when the proximal operator is not easy to compute within an interval .*Example 2: Online zeroth-order methods.* Zeroth-order methods involves an estimate of at based on functional evaluations , with a given perturbation; see, e.g., Gaussian smoothing or Kiefer-Wolfowitz approaches [18, 19, 20]. Inexactness of (3) is due to projections onto a restriction of [20] or when (3) is not solved to convergence.

*Example 3: Learning under information streams.* For applications with continuous streams of data, the interval may coincide with the inter-arrival time of data points; because of an underlying limited computational budget (compared to ), one may afford one step of the proximal-gradient method and a limited number of algorithmic steps to solve (3

). Examples include singular value decomposition (SVD) based proxies

[5] or structured sparsity [21].For static optimization settings, convergence of inexact proximal-gradient methods has been investigated in, e.g., [14, 15, 8, 22] (see also pertinent references therein); in particular, [8] showed that the inexact proximal-gradient method can achieve the same rate of convergence of the exact counterpart if the error sequence decreases at appropriate rates. In an online setting, [5] investigated the convergence of the proximal-gradient method with an approximate knowledge of (but with an exact implementation of the proximal operator); strongly convex cost functions were considered. In this paper, we analyze the convergence of the online inexact proximal-gradient method with errors in both the computation of and (3). In particular:

The results of [5] are generalized for the case of errors in the proximal operator, and for the case of costs that are not strongly-convex.

The analysis of, e.g., [8, 14, 15] in the context of batch optimization is extended to a time-varying setting considered, with the temporal variability of solution paths (1) playing a key role in the convergence rates.

Under suitable assumptions, convergence of the error iterates is established for strongly convex cost functions. On the other hand, convergence claims are established in terms of dynamic regret when the cost is not strongly convex, under the additional assumption that the feasibility sets are compact. Bounds are expressed in terms of the cumulative error and the path length of the optimal solutions. The role of the errors is emphasized in the bounds, thus suggesting how to allocate computational resources to strike a balance between performance and precision in the gradient computation and in the proximal operator.

## Ii Online Inexact Algorithm

The models for the errors in the computation of the gradient of and of the proximal operator are described first, followed by the online inexact proximal-gradient method.

*Gradient error.* For a given point , the first-order information of is available in the form of , with denoting the gradient error. The error sequence is assumed to be bounded; however, the convergence results derived in the next section can be readily extended to the case where is stochastic as, e.g., [5].

*Error in the proximal step.* A point is an approximation of with a precision if [14]:

(4) |

It is useful to notice that equation (4) implies that [14, 8]

(5) |

where corresponds to the case where the proximal operator is computed exactly; furthermore, since is a -strongly convex function, one has that:

(6) |

Together, equations (6) and (5) imply that

(7) |

See also Appendix -A. More details on (4) will be provided shortly.

With these definitions in place, the online inexact algorithm is presented next, where the parameter is set to as in e.g., [1, 8, 4].

Online inexact proximal-gradient algorithm

Initialize , , and set .

For each :

[S1] Obtain estimate of the gradient

[S2] Perform the following updates:

(8a) | ||||

(8b) |

[S3] Go to [S1].

At each time step , the algorithm is assumed to have the availability of: i) an estimate of , and, ii) the function . This is the case when, e.g., depends on data or its gradient is not available, while represents regularization terms based on a prior on the optimal solution, or set indicator functions for constraints. It is thus reasonable to assume that one has access to – since it is in general engineered – while access on depends on data; take for example, the case , where is the data stream. Conditions on the step size will be given shortly in Section III.

The characteristics of error sequence depends on the particular application. For example, in measurements-based online network optimization algorithms, captures measurement noise (see Example 1) [13, 16]; therefore, a bound on [13] (or on the expected value of [5]) is utilized to assess the tracking performance of the algorithm, but it may not be under the control of the designer of the algorithm. On the other hand, the error may be controllable by the designer of the algorithm in, e.g., zeroth-order methods (Example 2) and applications such as subspace tracking and online sparse regression (Example 3); see e.g., [18, 19, 20, 5, 21] and pertinent references therein.

Regarding the the error sequence , there are two common themes in the examples considered in Section I: i) if a set indicator function is considered, points may be projected in the interior of ; and, ii) for a given lower semi-continuous convex function, (3) may not be solved to convergence. In both cases, the error sequence can be controlled, based on given computational budgets or other design specifications. Examples are provided next.

*Example: inexact projection* [14]. Suppose that , for a given closed and convex set ; let denote the distance of the point from the convex set . Then, the definition (4) implies that if and only if

(9) |

That is, if , then may not lie in the boundary of ; rather, may lie in the interior of . It is worth noticing that the point is always feasible as explained in [14].

*Example: structured sparsity*. Take, for example, the case where with a given sub-block of the vector . In this case, a block coordinate method can be utilized to solve (2) [8, 21]; the block coordinate method can be run up a given error .

*Example: SVD-based proxies*. Proximal operators that involve an SVD computation (e.g., nuclear norm minimization) may be computed inexactly, especially for large matrices.

Finally, it is also worth mentioning that the two errors could be analyzed in a unified way of one interprets as a perturbation in the computation of the (exact) operator (see, e.g., Definition 3 in [14]); however, similarly to [8], the current models of the errors allow one to better appreciate the role of the “exactness” of the first-order information and the proximal operator in the performance of the algorithm as shown in the next section.

## Iii Performance Analysis

This section will analyze the performance of the online inexact algorithm (8); two metrics will be considered:

*i)* convergence of the sequence ; and,

*ii)* the dynamic regret, defined as (see, e.g., [23, 24, 25] and references therein):

(10) |

The error sequence will be analyzed when the cost function is strongly convex; in particular, bounds on the cumulative error and -linear convergence results will be offered. The latter is indeed possible only in case of strong convexity. On the other hand, the results for the dynamic regret can are provided for both the convex and the strongly-convex case.

The following standard assumptions are presumed throughout this section.

###### Assumption 1

The function is close, convex and proper. Assume that has a -Lipschitz continuous gradient at each time .

###### Assumption 2

The function is a lower semi-continuous proper convex function for all .

###### Assumption 3

For all , attains its minimum at a certain .

To characterize bounds on the error sequence and the dynamic regret, it is necessary to introduce a “measure” of the temporal variability of (1) as well as of the “exactness” of the first-order information of and the computation of the proximal operator. For the former, define

(11) |

along with the following quantities [5, 13, 24]:

(12) |

with typically referred to as the “path length” or “cumulative drifting.” When is strongly convex, is uniquely defined; on the other hand, is associated with a solution path when is not strongly convex. Consider further the following definitions for the cumulative errors [8]:

(13) |

With these definitions in place, the convergence results are established first for the case where the function in (1) is -strongly convex for all .

The following lemma will be utilized to derive convergence results when the function in (1) is -strongly convex.

###### Lemma 1

*Proof*. See Appendix -B.

Based on Lemma 1, the following theorems characterize the behavior of the error sequence .

###### Theorem 1

*Proof*. See Appendix -C.

###### Theorem 2

Suppose that there exists finite constants , and such that , , and for all . Then, under the same assumptions of Theorem 1, it hols that

(17) |

*Proof*. See Appendix -D.

From Theorem 1, it can be seen that if , , and grow as , then the averaged tracking error behaves as . The same limiting behavior can be obtained even if the error sequences and decrease over time, if . As expected, the error in the gradient computation is down-weighted by the step size; on the other hand, the error in the proximal operator directly affects the tracking performance. The result of Theorem 2 may suggest how to allocate computational resources to minimize the maximum tracking error, for a given and ; since is multiplied by , one may want to increase the interval (thus increasing ) and allocate more resources in the proximity operator (thus decreasing ).

The next result pertains to the dynamic regret, and it extends the existing results of [5] to the case of inexact proximal operators.

###### Theorem 3

*Proof*. See Appendix -E.

Finally, the next result pertains to the dynamic regret in case of functions that are convex but not strongly-convex. We impose the additional assumption that the cost function includes a time-varying set indicator function for a compact set. The derivation of bounds on the dynamic regret for the case of sets that are not compact is left as a future research.

###### Theorem 4

*Proof*. See Appendix -F.

If , , , , and grow as , then ; that is, the dynamic regret settles to a constant value. A no-regret result can be obtained if , , , , and all grow sublinearly in ; that is, if they grow as . However, this may not be achievable when the optimization problem continuously evolves over time.

## Iv Illustrative Numerical Results

As an example of application of the proposed methods, we consider a network flow problem based on the network in Fig. 1. the network graph the network has nodes and (directed) links, and the routing matrix is based on the directed edges. Let denote the rate generated at node for traffic at time , and the flow between noted and for traffic . For brevity, let and stacks the node traffic and link rates for the -th flow. One has that and are related by the flow conservation constraint ; if a node does not generate or receive traffic, then or with the -th row of . Consider then the following time-varying problem (where we recall that is the time index):

(20) |

where and stack the traffic rates and link rates for brevity; is a maximum traffic rate; and is a regularization function that makes the cost strongly convex. The per-link capacity constraints is , where is the time-varying link capacity and is a time-varying link traffic that is non-controllable. Notice that problem (IV) can be equivalently rewritten in terms of only the vector variable (and the term in the cost can be dropped); simulations will be based on the reformulated problem.

For the numerical results, assume that two traffic flows are generated by nodes and , and they are received at nodes and

, respectively. In terms of dynamics of the optimal solutions, at each time step the channel gain of links are generated by using a complex Gaussian random variable with mean

and a given variance

for both real and imaginary parts; the transmit power for each node is a Gaussian random variable with mean and a variance ; the exogenous traffic follows a random walk, where the increment has zero mean and a variance ; and, the cost is perturbed by modifying . Different values for and are obtained by varying the variance of these random variables.The algorithm (8) was implemented, with the following settings:

Gradient errors: the cost function was assumed unknown; therefore, at each step of the algorithm, the gradient is estimated using a multi-point bandit feedback [18, 20]. Briefly, to estimate the gradient of a function around a point , consider drawing points from the unit sphere; then, an estimate can be found as , where is given parameter [20]. Notice that this requires functional evaluations at each step of the algorithm.

Error in the proximal operator: since the estimate of the gradient requires functional evaluations around the current point, we consider a restriction of the feasible set; in particular, consider the constraint , where is a pre-selected constant (that is related to [20]). This will allow for functional evaluations at points that are feasible. In the numerical tests, amounts to , whereas the maximum errors due to the gradient estimate and the inexact projection add up to .

Figure 2 shows the evolution of the cumulative tracking error at each time step (with unique, since the cost function is by design strongly convex). Based on Theorem 1, in the current setting the limiting behavior of is . Indeed, a plateau can be seen, with an asymptotic error that is larger for the inexact proximal gradient method. Figure 3 illustrates the traffic rates achieved with a batch algorithm and with the inexact proximal gradient method. It can be seen that the optimal traffic rates are tracked. Slightly lower traffic rates are obtained in the online case because of the projection onto a restriction of the feasible set.

### -a Results for the inexact proximal operator

Technical details for the error in the proximal operator are derived. These technical details are then utilized in Section III to derive pertinent convergence results. We start from the following lemma.

###### Lemma 2

Take the function . Then .

Proof. The proof can be derived as a special case of Theorem 3.1.1 in [26].

Based on Lemma 2, we obtain:

(21) |

Since is an subgradient:

(22) |

Since (22) is true for every , it is also specifically true for . Then we can write (22) as

(23) |

and therefore . Now defining and putting (22) and (23) together, we get

(24) |

Since is chosen equal to in (8), then (24) suggests the following inequality:

(25) |

where . It also follows that can be written as

(26) |

with .

### -B Proof of Lemma 1

Based on (26), we can write

(27a) | ||||

(27b) |

Now, the prox operator is non-expansive, therefore:

(28) |

which leads to

(29) |

Consider the function . The norm of the gradient is bounded as

(30) |

and therefore is Lipschitz (and a contraction for )[27]. Hence, we can bound (29) as

(31) |

Adding and subtracting to the first term in the right hand side of (31) we can rewrite (31) as

(32a) | |||

(32b) | |||

(32c) |

Applying (32c) recursively we get the result.

### -C Proof of Theorem 1

### -D Proof of Theorem 2

### -E Proof of Theorem 3

### -F Proof of Theorem 4

Since has a -Lipschitz continuous gradient:

(36) |

Using the convexity of we also get

(37) |

Therefore, putting (36) and (37) together:

(38) |

On the other hand, rewrite equation (25) as

(39) |

with . Adding the inequalities (38) and (39):

(40) |

and therefore

(41) |

Set , say ; then:

Comments

There are no comments yet.