# Stochastic Averaging for Constrained Optimization with Application to Online Resource Allocation

Existing approaches to resource allocation for today's stochastic networks are challenged to meet fast convergence and tolerable delay requirements. The present paper leverages advances in online learning to facilitate stochastic resource allocation tasks. By recognizing the central role of Lagrange multipliers, the underlying constrained optimization problem is formulated as a machine learning task involving both training and operational modes, with the goal of learning the sought multipliers in a fast and efficient manner. To this end, an order-optimal offline learning approach is first developed for batch training, and it is then generalized to the online setting through a procedure termed learn-and-adapt. The novel resource allocation protocol combines the benefits of stochastic approximation and statistical learning to obtain low-complexity online updates with learning errors close to the statistical accuracy limits, while still preserving adaptation performance, which in the stochastic network optimization context guarantees queue stability. Analysis and simulated tests demonstrate that the proposed data-driven approach improves the delay and convergence performance of existing resource allocation schemes.


## I Introduction

Contemporary cloud data centers (DCs) are proliferating to provide major Internet services such as video distribution and data backup [2]. To reduce electricity costs and improve sustainability, advanced smart grid features are now adopted by cloud networks [3]. Due to their inherent stochasticity, these features challenge the overall network design, and particularly the network's resource allocation task. Recent approaches to this topic mitigate the spatio-temporal uncertainty associated with energy prices and renewable availability [4, 5, 6, 7, 8]. Interestingly, the resultant algorithms reconfirm the critical role of the dual decomposition framework, and its renewed popularity for optimizing modern stochastic networks.

However, slow convergence and the associated network delay of existing resource allocation schemes have recently motivated improved first- and second-order optimization algorithms [9, 10, 11, 12]. Yet, historical data have not been exploited to mitigate future uncertainty. This is the key novelty of the present paper, which brings the benefits of statistical learning to stochastic resource allocation tasks.

Targeting this goal, a critical observation is that renowned network optimization algorithms (e.g., back-pressure and max-weight) are intimately linked with Lagrange’s dual theory, where the associated multipliers admit pertinent price interpretations [13, 14, 15]. We contend that learning these multipliers can benefit significantly from historical relationships and trends present in massive datasets [16]. In this context, we revisit the stochastic network optimization problem from a machine learning vantage point, with the goal of learning the Lagrange multipliers in a fast and efficient manner. Works in this direction include [17] and [18], but the methods there are more suitable for problems with two features: 1) the network states belong to a distribution with finite support; and, 2) the feasible set is discrete with a finite number of actions. Without these two features, the involved learning procedure may become intractable, and the advantageous performance guarantee may not hold. Overall, online resource allocation capitalizing on data-driven learning schemes remains a largely uncharted territory.

Motivated by recent advances in machine learning, we systematically formulate the resource allocation problem as an online task with a batch training phase and an operational learn-and-adapt phase. In the batch training phase, we view the empirical dual problem of maximizing the sum of finitely many concave functions as an empirical risk maximization (ERM) task, which is well studied in machine learning [16]. Leveraging our specific ERM problem structure, we modify the recently developed stochastic average gradient approach (SAGA) to fit our training setup. SAGA belongs to the class of fast incremental gradient (a.k.a. stochastic variance reduction) methods [19], which combine the merits of both stochastic gradient and batch gradient methods. In the resource allocation setup, our offline SAGA yields empirical Lagrange multipliers with the order-optimal linear convergence rate of the batch gradient approach, and per-iteration complexity as low as that of the stochastic gradient approach. Broadening the static learning setup in [19] and [20], we further introduce a dynamic resource allocation approach (that we term online SAGA) that operates in a learn-and-adapt fashion. Online SAGA fuses the benefits of stochastic approximation and statistical learning: in the learning mode, it preserves the simplicity of offline SAGA to dynamically learn from streaming data, thus lowering the training error; while it also adapts by incorporating attributes of the well-appreciated stochastic dual (sub)gradient approach (SDG) [4, 5, 6, 7, 21], in order to track queue variations and thus guarantee long-term queue stability.

In a nutshell, the main contributions of this paper can be summarized as follows.

1. Using stochastic network management as a motivating application domain, we take a fresh look at dual solvers of constrained optimization problems as machine learning iterations involving training and operational phases.

2. During the training phase, we considerably broaden SAGA to efficiently compute Lagrange multiplier iterates at order-optimal linear convergence, and computational cost comparable to the stochastic gradient approach.

3. In the operational phase, our online SAGA learns-and-adapts at low complexity from streaming data. For allocating stochastic network resources, this leads to a cost-delay tradeoff that holds with high probability, and markedly improves upon SDG's tradeoff.

Outline. The rest of the paper is organized as follows. The system models are described in Section II. The motivating resource allocation setup is formulated in Section III. Section IV deals with our learn-and-adapt dual solvers for constrained optimization. Convergence analysis of the novel online SAGA is carried out in Section V. Numerical tests are provided in Section VI, followed by concluding remarks in Section VII.

Notation. $\mathbb{E}$ ($\mathbb{P}$) denotes expectation (probability); $\mathbf{1}$ denotes the all-one vector; and $\|\mathbf{x}\|$ denotes the $\ell_2$-norm of vector $\mathbf{x}$. Inequalities for vectors $\mathbf{a}\le\mathbf{b}$ are defined entry-wise; $[\cdot]^+:=\max\{\cdot,0\}$ denotes projection onto the nonnegative orthant; and $(\cdot)^\top$ stands for transposition. $f(x)=\mathcal{O}(g(x))$ denotes big order of $g(x)$, i.e., $f(x)/g(x)\to C$ for some constant $C$ as $x\to\infty$; and $f(x)=o(g(x))$ denotes small order of $g(x)$, i.e., $f(x)/g(x)\to 0$ as $x\to\infty$.

## II Network Modeling Preliminaries

This section focuses on resource allocation over a sustainable DC network with $J$ mapping nodes (MNs) and $I$ DCs. MNs here can be authoritative DNS servers as used by Akamai and most content delivery networks, or HTTP ingress proxies as used by Google and Yahoo! [22], which collect user requests over a geographical area (e.g., a city or a state), and then forward workloads to one or more DCs distributed across a large area (e.g., a country).

Notice though, that the algorithms and their performance analysis in Sections IV and V apply to general resource allocation tasks, such as energy management in power systems [23], cross-layer rate-power allocation in communication links [24], and traffic control in transportation networks [25].

Network constraints. Suppose that interactive workloads are allocated as in e.g., [22], and only delay-tolerant workloads that are deferrable are to be scheduled across slots. Typical examples include system updates and data backup, which provide ample optimization opportunities for workload allocation based on the dynamic variation of energy prices, and the random availability of renewable energy sources.

Suppose that time is indexed in discrete slots $t$, and consider an infinite time horizon $T\to\infty$. With reference to Fig. 1, let $v_{j,t}$ denote the amount of delay-tolerant workload arriving at MN $j$ in slot $t$, and $\tilde{\mathbf{x}}_{j,t}:=\{\tilde{x}_{i,j,t}\}_i$ the vector collecting the workloads routed from MN $j$ to all DCs in slot $t$. With the fraction of unserved workload buffered in corresponding queues, the queue length $q^{\rm mn}_{j,t}$ at each mapping node $j$ at the beginning of slot $t$ obeys the recursion

$$q^{\rm mn}_{j,t+1}=\Big[q^{\rm mn}_{j,t}+v_{j,t}-\sum_{i\in\mathcal{I}_j}\tilde{x}_{i,j,t}\Big]^+,\quad \forall j,t \tag{1}$$

where $\mathcal{I}_j$ denotes the set of DCs that MN $j$ is connected to.

At the DC side, $x_{i,t}$ denotes the workload processed by DC $i$ during slot $t$. Unserved portions of the workloads are buffered at DC queues, whose length obeys (cf. (1))

$$q^{\rm dc}_{i,t+1}=\Big[q^{\rm dc}_{i,t}-x_{i,t}+\sum_{j\in\mathcal{J}_i}\tilde{x}_{i,j,t}\Big]^+,\quad \forall i,t \tag{2}$$

where $q^{\rm dc}_{i,t}$ is the queue length in DC $i$ at the beginning of slot $t$, and $\mathcal{J}_i$ denotes the set of MNs that DC $i$ is linked with. The per-slot workload $x_{i,t}$ is bounded by the capacity $D_i$ of each DC $i$; that is,

$$0\le x_{i,t}\le D_i,\quad \forall i,t. \tag{3}$$

Since the link from MN $j$ to DC $i$ has bounded bandwidth $B_{i,j}$, it holds that

$$0\le \tilde{x}_{i,j,t}\le B_{i,j},\quad \forall i,j,t. \tag{4}$$

If MN $j$ and DC $i$ are not connected, then $B_{i,j}=0$.
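For concreteness, the recursions (1)-(4) are straightforward to simulate; the minimal Python sketch below uses toy network sizes, capacities, and arrival distributions that are purely illustrative (none come from the paper), and draws feasible routing and processing decisions at random.

```python
import numpy as np

rng = np.random.default_rng(0)

I, J = 3, 2                  # toy numbers of DCs and MNs (illustrative only)
B = np.full((I, J), 5.0)     # link bandwidths B_{i,j}, cf. (4)
D = np.full(I, 8.0)          # DC capacities D_i, cf. (3)

q_mn = np.zeros(J)           # MN queue lengths q^mn_{j,t}
q_dc = np.zeros(I)           # DC queue lengths q^dc_{i,t}

for t in range(100):
    v = rng.uniform(0, 4, size=J)                  # arrivals v_{j,t}
    x_route = rng.uniform(0, 1, size=(I, J)) * B   # routing decisions obeying (4)
    x_proc = rng.uniform(0, 1, size=I) * D         # processing decisions obeying (3)
    # Queue recursions (1)-(2): the [.]^+ projection keeps queues nonnegative.
    q_mn = np.maximum(q_mn + v - x_route.sum(axis=0), 0.0)
    q_dc = np.maximum(q_dc - x_proc + x_route.sum(axis=1), 0.0)
```

Any allocation policy can be plugged in by replacing the random draws of `x_route` and `x_proc`; the projections in the last two lines are exactly the $[\cdot]^+$ operations in (1)-(2).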

Operation costs. We will use the strictly convex function $G^d_{i,j}(\tilde{x}_{i,j,t})$ to denote the cost for distributing workloads from MN $j$ to DC $i$, which depends on the distance between them.

Let the energy provided at the beginning of slot $t$ by the renewable generator at DC $i$ be constrained to lie in a bounded interval. The power consumption of DC $i$ is taken to be a quadratic function of the demand [6], with a time-varying parameter capturing environmental factors such as humidity and temperature [8]. The energy transaction cost is modeled as a linear function of the power imbalance amount to capture the cost of real-time power balancing, where the coefficient is the buy/sell price in the local power wholesale market. (When the prices of buying and selling are not the same, the transaction cost is a piece-wise linear function of the power imbalance amount, as in [6].) Clearly, each DC should buy energy from the external energy market in slot $t$ at this price when its consumption exceeds the available renewables, or sell surplus energy to the market at the same price otherwise.

The aggregate cost for the considered MN-DC network per slot $t$ is a convex function, given by

$$\Psi_t(\{x_{i,t}\},\{\tilde{x}_{i,j,t}\}):=\sum_{i\in\mathcal{I}}G^e_{i,t}(x_{i,t})+\sum_{i\in\mathcal{I}}\sum_{j\in\mathcal{J}}G^d_{i,j}(\tilde{x}_{i,j,t}). \tag{5}$$

While (5) arises under the specific operating conditions considered here, our analysis applies to general smooth and strongly convex costs.

Building on (1)-(5), the next section will deal with the allocation of resources per slot $t$. The goal will be to minimize the long-term average cost, subject to operational constraints.

## III Motivating Application: Resource Allocation

For notational brevity, collect all random variables at time $t$ in the state vector $\mathbf{s}_t$, and the optimization variables in the vector $\mathbf{x}_t$. Considering the scheduled $x_{i,t}$ as the outgoing amount of workload from DC $i$, define the "node-incidence" matrix $\mathbf{A}$ with $(i,e)$-th entry given by

$$\mathbf{A}(i,e):=\begin{cases}1,&\text{if link }e\text{ enters node }i\\-1,&\text{if link }e\text{ leaves node }i\\0,&\text{else.}\end{cases}$$

We assume that each row of $\mathbf{A}$ has at least one $-1$ entry, and each column of $\mathbf{A}$ has at most one $-1$ entry, meaning that each node has at least one outgoing link, and each link has at most one source node. Further, collect the instantaneous workloads across MNs in the zero-padded vector $\mathbf{c}_t$, as well as the link and DC capacities in the vector $\bar{\mathbf{x}}$.

Hence, the sought scheduling is the solution of the following long-term network-optimization problem

$$\Psi^*:=\min_{\{\mathbf{x}_t,\,\forall t\}}\ \lim_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\Psi_t(\mathbf{x}_t)\right] \tag{6a}$$
$$\text{s.t.}\quad \mathbf{0}\le\mathbf{x}_t\le\bar{\mathbf{x}},\ \ \forall t \tag{6b}$$
$$\qquad\ \ \mathbf{q}_{t+1}=\left[\mathbf{q}_t+\mathbf{A}\mathbf{x}_t+\mathbf{c}_t\right]^+,\ \ \forall t \tag{6c}$$
$$\qquad\ \ \lim_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\mathbf{q}_t\right]<\infty \tag{6d}$$

where (6b) and (6c) come from concatenating (3)-(4) and (1)-(2), while (6d) ensures strong queue stability as in [21, Definition 2.7]. The objective in (6a) considers the entire time horizon, and the expectation is over all sources of randomness, namely the random vector $\mathbf{s}_t$, and the randomness of the variables $\mathbf{x}_t$ and $\mathbf{q}_t$ induced by the sample path of the states. For problem (6), the queue dynamics in (6c) couple the optimization variables over the infinite time horizon. For practical cases where the knowledge of $\mathbf{s}_t$ is causal, finding the optimal solution is generally intractable. Our approach to circumventing this obstacle is to replace (6c) with limiting average constraints, and employ dual decomposition to separate the solution across time, as elaborated next.

### III-A Problem reformulation

Substituting (6c) into (6d), we will argue that the long-term aggregate (endogenous plus exogenous) workload must satisfy the following necessary condition

$$\lim_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\mathbf{A}\mathbf{x}_t+\mathbf{c}_t\right]\le\mathbf{0}. \tag{7}$$

Indeed, (6c) implies that $\mathbf{q}_{t+1}\ge\mathbf{q}_t+\mathbf{A}\mathbf{x}_t+\mathbf{c}_t$; summing over $t=1,\ldots,T$ and taking expectations yields $\mathbb{E}[\mathbf{q}_{T+1}]-\mathbb{E}[\mathbf{q}_1]\ge\sum_{t=1}^{T}\mathbb{E}[\mathbf{A}\mathbf{x}_t+\mathbf{c}_t]$. Since both $\mathbb{E}[\mathbf{q}_1]$ and $\mathbb{E}[\mathbf{q}_{T+1}]$ are bounded under (6d), dividing both sides by $T$ and taking $T\to\infty$ yields (7). Using (7), we can write a relaxed version of (6) as

$$\tilde{\Psi}^*:=\min_{\{\mathbf{x}_t,\,\forall t\}}\ \lim_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\Psi_t(\mathbf{x}_t)\right]\quad \text{s.t. (6b) and (7)}. \tag{8}$$

Compared to (6), the queue variables are not present in (8), while the time-coupling constraints (6c) and (6d) are replaced with (7). As (8) is a relaxed version of (6), it follows that $\tilde{\Psi}^*\le\Psi^*$. Hence, if one solves (8) instead of (6), it will be prudent to derive an optimality bound on the resultant schedule, provided that the schedule obtained by solving (8) is feasible for (6). In addition, using arguments similar to those in [21] and [5], it can be shown that if the random process $\{\mathbf{s}_t\}$ is stationary, there exists a stationary control policy $\chi(\cdot)$, which is a function only of the current $\mathbf{s}_t$; it satisfies (6b) almost surely; and it attains the optimal cost of (8) while satisfying its average constraint. This implies that the infinite time horizon problem (8) is equivalent to the following per-slot convex program

$$\tilde{\Psi}^*:=\min_{\chi(\cdot)}\ \mathbb{E}\left[\Psi(\chi(\mathbf{s}_t);\mathbf{s}_t)\right] \tag{9a}$$
$$\text{s.t.}\quad \mathbb{E}\left[\mathbf{A}\chi(\mathbf{s}_t)+\mathbf{c}_t(\mathbf{s}_t)\right]\le\mathbf{0} \tag{9b}$$
$$\qquad\ \ \chi(\mathbf{s}_t)\in\mathcal{X}:=\{\mathbf{0}\le\chi(\mathbf{s}_t)\le\bar{\mathbf{x}}\},\ \ \forall \mathbf{s}_t \tag{9c}$$

where we interchangeably use $\Psi(\chi(\mathbf{s}_t);\mathbf{s}_t)$ and $\Psi_t(\mathbf{x}_t)$, to emphasize the dependence of the real-time cost and decision on the random state $\mathbf{s}_t$. Note that the optimization in (9) is w.r.t. the stationary policy $\chi(\cdot)$. Hence, there is an infinite number of variables in the primal domain. Observe though, that there is a finite number of constraints coupling the realizations (cf. (9b)). Thus, the dual problem contains a finite number of variables, hinting that the problem is perhaps tractable in the dual space [24, 12]. Furthermore, we will demonstrate in Section V that, after the careful algorithmic design of Section IV, our online solution for the relaxed problem (9) is also feasible for (6).

### Iii-B Lagrange dual and optimal solutions

Let $\boldsymbol{\lambda}$ denote the Lagrange multiplier vector associated with constraints (9b). Upon defining $\mathbf{x}:=\{\mathbf{x}_t,\forall t\}$, the partial Lagrangian function of (8) is $\mathcal{L}(\mathbf{x},\boldsymbol{\lambda}):=\mathbb{E}[\mathcal{L}_t(\mathbf{x}_t,\boldsymbol{\lambda})]$, where the instantaneous Lagrangian is given by

$$\mathcal{L}_t(\mathbf{x}_t,\boldsymbol{\lambda}):=\Psi_t(\mathbf{x}_t)+\boldsymbol{\lambda}^\top(\mathbf{A}\mathbf{x}_t+\mathbf{c}_t). \tag{10}$$

Considering $\mathcal{X}$ as the feasible set specified by the instantaneous constraints in (9c), which are not dualized in (10), the dual function can be written as

$$\mathcal{D}(\boldsymbol{\lambda}):=\min_{\{\mathbf{x}_t\in\mathcal{X},\forall t\}}\mathcal{L}(\mathbf{x},\boldsymbol{\lambda})=\mathbb{E}\Big[\min_{\mathbf{x}_t\in\mathcal{X}}\mathcal{L}_t(\mathbf{x}_t,\boldsymbol{\lambda})\Big]. \tag{11}$$

Correspondingly, the dual problem of (8) is

$$\max_{\boldsymbol{\lambda}\ge\mathbf{0}}\ \mathcal{D}(\boldsymbol{\lambda})=\mathbb{E}\left[\mathcal{D}_t(\boldsymbol{\lambda})\right] \tag{12}$$

where $\mathcal{D}_t(\boldsymbol{\lambda}):=\min_{\mathbf{x}_t\in\mathcal{X}}\mathcal{L}_t(\mathbf{x}_t,\boldsymbol{\lambda})$. We henceforth refer to (12) as the ensemble dual problem. Note that, similar to $\mathcal{L}_t$, here $\mathcal{D}_t$ and $\mathbf{c}_t$ are both parameterized by $\mathbf{s}_t$.

If the optimal Lagrange multipliers $\boldsymbol{\lambda}^*$ were known, a sufficient condition for the optimal solution of (8) or (9) would be to minimize the Lagrangian, or its instantaneous versions, over the set $\mathcal{X}$ [26, Proposition 3.3.4]. Specifically, as formalized in the ensuing proposition, the optimal routing and workload schedules in this MN-DC network can be expressed as functions of the multipliers $\boldsymbol{\lambda}^*$ associated with (9b), and the realization of the random state $\mathbf{s}_t$.

###### Proposition 1.

Consider the strictly convex costs $G^e_{i,t}$ and $G^d_{i,j}$ in (5). Given the realization $\mathbf{s}_t$ in (9), and the Lagrange multipliers $\boldsymbol{\lambda}^*$ associated with (9b), the optimal instantaneous workload routing decisions are

$$\tilde{x}^*_{i,j,t}(\mathbf{s}_t)=\Big[(\nabla G^d_{i,j})^{-1}\big((\lambda^{\rm mn}_j)^*-(\lambda^{\rm dc}_i)^*\big)\Big]_0^{B_{i,j}} \tag{13a}$$

and the optimal instantaneous workload scheduling decisions are given by [with $G^e_i(\mathbf{x}_t;\mathbf{s}_t):=G^e_{i,t}(\mathbf{x}_t)$]

$$x^*_{i,t}(\mathbf{s}_t)=\Big[(\nabla G^e_i(\mathbf{s}_t))^{-1}\big((\lambda^{\rm dc}_i)^*\big)\Big]_0^{D_i} \tag{13b}$$

where $(\nabla G^d_{i,j})^{-1}$ and $(\nabla G^e_i)^{-1}$ denote the inverse functions of the gradients $\nabla G^d_{i,j}$ and $\nabla G^e_i$, respectively, and $[\cdot]_0^{B}$ denotes projection onto $[0,B]$.

We omit the proof of Proposition 1, which can be readily derived using the KKT conditions for constrained optimization [26]. Building on Proposition 1, it is interesting to observe that the stationary policy we are looking for in Section III-A is expressible uniquely in terms of $\boldsymbol{\lambda}^*$. The intuition behind this solution is that the Lagrange multipliers act as interfaces between MN-DC and workload-power balance, capturing the availability of resources and the utility information that is relevant from a resource allocation point of view.
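To see Proposition 1 in action, suppose for illustration that the costs are quadratic, say $G^e_i(x)=\tfrac{1}{2}a_i x^2$ and $G^d_{i,j}(\tilde{x})=\tfrac{1}{2}a_{i,j}\tilde{x}^2$, so each inverse gradient reduces to a division. All numbers below are hypothetical; the sketch merely evaluates the projected closed forms (13a)-(13b).

```python
import numpy as np

# Hypothetical quadratic costs: G^e_i(x) = 0.5*a_e[i]*x^2 and
# G^d_{i,j}(x) = 0.5*a_d[i,j]*x^2, so (grad G)^{-1}(y) = y / a.
a_e = np.array([2.0, 1.0])           # curvatures of the DC energy costs
a_d = np.array([[1.0, 4.0],
                [2.0, 1.0]])         # curvatures of the routing costs (I x J)
D = np.array([5.0, 5.0])             # DC capacities D_i
B = np.full((2, 2), 3.0)             # link bandwidths B_{i,j}

lam_mn = np.array([4.0, 2.0])        # optimal multipliers (lambda^mn_j)*
lam_dc = np.array([1.0, 3.0])        # optimal multipliers (lambda^dc_i)*

# (13a): price differential through the inverse gradient, projected onto [0, B]
x_route = np.clip((lam_mn[None, :] - lam_dc[:, None]) / a_d, 0.0, B)
# (13b): DC multiplier through the inverse gradient, projected onto [0, D]
x_proc = np.clip(lam_dc / a_e, 0.0, D)
```

Note how a negative price differential (e.g., $\lambda^{\rm mn}_j<\lambda^{\rm dc}_i$) is projected to a zero routing decision, reflecting the price interpretation of the multipliers.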

However, to implement the optimal resource allocation in (13), the optimal multipliers must be known. To this end, we first outline the celebrated stochastic approximation-based and corresponding Lyapunov approaches to stochastic network optimization. Subsequently, we develop a novel approach in Section IV to learn the optimal multipliers in both offline and online settings.

### Iii-C Stochastic dual (sub-)gradient ascent

For the ensemble dual problem (12), a standard (sub)gradient iteration involves taking the expectation over the distribution of $\mathbf{s}_t$ to compute the gradient [24]. This is challenging because the underlying distribution of $\mathbf{s}_t$ is usually unknown in practice. Even if the joint probability distribution functions were available, finding the expectations can be non-trivial, especially in high-dimensional settings with large numbers of MNs and/or DCs.

To circumvent this challenge, a popular solution relies on stochastic approximation [27, 21, 6]. The resultant stochastic dual gradient (SDG) iterations can be written as (cf. (12))

$$\check{\boldsymbol{\lambda}}_{t+1}=\left[\check{\boldsymbol{\lambda}}_t+\mu\nabla\mathcal{D}_t(\check{\boldsymbol{\lambda}}_t)\right]^+ \tag{14}$$

where $\mu$ is a constant stepsize, and $\nabla\mathcal{D}_t(\check{\boldsymbol{\lambda}}_t)=\mathbf{A}\mathbf{x}_t(\check{\boldsymbol{\lambda}}_t)+\mathbf{c}_t$ is an unbiased estimator of the ensemble gradient. The primal variables can be found by solving "on-the-fly" the instantaneous problems, one per slot $t$,

$$\mathbf{x}_t\in\arg\min_{\mathbf{x}_t\in\mathcal{X}}\mathcal{L}_t(\mathbf{x}_t,\check{\boldsymbol{\lambda}}_t) \tag{15}$$

where the $\arg\min$ operator accounts for cases in which the Lagrangian has multiple minimizers. The minimization in (15) is not difficult to solve: for a number of relevant cost and utility functions, as in Proposition 1, closed-form solutions are available for the primal variables. Note from (14) that the iterate $\check{\boldsymbol{\lambda}}_{t+1}$ depends on the probability distribution of $\mathbf{s}_t$ only through the stochastic gradient $\nabla\mathcal{D}_t(\check{\boldsymbol{\lambda}}_t)$. Consequently, the process $\{\check{\boldsymbol{\lambda}}_t\}$ is Markov with time-invariant transition probability, since $\{\mathbf{s}_t\}$ is stationary. In the context of Lyapunov optimization [5, 13] and [21], the Markovian iteration in (14) is interpreted as a virtual queue recursion; i.e., $\check{\boldsymbol{\lambda}}_t=\mu\mathbf{q}_t$.
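The SDG recursion (14)-(15) is easy to prototype. The toy scalar instance below is ours, not the paper's: per-slot cost $\Psi_t(x)=\tfrac{1}{2}(x-s_t)^2$ over $\mathcal{X}=[0,4]$ with average constraint $\mathbb{E}[x_t-c]\le 0$, so both the Lagrangian minimizer and the stochastic dual gradient come in closed form. For $s_t\sim\mathrm{U}(0,2)$ and $c=0.25$, the multiplier should hover around $\lambda^*=1$.

```python
import numpy as np

rng = np.random.default_rng(1)

xmax, c, mu = 4.0, 0.25, 0.01   # toy capacity, constraint level, stepsize

lam = 0.0
for t in range(20000):
    s = rng.uniform(0, 2)                 # random state s_t
    # (15): x_t minimizes 0.5*(x - s)^2 + lam*(x - c) over X = [0, xmax]
    x = float(np.clip(s - lam, 0.0, xmax))
    # (14): projected ascent along the stochastic dual gradient x_t - c
    lam = max(lam + mu * (x - c), 0.0)
```

With a constant stepsize the iterate only reaches a neighborhood of $\lambda^*$ and keeps fluctuating there, which previews the slow-convergence limitation of SDG discussed next.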

Thanks to their low complexity and ability to cope with non-stationary scenarios, SDG-based approaches are widely used in various research disciplines; e.g., adaptive signal processing [28], stochastic network optimization [21, 13, 17, 29], and energy management in power grids [5, 23]. Unfortunately, SDG iterates are known to converge slowly. Although simple, SDG does not exploit the often sizable number of historical samples. These considerations motivate a systematic design of an offline-aided-online approach, which can significantly improve online performance of SDG for constrained optimization, and can have major impact in e.g., network resource allocation tasks by utilizing streaming big data, while preserving its low complexity and fast adaptation.

## IV Learn-and-Adapt Dual Solvers for Constrained Optimization

Before developing such a promising approach, which we view as a learn-and-adapt SDG scheme, we list our assumptions; these are satisfied in typical network resource allocation problems.

(as1) The state process $\{\mathbf{s}_t\}$ is independent and identically distributed (i.i.d.), and the common probability density function (pdf) has bounded support.

(as2) The cost in (5) is non-decreasing w.r.t. its arguments; it is a $\sigma$-strongly convex function (we say that a function $f$ is $\sigma$-strongly convex if and only if $f(\mathbf{x})-\frac{\sigma}{2}\|\mathbf{x}\|^2$ is convex [26]); and its gradient is Lipschitz continuous for all $t$.

(as3) There exists a stationary policy $\chi(\cdot)$ satisfying $\chi(\mathbf{s}_t)\in\mathcal{X}$ for all $\mathbf{s}_t$, and $\mathbb{E}[\mathbf{A}\chi(\mathbf{s}_t)+\mathbf{c}_t]\le-\boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon}>\mathbf{0}$ is the slack constant; and

(as4) The instantaneous dual function $\mathcal{D}_t(\boldsymbol{\lambda})$ in (12) is strongly concave, and its gradient is $L$-Lipschitz continuous with condition number $\kappa$, for all $\mathbf{s}_t$.

Although (as1) can be relaxed if ergodicity holds, it is typically adopted by stochastic resource allocation schemes for simplicity of exposition [13, 17, 30]. Under (as2), the objective function is non-decreasing and strongly convex, which is satisfied in practice with quadratic/exponential utility or cost functions [30]. The so-called Slater's condition in (as3) ensures the existence of a bounded Lagrange multiplier [26], which is necessary for the queue stability of (6); see e.g., [31, 17]. If (as3) cannot be satisfied, one should consider reducing the workload arrival rates at the MN side, or increasing the link and facility capacities at the DC side. The $L$-Lipschitz continuity of the dual gradient in (as4) directly follows from the strong convexity of the primal function in (as2), with $L=\rho(\mathbf{A}\mathbf{A}^\top)/\sigma$, where $\rho(\mathbf{A}\mathbf{A}^\top)$ is the spectral radius of the matrix $\mathbf{A}\mathbf{A}^\top$. The strong concavity in (as4) is frequently assumed in network optimization [9], and it is closely related to the local smoothness and the uniqueness of Lagrange multipliers assumed in [13, 17, 30]. For pessimistic cases, (as4) can be satisfied by subtracting an $\ell_2$-regularizer from the dual function (11), a technique typically used in machine learning applications (e.g., ridge regression). We quantify its sub-optimality in Appendix A, which implies that the resultant primal solution will be near-optimal and feasible for a suitably chosen regularization weight. Since we are after an $\epsilon$-optimal online solution, it suffices to choose the weight accordingly.

### IV-A Batch learning via offline SAGA based training

Consider that a training set $\mathcal{S}:=\{\mathbf{s}_n\}_{n=1}^N$ of $N$ historical state samples is available. Using $\mathcal{S}$, we can form an empirical version of (11) via sample averaging as

$$\hat{\mathcal{D}}_S(\boldsymbol{\lambda}):=\frac{1}{N}\sum_{n=1}^{N}\hat{\mathcal{D}}_n(\boldsymbol{\lambda})=\frac{1}{N}\sum_{n=1}^{N}\Big[\min_{\mathbf{x}_n\in\mathcal{X}}\mathcal{L}_n(\mathbf{x}_n,\boldsymbol{\lambda})\Big]. \tag{16}$$

Note that the time index $t$ has been replaced by the sample index $n$ to differentiate the training phase (based on historical data) from the operational (a.k.a. testing or tracking) phase. Consequently, the empirical dual problem can be expressed as

$$\max_{\boldsymbol{\lambda}\ge\mathbf{0}}\ \frac{1}{N}\sum_{n=1}^{N}\hat{\mathcal{D}}_n(\boldsymbol{\lambda}). \tag{17}$$

Recognizing that the objective is a sum of finite concave functions, the task in (17) in the machine learning parlance is termed empirical risk maximization (ERM) [16], which is carried out using the batch gradient ascent iteration

$$\boldsymbol{\lambda}_{k+1}=\Big[\boldsymbol{\lambda}_k+\frac{\eta}{N}\sum_{n=1}^{N}\nabla\hat{\mathcal{D}}_n(\boldsymbol{\lambda}_k)\Big]^+ \tag{18}$$

where the index $k$ represents the batch learning (iteration) index, and $\eta$ is the stepsize that controls the learning rate. While the iteration (18) exhibits a decent convergence rate, its computational complexity becomes prohibitively high as the data size $N$ grows large. A typical alternative is to employ a stochastic gradient (SG) iteration, which uniformly at random selects one of the summands in (18). However, such an SG iteration relies on only a single unbiased gradient correction, which leads to a sub-linear convergence rate. Hybrids of stochastic and batch gradient methods have recently become popular [19, 32]. (Stochastic iterations for the empirical dual problem are different from those in Section III-C, since stochasticity is introduced here by the randomized algorithm itself, as opposed to the stochasticity of future states in the online setting.)

Leveraging our special problem structure, we adapt the recently developed stochastic average gradient approach (SAGA) to fit our dual-space setup, with the goal of efficiently computing empirical Lagrange multipliers. Compared with the original SAGA, which was developed for unconstrained optimization [19], here we start from the constrained optimization problem (6), and first derive a projected form of SAGA.

Per iteration $k$, offline SAGA first evaluates at the current iterate $\boldsymbol{\lambda}_k$ one gradient sample $\nabla\hat{\mathcal{D}}_{\nu(k)}(\boldsymbol{\lambda}_k)$, with sample index $\nu(k)$ selected uniformly at random. Thus, the per-iteration computational complexity of SAGA is that of SG, and markedly lower than that of the batch gradient ascent (18), which requires $N$ such evaluations per iteration. Unlike SG however, SAGA stores a collection of the most recent gradients for all samples $n=1,\ldots,N$, with the auxiliary iteration index $k[n]$ denoting the most recent past iteration at which sample $n$ was randomly drawn; i.e., $k[\nu(k)]=k$. Specifically, SAGA's gradient linearly combines the gradient randomly selected at iteration $k$ with the stored ones to update the multipliers. The resultant gradient is the sum of the difference between the fresh gradient and the stored one for the same sample, plus the average of all gradients in memory, namely

$$\mathbf{g}_k(\boldsymbol{\lambda}_k)=\nabla\hat{\mathcal{D}}_{\nu(k)}(\boldsymbol{\lambda}_k)-\nabla\hat{\mathcal{D}}_{\nu(k)}(\boldsymbol{\lambda}_{k[\nu(k)]})+\frac{1}{N}\sum_{n=1}^{N}\nabla\hat{\mathcal{D}}_n(\boldsymbol{\lambda}_{k[n]}). \tag{19a}$$

Therefore, the update of the offline SAGA can be written as

$$\boldsymbol{\lambda}_{k+1}=\left[\boldsymbol{\lambda}_k+\eta\,\mathbf{g}_k(\boldsymbol{\lambda}_k)\right]^+ \tag{19b}$$

where $\eta$ denotes the stepsize. The steps of offline SAGA are summarized in Algorithm 1.
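As a sanity check of the projected iteration (19), the sketch below runs the update on a toy empirical dual of our own choosing, $\hat{\mathcal{D}}_n(\boldsymbol{\lambda})=-\tfrac{1}{2}\|\boldsymbol{\lambda}-\boldsymbol{\theta}_n\|^2$, whose gradients are trivial and whose constrained maximizer is the projection $[\bar{\boldsymbol{\theta}}]^+$ of the sample mean. The gradient table and its running average are maintained in $\mathcal{O}(1)$ per iteration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy empirical dual (illustrative): hat-D_n(lam) = -0.5*||lam - theta_n||^2,
# with gradient theta_n - lam; its maximizer over lam >= 0 is [mean(theta)]^+.
N, d = 50, 3
theta = rng.normal(0.5, 1.0, size=(N, d))

def grad(n, lam):
    return theta[n] - lam

lam = np.zeros(d)
table = theta - lam                    # stored gradients, all taken at lam_0 = 0
avg = table.mean(axis=0)
eta = 0.3                              # constant stepsize (here L = 1)

for k in range(8000):
    n = rng.integers(N)                # sample index nu(k), uniform over the batch
    g_new = grad(n, lam)
    # (19a): fresh minus stored gradient for sample n, plus the table average
    g = g_new - table[n] + avg
    avg = avg + (g_new - table[n]) / N # keep the average consistent in O(1)
    table[n] = g_new
    # (19b): projected ascent step
    lam = np.maximum(lam + eta * g, 0.0)
```

Despite touching a single sample per iteration, the iterate converges to the empirical maximizer at a linear rate, in line with Theorem 1 below.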

To expose the merits of SAGA, recognize first that since $\nu(k)$ is drawn uniformly at random from the set $\{1,\ldots,N\}$, we have $\mathbb{P}\{\nu(k)=n\}=1/N$, and thus the expectation of the corresponding gradient sample is given by

$$\mathbb{E}\big[\nabla\hat{\mathcal{D}}_{\nu(k)}(\boldsymbol{\lambda}_k)\big]=\sum_{n=1}^{N}\mathbb{P}\{\nu(k)=n\}\nabla\hat{\mathcal{D}}_n(\boldsymbol{\lambda}_k)=\frac{1}{N}\sum_{n=1}^{N}\nabla\hat{\mathcal{D}}_n(\boldsymbol{\lambda}_k). \tag{20}$$

Hence, $\nabla\hat{\mathcal{D}}_{\nu(k)}(\boldsymbol{\lambda}_k)$ is an unbiased estimator of the empirical gradient in (18). Likewise, the last two terms in (19a) cancel in the mean w.r.t. $\nu(k)$; thus SAGA's stochastic averaging gradient estimator $\mathbf{g}_k(\boldsymbol{\lambda}_k)$ is unbiased, as is the case with SG, which only employs $\nabla\hat{\mathcal{D}}_{\nu(k)}(\boldsymbol{\lambda}_k)$.

With regards to variance, SG's gradient estimator has non-vanishing variance, which can be scaled down using decreasing stepsizes (e.g., $\eta_k=\mathcal{O}(1/k)$) to effect convergence of the iterates in the mean-square sense [27]. As can be seen from the correction term in (19a), SAGA's gradient estimator has lower variance. Indeed, the sum term in (19a) is deterministic, and thus has no effect on the variance. However, representing gradients of the same drawn sample $\nu(k)$, the first two terms are highly correlated, and their difference has considerably smaller variance. More importantly, the variance of the stochastic gradient approximation vanishes as $\boldsymbol{\lambda}_k$ approaches the optimal argument of (17); see e.g., [19]. This is in contrast to SG, where the variance of the stochastic approximation remains even when the iterates are close to the optimal solution. This variance reduction property allows SAGA to achieve linear convergence with a constant stepsize, which is not achievable for the SG method.

In the following theorem, we use the result in [19] to show that the offline SAGA method is linearly convergent.

###### Theorem 1.

Consider the offline SAGA iteration in (19), and assume that the conditions in (as2)-(as4) are satisfied. If $\boldsymbol{\lambda}^*_S$ denotes the unique optimal argument of (17), and the stepsize is chosen as $\eta=1/(3L)$ with the Lipschitz constant $L$ as in (as4), then the SAGA iterates initialized with $\boldsymbol{\lambda}_0$ satisfy

$$\mathbb{E}_\nu\|\boldsymbol{\lambda}_k-\boldsymbol{\lambda}^*_S\|^2\le(\Gamma_N)^k\,C_S(\boldsymbol{\lambda}_0) \tag{21a}$$

where $\Gamma_N:=1-\min\big(\tfrac{1}{4N},\tfrac{1}{3\kappa}\big)$ with $\kappa$ denoting the condition number in (as4), the expectation $\mathbb{E}_\nu$ is over all choices of the sample index $\nu(k)$ up to iteration $k$, and the constant $C_S(\boldsymbol{\lambda}_0)$ is

$$C_S(\boldsymbol{\lambda}_0):=\|\boldsymbol{\lambda}_0-\boldsymbol{\lambda}^*_S\|^2-\frac{2N}{3L}\Big[\hat{\mathcal{D}}_S(\boldsymbol{\lambda}_0)-\hat{\mathcal{D}}_S(\boldsymbol{\lambda}^*_S)-\big\langle\nabla\hat{\mathcal{D}}_S(\boldsymbol{\lambda}^*_S),\boldsymbol{\lambda}_0-\boldsymbol{\lambda}^*_S\big\rangle\Big]. \tag{21b}$$
###### Proof.

See Appendix B. ∎

Since $\Gamma_N<1$, Theorem 1 asserts that the sequence generated by SAGA converges exponentially to the empirical optimum in mean-square. Similar to (as4), if the summands $\hat{\mathcal{D}}_n$ in (16) are $L$-smooth, then so is $\hat{\mathcal{D}}_S$, and the sub-optimality gap induced by $\boldsymbol{\lambda}_k$ can be bounded by

$$\mathbb{E}_\nu\big[\hat{\mathcal{D}}_S(\boldsymbol{\lambda}^*_S)-\hat{\mathcal{D}}_S(\boldsymbol{\lambda}_k)\big]\le L\,\mathbb{E}_\nu\|\boldsymbol{\lambda}_k-\boldsymbol{\lambda}^*_S\|^2\le(\Gamma_N)^k\,L\,C_S(\boldsymbol{\lambda}_0). \tag{22}$$

Remarkably, our offline SAGA based on training data obtains the order-optimal convergence rate among all first-order approaches at the cost of only a single gradient evaluation per iteration. We illustrate the convergence of offline SAGA along with two SG variants in Fig. 2 for a simple example. As observed, SAGA converges to the optimal argument linearly, while SG with a diminishing stepsize exhibits a sublinear convergence rate, and SG with a constant stepsize converges only to a neighborhood of the optimal solution.

###### Remark 1.

The stepsize in Theorem 1 requires knowing the Lipschitz constant of (16). In our context, it is given by $L=\rho(\mathbf{A}\mathbf{A}^\top)/\sigma$, with $\rho(\mathbf{A}\mathbf{A}^\top)$ denoting the spectral radius of the matrix $\mathbf{A}\mathbf{A}^\top$, and $\sigma$ defined in (as2). When $L$ cannot be accurately approximated in practice, it is worth mentioning that the linear convergence rate of offline SAGA in Theorem 1 can still be established under a wider range of stepsizes, with slightly different constants $\Gamma_N$ and $C_S$ [19, Theorem 1].

### Iv-B Learn-and-adapt via online SAGA

Offline SAGA is clearly suited for learning from (even massive) training datasets available in batch format, and it is tempting to carry over our training-based iterations to the operational phase as well. However, different from common machine learning tasks, the training-to-testing procedure is no longer applicable here, since the online algorithm must also track system dynamics to ensure the stability of queue lengths in our network setup. Motivated by this observation, we introduce a novel data-driven approach called online SAGA, which incorporates the benefits of batch training to mitigate online stochasticity, while preserving the adaptation capability.

Unlike the offline SAGA iteration that relies on a fixed training set, here we deal with a dynamic learning task where the training set grows in each time slot $t$. Running sufficiently many SAGA iterations for each empirical dual problem per slot could be computationally expensive. Inspired by the idea of dynamic SAGA with a fixed training set [20], the crux of our online SAGA in the operational phase is to learn from the dynamically growing training set using only a few SAGA iterations per slot, which yields a reasonably accurate solution at affordable complexity.

To start, it is useful to recall a learning performance metric that uniformly bounds the difference between the empirical loss in (16) and the ensemble loss in (11) with high probability (w.h.p.), namely [16]

$$\sup_{\boldsymbol{\lambda}\ge\mathbf{0}}\big|\mathcal{D}(\boldsymbol{\lambda})-\hat{\mathcal{D}}_S(\boldsymbol{\lambda})\big|\le H_s(N) \tag{23}$$

where $H_s(N)$ denotes a bound on the statistical error induced by the finite size $N$ of the training set $\mathcal{S}$. Under proper (so-termed mixing) conditions, the law of large numbers guarantees that $H_s(N)$ is in the order of $\mathcal{O}(1/\sqrt{N})$, or decays faster for specific function classes [16, Section 3.4]. That the statement in (23) holds w.h.p. means that there exists a constant $\delta$ such that (23) holds with probability at least $1-\delta$. In such a case, the statistical accuracy $H_s(N)$ depends on $\delta$, but we keep that dependency implicit to simplify notation [16, 33].

Let $\boldsymbol{\lambda}_{K_N}$ be an approximately optimal solution obtained by offline SAGA over the $N$ training samples, whose optimization error $H_o(K_N)$ emerges due to, e.g., running finitely many iterations, say $K$ on average per sample; that is, $\hat{\mathcal{D}}_S(\boldsymbol{\lambda}^*_S)-\hat{\mathcal{D}}_S(\boldsymbol{\lambda}_{K_N})\le H_o(K_N)$, where $\boldsymbol{\lambda}^*_S$ is the optimal argument in (17). Clearly, the overall learning error, defined as the difference between the ensemble loss at $\boldsymbol{\lambda}^*$ and the empirical loss at $\boldsymbol{\lambda}_{K_N}$, is bounded above w.h.p. as

$$\begin{aligned}\mathcal{D}(\boldsymbol{\lambda}^*)-\hat{\mathcal{D}}_S(\boldsymbol{\lambda}_{K_N})&=\mathcal{D}(\boldsymbol{\lambda}^*)-\hat{\mathcal{D}}_S(\boldsymbol{\lambda}^*)+\hat{\mathcal{D}}_S(\boldsymbol{\lambda}^*)-\hat{\mathcal{D}}_S(\boldsymbol{\lambda}_{K_N})\\ &\le \mathcal{D}(\boldsymbol{\lambda}^*)-\hat{\mathcal{D}}_S(\boldsymbol{\lambda}^*)+\hat{\mathcal{D}}_S(\boldsymbol{\lambda}^*_S)-\hat{\mathcal{D}}_S(\boldsymbol{\lambda}_{K_N})\\ &\le H_s(N)+H_o(K_N)\end{aligned} \tag{24}$$

where the bound superimposes the optimization error on the statistical error. If $N$ is relatively small, $H_s(N)$ will be large, and keeping the per-sample iterations $K$ small is reasonable for reducing complexity, but also because $H_o(K_N)$ stays comparable to $H_s(N)$. With the dynamically growing training set in the operational phase, our online SAGA will aim at a "sweet spot" between affordable complexity (controlled by $K$), and a desirable overall learning error in which the optimization error remains comparable to the statistical error.

The proposed online SAGA method consists of two complementary stages: offline learning, and online learning and adaptation. In the offline training-based learning stage, it runs the SAGA iteration (19) (cf. Algorithm 1) on a batch dataset with $N$ historical samples, using on average $K$ iterations per sample. In the online operational phase, initialized with the offline output $\lambda_{K_N}$,444To differentiate offline and online iterations, let $\lambda_k$ denote the $k$-th iterate in the offline learning, and $\lambda_t$ denote the iterate during slot $t$. online SAGA continues the learning process by storing a "hot" gradient collection as in Algorithm 1, and it also adapts to the queue length, which plays the role of an instantaneous multiplier estimate analogous to that in (14). Specifically, online SAGA keeps acquiring data and growing the training set, based on which it learns online by running a few iterations of (19) per slot $t$; the last iterate of slot $t$ initializes the first one of slot $t+1$. The learned multiplier $\lambda_t$ captures a priori information. To account for instantaneous state information as well as actual constraint violations, which are valuable for adaptation, online SAGA superimposes to $\lambda_t$ the queue length (multiplier estimate) $q_t$. To avoid possible bias in the steady state, this superposition is bias-corrected, leading to an effective multiplier estimate (with shorthand notation $\gamma_t$)

$$\underbrace{\gamma_t}_{\text{effective multiplier}}=\underbrace{\lambda_t}_{\text{statistical learning}}+\underbrace{\mu q_t-b}_{\text{stochastic approx.}} \tag{25}$$

where the scalar $\mu>0$ tunes the emphasis on past versus present state information, and the value of the constant $b$ will be specified in Theorem 4. Based on the effective dual variable, the primal variables are obtained as in (13) with $\lambda_t$ replaced by $\gamma_t$; or, for general constrained optimization problems, as

$$x_t(\gamma_t):=\arg\min_{x_t\in\mathcal{X}}\,\mathcal{L}_t(x_t,\gamma_t). \tag{26}$$

The necessity of combining statistical learning and stochastic approximation in (25) can be further understood from the feasibility requirements of the original problem (6), e.g., constraints (6c) and (6d). Note that in the adaptation step of online SAGA, the queue variable $q_t$ is updated according to (6c), and the primal variable is obtained as the minimizer of the Lagrangian using the effective multiplier $\gamma_t$, which accounts for the queue variable via (25). If $q_t$ were to approach infinity, $\gamma_t$ would become large, and therefore the primal allocation obtained by minimizing the online Lagrangian (26) with $\gamma_t$ would make the penalty term negative, thus decreasing $q_t$ through the recursion in (6c). This mechanism ensures that constraint (6d) will not be violated.555Without (25), for any finite time $t$, $\lambda_t$ is always near-optimal for the ensemble problem (12), and the primal variable in turn is only near-feasible w.r.t. (9b), which is necessary for (6d). Since $q_t$ essentially accumulates online constraint violations of (9b), it would grow linearly with $t$ and eventually become unbounded. The online SAGA is listed in Algorithm 2.
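
The operational phase described above can be sketched end to end on a toy single-queue allocation problem. Everything below is a hypothetical illustration, not the paper's Algorithm 2: the quadratic allocation cost, the uniform arrivals, and the constants `mu`, `b`, `eta`, `p`, `xmax` are placeholders (the paper sets $b$ in Theorem 4):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, b, eta, p, xmax = 0.05, 0.0, 0.01, 0.5, 4.0  # illustrative placeholders

def primal(gamma):
    # x(gamma) = argmin_{0 <= x <= xmax} p*x**2 + gamma*(a - x), cf. (26);
    # the arrival a only shifts the objective, so x(gamma) = gamma / (2p).
    return float(np.clip(gamma / (2 * p), 0.0, xmax))

samples = list(rng.uniform(0.5, 1.5, size=50))  # offline batch of arrivals
grad_tab = [0.0] * len(samples)                 # "hot" SAGA gradient table
grad_sum = 0.0                                  # running sum of table entries
lam, q = 0.2, 0.0                               # offline output; empty queue

for t in range(2000):
    a_t = float(rng.uniform(0.5, 1.5))  # state revealed at slot t
    gamma = lam + mu * q - b            # effective multiplier, cf. (25)
    x_t = primal(gamma)                 # online allocation, cf. (26)
    q = max(0.0, q + a_t - x_t)         # queue recursion, cf. (6c)

    # Grow the training set, then take one SAGA step on a random sample.
    samples.append(a_t)
    grad_tab.append(0.0)
    i = int(rng.integers(len(samples)))
    g_new = samples[i] - primal(lam)    # dual (sub)gradient at sample i
    lam = max(0.0, lam + eta * (g_new - grad_tab[i] + grad_sum / len(samples)))
    grad_sum += g_new - grad_tab[i]
    grad_tab[i] = g_new

print(round(lam, 2), round(q, 2))
```

In this toy run the learned multiplier drifts toward the value matching the mean arrival rate while the queue stays bounded: the $\mu q_t$ term in $\gamma_t$ supplies the negative drift whenever the queue builds up, exactly the feasibility mechanism discussed above.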

###### Remark 2 (Connection with existing algorithms).

Online SAGA has similarities and differences with common schemes in machine learning and stochastic network optimization.

(P1) SDG in [24, 6] can be viewed as a special case of online SAGA that purely relies on stochastic estimates of the multipliers, namely instantaneous queue lengths, meaning that the statistical learning component $\lambda_t$ in (25) is absent. In contrast, online SAGA further learns by stochastic averaging over (possibly big) data to improve the convergence of SDG.
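
As a small (hypothetical) illustration of this reduction, setting the learned component to zero in (25) leaves only the scaled queue term that plain SDG uses; the values of `mu` and `b` are placeholders:

```python
def effective_multiplier(lam, q, mu=0.05, b=0.0):
    # Eq. (25) with illustrative placeholders mu and b (b is set in Theorem 4).
    return lam + mu * q - b

# Plain SDG: no learned multiplier, the queue alone drives the allocation.
sdg_gamma = effective_multiplier(lam=0.0, q=12.0)
# Online SAGA: the learned lam shifts gamma toward its steady-state value,
# so the queue need not grow as large to reach the same operating point.
saga_gamma = effective_multiplier(lam=0.9, q=12.0)
print(round(sdg_gamma, 3), round(saga_gamma, 3))  # 0.6 1.5
```

The same effective multiplier of 1.5 would require a queue length of 30 under pure SDG, which is one way to see why learning $\lambda_t$ reduces delay.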

(P2) Several schemes have been developed for statistical learning via large-scale optimization, including SG, SAG [32], SAGA [19], dynaSAGA [20], and AdaNewton [34]. However, these are generally tailored for static large-scale optimization, and are not suitable for dynamic resource allocation, because they neither track constraint violations nor adapt to state changes, which is critical to ensure the queue stability in (6d).

(P3) While online SAGA shares the learning-while-adapting attribute with the methods in [17] and [18], those approaches generally require histogram-based pdf estimation, and solve the resultant empirical dual problem exactly per slot. However, histogram-based estimation is not tractable for our high-dimensional continuous random vector; and when the dual problem cannot be written in an explicit form, solving it exactly per iteration is computationally expensive.

###### Remark 3 (Computational complexity).

Since each row of the constraint matrix has only a few non-zero entries, the dual update in (14) has computational complexity that scales linearly with the dimension of the dual variable. The primal update in (15), in general, requires solving a convex program whose worst-case complexity is polynomial in the primal dimension using a standard interior-point solver. Overall, the per-slot computational complexity of the Lyapunov optimization or SDG in [21, 24] is dominated by this primal update.

Online SAGA has complexity comparable with that of SDG during the online adaptation phase, where the effective multiplier update, the primal update, and the queue update also incur a comparable worst-case complexity in total. Although the online learning phase involves additional updates of the empirical multiplier through (19), we only need to compute one gradient at a randomly selected sample