Distributed Stochastic Optimization in Networks with Low Informational Exchange

by   Wenjie Li, et al.

We consider a distributed stochastic optimization problem in networks with finite number of nodes. Each node adjusts its action to optimize the global utility of the network, which is defined as the sum of local utilities of all nodes. Gradient descent method is a common technique to solve the optimization problem, while the computation of the gradient may require much information exchange. In this paper, we consider that each node can only have a noisy numerical observation of its local utility, of which the closed-form expression is not available. This assumption is quite realistic, especially when the system is too complicated or constantly changing. Nodes may exchange the observation of their local utilities to estimate the global utility at each timeslot. We propose stochastic perturbation based distributed algorithms under the assumptions whether each node has collected local utilities of all or only part of the other nodes. We use tools from stochastic approximation to prove that both algorithms converge to the optimum. The convergence rate of the algorithms is also derived. Although the proposed algorithms can be applied to general optimization problems, we perform simulations considering power control in wireless networks and present numerical results to corroborate our claim.



There are no comments yet.


page 1

page 2

page 3

page 4


Distributed Derivative-free Learning Method for Stochastic Optimization over a Network with Sparse Activity

This paper addresses a distributed optimization problem in a communicati...

Distributed Zeroth-Order Stochastic Optimization in Time-varying Networks

We consider a distributed convex optimization problem in a network which...

Distributed Embodied Evolution in Networks of Agents

In most network problems, the optimum behaviors of agents in the network...

Stochastic Optimization from Distributed, Streaming Data in Rate-limited Networks

Motivated by machine learning applications in networks of sensors, inter...

Convergence Analysis of Nonconvex Distributed Stochastic Zeroth-order Coordinate Method

This paper investigates the stochastic distributed nonconvex optimizatio...

Optimizing Waiting Thresholds Within A State Machine

Azure (the cloud service provided by Microsoft) is composed of physical ...

Locally Convex Sparse Learning over Networks

We consider a distributed learning setup where a sparse signal is estima...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Distributed optimization is a fundamental problem in networks, which helps to improve the performance of the system by maximizing some predefined objective function. Significant amount of work have been done to solve the optimization problems in various applications. For example, in power control [2, 3, 4] and beamforming allocation [5, 6, 7, 8] problems, transmitters need to control its transmission power or beamforming in a smart manner, in order to maximize some performance metric of the wireless communication systems, such as throughput or energy efficiency. In medium access control problem [9]

, users set their individual channel access probability to maximize their benefit. In wireless sensor networks, sensor nodes collect information to serve a fusion center, an interesting problem is to make each node independently decide the quality of its report to maximize the average quality of information gathered by the fusion center subject to some power constraint

[10, 11], note that a higher level of quality requires higher power consumption.

This paper considers an optimization problem in a distributed network where each node adjusts its own action to maximize the global utility of the system, which is also perturbed by a stochastic process, e.g., wireless channels. The global utility is the sum of the local utilities of all nodes of the network. Gradient descent method is the most common technique to deal with optimization problems. In many scenarios in practice, however, the computation of gradient may require too much information exchange between the nodes, examples are provided in Section IV. Furthermore, there are other contexts also where the utility function of each node does not have a closed form expression or the expression is very complex which makes it very hard to use in the optimization, e.g., computation of the derivatives is very complicated or not possible. In this paper, we consider therefore that a node only has a noisy numerical observation of its utility function, which is quite realistic when the system is complex and time-varying. The nodes can only exchange the observation of their local utilities so that each node can have the knowledge of the whole network. However, a node may not receive all the local utilities of the other nodes due to the network topology or other practical issues, e.g., it is not possible to exchange much signaling information. In this situation, a node has to approximate the global utility with only incomplete information of local utilities. We have also taken in account such issue in this paper.

In summary, our problem is quite challenging due to the following reasons: i) each node has only a numerical observation of its local utility at each time; ii) each node may have incomplete information of the global utility of the network; iii) the action of each node has an impact on the utilities of the other nodes in the network; iv) the utility of each node is also influenced by some stochastic process (e.g., time varying channels in wireless networks) and the objective function is the average global utility.

In this paper, we develop novel distributed algorithms to optimize the global average utility of a network, where the nodes can only exchange the numerical observation of their local utility. Different versions of algorithms are proposed depending on: i) whether the value of action is constrained or unconstrained; ii) whether each node has the full knowledge of local utilities of all the other nodes or only a part of the utilities of other nodes. We have proved the convergence of the algorithms in all situations, using stochastic approximation tools. The convergence rate of different algorithms are also derived, in order to show the convergence speed of the proposed algorithms to the optimum from a quantitative point of view and deeply investigate the impact of the parameters introduced by the algorithm. Our theoretical results are further justified by simulations.

Some preliminary results of our work have been presented in [1]. This extended version provides the complete proof of all the results. Moreover, the constrained optimization problem and the analysis of the convergence rate in this paper are not considered at all in [1]. As we will see in Section VII, the derivation of the convergence rate is especially challenging.

The rest of the paper is organized as follows. Section II discusses some related work and highlights our main contribution. Section III describes the system model as well as some basic assumptions. Section IV shows motivating examples to explain the interest of our problem. Section V develops the initial version of our distributed optimization algorithm using stochastic perturbation (DOSP) and shows its convergence. Section VI presents the first variant of the DOSP algorithm to deal with the situation where a node has incomplete information of the global utility of the network. Section V-D proposes the second variant of the DOSP algorithm to solve the constrained optimization problem. Section VII focuses on the analysis of the convergence rate of the proposed algorithms. Section VIII shows some numerical results as well as a comparison with an alternative algorithm and Section IX concludes this paper.

Ii Related Work

Most of the prior work in the area of optimization consider that the objective function has a well known and simple closed form expression. Under this assumption, the optimization problem can be performed using gradient ascent or descent method [12]. This method can achieve a local optimum or global optimum in some special cases (e.g. concavity of the utility, etc.) of the optimization problem. A distributed asynchronous stochastic gradient optimization algorithms is presented in [13]. Incremental sub-gradient methods for non-differentialable optimization are discussed in [14]. Interested readers are referred to a survey by Bertsekas [15] on incremental gradient, sub-gradient, and proximal methods for convex optimization. The use of gradient-based method supposes in advance that the gradient can be computed or is available at each node, which is not always possible as this would require too much information exchanges. In our case, the computation of the gradient is not possible at each node since only limited control information can be exchanged in the network. This problem is known as derivative-free optimization, see [16] and the references therein. Our goal is then to develop an algorithm that requires only the knowledge of a numerical observation of the utility function. The obtained algorithm should be distributed.

Distributed optimization has also been studied in the literature using game theoretic tools. However, most of the existing work assume that a closed form expression of the payoff is available. One can refer to [17, 18] and the references therein for more details, while we do not consider non-cooperative games in this paper.

Stochastic approximation (SA) [19, 20] is an efficient method to solve the optimization problems in noisy environment. Typically, the action is updated as follows


where represents an estimation of the gradient . An important assumption is that the estimation error

is seen as a zero-mean random vector with finite variance, for example, see

[21]. If the step-size is properly chosen, then can tend to its optimum point asymptotically. The challenge of our work is how to propose such estimation of the gradient only with the noisy numerical observation of the utilities.

Most of the previous work related to derivative-free optimization consider a control center that updates the entire action vector during the algorithm, see [16] for more details. However, in our distributed setting, each node is only able to update its own action. Nevertheless, a stochastic approximation method using the simultaneous perturbation gradient approximation (SPGA) [22] can be an option to solve our distributed derivative-free optimization problem. The SPGA algorithm was initially proposed to accelerate the convergence speed of the centralized multi-variate optimization problem with deterministic objective function. Two measurements of the objective function are needed per update of the action. The approximation of the partial derivative of an element  is given by


where is vanishing and with each element zero mean and i.i.d. Two successive measurements of the objective function are required to perform a single estimation of the gradient. The interest of the SPGA method is that each variable can be updated simultaneously and independently. Spall has also proposed an one-measurement version of the SPGA algorithm in [23] with


Such algorithm also leads to converge, while with a decreased speed compared with the two-measurement SPGA. An essential result is that the estimation of gradient using (2) or (3) is unbiased if is vanishing, as long as the objective function is static. However, if the objective function is stochastic and its observation is noisy, there would be an additional term of stochastic noise in the numerator of (2) and (3), which may seriously affect the performance of approximation when the value is too small. As a consequence, the SPGA algorithm cannot be used to solve our stochastic optimization problem.

The authors in [24] proposed a fully distributed Nash equilibrium seeking algorithm which requires only a measurement of the numerical value of the static utility function. Their scheme is based on deterministic sine perturbation of the payoff function in continuous time. In [25], the authors extended the work in [24] to the case of discrete time and stochastic state-dependent utility functions, convergence to a close region of Nash equilibrium has been proved. However, in a distributed setting, it is challenging to ensure that the sine perturbation of different nodes satisfy the orthogonality requirement, especially when the number of nodes is large. Moreover, the continuous sine perturbation based algorithm converges slowly in a discrete-time system. Stochastic perturbation based algorithm has been proposed in [26] to solve an optimization problem, the algorithm is given by


with the zero-mean stochastic perturbation. The behavior of (4) has been analyzed in [26], however, under the assumption that the objective function is static and quadratic. Our proposed algorithm is different from (4) as we use the random perturbation with vanishing amplitude. In addition, the objective function is stochastic with non-specified form in our setting, which is much more challenging. Furthermore, we consider a situation where nodes have to exchange their local utilities to estimate the global utility and each node may have incomplete information of the local utilities of other nodes.

Iii System Model

This section presents the problem formulation as well as the basic assumptions. Throughout this paper, matrices and vectors are in boldface upper-case letters and in in bold-face lower-case letters respectively. Calligraphic font denotes set. denotes the Euclidean norm of any vector . In order to lighten the notations, we use

and in the rest of the paper.

Iii-a Problem formulation

Consider a network consisting of a finite set of nodes . Each node  is able to control its own action at each discrete timeslot , in order to maximize the performance of the network. Introduce the action vector which contains the action of all nodes at timeslot . Let denote the feasible action set of node , i.e., . Introduce . In general, the performance of a network is not only determined by the action of nodes, but also affected by the environment state, e.g., channels. We assume that the environment state of the entire network at any timeslot  is described by a matrix , which is considered as an i.i.d. ergodic stochastic process.

For any realization of and , we are interested in the global utility of the network, which is defined as the sum of the local utility function of each node , i.e.,

The network performance is then characterized by the average global utility

In this work, we consider a challenging setting that nodes do not have the knowledge of nor the closed-form expression of the utility functions. Each node  only has a numerical estimation of its local utility at each timeslot. Assume that


with some additive noise. Nodes can communicate with each other so that each node  can get a approximate value of the global utility .

As a summary, our aim is to propose some distributed solution of the following problem


An application example is introduced in Section IV to highlight the interest of this problem.

Iii-B Basic assumptions

We present the basic assumptions considered in this paper in order to guarantee the performance of our proposed algorithm.

Denote as the solution of the problem (6). Since the existence of can be ensured by the concavity of the objective function , we have the following assumption.

  • A1. (properties of objective function) Both and exist continuously. There exists such that and , . The objective function is strictly concave, i.e.,


    Besides, for any and , there exists a constant such that


We have some further assumptions on the local utility functions, which will be useful in our analysis.

  • A2. (properties of local utility function) For any , the function is Lipschitz with Lipschitz constant , i.e.,


    Besides, , , so that is also bounded.

From A2, we can easily deduce that is also Lipschitz with Lipschitz constant , since


In the end, we consider a common assumption on the noise term introduced in (5).

  • A3. (properties of additive noise) The noise is zero-mean, uncorrelated, and has bounded variance, i.e., for any , we have , , and if .

Iv Motivating examples

Figure 1: (I) A D2D network with transmitter-receiver pairs; (II) Downlink multi-cell network, each eNB (or base station) may serve multiple UEs (users).

This section provides some examples and shows the limit of the classical gradient-based methods.

We consider the power allocation problem in a general network with transmitter-receiver links. A link here can be seen as a node in our system model as presented in Section III. For example,

  • in a D2D network with transmitter-receiver pairs, each transmitter communicates with its associated receiver and different links interfere among each other, see Figure 1 (I).

  • in a multi-cell network, each base station may serve multiple users. We focus on the downlink, i.e., a user is seen as its receiver and its associated base station is seen as a transmitter, see Figure 1 (II).

In both models, the action is in fact the transmission power of transmitter  at timeslot , of which the value cannot exceed the maximum transmission power . Thus we have , . The environment state matrix represents the time-varying channel state of the network. More precisely, , in which each element denotes the the channel gain between transmitter and receiver at timeslot .

The utility function can be various depending on different applications, e.g., throughput and energy efficiency [27].

In our first example, the local utility of each node  is given by


where , represents the energy cost of transmission, and denotes the bit rate given by


Note that the maximization of -function of the bit rate is of type proportional fairness, which is used to ensure fairness among the nodes in the network. With the above notations, the global utility function can be written as


It is straightforward to show that is concave with respect to , hence the existence of the optimum can be guaranteed.

In order to maximize the average global utility , one may apply the classical stochastic approximation method described by (1). An essential step is that, each transmitter or receiver should be able to calculate the partial derivative, i.e.,


From (14), we find that the direct calculation of the partial derivative is complicated and nodes should exchange much information: Each node  should know the values of , the cross-channel gain , as well as of all nodes . Moreover, in the situation where the channel is time-varying, it is not realistic for any receiver  to estimate the direct-channel gain and all the cross-channel gain , .

We consider the sum-rate maximization problem as a second example, i.e., to maximize . The challenge come from the fact that is not concave with respect to if the rate is given by (12). For this reason, we have to consider the approximation of and some variable change to make the objective function concave, which is a well known problem [28]. It is common to use change of variable (i.e., consider as the transmission power instead of ) and consider the approximation [28], so that the global utility function is written as


It is straightforward to show the concavity of . Similar to (14), we evaluate


of which the calculation also requires much information, such as the cross-channel gain , and all the interference estimated by each receiver. All the channel information has to be estimated and exchanged by each active node, which is a huge burden for the network.

The two examples clearly shows the limit of the classical methods and motivates us to propose some novel solution where low informational exchange are required. In this paper, we consider that the nodes can only exchange their numerical approximation of local utilities. It is worth mentioning that the information exchange in our setting is much less than the gradient based method. We present our distributed optimization algorithms as well as their performance, in the situations where each node has the complete or incomplete knowledge of the local utilities of the other nodes, respectively. It is worth mentioning that, apart from the examples presented in this section, the solution proposed in this paper can also be applied to other type of problems such beamforming, coordinated multipoint (CoMP) [29], and so on.

V Distributed optimization algorithm using stochastic perturbation

This section presents a first version of our distributed optimization algorithm. We assume that each node is always able to collect the numerical estimation of local utilities from all the other nodes. In this situation, each node can evaluate the numerical value of the global utility function at each iteration (or timeslot)  by applying


since each node knows the complete information of for any . We first consider an unconstrained optimization problem, i.e., , in Section V-A. The constrained optimization problem with , is then presented in Section V-D.

V-a Algorithm

The distributed optimization algorithm using stochastic perturbation (DOSP) is presented in Algorithm 1.

  1. Initialize and set the action randomly.

  2. Generate a random variable

    , perform action .

  3. Estimate , exchange its value with the other nodes and calculate

  4. Update according to equation (18).

  5. , go to 2.

Algorithm 1 DOSP Algorithm for each node 

At each iteration , an arbitrary reference node  updates its action by applying


in which and are vanishing step-sizes, is randomly generated by each node  and . Recall that the approximation of the global utility is calculated by each node using (17), of which the value depends on the actual action performed by each node and the environment state matrix . Note that is very close to when is large as is vanishing. An example is presented in Section V-B to describe in detail the application of Algorithm 1 in practice.

Obviously, (18) can be written in the general form (1), in which


As we will discuss in Section V-C,

can be an asymptotically unbiased estimation of the partial derivative

and can converge to , under the condition that the parameters , , and are properly chosen. We state in what follows the desirable properties of these parameters.

  • A4. (properties of step-sizes) Both and take real positive values with , besides,

  • A5. (properties of random perturbation) The elements of are i.i.d. with , . There exist and such that

The conditions on the parameters can be easily achieved. We show in Example 1 a common setting of these parameters, which are also used to obtain the simulation results to be presented in Section VIII.

Example 1.

An easiest choice of the probability distribution of

is the symmetrical Bernoulli distribution with

and , . We can verify the conditions in A5 with .

Let and with the constants , so that both and are vanishing. Since converges if ; diverges if . Clearly, there exist pairs of and to make and satisfy the conditions in A4.

Remark 1.

The proposed algorithm has similar shape compared with the other existed methods proposed in [26, 24, 25]. The difference between our solution and the sine perturbation based method [24, 25] is that, we use a random vector instead of some deterministic sine functions as the perturbation term. Comparing (4) and (18), we can see that the amplitude of random perturbation is vanishing in our algorithm, which is not the case in the algorithm presented in [26].

V-B Application example

This section presents the application of Algorithm 1 to perform resource allocation in a D2D network, in order to highlight the interest of our solution. We focus on the requirement arisen by Algorithm 1, in terms of computation, memory and informational exchange.

Figure 2: (I) At iteration , each transmitter transmits to its associated receiver with transmission power ; (II) Each receiver sends the approximate local utility to its transmitter, every transmitter can receive and decode this feedback information.

Figure 2 briefly shows the algorithm procedure during one iteration. Recall that denotes the actual value of the action set by transmitter  at iteration . In order to update , each transmitter  needs to

  • independently and randomly generate a scalar under the condition A5;

  • use a pre-defined vanishing sequence which is common for each link;

  • independently update by applying (18), more details will be provided soon.

Then is given by .

Each transmitter  transmits to its associated receiver with the transmission power of value . The associated receiver  should be able to approximate the numerical value of its local utility and send this value to transmitter  as a feedback. All the transmitters can evaluate using (17) under the assumption that every transmitter is able to receive and decode the feedback from all the receivers.

Then iteration  starts, each transmitter  needs to update its power allocation strategy, what is required is listed as follows:

  • use a pre-defined vanishing sequence ;

  • reuse of the local random value , which was generated by transmitter  at iteration . It means that the value of should be saved temporarily

  • use the numerical value of , which has been explained already.

Each transmitter can then update independently using (18).

In the following step, each transmitter updates in the same way as the previous iteration, i.e., with a newly generated pseudo-random value.

We can see that Algorithm 1 can be easily applied in a network of multiple links: the algorithm itself has low complexity and each receiver only needs to feedback one quantity () per iteration to perform the algorithm.

V-C Convergence results

This section investigates the asymptotic behavior of Algorithm 1. For any integer , we consider the divergence


to describe the difference between the actual action and the optimal action . Our aim is to show that almost surely as .

In order to explain the reason behind the update rule (18), we rewrite it in the generalized Robbins-Monro form [20], i.e.,


where represents the expected value of with respect to all the stochastic terms (including , , and ) for a given , i.e.,


note that we prefer to highlight the stochastic terms in (22), an alternative way is to write (22) as ; denotes the estimation bias between and the actual gradient of the average objective function , i.e.,


and can be seen as the stochastic noise, which is the difference between the value of a single realization of and its average as defined in (22), i.e.,

Remark 2.

The analysis presented in this work is challenging and different from the existed results. An explicit difference comes from the unique feature of the algorithm itself as discussed in Remark 1: we are using a different method to estimate the gradient. Besides, the objective function is stochastic with general form in our problem, while it is considered as static in [22, 23] and it is assumed to be static and quadratic in [26].

To perform the analysis of convergence, we have to investigate the properties of and .

Lemma 1.

If all the conditions in A1-A5 are satisfied, then


which implies that as .


See Appendix -A. ∎

Lemma 1 implies that defined in Algorithm 1 is a reasonable estimator of the gradient with vanishing bias.

Lemma 2.

If all the conditions in A1-A5 are satisfied and almost surely, then for any constant , we have


See Appendix -B. ∎

Based on the results in Lemma 1 and in Lemma 2, we can build conditions under which almost surely.

Theorem 1.

If all the conditions in A1-A5 are satisfied and almost surely, then as almost surely by applying Algorithm 1.


See Appendix -C. ∎

V-D Distributed optimization under constraints

In this section, we consider the constrained optimization problem, in which the action of each node takes value from an interval, i.e., , . We assume that is not on the boundary of the feasible set , i.e., , . For example, in the power allocation problem, presents the transmission power of transmitter , which should be positive and not larger than a maximum value.

Recall that the actually performed action by nodes is . At each iteration, we need to ensure that . A direct solution is to introduce a projection to the algorithm, i.e., (18) turns to


in which for any ,


As by A5, (28) leads to .

Remark 3.

In order to write (28) into the form similar to (21), one has to re-define the bias term and the stochastic noise because of the operator . Therefore, it is not straightforward to deduce of the convergence of the constrained optimization algorithm from the results presented in Section V-C.

Theorem 2.

In the constrained optimization problem where and , , by applying the projection (27), we have as almost surely under the assumptions A1-A5.


See Appendix -D. ∎

Vi Optimization Algorithm with incomplete information of utilities of other nodes

A limit of Algorithm 1 is that each node is required to know the local utility of all the other nodes. Such issue is significant as there are many nodes in the network. It is thus important to consider a more realistic situation where a node only has has the knowledge of the local utilities of a subset of nodes, with . Throughout this section, we have the following assumption:

  • A6). at any iteration , an arbitrary node  knows the utility of another node  with a constant probability , , the elements contained in the set is random, for any , we have


Notice that we do not assume any specified network topology and each node has a different and independent set .

We propose a modified algorithm and then show its asymptotic performance. The algorithm is described in Algorithm 2. The main difference between Algorithm 1 and Algorithm 2 comes from the approximation of the objective function, i.e.,


Similar to (18), the algorithm is given by


The basic idea is to consider as a surrogate function of , in the case where the set is non-empty. Otherwise, node  does not know any utility of the other nodes, it then cannot estimate the global utility of the system. As a result, node  keeps its previous action, i.e., . Note that different users may have different knowledge of the global utility as is independent for each node . For example, node  may know of a different node , whereas node  may not know .

  1. Initialize and set the action randomly.

  2. Generate a random variable , perform action .

  3. Estimate , exchange its value with the other nodes and calculate using (30) based on the collected local utilities.

  4. Update according to equation (31).

  5. , go to 2.

Algorithm 2 DOSP algorithm for each node with incomplete information of the utilities of other nodes

Due to the additional random term , the convergence analysis of Algorithm 2 is more complicated than that of Algorithm 1. We start with Lemma 3, which is useful for the analysis in what follows

Lemma 3.

The expected value of over all possible sets is proportional to , i.e.,


See Appendix -E. ∎

To simplify the notation, introduce


Similar to (21), we rewrite (31) as




We need to investigate the property of and in order to show the convergence of Algorithm 2. The proof is more complicated because of the additional random term in both and compared with and discussed in Section V.

Theorem 3.

In the situation where nodes do not have the access to all the other nodes’ local utilities and the objective function is approximated by applying (30), then we still have as almost surely by applying Algorithm 2, as long as the assumptions A1-A6 hold and almost surely.


See Appendix -F. ∎

Remark 4.

Although the asymptotic convergence still holds, the convergence speed is reduced if the information of the objective function is incomplete. By comparing (21) and (34), we can see that the equivalent step size is decreased by times. Moreover, the variance of the stochastic noise is higher, as the randomness is more significant when we use random symbols to represent the average of symbols. More discussion related to the convergence rate will be provided in Section VII.

Vii Convergence rate

In this section, we study the average convergence rate of the proposed algorithm in order to investigate how fast the proposed algorithms converge to the optimum from a quantitative point of view. The analysis also provides a detailed guideline of setting the parameters and which determine the convergence rate. We start with the analysis considering general forms of and . A widely used example is then considered afterwards.

Vii-a General analysis

As a common setting in the analysis of the convergence rate [30], we have an additional assumption on the concavity of the objective function in this section, i.e.,

  • A7. is strongly concave, there exists such that


In this section, we are interested in the evolution of the average divergence ). In order to distinguish the two versions of algorithms with complete and incomplete information of local utilities, we use and to denote the divergence resulted by Algorithm 1 and Algorithm 2, respectively.

An essential step of the analysis of the convergence rate is to investigate the relation of the divergence between two successive iterations.

Denote and as the upper bounds of and of in Algorithm 1 and Algorithm 2 respectively. Our result is stated in Lemma (4).

Lemma 4.

Assume that A7 holds, for any , resulted by Algorithm 1 is such that


Similarly, resulted by Algorithm 2 is such that


See Appendix -G. ∎

We can see that both (38) and (39) can be written in the simplified general form


with positive constants


in (38) and


in (39).

Remark 5.

In the constrained optimization problem, we can obtain the same result as Lemma 4 directly from (68), for any . In fact, the unconstrained optimization problem can be seen as a special case of the constrained optimization problem with , , and . For this reason, we consider the general problem with taken into account, in the rest of this section.


which implies that and as .

Lemma 4 provides us the relation between and . Our next goal is to search a vanishing upper bound of using (40). In other words, we aim to search a sequence such that

This type of analysis is usually performed by induction: consider a given expression of and assume that , one needs to show that by applying (40).

An important issue is then the proper choice of the form of the upper bound . Note that there exists only one step-size in the classical stochastic approximation algorithm described by (1), it is relatively simple to determine the form of with a further setting , see [30]. Our problem is much more complicated as we use two step-sizes and with general form under Assumption A4. The following lemma presents an important property of .

Lemma 5.

If there exists a decreasing sequence such that can be deduced from and (40), then ,


See Appendix -H. ∎

Note that the lower bound of is vanishing as and . Such bound means that, using induction by developing (40), the convergence rate of cannot be better than the decreasing speed of and of . After the verification of the existence of bounded constant or such that or