I Introduction
Iteratively finding the minimum of a (possibly nonconvex) stochastic optimization problem over a feasible set is very popular in a variety of fields and applications, such as signal processing, wireless communications, machine learning, social networks, economics, statistics and bioinformatics, to name just a few. Such stochastic problems arise in two different types of formulations, as discussed in the following.
The first formulation type corresponds to the case where the objective function is in terms of the expectation of a cost function which is parametrised by a random (vector) variable, as shown in the following formulation:
$$\min_{x \in \mathcal{X}} \; \mathbb{E}_{\xi}\big[ f(x, \xi) \big] \qquad (1)$$
where $x \in \mathcal{X}$ is the optimization variable and $\xi$ is the random (vector) variable involved in the problem. The expectation in the objective function may not have a closed-form expression, either because the statistics of the random variable are unknown or because the computational complexity of evaluating the exact expectation is excessively high. Therefore, deterministic optimization methods cannot solve this optimization problem. Such formulations are encountered in various fields, such as wireless communications, signal processing, economics and bioinformatics; for instances of such optimization problems, see [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12].
Another category of problem formulations that utilize stochastic optimization corresponds to large-scale optimization in which the objective function is deterministic, but given by the sum of a large number of subfunctions, as shown in the following formulation:
$$\min_{x \in \mathcal{X}} \; \frac{1}{N} \sum_{i=1}^{N} f_i(x) \qquad (2)$$
where $x \in \mathcal{X}$ is the optimization variable, $N$ is the number of terms in the sum and each subfunction $f_i$ is characterised by a data sample $\xi_i$. Such problems naturally arise in various applications in the machine learning area, such as classification, regression, pattern recognition, image processing, bioinformatics and social networks
[13, 14, 15, 16]. Moreover, these optimization problems are typically huge and need to be solved efficiently. Note that the above problem formulation is naturally deterministic. However, in large-scale problems encountered in many emerging applications dealing with big data, the number of data samples $N$, or equivalently the number of subfunctions (and/or the size of the optimization variable $x$), is very large; hence, deterministic optimization approaches (such as gradient descent [17]) cannot be used to solve them, as the computational complexity of each iteration would be excessively high. This is because calculating the gradient of the objective function at each iteration requires calculating the gradients of all of these subfunctions, which is very costly and may not be practical. As such, the extremely large size of datasets is a critical challenge for solving large-scale problems arising in emerging applications, and it makes classical optimization methods intractable. Consequently, there is an increasing need for new methods with tractable complexity that can handle large-scale problems efficiently [18].
To address the aforementioned challenge, stochastic optimization approaches are used. Under such approaches, at each iteration, instead of calculating the exact gradient of the objective function, a stochastic gradient is calculated by randomly picking one subfunction (corresponding to one randomly chosen or newly arrived data sample) and using the gradient of that subfunction as a stochastic approximation of the true gradient of the objective function. As such, the computational complexity of the gradient calculation is roughly $1/N$ of that of deterministic gradient methods and is independent of the number of data samples. Moreover, the intuition behind utilising stochastic optimization methods instead of deterministic approaches for the deterministic problem formulation in (2) is that the objective function in (2) can be viewed as the sample average approximation of the stochastic objective function in (1), where each sample function $f_i(x) = f(x, \xi_i)$ corresponds to a realisation of the random cost function (i.e., the cost under a fixed realisation $\xi_i$) [18].
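To make the per-iteration cost concrete, the following minimal sketch (our illustration, not code from the cited works) minimises a finite-sum objective by sampling a single subfunction gradient per iteration; the function name and the diminishing stepsize rule `lr0/sqrt(t)` are illustrative choices, not prescriptions from the literature.

```python
import numpy as np

def sgd(grad_fi, N, x0, steps=2000, lr0=0.1, rng=None):
    """Minimise (1/N) * sum_i f_i(x) by using the gradient of one
    randomly drawn subfunction per iteration, so the per-iteration
    cost does not depend on N."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float)
    for t in range(1, steps + 1):
        i = rng.integers(N)                         # one random data sample
        x = x - (lr0 / np.sqrt(t)) * grad_fi(x, i)  # diminishing stepsize
    return x
```

For example, with $f_i(x) = (x - b_i)^2 / 2$ the stochastic gradient is $x - b_i$ and the iterates approach the sample mean of the $b_i$.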
Having encountered these two forms of optimization problems (that require stochastic optimization methods) in different areas, researchers from various communities, including but not limited to the optimization, pure mathematics, neural networks and machine learning fields, have been working on stochastic optimization algorithms. Specifically, they aim to propose new stochastic optimization algorithms that meet the new requirements of emerging and future applications, as discussed in the following.
First of all, the main requirement for designing new stochastic optimization algorithms is fast convergence. This is because the stochastic optimization problems encountered in emerging and future applications are mainly large-scale problems that need to be handled in a very short time; otherwise, the obtained solutions may be outdated, and hence no longer valid, by the time of convergence. Therefore, efficient algorithms that converge very fast are essential.
Furthermore, in many new applications, there are multiple agents that control different variables and need to jointly optimise the whole-system performance in a distributed way. These applications and scenarios deal with distributed systems in which the associated stochastic optimization problems need to be solved iteratively by a parallel stochastic optimization method.^1 In addition, with the recent advances in multi-core parallel processing technology, it is increasingly desirable to develop parallel algorithms that allow the various control variables to be updated simultaneously at their associated agents at each iteration.

^1 Such applications arise in various fields, for example, multi-agent resource allocation in wireless interference systems [19], peer-to-peer networks, ad-hoc networks and cognitive radio systems in wireless networks, parallel-computing data centres in big data, and large-scale distributed systems in economics and statistics, to name just a few.
The above requirements are not truly addressed by the classical stochastic optimization algorithms. Therefore, new algorithms are needed that can efficiently solve large-scale stochastic optimization problems with fast convergence and support parallel optimization of the involved control variables. The existing algorithms that aim to meet these requirements can be categorised into three groups: stochastic gradient-based, stochastic majorization-minimization and stochastic parallel decomposition methods, as discussed in the following sections.
I-A Stochastic Gradient-Based Algorithms
Stochastic gradient-based methods are variations of the well-known stochastic gradient descent (SGD) method
[20, 21]. In the dynamic equations of these methods, instead of using the exact gradient of the objective function, which is not available in problem formulations in the general form of (1) or is not practical to calculate in problem formulations in the general form of (2), a noisy estimate of the gradient is used. Such a noisy gradient estimate, which is referred to as a
stochastic gradient, is usually obtained by calculating the gradient of the observed sample function for the observed realisation $\xi^t$ of the random variable in the case of the problem formulation in (1), or the gradient of the subfunction $f_i$ for a randomly chosen index $i$ in the case of the problem formulation in (2). As the negative of a stochastic gradient is used as the update direction at each iteration, these methods are called stochastic gradient-based. Various numerical results show that such an update direction suffers from slow convergence. This is mainly because the update direction may not be a tight approximation of the true gradient, and converges to the true gradient only slowly. To improve the slow convergence of the stochastic gradient-based update direction, acceleration techniques have been proposed [22], including averaging over the iterates [20], suffix averaging [20] and polynomial-decay averaging [21]. However, although some of these methods have been proved to achieve the optimal convergence rate for a considered class of problems [20], the obtained rate is asymptotic, and numerical experiments on a large number of problems show that, in practice, their improvement over the SGD method is not significant [21]. Furthermore, the convergence of the aforementioned works is proved mainly for the special case of strongly convex objective functions, and these methods may not work well for other convex functions or for the general class of nonconvex problems. The existing works that also consider nonconvex problems are mainly designed for the unconstrained case only [23, 24]. In fact, under some strict requirements on the stepsizes [25], [20], these works prove a descent-type convergence of the algorithm for the unconstrained case. However, such analysis may not be valid in the presence of the projection step that is introduced by the constraints.
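As an illustration of the averaging-based acceleration techniques mentioned above, the sketch below implements suffix averaging in its simplest form: run plain SGD, but return the average of the last fraction of iterates rather than the final, noisier one. The function name and the $1/\sqrt{t}$ stepsize are our own illustrative assumptions.

```python
import numpy as np

def sgd_suffix_avg(stoch_grad, x0, steps, lr0, suffix=0.5, rng=None):
    """Plain SGD with a 1/sqrt(t) stepsize, returning the average of
    the last `suffix` fraction of iterates (suffix averaging) instead
    of the final iterate, which averages out the gradient noise."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float)
    tail = []                                # iterates kept for averaging
    start = int((1.0 - suffix) * steps)
    for t in range(1, steps + 1):
        x = x - (lr0 / np.sqrt(t)) * stoch_grad(x, rng)
        if t > start:
            tail.append(x.copy())
    return np.mean(tail, axis=0)
```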
Finally, another popular acceleration method, especially in training algorithms for deep networks such as the Adam algorithm [26], is mini-batch stochastic gradient estimation. However, with the rapid growth of data and increasing model complexity, it still exhibits a slow convergence rate, while the per-iteration complexity of mini-batch algorithms grows linearly with the size of the batch [27].
In this paper, we are interested in the general class of nonconvex stochastic optimization problems and aim to propose a fast-converging stochastic optimization algorithm that can also handle constrained optimization problems.
I-B Stochastic Majorization-Minimization
To better support nonconvex stochastic optimization problems, the stochastic majorization-minimization method is also widely used. This method is a nontrivial extension of the majorization-minimization (MM) method for deterministic problems. MM is an optimization method that minimises a possibly nonconvex function by iteratively minimising a convex upper-bound function that serves as a surrogate for the objective function [28]. The intuition behind MM is that minimising the upper-bound surrogate functions over the iterations monotonically drives the objective function value down. Because of the simplicity of this idea, MM has been popular for a wide range of problems in signal processing and machine learning applications, and many existing algorithms build on it, including expectation-maximisation [29], DC programming [30] and proximal algorithms [31, 32]. However, extending MM to the stochastic optimization case is not trivial: in a stochastic optimization problem, there is no closed-form expression for the objective function (due to the expectation involved), and hence it is difficult to find the required upper-bound convex approximation (i.e., the surrogate function) for the objective function.
To address the aforementioned issue, stochastic MM [33], also known as stochastic successive upper-bound minimization (SSUM) [34], has been proposed as a new class of algorithms for large-scale nonconvex optimization problems. It extends the idea of MM to stochastic optimization in the following way: instead of constructing a surrogate for the original objective function, stochastic MM finds a majorizing surrogate for the observed sample function at each iteration. The current and previously found instance surrogate functions (a.k.a. sample surrogate functions) are then incrementally combined to form an approximate surrogate for the objective function, which is then minimised to update the iterate.
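The incremental combination of sample surrogates can be illustrated on a toy problem. In the sketch below (our own construction, not the SSUM code of [33, 34]), the sample cost for $\min_x \mathbb{E}[(x - \xi)^2]$ serves as its own quadratic majorizer, the surrogates are blended with weights $\rho_t = 1/t$, and the combined quadratic is minimised in closed form at each step.

```python
def stochastic_mm(samples, x0=0.0):
    """Stochastic MM (SSUM-style) sketch for min_x E[(x - xi)^2].
    The sample surrogate at iteration t is g_t(x) = (x - xi_t)^2,
    which trivially majorizes the sample cost (it equals it).
    Surrogates are combined recursively with weights rho_t = 1/t,
    giving a quadratic A*x^2 - B*x (+ const) minimised exactly."""
    A_bar, B_bar, x = 0.0, 0.0, x0
    for t, xi in enumerate(samples, start=1):
        rho = 1.0 / t
        A_bar = (1 - rho) * A_bar + rho * 1.0         # x^2 coefficient
        B_bar = (1 - rho) * B_bar + rho * (2.0 * xi)  # linear-term weight
        x = B_bar / (2.0 * A_bar)   # argmin of the combined surrogate
    return x
```

With $\rho_t = 1/t$ this recursion reproduces the running sample mean, so the iterate tends to $\mathbb{E}[\xi]$.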
Under some conditions, mainly on the instance surrogate functions, it has been shown that the approximate surrogate function evolves into a majorizing surrogate for the expected cost in the objective function of the original stochastic optimization problem [33]. The major condition is that the instance surrogate function should be an upper bound for the observed sample cost function at each iteration [34]. In some related works, this condition only needs to be satisfied locally, while most existing works require the surrogate function to be a global upper bound [34].
It should be noted that although the upper-bound requirement is fundamental for the convergence of these methods, it is restrictive in practice. This is mainly because finding a proper upper bound for the sample cost function at each iteration may itself be difficult. Therefore, although this upper bound can facilitate the minimization at each iteration, finding it generally increases the per-iteration computational complexity of the algorithm. Consequently, minimising the approximate surrogate function at each iteration is only practical when the surrogate functions are simple enough to be easily optimized, e.g., when they can be parametrised with a small number of variables [34]. Otherwise, the complexity of stochastic MM may outweigh the simplicity of the idea behind it, making the method impractical.
In addition to the aforementioned complexity issue, stochastic MM methods may not be implementable in parallel. Since, in general, it is not guaranteed that the problem of optimising the approximate surrogate function at each iteration is decomposable with respect to the control variables of the different agents, a centralised implementation might be required. Consequently, this method may not be suitable for distributed systems and applications that need parallel implementations.
Motivated by these issues of stochastic MM, in this paper we propose a stochastic scheme with an approximate surrogate that is not only easy to obtain and minimise at each iteration, but also easy to decompose in parallel. Specifically, the instance surrogate function in our method does not have to be an upper bound. Moreover, it can be calculated and optimized at each iteration with low complexity, as will be seen in the next section. However, this brings new challenges that need to be tackled in order to prove the convergence of the proposed method. Most importantly, since the instance surrogate function in our algorithm is no longer an upper bound, unlike in stochastic MM, the monotonic decrease property cannot be utilized in our case. Therefore, it is more challenging to show that the approximate surrogate function eventually becomes an upper bound for the expected cost in the objective function. We show how we tackle these challenges in later sections.
I-C Parallel Stochastic Optimization
As explained before, there is an increasing need to design new stochastic optimization algorithms that enable parallel optimization of the control variables by different agents in a distributed manner. Note that many gradient-based methods are parallelisable in nature [20, 21, 35, 36, 37]. However, as mentioned before, they suffer from slow convergence in practice [38, 39].
There are only a few works on parallel stochastic optimization in the literature. The authors in [19] proposed a stochastic parallel optimization method that decomposes the original stochastic (nonconvex) optimization problem into parallel deterministic convex subproblems, which are then solved independently by different agents in a distributed fashion. The method proposed here differs from that in [19] in the following ways. Firstly, unlike our method, the algorithm in [19] requires a weighted averaging of the iterates, which slows down the convergence in practice. Secondly, in our method, the objective function is approximated with an incremental convex approximation that is easier to calculate and minimize, with less computational complexity than that in [19], for an arbitrary objective function in general. These differences are elaborated further in Section III-C.
I-D Contributions of This Paper
In this paper, we address the aforementioned issues of the existing works and propose a fast converging and lowcomplexity parallel stochastic optimization algorithm for general nonconvex stochastic optimization problems. The main contributions of this work can be summarised as follows.

A stochastic convex approximation method for general (possibly nonconvex) stochastic optimization problems with guaranteed convergence: We propose a stochastic convex approximation framework that can solve general stochastic optimization problems (i.e., without requiring the objective to be strongly convex or even convex) with low complexity. We analyze the convergence of the proposed framework for both convex and nonconvex stochastic optimization problems, and prove its convergence to the optimal solution for convex problems and to a stationary point for general nonconvex problems.

A general framework for parallel stochastic optimization: We show that our proposed method can be applied to the parallel decomposition of stochastic multi-agent optimization problems arising in distributed systems. Under our method, the original (possibly nonconvex) stochastic optimization problem is decomposed into parallel deterministic convex subproblems, and each subproblem is then solved by the associated agent in a distributed fashion. Such a parallel optimization approach can be highly beneficial for reducing computational complexity, especially in large-scale optimization problems.

Applications to solving large-scale stochastic optimization problems with fast convergence: We show by simulations that our proposed framework can efficiently solve large-scale stochastic optimization problems in the area of machine learning. Comparison with the state-of-the-art methods shows that ours significantly outperforms them in terms of convergence speed, while maintaining low computational complexity and storage requirements.
The rest of the paper is organised as follows. Section II formulates the problem. Section III introduces the proposed parallel stochastic optimization framework and presents its convergence results for both the convex and nonconvex cases. In Section IV, we illustrate an important application example of the proposed framework in the area of machine learning, and show how the proposed method can efficiently solve the problem in this application with low complexity and high convergence speed. Simulation results and comparison to the state-of-the-art methods are presented in Section V. Finally, Section VI concludes the paper.
II Problem Formulation
Consider a multi-agent system composed of $n$ users, each independently controlling a strategy vector $x_i \in \mathcal{X}_i$, that aim to solve the following stochastic optimization problem together in a distributed manner:
$$\min_{x \in \mathcal{X}} \; U(x) \triangleq \mathbb{E}_{\xi}\big[ f(x, \xi) \big] \qquad (3)$$
where $x = (x_1, \dots, x_n) \in \mathcal{X}$ is the joint strategy vector and $\mathcal{X} = \mathcal{X}_1 \times \dots \times \mathcal{X}_n$ is the joint strategy set of all users. Moreover, the objective function $U(x)$ is the expectation of a sample cost function $f(x, \xi)$, which depends on the joint strategy vector $x$ and a random vector $\xi$. The random vector $\xi$ is defined on a probability space whose probability measure is unknown. Such a problem formulation is very general and includes many optimization problems as special cases. Specifically, since the objective function is not assumed to be convex over the set $\mathcal{X}$, the considered optimization problem can be nonconvex in general. The following assumptions, which are classical in the stochastic optimization literature and are satisfied for a large class of problems [25], are made throughout this paper.
Assumption 1 (Problem formulation structure).
For the problem formulation in (3), we assume

The feasible sets $\mathcal{X}_i$ are compact and convex;^2

^2 This assumption guarantees that there exists a solution to the considered optimization problem.

For any realisation $\xi$ of the random vector, the sample cost function $f(\cdot, \xi)$ is continuously differentiable over $\mathcal{X}$ and has a Lipschitz continuous gradient;

The objective function $U(x)$ has a Lipschitz continuous gradient with constant $L$.
We aim to design a distributed iterative algorithm with low complexity, fast convergence and low storage requirements that solves the considered stochastic optimization problem when the distribution of the random vector $\xi$ is unknown, or when it is practically impossible to accurately compute the expected value in (3), so that we only have access to observed samples (i.e., realisations) of $\xi$.
III Proposed Parallel Stochastic Optimization Algorithm
III-A Algorithm Description
Note that the considered problem in (3) is possibly nonconvex. Moreover, due to the expectation involved, its objective function does not have a closed-form expression. These two issues make finding the stationary point(s) of this problem very challenging. In the following, we tackle these challenges and propose an iterative decomposition algorithm that solves the problem in a distributed manner, where the agents update and optimise their associated surrogate functions in parallel. In this way, the original objective function is replaced with incremental strongly convex surrogate functions, which are then updated and optimized by the agents in parallel. Note that the proposed method also supports the mini-batch case. Therefore, for the sake of generality, we consider mini-batch stochastic gradient estimation, where the size of the mini-batch, denoted by $B$, can be chosen to achieve a good tradeoff between the per-iteration complexity and the convergence speed.
$$\boldsymbol{f}_i^{t} = (1 - \rho^t)\,\boldsymbol{f}_i^{t-1} + \rho^t \, \frac{1}{B} \sum_{j=1}^{B} \nabla_{x_i} f\big(x^t, \xi_j^t\big) \qquad (4)$$

$$\hat{x}_i^{t} = \operatorname*{arg\,min}_{x_i \in \mathcal{X}_i} \; \tau \big\| x_i - x_i^t \big\|^2 + \big(\boldsymbol{f}_i^{t}\big)^{\mathsf T} \big(x_i - x_i^t\big) \qquad (5)$$

$$x_i^{t+1} = x_i^t + \gamma^t \big( \hat{x}_i^t - x_i^t \big) \qquad (6)$$
The proposed parallel decomposition method is described in Algorithm 1, which proceeds as follows. At each iteration $t$, a batch of $B$ random vectors $\{\xi_j^t\}_{j=1}^{B}$ is realised and, accordingly, each agent $i$ calculates the gradient of the associated sample function with respect to its control variable $x_i$ at the latest iterate (i.e., $x^t$). Then, using its newly calculated gradient and the previous ones, agent $i$ incrementally updates a vector $\boldsymbol{f}_i^t$ that is an estimate of the exact gradient of the original objective function with respect to $x_i$. It will be shown that this vector eventually converges to the true gradient. Using this gradient estimate and the latest iterate $x_i^t$, agent $i$ then constructs a quadratic deterministic surrogate function, as in (5). The surrogate function is then minimised by agent $i$ to update the control variable $x_i$. In this way, by solving deterministic quadratic subproblems in parallel, each user minimises a sample convex approximation of the original nonconvex stochastic function.
Note that the first term in the surrogate function (5) is the proximal regularisation term, which makes the surrogate function strongly convex with parameter $\tau$. Moreover, the role of the second term is to incrementally estimate the unavailable exact gradient of the original objective function at each iteration, using the sample gradients collected over the iterates so far.
According to (4), the direction $\boldsymbol{f}_i^t$ is recursively updated based on the previous direction $\boldsymbol{f}_i^{t-1}$ and the newly observed stochastic gradient. Under proper choices of the sequences $\{\rho^t\}$ and $\{\gamma^t\}$, this incremental estimate of the exact gradient becomes more and more accurate as $t$ increases, and gradually converges to the exact gradient (this will be shown later by Lemma 4).
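The update pattern just described can be sketched for a single agent as follows. Since minimising a surrogate of the form $\tau\|z - x^t\|^2 + \hat{f}^{\mathsf T}(z - x^t)$ over a convex set amounts to projecting $x^t - \hat{f}/(2\tau)$ onto that set, the surrogate minimisation has a closed form. The stepsize exponents and all names below are illustrative assumptions, not the paper's prescribed choices.

```python
import numpy as np

def ssca_sketch(stoch_grad, project, x0, steps, tau=1.0, rng=None):
    """Single-agent sketch of the described updates: a recursively
    averaged gradient estimate, a closed-form surrogate minimiser
    (a projected step), and a smoothed iterate update.  Assumed,
    illustrative stepsizes: rho_t = t**-0.6 and gamma_t = t**-0.9."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float)
    f_hat = np.zeros_like(x)                 # recursive gradient estimate
    for t in range(1, steps + 1):
        rho, gamma = t ** -0.6, t ** -0.9
        f_hat = (1 - rho) * f_hat + rho * stoch_grad(x, rng)
        # argmin of tau*||z - x||^2 + f_hat^T (z - x) over the feasible set
        x_hat = project(x - f_hat / (2.0 * tau))
        x = x + gamma * (x_hat - x)          # smoothed iterate update
    return x
```

As a sanity check, for $\min_x \mathbb{E}[(x - \xi)^2]$ with $\xi$ concentrated around 1 and a box feasible set, the iterates settle near 1.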
III-B Convergence Analysis of the Proposed Method
In the following, we present the main convergence results for the proposed Algorithm 1. Prior to that, let us state the following assumptions on the noise of the stochastic gradients as well as on the sequences $\{\rho^t\}$ and $\{\gamma^t\}$.
Assumption 2 (Unbiased gradient estimation with bounded variance).
For any iteration $t$, the following hold almost surely:

(a) $\mathbb{E}\big[ \nabla f(x^t, \xi^t) \,\big|\, \mathcal{F}^t \big] = \nabla U(x^t)$;

(b) $\mathbb{E}\big[ \| \nabla f(x^t, \xi^t) - \nabla U(x^t) \|^2 \,\big|\, \mathcal{F}^t \big] < \infty$;

where $\mathcal{F}^t$ denotes the past history of the algorithm up to iteration $t$.
Assumption 2(a) indicates that the instantaneous gradient is an unbiased estimate of the exact gradient at each point, and Assumption 2(b) indicates that the variance of this noisy gradient estimate is bounded. These assumptions are standard and very common in the literature on instantaneous gradient errors [40, 41]. Moreover, it can easily be verified that if the random variables are bounded and identically distributed, then these assumptions are automatically satisfied [25]. Finally, Assumption 2 clearly implies that the gradient of the observed sample function at the current iterate is an unbiased estimate of the true gradient of the original objective function, with a finite variance (note, however, that the recursive estimate $\boldsymbol{f}^t$ is not an unbiased estimate for finite $t$).

Assumption 3 (Stepsize sequence constraints).
The sequences $\{\rho^t\}$ and $\{\gamma^t\}$ satisfy the following conditions:

(a) $\rho^t \to 0$, $\sum_{t} \rho^t = \infty$, and $\sum_{t} (\rho^t)^2 < \infty$;

(b) $\gamma^t \to 0$, $\sum_{t} \gamma^t = \infty$, and $\sum_{t} (\gamma^t)^2 < \infty$.
The following theorem states a preliminary convergence result of the proposed Algorithm 1 for the general (possibly nonconvex) stochastic optimization problems.
Theorem 1.
The following theorem shows the convergence of the proposed algorithm to the optimal solution for the case of convex stochastic optimization problems.
Theorem 2.
The last theorem establishes the convergence of the proposed algorithm to a stationary point for the general case of nonconvex stochastic optimization problems.
III-C Comparison with the Existing Methods
It should be noted that, unlike in SGD, the stochastic gradient vector in our method is not an unbiased estimate of the exact gradient for finite $t$. This fact adds nontrivial challenges to proving the convergence of the proposed algorithm, since the expectation of the update direction is no longer a descent direction. To tackle this challenge, we prove that this biased estimate asymptotically converges to the exact gradient with probability one.
Moreover, unlike in the stochastic MM methods discussed before, the considered surrogate functions are not necessarily upper bounds for the observed sample function, but eventually converge to a global upper bound for the expectation of the sample functions. In addition, the considered surrogate functions can be computed and optimized with low complexity for any optimization problem of the form in (3). These advantages address the complexity issues of the stochastic MM methods discussed before.
Furthermore, the proposed method differs from that in [19] in the following ways. Firstly, the algorithm proposed in [19] requires weighted averaging of the iterates, where the associated vanishing factor is assumed to diminish much faster than the vanishing factor involved in the incremental estimate of the exact gradient of the objective function. In fact, averaging over the iterates helps to average out the noise involved in the gradient estimation of the stochastic objective function, and is hence used as a crucial step in proving the convergence of the method proposed in [19]. In practice, however, such averaging makes the convergence of the algorithm slower, as the weight for the approximate solution found at the current iteration converges to zero very quickly. Therefore, the effect of a new approximation point found by minimising the approximate surrogate function at the current iteration very quickly becomes negligible. This step, which is fundamental for the convergence of the scheme in [19], is no longer used in the proposed algorithm. Although this makes the convergence proof of our method challenging, it contributes to the significantly faster convergence of the proposed scheme compared to that in [19], as verified by the numerical results in Section V.
Secondly, in [19], the surrogate function at each agent is obtained from the original objective function by replacing the convex part of the expected value with its incremental sample function and the nonconvex part with a convex local estimate. In our proposed method, by contrast, the expectation of the whole objective function is approximated with an incremental convex approximation, which can be easily calculated and optimized with low complexity at each iteration. This contributes to the lower complexity of the scheme proposed here compared to the one in [19], for an arbitrary objective function in general.
IV An Application Example of the Proposed Method in Solving Large-Scale Machine Learning Problems
Optimization is believed to be one of the important pillars of machine learning [42]. This is mainly due to the nature of machine learning systems, in which a set of parameters must be optimized based on currently available data so that the learning system can make decisions for yet unseen data. Nowadays, many challenging machine learning problems in a wide range of applications rely on optimization methods, and designing new methods that are more widely applicable to modern machine learning applications is highly desirable. One important problem that appears in many machine learning applications with huge datasets is training large-scale support vector machines (SVMs). In this section, we demonstrate how our proposed optimization algorithm can efficiently solve this problem. We compare the performance of the proposed algorithm to state-of-the-art methods in the literature and present experimental results to demonstrate the merits of our proposed framework.
SVMs are one of the most prominent machine learning techniques for a wide range of classification problems in machine learning applications, such as cancer diagnosis in bioinformatics, image classification, face recognition in computer vision and text categorisation in document processing
[43, 44]. The SVM classifier is formulated by the following optimization problem
^3 [45]:

$$\min_{w} \; \frac{\lambda}{2} \| w \|^2 + \frac{1}{N} \sum_{i=1}^{N} \max\big( 0, \; 1 - y_i \, w^{\mathsf T} x_i \big) \qquad (8)$$

^3 Note that although the SVM problem formulation is widely used in the literature to evaluate the performance of optimization algorithms, it is not everywhere differentiable. However, it should be noted that in our proposed algorithm, as in the other works in the literature, differentiability is a sufficient condition for the convergence proof. In addition, the probability of encountering the non-differentiable points is almost zero in practice.
where $\{(x_i, y_i)\}_{i=1}^{N}$ is the set of training samples and $\lambda$ is the regularisation parameter, which controls overfitting by maintaining a tradeoff between increasing the size of the margin (larger $\lambda$) and ensuring that the data lie on the correct side of the margin (smaller $\lambda$). Once this problem is solved, the obtained $w$ determines the SVM classifier as $\hat{y} = \mathrm{sign}(w^{\mathsf T} x)$, i.e., each future datum $x$ is labelled by the sign of its inner product with the solution $w$.
Although the SVM problem formulation is well understood, solving large-scale SVMs is still challenging. For the large and high-dimensional datasets encountered in emerging applications, the optimization problem cast by SVMs is huge. Consequently, solving such large-scale SVMs is mathematically complex and computationally expensive. Specifically, as the number of training data $N$ becomes very large, computing the exact gradient of the objective function becomes impractical. Therefore, for training large-scale SVMs, off-the-shelf optimization algorithms for general problems quickly become intractable in their memory requirements. Moreover, for emerging applications that need to handle a huge amount of data in a short time, the convergence speed of the applied algorithm is another critical factor that must be carefully considered. Furthermore, when the data contain a large number of attributes, extensive resources may be required to process the datasets and solve the associated problems. To address this issue, especially in resource-limited multi-agent systems, it is beneficial to decompose the attributes of the dataset into small groups, each associated with a computing agent, and then process these smaller datasets by the distributed agents in parallel.
The aforementioned requirements give rise to an increasing need for an efficient and scalable method that can solve large-scale SVMs with low complexity and fast convergence. In the sequel, we show how our proposed algorithm effectively addresses these issues to solve large-scale SVMs with lower complexity and faster convergence than the state-of-the-art solutions.
First, note that for SVMs with a huge number of training data, the optimization problem in (8) can be viewed as the sample average approximation (SAA) of the following stochastic optimization problem:
(9) 
in which the randomness comes from random samples . Considering the batch size , the stochastic gradient of the above objective function is
(10) 
where is the indicator function and
is a randomly drawn training sample. Note that since the training samples are chosen i.i.d., the gradient of the loss function with respect to any individual sample can be shown to be an unbiased estimate of a gradient of
[21]. Moreover, the variance of such an estimate is finite, i.e., , as shown in [45]. To compute this stochastic gradient at each iteration, only one sample out of the training samples is drawn uniformly at random. Accordingly, the cost of computing the gradient at each iteration is lowered to almost of the cost of computing the exact gradient. Therefore, by adopting a stochastic approach and utilising the stochastic gradient in (10) instead of calculating the exact gradient of the SVM problem, the per-iteration cost of computing the gradient becomes independent of the number of training data, which makes the approach highly scalable to large-scale SVMs.

The state-of-the-art algorithms for solving SVMs through a stochastic approach can be classified into first-order methods, such as Pegasos [45], and second-order methods, such as the Newton-Armijo method [46]. The existing first-order methods can significantly decrease the computational cost per iteration, but they converge very slowly, which is undesirable, especially for emerging applications with huge datasets. On the other hand, second-order methods suffer from high complexity, due to the significantly expensive matrix computations and high storage requirements [47, 48]. For example, the Newton-Armijo method [46] needs to solve a matrix equation, which involves matrix inversion, at each iteration. However, in the case of large-scale SVMs for big data classification, inverting a curvature matrix, or storing an iterative approximation of that inverse, is very expensive, which makes this method impractical for large-scale SVMs. In addition, this method is an offline method that needs all the training data in a full batch to calculate the exact values of the gradient and the Hessian matrix at each iteration. This causes high complexity and computational cost, and makes this method inapplicable to online applications.
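Assuming the standard hinge-loss form of the SVM objective, the single-sample stochastic subgradient described around (10) can be sketched as follows; the function name, the explicit regularisation argument, and the sampling code are our illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_subgradient(w, X, y, lam):
    # Draw one training sample uniformly at random; since samples are
    # i.i.d., the result is an unbiased estimate of the full gradient.
    i = rng.integers(len(y))
    xi, yi = X[i], y[i]
    # The hinge term contributes only when the sampled point violates
    # the margin, i.e. when the indicator y_i <w, x_i> < 1 is active.
    if yi * (xi @ w) < 1.0:
        return lam * w - yi * xi
    return lam * w
```

Since only one of the training samples is touched per call, the per-iteration cost is independent of the dataset size, which is the scalability property discussed above.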
In the next section, we show that our proposed algorithm can solve large-scale SVMs significantly faster than the existing solutions, and with low complexity. Using real-world big datasets, we compare the proposed accelerated method against the aforementioned methods for solving SVMs.
V Simulation Results and Discussion
In this section, we empirically evaluate the proposed stochastic optimization algorithm and demonstrate its efficiency for solving large-scale SVMs. We compare the performance of the proposed algorithm against three important state-of-the-art baselines in this area:

1) The Pegasos algorithm [45], which is known to be one of the best state-of-the-art algorithms for solving large-scale SVMs;

2) The state-of-the-art parallel stochastic optimization algorithm of [19];

3) The ADAM algorithm [26], one of the most popular and widely-used optimization algorithms for deep learning: an adaptive learning-rate method designed specifically for training deep neural networks, which has been recognized as one of the best optimization algorithms for deep learning and whose popularity, according to [49], is growing exponentially.
We perform our simulations and comparisons on two popular real-world large datasets with very different feature counts and sparsity. The COV1 dataset classifies the forest-cover areas in the Roosevelt National Forest of northern Colorado, identified from cartographic variables, into two classes [50], and the RCV1 dataset classifies the CCAT and ECAT classes versus the GCAT and MCAT classes in the Reuters RCV1 collection [51]. (The datasets are available online at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html.) Details of the datasets' characteristics, as well as the values of the SVM regularisation parameter used in the experiments, are provided in Table I. (Note that for the regularisation parameter of each dataset, we adopt the typical value used in previous works [20, 21, 45, 52].)
Similar to [21], the initial point of all the algorithms is set to . We also tune the step-size parameters of the method in [19] to achieve its best performance on the used datasets. Moreover, for the Pegasos algorithm, we output the last weight vector rather than the average weight vector, as the former is found to perform better in practice [45]. Finally, for the hyperparameters of the ADAM algorithm, we use their default values as reported in [26], which are widely used for almost all problems and known to perform very well in practice.
Table I: Characteristics of the datasets used in the experiments.

Dataset | Training Size | Test Size | Features | Sparsity (%)
COV1    | 522911        | 58101     | 55       | 22.22
RCV1    | 677399        | 20242     | 47236    | 0.16
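For reference, one ADAM update with the default hyperparameters reported in [26] (step size 0.001, moment decay rates 0.9 and 0.999, and a small constant 1e-8) can be sketched as follows; the stateless function signature is our own illustrative choice:

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Bias correction for the zero initialisation of m and v.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Parameter update with a per-coordinate adaptive step size.
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

These default values are the ones used for the ADAM baseline in our experiments.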
Figs. 1(a) and 1(b) show the convergence results of the proposed method and the baselines on each of the datasets. As can be seen in these figures, the proposed algorithm converges much faster than the baselines. At the early iterations, the convergence speeds of all of these methods are rather fast. However, after a number of iterations, the convergence speed of the baselines drops, while our method still maintains a good convergence speed. This is because the estimate of the gradient of the objective function at each iteration in those methods is not as good as in our proposed method. For the Pegasos method, this is due to the fact that the gradient estimate is based on only one data sample and its associated observed sample function. In contrast, our proposed method utilizes all the data samples observed up to that point and averages over the previous gradients of the observed sample functions to estimate the true gradient of the original objective function. Moreover, the method in [19] performs an extra averaging over the iterates, which makes the solution change increasingly slowly as the iterations go on.
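The gradient-averaging idea described above can be sketched with a generic recursive update; the specific recursion f_t = (1 - rho_t) * f_{t-1} + rho_t * g_t and the function name below are illustrative assumptions rather than the paper's exact update rule:

```python
import numpy as np

def average_gradients(sample_grads, rhos):
    # Recursive averaging of per-sample gradients:
    #   f_t = (1 - rho_t) * f_{t-1} + rho_t * g_t,
    # which damps the noise of single-sample estimates over iterations.
    f = np.zeros_like(sample_grads[0])
    for g, rho in zip(sample_grads, rhos):
        f = (1.0 - rho) * f + rho * g
    return f
```

With rho_t = 1/t, the recursion reduces to the plain average of all gradients observed so far, in contrast to a single-sample estimate whose variance does not decrease with the iteration count.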
Figs. 2 and 3 show the classification precision of the resulting SVM model versus the number of samples visited. The top row shows the classification precision on the training data, and the bottom row shows it on the test data. According to the figures, under the same number of iterations, the proposed method obtains a more precise SVM with higher accuracy on both the training and test data than the baselines.
Finally, Table II compares the CPU time for running the algorithms for iterations. As can be verified from this table, the CPU time of the proposed method is less than those of the baselines. Therefore, the per-iteration computational complexity of the proposed algorithm is lower than that of the considered baselines, which are themselves known to have low complexity.
VI Conclusion
In this paper, we proposed a novel fast parallel stochastic optimization framework that can solve a large class of possibly nonconvex constrained stochastic optimization problems. Under the proposed method, each user of a multi-agent system updates its control variable in parallel by independently solving a successive convex approximation subproblem. These subproblems have low complexity and are easy to obtain. The proposed algorithm can be applied to a large class of optimization problems arising in important applications from various fields, such as wireless networks and large-scale machine learning. Moreover, we proved convergence of the proposed algorithm to the optimal solution for convex problems, and to a stationary point for general nonconvex problems.
Moreover, as a representative application of our proposed stochastic optimization framework in the context of machine learning, we elaborated on large-scale SVMs and demonstrated how the proposed algorithm can efficiently solve this problem, especially for modern applications with huge datasets. We compared the performance of our proposed algorithm to the state-of-the-art baselines. Numerical results on popular real-world datasets show that the proposed method can significantly outperform the state-of-the-art methods in terms of convergence speed while having the same or lower complexity and storage requirements.
References
 [1] H. Shuai, J. Fang, X. Ai, Y. Tang, J. Wen, and H. He, “Stochastic optimization of economic dispatch for microgrid based on approximate dynamic programming,” IEEE Transactions on Smart Grid, vol. 10, no. 3, pp. 2440–2452, 2018.
 [2] N. Omidvar, A. Liu, V. Lau, F. Zhang, D. H. K. Tsang, and M. R. Pakravan, “Optimal hierarchical radio resource management for HetNets with flexible backhaul,” IEEE Transactions on Wireless Communications, vol. 17, no. 7, pp. 4239–4255, 2018.
 [3] R. Rezaei, N. Omidvar, M. Movahednasab, M. R. Pakravan, S. Sun, and Y. L. Guan, “Efficient, fair and QoSaware policies for wirelessly powered communication networks,” arXiv preprint arXiv:1909.07700, 2019.
 [4] K. Marti and P. Kall, Stochastic Programming: Numerical Techniques and Engineering Applications. Springer Science & Business Media, 2013, vol. 423.
 [5] M. Movahednasab, N. Omidvar, M. R. Pakravan, and T. Svensson, “Joint data routing and power scheduling for wireless powered communication networks,” in ICC 2019 - 2019 IEEE International Conference on Communications (ICC). IEEE, 2019, pp. 1–7.
 [6] R. Rezaei, M. Movahednasab, N. Omidvar, and M. R. Pakravan, “Optimal and nearoptimal policies for wireless power transfer considering fairness,” in 2018 IEEE Global Communications Conference (GLOBECOM). IEEE, 2018, pp. 1–7.
 [7] N. Omidvar, A. Liu, V. Lau, F. Zhang, D. H. K. Tsang, and M. R. Pakravan, “Twotimescale radio resource management for heterogeneous networks with flexible backhaul,” in 2015 IEEE Global Communications Conference (GLOBECOM). IEEE, 2015, pp. 1–6.
 [8] A. S. Bedi, A. Koppel, and K. Rajawat, “Asynchronous saddle point algorithm for stochastic optimization in heterogeneous networks,” IEEE Transactions on Signal Processing, vol. 67, no. 7, pp. 1742–1757, 2019.
 [9] N. Omidvar, A. Liu, V. Lau, F. Zhang, D. H. K. Tsang, and M. R. Pakravan, “Twotimescale QoSaware crosslayer optimisation for HetNets with flexible backhaul,” in 2015 IEEE 26th Annual International Symposium on Personal, Indoor, and Mobile Radio Communications (PIMRC). IEEE, 2015, pp. 1072–1076.
 [10] M. Movahednasab, B. Makki, N. Omidvar, M. R. Pakravan, T. Svensson, and M. Zorzi, “An energyefficient controller for wirelesslypowered communication networks,” arXiv preprint arXiv:1905.05958, 2019.
 [11] R. Rezaei, M. Movahednasab, N. Omidvar, and M. R. Pakravan, “Stochastic power control policies for batteryoperated wireless power transfer,” in 2018 IEEE 29th Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC). IEEE, 2018, pp. 1–5.
 [12] N. Omidvar, F. Zhang, A. Liu, V. Lau, D. H. K. Tsang, and M. R. Pakravan, “Crosslayer QSIaware radio resource management for HetNets with flexible backhaul,” in Wireless Communications and Networking Conference (WCNC 2016), Doha, Qatar. IEEE, 2016, pp. 945–950.
 [13] Y. Zhang, A. M. Saxe, M. S. Advani, and A. A. Lee, “Energy–entropy competition and the effectiveness of stochastic gradient descent in machine learning,” Molecular Physics, vol. 116, no. 21–22, pp. 3214–3223, 2018.
 [14] Y. Li and Y. Liang, “Learning overparameterized neural networks via stochastic gradient descent on structured data,” in Advances in Neural Information Processing Systems, 2018, pp. 8157–8166.
 [15] A. F. M. Agarap, “On breast cancer detection: an application of machine learning algorithms on the wisconsin diagnostic dataset,” in Proceedings of the 2nd International Conference on Machine Learning and Soft Computing. ACM, 2018, pp. 5–9.
 [16] V. Veitch, M. Austern, W. Zhou, D. M. Blei, and P. Orbanz, “Empirical risk minimization and stochastic gradient descent for relational data,” arXiv preprint arXiv:1806.10701, 2018.
 [17] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2009.
 [18] V. Cevher, S. Becker, and M. Schmidt, “Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics,” IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 32–43, 2014.
 [19] Y. Yang, G. Scutari, D. P. Palomar, and M. Pesavento, “A parallel decomposition method for nonconvex stochastic multiagent optimization problems,” IEEE Transactions on Signal Processing, vol. 64, no. 11, pp. 2949–2964, 2016.
 [20] A. Rakhlin, O. Shamir, and K. Sridharan, “Making gradient descent optimal for strongly convex stochastic optimization,” arXiv preprint arXiv:1109.5647, 2011.
 [21] O. Shamir and T. Zhang, “Stochastic gradient descent for nonsmooth optimization: Convergence results and optimal averaging schemes,” in ICML (1), 2013, pp. 71–79.
 [22] B. T. Polyak and A. B. Juditsky, “Acceleration of stochastic approximation by averaging,” SIAM Journal on Control and Optimization, vol. 30, no. 4, pp. 838–855, 1992.
 [23] D. P. Bertsekas and J. N. Tsitsiklis, “Gradient convergence in gradient methods with errors,” SIAM Journal on Optimization, vol. 10, no. 3, pp. 627–642, 2000.
 [24] J. Tsitsiklis, D. Bertsekas, and M. Athans, “Distributed asynchronous deterministic and stochastic gradient optimization algorithms,” IEEE Transactions on Automatic Control, vol. 31, no. 9, pp. 803–812, 1986.
 [25] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, “Robust stochastic approximation approach to stochastic programming,” SIAM Journal on Optimization, vol. 19, no. 4, pp. 1574–1609, 2009.
 [26] D. P. Kingma and J. Ba, “ADAM: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [27] X. Peng, L. Li, and F.-Y. Wang, “Accelerating minibatch stochastic gradient descent using typicality sampling,” arXiv preprint arXiv:1903.04192, 2019.
 [28] K. Lange, D. R. Hunter, and I. Yang, “Optimization transfer using surrogate objective functions,” Journal of Computational and Graphical Statistics, vol. 9, no. 1, pp. 1–20, 2000.
 [29] O. Cappé and E. Moulines, “Online expectation–maximization algorithm for latent data models,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 71, no. 3, pp. 593–613, 2009.
 [30] G. Gasso, A. Rakotomamonjy, and S. Canu, “Recovering sparse signals with a certain family of nonconvex penalties and DC programming,” IEEE Transactions on Signal Processing, vol. 57, no. 12, pp. 4686–4698, 2009.
 [31] S. J. Wright, R. D. Nowak, and M. A. Figueiredo, “Sparse reconstruction by separable approximation,” IEEE Transactions on Signal Processing, vol. 57, no. 7, pp. 2479–2493, 2009.
 [32] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.
 [33] J. Mairal, “Stochastic majorizationminimization algorithms for largescale optimization,” in Advances in Neural Information Processing Systems, 2013, pp. 2283–2291.
 [34] M. Razaviyayn, M. Sanjabi, and Z.-Q. Luo, “A stochastic successive minimization method for nonsmooth nonconvex optimization with applications to transceiver design in wireless communication networks,” Mathematical Programming, vol. 157, no. 2, pp. 515–545, 2016.
 [35] Y. Nesterov, “Gradient methods for minimizing composite functions,” Mathematical Programming, vol. 140, no. 1, pp. 125–161, 2013.
 [36] I. Necoara and D. Clipici, “Efficient parallel coordinate descent algorithm for convex optimization problems with separable constraints: application to distributed MPC,” Journal of Process Control, vol. 23, no. 3, pp. 243–253, 2013.
 [37] P. Tseng and S. Yun, “A coordinate gradient descent method for nonsmooth separable minimization,” Mathematical Programming, vol. 117, no. 1, pp. 387–423, 2009.
 [38] M. Razaviyayn, M. Hong, Z.-Q. Luo, and J.-S. Pang, “Parallel successive convex approximation for nonsmooth nonconvex optimization,” in Advances in Neural Information Processing Systems, 2014, pp. 1440–1448.
 [39] F. Facchinei, S. Sagratella, and G. Scutari, “Flexible parallel algorithms for big data optimization,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 7208–7212.
 [40] S. S. Ram, A. Nedić, and V. V. Veeravalli, “Incremental stochastic subgradient algorithms for convex optimization,” SIAM Journal on Optimization, vol. 20, no. 2, pp. 691–717, 2009.
 [41] J. Zhang, D. Zheng, and M. Chiang, “The impact of stochastic noisy feedback on distributed network utility maximization,” IEEE Transactions on Information Theory, vol. 54, no. 2, pp. 645–665, 2008.
 [42] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for largescale machine learning,” SIAM Review, vol. 60, no. 2, pp. 223–311, 2018.
 [43] P. Y. Pawar and S. Gawande, “A comparative study on different types of approaches to text categorization,” International Journal of Machine Learning and Computing, vol. 2, no. 4, p. 423, 2012.
 [44] Y. Feng and D. P. Palomar, “Normalization of linear support vector machines,” IEEE Transactions on Signal Processing, vol. 63, no. 17, pp. 4673–4688, 2015.
 [45] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter, “Pegasos: Primal estimated sub-gradient solver for SVM,” Mathematical Programming, vol. 127, no. 1, pp. 3–30, 2011.
 [46] Y.-J. Lee and O. L. Mangasarian, “SSVM: A smooth support vector machine for classification,” Computational Optimization and Applications, vol. 20, no. 1, pp. 5–22, 2001.
 [47] V. Cevher, S. Becker, and M. Schmidt, “Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics,” IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 32–43, 2014.
 [48] R. Byrd, G. M. Chin, W. Neveitt, and J. Nocedal, “On the use of stochastic Hessian information in unconstrained optimization,” SIAM J. Optim, vol. 21, no. 3, pp. 977–995, 2011.
 [49] A. Karpathy, “A peek at trends in machine learning,” 2017.
 [50] R. Collobert, S. Bengio, and Y. Bengio, “A parallel mixture of SVMs for very large scale problems,” in Advances in Neural Information Processing Systems, 2002, pp. 633–640.
 [51] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, “RCV1: A new benchmark collection for text categorization research,” Journal of Machine Learning Research, vol. 5, pp. 361–397, 2004.
 [52] T. Joachims, “Training linear SVMs in linear time,” in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006, pp. 217–226.
 [53] A. Ruszczyński, “Feasible direction methods for stochastic programming problems,” Mathematical Programming, vol. 19, no. 1, pp. 220–229, 1980.
Appendix A Convergence of the Proposed Algorithm
Lemma 1.
Under Algorithm 1, for any , we have
(11) 
where the weights are defined as
(12) 
Proof.
The proof follows from a simple induction by applying the update equation of recursively, as follows. For any fixed agent index , starting from and using the update equation in (4) along with the initial values of and , we have , which satisfies the form introduced in (11). Now assume that for some , . It suffices to show that , as well.
Using the update equation (4) for the iteration, we have
(13) 
Substituting , in the above equation results in
(14) 
where the second equality follows from the fact that according to the definition of in (12), .
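For intuition, if the update of the gradient estimate takes the generic averaging form $f^{t} = (1-\rho^{t})\,f^{t-1} + \rho^{t} g^{t}$ (an assumed form, consistent with the induction above), unrolling the recursion gives a weighted sum of the past terms:

```latex
f^{t} = \sum_{j=1}^{t} w_{j}^{t}\, g^{j},
\qquad
w_{j}^{t} = \rho^{j} \prod_{i=j+1}^{t} \left(1-\rho^{i}\right),
```

and, with $\rho^{1} = 1$, a simple induction gives $\sum_{j=1}^{t} w_{j}^{t} = 1$, which is the normalisation property used in the proof of Lemma 2.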
Lemma 2.
Under Algorithm 1 and for any , is bounded.
Proof.
The proof is a direct result of Lemma 1. Since according to Assumption 1b, are Lipschitz continuous over the domain , it follows that there exists some such that . Therefore, using (11), we have
(15) 
Moreover, from (12) and the assumption that , it is easy to verify that
(16) 
Combining (15) and (16) results in
(17) 
which completes the proof.
Lemma 3.
Under Algorithm 1, for any , we will have , and hence,
(18) 
Proof.
For any , since is the minimiser of and is a feasible point, we have
(19) 
Therefore, since, according to Lemma 2, is bounded, we have . Consequently, as is decreasing (due to Assumption 3a), .
Lemma 4.
Under Algorithm 1, the vector converges to the true gradient of the objective function, with probability one, i.e.,
(20) 
Proof.
We simply refer to [53, Lemma 1], and show that all of the required conditions of that lemma are satisfied in our case. Specifically, Conditions (a) and (b) of [53, Lemma 1] are satisfied due to Assumption 1a and Assumption 2, respectively. Moreover, Conditions (c) and (d) of [53, Lemma 1] are satisfied by Assumptions 3 and 2a. Finally, it remains to prove that Condition (e) of [53, Lemma 1] is satisfied under our problem as well, which will be shown in the following.
Note that since the objective function has a Lipschitz continuous gradient (according to Assumption 1), we have
(21) 
Moreover, due to Lemma 3, we have . These two facts, along with Assumption 3 on the sequences and , imply that the right-hand side of inequality (21) goes to zero as goes to infinity, and hence,
(22) 
Therefore, it follows from [53, Lemma 1] that , w.p.1.
Now, with the above lemmas, we proceed to prove Theorem 2. First of all, since the sequence lies in a compact set, it is sufficient to show that every limit point of the iterates is a stationary point of Problem (3). To show this, let be the limit point of a convergent subsequence . Note that since is a closed set, this limit point belongs to it, and hence, it is a feasible point of the optimization problem (3).
Lemma 5.
There exists a subsequence such that under the proposed Algorithm 1,
(23) 
Proof.
We prove this lemma by contradiction. Assume that the statement of the lemma is not true (i.e., there is no such subsequence). This means that for any , there exists some such that
(24) 
Furthermore, note that from the firstorder optimality conditions of (6), we have
(25) 
which, by substituting , leads to
(26) 
Moreover, it follows from Assumption 1c that
(27) 
Plugging inequalities (24) and (26) into (27), we obtain that for any ,
(28) 
Considering the facts that and (according to Lemmas 3 and 4, respectively) and the assumption that is a decreasing sequence (due to Assumption 3), it follows that there exists a sufficiently large such that for any , we have
(29) 
for some , with probability one. Therefore, it follows from (28) that for any ,
(30) 
and hence, invoking (24), we have
(31) 
Letting , the right-hand side of (31) goes to , as according to Assumption 3. This contradicts the boundedness of the objective function over the domain . Therefore, the contradiction assumption in (24) must be wrong, and this completes the proof.
A-A Proof of Theorem 1
First, note that according to Lemma 5, there exists a subsequence such that
. Let be the limit point of the convergent subsequence . Since is a closed and convex set, we have . In the following, we will show that this limit point is a stationary point for the stochastic optimization problem in (3).
Invoking the first-order optimality condition of (6), we have
(32) 
where the last equality follows from the facts that (as shown by Lemma 5) and (as shown by Lemma 4). Summing (32) over all , we obtain
(33) 
which is the desired first-order optimality condition for the considered optimization problem in (3). Therefore, the limit point is a stationary point of (3). This completes the proof of Theorem 1.
Note that in the case of a convex objective function , this stationary point is a global minimum, and hence we will have , where is the optimal value of the considered optimization problem in (3). (Note that this is the first place that makes use of the convexity of ; all the previous results are applicable to the nonconvex case, as well.)
A-B Proof of Theorem 2 (Convex case)
First, for any fixed , we define the following two sets
(34) 
(35) 
and the set of indices
(36) 
Second, we prove the following lemma, which will then be used in the proof of Theorem 2.
Lemma 6.
Proof.
First, note that if , then the statement of the lemma is obvious. Therefore, it suffices to prove the lemma for the case where . We prove this case by contradiction, as follows. Suppose that there exists no that satisfies (37). Then there should exist some subsequence such that
(38) 
This means that the sequence is convergent to a limit point. Let be the limit point of this convergent subsequence. Note that since is a closed and convex set, this limit point belongs to , as well. Using an analysis similar to that in (32), for the convergent sequence , it is easy to verify that
(39) 
and hence, the limit point is a stationary point to the optimization problem (3). Consequently, since the problem is assumed to be convex, this stationary point is the global minimum, and hence, we have , or equivalently
(40) 
Accordingly, for any fixed , there exists some sufficiently large such that
(41) 
Substituting , it follows that for sufficiently large , we have . This means that , which obviously contradicts the fact that . Therefore, the contradiction assumption must be wrong, and this completes the proof of the lemma.
Since the subsequence converges to a stationary point of (3), and the function is continuous (due to Assumption 1), it follows that there exists a sufficiently large such that, for any , . Consequently, due to the definition in (34), we have
(42) 
In the following, we will show by contradiction that for any , , as well.
Suppose , and hence, . Therefore, according to Lemma 6, there exists some fixed such that
(43) 
Now, using similar analysis to that in (24)–(28), it is easy to show that
(44) 
Moreover, for the considered sufficiently large , there exist some such that
(45) 
Therefore, combining the above three inequalities results in
(46) 
and hence, . Consequently, it follows that
(47) 
which indicates that . Obviously, this contradicts our initial assumption that . Therefore, this assumption must be wrong and .
The above result shows that, under our proposed algorithm, once we enter the region (which, as shown in (42), happens at a sufficiently large iteration ), we never leave it. Consequently, for any , there exists a sufficiently large such that
(48) 
or equivalently,
(49) 
Since the above inequality holds for any arbitrary , we conclude that
(50) 
which completes the proof of Theorem 2.
A-C Proof of Theorem 3 (Nonconvex case)
Let be the set of limit points of the proposed Algorithm 1. Note that, according to Lemma 3, this set is not empty. In the following, we prove that if a strictly local minimum of the optimization problem (3) belongs to this set, then this set has only a single element, i.e., the algorithm converges to that strictly local minimum point.
Let denote the strictly local minimum point belonging to the set of limit points . Since , there exists a subsequence of the iteration indices such that the sequence of iterates generated by Algorithm 1 converges to , i.e.,
(51) 
For any , let denote a ball of radius centred at . Now, for any sufficiently small such that contains only one stationary point (such a nonempty region exists because is a strictly local minimum point), we define the following sets of points:
(52) 