Introduction
Proximal gradient (PG) methods (Mine and Fukushima, 1981; Nesterov, 2004; Parikh, Boyd, and others, 2014) are a class of powerful optimization tools in artificial intelligence and machine learning. In general, they consider the following nonsmooth optimization problem:
(1)  $\min_{x \in \mathbb{R}^d} f(x) + \psi(x)$
where $f(x)$ usually is the loss function such as hinge loss and logistic loss, and $\psi(x)$ is the nonsmooth structure regularizer such as norm regularization. In recent research, Beck and Teboulle (2009) and Nesterov (2013) proposed accelerated PG methods to solve convex problems by using Nesterov's acceleration technique. After that, Li and Lin (2015) presented a class of accelerated PG methods for nonconvex optimization. More recently, Gu, Huo, and Huang (2018) introduced inexact PG methods for nonconvex nonsmooth optimization. To solve big data problems, incremental or stochastic PG methods (Bertsekas, 2011; Xiao and Zhang, 2014) were developed for large-scale convex optimization. Correspondingly, Ghadimi, Lan, and Zhang (2016) and Reddi et al. (2016) proposed stochastic PG methods for large-scale nonconvex optimization.
However, in many machine learning problems, the explicit expressions of gradients are difficult or infeasible to obtain. For example, in some complex graphical model inference (Wainwright, Jordan, and others, 2008) and structure prediction problems (Sokolov, Hitschler, and Riezler, 2018), it is difficult to compute the explicit gradients of the objective functions. Even worse, in bandit (Shamir, 2017) and black-box learning (Chen et al., 2017) problems, only the objective function values are available (the explicit gradients cannot be calculated). Clearly, the above PG methods fail in these scenarios. The gradient-free (zeroth-order) optimization method (Nesterov and Spokoiny, 2017) is a promising choice to address these problems because it only uses function values in the optimization process. Thus, gradient-free optimization methods have been increasingly embraced for solving many machine learning problems (Conn, Scheinberg, and Vicente, 2009).
Algorithm  Reference  Gradient estimator  Problem  Convergence rate
RSGF  Ghadimi and Lan (2013)  GauSGE  S(NC)  
ZOSVRG  Liu et al. (2018c)  CooSGE  S(NC)  
SZVRG  Liu et al. (2018a)  GauSGE  S(NC)  
SZVRG  Liu et al. (2018a)  GauSGE  NS(NC)  
RSPGF  Ghadimi, Lan, and Zhang (2016)  GauSGE  S(NC) + NS(C)  
ZOProxSVRG  Ours  CooSGE  S(NC) + NS(C)  
ZOProxSVRG  Ours  GauSGE  S(NC) + NS(C)  
ZOProxSAGA  Ours  CooSGE  S(NC) + NS(C)  
ZOProxSAGA  Ours  GauSGE  S(NC) + NS(C)  
Although many gradient-free methods have recently been developed and studied (Agarwal, Dekel, and Xiao, 2010; Nesterov and Spokoiny, 2017; Liu et al., 2018b), they often suffer from the high variance of zeroth-order gradient estimates. In addition, these algorithms are mainly designed for smooth or convex settings, as discussed in the related works below, which limits their applicability to the wide range of nonconvex nonsmooth machine learning problems that involve nonconvex loss functions and nonsmooth regularization.
In this paper, we therefore propose a class of faster gradient-free proximal stochastic methods for solving the following nonconvex nonsmooth problem:
(2)  $\min_{x \in \mathbb{R}^d} F(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x) + \psi(x)$
where each $f_i(x)$ is a nonconvex and smooth loss function, and $\psi(x)$ is a convex and nonsmooth regularization term. Until now, there have been few zeroth-order stochastic methods for solving problem (2), except a recent attempt proposed in (Ghadimi, Lan, and Zhang, 2016). Specifically, Ghadimi, Lan, and Zhang (2016) proposed a randomized stochastic projected gradient-free method (RSPGF), i.e., a zeroth-order proximal stochastic gradient method. However, due to the large variance of the zeroth-order estimated gradient, which results from randomly selecting both the sample and the direction of the derivative, the RSPGF only has a convergence rate of $O(1/\sqrt{T})$, which is significantly slower than $O(1/T)$, the best convergence rate of the zeroth-order stochastic algorithms, where $T$ is the number of iterations. To accelerate the RSPGF algorithm, we use the variance reduction strategies of the first-order methods, i.e., SVRG (Xiao and Zhang, 2014) and SAGA (Defazio, Bach, and Lacoste-Julien, 2014), to reduce the variance of the estimated stochastic gradient.
Although SVRG and SAGA have shown good performance, applying these strategies to the zeroth-order method is not a trivial task. The main challenge is that both SVRG and SAGA rely on the assumption that the stochastic gradient is an unbiased estimate of the true full gradient. However, this assumption does not hold in the zeroth-order algorithms. In this paper, we therefore fill this gap between the zeroth-order proximal stochastic method and the classic variance reduction approaches (SVRG and SAGA).
Main Contributions
Our main contributions are summarized as follows:

We propose a class of faster gradient-free proximal stochastic methods (ZOProxSVRG and ZOProxSAGA), based on the variance reduction techniques of SVRG and SAGA. Our new algorithms only use the objective function values in the optimization process.

Moreover, we provide a theoretical analysis of the convergence properties of both the new ZOProxSVRG and ZOProxSAGA methods. Table 1 shows the specific convergence rates of the proposed algorithms and other related ones. In particular, our algorithms have a faster convergence rate than the RSPGF (Ghadimi, Lan, and Zhang, 2016), the existing zeroth-order stochastic PG algorithm for solving nonconvex nonsmooth problems.

Extensive experimental results and theoretical analysis demonstrate the effectiveness of our algorithms.
Related Works
Gradient-free (zeroth-order) methods have been effectively used to solve many machine learning problems where the explicit gradient is difficult or infeasible to obtain, and have also been widely studied. For example, Nesterov and Spokoiny (2017) proposed several random gradient-free methods by using the Gaussian smoothing technique. Duchi et al. (2015) proposed a zeroth-order mirror descent algorithm. More recently, Yu et al. (2018) and Dvurechensky, Gasnikov, and Gorbunov (2018) presented accelerated zeroth-order methods for convex optimization. To solve nonsmooth problems, zeroth-order online or stochastic ADMM methods (Liu et al., 2018b; Gao, Jiang, and Zhang, 2018) have been introduced.
The above zeroth-order methods mainly focus on (strongly) convex problems. In fact, there exist many nonconvex machine learning tasks whose explicit gradients are not available, such as the nonconvex black-box learning problems (Chen et al., 2017; Liu et al., 2018c). Thus, several recent works have begun to study zeroth-order stochastic methods for nonconvex optimization. For example, Ghadimi and Lan (2013) proposed the randomized stochastic gradient-free (RSGF) method, i.e., a zeroth-order stochastic gradient method. To accelerate optimization, more recently, Liu et al. (2018c, a) proposed the zeroth-order stochastic variance reduction gradient (ZOSVRG) methods. Moreover, to solve large-scale machine learning problems, some asynchronous parallel stochastic zeroth-order algorithms have been proposed in (Gu, Huo, and Huang, 2016; Lian et al., 2016; Gu et al., 2018).
Although the above zeroth-order stochastic methods can effectively solve nonconvex optimization problems, there are few zeroth-order stochastic methods for nonconvex nonsmooth composite optimization except the RSPGF method presented in (Ghadimi, Lan, and Zhang, 2016). In addition, Liu et al. (2018a) have also studied a zeroth-order algorithm for solving a nonconvex nonsmooth problem, which is different from problem (2).
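Zeroth-Order Proximal Stochastic Method Revisit
In this section, we briefly review the zeroth-order proximal stochastic gradient (ZOProxSGD) method for solving problem (2). Before that, we first revisit the proximal gradient descent (ProxGD) method (Mine and Fukushima, 1981).
ProxGD is an effective method to solve problem (2) via the following iteration:
(3)  $x_{t+1} = \mathrm{Prox}_{\eta\psi}\big(x_t - \eta \nabla f(x_t)\big), \quad t = 0, 1, \ldots$
where $f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$, $\eta > 0$ is a step size, and $\mathrm{Prox}_{\eta\psi}(\cdot)$ is a proximal operator defined as:
(4)  $\mathrm{Prox}_{\eta\psi}(x) = \arg\min_{y \in \mathbb{R}^d} \big\{ \psi(y) + \frac{1}{2\eta}\|y - x\|^2 \big\}$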
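As an illustration of the ProxGD iteration (3) and the proximal operator (4), the following minimal sketch assumes that the regularizer is the $\ell_1$ norm scaled by a weight `lam`, whose proximal operator is coordinate-wise soft-thresholding. The function names `prox_l1` and `prox_gd_step`, the toy quadratic smooth part, and all parameter values are illustrative assumptions rather than part of the method description.

```python
def prox_l1(x, eta, lam):
    """Prox of eta * lam * ||.||_1: coordinate-wise soft-thresholding (assumed regularizer)."""
    thr = eta * lam
    return [max(abs(v) - thr, 0.0) * (1.0 if v > 0 else -1.0) for v in x]

def prox_gd_step(x, grad, eta, lam):
    """One ProxGD iteration: gradient step on the smooth part, then the prox of the nonsmooth part."""
    z = [xi - eta * gi for xi, gi in zip(x, grad(x))]
    return prox_l1(z, eta, lam)

# Toy smooth part f(x) = 0.5 * ||x - c||^2, whose gradient is x - c.
c = [3.0, -0.2, 0.5]
grad = lambda x: [xi - ci for xi, ci in zip(x, c)]

x = [0.0, 0.0, 0.0]
for _ in range(200):
    x = prox_gd_step(x, grad, eta=0.5, lam=1.0)
```

On this toy problem the minimizer is the soft-thresholded vector (2, 0, 0), so the small entries of `c` are set exactly to zero by the proximal step.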
As discussed above, because ProxGD needs to compute the gradient at each iteration, it cannot be applied to problems where the explicit gradient of the function $f(x)$ is not available. For example, in the black-box machine learning model, only function values (e.g., prediction results) are available (Chen et al., 2017). To avoid computing the explicit gradient, we use zeroth-order gradient estimators (Nesterov and Spokoiny, 2017; Liu et al., 2018c) to estimate the gradient using only function values.

Specifically, we use the Gaussian Smoothing Gradient Estimator (GauSGE) (Nesterov and Spokoiny, 2017; Ghadimi, Lan, and Zhang, 2016) to estimate the gradients as follows:
(5)  $\hat{\nabla} f_i(x) = \frac{f_i(x + \mu u_i) - f_i(x)}{\mu}\, u_i, \quad i \in \{1, \ldots, n\}$
where $\mu$ is a smoothing parameter, and $\{u_i\}_{i=1}^{n}$ denote i.i.d. random directions drawn from a zero-mean isotropic multivariate Gaussian distribution $\mathcal{N}(0, I_d)$.
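The Gaussian smoothing estimator can be sketched as follows. This is a minimal illustration, assuming a forward difference along one fresh Gaussian direction per call; the name `gau_sge`, the quadratic test function, and the sample count are assumptions made for the example. Averaging many estimates is done here only to show that the estimator points, on average, close to the true gradient.

```python
import random

def gau_sge(f, x, mu, rng):
    """Gaussian smoothing estimate (f(x + mu*u) - f(x)) / mu * u with u ~ N(0, I_d)."""
    u = [rng.gauss(0.0, 1.0) for _ in x]
    shifted = f([xi + mu * ui for xi, ui in zip(x, u)])
    scale = (shifted - f(x)) / mu          # directional finite difference
    return [scale * ui for ui in u]

f = lambda x: sum(xi * xi for xi in x)     # toy function; true gradient is 2x
rng = random.Random(0)
x = [1.0, -2.0]

n_samples = 20000
avg = [0.0, 0.0]
for _ in range(n_samples):
    g = gau_sge(f, x, mu=1e-3, rng=rng)
    avg = [a + gi / n_samples for a, gi in zip(avg, g)]
```

Each call uses only two function values, but a single estimate is noisy; this is the variance that the methods discussed in this paper aim to control.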
Moreover, to obtain a better gradient estimate, we can use the Coordinate Smoothing Gradient Estimator (CooSGE) (Gu, Huo, and Huang, 2016; Gu et al., 2018; Liu et al., 2018c) to estimate the gradients as follows:
(6)  $\hat{\nabla} f_i(x) = \sum_{j=1}^{d} \frac{f_i(x + \mu_j e_j) - f_i(x - \mu_j e_j)}{2\mu_j}\, e_j, \quad i \in \{1, \ldots, n\}$
where $\mu_j$ is a coordinate-wise smoothing parameter, and $e_j$ is a standard basis vector with 1 at its $j$-th coordinate, and 0 otherwise. Although the CooSGE needs more function queries than the GauSGE, it yields a better gradient estimate, and can even allow the algorithms to obtain a faster convergence rate.
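The coordinate-wise estimator can be sketched in a few lines. This is a minimal illustration, assuming the same smoothing parameter `mu` for every coordinate; the name `coo_sge` and the test function are assumptions made for the example.

```python
def coo_sge(f, x, mu):
    """Coordinate-wise central differences: (f(x + mu*e_j) - f(x - mu*e_j)) / (2*mu) for each j."""
    grad = []
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += mu                         # x + mu * e_j
        xm[j] -= mu                         # x - mu * e_j
        grad.append((f(xp) - f(xm)) / (2.0 * mu))
    return grad

# Toy function with known gradient (2*x0 + 3*x1, 3*x0).
f = lambda x: x[0] ** 2 + 3.0 * x[0] * x[1]
g = coo_sge(f, [1.0, 2.0], mu=1e-4)
```

The estimate costs 2d function queries for a d-dimensional point, compared with two queries for a single Gaussian direction, which is the trade-off described above.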
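Finally, based on these estimated gradients, we give a zeroth-order proximal gradient descent (ZOProxGD) method, which performs the following iteration:
(7)  $x_{t+1} = \mathrm{Prox}_{\eta\psi}\big(x_t - \eta \hat{\nabla} f(x_t)\big), \quad t = 0, 1, \ldots$
where $\hat{\nabla} f(x) = \frac{1}{n}\sum_{i=1}^{n} \hat{\nabla} f_i(x)$.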
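The zeroth-order proximal gradient iteration can be sketched by replacing the gradient in ProxGD with a function-value-based estimate. The sketch below assumes the coordinate-wise estimator with a single smoothing parameter and an $\ell_1$ regularizer with weight `lam`; the names `coo_sge`, `soft_threshold`, and `zo_prox_gd` and the toy objective are illustrative assumptions.

```python
def coo_sge(f, x, mu):
    """Coordinate-wise central-difference estimate of the gradient of f at x."""
    g = []
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += mu
        xm[j] -= mu
        g.append((f(xp) - f(xm)) / (2.0 * mu))
    return g

def soft_threshold(z, thr):
    """Prox of thr * ||.||_1 (the regularizer assumed in this sketch)."""
    return [max(abs(v) - thr, 0.0) * (1.0 if v > 0 else -1.0) for v in z]

def zo_prox_gd(f, x0, eta, lam, mu, iters):
    """Zeroth-order proximal gradient descent: only values of f are queried."""
    x = list(x0)
    for _ in range(iters):
        g = coo_sge(f, x, mu)
        x = soft_threshold([xi - eta * gi for xi, gi in zip(x, g)], eta * lam)
    return x

c = [3.0, -0.2, 0.5]
f = lambda x: 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, c))
x = zo_prox_gd(f, [0.0, 0.0, 0.0], eta=0.5, lam=1.0, mu=1e-4, iters=200)
```

On this toy objective the iterates approach (2, 0, 0), the same point that ProxGD with exact gradients would reach, up to finite-difference rounding.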
New Faster Zeroth-Order Proximal Stochastic Methods
In this section, to efficiently solve large-scale nonconvex nonsmooth problems, we propose a class of faster zeroth-order proximal stochastic methods with the variance reduction (VR) techniques of SVRG and SAGA, respectively.
ZOProxSVRG
In this subsection, we propose the zeroth-order proximal SVRG (ZOProxSVRG) method by using the VR technique of SVRG in (Xiao and Zhang, 2014; Reddi et al., 2016).
The corresponding algorithmic framework is described in Algorithm 1, where we use a mixture stochastic gradient $\hat{v}_t^s = \hat{\nabla} f_{\mathcal{I}_t}(x_t^s) - \hat{\nabla} f_{\mathcal{I}_t}(\tilde{x}^s) + \hat{\nabla} f(\tilde{x}^s)$. Note that $\mathbb{E}_{\mathcal{I}_t}[\hat{v}_t^s] = \hat{\nabla} f(x_t^s) \neq \nabla f(x_t^s)$, i.e., this stochastic gradient is a biased estimate of the true full gradient. Although SVRG has shown great promise, it relies upon the assumption that the stochastic gradient is an unbiased estimate of the true full gradient. Thus, adapting the ideas of SVRG to zeroth-order optimization is not a trivial task. To address this issue, we analyze the upper bound for the variance of the estimated gradient, and choose the appropriate step size and smoothing parameter to control this variance, which is discussed in detail in the theorems below.
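Since Algorithm 1 is only referenced here, the following is a rough sketch of an SVRG-style zeroth-order proximal loop, not the authors' exact pseudocode. It assumes the coordinate-wise estimator, an $\ell_1$ regularizer with weight `lam`, sampling with replacement, and a convex toy finite sum chosen so that the result can be checked; all names (`zo_prox_svrg`, `coo_sge`, `soft_threshold`) and parameter values are hypothetical.

```python
import random

def coo_sge(f, x, mu):
    """Coordinate-wise central-difference estimate of the gradient of f at x."""
    g = []
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += mu
        xm[j] -= mu
        g.append((f(xp) - f(xm)) / (2.0 * mu))
    return g

def soft_threshold(z, thr):
    """Prox of thr * ||.||_1 (assumed regularizer)."""
    return [max(abs(v) - thr, 0.0) * (1.0 if v > 0 else -1.0) for v in z]

def zo_prox_svrg(fs, x0, eta, lam, mu, epochs, m, b, seed=0):
    rng = random.Random(seed)
    n, d = len(fs), len(x0)
    x = list(x0)
    for _ in range(epochs):                               # outer loop
        snap_grads = [coo_sge(fi, x, mu) for fi in fs]    # estimates at the snapshot point
        full = [sum(g[j] for g in snap_grads) / n for j in range(d)]
        for _ in range(m):                                # inner loop
            batch = [rng.randrange(n) for _ in range(b)]
            v = list(full)                                # mixture gradient: full + batch correction
            for i in batch:
                gi = coo_sge(fs[i], x, mu)
                for j in range(d):
                    v[j] += (gi[j] - snap_grads[i][j]) / b
            x = soft_threshold([xj - eta * vj for xj, vj in zip(x, v)], eta * lam)
    return x

# Toy finite sum: f_i(x) = 0.5 * a_i * ||x - c_i||^2 (convex, for checkability only).
a = [1.0, 2.0, 3.0]
c = [[3.0, 0.1], [2.0, -0.1], [1.0, 0.2]]
fs = [lambda x, ai=ai, ci=ci: 0.5 * ai * sum((xj - cj) ** 2 for xj, cj in zip(x, ci))
      for ai, ci in zip(a, c)]
x = zo_prox_svrg(fs, [0.0, 0.0], eta=0.05, lam=1.0, mu=1e-4, epochs=50, m=20, b=1)
```

The correction term shrinks as the iterate and the snapshot approach each other, which is the variance-reduction effect discussed in the lemmas. On this toy problem the iterates approach (7/6, 0).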
Next, we derive the upper bounds for the variance of estimated gradient based on the CooSGE and the GauSGE, respectively.
Lemma 1.
In Algorithm 1 using the CooSGE, given the mixture estimated gradient , then the following inequality holds
(9) 
where .
Remark 1.
Lemma 1 shows that variance of has an upper bound. As the number of iterations increases, both and will approach the same stationary point , then the variance of stochastic gradient decreases, but does not vanish, due to using the zeroth-order estimated gradient.
Lemma 2.
In Algorithm 1 using the GauSGE, given the estimated gradient , then the following inequality holds
(10) 
Remark 2.
Lemma 2 shows that variance of has an upper bound. As the number of iterations increases, both and will approach the same stationary point , then the variance of stochastic gradient decreases.
ZOProxSAGA
In this subsection, we propose the zeroth-order proximal SAGA (ZOProxSAGA) method by using the VR technique of SAGA in (Defazio, Bach, and Lacoste-Julien, 2014; Reddi et al., 2016).
The corresponding algorithmic description is given in Algorithm 2, where we use a mixture stochastic gradient . Similarly, , i.e., this stochastic gradient is a biased estimate of the true full gradient. Note that in Algorithm 2, due to , step 8 can directly use the term , which is computed in step 5, to avoid unnecessary calculations. Next, we give the upper bounds for the variance of the stochastic gradient based on the CooSGE and the GauSGE, respectively.
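Since Algorithm 2 is only referenced here, the following is a rough sketch of a SAGA-style zeroth-order proximal loop, not the authors' exact pseudocode. It assumes the coordinate-wise estimator, an $\ell_1$ regularizer with weight `lam`, a single mini-batch per iteration used both for the update and for refreshing the stored estimates, and a convex toy finite sum chosen for checkability; all names and parameter values are hypothetical.

```python
import random

def coo_sge(f, x, mu):
    """Coordinate-wise central-difference estimate of the gradient of f at x."""
    g = []
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += mu
        xm[j] -= mu
        g.append((f(xp) - f(xm)) / (2.0 * mu))
    return g

def soft_threshold(z, thr):
    """Prox of thr * ||.||_1 (assumed regularizer)."""
    return [max(abs(v) - thr, 0.0) * (1.0 if v > 0 else -1.0) for v in z]

def zo_prox_saga(fs, x0, eta, lam, mu, iters, b, seed=0):
    rng = random.Random(seed)
    n, d = len(fs), len(x0)
    x = list(x0)
    table = [coo_sge(fi, x, mu) for fi in fs]             # stored estimate for each f_i
    avg = [sum(g[j] for g in table) / n for j in range(d)]
    for _ in range(iters):
        batch = rng.sample(range(n), b)                   # distinct indices
        new = {i: coo_sge(fs[i], x, mu) for i in batch}
        # Mixture gradient: table average plus the mini-batch correction.
        v = [avg[j] + sum(new[i][j] - table[i][j] for i in batch) / b for j in range(d)]
        x = soft_threshold([xj - eta * vj for xj, vj in zip(x, v)], eta * lam)
        for i in batch:                                   # refresh the table and its average
            for j in range(d):
                avg[j] += (new[i][j] - table[i][j]) / n
            table[i] = new[i]
    return x

# Toy finite sum: f_i(x) = 0.5 * a_i * ||x - c_i||^2 (convex, for checkability only).
a = [1.0, 2.0, 3.0]
c = [[3.0, 0.1], [2.0, -0.1], [1.0, 0.2]]
fs = [lambda x, ai=ai, ci=ci: 0.5 * ai * sum((xj - cj) ** 2 for xj, cj in zip(x, ci))
      for ai, ci in zip(a, c)]
x = zo_prox_saga(fs, [0.0, 0.0], eta=0.05, lam=1.0, mu=1e-4, iters=3000, b=1)
```

Unlike the SVRG-style loop, no periodic full pass over all samples is needed after initialization; the price is storing one gradient estimate per sample. On this toy problem the iterates approach (7/6, 0).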
Lemma 3.
In Algorithm 2 using the CooSGE, given the estimated gradient with , then the following inequality holds
(11) 
Remark 3.
Lemma 3 shows that variance of has an upper bound. As the number of iterations increases, both and will approach the same stationary point, then the variance of stochastic gradient decreases.
Lemma 4.
In Algorithm 2 using GauSGE, given the estimated gradient with , then the following inequality holds
(12) 
Remark 4.
Lemma 4 shows that variance of has an upper bound. As the number of iterations increases, both and will approach the same stationary point , then the variance of stochastic gradient decreases.
Convergence Analysis
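In this section, we conduct the convergence analysis of both ZOProxSVRG and ZOProxSAGA. First, we give some mild assumptions regarding problem (2) as follows:
Assumption 1.
For $\forall i \in \{1, 2, \ldots, n\}$, the gradient of the function $f_i$ is Lipschitz continuous with a Lipschitz constant $L > 0$, such that
$\|\nabla f_i(x) - \nabla f_i(y)\| \le L \|x - y\|, \quad \forall x, y \in \mathbb{R}^d,$
which implies
$f_i(x) \le f_i(y) + \nabla f_i(y)^T (x - y) + \frac{L}{2}\|x - y\|^2, \quad \forall x, y \in \mathbb{R}^d.$
Assumption 2.
The gradient is bounded as $\|\nabla f_i(x)\|^2 \le \sigma^2$ for all $i \in \{1, 2, \ldots, n\}$.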
The first assumption is standard for the convergence analysis of the zeroth-order algorithms (Ghadimi, Lan, and Zhang, 2016; Nesterov and Spokoiny, 2017; Liu et al., 2018c). The second assumption gives the bounded gradient used in (Nesterov and Spokoiny, 2017; Liu et al., 2018b), which is stricter than the bounded variance of the gradient in (Lian et al., 2016; Liu et al., 2018c, a), because we need to analyze the more complex problem (2), which includes a nonsmooth part. Next, we introduce the standard gradient mapping (Parikh, Boyd, and others, 2014) used in the convergence analysis as follows:
(13)  $g_\eta(x) = \frac{1}{\eta}\big(x - \mathrm{Prox}_{\eta\psi}(x - \eta \nabla f(x))\big)$
For the nonconvex problems, if $g_\eta(x) = 0$, the point $x$ is a critical point (Parikh, Boyd, and others, 2014). Thus, we can use the following definition as the convergence metric.
Definition 1.
(Reddi et al., 2016) A solution $x$ is called $\epsilon$-accurate, if $\mathbb{E}\|g_\eta(x)\|^2 \le \epsilon$ for some $\eta > 0$.
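To make the convergence metric concrete, the sketch below evaluates the squared norm of the gradient mapping at a critical point and at a non-critical point. It assumes an $\ell_1$ regularizer with weight `lam` (so the proximal operator is soft-thresholding) and a toy quadratic smooth part; the names `gradient_mapping`, `soft_threshold`, and `sq_norm` are illustrative assumptions.

```python
def soft_threshold(z, thr):
    """Prox of thr * ||.||_1 (assumed regularizer)."""
    return [max(abs(v) - thr, 0.0) * (1.0 if v > 0 else -1.0) for v in z]

def gradient_mapping(x, grad, eta, lam):
    """(x - Prox(x - eta * grad(x))) / eta, computed with the true gradient."""
    z = [xi - eta * gi for xi, gi in zip(x, grad(x))]
    p = soft_threshold(z, eta * lam)
    return [(xi - pi) / eta for xi, pi in zip(x, p)]

def sq_norm(v):
    return sum(vi * vi for vi in v)

# Toy smooth part f(x) = 0.5 * ||x - c||^2; with lam = 1 the minimizer is (2, 0, 0).
c = [3.0, -0.2, 0.5]
grad = lambda x: [xi - ci for xi, ci in zip(x, c)]

sq_star = sq_norm(gradient_mapping([2.0, 0.0, 0.0], grad, eta=0.5, lam=1.0))
sq_zero = sq_norm(gradient_mapping([0.0, 0.0, 0.0], grad, eta=0.5, lam=1.0))
```

Here `sq_star` is zero because the point is a fixed point of the proximal gradient step, while `sq_zero` is strictly positive. When $\psi$ is identically zero, the gradient mapping reduces to the ordinary gradient.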
Convergence Analysis of ZOProxSVRG
In this subsection, we present the convergence analysis of the ZOProxSVRG with the CooSGE (ZOProxSVRGCooSGE) and the GauSGE (ZOProxSVRGGauSGE), respectively.
Theorem 1.
Remark 5.
Theorem 1 shows that, given , and , the ZOProxSVRGCooSGE has convergence rate.
Theorem 2.
Remark 6.
Theorem 2 shows that given , and , the ZOProxSVRGGauSGE has convergence rate, in which the part arises from the GauSGE.
Convergence Analysis of ZOProxSAGA
In this subsection, we provide the convergence analysis of the ZOProxSAGA with the CooSGE (ZOProxSAGACooSGE) and the GauSGE (ZOProxSAGAGauSGE), respectively.
Theorem 3.
Remark 7.
Theorem 3 shows that given and , the ZOProxSAGACooSGE has convergence rate.
Theorem 4.
Remark 8.
Theorem 4 shows that given and , the ZOProxSAGAGauSGE has convergence rate, in which the part arises from the GauSGE.
All related proofs are in the supplementary document.
Experiments
In this section, we compare the proposed algorithms (ZOProxSVRGCooSGE, ZOProxSVRGGauSGE, ZOProxSAGACooSGE, ZOProxSAGAGauSGE) with the RSPGF method (Ghadimi, Lan, and Zhang, 2016) on two applications: black-box binary classification and adversarial attacks on black-box deep neural networks (DNNs). Note that the RSPGF uses the GauSGE to estimate gradients.
Black-Box Binary Classification
Experimental Setup
In this experiment, we apply our algorithms to the black-box binary classification problem. Specifically, given a set of training samples , where and , we find the optimal predictor by solving the following problem:
(30) 
where is the black-box loss function, which returns only the function value for a given input. Here, we use the nonconvex sigmoid loss function in the black-box setting.
In the experiment, we use publicly available real datasets (20news is from https://cs.nyu.edu/~roweis/data.html; a9a, w8a and covtype.binary are from www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/), which are summarized in Table 2. In the algorithms, we fix the mini-batch size , the smoothing parameters in the GauSGE and in the CooSGE. Meanwhile, we fix , and use the same initial solution from the standard normal distribution in each experiment. For each dataset, we use half of the samples as training data, and the rest as testing data.
datasets  #samples  #features  #classes
20news  16,242  100  2
a9a  32,561  123  2
w8a  64,700  300  2
covtype.binary  581,012  54  2
Experimental Results
Figures 1 and 2 show that both the objective values and the test losses of the proposed methods decrease faster than those of the RSPGF method as time increases. In particular, both the ZOProxSVRG and ZOProxSAGA using the CooSGE perform better than their counterparts using the GauSGE. From these results, we find that the CooSGE performs better than the GauSGE in estimating gradients. Moreover, these results also demonstrate that both the ZOProxSVRG and ZOProxSAGA using the CooSGE have a relatively faster convergence rate than their counterparts using the GauSGE. Since the ZOProxSAGA has lower function query complexity than the ZOProxSVRG, it performs better than the ZOProxSVRG. For example, the ZOProxSVRGCooSGE needs function queries, while ZOProxSAGACooSGE needs function queries.
Adversarial Attacks on Black-Box DNNs
In this experiment, we apply our methods to generate adversarial examples to attack a pretrained neural network model. Following (Chen et al., 2017; Liu et al., 2018c), the parameters of the given model are hidden from us and only its outputs are accessible. In this case, we cannot compute the gradients by using the backpropagation algorithm. Thus, we use the zeroth-order algorithms to find a universal adversarial perturbation that could fool the samples , which can be specified as the following elastic-net attack to black-box DNNs problem:
(31) 
where and are nonnegative parameters to balance attack success rate, distortion and sparsity. Here represents the final layer output of the neural network, which is the probabilities of classes.
Following (Liu et al., 2018c), we use a pretrained DNN (https://github.com/carlini/nnrobustattacks) on the MNIST dataset as the target black-box model, which achieves 99.4% test accuracy. In the experiment, we select examples from the same class, and set the batch size and a constant step size for the zeroth-order algorithms, where . In addition, we set and in the experiment.
Figure 3 shows that both the objective values and the black-box attack losses (i.e., the first part of problem (31)) of the proposed algorithms decrease faster than those of the RSPGF method as the number of iterations increases. Here, we add the ZOProxSGDCooSGE method for comparison, which is obtained by combining the ZOProxSGD method with the CooSGE. Interestingly, the ZOProxSGDCooSGE performs better than both the ZOProxSVRGGauSGE and ZOProxSAGAGauSGE, which further demonstrates that the CooSGE can perform better than the GauSGE in estimating gradients. Although it performs relatively well in generating adversarial samples, the ZOProxSGDCooSGE still performs worse than both the ZOProxSVRGCooSGE and ZOProxSAGACooSGE, because it does not use the VR technique.
Conclusions
In this paper, we proposed a class of faster gradient-free proximal stochastic methods based on the zeroth-order gradient estimators, i.e., the GauSGE and the CooSGE, which only use the objective function values in the optimization. Moreover, we provided a theoretical analysis of the convergence properties of the proposed algorithms (ZOProxSVRG and ZOProxSAGA) based on the CooSGE and the GauSGE, respectively. In particular, both the ZOProxSVRG and ZOProxSAGA using the CooSGE have relatively faster convergence rates than their counterparts using the GauSGE, since the CooSGE performs better than the GauSGE in estimating gradients.
Acknowledgments
F. Huang and S. Chen were partially supported by the Natural Science Foundation of China (NSFC) under Grant No. 61806093 and No. 61682281, and the Key Program of NSFC under Grant No. 61732006, and Jiangsu Postdoctoral Research Grant Program No. 2018K004A. F. Huang, Z. Huo, H. Huang were partially supported by U.S. NSF IIS 1836945, IIS 1836938, DBI 1836866, IIS 1845666, IIS 1852606, IIS 1838627, IIS 1837956.
References
Agarwal, Dekel, and Xiao (2010) Agarwal, A.; Dekel, O.; and Xiao, L. 2010. Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT, 28–40. Citeseer.
Beck and Teboulle (2009) Beck, A., and Teboulle, M. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1):183–202.
Bertsekas (2011) Bertsekas, D. P. 2011. Incremental proximal methods for large scale convex optimization. Mathematical Programming 129(2):163–195.
Chen et al. (2017) Chen, P.-Y.; Zhang, H.; Sharma, Y.; Yi, J.; and Hsieh, C.-J. 2017. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In The 10th ACM Workshop on Artificial Intelligence and Security, 15–26. ACM.
Conn, Scheinberg, and Vicente (2009) Conn, A. R.; Scheinberg, K.; and Vicente, L. N. 2009. Introduction to derivative-free optimization, volume 8. SIAM.
Defazio, Bach, and Lacoste-Julien (2014) Defazio, A.; Bach, F.; and Lacoste-Julien, S. 2014. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 1646–1654.
Duchi et al. (2015) Duchi, J. C.; Jordan, M. I.; Wainwright, M. J.; and Wibisono, A. 2015. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory 61(5):2788–2806.
Dvurechensky, Gasnikov, and Gorbunov (2018) Dvurechensky, P.; Gasnikov, A.; and Gorbunov, E. 2018. An accelerated method for derivative-free smooth stochastic convex optimization. arXiv preprint arXiv:1802.09022.
Gao, Jiang, and Zhang (2018) Gao, X.; Jiang, B.; and Zhang, S. 2018. On the information-adaptive variants of the ADMM: an iteration complexity perspective. Journal of Scientific Computing 76(1):327–363.
Ghadimi and Lan (2013) Ghadimi, S., and Lan, G. 2013. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23:2341–2368.
Ghadimi, Lan, and Zhang (2016) Ghadimi, S.; Lan, G.; and Zhang, H. 2016. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming 155(1-2):267–305.
Gu et al. (2018) Gu, B.; Huo, Z.; Deng, C.; and Huang, H. 2018. Faster derivative-free stochastic algorithm for shared memory machines. In ICML, 1807–1816.
Gu, Huo, and Huang (2016) Gu, B.; Huo, Z.; and Huang, H. 2016. Zeroth-order asynchronous doubly stochastic algorithm with variance reduction. arXiv preprint arXiv:1612.01425.
Gu, Huo, and Huang (2018) Gu, B.; Huo, Z.; and Huang, H. 2018. Inexact proximal gradient methods for non-convex and non-smooth optimization. In AAAI.
Li and Lin (2015) Li, H., and Lin, Z. 2015. Accelerated proximal gradient methods for nonconvex programming. In Advances in Neural Information Processing Systems, 379–387.
Lian et al. (2016) Lian, X.; Zhang, H.; Hsieh, C.-J.; Huang, Y.; and Liu, J. 2016. A comprehensive linear speedup analysis for asynchronous stochastic parallel optimization from zeroth-order to first-order. In Advances in Neural Information Processing Systems, 3054–3062.
Liu et al. (2018a) Liu, L.; Cheng, M.; Hsieh, C.-J.; and Tao, D. 2018a. Stochastic zeroth-order optimization via variance reduction method. CoRR abs/1805.11811.
Liu et al. (2018b) Liu, S.; Chen, J.; Chen, P.-Y.; and Hero, A. 2018b. Zeroth-order online alternating direction method of multipliers: Convergence analysis and applications. In The Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84, 288–297.
Liu et al. (2018c) Liu, S.; Kailkhura, B.; Chen, P.-Y.; Ting, P.; Chang, S.; and Amini, L. 2018c. Zeroth-order stochastic variance reduction for nonconvex optimization. arXiv preprint arXiv:1805.10367.
Mine and Fukushima (1981) Mine, H., and Fukushima, M. 1981. A minimization method for the sum of a convex function and a continuously differentiable function. Journal of Optimization Theory & Applications 33(1):9–23.
Nesterov and Spokoiny (2017) Nesterov, Y., and Spokoiny, V. G. 2017. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics 17:527–566.
Nesterov (2004) Nesterov, Y. 2004. Introductory Lectures on Convex Programming Volume I: Basic course. Kluwer, Boston.
Nesterov (2013) Nesterov, Y. 2013. Gradient methods for minimizing composite functions. Mathematical Programming 140(1):125–161.
Parikh, Boyd, and others (2014) Parikh, N.; Boyd, S.; et al. 2014. Proximal algorithms. Foundations and Trends® in Optimization 1(3):127–239.
Reddi et al. (2016) Reddi, S.; Sra, S.; Poczos, B.; and Smola, A. J. 2016. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems, 1145–1153.
Shamir (2017) Shamir, O. 2017. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. Journal of Machine Learning Research 18(52):1–11.
Sokolov, Hitschler, and Riezler (2018) Sokolov, A.; Hitschler, J.; and Riezler, S. 2018. Sparse stochastic zeroth-order optimization with an application to bandit structured prediction. arXiv preprint arXiv:1806.04458.
Wainwright, Jordan, and others (2008) Wainwright, M. J.; Jordan, M. I.; et al. 2008. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning 1(1–2):1–305.
Xiao and Zhang (2014) Xiao, L., and Zhang, T. 2014. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization 24(4):2057–2075.
Yu et al. (2018) Yu, X.; King, I.; Lyu, M. R.; and Yang, T. 2018. A generic approach for accelerating stochastic zeroth-order convex optimization. In IJCAI, 3040–3046.
Appendix A Supplementary Materials for “Faster Gradient-Free Proximal Stochastic Methods for Nonconvex Nonsmooth Optimization”
In this section, we provide the detailed proofs of the above lemmas and theorems. First, we give some useful properties of the CooSGE and the GauSGE, respectively.
Lemma 5.
(Liu et al., 2018c) Assume that the function is smooth. Let denote the estimated gradient defined by the CooSGE. Define , where
denotes the uniform distribution at the interval
. Then we have
is smooth, and
(32) where denotes the partial derivative with respect to the th coordinate.

For ,
(33) (34) 
If for , then
(35)
Lemma 6.
Assume that the function is smooth. Let denote the estimated gradient defined by the GauSGE. Define . Then we have

For any , .

For any ,
(36) 
For any ,
(37)
Proof.
Notations: To make the paper easier to follow, we give the following notations:

denotes the vector norm and the matrix spectral norm, respectively.

denotes the smoothing parameter of the gradient estimators (i.e., the CooSGE and the GauSGE).

denotes the step size of updating variable .

denotes the Lipschitz constant of .

denotes the minibatch size of stochastic gradient.

, and are the total number of iterations, the number of iterations in the inner loop, and the number of iterations in the outer loop, respectively.

For notational simplicity, denotes .
Convergence Analysis of ZOProxSVRGCooSGE
In this section, we give the convergence analysis of the ZOProxSVRGCooSGE. First, we give a useful lemma about the upper bound of the variance of the estimated gradient.
Lemma 7.
In Algorithm 1 using the CooSGE, given the estimated gradient , then the following inequality holds
(38) 
Proof.
Since
(39) 
we have