# Faster Gradient-Free Proximal Stochastic Methods for Nonconvex Nonsmooth Optimization

The proximal gradient method plays an important role in solving many machine learning tasks, especially nonsmooth problems. However, in some machine learning problems such as the bandit model and the black-box learning problem, the proximal gradient method can fail because the explicit gradients of these problems are difficult or infeasible to obtain. The gradient-free (zeroth-order) method can address these problems because only the objective function values are required in the optimization. Recently, the first zeroth-order proximal stochastic algorithm was proposed to solve nonconvex nonsmooth problems. However, its convergence rate is O(1/√T) for nonconvex problems, which is significantly slower than the best convergence rate O(1/T) of zeroth-order stochastic algorithms, where T is the iteration number. To fill this gap, in this paper, we propose a class of faster zeroth-order proximal stochastic methods with the variance reduction techniques of SVRG and SAGA, denoted as ZO-ProxSVRG and ZO-ProxSAGA, respectively. In our theoretical analysis, we address the main challenge that an unbiased estimate of the true gradient does not hold in the zeroth-order case, which was required in previous theoretical analyses of both SVRG and SAGA. Moreover, we prove that both the ZO-ProxSVRG and ZO-ProxSAGA algorithms have O(1/T) convergence rates. Finally, experimental results verify that our algorithms have a faster convergence rate than the existing zeroth-order proximal stochastic algorithm.


## Introduction

Proximal gradient (PG) methods (Mine and Fukushima, 1981; Nesterov, 2004; Parikh, Boyd, and others, 2014) are a class of powerful optimization tools in artificial intelligence and machine learning. In general, they consider the following nonsmooth optimization problem:

$$\min_{x\in\mathbb{R}^d} f(x) + \psi(x), \qquad (1)$$

where $f(x)$ usually is the loss function, such as the hinge loss or logistic loss, and $\psi(x)$ is the nonsmooth structured regularizer, such as the $\ell_1$-norm. In recent research, Beck and Teboulle (2009); Nesterov (2013) proposed accelerated PG methods to solve convex problems by using Nesterov's acceleration technique. After that, Li and Lin (2015) presented a class of accelerated PG methods for nonconvex optimization. More recently, Gu, Huo, and Huang (2018) introduced inexact PG methods for nonconvex nonsmooth optimization. To solve big data problems, incremental or stochastic PG methods (Bertsekas, 2011; Xiao and Zhang, 2014) were developed for large-scale convex optimization. Correspondingly, Ghadimi, Lan, and Zhang (2016); Reddi et al. (2016) proposed stochastic PG methods for large-scale nonconvex optimization.

However, in many machine learning problems, the explicit expressions of gradients are difficult or infeasible to obtain. For example, in some complex graphical model inference (Wainwright, Jordan, and others, 2008) and structure prediction problems (Sokolov, Hitschler, and Riezler, 2018), it is difficult to compute the explicit gradients of the objective functions. Even worse, in bandit (Shamir, 2017) and black-box learning (Chen et al., 2017) problems, only the objective function values are available (the explicit gradients cannot be calculated). Clearly, the above PG methods will fail in dealing with these scenarios. The gradient-free (zeroth-order) optimization method (Nesterov and Spokoiny, 2017) is a promising choice to address these problems because it only uses the function values in optimization process. Thus, the gradient-free optimization methods have been increasingly embraced for solving many machine learning problems (Conn, Scheinberg, and Vicente, 2009).

Although many gradient-free methods have recently been developed and studied (Agarwal, Dekel, and Xiao, 2010; Nesterov and Spokoiny, 2017; Liu et al., 2018b), they often suffer from the high variance of zeroth-order gradient estimates. In addition, these algorithms are mainly designed for smooth or convex settings, as discussed in the related works below, which limits their applicability to the wide range of nonconvex nonsmooth machine learning problems involving nonconvex loss functions and nonsmooth regularization.

Thus, in this paper, we propose a class of faster gradient-free proximal stochastic methods for solving the following nonconvex nonsmooth problem:

$$\min_{x\in\mathbb{R}^d} F(x) := f(x) + \psi(x), \quad f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad (2)$$

where each $f_i(x)$ is a nonconvex and smooth loss function, and $\psi(x)$ is a convex and nonsmooth regularization term. Until now, there have been few zeroth-order stochastic methods for solving problem (2), except a recent attempt in (Ghadimi, Lan, and Zhang, 2016). Specifically, Ghadimi, Lan, and Zhang (2016) proposed the randomized stochastic projected gradient-free method (RSPGF), i.e., a zeroth-order proximal stochastic gradient method. However, due to the large variance of the zeroth-order estimated gradient, generated from randomly selecting both the sample and the direction of the derivative, the RSPGF only has a convergence rate of $O(1/\sqrt{T})$, which is significantly slower than $O(1/T)$, the best convergence rate of zeroth-order stochastic algorithms. To accelerate the RSPGF algorithm, we use the variance reduction strategies of the first-order methods SVRG (Xiao and Zhang, 2014) and SAGA (Defazio, Bach, and Lacoste-Julien, 2014) to reduce the variance of the estimated stochastic gradient.

Although SVRG and SAGA have shown good performance, applying these strategies to the zeroth-order method is not a trivial task. The main challenge arises because both SVRG and SAGA rely on the assumption that a stochastic gradient is an unbiased estimate of the true full gradient, which does not hold for zeroth-order algorithms. Thus, in this paper, we fill this gap between the zeroth-order proximal stochastic method and the classic variance reduction approaches (SVRG and SAGA).

### Main Contributions

Our main contributions are summarized as follows:

• We propose a class of faster gradient-free proximal stochastic methods (ZO-ProxSVRG and ZO-ProxSAGA), based on the variance reduction techniques of SVRG and SAGA. Our new algorithms only use the objective function values in the optimization process.

• Moreover, we provide theoretical analysis of the convergence properties of both the new ZO-ProxSVRG and ZO-ProxSAGA methods. Table 1 summarizes the convergence rates of the proposed algorithms and related ones. In particular, our algorithms have a faster $O(1/T)$ convergence rate than the $O(1/\sqrt{T})$ rate of RSPGF (Ghadimi, Lan, and Zhang, 2016), the existing stochastic PG algorithm for solving nonconvex nonsmooth problems.

• Extensive experimental results and theoretical analysis demonstrate the effectiveness of our algorithms.

## Related Works

Gradient-free (zeroth-order) methods have been effectively used to solve many machine learning problems, where the explicit gradient is difficult or infeasible to obtain, and have also been widely studied. For example, Nesterov and Spokoiny (2017) proposed several random gradient-free methods by using Gaussian smoothing technique. Duchi et al. (2015) proposed a zeroth-order mirror descent algorithm. More recently, Yu et al. (2018); Dvurechensky, Gasnikov, and Gorbunov (2018) presented the accelerated zeroth-order methods for the convex optimization. To solve the nonsmooth problems, the zeroth-order online or stochastic ADMM methods (Liu et al., 2018b; Gao, Jiang, and Zhang, 2018) have been introduced.

The above zeroth-order methods mainly focus on the (strongly) convex problems. In fact, there exist many nonconvex machine learning tasks, whose explicit gradients are not available, such as the nonconvex black-box learning problems (Chen et al., 2017; Liu et al., 2018c). Thus, several recent works have begun to study the zeroth-order stochastic methods for the nonconvex optimization. For example, Ghadimi and Lan (2013) proposed the randomized stochastic gradient-free (RSGF) method, i.e., a zeroth-order stochastic gradient method. To accelerate optimization, more recently, Liu et al. (2018c, a) proposed the zeroth-order stochastic variance reduction gradient (ZO-SVRG) methods. Moreover, to solve the large-scale machine learning problems, some asynchronous parallel stochastic zeroth-order algorithms have been proposed in (Gu, Huo, and Huang, 2016; Lian et al., 2016; Gu et al., 2018).

Although the above zeroth-order stochastic methods can effectively solve nonconvex optimization problems, there are few zeroth-order stochastic methods for nonconvex nonsmooth composite optimization, except the RSPGF method presented in (Ghadimi, Lan, and Zhang, 2016). In addition, Liu et al. (2018a) have also studied a zeroth-order algorithm for solving a nonconvex nonsmooth problem, which is, however, different from problem (2).

## Zeroth-Order Proximal Stochastic Method Revisit

In this section, we briefly review the zeroth-order proximal stochastic gradient (ZO-ProxSGD) method to solve the problem (2). Before that, we first revisit the proximal gradient descent (ProxGD) method (Mine and Fukushima, 1981).

ProxGD is an effective method to solve the problem (2) via the following iteration:

$$x_{t+1} = \mathrm{Prox}_{\eta\psi}\big(x_t - \eta\nabla f(x_t)\big), \quad t = 0, 1, \cdots, \qquad (3)$$

where $\eta > 0$ is a step size, and $\mathrm{Prox}_{\eta\psi}(\cdot)$ is the proximal operator, defined as:

$$\mathrm{Prox}_{\eta\psi}(x) = \arg\min_{y\in\mathbb{R}^d}\Big\{\psi(y) + \frac{1}{2\eta}\|y - x\|^2\Big\}. \qquad (4)$$
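For instance, with the $\ell_1$ regularizer $\psi(x) = \lambda\|x\|_1$ used later in the experiments, the proximal operator (4) has a closed-form soft-thresholding solution. A minimal numerical sketch (the function name is ours):

```python
import numpy as np

def prox_l1(x, eta, lam):
    """Proximal operator of psi(x) = lam * ||x||_1 with step size eta.

    Solves argmin_y { lam * ||y||_1 + ||y - x||^2 / (2 * eta) },
    whose solution is coordinate-wise soft-thresholding.
    """
    return np.sign(x) * np.maximum(np.abs(x) - eta * lam, 0.0)

# Coordinates with magnitude below eta*lam = 0.5 are set exactly to zero;
# the others are shrunk toward zero by 0.5.
x = np.array([1.5, -0.2, 0.7])
print(prox_l1(x, eta=0.5, lam=1.0))  # values: [1.0, -0.0, 0.2]
```

Other simple regularizers (e.g., $\lambda\|x\|_2^2$) admit similar closed forms, which is what makes each proximal step in (3) cheap.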

As discussed above, because ProxGD needs to compute the full gradient $\nabla f(x_t)$ at each iteration, it cannot be applied to problems where the explicit gradient of the function is not available. For example, in a black-box machine learning model, only function values (e.g., prediction results) are available (Chen et al., 2017). To avoid computing explicit gradients, we use zeroth-order gradient estimators (Nesterov and Spokoiny, 2017; Liu et al., 2018c) to estimate the gradient from function values only.

• Specifically, we use the Gaussian Smoothing Gradient Estimator (GauSGE) (Nesterov and Spokoiny, 2017; Ghadimi, Lan, and Zhang, 2016) to estimate the gradients as follows:

$$\hat{\nabla} f_i(x) = \frac{f_i(x + \mu u_i) - f_i(x)}{\mu}\, u_i, \quad i \in [n], \qquad (5)$$

where $\mu > 0$ is a smoothing parameter, and the $u_i$ denote i.i.d. random directions drawn from a zero-mean isotropic multivariate Gaussian distribution $N(0, I_d)$.

• Moreover, to obtain a better gradient estimate, we can use the Coordinate Smoothing Gradient Estimator (CooSGE) (Gu, Huo, and Huang, 2016; Gu et al., 2018; Liu et al., 2018c) to estimate the gradients as follows:

$$\hat{\nabla} f_i(x) = \sum_{j=1}^{d} \frac{f_i(x + \mu_j e_j) - f_i(x - \mu_j e_j)}{2\mu_j}\, e_j, \quad i \in [n], \qquad (6)$$

where $\mu_j > 0$ is a coordinate-wise smoothing parameter, and $e_j$ is a standard basis vector with $1$ at its $j$-th coordinate and $0$ otherwise. Although the CooSGE needs more function queries than the GauSGE, it yields a better gradient estimate and can even enable the algorithms to obtain a faster convergence rate.
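Both estimators can be sketched numerically. Assuming each $f_i$ is available only as a function-value oracle (the function names and the test function are ours):

```python
import numpy as np

def gausge(f, x, mu=1e-3, rng=None):
    """Gaussian Smoothing Gradient Estimator, Eq. (5): one random
    direction u ~ N(0, I_d) and two function queries per estimate."""
    rng = np.random.default_rng(rng)
    u = rng.standard_normal(x.shape)
    return (f(x + mu * u) - f(x)) / mu * u

def coosge(f, x, mu=1e-4):
    """Coordinate Smoothing Gradient Estimator, Eq. (6): central
    differences along each coordinate, 2*d function queries in total."""
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x); e[j] = 1.0
        g[j] = (f(x + mu * e) - f(x - mu * e)) / (2.0 * mu)
    return g

# On a smooth test function (true gradient 2x), CooSGE is accurate with
# a single call, while GauSGE needs averaging over many random directions.
f = lambda x: np.dot(x, x)
x = np.array([1.0, -2.0, 3.0])
g_coo = coosge(f, x)
g_gau = np.mean([gausge(f, x, rng=s) for s in range(500)], axis=0)
```

This illustrates the query trade-off just described: the CooSGE spends $2d$ queries per estimate in exchange for much lower variance.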

Finally, based on these estimated gradients, we give a zeroth-order proximal gradient descent (ZO-ProxGD) method, which performs the following iteration:

$$x_{t+1} = \mathrm{Prox}_{\eta\psi}\big(x_t - \eta\hat{\nabla} f(x_t)\big), \quad t = 0, 1, \cdots, \qquad (7)$$

where $\hat{\nabla} f(x_t) = \frac{1}{n}\sum_{i=1}^{n}\hat{\nabla} f_i(x_t)$.

Since ZO-ProxGD needs to estimate the full gradient $\hat{\nabla} f(x_t)$, when $n$ is large in problem (2), its high cost per iteration is prohibitive. As a result, Ghadimi, Lan, and Zhang (2016) proposed the RSPGF (i.e., ZO-ProxSGD), which performs the following iteration:

$$x_{t+1} = \mathrm{Prox}_{\eta\psi}\big(x_t - \eta\hat{\nabla} f_{I_t}(x_t)\big), \quad t = 0, 1, \cdots, \qquad (8)$$

where $\hat{\nabla} f_{I_t}(x_t) = \frac{1}{b}\sum_{i\in I_t}\hat{\nabla} f_i(x_t)$, and $b = |I_t|$ is the mini-batch size.
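Putting the estimator and the proximal step together, a toy end-to-end sketch of the ZO-ProxSGD/RSPGF iteration (8) with an $\ell_1$ regularizer; the problem instance and all names are ours:

```python
import numpy as np

def prox_l1(x, eta, lam):
    return np.sign(x) * np.maximum(np.abs(x) - eta * lam, 0.0)

def coosge(f, x, mu=1e-4):
    # coordinate-wise zeroth-order gradient estimate (Eq. (6))
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x); e[j] = 1.0
        g[j] = (f(x + mu * e) - f(x - mu * e)) / (2 * mu)
    return g

def zo_prox_sgd(fs, x0, eta=0.1, lam=0.01, b=2, T=100, seed=0):
    """ZO-ProxSGD: average the zeroth-order gradients over a random
    mini-batch I_t of size b, then take one proximal step (Eq. (8))."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(T):
        batch = rng.choice(len(fs), size=b, replace=False)
        g = np.mean([coosge(fs[i], x) for i in batch], axis=0)
        x = prox_l1(x - eta * g, eta, lam)
    return x

# Toy instance: f_i(x) = ||x - a_i||^2, so the smooth part is minimized
# near the mean of the a_i while the l1 term promotes sparsity.
rng = np.random.default_rng(1)
a = rng.standard_normal((8, 5)) + 2.0
fs = [lambda x, ai=ai: np.sum((x - ai) ** 2) for ai in a]
x_hat = zo_prox_sgd(fs, np.zeros(5))
```

Because the stochastic zeroth-order gradient has large variance, this plain iteration only attains the $O(1/\sqrt{T})$ rate; the variance-reduced variants proposed next are designed to improve on it.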

## New Faster Zeroth-Order Proximal Stochastic Methods

In this section, to efficiently solve the large-scale nonconvex nonsmooth problems, we propose a class of faster zeroth-order proximal stochastic methods with the variance reduction (VR) techniques of SVRG and SAGA, respectively.

### ZO-ProxSVRG

In this subsection, we propose the zeroth-order proximal SVRG (ZO-ProxSVRG) method by using the VR technique of SVRG in (Xiao and Zhang, 2014; Reddi et al., 2016).

The corresponding algorithmic framework is described in Algorithm 1, where we use a mixture stochastic gradient $\hat{v}^s_t = \hat{\nabla} f_{I_t}(x^s_t) - \hat{\nabla} f_{I_t}(\tilde{x}^s) + \hat{\nabla} f(\tilde{x}^s)$. Note that $\mathbb{E}[\hat{v}^s_t] \neq \nabla f(x^s_t)$, i.e., this stochastic gradient is a biased estimate of the true full gradient. Although SVRG has shown great promise, it relies on the assumption that the stochastic gradient is an unbiased estimate of the true full gradient; adapting the ideas of SVRG to zeroth-order optimization is therefore not a trivial task. To address this issue, we analyze the upper bound on the variance of the estimated gradient $\hat{v}^s_t$, and choose an appropriate step size and smoothing parameter to control this variance, as discussed in detail in the theorems below.
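A compact sketch of one epoch of ZO-ProxSVRG (our simplified reading of Algorithm 1; the regularizer, toy losses, and all names are ours):

```python
import numpy as np

def prox_l1(x, eta, lam):
    return np.sign(x) * np.maximum(np.abs(x) - eta * lam, 0.0)

def zo_grad(f, x, mu=1e-4):
    # CooSGE-style coordinate-wise zeroth-order estimate (Eq. (6))
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x); e[j] = 1.0
        g[j] = (f(x + mu * e) - f(x - mu * e)) / (2 * mu)
    return g

def zo_prox_svrg_epoch(fs, x_tilde, eta=0.05, lam=0.01, b=2, m=10, seed=0):
    """One outer loop of ZO-ProxSVRG. The mixture gradient
    v = g_I(x) - g_I(x_tilde) + g_full(x_tilde) is only a *biased*
    estimate of grad f(x), because every term is itself a zeroth-order
    estimate rather than an exact gradient."""
    rng = np.random.default_rng(seed)
    n = len(fs)
    g_full = np.mean([zo_grad(f, x_tilde) for f in fs], axis=0)  # snapshot
    x = x_tilde.copy()
    for _ in range(m):
        batch = rng.choice(n, size=b, replace=False)
        v = (np.mean([zo_grad(fs[i], x) for i in batch], axis=0)
             - np.mean([zo_grad(fs[i], x_tilde) for i in batch], axis=0)
             + g_full)
        x = prox_l1(x - eta * v, eta, lam)
    return x

# Toy run on f_i(x) = ||x - a_i||^2; repeating the epoch with the
# snapshot x_tilde <- x gives the full method.
rng = np.random.default_rng(1)
a = rng.standard_normal((6, 4)) + 2.0
fs = [lambda x, ai=ai: np.sum((x - ai) ** 2) for ai in a]
x = np.zeros(4)
for s in range(3):
    x = zo_prox_svrg_epoch(fs, x, seed=s)
```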

Next, we derive the upper bounds for the variance of the estimated gradient $\hat{v}^s_t$ based on the CooSGE and the GauSGE, respectively.

###### Lemma 1.

In Algorithm 1 using the CooSGE, given the mixture estimated gradient $\hat{v}^s_t$, the following inequality holds:

$$\mathbb{E}\|\hat{v}^s_t - \nabla f(x^s_t)\|^2 \le \frac{2\delta_n L^2 d}{b}\,\mathbb{E}\|x^s_t - \tilde{x}^s\|^2 + \frac{L^2 d^2 \mu^2}{2}, \qquad (9)$$

where .

###### Remark 1.

Lemma 1 shows that the variance of $\hat{v}^s_t$ has an upper bound. As the number of iterations increases, both $x^s_t$ and $\tilde{x}^s$ approach the same stationary point $x^*$, so the variance of the stochastic gradient decreases; it does not vanish, however, because of the zeroth-order gradient estimation.

###### Lemma 2.

In Algorithm 1 using the GauSGE, given the estimated gradient $\hat{v}^s_t$, the following inequality holds:

$$\mathbb{E}\|\hat{v}^s_t - \nabla f(x^s_t)\|^2 \le \Big(2 + \frac{12\delta_n}{b}\Big)(d+6)^3 L^2 \mu^2 + \frac{6\delta_n L^2}{b}\,\mathbb{E}\|x^s_t - \tilde{x}^s\|^2 + \Big(4 + \frac{24\delta_n}{b}\Big)(2d+9)\sigma^2. \qquad (10)$$
###### Remark 2.

Lemma 2 shows that the variance of $\hat{v}^s_t$ has an upper bound. As the number of iterations increases, both $x^s_t$ and $\tilde{x}^s$ approach the same stationary point, so the variance of the stochastic gradient decreases.

### ZO-ProxSAGA

In this subsection, we propose the zeroth-order proximal SAGA (ZO-ProxSAGA) method by using the VR technique of SAGA in (Defazio, Bach, and Lacoste-Julien, 2014; Reddi et al., 2016).

The corresponding algorithmic description is given in Algorithm 2, where we use a mixture stochastic gradient $\hat{v}_t = \hat{\nabla} f_{I_t}(x_t) - \hat{\nabla} f_{I_t}(z^t_{I_t}) + \frac{1}{n}\sum_{i=1}^{n}\hat{\nabla} f_i(z^t_i)$. Similarly, $\mathbb{E}[\hat{v}_t] \neq \nabla f(x_t)$, i.e., this stochastic gradient is a biased estimate of the true full gradient. Note that in Algorithm 2, step 8 can directly reuse the term computed in step 5, which avoids unnecessary calculations. Next, we give the upper bounds for the variance of the stochastic gradient $\hat{v}_t$ based on the CooSGE and the GauSGE, respectively.
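A compact sketch of this scheme (our simplified reading of Algorithm 2: a table of stored zeroth-order gradients plays the role of the reference points $z^t_i$; the regularizer, toy losses, and all names are ours):

```python
import numpy as np

def prox_l1(x, eta, lam):
    return np.sign(x) * np.maximum(np.abs(x) - eta * lam, 0.0)

def zo_grad(f, x, mu=1e-4):
    # CooSGE-style coordinate-wise zeroth-order estimate (Eq. (6))
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x); e[j] = 1.0
        g[j] = (f(x + mu * e) - f(x - mu * e)) / (2 * mu)
    return g

def zo_prox_saga(fs, x0, eta=0.05, lam=0.01, b=2, T=50, seed=0):
    """ZO-ProxSAGA sketch: keep a table of zeroth-order gradients taken
    at reference points z_i; each step corrects the mini-batch estimate
    with the table average, then refreshes the entries of the batch."""
    rng = np.random.default_rng(seed)
    n, x = len(fs), x0.copy()
    table = np.array([zo_grad(f, x0) for f in fs])  # gradients at z_i = x0
    for _ in range(T):
        batch = rng.choice(n, size=b, replace=False)
        g_new = np.array([zo_grad(fs[i], x) for i in batch])
        v = g_new.mean(axis=0) - table[batch].mean(axis=0) + table.mean(axis=0)
        table[batch] = g_new                        # z_i <- x for i in batch
        x = prox_l1(x - eta * v, eta, lam)
    return x

# Toy run on f_i(x) = ||x - a_i||^2 with an l1 regularizer.
rng = np.random.default_rng(2)
a = rng.standard_normal((6, 4)) + 2.0
fs = [lambda x, ai=ai: np.sum((x - ai) ** 2) for ai in a]
x_hat = zo_prox_saga(fs, np.zeros(4))
```

Compared with ZO-ProxSVRG, no full-gradient snapshot is recomputed inside the loop, so ZO-ProxSAGA needs fewer function queries per iteration at the cost of $O(nd)$ extra memory for the table.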

###### Lemma 3.

In Algorithm 2 using the CooSGE, given the estimated gradient $\hat{v}_t$, the following inequality holds:

$$\mathbb{E}\|\hat{v}_t - \nabla f(x_t)\|^2 \le \frac{2L^2 d}{nb}\sum_{i=1}^{n}\mathbb{E}\|x_t - z^t_i\|_2^2 + \frac{L^2 d^2 \mu^2}{2}. \qquad (11)$$
###### Remark 3.

Lemma 3 shows that the variance of $\hat{v}_t$ has an upper bound. As the number of iterations increases, both $x_t$ and the reference points $z^t_i$ approach the same stationary point, so the variance of the stochastic gradient decreases.

###### Lemma 4.

In Algorithm 2 using the GauSGE, given the estimated gradient $\hat{v}_t$, the following inequality holds:

$$\mathbb{E}\|\hat{v}_t - \nabla f(x_t)\|^2 \le \Big(2 + \frac{12}{b}\Big)(d+6)^3 L^2 \mu^2 + \frac{6L^2}{nb}\sum_{i=1}^{n}\mathbb{E}\|x_t - z^t_i\|^2 + \Big(4 + \frac{24}{b}\Big)(2d+9)\sigma^2. \qquad (12)$$
###### Remark 4.

Lemma 4 shows that the variance of $\hat{v}_t$ has an upper bound. As the number of iterations increases, both $x_t$ and the reference points $z^t_i$ approach the same stationary point, so the variance of the stochastic gradient decreases.

## Convergence Analysis

In this section, we conduct the convergence analysis of both ZO-ProxSVRG and ZO-ProxSAGA. First, we give some mild assumptions regarding problem (2) as follows:

###### Assumption 1.

For $i \in [n]$, the gradient of the function $f_i$ is Lipschitz continuous with a Lipschitz constant $L$, such that

$$\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x - y\|, \quad \forall x, y \in \mathbb{R}^d,$$

which implies

$$f_i(x) \le f_i(y) + \nabla f_i(y)^T(x - y) + \frac{L}{2}\|x - y\|^2.$$
###### Assumption 2.

The gradient of each $f_i$ is bounded, i.e., $\|\nabla f_i(x)\|^2 \le \sigma^2$ for all $i \in [n]$ and $x \in \mathbb{R}^d$.

The first assumption is standard in the convergence analysis of zeroth-order algorithms (Ghadimi, Lan, and Zhang, 2016; Nesterov and Spokoiny, 2017; Liu et al., 2018c). The second assumption imposes a bounded gradient, as used in (Nesterov and Spokoiny, 2017; Liu et al., 2018b), which is relatively stricter than the bounded variance of the gradient used in (Lian et al., 2016; Liu et al., 2018c, a), because we analyze the more complex problem (2), which includes a nonsmooth part. Next, we introduce the standard gradient mapping (Parikh, Boyd, and others, 2014) used in the convergence analysis:

$$g_\eta(x) = \frac{1}{\eta}\Big(x - \mathrm{Prox}_{\eta\psi}\big(x - \eta\nabla f(x)\big)\Big). \qquad (13)$$

For nonconvex problems, if $g_\eta(x) = 0$, the point $x$ is a critical point (Parikh, Boyd, and others, 2014). Thus, we can use the following definition as the convergence metric.

###### Definition 1.

(Reddi et al., 2016) A solution $x$ is called $\epsilon$-accurate if $\mathbb{E}\|g_\eta(x)\|^2 \le \epsilon$ for some $\eta > 0$.
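The metric in Definition 1 is easy to evaluate numerically; a sketch of the gradient mapping (13) with the $\ell_1$ proximal operator as a running example (names are ours):

```python
import numpy as np

def prox_l1(x, eta, lam):
    return np.sign(x) * np.maximum(np.abs(x) - eta * lam, 0.0)

def grad_mapping(x, grad_f, eta, lam):
    """g_eta(x) = (x - Prox_{eta*psi}(x - eta*grad f(x))) / eta, Eq. (13).
    x is a critical point of F = f + psi iff g_eta(x) = 0."""
    return (x - prox_l1(x - eta * grad_f(x), eta, lam)) / eta

# At x = 0 with |grad f(0)| <= lam, zero is already a critical point of
# f(x) + lam*||x||_1, so the gradient mapping vanishes.
grad_f = lambda x: 0.5 * np.ones_like(x)   # a constant-slope example
g = grad_mapping(np.zeros(3), grad_f, eta=0.1, lam=1.0)
print(np.linalg.norm(g))  # -> 0.0
```

With a steeper slope (e.g., gradient $2$ per coordinate), the mapping is nonzero, correctly flagging that $x = 0$ is not critical.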

### Convergence Analysis of ZO-ProxSVRG

In the subsection, we show the convergence analysis of the ZO-ProxSVRG with the CooSGE (ZO-ProxSVRG-CooSGE) and the GauSGE (ZO-ProxSVRG-GauSGE), respectively.

###### Theorem 1.

Assume the sequence $\{x^s_t\}$ is generated by Algorithm 1 using the CooSGE, and define a sequence $\{c_t\}$ as follows:

$$c_t = \begin{cases} \dfrac{\delta_n L^2 d \eta}{b} + c_{t+1}(1+\beta), & 0 \le t \le m-1; \\ 0, & t = m, \end{cases} \qquad (14)$$

where . Let , and satisfies the following inequality:

$$\frac{8\rho^2 m^2}{b} + \rho \le 1, \qquad (15)$$

then we have

$$\mathbb{E}\|g_\eta(x^s_t)\|^2 \le \frac{\mathbb{E}[F(x^1_0) - F(x^*)]}{T\gamma} + \frac{L^2 d^2 \mu^2 \eta}{4\gamma}, \qquad (16)$$

where $\gamma$ is a constant determined by the parameter choices, and $x^*$ is an optimal solution of problem (2). With suitable choices of $\eta$, $\mu$, and $m$, we further have

$$\mathbb{E}\|g_\eta(x^s_t)\|^2 \le \frac{16 d L\,\mathbb{E}[F(x^1_0) - F(x^*)]}{T} + O\Big(\frac{d}{T}\Big). \qquad (17)$$
###### Remark 5.

Theorem 1 shows that, with these parameter choices, the ZO-ProxSVRG-CooSGE has an $O(d/T)$ convergence rate.

###### Theorem 2.

Assume the sequence $\{x^s_t\}$ is generated by Algorithm 1 using the GauSGE, and define a sequence $\{c_t\}$ as follows:

$$c_t = \begin{cases} \dfrac{3\delta_n L^2 \eta}{b} + c_{t+1}(1+\beta), & 0 \le t \le m-1; \\ 0, & t = m, \end{cases} \qquad (18)$$

where . Let and satisfies the following inequality:

$$\frac{24\rho^2 m^2}{b} + \rho \le 1, \qquad (19)$$

then we have

$$\mathbb{E}\|g_\eta(x^s_t)\|^2 \le \frac{\mathbb{E}[F(x^1_0) - F(x^*)]}{T\gamma} + \Big(1 + \frac{6\delta_n}{b}\Big)\frac{(d+6)^3 L^2 \mu^2 \eta}{\gamma} + \Big(2 + \frac{12\delta_n}{b}\Big)\frac{(2d+9)\sigma^2 \eta}{\gamma}, \qquad (20)$$

where $\gamma$ is a constant determined by the parameter choices, and $x^*$ is an optimal solution of problem (2). With suitable choices of $\eta$, $\mu$, and $m$, we further have

$$\mathbb{E}\|g_\eta(x^s_t)\|^2 \le \frac{18 L\,\mathbb{E}[F(x^1_0) - F(x^*)]}{T} + O\Big(\frac{d}{T}\Big) + O(d\sigma^2). \qquad (21)$$
###### Remark 6.

Theorem 2 shows that, with these parameter choices, the ZO-ProxSVRG-GauSGE has an $O(d/T)$ convergence rate up to an additional $O(d\sigma^2)$ term, which is generated by the GauSGE.

### Convergence Analysis of ZO-ProxSAGA

In this subsection, we provide the convergence analysis of the ZO-ProxSAGA with the CooSGE (ZO-ProxSAGA-CooSGE) and the GauSGE (ZO-ProxSAGA-GauSGE), respectively.

###### Theorem 3.

Assume the sequence $\{x_t\}$ is generated by Algorithm 2 using the CooSGE, and define a positive sequence $\{c_t\}$ as follows:

$$c_t = \frac{L^2 d \eta}{b} + c_{t+1}(1-p)(1+\beta), \qquad (22)$$

where . Let , , and satisfies the following inequality:

$$\frac{32\rho^2 n^2}{b^3} + \rho \le 1, \qquad (23)$$

then we have

$$\mathbb{E}\|g_\eta(x_t)\|^2 \le \frac{\mathbb{E}[F(x_0) - F(x^*)]}{T\gamma} + \frac{L^2 d^2 \mu^2 \eta}{4\gamma}, \qquad (24)$$

where $\gamma$ is a constant determined by the parameter choices, and $x^*$ is an optimal solution of problem (2). With suitable choices of $\eta$ and $\mu$, we further have

$$\mathbb{E}\|g_\eta(x_t)\|^2 \le \frac{64 d L\,\mathbb{E}[F(x_0) - F(x^*)]}{3T} + O\Big(\frac{d}{T}\Big). \qquad (25)$$
###### Remark 7.

Theorem 3 shows that, with these parameter choices, the ZO-ProxSAGA-CooSGE has an $O(d/T)$ convergence rate.

###### Theorem 4.

Assume the sequence $\{x_t\}$ is generated by Algorithm 2 using the GauSGE, and define a positive sequence $\{c_t\}$ as follows:

$$c_t = \frac{3 L^2 \eta}{b} + c_{t+1}(1-p)(1+\beta), \qquad (26)$$

where . Let , and satisfies the following inequality:

$$\frac{96\rho^2 n^2}{b^3} + \rho \le 1, \qquad (27)$$

then we have

$$\mathbb{E}\|g_\eta(x_t)\|^2 \le \frac{\mathbb{E}[F(x_0) - F(x^*)]}{T\gamma} + \Big(2 + \frac{12}{b}\Big)\frac{(2d+9)\sigma^2 \eta}{\gamma} + \Big(1 + \frac{6}{b}\Big)\frac{(d+6)^3 L^2 \mu^2 \eta}{\gamma}, \qquad (28)$$

where $\gamma$ is a constant determined by the parameter choices, and $x^*$ is an optimal solution of problem (2). With suitable choices of $\eta$ and $\mu$, we further have

$$\mathbb{E}\|g_\eta(x_t)\|^2 \le \frac{144 L\,\mathbb{E}[F(x_0) - F(x^*)]}{5T} + O\Big(\frac{d}{T}\Big) + O(d\sigma^2). \qquad (29)$$
###### Remark 8.

Theorem 4 shows that, with these parameter choices, the ZO-ProxSAGA-GauSGE has an $O(d/T)$ convergence rate up to an additional $O(d\sigma^2)$ term, which is generated by the GauSGE.

All related proofs are in the supplementary document.

## Experiments

In this section, we compare the proposed algorithms (ZO-ProxSVRG-CooSGE, ZO-ProxSVRG-GauSGE, ZO-ProxSAGA-CooSGE, ZO-ProxSAGA-GauSGE) with the RSPGF method (Ghadimi, Lan, and Zhang, 2016) on two applications: black-box binary classification and adversarial attacks on black-box deep neural networks (DNNs). Note that RSPGF uses the GauSGE to estimate gradients.

### Black-Box Binary Classification

#### Experimental Setup

In this experiment, we apply our algorithms to the black-box binary classification problem. Specifically, given a set of training samples $\{(a_i, l_i)\}_{i=1}^n$, where $a_i \in \mathbb{R}^d$ and $l_i \in \{-1, +1\}$, we find the optimal predictor $x$ by solving the following problem:

$$\min_{x\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^{n} f_i(x) + \lambda_1\|x\|_1 + \lambda_2\|x\|_2^2, \qquad (30)$$

where $f_i(x)$ is a black-box loss function that only returns its function value for a given input. Here, we specify $f_i$ as the nonconvex sigmoid loss function in the black-box setting.
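One standard form of the nonconvex sigmoid loss is $f_i(x) = 1/(1 + \exp(l_i a_i^T x))$; the paper's exact form may differ, but a black-box oracle for problem (30) can be sketched as follows (all names and the synthetic data are ours):

```python
import numpy as np

def make_black_box_losses(A, l):
    """Per-sample black-box oracles f_i for problem (30), using the
    sigmoid loss f_i(x) = 1 / (1 + exp(l_i * a_i^T x)); only the
    function value is exposed, as in the black-box setting."""
    return [lambda x, ai=ai, li=li: 1.0 / (1.0 + np.exp(li * ai @ x))
            for ai, li in zip(A, l)]

def objective(fs, x, lam1, lam2):
    # F(x) = (1/n) sum_i f_i(x) + lam1*||x||_1 + lam2*||x||_2^2
    return (np.mean([f(x) for f in fs])
            + lam1 * np.abs(x).sum() + lam2 * np.dot(x, x))

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 4))
l = rng.choice([-1.0, 1.0], size=10)
fs = make_black_box_losses(A, l)
print(objective(fs, np.zeros(4), lam1=0.01, lam2=0.01))  # -> 0.5 at x = 0
```

Any of the zeroth-order algorithms above can then be run against these oracles without ever touching the gradient of the loss.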

In the experiment, we use publicly available real datasets (20news is from https://cs.nyu.edu/~roweis/data.html; a9a, w8a and covtype.binary are from www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/), which are summarized in Table 2. In the algorithms, we fix the mini-batch size and the smoothing parameters of the GauSGE and the CooSGE, and use the same initial solution $x_0$, drawn from the standard normal distribution, in each experiment. For each dataset, we use half of the samples as training data and the rest as testing data.

#### Experimental Results

Figures 1 and 2 show that both the objective values and the test losses of the proposed methods decrease faster than those of the RSPGF method as time increases. In particular, both ZO-ProxSVRG and ZO-ProxSAGA using the CooSGE perform better than their counterparts using the GauSGE. From these results, we find that the CooSGE estimates gradients better than the GauSGE, and that both ZO-ProxSVRG and ZO-ProxSAGA using the CooSGE have relatively faster convergence than their counterparts using the GauSGE. Moreover, since ZO-ProxSAGA has a lower function query complexity than ZO-ProxSVRG (the ZO-ProxSVRG-CooSGE requires more function queries than the ZO-ProxSAGA-CooSGE), it shows better performance than ZO-ProxSVRG.

### Adversarial Attacks on Black-Box DNNs

In this experiment, we apply our methods to generate adversarial examples that attack a pre-trained neural network model. Following (Chen et al., 2017; Liu et al., 2018c), the parameters of the given model are hidden from us and only its outputs are accessible. In this case, we cannot compute the gradients by the back-propagation algorithm. Thus, we use the zeroth-order algorithms to find a universal adversarial perturbation $x$ that could fool the samples $\{a_i\}_{i=1}^n$, which can be specified as the following elastic-net attack on black-box DNNs:

$$\min_{x\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^{n}\max\Big\{F_{l_i}(a_i + x) - \max_{j\neq l_i}F_j(a_i + x),\ 0\Big\} + \lambda_1\|x\|_1 + \lambda_2\|x\|_2^2, \qquad (31)$$

where $\lambda_1$ and $\lambda_2$ are nonnegative parameters that balance the attack success rate, distortion, and sparsity. Here $F(\cdot) = [F_1(\cdot), \ldots, F_K(\cdot)]$ represents the final-layer output of the neural network, i.e., the probabilities of the $K$ classes, and $l_i$ is the true label of sample $a_i$.
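Assuming a black-box `predict_proba` oracle that returns the class probabilities $F(\cdot)$, the attack loss in (31) can be sketched as follows (the toy softmax model and all names are ours):

```python
import numpy as np

def attack_loss(x, samples, labels, predict_proba, lam1=1e-3, lam2=1e-3):
    """Elastic-net black-box attack objective of Eq. (31): the hinge term
    for sample a_i vanishes once some wrong class outscores the true
    class l_i under the perturbation x."""
    total = 0.0
    for a, li in zip(samples, labels):
        p = predict_proba(a + x)                 # black-box output F(a + x)
        margin = p[li] - np.max(np.delete(p, li))
        total += max(margin, 0.0)
    return (total / len(samples)
            + lam1 * np.abs(x).sum() + lam2 * np.dot(x, x))

# Toy stand-in for the DNN: class probabilities from a fixed linear
# softmax model with three classes.
W = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
def predict_proba(a):
    z = W @ a
    e = np.exp(z - z.max())
    return e / e.sum()

loss = attack_loss(np.zeros(2), [np.array([2.0, 0.0])], [0], predict_proba)
```

Since only `predict_proba` values are needed, this objective is exactly the kind of function-value oracle the zeroth-order estimators above can differentiate numerically.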

Following (Liu et al., 2018c), we use a pre-trained DNN (https://github.com/carlini/nnrobustattacks) on the MNIST dataset as the target black-box model, which achieves 99.4% test accuracy. In the experiment, we select examples from the same class, and set the batch size and a constant step size for the zeroth-order algorithms. In addition, we fix $\lambda_1$ and $\lambda_2$ in the experiment.

Figure 3 shows that both the objective values and the black-box attack losses (i.e., the first part of problem (31)) of the proposed algorithms decrease faster than those of the RSPGF method as the number of iterations increases. Here, we add the ZO-ProxSGD-CooSGE method for comparison, which is obtained by combining the ZO-ProxSGD method with the CooSGE. Interestingly, the ZO-ProxSGD-CooSGE performs better than both the ZO-ProxSVRG-GauSGE and ZO-ProxSAGA-GauSGE, which further demonstrates that the CooSGE estimates gradients better than the GauSGE. Although it performs relatively well in generating adversarial samples, the ZO-ProxSGD-CooSGE is still worse than both the ZO-ProxSVRG-CooSGE and ZO-ProxSAGA-CooSGE, because it does not use the VR technique.

## Conclusions

In this paper, we proposed a class of faster gradient-free proximal stochastic methods based on the zeroth-order gradient estimators, i.e., the GauSGE and the CooSGE, which only use the objective function values in the optimization. Moreover, we provided theoretical analysis of the convergence properties of the proposed algorithms (ZO-ProxSVRG and ZO-ProxSAGA) based on the CooSGE and the GauSGE, respectively. In particular, both ZO-ProxSVRG and ZO-ProxSAGA using the CooSGE have relatively faster convergence rates than their counterparts using the GauSGE, since the CooSGE estimates gradients more accurately than the GauSGE.

## Acknowledgments

F. Huang and S. Chen were partially supported by the Natural Science Foundation of China (NSFC) under Grant No. 61806093 and No. 61682281, and the Key Program of NSFC under Grant No. 61732006, and Jiangsu Postdoctoral Research Grant Program No. 2018K004A. F. Huang, Z. Huo, H. Huang were partially supported by U.S. NSF IIS 1836945, IIS 1836938, DBI 1836866, IIS 1845666, IIS 1852606, IIS 1838627, IIS 1837956.

## References

• Agarwal, Dekel, and Xiao (2010) Agarwal, A.; Dekel, O.; and Xiao, L. 2010. Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT, 28–40. Citeseer.
• Beck and Teboulle (2009) Beck, A., and Teboulle, M. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences 2(1):183–202.
• Bertsekas (2011) Bertsekas, D. P. 2011. Incremental proximal methods for large scale convex optimization. Mathematical programming 129(2):163–195.
• Chen et al. (2017) Chen, P.-Y.; Zhang, H.; Sharma, Y.; Yi, J.; and Hsieh, C.-J. 2017. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In The 10th ACM Workshop on Artificial Intelligence and Security, 15–26. ACM.
• Conn, Scheinberg, and Vicente (2009) Conn, A. R.; Scheinberg, K.; and Vicente, L. N. 2009. Introduction to derivative-free optimization, volume 8. Siam.
• Defazio, Bach, and Lacoste-Julien (2014) Defazio, A.; Bach, F.; and Lacoste-Julien, S. 2014. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 1646–1654.
• Duchi et al. (2015) Duchi, J. C.; Jordan, M. I.; Wainwright, M. J.; and Wibisono, A. 2015. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory 61(5):2788–2806.
• Dvurechensky, Gasnikov, and Gorbunov (2018) Dvurechensky, P.; Gasnikov, A.; and Gorbunov, E. 2018. An accelerated method for derivative-free smooth stochastic convex optimization. arXiv preprint arXiv:1802.09022.
• Gao, Jiang, and Zhang (2018) Gao, X.; Jiang, B.; and Zhang, S. 2018. On the information-adaptive variants of the admm: an iteration complexity perspective. Journal of Scientific Computing 76(1):327–363.
• Ghadimi and Lan (2013) Ghadimi, S., and Lan, G. 2013. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23:2341–2368.
• Ghadimi, Lan, and Zhang (2016) Ghadimi, S.; Lan, G.; and Zhang, H. 2016. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming 155(1-2):267–305.
• Gu et al. (2018) Gu, B.; Huo, Z.; Deng, C.; and Huang, H. 2018. Faster derivative-free stochastic algorithm for shared memory machines. In ICML, 1807–1816.
• Gu, Huo, and Huang (2016) Gu, B.; Huo, Z.; and Huang, H. 2016. Zeroth-order asynchronous doubly stochastic algorithm with variance reduction. arXiv preprint arXiv:1612.01425.
• Gu, Huo, and Huang (2018) Gu, B.; Huo, Z.; and Huang, H. 2018. Inexact proximal gradient methods for non-convex and non-smooth optimization. In AAAI.
• Li and Lin (2015) Li, H., and Lin, Z. 2015. Accelerated proximal gradient methods for nonconvex programming. In Advances in neural information processing systems, 379–387.
• Lian et al. (2016) Lian, X.; Zhang, H.; Hsieh, C. J.; Huang, Y.; and Liu, J. 2016. A comprehensive linear speedup analysis for asynchronous stochastic parallel optimization from zeroth-order to first-order. In Advances in Neural Information Processing Systems, 3054–3062.
• Liu et al. (2018a) Liu, L.; Cheng, M.; Hsieh, C.-J.; and Tao, D. 2018a. Stochastic zeroth-order optimization via variance reduction method. CoRR abs/1805.11811.
• Liu et al. (2018b) Liu, S.; Chen, J.; Chen, P.-Y.; and Hero, A. 2018b. Zeroth-order online alternating direction method of multipliers: Convergence analysis and applications. In The Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84, 288–297.
• Liu et al. (2018c) Liu, S.; Kailkhura, B.; Chen, P.-Y.; Ting, P.; Chang, S.; and Amini, L. 2018c. Zeroth-order stochastic variance reduction for nonconvex optimization. arXiv preprint arXiv:1805.10367.
• Mine and Fukushima (1981) Mine, H., and Fukushima, M. 1981. A minimization method for the sum of a convex function and a continuously differentiable function. Journal of Optimization Theory & Applications 33(1):9–23.
• Nesterov and Spokoiny (2017) Nesterov, Y., and Spokoiny, V. G. 2017. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics 17:527–566.
• Nesterov (2004) Nesterov, Y. 2004. Introductory Lectures on Convex Programming Volume I: Basic course. Kluwer, Boston.
• Nesterov (2013) Nesterov, Y. 2013. Gradient methods for minimizing composite functions. Mathematical Programming 140(1):125–161.
• Parikh, Boyd, and others (2014) Parikh, N.; Boyd, S.; et al. 2014. Proximal algorithms. Foundations and Trends® in Optimization 1(3):127–239.
• Reddi et al. (2016) Reddi, S.; Sra, S.; Poczos, B.; and Smola, A. J. 2016. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems, 1145–1153.
• Shamir (2017) Shamir, O. 2017. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. Journal of Machine Learning Research 18(52):1–11.
• Sokolov, Hitschler, and Riezler (2018) Sokolov, A.; Hitschler, J.; and Riezler, S. 2018. Sparse stochastic zeroth-order optimization with an application to bandit structured prediction. arXiv preprint arXiv:1806.04458.
• Wainwright, Jordan, and others (2008) Wainwright, M. J.; Jordan, M. I.; et al. 2008. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning 1(1–2):1–305.
• Xiao and Zhang (2014) Xiao, L., and Zhang, T. 2014. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization 24(4):2057–2075.
• Yu et al. (2018) Yu, X.; King, I.; Lyu, M. R.; and Yang, T. 2018. A generic approach for accelerating stochastic zeroth-order convex optimization. In IJCAI, 3040–3046.

## Appendix A Supplementary Materials for “Faster Gradient-Free Proximal Stochastic Methods for Nonconvex Nonsmooth Optimization”

In this section, we provide the detailed proofs of the above lemmas and theorems. First, we give some useful properties of the CooSGE and the GauSGE, respectively.

###### Lemma 5.

(Liu et al., 2018c) Assume that the function $f$ is $L$-smooth. Let $\hat{\nabla} f(x)$ denote the estimated gradient defined by the CooSGE. Define $f_{\mu_j}(x) = \mathbb{E}_{v \sim U[-1,1]}\big[f(x + \mu_j v e_j)\big]$, where $U[-1,1]$ denotes the uniform distribution on the interval $[-1, 1]$. Then we have

• $f_{\mu_j}$ is $L$-smooth, and

$$\hat{\nabla} f(x) = \sum_{j=1}^{d}\frac{\partial f_{\mu_j}(x)}{\partial x_j}\, e_j, \qquad (32)$$

where $\partial f_{\mu_j}(x)/\partial x_j$ denotes the partial derivative with respect to the $j$-th coordinate.

• For $j \in [d]$,

$$|f_{\mu_j}(x) - f(x)| \le \frac{L\mu_j^2}{2}, \qquad (33)$$
$$\Big|\frac{\partial f_{\mu_j}(x)}{\partial x_j} - \frac{\partial f(x)}{\partial x_j}\Big| \le \frac{L\mu_j^2}{2}. \qquad (34)$$
• If $\mu_j = \mu$ for all $j \in [d]$, then

$$\|\hat{\nabla} f(x) - \nabla f(x)\|_2^2 \le \frac{L^2 d^2 \mu^2}{4}. \qquad (35)$$
###### Lemma 6.

Assume that the function $f$ is $L$-smooth. Let $\hat{\nabla} f(x)$ denote the estimated gradient defined by the GauSGE, and define the smoothed function $f_\mu(x) = \mathbb{E}_{u \sim N(0, I_d)}\big[f(x + \mu u)\big]$. Then we have

• For any $x \in \mathbb{R}^d$, $\nabla f_\mu(x) = \mathbb{E}_u\big[\hat{\nabla} f(x)\big]$.

• For any $x \in \mathbb{R}^d$,

$$|f_\mu(x) - f(x)| \le \frac{L d \mu^2}{2}, \quad \|\nabla f_\mu(x) - \nabla f(x)\| \le \frac{L\mu (d+3)^{3/2}}{2}, \quad \mathbb{E}_u\|\hat{\nabla} f(x)\|^2 \le 2(d+4)\|\nabla f(x)\|^2 + \frac{\mu^2 L^2 (d+6)^3}{2}. \qquad (36)$$
• For any $x \in \mathbb{R}^d$,

$$\mathbb{E}_u\|\hat{\nabla} f(x) - \nabla f(x)\|^2 \le 2(2d+9)\|\nabla f(x)\|^2 + \mu^2 L^2 (d+6)^3. \qquad (37)$$
###### Proof.

The first and second parts of the above results can be obtained from Lemma 5 in (Ghadimi, Lan, and Zhang, 2016). Using the inequality (36), we have

$$\mathbb{E}_u\|\hat{\nabla} f(x) - \nabla f(x)\|^2 \le 2\,\mathbb{E}_u\|\hat{\nabla} f(x)\|^2 + 2\|\nabla f(x)\|^2 \le 2(2d+9)\|\nabla f(x)\|^2 + \mu^2 L^2 (d+6)^3,$$

where the first inequality holds by the Cauchy-Schwarz and Young's inequalities. ∎

Notations: To make the paper easier to follow, we use the following notation:

• $\|\cdot\|$ denotes the vector $\ell_2$-norm and the matrix spectral norm, respectively.

• $\mu$ denotes the smoothing parameter of the gradient estimators (i.e., the CooSGE and the GauSGE).

• $\eta$ denotes the step size for updating the variable $x$.

• $L$ denotes the Lipschitz constant of $\nabla f_i$.

• $b$ denotes the mini-batch size of the stochastic gradient.

• $T$, $m$ and $S$ are the total number of iterations, the number of iterations in the inner loop, and the number of iterations in the outer loop, respectively.

• For notational simplicity, denotes .

### Convergence Analysis of ZO-ProxSVRG-CooSGE

In this section, we give the convergence analysis of the ZO-ProxSVRG-CooSGE. First, we give a useful lemma about the upper bound on the variance of the estimated gradient.

###### Lemma 7.

In Algorithm 1 using the CooSGE, given the estimated gradient $\hat{v}^s_t$, the following inequality holds:

$$\mathbb{E}\|\hat{v}^s_t - \nabla f(x^s_t)\|^2 \le \frac{2\delta_n L^2 d}{b}\,\mathbb{E}\|x^s_t - \tilde{x}^s\|_2^2 + \frac{L^2 d^2 \mu^2}{2}. \qquad (38)$$
###### Proof.

Since

 EIt[^∇fIt(xst)−^∇fIt(~xs)