Zeroth-order (gradient-free) methods are powerful optimization tools for many machine learning problems in which the gradient of the objective function is unavailable or computationally prohibitive. For example, zeroth-order optimization methods have been applied to bandit feedback analysis [6] and adversarial attacks on black-box deep neural networks (DNNs) [5, 26]. Recently, zeroth-order methods have been increasingly studied. Based on the Gaussian smoothing gradient estimator, Nesterov and Spokoiny [28] proposed zeroth-order gradient descent methods. Ghadimi and Lan [10] presented zeroth-order stochastic gradient methods. More recently, Balasubramanian and Ghadimi [2] introduced a class of zeroth-order conditional gradient methods. To deal with nonsmooth penalties, Liu et al. [25] and Gao et al. [9] proposed zeroth-order online and stochastic alternating direction method of multipliers (ADMM) methods.
So far, the above zeroth-order algorithms mainly build on the convexity of the problems. In fact, zeroth-order methods are also highly successful in solving many nonconvex problems, such as adversarial attacks on black-box DNNs [5, 26]. Thus, Ghadimi and Lan [10] studied zeroth-order stochastic gradient descent (ZO-SGD) methods for nonconvex optimization. To accelerate ZO-SGD, Liu et al. [26, 24] proposed fast zeroth-order stochastic variance-reduced gradient (ZO-SVRG) methods based on SVRG. For big data optimization, asynchronous parallel zeroth-order methods [23, 12] and distributed zeroth-order methods [13] have been developed. More recently, to reduce function query complexity, SPIDER-SZO [7] and ZO-SPIDER-Coord [19] have been introduced using the stochastic path-integrated differential estimator (SPIDER) [7], which is improved by SpiderBoost [37] and is also a variant of the stochastic recursive gradient algorithm (SARAH) [29]. To deal with nonsmooth regularization, Ghadimi et al. [11] and Huang et al. [18] proposed zeroth-order proximal stochastic gradient methods. However, these nonconvex zeroth-order methods are still not well suited to many machine learning problems with complex nonsmooth penalties and constraints, such as structured adversarial attacks on black-box DNNs. More recently, Huang et al. [17] proposed a class of nonconvex zeroth-order stochastic ADMM methods (i.e., ZO-SVRG-ADMM and ZO-SAGA-ADMM) to solve these complex problems. However, these zeroth-order ADMM methods suffer from high function query complexity (see Table 1).
|Type||Algorithm||Reference||Problem||Function Query Complexity|
|Finite-sum||ZO-SVRG-ADMM||[17]||NC(S) + C(mNS)|||
|Finite-sum||ZO-SPIDER-ADMM||Ours||NC(S) + C(mNS)|||
|Online||ZOO-ADMM||[25]||C(S) + C(NS)|||
|Online||ZOO-SPIDER-ADMM||Ours||NC(S) + C(mNS)|||
In this paper, we therefore propose a class of faster zeroth-order stochastic ADMM methods with lower function query complexity to solve the following nonconvex nonsmooth problem:
where , , , and is a nonconvex and smooth function, and is a convex and possibly nonsmooth function for all . Note that the explicit gradients of are difficult or infeasible to obtain. In this case, we have to use zeroth-order methods [28, 26] to estimate the gradient of each . For the problem (1), the finite-sum subproblem generally comes from empirical loss minimization in machine learning, and the online subproblem arises from expected loss minimization, where the random variable follows an unknown data distribution. To address the online subproblem, we extend our ZO-SPIDER-ADMM to the online setting and propose an online ZO-SPIDER-ADMM algorithm. Our main contributions are summarized as follows:
- We propose a fast zeroth-order stochastic ADMM method (i.e., ZO-SPIDER-ADMM) with lower function query complexity to solve the problem (1).
- We prove that the ZO-SPIDER-ADMM has the optimal function query complexity of for finding an -approximate local solution, which improves on the best existing nonconvex zeroth-order ADMM methods by a factor .
- We extend the ZO-SPIDER-ADMM to the online setting and propose a faster online zeroth-order ADMM method (i.e., ZOO-SPIDER-ADMM). Moreover, we prove that the ZOO-SPIDER-ADMM has the optimal query complexity of , which improves on the best existing result by a factor .
Let and for . Given a positive definite matrix , ; and denote the largest and smallest eigenvalues of the matrix , respectively; the condition number . and denote the largest and smallest eigenvalues of the matrix , respectively. Given positive definite matrices , let and .
2 Related Work
ADMM [8, 4] is a popular optimization method for solving composite and constrained problems in machine learning. Owing to its flexibility in splitting the objective function into a loss and a complex penalty, ADMM can relatively easily solve problems with complicated structured penalties, such as the graph-guided fused lasso [22], which are too complicated for other popular optimization methods such as proximal gradient methods [3]. Thus, ADMM has been widely studied in recent years [30, 39]. For large-scale optimization, some stochastic ADMM methods [31, 33, 40, 27] have been proposed. In fact, ADMM is also successful in solving many nonconvex machine learning problems, such as training neural networks [34]. Thus, nonconvex ADMM methods and their stochastic variants have been developed in [35, 36, 14, 20]. At the same time, nonconvex stochastic ADMM methods [15, 16] have been studied.
So far, the above ADMM methods need to repeatedly calculate gradients of the loss function over the iterations. However, in many machine learning problems, the gradients of the objective functions are difficult or infeasible to obtain. For example, in adversarial attacks on black-box DNNs [5, 26], only evaluation values (i.e., function values) are provided. Thus, [25, 9] proposed zeroth-order online and stochastic ADMM methods for solving some convex problems. More recently, [17] proposed the nonconvex ZO-SVRG-ADMM and ZO-SAGA-ADMM methods.
3 Faster Zeroth-Order Stochastic ADMM Methods
In this section, we propose a faster zeroth-order stochastic ADMM method (i.e., ZO-SPIDER-ADMM) to solve the problem (1) based on the SPIDER method. Moreover, we extend the ZO-SPIDER-ADMM to the online setting and propose a faster online zeroth-order ADMM method, i.e., ZOO-SPIDER-ADMM.
First, we give the augmented Lagrangian function of the problem (1) as follows:
where denotes the dual variable and denotes the penalty parameter.
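To make the structure of this function concrete, the following Python sketch evaluates an augmented Lagrangian for a linearly constrained problem with one smooth loss and several penalty blocks. The symbol names (`A`, `B`, `c`, `lam`, `rho`), the helper functions, and the sign convention on the dual term are assumptions chosen for illustration; they are not taken from the displayed equation.

```python
def matvec(M, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def residual(A, B, c, x, ys):
    """Constraint residual A x + sum_j B[j] y_j - c (assumed constraint form)."""
    r = matvec(A, x)
    for B_j, y_j in zip(B, ys):
        r = [a + b for a, b in zip(r, matvec(B_j, y_j))]
    return [a - b for a, b in zip(r, c)]

def augmented_lagrangian(f, psis, A, B, c, x, ys, lam, rho):
    """f(x) + sum_j psi_j(y_j) - <lam, r> + (rho / 2) * ||r||^2 (assumed form)."""
    r = residual(A, B, c, x, ys)
    objective = f(x) + sum(psi(y) for psi, y in zip(psis, ys))
    dual_term = sum(l * r_i for l, r_i in zip(lam, r))
    penalty = 0.5 * rho * sum(r_i * r_i for r_i in r)
    return objective - dual_term + penalty

# Toy instance: f(x) = ||x||^2, a single l1 penalty, and the constraint x - y = 0.
f = lambda x: sum(v * v for v in x)
l1 = lambda y: sum(abs(v) for v in y)
A = [[1.0, 0.0], [0.0, 1.0]]
B = [[[-1.0, 0.0], [0.0, -1.0]]]
c = [0.0, 0.0]
feasible = augmented_lagrangian(f, [l1], A, B, c, [1.0, -2.0], [[1.0, -2.0]], [0.5, 0.5], 2.0)
infeasible = augmented_lagrangian(f, [l1], A, B, c, [1.0, -2.0], [[0.0, 0.0]], [0.5, 0.5], 2.0)
```

At a feasible point the residual vanishes and the value reduces to the original objective; away from feasibility, the dual and penalty terms contribute.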
In the problem (1), the explicit expression of gradient for each function is not available, and only function value of is available. Thus, we use the coordinate smoothing gradient estimator  to evaluate gradients as follows:
where is a coordinate-wise smoothing parameter, and is a standard basis vector with 1 at its -th coordinate and 0 elsewhere. Without loss of generality, let . Considering the case that the sample size is large, we use mini-batches of samples instead of all samples in estimating the gradient of . Algorithm 1 describes the ZO-SPIDER-ADMM algorithm.
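A coordinate-wise estimator of this kind can be sketched in a few lines of Python. The sketch assumes the common central-difference form with a single smoothing parameter `mu` shared by all coordinates; the function names and the toy component losses are hypothetical and serve only to illustrate the idea.

```python
def coord_grad_estimate(f, x, mu=1e-4):
    """Central-difference estimate of the gradient of f at x, one coordinate at a time.

    Uses 2 * len(x) function evaluations and no derivative information.
    """
    grad = []
    for j in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += mu   # x + mu * e_j
        x_minus[j] -= mu  # x - mu * e_j
        grad.append((f(x_plus) - f(x_minus)) / (2.0 * mu))
    return grad

def minibatch_grad_estimate(losses, batch, x, mu=1e-4):
    """Average the coordinate-wise estimate over the component losses indexed by batch."""
    total = [0.0] * len(x)
    for i in batch:
        g = coord_grad_estimate(losses[i], x, mu)
        total = [t + g_j for t, g_j in zip(total, g)]
    return [t / len(batch) for t in total]

# Toy component losses f_i(x) = ||x - i||^2, whose true gradient is 2 * (x - i).
losses = [lambda x, i=i: sum((v - i) ** 2 for v in x) for i in range(4)]
estimate = minibatch_grad_estimate(losses, batch=[1, 3], x=[1.0, 2.0, 3.0])
```

For these quadratic toy losses the central difference matches the true mini-batch gradient up to rounding error; for general smooth losses the error depends on the smoothing parameter.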
Since we use the inexact stochastic zeroth-order gradient to update , we define an approximated function of over as follows:
where is a step size and . At step 10 of Algorithm 1, we can easily obtain . To avoid computing the inverse of the matrix , we can set with to linearize the term .
At step 9 of Algorithm 1, we update the parameter by solving the following subproblem:
where . When we set with for all to linearize the term , we can use the following proximal operator to update each :
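As an illustration of such a proximal update, the sketch below implements the proximal operator of a scaled l1 norm (soft-thresholding). The choice of the l1 penalty and the names `soft_threshold`, `prox_objective`, `v`, and `tau` are assumptions made for the example; other convex penalties would use their own proximal operators in the same place.

```python
def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: argmin_y tau * ||y||_1 + 0.5 * ||y - v||^2."""
    return [max(v_k - tau, 0.0) - max(-v_k - tau, 0.0) for v_k in v]

def prox_objective(y, v, tau):
    """Objective minimized by the proximal operator, used here only as a check."""
    l1_part = tau * sum(abs(y_k) for y_k in y)
    quad_part = 0.5 * sum((y_k - v_k) ** 2 for y_k, v_k in zip(y, v))
    return l1_part + quad_part

v = [3.0, -0.5, -2.0]
y_star = soft_threshold(v, tau=1.0)  # entries with |v_k| <= tau are set to zero
```

Because the update is available in closed form, each penalty block can be updated without an inner iterative solver.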
Next, we extend the above ZO-SPIDER-ADMM method to the online setting and propose an online ZO-SPIDER-ADMM (i.e., ZOO-SPIDER-ADMM) to solve the online problem (1). In the online setting, denotes a population risk over an underlying data distribution. The online problem (1) can be viewed as having infinitely many samples, so we are not able to compute the full zeroth-order gradient of the function . In this case, we use sampling to estimate the full zeroth-order gradient. The online ZO-SPIDER-ADMM is described in Algorithm 2.
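The difference between the two settings can be sketched as follows: a SPIDER-type estimator is refreshed periodically, using all components in the finite-sum case and a freshly drawn batch of samples in the online case, and is updated recursively in between. The function names, the refresh period `q`, the batch sizes `big` and `small`, and the generic oracle `grad_est(x, samples)` (which stands for any zeroth-order gradient estimator) are assumptions for illustration and do not reproduce Algorithms 1 and 2 line by line.

```python
def spider_estimate(grad_est, draw_batch, x, x_prev, v_prev, t, q, big, small):
    """Return a SPIDER-type gradient estimate at iteration t.

    grad_est(x, samples): averaged gradient estimate over the given samples.
    draw_batch(size): returns a list of `size` samples (or component indices).
    """
    if t % q == 0:
        # Refresh: all n components in the finite-sum case,
        # a large freshly drawn sample in the online case.
        return grad_est(x, draw_batch(big))
    # Recursive correction: the same small batch is used at x and at x_prev.
    samples = draw_batch(small)
    g_new = grad_est(x, samples)
    g_old = grad_est(x_prev, samples)
    return [a - b + c for a, b, c in zip(g_new, g_old, v_prev)]

# Toy check: per-sample loss sum_k (x_k - s)^2 with an exact gradient oracle.
def exact_grad(x, samples):
    mean_s = sum(samples) / len(samples)
    return [2.0 * (x_k - mean_s) for x_k in x]

draw = lambda size: [1.0] * size  # degenerate sampler that keeps the check deterministic
x0, x1 = [0.0, 4.0], [0.5, 3.0]
v0 = spider_estimate(exact_grad, draw, x0, None, None, t=0, q=5, big=8, small=2)
v1 = spider_estimate(exact_grad, draw, x1, x0, v0, t=1, q=5, big=8, small=2)
```

Reusing the same mini-batch at both points is what keeps the variance of the correction term proportional to the distance between consecutive iterates.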
4 Theoretical Analysis
In this section, we study the convergence properties of both the ZO-SPIDER-ADMM and ZOO-SPIDER-ADMM methods. All related proofs are provided in the supplementary document. First, we restate the standard -approximate stationary point of the problem (1), as used in [20, 16].
Given , the point is said to be an -stationary point of the problem (1), if it holds that
Next, we give some standard assumptions regarding the problem (1) as follows:
Loss function for is -smooth such that, for any
For the expectation setting, the variance is bounded, i.e., for all . Full gradient of loss function is bounded, i.e., there exists a constant such that for all , it follows that .
and for all are all lower bounded, and let and .
Matrix is full row or column rank.
Assumption 1 imposes smoothness on the individual loss functions, which is commonly used in the convergence analysis of nonconvex algorithms. Assumption 2 states that both the full gradient and the variance of the stochastic gradients are bounded in norm. The bounded-variance part of Assumption 2 is needed only for the online case. Assumption 3 guarantees the feasibility of the optimization problem and has been used in the study of nonconvex ADMMs [14, 20]. Assumption 4 guarantees that the matrix or is non-singular, which is commonly used in the convergence analysis of nonconvex ADMM algorithms [14, 20]. Without loss of generality, we will use a full column rank matrix in the following.
4.1 Convergence Analysis of the ZO-SPIDER-ADMM
In this subsection, we study the convergence properties of the ZO-SPIDER-ADMM algorithm. Throughout the paper, let such that . For notational simplicity, let
Suppose the sequence is generated from Algorithm 1, and define a Lyapunov function as follows:
Let , and , then we have
where , with and is a lower bound of the function .
Next, based on the above Lemma 1, we give convergence properties of the ZO-SPIDER-ADMM. We begin with defining a useful variable .
Theorem 1 shows that given , , and , the ZO-SPIDER-ADMM has convergence rate of , and has the optimal function query complexity of for finding an -approximate stationary point. Specifically, the function query complexity is generated from the following two parts: The first part is generated from using function values to compute the zeroth-order full gradient; The second part is generated from using function values to compute mini-batch zeroth-order stochastic gradient at each iteration, and our algorithm needs to iterations to obtain -approximate stationary point.
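The two parts of this count can be made concrete with a small accounting sketch. It assumes a central-difference coordinate-wise estimator (two function queries per coordinate), a full refresh over all `n` components every `q` iterations, and two mini-batch estimates (at the current and previous iterates) in each remaining iteration. The function name and all parameter values are hypothetical and are not the settings analyzed in Theorem 1.

```python
def function_queries(d, n, num_iters, q, batch):
    """Count function evaluations under the stated assumptions.

    Each coordinate-wise central-difference estimate of one component
    gradient costs 2 * d function queries.
    """
    per_component = 2 * d
    refresh_queries = 0
    minibatch_queries = 0
    for t in range(num_iters):
        if t % q == 0:
            # Full zeroth-order gradient over all n components.
            refresh_queries += n * per_component
        else:
            # Mini-batch estimates at the current and the previous iterate.
            minibatch_queries += 2 * batch * per_component
    return refresh_queries, minibatch_queries

refresh, minibatch = function_queries(d=10, n=100, num_iters=30, q=10, batch=10)
```

With these hypothetical values, the three refresh steps cost 6000 queries and the 27 mini-batch steps cost 10800 queries, which shows how the refresh period and batch size trade off the two parts.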
4.2 Convergence Analysis of Online ZO-SPIDER-ADMM
In this subsection, we study the convergence properties of the online ZO-SPIDER-ADMM (i.e., ZOO-SPIDER-ADMM) algorithm.
Suppose the sequence is generated from Algorithm 2, and define a Lyapunov function as follows:
Let , and , then we have
where , with , and is a lower bound of the function .
Next, based on the above Lemma 2, we give convergence properties of the ZOO-SPIDER-ADMM. Let .
Theorem 2 shows that given , , , and , the ZOO-SPIDER-ADMM has the optimal function query complexity of for finding an -approximate stationary point.
5 Experimental Results
In this section, we use the task of structured adversarial attacks on black-box DNNs to verify the efficiency of our algorithms. In the experiment, we compare our algorithm (ZO-SPIDER-ADMM) with ZO-SVRG-ADMM [17], ZO-SAGA-ADMM [17], and the zeroth-order stochastic ADMM without variance reduction (ZO-SGD-ADMM).
|datasets||# test samples||#dimension||#classes|
5.1 Experimental Setup
In this experiment, we focus on generating adversarial samples to attack pre-trained DNNs whose parameters and structures are hidden from us and whose outputs alone are accessible. As in [38, 17], we try to learn some interesting structures in the adversarial samples that can fool the black-box DNNs. Specifically, we try to find a universal structured adversarial perturbation that could fool the samples . In fact, this task can be described as the following problem:
where , and represents the final-layer output of the neural network before the softmax, and the overlapping groups are generated by dividing an image into sub-groups of pixels. Here ensures the validity of the generated adversarial examples. Specifically, if for all and , otherwise .
In the experiment, we use the DNN models pre-trained on the four benchmark datasets in Table 2 as the target black-box models. These pre-trained DNNs on MNIST, Fashion-MNIST, SVHN and CIFAR10 can attain , , and test accuracy, respectively. In the experiment, we set the tuning parameters , , , and the smoothing parameter . For these datasets, the kernel size for the overlapped group lasso is set to and the stride is one. In the non-online setting, we select 40 samples from each class, and choose 4 as the batch size. In the online setting, we select 500 samples from each class, and choose 10 as the batch size.
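The overlapping groups used by such a penalty can be enumerated with a sliding window. The sketch below is an illustration only: the function names, the row-major pixel indexing, and the tiny 3-by-3 image with a hypothetical kernel size of 2 are assumptions, and the group lasso value is computed without any group weights.

```python
import math

def overlapping_groups(height, width, kernel, stride=1):
    """List the pixel-index groups covered by a kernel x kernel sliding window."""
    groups = []
    for top in range(0, height - kernel + 1, stride):
        for left in range(0, width - kernel + 1, stride):
            groups.append([(top + r) * width + (left + c)
                           for r in range(kernel) for c in range(kernel)])
    return groups

def group_lasso(z, groups):
    """Sum of Euclidean norms of z restricted to each group."""
    return sum(math.sqrt(sum(z[k] ** 2 for k in g)) for g in groups)

groups = overlapping_groups(height=3, width=3, kernel=2, stride=1)
z = [0.0] * 9
z[4] = 3.0  # the centre pixel belongs to all four groups
penalty = group_lasso(z, groups)
```

With a stride of one, neighbouring windows share pixels, so a single perturbed pixel is penalized once for every group that contains it.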
5.2 Experimental Results
Figure 1 shows that the attack loss (i.e., the first term in the problem (9)) of our algorithm decreases faster than that of the other algorithms as the number of iterations increases. Figure 2 shows that the attack loss of our online algorithm also decreases faster than that of the other online algorithm as the number of iterations increases. Note that ZOO-ADMM [25] and ZO-GADM [9] are similar algorithms, and ZOO-ADMM uses a better zeroth-order gradient estimator, so we choose ZOO-ADMM as the comparison method. These results demonstrate that our algorithms have lower function query complexity than the other algorithms. Figure 3 shows some structured perturbations estimated by our methods, which can successfully attack the corresponding DNNs.
In this paper, we proposed a fast ZO-SPIDER-ADMM method with lower function query complexity to solve the problem (1). Moreover, we proved that the ZO-SPIDER-ADMM has the optimal function query complexity of for finding an -approximate stationary point, which improves on the existing nonconvex zeroth-order stochastic ADMM methods by a factor . Further, we extended the ZO-SPIDER-ADMM method to the online setting and proposed a fast ZOO-SPIDER-ADMM method. We also proved that the ZOO-SPIDER-ADMM method has the optimal function query complexity of , which improves on the best existing results by a factor .
- Agarwal et al.  Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT, pages 28–40. Citeseer, 2010.
- Balasubramanian and Ghadimi  Krishnakumar Balasubramanian and Saeed Ghadimi. Zeroth-order (non)-convex stochastic optimization via conditional gradient and gradient updates. In Advances in Neural Information Processing Systems, pages 3455–3464, 2018.
- Beck and Teboulle  Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.
- Boyd et al.  Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
- Chen et al.  Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. Workshop on Artificial Intelligence and Security, pages 15–26. ACM, 2017.
- Choromanski et al.  Krzysztof Choromanski, Mark Rowland, Vikas Sindhwani, Richard E Turner, and Adrian Weller. Structured evolution with compact architectures for scalable policy optimization. arXiv preprint arXiv:1804.02395, 2018.
- Fang et al.  Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 689–699, 2018.
- Gabay and Mercier  Daniel Gabay and Bertrand Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976.
- Gao et al.  Xiang Gao, Bo Jiang, and Shuzhong Zhang. On the information-adaptive variants of the admm: an iteration complexity perspective. Journal of Scientific Computing, 76(1):327–363, 2018.
- Ghadimi and Lan  Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
- Ghadimi et al.  Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267–305, 2016.
- Gu et al.  Bin Gu, Zhouyuan Huo, Cheng Deng, and Heng Huang. Faster derivative-free stochastic algorithm for shared memory machines. In ICML, pages 1807–1816, 2018.
- Hajinezhad et al.  Davood Hajinezhad, Mingyi Hong, and Alfredo Garcia. Zeroth order nonconvex multi-agent optimization over networks. arXiv preprint arXiv:1710.09997, 2017.
- Hong et al.  Mingyi Hong, Zhi-Quan Luo, and Meisam Razaviyayn. Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM Journal on Optimization, 26(1):337–364, 2016.
- Huang et al.  Feihu Huang, Songcan Chen, and Zhaosong Lu. Stochastic alternating direction method of multipliers with variance reduction for nonconvex optimization. arXiv preprint arXiv:1610.02758, 2016.
- Huang et al. [2019a] Feihu Huang, Songcan Chen, and Heng Huang. Faster stochastic alternating direction method of multipliers for nonconvex optimization. In International Conference on Machine Learning, pages 2839–2848, 2019a.
- Huang et al. [2019b] Feihu Huang, Shangqian Gao, Songcan Chen, and Heng Huang. Zeroth-order stochastic alternating direction method of multipliers for nonconvex nonsmooth optimization. In IJCAI, 2019b.
- Huang et al. [2019c] Feihu Huang, Bin Gu, Zhouyuan Huo, Songcan Chen, and Heng Huang. Faster gradient-free proximal stochastic methods for nonconvex nonsmooth optimization. arXiv preprint arXiv:1902.06158, 2019c.
- Ji et al.  Kaiyi Ji, Zhe Wang, Yi Zhou, and Yingbin Liang. Improved zeroth-order variance reduced algorithms and analysis for nonconvex optimization. In International Conference on Machine Learning, pages 3100–3109, 2019.
- Jiang et al.  Bo Jiang, Tianyi Lin, Shiqian Ma, and Shuzhong Zhang. Structured nonconvex and nonsmooth optimization: algorithms and iteration complexity analysis. Computational Optimization and Applications, 72(1):115–157, 2019.
- Johnson and Zhang  Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315–323, 2013.
- Kim et al.  Seyoung Kim, Kyung-Ah Sohn, and Eric P Xing. A multivariate regression approach to association analysis of a quantitative trait network. Bioinformatics, 25(12):i204–i212, 2009.
- Lian et al.  Xiangru Lian, Huan Zhang, Cho-Jui Hsieh, Yijun Huang, and Ji Liu. A comprehensive linear speedup analysis for asynchronous stochastic parallel optimization from zeroth-order to first-order. In Advances in Neural Information Processing Systems, pages 3054–3062, 2016.
- Liu et al. [2018a] Liu Liu, Minhao Cheng, Cho-Jui Hsieh, and Dacheng Tao. Stochastic zeroth-order optimization via variance reduction method. CoRR, abs/1805.11811, 2018a.
- Liu et al. [2018b] Sijia Liu, Jie Chen, Pin-Yu Chen, and Alfred Hero. Zeroth-order online alternating direction method of multipliers: Convergence analysis and applications. In AISTATS, volume 84, pages 288–297, 2018b.
- Liu et al. [2018c] Sijia Liu, Bhavya Kailkhura, Pin-Yu Chen, Paishun Ting, Shiyu Chang, and Lisa Amini. Zeroth-order stochastic variance reduction for nonconvex optimization. In NIPS, pages 3731–3741, 2018c.
- Liu et al.  Yuanyuan Liu, Fanhua Shang, and James Cheng. Accelerated variance reduced stochastic admm. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
- Nesterov and Spokoiny  Yurii Nesterov and Vladimir G. Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17:527–566, 2017.
- Nguyen et al.  Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. Sarah: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2613–2621. JMLR. org, 2017.
- Nishihara et al.  Robert Nishihara, Laurent Lessard, Ben Recht, Andrew Packard, and Michael Jordan. A general analysis of the convergence of admm. In International Conference on Machine Learning, pages 343–352, 2015.
- Ouyang et al.  Hua Ouyang, Niao He, Long Tran, and Alexander G Gray. Stochastic alternating direction method of multipliers. ICML, 28:80–88, 2013.
- Reddi et al.  Sashank Reddi, Suvrit Sra, Barnabas Poczos, and Alexander J Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In NIPS, pages 1145–1153, 2016.
- Suzuki  Taiji Suzuki. Stochastic dual coordinate ascent with alternating direction method of multipliers. In ICML, pages 736–744, 2014.
- Taylor et al.  Gavin Taylor, Ryan Burmeister, Zheng Xu, Bharat Singh, Ankit Patel, and Tom Goldstein. Training neural networks without gradients: a scalable admm approach. In ICML, pages 2722–2731, 2016.
- Wang et al. [2015a] Fenghui Wang, Wenfei Cao, and Zongben Xu. Convergence of multi-block bregman admm for nonconvex composite problems. arXiv preprint arXiv:1505.03063, 2015a.
- Wang et al. [2015b] Yu Wang, Wotao Yin, and Jinshan Zeng. Global convergence of admm in nonconvex nonsmooth optimization. Journal of Scientific Computing, pages 1–35, 2015b.
- Wang et al.  Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tarokh. Spiderboost: A class of faster variance-reduced algorithms for nonconvex optimization. arXiv preprint arXiv:1810.10690, 2018.
- Xu et al.  Kaidi Xu, Sijia Liu, Pu Zhao, Pin-Yu Chen, Huan Zhang, Deniz Erdogmus, Yanzhi Wang, and Xue Lin. Structured adversarial attack: Towards general implementation and better interpretability. arXiv preprint arXiv:1808.01664, 2018.
- Xu et al.  Yi Xu, Mingrui Liu, Qihang Lin, and Tianbao Yang. Admm without a fixed penalty parameter: Faster convergence with new adaptive penalization. In Advances in Neural Information Processing Systems, pages 1267–1277, 2017.
- Zheng and Kwok  Shuai Zheng and James T Kwok. Fast and light stochastic admm. In IJCAI, 2016.
Appendix A Supplementary Materials
In this section, we study in detail the convergence properties of both the ZO-SPIDER-ADMM and ZOO-SPIDER-ADMM algorithms. Throughout the paper, let such that . First, we restate some useful lemmas.
Reddi et al.  For random variables are independent and mean , we have
Ji et al.  Given , for any given smoothing parameter and any , we have
Applying the mean value theorem to the gradient , we have, for some ,