Nonconvex Zeroth-Order Stochastic ADMM Methods with Lower Function Query Complexity

07/30/2019 ∙ by Feihu Huang, et al. ∙ Simon Fraser University ∙ University of Pittsburgh

Zeroth-order (gradient-free) methods are a class of powerful optimization tools for many machine learning problems, because they only need function values (not gradients) in the optimization. In particular, zeroth-order methods are very suitable for many complex problems, such as black-box attacks and bandit feedback, whose explicit gradients are difficult or infeasible to obtain. Although many zeroth-order methods have recently been developed, these approaches still have two main drawbacks: 1) high function query complexity; 2) not being well-suited to solving problems with complex penalties and constraints. To address these challenging drawbacks, in this paper we propose a novel fast zeroth-order stochastic alternating direction method of multipliers (ADMM) method (i.e., ZO-SPIDER-ADMM) with lower function query complexity for solving nonconvex problems with multiple nonsmooth penalties. Moreover, we prove that our ZO-SPIDER-ADMM has the optimal function query complexity of O(dn + dn^{1/2}ε^{-1}) for finding an ε-approximate local solution, where n and d denote the sample size and the dimension of the data, respectively. In particular, the ZO-SPIDER-ADMM improves the existing best nonconvex zeroth-order ADMM methods by a factor of O(d^{1/3}n^{1/6}). Moreover, we propose a fast online ZO-SPIDER-ADMM (i.e., ZOO-SPIDER-ADMM). Our theoretical analysis shows that the ZOO-SPIDER-ADMM has the function query complexity of O(dε^{-3/2}), which improves the existing best result by a factor of O(ε^{-1/2}). Finally, we utilize the task of structured adversarial attacks on black-box deep neural networks to demonstrate the efficiency of our algorithms.


1 Introduction

Zeroth-order (gradient-free) methods are powerful optimization tools for many machine learning problems in which the gradient of the objective function is not available or is computationally prohibitive. For example, zeroth-order optimization methods have been applied to bandit feedback analysis [1], reinforcement learning [6], and adversarial attacks on black-box deep neural networks (DNNs) [5, 26]. Recently, zeroth-order methods have been increasingly studied. Based on the Gaussian smoothing gradient estimator, Nesterov and Spokoiny [28] proposed zeroth-order gradient descent methods, and Ghadimi and Lan [10] presented zeroth-order stochastic gradient methods. More recently, Balasubramanian and Ghadimi [2] introduced a class of zeroth-order conditional gradient methods. To deal with nonsmooth penalties, Liu et al. [25] and Gao et al. [9] proposed zeroth-order online and stochastic alternating direction method of multipliers (ADMM) methods.

So far, the above zeroth-order algorithms mainly rely on the convexity of the problems. In fact, zeroth-order methods are also highly successful in solving many nonconvex problems such as adversarial attacks on black-box DNNs [5, 26]. Thus, Ghadimi and Lan [10] studied zeroth-order stochastic gradient descent (ZO-SGD) methods for nonconvex optimization. To accelerate ZO-SGD, Liu et al. [26, 24] proposed fast zeroth-order stochastic variance-reduced gradient (ZO-SVRG) methods based on SVRG [21]. For big-data optimization, asynchronous parallel zeroth-order methods [23, 12] and distributed zeroth-order methods [13] have been developed. More recently, to reduce the function query complexity, SPIDER-SZO [7] and ZO-SPIDER-Coord [19] have been introduced using the stochastic path-integrated differential estimator, i.e., SPIDER [7], which is improved by SpiderBoost [37] and is also a variant of the stochastic recursive gradient algorithm (SARAH [29]). To deal with nonsmooth regularization, Ghadimi et al. [11] and Huang et al. [18] proposed zeroth-order proximal stochastic gradient methods. However, these nonconvex zeroth-order methods are still not well-suited to many machine learning problems with complex nonsmooth penalties and constraints, such as the structured adversarial attack on black-box DNNs in [17]. Thus, Huang et al. [17] recently proposed a class of nonconvex zeroth-order stochastic ADMM methods (i.e., ZO-SVRG-ADMM and ZO-SAGA-ADMM) to solve these complex problems. However, these zeroth-order ADMM methods suffer from high function query complexity (see Table 1).

Type       | Algorithm       | Reference | Problem        | Function Query Complexity
Finite-sum | ZO-SVRG-ADMM    | [17]      | NC(S) + C(mNS) | O(dn + d^{4/3}n^{2/3}ε^{-1})
Finite-sum | ZO-SAGA-ADMM    | [17]      | NC(S) + C(mNS) | O(dn + d^{4/3}n^{2/3}ε^{-1})
Finite-sum | ZO-SPIDER-ADMM  | Ours      | NC(S) + C(mNS) | O(dn + dn^{1/2}ε^{-1})
Online     | ZOO-ADMM        | [25]      | C(S) + C(NS)   | O(dε^{-2})
Online     | ZO-GADM         | [9]       | C(S) + C(NS)   | O(dε^{-2})
Online     | ZOO-SPIDER-ADMM | Ours      | NC(S) + C(mNS) | O(dε^{-3/2})
Table 1: Convergence properties comparison of the zeroth-order ADMM algorithms for finding an ε-approximate (local) solution. C, NC, S, NS and mNS are the abbreviations of convex, non-convex, smooth, non-smooth and the sum of multiple non-smooth functions, respectively. d is the dimension of the data and n denotes the sample size.

In this paper, we thus propose a class of faster zeroth-order stochastic ADMM methods with lower function query complexity to solve the following nonconvex nonsmooth problem:

(1)    min_{x, y_{[K]}}  f(x) + Σ_{k=1}^{K} ψ_k(y_k),   s.t.  Ax + Σ_{k=1}^{K} B_k y_k = c,

where x ∈ ℝ^d, the matrices A and {B_k}_{k=1}^K and the vector c have compatible dimensions, f(x) := (1/n) Σ_{i=1}^{n} f_i(x) (finite-sum setting) or f(x) := E_ξ[f(x; ξ)] (online setting) is a nonconvex and smooth function, and each ψ_k(y_k) is a convex and possibly nonsmooth function. Note that the explicit gradients of the component functions f_i are difficult or infeasible to obtain, so we have to use zeroth-order methods [28, 26] to estimate the gradient of each f_i. For the problem (1), the finite-sum subproblem generally comes from empirical loss minimization in machine learning, and the online subproblem comes from expected loss minimization, where the random variable ξ follows an unknown data distribution. To address the online subproblem, we extend our ZO-SPIDER-ADMM to the online setting and propose an online ZO-SPIDER-ADMM algorithm. Our main contributions are summarized as follows:

  • We propose a fast zeroth-order stochastic ADMM method (i.e., ZO-SPIDER-ADMM) with lower function query complexity to solve the problem (1).

  • We prove that the ZO-SPIDER-ADMM has the optimal function query complexity of O(dn + dn^{1/2}ε^{-1}) for finding an ε-approximate local solution, which improves the existing best nonconvex zeroth-order ADMM methods by a factor of O(d^{1/3}n^{1/6}).

  • We extend the ZO-SPIDER-ADMM to the online setting and propose a faster online zeroth-order ADMM method (i.e., ZOO-SPIDER-ADMM). Moreover, we prove that the ZOO-SPIDER-ADMM has the optimal function query complexity of O(dε^{-3/2}), which improves the existing best result by a factor of O(ε^{-1/2}).

1.1 Notations

Let [n] = {1, 2, …, n} and let ‖·‖ denote the Euclidean norm. Given a positive definite matrix G, ‖x‖²_G = xᵀGx; φ_max(G) and φ_min(G) denote the largest and smallest eigenvalues of G, respectively, and the condition number of G is κ_G = φ_max(G)/φ_min(G). φ^A_max and φ^A_min denote the largest and smallest eigenvalues of the matrix AᵀA, respectively. Given positive definite matrices {G_k}_{k=1}^K, let φ_max = max_k φ_max(G_k) and φ_min = min_k φ_min(G_k).

2 Related Work

ADMM [8, 4] is a popular optimization method for solving composite and constrained problems in machine learning. Due to its flexibility in splitting the objective function into a loss and a complex penalty, the ADMM can relatively easily solve problems with complicated structured penalties, such as the graph-guided fused lasso [22], which are too complicated for other popular optimization methods such as proximal gradient methods [3]. Thus, ADMM has been widely studied in recent years [30, 39]. For large-scale optimization, stochastic ADMM methods [31, 33, 40, 27] have been proposed. In fact, the ADMM method is also successful in solving many nonconvex machine learning problems, such as training neural networks [34]. Thus, nonconvex ADMM and its stochastic variants have been developed in [35, 36, 14, 20], and nonconvex stochastic ADMM methods [15, 16] have been studied at the same time.

So far, the above ADMM methods need to repeatedly calculate gradients of the loss function over the iterations. However, in many machine learning problems, the gradients of the objective functions are difficult or infeasible to obtain. For example, in adversarial attacks on black-box DNNs [5, 26], only evaluation values (i.e., function values) are provided. Thus, [25, 9] proposed zeroth-order online and stochastic ADMM methods for solving convex problems. More recently, [17] proposed the nonconvex ZO-SVRG-ADMM and ZO-SAGA-ADMM methods.

3 Faster Zeroth-Order Stochastic ADMM Methods

In this section, we propose a faster zeroth-order stochastic ADMM method (i.e., ZO-SPIDER-ADMM) to solve the problem (1) based on the SPIDER method. Moreover, we extend the ZO-SPIDER-ADMM to the online setting, and propose a faster online zeroth-order ADMM, i.e., ZOO-SPIDER-ADMM.

First, we give the augmented Lagrangian function of the problem (1) as follows:

(2)    L_ρ(x, y_{[K]}, λ) = f(x) + Σ_{k=1}^{K} ψ_k(y_k) − ⟨λ, Ax + Σ_{k=1}^{K} B_k y_k − c⟩ + (ρ/2)‖Ax + Σ_{k=1}^{K} B_k y_k − c‖²,

where λ denotes the dual variable and ρ > 0 denotes the penalty parameter.

1:  Input: total iterations T, epoch length q, mini-batch size b, step size η > 0, penalty parameter ρ > 0, and smoothing parameter μ > 0;
2:  Initialize: x_0, y_k^0 for all k ∈ [K], and λ_0;
3:  for t = 0, 1, …, T − 1 do
4:     if mod(t, q) = 0 then
5:         Compute the full zeroth-order gradient v_t = (1/n) Σ_{i=1}^{n} ∇̂f_i(x_t) using the estimator (3);
6:     else
7:         Uniformly randomly pick a mini-batch I_t of size b from {1, …, n} with replacement, then update v_t = (1/b) Σ_{i∈I_t} ( ∇̂f_i(x_t) − ∇̂f_i(x_{t−1}) ) + v_{t−1};
8:     end if
9:      Update y_k^{t+1} by solving the subproblem (4) (equivalently, the proximal step (5)) for all k ∈ [K];
10:      Update x_{t+1} = argmin_x L̂_ρ(x, y_{[K]}^{t+1}, λ_t, v_t);
11:      Update λ_{t+1} = λ_t − ρ( A x_{t+1} + Σ_{k=1}^{K} B_k y_k^{t+1} − c );
12:  end for
13:  Output: (x_ζ, y_{[K]}^ζ, λ_ζ) with ζ chosen uniformly at random from {1, …, T}.
Algorithm 1 ZO-SPIDER-ADMM Algorithm

In the problem (1), the explicit expression of the gradient of each function f_i is not available, and only the function value of f_i is available. Thus, we use the coordinate-wise smoothing gradient estimator [26] to evaluate the gradients as follows:

(3)    ∇̂f_i(x) = Σ_{j=1}^{d} ( f_i(x + μ_j e_j) − f_i(x − μ_j e_j) ) / (2 μ_j) · e_j,

where μ_j is a coordinate-wise smoothing parameter, and e_j is a standard basis vector with 1 at its j-th coordinate and 0 otherwise. Without loss of generality, let μ_j = μ for all j ∈ [d]. Considering the case that the sample size n is large, we use a mini-batch of b samples instead of all n samples when estimating the gradient of f. Algorithm 1 describes the ZO-SPIDER-ADMM algorithm.
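To make the estimator (3) concrete, the following minimal Python (NumPy) sketch computes the coordinate-wise zeroth-order gradient of a black-box function; the function name zo_coord_grad and the quadratic test function are illustrative choices, not part of the paper:

```python
import numpy as np

def zo_coord_grad(f, x, mu=1e-3):
    """Coordinate-wise zeroth-order gradient estimator, as in (3).

    Uses 2*d function queries: (f(x + mu*e_j) - f(x - mu*e_j)) / (2*mu) per coordinate."""
    d = x.shape[0]
    grad = np.zeros(d)
    for j in range(d):
        e = np.zeros(d)
        e[j] = 1.0
        grad[j] = (f(x + mu * e) - f(x - mu * e)) / (2.0 * mu)
    return grad

# Sanity check on a smooth function whose true gradient is known: f(x) = ||x||^2 / 2.
x = np.array([1.0, -2.0, 0.5])
est = zo_coord_grad(lambda z: 0.5 * np.dot(z, z), x)
print(np.allclose(est, x, atol=1e-3))  # the estimate should be close to the true gradient x
```

Note that each call costs 2d function queries, which is where the dimension factor d in the query complexities of Table 1 comes from.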

Since we use the inexact stochastic zeroth-order gradient v_t to update x, we define an approximated function of the augmented Lagrangian (2) over x as follows:

L̂_ρ(x, y_{[K]}^{t+1}, λ_t, v_t) = f(x_t) + v_tᵀ(x − x_t) + (1/(2η))‖x − x_t‖²_G − ⟨λ_t, Ax + Σ_{k=1}^{K} B_k y_k^{t+1} − c⟩ + (ρ/2)‖Ax + Σ_{k=1}^{K} B_k y_k^{t+1} − c‖²,

where η > 0 is a step size and G is a positive definite matrix. At the step 10 of Algorithm 1, we can easily obtain x_{t+1} = argmin_x L̂_ρ(x, y_{[K]}^{t+1}, λ_t, v_t). To avoid computing the inverse of a matrix involving AᵀA, we can set G = rI_d − ηρAᵀA with r large enough that G remains positive definite, which linearizes the term (ρ/2)‖Ax + Σ_{k=1}^{K} B_k y_k^{t+1} − c‖².
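Assuming the approximated function takes the form displayed above, this choice of G makes the x-subproblem admit a simple closed form: the quadratic terms in x contributed by the penalty and by the G-weighted proximal term cancel, leaving the sketch

\[
x_{t+1} \;=\; x_t - \frac{\eta}{r}\Big( v_t - A^\top \lambda_t + \rho\, A^\top \big( A x_t + \textstyle\sum_{k=1}^{K} B_k y_k^{t+1} - c \big) \Big),
\]

so each x-update only needs matrix-vector products with A and Aᵀ rather than a matrix inversion.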

1:  Input: total iterations T, epoch length q, sample sizes n_1 and n_2 (with n_2 ≤ n_1), step size η > 0, penalty parameter ρ > 0, and smoothing parameter μ > 0;
2:  Initialize: x_0, y_k^0 for all k ∈ [K], and λ_0;
3:  for t = 0, 1, …, T − 1 do
4:     if mod(t, q) = 0 then
5:         Draw n_1 samples {ξ_i}_{i=1}^{n_1} independently, then compute v_t = (1/n_1) Σ_{i=1}^{n_1} ∇̂f(x_t; ξ_i);
6:     else
7:         Draw n_2 samples {ξ_i}_{i=1}^{n_2} independently, then update v_t = (1/n_2) Σ_{i=1}^{n_2} ( ∇̂f(x_t; ξ_i) − ∇̂f(x_{t−1}; ξ_i) ) + v_{t−1};
8:     end if
9:      Update y_k^{t+1} by solving the subproblem (4) (equivalently, the proximal step (5)) for all k ∈ [K];
10:      Update x_{t+1} = argmin_x L̂_ρ(x, y_{[K]}^{t+1}, λ_t, v_t);
11:      Update λ_{t+1} = λ_t − ρ( A x_{t+1} + Σ_{k=1}^{K} B_k y_k^{t+1} − c );
12:  end for
13:  Output: (x_ζ, y_{[K]}^ζ, λ_ζ) with ζ chosen uniformly at random from {1, …, T}.
Algorithm 2 ZOO-SPIDER-ADMM Algorithm

At the step 9 of Algorithm 1, we update each parameter y_k by solving the following subproblem

(4)    y_k^{t+1} = argmin_{y_k} { ψ_k(y_k) − ⟨λ_t, Ax_t + Σ_{j=1}^{K} B_j y_j − c⟩ + (ρ/2)‖Ax_t + Σ_{j=1}^{K} B_j y_j − c‖² + (1/2)‖y_k − y_k^t‖²_{G_k} },

where the blocks y_j with j ≠ k are held at their current values and G_k is a positive definite matrix. When we set G_k = r_k I − ρ B_kᵀB_k with r_k large enough that G_k remains positive definite, the quadratic term (ρ/2)‖B_k y_k‖² is linearized, and we can use the following proximal operator to update each y_k:

(5)    y_k^{t+1} = prox_{ψ_k / r_k}( y_k^t − (1/r_k) B_kᵀ( ρ(Ax_t + Σ_{j=1}^{K} B_j y_j^t − c) − λ_t ) ),

where prox_{h}(z) = argmin_y { h(y) + (1/2)‖y − z‖² }.
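As a concrete, hypothetical instance of the proximal step (5), suppose one block uses an ℓ1 penalty ψ_k(y) = ν‖y‖₁ with B_k = I; then prox_{ψ_k/r_k} reduces to coordinate-wise soft-thresholding. The following minimal sketch, with the names prox_l1 and y_update chosen only for illustration, shows this update in Python (NumPy):

```python
import numpy as np

def prox_l1(z, tau):
    # Proximal operator of tau * ||y||_1: coordinate-wise soft-thresholding.
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def y_update(y_t, A, x_t, By_sum, c, lam, rho, r_k, nu):
    """One linearized y_k-step of the form (5), assuming psi_k = nu * ||y||_1 and B_k = I.

    By_sum is the sum over all blocks B_j y_j^t (including this one)."""
    residual = A @ x_t + By_sum - c      # current constraint residual
    grad_term = rho * residual - lam     # B_k^T (rho * residual - lambda) with B_k = I
    z = y_t - grad_term / r_k            # gradient step on the smooth part
    return prox_l1(z, nu / r_k)          # proximal step on the nonsmooth penalty

# Tiny usage example with random data (dimensions are illustrative only).
rng = np.random.default_rng(0)
d, p = 5, 5
A = rng.standard_normal((p, d)); x_t = rng.standard_normal(d)
y_t = rng.standard_normal(p); lam = np.zeros(p); c = np.zeros(p)
y_next = y_update(y_t, A, x_t, y_t.copy(), c, lam, rho=1.0, r_k=5.0, nu=0.1)
```

Other common choices of ψ_k (group lasso, box-constraint indicators) simply swap in the corresponding proximal operator.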

Next, we extend the above ZO-SPIDER-ADMM method to the online setting and propose an online ZO-SPIDER-ADMM (i.e., ZOO-SPIDER-ADMM) to solve the online form of problem (1). In the online setting, f(x) = E_ξ[f(x; ξ)] denotes a population risk over an underlying data distribution. The online problem (1) can be viewed as having infinitely many samples, so we cannot compute the full zeroth-order gradient of f exactly. In this case, we draw a large sample to estimate the full zeroth-order gradient. The online ZO-SPIDER-ADMM is described in Algorithm 2.
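The common core of Algorithms 1 and 2 is the SPIDER-style recursive gradient estimator built on top of the coordinate-wise estimator (3): a (sampled) full zeroth-order gradient is recomputed every q iterations, and in between, the estimate is corrected with a small mini-batch evaluated at two consecutive iterates. The following minimal Python sketch illustrates this estimator update; the function names spider_zo_grad and zo_coord_grad, the least-squares components, and the sampling details are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

def zo_coord_grad(f, x, mu=1e-3):
    # Coordinate-wise zeroth-order gradient estimator (3): 2*d queries per call.
    d = x.shape[0]
    g = np.zeros(d)
    for j in range(d):
        e = np.zeros(d); e[j] = 1.0
        g[j] = (f(x + mu * e) - f(x - mu * e)) / (2.0 * mu)
    return g

def spider_zo_grad(t, q, x_t, x_prev, v_prev, fs, rng, batch_size, mu=1e-3):
    """SPIDER-style zeroth-order gradient estimate v_t (steps 4-8 of Algorithms 1 and 2)."""
    n = len(fs)
    if t % q == 0:
        # Reference point: (sampled) full zeroth-order gradient.
        return np.mean([zo_coord_grad(f, x_t, mu) for f in fs], axis=0)
    # Recursive correction computed on a mini-batch drawn with replacement.
    idx = rng.integers(0, n, size=batch_size)
    diffs = [zo_coord_grad(fs[i], x_t, mu) - zo_coord_grad(fs[i], x_prev, mu) for i in idx]
    return v_prev + np.mean(diffs, axis=0)

# Tiny usage example on synthetic least-squares components (illustrative only).
rng = np.random.default_rng(0)
n, d = 20, 4
A_data = rng.standard_normal((n, d)); b_data = rng.standard_normal(n)
fs = [lambda x, a=A_data[i], bi=b_data[i]: 0.5 * (a @ x - bi) ** 2 for i in range(n)]
x_prev = np.zeros(d)
v = spider_zo_grad(0, 5, x_prev, x_prev, None, fs, rng, batch_size=4)
x_t = x_prev - 0.1 * v
v = spider_zo_grad(1, 5, x_t, x_prev, v, fs, rng, batch_size=4)
```

The recursion reuses the previous estimate v_{t−1}, so each non-reference iteration only queries the function on a small mini-batch, which is the source of the lower query complexity analyzed in the next section.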

4 Theoretical Analysis

In this section, we study the convergence properties of both the ZO-SPIDER-ADMM and ZOO-SPIDER-ADMM methods. All related proofs are provided in the supplementary document. First, we restate the standard ε-approximate stationary point of the problem (1), as used in [20, 16].

Definition 1.

Given ε > 0, the point (x*, y*_{[K]}, λ*) is said to be an ε-stationary point of the problem (1) if it holds that

(6)    E[ ‖∇f(x*) − Aᵀλ*‖² + Σ_{k=1}^{K} dist(B_kᵀλ*, ∂ψ_k(y_k*))² + ‖Ax* + Σ_{k=1}^{K} B_k y_k* − c‖² ] ≤ ε,

where ∂ψ_k(y_k) denotes the subdifferential of ψ_k at y_k, and dist(z, S) = inf_{s ∈ S} ‖z − s‖ denotes the distance from the point z to the set S.

Next, we give some standard assumptions regarding the problem (1) as follows:

Assumption 1.

Each loss function f_i (i ∈ [n]) is L-smooth, i.e., for any x, x′ ∈ ℝ^d, ‖∇f_i(x) − ∇f_i(x′)‖ ≤ L‖x − x′‖.

Assumption 2.

For the expectation (online) setting, the variance of the stochastic gradient is bounded, i.e., E‖∇f(x; ξ) − ∇f(x)‖² ≤ σ² for all x. The full gradient of the loss function is bounded, i.e., there exists a constant δ > 0 such that ‖∇f(x)‖ ≤ δ for all x.

Assumption 3.

f(x) and ψ_k(y_k) for all k ∈ [K] are lower bounded; let f* = inf_x f(x) and ψ_k* = inf_{y_k} ψ_k(y_k).

Assumption 4.

The matrix A is of full row or column rank.

Assumption 1 imposes smoothness on the individual loss functions, which is commonly used in the convergence analysis of nonconvex algorithms [11]. Assumption 2 states that both the full gradient and the variance of the stochastic gradients are bounded; the variance-boundedness part is only needed in the online case. Assumption 3 guarantees the feasibility of the optimization problem, which has been used in the study of nonconvex ADMMs [14, 20]. Assumption 4 guarantees that the matrix AAᵀ or AᵀA is non-singular, which is commonly used in the convergence analysis of nonconvex ADMM algorithms [14, 20]. Without loss of generality, we assume A has full column rank in the following.

4.1 Convergence Analysis of the ZO-SPIDER-ADMM

In this subsection, we study the convergence properties of the ZO-SPIDER-ADMM algorithm.

Lemma 1.

Suppose the sequence {x_t, y_{[K]}^t, λ_t} is generated by Algorithm 1, and define a Lyapunov function Φ_t based on the augmented Lagrangian (2). If the step size η, the penalty parameter ρ, and the matrices G and {G_k}_{k=1}^K are chosen appropriately, then the sequence {Φ_t} decreases at each iteration by a quantity proportional to the squared successive differences of the iterates, up to an error term controlled by the smoothing parameter μ, where f* is a lower bound of the function f.

Next, based on the above Lemma 1, we give the convergence properties of the ZO-SPIDER-ADMM.

Theorem 1.

Suppose the sequence {x_t, y_{[K]}^t, λ_t} is generated by Algorithm 1, with the step size η, the penalty parameter ρ, and the matrices G and {G_k}_{k=1}^K chosen as in Lemma 1. Then we have

(7)    (1/T) Σ_{t=1}^{T} E[θ_t] ≤ O( (Φ_1 − Φ*)/T + dL²μ² ),

where θ_t denotes the stationarity measure in Definition 1 evaluated at the t-th iterate, Φ* is a lower bound of the Lyapunov function Φ_t, and f* is a lower bound of the function f. It implies that if the iteration number T = O(ε^{-1}) and the smoothing parameter μ is sufficiently small, then (x_ζ, y_{[K]}^ζ, λ_ζ) is an ε-approximate stationary point of (1), where ζ is selected uniformly at random from {1, …, T}.

Remark 1.

Theorem 1 shows that, choosing the epoch length and mini-batch size as q = b = ⌈n^{1/2}⌉ and an appropriately small smoothing parameter μ, the ZO-SPIDER-ADMM has a convergence rate of O(1/T) and the optimal function query complexity of O(dn + dn^{1/2}ε^{-1}) for finding an ε-approximate stationary point. Specifically, the function query complexity comes from two parts: the first part comes from using O(dn) function values to compute the zeroth-order full gradient once every q iterations; the second part comes from using O(db) function values to compute the mini-batch zeroth-order stochastic gradient at each iteration; and our algorithm needs T = O(ε^{-1}) iterations to obtain an ε-approximate stationary point.
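For intuition, the total query count in Remark 1 can be tallied explicitly. Assuming an epoch length and mini-batch size of q = b = ⌈n^{1/2}⌉ and T = O(ε^{-1}) iterations, and noting that the coordinate-wise estimator (3) uses 2d function queries per component gradient, a rough count gives

\[
\underbrace{\Big\lceil \tfrac{T}{q} \Big\rceil \cdot 2dn}_{\text{full gradients}}
\;+\;
\underbrace{T \cdot 2db}_{\text{mini-batch gradients}}
\;=\;
O\!\big(dn + d n^{1/2}\epsilon^{-1}\big) + O\!\big(d n^{1/2}\epsilon^{-1}\big)
\;=\;
O\!\big(dn + d n^{1/2}\epsilon^{-1}\big).
\]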

4.2 Convergence Analysis of Online ZO-SPIDER-ADMM

In this subsection, we study the convergence properties of the online ZO-SPIDER-ADMM (i.e., ZOO-SPIDER-ADMM) algorithm.

Lemma 2.

Suppose the sequence {x_t, y_{[K]}^t, λ_t} is generated by Algorithm 2, and define a Lyapunov function Φ_t based on the augmented Lagrangian (2). If the step size η, the penalty parameter ρ, and the matrices G and {G_k}_{k=1}^K are chosen appropriately, then the sequence {Φ_t} decreases at each iteration by a quantity proportional to the squared successive differences of the iterates, up to an error term controlled by the smoothing parameter μ and the sample sizes, where f* is a lower bound of the function f.

Next, based on the above Lemma 2, we give the convergence properties of the ZOO-SPIDER-ADMM.

Theorem 2.

Suppose the sequence {x_t, y_{[K]}^t, λ_t} is generated by Algorithm 2, with the step size η, the penalty parameter ρ, and the matrices G and {G_k}_{k=1}^K chosen as in Lemma 2. Then we have

(8)    (1/T) Σ_{t=1}^{T} E[θ_t] ≤ O( (Φ_1 − Φ*)/T + dL²μ² + σ²/n_1 ),

where θ_t denotes the stationarity measure in Definition 1 evaluated at the t-th iterate, Φ* is a lower bound of the Lyapunov function Φ_t, and f* is a lower bound of the function f. It implies that if the iteration number T = O(ε^{-1}), the sample sizes n_1 = O(ε^{-1}) and n_2 = O(ε^{-1/2}), the epoch length q = O(ε^{-1/2}), and a sufficiently small smoothing parameter μ are chosen,

then (x_ζ, y_{[K]}^ζ, λ_ζ) is an ε-approximate stationary point of (1), where ζ is selected uniformly at random from {1, …, T}.

Remark 2.

Theorem 2 shows that, with the sample sizes, the epoch length, the mini-batch size, and the smoothing parameter chosen as in Theorem 2, the ZOO-SPIDER-ADMM has the optimal function query complexity of O(dε^{-3/2}) for finding an ε-approximate stationary point.
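A similar tally is consistent with the online bound, under the parameter choices stated in Theorem 2 (a reference sample of size O(ε^{-1}), an epoch length q = O(ε^{-1/2}), a per-iteration mini-batch of size O(ε^{-1/2}), and T = O(ε^{-1}) iterations):

\[
\underbrace{\Big\lceil \tfrac{T}{q} \Big\rceil \cdot 2d\,O(\epsilon^{-1})}_{\text{reference gradients}}
\;+\;
\underbrace{T \cdot 2d\,O(\epsilon^{-1/2})}_{\text{mini-batch gradients}}
\;=\;
O\!\big(d\epsilon^{-3/2}\big) + O\!\big(d\epsilon^{-3/2}\big)
\;=\;
O\!\big(d\epsilon^{-3/2}\big).
\]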

5 Experimental Results

In this section, we utilize the task of structured adversarial attacks on black-box DNNs [17] to verify the efficiency of our algorithms. In the experiments, we compare our algorithm (ZO-SPIDER-ADMM) with ZO-SVRG-ADMM [17], ZO-SAGA-ADMM [17], and the zeroth-order stochastic ADMM without variance reduction (ZO-SGD-ADMM).

Dataset        | # test samples | # dimensions         | # classes
MNIST          | 10,000         | 28×28×1 (d = 784)    | 10
Fashion-MNIST  | 10,000         | 28×28×1 (d = 784)    | 10
SVHN           | 26,032         | 32×32×3 (d = 3,072)  | 10
CIFAR-10       | 10,000         | 32×32×3 (d = 3,072)  | 10
Table 2: Four Benchmark Datasets for Attacking Black-Box DNNs

5.1 Experimental Setup

In this experiment, we focus on generating adversarial examples to attack pre-trained DNNs whose parameters and structures are hidden from us and whose outputs are the only accessible information. As in [38, 17], we try to learn interesting structures in the adversarial perturbations that can fool the black-box DNNs. Specifically, we try to find a universal structured adversarial perturbation δ that fools a whole group of samples. This task can be formulated as an instance of problem (1):

(9)

where the loss on each sample measures whether the perturbed sample is misclassified, Z(·) represents the final-layer output of the neural network before the softmax, and the overlapping groups are generated by dividing an image into sub-groups of pixels. The nonsmooth terms consist of an overlapping group-lasso penalty on δ together with an indicator function that ensures the validity of the generated adversarial examples: the indicator equals 0 if every perturbed sample remains a valid image and the perturbation magnitude is within the prescribed budget, and +∞ otherwise.
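For concreteness, a common instantiation of the black-box attack loss in problem (9) is a C&W-style hinge on the pre-softmax outputs, combined with an overlapping group-lasso penalty on the perturbation. The sketch below is an assumption-based illustration only: the function names attack_loss and group_lasso, the hinge form, the confidence margin kappa, and the group construction are not taken verbatim from the paper.

```python
import numpy as np

def attack_loss(Z, images, delta, true_labels, kappa=0.0):
    """Average C&W-style hinge loss over a batch: encourages misclassification.

    Z(x) returns the pre-softmax logits of the black-box DNN for input x."""
    losses = []
    for x, t in zip(images, true_labels):
        logits = Z(x + delta)
        other = np.max(np.delete(logits, t))           # best wrong-class logit
        losses.append(max(logits[t] - other + kappa, 0.0))
    return float(np.mean(losses))

def group_lasso(delta_img, kernel=3, stride=1):
    """Overlapping group-lasso penalty: sum of l2 norms over sliding pixel blocks."""
    h, w = delta_img.shape[:2]
    total = 0.0
    for i in range(0, h - kernel + 1, stride):
        for j in range(0, w - kernel + 1, stride):
            total += np.linalg.norm(delta_img[i:i + kernel, j:j + kernel])
    return total
```

In the ADMM splitting of problem (1), the hinge loss plays the role of the smooth nonconvex f (queried only through function values), while the group-lasso penalty and the validity indicator are handled as the nonsmooth blocks ψ_k.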

Figure 1: Attack loss of adversarial attacks on four black-box DNNs (MNIST, Fashion-MNIST, CIFAR-10, SVHN).
Figure 2: Attack loss of adversarial attacks on four black-box DNNs (MNIST, Fashion-MNIST, CIFAR-10, SVHN) in the online setting.

Figure 3: Learned structured perturbations. A red label indicates a successful attack.

In the experiment, we use pre-trained DNN models on the four benchmark datasets in Table 2 as the target black-box models; these pre-trained DNNs attain high test accuracy on MNIST, Fashion-MNIST, SVHN and CIFAR-10. We fix the tuning parameters (the step size, the penalty parameter ρ, and the regularization weight of the group-lasso penalty) and the smoothing parameter μ across all runs. For these datasets, the overlapping groups of the group lasso are obtained by sliding a small square kernel over each image with a stride of one. In the finite-sum (non-online) setting, we select 40 samples from each class and use a mini-batch size of 4. In the online setting, we select 500 samples from each class and use a mini-batch size of 10.

5.2 Experimental Results

Figure 1 shows that the attack loss (i.e., the first term in the problem (9)) of our algorithm decreases faster than that of the other algorithms as the number of iterations increases. Figure 2 shows that the attack loss of our online algorithm also decreases faster than that of the other online algorithms as the number of iterations increases. Note that ZOO-ADMM [25] and ZO-GADM [9] are similar algorithms, and ZOO-ADMM uses a better zeroth-order gradient estimator, so we choose ZOO-ADMM as the comparison method. These results demonstrate that our algorithms have lower function query complexity than the other algorithms. Figure 3 shows some structured perturbations estimated by our methods, which can successfully attack the corresponding DNNs.

6 Conclusion

In this paper, we proposed a fast ZO-SPIDER-ADMM method with lower function query complexity to solve the problem (1). Moreover, we proved that the ZO-SPIDER-ADMM has the optimal function query complexity of O(dn + dn^{1/2}ε^{-1}) for finding an ε-approximate stationary point, which improves the existing nonconvex zeroth-order stochastic ADMM methods by a factor of O(d^{1/3}n^{1/6}). Further, we extended the ZO-SPIDER-ADMM method to the online setting and proposed a fast ZOO-SPIDER-ADMM method. We also proved that the ZOO-SPIDER-ADMM method has the optimal function query complexity of O(dε^{-3/2}), which improves the existing best result by a factor of O(ε^{-1/2}).

References

  • Agarwal et al. [2010] Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT, pages 28–40. Citeseer, 2010.
  • Balasubramanian and Ghadimi [2018] Krishnakumar Balasubramanian and Saeed Ghadimi. Zeroth-order (non)-convex stochastic optimization via conditional gradient and gradient updates. In Advances in Neural Information Processing Systems, pages 3455–3464, 2018.
  • Beck and Teboulle [2009] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.
  • Boyd et al. [2011] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
  • Chen et al. [2017] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Workshop on Artificial Intelligence and Security, pages 15–26. ACM, 2017.
  • Choromanski et al. [2018] Krzysztof Choromanski, Mark Rowland, Vikas Sindhwani, Richard E Turner, and Adrian Weller. Structured evolution with compact architectures for scalable policy optimization. arXiv preprint arXiv:1804.02395, 2018.
  • Fang et al. [2018] Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 689–699, 2018.
  • Gabay and Mercier [1976] Daniel Gabay and Bertrand Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976.
  • Gao et al. [2018] Xiang Gao, Bo Jiang, and Shuzhong Zhang. On the information-adaptive variants of the admm: an iteration complexity perspective. Journal of Scientific Computing, 76(1):327–363, 2018.
  • Ghadimi and Lan [2013] Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
  • Ghadimi et al. [2016] Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267–305, 2016.
  • Gu et al. [2018] Bin Gu, Zhouyuan Huo, Cheng Deng, and Heng Huang. Faster derivative-free stochastic algorithm for shared memory machines. In ICML, pages 1807–1816, 2018.
  • Hajinezhad et al. [2017] Davood Hajinezhad, Mingyi Hong, and Alfredo Garcia. Zeroth order nonconvex multi-agent optimization over networks. arXiv preprint arXiv:1710.09997, 2017.
  • Hong et al. [2016] Mingyi Hong, Zhi-Quan Luo, and Meisam Razaviyayn. Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM Journal on Optimization, 26(1):337–364, 2016.
  • Huang et al. [2016] Feihu Huang, Songcan Chen, and Zhaosong Lu. Stochastic alternating direction method of multipliers with variance reduction for nonconvex optimization. arXiv preprint arXiv:1610.02758, 2016.
  • Huang et al. [2019a] Feihu Huang, Songcan Chen, and Heng Huang. Faster stochastic alternating direction method of multipliers for nonconvex optimization. In International Conference on Machine Learning, pages 2839–2848, 2019a.
  • Huang et al. [2019b] Feihu Huang, Shangqian Gao, Songcan Chen, and Heng Huang. Zeroth-order stochastic alternating direction method of multipliers for nonconvex nonsmooth optimization. In IJCAI, 2019b.
  • Huang et al. [2019c] Feihu Huang, Bin Gu, Zhouyuan Huo, Songcan Chen, and Heng Huang. Faster gradient-free proximal stochastic methods for nonconvex nonsmooth optimization. arXiv preprint arXiv:1902.06158, 2019c.
  • Ji et al. [2019] Kaiyi Ji, Zhe Wang, Yi Zhou, and Yingbin Liang. Improved zeroth-order variance reduced algorithms and analysis for nonconvex optimization. In International Conference on Machine Learning, pages 3100–3109, 2019.
  • Jiang et al. [2019] Bo Jiang, Tianyi Lin, Shiqian Ma, and Shuzhong Zhang. Structured nonconvex and nonsmooth optimization: algorithms and iteration complexity analysis. Computational Optimization and Applications, 72(1):115–157, 2019.
  • Johnson and Zhang [2013] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315–323, 2013.
  • Kim et al. [2009] Seyoung Kim, Kyung-Ah Sohn, and Eric P Xing. A multivariate regression approach to association analysis of a quantitative trait network. Bioinformatics, 25(12):i204–i212, 2009.
  • Lian et al. [2016] Xiangru Lian, Huan Zhang, Cho-Jui Hsieh, Yijun Huang, and Ji Liu. A comprehensive linear speedup analysis for asynchronous stochastic parallel optimization from zeroth-order to first-order. In Advances in Neural Information Processing Systems, pages 3054–3062, 2016.
  • Liu et al. [2018a] Liu Liu, Minhao Cheng, Cho-Jui Hsieh, and Dacheng Tao. Stochastic zeroth-order optimization via variance reduction method. CoRR, abs/1805.11811, 2018a.
  • Liu et al. [2018b] Sijia Liu, Jie Chen, Pin-Yu Chen, and Alfred Hero. Zeroth-order online alternating direction method of multipliers: Convergence analysis and applications. In AISTATS, volume 84, pages 288–297, 2018b.
  • Liu et al. [2018c] Sijia Liu, Bhavya Kailkhura, Pin-Yu Chen, Paishun Ting, Shiyu Chang, and Lisa Amini. Zeroth-order stochastic variance reduction for nonconvex optimization. In NIPS, pages 3731–3741, 2018c.
  • Liu et al. [2017] Yuanyuan Liu, Fanhua Shang, and James Cheng. Accelerated variance reduced stochastic admm. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • Nesterov and Spokoiny [2017] Yurii Nesterov and Vladimir G. Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17:527–566, 2017.
  • Nguyen et al. [2017] Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. Sarah: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2613–2621. JMLR. org, 2017.
  • Nishihara et al. [2015] Robert Nishihara, Laurent Lessard, Ben Recht, Andrew Packard, and Michael Jordan. A general analysis of the convergence of admm. In International Conference on Machine Learning, pages 343–352, 2015.
  • Ouyang et al. [2013] Hua Ouyang, Niao He, Long Tran, and Alexander G Gray. Stochastic alternating direction method of multipliers. ICML, 28:80–88, 2013.
  • Reddi et al. [2016] Sashank Reddi, Suvrit Sra, Barnabas Poczos, and Alexander J Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In NIPS, pages 1145–1153, 2016.
  • Suzuki [2014] Taiji Suzuki. Stochastic dual coordinate ascent with alternating direction method of multipliers. In ICML, pages 736–744, 2014.
  • Taylor et al. [2016] Gavin Taylor, Ryan Burmeister, Zheng Xu, Bharat Singh, Ankit Patel, and Tom Goldstein. Training neural networks without gradients: a scalable admm approach. In ICML, pages 2722–2731, 2016.
  • Wang et al. [2015a] Fenghui Wang, Wenfei Cao, and Zongben Xu. Convergence of multi-block bregman admm for nonconvex composite problems. arXiv preprint arXiv:1505.03063, 2015a.
  • Wang et al. [2015b] Yu Wang, Wotao Yin, and Jinshan Zeng. Global convergence of admm in nonconvex nonsmooth optimization. Journal of Scientific Computing, pages 1–35, 2015b.
  • Wang et al. [2018] Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tarokh. Spiderboost: A class of faster variance-reduced algorithms for nonconvex optimization. arXiv preprint arXiv:1810.10690, 2018.
  • Xu et al. [2018] Kaidi Xu, Sijia Liu, Pu Zhao, Pin-Yu Chen, Huan Zhang, Deniz Erdogmus, Yanzhi Wang, and Xue Lin. Structured adversarial attack: Towards general implementation and better interpretability. arXiv preprint arXiv:1808.01664, 2018.
  • Xu et al. [2017] Yi Xu, Mingrui Liu, Qihang Lin, and Tianbao Yang. Admm without a fixed penalty parameter: Faster convergence with new adaptive penalization. In Advances in Neural Information Processing Systems, pages 1267–1277, 2017.
  • Zheng and Kwok [2016] Shuai Zheng and James T Kwok. Fast and light stochastic admm. In IJCAI, 2016.

Appendix A Supplementary Materials

In this section, we study in detail the convergence properties of both the ZO-SPIDER-ADMM and ZOO-SPIDER-ADMM algorithms. First, we restate some useful lemmas.

Lemma 3.

(Reddi et al. [2016]) For independent random variables z_1, …, z_b with zero mean, we have

(10)    E‖ (1/b) Σ_{i=1}^{b} z_i ‖² = (1/b²) Σ_{i=1}^{b} E‖z_i‖².
Lemma 4.

(Ji et al. [2019]) Given the coordinate-wise gradient estimator ∇̂f_i(x) in (3), for any given smoothing parameters {μ_j}_{j=1}^d and any x ∈ ℝ^d, we have

(11)    ‖ ∇̂f_i(x) − ∇f_i(x) ‖² ≤ (L²/4) Σ_{j=1}^{d} μ_j².
Proof.

Applying the mean value theorem to the gradient ∇f_i, we have, for some intermediate point,