The alternating direction method of multipliers (ADMM) [Gabay and Mercier1976, Boyd et al.2011] is a popular optimization tool for solving composite and constrained problems in machine learning. In particular, ADMM can efficiently optimize problems with complicated structured regularization, such as the graph-guided fused lasso [Kim et al.2009], which are too complicated for other popular optimization methods such as proximal gradient methods [Beck and Teboulle2009]. Thus, ADMM has been widely studied in recent years [Boyd et al.2011]. For large-scale optimization, the stochastic ADMM method [Ouyang et al.2013] has been proposed. Due to the variance of the stochastic gradient, however, such methods suffer from a slow convergence rate. To speed up convergence, faster stochastic ADMM methods [Suzuki2014, Zheng and Kwok2016] have recently been proposed using variance reduced (VR) techniques such as SVRG [Johnson and Zhang2013]. In fact, ADMM is also highly successful in solving various nonconvex problems such as tensor decomposition [Jiang et al.2019] and learning neural networks [Taylor et al.2016]. Thus, some fast nonconvex stochastic ADMM methods have been developed in [Huang et al.2016].
Currently, most ADMM methods need to compute gradients of the loss functions at each iteration. However, in many machine learning problems, the explicit expression of the gradient of the objective function is difficult or infeasible to obtain. For example, in black-box situations, only prediction results (i.e., function values) are provided [Chen et al.2017, Liu et al.2018b]. In bandit settings [Agarwal et al.2010], the player only receives partial feedback in terms of loss function values, so it is impossible to obtain an explicit gradient of the loss function. Clearly, the classic optimization methods, based on first-order gradient or second-order information, are ill-suited to these problems. Thus, zeroth-order optimization methods [Duchi et al.2015, Nesterov and Spokoiny2017] have been developed, which use only function values in the optimization.
|Algorithm|Reference|Gradient Estimator|Problem|Convergence Rate|
|---|---|---|---|---|
|ZOO-ADMM|[Liu et al.2018a]|GauSGE|C(S) + C(NS)| |
|ZO-GADM|[Gao et al.2018]|UniSGE|C(S) + C(NS)| |
|RSPGF|[Ghadimi et al.2016]|GauSGE|NC(S) + C(NS)| |
|ZO-ProxSVRG|[Huang et al.2019]|CooSGE|NC(S) + C(NS)| |
|ZO-SVRG-ADMM|Ours|CooSGE|NC(S) + C(mNS)| |
In this paper, we focus on using zeroth-order methods to solve the following nonconvex nonsmooth problem:
where , for all , is a nonconvex and smooth function, and each is a convex and nonsmooth function. In machine learning, function can be used for the empirical loss, for multiple structure penalties (e.g., sparse + group sparse), and the constraint for encoding the structure pattern of model parameters such as graph structure. Due to the flexibility in splitting the objective function into loss and each penalty , ADMM is an efficient method to solve the above constrained problem. However, in problem (1), we can only access the objective values rather than the explicit function , so the classic ADMM methods are unsuitable for this problem.
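For orientation, one way to write a problem of the kind described above is sketched below. The symbols are assumptions introduced here for illustration, not recovered from the original equation: $f_i$ for the smooth nonconvex loss components, $\psi_j$ for the convex nonsmooth penalties, and $A$, $B_j$, $c$ for the linear constraint.

```latex
\min_{x,\,\{y_j\}_{j=1}^{k}} \;
  \frac{1}{n}\sum_{i=1}^{n} f_i(x) \;+\; \sum_{j=1}^{k} \psi_j(y_j)
\qquad \text{s.t.} \qquad
  A x + \sum_{j=1}^{k} B_j y_j = c .
```

Under this reading, the loss depends only on $x$, each penalty depends only on its own block $y_j$, and the blocks are coupled solely through the linear constraint, which is what makes the splitting natural for ADMM.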
Recently, [Gao et al.2018, Liu et al.2018a] proposed zeroth-order stochastic ADMM methods, which use only the objective values to optimize. However, these zeroth-order ADMM-based methods rely on the convexity of the objective function, which limits them in many applications such as adversarial attacks on black-box deep neural networks (DNNs). Because problem (1) includes multiple nonsmooth regularization functions and a constraint, the existing nonconvex zeroth-order algorithms [Liu et al.2018b, Ghadimi et al.2016, Huang et al.2019] are not suitable for this problem either.
In this paper, we therefore propose a class of fast zeroth-order stochastic ADMM methods (i.e., ZO-SVRG-ADMM and ZO-SAGA-ADMM) to solve problem (1) based on the coordinate smoothing gradient estimator [Liu et al.2018b]. In particular, the ZO-SVRG-ADMM and ZO-SAGA-ADMM methods build on SVRG [Johnson and Zhang2013] and SAGA [Defazio et al.2014], respectively. Moreover, we study the convergence properties of the proposed methods. Table 1 shows the convergence properties of the proposed methods and other related ones.
1.1 Challenges and Contributions
Although both SVRG and SAGA perform well in first-order and second-order methods, applying these techniques to the nonconvex zeroth-order ADMM method is not trivial. There exist at least two main challenges:
Due to the failure of the Fejér monotonicity of the iterates, the convergence analysis of nonconvex ADMM is generally quite difficult [Wang et al.2015]. With inexact zeroth-order gradient estimates, this difficulty becomes even greater in nonconvex zeroth-order ADMM methods.
Thus, we carefully establish the Lyapunov functions in the following theoretical analysis to ensure convergence of the proposed methods. In summary, our major contributions are given below:
1) We propose a class of fast zeroth-order stochastic ADMM methods (i.e., ZO-SVRG-ADMM and ZO-SAGA-ADMM) to solve problem (1).
2) We prove that both ZO-SVRG-ADMM and ZO-SAGA-ADMM have a convergence rate of for nonconvex nonsmooth optimization. In particular, our methods not only reach the existing best convergence rate for nonconvex optimization, but are also able to effectively solve many machine learning problems with multiple complex regularized penalties.
3) Extensive experiments conducted on black-box classification and structured adversarial attacks on black-box DNNs validate the efficiency of the proposed algorithms.
2 Related Works
Zeroth-order (gradient-free) optimization is a powerful tool for solving machine learning problems in which the gradient of the objective function is unavailable or computationally prohibitive. Recently, zeroth-order optimization methods have been widely applied and studied. For example, they have been applied to bandit feedback analysis [Agarwal et al.2010] and black-box attacks on DNNs [Chen et al.2017, Liu et al.2018b]. [Nesterov and Spokoiny2017] proposed several random zeroth-order methods using the Gaussian smoothing gradient estimator. To deal with nonsmooth regularization, [Gao et al.2018, Liu et al.2018a] proposed zeroth-order online/stochastic ADMM-based methods.
So far, the above algorithms mainly rely on the convexity of the problems. In fact, zeroth-order methods are also highly successful in solving various nonconvex problems such as adversarial attacks on black-box DNNs [Liu et al.2018b]. Thus, [Ghadimi and Lan2013, Liu et al.2018b, Gu et al.2018] have begun to study zeroth-order stochastic methods for nonconvex optimization. To deal with nonsmooth regularization, [Ghadimi et al.2016, Huang et al.2019] proposed nonconvex zeroth-order proximal stochastic gradient methods. However, these methods are still not well suited to some complex machine learning problems, such as the task of structured adversarial attacks on black-box DNNs, which is described in the experiments below.
Let and for . Given a positive definite matrix , ; and denote the largest and smallest eigenvalues of , respectively, and . and denote the largest and smallest eigenvalues of matrix .
Given , the point is said to be an -approximate stationary point of the problems (1), if it holds that
Next, we make some mild assumptions regarding problem (1) as follows:
Assumption 1. Each function is -smooth for such that
which is equivalent to
Assumption 2. The gradient of each function is bounded, i.e., there exists a constant such that for all , it follows that .
Assumption 3. and for all are all lower bounded, and denote and for .
Assumption 4. is a full row or column rank matrix.
Assumption 1 has been commonly used in the convergence analysis of nonconvex algorithms [Ghadimi et al.2016]. Assumption 2 is widely used for stochastic gradient-based and ADMM-type methods [Boyd et al.2011]. Assumptions 3 and 4 are usually used in the convergence analysis of ADMM methods [Jiang et al.2019, Huang et al.2016]. Without loss of generality, we will use the full column rank of matrix in the rest of this paper.
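As a reminder of what the smoothness assumption usually looks like, the standard textbook statement of $L$-smoothness and the quadratic upper bound that normally accompanies it are given below. The symbols $f_i$, $L$, $x$, and $y$ are assumed notation; the exact form used in the original equations is not recoverable from the text.

```latex
\|\nabla f_i(x) - \nabla f_i(y)\| \;\le\; L\,\|x - y\|
\quad \text{for all } x, y,
\qquad
f_i(x) \;\le\; f_i(y) + \nabla f_i(y)^{\top}(x - y) + \frac{L}{2}\,\|x - y\|^{2} .
```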
4 Fast Zeroth-Order Stochastic ADMMs
where and denote the dual variable and penalty parameter, respectively.
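For a linearly constrained problem of the form $\min f(x) + \sum_{j} \psi_j(y_j)$ subject to $Ax + \sum_j B_j y_j = c$, an augmented Lagrangian with dual variable $\lambda$ and penalty parameter $\rho > 0$ is typically written as below. This is a sketch in assumed notation; in particular, the sign in front of the dual term is a convention and may differ from the one used in the original equation.

```latex
\mathcal{L}_{\rho}\bigl(x, \{y_j\}, \lambda\bigr)
  = f(x) + \sum_{j=1}^{k} \psi_j(y_j)
  - \Bigl\langle \lambda,\; A x + \sum_{j=1}^{k} B_j y_j - c \Bigr\rangle
  + \frac{\rho}{2}\,\Bigl\| A x + \sum_{j=1}^{k} B_j y_j - c \Bigr\|^{2} .
```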
In problem (1), the explicit expression of the objective function is not available, and only the function value of is available. To avoid computing the explicit gradient, we therefore use the coordinate smoothing gradient estimator [Liu et al.2018b] to estimate gradients: for ,
where is a coordinate-wise smoothing parameter, and is a standard basis vector with 1 at its -th coordinate and 0 otherwise.
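A coordinate-wise estimator of this kind can be sketched in a few lines. The version below is a minimal illustration, assuming a symmetric (central) difference along each standard basis vector and, for simplicity, a single smoothing parameter `mu` shared by all coordinates. The names `coord_grad_estimate` and `black_box` are hypothetical.

```python
from typing import Callable, List, Sequence


def coord_grad_estimate(f: Callable[[Sequence[float]], float],
                        x: Sequence[float],
                        mu: float = 1e-4) -> List[float]:
    """Estimate the gradient of f at x using function values only.

    Coordinate i is approximated by (f(x + mu*e_i) - f(x - mu*e_i)) / (2*mu),
    where e_i is the i-th standard basis vector. Assumption: one shared
    smoothing parameter mu instead of one parameter per coordinate.
    """
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += mu
        x_minus[i] -= mu
        grad.append((f(x_plus) - f(x_minus)) / (2.0 * mu))
    return grad


def black_box(x: Sequence[float]) -> float:
    # Hypothetical black-box objective; only its values are queried.
    return x[0] ** 2 + 3.0 * x[0] * x[1]


g = coord_grad_estimate(black_box, [1.0, 2.0])
```

Each call costs two function queries per coordinate, so the query cost grows linearly with the dimension. That is the price paid for an estimate whose error is controlled by the smoothing parameter rather than by random sampling.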
Based on the above estimated gradients, we propose a zeroth-order ADMM (ZO-ADMM) method to solve the problem (1) by executing the following iterations, for
where the term with is used to linearize the term . Here, because the inexact zeroth-order gradient is used to update , we define an approximate function over as follows:
where , is the zeroth-order gradient and is a step size. Since the matrix may be large, we set with to linearize the term .
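To make the update order concrete, here is a toy sketch of a zeroth-order ADMM loop. It is not the authors' algorithm: it assumes the much simpler problem min f(x) + tau*||y||_1 subject to x - y = 0 (identity constraint matrices, a single penalty), so no extra linearization of the quadratic term is needed. All names and default parameters (`zo_admm_l1`, `rho`, `eta`, `mu`) are hypothetical.

```python
def coord_grad_estimate(f, x, mu=1e-4):
    """Coordinate-wise central-difference gradient estimate (values only)."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += mu
        x_minus[i] -= mu
        grad.append((f(x_plus) - f(x_minus)) / (2.0 * mu))
    return grad


def soft_threshold(v, kappa):
    """Proximal operator of kappa * ||.||_1, applied elementwise."""
    return [max(abs(vi) - kappa, 0.0) * (1.0 if vi > 0 else -1.0) for vi in v]


def zo_admm_l1(f, dim, tau, rho=1.0, eta=0.5, mu=1e-4, iters=200):
    """Toy zeroth-order ADMM for min f(x) + tau*||y||_1 s.t. x - y = 0."""
    x = [0.0] * dim
    y = [0.0] * dim
    lam = [0.0] * dim
    for _ in range(iters):
        # y-update: exact minimization of the augmented Lagrangian in y.
        y = soft_threshold([xi - li / rho for xi, li in zip(x, lam)], tau / rho)
        # x-update: f is replaced by a linear model built from the estimated
        # gradient plus the proximal term (1 / (2 * eta)) * ||x - x_t||^2,
        # which gives a closed-form minimizer.
        g = coord_grad_estimate(f, x, mu)
        x = [(xi / eta + li + rho * yi - gi) / (1.0 / eta + rho)
             for xi, li, yi, gi in zip(x, lam, y, g)]
        # Dual update, with the convention L = ... - <lam, x - y> + ...
        lam = [li - rho * (xi - yi) for li, xi, yi in zip(lam, x, y)]
    return x, y, lam


# Usage on a convex quadratic, chosen only because its solution is known:
# the minimizer of 0.5*||x - b||^2 + tau*||x||_1 is soft_threshold(b, tau).
b = [3.0, -0.2, 1.0]
x, y, lam = zo_admm_l1(
    lambda v: 0.5 * sum((vi - bi) ** 2 for vi, bi in zip(v, b)),
    dim=3, tau=0.5)
```

The quadratic test function is not representative of the nonconvex losses targeted in this paper. It only serves to check that the three updates fit together.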
In problem (1), not only is the noisy gradient of unavailable, but the sample size is also very large. Thus, we propose the fast ZO-SVRG-ADMM and ZO-SAGA-ADMM methods to solve problem (1), based on SVRG and SAGA, respectively.
Algorithm 1 shows the algorithmic framework of ZO-SVRG-ADMM. In Algorithm 1, we use the estimated stochastic gradient with . We have , i.e., this stochastic gradient is a biased estimate of the true full gradient. Although SVRG has shown great promise, it relies on the assumption that the stochastic gradient is an unbiased estimate of the true full gradient. Thus, adapting the ideas of SVRG to zeroth-order ADMM optimization is not a trivial task. To handle this challenge, we choose an appropriate step size , penalty parameter and smoothing parameter to guarantee the convergence of our algorithms, as discussed in the convergence analysis below.
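The variance-reduced estimate can be illustrated with a short sketch. It assumes the usual SVRG form (mini-batch estimate at the current point, minus the same mini-batch estimate at a snapshot point, plus the full estimate at the snapshot), with every gradient replaced by a coordinate-wise zeroth-order estimate. The function names and the toy component functions are hypothetical.

```python
import random


def coord_grad_estimate(f, x, mu=1e-4):
    """Coordinate-wise central-difference gradient estimate (values only)."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += mu
        x_minus[i] -= mu
        grad.append((f(x_plus) - f(x_minus)) / (2.0 * mu))
    return grad


def batch_grad(fs, idx, x, mu=1e-4):
    """Average the zeroth-order estimates of the components fs[i], i in idx."""
    total = [0.0] * len(x)
    for i in idx:
        gi = coord_grad_estimate(fs[i], x, mu)
        total = [t + g for t, g in zip(total, gi)]
    return [t / len(idx) for t in total]


def zo_svrg_gradient(fs, x, x_snap, full_grad_snap, batch, mu=1e-4):
    """SVRG-style estimate: g_batch(x) - g_batch(x_snap) + g_full(x_snap)."""
    g_x = batch_grad(fs, batch, x, mu)
    g_snap = batch_grad(fs, batch, x_snap, mu)
    return [a - b + c for a, b, c in zip(g_x, g_snap, full_grad_snap)]


# Hypothetical components f_i(x) = 0.5 * (x - t_i)^2 in one dimension.
targets = [1.0, 2.0, 4.0]
fs = [lambda x, t=t: 0.5 * (x[0] - t) ** 2 for t in targets]

x_snap = [0.0]
full_grad_snap = batch_grad(fs, range(len(fs)), x_snap)  # once per epoch
batch = random.Random(0).sample(range(len(fs)), 2)

g_at_snap = zo_svrg_gradient(fs, x_snap, x_snap, full_grad_snap, batch)
g_new = zo_svrg_gradient(fs, [1.0], x_snap, full_grad_snap, batch)
```

At the snapshot point the two mini-batch terms cancel and the estimate equals the full zeroth-order gradient estimate. Away from it, only the change in the mini-batch estimate contributes variance.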
5 Convergence Analysis
In this section, we will study the convergence properties of the proposed algorithms (ZO-SVRG-ADMM and ZO-SAGA-ADMM). For notational simplicity, let
5.1 Convergence Analysis of ZO-SVRG-ADMM
In this subsection, we analyze convergence properties of the ZO-SVRG-ADMM.
Given the sequence is generated from Algorithm 1, we define a Lyapunov function:
where the positive sequence satisfies
In addition, we define a useful variable .
Theorem 1 shows that given , , , and , the ZO-SVRG-ADMM has convergence rate of . Specifically, when , given , the ZO-SVRG-ADMM has convergence rate of ; when , given , it has convergence rate of ; when , given , it has convergence rate of . Thus, the ZO-SVRG-ADMM has the optimal function query complexity of for finding an -approximate local solution.
5.2 Convergence Analysis of ZO-SAGA-ADMM
In this subsection, we provide the convergence analysis of the ZO-SAGA-ADMM.
Given the sequence is generated from Algorithm 2, we define a Lyapunov function
where the positive sequence satisfies
In addition, we define a useful variable .
Theorem 2 shows that given , , and , the ZO-SAGA-ADMM has a convergence rate of . Specifically, when , given , the ZO-SAGA-ADMM has convergence rate of ; when , given , it has convergence rate of ; when , given , it has convergence rate of . Thus, the ZO-SAGA-ADMM has the optimal function query complexity of for finding an -approximate local solution.
In this section, we compare our algorithms (ZO-SVRG-ADMM, ZO-SAGA-ADMM) with the ZO-ProxSVRG, ZO-ProxSAGA [Huang et al.2019], the deterministic zeroth-order ADMM (ZO-ADMM), and zeroth-order stochastic ADMM (ZO-SGD-ADMM) without variance reduction on two applications: 1) robust black-box binary classification, and 2) structured adversarial attacks on black-box DNNs.
6.1 Robust Black-Box Binary Classification
In this subsection, we focus on a robust black-box binary classification task with graph-guided fused lasso. Given a set of training samples , where and , we find the optimal parameter by solving the problem:
where is the black-box loss function, which returns only the function value for a given input. Here, we specify the loss function , which is the nonconvex robust correntropy-induced loss [He et al.2011]. Matrix decodes the sparsity pattern of the graph obtained by sparse inverse covariance selection, as in [Ouyang et al.2013]. In the experiment, we set the mini-batch size , smoothing parameter and penalty parameters .
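As an illustration of why such a loss is robust, the sketch below implements one common form of the correntropy-induced loss, (sigma^2 / 2) * (1 - exp(-r^2 / sigma^2)) with residual r = b - a^T x. Both this exact form and the names used (`correntropy_loss`, `sigma`) are assumptions made for illustration; the loss used in the experiment may differ in scaling.

```python
import math
from typing import Sequence


def correntropy_loss(x: Sequence[float],
                     a: Sequence[float],
                     b: float,
                     sigma: float = 1.0) -> float:
    """Assumed correntropy-induced loss for one sample (a, b).

    The loss is zero at zero residual and saturates at sigma^2 / 2, so a
    single badly mislabeled sample has a bounded influence on the objective.
    """
    residual = b - sum(ai * xi for ai, xi in zip(a, x))
    return (sigma ** 2 / 2.0) * (1.0 - math.exp(-(residual ** 2) / sigma ** 2))


def empirical_loss(x, samples, sigma=1.0):
    """Average loss over (a, b) pairs; this plays the role of the black box."""
    return sum(correntropy_loss(x, a, b, sigma) for a, b in samples) / len(samples)


loss_exact = correntropy_loss([1.0, 0.0], [1.0, 2.0], 1.0)   # residual 0
loss_outlier = correntropy_loss([0.0, 0.0], [1.0, 2.0], 100.0)  # huge residual
```

Because the loss is bounded, it is nonconvex in `x`, which is the property that places the classification problem in the nonconvex setting studied here.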
In the experiment, we use several public real datasets (20news is from https://cs.nyu.edu/~roweis/data.html; the others are from www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/), which are summarized in Table 2. For each dataset, we use half of the samples as training data and the rest as testing data. Figure 1 shows that the objective values of our algorithms decrease faster than those of the other algorithms as the CPU time increases. In particular, our algorithms perform better than the zeroth-order proximal algorithms. It is relatively difficult for these zeroth-order proximal methods to deal with the nonsmooth penalties in problem (7). Thus, we have to use an iterative method (such as the classic ADMM method) to solve the proximal operator in these proximal methods.
6.2 Structured Attacks on Black-Box DNNs
In this subsection, we use our algorithms to generate adversarial examples to attack pre-trained DNN models whose parameters are hidden from us and whose outputs alone are accessible. Moreover, we consider an interesting problem: “What possible structures could adversarial perturbations have to fool black-box DNNs?” Thus, we use the zeroth-order algorithms to find a universal structured adversarial perturbation that could fool the samples , which can be regarded as the following problem:
where represents the final layer output before softmax of the neural network, and ensures the validity of the created adversarial examples. Specifically, if for all and , otherwise . Following [Xu et al.2018], we use the overlapping lasso to obtain structured perturbations. Here, the overlapping groups are generated by dividing an image into sub-groups of pixels.
In the experiment, we use the pre-trained DNN models on MNIST and CIFAR-10 as the target black-box models, which can attain and test accuracy, respectively. For MNIST, we select 20 samples from a target class and set batch size ; for CIFAR-10, we select 30 samples and set . In the experiment, we set , where and for MNIST and CIFAR-10, respectively. At the same time, we set the parameters , , and . For both datasets, the kernel size for overlapping group lasso is set to and the stride is one.
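The way a kernel size and a stride define overlapping pixel groups can be sketched as follows. This is an illustrative construction, assuming square sliding windows over a row-major flattened single-channel image; the kernel size used here (2) and the function names are hypothetical and not the experimental setting.

```python
import math
from typing import List, Sequence


def overlapping_groups(height: int, width: int,
                       kernel: int, stride: int = 1) -> List[List[int]]:
    """Index groups from sliding a kernel x kernel window over the image.

    Pixels are indexed row-major, so pixel (r, c) has index r * width + c.
    With stride < kernel, neighbouring windows share pixels.
    """
    groups = []
    for top in range(0, height - kernel + 1, stride):
        for left in range(0, width - kernel + 1, stride):
            groups.append([r * width + c
                           for r in range(top, top + kernel)
                           for c in range(left, left + kernel)])
    return groups


def overlapping_group_lasso(delta: Sequence[float],
                            groups: Sequence[Sequence[int]]) -> float:
    """Sum of Euclidean norms of the perturbation restricted to each group."""
    return sum(math.sqrt(sum(delta[i] ** 2 for i in g)) for g in groups)


groups = overlapping_groups(4, 4, kernel=2, stride=1)
penalty = overlapping_group_lasso([1.0] * 16, groups)
```

Since each group contributes an unsquared norm, the penalty tends to switch whole windows of pixels on or off together, which is what produces spatially structured perturbations.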
Figure 3 shows that the attack losses (i.e., the first term of the problem (6.2)) of our methods decrease faster than those of the other methods as the number of iterations increases. Figure 2 shows that our algorithms can learn some structured perturbations, and can successfully attack the corresponding DNNs.
In this paper, we proposed the fast ZO-SVRG-ADMM and ZO-SAGA-ADMM methods based on the coordinate smoothing gradient estimator, which use only objective function values to optimize. Moreover, we proved that the proposed methods have a convergence rate of . In particular, our methods not only reach the existing best convergence rate for nonconvex optimization, but are also able to effectively solve many machine learning problems with complex nonsmooth regularizations.
F.H., S.G., H.H. were partially supported by U.S. NSF IIS 1836945, IIS 1836938, DBI 1836866, IIS 1845666, IIS 1852606, IIS 1838627, IIS 1837956. S.C. was partially supported by the NSFC under Grant No. 61806093 and No. 61682281, and the Key Program of NSFC under Grant No. 61732006.
- [Agarwal et al.2010] Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT, pages 28–40. Citeseer, 2010.
- [Beck and Teboulle2009] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.
- [Boyd et al.2011] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
- [Chen et al.2017] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Workshop on Artificial Intelligence and Security, pages 15–26. ACM, 2017.
- [Defazio et al.2014] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, pages 1646–1654, 2014.
- [Duchi et al.2015] John C Duchi, Michael I Jordan, Martin J Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE TIT, 61(5):2788–2806, 2015.
- [Gabay and Mercier1976] Daniel Gabay and Bertrand Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976.
- [Gao et al.2018] Xiang Gao, Bo Jiang, and Shuzhong Zhang. On the information-adaptive variants of the admm: an iteration complexity perspective. Journal of Scientific Computing, 76(1):327–363, 2018.
- [Ghadimi and Lan2013] Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
- [Ghadimi et al.2016] Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267–305, 2016.
- [Gu et al.2018] Bin Gu, Zhouyuan Huo, Cheng Deng, and Heng Huang. Faster derivative-free stochastic algorithm for shared memory machines. In ICML, pages 1807–1816, 2018.
- [He et al.2011] Ran He, Wei-Shi Zheng, and Bao-Gang Hu. Maximum correntropy criterion for robust face recognition. IEEE TPAMI, 33(8):1561–1576, 2011.
- [Huang et al.2016] Feihu Huang, Songcan Chen, and Zhaosong Lu. Stochastic alternating direction method of multipliers with variance reduction for nonconvex optimization. arXiv preprint arXiv:1610.02758, 2016.
- [Huang et al.2019] Feihu Huang, Bin Gu, Zhouyuan Huo, Songcan Chen, and Heng Huang. Faster gradient-free proximal stochastic methods for nonconvex nonsmooth optimization. In AAAI, 2019.
- [Jiang et al.2019] Bo Jiang, Tianyi Lin, Shiqian Ma, and Shuzhong Zhang. Structured nonconvex and nonsmooth optimization: algorithms and iteration complexity analysis. Computational Optimization and Applications, 72(1):115–157, 2019.
- [Johnson and Zhang2013] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315–323, 2013.
- [Kim et al.2009] Seyoung Kim, Kyung-Ah Sohn, and Eric P Xing. A multivariate regression approach to association analysis of a quantitative trait network. Bioinformatics, 25(12):i204–i212, 2009.
- [Liu et al.2018a] Sijia Liu, Jie Chen, Pin-Yu Chen, and Alfred Hero. Zeroth-order online alternating direction method of multipliers: Convergence analysis and applications. In AISTATS, volume 84, pages 288–297, 2018.
- [Liu et al.2018b] Sijia Liu, Bhavya Kailkhura, Pin-Yu Chen, Paishun Ting, Shiyu Chang, and Lisa Amini. Zeroth-order stochastic variance reduction for nonconvex optimization. In NIPS, pages 3731–3741, 2018.
- [Nesterov and Spokoiny2017] Yurii Nesterov and Vladimir G. Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17:527–566, 2017.
- [Ouyang et al.2013] Hua Ouyang, Niao He, Long Tran, and Alexander G Gray. Stochastic alternating direction method of multipliers. ICML, 28:80–88, 2013.
- [Suzuki2014] Taiji Suzuki. Stochastic dual coordinate ascent with alternating direction method of multipliers. In ICML, pages 736–744, 2014.
- [Taylor et al.2016] Gavin Taylor, Ryan Burmeister, Zheng Xu, Bharat Singh, Ankit Patel, and Tom Goldstein. Training neural networks without gradients: a scalable admm approach. In ICML, pages 2722–2731, 2016.
- [Wang et al.2015] Fenghui Wang, Wenfei Cao, and Zongben Xu. Convergence of multi-block bregman admm for nonconvex composite problems. arXiv preprint arXiv:1505.03063, 2015.
- [Xu et al.2018] Kaidi Xu, Sijia Liu, Pu Zhao, Pin-Yu Chen, Huan Zhang, Deniz Erdogmus, Yanzhi Wang, and Xue Lin. Structured adversarial attack: Towards general implementation and better interpretability. arXiv preprint arXiv:1808.01664, 2018.
- [Zheng and Kwok2016] Shuai Zheng and James T Kwok. Fast-and-light stochastic admm. In IJCAI, pages 2407–2613, 2016.