In this paper, we consider the following convex optimization problem:
where f is differentiable and strongly convex. An optimization method that solves the above problem using only function-value access is known as zeroth-order optimization or black-box optimization (Nesterov & Spokoiny, 2017; Ghadimi & Lan, 2013).
Zeroth-order optimization has attracted attention from the machine learning community (Bergstra et al., 2011; Ilyas et al., 2018), and it is especially useful for solving problem (1.1) when the evaluation of gradients is difficult or even infeasible. One prominent example of zeroth-order optimization is the black-box adversarial attack on deep neural networks (Chen et al., 2017; Hu & Tan, 2017; Papernot et al., 2017; Ilyas et al., 2018). In the black-box adversarial attack, only the inputs and outputs of the neural network are available, and backpropagation through the target network is prohibited. Another application example is hyper-parameter tuning, which searches for the optimal parameters of deep neural networks or other learning models (Snoek et al., 2012; Bergstra et al., 2011).
In recent years, theoretical work on zeroth-order optimization has arisen as an alternative to the corresponding first-order methods, estimating gradients via function-value differences (Nesterov & Spokoiny, 2017; Ghadimi & Lan, 2013; Duchi et al., 2015). However, this line of work has concentrated on extracting gradient information of the objective function and has failed to utilize second-order Hessian information; to that extent, it has not fully exploited the structure of the model and attains less competitive rates. In this paper, we take advantage of the model's second-order information and propose a novel method called Hessian-aware zeroth-order optimization. Aligning with the earlier works of Nesterov & Spokoiny (2017) and Ghadimi & Lan (2013), we present our gradient estimation as follows:
where b is the batch size of points used for gradient estimation and H is an approximate Hessian at the evaluation point. With Eqn. (1.2) at hand, the core update rule of our Hessian-aware zeroth-order algorithm, named ZO-HessAware, is
where η is the step size. If one lets the natural gradient p be defined as
In comparison, the early zeroth-order literature (for instance, Nesterov & Spokoiny (2017)) conducts gradient estimation via
Comparing (1.3) and (1.5), one observes that the two update directions share the same form but admit different covariances. Because p contains Hessian information while sharing the same form as the estimated gradient, p can be regarded as a natural gradient, and ZO-HessAware can be regarded as a natural gradient descent method. Perhaps surprisingly, in the context of zeroth-order optimization, the difference between our update and those of earlier works boils down to the difference of search-direction covariances. Our particular choice of covariance matrix, the approximate inverse Hessian, allows us to incorporate Hessian information into the update rule in Eqn. (1.4), which in turn achieves an improved theoretical convergence rate in zeroth-order optimization.
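To make the covariance view concrete, here is a minimal sketch of a generic two-point zeroth-order search direction whose sampling covariance can be chosen freely. The function name, the 1/μ scaling, and the defaults are illustrative assumptions rather than the paper's exact estimator: with `cov_sqrt=None` the direction matches the identity-covariance estimator of Eqn. (1.5), while passing an (approximate) inverse square-root Hessian yields a natural-gradient-style direction in the spirit of Eqn. (1.3).

```python
import numpy as np

def zo_search_direction(f, x, mu, b, cov_sqrt=None, rng=None):
    """Average of b two-point finite differences along random directions.

    Directions are v = cov_sqrt @ u with u ~ N(0, I), so the estimate
    satisfies E[p] ~= (cov_sqrt @ cov_sqrt.T) @ grad f(x): identity
    covariance recovers the ordinary gradient, covariance H^{-1} a
    natural gradient H^{-1} grad f(x).
    """
    rng = np.random.default_rng() if rng is None else rng
    fx = f(x)
    p = np.zeros_like(x, dtype=float)
    for _ in range(b):
        u = rng.standard_normal(x.size)
        v = u if cov_sqrt is None else cov_sqrt @ u
        p += (f(x + mu * v) - fx) / mu * v
    return p / b
```

On a quadratic with Hessian diag(1, 4), for example, choosing cov_sqrt = diag(1, 1/2) rescales the steep coordinate so that both coordinates descend at a similar rate.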
Let the approximate Hessian satisfy our approximation condition (see Eqn. (3.1)), and let the algorithm be initialized at a point sufficiently close to the optimal solution. Then, in order to obtain an ε-accurate solution, our ZO-HessAware algorithm with a proper step size achieves the iteration complexity stated below. Note that if the objective function has strong convexity parameter τ, the approximation parameter can be chosen in terms of λ_k, the k-th largest eigenvalue of the Hessian, and ZO-HessAware then enjoys a correspondingly improved iteration complexity.
Furthermore, when ZO-HessAware is implemented with power-method based Hessian approximation (the resulting algorithm is named ZOHA-PW), it achieves the following query complexity:
Since this relation always holds, our proposed ZO-HessAware algorithm enjoys a sharper theoretical convergence rate and query complexity than the comparable version using no Hessian information (recall that Nesterov & Spokoiny (2017) indicate an iteration complexity and a query complexity that depend on the smoothness parameter L of the objective function).
Though ZOHA-PW obtains a nice theoretical query complexity, the power-method procedure itself is expensive in queries, especially when the dimension is very large. Hence, we also propose several heuristic but practical methods that construct approximate Hessians with much lower query cost.
First, we use a Gaussian-sampling method to approximate the Hessian. This method samples only a small batch of points from a Gaussian distribution to estimate the Hessian, with a batch size much smaller than the data dimension.
Furthermore, we propose a diagonal Hessian approximation, a popular technique in training deep neural networks. This approach needs no extra queries to the function value and can retain the principal information of the Hessian, as has been demonstrated in training deep neural networks (Kingma & Ba, 2015; Duchi et al., 2011; Zeiler, 2012).
To numerically justify the effectiveness of our zeroth-order algorithm ZO-HessAware, we apply it, together with our Hessian approximation approaches, to the task of black-box adversarial attacks on neural-network-based image classifiers (Ilyas et al., 2018; Chen et al., 2017). An adversarial attack aims to find an example that differs from a given image only by small noise yet is misclassified by the neural network.
We compare our algorithms with two state-of-the-art algorithms, PGD-NES and ZOO (Ilyas et al., 2018; Chen et al., 2017). The comparison shows that our Hessian-aware zeroth-order algorithms require far fewer function-value queries while obtaining a better attack success rate. Especially when the attack task is hard, our ZO-HessAware-type algorithms achieve much higher success rates than the state-of-the-art algorithms. Our experimental results also suggest that Hessian information is a key tool for promoting the success rate of black-box adversarial attacks when the attack task is very hard.
To further improve the attack success rate and reduce the query complexity, we propose a novel strategy called Descent-Checking. Empirically, Descent-Checking brings a higher success rate and a lower query complexity, and this benefit is more evident when the attack task is hard.
1.1 Main Contribution
We summarize our main contributions as follows.
We exploit the Hessian information of the model function and propose a novel Hessian-aware zeroth-order algorithm called ZO-HessAware. Our key innovation is to integrate Hessian information into gradient estimation while keeping the algorithmic form similar to zeroth-order gradient descent. Theoretically, we show that ZO-HessAware with power-method based Hessian approximation has a faster convergence rate and lower query complexity than existing methods that use no Hessian information.
We propose several novel structured Hessian approximation methods, including a Gaussian-sampling method and a diagonalization method. To the best of our knowledge, Hessian estimation via Gaussian sampling is new; it takes only a few extra queries to the function value. In constructing the diagonal approximate Hessian, we use the natural gradient, which contains Hessian information, rather than the ordinary gradient used in training deep neural networks.
We propose a descent-checking trick for the black-box adversarial attack. This trick can significantly improve the success rate and reduce the number of queries.
We empirically demonstrate the power of Hessian information in zeroth-order optimization, especially for black-box adversarial attacks. Experimental results show that our ZO-HessAware-type algorithms achieve better success rates and need fewer queries than state-of-the-art algorithms, especially when the problem is hard.
1.2 Related Work
Zeroth-order optimization minimizes functions using only function-value oracles. It is an important research topic in optimization (Nesterov & Spokoiny, 2017; Matyas, 1965; Ghadimi & Lan, 2013). Nesterov & Spokoiny (2017) utilized random Gaussian vectors as search directions and gave convergence properties of zeroth-order algorithms when the objective function is convex. Ghadimi & Lan (2013) proposed new zeroth-order algorithms with better convergence rates when the problem is non-smooth. Zeroth-order methods with variance reduction were recently proposed to solve non-convex problems (Fang et al., 2018; Liu et al., 2018). Zeroth-order algorithms are also a crucial research topic in online learning, where many results have been obtained in recent years (Shamir, 2017; Bach & Perchet, 2016; Duchi et al., 2015). In these works, one can only access the function value and uses this feedback to approximate the gradient or sub-gradient.
Recently, zeroth-order algorithms have become the main tool for black-box adversarial attacks (Chen et al., 2017; Ilyas et al., 2018). Chen et al. (2017) extended the CW attack (Carlini & Wagner, 2017), a powerful white-box method, to the black-box setting and proposed ZOO. ZOO can be viewed as a kind of zeroth-order stochastic coordinate descent (Chen et al., 2017): it randomly chooses a coordinate and then uses zeroth-order oracles to estimate the gradient along that coordinate. However, ZOO suffers from poor query complexity because, in theory, it needs enough queries to estimate the gradients of all coordinates. To reduce the query complexity, Ilyas et al. (2018) resorted to natural evolutionary strategies (Wierstra et al., 2014) to estimate gradients and used the so-called 'antithetic sampling' technique (Salimans et al., 2017) to obtain better performance.
Furthermore, the covariance matrix adaptation evolution strategy (CMA-ES) is another important zeroth-order method closely related to our algorithm (Hansen & Ostermeier, 2001). CMA-ES uses a learned covariance matrix to generate search directions, and this covariance matrix plays a role much like the inverse of our approximate Hessian. The main difference between the two algorithms is how they use zeroth-order oracles: Eqn. (1.3) shows that ZO-HessAware queries function values to approximate a natural gradient, whereas CMA-ES generates candidate points and keeps as search directions those along which the function value is small.
Organization. The rest of this paper is organized as follows. In Section 2, we present notation and preliminaries. In Section 3, we describe Algorithm ZO-HessAware in detail and analyze its local and global convergence rates as well as its query complexity with power-method based Hessian approximation. In Section 4, we propose two different strategies for constructing a good approximate Hessian. In Section 5, we compare our ZO-HessAware-type algorithms with two state-of-the-art algorithms on the adversarial attack problem. Finally, we conclude our work in Section 6. All detailed proofs are deferred to the appendix in their order of appearance.
2 Notation and Preliminaries
We first introduce the notation that will be used in this paper, and then give some assumptions on the objective function.
Given a positive semi-definite matrix of rank r and a positive integer k ≤ r, its eigenvalue decomposition is given as
follows, where the columns of the orthonormal factor contain the eigenvectors and the diagonal entries, in non-increasing order, are the nonzero eigenvalues. We also use λ_max(·) and λ_min(·) to denote the largest and smallest eigenvalues of a positive semi-definite matrix, respectively.
Given a positive semi-definite matrix A, we define the A-norm as ‖x‖_A = √(xᵀAx). Furthermore, for positive semi-definite matrices we write A ⪰ B when A − B is positive semi-definite.
2.2 Properties of Smoothness and Convexity
In this paper, we consider functions that are L-smooth and strongly convex, which implies the following properties.
If a function f is L-smooth, then for all x and y we have f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖y − x‖².
If a function f is strongly convex with parameter τ, then for all x and y we have f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (τ/2)‖y − x‖².
We also assume that the Hessian of f is ρ-Lipschitz continuous, that is, ‖∇²f(x) − ∇²f(y)‖ ≤ ρ‖x − y‖ for all x and y.
2.3 Gaussian Smoothing
Let f be a function that is differentiable along any direction in its domain. The Gaussian smoothing of f is defined in Eqn. (2.6), with a parameter μ > 0 controlling the amount of smoothing. The smoothed function preserves several important properties of f: if f is convex, then its smoothing is also convex, and if f is L-smooth, then its smoothing is also L-smooth.
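For concreteness, the Gaussian smoothing used throughout this literature (Nesterov & Spokoiny, 2017) can be written as

```latex
f_{\mu}(x) \;=\; \mathbb{E}_{u \sim \mathcal{N}(0, I_d)}\left[ f(x + \mu u) \right],
```

where μ > 0 is the smoothing parameter, and convexity and L-smoothness of f carry over to f_μ as noted above.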
3 Hessian-Aware Zeroth-Order Method
In this section, we exploit the Hessian information of the model function, which was commonly ignored in past work on zeroth-order optimization, and propose the ZO-HessAware algorithm.
Our algorithm first constructs an approximate Hessian for the current point that satisfies the approximation condition (3.1). The parameter in (3.1) measures how well the matrix approximates the true Hessian: at one extreme the approximate Hessian is exact, while a small parameter means a poor approximation. One can use different methods to construct such a matrix. In Section 4, we provide several approaches that compute a good approximate Hessian with a small number of function-value queries. Note that we need not construct an approximate Hessian at every iteration; empirically, we can update it only every few iterations, with a parameter controlling the frequency of Hessian approximation.
Then we estimate the gradient through derivative-free oracles. Different from the existing zeroth-order works (Nesterov & Spokoiny, 2017; Ghadimi & Lan, 2013; Duchi et al., 2015), Hessian information enters our gradient estimation, which is represented as follows:
where b is the batch size: at the current point we sample a batch of b points to obtain a good gradient estimate. This mini-batch strategy is widely used in real applications such as adversarial attack.
Finally, analogously to Newton-style algorithms, we update the iterate using the approximate Hessian and the estimated gradient. Combining this with Eqn. (3.2), we represent the algorithmic procedure of ZO-HessAware as follows.
We depict the detailed algorithmic procedure of ZO-HessAware in Algorithm 1.
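A toy end-to-end sketch of the loop in Algorithm 1, under the strong assumption that an exact Hessian is available to play the role of the approximate Hessian; the step size, batch size, and smoothing radius below are illustrative choices, not the paper's. Sampling directions with covariance equal to the inverse Hessian makes the averaged two-point differences an estimate of the natural gradient, which is then used directly as the descent direction.

```python
import numpy as np

def zo_hess_aware(f, x0, H, eta=0.5, mu=1e-5, b=40, iters=100, seed=0):
    """Sketch of the ZO-HessAware loop on a smooth strongly convex f.

    H plays the role of the approximate Hessian; C @ u ~ N(0, H^{-1}),
    so the batch average p estimates H^{-1} grad f(x) (the natural
    gradient), and the update is x <- x - eta * p.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    C = np.linalg.cholesky(np.linalg.inv(H))   # C C^T = H^{-1}
    for _ in range(iters):
        fx = f(x)
        p = np.zeros_like(x)
        for _ in range(b):
            v = C @ rng.standard_normal(x.size)
            p += (f(x + mu * v) - fx) / mu * v
        x -= eta * p / b                       # natural-gradient step
    return x
```

Because the sampling covariance already whitens the curvature, both the steep and flat coordinates of an ill-conditioned quadratic contract at a similar rate.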
In the rest of this section, we first give some important properties of the gradient estimate computed as in Eqn. (3.2). Then we analyze the local and global convergence of Algorithm 1, respectively. Finally, we analyze the query complexity under power-method based Hessian approximation.
3.1 Properties of Estimated Gradient
We now list, in the following lemmas, some important properties of the gradient estimate defined in Eqn. (3.2) that will be used in our analysis of the convergence rate of ZO-HessAware. These lemmas are also of independent interest for zeroth-order algorithms.
Let f be L-smooth; then the gradient estimate defined in Eqn. (3.2) satisfies
Let f be L-smooth; then the gradient estimate can be bounded as
We now bound another key quantity of the estimate in the following lemma.
Let f be L-smooth; then the following property holds:
3.2 Local Convergence
Now we analyze the local convergence of Algorithm 1. To achieve a fast convergence rate, the initial point should be close enough to the optimal point, and at the same time the Hessian should be well approximated.
Note that the local convergence properties rely on condition (3.4). However, this condition may be violated at the next iteration if the descent direction is not good. This problem can be remedied by checking the function value: we discard the current iterate if its function value exceeds that of the previous iterate.
To achieve an ε-accurate solution, both the first and second terms on the right-hand side of inequality (3.3) must be smaller than ε. Therefore, ZO-HessAware needs
iterations. In contrast, without second-order information, first-order methods with zeroth-order oracles need more iterations, in proportion to the condition number. Since this relation always holds, ZO-HessAware has a faster convergence rate than conventional zeroth-order methods without Hessian information. Especially when the Hessian is well approximated by a rank-k matrix, our algorithm shows great advantages.
3.3 Global Convergence
We analyze the global convergence of Algorithm 1 in this section. To guarantee global convergence, we must set a smaller step size than the one in Theorem 1. We then have the following theorem.
In our analysis of global convergence, we use a fixed step size. One could also use a line search to obtain a better convergence property, at the cost of extra function-value queries.
3.4 Query Complexity Analysis
In this section, we analyze the query complexity of ZOHA-PW, which implements ZO-HessAware with power-method based Hessian approximation. The power method only needs access to Hessian-vector products, which can be approximated with function-value queries via finite differences, as follows:
where the subscript denotes the corresponding entry of the vector. Note that the Hessian never needs to be represented explicitly, and the finite-difference result can be regarded as the product with a slightly perturbed Hessian.
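An entry-by-entry finite-difference Hessian-vector product can be sketched as below; the step size ν and the particular second-order difference are assumptions consistent with standard derivations, not necessarily the paper's exact formula.

```python
import numpy as np

def hvp_zeroth_order(f, x, v, nu=1e-3):
    """Approximate H(x) @ v from function values only.

    Entry j uses the second-order difference
    [f(x + nu*v + nu*e_j) - f(x + nu*v) - f(x + nu*e_j) + f(x)] / nu^2,
    which is exact for quadratics and O(nu)-accurate for smooth f.
    """
    d = x.size
    fx = f(x)
    fxv = f(x + nu * v)
    hv = np.empty(d)
    for j in range(d):
        e = np.zeros(d)
        e[j] = nu
        hv[j] = (f(x + nu * v + e) - fxv - f(x + e) + fx) / nu**2
    return hv
```

Each product costs 2d + 2 function-value queries, which is why the power-method variant is query-hungry in high dimension.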
Given the above approximation of Hessian-vector products, we run the power method to obtain the k largest eigenvalues and their corresponding eigenvectors. The detailed algorithmic procedure is depicted in Algorithm 2. The approximate Hessian computed by the power method then has the following properties.
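Given any Hessian-vector-product oracle (for example, one built from function-value queries), the top-k eigenpairs can be recovered by block power (subspace) iteration. This is a standard sketch rather than Algorithm 2 verbatim; the iteration count and QR-based orthonormalization are illustrative choices.

```python
import numpy as np

def topk_power(matvec, d, k, iters=60, seed=0):
    """Subspace iteration for the k largest eigenpairs of a PSD matrix
    accessed only through matvec(v) = H @ v."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    for _ in range(iters):
        Z = np.column_stack([matvec(Q[:, i]) for i in range(k)])
        Q, _ = np.linalg.qr(Z)          # re-orthonormalize the block
    # Rayleigh quotients give the eigenvalue estimates
    lam = np.array([Q[:, i] @ matvec(Q[:, i]) for i in range(k)])
    return lam, Q
```

The rank-k approximate Hessian is then Q @ diag(lam) @ Q.T, possibly plus a regularization term.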
Now we give the query complexity analysis of ZOHA-PW. First, we choose the batch size and the smoothing parameter appropriately. Then each iteration of ZOHA-PW takes a batch of queries to estimate the gradient and additional queries to construct the approximate Hessian. Combining this with the convergence rate in Theorem 1, we obtain the following query complexity of ZOHA-PW.
Set the batch size and smoothing parameter in Algorithm 1 as above. Then the query complexity of ZOHA-PW is
The approximate Hessian constructed using the power method captures the principal rank-k information of the true Hessian. Empirically, the Hessian of a model function can often be written as a rank-k matrix plus a perturbation matrix of small norm, with its main information lying in the rank-k part (Yuan et al., 2007; Bakker et al., 2018; Sainath et al., 2013). Hence the query complexity of ZOHA-PW is much smaller than that of zeroth-order methods without Hessian information, as indicated in the work of Nesterov & Spokoiny (2017).
Furthermore, we can use the approximate Hessian from the last iteration of Algorithm 1 as the input to Algorithm 2. Because the iterate is close to the optimal point, the relevant quantity can be regarded as a constant, so an approximate Hessian can be obtained at constant query cost per update. Hence the query complexity of ZOHA-PW can be further improved.
4 Structured Hessian Approximation
In this section, we provide two kinds of heuristic methods for constructing approximate Hessians. These methods take much fewer function-value queries than the power-method based Hessian approximation. The first method is based on Gaussian sampling and the second on diagonalization.
4.1 Gaussian-Sampling Based Hessian Approximation
In this section, we propose a novel method for approximating the Hessian with much lower query complexity. The method is based on Gaussian sampling, and we name ZO-HessAware implemented with this Hessian approximation ZOHA-Gauss.
Our new method estimates the Hessian of the Gaussian-smoothed function defined in Eqn. (2.6). Since the smoothed function is close to the original when the smoothing parameter is small, a good approximation of the smoothed Hessian will also approximate the true Hessian well. In fact, we can bound the error between the two Hessians as follows.
Using Gaussian sampling, we approximate the Hessian of the smoothed function as follows:
where the added term is a properly chosen regularizer that keeps the estimate invertible. In the construction of Eqn. (4.1), we take only a small batch of points; the batch size is small, even much smaller than the dimension. Hence this construction has a low query complexity.
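A sketch of a Gaussian-sampling Hessian estimate in the spirit of Eqn. (4.1). The symmetric second-order difference, the Stein-type correction uuᵀ − I (which makes the average unbiased for quadratics), and the regularizer α·I that keeps the matrix invertible are all our own assumptions; the paper's exact construction is Eqn. (4.1).

```python
import numpy as np

def gauss_hessian(f, x, mu, b, alpha=1e-3, rng=None):
    """Estimate the Hessian at x from 2b + 1 function values.

    Each sample contributes delta / (2 mu^2) * (u u^T - I) with
    delta = f(x + mu u) + f(x - mu u) - 2 f(x); for a quadratic with
    Hessian A the expectation of the averaged term is exactly A.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = x.size
    fx = f(x)
    H = np.zeros((d, d))
    for _ in range(b):
        u = rng.standard_normal(d)
        delta = f(x + mu * u) + f(x - mu * u) - 2.0 * fx
        H += delta / (2.0 * mu**2) * (np.outer(u, u) - np.eye(d))
    return H / b + alpha * np.eye(d)
```

Note that the query cost scales with the batch size b, not with the dimension d, which is the point of this construction.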
The approximate Hessian constructed as Eqn. (4.1) has the following property.
Combining Lemmas 4 and 5, we conclude that if the smoothing parameter is small and the batch size in Eqn. (4.1) is large, then the matrix constructed as in Eqn. (4.1) approximates the Hessian very well. However, we cannot give the exact approximation precision, measured as in Eqn. (3.1), when the batch size is much smaller than the dimension. Thus, we do not give a theoretical convergence rate or query complexity for ZOHA-Gauss.
4.2 Diagonalization Based Hessian Approximation
We propose to use a diagonal matrix to approximate the Hessian. This technique has been used in the optimization of deep neural networks (Kingma & Ba, 2015; Zeiler, 2012; Tieleman & Hinton, 2012) and in online learning (Duchi et al., 2011).
First, we compute an approximate Hessian in the manner of ADAM (Kingma & Ba, 2015) as follows:
where the square of the natural-gradient estimate is taken entry-wise.
Second, we can also use the method of ADAGRAD (Duchi et al., 2011) to construct the approximate Hessian as
Other methods for constructing diagonal Hessian approximations used in training deep neural networks, such as ADADELTA (Zeiler, 2012), can also be used in our framework.
These kinds of Hessian approximation are heuristic: Theorem 1 does not give an exact convergence rate for ZO-HessAware with diagonal Hessian approximation. However, diagonal Hessian approximations have shown their power in training deep neural networks. Furthermore, a diagonal approximate Hessian has an important advantage: it needs no extra queries to the function value and has lower computational and storage cost.
Though the construction of our diagonal Hessian approximation follows the same procedure as ADAM and ADAGRAD, there are some differences. First, in training neural networks, ADAM and ADAGRAD use the accumulated statistics to rescale updates in a way that differs from our approximate Hessian. Second, in constructing our diagonal Hessian, we use the 'natural gradient' defined in Eqn. (1.3), which contains Hessian information, rather than the ordinary gradient. Moreover, the current diagonal Hessian is used in estimating the next 'natural gradient', whereas in ADAM and ADAGRAD the diagonal scaling does not affect the computation of gradients.
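The two diagonal updates can be sketched together as follows; β, the small damping term, and the use of the squared 'natural gradient' p are illustrative defaults in the spirit of Eqn. (4.2) and its ADAGRAD analogue, not verified constants from the paper.

```python
import numpy as np

def diag_hessian_adam(h_prev, p, beta=0.9, eps=1e-8):
    """ADAM-style diagonal update (cf. Eqn. (4.2)): exponential moving
    average of the entry-wise squared natural gradient p."""
    h = beta * h_prev + (1.0 - beta) * p * p
    return h, np.sqrt(h) + eps      # diagonal of the approximate Hessian

def diag_hessian_adagrad(h_prev, p, eps=1e-8):
    """ADAGRAD-style diagonal update: running sum of entry-wise squares."""
    h = h_prev + p * p
    return h, np.sqrt(h) + eps
```

Both updates cost O(d) arithmetic and zero extra function-value queries, which is the advantage highlighted in the text.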
5 Experiments
In this section, we apply our Hessian-aware zeroth-order algorithms to black-box adversarial attacks. This is an important research topic in the security of deep learning: neural networks are widely used in image classification, yet current neural-network-based classifiers are susceptible to adversarial examples.
Our adversarial attack experiments include both targeted and un-targeted attacks. A targeted attack aims to find an adversarial example of a given image that is misclassified as a chosen target class label. In this case, we minimize the following problem, proposed in the work of Carlini & Wagner (2017):
where the attack loss is defined through the logit-layer representation (logits) of the DNN, whose softmax gives the predicted probability that the input belongs to each class. A margin parameter tunes attack transferability, and we fix its value in our experiments. The constraint requires the adversarial image to remain close to the given image.
An un-targeted attack aims to find an example close to the given image, with its true label, that is nonetheless misclassified by the neural network. In this case, we minimize the following function (Carlini & Wagner, 2017):
5.1 Algorithm Implementation
In the experiments, we implement ZO-HessAware (Algorithm 1) with two different kinds of Hessian approximation. The first is based on the Gaussian sampling described in Section 4.1; we call it ZOHA-Gauss. The second uses the diagonal Hessian approximation described in Section 4.2 with the ADAM-style update defined in Eqn. (4.2); we name it ZOHA-Diag. We do not implement ZO-HessAware with the other kinds of diagonal Hessian approximation because they perform similarly.
where the projection operator keeps the iterate feasible with respect to the perturbation constraint. Note that this projection is exact for ZOHA-Diag, but for ZOHA-Gauss we must compute the projected iterate by solving the following sub-problem:
Because the objective function of the neural network model may be non-convex, we implement the approximate Hessian in ZOHA-Gauss as follows:
This modification ensures that the approximate Hessian is positive definite. Furthermore, we observe that it can be written as a scaled identity plus a low-rank term, which we exploit as follows: we first compute the SVD of the low-rank factor and then transform its singular values accordingly. In practice, the regularization value can be set as a small fraction of the estimated curvature or tuned with several tries.
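When the approximate Hessian has the form α·I plus a low-rank term U Uᵀ (with U having only a few columns), the inverse square root needed by ZOHA-Gauss can be obtained from a thin SVD of U instead of a full d × d eigendecomposition. This is a sketch of that low-rank trick; the exact factorization used in the paper is not shown in this excerpt.

```python
import numpy as np

def inv_sqrt_low_rank(U, alpha):
    """Return M = (alpha*I + U @ U.T)^(-1/2) via the thin SVD of U.

    On the column span of U the eigenvalues of H are alpha + s_i^2; on
    the orthogonal complement they equal alpha. M applies the matching
    inverse square roots on each subspace.
    """
    d = U.shape[0]
    V, s, _ = np.linalg.svd(U, full_matrices=False)   # U = V diag(s) W^T
    scale = 1.0 / np.sqrt(alpha + s ** 2) - 1.0 / np.sqrt(alpha)
    return np.eye(d) / np.sqrt(alpha) + (V * scale) @ V.T
```

The SVD costs O(d b²) for a d × b factor, which is far cheaper than the O(d³) eigendecomposition when the batch size b is much smaller than d.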
5.1.1 Descent Checking
To improve the success rate and query efficiency, we introduce an important technique called Descent-Checking, discussed in Remark 1 but implemented here with a slight difference. Descent-Checking proceeds as follows. After obtaining a candidate iterate, we query its function value and check whether it decreases the objective. If it does, we go to the next iteration. Otherwise, we discard the current candidate and take extra samples, combined with the existing ones, to estimate a new gradient, until either a new candidate decreases the objective or the total sample size exceeds a threshold. If the threshold is exceeded, we accept the 'bad' candidate and go to the next iteration. Because the extra batch size is typically only a few tens, the Descent-Checking strategy does not incur many extra queries. As a result, Descent-Checking effectively filters out bad search directions, which leads to a higher attack success rate. We depict the detailed procedure of ZO-HessAware with Descent-Checking in Algorithm 3. Accordingly, we name ZOHA-Gauss and ZOHA-Diag with the Descent-Checking strategy ZOHA-Gauss-DC and ZOHA-Diag-DC, respectively.
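The checking loop described above can be sketched as follows; `propose` stands in for the gradient-estimation-plus-update step of Algorithm 3, and the batch increment q and budget max_n are illustrative. For simplicity, this sketch re-estimates with a larger batch rather than literally reusing the earlier samples.

```python
import numpy as np

def descent_checking_step(f, x, propose, q=20, max_n=100, rng=None):
    """Accept a candidate step only if it decreases f; otherwise retry
    with more samples, and accept a 'bad' step once the budget is hit."""
    rng = np.random.default_rng() if rng is None else rng
    fx = f(x)
    n = q
    x_new = propose(x, n, rng)
    while f(x_new) >= fx and n < max_n:
        n += q                          # enlarge the sample batch
        x_new = propose(x, n, rng)
    return x_new
```

Each retry costs one extra function-value query for the check plus the extra samples, which is why the overhead stays small when q is a few tens.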
5.2 Evaluation on MNIST
We evaluate the effectiveness of our attacks against a convolutional neural network (CNN) on the MNIST dataset. The network is composed of two 5×5 convolutional layers followed by two fully connected layers, with 2×2 max-pooling after each convolutional layer and ReLU after every layer except the last. The network is trained for 100 epochs with a learning rate starting at 0.1 and decayed by 0.5 every 20 epochs.
We test the attack algorithms on images from the MNIST test set under a fixed perturbation limit. We run each attack until an adversarial example is found or the query budget is exceeded.
We report the experimental results in Table 1 and Figures 1 and 2. Visualizations of the adversarial attacks are presented in Appendix E. We observe that ZO-HessAware, under its different implementations, obtains much better success rates than the two state-of-the-art algorithms. This validates the effectiveness of second-order information in zeroth-order optimization. In particular, our ZOHA-DC-type algorithms obtain the best success rates on both targeted and un-targeted attacks, much higher than ZOO and PGD-NES, while taking fewer function-value queries. Furthermore, on the un-targeted attack, the ZOHA-DC-type algorithms take less than half the queries of PGD-NES, demonstrating the query efficiency of our Hessian-aware zeroth-order algorithms.
The distributions of query counts in Figures 1 and 2 show that the Descent-Checking technique reduces the number of queries effectively. For example, comparing ZOHA-Gauss with ZOHA-Gauss-DC in Figure 2, the fraction of runs finishing within the lower query ranges is much higher for ZOHA-Gauss-DC.
| | Algorithm | Success rate (%) | Median queries | Average queries |
|---|---|---|---|---|
| targeted | ZOO (Chen et al., 2017) | 42.13 | 15,200 | 17,091 |
| | PGD-NES (Ilyas et al., 2018) | 44.19 | 7,300 | 10,496 |
| un-targeted | ZOO (Chen et al., 2017) | 77.18 | 13,300 | 16,390 |
| | PGD-NES (Ilyas et al., 2018) | 81.55 | 5,800 | 8,567 |
5.3 Evaluation on ImageNet
In this experiment, we use a pre-trained ResNet50 for evaluation, under a fixed perturbation limit. We choose images randomly from the ImageNet test set and run each attack method until an adversarial example is found or the query budget is exceeded. Furthermore, for targeted attacks, the target label is chosen uniformly at random from the 1,000 classes.
In the experiment on ImageNet, instead of Eqn. (3.2), we use the following method to estimate the gradient.
This estimate has the same expectation as the one defined in Eqn. (3.2) but performs better in this experiment; the natural gradient is modified in the same symmetric way.
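A symmetric (antithetic) two-point estimate of this kind can be sketched as follows; the 1/(2μ) scaling is an assumption consistent with the modified estimator having the same expectation as the one-sided one.

```python
import numpy as np

def zo_gradient_antithetic(f, x, mu, b, rng=None):
    """Antithetic gradient estimate: each direction u is paired with -u,
    so (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u has the same expectation
    as the one-sided estimate but typically lower variance."""
    rng = np.random.default_rng() if rng is None else rng
    g = np.zeros_like(x, dtype=float)
    for _ in range(b):
        u = rng.standard_normal(x.size)
        g += (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
    return g / b
```

The symmetric difference cancels even-order terms of the Taylor expansion, which is one plausible reason for its better empirical performance here.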
We report the results in Table 2 and Figures 3 and 4. Visualizations of the adversarial attacks are presented in Figures 6 and 7 of Appendix E. We observe that our algorithms take far fewer queries than ZOO and PGD-NES. For the un-targeted attack, the median query count of ZOHA-Diag-DC is only a small fraction of ZOO's and of PGD-NES's at the same success rate. For the targeted attack, all algorithms take many more queries than in the un-targeted case, but our algorithms still show great query efficiency. In particular, both ZOHA-Diag and ZOHA-Diag-DC achieve an attack success rate higher than PGD-NES while taking only a fraction of PGD-NES's queries. Though ZOO also obtains a high success rate, it takes several times as many queries as ZOHA-Diag and ZOHA-Diag-DC.
| | Algorithm | Success rate (%) | Median queries | Average queries |
|---|---|---|---|---|
| targeted | ZOO (Chen et al., 2017) | 100 | 39,100 | 45,822 |
| | PGD-NES (Ilyas et al., 2018) | 99.37 | 11,270 | 17,435 |
| un-targeted | ZOO (Chen et al., 2017) | 100 | 12,700 | 14,199 |
| | PGD-NES (Ilyas et al., 2018) | 100 | 1,500 | 2,283 |
From the above two experiments, we can draw some important insights. First, the comparison of attack success rates between the two deep learning models indicates that a deeper or more complicated neural network is potentially more vulnerable to adversarial attack: on ResNet50, all algorithms achieve success rates at or near 100%, whereas the attack success rate on the simple convolutional network with a few layers is much lower.
Second, the large gap in success rates between our Hessian-aware zeroth-order methods and the two state-of-the-art algorithms on MNIST suggests that Hessian information may bring great advantages on hard adversarial attack problems.
Third, we observe that our algorithms with Descent-Checking outperform those without it. The MNIST results show that the Descent-Checking strategy effectively improves the attack success rate for both targeted and un-targeted attacks. At the same time, Descent-Checking is an effective way to reduce query complexity, as is readily observed from the distributions of query counts in Figures 1-4.
6 Conclusion
In this paper, we propose a novel zeroth-order algorithmic framework called ZO-HessAware, which exploits the second-order information of the model function. Thanks to this information, ZO-HessAware achieves a faster convergence rate and lower query complexity than existing methods without Hessian information. We also propose several methods to capture the principal information of the Hessian efficiently. Experiments on black-box adversarial attacks show that our ZO-HessAware algorithms improve the attack success rate and reduce the query complexity effectively, which empirically validates both the effectiveness of Hessian information in zeroth-order optimization and our theoretical analysis. We also propose a novel technique called Descent-Checking, which empirically improves the attack success rate and reduces the query complexity.
- Bach & Perchet (2016) Bach, F. & Perchet, V. (2016). Highly-smooth zero-th order online optimization. In Conference on Learning Theory (pp. 257–283).
- Bakker et al. (2018) Bakker, C., Henry, M. J., & Hodas, N. O. (2018). Understanding and exploiting the low-rank structure of deep networks.
- Balcan et al. (2016) Balcan, M.-F., Du, S. S., Wang, Y., & Yu, A. W. (2016). An improved gap-dependency analysis of the noisy power method. In Conference on Learning Theory (pp. 284–309).
- Bergstra et al. (2011) Bergstra, J. S., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for hyper-parameter optimization. In Advances in neural information processing systems (pp. 2546–2554).
- Carlini & Wagner (2017) Carlini, N. & Wagner, D. (2017). Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP) (pp. 39–57). IEEE.
- Chen et al. (2017) Chen, P.-Y., Zhang, H., Sharma, Y., Yi, J., & Hsieh, C.-J. (2017). Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security (pp. 15–26). ACM.
- Duchi et al. (2011) Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul), 2121–2159.
- Duchi et al. (2015) Duchi, J. C., Jordan, M. I., Wainwright, M. J., & Wibisono, A. (2015). Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5), 2788–2806.
- Fang et al. (2018) Fang, C., Li, C. J., Lin, Z., & Zhang, T. (2018). Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems (pp. 686–696).
- Ghadimi & Lan (2013) Ghadimi, S. & Lan, G. (2013). Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4), 2341–2368.
- Hansen & Ostermeier (2001) Hansen, N. & Ostermeier, A. (2001). Completely derandomized self-adaptation in evolution strategies. Evolutionary computation, 9(2), 159–195.
- Hu & Tan (2017) Hu, W. & Tan, Y. (2017). Generating adversarial malware examples for black-box attacks based on gan. arXiv preprint arXiv:1702.05983.
- Ilyas et al. (2018) Ilyas, A., Engstrom, L., Athalye, A., & Lin, J. (2018). Black-box adversarial attacks with limited queries and information. In J. Dy & A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research (pp. 2137–2146). Stockholmsmässan, Stockholm Sweden: PMLR.
- Kingma & Ba (2015) Kingma, D. P. & Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
- Liu et al. (2018) Liu, S., Kailkhura, B., Chen, P.-Y., Ting, P., Chang, S., & Amini, L. (2018). Zeroth-order stochastic variance reduction for nonconvex optimization. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 31 (pp. 3731–3741). Curran Associates, Inc.
- Magnus (1978) Magnus, J. R. (1978). The moments of products of quadratic forms in normal variables. Statistica Neerlandica, 32(4), 201–210.
- Matyas (1965) Matyas, J. (1965). Random optimization. Automation and Remote control, 26(2), 246–253.
- Nesterov & Spokoiny (2017) Nesterov, Y. & Spokoiny, V. (2017). Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2), 527–566.
- Papernot et al. (2017) Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., & Swami, A. (2017). Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security (pp. 506–519). ACM.
- Sainath et al. (2013) Sainath, T. N., Kingsbury, B., Sindhwani, V., Arisoy, E., & Ramabhadran, B. (2013). Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on (pp. 6655–6659). IEEE.
- Salimans et al. (2017) Salimans, T., Ho, J., Chen, X., Sidor, S., & Sutskever, I. (2017). Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864.
- Shamir (2017) Shamir, O. (2017). An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. Journal of Machine Learning Research, 18(52), 1–11.
- Snoek et al. (2012) Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems (pp. 2951–2959).
- Tieleman & Hinton (2012) Tieleman, T. & Hinton, G. (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2), 26–31.
- Wierstra et al. (2014) Wierstra, D., Schaul, T., Glasmachers, T., Sun, Y., Peters, J., & Schmidhuber, J. (2014). Natural evolution strategies. The Journal of Machine Learning Research, 15(1), 949–980.
- Yuan et al. (2007) Yuan, M., Ekici, A., Lu, Z., & Monteiro, R. (2007). Dimension reduction and coefficient estimation in multivariate linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(3), 329–346.
- Zeiler (2012) Zeiler, M. D. (2012). Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.
Appendix A Proofs for Section 3.1
In this section, we will prove three properties of the estimated gradient defined in Eqn. (3.2). Before that, we first list some important lemmas related to the Gaussian distribution that will be used in our proof.
A.1 Important Lemmas
Lemma 6 ((Nesterov & Spokoiny, 2017)).
If $f$ is $L$-smooth, then we have
$$\left| f(y) - f(x) - \langle \nabla f(x), y - x \rangle \right| \le \frac{L}{2} \|y - x\|^2, \quad \forall x, y.$$
Lemma 7 ((Nesterov & Spokoiny, 2017)).
If $f$ is $L$-smooth, then
$$\left| f(x + \mu u) - f(x) - \mu \langle \nabla f(x), u \rangle \right| \le \frac{\mu^2 L}{2} \|u\|^2.$$
If the Hessian of $f$ is $M$-Lipschitz continuous, then we can guarantee that
$$\left| f(x + \mu u) - f(x) - \mu \langle \nabla f(x), u \rangle - \frac{\mu^2}{2} \langle \nabla^2 f(x) u, u \rangle \right| \le \frac{\mu^3 M}{6} \|u\|^3.$$
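As a quick numerical sanity check of the standard third-order Taylor remainder bound for functions with an $M$-Lipschitz Hessian, the following sketch verifies it on the illustrative function $f(x) = \sum_i x_i^3 / 6$, whose Hessian $\operatorname{diag}(x)$ is $1$-Lipschitz in the spectral norm (an example chosen here, not taken from the paper):

```python
import numpy as np

# f(x) = sum(x_i^3)/6 has gradient x_i^2/2 and Hessian diag(x_i);
# the Hessian is 1-Lipschitz in the spectral norm.
f = lambda x: np.sum(x ** 3) / 6.0
grad = lambda x: x ** 2 / 2.0
hess = lambda x: np.diag(x)

rng = np.random.default_rng(1)
x = rng.standard_normal(6)
mu, M = 0.1, 1.0
for _ in range(100):
    u = rng.standard_normal(6)
    # |f(x+mu*u) - f(x) - mu<grad,u> - (mu^2/2) u^T H u|
    lhs = abs(f(x + mu * u) - f(x) - mu * grad(x) @ u
              - 0.5 * mu ** 2 * u @ hess(x) @ u)
    # (M/6) * ||mu*u||^3
    rhs = M / 6.0 * (mu * np.linalg.norm(u)) ** 3
    assert lhs <= rhs + 1e-12
```

For this cubic the remainder is exactly $\sum_i (\mu u_i)^3 / 6$, which the bound dominates since the $\ell_3$ norm is never larger than the $\ell_2$ norm.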
Lemma 8 ((Nesterov & Spokoiny, 2017)).
Let $p \ge 2$ and let $u$ be drawn from $\mathcal{N}(0, I_d)$; then we have the following bound:
$$\mathbb{E}_u\!\left[\|u\|^p\right] \le (d + p)^{p/2}.$$
Next, we give results on the moments of products of quadratic forms in normally distributed variables.
Lemma 9 ((Magnus, 1978)).
Let $A$ and $B$ be two symmetric matrices, and let $u$ have the Gaussian distribution, that is, $u \sim \mathcal{N}(0, I_d)$. Define $q_A = u^\top A u$ and $q_B = u^\top B u$. The expectations of $q_A$ and $q_A q_B$ are:
$$\mathbb{E}[q_A] = \operatorname{tr}(A), \qquad \mathbb{E}[q_A q_B] = \operatorname{tr}(A)\operatorname{tr}(B) + 2\operatorname{tr}(AB).$$
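These quadratic-form identities for standard Gaussians, $\mathbb{E}[u^\top A u] = \operatorname{tr}(A)$ and $\mathbb{E}[(u^\top A u)(u^\top B u)] = \operatorname{tr}(A)\operatorname{tr}(B) + 2\operatorname{tr}(AB)$, can be verified empirically; the matrices below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
# Two symmetric test matrices (arbitrary choices for illustration).
A = np.diag(np.arange(1.0, d + 1.0))
M = rng.standard_normal((d, d))
B = (M + M.T) / 2.0

# Monte Carlo estimates with u ~ N(0, I_d).
n = 200_000
U = rng.standard_normal((n, d))
qA = np.einsum('ni,ij,nj->n', U, A, U)   # u^T A u per sample
qB = np.einsum('ni,ij,nj->n', U, B, U)   # u^T B u per sample

e_qA = qA.mean()             # should approach tr(A)
e_qAqB = (qA * qB).mean()    # should approach tr(A)tr(B) + 2 tr(AB)
print(e_qA, np.trace(A))
print(e_qAqB, np.trace(A) * np.trace(B) + 2 * np.trace(A @ B))
```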