1 Introduction
In this paper, we consider the following convex optimization problem:
(1.1) $\min_{x \in \mathbb{R}^d} f(x),$
where $f \colon \mathbb{R}^d \to \mathbb{R}$ is differentiable and strongly convex. An optimization method that solves the above problem with access to function values only is known as zeroth-order optimization or black-box optimization (Nesterov & Spokoiny, 2017; Ghadimi & Lan, 2013).
Zeroth-order optimization has attracted attention from the machine learning community (Bergstra et al., 2011; Ilyas et al., 2018), and it is especially useful for solving problem (1.1) when evaluations of gradients are difficult or even infeasible. One prominent example of zeroth-order optimization is the black-box adversarial attack on deep neural networks (Chen et al., 2017; Hu & Tan, 2017; Papernot et al., 2017; Ilyas et al., 2018). In the black-box adversarial attack, only the inputs and outputs of the neural network are available to the attacker, and backpropagation through the target network is prohibited. Another application example is hyperparameter tuning, which searches for the optimal parameters of deep neural networks or other learning models (Snoek et al., 2012; Bergstra et al., 2011). In recent years, theoretical works on zeroth-order optimization have arisen as alternatives to the corresponding first-order methods; they estimate gradients using function-value differences
(Nesterov & Spokoiny, 2017; Ghadimi & Lan, 2013; Duchi et al., 2015). However, these works have concentrated on extracting gradient information of the objective function and have failed to utilize second-order Hessian information; to some extent, they have not fully exploited the structure of the model and thus obtain less competitive rates. In this paper, we aim to take advantage of the model's second-order information and propose a novel method called Hessian-aware zeroth-order optimization. Aligning with the earlier works of Nesterov & Spokoiny (2017) and Ghadimi & Lan (2013), we present our gradient estimation as follows:
(1.2) $\hat g_\mu(x) = \frac{1}{b\mu}\sum_{i=1}^{b}\big[f(x + \mu H^{-1/2} u_i) - f(x)\big]\, H^{1/2} u_i, \qquad u_i \sim \mathcal{N}(0, I_d),$
where $b$ is the batch size of points used for gradient estimation and $H$ is an approximate Hessian at the evaluation point. With Eqn. (1.2) at hand, the core update rule of our Hessian-aware zeroth-order algorithm, namely ZOHessAware, is
$x_{t+1} = x_t - \eta_t H^{-1}\hat g_\mu(x_t),$
where $\eta_t$ is the step size. If one lets $v$ be defined as
(1.3) $v = H^{-1/2} u, \qquad u \sim \mathcal{N}(0, I_d),$
then, using the linear transformation property of the multivariate Gaussian distribution, $v \sim \mathcal{N}(0, H^{-1})$ and the update rule further reduces to
(1.4) $x_{t+1} = x_t - \frac{\eta_t}{b\mu}\sum_{i=1}^{b}\big[f(x_t + \mu v_i) - f(x_t)\big]\, v_i, \qquad v_i \sim \mathcal{N}\big(0, H^{-1}\big).$
In comparison, earlier zeroth-order literature (for instance, Nesterov & Spokoiny (2017)) conducts gradient estimation via
(1.5) $\hat g_\mu(x) = \frac{1}{\mu}\big[f(x + \mu u) - f(x)\big]\, u, \qquad u \sim \mathcal{N}(0, I_d).$
Comparing (1.3) and (1.5), one observes that the two update directions share the same form but admit different covariances. Because $v$ contains Hessian information and the resulting estimate shares the same form as the estimated gradient, it can be regarded as a natural gradient, and ZOHessAware can be regarded as a natural gradient descent method. Perhaps surprisingly, in the context of zeroth-order optimization the difference between our update and those of earlier works boils down to the difference of the covariances of the search directions. Our special choice of the covariance matrix as the approximate inverse Hessian allows us to incorporate Hessian information into the update rule in Eqn. (1.4), which in turn achieves an improved theoretical convergence rate in zeroth-order optimization.
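For concreteness, the following NumPy sketch performs one Hessian-aware zeroth-order step in the reduced form of Eqn. (1.4). It is only an illustration of the estimator's structure under our notational assumptions (smoothing radius `mu`, batch size `b`, and a supplied factor `H_inv_sqrt` approximating the inverse square root of the Hessian are all names introduced here), not a reference implementation of Algorithm 1.

```python
import numpy as np

def zoha_update(f, x, H_inv_sqrt, mu=1e-3, b=20, eta=0.1, rng=None):
    """One illustrative Hessian-aware zeroth-order step (cf. Eqns. (1.2)-(1.4)).

    f          : black-box objective, maps a 1-D array to a float
    x          : current iterate, shape (d,)
    H_inv_sqrt : matrix approximating H^{-1/2}, shape (d, d)
    mu, b, eta : smoothing radius, sample batch size, step size
    """
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    fx = f(x)
    step = np.zeros(d)
    for _ in range(b):
        u = rng.standard_normal(d)   # u ~ N(0, I_d)
        v = H_inv_sqrt @ u           # v ~ N(0, H^{-1}), the covariance in Eqn. (1.3)
        step += (f(x + mu * v) - fx) / mu * v
    return x - eta * step / b        # reduced update of Eqn. (1.4)

# Toy usage on a badly conditioned quadratic where the Hessian is known exactly.
if __name__ == "__main__":
    A = np.diag([100.0, 1.0])
    f = lambda z: 0.5 * z @ A @ z
    H_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(A)))
    x = np.array([1.0, 1.0])
    for _ in range(200):
        x = zoha_update(f, x, H_inv_sqrt, mu=1e-4, b=10, eta=0.5)
    print("final iterate:", x)
```

In the toy run, the Hessian-shaped sampling covariance makes the expected step behave like a natural-gradient step, so the badly scaled quadratic is contracted uniformly in both coordinates.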
Assume the approximate Hessian $H$ satisfies the approximation condition (3.1) and the algorithm is initialized at a point sufficiently close to the optimal solution. Then, to reach an $\epsilon$-accurate solution, our ZOHessAware algorithm with a proper step size achieves the local iteration complexity given in Theorem 1. Moreover, if the objective function is strongly convex, the approximation quality in (3.1) can be bounded in terms of the strong convexity parameter and the eigenvalues of the Hessian, and ZOHessAware enjoys a correspondingly sharper iteration complexity. Furthermore, when ZOHessAware is implemented with power-method based Hessian approximation (named ZOHAPW), it achieves the query complexity stated in Theorem 4.
Consequently, our proposed ZOHessAware algorithm enjoys a sharper theoretical convergence rate and query complexity than the comparable method using no Hessian information (recall that the iteration and query complexities of Nesterov & Spokoiny (2017) scale with the dimension and the smoothness parameter of the objective function).
Though ZOHAPW obtains a nice theoretical query complexity, it takes at least $d$ queries per Hessian approximation because of the power-method procedure. This is very expensive, especially when the dimension $d$ is very large. Hence, we also propose several heuristic but practical methods to construct approximate Hessians with a much lower query cost.

First, we use a Gaussian sampling method to approximate the Hessian. This method samples only a small batch of points from the Gaussian distribution to estimate the Hessian, with the batch size being much smaller than the data dimension $d$.

Furthermore, we propose a diagonal Hessian approximation, a technique that is popular in training deep neural networks. This approach needs no extra queries to the function value and can retain the principal information of the Hessian, as has been demonstrated in training deep neural networks (Kingma & Ba, 2015; Duchi et al., 2011; Zeiler, 2012).
To numerically justify the effectiveness of our zeroth-order algorithm ZOHessAware, we apply it, with the Hessian approximation approaches above, to the task of black-box adversarial attack on neural-network-based image classifiers (Ilyas et al., 2018; Chen et al., 2017). The adversarial attack aims to find a version of a given image that carries only small noise yet is misclassified by the neural network.

We compare our algorithms with two state-of-the-art algorithms, PGDNES and ZOO (Ilyas et al., 2018; Chen et al., 2017). The comparison shows that our Hessian-aware zeroth-order algorithms require far fewer function-value queries while achieving a better attack success rate. Especially when the attack task is hard, our ZOHessAware-type algorithms achieve much higher success rates than the state-of-the-art algorithms. Our experimental results also suggest that Hessian information is a key tool for improving the success rate of black-box adversarial attacks when the attack task is very hard.

To improve the attack success rate and reduce the query cost, we propose a novel strategy called Descent Checking. Empirically, Descent Checking yields a higher success rate and a lower query count, and this benefit is more evident when the attack task is hard.
1.1 Main Contribution
We summarize our main contributions as follows.

We exploit the Hessian information of the model function and propose a novel Hessian-aware zeroth-order algorithm called ZOHessAware. Our key novelty is to integrate Hessian information into the gradient estimation while keeping the algorithmic form similar to zeroth-order gradient descent. Theoretically, we show that ZOHessAware with power-method based Hessian approximation has a faster convergence rate and a lower query complexity than existing work without Hessian information.

Several novel structured Hessian approximation methods are proposed, including a Gaussian sampling method and a diagonalization method. To the best of our knowledge, the Hessian estimation via Gaussian sampling is new, and it only takes a few extra queries to the function value. In the construction of the diagonal approximate Hessian, we use the natural gradient, which contains Hessian information, rather than the ordinary gradient used in training deep neural networks.

We propose a Descent Checking trick for the black-box adversarial attack. This trick can significantly improve the success rate and reduce the number of queries.

We empirically demonstrate the power of Hessian information in zeroth-order optimization, especially in the black-box adversarial attack. Experimental results show that our ZOHessAware-type algorithms achieve better success rates and need fewer queries than state-of-the-art algorithms, especially when the problem is hard.
1.2 Related Work
Zeroth-order optimization minimizes functions only through function-value oracles. It is an important research topic in optimization (Nesterov & Spokoiny, 2017; Matyas, 1965; Ghadimi & Lan, 2013). Nesterov & Spokoiny (2017) utilized random Gaussian vectors as the search directions and gave convergence properties of zeroth-order algorithms when the objective function is convex. Ghadimi & Lan (2013) proposed new zeroth-order algorithms which have better convergence rates when the problem is nonsmooth. Zeroth-order methods with variance reduction were proposed recently to solve nonconvex problems (Fang et al., 2018; Liu et al., 2018). Zeroth-order algorithms are also a crucial research topic in online learning, where many results have been obtained in recent years (Shamir, 2017; Bach & Perchet, 2016; Duchi et al., 2015). In these works, one can only access the function value and uses this feedback to approximate the gradient or subgradient. Recently, zeroth-order algorithms have become the main tool for the black-box adversarial attack (Chen et al., 2017; Ilyas et al., 2018). Chen et al. (2017) extended the CW attack (Carlini & Wagner, 2017), a powerful white-box method, to the black-box setting and proposed ZOO. ZOO can be viewed as a kind of zeroth-order stochastic coordinate descent (Chen et al., 2017): it randomly chooses a coordinate and then uses zeroth-order oracles to estimate the gradient along that coordinate. However, ZOO suffers from a poor query complexity because, in theory, it needs a number of queries proportional to the dimension to estimate the gradients of all the coordinates. To reduce the query complexity, Ilyas et al. (2018) resorted to natural evolution strategies (Wierstra et al., 2014) to estimate gradients and used the so-called 'antithetic sampling' technique (Salimans et al., 2017) to obtain better performance.
Furthermore, the covariance matrix adaptation evolution strategy (CMAES) is another important zeroth-order method that is closely related to our algorithm (Hansen & Ostermeier, 2001). CMAES uses a learned covariance matrix to generate search directions, and this covariance matrix plays a role much like the inverse of our approximate Hessian. The main difference between the two algorithms lies in the way they use zeroth-order oracles. Eqn. (1.3) shows that ZOHessAware queries the function value to approximate a natural gradient. In contrast, CMAES generates candidate points and keeps as search directions those candidates whose function value is small.
Organization. The rest of this paper is organized as follows. In Section 2, we present notation and preliminaries. In Section 3, we describe Algorithm ZOHessAware in detail and analyze its local and global convergence rates as well as its query complexity with power-method based Hessian approximation. In Section 4, we propose two different strategies to construct a good approximate Hessian. In Section 5, we compare our ZOHessAware-type algorithms with two state-of-the-art algorithms on the adversarial attack problem. Finally, we conclude our work in Section 6. All detailed proofs are deferred to the appendix in their order of appearance.
2 Notation and Preliminaries
We first introduce the notation used throughout this paper. Then, we state some assumptions on the objective function that will be used in our analysis.
2.1 Notation
Given a positive semidefinite matrix $A \in \mathbb{R}^{d \times d}$ of rank $r$ and a positive integer $k \le r$, its eigenvalue decomposition is given as
(2.1) $A = U \Lambda U^\top = \begin{bmatrix} U_k & U_{r-k} \end{bmatrix}\begin{bmatrix} \Lambda_k & 0 \\ 0 & \Lambda_{r-k} \end{bmatrix}\begin{bmatrix} U_k & U_{r-k} \end{bmatrix}^\top,$
where $U_k$ and $U_{r-k}$ contain the top $k$ and the remaining eigenvectors of $A$, respectively, and $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_r)$ with $\lambda_1 \ge \dots \ge \lambda_r > 0$ collects the nonzero eigenvalues of $A$. We also use $\lambda_{\max}(\cdot)$ and $\lambda_{\min}(\cdot)$ to denote the largest and smallest eigenvalues of a positive semidefinite matrix, respectively. Using the matrix $A$, we can define the norm $\|x\|_A = \sqrt{x^\top A x}$. Furthermore, for symmetric matrices $A$ and $B$, we write $A \succeq B$ when $A - B$ is positive semidefinite.
2.2 Properties of Smoothness and Convexity
In this paper, we consider functions that are smooth and strongly convex, which implies the following properties.
$L$-smoothness
If the function $f$ is $L$-smooth, then we have
(2.2) $\|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\|$
and
(2.3) $f(y) \le f(x) + \langle \nabla f(x),\, y - x \rangle + \frac{L}{2}\|y - x\|^2.$
$\tau$-strong convexity
If the function $f$ is $\tau$-strongly convex, then we have
$f(y) \ge f(x) + \langle \nabla f(x),\, y - x \rangle + \frac{\tau}{2}\|y - x\|^2$
and
$\|\nabla f(x) - \nabla f(y)\| \ge \tau\,\|x - y\|.$
We also assume that the Hessian of $f$ is $L_H$-Lipschitz continuous, that is,
(2.4) $\|\nabla^2 f(x) - \nabla^2 f(y)\| \le L_H\,\|x - y\|$
and
(2.5) $\|\nabla f(y) - \nabla f(x) - \nabla^2 f(x)(y - x)\| \le \frac{L_H}{2}\|y - x\|^2.$
2.3 Gaussian Smoothing
Let $f$ be a function that is differentiable along any direction in $\mathbb{R}^d$. The Gaussian smoothing of $f$ is defined as
(2.6) $f_\mu(x) = \mathbb{E}_{u \sim \mathcal{N}(0, I_d)}\big[f(x + \mu u)\big] = \frac{1}{(2\pi)^{d/2}}\int f(x + \mu u)\, e^{-\|u\|^2/2}\, du,$
and its gradient can be written as
$\nabla f_\mu(x) = \mathbb{E}_{u \sim \mathcal{N}(0, I_d)}\Big[\frac{f(x + \mu u) - f(x)}{\mu}\, u\Big].$
Here $\mu > 0$ is the parameter that controls the smoothing. $f_\mu$ preserves several important properties of $f$. For example, if $f$ is convex, then $f_\mu$ is also convex; if $f$ is smooth, then $f_\mu$ is also smooth.
3 HessianAware ZerothOrder Method
In this section, we exploit the Hessian information of the model function, which was commonly ignored in past works on zeroth-order optimization, and propose the ZOHessAware algorithm.
Our algorithm first constructs an approximate Hessian $H(x)$ at the current point $x$ satisfying
(3.1) $\zeta\, \nabla^2 f(x) \preceq H(x) \preceq \nabla^2 f(x)$
with $0 < \zeta \le 1$. The parameter $\zeta$ measures how well $H(x)$ approximates $\nabla^2 f(x)$: if $\zeta = 1$, then $H(x)$ is the exact Hessian, while if $\zeta$ is small, then $H(x)$ approximates $\nabla^2 f(x)$ poorly. One can use different methods to construct such an $H(x)$. In Section 4, we provide several approaches to compute a good approximate Hessian with a small number of queries to the function value. Note that we do not need to construct an approximate Hessian at every iteration; empirically, we may update it only every $m$ iterations, where $m$ is a parameter controlling the frequency of the Hessian approximation.
Then we estimate the gradient via derivative-free oracles. Different from existing zeroth-order works (Nesterov & Spokoiny, 2017; Ghadimi & Lan, 2013; Duchi et al., 2015), Hessian information enters our gradient estimation, which reads
(3.2) $\hat g_\mu(x) = \frac{1}{b\mu}\sum_{i=1}^{b}\big[f\big(x + \mu H^{-1/2}(x)\, u_i\big) - f(x)\big]\, H^{1/2}(x)\, u_i, \qquad u_i \sim \mathcal{N}(0, I_d),$
where $b$ is the batch size. At the point $x$, we sample $b$ points to obtain a good gradient estimate; this strategy is widely used in real applications such as the adversarial attack.
Finally, analogously to Newton-style algorithms, we update $x$ using the approximate Hessian and the estimated gradient as $x_{t+1} = x_t - \eta_t H^{-1}(x_t)\, \hat g_\mu(x_t)$. Combining this with Eqn. (3.2), we can write the algorithmic procedure of ZOHessAware as
$x_{t+1} = x_t - \frac{\eta_t}{b\mu}\sum_{i=1}^{b}\big[f(x_t + \mu v_i) - f(x_t)\big]\, v_i, \qquad v_i = H^{-1/2}(x_t)\, u_i \sim \mathcal{N}\big(0, H^{-1}(x_t)\big).$
We depict the detailed algorithmic procedure of ZOHessAware in Algorithm 1.
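Algorithm 1 is not reproduced here, but the following sketch illustrates the loop structure just described: the approximate Hessian is refreshed only every `m` iterations, the gradient is estimated in the spirit of Eqn. (3.2), and a Newton-style step is taken. The helper `approx_hessian` is a placeholder for any of the constructions in Section 4; all parameter names are illustrative.

```python
import numpy as np

def zo_hess_aware(f, x0, approx_hessian, T=100, m=10, b=20, mu=1e-3, eta=0.1, rng=None):
    """Sketch of the ZOHessAware loop; approx_hessian(f, x) must return a symmetric
    positive definite matrix."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    d = x.size
    H = np.eye(d)
    for t in range(T):
        if t % m == 0:                       # refresh the approximate Hessian every m steps
            H = approx_hessian(f, x)
        L = np.linalg.cholesky(H)            # factor H = L L^T once per step
        fx = f(x)
        g = np.zeros(d)
        for _ in range(b):                   # gradient estimate in the spirit of Eqn. (3.2)
            u = rng.standard_normal(d)
            v = np.linalg.solve(L.T, u)      # v has covariance H^{-1} (Cholesky-based choice)
            g += (f(x + mu * v) - fx) / mu * (L @ u)   # L u plays the role of H^{1/2} u
        g /= b
        x = x - eta * np.linalg.solve(H, g)  # Newton-style step x - eta * H^{-1} g
    return x
```

Since `E[(L u)(L^{-T} u)^T] = I`, the estimate `g` is an (approximately) unbiased gradient estimate, and the final `H^{-1}` solve yields the natural-gradient direction of Eqn. (1.4).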
In the rest of this section, we first give some important properties of the estimated gradient computed as in Eqn. (3.2). Then we analyze the local and global convergence properties of Algorithm 1, respectively. Finally, the query complexity is analyzed for the power-method based Hessian approximation.
3.1 Properties of Estimated Gradient
Now, in the following lemmas we list some important properties of $\hat g_\mu$ defined in Eqn. (3.2) that will be used in our analysis of the convergence rate of ZOHessAware. These lemmas are also of independent interest for zeroth-order algorithms.
Lemma 1.
Let $f$ be $L$-smooth. Then $\hat g_\mu$ defined in Eqn. (3.2) satisfies:
Lemma 2.
Let $f$ be $L$-smooth. Then $\hat g_\mu$ can be bounded as:
We also need the following bound.
Lemma 3.
Let $f$ be $L$-smooth. Then the following property holds:
3.2 Local Convergence
Now we begin to analyze the local convergence property of Algorithm 1. To achieve a fast convergence rate, the initial point should be close enough to the optimal point, and at the same time the Hessian should be well approximated.
Theorem 1.
Remark 1.
Note that the local convergence properties rely on condition (3.4). However, this condition may be violated at the next iteration if the descent direction is not good. This problem can be remedied by checking the function value: we discard the current iterate if its function value is larger than that of the previous one.
To achieve an $\epsilon$-accurate solution, both the first and the second term on the right-hand side of inequality (3.3) must be smaller than $\epsilon$; this determines the number of iterations ZOHessAware needs. In contrast, without second-order information, first-order methods with zeroth-order oracles need a number of iterations that scales with the condition number. Hence ZOHessAware has a faster convergence rate than conventional zeroth-order methods without Hessian information, especially when the Hessian can be well approximated by a low-rank matrix, in which case our algorithm shows great advantages.
3.3 Global Convergence
We analyze the global convergence property of Algorithm 1 in this section. To guarantee global convergence, we have to set a smaller step size than the one used in Theorem 1. We then have the following theorem.
Theorem 2.
In our analysis of global convergence, we use a fixed step size. One could also use a line search to obtain a better convergence property, at the cost of extra queries to the function value.
3.4 Query Complexity Analysis
In this section, we analyze the query complexity of ZOHAPW, which implements ZOHessAware with power-method based Hessian approximation. The power method only needs access to Hessian-vector products, which can be approximated by
(3.5) $\big[\widetilde H(x)\, v\big]_j = \frac{f(x + \mu e_j + \mu v) - f(x + \mu e_j) - f(x + \mu v) + f(x)}{\mu^2}, \qquad j = 1, \dots, d,$
where $[\cdot]_j$ denotes the $j$-th entry of a vector and $e_j$ is the $j$-th standard basis vector. Note that $\widetilde H(x)$ never needs to be represented explicitly, and it can be regarded as the Hessian with some small perturbations.
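As an illustration of how such a finite-difference Hessian-vector product can drive a power iteration without ever forming the Hessian, consider the sketch below. The coordinate-wise form and the normalization follow our reading of Eqn. (3.5); Algorithm 2 in the paper should be taken as authoritative, and for simplicity only the top eigenpair is computed here.

```python
import numpy as np

def zo_hessian_vector_product(f, x, v, mu=1e-3):
    """Finite-difference approximation of the Hessian-vector product H(x) v
    using only function values (cf. Eqn. (3.5)); costs d + 2 queries."""
    d = x.size
    fx = f(x)
    fxv = f(x + mu * v)
    hv = np.empty(d)
    for j in range(d):
        e = np.zeros(d)
        e[j] = 1.0
        hv[j] = (f(x + mu * e + mu * v) - f(x + mu * e) - fxv + fx) / mu**2
    return hv

def zo_power_method(f, x, num_iters=20, mu=1e-3, rng=None):
    """A few power iterations on the zeroth-order HVP; returns an estimate of
    the top eigenvalue and eigenvector of the Hessian at x."""
    rng = np.random.default_rng() if rng is None else rng
    v = rng.standard_normal(x.size)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(num_iters):
        hv = zo_hessian_vector_product(f, x, v, mu)
        lam = float(v @ hv)                    # Rayleigh-quotient estimate
        v = hv / (np.linalg.norm(hv) + 1e-12)
    return lam, v
```

Each Hessian-vector product costs on the order of $d$ function values, which is the source of the "at least $d$ queries" cost of ZOHAPW mentioned above.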
Given the above approximation of Hessian-vector products, we run the power method to obtain the largest eigenvalues and their corresponding eigenvectors. The detailed algorithmic procedure is depicted in Algorithm 2. The approximate Hessian computed by the power method then has the following properties.
Theorem 3.
Now we give the query complexity analysis of ZOHAPW. First, we choose the batch size $b$ and the smoothing parameter $\mu$. Then, in each iteration, ZOHAPW needs a number of queries proportional to $b$ to estimate the gradient and additional queries to construct the approximate Hessian. Combining this with the convergence rate given in Theorem 1, we obtain the following query complexity of ZOHAPW.
Theorem 4.
With the step size and batch size in Algorithm 1 set properly, the query complexity of ZOHAPW is as follows.
The approximate Hessian constructed by the power method can capture the principal low-rank information of the true Hessian. Empirically, the Hessian of a model function can be written as a low-rank matrix plus a perturbation matrix of small norm, with its main information lying in the low-rank part (Yuan et al., 2007; Bakker et al., 2018; Sainath et al., 2013). Hence, the query complexity of ZOHAPW is much smaller than that of zeroth-order methods without Hessian information indicated in the work of Nesterov & Spokoiny (2017).
Furthermore, we can use the result of the last iteration of Algorithm 1 as the input of Algorithm 2. Because this input is already close to the optimal solution, the corresponding cost of Algorithm 2 can be regarded as constant; that is, we can obtain an approximate Hessian with a reduced query cost. Hence, the query complexity of ZOHAPW can be further improved.
4 Structured Hessian Approximation
In this section, we provide two kinds of heuristic methods to construct an approximate Hessian. These methods take far fewer function-value queries than the power-method based Hessian approximation, which takes at least $d$ queries. The first method is based on Gaussian sampling and the second one is based on diagonalization.
4.1 GaussianSampling Based Hessian Approximation
In this section, we propose a novel method to approximate the Hessian of $f$ with a much lower query complexity. This method is based on Gaussian sampling, and we name ZOHessAware implemented with this Hessian approximation ZOHAGauss.
Our new method estimates the Hessian of the smoothed function $f_\mu$ defined in Eqn. (2.6). Since $\nabla^2 f_\mu(x)$ is close to $\nabla^2 f(x)$ when $\mu$ is small, a good approximation of $\nabla^2 f_\mu(x)$ also approximates $\nabla^2 f(x)$ well. In fact, we can bound the error between $\nabla^2 f_\mu(x)$ and $\nabla^2 f(x)$ as follows.
Using Gaussian sampling, we can approximate the Hessian of $f_\mu$ as follows:
(4.1) $\widehat H(x) = \frac{1}{b}\sum_{i=1}^{b}\frac{f(x + \mu u_i) + f(x - \mu u_i) - 2 f(x)}{2\mu^2}\, u_i u_i^\top + \alpha I, \qquad u_i \sim \mathcal{N}(0, I_d),$
where $\alpha I$ is a properly chosen regularizer that keeps $\widehat H(x)$ invertible. In the construction of $\widehat H(x)$ in Eqn. (4.1), we only take a small batch of points; that is, $b$ is small, even much smaller than the dimension $d$. Hence, the construction of such an $\widehat H(x)$ has a low query complexity.
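A minimal sketch of this construction, under our reconstruction of Eqn. (4.1): each sampled direction contributes a rank-one term weighted by a second-order finite difference, and a small ridge `alpha` keeps the result invertible. Parameter names are illustrative.

```python
import numpy as np

def gauss_sampling_hessian(f, x, b=10, mu=1e-3, alpha=1e-2, rng=None):
    """Gaussian-sampling Hessian approximation in the spirit of Eqn. (4.1).

    Uses 2*b + 1 function values, so b can be far smaller than the dimension d.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = x.size
    fx = f(x)
    H = np.zeros((d, d))
    for _ in range(b):
        u = rng.standard_normal(d)
        curv = (f(x + mu * u) + f(x - mu * u) - 2.0 * fx) / (2.0 * mu**2)
        H += curv * np.outer(u, u)   # rank-one term weighted by estimated curvature
    return H / b + alpha * np.eye(d)
```

Only $2b + 1$ function values are needed per refresh, so with a batch of a few tens the cost stays small even in high dimension.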
The approximate Hessian constructed as in Eqn. (4.1) has the following property.
Lemma 5.
Combining Lemmas 4 and 5, we obtain that if $\mu$ is small and the batch size $b$ in Eqn. (4.1) is large, then the matrix constructed as in Eqn. (4.1) approximates the Hessian very well. However, we cannot give the exact approximation precision of this matrix, measured as in Eqn. (3.1), when the batch size is much smaller than $d$. Thus, we do not give a theoretical convergence rate or query complexity for ZOHAGauss.
4.2 Diagonalization Based Hessian Approximation
We propose to use a diagonal matrix to approximate the Hessian. This method has been used in the optimization of deep neural networks (Kingma & Ba, 2015; Zeiler, 2012; Tieleman & Hinton, 2012) and online learning (Duchi et al., 2011).
First, we compute an approximate Hessian in the manner of ADAM (Kingma & Ba, 2015) as follows:
(4.2) $\hat h_t = \beta\, \hat h_{t-1} + (1-\beta)\, \hat g_t \odot \hat g_t, \qquad H_t = \operatorname{diag}\big(\sqrt{\hat h_t}\,\big),$
with $0 < \beta < 1$, where $\hat g_t \odot \hat g_t$ denotes the entrywise square of $\hat g_t$.
Second, we can also use the method of ADAGRAD (Duchi et al., 2011) to construct the approximate Hessian as
$H_t = \operatorname{diag}\Big(\sqrt{\textstyle\sum_{s=1}^{t} \hat g_s \odot \hat g_s}\,\Big).$
Other methods of constructing diagonal Hessian approximations, such as ADADELTA (Zeiler, 2012) used in training deep neural networks, can also be used in our framework.
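The following sketch maintains an ADAM-style diagonal statistic from the zeroth-order "natural gradient" estimates, in the spirit of Eqn. (4.2). Whether the square root is applied and how the diagonal is floored are our own illustrative choices rather than the paper's exact recipe.

```python
import numpy as np

class DiagonalHessian:
    """ADAM-style running estimate of a diagonal approximate Hessian (cf. Eqn. (4.2))."""

    def __init__(self, dim, beta=0.9, eps=1e-8):
        self.h = np.zeros(dim)   # exponential moving average of squared gradient estimates
        self.beta = beta
        self.eps = eps

    def update(self, g_hat):
        # g_hat is the current zeroth-order (natural) gradient estimate
        self.h = self.beta * self.h + (1.0 - self.beta) * g_hat * g_hat
        return self

    def as_diagonal(self):
        # diagonal of the approximate Hessian; square root and eps floor are illustrative
        return np.sqrt(self.h) + self.eps

    def inv_sqrt_times(self, u):
        # apply H^{-1/2} to a vector, as needed when sampling v = H^{-1/2} u
        return u / np.sqrt(self.as_diagonal())
```

Because the statistic is built from gradient estimates the algorithm computes anyway, this construction needs no additional function-value queries and only O(d) extra storage.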
These kinds of Hessian approximation are heuristic: we cannot derive an exact convergence rate of ZOHessAware with a diagonal Hessian approximation from Theorem 1. However, diagonal Hessian approximations have shown their power in training deep neural networks. Furthermore, the diagonal approximate Hessian has the important advantage that it needs no extra queries to the function value and has lower computational and storage cost.
Though the construction procedure of our diagonal Hessian approximation is the same as that of ADAM and ADAGRAD, there are some differences between these diagonal Hessians. First, the diagonal matrix that ADAM and ADAGRAD use as a preconditioner in training neural networks differs from our approximate Hessian. Second, in the construction of our diagonal Hessian, we use the 'natural gradient' defined in Eqn. (1.3), which contains Hessian information, rather than the ordinary gradient. Moreover, the information of the current diagonal Hessian is used in the estimation of the next 'natural gradient'; in ADAM and ADAGRAD, by contrast, the diagonal preconditioner does not affect the computation of gradients.
5 Experiments
In this section, we apply our Hessian-aware zeroth-order algorithms to black-box adversarial attacks. This is an important research topic in the security of deep learning because neural networks are widely used in image classification, yet current neural-network-based classifiers are susceptible to adversarial examples.
Our adversarial attack experiments include both targeted and untargeted attacks. The targeted attack aims to find an adversarial example of a given image $x_0$ that is misclassified into a targeted class label $t$. In this case, we minimize the following problem proposed in the work of Carlini & Wagner (2017):
(5.1) $\min_{x}\ \max\Big\{\max_{i \neq t}\log\big[Z(x)\big]_i - \log\big[Z(x)\big]_t,\ -\kappa\Big\} \quad \text{s.t.}\ \|x - x_0\|_\infty \le \epsilon,$
where $Z(x)$ is the logit-layer representation (logits) of the DNN for input $x$, such that $[Z(x)]_i$ represents the predicted probability that $x$ belongs to class $i$. The parameter $\kappa$ tunes the attack transferability, and we fix its value in our experiments. The constraint means that the adversarial image should stay close to the given image. The untargeted adversarial attack aims to find an example close to the given image with true label $y_0$ that is nevertheless misclassified by the neural network. In this case, we minimize the following function (Carlini & Wagner, 2017):
(5.2) $\min_{x}\ \max\Big\{\log\big[Z(x)\big]_{y_0} - \max_{i \neq y_0}\log\big[Z(x)\big]_i,\ -\kappa\Big\} \quad \text{s.t.}\ \|x - x_0\|_\infty \le \epsilon.$
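For reference, here is a sketch of the attack losses in Eqns. (5.1) and (5.2), written for a classifier that returns per-class probabilities. The clipping constant `kappa` and the use of log-probabilities follow the Carlini & Wagner formulation as we have reconstructed it here; the perturbation constraint is handled separately by the projection discussed in Section 5.1.

```python
import numpy as np

def targeted_cw_loss(probs, target, kappa=0.0, eps_log=1e-12):
    """Loss of Eqn. (5.1): push the target class's (log-)probability above all others."""
    logp = np.log(probs + eps_log)
    others = np.delete(logp, target)
    return max(others.max() - logp[target], -kappa)

def untargeted_cw_loss(probs, true_label, kappa=0.0, eps_log=1e-12):
    """Loss of Eqn. (5.2): push the true class's (log-)probability below some other class."""
    logp = np.log(probs + eps_log)
    others = np.delete(logp, true_label)
    return max(logp[true_label] - others.max(), -kappa)
```

Either loss reaches its minimum value of $-\kappa$ exactly when the attack has succeeded with the desired margin, which is why the attacks below stop as soon as misclassification is achieved.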
5.1 Algorithm Implementation
In the experiments, we implement ZOHessAware (Algorithm 1) with two different kinds of Hessian approximation. The first one is based on the Gaussian sampling described in Section 4.1, and we call it ZOHAGauss. The second one uses the diagonal Hessian approximation described in Section 4.2 with the ADAM-style update defined in Eqn. (4.2), and we name it ZOHADiag. We do not implement ZOHessAware with other kinds of diagonal Hessian approximation because those methods show similar performance.
Furthermore, because the adversarial problem is constrained, we modify the update step (7) of Algorithm 1 as follows:
(5.3) $x_{t+1} = \Pi_{\mathcal{X}}\big(x_t - \eta_t H_t^{-1}\hat g_t\big),$
where $\Pi_{\mathcal{X}}$ is the projection onto the feasible set $\mathcal{X} = \{x : \|x - x_0\|_\infty \le \epsilon\}$. Note that this projection is exact for ZOHADiag. For ZOHAGauss, however, the exact update should be computed by solving the following subproblem:
(5.4) $x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}}\ \big\|x - \big(x_t - \eta_t H_t^{-1}\hat g_t\big)\big\|_{H_t}^2.$
Nevertheless, the projection in Eqn. (5.3) performs well and is simple to implement, even though it is only an approximation of the true update computed by Eqn. (5.4).
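A sketch of the projected update in Eqn. (5.3): after the Hessian-preconditioned step, each pixel is clipped back into the allowed perturbation ball around the original image (and, as an extra illustrative detail, into the valid pixel range). For a diagonal preconditioner this clipping coincides with the exact weighted projection; for ZOHAGauss it is the approximation discussed above. Function and parameter names are illustrative.

```python
import numpy as np

def project_linf(x, x0, eps, lo=0.0, hi=1.0):
    """Project x onto {z : ||z - x0||_inf <= eps} intersected with the valid pixel box."""
    return np.clip(np.clip(x, x0 - eps, x0 + eps), lo, hi)

def constrained_step(x, x0, g_hat, H_inv, eta, eps):
    """Update of Eqn. (5.3): Hessian-preconditioned step followed by the projection."""
    return project_linf(x - eta * (H_inv @ g_hat), x0, eps)
```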
Because the objective function of the neural network model may be nonconvex, we implement the approximate Hessian in ZOHAGauss as follows:
$\widehat H(x) = \frac{1}{b}\sum_{i=1}^{b}\frac{\big|f(x + \mu u_i) + f(x - \mu u_i) - 2 f(x)\big|}{2\mu^2}\, u_i u_i^\top + \alpha I.$
This modification ensures that $\widehat H(x)$ is positive definite. Furthermore, we observe that $\widehat H(x)$ can be written as $G G^\top + \alpha I$ for some matrix $G \in \mathbb{R}^{d \times b}$. We can then compute $\widehat H^{-1/2}(x)$ as follows. First, we compute the thin SVD of $G$ as $G = U \Sigma V^\top$ with $U \in \mathbb{R}^{d \times b}$ and $\Sigma \in \mathbb{R}^{b \times b}$. Then
$\widehat H^{-1/2}(x) = U\big(\Sigma^2 + \alpha I\big)^{-1/2} U^\top + \alpha^{-1/2}\big(I - U U^\top\big).$
In practice, the value of $\alpha$ can be set as a small fraction of the scale of $G G^\top$ or tuned with a few trials.
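Assuming the low-rank-plus-ridge structure described above, the inverse square root needed by ZOHAGauss can be applied without forming any dense $d \times d$ matrix. The identity used below (acting with $(\Sigma^2 + \alpha I)^{-1/2}$ on the span of $U$ and with $\alpha^{-1/2}$ on its orthogonal complement) is a standard linear-algebra fact; the surrounding factorization is our reconstruction of the procedure, with illustrative names.

```python
import numpy as np

def inv_sqrt_apply(G, alpha, w):
    """Apply (G G^T + alpha*I)^{-1/2} to a vector w, where G is d x b with b << d."""
    U, s, _ = np.linalg.svd(G, full_matrices=False)   # thin SVD: G = U diag(s) V^T
    coeff = U.T @ w                                    # component of w inside span(U)
    inside = U @ (coeff / np.sqrt(s**2 + alpha))       # (Sigma^2 + alpha I)^{-1/2} on span(U)
    outside = (w - U @ coeff) / np.sqrt(alpha)         # alpha^{-1/2} on the complement
    return inside + outside

# Quick self-check against the dense eigendecomposition.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G, w, alpha = rng.standard_normal((50, 5)), rng.standard_normal(50), 0.1
    H = G @ G.T + alpha * np.eye(50)
    vals, vecs = np.linalg.eigh(H)
    dense = vecs @ ((vecs.T @ w) / np.sqrt(vals))
    assert np.allclose(inv_sqrt_apply(G, alpha, w), dense)
```

The cost is dominated by the thin SVD of the $d \times b$ matrix, which is cheap when the batch size $b$ is only a few tens.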
5.1.1 Descent Checking
To improve the success rate and the query efficiency, we introduce an important technique called Descent Checking, which was discussed in Remark 1 but is implemented here in a slightly different way. Descent Checking proceeds as follows. After obtaining the candidate iterate, we query its function value and check whether it is no larger than the function value of the current iterate. If so, we proceed to the next iteration. Otherwise, we discard the candidate, take extra samples, combine them with the existing ones to estimate a new gradient, and repeat until a new candidate achieves descent or the total sample size exceeds a threshold. If the threshold is exceeded, we accept the 'bad' candidate and go to the next iteration. Because the batch size is usually only several tens, the Descent Checking strategy does not incur many extra queries. As a result of Descent Checking, we can effectively filter out bad search directions, which leads to a higher attack success rate. We depict the detailed algorithmic procedure of ZOHessAware with Descent Checking in Algorithm 3. Accordingly, we name ZOHAGauss and ZOHADiag equipped with the Descent Checking strategy ZOHAGaussDC and ZOHADiagDC, respectively.
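A sketch of the Descent Checking loop just described: a candidate step is accepted only if it does not increase the objective; otherwise the gradient is re-estimated with a larger sample budget until either a descending step is found or the budget is exhausted, at which point the 'bad' step is accepted anyway. Function and parameter names are illustrative.

```python
def descent_checking_step(f, x, estimate_step, b=20, max_samples=100):
    """Descent Checking: retry with more samples until the step stops increasing f.

    estimate_step(x, num_samples) must return a candidate next iterate built from
    num_samples zeroth-order samples (e.g., one ZOHessAware update).
    """
    fx = f(x)
    num_samples = b
    x_next = estimate_step(x, num_samples)
    while f(x_next) > fx and num_samples < max_samples:
        num_samples += b     # re-estimate with a larger budget (the paper pools new
                             # samples with the existing ones)
        x_next = estimate_step(x, num_samples)
    return x_next            # accepted even if still 'bad' once the budget is hit
```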
5.2 Evaluation on MNIST
We evaluate the effectiveness of our attacks against a convolutional neural network (CNN) on the MNIST dataset. The network is composed of two 5×5 convolutional layers followed by two fully connected layers. We use 2×2 max-pooling after each convolutional layer and ReLU after every layer except the last one. The network is trained for 100 epochs with a learning rate starting at 0.1 and decayed by 0.5 every 20 epochs. We test the attack algorithms on images from the test set with a fixed limit on the perturbation, and we run each attack until an adversarial example is found or the number of queries exceeds the budget.
We report the experimental results in Table 1 and Figures 1 and 2; visualizations of the adversarial attacks are presented in Appendix E. We observe that ZOHessAware with its different implementations obtains much better success rates than the two state-of-the-art algorithms, which validates the effectiveness of second-order information of the model function in zeroth-order optimization. In particular, our DC-type algorithms obtain the best success rates on both targeted and untargeted adversarial attacks, much higher than those of ZOO and PGDNES, while taking fewer queries to the function value. Furthermore, on the untargeted attack, the DC-type algorithms take less than half the queries of PGDNES, which demonstrates the query efficiency of our Hessian-aware zeroth-order algorithms.
The comparison of the query-number distributions in Figures 1 and 2 shows that the Descent Checking technique reduces the number of queries effectively. For example, comparing ZOHAGauss with ZOHAGaussDC in Figure 2, the fraction of attacks that succeed within a small query budget is much higher for ZOHAGaussDC than for ZOHAGauss.
Algorithm  success rate  median queries  average queries  

targeted  ZOO (Chen et al., 2017)  42.13  15,200  17,091 
PGDNES (Ilyas et al., 2018)  44.19  7,300  10,496  
ZOHAGauss  50.03  3,712  6,649  
ZOHAGaussDC  56.14  2,941  6,246  
ZOHADiag  52.13  6,400  9,128  
ZOHADiagDC  55.56  3,936  7,239  
untargeted  ZOO (Chen et al., 2017)  77.18  13,300  16,390 
PGDNES (Ilyas et al., 2018)  81.55  5,800  8,567  
ZOHAGauss  85.06  3,612  5,000  
ZOHAGaussDC  88.80  2,152  3,629  
ZOHADiag  90.37  4,500  6,439  
ZOHADiagDC  91.90  2,460  4,352 
5.3 Evaluation on ImageNet
In this experiment, we use a pretrained ResNet50 for evaluation. The limit on the perturbation is fixed, and we randomly choose images from the ImageNet test set, running each attack until an adversarial example is found or the number of queries exceeds the budget. Furthermore, if the attack is targeted, the target label is chosen randomly from the other classes. In the experiment on ImageNet, instead of Eqn. (3.2), we use the following two-sided method to estimate the gradient:
$\hat g_\mu(x) = \frac{1}{2b\mu}\sum_{i=1}^{b}\big[f(x + \mu H^{-1/2} u_i) - f(x - \mu H^{-1/2} u_i)\big]\, H^{1/2} u_i.$
Such an estimate has the same expectation as the one defined in Eqn. (3.2) but performs better in this experiment. Accordingly, the natural gradient is modified in the same way.
We report the results in Table 2 and Figures 3 and 4; visualizations of the adversarial attacks are presented in Figures 6 and 7 of Appendix E. We observe that our algorithms take far fewer queries than ZOO and PGDNES. For the untargeted attack, the median number of queries of ZOHADiagDC is only a small fraction of that of ZOO and well below that of PGDNES, at the same success rate. For the targeted attack, all algorithms take many more queries than in the untargeted case, but our algorithms still show great query efficiency. In particular, both ZOHADiag and ZOHADiagDC achieve a 100% attack success rate, higher than that of PGDNES, while using substantially fewer queries. Though ZOO also obtains a 100% success rate, it takes several times as many queries as ZOHADiag and ZOHADiagDC.
Algorithm  success rate  median queries  average queries  

targeted  ZOO (Chen et al., 2017)  100  39,100  45,822 
PGDNES (Ilyas et al., 2018)  99.37  11,270  17,435  
ZOHAGauss  99.62  8,748  12,257  
ZOHAGaussDC  100  8,588  11,770  
ZOHADiag  100  7,400  9,123  
ZOHADiagDC  100  6,273  8,574  
untargeted  ZOO (Chen et al., 2017)  100  12,700  14,199 
PGDNES (Ilyas et al., 2018)  100  1,500  2,283  
ZOHAGauss  100  1,212  2,259  
ZOHAGaussDC  100  1,124  1,959  
ZOHADiag  100  800  1,149  
ZOHADiagDC  100  561  945 
5.4 Discussion
From the above two experiments, we can draw several important insights. First, the comparison of attack success rates between the two deep learning models indicates that a deeper or more complicated neural network is potentially more vulnerable to the adversarial attack. On ResNet50, all algorithms achieve success rates over 99%, whereas the attack success rates on the simple convolutional network with only a few layers are much lower.
Second, the large gap in success rates between our Hessian-aware zeroth-order methods and the two state-of-the-art algorithms on MNIST suggests that Hessian information brings great advantages on hard adversarial attack problems.
Third, our algorithms with Descent Checking perform better than the ones without it. The experimental results on MNIST show that the Descent Checking strategy effectively improves the attack success rate for both targeted and untargeted attacks. At the same time, Descent Checking is an effective way to reduce the query cost, as can be observed from the distributions of the number of queries in Figures 1–4.
6 Conclusion
In this paper, we propose a novel zeroth-order algorithmic framework called ZOHessAware which exploits the second-order information of the model function. Thanks to this information, ZOHessAware achieves a faster convergence rate and a lower query complexity than existing works without Hessian information. We also propose several methods to capture the principal information of the Hessian efficiently. Experiments on the black-box adversarial attack show that our ZOHessAware algorithms improve the attack success rate and reduce the query cost effectively, which empirically validates both the usefulness of Hessian information in zeroth-order optimization and our theoretical analysis. We also propose a novel technique called Descent Checking which empirically improves the attack success rate and reduces the query cost.
References
 Bach, F. & Perchet, V. (2016). Highly-smooth zeroth order online optimization. In Conference on Learning Theory (pp. 257–283).
 Bakker, C., Henry, M. J., & Hodas, N. O. (2018). Understanding and exploiting the low-rank structure of deep networks.
 Balcan, M.-F., Du, S. S., Wang, Y., & Yu, A. W. (2016). An improved gap-dependency analysis of the noisy power method. In Conference on Learning Theory (pp. 284–309).
 Bergstra, J. S., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems (pp. 2546–2554).
 Carlini, N. & Wagner, D. (2017). Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP) (pp. 39–57). IEEE.
 Chen, P.-Y., Zhang, H., Sharma, Y., Yi, J., & Hsieh, C.-J. (2017). ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security (pp. 15–26). ACM.
 Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul), 2121–2159.
 Duchi, J. C., Jordan, M. I., Wainwright, M. J., & Wibisono, A. (2015). Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5), 2788–2806.
 Fang, C., Li, C. J., Lin, Z., & Zhang, T. (2018). SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems (pp. 686–696).
 Ghadimi, S. & Lan, G. (2013). Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4), 2341–2368.
 Hansen, N. & Ostermeier, A. (2001). Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2), 159–195.
 Hu, W. & Tan, Y. (2017). Generating adversarial malware examples for black-box attacks based on GAN. arXiv preprint arXiv:1702.05983.
 Ilyas, A., Engstrom, L., Athalye, A., & Lin, J. (2018). Black-box adversarial attacks with limited queries and information. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research (pp. 2137–2146). PMLR.
 Kingma, D. P. & Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
 Liu, S., Kailkhura, B., Chen, P.-Y., Ting, P., Chang, S., & Amini, L. (2018). Zeroth-order stochastic variance reduction for nonconvex optimization. In Advances in Neural Information Processing Systems 31 (pp. 3731–3741). Curran Associates, Inc.
 Magnus, J. R. (1978). The moments of products of quadratic forms in normal variables. Statistica Neerlandica, 32(4), 201–210.
 Matyas, J. (1965). Random optimization. Automation and Remote Control, 26(2), 246–253.
 Nesterov, Y. & Spokoiny, V. (2017). Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2), 527–566.
 Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., & Swami, A. (2017). Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security (pp. 506–519). ACM.
 Sainath, T. N., Kingsbury, B., Sindhwani, V., Arisoy, E., & Ramabhadran, B. (2013). Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6655–6659). IEEE.
 Salimans, T., Ho, J., Chen, X., Sidor, S., & Sutskever, I. (2017). Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864.
 Shamir, O. (2017). An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. Journal of Machine Learning Research, 18(52), 1–11.
 Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems (pp. 2951–2959).
 Tieleman, T. & Hinton, G. (2012). Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2), 26–31.
 Wierstra, D., Schaul, T., Glasmachers, T., Sun, Y., Peters, J., & Schmidhuber, J. (2014). Natural evolution strategies. The Journal of Machine Learning Research, 15(1), 949–980.
 Yuan, M., Ekici, A., Lu, Z., & Monteiro, R. (2007). Dimension reduction and coefficient estimation in multivariate linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(3), 329–346.
 Zeiler, M. D. (2012). ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.
Appendix A Proof of Section 3.1
In this section, we will prove three properties of the estimated gradient defined in Eqn. (3.2). Before that, we first list some important lemmas related to the Gaussian distribution that will be used in our proof.
A.1 Important Lemmas
Lemma 6 (Nesterov & Spokoiny, 2017).
If $f$ is $L$-smooth, then we have
Lemma 7 (Nesterov & Spokoiny, 2017).
If $f$ is $L$-smooth, then
If the Hessian of $f$ is Lipschitz continuous, then we can further guarantee that
Lemma 8 (Nesterov & Spokoiny, 2017).
Let $u$ be drawn from $\mathcal{N}(0, I_d)$; then we have the following bound:
Then we give results on the moments of products of quadratic forms of normally distributed vectors.
Lemma 9 (Magnus, 1978).
Let $A$ and $B$ be two symmetric matrices, and let $u$ have the Gaussian distribution $u \sim \mathcal{N}(0, I_d)$. Then the expectations of the corresponding quadratic forms and of their products are: