lmt
Public code for a paper "Lipschitz-Margin Training: Scalable Certification of Perturbation Invariance for Deep Neural Networks."
The high sensitivity of neural networks to malicious perturbations of their inputs raises security concerns. We aim to ensure perturbation invariance in their predictions. However, prior work requires strong assumptions on network structures and massive computational costs, so its applicability is limited. In this paper, based on Lipschitz constants and prediction margins, we present a widely applicable and computationally efficient method to lower-bound the size of adversarial perturbations that can deceive networks. Moreover, we propose an efficient training procedure to strengthen perturbation invariance. In experimental evaluations, our method provided strong guarantees even for large networks.
Deep neural networks are highly vulnerable to intentionally created small perturbations of their inputs [36], called adversarial perturbations, which raise serious security concerns in applications such as self-driving cars. Adversarial perturbations in object recognition systems have been studied intensively [36; 12; 7], and we mainly target object recognition systems.
One approach to defending against adversarial perturbations is to mask gradients. Defensive distillation [29], which distills networks into themselves, is one of the most prominent methods. However, Carlini and Wagner [7] showed that we can create adversarial perturbations that deceive networks trained with defensive distillation. Input transformations and detections [40; 14] are other defense strategies, although they can be bypassed [6]. Adversarial training [12; 20; 23], which injects adversarially perturbed data into training data, is a promising approach. However, there is a risk of overfitting to attacks [20; 37]. Many other heuristics have been developed to make neural networks insensitive to small perturbations of inputs. However, recent work has repeatedly succeeded in creating adversarial perturbations for networks protected with heuristics from the literature [1]. For instance, Athalye et al. [2] reported that many ICLR 2018 defense papers did not adequately protect networks soon after the announcement of their acceptance. This indicates that even protected networks can be unexpectedly vulnerable, which is a crucial problem for this line of research because the primary concern of these studies is security threats.

The literature indicates the difficulty of defense evaluations. Thus, our goal is to ensure lower bounds on the size of adversarial perturbations that can deceive networks for each input. Many existing approaches, which we cover in Sec. 2, are applicable only to small networks with special structures. On the other hand, common networks used in evaluations of defense methods are wide, which makes prior methods computationally intractable, and complicated, which makes some prior methods inapplicable. This work tackles this problem, and we provide a widely applicable yet highly scalable method that ensures large guarded areas for a wide range of network structures.
The existence of adversarial perturbations indicates that the slope of the loss landscape around data points is large, and we aim to bound the slope. An intuitive way to measure the slope is to calculate the size of the gradient of a loss with respect to an input. However, it is known to provide a false sense of security [37; 7; 2]. Thus, we require upper bounds of the gradients. The next candidate is to calculate a local Lipschitz constant, that is, the maximum size of the gradients around each data point. Even though this can provide certification, calculating the Lipschitz constant is computationally hard. We can obtain it only for small networks, or obtain an approximation of it, which cannot provide certification [15; 39; 11]. A coarser but available alternative is to calculate the global Lipschitz constant. However, even for small networks, prior work could provide only certifications that are orders of magnitude smaller than the usual discretization of images [36; 30]. We show that we can overcome such looseness with our improved and unified bounds and a newly developed training procedure. The training procedure is more general and effective than previous approaches [8; 41]. We empirically observed that the training procedure also improves robustness against current attack methods.
In this section, we review prior work on providing certifications for networks. One popular approach is to restrict the discussion to networks using ReLU [25] exclusively as their activation functions and to reduce the verification problem to other well-studied problems. Bastani et al. [4] encoded networks as linear programs, Katz et al. [17, 16] reduced the problem to Satisfiability Modulo Theory, and Raghunathan et al. [31] encoded networks as semidefinite programs. However, these formulations demand prohibitive computational costs and their applications are limited to small networks. As a relatively tractable method, Kolter and Wong [19] bounded the influence of $\ell_\infty$-norm bounded perturbations using convex outer-polytopes. However, it is still hard to scale this method to deep or wide networks. Another approach is to assume smoothness of networks and losses. Hein and Andriushchenko [15] focused on local Lipschitz constants of neural networks around each input. However, the guarantee is provided only for networks with one hidden layer. Sinha et al. [34] proposed a certifiable procedure of adversarial training. However, smoothness constants, which their certification requires, are usually unavailable or infinite. As concurrent work, Ruan et al. [32] proposed another algorithm to certify robustness in a more scalable manner than previous approaches. We note that our algorithm is still significantly faster.

We define the threat model, our defense goal, and basic terminologies.
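Let $X$ be a data point from data distribution $D$ and its true label be $t_X \in \{1, \dots, K\}$, where $K$ is the number of classes. Attackers create a new data point similar to $X$ that deceives defenders' classifiers. In this paper, we consider the $\ell_2$-norm as a similarity measure between data points because it is one of the most common metrics [24; 7].

Let $c$ be a positive constant and $F$ be a classifier. We assume that the output of $F$ is a vector $F(X)$ and the classifier predicts the label with $\arg\max_{i \in \{1, \dots, K\}} F(X)_i$, where $F(X)_i$ denotes the $i$-th element of $F(X)$. Now, we define an adversarial perturbation $\epsilon_{F,X}$ as follows.

$$\epsilon_{F,X} \in \left\{\epsilon \mid \|\epsilon\|_2 < c \ \wedge\ t_X = \arg\max_i F(X)_i \ \wedge\ t_X \neq \arg\max_i F(X+\epsilon)_i\right\}$$

We define a guarded area for a network $F$ and a data point $X$ as a hypersphere with a radius $c$ that satisfies the following condition:

$$\forall \epsilon,\ \left(\|\epsilon\|_2 < c \Rightarrow t_X = \arg\max_i F(X+\epsilon)_i\right). \tag{1}$$

This condition (1) is always satisfied when $c = 0$. Our goal is to ensure that neural networks have larger guarded areas for data points in data distribution.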
In this section, we first describe basic concepts for calculating the provably guarded area defined in Sec. 3. Next, we outline our training procedure to enlarge the guarded area.
We explain how to calculate the guarded area using the Lipschitz constant. If $L_F$ bounds the Lipschitz constant of neural network $F$, we have the following from the definition of the Lipschitz constant:

$$\|F(X) - F(X+\epsilon)\|_2 \le L_F \|\epsilon\|_2.$$

Note that if the last layer of $F$ is softmax, we only need to consider the subnetwork before the softmax layer. We introduce the notion of prediction margin $M_{F,X}$:

$$M_{F,X} := F(X)_{t_X} - \max_{i \neq t_X} F(X)_i.$$

This margin has been studied in relationship to generalization bounds [21; 3; 28]. Using the prediction margin, we can prove that the following proposition holds.

Proposition 1.

$$\left(M_{F,X} \ge \sqrt{2} L_F \|\epsilon\|_2\right) \Rightarrow \left(M_{F,X+\epsilon} \ge 0\right). \tag{2}$$

The details of the proof are in Appendix A. Thus, perturbations smaller than $M_{F,X} / (\sqrt{2} L_F)$ cannot deceive the network $F$ for a data point $X$. Proposition 1 sees network $F$ as a function with a multidimensional output. This connects the Lipschitz constant of a network, which has been discussed in Szegedy et al. [36] and Cisse et al. [8], with the absence of adversarial perturbations. If we cast the problem to a set of functions with a one-dimensional output, we can obtain a variant of Prop. 1. Assume that the last layer before softmax in $F$ is a fully-connected layer and $w_i$ is the $i$-th row of its weight matrix. Let $L_{\mathrm{sub}}$ be a Lipschitz constant of a sub-network of $F$ before the last fully-connected layer. We obtain the following proposition directly from the definition of the Lipschitz constant [15; 39].

Proposition 2.

$$\forall i,\ \left(F(X)_{t_X} - F(X)_i \ge L_{\mathrm{sub}} \|\epsilon\|_2 \|w_{t_X} - w_i\|_2\right) \Rightarrow \left(M_{F,X+\epsilon} \ge 0\right). \tag{3}$$
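As an illustration of how Proposition 1 turns a margin and a Lipschitz bound into a certified radius, the following is a minimal NumPy sketch. It assumes the proposition takes the form "margin $\ge \sqrt{2}$ times Lipschitz bound times perturbation size implies an unchanged prediction"; the function names `prediction_margin` and `certified_radius` and the example numbers are hypothetical and are not taken from the released code.

```python
import numpy as np


def prediction_margin(logits, label):
    """Margin between the true-class logit and the largest other logit."""
    logits = np.asarray(logits, dtype=float)
    others = np.delete(logits, label)
    return float(logits[label] - others.max())


def certified_radius(logits, label, lipschitz_bound):
    """Lower bound on the l2 size of a perturbation that can change the prediction.

    Assumes the condition margin >= sqrt(2) * L * ||eps||_2 (Proposition 1).
    Misclassified points (negative margin) get radius 0.
    """
    margin = prediction_margin(logits, label)
    return max(margin, 0.0) / (np.sqrt(2.0) * lipschitz_bound)


# Hypothetical logits for a 3-class problem and a hypothetical Lipschitz bound.
logits = np.array([2.0, 5.0, 1.0])
radius = certified_radius(logits, label=1, lipschitz_bound=3.0)
wrong_class_radius = certified_radius(logits, label=0, lipschitz_bound=3.0)
```

Because the radius depends only on the logits and a precomputed bound, it can be evaluated for each input with a single forward pass.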
To ensure non-trivial guarded areas, we propose a training procedure that enlarges the provably guarded area.
To encourage conditions Eq.(2) or Eq.(3) to be satisfied with the training data, we convert them into losses. We take Eq.(2) as an example. To make Eq.(2) satisfied for perturbations with $\ell_2$-norm up to $c$, we require the following condition.

$$\forall i \neq t_X,\ \left(F(X)_{t_X} \ge F(X)_i + \sqrt{2} c L_F\right). \tag{4}$$

Thus, we add $\sqrt{2} c L_F$ to all elements in logits except for the index corresponding to $t_X$. In training, we calculate an estimate of the upper bound of $L_F$ in a computationally efficient and differentiable way and use it instead of $L_F$. The hyperparameter $c$ is specified by users. We call this training procedure Lipschitz-margin training (LMT). The algorithm is provided in Figure 2. Using Eq.(3) instead of Eq.(2) is straightforward. Small additional techniques to make LMT more stable are given in Appendix E.

From the former paragraph, we can see that LMT maximizes the number of training data points that have larger guarded areas than $c$, as long as the original training procedure maximizes the number of them that are correctly classified. We experimentally evaluate its generalization to test data in Sec. 6. The hyperparameter $c$ is easy to interpret and easy to tune. The larger $c$ we specify, the stronger invariant property the trained network will have. However, this does not mean that the trained network always has high accuracy against noisy examples. To see this, consider the case where $c$ is extremely large. In such a case, constant functions become an optimal solution. We can interpret LMT as an interpolation between the original function, which is highly expressive but extremely non-smooth, and constant functions, which are robust and smooth.
A main computational overhead of LMT is the calculation of the Lipschitz constant. We show in Sec. 5 that its computational cost is almost the same as increasing the batch size by one. Since we typically have tens or hundreds of samples in a mini-batch, this cost is negligible.
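The logit modification used by LMT can be sketched as follows. This is a simplified NumPy illustration, assuming the added constant is $\sqrt{2}$ times $c$ times the Lipschitz bound as in Eq.(4); here the bound is a plain number, whereas in actual training it would be a differentiable estimate that is part of the computation graph. The names `lmt_logits` and `softmax_cross_entropy` are hypothetical.

```python
import numpy as np


def lmt_logits(logits, labels, lipschitz_bound, c):
    """Add sqrt(2) * c * L to every logit except the true-class one."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits + np.sqrt(2.0) * c * lipschitz_bound
    rows = np.arange(logits.shape[0])
    shifted[rows, labels] = logits[rows, labels]  # true-class logit is left unchanged
    return shifted


def softmax_cross_entropy(logits, labels):
    """Mean softmax cross-entropy, computed in a numerically stable way."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(labels)), labels].mean())


# Hypothetical mini-batch of two samples with three classes.
logits = np.array([[4.0, 1.0, 0.0], [0.5, 2.0, 1.5]])
labels = np.array([0, 1])

modified = lmt_logits(logits, labels, lipschitz_bound=2.0, c=0.5)
plain_loss = softmax_cross_entropy(logits, labels)
lmt_loss = softmax_cross_entropy(modified, labels)
```

The modified loss is larger than the plain loss unless the margin already exceeds the added constant by a wide gap, so minimizing it pushes the network toward larger margins, a smaller Lipschitz bound, or both.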
In this section, we first describe a method to calculate upper bounds of the Lipschitz constant. We bound the Lipschitz constant of each component and recursively calculate the overall bound. The concept is from Szegedy et al. [36]. While prior work required separate analysis for slightly different components [36; 30; 8; 32], we provide a more unified analysis. Furthermore, we provide a fast calculation algorithm for both the upper bounds and their differentiable approximation.
We describe the relationships between the Lipschitz constants and some functionals that frequently appear in deep neural networks: composition, addition, and concatenation. Let $f$ and $g$ be functions with Lipschitz constants bounded by $L_1$ and $L_2$, respectively. The Lipschitz constant of output for each functional is bounded as follows:

composition $f \circ g$: $L_1 \cdot L_2$, addition $f + g$: $L_1 + L_2$, concatenation $(f, g)$: $\sqrt{L_1^2 + L_2^2}$.
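The three rules can be written as small helper functions. The sketch below assumes the standard bounds (product for composition, sum for addition, root of the sum of squares for concatenation of two functions of the same input); the helper names and the layer bounds in the residual-block example are hypothetical.

```python
import math


def lipschitz_composition(l1, l2):
    """Bound for f(g(x)): the product of the two bounds."""
    return l1 * l2


def lipschitz_addition(l1, l2):
    """Bound for f(x) + g(x): the sum of the two bounds."""
    return l1 + l2


def lipschitz_concatenation(l1, l2):
    """Bound for (f(x), g(x)): the root of the sum of squared bounds."""
    return math.sqrt(l1 ** 2 + l2 ** 2)


# Residual block x -> x + conv2(relu(conv1(x))) with hypothetical layer bounds.
conv1_bound, relu_bound, conv2_bound = 1.5, 1.0, 2.0
branch_bound = lipschitz_composition(
    lipschitz_composition(conv1_bound, relu_bound), conv2_bound
)
block_bound = lipschitz_addition(1.0, branch_bound)  # the identity path has constant 1
concat_bound = lipschitz_concatenation(3.0, 4.0)
```

Applying these helpers layer by layer gives a bound for a whole network from per-layer bounds.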
We describe bounds of the Lipschitz constants of major layers commonly used in image recognition tasks. We note that we can ignore any bias parameters because they do not change the Lipschitz constants of each layer.
Fully-connected, convolutional and normalization layers are typically linear operations at inference time. For instance, batch-normalization is a multiplication of a diagonal matrix whose $i$-th element is $\gamma_i / \sqrt{\sigma_i^2 + \epsilon}$, where $\gamma_i$, $\sigma_i^2$, and $\epsilon$ are a scaling parameter, running average of variance, and a constant, respectively. Since the composition of linear operators is also linear, we can jointly calculate the Lipschitz constant of some common pairs of layers such as convolution + batch-normalization. By using the following theorem, we propose a more unified algorithm than Yoshida and Miyato [41].

Theorem 1. Let $f$ be a linear operator from $\mathbb{R}^n$ to $\mathbb{R}^m$, where $n < \infty$ and $m < \infty$. We initialize a vector $u \in \mathbb{R}^n$ from a Gaussian with zero mean and unit variance. When we iteratively apply the following update formula, the $\ell_2$-norm of $u$ converges to the square of the operator norm of $f$ in terms of the $\ell_2$-norm, almost surely.

$$u \leftarrow u / \|u\|_2, \quad v \leftarrow f(u), \quad u \leftarrow \frac{1}{2} \frac{\partial \|v\|_2^2}{\partial u}.$$
The proof is found in Appendix C.1. The algorithm for training time is provided in Figure 2. At training time, we need only one iteration of the above update formula, as in Yoshida and Miyato [41]. Note that for estimation of the operator norm for a forward path, we do not need to use gradients. In a convolutional layer, for instance, we do not need another convolution operation or transposed convolution. We only need to increase the batch size by one. The wide availability of our calculation method will be especially useful when linear operators more complicated than the usual convolution appear in the future. Since we want to ensure that the calculated bound is an upper bound for certification, we can use the following theorem.
Let and be an operator norm of a function in terms of the -norm and the -norm of the vector at the -th iteration, where each element
is initialized by a Gaussian with zero mean and unit variance. With probability higher than
, the error between and is smaller than , where .

The proof is in Appendix C.3, and it mostly follows Friedman [10]. If we use a large batch for the power iteration, the probability becomes exponentially closer to one. We can also use singular value decomposition as another way for accurate calculation. Despite its simplicity, the obtained bound for convolutional layers is much tighter than the previous results in Peck et al. [30] and Cisse et al. [8], and that for normalization layers is novel. We numerically confirm the improvement of bounds in Sec. 6.

First, we have the following theorem.
Define $f(x) = (f_1(x_1), f_2(x_2), \dots, f_\Lambda(x_\Lambda))$, where $x_i \subset x$ and $\|f_i\|_2 \le L$ for all $i$. Then,

$$\|f\|_2 \le \sqrt{n}\, L,$$

where $n := \max_j |\{i \mid x_j \in x_i\}|$ and $x_j$ is the $j$-th element of $x$.

The proof, whose idea comes from Cisse et al. [8], is found in Appendix D.1. The exact form of $n$ in the pooling and convolutional layers is given in Appendix D.3. The assumption in Theorem 3 holds for most layers of networks for image recognition tasks, including pooling layers, convolutional layers, and activation functions. Careful counting of $n$ leads to improved bounds on the relationship between the Lipschitz constant of a convolutional layer and the spectral norm of its reshaped kernel from the previous result [8].

Let $\|\mathrm{Conv}\|_2$ be the operator norm of a convolutional layer in terms of the $\ell_2$-norm, and $\|W'\|_2$ be the spectral norm of a matrix where the kernel of the convolution is reshaped into a matrix with the same number of rows as its output channel size. Assume that the width and the height of its input before padding are larger than or equal to those of the kernel. The following inequality holds.

$$\|W'\|_2 \le \|\mathrm{Conv}\|_2 \le \sqrt{n}\, \|W'\|_2,$$

where $n$ is a constant independent of the weight matrix.
With recursive computation using the bounds described in the previous sections, we can calculate an upper bound of the Lipschitz constants of the whole network in a differentiable manner with respect to network parameters. At inference time, calculation of the Lipschitz constant is required only once.
In calculations at training time, there may be some notable differences in the Lipschitz constants. For example, $\sigma_i$ in a batch normalization layer depends on its input. However, we empirically found that calculating the Lipschitz constants using the same bound as at inference time effectively regularizes the Lipschitz constant. This lets us deal with batch-normalization layers, which prior work ignored despite their impact on the Lipschitz constant [8; 41].
In this section, we show the results of numerical evaluations. Since our goal is to create networks with stronger certification, we evaluated the following three points.
Our bounds of the Lipschitz constants are tighter than previous ones (Sec. 6.1).
Our calculation technique of the guarded area and LMT are available for modern large and complex networks (Sec. 6.2).
We also evaluated the robustness of trained networks against current attacks and confirmed that LMT robustifies networks (Secs. 6.1 and 6.2). For calculating the Lipschitz constant and guarded area, we used Prop. 2. Detailed experimental setups are available in Appendix F. Our code is available at https://github.com/ytsmiling/lmt.
We numerically validated improvements of bounds for each component and numerically analyzed the tightness of overall bounds of the Lipschitz constant. We also see the non-triviality of the provably guarded area. We used the same network and hyperparameters as Kolter and Wong [19].
We evaluated the difference of bounds in convolutional layers in networks trained using a usual training procedure and LMT. Figure 3 shows comparisons between the bounds in the second convolutional layer. It also shows the difference of bounds in pooling layers, which does not depend on training methods. We can confirm improvement in each bound. This results in significant differences in upper-bounds of the Lipschitz constants of the whole networks.
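Let $L$ be an upper bound of the Lipschitz constant calculated by our method. Let $L_{\mathrm{local}}$ and $L_{\mathrm{global}}$ be the local and global Lipschitz constants. Between them, we have the following relationship.

$$\underbrace{\frac{\text{Margin}}{L}}_{(A)} \overset{(i)}{\le} \underbrace{\frac{\text{Margin}}{L_{\mathrm{global}}}}_{(B)} \overset{(ii)}{\le} \underbrace{\frac{\text{Margin}}{L_{\mathrm{local}}}}_{(C)} \overset{(iii)}{\le} \underbrace{\text{Smallest adversarial perturbation}}_{(D)} \tag{5}$$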
We analyzed errors in inequalities (i) – (iii). We define an error of (i) as (B)/(A) and others in the same way. We used lower bounds of the local and global Lipschitz constants calculated from the maximum size of gradients found. A detailed procedure for the calculation is explained in Appendix F.1.3. For the generation of adversarial perturbations, we used DeepFool [24]. Note that (iii) does not necessarily hold because we calculated mere lower bounds of Lipschitz constants in (B) and (C). We analyzed inequality (5) in an unregularized model, an adversarially trained (AT) model with the -iteration C&W attack [7], and an LMT model. Figure 4 shows the result. With an unregularized model, estimated error ratios in (i) – (iii) were , , and respectively. This shows that even if we could precisely calculate the local Lipschitz constant for each data point with possibly substantial computational costs, inequality (iii) becomes more than times looser than the size of adversarial perturbations found by DeepFool. In an AT model, the discrepancy became more than 2.4. On the other hand, in an LMT model, estimated error ratios in (i) – (iii) were , , and respectively. The overall median error between the size of found adversarial perturbations and the provably guarded area was . This shows that the trained network became smooth and Lipschitz-constant-based certifications became significantly tighter when we used LMT. This also resulted in better defense against attacks. For reference, the median of found adversarial perturbations for an unregularized model was , while the median of the size of the provably guarded area was in an LMT model.
We discuss the size of the provably guarded area, which is practically more interesting than tightness. While our algorithm has clear advantages on computational costs and broad applicability over prior work, guarded areas that our algorithm ensured were non-trivially large. In a naive model, the median of the size of perturbations we could certify invariance was . This means changing several pixels by one in usual – scale cannot change their prediction. Even though this result is not so tight as seen in the previous paragraph, this is significantly larger than prior computationally cheap algorithm proposed by Peck et al. [30]. The more impressive result was obtained in models trained with LMT, and the median of the guarded area was . This corresponds to in the norm. Kolter and Wong [19], which used the same network and hyperparameters as ours, reported that they could defend from perturbations with its -norm bounded by for more than examples. Thus, in the -norm, our work is inferior, if we ignore their limited applicability and massive computational demands. However, our algorithm mainly targets the -norm, and in that sense, the guarded area is significantly larger. Moreover, for more than half of the test data, we could ensure that there are no one-pixel attacks [35]. To confirm the non-triviality of the obtained certification, we have some examples of provably guarded images in Figure 5.
We evaluated our method with a larger and more complex network to confirm its broad applicability and scalability. We used 16-layered wide residual networks [42] with width factor on the SVHN dataset [27] following Cisse et al. [8]. To the best of our knowledge, this is the largest network for which certification has been considered. We compared LMT with a naive counterpart, which uses weight decay, spectral norm regularization [41], and Parseval networks.
For a model trained with LMT, we could ensure larger guarded areas than for more than half of test data. This order of certification was only provided for small networks in prior work. In models trained with other methods, we could not provide such strong certification. There are mainly two differences between LMT and other methods. First, LMT enlarges prediction margins. Second, LMT regularizes batch-normalization layers, while in other methods, batch-normalization layers cancel the regularization on weight matrices and kernels of convolutional layers. We also conducted additional experiments to provide further certification for the network. First, we replaced convolution with kernel size 1 and stride 2 with average-pooling with size 2 and convolution with kernel size 1. Then, we used LMT with . As a result, while the accuracy dropped to , the median size of the provably guarded areas was larger than . This means that changing elements of input by in usual image scales (–) cannot cause error over for the trained network. These certifications are non-trivial, and to the best of our knowledge, these are the best certifications provided for a network this large.

We evaluated the robustness of trained networks against adversarial perturbations created by the current attacks. We used the C&W attack [7] with 100 iterations and no random restart for evaluation. Table 1 summarizes the results. While LMT slightly reduced accuracy, it largely improved robustness compared to other regularization-based techniques. Since these techniques are independent of other techniques such as adversarial training or input transformations, further robustness can be expected when LMT is combined with them.
Size of perturbations | ||||
---|---|---|---|---|
Clean | ||||
weight decay | ||||
Parseval network | ||||
spectral norm regularization | ||||
LMT |
To ensure perturbation invariance of a broad range of networks with a computationally efficient procedure, we achieved the following.
We offered general and tighter spectral bounds for each component of neural networks.
We introduced a general and fast calculation algorithm for the upper bound of operator norms and its differentiable approximation.
We proposed a training algorithm which effectively constrains networks to be smooth, and achieves better certification and robustness against attacks.
We successfully provided non-trivial certification for small to large networks with negligible computational costs.
We believe that this work will serve as an essential step towards both certifiable and robust deep learning models. Applying developed techniques to other Lipschitz-concerned domains such as training of GAN or training with noisy labels is future work.
The authors thank Takeru Miyato for valuable feedback. YT was supported by Toyota/Dwango AI scholarship. IS was supported by KAKENHI 17H04693. MS was supported by KAKENHI 17H00757.
Proceedings of the 35th International Conference on Machine Learning, pages 274–283, 2018.
Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 3–14, 2017a.
Error bounds on the power method for determining the largest eigenvalue of a symmetric, positive definite matrix. Linear Algebra and its Applications, 280(2):199–216, 1998.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2574–2582, 2016.

We prove Prop. 1 in Sec. 4.1. Let us consider a classifier $F$ with Lipschitz constant $L_F$. Let $F(X)$ be an output vector of the classifier for a data point $X$.
The statement to prove is the following:
(6) |
If we prove the following, it suffices:
(7) |
Before proving inequality (7), we have the following lemma.
For real vectors and , the following inequality holds:
W.l.o.g. we assume . Let be . Then,
∎
We prove bounds described in Sec. 5.1. Let and be functions with their Lipschitz constants bounded with and , respectively.
Using triangle inequality,
We see the Lipschitz constant of linear components, given in Sec. 5.2, in more detail. We first prove Theorem 1, and Theorem 2. Next, we focus on its calculation for normalization layers.
Since there exists a matrix representation $W$ of $f$ and the operator norm of $f$ in terms of the $\ell_2$-norm is equivalent to the spectral norm of $W$, considering $W$ is sufficient. Now, we have

$$\frac{1}{2} \frac{\partial \|Wu\|_2^2}{\partial u} = W^\top W u.$$

Thus, recursive application of the algorithm in Theorem 1 is equivalent to the power iteration on $W^\top W$. Since the maximum eigenvalue of $W^\top W$ is the square of the spectral norm of $W$, $\|u\|_2$ converges to the square of the spectral norm of $W$ almost surely in the algorithm. ∎
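The equivalence with power iteration can be checked numerically. The following NumPy sketch assumes the update "normalize, apply the operator, then take half the gradient of the squared output norm", and it writes that gradient explicitly as a transposed application instead of relying on automatic differentiation; the function name `squared_operator_norm` and the example matrix are hypothetical.

```python
import numpy as np


def squared_operator_norm(apply_w, apply_wt, dim, n_iter=100, seed=0):
    """Estimate the squared l2 operator norm of a linear map by power iteration.

    apply_w computes W u and apply_wt computes W^T v. Their composition equals
    0.5 * d||W u||_2^2 / du, which an autodiff framework could supply instead.
    """
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(dim)  # Gaussian initialization
    for _ in range(n_iter):
        u = u / np.linalg.norm(u)  # keep only the direction
        v = apply_w(u)
        u = apply_wt(v)  # u <- W^T W u
    return float(np.linalg.norm(u))


# Hypothetical 2 x 3 matrix with singular values sqrt(5) and 1.
W = np.array([[2.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
estimate = squared_operator_norm(lambda u: W @ u, lambda v: W.T @ v, dim=3)
reference = np.linalg.norm(W, 2) ** 2  # squared spectral norm from an SVD
```

Since the final vector is the image of a unit vector, its norm never exceeds the true squared norm, so a finite number of iterations gives an estimate from below; this is why a separate error bound is needed for certification.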
We use the same notation as in Algorithm 2. In Algorithm 2, we only care about the direction of the vector because we normalize it at every iteration. We first explain that the direction of converges to a singular vector of the largest singular value of the linear function when is fixed. Since and is a scalar, converges to the same direction as in Theorem 1. In other words, converges to the singular vector of the largest singular value of .
From the proof of Theorem 1 in Appendix C.1, we consider the power iteration on . Let be the largest singular value of a matrix . Since is a symmetric positive definite matrix, from Theorem 1.1 in Friedman [1998], we have
where is bounded by from Prop. 2.2 in [Friedman, 1998]. A quantity has the following relationship [Friedman, 1998]:
Thus, the Theorem 2 holds. ∎
If we use batchsize for the algorithm and take the max of all upper bound, then the failure probability is less than .
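Batch normalization applies the following function,

$$y_i = \gamma_i \frac{x_i - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}} + \beta_i, \tag{8}$$

where $\gamma_i$ and $\beta_i$ are learnable parameters and $\mu_i, \sigma_i$ are the mean and deviation of the (mini) batch, respectively. Parameters $\gamma_i, \beta_i$ and variables $\mu_i$ and $\sigma_i$ are constant at inference time. The small constant $\epsilon$ is generally added for numerical stability. We can rewrite the update of (8) as follows:

$$y_i = \frac{\gamma_i}{\sqrt{\sigma_i^2 + \epsilon}} x_i + \left(\beta_i - \frac{\gamma_i \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}\right).$$

Since the second term is constant in terms of input, it is independent of the Lipschitz constant. Thus, we consider the following update:

$$y_i = \frac{\gamma_i}{\sqrt{\sigma_i^2 + \epsilon}} x_i.$$

The Lipschitz constant can be bounded by $\max_i |\gamma_i| / \sqrt{\sigma_i^2 + \epsilon}$.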
Since the operation is linear, we can also use Algorithm 2 for the calculation. This allows us to calculate the Lipschitz constant of batch-normalization and the preceding linear layers jointly. When we apply Algorithm 2 to a single batch-normalization layer, a numerical issue can occur. See Appendix C.4.3 for more details.
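The batch-normalization bound and the benefit of the joint calculation can be illustrated with a short NumPy sketch. It assumes the inference-time scaling is the learned scale divided by the square root of the running variance plus a small constant, and it uses a small fully-connected weight matrix in place of a convolution for brevity; `batchnorm_lipschitz`, `fold_batchnorm`, and all numbers are hypothetical.

```python
import numpy as np


def batchnorm_lipschitz(gamma, running_var, eps=1e-5):
    """Largest absolute diagonal entry of the inference-time batch-norm scaling."""
    gamma = np.asarray(gamma, dtype=float)
    running_var = np.asarray(running_var, dtype=float)
    return float(np.max(np.abs(gamma) / np.sqrt(running_var + eps)))


def fold_batchnorm(weight, gamma, running_var, eps=1e-5):
    """Matrix of the joint linear map: batch-norm scaling applied after `weight`."""
    scale = np.asarray(gamma, dtype=float) / np.sqrt(
        np.asarray(running_var, dtype=float) + eps
    )
    return scale[:, None] * np.asarray(weight, dtype=float)


# Hypothetical parameters: three output units, two input units.
gamma = np.array([1.0, -2.0, 0.5])
running_var = np.array([4.0, 1.0, 0.25])
weight = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]])

bn_bound = batchnorm_lipschitz(gamma, running_var, eps=0.0)
separate_bound = bn_bound * np.linalg.norm(weight, 2)
joint_bound = np.linalg.norm(fold_batchnorm(weight, gamma, running_var, eps=0.0), 2)
```

The joint bound is never larger than the product of the two separate bounds, which is the motivation for treating a linear layer and the following normalization as one linear operator.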
In weight normalization [Salimans and Kingma, 2016], the same discussion applies if we replace $\sqrt{\sigma_i^2 + \epsilon}$ in batch-normalization with $\|w_i\|_2$, where $w_i$ is the $i$-th row of a weight matrix.
In some cases, estimation of the spectral norm using power iteration can fail during training. For example, in a batch-normalization layer, the vector $u$ in Algorithm 2 converges to some one-hot vector. Once $u$ converges, no matter how much other parameters change during training, $u$ stays the same. To avoid the problem, when we apply Algorithm 2 to normalization layers, we add small perturbations to the vector in the algorithm at every iteration after its normalization.
First, we prove the following lemma:
Let vector be a concatenation of vectors and let be a function such that is a concatenations of vectors , where each is a function with its Lipschitz constant bounded by . Then, the Lipschitz constant of can be bounded by .
∎
$c_{\mathrm{in}}, c_{\mathrm{out}}, h_k, w_k$: input channel size, output channel size, kernel height, kernel width.
$W'$: a matrix into which the kernel of a convolution is reshaped, with size $c_{\mathrm{out}} \times (c_{\mathrm{in}} h_k w_k)$.
The operation in a convolution layer satisfies the assumption in Theorem 3, where all $f_i$ are the matrix multiplication of $W'$. Thus, the right inequality holds. Since matrix multiplication with $W'$ is applied at least once in the convolution, the left inequality holds. ∎
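The two-sided relationship between the operator norm of a convolution and the spectral norm of its reshaped kernel can be checked on a toy case. The sketch below uses a single-channel 1-D convolution without padding and with stride 1, and assumes the upper constant is the square root of the maximum number of patches in which one input element appears (here, the kernel length); `conv1d_matrix` and the kernel values are hypothetical.

```python
import numpy as np


def conv1d_matrix(kernel, input_len):
    """Explicit matrix of a 1-D convolution (cross-correlation), stride 1, no padding."""
    k = len(kernel)
    out_len = input_len - k + 1
    mat = np.zeros((out_len, input_len))
    for i in range(out_len):
        mat[i, i:i + k] = kernel  # each output reads one length-k patch of the input
    return mat


kernel = np.array([1.0, -2.0, 0.5])
conv = conv1d_matrix(kernel, input_len=8)

conv_norm = np.linalg.norm(conv, 2)  # operator norm of the convolution
kernel_norm = np.linalg.norm(kernel)  # spectral norm of the 1 x 3 reshaped kernel
n_repeat = len(kernel)  # an input element appears in at most 3 patches
lower, upper = kernel_norm, np.sqrt(n_repeat) * kernel_norm
```

For realistic layers the explicit matrix is too large to build, which is why the bound in terms of the reshaped kernel, or the power iteration described earlier, is used instead.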
We provide a tight bound on the number of repetitions for pooling and convolutional layers here.
$h_{\mathrm{in}}, w_{\mathrm{in}}$: height and width of input array.
$h_k, w_k$: kernel height, kernel width.
$h_s, w_s$: stride height, stride width.
First of all, the number of repetitions is bounded by the size of the receptive field, which is $h_k \cdot w_k$. This is provided by Cisse et al. [2017]. Now, we extend the bound by considering the input size and stride. Firstly, we consider the input size after padding. If both the input and kernel size are , the number of repetitions is obviously bounded by . Similarly, the number of repetitions can be bounded by the following:
We can further bound the number of repetitions by considering the stride as follows:
∎
Lipschitz constant of max function is bounded by one.
Before bounding the Lipschitz constant, we note that the following inequality holds for a vector :
This can be proved using
Now, we bound the Lipschitz constant of the average function .
We empirically found that applying the addition only when a prediction is correct stabilizes the training. Thus, in the training, we scale the addition with
Even though depends on , we do not back-propagate it.
In this section, we describe the details of our experimental settings.
We used the same network, optimizer and hyperparameters as Kolter and Wong [2018]. A network consisting of two convolutional and two fully-connected layers was used. Table 3 shows the details of its structure.
layer | output size | kernel | padding | stride |
---|---|---|---|---|
convolution | 16 | (4,4) | (1,1) | (2,2) |
ReLU | - | - | - | - |
convolution | 32 | (4,4) | (1,1) | (2,2) |
ReLU | - | - | - | - |
fully-connected | 100 | - | - | - |
ReLU | - | - | - | - |
fully-connected | 10 | - | - | - |
All models were trained using the Adam optimizer [Kingma and Ba, 2015] for epochs with a batch size of . The learning rate of Adam was set to . Note that these settings are the same as in Kolter and Wong [2018]. For an LMT model, we set . For an AT model, we tuned the hyperparameter of the C&W attack from and chose the best one on validation data.
(A): We calculated (A) with Proposition 2.
(B): We took the maximum of the local Lipschitz constants calculated for (C).
(C): First, we added a random perturbation whose elements are each sampled from a Gaussian with zero mean and variance , where is set as the reciprocal of the input dimension. Next, we calculated the size of the gradient with respect to the input. We repeated these two steps 100 times and used the maximum value among them as an estimate of the local Lipschitz constant.
(D): We used DeepFool [Moosavi-Dezfooli et al., 2016].
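The sampling procedure used for the local Lipschitz estimate can be sketched as follows. This is an illustrative NumPy version on a toy one-hidden-layer ReLU function with a hand-written gradient, not the experimental code; it assumes the standard deviation of the noise equals the reciprocal of the input dimension (the source leaves the exact scaling symbol unreadable), and `estimate_local_lipschitz`, `grad_fn`, and the weights are hypothetical.

```python
import numpy as np


def estimate_local_lipschitz(grad_fn, x, n_trials=100, seed=0):
    """Largest input-gradient norm over randomly perturbed copies of x.

    The result is a lower bound of the true local Lipschitz constant.
    """
    rng = np.random.default_rng(seed)
    sigma = 1.0 / x.size  # assumption: noise scale is the reciprocal of the input dimension
    best = 0.0
    for _ in range(n_trials):
        noisy = x + sigma * rng.standard_normal(x.shape)
        best = max(best, float(np.linalg.norm(grad_fn(noisy))))
    return best


# Toy scalar function f(x) = w2 . relu(W1 x) with hypothetical weights.
W1 = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
w2 = np.array([1.0, -1.0, 0.5])


def grad_fn(x):
    """Gradient of f with respect to x for the active ReLU pattern at x."""
    active = (W1 @ x > 0).astype(float)
    return W1.T @ (w2 * active)


x = np.array([1.0, 1.0])
local_estimate = estimate_local_lipschitz(grad_fn, x)
global_bound = np.linalg.norm(w2) * np.linalg.norm(W1, 2)  # product of layer norms
```

Because the estimate only samples gradients, it can underestimate the true constant, which is why quantities derived from it are not certificates.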
A wide residual network [Zagoruyko and Komodakis, 2016] with 16 layers and a width factor was used. We sampled 10000 images from the extra data available for the SVHN dataset as validation data and combined the rest with the official training data, following Cisse et al. [2017]. All inputs were preprocessed so that each element has a value in a range -.
Models were trained with Nesterov Momentum [Nesterov, 1983] for epochs with a batch size of . The initial learning rate was set to and it was multiplied by at epochs and . For naive models, the weight decay with and the dropout with a dropout ratio of were used. For Parseval networks, the weight decay was removed except for the last fully-connected layer and Parseval regularization with was added, following Cisse et al. [2017]. For a network with the spectral norm regularization, the weight decay was removed and the spectral norm regularization with was used following Yoshida and Miyato [2017]. We note that both Cisse et al. [2017] and Yoshida and Miyato [2017] used batch-normalization for their experimental evaluations and thus, we left it for them. For LMT, we used and did not apply weight decay. In residual blocks, the Lipschitz constant for the convolutional layer and the batch normalization layer was jointly calculated as described in Sec. 5.2.

Since the proposed calculation method of guarded areas imposes almost no computational overhead at inference time, this property has various potential applications. First of all, we note that in real-world applications, even though true labels are not available, we can calculate the lower bounds on the size of perturbations needed to change the predictions. The primary use is balancing between the computational costs and the performance. When the provably guarded areas are sufficiently large, we can use weak and computationally cheap detectors of perturbations, because the detectors only need to find large perturbations. For data with small guarded areas, we may resort to computationally heavy options, e.g., strong detectors or denoising networks.
Here, we discuss the difference between our work and Cisse et al. [2017]. In the formulation of Parseval networks, the goal is to limit the change in some Lipschitz continuous loss by constraining the Lipschitz constant. However, since the existence of adversarial perturbations corresponds to the 0-1 loss, which is not continuous, their discussion is not applicable. For example, if we add a scaling layer to the output of a network without changing its parameters, we can control the Lipschitz constant of the network. However, this does not change its predictions and is therefore irrelevant to the existence of adversarial perturbations. Hence, considering the Lipschitz constant alone can be insufficient. In LMT, this insufficiency is avoided by using Propositions 1 and 2.
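The scaling-layer argument can be checked numerically. The sketch below is an illustration under assumed values: the logits, the Lipschitz bound, and the helper names are hypothetical. It shows that an output scaling layer shrinks the Lipschitz bound while leaving both the prediction and the ratio of margin to Lipschitz bound unchanged.

```python
def margin(logits):
    """Gap between the largest and second-largest logits."""
    top, runner_up = sorted(logits, reverse=True)[:2]
    return top - runner_up


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


def add_scaling_layer(logits, lipschitz_bound, alpha):
    """Scale the network output by alpha > 0; the Lipschitz bound scales with it."""
    return [alpha * z for z in logits], alpha * lipschitz_bound


# Made-up network output and Lipschitz bound, for illustration only.
logits, L = [2.0, 0.5, -1.0], 4.0
small_logits, small_L = add_scaling_layer(logits, L, alpha=0.01)

ratio_before = margin(logits) / L
ratio_after = margin(small_logits) / small_L
```

Because the margin and the Lipschitz bound shrink by the same factor, a criterion based on their ratio is unaffected by the rescaling, whereas the Lipschitz constant alone is not.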
Additionally, we point out three differences. First, in Parseval networks, the upper bound of the Lipschitz constant of each component is restricted to be smaller than one. This makes their theoretical framework incompatible with some frequently used layers such as the batch normalization layer. Since they simply ignore the effects of such layers, Parseval networks cannot control the Lipschitz constant of networks with normalization layers. In contrast, our calculation method of guarded areas and LMT can handle such layers without problems. Second, Parseval networks force all singular values of the weight matrices to be close to one, which prevents the weight matrices from discarding unnecessary features. As Wang et al. [2017] pointed out, learning unnecessary features can be a cause of adversarial perturbations, which indicates that the orthonormality condition has an adverse effect that encourages the existence of adversarial perturbations. Since LMT does not penalize small singular values, it does not suffer from this problem. Third, LMT requires only differentiable bounds of the Lipschitz constants. This lets LMT be easily extended to networks with various components. In contrast, the framework of Parseval networks requires special optimization techniques for each component.
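To illustrate why normalization layers need not be constrained below one, here is a small sketch of a composed Lipschitz bound. It relies on two standard facts: an inference-time batch normalization layer scales channel i by gamma_i / sqrt(var_i + eps), and the Lipschitz constant of a composition is at most the product of the per-layer constants. The per-layer product is an assumption made for simplicity and is looser than the joint convolution and batch normalization calculation mentioned in the experimental setup; all names and values are hypothetical.

```python
import math


def batchnorm_lipschitz(gammas, variances, eps=1e-5):
    """Lipschitz constant of inference-time batch normalization.

    Each channel is scaled by gamma / sqrt(var + eps), so the constant is
    the largest absolute scale, which may well exceed one.
    """
    return max(abs(g) / math.sqrt(v + eps) for g, v in zip(gammas, variances))


def network_lipschitz_bound(layer_bounds):
    """Upper bound for a sequential network: product of per-layer bounds."""
    return math.prod(layer_bounds)


# Made-up layer statistics, for illustration only.
bn_bound = batchnorm_lipschitz([2.0, 0.5], [0.25, 1.0], eps=0.0)
total_bound = network_lipschitz_bound([0.9, bn_bound, 1.0])
```

Every operation here is differentiable almost everywhere in the layer parameters, which is the only property the third point above asks of a bound.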
The formulation of LMT is highly flexible, so we can consider some extended versions. First, we consider applications that require guarded areas of different sizes for different classes. For example, distinguishing humans from cats will be more important than distinguishing Angora cats from Persian cats. In LMT, such knowledge can be incorporated by specifying a different hyperparameter for each pair of classes. Second, we consider a combination with adversarial training. It is more reasonable to require smaller margins for inputs with large perturbations. In LMT, we can incorporate this intuition by changing the hyperparameter according to the size of the perturbations, or by merely setting it to zero for perturbed data. This ability to be easily combined with other notions is one of the advantages of LMT.
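Both extensions can be sketched with a single helper. The sketch assumes that LMT training adds sqrt(2) * c * L to every logit except the true class, with L an upper bound on the Lipschitz constant; the per-pair table `C_PAIRS`, the class indices, and the numbers are hypothetical and only illustrate the idea.

```python
import math


def lmt_training_logits(logits, true_class, lipschitz_bound, c_pairs):
    """Shift each non-true logit by sqrt(2) * c * L, with c chosen per class pair.

    c_pairs[t][j] is the hypothetical margin hyperparameter required between
    true class t and competing class j.
    """
    shift = math.sqrt(2.0) * lipschitz_bound
    return [
        z if j == true_class else z + shift * c_pairs[true_class][j]
        for j, z in enumerate(logits)
    ]


# Hypothetical classes: 0 = human, 1 = Angora cat, 2 = Persian cat.
# A large margin separates humans from cats, a small one separates cat breeds.
C_PAIRS = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.1],
    [1.0, 0.1, 0.0],
]
# For adversarially perturbed training data, the margin requirement is dropped.
NO_MARGIN = [[0.0] * 3 for _ in range(3)]

logits = [0.0, 3.0, 2.0]
clean = lmt_training_logits(logits, 1, 1.0, C_PAIRS)
perturbed = lmt_training_logits(logits, 1, 1.0, NO_MARGIN)
```

The shifted logits would then be passed to the usual loss, so the per-pair table changes only the size of the shift and not the training loop.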