RecurJac: An Efficient Recursive Algorithm for Bounding Jacobian Matrix of Neural Networks and Its Applications

10/28/2018 ∙ by Huan Zhang, et al. ∙ Microsoft

The Jacobian matrix (or the gradient for single-output networks) is directly related to many important properties of neural networks, such as the function landscape, stationary points, (local) Lipschitz constants and robustness to adversarial attacks. In this paper, we propose a recursive algorithm, RecurJac, to compute both upper and lower bounds for each element in the Jacobian matrix of a neural network with respect to the network's input, and the network can contain a wide range of activation functions. As a byproduct, we can efficiently obtain a (local) Lipschitz constant, which plays a crucial role in neural network robustness verification, as well as the training stability of GANs. Experiments show that (local) Lipschitz constants produced by our method are of better quality than previous approaches, thus providing better robustness verification results. Our algorithm has polynomial time complexity, and its computation time is reasonable even for relatively large networks. Additionally, we use our bounds on the Jacobian matrix to characterize the landscape of the neural network, for example, to determine whether there exist stationary points in a local neighborhood. Source code available at https://github.com/huanzhang12/RecurJac-Jacobian-Bounds.


Introduction

Deep neural networks have been successfully applied to many applications, but one of the major criticisms is that they are black boxes: no satisfactory explanation of their behavior can be easily offered. Given a neural network $f$ with input $x$, one fundamental question to ask is: how does a perturbation in the input space affect the output prediction? To formally answer this question and bound the behavior of neural networks, a critical step is to compute uniform bounds on the Jacobian matrix $\nabla_x f(x)$ for all $x$ within a certain region. Many recent works on understanding or verifying the behavior of neural networks rely on this quantity. For example, once a (local) Jacobian bound is computed, one can immediately know the radius of a guaranteed "safe region" in the input space, where no adversarial perturbation can change the output label [Hein and Andriushchenko2017, Weng et al.2018b]. This is also referred to as the robustness verification problem. In generative adversarial networks (GANs) [Goodfellow et al.2014], the training process suffers from the gradient vanishing problem and can be very unstable. Adding the Lipschitz constant of the discriminator network as a constraint [Arjovsky, Chintala, and Bottou2017, Miyato et al.2018] or as a regularizer [Gulrajani et al.2017] significantly improves the training stability of GANs. For neural networks, the Jacobian matrix with respect to the input is also closely related to the Jacobian matrix with respect to the weights, whose bound directly characterizes the generalization gap in supervised learning and GANs; see, e.g., [Vapnik and Vapnik1998, Sriperumbudur et al.2009, Bartlett, Foster, and Telgarsky2017, Arora and Zhang2018, Zhang et al.2017].

Computing bounds on the Jacobian (or the gradient) is very challenging even for a simple ReLU network, and how to efficiently provide a tight bound is still an open problem for deep neural networks. Previous attempts to compute Jacobian bounds can be summarized into three categories. Sampling approaches [Wood and Zhang1996, Weng et al.2018b] estimate the Jacobian bound by sampling points and estimating the maximum gradient, but the computed quantity is usually an under-estimation, and as a result not a certified bound. Another line of work simply bounds the norm of the Jacobian matrix over the entire domain (i.e., the global Lipschitz constant) by the product of the operator norms of the weight matrices [Szegedy et al.2013, Cisse et al.2017, Elsayed et al.2018]. Unfortunately, this is a very loose global upper bound, especially when we are only interested in a small local region of a neural network. Finally, some recent works focus on computing the Lipschitz constant of ReLU networks: [Raghunathan, Steinhardt, and Liang2018] solves a semi-definite programming (SDP) problem to give a Lipschitz constant, but its computational cost is high and it only applies to 2-layer networks; Fast-Lip [Weng et al.2018a] can be applied to multi-layer ReLU networks, but the bound quickly loses its power when the network goes deeper.

In this paper, we propose a novel recursive algorithm, dubbed RecurJac, for efficiently computing a certified Jacobian bound. Unlike the layer-by-layer algorithm (Fast-Lip) for ReLU networks in [Weng et al.2018a], we develop a recursive refinement procedure that significantly outperforms Fast-Lip on ReLU networks, and our algorithm is general enough to be applied to networks with most common activation functions, not limited to ReLU. Our key observation is that the Jacobian bounds of previous layers can be used to reduce the uncertainties of neuron activations in the current layer, and some uncertain neurons can be fixed without affecting the final bound. We can then absorb these fixed neurons into the previous layer's weight matrix, which results in bounding the Jacobian matrix of another, shallower network. This technique can be applied recursively to get a tighter final bound. Compared with the non-recursive algorithm (Fast-Lip), RecurJac increases the computation cost by at most $H$ times ($H$ is the depth of the network), which is reasonable even for relatively large networks.

Figure 1: RecurJac can obtain local and global Lipschitz constants that are orders of magnitude better than existing algorithms. See the Experiments section for more details.

We apply RecurJac to various applications. First, after obtaining the upper and lower bounds of the Jacobian matrix, we can investigate the local optimization landscape, for example by guaranteeing that no stationary points exist inside a certain region. Experimental results show that the radius of this region steadily decreases as networks become deeper. Second, RecurJac can find a local Lipschitz constant that is up to two orders of magnitude smaller than that of the state-of-the-art algorithm without a recursive structure (Figure 1). Finally, we can use RecurJac to evaluate the robustness of neural networks, by giving a certified lower bound on the minimum adversarial distortion, within which no adversarial examples can be found.

Related Work

Previous algorithms for computing the Lipschitz constant.

Several previous works give bounds on the local or global Lipschitz constant of neural networks, which is a special case of our problem: after knowing the element-wise bounds of the Jacobian matrix, a local or global Lipschitz constant can be obtained by taking the corresponding induced norm of the Jacobian matrix (more details in the next section).

One simple approach to estimate the Lipschitz constant of any black-box function is to sample many pairs of points $x, y$ and compute the maximum of $\|f(x)-f(y)\| / \|x-y\|$ [Wood and Zhang1996]. However, the computed value may be an under-estimation unless the sample size goes to infinity. Extreme value theory [De Haan and Ferreira2007] can be used to refine the estimate, but the computed value could still under-estimate the Lipschitz constant [Weng et al.2018b], especially due to the high input dimensionality of neural networks.

For a neural network with known structure and weights, it is possible to compute a Lipschitz constant explicitly. Since the Jacobian matrix of a neural network with respect to the input can be written explicitly (as we will introduce later in Eq. (2)), an easy and popular way to obtain a loose global Lipschitz constant is to multiply the weight matrices' operator norms and the maximum derivative of each activation function (most common activation functions are Lipschitz continuous) [Szegedy et al.2013]. Since this quantity is simple to compute and can be optimized by back-propagation, many recent works propose defenses against adversarial examples [Cisse et al.2017, Elsayed et al.2018, Tsuzuku, Sato, and Sugiyama2018, Qian and Wegman2018] or techniques to improve the training stability of GANs [Miyato et al.2018] by regularizing this global Lipschitz constant. However, it is a very loose Lipschitz constant, as will be shown in our experiments.
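To make this concrete, the following minimal numpy sketch (not from the paper; the layer shapes and random weights are made up) computes the naive global Lipschitz bound as the product of the weight matrices' spectral norms, assuming 1-Lipschitz activations such as (leaky-)ReLU or tanh:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = [rng.standard_normal((64, 784)),   # W^(1)
           rng.standard_normal((64, 64)),    # W^(2)
           rng.standard_normal((10, 64))]    # W^(3)

# Global Lipschitz upper bound w.r.t. the l2 norm: prod_l ||W^(l)||_2,
# valid because sup |sigma'(z)| <= 1 for these activations.
global_lip = np.prod([np.linalg.norm(W, ord=2) for W in weights])
print(f"naive global Lipschitz upper bound: {global_lip:.2f}")
```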

For 2-layer ReLU networks, [Raghunathan, Steinhardt, and Liang2018] computes a global Lipschitz constant by relaxing the problem to semi-definite programming (SDP) and solving its dual, but it is computationally expensive. For 2-layer networks with twice differentiable activation functions, [Hein and Andriushchenko2017] derives the local Lipschitz constant for robustness verification. These methods show promising results for 2-layer networks, but cannot be trivially extended to networks with multiple layers.

Bounds for the Jacobian matrix.

Recently, [Weng et al.2018a] proposed a layer-by-layer algorithm, Fast-Lip, for computing the lower and upper bounds of the Jacobian matrix with respect to the network input $x$. It exploits the special activation patterns of ReLU networks but does not apply to networks with general activation functions. Most importantly, it loses power quickly when the network becomes deeper, and using Fast-Lip for robustness verification produces non-trivial bounds only for very shallow networks (fewer than 4 layers).

Robustness verification of neural networks.

Verifying whether a neural network is robust to a norm-bounded distortion is an NP-complete problem [Katz et al.2017]. Solving for the minimal adversarial perturbation can take hours to days using solvers for satisfiability modulo theories (SMT) [Katz et al.2017] or mixed integer linear programming (MILP) [Tjeng, Xiao, and Tedrake2017]. However, a lower bound on the minimum adversarial distortion (a robustness lower bound) can be given if the local Lipschitz constant near the input example $x_0$ is known. For a multi-class classification network, assume the output of the network is a $K$-dimensional vector $f(x) = [f_1(x), \dots, f_K(x)]$ where each $f_j(x)$ is the logit for the $j$-th class and the final prediction is $c = \arg\max_j f_j(x_0)$. The following lemma gives a robustness lower bound [Hein and Andriushchenko2017, Weng et al.2018b]: for an input example $x_0$, the prediction stays at class $c$ for any perturbation $\delta$ with

$\|\delta\|_p \le \min\Big\{ \min_{j \ne c} \frac{f_c(x_0) - f_j(x_0)}{L^{j}_{x_0,R}},\; R \Big\}$   (1)

where $L^{j}_{x_0,R}$ is the Lipschitz constant of $f_c(x) - f_j(x)$ in some local region (formally defined later). Therefore, as long as a local Lipschitz constant can be computed, we can verify that the prediction of a neural network will stay unchanged for any perturbation within the radius given by (1). A good local Lipschitz constant is hard to compute in general: [Hein and Andriushchenko2017] only shows results for 2-layer neural networks; [Weng et al.2018b] applies a sampling-based approach and cannot guarantee that the computed radius satisfies (1). Thus, an efficient, guaranteed and tight bound on the Lipschitz constant is essential for understanding the robustness of deep neural networks.
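The lemma translates directly into a few lines of code. The sketch below (a toy illustration with made-up numbers, not the authors' implementation) computes the certified radius of Eq. (1) given the logits at $x_0$ and per-class local Lipschitz constants of $f_c - f_j$ inside the ball of radius $R$:

```python
import numpy as np

def certified_radius(logits, lipschitz, R):
    """logits: (K,) logits at x0; lipschitz[j]: local Lipschitz constant of
    f_c - f_j inside the ball of radius R; returns a certified radius."""
    c = int(np.argmax(logits))
    margins = [(logits[c] - logits[j]) / lipschitz[j]
               for j in range(len(logits)) if j != c]
    return min(min(margins), R)

# toy example: class 0 is predicted; no perturbation of norm <= 0.42 can flip it
print(certified_radius(np.array([3.2, 1.1, 0.4]), np.array([1.0, 5.0, 2.0]), R=0.5))
```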

Some other methods have also been proposed for robustness verification, including direct linear bounds [Zhang et al.2018, Croce, Andriushchenko, and Hein2018, Weng et al.2018a], the convex adversarial polytope [Wong and Kolter2018, Wong et al.2018], Lagrangian relaxation [Dvijotham et al.2018] and geometric abstractions [Gehr et al.2018, Mirman, Gehr, and Vechev2018]. In this paper we focus on local Lipschitz constant based methods only.

RecurJac: Recursive Jacobian Bounding

In this section, we present RecurJac, our recursive algorithm for uniformly bounding the (local) Jacobian matrix of neural networks with a wide range of activation functions.

Notations.

For an $H$-layer neural network $f(x)$ with input $x \in \mathbb{R}^{n_0}$, weight matrices $W^{(1)}, \dots, W^{(H)}$ and bias vectors $b^{(1)}, \dots, b^{(H)}$, the network can be defined recursively as $\Phi^{(l)}(x) = \sigma\big(W^{(l)} \Phi^{(l-1)}(x) + b^{(l)}\big)$ for all $l \in \{1, \dots, H-1\}$, with $\Phi^{(0)}(x) = x$ and $f(x) = W^{(H)} \Phi^{(H-1)}(x) + b^{(H)} \in \mathbb{R}^{K}$. $\sigma(\cdot)$ is a component-wise activation function: (leaky-)ReLU, a member of the sigmoid family (including sigmoid, arctan, hyperbolic tangent, etc.), or any other activation function that satisfies the assumptions we will formally state below. We denote $W^{(l)}_{r,:}$ as the $r$-th row and $W^{(l)}_{:,k}$ as the $k$-th column of $W^{(l)}$. For convenience, we denote $z^{(l)}(x) = W^{(l)} \Phi^{(l-1)}(x) + b^{(l)}$ as the pre-activation values of layer $l$.
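A minimal numpy sketch of this notation (the shapes, the ReLU choice and the function name are illustrative assumptions, not the authors' code) is:

```python
import numpy as np

def forward(weights, biases, x):
    """Phi^(0)(x) = x, Phi^(l)(x) = sigma(W^(l) Phi^(l-1)(x) + b^(l)),
    f(x) = W^(H) Phi^(H-1)(x) + b^(H). Returns f(x) and all pre-activations z^(l)."""
    h, pre_acts = x, []
    for l, (W, b) in enumerate(zip(weights, biases), start=1):
        z = W @ h + b
        pre_acts.append(z)
        # the last layer (l = H) is linear; hidden layers use ReLU here
        h = z if l == len(weights) else np.maximum(z, 0)
    return h, pre_acts
```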

Local Lipschitz constant.

Given a function $f$ and two distance metrics $d_i$ and $d_o$ on the input and output spaces, respectively, a local Lipschitz constant of $f$ in a closed ball of radius $R$ centered at $x_0$ (denoted as $\mathbb{B}_p(x_0, R)$) is defined as a scalar $L^{R}_{x_0}$ such that

$d_o\big(f(x), f(y)\big) \le L^{R}_{x_0} \, d_i(x, y)$ for all $x, y \in \mathbb{B}_p(x_0, R)$.

Any scalar that satisfies this condition is a local Lipschitz constant. However, a good local Lipschitz constant should be as small as possible, i.e., close to the best (smallest) local Lipschitz constant. The Lipschitz constant we compute can be seen as an upper bound on the best Lipschitz constant.

Assumptions on activation functions.

Our bounds depend on the following assumptions on the activation function $\sigma$: (1) $\sigma$ is continuous and differentiable almost everywhere on $\mathbb{R}$; this is a basic assumption satisfied by neural network activation functions. (2) There exists a positive constant $c$ such that $0 \le \sigma'(z) \le c$ whenever the derivative exists. This covers all common activation functions, including (leaky-)ReLU, hard-sigmoid, exponential linear units (ELU), sigmoid, hyperbolic tangent, arctan and all sigmoid-shaped activation functions. This assumption helps us derive an elegant bound.

Overview of Techniques.

The local Lipschitz constant can be represented as the maximum directional derivative inside the ball $\mathbb{B}_p(x_0, R)$ [Weng et al.2018b]. For differentiable functions, this is the maximum norm of the gradient with respect to the distance metric $d_i$ (or the maximum operator norm of the Jacobian induced by the distances $d_i$ and $d_o$ in the vector-output case). Deriving the exact maximum norm of the gradient is NP-complete even for a 2-layer ReLU network [Raghunathan, Steinhardt, and Liang2018], so we bound each element of the Jacobian through a layer-by-layer approach, as shown below.

Define diagonal matrices representing the derivatives of the activation functions:

$\Sigma^{(l)} = \mathrm{diag}\big(\sigma'(z^{(l)})\big)$, i.e., $\Sigma^{(l)}_{r,r} = \sigma'\big(W^{(l)}_{r,:} \Phi^{(l-1)}(x) + b^{(l)}_r\big)$.

The Jacobian matrix of an $H$-layer network can then be written as:

$J = \nabla_x f(x) = W^{(H)} \Sigma^{(H-1)} W^{(H-1)} \cdots \Sigma^{(1)} W^{(1)}$.   (2)

For ease of notation, we also define

$Y^{(l)} := W^{(H)} \Sigma^{(H-1)} W^{(H-1)} \cdots \Sigma^{(l)} W^{(l)}$

for $l \in \{1, \dots, H-1\}$. As a special case, $Y^{(H)} = W^{(H)}$ and $Y^{(1)} = J$.
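As a sanity check, the small numpy snippet below (toy shapes and tanh activations are assumptions, not part of the paper) verifies Eq. (2) against a finite-difference Jacobian of a 3-layer network:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2, W3 = rng.standard_normal((4, 3)), rng.standard_normal((4, 4)), rng.standard_normal((2, 4))
b1, b2, b3 = rng.standard_normal(4), rng.standard_normal(4), rng.standard_normal(2)

def f(x):
    h1 = np.tanh(W1 @ x + b1)
    h2 = np.tanh(W2 @ h1 + b2)
    return W3 @ h2 + b3

x = rng.standard_normal(3)
z1 = W1 @ x + b1
z2 = W2 @ np.tanh(z1) + b2
S1 = np.diag(1.0 - np.tanh(z1) ** 2)   # Sigma^(1) = diag(sigma'(z^(1)))
S2 = np.diag(1.0 - np.tanh(z2) ** 2)   # Sigma^(2)
J = W3 @ S2 @ W2 @ S1 @ W1             # Eq. (2)

# central finite differences, one input coordinate at a time
eps = 1e-6
J_fd = np.stack([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(3)], axis=1)
assert np.allclose(J, J_fd, atol=1e-4)
```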

In the first step, we assume that we have the following pre-activation bounds $l^{(l)}$ and $u^{(l)}$ for every layer $l$ and every $x \in \mathbb{B}_p(x_0, R)$:

$l^{(l)} \le W^{(l)} \Phi^{(l-1)}(x) + b^{(l)} \le u^{(l)}$.   (3)

These bounds can be computed efficiently via any algorithm that computes layer-wise activation bounds, including CROWN [Zhang et al.2018] and the convex adversarial polytope [Wong and Kolter2018]. Because the pre-activations lie within a range rather than taking a fixed value (see Eq. (3)), the matrices $\Sigma^{(l)}$ contain uncertainties, which will be characterized analytically.
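For intuition, here is a minimal interval-arithmetic sketch of such pre-activation bounds, assuming an $\ell_\infty$ ball of radius $R$ around $x_0$ and ReLU activations; it is much looser than CROWN or the convex adversarial polytope, but illustrates what Eq. (3) provides:

```python
import numpy as np

def preactivation_bounds(weights, biases, x0, R):
    """Return a list of per-layer (l^(k), u^(k)) pre-activation bounds."""
    lo, hi = x0 - R, x0 + R                       # input interval
    bounds = []
    for W, b in zip(weights, biases):
        W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
        l = W_pos @ lo + W_neg @ hi + b           # lower pre-activation bound
        u = W_pos @ hi + W_neg @ lo + b           # upper pre-activation bound
        bounds.append((l, u))
        lo, hi = np.maximum(l, 0), np.maximum(u, 0)   # pass through ReLU
    return bounds
```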

In the second step, we compute both lower and upper bounds for each entry of $Y^{(l)}$ in a backward manner. More specifically, we compute $L^{(l)}$ and $U^{(l)}$ so that

$L^{(l)} \le Y^{(l)} \le U^{(l)}$   (4)

holds true element-wise. For layer $H$, we have $Y^{(H)} = W^{(H)}$ and thus $L^{(H)} = U^{(H)} = W^{(H)}$. For layers $l < H$, uncertainties in the matrices $\Sigma^{(l)}, \dots, \Sigma^{(H-1)}$ propagate into $Y^{(l)}$ layer by layer. Naively deriving $L^{(l)}$ and $U^{(l)}$ just from $L^{(l+1)}$ and $U^{(l+1)}$ leads to a very loose bound. We propose a fast recursive algorithm that makes use of the bounds for all previous layers to compute a much tighter bound for $Y^{(l)}$. Applying our algorithm down to $l = 1$ eventually allows us to obtain $L^{(1)} \le Y^{(1)} = J \le U^{(1)}$. Our algorithm can also be applied in a forward manner; the forward version (RecurJac-F) typically leads to slightly tighter bounds but can slow down the computation significantly, as we will show in the experiments.

From an optimization perspective, we essentially try to solve two constrained problems (a maximization and a minimization) with variables $\Sigma^{(1)}, \dots, \Sigma^{(H-1)}$, for each element $(j,k)$ of the Jacobian $Y^{(1)} = J$:

$\max / \min \; \big[ W^{(H)} \Sigma^{(H-1)} W^{(H-1)} \cdots \Sigma^{(1)} W^{(1)} \big]_{j,k}$, subject to each diagonal entry $\Sigma^{(l)}_{r,r}$ lying in the range induced by the pre-activation bounds (3).   (5)

[Raghunathan, Steinhardt, and Liang2018] show that even for ReLU networks with one hidden layer, finding the maximum norm of the gradient is equivalent to the Max-Cut problem and is NP-complete. RecurJac is a polynomial-time algorithm that gives upper and lower bounds on (5), rather than solving for the exact maxima and minima in exponential time.

After obtaining the Jacobian bounds $L^{(1)}$ and $U^{(1)}$, we can use them to derive an upper bound for the local Lipschitz constant in the set $\mathbb{B}_p(x_0, R)$. We present bounds when $d_i$ and $d_o$ are both ordinary $\ell_p$-norm ($p \ge 1$) distances in Euclidean space. We can also use the Jacobian bounds for other purposes, such as understanding the local optimization landscape.

Bounds for $\Sigma^{(l)}$

From (2), we can see that the uncertainties in $Y^{(l)}$ come purely from $\Sigma^{(l)}, \dots, \Sigma^{(H-1)}$; all $W^{(l)}$ are fixed. For each $\Sigma^{(l)}$, we define the range of its $r$-th diagonal entry as $[l^{(l)}_{\Sigma,r}, u^{(l)}_{\Sigma,r}]$, i.e.,

$l^{(l)}_{\Sigma,r} \le \Sigma^{(l)}_{r,r} \le u^{(l)}_{\Sigma,r}$.   (6)

Note that $l^{(l)}_{\Sigma,r}$ and $u^{(l)}_{\Sigma,r}$ can be easily obtained because we know the pre-activation range $[l^{(l)}_r, u^{(l)}_r]$ (thanks to (3)) and the analytical form of $\sigma'$. For example, for the sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$, with $\sigma'(z) = \sigma(z)\big(1 - \sigma(z)\big)$, we have:

$u^{(l)}_{\Sigma,r} = \begin{cases} \sigma'(0), & \text{if } l^{(l)}_r < 0 < u^{(l)}_r \\ \max\big(\sigma'(l^{(l)}_r), \sigma'(u^{(l)}_r)\big), & \text{otherwise} \end{cases}$   (7)

$l^{(l)}_{\Sigma,r} = \min\big(\sigma'(l^{(l)}_r), \sigma'(u^{(l)}_r)\big)$   (8)

Note that equations (7) and (8) are also valid for other sigmoid-family activation functions, including $\tanh$, $\arctan$, and many others, since their derivatives are unimodal with the maximum attained at 0.
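A small helper consistent with Eqs. (7)-(8) (a sketch, not the authors' code) bounds the sigmoid derivative over a pre-activation interval:

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def sigma_bounds_sigmoid(l, u):
    """l, u: arrays of pre-activation bounds; returns (l_Sigma, u_Sigma)."""
    lo = np.minimum(sigmoid_prime(l), sigmoid_prime(u))               # Eq. (8)
    hi = np.where((l < 0) & (u > 0), sigmoid_prime(0.0),
                  np.maximum(sigmoid_prime(l), sigmoid_prime(u)))     # Eq. (7)
    return lo, hi
```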

For (leaky-)ReLU activation functions with a negative-side slope $\alpha$ ($0 \le \alpha < 1$), $l^{(l)}_{\Sigma,r}$ and $u^{(l)}_{\Sigma,r}$ are:

$l^{(l)}_{\Sigma,r} = \begin{cases} 1, & \text{if } l^{(l)}_r \ge 0 \\ \alpha, & \text{otherwise} \end{cases}$ $\qquad$ $u^{(l)}_{\Sigma,r} = \begin{cases} \alpha, & \text{if } u^{(l)}_r \le 0 \\ 1, & \text{otherwise.} \end{cases}$

For (leaky-)ReLU activation functions, in the cases where $l^{(l)}_r \ge 0$ or $u^{(l)}_r \le 0$, we have $l^{(l)}_{\Sigma,r} = u^{(l)}_{\Sigma,r}$, so $\Sigma^{(l)}_{r,r}$ becomes a constant and there is no uncertainty.
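The analogous helper for (leaky-)ReLU is even simpler (again a sketch; the degenerate case $l^{(l)}_r = u^{(l)}_r = 0$ is ignored):

```python
import numpy as np

def sigma_bounds_leaky_relu(l, u, alpha=0.0):
    """l, u: arrays of pre-activation bounds; alpha: negative-side slope
    (alpha = 0 gives plain ReLU); returns (l_Sigma, u_Sigma)."""
    lo = np.where(l >= 0, 1.0, alpha)   # derivative is fixed to 1 when l >= 0
    hi = np.where(u <= 0, alpha, 1.0)   # derivative is fixed to alpha when u <= 0
    return lo, hi
```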

A recursive algorithm to bound the Jacobian matrix

Bounds for $Y^{(H-1)}$.

By definition, we have $Y^{(H)} = W^{(H)}$ and $Y^{(H-1)} = W^{(H)} \Sigma^{(H-1)} W^{(H-1)}$. Therefore, we obtain

$Y^{(H-1)}_{j,k} = \sum_r W^{(H)}_{j,r} \, \Sigma^{(H-1)}_{r,r} \, W^{(H-1)}_{r,k}$,   (9)

where $\Sigma^{(H-1)}_{r,r} \in [l^{(H-1)}_{\Sigma,r}, u^{(H-1)}_{\Sigma,r}]$ thanks to (6).

By the assumptions on activation functions, $\Sigma^{(H-1)}_{r,r}$ is always non-negative, and thus we only need to consider the signs of $W^{(H)}_{j,r}$ and $W^{(H-1)}_{r,k}$ in (9). Denote $L^{(H-1)}_{j,k}$ and $U^{(H-1)}_{j,k}$ to be a lower and an upper bound of (9). By examining the signs of each term, we can take

$L^{(H-1)}_{j,k} = \sum_{r:\, W^{(H)}_{j,r} W^{(H-1)}_{r,k} < 0} u^{(H-1)}_{\Sigma,r} W^{(H)}_{j,r} W^{(H-1)}_{r,k} + \sum_{r:\, W^{(H)}_{j,r} W^{(H-1)}_{r,k} > 0} l^{(H-1)}_{\Sigma,r} W^{(H)}_{j,r} W^{(H-1)}_{r,k}$   (10)

$U^{(H-1)}_{j,k} = \sum_{r:\, W^{(H)}_{j,r} W^{(H-1)}_{r,k} > 0} u^{(H-1)}_{\Sigma,r} W^{(H)}_{j,r} W^{(H-1)}_{r,k} + \sum_{r:\, W^{(H)}_{j,r} W^{(H-1)}_{r,k} < 0} l^{(H-1)}_{\Sigma,r} W^{(H)}_{j,r} W^{(H-1)}_{r,k}$   (11)

In (10), we collect all terms with $W^{(H)}_{j,r} W^{(H-1)}_{r,k} < 0$ and multiply them by $u^{(H-1)}_{\Sigma,r}$ as a lower bound of the negative part, and collect all terms with $W^{(H)}_{j,r} W^{(H-1)}_{r,k} > 0$ and multiply them by $l^{(H-1)}_{\Sigma,r}$ as a lower bound of the positive counterpart. We obtain the upper bound in (11) following the same rationale. Fast-Lip is a special case of RecurJac for ReLU activation functions when there are only two layers; RecurJac becomes much more sophisticated in the multi-layer case, as we will show in the next paragraph.
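The bounds (10)-(11) vectorize naturally. The numpy sketch below (an illustrative implementation of the formulas above, not the released RecurJac code) scales sign-definite products by $l_\Sigma$ or $u_\Sigma$ accordingly:

```python
import numpy as np

def bound_last_two_layers(W_H, W_Hm1, l_sigma, u_sigma):
    """W_H: (m, n); W_Hm1: (n, d); l_sigma, u_sigma: (n,) bounds on sigma'.
    Returns element-wise bounds L, U on W_H @ diag(sigma') @ W_Hm1."""
    prod = W_H[:, :, None] * W_Hm1[None, :, :]        # prod[j, r, k]
    pos, neg = np.maximum(prod, 0), np.minimum(prod, 0)
    l_s, u_s = l_sigma[None, :, None], u_sigma[None, :, None]
    L = (l_s * pos + u_s * neg).sum(axis=1)           # Eq. (10)
    U = (u_s * pos + l_s * neg).sum(axis=1)           # Eq. (11)
    return L, U
```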

Bounds for $Y^{(l)}$ when $l < H-1$.

By definition, we have $Y^{(l)} = Y^{(l+1)} \Sigma^{(l)} W^{(l)}$, i.e.,

$Y^{(l)}_{j,k} = \sum_r Y^{(l+1)}_{j,r} \, \Sigma^{(l)}_{r,r} \, W^{(l)}_{r,k}$,   (12)

where $\Sigma^{(l)}_{r,r} \in [l^{(l)}_{\Sigma,r}, u^{(l)}_{\Sigma,r}]$ thanks to (6) and $L^{(l+1)}_{j,r} \le Y^{(l+1)}_{j,r} \le U^{(l+1)}_{j,r}$ thanks to the previous computation. We want to find bounds $L^{(l)}_{j,k}$ and $U^{(l)}_{j,k}$ such that $L^{(l)}_{j,k} \le Y^{(l)}_{j,k} \le U^{(l)}_{j,k}$.

We decompose (12) into two terms:

$Y^{(l)}_{j,k} = \underbrace{\sum_{r:\, L^{(l+1)}_{j,r} \le 0 \le U^{(l+1)}_{j,r}} Y^{(l+1)}_{j,r} \Sigma^{(l)}_{r,r} W^{(l)}_{r,k}}_{\text{term (A): sign of } Y^{(l+1)}_{j,r} \text{ uncertain}} + \underbrace{\sum_{r:\, L^{(l+1)}_{j,r} > 0 \text{ or } U^{(l+1)}_{j,r} < 0} Y^{(l+1)}_{j,r} \Sigma^{(l)}_{r,r} W^{(l)}_{r,k}}_{\text{term (B): sign of } Y^{(l+1)}_{j,r} \text{ known}}$   (13)

and bound them separately.

Observing the signs of each term $L^{(l+1)}_{j,r} W^{(l)}_{r,k}$ and $U^{(l+1)}_{j,r} W^{(l)}_{r,k}$, we take:

$L'^{(l)}_{j,k} = \sum_{r:\, W^{(l)}_{r,k} > 0} u^{(l)}_{\Sigma,r} L^{(l+1)}_{j,r} W^{(l)}_{r,k} + \sum_{r:\, W^{(l)}_{r,k} < 0} u^{(l)}_{\Sigma,r} U^{(l+1)}_{j,r} W^{(l)}_{r,k}$   (14)

and

$U'^{(l)}_{j,k} = \sum_{r:\, W^{(l)}_{r,k} > 0} u^{(l)}_{\Sigma,r} U^{(l+1)}_{j,r} W^{(l)}_{r,k} + \sum_{r:\, W^{(l)}_{r,k} < 0} u^{(l)}_{\Sigma,r} L^{(l+1)}_{j,r} W^{(l)}_{r,k}$.   (15)

The index constraint $L^{(l+1)}_{j,r} \le 0 \le U^{(l+1)}_{j,r}$ from term (A) is still in effect in (14) and (15), but we omit it for notational simplicity. Then we can show that $L'^{(l)}_{j,k}$ and $U'^{(l)}_{j,k}$ are a lower and an upper bound for term (A) in (13) as follows:

$L'^{(l)}_{j,k} \le \sum_{r:\, L^{(l+1)}_{j,r} \le 0 \le U^{(l+1)}_{j,r}} Y^{(l+1)}_{j,r} \Sigma^{(l)}_{r,r} W^{(l)}_{r,k} \le U'^{(l)}_{j,k}$,   (16)

where the middle expression is term (A) in (13).

For term (B) in (13), the sign of $Y^{(l+1)}_{j,r}$ does not change, since $L^{(l+1)}_{j,r} > 0$ or $U^{(l+1)}_{j,r} < 0$. Similar to what we did in (10) and (11), depending on the sign of $Y^{(l+1)}_{j,r} W^{(l)}_{r,k}$, we can lower/upper bound term (B) using $Y^{(l+1)}_{j,r}$ itself instead of its bounds $L^{(l+1)}_{j,r}$ and $U^{(l+1)}_{j,r}$. This gives much tighter bounds than naively using the worst-case bounds as we did for term (A). More specifically, we define matrices $\tilde{W}^{(l)}_{L}$ and $\tilde{W}^{(l)}_{U}$ for layer $l$ as below:

(17)
(18)

Then we can show the following lemma.

For any $j$ and $k$, we have

(19)

where the bounded quantity is term (B) in (13). Note that when the sign of $Y^{(l+1)}_{j,r}$ is fixed, i.e., $L^{(l+1)}_{j,r} > 0$ or $U^{(l+1)}_{j,r} < 0$ as in term (B), the bounds in (19) are always tighter than those in (16). After we know the sign of $Y^{(l+1)}_{j,r}$, we can fix $\Sigma^{(l)}_{r,r}$ to be either $l^{(l)}_{\Sigma,r}$ or $u^{(l)}_{\Sigma,r}$ according to the sign of $Y^{(l+1)}_{j,r} W^{(l)}_{r,k}$ and thus eliminate the uncertainty in $\Sigma^{(l)}_{r,r}$. Then we can plug this fixed value into the lower and upper bounds and merge the terms involving $\Sigma^{(l)}_{r,r}$, $W^{(l)}_{r,k}$ and $Y^{(l+1)}_{j,r}$, resulting in (19). Compared with using the worst-case bounds directly as in (16), we expand $Y^{(l+1)}_{j,r}$ and remove the uncertainty in $\Sigma^{(l)}_{r,r}$ in (19), and thus get much tighter bounds.

Finally, combining the proposition above (bounds for $Y^{(H-1)}$) and the lemma above (bounds for term (B)), we get the following recursive formula to bound $Y^{(l)}_{j,k}$.

For any $l < H-1$ and any $j, k$, we obtain $L^{(l)}_{j,k}$ by adding the lower bound (14) for term (A) to the lower bound of term (B) given by (19), and $U^{(l)}_{j,k}$ by adding the upper bound (15) for term (A) to the upper bound of term (B) given by (19), where $L'^{(l)}_{j,k}$, $U'^{(l)}_{j,k}$, $\tilde{W}^{(l)}_{L}$ and $\tilde{W}^{(l)}_{U}$ are defined in (14), (15), (17) and (18), respectively.

The lower and upper bounds of $Y^{(H-1)}$ in (10) and (11) can be viewed as a special case of the theorem above with $l = H-1$. Because $Y^{(H)} = W^{(H)}$ contains no uncertainty in this case, term (A) in the decomposition (13) vanishes. Moreover, the bounds of term (B) in (19) reduce to exactly (10) and (11) once we notice that $L^{(H)} = W^{(H)}$ and $U^{(H)} = W^{(H)}$ and specify an identity matrix for the (nonexistent) layer above. Specifying this identity matrix is equivalent to adding another identity layer to the neural network, which does not change any computation.

A recursive algorithm to bound $Y^{(l)}$.

Notice that the lower and upper bounds in the lemma above have exactly the same form as $Y^{(l+1)}$, with $W^{(l+1)}$ replaced by $W^{(l+1)} \tilde{W}^{(l)}_{L}$ and $W^{(l+1)} \tilde{W}^{(l)}_{U}$, respectively. Therefore, we can recursively apply the theorem above to obtain a lower and an upper bound for term (B). This recursive procedure further reduces the uncertainty in $\Sigma$ for all previous layers, improving the quality of the bounds significantly. We elaborate our recursive algorithm in Algorithm 1 for the case of a single output row, so we omit the last superscript in $L$ and $U$. When the network has multiple outputs, we can apply Algorithm 1 independently for each output.

1: Input: layer index $l$, weight matrix $\tilde{W}$ for the current recursion level
2: if $l = H$ then
3:     $L \leftarrow W^{(H)}$, $U \leftarrow W^{(H)}$  ▷ no uncertainty in $Y^{(H)}$
4: else if $l = H-1$ then
5:     Compute $L$ from (10), $U$ from (11)
6: else if $l < H-1$ then
7:     Compute $\tilde{W}_{L}$ from (17), $\tilde{W}_{U}$ from (18)
8:     $L_1, U_1$ = ComputeLU($l+1$, $\tilde{W}_{L}$)  ▷ Recursive call
9:     $L_2, U_2$ = ComputeLU($l+1$, $\tilde{W}_{U}$)  ▷ Recursive call
10:     Compute $L'$ from (14), $U'$ from (15)
11:     $L \leftarrow L'$ + lower bound of term (B) from the recursive calls
12:     $U \leftarrow U'$ + upper bound of term (B) from the recursive calls
13: end if
14: return $L$, $U$
Algorithm 1 ComputeLU (compute the lower and upper Jacobian bounds)

Compute the bounds in a forward manner.

In the previous sections, we start our computation from the last layer and bound $Y^{(l)}$ in a backward manner. By transposing (2), we have

$J^\top = {W^{(1)}}^\top \Sigma^{(1)} {W^{(2)}}^\top \Sigma^{(2)} \cdots \Sigma^{(H-1)} {W^{(H)}}^\top$.

Then we can apply Algorithm 1 to bound $J^\top$ according to the equation above. This is equivalent to starting from the first layer and bounding the partial products from layer 1 up to layer $H$. Because we obtain the bounds on the pre-activations in a forward manner by CROWN [Zhang et al.2018], the bounds (3) get looser as the layer index grows. Therefore, bounding the Jacobian in this forward fashion is expected to give tighter bounds, at least for the partial products involving the first few layers. In our experiments, we see that the bounds for $J$ obtained from the forward version are typically a little tighter than those obtained from the backward version. However, the "output" dimension in this case is $n_0$, the input dimension of the neural network. For image classification networks, $n_0$ is large, so the forward version has to apply Algorithm 1 $n_0$ times to obtain the final bounds, which increases the computational cost significantly compared to the backward version. We make a detailed comparison between the forward and backward versions in the experiments section.
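The quick numpy check below (toy shapes, ReLU activations, no biases; a sketch under these assumptions rather than the paper's code) confirms that transposing Eq. (2) simply reverses the factorization, which is what lets the same bounding routine run in the forward direction:

```python
import numpy as np

rng = np.random.default_rng(1)
W1, W2, W3 = rng.standard_normal((5, 3)), rng.standard_normal((4, 5)), rng.standard_normal((2, 4))
x = rng.standard_normal(3)

z1 = W1 @ x
S1 = np.diag((z1 > 0).astype(float))      # ReLU derivatives at z^(1)
z2 = W2 @ np.maximum(z1, 0)
S2 = np.diag((z2 > 0).astype(float))      # ReLU derivatives at z^(2)

J   = W3 @ S2 @ W2 @ S1 @ W1              # Eq. (2), backward ordering
J_T = W1.T @ S1 @ W2.T @ S2 @ W3.T        # transposed, forward ordering
assert np.allclose(J.T, J_T)
```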

Compute a local Lipschitz constant

After obtaining $L^{(1)}_{j,k}$ and $U^{(1)}_{j,k}$ for all $j$ and $k$, we define

$|Y^{(1)}_{j,k}| \le M_{j,k} := \max\big(|L^{(1)}_{j,k}|, |U^{(1)}_{j,k}|\big)$,   (20)

where the $\max$ and the inequality are taken element-wise. In the rest of this subsection, we simplify the notation to $L$, $U$ and $M$ when no confusion arises.

Recall that the local Lipschitz constant can be evaluated as

$L^{R}_{x_0} = \max_{x \in \mathbb{B}_p(x_0, R)} \|\nabla f(x)\|$.

Note that $\nabla f(x)$ is the Jacobian matrix and $\|\cdot\|$ denotes the operator norm induced by $d_i$ and $d_o$. Then we can bound the maximum norm of the Jacobian (the local Lipschitz constant) by considering its element-wise worst case. When $d_i$ and $d_o$ are both ordinary $\ell_p$-norm ($p \ge 1$) distances in Euclidean space, we denote the local Lipschitz constant as $L^{R}_{p,x_0}$, and it can be bounded as follows.

For any $p \ge 1$, we have

$L^{R}_{p,x_0} \le \|M\|_p$,   (21)

where $M$ is defined in (20) and $\|\cdot\|_p$ denotes the induced operator $p$-norm.
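Concretely, Eqs. (20)-(21) amount to forming the element-wise worst case and taking an induced norm; the values of L and U below are made up for illustration:

```python
import numpy as np

L = np.array([[-0.2, 0.1], [0.0, -0.5]])   # lower Jacobian bounds L^(1)
U = np.array([[ 0.3, 0.4], [0.2,  0.1]])   # upper Jacobian bounds U^(1)
M = np.maximum(np.abs(L), np.abs(U))       # Eq. (20)

lip_1   = np.linalg.norm(M, ord=1)         # induced 1-norm (max column sum)
lip_2   = np.linalg.norm(M, ord=2)         # spectral norm
lip_inf = np.linalg.norm(M, ord=np.inf)    # induced inf-norm (max row sum)
print(lip_1, lip_2, lip_inf)               # each upper-bounds L^R_{p, x0}
```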

Improve the bound for .

For the important case of upper bounding , we use an additional trick to improve the bound (21). We note that . As in (13), we decompose it into two terms

(22)

where , , and .

For term , we take the same bound as we have in (21), i.e.,