Towards Robust, Locally Linear Deep Networks

07/07/2019 ∙ by Guang-He Lee, et al. ∙ MIT

Deep networks realize complex mappings that are often understood by their locally linear behavior at or around points of interest. For example, we use the derivative of the mapping with respect to its inputs for sensitivity analysis, or to explain (obtain coordinate relevance for) a prediction. One key challenge is that such derivatives are themselves inherently unstable. In this paper, we propose a new learning problem to encourage deep networks to have stable derivatives over larger regions. While the problem is challenging in general, we focus on networks with piecewise linear activation functions. Our algorithm consists of an inference step that identifies a region around a point where linear approximation is provably stable, and an optimization step to expand such regions. We propose a novel relaxation to scale the algorithm to realistic models. We illustrate our method with residual and recurrent networks on image and sequence datasets.




1 Introduction

Complex mappings are often characterized by their derivatives at points of interest. Such derivatives with respect to the inputs play key roles across many learning problems, including sensitivity analysis. The associated local linearization is frequently used to obtain explanations for model predictions (Baehrens et al., 2010; Simonyan et al., 2013; Sundararajan et al., 2017; Smilkov et al., 2017); to form explicit first-order local approximations (Rifai et al., 2012; Goodfellow et al., 2015; Wang & Liu, 2016; Koh & Liang, 2017; Alvarez-Melis & Jaakkola, 2018b); or to guide learning through regularization of functional classes controlled by derivatives (Gulrajani et al., 2017; Bellemare et al., 2017; Mroueh et al., 2018). We emphasize that the derivatives discussed in this paper are with respect to the input coordinates rather than the parameters.

The key challenge lies in the fact that derivatives of functions parameterized by deep learning models are not stable in general (Ghorbani et al., 2019). State-of-the-art deep learning models (He et al., 2016; Huang et al., 2017) are typically over-parametrized (Zhang et al., 2017), leading to unstable functions as a by-product. The instability is reflected in both the function values (Goodfellow et al., 2015) and the derivatives (Ghorbani et al., 2019; Alvarez-Melis & Jaakkola, 2018a). Due to unstable derivatives, first-order approximations used for explanations also lack robustness (Ghorbani et al., 2019; Alvarez-Melis & Jaakkola, 2018a).

We note that gradient stability is a notion different from adversarial examples. A stable gradient can be large or small, so long as it remains approximately invariant within a local region. Adversarial examples, on the other hand, are small perturbations of the input that change the predicted output (Goodfellow et al., 2015). A large local gradient, whether stable or not in our sense, is likely to contribute to finding an adversarial example. Robust estimation techniques used to protect against adversarial examples (e.g., Madry et al., 2018) focus on stable function values rather than stable gradients, but can nevertheless indirectly impact (potentially help) gradient stability. A direct extension of robust estimation to ensure gradient stability would involve finding maximally distorted derivatives and would require access to approximate Hessians of deep networks.

In this paper, we focus on deep networks with piecewise linear activations to make the problem tractable. The special structure of this class of networks (functional characteristics) allows us to infer lower bounds on the margin — the maximum radius of ℓ_p-norm balls around a point within which derivatives are provably stable. In particular, we investigate the special case of p = 2, since the lower bound has an analytical solution and permits us to formulate a regularization problem to maximize it. The resulting objective is, however, rigid and non-smooth, and we further relax the learning problem in a manner resembling (locally) support vector machines (SVM) (Vapnik, 1995; Cortes & Vapnik, 1995).

Both the inference and learning problems in our setting require evaluating the gradient of each neuron with respect to the inputs, which poses a significant computational challenge. For piecewise linear networks, given D-dimensional data, we propose a novel perturbation algorithm that collects all the exact gradients by forward propagating D + 1 carefully crafted samples in parallel, without any back-propagation. When the GPU memory cannot fit the samples in one batch, we develop an unbiased approximation to the objective with a random subset of such samples.

Empirically, we examine our inference and learning algorithms with fully-connected (FC), residual (ResNet) (He et al., 2016), and recurrent (RNN) networks on image and time-series datasets with quantitative and qualitative experiments. The main contributions of this work are as follows:

  • Inference algorithms that identify input regions of neural networks, with piecewise linear activation functions, that are provably stable.

  • A novel learning criterion that effectively expands regions of provably stable derivatives.

  • Novel perturbation algorithms that scale computation to high dimensional data.

  • Empirical evaluation with several types of networks.

2 Related Work

For tractability reasons, we focus in this paper on neural networks with piecewise linear activation functions, such as ReLU (Glorot et al., 2011) and its variants (Maas et al., 2013; He et al., 2015; Arjovsky et al., 2016). Since the nonlinear behavior of deep models is mostly governed by the activation function, a neural network defined with affine transformations and piecewise linear activation functions is itself piecewise linear (Montufar et al., 2014). For example, FC networks, convolutional neural networks (CNN) (LeCun et al., 1998), RNN, and ResNet (He et al., 2016) are all plausible candidates under our consideration. We will call this class of networks piecewise linear networks throughout the paper.

The proposed approach is based on a mixed integer linear representation of piecewise linear networks, the activation pattern (Raghu et al., 2017), which encodes the active linear piece (integer) of the activation function for each neuron; once an activation pattern is fixed, the network degenerates to a linear model (linear). Thus the feasible set corresponding to an activation pattern in the input space is a natural region where derivatives are provably stable (the same linear function). Note the possible degenerate case where neighboring regions (with different activation patterns) nevertheless have the same end-to-end linear coefficients (Serra et al., 2018). We call the feasible set induced by an activation pattern (Serra et al., 2018) a linear region, and a maximal connected subset of the input space subject to the same derivatives of the network (Montufar et al., 2014) a complete linear region. Activation patterns have been studied in various contexts, such as visualizing neurons (Fischetti & Jo, 2017), reachability of a specific output value (Lomuscio & Maganti, 2017), their connection to vector quantization (Balestriero & Baraniuk, 2018), counting the number of linear regions of piecewise linear networks (Raghu et al., 2017; Montúfar, 2017; Serra et al., 2018), and adversarial attacks (Cheng et al., 2017; Fischetti & Jo, 2017; Weng et al., 2018) or defenses (Wong & Kolter, 2018). Note the distinction between locally linear regions of the functional mapping and decision regions defined by classes (Wong & Kolter, 2018; Yan et al., 2018; Mirman et al., 2018; Croce et al., 2019).

Here we elaborate on the differences between our work and the two most relevant categories above. In contrast to quantifying the number of linear regions as a measure of complexity, we focus on the local linear regions and try to expand them via learning. The notion of stability we consider differs from adversarial examples, and the methods themselves are also different. Finding the exact adversarial example is in general NP-complete (Katz et al., 2017; Sinha et al., 2018), and mixed integer linear programs that compute the exact adversarial example do not scale (Cheng et al., 2017; Fischetti & Jo, 2017). Layer-wise relaxations of ReLU activations (Weng et al., 2018; Wong & Kolter, 2018) are more scalable but yield bounds instead of exact solutions. Empirically, even relying on relaxations, the defense (learning) methods (Wong & Kolter, 2018; Wong et al., 2018) are still intractable on ImageNet-scale images (Deng et al., 2009). In contrast, our inference algorithm certifies the exact margin around a point subject to its activation pattern by forwarding samples in parallel.

In a high-dimensional setting, where it is computationally challenging to compute the learning objective, we develop an unbiased estimate via a simple sub-sampling procedure, which in practice scales to ResNet (He et al., 2016) on high-dimensional images.

The proposed learning algorithm is based on the inference problem with margins. The derivation is reminiscent of the SVM objective (Vapnik, 1995; Cortes & Vapnik, 1995), but differs in its purpose; while SVM training seeks to maximize the margin between data points and a linear classifier, our approach instead maximizes the margin of linear regions around each data point. Since there is no label information to guide the learning algorithm for each linear region, the objective is unsupervised and more akin to transductive/semi-supervised SVM (TSVM) (Vapnik & Sterin, 1977; Bennett & Demiriz, 1999). In the literature, the idea of margin has also been extended to nonlinear classifiers in terms of decision boundaries (Elsayed et al., 2018). Concurrently, Croce et al. (2019) also leverage the (raw) margin on small networks for adversarial training. In contrast, we develop a smooth relaxation of the margin and novel perturbation algorithms for gradient stability, which scale the computation to realistic networks.

The problem we tackle has implications for interpretability and transparency of complex models. The gradient has been a building block for various explanation methods for deep models, including gradient saliency map (Simonyan et al., 2013) and its variants (Springenberg et al., 2014; Sundararajan et al., 2017; Smilkov et al., 2017), which apply a gradient-based attribution of the prediction to the input with nonlinear post-processings for visualization (e.g., normalizing and clipping by the percentile (Smilkov et al., 2017; Sundararajan et al., 2017)). While one of the motivations for this work is the instability of gradient-based explanations (Ghorbani et al., 2019; Alvarez-Melis & Jaakkola, 2018a), we focus more generally on the fundamental problem of establishing robust derivatives.

3 Methodology

To simplify the exposition, the approaches are developed under the notation of FC networks with ReLU activations, which naturally generalizes to other settings. We first introduce notation, and then present our inference and learning algorithms. All the proofs are provided in Appendix A.

3.1 Notation

We consider a neural network with M hidden layers and N^i neurons in the i-th layer, together with the function f it represents. We use z^i(x) and a^i(x) to denote the vectors of (raw) neurons and activated neurons in the i-th layer, respectively. We will use x and a^0(x) interchangeably to represent an input instance from R^D. With an FC architecture and ReLU activations, each z^i(x) and a^i(x) is computed with the transformation matrix W^i and bias vector b^i as

z^i(x) = W^i a^{i-1}(x) + b^i,  a^i(x) = max(z^i(x), 0),  ∀ i ∈ [M],

where [M] denotes the set {1, …, M}. We use a subscript j to further denote a specific neuron, e.g., z^i_j(x). To avoid confusion with other instances x′, we assert that all the neurons are functions of the specific instance denoted by x. The output of the network is a linear transformation of the last hidden layer, f(x) = W^{M+1} a^M(x) + b^{M+1}, with W^{M+1} and b^{M+1}. The output can be further processed by a nonlinearity such as softmax for classification problems. However, we focus on the piecewise linear property of the network represented by f(x), and leverage a generic loss function L(y, f(x)) to fold in such nonlinear mechanisms.

We use D_tr to denote the set of training data (x, y), D_x to denote the same set without the labels y, and B_{p,ε}(x) = {x′ : ‖x′ − x‖_p ≤ ε} to denote the ℓ_p-ball around x with radius ε.

The activation pattern (Raghu et al., 2017) used in this paper is defined as:

Definition 1.

(Activation Pattern) An activation pattern is a set of indicators for neurons, O = {o^i ∈ {−1, 1}^{N^i} : i ∈ [M]}, that specifies the following functional constraints:

o^i_j z^i_j(x) ≥ 0, ∀ i ∈ [M], j ∈ [N^i].

Each o^i_j is called an activation indicator. Note that a point on the boundary of a linear region is feasible for multiple activation patterns. The definition fits the property of the activation pattern discussed in §2. We define ∇_x z^i_j(x) to be the sub-gradient found by back-propagation using an activation pattern O_x, whenever O_x is defined in the context.
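To make Definition 1 concrete, the following sketch computes the activation pattern of a tiny FC ReLU network in NumPy; the architecture and weights are illustrative only, not the paper's models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-hidden-layer FC ReLU network; weights are illustrative only.
W1, b1 = rng.standard_normal((5, 3)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((4, 5)), rng.standard_normal(4)

def preactivations(x):
    """Raw neurons z^1(x), z^2(x) before the ReLU."""
    z1 = W1 @ x + b1
    z2 = W2 @ np.maximum(z1, 0) + b2
    return [z1, z2]

def activation_pattern(x):
    """Activation indicators o^i_j in {-1, +1} satisfying o^i_j z^i_j(x) >= 0."""
    return [np.where(z >= 0, 1, -1) for z in preactivations(x)]

x = rng.standard_normal(3)
pattern = activation_pattern(x)
# Points sharing this pattern lie in the same linear region of the network.
```

Two nearby points with the same pattern are in the same linear region, so the network's Jacobian is identical at both.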

3.2 Inference for Regions with Stable Derivatives

Although the activation pattern implicitly describes a linear region, it does not yield explicit constraints on the input space, making it hard to develop algorithms directly. Hence, we first derive an explicit characterization of the feasible set in the input space with Lemma 2. (Footnote: Similar characterizations also appeared in (Balestriero & Baraniuk, 2018) and, concurrently with our work, (Croce et al., 2019).)

Lemma 2.

Given an activation pattern O with any feasible point x, each activation indicator o^i_j induces a feasible set S^i_j(O) ≜ {x′ ∈ R^D : o^i_j z^i_j(x′) ≥ 0}, and the feasible set of the activation pattern is equivalent to the intersection S(O) ≜ ∩_{i∈[M], j∈[N^i]} S^i_j(O).

Remark 3.

Lemma 2 characterizes each linear region of f as the feasible set S(O) defined by a set of linear constraints on the input space R^D; S(O) is thus a convex polyhedron.

The aforementioned linear property of an activation pattern, equipped with the input space constraints from Lemma 2, yields the definition of ε̂_{x,p}, the margin of x subject to its activation pattern:

ε̂_{x,p} ≜ max ε subject to B_{p,ε}(x) ⊆ S(O_x),

where O_x can be any feasible activation pattern at x. (Footnote: When x has multiple feasible activation patterns, ε̂_{x,p} is always 0 regardless of the choice of O_x.) Therefore, the sub-gradients ∇_x z^i_j(x) from now on can be taken arbitrarily, as long as consistency among sub-gradients is ensured with respect to some feasible activation pattern O_x. Note that ε̂_{x,p} is a lower bound of the margin subject to a derivative specification (i.e., a complete linear region).

3.2.1 Directional Verification and the Cases ℓ_1 and ℓ_∞

We first exploit the convexity of the feasible set S(O_x) to check the feasibility of a directional perturbation.

Proposition 4.

(Directional Feasibility) Given a point x, a feasible set S(O_x), and a unit vector Δx, if there exists ε ≥ 0 such that x + εΔx ∈ S(O_x), then f is linear in {x + ε′Δx : ε′ ∈ [0, ε]}.

The feasibility of x + εΔx can be computed by simply checking whether x + εΔx satisfies the activation pattern O_x. Proposition 4 can be applied to the feasibility problem on ℓ_1-balls.

Proposition 5.

(ℓ_1-ball Feasibility) Given a point x, a feasible set S(O_x), and an ℓ_1-ball B_{1,ε}(x) with extreme points x ± ε e_d (d ∈ [D]), if every extreme point is in S(O_x), then f is linear in B_{1,ε}(x).

Proposition 5 can be generalized to an ℓ_∞-ball. However, in high dimension D, the number of extreme points of an ℓ_∞-ball is exponential in D, making it intractable. In contrast, the number of extreme points of an ℓ_1-ball is only linear in D (x + ε e_d and x − ε e_d for each dimension d). With the above methods to verify feasibility, we can run binary searches to find certificates of the margins for directional perturbations and ℓ_1-balls. The details are in Appendix B.
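The binary-search certification of Proposition 4 and Appendix B can be sketched on a hypothetical toy network: we search for the largest step along a direction whose endpoint still satisfies the activation pattern of x.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy FC ReLU network with hypothetical weights.
W1, b1 = rng.standard_normal((5, 3)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((4, 5)), rng.standard_normal(4)

def pattern(p):
    """Boolean activation pattern (z >= 0) of all neurons at p."""
    z1 = W1 @ p + b1
    z2 = W2 @ np.maximum(z1, 0) + b2
    return np.concatenate([z1 >= 0, z2 >= 0])

def directional_margin(x, dx, hi=10.0, iters=40):
    """Binary search for the largest step eps (up to hi) such that x + eps*dx
    still satisfies the activation pattern of x; by convexity of the region,
    the whole segment [x, x + eps*dx] is then certified linear."""
    ref, lo = pattern(x), 0.0
    if np.array_equal(pattern(x + hi * dx), ref):
        return hi
    for _ in range(iters):
        mid = (lo + hi) / 2
        if np.array_equal(pattern(x + mid * dx), ref):
            lo = mid   # certified feasible: move the lower end up
        else:
            hi = mid
    return lo

x = rng.standard_normal(3)
dx = np.array([1.0, 0.0, 0.0])  # a coordinate direction
eps = directional_margin(x, dx)
```

For an ℓ_1-ball, the same check is repeated at the 2D extreme points x ± ε e_d, as in Proposition 5.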

3.2.2 The Case ℓ_2

The feasibility of an ℓ_1-ball is tractable due to the convexity of S(O_x), and its certification is efficient via binary search; by further exploiting the polyhedron structure of S(O_x), the ℓ_2 margin ε̂_{x,2} can be certified analytically.

Proposition 6.

(ℓ_2-ball Certificate) Given a point x, ε̂_{x,2} is the minimum ℓ_2 distance between x and the union of the hyperplanes induced by the neurons, {x′ : z^i_j(x′) = 0} for i ∈ [M], j ∈ [N^i].

To compute the distance between x and the hyperplane induced by a neuron z^i_j, we evaluate |z^i_j(x)| / ‖∇_x z^i_j(x)‖_2. If we denote I as the set of hidden neuron indices (i, j), then ε̂_{x,2} can be computed as min_{(i,j)∈I} |z^i_j(x)| / ‖∇_x z^i_j(x)‖_2, where all the z^i_j(x) can be computed by a single forward pass. (Footnote: Concurrently, Croce et al. (2019) find that the ℓ_p margin can be similarly computed with the dual norm ‖∇_x z^i_j(x)‖_q, where ℓ_q is the dual of the ℓ_p-norm.) We will show in §4.1 that all the ∇_x z^i_j(x) can also be computed efficiently by forward passes in parallel. We refer readers to Figure 1(c) for a visualization of the certified margins.
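The analytic ℓ_2 certificate of Proposition 6 can be sketched for a toy two-layer ReLU network as follows; the weights are hypothetical, and the neuron gradients are written in closed form rather than collected by the parallel procedure of §4.1.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy FC ReLU network with hypothetical weights.
W1, b1 = rng.standard_normal((5, 3)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((4, 5)), rng.standard_normal(4)

def l2_margin(x):
    """min over neurons of |z| / ||grad z||_2, i.e., the distance from x to
    the nearest neuron hyperplane, with gradients in closed form."""
    z1 = W1 @ x + b1
    mask = (z1 > 0).astype(float)
    z2 = W2 @ (mask * z1) + b2    # equals W2 @ relu(z1) + b2
    J1 = W1                       # gradient of each z^1_j w.r.t. x
    J2 = (W2 * mask) @ W1         # chain rule through the frozen ReLU mask
    zs = np.concatenate([z1, z2])
    Js = np.vstack([J1, J2])
    return np.min(np.abs(zs) / np.linalg.norm(Js, axis=1))

x = rng.standard_normal(3)
eps_hat = l2_margin(x)
# The activation pattern (hence the Jacobian) is constant within the
# l2-ball of radius eps_hat around x.
```

Every |z^i_j(x)| comes from one forward pass; only the gradient norms need the machinery of §4.1 in deeper models.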

3.2.3 The Number of Complete Linear Regions

The sizes of linear regions are related to their overall number, especially if we consider a bounded input space. Counting the number of linear regions of f is, however, intractable due to the combinatorial nature of the activation patterns (Serra et al., 2018). We argue that counting the number of linear regions over the whole space does not capture the structure of the data manifold, and we instead propose to certify the number of complete linear regions (#CLR) of f among the data points D_x, which turns out to be efficient to compute given a mild condition. Here we use |C| to denote the cardinality of a set C, and we have

Lemma 7.

(Complete Linear Region Certificate) If every data point x ∈ D_x has only one feasible activation pattern, denoted as O_x, then the number of complete linear regions of f among D_x is upper-bounded by the number of distinct activation patterns |{O_x : x ∈ D_x}|, and lower-bounded by the number of distinct Jacobians |{∇_x f(x) : x ∈ D_x}|.
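The bounds of Lemma 7 are cheap to evaluate: hash the activation pattern and the Jacobian at each data point and count distinct values. A toy sketch with a hypothetical one-hidden-layer network:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical one-hidden-layer network with a linear output layer.
W1, b1 = rng.standard_normal((5, 3)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((2, 5)), rng.standard_normal(2)

def pattern_and_jacobian(x):
    """Hidden-layer activation pattern and the end-to-end Jacobian,
    which is constant inside the corresponding linear region."""
    z1 = W1 @ x + b1
    mask = (z1 > 0).astype(float)
    pat = tuple(bool(s) for s in z1 >= 0)
    J = (W2 * mask) @ W1                 # df/dx within this region
    return pat, J

X = rng.standard_normal((100, 3))        # stand-in for the data points
patterns, jacobians = set(), set()
for x in X:
    pat, J = pattern_and_jacobian(x)
    patterns.add(pat)
    jacobians.add(J.round(8).tobytes())

upper, lower = len(patterns), len(jacobians)  # Lemma 7 bounds on #CLR
```

The lower bound can be smaller than the upper bound exactly in the degenerate case noted in §2, where distinct patterns share the same end-to-end linear coefficients.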

3.3 Learning: Maximizing the Margins of Stable Derivatives

In this section, we focus on methods aimed at maximizing the ℓ_2 margin ε̂_{x,2}, since it is (sub-)differentiable. We first formulate a regularization problem whose objective maximizes the margin:

min_θ Σ_{(x,y)∈D_tr} L(y, f(x)) − λ ε̂_{x,2},   (4)

However, the objective itself is rather rigid, due to the inner minimization inside ε̂_{x,2} and the reciprocal of the gradient norm it contains. Qualitatively, such a rigid loss surface hinders optimization and may diverge. To alleviate the problem, we apply a hinge-based relaxation to the distance function, similar to SVM.

3.3.1 Relaxation

An ideal relaxation of Eq. (4) would disentangle the neuron values |z^i_j(x)| from the gradient norms ‖∇_x z^i_j(x)‖_2 for a smoother problem. Our first attempt is to formulate an equivalent problem with special constraints that we can leverage.

Lemma 8.

If there exists a (global) optimal solution of Eq. (4) that satisfies the constraints of Eq. (5), then every optimal solution of Eq. (5) is also optimal for Eq. (4).


If the condition in Lemma 8 does not hold, Eq. (5) is still a valid upper bound of Eq. (4) due to its smaller feasible set. An upper bound of Eq. (5) can consequently be obtained from the constraints:


We then derive a relaxation that solves a smoother problem by relaxing the square root and reciprocal of the norm, as well as the hard constraint, into a hinge loss, yielding a soft regularization problem:


where the hinge threshold is a hyper-parameter. The relaxed regularization problem can be regarded as a maximum aggregation of TSVM losses among all the neurons, where a TSVM loss with only unannotated data can be written as:


which pursues the similar goal of maximizing the margin in a linear model scenario, where the margin is computed between a linear hyperplane (the classifier) and the training points.

To visualize the effect of the proposed methods, we construct a toy 2D binary classification dataset, and train a 4-layer fully connected network with 1) the (vanilla) binary cross-entropy loss, 2) the distance regularization as in Eq. (4), and 3) the relaxed regularization as in Eq. (7). Implementation details are in Appendix F. The resulting piecewise linear regions and prediction heatmaps, along with gradient annotations, are shown in Figure 1. The distance regularization enlarges the linear regions around each training point, and the relaxed regularization further generalizes the property to the whole space; the relaxed regularization possesses a smoother prediction boundary, and has a special central region where the gradients vanish, allowing the gradients to change directions smoothly.

(a) Vanilla loss
(b) Distance regularization
(c) Relaxed regularization
Figure 1: Toy examples of a synthetic 2D classification task. For each model (regularization type), we show a prediction heatmap (smaller pane) and the corresponding locally linear regions. The boundary of each linear region is plotted with line segments, and each circle shows the margin around the training point. The gradient is annotated as arrows with length proportional to its norm.

3.3.2 Improving Sparse Learning Signals

Since a linear region is shaped by the set of neurons that are “close” to a given point, a noticeable problem of Eq. (7) is that it only focuses on the “closest” neuron, making it hard to scale the effect to large networks. Hence, we generalize the relaxed loss in Eq. (7) to a set of neurons that incur high losses for the given point. We denote I(x) as the set of neurons with the highest relaxed (TSVM) losses on x, up to a tunable percentage. The generalized loss is our final objective for learning RObust Local Linearity (Roll) and is written as:


A special case of Eq. (9) arises when the percentage is set to 100% (i.e., I(x) contains all the neurons), where the nonlinear sorting step effectively disappears. Such a simple additive structure without a nonlinear sorting step can stabilize the training process, is simple to parallelize, and allows for an approximate learning algorithm, as will be developed in §4.2. Besides, taking all the neurons can induce a strong synergy effect, as the gradient norms in Eq. (9) between any two layers are highly correlated.

4 Computation, Approximate Learning, and Compatibility

4.1 Parallel Computation of Gradients

The margin ε̂_{x,2} and the Roll loss in Eq. (9) demand heavy computation of gradient norms. While calling back-propagation once per neuron is intractable, we develop a parallel algorithm that avoids back-propagation entirely by exploiting the functional structure of the network.

Given an activation pattern, we know that each hidden neuron is also a linear function of the input. We can construct another linear network that is identical to the original network within the linear region, based on the same set of parameters but with fixed linear activation functions constructed to mimic the behavior of the original activations in the region. Due to the linearity of this surrogate network, the derivatives of all the neurons with respect to an input axis can be computed by forwarding two samples: subtracting the neurons evaluated at a one-hot input from the same neurons evaluated at a zero input. The procedure can be amortized and parallelized across all the dimensions by feeding D + 1 samples to the surrogate network in parallel. We remark that the algorithm generalizes to all piecewise linear networks, and refer readers to Appendix C for algorithmic details. (Footnote: When the network is FC (or can efficiently be represented as such), one can use a dynamic programming algorithm to compute the gradients (Papernot et al., 2016), which is included in Appendix D.)
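A sketch of this perturbation idea on a toy FC network: we freeze the ReLU states at the reference point x, which makes the surrogate network exactly linear, and recover every neuron's gradient from one batched forward pass over the D one-hot inputs plus the zero input. The helper names and architecture are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(4)
D = 3  # input dimension

# Toy FC ReLU network; weights are illustrative.
W1, b1 = rng.standard_normal((5, D)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((4, 5)), rng.standard_normal(4)

x = rng.standard_normal(D)
mask = ((W1 @ x + b1) > 0).astype(float)  # ReLU states frozen at x

def forward_masked(X):
    """Surrogate linear network: ReLU replaced by the fixed 0/1 mask, so the
    map is exactly linear and agrees with the true network near x."""
    Z1 = X @ W1.T + b1
    Z2 = (mask * Z1) @ W2.T + b2
    return np.concatenate([Z1, Z2], axis=1)  # all neuron values

# One batched forward pass over D one-hot inputs plus the zero input:
# for a linear map g, the d-th partial derivative is g(e_d) - g(0).
batch = np.concatenate([np.eye(D), np.zeros((1, D))], axis=0)
out = forward_masked(batch)
grads = (out[:D] - out[D]).T  # row j = gradient of neuron j w.r.t. x
```

No backward pass is involved; the whole Jacobian of every neuron comes from a single batched forward evaluation.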

To analyze the complexity of the proposed approach, we assume that parallel computation incurs no overhead and that a batched matrix multiplication takes a unit operation. To compute the gradients of all the neurons for a batch of inputs, our perturbation algorithm takes a number of batched operations proportional to the depth of the network, while back-propagation must instead be invoked once per neuron. The detailed analysis is also in Appendix C.

4.2 Approximate Learning

Despite the parallelizable computation of the gradients, it is still challenging to compute the loss for large networks in a high-dimensional setting, where even calling the D + 1 forward passes in parallel as in §4.1 is infeasible due to memory constraints. Hence we propose an unbiased estimator of the Roll loss in Eq. (9) for this regime. Note that the neuron values are already computable in a single forward pass. For the sum of gradient norms, we use an equivalent decoupling that rewrites the sum as a scaled expectation over uniformly sampled input axes (Eq. (10)), where the summation inside the expectation can be efficiently computed using the procedure in §4.1 and is in general storable within GPU memory. In practice, we uniformly sample a small number of input axes to obtain an unbiased approximation to Eq. (10); computing all the partial derivatives with respect to the sampled axes requires only a correspondingly small multiple of the memory of a typical forward pass for x (one one-hot vector per sampled axis plus a zero vector).
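The sub-sampling idea can be illustrated on sums of squared gradient norms, for which coordinate sampling gives an exactly unbiased estimate: ‖v‖²_2 = D · E_d[v_d²] for an axis d drawn uniformly from the D axes. This is an illustration under that squared-norm assumption, not a verbatim transcription of Eq. (10).

```python
import numpy as np

rng = np.random.default_rng(5)
D, n_neurons = 100, 8
J = rng.standard_normal((n_neurons, D))  # stand-in for all neuron gradients

full = (J ** 2).sum()  # sum over neurons of squared gradient norms

def estimate(J, n_axes, rng):
    """Rescaled sum over a random subset of coordinates; unbiased because
    each axis is included with probability n_axes / D."""
    d = rng.choice(J.shape[1], size=n_axes, replace=False)
    return J.shape[1] / n_axes * (J[:, d] ** 2).sum()

# Averaging many subsampled estimates recovers the full value.
est = np.mean([estimate(J, 10, rng) for _ in range(5000)])
```

With 10 of 100 axes per step, each estimate touches only a tenth of the coordinates, matching the memory argument above.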

4.3 Compatibility

The proposed algorithms can be used with all deep learning models built from affine transformations and piecewise linear activation functions, by enumerating every neuron on which a ReLU-like activation function is imposed. They do not immediately generalize to the nonlinearity of maxout/max-pooling (Goodfellow et al., 2013), which also yields a piecewise linear function. We provide an initial step towards doing so in Appendix E, but we suggest using average-pooling or convolution with large strides instead, since, unlike max-pooling, they do not induce extra linear constraints and do not in general yield a significant difference in performance (Springenberg et al., 2014).

5 Experiments

In this section, we compare our approach (‘Roll’) with a baseline model trained with the same procedure except for the regularization (‘vanilla’) in several scenarios. All the reported quantities are computed on a testing set. Experiments are run on a single GPU.

5.1 MNIST

Table 1: FC networks on the MNIST dataset. #CLR is the number of complete linear regions among the 10K testing points, and the ε̂_{x,2} columns show the margin at each percentile of the testing data.

Evaluation Measures: 1) accuracy (ACC), 2) the number of complete linear regions (#CLR), and 3) the margins of linear regions ε̂_{x,2}. We compute the margin ε̂_{x,2} for each testing point x, and evaluate it at 4 different percentiles among the testing data.

We use a standard training/validation/testing split of the MNIST dataset. Experiments are conducted on a 4-layer FC model with ReLU activations. The implementation details are in Appendix G. We report the two models with the largest median validation margin, selected under the same and under slightly lower validation accuracy compared to the baseline model, respectively.

The results are shown in Table 1. The hyper-parameters of the two tuned models differ, as shown in the table. The condition in Lemma 7 for certifying #CLR is satisfied with matching upper and lower bounds, so a single number is reported. Given the same performance, the Roll loss achieves markedly larger margins than the vanilla loss for most of the percentiles; by trading off accuracy, still larger margins can be achieved. The Spearman's rank correlation between ε̂_{x,1} and ε̂_{x,2} among the testing data is at least 0.98 in all the cases. The lower #CLR of our approach compared to the baseline model reflects the existence of larger linear regions that span multiple testing points. All the points inside the same linear region in the Roll model with the matched ACC have the same label, while there are visually similar digits in the same linear region in the other Roll model. We conduct a parameter analysis in Figure 2, showing the ACC and the median margin under varying regularization hyper-parameters while the others are fixed. As expected, stronger regularization decreases the accuracy and increases the margin. The smoothness of the curves indicates low sensitivity to the hyper-parameters.

To validate the efficiency of the proposed method, we measure the average running time of a complete mini-batch gradient descent step (starting from the forward pass). We compare 1) the vanilla loss, 2) the full Roll loss in Eq. (9) computed by back-propagation, 3) the same loss computed by our perturbation algorithm, and 4) the approximate Roll loss in Eq. (10) computed by perturbation with a subset of sampled axes. The results are shown in Table 2. The accuracy and margins of the approximate Roll loss are comparable to those of the full loss. Overall, our approach is only twice as slow as the vanilla loss, and the approximate loss is about 9 times faster than the full loss. Compared to back-propagation, our perturbation algorithm achieves about a 12-fold empirical speed-up. In summary, the computational overhead of our method over the vanilla loss is minimal, which is achieved by the perturbation algorithm and the approximate loss.

Figure 2: Parameter analysis on the MNIST dataset; the reported margin is the median of ε̂_{x,2} over the testing data.

Table 2: Running time of a gradient descent step of FC networks on the MNIST dataset. Columns: vanilla loss, full Roll loss computed by back-propagation, full Roll loss computed by our perturbation algorithm, and the sampled approximation computed by perturbation. The full setting refers to Eq. (9) with all neurons included, and the sampled setting approximates Eq. (10) with a subset of input axes.

5.2 Speaker Identification

Table 3: RNNs on the Japanese Vowel dataset. The ε̂_{x,2} columns show the margin at each percentile of the testing data (the larger the better).

Figure 3: Stability bounds on derivatives on the Japanese Vowel dataset. Panels (a) and (b) show the channels of the sequences that yield two selected percentiles of ε̂_{x,2} on the Roll model.

We train RNNs for speaker identification on the Japanese Vowel dataset from the UCI machine learning repository (Dheeru & Karra Taniskidou, 2017) with the official training/testing split. (Footnote: The regularization hyper-parameter is tuned on the testing set, and thus the performance should be interpreted as validation.) The dataset has variable sequence lengths between 7 and 29, with 12 channels and 9 classes. We implement the network with the state-of-the-art scaled Cayley orthogonal RNN (scoRNN) (Helfrich et al., 2018), which parameterizes the transition matrix of the RNN using orthogonal matrices to prevent vanishing/exploding gradients, with LeakyReLU activations. The implementation details are in Appendix H. The reported models are selected by the same criterion as in §5.1.

The results are reported in Table 3. With the same or slightly inferior ACC, our approach leads to models with about 4 or 20 times larger margins, respectively, across the percentiles of the testing data, compared to the vanilla loss. The Spearman's rank correlation between ε̂_{x,1} and ε̂_{x,2} is high in all the cases. We also conduct a sensitivity analysis on the derivatives by computing the directional margin along each coordinate, which identifies the stability bound at each timestamp and channel that guarantees stable derivatives. The visualization using the vanilla model and our Roll model with the same ACC is in Figure 3. Qualitatively, the stability bounds under the Roll regularization are consistently larger than those of the vanilla model.

5.3 Caltech-256

Table 4: ResNet on Caltech-256. P@1 and P@5 denote precision at 1 and 5; the remaining columns report the gradient distortion at four percentiles among the testing data (the smaller the better), for the expected distortion (first four) and the maximum distortion (last four).

Loss    | P@1   | P@5   | expected distortion          | maximum distortion
Vanilla | 80.7% | 93.4% | 583.8 777.4 1041.9 3666.7    | 840.9 1118.2 1477.6 5473.5
Roll    | 80.8% | 94.1% | 540.6 732.0 948.7 2652.2     | 779.9 1046.7 1368.2 3882.8
(a) Image            (Laptop)
(b) Orig. gradient (Roll)
(c) Adv. gradient (Roll)
(d) Orig. gradient (Vanilla)
(e) Adv. gradient (Vanilla)
(f) Image            (Bear)
(g) Orig. gradient (Roll)
(h) Adv. gradient (Roll)
(i) Orig. gradient (Vanilla)
(j) Adv. gradient (Vanilla)
Figure 4: Visualization of the examples in Caltech-256 that yield two selected percentiles of the maximum gradient distortion among the testing data on our Roll model (top and bottom rows). The adversarial gradient is found by maximizing the distortion over an ϵ-ball around the image.

We conduct experiments on Caltech-256 (Griffin et al., 2007), which has 256 classes, each with at least 80 images. We downsize the images and train a ResNet (He et al., 2016), initializing from parameters pre-trained on ImageNet (Deng et al., 2009). The approximate Roll loss in Eq. (10) is used with random samples of input axes on each channel. We randomly select 5 and 15 samples per class as the validation and testing sets, respectively, and put the remaining data into the training set. The implementation details are in Appendix I.

Evaluation Measures: Due to the high input dimensionality, computing the certificates is computationally challenging without a cluster of GPUs. Hence, we turn to a sample-based approach to evaluate the stability of the gradients of the ground-truth label in a local region, with the goal of revealing stability across different linear regions. Note that evaluating the gradient of the prediction instead would be problematic for comparing different models in this case.

Given labeled data , we evaluate the stability of the gradient in terms of the expected distortion (over a uniform distribution) and the maximum distortion within the intersection of an -ball and the domain of images . The gradient distortion is defined as . For a fixed x, we refer to the maximizer as the adversarial gradient. Computing the maximum distortion requires optimization, but gradient-based optimization is not applicable since the gradient of the loss involves the Hessian, which is either zero or ill-defined due to piecewise linearity. Hence, we use a genetic algorithm (Whitley, 1994) for black-box optimization; implementation details are provided in Appendix J. We use samples to approximate the expected distortion. Due to computational limits, we evaluate only random images from the testing set for both the maximum and expected gradient distortions. The -ball radius is set to .
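For concreteness, the sampling-based evaluation can be sketched on a toy piecewise linear network. Everything here is an illustrative assumption: the network and its sizes are arbitrary, and plain uniform sampling stands in for the paper's genetic algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-hidden-layer ReLU network f(x) = w2 . relu(W1 x + b1); its input
# gradient is piecewise constant, so distortion reflects region changes.
W1 = rng.standard_normal((8, 4)); b1 = rng.standard_normal(8)
w2 = rng.standard_normal(8)

def grad_f(x):
    """Gradient of the scalar output w.r.t. the input; constant inside a linear region."""
    a = (W1 @ x + b1 > 0).astype(float)   # activation pattern at x
    return (w2 * a) @ W1

def max_distortion(x, eps, n_samples=1000):
    """Approximate the maximum relative gradient distortion over the l-inf
    eps-ball by uniform sampling (the paper instead uses a genetic algorithm)."""
    g0 = grad_f(x)
    denom = np.linalg.norm(g0) + 1e-12    # guard against a zero gradient
    worst = 0.0
    for _ in range(n_samples):
        xp = x + rng.uniform(-eps, eps, size=x.shape)
        worst = max(worst, np.linalg.norm(grad_f(xp) - g0) / denom)
    return worst

x = rng.standard_normal(4)
print(max_distortion(x, eps=0.5))
```

With a vanishingly small radius, all samples stay in the linear region at x and the distortion is exactly zero; larger radii cross neuron boundaries and the distortion grows.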

The results, along with precision at 1 and 5 (P@1 and P@5), are presented in Table 4. The Roll loss yields more stable gradients than the vanilla loss, with marginally superior precision. Out of the examined examples x, only and gradient-distorted images change prediction labels under the Roll and vanilla models, respectively. We visualize some examples in Figure 4 with the original and adversarial gradients for each loss. Qualitatively, the Roll loss yields stable shapes and intensities of gradients, while the vanilla loss does not. More examples with integrated gradient attributions (Sundararajan et al., 2017) are provided in Appendix K.

6 Conclusion

This paper introduces a new learning problem that endows deep models with robust local linearity. The central aim is to construct locally transparent neural networks, whose derivatives faithfully approximate the underlying function and thus serve as stable tools for further applications. We focus on piecewise linear networks and solve the problem based on a margin principle similar to that of SVMs. Empirically, the proposed Roll loss expands regions with provably stable derivatives and further generalizes the stable-gradient property across linear regions.


The authors acknowledge support for this work by a grant from Siemens Corporation, thank the anonymous reviewers for their helpful comments, and thank Hao He and Yonglong Tian for helpful discussions.


  • Alvarez-Melis & Jaakkola (2018a) David Alvarez-Melis and Tommi S. Jaakkola. On the robustness of interpretability methods. 2018 ICML Workshop on Human Interpretability in Machine Learning (WHI 2018), 2018a.
  • Alvarez-Melis & Jaakkola (2018b) David Alvarez-Melis and Tommi S. Jaakkola. Towards robust interpretability with self-explaining neural networks. In Advances in Neural Information Processing Systems, pp. 7786–7795, 2018b.
  • Arjovsky et al. (2016) Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In Proceedings of the International Conference on Machine Learning, pp. 1120–1128, 2016.
  • Baehrens et al. (2010) David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. Journal of Machine Learning Research, 11(Jun):1803–1831, 2010.
  • Balestriero & Baraniuk (2018) Randall Balestriero and Richard Baraniuk. Mad max: Affine spline insights into deep learning. arXiv preprint arXiv:1805.06576v5, 2018.
  • Bellemare et al. (2017) Marc G Bellemare, Ivo Danihelka, Will Dabney, Shakir Mohamed, Balaji Lakshminarayanan, Stephan Hoyer, and Rémi Munos. The cramer distance as a solution to biased wasserstein gradients. arXiv preprint arXiv:1705.10743, 2017.
  • Bennett & Demiriz (1999) Kristin P Bennett and Ayhan Demiriz. Semi-supervised support vector machines. In Advances in Neural Information processing systems, pp. 368–374, 1999.
  • Cheng et al. (2017) Chih-Hong Cheng, Georg Nührenberg, and Harald Ruess. Maximum resilience of artificial neural networks. In International Symposium on Automated Technology for Verification and Analysis, pp. 251–268. Springer, 2017.
  • Cortes & Vapnik (1995) Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.
  • Croce et al. (2019) Francesco Croce, Maksym Andriushchenko, and Matthias Hein. Provable robustness of relu networks via maximization of linear regions. Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2019.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.
  • Dheeru & Karra Taniskidou (2017) Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017.
  • Elsayed et al. (2018) Gamaleldin Elsayed, Dilip Krishnan, Hossein Mobahi, Kevin Regan, and Samy Bengio. Large margin deep networks for classification. In Advances in Neural Information Processing Systems, pp. 850–860, 2018.
  • Fischetti & Jo (2017) Matteo Fischetti and Jason Jo. Deep neural networks as 0-1 mixed integer linear programs: A feasibility study. arXiv preprint arXiv:1712.06174, 2017.
  • Ghorbani et al. (2019) Amirata Ghorbani, Abubakar Abid, and James Zou. Interpretation of neural networks is fragile. Proceedings of the AAAI Conference on Artificial Intelligence, 2019.
  • Glorot et al. (2011) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 315–323, 2011.
  • Goodfellow et al. (2013) Ian Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pp. 1319–1327, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.
  • Goodfellow et al. (2015) Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
  • Griffin et al. (2007) Gregory Griffin, Alex Holub, and Pietro Perona. Caltech-256 object category dataset. 2007.
  • Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pp. 5767–5777, 2017.
  • He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034, 2015.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
  • Helfrich et al. (2018) Kyle Helfrich, Devin Willmott, and Qiang Ye. Orthogonal recurrent neural networks with scaled cayley transform. Proceedings of the International Conference on Machine Learning, 2018.
  • Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708, 2017.
  • Katz et al. (2017) Guy Katz, Clark Barrett, David L Dill, Kyle Julian, and Mykel J Kochenderfer. Reluplex: An efficient smt solver for verifying deep neural networks. In International Conference on Computer Aided Verification, pp. 97–117. Springer, 2017.
  • Kingma & Ba (2015) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  • Koh & Liang (2017) Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1885–1894, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Lomuscio & Maganti (2017) Alessio Lomuscio and Lalit Maganti. An approach to reachability analysis for feed-forward relu neural networks. arXiv preprint arXiv:1706.07351, 2017.
  • Maas et al. (2013) Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, pp. 3, 2013.
  • Madry et al. (2018) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
  • Mirman et al. (2018) Matthew Mirman, Timon Gehr, and Martin Vechev. Differentiable abstract interpretation for provably robust neural networks. In International Conference on Machine Learning, pp. 3575–3583, 2018.
  • Montúfar (2017) Guido Montúfar. Notes on the number of linear regions of deep neural networks. 2017.
  • Montufar et al. (2014) Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in neural information processing systems, pp. 2924–2932, 2014.
  • Mroueh et al. (2018) Youssef Mroueh, Chun-Liang Li, Tom Sercu, Anant Raj, and Yu Cheng. Sobolev GAN. In International Conference on Learning Representations, 2018.
  • Papernot et al. (2016) Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 372–387. IEEE, 2016.
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
  • Raghu et al. (2017) Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl Dickstein. On the expressive power of deep neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2847–2854. JMLR. org, 2017.
  • Reddi et al. (2018) Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. In International Conference on Learning Representations, 2018.
  • Rifai et al. (2012) Salah Rifai, Yoshua Bengio, Yann Dauphin, and Pascal Vincent. A generative process for sampling contractive auto-encoders. Proceedings of the 29th International Conference on Machine Learning, 2012.
  • Serra et al. (2018) Thiago Serra, Christian Tjandraatmadja, and Srikumar Ramalingam. Bounding and counting linear regions of deep neural networks. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 4558–4566. PMLR, 10–15 Jul 2018.
  • Simonyan et al. (2013) Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
  • Sinha et al. (2018) Aman Sinha, Hongseok Namkoong, and John Duchi. Certifying some distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.
  • Smilkov et al. (2017) Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
  • Springenberg et al. (2014) Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
  • Sundararajan et al. (2017) Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3319–3328. JMLR. org, 2017.
  • Vapnik (1995) Vladimir N. Vapnik. Estimation of dependences based on empirical data. 1982. NY: Springer-Verlag, 1995.
  • Vapnik & Sterin (1977) Vladimir N. Vapnik and A. Sterin. On structural risk minimization or overall risk in a problem of pattern recognition. Automation and Remote Control, 10(3):1495–1503, 1977.
  • Wang & Liu (2016) Dilin Wang and Qiang Liu. Learning to draw samples: With application to amortized mle for generative adversarial learning. arXiv preprint arXiv:1611.01722, 2016.
  • Weng et al. (2018) Tsui-Wei Weng, Huan Zhang, Hongge Chen, Zhao Song, Cho-Jui Hsieh, Duane Boning, Inderjit S Dhillon, and Luca Daniel. Towards fast computation of certified robustness for relu networks. Proceedings of the International Conference on Machine Learning, 2018.
  • Whitley (1994) Darrell Whitley. A genetic algorithm tutorial. Statistics and computing, 4(2):65–85, 1994.
  • Wong & Kolter (2018) Eric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In Proceedings of the International Conference on Machine Learning, pp. 5283–5292, 2018.
  • Wong et al. (2018) Eric Wong, Frank Schmidt, Jan Hendrik Metzen, and J Zico Kolter. Scaling provable adversarial defenses. In Advances in Neural Information Processing Systems, pp. 8410–8419, 2018.
  • Yan et al. (2018) Ziang Yan, Yiwen Guo, and Changshui Zhang. Deep defense: Training dnns with improved adversarial robustness. In Advances in Neural Information Processing Systems, pp. 417–426, 2018.
  • Zhang et al. (2017) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. International Conference on Learning Representations, 2017.

Appendix A Proofs

A.1 Proof of Lemma 2

Lemma 2.

Given an activation pattern with any feasible point x, each activation indicator induces a feasible set , and the feasible set of the activation pattern is equivalent to .


Proof. For , we have . If is feasible for the fixed activation pattern , this is equivalent to satisfying the linear constraint


in the first layer.

Assume satisfies all the constraints before layer . If all the previous layers follow the fixed activation indicators, we can equivalently rewrite each


Then for , is a fixed linear function of x with linear weights equal to by construction. If is also feasible for the fixed activation indicator , this is equivalent to also satisfying the linear constraint


The proof follows by induction. ∎
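The induction can be checked numerically on a small two-layer ReLU network (a toy model with arbitrary weights, not the paper's architecture): once the activation pattern is fixed, the network is an exact affine map on the feasible set.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy two-layer ReLU network; weights are illustrative, not from the paper.
W1 = rng.standard_normal((6, 3)); b1 = rng.standard_normal(6)
W2 = rng.standard_normal((2, 6)); b2 = rng.standard_normal(2)

def forward(x):
    z = W1 @ x + b1
    a = (z > 0).astype(float)        # activation pattern at x
    return W2 @ (a * z) + b2, a

def jacobian(a):
    # With the pattern fixed, the network is the affine map W2 diag(a) W1 + const.
    return W2 @ (a[:, None] * W1)

x = rng.standard_normal(3)
y, a = forward(x)
# Find a nearby point that keeps the same pattern, i.e. stays in the feasible set.
step = 1e-3
while True:
    xp = x + step * rng.standard_normal(3)
    yp, ap = forward(xp)
    if np.array_equal(a, ap):
        break
    step /= 10
# Within the feasible set the exact linearization reproduces the output.
assert np.allclose(jacobian(a) @ (xp - x) + y, yp)
```

The shrinking step guarantees we land in the same feasible set; there the linearization error is exactly zero, not merely first-order small.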

A.2 Proof of Proposition 4

Proposition 4.

(Directional Feasibility) Given a point x, a feasible set and a unit vector , if such that , then is linear in .


Proof. Since is a convex set and , . ∎

A.3 Proof of Proposition 5

Proposition 5.

(-ball Feasibility) Given a point x, a feasible set , and an -ball with extreme points , if , then is linear in .


Proof. is a convex set and . Hence, every is a convex combination of , which implies . ∎
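A minimal sketch of the proposition for a single ReLU layer (toy weights; for deeper networks the extreme-point check is taken over all linear constraints in the feasible set): if every extreme point of the ℓ∞-ball shares the activation pattern of x, so does every interior point.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
# One ReLU layer: each neuron's hyperplane w_i . x + b_i = 0 bounds the feasible set.
W = rng.standard_normal((5, 3)); b = rng.standard_normal(5)

def pattern(x):
    return tuple(int(v > 0) for v in W @ x + b)

def linear_on_ball(x, eps):
    """If all 2^d extreme points of the l-inf ball share x's activation
    pattern, every interior point (a convex combination of the extreme
    points) does too, so the network is linear on the whole ball."""
    p0 = pattern(x)
    corners = (x + eps * np.array(s) for s in product((-1.0, 1.0), repeat=x.size))
    return all(pattern(c) == p0 for c in corners)

x = rng.standard_normal(3)
eps = 1.0
while not linear_on_ball(x, eps):    # shrink until the ball fits one region
    eps /= 2
# Spot-check: a random interior point shares the pattern.
assert pattern(x + rng.uniform(-eps, eps, size=3)) == pattern(x)
```

The check visits 2^d corners, so it is only practical in low dimension; the point is the logic of the proof, not an efficient certificate.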

A.4 Proof of Proposition 6

Proposition 6.

(-ball Certificate) Given a point x, is the minimum distance between x and the union of the hyperplanes induced by the linear constraints in .


Proof. Since is a convex polyhedron and , is equivalent to the statement that the hyperplanes induced by the linear constraints in are at distance at least from x. Accordingly, the minimum distance between x and the hyperplanes is the maximum that satisfies . ∎
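For a single ReLU layer the certificate has a simple closed form, sketched below with toy weights (deeper networks take the same minimum over all induced hyperplanes): the certified ℓ2 radius is the minimum point-to-hyperplane distance |w_i·x + b_i| / ‖w_i‖.

```python
import numpy as np

rng = np.random.default_rng(3)
# One ReLU layer; sizes are arbitrary.
W = rng.standard_normal((6, 4)); b = rng.standard_normal(6)

def pattern(x):
    return tuple(int(v > 0) for v in W @ x + b)

def l2_certificate(x):
    """Certified l2 radius: the minimum distance from x to the neuron
    hyperplanes w_i . x + b_i = 0, and the index of the nearest one."""
    z = W @ x + b
    dists = np.abs(z) / np.linalg.norm(W, axis=1)
    return dists.min(), int(dists.argmin())

x = rng.standard_normal(4)
eps, i = l2_certificate(x)
# Inside the certified ball the activation pattern cannot change...
u = rng.standard_normal(4); u /= np.linalg.norm(u)
assert pattern(x + 0.99 * eps * u) == pattern(x)
# ...while stepping just past eps toward the nearest hyperplane flips neuron i.
d = -np.sign(W[i] @ x + b[i]) * W[i] / np.linalg.norm(W[i])
assert pattern(x + 1.01 * eps * d) != pattern(x)
```

The second assertion shows the certificate is tight: the nearest hyperplane is crossed as soon as the radius is exceeded along its normal direction.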

A.5 Proof of Lemma 7

Lemma 7.

(Complete Linear Region Certificate) If every data point has only one feasible activation pattern denoted as , the number of complete linear regions of among is upper-bounded by the number of different activation patterns , and lower-bounded by the number of different Jacobians .


Proof. The number of different activation patterns is an upper bound since it counts the number of linear regions rather than the number of complete linear regions (a complete linear region can contain multiple linear regions). The number of different Jacobians is a lower bound since it only counts the number of distinct linear coefficients on without distinguishing whether they lie in the same connected region. ∎
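The two bounds can be illustrated empirically on a toy scalar-output network (the network, its sizes, and the sampling scheme are all illustrative assumptions): over any set of sample points, the count of distinct Jacobians never exceeds the count of distinct activation patterns, sandwiching the number of complete linear regions encountered.

```python
import numpy as np

rng = np.random.default_rng(4)
# Toy scalar-output ReLU network; sizes are arbitrary.
W1 = rng.standard_normal((8, 2)); b1 = rng.standard_normal(8)
w2 = rng.standard_normal(8)

def pattern_and_jacobian(x):
    a = (W1 @ x + b1 > 0).astype(float)
    jac = (w2 * a) @ W1                  # input gradient within the region
    return tuple(a.astype(int)), tuple(np.round(jac, 10))

samples = rng.standard_normal((500, 2))
patterns, jacobians = set(), set()
for x in samples:
    p, j = pattern_and_jacobian(x)
    patterns.add(p); jacobians.add(j)
# Among the sampled points, #patterns upper-bounds and #distinct Jacobians
# lower-bounds the number of complete linear regions encountered.
assert len(jacobians) <= len(patterns)
print(len(patterns), len(jacobians))
```

The gap between the two counts is exactly the slack described in the proof: distinct patterns can realize the same linear map, and equal Jacobians can occur in disconnected regions.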

A.6 Proof of Lemma 8

Lemma 8.

If there exists a (global) optimal solution of Eq. (4) that satisfies , then every optimal solution of Eq. (5) is also optimal for Eq. (4).


Proof. The proof is based on constructing a neural network that is feasible in Eq. (5) and attains the same loss as the optimal model in Eq. (4). Since the optimum of Eq. (5) is lower-bounded by the optimum of Eq. (4) due to the smaller feasible set, a model that is feasible in Eq. (5) and attains the same loss as the optimum of Eq. (4) is also optimal for Eq. (5).

Given the optimal model of Eq. (4) satisfying the constraint , we construct a model feasible in Eq. (5). For , we compute the smallest neuron response in and revise its weights by the following rules:


The above rule only scales the neuron value of without changing the values of all the higher layers, so the realized function of does not change. That is, it achieves the same objective value as Eq. (4) while being feasible in Eq. (5).