Full-Jacobian Representation of Neural Networks

05/02/2019 · Suraj Srinivas, et al. · Idiap Research Institute

Non-linear functions such as neural networks can be locally approximated by affine planes. Recent works make use of input-Jacobians, which describe the normals to these planes. In this paper, we introduce full-Jacobians, which include this normal along with an additional intercept term, called the bias-Jacobian, that together completely describe the local planes. For ReLU neural networks, bias-Jacobians correspond to sums of gradients of outputs w.r.t. intermediate layer pre-activations. We first use these full-Jacobians for distillation by aligning gradients of intermediate representations. Next, we regularize bias-Jacobians alone to improve generalization. Finally, we show that full-Jacobian maps can be viewed as saliency maps. Experimental results show improved distillation on small datasets, improved generalization for neural network training, and sharper saliency maps.


1 Introduction

One of the main unsolved problems in deep learning is to optimally incorporate prior knowledge about data. Priors inform regularization methods, which help improve generalization when dealing with small training sets. This is especially crucial for knowledge transfer, which involves emulating the function mapping of a “teacher” in a “student” using training examples. For this task, prior knowledge encodes information about the teacher’s map. A good representation of this map can result in rapid learning by the student using little data.

A good knowledge transfer method can be of great value to neural network practitioners. In particular, it enables exploring the space of model architectures without having to retrain from scratch every time. This flexibility of exploration is critical for hyper-parameter and architecture search, compression, and ensemble learning. However, this task is especially challenging with neural networks, as we do not have expressive representations encoding neural network functions. Crucially, we want representations that encode only information about the function map, and not the idiosyncrasies of its parameterization. Encoding unnecessary information can overly restrict the student model, causing it to under-perform.

Recently, Czarnecki et al. (2017) proposed to use input-Jacobians, the gradients of the outputs w.r.t. the inputs, for knowledge transfer. Input-Jacobians capture the slope of the local affine approximation of the neural network. Together with the function output, these quantities completely capture the local behavior of the network.

In this paper we propose full-Jacobians, a representation which includes the input-Jacobian and an additional term called the bias-Jacobian. Together, they also completely capture the local behavior of neural networks. However, unlike raw function outputs, bias-Jacobians can provide more insight into the internal decision-making process of neural networks.

The overall contributions of our paper are:

  1. We introduce full-Jacobians and use them for distillation in a low-data setting.

  2. We propose bias-Jacobian-norm minimization as a regularizer for neural networks and show connections with dropout.

  3. We show that full-Jacobian maps serve as neural network saliency maps, pointing to important regions in the input.

We provide experimental evidence showing that full-Jacobians indeed help knowledge transfer, and that bias-Jacobian-norm minimization provides regularization benefits. We also provide source code to reproduce the visualization experiments in the supplementary material.

2 Related Work

Knowledge distillation (Ba & Caruana, 2014; Hinton et al., 2015) for neural networks usually involves matching the outputs of two networks on the same input. Romero et al. (2014) and Zagoruyko & Komodakis (2017) propose methods to improve performance by adding supervision at intermediate layers. While Romero et al. (2014) use connector functions to match intermediate layers of two different networks, Zagoruyko & Komodakis (2017) use channel-wise sums of features of the same spatial extent. Recent works such as those by Heo et al. (2018) and Yim et al. (2017) use similar overall strategies of matching quantities related to intermediate activations. In contrast, Czarnecki et al. (2017) and Srinivas & Fleuret (2018) match input-Jacobians in order to preserve parameterization-invariance, and the latter also connect input-Jacobian matching to data augmentation with Gaussian noise.

Jacobian-based penalties for regularizing neural networks have been gaining popularity since their use to regularize GANs (Gulrajani et al., 2017). Early works on such penalties date back to Drucker & Le Cun (1992), who proposed them to improve robustness to changes in the input. This interpretation was later revisited by Srinivas & Fleuret (2018), who again connect Jacobian-norm minimization to data augmentation with noise.

Deep Taylor Decomposition (Montavon et al., 2017) also proposes a saliency-map representation which sums to the neural network output. It involves a custom back-propagation rule formulated to satisfy certain interpretability-based axioms. As a result, its precise mathematical relationship with the underlying neural network function is unclear. In contrast, our full-Jacobian representation assumes no additional axioms and has a precise meaning: it consists of the parameters of the local affine plane.

3 Full-Jacobians

Let us consider a neural network f with inputs x ∈ R^d. The following simple result holds for ReLU networks without bias-parameters.

Proposition 1.

Let f be a ReLU neural network without bias-parameters. Then f(x) = ∇_x f(x)^T x, for all x ∈ R^d.

All proofs are provided in the supplementary material. The proof here uses the fact that for such nets, f(kx) = k f(x) for any k ≥ 0. Here ∇_x f(x)^T x can be seen as an alternate representation of f(x), in contrast to the usual representation involving parameterized weights and non-linearities. We emphasize that even though the proof uses a first-order Taylor series, the relation described is exact.
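To make the statement concrete, below is a minimal numerical check of Proposition 1 in PyTorch; the layer sizes, the random seed, and the use of torch.autograd.grad are our own illustrative choices, not part of the paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A small ReLU network with no bias parameters anywhere.
f = nn.Sequential(
    nn.Linear(8, 16, bias=False), nn.ReLU(),
    nn.Linear(16, 16, bias=False), nn.ReLU(),
    nn.Linear(16, 1, bias=False),
)

x = torch.randn(8, requires_grad=True)
y = f(x).squeeze()

# Input-Jacobian: gradient of the scalar output w.r.t. the input.
(grad_x,) = torch.autograd.grad(y, x)

# Proposition 1: f(x) equals <grad_x, x> exactly, not just to first order.
print(y.item(), torch.dot(grad_x, x).item())
```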

This can be naturally extended to ReLU neural networks with bias-parameters by incorporating multiplicative inputs for biases which always equal one. For example, an affine function f(x) = w^T x + b, where w, x ∈ R^d and b ∈ R, can be converted to a linear function by introducing a 'bias input' x_b = 1, giving us f([x; x_b]) = [w; b]^T [x; x_b]. Here [x; x_b] is the effective input to the linear system.
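As a tiny illustration of this reformulation (the particular numbers are arbitrary, chosen only for this sketch):

```python
import numpy as np

w, b = np.array([2.0, -1.0]), 0.5
x = np.array([3.0, 4.0])

affine = w @ x + b                            # w^T x + b
linear = np.append(w, b) @ np.append(x, 1.0)  # [w; b]^T [x; x_b] with x_b = 1
assert np.isclose(affine, linear)
```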

In a similar manner, for ReLU networks with biases we can introduce such bias inputs, one for every bias parameter. Let b ∈ R^m denote the vector collecting all bias parameters of f.

Proposition 2.

Let f be a ReLU neural network with bias-parameters b ∈ R^m. Then

f(x) = ∇_x f(x)^T x + Σ_{i=1}^{m} [∇_b f(x) ⊙ b]_i     (1)

Here, ⊙ is the Hadamard product. Similar to the previous case, equation 1 is an alternate representation of the neural network output in terms of various Jacobians. We shall call ∇_x f(x) the input-Jacobian and ∇_b f(x) ⊙ b the bias-Jacobian. Together, they will be referred to as the full-Jacobian. To the best of our knowledge, this is the only exact representation of neural network outputs other than the usual feed-forward representation in terms of weights and biases.

The full-Jacobian decomposition represents the parameters of the affine plane that locally approximates the function at x. The input-Jacobian is its normal, while the sum of the bias-Jacobian terms is the intercept. Alternately, this plane can also be represented by the input-Jacobian and function value pair (∇_x f(x), f(x)). Both pairs are representations of the same affine plane.
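The decomposition of equation 1 can be checked directly with automatic differentiation. The sketch below is our own illustration (architecture, seed and helper code are arbitrary choices); it recovers the output of a small biased ReLU network from its input-Jacobian and bias-Jacobians alone.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A small ReLU network, this time with bias parameters.
f = nn.Sequential(
    nn.Linear(8, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 1),
)

x = torch.randn(8, requires_grad=True)
y = f(x).squeeze()

biases = [m.bias for m in f if isinstance(m, nn.Linear)]
grads = torch.autograd.grad(y, [x] + biases)
grad_x, grad_b = grads[0], grads[1:]

# Equation (1): f(x) = grad_x^T x + sum_i [grad_b * b]_i
input_term = torch.dot(grad_x, x)
bias_term = sum((g * b).sum() for g, b in zip(grad_b, biases))
print(y.item(), (input_term + bias_term).item())  # the two values coincide
```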

Note here that for ReLU networks, ∇_b f(x) is also the gradient of the output w.r.t. the intermediate layer pre-activations, by the chain rule. For a layer with pre-activations z = W a + b, where a denotes the incoming activations, it is easy to see that ∇_b f(x) = ∇_z f(x), where ∇_z f(x) is the gradient w.r.t. the pre-activations z.

We shall henceforth use the shorthand notation f^b = ∇_b f(x) ⊙ b for the bias-Jacobians, and drop the explicit dependence on x where convenient, as shown in Table 1. Other notations are summarized there for reference.

x : Input
f(x) : Function (neural network output)
b : Bias-parameters of f
f^b = ∇_b f(x) ⊙ b : Bias-Jacobian
∇_x f(x) : Input-Jacobian
Table 1: Notations and Terminology
Figure 1: Illustration of the full-Jacobian representation for a pre-trained VGG-16 network. The bias-Jacobian maps are summed across the channel dimension to produce single-channel maps. According to equation 1, aggregating these spatially gives us the neural network output f(x).

3.1 Interpreting bias-Jacobians

3.1.1 Toy example

To illustrate the form of bias-Jacobians for a simple case, consider the decomposition of a one-hidden layer ReLU neural network of the following form.

Example 1.

Let f(x) = relu(w^T x + b), with w, x ∈ R^d and b ∈ R.
Also ∇_x f(x) = relu'(w^T x + b) w, and ∇_b f(x) = relu'(w^T x + b).

Here, ∇_x f(x) is the input-Jacobian and ∇_b f(x) b is the bias-Jacobian, so that f(x) = ∇_x f(x)^T x + ∇_b f(x) b.

The above example follows from Proposition 1 applied to the ReLU, i.e., relu(z) = relu'(z) z. Note here that ∇_b f(x) = relu'(w^T x + b) is also the gradient of the output w.r.t. the pre-activation w^T x + b. Thus the bias-Jacobians incorporate the bias-parameters as well as the gradient of the output w.r.t. intermediate layer pre-activations of the neural network.
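A quick numeric check of Example 1 (the particular w, b and x below are arbitrary, and we take relu'(0) = 0):

```python
import numpy as np

np.random.seed(0)
w, b = np.random.randn(5), 0.7
x = np.random.randn(5)

z = w @ x + b
gate = float(z > 0)              # relu'(z)

f = max(z, 0.0)                  # relu(w^T x + b)
input_term = gate * (w @ x)      # (grad_x f)^T x
bias_term = gate * b             # (grad_b f) b
assert np.isclose(f, input_term + bias_term)
```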

Figure 1 shows this decomposition for a pre-trained VGG-16 with batch normalization. For purposes of visualization, we collapse bias-Jacobians along the channel dimension to obtain single-channel heat maps. Performing summation over the spatial dimension as indicated in the figure gives us exactly the function output, according to equation 1.

3.1.2 Connection to noise injection

Srinivas & Fleuret (2018) interpret input-Jacobians as the sensitivity of the neural network to noise added to its inputs. Here we show that bias-Jacobians can be interpreted as the sensitivity to bias-parameters. Given a neural network function f(x; b) with weights W and biases b, we apply multiplicative noise to the biases to obtain the following.

Proposition 3.

Given the notations above, and assuming the noise variable ε ∈ R^m is small, we have

f(x; b ⊙ (1 + ε)) ≈ f(x; b) + ε^T (∇_b f(x; b) ⊙ b).

This is obtained by applying a first-order Taylor series expansion in a local linear neighbourhood around b. This general expression holds for any variable of f. Notice that the second term contains ∇_b f(x; b) ⊙ b, which is exactly the bias-Jacobian. Hence the bias-Jacobian can be interpreted as the sensitivity of the neural network to multiplicative noise applied to the bias-parameters.
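The following toy check (our own construction, with an arbitrary small network and noise scale) illustrates this: perturbing every bias multiplicatively by a small amount changes the output by roughly ε^T (∇_b f ⊙ b).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

f = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
x = torch.randn(8)

biases = [m.bias for m in f if isinstance(m, nn.Linear)]
y = f(x).squeeze()
grad_b = torch.autograd.grad(y, biases)

# Small multiplicative noise on every bias: b -> b * (1 + eps).
eps = [1e-3 * torch.randn_like(b) for b in biases]
first_order = sum((e * g * b).sum() for e, g, b in zip(eps, grad_b, biases))

with torch.no_grad():
    for b, e in zip(biases, eps):
        b.mul_(1 + e)
y_noisy = f(x).squeeze()

print((y_noisy - y).item(), first_order.item())  # approximately equal
```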

3.2 Sources of Bias

In the discussion above we considered ReLU networks with bias-parameters. Here we shall look at other quantities which can effectively act as biases. There are three main sources of bias in neural networks in general.

  • Explicit bias-parameters: These refer to convolutional and fully connected layers of the form z = Wx + b, with explicitly added bias-parameters b.

  • Batch-norm parameters: For batch-norm layers of the form z = γ (x − μ)/√(σ² + ε) + β, the effective bias is β − γμ/√(σ² + ε) (see the sketch after this list). This is typically much larger in magnitude than the explicit bias-parameters in convolutional or fully connected layers.

  • Activation intercepts: We can linearize a nonlinearity φ(z) in a neighbourhood around z₀ to obtain φ(z) ≈ φ'(z₀) z + (φ(z₀) − φ'(z₀) z₀). Here the term φ(z₀) − φ'(z₀) z₀ is the effective bias that is unaccounted for by the derivative. Note that for the ReLU nonlinearity this term is always zero. In this work we only consider ReLU non-linearities and hence do not have this source of bias.
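For the batch-norm case, the effective bias can be read off directly from the layer's parameters and running statistics. The sketch below (our own, using a PyTorch BatchNorm2d in evaluation mode with made-up running statistics) verifies that an all-zero input returns exactly this effective bias.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm2d(4).eval()         # evaluation mode: uses running statistics
bn.running_mean.uniform_(-1.0, 1.0)   # make the statistics non-trivial
bn.running_var.uniform_(0.5, 1.5)

gamma, beta = bn.weight, bn.bias
mu, var = bn.running_mean, bn.running_var

# BN computes gamma * (z - mu) / sqrt(var + eps) + beta; the part independent
# of the incoming activation z is the effective bias.
effective_bias = beta - gamma * mu / torch.sqrt(var + bn.eps)

out = bn(torch.zeros(1, 4, 1, 1))     # zero input isolates the bias term
print(torch.allclose(out.flatten(), effective_bias))  # True
```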

4 Full-Jacobian Matching

Given two networks f and g, we would like to perform distillation with f being the teacher and g being the student. The problem of distillation is to improve the training of g using information from f. Usually, f and g are trained on the same dataset, and distillation is done by matching the outputs f(x) and g(x) for the same input x. In essence, we look for a function g with the same input-output mapping as f but with a different parameterization owing to its different architecture.

Now, if we require that the two functions are equal, then it follows that the gradients ∇_x f and ∇_x g are also equal. Combining this with equation 1, we see that the sums Σ_i [∇_b f ⊙ b_f]_i and Σ_i [∇_b g ⊙ b_g]_i must also be equal. Hence we can match the pairs (∇_x f(x), Σ_i [∇_b f(x) ⊙ b_f]_i) for the two functions. As we shall see next, this distillation objective can become easier when there exist common sub-structures in these functions.

4.1 Sub-Structures within Architectures

Here we shall see how the existence of certain sub-structures within architectures can make the distillation problem easier.

4.1.1 Local connections

Example 2.

Let f(x) = f_1(x_1) + f_2(x_2), where the input decomposes as x = (x_1, x_2). Here, f_1 depends only on x_1, and f_2 depends only on x_2. Also, ∇_{x_1} f = ∇_{x_1} f_1, and ∇_{x_2} f = ∇_{x_2} f_2. Let g(x) = g_1(x_1) + g_2(x_2) be another function parameterized similarly. It is clear that if we require f = g, then we can equivalently require f_1 = g_1 and f_2 = g_2.

Thus, as a result of this common structure, we are able to break a single distillation problem into two smaller sub-problems. In this example the functions are locally-connected, as f_1 depends only on x_1, and similarly for f_2. Note that in practical deep nets, convolutional layers are examples of such locally-connected functions. For convolutional layers with stride equal to the kernel size, the correspondence is similar, with each output location depending only on a disjoint patch of the input.

4.1.2 Depth

Multiple theoretical results about deep networks express so-called "no-flattening" theorems (Cohen et al., 2016; Raghu et al., 2017). Broadly speaking, they state that a shallow network requires exponentially many units to approximate a deep network. In practice, for distillation, this means that different layers in a neural network are indeed useful and cannot be approximated by shallower nets. Furthermore, visualization studies in computer vision have pointed to the fact that different layers in deep networks have clearly delineated tasks (Zeiler & Fergus, 2014). For instance, early layers often perform edge detection, while higher layers perform object part detection. This means that depth can sometimes be seen as another form of sub-structure within neural networks.

4.1.3 Matching methodology

These examples motivate the following approach for matching two convolutional networks. Given f and g, we choose corresponding convolutional layers in each and match the bias-Jacobian terms for each chosen layer separately. This incorporates the depth-separation argument presented above. Within each layer, the bias-Jacobian terms are only summed channel-wise, not spatially. This uses the locally-connected nature of convolutions. Without these assumptions, we would be restricted to matching only the overall sum of bias-Jacobians. By making them, we are able to match sums of smaller sub-parts of bias-Jacobians to each other.
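A sketch of this layer-wise matching is given below. It is our own illustration of the procedure: the hook-based extraction of pre-activations, the squared-error loss on the channel-wise sums, and the function names are assumptions, and it presumes the chosen layers carry explicit per-channel biases (the batch-norm effective biases of Section 3.2 could be substituted).

```python
import torch
import torch.nn.functional as F

def bias_jacobian_maps(net, layers, x, class_idx):
    """Channel-wise sums of bias-Jacobian maps at the chosen layers.

    Each map is (grad of the correct-class output w.r.t. the layer's
    pre-activations) * (that layer's per-channel bias), summed over channels."""
    feats = []
    hooks = [l.register_forward_hook(lambda m, i, o: feats.append((m, o))) for l in layers]
    out = net(x)
    for h in hooks:
        h.remove()
    score = out[torch.arange(x.size(0)), class_idx].sum()
    grads = torch.autograd.grad(score, [o for _, o in feats], create_graph=True)
    maps = []
    for (m, _), g in zip(feats, grads):
        b = m.bias.view(1, -1, 1, 1)        # per-channel bias of the layer
        maps.append((g * b).sum(dim=1))     # sum over channels, keep the spatial map
    return maps

def full_jacobian_matching_loss(teacher, student, t_layers, s_layers, x, y):
    t_maps = bias_jacobian_maps(teacher, t_layers, x, y)
    s_maps = bias_jacobian_maps(student, s_layers, x, y)
    return sum(F.mse_loss(s, t.detach()) for s, t in zip(s_maps, t_maps))
```

In training, this term would be added, with a weighting coefficient, to the cross-entropy, activation-matching and input-Jacobian losses described in Section 7.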

5 Bias-Jacobian Regularization

Our interpretation of bias-Jacobians as measures of sensitivity to bias-parameters suggests a natural regularization strategy: minimizing that sensitivity. While input-Jacobians capture the sensitivity to changes in the input, bias-Jacobians capture the sensitivity to changes in the bias-parameters. Minimizing the sensitivity of a neural network to its parameters has long been considered an important criterion for generalization (Hochreiter & Schmidhuber, 1997). Recent works also connect the notion of flat minima to the implicit regularization of SGD, thus partially explaining the success of deep learning (Keskar et al., 2016).

However, as pointed out by Dinh et al. (2017), many measures of flat minima, such as gradients or Hessians w.r.t. the weights of a neural network, are heavily dependent on the parameterization. In particular, one can use the non-negative homogeneity of ReLU (i.e. relu(kz) = k relu(z) for k ≥ 0) to arbitrarily change the scale of the weights without changing the underlying function. Note that both input-Jacobians and bias-Jacobians are unaffected by such scale changes, as these changes do not affect the output f(x).

Srinivas & Fleuret (2018) previously used input-Jacobian norm regularization to reduce sensitivity to input noise, but they do not report an increase in generalization. As a result, in this work we shall only investigate the effects of bias-Jacobian norm regularization (i.e. penalizing ||∇_b f(x) ⊙ b||²).
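A minimal sketch of such a regularizer follows (our own code, assuming the scalar score is taken at the correct-class output as in Section 7; the paper applies the penalty to a single layer's biases, while this sketch includes every parameter named "bias" for brevity).

```python
import torch
import torch.nn.functional as F

def bias_jacobian_penalty(net, x, y):
    """Squared norm of the bias-Jacobian, differentiable w.r.t. the weights."""
    biases = [p for name, p in net.named_parameters() if name.endswith("bias")]
    score = net(x)[torch.arange(x.size(0)), y].sum()   # correct-class outputs
    grads = torch.autograd.grad(score, biases, create_graph=True)
    return sum(((g * b) ** 2).sum() for g, b in zip(grads, biases))

def regularized_loss(net, x, y, reg_strength):
    return F.cross_entropy(net(x), y) + reg_strength * bias_jacobian_penalty(net, x, y)
```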

5.1 Connection to dropout

One other important regularizer which adds noise to intermediate layers of networks is dropout (Srivastava et al., 2014). However the difference is that while dropout can be viewed as adding multiplicative noise to activations directly, bias-Jacobian regularization adds multiplicative noise to bias-parameters. Equivalently, this can also be thought of as adding noise to pre-activations of layers, as opposed to post-non-linearity activations as done typically in dropout.

Let us consider a form of dropout in the limit of low dropout noise. For convenience we shall assume dropout with multiplicative Gaussian noise, but the same argument can easily be repeated with Bernoulli noise. Invoking Proposition 3 for an intermediate activation a rather than the biases, we have

f(x; a ⊙ (1 + ε)) ≈ f(x; a) + ε^T (∇_a f(x; a) ⊙ a).

Here, ε is the multiplicative Gaussian noise variable. Thus, in the low-noise limit, we can analytically perform dropout by taking the expectation over all noise terms. This results in a deterministic regularizer which minimizes the norm of ∇_a f ⊙ a. We observe that this term is similar to the bias-Jacobian, as the gradient w.r.t. the biases of a layer is the same as the gradient w.r.t. the corresponding intermediate pre-activations, by the chain rule. Note that the two regularizers are identical when the previous layer's activations are zero, which makes the pre-activations equal to the biases. To summarize, dropout and the bias-Jacobian norm share a tight connection: both reduce the sensitivity of the output to the intermediate layers.

6 Full-Jacobian Visualization

Here we shall use the full-Jacobian representation to formulate a neural network saliency method. While there is a large literature on saliency methods, there is no precise definition of saliency, and many works resort to axiomatic approaches (Sundararajan et al., 2017). An informal definition of saliency is the relative importance of each pixel of the image to the final decision. This is sometimes measured by the change in the neural network output upon changing the value of a pixel. A good saliency measure takes into consideration the non-linear effects of such pixel changes. There are also no objective methods to score the relative merits of such saliency maps; the most reliable test unfortunately remains visual inspection.

Within these constraints, we propose a simple way to visualize saliency, given by the following equation. Let c run across the channels of a layer l in the neural network, and let f^b_{l,c}(x) denote the corresponding channel of the bias-Jacobian at that layer.

S(x) = ψ(|∇_x f(x) ⊙ x|) + Σ_l ψ(|Σ_c f^b_{l,c}(x)|)     (4)

Here, ψ(·) is an operator which maps a map of any dimension to the space of inputs; in practice this refers to using methods such as linear or cubic interpolation for resizing images. Thus we compute channel-wise sums of bias-Jacobians, take their absolute value, resize them to the image dimensions, and then accumulate them together with the corresponding maps of every other layer and with the input-gradient map.
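A sketch of this computation is given below. It is our own reading of equation 4: the hook-based extraction, the bilinear resizing inside ψ, and the way the input-gradient term is collapsed across color channels are implementation choices, and the chosen layers are assumed to carry explicit per-channel biases (batch-norm effective biases could be substituted).

```python
import torch
import torch.nn.functional as F

def full_jacobian_saliency(net, layers, x, class_idx):
    """Equation (4): resized absolute channel-sums of bias-Jacobians, accumulated
    over the chosen layers, plus the input-gradient map."""
    feats = []
    hooks = [l.register_forward_hook(lambda m, i, o: feats.append((m, o))) for l in layers]
    x = x.requires_grad_(True)
    out = net(x)
    for h in hooks:
        h.remove()
    score = out[torch.arange(x.size(0)), class_idx].sum()
    grads = torch.autograd.grad(score, [x] + [o for _, o in feats])

    # psi: channel-sum -> absolute value -> bilinear resize to the input resolution.
    def psi(t):
        t = torch.abs(t.sum(dim=1, keepdim=True))
        return F.interpolate(t, size=x.shape[-2:], mode="bilinear", align_corners=False)

    saliency = psi(grads[0] * x)                       # input-Jacobian term
    for (m, _), g in zip(feats, grads[1:]):
        b = m.bias.view(1, -1, 1, 1)                   # per-channel bias of the layer
        saliency = saliency + psi(g * b)               # bias-Jacobian term for this layer
    return saliency
```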

The full-Jacobian saliency method has the unique advantage of using quantities which completely capture the local behaviour of neural networks. This is unlike methods based on input-Jacobians alone (Sundararajan et al., 2017; Springenberg et al., 2014; Smilkov et al., 2017), which do not account for the intercept of the local planes. Having said that, most methods in the literature, like ours, only take into account convolutional layers and not fully connected ones. Fortunately, most modern architectures completely do away with the latter.

Most other saliency methods in the literature require specifying certain choices. Integrated gradients (Sundararajan et al., 2017) requires choosing the number of steps for the Riemann approximation of an integral, while SmoothGrad (Smilkov et al., 2017) needs the number of images to smooth the gradient over. Grad-CAM (Selvaraju et al., 2017) requires the choice of an intermediate hidden layer, which we found especially tricky to tune. Guided backprop (Springenberg et al., 2014), on the other hand, is specific to ReLU networks. In contrast, our full-Jacobian method extends to any non-linearity by accounting for the activation intercepts.

7 Experiments

To show the effectiveness of full-Jacobians, we run experiments on distillation, regularization, and visualization. First, we perform distillation on the CIFAR-100 and CIFAR-10 datasets (Krizhevsky & Hinton, 2009) in a limited-data setting. Second, we regularize the training of individual neural networks on the CIFAR-100 dataset. Finally, we show visualizations of neural network saliency maps using full-Jacobian visualization. For all experiments, we approximate the Jacobian computation by computing the gradient of the output unit corresponding to the correct class, as done by Srinivas & Fleuret (2018). Details about the experiments are provided in the supplementary material.

7.1 Distillation

For distillation experiments, we use VGG-like (Simonyan & Zisserman, 2014) architectures with batch normalization. The main difference is that we discard all fully-connected layers except the final one. We use the following procedure in our experiments. First, a 9-layer "teacher" network is trained on the full CIFAR-100 dataset. Then, a larger 13-layer "student" network is trained, but this time on small subsets rather than the full dataset. As the teacher is trained on much more data than the student, we expect distillation to improve the student's performance. Note that in this case our objective is not to compress the teacher model, but to effectively transfer the knowledge of the full CIFAR-100 dataset when only limited samples are available.

We compare our methods against the following baselines.

  1. Cross-Entropy (CE) training: we train the student using only the ground-truth (hard) labels available with the dataset, without invoking the teacher network.

  2. CE + match output-activations (Activation Matching): this is the classical form of distillation (Ba & Caruana, 2014; Hinton et al., 2015), where the output-activations of the teacher network are matched with those of the student. This is weighted with the cross-entropy term which uses the ground-truth targets. Here we use the squared-error loss function for matching activations.

  3. CE + match {output-activations + input-Jacobians} (i-Jacobians): this is the regularizer used by Czarnecki et al. (2017) and Srinivas & Fleuret (2018), where the input-Jacobians of the teacher and student networks are matched. Here we minimize the distance between input-Jacobians.

  4. CE + match {output-activations + hidden-layer attention} (Attention): this approach is taken by Zagoruyko & Komodakis (2017), who match the channel-wise absolute sums of hidden layers for teacher and student layers of the same spatial dimensions. This can also be thought of as matching intermediate activations, rather than intermediate gradients as our method does.

  5. i-Jacobians + Attention: considering that attention matching also incorporates sub-structure information like bias-Jacobians do, we combine the two previous baselines to directly compare against our method.

We find that our new augmented baseline of input-Jacobian plus attention matching is surprisingly strong and beats all previous baselines, including full-Jacobians. To improve upon this strong baseline, we add the bias-Jacobian matching term to it and find that this further improves performance. This seems to contradict our assertion in Section 4.1 that one can match either bias-Jacobians or intermediate activations to account for sub-structure, as they contain information about the same affine plane.

However, individually these quantities carry complementary information: while attention maps at a layer capture the computation performed by the network up to that layer, the gradients of the output w.r.t. that layer capture the computation performed by the rest of the network after it. This complementarity explains the increase in performance for the augmented objective. In practice we match bias-Jacobians or attention maps at only three of the eleven convolutional layers, because computing them for all layers during training is computationally expensive. Similar experiments are presented for CIFAR-10 in Table 3.

# of Data points / class | 5 | 10 | 50 | 100 | 500 (full)
Cross-Entropy (CE) | 7.45 ± 0.3 | 11.83 ± 0.4 | 40.88 ± 0.8 | 51.19 ± 0.01 | 69.95 ± 0.2
Activation Matching (Ba & Caruana, 2014) | 23.72 ± 1.3 | 37.22 ± 0.2 | 59.43 ± 0.02 | 63.91 ± 0.2 | 66.99 ± 0.2
i-Jacobians (Czarnecki et al., 2017) | 27.27 ± 1.2 | 41.47 ± 1 | 61.83 ± 0.01 | 65.43 ± 0.6 | 66.92 ± 0.7
Attention (Zagoruyko & Komodakis, 2017) | 38.18 ± 1.9 | 46.39 ± 0.1 | 60.27 ± 0.3 | 64.28 ± 0.2 | 66.53 ± 0.3
i-Jacobians + Attention | 42.75 ± 1.7 | 51.16 ± 0.6 | 62.62 ± 0.6 | 65.38 ± 0.2 | 67.25 ± 0.8
Full-Jacobians (Ours) | 35.15 ± 0.5 | 48.00 ± 0.4 | 62.88 ± 0.1 | 65.84 ± 0.1 | 66.83 ± 0.1
Full-Jacobians + Attention (Ours) | 47.11 ± 0.9 | 54.59 ± 0.2 | 63.20 ± 0.4 | 65.49 ± 0.1 | 66.65 ± 0.4
Table 2: Distillation performance on CIFAR-100 (see Section 7.1). The table shows average test accuracy (%) across two runs, along with the standard deviation. We find that matching full-Jacobians along with attention works best in limited-data settings. The student network is VGG-11 while the teacher is a VGG-9 network. As the student is larger than the teacher, distillation does not help when using the entire dataset.
# of Data points / class | 50 | 100 | 500 | 1000 | 5000 (full)
Cross-Entropy (CE) | 49.29 ± 1.6 | 59.93 ± 0.1 | 79.36 ± 0.04 | 83.87 ± 0.1 | 91.95 ± 0.1
Activation Matching (Ba & Caruana, 2014) | 55.43 ± 2.1 | 65.33 ± 2.2 | 85.44 ± 0.1 | 88.77 ± 0.3 | 92.47 ± 0.1
i-Jacobians (Czarnecki et al., 2017) | 55.73 ± 2 | 67.22 ± 3.0 | 85.84 ± 0.1 | 89.30 ± 0.3 | 92.04 ± 0.01
Attention (Zagoruyko & Komodakis, 2017) | 68.11 ± 0.8 | 74.44 ± 0.2 | 85.88 ± 0.1 | 88.61 ± 0.1 | 91.20 ± 0.01
i-Jacobians + Attention | 70.83 ± 1.0 | 77.06 ± 0.2 | 86.51 ± 0.3 | 89.63 ± 0.1 | 90.68 ± 0.04
Full-Jacobians (Ours) | 58.88 ± 0.2 | 69.42 ± 1.4 | 86.55 ± 0.1 | 89.76 ± 0.1 | 91.49 ± 0.05
Full-Jacobians + Attention (Ours) | 72.75 ± 0.4 | 78.71 ± 0.1 | 87.31 ± 0.3 | 89.87 ± 0.3 | 90.68 ± 0.1
Table 3: Distillation performance on CIFAR-10 (see Section 7.1). The table shows average test accuracy (%) across two runs, along with the standard deviation. We find that matching full-Jacobians along with attention works best in limited-data settings. The student network is VGG-11 while the teacher is a VGG-9 network. As the student is larger than the teacher, distillation does not help when using the entire dataset.

7.1.1 Effect on Input-Jacobian Matching

In our experiments we found that the Jacobian-based matching terms are difficult to optimize. This was also observed by Srinivas & Fleuret (2018), who attributed it to a second-order vanishing-gradient effect; we did not observe any such effect in our experiments and are unsure of the exact cause of the difficulty. Figure 2 illustrates this phenomenon for CIFAR-100 distillation with a small number of data points per class. For the case of input-Jacobian matching alone, we see that the cosine angle between teacher and student input-Jacobians hardly improves on the training set. Surprisingly, augmenting this loss with bias-Jacobian or attention losses helps the optimization of the input-Jacobians. In all three cases, the regularization constant for the input-Jacobian matching loss term is unchanged. This indicates that the gains we observe could be due to a virtuous cycle of regularizers reinforcing and improving each other's objectives.

Figure 2: Plot shows evolution of input-Jacobian angle between teacher and student during training. The input-Jacobian matching objective is identical in all three cases, and we find that augmenting this with full-Jacobian and attention matching helps increase alignment.

7.1.2 Effect of Student size

Common folk wisdom among machine learning researchers is that small models should be preferred to large ones when training with limited data. We find that this advice does not hold for the case of distillation. We train three student models (VGG-{4,6,11}) on CIFAR-100 with a small, fixed number of data points per class using full-Jacobian matching. Surprisingly, we find that the larger models perform better: VGG-11 attains the highest accuracy, followed by VGG-6 and then VGG-4. We also plot the angle between input-Jacobians for all three cases in Figure 3, and find that the input-Jacobians are better aligned for VGG-11. These observations are not surprising in hindsight, as additional capacity is required to fit all the objectives we introduce.

We make two additional observations here. First, when using VGG-9 as the student, we found that it performed as well as VGG-11; this is expected, as the teacher itself is a VGG-9 network. Second, VGG-4 and VGG-6 do slightly outperform VGG-11 on even smaller datasets and show better input-Jacobian alignment; however, we did not observe this in other cases.

Figure 3: Plot shows evolution of input-Jacobian angle between teacher and student for three different student networks. We find that larger models fit the teacher better, which is also reflected in the improved input-Jacobian alignment.
# of Data points / class | 50 | 100 | 500
No regularization | 33.25 ± 0.6 | 46.24 ± 0.1 | 68.48 ± 0.1
Dropout (Srivastava et al., 2014) | 35.04 ± 0.6 | 47.62 ± 0.7 | 70.14 ± 0.06
Bias weight decay | 34.17 ± 0.2 | 47.29 ± 0.7 | 68.75 ± 0.04
Bias-Jacobian (Ours) | 36.02 ± 0.08 | 48.76 ± 0.1 | 71.49 ± 0.02
Table 4: Regularization of VGG-11 models on CIFAR-100 (see Section 7.2). We report average test accuracy (%) across two runs, along with the standard deviation. λ denotes the regularization strength, while p is the dropout probability. We apply these regularizers to the same single layer of VGG-11, and find that bias-Jacobian regularization outperforms dropout and bias weight decay in all cases.

Table 5: Comparison of different neural network saliency methods (see Section 7.3). Columns, left to right: input image, Guided Backprop (Springenberg et al., 2014), Integrated Gradients (Sundararajan et al., 2017), SmoothGrad (Smilkov et al., 2017), Grad-CAM (Selvaraju et al., 2017), Full-Jacobian (Ours), and Bias-Jacobian (Ours). Guided Backprop, Integrated Gradients and SmoothGrad produce sharp object boundaries, while Grad-CAM indicates important regions without adhering to boundaries. Full-Jacobians highlight salient regions while being tightly confined within objects.

7.2 Regularization

We perform experiments where we penalize the bias-Jacobian norm to check whether it improves generalization. We train 9-layer VGG networks on CIFAR-100 with a varying number of data points per class, and measure test accuracy. We compare our method with dropout and with weight decay on the bias parameters, applied to the same layer whose bias-Jacobian norm we penalize. We found that regularization benefits arise when applying these regularizers to the final convolutional layers. For all methods, we choose the regularization hyper-parameters, namely the dropout probability, the bias weight-decay coefficient, and the bias-Jacobian penalty strength, by grid search.

Our experiments (Table 4) confirm our hypothesis that penalizing bias-Jacobians has regularization benefits, and we find that it is also superior to dropout and to weight decay on the biases.

7.3 Visualization

We perform full-Jacobian visualization on an ImageNet pre-trained VGG-16 network with batch normalization. This network has 13 convolution + batch-norm (linear) blocks. For each block, we extract the bias-Jacobians and use equation 4 to compute the visualization. Table 5 shows these visualizations along with four baselines: guided backprop, integrated gradients, SmoothGrad and Grad-CAM.

We see that the first three methods are based on input-Jacobians alone, and hence their maps are qualitatively different from those of Grad-CAM and full-Jacobians: they tend to highlight object boundaries more than their interior. Grad-CAM, on the other hand, highlights broad regions of the input without demarcating clear boundaries. Full-Jacobians combine the advantages of both: the highlighted regions are confined within object boundaries while the interior is highlighted at the same time. This is not surprising, as full-Jacobians include information both about input-Jacobians, like guided backprop, integrated gradients and SmoothGrad, and about intermediate-layer gradients, like Grad-CAM. Finally, we also visualize the bias-Jacobians alone, and find that these maps tend to be sharper than full-Jacobian maps, primarily because they do not contain the noisy input-gradient component.

8 Conclusion

We have introduced the full-Jacobian representation, which completely captures the local affine behavior of a neural network. In particular, it provides a formal way to reason about the intermediate layers of multi-layered architectures. In this paper, we used this representation for distillation and for a regularizer that draws parallels with dropout. We also found that visualizing full-Jacobians produces sharp saliency maps.

Despite these advances, this representation is incomplete without a formal understanding of structural similarities between neural nets. This was briefly discussed in Section 4.1. Future work can focus on formalizing this notion for convolutional networks, as well as on methods to automatically discover such similarity between two architectures and find the optimal matching losses for knowledge transfer.

References