As demonstrated in prior research, the gradient tensor of the scalar loss function with respect to the input or an intermediate layer, termed the Jacobian, is highly informative [gradcam]. This follows naturally from the equations of backpropagation for a perceptron,
$$\delta^{L} = \nabla_{a} J \odot \sigma'(z^{L}), \qquad \delta^{l} = \left( (W^{l+1})^{T} \delta^{l+1} \right) \odot \sigma'(z^{l}),$$
with $\delta^{l} \equiv \partial J / \partial z^{l}$. Here $\delta^{l}$ is the gradient tensor, $l$ is the layer with $L$ being the final layer, $\nabla_{a} J$ is the gradient of the loss function $J$ with respect to the neural network output after the final activation, $\sigma$ is the activation function, $z^{l}$ is the output after layer $l$ with $a^{l} = \sigma(z^{l})$, $W^{l}$ is the weight matrix, and $\odot$ is the Hadamard product. It follows from these equations that the gradient tensor at any layer is a function of both the loss function and all succeeding weight matrices. The information in gradient tensors has been employed classically for regularization [doubleback] and more recently for visualizing saliency maps [salmaps], interpreting DNNs [visualizing; allconvnet], generating adversarial examples [harnessing], and weakly supervised object localization [gradcam]. Most approaches use the information from the gradient tensor in a separate step to achieve the desired quantitative or qualitative result. Different from these approaches, we use the gradient tensor during the training procedure via an adversarial process [GAN] in our proposed GRadiEnt Adversarial Training (GREAT) procedure.
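To make the recursion concrete, the following NumPy sketch computes the gradient tensors $\delta^{l}$ for a toy, bias-free perceptron with sigmoid activations and a squared-error loss. The shapes, the choice of loss, and the function names here are purely illustrative, not the architectures used in our experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_deltas(x, y, weights):
    """Gradient tensors delta^l = dJ/dz^l for a bias-free perceptron with
    sigmoid activations and squared-error loss J = 0.5 * ||a^L - y||^2:
        delta^L = grad_a J (Hadamard) sigma'(z^L)
        delta^l = ((W^{l+1})^T delta^{l+1}) (Hadamard) sigma'(z^l)
    """
    # Forward pass, caching the pre-activations z^l.
    a, zs = x, []
    for W in weights:
        z = W @ a
        zs.append(z)
        a = sigmoid(z)
    grad_a = a - y  # gradient of J with respect to the final activation
    sp = lambda z: sigmoid(z) * (1.0 - sigmoid(z))  # sigma'(z)
    deltas = [grad_a * sp(zs[-1])]  # delta^L
    # Recurse backwards: delta^l depends on all succeeding weight matrices.
    for W, z in zip(reversed(weights[1:]), reversed(zs[:-1])):
        deltas.append((W.T @ deltas[-1]) * sp(z))
    return deltas[::-1]  # ordered delta^1 ... delta^L
```

Since $z^{1} = W^{1} x$, the input gradient follows as $\nabla_{x} J = (W^{1})^{T} \delta^{1}$, which is the quantity the later sections manipulate.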
The main premise underlying GREAT is that the information in the gradient tensor inhibits reliable training dynamics under certain scenarios. GREAT aims to nullify this dark information in the gradient tensors by first processing the gradient tensor in an auxiliary network and then passing an adversarial signal back to the main network (Figure 1a) via the gradient reversal procedure [DANN]. This adversarial signal regularizes the weight tensors in the main network, akin to double backpropagation [doubleback]. Using calculus, the adversarial gradient signal flowing forward in the main network can be shown to be of a similar functional form as $\delta^{l}$ but of opposite sign, and affected by the preceding weight matrices $W^{1}, \ldots, W^{l}$ up to the layer of the considered gradient tensor. As networks tend to have perfect sample expressiveness as soon as the number of parameters exceeds the number of data points [rethinkgen], we expect the regularization provided by the auxiliary network to improve robustness without considerably affecting performance. We describe the dark information present in the gradient tensors in three scenarios: (a) adversarial examples, (b) multi-task learning, and (c) knowledge distillation [distillation]. We describe the intuition behind using GREAT for these three scenarios in the subsequent paragraphs and describe the exact training methodology in Section 2.
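The gradient reversal procedure [DANN] at the heart of GREAT can be sketched in a few lines: the operation is the identity in the forward pass and negates (and optionally scales) the gradient in the backward pass. The class below is a hypothetical stand-alone illustration, not tied to any framework's autograd machinery.

```python
import numpy as np

class GradientReversal:
    """Gradient reversal (as in DANN): identity forward, negated and
    lambda-scaled gradient backward. Turning the auxiliary network's
    minimization signal into an adversarial signal for the main network."""

    def __init__(self, lam=1.0):
        self.lam = lam  # scaling factor for the reversed signal

    def forward(self, x):
        return x  # the auxiliary network sees the tensor unchanged

    def backward(self, grad_output):
        # The signal flowing back into the main network is reversed.
        return -self.lam * grad_output
```

In a framework with automatic differentiation this would be registered as a custom backward rule; the sketch only makes the forward/backward asymmetry explicit.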
Adversarial examples are carefully crafted perturbations applied to normal images which are usually imperceptible to humans, but can seriously confuse state-of-the-art deep learning models [advexamples; harnessing]. A step common to all adversarial example generation is calculating the gradient of the objective function with respect to the input [madry], called the saliency map. The objective function is either the task loss function or derived from it. This gradient tensor is processed to perturb the original image, and the model misclassifies the perturbed image. We use GREAT to make the saliency maps uninformative (Figure 1b), and hence mitigate the network's susceptibility to adversarial examples.
The objective of knowledge distillation is to compress the predictive behavior of a cumbersome DNN (teacher) or an ensemble of DNNs into a simpler model (student) [distillation; ensembling]. Distilling knowledge to a student network is achieved by matching the logits or the soft output distribution of the teacher to the output of the student, in addition to the usual supervised loss function. In Figure 1c, we show how GREAT provides a complementary approach to distillation wherein we statistically match the gradient tensor of the teacher to that of the student using the auxiliary network, in lieu of matching output distributions.
In multi-task learning, a single network is trained end-to-end to produce multiple related but different task outputs for an input [relationship]. This is achieved by having a common encoder and separate task-specific decoders. In a perfect multi-task learning scenario, the gradient tensors of the individual task-loss functions with respect to the last shared layer in the encoder should be indistinguishable, so as to coherently train all the shared layers in the encoder. We use GREAT to train a gradient alignment layer between the encoder and the task-specific decoders which operates in the backward pass, so that the task-specific gradient tensors are less distinguishable by the auxiliary network (Figure 1d).
In Section 2, we describe the GREAT procedure for each of the above scenarios. In Section 3, we highlight the results of GREAT, and in Section 4 we discuss conclusions and possible avenues of future work. Note that we discuss relevant work where appropriate in the remainder of this article.
2 Gradient Adversarial Training
We describe the adaptations of GREAT suitable for adversarial defense, knowledge distillation and multi-task learning.
2.1 Adversarial defense
The general objective for defense against adversarial examples is
$$\min_{\theta} \; \mathbb{E}_{(x, y)} \left[ \max_{\|\eta\|_{p} \leq \epsilon} J(x + \eta, y; \theta) \right].$$
Here, $x$ is the input, $y$ the output, $\theta$ the network parameters, $\eta$ is the perturbation tensor whose $p$-norm is constrained to be less than $\epsilon$, and $J$ subsumes the loss function and the network architecture. Non-targeted attacks are devised by setting $\eta = \epsilon \, g(\nabla_{x} J(x, y; \theta))$, i.e., moving in the direction of the gradient of the ground-truth class $y$, where $g$ is usually the sign function as in FGSM; whereas targeted attacks are calculated as $\eta = -\epsilon \, g(\nabla_{x} J(x, y'; \theta))$ for a target class $y' \neq y$. Using a first-order Taylor series approximation in the equation above amounts to the equivalent formulation
$$\min_{\theta} \; \mathbb{E}_{(x, y)} \left[ J(x, y; \theta) + \max_{\|\eta\|_{p} \leq \epsilon} \eta^{T} \nabla_{x} J(x, y; \theta) \right].$$
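The single-step attacks above can be sketched as follows, with $g$ taken to be the sign function as in FGSM. The gradient $\nabla_{x} J$ is assumed precomputed, and all names are illustrative.

```python
import numpy as np

def fgsm_perturb(x, grad, eps, targeted=False):
    """Single-step FGSM-style perturbation.

    Non-targeted: step along sign(grad) of the ground-truth class.
    Targeted: step against sign(grad) computed for the target class,
    i.e., grad here would be nabla_x J(x, y'; theta) for y' != y.
    """
    step = eps * np.sign(grad)
    return x - step if targeted else x + step
```

An iterated attack (iFGSM) simply applies this step repeatedly, clipping the cumulative perturbation back into the $\epsilon$-ball after each step.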
Previous attempts at adversarial defenses have focused on minimizing $\|\nabla_{x} J\|$ locally at the training points [defensivedistill; Ross2017ImprovingTA; datagrad; advrob]. However, this leads to a sharp curvature of the loss surface near those points, violating the first-order Taylor approximation, which in turn makes the defense ineffective [certified].
GREAT: Our GREAT procedure removes the class-specific information present in the gradient tensor. Formally, we drive
$$\nabla_{x} J(x, y; \theta) \approx \nabla_{x} J(x, y'; \theta) \quad \forall \; y' \neq y$$
for all samples in the training set. In the absence of class-specific information, a single-step targeted attack becomes hard, as the perturbation tensor is class-agnostic. However, by making the gradient tensors class-agnostic, GREAT in effect obfuscates the gradient. Networks with obfuscated gradients are still vulnerable to sophisticated iterative attacks [obgrad] and to universal adversarial perturbations [universalap]. Hence, as a second line of defense, we propose the gradient-adversarial cross-entropy (GREACE) loss.
GREACE: GREACE adapts the cross-entropy loss function to add weight to the negative classes whose gradient tensors are similar to that of the primary class. The weight is added to the negative classes in the gradient tensor flowing backward from the soft-max activation, before back-propagating through the rest of the main network (see Figure 2). The weight is evaluated using the soft-max distribution from the auxiliary network, which indicates the similarity of the gradient tensor of the primary class to the negative classes. This added weight helps separate the high-dimensional decision boundary between easily confused classes, similar in spirit to the confidence penalty [confidencepen] and focal loss [focalloss], albeit from the perspective of gradients. Mathematically, the gradient tensor from the cross-entropy loss is modified in the following way:
$$\frac{\partial J_{G}}{\partial a} = \frac{\partial J}{\partial a} + \beta \left( \phi(\tilde{a}) \odot \mathbb{1}_{y' \neq y} \right).$$
Here, $J_{G}$ and $J$ are the GREACE and original cross-entropy functions respectively, $a$ and $\tilde{a}$ are the output activations from the main and auxiliary network respectively, $\phi$ is the soft-max function, $\beta$ is a penalty parameter, and $\mathbb{1}_{y' \neq y}$ is a one-hot function over all classes $y'$ not equal to the original class $y$, i.e., the negative classes. The gradient fed into the auxiliary network is masked after passing through the soft-max function in the main network, so that only the component corresponding to the ground-truth class is retained. This prevents the auxiliary classifier from latching onto gradient cues from the negative classes and forces it to concentrate only on the class in question. We also experimented with the unmasked gradient tensor, but the results were not as good. The combined objective for adversarial defense is:
$$\min_{\theta, \theta_{a}} \; J_{G}(x, y; \theta) + \alpha \, J_{m}\!\left( \nabla_{x} J(x, y; \theta), y; \theta_{a} \right).$$
Here $J_{G}$ indicates the GREACE loss, $J$ the standard cross-entropy, $J_{m}$ the masked cross-entropy used by the auxiliary network (with parameters $\theta_{a}$), and $\alpha$ is a weight parameter for the auxiliary network's loss.
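As an illustration of the GREACE modification, the NumPy sketch below works with the gradient with respect to the pre-softmax logits, where the cross-entropy gradient has the simple closed form $\phi(a) - \mathbb{1}_{y}$; the logits, the penalty $\beta$, and the helper names are ours and purely illustrative.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())  # numerically stable soft-max
    return e / e.sum()

def greace_grad(main_logits, aux_logits, y, beta):
    """Cross-entropy gradient (w.r.t. logits) plus the GREACE penalty:
    the auxiliary network's soft-max, masked to the negative classes,
    up-weights easily confused negatives by a factor beta."""
    K = main_logits.shape[0]
    y_onehot = np.eye(K)[y]
    grad_ce = softmax(main_logits) - y_onehot   # standard CE gradient
    neg_mask = 1.0 - y_onehot                   # one-hot over negative classes
    penalty = softmax(aux_logits) * neg_mask    # weight confusable negatives
    return grad_ce + beta * penalty
```

Note that the true-class component is untouched; only the negative classes that the auxiliary network finds similar (high soft-max mass) receive extra gradient weight.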
2.2 Knowledge distillation
In classical distillation [distillation], the student's output distribution mimics the teacher's soft output distribution. In GREAT, the student model instead mimics the teacher model's gradient distribution, which is a weaker constraint, as it allows the final distributions to differ by a constant value. A solution for the student which jointly minimizes the supervised loss and matches the teacher's gradients exists, as proved in [sobolev]. GREAT uses a discriminator to match the gradient distributions owing to the success of adversarial losses [GAN] over traditional regression-based losses.
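The claim that gradient matching is a weaker constraint than output matching follows because two functions differing by a constant have identical gradients everywhere. A tiny numerical illustration (the toy "teacher" and "student" functions here are ours, not trained models):

```python
import numpy as np

def num_grad(f, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

teacher = lambda x: np.sum(x ** 2)        # toy "teacher" output
student = lambda x: np.sum(x ** 2) + 3.7  # differs by a constant only
```

The two outputs disagree at every point, yet their gradients coincide, so a gradient-matching student is free to settle on any constant offset the supervised loss prefers.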
The GREAT procedure for knowledge distillation mimics a GAN training procedure. The binary classifier discriminates between student and teacher model gradients, and drives the student model to generate a gradient tensor distribution similar to that of the teacher model, as shown in Figure 1c. The objective to be optimized is:
$$\min_{\theta_{s}} \max_{\theta_{d}} \; \mathbb{E}\!\left[ J_{s}(x, y; \theta_{s}) \right] + \alpha \left( \mathbb{E}\!\left[ \log D(\nabla_{x} J_{t}; \theta_{d}) \right] + \mathbb{E}\!\left[ \log \left( 1 - D(\nabla_{x} J_{s}; \theta_{d}) \right) \right] \right).$$
Here $D$ is the binary classifier with parameters $\theta_{d}$, $\nabla_{x} J_{s}$ and $\nabla_{x} J_{t}$ are the gradient tensors from the student and teacher, respectively, $\mathbb{E}$ denotes expectation, and $\alpha$ is a loss balancing parameter. GREAT has no hyper-parameter controlling the teacher's distribution to be matched, unlike the hard-to-set temperature parameter in distillation. However, we incur an extra pass through the student network.
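The adversarial term of the objective can be sketched with a toy logistic discriminator over flattened gradient tensors; the supervised term $J_{s}$ and the balancing weight $\alpha$ are omitted for brevity, and the discriminator parameterization is purely illustrative.

```python
import numpy as np

def discriminator(g, w):
    """Toy logistic discriminator: probability that gradient g is a
    teacher gradient, with weight vector w standing in for theta_d."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, g)))

def gradient_matching_value(w, teacher_grads, student_grads, eps=1e-12):
    """E[log D(grad_teacher)] + E[log(1 - D(grad_student))]:
    the discriminator ascends this value; the student descends it."""
    t = np.mean([np.log(discriminator(g, w) + eps) for g in teacher_grads])
    s = np.mean([np.log(1.0 - discriminator(g, w) + eps) for g in student_grads])
    return t + s
```

When the two gradient distributions become indistinguishable, no choice of $w$ lets the discriminator push this value above its chance level, which is exactly the equilibrium the student is driven toward.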
2.3 Multi-task learning
GradNorm [gradnorm] adaptively balances the loss weights based on the norms of the gradients. The GREAT procedure for multi-task learning can be viewed as a generalization of GradNorm, with two important differences: (1) we do not enforce that the gradients have balanced norms, but instead desire that they have similar statistical distributions; this is achieved by the auxiliary network, similar to a discriminator in a GAN setting. (2) Instead of assigning task weights, we add extra capacity to the network in the form of gradient-alignment layers (GALs). These layers are placed after the shared encoder and before each of the task-specific decoders, as shown in Figure 3. They have the same dimensions as the last shared feature tensor (excluding the batch dimension), and are active only during the backward pass, i.e., the GALs are dropped during forward inference.
The auxiliary network receives the gradient tensor from each task as input and classifies it according to task. Successful classification implies the gradient tensors are discriminative, which impedes training of the shared encoder as the gradients are misaligned. The GALs mitigate the misalignment by element-wise scaling of the gradient tensors from all tasks. These layers are trained using the reversed gradient signal from the auxiliary network, i.e., the GALs attempt to make the gradient tensors indistinguishable. Intuitively, the GALs observe the statistical irregularities that prompt the auxiliary classifier to successfully discriminate between the gradient tensors, and then adapt the tensors to remove the irregularities or equalize the distributions. Note that the task losses are normalized by their initial values so that the alignment layers perform local alignment and not global loss scale alignment. Furthermore, the soft-max activation function in the auxiliary network's classification layer implicitly normalizes the gradients. The values in the GAL weight tensors are initialized to ones and restricted to be positive for training to converge. In practice, we observed that a low learning rate ensured positivity of the GAL tensors. The overall objective for multi-task learning is:
$$\min_{\theta_{e}, \{\theta_{i}\}, \{\Theta_{i}\}} \; \sum_{i=1}^{N} \hat{J}_{i}(x, y_{i}; \theta_{e}, \theta_{i}, \Theta_{i}) \quad \text{and} \quad \min_{\theta_{a}} \; J_{N}\!\left( \{\delta_{i}\}; \theta_{a}, \{t_{i}\} \right),$$
where the GAL parameters $\Theta_{i}$ additionally receive the reversed gradient of $J_{N}$ from the auxiliary network.
Here $\hat{J}_{i}$ are the normalized task losses, $J_{N}$ is the $N$-class cross-entropy loss, $\theta_{e}$ and $\theta_{a}$ are the learnable parameters in the shared encoder and auxiliary classifier, respectively, $\theta_{i}$, $\Theta_{i}$, and $y_{i}$ are the decoder parameters, GAL parameters, and labels for task $i$, respectively, and $t_{i}$ represent the task labels.
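The behavior of a gradient-alignment layer can be sketched directly from its description: identity in the forward pass, element-wise rescaling of each task's gradient tensor in the backward pass, with scales initialized to ones and kept positive. The class below is an illustrative stand-in, not our training implementation.

```python
import numpy as np

class GradientAlignmentLayer:
    """GAL sketch: dropped during forward inference, applies a learnable
    element-wise (Hadamard) scaling to the task gradient in the backward
    pass. Scales start at one and are kept positive during training."""

    def __init__(self, shape):
        self.scale = np.ones(shape)  # one learnable scale per feature element

    def forward(self, features):
        return features  # identity: no effect on the forward pass

    def backward(self, task_grad):
        return self.scale * task_grad  # element-wise rescaling of the gradient
```

During training, `scale` would be updated by the reversed gradient signal from the auxiliary classifier, pushing the rescaled task gradients toward indistinguishability.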
3 Experiments
3.1 Adversarial defense
We demonstrate GREAT on the CIFAR-10 and SVHN datasets. We use a ResNet-18 architecture [resnet] for both datasets. We observed that ResNet models are more effective in the GREAT training paradigm for adversarial defense relative to models without skip connections. In GREAT, skip connections help propagate the gradient information in the usual backward pass, as well as forward propagate the reversed gradient from the auxiliary classifier network through the main network. In our experiments, the auxiliary network is a copy of the main network. We gradually increase the auxiliary loss weight parameter $\alpha$ and the penalty parameter $\beta$ to their final values, so as to not impede the main training task during the initial epochs. We empirically set $\alpha$ and $\beta$ to 2 and 10 for CIFAR-10 and SVHN, respectively. These values optimally defend against adversarial examples while not adversely affecting the test accuracy on the original samples. The network architectures and additional parameters are discussed in the supplement. We evaluate our method against targeted and non-targeted adversarial examples using the fast gradient sign method (FGSM) and its iterated version (iFGSM). For targeted attacks, we report the test accuracy for adversaries choosing a random target class or the worst (least-probability) target class. We compare our method against adversarial training and a base network with no defense mechanism in Tables 1 and 2. We employ FGSM adversaries in the adversarially trained network, described further in the supplement. Most other defenses are not effective, as reported in [obgrad]. For CIFAR-10, we also plot the test accuracy as a function of the maximum perturbation allowed by the adversary in Figure 4. First, the training set accuracy indicates that GREACE acts as a strong regularizer, and the combination of GREACE and GREAT prevents over-fitting to the training set. Second, we see that GREAT adds robustness to non-targeted single-step attacks but fails against the iterated adversary (iFGSM), an indication of gradient obfuscation [obgrad]. Third, we see that GREACE in isolation is robust to adversarial attacks; however, the combination of GREAT and GREACE boosts robustness further. Surprisingly, GRE(AT+CE) performs better than adversarial training on single-step attacks, even though adversarial training is explicitly trained to be robust against them. Finally, in Figure 4 we see that the performance of GRE(AT+CE) deteriorates only slightly for strong adversaries with high perturbation values, validating the robustness of the classifier. (Single-step targeted attacks are not successful on SVHN due to the simple task of recognizing digits.)
The saliency maps for the different methods are plotted in Figure 4 for three examples from CIFAR-10. Pixel activations around an object promote the generation of adversarial examples. We see that the saliency maps for the baseline and adversarial training have high pixel activations both within and around the object, whereas the activations for GREAT are very noisy and not discriminative, as expected. In contrast, the saliency maps for GRE(AT+CE) are sparse and predominantly activated within the object, hence mitigating adversarial examples.
3.2 Knowledge distillation
We demonstrate GREAT's potential for knowledge distillation on the CIFAR-10 and mini-ImageNet datasets. The mini-ImageNet dataset is a subset of the original ImageNet dataset with 200 classes, and 500 training and 50 test samples for each class. We show distillation results for two scenarios: (a) all training examples are used to train the student model, i.e., the dense regime, and (b) only 5% of the training samples are used to train the student models, i.e., the sparse regime. For CIFAR-10, we use (i) a 5-layer CNN and a pretrained ResNet-18, and (ii) a ResNet-18 and a pretrained ResNext-29-8 [resnext] as student-teacher combinations. For mini-ImageNet, we train a teacher ResNet-152 model at two resolutions, (i) 64x64 and (ii) 224x224, for 100 epochs and 50 epochs, respectively. We use a ResNet-18 as the student model at both resolutions. We use a shallower version of the student model as the auxiliary binary classifier. Details of the architecture, optimizer, and learning rate policy for each scenario are in the supplement. We compare GREAT against a baseline model trained using cross-entropy loss, and against a distilled model trained using a combination of cross-entropy and unsupervised KL-loss. We determined the best temperature and weight parameter for distillation in the two training regimes on the 5-layer CNN+ResNet-18 combination, and used these parameters for the mini-ImageNet experiments. The optimal parameters were chosen through grid search for the ResNet-18+ResNext-29-8 combination. In all experiments using GREAT, we set the loss balancing parameter $\alpha$ to the value determined from the dense training regime of the CNN+ResNet-18 combination. The results are reported in Table 3. We see that GREAT consistently performs better than the baseline and distillation in the sparse training regime, indicating better regularization by the gradient adversarial signal.
The baseline model performs best in the full-resolution, dense training regime for mini-ImageNet, indicating that the teacher model trained for only 50 epochs provides weak learning cues. Indeed, the best test accuracy reported for mini-ImageNet at full resolution is 83.32%, as opposed to our teacher model's 71.30% top-1 accuracy. The poor performance of distillation in the mini-ImageNet dense regime indicates that the hyperparameters determined on CIFAR-10 are not transferable across datasets. In contrast, GREAT with the same $\alpha$ parameter is able to coherently distill the model for both the dense and sparse training regimes across different student-teacher combinations.
3.3 Multi-task learning
We test GREAT for multi-task learning on two datasets: (a) CIFAR-10, where the input is a noisy gray-scale image and the tasks are (i) classification, (ii) colorization, (iii) edge detection, and (iv) denoised reconstruction; and (b) the NYUv2 dataset, wherein the tasks are (i) depth estimation, (ii) surface-normal estimation, and (iii) key-point estimation. The input and output resolutions for the CIFAR-10 dataset are 32x32, and the input resolution for NYUv2 is 320x320 with an output resolution of 80x80, as set in [gradnorm]. We compare our method against the baseline of equal weights, GradNorm [gradnorm], and uncertainty-based weighting [kendallmulti]. For all methods we use the same architecture: a ResNet-53 with a dilated convolution backbone and task-specific decoders. We tested GradNorm for different values of its hyperparameter, setting it equal to 0.6 for CIFAR-10, and to 1.5 for NYUv2 as set in the original paper. Full details about the dataset creation, task losses, and main model and classifier architectures are in the supplement. Table 4 lists the results. We see that GREAT performs better than or on par with GradNorm, despite having no tunable hyperparameters. This indicates that the extra parameters in the GALs are sufficient to absorb dataset-specific information without requiring hand-tuning. On CIFAR-10, we see that GREAT performs best on edge detection and denoised auto-encoding, and is close to the best value for colorization. The high classification error for the uncertainty-based method and the high RMSE values of the baseline on the other three tasks indicate that classification is antagonistic to the other three tasks. However, both GradNorm and GREAT are able to correctly balance the gradient flowing from classification with the other tasks. On the NYUv2 dataset, we see that GREAT performs best on depth and normal estimation, and is within a small RMSE margin on keypoint detection. Overall, we see that GREAT performs better than all other methods on four of the seven tasks, and is close to the best values in all cases.
4 Conclusion and future work
We have introduced gradient adversarial training and demonstrated its applicability in diverse scenarios: from defense against adversarial examples to knowledge distillation to multi-task learning. We show that adaptations of GREAT (a) offer a strong defense against both targeted and non-targeted adversarial examples, (b) can easily distill knowledge from different teacher networks without heavy parameter tuning, and (c) aid multi-task learning by training a gradient alignment layer. There are several directions of future work in the proposed domains. We wish to investigate other forms of loss functions beyond GREACE that are symbiotic with GREAT, explore progressive training of student networks using ideas from Progressive-GAN [progressivegan] to better learn from the teacher, and absorb the explicit parameters in the GALs directly into the optimizer, as is done with the mean and variance estimates for each weight parameter in ADAM [adam]. The general approach underlying GREAT of passing an adversarial gradient signal to a network is broadly applicable to domains beyond the ones discussed here, such as to the discriminator in domain adversarial training [DANN] and in GANs [GAN]. We can also replace direct gradient tensor evaluation with synthetic gradients [syntheticgrad] for efficiency. Holistically, we believe that understanding gradient distributions will help uncover the underlying mechanisms that govern the successful training of deep architectures using backpropagation, and gradient adversarial training is a step in this direction.