Gradient Adversarial Training of Neural Networks

06/21/2018 ∙ by Ayan Sinha, et al. ∙ Magic Leap, Inc.

We propose gradient adversarial training, an auxiliary deep learning framework applicable to different machine learning problems. In gradient adversarial training, we leverage a prior belief that in many contexts, simultaneous gradient updates should be statistically indistinguishable from each other. We enforce this consistency using an auxiliary network that classifies the origin of the gradient tensor, and the main network serves as an adversary to the auxiliary network in addition to performing standard task-based training. We demonstrate gradient adversarial training for three different scenarios: (1) as a defense to adversarial examples, we classify gradient tensors and tune them to be agnostic to the class of their corresponding example; (2) for knowledge distillation, we do binary classification of gradient tensors derived from the student or teacher network and tune the student gradient tensor to mimic the teacher's gradient tensor; and (3) for multi-task learning, we classify the gradient tensors derived from different task loss functions and tune them to be statistically indistinguishable. For each of the three scenarios we show the potential of the gradient adversarial training procedure. Specifically, gradient adversarial training increases the robustness of a network to adversarial attacks, is able to better distill the knowledge from a teacher network to a student network compared to soft targets, and boosts multi-task learning by aligning the gradient tensors derived from the task-specific loss functions. Overall, our experiments demonstrate that gradient tensors contain latent information about whatever tasks are being trained, and can support diverse machine learning problems when intelligently guided through adversarialization using an auxiliary network.




1 Introduction

In backpropagation backprop, the gradient of the loss function is evaluated with respect to the weight tensor in each layer, and the weights are updated using a learning rule adam. Gradient tensors recursively evaluated through backpropagation can successfully train deep networks with millions of weight parameters across hundreds of layers and generalize to unseen examples resnet. However, a mathematical formalism of the generalization ability of deep neural networks (DNNs) trained using backpropagation remains elusive. Indeed, this lack of formalism has given rise to new domains in deep learning such as robustness of DNNs, in particular to adversarial examples advexamples, domain adaptation DANN, multi-task learning relationship, and model compression modelcomp. Here, we investigate the potential of gradient tensors derived during backpropagation to serve as an additional cue to learning in these new domains.

Figure 1: Gradient Adversarial Training (GREAT) of Neural Networks. The legends on the top left and right of the figure show the information flow in the networks and the different kinds of modules. (a) The general methodology of GREAT, wherein the main network is trained using standard backpropagation and also acts as an adversary to the auxiliary network via gradient reversal. The auxiliary network is trained on gradient tensors evaluated during backpropagation. (b) GREAT procedure for adversarial defense: The auxiliary network performs the same classification as the main network, albeit with gradient tensors as input. (c) GREAT method for knowledge distillation: The auxiliary network performs binary classification on the gradient tensors from the student and teacher networks. (d) GREAT method for multi-task learning: The auxiliary network classifies the gradient tensors from the different task decoders and aligns them through gradient reversal and an explicit gradient alignment layer described later.

As demonstrated in prior research, the gradient tensor of the scalar loss function with respect to the input or intermediate layer, termed the Jacobian, is highly informative gradcam. This follows naturally from the equations of backpropagation for a perceptron,

$$\delta^{L} = \nabla_{a} C \odot \sigma'(z^{L}), \qquad \delta^{l} = \left( (W^{l+1})^{T} \delta^{l+1} \right) \odot \sigma'(z^{l}) \tag{1}$$

with $l < L$. Here $\delta^{l}$ is the gradient tensor at layer $l$, with $L$ being the final layer, $\nabla_{a} C$ is the gradient of the loss function $C$ with respect to the neural network output $a$ after the final activation, $\sigma$ is the activation function, $z^{l}$ is the output after layer $l$ with $a = \sigma(z^{L})$, $W^{l}$ is the weight matrix, and $\odot$ is the Hadamard product. It follows from these equations that the gradient tensor at any layer is a function of both the loss function and all succeeding weight matrices. The information from gradient tensors has been employed classically for regularization doubleback and more recently for visualizing saliency maps salmaps, interpreting DNNs visualizing; allconvnet, generating adversarial examples harnessing, and weakly supervised object localization gradcam. Most approaches use the information from the gradient tensor in a separate step to achieve the desired quantitative or qualitative result. Different from these approaches, we use the gradient tensor during the training procedure via an adversarial process GAN in our proposed GRadiEnt Adversarial Training (GREAT) procedure.
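To make the backpropagation recursion described above concrete, here is a minimal sketch in plain Python that computes the per-layer gradient tensors of a tiny two-layer perceptron. The weights, the sigmoid activation, the squared-error loss, and the function name `backprop_deltas` are illustrative assumptions, not taken from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def backprop_deltas(weights, x, y):
    """Return the gradient tensors for every layer, ordered first to last."""
    # Forward pass, storing the pre-activation output of each layer.
    a, zs = x, []
    for W in weights:
        z = matvec(W, a)
        zs.append(z)
        a = [sigmoid(v) for v in z]
    # Final layer: gradient of the (assumed) loss 0.5 * ||a - y||^2 with respect
    # to a, multiplied element-wise by the activation derivative.
    delta = [(ai - yi) * sigmoid_prime(zi) for ai, yi, zi in zip(a, y, zs[-1])]
    deltas = [delta]
    # Earlier layers: multiply by the transposed succeeding weight matrix, then
    # take the element-wise (Hadamard) product with the activation derivative.
    for l in range(len(weights) - 2, -1, -1):
        back = matvec(transpose(weights[l + 1]), deltas[0])
        deltas.insert(0, [b * sigmoid_prime(z) for b, z in zip(back, zs[l])])
    return deltas

# Hypothetical two-layer network: 2 inputs -> 2 hidden units -> 1 output.
W1 = [[0.5, -0.3], [0.8, 0.1]]
W2 = [[1.0, -1.0]]
deltas = backprop_deltas([W1, W2], x=[1.0, 2.0], y=[0.0])
```

The hidden-layer gradient tensor depends on `W2` as well as on the loss, which is the dependence on succeeding weight matrices noted in the text.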

The main premise underlying GREAT is that the information in the gradient tensor inhibits reliable training dynamics under certain scenarios. GREAT aims to nullify the dark information in the gradient tensors by first processing the gradient tensor in an auxiliary network and then passing an adversarial signal back to the main network (Figure 1a) via the gradient reversal procedure DANN. This adversarial signal regularizes the weight tensors in the main network akin to double backpropagation doubleback. Using calculus, the adversarial gradient signal flowing forward in the main network can be shown to be,


which has a functional form similar to that of the backpropagated gradient tensor, but with opposite sign, and is affected by the preceding weight matrices up to the layer of the considered gradient tensor. As networks tend to have perfect sample expressiveness as soon as the number of parameters exceeds the number of data points rethinkgen, we expect the regularization provided by the auxiliary network to improve robustness without considerably affecting performance. We describe the dark information present in the gradient tensors in three scenarios: (a) adversarial examples, (b) multi-task learning, and (c) knowledge distillation distillation. We describe the intuition behind using GREAT for these three scenarios in the subsequent paragraphs and the exact training methodology in Section 2.
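The gradient reversal procedure mentioned above can be sketched as a layer that leaves its input untouched in the forward pass and flips the sign of the gradient in the backward pass. The class below is a simplified illustration in plain Python; the class name and the `scale` parameter are assumptions for this sketch, not the authors' implementation.

```python
class GradientReversal:
    """Identity in the forward pass; negates and scales gradients in the backward pass."""

    def __init__(self, scale=1.0):
        # Hypothetical weighting of the adversarial signal.
        self.scale = scale

    def forward(self, x):
        # Values pass through unchanged.
        return list(x)

    def backward(self, grad_output):
        # The sign flip turns the auxiliary network's descent direction into
        # an ascent direction for whatever sits upstream of this layer.
        return [-self.scale * g for g in grad_output]

grl = GradientReversal(scale=0.5)
features = [0.2, -1.0, 3.0]
passed = grl.forward(features)
reversed_grad = grl.backward([1.0, -2.0, 0.0])
```

Placing such a layer between the gradient tensor and the auxiliary network lets a single backward pass train the auxiliary network normally while pushing the main network in the opposing direction.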

Adversarial examples are carefully crafted perturbations applied to normal images which are usually imperceptible to humans, but can seriously confuse state-of-the-art deep learning models advexamples; harnessing. A step common to all adversarial example generation is calculating the gradient of the objective function with respect to the input madry, called the saliency map. The objective function is either the task loss function or derived from it. This gradient tensor is processed to perturb the original image, and the model misclassifies the perturbed image. We use GREAT to make the saliency maps uninformative (Figure 1b), and hence mitigate the network’s susceptibility to adversarial examples.

The objective of knowledge distillation is to compress the predictive behavior of a cumbersome DNN (teacher) or an ensemble of DNNs into a simpler model (student) distillation; ensembling. Distilling knowledge to a student network is achieved by matching the logits or soft output distribution of the teacher to the output of the student, in addition to the usual supervised loss function. In Figure 1c, we show how GREAT provides a complementary approach to distillation wherein we statistically match the gradient tensor of the teacher to the student using the auxiliary network, in lieu of matching output distributions.

In multi-task learning, a single network is trained end-to-end to achieve multiple related but different task outputs for an input relationship. This is achieved by having a common encoder and separate task-specific decoders. In a perfect multi-task learning scenario, the gradient tensors of the individual task-loss functions with respect to the last shared layer in the encoder should be indistinguishable so as to coherently train all the shared layers in the encoder. We use GREAT to train a gradient alignment layer between the encoder and task-specific decoders which operates in the backward pass so that the task-specific gradient tensors are less distinguishable by the auxiliary network (Figure 1d).

In Section 2, we describe the GREAT procedure for each of the above scenarios. In Section 3, we highlight the results of GREAT and in Section 4 we discuss conclusions and possible avenues of future work. Note we discuss relevant work as appropriate in the remainder of this article.

Figure 2: Adversarial defense comprised of GREAT and GREACE. In GREACE, the output probability distribution from the auxiliary network is added to the gradient of the loss with respect to the logits in the main network to help separate negative classes whose gradient tensors are similar to the primary class.

2 Gradient Adversarial Training

We describe the adaptations of GREAT suitable for adversarial defense, knowledge distillation and multi-task learning.

2.1 Adversarial defense

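The general objective for defense against adversarial examples is

$$\min_{\theta} \max_{\|\Delta\|_{p} \le \epsilon} J(\theta, x + \Delta, y) \tag{3}$$

Here, $x$ is the input, $y$ the output, $\theta$ the network parameters, $\Delta$ is the perturbation tensor whose $p$-norm is constrained to be less than $\epsilon$, and $J$ subsumes the loss function and the network architecture. Non-targeted attacks are devised by $x + \epsilon f(\nabla_{x} J(\theta, x, y))$, i.e., moving in the direction of the gradient of the ground truth class $y$, where $f$ is usually the sign function in FGSM; whereas targeted attacks are calculated as $x - \epsilon f(\nabla_{x} J(\theta, x, \hat{y}))$ for $\hat{y} \neq y$. Using a first-order Taylor series approximation in equation 3 amounts to the equivalent formulation,

$$\min_{\theta} \; J(\theta, x, y) + \epsilon \, \|\nabla_{x} J(\theta, x, y)\|_{q} \tag{4}$$

where $\|\cdot\|_{q}$ is the dual norm of $\|\cdot\|_{p}$, with $1/p + 1/q = 1$.
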
Previous attempts at adversarial defenses have focused on minimizing the norm of the input gradient locally at the training points defensivedistill; Ross2017ImprovingTA; datagrad; advrob. However, this leads to a sharp curvature of the loss surface near those points, violating the first-order Taylor approximation, which in turn makes the defense ineffective certified.
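The single-step attacks described above can be illustrated on a toy linear softmax classifier, for which the input gradient of the cross-entropy has a closed form. This is a sketch under stated assumptions: the model, its weights, and the helper names (`input_gradient`, `fgsm`, `predict`) are hypothetical and chosen only to show the non-targeted and targeted updates.

```python
import math

def sign(v):
    return (v > 0) - (v < 0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_of(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def predict(W, x):
    scores = logits_of(W, x)
    return scores.index(max(scores))

def input_gradient(W, x, label):
    """Gradient of the cross-entropy with respect to x for logits = W x."""
    probs = softmax(logits_of(W, x))
    err = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    return [sum(err[k] * W[k][i] for k in range(len(W))) for i in range(len(x))]

def fgsm(W, x, label, eps, targeted=False):
    """Non-targeted: step up the loss of the true label.
    Targeted: step down the loss of the chosen target label."""
    grad = input_gradient(W, x, label)
    step = -eps if targeted else eps
    return [xi + step * sign(g) for xi, g in zip(x, grad)]

W = [[1.0, 0.0], [0.0, 1.0]]   # toy two-class linear model
x = [0.6, 0.4]                 # classified as class 0
x_untargeted = fgsm(W, x, label=0, eps=0.2)
x_targeted = fgsm(W, x, label=1, eps=0.2, targeted=True)
```

Both attacks rely only on the sign of the input gradient, which is why making that gradient uninformative about the class is a plausible line of defense against single-step adversaries.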

GREAT: Our GREAT procedure removes the class-specific information present in the gradient tensor. Formally, for all samples in the training set,


In the absence of class-specific information, a single-step targeted attack becomes hard as the perturbation tensor is class-agnostic. However, GREAT makes the gradient tensors class-agnostic or in other words obfuscates the gradient. Networks with obfuscated gradients are still vulnerable to sophisticated iterative attacks obgrad and to universal adversarial perturbations universalap. Hence, as a second line of defense we propose gradient-adversarial cross-entropy (GREACE) loss.

GREACE: GREACE adapts the cross-entropy loss function to add weight to the negative classes whose gradient tensors are similar to those of the primary class. The weight is added to the negative classes in the gradient tensor flowing backward from the soft-max activation, before back-propagating through the rest of the main network (see Figure 2). The weight is evaluated using the soft-max distribution from the auxiliary network, which indicates the similarity of the gradient tensor of the primary class to the negative classes. This added weight helps separate the high-dimensional decision boundary between easily confused classes, similar in spirit to confidence penalty confidencepen and focal loss focalloss, albeit from the perspective of gradients. Mathematically, the gradient tensor from the cross-entropy loss is modified in the following way,

$$\nabla_{a} \hat{C} = \nabla_{a} C + \beta \, \sigma(a') \, \mathbb{1}_{\hat{y} \neq y} \tag{6}$$

Here, $\hat{C}$ and $C$ are the GREACE and original cross-entropy functions respectively, $a$ and $a'$ are the output activations from the main and auxiliary network respectively, $\sigma$ is the soft-max function, $\beta$ is a penalty parameter, and $\mathbb{1}_{\hat{y} \neq y}$ is a one-hot function for all $\hat{y}$ not equal to the original class $y$, i.e., negative classes. The gradient fed into the auxiliary network is masked after passing through the soft-max function in the main network, so that only the entry of the class in question is retained. This prevents the auxiliary classifier from latching onto gradient cues from negative classes, so that it concentrates only on the class in question. We also experimented with the unmasked gradient tensor, but the results were not as good. The combined objective for adversarial defense is:


The terms in this objective are the GREACE loss, the standard cross-entropy, and the masked cross-entropy, together with a weight parameter for the auxiliary network’s loss.
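The GREACE modification of the cross-entropy gradient described above can be sketched as follows. The additive form used here (standard soft-max cross-entropy gradient plus the penalty parameter times the auxiliary soft-max probability on negative classes) is an assumption based on the textual description and the Figure 2 caption; the function name and the example logits are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greace_logit_gradient(main_logits, aux_logits, label, beta):
    """Cross-entropy gradient w.r.t. the main logits, with extra weight on
    negative classes taken from the auxiliary network's soft-max output."""
    p_main = softmax(main_logits)
    p_aux = softmax(aux_logits)
    grad = []
    for k, (pm, pa) in enumerate(zip(p_main, p_aux)):
        ce = pm - (1.0 if k == label else 0.0)      # standard soft-max CE gradient
        extra = beta * pa if k != label else 0.0    # negative classes only
        grad.append(ce + extra)
    return grad

main_logits = [2.0, 1.0, 0.0]   # main network output, true class is 0
aux_logits = [1.0, 3.0, 0.0]    # auxiliary network finds class 1 gradients similar
plain = greace_logit_gradient(main_logits, aux_logits, label=0, beta=0.0)
greace = greace_logit_gradient(main_logits, aux_logits, label=0, beta=1.0)
```

In this example the negative class that the auxiliary network confuses most with the primary class receives the largest additional gradient, so a descent step pushes its logit down hardest.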

Algorithm 1: Defense against adversarial examples using GREAT and GREACE
procedure Train
    Requires: inputs, labels, penalty parameter
    while training iterations remain do
        Compute the main network loss by a forward pass
        Evaluate the masked gradient tensor
        Compute the auxiliary network loss by a forward pass
        Update weights in the auxiliary network using the auxiliary network loss
        Evaluate the reversed gradient
        Evaluate the GREACE loss
        Update weights in the main network using the GREACE loss and the reversed gradient

2.2 Knowledge distillation

In classical distillation distillation, the student’s output distribution mimics the teacher’s soft output distribution. In GREAT, the student model mimics the teacher model’s gradient distribution, which is a weaker constraint as it allows the final distributions to differ by a constant value. A solution for the student that jointly minimizes the supervised loss and the gradient-matching loss exists, as proved in sobolev. GREAT uses a discriminator to match the gradient distributions owing to the success of adversarial losses GAN over traditional regression-based losses.

Algorithm 2: Knowledge distillation using the GREAT procedure
procedure Train
    Requires: inputs, labels, teacher network with its parameters
    while training iterations remain do
        Compute the student network loss by a forward pass
        Evaluate the student gradient tensor
        Evaluate the teacher gradient tensor
        Compute the binary classifier loss by a forward pass
        Update weights in the auxiliary network using the binary classifier loss
        Evaluate the gradient tensor of the student loss
        Evaluate the reversed gradient
        Update weights in the main network using the loss gradient and the reversed gradient

The GREAT procedure for knowledge distillation mimics a GAN training procedure. The binary classifier discriminates between student and teacher model gradients and drives the student model to generate gradient tensor distribution similar to the teacher model as shown in Figure 1c. The objective to be optimized is:


The objective involves the binary classifier and its parameters, the gradient tensors from the student and the teacher, an expectation over these gradient tensors, and a loss balancing parameter. GREAT has no hyper-parameter controlling the teacher’s distribution to be matched, unlike the hard-to-set temperature parameter in distillation. However, it requires an extra pass through the student network.
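As an illustration of the discriminator's role, the sketch below scores flattened gradient tensors with a logistic binary classifier and evaluates a standard binary cross-entropy. The logistic form, the labeling convention (teacher as 1, student as 0), and all numeric values are assumptions for this sketch; the paper's exact objective is not reproduced here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def discriminator_prob(w, b, grad_tensor):
    """Probability that a flattened gradient tensor came from the teacher."""
    return sigmoid(sum(wi * gi for wi, gi in zip(w, grad_tensor)) + b)

def discriminator_loss(w, b, teacher_grads, student_grads):
    """Binary cross-entropy with teacher gradients labeled 1 and student gradients labeled 0."""
    loss = 0.0
    for g in teacher_grads:
        loss -= math.log(discriminator_prob(w, b, g))
    for g in student_grads:
        loss -= math.log(1.0 - discriminator_prob(w, b, g))
    return loss / (len(teacher_grads) + len(student_grads))

w, b = [2.0, 0.0], 0.0                           # hypothetical discriminator weights
teacher_grads = [[1.0, 0.0], [0.9, 0.1]]
student_far = [[-1.0, 0.0], [-0.8, 0.2]]         # student gradients unlike the teacher's
student_close = [[1.0, 0.0], [0.9, 0.1]]         # student gradients matching the teacher's

loss_far = discriminator_loss(w, b, teacher_grads, student_far)
loss_close = discriminator_loss(w, b, teacher_grads, student_close)
```

The discriminator's loss is higher when the student's gradient tensors resemble the teacher's. Training the student with the reversed gradient of this loss therefore drives its gradient distribution toward the teacher's.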

2.3 Multi-task learning

GradNorm gradnorm adaptively balances the loss-weights based on the norm of the gradients. The GREAT procedure for multi-task learning can be viewed as a generalization of GradNorm with two important differences: (1) We do not enforce that the gradients have balanced norms, but instead desire that they have similar statistical distributions. This is achieved by the auxiliary network, similar to a discriminator in a GAN setting. (2) Instead of assigning task-weights, we add extra capacity to the network in the form of gradient-alignment layers (GALs). These layers are placed after the shared encoder and before each of the task-specific decoders as shown in Figure 3. They have the same dimensions as the last shared feature tensor, excluding the batch dimension, and are active only during the backward pass, i.e., the GALs are dropped during forward inference.

Algorithm 3: Multi-task learning using GREAT on GALs
procedure Train
    Requires: inputs, labels for the tasks
    Initialize GAL tensors with ones and record the initial task losses
    while training iterations remain do
        Normalize the task losses after the forward pass
        Evaluate the task gradient tensors w.r.t. the shared feature
        Update weights in the decoders using the task losses
        Update weights in the encoder using the aligned task gradient tensors
        Compute the task classification loss by a forward pass
        Update weights in the task classifier network using the task classification loss
        Evaluate the reversed gradient
        Update weights in the GALs using the reversed gradient

Figure 3: GREAT for multi-task learning. The GALs are trained using reversed gradients from the auxiliary task classifier.

The auxiliary network receives the gradient tensor from each task as input and classifies them according to task. Successful classification implies the gradient tensors are discriminative, which impedes training of the shared encoder as the gradients are misaligned. The GALs mitigate the misalignment by element-wise scaling of the gradient tensors from all tasks. These layers are trained using the reversed gradient signal from the auxiliary network, i.e., the GALs attempt to make the gradient tensors indistinguishable. Intuitively, the GALs observe the statistical irregularities that prompt the auxiliary classifier to successfully discriminate between the gradient tensors, and then adapt the tensors to remove the irregularities or equalize the distributions. Note that the task losses are normalized by the initial loss so that the alignment layers are tasked with local alignment and not global loss scale alignment. Furthermore, the soft-max activation function in the auxiliary network’s classification layer implicitly normalizes the gradients. The values in the GAL weight tensors are initialized with ones and restricted to be positive for training to converge. In practice, we observed that a low learning rate ensured positivity of the GAL tensors. The overall objective for multi-task learning is:


The objective involves the normalized task losses, an N-class cross-entropy loss for the auxiliary classifier, the learnable parameters in the shared encoder and the auxiliary classifier, and, for each task, the decoder parameters, GAL parameters, and ground-truth labels. The task labels serve as the classification targets of the auxiliary classifier.
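A gradient-alignment layer as described above can be sketched as an object that is an identity in the forward pass and an element-wise scaling in the backward pass. The class and method names, the plain gradient-descent update, and the small positive floor on the scales are assumptions for illustration; the text only states that the scales start at one and stayed positive with a low learning rate.

```python
class GradientAlignmentLayer:
    """Identity in the forward pass; element-wise gradient scaling in the backward pass."""

    def __init__(self, size):
        # GAL tensors are initialized with ones, as stated in the text.
        self.scale = [1.0] * size

    def forward(self, features):
        # Inactive during forward inference.
        return list(features)

    def backward(self, task_grad):
        # Element-wise scaling of the task-specific gradient tensor.
        return [s * g for s, g in zip(self.scale, task_grad)]

    def update(self, reversed_grad, lr=0.01):
        # Step on the scales using the reversed signal from the auxiliary classifier.
        # The floor at 1e-6 is an assumption that keeps the scales positive.
        self.scale = [max(1e-6, s - lr * r) for s, r in zip(self.scale, reversed_grad)]

def normalize_losses(losses, initial_losses):
    """Divide each task loss by its initial value so all tasks start at the same scale."""
    return [l / l0 for l, l0 in zip(losses, initial_losses)]

gal = GradientAlignmentLayer(size=2)
untouched = gal.backward([1.0, 2.0])         # scales are ones, gradient passes through
gal.update([10.0, -10.0], lr=0.01)           # hypothetical reversed gradient
aligned = gal.backward([1.0, 1.0])
normalized = normalize_losses([2.0, 0.5], [4.0, 1.0])
```

One such layer would sit in front of each task-specific decoder, so every task's gradient is rescaled independently before reaching the shared encoder.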

Method        Train    No-Attack   Non-Targeted       Targeted
                                                      Worst    Random   Worst    Random
Baseline      99.97    93.32       32.75    1.99      72.89    10.43    89.59    18.29
Adversarial   99.97    89.91       56.88    16.73     82.07    45.26    89.81    69.89
GREACE        92.56    89.84       77.90    72.40     83.39    66.23    87.13    79.10
GREAT         99.53    91.95       47.51    15.45     72.73    12.78    89.62    21.95
GRE(AT+CE)    90.87    89.97       81.28    77.04     84.53    73.52    88.57    82.38
Table 1: CIFAR-10 test accuracy in % values of different training methods on targeted and non-targeted attacks using FGSM and iFGSM. We use , and set for iFGSM. GRE(AT+CE) is the best defense for all but one adversary, highlighting the importance of gradient adversarial training in addition to GREACE during training.

Method        Train    No-Attack   Non-Targeted       Targeted
                                                      Worst    Random   Worst    Random
Baseline      99.97    96.20       45.97    6.26      96.20    96.20    80.27    15.65
Adversarial   99.98    94.92       58.70    8.41      94.92    94.92    83.70    19.19
GREACE        93.52    93.71       73.65    80.13     93.70    93.70    89.26    88.09
GREAT         99.76    95.58       45.95    5.95      95.58    95.58    79.09    15.68
GRE(AT+CE)    92.95    93.90       74.12    79.36     93.89    93.90    89.56    87.37
Table 2: SVHN test accuracy in % values of different training methods on targeted and non-targeted attacks using FGSM and iFGSM. We use , and set for iFGSM. GRE(AT+CE) or GREACE is the best defense for all iFGSM and non-targeted FGSM adversaries, showing that the modified cross-entropy loss robustly separates classes.

3 Results

3.1 Adversarial defense

We demonstrate GREAT on the CIFAR-10 and SVHN datasets. We use a ResNet-18 architecture resnet for both datasets. We observed that ResNet models are more effective in the GREAT training paradigm for adversarial defense relative to models without skip connections. In GREAT, skip connections help propagate the gradient information in the usual backward pass, as well as forward propagate the reversed gradient from the auxiliary classifier network through the main network. In our experiments, the auxiliary network is a copy of the main network. We gradually increase the auxiliary loss weight parameter and the penalty parameter to their final values so as to not impede the main training task during initial epochs. We empirically set and to 2 and 10 for CIFAR-10 and SVHN, respectively. These values optimally defend against adversarial examples, while not adversely affecting the test accuracy on the original samples. The network architectures and additional parameters are discussed in the supplement.

We evaluate our method against targeted and non-targeted adversarial examples using the fast gradient sign method (FGSM) and its iterated version (iFGSM). For targeted attacks we report the test accuracy for adversaries choosing a random target class or the worst (least probability) target class. We compare our method against adversarial training and a base network with no defense mechanism in Tables 1 and 2. We employ FGSM adversaries in the adversarially trained network, described further in the supplement. Most other defenses are not effective, as reported in obgrad. For CIFAR-10, we also plot the test accuracy as a function of the maximum perturbation allowed by the adversary in Figure 4.

First, the training set accuracy indicates that GREACE acts as a strong regularizer, and the combination of GREACE and GREAT prevents over-fitting to the training set. Second, we see that GREAT adds robustness to non-targeted single-step attacks but fails against the iterated adversary (iFGSM), an indication of gradient obfuscation obgrad. Third, we see that GREACE in isolation is robust to adversarial attacks; however, the combination of GREAT and GREACE boosts robustness. Surprisingly, GRE(AT+CE) performs better than adversarial training on single-step attacks, even though adversarial training is trained to be robust against them. Finally, in Figure 4 we see that the performance of GRE(AT+CE) deteriorates only slightly for strong adversaries with high perturbation values, validating the robustness of the classifier. (Footnote 1: Single-step targeted attacks are not successful on SVHN due to the simple task of recognizing digits.)
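For readers unfamiliar with the iterated adversary used in this evaluation, the sketch below shows a common formulation of iterated FGSM on a toy linear softmax classifier: several small signed-gradient steps, each followed by a projection back into the allowed perturbation range. The step size (the maximum perturbation divided by the number of iterations), the toy model, and the helper names are assumptions; the paper's exact attack settings are not reproduced here.

```python
import math

def sign(v):
    return (v > 0) - (v < 0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_of(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def predict(W, x):
    scores = logits_of(W, x)
    return scores.index(max(scores))

def input_gradient(W, x, label):
    """Gradient of the cross-entropy with respect to x for logits = W x."""
    probs = softmax(logits_of(W, x))
    err = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    return [sum(err[k] * W[k][i] for k in range(len(W))) for i in range(len(x))]

def ifgsm(W, x, label, eps, iterations):
    """Non-targeted iterated FGSM constrained to an L-infinity ball of radius eps."""
    step = eps / iterations          # assumed step size
    x_adv = list(x)
    for _ in range(iterations):
        grad = input_gradient(W, x_adv, label)   # gradient is re-evaluated each step
        x_adv = [xi + step * sign(g) for xi, g in zip(x_adv, grad)]
        # Project back so the total perturbation never exceeds eps per coordinate.
        x_adv = [min(max(xa, xo - eps), xo + eps) for xa, xo in zip(x_adv, x)]
    return x_adv

W = [[1.0, 0.0], [0.0, 1.0]]   # toy two-class linear model
x = [0.6, 0.4]                 # classified as class 0
x_adv = ifgsm(W, x, label=0, eps=0.2, iterations=4)
```

Because the gradient is recomputed at every step, an iterated adversary can exploit structure that a single linearization misses, which is consistent with the observation that obfuscating the gradient alone does not stop it.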
The saliency maps for the different methods are plotted in Figure 4 for 3 examples of CIFAR-10. Pixel activations around an object promote generation of adversarial examples. We see that the saliency maps for baseline and adversarial training have high pixel activations both within and around the object, whereas activations for GREAT are very noisy and not discriminative, as expected. In contrast, the saliency maps for GRE(AT+CE) are sparse and predominantly activated within the object, hence mitigating adversarial examples.

Figure 4: Left: Saliency maps for images in CIFAR-10 for different training methods. Baseline and adversarial are active outside the object, GREACE is sparse, GREAT is uninformative, and GRE(AT+CE) is sparse, less informative and active within the object. Right: Accuracy plots of different methods against adversaries as a function of the maximum allowed perturbation. GREAT (yellow) is more robust than Adversarial (blue) to non-targeted adversaries for large perturbations. GREACE (green) and GRE(AT+CE) (red) are uniformly robust across perturbation sizes.

Method         CIFAR-10                               mini-ImageNet
               CNN(S)+RN(T)       RN(S)+RNx(T)        RN(S)+RN152(T), 64x64    RN(S)+RN152(T), 224x224
               100%      5%       100%      5%        100%      5%             100%      5%
Baseline       84.74     65.41    93.19     66.73     59.24     14.41          58.02     13.79
Distillation   85.69     66.45    93.65     67.69     51.72     16.73          46.77     14.00
GREAT          85.72     66.55    93.43     67.80     59.80     16.82          56.31     14.02
Table 3: Results of knowledge distillation on CIFAR-10 and mini-ImageNet. RN refers to ResNet-18. The third row indicates the % of all train samples used during training. GREAT performs best in the sparse regime for all combinations and better than distillation on all but 1 scenario.

3.2 Knowledge distillation

We demonstrate GREAT’s potential for knowledge distillation on the CIFAR-10 and mini-ImageNet datasets. The mini-ImageNet dataset is a subset of the original ImageNet dataset with 200 classes, and 500 training and 50 test samples for each class. We show distillation results for 2 scenarios: (a) all training examples are used to train the student model, i.e., the dense regime, and (b) only 5% of training samples are used to train the student models, i.e., the sparse regime. For CIFAR-10, we use (i) a 5-layer CNN and a pretrained ResNet-18, and (ii) ResNet-18 and a pretrained ResNeXt-29-8 resnext as student-teacher combinations. For mini-ImageNet, we train a teacher ResNet-152 model at two resolutions: (i) 64x64 and (ii) 224x224, for 100 epochs and 50 epochs, respectively. We use a ResNet-18 as the student model at both resolutions. We use a shallower version of the student model as the auxiliary binary classifier. Details of the architecture, optimizer and learning rate policy for each scenario are in the supplement. We compare GREAT against a baseline model trained using cross-entropy loss, and against a distilled model trained using a combination of cross-entropy and unsupervised KL-loss. We determined the best temperature and balancing parameter for distillation in the two training regimes on the 5-layer CNN+ResNet-18 combination, and used these parameters for the mini-ImageNet experiments. The optimal parameters were chosen through grid search for the ResNet-18+ResNeXt-29-8 combination. For GREAT, we use a single value of the loss balancing parameter in all experiments, determined from the dense training regime of the CNN+ResNet-18 combination. The results are reported in Table 3. We see that GREAT consistently performs better than the baseline and distillation in the sparse training regime, indicating better regularization by the gradient adversarial signal.
The baseline model performs best in the full-resolution, dense training regime for mini-ImageNet, indicating that the teacher model trained for only 50 epochs provides weak learning cues. Indeed, the best test accuracy reported for mini-ImageNet at full resolution is 83.32%, as opposed to our teacher model with 71.30% top-1 accuracy. The poor performance of distillation in the mini-ImageNet dense regime indicates that the hyperparameters determined on CIFAR-10 are not transferable across datasets. In contrast, GREAT with the same parameter is able to coherently distill the model for both the dense and sparse training regimes across different student-teacher combinations.

Method        CIFAR-10                           NYUv2
              Class    Color    Edge     Auto    Depth    Normal   Keypoint
Equal         24.0     0.131    0.349    0.113   0.861    0.207    0.407
Uncertainty   26.6     0.111    0.270    0.090   0.796    0.192    0.389
GradNorm      23.5     0.116    0.270    0.091   0.810    0.169    0.377
GREAT         24.2     0.114    0.252    0.087   0.779    0.167    0.382
Table 4: Test errors of multi-task learning on the CIFAR-10 and NYUv2 datasets. GREAT performs best on 2 tasks each for CIFAR and NYUv2, and has comparable performance on the other tasks.

3.3 Multi-task learning

We test GREAT for multi-task learning on 2 datasets: (a) CIFAR-10, with a noisy gray-scale image as input and with tasks (i) classification, (ii) colorization, (iii) edge detection and (iv) denoised reconstruction; (b) the NYUv2 dataset, wherein the tasks are (i) depth estimation, (ii) surface-normal estimation, and (iii) key-point estimation. The input and output resolutions for the CIFAR-10 dataset are 32x32, and the input resolution for NYUv2 is 320x320 and the output resolution is 80x80, as set in gradnorm. We compare our method against the baseline of equal weights, GradNorm gradnorm, and uncertainty-based weighting kendallmulti. For all methods we use the same architecture: a ResNet-53 with dilated convolution backbone and task-specific decoders. We tested GradNorm for different values of its hyperparameter and set it equal to 0.6 for CIFAR-10, and 1.5 for NYUv2 as set in the original paper. Full details about the dataset creation, task losses, main model and classifier architecture are in the supplement. Table 4 lists the results. We see that GREAT performs better than or on par with GradNorm, despite having no tunable hyperparameters. This indicates that the extra parameters in the GALs are sufficient to absorb dataset-specific information without requiring hand-tuning. On CIFAR-10, we see that GREAT performs best on edge detection and denoised auto-encoding, and is close to the best value for colorization. The high classification error for the uncertainty-based method and the high RMSE values of the baseline on the other three tasks indicate that classification is antagonistic to the other three tasks. However, both GradNorm and GREAT are able to correctly balance the gradient flowing from classification with the other tasks. On the NYUv2 dataset we see that GREAT performs best on depth and normal estimation, and is within 0.005 RMSE of the best method on keypoint detection (Table 4). Overall, we see that GREAT performs better than all other methods on four of the seven tasks, and is close to the best values in all cases.

4 Conclusion and future work

We have introduced gradient adversarial training and demonstrated its applicability in diverse scenarios: from defense against adversarial examples to knowledge distillation to multi-task learning. We show that adaptations of GREAT (a) offer strong defense to both targeted and non-targeted adversarial examples, (b) can easily distill knowledge from different teacher networks without heavy parameter tuning, and (c) aid multi-task learning by tuning a gradient alignment layer. There are several directions of future work in the proposed domains. We wish to investigate other forms of loss functions beyond GREACE that are symbiotic with GREAT, explore progressive training of student networks using ideas from Progressive-GAN progressivegan to better learn from the teacher, and absorb the explicit parameters in the GALs directly into the optimizer, as done with the mean and variance estimates for each weight parameter in ADAM adam. The general approach underlying GREAT of passing an adversarial gradient signal to a network is broadly applicable to domains beyond the ones discussed here, such as to the discriminator in domain adversarial training DANN and GANs GAN. We can also replace direct gradient tensor evaluation with synthetic gradients syntheticgrad for efficiency. In the future we will explore these exciting avenues. Holistically, we believe that understanding gradient distributions will help uncover the underlying mechanisms that govern the successful training of deep architectures using backpropagation, and gradient adversarial training is a step in this direction.