1 Introduction
In human learning, an effective and widely used method for improving learning outcomes is self-explanation, in which students explain a learned topic to themselves to understand it better. Self-explanation encourages a student to actively digest and integrate prior knowledge and new information, which helps fill gaps in understanding the topic.
Inspired by this explanation-driven learning technique of humans, we investigate whether the same methodology can improve machine learning. We propose a novel learning framework called learning by self-explanation (LeaSE). In this framework, there is an explainer model and an audience model, both of which learn to perform the same prediction task. The explainer has a learnable architecture and a set of learnable network weights. The audience has an architecture predefined by human experts and a set of learnable network weights. The goal is to help the explainer learn the target task well, and the way to achieve this goal is to encourage the explainer to give clear explanations to the audience. Intuitively, if a model can explain prediction outcomes well, it has a deep understanding of the prediction task and can learn better accordingly. The learning is organized into four stages. In the first stage, the explainer trains its network weights by minimizing the prediction loss on its training dataset, with its architecture fixed. In the second stage, the explainer uses the model trained in the first stage to make predictions on the audience’s training examples and leverages an adversarial attack approach to explain the prediction outcomes. In the third stage, the audience combines its training examples with the explainer’s explanations of the prediction outcomes on these examples to train its network weights. In the fourth stage, the explainer updates its neural architecture by minimizing its own validation loss and the audience’s validation loss. The four stages are performed jointly end-to-end in a multi-level optimization framework, where different stages influence each other. We apply our method to neural architecture search in image classification tasks. Applied to DARTS-2nd, our method reduces the classification error from 20.58% to 19.26% on CIFAR-100 and from 2.76% to 2.62% on CIFAR-10, and reduces the top-1 error on ImageNet (deng2009imagenet) from 26.7% to 24.7%.
The major contributions of this paper are as follows:
Inspired by the explanation-driven learning technique of humans, we propose a novel machine learning approach called learning by self-explanation (LeaSE). In our approach, the explainer model improves its learning ability by trying to clearly explain to an audience model how the prediction outcomes are made.
We propose a multi-level optimization framework to formulate LeaSE which involves four stages of learning: explainer learns; explainer explains; audience learns; explainer and audience validate themselves.
We develop an efficient algorithm to solve the LeaSE problem.
We apply our approach to neural architecture search on CIFAR-100, CIFAR-10, and ImageNet. The results demonstrate the effectiveness of our method.
The rest of the paper is organized as follows. Sections 2 and 3 present the method and the experiments, respectively. Section 4 reviews related work. Section 5 concludes the paper.
2 Methods
In this section, we propose a framework for learning by self-explanation (LeaSE) and develop an optimization algorithm for solving the LeaSE problem.
2.1 Learning by Self-Explanation
In our framework, there is an explainer model and an audience model, both of which learn to perform the same target task. The primary goal of our framework is to help the explainer learn the target task well. The way to achieve this goal is to let the explainer make meaningful explanations of the prediction outcomes in the target task. The intuition behind LeaSE is that, to correctly explain prediction results, a model needs to understand the target task well. The explainer has a learnable architecture and a set of learnable network weights. The audience has a predefined neural architecture (designed by human experts) and a set of learnable network weights. The learning is organized into four stages. In the first stage, the explainer trains its network weights on its training dataset, with the architecture fixed:
The architecture is used to define the training loss, but it is not updated in this stage. If the architecture were also learned by minimizing this training loss, a trivial solution would result in which the architecture is so large and complex that it perfectly overfits the training data but generalizes poorly to unseen data. Note that the optimally trained weights are a function of the architecture, since they are a function of the training loss and the training loss is a function of the architecture. In the second stage, the explainer uses the trained model to make predictions on the input training examples of the audience and explains the prediction outcomes. Specifically, given an input data example (without loss of generality, we assume it is an image) and its predicted label, the explainer aims to find a subset of image patches that are most correlated with the predicted label and uses them as explanations for the prediction. We leverage an adversarial attack approach to achieve this goal. An adversarial attack (goodfellow2014explaining) adds small perturbations to the pixels of the image so that the prediction outcome on the perturbed image is no longer the original predicted label. Pixels that are perturbed more have higher correlations with the prediction outcome and can be used as explanations. This process amounts to solving the following optimization problem:
Here, a perturbation is added to each image, and the two arguments of the loss are the prediction outcomes of the explainer’s network on the perturbed image and on the original image. Without loss of generality, we assume the task is image classification. Then both prediction outcomes are vectors of predicted probabilities over the classes, and the loss is the cross-entropy loss. In this optimization problem, the explainer aims to find a perturbation for each image such that the predicted outcome on the perturbed image is largely different from that on the original image. The learned optimal perturbations are used as explanations: larger values indicate that the corresponding pixels are more important in decision-making. Note that the optimal perturbations are a function of the explainer’s optimal weights, since they are a function of the objective in Eq.(2) and the objective is a function of those weights. In the third stage, given the explanations made by the explainer, the audience leverages them to learn the target task. Since the perturbations indicate how important the input pixels are, the audience uses them to reweigh the pixels through element-wise multiplication of each image with its perturbation. Pixels that are more important are given more weight. Then the audience trains its network weights on these weighted images:
Here, the loss is computed between the prediction outcome of the audience’s network on the weighted image and the class label. Note that the audience’s optimal weights are a function of the optimal perturbations, since they are a function of the objective in Eq.(3) and the objective is a function of the perturbations. In the fourth stage, the explainer validates its network weights on its validation set and the audience validates its network weights on its validation set. The explainer optimizes its architecture by minimizing its validation loss and the audience’s validation loss:
The two validation losses are balanced by a tradeoff parameter.
The four stages are mutually dependent: the explainer’s weights learned in the first stage are used to define the objective function in the second stage; the perturbations learned in the second stage are used to define the objective function in the third stage; the explainer’s weights and the audience’s weights learned in the first and third stages are used to define the objective in the fourth stage; and the architecture updated in the fourth stage in turn changes the objective function in the first stage, which subsequently changes the explainer’s weights, the perturbations, and the audience’s weights.
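The second and third stages can be illustrated with a toy example. The sketch below is not the authors’ implementation: it assumes a linear softmax classifier as the explainer, uses the predicted label as the cross-entropy target, bounds each perturbation entry by a hypothetical budget `eps`, and reweighs pixels by the perturbation magnitude. The names `explain` and `reweigh` are introduced here for illustration only.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(weights, x):
    # Linear classifier: one weight row per class.
    return softmax([sum(w * v for w, v in zip(row, x)) for row in weights])

def explain(weights, x, eps=0.1, lr=0.5, steps=20):
    """Gradient ascent on the cross-entropy between the prediction on the
    perturbed input and the originally predicted label (assumed target)."""
    probs = predict(weights, x)
    label = max(range(len(probs)), key=probs.__getitem__)
    delta = [0.0] * len(x)
    for _ in range(steps):
        p = predict(weights, [a + d for a, d in zip(x, delta)])
        # d(cross-entropy)/d(logit_k) = p_k - 1[k == label]
        dz = [pk - (1.0 if k == label else 0.0) for k, pk in enumerate(p)]
        grad = [sum(weights[k][j] * dz[k] for k in range(len(dz)))
                for j in range(len(x))]
        # Ascent step, then clip to the assumed budget [-eps, eps].
        delta = [max(-eps, min(eps, d + lr * g)) for d, g in zip(delta, grad)]
    return label, delta

def reweigh(x, delta):
    # Element-wise product; using |delta| as the weight is an assumption.
    return [abs(d) * v for d, v in zip(delta, x)]

weights = [[2.0, 0.5, 0.0], [-2.0, 0.5, 0.0]]
image = [1.0, 1.0, 1.0]
label, delta = explain(weights, image)
weighted_image = reweigh(image, delta)
```

In this toy setting only the first pixel separates the two classes, so it receives the largest perturbation and therefore the largest weight in the image passed on to the audience.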
Putting these pieces together, we have the following LeaSE framework, which is a four-level optimization problem:
This formulation nests four optimization problems. The constraints of the outer optimization problem contain three inner optimization problems, corresponding to the first, second, and third learning stages respectively. The objective function of the outer optimization problem corresponds to the fourth learning stage.
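The structure described above can be written out as follows. This is a reconstruction from the prose, not the paper’s typeset equation, and every symbol is an assumption introduced here: $A$ is the explainer’s architecture, $E$ and $W$ are the weights of the explainer and the audience, $f$ and $g$ are their networks, $(x_i, t_i)$ are the audience’s $N$ training examples and labels, $\Delta = \{\delta_i\}$ are the perturbations, $D_e^{(\mathrm{tr})}$, $D_e^{(\mathrm{val})}$, and $D_a^{(\mathrm{val})}$ are the explainer’s training data and the two validation sets, $\ell$ is the cross-entropy loss, and $\gamma$ is the tradeoff parameter.

```latex
\begin{aligned}
\min_{A}\quad & L\big(E^{*}(A), A, D_e^{(\mathrm{val})}\big)
  + \gamma\, L\big(W^{*}(\Delta^{*}(E^{*}(A))), D_a^{(\mathrm{val})}\big) \\
\text{s.t.}\quad
& W^{*}(\Delta^{*}(E^{*}(A))) = \operatorname*{argmin}_{W}
  \sum_{i=1}^{N} \ell\big(g(\delta_i^{*}(E^{*}(A)) \odot x_i; W),\, t_i\big) \\
& \Delta^{*}(E^{*}(A)) = \operatorname*{argmax}_{\Delta}
  \sum_{i=1}^{N} \ell\big(f(x_i + \delta_i; E^{*}(A)),\, f(x_i; E^{*}(A))\big) \\
& E^{*}(A) = \operatorname*{argmin}_{E} L\big(E, A, D_e^{(\mathrm{tr})}\big)
\end{aligned}
```

Read from the bottom up, the three constraints are the first, second, and third stages, and the outer objective is the fourth. Whether the perturbations are additionally constrained (for example by a norm bound) cannot be recovered from the surrounding text.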
|Architecture of the explainer|
|Network weights of the explainer|
|Network weights of the audience|
|Training data of the explainer|
|Training data of the audience|
|Validation data of the explainer|
|Validation data of the audience|
Similar to (liu2018darts), we represent the architecture of the explainer in a differentiable way. The search space is composed of a large number of building blocks. The output of each block is associated with a variable indicating how important this block is. After learning, the blocks whose variables are among the largest are retained to form the final architecture. Architecture search therefore amounts to optimizing this set of architecture variables.
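A minimal sketch of this differentiable representation is given below. It assumes, in the style of DARTS, that the importance variables are normalized with a softmax and that a node’s output is the weighted sum of the candidate blocks’ outputs; the text above does not state the normalization, and the names `mixed_output` and `retain_top` are hypothetical.

```python
import math

def softmax(values):
    m = max(values)
    e = [math.exp(v - m) for v in values]
    s = sum(e)
    return [v / s for v in e]

def mixed_output(x, ops, alphas):
    """Weighted sum of candidate block outputs (assumed softmax weights)."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, ops))

def retain_top(alphas, names, k=1):
    """Keep the k blocks whose importance variables are largest."""
    ranked = sorted(zip(alphas, names), reverse=True)
    return [name for _, name in ranked[:k]]

# Toy scalar "blocks" standing in for convolution, pooling, etc.
ops = [lambda x: x, lambda x: 0.0, lambda x: 2.0 * x]
names = ["identity", "zero", "double"]

uniform = mixed_output(3.0, ops, [0.0, 0.0, 0.0])
kept = retain_top([0.1, -1.0, 0.7], names, k=1)
```

Because the mixed output is a smooth function of the importance variables, they can be updated by gradient descent together with the network weights, and the discrete architecture is read off only at the end.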
2.2 Optimization Algorithm
In this section, we derive an optimization algorithm to solve the LeaSE problem defined in Eq.(5). Inspired by (liu2018darts), we approximate the explainer’s optimal weights using a one-step gradient descent update of the weights with respect to the first-stage training loss. We plug this approximation into the second-stage objective and obtain an approximated objective. Then we approximate the optimal perturbations using a one-step gradient descent update of the perturbations with respect to this approximated objective. Next, we plug the approximated perturbations into the third-stage objective and get another approximated objective, and we approximate the audience’s optimal weights using a one-step gradient descent update of its weights with respect to it. Finally, we plug the approximations of the explainer’s and the audience’s weights into the fourth-stage validation loss and get an approximated objective, on which we perform a gradient descent update of the architecture. In the sequel, we use to denote .
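The chain of one-step approximations can be sketched as follows. The notation is assumed here rather than taken from the paper: $A$, $E$, $W$ are the architecture and the two sets of weights, $\delta_i$ is the perturbation for the audience’s training example $(x_i, t_i)$, $f$ and $g$ are the explainer’s and audience’s networks, $\xi_e$, $\xi_\delta$, $\xi_w$, and $\eta$ are learning rates, and $\gamma$ is the tradeoff parameter.

```latex
\begin{aligned}
E' &= E - \xi_{e}\, \nabla_{E} L\big(E, A, D_e^{(\mathrm{tr})}\big) \\
\delta_i' &= \delta_i + \xi_{\delta}\, \nabla_{\delta_i}
  \ell\big(f(x_i + \delta_i; E'),\, f(x_i; E')\big) \\
W' &= W - \xi_{w}\, \nabla_{W} \sum_{i=1}^{N}
  \ell\big(g(\delta_i' \odot x_i; W),\, t_i\big) \\
A &\leftarrow A - \eta\, \nabla_{A} \Big[ L\big(E', A, D_e^{(\mathrm{val})}\big)
  + \gamma\, L\big(W', D_a^{(\mathrm{val})}\big) \Big]
\end{aligned}
```

The second line is written as an ascent step because the second stage maximizes its objective; this is the same as a descent step on the negated objective. Since $E'$ depends on $A$, $\delta_i'$ on $E'$, and $W'$ on $\delta_i'$, the gradient in the last line has to be propagated through all three updates by the chain rule.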
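First, we approximate the explainer’s optimal weights using a one-step gradient descent update:
where the step size is a learning rate. Plugging the approximated weights into the second-stage objective, we obtain an approximated objective. Then we approximate the optimal perturbations using a one-step gradient descent update with respect to this approximated objective:
Plugging the approximated perturbations into the third-stage objective, we obtain an approximated objective. Then we approximate the audience’s optimal weights using a one-step gradient descent update with respect to it:
Finally, we plug the approximated weights of the explainer and the audience into the fourth-stage objective and get an approximated validation objective. We can update the explainer’s architecture by descending the gradient of this objective w.r.t. the architecture: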
The first term in the third line involves an expensive matrix-vector product, whose computational cost can be reduced by a finite difference approximation:
where and is a small scalar that equals .
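The finite difference idea can be checked on a toy scalar problem. The sketch below assumes the central-difference form used in DARTS, where the weights are shifted by a small step along a vector `v` and the architecture gradient is evaluated at both shifted points; the toy loss, the function names, and the step scale `0.01 / |v|` are assumptions made for illustration and are not recovered from the text above.

```python
def grad_arch(w, a):
    """Gradient w.r.t. a of the toy loss L(w, a) = a * (0.5 * w**2 + w)."""
    return 0.5 * w ** 2 + w

def mixed_second_derivative_times_v(w, a, v, scale=0.01):
    """Approximate (d^2 L / da dw) * v with two gradient evaluations."""
    if v == 0.0:
        return 0.0
    eps = scale / abs(v)  # assumed step size, in the style of DARTS
    w_plus, w_minus = w + eps * v, w - eps * v
    return (grad_arch(w_plus, a) - grad_arch(w_minus, a)) / (2.0 * eps)

approx = mixed_second_derivative_times_v(w=1.5, a=0.3, v=2.0)
exact = (1.5 + 1.0) * 2.0  # analytic value: d^2 L / da dw = w + 1
```

The approximation needs only two extra gradient evaluations instead of forming the second-derivative matrix explicitly, which is what makes the architecture update affordable for large networks.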
For in Eq.(9), it can be calculated as
according to the chain rule, where
The algorithm for solving LeaSE is summarized in Algorithm 1.
3 Experiments
We apply LeaSE to neural architecture search in image classification tasks. Following (liu2018darts), we first perform architecture search, which finds an optimal cell, and then perform architecture evaluation, which composes multiple copies of the searched cell into a large network, trains it from scratch, and evaluates the trained model on the test set.
3.1 Datasets
We used three datasets in the experiments: CIFAR-10, CIFAR-100, and ImageNet (deng2009imagenet). The CIFAR-10 dataset contains 50K training images and 10K testing images from 10 classes, with an equal number of images per class. Following (liu2018darts), we split the original 50K training set into a new 25K training set and a 25K validation set. In the sequel, “training set” always refers to the new 25K training set. During architecture search, the training set is used as the training data of both the explainer and the audience in LeaSE, and the validation set is used as the validation data of both. During architecture evaluation, the combination of the training data and validation data is used to train the large network that stacks multiple copies of the searched cell. The CIFAR-100 dataset contains 50K training images and 10K testing images from 100 classes, with an equal number of images per class. As with CIFAR-10, the 50K training images are split into a 25K training set and a 25K validation set, which are used in the same way as for CIFAR-10. The ImageNet dataset contains a training set of 1.2M images and a validation set of 50K images from 1000 object classes. The validation set is used as the test set for architecture evaluation. Following (liu2018darts), we evaluate the architectures searched on CIFAR-10 and CIFAR-100 on ImageNet: given a cell searched on CIFAR-10 or CIFAR-100, multiple copies of it compose a large network, which is then trained on the 1.2M ImageNet training images and evaluated on the 50K test images.
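The 25K/25K split of the CIFAR training images can be sketched in a few lines. The text does not say how the examples are assigned to the two halves, so the seeded random shuffle below is an assumption, and `split_train_val` is a hypothetical helper that works on example indices only.

```python
import random

def split_train_val(num_examples=50_000, num_train=25_000, seed=0):
    """Return disjoint index lists for the new training and validation sets."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)  # assumed: seeded random assignment
    return indices[:num_train], indices[num_train:]

train_idx, val_idx = split_train_val()
```

During search, both the explainer and the audience would train on the examples in `train_idx` and be validated on those in `val_idx`; for architecture evaluation the two lists are simply concatenated again.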
3.2 Experimental Settings
The search space of the explainer’s architecture in LeaSE is the same as that in DARTS (liu2018darts). The candidate operations include 3×3 and 5×5 separable convolutions, 3×3 and 5×5 dilated separable convolutions, 3×3 max pooling, 3×3 average pooling, identity, and zero. In LeaSE, the network of the explainer is a stack of multiple cells, each consisting of 7 nodes. For the architecture of the audience, we used ResNet-18 (resnet). The tradeoff parameter is set to 1.
For CIFAR-10 and CIFAR-100, during architecture search, the explainer’s network is a stack of 8 cells, with the initial channel number set to 16. The search is performed for 50 epochs with a batch size of 64. The hyperparameters for the explainer’s architecture and weights are set in the same way as in DARTS. The network weights of the audience are optimized using SGD with a momentum of 0.9 and a weight decay of 3e-4. The initial learning rate is set to 0.025 with a cosine decay scheduler. During architecture evaluation, 20 copies of the searched cell are stacked to form the explainer’s network, with the initial channel number set to 36. The network is trained for 600 epochs with a batch size of 96 (for both CIFAR-10 and CIFAR-100). The experiments are performed on a single Tesla V100.
For ImageNet, following (liu2018darts), we take the architecture searched on CIFAR-10 and evaluate it on ImageNet. We stack 14 cells (searched on CIFAR-10) to form a large network and set the initial channel number to 48. The network is trained for 250 epochs with a batch size of 1024 on 8 Tesla V100s. Each LeaSE experiment is repeated ten times with random seeds from 1 to 10. We report the mean and standard deviation of the results over the 10 runs.
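For reference, the cosine decay schedule mentioned above for the audience’s weights can be written as a small function. The initial learning rate of 0.025 and the 50 search epochs come from the text; the per-epoch granularity and the minimum learning rate of zero are assumptions, and `cosine_lr` is a hypothetical name.

```python
import math

def cosine_lr(epoch, total_epochs=50, base_lr=0.025, min_lr=0.0):
    """Cosine decay from base_lr at epoch 0 to min_lr at total_epochs."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [cosine_lr(epoch) for epoch in range(51)]
```

The rate falls slowly at first, fastest around the midpoint of the search, and slowly again near the end, reaching half of the initial value at epoch 25.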
|Method||Top-1 Error (%)||Top-5 Error (%)||Params (M)||Search cost (GPU days)|
|*ShuffleNet 2 (v1) (ZhangZLS18)||26.4||10.2||5.4||-|
|*ShuffleNet 2 (v2) (MaZZS18)||25.1||7.6||7.4||-|
|*P-DARTS (CIFAR100) (chen2019progressive)||24.7||7.5||5.1||0.3|
|*P-DARTS (CIFAR10) (chen2019progressive)||24.4||7.4||4.9||0.3|
3.3 Results
Table 2 shows the classification error (%), number of weight parameters (millions), and search cost (GPU days) of different NAS methods on CIFAR-100. When our method LeaSE is applied to DARTS-2nd, the classification error is reduced from 20.58% to 19.26%. This demonstrates that learning by self-explanation is an effective mechanism for improving the quality of searched architectures. In our method, the explainer is encouraged to make sensible explanations for the prediction outcomes, and to make meaningful explanations it needs to understand the prediction task well.
Table 3 shows the classification error (%), number of weight parameters (millions), and search cost (GPU days) of different NAS methods on CIFAR-10. When our proposed LeaSE is applied to DARTS-2nd, the error is reduced from 2.76% to 2.62%. This further demonstrates the efficacy of our method in searching for better-performing architectures, thanks to its mechanism of improving the explainer’s learning ability by encouraging the explainer to make sensible explanations.
Table 4 shows the results on ImageNet, including top-1 and top-5 classification errors on the test set, number of weight parameters (millions), and search cost (GPU days). Following (liu2018darts), we take the architecture searched by LeaSE-R18-DARTS-2nd on CIFAR-10 and evaluate it on ImageNet. When our LeaSE method is applied to DARTS-2nd, the top-1 error is reduced from 26.7% to 24.7%. This further demonstrates the effectiveness of our method.
4 Related Works
Neural architecture search (NAS), which aims to find the neural network architecture with the best predictive performance, has achieved remarkable progress recently. In general, there are three paradigms of NAS methods: reinforcement learning (RL) approaches (zoph2016neural; pham2018efficient; zoph2018learning), evolutionary learning approaches (liu2017hierarchical; real2019regularized), and differentiable approaches (cai2018proxylessnas; liu2018darts; xie2018snas). In RL-based approaches, a policy is learned to iteratively generate new architectures by maximizing a reward, which is the accuracy on the validation set. Evolutionary learning approaches represent the architectures as individuals in a population. Individuals with high fitness scores (validation accuracy) have the privilege to generate offspring, which replace individuals with low fitness scores. Differentiable approaches adopt a network pruning strategy. On top of an over-parameterized network, the weights of connections between nodes are learned using gradient descent, and weights close to zero are then pruned. Many efforts have been devoted to improving differentiable NAS methods. In P-DARTS (chen2019progressive), the depth of searched architectures is allowed to grow progressively during the training process. Search space approximation and regularization approaches are developed to reduce computational overhead and improve search stability. PC-DARTS (abs-1907-05737) reduces the redundancy in exploring the search space by sampling a small portion of a super network. Operation search is performed in a subset of channels, with the held-out part bypassed in a shortcut. Our proposed LeaSE framework can be applied to any differentiable NAS method.
5 Conclusions
In this paper, we propose a new machine learning approach, learning by self-explanation (LeaSE), inspired by the explanation-driven learning technique of humans. In LeaSE, the primary goal is to help an explainer model learn to perform a target task well. The way to achieve this goal is to let the explainer make sensible explanations. The intuition behind LeaSE is that a model has to understand a topic very well before it can explain this topic clearly. We propose a multi-level optimization framework to formalize LeaSE, which involves four learning stages: the explainer learns a topic; the explainer explains this topic; the audience learns this topic based on the explanations given by the explainer; and the explainer re-learns this topic based on feedback on the learning outcome of the audience. Applied to neural architecture search on top of DARTS-2nd, our framework reduces the classification error on CIFAR-100 and CIFAR-10 and the top-1 error on ImageNet.