1 Introduction
In human learning, an effective and widely used methodology for improving learning outcomes is self-explanation, where students explain a learned topic to themselves in order to achieve a better understanding of it. Self-explanation encourages a student to actively digest and integrate prior knowledge and new information, which helps to fill gaps in understanding a topic.
Inspired by this explanation-driven learning technique of humans, we are interested in investigating whether this methodology can improve machine learning as well. We propose a novel learning framework called learning by self-explanation (LeaSE). In this framework, there is an explainer model and an audience model, both of which learn to perform the same prediction task. The explainer has a learnable architecture and a set of learnable network weights. The audience has an architecture predefined by human experts and a set of learnable network weights. The goal is to help the explainer learn the target task well, and the way to achieve this goal is to encourage the explainer to give clear explanations to the audience. Intuitively, if a model can explain its prediction outcomes well, it has a deep understanding of the prediction task and can learn better accordingly. The learning is organized into four stages. In the first stage, the explainer trains its network weights by minimizing the prediction loss on its training dataset, with its architecture fixed. In the second stage, the explainer uses the model trained in the first stage to make predictions on the training examples of the audience and leverages an adversarial attack approach to explain the prediction outcomes. In the third stage, the audience model combines its training examples and the explainer-made explanations of the prediction outcomes on these examples to train its network weights. In the fourth stage, the explainer updates its neural architecture by minimizing its own validation loss and the audience's validation loss. The four stages are performed jointly end-to-end in a multi-level optimization framework, where different stages influence each other. We apply our method to neural architecture search in image classification tasks and achieve significant improvements on CIFAR-100, CIFAR-10, and ImageNet (deng2009imagenet).

The major contributions of this paper are as follows:

Inspired by the explanation-driven learning technique of humans, we propose a novel machine learning approach called learning by self-explanation (LeaSE). In our approach, an explainer model improves its learning ability by trying to clearly explain to an audience model how its prediction outcomes are made.

We propose a multi-level optimization framework to formulate LeaSE, which involves four stages of learning: the explainer learns; the explainer explains; the audience learns; the explainer and the audience validate themselves.

We develop an efficient algorithm to solve the LeaSE problem.

We apply our approach to neural architecture search on CIFAR-100, CIFAR-10, and ImageNet. The results demonstrate the effectiveness of our method.
The rest of the paper is organized as follows. Sections 2 and 3 present the method and experiments, respectively. Section 4 reviews related work. Section 5 concludes the paper.
2 Methods
In this section, we propose a framework for learning by self-explanation (LeaSE) and develop an optimization algorithm for solving the LeaSE problem.
2.1 Learning by Self-Explanation
In our framework, there is an explainer model and an audience model, both of which learn to perform the same target task. The primary goal of our framework is to help the explainer learn the target task very well. The way to achieve this goal is to let the explainer make meaningful explanations of its prediction outcomes on the target task. The intuition behind LeaSE is: to correctly explain prediction results, a model needs to understand the target task very well. The explainer has a learnable architecture $A$ and a set of learnable network weights $E$. The audience has a neural architecture predefined by human experts and a set of learnable network weights $W$. The learning is organized into four stages. In the first stage, the explainer trains its network weights on its training dataset $D_e^{(tr)}$, with the architecture fixed:

$$E^*(A) = \operatorname*{argmin}_{E} \; L(A, E, D_e^{(tr)}). \qquad (1)$$
The architecture $A$ is used to define the training loss but is not updated in this stage. If $A$ were learned by minimizing this training loss, a trivial solution would be yielded where $A$ is so large and complex that it perfectly overfits the training data but generalizes poorly on unseen data. Note that the optimally trained weights $E^*(A)$ are a function of $A$: the training loss is a function of $A$, and $E^*$ is a function of the training loss. In the second stage, the explainer uses the trained model to make predictions on the input training examples of the audience and explains the prediction outcomes. Specifically, given an input data example $x$ (without loss of generality, we assume it is an image) and the predicted label $y$, the explainer aims to find a subset of image patches in $x$ that are most correlated with $y$ and uses them as explanations for $y$. We leverage an adversarial attack approach to achieve this goal. Adversarial attack (goodfellow2014explaining) adds small random perturbations to the pixels of $x$ so that the prediction outcome on the perturbed image is no longer $y$. Pixels that are perturbed more have higher correlations with the prediction outcome and can be used as explanations. This process amounts to solving the following optimization problem:

$$\delta^*(E^*(A)) = \operatorname*{argmax}_{\delta} \; \sum_i \ell\big(f(x_i; A, E^*(A)),\; f(x_i + \delta_i; A, E^*(A))\big), \qquad (2)$$
where $\tilde{x}_i = x_i + \delta_i$ and $\delta_i$ is the perturbation added to image $x_i$. $f(x_i; A, E^*(A))$ and $f(\tilde{x}_i; A, E^*(A))$ are the prediction outcomes of the explainer's network on $x_i$ and $\tilde{x}_i$. Without loss of generality, we assume the task is image classification with $K$ classes; the two prediction outcomes are then $K$-dimensional vectors containing prediction probabilities on the $K$ classes. $\ell(p, q) = -\sum_{k=1}^{K} p_k \log q_k$ is the cross-entropy loss. In this optimization problem, the explainer aims to find a perturbation $\delta_i$ for each image $x_i$ so that the predicted outcome on the perturbed image is largely different from that on the original image. The learned optimal perturbations $\delta^*$ are used as explanations, and entries with larger values indicate that the corresponding pixels are more important in decision-making. Note that $\delta^*$ is a function of $E^*(A)$ since $\delta^*$ is a function of the objective in Eq.(2) and the objective is a function of $E^*(A)$. In the third stage, given the explanations made by the explainer, the audience leverages them to learn the target task. Since the perturbations indicate how important the input pixels are, the audience uses them to reweigh the pixels: $x_i \circ \delta_i^*$, where $\circ$ denotes element-wise multiplication. Pixels that are more important are given larger weights. Then the audience trains its network weights on these weighted images:

$$W^*(\delta^*(E^*(A))) = \operatorname*{argmin}_{W} \; \sum_i \ell\big(g(x_i \circ \delta_i^*; W),\; y_i\big), \qquad (3)$$
where $g(x_i \circ \delta_i^*; W)$ is the prediction outcome of the audience's network on the weighted image and $y_i$ is the class label. Note that $W^*$ is a function of $\delta^*$ since $W^*$ is a function of the objective in Eq.(3) and the objective is a function of $\delta^*$. In the fourth stage, the explainer validates its network weights on its validation set $D_e^{(val)}$ and the audience validates its network weights on its validation set $D_a^{(val)}$. The explainer optimizes its architecture by minimizing its own validation loss and the audience's validation loss:

$$\min_{A} \; L(A, E^*(A), D_e^{(val)}) + \lambda\, L(W^*(\delta^*(E^*(A))), D_a^{(val)}), \qquad (4)$$
where $\lambda$ is a tradeoff parameter.
The four stages are mutually dependent: $E^*(A)$ learned in the first stage is used to define the objective function in the second stage; $\delta^*(E^*(A))$ learned in the second stage is used to define the objective function in the third stage; $E^*(A)$ and $W^*(\delta^*(E^*(A)))$ learned in the first and third stages are used to define the objective in the fourth stage; the architecture $A$ updated in the fourth stage in turn changes the objective function in the first stage, which subsequently causes $E^*$, $\delta^*$, and $W^*$ to change.
Putting these pieces together, we have the following LeaSE framework, which is a four-level optimization problem:

$$\begin{array}{ll}
\min_{A} & L(A, E^*(A), D_e^{(val)}) + \lambda\, L(W^*(\delta^*(E^*(A))), D_a^{(val)}) \\[2pt]
\text{s.t.} & W^*(\delta^*(E^*(A))) = \operatorname*{argmin}_{W} \sum_i \ell\big(g(x_i \circ \delta_i^*; W), y_i\big) \\[2pt]
& \delta^*(E^*(A)) = \operatorname*{argmax}_{\delta} \sum_i \ell\big(f(x_i; A, E^*(A)), f(x_i + \delta_i; A, E^*(A))\big) \\[2pt]
& E^*(A) = \operatorname*{argmin}_{E} L(A, E, D_e^{(tr)})
\end{array} \qquad (5)$$
This formulation nests four optimization problems. In the constraints of the outer optimization problem are three inner optimization problems, corresponding to the first, second, and third learning stages respectively. The objective function of the outer optimization problem corresponds to the fourth learning stage.
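To make the second and third stages concrete, here is a minimal numpy sketch of explanation by adversarial perturbation on a toy linear-softmax classifier. The function name, the toy model, and the use of numerical central-difference gradients are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(p, q, eps=1e-12):
    # cross-entropy between two probability vectors: -sum_k p_k log q_k
    return float(-np.sum(p * np.log(q + eps)))

def explain_by_perturbation(x, W, delta0, step=0.1, iters=20, h=1e-5):
    """Gradient ASCENT on delta to maximize the cross-entropy between the
    prediction on x and on x + delta (the second-stage objective, Eq.(2)).
    The model is a toy linear-softmax classifier f(x) = softmax(W x);
    numerical central differences stand in for backpropagation."""
    p_clean = softmax(W @ x)            # prediction on the clean input
    delta = delta0.astype(float)        # small random init (zero is a stationary point)
    for _ in range(iters):
        grad = np.zeros_like(delta)
        for i in range(delta.size):
            d_plus, d_minus = delta.copy(), delta.copy()
            d_plus[i] += h
            d_minus[i] -= h
            grad[i] = (cross_entropy(p_clean, softmax(W @ (x + d_plus)))
                       - cross_entropy(p_clean, softmax(W @ (x + d_minus)))) / (2 * h)
        delta += step * grad            # ascent: the second stage is a maximization
    return delta
```

Per the third stage, the audience would then train on the reweighed input $x \circ \delta$, so that pixels with large perturbations receive large weights.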
Table 1: Notation used in LeaSE.

Notation | Meaning
$A$ | Architecture of the explainer
$E$ | Network weights of the explainer
$W$ | Network weights of the audience
$\delta$ | Explanations (pixel-wise perturbations)
$D_e^{(tr)}$ | Training data of the explainer
$D_a^{(tr)}$ | Training data of the audience
$D_e^{(val)}$ | Validation data of the explainer
$D_a^{(val)}$ | Validation data of the audience
Similar to (liu2018darts), we represent the architecture of the explainer in a differentiable way. The search space of $A$ is composed of a large number of building blocks, where the output of each block is associated with a variable $a$ indicating how important this block is. After learning, blocks whose $a$ is among the largest are retained to form the final architecture. To this end, architecture search amounts to optimizing the set of architecture variables $A = \{a\}$.
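The differentiable representation can be sketched as follows (a DARTS-style softmax relaxation is assumed; the function names are hypothetical):

```python
import numpy as np

def mixed_output(op_outputs, a):
    """Relaxed edge output: a softmax-weighted sum of candidate block outputs.
    op_outputs: array of shape (num_blocks, feature_dim); a: architecture variables."""
    w = np.exp(a - a.max())
    w = w / w.sum()
    return w @ op_outputs, w

def discretize(a, k=1):
    """After learning, retain the k blocks whose architecture variables are largest."""
    return np.argsort(a)[::-1][:k]
```

During search, the architecture variables `a` are updated by gradient descent on the validation objective; `discretize` then recovers a discrete architecture from the relaxed one.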
2.2 Optimization Algorithm
In this section, we derive an optimization algorithm to solve the LeaSE problem defined in Eq.(5). Inspired by (liu2018darts), we approximate $E^*(A)$ using a one-step gradient descent update of $E$ with respect to the training loss $L(A, E, D_e^{(tr)})$. We plug this approximation $E'$ into the objective of Eq.(2) and obtain an approximated objective $O_\delta$. Then we approximate $\delta^*$ using a one-step gradient ascent update of $\delta$ with respect to $O_\delta$ (ascent, since the second stage is a maximization). Next, we plug the approximation $\delta'$ into the objective of Eq.(3) and get an approximated objective $O_W$. Then we approximate $W^*$ using a one-step gradient descent update of $W$ with respect to $O_W$. Finally, we plug the approximations $E'$ and $W'$ into Eq.(4), obtain an approximated validation objective, and perform a gradient descent update of $A$ with respect to it. In the sequel, we use $\nabla^2_{A,B} f(A,B)$ to denote $\frac{\partial^2 f(A,B)}{\partial A \, \partial B}$.
First of all, we approximate $E^*(A)$ using

$$E' = E - \xi_1 \nabla_E L(A, E, D_e^{(tr)}), \qquad (6)$$

where $\xi_1$ is a learning rate. Plugging $E'$ into the objective of Eq.(2), we obtain an approximated objective $O_{\delta} = \sum_i \ell\big(f(x_i; A, E'), f(x_i + \delta_i; A, E')\big)$. Then we approximate $\delta^*$ using a one-step gradient ascent update of $\delta$ with respect to $O_{\delta}$:

$$\delta' = \delta + \xi_2 \nabla_{\delta} O_{\delta}. \qquad (7)$$

Plugging $\delta'$ into the objective of Eq.(3), we obtain an approximated objective $O_W = \sum_i \ell\big(g(x_i \circ \delta_i'; W), y_i\big)$. Then we approximate $W^*$ using a one-step gradient descent update of $W$ with respect to $O_W$:

$$W' = W - \xi_3 \nabla_W O_W. \qquad (8)$$

Finally, we plug $E'$ and $W'$ into Eq.(4) and get the approximated objective $L(A, E', D_e^{(val)}) + \lambda L(W', D_a^{(val)})$. We can update the explainer's architecture by descending its gradient w.r.t. $A$:

$$A \leftarrow A - \eta \Big( \nabla_A L(A, E', D_e^{(val)}) + \lambda \nabla_A L(W', D_a^{(val)}) \Big), \qquad (9)$$
where, by the chain rule,

$$\nabla_A L(A, E', D_e^{(val)}) = \nabla_A L(A, E', D_e^{(val)}) - \xi_1 \nabla^2_{A,E} L(A, E, D_e^{(tr)})\, \nabla_{E'} L(A, E', D_e^{(val)}), \qquad (10)$$

with the first term on the right-hand side taken with $E'$ held fixed. The second term involves an expensive matrix-vector product, whose computational complexity can be reduced by a finite difference approximation:

$$\nabla^2_{A,E} L(A, E, D_e^{(tr)})\, \nabla_{E'} L(A, E', D_e^{(val)}) \approx \frac{\nabla_A L(A, E^{+}, D_e^{(tr)}) - \nabla_A L(A, E^{-}, D_e^{(tr)})}{2\alpha}, \qquad (11)$$

where $E^{\pm} = E \pm \alpha \nabla_{E'} L(A, E', D_e^{(val)})$ and $\alpha$ is a small scalar, e.g., $0.01 / \|\nabla_{E'} L(A, E', D_e^{(val)})\|_2$.
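The finite-difference trick in Eq.(11) can be sketched and sanity-checked as follows; `mixed_hvp` and the bilinear toy loss are our own illustrative choices. For $L(A, E) = A^{\top} M E$, the gradient w.r.t. $A$ is $ME$, so the difference quotient recovers $Mv$ exactly:

```python
import numpy as np

def mixed_hvp(grad_A, E, v, alpha=0.01):
    """Approximate the mixed Hessian-vector product (d^2 L / dA dE) v by
    differencing the gradient w.r.t. A at E + alpha*v and E - alpha*v,
    as in Eq.(11). grad_A(E) returns the gradient of L w.r.t. A at weights E."""
    return (grad_A(E + alpha * v) - grad_A(E - alpha * v)) / (2 * alpha)
```

This replaces a second-order backward pass with two extra first-order gradient evaluations, which is the source of the efficiency gain.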
For $\nabla_A L(W', D_a^{(val)})$ in Eq.(9), according to the chain rule it can be calculated as

$$\nabla_A L(W', D_a^{(val)}) = \frac{\partial \delta'}{\partial A} \frac{\partial W'}{\partial \delta'} \nabla_{W'} L(W', D_a^{(val)}),$$

where $\frac{\partial W'}{\partial \delta'} = -\xi_3 \nabla^2_{\delta', W} O_W$ follows from Eq.(8) and $\frac{\partial \delta'}{\partial A}$ follows from Eq.(7), noting that $E'$ in $O_\delta$ depends on $A$. The matrix-vector products arising in these terms can be approximated with finite differences in the same way as in Eq.(11).
The algorithm for solving LeaSE is summarized in Algorithm 1.
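The per-iteration updates of Algorithm 1, i.e., Eqs.(6)-(9), can be sketched as a single function; the gradient callables and their names are illustrative assumptions rather than the paper's code:

```python
import numpy as np

def lease_step(A, E, W, delta, grads, xi1, xi2, xi3, eta, lam):
    """One LeaSE iteration: one-step approximations of the four stages.
    `grads` maps names to user-supplied gradient functions (hypothetical API)."""
    E1 = E - xi1 * grads["E_train"](A, E)                # Eq.(6): explainer weights
    delta1 = delta + xi2 * grads["delta"](A, E1, delta)  # Eq.(7): ascent on the attack objective
    W1 = W - xi3 * grads["W_train"](W, delta1)           # Eq.(8): audience weights
    gA = grads["A_val"](A, E1) + lam * grads["A_aud_val"](A, E1, delta1, W1)
    A1 = A - eta * gA                                    # Eq.(9): architecture update
    return A1, E1, W1, delta1
```

Iterating this step until convergence alternates all four stages, so each update of the architecture $A$ immediately influences the next updates of $E$, $\delta$, and $W$.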
3 Experiments
We apply LeaSE to neural architecture search in image classification tasks. Following (liu2018darts), we first perform architecture search, which finds an optimal cell; we then perform architecture evaluation, which composes multiple copies of the searched cell into a large network, trains it from scratch, and evaluates the trained model on the test set.
3.1 Datasets
We used three datasets in the experiments: CIFAR-10, CIFAR-100, and ImageNet (deng2009imagenet). The CIFAR-10 dataset contains 50K training images and 10K testing images from 10 classes (with an equal number of images per class). Following (liu2018darts), we split the original 50K training set into a new 25K training set and a 25K validation set. In the sequel, "training set" always refers to the new 25K training set. During architecture search, the training set is used as $D_e^{(tr)}$ and $D_a^{(tr)}$ in LeaSE, and the validation set is used as $D_e^{(val)}$ and $D_a^{(val)}$. During architecture evaluation, the combination of the training data and validation data is used to train the large network stacking multiple copies of the searched cell. The CIFAR-100 dataset contains 50K training images and 10K testing images from 100 classes (with an equal number of images per class). Similar to CIFAR-10, the 50K training images are split into a 25K training set and a 25K validation set; the usage of the new training set and validation set is the same as for CIFAR-10. The ImageNet dataset contains a training set of 1.2M images and a validation set of 50K images from 1000 object classes; its validation set is used as the test set for architecture evaluation. Following (liu2018darts), we evaluate the architectures searched on CIFAR-10 and CIFAR-100 on ImageNet: given a cell searched on CIFAR-10 or CIFAR-100, multiple copies of it compose a large network, which is then trained on the 1.2M training images of ImageNet and evaluated on the 50K test images.
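The 25K/25K split described above can be sketched as follows (the function name and seed handling are our own, not from the paper):

```python
import numpy as np

def split_train_val(n_total=50000, n_train=25000, seed=0):
    """Shuffle the original training indices and split them into a disjoint
    new training set and validation set, as used during architecture search."""
    idx = np.random.default_rng(seed).permutation(n_total)
    return idx[:n_train], idx[n_train:]
```

The two index sets are disjoint by construction, so the validation loss in the fourth stage is computed on data the weights were never trained on.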
3.2 Experimental Settings
The search space of $A$ in LeaSE is the same as that in DARTS (liu2018darts). The candidate operations include: 3×3 and 5×5 separable convolutions, 3×3 and 5×5 dilated separable convolutions, 3×3 max pooling, 3×3 average pooling, identity, and zero. In LeaSE, the network of the explainer is a stack of multiple cells, each consisting of 7 nodes. For the architecture of the audience, we used ResNet-18 (resnet). $\lambda$ is set to 1.
For CIFAR-10 and CIFAR-100, during architecture search, the explainer's network is a stack of 8 cells, with the initial channel number set to 16. The search is performed for 50 epochs with a batch size of 64. The hyperparameters for the explainer's architecture and weights are set in the same way as in DARTS. The network weights of the audience are optimized using SGD with a momentum of 0.9 and a weight decay of 3e-4; the initial learning rate is set to 0.025 with a cosine decay schedule. During architecture evaluation, 20 copies of the searched cell are stacked to form the explainer's network, with the initial channel number set to 36. The network is trained for 600 epochs with a batch size of 96 (for both CIFAR-10 and CIFAR-100). These experiments are performed on a single Tesla V100 GPU. For ImageNet, following (liu2018darts), we take the architecture searched on CIFAR-10 and evaluate it on ImageNet. We stack 14 cells (searched on CIFAR-10) to form a large network and set the initial channel number to 48. The network is trained for 250 epochs with a batch size of 1024 on 8 Tesla V100 GPUs. Each LeaSE experiment is repeated ten times with random seeds 1 to 10, and we report the mean and standard deviation of the results obtained from the 10 runs.
3.3 Results
Table 2: Results on CIFAR-100: classification error (%) on the test set, number of weight parameters (millions), and search cost (GPU days).

Method | Error (%) | Param (M) | Cost (GPU days)
*ResNet (he2016deep) | 22.10 | 1.7 | -
*DenseNet (HuangLMW17) | 17.18 | 25.6 | -
*PNAS (LiuZNSHLFYHM18) | 19.53 | 3.2 | 150
*ENAS (pham2018efficient) | 19.43 | 4.6 | 0.5
*AmoebaNet (real2019regularized) | 18.93 | 3.1 | 3150
DARTS-1st (liu2018darts) | 20.52±0.31 | 1.8 | 0.4
*GDAS (DongY19) | 18.38 | 3.4 | 0.2
*R-DARTS (ZelaESMBH20) | 18.01±0.26 | - | 1.6
*DARTS- (abs200901027) | 17.51±0.25 | 3.3 | 0.4
DARTS- (abs200901027) | 18.97±0.16 | 3.1 | 0.4
*P-DARTS (chen2019progressive) | 17.49 | 3.6 | 0.3
DARTS+ (abs190906035) | 17.11±0.43 | 3.8 | 0.2
*DropNAS (HongL0TWL020) | 16.39 | 4.4 | 0.7
*DARTS-2nd (liu2018darts) | 20.58±0.44 | 1.8 | 1.5
LeaSE-R18-DARTS-2nd (ours) | 19.26 | 2.1 | 1.8
Table 3: Results on CIFAR-10: classification error (%) on the test set, number of weight parameters (millions), and search cost (GPU days).

Method | Error (%) | Param (M) | Cost (GPU days)
*DenseNet (HuangLMW17) | 3.46 | 25.6 | -
*HierEvol (liu2017hierarchical) | 3.75±0.12 | 15.7 | 300
*NAONet-WS (LuoTQCL18) | 3.53 | 3.1 | 0.4
*PNAS (LiuZNSHLFYHM18) | 3.41±0.09 | 3.2 | 225
*ENAS (pham2018efficient) | 2.89 | 4.6 | 0.5
*NASNet-A (zoph2018learning) | 2.65 | 3.3 | 1800
*AmoebaNet-B (real2019regularized) | 2.55±0.05 | 2.8 | 3150
*DARTS-1st (liu2018darts) | 3.00±0.14 | 3.3 | 0.4
*R-DARTS (ZelaESMBH20) | 2.95±0.21 | - | 1.6
*GDAS (DongY19) | 2.93 | 3.4 | 0.2
*SNAS (xie2018snas) | 2.85 | 2.8 | 1.5
*BayesNAS (ZhouYWP19) | 2.81±0.04 | 3.4 | 0.2
*MergeNAS (WangXYYHS20) | 2.73±0.02 | 2.9 | 0.2
*NoisyDARTS (abs200503566) | 2.70±0.23 | 3.3 | 0.4
*ASAP (NoyNRZDFGZ20) | 2.68±0.11 | 2.5 | 0.2
*SDARTS (abs200205283) | 2.61±0.02 | 3.3 | 1.3
*DARTS- (abs200901027) | 2.59±0.08 | 3.5 | 0.4
DARTS- (abs200901027) | 2.97±0.04 | 3.3 | 0.4
*DropNAS (HongL0TWL020) | 2.58±0.14 | 4.1 | 0.6
*PC-DARTS (abs190705737) | 2.57±0.07 | 3.6 | 0.1
*FairDARTS (abs191112126) | 2.54 | 3.3 | 0.4
*DrNAS (abs200610355) | 2.54±0.03 | 4.0 | 0.4
*P-DARTS (chen2019progressive) | 2.50 | 3.4 | 0.3
*DARTS-2nd (liu2018darts) | 2.76±0.09 | 3.3 | 1.5
LeaSE-R18-DARTS-2nd (ours) | 2.62 | 3.4 | 1.8
Table 4: Results on ImageNet: top-1 and top-5 classification errors (%) on the test set, number of weight parameters (millions), and search cost (GPU days).

Method | Top-1 Error (%) | Top-5 Error (%) | Param (M) | Cost (GPU days)
*Inception-v1 (googlenet) | 30.2 | 10.1 | 6.6 | -
*MobileNet (HowardZCKWWAA17) | 29.4 | 10.5 | 4.2 | -
*ShuffleNet 2× (v1) (ZhangZLS18) | 26.4 | 10.2 | 5.4 | -
*ShuffleNet 2× (v2) (MaZZS18) | 25.1 | 7.6 | 7.4 | -
*NASNet-A (zoph2018learning) | 26.0 | 8.4 | 5.3 | 1800
*PNAS (LiuZNSHLFYHM18) | 25.8 | 8.1 | 5.1 | 225
*MnasNet-92 (TanCPVSHL19) | 25.2 | 8.0 | 4.4 | 1667
*AmoebaNet-C (real2019regularized) | 24.3 | 7.6 | 6.4 | 3150
*SNAS (CIFAR-10) (xie2018snas) | 27.3 | 9.2 | 4.3 | 1.5
*BayesNAS (CIFAR-10) (ZhouYWP19) | 26.5 | 8.9 | 3.9 | 0.2
*PARSEC (CIFAR-10) (abs190205116) | 26.0 | 8.4 | 5.6 | 1.0
*GDAS (CIFAR-10) (DongY19) | 26.0 | 8.5 | 5.3 | 0.2
*DSNAS (ImageNet) (HuXZLSLL20) | 25.7 | 8.1 | - | -
*SDARTS-ADV (CIFAR-10) (abs200205283) | 25.2 | 7.8 | 5.4 | 1.3
*PC-DARTS (CIFAR-10) (abs190705737) | 25.1 | 7.8 | 5.3 | 0.1
*ProxylessNAS (ImageNet) (cai2018proxylessnas) | 24.9 | 7.5 | 7.1 | 8.3
*FairDARTS (CIFAR-10) (abs191112126) | 24.9 | 7.5 | 4.8 | 0.4
*FairDARTS (ImageNet) (abs191112126) | 24.4 | 7.4 | 4.3 | 3.0
*P-DARTS (CIFAR-100) (chen2019progressive) | 24.7 | 7.5 | 5.1 | 0.3
*P-DARTS (CIFAR-10) (chen2019progressive) | 24.4 | 7.4 | 4.9 | 0.3
*DrNAS (ImageNet) (abs200610355) | 24.2 | 7.3 | 5.2 | 3.9
*PC-DARTS (ImageNet) (abs190705737) | 24.2 | 7.3 | 5.3 | 3.8
*DARTS+ (ImageNet) (abs190906035) | 23.9 | 7.4 | 5.1 | 6.8
*DARTS- (ImageNet) (abs200901027) | 23.8 | 7.0 | 4.9 | 4.5
*DARTS+ (CIFAR-100) (abs190906035) | 23.7 | 7.2 | 5.1 | 0.2
*DARTS-2nd (CIFAR-10) (liu2018darts) | 26.7 | 8.7 | 4.7 | 4.0
LeaSE-R18-DARTS-2nd (CIFAR-10) (ours) | 24.7 | 7.9 | 4.7 | 4.0
Table 2 shows the classification error (%), number of weight parameters (millions), and search cost (GPU days) of different NAS methods on CIFAR-100. From this table, we can see that when our method LeaSE is applied to DARTS-2nd, the classification error is significantly reduced, from 20.58% to 19.26%. This demonstrates that learning by self-explanation is an effective mechanism for improving the quality of searched architectures. In our method, the explainer is encouraged to make sensible explanations for its prediction outcomes; to make meaningful explanations, it needs to understand the prediction task very well, and it accordingly learns better.
Table 3 shows the classification error (%), number of weight parameters (millions), and search cost (GPU days) of different NAS methods on CIFAR-10. As can be seen, applying our proposed LeaSE to DARTS-2nd reduces the error from 2.76% to 2.62%. This further demonstrates the efficacy of our method in searching better-performing architectures, thanks to its mechanism of improving the explainer's learning ability by encouraging it to make sensible explanations.
Table 4 shows the results on ImageNet, including top-1 and top-5 classification errors on the test set, number of weight parameters (millions), and search cost (GPU days). Following (liu2018darts), we take the architecture searched by LeaSE-R18-DARTS-2nd on CIFAR-10 and evaluate it on ImageNet. As can be seen, applying our LeaSE method to DARTS-2nd considerably reduces the top-1 error from 26.7% to 24.7%. This further demonstrates the effectiveness of our method.
4 Related Work
Neural architecture search (NAS), which aims to find the optimal neural network architecture for the best predictive performance, has achieved remarkable progress recently. In general, there are three paradigms of NAS methods: reinforcement learning (RL) approaches (zoph2016neural; pham2018efficient; zoph2018learning), evolutionary learning approaches (liu2017hierarchical; real2019regularized), and differentiable approaches (cai2018proxylessnas; liu2018darts; xie2018snas). In RL-based approaches, a policy is learned to iteratively generate new architectures by maximizing a reward, which is the accuracy on the validation set. Evolutionary learning approaches represent architectures as individuals in a population; individuals with high fitness scores (validation accuracy) have the privilege to generate offspring, which replace individuals with low fitness scores. Differentiable approaches adopt a network pruning strategy: on top of an over-parameterized network, the weights of connections between nodes are learned using gradient descent, and weights close to zero are then pruned. There have been many efforts devoted to improving differentiable NAS methods. In P-DARTS (chen2019progressive), the depth of searched architectures is allowed to grow progressively during the training process; search space approximation and regularization approaches are developed to reduce computational overhead and improve search stability. PC-DARTS (abs190705737) reduces the redundancy in exploring the search space by sampling a small portion of a super network; operation search is performed in a subset of channels, with the held-out part bypassed in a shortcut. Our proposed LeaSE framework can be applied to any differentiable NAS method.

5 Conclusions
In this paper, we propose a new machine learning approach, learning by self-explanation (LeaSE), inspired by the explanation-driven learning technique of humans. In LeaSE, the primary goal is to help an explainer model learn to perform a target task well; the way to achieve this goal is to let the explainer make sensible explanations of its predictions. The intuition behind LeaSE is that a model has to understand a topic very well before it can explain this topic clearly. We propose a multi-level optimization framework to formalize LeaSE, which involves four learning stages: the explainer learns a topic; the explainer explains this topic; the audience learns this topic based on the explanations given by the explainer; and the explainer re-learns this topic based on feedback on the learning outcome of the audience. We apply our framework to neural architecture search and achieve significant improvements on CIFAR-100, CIFAR-10, and ImageNet.