Sub-Architecture Ensemble Pruning in Neural Architecture Search

10/01/2019 · by Yijun Bian, et al.

Neural architecture search (NAS) is gaining more and more attention in recent years due to its flexibility and its remarkable capability of reducing the burden of neural network design. To achieve better performance, however, the searching process usually costs massive computation, which might not be affordable for researchers and practitioners. While recent attempts have employed ensemble learning methods to mitigate the enormous computation, an essential characteristic of ensemble methods, diversity, is missed out, causing more similar sub-architectures to be gathered and potential redundancy in the final ensemble architecture. To bridge this gap, we propose a pruning method for NAS ensembles named "Sub-Architecture Ensemble Pruning in Neural Architecture Search (SAEP)." It aims to utilize diversity and to achieve sub-ensemble architectures of a smaller size with performance comparable to the unpruned ensemble architectures. Three possible solutions are proposed to decide which sub-architectures should be pruned during the searching process. Experimental results demonstrate the effectiveness of the proposed method in largely reducing the size of ensemble architectures while maintaining the final performance. Moreover, distinct deeper architectures could be discovered if the searched sub-architectures are not diverse enough.

Introduction

Designing neural network architectures usually requires manually elaborated architecture engineering, extensive expertise, and expensive costs. Neural architecture search (NAS), which aims at mitigating these challenges, has attracted increasing attention recently [Zoph et al.2018, Elsken, Metzen, and Hutter2019, Wistuba, Rawat, and Pedapati2019]. However, NAS methods usually incur huge computational costs to achieve an architecture with the expected performance, which is too expensive for many researchers and practitioners to afford and for many infrastructures to deploy [Zoph and Le2017].

Recent work [Cortes et al.2017, Huang et al.2018, Macko et al.2019] proposes to employ ensemble methods to mitigate this shortcoming, using weak sub-architectures trained at a lower computation cost to compose powerful neural architectures. However, all of them overlook a crucial principle in ensemble methods, namely model diversity, when searching for new sub-architectures, although diversity is usually beneficial for creating better model ensembles [Zhou2012, Jiang et al.2017]. Moreover, many ensemble pruning methods utilize diversity to obtain sub-ensembles of a smaller size than the original ensembles [Zhou2012]. It has been shown that a few diverse individual learners can even construct a more powerful ensemble learner than the unpruned ensemble [Zhou, Wu, and Tang2002, Zhou2012]. This motivates us to investigate the NAS ensemble pruning problem, targeting diverse sub-ensemble architectures towards a smaller yet effective ensemble model.

However, it is quite challenging to characterize the diversity of different sub-architectures and to decide which ones should be pruned or kept in the ensemble architecture. First, there are plenty of definitions and measurements of diversity in the ensemble learning community [Zhou2012]; unlike model accuracy, diversity has no well-accepted formal definition [Jiang et al.2017]. Second, diversity among individual learners usually decreases as these individual learners reach higher levels of accuracy [Lu et al.2010]; combining some accurate individual learners with some relatively weak ones is usually better than combining accurate ones only, since diversity matters more than pure accuracy [Zhou2012]. Third, selecting the best combination of sub-architectures from an ensemble architecture is NP-complete, with exponential computational complexity [Li, Yu, and Zhou2012, Martínez-Muñoz and Suárez2007]. Thus, how to properly handle the trade-off between accuracy and diversity and select the best subset of an ensemble architecture is an essential issue in the NAS ensemble pruning problem.

To tackle the NAS ensemble pruning problem, we seek diverse sub-ensemble architectures of a smaller size yet still with performance comparable to the original ensemble architecture without pruning. The idea is to prune the ensemble architecture on-the-fly based on different criteria and keep the more valuable sub-architectures during the searching process. Our NAS ensemble pruning method, named "Sub-Architecture Ensemble Pruning in Neural Architecture Search (SAEP)" and motivated by AdaNet [Cortes et al.2017], comes with three proposed criteria to decide which sub-architectures will be pruned. Besides, SAEP might lead to distinct, deeper, and more effective architectures than the original one if the degree of diversity is not sufficient, which can be regarded as a bonus of pruning. Our contribution in this paper is threefold:

  • We propose a NAS ensemble pruning method to seek sub-ensemble architectures of a smaller size, benefiting from an essential characteristic of ensemble learning, i.e., diversity. It obtains accuracy comparable to that of the ensemble architectures that are not pruned.

  • Moreover, the proposed method can lead to distinct deeper architectures than the original unpruned ensemble architecture if the diversity is not sufficient.

  • Experimental results demonstrate the effectiveness of the proposed method in largely reducing the number of sub-architectures in ensemble architectures and increasing diversity while maintaining the final performance.

Problem Statement

Notations

In this paper, we denote tensors with bold italic lowercase letters (e.g., $\boldsymbol{x}$), vectors with bold lowercase letters (e.g., $\mathbf{w}$), and scalars with italic lowercase letters (e.g., $m$). We use $\mathbf{w}^\top$ to represent the transpose of a vector $\mathbf{w}$. Data/hypothesis spaces are denoted by bold script uppercase letters (e.g., $\boldsymbol{\mathcal{X}}$). We use $\mathbb{R}$, $\Pr[\cdot]$, $\mathbb{E}[\cdot]$, and $\mathbb{1}[\cdot]$ to denote the real space, the probability measure, the expectation of a random variable, and the indicator function, respectively.

We summarize the notations and their definitions in Table 1. We follow the notations and the definition of the search space in AdaNet to formulate our problem and introduce our method, since AdaNet is one of the most popular ensemble searching methods in the NAS literature. It is worth pointing out that our proposed pruning criteria could also be generalized to other ensemble NAS methods, which could be interesting for future exploration.

Let $f$ be a neural network with $l$ layers searched via AdaNet [Cortes, Mohri, and Syed2014, Cortes et al.2017], where each layer may connect to all previous layers. The output of $f$ for any input $\boldsymbol{x}$ connects with all intermediate units, i.e.,

$$f(\boldsymbol{x}) = \sum_{k=1}^{l} \mathbf{w}_k \cdot \mathbf{h}_k(\boldsymbol{x}) \,, \qquad (1)$$

where $\mathbf{w}_k \in \mathbb{R}^{n_k}$ and $\mathbf{h}_k = [h_{k,1}, \dots, h_{k,n_k}]^\top$. Here $h_{k,j}$ is the function of the $j$-th unit in the $k$-th layer, i.e.,

$$h_{k,j}(\boldsymbol{x}) = \varphi\Big( \sum_{s=1}^{k-1} \mathbf{u}_{k,s} \cdot \mathbf{h}_s(\boldsymbol{x}) \Big) \,, \qquad (2)$$

where $\mathbf{h}_0(\boldsymbol{x}) = \boldsymbol{x}$ is the layer denoted by the input and $\varphi$ is an activation function. If $\mathbf{w}_k = \mathbf{0}$ for $k < l$ and $\mathbf{u}_{k,s} = \mathbf{0}$ for $s < k-1$, this architecture of $f$ will coincide with the standard multi-layer feed-forward ones [Cortes et al.2017].
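To make the layer-wise connectivity in Eqs. (1)-(2) concrete, here is a minimal NumPy sketch of such a forward pass. The tanh activation, the weight shapes, and the function name are illustrative assumptions of this sketch, not the paper's implementation.

```python
import numpy as np

def adanet_style_forward(x, unit_weights, output_weights, act=np.tanh):
    """Sketch of Eqs. (1)-(2): units in layer k read the input and all lower
    layers' outputs; the final output is a weighted sum over the units of
    every layer (shapes and activation are assumptions of this sketch)."""
    layers = [x]                          # h_0 is the input
    for U in unit_weights:                # U has shape (n_k, d + n_1 + ... + n_{k-1})
        prev = np.concatenate(layers)     # outputs of all layers below
        layers.append(act(U @ prev))      # Eq. (2)
    # Eq. (1): weighted sum over the units of layers 1..l
    return sum(w @ h for w, h in zip(output_weights, layers[1:]))

# Tiny usage example with random weights (two layers of 3 and 2 units).
rng = np.random.default_rng(0)
x = rng.normal(size=4)
unit_weights = [rng.normal(size=(3, 4)), rng.normal(size=(2, 4 + 3))]
output_weights = [rng.normal(size=3), rng.normal(size=2)]
print(adanet_style_forward(x, unit_weights, output_weights))
```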

Notation                Definition
…                       the representation of … for clarity
$\boldsymbol{x}$        the input of neural networks
$f$                     the function of a neural network with $l$ layers
$n_k$                   the number of units in the $k$-th layer
$h_{k,j}$               the function of the $j$-th unit in the $k$-th layer
$\mathbf{u}_{k,s}$      the weight of the $s$-th layer for the units of the $k$-th layer
$\mathbf{h}_k$          the function vector of units in the $k$-th layer
$\mathbf{w}_k$          the weight of the $k$-th layer for the output $f$
$\|\mathbf{w}_k\|_p$    the $\ell_p$-norm of $\mathbf{w}_k$, where $p \geq 1$
$T$                     the number of iterations in the neural architecture searching process
$\Gamma_k$              a specific complexity constraint based on the Rademacher complexity
Table 1: The symbols and definitions used in this paper.

While AdaNet attempts to train multiple weak sub-architectures at a lower computation cost to compose powerful neural architectures inspired by ensemble methods [Cortes et al.2017], the crucial characteristic of diversity brings opportunities to achieve sub-ensemble architectures of a smaller size using diverse sub-architectures, yet still with performance comparable to the original ensemble architecture generated by AdaNet. Based on the terminologies mentioned above, we formally define the NAS ensemble pruning problem.

Problem Definition (NAS Ensemble Pruning).

Given an ensemble architecture $f$ searched by ensemble NAS methods such as AdaNet, and a training set $S = \{(\boldsymbol{x}_i, y_i)\}_{i=1}^{m}$ where all training instances are assumed to be drawn i.i.d. (independently and identically distributed) from one distribution $\mathcal{D}$ over $\boldsymbol{\mathcal{X}} \times \boldsymbol{\mathcal{Y}}$ with $c$ as the number of labels, the goal is to prune the ensemble architecture $f$ and seek a sub-ensemble architecture $f'$ of a smaller size, using fewer sub-architectures yet still with performance comparable to the original ensemble architecture $f$.

Sub-Architecture Ensemble Pruning in Neural Architecture Search (SAEP)

In this section, we elaborate on the proposed NAS ensemble pruning method to achieve smaller yet effective ensemble neural architectures. Before pruning the less valuable sub-architectures, we need to generate sub-architectures first. We take advantage of AdaNet [Cortes et al.2017] here due to its popularity and superiority in ensemble NAS research, and utilize its objective function for generating sub-architecture candidates in the searching process. The objective function to generate new candidates in AdaNet [Cortes et al.2017] is defined as

(3)

where denotes the empirical margin error of function on the training set . denotes a specific complexity constraint. As the learning guarantee in [Cortes et al.2017] is for binary classification, we introduce an auxiliary function in Eq. (4) to extend the objective to multi-class classification problems echoing with our problem statement, i.e.,

(4)

In this case, the empirical margin error would be

(5)
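As a concrete reference for Eqs. (4)-(5), the following NumPy sketch computes the multi-class margin and the resulting empirical margin error. Treating $f(\boldsymbol{x})$ as a row of per-class scores and exposing the threshold $\rho$ as a parameter are choices made here for illustration.

```python
import numpy as np

def empirical_margin_error(scores, y, rho=0.0):
    """Empirical margin error in the spirit of Eqs. (4)-(5).

    scores : (m, c) array, per-class outputs f(x_i) for each instance
    y      : (m,)   true labels in {0, ..., c-1}
    rho    : margin threshold
    """
    m = len(y)
    true_score = scores[np.arange(m), y]
    rival = scores.copy()
    rival[np.arange(m), y] = -np.inf
    margin = true_score - rival.max(axis=1)   # auxiliary function, Eq. (4)
    return float(np.mean(margin <= rho))      # empirical margin error, Eq. (5)

# Example: three instances, three classes; one margin is non-positive.
scores = np.array([[2.0, 0.5, 0.1], [0.2, 0.1, 0.3], [1.0, 1.2, 0.9]])
print(empirical_margin_error(scores, np.array([0, 2, 0])))   # 0.333...
```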
Figure 1: Illustration of the difference between SAEP and AdaNet during the incremental construction of a neural architecture. Layers in blue and green indicate the input and output layers, respectively. Units in yellow, cyan, and red are added at the first, second, and third iterations, respectively. (a) AdaNet [Cortes et al.2017]: a line between two blocks of units indicates that these blocks are fully connected. (b) SAEP: only some valuable blocks are kept (those to be pruned are denoted by black dashed lines), which is the key difference from AdaNet. The criterion used to decide which sub-architectures will be pruned has three proposed instantiations in our SAEP, i.e., PRS, PAP, and PIE.

Guided by Eq. (3), AdaNet only generates new candidates by minimizing the empirical error and the architecture complexity, while overlooking the diversity and differences among different sub-architectures. To achieve smaller yet effective ensembles by taking the diversity property into account, we first need to measure the diversity of different sub-architectures, so that a corresponding objective function can be derived to guide the selection of the more valuable sub-architectures during the searching process.

Specifically, we propose three different ways to enhance the diversity of different sub-architectures. The first solution uses no explicit objective, while the latter two provide specific quantitative objectives in which diversity serves as guidance among different sub-architectures for NAS. Besides, the diversity of the sub-ensemble architectures generated by these solutions can be quantified to verify whether they work or not.

Our final NAS ensemble pruning method, named "Sub-Architecture Ensemble Pruning in Neural Architecture Search (SAEP)," is shown in Algorithm 1. The key difference between SAEP and AdaNet is that SAEP prunes the less valuable sub-architectures based on certain criteria during the searching process (lines 10-11 in Algorithm 1), instead of keeping all of them, as shown in Figure 1. At the $t$-th iteration in Algorithm 1, let $f_{t-1}$ denote the neural network constructed before the start of this iteration, with a depth of $l_{t-1}$. The first target at the $t$-th iteration is to generate new candidates (lines 3-4) and select the better one to be added to the model $f_{t-1}$ (lines 4-9), since we expect the searching process to be progressive. The second target at the $t$-th iteration is to prune the less valuable sub-architectures from $f_t$ and keep the beneficial ones to construct the final architecture (lines 10-11).

Input: Dataset $S = \{(\boldsymbol{x}_i, y_i)\}_{i=1}^{m}$; Parameter: number of iterations $T$
Output: Final function $f_T$
1:  Initialize $f_0 = 0$ and $l_0 = 0$
2:  for $t = 1$ to $T$ do
3:      Generate a candidate $h$ of depth $l_{t-1}$ by minimizing Eq. (3) s.t. the complexity constraint
4:      Generate a candidate $h'$ of depth $l_{t-1} + 1$ by minimizing Eq. (3) s.t. the complexity constraint
5:      if adding $h$ yields an objective value in Eq. (3) no larger than adding $h'$ then
6:          $f_t = f_{t-1} + \mathbf{w} \cdot h$
7:      else
8:          $f_t = f_{t-1} + \mathbf{w}' \cdot h'$
9:      end if
10:     Choose $h_j$ based on one certain criterion, i.e., picking randomly in PRS, the minimizer of Eq. (6) in PAP, or the minimizer of Eq. (14) in PIE
11:     Set $\mathbf{w}_j$ to be zero
12: end for
Algorithm 1: Sub-Architecture Ensemble Pruning in Neural Architecture Search (SAEP)
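For readers who prefer code, below is a minimal Python sketch of the loop in Algorithm 1. The helper callables generate_candidates, ensemble_objective, and pruning_criterion are hypothetical placeholders supplied by the caller; they are not part of AdaNet's or SAEP's published API, and dropping a sub-architecture from the list stands in for setting its weight to zero.

```python
import random

def saep_search(dataset, T, generate_candidates, ensemble_objective,
                pruning_criterion=None, prune_prob=0.5, seed=0):
    """Sketch of Algorithm 1 with caller-supplied (hypothetical) helpers.

    generate_candidates(ensemble, dataset) -> (h_same_depth, h_one_deeper)
    ensemble_objective(ensemble, dataset)  -> value of the Eq. (3)-style objective
    pruning_criterion(ensemble, dataset)   -> index to prune or None (PAP/PIE);
                                              if the criterion itself is None,
                                              the PRS-style random rule is used.
    """
    rng = random.Random(seed)
    ensemble = []                                     # kept sub-architectures
    for t in range(T):
        # Lines 3-4: generate two candidates (same depth / one layer deeper).
        h, h_deeper = generate_candidates(ensemble, dataset)
        # Lines 5-9: keep whichever candidate yields the lower objective.
        best = min((h, h_deeper),
                   key=lambda c: ensemble_objective(ensemble + [c], dataset))
        ensemble.append(best)
        # Lines 10-11: pick a less valuable sub-architecture and drop it.
        if pruning_criterion is None:                 # PRS: random rule
            j = rng.randrange(len(ensemble)) if rng.random() < prune_prob else None
        else:                                         # PAP or PIE rule
            j = pruning_criterion(ensemble, dataset)
        if j is not None and len(ensemble) > 1:
            ensemble.pop(j)
    return ensemble
```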

To identify the most valuable sub-architectures, we propose three solutions. We now introduce them as guidance for pruning on-the-fly, i.e., for deciding which sub-architectures are less valuable and should be pruned.

Pruning by Random Selection (PRS)

The first solution, named "Pruning by Random Selection (PRS)," randomly prunes some of the sub-architectures during the searching process. In PRS, we first decide randomly whether or not to pick one of the sub-architectures to be pruned; if we do decide to prune one of them, the choice of which sub-architecture to prune is random as well, instead of following a specific objective as in the next two solutions.

However, there is no specific objective for PRS to follow in the pruning process. That might lead to a situation where some valuable sub-architectures are pruned as well. Therefore, we need to find more explicit objectives to guide our pruning.

Pruning by Accuracy Performance (PAP)

To measure different sub-architectures better, we propose a second pruning solution based on their accuracy performance, named "Pruning by Accuracy Performance (PAP)." To choose the valuable sub-architectures among the individual sub-architectures in the original model, the objective function for this target is defined as

$$\mathcal{L}_{\mathrm{PAP}}(h_j) = \frac{1}{m} \sum_{i=1}^{m} \Big( \mathbb{1}\big[ f^{\setminus j}(\boldsymbol{x}_i) \neq y_i \big] - \mathbb{1}\big[ f(\boldsymbol{x}_i) \neq y_i \big] \Big) \,, \qquad (6)$$

where $h_j$ is the sub-architecture corresponding to the weight $\mathbf{w}_j$, $f^{\setminus j}$ denotes the ensemble architecture with $h_j$ excluded, and, with a slight abuse of notation, $f(\boldsymbol{x})$ denotes the label predicted by $f$. Our target is to pick up the $h_j$ and $\mathbf{w}_j$ that minimize Eq. (6), and to prune them if their loss is less than zero. The reason why we do this is that the generalization error of gathering all sub-architectures is defined as

$$\mathcal{E}(f) = \mathbb{E}_{(\boldsymbol{x}, y)} \big[ \mathbb{1}[ f(\boldsymbol{x}) \neq y ] \big] \,; \qquad (7)$$

if the $j$-th sub-architecture is excluded from the final architecture, the generalization error of the pruned sub-ensemble architecture will become

$$\mathcal{E}(f^{\setminus j}) = \mathbb{E}_{(\boldsymbol{x}, y)} \big[ \mathbb{1}[ f^{\setminus j}(\boldsymbol{x}) \neq y ] \big] \,. \qquad (8)$$

Then, if we expect the pruned architecture to work better than the original one, we need to make sure that $\mathcal{E}(f^{\setminus j}) \leq \mathcal{E}(f)$, i.e.,

$$\mathbb{E}_{(\boldsymbol{x}, y)} \big[ \mathbb{1}[ f^{\setminus j}(\boldsymbol{x}) \neq y ] - \mathbb{1}[ f(\boldsymbol{x}) \neq y ] \big] \leq 0 \,. \qquad (9)$$

Therefore, if a sub-architecture meeting Eq. (9) is excluded from the final architecture, the performance will not be weakened and could even be better than that of the original one. The hidden meaning behind Eq. (9) is that, on a substantial portion of instances, the final architecture makes mistakes while the pruned architecture that excludes the $j$-th sub-architecture works correctly. Sub-architectures whose mistakes are serious enough to affect the final architecture negatively are expected to be pruned, leading to our loss function in Eq. (6). In this case, we could improve the performance of the final architecture without breaking the learning guarantee.
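To illustrate how a quantity of the form in Eq. (6) can be estimated in practice, here is a small NumPy sketch that scores each sub-architecture by the change in empirical error when it is removed. Treating the ensemble as a weighted vote over predicted labels is a simplifying assumption of this sketch, not the paper's exact formulation.

```python
import numpy as np

def pap_scores(preds, weights, y):
    """PAP-style score per sub-architecture (a sketch in the spirit of Eq. (6)).

    preds   : (T, m) integer array of class predictions, one row per sub-architecture
    weights : (T,)   ensemble weights
    y       : (m,)   ground-truth labels
    Returns one score per sub-architecture; a negative score means removing that
    sub-architecture lowers the ensemble's empirical error.
    """
    T, m = preds.shape
    n_classes = int(max(preds.max(), y.max())) + 1
    votes = np.zeros((m, n_classes))
    for t in range(T):                       # weighted votes of the full ensemble
        votes[np.arange(m), preds[t]] += weights[t]
    full_err = np.mean(votes.argmax(axis=1) != y)
    scores = np.empty(T)
    for j in range(T):                       # error after excluding h_j
        votes_wo_j = votes.copy()
        votes_wo_j[np.arange(m), preds[j]] -= weights[j]
        err_wo_j = np.mean(votes_wo_j.argmax(axis=1) != y)
        scores[j] = err_wo_j - full_err      # < 0: pruning h_j helps
    return scores

# Usage idea: prune the sub-architecture with the most negative score, if any.
# scores = pap_scores(preds, weights, y)
# j = int(scores.argmin()) if scores.min() < 0 else None
```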

However, the objective in Eq. (6) only considers the accuracy performance of different sub-architectures and misses out on the crucial characteristic of diversity in ensemble methods. Therefore, we need an objective that reflects both accuracy and diversity.

Pruning by Information Entropy (PIE)

To consider accuracy and diversity simultaneously, we propose another strategy, named "Pruning by Information Entropy (PIE)," whose objective is based on information entropy. For any sub-architecture $h_j$ in the ensemble architecture, $\mathbf{f}_j$ represents its classification results on the dataset $S$, and $\mathbf{y}$ is the class label vector. To exhibit the relevance between this sub-architecture and the class label vector, the normalized mutual information [Zadeh et al.2017],

$$\mathrm{MI}(\mathbf{f}_j, \mathbf{y}) = \frac{I(\mathbf{f}_j; \mathbf{y})}{H(\mathbf{f}_j, \mathbf{y})} \,, \qquad (10)$$

is used to imply its accuracy. Note that

$$I(X; Y) = H(X) + H(Y) - H(X, Y) \qquad (11)$$

is the mutual information [Cover and Thomas2012], where $H(\cdot)$ and $H(\cdot, \cdot)$ are the entropy function and the joint entropy function, respectively. To reveal the redundancy between two sub-architectures ($h_i$ and $h_j$) in the ensemble architecture, the normalized variation of information [Zadeh et al.2017],

$$\mathrm{VI}(\mathbf{f}_i, \mathbf{f}_j) = 1 - \frac{I(\mathbf{f}_i; \mathbf{f}_j)}{H(\mathbf{f}_i, \mathbf{f}_j)} \,, \qquad (12)$$

is used to indicate the diversity between them. The objective function for handling the trade-off between diversity and accuracy of two sub-architectures is defined as

$$\mathcal{L}_{\mathrm{PIE}}(h_i, h_j) = \lambda \cdot \frac{\mathrm{MI}(\mathbf{f}_i, \mathbf{y}) + \mathrm{MI}(\mathbf{f}_j, \mathbf{y})}{2} + (1 - \lambda) \cdot \mathrm{VI}(\mathbf{f}_i, \mathbf{f}_j) \qquad (13)$$

if $i \neq j$, otherwise $\mathcal{L}_{\mathrm{PIE}}(h_i, h_j) = +\infty$. Note that $\lambda \in [0, 1]$ is a regularization factor introduced to balance these two criteria, indicating their relative importance as well. Our target is to pick up the $h_i$ and $h_j$, and prune them by minimizing $\mathcal{L}_{\mathrm{PIE}}$ in Eq. (14), i.e.,

$$(i^{*}, j^{*}) = \operatorname*{argmin}_{i, j} \ \mathcal{L}_{\mathrm{PIE}}(h_i, h_j) \,. \qquad (14)$$

This loss function considers both diversity and accuracy concurrently, according to the essential characteristics of ensemble learning.
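The entropy-based quantities used by PIE are straightforward to compute from discrete prediction vectors. The sketch below assumes the normalization by the joint entropy shown in Eqs. (10)-(12); other normalizations of mutual information exist, so this is an illustrative choice rather than the only one.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in nats) of a discrete label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def joint_entropy(a, b):
    """Joint entropy H(a, b) of two discrete label vectors."""
    pairs = np.stack([a, b], axis=1)
    _, counts = np.unique(pairs, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def normalized_mi(a, b):
    """I(a;b) / H(a,b): relevance between predictions and labels (cf. Eq. (10))."""
    h_ab = joint_entropy(a, b)
    mi = entropy(a) + entropy(b) - h_ab        # Eq. (11)
    return mi / h_ab if h_ab > 0 else 0.0

def normalized_vi(a, b):
    """1 - I(a;b) / H(a,b): diversity between two prediction vectors (cf. Eq. (12))."""
    return 1.0 - normalized_mi(a, b)
```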

Experiments

In this section, we describe our experiments to verify the effectiveness of the proposed SAEP method. There are four major questions that we aim to answer. (1) Could SAEP achieve sub-ensemble architectures of a smaller size yet still with accuracy comparable to the original ensemble architecture? (2) Could SAEP generate sub-ensemble architectures with more diversity than the original ensemble architecture? (3) What are the impacts of the parameter $\lambda$ on the sub-ensemble architectures generated by PIE? (4) Could PIE generate sub-architectures different from those in the original ensemble architecture?

Three Image Classification Datasets

The three image classification datasets that we employ in the experiments are all publicly available. The ImageNet [Deng et al.2009] dataset is not included because its cost is not affordable on the GPU that we use.

CIFAR-10 [Krizhevsky and Hinton2009]: 60,000 32x32 color images in 10 classes are used as instances, with 6,000 images per class, representing airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks, respectively. There are 50,000 training images and 10,000 test images.

MNIST [LeCun et al.1998]: 70,000 28x28 grayscale images of handwritten digits in 10 different classes are used as instances. There are 60,000 instances as a training set and 10,000 instances as a test set. The digits have been size-normalized and centered in a fixed-size image.

Fashion-MNIST [Xiao, Rasul, and Vollgraf2017]: 70,000 28x28 grayscale images are used as instances, including 60,000 instances for training and 10,000 instances for testing. They are categorized into ten classes, representing T-shirts/tops, trousers, dresses, coats, sandals, shirts, sneakers, bags, and ankle boots, respectively.

Baseline Methods

To analyze the effectiveness of SAEP, we compare the three proposed solutions (i.e., PRS, PAP, and PIE) with AdaNet [Cortes et al.2017]. Besides, AdaNet (usually set to use uniform average weights in practice) has a variant that uses mixture weights, which we call AdaNet.W. Similarly, PRS.W, PAP.W, and PIE.W (i.e., SAEP.W) are the variants of PRS, PAP, and PIE that use mixture weights, respectively. Our baselines thus include AdaNet and its corresponding variant AdaNet.W.

Experimental Settings

In the same experiment, all methods use the same kind of sub-architectures to ensure fairness of the comparisons and to verify whether their objectives work well. The optional sub-architectures that we use include multilayer perceptrons (MLP) and convolutional neural networks (CNN); more options, such as depthwise separable convolution networks, are possible. As for the hyper-parameters, the learning rate is set to 0.025, and cosine decay is applied to the learning rate using a momentum optimizer during training. The number of training steps is 5,000, and the batch size is 128.
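For reproducibility, the optimizer setting described above can be expressed as follows in TensorFlow/Keras; the momentum value of 0.9 is an assumption, since the text does not specify it.

```python
import tensorflow as tf

TRAIN_STEPS = 5_000
BATCH_SIZE = 128

# Cosine-decayed learning rate starting at 0.025, driven by a momentum optimizer.
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.025, decay_steps=TRAIN_STEPS)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9)  # momentum assumed
```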

We use the three datasets mentioned before for image classification. In the multi-class classification scenario, we use all of the categories in the corresponding dataset; in the binary classification scenario, we reduce these datasets by considering several pairs of classes. For example, we consider one pair of classes in CIFAR-10 (i.e., deer-truck), five pairs of classes in Fashion-MNIST (i.e., top-pullover, top-coat, top-shirt, trouser-dress, and sandal-ankle boot), and two pairs of digits in MNIST (i.e., 6-9 and 5-8).

Figure 2: Empirical results of accuracy for image classification tasks. (a) Comparison of the performance of AdaNet and SAEP. (b) Comparison of the performance of their corresponding variants.

SAEP Leads to Sub-Ensemble Architectures with Smaller Size than AdaNet

In this subsection, we verify whether the pruned sub-ensemble architectures could achieve comparable performance with the original ensemble architecture.

Dataset           Test Accuracy (%) / Number of Sub-Architectures
                  AdaNet         PRS            PAP            PIE
MNIST             93.21 / 12     93.29* / 2     93.27 / 1      93.21 / 12
Fashion-MNIST     81.84 / 13     81.49 / 1      82.27* / 6     81.76 / 7
                  AdaNet.W       PRS.W          PAP.W          PIE.W
MNIST             93.26 / 10     93.19 / 1      93.41* / 8     93.39 / 14
Fashion-MNIST     81.82 / 17     82.26 / 6      81.81 / 1      82.36* / 13
Table 2: Empirical results of accuracy for multi-class classification on the MNIST and Fashion-MNIST datasets. Each cell reports two values: the test accuracy (%) and the number of sub-architectures. Note that the sub-architectures used in these experiments are MLP. The best accuracy and its corresponding architecture's size are marked with an asterisk (*) for each dataset (row).

Experimental results are reported in Figure 2 and Tables 2-3, containing the accuracy on the test set of each method and the corresponding size of the searched architectures. More detailed results (Figure 5) are reported in the appendix due to the space limitation. As we can see in Table 2, SAEP (indicated by PRS, PAP, and PIE) achieves the same level of accuracy as AdaNet, and SAEP yields smaller ensemble architectures than AdaNet. For example, PIE.W achieves 82.36% accuracy with a sub-ensemble architecture of size 13, while AdaNet.W achieves 81.82% accuracy with an ensemble architecture of size 17. Similar results can be observed for SAEP in Figure 2(a) and for their corresponding variants in Figure 2(b). All these results suggest that our NAS ensemble pruning method is meaningful.

Label Pair           Test Accuracy (%) / Number of Sub-Architectures
                     AdaNet         PRS            PAP            PIE
deer-truck           88.80 / 10     89.25* / 1     88.65 / 2      87.95 / 13
top-pullover         96.70* / 12    96.30 / 1      96.65 / 1      96.70* / 14
top-coat             98.55* / 14    98.40 / 1      98.55* / 2     98.45 / 11
top-shirt            84.05* / 9     83.60 / 1      84.05* / 8     83.90 / 12
trouser-dress        97.70 / 10     97.70 / 5      97.80 / 4      97.90* / 6
sandal-ankle boot    97.90 / 8      97.95 / 1      98.05* / 1     97.95 / 5
digit 6-9            99.69 / 7      99.80* / 1     99.80* / 1     99.75 / 8
digit 5-8            98.82* / 17    98.71 / 1      98.77 / 11     98.71 / 13
(a) Comparison of the performance of AdaNet and SAEP.
Label Pair           Test Accuracy (%) / Number of Sub-Architectures
                     AdaNet.W       PRS.W          PAP.W          PIE.W
deer-truck           88.80 / 6      88.90* / 1     88.45 / 1      88.20 / 13
top-pullover         96.70* / 6     96.35 / 3      96.65 / 1      96.55 / 6
top-coat             98.75* / 10    98.55 / 1      98.45 / 3      98.35 / 12
top-shirt            83.80 / 7      84.25* / 1     84.20 / 1      83.85 / 6
trouser-dress        97.55 / 7      97.70 / 1      97.65 / 6      97.75* / 7
sandal-ankle boot    97.80 / 8      97.95* / 1     97.80 / 2      97.65 / 7
digit 6-9            99.69 / 11     99.80* / 1     99.69 / 3      99.75 / 8
digit 5-8            98.77 / 8      99.04* / 3     98.77 / 2      98.61 / 7
(b) Comparison of the performance of their corresponding variants.
Table 3: Empirical results of accuracy and size of architectures for binary classification on the CIFAR-10, Fashion-MNIST, and MNIST datasets. The sub-architectures that we use for Fashion-MNIST and MNIST are MLP; those for CIFAR-10 are CNN. The best accuracy with its corresponding architecture's size is marked with an asterisk (*) for each label pair (row).
Figure 3: Empirical results of diversity for image classification tasks. (a) The effect of diversity on the accuracy performance and the size of the sub-ensemble architectures. Increasing diversity could be beneficial to the accuracy of the sub-ensemble architecture before the diversity reaches a certain threshold; after the diversity reaches the threshold, increasing diversity becomes less beneficial. (b) The effect of the $\lambda$ value on the diversity and the accuracy performance of the sub-ensemble architectures. (c) The effect of the $\lambda$ value on the diversity and the size of the sub-ensemble architectures.

PIE Generates Sub-Ensemble Architectures with More Diversity

In this subsection, we verify whether the purpose of increasing the diversity of ensemble architectures is satisfied. We use the normalized variation of information in PIE to imply the redundancy between two different sub-architectures, indicating the diversity between them. However, in this experiment, we use another measure, the disagreement measure [Skalak and others1996, Ho1998], to calculate the diversity of the ensemble architecture and of the pruned sub-ensemble architectures, because there is no analogous term in PRS and PAP. Note that researchers have proposed many other measures to calculate diversity, and the disagreement measure is one of them [Zhou2012]. We choose the disagreement measure here because it is easy to calculate and to understand. The disagreement between two sub-architectures $h_i$ and $h_j$ is

$$\mathrm{dis}(h_i, h_j) = \frac{1}{m} \sum_{k=1}^{m} \mathbb{1}\Big[ \mathbb{1}[h_i(\boldsymbol{x}_k) = y_k] \neq \mathbb{1}[h_j(\boldsymbol{x}_k) = y_k] \Big] \,, \qquad (15)$$

the diversity of the ensemble architecture using the disagreement measure is the average over all pairs of its sub-architectures, i.e.,

$$\mathrm{div}(f) = \frac{2}{T(T-1)} \sum_{i=1}^{T} \sum_{j=i+1}^{T} \mathrm{dis}(h_i, h_j) \,, \qquad (16)$$

and the diversity of the sub-ensemble architecture can be calculated analogously.
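Below is a small NumPy sketch of the disagreement computation in Eqs. (15)-(16), assuming that disagreement is measured on whether each sub-architecture classifies an instance correctly, which is the usual oracle-output convention for this measure in the ensemble literature.

```python
import numpy as np

def disagreement(pred_i, pred_j, y):
    """Pairwise disagreement: fraction of instances on which exactly one of the
    two sub-architectures is correct (cf. Eq. (15))."""
    correct_i = (pred_i == y)
    correct_j = (pred_j == y)
    return float(np.mean(correct_i != correct_j))

def ensemble_disagreement(preds, y):
    """Average pairwise disagreement over all sub-architecture pairs (cf. Eq. (16)).

    preds : list of (m,) prediction arrays, one per sub-architecture
    y     : (m,) ground-truth labels
    """
    T = len(preds)
    if T < 2:
        return 0.0          # a single sub-architecture has no pairwise diversity
    pairs = [(i, j) for i in range(T) for j in range(i + 1, T)]
    return float(np.mean([disagreement(preds[i], preds[j], y) for i, j in pairs]))
```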

Table 4 reports the performance of each method with the corresponding disagreement value, reflecting the diversity of the whole architecture. Besides, Figure 3 reports the diversity of the sub-architectures searched by PIE and other related information. Note that the larger the disagreement is, the larger the diversity of the ensemble architecture or of the pruned sub-ensemble architecture is. Three of the rows in Table 4 illustrate that PIE could yield sub-ensemble architectures with more diversity. It is also understandable that AdaNet.W obtains more diversity than PIE.W in Figure 3, since AdaNet.W has a larger ensemble architecture in the third row of Table 4. Figure 3(a) illustrates that the accuracy of the sub-ensemble architecture benefits from increasing diversity before the diversity reaches a certain threshold, and that increasing diversity becomes less beneficial after the diversity reaches that threshold. Meanwhile, Figure 3(a) also shows that larger sub-ensemble architectures correspond to less diversity. In addition, Figures 3(b)-3(c) indicate the effect of the $\lambda$ value in Eq. (13) on the diversity, the accuracy performance, and the size of the sub-ensemble architectures.

       Test Accuracy (%) / Diversity (Disagreement)
       AdaNet             PRS                PAP                PIE
MLP    77.42 / 0.6693     77.16 / 0.0000     78.72 / 0.5722     78.67 / 0.7120*
CNN    84.81 / 0.8044     85.25 / 0.0000     84.94 / 0.0000     85.02 / 0.8859*
       AdaNet.W           PRS.W              PAP.W              PIE.W
MLP    78.84 / 0.6827*    78.55 / 0.0000     80.61 / 0.3688     78.05 / 0.6081
CNN    85.36 / 0.8906     84.18 / 0.0000     84.70 / 0.0000     85.36 / 0.9297*
Table 4: Empirical results of accuracy for multi-class classification on the Fashion-MNIST dataset. Each cell reports two values: the test accuracy (%) and the diversity (disagreement) of the sub-architectures. Note that the sub-architectures used in these experiments are MLP or CNN. The maximal diversity (disagreement) and its corresponding accuracy are marked with an asterisk (*) for each row. Note that a diversity of 0.0000 indicates that only one sub-architecture is kept in the pruned sub-ensemble architecture by the corresponding method.

Effect of the $\lambda$ Value

We now investigate the effect of the hyper-parameter $\lambda$ in PIE. The value of $\lambda$ also indicates the relation between the two criteria in Eq. (13). To reveal this issue, different $\lambda$ values (from 0.1 to 0.9 in steps of 0.1) are tested in this part of the experiments. Figure 4 exemplifies the effect of $\lambda$ on the MNIST and Fashion-MNIST datasets. Figure 4(a) illustrates that different $\lambda$ values have little effect on the accuracy performance of the final architectures. Figure 4(b) illustrates that different $\lambda$ values affect the number of the finally pruned sub-architectures, and that a global minimum around the optimal $\lambda$ indeed exists for each dataset. Figure 4(c) shows that when $\lambda$ is set to 0.5, the sub-ensemble architecture achieves competitive accuracy with a smaller size, which is why the $\lambda$ of PIE and PIE.W is set to 0.5 in Tables 2-5 if there is no extra explanation.

Figure 4: The effect of different $\lambda$ values for image classification tasks. (a) The effect of the $\lambda$ value on the accuracy performance of sub-ensemble architectures. (b) The effect of the $\lambda$ value on the size of sub-ensemble architectures. (c) The sub-ensemble architectures achieve competitive accuracy with a smaller size under different $\lambda$ values.

PIE Could Generate Distinct Deeper Sub-Architectures than AdaNet

In a few cases, we observe that PIE could produce a larger ensemble architecture than AdaNet, which makes us wonder whether SAEP could lead to architectures distinct from those of AdaNet. Thus, we examine the exact sub-architectures that are kept in the final architecture to explore more details.

Method           Test Accuracy (%)   Diversity   Size   Indexes
AdaNet           77.42               0.6693      7      [0, 1, 2, 3, 4, 5, 6]
PIE (λ=0.5)      78.67               0.7120      6      [0, 1, 2, 4, 5, 6]
PIE (λ=0.9)      77.99               0.4516      7      [0, 1, 2, 3, 4, 5, 7]
PIE.W (λ=0.0)    78.33               0.3941      19     [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 19]
Table 5: Empirical results of the exact sub-architectures that are kept in the final sub-ensemble architecture after pruning on the Fashion-MNIST dataset. Each method includes four columns: the test accuracy (%), the diversity (disagreement), the size (i.e., the number of sub-architectures), and the indexes of the sub-architectures in the final sub-ensemble architecture after pruning. Note that the sub-architectures used in these experiments are MLP.

As we can see in Table 5, when the size of the sub-ensemble architecture equals or exceeds that of the ensemble architecture that is not pruned, the diversity of the sub-ensemble architecture is usually smaller than that of the ensemble architecture. The reason why PIE (or PIE.W) generates distinct deeper sub-architectures might be that the diversity is not sufficient for its objective in Eq. (14). In this case, the objective guides the pruning process to search for more distinct deeper sub-architectures to increase diversity.

For more information, as for the MNIST dataset in Table 2, AdaNet keeps all twelve sub-architectures in the final architecture and reaches 93.21% accuracy, while PIE searches thirteen sub-architectures and prunes the thirteenth one at last, arriving at the same accuracy. Besides, AdaNet.W keeps all ten sub-architectures in the final architecture and reaches 93.26% accuracy, while PIE.W keeps the first to the fifteenth sub-architectures except the fourteenth one, arriving at 93.39% accuracy. Similar results are reported in the appendix.

Related Work

In this section, we briefly review neural architecture search (NAS). The concept of "neural architecture search (NAS)" was first proposed by Zoph and Le [Zoph and Le2017], who presented NAS as a gradient-based method to find good architectures. A "controller," implemented as a recurrent network, was used to generate variable-length strings that specified the structure and connectivity of a neural network; the generated "child network," specified by the string, was then trained on real data to obtain its accuracy as the reward signal, so that the controller would generate architectures with higher probabilities of receiving high accuracy [Zoph and Le2017, Baker et al.2017, Zoph et al.2018]. Existing NAS methods can be categorized along three dimensions: search space, search strategy, and performance estimation strategy [Elsken, Metzen, and Hutter2019, Kandasamy et al.2018, Cai et al.2018a, Liu, Simonyan, and Yang2018]. Classical NAS methods yielded chain-structured neural architectures [Zela et al.2018, Elsken, Metzen, and Hutter2019], yet ignored some modern design elements from hand-crafted architectures, such as skip connections in ResNet [He et al.2016]. Thus, some researchers also attempted to build complex multi-branch networks by incorporating such elements and achieved positive results [Cai et al.2018b, Real et al.2018, Elsken, Metzen, and Hutter2018, Brock et al.2017, Elsken, Metzen, and Hutter2017, Zhong et al.2018a, Pham et al.2018, Zhong et al.2018b].

Recently, NAS methods that involve ensemble learning have gradually attracted researchers' attention. Cortes et al. [Cortes et al.2017] proposed a data-dependent learning guarantee to guide the choice of additional sub-networks and presented AdaNet to learn neural networks adaptively. They claimed that AdaNet could precisely address some of the issues of wasted data, time, and resources in neural architecture search, since the optimization problem of AdaNet is convex and admits a unique global solution. Besides, Huang et al. [Huang et al.2018] specialized sub-architectures by residual blocks and claimed that their BoostResNet boosts over multi-channel representations/features, which is different from AdaNet. Macko et al. [Macko et al.2019] proposed another attempt, named AdaNAS, to utilize ensemble methods to compose a neural network automatically, which is an extension of AdaNet with the difference of using subnetworks comprising stacked NASNet [Zoph and Le2017, Zoph et al.2018] blocks. However, these methods gather all searched sub-architectures together and miss out on the critical characteristic that ensemble models usually benefit from diverse individual learners.

Conclusion

Recent attempts on NAS with ensemble learning methods have achieved prominent results in reducing the search complexity and improving the effectiveness [Cortes et al.2017]. However, current approaches usually miss out on an essential characteristic of ensemble learning, namely diversity. To bridge this gap, in this paper, we target ensemble learning methods in NAS and propose an ensemble pruning method named "Sub-Architecture Ensemble Pruning in Neural Architecture Search (SAEP)" to reduce the redundant sub-architectures during the searching process. Three solutions (i.e., PRS, PAP, and PIE) are proposed as the guiding criteria in SAEP, reflecting the characteristics of the ensemble architecture, to prune the less valuable sub-architectures. Experimental results indicate that SAEP could guide diverse sub-architectures to create sub-ensemble architectures of a smaller size yet still with performance comparable to the ensemble architecture that is not pruned. Besides, PIE might lead to distinct deeper sub-architectures if the diversity is not sufficient. In the future, we plan to generalize the current method to more diverse ensemble strategies and derive theoretical guarantees to further improve the performance of the NAS ensemble architectures.

References

  • [Baker et al.2017] Baker, B.; Gupta, O.; Naik, N.; and Raskar, R. 2017. Designing neural network architectures using reinforcement learning. In ICLR.
  • [Brock et al.2017] Brock, A.; Lim, T.; Ritchie, J.; and Weston, N. 2017. Smash: One-shot model architecture search through hypernetworks. In NIPS Workshop on Meta-Learning.
  • [Cai et al.2018a] Cai, H.; Chen, T.; Zhang, W.; Yu, Y.; and Wang, J. 2018a. Efficient architecture search by network transformation. In AAAI.
  • [Cai et al.2018b] Cai, H.; Yang, J.; Zhang, W.; Han, S.; and Yu, Y. 2018b. Path-level network transformation for efficient architecture search. In ICML.
  • [Cortes et al.2017] Cortes, C.; Gonzalvo, X.; Kuznetsov, V.; Mohri, M.; and Yang, S. 2017. Adanet: Adaptive structural learning of artificial neural networks. In ICML, 874–883.
  • [Cortes, Mohri, and Syed2014] Cortes, C.; Mohri, M.; and Syed, U. 2014. Deep boosting. In ICML, 1179–1187.
  • [Cover and Thomas2012] Cover, T., and Thomas, J. 2012. Elements of information theory. John Wiley & Sons.
  • [Deng et al.2009] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In CVPR, 248–255. IEEE.
  • [Elsken, Metzen, and Hutter2017] Elsken, T.; Metzen, J.; and Hutter, F. 2017. Simple and efficient architecture search for convolutional neural networks. In NIPS Workshop on Meta-Learning.
  • [Elsken, Metzen, and Hutter2018] Elsken, T.; Metzen, J.; and Hutter, F. 2018. Efficient multi-objective neural architecture search via lamarckian evolution. ArXiv e-prints.
  • [Elsken, Metzen, and Hutter2019] Elsken, T.; Metzen, J. H.; and Hutter, F. 2019. Neural architecture search: A survey. J Mach Learn Res 20(55):1–21.
  • [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770–778.
  • [Ho1998] Ho, T. K. 1998. The random subspace method for constructing decision forests. IEEE T Pattern Anal 20(8):832–844.
  • [Huang et al.2018] Huang, F.; Ash, J.; Langford, J.; and Schapire, R. 2018. Learning deep resnet blocks sequentially using boosting theory. In ICML.
  • [Jiang et al.2017] Jiang, Z.; Liu, H.; Fu, B.; and Wu, Z. 2017. Generalized ambiguity decompositions for classification with applications in active learning and unsupervised ensemble pruning. In AAAI, 2073–2079.
  • [Kandasamy et al.2018] Kandasamy, K.; Neiswanger, W.; Schneider, J.; Poczos, B.; and Xing, E. 2018. Neural architecture search with bayesian optimisation and optimal transport. In NeurIPS, 2020–2029.
  • [Krizhevsky and Hinton2009] Krizhevsky, A., and Hinton, G. 2009. Learning multiple layers of features from tiny images. Technical report, Citeseer.
  • [LeCun et al.1998] LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
  • [Li, Yu, and Zhou2012] Li, N.; Yu, Y.; and Zhou, Z.-H. 2012. Diversity regularized ensemble pruning. In ECML PKDD, 330–345.
  • [Liu, Simonyan, and Yang2018] Liu, H.; Simonyan, K.; and Yang, Y. 2018. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055.
  • [Lu et al.2010] Lu, Z.; Wu, X.; Zhu, X.; and Bongard, J. 2010. Ensemble pruning via individual contribution ordering. In SIGKDD, 871–880. ACM.
  • [Macko et al.2019] Macko, V.; Weill, C.; Mazzawi, H.; and Gonzalvo, J. 2019. Improving neural architecture search image classifiers via ensemble learning. arXiv preprint arXiv:1903.06236.
  • [Martínez-Muñoz and Suárez2007] Martínez-Muñoz, G., and Suárez, A. 2007. Using boosting to prune bagging ensembles. Pattern Recogn Lett 28(1):156–165.
  • [Pham et al.2018] Pham, H.; Guan, M.; Zoph, B.; Le, Q.; and Dean, J. 2018. Efficient neural architecture search via parameter sharing. In ICML.
  • [Real et al.2018] Real, E.; Aggarwal, A.; Huang, Y.; and Le, Q. 2018. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548.
  • [Skalak and others1996] Skalak, D. B., et al. 1996. The sources of increased accuracy for two proposed boosting algorithms. In AAAI, volume 1129, 1133.
  • [Wistuba, Rawat, and Pedapati2019] Wistuba, M.; Rawat, A.; and Pedapati, T. 2019. A survey on neural architecture search. arXiv preprint arXiv:1905.01392.
  • [Xiao, Rasul, and Vollgraf2017] Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
  • [Zadeh et al.2017] Zadeh, S.; Ghadiri, M.; Mirrokni, V.; and Zadimoghaddam, M. 2017. Scalable feature selection via distributed diversity maximization. In AAAI, 2876–2883.
  • [Zela et al.2018] Zela, A.; Klein, A.; Falkner, S.; and Hutter, F. 2018. Towards automated deep learning: Efficient joint neural architecture and hyperparameter search. In ICML Workshop on AutoML.
  • [Zhong et al.2018a] Zhong, Z.; Yan, J.; Wu, W.; Shao, J.; and Liu, C.-L. 2018a. Practical block-wise neural network architecture generation. In CVPR, 2423–2432.
  • [Zhong et al.2018b] Zhong, Z.; Yang, Z.; Deng, B.; Yan, J.; Wu, W.; Shao, J.; and Liu, C.-L. 2018b. Blockqnn: Efficient block-wise neural network architecture generation. arXiv preprint arXiv:1808.05584.
  • [Zhou, Wu, and Tang2002] Zhou, Z.-H.; Wu, J.; and Tang, W. 2002. Ensembling neural networks: many could be better than all. Artif Intell 137(1-2):239–263.
  • [Zhou2012] Zhou, Z.-H. 2012. Ensemble Methods: Foundations and Algorithms. CRC press.
  • [Zoph and Le2017] Zoph, B., and Le, Q. 2017. Neural architecture search with reinforcement learning. In ICLR.
  • [Zoph et al.2018] Zoph, B.; Vasudevan, V.; Shlens, J.; and Le, Q. 2018. Learning transferable architectures for scalable image recognition. In CVPR, 8697–8710.

Appendix A More Details of Experiments

SAEP Leads to Sub-Ensemble Architectures with Smaller Size than AdaNet

In this section, we report additional experimental results in Figure 5 and Table 3. The experimental results reported in Table 3 contain the accuracy on the test set of each method on each label pair for binary classification problems. Each row (label pair) in Table 3 compares the classification accuracy with the same type of sub-architectures, marking the result with the highest accuracy and its corresponding size of the searched ensemble architecture with an asterisk. Similar results are reported in Table 2 for multi-class classification problems. As we can see, SAEP (indicated by PRS, PAP, and PIE) achieves accuracy comparable to AdaNet with a smaller final ensemble architecture in most cases. Similar results can be observed in Figure 2 and Figure 5: SAEP achieves accuracy comparable to AdaNet in Figure 5(a), and SAEP yields smaller ensemble architectures than AdaNet in most cases in Figure 5(b). All these results suggest that the proposed NAS ensemble pruning method (SAEP) is meaningful.

PIE Could Generate Distinct Deeper Sub-Architectures than AdaNet

In this section, we report additional details about the size of the searched ensemble architectures after pruning.

For example, as for the MNIST dataset in Table 2, AdaNet keeps all twelve sub-architectures in the final architecture and reaches 93.21% accuracy, while PIE searches thirteen sub-architectures and prunes the thirteenth one at last, arriving at the same accuracy. Their paths coincide while searching the first twelve sub-architectures and diverge afterwards, which means that PIE searches deeper architectures than AdaNet does. Besides, AdaNet.W keeps all ten sub-architectures in the final architecture and reaches 93.26% accuracy, while PIE.W keeps the first to the fifteenth sub-architectures except the fourteenth one, arriving at 93.39% accuracy. Their paths coincide while searching the first ten sub-architectures and diverge afterwards, which means that PIE.W also searches distinct deeper architectures than AdaNet.W. As for the Fashion-MNIST dataset in Table 2, AdaNet.W keeps all seventeen sub-architectures in the final architecture and reaches 81.82% accuracy, while PIE.W keeps the first to the thirteenth sub-architectures and prunes the fourteenth one, arriving at 82.36% accuracy.

Similar results can be observed for the top-pullover pair in Table 3. As for the top-pullover pair in Table 3(a), AdaNet keeps all twelve sub-architectures in the final architecture and reaches 96.70% accuracy, while PIE keeps the first to the fifteenth sub-architectures except the ninth one and arrives at the same accuracy; this means that PIE searches deeper sub-architectures than AdaNet does. Their paths coincide while searching the first eight sub-architectures and diverge afterwards. Similarly, for the same pair in Table 3(b), AdaNet.W keeps all six sub-architectures in the final architecture, while PIE.W keeps the first to the seventh sub-architectures except the second one in the final ensemble architecture. Their paths coincide while searching the first sub-architecture and diverge afterwards. All these observations suggest that, in practice, SAEP could achieve distinct deeper architectures in some cases.

Figure 5: Empirical results of accuracy and size of the searched architectures after pruning for image classification tasks. (a) Accuracy on the test sets. (b) Size of the pruned ensemble architectures.