STEERAGE: Synthesis of Neural Networks Using Architecture Search and Grow-and-Prune Methods

12/12/2019 ∙ by Shayan Hassantabar, et al.

Neural networks (NNs) have been successfully deployed in many applications. However, architectural design of these models is still a challenging problem. Moreover, neural networks are known to have a lot of redundancy. This increases the computational cost of inference and poses an obstacle to deployment on Internet-of-Things sensors and edge devices. To address these challenges, we propose the STEERAGE synthesis methodology. It consists of two complementary approaches: efficient architecture search, and grow-and-prune NN synthesis. The first step, covered in a global search module, uses an accuracy predictor to efficiently navigate the architectural search space. The predictor is built using boosted decision tree regression, iterative sampling, and efficient evolutionary search. The second step involves local search. By using various grow-and-prune methodologies for synthesizing convolutional and feed-forward NNs, it reduces network redundancy while boosting performance. We have evaluated STEERAGE performance on various datasets, including MNIST and CIFAR-10. On the MNIST dataset, our CNN architecture achieves an error rate of 0.66%. On the CIFAR-10 dataset, we used the ResNet architectures as the baseline. Our STEERAGE-synthesized ResNet-18 has a 2.52% higher accuracy than the original ResNet-18 and a 1.74% higher accuracy than the much deeper ResNet-101, while having a comparable number of parameters and FLOPs to the original ResNet-18. This shows that instead of just increasing the number of layers to increase accuracy, an alternative is to use a better NN architecture with fewer layers. In addition, STEERAGE achieves an error rate of just 3.86% with a variant of the ResNet architecture with 40 layers. To the best of our knowledge, this is the highest accuracy obtained by ResNet-based architectures on the CIFAR-10 dataset.


1 Introduction

Neural networks (NNs) have led to remarkable achievements in various areas of Artificial Intelligence, such as computer vision [krizhevsky2012imagenet], speech recognition [graves2013speech, kim2017residual], and machine translation [jean2014using]. Deep NNs can learn hierarchical features, which is key to the above-mentioned successes. Improving the performance of these models is a very active line of research [akmandor2019secret].

The NN architecture affects the performance of the final model. As a result, synthesizing an appropriate NN architecture is of utmost importance. However, since the NN architecture search space is exponentially large in its many hyperparameters, using a trial-and-error approach to constructing the NN leads to suboptimal results. Hence, it is necessary to find the best set of hyperparameter values for the architecture.

Another important design objective is model compactness. Compact models are easier to deploy on edge devices. As shown previously [han2015learning, hassantabar2019scann], the number of NN parameters can be reduced significantly without degrading performance. Thus, choosing the hyperparameter values that yield very compact, not just accurate, NN architectures is also an important consideration.

Neural architecture search (NAS) is a technique for automatically synthesizing convolutional layers through hyperparameter search. It is currently a very active area of research [baker2016designing, tan2018mnasnet]. However, most NAS approaches, such as those based on reinforcement learning (RL) [baker2016designing, liu2018progressive], are computationally very expensive.

To address the above challenges, we propose a two-step global+local search based NN synthesis methodology called STEERAGE (SynThesis of nEural nEtworks with aRchitecture seArch and Grow-and-prunE methods). The name alludes to steering a boat: here, it steers NN synthesis. STEERAGE is efficient and automatically generates accurate and compact NN architectures. To efficiently find the NN hyperparameter values, we derive an accuracy predictor to measure architecture fitness. This significantly speeds up architecture search. We use grow-and-prune synthesis methods to ensure model compactness. STEERAGE targets both the convolutional and feed-forward layers of the NN through two sets of hyperparameter values, and alleviates the curse of dimensionality by incorporating dimensionality reduction (DR) in the search space.

Global search is based on the NAS approach. However, unlike prior work, it does not rely on specially-crafted efficient building blocks, nor does it use the computationally expensive RL approach. Instead, it uses an accuracy predictor to estimate architecture performance. This obviates the need to train the NNs during search, thus significantly speeding up the search process. We build the accuracy predictor iteratively, using a combination of boosted decision tree regression and quasi Monte-Carlo (QMC) sampling. We then employ an efficient evolutionary search (EES) method to find the best-performing variant of the base architecture in the search space.

The second synthesis step, local search, starts with the best architecture obtained from global search. It not only enhances model performance, but also generates a compact NN by reducing architectural redundancy. We use two different grow-and-prune methodologies for this purpose: one targeted at feed-forward NNs (FFNNs) and one at convolutional NNs (CNNs). For FFNNs, we use the SCANN [hassantabar2019scann] synthesis methodology to enhance compactness and accuracy. SCANN iteratively applies three operations: connection growth, neuron growth, and connection pruning. For the convolutional layers, we use the NeST [dai2017nest] synthesis methodology. It uses two operations: feature map growth and partial-area convolution.

Our contributions can be summarized as follows:

  • We present STEERAGE, an NN synthesis methodology that efficiently explores both the FFNN and CNN search spaces for edge applications.

  • STEERAGE employs a two-step synthesis flow that combines two complementary approaches: architecture search and network grow-and-prune. The first step uses an accuracy predictor based on boosted decision tree regression for efficient architecture search. The second step refines the architecture to improve its performance and reduce redundancy.

  • STEERAGE is general and easy to extend; it adapts to various architecture types, including FFNNs as well as shallow and deep CNNs.

  • We demonstrate the effectiveness of STEERAGE through evaluation on various datasets, including MNIST and CIFAR-10.

The rest of the article is organized as follows. Section 2 presents related work. Section 3 provides the necessary background material. Section 4 discusses the STEERAGE synthesis methodology, and the complementary nature of its global and local search modules. Section 5 presents experimental results. Section 6 provides a short discussion of the methodology and results. Finally, Section 7 concludes the article.

2 Related Work

In this section, we summarize prior work on NN synthesis.

2.1 Synthesizing Compact Architectures

Synthesis of compact NNs through removal of redundant connections and neurons has attracted significant attention. Network pruning is one such widely used approach [han2015learning, dai2017nest, molchanov2016pruning, yang2018netadapt, dai2018grow, hassantabar2019scann, dai2019incremental]. Several works use structured sparsity to prune deep CNNs [wen2016learning, ye2018rethinking]. RL has also been used to compress CNN models for mobile devices [he2018amc]. Runtime pruning [lin2017runtime] separately uses RL to evaluate the importance of each convolutional feature map and channel pruning for each input image. Dynamic channel pruning [gao2018dynamic] is another approach for pruning channels based on their relevance at runtime.

Another approach for generating compact CNNs is to handcraft efficient building blocks. It is aimed at NN deployment on mobile devices. One operation that significantly reduces computational cost is depthwise separable convolution. It has been used in both MobileNet architecture versions [howard2017mobilenets, sandler2018mobilenetv2]. MobileNetV2 [sandler2018mobilenetv2] uses efficient CNN blocks with inverted residuals and linear bottlenecks. Other such operations are pointwise group convolution and channel shuffle, which are used in the ShuffleNet architectures [zhang2018shufflenet, ma2018shufflenet]. In addition, quantization can also reduce the computational cost of inference [han2015deep, hubara2016binarized, zhu2016trained], with a minimal adverse impact on performance.

2.2 Neural Architecture Search (NAS)

RL is a well-known method for synthesizing NNs [baker2016designing, liu2018progressive, pham2018efficient]. MnasNet [tan2018mnasnet] uses a platform-aware NAS approach based on RL, built atop MobileNetV2 blocks, and uses different layer structures for different parts of the network. MobileNetV3 uses two complementary search approaches: it combines NAS for block-wise search with NetAdapt [yang2018netadapt] for layer-wise search (for individual fine-tuning of the layers). Another NAS approach is weight sharing [brock2017smash, bender2018understanding]. It trains an over-parametrized supernet architecture. The differentiable NAS approach uses a loss function to jointly optimize the weights and the architectural parameters [liu2018darts, veniat2018learning, xie2018snas, wu2019fbnet, guo2019single]. The final architecture inherits the weights of the supernet and is then fine-tuned.

3 Background

In this section, we familiarize the reader with the necessary background material. First, we give an overview of the SCANN [hassantabar2019scann] synthesis framework, since it is used within STEERAGE. We also discuss the ChamNet [dai2019chamnet] framework, which first introduced the idea of an accuracy predictor to expedite NN synthesis.

3.1 SCANN

In this section, we first briefly explain the operations used in SCANN, which are used to change the NN architecture in different steps of the training process. Then, we go over three different training schemes employed in this framework.

SCANN uses three main operations to change the network architecture, namely, connection growth, neuron growth, and connection pruning. Connection growth adds a number of inactive connections to the network. Connections may be added based on their effect on the loss function. Neuron growth works either by duplicating existing neurons in the network or by randomly generating new neurons and their connections. Connection pruning reduces the size of the network. This is done by either pruning small-weight or large-weight connections.
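To make these operations concrete, the following Python sketch (not the authors' implementation; the mask-based representation, function names, and fractions are our assumptions) shows magnitude-based connection pruning and gradient-based connection growth on a single masked weight matrix:

import numpy as np

def prune_connections(weights, mask, prune_fraction=0.1):
    # Deactivate the smallest-magnitude connections among the currently active ones.
    active = np.flatnonzero(mask)
    k = int(prune_fraction * active.size)
    if k > 0:
        smallest = np.argsort(np.abs(weights.flat[active]))[:k]
        mask = mask.copy()
        mask.flat[active[smallest]] = 0
    return mask

def grow_connections(loss_grad, mask, grow_fraction=0.1):
    # Activate inactive connections whose loss gradient has the largest magnitude.
    inactive = np.flatnonzero(mask == 0)
    k = min(int(grow_fraction * mask.size), inactive.size)
    if k > 0:
        largest = np.argsort(-np.abs(loss_grad.flat[inactive]))[:k]
        mask = mask.copy()
        mask.flat[inactive[largest]] = 1
    return mask

A duplication-based neuron growth step would similarly copy the weights of an existing neuron (with small random perturbations) into a newly added one; we omit it here for brevity.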

The above operations are used in three different training schemes. After applying any of the architecture-changing operations, the weights of the network are trained and the performance of the resulting architecture is evaluated on the validation set. The three training schemes are referred to as Schemes A, B, and C. Scheme A starts with a small network and gradually increases its size. On the other hand, Scheme B starts with a large network and gradually decreases its size. Scheme C is similar to Scheme B, but limits the NN to a multi-layer perceptron (MLP) architecture by only allowing connections between adjacent layers.

3.2 ChamNet

ChamNet is aimed at efficiently adapting a given NN to a resource-constrained computational platform. It uses three different predictors for this purpose: accuracy, latency, and energy consumption of the NN. With the help of these predictors, model adaptation is framed as a constrained optimization problem that aims to maximize accuracy while satisfying the energy and latency constraints of the platform.

ChamNet uses Gaussian process (GP) regression to train the accuracy predictor. The latency predictor is based on an operator latency look-up table. The energy predictor is also built atop GP regression, with energy consumption values directly measured from the hardware platform. These three predictors are used in an efficient NN search process (random and genetic search) to solve the constrained optimization problem.
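As a rough illustration of such a predictor (a minimal sketch, not ChamNet's implementation; the kernel choice and toy data are assumptions), a GP regressor maps a hyperparameter vector to a predicted accuracy together with an uncertainty estimate:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# X: hyperparameter vectors of architectures that were actually trained; y: their accuracies
X = np.random.rand(40, 6)                                  # 40 sampled architectures, 6 hyperparameters
y = 0.9 - 0.2 * X[:, 0] + 0.01 * np.random.randn(40)       # stand-in accuracy values

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X, y)
mean, std = gp.predict(np.random.rand(5, 6), return_std=True)  # predicted accuracy and uncertainty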

4 The STEERAGE Synthesis Framework

In this section, we first give an overview of the STEERAGE synthesis framework. We then zoom into the search space and introduce the DR methods that we used as one of the search dimensions. We then go on to describe the complete list of search dimensions (hyperparameters) for both CNNs and FFNNs. This is followed by the description of our accuracy predictor and search mechanisms.

4.1 Framework Overview

Fig. 1 shows the block diagram of the STEERAGE methodology. It takes the base network and dataset as input and, based on an evolutionary search and network grow-and-prune algorithms, finds a variant of the base NN that is superior in terms of classification performance. For evolutionary search, we represent each architecture with a vector of n hyperparameters, where n is the number of hyperparameters in the search space.

We evaluate the fitness of each vector representation of the architecture by using an accuracy predictor. This predictor is trained using a number of sample architectures that are chosen iteratively. During training, the accuracy of the architectures is used as their measure of fitness. EES then selects the fittest candidates to generate the next round of architectures using mutation and crossover operations. We formulate evolutionary search as an optimization problem that maximizes the predicted accuracy of the architecture:

arch* = argmax_{arch ∈ search space} AccuracyPredictor(arch)

The accuracy predictor is used to find the gene (hyperparameter vector) that maximizes this objective. The process of building this predictor is discussed in Section 4.4.

As can be seen from Fig. 1, the candidate architecture output of EES is fed to the local search module. In this step, we use two different grow-and-prune synthesis methodologies, SCANN and NeST, to synthesize the feed-forward and convolutional parts of the network, respectively. This step finds the best NN architecture, given the output of the EES step. This process is described in Section 4.6.

Fig. 1: Block Diagram of STEERAGE: (a) Training an accuracy predictor, (b) genetic search to find the architecture with maximum accuracy, and (c) local search around the architecture through efficient grow-and-prune algorithms.

4.2 Dimensionality Reduction

The curse of dimensionality is a well-known problem that stems from the need to increase the dataset size as its dimensionality increases. Since it is often not possible to increase the dataset size, researchers traditionally use various DR techniques. Since DR methods are quite effective, we include DR as one of the hyperparameters in our search space. It enables us to reduce the number of features and map the original d-dimensional dataset to a k-dimensional dataset, where k < d. We use two different sets of DR methods for FFNNs and CNNs, as described next.

4.2.1 FFNNs

The framework chooses amongst 11 different DR methods for FFNNs. Random projection (RP) is a DR method based on the Johnson-Lindenstrauss lemma [sivakumar2002algorithmic, dasgupta2003elementary]. We consider four different random projection matrices that yield four different DR methods. The entries in two of the matrices are i.i.d. samples drawn from Gaussian distributions. Entries of the other two matrices are drawn from the sparse probability distributions proposed in [achlioptas2001database]: in one, each entry is +1 or -1 with probability 1/2 each; in the other, each entry is +sqrt(3) or -sqrt(3) with probability 1/6 each, and 0 with probability 2/3.

We also include a number of traditional DR methods in the set of choices available to the framework: principal component analysis (PCA), polynomial kernel PCA, Gaussian kernel PCA, factor analysis, isomap, independent component analysis, and spectral embedding. We used the scikit-learn machine learning library [pedregosa2011scikit] implementation of these methods.
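As an illustration, all of these transforms are available through a common scikit-learn interface (a sketch with an assumed dataset shape and target dimensionality; the random-projection classes approximate the RP variants described above):

import numpy as np
from sklearn.decomposition import PCA, KernelPCA, FastICA, FactorAnalysis
from sklearn.random_projection import GaussianRandomProjection, SparseRandomProjection
from sklearn.manifold import Isomap, SpectralEmbedding

X = np.random.rand(1000, 256)   # placeholder dataset: 1000 samples, 256 features
k = 32                          # target dimensionality chosen by the search

dr_methods = {
    "rp_gaussian": GaussianRandomProjection(n_components=k),
    "rp_sparse":   SparseRandomProjection(n_components=k),   # Achlioptas-style sparse entries
    "pca":         PCA(n_components=k),
    "kpca_poly":   KernelPCA(n_components=k, kernel="poly"),
    "kpca_rbf":    KernelPCA(n_components=k, kernel="rbf"),
    "factor":      FactorAnalysis(n_components=k),
    "ica":         FastICA(n_components=k),
    "isomap":      Isomap(n_components=k),
    "spectral":    SpectralEmbedding(n_components=k),
}
X_reduced = dr_methods["pca"].fit_transform(X)   # shape: (1000, 32)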

4.2.2 CNNs

For image datasets that are analyzed by CNN architectures, we use different downsampling methods. Similar to the methods used in [chrabaszcz2017downsampled], we use the lanczos, nearest, bilinear, bicubic, Hamming, and box filters from the Pillow library [pillow].

Lanczos uses a low-pass filter to smoothly interpolate the pixel values in the resampled image. Nearest-neighbor interpolation, used in the nearest filter, simply selects the value of the nearest pixel, producing a piecewise-constant interpolant. Bilinear interpolation extends linear interpolation to two dimensions by interpolating linearly along each image axis. The bicubic filter extends cubic interpolation and can be implemented with Lagrange polynomials, cubic splines, or the cubic convolution algorithm. The Hamming filter is similar to bilinear, but generates sharper images. The box filter uses nearest-neighbor interpolation for upscaling; furthermore, each pixel of the original image contributes to one pixel of the resampled image with identical weights.

Using the above methods, we downsample the original 28x28 images of the MNIST dataset to smaller resolutions. Similarly, we reduce the 32x32 images of the CIFAR-10 dataset to a smaller resolution. Fig. 2 shows examples of downsampled MNIST images produced by these methods.
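The following sketch shows how such downsampled variants can be produced with Pillow (the file name and the 14x14 target size are placeholders, not necessarily the sizes used in our experiments):

from PIL import Image

FILTERS = {
    "lanczos":  Image.LANCZOS,
    "nearest":  Image.NEAREST,
    "bilinear": Image.BILINEAR,
    "bicubic":  Image.BICUBIC,
    "hamming":  Image.HAMMING,
    "box":      Image.BOX,
}

img = Image.open("mnist_digit.png")   # hypothetical 28x28 grayscale MNIST image
downsampled = {name: img.resize((14, 14), resample=f) for name, f in FILTERS.items()}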

Fig. 2: Original MNIST images (28x28, left column) and their downsampled versions using six different techniques: lanczos, nearest, bilinear, bicubic, Hamming, and box filter, from left to right, respectively.

4.3 Search Vector Space

The set of hyperparameters chosen to navigate the search space directly impacts the performance of the STEERAGE framework. Hence, it is necessary to define the search space with important design hyperparameters. However, including too many hyperparameters may have a negative impact. For example, as the number of search hyperparameters increases, we need more data to train the accuracy predictor. Furthermore, including architectures with inferior performance can also hurt STEERAGE performance. Therefore, the set of hyperparameters should be as small as possible, yet cover all important design factors.

We differentiate between the set of hyperparameters chosen for FFNNs and CNNs. These sets are discussed next.

4.3.1 FFNNs

For FFNNs, we use five hyperparameters to define the search space. These are (i) the number of hidden layers in the architecture, (ii) the number of neurons in each layer, (iii) the DR method, (iv) the feature compression ratio of the DR method, and (v) network quantization. Quantization is applied in the inference step: we consider various bit-width representations of the network weights (e.g., 4, 8, 16, or 32 bits).

4.3.2 CNNs

There are a number of hyperparameters that significantly impact the performance of CNNs. They include the number of convolutional layers, the number of feature maps in each layer, and the filter size of each layer. Pooling layers are often used to reduce the spatial size of the feature maps in the convolutional layers. The use of average pooling or max-pooling can, hence, be another hyperparameter included in the search space. We also include some hyperparameters that are similar to those targeted for FFNNs, e.g., the DR method, the number of hidden layers and the number of neurons in each layer of the fully-connected classifier, and the quantization bit width.

4.4 Building the Accuracy Predictor

In order to facilitate and accelerate architecture search, we need an efficient way to evaluate the fitness of each candidate. We use an accuracy predictor for this purpose, which obviates the need to train each candidate network. Two desirable properties of such a predictor are prediction reliability and sample efficiency. For prediction reliability, the predictor should minimize the distance between the predicted and actual accuracy, and it should maintain the relative ranking of architectures based on their predicted and actual accuracies. For sample efficiency, it should require only a small amount of training data (i.e., few trained architectures). We address these requirements next.

4.4.1 Regression Model

In order to obtain a suitable regression model for use in the accuracy predictor, we compared six methods: Gaussian process, multilayer perceptron (MLP), linear regression, decision tree regression, boosted decision tree regression, and Bayesian ridge regression. We compared their performance in the context of accuracy prediction. Fig. 3 shows a comparison of these methods based on the sampled architectures (sample selection is discussed ahead). The reported mean-squared error (MSE) is the leave-one-out MSE. Since boosted decision tree regression had the smallest MSE, we chose it as the regression model for the accuracy predictor.

Fig. 3: Performance comparison of six different regression methods for accuracy prediction.

Boosting is an ensemble approach that fits several copies of the regressor on the original dataset. After each iteration, the weights of the data instances are modified based on the prediction error of the current iteration. This process helps the regressor focus more on difficult data instances in the next iteration, which is one reason why the boosted decision tree regressor is well suited to training the accuracy predictor. We set the maximum depth of each tree to five and fixed the maximum number of estimators in the boosting stage.
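A minimal sketch of this predictor using scikit-learn follows (the estimator count and the toy data are assumptions; older scikit-learn versions use base_estimator instead of estimator):

import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_score

X = np.random.rand(100, 6)               # hyperparameter vectors of trained sample architectures
y = 0.95 - 0.1 * X.mean(axis=1)          # stand-in validation accuracies

predictor = AdaBoostRegressor(estimator=DecisionTreeRegressor(max_depth=5), n_estimators=100)

# leave-one-out MSE, as used above to compare candidate regression models
loo_mse = -cross_val_score(predictor, X, y, cv=LeaveOneOut(),
                           scoring="neg_mean_squared_error").mean()

predictor.fit(X, y)
predicted_acc = predictor.predict(np.random.rand(5, 6))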

4.4.2 Sample Generation

The objective of this step is to generate a sample set that is representative of the architectures in the search space and has as few samples as possible. The process of iterative sample selection is summarized in Algorithm 1.

Since the search space can be quite large, we first create a pool of architectures. We use quasi Monte-Carlo (QMC) sampling [asmussen2007stochastic], the Sobol sequence [sobol1967distribution] in particular, to generate the initial pool of architectures. Sobol sequences are designed to generate samples that are as uniformly distributed as possible over the multi-dimensional search space. In addition, compared to other sampling methods, such as random sampling, stratified sampling, and Latin hypercube sampling [stein1987large], sampling based on Sobol sequences generates more evenly distributed samples [burhenne2011sampling]. We use this procedure to generate the initial pool of sample architectures.
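A sketch of this pool-generation step with SciPy's QMC module follows (the pool size, dimensionality, and bounds are illustrative):

from scipy.stats import qmc

sampler = qmc.Sobol(d=5, scramble=True)            # one dimension per search hyperparameter
unit_samples = sampler.random_base2(m=10)          # 2^10 = 1024 points in [0, 1)^5
lower = [1, 50, 1, 1.0, 1]                         # per-hyperparameter lower bounds
upper = [6, 600, 11, 20.0, 4]                      # per-hyperparameter upper bounds
arch_pool = qmc.scale(unit_samples, lower, upper)  # rescale to the actual search ranges

Each row of arch_pool can then be snapped to the nearest valid step of each hyperparameter to obtain a candidate architecture.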

After generating the pool of candidate architectures, we select a set of sample architectures to be trained and used for building the accuracy predictor. This is done over multiple iterations (three in our implementation). In the first iteration, a pre-defined number of architectures is randomly selected from the pool, and the accuracy predictor is trained on the validation accuracy of these architectures. Then, the fitness of the remaining samples in the architecture pool is evaluated with the accuracy predictor, and in the next iteration we choose the samples with the highest predicted accuracy. This process is repeated for a pre-defined number of iterations. In each iteration, the accuracy predictor is updated with the new, enlarged set of architecture-validation accuracy observations.

Input: count: size of architecture pool; iterCount: #candidates in each iteration; numIters: #iterations; lower/upper bounds and step size of each hyperparameter
  archPool = count samples drawn from the search space using a Sobol sequence
  Samples = randomly select iterCount samples from archPool
  samplesAcc = Validation-Acc(Samples)
  Predictor = boosted decision tree regressor trained on (Samples, samplesAcc)
  while maximum number of iterations not reached do
     for Arch in archPool \ Samples do
        a(Arch) = Predictor(Arch)
     end for
     Candidates = top iterCount architectures in archPool \ Samples, reverse-sorted by a(Arch)
     Samples = Samples ∪ Candidates
     candidatesAcc = Validation-Acc(Candidates)
     Predictor = boosted decision tree regressor trained on all observations
  end while
Output: the accuracy predictor model (Predictor)
Algorithm 1 Iterative Sample Selection and Accuracy Predictor Training
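A compact Python rendering of Algorithm 1 is sketched below; train_and_validate() and make_predictor() are assumed helpers (real training of a sample and fitting of the boosted decision tree regressor, respectively), and arch_pool is a list of hyperparameter tuples:

import numpy as np

def build_accuracy_predictor(arch_pool, iter_count=20, num_iters=3):
    # Seed with randomly chosen architectures that are actually trained.
    idx = np.random.choice(len(arch_pool), iter_count, replace=False)
    samples = [arch_pool[i] for i in idx]
    accs = [train_and_validate(a) for a in samples]          # expensive step
    predictor = make_predictor(samples, accs)
    for _ in range(num_iters - 1):
        remaining = [a for a in arch_pool if a not in samples]
        scores = predictor.predict(np.array(remaining))
        top = [remaining[i] for i in np.argsort(-scores)[:iter_count]]
        samples += top
        accs += [train_and_validate(a) for a in top]         # train only the newly added samples
        predictor = make_predictor(samples, accs)            # refit on all observations so far
    return predictor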

4.5 Efficient Evolutionary Search

The EES process is formulated as an optimization problem to find the architecture that scores highest on the fitness criterion: predicted accuracy. We use the accuracy predictor described in Section 4.4 to accelerate the search process.

Algorithm 2 summarizes the search process. We use a genetic search algorithm to find the best architecture in the search space. The first step is to define the search space by identifying the lower and upper bound values, as well as the step size, for each hyperparameter. We then randomly generate a pre-defined number of architectures within this space and evaluate their fitness using the accuracy predictor. Using the mutation operator, with a pre-defined probability, we breed the next generation of NN architecture candidates. We sort these candidates based on their predicted accuracy values and, based on the number of samples allowed in each search iteration, pick the best architectures for the next iteration. After a fixed number of search iterations, we obtain the best architecture, which is then trained and evaluated on the test set. Thus, this process returns the best-found architecture and its performance on the test set, where performance refers to the accuracy of the best architecture after the global search step. The population size in each iteration and the total number of search iterations are fixed in our implementation.

Input: getAcc: accuracy predictor; iterCount: #samples in each iteration; p: mutation probability; numIters: #search iterations; lower/upper bounds and step size of each hyperparameter
  Parents['Arch'] = randomly generate iterCount architectures
  Parents['Reward'] = getAcc(Parents['Arch'])
  while maximum number of iterations not reached do
     Children = {}
     for parent in Parents do
        child = parent
        with probability p: child['Arch'] = mutate(parent['Arch'])
        child['Reward'] = getAcc(child['Arch'])
        Children += child
     end for
     Children = Children ∪ Parents
     reverse-sort Children by 'Reward'
     bestArch = Children[0]['Arch']
     bestReward = Children[0]['Reward']
     Parents = Children[:iterCount]
  end while
  Accuracy = test accuracy of bestArch
Output: best architecture (bestArch) and its test accuracy
Algorithm 2 Global Search Module
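The corresponding Python sketch of the global search loop is given below; random_architecture(), mutate(), and encode() are assumed helpers, and the population size, mutation probability, and iteration count are placeholders rather than the values used in our experiments:

import random

def evolutionary_search(predictor, space, pop_size=100, num_iters=50, p_mutate=0.3):
    parents = [random_architecture(space) for _ in range(pop_size)]
    for _ in range(num_iters):
        children = []
        for parent in parents:
            child = dict(parent)
            if random.random() < p_mutate:
                child = mutate(child, space)      # perturb one hyperparameter within its bounds
            children.append(child)
        population = parents + children
        # Rank by predicted accuracy; no network training is needed during the search.
        population.sort(key=lambda a: predictor.predict([encode(a)])[0], reverse=True)
        parents = population[:pop_size]
    return parents[0]                             # best architecture found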

4.6 Grow-and-prune Synthesis

We use a secondary step of NN grow-and-prune synthesis for two main reasons: (i) to find a model with an even better performance and (ii) to ensure model compactness and computational efficiency. However, the initial NN architecture provided to a grow-and-prune NN synthesis tool has a notable impact on the result. Hence, identifying a good starting point is important. In our case, we treat the architecture found by the global search module to be the initial point. Subsequent grow-and-prune synthesis is encapsulated in a local search module. We discuss application of the above approach to FFNNs and CNNs next.

4.6.1 FFNNs

We use the SCANN [hassantabar2019scann] synthesis tool for local search of the feed-forward part of the architecture. SCANN can achieve very high compression rates, yet generate very accurate networks, when a DR method is employed.

Input: best architecture from the global search module; reduced-dimension dataset; number of search iterations
  while maximum number of iterations not reached do
     (a) apply one architecture-changing operation
     (b) train the weights of the network on the reduced-dimension dataset and evaluate its performance on the validation set
  end while
Output: final network architecture with the best performance on the validation set
Algorithm 3 Local search with SCANN

Algorithm 3 summarizes the grow-and-prune process for the FFNNs. We use the best architecture found from the search process discussed in Section 4.5 to initialize this process. We also feed SCANN the corresponding reduced-dimension dataset. Subsequently, for a defined number of iterations, we use one of the three architecture-changing SCANN operations and evaluate the performance of the resulting model on the validation set. These operations are connection growth, neuron growth, and connection pruning. The output of this module is the best architecture based on evaluation on the validation set.

4.6.2 CNNs

For the convolutional layers of the architecture, we use NeST [dai2017nest] to perform grow-and-prune synthesis. In the growth phase, this methodology uses an intelligent feature map growth method. In order to add a new feature map to the convolutional layers, the best set of kernels is selected from a set of randomly generated kernels. This selection is based on which ones reduce the value of the loss function the most. We also employ partial-area convolutions. Different parts of the images are of interest to different kernels. Therefore, using the method in NeST, we identify the corresponding areas of interest in the image and prune away the connections to other image areas. This reduces the number of parameters in the convolutional layers (by a certain pruning ratio), as well as the number of floating-point operations (FLOPs) needed for computation.
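A literal, highly simplified reading of the feature-map growth step is sketched below (add_feature_map() and eval_loss() are assumed helpers; NeST's actual implementation selects and initializes kernels using gradient information rather than exhaustive evaluation):

import copy

def grow_feature_map(model, layer_idx, num_candidates, val_batch):
    # Try several randomly initialized candidate kernels and keep the one that
    # reduces the loss the most; keep the original model otherwise.
    best_model, best_loss = model, eval_loss(model, val_batch)
    for _ in range(num_candidates):
        candidate = copy.deepcopy(model)
        add_feature_map(candidate, layer_idx)     # append one randomly initialized kernel
        loss = eval_loss(candidate, val_batch)
        if loss < best_loss:
            best_model, best_loss = candidate, loss
    return best_model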

5 Experimental Results

In this section, we evaluate the performance of STEERAGE on several datasets. Table I shows their characteristics.

The evaluation results are divided into two parts. Section 5.1 presents results obtained by STEERAGE on FFNNs. Section 5.2 presents results on CNN architectures for the MNIST and CIFAR-10 datasets. Note that MNIST has both types of baseline architectures: the FFNN LeNet-300-100 and the CNN LeNet-5. We use various ResNet [he2016deep, he2016identity] architectures as the baselines for the CIFAR-10 dataset. We demonstrate that the NNs generated by STEERAGE are compact, computationally efficient, and accurate. Therefore, STEERAGE-generated NNs can be used in energy-constrained edge devices and IoT sensors.

Dataset Training Set Validation Set Test Set Features Classes
Sensorless Drive Diagnosis
Human Activity Recognition (HAR)
Musk v
Pen-Based Recognition of Handwritten Digits
Landsat Satellite Image
Letter Recognition
Epileptic Seizure Recognition
Smartphone Human Activity Recognition
DNA
MNIST
CIFAR-10
TABLE I: Characteristics of the datasets

5.1 FFNNs

In this section, we present STEERAGE synthesis results on FFNNs. We started with a large set of potential hyperparameters and narrowed it down to a small subset. Table II shows the hyperparameters we considered for FFNNs. For each hyperparameter, we show the lower bound, upper bound, and step size. For the number of layers, we considered one to six hidden layers in the architecture search space, with a step size of one. We let the synthesis algorithm find the optimal number of neurons (in a 50-600 range, with a step size of 25) in each of the hidden layers. STEERAGE steps through the 11 DR methods discussed earlier one by one, with the DR ratio spanning 1-20 with a step size of 0.1. Finally, we evaluated four different values for quantization: 4-, 8-, 16-, or 32-bit (i.e., full precision) inference. These are placed in bins 1-4 in the table. Furthermore, we may refine these search parameters for specific datasets. We also considered the use of different nonlinear activation functions in different network layers. However, since this did not help improve performance, we did not consider it further.

The final search space for each dataset is a subset of the search space defined by this table. For example, for the letter recognition dataset, which only has 16 features, we restricted the feature compression ratio accordingly. In the following, we first present synthesis results on nine small- to medium-size datasets whose NN models would be appropriate for edge devices or IoT sensors. These datasets were obtained from the UCI machine learning repository [Dua:2017] and the Statlog collection [Michie:1995:MLN:212782]. Then, we present results on the FFNN implementation for the MNIST dataset.

Hyperparameter        Lower bound   Upper bound   Step size
#Layers               1             6             1
#Neurons per layer    50            600           25
DR method             1             11            1
DR ratio              1             20            0.1
Quantization          bin 1         bin 4         1
TABLE II: General search space for FFNNs
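For concreteness, one possible encoding of this search space as a hyperparameter vector is sketched below (field names are our own; the bounds follow Table II):

ffnn_space = {
    "num_hidden_layers": (1, 6, 1),        # (lower bound, upper bound, step size)
    "neurons_per_layer": (50, 600, 25),
    "dr_method":         (1, 11, 1),       # index into the 11 DR methods
    "dr_ratio":          (1.0, 20.0, 0.1),
    "quant_bin":         (1, 4, 1),        # bins map to 4-, 8-, 16-, and 32-bit inference
}

# Example candidate: 3 hidden layers of 200 neurons, DR method 5 with 5x compression, 8-bit weights
candidate = {"num_hidden_layers": 3, "neurons_per_layer": 200,
             "dr_method": 5, "dr_ratio": 5.0, "quant_bin": 2}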

5.1.1 Small to medium size datasets

The top nine rows of Table I show the characteristics of the small- to medium-size datasets we experimented with. For these experiments, we use the Adam optimizer. Table III shows the test accuracy results. There are two rows associated with STEERAGE: the first shows the result obtained using just the global search module, and the second when both the global and local search modules are used. The number of neurons in the hidden layers of the baseline MLP architecture is a multiple of the number of its input features. Relative to the MLP baseline, STEERAGE-generated NNs improve classification accuracy on all nine datasets. Furthermore, compared to SCANN and DR+SCANN [hassantabar2019scann], STEERAGE generates NNs with higher accuracy across all datasets. Table IV shows the number of parameters in the respective NN architectures of Table III. As can be seen, STEERAGE generally generates much more compact architectures than the baseline and is competitive with SCANN. In addition, using global+local search helps both in terms of model performance and connection compression ratio relative to just global search.

Dataset SenDrive HAR Musk Pendigits SatIm Letter Seizure SHAR DNA
Baseline
SCANN [hassantabar2019scann]
DR + SCANN [hassantabar2019scann]
STEERAGE (GS) 97.20%
STEERAGE (GS+LS) 99.36% 96.43% 99.19% 98.05% 92.00% 97.72% 95.50% 95.95%
TABLE III: Highest test accuracy comparison on small- to medium-size datasets (the highest number is highlighted)
Dataset SenDrive HAR Musk Pendigits SatIm Letter Seizure SHAR DNA
Baseline k k k k k k k k k
SCANN k k k k k k k k k
DR + SCANN k k k k k k
STEERAGE (GS) k k k k k k k k k
STEERAGE (GS+LS) k k k k k k k k k
TABLE IV: Neural network parameter comparison

5.1.2 The MNIST dataset

MNIST is a well-studied dataset of handwritten digits. It contains 60000 training images and 10000 test images. We set aside 10000 images from the training set as the validation set. We adopt LeNet-300-100 [lecun1998gradient] as the baseline FFNN model for this dataset. We use the stochastic gradient descent optimizer in our experiments. Table V shows the results for feed-forward architecture synthesis on the MNIST dataset, along with a comparison to related work on feed-forward MNIST architectures. Relative to related work, the combination of global and local search achieves the highest accuracy while delivering a large connection compression ratio. DR+SCANN generates the most compact NNs in this case, however, at a lower accuracy than STEERAGE. This points to a classic accuracy-compactness tradeoff. Again, using the combination of global and local search helps both in terms of model performance and connection compression ratio.

Method Weights Connection Compression ratio Test Accuracy
LeNet-300-100 k
NeST [dai2017nest] k
DR + SCANN [hassantabar2019scann] k
DR + SCANN [hassantabar2019scann] k
STEERAGE (GS) k
STEERAGE (GS+LS) k
TABLE V: Accuracy results for feed-forward MNIST architectures

5.2 CNNs

In this section, we present results of STEERAGE on CNN architectures for the medium-size MNIST and CIFAR-10 datasets.

5.2.1 The MNIST dataset

For the MNIST dataset, we use LeNet-5 [lecun1998gradient] as the baseline architecture and adopt its PyTorch [paszke2017automatic] implementation. This architecture has two convolutional layers followed by a feed-forward classifier with three hidden layers. Table VI shows the search space. It includes the image DR method, image size, and quantization. It also includes the number of filters in each convolutional layer, the use of max or average pooling after the convolutional layers, and the number of neurons in the three fully-connected layers. The search space for the fully-connected layers is selected based on image size: as the number of input features decreases, we decrease the search interval for the number of neurons in these layers.

Input size
Image DR -
Kernel Size vs.
Conv - - -
Pooling Max vs. Avg
Conv - - -
Pooling Max vs. Avg
FC - - -
FC - - -
FC - - -
Quantization bits - - -
TABLE VI: Search space for LeNet-5

We use the stochastic gradient descent optimizer, together with momentum and weight decay, in the global search module. In the local search module, we use various SCANN schemes to obtain the feed-forward part of the network and NeST for grow-and-prune synthesis of the convolutional layers.

Table VII presents the results. The most accurate architecture based on the LeNet-5 model has an error rate of 0.66%. It is obtained using both the global and local search modules and has only 7.2k parameters. Moreover, another model based on global+local search has only 5.2k parameters, with an error rate of 0.72%. These two results are highlighted in bold. Interestingly, the first of the above two models outperforms the GS+LS models obtained with the other image sizes. This may be due to the fact that dimensionality reduction has two opposite effects on accuracy: mitigating the curse of dimensionality improves accuracy; however, it also results in some loss of information, which reduces accuracy.

Method Error rate (%) Weights
Baseline k
Network pruning [han2015learning] k
NeST [dai2017nest] k
SCANN Scheme A[hassantabar2019scann] k
SCANN Scheme C[hassantabar2019scann] k
STEERAGE (GS) () k
STEERAGE (GS+LS) () k
STEERAGE (GS) () k
STEERAGE (GS+LS) () 0.72% 5.2k
STEERAGE (GS+LS) () 0.66% 7.2k
STEERAGE (GS) () k
STEERAGE (GS+LS) () k
TABLE VII: Comparison of results for LeNet-5 (the image size is shown in parentheses for STEERAGE)

Fig. 4 shows the tradeoff between accuracy and the number of parameters in the network for various image sizes. For each image size, we show two curves: one is the result of using grow-and-prune synthesis on only the feed-forward part of the architecture, whereas the other is based on using it on both the feed-forward and convolutional layers. As can be seen, applying it to both layer types yields NNs that are both more accurate and more compact. Furthermore, by using the global and local search modules, STEERAGE provides NN designers a choice among NNs that fall at different points on the tradeoff curve, so that they can pick the one most suitable for their use scenario. Note that beyond a certain point, adding more parameters to the network leads to overfitting and thus hurts performance.

Fig. 4: Accuracy vs. #parameters tradeoff for three variants of LeNet-5 obtained with three different image sizes ((a), (b), and (c)).

5.2.2 The CIFAR- dataset

In this section, we present results of our CNN synthesis methodology for the CIFAR-10 dataset. For this dataset, we used ResNet [he2016deep, he2016identity] architectures of various depths as the baselines. ResNet uses two residual block architectures, basic and bottleneck, to facilitate the training of deep networks. The search space for ResNet architectures is shown in Table VIII. We used two different image sizes: the original 32x32 and a down-sampled version. We choose from among the six DR methods discussed earlier for the down-sampled images. ResNet architectures have four stages of residual blocks, either basic or bottleneck [he2016deep]; we use the basic block for shallower variants and the bottleneck block for deeper ones. We search over network depth by searching for the number of blocks (in the range 2-6) in each of the four stages of the residual network. Moreover, the number of filters in the convolutional layers of each stage is part of the search space. Other hyperparameters denote whether max or average pooling is used after the convolutional layers, the number of fully-connected layers, and the quantization level (4, 8, 16, or 32 bits of precision). We also experiment with fixed-depth ResNet architectures; in those experiments, we do not include the network depth in the search space.

Input size -
Image DR -
#Basic blocks -
Network depth - ()
Conv - ()
Conv_x - ()
Conv_x - ()
Conv_x - ()
Conv_x - ()
Pooling Max - Avg
#FC layers -
Quantization bits - - -
TABLE VIII: Adaptive search space for ResNet architectures (for #filters in the convolutional layers, the step sizes are shown in the parentheses)

The results are summarized in Table IX. We compare the results obtained using our synthesis methodology with those for the original ResNet architectures. By setting different pruning ratios (in the partial-area convolution operation) in the local search module, STEERAGE can navigate the accuracy-model complexity tradeoff. This is evident from the results shown in the table. Our ResNet architecture variants are more accurate than the original ones. For example, STEERAGE-synthesized ResNet-18 reaches an error rate 2.52% lower than that of the original ResNet-18, and 1.74% lower than that of even the much larger ResNet-101. Furthermore, another synthesized ResNet-18 variant still improves on the error rate of the baseline ResNet-18 while using fewer parameters.

By including the depth of the network as one of the hyperparameters in the search space, we were able to synthesize a variant of the ResNet architecture with 40 layers that achieves an error rate of only 3.86%. It dominates the original ResNet-101 architecture in error rate, number of parameters, and FLOPs. Another variant of this architecture reduces the number of parameters and FLOPs further, and also dominates the original ResNet-101 architecture. These STEERAGE-synthesized architectures are more accurate than even the ResNet-1001 architecture [he2016identity], which has 1001 layers and is the most accurate of the original ResNet architectures. This shows that the traditional method of increasing accuracy by increasing the number of layers is not necessarily the best approach: a better architecture can accomplish the same job with far fewer layers (as well as fewer parameters and FLOPs).

Method Error rate (%) #Parameters FLOPs
ResNet- [he2016deep] M M
ResNet- [he2016deep] M G
ResNet- [he2016deep] M G
ResNet- [he2016identity] M G
ResNet- (GS) M M
ResNet- (GS+LS) M M
ResNet- (GS+LS) M M
ResNet- (GS) M G
ResNet- (GS+LS) M G
ResNet- (GS+LS) M M
TABLE IX: Results for the CIFAR-10 dataset

6 Discussion

There are several advantages to our synthesis methodology. To begin with, we speed up the search process by using the accuracy predictor. For example, fully training a ResNet architecture can take many GPU hours, whereas the accuracy predictor evaluates the same architecture in seconds. This yields a very large speedup in evaluating architectures during the search process. Furthermore, by using boosted decision tree regression, we trained a predictor that accurately predicts the performance of the architectures. This synthesis approach is more efficient than RL-based NAS approaches. The methodology also easily adapts to various network architecture types: FFNNs as well as shallow and deep CNNs.

Our architecture search framework is general. As a result, we can easily add other hyperparameters to the search space. For example, an interesting topic for future research is to investigate the effect of including the type of normalization on architecture performance. Similar to the work done in [luo2019switchable], we could add switchable normalization to the search space and enable the framework to find the optimal combination of various normalization schemes for different parts of the network.

We evaluated the impact of both global search and combined global+local search. The combined approach performed better. This is because global search efficiently obtains a good initialization point for subsequent grow-and-prune synthesis that yields a compact and computationally-efficient architecture while also enhancing accuracy.

7 Conclusion

In this article, we proposed a new two-step NN synthesis methodology called STEERAGE. It uses an efficient global search module based on an accuracy predictor to find the best set of hyperparameter values in the NN design space. It then uses grow-and-prune synthesis methods to find a superior version of the architecture that is more compact, efficient, and accurate. Experimental results demonstrate the ability of STEERAGE to generate accurate and compact networks for both architecture types: FFNNs and CNNs. STEERAGE generated a 40-layer ResNet architecture variant that achieves the highest accuracy relative to all ResNet architectures on the CIFAR-10 dataset.

References