SCANN: Synthesis of Compact and Accurate Neural Networks

04/19/2019 ∙ by Shayan Hassantabar, et al. ∙ 0

Artificial neural networks (ANNs) have become the driving force behind recent artificial intelligence (AI) research. An important problem with implementing a neural network is the design of its architecture. Typically, such an architecture is obtained manually by exploring its hyperparameter space and kept fixed during training. This approach is both time-consuming and inefficient. Furthermore, modern neural networks often contain millions of parameters, whereas many applications require small inference models. Also, while ANNs have found great success in big-data applications, there is also significant interest in using ANNs for medium- and small-data applications that can be run on energy-constrained edge devices. To address these challenges, we propose a neural network synthesis methodology (SCANN) that can generate very compact neural networks without loss in accuracy for small and medium-size datasets. We also use dimensionality reduction methods to reduce the feature size of the datasets, so as to alleviate the curse of dimensionality. Our final synthesis methodology consists of three steps: dataset dimensionality reduction, neural network compression in each layer, and neural network compression with SCANN. We evaluate SCANN on the medium-size MNIST dataset by comparing our synthesized neural networks to the well-known LeNet-5 baseline. Without any loss in accuracy, SCANN generates a 46.3× smaller network than the LeNet-5 Caffe model. We also evaluate the efficiency of using dimensionality reduction alongside SCANN on nine small to medium-size datasets. Using this methodology enables us to reduce the number of connections in the network by up to 5078.7× (geometric mean: 82.1×), with little to no drop in accuracy. We also show that our synthesis methodology yields neural networks that are much better at navigating the accuracy vs. energy efficiency space.




1 Introduction

Artificial neural networks (ANNs) have a long history, dating back to the 1950s [1]. However, interest in ANNs has waxed and waned over the years. The recent surge of interest in ANNs is due to large datasets becoming available, enabling ANNs to be trained to high accuracy. This trend is also due to a significant increase in compute power that speeds up the training process. ANNs demonstrate very high classification accuracies for many applications of interest, e.g., image recognition [2], speech recognition [3], and machine translation [4]. ANNs have also become deeper, with tens to hundreds of layers. Thus, the phrase ‘deep learning’ is often associated with such neural networks [5]. Deep learning refers to the ability of ANNs to learn hierarchically, with complex features built upon simple ones.

An important challenge in deploying ANNs in practice is their architecture design, since the ANN architecture directly influences the learnt representations and thus the performance. Typically, researchers spend a great deal of time on trial and error to find a good architecture because the search space is exponentially large with respect to many of its hyperparameters. As an example, let us consider a convolutional neural network (CNN) often used in image recognition tasks. Its various hyperparameters, such as depth, number of filters in each layer, kernel size, how feature maps are connected, etc., need to be determined when designing an architecture. Improvements in such architectures often take several years of effort, as evidenced by the evolution of various architectures for the ImageNet dataset: AlexNet [2], GoogLeNet [6, 7], ResNet [8, 9, 10], and DenseNet [11].

Another challenge ANNs pose is that to obtain their high accuracy, they need to be designed with a large number of parameters. This negatively impacts both the training and inference times. For example, modern deep CNNs often have millions of parameters and take days to train even with powerful graphics processing units (GPUs). However, making the ANN models compact and energy-efficient may enable them to be moved from the cloud to the edge, leading to benefits in communication energy, network bandwidth, and security. The challenge is to do so without degrading accuracy. This is currently a very active area of research [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24].

As the number of features or dimensions of the dataset increases, in order to generalize accurately, we need exponentially more data. This is another challenge which is referred to as the curse of dimensionality. Hence, one way to reduce the need for large amounts of data is to reduce the dimensionality of the dataset. In addition, with the same amount of data, by reducing the number of features, the accuracy of the inference model may also improve to a degree. However, beyond a certain point, which is dataset-dependent, reducing the number of features may lead to loss of information, which may lead to inferior classification results.

In this paper, we address the aforementioned challenges by introducing an ANN synthesis methodology, called SCANN. The ANNs it synthesizes are not necessarily multilayer perceptrons (MLPs). SCANN allows skipped connections, instead of enforcing a layer-by-layer connection structure, in order to address the limitation of conventional ANNs that fixes their depth while training. SCANN uses three basic operations: connection growth, neuron growth, and connection pruning. It generates a feed-forward architecture with arbitrary depth. It uses three effective training schemes that enable it to generate ANNs that are much smaller in size and similar in performance relative to prior ANNs.

Moreover, in order to alleviate the curse of dimensionality, we use dimensionality reduction methods to reduce the feature size of the datasets, before using the SCANN methodology to generate compact neural networks. This methodology consists of three basic blocks: dimensionality reduction, neural network compression in each layer, followed by another neural network compression step with SCANN. We refer to this methodology as DR+SCANN.

For the LeNet-5 Caffe model [23, 25] derived for the MNIST [23] dataset, SCANN generates a network with only 9.3K parameters (representing a 46.3× compression ratio relative to the baseline), while providing performance similar to the baseline. To further show the efficacy of SCANN, we experiment with several small to medium-size datasets. Although on two datasets, SCANN achieves a compression ratio of and while losing less than in accuracy, on the other seven, SCANN generates ANNs that are to smaller than the baseline architecture, without any drop in classification accuracy. In addition, by combining dimensionality reduction with SCANN (i.e., DR+SCANN), we demonstrate a compression ratio between to , with little to no drop in model performance on these datasets. These results show that our final synthesis framework (DR+SCANN) has an even higher compression power than SCANN, and can significantly reduce the number of connections in the network, without degrading performance. Moreover, we demonstrate that DR+SCANN yields ANNs that are very energy-efficient, while offering accuracy similar to that of other methods. This opens the door for such ANNs to be used in IoT sensors.

The paper is organized as follows. Section 2 describes related work. Section 3 describes the SCANN synthesis methodology in detail. Section 4 explains the methodology that combines dimensionality reduction with SCANN. Section 5 provides results of synthesis on various benchmarks. Section 6 provides a short discussion. Finally, Section 7 concludes the paper.

2 Related Work

In this section, we review some of the previous work in two related areas: dimensionality reduction and automatic architecture synthesis.

2.1 Dimensionality Reduction

The high dimensionality of many datasets used in various applications of machine learning leads to the curse of dimensionality problem. Therefore, researchers have explored dimensionality reduction methods to improve the performance of machine learning models by decreasing the number of features. Traditional dimensionality reduction methods include Principal Component Analysis (PCA), Kernel PCA, Factor Analysis (FA), Independent Component Analysis (ICA), as well as Spectral Embedding methods. Some graph-based methods include Isomap and Maximum Variance Unfolding [27]. FeatureNet [28] uses community detection in small sample size datasets to map high-dimensional data to lower dimensions. Other dimensionality reduction methods include stochastic proximity embedding (SPE) [29], Linear Discriminant Analysis (LDA), and t-distributed Stochastic Neighbor Embedding (t-SNE) [30]. A detailed survey of dimensionality reduction methods can be found in [31].

2.2 Automatic Architecture Synthesis

Researchers have proposed three categories of automatic architecture synthesis methods: evolutionary algorithms, reinforcement learning algorithms, and structure adaptation algorithms.

2.2.1 Evolutionary Algorithm

The use of an evolutionary algorithm to select an ANN architecture dates back to 1989 [32]. One of the seminal works in neuroevolution is the NEAT algorithm [33], which uses direct encoding of every neuron and connection to simultaneously evolve the network architecture and weights through weight mutation, connection mutation, node mutation, and crossover. Recent years have seen extensions of the evolutionary algorithm to generate CNNs. For example, Xie and Yuille [34] use a concise binary representation of network connections, and demonstrate a comparable classification accuracy to previous human-designed architectures.

2.2.2 Reinforcement Learning Algorithm

A recent trend in automatic ANN architecture synthesis is to use reinforcement learning. Zoph and Le [35] use a recurrent neural network as a controller to generate a string that specifies the network architecture. They use the performance of the generated network on a validation dataset as the reward signal to compute the policy gradient and update the controller. In a later work [36], the authors define a different search space and use the controller to obtain a building block instead of the whole network. They showed that the convolutional cells obtained by learning performed on the CIFAR-10 dataset can be successfully transferred to architectures for other datasets. They achieve a state-of-the-art classification accuracy on ImageNet.

2.2.3 Structure Adaptation Algorithm

Several previous works achieve compact and accurate neural networks through structure adaptation algorithms. One such method is network pruning, which has been used in several works [37, 38, 39, 40, 12, 41, 42]. Structure adaptation algorithms can be constructive or destructive. Constructive algorithms start from a small neural network and grow it into a larger more accurate neural network. Destructive algorithms start from a large neural network and prune connections and neurons to get rid of the redundancy while maintaining accuracy. NeST [37] is a network synthesis tool that combines both the constructive and destructive approaches in a grow-and-prune synthesis paradigm. It is used to synthesize compact and accurate architectures for the MNIST and ImageNet datasets. However, its limitation is that growth and pruning are both performed at a specific ANN layer. Thus, network depth cannot be adjusted and is fixed throughout training. In the next section, we will show this problem can be solved by synthesizing a general feed-forward network instead of an MLP architecture, allowing the ANN depth to be changed dynamically during training.

Several works have also proposed more efficient building blocks for CNN architectures [43, 44, 20, 45, 18, 24, 21]. They result in compact networks, with far fewer parameters, while maintaining or improving performance. Platform-aware search for an optimized NN architecture has also been used in this area. Yin et al. [46] combine the grow-prune synthesis methodology with hardware-guided training to achieve compact long short-term memory (LSTM) cells. The authors of [19] train an ANN to satisfy pre-defined resource constraints, such as latency and energy consumption, with the help of a pre-generated accuracy predictor.

Orthogonal to the above works, quantization has also been used to reduce computations in a network with little to no accuracy drop [13, 47, 17].

3 SCANN Synthesis Methodology

In this section, we first propose a technique to address the limitation of prior work that requires the ANN depth to be fixed. Then we introduce three basic architecture-changing techniques that enable the synthesis of an optimized feed-forward network architecture. Finally, we describe three training schemes that can be used to synthesize the network architecture.

3.1 Depth Change

To address the problem of having to fix the ANN depth during training in prior work, we adopt a general feed-forward architecture instead of an MLP structure. Specifically, a hidden neuron can receive inputs from any neuron activated before it (including input neurons), and can feed its output to any neuron activated after it (including output neurons). In this setting, depth is determined by how hidden neurons are connected and thus can be changed through rewiring of hidden neurons. As shown in Fig. 1, depending on how the hidden neurons are connected, they can form one, two, or three hidden layers.

Fig. 1: Connection pattern determines network depth. Only hidden neurons are shown. (a) One hidden layer, (b) two hidden layers, and (c) three hidden layers.
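To make the link between wiring and depth concrete, the following is a minimal sketch, not the authors' implementation. It assumes hidden neurons are indexed in activation order and that a connection is a hypothetical `(src, dst)` pair with `src < dst`; the depth of a neuron is then one more than the deepest hidden neuron feeding it.

```python
def neuron_depths(num_hidden, connections):
    """Return the hidden-layer index (starting at 1) of each hidden neuron.

    connections: set of (src, dst) pairs between hidden neurons, src < dst.
    Neurons fed only by inputs sit in hidden layer 1.
    """
    depths = [1] * num_hidden
    # Visiting dst in activation order guarantees depths[src] is final.
    for dst in range(num_hidden):
        for src in range(dst):
            if (src, dst) in connections:
                depths[dst] = max(depths[dst], depths[src] + 1)
    return depths


def network_depth(num_hidden, connections):
    """Number of hidden layers implied by the wiring."""
    return max(neuron_depths(num_hidden, connections), default=0)


# Three hidden neurons, rewired to form one, two, or three hidden layers.
print(network_depth(3, set()))
print(network_depth(3, {(0, 1)}))
print(network_depth(3, {(0, 1), (1, 2)}))
```

Adding or removing a single connection between hidden neurons is enough to change the result, which is why no explicit depth hyperparameter is needed in this setting.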

3.2 Overall Workflow

The overall workflow for architecture synthesis is shown in Algorithm 1. The synthesis process iteratively alternates between architecture change and weight training. Thus, the network architecture evolves along the way. After a specified number of iterations, the checkpoint that achieves the best performance on the validation set is output as the final network. Next, we first elaborate on the three basic architecture-changing operations, and then introduce three different training schemes based on how the architectures evolve.

Input: Initial network architecture, weights, and maximum number of iterations
  while maximum iterations not reached do
     (a) Perform one of the three basic architecture-changing operations
     (b) Train weights of the network and test its performance on the validation set
  end while
Output: Final network architecture and associated weights that achieve the best performance on the validation set
Algorithm 1 Automatic architecture synthesis
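The loop in Algorithm 1 can be sketched as follows. This is an illustrative skeleton under stated assumptions: `schedule`, `train`, and `evaluate` are hypothetical callables supplied by the user, operations are applied in a fixed repeating order, and each operation returns a new network object so that earlier checkpoints stay intact.

```python
def synthesize(network, schedule, train, evaluate, max_iters):
    """Alternate architecture changes and training; keep the best checkpoint.

    schedule: list of architecture-changing operations, applied cyclically
    train:    function(network) -> trained network
    evaluate: function(network) -> validation score (higher is better)
    """
    best_network, best_score = None, float("-inf")
    for it in range(max_iters):
        op = schedule[it % len(schedule)]   # step (a): change the architecture
        network = train(op(network))        # step (b): train the weights
        score = evaluate(network)           # validation performance
        if score > best_score:
            best_network, best_score = network, score
    return best_network, best_score


# Toy stand-in: the "network" is just a connection count, and the
# validation score peaks at five connections.
grow = lambda n: n + 3
prune = lambda n: n - 2
result = synthesize(0, [grow, prune], train=lambda n: n,
                    evaluate=lambda n: -abs(n - 5), max_iters=4)
print(result)
```

Returning the best checkpoint rather than the last one matters because a pruning step near the end may temporarily hurt validation performance.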

3.3 Basic Architecture-changing Operations

Three basic operations, connection growth, neuron growth, and connection pruning, are used to adjust the network architecture, in order to evolve a feed-forward network just through these operations. Fig. 2 shows a simple example in which an MLP architecture with one hidden layer evolves into a non-MLP architecture with two hidden layers with a sequence of basic operations mentioned above.

Fig. 2: An MLP architecture with one hidden layer evolves into a non-MLP architecture with two hidden layers through a sequence of neuron growth, connection growth, and connection pruning.

Next, we describe these three operations. We denote the $i$th hidden neuron as $n_i$, its activity as $x_i$, and its preactivity as $u_i$, where $x_i = f(u_i)$ and $f$ is the activation function. We denote the depth of $n_i$ by $D_i$ and the loss function by $L$. Finally, we denote the connection between $n_i$ and $n_j$, where $D_i \le D_j$, as $w_{ij}$. In our implementation, we use masks to mask out the pruned weights.

3.3.1 Connection Growth

Connection growth adds connections between neurons that are unconnected. The initial weights of all newly added connections are set to 0. Depending on how connections can be added, we use three different methods, as shown in Algorithm 2.

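  • Gradient-based growth: Gradient-based growth was proposed by Dai et al. [37]. It adds connections that tend to reduce the loss function significantly. Suppose two neurons $n_i$ and $n_j$ are not connected and the depth of $n_i$ does not exceed that of $n_j$; then gradient-based growth adds a new connection $w_{ij}$ between them if the magnitude of the loss gradient with respect to that connection, $|\partial L / \partial w_{ij}|$, is large.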

  • Full growth: Full growth restores all possible connections to the network.

  • Random growth: Random growth randomly picks some inactive connections and adds them to the network.

    Input: Network, weight matrix, mask matrix, data batch, threshold
      if full growth then
         set all elements in the mask matrix to 1
      else if random growth then
         randomly set some elements in the mask matrix to 1
      else if gradient-based growth then
         forward propagation through the network using the data batch and then back propagation
         for each inactive connection whose gradient magnitude exceeds the threshold, set its mask element to 1
      end if
    Output: Modified weight matrix and mask matrix
    Algorithm 2 Connection growth algorithm
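A minimal sketch of the three growth modes on a 0/1 mask is given below. It is an illustration rather than the authors' code: the function name, the `fraction` knob for random growth, and the list-of-lists layout are assumptions, and the mask is assumed to cover only positions where a connection is allowed. Weights of newly activated connections are assumed to be 0 already, since pruned weights are zeroed.

```python
import random


def grow_connections(mask, mode, grads=None, threshold=0.0, fraction=0.1, rng=None):
    """Return a copy of `mask` with some inactive (0) entries switched to 1.

    mode:      "full", "random", or "gradient"
    grads:     loss gradient per connection, same shape as mask (gradient mode)
    threshold: minimum gradient magnitude for gradient-based growth
    fraction:  chance of activating each inactive entry in random mode
    """
    if mode not in ("full", "random", "gradient"):
        raise ValueError(f"unknown growth mode: {mode}")
    rng = rng or random.Random(0)
    new_mask = [row[:] for row in mask]
    for i, row in enumerate(mask):
        for j, active in enumerate(row):
            if active:
                continue  # already connected
            if mode == "full":
                new_mask[i][j] = 1
            elif mode == "random":
                if rng.random() < fraction:
                    new_mask[i][j] = 1
            elif abs(grads[i][j]) > threshold:  # gradient-based growth
                new_mask[i][j] = 1
    return new_mask


mask = [[1, 0], [0, 0]]
grads = [[0.0, 0.5], [0.01, -0.9]]
print(grow_connections(mask, "gradient", grads=grads, threshold=0.1))
```

Gradient-based growth needs one forward and backward pass to obtain `grads`, whereas full and random growth need no data at all.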

3.3.2 Neuron Growth

Neuron growth adds new neurons to the network, thus increasing network size over time. There are two possible methods for doing this, as shown in Algorithm 3. First, drawing an analogy from biological cell division, neuron growth can be achieved by duplicating an existing neuron. To break the symmetry, random noise is added to the weights of all the connections related to this newly added neuron. The specific neuron that is duplicated can be selected in two ways:

  • Activation-based selection: Activation-based selection selects neurons with a large activation for duplication.

  • Random selection: Random selection randomly selects neurons for duplication.

Second, instead of duplicating existing neurons, new neurons with random initial weights and random initial connections with other neurons may be added to the network.

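Input: Network, weight matrix, mask matrix, data batch, a candidate neuron to be added
  if neuron division then
     if activation-based selection then
        forward propagation through the network using the data batch
        pick the active neuron with the largest activation
     else if random selection then
        randomly pick an active neuron
     end if
     copy the connections and weights of the picked neuron to the candidate neuron and add random noise to the copied weights
  else if random growth then
     randomly set mask elements of the candidate neuron's incoming and outgoing connections to 1
     randomly initialize the corresponding weights
  end if
Output: Modified weight matrix and mask matrix
Algorithm 3 Neuron growth algorithm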
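Neuron division can be sketched as below. The dictionary-of-weights representation, the function names, and the Gaussian noise with standard deviation `noise_std` are assumptions made for illustration; the text only states that random noise is added to break symmetry.

```python
import random


def select_neuron(neurons, activations=None, rng=None):
    """Pick the neuron to duplicate: largest activation if given, else random."""
    if activations is not None:
        return max(neurons, key=lambda n: activations[n])
    rng = rng or random.Random(0)
    return rng.choice(list(neurons))


def duplicate_neuron(weights, parent, child, noise_std=0.01, rng=None):
    """Return new weights in which `child` copies every connection of `parent`.

    weights: dict mapping (src, dst) -> weight for active connections.
    Noise is added to each copied weight so the twins can diverge in training.
    """
    rng = rng or random.Random(0)
    new_weights = dict(weights)
    for (src, dst), w in weights.items():
        if dst == parent:    # copy an incoming connection
            new_weights[(src, child)] = w + rng.gauss(0.0, noise_std)
        elif src == parent:  # copy an outgoing connection
            new_weights[(child, dst)] = w + rng.gauss(0.0, noise_std)
    return new_weights


weights = {("x", "h1"): 0.5, ("h1", "y"): -1.0, ("x", "h2"): 0.2}
parent = select_neuron(["h1", "h2"], activations={"h1": 0.9, "h2": 0.1})
grown = duplicate_neuron(weights, parent, "h3", noise_std=0.0)
print(sorted(grown))
```

With `noise_std=0.0` the child is an exact copy, which is useful for checking the wiring; in practice a nonzero value is needed, otherwise both neurons receive identical gradients.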

3.3.3 Connection Pruning

Connection pruning disconnects previously connected neurons and reduces the number of network parameters. If all connections associated with a neuron are pruned, then the neuron is removed from the network. We adopt a widely-used method [12, 13, 37] to prune connections with small magnitude, as shown in Algorithm 4. The rationale behind it is that since small weights have a relatively small influence on the network, ANN performance can be restored through retraining after pruning.

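Input: Weight matrix, mask matrix, threshold
  for all weights in the weight matrix do
     if the weight's magnitude is below the threshold then
        set the weight and its mask element to 0
     end if
  end for
Output: Modified weight matrix and mask matrix
Algorithm 4 Connection pruning algorithm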
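Magnitude-based pruning with a mask can be sketched as follows. The square matrix layout, in which entry `[i][j]` holds the connection from neuron `i` to neuron `j`, and the helper `isolated_neurons` are assumptions for illustration only.

```python
def prune_connections(weights, mask, threshold):
    """Zero out active connections whose magnitude is below `threshold`.

    Returns new (weights, mask) lists; the inputs are left unchanged.
    """
    new_weights = [row[:] for row in weights]
    new_mask = [row[:] for row in mask]
    for i, row in enumerate(weights):
        for j, w in enumerate(row):
            if mask[i][j] and abs(w) < threshold:
                new_weights[i][j] = 0.0
                new_mask[i][j] = 0
    return new_weights, new_mask


def isolated_neurons(mask):
    """Indices of neurons with no remaining incoming or outgoing connection."""
    n = len(mask)
    return [k for k in range(n)
            if not any(mask[k]) and not any(mask[i][k] for i in range(n))]


weights = [[0.0, 0.5, 0.01],
           [0.0, 0.0, 0.02],
           [0.0, 0.0, 0.0]]
mask = [[0, 1, 1],
        [0, 0, 1],
        [0, 0, 0]]
pruned_w, pruned_m = prune_connections(weights, mask, threshold=0.1)
print(pruned_m, isolated_neurons(pruned_m))
```

In this toy example both connections into neuron 2 fall below the threshold, so the neuron ends up with no connections and can be removed from the network.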

3.4 Training Schemes

In practice, depending on how the initial network architecture and basic operations in Step (a) of Algorithm 1 are chosen, we adopt three training schemes in our experiments, as explained next.

3.4.1 Scheme A

Scheme A is a constructive approach, where we start with a tiny network, and gradually increase the network size. This can be achieved by performing connection and neuron growth more often than connection pruning or carefully selecting the growth and pruning rates, such that each growth operation grows a large number of connections and neurons, while each pruning operation prunes a small number of connections.

3.4.2 Scheme B

Scheme B is a destructive approach, where we start with an over-parameterized network and end up with a small network. There are two possible ways to accomplish this. First, similar to the technique in [12, 37], we can iteratively prune a small number of network connections and then train the weights. This gradually reduces network size and finally results in a small network after many iterations. Another approach is that, instead of pruning the network gradually, we can aggressively prune the network to a tiny size. However, to make this approach work, we need to repeatedly prune the network and then grow the network back, rather than just perform a one-time pruning. In practice, we find the second approach works better.

3.4.3 Scheme C

Scheme B also works with MLP architectures, with only a small adjustment in connection growth such that only connections between adjacent layers are added and not skipped connections. For clarity, we give another name to MLP-based Scheme B: Scheme C. Note that Scheme C is similar to the iterative hard thresholding technique proposed in [48]. Besides, Scheme C can also be viewed as an iterative version of the dense-sparse-dense technique proposed in [49], with the aim of generating compact networks instead of improving performance of the original architecture.

Fig. 3 shows examples of the initial and final architectures for each scheme. Both Schemes A and B evolve general feed-forward architectures, thus allowing network depth to be changed during training. Scheme C evolves an MLP structure, thus keeping the depth fixed.

Fig. 3: Illustration of the three training schemes. Shown here are the initial and final architectures: (a) Scheme A, (b) Scheme B, and (c) Scheme C.

4 Dimensionality Reduction + SCANN

In this section, we propose a methodology to synthesize compact neural networks by combining dimensionality reduction (DR) and SCANN, which we refer to as DR+SCANN. Fig. 4 shows the block diagram of the methodology. It begins by obtaining a very accurate baseline architecture by progressively increasing the number of hidden layers. Its other main parts are dataset dimensionality reduction and two neural network compression steps that are discussed next.

4.1 Dataset Modification

Fig. 4: Block diagram of DR+SCANN: (a) dimensionality reduction into a lower-dimensional space, (b) reduction in the number of neurons in each layer of the initial MLP architecture by the same feature compression ratio, and (c) further compression of the neural network using the three training schemes from SCANN. The process of finding the baseline MLP architecture is also shown.

Dataset modification entails normalizing the dataset and reducing its dimensionality. All feature values are normalized to the range [, ]. Reducing the number of features in the dataset is aimed at alleviating the effect of the curse of dimensionality and increasing data classifiability. This way, a $d$-dimensional dataset is mapped onto a $k$-dimensional space, $k < d$, using various dimensionality reduction methods. We explore 11 such methods, including four random projection (RP) methods.

RP is used to reduce data dimensionality based on the Johnson-Lindenstrauss lemma [50, 51]. The essence of this lemma is that if the data points are in a space of sufficiently high dimension, they can be projected onto a suitable lower dimension, while approximately maintaining inter-point distances. More precisely, this lemma shows that the distances between the points change only by a factor of $(1 \pm \varepsilon)$ when they are randomly projected onto a subspace of $O(\log n / \varepsilon^2)$ dimensions, for any $0 < \varepsilon < 1$, where $n$ is the number of data points.

The RP matrix can be generated in several ways. Here, we discuss the four RP matrices we used. One approach is to generate it using a Gaussian distribution. In this case, the entries are i.i.d. samples drawn from a Gaussian distribution . Another RP matrix can be obtained by sampling entries from . These entries are shown below.

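Achlioptas [52] proposed several other sparse RP matrices. Two of these proposals are as follows, where the entries $\phi_{ij}$ are independent random variables drawn from the following probability distributions:

$$\phi_{ij} = \begin{cases} +1 & \text{with probability } 1/2 \\ -1 & \text{with probability } 1/2 \end{cases} \qquad\qquad \phi_{ij} = \sqrt{3} \times \begin{cases} +1 & \text{with probability } 1/6 \\ 0 & \text{with probability } 2/3 \\ -1 & \text{with probability } 1/6 \end{cases}$$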

The other dimensionality reduction methods that we used include PCA, Polynomial Kernel PCA, Gaussian Kernel PCA, FA, Isomap, ICA, and Spectral Embedding. Implementations of these methods are obtained from the Scikit-learn machine learning library [53].
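As an illustration of the random projection step described above, the sketch below builds one of the sparse matrices from Achlioptas's paper, with entries $\sqrt{3} \times \{+1, 0, -1\}$ drawn with probabilities 1/6, 2/3, and 1/6, and applies it to a few points. The function names and the $1/\sqrt{k}$ scaling are assumptions made here; they are a common convention, not necessarily the exact setup used in the experiments.

```python
import math
import random


def achlioptas_matrix(d, k, rng):
    """d x k sparse projection matrix with entries in {+sqrt(3), 0, -sqrt(3)}."""
    values = (math.sqrt(3.0), 0.0, -math.sqrt(3.0))
    # Integer weights 1:4:1 correspond to probabilities 1/6, 2/3, 1/6.
    return [rng.choices(values, weights=(1, 4, 1), k=k) for _ in range(d)]


def project(points, matrix):
    """Map d-dimensional points to k dimensions, scaled by 1/sqrt(k)."""
    k = len(matrix[0])
    scale = 1.0 / math.sqrt(k)
    return [[scale * sum(x * row[j] for x, row in zip(p, matrix))
             for j in range(k)]
            for p in points]


rng = random.Random(0)
R = achlioptas_matrix(d=20, k=5, rng=rng)
points = [[1.0] * 20, [0.0] * 20]
low_dim = project(points, R)
print(len(low_dim), len(low_dim[0]))
```

Because about two thirds of the entries are zero, the projection skips most multiplications, which is the practical appeal of the sparse variants over a dense Gaussian matrix.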

4.2 Neural Network Compression in Each Layer

Dimensionality reduction maps the dataset into a vector space of lower dimension. As a result, as the number of features reduces, the number of neurons in the input layer of the neural network decreases accordingly. Moreover, since the dataset dimension is reduced, one might expect the task of classification to become easier. This means we can reduce the number of neurons in all layers, not just the input layer. This step reduces the number of neurons in each layer of the neural network, except for the output layer, by the same feature compression ratio used in the dimensionality reduction step (see Fig. 4). Fig. 5 shows an example of this process of compressing neural networks in each layer. We refer to this dimensionality reduction stage as DR.

Fig. 5: Compressing the neural network by a compression ratio of 2: the number of neurons in each layer, except the last layer, is reduced by a factor of 2.
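The per-layer compression step amounts to simple arithmetic on the layer widths. The sketch below assumes widths are given as a list ending with the output layer and that fractional widths are rounded up to at least one neuron; the rounding rule and the example widths are assumptions, not taken from the text.

```python
import math


def compress_layers(layer_sizes, ratio):
    """Shrink every layer except the output layer by the compression ratio."""
    *hidden_and_input, output = layer_sizes
    shrunk = [max(1, math.ceil(n / ratio)) for n in hidden_and_input]
    return shrunk + [output]


# Hypothetical MLP: 16 inputs, two hidden layers of 8 neurons, 3 classes.
print(compress_layers([16, 8, 8, 3], ratio=2))
```

The output layer is left untouched because its width is fixed by the number of classes.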

4.3 Neural Network Compression With SCANN

We input several neural network architectures obtained from the output of the first neural network compression step to SCANN. These architectures correspond to the best three classification accuracies, as well as the three most compressed networks that meet the baseline accuracy of the initial MLP architecture, as evaluated on the validation set.
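The selection rule described above can be sketched as follows. The record fields `val_acc` and `params` and the example numbers are hypothetical; only the rule itself (top three by validation accuracy, plus the three smallest networks that meet the baseline accuracy) comes from the text.

```python
def select_candidates(candidates, baseline_accuracy, top=3):
    """Pick architectures to pass on to SCANN, without duplicates."""
    by_accuracy = sorted(candidates, key=lambda c: c["val_acc"], reverse=True)[:top]
    meets_baseline = [c for c in candidates if c["val_acc"] >= baseline_accuracy]
    by_size = sorted(meets_baseline, key=lambda c: c["params"])[:top]
    selected = []
    for c in by_accuracy + by_size:
        if c not in selected:  # a network may qualify on both criteria
            selected.append(c)
    return selected


# Made-up validation results for five compressed architectures.
candidates = [
    {"name": "a", "val_acc": 0.95, "params": 1000},
    {"name": "b", "val_acc": 0.97, "params": 5000},
    {"name": "c", "val_acc": 0.90, "params": 300},
    {"name": "d", "val_acc": 0.96, "params": 800},
    {"name": "e", "val_acc": 0.93, "params": 400},
]
chosen = select_candidates(candidates, baseline_accuracy=0.93)
print([c["name"] for c in chosen])
```

Network "c" is the smallest but misses the baseline accuracy, so it is excluded from both lists.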

SCANN uses the corresponding reduced-dimension dataset. In Scheme A, we need to set the maximum number of connections in the network. We set this value to the number of connections in the neural network that results from the first compression step. This way, the final neural network will become smaller. Schemes B and C require the maximum number of neurons and the maximum number of connections to be initialized. In addition, in these two training schemes, the final number of connections in the network also needs to be set. Furthermore, the number of layers in the MLP architecture synthesized by Scheme C needs to be predetermined. We initialize these parameters using the network architecture that is output from the first neural network compression step.

5 Experimental Results

In this section, we evaluate the performance of SCANN and DR+SCANN on several small- to medium-size datasets. Table I shows the characteristics of these datasets.

The evaluation results are divided into two parts. Section 5.1 discusses results obtained by SCANN when applied to the widely used MNIST dataset. Compared to related work, SCANN generates neural networks with better classification accuracy and fewer parameters. In Section 5.2, we show results of experiments on nine other datasets. We demonstrate that the ANNs generated by SCANN are very compact and energy-efficient, while maintaining performance. These results open up opportunities to use SCANN-generated ANNs in energy-constrained edge devices and IoT sensors.

Dataset Training Set Validation Set Test Set Features Classes
Sensorless Drive Diagnosis
Human Activity Recognition (HAR)
Musk v
Pen-Based Recognition of Handwritten Digits
Landsat Satellite Image
Letter Recognition
Epileptic Seizure Recognition
Smartphone Human Activity Recognition
TABLE I: Characteristics of the datasets

5.1 Experiments with MNIST

MNIST is a well-studied dataset of handwritten digits. It contains 60000 training images and 10000 test images. We set aside 10000 images from the training set as the validation set. We adopt the LeNet-5 Caffe model [23, 25] that is widely used in related works [12, 54, 55]. For Schemes A and B, the feed-forward part of the network is learnt by SCANN, whereas the convolutional part is kept the same as in the baseline (Scheme A does not make any changes to the baseline, but Scheme B prunes the connections). For Scheme C, SCANN starts with the baseline architecture, and only learns the connections and weights, without changing the depth of the network. All experiments use the stochastic gradient descent (SGD) optimizer with a learning rate of 0.03, momentum of 0.9, and weight decay of 1e-4. No other regularization technique like dropout or batch normalization is used. We run each experiment five times and report the average performance.

The LeNet-5 Caffe model contains two convolutional layers with 20 and 50 filters, and also one fully-connected hidden layer with 500 neurons. For Scheme A, we start with 400 hidden neurons in the feed-forward part, randomly prune out 95 percent of the connections in the beginning and then iteratively perform a sequence of connection growth that activates 30 percent of all connections and connection pruning that prunes 25 percent of existing connections. For Scheme B, we start with 400 hidden neurons in the feed-forward part and iteratively perform a sequence of connection pruning such that 3.3K connections are left in the convolutional part and 16K connections are left in the feed-forward part, and connection growth such that 90 percent of all connections are restored. For Scheme C, we start with a fully connected baseline architecture and iteratively perform a sequence of connection pruning such that 3.3K connections are left in the convolutional part and 6K connections are left in the feed-forward part, and connection growth such that all connections are restored.

Table II summarizes the results. The baseline error rate is 0.72% with 430.5K parameters. The most compressed model generated by SCANN contains only 9.3K parameters (a compression ratio of 46.3 over the baseline), achieving a 0.72% error rate when using Scheme C. Scheme A obtains the best error rate of 0.68%, but with a lower compression ratio of 2.3. For a fair comparison, we implement the method given in [12] on the same data split.

Methods                 Error rate   Weights   Compression ratio
Baseline                0.72%        430.5K    1.0
Network pruning [12]    0.77%        34.5K     12.5
Scheme A                0.68%        184.6K    2.3
Scheme B                0.72%        19.3K     22.3
Scheme C                0.72%        9.3K      46.3
TABLE II: Comparison of different methods on the LeNet-5 Caffe model
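The compression ratios in Table II follow directly from the weight counts: each is the baseline parameter count divided by the parameter count of the method. The short check below recomputes them from the table values (in thousands of weights); the variable names are ours.

```python
# Weight counts from Table II, in thousands.
params_k = {
    "Baseline": 430.5,
    "Network pruning [12]": 34.5,
    "Scheme A": 184.6,
    "Scheme B": 19.3,
    "Scheme C": 9.3,
}

# Compression ratio = baseline weights / method weights, to one decimal.
ratios = {name: round(params_k["Baseline"] / p, 1) for name, p in params_k.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```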

5.2 Experiments with Other Datasets

Though SCANN demonstrates very good compression ratios for LeNets on the medium-size MNIST dataset at similar or better accuracy, one may ask if SCANN can also generate compact neural networks from other medium and small datasets. To answer this question, we experiment with nine datasets from the UCI machine learning repository [56] and Statlog collection [57]. Next, we present evaluation results on these datasets.

SCANN experiments are based on the Adam optimizer with a learning rate of and weight decay of e-. We compare results obtained by DR+SCANN with those obtained by only applying SCANN, and also DR without using SCANN in a secondary compression step. Table III shows the classification accuracy obtained. The MLP column shows the accuracy of the MLP baseline for each dataset. For all the other methods, we present two columns, the left of which shows the highest achieved accuracy (H.A.) whereas the right one shows the result for the most compressed network (M.C.). Furthermore, for the DR columns, the dimensionality reduction method employed is shown in parentheses. Table IV shows the number of parameters in the network for the corresponding columns in Table III.

SCANN-generated networks show improved accuracy for six of the nine datasets, as compared to the MLP baseline. The accuracy increase is between to . These results correspond to networks that are to smaller than the base architecture. Furthermore, DR+SCANN shows improvements on the highest classification accuracy on five out of the nine datasets, as compared to SCANN-generated results.

In addition, SCANN yields ANNs that achieve the baseline accuracy with fewer parameters on seven out of the nine datasets. For these datasets, the results show a connection compression ratio between to . Moreover, as shown in Tables III and IV, combining dimensionality reduction with SCANN helps achieve higher compression ratios. For these seven datasets, DR+SCANN can meet the baseline accuracy with a to smaller network. This shows a significant improvement over the compression ratio achievable by just using SCANN.

We also report the performance of applying DR without the benefit of the SCANN synthesis step. While these results show improvements, DR+SCANN achieves considerably more compression than either DR or SCANN used on its own. This points to a synergy between DR and SCANN.

Columns: Dataset, MLP, DR (H.A.), DR (M.C.), SCANN (H.A.), SCANN (M.C.), DR+SCANN (H.A.), DR+SCANN (M.C.)
Dimensionality reduction method shown in parentheses, by dataset:
SenDrive: FA
Musk: FA
Pendigits: Isomap
SatIm: PCA
Letter: PCA
Seizure: FA
[Numeric accuracy entries not recoverable]
TABLE III: Test accuracy comparison
Columns: Dataset, MLP, DR (H.A.), DR (M.C.), SCANN (H.A.), SCANN (M.C.), DR+SCANN (H.A.), DR+SCANN (M.C.)
Rows: SenDrive, HAR, Musk, Pendigits, SatIm, Letter, Seizure, SHAR, DNA
[Numeric parameter counts and the parenthesized entries not recoverable]
TABLE IV: Neural network parameter comparison

Although the classification performance is of great importance, in applications where computing resources are limited, e.g., in battery-operated devices, energy efficiency might be one of the most important concerns. Thus, energy performance of the algorithms should also be taken into consideration in such cases. To evaluate the energy performance, we use the energy analysis method proposed in [58], where the energy consumption for inference is calculated based on the number of multiply-accumulate (MAC) and comparison operations and the number of SRAM accesses. For example, a multiplication of two matrices of size and would require MAC operations and SRAM accesses. In their model, a single MAC operation, SRAM access, and comparison operation implemented in a 130 nm CMOS process (which may be an appropriate technology for many IoT sensors) consumes 11.8 , 34.6 and 6.16 , respectively. Table V shows the energy consumption estimates per inference for the corresponding models discussed in Tables III and IV. DR+SCANN can be seen to have the best overall energy performance. Except for the Letter dataset (for which the energy reduction is only 17 percent), the compact ANNs generated by DR+SCANN consume one to four orders of magnitude less energy than the baseline MLP models. Thus, this synthesis methodology is suitable for heavily energy-constrained devices, such as IoT sensors.
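To make the operation-counting idea concrete, the sketch below estimates the inference energy of a dense MLP from MAC, SRAM-access, and comparison counts. It is an illustration under stated assumptions, not the model from [58]: the function names are hypothetical, two SRAM accesses per MAC (one read per operand) and one comparison per hidden neuron are assumed, and the per-operation energies in the example are made-up values in arbitrary units.

```python
def matmul_op_counts(m: int, n: int, k: int, sram_per_mac: int = 2):
    """Operation counts for multiplying an (m x n) matrix by an (n x k) matrix.

    A dense product needs m*n*k multiply-accumulates. The number of SRAM
    accesses per MAC (default 2, one read per operand) is an assumption.
    """
    macs = m * n * k
    return macs, sram_per_mac * macs


def dense_mlp_energy(layer_sizes, e_mac, e_sram, e_cmp, sram_per_mac=2):
    """Estimate the energy of one inference of a dense MLP.

    layer_sizes: [inputs, hidden..., outputs].
    e_mac, e_sram, e_cmp: per-operation energies, all in the same unit.
    One comparison per hidden neuron (e.g., a ReLU) is assumed.
    """
    total_macs = total_sram = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        # A single input vector is a (1 x n_in) matrix times (n_in x n_out).
        macs, sram = matmul_op_counts(1, n_in, n_out, sram_per_mac)
        total_macs += macs
        total_sram += sram
    comparisons = sum(layer_sizes[1:-1])
    return total_macs * e_mac + total_sram * e_sram + comparisons * e_cmp


# Hypothetical per-operation energies in arbitrary units, not the values from [58].
print(matmul_op_counts(2, 3, 4))                                       # (24, 48)
print(dense_mlp_energy([4, 3, 2], e_mac=1.0, e_sram=3.0, e_cmp=0.5))   # 127.5
```

For a sparse network, the per-layer MAC count would be the number of active connections rather than the full product of layer widths, which is why removing connections translates directly into lower estimated energy.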

Columns: Dataset, MLP, DR (H.A.), DR (M.C.), SCANN (H.A.), SCANN (M.C.), DR+SCANN (H.A.), DR+SCANN (M.C.)
Rows: SenDrive, HAR, Musk, Pendigits, SatIm, Letter, Seizure, SHAR, DNA
[Numeric energy entries not recoverable]
TABLE V: Inference energy consumption comparison

6 Discussion

The advantages of SCANN are derived from its core benefit: the network architecture is allowed to dynamically evolve during training. This benefit is not directly available in several other existing automatic architecture synthesis techniques, such as the evolutionary and reinforcement-learning-based approaches. In those methods, a new architecture, whether generated through mutation and crossover in the evolutionary approach or from the controller in the reinforcement learning approach, needs to be fixed during training and trained from scratch again when the architecture is changed. However, human learning is incremental. Our brain gradually changes based on the presented stimuli. For example, studies of the human neocortex have shown that up to 40 percent of the synapses are rewired every day [59]. Hence, from this perspective, SCANN takes inspiration from how the human brain evolves incrementally. SCANN’s dynamic rewiring can be easily achieved through connection growth and pruning.
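As an illustration of how rewiring through connection growth and pruning can operate on a single layer, the sketch below keeps a binary connectivity mask next to the weight matrix. It is not the authors' implementation: the function names are hypothetical, and the selection criteria (pruning by weight magnitude, growing by gradient magnitude) are assumed for the sake of the example.

```python
def prune_by_magnitude(weights, mask, threshold):
    """Deactivate active connections whose weight magnitude is below threshold."""
    return [
        [1 if m and abs(w) >= threshold else 0 for w, m in zip(w_row, m_row)]
        for w_row, m_row in zip(weights, mask)
    ]


def grow_by_gradient(mask, grads, k):
    """Activate the k dormant connections with the largest gradient magnitude."""
    dormant = [
        (abs(grads[i][j]), i, j)
        for i, row in enumerate(mask)
        for j, m in enumerate(row)
        if not m
    ]
    dormant.sort(reverse=True)
    new_mask = [row[:] for row in mask]  # copy so the input mask is untouched
    for _, i, j in dormant[:k]:
        new_mask[i][j] = 1
    return new_mask


# Hypothetical 2x2 layer: weights, an all-active mask, and loss gradients.
weights = [[0.5, -0.01], [0.02, -0.8]]
mask = [[1, 1], [1, 1]]
grads = [[0.0, 0.3], [-0.9, 0.0]]

mask = prune_by_magnitude(weights, mask, threshold=0.05)
print(mask)  # [[1, 0], [0, 1]]
mask = grow_by_gradient(mask, grads, k=1)
print(mask)  # [[1, 0], [1, 1]]
```

Because the mask is updated in place of rebuilding the network, training can continue from the current weights after each growth or pruning step instead of restarting from scratch.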

Comparisons between SCANN and DR+SCANN show that the latter results in a smaller network in nearly all cases. This is due to the initial dimensionality reduction step. By mapping data instances into lower dimensions, it reduces the number of neurons in each layer of the neural network without degrading performance, so a significantly smaller neural network is fed to SCANN. As a result, DR+SCANN synthesizes smaller networks than SCANN alone. However, a limitation of SCANN is that it can only evolve feed-forward networks. Extending SCANN to CNNs and recurrent neural networks is the focus of our future work.
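The effect of dimensionality reduction on the size of the starting network can be seen with a short parameter count. This is a rough sketch for a fully connected MLP; the helper name and the layer sizes are hypothetical and are not taken from the experiments.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected MLP."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Hypothetical layer sizes, not taken from the paper's experiments.
original = [100, 50, 10]  # 100 input features
reduced = [20, 10, 10]    # 20 features after DR, with a narrower hidden layer

print(mlp_param_count(original))  # 5560
print(mlp_param_count(reduced))   # 320
```

In this made-up case the network handed to the synthesis step is already more than 17 times smaller before any connection is pruned.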

7 Conclusion

In this paper, we proposed a synthesis methodology that can generate compact and accurate neural networks. It solves the problem of having to fix the depth of the network during training that prior synthesis methods suffer from. It is able to evolve an arbitrary feed-forward network architecture with the help of three basic operations: connection growth, neuron growth, and connection pruning. Experiments on the MNIST dataset show that, without loss in accuracy, SCANN generates a 46.3× smaller network than the LeNet-5 Caffe model. Furthermore, by combining dimensionality reduction with SCANN synthesis, we showed significant improvements in the compression power of this framework. Experiments with several other small to medium datasets show that SCANN and DR+SCANN can provide a good tradeoff between accuracy and energy efficiency in applications where computing resources are limited.