Artificial neural networks (ANNs) have a long history, dating back to the 1950s. However, interest in ANNs has waxed and waned over the years. The recent spurt in interest is due to large datasets becoming available, which enable ANNs to be trained to high accuracy, and to a significant increase in compute power that speeds up the training process. ANNs demonstrate very high classification accuracy for many applications of interest, e.g., image recognition, speech recognition, and machine translation. ANNs have also become deeper, with tens to hundreds of layers; thus, the phrase 'deep learning' is often associated with such neural networks. Deep learning refers to the ability of ANNs to learn hierarchically, with complex features built upon simpler ones.
An important challenge in deploying ANNs in practice is architecture design, since the architecture directly influences the learnt representations and thus the performance. Researchers typically spend a huge amount of time on trial-and-error to find a good architecture because the search space is exponentially large in the number of hyperparameters. As an example, consider a convolutional neural network (CNN), often used in image recognition tasks. Its many hyperparameters, such as depth, number of filters in each layer, kernel size, and how feature maps are connected, need to be determined when designing an architecture. Improvements in such architectures often take several years of effort, as evidenced by the evolution of architectures for the ImageNet dataset: AlexNet, GoogLeNet [6, 7], ResNet [8, 9, 10], and DenseNet.
Another challenge is that, to obtain high accuracy, ANNs need to be designed with a large number of parameters, which negatively impacts both training and inference times. For example, modern deep CNNs often have millions of parameters and take days to train even with powerful graphics processing units (GPUs). Making ANN models compact and energy-efficient may enable them to move from the cloud to the edge, with benefits in communication energy, network bandwidth, and security. The challenge is to do so without degrading accuracy. This is currently a very active area of research [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24].
As the number of features, i.e., dimensions, of a dataset increases, exponentially more data are needed to generalize accurately. This challenge is referred to as the curse of dimensionality. Hence, one way to reduce the need for large amounts of data is to reduce the dimensionality of the dataset. In addition, for a fixed amount of data, reducing the number of features may improve the accuracy of the inference model up to a point. Beyond that dataset-dependent point, however, removing features leads to loss of information, which may yield inferior classification results.
In this paper, we address the aforementioned challenges by introducing an ANN synthesis methodology called SCANN. The ANNs it synthesizes are not necessarily multilayer perceptrons (MLPs). Instead of enforcing a layer-by-layer connection structure, SCANN allows skipped connections, in order to overcome the limitation of conventional ANN training that fixes the network depth in advance. SCANN uses three basic operations: connection growth, neuron growth, and connection pruning. It generates feed-forward architectures of arbitrary depth, and employs three effective training schemes that enable it to generate ANNs that are much smaller than, yet similar in performance to, prior ANNs.
Moreover, in order to alleviate the curse of dimensionality, we use dimensionality reduction methods to reduce the feature size of the datasets, before using the SCANN methodology to generate compact neural networks. This methodology consists of three basic blocks: dimensionality reduction, neural network compression in each layer, followed by another neural network compression step with SCANN. We refer to this methodology as DR+SCANN.
For the LeNet-5 Caffe model [23, 25] derived for the MNIST dataset, SCANN generates a network with only 9.3K parameters (a 46.3× compression ratio relative to the baseline), yet with performance similar to the baseline. To further show the efficacy of SCANN, we experiment with several small- to medium-size datasets. On two of these datasets, SCANN achieves a high compression ratio at the cost of a small loss in accuracy; on the other seven, it generates ANNs that are substantially smaller than the baseline architecture, without any drop in classification accuracy. In addition, by combining dimensionality reduction with SCANN (i.e., DR+SCANN), we demonstrate even higher compression ratios, with little to no drop in model performance on these datasets. These results show that our final synthesis framework (DR+SCANN) has even higher compression power than SCANN alone, and can significantly reduce the number of connections in the network without degrading performance. Moreover, we demonstrate that DR+SCANN yields ANNs that are very energy-efficient, yet offer accuracy similar to other methods. This opens the door for such ANNs to be used in IoT sensors.
The paper is organized as follows. Section 2 describes related work. Section 3 describes the SCANN synthesis methodology in detail. Section 4 explains the methodology that combines dimensionality reduction with SCANN. Section 5 provides results of synthesis on various benchmarks. Section 6 provides a short discussion. Finally, Section 7 concludes the paper.
2 Related Work
In this section, we review some of the previous work in two related areas: dimensionality reduction and automatic architecture synthesis.
2.1 Dimensionality Reduction
The high dimensionality of many datasets used in machine learning leads to the curse of dimensionality problem. Therefore, researchers have explored dimensionality reduction methods that improve the performance of machine learning models by decreasing the number of features. Traditional dimensionality reduction methods include Principal Component Analysis (PCA), Kernel PCA, Factor Analysis (FA), Independent Component Analysis (ICA), and Spectral Embedding methods. Graph-based methods include Isomap and Maximum Variance Unfolding. FeatureNet uses community detection in small-sample-size datasets to map high-dimensional data to lower dimensions. Other dimensionality reduction methods include stochastic proximity embedding (SPE), Linear Discriminant Analysis (LDA), and t-distributed Stochastic Neighbor Embedding (t-SNE). A detailed survey of dimensionality reduction methods can be found in the literature.
2.2 Automatic Architecture Synthesis
Researchers have proposed three categories of automatic architecture synthesis methods: evolutionary algorithms, reinforcement learning, and structure adaptation.
2.2.1 Evolutionary Algorithm
The use of an evolutionary algorithm to select an ANN architecture dates back to 1989. One of the seminal works in neuroevolution is the NEAT algorithm, which uses a direct encoding of every neuron and connection to simultaneously evolve the network architecture and weights through weight mutation, connection mutation, node mutation, and crossover. Recent years have seen extensions of evolutionary algorithms to generate CNNs. For example, Xie and Yuille use a concise binary representation of network connections and demonstrate classification accuracy comparable to previous human-designed architectures.
2.2.2 Reinforcement Learning Algorithm
A recent trend in automatic ANN architecture synthesis is to use reinforcement learning. Zoph and Le use a recurrent neural network as a controller to generate a string that specifies the network architecture. They use the performance of the generated network on a validation dataset as the reward signal to compute the policy gradient and update the controller. In a later work, the authors define a different search space and use the controller to obtain a building block instead of the whole network. They show that the convolutional cells learned on the CIFAR-10 dataset can be successfully transferred to architectures for other datasets, achieving state-of-the-art classification accuracy on ImageNet.
2.2.3 Structure Adaptation Algorithm
Several previous works achieve compact and accurate neural networks through structure adaptation algorithms. One such method is network pruning, which has been used in several works [37, 38, 39, 40, 12, 41, 42]. Structure adaptation algorithms can be constructive or destructive. Constructive algorithms start from a small neural network and grow it into a larger, more accurate one. Destructive algorithms start from a large neural network and prune connections and neurons to remove redundancy while maintaining accuracy. NeST is a network synthesis tool that combines the constructive and destructive approaches in a grow-and-prune synthesis paradigm. It has been used to synthesize compact and accurate architectures for the MNIST and ImageNet datasets. However, its limitation is that growth and pruning are both performed at a specific ANN layer. Thus, network depth cannot be adjusted and is fixed throughout training. In the next section, we show that this problem can be solved by synthesizing a general feed-forward network instead of an MLP architecture, allowing the ANN depth to change dynamically during training.
Several works have also proposed more efficient building blocks for CNN architectures [43, 44, 20, 45, 18, 24, 21]. These result in compact networks with far fewer parameters, while maintaining or improving performance. Platform-aware search for an optimized neural network architecture has also been used in this area. Yin et al. combine the grow-and-prune synthesis methodology with hardware-guided training to achieve compact long short-term memory (LSTM) cells. Another line of work trains an ANN to satisfy pre-defined resource constraints, such as latency and energy consumption, with the help of a pre-generated accuracy predictor.
3 SCANN Synthesis Methodology
In this section, we first propose a technique to address the limitation of prior work that requires the ANN depth to be fixed. Then we introduce three basic architecture-changing techniques that enable the synthesis of an optimized feed-forward network architecture. Finally, we describe three training schemes that can be used to synthesize the network architecture.
3.1 Depth Change
To address the problem of having to fix the ANN depth during training in prior work, we adopt a general feed-forward architecture instead of an MLP structure. Specifically, a hidden neuron can receive inputs from any neuron activated before it (including input neurons), and can feed its output to any neuron activated after it (including output neurons). In this setting, depth is determined by how hidden neurons are connected and thus can be changed through rewiring of hidden neurons. As shown in Fig. 1, depending on how the hidden neurons are connected, they can form one, two, or three hidden layers.
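This general feed-forward connectivity can be illustrated with a single masked weight matrix over topologically ordered neurons. The sketch below is our own minimal illustration, not the authors' implementation: neuron order, the mask representation, and the choice of `tanh` are assumptions, and output neurons are treated like hidden neurons for simplicity.

```python
import numpy as np

def forward(x, W, mask, f=np.tanh):
    """Forward pass through a general feed-forward network.

    Neurons are topologically ordered: input neurons first, then hidden
    and output neurons.  W[i, j] is the weight of the connection from
    neuron i to neuron j (i < j), and mask[i, j] = 1 iff that connection
    exists.  Depth is implicit in the connectivity pattern, so rewiring
    hidden neurons changes the effective depth without changing the code.
    """
    n_total = W.shape[0]
    n_in = x.shape[0]
    act = np.zeros(n_total)
    act[:n_in] = x                      # input neurons hold the raw features
    for j in range(n_in, n_total):      # activate remaining neurons in order
        pre = (W[:j, j] * mask[:j, j]) @ act[:j]   # preactivity u_j
        act[j] = f(pre)                 # activity x_j = f(u_j)
    return act
```

Because every neuron may read from any earlier neuron (including the inputs), the same routine covers one-, two-, or three-hidden-layer wirings such as those in Fig. 1.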
3.2 Overall Workflow
The overall workflow for architecture synthesis is shown in Algorithm 1. The synthesis process iteratively alternates between architecture change and weight training; thus, the network architecture evolves along the way. After a specified number of iterations, the checkpoint that achieves the best performance on the validation set is output as the final network. Next, we first elaborate on the three basic architecture-changing operations, and then introduce three different training schemes based on how the architectures evolve.
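Algorithm 1 itself is not reproduced here; the loop below is a paraphrased skeleton of it under our own naming (`change_architecture`, `train_step`, `evaluate` are hypothetical callbacks, not SCANN's API), showing the alternation between architecture change, weight training, and best-checkpoint tracking.

```python
import numpy as np

def synthesize(train_step, evaluate, change_architecture, state, iters=10):
    """Skeleton of the SCANN-style synthesis loop: alternate architecture
    changes with weight training, checkpoint the architecture with the
    best validation score, and return that checkpoint."""
    best_score, best_state = -np.inf, state
    for _ in range(iters):
        state = change_architecture(state)   # growth and/or pruning step
        state = train_step(state)            # weight training step
        score = evaluate(state)              # validation performance
        if score > best_score:
            best_score, best_state = score, dict(state)  # checkpoint
    return best_state
```

The returned checkpoint, rather than the final iterate, is the synthesized network, which is what makes the alternation safe even when a late pruning step hurts accuracy.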
3.3 Basic Architecture-changing Operations
Three basic operations, namely connection growth, neuron growth, and connection pruning, are used to adjust the network architecture; a feed-forward network can be evolved just through these operations. Fig. 2 shows a simple example in which an MLP architecture with one hidden layer evolves into a non-MLP architecture with two hidden layers through a sequence of these basic operations.
Next, we describe these three operations. We denote the $i$th hidden neuron as $n_i$, its activity as $x_i$, and its preactivity as $u_i$, where $x_i = f(u_i)$ and $f$ is the activation function. We denote the depth of $n_i$ by $d_i$ and the loss function by $L$. Finally, we denote the connection between $n_i$ and $n_j$, where $d_i < d_j$, as $w_{ij}$. In our implementation, we use masks to mask out the pruned weights.
3.3.1 Connection Growth
Connection growth adds connections between neurons that are unconnected. The initial weights of all newly added connections are set to 0. Depending on how connections can be added, we use three different methods, as shown in Algorithm 2.
Gradient-based growth: Gradient-based growth was proposed by Dai et al. It adds connections that tend to reduce the loss function significantly. Suppose two neurons $n_i$ and $n_j$ are not connected and $d_i < d_j$; then gradient-based growth adds a new connection $w_{ij}$ if the gradient magnitude $|\partial L / \partial w_{ij}|$ is large.
Full growth: Full growth restores all possible connections to the network.
Random growth: Random growth randomly picks some inactive connections and adds them to the network.
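Gradient-based growth can be sketched as a mask update. This is our own minimal sketch, not Algorithm 2 itself: it assumes the masked-weight representation described above, takes a precomputed gradient matrix, and activates the fraction `frac` of inactive connections with the largest gradient magnitude.

```python
import numpy as np

def gradient_based_growth(mask, grad, frac=0.1):
    """Activate the inactive connections whose loss-gradient magnitude
    |dL/dw_ij| is largest.  Newly added connections keep weight 0 until
    the next training phase (the mask, not the weight, changes here)."""
    inactive = (mask == 0)
    scores = np.abs(grad) * inactive       # only unconnected pairs compete
    k = int(frac * inactive.sum())         # number of connections to add
    if k == 0:
        return mask
    # flat indices of the k largest gradient magnitudes among inactive entries
    flat = np.argsort(scores, axis=None)[-k:]
    new_mask = mask.copy()
    new_mask[np.unravel_index(flat, mask.shape)] = 1
    return new_mask
```

Full growth corresponds to `new_mask[:] = 1`, and random growth replaces the gradient-based ranking with a random choice over the inactive entries.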
3.3.2 Neuron Growth
Neuron growth adds new neurons to the network, thus increasing network size over time. There are two possible methods for doing this, as shown in Algorithm 3. First, drawing an analogy from biological cell division, neuron growth can be achieved by duplicating an existing neuron. To break the symmetry, random noise is added to the weights of all the connections related to this newly added neuron. The specific neuron that is duplicated can be selected in two ways:
Activation-based selection: Activation-based selection selects neurons with a large activation for duplication.
Random selection: Random selection randomly selects neurons for duplication.
Second, instead of duplicating existing neurons, new neurons with random initial weights and random initial connections with other neurons may be added to the network.
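Duplication-based neuron growth can be sketched as copying one row and one column of the masked weight matrix and perturbing the copies. This is an illustrative sketch under the matrix representation used earlier, not Algorithm 3; the noise scale and function names are our assumptions.

```python
import numpy as np

def duplicate_neuron(W, mask, activations, rng, noise=0.01, random_pick=False):
    """Grow a neuron by duplicating an existing one: copy its incoming and
    outgoing weights/connections, then add small random noise to the copies
    to break symmetry.  Activation-based selection duplicates the neuron
    with the largest activation magnitude; random selection picks uniformly."""
    if random_pick:
        i = rng.integers(len(activations))          # random selection
    else:
        i = int(np.argmax(np.abs(activations)))     # activation-based selection
    n = W.shape[0]
    W2 = np.zeros((n + 1, n + 1))
    M2 = np.zeros((n + 1, n + 1))
    W2[:n, :n], M2[:n, :n] = W, mask
    W2[n, :n] = W[i, :] + noise * rng.standard_normal(n)  # outgoing copy
    W2[:n, n] = W[:, i] + noise * rng.standard_normal(n)  # incoming copy
    M2[n, :n], M2[:n, n] = mask[i, :], mask[:, i]         # same connectivity
    return W2, M2
```

The second growth variant described above, adding a fresh neuron, would instead initialize the new row and column with random weights and a random connectivity pattern.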
3.3.3 Connection Pruning
Connection pruning disconnects previously connected neurons and reduces the number of network parameters. If all connections associated with a neuron are pruned, then the neuron is removed from the network. We adopt a widely-used method [12, 13, 37] to prune connections with small magnitude, as shown in Algorithm 4. The rationale behind it is that since small weights have a relatively small influence on the network, ANN performance can be restored through retraining after pruning.
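Magnitude-based pruning is a one-liner over the same masked-weight representation. The sketch below is ours, not Algorithm 4; the pruning fraction `frac` is an illustrative parameter.

```python
import numpy as np

def magnitude_prune(W, mask, frac=0.2):
    """Deactivate the fraction `frac` of active connections with the
    smallest weight magnitude.  A neuron all of whose connections are
    pruned effectively drops out of the network."""
    active = np.flatnonzero(mask)          # flat indices of live connections
    k = int(frac * active.size)            # number of connections to prune
    if k == 0:
        return mask
    mags = np.abs(W).ravel()[active]
    drop = active[np.argsort(mags)[:k]]    # smallest |w| among active entries
    new_mask = mask.copy().ravel()
    new_mask[drop] = 0
    return new_mask.reshape(mask.shape)
```

As the text notes, the small weights removed here have little influence on the output, so accuracy is typically recovered by retraining the surviving weights after each pruning step.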
3.4 Training Schemes
In practice, depending on how the initial network architecture and basic operations in Step (a) of Algorithm 1 are chosen, we adopt three training schemes in our experiments, as explained next.
3.4.1 Scheme A
Scheme A is a constructive approach, where we start with a tiny network and gradually increase the network size. This can be achieved by performing connection and neuron growth more often than connection pruning, or by carefully selecting the growth and pruning rates such that each growth operation adds a large number of connections and neurons, while each pruning operation removes only a small number of connections.
3.4.2 Scheme B
Scheme B is a destructive approach, where we start with an over-parameterized network and end up with a small network. There are two possible ways to accomplish this. First, similar to the technique in [12, 37], we can iteratively prune a small number of network connections and then train the weights. This gradually reduces network size and finally results in a small network after many iterations. Another approach is that, instead of pruning the network gradually, we can aggressively prune the network to a tiny size. However, to make this approach work, we need to repeatedly prune the network and then grow the network back, rather than just perform a one-time pruning. In practice, we find the second approach works better.
3.4.3 Scheme C
Scheme B also works with MLP architectures, with only a small adjustment to connection growth: only connections between adjacent layers are added, and no skipped connections. For clarity, we give MLP-based Scheme B another name: Scheme C. Note that Scheme C is similar to the iterative hard thresholding technique. Scheme C can also be viewed as an iterative version of the dense-sparse-dense technique, with the aim of generating compact networks rather than improving the performance of the original architecture.
Fig. 3 shows examples of the initial and final architectures for each scheme. Both Schemes A and B evolve general feed-forward architectures, thus allowing network depth to be changed during training. Scheme C evolves an MLP structure, thus keeping the depth fixed.
4 Dimensionality Reduction + SCANN
In this section, we propose a methodology to synthesize compact neural networks by combining dimensionality reduction (DR) and SCANN, which we refer to as DR+SCANN. Fig. 4 shows the block diagram of the methodology. It begins by obtaining a very accurate baseline architecture by progressively increasing the number of hidden layers. Its other main parts are dataset dimensionality reduction and two neural network compression steps that are discussed next.
4.1 Dataset Modification
Dataset modification entails normalizing the dataset and reducing its dimensionality. All feature values are normalized to a common range. Reducing the number of features in the dataset aims to alleviate the curse of dimensionality and increase data classifiability. An $n$-dimensional dataset is thus mapped onto a $k$-dimensional space, $k \leq n$, using various dimensionality reduction methods. We explore 11 such methods, including four random projection (RP) methods.
RP reduces data dimensionality based on the Johnson-Lindenstrauss lemma [50, 51]. The essence of this lemma is that points in a space of sufficiently high dimension can be projected onto a suitable lower dimension while approximately maintaining inter-point distances. More precisely, the lemma shows that pairwise distances change only by a factor of $(1 \pm \epsilon)$ when $N$ points are randomly projected onto a subspace of $O(\log N / \epsilon^2)$ dimensions, for any $0 < \epsilon < 1$.
The RP matrix can be generated in several ways. Here, we discuss the four RP matrices we used. One approach is to generate the matrix using a Gaussian distribution, in which case the entries are i.i.d. samples drawn from a zero-mean Gaussian. Another RP matrix can be obtained by sampling entries from a symmetric two-point distribution over $\{+1, -1\}$, with appropriate scaling.
Achlioptas proposed several other sparse RP matrices, in which the entries take values in $\{+1, 0, -1\}$ with prescribed probabilities; we use two of these proposals.
The other dimensionality reduction methods that we used include PCA, Polynomial Kernel PCA, Gaussian Kernel PCA, FA, Isomap, ICA, and Spectral Embedding. Implementations of these methods are obtained from the Scikit-learn machine learning library .
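As an illustration, two standard RP constructions can be written in a few lines of NumPy. This is a sketch of the textbook Gaussian and Achlioptas-style sparse projections (entries $\mathcal{N}(0, 1/k)$, and $\sqrt{3/k}\,\{+1$ w.p. $1/6$, $0$ w.p. $2/3$, $-1$ w.p. $1/6\}$, respectively); the exact matrices used in the paper are not fully reproduced here.

```python
import numpy as np

def gaussian_rp(X, k, rng):
    """Project the n-dimensional rows of X to k dimensions with a Gaussian
    random matrix whose entries are i.i.d. N(0, 1/k); by the
    Johnson-Lindenstrauss lemma this approximately preserves distances."""
    n = X.shape[1]
    Phi = rng.standard_normal((n, k)) / np.sqrt(k)
    return X @ Phi

def sparse_rp(X, k, rng):
    """Achlioptas-style sparse projection: entries are
    sqrt(3/k) * {+1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6},
    so two-thirds of the matrix is zero and no Gaussian sampling is needed."""
    n = X.shape[1]
    Phi = np.sqrt(3.0 / k) * rng.choice([1.0, 0.0, -1.0],
                                        size=(n, k), p=[1/6, 2/3, 1/6])
    return X @ Phi
```

Equivalent ready-made implementations are available as `GaussianRandomProjection` and `SparseRandomProjection` in scikit-learn's `sklearn.random_projection` module.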
4.2 Neural Network Compression in Each Layer
Dimensionality reduction maps the dataset into a vector space of lower dimension. As the number of features is reduced, the number of neurons in the input layer of the neural network decreases accordingly. Moreover, since the dataset dimension is reduced, one might expect the classification task to become easier. This means we can reduce the number of neurons in all layers, not just the input layer. This step therefore reduces the number of neurons in each layer of the neural network, except the output layer, by the feature compression ratio of the dimensionality reduction step (see Fig. 4). Fig. 5 shows an example of this per-layer compression process. We refer to this dimensionality reduction stage as DR.
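The per-layer shrinking rule is simple enough to state as code. This helper is a hypothetical illustration of the rule described above (names and rounding policy are ours): every hidden layer is scaled by the same feature compression ratio, while the output layer keeps its size.

```python
def compress_layers(layer_sizes, n_features_before, n_features_after):
    """Shrink every hidden layer of an MLP by the feature compression
    ratio of the dimensionality reduction step.  layer_sizes lists the
    widths [input, hidden..., output]; the output layer is unchanged."""
    ratio = n_features_after / n_features_before
    new_sizes = [n_features_after]                 # input layer = new feature count
    for h in layer_sizes[1:-1]:                    # hidden layers scale by ratio
        new_sizes.append(max(1, round(h * ratio)))
    new_sizes.append(layer_sizes[-1])              # output layer (classes) unchanged
    return new_sizes
```

For example, halving the feature count halves every hidden layer, which already shrinks the parameter count roughly fourfold before SCANN is applied.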
4.3 Neural Network Compression With SCANN
We input several neural network architectures obtained from the output of the first neural network compression step to SCANN. These architectures correspond to the best three classification accuracies, as well as the three most compressed networks that meet the baseline accuracy of the initial MLP architecture, as evaluated on the validation set.
SCANN uses the corresponding reduced-dimension dataset. In Scheme A, we need to set the maximum number of connections in the network. We set this value to the number of connections in the neural network that results from the first compression step; this way, the final neural network will be smaller. Schemes B and C require the maximum numbers of neurons and connections to be initialized. In these two training schemes, the final number of connections in the network also needs to be set. Furthermore, the number of layers in the MLP architecture synthesized by Scheme C needs to be predetermined. We initialize these parameters using the network architecture output by the first neural network compression step.
5 Experimental Results
In this section, we evaluate the performance of SCANN and DR+SCANN on several small- to medium-size datasets. Table I shows the characteristics of these datasets.
The evaluation results are divided into two parts. Section 5.1 discusses results obtained by SCANN when applied to the widely used MNIST dataset. Compared to related work, SCANN generates neural networks with better classification accuracy and fewer parameters. In Section 5.2, we show results of experiments on nine other datasets. We demonstrate that the ANNs generated by SCANN are very compact and energy-efficient, while maintaining performance. These results open up opportunities to use SCANN-generated ANNs in energy-constrained edge devices and IoT sensors.
[Table I: characteristics (training, validation, and test set sizes, number of features, and number of classes) of the benchmark datasets, including Sensorless Drive Diagnosis, Human Activity Recognition (HAR), Pen-Based Recognition of Handwritten Digits, Landsat Satellite Image, Epileptic Seizure Recognition, and Smartphone Human Activity Recognition.]
5.1 Experiments with MNIST
MNIST is a well-studied dataset of handwritten digits. It contains 60000 training images and 10000 test images. We set aside 10000 images from the training set as the validation set. We adopt the LeNet-5 Caffe model that is widely used in related works [12, 54, 55]. For Schemes A and B, the feed-forward part of the network is learnt by SCANN, whereas the convolutional part is kept the same as in the baseline (Scheme A makes no changes to the baseline convolutional part, but Scheme B prunes its connections). For Scheme C, SCANN starts with the baseline architecture and only learns the connections and weights, without changing the depth of the network. All experiments use the stochastic gradient descent (SGD) optimizer with a learning rate of 0.03, momentum of 0.9, and weight decay of 1e-4. No other regularization technique, such as dropout or batch normalization, is used. We run each experiment five times and report the average performance.
The LeNet-5 Caffe model contains two convolutional layers with 20 and 50 filters, and also one fully-connected hidden layer with 500 neurons. For Scheme A, we start with 400 hidden neurons in the feed-forward part, randomly prune out 95 percent of the connections in the beginning and then iteratively perform a sequence of connection growth that activates 30 percent of all connections and connection pruning that prunes 25 percent of existing connections. For Scheme B, we start with 400 hidden neurons in the feed-forward part and iteratively perform a sequence of connection pruning such that 3.3K connections are left in the convolutional part and 16K connections are left in the feed-forward part, and connection growth such that 90 percent of all connections are restored. For Scheme C, we start with a fully connected baseline architecture and iteratively perform a sequence of connection pruning such that 3.3K connections are left in the convolutional part and 6K connections are left in the feed-forward part, and connection growth such that all connections are restored.
Table II summarizes the results. The baseline error rate is 0.72% with 430.5K parameters. The most compressed model generated by SCANN contains only 9.3K parameters (a 46.3× compression ratio over the baseline), achieving a 0.72% error rate when using Scheme C. Scheme A obtains the best error rate of 0.68%, however with a lower compression ratio of 2.3×. For a fair comparison, we implement the network pruning method on the same data split.
|Methods|Error rate|Weights|Compression ratio|
|---|---|---|---|
|Network pruning|0.77%|34.5K|12.5×|
5.2 Experiments with Other Datasets
Though SCANN demonstrates very good compression ratios for LeNets on the medium-size MNIST dataset at similar or better accuracy, one may ask whether SCANN can also generate compact neural networks for other medium-size and small datasets. To answer this question, we experiment with nine datasets from the UCI machine learning repository and the Statlog collection. Next, we present evaluation results on these datasets.
SCANN experiments are based on the Adam optimizer with a small learning rate and weight decay. We compare results obtained by DR+SCANN with those obtained by applying SCANN alone, and by applying DR without a SCANN compression step. Table III shows the classification accuracy obtained. The MLP column shows the accuracy of the MLP baseline for each dataset. For all other methods, we present two columns: the left one shows the highest achieved accuracy (H.A.), whereas the right one shows the result for the most compressed network (M.C.). Furthermore, for the DR columns, the dimensionality reduction method employed is shown in parentheses. Table IV shows the number of parameters in the network for the corresponding columns of Table III.
SCANN-generated networks show improved accuracy on six of the nine datasets, relative to the MLP baseline, with networks that are simultaneously smaller than the base architecture. Furthermore, DR+SCANN improves the highest classification accuracy on five of the nine datasets, relative to the SCANN-generated results.
In addition, SCANN yields ANNs that achieve the baseline accuracy with fewer parameters on seven of the nine datasets. Moreover, as shown in Tables III and IV, combining dimensionality reduction with SCANN helps achieve higher compression ratios: on these seven datasets, DR+SCANN meets the baseline accuracy with a substantially smaller network. This shows a significant improvement over the compression ratio achievable by just using SCANN.
We also report the performance of applying DR without the benefit of the SCANN synthesis step. While these results show improvements, DR+SCANN can be seen to have much more compression power, relative to when DR and SCANN are used separately. This points to a synergy between DR and SCANN.
[Tables III and IV: per-dataset classification accuracy and parameter counts, respectively, for the MLP baseline, DR, SCANN, and DR+SCANN (highest-accuracy, H.A., and most-compressed, M.C., variants) on the SenDrive, HAR, Musk, Pendigits, SatIm, Letter, Seizure, SHAR, and DNA datasets.]
Although classification performance is of great importance, in applications where computing resources are limited, e.g., battery-operated devices, energy efficiency may be one of the most important concerns. Thus, the energy consumption of the models should also be taken into account in such cases. To evaluate energy performance, we use an energy analysis method in which the inference energy is calculated from the number of multiply-accumulate (MAC) operations, comparison operations, and SRAM accesses. For example, multiplying an $m \times n$ matrix by an $n \times p$ matrix requires $mnp$ MAC operations, each with its associated SRAM accesses. In this model, a single MAC operation, SRAM access, and comparison operation implemented in a 130-nm CMOS process (an appropriate technology for many IoT sensors) consumes 11.8, 34.6, and 6.16 units of energy, respectively. Table V shows the energy consumption estimates per inference for the corresponding models discussed in Tables III and IV. DR+SCANN can be seen to have the best overall energy performance. Except for the Letter dataset (for which the energy reduction is only 17 percent), the compact ANNs generated by DR+SCANN consume one to four orders of magnitude less energy than the baseline MLP models. Thus, this synthesis methodology is suitable for heavily energy-constrained devices, such as IoT sensors.
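A counting model of this kind can be sketched in a few lines. This is our own simplified version for a dense MLP, not the cited model: the three-accesses-per-MAC SRAM assumption, the one-comparison-per-hidden-neuron assumption (e.g., for ReLU), and the per-operation energy values passed in are all placeholders, not the paper's numbers.

```python
def inference_energy(layer_sizes, e_mac, e_sram, e_cmp=0.0):
    """Rough per-inference energy of a dense MLP under a counting model:
    multiplying consecutive layers of widths a and b costs a*b MACs.
    Assumptions (ours): 3 SRAM accesses per MAC (weight read, partial-sum
    read/write) and one comparison per non-input neuron (e.g., ReLU)."""
    macs = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    sram = macs * 3                    # assumed memory traffic per MAC
    cmps = sum(layer_sizes[1:])        # assumed one comparison per activation
    return macs * e_mac + sram * e_sram + cmps * e_cmp
```

Because energy scales with the MAC count, and the MAC count scales with the number of connections, the connection compression achieved by DR+SCANN translates almost directly into the energy reductions reported in Table V.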
6 Discussion
The advantages of SCANN derive from its core benefit: the network architecture is allowed to evolve dynamically during training. This benefit is not directly available in several other automatic architecture synthesis techniques, such as the evolutionary and reinforcement learning based approaches. In those methods, a new architecture, whether generated through mutation and crossover in the evolutionary approach or by the controller in the reinforcement learning approach, is fixed during training and must be trained from scratch whenever the architecture changes. Human learning, however, is incremental: the brain gradually changes based on the presented stimuli. For example, studies of the human neocortex have shown that up to 40 percent of the synapses are rewired every day. From this perspective, SCANN takes inspiration from how the human brain evolves incrementally; its dynamic rewiring is easily achieved through connection growth and pruning.
Comparisons between SCANN and DR+SCANN show that the latter results in a smaller network in nearly all the cases. This is due to the initial step of dimensionality reduction. By mapping data instances into lower dimensions, it reduces the number of neurons in each layer of the neural network, without degrading performance. This helps feed a significantly smaller neural network to SCANN. As a result, DR+SCANN synthesizes smaller networks relative to when only SCANN is used. However, a limitation of SCANN is that it can only evolve feed-forward networks. How to extend SCANN to CNNs and recurrent neural networks is the focus of our future work.
7 Conclusion
In this paper, we proposed a synthesis methodology that can generate compact and accurate neural networks. It solves the problem, suffered by prior synthesis methods, of having to fix the depth of the network during training. It evolves an arbitrary feed-forward network architecture with the help of three basic operations: connection growth, neuron growth, and connection pruning. Experiments on the MNIST dataset show that, without loss in accuracy, SCANN generates a network 46.3× smaller than the LeNet-5 Caffe model. Furthermore, by combining dimensionality reduction with SCANN synthesis, we showed significant improvements in the compression power of this framework. Experiments on several other small- to medium-size datasets show that SCANN and DR+SCANN can provide a good tradeoff between accuracy and energy efficiency in applications where computing resources are limited.
-  F. Rosenblatt, The Perceptron, a Perceiving and Recognizing Automaton Project Para. Cornell Aeronautical Laboratory, 1957.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
-  G. E. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
-  S. Jean, K. Cho, R. Memisevic, and Y. Bengio, “On using very large target vocabulary for neural machine translation,” arXiv preprint arXiv:1412.2007, 2014.
-  Y. LeCun, Y. Bengio, and G. E. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2015, pp. 1–9.
-  C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, Inception-ResNet and the impact of residual connections on learning,” arXiv preprint arXiv:1602.07261, 2016.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 770–778.
-  S. Zagoruyko and N. Komodakis, “Wide residual networks,” arXiv preprint arXiv:1605.07146, 2016.
-  S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2017, pp. 5987–5995.
-  G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2017, pp. 2261–2269.
-  S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Proc. Advances in Neural Information Processing Systems, 2015, pp. 1135–1143.
-  S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
-  R. Ding, Z. Liu, R. Shi, D. Marculescu, and R. D. Blanton, “LightNN: Filling the gap between conventional deep neural networks and binarized networks,” in Proc. Great Lakes Symposium on VLSI, 2017, pp. 35–40.
-  D. Stamoulis, E. Cai, D.-C. Juan, and D. Marculescu, “HyperPower: Power-and memory-constrained hyper-parameter optimization for neural networks,” arXiv preprint arXiv:1712.02446, 2017.
-  E. Cai, D.-C. Juan, D. Stamoulis, and D. Marculescu, “NeuralPower: Predict and deploy energy-efficient convolutional neural networks,” arXiv preprint arXiv:1710.05420, 2017.
-  I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks,” in Proc. Advances in Neural Information Processing Systems, 2016, pp. 4107–4115.
-  N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “ShuffleNet v2: Practical guidelines for efficient CNN architecture design,” arXiv preprint arXiv:1807.11164, vol. 1, 2018.
-  X. Dai, P. Zhang, B. Wu, H. Yin, F. Sun, Y. Wang, M. Dukhan, Y. Hu, Y. Wu, Y. Jia, P. Vajda, M. Uyttendaele, and N. K. Jha, “ChamNet: Towards efficient network design through platform-aware model adaptation,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2018.
-  A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
-  M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNet v2: Inverted residuals and linear bottlenecks,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
-  B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2018, pp. 8697–8710.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
-  M. Tan, B. Chen, R. Pang, V. Vasudevan, and Q. V. Le, “MnasNet: Platform-aware neural architecture search for mobile,” arXiv preprint arXiv:1807.11626, 2018.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proc. ACM Int. Conf. Multimedia, 2014, pp. 675–678.
-  J. B. Tenenbaum, V. De Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, no. 5500, pp. 2319–2323, 2000.
-  K. Q. Weinberger and L. K. Saul, “An introduction to nonlinear dimensionality reduction by maximum variance unfolding,” in Proc. AAAI, vol. 6, 2006, pp. 1683–1686.
-  K. Bhardwaj and R. Marculescu, “Dimensionality reduction via community detection in small sample datasets,” in Proc. Pacific-Asia Conf. Knowledge Discovery and Data Mining. Springer, 2018, pp. 102–114.
-  D. K. Agrafiotis, “Stochastic proximity embedding,” J. Computational Chemistry, vol. 24, no. 10, pp. 1215–1221, 2003.
-  L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” J. Machine Learning Research, vol. 9, pp. 2579–2605, Nov 2008.
-  L. van der Maaten, E. Postma, and J. Van den Herik, “Dimensionality reduction: A comparative review,” J. Machine Learning Research, vol. 10, pp. 66–71, 2009.
-  G. F. Miller, P. M. Todd, and S. U. Hegde, “Designing neural networks using genetic algorithms,” in Proc. Int. Conf. Genetic Algorithms, vol. 89, 1989, pp. 379–384.
-  K. O. Stanley and R. Miikkulainen, “Evolving neural networks through augmenting topologies,” Evolutionary Computation, vol. 10, no. 2, pp. 99–127, 2002.
-  L. Xie and A. Yuille, “Genetic CNN,” arXiv preprint arXiv:1703.01513, 2017.
-  B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” arXiv preprint arXiv:1611.01578, 2016.
-  B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition,” arXiv preprint arXiv:1707.07012, 2017.
-  X. Dai, H. Yin, and N. K. Jha, “NeST: A neural network synthesis tool based on a grow-and-prune paradigm,” arXiv preprint arXiv:1711.02017, 2017.
-  ——, “Grow and prune compact, fast, and accurate LSTMs,” arXiv preprint arXiv:1805.11797, 2018.
-  T. Zhang, K. Zhang, S. Ye, J. Li, J. Tang, W. Wen, X. Lin, M. Fardad, and Y. Wang, “Adam-ADMM: a unified, systematic framework of structured weight pruning for DNNs,” arXiv preprint arXiv:1807.11091, 2018.
-  T.-J. Yang, Y.-H. Chen, and V. Sze, “Designing energy-efficient convolutional neural networks using energy-aware pruning,” arXiv preprint arXiv:1611.05128, 2016.
-  W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Proc. Advances in Neural Information Processing Systems, 2016, pp. 2074–2082.
-  T.-J. Yang, A. Howard, B. Chen, X. Zhang, A. Go, M. Sandler, V. Sze, and H. Adam, “NetAdapt: platform-aware neural network adaptation for mobile applications,” Energy, vol. 41, p. 46, 2018.
-  F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and 0.5 MB model size,” arXiv preprint arXiv:1602.07360, 2016.
-  B. Wu, A. Wan, X. Yue, P. Jin, S. Zhao, N. Golmant, A. Gholaminejad, J. Gonzalez, and K. Keutzer, “Shift: A zero flop, zero parameter alternative to spatial convolutions,” arXiv preprint arXiv:1711.08141, 2017.
-  X. Zhang, X. Zhou, M. Lin, and J. Sun, “ShuffleNet: An extremely efficient convolutional neural network for mobile devices,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2018, pp. 6848–6856.
-  H. Yin, G. Chen, Y. Li, S. Che, W. Zhang, and N. K. Jha, “Hardware-guided symbiotic training for compact, accurate, yet execution-efficient LSTM,” arXiv preprint arXiv:1901.10997, 2019.
-  C. Zhu, S. Han, H. Mao, and W. J. Dally, “Trained ternary quantization,” arXiv preprint arXiv:1612.01064, 2016.
-  X. Jin, X. Yuan, J. Feng, and S. Yan, “Training skinny deep neural networks with iterative hard thresholding methods,” arXiv preprint arXiv:1607.05423, 2016.
-  S. Han, J. Pool, S. Narang, H. Mao, S. Tang, E. Elsen, B. Catanzaro, J. Tran, and W. J. Dally, “DSD: Regularizing deep neural networks with dense-sparse-dense training flow,” arXiv preprint arXiv:1607.04381, 2016.
-  D. Sivakumar, “Algorithmic derandomization via complexity theory,” in Proc. ACM Symp. Theory of Computing, 2002, pp. 619–626.
-  S. Dasgupta and A. Gupta, “An elementary proof of a theorem of Johnson and Lindenstrauss,” Random Structures & Algorithms, vol. 22, no. 1, pp. 60–65, 2003.
-  D. Achlioptas, “Database-friendly random projections,” in Proc. ACM SIGMOD-SIGACT-SIGART Symp. Principles of Database Systems, 2001, pp. 274–281.
-  F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
-  X. Dong, S. Chen, and S. Pan, “Learning to prune deep neural networks via layer-wise optimal brain surgeon,” in Proc. Advances in Neural Information Processing Systems, 2017, pp. 4860–4874.
-  K. Ullrich, E. Meeds, and M. Welling, “Soft weight-sharing for neural network compression,” arXiv preprint arXiv:1702.04008, 2017.
-  D. Dheeru and E. Karra Taniskidou, “UCI Machine Learning Repository,” 2017. [Online]. Available: http://archive.ics.uci.edu/ml
-  D. Michie, D. Spiegelhalter, C. Taylor, and J. Campbell, Machine Learning, Neural and Statistical Classification. Upper Saddle River, NJ, USA: Ellis Horwood, 1994.
-  A. O. Akmandor, H. Yin, and N. K. Jha, “Simultaneously ensuring smartness, security, and energy efficiency in Internet-of-Things sensors,” in Proc. IEEE Custom Integrated Circuits Conference, 2018, pp. 1–8.
-  J. Hawkins, “What intelligent machines need to learn from the neocortex,” IEEE Spectrum, vol. 54, no. 6, pp. 34–71, 2017.