I. Motivation
For modern edge computing systems, the task of implementing deep learning applications with high efficiency has become critical. For example, automatic speech recognition (ASR) [16] is an area where significant industrial effort is being focused to move the problem on-device and avoid an always-on connection to cloud computing resources. The typical approach is first to select a deep neural network (DNN) architecture and train it to acceptable accuracy on some task. Techniques such as weight pruning [33] and model quantization [30] are then applied, so that inference has acceptable latency and memory consumption on the target device. This general workflow is applied in the vast majority of research, for example, in the recent on-device ASR work at Google [15].
However, certain features of neural architecture, especially in convolutional neural networks (CNNs), present hard roadblocks for performance. For example, the sizes and shapes chosen for convolution kernels determine whether or not fast and memory-efficient convolution algorithms can be applied at inference time [3].

In this paper, we argue for a more flexible and holistic approach to the optimization of deep learning applications. We propose to adapt techniques from hardware-software co-design to the domain of deep learning, by performing neural architecture search with both classification accuracy and inference performance as primary objectives.
Extending neural architecture search with information about the hardware on which the inference will be deployed allows high-level decisions to be made concerning the structure of the neural network which have profound implications for performance. Without this information, we are searching “in the dark” with regard to inference performance, and it is unlikely that the search will find high-efficiency neural architectures, except by chance.
Contribution
In this paper, we extend neural architecture search with an inference performance cost model to allow performance concerns to be directly integrated in the architecture search.
We propose a neural architecture search based on the idea of iterative refinement of stepwise-optimal candidates that allows the user to control the depth and breadth of the search.
We demonstrate that it is possible to find variations of neural architecture with equivalent classification accuracy but greatly improved inference performance, even before weight pruning or model quantization are considered.
Paper Structure
We first present some background on neural architecture search, and inference performance in deep neural networks.
Next, we present our performance-oriented neural architecture search algorithm. Using our approach, we explore variations of the initial seed architecture and collect classification accuracy and inference performance statistics.
We evaluate our approach on a real-world edge platform by considering the application of a keyword spotting (KWS) pipeline on an ARM Cortex-A class processor. For context, we also benchmark the models produced by our approach on a typical high-end GPU commonly used for deep-learning workloads. Finally, we present some related work and discuss potential extensions of our approach.
II. Background
Many deep learning models in widespread use today are still designed by humans [23, 17] or according to simple schemes where submodules are programmatically stacked to a configurable depth to construct the network [27, 31, 26].
Neural architecture search is typically performed in conjunction with hyperparameter optimization to automatically
discover model architectures which can be trained to a high degree of accuracy for a specific learning task [35, 13, 12]. However, the search process may discover many model architectures which can be trained to an approximately equivalent degree of accuracy. When resources are no object at inference time, any of these models will suffice. However, when performing inference on edge devices, with limited memory and compute capacity, it is crucial to optimize the model to minimize storage and computational requirements.
Variants of neural architecture search have been successfully applied to many networks in different contexts in prior work [35]. However, these investigations have focused only on improving the accuracy of the trained networks.
Yang et al. propose the use of co-design methods to produce a customized hardware design to accelerate a specific neural network [32]. Our work differs from that of Yang et al. primarily in that we propose a generic method by which the entire space of customized networks derived from some initial seed network can be explored automatically, rather than designing a custom network by hand.
Cai et al. [8] propose a modification to the training process of the neural network to directly incorporate parameter search. However, a key difference from our work is that Cai et al. require a set of fixed alternatives to be chosen in advance, rather than our fine-grained bounds for individual architectural parameters. Nevertheless, the approach of Cai et al. is shown to yield improvements in two image classification tasks.
Pruning weights with small values induces sparsity in the model [14] and is a popular technique for decreasing operation count. However, the degree to which a model can be pruned without failing to meet classification accuracy targets is unpredictable and largely model-dependent. Furthermore, it is not always clear how to select an effective pruning approach for a given model beyond trial and error.
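As a concrete illustration, magnitude pruning can be sketched in a few lines of NumPy. This is a generic sketch of the technique, not the specific pruning scheme of [33] or [14]:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights smallest in magnitude.
    Ties at the threshold may zero slightly more than requested."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

w = np.array([[0.9, -0.05, 0.4],
              [0.01, -0.7, 0.1]])
pw = magnitude_prune(w, 0.5)   # zeroes the three smallest: 0.01, -0.05, 0.1
```

On sparse-aware hardware or kernels, the zeroed weights can then be skipped entirely, which is where the operation-count saving comes from.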
Quantization is another major approach to reducing model size and improving inference performance. By converting weights to a smaller numeric format, the model size is reduced. Although quantization does not change the number of operations, computation on smaller types is often faster, especially on FPGAs [6] or custom accelerators [14], where the arithmetic implementation in hardware can be customized to match the model quantization scheme.
On general-purpose processors and GPUs, quantization is restricted to the numeric types already present in hardware. Nonetheless, quantization often results in a net gain in inference performance [11].
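For illustration, a minimal affine (asymmetric) quantization of a float tensor to uint8 might look as follows. This is a generic sketch rather than any specific framework's scheme:

```python
import numpy as np

def quantize_uint8(w):
    """Affine (asymmetric) quantization of a float tensor to uint8."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = int(round(-lo / scale))
    q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)
q, scale, zp = quantize_uint8(w)
w_hat = dequantize(q, scale, zp)   # round-trip error bounded by `scale`
```

The model shrinks 4x (float32 to uint8), and the worst-case reconstruction error is on the order of the quantization step `scale`.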
II-A. DNN Convolution
Convolution is the most computationally heavy primitive in the majority of popular deep neural networks. As such, it presents a natural target for optimization.
Figure 1 shows the data flow in the DNN convolution operation. The $H$ and $W$ parameters are determined by the size of the input feature maps, while the $k_h$, $k_w$, and $f$ parameters characterize the convolution being applied. The filter size of the convolution is $k_h \times k_w$ points, and the number of feature maps, $f$, determines how many of these filters are trained. Since convolutions are connected input-to-output in the neural network, the number of feature maps on the output, $f$, matches the number of feature maps, $C$, on the input of the next convolution in the chain.
Since inference performance is dominated by the convolutional layers, the size and number of convolutional filters trained in each convolutional layer are natural candidates for search. With very few or very small filters, the network encodes much less information than with more numerous or larger filters. However, a reduction in filter quantity or filter size reduces both the operation count, and the model size, resulting in more efficient inference.
Striking a balance between these opposing objectives is the goal of this paper. We propose to incorporate inference performance concerns directly into neural architecture search, by extending the search with a new objective function that models inference cost.
Inference Performance
The inference performance of convolutional DNNs is dominated by the time spent executing the convolutional layers. $k_h \cdot k_w$ multiplications and $k_h \cdot k_w - 1$ additions are used to compute the output of a single filter channel. All filters are then applied at $(H / s_h) \times (W / s_w)$ input points, where $H$ and $W$ are subdivided by their respective strides, $s_h$ and $s_w$, if present. $C - 1$ additions per output point are then performed to sum across the $C$ input channels and compute one output feature map. Since there are $f$ output feature maps, the total operation count is $f \cdot \frac{H}{s_h} \cdot \frac{W}{s_w} \cdot \left( C \cdot (2 k_h k_w - 1) + C - 1 \right)$, or more succinctly,

$\mathrm{Ops} \approx 2 \cdot f \cdot C \cdot k_h \cdot k_w \cdot \frac{H}{s_h} \cdot \frac{W}{s_w}$ (1)
Similarly, the number of weight parameters for a convolutional layer can be expressed as

$\mathrm{Weights} = f \cdot C \cdot k_h \cdot k_w$ (2)
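Equations (1) and (2) are straightforward to compute with a small helper. The example values below are hypothetical, assuming a single-channel 40×32 input and a 4×10 filter as in the seed network's first layer described later; the output-size handling ignores padding and kernel extent for simplicity:

```python
def conv_ops(H, W, C, f, kh, kw, sh=1, sw=1):
    """Approximate multiply+add count of one conv layer, Equation (1).
    Output spatial size is approximated as (H/sh) x (W/sw)."""
    return 2 * f * C * kh * kw * (H // sh) * (W // sw)

def conv_params(C, f, kh, kw):
    """Weight count of one conv layer, Equation (2), ignoring biases."""
    return f * C * kh * kw

# Hypothetical first-layer values: single-channel 40x32 input,
# 100 filters of size 4x10, unit stride.
ops = conv_ops(40, 32, 1, 100, 4, 10)    # 10,240,000, i.e. ~10.2 MOps
params = conv_params(1, 100, 4, 10)      # 4,000 weights
```

Summing `conv_ops` over all convolutional layers gives the cost measure used by the search in Section III.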
There are many special-case algorithms for computing a direct convolution, which all have the same asymptotic complexity, but exploit spatial locality and other features of the convolution to make efficient use of different primitive operations, such as matrix multiplication [28]. The choice of filter size and shape determines which of these algorithms can actually be used.
III. Our Approach
Prior work [7, 35] on neural architecture search ranges in scope from the design of whole-network connective structure to the value ranges of specific hyperparameters. In order to demonstrate our performance-oriented search, we focus on the two kinds of parameters which most immediately influence operation count in the network: the filter size, $k_h \times k_w$, and the feature map count, $f$. We consider the filter size to consist of two distinct sub-parameters, $k_h$ and $k_w$, to allow for non-square filters, since these are observed in several expert-designed networks for keyword spotting.

Taking the expert-designed network from the work of De Prado et al. [11] as the initial seed architecture, we explore variant networks with the same connective structure, but different $f$, $k_h$, and $k_w$ parameters on all layers.
In principle, this search can be extended to incorporate many other parameters, provided that their impact on the operation count in the network can be specified.
III-A. Search
We begin by constructing a search space, fixing a range for each of the parameters $f$, $k_h$, and $k_w$ for each convolution in the network. Each point in this search space is a unique configuration of the neural network. There are two key metrics of network quality which we consider: accuracy and cost. To model how well any given configuration of the network classifies, we measure the TOP-1 accuracy across some fixed number of inferences. Using the formula in Equation 1, we also compute the cost of the configuration as the sum of the operation counts of all convolutional layers.

In principle, any measure of accuracy and of cost can be used. For example, to relax the accuracy requirement, the average TOP-5 accuracy could be considered instead of TOP-1.
Since $f$, $k_h$, and $k_w$ are arbitrary positive integers, the configuration space of the network is extremely large. Recall that these parameters can vary independently for each convolutional layer in the network. In practice, for any one layer, $k_h$ and $k_w$ are bounded above by the height and width of the input feature maps, since setting $k_h = H$ and $k_w = W$ means that the convolution produces a single scalar output per filter. The number of filters, however, is unbounded in practice, although we would expect diminishing returns in terms of accuracy as the filter count grows very large.

Even by fixing $f$, $k_h$, and $k_w$ to be bounded above by their settings in the seed network, many configurations remain, each of which requires some number of training iterations to evaluate.
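To get a feel for the scale, the size of the configuration space is the product of the per-layer choices raised to the number of layers. The bounds below are purely illustrative:

```python
def space_size(layers, f_choices, kh_choices, kw_choices):
    """Number of configurations when each layer independently picks
    its filter count and kernel height/width from bounded ranges."""
    return (f_choices * kh_choices * kw_choices) ** layers

# hypothetical bounds: 6 conv layers, 100 filter-count choices,
# 5 choices each for kernel height and width
n = space_size(6, 100, 5, 5)   # 2500**6, roughly 2.4e20 configurations
```

Even with modest per-layer ranges, exhaustive evaluation is clearly out of the question, which motivates the guided search strategies discussed next.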
Prior work on Neural Architecture Search has employed numerous approaches to deal with the size and complexity of the search spaces involved. All of the approaches share a common design element, in that they attempt to learn the structure of the search space so that the majority of the search time is spent evaluating candidate configurations which are likely to be of high quality.
Two of the most popular approaches are to use reinforcement learning or hypernetworks to learn the structure of the search space. In the “SMASH” approach of Brock et al. [7], a hypernetwork is trained which predicts the weights of proposed configurations, so that points in the space that are likely to be high-quality configurations can be identified without fully training each one. In the approach of Le et al. [35], a recurrent neural network (RNN) is trained with reinforcement learning, which learns to propose strings specifying the configuration of the network. This approach works by feeding the accuracy of the child network (the CNN matching the proposed configuration) back into the parent RNN to modify the loss.
In principle, any method may be chosen to explore the configuration space of the network. Regardless of the choice of method, the ideal result is a set of configurations of the network which are high-quality, in the sense that they have high accuracy and low cost. However, according to the needs of the programmer, some of these high-quality configurations may be better than others. For example, the least-cost network which meets some accuracy target may not be the same candidate as the most accurate network that does not exceed a certain cost.
To evaluate a candidate model while exploring the search space, some budget of training iterations is required. Depending on the available computing resources for training, this may be larger or smaller. With a smaller training iteration budget, more points in the search space can be explored in a fixed amount of time than with a larger budget. However, with a larger budget, the candidates will be evaluated in more depth, making it easier to discriminate between them. The stopping criterion for one search phase may be an elapsed wall-clock time, or a certain number of models explored. In either case, after executing a search phase, we obtain a set of candidate models, each trained for some number of iterations.
III-B. Refinement
The second phase of our approach is the refinement of the set of candidate models. We construct the Pareto frontier of the model set produced from the search stage with respect to the two criteria of accuracy and cost.
If all, or very many, candidates on the Pareto frontier share a common substructure (i.e. common settings for any $f$, $k_h$, or $k_w$ parameter), we fix that substructure and repeat the search phase. Since we have reduced the search space by fixing some substructure, each of these iterative refinements becomes faster to evaluate. This means that we can explore more of the search space around refined configurations (with a fixed training iteration budget), or evaluate candidates more precisely (with an increased training iteration budget). In this way, the search process works at a progressively finer granularity the more refinement steps are taken.
A Pareto improvement is a change that makes at least one criterion better without making any other criterion worse, given a certain initial configuration. A configuration is Pareto optimal when no further Pareto improvements can be made. The Pareto frontier is the set of all Pareto optimal configurations, i.e. configurations of the network where no other configuration can achieve at least as much accuracy with lower cost, nor strictly more accuracy with equivalent cost.
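This definition translates directly into code. A brute-force sketch, quadratic in the number of candidates, which is adequate at the scale of a few hundred models; the data points are hypothetical (TOP-1 accuracy, MOps) pairs:

```python
def pareto_frontier(candidates):
    """Accuracy/cost Pareto-optimal subset of (accuracy, cost) pairs;
    higher accuracy is better, lower cost is better."""
    return [(acc, cost) for acc, cost in candidates
            if not any((a >= acc and c < cost) or (a > acc and c <= cost)
                       for a, c in candidates)]

# hypothetical (TOP-1 accuracy, MOps) pairs
models = [(0.94, 580.0), (0.94, 90.0), (0.95, 220.0),
          (0.90, 17.0), (0.89, 40.0)]
frontier = pareto_frontier(models)
```

Here (0.94, 580.0) is dominated by (0.94, 90.0), which is equally accurate but cheaper, and (0.89, 40.0) is dominated by (0.90, 17.0); the other three points survive on the frontier.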
In practice, we found that fixing the most commonly observed parameter setting in the Pareto-optimal models worked very well in the refinement phase. It is important to note that when we freeze a parameter setting, although we do not continue to evaluate candidates which do not share the setting for the fixed architectural parameter, we do not discard Pareto-optimal candidates which have already been found. Models on the Pareto frontier which do not share the chosen setting for a frozen parameter might continue to be on the frontier even after the other candidates, and their variants, have been trained for further iterations. Algorithm 1 summarizes the workflow of our neural architecture search strategy. Steps 4 and 5 may be repeated until all architectural parameters are frozen.
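The search-and-refine loop described above can be sketched as follows. The `evaluate` and `cost` functions here are synthetic stand-ins (in the real workflow, evaluation trains a candidate for a fixed iteration budget and measures TOP-1 accuracy, and cost applies Equation 1); the freezing heuristic picks the single most common unfrozen parameter setting among frontier models:

```python
import random
from collections import Counter

LAYERS = 3
CHOICES = {"f": [20, 50, 100], "k": [1, 3, 5]}

def sample_config(frozen):
    """Random configuration respecting already-frozen settings."""
    return {(l, p): frozen.get((l, p), random.choice(v))
            for l in range(LAYERS) for p, v in CHOICES.items()}

def cost(cfg):  # stand-in for Eq. (1) summed over layers
    return sum(cfg[(l, "f")] * cfg[(l, "k")] ** 2 for l in range(LAYERS))

def evaluate(cfg):
    """Toy accuracy proxy: 3x3 kernels help, extra filters help a little."""
    base = sum(0.9 if cfg[(l, "k")] == 3 else 0.7 for l in range(LAYERS))
    return base / LAYERS + sum(cfg[(l, "f")] for l in range(LAYERS)) / 10000.0

def frontier(cands):
    """Accuracy/cost Pareto frontier of (accuracy, cost, config) triples."""
    return [(acc, cst, cfg) for acc, cst, cfg in cands
            if not any((a >= acc and c < cst) or (a > acc and c <= cst)
                       for a, c, _ in cands)]

def refine(num_phases=3, per_phase=50, seed=0):
    random.seed(seed)
    frozen, cands = {}, []
    for _ in range(num_phases):
        for _ in range(per_phase):                      # search phase
            cfg = sample_config(frozen)
            cands.append((evaluate(cfg), cost(cfg), cfg))
        # freeze the most common unfrozen setting among frontier models;
        # previously found Pareto-optimal candidates are kept regardless
        counts = Counter((k, v) for _, _, cfg in frontier(cands)
                         for k, v in cfg.items() if k not in frozen)
        if counts:
            key, val = counts.most_common(1)[0][0]
            frozen[key] = val
    return frontier(cands), frozen

front, frozen = refine()
```

Each phase narrows the space, so later phases either cover more of the remaining space or evaluate candidates with a larger training budget, as described above.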
IV. Experimentation
In this paper, we use keyword spotting (KWS) as an example application for studying neural architecture search. A KWS system takes an input audio signal and predicts its most likely text label. Such a system can be trained with a typical supervised learning approach using labelled speech samples. Instead of classifying audio signals directly, a feature extraction step is applied to generate a spectrogram from the audio. Computing the Mel-frequency Cepstral Coefficients (MFCCs) [10] is a typical preprocessing step for ASR applications. Figure 2 shows an end-to-end KWS workflow in which 2D MFCCs are used to train a neural network classifier. For all models we generate MFCC features in the same way: 40 frequency bands are applied to 16 kHz audio samples with a 128 ms frame length and a 32 ms stride. We use the Google speech commands dataset [29], which has samples of 1 second in length, so the MFCC tensor has 40×32 features.
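The 40×32 feature shape follows from simple framing arithmetic, assuming centered (zero-padded) framing of the 1-second clips:

```python
sample_rate = 16_000
frame_len = int(0.128 * sample_rate)   # 2048 samples per 128 ms frame
hop = int(0.032 * sample_rate)         # 512 samples per 32 ms stride
num_samples = 1 * sample_rate          # 1-second keyword clips

# with centered (zero-padded) framing: one frame per hop, plus one
num_frames = 1 + num_samples // hop
feature_shape = (40, num_frames)       # 40 mel bands x 32 frames
```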
The speech commands dataset contains 65,000 audio samples of 30 keywords. We selected 10 keywords (yes, no, up, down, left, right, on, off, stop, go) for our experiment and used the default training, validation and test split, 80:10:10 respectively.
We implemented our approach using the Microsoft NNI (Neural Network Intelligence) toolkit [9] for constructing automated machine learning (AutoML) experiments. NNI is designed for building automated searches for the best neural architecture and/or hyperparameter sets.
For the two objectives, we chose the TOP-1 accuracy across 100 inferences as the estimator of network accuracy, and the sum of operation counts (Equation 1) as the estimator of network cost. Rather than a reinforcement learning or RNN based search, we chose to use the Tree-structured Parzen Estimator (TPE) [5], since each trial involves training a neural network to convergence, making the trials high-cost. This estimator is already implemented in the NNI toolkit.
Seed Architecture
A 6-layer CNN model (Figure 3) previously achieved 94.23% test accuracy on the Google speech commands dataset [29], and is therefore a good starting point for neural architecture search. We refer to the Caffe-trained model of [11] as the original model (i.e. with the parameters as found in [11] that yield 94.23% TOP-1 accuracy), and to its neural architecture as the seed architecture.

Starting with the seed architecture, we configured NNI to perform a search over the number of feature maps $f$, kernel height $k_h$, and kernel width $k_w$ for each convolutional layer.
Search in Experiments
Microsoft’s NNI toolkit offers many options to tune network hyperparameters as well as solver/optimizer parameters, including training iterations, batch size, learning rate, and decay strategy. However, the search space expands exponentially if we try to search for values for all parameters independently. In order to finish our experiment within a reasonable time budget, we used a two-step approach. First, we explored the solver parameters used to train the original model. The original model is trained in 40,000 iterations with the ADAM optimizer [21] with a batch size of 100, and the base learning rate drops by 70% every 10,000 iterations.
Performing a hyperparameter search with NNI, we found that an adjusted learning rate and a batch size of 25 with 8,000 training iterations had a high correlation with top accuracy scores. The ADAM optimizer was still preferred. We fixed this set of new solver parameters before beginning our search.
Our search space bounds for the three parameter types were configured as follows. For the per-layer bound on $f$, we used the setting in the seed architecture as an upper bound, together with a fixed lower bound. For the settings of $k_h$ and $k_w$, we used the seed architecture settings as upper bounds for the first convolutional layer (4×10), and for the remaining layers we increased the bounds slightly from the original setting of (3×3) to (5×5). All $k_h$ and $k_w$ lower bounds were set to 1.

This choice of bounds means we are searching for networks where each layer has at most as many filters, $f$, as in the original network, and the first-layer filter size is at most as large as in the original network. The filter sizes of subsequent layers may be larger than in the original network, up to two points of growth in each spatial dimension.
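Bounds of this kind are expressed in NNI as a search space of discrete choices. The sketch below uses NNI's `{"_type": "choice", "_value": [...]}` search-space format; the parameter names and the lower ends of the filter-count ranges are illustrative, since the exact lower bounds are not reproduced here:

```python
# NNI-style search space (JSON-compatible dict). Upper bounds follow the
# text: f capped at the seed setting of 100, the first-layer kernel
# capped at the seed's 4x10, later kernels allowed up to 5x5.
search_space = {
    "conv1_f":  {"_type": "choice", "_value": list(range(10, 101, 10))},
    "conv1_kh": {"_type": "choice", "_value": [1, 2, 3, 4]},
    "conv1_kw": {"_type": "choice", "_value": list(range(1, 11))},
}
for layer in range(2, 7):  # conv2 .. conv6
    search_space[f"conv{layer}_f"] = {
        "_type": "choice", "_value": list(range(10, 101, 10))}
    for dim in ("kh", "kw"):
        search_space[f"conv{layer}_{dim}"] = {
            "_type": "choice", "_value": [1, 2, 3, 4, 5]}
```

An NNI experiment then draws one value per key for each trial, and the TPE tuner biases later draws towards regions that previously scored well.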
Refinement in Experiments
We ran the search stage of the experiment with our initial configuration until 300 candidate networks were produced. Looking at the Pareto frontier in this experiment, we noticed that the vast majority of high-quality candidates used a filter size of (5×5) for the first convolutional layer. There was no other significant common substructure in the candidates. Fixing this filter size for the first layer, we repeated the search step with the refined seed network until a further 500 candidates were produced. We computed the Pareto frontier for this refined set of candidates, and fine-tuned the 12 Pareto-optimal models until each had hit the limit of 40,000 training iterations (i.e. the number of iterations used to train the seed model). Figure 4 shows the results of the experiment.
Since the non-square (4×10) kernel of the first convolutional layer yielded high accuracy in CNN, DS-CNN, and CRNN scenarios [34], it is believed to be a good design for KWS. However, our search found that a (5×5) kernel in the first layer is a better choice.
After investigating the data, we found that the MFCC features in [34] were generated from 40 ms speech frames, whereas we generated MFCC features from 128 ms speech frames. Consequently, each frame of the MFCC feature map in our experiment covers more temporal information (the time dimension of the MFCC) than in previous work. This setting enables good performance for smaller square kernels, and it explains why the $k_h$ and $k_w$ values we observe are preferred by the search. It also demonstrates the power of neural architecture search to adapt the model to changes in the conditioning of the dataset without intervention on the part of the end-user.
Observations
The most immediate observation from the experimental data is that, once the TOP-1 accuracy exceeds 90%, the vast majority of candidates which achieve any given accuracy target are much more expensive than they need to be. For many of the gradations in TOP-1 accuracy, the difference between the least-cost and greatest-cost configuration is close to, or exceeds, an order of magnitude.
Using traditional approaches to Neural Architecture Search, which are oblivious to inference performance, we have no way to ensure that the search process will choose a candidate with reasonable cost, and it is clear that the likelihood of making a very expensive choice is high.
TOP-1   MOps    ΔTOP-1   Speedup   Note
0.9423  581.12   0.00%    1.00×    seed model
0.8960   17.22  -4.63%   33.75×    best speedup
0.9410   87.61  -0.09%    6.63×    fastest within 0.1% of seed TOP-1
0.9425  167.68  +0.02%    3.47×    fastest exceeding seed TOP-1
0.9511  223.44  +0.88%    2.60×    best TOP-1
Table I summarizes the most interesting models we found in our experiment. The configuration with the largest reduction in operation count uses 33.75× fewer operations for inference than the seed architecture, at the cost of a 4.63 point reduction in TOP-1 accuracy.
The least-cost configuration with approximately the same accuracy as the seed architecture exhibited a 6.63× reduction in operation count, while the least-cost architecture that was strictly more accurate achieved a 3.47× reduction. Finally, the most accurate configuration found improves TOP-1 accuracy by 0.88 points, to 95.11%, while reducing operation count by 2.60× over the seed architecture.
Real-World Benchmarking
Using the full set of Pareto-optimal CNN models from Figure 4, we performed two real-world benchmarking experiments: one on an embedded SoC, the Samsung Exynos 5 Octa (Exynos 5422), and one on a high-end GPU (the NVIDIA GTX 1080Ti). We used the ODroid XU3 system, which has 2GB of RAM, with OpenBLAS 0.3.5 for CPU inference. For the GPU experiments, we used CUDA toolkit version 10.0 and cuDNN version 7.5.
All models on the Pareto frontier were benchmarked using Caffe, with 100 inferences performed in all cases. The runtime of the seed model was also benchmarked. Labels on bars are the TOP-1 score of the model for which inference time is being benchmarked. All models (including the seed model) were trained for a total of 40,000 iterations.
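Single-inference latency can be measured with a simple timing loop of this general shape. This is a sketch: in the paper's setup each run would be a Caffe forward pass for the model under test, while here a dummy workload stands in:

```python
import time

def benchmark(run_inference, warmup=10, iters=100):
    """Mean single-inference latency in milliseconds over `iters` runs."""
    for _ in range(warmup):                 # discard cold-start effects
        run_inference()
    start = time.perf_counter()
    for _ in range(iters):
        run_inference()
    return (time.perf_counter() - start) / iters * 1000.0

# dummy stand-in workload in place of a real model's forward pass
latency_ms = benchmark(lambda: sum(i * i for i in range(10_000)))
```

Warming up before timing matters in practice, since first-run costs (memory allocation, cache and GPU initialization) would otherwise inflate the mean.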
Figure 5 shows the single-inference latency of the Pareto-optimal models using Caffe [19] on the Exynos 5422 system, and Figure 6 shows the same data on the GTX 1080Ti, using the Caffe cuDNN backend.
The mapping from operation count to inference latency is not one-to-one, since Caffe uses GEMM to implement the convolution arithmetic, and spatial locality and other implementation concerns affect execution time. Nevertheless, the general trend observed in Figure 4 holds in practice.
Model  conv1  conv2  conv3  conv4  conv5  conv6  TOP-1  MOps
seed  4x10, 100  3x3, 100  3x3, 100  3x3, 100  3x3, 100  3x3, 100  94.2%  581.1 
kws1  3x3, 40  3x3, 30  1x1, 30  5x5, 50  5x5, 50  5x5, 50  95.1%  223.4 
kws2  5x5, 40  3x3, 50  1x1, 30  5x5, 40  3x3, 50  5x5, 50  94.3%  167.7 
kws3  5x5, 50  1x1, 30  5x5, 40  3x3, 20  5x5, 30  3x3, 50  94.1%  87.6 
kws4  5x5, 50  3x3, 40  5x5, 20  1x1, 20  5x5, 30  3x3, 50  93.8%  87.2 
kws5  5x5, 20  1x1, 40  5x5, 30  3x3, 20  5x5, 30  3x3, 30  93.8%  76.5 
kws6  5x5, 20  3x3, 40  3x3, 40  3x3, 20  3x3, 40  3x3, 40  93.6%  65.2 
kws7  3x3, 50  1x1, 30  3x3, 20  5x5, 20  3x3, 50  3x3, 40  93.6%  56.8 
kws8  5x5, 50  1x1, 50  3x3, 20  3x3, 40  3x3, 30  3x3, 20  93.7%  46.3 
kws9  5x5, 50  1x1, 20  1x1, 50  3x3, 20  5x5, 20  3x3, 40  93.4%  37.7 
kws10  3x3, 40  1x1, 20  1x1, 20  3x3, 20  5x5, 20  3x3, 30  93.4%  26.3 
kws11  5x5, 30  1x1, 20  1x1, 20  1x1, 20  3x3, 20  5x5, 20  91.9%  20.2 
kws12  5x5, 50  1x1, 40  1x1, 50  1x1, 20  3x3, 20  3x3, 20  92.7%  17.2 
Discovered Models
Table II shows the network architectures on the Pareto frontier in Figure 4. The configuration of each convolutional layer is written as $k_h$×$k_w$, $f$. Final TOP-1 scores after fine-tuning are shown, along with operation count in millions. Additionally, the first row shows the parameters of the seed architecture.
One of the clearest trends in the data is that the choice of 100 filters per convolutional unit in the expert-designed seed architecture is excessive. None of the architectures found during search has any unit with more than 50 filters.
Another trend is that the choice of a 3×3 filter for all but the first unit in the network is suboptimal. In fact, only one of the final set of Pareto-optimal models uses this arrangement (model kws6). Looking at the most accurate model found (model kws1), we see that a smaller 1×1 filter size is found for the third unit, while larger 5×5 filters are selected for the last three units. While the choice of 100 filters per unit is clearly excessive, it appears that the choice of a 3×3 filter size is sometimes too large, and sometimes too small.
Additionally, the choice of a 4×10 filter in the first layer, which evenly divides the dimensions of the input spectrogram, seems to be totally unnecessary in our experiments: not a single Pareto-optimal arrangement of the network uses this filter shape. This expert-designed substructure could be useful with certain MFCC generation settings, but it does not hold in general. The NAS approach helps us avoid such suboptimal architectures.
The smallest model found, kws12, uses a 5×5 filter for the first convolutional unit, 1×1 filters for the next three convolutional units, and 3×3 filters for the last two units. Despite the enormous reduction in operation count versus the seed model, this arrangement is only 1.5 points less accurate with the same training budget.
V. Conclusion and Future Work
It is clear that the design of deep neural networks has become a task for which pen and paper are no longer suited. We have presented a computer-aided approach for the design of deep networks, which extends typical neural architecture search with a new objective modelling inference performance. From an initial seed architecture designed by hand, we are able to automatically discover variants which are as much as 33.75× more efficient, or as much as 0.88 points more accurate, as well as the set of Pareto-optimal models spanning these two extremes. As our evaluation shows, the benefits translate into practical performance gains, both on resource-constrained embedded devices and on high-performance GPU systems.
Future Work
We have not evaluated the use of pruning or quantization in conjunction with our approach. Pruning and quantization apply to a fixed network architecture, and so are orthogonal approaches to reducing model size and operation count.
As we have demonstrated, using a more appropriate neural architecture can result in up to a 33.75× reduction in operation count. However, the resulting model can potentially still benefit from quantization and pruning, and the benefits are cumulative, since pruning and quantization would be applied to the models discovered by search. The evaluation of pruning and quantization in conjunction with model search is a promising avenue for future work.
Acknowledgments
This work was partly supported by Science Foundation Ireland with grants 12/IA/1381, 13/RC/2094 (the Irish Software Research Centre www.lero.ie) and 13/RC/2106 (the Adapt Research Centre www.adaptcentre.ie). The project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 732204 (Bonseyes). This work is supported by the Swiss State Secretariat for Education, Research and Innovation (SERI) under contract number 16.0159.
The opinions expressed and arguments employed herein do not necessarily reflect the official views of these funding bodies.
References
 [1] (2017) 28th IEEE international conference on applicationspecific systems, architectures and processors, ASAP 2017, seattle, wa, usa, july 1012, 2017. IEEE Computer Society. External Links: Link, ISBN 9781509048250 Cited by: 28.
 [2] (2016) 43rd ACM/IEEE annual international symposium on computer architecture, ISCA 2016, seoul, south korea, june 1822, 2016. IEEE Computer Society. External Links: Link, ISBN 9781467389471 Cited by: 14.
 [3] (2018) Optimal DNN primitive selection with partitioned boolean quadratic programming. See Proceedings of the 2018 international symposium on code generation and optimization, CGO 2018, vösendorf / vienna, austria, february 2428, 2018, Knoop et al., pp. 340–351. External Links: Link, Document Cited by: §I.
 [4] K. Bazargan and S. Neuendorffer (Eds.) (2019) Proceedings of the 2019 ACM/SIGDA international symposium on fieldprogrammable gate arrays, FPGA 2019, seaside, ca, usa, february 2426, 2019. ACM. External Links: Link, Document, ISBN 9781450361378 Cited by: 32.
 [5] (2011) Algorithms for hyperparameter optimization. See Advances in neural information processing systems 24: 25th annual conference on neural information processing systems 2011. proceedings of a meeting held 1214 december 2011, granada, spain, ShaweTaylor et al., pp. 2546–2554. External Links: Link Cited by: §IV.
 [6] (2018) FINNR: an endtoend deeplearning framework for fast exploration of quantized neural networks. TRETS 11 (3), pp. 16:1–16:23. External Links: Link Cited by: §II.
 [7] (2017) SMASH: oneshot model architecture search through hypernetworks. CoRR abs/1708.05344. External Links: Link, 1708.05344 Cited by: §IIIA, §III.
 [8] (2018) ProxylessNAS: direct neural architecture search on target task and hardware. CoRR abs/1812.00332. External Links: Link, 1812.00332 Cited by: §II.

[9]
(2019)
An open source AutoML toolkit for neural architecture search and hyperparameter tuning.
. Note: {(}https://github.com/Microsoft/nni)[Online; accessed 18 March 2019] Cited by: §IV.  [10] (1980Aug.) Experiments in syllablebased recognition of continuous speech. IEEE Trans. Acoust., Speech, Signal Processing 28, pp. 357 – 366. Cited by: §IV.
 [11] (2018) QUENN: quantization engine for lowpower neural networks. See Proceedings of the 15th ACM international conference on computing frontiers, CF 2018, ischia, italy, may 0810, 2018, Kaeli and Pericàs, pp. 36–44. External Links: Link, Document Cited by: §II, §III, §IV.
 [12] (201808) Neural Architecture Search: A Survey. arXiv eprints, pp. arXiv:1808.05377. External Links: 1808.05377 Cited by: §II.
 [13] (2016) HyperNetworks. CoRR abs/1609.09106. External Links: Link, 1609.09106 Cited by: §II.
 [14] (2016) EIE: efficient inference engine on compressed deep neural network. See 2, pp. 243–254. External Links: Link, Document Cited by: §II, §II.
 [15] (2018) Streaming endtoend speech recognition for mobile devices. CoRR abs/1811.06621. External Links: Link, 1811.06621 Cited by: §I.
 [16] (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29 (6), pp. 82–97. External Links: Link, Document Cited by: §I.
 [17] (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. CoRR abs/1602.07360. External Links: Link, 1602.07360 Cited by: §II.
 [18] (2015) IEEE conference on computer vision and pattern recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015. IEEE Computer Society. External Links: Link, ISBN 978-1-4673-6964-0 Cited by: 27.
 [19] (2014) Caffe: convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093. Cited by: §IV.
 [20] D. R. Kaeli and M. Pericàs (Eds.) (2018) Proceedings of the 15th ACM international conference on computing frontiers, CF 2018, Ischia, Italy, May 08-10, 2018. ACM. External Links: Link, Document Cited by: 11.
 [21] (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR), Cited by: §IV.
 [22] J. Knoop, M. Schordan, T. Johnson, and M. F. P. O'Boyle (Eds.) (2018) Proceedings of the 2018 international symposium on code generation and optimization, CGO 2018, Vösendorf/Vienna, Austria, February 24-28, 2018. ACM. External Links: Link, Document Cited by: 3.
 [23] (2014) One weird trick for parallelizing convolutional neural networks. CoRR abs/1404.5997. External Links: Link, 1404.5997 Cited by: §II.
 [24] (2017) Proceedings of the 44th annual international symposium on computer architecture, ISCA 2017, Toronto, ON, Canada, June 24-28, 2017. ACM. External Links: Link, Document, ISBN 978-1-4503-4892-8 Cited by: 33.
 [25] J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. C. N. Pereira, and K. Q. Weinberger (Eds.) (2011) Advances in neural information processing systems 24: 25th annual conference on neural information processing systems 2011. Proceedings of a meeting held 12-14 December 2011, Granada, Spain. External Links: Link Cited by: 5.
 [26] (2014) Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556. External Links: Link, 1409.1556 Cited by: §II.
 [27] (2015) Going deeper with convolutions. See 18, pp. 1–9. External Links: Link, Document Cited by: §II.
 [28] (2017) Parallel multi-channel convolution using general matrix multiplication. See 1, pp. 19–24. External Links: Link, Document Cited by: §II-A.
 [29] (2018) Speech commands: A dataset for limitedvocabulary speech recognition. CoRR abs/1804.03209. External Links: Link, 1804.03209 Cited by: §IV, §IV.
 [30] (2018) Weighted quantization-regularization in DNNs for weight memory minimization toward HW implementation. IEEE Trans. on CAD of Integrated Circuits and Systems 37 (11), pp. 2929–2939. External Links: Link, Document Cited by: §I.
 [31] (2016) Wider or deeper: revisiting the resnet model for visual recognition. CoRR abs/1611.10080. External Links: Link, 1611.10080 Cited by: §II.
 [32] (2019) Synetgy: algorithm-hardware co-design for ConvNet accelerators on embedded FPGAs. See Proceedings of the 2019 ACM/SIGDA international symposium on field-programmable gate arrays, FPGA 2019, Seaside, CA, USA, February 24-26, 2019, Bazargan and Neuendorffer, pp. 23–32. External Links: Link, Document Cited by: §II.
 [33] (2017) Scalpel: customizing DNN pruning to the underlying hardware parallelism. See 24, pp. 548–560. External Links: Link Cited by: §I.
 [34] (2017) Hello edge: keyword spotting on microcontrollers. CoRR abs/1711.07128. External Links: Link, 1711.07128 Cited by: §IV, §IV.
 [35] (2016) Neural architecture search with reinforcement learning. CoRR abs/1611.01578. External Links: Link, 1611.01578 Cited by: §II, §II, §IIIA, §III.