1. Introduction
During the past decade, deep learning (DL) has led to significant breakthroughs in many areas, such as image classification and natural language processing (Huang et al., 2017; He et al., 2016; Brown and others, 2020). However, the large model sizes and computational complexity of existing models limit the deployment of DL on resource-constrained devices and its large-scale adoption in edge computing. Multiple model compression techniques, such as network pruning (Han et al., 2015), quantization (Courbariaux et al., 2016), and knowledge distillation (Hinton et al., 2015), have been proposed to compress and deploy such complex models on resource-constrained devices without sacrificing test accuracy. However, these techniques require a significant amount of manual tuning. Hence, neural architecture search (NAS) has been proposed to automatically design neural architectures with reduced model sizes (Baker et al., 2016; Zoph and Le, 2016; Liu and others, 2018; Liu et al., 2018; Elsken et al., 2019).

NAS is an optimization problem with specific targets (e.g., high classification accuracy) over a set of possible candidate architectures. The set of candidate architectures defines the (typically vast) search space, while the optimizer defines the search algorithm. Recent breakthroughs in NAS can simplify the tricky (and error-prone) ad-hoc architecture design process (Liu and others, 2018; Pham et al., 2018). Moreover, the networks obtained via NAS have higher test accuracy and significantly fewer parameters than hand-designed networks (Liu et al., 2018; Real and others, 2017). These advantages of NAS have attracted significant attention from researchers and engineers alike (Wistuba et al., 2019). However, most existing NAS approaches do not explicitly consider hardware constraints (e.g., latency and energy consumption). Consequently, the resulting neural networks still cannot be deployed on real devices.
To address this drawback, recent studies propose hardware-aware NAS, which incorporates the hardware constraints of networks during the search process (Jiang and others, 2020). Nevertheless, current approaches are time-consuming since they involve training the candidate networks and a tedious search process (Wu and others, 2019). To accelerate NAS, recent approaches rely on graph neural networks (GNNs) to estimate the accuracy of a given network (Ning et al., 2020; Wen et al., 2020; Chau et al., 2020; Lukasik et al., 2020). However, training a GNN-based accuracy predictor is still time-consuming (on the order of tens of minutes (Chiang and others, 2019) to hours (Mao et al., 2019) on GPU clusters). Therefore, adapting existing NAS approaches to different hardware architectures is challenging due to their intensive computation and execution time requirements.
To alleviate the computation cost of current NAS approaches, we propose to analyze the NAS problem from a network topology perspective. This idea is motivated by the observation that the tediousness and complexity of current NAS approaches stem from a lack of understanding of what actually contributes to a neural network's accuracy. Indeed, innovations in the topology of neural architectures, especially the introduction of skip connections, have achieved great success in many applications (Huang et al., 2017; He et al., 2016). This is because, in general, the topology (or structure) of a network strongly influences the phenomena taking place over it (Newman et al., 2006). For instance, how closely the users of a social network are interconnected directly affects how fast information propagates through the network (Barabási and Bonabeau, 2003). Similarly, a DNN architecture can be seen as a network of connected neurons. As discussed in (Bhardwaj et al., 2021), the topology of deep networks has a significant impact on how effectively gradients can propagate through the network, and thus on the test performance of neural networks. These observations motivate us to take a network science approach and quantify the topological properties of neural networks in order to accelerate NAS.

From an application perspective, the performance and energy efficiency of DNN accelerators are other critical metrics besides test accuracy. In-memory computing (IMC)-based architectures have recently emerged as a promising technique to construct high-performance and energy-efficient hardware accelerators for DNNs. IMC-based architectures can store all the weights on-chip, hence removing the latency of off-chip memory accesses. However, IMC-based architectures face the challenge of a tremendous increase in on-chip communication volume. While most state-of-the-art neural networks adopt skip connections to improve their performance (He et al., 2016; Sandler et al., 2018; Huang et al., 2017), the wide usage of skip connections requires large amounts of data transfer across multiple layers, thus causing a significant communication overhead. Prior work on IMC-based DNN accelerators proposed bus-based networks-on-chip (NoC) (Chen et al., 2018) or c-mesh-based NoCs (Shafiee and others, 2016) for communication between multiple layers. However, both bus-based and c-mesh-based on-chip communication significantly increase the area, latency, and energy consumption of the hardware; hence, they do not offer a promising solution for future accelerators.
Starting from these overarching ideas, this paper proposes FLASH – a fast neural architecture search with hardware optimization – to address the drawbacks of current NAS techniques. FLASH delivers a neural architecture that is co-optimized with respect to accuracy and hardware performance. Specifically, by analyzing the topological properties of neural architectures from a network science perspective, we propose a new topology-based metric, namely, the NN-Degree. We show that the NN-Degree can indicate the test performance of a given architecture. This makes our proposed NAS training-free during the search process and accelerates NAS by orders of magnitude compared to state-of-the-art approaches. Then, we demonstrate that the NN-Degree enables a lightweight accuracy predictor with only three parameters. Moreover, to improve the on-chip communication efficiency, we adopt a mesh-based NoC for the IMC-based hardware. Based on the communication-optimized hardware architecture, we measure the hardware performance for a subset of neural networks from the NAS search space. Then, we construct analytical models for the area, latency, and energy consumption of a neural network on our optimized target hardware platform. Unlike existing neural network-based and black-box style search algorithms (Jiang and others, 2020), the proposed NAS methodology enables searching across the entire search space via a mathematically rigorous and time-efficient optimization algorithm. Consequently, our experimental evaluations show that FLASH significantly pushes forward the NAS frontier by enabling NAS in less than 0.1 seconds on a 20-core Intel Xeon CPU. Finally, we demonstrate that FLASH can be readily transferred to other hardware platforms (e.g., Raspberry Pi) by merely fine-tuning the hardware performance models.
Overall, this paper makes the following contributions:

We propose a new topology-based analytical metric (NN-Degree) to quantify the topological characteristics of DNNs with skip connections. We demonstrate that the NN-Degree enables a training-free NAS within seconds. Moreover, we use the NN-Degree metric to build a new lightweight (three-parameter) accuracy predictor by training on as few as 25 samples out of a vast search space with more than 63 billion configurations. Without any significant loss in accuracy, our proposed accuracy predictor requires 6.88× fewer samples and provides a reduction of the fine-tuning time cost compared to existing GNN/GCN-based approaches (Wen et al., 2020).

We construct analytical models to estimate the latency, area, and energy consumption of various DNN architectures. We show that our proposed analytical models are applicable to multiple hardware architectures and achieve high accuracy with less than one second of fine-tuning time.

We design a hierarchical simplicial homology global optimization (SHGO)-based algorithm to search for the optimal architecture. Our proposed hierarchical SHGO-based algorithm enables 27,729× faster (less than 0.1 seconds) NAS compared to the RL-based baseline approach.

We demonstrate that our methodology enables NAS on a Raspberry Pi 3B with less than 3 seconds of computation time. To the best of our knowledge, this is the first work showing NAS running directly on edge devices with such low computational requirements.
The rest of the paper is organized as follows. In Section 2, we discuss related work and background information. In Section 3, we formulate the optimization problem, then describe the new analytical models and search algorithm. Our experimental results are presented in Section 4. Finally, Section 5 concludes the paper with remarks on our main contributions and future research directions.
2. Related Work and Background Information
Hardware-aware NAS: Hardware accelerators for DNNs have recently become popular due to the high performance demands of multiple applications (Deng et al., 2009; Manning et al., 1999; Benmeziane and others, 2021); they can significantly reduce the latency and energy associated with DNN inference. The hardware performance (e.g., latency, energy, and area) of accelerators varies with DNN properties (e.g., number of layers, parameters, etc.); therefore, hardware performance is also a crucial factor to consider during NAS.
Several recent studies consider hardware performance for NAS. The authors of (Dai et al., 2020) introduce a growing and pruning strategy that automatically maximizes the test accuracy and minimizes the FLOPs of neural architectures during training. A platform-aware NAS targeting mobile devices is proposed in (Tan and others, 2019); the objective is to maximize the model accuracy with an upper bound on latency. The authors of (Wu and others, 2019) create a latency-aware loss function to perform differentiable NAS. The latency of DNNs is estimated through a lookup table which consists of the latency of each operation/layer. However, both of these studies consider latency as the only metric for hardware performance. The authors of (Marculescu et al., 2018) propose a hardware-aware NAS framework to design convolutional neural networks. Specifically, by building analytical latency, power, and memory models, they create a hardware-aware optimization methodology to search for the optimal architecture that meets the hardware budgets. The authors of (Jiang and others, 2020) consider latency, energy, and area as metrics for hardware performance while performing NAS. Also, a reinforcement learning (RL)-based controller is adopted to tune the network architecture and device parameters. The resulting network is retrained to evaluate the model accuracy. There are two major drawbacks of this approach. First, RL is a slow-converging process that prohibits fast exploration of the design space. Second, retraining the network further exacerbates the search time, leading to hundreds of GPU hours for real applications (Zoph and Le, 2016). Furthermore, most existing hardware-aware NAS approaches explicitly optimize the architectures for a specific hardware platform (Cai et al., 2019; Wu and others, 2019; Li and others, 2020). Hence, if we switch to new hardware, we need to repeat the entire NAS process, which is very time-consuming under the existing NAS frameworks (Cai et al., 2019; Wu and others, 2019; Li and others, 2020). The demand for reducing the overhead of adapting to new hardware motivates us to improve the transferability of our hardware-aware NAS methodology.

Accuracy Predictor-based NAS: Several approaches perform NAS by estimating the accuracy of the network (Ning et al., 2020; Wen et al., 2020; Chau et al., 2020; Lukasik et al., 2020). These approaches first train a graph neural network (GNN), or a graph convolutional network (GCN), to estimate the network accuracy while exploring the search space. During the search process, the test accuracy of the sample networks is obtained from the estimator instead of through regular training. Although estimating the accuracy significantly accelerates the NAS process, the training cost of the accuracy predictor itself remains a bottleneck. A GNN requires many training samples to achieve high accuracy, thus involving a significant overhead for training the candidate networks from the search space. Therefore, NAS based on accuracy predictors still suffers from excessive computation and time requirements.
Time-efficient NAS: To reduce the time cost of training candidate networks, the authors of (Pham et al., 2018; Stamoulis and others, 2019) introduced the weight-sharing mechanism (WS-NAS). Specifically, candidate networks are generated by randomly sampling part of a large network (supernet). Hence, candidate networks share the weights of the supernet and update these weights during training. By reusing these trained weights instead of training from scratch, WS-NAS significantly improves the time efficiency of NAS. However, the accuracy of the models obtained via WS-NAS is typically far below that of models trained from scratch. Several optimization techniques have been proposed to close the accuracy gap between shared weights and stand-alone training (Yu and others, 2020; Cai et al., 2020). For example, the authors of (Cai et al., 2020) propose a progressive shrinking algorithm to train the supernet. However, in many cases, the resulting networks still need some fine-tuning epochs to obtain the final architecture. To further accelerate the NAS process, some works propose differentiable NAS (Liu et al., 2018; Cai et al., 2019). Differentiable NAS approaches search for the optimal architecture by learning the optimal architecture parameters during the training process. Hence, differentiable NAS only needs to train the supernet once, thus reducing the training time significantly. Nevertheless, due to the significantly large number of parameters of the supernet, differentiable NAS requires a high volume of GPU memory. To further improve the time efficiency of NAS, several approaches have been proposed for training-free NAS (Abdelfattah et al., 2021; Chen et al., 2021). These approaches leverage a training-free proxy that indicates the test performance of a given architecture; hence, the training time is eliminated from the NAS process. However, these methods usually use gradient-based information to build the proxy (Abdelfattah et al., 2021; Chen et al., 2021). Therefore, in order to calculate the gradients, GPUs are still necessary for the backward propagation process. To completely decouple the NAS process from GPU platforms, our work proposes a GPU-free proxy for training-free NAS. We provide more details in Section 4.3.

Skip connections and Network Science: Networks obtained both by manual design and by NAS have shown that long-range links (i.e., skip connections) are crucial for achieving higher accuracy (He et al., 2016; Huang et al., 2017; Sandler et al., 2018; Liu et al., 2018). Overall, there are two commonly used types of skip connections in neural networks. First, there are DenseNet-type skip connections (DTSC), which concatenate previous layers' outputs as the input for the next layer (Huang et al., 2017). To study the topological properties and enlarge the search space, we do not use the original DenseNets (Huang et al., 2017), which contain all-to-all connections. Instead, we consider a generalized version where we vary the number of skip connections by randomly selecting only some channels for concatenation, as shown in Fig. 1(a). The other type is addition-type skip connections (ATSC), which consist of links that bypass several layers and are directly added to the output of later layers (see Fig. 1(b)) (He et al., 2016).
In network science, a small-world network is defined as a highly clustered network with a small distance (typically logarithmic in the number of network nodes) between any two nodes (Watts and Strogatz, 1998). Considering the skip connections in neural networks, we propose to use the small-world network concept to analyze networks with both short- and long-range (or skip) links. Indeed, small-world networks can be decomposed into: (i) a lattice network accounting for short-range links; (ii) a random network accounting for long-range links (see Fig. 1(c)). The coexistence of a rich set of short- and long-range links leads to both a high degree of clustering and a short average path length (logarithmic in the network size). We use the small-world network model to analyze the topological properties of neural networks in Section 3.
Average Degree: The average degree of a network is the average number of connections a node has, i.e., the total number of edges divided by the total number of nodes. The average degree and the degree distribution (i.e., the distribution of node degrees) are important topological characteristics that directly affect how information flows through a network (Barabási and Bonabeau, 2003). Indeed, small-world network theory reveals that the average degree of a network has a significant impact on the network's average path length and clustering behavior (Watts and Strogatz, 1998). Therefore, we use network science to investigate the performance gains due to these topological properties.
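For concreteness, the average degree as defined above (total edges divided by total nodes) can be computed as follows; the example graph is our own illustration, not from the paper:

```python
def average_degree(num_nodes, edges):
    """Average degree as used in this paper: total edges / total nodes.
    (Note: the conventional undirected average degree is 2*E/N, since each
    edge touches two endpoints; the definition here uses E/N.)"""
    return len(edges) / num_nodes

# Example: a 4-node ring has 4 edges, so its average degree (E/N) is 1.0.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
```
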
3. Proposed Methodology
3.1. Overview of New NAS Approach
The proposed NAS framework is a two-stage process, as illustrated in Fig. 2: (i) We first quantify the topological characteristics of neural networks via the newly proposed NN-Degree metric. Then, we randomly select a few networks and train them to fine-tune the accuracy predictor based on the network topology. We also build analytical models to estimate the latency, energy, and area of given neural architectures. (ii) Based on the accuracy predictor and analytical performance models from the first stage, we use a simplicial homology global optimization (SHGO)-based algorithm in a hierarchical fashion to search for the optimal network architecture.
3.2. Problem Formulation of Hardware-aware NAS
The overall target of hardware-aware NAS is to find the network architecture that gives the highest test accuracy while achieving small area, low latency, and low energy consumption when deployed on the target hardware. In practice, there are constraints (budgets) on the hardware performance and test accuracy. For example, battery-based devices have a very constrained energy capacity (Wang et al., 2020); hence, there is an upper bound on the energy consumption of the neural architecture. To summarize, the NAS problem can be expressed as:
(1)  max F(a, Ar, l, e)
subject to:  a ≥ a_th,  Ar ≤ Ar_th,  l ≤ l_th,  e ≤ e_th
where a_th, Ar_th, l_th, and e_th are the constraints on the test accuracy, area, latency, and energy consumption, respectively. We summarize the symbols (and their meanings) used in this part in Table 1.
Symbol  Definition
F  Objective function of NAS
a  Test accuracy of a given network
Ar  Chip area
l  Inference latency of a given network
e  Inference energy consumption of a given network
a_th  Constraint on test accuracy for NAS
Ar_th  Constraint on area for NAS
l_th  Constraint on inference latency for NAS
e_th  Constraint on inference energy consumption for NAS
3.3. NN-Degree and Training-free NAS
This section first introduces our idea of modeling a CNN based on network science (Watts and Strogatz, 1998). To this end, we define a group of consecutive layers with the same width (i.e., number of output channels, w_c) as a cell; then we break the entire network into multiple cells and denote the number of cells by N_c. Similar to MobileNet-v2 (Sandler et al., 2018), we also adopt a width multiplier (w_m) to scale the width of each cell. Moreover, following most mainstream CNN architectures, we assume that each cell inside a CNN has the same number of layers (d_c). Furthermore, as shown in Fig. 1, we consider each channel of the feature map as a node in a network and each convolution filter/kernel as an undirected link. These notations are summarized in Table 2.
Symbol  Definition
g  NN-Degree (new metric we propose)
g_lc  NN-Degree of the lattice network (short-range connections)
g_rc  NN-Degree of the random network (long-range or skip connections)
N_c  Number of cells
w_c  Number of output channels per layer within cell c (i.e., the width of cell c)
d_c  Number of layers within cell c (i.e., the depth of cell c)
s_c  Number of skip connections within cell c
θ_1, θ_2, θ_3  Learnable parameters for the accuracy predictor
Combining the concept of small-world networks from Section 2 with our modeling of a CNN, we decompose a network cell with skip connections into a lattice network and a random network (see Fig. 1(c)).
Proposed Metrics: Our key objective is twofold: (i) quantify which topological characteristics of DNN architectures affect their performance, and (ii) exploit such properties to accurately predict the test accuracy of a given architecture. To this end, we propose a new analytical metric called the NN-Degree, defined below.
Definition of NN-Degree: Given a DNN with N_c cells, d_c layers per cell, the width of each cell w_c, and the number of skip connections of each cell s_c, the NN-Degree metric g is defined as the sum of the average degrees of all cells:
(2)  g = \sum_{c=1}^{N_c} (g_{lc} + g_{rc})
Intuition: The average degree of a given DNN cell is the sum of the average degrees of its lattice network (g_lc) and its random network (g_rc). Given a cell with d_c convolutional layers and w_c channels per layer, the number of nodes is d_c × w_c. Moreover, each convolutional layer has w_c × w_c filters (kernels) accounting for the short-range connections; hence, the lattice network contains d_c × w_c^2 connections in total, so its average degree is g_lc = w_c. Similarly, the s_c skip connections of the random network yield an average degree of g_rc = s_c/(d_c × w_c). Using the above analysis, we can express the NN-Degree as follows:
(3)  g = \sum_{c=1}^{N_c} \left( w_c + \frac{s_c}{d_c \times w_c} \right)
Discussion: The first term in Equation 3 (i.e., w_c) reflects the width of the network. Many successful DNN architectures, such as DenseNets (Huang et al., 2017), Wide-ResNets (Zagoruyko and Komodakis, 2016), and MobileNets (Sandler et al., 2018), have shown that wider networks can achieve a higher test performance. The second term (i.e., s_c/(d_c × w_c)) quantifies how densely the nodes are connected through the skip connections. As discussed in (Veit et al., 2016), networks with more skip connections have more forward/backward propagation paths and thus a better test performance. Based on the above analysis, we claim that a higher NN-Degree value should indicate networks with higher test performance. We verify this claim empirically in the experimental section. Next, we propose an accuracy predictor based only on the NN-Degree.
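A minimal sketch of computing the NN-Degree from the cell parameters (the cell values below are hypothetical; the per-cell expression follows the lattice/random decomposition above, i.e., a width term plus a skip-density term):

```python
def nn_degree(cells):
    """NN-Degree: each cell contributes its width w (lattice part, short-range
    links) plus its skip-connection density s/(d*w) (random part, skip links).
    `cells` is a list of (d, w, s) tuples: depth, width, skip-connection count."""
    return sum(w + s / (d * w) for d, w, s in cells)

# Example: 3 hypothetical cells, each 4 layers deep, widths 16/32/64,
# with 64/128/256 skip connections respectively.
cells = [(4, 16, 64), (4, 32, 128), (4, 64, 256)]
g = nn_degree(cells)  # wider cells and more skips -> larger NN-Degree
```
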
Accuracy Predictor: Given the NN-Degree (g) definition, we build the accuracy predictor using a variant of logistic regression. Specifically, the test accuracy a of a given architecture is:
(4)  a = \frac{\theta_1}{1 + e^{-(\theta_2 g + \theta_3)}}
where θ_1, θ_2, and θ_3 are the parameters that are fine-tuned with the accuracy and NN-Degree of sample networks from the search space. Section 4 shows that by using as few as 25 data samples (NN-Degree and corresponding accuracy values), we can generate an accurate predictor for a huge search space covering more than 63 billion configurations within 1 second on a 20-core Intel Xeon CPU.
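As an illustration of how such a three-parameter logistic predictor behaves, the sketch below assumes the form a(g) = θ1/(1 + exp(−(θ2·g + θ3))) with made-up parameter values; in the actual flow, the parameters are fitted on the 25 training samples:

```python
import math

def predict_accuracy(g, theta):
    """Three-parameter logistic accuracy predictor over the NN-Degree g:
    a(g) = t1 / (1 + exp(-(t2*g + t3))).  (Exact functional form and
    parameter values here are illustrative assumptions.)"""
    t1, t2, t3 = theta
    return t1 / (1.0 + math.exp(-(t2 * g + t3)))

# Hypothetical fitted parameters: accuracy saturates below 95% as g grows.
theta = (0.95, 0.05, -2.0)
curve = [(g, predict_accuracy(g, theta)) for g in (20, 60, 120)]
```

With θ2 > 0, the predictor is monotonically increasing in the NN-Degree, which matches the claim that a higher NN-Degree indicates a better-performing network.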
3.4. Overview of In-memory Computing (IMC)-based Hardware
Fig. 3 shows the IMC architecture considered in this work. We note that the proposed FLASH methodology is not specific to IMC-based hardware. We adopt an IMC architecture since it has been shown to achieve lower memory-access latency (Horowitz, 2014). Due to the high communication volume imposed by deeper and denser networks, the communication between multiple tiles is crucial for hardware performance, as shown in (Krishnan and others, 2020; Mandal and others, 2020).
Our architecture consists of multiple tiles connected by network-on-chip (NoC) routers, as shown in Fig. 3(a). We use a mesh-based NoC due to its superior performance compared to bus-based architectures. Each tile consists of a fixed number of compute elements (CEs), a rectified linear unit (ReLU), an I/O buffer, and an accumulation unit, as shown in Fig. 3(b). Within each CE, there are a fixed number of in-memory processing elements (imPEs), a multiplexer, a switch, an analog-to-digital converter (ADC), a shift-and-add (S&A) circuit, and a local buffer (Chen et al., 2018), as shown in Fig. 3(c). The ADC precision is set to four bits to avoid any accuracy degradation. No digital-to-analog converter (DAC) is used in the architecture; instead, a sequential signaling technique is adopted to represent multi-bit inputs (Peng and others, 2019). Each imPE consists of 256×256 IMC crossbars (the memory elements) based on ReRAM (1T1R) technology (Krishnan and others, 2020; Mandal and others, 2020; Chen et al., 2018). This work adopts a sequential operation between DNN layers since a pipelined operation may cause pipeline bubbles during inference (Song et al., 2017; Qiao et al., 2018).
Symbol  Definition
N_c  Number of cells
w_m  Width multiplier
d_c  Number of layers within cell c
w_c  Width of cell c
FLOPs_c  Number of FLOPs of cell c
N_CE  Number of CEs in each tile
N_PE  Number of imPEs in each CE
T^l  Number of tiles for layer l
T  Total number of tiles
A_tile  Area of a tile
f_e  Features for energy
3.5. Hardware Performance Modeling
This section describes the methodology of modeling hardware performance. We consider three metrics for hardware performance: area, latency, and energy consumption. We use customized versions of NeuroSim (Chen et al., 2018) for circuit simulation (computing fabric) and BookSim (Jiang and others, 2013) for cycleaccurate NoC simulation (communication fabric). First, we describe the details of the simulator.
Input to the simulator: The inputs to the simulator include the DNN structure, technology node, and operating frequency. In this work, we consider a layer-by-layer operation. Specifically, we simulate each DNN layer and sum the per-layer results to obtain the total hardware performance for the DNN.
Simulation of computing fabric: Table 4 shows the parameters considered for the simulation of the computing fabric. At the start of the simulation, the number of in-memory computing tiles is computed. Then, the area and energy of one tile are computed through analytical models derived from HSPICE simulation. After that, the area and energy of one tile are multiplied by the total number of tiles to obtain the total area and energy of the computing fabric. The latency of the computing fabric is computed as a function of the workload (the DNN being executed). We note that the original version of NeuroSim considers point-to-point on-chip interconnects, while our work uses a mesh-based NoC. Therefore, we skip the interconnect simulation in NeuroSim.
Simulation of communication fabric: We perform cycle-accurate simulation of the communication fabric using BookSim. First, the number of tiles required for each layer is obtained from the simulation of the computing fabric. In this work, we assume that each tile is connected to a dedicated router of the NoC. A trace file is generated for each layer of the DNN; it contains the source router, the destination router, and the timestamp at which each packet is generated. The trace file is simulated through BookSim to obtain the latency to finish all transactions between two layers. We also obtain the area and energy of the interconnect through BookSim. Table 4 shows the parameters considered for the interconnect simulation. More details of the simulator can be found in (Krishnan and others, 2021).
For hardware performance modeling, we first obtain the performance of the DNN through simulation; the measured performance numbers are then used to construct the performance models.
Analytical Area Model: An in-memory computing-based DNN accelerator consists of two major components: computation and communication. The computation unit consists of multiple tiles and peripheral circuits; the communication unit includes an NoC with routers and other network components (e.g., buffers, links). To estimate the total area, we first compute the number of rows (R^l) and the number of columns (C^l) of imPEs required for the l-th layer of the DNN following Equation 5 and Equation 6:
(5)  R^l = \left\lceil \frac{k_l \times k_l \times n_{in}^l}{256} \right\rceil
(6)  C^l = \left\lceil \frac{n_{out}^l \times N_{bits}}{256} \right\rceil
where k_l is the kernel size of layer l, n_in^l and n_out^l are its numbers of input and output channels, N_bits is the weight precision, and 256 is the imPE crossbar dimension (Table 4); the remaining symbols are defined in Table 3. Therefore, the total number of imPEs required for the l-th layer of the DNN is R^l × C^l. Each tile consists of N_CE CEs, and each CE consists of N_PE imPEs. Accordingly, each tile comprises N_CE × N_PE imPEs. Therefore, the total number of tiles required for the l-th layer of the DNN (T^l) is:
(7)  T^l = \left\lceil \frac{R^l \times C^l}{N_{CE} \times N_{PE}} \right\rceil
Hence, the total number of tiles (T) required for a given DNN is T = \sum_l T^l.
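The tile-count computation can be sketched as below; the crossbar mapping (rows hold the flattened kernel inputs, columns hold output channels times weight bit-width), the values of N_CE and N_PE, and the layer shapes are all illustrative assumptions, not the paper's exact configuration:

```python
import math

def tiles_for_layer(kernel, c_in, c_out, n_bits=8, xbar=256, n_ce=4, n_pe=8):
    """Tiles needed for one conv layer on the IMC fabric (illustrative mapping).
    rows (R^l): flattened kernel inputs k*k*c_in spread over 256-row crossbars;
    cols (C^l): output channels times weight bit-width over 256-col crossbars;
    each tile holds n_ce * n_pe imPE crossbars."""
    rows = math.ceil(kernel * kernel * c_in / xbar)   # R^l
    cols = math.ceil(c_out * n_bits / xbar)           # C^l
    return math.ceil(rows * cols / (n_ce * n_pe))     # T^l

# Total tiles T for a small 3-layer DNN (hypothetical layer shapes).
layers = [(3, 3, 64), (3, 64, 128), (3, 128, 128)]   # (kernel, c_in, c_out)
total_tiles = sum(tiles_for_layer(k, ci, co) for k, ci, co in layers)
```
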
Circuit  NoC
imPE array size  256×256  Bus width  32
Cell levels  2 bit/cell  Routing algorithm  X–Y
Flash ADC resolution  4 bits  Number of router ports  5
Technology used  RRAM  Topology  Mesh
As shown in Fig. 3(a), each tile is connected to an NoC router for on-chip communication. We assume that the total number of routers equals the total number of tiles. Hence, the total chip area is expressed as follows:
(8)  A_{total} = A_T + A_{periph} + A_R + A_{NoC}
where A_T is the area accounted for by all tiles and A_R is the total area accounted for by all routers in the design. The area of a single tile is denoted by A_tile; there are T tiles in the design, therefore A_T = T × A_tile. The area of the peripheral circuits (A_periph) includes the I/O interface, max-pool unit, accumulation unit, and global buffer. The area of a single router is denoted by A_router; the number of routers equals the number of tiles (T), therefore A_R = T × A_router. The area of the other NoC components (A_NoC) comprises links and buffers.

Analytical Latency Model: Similar to area, the total latency consists of computation latency and communication latency, as shown in Fig. 4(a). To construct the analytical latency model, we use the floating-point operations (FLOPs) of the network to represent the computational workload. We observe that the FLOPs of a given network are roughly proportional to the total number of convolution filters (kernels), which is the product of the number of layers and the square of the number of channels per layer (i.e., the width). In the network search space we consider, the width is equivalently represented by the width multiplier w_m and the number of layers by d_c; hence, we express the number of FLOPs of a given cell approximately as the product of the number of layers and the square of the width multiplier:
(9)  FLOPs_c \propto d_c \times w_m^2
Moreover, the communication volume increases significantly due to skip connections. To quantify this, we define the communication volume of a given network cell c (V_c) as the total amount of data transferred over the s_c skip connections within the cell.
Combining the above analysis of computation latency and communication latency, we use a linear model for the latency:
(10)  l = \theta_{comp}^{\top} f_{comp} + \theta_{NoC}^{\top} f_{NoC}
where θ_comp is a weight vector and f_comp is the vector of features with respect to the computation latency; θ_NoC is another weight vector and f_NoC is the vector of features corresponding to the NoC latency. We randomly sample some networks from the search space and measure their latency to fine-tune the values of θ_comp and θ_NoC.
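Fitting such a linear latency model reduces to ordinary least squares; below is a sketch with NumPy on synthetic measurements (the feature choice — one computation feature, one communication feature — and all numeric values are made up for illustration):

```python
import numpy as np

# One row per sampled network; columns: FLOPs proxy (computation feature)
# and skip-connection communication volume (NoC feature).
X = np.array([[1.0, 0.5],
              [2.0, 1.0],
              [4.0, 1.5],
              [8.0, 4.0]])
# Measured latencies for those samples (synthetic, generated with
# weights 1.2 and 0.8 so the fit is exact).
y = np.array([1.6, 3.2, 6.0, 12.8])

# Least-squares fit of the stacked weight vector (theta_comp, theta_noc).
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict_latency(f_comp, f_noc):
    """Linear latency model: theta_comp * f_comp + theta_noc * f_noc."""
    return theta[0] * f_comp + theta[1] * f_noc
```
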
Analytical Energy Model: We divide the total energy consumption into computation energy and communication energy, as shown in Fig. 4(b). Specifically, the entire computation process inside each tile consists of three steps:

Read the input feature map from the I/O buffer to the CE;

Perform computations in CE and ReLU unit, then update the results in the accumulator;

Write the output feature map to the I/O buffer.
Therefore, both the size of the feature map and the FLOPs contribute to the computation energy of a single cell. Moreover, the communication energy consumption is primarily determined by the communication volume. Hence, we use a linear combination of features to estimate the energy consumption of each tile:

E_tile = θ_E · f_E   (11)

where θ_E is a weight vector and f_E is the vector of features corresponding to the energy consumption of each tile. We use the measured energy consumption values of several sample networks to fine-tune the values of θ_E. The total energy consumption is the product of E_tile and the number of tiles T:

E_total = E_tile × T   (12)
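A minimal numeric sketch of the per-tile and total energy computation described above; the weights, feature values, and tile count are purely illustrative:

```python
import numpy as np

theta_e = np.array([0.5, 1.2, 0.3])   # fitted weights (hypothetical)
f_tile = np.array([2.0, 4.0, 1.0])    # per-tile features, e.g. feature-map size,
                                      # FLOPs, communication volume (hypothetical units)
num_tiles = 16

e_tile = theta_e @ f_tile             # Eq. (11): linear combination of features
e_total = e_tile * num_tiles          # Eq. (12): scale by the number of tiles
assert np.isclose(e_tile, 6.1)
assert np.isclose(e_total, 97.6)
```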
We note that all the features used in both our accuracy predictor and our analytical hardware performance models depend only on a few basic architecture parameters of the network. Therefore, the analytical hardware models are lightweight. To the best of our knowledge, no other lightweight analytical models exist for IMC platforms. Moreover, FLASH is general and can be applied to different hardware platforms: for a given hardware platform, the energy, latency, and area of the DNNs must first be collected; then the analytical hardware models are fine-tuned using these performance data.
3.6. Optimal neural architecture search
Based on the above accuracy predictor and analytical hardware performance models, we perform the second stage of our NAS methodology, i.e., searching for the optimal neural architecture by considering both the test accuracy and the hardware performance on the target hardware. To this end, we use a modified version of the Simplicial Homology Global Optimization (SHGO (Endres et al., 2018)) algorithm to search for the optimal architecture. SHGO has mathematically rigorous convergence properties on nonlinear objective functions and constraints and can solve derivative-free optimization problems.¹ (¹A detailed discussion of SHGO is beyond the scope of this paper; more details are available in (Endres et al., 2018).) Moreover, SHGO requires far fewer samples and much less time to converge than reinforcement learning approaches (Jiang and others, 2020). Hence, we use SHGO as the basis of our new hierarchical search algorithm.
Specifically, as shown in Algorithm 1, to further accelerate the search process, we propose a three-level SHGO-based algorithm instead of using the original SHGO algorithm. At the first level, we enumerate the architecture parameter with the narrowest range in the search space; without fixing this parameter, we could not use a large search step size for the second-level coarse-grain search. At the second level, we use SHGO with a large search step size to search for a coarse optimum while keeping the enumerated parameter fixed. At the third level (fine-grain search), we use SHGO with the smallest search step size (i.e., 1) to search for the optimal values within the neighborhood of the coarse optimum, and add the result to the candidate set. After completing the three-level search, we compare all neural architectures in the candidate set and determine the (final) optimal architecture. To summarize, given the number of hyperparameters M and the number of possible values N of each hyperparameter, the complexity of our hierarchical SHGO-based NAS is roughly proportional to MN, i.e., O(MN).
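The three levels above can be sketched with SciPy's `shgo` implementation. The objective function, parameter names, and bounds below are invented stand-ins for the real objective (the accuracy predictor combined with the analytical hardware models), so this is an illustration of the enumerate/coarse/fine structure only:

```python
from scipy.optimize import shgo

def objective(x, n_cells):
    # Hypothetical stand-in objective: minimized at depth = 10 * n_cells
    # and width multiplier = 2, and favoring larger cell counts.
    depth, width_mult = x
    return (depth - 10 * n_cells) ** 2 + (width_mult - 2) ** 2 - n_cells

candidates = []
for n_cells in (1, 2, 3):                      # level 1: enumerate the narrow-range parameter
    coarse = shgo(objective, bounds=[(4, 40), (1, 4)], args=(n_cells,))
    d0, w0 = coarse.x                          # level 2: coarse optimum
    fine = shgo(objective,                     # level 3: refine around the coarse optimum
                bounds=[(max(4, d0 - 2), min(40, d0 + 2)),
                        (max(1, w0 - 0.5), min(4, w0 + 0.5))],
                args=(n_cells,))
    candidates.append((fine.fun, n_cells, tuple(fine.x)))

# Compare all candidates and keep the best one.
best_fun, best_cells, best_x = min(candidates)
assert best_cells == 3
```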
Experimental results in Section 4 show that our proposed hierarchical search accelerates the overall search process without any decrease in the quality of the obtained neural architecture. Moreover, our hierarchical SHGO-based algorithm involves a much smaller computational workload than the original (one-level) SHGO-based algorithm and RL-based approaches (Jiang and others, 2020); this even enables us to run NAS on a real Raspberry Pi 3B processor.
4. Experimental Results
4.1. Experimental setup
Dataset:
Existing NAS approaches show that the test accuracy of CNNs on the CIFAR-10 dataset is indicative of their test accuracy on other datasets, such as ImageNet (Dong and Yang, 2020). Hence, similar to most NAS approaches, we use CIFAR-10 as the primary dataset. Moreover, we also evaluate our framework on CIFAR-100 and Tiny-ImageNet² (²Tiny-ImageNet is a downscaled version of the ImageNet dataset with 64x64 resolution and 200 classes (Deng et al., 2009). For more details, see: http://cs231n.stanford.edu/tinyimagenet200.zip) to demonstrate the generality of our proposed NN-Degree metric and accuracy predictor.

Training Hyperparameters:
We train each of the selected neural networks five times with PyTorch and use the mean test accuracy of these five runs as the final result. All networks are trained for 200 epochs with the SGD optimizer and a momentum of 0.9. We set the initial learning rate to 0.1 and use the Cosine Annealing algorithm as the learning rate scheduler.
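The Cosine Annealing schedule mentioned above decays the learning rate from its initial value following half a cosine period over the training run; a minimal sketch (assuming a minimum learning rate of 0, which is the common default) is:

```python
import math

def cosine_annealing_lr(epoch, total_epochs=200, lr_max=0.1, lr_min=0.0):
    """Cosine Annealing schedule: decay from lr_max to lr_min over
    total_epochs along half a cosine period (lr_min = 0 assumed here)."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

assert cosine_annealing_lr(0) == 0.1                  # starts at the initial rate
assert abs(cosine_annealing_lr(100) - 0.05) < 1e-12   # halfway: half the rate
assert abs(cosine_annealing_lr(200)) < 1e-12          # decays to ~0 at the end
```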
Search Space: DenseNets are more efficient than ResNets in terms of model size and computation workload, while achieving the same test accuracy (Huang et al., 2017). Moreover, DenseNets have many more skip connections; this provides more flexibility for exploration compared to networks with addition-type skip connections (ResNets, Wide-ResNets, and MobileNets). Hence, in our experiments, we explore CNNs with DenseNet-type skip connections.
To enlarge the search space, we generate a generalized version of standard DenseNets by randomly selecting the channels to concatenate. Specifically, for a given cell, we define an upper bound on the number of skip connections that each layer can have; we use this bound to control the topological properties of the CNNs. Under this definition, a layer can receive DenseNet-type skip connections (DTSC) from at most that many channels of the previous layers within the same cell; that is, we randomly select channels from the preceding layers and concatenate them at the current layer. The concatenated channels then pass through a convolutional layer to generate the layer's output. Similar to recent NAS research (Liu et al., 2018), we select the links randomly because random architectures are often as competitive as carefully designed ones. If the skip connections encompass all-to-all connections, this setup reduces to the original DenseNet architecture (Huang et al., 2017). An important advantage of this setup is that we can control the number of DTSC to cover a vast search space with a large number of candidate DNNs.
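The random channel selection described above can be sketched as follows; the function name and the (layer, channel) encoding of candidate channels are assumptions for illustration, not the paper's implementation:

```python
import random

def pick_skip_channels(layer_idx, width, max_skip, rng=None):
    """Hypothetical sketch of DTSC generation: a layer may concatenate up
    to `max_skip` channels drawn at random from all channels produced by
    earlier layers of the same cell; channels are (layer, channel) pairs."""
    rng = rng or random.Random(0)
    candidates = [(l, c) for l in range(layer_idx) for c in range(width)]
    return rng.sample(candidates, min(max_skip, len(candidates)))

# Layer 3 of a width-4 cell has 3 * 4 = 12 candidate channels; with
# max_skip = 5, five of them are concatenated at random.
chosen = pick_skip_channels(3, 4, 5)
assert len(chosen) == 5
assert all(l < 3 and c < 4 for l, c in chosen)
```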
Like standard DenseNets, we can generalize this setup to contain multiple cells of a given width and depth; DTSC are present only within a cell and not across cells. Furthermore, following standard practice (Simonyan and Zisserman, 2014), we increase the width (i.e., the number of output channels per layer) by a factor of 2 and halve the height and width of the feature map from cell to cell. After several cells (groups) of convolutional layers, the final feature map is average-pooled and passed through a fully-connected layer to generate the logits. The width of each cell is controlled using a width multiplier (like in Wide-ResNets (Zagoruyko and Komodakis, 2016)). The base numbers of channels of the cells are [16, 32, 64]; for a width multiplier of 3, the cells have [48, 96, 192] channels per layer. To summarize, we control these parameter values to sample candidate architectures from the entire search space.

Fig. 5 illustrates a sample CNN similar to the candidate architectures in our search space (small depth and width values are used for clarity). This CNN consists of three cells, each containing several convolutional layers. The three cells have widths (i.e., numbers of channels per layer) of 2, 3, and 4, respectively. The maximum number of channels that can supply skip connections to a layer grows with the cell width and the layer's position in the cell: the first cell can have a maximum of two skip-connection candidates per layer (i.e., previous channels that can supply skip connections), the second cell a maximum of five skip-connection candidates per layer, and so on. Moreover, as mentioned before, we randomly choose the channels for skip connections at each layer. The inset of Fig. 5 shows, for a specific layer, how skip connections are created by concatenating feature maps from previous layers.
In practice, we use three cells for the CIFAR-10 dataset. We constrain the depth and the width multiplier, and we also constrain the skip-connection bound of each cell so as to balance the number of skip connections across the cells. Moreover, the maximum number of skip connections that a layer can have is the product of the cell width and the number of earlier layers whose channels it may concatenate; this maximum occurs for the last layer in a cell, which may concatenate all of the earlier output channels except those of the second-to-last layer. Hence, an upper bound on the number of skip connections follows for each cell. With these constraints, the overall search space contains more than 63 billion configurations.
Hardware Platform: The training of the sample neural architectures from the search space is conducted on an Nvidia GTX 1080 Ti GPU. We use an Intel Xeon 6230, a 20-core CPU, to simulate the hardware performance of the candidate networks and to fine-tune the accuracy predictor and the analytical hardware models. Finally, we use the same 20-core CPU to conduct the NAS process.
4.2. Accuracy Predictor
Accuracy Estimation Technique | Search Space (SS) Size | # Training Samples | RMSE (%) | Training Time (s)
GNN+MLP (Ning et al., 2020) | | 15250 | |
GNN (Lukasik et al., 2020) | | 11862 | 0.05 |
GCN (Chau et al., 2020) | | 40 | >1.8 |
GCN (Wen et al., 2020) | | 6.88x ours | 1.4 | 25 (66x ours)
FLASH (ours) | 100% of FLASH SS | 25 | 0.152 | 0.38
We first derive the NN-Degree for the neural architectures in our search space. Based on Equation 2, we substitute the number of possible skip connections with the actual number of skip connections in a cell as follows:
(13) 
In Section 3, we argued that neural architectures with higher NN-Degree values tend to provide higher test accuracy. In Fig. 6(a), we plot the test accuracy vs. NN-Degree of 60 neural networks randomly sampled from the search space for the CIFAR-10 dataset; our proposed network-topology-based metric NN-Degree is indeed indicative of the test accuracy. Furthermore, Fig. 6(b) and Fig. 6(c) show the test accuracy vs. NN-Degree of 20 networks on the CIFAR-100 dataset and 27 networks on Tiny-ImageNet, also randomly sampled from the search space. Clearly, NN-Degree predicts the test accuracy of neural networks on these two datasets as well. These results empirically support our claim in Section 3, i.e., networks with higher NN-Degree values achieve better test accuracy.
Next, we use the NN-Degree to build our analytical accuracy predictor. We train as few as 25 architectures randomly sampled from the entire search space and record their test accuracy and NN-Degree on the CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets. Then, we fine-tune our NN-Degree-based accuracy predictor described by Equation 7. As shown in Fig. 7(a), Fig. 7(b), and Fig. 7(c), our accuracy predictor achieves very high performance on all these datasets while using surprisingly few samples and only three parameters.
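Fitting a three-parameter predictor to 25 (NN-Degree, accuracy) pairs can be sketched as below. The exact functional form of Equation 7 is not reproduced here; the saturating curve `a - b * exp(-c * g)` and all numeric values are assumptions made purely for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical three-parameter form: accuracy saturates as NN-Degree g grows,
# mirroring the observation that higher NN-Degree correlates with accuracy.
def predictor(g, a, b, c):
    return a - b * np.exp(-c * g)

g = np.linspace(100, 2000, 25)           # NN-Degree of 25 sampled networks
acc = predictor(g, 95.0, 8.0, 2e-3)      # synthetic "measured" accuracies

# Fine-tune the three parameters on the 25 samples.
params, _ = curve_fit(predictor, g, acc, p0=[90.0, 10.0, 1e-3])
assert np.allclose(params, [95.0, 8.0, 2e-3], rtol=1e-3)
```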
We also compare our NN-Degree-based accuracy predictor with the current state-of-the-art approaches. As shown in Table 5, most existing approaches use graph-based neural networks to make predictions (Wen et al., 2020; Lukasik et al., 2020; Chau et al., 2020; Ning et al., 2020). However, graph-based neural networks require much more training data and are far more complicated, in terms of computation and model structure, than classical methods like logistic regression. Due to this significant reduction in model complexity, our predictor requires fewer training samples, even though it covers a much larger search space than existing work. Moreover, our NN-Degree-based predictor has only three parameters to update; hence, it needs less fine-tuning time than existing approaches. Finally, besides its low model complexity and fast training, our predictor also achieves a very small RMSE (0.152%).
During the search stage of our NAS methodology, we use the accuracy predictor to directly predict the accuracy of sampled architectures instead of performing time-consuming training. The high precision and low complexity of our accuracy predictor also enable us to adopt very fast optimization methods during the search stage. Furthermore, because our proposed metric NN-Degree can predict the test performance of a given architecture, we can use the NN-Degree as a proxy for the test accuracy and perform NAS without any time-consuming training. This training-free property allows us to quickly compare the accuracy of candidate architectures and thus accelerate the entire NAS.
4.3. NN-Degree-based Training-free NAS
Method | Search Method | #Params | Search Cost | Training needed | Test error (%)
ENAS (Pham et al., 2018) | RL + weight sharing | 4.6M | 12 GPU hours | Yes | 2.89
SNAS (Xie et al., 2019) | gradient-based | 2.8M | 36 GPU hours | Yes | 2.85
DARTS-v1 (Liu et al., 2018) | gradient-based | 3.3M | 1.5 GPU hours | Yes | 3.0
DARTS-v2 (Liu et al., 2018) | gradient-based | 3.3M | 4 GPU hours | Yes | 2.76
ProxylessNAS (Cai et al., 2019) | gradient-based | 5.7M | N/A | Yes | 2.08
Zero-Cost (Abdelfattah et al., 2021) | proxy-based | N/A | N/A | Yes | 5.78
TE-NAS (Chen et al., 2021) | proxy-based | 3.8M | 1.2 GPU hours | No | 2.63
FLASH (ours) | NN-Degree-based | 3.8M | 0.11 seconds | No | 3.13
To conduct the training-free NAS, we reformulate the problem described by Equation 1 as follows:
(14) 
To maximize this objective, we can search for the network with the maximal NN-Degree value, which eliminates the training of candidate architectures. Fig. 8 shows how we use the NN-Degree to perform training-free NAS. During the first stage, we profile a few networks on the target hardware and fine-tune our hardware performance models. During the second stage, we randomly sample candidate architectures and keep those that meet the hardware performance constraints; we use the fine-tuned analytical models to estimate the hardware performance instead of running real inference, which improves the time efficiency of the entire NAS. After that, we select as the optimal architecture the one with the highest NN-Degree value among those meeting the hardware performance constraints. We note that the NAS process itself is training-free (hence lightweight), as only the final solution needs to be trained.
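The two-stage, training-free selection described above can be sketched as follows. The `nn_degree` and `meets_constraints` functions are toy stand-ins for Equation 13 and the fine-tuned analytical hardware models, and the architecture encoding is hypothetical:

```python
import random

def nn_degree(arch):
    # Toy stand-in for the NN-Degree of Equation 13.
    depth, width = arch
    return depth * width ** 2

def meets_constraints(arch, lat_budget=10.0):
    # Toy stand-in for the analytical hardware models plus constraints.
    depth, width = arch
    return 0.1 * depth * width <= lat_budget

# Sample candidates, filter by hardware constraints, and keep the
# feasible candidate with the highest NN-Degree (no training involved).
rng = random.Random(0)
samples = [(rng.randint(4, 40), rng.randint(1, 4)) for _ in range(20000)]
feasible = [a for a in samples if meets_constraints(a)]
best = max(feasible, key=nn_degree)

assert meets_constraints(best)
assert all(nn_degree(a) <= nn_degree(best) for a in feasible)
```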
To evaluate the performance of our training-free NAS framework, we randomly sample 20,000 candidate architectures from the search space and select the one with the highest NN-Degree value as the optimal architecture. It takes only 0.11 seconds on a 20-core CPU (no GPU needed) to evaluate the NN-Degree of these 20,000 samples and obtain the optimal architecture. As shown in Table 6, the optimal architecture among these 20,000 samples achieves test performance comparable to representative time-efficient NAS approaches, at a much lower time cost and with far smaller compute requirements.
4.4. Analytical hardware performance models
Our experiments show that using 180 samples offers a good balance between the analytical models' accuracy and the number of fine-tuning samples. Hence, we randomly select 180 neural architectures from the search space to build our analytical hardware performance models. Next, we perform inference of these 180 networks on our simulator (Krishnan and others, 2021) to obtain their area, latency, and energy consumption. After obtaining the hardware performance of the 180 sample networks, we fine-tune the parameters of our analytical area, latency, and energy models discussed in Section 3. To evaluate the fine-tuned models, we randomly select another 540 architectures from the search space, conduct inference, and obtain their hardware performance.
Table 7 summarizes the performance of our analytical models; the mean estimation error is always less than 4%. Fig. 9 shows the hardware performance estimated by our analytical models for the ImageNet dataset. We observe that the estimates coincide with the values measured in simulation. Our analytical models thus yield very accurate predictions of hardware performance at a time cost of less than 1 second on a 20-core CPU. This high accuracy and low computational workload enable us to use the analytical models directly during the search stage instead of conducting real inference.
Model | #Features | Mean Error (%) | Max Error (%) | Fine-tuning Time (s)
Area | 2 | 0.1 | 0.2 | 0.49
Latency | 9 | 3.0 | 20.8 | 0.52
Energy | 16 | 3.7 | 24.4 | 0.56
Estimation Error (%) | SVM | Random Forest (max depth 16) | FLASH Analytical Models
Latency | 58.98 | 8.23 | 6.7
Energy | 78.49 | 11.01 | 3.5
Area | 36.99 | 13.37 | 1.7
Comparison with other machine learning models:
Table 8 compares the estimation error of SVM, a random forest with a maximum tree depth of 16, and the proposed analytical hardware models for the ImageNet dataset. A maximum tree depth of 16 is chosen for the random forest since it provides the best accuracy among random forest models. We observe that our proposed analytical hardware models achieve the smallest error among the three modeling techniques. SVM performs poorly since it tries to fit the data with a hyperplane, and no such plane may exist given the complex relationship between the features and the performance of the hardware platform.
4.5. On-chip communication optimization
As shown in Fig. 10 and Fig. 11, we compare the NoC performance (area, energy, and latency) of FLASH's mesh-NoC against the cmesh-NoC (Shafiee and others, 2016) for 16 networks randomly selected from the search space, for the CIFAR-10 and ImageNet datasets, respectively. We observe that the mesh-NoC occupies on average only 37% of the area and consumes only 41% of the energy of the cmesh-NoC. Since the cmesh-NoC uses extra links and repeaters to connect diagonal routers, its area and energy are significantly higher than those of the mesh-NoC. The additional links and routers in the cmesh-NoC result in lower hop counts than the mesh-NoC; however, the lower hop count reduces latency only at low congestion. As the congestion in the NoC increases, the latency of the cmesh-NoC becomes higher than that of the mesh-NoC due to the increased utilization of the additional links; this phenomenon is also demonstrated in (Grot and Keckler, 2008). Therefore, the communication latency with the cmesh-NoC is higher than with the mesh-NoC for most of the DNNs; on average, the communication latency of the mesh-NoC is within 3% of that of the cmesh-NoC. Moreover, we observe that the average utilization of the queues in the mesh-NoC varies between 20%-40% for the ImageNet dataset, while the maximum utilization ranges from 60% to 80%; the mesh-NoC is therefore heavily congested. Thus, our proposed communication optimization strategy outperforms the state-of-the-art approaches.
4.6. Hierarchical SHGO-based neural architecture search
After fine-tuning the NN-Degree-based accuracy predictor and the analytical hardware performance models, we use our hierarchical SHGO-based algorithm to perform the neural architecture search.
Baseline approaches: Reinforcement Learning (RL) is widely used in NAS (Jiang and others, 2020; Hsu and others, 2018; Zoph et al., 2018); hence, we implemented an RL-based NAS framework as a baseline that optimizes the objective function in Equation 1. Specifically, we adopt a deep-Q network approach for the baseline-RL (Mnih and others, 2013). We construct four different controllers for the number of cells, the cell depth, the width multiplier, and the number of long skip connections. The training hyperparameters for the baseline-RL are shown in Table 9. The baseline-RL approach estimates the optimal values of these parameters; we tune it to obtain the best possible results. We also implement a one-level SHGO algorithm (i.e., the original SHGO) as another baseline to show the efficiency of our hierarchical algorithm.
Metric | Value | Metric | Value
Number of layers | 3 | Learning rate | 0.001
Number of neurons in each layer | 20 | Activation | softmax
Optimizer | ADAM | Loss | MSE
Constraints involved? | Method | Search cost (#Samples) | Search Time (s) | Quality of obtained model (Eq. 1) | Converge?
No | RL | 10000 | 1955 | 20984 | Yes
No | one-level SHGO | 23 | 0.03 | 20984 | Yes
No | hierarchical SHGO (FLASH) | 69 | 0.07 | 20984 | Yes
No | Improvement | 144.93x | 27929x | |
Yes | RL | >10000 | | | No
Yes | one-level SHGO | 1195 | 3.82 | 10550 | Yes
Yes | hierarchical SHGO (FLASH) | 170 | 0.26 | 11969 | Yes
Yes | Improvement | 7.03x | 14.7x | 1.13x |
We compare the baseline-RL approach with our proposed SHGO-based optimization. As shown in Table 10, when there are no constraints on accuracy and hardware performance, our hierarchical SHGO-based algorithm brings negligible overhead compared to the one-level SHGO algorithm. Moreover, our hierarchical SHGO-based algorithm needs far fewer samples during the search process than the RL-based method. Our search completes in as little as 0.07 seconds, 27929x faster than the RL-based method, while achieving the same quality of solution. As for searching with specific constraints, the RL-based method does not even converge after training with 10000 samples. Furthermore, our hierarchical SHGO-based algorithm obtains a better-quality model with fewer samples and less search time than the one-level SHGO algorithm. These results show that our hierarchical strategy further improves the efficiency of the original SHGO algorithm.

Constraints involved? | Method | Search cost (#Samples) | Search Time (s) | Quality of obtained model
 | | RPi3B | MC1 | RPi3B | MC1 | RPi3B | MC1
No | one-level SHGO | 112 | 113 | 1.68 | 0.71 | 4.74 | 4.13
No | hierarchical SHGO (FLASH) | 180 | 135 | 2.21 | 0.45 | 4.74 | 4.13
Yes | one-level SHGO | 1309 | 1272 | 45.98 | 9.65 | 0.35 | 0.38
Yes | hierarchical SHGO (FLASH) | 261 | 414 | 2.33 | 1.32 | 0.48 | 0.57
Yes | Improvement | 5.01x | 3.07x | 19.73x | 20.5x | 1.37x | 1.51x
4.7. Case study: Raspberry Pi and Odroid MC1
As discussed in the previous sections, each component and stage of FLASH is very efficient in terms of both computation and time cost. To further demonstrate the efficiency of the FLASH methodology, we implement FLASH on two typical edge devices, namely the Raspberry Pi 3 Model B (RPi3B) and the Odroid MC1 (MC1).
Setup:
The RPi3B has an Arm Cortex-A53 quad-core processor with a nominal frequency of 1.2 GHz and 1 GB of RAM. We use the Odroid Smart Power 2 to measure voltage, current, and power, and TensorFlow Lite (TFLite) as the runtime framework on the RPi3B. To this end, we first define the architecture of the models in TensorFlow (TF), then convert the TF model into the TFLite format and generate the binary file deployed on the RPi3B.
The Odroid MC1 is powered by the Exynos 5422, a heterogeneous multi-processor system-on-chip (MPSoC). This SoC consists of two clusters of Arm cores and a small GPU core. Apart from the hardware platform itself, we use the same setup as for the RPi3B.
Accuracy predictor and analytical hardware performance models: We adopt the same accuracy predictor used in Section 4.6. We consider only latency and energy consumption as the hardware performance metrics because the chip area is fixed. Hence, the objective function for searching on the RPi3B and MC1 is:
(15) 
To fine-tune the analytical latency and energy models, we randomly select 180 sample networks from the search space, convert them into the TFLite format, and record their latency and energy consumption on the RPi3B. Based on the recorded data, we update the parameters of the analytical latency and energy models. Figs. 12 and 13 show that the estimates of our analytical hardware performance models almost coincide with the real performance on both the RPi3B and MC1.
Search process on RPi3B and MC1: We do not report results for RL-based methods because training RL models requires intensive computation resources; thus, they cannot be deployed on the RPi3B and MC1. As shown in Table 11, for searching without any constraints, our hierarchical SHGO-based algorithm has only a minimal overhead compared with the basic (one-level) SHGO algorithm; on the MC1, it is in fact faster than the one-level SHGO algorithm.
For searching with constraints, the hierarchical SHGO-based algorithm obtains a better-quality model with fewer samples and less search time on the RPi3B; we achieve similar improvements on the MC1. These results again demonstrate the effectiveness of our hierarchical strategy. Overall, the total search time is as short as 2.33 seconds on the RPi3B and 1.32 seconds on the MC1, i.e., on resource-constrained edge devices. To the best of our knowledge, this is the first time a neural architecture search has been reported directly on edge devices.
5. Conclusions and Future Work
This paper presented FLASH, a very fast methodology that improves the time efficiency of NAS. To this end, we proposed a new topology-based metric, the NN-Degree. Using the NN-Degree, we built an analytical accuracy predictor trained with as few as 25 samples out of a vast search space with more than 63 billion configurations. Our accuracy predictor achieves the same performance as state-of-the-art approaches with 6.88x fewer samples and a large reduction in fine-tuning time. We also optimized the on-chip communication by designing a mesh-NoC for communication across multiple layers; based on the optimized hardware, we built new analytical models to predict the area, latency, and energy consumption.
Combining the accuracy predictor and the analytical hardware performance models, we developed a hierarchical Simplicial Homology Global Optimization (SHGO)-based algorithm to optimize the co-design process while considering both the test accuracy and the area, latency, and energy figures of the target hardware. Finally, we demonstrated that our hierarchical SHGO-based algorithm enables NAS that is 27929x faster (less than 0.1 seconds) than state-of-the-art RL-based approaches. We also showed that FLASH can be readily transferred to other hardware platforms by performing NAS on a Raspberry Pi 3B and an Odroid MC1 in less than 3 seconds. To the best of our knowledge, our work is the first to report NAS performed directly and efficiently on edge devices.
We note that there is no fundamental limitation to applying FLASH to other machine learning tasks. However, no IMC-based architectures are widely adopted yet for tasks such as speech recognition or object segmentation. Therefore, the current work focuses on DNN inference and leaves the extension to other machine learning tasks as future work. Finally, we plan to incorporate more types of networks, such as ResNet and MobileNet-v2, as part of our future work.
6. Acknowledgments
This work was supported in part by the US National Science Foundation (NSF) grant CNS2007284, and in part by Semiconductor Research Corporation (SRC) grants GRC 2939.001 and 3012.001.
References
Zero-Cost Proxies for Lightweight NAS. In International Conference on Learning Representations.
Designing Neural Network Architectures using Reinforcement Learning. arXiv preprint arXiv:1611.02167.
Scale-Free Networks. Scientific American 288 (5), pp. 60-69.
A Comprehensive Survey on Hardware-Aware Neural Architecture Search. arXiv preprint arXiv:2101.09336.
How Does Topology Influence Gradient Propagation and Model Performance of Deep Networks with DenseNet-type Skip Connections? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Language Models are Few-Shot Learners. arXiv preprint arXiv:2005.14165.
Once-for-All: Train One Network and Specialize It for Efficient Deployment. In International Conference on Learning Representations.
ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware. In International Conference on Learning Representations.
BRP-NAS: Prediction-based NAS using GCNs. arXiv preprint arXiv:2007.08668.
NeuroSim: A Circuit-Level Macro Model for Benchmarking Neuro-Inspired Architectures in Online Learning. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37 (12), pp. 3067-3080.
Neural Architecture Search on ImageNet in Four GPU Hours: A Theoretically Inspired Perspective. In International Conference on Learning Representations.
Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 257-266.
Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. arXiv preprint arXiv:1602.02830.
Grow and Prune Compact, Fast, and Accurate LSTMs. IEEE Transactions on Computers 69 (3), pp. 441-452.
ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248-255.
NAS-Bench-201: Extending the Scope of Reproducible Neural Architecture Search. arXiv preprint arXiv:2001.00326.
Neural Architecture Search: A Survey. The Journal of Machine Learning Research 20 (1), pp. 1997-2017.
A Simplicial Homology Algorithm for Lipschitz Optimisation. Journal of Global Optimization 72 (2), pp. 181-217.
Scalable On-Chip Interconnect Topologies. In 2nd Workshop on Chip Multiprocessor Memory Systems and Interconnects.
Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv preprint arXiv:1510.00149.
Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778.
Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531.
1.1 Computing's Energy Problem (and What We Can Do About It). In IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10-14.
MONAS: Multi-Objective Neural Architecture Search using Reinforcement Learning. arXiv preprint arXiv:1806.10332.
Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700-4708.
A Detailed and Flexible Cycle-Accurate Network-on-Chip Simulator. In IEEE ISPASS, pp. 86-96.
Device-Circuit-Architecture Co-Exploration for Computing-in-Memory Neural Accelerators. IEEE Transactions on Computers.
Interconnect-Aware Area and Energy Optimization for In-Memory Acceleration of DNNs. IEEE Design & Test 37 (6), pp. 79-87.
Interconnect-Centric Benchmarking of In-Memory Acceleration for DNNs. In 2021 China Semiconductor Technology International Conference (CSTIC), pp. 1-4.
EDD: Efficient Differentiable DNN Architecture and Implementation Co-Search for Embedded AI Solutions. In Proceedings of the 57th ACM/EDAC/IEEE Design Automation Conference.
Progressive Neural Architecture Search. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19-34.
DARTS: Differentiable Architecture Search. arXiv preprint arXiv:1806.09055.
Neural Architecture Performance Prediction Using Graph Neural Networks. arXiv preprint arXiv:2010.10024.
A Latency-Optimized Reconfigurable NoC for In-Memory Acceleration of DNNs. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 10 (3), pp. 362-375.
Foundations of Statistical Natural Language Processing. MIT Press.
Learning Scheduling Algorithms for Data Processing Clusters. In ACM Special Interest Group on Data Communication, pp. 270-288.
Hardware-Aware Machine Learning: Modeling and Optimization. In Proceedings of the International Conference on Computer-Aided Design, ICCAD '18.
Playing Atari with Deep Reinforcement Learning. arXiv preprint arXiv:1312.5602.
The Structure and Dynamics of Networks. Princeton University Press.
A Generic Graph-based Neural Architecture Encoding Scheme for Predictor-based NAS.
Inference Engine Benchmarking Across Technological Platforms from CMOS to RRAM. In Proceedings of the International Symposium on Memory Systems, pp. 471-479.
Efficient Neural Architecture Search via Parameter Sharing. In International Conference on Machine Learning, pp. 4095-4104.
AtomLayer: A Universal ReRAM-based CNN Accelerator with Atomic Layer Computation. In IEEE/ACM DAC.
Large-Scale Evolution of Image Classifiers. In International Conference on Machine Learning, pp. 2902-2911.
MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510-4520.
ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars. In ACM/IEEE ISCA.
Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556.
PipeLayer: A Pipelined ReRAM-based Accelerator for Deep Learning. In IEEE HPCA, pp. 541-552.
Single-Path NAS: Designing Hardware-Efficient ConvNets in Less Than 4 Hours. arXiv preprint arXiv:1904.02877.
MnasNet: Platform-Aware Neural Architecture Search for Mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2820-2828.
Residual Networks Behave Like Ensembles of Relatively Shallow Networks. arXiv preprint arXiv:1605.06431.
Neural Network Inference on Mobile SoCs. IEEE Design & Test 37 (5), pp. 50-57.
Collective Dynamics of 'Small-World' Networks. Nature 393 (6684), pp. 440-442.
Neural Predictor for Neural Architecture Search. In European Conference on Computer Vision, pp. 660-676.
A Survey on Neural Architecture Search. arXiv preprint arXiv:1905.01392.
FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10734-10742.
 SNAS: stochastic neural architecture search. In International Conference on Learning Representations, Cited by: Table 6.
 BigNAS: scaling up neural architecture search with big singlestage models. In Computer Vision – ECCV 2020, pp. 702–717. Cited by: §2.
 Wide Residual Networks. arXiv preprint arXiv:1605.07146. Cited by: §3.3, §4.1.
 Neural Architecture Search with Reinforcement Learning. arXiv preprint arXiv:1611.01578. Cited by: §1, §2.
 Learning Transferable Architectures for Scalable Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8697–8710. Cited by: §4.6.