Following the great success of deep neural networks, we are now witnessing the democratization of Artificial Intelligence (AI), which involves various machine learning tasks (e.g., image classification, video segmentation, speech recognition) [1, 2], numerous applications (e.g., automotive vehicles, robots, health care) [3, 4], and different hardware platforms (e.g., CPUs, GPUs, FPGAs, ASICs) [5, 6, 7, 8, 9]. One of the most important questions in the AI democratization era is: given a dataset with a specified machine learning task, how can we efficiently identify the best neural network architecture and hardware design, such that network accuracy and hardware efficiency are maximized simultaneously?
Neural architecture search (NAS) has been proposed to liberate human labor in the design of neural architectures by automatically identifying their hyperparameters. However, such an approach does not take hardware into consideration, so the identified architecture may easily be useless because it violates the required hardware specifications. In response, hardware-aware NAS [16, 17, 18, 19] has been proposed, in which the hardware specifications are considered during the search process. To further improve hardware efficiency, co-exploration of neural architectures and hardware design has been proposed, showing that the Pareto frontier between network accuracy and hardware efficiency can be pushed further by opening the hardware design space. However, all of these works are based on the conventional von Neumann architecture (e.g., mobile platforms or FPGAs), so memory accesses inevitably become the performance bottleneck due to the well-known memory wall.
Computing-in-memory (CIM) has been shown to effectively overcome this memory wall, and it is considered a promising candidate for neural network computation because of two architectural benefits. (i) The CIM architecture can exploit the fixed memory access pattern of neural network computation to execute operations in place. (ii) Emerging devices (e.g., ReRAM, STT-RAM) can be efficiently leveraged in the in-memory computing architecture to provide high performance and energy efficiency. In [24, 25], MOSFET-based in-memory processing is employed for neural network computation, and improvements in energy and delay are observed compared with conventional von Neumann architectures. Other studies [23, 26] leverage in-memory computing schemes based on emerging devices to construct crossbar architectures that perform matrix multiplication in the analog domain, which further improves computation metrics such as area, energy, and delay.
Most of the existing works on CIM neural accelerator design simply map classic neural networks (e.g., LeNet, AlexNet) to the CIM platform to evaluate their design and compare it against other counterparts. However, without optimizing the neural architectures, the reported metrics (i.e., accuracy, latency, energy, etc.) may be far from optimal. In this work, we bring CIM neural accelerator design together with neural architecture search, aiming to automatically identify the best device, circuit, and neural architecture tuple with maximized network accuracy and hardware efficiency. To the best of our knowledge, this is the first work to carry out device-circuit-architecture co-exploration for CIM neural accelerators.
The novel device-circuit-architecture co-exploration brings opportunities to boost performance; however, it also introduces several new challenges. First, unlike neural architecture co-exploration based on the conventional von Neumann architecture, the design space of a CIM-based neural accelerator spans multiple layers, from device type and circuit topology to neural architecture. Second, limited by the computing capacity of each device cell, quantization is essential to improve hardware efficiency [27, 28, 29]; as such, quantization has to be determined automatically during the search process. Third, in addition to the hardware efficiency goals for mobile platforms and FPGAs, CIM has extra objectives, such as minimizing area and maximizing lifetime. Last but not least, emerging devices commonly exhibit non-ideal behavior (known as device variation); that is, if we directly map trained DNN models to the architecture without considering device variation, a dramatic accuracy loss will be observed, rendering the architecture useless.
This paper proposes a device-circuit-architecture co-exploration framework, namely NACIM, to automatically identify the best CIM neural accelerators, including the device type, circuit topology, and neural architecture hyperparameters. The NACIM framework iteratively conducts explorations based on a reward function, which makes it suitable for reinforcement learning approaches or evolutionary algorithms. By configuring the parameters of the framework, designers can customize the optimization goals according to their demands. Furthermore, the framework takes device variation into account: in the forward path of our training framework, we incorporate the variation into the computation based on the device noise model. Experimental results show that the proposed NACIM framework can find a robust neural network with only 0.45% accuracy loss in the presence of device variation, compared with a 76.44% loss from the state-of-the-art NAS that does not consider device variation. In addition, NACIM can significantly push forward the Pareto frontier in terms of the tradeoff between accuracy and hardware efficiency, achieving up to 16.3 TOPs/W energy efficiency, a 3.17× improvement.
The main contributions of this work are listed as follows.
We formally define the optimization problem of identifying the best computing-in-memory (CIM) neural accelerator, whose design space spans across device type, circuit topology to neural architecture. To the best of our knowledge, this is the first work on optimizing CIM neural accelerators together with neural architecture search.
We have proposed a novel device-circuit-architecture co-exploration framework, namely NACIM, to simultaneously optimize network accuracy and hardware efficiency. The framework further optimizes the quantization to boost the hardware efficiency and considers the device variation to identify the robust neural architectures.
We implement the NACIM framework using a reinforcement learning approach and evaluate it on the commonly used datasets. Experimental results demonstrate the efficacy of the proposed framework in identifying the robust neural architectures in terms of device variation and pushing forward the Pareto frontier between accuracy and efficiency.
The remainder of the paper is organized as follows. Section II presents the background of both neural architecture search and computing-in-memory architectures. Section III demonstrates the search space of five layers, and formally defines the cross-layer optimization problem. The proposed novel cross-layer optimization framework is presented in Section IV. Experimental results are shown in Section V. Finally, concluding remarks are given in Section VI.
II-A System-Level Overview
Figure 1 gives an overview of extending the conventional neural architecture search framework to optimize neural architectures for computing-in-memory architectures based on non-volatile devices. Specifically, the neural architecture search process is first performed on GPUs, which involves training new models from scratch to generate the reward. After the search process converges, the identified neural network architecture is deployed on the target computing-in-memory architecture. However, as shown in Figure 1, there is a missing link between the neural architecture search process and the computing-in-memory neural accelerator design. We introduce neural architecture search and the computing-in-memory platform in the following subsections.
II-B Neural Architecture Search
Most recently, Neural Architecture Search (NAS) has been consistently achieving breakthroughs in different machine learning applications, such as image classification, image segmentation, and video action recognition. NAS attracts considerable attention mainly because it relieves the need for human expertise and labor in identifying high-accuracy neural architectures.
A typical NAS framework is composed of a controller and a trainer. The controller iteratively predicts neural architecture parameters, which define a child network, and the trainer trains the child network from scratch on a held-out dataset to obtain its accuracy. The accuracy is then fed back to update the controller. Finally, after the number of child networks predicted by the controller exceeds a predefined threshold, the search process terminates. Among all of the searched neural architectures, the one with the highest accuracy is selected.
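The controller-trainer loop described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation used in this work: the names `SEARCH_SPACE`, `sample_child`, `train_and_evaluate`, and `search` are hypothetical, the controller is replaced by uniform random sampling, and the trainer is a stub that returns a synthetic score instead of training a network.

```python
import random

# Hypothetical per-layer choices; real search spaces are task specific.
SEARCH_SPACE = {"filter_size": [1, 3, 5, 7], "channels": [24, 36, 48, 64]}


def sample_child(rng, num_layers):
    """Stand-in for the controller: pick parameters for every layer."""
    return [{k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
            for _ in range(num_layers)]


def train_and_evaluate(child):
    """Stub trainer that returns a synthetic accuracy in [0, 1).

    A real trainer would train the child network from scratch and
    report its accuracy on held-out data.
    """
    score = sum(layer["channels"] * layer["filter_size"] for layer in child)
    return (score % 1000) / 1000.0


def search(max_children, num_layers=3, seed=0):
    """Run the search until the child-network budget is exhausted."""
    rng = random.Random(seed)
    best_child, best_acc = None, -1.0
    history = []
    for _ in range(max_children):
        child = sample_child(rng, num_layers)
        acc = train_and_evaluate(child)
        history.append(acc)  # feedback a learned controller would use
        if acc > best_acc:
            best_child, best_acc = child, acc
    return best_child, best_acc, history
```

In a learned controller, the recorded accuracies would drive a parameter update after each prediction rather than only being stored.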
Existing work has demonstrated that automatically searched neural architectures can achieve accuracy close to that of the best human-invented architectures [10, 11]. However, the identified architectures may have complicated structures, which render them useless in real-world applications. For instance, they may result in excessive bandwidth requirements when performing secure inference.
In this paper, we consider the crossbar as the basic compute-in-memory engine. We discuss the devices used in this work and their non-ideal behavior. We also introduce NeuroSim, the framework we use to simulate crossbar computation.
II-C1 Device and its variations
Non-volatile devices have been widely adopted in crossbar computation. When a crossbar is used to perform inference, different device implementations lead to different energy, latency, etc. Here, we consider two factors: (1) the number of levels to which the device can be configured, and (2) the non-ideal behavior of the device. Both binary devices and multi-level devices are used in existing crossbar computation. For multi-level devices, existing work reports 4-bit (i.e., 16-level) devices with good distinction among the different levels. Besides multi-level devices, binary devices (STT-MRAM, etc.) are also considered in our implementation. Different kinds of devices may affect the on and off currents of the crossbar computation and ultimately impact the delay, energy, etc. Different numbers of levels in these devices also require different peripheral circuitry in the crossbar architecture, which is another design space we consider in this work.
These emerging devices also suffer from various errors. When the circuitry is used for inference, device-to-device variation can be the dominant error source. The variation can be introduced in the fabrication process and in the device programming phase. The other dominant source of error is noise. Among the noise sources, random telegraph noise (RTN) in particular is a main source; it is caused by electrons being temporarily trapped within the device, which in turn changes the effective conductance of the device. Other noise sources include thermal noise and shot noise; however, they are typically much smaller than RTN. In this work, we model the device variation as a whole and use a Gaussian distribution to represent it. The magnitude of the variation is taken from prior work, in which the variations are obtained from actual measurements.
II-C2 Crossbar Architecture
Different crossbar-based architectures have been proposed [23, 26]. We assume an ISAAC-like architecture in our simulation. The architecture is highly parallel with multiple tiles, and each tile contains multiple crossbar arrays. The computation is performed in the analog domain, with ADCs and DACs converting signals from and to the analog domain. We assume that all the weights can be mapped to the crossbar arrays; therefore, no reprogramming of the weights is needed during the computation.
DNN+NeuroSim is an integrated framework built for emulating deep neural network (DNN) inference performance or on-chip training performance on hardware accelerators based on near-memory computing or in-memory computing architectures. Various device technologies are supported, including SRAM, emerging non-volatile memory (eNVM) based on resistance switching (e.g., RRAM, PCM, STT-MRAM), and ferroelectric FET (FeFET). SRAM is by nature 1-bit per cell, while eNVMs and FeFET in this simulator can support either 1-bit or multi-bit per cell. NeuroSim is a circuit-level macro model for benchmarking neuro-inspired architectures (including memory array, peripheral logic, and interconnect routing) in terms of circuit-level performance metrics, such as chip area, latency, dynamic energy, and leakage power. With PyTorch and TensorFlow wrappers, the DNN+NeuroSim framework supports a hierarchical organization from the device level (transistors from 130 nm down to 7 nm, eNVM and FeFET device properties), to the circuit level (peripheral circuit modules such as analog-to-digital converters, ADCs), to the chip level (tiles of processing elements built from multiple sub-arrays, plus global interconnect and buffer), and then to the algorithm level (different convolutional neural network topologies), enabling instruction-accurate evaluation of the inference accuracy as well as the circuit-level performance metrics at inference run-time.
III Problem Definition
Figure 2 illustrates the cross-layer optimization from application to hardware. Our ultimate goal is to implement the inference of a neural network on a computing-in-memory platform. Such an implementation involves optimization across 5 layers: neural architecture search, quantization determination, data flow, circuit design, and device selection. In the following, we first introduce the definitions for each layer and then formally define the optimization problem.
(a) Neural Architecture: As shown in Figure 2, a neural architecture is composed of a set of layers, and the number of layers in the neural architecture is the size of that set. A layer can be a convolutional layer, a fully connected layer, etc. In order to automatically identify the neural architecture, we parameterize each layer to form a search space. For each layer, a parameter set contains the predictable parameters, such as the number of filters and the filter size for a convolutional layer, and the number of neurons for a fully connected layer. After the parameters of all layers are determined, we obtain a neural architecture, called a child network. The accuracy of the child network can be obtained by training it on a held-out dataset. For illustration purposes, we use a linear chain of layers as an example. However, the proposed technique is not limited to such a structure and is applicable to more complicated structures, such as a Directed Acyclic Graph (DAG), which can represent residual connections.
(b) Quantization: For each layer of the neural architecture, we can apply a different data precision for computation. The quantization of a neural architecture consists of a quantization for activations and a quantization for weights. For each layer, the activation quantization specifies the number of bits used to represent the integer part and the number of bits used to represent the fraction part of the activation data; the weight quantization is defined similarly. Figure 1 (b) illustrates two quantization instances for a 4-layer neural architecture, where the number above the x-axis indicates the bit-width of the integer part and the number below the x-axis indicates that of the fraction part.
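As an illustration of the integer/fraction bit-width notation, the sketch below rounds a value to a fixed-point grid. It is a hedged example rather than the quantizer used in this work: the function names `quantize` and `quantize_layer` are hypothetical, and the sketch assumes that the sign is handled separately, so that only the magnitude is clipped to the representable range.

```python
def quantize(x, int_bits, frac_bits):
    """Round x to a fixed-point grid with the given integer/fraction bits.

    Assumption: the sign is stored separately, so the magnitude is
    clipped to [0, 2**int_bits - 2**-frac_bits].
    """
    step = 2.0 ** -frac_bits              # resolution of the fraction part
    max_mag = 2.0 ** int_bits - step      # largest representable magnitude
    mag = min(round(abs(x) / step) * step, max_mag)
    return mag if x >= 0 else -mag


def quantize_layer(values, int_bits, frac_bits):
    """Apply the same (integer, fraction) setting to every value of a layer."""
    return [quantize(v, int_bits, frac_bits) for v in values]
```

For example, with 1 integer bit and 2 fraction bits the grid step is 0.25 and the largest magnitude is 1.75, so 0.3 is stored as 0.25 and 5.0 saturates at 1.75. Different layers may call `quantize_layer` with different settings for activations and weights.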
(c) Data Flow: The data flow is the intermediate layer between software (neural architecture) and hardware (circuit and device). In this work, we apply the weight-stationary data flow, which is commonly used in computing-in-memory architectures. The basic idea is as follows. First, for the convolution operation, the weights of a kernel are expanded and spread vertically over the memory cells of the crossbar, while for a fully connected layer, the weights for each output neuron are spread vertically over the crossbar. Second, the activation (i.e., the input feature map, IFM, or the input neurons) is fed horizontally into the crossbar. Third, at each cycle, a dot product is performed between the fed activations and the stationary weights to obtain the partial sums of the outputs, and these are accumulated on top of the previously obtained partial sums. Figure 2 (c) shows the above details for both the convolution operation (left-hand side) and the fully connected operation (right-hand side).
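The weight-stationary mapping can be illustrated with a small functional model. The sketch below is an idealized, digital stand-in for the analog crossbar and rests on several assumptions: all function names are hypothetical, each kernel fits in a single column (so no accumulation across arrays is needed), stride is 1, and there is no padding.

```python
def flatten_kernel(kernel):
    """Unroll a C x K x K kernel into one crossbar column (a flat list)."""
    return [w for channel in kernel for row in channel for w in row]


def crossbar_mvm(columns, inputs):
    """Each column holds stationary weights; the inputs drive the rows.

    Returns one dot product (partial sum) per column.
    """
    return [sum(w * a for w, a in zip(col, inputs)) for col in columns]


def extract_patch(ifm, top, left, k):
    """Collect a C x K x K input window in the same order as the kernel."""
    return [ifm[c][top + i][left + j]
            for c in range(len(ifm)) for i in range(k) for j in range(k)]


def conv2d_weight_stationary(ifm, kernels):
    """Convolution where the weights are placed once and inputs stream in."""
    k = len(kernels[0][0])                                  # kernel height/width
    columns = [flatten_kernel(kern) for kern in kernels]    # programmed once
    h, w = len(ifm[0]), len(ifm[0][0])
    out = [[[0] * (w - k + 1) for _ in range(h - k + 1)] for _ in kernels]
    for top in range(h - k + 1):
        for left in range(w - k + 1):
            sums = crossbar_mvm(columns, extract_patch(ifm, top, left, k))
            for oc, s in enumerate(sums):
                out[oc][top][left] = s
    return out
```

A fully connected layer is the special case in which `crossbar_mvm` is called once, with one column per output neuron.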
(d) Circuit: Figure 2 (d) shows the chip hierarchy. A chip is composed of a tile array, a PE array, a synaptic array, and the device. The top level of the chip is a network-on-chip (NoC) based tile array, whose parameters include the size of the global buffer and the bandwidth of a link on the NoC. Similarly, a tile is composed of a PE array, and a PE is composed of a synaptic array. In the synaptic array, each cell is a device, which is selected from a set of available devices defined as follows.
(e) Device: Different devices can be employed in the circuit. The device is selected from a set of available devices (e.g., ReRAM, FeFET, STT-MRAM, as shown in Figure 2 (e)). Each device is characterized by the number of bits it can store in one cell (for instance, a 4-bit ReRAM stores 4 bits per cell) and by a variation function, which is based on existing work. Note that if the weight bit-width of a layer, as given by its quantization, is larger than the number of bits per cell, we adopt shift-and-add circuitry at the periphery and use multiple devices to represent each weight; otherwise, we employ one device to store each weight. By leveraging the shift-and-add operation, we can support an arbitrary number of bits, which supports design space exploration when applying NAS to the crossbar.
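The shift-and-add idea can be checked with integer arithmetic. The following sketch is an idealized illustration under stated assumptions: the names `split_weight` and `shift_and_add_dot` are hypothetical, weights are unsigned integers, and device variation and ADC effects are ignored.

```python
def split_weight(weight, total_bits, cell_bits):
    """Split an unsigned integer weight into cell-sized slices, LSB first."""
    num_cells = -(-total_bits // cell_bits)       # ceiling division
    mask = (1 << cell_bits) - 1
    return [(weight >> (i * cell_bits)) & mask for i in range(num_cells)]


def shift_and_add_dot(weights, inputs, total_bits, cell_bits):
    """Dot product computed slice by slice and recombined by shifting."""
    slices = [split_weight(w, total_bits, cell_bits) for w in weights]
    num_cells = len(slices[0])
    result = 0
    for i in range(num_cells):
        # One crossbar column holds slice i of every weight.
        partial = sum(s[i] * a for s, a in zip(slices, inputs))
        result += partial << (i * cell_bits)      # weight the slice by 2**(i * cell_bits)
    return result
```

With 8-bit weights and 4-bit cells, each weight occupies two devices, and the recombined result equals the ordinary dot product.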
Problem Statement: Based on the definition of each layer, we formally define the problem solved in this work as follows: Given a dataset (e.g., CIFAR-10), a machine learning task (e.g., image classification), and a set of available devices, we are going to determine:
the neural architecture for the machine learning task;
the quantization of each layer in the architecture;
the device, selected from the set of available devices, used for the chip design;
the circuit design based on the selected device;
Objective: such that the inference accuracy of the machine learning task on the resultant circuit is maximized, while the hardware efficiency (e.g., latency, energy efficiency, area, etc.) is also maximized. Note that since the above optimization problem has multiple objectives, we further propose a framework in the next section that allows designers to specify the metrics to be optimized (e.g., simultaneously maximizing accuracy while minimizing latency and area).
IV Cross-Layer Exploration Framework
Figure 3 gives an overview of the proposed Neural Architecture and Computing-in-Memory Architecture Co-Exploration Framework, namely NACIM, which solves the problem defined in Section III. NACIM contains 4 components: ➀ a controller, ➁ an optimizer selector, ➂ a network accuracy evaluator, and ➃ a hardware performance evaluator.
➀Controller. The controller is a core component of the NACIM framework. It predicts the hyperparameters of the neural architecture, quantization, and device according to the network accuracy and hardware performance reported by the evaluators. These metrics form a reward function for updating the controller. The reward function combines the prediction accuracy with the hardware performance (latency, energy, and area), weighted by a scaling parameter. The merge function that aggregates the hardware metrics can either be a simple weighted sum or a more advanced function defined by the user. In the experiment section of this work, we adopt a weighted sum for this function.
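One possible shape of such a reward is sketched below. It is an assumption-laden illustration and should not be read as the exact formula of this work: the convex combination through `alpha`, the normalization of each cost against a `budgets` value, and all function and parameter names are hypothetical choices made for the example.

```python
def merge_weighted_sum(metrics, budgets, weights):
    """Weighted sum of normalized hardware scores (higher is better).

    Each metric (latency, energy, area) is a cost, so it is mapped to
    1 - cost / budget and clipped to [0, 1]. This normalization is an
    assumption made for the sketch.
    """
    total = 0.0
    for name, weight in weights.items():
        score = 1.0 - metrics[name] / budgets[name]
        total += weight * max(0.0, min(1.0, score))
    return total


def reward(accuracy, metrics, budgets, weights, alpha=0.5):
    """Combine accuracy and hardware score; alpha is the assumed scaling parameter."""
    hw_score = merge_weighted_sum(metrics, budgets, weights)
    return alpha * accuracy + (1.0 - alpha) * hw_score
```

Setting a single non-zero entry in `weights` makes the merge function return one metric only, which corresponds to a bi-objective search; under this assumed form, a larger `alpha` places more emphasis on accuracy.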
Based on the reward, the controller predicts hyperparameters, which can be implemented with different techniques, such as reinforcement learning or evolutionary algorithms. In this work, we employ a reinforcement learning method in the controller, which interacts with an environment modeled as a Markov Decision Process (MDP). The Monte Carlo policy gradient algorithm is employed to update the controller. The gradient is averaged over a batch of episodes and accumulated over all steps in each episode; the rewards are discounted at every step by an exponential factor, and the baseline is the exponential moving average of the rewards.
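A Monte Carlo policy gradient update with a discount factor and a moving-average baseline can be sketched as follows. This is a simplified illustration, not the controller of this work: it assumes an independent softmax policy per step in place of a recurrent controller, it applies the discount as `gamma ** (T - 1 - t)` (one common convention), and the names and default values of `lr`, `gamma`, and `decay` are illustrative.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_episode(logits, rng):
    """Sample one action per step from independent softmax policies."""
    actions = []
    for step_logits in logits:
        probs = softmax(step_logits)
        actions.append(rng.choices(range(len(probs)), weights=probs)[0])
    return actions


def reinforce_update(logits, episodes, rewards, baseline, lr=0.2, gamma=0.99):
    """One policy-gradient step averaged over a batch of episodes."""
    m, T = len(episodes), len(logits)
    grads = [[0.0] * len(row) for row in logits]
    for actions, r in zip(episodes, rewards):
        for t, a in enumerate(actions):
            probs = softmax(logits[t])
            scale = (gamma ** (T - 1 - t)) * (r - baseline) / m
            for j, p in enumerate(probs):
                # d log pi(a) / d logit_j = 1[j == a] - p_j
                grads[t][j] += scale * ((1.0 if j == a else 0.0) - p)
    for t in range(T):
        for j in range(len(logits[t])):
            logits[t][j] += lr * grads[t][j]   # gradient ascent on the reward
    return logits


def update_baseline(baseline, batch_rewards, decay=0.9):
    """Exponential moving average of the observed rewards."""
    mean_r = sum(batch_rewards) / len(batch_rewards)
    return decay * baseline + (1.0 - decay) * mean_r
```

Subtracting the baseline means that only episodes with above-average reward increase the probability of their actions, which reduces the variance of the update.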
➁Optimizer Selector. The optimizer selector determines the flow in the NACIM framework. As shown in Figure 3 ➁, there are four switches, corresponding to the four decision variables: neural architecture, quantization, device, and circuit. Depending on the status of the switches, NACIM can perform different functions, as listed in the following:
In the first case, NACIM performs the conventional neural architecture search, which aims to maximize accuracy without considering hardware efficiency.
In the second case, NACIM considers quantization during the neural architecture search, and simultaneously determines the neural architecture and the quantization for each network layer.
In the third case, NACIM additionally involves the devices in the search process where the device variation will be considered to guarantee no accuracy loss after implementing the identified network on the target hardware.
In the fourth case, NACIM further explores the circuit design space for circuit optimization together with quantization in terms of a given architecture and device.
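The four cases can be summarized as a small lookup table. The sketch below is only one hypothetical encoding: the `Switches` class, the integer case keys, and the convention that `True` means "this variable is explored in the given case" are assumptions, since the text does not fix a switch polarity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Switches:
    """True means the corresponding variable is explored in that case."""
    architecture: bool
    quantization: bool
    device: bool
    circuit: bool


# Hypothetical encoding of the four cases described in the text.
CASES = {
    1: Switches(True, False, False, False),   # conventional NAS
    2: Switches(True, True, False, False),    # NAS with quantization
    3: Switches(True, True, True, False),     # device variation also considered
    4: Switches(False, True, False, True),    # quantization and circuit only
}


def explored_variables(case):
    """List the decision variables that are explored in the given case."""
    s = CASES[case]
    names = ("architecture", "quantization", "device", "circuit")
    return [name for name in names if getattr(s, name)]
```

In the fourth case the architecture and device are inputs rather than outputs, so trained weights can be reused across iterations.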
In this work, in order to conduct cross-layer optimization, we first set the switch combination to the third case (called “hardware perturbation aware NAS”, abbreviated as “pNAS”), such that we can identify neural architectures with high accuracy on the target devices with variation. Second, we apply the fourth switch combination (called “hardware resource aware NAS”, abbreviated as “rNAS”) to further explore the circuit optimization and take the hardware performance into consideration. The details of pNAS and rNAS are introduced together with the following two evaluators.
➂Accuracy Evaluator. The accuracy evaluator is the key component for executing pNAS. In conventional neural architecture search targeting mobile or FPGA platforms, there is no need to consider hardware perturbation; however, on a computing-in-memory platform, the underlying devices exhibit variations in their characteristics (i.e., device non-idealities), which in turn affect the accuracy. As a result, if we do not consider the variation during training, as shown in the left component of Figure 3 ➂, there will be a dramatic accuracy loss when the identified architecture is deployed to the circuit.
The crossbar is assumed to be used for inference in this paper. However, the non-ideal behavior of the devices in the inference stage may significantly decrease the application-level accuracy, which hinders the use of crossbars built from emerging devices. In this work, we propose a modified training method to alleviate the impact of the non-ideal behavior of the device and circuit, as shown in the right component of Figure 3 ➂. When device variation is considered in the training phase, training typically requires a much longer time than a conventional training method. This is not tolerable in the NAS process, since the framework has to train all predicted architectures; leveraging existing methods would therefore dramatically increase the search time. In this paper, we propose a more efficient way to reduce the effects of device variation. Specifically, we propose a novel training method that involves the device variation in the training procedure. The method is composed of two steps. First, we use the Monte Carlo method to obtain samples (i.e., the device variations) for each weight from a Gaussian distribution whose mean is 0 and whose variance equals the device variance. Second, these samples are added to the corresponding weights in the forward path during training. Since only one Monte Carlo sample per weight is required in each forward path, we can obtain reasonable accuracy with negligible extra training time.
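The two steps above can be sketched on a toy linear model. This is a hedged illustration, not the training code of this work: the model, the function names, the choice to apply the gradient to the stored noise-free weights, and the hyperparameters `sigma`, `lr`, and `epochs` are all assumptions made for the example.

```python
import random


def noisy_weights(weights, sigma, rng):
    """Draw one Gaussian sample (mean 0, std sigma) per weight and add it."""
    return [w + rng.gauss(0.0, sigma) for w in weights]


def train_with_variation(data, num_weights, sigma, epochs=200, lr=0.05, seed=0):
    """SGD on a linear model whose forward pass uses perturbed weights."""
    rng = random.Random(seed)
    weights = [0.0] * num_weights
    for _ in range(epochs):
        for x, target in data:
            w_noisy = noisy_weights(weights, sigma, rng)  # one sample per forward pass
            pred = sum(w * xi for w, xi in zip(w_noisy, x))
            err = pred - target
            # The gradient is evaluated at the perturbed weights but
            # applied to the stored (noise-free) weights.
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
    return weights


def evaluate_with_variation(weights, data, sigma, trials=100, seed=1):
    """Mean squared error averaged over random draws of device variation."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        w_noisy = noisy_weights(weights, sigma, rng)
        for x, target in data:
            pred = sum(w * xi for w, xi in zip(w_noisy, x))
            total += (pred - target) ** 2
    return total / (trials * len(data))
```

Because only one extra random draw per weight is needed in each forward pass, the cost per training step stays close to that of ordinary training, which is the property the text relies on.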
Based on the proposed trainer, pNAS is executed as follows. The controller, trainer, and accuracy evaluator collaboratively search the parameters of neural architecture, quantization, and devices for higher accuracy while taking noises caused by hardware perturbation into account and proposing a variety of candidate architectures. This searching step includes four phases. First, the controller predicts a quantized neural architecture and a type of device. Second, the identified architecture is trained by the trainer using the proposed weight perturbation aware training method. Third, the trained model is then evaluated by the accuracy evaluator to generate inference accuracy with noise. Finally, the accuracy will be the reward to update the controller for predicting new hyperparameters.
➃Performance Evaluator. Before entering the performance evaluator, we first conduct the circuit optimization. We base the circuit optimization on NeuroSim and make modifications to support different quantization for different network layers. Based on the modified model, given a neural architecture, a quantization, and a device, we can optimize the circuit and determine its parameters. Then, based on the optimized circuit and the evaluation tool, we can estimate the area, energy efficiency, and latency of the implementation of the inference phase.
Based on the above performance evaluator, rNAS fine-tunes the quantization parameters of the candidate architectures to further take hardware metrics, including area, energy, and latency, into consideration. To accelerate the search process, we fix the neural architecture and device during this exploration, so that there is no need to train the network from scratch. Specifically, we set the switches so that only the quantization and circuit are explored, while the neural architecture and device remain fixed. In each iteration, we predict new quantization parameters for the identified neural architecture and device. Then, we first obtain the inference accuracy via the accuracy evaluator using the saved weights and the new quantization parameters. Next, we conduct the circuit optimization and obtain the hardware metrics, including latency, energy, and area. Finally, we generate the reward according to the reward function and update the controller based on the reward for the prediction in the next iteration.
V Experiments and Results
In this section, we will first present the experiment setup. Then the experimental results will be presented.
V-A Experiment Setup
Similar to most existing works on CIM-based neural accelerators [39, 40], we use the CIFAR-10 dataset and the image classification task to evaluate our cross-layer optimization framework, NACIM. The backbone is fixed to a typical convolutional neural network consisting of 6 convolutional layers and 2 fully connected layers. For each convolutional layer, the hyperparameters include the filter height/width (e.g., [1, 3, 5, 7]) and the number of output channels (e.g., [24, 36, 48, 64]). For the fully connected layers, the search space is the number of neurons (e.g., [64, 128, 256, 512]).
Quantization is explored during the search process. The quantization bit-widths of the activations and weights of each layer are searched separately. For each type of data, the number of integer bits ranges from 0 to 3, and the number of fraction bits ranges from 0 to 6.
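To give a sense of scale, the size of this search space can be estimated with a few lines of arithmetic. The sketch assumes that the example lists above are the complete option sets, that every layer (including both fully connected layers) is searched independently, and that integer and fraction bits are chosen independently; the constant and function names are hypothetical.

```python
FILTER_SIZES = [1, 3, 5, 7]
OUTPUT_CHANNELS = [24, 36, 48, 64]
FC_NEURONS = [64, 128, 256, 512]
NUM_CONV, NUM_FC = 6, 2

INT_BITS = range(0, 4)    # 0..3 integer bits
FRAC_BITS = range(0, 7)   # 0..6 fraction bits


def architecture_choices():
    """Number of distinct architectures under the stated assumptions."""
    per_conv = len(FILTER_SIZES) * len(OUTPUT_CHANNELS)
    return per_conv ** NUM_CONV * len(FC_NEURONS) ** NUM_FC


def quantization_choices():
    """Number of distinct layer-wise quantization settings."""
    per_tensor = len(INT_BITS) * len(FRAC_BITS)   # one (integer, fraction) pair
    per_layer = per_tensor ** 2                   # activation and weight
    return per_layer ** (NUM_CONV + NUM_FC)
```

Under these assumptions there are 2^28 architectures and 784^8 quantization settings, which is far too many to enumerate and motivates a learned controller.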
For the device and circuit, as a starting point, we use a 4-bit ReRAM device in the crossbar computation, with a noise model taken from prior work. We assume the current range of the device to be [0, 16]. At each level of the device, the variation follows a Gaussian distribution with a mean of 0 and a standard deviation of 800. We assume a fixed array size for the crossbar. The update rate of the controller is set to 0.2; the framework trains each candidate architecture for 30 epochs and searches for the optimal architecture for 500 episodes. We pick the 40 architectures with the highest hardware-noise-aware inference accuracy from the search results and further fine-tune each of them for 200 training epochs.
We search through layer-wise quantization parameters for each candidate architecture while assuming the underlying hardware to have the following properties: we use 4-bit RRAMs as our CIM device and 16-level (4-bit) ADCs for the crossbar; the chip clock frequency is 1 GHz; and the chip technology node is 32 nm. The memory voltage is 0.5 V and the chip voltage is 1.1 V. For each candidate architecture, the controller starts from the specifications provided by the previous search step and then performs 100 search steps to generate an optimized quantization scheme for this architecture.
V-B Comparison Results to State-of-the-Art NAS
First, we show the exploration results of different search methods in Table I. “QuantNAS” indicates the state-of-the-art quantization-architecture co-exploration method. “pNAS” indicates the noise-aware training and search method proposed in this work, where the switch combination is set to the third case. “NACIM” indicates the noise-aware training and search method together with the hardware resource aware quantization search, which combines pNAS and rNAS. Note that “NACIM” can obtain a series of solutions on the Pareto frontier; we use two NACIM entries to represent the solution with maximum hardware efficiency and the solution with maximum accuracy, respectively. For comparison, we report the accuracy of all architectures without noise in column “Accuracy”. We then report the accuracy after considering the device variation in column “Acc w/ variation”. We employ the same circuit optimization procedure for all methods and obtain the hardware efficiency metrics, including area, energy delay product (EDP), speed (TOPs), and energy efficiency (TOPs/W).
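The three derived metrics follow from the operation count, latency, and energy of one inference. The helper functions below use the standard definitions; the function names and the assumption that inputs are given in operations, joules, and seconds are choices made for this sketch, not details taken from the evaluation tool.

```python
def energy_delay_product(energy_j, latency_s):
    """EDP in joule-seconds; lower is better."""
    return energy_j * latency_s


def throughput_tops(num_ops, latency_s):
    """Speed in tera-operations per second."""
    return num_ops / latency_s / 1e12


def efficiency_tops_per_watt(num_ops, energy_j):
    """Energy efficiency in TOPs/W.

    Average power is energy / latency, so throughput divided by power
    reduces to operations per joule.
    """
    return num_ops / energy_j / 1e12
```

Because TOPs/W depends only on operations per joule, a design can have lower speed and still achieve higher energy efficiency.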
The results in Table I show that QuantNAS can find the architecture with the highest accuracy. However, when it is deployed on a computing-in-memory circuit with variation, it suffers a drastic accuracy loss from 84.92% to 8.48%, rendering the architecture useless. On the contrary, with device variation considered in the training process, the network accuracies of pNAS, NACIM (maximum hardware efficiency), and NACIM (maximum accuracy) on the computing-in-memory circuit are 70.76%, 70.12%, and 73.45%, respectively. What is more, the accuracy loss for NACIM is only 0.43%.
We can also observe from the table that, by employing cross-layer optimization, NACIM obtains the best hardware efficiency. Compared with QuantNAS, the NACIM solution with maximum hardware efficiency achieves a 1.82× reduction in area and a 3.66× improvement in energy delay product. Compared with pNAS, these figures are 14.01% and 1.89×, respectively. Compared with the NACIM solution with maximum accuracy, these figures are 9.64% and 1.70×, respectively. These results demonstrate the capability of NACIM to synthesize cost-effective computing-in-memory chips.
Another observation is that the architectures identified by both QuantNAS and NACIM achieve slightly higher speed than that by NACIM. This is because NACIM finds many simple structures with fewer operations, but the latency is not improved accordingly since other designs can have more processing elements. In the comparison of energy efficiency, NACIM achieves higher energy efficiency than QuantNAS. NACIM achieves higher energy efficiency, reaching up to 16.3 TOPs/W. The above observations clearly show the importance of conducting cross-layer optimization to obtain useful neural architectures for hardware-efficient computing-in-memory architectures.
V-C Results of Bi-Objective Optimization
Next, we report the design space exploration results of both pNAS and NACIM with bi-objective optimization: maximizing the accuracy and the hardware performance. Here, the accuracy is obtained by executing the neural network on the computing-in-memory chip with variation. We carry out three sets of experiments to optimize each hardware performance metric (latency, area, and energy) separately. The reward function is calculated based on these metrics, as shown in Formula 3, where we set the scaling parameter to 0.5 to co-optimize network accuracy and hardware efficiency. In the bi-objective optimization, the merge function returns the value of only one metric; we extend this to multi-objective optimization in the next subsection.
Figure 4 shows the design space exploration in terms of accuracy and latency. In this figure, the x-axis and y-axis represent the latency and error, respectively. Each rectangle stands for a design identified by NACIM and each cross stands for a design identified by pNAS. For all multi-objective results, the ideal solutions lie in the bottom-left corner, as shown in this figure.
From the results, we can see that by considering cross-layer optimization, NACIM can significantly push forward the Pareto frontier between accuracy and latency. This is because NACIM generates the reward from the weighted accuracy and latency, which improves latency by finding better circuit designs while guaranteeing accuracy at the same time. Specifically, comparing the solutions with the highest accuracy (design A for NACIM and design B for pNAS), we can see that A’s accuracy (73.77%) is higher than B’s accuracy (73.69%). Moreover, design A reduces latency by 16.63%. Comparing the solutions with the lowest latency, we can see that NACIM (design C) achieves the same accuracy but 32.49% lower latency than pNAS (design D).
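For readers who want to reproduce this kind of comparison, the Pareto frontier of a set of explored designs can be extracted as sketched below. This is a generic non-dominated filter over (latency, error) pairs, both to be minimized. The design names and values are hypothetical and do not correspond to the points in Figure 4.

```python
def pareto_front(designs):
    """Return the designs that are not dominated in (latency, error).

    Each design is a (name, latency, error) tuple; both metrics are minimized.
    A design is dominated if another design is no worse in both metrics and
    strictly better in at least one.
    """
    front = []
    for name, latency, error in designs:
        dominated = any(
            other_latency <= latency
            and other_error <= error
            and (other_latency < latency or other_error < error)
            for _, other_latency, other_error in designs
        )
        if not dominated:
            front.append((name, latency, error))
    # Sort by latency so the frontier reads from left to right on the plot.
    return sorted(front, key=lambda design: design[1])


# Hypothetical explored designs: (name, latency, error).
explored = [
    ("P", 1.0, 0.30),
    ("Q", 2.0, 0.27),
    ("R", 2.5, 0.29),
    ("S", 3.0, 0.26),
    ("T", 3.0, 0.28),
]
front_names = [name for name, _, _ in pareto_front(explored)]
print(front_names)
```

One search method pushes the frontier forward relative to another when its frontier lies closer to the bottom-left corner, that is, when its non-dominated designs dominate those of the other method.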
We further conduct experiments on optimizing area and energy and observe similar results, which are shown in Figures 5 and 6. There is one interesting observation in exploring the design space for accuracy and energy tradeoffs, shown in Figure 6: pNAS can find solutions with higher accuracy than NACIM. For example, in the figure, design A identified by NACIM has a 1% accuracy loss against design B, which is identified by pNAS. However, NACIM achieves 1.73× higher energy efficiency. Here, both designs have the same neural architecture but different quantization: in order to obtain high energy efficiency, NACIM employs lower bit-width precision. We can avoid such accuracy loss by increasing the scaling variable in the reward function in Formula 3.
All of the above observations verify the importance of conducting bi-objective optimization instead of mono-objective optimization on accuracy.
V-D Results of Multi-Objective Optimization
Figure 7 shows the design space exploration tradeoffs between accuracy and the normalized hardware efficiency. The normalized hardware efficiency, represented by the x-axis, is calculated from weighted hardware metrics, including latency, area, and energy. Each hardware component has the same weight; the total normalized hardware efficiency makes up half of the reward, and inference accuracy makes up the other half. An interesting observation from the results is that, compared with the bi-objective optimization, NACIM found more architectures with lower accuracy. This is because the weight for accuracy in calculating the reward is decreased. However, we can still find the solution with the highest accuracy, which achieves a 1.65× improvement in hardware efficiency.
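The weighting described above can be sketched as follows, assuming each hardware metric has already been normalized to a score in [0, 1] where higher is better. The function names and the normalization are assumptions made for illustration; only the equal weights among latency, area, and energy and the half-and-half split with accuracy come from the text.

```python
def normalized_hw_efficiency(scores, weights=None):
    """Weighted sum of per-metric scores; equal weights by default."""
    if weights is None:
        weights = {name: 1.0 / len(scores) for name in scores}
    return sum(weights[name] * scores[name] for name in scores)


def multi_objective_reward(accuracy, scores):
    """Accuracy and normalized hardware efficiency each contribute half."""
    return 0.5 * accuracy + 0.5 * normalized_hw_efficiency(scores)


# Hypothetical normalized scores for one design (1.0 is best).
design_scores = {"latency": 0.6, "area": 0.9, "energy": 0.3}
hw_efficiency = normalized_hw_efficiency(design_scores)
reward = multi_objective_reward(accuracy=0.7, scores=design_scores)
print(f"hardware efficiency = {hw_efficiency:.2f}, reward = {reward:.2f}")
```

Because the hardware share is split three ways, each individual metric carries one sixth of the reward in this sketch, while accuracy carries one half.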
VI Conclusion
In this work, we formally defined the cross-layer optimization problem for automatically identifying neural architectures on computing-in-memory (CIM) platforms. We devised a novel neural architecture search framework that gives designers the flexibility to set different optimization goals. We further integrated a trainer that considers device variation into our framework. In experiments, we first demonstrated the importance of finding a neural architecture that is robust to device variation in CIM, since variation can render architectures found by existing NAS useless due to dramatic accuracy loss. We further showed that cross-layer optimization can identify a robust neural architecture with 0.45% accuracy loss after considering variation, and can maximize hardware efficiency to achieve 16.3 TOPs/W energy efficiency.
References
-  A. Krizhevsky et al., “Imagenet classification with deep convolutional neural networks,” in Proc. of NIPS, 2012, pp. 1097–1105.
-  J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
-  H. Gao, B. Cheng, J. Wang, K. Li, J. Zhao, and D. Li, “Object classification using cnn-based fusion of vision and lidar in autonomous vehicle environment,” IEEE Transactions on Industrial Informatics, vol. 14, no. 9, pp. 4224–4231, 2018.
-  X. Xu, T. Wang, Y. Shi, H. Yuan, Q. Jia, M. Huang, and J. Zhuang, “Whole heart and great vessel segmentation in congenital heart disease using deep neural networks and graph matching,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2019, pp. 477–485.
-  J. Zhang, K. Rangineni, Z. Ghodsi, and S. Garg, “Thundervolt: enabling aggressive voltage underscaling and timing error resilience for energy efficient deep learning accelerators,” in Proceedings of the 55th Annual Design Automation Conference. ACM, 2018, p. 19.
-  J. J. Zhang and S. Garg, “Fate: fast and accurate timing error prediction framework for low power dnn accelerator design,” in Proceedings of the International Conference on Computer-Aided Design. ACM, 2018, p. 24.
-  W. Jiang, E. H.-M. Sha, X. Zhang, L. Yang, Q. Zhuge, Y. Shi, and J. Hu, “Achieving super-linear speedup across multi-fpga for real-time dnn inference,” ACM Transactions on Embedded Computing Systems (TECS), vol. 18, no. 5s, p. 67, 2019.
-  X. Xu et al., “Resource constrained cellular neural networks for real-time obstacle detection using fpgas,” in Proc. of ISQED. IEEE, 2018, pp. 437–440.
-  W. Jiang et al., “Heterogeneous fpga-based cost-optimal design for timing-constrained cnns,” IEEE TCAD, 2018.
-  B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” in International Conference on Learning Representations (ICLR), 2017.
-  B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition,” in
-  E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. Le, and A. Kurakin, “Large-scale evolution of image classifiers,” arXiv preprint arXiv:1703.01041, 2017.
-  H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu, “Hierarchical representations for efficient architecture search,” arXiv preprint arXiv:1711.00436, 2017.
-  V. Nekrasov, H. Chen, C. Shen, and I. Reid, “Architecture Search of Dynamic Cells for Semantic Video Segmentation,” arXiv preprint arXiv:1904.02371, 2019.
-  H. Liu, K. Simonyan, and Y. Yang, “Darts: Differentiable architecture search,” arXiv preprint arXiv:1806.09055, 2018.
-  M. Tan, B. Chen, R. Pang, V. Vasudevan, and Q. V. Le, “Mnasnet: Platform-aware neural architecture search for mobile,” arXiv preprint arXiv:1807.11626, 2018.
-  H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang, “Efficient architecture search by network transformation.” AAAI, 2018.
-  B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer, “Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search,” arXiv preprint arXiv:1812.03443, 2018.
-  W. Jiang, X. Zhang, E. H.-M. Sha, L. Yang, Q. Zhuge, Y. Shi, and J. Hu, “Accuracy vs. efficiency: Achieving both through fpga-implementation aware neural architecture search,” in Proceedings of the 56th Annual Design Automation Conference 2019. ACM, 2019, p. 5.
-  W. Jiang, L. Yang, E. Sha, Q. Zhuge, S. Gu, Y. Shi, and J. Hu, “Hardware/software co-exploration of neural architectures,” arXiv preprint arXiv:1907.04650, 2019.
-  D. Ielmini and H.-S. P. Wong, “In-memory computing with resistive switching devices,” in Nature Electronics. Nature, 2018, p. 333.
-  V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “Efficient processing of deep neural networks: A tutorial and survey,” in Proceedings of the IEEE. IEEE, 2017, pp. 2295–2329.
-  A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, “Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” ACM SIGARCH Computer Architecture News, pp. 14–26, 2016.
-  A. Biswas and A. P. Chandrakasan, “Conv-ram: An energy-efficient sram with embedded convolution computation for low-power cnn-based machine learning applications,” in 2018 IEEE International Solid-State Circuits Conference-(ISSCC). IEEE, 2018, pp. 488–490.
-  M. Kang, S. Lim, S. Gonugondla, and N. Shanbhag, “An in-memory vlsi architecture for convolutional neural networks,” in IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2018, pp. 494–505.
-  P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory,” ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 27–39, 2016.
-  X. Xu et al., “Scaling for edge inference of deep neural networks,” Nature Electronics, vol. 1, no. 4, p. 216, 2018.
-  X. Xu, Q. Lu, L. Yang, S. Hu, D. Chen, Y. Hu, and Y. Shi, “Quantization of fully convolutional networks for accurate biomedical image segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8300–8308.
-  J. J. Zhang, P. Raj, S. Zarar, A. Ambardekar, and S. Garg, “Compact: On-chip compression of activations for low power systolic array based cnn acceleration,” ACM Transactions on Embedded Computing Systems (TECS), vol. 18, no. 5s, p. 47, 2019.
-  M. Zhao et al., “Investigation of statistical retention of filamentary analog rram for neuromorphic computing,” IEEE International Electron Devices Meeting (IEDM), pp. 39–4, 2017.
-  C. Liu et al., “Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation,” in Proc. of CVPR, 2019, pp. 82–92.
-  W. Peng et al., “Video action recognition via neural architecture searching,” in Proc. of ICIP. IEEE, 2019, pp. 11–15.
-  B. Feinberg, S. Wang, and E. Ipek, “Making memristive neural network accelerators reliable,” IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 52–65, 2018.
-  X. Peng et al., “Dnn+neurosim: An end-to-end benchmarking framework for compute-in-memory accelerators with versatile device technologies,” in Proc. of IEDM, 2019.
-  P. Chen, X. Peng, and S. Yu, “Neurosim: A circuit-level macro model for benchmarking neuro-inspired architectures in online learning,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 12, pp. 3067–3080, Dec 2018.
-  R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine learning, vol. 8, no. 3-4, pp. 229–256, 1992.
-  Q. Lu, W. Jiang, X. Xu, Y. Shi, and J. Hu, “On neural architecture search for resource-constrained hardware platforms,” in International Conference on Computer-Aided Design (ICCAD). ACM, 2019, p. 1.
-  B. Zhang, L.-Y. Chen, and N. Verma, “Stochastic data-driven hardware resilience to efficiently train inference models for stochastic hardware implementations,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
-  A. D. Patil, H. Hua, S. Gonugondla, M. Kang, and N. R. Shanbhag, “An mram-based deep in-memory architecture for deep neural networks,” in 2019 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2019, pp. 1–5.
-  X. Sun, R. Liu, X. Peng, and S. Yu, “Computing-in-memory with sram and rram for binary neural networks,” in 2018 14th IEEE International Conference on Solid-State and Integrated Circuit Technology (ICSICT). IEEE, 2018, pp. 1–4.