1. Introduction
In the machine learning community, there is great interest in developing computational models to solve problems related to computer vision
(Krizhevsky et al., 2012b), speech recognition (Graves et al., 2013), information security (Kang and Kang, 2016), climate modeling (Jones, 2017), etc. To improve the delay and energy efficiency of computational tasks related to both inference and training, the hardware design and architecture communities are considering how hardware can best be employed to realize algorithms/models from the machine learning community. Approaches include application-specific integrated circuits (ASICs) to accelerate deep neural networks (DNNs) (Reagen et al., 2016; Whatmough et al., 2017) and convolutional neural networks (CoNNs) (Moons et al., 2016), neural processing units (NPUs) (Hashemi et al., 2017), hardware realizations of spiking neural networks (Esser et al., 2015; Kim et al., 2015), etc. When considering application-specific hardware to support neural networks, it is important that said hardware can implement networks that are extensible to a large class of networks, and can solve a large collection of application-level problems. DNNs represent one such class of networks and have demonstrated their strength in applications such as playing the game of Go (Silver et al., 2017), image and video analysis (Krizhevsky et al., 2012a), target tracking (Kristan et al., 2017), etc. In this paper, we use the convolutional neural network (CoNN) as a case study for DNNs due to its general prevalence. CoNNs are computationally intensive, which can lead to high latency and energy for inference, and even higher latency/energy for training. The focus of this paper is on developing a low-energy/delay mixed-signal system based on cellular neural networks (CeNNs) for realizing CoNNs.
A Cellular Nonlinear/Neural Network (CeNN) is an analog computing architecture (Chua and Yang, 1988) that is well suited for many information processing tasks. In a CeNN, identical processing units (called cells) process analog information in a concurrent manner. Interconnection between cells is typically local (i.e., nearest neighbor) and space-invariant. For spatiotemporal applications, CeNNs can offer vastly superior performance and power efficiency when compared to conventional von Neumann architectures (Palit et al., 2015; Kinget and Steyaert, 1995). Using "CeNNs for CoNNs" allows the bulk of the computation associated with a CoNN to be performed in the analog domain. Sensed information can immediately be processed with no analog-to-digital conversion (ADC). Also, inference-based processing tasks can tolerate the lower precision typically associated with analog hardware (e.g., Google's TPU employs 8-bit integer matrix multiplies (Jouppi et al., 2017)), and can leverage its higher energy efficiency. With this context, we make the following contributions in this paper.
(i) We elaborate on the use of CeNNs to realize the computations that are typically associated with different layers in a CoNN. These layers include convolution, ReLU, and pooling layers. Based on the implementations for each layer, a baseline CeNN-friendly CoNN for the MNIST problem (LeCun et al., 2010) is presented. (A preliminary version of this design was presented in (Horvath et al., 2017).)
(ii) We introduce an improved CoNN model for the MNIST problem to support CeNN-friendly layers/algorithms that can ultimately improve figures of merit (FOM) such as delay, energy, and accuracy. Following the same approach, we also develop a CeNN-friendly CoNN for the CIFAR-10 problem (Krizhevsky and Hinton, 2009).
(iii) We present a complete, mixed-signal architecture to support CeNN-friendly CoNN designs. Besides CeNN cells and SRAM to store weights, the architecture includes analog memory to store intermediate feature-map data, as well as an ADC and digital circuits for the FC-layer computation. The architecture also supports efficient programming/reprogramming of CeNN cells.
We have conducted detailed studies of energy, delay, and accuracy per classification for the MNIST and CIFAR-10 datasets, and compared our networks and architecture with other algorithms and architectures (Reagen et al., 2016; Whatmough et al., 2017; Moons et al., 2016; Hashemi et al., 2017; Esser et al., 2015; Kim et al., 2015) that address the same problems. For the MNIST dataset, at iso-accuracy, our results demonstrate an 8.7× improvement in energy-delay product (EDP) when compared with a state-of-the-art accelerator. When compared with another recent analog implementation (Biswas et al., 2018), a 10.3× improvement in EDP is observed. For the CIFAR-10 dataset, a 4.3× improvement in EDP is observed when compared with a state-of-the-art quantized approach (Hashemi et al., 2017).
The rest of the paper is structured as follows. Sec. 2 gives a general discussion of CeNNs and existing CoNN accelerators. In Sec. 3, we present the implementation of CoNN layers in CeNNs. Our baseline network designs, as well as other algorithmic changes and network topologies that might be well-suited for our architecture, are given in Sec. 4. Sec. 5 describes our proposed architecture, including the CeNN cell design and simulations of various core architectural components. Evaluation and benchmarking results are presented in Sec. 6. Lastly, Sec. 7 concludes the paper.
2. Background
Here, we briefly review the basic concepts of CeNNs and accelerator designs for CoNNs.
2.1. CeNN basics
A CeNN architecture is a spatially invariant array of identical cells (Fig. 1a) (Horvath et al., 2017). Each cell has identical connections with adjacent cells in a predefined neighborhood of radius r. The number of cells m in the neighborhood is given by the expression m = (2r + 1)². (r is typically 1, which suggests that each cell interacts only with its immediate neighbors.)
A CeNN cell is comprised of one resistor, one capacitor, 2m linear voltage-controlled current sources (VCCSs), and one fixed current source (Fig. 1b). The input, state, and output of a given cell C_ij correspond to the nodal voltages u_ij, x_ij, and y_ij, respectively. VCCSs controlled by the input and output voltages of each neighbor deliver feedforward and feedback currents to a given cell. CeNN cell dynamics can be described by a system of ordinary differential equations, where each equation is simply Kirchhoff's Current Law (KCL) at the state node of the corresponding cell (Eq. 1). CeNN cells also employ a nonlinear, sigmoid-like transfer function at the output (see Eq. 2).
(1) $C \dfrac{dx_{ij}(t)}{dt} = -\dfrac{x_{ij}(t)}{R} + \sum_{C_{kl} \in N_r(i,j)} a_{kl}\, y_{kl}(t) + \sum_{C_{kl} \in N_r(i,j)} b_{kl}\, u_{kl} + Z$
(2) $y_{ij}(t) = \tfrac{1}{2}\left( \left| x_{ij}(t) + 1 \right| - \left| x_{ij}(t) - 1 \right| \right)$
The feedback and feedforward weights from cell C_kl to cell C_ij are captured by the parameters a_kl and b_kl, respectively. The a_kl and b_kl weights are space-invariant and are denoted by two (2r+1)×(2r+1) matrices. (If r = 1, the matrices are 3×3.) The matrices of a_kl and b_kl parameters are referred to as templates, where A and B are the feedback and feedforward templates, respectively. These templates are the coefficients in the differential equation, and can be either constants, to reflect a linear relation between cells, or nonlinear functions, to reflect a nonlinear relation between cells. Design flexibility is further enhanced by the fixed bias current Z, which provides a means to adjust the total current flowing into a cell. By selecting values for A, B, and Z, CeNNs can solve a wide range of problems.
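To make the cell dynamics concrete, the following minimal numerical sketch (our own illustration, not the hardware model) integrates the state equation of Eq. 1 with forward Euler and applies the output nonlinearity of Eq. 2. The zero-padded boundary, time step, and normalization R = C = 1 are assumptions made for illustration only.

```python
import numpy as np

def cenn_step(x, u, A, B, Z, dt=0.05, R=1.0, C=1.0):
    """One forward-Euler step of the CeNN state equation (Eq. 1).
    x, u: state and input arrays; A, B: 3x3 feedback/feedforward templates."""
    y = 0.5 * (np.abs(x + 1) - np.abs(x - 1))  # implicit output nonlinearity (Eq. 2)

    def tconv(t, img):
        """Local 3x3 template 'convolution' with zero-padded boundary."""
        out = np.zeros_like(img, dtype=float)
        p = np.pad(img, 1)
        for di in range(3):
            for dj in range(3):
                out += t[di, dj] * p[di:di + img.shape[0], dj:dj + img.shape[1]]
        return out

    dx = (-x / R + tconv(A, y) + tconv(B, u) + Z) / C
    return x + dt * dx

def cenn_settle(u, A, B, Z, steps=400):
    """Integrate until the array settles; return the clamped output."""
    x = np.zeros_like(u, dtype=float)
    for _ in range(steps):
        x = cenn_step(x, u, A, B, Z)
    return 0.5 * (np.abs(x + 1) - np.abs(x - 1))
```

With A = 0 and an identity B template, the state settles to x = u and the output is simply the input clamped to [-1, 1], which is the feedforward-only mode used for the convolution layers discussed later.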
Various circuits, including inverters, Gilbert multipliers, operational transconductance amplifiers (OTAs), etc., can be used as VCCSs in a CeNN (Molinar-Solis et al., 2007; Wang et al., 1998). For the work discussed in this paper, we assume the OTA design from (Lou et al., 2015). OTAs provide a large linear range for voltage-to-current conversion, and can implement a wide range of transconductances for different CeNN template implementations. Furthermore, these OTAs can also be used to implement nonlinear templates, which leads to CeNNs with richer functionality (Lou et al., 2015).
2.2. Convolutional neural network accelerators
Due to the high computational complexity of CoNNs, various hardware platforms are used to enable efficient processing of DNNs, including GPUs, FPGAs, ASICs, etc. In particular, there is growing interest in using ASICs to provide more specialized acceleration of DNN computation; a recent review paper (Sze et al., 2017) summarizes these approaches. Both digital and analog circuitry have been proposed to implement these accelerators. In the digital domain, typical approaches include using optimized dataflow to reduce the data-movement overhead of dense matrix multiplication (Chen et al., 2017), or implementing sparse matrix multiplication by applying pruning to the network (Han et al., 2016). More recently, analog implementations have also been proposed to accelerate deep learning. Work in (Biswas et al., 2018) embeds a charge-sharing scheme into SRAM cells to reduce the overhead of memory accesses, while work in (Shafiee et al., 2016) uses a memristor crossbar circuit to speed up the inference of deep neural networks.

3. CeNN implementation of CoNN computations
As pointed out earlier, CeNNs have a number of benefits, such as (i) ease of implementation in VLSI, (ii) low energy due to their natural fit for analog realization, and (iii) Turing completeness. We show in this section that all the core operations in a CoNN can be readily implemented with CeNNs. In a CoNN, every layer typically implements a simple operation, which might include: (i) convolutions, (ii) nonlinear operations (usually a rectifier), (iii) pooling operations, and (iv) fully connected layers. Below, we describe how each of these layers can map to a CeNN. A more detailed description of these operations, and of how the layered network itself can be built, can be found in (LeCun et al., 2015). We discuss our network designs in Sec. 4.
3.1. Convolution
Convolution layers detect and extract different features from the input data by computing the summation of the pointwise multiplication of a feature map and a convolutional kernel. One map is the input image I, and the convolutional kernel K encodes a desired feature to be detected. It is easy to see that the convolution has the highest response at positions where the desired feature appears. The convolution operation is defined in Eq. 3. The exact convolutional kernels are optimized during training.
(3) $O_{ij} = \sum_{k} \sum_{l} I_{i+k,\, j+l}\, K_{k,l}$
As can be seen from Eq. 1, with the application of the feedforward template B, a CeNN can implement a convolutional kernel for a feature map in a straightforward manner. All the feature maps resulting from these convolution operations then need to be summed together; we discuss the mechanism for achieving this in Sec. 5.
Due to the sigmoid function in the CeNN equation, the output of a CeNN is thresholded to the range (−1, 1). However, in the CoNN computation, an output could be larger than 1 or smaller than −1, which leads to an error in data representation. Our initial simulation results suggest that this error does not impact the overall classification accuracy of the networks considered in this paper.
3.2. Rectified Linear Units
As CoNNs are built and designed for recognition and classification tasks, nonlinear operations are required. Perhaps the most commonly used nonlinearity in deep learning architectures (Dahl et al., 2013) is the rectified linear unit (ReLU), which, per Eq. 4, thresholds every value below zero.

(4) $f(x) = \max(0, x)$
In a CeNN, the ReLU operation can be implemented using a nonlinear template. In CeNN theory, nonlinear templates are usually denoted as templates acting in parallel with the A and B templates. To realize the required nonlinear computation, one can define an additional template implementing the nonlinear function of the ReLU operation, f(x) = max(0, x). This function sets all negative values to zero and leaves positive values unchanged; hence it directly implements Eq. 4. That said, (i) while nonlinear templates are well established in the theory of CeNNs, (ii) the application of a nonlinear function has obvious computational utility, and (iii) nonlinear templates can be easily simulated, in practice nonlinear operations are much more difficult to realize. While existing hardware considers nonlinear template implementations (Lou et al., 2015), it may still not exactly mimic the behavior of nonlinear templates. (We discuss this in more detail in Sec. 3.4.)
Alternatively, as the CeNN Universal Machine (CeNN-UM) is Turing complete, all nonlinear templates can be implemented as a series of linear templates together with the implicit CeNN nonlinearity (i.e., the sigmoid output, see Eq. 2) (Roska and Chua, 1993). This implicit CeNN nonlinearity is widely implemented in real devices such as the ACE16k chip (Rodríguez-Vázquez et al., 2004) and the SPS 02 Smart Photosensor from Toshiba (SPS, [n. d.]). In the CoNN case, the ReLU operation can be rewritten as a series of linear operations (with the implicit CeNN nonlinearity) by applying the templates below.
First, one can execute the feedforward template given by Eq. 5, which simply decreases all values by 1. Because the standard CeNN nonlinearity thresholds all values in a CeNN array below −1, after this shift all values that were originally between −1 and 0 are simply set to −1.
(5) $A = \mathbf{0}, \quad B = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad Z = -1$
Next, one can shift the values back (i.e., increase them by 1) by applying the template operation in Eq. 6:
(6) $A = \mathbf{0}, \quad B = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad Z = +1$
Because the implicit nonlinearity thresholds the shifted values, these two linear operations implement the required ReLU operator: all positive values are left unchanged, and all negative values are thresholded to 0.
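As a sanity check, the two shift steps combined with the implicit clamp reproduce ReLU exactly for values in [−1, 1]. The following minimal sketch is our own numerical illustration, not the hardware implementation:

```python
import numpy as np

def cenn_clamp(x):
    """Implicit CeNN output nonlinearity (Eq. 2): clamp to [-1, 1]."""
    return 0.5 * (np.abs(x + 1) - np.abs(x - 1))

def cenn_relu(x):
    """ReLU over CeNN-range data via two linear template steps.
    Step 1 (Eq. 5): shift down by 1; the clamp pins everything that was
    in [-1, 0] to -1.  Step 2 (Eq. 6): shift back up by 1, recovering
    max(x, 0) for x in [-1, 1]."""
    y = cenn_clamp(x - 1.0)
    return cenn_clamp(y + 1.0)
```

For example, an input of −0.8 becomes −1.8, is clamped to −1, and shifts back to 0, while an input of 0.3 passes through both shifts unchanged.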
3.3. Pooling
Pooling operations are employed to decrease the amount of information flowing between consecutive layers in a deep neural network, and to compensate for the effects of small translations. Two pooling approaches are widely used in CoNNs: max pooling and average pooling. Here, we discuss implementations of both pooling approaches using CeNNs.
3.3.1. Max pooling
A max pooling operation selects the maximum element in a region around every value, per Eq. 7:
(7) $y_{ij} = \max_{C_{kl} \in N_r(i,j)} u_{kl}$
Like ReLU, max pooling is a nonlinear function. As before, the functionality associated with max pooling can be realized with a sequence of linear operations. We use a pooling operation with a 3×3 receptive field as an example to illustrate the process. The idea is to compare the intensity of each pixel in the image with all of its neighbors in succession (those within a radius of 1 in the 3×3 case). We use p_i to represent the intensity of pixel i. For each comparison, if the intensity n_k of a neighboring pixel is larger than p_i, we use n_k to replace p_i at location i; otherwise, p_i remains unchanged. By making comparisons with all neighboring pixels, the value of p_i is set to the maximum over all of its neighbors.
We developed a sequence of CeNN templates to realize the comparison between p_i and one of its neighboring pixels, n_k. Then, by simply rotating the templates, we can easily compare p_i with the other neighboring pixels. Downsampling can be performed afterwards to extract the maximum value within a certain range if needed. The detailed CeNN operations that realize one comparison can be broken down into 4 steps, summarized as follows. (i) Apply the linear DIFF template shown in Eq. 8:
(8) $A = \mathbf{0}, \quad B = \begin{bmatrix} 0 & 1/2 & 0 \\ 0 & -1/2 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad Z = -1$ (shown for the comparison with the north neighbor; rotations give the other directions)
The output after applying this template is (n_k − p_i)/2 − 1. After applying the sigmoid function, the output becomes −1 if n_k < p_i, and otherwise remains (n_k − p_i)/2 − 1. (ii) Apply the linear INC template in Eq. 9 to shift the pixel intensity up. After this operation, the value becomes 0 if n_k < p_i, and (n_k − p_i)/2 otherwise.
(9) $A = \mathbf{0}, \quad B = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad Z = +1$
(iii) Apply the CeNN MULT template to multiply the result by 2. The value thus becomes 0 if n_k < p_i, and n_k − p_i otherwise. (iv) Add p_i to the result to obtain the maximum of p_i and n_k, and use it to update the intensity at location i.
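The four steps can be sketched numerically as follows. This is our own illustration under the assumption that the DIFF template scales the difference by 1/2 and biases it by −1 (so that every intermediate passing through the output stage stays within the clamp range), and that the MULT result is read from the cell state directly, since doubling could exceed the [−1, 1] output range:

```python
import numpy as np

def clamp(v):
    """Implicit CeNN output nonlinearity (Eq. 2): clamp to [-1, 1]."""
    return 0.5 * (np.abs(v + 1) - np.abs(v - 1))

def pairwise_max(p, n):
    """Elementwise max(p, n) for values in [-1, 1] using only linear
    template steps plus the implicit clamp (one assumed realization of
    the DIFF/INC/MULT sequence of Sec. 3.3.1)."""
    d = clamp(0.5 * (n - p) - 1.0)   # DIFF: -1 if n < p, else (n - p)/2 - 1
    d = clamp(d + 1.0)               # INC : 0 if n < p, else (n - p)/2
    d = 2.0 * d                      # MULT: 0 if n < p, else n - p (state read)
    return p + d                     # step (iv): p + max(0, n - p) = max(p, n)
```

Repeating this comparison against each neighbor (rotating the DIFF template between passes) leaves each pixel holding the neighborhood maximum.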
3.3.2. Average pooling
Per Sec. 3.3.1, a max pooling operation with linear CeNN templates requires up to 16 computational steps. (Each comparison requires 4 steps, and each pixel needs to be compared with (at least) its 4 neighboring pixels.) That said, average pooling can be used in lieu of max pooling, and may have only a nominal impact on classification accuracy in certain scenarios (Boureau et al., 2010). Average pooling is easily realized with CeNNs; in fact, only one template operation is required. To perform an average pooling operation over a 3×3 grid, one can simply employ the templates in Eq. 10.
(10) $A = \mathbf{0}, \quad B = \dfrac{1}{9}\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}, \quad Z = 0$
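The single template operation amounts to one 3×3 convolution with a uniform 1/9 kernel; a minimal sketch (the zero-padded boundary handling is our assumption — a real array has boundary cells):

```python
import numpy as np

def avg_pool_template(img):
    """3x3 average pooling as one feedforward CeNN template (Eq. 10):
    B = (1/9) * ones((3, 3)), A = 0, Z = 0, with zero-padded borders."""
    B = np.full((3, 3), 1.0 / 9.0)
    p = np.pad(img, 1)
    out = np.zeros_like(img, dtype=float)
    for di in range(3):
        for dj in range(3):
            out += B[di, dj] * p[di:di + img.shape[0], dj:dj + img.shape[1]]
    return out
```

One pass over the whole array replaces the 16-step linear max-pooling sequence, which is the delay/energy trade-off discussed above.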
3.4. Nonlinear template operations
While CeNN templates typically encode linear relationships between cells, nonlinear relationships are also possible and can be highly beneficial. (As noted earlier, while nonlinear template operations are well supported by CeNN theory, linear operations are more common in hardware realizations owing to the complexity of the circuitry required for nonlinear steps.) That said, we also consider what impact nonlinear template operations may ultimately have on application-level FOM.
We consider nonlinear implementations of ReLU and pooling per (Lou et al., 2015). The nonlinear OTA-based I-V characteristic shown in (Lou et al., 2015) can directly mimic the ReLU function discussed in Sec. 3.2. The pooling operation can also be implemented by the nonlinear GLOBMAX template, which can be found in the standard CeNN template library (tem, [n. d.]). The GLOBMAX operation selects the maximum value in the neighborhood of a cell in a CeNN array and propagates it through the array. By setting the execution time of the template accordingly, one can easily control how far the maximum values propagate, i.e., which regions they fill. Here, the nonlinear templates can also be implemented by using the nonlinear function given in Eq. 11.
(11) 
3.5. FullyConnected Layers
The operations described above act as local feature extractors and can extract complex feature maps from a given input image. However, to accomplish classification, one must convert said feature maps into a scalar index value associated with the selected class. While various machine learning algorithms (e.g., SVMs) can be used for this, a common approach is to employ a fully connected (FC) layer and associated neurons. The FC layer considers information globally and unifies the local features from the lower layers. It can be defined as a pixelwise dot product between a weight map and the feature map. This product captures how strongly the data belongs to a class, and it is calculated for every class independently. The index of the largest classification result is then selected and associated with the input data.
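The FC step reduces to a matrix-vector product over the flattened feature maps followed by an argmax. A minimal digital-side sketch (shapes and names are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def fc_classify(feature_maps, weights, biases):
    """Digital FC-layer step after ADC readout: pixelwise dot product of
    the flattened feature maps with one weight map per class (Sec. 3.5).
    Illustrative shapes: feature_maps (M, H, W), weights (classes, M*H*W)."""
    v = feature_maps.reshape(-1)       # flatten all maps into one vector
    scores = weights @ v + biases      # one classification score per class
    return int(np.argmax(scores))      # index of the selected class
```

Each row of `weights` plays the role of one class's weight map, and the returned index is the class label assigned to the input.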
CeNNs can readily implement the dot product function of the FC layer for small maps. However, for large feature maps and weight maps (i.e., pointwise calculations over vectors of length greater than 9), a CeNN would require a large neighborhood radius r, and hence cannot efficiently implement such FC layers. To overcome this challenge, one can leverage a digital processor (e.g., per (Nahlus et al., 2014)) to perform the FC-layer function.

4. CeNN-based CoNNs for two case studies
As mentioned in the previous section, (a) CeNNs operate in the analog domain, which can result in lower power consumption/improved energy efficiency (Kim et al., 2009), and (b) CeNNs are Turing complete (Chua and Roska, 2002a) and can provide a richer library of functionality than what is typically associated with CoNNs. In this section, we consider how the topographic, highly parallel CeNN architecture can efficiently implement deep-learning operations/CoNNs.
CeNNs are typically comprised of a single layer of processing elements (PEs). Thus, while most CeNN hardware implementations lack the layered structure of CoNNs, by using local memory and CeNN reprogramming (commonly available on every realized CeNN chip (Rodríguez-Vázquez et al., 2004), as will be discussed), a cascade of said operations can be realized by reusing the result of each previous processing layer (Chua and Roska, 2002b). One can also simply use multiple CeNNs to compute the different feature maps of each layer in parallel. These CeNNs need to communicate with each other, e.g., in order to sum values across different feature maps. Below, we show how layered CoNNs can be realized with layers of CeNNs through two case studies: (i) MNIST and (ii) CIFAR-10.
4.1. CeNNbased CoNNs for MNIST
Using the building blocks described above, we have developed several CeNN-friendly structures for the MNIST problem. In the MNIST handwritten digit classification task (Lecun et al., 1998), a system must analyze and classify which digit (0-9) is represented by a 28×28 pixel grayscale image. There are 60,000 images in the training set and 10,000 images in the test set.

To develop the CeNN-friendly CoNN, we leverage the following two observations. First, all computational kernels are best restricted to a CeNN-friendly size of 3×3. In some sense, this could be viewed as a departure from the larger kernel sizes (e.g., 5×5 or larger) that may be common in CoNNs. It should be noted that larger kernels are acceptable according to CeNN theory (i.e., per Sec. 2, a neighborhood's radius could easily be larger than 1). However, due to increased connectivity requirements, such kernels are infrequently realized in hardware. That said, the 3×3 kernel size is not necessarily a restriction. Recent work (Simonyan et al., 2014)
suggests that larger kernels can be approximated by a series of 3×3 kernels with fewer parameters. Again, this maps well to CeNN hardware. Second, per the discussion in Sec. 3, all template operations for the convolution, ReLU, and pooling steps are feedforward (B) templates. The feedback template (A) is not used in any of the feature-extracting operations (i.e., per Eq. 1, all A values are simply 0).

During network development, we use TensorFlow to train the network at full precision to obtain accuracy data, using stochastic gradient descent. We have also implemented a more versatile/adjustable training framework in MATLAB. The MATLAB-based simulator extracts weights from the trained TensorFlow model, and performs inference in conjunction with CeNN operations at a precision equivalent to the actual hardware. Our network learns the parameters of the B-type templates for the convolution kernels. (Per Sec. 3, the B-template values for the ReLU and pooling layers are fixed.)

Following the observations and process described above, we developed a layered, CeNN-friendly network to solve the MNIST problem. The network topology is shown in Fig. 2. The network contains two convolution layers, each with 4 feature maps. An FC layer follows the two convolution layers to obtain the classification results. The baseline network is designed using max pooling and linear templates to maximize classification accuracy. However, we also study the network accuracy with average pooling and with alternatives based on nonlinear templates, to evaluate trade-offs in accuracy, delay, energy, etc., as discussed below.
The accuracies for the different design options are summarized in the second column of Table 1. From the table, we can see that max pooling generally leads to better accuracy than average pooling. The nonlinear template implementation is also less accurate than the linear implementation of max pooling. This is mainly because the GLOBMAX template is an approximation of max pooling, and it is not as accurate as the linear template approach.
4.2. Eliminating FC layers
One potential challenge of a network with a fully-connected layer, as shown in Fig. 2, is the need to convert analog CeNN data into a digital representation to perform the computations associated with the FC layer, since an FC layer is not CeNN-friendly (see Sec. 3.5). To reduce the impact of analog-to-digital conversion and the associated FC-layer computation, we have designed an alternative network for MNIST digit classification that eliminates the FC layer. Note that this network-modification approach can be extended to any given network; here, we show how we apply it to our Design 1.
In this alternative network (Fig. 3), the weights (and image sizes) associated with the last layer of the network are reduced to CeNN-friendly, 3×3 kernels. The changes include modifications to the pooling layers. In the network in Fig. 2, max pooling is achieved by propagating the maximum pixel value to all neighbors within a certain region specified by the network design; the sizes of the feature maps, however, do not change. For the network in Fig. 3, the maximum value is propagated within a 2×2 grid to form a group, and only the one maximum pixel value in each group is extracted to be processed in the next stage of the network. Thus, the feature-map size is reduced by a factor of two in each dimension with each pooling layer. To implement downsampling through max pooling, after a pooling operation is completed, only one pixel per 2×2 grid within a feature map needs to be written to an analog memory array for next-stage processing. In the network in Fig. 3, three pooling layers are required to properly downsize an image and obtain reasonable accuracy. The final computational steps associated with this alternative network are readily amenable to CeNN hardware implementations, as both the image size and the kernel size are reduced to a small, CeNN-friendly size.
Potential overheads associated with FC-layer computations are reduced, as only the final results (10 probability values corresponding to the number of image classes) must be sent to any digital logic and/or CPU (in lieu of the 16,000 signals associated with the network in Fig. 2). Downsampling may also impact classification energy, as smaller subsets of the CeNN array can be used for the computations associated with successive layers of the network. Again, we evaluate the accuracy of this approach using average pooling, nonlinear templates, etc. The results are shown in the third column of Table 1. In general, these accuracy numbers remain close to those of the baseline design discussed in Sec. 4.1.

4.3. CeNN-based CoNNs for CIFAR-10
The networks proposed in Secs. 4.1 and 4.2 for MNIST are relatively simple compared with state-of-the-art networks. Typically, to solve more complex problems, larger networks with more layers/feature maps are required. In this subsection, we discuss our design for larger CeNN-friendly CoNNs.
As a case study, we use the CIFAR-10 dataset, which consists of 50,000 images in the training set, 10,000 images in the validation set, and 10,000 images in the test set. These are all color images with RGB channels. There are 10 classes of objects (e.g., airplane, automobile, bird, etc.) within the dataset. Each image belongs to one class and has a size of 32×32. During the inference stage, the network must predict which class an image belongs to.
We use a modified AlexNet (Krizhevsky et al., 2012a) network to solve the CIFAR-10 problem. AlexNet was originally used to solve ImageNet (Deng et al., 2009), a more complex problem; thus, we expect our modifications to still yield reasonable accuracy on CIFAR-10. We modify AlexNet to (i) enable the modified network to solve the CIFAR-10 problem, and (ii) make the network CeNN-friendly. Specifically, our main modifications are as follows: (i) For all convolution layers in AlexNet, the kernel sizes are changed to 3×3 so that they are readily amenable to CeNNs with the same template size. (ii) We remove the FC layer of AlexNet, since it is not CeNN-friendly, and use a convolution layer with 10 outputs as the last layer to obtain the classification probabilities. (iii) Downsampling in the pooling layers is not used in the modified network, in order to retain a reasonable model size. The network architecture is shown in Fig. 4.

We use the network in Fig. 4 as a baseline, and explore the design space by changing the number of feature maps in each layer. In the baseline, the feature-map counts for the first 5 convolution layers are the same as in AlexNet (C96-C256-C384-C384-C256). We also considered configurations of C64-C128-C256-C256-C128 and C64-C128-C128-C128-C64. We use the Adam algorithm (Kingma and Ba, 2014) to train the network. The accuracy data for the different design options are summarized in Table 2. The accuracies drop by only 1.6% and 2.7% as the network size decreases. Therefore, we also consider these two networks in the benchmarking efforts discussed in Sec. 6.
Table 2. Accuracy for the three feature-map configurations.

Approach: C96-C256-C384-C384-C256 | C64-C128-C256-C256-C128 | C64-C128-C128-C128-C64
Accuracy: 84.5% | 82.9% | 81.8%
5. CeNN architectures
In this section, we introduce our CeNN-based architecture for realizing CeNN-friendly CoNNs. Our architecture is general and programmable for any CoNN that contains convolution, ReLU, and pooling layers. Meanwhile, by changing the configuration (e.g., SRAM size, number of OTAs) and circuit parameters (e.g., bias current), our CeNN architecture can satisfy different precision requirements for a network. Thus, we can explore trade-offs between accuracy, delay, and energy efficiency within the same network. We first present our CeNN-based architecture in Sec. 5.1. We then describe each component of the architecture: CeNN cells in Sec. 5.2, analog memories in Sec. 5.3, and SRAM in Sec. 5.4. We also highlight the dataflow of the CoNN computation using the CeNN architecture. In Sec. 5.5, we discuss the need for ADCs and digital circuitry to support the computations of an FC layer (i.e., to support networks as discussed in Sec. 4.1). Finally, we discuss the programming mechanism for the CeNN templates of the architecture. Throughout, we also highlight differences between the CeNN cell designs presented here and previous work (e.g., (Horvath et al., 2017)).
5.1. Architecture
Our CeNN architecture for CoNN computation (Fig. 5) consists of multiple CeNN arrays (the boxes labeled "array"). These arrays are the key components for implementing the convolution, ReLU, and pooling operations of a CoNN. Within each array there are multiple cells, per Sec. 2.1. The array size can usually accommodate all the image pixels to enable parallel processing of a whole image (extra cells are power gated to save power). For large images, time multiplexing is used to sequentially process parts of the image. The connections between cells follow the typical CeNN array design described in Sec. 2.1. An SRAM array (the rectangle at the bottom of Fig. 5) stores the templates needed for the CeNN computation; how the CeNN templates are configured with the SRAM data is discussed in Sec. 5.4. An analog memory array (the boxes labeled MEM) is embedded into each CeNN cell and is used to store intermediate results of the CeNN computation. Each CeNN array is associated with an ADC. The output of the ADC connects to the host processor or digital logic, which supports the computations for FC layers.
Each CeNN array performs the computations associated with one feature map at a time. Thus, with N_CeNN CeNN arrays, N_CeNN feature maps can be computed simultaneously. Generally, in a state-of-the-art CoNN design there may be hundreds of feature maps. However, it is not possible to accommodate hundreds of CeNNs on a chip due to area and power restrictions. Therefore, the CeNNs must be time multiplexed to compute the different feature maps of a layer, and the intermediate data must be stored in the associated analog memory for processing in the next layer. Thus, the number of CeNN arrays should be chosen to balance the power/area of the chip against the degree of parallelism across the feature maps (FMs) of any given layer.
We use a convolution layer as an example to illustrate how the computation is performed, since it is typically the most time- and energy-consuming layer in state-of-the-art CoNN designs. We assume layer l is a convolution layer with M_l feature maps as inputs and N_l feature maps as outputs, and that the number of CeNN arrays is N_CeNN. For each output feature map Y_j of layer l, the computation required is shown in Eq. 12: each feature map X_i (i from 1 to M_l) of layer l−1 must be convolved with a kernel K_{i,j}, and the sum of the convolution results must be computed. That is,
Y_j = Σ_{i=1}^{N} X_i ∗ K_ij + b_j    (12)
The computation in Eq. 12 is repeated M times to obtain the results for all M output feature maps in layer l.
To compute an output feature map Y_j, we first perform convolution operations on the first P input feature maps of layer l, X_1 to X_P, obtaining X_1 ∗ K_1j to X_P ∗ K_Pj. Their partial sum is computed by leveraging the connections among the CeNN arrays, and the intermediate result is stored in the analog memories associated with CeNN array 1. Similarly, the convolutions on the next P input feature maps (X_{P+1} to X_{2P}) are performed, and their partial sum is computed and stored in the analog memories associated with CeNN array 2. We repeat this process until all N input feature maps have been convolved with their kernels and all the partial sums are stored in the analog memories of the arrays. If the number of CeNN arrays, P, is less than the number of partial sums, one CeNN array stores more than one of them. We then add these partial sums to obtain feature map Y_j of layer l. This whole process is repeated M times to obtain all the feature maps in layer l. The detailed algorithm is shown in Algorithm 5, which also summarizes the computations for the other types of CoNN layers. By iteratively using the CeNN architecture in this way, we realize the different layer functionalities. The relation between the processing time and the number of CeNN arrays for a convolution layer is given in Eq. 13.
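The partial-sum dataflow above can be sketched functionally. The following is a minimal NumPy sketch; `conv2d_same` and `cenn_conv_layer` are illustrative names of our own, and the analog accumulation in the CeNN memories is modeled simply as array addition.

```python
import numpy as np

def conv2d_same(x, k):
    """3x3 'same' convolution with zero padding; stands in for one
    CeNN template pass over a feature map."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for r in range(x.shape[0]):
        for c in range(x.shape[1]):
            out[r, c] = np.sum(xp[r:r + kh, c:c + kw] * k)
    return out

def cenn_conv_layer(inputs, kernels, biases, num_arrays):
    """Time-multiplexed convolution-layer dataflow: groups of input maps
    are convolved on the available arrays, partial sums accumulate in the
    (modeled) analog memories, and the partial sums are added at the end.
    `kernels[i][j]` maps input map i to output map j."""
    outputs = []
    for j, b in enumerate(biases):
        # One partial-sum slot per physical CeNN array.
        partials = [np.zeros_like(inputs[0], dtype=float)
                    for _ in range(num_arrays)]
        for i, x in enumerate(inputs):
            partials[i % num_arrays] += conv2d_same(x, kernels[i][j])
        outputs.append(sum(partials) + b)
    return outputs
```

With `num_arrays` smaller than the number of input maps, several partial sums share one slot, mirroring the case where one CeNN array stores more than one partial sum.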
T_conv = M · ⌈N/P⌉ · (T_settle + T_prog + T_read + T_write)    (13)
Here, T_settle refers to the settling time of a CeNN array, T_prog refers to the reprogramming time of a CeNN (i.e., loading new templates), and T_read and T_write are the analog memory read and write times, respectively.
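This timing model can be written as a one-line helper. This is a first-order sketch under our reading of the dataflow (every pass pays settling, reprogramming, and one memory read and write); the paper's exact bookkeeping may differ, e.g., if reprogramming is amortized across passes.

```python
import math

def conv_layer_time(n_in, m_out, p_arrays, t_settle, t_prog, t_read, t_write):
    """First-order processing time of a convolution layer: each of the
    m_out output maps needs ceil(n_in / p_arrays) sequential CeNN passes,
    and every pass incurs settling, template reprogramming, and one
    analog-memory read and write."""
    passes = m_out * math.ceil(n_in / p_arrays)
    return passes * (t_settle + t_prog + t_read + t_write)
```

Doubling the number of arrays (up to `n_in`) halves the pass count, which is the processing-time/area tradeoff discussed above.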
Algorithm 1 CoNN layer computation with CeNN
The templates of each CeNN array can be programmed to implement the different kernels in a given CoNN. Before each CeNN operation, the OTAs are reconfigured to implement the required template. The template values are read from the SRAM block, where all of them are stored; the bitline outputs of the SRAM are connected to the switches of the OTAs. After configuration, the CeNN operation is performed. Below, we discuss the key blocks of the CeNN architecture.
5.2. CeNN cell design
CeNN arrays are the core computational elements in our architecture. The CeNN template values for the different layers are determined during the network design phase. For convolution layers, the templates are the weights trained by deep neural network frameworks; the templates for ReLU and pooling, discussed in Sec. 3, are independent of the specific problem instance. These template values are read from the SRAM to configure the VCCSs in the CeNN cells. All the cells in an array share the same template values, but different CeNN arrays may employ either the same templates (e.g., for ReLU and pooling layers) or different templates (e.g., for convolution layers).
Many prior works have focused on CeNNs implemented by analog circuits using CMOS transistors. Per Sec. 2, a widely used implementation is based on OTAs (Chou et al., 1997). Here, an OTA is built from two-stage operational amplifiers (Lou et al., 2015). We use OTAs with binary-weighted transconductance values (i.e., g_m, 2g_m, …, 2^{N−1}g_m) to realize N-bit templates (i.e., weights). The value of g_m is set according to the power requirement, since g_m is proportional to the bias current. Each OTA is connected to a switch for power gating. By power gating different combinations of these OTAs (as shown in Fig. 6), different template values can be realized.
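The mapping from a quantized template value to a set of enabled OTAs is just the binary expansion of the value. The sketch below is illustrative (function name and unit value are our own; `gm_unit=12` mirrors the 4-bit design in Sec. 6.1, and sign handling and physical units are omitted):

```python
def gate_otas(level, gm_unit=12.0, n_bits=4):
    """Choose which binary-weighted OTAs (gm, 2gm, 4gm, ...) to enable so
    their summed transconductance realizes an n-bit template magnitude.
    Returns (per-OTA enable flags, realized total transconductance)."""
    level = max(0, min(2**n_bits - 1, int(level)))  # clamp to n-bit range
    enables = [bool((level >> b) & 1) for b in range(n_bits)]
    total_gm = sum(gm_unit * (2**b) for b in range(n_bits) if enables[b])
    return enables, total_gm
```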
The cell resistance (R in Fig. 1) is set such that the cell voltage settles to the desired output to achieve correct CoNN functionality. The cell capacitance (C in Fig. 1) is the sum of the output capacitances of the nearby OTAs. The delay and energy estimates for a CeNN cell in this paper differ from those in (Horvath et al., 2017) in that: (1) a 32 nm technology is used for the hardware design, (2) the g_m values of the OTAs are larger for faster processing while still satisfying a given power requirement, and (3) the cell resistance in (Horvath et al., 2017) is derived from the absolute value of the sum of the template transconductances, which leads to much larger settling times. The work in (Horvath et al., 2017) is therefore a conservative estimate that overstates delay and energy.
5.3. Analog memory design
In order to support operations that require multiple (analog) steps associated with different CeNN templates, each CeNN cell is augmented with an embedded analog memory array (Carmona-Galan et al., 1999) (see Fig. 5). For the CeNN based convolution computation described in Sec. 5.1, the analog memory stores the intermediate result after each step; for a convolution layer, all the intermediate results described in Algorithm 5 are stored there. The design of the analog memory and the op amp follows (Carmona-Galan et al., 1999). Specifically, the analog memory array uses a write transistor and a read transistor to enable writes and reads, and an additional op amp holds the state of the capacitors shown in Fig. 5b. Multiple pass transistors and capacitors store the data: each capacitor together with its pass transistor forms a memory cell (a charge storage capacitor) that stores one state value of a CeNN cell (i.e., the data corresponding to one pixel). The number of capacitors within one analog memory array depends on how much data needs to be stored. The gates of the pass transistors are connected to a MUX, which determines which capacitor is written or read. For a write, the write transistor is on and one of the pass transistors is selected by the MUX, so the data is written to the corresponding capacitor. For a read, the read transistor is on and one of the pass transistors is selected by the MUX. As each analog memory array is dedicated to one CeNN cell, all CeNN cells can access their memory arrays in parallel.
5.4. SRAM
An SRAM array stores all the template values required for the CeNNs to realize a CoNN. While the SRAM itself is a standard design, we must carefully select the number of bitlines per word line due to power and performance constraints. One design choice is to have one word contain all the template values for one CeNN array: for one template operation, 10N bits are needed for N-bit precision weights (9 template values and a bias). With this option, if the P CeNN arrays have distinct sets of templates (i.e., in a convolution layer), P accesses are required; if they share the same templates (i.e., in ReLU and pooling layers), only one access is required. To reduce the number of accesses, two or more words may be accessed in one cycle by either using more read ports or longer SRAM words. After the SRAM data are read, they control how the OTAs are power gated, which in turn realizes the different template values.
5.5. ADC and hardware for FC layers
Each CeNN array is connected to an ADC to convert analog data to a digital representation whenever necessary, e.g., for the FC layer computation (the last layer in Fig. 2). FC layers typically require computing the dot product of two large vectors. Such operations are not well suited to CeNNs with their limited kernel size; hence a CPU, GPU, or other hardware should be employed. In the benchmarking efforts discussed in Sec. 6, combinations of digital adders, multipliers, and registers (i.e., ASICs) are used. For simplicity, ripple carry adders and array multipliers are employed in our simulations. Both inputs and weights are N bits, where N is the precision of the CeNN. We assume that the weights for the FC layer are stored in SRAM. The result of each multiplication is 2N bits, and additional guard bits are carried when accumulating the final results of this layer to avoid overflow. That said, the alternative network design shown in Sec. 4.2 can eliminate this layer.
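The bit-width bookkeeping for the digital FC hardware can be sketched as a saturating fixed-point dot product. The widths here (2N-bit products plus a configurable number of guard bits) are illustrative assumptions consistent with the description above, not the exact hardware parameters.

```python
def fc_dot_fixed(xs, ws, n_bits=4, guard_bits=4):
    """Fixed-point FC dot product: n-bit inputs and weights form 2n-bit
    products, and the accumulator carries guard bits so the running sum
    does not overflow (saturating arithmetic models the bounded width)."""
    acc_bits = 2 * n_bits + guard_bits
    lo, hi = -(2**(acc_bits - 1)), 2**(acc_bits - 1) - 1
    acc = 0
    for x, w in zip(xs, ws):
        acc = max(lo, min(hi, acc + x * w))  # saturating accumulate
    return acc
```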
6. Evaluation
We now evaluate the architectures, networks, and algorithms described above to determine (i) whether CeNN-friendly CoNNs are competitive with existing architectures and algorithms that address the same dataset, and (ii) if so, what elements of the CeNN design space lead to superior application-level FOMs (e.g., energy and delay per classification, and accuracy). While our CeNN architecture can be applied to different datasets, we specifically compare our approach to other efforts in the context of the MNIST and CIFAR-10 datasets, given the wealth of comparison points available.
6.1. Simulation setup
Components of the CeNN based architecture are evaluated via SPICE simulation using the Arizona State University Predictive Technology Model (ASU PTM) for high-performance MOSFET devices at the 32 nm technology node (Zhao et al., 2006). We use CACTI 7 (Balasubramonian et al., 2017) to estimate the delay and energy of SRAM accesses at the same technology node. The SRAM size is set to 16 KB to retain reasonable access time/energy while accommodating all templates for the proposed networks; it can be scaled if necessary to accommodate all the weights of larger networks. In our SRAM design, each word line contains enough bitlines that all the weights needed for one CeNN operation can be read in a single access. The analog memory is also scaled to the same technology node.
Though the architecture itself can realize any number of bits, we assume 4bit and 8bit precision in our evaluation. 4bit results help to inform the energy efficiency of our design with reasonable applicationlevel classification accuracy, while 8bit designs generally do not sacrifice accuracy when compared with 32bit floating point representation. We use 4 CeNNs that correspond to 4 feature maps in the networks described in Sec. 4 for evaluation. However, the number of CeNNs could be changed as a tradeoff between processing time and area/power, as discussed in Algorithm 5 in Sec. 5.1. We take the trained model from TensorFlow, and perform inference computations in a MATLAB based infrastructure with both feature maps and weights quantized to 4 bits or 8 bits to predict accuracy.
The supply voltage is set to 1 V, and the ratio of the current mirrors in the OTAs is set to 2 to save power in the first stage of the OTA. For different precision requirements, the same OTA schematic is used with different transistor sizes and bias currents. The multi-OTA design of Sec. 5.2 is used to represent different numbers of bits for the weights. For each OTA, we convert its signal-to-noise ratio (SNR) to bit precision using the methods in (Kester, 2009) to represent different numbers of bits for the feature maps. Compared to the 4-bit designs, the 8-bit designs increase the bias current by 7.5× and the transistor width by 4× to raise the SNR of the circuit from 32.1 dB to 50.6 dB. As a result, the delay increases by 4.3× due to the change in bias conditions and the larger transistor size (i.e., larger parasitic capacitance), and the power increases by 7.5× with the bias current. The g_m values of the OTAs can be selected to trade off processing speed and power. Here, we use four OTAs with g_m values of 12, 24, 48, and 96 to realize 4-bit templates (i.e., weights). In the 8-bit design, finer granularity is used, with g_m values of 0.75, 1.5, 3, 6, 12, 24, 48, and 96. We assume state-of-the-art ADC designs (Xu et al., 2016, 2014) to estimate the delay and energy of the analog-to-digital conversion needed before the FC layer in the network of Fig. 2, with each CeNN associated with one ADC.
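The SNR-to-bits conversion referred to above is commonly done with the standard ENOB relation; a sketch is below. Note this formula gives roughly 8 effective bits at 50.6 dB but about 5 at 32.1 dB, so the 4-bit design presumably carries some margin; we state the relation only as the usual conversion (Kester), not as the paper's exact procedure.

```python
def snr_to_enob(snr_db):
    """Effective number of bits from SNR via the standard relation
    ENOB = (SNR_dB - 1.76) / 6.02."""
    return (snr_db - 1.76) / 6.02
```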
We employ the same device model to benchmark the analog memory arrays. We first determine the capacitance and the size of the pass transistors based on the methods in (Carmona-Galan et al., 1999); the width of the pass transistor is 180 nm, and we use a minimum length of 30 nm. The memory write time is determined by the resistance of the pass transistor and the storage capacitor, and the memory read time is determined by the analog signal passing through the read buffer. We use SPICE to measure the delay of the analog memories; per these simulations, each memory write and read requires 124 and 253, respectively.
6.2. Evaluation of the CeNN based architecture
We initially use the 4-bit CeNN design as an example to show how we evaluate the accuracy, delay, and energy of our CeNN architecture for CoNN computation. We use MNIST as the benchmarking dataset, and both the network in Fig. 2 and the network in Fig. 3 with different configurations (summarized in Table 1) are used for evaluation. 8-bit results are also presented.
We first measure the energy and delay associated with each layer of a CeNN-friendly CoNN for the 4-bit design. Table 3 summarizes the delay and energy of each layer for the networks in Figs. 2 and 3. Per Table 3, the energy of each layer in the network in Fig. 3 decreases with subsequent layers as the data is downsampled and only a subset of cells in a CeNN is used for the computation. The delay, however, remains constant across layers, as all computations in the CeNN cells occur in parallel. (The network in Fig. 3 has a higher latency than the network in Fig. 2 in the CeNN components because more layers, i.e., more template operations, are needed to properly downscale the image.) We use the MATLAB framework to quantize the weights and inputs to 4 bits in the inference stage; the resulting classification accuracies are shown in Table 4. In all cases, the accuracy decreases by about 2% compared with the 32-bit floating point designs in Table 1, due to the reduced precision of inputs and weights in our simple networks.
We next consider the impact of the ADCs and the FC layer. The delay and energy of an ADC can be approximated based on a 28 nm SAR ADC design from (Xu et al., 2016). The total time and energy to port all analog data to the digital domain for the network in Fig. 2 are 166.7 ns and 3834 pJ, respectively (using time multiplexing). For the FC layer, we first use the uniform beyond-CMOS benchmarking (BCB) methodology (Nikonov et al., 2015) to estimate the delay and energy of a full adder, as well as of the registers that hold temporary data during the computation. We then estimate the delay of the multiplication and addition operations by counting the number of full adders on the critical path of the multiplier and adder; the energy per operation is estimated as the sum of all full adder operations plus loading/storing data during computation. The energy and delay overhead of interconnect parasitics is also taken into account via the BCB methodology. Overall, the delay and energy of the FC layer are 124.4 ns and 4041 pJ, contributing 23% and 20%, respectively, of the total delay and energy per classification for the network in Fig. 2 (including ADCs).
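The full-adder-counting delay estimate can be sketched as below. These are first-order critical-path counts in the spirit of the methodology described above (the array-multiplier path length of roughly 2n full-adder delays is our simplifying assumption, not the paper's exact count):

```python
def ripple_adder_delay(n_bits, t_fa=1.0):
    """Worst-case delay of an n-bit ripple-carry adder: the carry
    ripples through all n full adders."""
    return n_bits * t_fa

def array_multiplier_delay(n_bits, t_fa=1.0):
    """Rough critical path of an n x n array multiplier, counted in
    full-adder delays (on the order of 2n)."""
    return 2 * n_bits * t_fa
```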
            Network in Fig. 2         Network in Fig. 3
Layer       Delay (ns)  Energy (pJ)   Delay (ns)  Energy (pJ)
Conv. 1     5.3         626           5.3         626
ReLU 1      10.7        536           10.7        536
Pooling 1   85.5        4290          85.5        3398
Conv. 2     42.8        2827          42.8        981
ReLU 2      10.7        410           10.7        186
Pooling 2   85.5        3277          85.5        1489
Conv. 3     —           —             42.8        519
ReLU 3      —           —             10.7        115
Pooling 3   —           —             85.5        921
Conv. 4     —           —             53.4        582
ADC + FC    291.1       7875          —           —
Total       531.6       19841         432.9       9353
Though the network in Fig. 3 (with no FC layer) requires additional layers to properly downscale the image, its delay is still 19% lower than that of the network in Fig. 2. Additionally, the network in Fig. 3 requires 2.1× less energy per classification due to downsampling. However, its accuracy is 0.5% lower than that of the network in Fig. 2.
To evaluate the impact of different approaches to the pooling operation, as well as how nonlinear template operations affect energy, delay, and accuracy, we apply each design alternative to the networks in Figs. 2 and 3. Results are summarized in Table 4; the numbers in parentheses compare each alternative with the baseline (the networks in Figs. 2 and 3 with maximum pooling and linear templates). Using average pooling reduces delay/energy by 1.4×/1.5× and 2.2×/2.1× for the networks in Figs. 2 and 3, respectively, as 16 CeNN steps are reduced to 1, while accuracy drops by 0.8% for the network in Fig. 2 and 1.7% for the network in Fig. 3. Designs with nonlinear templates reduce delay/energy by 1.5×/1.7× and 3.7×/2.8×, respectively, as both the ReLU and pooling operations are reduced to a single step; however, accuracy drops by 3.6% and 4.5%, respectively, following the same trend as the floating point versions.
                  Network in Fig. 2                     Network in Fig. 3
Approach          Accuracy  Delay (ns)   Energy (nJ)    Accuracy  Delay (ns)   Energy (nJ)
Baseline          96.5%     532          19.8           96.0%     433          9.4
Average pooling   95.7%     372 (1.4×)   12.5 (1.5×)    94.3%     192 (2.2×)   4.4 (2.1×)
Nonlinear op.     92.9%     357 (1.5×)   12.0 (1.7×)    91.5%     116 (3.7×)   3.4 (2.8×)
The accuracy drops for the 4-bit designs (Table 4) compared with the 32-bit floating point designs (Table 1). Meanwhile, there is evidence that 8-bit precision usually does not sacrifice accuracy for many networks compared with 32-bit floating point, and it is widely used in state-of-the-art training and inference engines (Jouppi et al., 2017). We therefore also evaluate the accuracy, delay, and energy of an 8-bit CeNN design using the same method, to show the tradeoffs. In this design, we use OTAs with an SNR equivalent to 8-bit precision, and the weights are also set to 8 bits. We use a different ADC design (Keane et al., 2017) to evaluate the overhead of converting analog signals to 8-bit digital signals, and the inputs and weights of the digital FC layer are also set to 8 bits. The results are summarized in Table 5. As expected, delay and energy increase over the 4-bit design by 2.0×–4.2× and 3.8×–7.5×, respectively, depending on the specific design, but accuracy approaches that of 32-bit floating point. The delay and energy of the network in Fig. 3 increase more than those of the network in Fig. 2: the computations of the network in Fig. 3 are almost entirely in the analog domain, while the network in Fig. 2 uses both analog and digital circuits, and as the number of bits increases, the delay and energy of analog computations generally grow faster than those of digital computations.
6.3. Comparison to other MNIST implementations
It is natural to ask how our CeNN based approach compares to other accelerator architectures and algorithms developed for classification problems such as MNIST. Since the computations in our designs are mostly performed in the analog domain, we first compare our work with a recent logic-in-memory analog implementation that addresses the same problem (Biswas et al., 2018), comparing the delay and energy of the convolution layers. As (Biswas et al., 2018) only reports throughput and energy efficiency for the first two convolution layers of LeNet-5, using 7-bit inputs and 1-bit weights, we likewise use the throughput and energy efficiency of the convolution layers in our baseline network design for a fair comparison. The results are shown in Table 6. Our CeNN design demonstrates a 10.3× EDP improvement over (Biswas et al., 2018), and at the application level we still obtain better classification accuracy (96.5% vs. 96%). However, since (Biswas et al., 2018) does not include data for the FC layer, it lacks complete EDP data on MNIST; hence we do not include it in the benchmarking plot (Fig. 7) discussed below.
Approach                                              Feature map precision  Weight precision  Throughput  Energy efficiency  Technology  Accuracy
CeNN based approach                                   4 bits                 4 bits            251 GOPS    12.3 TOPS/W        32 nm       96.5%
Logic-in-memory analog circuit (Biswas et al., 2018)  7 bits                 1 bit             10.7 GOPS   28.1 TOPS/W        65 nm       96%
We next consider a state-of-the-art digital DNN engine (Whatmough et al., 2017) at a 28 nm technology node for the MNIST dataset, at iso-accuracy with our CeNN based designs. We scale the design in (Whatmough et al., 2017) from 28 nm to 32 nm for a fair comparison using the method described in (Perricone et al., 2016). The work in (Whatmough et al., 2017) assumes a multilayer perceptron (MLP) with 8-bit feature maps and weights, at varying network sizes. Among these, we find three implementations that match the accuracy of our three designs. Their network sizes are 784-16-16-16-10, 784-32-32-32-10, and 784-64-64-64-10, with accuracies of 95.41%, 97.0%, and 97.58%, respectively. Our three designs are (i) the network in Fig. 3, baseline with 4-bit precision (96.03% accuracy), (ii) the network in Fig. 2, baseline with 4-bit precision (96.5% accuracy), and (iii) the network in Fig. 3, average pooling with 8-bit precision (97.41% accuracy). We compare FOMs including energy and delay at iso-accuracy for these designs.
Comparison     Approach                                 Accuracy  Bits  Delay (ns)  Energy (nJ)  EDP (nJ·ns)
Comparison 1   CeNN – network in Fig. 3, baseline       96.03%    4     372         9.0          3348
               DNN engine (Whatmough et al., 2017)      95.41%    8     1001        39.9         39940
Comparison 2   CeNN – network in Fig. 2, baseline       96.5%     4     532         19.8         10534
               DNN engine (Whatmough et al., 2017)      97.0%     8     1478        72.5         107155
Comparison 3   CeNN – network in Fig. 3, avg. pooling   97.41%    8     810         230          186300
               DNN engine (Whatmough et al., 2017)      97.58%    8     2692        145          390340
From Table 7, we find that our implementation achieves 2.1×–8.7× better EDP and 6×–27× better energy efficiency than the DNN engine (Whatmough et al., 2017). The 8-bit CeNN based design is not as efficient as the 4-bit design relative to the DNN engine, because analog circuits fare worse than digital circuits in area/delay/energy at higher precision. Our delay and energy data are based on simulations, while the DNN engine data are based on measurements, so some discrepancy may exist. In general, however, the CeNN approach benefits because (i) high parallelism is achieved for the multiplications and additions in the CeNN-based architecture, (ii) the network exploits local analog memory for fast processing, and (iii) accessing feature maps in the analog domain is faster than accessing digital weights in the digital domain; a weight stationary approach is therefore used. Once the weights are read from the SRAM (i.e., all the cells are configured), all the computations associated with those weights are performed, and the weights need not be read from SRAM again, minimizing the total weight access time. Since there are still unused OTAs in our design, it could be further optimized to reduce delay and energy.
We also compare our work with a wider range of implementations, including custom ASIC chips (Reagen et al., 2016; Whatmough et al., 2017; Moons et al., 2016; Chen et al., 2017), neural processing units (Hashemi et al., 2017), spiking neural networks (Esser et al., 2015; Kim et al., 2015; Mostafa et al., 2017), crossbar implementations (Tang et al., 2017), and CPU/GPU-based solutions of the DropConnect approach (Wan et al., 2013), the most accurate approach for MNIST to date (data measured on an i7-5820K with 32 GB DDR3 and an Nvidia Titan GPU). Fig. 7 plots EDP vs. misclassification rate for all these approaches. For a fair comparison, we again scale all delay/energy data to the 32 nm technology node using ITRS data, per (Perricone et al., 2016).
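Cross-node scaling of the reported numbers can be sketched with a crude constant-field model: delay scales linearly with feature size and energy quadratically. This is a stand-in for the ITRS-based factors of (Perricone et al., 2016), which differ in detail, so treat it as illustrative only.

```python
def scale_to_node(delay, energy, from_nm, to_nm):
    """First-order technology scaling for cross-node comparison:
    delay ~ feature size, energy ~ feature size squared
    (constant-field approximation)."""
    s = to_nm / from_nm
    return delay * s, energy * s**2
```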
Note that since the comparison spans orders of magnitude (log scale), additional uncertainties (interconnect parasitics, clocking, control circuits) should not change the overall trend in Fig. 7, as the EDP of these elements would not be orders of magnitude larger (Nikonov and Young, 2015). Our approach has significantly lower EDP than other approaches with comparable classification accuracy. Among our designs, higher EDP is generally correlated with higher accuracy. We draw a Pareto frontier (the green line in Fig. 7) according to the product of misclassification rate and EDP. Several of our data points lie on the Pareto frontier: for the 4-bit design, the network in Fig. 3 with maximum pooling and linear templates and the network in Fig. 3 with average pooling and linear templates; for the 8-bit design, the network in Fig. 2 with average pooling and linear templates. We note that the EDP values of some implementations (Reagen et al., 2016; Whatmough et al., 2017; Chen et al., 2017; Moons et al., 2016) in Fig. 7 are obtained from actual measurements, while others are from simulation, so some discrepancy may exist.
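Identifying the frontier points in such a plot is a standard dominance computation. The sketch below uses the usual Pareto-dominance definition; note the paper instead ranks points by the product of the two metrics, so this is our illustrative variant rather than the paper's exact procedure.

```python
def pareto_frontier(points):
    """Pareto-optimal subset of (misclassification_rate, edp) points:
    a point survives unless some other point is at least as good in
    both metrics and strictly better in at least one."""
    frontier = []
    for i, (m1, e1) in enumerate(points):
        dominated = any(
            m2 <= m1 and e2 <= e1 and (m2 < m1 or e2 < e1)
            for j, (m2, e2) in enumerate(points) if j != i
        )
        if not dominated:
            frontier.append((m1, e1))
    return frontier
```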
6.4. Evaluation of larger networks
In Sec. 6.3, we presented a comprehensive comparison using the MNIST problem as the context. However, networks for MNIST are relatively simple. In this subsection, we therefore also compare our CeNN design with implementations that target larger networks, i.e., accelerators that solve the CIFAR-10 problem.
For the CIFAR-10 dataset, images of size 32×32 are used, and we use CeNNs of the same size to enable parallel processing of a whole image. The evaluation setup is the same as in Sec. 6.1. We use the networks discussed in Sec. 4.3 and summarize our results in Table 8. Here, we use the 4-bit design to maximize energy efficiency; the accuracy is close to the 32-bit floating point accuracy (given in Table 2).
Approach   C96-C256-C384-C384-C256   C64-C128-C256-C256-C128   C64-C128-C128-C128-C64
Accuracy   83.9%                     82.2%                     80.8%
Delay      311                       106                       47
Energy     497                       169                       75
We compare our approach with a large number of implementations that solve the CIFAR-10 problem; the benchmarking plot is shown in Fig. 8. The implementations include IBM TrueNorth (Esser et al., 2015), a Fourier transform approach (Liao et al., 2017), an NPU (Hashemi et al., 2017), Eyeriss (Chen et al., 2017), a mixed-signal approach (Bankman et al., 2018), and the CPU and GPU data reported in (Ni et al., 2017). We again draw a Pareto frontier based on the product of misclassification rate and EDP for the data points in Fig. 8. One of our CeNN data points (C64-C128-C128-C128-C64) lands on the Pareto frontier. We also make an iso-accuracy comparison with the NPU data point in the plot, selecting the design of ours whose accuracy is closest to the NPU design; the detailed comparison is shown in Table 9. Not only is the accuracy of our CeNN design 0.3% better than the NPU approach, but our design also achieves a 4.3× better EDP. Note that the NPU data are also simulation results.
Approach                     Technology node  Accuracy  Bits  Delay  Energy  EDP
CeNN based approach          32 nm            80.8%     4     47     75      3525
NPU (Hashemi et al., 2017)   32 nm            80.5%     8     485    32      15332
We also discuss how our work differs from existing analog accelerator engines, namely ISAAC (Shafiee et al., 2016) and RedEye (LiKamWa et al., 2016). Compared with ISAAC: (1) ISAAC uses a crossbar architecture in which signals are accumulated horizontally, whereas our CeNN-based computation communicates signals only locally; (2) ISAAC uses an in-memory computing paradigm, while our work separates the memory units and computation units; and (3) ISAAC requires memristors for computation, while our work is compatible with both emerging devices and conventional CMOS. Our architecture and processing engine also differ from RedEye: (1) our approach uses OTAs as computational units, while RedEye uses tunable capacitors; and (2) our dataflow is different, as RedEye transmits data vertically, while our CeNN based architecture processes data locally.
6.5. Training with actual I-V characteristics
In Sec. 6.2, we showed that with an 8-bit representation the accuracy does not decrease much compared with the 32-bit floating point representation. However, another source of error is the actual I-V characteristics of an OTA. For example, in Fig. 9, as the difference between the two inputs of the OTA grows, the mismatch between the actual and ideal I-V characteristics becomes more severe. This behavior could potentially decrease the accuracy.
To study this impact, we include the mismatch described above in the inference stage. We take the actual I-V characteristics of an OTA from SPICE simulation and build a lookup table, which we embed into the MATLAB based CeNN simulator for inference. That is, whenever an OTA operation is required, the result is read from the lookup table instead of being computed by direct matrix multiplication. Simulations of the networks in Figs. 2 and 3 suggest that including the actual I-V characteristics decreases the accuracy from 98.1% and 97.8% to 96.5% and 95.8%, respectively.
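The lookup-table substitution can be sketched as follows. In the paper the (voltage, current) samples come from SPICE; here a saturating tanh curve is a toy stand-in for measured data, and the names are ours.

```python
import numpy as np

def make_ota_lut(v_samples, i_samples):
    """Wrap (input voltage, output current) samples into an interpolating
    lookup table that replaces the ideal linear gm model at inference."""
    order = np.argsort(v_samples)
    v = np.asarray(v_samples, dtype=float)[order]
    i = np.asarray(i_samples, dtype=float)[order]
    return lambda vin: np.interp(vin, v, i)  # clamps outside the sampled range

# Toy stand-in for measured SPICE data: a saturating (tanh-like)
# characteristic instead of the ideal i = gm * v.
v_grid = np.linspace(-1.0, 1.0, 101)
ota = make_ota_lut(v_grid, np.tanh(3.0 * v_grid))
```

During inference, every OTA evaluation calls `ota(v)` instead of multiplying by the nominal transconductance, which is how the mismatch enters the simulated network.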
However, this accuracy decrease can be largely compensated by incorporating the I-V characteristics into the training stage. We use the same lookup table and plug it into the forward path of training in the TensorFlow framework. With the I-V characteristics considered during training, the accuracy increases and becomes close to the ideal accuracy. The results are summarized in Table 10: using the actual I-V characteristics in training, accuracy decreases by only 0.2% compared with the original networks, for the baseline designs of both the network in Fig. 2 and the network in Fig. 3. This approach should be applicable to other nonideal circuit behaviors as well.
Network type                                       Original network  Inference with actual I-V  Training & inference with actual I-V
Network in Fig. 2, linear templates, max pooling   98.1%             96.5%                      97.9%
Network in Fig. 3, linear templates, max pooling   97.8%             95.8%                      97.6%
Our initial study indicates that accounting for nonideal circuit characteristics during training can improve accuracy. Other nonideal behaviors of analog circuits would likely also need to be included in training. Given that on-chip training is still an active research topic (Yao et al., 2017; Chen and Yu, 2017), whether individualized (per-chip) training is required remains an open question. We expect, however, that the neural network itself is robust, so small variations will probably not greatly impact accuracy at the application level. Moreover, approaches like partial on-chip training could be used: train a model on a GPU, deploy it to the device, and then fine-tune based on the nonideal device characteristics.
7. Conclusions and Discussions
This paper presents a mixed-signal architecture for hardware implementation of convolutional neural networks, based on an analog CeNN realization. We demonstrate the use of CeNNs to realize the different layers of a CoNN and the design of CeNN-friendly CoNNs. We present tradeoffs for each CeNN based design and compare our approaches with various existing accelerators to illustrate the benefits, using the MNIST and CIFAR-10 problems as case studies. Our results show that the CeNN based approach can lead to superior performance while retaining reasonable accuracy: in iso-accuracy comparisons with state-of-the-art approaches, we obtain up to 8.7× better EDP for the MNIST problem and 4.3× better EDP for the CIFAR-10 problem. Our CeNN based design is well suited to layers with 3×3 kernels, the most commonly used kernel size in today's neural networks. Our simulation results also indicate that for AlexNet-level networks we can achieve better performance than existing works, which makes the CeNN based architecture suitable for deployment on IoT devices, where the networks used are usually smaller than AlexNet. As future work, we will continue evaluating what benefits machine learning/computer vision applications can gain from analog computation and emerging devices.
Acknowledgements.
This work was supported in part by the Center for Low Energy Systems Technology (LEAST), one of six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA.
References
 SPS ([n. d.]) [n. d.]. Official site of the Toshiba SPS 02 Smart Photosensor. http://www.toshibateli.co.jp/en/products/industrial/sps/sps.htm.
 tem ([n. d.]) [n. d.]. Software Library for Cellular Wave Computing Engines in an era of kilo-processor chips, Version 3.1. http://cnntechnology.itk.ppke.hu/Template_library_v3.1.pdf. Accessed: 2016-11-29.
 Balasubramonian et al. (2017) Rajeev Balasubramonian et al. 2017. CACTI 7: New Tools for Interconnect Exploration in Innovative Off-Chip Memories. TACO 14, 2 (2017).
 Bankman et al. (2018) Daniel Bankman et al. 2018. An always-on 3.8 μJ/86% CIFAR-10 mixed-signal binary CNN processor with all memory on chip in 28nm CMOS. In Solid-State Circuits Conference (ISSCC). 222–224.
 Biswas et al. (2018) Avishek Biswas et al. 2018. Conv-RAM: An Energy-Efficient SRAM with Embedded Convolution Computation for Low-Power CNN-Based Machine Learning Applications. In ISSCC. 31.1.
 Boureau et al. (2010) Y-Lan Boureau et al. 2010. A theoretical analysis of feature pooling in visual recognition. In ICML. 111–118.
 Carmona-Galan et al. (1999) R. Carmona-Galan et al. 1999. An 0.5 μm CMOS analog random access memory chip for TeraOPS speed multimedia video processing. IEEE Trans. on Multimedia 1, 2 (1999), 121–135.
 Chen and Yu (2017) Pai-Yu Chen, Xiaochen Peng, and Shimeng Yu. 2017. NeuroSim+: An integrated device-to-algorithm framework for benchmarking synaptic devices and array architectures. In IEDM. 6.1.
 Chen et al. (2017) Yu-Hsin Chen et al. 2017. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. JSSC 52, 1 (2017), 127–138.
 Chou et al. (1997) Eric Y. Chou et al. 1997. VLSI design of optimization and image processing cellular neural networks. IEEE TCAS-I 44 (1997), 12–20.
 Chua and Roska (2002a) Leon O Chua and Tamas Roska. 2002a. Cellular neural networks and visual computing: foundations and applications. Cambridge University Press.
 Chua and Roska (2002b) Leon O. Chua and Tamas Roska. 2002b. Cellular neural networks and visual computing: foundations and applications.
 Dahl et al. (2013) George E Dahl, Tara N Sainath, and Geoffrey E Hinton. 2013. Improving deep neural networks for LVCSR using rectified linear units and dropout. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 8609–8613.

 Deng et al. (2009) Jia Deng et al. 2009. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR). 248–255.
 Esser et al. (2015) Steve Esser et al. 2015. Backpropagation for Energy-Efficient Neuromorphic Computing. In NIPS. 1117–1125.
 Szegedy et al. (2015) Christian Szegedy et al. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

 Graves et al. (2013) Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 6645–6649.
 Han et al. (2016) S. Han et al. 2016. EIE: Efficient inference engine on compressed deep neural network. In ISCA.
 Hashemi et al. (2017) Soheil Hashemi et al. 2017. Understanding the impact of precision quantization on the accuracy and energy of neural networks. In DATE. 1474–9.
 Horvath et al. (2017) A. Horvath et al. 2017. Cellular neural network friendly convolutional neural networks – CNNs with CNNs. In DATE. 145–150.
 Molinar-Solis et al. (2007) Jesus E. Molinar-Solis et al. 2007. Programmable CMOS CNN cell based on floating-gate inverter unit. The Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology (2007), 207–216.
 Jones (2017) Nicola Jones. 2017. Machine learning tapped to improve climate forecasts. Nature 548 (2017), 379–380.

 Jouppi et al. (2017) Norman P. Jouppi et al. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In ISCA.
 Kang and Kang (2016) Min-Joo Kang and Je-Won Kang. 2016. Intrusion detection system using deep neural network for in-vehicle network security. PLoS ONE 11, 6 (2016), e0155781.
 Keane et al. (2017) John P. Keane et al. 2017. An 8GS/s time-interleaved SAR ADC with unresolved decision detection achieving −58 dBFS noise and 4 GHz bandwidth in 28nm CMOS. In Solid-State Circuits Conference (ISSCC). 284–285.
 Kester (2009) Walt Kester. 2009. Understand SINAD, ENOB, SNR, THD, THD+N, and SFDR so you don't get lost in the noise floor. MT-003 Tutorial (2009).
 Kim et al. (2015) J. K. Kim et al. 2015. A 640M pixel/s 3.65 mW sparse event-driven neuromorphic object recognition processor with on-chip learning. In VLSI Circuits. 50–51. https://doi.org/10.1109/VLSIC.2015.7231323
 Kim et al. (2009) Kwanho Kim, Seungjin Lee, Joo-Young Kim, Minsu Kim, and Hoi-Jun Yoo. 2009. A 125 GOPS 583 mW network-on-chip based parallel processor with bio-inspired visual attention engine. IEEE Journal of Solid-State Circuits 44, 1 (2009), 136–147.
 Kingma and Ba (2014) D. Kingma and J. Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
 Kristan et al. (2017) Matej Kristan et al. 2017. The visual object tracking VOT2017 challenge results. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 1949–1972.
 Krizhevsky et al. (2012a) Alex Krizhevsky et al. 2012a. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS. 1097–1105.
 Krizhevsky and Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. 2009. Learning multiple layers of features from tiny images. Technical report, University of Toronto 1, 4 (2009).
 Krizhevsky et al. (2012b) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012b. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS. 1097–1105.
 L O Chua and Lin Yang (1988) L O Chua and Lin Yang. 1988. Cellular neural network: Theory. IEEE Transactions on Circuits and Systems 35 (1988), 1257–1272.
 LeCun et al. (2015) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436–444.
 Lecun et al. (1998) Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradientbased learning applied to document recognition. Proc. of the IEEE 86, 11 (Nov 1998), 2278–2324. https://doi.org/10.1109/5.726791
 LeCun et al. (2010) Yann LeCun, Corinna Cortes, and CJ Burges. 2010. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann. lecun. com/exdb/mnist 2 (2010).
 Wang et al. (1998) Lei Wang et al. 1998. Time multiplexed color image processing based on a CNN with cell-state outputs. IEEE TVLSI 6, 2 (1998), 314–322.
 Liao et al. (2017) S. Liao et al. 2017. Energy-efficient, high-performance, highly-compressed deep neural network design using block-circulant matrices. In Proceedings of the 36th International Conference on Computer-Aided Design. 458–465.
 LiKamWa et al. (2016) Robert LiKamWa et al. 2016. RedEye: analog ConvNet image sensor architecture for continuous mobile vision. ACM SIGARCH Computer Architecture News (2016).
 Lou et al. (2015) Qiuwen Lou et al. 2015. TFET-based Operational Transconductance Amplifier Design for CNN Systems. In GLSVLSI. 277–282.
 Moons et al. (2016) Bert Moons et al. 2016. A 0.3–2.6 TOPS/W precision-scalable processor for real-time large-scale ConvNets. In VLSI Circuits. 1–2.
 Mostafa et al. (2017) Hesham Mostafa et al. 2017. Fast classification using sparsely active spiking networks. ISCAS (2017), 1–4.
 Nahlus et al. (2014) I. Nahlus, E. P. Kim, N. R. Shanbhag, and D. Blaauw. 2014. Energy-efficient dot product computation using a switched analog circuit architecture. In International Symposium on Low Power Electronics and Design (ISLPED). 315–318. https://doi.org/10.1145/2627369.2627664
 Ni et al. (2017) L. Ni et al. 2017. An energy-efficient digital ReRAM-crossbar-based CNN with bitwise parallelism using block-circulant matrices. IEEE Journal on Exploratory Solid-State Computational Devices and Circuits (2017), 37–46.
 Nikonov et al. (2015) D.E. Nikonov et al. 2015. Benchmarking of Beyond-CMOS Exploratory Devices for Logic Integrated Circuits. IEEE JXCDC 1 (2015), 3–11.
 Nikonov and Young (2015) D.E. Nikonov and I.A. Young. 2015. Benchmarking of Beyond-CMOS Exploratory Devices for Logic Integrated Circuits. IEEE JXCDC 1 (2015), 3–11.
 Palit et al. (2015) Indranil Palit, Qiuwen Lou, Nicholas Acampora, Joseph Nahas, Michael Niemier, and X. Sharon Hu. 2015. Analytically Modeling Power and Performance of a CNN System. In Proceedings of the IEEE/ACM International Conference on ComputerAided Design (ICCAD ’15). 186–193.
 Perricone et al. (2016) Robert Perricone et al. 2016. Can beyond-CMOS devices illuminate dark silicon? In DATE. 13–18.
 Kinget and Steyaert (1995) Peter Kinget and Michel S.J. Steyaert. 1995. A programmable analog cellular neural network CMOS chip for high speed image processing. IEEE Journal of Solid-State Circuits 30, 3 (1995), 235–243.
 Reagen et al. (2016) Brandon Reagen et al. 2016. Minerva: Enabling low-power, highly-accurate deep neural network accelerators. In ISCA. 267–278.
 Rodríguez-Vázquez et al. (2004) Angel Rodríguez-Vázquez, Gustavo Liñán-Cembrano, L. Carranza, Elisenda Roca-Moreno, Ricardo Carmona-Galán, Francisco Jiménez-Garrido, Rafael Domínguez-Castro, and S. Espejo Meana. 2004. ACE16k: the third generation of mixed-signal SIMD-CNN ACE chips toward VSoCs. IEEE Transactions on Circuits and Systems I: Regular Papers 51, 5 (2004), 851–863.
 Roska and Chua (1993) Tamas Roska and Leon O Chua. 1993. The CNN universal machine: an analogic array computer. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing 40, 3 (1993), 163–173.
 Shafiee et al. (2016) Ali Shafiee et al. 2016. ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. ACM SIGARCH Computer Architecture News (2016), 14–26.
 Silver et al. (2017) David Silver et al. 2017. Mastering the game of Go without human knowledge. Nature 550, 7676 (2017), 354–359.
 Simonyan et al. (2014) Karen Simonyan et al. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
 Sze et al. (2017) Vivienne Sze et al. 2017. Efficient processing of deep neural networks: A tutorial and survey. Proc. IEEE (2017), 2295–2329.
 Tang et al. (2017) T. Tang et al. 2017. Binary convolutional neural network on RRAM. In ASP-DAC. 782–787.
 Wan et al. (2013) L. Wan et al. 2013. Regularization of neural networks using DropConnect. In ICML. 1058–1066.
 Whatmough et al. (2017) P. N. Whatmough et al. 2017. A 28 nm SoC with a 1.2GHz 568nJ/prediction sparse deep-neural-network engine with >0.1 timing error rate tolerance for IoT applications. In ISSCC. 242–243. https://doi.org/10.1109/ISSCC.2017.7870351
 Xu et al. (2016) Benwei Xu et al. 2016. A 23 mW 24 GS/s 6b time-interleaved hybrid two-step ADC in 28 nm CMOS. In VLSI Circuits. 1–2.
 Xu et al. (2014) Y. Xu et al. 2014. A 7-bit 40 MS/s single-ended asynchronous SAR ADC in 65 nm CMOS. Analog Integrated Circuits and Signal Processing 80 (2014), 349.

 Yao et al. (2017) Peng Yao et al. 2017. Face classification using electronic synapses. Nature Communications (2017), 15199.
 Zhao et al. (2006) Wei Zhao et al. 2006. New generation of predictive technology model for sub-45 nm early design exploration. IEEE TED 53 (2006), 2816–2823.