I Introduction
Approximate computing exploits the fact that many applications (such as image recognition, video processing and data mining) are error resilient, i.e., their quality of service can be traded for performance or power consumption. Adopting the principles of approximate computing thus makes it possible to significantly improve the energy efficiency of complex computer applications. To obtain an approximate implementation, a common practice is to replace selected components of the original (exact) implementation by their approximate versions. For this purpose, approximate components based on various approximation principles have been introduced (for example, see a recent survey of approximate adders and multipliers [1]). Even open circuit libraries nowadays provide various sorts of approximate circuits [2, 3]. However, it is important to emphasize that these circuits have (almost always) been optimized with respect to general-purpose error metrics and evaluated under the assumption of uniformly distributed input values. Applying these prefabricated approximate circuits can bring some improvements in power consumption or performance, but much better trade-offs are typically obtained if the approximate circuit is deliberately developed and optimized for a given application and if it exploits some knowledge about the application, for example, a particular (non-uniform) distribution of input vectors.
These application-specific approximate circuits (ASACs) can, in principle, be obtained using automated circuit approximation methods such as ABACUS [4], SALSA [5], CGP [6, 3] etc., if a suitable error metric is provided. In contrast to the manual approximation approach (represented by, e.g., [1]), these methods automatically generate and evaluate candidate designs until an implementation showing acceptable trade-offs between design objectives is obtained. Let us consider an example in which the objective is to create highly efficient approximate multipliers for an image classifier based on a Convolutional Neural Network (CNN). Papers [6, 7] have already shown that employing approximate multipliers optimized with respect to a given CNN can reduce (in comparison with a common truncation) the overall power consumption with a negligible impact on the accuracy. When automated approximate circuit design techniques are applied in this context, the key question is how to define the error metric for the approximation procedure working at the level of components (multipliers). It is evident that the error metric cannot be based on the classification accuracy (i.e., at the CNN level), as obtaining this parameter requires a very time-consuming evaluation of each CNN containing a new candidate approximate multiplier. Such an approach allows only a very limited number of candidate designs to be explored within the available time and typically yields a low-quality solution. On the other hand, if a common error metric is applied at the level of multipliers, the approximation algorithm has no way to exploit the particular data distribution observed in a given CNN. In general, we are looking for an easy-to-calculate error metric applicable at the level of components, but providing outputs highly correlated with the quality measure used in the application containing these components. This application-tailored, but component-level, error metric is then used in the circuit approximation method.
This paper deals with the automated design of ASACs using Cartesian Genetic Programming (CGP). In the context of CGP-based approximation, we propose a new error metric – the weighted mean error distance (WMED) – for steering the circuit approximation process. WMED introduces a set of weights (derived from the data distribution measured on a selected signal in a given application) determining the importance of each input vector for the approximation process. The principle is to allow aggressive approximations for less important inputs (which are assigned lower weights) and only gentle approximations for highly important inputs (which are assigned higher weights).
The proposed method is evaluated using (1) synthetic benchmark problems and (2) two instances of neural image classifiers. In the case of the synthetic benchmark problems, the objective is to design an approximate multiplier showing high-quality trade-offs between WMED and power consumption. The weights used in WMED reflect the importance of the particular input vectors on one of the multiplier's inputs, which is modeled using a probability mass function. In other words, the multiplier is designed, optimized and approximated for a user-given probability mass function. This is highly relevant for applications in which one operand of the multiplier is an arbitrary input value and the importance of the second operand (roughly) follows a known distribution. For example, in image filters, signal filters, or artificial neurons there is always an input multiplied by a certain value (i.e., a filter coefficient or a synaptic weight) which can be statistically characterized for a given application. At the same time, all multipliers have to be identical in these applications in order to obtain uniform circuit structures suitable for hardware implementation.
In the case of the neural image classifiers, application-specific approximate MAC (multiply-and-accumulate) units are designed to provide the best trade-offs between classification accuracy and power consumption. The definition of WMED is based on the distribution of synaptic weights across all NN layers.
II Related work
This paper deals with functional approximation, which is a technology-independent circuit approximation method. Its purpose is to modify the implementation (function) of a given circuit in such a way that the quality of service is kept at a desired level while power consumption is reduced (or performance is increased) with respect to the original implementation.
II-A Functional approximation
Approximations have been introduced to circuits described at the transistor, gate [5, 6], register-transfer and behavioral [4] levels. Many authors have introduced approximate operations directly at the level of abstract circuit representations such as binary decision diagrams and and-inverter graphs [8]. The basic functional approximation principles are: (i) truncation, which reduces the bit widths of registers and of all operations of the data path; (ii) pruning, which removes some parts of the circuit; (iii) component replacement, in which exact components are replaced with approximate components available in a library of approximate components; (iv) resynthesis, in which the original logic function is replaced by a cheaper implementation; and (v) other techniques such as table lookup.
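As an illustration of principle (i), the following Python sketch (our own, not taken from any library) approximates an 8-bit multiplication by truncating the least-significant bits of both operands before an exact multiply:

```python
def truncated_mul(a: int, b: int, cut: int = 4) -> int:
    """Truncation-based approximate 8-bit multiply: the `cut`
    least-significant bits of both operands are zeroed, so the
    hardware for the corresponding partial products disappears."""
    mask = ~((1 << cut) - 1) & 0xFF
    return (a & mask) * (b & mask)

print(truncated_mul(16, 32))  # 512 -- exact, the low bits were zero anyway
print(truncated_mul(17, 33))  # 512 -- the discarded low bits cause the error
```

Since 17 · 33 = 561, the error of the second call is 49; it grows with the magnitude of the discarded bits.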
Automated approximation methods are often constructed as iterative methods in which many candidate approximate circuits have to be generated and evaluated. This is, in fact, a multi-objective search process. Examples of elementary circuit modifications (i.e., steps in the search space) are replacing a gate by another one, reconnecting an internal signal or reconnecting a circuit output. It has been shown that this kind of search can be performed effectively by means of Cartesian genetic programming [9, 3, 6]. Details on CGP are given in Section III.
II-B Approximate CNNs
With the rapid development of artificial intelligence methods based on deep CNNs, a lot of attention has been focused on efficient hardware implementations of neural networks [10]. CNNs employ multiple layers of computational elements performing the convolution operation, pooling (selection/subsampling), non-linear transformations and the final classification based on a common multi-layer perceptron (MLP).
One of the key challenges in this area is to provide a fast and energy-efficient inference phase (i.e., the application of an already trained network). The reason is that trained CNNs are employed in embedded systems and have to process enormous volumes of data in real-time scenarios. As CNNs are highly error resilient, a good strategy is to reduce the bit width of all involved operations and storage elements. This approach has been taken by the Tensor Processing Unit (TPU), where only 8-bit operations are implemented in the MAC units. The highly parallel processing enabled by the TPU exploits a systolic array composed of 65,536 8-bit MAC units [11]. Approximation techniques developed for circuit implementations of NNs were surveyed in [12]. Approximate multipliers for NNs are implemented either as multiplier-less multipliers [7], truncated multipliers [11] or application-specific multipliers [6]. For example, Mrazek et al. developed approximate multipliers that perform exact multiplication by zero (which is important as many weights are zero and no error is thus distributed to subsequent processing layers), while deep approximations are allowed for all the remaining operand values [6]. On two benchmark problems, this strategy provided better trade-offs (energy vs. accuracy) than the multiplier-less multipliers [7, 6].
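The exact-by-zero strategy of [6] can be sketched as a thin wrapper around an arbitrary approximate multiplier; the truncating multiplier used in the example below is a hypothetical stand-in, not the circuit from [6]:

```python
def zero_exact(approx_mul):
    """Wrap an approximate multiplier so that multiplication by zero is
    always exact: zero weights are frequent in trained NNs, and this
    guarantees that they inject no error into subsequent layers."""
    def mul(a, b):
        return 0 if a == 0 or b == 0 else approx_mul(a, b)
    return mul

# Example with a simple truncating multiplier as the stand-in:
m = zero_exact(lambda a, b: (a & 0xF0) * (b & 0xF0))
print(m(0, 200))    # 0 -- exact by construction
print(m(100, 200))  # approximate for all non-zero operands
```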
III Design of application-specific approximate circuits
The proposed design method based on CGP is developed for combinational circuits. For the sake of simplicity, we will focus on approximate multipliers in this section.
III-A Weighted mean error distance
We propose WMED as an extension of the conventional mean error distance (MED). Let X and Y be discrete random variables representing the data at the two inputs of a multiplier. Let p be the probability mass function of Y, defined as p(y) = P(Y = y). Given p and a signed approximate n-bit multiplier M, WMED is defined as

  WMED(M) = (1 / 2^n) · Σ_x Σ_y w(y) · |x · y − M(x, y)|,

where M(x, y) is the output of the signed approximate multiplier for inputs x and y, and w(y) is the weight determined by the probability mass function p. In our case w(y) = p(y), but a different approach can be chosen in general. The WMED for an unsigned approximate multiplier is constructed accordingly. Note that WMED coincides with MED when p is the uniform distribution.
III-B Circuit representation in CGP
In CGP [9], a combinational circuit is modeled as a two-dimensional grid of nodes (see the example in Fig. 1), where the type of the nodes depends on the level of abstraction used in modeling (gates are used in our case). The circuit utilizes n_i primary inputs and n_o primary outputs. A unique address is assigned to all primary inputs (0 – 4 in Fig. 1) and to the outputs of all nodes (5 – 16 in Fig. 1) to define an addressing system enabling circuit topologies to be specified. As no feedback connections are allowed in the basic version of CGP, only combinational circuits can be created. Each candidate circuit is represented using n_c · n_r · (n_a + 1) + n_o integers, where n_c is the number of columns, n_r is the number of rows and n_a is the maximum arity of the node functions. All supported node functions are defined in the function set Γ. In this representation, n_a + 1 integers specify one programmable node in such a way that n_a integers specify the source addresses of its inputs and one integer determines the function of the node. This circuit representation can be seen as a netlist in which redundant components are allowed.
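A genotype of this form can be decoded and simulated in a few lines. The function set, grid size and example circuit below are illustrative only (a 1 × 4 grid with n_a = 2):

```python
# Each node is a triple (in1, in2, func); addresses 0..n_i-1 denote the
# primary inputs, and node i gets address n_i + i.  The second input of
# the unary NOT is carried but ignored (fixed-arity encoding).
FUNCS = {0: lambda a, b: a & b,   # AND
         1: lambda a, b: a | b,   # OR
         2: lambda a, b: a ^ b,   # XOR
         3: lambda a, b: 1 - a}   # NOT (ignores b)

def evaluate(genotype, outputs, inputs):
    vals = list(inputs)                  # primary inputs first
    for in1, in2, f in genotype:         # nodes in address order
        vals.append(FUNCS[f](vals[in1], vals[in2]))
    return [vals[o] for o in outputs]

# XOR of two 1-bit inputs built from AND/OR/NOT nodes:
genotype = [(0, 1, 0), (0, 1, 1), (2, 2, 3), (3, 4, 0)]
table = [evaluate(genotype, [5], [a, b])[0] for a in (0, 1) for b in (0, 1)]
print(table)  # [0, 1, 1, 0]
```

Mutating one integer of such a genotype (an input address or a function code) yields exactly the elementary circuit modifications described in Section II-A.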
III-C Search algorithm and fitness function
Having a candidate circuit represented as a string of integers, new candidate circuits are created by a random modification of this string – the so-called mutation. It is important to ensure that all randomly created numbers lie within a legal interval, i.e., that a valid candidate circuit is always produced.
CGP employs a simple search algorithm denoted (1 + λ), which operates with a small set of candidate circuits (the so-called population) [9]. Starting with the original circuit (the so-called parent), a new population is created by applying the mutation operator to the parent and creating λ offspring circuits. The mutation operator randomly modifies up to a predefined number of randomly selected integers of the string. The offspring are evaluated in terms of functionality and electrical parameters, and a fitness score is assigned to them. The best performing individual is taken as the new parent. These steps are repeated until the time available for the evolution is exhausted.
The goal of the design process is to find an approximate circuit minimizing the area on a chip while keeping WMED below a predefined threshold T. The area parameter is chosen because it is highly correlated with power consumption and can quickly be estimated using the technology library (see the methodology proposed in [6]). The design process is repeated for several target approximation errors in order to construct a Pareto front (the error vs. the area). The fitness value of a candidate approximate multiplier M is defined as

  f(M) = area(M) if WMED(M) ≤ T, and f(M) = ∞ otherwise,  (1)

where area(M) is the estimated area of M and the objective is to minimize f(M).
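The search loop and the constrained fitness of Eq. 1 can be sketched as follows. The mutation operator, area estimator and WMED evaluator are supplied by the surrounding framework, so the toy versions used in the example are our own:

```python
import random

def one_plus_lambda(parent, lam, max_mut, generations,
                    mutate, area, wmed, target_wmed):
    """(1 + lambda) ES used by CGP.  Fitness follows Eq. 1: minimize
    the estimated area; circuits violating the WMED constraint score
    infinity.  Ties are accepted to allow neutral drift."""
    def fit(c):
        return area(c) if wmed(c) <= target_wmed else float("inf")
    best, best_fit = parent, fit(parent)
    for _ in range(generations):
        for _ in range(lam):
            child = mutate(best, random.randint(1, max_mut))
            f = fit(child)
            if f <= best_fit:
                best, best_fit = child, f
    return best, best_fit

# Toy problem: a genome of 8 bits, "area" = number of set bits, and a
# feasibility rule standing in for the WMED constraint (bit 0 required).
def flip(g, k):
    g = list(g)
    for i in random.sample(range(len(g)), k):
        g[i] ^= 1
    return g

random.seed(1)
g, f = one_plus_lambda([1] * 8, lam=4, max_mut=2, generations=200,
                       mutate=flip, area=sum,
                       wmed=lambda g: 0.0 if g[0] else 1.0,
                       target_wmed=0.1)
print(g, f)  # converges towards the minimal feasible circuit [1, 0, ..., 0]
```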
IV Case Study 1: Data-distribution-driven approximate multipliers
The objective of this section is to show that better trade-offs (between key parameters of the multipliers) can be obtained in comparison with conventional approximation methods (which assume uniformly distributed input data) if a non-uniform data distribution is used in the WMED definition. Figure 2 shows the data distributions used in our experiments: an arbitrarily chosen normal distribution and a half-normal distribution. The uniform distribution serves as a reference in all experiments.

Approximate 8-bit multipliers are evolved using CGP with the standard parameter setting recommended in the literature [9, 6]: n_i = 16 (two 8-bit inputs), n_o = 16, n_c = 320 … 490 depending on the initial multiplier, n_r = 1, n_a = 2, Γ = {all standard two-input gates}, λ = 4; the number of mutations per individual follows the recommended setting. The initial population of CGP is seeded with different conventional implementations of exact multipliers. The fitness function is defined according to Eq. 1. For all 14 target WMED values, we repeated the CGP-based design ten times (one CGP run took 1 hour). The best evolved circuits were resynthesized with Synopsys Design Compiler (45 nm process; 1 V) to obtain their power consumption and other parameters (Fig. 3). In order to investigate the impact of the selected distributions on the properties of the resulting multipliers, each multiplier is also evaluated using the remaining WMEDs that were not considered during the design. For both the normal and the half-normal distribution we confirmed that CGP can evolve approximate multipliers showing better trade-offs than the approximate multipliers evolved for the uniform distribution and the top-quality approximate multipliers available in [1].
The heat maps in Fig. 4 show, for selected multipliers (see the highlighted points in Fig. 3), how the resulting approximation error reflects the data distribution applied in the approximation process. In the case of the normal distribution, if the operand is around 127 the product shows a low error, but higher errors are visible for operands near 0 and 255. In the case of the half-normal distribution, low errors are visible for operands close to zero. In the case of the uniform distribution, the error is spread more uniformly.
Intuitively, the approximate multipliers optimized for the half-normal distribution should provide better trade-offs than the other multipliers when used in an image filter constructed to eliminate Gaussian noise. The reason is that Gaussian filters employ a pixel filtering window with many close-to-zero coefficients whose sum has to be less than 256. If the results of approximate multiplication by these coefficients are almost exact (the error can be arbitrarily high for non-coefficient values), then the quality of filtering is higher than if the filter contains approximate multipliers showing uniformly distributed errors. Hence, we compared the impact of various approximate multipliers on the quality of filtering conducted with the approximate Gaussian filter. We used a standard Gaussian filter implementation in which the pixels are multiplied by nine constants. Figure 5 clearly shows that the Gaussian filters employing the approximate multipliers evolved for the half-normal distribution show better trade-offs between Peak Signal to Noise Ratio (PSNR) and power consumption (given for the complete image filter implementation) than the other implementations. PSNR is calculated as the mean value over 25 images. Note that we did not design any specialized approximate multipliers for this task; we simply applied the approximate multipliers presented in Fig. 3.
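This experiment can be sketched as a 3 × 3 Gaussian filter whose pixel-coefficient products go through a pluggable (possibly approximate) multiplier. The kernel, image size and sample multipliers below are our own choices, not the paper's exact setup:

```python
import numpy as np

KERNEL = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]])  # coefficients sum to 16

def gauss_filter(img, mul):
    """Filter an 8-bit grayscale image, computing each pixel-coefficient
    product with the supplied multiplier `mul(pixel, coeff)`."""
    h, w = img.shape
    out = np.zeros_like(img)
    pad = np.pad(img, 1, mode="edge")
    for y in range(h):
        for x in range(w):
            acc = sum(mul(int(pad[y + dy, x + dx]), int(KERNEL[dy, dx]))
                      for dy in range(3) for dx in range(3))
            out[y, x] = min(255, acc // 16)
    return out

def psnr(a, b):
    mse = np.mean((a.astype(float) - b.astype(float)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(255 ** 2 / mse)

# A multiplier that is exact for small second operands reproduces the
# exact filter output, because all kernel coefficients are below 8:
rng = np.random.default_rng(0)
img = rng.integers(0, 256, (16, 16), dtype=np.uint8)
exact = gauss_filter(img, lambda a, b: a * b)
approx = gauss_filter(img, lambda a, b: a * b if b < 8 else (a & 0xF0) * b)
print(psnr(exact, approx))  # inf -- identical outputs
```

This is the intuition stated above in miniature: a multiplier that is accurate only where the coefficient distribution puts its mass loses nothing in this application.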
V Case Study 2: Approximate MAC units for CNNs
When applying automated approximate circuit design techniques in the context of neural-network-based image classifiers, the key question is how to define an easy-to-calculate error metric for the approximation procedure working at the level of components (such as MACs and multipliers), because obtaining the classification accuracy of the whole NN is very time consuming. We apply the CGP-based circuit approximation utilizing WMED to evolve approximate multipliers tailored to a particular trained NN.
V-A Image classification benchmarks
Our method is evaluated on the task of image classification (digits 0 – 9). Two NN architectures are addressed: a popular Multi-Layer Perceptron (MLP) applied to the MNIST benchmark and the CNN LeNet-5 [14] applied to the more challenging Google SVHN benchmark. This setup allows us to compare our results with [6]. We used an MLP network with 784 input neurons (one per pixel of a 28 × 28 image), 300 neurons in the hidden layer and 10 output neurons whose outputs are interpreted as the probability of each of the 10 target classes. We modified LeNet-5 to be able to process the 32 × 32 pixel images stored in SVHN. LeNet-5 consists of five layers – three convolution layers, two pooling layers used for data subsampling and one fully connected layer. The latter layer consists of 120 neurons outputting 10 values that are interpreted as the probability of each of the 10 target classes. In LeNet-5, more than 278 thousand multiplication operations have to be executed to classify a single input image. A common MLP implementation shows 98 % accuracy on the MNIST data set. In the case of LeNet-5, 90.8 – 92.7 % accuracy is typically reported on SVHN [6].
V-B Reference implementation
Common implementations of neural networks typically use a 32-bit floating-point representation of real numbers for data storage and manipulation. For both considered neural networks, we first apply a quantization process with the Ristretto tool, which performs a fully automated trimming analysis of a given network [15]. The analysis using different bit widths revealed that 8-bit fixed-point signed values provide sufficient classification accuracy (only a 0.01 % resp. 0.1 % accuracy drop reported for MNIST resp. SVHN). At the end of this process, we obtained models that can be accelerated in hardware using a systolic array of processing elements. Each processing element consists of an 8-bit MAC unit and an m-bit register (as in [11]). Each MAC includes an 8-bit signed multiplier and an m-bit adder, where m = 16 + ⌈log2 k⌉ and k is the maximum number of products that have to be summed up. In the case of the fully connected layers and the MLP, k equals the maximum number of weights that can be connected to a neuron. In the case of the convolution layers, k is the number of items in a kernel.
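The adder width follows from the fact that a signed 8 × 8 product fits in 16 bits and each doubling of the number of accumulated products costs one extra bit. A small helper (names are ours):

```python
from math import ceil, log2

def mac_adder_width(k: int, product_bits: int = 16) -> int:
    """Bit width needed to accumulate k signed `product_bits`-bit
    products without overflow: one extra bit per doubling of k."""
    return product_bits + ceil(log2(k)) if k > 1 else product_bits

# The MLP's hidden layer sums up to 300 weighted inputs per neuron:
print(mac_adder_width(300))  # 16 + ceil(log2(300)) = 25
```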

TABLE I: Classification accuracy drop (before and after fine-tuning) and reduction of PDP, power and area of the MAC units for the target WMED values.

         |                    SVHN data set                     |                    MNIST data set
WMED [%] | Acc. drop | Acc. drop (tuned) | PDP  | Power | Area  | Acc. drop | Acc. drop (tuned) | PDP  | Power | Area
0        | 0.00 %    | 0.24 %            | 0 %  | 0 %   | 0 %   | 0.00 %    | 0.09 %            | 0 %  | 0 %   | 0 %
0.005    | 0.02 %    | 0.36 %            | 4 %  | 8 %   | 3 %   | 0.00 %    | 0.09 %            | 1 %  | 12 %  | 3 %
0.01     | 0.00 %    | 0.44 %            | 4 %  | 14 %  | 5 %   | 0.00 %    | 0.10 %            | 14 % | 16 %  | 6 %
0.05     | 0.00 %    | 0.51 %            | 26 % | 26 %  | 16 %  | 0.03 %    | 0.14 %            | 28 % | 27 %  | 11 %
0.1      | 0.07 %    | 0.41 %            | 29 % | 37 %  | 27 %  | 0.05 %    | 0.10 %            | 35 % | 32 %  | 13 %
0.5      | 0.08 %    | 0.31 %            | 55 % | 57 %  | 38 %  | 0.01 %    | 0.10 %            | 60 % | 65 %  | 45 %
1        | 0.13 %    | 0.20 %            | 60 % | 65 %  | 45 %  | 0.42 %    | 0.12 %            | 70 % | 71 %  | 49 %
2        | 0.82 %    | 0.41 %            | 70 % | 71 %  | 49 %  | 4.79 %    | 0.02 %            | 79 % | 75 %  | 53 %
5        | 18.56 %   | 1.85 %            | 90 % | 86 %  | 70 %  | 3.70 %    | 0.30 %            | 85 % | 83 %  | 66 %
10       | 62.99 %   | 5.04 %            | 89 % | 87 %  | 66 %  | 61.14 %   | 1.24 %            | 91 % | 89 %  | 70 %
V-C Applying available approximate circuits
We replaced the exact multipliers with top-quality approximate 8-bit multipliers that have been proposed in the literature. In particular, we considered broken-array multipliers [13] and the EvoApprox8b library [3]. We also utilized the approximate multipliers in which exact multiplication by zero is guaranteed [6]. Then we evaluated the accuracy of the neural networks containing these multipliers on the test data sets. The results are presented in Fig. 7.
V-D Evolutionary design of approximate multipliers
We employed CGP to evolve application-tailored 8-bit approximate multipliers with the WMED error metric reflecting the properties of our target neural networks and data sets. In order to establish WMED, we analyzed the distribution of weights across all convolutional CNN layers / MLP neurons in the fully trained NNs. The resulting distributions are shown in Fig. 6 (top). In the case of SVHN, the distribution of weights is close to a normal distribution with zero mean; in the case of MNIST, 92 % of the values (the most frequent ones) lie within the interval (−0.08 … 0.08).
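Deriving the WMED weight vector from a trained network amounts to quantizing its weights and normalizing their histogram into a pmf over the 2^8 operand values. A sketch with an assumed fixed-point scale of 128 (both the function name and the scale are our choices):

```python
import numpy as np

def wmed_weights(trained_weights, n=8, scale=128):
    """Quantize trained NN weights to signed n-bit fixed point and use
    the normalized histogram as the per-operand importance w(y)."""
    lo, hi = -(2 ** (n - 1)), 2 ** (n - 1) - 1
    q = np.clip(np.round(np.asarray(trained_weights) * scale), lo, hi)
    hist = np.bincount(q.astype(int) - lo, minlength=2 ** n)
    return hist / hist.sum()

# Zero-mean, near-normal weights (the SVHN case): the probability mass
# concentrates around the operand value 0 (index 128 after the offset).
w = wmed_weights(np.random.default_rng(0).normal(0, 0.05, 10000))
print(len(w), round(float(w.sum()), 6))  # 256 1.0
```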
CGP was used with the following parameters: n_i = 16 (two 8-bit inputs), n_o = 16, n_c = 320 … 490 depending on the initial multiplier, n_r = 1, n_a = 2, Γ = {all standard two-input gates}, λ = 4; the number of mutations per individual and the number of iterations per run follow the recommended settings. The fitness function is defined as proposed in Section III.
The best discovered multipliers were integrated into MAC units and the relevant design parameters were obtained with Synopsys Design Compiler (45 nm process). Fig. 6 (bottom) shows the Power Delay Product (PDP) of the resulting approximate multipliers evolved for each target WMED by means of box plots. Each box plot was constructed from 25 independent CGP runs. For example, if WMED is constrained to 0.2 %, PDP can be reduced by 50 % in the case of LeNet-5 on SVHN.
V-E Integration of approximate MACs into CNNs
The best non-dominated MACs were integrated into both neural network architectures, whose classification accuracy was then calculated using the test sets (Table I). We can observe that the CNN accuracy remains practically unchanged for WMED ≤ 0.5 %. However, the corresponding PDP of the MAC units was reduced by 55 %. If a deeper approximation is allowed (WMED = 2 %), a 70 % reduction of PDP is reported.
Fine-tuning of the NN weights can, in principle, reduce the accuracy drop introduced by quantization and approximation. During this fine-tuning, the network learns how to classify images with the approximate multipliers in place. Table I shows that the effect of fine-tuning (10 iterations employed) is enormous, especially in the case of the 5 % and 10 % target errors. For the 10 % error, for example, the accuracy drop was improved from 62.99 % to 5.04 % for SVHN and from 61.14 % to 1.24 % for MNIST. As it is acceptable to tolerate a 1 % accuracy drop in practice, we can achieve more than 70 % power and PDP reduction for SVHN (WMED = 2 %) and 85 % reduction for MNIST (WMED = 5 %). Fig. 7 compares the classification accuracy (obtained by LeNet-5 on SVHN and the MLP on MNIST) and the relative power consumption when different approximate multipliers are employed in the MAC units. Solutions obtained with the proposed method clearly dominate.
VI Conclusions
By means of the proposed error metric – WMED – we demonstrated how an application-level error metric can be translated to the component level and exploited in searching for high-quality application-specific approximate circuits. The method was evaluated in the design of approximate multipliers in which the importance of one of the operands is determined using a probability mass function. Under this scenario we evolved approximate multipliers showing better trade-offs than (i) approximate multipliers evolved with a common error metric and (ii) high-quality conventionally designed approximate multipliers. The impact of the method was demonstrated in the approximate implementation of Gaussian image filters. We also showed that when the evolved MAC units are used in NN-based classifiers, a 65 % power consumption reduction is obtained (in the MAC units) with a negligible impact on the classification accuracy.
This work was supported by the Czech Science Foundation project 19-10137S.
References
 [1] H. Jiang, C. Liu et al., "A review, classification, and comparative evaluation of approximate arithmetic circuits," J. Emerg. Technol. Comput. Syst., vol. 13, no. 4, Aug. 2017.
 [2] M. Shafique, W. Ahmad, R. Hafiz, and J. Henkel, "A low latency generic accuracy configurable adder," in Proceedings of the 52nd Annual Design Automation Conference. ACM, 2015, pp. 86:1–86:6.
 [3] V. Mrazek, R. Hrbacek, Z. Vasicek, and L. Sekanina, "EvoApprox8b: Library of approximate adders and multipliers for circuit design and benchmarking of approximation methods," in Design, Automation & Test in Europe Conference & Exhibition, DATE 2017, 2017, pp. 258–261.
 [4] K. Nepal, Y. Li, R. I. Bahar, and S. Reda, "ABACUS: A technique for automated behavioral synthesis of approximate computing circuits," in Proceedings of the Conference on Design, Automation and Test in Europe, ser. DATE'14. EDA Consortium, 2014, pp. 1–6.
 [5] S. Venkataramani, A. Sabne, V. J. Kozhikkottu, K. Roy, and A. Raghunathan, "SALSA: Systematic logic synthesis of approximate circuits," in The 49th Design Automation Conference. ACM, 2012, pp. 796–801.
 [6] V. Mrazek, S. S. Sarwar, L. Sekanina, Z. Vasicek, and K. Roy, "Design of power-efficient approximate multipliers for approximate artificial neural networks," in Proceedings of the IEEE/ACM International Conference on Computer-Aided Design. ACM, 2016, pp. 811–817.
 [7] S. S. Sarwar, S. Venkataramani, A. Raghunathan, and K. Roy, "Multiplier-less artificial neurons exploiting error resiliency for energy-efficient neural computing," in Proc. of the Design, Automation & Test in Europe Conference. EDA Consortium, 2016, pp. 1–6.
 [8] A. Chandrasekharan, M. Soeken, D. Große, and R. Drechsler, "Approximation-aware rewriting of AIGs for error tolerant applications," in Proc. of ICCAD'16. ACM, 2016, pp. 83:1–83:8.
 [9] J. F. Miller, Cartesian Genetic Programming. Springer-Verlag, 2011.
 [10] V. Sze, Y. Chen, T. Yang, and J. S. Emer, "Efficient processing of deep neural networks: A tutorial and survey," Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.
 [11] N. P. Jouppi, C. Young, N. Patil et al., "In-datacenter performance analysis of a tensor processing unit," in Proc. of the 44th Annual Int. Symposium on Computer Architecture. ACM, 2017, pp. 1–12.
 [12] P. Panda, A. Sengupta, S. S. Sarwar, G. Srinivasan, S. Venkataramani, A. Raghunathan, and K. Roy, "Invited – cross-layer approximations for neuromorphic computing: From devices to circuits and systems," in 53rd Design Automation Conference. IEEE, 2016, pp. 1–6.
 [13] H. R. Mahdiani, A. Ahmadi, S. M. Fakhraie, and C. Lucas, "Bio-inspired imprecise computational blocks for efficient VLSI implementation of soft-computing applications," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 57, no. 4, pp. 850–862, April 2010.
 [14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 [15] P. Gysel, J. Pimentel, M. Motamedi, and S. Ghiasi, "Ristretto: A framework for empirical study of resource-efficient inference in convolutional neural networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 11, pp. 5784–5789, 2018.