I Introduction
Due to the ever-increasing usage of portable devices such as cell phones, notebook computers, and personal digital assistants (PDAs), low-power and energy-efficient design of digital circuits and systems has gained a lot of attention. This is driven by the need to extend the battery-based operation time of these devices by reducing their average energy consumption. Emerging applications in computer science and vision such as video and image processing [1], fast search engines [2, 3], deep learning and machine learning [4, 5, 6] have opened new opportunities for low-power and energy-efficient circuit and system design. These applications typically require a large amount of computation, implying high power consumption. Fortunately, these computations can tolerate some degree of inaccuracy in their final results. The approximate computing paradigm enables us to take advantage of this power-accuracy tradeoff. Approximate computing is a technique that, although it does not guarantee exactness, produces results with a level of accuracy sufficient for the application's needs. This is done by relaxing the exact equivalency requirement between the provided specification and the generated implementation. The idea of approximate computing can be implemented at different levels of the design hierarchy. This paper presents a realization of the approximate computing approach by focusing on the logic synthesis step of the design flow. Logic synthesis has two phases (technology-independent optimization and technology mapping) and is defined as the process of optimizing a given Boolean network and mapping it to a gate-level netlist while optimizing power, area, delay, or any other desired metric. Multiple previous works address technology-independent optimization during logic synthesis, focusing on approximate solutions that maintain a constraint on either the error rate or the error magnitude. Such works attempt to reduce the total area and/or critical path delay of the final approximate circuit. These techniques are mostly greedy in nature and tend to incur long runtimes as well as high costs for ensuring that a satisfactory accuracy level is achieved after the approximation. Additionally, they lack a learning process that would allow the framework to learn from previous experience in order to become more effective and runtime-efficient.
In this paper, we present DeepPowerX, a novel approach for Approximate Logic Synthesis (ALS) which utilizes machine learning algorithms to minimize circuit power consumption. DeepPowerX benefits from advances in deep learning by utilizing a Deep Neural Network (DNN) for fast calculation of the error rate of an arbitrary netlist during the approximation process. During the training phase of DeepPowerX, training data is generated and used for training the embedded DNN. Each training data vector includes features of a node which is to be approximated, with a projected maximum error rate at the primary outputs of the circuit. In the inference phase, DeepPowerX receives a mapped circuit as input and traverses it in order to recommend replacements for each node. At each approximation step, DeepPowerX consults the embedded DNN, which is trained to predict the error rate at the outputs of the given netlist. If the predicted error rate exceeds the predetermined error rate given by the user, that specific gate replacement for the node under consideration is abandoned; otherwise, it is accepted.
The embedded DNN in DeepPowerX receives features of a target gate in a circuit and its surrounding gates as inputs and predicts the resultant maximum error rate at the primary outputs of the circuit. The predicted error rate is defined as the normalized Hamming distance between the exact and approximate truth tables of the primary outputs. Boolean difference calculus is applied to calculate the error rate at the primary outputs due to local gate approximations and is further used for DNN training and calibration during implementation. Experimental results demonstrate that DeepPowerX achieves significant improvements in terms of runtime, power, and area savings over state-of-the-art ALS frameworks. To ease integration with industry-standard synthesis tools, the approximation is performed after the technology mapping phase. We believe that this is the first paper to address the problem of approximate logic synthesis by incorporating deep learning in the approximation process.
II Background
II-A Probabilistic Error Propagation
In [7]
, a probabilistic error propagation method using Boolean difference calculus is presented to calculate the error rate at the output of a logic gate. In this method, given the Boolean function of the gate, the error probabilities at its inputs, and the intrinsic error probability of the gate itself, the probability of error at the output of the gate is calculated. For example, as shown in Fig. 1, given the intrinsic error probability ε_g of a 2-input OR gate, the signal probabilities p_1 and p_2 at its inputs (the probabilities of the input signals being 1), and the input error probabilities ε_1 and ε_2, the probability of error at the output of this gate is:

ε_z = ε_g + (1 − 2ε_g)·[ε_1(1 − ε_2)(1 − p_2) + ε_2(1 − ε_1)(1 − p_1) + ε_1 ε_2 (p_1 p_2 + (1 − p_1)(1 − p_2))]   (1)
To calculate the error rate at an output of a network resulting from error injection at one of its internal nodes, the above calculation should be applied iteratively or recursively, starting from this node and ending at the target output.
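The error propagation rule for the 2-input OR gate above can be sketched in code as follows, assuming statistically independent inputs and errors (the function name and the closed form are our reconstruction of the rule, not the authors' implementation):

```python
def or2_output_error(p1, p2, e1, e2, eps_g):
    """Error probability at the output of a 2-input OR gate.

    p1, p2 : probabilities that inputs x1, x2 are logic 1
    e1, e2 : error probabilities at the inputs
    eps_g  : intrinsic error probability of the gate itself
    Assumes inputs and error events are statistically independent.
    """
    # Probability that the input errors flip the output:
    #  - only x1 wrong: the OR output flips iff x2 = 0
    #  - only x2 wrong: the OR output flips iff x1 = 0
    #  - both wrong:    the OR output flips iff x1 = x2
    p_flip = (e1 * (1 - e2) * (1 - p2)
              + e2 * (1 - e1) * (1 - p1)
              + e1 * e2 * (p1 * p2 + (1 - p1) * (1 - p2)))
    # Combine with the gate's intrinsic error (XOR of two error events)
    return eps_g + (1 - 2 * eps_g) * p_flip
```

With error-free inputs the output error reduces to the gate's intrinsic error, as expected.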
II-B Design Space Complexity of Approximate Logic Synthesis
The standard way of performing approximate logic synthesis by gate replacement involves visiting every node, approximating it, and calculating the error rate at the primary outputs resulting from this approximation. Therefore, the complexity analysis has two parts: node replacement and error propagation. Let us assume that the given circuit is modeled by a Directed Acyclic Graph G = (V, E), where V is the vertex set and E is the edge set.

Node replacement: if there are up to k possible replacements for a node n in V, the total number of possible approximations for the given circuit has an upper bound of k^|V|.

Error propagation: an approximated node injects an error at its output, which needs to be propagated throughout the circuit to find the error at the primary outputs. Using the error propagation method explained in Section II-A, a Breadth-First Search (BFS) with complexity O(m + n) is needed, where m and n are the edge count (m = |E|) and node count (n = |V|), respectively.
To verify that the error at the primary outputs is bounded by the given error constraint, error propagation must be done at each gate replacement iteration. Therefore, the total complexity will be O(k^n · (m + n)).
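The BFS-style propagation above can be sketched as follows. The graph representation and the `combine` callback are illustrative assumptions, and taking the maximum over reconvergent paths is a simplification of the full Boolean-difference rule of Section II-A; each node and edge is still visited once, giving the O(m + n) cost:

```python
from collections import deque

def propagate_error(fanouts, start, local_error, combine):
    """Sketch of BFS error propagation from an approximated node.

    fanouts     : dict mapping node -> list of fanout nodes (the DAG)
    start       : the approximated node where the error is injected
    local_error : error probability injected at `start`
    combine     : callable(node, input_error) -> output error for that gate,
                  e.g. a per-gate rule such as Equation (1)
    Runs in O(|V| + |E|): each node and edge is processed once.
    """
    err = {start: local_error}
    queue = deque([start])
    visited = {start}
    while queue:
        node = queue.popleft()
        for succ in fanouts.get(node, []):
            # Keep the worst (largest) error seen on any path to `succ`;
            # a full implementation would combine reconvergent errors exactly.
            err[succ] = max(err.get(succ, 0.0), combine(succ, err[node]))
            if succ not in visited:
                visited.add(succ)
                queue.append(succ)
    return err
```

On a simple chain a -> b -> c with a rule that halves the error at each gate, the injected error decays level by level.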
II-C Deep Neural Networks
A DNN has one input layer, one or more hidden layers, and one output layer. Each layer is comprised of a group of neurons. Inputs of neurons in hidden layers travel through a nonlinear activation function so that the network can learn any complex relation that may be present between the input features and output classes. There are three main operations: the feedforward operation computes activations and their derivatives using the weights, biases, and an activation function; the backpropagation operation computes error values; and the update operation modifies the trainable parameters using a learning-rate hyperparameter.

III Related Work
This section has the following logical flow: first, we review a few papers on the topic of ALS; next, we cover works focused on optimizing power and energy during approximation; finally, we mention three papers that use machine learning and deep learning in logic synthesis. Our paper has a flavor of all three because it uses deep learning in ALS and targets power minimization.
Wu and Qian [8] used approximation of factored forms of the Boolean expression of each node to arrive at an efficient approximation for the whole circuit. They implemented two versions, namely single-selection and multi-selection, with better QoR for the former and lower runtime for the latter. Hashemi et al. [9] proposed a Boolean Matrix Factorization (BMF) method to provide approximation on the Boolean-level representation of a given circuit. Additionally, a decomposition method for subcircuits was proposed to provide a tradeoff between the required accuracy and the circuit complexity. Zhou et al. [10] proposed a delay-driven ALS framework that utilizes an And-Inverter Graph (AIG) representation of a given circuit in order to optimize the circuit's performance.
For power and energy minimization in approximate logic synthesis, there are several papers in the literature. Venkataramani et al. [11] presented a framework for approximate logic synthesis targeting area and power minimization by functional approximations inside the circuit through logic gate removal and function simplification, iterating until the error constraint is reached. Schlachter et al. [12] proposed a gate-level pruning method to approximate arithmetic circuits found in the functional blocks of discrete cosine transformation units used in image and video processing. The authors obtained a tradeoff between accuracy and power consumption using this approach. In [13], Zervakis et al. introduced a voltage-driven functional approximation method to perform gate pruning after synthesis. Their experimental results on adders and multipliers show improvements in both energy and area.
Regarding the usage of machine learning and deep learning algorithms in logic synthesis, some papers have been published recently. Q-ALS, a reinforcement learning-based framework for approximate logic synthesis, was presented in [14]. Q-ALS learns the maximum error rate tolerable by each node in order to optimize the circuit first for delay and then for area while adhering to a predetermined error rate at the primary outputs. Q-ALS is the first framework that formulates the technology mapping problem as a reinforcement-learning problem and gives solid definitions for the state, action, and reward functions. Yu et al. [15] proposed an exact synthesis flow utilizing a Convolutional Neural Network (CNN) that targets the elimination of human experts from the whole process. The authors were able to generate the best designs for three large-scale circuits, beating state-of-the-art logic synthesis tools. In [16], a deep reinforcement learning approach for exact logic synthesis is presented. The authors used the A2C reinforcement learning algorithm to determine the order of applying optimization commands (among a few candidate commands) to a given circuit to achieve better QoR. Similar to [15], the goal of [16] is to remove human guidance and expertise from the logic synthesis process. In this paper, we present DeepPowerX, which provides low-power and area-efficient approximate logic solutions while benefiting from state-of-the-art deep learning algorithms to offer significant improvements in QoR (power, area, delay, and runtime).

IV Proposed Framework: DeepPowerX
Our proposed framework for minimum-power approximate logic synthesis is called DeepPowerX. The first word refers to the deep learning that is employed for error calculation and for providing recommendations for gate replacements; Power denotes the main optimization objective, whereas X refers to the approximate computing paradigm of our framework. Fig. 2 illustrates the DeepPowerX flowchart. In this implementation of DeepPowerX, power/area minimization algorithms and a DNN are utilized to produce the best approximate netlist solution for a given Boolean network. The embedded DNN in DeepPowerX is trained on sample approximated networks with known error rates at their primary outputs. These error rates are in turn calculated by applying the Boolean difference calculus to compute the normalized Hamming distance between the truth table of the exact network and that of its approximate implementation (cf. Section IV-A). The replacement algorithm first attempts to replace critical power nodes (i.e., those with the highest switching powers) and a portion of their immediate fanout nodes with simpler nodes, and then moves on to the area minimization phase, while continuously consulting the embedded DNN to ensure that the error rate at the primary outputs is kept within the specified user constraints. DeepPowerX is trained to minimize the total switching power consumption during the approximation process. DeepPowerX has three main phases: training, inference, and testing.
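As a concrete illustration of the error metric used above, the normalized Hamming distance between exact and approximate truth tables can be computed as follows (the function name and list-based truth-table encoding are our assumptions):

```python
def normalized_hamming_error(exact, approx):
    """Error rate as the normalized Hamming distance between the exact
    and approximate truth tables of a primary output.

    exact, approx : sequences of 0/1 values, one entry per input minterm.
    """
    if len(exact) != len(approx):
        raise ValueError("truth tables must have the same length")
    # Fraction of minterms on which the two implementations disagree
    return sum(a != b for a, b in zip(exact, approx)) / len(exact)
```

For example, two 4-row truth tables that disagree on one row have an error rate of 0.25.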
IV-A Training
The training phase is comprised of two steps: generating the training data, and training the onboard DNN using this data.
First, in an offline step, DeepPowerX generates the training data used for training its embedded DNN. For each training network, DeepPowerX traverses its nodes and calculates the error that would be injected into the network by replacing a node with a gate from the library. This error is calculated as the Hamming distance between the truth tables of the node and of the replacing gate. Next, using the probabilistic error propagation method described in Section II-A, the corresponding error rates at the primary outputs for this gate replacement are estimated. Finally, relevant features of the node under analysis are extracted, including the node type, replacing gate type, local error due to replacement, number of fanouts, number of fanins, logic level, logic depth of the circuit, and all fanin and fanout node features (up to a certain depth limit). These features comprise one training data point, and the estimated error rate is its label. This process is repeated for all nodes in all training networks. The generated training data is then used for training the embedded DNN in the next step. The training data generation is shown in Algorithm 1. Inputs of the algorithm are the training networks and a technology library; its outputs are a list of training vectors together with their corresponding error rates as labels. Note that the local error resulting from a node replacement is also included in the feature vectors; our experiments show that this helps the DNN converge much faster and provides more accurate predictions.

In the second step, the generated training data is used for training the onboard DNN. The dimensions of the input layer are determined by the node with the highest number of fanins/fanouts in a dense training network. Note that since we include some features of nodes within the fanin/fanout cone of a node in its feature vector, the length of the feature vector grows with the number of nodes in its fanin/fanout cones. Based on our experiments on the training networks, we set 93 as the maximum length of a feature vector. Feature vectors shorter than this are zero-padded to the standard size. The DNN in DeepPowerX is comprised of two fully connected hidden layers with 400 and 300 neurons in the first and second layers, respectively, and 51 neurons in its output layer; the number of output neurons is chosen to achieve the desired accuracy of the error prediction. We used the Adam optimizer, binary cross entropy as the loss function, and 30 training epochs. The DNN model and training parameters are shown in Table I.

Network structure  93-400-300-51
Learning rate  0.001
Optimizer  Adam
Loss function  Binary cross entropy
Epochs  30
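For illustration, the zero-padding step and the size of the fully connected network described above (93 inputs, hidden layers of 400 and 300 neurons, 51 outputs) can be sketched as follows; both function names are ours, and this is not the authors' implementation:

```python
def pad_features(vec, width=93):
    """Zero-pad a node's feature vector to the fixed DNN input width.

    A vector longer than `width` would indicate a node whose fanin/fanout
    cone exceeds the assumed maximum, so it is rejected here.
    """
    if len(vec) > width:
        raise ValueError("feature vector exceeds DNN input width")
    return list(vec) + [0.0] * (width - len(vec))

def dnn_parameter_count(layers=(93, 400, 300, 51)):
    """Trainable parameters of a fully connected network: each layer
    contributes one weight per (input, output) pair plus one bias per
    output neuron."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:]))
```

The 93-400-300-51 topology thus has 173,251 trainable parameters, small enough for fast per-node inference during synthesis.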
IV-B Inference
First, the switching activities of the nodes are extracted using a probabilistic, simulation-based method [18, 19]. These nodes are then sorted in descending order of switching activity, and the top 20% are identified as critical power nodes. Next, by traversing the circuit level-by-level starting from the primary inputs, the critical power nodes and also 20% of their immediate fanout nodes (that have not been previously replaced) are considered as candidates for approximation. Critical power nodes are replaced with gates that have smaller output capacitance, and their immediate fanout nodes are replaced with gates with smaller input capacitance. This way, the effective switching capacitance of the critical nodes is reduced, resulting in a reduction of the total switching power of the circuit. A user can choose to preserve the best critical delay of the circuit; in this case, gate replacement is done only for nodes that are not on the critical delay path, or only if the replacement does not increase the critical path delay.
A candidate node’s features are extracted and given to the DNN to predict the error rate at the primary outputs resulting from approximating the node. This process continues until the critical power nodes (and 20% of their immediate fanouts) have all been replaced or the error rate constraint has been violated. Next, if there is still room for additional approximations within the network, power optimization continues, but with gate removal instead of replacement. The power optimization algorithm continues for as long as the error rate is within the user-provided error rate constraint. If this constraint is violated, the last replacement is undone and the area optimization algorithm starts. The area optimization algorithm starts at the first level of the network and replaces each gate in that level with the lowest-cost available gate in the library in terms of area. It proceeds level by level until the error rate constraint is reached or the entire circuit has been traversed. Algorithm 2 shows the pseudocode for the inference phase of DeepPowerX. Lines 1-5 are for initialization, lines 6-23 for power minimization, and lines 24-27 for area minimization.
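The replacement portion of the loop can be sketched as follows. All names here (`switching_activity`, `smaller_output_cap_gate`, `predict`, `replace_with`) are illustrative assumptions standing in for the framework's internals, not its actual API, and the sketch covers only the critical-power-node pass, not the fanout, gate-removal, or area phases:

```python
def approximate_for_power(circuit, dnn, library, max_error, top_frac=0.20):
    """Sketch of the DeepPowerX power-minimization pass.

    Replaces critical power nodes with lower-capacitance library gates,
    accepting each replacement only if the DNN-predicted error rate at
    the primary outputs stays within the user constraint `max_error`.
    """
    # Rank nodes by switching activity; the top fraction (20% by default)
    # are the "critical power" nodes.
    ranked = sorted(circuit.nodes,
                    key=lambda n: n.switching_activity, reverse=True)
    critical = ranked[: max(1, int(top_frac * len(ranked)))]
    for node in critical:
        candidate = library.smaller_output_cap_gate(node)  # assumed helper
        if candidate is None:
            continue  # no cheaper functionally similar gate available
        predicted = dnn.predict(node, candidate)           # assumed helper
        if predicted <= max_error:
            node.replace_with(candidate)                   # accept
        # otherwise the replacement is abandoned
    return circuit
```

Because each candidate is checked with a single DNN query, the cost per node is constant rather than one BFS per replacement.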
After the network has been approximated, it is exported and mapped again using ABC [17], which allows the network to be further optimized. It is important to note that if the user wanted to prioritize area over power savings, the execution order of the two algorithms could be swapped; this would dedicate a larger portion of the error budget to the area optimization algorithm.
IV-C Testing
We used 60% of the data generated as in Section IV-A for training, 20% for validation, and 20% for testing. We obtained a high accuracy of 98% on the test data, meaning that in 98% of cases the embedded DNN in DeepPowerX predicted the correct error rate at the primary outputs of an unseen circuit, given a random approximation of any internal node of that circuit. This confirms that our framework generalizes the knowledge learned from the training networks to unseen test networks.
IV-D Design Space Complexity
In the inference phase, DeepPowerX consults the embedded DNN in order to find the maximum error rate at the primary outputs resulting from an approximation. This is done once per approximation iteration. Therefore, the complexity of error propagation in the standard approach (Section II-B) is reduced to constant time. Also, DeepPowerX replaces a candidate gate with another one from the library (which has a fixed gate count) with smaller output (input) capacitance; finding such a gate and performing the replacement are done in constant time. Therefore, the complexity of node replacement in DeepPowerX is O(c1 · |V|), where c1 is a constant. Given the constant-time complexity of error estimation in DeepPowerX, the total complexity will be O(c2 · |V|), where c2 is another constant. The complexity of the area minimization phase is the same; hence, the total complexity of DeepPowerX is O(|V|).
V Experimental Results
Circuit  Area  Power  SASIMI [11]: area (%)  power (%)  DeepPowerX: area (%)  power (%)
KSA  1429.81  910.2  16.3  14.79  27.5  26.4 
c880  639.3  335.84  13.1  18.03  22.4  28.4 
c1908  858.51  583.3  13.8  22.9  21.7  45.38 
c2670  1355  851.54  5.09  15.68  32.6  29.6 
c3540  1934.74  1212.04  21.94  19.72  17.6  17.5 
c7552  3970.43  3168.52  12.79  19.18  13.5  21.5 
AVG  1697.96  1176.90  13.83  18.38  22.55  28.13 
We implemented our framework such that a user can select to perform power+area or delay+area optimization. In the latter case, instead of approximating critical power nodes, nodes on the critical delay path of the circuit are replaced with faster gates from the technology library. At the end of this section, some experimental results for the delay+area mode are presented. As in previous ALS frameworks, the error rate constraint at the primary outputs of the circuit was set to 5%; this enables a fair evaluation of our framework. A 45-nm PTM [20] library was used to generate technology-mapped netlists from circuits in the ISCAS85 [21], MCNC [22], and EPFL [23] benchmark suites. A subset of these combinational circuits was used in previous ALS frameworks and has been selected for direct comparison. First, we experimented on the EPFL random_control benchmark suite. This suite contains very large circuits, such as mem_ctrl with 1204/1231 I/Os, 47110 nodes, and 93945 edges, which are good candidates for evaluating the scalability of our framework. Fig. 3 shows both exact and approximate power and area results for these circuits. DeepPowerX substantially reduced both the power consumption and the area of the router circuit, with a maximum error rate of 4.17% at its primary outputs as compared to the exact circuit solution. Also, we saw 36% and 53% reductions in power and area for the mem_ctrl circuit, which demonstrates the excellent scalability of DeepPowerX for large circuits. On average over the 10 benchmark circuits in the EPFL random_control suite, DeepPowerX reduced the power consumption and the area by 49% and 41%, respectively.
Circuit  SASIMI [11] (%)  Selection-based [8] (%)  Q-ALS [14] (%)  DeepPowerX (%)
c880  11.4  11.7  13.6  59.6 
c1908  39.0  40.2  39.5  3.6 
c2670  28.6  32.7  33.3  96.8 
c3540  2.5  3.5  4.5  3.4 
c5315  1.9  1.9  37.9  67.2 
c7552  5.2  5.9  38.0  2.4 
AVG  14.76  15.98  27.8  28.9 
We also experimented on the MCNC benchmark suite. As shown in Fig. 5, DeepPowerX provided average savings of 37% and 27% in power and area compared with the exact solutions, at the same 5% error rate at the primary outputs. We also compared DeepPowerX with one of the most powerful ALS frameworks (SASIMI [11]) that also offers optimizations for power and area; this is therefore a good measure of where our framework stands with respect to state-of-the-art tools. In the SASIMI paper, power and area savings are reported for several open-source circuits, and we extracted power and area results for the same circuits using DeepPowerX. Table II lists these results for both SASIMI and DeepPowerX. As seen in this table, SASIMI performs better only for the c3540 circuit; for the rest of the circuits, DeepPowerX is better. On average over the six circuits in Table II, DeepPowerX provides 22.55% and 28.13% savings in area and power, respectively, while the average savings for SASIMI are smaller, at 13.83% for area and 18.38% for power. Fig. 4 shows the runtimes for the same circuits listed in Table II; DeepPowerX achieves significant average runtime savings compared with SASIMI on these benchmark circuits. For the delay+area optimization mode, and again to compare with state-of-the-art ALS tools, we experimented on a few benchmark circuits with results available for three other ALS tools, namely SASIMI, Selection-based [8], and Q-ALS [14]. Table III shows the results. On average, DeepPowerX provides 14.14% better area savings compared with SASIMI and 12.92% more reduction compared with the selection-based approach of [8]. DeepPowerX even improves on the average area savings of Q-ALS, another powerful learning-based ALS tool, by a slight margin of 1.1%. Fig. 6 compares the runtimes of DeepPowerX (in its delay+area optimization mode), Q-ALS, SASIMI, and Selection-based. As seen in this figure, DeepPowerX improves the runtime on all but the c5315 circuit when compared to the SASIMI and Selection-based frameworks. However, compared with Q-ALS, DeepPowerX improves the runtime for only half of the circuits.
VI Conclusion
We presented DeepPowerX, a DNN-based approximate logic synthesis framework. DeepPowerX has two optimization modes, namely power+area and delay+area. In the first mode, nodes with the highest switching activities and a portion of their immediate fanouts are approximated. In the second mode, nodes on the critical delay path are replaced with faster gates. In both modes, an embedded pre-trained DNN guides the synthesis tool to adhere to a predetermined error rate at the primary outputs. At the end of both modes, area minimization is performed if additional approximation is possible. Experimental results on numerous circuits confirm significant improvements in QoR (power, area, delay, and runtime) for DeepPowerX when compared to exact solutions and to state-of-the-art approximate logic synthesis tools.
Acknowledgement
The research is supported in part by a grant from the Software and Hardware Foundations (SHF) program of the National Science Foundation. The authors would like to thank Souvik Kundu from the University of Southern California (USC) for his help in implementations used in this paper.
References
 [1] J. Huang, B. Wang, W. Wang, and P. Sen, “A surface approximation method for image and video correspondences,” IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5100–5113, 2015.

 [2] G. Ranjan, A. Tongaonkar, and R. Torres, “Approximate matching of persistent lexicon using search-engines for classifying mobile app traffic,” in IEEE INFOCOM 2016 - The 35th Annual IEEE International Conference on Computer Communications. IEEE, 2016, pp. 1–9.
 [3] K.-H. Yang, C.-C. Pan, and T.-L. Lee, “Approximate search engine optimization for directory service,” in Parallel and Distributed Processing Symposium, 2003. Proceedings. International. IEEE, 2003, 8 pp.
 [4] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.

 [5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
 [6] M. Nazemi, G. Pasandi, and M. Pedram, “Energy-efficient, low-latency realization of neural networks through Boolean logic minimization,” in 24th Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, 2019, pp. 1–6.
 [7] N. Mohyuddin, E. Pakbaznia, and M. Pedram, “Probabilistic error propagation in a logic circuit using the Boolean difference calculus,” in Advanced Techniques in Logic Synthesis, Optimizations and Applications. Springer, 2011, pp. 359–381.
 [8] Y. Wu and W. Qian, “An efficient method for multi-level approximate logic synthesis under error rate constraint,” in Proceedings of the 53rd Annual Design Automation Conference. ACM, 2016, p. 128.
 [9] S. Hashemi, H. Tann, and S. Reda, “BLASYS: Approximate logic synthesis using Boolean matrix factorization,” in 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), June 2018, pp. 1–6.
 [10] Z. Zhou, Y. Yao, S. Huang, S. Su, C. Meng, and W. Qian, “DALS: Delay-driven approximate logic synthesis,” in 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov 2018, pp. 1–7.
 [11] S. Venkataramani, K. Roy, and A. Raghunathan, “Substitute-and-simplify: A unified design paradigm for approximate and quality configurable circuits,” in Proceedings of the Conference on Design, Automation and Test in Europe. EDA Consortium, 2013, pp. 1367–1372.
 [12] J. Schlachter, V. Camus, K. V. Palem, and C. Enz, “Design and applications of approximate circuits by gate-level pruning,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 5, pp. 1694–1702, May 2017.
 [13] G. Zervakis, K. Koliogeorgi, D. Anagnostos, N. Zompakis, and K. Siozios, “VADER: Voltage-driven netlist pruning for cross-layer approximate arithmetic circuits,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 27, no. 6, pp. 1460–1464, June 2019.
 [14] G. Pasandi, S. Nazarian, and M. Pedram, “Approximate logic synthesis: A reinforcement learning-based technology mapping approach,” in 20th International Symposium on Quality Electronic Design (ISQED). IEEE, 2019, pp. 26–32.
 [15] C. Yu, H. Xiao, and G. De Micheli, “Developing synthesis flows without human knowledge,” in Proceedings of the 55th Annual Design Automation Conference, 2018, pp. 1–6.
 [16] A. Hosny, S. Hashemi, M. Shalan, and S. Reda, “DRiLLS: Deep reinforcement learning for logic synthesis,” arXiv preprint arXiv:1911.04021, 2019.
 [17] A. Mishchenko et al., “ABC: A system for sequential synthesis and verification,” Berkeley Logic Synthesis and Verification Group, 2018.
 [18] S. Jang, K. Chung, A. Mishchenko, R. Brayton et al., “A power optimization toolbox for logic synthesis and mapping,” 2009.
 [19] S. Iman and M. Pedram, Logic synthesis for low power VLSI designs. Springer Science & Business Media, 1998.
 [20] Arizona State University. (2013) Predictive Technology Model (PTM). [Online]. Available: http://ptm.asu.edu/
 [21] F. Brglez and H. Fujiwara, “A Neutral Netlist of 10 Combinational Benchmark Circuits and a Target Translator in Fortran,” in Proceedings of IEEE Int’l Symposium Circuits and Systems (ISCAS 85). IEEE Press, Piscataway, N.J., 1985, pp. 677–692.
 [22] S. Yang, Logic synthesis and optimization benchmarks user guide: version 3.0. Microelectronics Center of North Carolina (MCNC), 1991.
 [23] L. Amaru et al. (2017) The EPFL combinational benchmark suite. [Online]. Available: https://lsi.epfl.ch/benchmarks