Due to the ever-increasing use of portable devices such as cell phones, notebook computers, and personal digital assistants (PDAs), low-power and energy-efficient design of digital circuits and systems has gained a lot of attention. This is because of the growing need to extend the battery-based operation time of these devices by reducing their average energy consumption. Emerging applications in computer science and vision such as video and image processing, fast search engines [2, 3], deep learning, and machine learning [4, 5, 6] have opened new opportunities for low-power and energy-efficient circuit and system design. These applications typically require a large amount of computation, implying high power consumption. Fortunately, these computations can tolerate some degree of inaccuracy in their final results. An approximate computing paradigm enables us to take advantage of the power-accuracy trade-off.
Approximate computing is a computing technique that, although it does not guarantee exactness, produces results with a level of accuracy sufficient to meet the application's needs. This is done by relaxing the exact equivalence requirement between the provided specification and the generated implementation. The idea of approximate computing can be implemented at different levels of the design hierarchy. This paper presents a realization of the approximate computing approach by focusing on the logic synthesis step in the design flow. Logic synthesis has two phases (technology-independent optimization and technology mapping) and is defined as the process of optimizing a given Boolean network and mapping it to a gate-level netlist while optimizing power, area, delay, or any other desired metric. Multiple previous works address technology-independent optimization during logic synthesis, focusing on approximate solutions that maintain a constraint on either the error rate or the error magnitude. Such works attempt to reduce the total area and/or critical path delay of the final approximate circuit. These techniques are mostly greedy and tend to incur long run-times as well as a high cost of ensuring that a satisfactory accuracy level is achieved after the approximation. Additionally, these techniques lack a learning process that would allow the framework to learn from previous experience in order to become more effective and run-time efficient.
In this paper, we present Deep-PowerX, a novel approach for Approximate Logic Synthesis (ALS) that utilizes machine learning algorithms to minimize circuit power consumption. Deep-PowerX benefits from advances in deep learning by utilizing a Deep Neural Network (DNN) for fast calculation of the error rate of an arbitrary netlist during the approximation process. During the training phase of Deep-PowerX, training data is generated and used for training the embedded DNN. Each training data vector includes features of a node that is to be approximated, labeled with a projected maximum error rate at the primary outputs of the circuit. In the inference phase, Deep-PowerX receives a mapped circuit as input and traverses the circuit in order to recommend replacements for each node. At each approximation step, Deep-PowerX consults the embedded DNN, which is trained to predict the error rate at the outputs of the given netlist. If the predicted error rate exceeds the error rate constraint given by the user, that specific gate replacement for the node under consideration is abandoned; otherwise, it is accepted.
The embedded DNN in Deep-PowerX receives features of a target gate in a circuit and its surrounding gates as inputs and predicts the resultant maximum error rate at the primary outputs of the circuit. The predicted error rate is defined by the normalized Hamming distance between the exact and approximate truth tables of the primary outputs. Boolean difference calculus is applied to calculate the error rate at the primary outputs due to local gate approximations and is further used for DNN training and calibration during implementation. Experimental results demonstrate that Deep-PowerX achieves significant improvements in terms of run-time, power, and area savings over state-of-the-art ALS frameworks. Due to the ease of integration with industry standard synthesis tools, the approximation is done after the technology mapping phase. We believe that this is the first paper to address the problem of approximate logic synthesis incorporating deep learning in the process of approximation.
II-A Probabilistic Error Propagation
Mohyuddin et al. present a probabilistic error propagation method that uses Boolean difference calculus to calculate the error rate at the output of a logic gate. Given the Boolean function of the gate, the error probabilities at its inputs, and the intrinsic error probability of the gate itself, the method calculates the probability of error at the output of the gate. For example, as shown in Fig. 1, given the intrinsic error probability of a 2-input OR gate, the signal probabilities at its inputs (the probability that each input signal is 1), and the input error probabilities, the probability of error at the output of this gate is:
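For a 2-input OR gate, the Boolean difference calculation can be written out explicitly under two assumptions: the inputs are statistically independent, and errors act as independent bit flips. The symbol names below are notation assumed here for illustration: \epsilon_g is the intrinsic gate error probability, p_1 and p_2 are the input signal probabilities, \epsilon_1 and \epsilon_2 are the input error probabilities, \epsilon_{in} is the output error probability caused by input errors alone, and \epsilon_z is the resulting output error probability.

```latex
\begin{align}
\epsilon_{in} &= \epsilon_1(1-\epsilon_2)\Pr\!\left[\tfrac{\partial f}{\partial x_1}=1\right]
              + (1-\epsilon_1)\epsilon_2\Pr\!\left[\tfrac{\partial f}{\partial x_2}=1\right]
              + \epsilon_1\epsilon_2\Pr\!\left[\tfrac{\Delta f}{\Delta(x_1 x_2)}=1\right] \\
% For f = x_1 + x_2 (OR):
%   \partial f/\partial x_1 = \bar{x}_2, \quad \partial f/\partial x_2 = \bar{x}_1,
%   \Delta f/\Delta(x_1 x_2) = x_1 \odot x_2 \ (\text{XNOR})
              &= \epsilon_1(1-\epsilon_2)(1-p_2) + (1-\epsilon_1)\epsilon_2(1-p_1)
              + \epsilon_1\epsilon_2\bigl(p_1 p_2 + (1-p_1)(1-p_2)\bigr) \\
              &= \epsilon_1(1-p_2) + \epsilon_2(1-p_1) + \epsilon_1\epsilon_2(2p_1p_2 - 1) \\
\epsilon_z    &= \epsilon_g(1-\epsilon_{in}) + (1-\epsilon_g)\epsilon_{in}
               = \epsilon_g + (1-2\epsilon_g)\,\epsilon_{in}
\end{align}
```

The last line states that the output is wrong either when the gate itself errs on correct inputs or when it works correctly on an already erroneous input combination.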
To calculate error rate at an output of a network resulting from error injection to one of its internal nodes, the above calculation should be done iteratively or recursively starting from this node and ending at the target output.
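The recursive propagation described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the paper: it assumes independent gate inputs (so reconvergent fanout is not handled), models the injected error as an independent flip at the approximated node, and enumerates input patterns instead of using closed-form Boolean differences. The names `gate_error`, `signal_prob`, and `propagate`, and the netlist encoding, are hypothetical.

```python
from itertools import product

def _pattern_prob(bits, probs):
    """Probability of a 0/1 pattern given per-position probabilities of a 1."""
    result = 1.0
    for bit, p in zip(bits, probs):
        result *= p if bit else 1.0 - p
    return result

def signal_prob(func, probs):
    """Probability that the error-free gate output is 1."""
    n = len(probs)
    return sum(_pattern_prob(x, probs) for x in product((0, 1), repeat=n) if func(x))

def gate_error(func, probs, errs, eps_gate=0.0):
    """Probability that the gate output is wrong.

    probs: signal probabilities of the error-free inputs
    errs: probability that each input is flipped
    eps_gate: error injected at the gate itself (assumed independent)
    """
    n = len(probs)
    eps_in = 0.0
    for x in product((0, 1), repeat=n):          # error-free input values
        for e in product((0, 1), repeat=n):      # which inputs are flipped
            flipped = tuple(xi ^ ei for xi, ei in zip(x, e))
            if func(flipped) != func(x):
                eps_in += _pattern_prob(x, probs) * _pattern_prob(e, errs)
    return eps_gate + (1.0 - 2.0 * eps_gate) * eps_in

def propagate(netlist, pi_probs, injected):
    """Propagate error probabilities through gates listed in topological order.

    netlist: list of (name, func, fanin_names)
    pi_probs: signal probability of each primary input
    injected: dict mapping a node name to the error injected there
    """
    prob = dict(pi_probs)
    err = {name: 0.0 for name in pi_probs}
    for name, func, fanins in netlist:
        p_in = [prob[f] for f in fanins]
        e_in = [err[f] for f in fanins]
        prob[name] = signal_prob(func, p_in)
        err[name] = gate_error(func, p_in, e_in, injected.get(name, 0.0))
    return err

OR2 = lambda x: x[0] | x[1]
AND2 = lambda x: x[0] & x[1]

# g1 = OR(a, b), g2 = AND(g1, c); an error of 0.1 is injected at g1.
netlist = [("g1", OR2, ["a", "b"]), ("g2", AND2, ["g1", "c"])]
errors = propagate(netlist, {"a": 0.5, "b": 0.5, "c": 0.5}, {"g1": 0.1})
print(errors["g2"])
```

In this toy netlist the error at `g1` reaches `g2` only when `c` is 1, so half of the injected error is masked on the way to the output.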
II-B Design Space Complexity of Approximate Logic Synthesis
The standard way of performing approximate logic synthesis by gate replacement involves going through all nodes and performing approximation on each and calculating the error rate at primary outputs resulting from this approximation. Therefore, the complexity analysis has two parts: node replacement, and error propagation. Let’s assume that the given circuit is modeled by a Directed Acyclic Graph G=(V,E), where V is the vertex set and E is the edge set.
node replacement: if there are up to k possible replacements for a node n in V, the total number of possible approximations for the given circuit will have an upper bound of .
error propagation: an approximated node injects error at its output which needs to be propagated throughout the circuit to find the error at primary outputs. Using the error propagation method explained in Section II-A, a Breadth-First Search (BFS) with complexity of O(m+n) is needed, where m and n are the edge count (|E|) and node count (|V|), respectively.
To verify that the error at primary outputs will be bounded by the given error constraint, error propagation should be done at each gate replacement iteration. Therefore, the total complexity will be: .
II-C Deep Neural Networks
A DNN has one input layer, one or more hidden layers, and one output layer. Each layer is comprised of a group of neurons. Inputs of neurons in hidden layers pass through a non-linear activation function so that the network can learn complex relations that may be present between the input features and output classes. There are three main operations. The feedforward operation computes activations and their derivatives by using the weights, biases, and an activation function. The backpropagation operation computes error values, while the update operation modifies trainable parameters using a learning rate hyper-parameter.
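The three operations can be illustrated with a generic textbook sketch. It assumes a single hidden layer, sigmoid activations, a binary cross entropy loss, one training sample, and plain gradient descent; it is not the network or the optimizer used in Deep-PowerX, and the function name `train_step` is hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(params, x, y, lr):
    """One feedforward / backpropagation / update cycle on a single sample."""
    W1, b1, W2, b2 = params

    # Feedforward: hidden activations, then one sigmoid output neuron.
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    out = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    loss = -(y * math.log(out) + (1 - y) * math.log(1 - out))

    # Backpropagation: with sigmoid + binary cross entropy, the error at the
    # output pre-activation is (out - y); push it back through W2 and the
    # sigmoid derivative h * (1 - h).
    d_out = out - y
    d_h = [d_out * w * hi * (1 - hi) for w, hi in zip(W2, h)]

    # Update: move every trainable parameter against its gradient.
    W2 = [w - lr * d_out * hi for w, hi in zip(W2, h)]
    b2 = b2 - lr * d_out
    W1 = [[w - lr * dh * xi for w, xi in zip(row, x)]
          for row, dh in zip(W1, d_h)]
    b1 = [b - lr * dh for b, dh in zip(b1, d_h)]
    return (W1, b1, W2, b2), loss

random.seed(0)
params = (
    [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(4)],  # W1
    [0.0] * 4,                                                          # b1
    [random.uniform(-0.5, 0.5) for _ in range(4)],                      # W2
    0.0,                                                                # b2
)
x, y = [1.0, 0.0, 1.0], 1.0
losses = []
for _ in range(50):
    params, loss = train_step(params, x, y, lr=0.5)
    losses.append(loss)
print(losses[0], losses[-1])
```

Repeating the step on the same sample drives the loss down, which is the behavior the learning rate controls: a larger value takes bigger steps per update.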
III Related Work
This section is organized as follows: first we review a few papers on the topic of ALS, then we discuss others that focus on optimizing power and energy during approximation. Finally, we mention three papers that use machine learning and deep learning in logic synthesis. Our paper combines all three aspects because it uses deep learning in ALS and targets power minimization.
Wu and Qian used approximation of the factored-form Boolean expression of each node to obtain an efficient approximation for the whole circuit. They implemented two versions, namely single-selection and multi-selection, with better QoR for the former and lower run-time for the latter. Hashemi et al. proposed a Boolean Matrix Factorization (BMF) method to provide approximation on the Boolean-level representation of a given circuit. Additionally, a decomposition method of subcircuits was proposed to provide a trade-off between the required accuracy and the circuit complexity. Zhou et al. proposed a delay-driven ALS framework that utilizes an And-Inverter Graph (AIG) representation of a given circuit in order to optimize the circuit’s performance.
Several papers in the literature address power and energy minimization in approximate logic synthesis. Venkataramani et al. presented a framework for approximate logic synthesis targeting area and power minimization through functional approximations inside the circuit, applying logic gate removal and function simplification iteratively until the error constraint is reached. Schlachter et al. proposed a gate-level pruning method to perform approximation on arithmetic circuits found in functional blocks of discrete cosine transformation units that are used in image and video processing. The authors obtained a trade-off between accuracy and power consumption using this approach. Zervakis et al. introduced a voltage-driven functional approximation method to perform gate pruning after synthesis. Their experiments were carried out on adders and multipliers and showed improvements in both energy and area.
Several recent papers use machine learning and deep learning algorithms in logic synthesis. Pasandi et al. presented Q-ALS, a reinforcement learning-based framework for approximate logic synthesis. Q-ALS learns the maximum error rate tolerable by each node in order to optimize the circuit for delay and then area while adhering to a predetermined error rate at primary outputs. Q-ALS is the first framework that formulates the technology mapping problem as a reinforcement-learning problem and gives solid definitions for state, action, and reward functions. Yu et al. proposed an exact synthesis flow utilizing a Convolutional Neural Network (CNN) with the goal of eliminating human experts from the whole process. The authors could generate the best designs for three large-scale circuits, beating the state-of-the-art logic synthesis tools. Hosny et al. presented a deep reinforcement learning approach for exact logic synthesis. The authors used the A2C reinforcement learning algorithm to determine the order of applying optimization commands (among a few candidate commands) to a given circuit for achieving better QoR. As in the work of Yu et al., the goal is to remove human guidance and expertise from the process of logic synthesis. In this paper, we present Deep-PowerX, which provides low-power and area-efficient approximate logic solutions while benefiting from state-of-the-art deep learning algorithms to offer significant improvements on QoR (power, area, delay, and run-time).
IV Proposed Framework: Deep-PowerX
Our proposed framework for minimum-power approximate logic synthesis is called Deep-PowerX. Deep refers to deep learning, which is employed for error calculation and for providing recommendations for gate replacements; Power denotes the main optimization objective; and X refers to the approximate computing paradigm of our framework. Fig. 2 illustrates the Deep-PowerX flowchart. In this implementation of Deep-PowerX, power/area minimization algorithms and a DNN are utilized to produce the best approximate netlist solution for a given Boolean network. The embedded DNN in Deep-PowerX is trained using sample approximated networks whose error rates at the primary outputs are known. These error rates are in turn obtained by applying the Boolean difference calculus to calculate the normalized Hamming distance between the truth table of the exact network and that of its approximate implementation (cf. Section IV-A). The replacement algorithm first attempts to replace critical power nodes (those with the highest switching powers) and a portion of their immediate fanout nodes with simpler nodes and then goes on to the area minimization phase, while continuously consulting the embedded DNN to ensure that the error rate at primary outputs is kept within the user-specified constraint. Deep-PowerX is trained to minimize the total switching power consumption during the approximation process. Deep-PowerX has three main phases: training, inference, and testing.
The training phase is comprised of two steps: training data generation, and performing training of the on-board DNN using this data.
First, in an offline step, Deep-PowerX generates the training data to be used for training its embedded DNN. For each training network, Deep-PowerX traverses its nodes and calculates the error that is injected into the network by replacing a node with a gate from the library. This error is calculated as the Hamming distance between the truth table of the node and that of the replacing gate. Next, using the probabilistic error propagation method described in Section II-A, the corresponding error rates at primary outputs for this gate replacement are estimated. Finally, relevant features of the node under analysis are extracted, including the node type, replacing gate type, local error due to replacement, number of fanouts, number of fanins, logic level, logic depth of the circuit, and all fanin and fanout node features (up to a certain depth limit). These features comprise one training data point, and the estimated error rate is its label. This process is repeated for all nodes in all training networks. Once generated, the training data is used for training the embedded DNN in the next step. The training data generation is shown in Algorithm 1. Inputs of the algorithm are training networks and a technology library, and its outputs are a list of training vectors together with their corresponding error rates as labels. Note that the local error resulting from a node replacement is also included in the feature vector. Our experiments show that this helps the DNN converge much faster and provide more accurate predictions.
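The local error used as a feature can be sketched as a Hamming distance between two truth tables. The sketch below normalizes the distance by the table size, which assumes all input patterns are equally likely; that normalization and the helper names `truth_table` and `local_error` are assumptions for illustration, not details taken from Algorithm 1.

```python
from itertools import product

def truth_table(func, n_inputs):
    """Output of `func` for every input pattern, in counting order."""
    return [func(bits) for bits in product((0, 1), repeat=n_inputs)]

def local_error(node_func, replacement_func, n_inputs):
    """Fraction of input patterns on which the replacement gate disagrees."""
    exact = truth_table(node_func, n_inputs)
    approx = truth_table(replacement_func, n_inputs)
    mismatches = sum(a != b for a, b in zip(exact, approx))
    return mismatches / len(exact)

AND2 = lambda x: x[0] & x[1]
NAND2 = lambda x: 1 - (x[0] & x[1])
NOR2 = lambda x: 1 - (x[0] | x[1])
BUF_A = lambda x: x[0]          # pass the first input through, ignore the second

print(local_error(AND2, BUF_A, 2))   # AND2 replaced by a buffer on input 0
print(local_error(NAND2, NOR2, 2))   # NAND2 replaced by NOR2
```

A replacement that differs on one of four patterns injects a smaller local error than one that differs on two, which is the kind of signal the feature is meant to give the DNN.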
In the second step, the generated training data is used for training the on-board DNN. Dimensions of the input layer are determined by a node with the highest number of fanins/fanouts in a dense training network. Note that since we include part of the features of the nodes within the fanin/fanout cone of a node in its feature vector, the length of that node's feature vector increases if it has many nodes in its fanin/fanout cones. Based on experiments on our training networks, we set 93 as the maximum length of a feature vector. For nodes that have a shorter feature vector, we append zeros to bring the vector to this standard size. The DNN in Deep-PowerX is comprised of two fully connected hidden layers with 400 and 300 neurons in the first and second layers, respectively. This DNN has 51 neurons in its output layer; the number of output neurons is chosen to achieve the desired accuracy of the error prediction. We have used the Adam optimizer, binary cross entropy as the loss function, and 30 as the number of epochs. The DNN model and training parameters are shown in Table I.
|Loss function||Binary cross entropy|
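The zero-padding of feature vectors to the standard length of 93 can be sketched as follows. The function name `pad_features` and the choice to reject over-long vectors are assumptions for illustration; the text only states that shorter vectors are extended with zeros.

```python
MAX_FEATURES = 93  # maximum feature-vector length reported in the text

def pad_features(features, size=MAX_FEATURES):
    """Append zeros so that every feature vector has exactly `size` entries."""
    if len(features) > size:
        # Assumption: longer vectors are treated as an error rather than truncated.
        raise ValueError(f"feature vector has {len(features)} entries, limit is {size}")
    return list(features) + [0.0] * (size - len(features))

padded = pad_features([3.0, 1.0, 0.25, 2.0])
print(len(padded))
```

Fixing the length this way lets nodes with different fanin/fanout cone sizes share one input layer.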
First, the switching activities of nodes are extracted using a probabilistic simulation-based method [18, 19]. Then these nodes are sorted in descending order of switching activity, and the top 20% are extracted as critical power nodes. Next, by traversing the circuit level by level starting from primary inputs, critical power nodes and 20% of their immediate fanout nodes (that have not been previously replaced) are considered as candidates for approximation. Critical power nodes are replaced with gates that have smaller output capacitance, and their immediate fanout nodes are replaced with gates that have smaller input capacitance. This way the effective switching capacitance of critical nodes is reduced, resulting in a reduction in the total switching power of the circuit. A user can choose to preserve the best critical delay of the circuit. In this case, a gate is replaced only if it is not on the critical delay path or if the replacement does not increase the critical path delay.
A candidate node’s features are extracted and given to the DNN to predict the error rate at the primary outputs as a consequence of approximating the node. This process continues until the critical power nodes (and 20% of their immediate fanouts) have all been replaced or the error rate constraint has been violated. Next, if there is still room for additional approximations within the network, power optimization continues, but with gate removal instead of replacement. The power optimization algorithm continues for as long as the error rate is within the user-provided error rate constraint. If this constraint is violated, the last replacement is undone and the area optimization algorithm starts. The area optimization algorithm starts at the first level of the network and replaces each gate in that level with the lowest-cost available gate in the library in terms of area. It proceeds level by level until the error rate constraint is reached or the entire circuit has been traversed. Algorithm 2 shows the pseudo code for the inference phase of Deep-PowerX. Lines 1-5 are for initialization, lines 6-23 are for power minimization, and lines 24-27 are for area minimization.
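The selection of critical power nodes and the accept-or-stop replacement loop can be sketched as follows. This is a simplified illustration, not Algorithm 2: the function names are hypothetical, the DNN is replaced by an arbitrary `predict_error` callable, and a replacement that would violate the constraint is simply not applied (equivalent to undoing it) before the loop ends.

```python
def select_critical(switching, fraction=0.2):
    """Names of the nodes with the highest switching activity (top `fraction`)."""
    ranked = sorted(switching, key=switching.get, reverse=True)
    count = max(1, round(len(ranked) * fraction))
    return ranked[:count]

def approximate(candidates, replacements, predict_error, max_error):
    """Apply replacements in order until the predicted error exceeds the limit.

    candidates: node names in traversal order
    replacements: dict mapping a node to its proposed replacement gate
    predict_error: stand-in for the DNN; called as
        predict_error(accepted_so_far, node, gate) and returning the
        predicted error rate at the primary outputs
    """
    accepted = {}
    for node in candidates:
        gate = replacements.get(node)
        if gate is None:
            continue
        if predict_error(accepted, node, gate) > max_error:
            break  # constraint violated: drop this replacement and stop
        accepted[node] = gate
    return accepted

# Toy data: ten nodes with made-up switching activities.
switching = {f"n{i}": act for i, act in
             enumerate([0.9, 0.1, 0.5, 0.7, 0.3, 0.2, 0.8, 0.4, 0.6, 0.05])}
critical = select_critical(switching)

# Toy predictor (assumption): every accepted replacement adds 2% error.
toy_predictor = lambda accepted, node, gate: 0.02 * (len(accepted) + 1)
proposals = {"n0": "INV_X1", "n6": "BUF_X1", "n3": "NAND2_X1"}
result = approximate(["n0", "n6", "n3"], proposals, toy_predictor, max_error=0.05)
print(critical, result)
```

Because the predictor is queried once per candidate, the cost of each iteration does not depend on the size of the circuit, which is the property the framework relies on.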
After the network has been approximated, it is exported and mapped again using ABC. This allows the network to be further optimized. It is important to note that if the user wants to prioritize area over power savings, the execution order of the algorithms can be swapped. This dedicates a bigger portion of the error budget to the area optimization algorithm.
We have used 60% of the generated data as in Section IV-A for training, 20% for validation, and 20% for testing. We obtained a high accuracy of 98% on the test data. This means that for 98% of the cases, the embedded DNN in Deep-PowerX could predict the correct value for error rate at primary outputs of an unseen circuit, given a random approximation on any internal nodes of this circuit. This therefore confirms that our framework provides good generalization of learned knowledge from training networks to apply to unseen test networks.
IV-D Design Space Complexity
In the inference phase, Deep-PowerX consults with the embedded DNN in order to find the maximum error rate at primary outputs resulting from an approximation. This is done once per each approximation iteration. Therefore, the complexity of error propagation as in the standard way of performing approximation (Section II-B) is reduced to a constant time. Also, Deep-PowerX replaces a candidate gate with another one from the library (which has a fixed gate count) with smaller output (input) capacitance. Finding such a gate and performing a replacement are done in a constant time. Therefore, the complexity of node replacement in Deep-PowerX is , where is a constant. Given the constant time complexity for error estimation in Deep-PowerX, the total complexity will be , where is another constant. The complexity of area minimization phase is the same, hence, the total complexity of Deep-PowerX is .
V Experimental Results
|Circuit||Area||Power||SASIMI area (%)||SASIMI power (%)||Deep-PowerX area (%)||Deep-PowerX power (%)|
We implemented our framework such that a user can choose to perform either power+area or delay+area optimization. In the latter case, instead of approximating critical power nodes, the nodes on the critical delay path of the circuit are replaced with faster gates from the technology library. At the end of this section, some experimental results for the delay+area mode are presented. As in previous ALS frameworks, the error rate constraint at primary outputs of the circuit was set to 5%, which helps us evaluate our framework against them. A 45 nm PTM library was used to generate technology-mapped netlists from circuits contained in benchmark suites such as ISCAS85, MCNC, and EPFL. A subset of these combinational circuits was used in previous ALS frameworks and has been selected for direct comparison. First, we experimented on the EPFL random_control benchmark suite. This benchmark suite contains very large circuits such as mem_ctrl, with 1204/1231 I/Os, 47110 nodes, and 93945 edges, which are good candidates for evaluating the scalability of our framework. Fig. 3 shows both exact and approximate power and area results for these circuits. Deep-PowerX achieved its largest power and area reductions on the router circuit, with a maximum error rate of 4.17% at its primary outputs as compared to the exact circuit solution. We also saw 36% and 53% reductions in power and area, respectively, for the mem_ctrl circuit, which demonstrates the scalability of Deep-PowerX to large circuits. On average over the 10 benchmark circuits in the EPFL random_control suite, Deep-PowerX reduced the power consumption and the area by 49% and 41%, respectively.
|Circuit||SASIMI ||Selection-based ||Q-ALS||Deep-PowerX|
We also experimented on the MCNC benchmark suite. As shown in Fig. 5, Deep-PowerX provided average savings of 37% and 27% on power and area, respectively, compared with the exact solutions, under the same 5% error rate constraint at primary outputs. We also compared Deep-PowerX with SASIMI, one of the most powerful ALS frameworks, which also offers optimizations for power and area and is therefore a good measure of where our framework stands with respect to state-of-the-art tools. In the SASIMI paper, power and area savings are reported for several open-source circuits. We extracted power and area results for the same circuits using Deep-PowerX. The power and area comparison table lists these results for both SASIMI and Deep-PowerX. As seen in this table, SASIMI performs better only for the c3540 circuit; for the rest of the circuits, Deep-PowerX is better. On average over the six circuits in this table, Deep-PowerX provides 22.55% and 28.13% savings on area and power, respectively, while SASIMI's average savings are lower, at 13.83% for area and 18.38% for power. Fig. 4 shows the run-time for the same circuits. Deep-PowerX achieves significant average run-time savings compared with SASIMI on these benchmark circuits. For the delay+area optimization mode, and again to compare with state-of-the-art ALS tools, we experimented on a few benchmark circuits with available results in three other ALS tools, namely SASIMI, Selection-based, and Q-ALS. The delay+area comparison table shows the results. On average, Deep-PowerX provides 14.14% better area saving than SASIMI and 12.92% more reduction than the selection-based approach. Deep-PowerX even improves on the average area saving of Q-ALS, another powerful learning-based ALS tool, by a slight margin of 1.1%. Fig. 6 compares the run-times of Deep-PowerX (in delay+area optimization mode), Q-ALS, SASIMI, and the selection-based approach. As seen in this figure, Deep-PowerX improves run-time on all but the c5315 circuit when compared to the SASIMI and selection-based frameworks. Compared with Q-ALS, however, Deep-PowerX improves the run-time for only half of the circuits.
We presented Deep-PowerX, a DNN based approximate logic synthesis framework. Deep-PowerX has two optimization modes, namely power+area and delay+area. In the first optimization mode, nodes with highest switching activities and a portion of their immediate fanouts are approximated. In the second mode, nodes in the critical delay path are replaced with faster gates. In both modes, an embedded pre-trained DNN is used for guiding the synthesis tool to stick to a predetermined error rate at primary outputs. At the end of both modes, area minimization is performed in case additional approximation is possible. Experimental results on numerous circuits confirm significant improvements on QoR (power, area, delay, and run-time) for Deep-PowerX when compared to exact solutions and also state-of-the-art approximate logic synthesis tools.
The research is supported in part by a grant from the Software and Hardware Foundations (SHF) program of the National Science Foundation. The authors would like to thank Souvik Kundu from the University of Southern California (USC) for his help in implementations used in this paper.
-  J. Huang, B. Wang, W. Wang, and P. Sen, “A surface approximation method for image and video correspondences,” IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5100–5113, 2015.
-  K.-H. Yang, C.-C. Pan, and T.-L. Lee, “Approximate search engine optimization for directory service,” in Parallel and Distributed Processing Symposium, 2003. Proceedings. International. IEEE, 2003, pp. 8–pp.
-  Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
-  M. Nazemi, G. Pasandi, and M. Pedram, “Energy-efficient, low-latency realization of neural networks through boolean logic minimization,” in 24th Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, 2019, pp. 1–6.
-  N. Mohyuddin, E. Pakbaznia, and M. Pedram, “Probabilistic error propagation in a logic circuit using the boolean difference calculus,” in Advanced Techniques in Logic Synthesis, Optimizations and Applications. Springer, 2011, pp. 359–381.
-  Y. Wu and W. Qian, “An efficient method for multi-level approximate logic synthesis under error rate constraint,” in Proceedings of the 53rd Annual Design Automation Conference. ACM, 2016, p. 128.
-  S. Hashemi, H. Tann, and S. Reda, “Blasys: Approximate logic synthesis using boolean matrix factorization,” in 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), June 2018, pp. 1–6.
-  Z. Zhou, Y. Yao, S. Huang, S. Su, C. Meng, and W. Qian, “Dals: Delay-driven approximate logic synthesis,” in 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov 2018, pp. 1–7.
-  S. Venkataramani, K. Roy, and A. Raghunathan, “Substitute-and-simplify: A unified design paradigm for approximate and quality configurable circuits,” in Proceedings of the Conference on Design, Automation and Test in Europe. EDA Consortium, 2013, pp. 1367–1372.
-  J. Schlachter, V. Camus, K. V. Palem, and C. Enz, “Design and applications of approximate circuits by gate-level pruning,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 5, pp. 1694–1702, May 2017.
-  G. Zervakis, K. Koliogeorgi, D. Anagnostos, N. Zompakis, and K. Siozios, “Vader: Voltage-driven netlist pruning for cross-layer approximate arithmetic circuits,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 27, no. 6, pp. 1460–1464, June 2019.
-  G. Pasandi, S. Nazarian, and M. Pedram, “Approximate logic synthesis: A reinforcement learning-based technology mapping approach,” in 20th International Symposium on Quality Electronic Design (ISQED). IEEE, 2019, pp. 26–32.
-  C. Yu, H. Xiao, and G. De Micheli, “Developing synthesis flows without human knowledge,” in Proceedings of the 55th Annual Design Automation Conference, 2018, pp. 1–6.
-  A. Hosny, S. Hashemi, M. Shalan, and S. Reda, “Drills: Deep reinforcement learning for logic synthesis,” arXiv preprint arXiv:1911.04021, 2019.
-  A. Mishchenko et al., “ABC: A system for sequential synthesis and verification,” Berkeley Logic Synthesis and Verification Group, 2018.
-  S. Jang, K. Chung, A. Mishchenko, R. Brayton et al., “A power optimization toolbox for logic synthesis and mapping,” 2009.
-  S. Iman and M. Pedram, Logic synthesis for low power VLSI designs. Springer Science & Business Media, 1998.
-  Arizona State University. (2013) Predictive Technology Model (PTM). [Online]. Available: http://ptm.asu.edu/
-  F. Brglez and H. Fujiwara, “A Neutral Netlist of 10 Combinational Benchmark Circuits and a Target Translator in Fortran,” in Proceedings of IEEE Int’l Symposium Circuits and Systems (ISCAS 85). IEEE Press, Piscataway, N.J., 1985, pp. 677–692.
-  S. Yang, Logic synthesis and optimization benchmarks user guide: version 3.0. Microelectronics Center of North Carolina (MCNC), 1991.
-  L. Amaru et al. (2017) The EPFL combinational benchmark suite. [Online]. Available: https://lsi.epfl.ch/benchmarks