1. Introduction
Electronic Design Automation (EDA) involves a diverse set of software algorithms and applications required for the design of complex electronic systems. Given the deep design challenges that designers face, developing high-quality and efficient design flows is crucial. A well-developed design flow can reduce time-to-market by enabling manufacturability and by addressing timing closure, power consumption, and similar objectives. In general, EDA vendors provide reference design flows along with their EDA tools. However, such design flows may not perform well for many designs.
There are two major reasons. First, the performance of a design flow varies with the Intellectual Property (IP) of the design. To achieve the design objectives, design flows need to be customized for the given IP. Such flows are called IP-specific or design-specific flows. This becomes more important as new types of designs emerge, e.g., design methods for neuromorphic chips (akopyan2015truenorth). Second, design flows are mostly developed by EDA developers and users based on their knowledge and experience, with many testing iterations and intensive supervision. However, because the number of available flows is so large, finding the best design flows in the entire search space by human testing is impossible. It is particularly difficult to find the best flows for recently developed transformations (cunxiyu:dac16; YuHNCKSCM18). For example, given 50 synthesis transformations that can each be processed independently, the total number of available design flows is $50!$. The search space of general flows is formally defined in Section 2.1. Although significant effort has been spent on providing high-quality design flows, techniques that systematically generate IP-specific synthesis flows have been lagging. Similar problems exist in designing System-on-Chip (SoC). In Section 2 (Figure 1), two motivating examples are provided to show the need for such a technique.
Design flows are considered iterative flows, since the transformations are applied to the design iteratively. Machine learning techniques have been leveraged in flow optimization, such as iterative flow optimization for compilers using Markov chains (agakov2006using). Regarding synthesis flow optimization, Liu et al. recently introduced an area optimization approach for lookup-table (LUT) mapping, in which the logic transformations are guided by the Markov Chain Monte Carlo (MCMC) method (liu2017parallelized). However, a Markov chain model is not sufficient for autonomously designing synthesis flows. The main reason is that a synthesis transformation may not affect the next transformation but may affect a transformation several iterations later, which does not satisfy the Markov property (durrett2010probability). In this work, we formulate the problem of artificially developing synthesis flows as a Multiclass Classification problem and solve it using deep learning (lecun2015deep). Deep learning has shown considerable success in tasks such as image recognition (krizhevsky2012imagenet) and natural language processing (farabet2013learning). Several advances mitigate the deficiencies of traditional multilayer perceptrons (MLPs): CNNs have made it possible to robustly and automatically extract learned features, and overfitting in fully connected layers is mitigated by the random regularization called dropout (srivastava2014dropout).
Specifically, this paper includes the following contributions:
a) The search space of artificially developing synthesis flows is formally defined in Section 2.
b) We introduce a flow-classification model (Section 3.1) combined with the one-hot modeling of flows (Section 3.2), such that the problem can be modeled as a Multiclass Classification problem.
c) We develop a fully autonomous framework for developing synthesis flows based on a Convolutional Neural Network (CNN). This framework takes HDL as input and outputs two sets of synthesis flows, namely angel-flows and devil-flows, which provide the best and worst QoRs, respectively (devil-flows could provide information for improving the synthesis transformations).
d) Our framework is demonstrated by successfully developing delay-driven and area-driven angel/devil-flows for a 64-bit Montgomery multiplier, a 128-bit AES core, and a 64-bit ALU. Evaluations of the CNN architecture and training process for classifying synthesis flows are also provided.
e) The datasets and demos are released publicly at https://github.com/ycunxi/FLowGenCNNsDAC18.git.
2. Background
2.1. Notations and Search Space
Definition 1 (non-repetition Synthesis Flow): Given a set of unique synthesis transformations $S$={$p_0$, $p_1$,…, $p_{n-1}$}, a synthesis flow $F$ is a permutation of the $p_i \in S$ performed iteratively.
Example 1: Let $S$={$p_0$, $p_1$, $p_2$}, where the $p_i$ are transformations in the synthesis tool that can be processed independently. Then there are six flows available in total:
$F_0$: $p_0 \to p_1 \to p_2$; $F_1$: $p_0 \to p_2 \to p_1$; $F_2$: $p_1 \to p_0 \to p_2$; $F_3$: $p_1 \to p_2 \to p_0$; $F_4$: $p_2 \to p_0 \to p_1$; $F_5$: $p_2 \to p_1 \to p_0$.
Remark 1: Let $N$ be the number of all available flows, where $S$ includes $n$ elements, such that $N \le n!$.
The upper bound of $N$ is reached iff all elements in $S$ can be processed independently. In practice, some constraints may have to be satisfied when processing these transformations. In this case, $N$ will be smaller than $n!$. For example, given the constraint that $p_0$ has to be processed before $p_1$, the available flows include only $F_0$, $F_1$, and $F_4$.
Definition 2 ($m$-repetition Synthesis Flow, $m \ge 2$): Given a set of unique synthesis transformations $S$={$p_0$, $p_1$,…, $p_{n-1}$}, a synthesis flow $F$ with $m$-repetition is a permutation of the elements of $S_m$, where $S_m$ contains $m$ copies of $S$.
Example 2: Let $S$={$p_0$, $p_1$}. Each $p_i$ can be processed independently. For developing 2-repetition synthesis flows, $S_2$={$p_0$, $p_1$, $p_0$, $p_1$}. The available flows include:
$p_0 \to p_0 \to p_1 \to p_1$; $p_0 \to p_1 \to p_0 \to p_1$; $p_0 \to p_1 \to p_1 \to p_0$; $p_1 \to p_0 \to p_0 \to p_1$; $p_1 \to p_0 \to p_1 \to p_0$; $p_1 \to p_1 \to p_0 \to p_0$.
Remark 2: Let $L$ be the length of a synthesis flow. Given an $m$-repetition flow with $n$ transformations in $S$, $L = n \times m$.
Remark 3: Let the function $f(n, L, m)$ be the number of available $m$-repetition flows with $n$ elements in $S$. $f(n, L, m)$ uniquely satisfies a recursive formula.
The number of available $m$-repetition flows with $n$ synthesis transformations is the same as the number of $L$-permutations of $n$ objects with repetition limited to $m$. The proof of the recursive formula is similar to that in (mendelson1981permutations) and is not included in this paper. The upper and lower boundary conditions are $n! < f(n, L, m) < n^L$. We can see that $f(n, L, m)$ becomes dramatically larger than $n!$ (non-repetition flows) as $m$ increases.
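As a rough illustration of how quickly the search space of Remark 3 grows, the sketch below counts sequences of a given length over a set of transformations in which each transformation is used at most a limited number of times. The recursion (choose the positions of one transformation, then recurse on the rest) is one standard way to count such sequences and is an assumption here; it is not claimed to be the exact formula used in this paper. The function names are hypothetical.

```python
from functools import lru_cache
from math import comb, factorial


@lru_cache(maxsize=None)
def count_sequences(n: int, length: int, m: int) -> int:
    """Count sequences of `length` over n symbols, each used at most m times."""
    if length == 0:
        return 1  # the empty sequence
    if n == 0:
        return 0  # positions left to fill but no symbols left
    total = 0
    # Place the last symbol r times (0 <= r <= m), choosing its positions,
    # then fill the remaining positions with the other n - 1 symbols.
    for r in range(min(m, length) + 1):
        total += comb(length, r) * count_sequences(n - 1, length - r, m)
    return total


def count_m_repetition_flows(n: int, m: int) -> int:
    """Number of flows of length n * m in which each transformation appears m times."""
    return count_sequences(n, n * m, m)


# Two transformations, 2-repetition: the six flows of a 2-element example.
small = count_m_repetition_flows(2, 2)
# Six transformations, 4-repetition (flow length 24).
large = count_m_repetition_flows(6, 4)
```

Because the flow length equals the number of transformations times the repetition count, every transformation appears exactly that many times, so the result can be cross-checked against the multinomial coefficient, e.g. `factorial(24) // factorial(4) ** 6` for the second case.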
2.2. Motivating Example
We provide two motivating examples, shown in Figure 1, using the open-source logic synthesis framework ABC (mishchenko2010abc). The setups are as follows:

• $S$={balance, restructure, rewrite, refactor, rewrite -z, refactor -z} ($n$=6); the elements in $S$ are logic transformations in ABC that can be processed independently (the names of these transformations are the same as the commands in ABC).

• 50,000 unique $m$-repetition flows are generated by random permutations of $S_m$ ($m$=4, $n$=6, $L$=24).

• Input designs: a 128-bit Advanced Encryption Standard (AES) core and a 64-bit ALU, taken from OpenCores (opencoreweb).

• Delay and area of these flows are obtained after technology mapping using a 14nm standard-cell library.
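The generation of unique random flows in this setup can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names, the fixed seed, and the use of `-z` as the option spelling of the two zero-cost variants are assumptions, and the example draws only 1,000 flows to keep it quick.

```python
import random

# The six ABC transformations of the setup (option spelling "-z" is assumed).
TRANSFORMATIONS = [
    "balance", "restructure", "rewrite",
    "refactor", "rewrite -z", "refactor -z",
]


def random_flows(transformations, m, count, seed=0):
    """Draw `count` unique flows, each a random permutation of m copies of the set.

    Assumes `count` does not exceed the number of distinct flows;
    otherwise the loop would never terminate.
    """
    rng = random.Random(seed)
    pool = list(transformations) * m  # m copies of every transformation
    seen = set()
    flows = []
    while len(flows) < count:
        rng.shuffle(pool)
        flow = tuple(pool)
        if flow not in seen:  # keep only flows not generated before
            seen.add(flow)
            flows.append(flow)
    return flows


def to_abc_script(flow):
    """Join a flow into a semicolon-separated command string."""
    return "; ".join(flow)


flows = random_flows(TRANSFORMATIONS, m=4, count=1000)
script = to_abc_script(flows[0])
```

Each resulting string could then be passed to the synthesis tool as a command sequence, followed by technology mapping to collect delay and area.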
The QoR distributions of AES and ALU designs using the 50,000 random flows are shown in Figure 1(a, b) and (c, d). There are several important observations based on Figure 1, which show the main motivations of this work:

• Given the same set of synthesis transformations, the QoR differs widely across flows. For example, the delay and area of the AES design produced by the 50,000 flows differ by up to 40% and 90%, respectively.

• The search space of the synthesis flows is large. According to Remark 3, the total number of available $m$-repetition flows with $n$ independent synthesis transformations far exceeds $n!$. Discovering high-quality synthesis flows by human testing over the entire search space is unlikely to succeed.

• The same set of flows performs differently on different designs. For example, in Figure 1, the QoR distributions of AES and ALU differ significantly. This means that high-quality flows for the AES design could perform poorly for the ALU. Therefore, synthesis flows need to be customized for a specific IP or design.
3. Approach
3.1. Overview
This section presents our framework, which artificially develops synthesis flows for a given design. Our framework takes the HDL as input and outputs two sets of synthesis flows, namely angel-flows and devil-flows, which provide the best and worst QoR according to the design objectives. This problem is formulated as Multiclass Classification and solved using a CNN classifier. The main idea of our approach is to train a CNN classifier with a small set of labeled random flows. The classes (or labels) of the synthesis flows are assigned based on one or multiple QoR metrics, such as delay, area, and power. The trained classifier is used to predict the classes of a large number of unlabeled random flows. Finally, angel-flows and devil-flows are generated by sorting the prediction confidence, i.e., the probability of being in a certain class (Section 3.3). This framework is a generic model for designing synthesis flows at many stages, such as high-level synthesis and logic synthesis. The demonstration is made by designing logic synthesis flows using ABC (mishchenko2010abc), as shown in Section 4. The flow of our framework is shown in Figure 2 and includes three main components:
① Generate training datasets. In this work, the training dataset is a set of labeled synthesis flows, namely training flows. However, the training flows are originally unlabeled. The first step of our approach is therefore to label a set of random flows. This requires applying these synthesis flows to the input design and collecting the QoR result at the end of each flow. Note that applying a synthesis flow to a large design could be time-consuming. Hence, our framework operates in an incremental fashion. The CNN training (component ②) starts after 1000 labeled flows are collected, and the CNN is retrained each time 500 new labeled flows are collected. In this way, our framework can produce intermediate results during the training process.
These flows are labeled according to the classification model shown in Table 1. This model can be changed according to the design objectives, using either a single-metric or a multi-metric model. For example, if the design objective is area optimization, a single-metric model is selected with area as the metric. If the design objective is minimizing delay within a given area budget, a multi-metric model is selected. Note that the number of classes is a fixed input of the proposed framework, and the definition (QoR range) of each class is decided using a general model. For example, defining seven classes ($n$=6) in a single-metric model requires six determinators. We define the six determinators using the {5%, 15%, 40%, 65%, 90%, 95%} QoR results of the collected labeled synthesis flows. For example, assuming 1000 labeled flows have been collected, the first determinator lies at the 5% position of the sorted values of the selected metric (around the 50th least value) and the last lies at the 95% position (around the 50th largest value). Since the training dataset is updated incrementally, the definitions of the classes may change dynamically. Angel-flows and devil-flows are subsets of the flows in classes 0 and n, respectively.
Single-metric  Multi-metric  Class/Label 
,  0  
,  1  
,  2  
…  …  … 
,  n 
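The single-metric labeling described above can be sketched as follows, assuming a metric for which smaller is better (such as area or delay). The function names, the way each percentage is mapped to an index in the sorted results, and the handling of values that fall exactly on a determinator are illustrative assumptions, not details taken from Table 1.

```python
from bisect import bisect_left

# Percent points used to place the six determinators.
PERCENT_POINTS = (5, 15, 40, 65, 90, 95)


def determinators(qor_values, percent_points=PERCENT_POINTS):
    """Pick one class boundary per percent point from the sorted QoR results.

    Assumes `qor_values` is non-empty.
    """
    ordered = sorted(qor_values)
    last = len(ordered) - 1
    return [ordered[min(last, len(ordered) * p // 100)] for p in percent_points]


def label(qor, dets):
    """Class index in 0..len(dets); a value equal to a boundary goes to the lower class."""
    return bisect_left(dets, qor)


# 1000 hypothetical QoR results (e.g. area values), here simply 0..999.
qor = list(range(1000))
dets = determinators(qor)
labels = [label(v, dets) for v in qor]
```

With seven classes, class 0 collects the best results and class 6 the worst. Because the boundaries are recomputed from whatever labeled flows are available, calling `determinators` again after each batch of new flows reproduces the dynamic class definitions mentioned in the text.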
② Design and train the CNN classifier. The second component trains a CNN classifier that predicts the classes of unlabeled flows. To train a CNN classifier, the training data, i.e., the labeled synthesis flows, need to be represented as matrices. We present a one-hot modeling that represents a synthesis flow as a binary matrix. This model and the CNN architecture are introduced in Section 3.2.
③ Output angel-flows and devil-flows. The trained classifier is used to predict the classes of a large number of untested sample flows. Although we are only interested in the flows in classes 0 and n, the classifier may assign many flows to these two classes. However, from the synthesis perspective, selecting a small set of flows is sufficient. In this work, the angel-flows and devil-flows are selected from the flows labeled 0 and n with the highest prediction confidence. The details are included in Section 3.3.
3.2. CNN Classifier
3.2.1. One-hot Representation of Synthesis Flow
In this section, the one-hot representation model of a synthesis flow is introduced for $m$-repetition flows. Non-repetition flows can be represented using the same model.
Let $M$ be the binary matrix of an $m$-repetition flow $F$ with $S$={$p_0$, $p_1$,…, $p_{n-1}$} (see Definition 2). The number of transformations in $F$ equals the length of the flow, $L = n \times m$ (see Remark 2). Let the $i$-th synthesis transformation in $F$ be $p_j$, with $p_j \in S$ and $0 \le j < n$. Its $n$-by-1 binary vector representation is $v_i$, where the $j$-th element is 1 and the other elements are 0. $M$ is an $L$-by-$n$ matrix whose $i$-th row is $v_i^T$.
Example 3: We illustrate the one-hot representation model using a flow $F$ from Example 2, such that $S$={$p_0$, $p_1$} and $m$=2; $M$ is a 4-by-2 matrix.
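A minimal sketch of the one-hot model: each step of a flow becomes one row with a single 1 in the column of the transformation applied at that step. The function name and the particular 2-repetition flow used below are hypothetical choices for illustration.

```python
def one_hot_flow(flow, transformations):
    """Return a len(flow)-by-len(transformations) binary matrix (list of rows)."""
    column = {name: j for j, name in enumerate(transformations)}
    matrix = []
    for name in flow:
        row = [0] * len(transformations)
        row[column[name]] = 1  # exactly one nonzero element per row
        matrix.append(row)
    return matrix


S = ["p0", "p1"]
F = ["p0", "p1", "p1", "p0"]  # a hypothetical 2-repetition flow over S
M = one_hot_flow(F, S)
# M is a 4-by-2 matrix:
# [[1, 0],
#  [0, 1],
#  [0, 1],
#  [1, 0]]
```

Every row sums to 1 and, for a flow in which each transformation is repeated the same number of times, every column sums to that repetition count.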
3.2.2. CNN Architecture and Training
The inputs of the CNN are $L$-by-$n$ binary matrices representing the synthesis flows. The CNN includes convolutional, pooling, locally connected, dense, and dropout layers. The kernel sizes of the convolutional and pooling layers are shown in Figure 3. The dropout rate in the dropout layer is 0.4 to prevent overfitting (srivastava2014dropout). Since our inputs are in one-hot representation, the loss function is computed using the sparse softmax cross-entropy function. The output of the network comes from the softmax function. Each convolutional layer has 200 kernels (filters).
Regarding the CNN architecture, two parameters have a significant impact on the prediction performance (accuracy): a) the kernel size of the convolutional layers and b) the activation functions of the convolutional and dense layers. Unlike in most CNN classification applications, square kernels do not perform well in classifying synthesis flows. We use rectangular kernels in this work. The reason is that there is only one nonzero element in each row of the one-hot matrix, so a rectangular kernel can avoid computations over all-zero submatrices. The results of comparing the accuracy of the CNN classifier using 3×6, 6×6, and 6×12 kernels are shown in Section 4, Figure 6.
The activation function of a node in the neural network defines the output of that node for a given set of inputs. In artificial neural networks, this function is also called the transfer function. The activation operations should provide different types of nonlinearities in the neural network to solve Multiclass Classification problems. In general, there are two types of activation functions: smooth nonlinear functions, such as Sigmoid, Tanh, Exponential Linear Units (ELU) (clevert2015fast), and Scaled Exponential Linear Units (SELU) (klambauer2017self), and non-smooth continuous functions, such as the Rectified Linear Unit (ReLU) (nair2010rectified) and Concatenated Rectified Linear Units (CReLU) (shang2016understanding). We find that for classifying synthesis flows, the smooth nonlinear activation functions, such as SELU and Tanh, perform better. The activation functions ReLU, ReLU6, ELU, SELU, Softplus, Softsign, Sigmoid, and Tanh are compared in Section 4, Figure 7.
Regarding the training process, the CNN classifier is trained specifically for each design, as described in Section 3.1. Since the training data are collected incrementally, the CNN is retrained each time 500 new data points are collected. The mini-batch (orr2003neural) training strategy is applied in this work with batch size 5, i.e., five training examples are evaluated simultaneously in each iteration. We have evaluated five different gradient descent algorithms: Stochastic Gradient Descent (SGD), Momentum (qian1999momentum), AdaGrad (duchi2011adaptive), RMSProp (tieleman2012lecture), and Follow the Regularised Leader (FTRL) (mcmahan2013ad). The comparison results are included in Section 4, Figures 4 and 5.
3.3. Angel-Flows and Devil-Flows
In this work, the outputs of the proposed framework are 200 angel-flows and 200 devil-flows. There are two steps for generating these flows. First, the trained CNN classifier is used to predict the class of a large number of random flows. According to the classification rule (Table 1), the angel-flows and devil-flows are selected from the class-0 flows and the class-$n$ flows. The predicted class of a random flow is the class with the highest probability in the softmax output of the CNN classifier. For example, assuming the output of the classifier (# classes = 7) is {$P_0$, $P_1$, …, $P_6$}, where $P_i$ is the probability of a flow being in class $i$, the predicted class is the class $i$ with the largest $P_i$. To minimize the errors in selecting the angel-flows (devil-flows), our framework selects the flows with the highest $P_0$ ($P_n$) among the class-0 (class-$n$) flows.
Example 4: Let the prediction results in Table 2 be the outputs of the CNN classifier for five synthesis flows. Three of them, $F_0$, $F_1$, and $F_4$, are predicted as class 0. If two angel-flows are required, $F_1$ and $F_0$ are selected and $F_4$ is eliminated.
Flow | $P_0$ | $P_1$ | $P_2$ | $P_3$ | $P_4$ | $P_5$ | $P_6$
$F_0$ | 0.47 | 0.13 | 0.22 | 0.02 | 0.03 | 0.12 | 0.01
$F_1$ | 0.51 | 0.12 | 0.01 | 0.09 | 0.17 | 0.08 | 0.02
$F_2$ | 0.02 | 0.45 | 0.14 | 0.12 | 0.11 | 0.10 | 0.06
$F_3$ | 0.12 | 0.03 | 0.17 | 0.62 | 0.01 | 0.02 | 0.03
$F_4$ | 0.35 | 0.23 | 0.09 | 0.02 | 0.13 | 0.17 | 0.01
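The selection step can be sketched as follows, using the probabilities of Table 2. The flow names `F0` to `F4` and the function name are hypothetical labels introduced for illustration; the sketch keeps only flows whose most probable class is the target class and then ranks them by the probability of that class.

```python
def select_by_confidence(predictions, k, target_class):
    """Return up to k flow names predicted as target_class, most confident first."""
    candidates = []
    for name, probs in predictions.items():
        predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if predicted == target_class:
            candidates.append((probs[target_class], name))
    candidates.sort(reverse=True)  # highest confidence first
    return [name for _, name in candidates[:k]]


# Softmax outputs of the five flows in Table 2 (seven classes).
predictions = {
    "F0": [0.47, 0.13, 0.22, 0.02, 0.03, 0.12, 0.01],
    "F1": [0.51, 0.12, 0.01, 0.09, 0.17, 0.08, 0.02],
    "F2": [0.02, 0.45, 0.14, 0.12, 0.11, 0.10, 0.06],
    "F3": [0.12, 0.03, 0.17, 0.62, 0.01, 0.02, 0.03],
    "F4": [0.35, 0.23, 0.09, 0.02, 0.13, 0.17, 0.01],
}

angel = select_by_confidence(predictions, k=2, target_class=0)
```

Devil-flows would be obtained the same way by passing the index of the last class as `target_class`.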
4. Experimental Results
We demonstrate the proposed framework by designing logic synthesis flows using the open-source synthesis framework ABC (mishchenko2010abc). Our framework is implemented in C++. The CNN classifier is implemented using TensorFlow r1.3 (abadi2016tensorflow) through its C++ API. The demonstration is made with three designs: a 64-bit Montgomery multiplier, a 128-bit AES core (opencoreweb), and a 64-bit ALU (opencoreweb). The goal is to generate 200 angel-flows and 200 devil-flows for area or delay optimization. We use the same setups as in the motivating example (Section 2.2). Thus, the synthesis flows are 4-repetition flows with six ABC synthesis transformations, $S$={balance, restructure, rewrite, refactor, rewrite -z, refactor -z}. The inputs of the CNN classifier are 24-by-6 matrices representing the synthesis flows using the one-hot modeling. These matrices are reshaped before being passed to the two convolutional layers.
For generating the area-driven or delay-driven flows, we use the single-metric classification model (Table 1) with the area or delay of the design as the metric. The number of classes is seven. The six determinators are defined using {5%, 15%, 40%, 65%, 90%, 95%} of the area/delay results of the training flows. The area and delay results are obtained after technology mapping with a 14nm standard-cell library. The number of training flows is 10,000 and the number of sample flows for generating the final flows is 100,000. The experimental results are obtained using a machine with 2x12-core Intel Xeon processors at 2.5 GHz, 256GB RAM, 2x240GB SSD, and 2 Nvidia Titan X GPUs.
The result section includes two parts. The first part contains the experimental results of training the CNN classifier. It consists of evaluations of different gradient descent algorithms, various convolutional kernel sizes, and activation functions. Based on these results, we find the best settings for the CNN architecture and training strategy. Using these settings, we generate the angel-flows and devil-flows and evaluate their quality. To evaluate the accuracy of the CNN classifier and the generated flows, we have explicitly collected the area and delay results by applying the 100,000 flows to the three designs. Hence, the true classes of the 100,000 sample flows are available for evaluation.
4.1. Results of Training CNN Classifier
4.1.1. Gradient Descent Algorithms
The results of training the CNN classifier using different gradient descent algorithms are shown in Figures 4 and 5. Figure 4 includes the results of training for generating area-driven flows using five different algorithms: Stochastic Gradient Descent (SGD), Momentum (qian1999momentum), AdaGrad (duchi2011adaptive), RMSProp (tieleman2012lecture), and Follow the Regularised Leader (FTRL) (mcmahan2013ad); Figure 5 includes the results for generating delay-driven flows. The number of training steps is 100,000. The kernel size of the convolutional layers is 6-by-12. In Figures 4 and 5, the y-axis represents the accuracy of prediction. Let $TP_0$ be the number of generated angel-flows whose true class is class 0, and let $TP_n$ be the number of generated devil-flows whose true class is class $n$. The accuracy is defined from these two counts.
The x-axis represents the training time of our framework. Note that the training process for the 64-bit Montgomery multiplier is 2× faster than for the other two designs. The reason is that collecting the training dataset takes most of the runtime, and applying one synthesis flow to the Montgomery multiplier is about 2× faster than for the other two designs. The actual runtime for training the CNN classifier is about 3-5% of the entire training time. As shown in Figures 4 and 5, RMSProp (tieleman2012lecture) outperforms the other algorithms in classifying synthesis flows. The accuracy of the classifier in these six experiments reaches 95% after 24 hours.
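One plausible reading of the accuracy measure used in these plots is the fraction of generated flows whose true class matches the class they were generated for. The sketch below computes that fraction; the exact formula is an assumption, since it is not reproduced in the text, and the function name and sample data are hypothetical.

```python
def generation_accuracy(angel_true_classes, devil_true_classes, n):
    """Fraction of generated flows whose true class is the intended one.

    Assumed definition: angel-flows are intended for class 0,
    devil-flows for class n.
    """
    correct_angel = sum(1 for c in angel_true_classes if c == 0)
    correct_devil = sum(1 for c in devil_true_classes if c == n)
    total = len(angel_true_classes) + len(devil_true_classes)
    return (correct_angel + correct_devil) / total


# Hypothetical true classes of four generated angel-flows and four devil-flows
# in a seven-class model (n = 6).
accuracy = generation_accuracy([0, 0, 1, 0], [6, 6, 6, 5], n=6)
```

With 200 angel-flows and 200 devil-flows, the denominator under this reading would be 400.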
4.1.2. Choice of Convolutional Kernel Size
As mentioned in Section 3.2, the size of the convolutional kernel has a significant impact on the CNN classifier. In Figure 6, three kernel sizes, 3×6, 6×6, and 6×12, are tested using the RMSProp algorithm (tieleman2012lecture) with 100,000 training steps. The number of kernels in each convolutional layer is 200. The input design is the 128-bit AES core, and the objective is generating delay-driven flows. We can see that the 3×6 and 6×12 kernels perform much better than the 6×6 kernel.
4.1.3. Evaluation of Activation Functions
For evaluating the performance of classifying synthesis flows using different activation functions, we set the learning rate to 0.0001, the number of learning steps to 100,000, and the convolutional kernel size to 6×12, and use RMSProp to minimize the loss function. Figure 7 compares eight activation functions: ReLU, ReLU6, ELU (clevert2015fast), SELU (klambauer2017self), Softplus, Softsign, Sigmoid, and Tanh. We can see that the ELU, SELU, Softsign, and Tanh functions outperform the others, and SELU offers the best accuracy for generating delay-driven flows for the 128-bit AES core. Note that the accuracy of different activation functions varies across datasets. In this work, SELU provides the most reliable performance.
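For reference, the eight activation functions compared here can be written as scalar functions. This is a plain-Python sketch of their standard definitions, not the TensorFlow implementation used in the experiments; the SELU constants are the commonly published values, and ELU is shown with its usual default of alpha = 1.

```python
import math

# Constants from the self-normalizing networks formulation of SELU.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def relu(x):
    return max(0.0, x)


def relu6(x):
    return min(max(0.0, x), 6.0)  # ReLU clipped at 6


def elu(x, alpha=1.0):
    return x if x > 0 else alpha * math.expm1(x)


def selu(x):
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * math.expm1(x))


def softplus(x):
    return math.log1p(math.exp(x))  # overflows for very large x in this naive form


def softsign(x):
    return x / (1.0 + abs(x))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


tanh = math.tanh
```

ReLU and ReLU6 are piecewise linear with kinks, whereas Softplus, Softsign, Sigmoid, and Tanh are smooth everywhere; ELU and SELU bend smoothly toward a negative saturation value instead of cutting off at zero.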
4.2. Quality of Generated Flows
Finally, we evaluate the quality of the generated angel-flows and devil-flows. The results shown in Figure 8 are obtained using the following settings: the number of training flows is 10,000; the number of sample flows is 100,000; the learning rate is 0.0001; the number of learning steps is 100,000; the activation function is SELU; the gradient descent algorithm is RMSProp; the convolutional kernel size is 6×12. The four types of points shown in Figure 8 represent the area-delay results of the area-angel-flows, area-devil-flows, delay-angel-flows, and delay-devil-flows. One axis represents delay and the other represents area. The background of each subfigure in Figure 8 is the 2D distribution of the 100,000 sample flows (this distribution is similar to that in Section 2.2, Figure 1, but with 100,000 data points). We can see that, among the 100,000 sample flows, the generated area (delay) angel-flows provide the best results in terms of area (delay), and the devil-flows provide the worst results. For example, the data points of the area-angel-flows of the three designs are clearly bounded by a certain area value. The total runtime for generating these flows is 3-4 days. This demonstrates that our framework can successfully develop angel-flows and devil-flows.
5. Conclusions
This work presents a fully autonomous framework that artificially produces design-specific synthesis flows without human guidance or baseline flows. We introduce a general approach for flow optimization problems by modeling them as Multiclass Classification. A one-hot modeling of iterative flows is proposed such that any flow can be represented as a binary matrix. This approach is demonstrated by generating the best and worst synthesis flows for three large designs with a 14nm technology. Future work will focus on artificially developing cross-layer synthesis flows to find the missing correlations between logic and physical design (DBLP:conf/iccad/YuCSC17).
6. Acknowledgement
This project is funded by the ERC-2014-AdG 669354 grant.
References

(1) F. Akopyan, J. Sawada et al., “Truenorth: Design and tool flow of a 65 mW 1 million neuron programmable neurosynaptic chip,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 10, pp. 1537–1557, 2015.
(2) C. Yu, M. J. Ciesielski, M. Choudhury, and A. Sullivan, “DAG-aware logic synthesis of datapaths,” in Proceedings of the 53rd Annual Design Automation Conference, DAC 2016, Austin, TX, USA, June 5-9, 2016, pp. 135:1–135:6.
(3) C. Yu, C. Huang, G. Nam, M. Choudhury, V. N. Kravets, A. Sullivan, M. J. Ciesielski, and G. D. Micheli, “End-to-end industrial study of retiming,” in 2018 IEEE Computer Society Annual Symposium on VLSI, ISVLSI 2018, Hong Kong, China, July 8-11, 2018, pp. 203–208.
(4) F. Agakov, E. Bonilla, et al., “Using machine learning to focus iterative optimization,” in Proceedings of the International Symposium on Code Generation and Optimization. IEEE, 2006, pp. 295–305.
(5) G. Liu and Z. Zhang, “A parallelized iterative improvement approach to area optimization for LUT-based technology mapping,” FPGA’17, 2017.
(6) R. Durrett, Probability: Theory and Examples. Cambridge University Press, 2010.
(7) Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
(8) A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
(9) C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical features for scene labeling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1915–1929, 2013.
(10) N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
(11) H. Mendelson, “On permutations with limited repetition,” Journal of Combinatorial Theory, Series A, vol. 30, no. 3, pp. 351–353, 1981.
(12) A. Mishchenko et al., “ABC: A System for Sequential Synthesis and Verification,” URL http://www.eecs.berkeley.edu/~alanmi/abc.
(13) “OpenCores,” URL https://opencores.org.
(14) D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (ELUs),” arXiv preprint arXiv:1511.07289, 2015.
(15) G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, “Self-normalizing neural networks,” arXiv preprint arXiv:1706.02515, 2017.
(16) V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
(17) W. Shang, K. Sohn, D. Almeida, and H. Lee, “Understanding and improving convolutional neural networks via concatenated rectified linear units,” in International Conference on Machine Learning, 2016, pp. 2217–2225.
(18) G. B. Orr and K.-R. Müller, Neural Networks: Tricks of the Trade. Springer, 2003.
(19) N. Qian, “On the momentum term in gradient descent learning algorithms,” Neural Networks, vol. 12, no. 1, pp. 145–151, 1999.
(20) J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.
(21) T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, pp. 26–31, 2012.
(22) H. B. McMahan, G. Holt et al., “Ad click prediction: a view from the trenches,” in KDD’13. ACM, 2013, pp. 1222–1230.
(23) M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
(24) C. Yu, M. Choudhury, A. Sullivan, and M. J. Ciesielski, “Advanced datapath synthesis using graph isomorphism,” in ICCAD 2017, Irvine, CA, USA, November 13-16, 2017, pp. 424–429.