I Introduction
As CMOS transistor technologies test the limits of Moore's Law [18], the design flow of VLSI circuits demands increasingly complex analysis, transformation, and verification iterations to validate functional correctness and design quality in terms of performance, power, and signal integrity. The design flow steps must also validate various process-voltage-temperature (PVT) corners and operating modes, such as low-power (LP) and high-performance (HP), that involve increasingly nonlinear effects. Fast and accurate simulation is therefore crucial to lower the number of design iterations, speed up convergence, and consequently shorten the design turnaround time [15].
SPICE simulators are the de facto standard tools for accurate analysis and signoff; however, they are very slow for billion-transistor circuits [21, 5]. Therefore, higher levels of circuit abstraction using approximation have been used to speed up simulation steps. Abstraction models are generally based on lookup tables (LUTs), closed-form formulations, factors, or their combinations. The traditional voltage-based models, namely the nonlinear delay model (NLDM), nonlinear power model (NLPM), effective current source model (ECSM [6]), and composite current source model (CCSM [26]), utilize LUTs that store delay, noise, or power as nonlinear functions of physical, structural, and environmental parameters, and depend on voltage modeling more than current modeling. Voltage-based models are intuitively better choices than simple closed-form formulations of nonlinear functions; however, they become increasingly inaccurate in capturing signal integrity and short-channel effects as technologies scale down [4]. Alternatively, current-based models such as current source models (CSMs) [7, 12, 16, 11, 3, 17, 19, 10, 9] use voltage-dependent components to model logic cells. In addition to higher accuracy, another advantage of current-based models over voltage-based models is their ability to simulate realistic output waveforms for arbitrary input signals. The major shortcoming of LUT-based approaches is the high latency of memory queries.
In this work, we present NN-PARS, a neural network (NN) based PARallelized circuit Simulation framework that replaces current-based CSM LUT queries with NN computations and exploits the architecture of graphics processing units (GPUs) for concurrent simulation. Following our proposed method, the various gates in the circuit can be simulated in parallel. An embedded event-driven scheduling engine selects gates for computation based on the characteristics of the underlying GPU platform and the input netlist, to minimize the total circuit simulation time. The major novelties of our NN-PARS framework are as follows:

NN-PARS accelerates the CSM simulation of complex integrated circuits using NN structures optimized for the underlying GPU computational capabilities.

Considering the iterative nature of CSM-based output waveform calculation, NN-PARS embeds a simple event-driven scheduling methodology that further maximizes simulation concurrency by performing calculation steps for many logic cells in parallel, thereby disentangling logic cell simulation from the order of cells in the circuit topology.
The remainder of the paper is organized as follows. Section II presents a brief background on CSM simulation. Sections III and IV elaborate our NN-PARS framework and experimental results, respectively. Section V concludes the paper.
II Background
Although our NN-PARS framework can be utilized to enhance any LUT-based circuit simulation technique, we choose CSM as the method of comparison. The CSM technique models each logic cell with voltage-dependent current sources, as well as input, Miller, and output capacitors [7, 9, 10]. In the case of a simple INV gate, the CSM components depend only on the input ($V_{in}$) and output ($V_{out}$) voltages. For logic cells with multiple inputs, however, these components depend on a larger number of variables, i.e., the voltages of the inputs and internal nodes [3]. Consequently, the size of CSM LUTs grows exponentially with the number of variables.
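As a rough illustration of this exponential growth (a sketch with assumed table parameters, not the paper's actual characterization settings), the entry count of a d-dimensional LUT with s samples per voltage axis scales as s^d, while a single-hidden-layer NN model grows only linearly in d:

```python
# Hypothetical illustration: a d-dimensional CSM LUT with s voltage
# samples per axis stores s**d entries, while a single-hidden-layer NN
# with h neurons needs only d*h + 2*h + 1 parameters.
def lut_entries(d, s=32):
    return s ** d

def nn_params(d, h):
    # hidden layer: d*h weights + h biases; output layer: h weights + 1 bias
    return d * h + 2 * h + 1

for d in (2, 4, 6):
    print(d, lut_entries(d), nn_params(d, h=20))
```

Even at a modest 32 samples per axis, a 4-variable LUT already holds over a million entries, while the NN parameter count stays in the low hundreds.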
Despite recent advances in the computational capabilities of CPUs, such as process parallelization through many-core processors with dedicated cache memory, they still lack efficiency when processing tasks with a large number of parallel computational subtasks. GPUs are specifically designed to outperform CPUs for such tasks, with the capability of higher-order parallel computation. These devices are also known as an efficient hardware platform for training and inference of NNs [24]. This is partly because of the two levels of parallelized processing units in GPUs: several multiprocessors (MPs), each containing several stream processors (SPs, also referred to as cores) that run the actual computation. Each core is equipped with arithmetic units, register files, and a designated cache. The superiority of GPUs can be observed by comparing the evolution of GPUs and CPUs in terms of floating-point operations per second (FLOPS), as shown in Fig. 1.
As suggested in [2], high-dimensional CSM LUTs with large sizes can only fit in the DRAM of CPUs or GPUs, while low-dimensional LUTs easily fit into L1 caches. The major shortcoming of data retrieval from DRAM is its high latency. As an example, the specification of a 24-core Intel processor with the Broadwell microarchitecture [14], given in Table I, shows that DRAM access is about two orders of magnitude slower than L1 cache access. Another disadvantage of memory queries is that, in contrast to the dedicated caches of each core in multi-core processors and GPUs, the main memory is shared: the number of parallel reads from DRAM to the processors, determined by the number of memory channels, is much lower than the number of cores. As an example, the 24-core processor in Table I has only 4 memory channels.
Dependency on memory drastically increases the total circuit simulation time and, in particular, prevents accurate approaches such as CSM from being practical. To mitigate this shortcoming, semi-analytical methods [8] combine nonlinear analytical models with low-dimensional CSM lookup tables to simultaneously achieve high modeling accuracy and low time and space complexity. Alternatively, [2] (referred to as the CSM-NN method throughout this paper) proposed the complete removal of long memory queries by approximating CSM component values with simple NNs. While this method improved the simulation time of simple gates, it did not address how it can be scaled up to circuit-level simulation, especially using the parallel computation capabilities of GPUs.
In the following two sections, we present how our NN-PARS framework parallelizes the simulation of logic cells in a circuit while avoiding the high-latency memory retrievals needed in LUT-based simulators, and how it further speeds up the simulation process by scheduling concurrent tasks according to the GPU's processing capabilities.
TABLE I
Intel Broadwell Microarchitecture
Memory | Size (KByte) | Latency (clock cycles)
L1 Data Cache | 32 per core | 4–5
L2 Cache | 256 | 11–12
L3 Cache | 60,000 | 38–42
DRAM | – | 250
Intel Xeon Processor E7-8894 v4
Cores | 24
Base Frequency | 2.40 GHz
Theoretical Peak Computation | 920 GFLOPS
III NN-PARS Framework
The characterization in our method is the same as in conventional CSM-LUT. We followed the same training flow, i.e., choice of network architecture, optimization algorithm, preprocessing, and evaluation, as in CSM-NN. The following subsections explain the NN modeling of the CSM of standard cells, the required resources and latency for parallel computation of NNs on a GPU platform, and finally the flow of circuit simulation, including the event-driven scheduling of NN-PARS.
III-A NN Architecture
We followed the same approach as in CSM-NN to substitute memory retrieval with NN computation for simple logic cells. Every logic cell in the library is modeled by an NN with a single hidden layer.
It is important to note that while the accuracy of the NNs in predicting CSM component values matters, accuracy should ultimately be reported based on the quality of the output waveforms, not just a single measurement such as logic cell delay. This coincides with the function of CSM in regenerating circuit voltage waveforms. Therefore, similar to [2] and [25], we use the expected waveform similarity ($\Gamma$) as the figure of merit for simulation accuracy. In this work, $\Gamma$ is defined as the mean of the absolute difference between precise HSPICE and NN-PARS simulations, relative to the supply voltage of the technology, as shown in Eq. 1.
$\Gamma = \frac{1}{N} \sum_{i=1}^{N} \frac{\left| V_{\text{HSPICE}}(t_i) - V_{\text{NN-PARS}}(t_i) \right|}{V_{DD}}$    (1)
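A minimal sketch of this waveform-similarity metric in Python (the discrete-sample form and all names are our assumptions; the paper specifies only the mean absolute difference normalized by the supply voltage):

```python
import numpy as np

def waveform_similarity_error(v_spice, v_sim, vdd):
    """Mean absolute waveform difference relative to VDD,
    computed over aligned voltage samples of the two simulators."""
    v_spice = np.asarray(v_spice, dtype=float)
    v_sim = np.asarray(v_sim, dtype=float)
    return float(np.mean(np.abs(v_spice - v_sim)) / vdd)

# Example: two nearly identical voltage ramps on a 0.7 V supply
t = np.linspace(0.0, 1.0, 101)
err = waveform_similarity_error(0.7 * t, 0.7 * t + 0.007, vdd=0.7)
```

A constant 7 mV offset on a 0.7 V supply yields an error of 1%, illustrating how the metric penalizes deviation over the whole waveform rather than at a single timing point.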
In addition to serving as a measure for reporting the accuracy of the results, we used $\Gamma$ to select the architecture of the NNs: the smallest number of neurons with which the model passes a predefined accuracy threshold in terms of $\Gamma$, when stimulated with a set of noisy inputs, is selected for the NN implementation of the logic cell.

III-B Computational Resources and Latency Analysis
The main advantage of the proposed method is its high parallelizability and, consequently, very low simulation latency on GPU platforms. A detailed analysis of the latency and the number of required computation resources is therefore necessary. The main computational operations of a single-hidden-layer NN are multiplication (MUL) and addition (ADD). GPU cores are designed to perform one MUL and one ADD in a single cycle [20]. Denoting the number of inputs and the size of the hidden layer as $n$ and $h$, respectively, there are $n \cdot h$ multiplications in the first layer. There are no dependencies among the MUL operations within one layer, so they can all be computed in parallel using $n \cdot h$ cores in a single cycle. These initial cores are occupied during this cycle but can be reused in subsequent cycles. To produce the output of each of the $h$ hidden neurons, $n$ values must be accumulated; this can be parallelized efficiently with a tree structure in $\lceil \log_2 n \rceil$ cycles. The number of cores required in the first accumulation cycle is $h \cdot \lceil n/2 \rceil$, which is less than the number of initial cores, so no further core allocation is required and the computation can be done on the initial ones. Following the same approach for the output layer, a single-hidden-layer NN can be computed with $n \cdot h$ cores within the latency given in Eq. 2.

$\text{Latency} = \left(1 + \lceil \log_2 n \rceil\right) + \left(1 + \lceil \log_2 h \rceil\right)$ cycles    (2)
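Under this cost model (n inputs, h hidden neurons, one MUL plus one ADD per core per cycle; our reading of the garbled derivation), the peak core count and cycle latency of one NN inference can be sketched as:

```python
import math

def nn_gpu_cost(n, h):
    """Cores and cycle latency for one single-hidden-layer NN inference:
    one cycle of fully parallel MULs plus a log-depth adder tree per layer."""
    cores = n * h                              # peak demand, first MUL cycle
    latency = (1 + math.ceil(math.log2(n))     # hidden layer: MUL + adder tree
               + 1 + math.ceil(math.log2(h)))  # output layer: MUL + adder tree
    return cores, latency

# e.g. a 2-input CSM component model with an 18-neuron hidden layer
cores, cycles = nn_gpu_cost(n=2, h=18)
```

With 36 cores, the whole inference completes in 8 GPU cycles, which is the scale of savings compared with a DRAM-resident LUT query of roughly 250 cycles (Table I).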
By implementing a trained NN with fixed parameters on a GPU, the weights for each operation can be stored in register files; therefore, there is no need to retrieve data from memory.
TABLE II
Streaming Multiprocessors (SMs) | 80
32-bit FP CUDA cores (per SM / total) | 64 / 5120
64-bit FP CUDA cores (per SM / total) | 32 / 2560
Register files (per SM / core) | 256 / 4 KB
L1 cache / shared memory (per SM / core) | 128 / 2 KB
L1 cache hit latency | 28 cycles
Base clock frequency | 1450 MHz
Single-precision peak performance | 14.8 TFLOPS
III-C Concurrent Simulation of Gates in CSM
In CSM simulation, voltage waveform calculation is performed over a series of short time intervals ($\Delta t$) in an iterative process. Given that the voltage values and input slews of all gates are known at one interval ($t$), the change in voltage can be calculated for the next interval ($t + \Delta t$). In other words, the change in the output voltage of a driver gate during one time interval is the input voltage change of its load gate during the next time interval. Hence, the simulations of the gates within a single interval are not dependent on each other and can potentially be done in parallel. In contrast, voltage-based simulation calculates the delay and output slew of a gate from its input slew, i.e., the output slew of the driver gate, and its capacitive load. This dependency of the delay calculation of load gates on the simulation of their driver gates prevents voltage-based methods from simulating gates from different levels of the circuit in parallel.
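The interval-by-interval update described above can be sketched as a simplified forward-Euler-style loop (the component function `i_out` and all names are hypothetical stand-ins, not the actual CSM equations):

```python
# Sketch of the per-interval CSM update: within one interval, every gate's
# output change depends only on voltages already known from the previous
# interval, so the inner loop over gates is embarrassingly parallel.
def simulate_interval(gates, voltages, dt):
    """gates: dict name -> (driver_name, i_out, c_out), where i_out is a
    hypothetical voltage-dependent output current source and c_out the
    output capacitance."""
    dv = {}
    for name, (driver, i_out, c_out) in gates.items():
        v_in = voltages[driver]   # driver output from the PREVIOUS interval
        v_out = voltages[name]
        dv[name] = i_out(v_in, v_out) / c_out * dt   # dV = (I / C) * dt
    # Apply all updates only after every gate is processed, so the order
    # of gates within the interval cannot matter.
    for name, delta in dv.items():
        voltages[name] += delta
    return voltages

# Two cascaded stages driven by a fixed input node
gates = {"g1": ("in", lambda vi, vo: vi - vo, 1.0),
         "g2": ("g1", lambda vi, vo: vi - vo, 1.0)}
voltages = {"in": 0.7, "g1": 0.0, "g2": 0.0}
simulate_interval(gates, voltages, dt=0.01)
```

Because all reads happen before any write, the per-gate body can be distributed across GPU cores without changing the result.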
III-D NN-PARS Circuit Simulation Flow
To better illustrate the steps of NN-PARS, we use the C7552 netlist from the ISCAS85 benchmarks [13] as an example circuit and the GPU platform introduced in Table II as an example processor. To further simplify the description, we limit the standard cells to INV, NAND2, and NOR2. First, NN-PARS identifies the count of gates of each standard cell in the circuit netlist; for example, there are 2625 NAND2, 799 INV, and 401 NOR2 gates in C7552. Based on the relative ratios of these counts, we dedicate GPU cores to model the cells as shown in Fig. 2. Once all the computational cores of the GPU are dedicated, the circuit simulation can start. A simple event-driven simulation scheduler schedules the steps of the simulation: according to the number of models on the GPU for each cell type, random gates in the circuit are selected for simulation. Because gate simulations are independent within each interval, CSM simulation can be performed in parallel for many gates; thus, at each time interval, NN-PARS selects a subset of gates to run on the GPU.
In our example, at each time interval, 52 NAND2, 20 INV, and 8 NOR2 gates of the circuit can be simulated in parallel (cf. Fig. 2). All other gates are simulated in the same fashion for this time interval, which means that for C7552 it takes $\lceil 2625/52 \rceil = 51$ GPU iterations to simulate the circuit for one time interval.
Although the CSM simulation of a logic cell at a certain time interval does not depend on that of other logic cells in the same interval, a random selection of logic cells as the subset to be simulated on the GPU may not be optimal. In fairly large circuits, a large number of cells require no simulation in a given time interval because none of their node voltages changed in the previous one. The event-driven simulation scheduler of NN-PARS therefore skips these unnecessary gate simulations: logic cells whose voltage values changed beyond a threshold are assigned to the active set and simulated in the next time frame, while logic cells with no changes in any of their node voltages are removed from the active set.
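A minimal sketch of this active-set bookkeeping (the threshold value, data layout, and names are assumptions):

```python
def update_active_set(node_voltages, prev_voltages, threshold=1e-3):
    """Keep only cells with at least one node voltage that moved beyond
    `threshold` in the last interval; settled cells are dropped."""
    next_active = set()
    for cell, nodes in node_voltages.items():
        changed = any(abs(nodes[k] - prev_voltages[cell][k]) > threshold
                      for k in nodes)
        if changed:
            next_active.add(cell)
    return next_active

# Cell "a" switched, cell "b" stayed settled
active = update_active_set({"a": {"out": 0.5}, "b": {"out": 0.0}},
                           {"a": {"out": 0.0}, "b": {"out": 0.0}})
```

Only cells in the returned set are dispatched to the GPU in the next time frame, so quiescent regions of a large circuit cost nothing.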
IV Experiments and Simulation Results
We implemented the simulator and the flow of our NN-PARS framework in Python. Our implementation is technology-independent and can characterize various logic cells and create NN models with flexible, configurable setups. More importantly, the simulator can exploit the GPU to parallelize the simulation of a given combinational circuit netlist. NN implementation and training are based on the scikit-learn [22] package.
The CPU and GPU devices used as platforms for CSM-LUT and NN-PARS are introduced in Table I and Table II, respectively. The hardware platforms are comparable in cost (about 8,000 USD) and production year (2017) in order to allow a fair comparison.
IV-A Selected Technologies
For a better evaluation of NN-PARS and its technology independence, we performed our experiments on both MOSFET (16nm) and FinFET (7nm) devices from the Predictive Technology Model (PTM) [23] packages. Two device types, namely low-standby-power (LP) and high-performance (HP), are used in our experiments [1].
IV-B Training for Logic Cells
Characterization generates a total of 500 data points per gate. The data was randomly split into training (90%) and test (10%) datasets. The exponential range of the characterized values is not optimal for training nonlinear regression models; therefore, we trained our models on the logarithms of the values. Normalization of the data in regression problems helps the solvers with faster convergence and better numerical stability; this process is implemented inside our solver [22]. To select the optimal size of the hidden layer for each model, we repeated the training process over a range of hidden-layer sizes. Each trained model was tested by applying a set of noisy input signals, and the model with the smallest hidden layer that met the $\Gamma$ threshold was chosen as the NN-PARS architecture for the logic cell. The complete results of the architecture choices for the INV, NAND2, and NOR2 NN-PARS models are given in Table III.
TABLE III
Technology | INV | NAND2 | NOR2
MOSFET-HP 16nm | 9 | 18 | 18
MOSFET-LP 16nm | 8 | 17 | 17
FinFET-HP 7nm | 10 | 20 | 20
FinFET-LP 7nm | 10 | 21 | 21
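The model-selection loop of Sec. IV-B can be sketched with scikit-learn. The data below is synthetic, and the test-set R² used as the pass criterion is only a stand-in for the waveform-level Γ check described in the text:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for characterization data: 500 (V_in, V_out) samples
# with a target spanning several orders of magnitude, trained in log domain.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 0.7, size=(500, 2))
y = np.log(np.exp(6.0 * X[:, 0]) * (X[:, 1] + 0.1))   # log-scaled target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.9,
                                          random_state=0)

scores = {}
for h in (5, 10, 20):                       # sweep candidate hidden-layer sizes
    model = MLPRegressor(hidden_layer_sizes=(h,), max_iter=1000,
                         random_state=0).fit(X_tr, y_tr)
    scores[h] = model.score(X_te, y_te)     # R^2 on the held-out 10%

# Smallest size clearing the threshold (stand-in for the Γ check),
# falling back to the best-scoring size if none passes
passing = [h for h in sorted(scores) if scores[h] > 0.95]
best_h = passing[0] if passing else max(scores, key=scores.get)
```

In the actual flow the pass criterion is Γ on noisy-input waveforms rather than held-out regression error, but the smallest-passing-size selection is the same.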
IV-C Circuit Simulation
In this work, we evaluated our NN-PARS framework by simulating a full adder (FA) circuit whose schematic is shown in Fig. 3. In addition, we analyzed the performance improvement achieved by NN-PARS over CSM-LUT for real combinational circuits from the ISCAS85 benchmarks [13].
The CSM-LUT method is computed on the CPU platform, as it does not benefit from GPU parallelization. The computation resources and latencies required for the GPU implementation of NN-PARS are calculated using the equations in Section III-B. Comparing the output waveforms of SPICE, CSM-LUT, and NN-PARS in Fig. 4 confirms the simulation accuracy of NN-PARS. We also measured $\Gamma$ by comparing the output waveforms of HSPICE, as the baseline, with those of NN-PARS simulations. The results in Table IV show that $\Gamma$ is limited to 2%.
TABLE IV
Technology | MOSFET 16nm | FinFET 7nm
Device | HP | LP | HP | LP
$\Gamma$ | 1.64% | 1.27% | 1.81% | 1.77%
Improvement | 30.4× (same for all devices)
As seen in Table IV, the improvement achieved by NN-PARS is the same across devices, since all the gates of the FA can be modeled on our GPU in parallel. The limited number of gates in the FA circuit does not reveal the full performance gain of NN-PARS; therefore, bigger circuits with thousands of gates were analyzed, with the results reported in Table V.
TABLE V
Circuit | # gates | Improvement (MOSFET) | Improvement (FinFET)
c880 | 383 | 92× | 81×
c1355 | 546 | 120× | 124×
c7552 | 3825 | 134× | 134×
V Conclusions
Our goal in this work was to resolve the accuracy and latency issues of existing simulation methodologies that depend heavily on memory queries. Our NN-PARS framework replaces long memory queries with efficient, parallelizable NN-based computations and employs an optimized event-driven scheduling engine that concurrently runs the simulation events of logic cells in the circuit.
The simulation latency of NN-PARS was evaluated in multiple MOSFET and FinFET technologies based on predictive technology models. The results confirm that NN-PARS improves the simulation speed of large circuits by up to 134× compared to a state-of-the-art current-based CSM baseline. Furthermore, the high accuracy of NN-PARS in terms of waveform similarity was evaluated w.r.t. HSPICE. We expect that the application of NN-PARS to the analysis and optimization of advanced VLSI circuits such as systems-on-chip (SoCs) will significantly improve the quality of results.
Acknowledgement
This research was sponsored in part by a grant from the Software and Hardware Foundations (SHF) program of the National Science Foundation (NSF).
References
[1] (2015-03) Optimal choice of FinFET devices for energy minimization in deeply-scaled technologies. In International Symposium on Quality Electronic Design (ISQED), pp. 234–238.
[2] (2019) CSM-NN: current source model based logic circuit simulation - a neural network approach. In International Conference on Computer Design (ICCD).
[3] (2008) A current source model for CMOS logic cells considering multiple input switching and stack effect. In Design, Automation and Test in Europe (DATE), pp. 568–573.
[4] (2008) Current source based standard cell model for accurate signal integrity and timing analysis. In Design, Automation and Test in Europe (DATE), pp. 574–579.
[5] (2000-06) A survey of design techniques for system-level dynamic power management. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 8 (3), pp. 299–316.
[6] (Website).
[7] (2003) Blade and Razor: cell and interconnect delay analysis using current-based models. In Design Automation Conference (DAC), pp. 386–389.
[8] (2014) Semi-analytical current source modeling of FinFET devices operating in near/sub-threshold regime with independent gate control and considering process variation. In Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 167–172.
[9] (2007-01) A current-based method for short circuit power calculation under noisy input waveforms. In Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 774–779.
[10] (2006) Statistical logic cell delay analysis using a current-based model. In Design Automation Conference (DAC), pp. 253–256.
[11] (2008) Statistical waveform and current source based standard cell models for accurate timing analysis. In Design Automation Conference (DAC), pp. 227–230.
[12] (2005) Current based delay models: a must for nanometer timing. Cadence Live Conference (CDNLive).
[13] (1999-07) Unveiling the ISCAS-85 benchmarks: a case study in reverse engineering. IEEE Design & Test 16 (3), pp. 72–80.
[14] (Website).
[15] (2018-10) Using machine learning to predict path-based slack from graph-based timing analysis. In International Conference on Computer Design (ICCD), pp. 603–612.
[16] (2004-11) A robust cell-level crosstalk delay change analysis. In International Conference on Computer-Aided Design (ICCAD), pp. 147–154.
[17] (2012) Current source modeling for power and timing analysis at different supply voltages. In Design, Automation and Test in Europe (DATE), pp. 923–928.
[18] (1965-09) Cramming more components onto integrated circuits. Electronics 38 (8), pp. 114.
[19] (2011-01) Accurate timing and noise analysis of combinational and sequential logic cells using current source modeling. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 19 (1), pp. 92–103.
[20] (2008-03) Scalable parallel programming with CUDA. ACM Queue 6 (2), pp. 40–53.
[21] (2006-08) Thermal modeling, analysis, and management in VLSI circuits: principles and methods. Proceedings of the IEEE 94 (8), pp. 1487–1501.
[22] (2011-11) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
[23] Predictive Technology Model from Arizona State University. http://ptm.asu.edu/ Accessed: 2019-05-20.
[24] (2009) Large-scale deep unsupervised learning using graphics processors. In International Conference on Machine Learning (ICML), pp. 873–880.
[25] (2016) Practical statistical static timing analysis with current source models. In Design Automation Conference (DAC), pp. 113:1–113:6.
[26] (Website).