As CMOS transistor technologies test the limits of Moore’s Law [18], the design flow of VLSI circuits demands increasingly complex analysis, transformation, and verification iterations to validate functional correctness and design quality in terms of performance, power, and signal integrity. The design flow steps also need to validate various process-voltage-temperature (PVT) corners and operating modes, such as low-power (LP) and high-performance (HP), that involve increasingly nonlinear effects. Fast and accurate simulation is therefore crucial to lower the number of design iterations, speed up convergence, and consequently shorten the design turnaround time.
SPICE simulators are the de facto standard tools for accurate analysis and sign-off; however, they are very slow for billion-transistor circuits [21, 5]. Therefore, higher levels of circuit abstraction based on approximation have been used to speed up simulation steps. Abstraction models are generally based on look-up tables (LUTs), closed-form formulations, factors, or their combinations. The traditional voltage based models, namely the nonlinear delay model (NLDM), nonlinear power model (NLPM), effective current source model (ECSM), and composite current source model (CCSM), utilize LUTs to store delay, noise, or power as nonlinear functions of physical, structural, and environmental parameters, and depend on voltage modeling more than current modeling. Voltage based models are intuitively better choices than simple closed-form formulations of nonlinear functions; however, they tend to be increasingly inaccurate in capturing signal integrity and short channel effects as technologies scale down. Alternatively, current based models such as Current Source Models (CSMs) [7, 12, 16, 11, 3, 17, 19, 10, 9] use voltage-dependent components to model logic cells. In addition to higher accuracy, another advantage of current based models over voltage based models is the ability to simulate realistic output waveforms for arbitrary input signals. The major shortcoming of LUT-based approaches is the high latency of memory queries.
In this work, we present NN-PARS, a neural network (NN) based PARallelized circuit Simulation framework that replaces current based CSM LUT queries with NN computations and exploits the architecture of graphics processing units (GPUs) for concurrent simulation. With the proposed method, various gates in the circuit can be simulated in parallel. An embedded event-driven scheduling engine selects gates for computation based on the characteristics of the underlying GPU platform and the input netlist, to minimize the total circuit simulation time. The major novelties of our NN-PARS framework are as follows:
- NN-PARS accelerates the CSM simulation of complex integrated circuits using optimized NN structures that take the computational capabilities of the underlying GPU into account.
- Considering the iterative nature of CSM-based output waveform calculation, NN-PARS embeds a simple event-driven scheduling methodology that further maximizes simulation concurrency by performing the calculation steps for many logic cells in the circuit in parallel, hence disentangling logic cell simulation from the order of cells in the circuit topology.
The remainder of our paper is organized as follows. Section II presents a brief background on CSM simulation. Sections III and IV elaborate our NN-PARS framework and experimental results, respectively. Section V concludes the paper.
Although our NN-PARS framework can be utilized to enhance any LUT based circuit simulation technique, we choose CSM as the method of comparison. The CSM technique models each logic cell with voltage-dependent current sources, as well as input, Miller, and output capacitors [7, 9, 10]. In the case of a simple INV gate, CSM components depend only on the input and output voltages. However, for logic cells with multiple inputs, these components depend on a larger number of variables, i.e., the voltages of the inputs and internal nodes [3]. Consequently, the size of CSM LUTs grows exponentially with the number of variables.
Despite recent advances in the computational capabilities of CPUs, such as process parallelization through many-core processors with dedicated cache memory, CPUs still lack high efficiency when processing tasks with a large number of parallel computational sub-tasks. GPUs are specifically designed to outperform CPUs on such tasks, owing to their capability for a much higher degree of parallel computation. Moreover, these devices are also known as an efficient hardware platform for training and inference of NNs [24]. This is partly because of two levels of parallelized processing units in GPUs: several multiprocessors (MPs), and several stream processors (SPs, also referred to as cores) that run the actual computation for each multiprocessor. Each core is equipped with arithmetic units, register files, and a designated cache. The superiority of GPUs can be observed by comparing the evolution of GPUs and CPUs in terms of the number of floating-point operations per second (FLOPS), as shown in Fig. 1.
As suggested in [2], high-dimensional CSM-LUTs with large sizes can only fit in the DRAM of CPUs or GPUs, while low-dimensional V-LUT tables can easily fit into L1 caches. The major shortcoming of data retrieval from DRAM is its high latency. As an example, the specification of a 24-core Intel processor with Broadwell microarchitecture [14] given in Table I shows that memory access in DRAM is about 2 orders of magnitude slower than in the L1 cache. Another disadvantage of memory queries is that, in contrast to the dedicated caches for each core in multi-core processors and GPUs, the main memory is shared. Moreover, the number of parallel reads from DRAM to processors, referred to as memory channels, is limited and is much lower than the number of cores. As an example, the 24-core processor in Table I has only 4 memory channels.
Dependency on memory drastically increases the total circuit simulation time and, in particular, prevents accurate approaches such as CSM from being practical. To mitigate this shortcoming, semi-analytical methods [8] suggest combining nonlinear analytical models and low-dimensional CSM lookup tables to achieve high modeling accuracy together with low time and space complexity. On the other hand, [2] (referred to as the CSM-NN method throughout the paper) proposed completely removing the long memory queries by approximating CSM component values using simple NNs. While this method improved the simulation time of simple gates, it did not address how it can be scaled up to the level of circuit simulation, especially using the parallel computation capabilities of GPUs.
In the following two sections, we present how our NN-PARS parallelizes simulation of logic cells in a circuit, while avoiding high latency memory retrievals needed in LUT based simulators, and further speeds up the simulation process by scheduling the concurrent tasks according to the GPU processing capabilities.
|Intel Broadwell Microarchitecture|
|Memory||Size (KByte)||Latency (Clock Cycle)|
|L1 Data Cache||32 per core||4-5|
|Intel Xeon Processor E7-8894 v4|
|Base Frequency||2.40 GHz|
|Theoretical Peak Computation||920 GFLOPs|
III NN-PARS Framework
The characterization in this method is the same as conventional CSM-LUT. We followed the same training flow, i.e. choice of network architecture, optimization algorithm, preprocessing, and evaluation as in CSM-NN. The following section explains modeling the CSM of standard cells with NN, required resources for parallel computation of NNs and latency on GPU platform, and finally the flow of circuit simulation, including the event-driven scheduling of NN-PARS.
III-A NN Architecture
We followed the same approach as in CSM-NN to substitute memory retrieval with NN computation for simple logic cells. Every logic cell in the library is modeled by an NN with a single hidden layer.
It is important to note that while the accuracy of NNs in predicting CSM component values matters, accuracy should ultimately be reported based on the quality of the output waveforms, and not just a single measurement such as logic cell delay. This is consistent with the purpose of CSM, which is to regenerate circuit voltage waveforms. Therefore, similar to [2] and [25], we use the expected waveform similarity as a figure of merit for simulation accuracy. In this work, it is defined as the mean of the absolute difference between the output waveforms of precise HSPICE and NN-PARS simulations, relative to the supply voltage of the technology, as shown in Eq. 1.
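The verbal definition of the waveform similarity metric can be illustrated with a minimal sketch. It assumes both waveforms are sampled at the same time points; the function name `waveform_similarity_error` and the sample values are hypothetical and only serve as an illustration, not as the implementation used in this work.

```python
from typing import Sequence


def waveform_similarity_error(v_ref: Sequence[float],
                              v_sim: Sequence[float],
                              vdd: float) -> float:
    """Mean absolute difference between two waveforms, relative to vdd.

    v_ref: reference samples (e.g. from HSPICE), in volts.
    v_sim: samples from the simulator under test, at the same time points.
    vdd:   supply voltage of the technology, in volts.
    """
    if len(v_ref) != len(v_sim) or len(v_ref) == 0:
        raise ValueError("waveforms must be non-empty and equally sampled")
    if vdd <= 0:
        raise ValueError("vdd must be positive")
    total = sum(abs(r - s) for r, s in zip(v_ref, v_sim))
    return total / (len(v_ref) * vdd)


# Hypothetical samples of a rising output waveform (volts).
ref = [0.0, 0.2, 0.5, 0.8]
sim = [0.0, 0.25, 0.5, 0.75]
err = waveform_similarity_error(ref, sim, vdd=1.0)
print(f"relative waveform error: {err:.3f}")
```

A smaller value indicates closer agreement with the reference waveform over the whole transition, rather than at a single measurement point such as the 50% delay crossing.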
In addition to serving as a measure for reporting the accuracy of the results, the expected waveform similarity was used to select the architecture of the NNs. The smallest number of hidden neurons for which the model passes a pre-defined accuracy threshold in terms of waveform similarity, when stimulated with a set of noisy inputs, is selected for the NN implementation of the logic cell.
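The selection rule described above amounts to a linear search over candidate hidden-layer sizes. The sketch below assumes a caller-supplied function that trains a model of a given size and returns its waveform error on noisy inputs; `smallest_passing_size` and `toy_error` are hypothetical names, and the toy error curve merely stands in for real training and evaluation.

```python
from typing import Callable, Iterable, Optional


def smallest_passing_size(candidate_sizes: Iterable[int],
                          error_for_size: Callable[[int], float],
                          threshold: float) -> Optional[int]:
    """Return the smallest hidden-layer size whose error meets the threshold.

    error_for_size is assumed to train a model with the given number of
    hidden neurons and return its waveform error on noisy test inputs.
    Returns None if no candidate passes.
    """
    for n_hidden in sorted(candidate_sizes):
        if error_for_size(n_hidden) <= threshold:
            return n_hidden
    return None


def toy_error(n_hidden: int) -> float:
    # Stand-in for "train, then evaluate": error shrinks as the layer grows.
    return 1.0 / (n_hidden * n_hidden)


chosen = smallest_passing_size(range(2, 21), toy_error, threshold=0.02)
print("selected hidden-layer size:", chosen)
```

Searching from the smallest candidate upward means the first size that passes is also the cheapest one to evaluate on the GPU.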
III-B Computational resources and latency analysis
The main advantage of the proposed method is its high parallelizability and, consequently, the very low latency of circuit simulation on GPU platforms. Therefore, a detailed analysis of the latency and the number of computation resources required by the CSM-NN is necessary. The main computational operations of a single-hidden-layer NN are multiplication (MUL) and addition (ADD). GPU cores are designed to perform one MUL and one ADD in a single cycle [20]. For a network with a given number of inputs and a given hidden-layer size, the number of multiplications in the first layer is the product of the two. Importantly, there are no dependencies among the MUL operations within one layer, so they can all be computed in parallel within a single cycle, using one core per multiplication. These initial cores are occupied in this cycle, but they can be reused in subsequent cycles. To calculate the output of each hidden neuron, its weighted input values must be accumulated. This can be efficiently parallelized with a tree structure, in a number of cycles that is logarithmic in the number of accumulated values. The number of cores required in the first cycle of the accumulation is less than the number of initial cores, thus no further core allocation is required and the computation can be done on the initial ones. Following the same approach for the output layer, we can conclude that a single-hidden-layer NN can be computed with a fixed number of cores within the latency given in Eq. 2.
By implementing a trained NN with fixed parameters on a GPU, the weights of each operation can be stored in register files, therefore, there is no need to retrieve data from memory.
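The counting argument above can be sketched as a small cost estimate. This is an illustrative model under stated assumptions, not the latency expression of Eq. 2: it assumes one cycle for all multiplications of a layer, a binary adder tree per neuron, and it ignores bias terms and the cost of the activation function. The parameter `n_out` (number of network outputs) and both function names are hypothetical.

```python
def adder_tree_depth(n_values: int) -> int:
    """Cycles needed to sum n_values with a binary adder tree."""
    if n_values < 1:
        raise ValueError("need at least one value")
    # ceil(log2(n_values)) computed with integers only.
    return (n_values - 1).bit_length()


def single_hidden_layer_cost(n_in: int, n_hidden: int, n_out: int = 1) -> dict:
    """Rough core and cycle estimate for one single-hidden-layer NN.

    Assumptions (not from the source): biases and activation latency are
    ignored, and cores from the first layer are reused by the second.
    """
    mul_hidden = n_in * n_hidden      # independent MULs in the hidden layer
    mul_out = n_hidden * n_out        # independent MULs in the output layer
    cycles = (1 + adder_tree_depth(n_in)          # hidden layer: MUL + ADD tree
              + 1 + adder_tree_depth(n_hidden))   # output layer: MUL + ADD tree
    return {"cores": max(mul_hidden, mul_out), "cycles": cycles}


example = single_hidden_layer_cost(n_in=2, n_hidden=8)
print(example)
```

For a two-input cell with eight hidden neurons, this model gives 16 cores and 6 cycles, which illustrates why the latency grows only logarithmically with the layer sizes.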
|Streaming Multiprocessors (SM)||80|
|32bit FP CUDA core (per SM/total)||64/5120|
|64bit FP CUDA core (per SM/total)||32/2560|
|Register files per SM||256/4 KB|
|L1 cache / shared memory (per SM/core)||128/2 KB|
|L1 cache hit latency||28|
|Base clock frequency||1450 MHz|
|Single precision FLOPS||14.8 TFLOPS|
III-C Concurrent simulation of gates in CSM
In CSM simulation, voltage waveform calculation is performed iteratively over a series of short time intervals. Given that the voltage values and input slews are known for all gates in one interval, the change in voltage can be calculated for the next interval. In other words, the change in the output voltage of a driver gate in one time interval is the input voltage change of its load gate in the next time interval. Following this approach, the simulations of gates within a single interval do not depend on each other and can potentially be done in parallel. In contrast, voltage based simulation calculates the delay and output slew of a single gate based on its input slew, i.e., the output slew of the driver gate, and the capacitive load. The dependency of the delay calculation of load gates on the simulation of their driver gates prevents voltage based methods from simulating gates from different levels of the circuit in parallel.
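The independence of gates within one interval can be illustrated with a small double-buffered update. The sketch assumes single-input gates and a caller-supplied per-gate function that returns the output voltage change for one interval; `step_all_gates` and the toy inverter model are hypothetical stand-ins for the actual CSM evaluation.

```python
from typing import Callable, Dict, Tuple

# (v_in, v_out, dv_in) -> dv_out for one time interval (hypothetical signature).
GateModel = Callable[[float, float, float], float]


def step_all_gates(drivers: Dict[str, str],
                   models: Dict[str, GateModel],
                   v: Dict[str, float],
                   dv: Dict[str, float]) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Advance every gate by one interval using only previous-interval values.

    drivers maps each gate to the node driving its input. Because every
    update reads the old dictionaries and writes new ones, the loop body
    could run for all gates concurrently, in any order.
    """
    new_v, new_dv = dict(v), dict(dv)
    for gate, driver in drivers.items():
        change = models[gate](v[driver], v[gate], dv[driver])
        new_dv[gate] = change
        new_v[gate] = v[gate] + change
    return new_v, new_dv


def toy_inverter(v_in: float, v_out: float, dv_in: float) -> float:
    # Toy first-order model (supply of 1.0 V assumed): move halfway to 1 - v_in.
    return 0.5 * ((1.0 - v_in) - v_out)


# Chain: primary input "in" -> g1 -> g2, right after "in" switched to 1.0 V.
drivers = {"g1": "in", "g2": "g1"}
models = {"g1": toy_inverter, "g2": toy_inverter}
v = {"in": 1.0, "g1": 1.0, "g2": 0.0}
dv = {"in": 0.0, "g1": 0.0, "g2": 0.0}
new_v, new_dv = step_all_gates(drivers, models, v, dv)
print(new_v)
```

In this example g2 does not move in the first interval because it only sees the previous value of g1, which is exactly what makes the two updates independent of each other.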
III-D NN-PARS circuit simulation flow
To better illustrate the steps of NN-PARS, we use the C7552 netlist from the ISCAS85 [13] benchmarks as an example circuit and the GPU platform introduced in Table II as an example processor. To further simplify the description, we limit the standard cells to INV, NAND2, and NOR2. First, NN-PARS counts the gates of each standard cell type in the circuit netlist. For example, there are 2625 NAND2, 799 INV, and 401 NOR2 gates in C7552. Based on the relative ratio of these counts, we dedicate GPU cores to model the cells as shown in Fig. 2. Once all the computational cores of the GPU are dedicated, the circuit simulation can start. A simple event-driven simulation scheduler schedules the simulation steps. According to the number of models on the GPU for each cell, random gates in the circuit are selected for simulation. Because gates are independent of each other within an interval, CSM simulation can be performed in parallel for many gates. Thus, at each time interval, NN-PARS selects a subset of gates to simulate on the GPU.
In our example, at each time frame, 52 NAND2, 20 INV, and 8 NOR2 gates of the circuit can be simulated in parallel (cf. Fig. 2). All other gates are then simulated for this time interval in similar subsets. This means that for C7552, multiple GPU iterations are needed to simulate the circuit for one time interval.
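As a worked illustration of the batching arithmetic, the sketch below estimates the number of GPU iterations per time interval from the gate counts and the per-cell parallelism quoted above. It assumes that every gate is simulated once per interval and that the three cell types are processed concurrently, so the slowest type determines the count; the function name is hypothetical and the result follows from these assumptions only.

```python
def iterations_per_interval(gate_counts: dict, models_on_gpu: dict) -> int:
    """Number of GPU passes needed to cover all gates once.

    Assumes each cell type is batched independently and all types run
    concurrently, so the cell type needing the most passes dominates.
    """
    passes = []
    for cell, count in gate_counts.items():
        parallel = models_on_gpu[cell]
        passes.append(-(-count // parallel))  # integer ceiling division
    return max(passes)


# Gate counts of C7552 and per-cell parallelism from the example in the text.
gate_counts = {"NAND2": 2625, "INV": 799, "NOR2": 401}
models_on_gpu = {"NAND2": 52, "INV": 20, "NOR2": 8}
n_iter = iterations_per_interval(gate_counts, models_on_gpu)
print("GPU iterations per time interval:", n_iter)  # 51 under these assumptions
```

Under these assumptions the NAND2 and NOR2 gates each need 51 passes and the INV gates need 40, so the interval completes after 51 passes.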
Although the CSM simulation of a logic cell at a certain time interval does not depend on that of other logic cells in the same interval, random selection of the logic cells to be simulated on the GPU may not be optimal. This is because, in fairly large circuits, a large number of cells do not require any simulation in a given time interval, as the voltage levels of their nodes did not change in the previous one. Therefore, the event-driven simulation scheduler of NN-PARS skips the unnecessary gate simulations. The NN-PARS scheduler assigns the logic cells whose voltage values changed beyond a threshold to the active set, so they will be simulated in the next time frame. On the other hand, the logic cells with no changes in any of their voltage nodes are removed from the active set.
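The active-set bookkeeping can be sketched as follows. This is one possible reading of the rule above, not the scheduler implemented in NN-PARS: a gate stays active if its output moved by more than the threshold, and its fanout gates become active because one of their input nodes changed. The names `update_active_set` and `fanout` are hypothetical.

```python
from typing import Dict, Iterable, Set


def update_active_set(active: Set[str],
                      delta_v: Dict[str, float],
                      fanout: Dict[str, Iterable[str]],
                      threshold: float) -> Set[str]:
    """Return the gates to simulate in the next time frame.

    active:    gates simulated in the current time frame.
    delta_v:   output voltage change of each simulated gate (volts).
    fanout:    gates driven by each gate (assumed netlist structure).
    threshold: minimum voltage change that counts as an event.
    """
    next_active: Set[str] = set()
    for gate in active:
        if abs(delta_v[gate]) > threshold:
            next_active.add(gate)                      # output still moving
            next_active.update(fanout.get(gate, ()))   # loads see an input change
    return next_active


# g1 is switching, g4 has settled; g2 and g5 are their respective loads.
fanout = {"g1": ["g2"], "g4": ["g5"]}
delta_v = {"g1": 0.1, "g4": 0.0001}
next_active = update_active_set({"g1", "g4"}, delta_v, fanout, threshold=0.01)
print(sorted(next_active))
```

Here g4 drops out of the active set and its load g5 is never scheduled, while g1 and its load g2 are carried into the next time frame.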
IV Experiments and Simulation Results
We implemented the simulator and the flow of our NN-PARS framework in Python. Our implementation is technology independent and can characterize various logic cells and create NN models for them with flexible, configurable setups. More importantly, the simulator can exploit the GPU to parallelize the simulation of a given combinational circuit netlist. NN implementation and training are based on the Scikit-learn [22] package.
CPU and GPU devices that are used as platforms for CSM and NN-PARS are introduced in Table I and Table II respectively. Hardware platforms are comparable to each other in terms of cost (about 8,000 USD) and the production year (2017) in order to have a fair comparison.
IV-A Selected Technologies
To better evaluate NN-PARS and its technology independence, we performed our experiments on both MOSFET (16nm) and FinFET (7nm) devices from the Predictive Technology Model (PTM) [23] packages. Two device types, namely low-standby power (LP) and high performance (HP), are used in our experiments [1].
IV-B Training for logic cells
Characterization generates a total of 500 data points per gate. The data was randomly split into training (90%) and test (10%) datasets. The values span an exponential range, which is not optimal for training nonlinear regression models. Therefore, we trained our models on a transformed scale of these values. Normalizing the data in regression problems helps the solvers achieve faster convergence and better numerical stability. This process is implemented inside our solver [22]. To select the optimal size of the hidden layer for each model, we repeated the training process for a range of neuron counts. Each of the trained models was tested by applying a set of noisy input signals. The model with the smallest hidden layer that met the accuracy threshold is chosen as the NN-PARS architecture for the logic cell. The complete results for the choice of architecture for the INV, NAND2, and NOR2 NN-PARS models are given in Table III.
IV-C Circuit Simulation
In this work, we evaluated our NN-PARS framework by simulating a full adder (FA) circuit with the schematic shown in Fig. 3. In addition, we analyzed the performance improvement achieved by NN-PARS compared to CSM-LUT for real combinational circuits from the ISCAS85 benchmarks [13].
The CSM-LUT method is assumed to run on the CPU platform, as it does not benefit from GPU parallelization. The required computation resources and latencies for the GPU implementation of NN-PARS are calculated using the equations in Section III-B. Comparing the output waveforms of the SPICE, CSM-LUT, and NN-PARS methods in Fig. 4 confirms the simulation accuracy of NN-PARS. We also measured the waveform similarity by comparing the output waveforms of HSPICE, as the baseline, with those of NN-PARS simulations. The results in Table IV suggest that the waveform error is limited to 2%.
|Technology||MOSFET 16nm||FinFET 7nm|
As we can see in Table IV, the improvement achieved by NN-PARS is the same for different devices, as all the gates of the FA can be modeled on our GPU in parallel. The limited number of gates in the FA circuit does not reveal the full performance increase of NN-PARS. Therefore, larger circuits with thousands of gates were analyzed. The results are reported in Table V.
Our goal in this work was to resolve the accuracy and latency issues of existing simulation methodologies that heavily depend on memory queries. Our NN-PARS framework replaces long memory queries with efficient and parallelizable NN based computations and employs an optimized event-driven scheduling engine that concurrently runs the simulation events of logic cells in the circuits.
The simulation latency of NN-PARS was evaluated in multiple MOSFET and FinFET technologies based on predictive technology models. The results confirm that NN-PARS improves the simulation speed by up to compared to a state-of-the-art current based CSM baseline in large circuits. Furthermore the high accuracy of NN-PARS in terms of waveform similarity was evaluated w.r.t. HSPICE. We expect the application of NN-PARS in analysis and optimization of advanced VLSI circuits such as system-on-chips (SoCs) will significantly improve the quality of results.
This research was sponsored in part by a grant from the Software and Hardware Foundations (SHF) program of the National Science Foundation (NSF).
- [1] (2015-03) Optimal choice of FinFET devices for energy minimization in deeply-scaled technologies. In International Symposium on Quality Electronic Design (ISQED), pp. 234–238.
- [2] (2019) CSM-NN: current source model based logic circuit simulation - a neural network approach. In International Conference on Computer Design (ICCD).
- [3] (2008) A current source model for CMOS logic cells considering multiple input switching and stack effect. In Design, Automation and Test in Europe (DATE), pp. 568–573.
- [4] (2008) Current source based standard cell model for accurate signal integrity and timing analysis. In Design, Automation and Test in Europe (DATE), pp. 574–579.
- [5] (2000-06) A survey of design techniques for system-level dynamic power management. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 8 (3), pp. 299–316.
- [6] (Website)
- [7] (2003) Blade and razor: cell and interconnect delay analysis using current-based models. In Design Automation Conference (DAC), pp. 386–389.
- [8] (2014) Semi-analytical current source modeling of FinFET devices operating in near/sub-threshold regime with independent gate control and considering process variation. In Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 167–172.
- [9] (2007-01) A current-based method for short circuit power calculation under noisy input waveforms. In Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 774–779.
- [10] (2006) Statistical logic cell delay analysis using a current-based model. In Design Automation Conference (DAC), pp. 253–256.
- [11] (2008) Statistical waveform and current source based standard cell models for accurate timing analysis. In Design Automation Conference (DAC), pp. 227–230.
- [12] (2005) Current based delay models: a must for nanometer timing. Cadence Live Conference (CDNLive).
- [13] (1999-07) Unveiling the ISCAS-85 benchmarks: a case study in reverse engineering. IEEE Des. Test 16 (3), pp. 72–80.
- [14] (Website)
- [15] Using machine learning to predict path-based slack from graph-based timing analysis. In International Conference on Computer Design (ICCD), pp. 603–612.
- [16] (2004-11) A robust cell-level crosstalk delay change analysis. In International Conference on Computer-Aided Design (ICCAD), pp. 147–154.
- [17] (2012) Current source modeling for power and timing analysis at different supply voltages. In Design, Automation and Test in Europe (DATE), pp. 923–928.
- [18] (1965-09) Cramming more components onto integrated circuits. Electronics 38 (8), pp. 114.
- [19] (2011-01) Accurate timing and noise analysis of combinational and sequential logic cells using current source modeling. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 19 (1), pp. 92–103.
- [20] (2008-03) Scalable parallel programming with CUDA. Queue 6 (2), pp. 40–53.
- [21] (2006-08) Thermal modeling, analysis, and management in VLSI circuits: principles and methods. Proceedings of the IEEE 94 (8), pp. 1487–1501.
- [22] (2011-11) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
- [23] Predictive Technology Model from Arizona State University. http://ptm.asu.edu/ (accessed 2019-05-20).
- [24] Large-scale deep unsupervised learning using graphics processors. In International Conference on Machine Learning (ICML), pp. 873–880.
- [25] (2016) Practical statistical static timing analysis with current source models. In Design Automation Conference (DAC), pp. 113:1–113:6.
- [26] (Website)