CSM-NN: Current Source Model Based Logic Circuit Simulation – A Neural Network Approach

02/13/2020 ∙ by Mohammad Saeed Abrishami, et al. ∙ University of Southern California

The miniaturization of transistors down to 5nm and beyond, together with the increasing complexity of integrated circuits, significantly aggravates short channel effects and demands the analysis and optimization of more design corners and modes. Simulators need to model output variables related to circuit timing, power, noise, etc., which exhibit nonlinear behavior. The existing simulation and sign-off tools, based on a combination of closed-form expressions and lookup tables, are either inaccurate or slow when dealing with circuits with billions of transistors or more. In this work, we present CSM-NN, a scalable simulation framework with optimized neural network structures and processing algorithms. CSM-NN is aimed at optimizing the simulation time by accounting for the latency of the required memory queries and computation, given the underlying CPU and GPU parallel processing capabilities. Experimental results show that CSM-NN reduces the simulation time by up to 6× compared to a state-of-the-art current source model based simulator running on a CPU. This speedup improves by up to 15× when running on a GPU. CSM-NN also provides high accuracy levels, with less than 2% error, compared to HSPICE.

I Introduction

The down-scaling of transistor geometries has drastically increased the complexity of short channel effects and process-voltage-temperature (PVT) variations. Consequently, application-specific integrated circuit (ASIC) design flow techniques, such as multi-corner multi-mode (MCMM) and parametric on-chip variation (POCV), depend on increasingly complex analysis, transformation, and verification iterations to ensure that the ASIC system functions correctly and meets design demands such as those related to performance, power, and signal integrity. In these methods, the design is tested in different PVT corners and operating modes such as low-power (LP), high-performance (HP), etc. Accurate simulations, such as those for timing analysis during placement, clock network synthesis, and routing, are crucial: they lower the number of design iterations, speed up convergence, and play a major role in the turnaround time of complex designs such as systems-on-chip (SoCs) [19].

SPICE simulations are accurate but very slow for the timing, power, and thermal analysis and optimization of modern ASIC designs with billions or trillions of transistors [30, 4]. Therefore, higher levels of circuit abstraction based on approximation have been used to speed up simulation steps. Abstraction models are generally based on look-up tables (LUTs), closed-form formulations, factors, or their combinations. The traditional models, namely the nonlinear delay model (NLDM), nonlinear power model (NLPM), effective current source model (ECSM [6]), and composite current source model (CCSM [36]), utilize LUTs for storing delay, noise, or power as nonlinear functions of physical, structural, and environmental parameters, and rely on voltage modeling more than current modeling. We refer to the NLDM, ECSM, and CCSM models as voltage-LUT (V-LUT) models throughout this paper. The V-LUT models are intuitively better choices than simple closed-form formulations of nonlinear functions; however, they tend to become increasingly inaccurate in capturing signal integrity and short channel effects as technologies scale down [3].

Alternatively, current source models (CSMs) [7, 15, 21, 13, 2, 22, 28, 12, 11] use voltage-dependent current sources, and possibly voltage-dependent capacitances, to model logic cells. In addition to higher accuracy, another advantage of CSM over V-LUT models is the ability to simulate realistic output waveforms for arbitrary input signals.

The number of CSM component values that must be stored in memory grows exponentially with the number of inputs and internal nodes of the logic cell. For example, 6-dimensional LUTs are required to model a 3-input NAND gate (NAND3). While V-LUT models fit in smaller/faster memories such as the L1 cache, the relatively larger tables of CSM-LUT only fit into bigger/slower ones, such as DRAM. Therefore, a fundamental idea for shortening simulation time is to replace some of the memorization with computation, aiming for optimal space/time efficiency.

In [9], a Semi-Analytical CSM (SA-CSM) was presented which uses small-size LUTs combined with nonlinear analytical equations to simultaneously achieve high modeling accuracy and space/time efficiency. However, developing analytical equations for complex circuits is a tedious process.

In this work, we propose CSM-NN, a circuit simulation framework that fully replaces LUTs with neural networks (NNs). This eliminates the long memory access latency of LUTs and hence significantly shortens the simulation time, especially when CSM-NN computations can take advantage of the parallelism offered by graphical processing units (GPUs) [20].

The major contributions of our work are as follows:

  • We developed a framework for simulating the nonlinear behavior of complex integrated circuits using optimized NN structures as well as training and inference algorithms tailored to the underlying CPU or GPU computational capabilities.

  • Our framework is scalable and technology-independent, i.e., it can efficiently handle increasingly complex technologies with high PVT variations while maintaining the accuracy and improving the simulation latency.

The remainder of our paper is organized as follows. Section II presents a short background on CSM and process variation issues. Sections III and IV elaborate on our CSM-NN framework and experimental results, respectively. Section V concludes the paper.

II Background

In this section, we briefly touch upon the basics of CSM and latency issues related to CSM-LUT memory access.

Each logic gate can be modeled using a voltage-dependent current source as well as (Miller and output) capacitance components [7]. The values of these components can be characterized using HSPICE simulations. The CSM components of a logic cell can be stored in LUTs and utilized for noise, timing, and power analysis of VLSI circuits [2, 11, 12, 16]. Fig. 1 illustrates CSMs for single-input (INV) and multi-input (NAND2) logic cells.

(a) CSM for single input (INV) logic gate.
(b) CSM for two-input (NAND2) logic gate.
Fig. 1: CSM examples for one and two input logic cells [12, 2].

Given the large number of simulation runs needed during the ASIC design and verification flow, and the corresponding long memory retrieval times, it is desirable to keep the number of dimensions and the size of the LUTs very small. Table I lists the size of the CSM LUTs for a simple library of basic gates.

The size of the CSM-LUTs for simple logic cells (cf. Table I) is an exponential function of logic cell complexity. As an example, the NOR2 LUTs are 200 times larger than those for INV, and the XOR2 LUTs are 20,000 times larger than the NOR2 ones. Note that in practical research or industrial standard cell libraries, there may be many logic cells of various sizes and complexities, some of which could be more complex than the simple logic cells in Table I.


Gate   #Dim.   Table Size
INV      2     400 FPs = 1.6 KB
NAND2    4     80,000 FPs = 320 KB
NOR2     4     80,000 FPs = 320 KB
AOI      6     12,000,000 FPs = 48 MB
NAND3    6     12,000,000 FPs = 48 MB
NOR3     6     12,000,000 FPs = 48 MB
XOR2     8     1,600,000,000 FPs = 6.4 GB
TABLE I: CSM LUT sizes for simple logic cells: the number of LUT dimensions (i.e., the count of input, output, and internal voltage-node variables) and the total memory required to store the voltage-dependent capacitance and current-source components. All CSM components are represented as 32-bit (4-byte) floating-point (FP) values, and the characterization resolution is assumed to be 10 points per dimension.
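As a concrete illustration of the arithmetic behind Table I, the Python sketch below recomputes the table sizes from the number of LUT dimensions; the per-cell CSM component counts used here are assumptions chosen only to reproduce the listed sizes, not values taken from the original characterization.

```python
# Sketch of the LUT-size arithmetic behind Table I.
RESOLUTION = 10   # characterization points per dimension
FP_BYTES = 4      # each stored value is a 32-bit (4-byte) floating point

cells = {          # cell: (LUT dimensions, assumed number of CSM components)
    "INV": (2, 4), "NAND2": (4, 8), "NOR2": (4, 8),
    "NAND3": (6, 12), "XOR2": (8, 16),
}

def human(nbytes):
    # Decimal-unit formatter (KB/MB/GB), matching the units used in Table I.
    for unit in ("B", "KB", "MB", "GB"):
        if nbytes < 1000:
            return f"{nbytes:g} {unit}"
        nbytes /= 1000
    return f"{nbytes:g} TB"

for name, (dims, components) in cells.items():
    entries = components * RESOLUTION ** dims   # floating-point values to store
    print(f"{name}: {entries:g} FPs = {human(entries * FP_BYTES)}")
```

The RESOLUTION ** dims term is what makes the XOR2 tables roughly 20,000 times larger than the NOR2 ones.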

Comparing the memory hierarchy details of the Intel Broadwell micro-architecture [18] in Table II with the sizes in Table I confirms that the CSM LUTs cannot fit in any of the caches; they must be stored in main memory (DRAM) and written into the cache in parts. The latency of a DRAM access is about two orders of magnitude higher than that of the L1 cache. This difference explains the much longer simulation latencies of CSM-LUT compared to V-LUT.

In the following two sections, we present how our CSM-NN eliminates the need for LUTs, and instead utilizes NNs to compute the CSM data.


Intel Broadwell micro-architecture
Memory            Size (KByte)   Latency (Clock Cycles)
L1 Data Cache     32             4-5
L2 Cache          256            11-12
L3 Cache          20,480         38-42
DRAM              -              250
Intel Xeon Processor E5-2699 v4
Cores             22
Base Frequency    2.2 GHz
Single Precision  1548.8 GFLOPs
Double Precision  774.4 GFLOPs
TABLE II: Latency of information retrieval from different levels of the memory hierarchy, and hardware specifications of the Intel Xeon E5-2699 v4 server processor with the Intel Broadwell micro-architecture. The computational capability of the processor is given in giga floating-point operations per second (GFLOPs).

III CSM-NN Framework

This section describes our CSM-NN framework, including the NN architecture and the optimization algorithm used for training.

III-A NN Architecture and Computation

To avoid the large LUTs and long query latencies of CSM-LUT, our CSM-NN embeds parametric nonlinear models, trained as fully-connected NNs, to represent the nonlinear CSM functions.

We believe CSM-NN can benefit from the following ML developments: (1) the evolution of novel ML algorithms can be leveraged to improve the accuracy and efficiency of CSM-NN; and, more importantly, (2) the exponential increase in computational capabilities, especially with recent advances in GPU design [33], significantly helps improve the performance of CSM-NN.

CSM-NN substitutes memory retrieval with computation; it is therefore necessary to analyze and optimize the number, structure, and latency of the operations required by CSM-NN on different hardware platforms.

There are two computation steps in CSM-NN: (1) simulation, using a feed-forward pass that calculates the output of the model based on the trained parameters and the input values, and (2) a back-propagation step, which modifies the parameters of the model based on the error, i.e., the difference between the expected values in the training data and the estimated output of the model. Since the training process is done only once, the computation spent in back-propagation is not a concern. Our objective is to improve the circuit simulation time; we therefore focus mainly on the inference process, i.e., we optimize the computation steps of the feed-forward pass.

To choose the best NN architecture for our CSM-NN, we note that the number of hidden layers and the number of neurons in the hidden layer(s) determine the total number of parameters of the input-output function and hence the flexibility of the model. Increasing the number of hidden layers beyond one (i.e., making the model deeper) instead of increasing the number of neurons in a single layer (i.e., making the layer wider) can also be considered. In deep neural networks (DNNs), the sequence of nonlinear activation layers allows the input-output dependency to have a higher degree of nonlinearity and more flexibility. Although there are still open questions about why DNNs perform so well [27], the common belief is that multiple layers generalize better because they learn the intermediate features between the raw input data and the high-level output [27, 38]. As an example, thanks to the availability of data and computational resources in the past few years, the state-of-the-art solutions for challenging ML problems, such as image classification in computer vision, have been made possible by models with over hundreds of layers [37, 17]. On the other hand, shallow networks do not generalize as well, but are very powerful at memorization [27]. In addition, training deeper models requires more data and time, and deeper models also need more computational resources for the feed-forward pass.

In conclusion, despite the recent emergence of DNN solutions and applications, and their potential to improve the accuracy of circuit simulation for complex timing, noise, and power analysis, we do not believe a DNN is a feasible choice for the architecture of CSM-NN.

In the mathematical theory of artificial neural networks (ANNs), the universal approximation theorem [8] affirms that a single-hidden-layer NN can approximate continuous functions with a finite number of neurons, under mild assumptions on the nonlinear activation function and given sufficient training data. Consequently, if a shallow wide network is trained with every possible input value, it can eventually memorize the corresponding outputs. The following characteristics of our problem further suggest that shallow wide networks with one hidden layer are the more plausible solution:

  • There are no discontinuities in the CSM component values.

  • While in practical applications the training data is limited or expensive to generate, in CSM-NN it is straightforward to generate training data with HSPICE simulations during the characterization process.

  • The number of inputs to the neural network is relatively small, even for complex logic cells and when considering PVT parameters (Table I). This implies that we are modeling a low-dimensional function.

Based on these features, and considering the impact on the inference step during circuit simulation, CSM-NN adopts a simple NN architecture with a single hidden layer to model the nonlinear behavior of the CSM components. The architecture and its input-output function are shown in Fig. 2 and Eq. 1.

Fig. 2: Single-hidden-layer NN architecture used in CSM-NN. x, y, and w denote the inputs, the output, and the weights, respectively, and the hidden-layer neurons apply the nonlinear activation. The number of inputs (i.e., the dimension) and the width of the hidden layer are denoted d and h, respectively.
y = \sum_{j=1}^{h} w^{(2)}_{j}\,\varphi\left(\sum_{i=1}^{d} w^{(1)}_{ji} x_i + b^{(1)}_{j}\right) + b^{(2)}    (1)

where \varphi(\cdot) is the nonlinear activation function and b^{(1)}_{j}, b^{(2)} are bias terms.
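A minimal NumPy sketch of the feed-forward pass of Eq. 1 is shown below (single hidden layer; tanh is used here since it is the activation adopted in Section IV-D). The array names and shapes are illustrative, not the framework's actual data structures.

```python
import numpy as np

def csm_nn_forward(x, W1, b1, w2, b2):
    """Feed-forward pass of a single-hidden-layer CSM-NN model.

    x  : (d,)   voltages of the terminal/internal nodes
    W1 : (h, d) hidden-layer weights,  b1 : (h,) hidden-layer biases
    w2 : (h,)   output weights,        b2 : scalar output bias
    Returns the predicted CSM component value (e.g., a current or capacitance).
    """
    hidden = np.tanh(W1 @ x + b1)   # h independent neurons, fully parallelizable
    return w2 @ hidden + b2         # final summation, reducible as a tree

# Toy usage with random parameters (d = 2 node voltages, h = 16 neurons)
rng = np.random.default_rng(0)
d, h = 2, 16
y = csm_nn_forward(rng.random(d), rng.standard_normal((h, d)),
                   rng.standard_normal(h), rng.standard_normal(h), 0.0)
print(y)
```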

The number of MUL operations in the feed-forward pass equals the number of weight parameters of the model, as given in Eq. 2. It is important to note that there are no dependencies among the MUL operations within a layer; therefore, they can be completely parallelized.

N_{MUL} = h\,(d + 1)    (2)

Using the notation of Eq. 1, the hidden layer requires h summations, each over d+1 values (d weighted inputs plus a bias). These summations can also be completely parallelized. To calculate the output, a summation over h+1 values (h weighted hidden activations plus a bias) is required; this summation can be efficiently parallelized using a tree structure. The total number of ADD operations and the latency of the tree-structured summations are given in Eq. 3 and Eq. 4.

N_{ADD} = h\,(d + 1)    (3)
L_{ADD} = \lceil \log_2(d + 1) \rceil + \lceil \log_2(h + 1) \rceil    (4)
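The operation counts and reduction latency above can be sanity-checked with a few lines of Python; the formulas coded here follow the reconstruction of Eqs. 2-4 and are assumptions to the extent that the original equations are not reproduced in this text.

```python
from math import ceil, log2

def feedforward_ops(d, h):
    """Operation counts and tree-reduction depth for a d-input, width-h,
    single-output network."""
    muls = h * (d + 1)                  # d per hidden neuron + 1 per output weight
    adds = h * d + h                    # hidden-layer sums + output sum
    # With full parallelism, every sum becomes a binary reduction tree:
    latency = ceil(log2(d + 1)) + ceil(log2(h + 1))   # ADD stages on the critical path
    return muls, adds, latency

print(feedforward_ops(d=2, h=16))   # an INV-sized model
print(feedforward_ops(d=4, h=36))   # a NAND2-sized model (cf. Table VI)
```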

CSM-NN accounts for the availability of resources when applying parallelization. NNs can be trained and used on two different hardware platforms, namely CPUs and GPUs. The evolution of GPUs and CPUs in terms of floating-point operations per second (FLOPS) is shown in Fig. 3.

III-A1 CPU

There are two phases of CSM-NN simulation computation when using CPUs: first, the weights of the NNs are loaded from the memory; and second, MUL and ADD operations are performed by arithmetic logic units (ALUs). As later described in Section IV, the number of CSM-NN parameters is sufficiently small. Therefore, they can fit into the cache (L1) of a CPU, and are accessible by the ALU in the order of a few CPU clock cycles.
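As a rough check of the claim that these parameters are cache-resident, the sketch below counts the parameters of a single-hidden-layer model and compares the footprint against the 32 KB L1 data cache of Table II; the parameter-count formula (weights plus biases) and the example sizes are assumptions based on Eq. 2 and Table VI.

```python
L1_BYTES = 32 * 1024   # L1 data cache size from Table II
FP_BYTES = 4           # 32-bit floating-point parameters

def model_footprint(d, h):
    """Parameter count and memory footprint of a d-input, width-h,
    single-output network."""
    # h*d hidden weights + h hidden biases + h output weights + 1 output bias
    params = h * d + h + h + 1
    return params, params * FP_BYTES

# Example: a NAND2-sized model (d = 4 node voltages, h = 36 neurons)
params, nbytes = model_footprint(d=4, h=36)
print(params, nbytes, nbytes < L1_BYTES)   # 217 parameters, 868 bytes, True
```

A single cell's full set of component models thus occupies only a few kilobytes, in contrast to the CSM-LUT sizes of Table I.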

III-A2 GPU

The computational capabilities of GPUs have increased dramatically in the past decade. This has made GPUs a good choice of hardware platform for NN computation [33].

There are two levels of parallelized processing units in GPUs: several multiprocessors (MPs), and, within each multiprocessor, several stream processors (SPs, also referred to as cores) that run the actual computation. Each core is equipped with ADD and MUL arithmetic units and dedicated register files. By implementing a trained NN (fixed parameters) on a GPU, the weights used by each operation can be stored in register files, so information retrieval from main memory is not required. We show in Section IV that the NNs of our CSM-NN framework fit into a typical GPU. As an example, the hardware specifications of an NVIDIA GPU with CUDA [29] cores are shown in Table III.

Streaming Multiprocessors (SM) 56
32bit FP CUDA core (per SM / total) 64 / 3584
64bit FP CUDA core (per SM / total) 32 / 1792
Register file per SM 256 KB
Shared memory per SM 96 KB
Register file per CUDA core 4 KB
Total L1 cache 64 KB
Base clock frequency 1328 MHz
Single Precision GFLOPs 9519
TABLE III: NVIDIA Tesla P100 GPU Specifications.
Fig. 3: Theoretical peak FLOPs with single precision.

It is worth noting that LUT-based models such as CSM-LUT and V-LUT depend only on memory queries; thus, using GPUs does not improve their simulation time. Therefore, given the stronger parallelization capabilities of GPUs compared to CPUs, the speed advantage of CSM-NN over CSM-LUT and V-LUT grows when running on GPUs.

III-B Training Process

We have adopted L-BFGS as the optimization technique for training the NNs of our CSM-NN framework; the following provides our justification. There are several gradient-descent-based optimization algorithms, such as stochastic gradient descent (SGD), Nesterov, Adagrad, and ADAM [14], that could be considered for training neural regression models. SGD and its derived algorithms, such as ADAM, are by far the most popular optimizers for NNs [34]. Their advantages over other techniques include parallelization, fast computation, and the use of minibatch training for better generalization, especially in DNNs. The functionality of these methods is, however, conditioned on the appropriate tuning of training hyper-parameters. On the other hand, quasi-Newton methods such as Broyden-Fletcher-Goldfarb-Shanno (BFGS) can be orders of magnitude faster than SGD. These methods measure the curvature of the objective function to select the length and direction of the steps. The main shortcoming of BFGS is that it requires substantial computation and memory resources to form the inverse of the Hessian matrix for large datasets. Limited-memory BFGS (L-BFGS) [25] is an optimization algorithm in the family of quasi-Newton methods that approximates the BFGS algorithm using a limited amount of memory.

The experimental results for low-dimensional problems in [23] show that L-BFGS produces highly competitive, and sometimes superior, models compared to SGD methods. Another important advantage of L-BFGS is that it requires adjusting zero (and, in advanced modified versions of L-BFGS, only a few) hyper-parameters; for example, unlike SGD, the learning rate (step size) of L-BFGS is tuned internally. We should also note that while several mini-batch versions of L-BFGS have recently been suggested in the literature [5], L-BFGS is generally considered a batch algorithm, and thus no batch-size adjustment is required. Considering these characteristics, we chose L-BFGS as the optimization technique for training the NNs in the CSM-NN method.
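Since the framework's NN implementation and training are built on Scikit-learn (Section IV), a hedged sketch of how one CSM-component model could be trained with the L-BFGS solver is shown below; the hidden-layer width, iteration cap, and synthetic data are placeholders, not the paper's actual configuration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# X: node voltages sampled during characterization, y: one CSM component value.
# Synthetic placeholder data; the real data comes from HSPICE characterization.
rng = np.random.default_rng(0)
X = rng.random((500, 2))
y = np.sin(X[:, 0]) * np.cos(X[:, 1])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

model = MLPRegressor(hidden_layer_sizes=(20,),   # single hidden layer
                     activation='tanh',          # tanh, as used in Section IV-D
                     solver='lbfgs',             # quasi-Newton, no learning-rate tuning
                     max_iter=5000)
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```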

The common approach in supervised learning is to verify the generalization of the trained model using a validation (test) dataset that is completely separate from the training dataset; this helps prevent over-fitting of the model. We therefore randomly select samples from the characterization data to test the accuracy of the model.

It is very important to note that while the accuracy of the NNs in predicting CSM component values matters, accuracy should ultimately be measured by the quality of the output signal waveforms; even measuring the propagation delay of a gate is not sufficient to confirm the accuracy of a CSM simulator. Therefore, similar to [2, 35], we used the expected waveform similarity error, denoted WS_e, as the figure of merit for the accuracy of our CSM simulations. In this work, WS_e is defined as the mean of the absolute difference between precise HSPICE and CSM-NN simulations, relative to the supply voltage of the technology, as shown in Eq. 5.

WS_e = \frac{1}{N} \sum_{k=1}^{N} \frac{\left| V_{HSPICE}(t_k) - V_{CSM\text{-}NN}(t_k) \right|}{V_{DD}}    (5)

where the two output waveforms are compared at N sampled time points t_k and V_{DD} is the supply voltage.
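A small sketch of how the waveform similarity error of Eq. 5 could be computed from two waveforms sampled on the same time grid; the function name and sampling assumptions are illustrative.

```python
import numpy as np

def waveform_similarity_error(v_hspice, v_csm_nn, vdd):
    """Mean absolute difference between two output waveforms, relative to the
    supply voltage (Eq. 5). Both waveforms must share the same time samples."""
    v_hspice = np.asarray(v_hspice, dtype=float)
    v_csm_nn = np.asarray(v_csm_nn, dtype=float)
    return np.mean(np.abs(v_hspice - v_csm_nn)) / vdd

# Example: two nearly identical ramps with VDD = 0.9 V
t = np.linspace(0.0, 1.0, 101)
err = waveform_similarity_error(0.9 * t, 0.9 * t + 0.005, vdd=0.9)
print(f"WS_e = {100 * err:.2f}%")   # about 0.56%
```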

III-C CSM-NN Flow

Technology information and transistor-level standard cell libraries are provided by semiconductor manufacturers and design parties. Each cell in the standard library must be characterized separately for every PVT corner and mode setting. The number of different MCMM settings depends on the technology and the product design policy. The characterization process is usually very time intensive and can be performed at different resolutions. While higher resolutions result in higher accuracy, they require longer characterization times. It should be noted that more data requires larger memory in CSM-LUT and possibly a longer training process in CSM-NN. Therefore, choosing an appropriate resolution is an important step in both the CSM-LUT and CSM-NN flows. While our results in Section IV are technology specific, they suggest a range of acceptable characterization resolutions. Up to this point of the flow, the CSM-NN steps coincide with those of CSM-LUT.

The next step is to train the NNs, one for every CSM component of a logic cell in a specific PVT corner (e.g., fast-fast and high temperature (FFHT)). The inputs of each NN are the voltages of the terminal and internal nodes, and the target output is the value of the CSM component at those voltage points.

The training data collected through characterization should first be preprocessed and then used for training. As explained in Section III-A, a wider network can result in a more accurate model, but requires more computation. Hence, we need to find an appropriate layer size: we choose the smallest number of neurons such that the network passes a pre-defined accuracy threshold in terms of WS_e.

In the following section, we will show that this optimal set of NN parameters can fit into the cache (L1) of a typical CPU or the register files of a typical GPU. To simulate a circuit in a specific MCMM setup, the corresponding NN models of all logic cells in the standard library are loaded.

IV Experiments and Simulation Results

We implemented the simulator and the flow of our CSM-NN framework in Python. Our implementation is technology-independent and can characterize and create NN models, with a flexible and configurable setup, for any given combinational circuit netlist. The NN implementation and training are based on the Scikit-learn [31] package.

The CPU and GPU devices introduced in Table II and Table III are used for the comparison between the two platforms, as both products were introduced in the same year (2016) and their current retail prices are of the same order (about 5,000 USD). In the following, we discuss our experiments, including the challenges specific to our problem setup.

IV-A Selected Technologies

In this work, for a better evaluation of CSM-NN, including its technology-independent characteristics, we performed our experiments on both MOSFET (16nm) and FinFET (20nm) device technologies from the Predictive Technology Model (PTM) [32] packages. Two device types, namely low-standby-power (LP) and high-performance (HP), are used in our experiments [1].

As technology scales down, a growing number of physical and fitting parameters are needed to model PVT variations. However, as pointed out in [40, 10, 26, 39], only a few of them are dominant, i.e., simulation models that account for those dominant parameters while ignoring the rest still provide sufficiently high accuracy. Following these studies, we considered the most important process variation factors when defining a limited number of process corners. No process variation distribution information is available for the PTM technologies; therefore, we followed the approach used in [24], which studied the same devices as this work, to define the PVT corners.

All distributions except temperature are considered normal (Gaussian) and are reported as (μ, σ) pairs, where μ and σ denote the mean and standard deviation, respectively. The typical temperature is taken as 27°C and the highest temperature (worst-case variation) as 125°C. The distributions of the process variation parameters and the process corners defined for the experiments are provided in Table IV.

PVT Variation Distribution
Technology
Fin-LP 0.9,0.05 4.6,0.23 - 15,0.5
Fin-HP 0.9,0.05 4.4,0.22 - 15,0.5
MOS-LP 0.9,0.05 4.6,0.23 2,0.1 1.2,0.04
MOS-HP 0.7,0.035 4.4,0.23 2,0.1 0.95,0.03
PVT variation in pre-defined corners
Corner (°C)
FF
SS
FFHT
SSHT
TABLE IV: Process (P), voltage (V), and temperature (T) variation distributions of the technologies used in the experiments. The values of the process attributes are reported for NMOS/NFET devices; the thickness attribute represents the oxide thickness for MOSFET and the fin thickness for FinFET devices. All distributions are normal and are reported as (μ, σ) pairs.

IV-B Characterization

The resolution of the characterization process is a key factor in determining the accuracy of both CSM-LUT and CSM-NN simulations. While more data points increase the accuracy of both simulators, this comes at the cost of a longer characterization process, larger tables in CSM-LUT, and a longer training time in our CSM-NN. We therefore evaluate our CSM-NN framework under different resolutions; the results can also be used later to suggest a baseline for other technologies.

It should be mentioned that CSM components exhibit different sensitivity levels to different voltage-node variables; a component that is less sensitive to a particular node voltage can be characterized with a lower resolution along that dimension. Moreover, the appropriate characterization resolution for one CSM component is not necessarily the same as for another, since the ranges over which different components change during a single transition can differ substantially. The resolution can also vary over the range of a voltage-node variable, e.g., higher resolutions for the noisy parts of the waveform (with higher frequencies of change) and lower resolutions for the smooth parts.

However, for the sake of simplicity, we used the same resolution for all voltage-node variables. As the units of different dimensions differ, we defined three resolution setups, as explained in Table V. Based on preliminary results, the normal setup was found to be an appropriate resolution, and the experiments were continued with this setup.

Setting          S: Soft   N: Normal   C: Coarse
Resolution (V)   0.01      0.05        0.1
TABLE V: Characterization resolution settings used in our experiments.
Cell    TT   FF   SS   FFHT   SSHT
MOSFET-HP 16nm
INV 14 16 18 16 18
NAND2 24 28 30 28 30
MOSFET-LP 16nm
INV 20 20 24 22 26
NAND2 28 32 30 32 32
FinFET-HP 20nm
INV 20 20 20 26 24
NAND2 34 30 34 36 36
FinFET-LP 20nm
INV 20 20 20 26 20
NAND2 30 36 36 40 38
TABLE VI: Choice of NN hidden layer size for single- and two-input logic cells.
Corner     MOSFET-HP 16nm        MOSFET-LP 16nm        FinFET-HP 20nm        FinFET-LP 20nm
           WS_e  S_CPU  S_GPU    WS_e  S_CPU  S_GPU    WS_e  S_CPU  S_GPU    WS_e  S_CPU  S_GPU
Nominal    <2%   9.3    16.8     <2%   9.3    16.8     <2%   6.9    15.1     <2%   8.6    16.8
FF         <2%   8.6    16.8     <1%   7.4    15.1     <2%   8.6    16.8     <2%   6.8    15.1
SS         <2%   8.6    16.8     <2%   6.9    15.1     <1%   6.9    15.1     <1%   6.8    15.1
FFHT       <2%   8.6    16.8     <1%   7.4    15.1     <1%   6.8    15.1     <2%   6.6    15.1
SSHT       <2%   8.6    16.8     <1%   7.4    15.1     <1%   6.8    15.1     <1%   6.6    15.1
TABLE VII: CSM simulation results of a full adder circuit in both FinFET and MOSFET technologies. The simulation time improvement (S) is the ratio of the time required for CSM-LUT simulation to that of CSM-NN. While the CSM-LUT results would not improve if run on a GPU (instead of a CPU), the improvement results for the CPU and GPU implementations of CSM-NN are reported as S_CPU and S_GPU, respectively. The hardware platforms' specs are reported in Tables II and III. WS_e is the accuracy measure introduced in Eq. 5.

IV-C Preprocessing and Loss Function Modification

Mean square error (MSE, also referred to as L2-norm error) is a commonly used regression loss function; it is simply the average of the squared distances between the targets (t_i) and the predicted values (y_i). A regularization term can also be added to the loss function to prevent overfitting by shrinking the model parameters. The values of the CSM components vary over a very wide range. For example, in an INV, the DC current when both transistors are on is many orders of magnitude larger than the current when one of them is off and the cell is only leaking. The MSE loss is a function of the absolute error; with this loss, errors at small-magnitude values therefore matter far less than errors at large-magnitude values. To address this, we log-transform the output so that the relative error is used in the loss calculation of the regression model, as shown in Eq. 6. An issue with this adjustment is that some of the values are negative, which complicates the log-transform. We resolved this by shifting the data toward positive values, i.e., subtracting the overall minimum from all data points.

\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left( \log\left(t_i - t_{\min} + \epsilon\right) - \log\left(y_i - t_{\min} + \epsilon\right) \right)^2    (6)

where t_i and y_i denote the target and predicted values, t_{\min} is the overall minimum of the targets, and \epsilon is a small positive constant.
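The sketch below illustrates the target preprocessing described above: shift the data toward positive values, then log-transform so that the MSE loss acts on relative rather than absolute error. The small epsilon guarding against log(0) at the minimum point is an assumption not stated in the text.

```python
import numpy as np

EPS = 1e-12   # assumed guard so the shifted minimum does not hit log(0)

def log_shift_transform(y):
    """Shift targets so the minimum maps to EPS, then take the log."""
    y = np.asarray(y, dtype=float)
    y_min = y.min()
    return np.log(y - y_min + EPS), y_min

def inverse_log_shift(y_log, y_min):
    """Map (predicted) transformed values back to the original scale."""
    return np.exp(y_log) + y_min - EPS

# CSM component values spanning many decades, some of them negative
values = np.array([-2e-4, -5e-9, 3e-12, 8e-5])
y_log, y_min = log_shift_transform(values)
assert np.allclose(inverse_log_shift(y_log, y_min), values)
```

Training the regression model on the transformed targets and inverting the transform at inference time keeps small-magnitude components from being drowned out by large ones in the loss.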

Normalizing the data in regression problems helps the solver converge faster and improves numerical stability. Hence, normalization of the inputs and outputs is typically implemented inside the solver, such as in the Scikit-learn package [31] used in our implementation.

IV-D NN Size and Training for Logic Cells

To select the size of the hidden layer for each model, we repeated the training process over a range of candidate neuron counts. Preliminary results in our experiments showed that the tanh nonlinearity provides better outcomes than other activation functions such as sigmoid and ReLU. As mentioned in Section III-B, no hyper-parameter tuning, e.g., of the learning rate or mini-batch size, is required for L-BFGS optimization.

The total number of generated data points is 500 per gate. We trained the NN with 90% of this data (5-fold cross-validation, i.e., 360 points for training and 90 for validation in each fold) and then tested on the remaining 10%. The split between training, validation, and test datasets was done at random.

Next, we applied a few noisy input samples to the cell and measured WS_e. The minimum hidden layer size that met the accuracy threshold was chosen as the CSM-NN architecture for the logic cell in the specific MCMM setup. The complete results for the choice of architecture for INV and NAND2 in different MCMM setups are given in Table VI.
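The hidden-layer sizing procedure described above can be sketched as a simple search over candidate widths; the width range, the 2% threshold, and the evaluation callback are placeholders for the framework's actual settings.

```python
from sklearn.neural_network import MLPRegressor

def select_hidden_width(X_train, y_train, evaluate_ws_error,
                        candidate_widths=range(10, 42, 2), threshold=0.02):
    """Return the smallest hidden-layer width whose trained model keeps the
    waveform similarity error (Eq. 5) of noisy test inputs below the threshold."""
    for width in candidate_widths:
        model = MLPRegressor(hidden_layer_sizes=(width,), activation='tanh',
                             solver='lbfgs', max_iter=5000)
        model.fit(X_train, y_train)
        if evaluate_ws_error(model) < threshold:
            return width, model
    raise ValueError("no candidate width met the accuracy threshold")
```

Here evaluate_ws_error is expected to simulate the cell with the candidate model on the noisy input samples and return WS_e.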

IV-E Circuit Simulation

In this work, we evaluated our CSM-NN framework by simulating a full-adder circuit (schematic shown in Fig. 4).

For the sake of a fair comparison, the HSPICE characterization setup is the same for both CSM-NN and CSM-LUT. We measured accuracy by comparing the output waveforms of HSPICE, as the baseline, with those of the CSM-NN simulations. The CPU and GPU devices used in our experiments are introduced in Table II and Table III, respectively. CSM-LUT is assumed to run on the CPU platform, as it does not benefit from GPU parallelization. The required computation resources and latencies are calculated using the equations in Section III-A. The results confirm that the CSM-NN output waveforms match those of HSPICE in terms of propagation delay, with error values limited to 0.1%. To further confirm the high accuracy of CSM-NN, we compared its waveform similarity to HSPICE by measuring WS_e; as listed in Table VII, WS_e is limited to 2%.

Fig. 4: Gate level schematic of the full adder circuit used in our experiments.

V Conclusions and Future Work

CSM-NN, a scalable, technology-independent circuit simulation framework, is proposed. CSM-NN aims to address the efficiency concerns of existing tools that depend on querying lookup tables stored in memory. Given the underlying CPU and GPU parallel processing capabilities, our framework replaces memorization with computation, utilizing a set of optimized NN structures and training and inference processing steps. The simulation latency of CSM-NN was evaluated for multiple MOSFET and FinFET technologies based on predictive technology models, in various PVT corners and modes. The results confirm that CSM-NN improves the simulation speed by up to 6× on CPU platforms compared to a CSM-LUT baseline. CSM-NN can further benefit from the parallelization capabilities of GPUs; the simulation speed improves by up to 15× when run on a GPU. CSM-NN also provides high accuracy, maintaining the waveform similarity error within 2% compared to HSPICE. We believe the application of CSM-NN in future simulation tools, such as those for sign-off and MCMM analysis and optimization of advanced VLSI circuits, can significantly improve simulation accuracy and speed.

As part of our future work, we plan to investigate CSM-NN on industrial circuits using accurate foundry technology information including PVT variations. We also plan to enhance our NNs to account for PVT corner parameters as inputs, to be able to train NNs once for all modes and corners and evaluate the cost vs speed and accuracy trade-off.

Acknowledgement

This research was sponsored in part by a grant from the Software and Hardware Foundations (SHF) program of the National Science Foundation. The authors would also like to thank Soheil Nazar Shahsavani and Mahdi Nazemi (of the University of Southern California) for helpful discussions.

References

  • [1] M. S. Abrishami, A. Shafaei, Y. Wang, and M. Pedram (2015-03) Optimal choice of FinFET devices for energy minimization in deeply-scaled technologies. In International Symposium on Quality Electronic Design (ISQED), Vol. , pp. 234–238. External Links: Document, ISSN Cited by: §IV-A.
  • [2] B. Amelifard, S. Hatami, H. Fatemi, and M. Pedram (2008) A current source model for CMOS logic cells considering multiple input switching and stack effect. In Design, Automation and Test in Europe (DATE), Vol. , pp. 568–573. External Links: Document, ISSN 1530-1591 Cited by: §I, Fig. 1, §II, §III-B.
  • [3] A. Goel and S. Vrudhula (2008) Current source based standard cell model for accurate signal integrity and timing analysis. Design, Automation and Test in Europe (DATE), pp. 574–579. External Links: Document Cited by: §I.
  • [4] L. Benini, A. Bogliolo, and G. De Micheli (2000-06) A survey of design techniques for system-level dynamic power management. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 8 (3), pp. 299–316. External Links: Document, ISSN Cited by: §I.
  • [5] R. Bollapragada, D. Mudigere, J. Nocedal, H. M. Shi, and P. T. P. Tang (2016) A progressive batching L-BFGS method for machine learning. In International Conference on Machine Learning (ICML). Cited by: §III-B.
  • [6] Cadence Inc., San Jose, California, U.S.(Website) External Links: Link Cited by: §I.
  • [7] J. F. Croix and D. F. Wong (2003) Blade and razor: cell and interconnect delay analysis using current-based models. In Design Automation Conference (DAC), Vol. , pp. 386–389. External Links: Document, ISSN Cited by: §I, §II.
  • [8] B. C. Csáji (2001) Approximation with artificial neural networks. Master’s Thesis, Faculty of Sciences, Eötvös Loránd University, Hungary. Cited by: §III-A.
  • [9] T. Cui, Y. Wang, X. Lin, S. Nazarian, and M. Pedram (2014) Semi-analytical current source modeling of FinFET devices operating in near/sub-threshold regime with independent gate control and considering process variation. In Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 167–172. Cited by: §I.
  • [10] H. Dadgour, Vivek De, and K. Banerjee (2008-11) Statistical modeling of metal-gate work-function variability in emerging device technologies and implications for circuit design. In International Conference on Computer-Aided Design (ICCAD), Vol. , pp. 270–277. External Links: Document, ISSN 1092-3152 Cited by: §IV-A.
  • [11] H. Fatemi, S. Nazarian, and M. Pedram (2007-01) A current-based method for short circuit power calculation under noisy input waveforms. In Asia and South Pacific Design Automation Conference (ASP-DAC), Vol. , pp. 774–779. External Links: Document, ISSN Cited by: §I, §II.
  • [12] H. Fatemi, S. Nazarian, and M. Pedram (2006) Statistical logic cell delay analysis using a current-based model. In Design Automation Conference (DAC), pp. 253–256. Cited by: §I, Fig. 1, §II.
  • [13] A. Goel and S. Vrudhula (2008) Statistical waveform and current source based standard cell models for accurate timing analysis. In Design Automation Conference (DAC), pp. 227–230. Cited by: §I.
  • [14] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press. Note: http://www.deeplearningbook.org Cited by: §III-B.
  • [15] R. Goyal and N. Kumar (2005) Current based delay models: a must for nanometer timing. Cadence Live Conference (CDNLive). Cited by: §I.
  • [16] S. Hatami and M. Pedram (2010) Efficient representation, stratification, and compression of variational CSM library waveforms using robust principle component analysis. In Design, Automation and Test in Europe (DATE), pp. 1285–1290. External Links: ISBN 978-3-9810801-6-2 Cited by: §II.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016-06) Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR). Cited by: §III-A.
  • [18] (Website) External Links: Link Cited by: §II.
  • [19] A. B. Kahng, U. Mallappa, and L. Saul (2018-10) Using machine learning to predict path-based slack from graph-based timing analysis. In International Conference on Computer Design (ICCD), pp. 603–612. External Links: Document, ISSN 2576-6996 Cited by: §I.
  • [20] S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, and D. Glasco (2011-Sep.) GPUs and the future of parallel computing. IEEE Micro 31 (5), pp. 7–17. External Links: Document, ISSN Cited by: §I.
  • [21] I. Keller, Ken Tseng, and N. Verghese (2004-11) A robust cell-level crosstalk delay change analysis. In International Conference on Computer-Aided Design (ICCAD), Vol. , pp. 147–154. External Links: Document, ISSN 1092-3152 Cited by: §I.
  • [22] C. Knoth, H. Jedda, and U. Schlichtmann (2012) Current source modeling for power and timing analysis at different supply voltages. In Design, Automation Test in Europe (DATE), Vol. , pp. 923–928. External Links: Document, ISSN 1558-1101 Cited by: §I.
  • [23] Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng (2011) On optimization methods for deep learning. In International Conference on Machine Learning (ICML), pp. 265–272. External Links: ISBN 978-1-4503-0619-5 Cited by: §III-B.
  • [24] Y. Li, C. Hwang, T. Li, and M. Han (2010-02) Process-variation effect, metal-gate work-function fluctuation, and random-dopant fluctuation in emerging CMOS technologies. IEEE Transactions on Electron Devices 57 (2), pp. 437–447. External Links: Document, ISSN 0018-9383 Cited by: §IV-A.
  • [25] D. C. Liu and J. Nocedal (1989-08-01) On the limited memory BFGS method for large scale optimization. Mathematical Programming 45 (1), pp. 503–528. External Links: ISSN 1436-4646, Document Cited by: §III-B.
  • [26] T. Matsukawa, S. O’uchi, K. Endo, Y. Ishikawa, H. Yamauchi, Y. X. Liu, J. Tsukada, K. Sakamoto, and M. Masahara (2009-06) Comprehensive analysis of variability sources of FinFET characteristics. In Symposium on VLSI Technology, Vol. , pp. 118–119. External Links: Document, ISSN 0743-1562 Cited by: §IV-A.
  • [27] H. Mhaskar, Q. Liao, and T. Poggio (2017) When and why are deep networks better than shallow ones?. In AAAI Conference on Artificial Intelligence, pp. 2343–2349. Cited by: §III-A.
  • [28] S. Nazarian, H. Fatemi, and M. Pedram (2011-01) Accurate timing and noise analysis of combinational and sequential logic cells using current source modeling. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 19 (1), pp. 92–103. External Links: Document, ISSN 1063-8210 Cited by: §I.
  • [29] J. Nickolls, I. Buck, M. Garland, and K. Skadron (2008-03) Scalable parallel programming with cuda. Queue 6 (2), pp. 40–53. External Links: ISSN 1542-7730, Link, Document Cited by: §III-A2.
  • [30] M. Pedram and S. Nazarian (2006-08) Thermal modeling, analysis, and management in VLSI circuits: principles and methods. Proceedings of the IEEE 94 (8), pp. 1487–1501. External Links: Document Cited by: §I.
  • [31] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and É. Duchesnay (2011-11) Scikit-learn: machine learning in python. Journal of Machine Learning Research 12, pp. 2825–2830. External Links: ISSN 1532-4435 Cited by: §IV-C, §IV.
  • [32] Predictive Technology Model, Arizona State University. Note: http://ptm.asu.edu/, accessed 2019-05-20. Cited by: §IV-A.
  • [33] R. Raina, A. Madhavan, and A. Y. Ng (2009) Large-scale deep unsupervised learning using graphics processors. In International Conference on Machine Learning (ICML), pp. 873–880. External Links: ISBN 978-1-60558-516-1, Document Cited by: §III-A2, §III-A.
  • [34] S. Ruder (2016) An overview of gradient descent optimization algorithms. arXiv abs/1609.04747. Cited by: §III-B.
  • [35] D. Sinha, V. Zolotov, S. K. Raghunathan, M. H. Wood, and K. Kalafala (2016) Practical statistical static timing analysis with current source models. In Design Automation Conference (DAC), pp. 113:1–113:6. External Links: ISBN 978-1-4503-4236-0, Document Cited by: §III-B.
  • [36] Synopsys Inc., Mountain View, California, U.S.(Website) External Links: Link Cited by: §I.
  • [37] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015-06) Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 1–9. External Links: Document, ISSN 1063-6919 Cited by: §III-A.
  • [38] T. Wiatowski and H. Bölcskei (2018-03) A mathematical theory of deep convolutional neural networks for feature extraction. IEEE Transactions on Information Theory 64 (3), pp. 1845–1866. External Links: Document, ISSN 0018-9448 Cited by: §III-A.
  • [39] Xiao Zhang, Jing Li, M. Grubbs, M. Deal, B. Magyari-Köpe, B. M. Clemens, and Y. Nishi (2009-12) Physical model of the impact of metal grain work function variability on emerging dual metal gate MOSFETs and its implication for sram reliability. In International Electron Devices Meeting (IEDM), Vol. , pp. 1–4. External Links: Document, ISSN 0163-1918 Cited by: §IV-A.
  • [40] X. Zhang, D. Connelly, P. Zheng, H. Takeuchi, M. Hytha, R. J. Mears, and T. K. Liu (2016-04) Analysis of 7/8-nm Bulk-Si FinFET technologies for 6T-SRAM scaling. IEEE Transactions on Electron Devices 63 (4), pp. 1502–1507. External Links: Document, ISSN 0018-9383 Cited by: §IV-A.