I Introduction
The downscaling of transistor geometries has drastically increased the complexity of short-channel effects and process-voltage-temperature (PVT) variations. Consequently, application-specific integrated circuit (ASIC) design-flow techniques, such as multi-corner multi-mode (MCMM) and parametric on-chip variation (POCV), depend on increasingly complex analysis, transformation, and verification iterations to ensure that the ASIC functions correctly and meets design demands such as those related to performance, power, and signal integrity. In these methods, the design is tested in different PVT corners and operating modes such as low-power (LP) and high-performance (HP). Accurate simulation, such as timing analysis during placement, clock network synthesis, and routing, is crucial: it lowers the number of design iterations, speeds up convergence, and plays a major role in the turnaround time of complex designs such as systems-on-chip (SoCs) [19].
SPICE simulations are accurate but far too slow for timing, power, and thermal analysis and optimization of modern ASIC designs with billions or trillions of transistors [30, 4]. Therefore, higher levels of circuit abstraction using approximation have been used to speed up simulation steps. Abstraction models are generally based on lookup tables (LUTs), closed-form formulations, scaling factors, or their combinations. The traditional models, namely the nonlinear delay model (NLDM), nonlinear power model (NLPM), effective current source model (ECSM [6]), and composite current source model (CCSM [36]), use LUTs to store delay, noise, or power as nonlinear functions of physical, structural, and environmental parameters, and depend on voltage modeling more than current modeling. We refer to the NLDM, ECSM, and CCSM models as voltage-LUT (V-LUT) throughout this paper. The V-LUT models are intuitively better choices than simple closed-form formulations of nonlinear functions; however, they become increasingly inaccurate in capturing signal integrity and short-channel effects as technologies scale down [3].
Alternatively, current source models (CSMs) [7, 15, 21, 13, 2, 22, 28, 12, 11] use voltage-dependent current sources, and possibly voltage-dependent capacitances, to model logic cells. In addition to higher accuracy, another advantage of CSMs over V-LUT models is their ability to simulate realistic waveforms for arbitrary input signals and produce the corresponding output waveforms.
The number of CSM component values that must be stored in memory grows exponentially with the number of inputs and internal nodes of the logic cell. For example, 6-dimensional LUTs are required to model a 3-input NAND gate (NAND3). While V-LUT models fit in smaller, faster memories such as the L1 cache, the relatively larger tables of CSM-LUT can only fit in bigger, slower ones such as DRAM. Therefore, a fundamental idea for shortening simulation time is to replace some of the memorization with computation, aiming for optimal space/time efficiency.
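As a rough illustration of this exponential growth, the sketch below computes LUT storage as a function of table dimensionality. The 20-samples-per-dimension resolution and 4-byte entries are our assumptions for illustration, not the paper's exact characterization settings:

```python
# Illustrative sketch: CSM-LUT storage grows exponentially with the number
# of table dimensions (the node voltages that index the table).
def lut_size_bytes(num_dims, points_per_dim=20, bytes_per_entry=4):
    # points_per_dim and bytes_per_entry are assumed, hypothetical values
    return points_per_dim ** num_dims * bytes_per_entry

print(lut_size_bytes(2))  # 2-D table (e.g. an inverter): 1,600 bytes
print(lut_size_bytes(6))  # 6-D table (e.g. NAND3): 256,000,000 bytes
```

Each added input or internal node multiplies the table size by another factor of the per-dimension resolution, which is why multi-input cells quickly outgrow on-chip caches.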
In [9], a semi-analytical CSM (SA-CSM) was presented that uses small-size LUTs combined with nonlinear analytical equations to simultaneously achieve high modeling accuracy and space/time efficiency. However, developing analytical equations for complex circuits is a tedious process.
In this work, we propose CSM-NN, a circuit simulation framework that fully replaces LUTs with neural networks (NNs). This eliminates the long memory-access latency of LUTs and hence significantly shortens simulation time, especially when CSM-NN computations can exploit the parallelism offered by graphics processing units (GPUs) [20].
The major contributions of our work are as follows:

We developed a framework for simulating the nonlinear behavior of complex integrated circuits using optimized NN structures, as well as training and inference algorithms tailored to the underlying CPU or GPU computational capabilities.

Our framework is scalable and technology-independent, i.e., it can efficiently handle increasingly complex technologies with high PVT variations while maintaining accuracy and improving simulation latency.
II Background
In this section, we briefly touch upon the basics of CSM and the latency issues related to CSM-LUT memory access.
Each logic gate can be modeled using voltage-dependent current sources as well as (Miller and output) capacitance components [7]. The values of these components can be characterized using HSPICE simulations. The CSM components of a logic cell can be stored in LUTs and utilized for noise, timing, and power analysis of VLSI circuits [2, 11, 12, 16]. Fig. 1 illustrates CSMs for single-input (INV) and multi-input (NAND2) logic cells.
Given the large number of simulation runs needed during the ASIC design and verification flow, and the correspondingly long memory-retrieval time, it is desirable to keep the number of dimensions and the size of the LUTs very small. Table I lists the CSM LUT sizes for a simple library of basic gates.
The size of CSM-LUTs, even for simple logic cells (cf. Table I), is an exponential function of logic-cell complexity. For example, the NOR2 LUTs are 200 times larger than the INV one, and the XOR2 LUTs are 20,000 times larger than the NOR2 ones. Note that practical research or industrial standard cell libraries may contain many logic cells of various sizes and complexities, some of which are more complex than the simple logic cells in Table I.
Gate | #Dim. | Table Size
INV | 2 | FPs = 1.6 KB
NAND2 | 4 | FPs = 320 KB
NOR2 | 4 | FPs = 320 KB
AOI | 6 | FPs = 48 MB
NAND3 | 6 | FPs = 48 MB
NOR3 | 6 | FPs = 48 MB
XOR2 | 8 | FPs = 6.4 GB
Comparing the memory hierarchy of the Intel Broadwell microarchitecture [18] in Table II with the sizes in Table I confirms that CSM LUTs cannot fit in any of the caches; they must be stored in main memory (DRAM) and written into the cache in parts. DRAM access latency is about two orders of magnitude higher than that of the L1 cache. This difference explains the much longer simulation latencies of CSM-LUT compared to V-LUT.
In the following two sections, we present how our CSM-NN eliminates the need for LUTs and instead utilizes NNs to compute the CSM data.
Intel Broadwell microarchitecture
Memory | Size (KByte) | Latency (clock cycles)
L1 Data Cache | 32 | 4-5
L2 Cache | 256 | 11-12
L3 Cache | 20,480 | 38-42
DRAM | | 250

Intel Xeon Processor E5-2699 v4
Cores | 22
Base Frequency | 2.2 GHz
Single Precision | 774.4 GFLOPs
Double Precision | 1548.8 GFLOPs
III CSM-NN Framework
This section describes our CSM-NN framework, including the NN architecture and the optimization algorithms used for training.
III-A NN Architecture and Computation
To avoid the large LUTs with long query latencies of CSM-LUT, our CSM-NN embeds parametric nonlinear models, trained as fully-connected NNs, to represent the nonlinear CSM functions.
We believe CSM-NN can benefit from the following ML developments: (1) novel ML algorithms can be utilized to improve the accuracy and efficiency of CSM-NN; and, more importantly, (2) the exponential increase in computational capability, especially with recent advances in GPU design [33], significantly helps improve the performance of CSM-NN.
CSM-NN substitutes memory retrieval with computation; it is therefore necessary to analyze and optimize the structure, number, and latency of the operations required by CSM-NN on different hardware platforms.
There are two steps in CSM-NN: (1) simulation, a feedforward pass that calculates the output of the model from the trained parameters and input values; and (2) backpropagation, which modifies the parameters of the model based on the error, i.e., the difference between the expected values of the training data and the estimated output of the model. Since the training process is done only once, the computation cost of backpropagation is not a concern. Our objective is to improve circuit simulation time; we therefore focus mainly on the inference process, i.e., we optimize the computation steps of the feedforward pass.
To choose the best NN architecture for our CSM-NN, we note that the number of hidden layers and the number of neurons in the hidden layer(s) determine the total number of parameters of the input-output function and the flexibility of the model. Increasing the number of hidden layers beyond one (i.e., making the model deeper), instead of increasing the number of neurons in a single layer (i.e., making the layer wider), can also be considered. In deep neural networks (DNNs), the sequence of nonlinear activation layers enables the input-output dependency to have a higher degree of nonlinearity with more flexibility. Although there are still unanswered questions about why DNNs perform so well [27], the belief is that multiple layers generalize better because they learn the intermediate features between the raw input data and the high-level output [27, 38]. As an example, thanks to the availability of data and computation resources in the past few years, state-of-the-art solutions to challenging ML problems, such as image classification in computer vision, are made possible by models with hundreds of layers [37, 17]. On the other hand, shallow networks do not generalize as well but are very powerful at memorization [27]. In addition, deeper models require more data and time for training and need more computational resources for the feedforward pass. In conclusion, despite the recent emergence of DNN solutions and applications and their potential to improve the accuracy of circuit simulation for complex timing, noise, and power analysis, we do not believe a DNN is a suitable choice for the architecture of CSM-NN.
In the mathematical theory of artificial neural networks (ANNs), the universal approximation theorem [8] affirms that a single-hidden-layer NN can approximate continuous functions with a finite number of neurons, under mild assumptions on the nonlinear activation function and given sufficient training data. Consequently, if a shallow, wide network is trained with every possible input value, it can eventually memorize the corresponding outputs. The following characteristics of our problem further suggest that shallow, wide networks with one hidden layer are the more plausible solution:

There are no discontinuities in the CSM component values.

While in many practical applications training data is limited or expensive to generate, in CSM-NN it is straightforward to generate training data with HSPICE simulations during the characterization process.

The number of inputs to the neural network is relatively small, even for complex logic cells and when PVT parameters are considered (Table I). This implies that we are modeling a low-dimensional function.
Based on these features, and considering the impact on the inference step during circuit simulation, CSM-NN adopts a simple NN architecture with a single hidden layer to model the nonlinear behavior of the CSM components. The architecture and input-output function are shown in Fig. 2 and Eq. 1.
(1)  $f(x_1,\ldots,x_n) = b^{(2)} + \sum_{j=1}^{m} w_j^{(2)} \tanh\big(b_j^{(1)} + \sum_{i=1}^{n} w_{ji}^{(1)} x_i\big)$, where $x_i$ are the $n$ input node voltages, $m$ is the number of hidden neurons, and $w$, $b$ denote the trained weights and biases.
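As a concrete sketch, the single-hidden-layer feedforward pass of Eq. 1 takes only a few lines of NumPy; the weight names, shapes, and random values below are illustrative, not the paper's trained models:

```python
import numpy as np

# Sketch of the single-hidden-layer feedforward pass (Eq. 1).
def csm_nn_forward(x, W1, b1, w2, b2):
    """x: (n,) node voltages -> scalar CSM component value."""
    h = np.tanh(W1 @ x + b1)     # hidden layer with tanh activation
    return float(w2 @ h + b2)    # linear output neuron

rng = np.random.default_rng(0)
n, m = 4, 16                     # e.g. a NAND2-sized model: 4 inputs, 16 neurons
W1, b1 = rng.standard_normal((m, n)), rng.standard_normal(m)
w2, b2 = rng.standard_normal(m), 0.0
y = csm_nn_forward(rng.standard_normal(n), W1, b1, w2, b2)
```

In simulation, this forward pass replaces a multi-dimensional table query: the node voltages go in, the current or capacitance value comes out.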
The number of MUL operations in the feedforward pass is equal to the number of weight parameters, as calculated in Eq. 2. It is very important to note that there are no dependencies among the MUL steps within a layer; therefore they can be completely parallelized.
(2)  $N_{\mathrm{MUL}} = m(n+1)$, where $n$ is the number of NN inputs and $m$ is the number of hidden neurons.
Considering the notation used in Eq. 1, the hidden layer requires one summation per hidden neuron, each over the weighted inputs plus the bias. These summations can also be completely parallelized. To calculate the output, a summation over the weighted hidden activations plus the output bias is required. This summation can be efficiently parallelized using a tree structure. The total number of ADD operations and the latency of the tree-structured summations are given in Eq. 3 and Eq. 4.
(3)  $N_{\mathrm{ADD}} = m(n+1)$
(4)  $T_{\mathrm{ADD}} = \lceil \log_2(n+1) \rceil + \lceil \log_2(m+1) \rceil$
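A small helper makes these operation counts and the tree-summation latency easy to tabulate for any cell size; the symbol names (n inputs, m hidden neurons) follow the reconstruction above and are our assumptions:

```python
import math

# Sketch of the feedforward operation counts and tree-summation latency
# for a single-hidden-layer network with n inputs and m hidden neurons.
def op_counts(n, m):
    n_mul = m * (n + 1)   # one MUL per weight: m*n hidden + m output
    n_add = m * (n + 1)   # m*n hidden-layer adds + m output-layer adds
    # tree-structured summation: ceil(log2) levels per layer
    latency = math.ceil(math.log2(n + 1)) + math.ceil(math.log2(m + 1))
    return n_mul, n_add, latency

print(op_counts(4, 16))   # NAND2-sized example: (80, 80, 8)
```

With full parallelism, the latency grows only logarithmically in the layer widths, which is what makes the GPU mapping discussed below attractive.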
CSM-NN accounts for the availability of resources when applying parallelization. NNs can be trained and utilized on two different hardware platforms, namely CPUs and GPUs. The evolution of GPUs and CPUs in terms of floating-point operations per second (FLOPS) is shown in Fig. 3.
III-A1 CPU
There are two phases of CSM-NN simulation computation when using CPUs: first, the weights of the NNs are loaded from memory; second, the MUL and ADD operations are performed by the arithmetic logic units (ALUs). As later described in Section IV, the number of CSM-NN parameters is sufficiently small that they fit into the L1 cache of a CPU and are accessible by the ALU within a few CPU clock cycles.
III-A2 GPU
The computational capabilities of GPUs have increased dramatically in the past decade. This has made GPUs a good choice of hardware platform for NN computation [33].
There are two levels of parallel processing units in GPUs: several multiprocessors (MPs), each containing several stream processors (SPs, also referred to as cores) that run the actual computation. Each core is equipped with ADD and MUL arithmetic units and designated register files. By implementing a trained NN (with fixed parameters) on a GPU, the weights of each operation can be stored in register files; therefore, retrieval of information from memory is not required. We will show in Section IV that the NNs of our CSM-NN framework fit into a typical GPU. As an example, the hardware specification of an NVIDIA GPU with CUDA [29] cores is shown in Table III.
Streaming Multiprocessors (SMs) | 56
32-bit FP CUDA cores (per SM / total) | 64 / 3584
64-bit FP CUDA cores (per SM / total) | 32 / 1792
Register file per SM | 256 KB
Shared memory per SM | 96 KB
Register file per CUDA core | 4 KB
Total L1 cache | 64 KB
Base clock frequency | 1328 MHz
Single Precision GFLOPs | 9519
It is worth noting that LUT-based models, such as CSM-LUT and the V-LUT models, depend only on memory queries, so using GPUs does not improve their simulation time. Consequently, given the relatively stronger parallelization capabilities of GPUs over CPUs, the speed advantage of CSM-NN over CSM-LUT and V-LUT grows when running on GPUs.
III-B Training Process
We have adopted L-BFGS as the optimization technique for training the NNs of our CSM-NN framework; the following provides our justification. Several gradient-descent-based optimization algorithms, such as stochastic gradient descent (SGD), Nesterov, Adagrad, and ADAM [14], are candidates for training neural regression models. SGD and its derived algorithms, such as ADAM, are by far the most popular algorithms for optimizing NNs [34]. Their advantages over other techniques include parallelization, fast computation, and the use of mini-batch training for better generalization, especially in DNNs. However, these methods only work well when their training hyper-parameters are tuned appropriately. On the other hand, quasi-Newton methods such as Broyden-Fletcher-Goldfarb-Shanno (BFGS) can be orders of magnitude faster than SGD. These methods measure the curvature of the objective function to select the length and direction of the steps. The main shortcoming of BFGS is that it requires substantial computation and memory to calculate the inverse of the Hessian matrix for large datasets. Limited-memory BFGS (L-BFGS) [25] is an optimization algorithm in the family of quasi-Newton methods that approximates the BFGS algorithm using a limited amount of memory.
The experimental results for low-dimensional problems in [23] show that L-BFGS produces highly competitive and sometimes superior models compared to SGD methods. Another important advantage of L-BFGS is that it requires tuning zero hyper-parameters (or, in advanced modified versions of L-BFGS, only a few). For example, unlike SGD, the learning rate (step size) of L-BFGS is tuned internally. We should also note that, while several mini-batch versions of L-BFGS have recently been suggested in the literature [5], L-BFGS is generally considered a batch algorithm, so no batch-size adjustment is required. Considering these properties, we chose L-BFGS as the optimization technique for training the NNs in the CSM-NN method.
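Since the flow is built on Scikit-learn (Section IV), training one CSM-component model with L-BFGS reduces to a few lines; the hidden size and the synthetic stand-in data below are illustrative, not the paper's characterization data:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Sketch of L-BFGS training of a single-hidden-layer regressor.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 0.9, size=(500, 4))   # e.g. NAND2 node voltages in [0, VDD]
y = np.sin(X).sum(axis=1)                  # smooth stand-in for HSPICE data

model = MLPRegressor(hidden_layer_sizes=(24,), activation='tanh',
                     solver='lbfgs', max_iter=5000, random_state=0)
model.fit(X, y)
print(model.score(X, y))                   # R^2 on the training data
```

Note that no learning rate or batch size appears anywhere in the call, which is exactly the hyper-parameter-free property argued for above.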
The common approach in supervised learning is to verify the generalization of the trained model on a validation (test) dataset that is completely separate from the training dataset; this guards against overfitting the model. We can therefore randomly select samples from the characterization data and test the accuracy of the model.
It is very important to note that, while the accuracy of the NNs in predicting CSM component values matters, accuracy should ultimately be measured by the quality of the output signal waveforms. Even measuring the propagation delay of the gate is not sufficient to confirm the accuracy of a CSM simulator. Therefore, similar to [2, 35], we used the waveform similarity error as a figure of merit for measuring the accuracy of our CSM simulations. In this work, it is defined as the mean of the absolute difference between precise HSPICE and CSM-NN simulations, relative to the supply voltage of the technology, as shown in Eq. 5.
(5)  $\mathrm{Err} = \frac{1}{N\,V_{DD}} \sum_{k=1}^{N} \left| V_{\mathrm{HSPICE}}(t_k) - V_{\mathrm{CSM\text{-}NN}}(t_k) \right|$, where the two waveforms are compared at $N$ sample points and $V_{DD}$ is the supply voltage.
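A minimal sketch of this metric, assuming uniformly sampled waveforms; the function name and the idealized ramp below are placeholders, not the paper's test signals:

```python
import numpy as np

# Sketch of the waveform-similarity error of Eq. 5: mean absolute voltage
# difference between HSPICE and CSM-NN waveforms, relative to VDD.
def waveform_error(v_hspice, v_csmnn, vdd):
    return float(np.mean(np.abs(v_hspice - v_csmnn)) / vdd)

t = np.linspace(0.0, 1e-9, 200)
ref = 0.9 / (1.0 + np.exp(-(t - 5e-10) / 5e-11))   # idealized output ramp
est = ref + 0.005                                   # 5 mV systematic offset
print(waveform_error(ref, est, vdd=0.9))            # about 0.0056 (0.56%)
```

Because the error is averaged over the whole waveform, it penalizes shape mismatches (glitches, slope errors) that a single delay measurement would miss.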
III-C CSM-NN Flow
Technology information and transistor-level standard cell libraries are provided by semiconductor manufacturers and design parties. Each cell in the standard library must be characterized separately for every PVT corner and mode setting. The number of different MCMM settings depends on the technology and the product design policy. The characterization process is usually very time intensive and can be done at different resolutions. While higher resolutions result in higher accuracy, they require longer characterization times. It should also be mentioned that more data requires larger memory in CSM-LUT and possibly a longer training process in CSM-NN. Choosing an appropriate resolution is therefore an important step in both the CSM-LUT and CSM-NN flows. While our results in Section IV are technology specific, they suggest a range of acceptable characterization resolutions. Up to this point of the flow, the CSM-NN steps coincide with those of CSM-LUT.
The next step is to train the NNs, one for every CSM component of a logic cell and for a specific PVT corner (e.g., fast-fast at high temperature (FFHT)). The inputs of the NNs are the voltages of the terminal and internal nodes, and the target output is the value of the CSM component at these voltage points. The training data collected through characterization is first pre-processed and then used for training. As explained in Section III-A, a wider network can produce a more accurate model but requires more computation; hence, we need to find an appropriate layer size. We choose the smallest number of neurons with which the network passes a predefined accuracy threshold in terms of the waveform similarity error.
In the following section, we show that this optimal set of NN parameters fits into the L1 cache of a typical CPU or the register files of a typical GPU. To simulate a circuit in a specific MCMM setup, the corresponding NN models of all logic cells in the standard library are loaded.
IV Experiments and Simulation Results
We implemented the simulator and the flow of our CSM-NN framework in Python. Our implementation is technology independent: it can characterize any given combinational circuit netlist and create NN models for it with a flexible, configurable setup. The NN implementation and training are based on the Scikit-learn package [31].
The CPU and GPU devices introduced in Table II and Table III are used to compare the two platforms, as both products were introduced in the same year (2016) and their current retail prices are of the same order (about 5,000 USD). In the following, we discuss our experiments, including challenges specific to our problem setup.
IV-A Selected Technologies
In this work, to better evaluate our CSM-NN including its technology-independent characteristics, we performed our experiments on both MOSFET (16 nm) and FinFET (20 nm) device technologies from the Predictive Technology Model (PTM) [32] packages. Two device types, namely low-standby-power (LP) and high-performance (HP), are used in our experiments [1].
As technology scales down, a growing number of physical and fitting parameters are needed to model PVT variations. However, as pointed out in [40, 10, 26, 39], only a few of them are dominant; i.e., simulation models that account for those dominant parameters while ignoring the rest provide sufficiently high accuracy. Following these studies, we considered the most important process variation factors when defining a limited number of process corners. No process variation distribution information is available for the PTM technologies. Therefore, we followed the approach used in [24], which studied the same devices as this work, to define the PVT corners.
All distributions except temperature are considered normal (Gaussian) and are reported as (μ, σ), with μ and σ representing the mean and standard deviation, respectively. The typical temperature is 27°C and the highest temperature (variation) is 125°C. The distributions of the process variation parameters and the process corners defined for the experiments are provided in Table IV.

PVT variation distributions

Technology | (μ, σ) of the modeled process parameters
FinLP | 0.9, 0.05 | 4.6, 0.23 | | 15, 0.5
FinHP | 0.9, 0.05 | 4.4, 0.22 | | 15, 0.5
MOSLP | 0.9, 0.05 | 4.6, 0.23 | 2, 0.1 | 1.2, 0.04
MOSHP | 0.7, 0.035 | 4.4, 0.23 | 2, 0.1 | 0.95, 0.03

PVT variation in predefined corners
Corner | T (°C)
FF |
SS |
FFHT |
SSHT |
IV-B Characterization
The resolution of the characterization process is a key factor in the accuracy of both CSM-LUT and CSM-NN simulations. While more data points increase the accuracy of both simulators, they come at the cost of a longer characterization process, larger tables in CSM-LUT, and a longer training time in our CSM-NN. We therefore evaluate our CSM-NN framework under different resolutions. The results can also serve as a baseline suggestion for other technologies.
It should be mentioned that the CSM components exhibit different sensitivity levels to different voltage-node variables; a component that is less sensitive to a given node voltage can be characterized at a lower resolution for that variable. Moreover, the appropriate characterization resolution for one CSM component is not necessarily the same as for another, since the ranges over which the component values change during a single transition can differ by orders of magnitude. The resolution can also vary across the range of a voltage-node variable, e.g., higher resolutions for the noisy parts of the waveform (with higher frequencies of change) and lower resolutions for the smooth parts.
However, for the sake of simplicity, we used the same resolution for all voltage-node variables. As the units of the different dimensions differ, we defined three resolution setups, as explained in Table V. Comparing preliminary results, the normal setup was found to be an appropriate resolution, and the experiments were continued with this setup.
Setup | S: Soft | N: Normal | C: Coarse
Resolution (V) | 0.01 | 0.05 | 0.1
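The uniform-resolution characterization sweep can be sketched as follows; the 0.05 V step and the [0, VDD] range follow the normal setup above, while the function name and grid construction are our assumptions:

```python
import itertools

# Sketch of building the characterization grid: sweep every voltage-node
# variable over [0, VDD] at a uniform step, then run one HSPICE query
# per grid point during characterization.
def char_grid(num_vars, vdd=0.9, step=0.05):
    n = int(round(vdd / step)) + 1
    axis = [round(i * step, 3) for i in range(n)]
    return itertools.product(axis, repeat=num_vars)

points = list(char_grid(2))   # e.g. INV: pairs of input/output voltages
print(len(points))            # 19 * 19 = 361 grid points
```

Doubling the resolution per axis multiplies the number of HSPICE runs by 2^d for a d-dimensional cell, which is the cost trade-off discussed above.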
Gate | TT | FF | SS | FFHT | SSHT
MOSFET-HP 16 nm
INV | 14 | 16 | 18 | 16 | 18
NAND2 | 24 | 28 | 30 | 28 | 30
MOSFET-LP 16 nm
INV | 20 | 20 | 24 | 22 | 26
NAND2 | 28 | 32 | 30 | 32 | 32
FinFET-HP 20 nm
INV | 20 | 20 | 20 | 26 | 24
NAND2 | 34 | 30 | 34 | 36 | 36
FinFET-LP 20 nm
INV | 20 | 20 | 20 | 26 | 20
NAND2 | 30 | 36 | 36 | 40 | 38
IV-C Pre-processing and Loss Function Modification
Mean squared error (MSE, also referred to as L2-norm error) is a commonly used regression loss function. It is simply the average of the squared distances between the targets and the predicted values. The loss function can also accommodate a regularization term, added to prevent overfitting by shrinking the model parameters. The values of the CSM components vary over a large scale. For example, in an INV the DC current is orders of magnitude larger when both transistors are on than when one of them is off and the cell is merely leaking. The MSE loss is a function of the absolute error; with this loss, errors in the small-scale values matter less than errors in the large-scale values. To address this, we log-transform the output, so that the relative error is used in the loss calculation of the regression model, as shown in Eq. 6. An issue with this adjustment is that some of the values are negative, which complicates the log-transform. We resolved this with a simple shift of the data toward positive values, subtracting from all data points their overall minimum.

(6)  $\mathcal{L} = \frac{1}{N} \sum_{k=1}^{N} \big( \log(t_k - t_{\min} + \epsilon) - \log(y_k - t_{\min} + \epsilon) \big)^2$, where $t_k$ are the targets, $y_k$ the predictions, $t_{\min}$ the overall minimum of the training targets, and $\epsilon$ a small positive constant.
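The shift-and-log target transform can be sketched as below; the epsilon guard and the function names are our assumptions, added so the minimum-valued sample does not map to log(0):

```python
import numpy as np

# Sketch of the target transform: shift all values positive, then
# log-transform so MSE penalizes relative rather than absolute error.
def make_transform(y_train, eps=1e-12):
    y_min = y_train.min()
    fwd = lambda y: np.log(y - y_min + eps)    # applied to training targets
    inv = lambda z: np.exp(z) + y_min - eps    # applied to model predictions
    return fwd, inv

y = np.array([-2e-9, 3e-12, 5e-4])   # currents spanning many decades
fwd, inv = make_transform(y)
assert np.allclose(inv(fwd(y)), y)   # transform is invertible
```

At inference time the inverse transform is applied to the network output, so the simulator always works with physical current values.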
Normalization of the data in regression problems helps the solver converge faster and improves numerical stability. Hence, normalization of inputs and outputs is typically implemented inside the solver, as in the Scikit-learn package [31] used in our implementation.
IV-D NN Size and Training for Logic Cells
To select the size of the hidden layer for each model, we repeated the training process for various neuron counts. Preliminary results in our experiments showed that the tanh nonlinearity provides better outcomes than other functions such as sigmoid and ReLU. As mentioned in Section III-B, no hyper-parameter tuning, e.g., of the learning rate or mini-batch size, is required for L-BFGS optimization.
The total number of generated data points is 500 per gate. We trained the NN with 90% of this data (5-fold cross-validation, 360 for training and 40 for validation) and then tested on the remaining 10%. The split between the training, validation, and test datasets was done at random.
Next, we applied a few noisy input samples to the cell and measured the waveform similarity error. The minimum hidden-layer size that met the accuracy threshold was chosen as the CSM-NN architecture for the logic cell in the specific MCMM setup. The complete architecture choices for INV and NAND2 under different MCMM setups are given in Table VI.
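The sizing loop above can be sketched as follows; `train_and_score` is a hypothetical stand-in for the characterize/train/measure steps, and the 2% threshold and candidate sizes are illustrative:

```python
# Sketch of the hidden-layer sizing loop: train increasingly wide
# single-hidden-layer models and keep the smallest one whose waveform
# error stays below the accuracy threshold.
def pick_hidden_size(train_and_score, sizes, threshold=0.02):
    for m in sorted(sizes):
        if train_and_score(m) <= threshold:
            return m
    return None   # no candidate met the accuracy target

# toy stand-in: measured error shrinks as the layer widens
err = {8: 0.05, 16: 0.03, 24: 0.015, 32: 0.01}
print(pick_hidden_size(err.get, sizes=err))   # smallest passing size: 24
```

Keeping the layer as narrow as the threshold allows is what lets the final parameter sets fit in a CPU L1 cache or GPU register files.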
IV-E Circuit Simulation
In this work, we evaluated our CSM-NN framework by simulating a full-adder circuit (schematic shown in Fig. 4).
For the sake of a fair comparison, the HSPICE characterization setup is the same for both CSM-NN and CSM-LUT. We measured the error by comparing output waveforms of HSPICE, as the baseline, with those of the CSM-NN simulations. The CPU and GPU devices used in our experiments are introduced in Table II and Table III, respectively. CSM-LUT is assumed to run on the CPU platform, as it does not benefit from GPU parallelization. The required computation resources and latencies are calculated using the equations in Section III-A. The results confirm that CSM-NN output waveforms match those of HSPICE in terms of propagation delay, with error values limited to 0.1%. To further confirm the high accuracy of CSM-NN, we compared its waveform similarity to HSPICE; as listed in Table VII, the waveform similarity error is limited to 2%.
V Conclusions and Future Work
CSM-NN, a scalable, technology-independent circuit simulation framework, is proposed. CSM-NN addresses the efficiency concerns of existing tools that depend on data queries from lookup tables stored in memory. Given the underlying CPU and GPU parallel processing capabilities, our framework replaces memorization with computation, utilizing a set of optimized NN structures and training and inference processing steps. The simulation latency of CSM-NN was evaluated in multiple MOSFET and FinFET technologies, based on predictive technology models, in various PVT corners and modes. The results confirm that CSM-NN improves simulation speed compared to a CSM-LUT baseline when using CPU platforms, and that it benefits further from the parallelization capabilities of GPUs. CSM-NN also provides high accuracy, maintaining the waveform similarity error within 2% of HSPICE. We believe the application of CSM-NN in future simulation tools, such as those for sign-off and MCMM analysis and optimization of advanced VLSI circuits, can significantly improve simulation accuracy and speed.
As part of our future work, we plan to investigate CSM-NN on industrial circuits using accurate foundry technology information, including PVT variations. We also plan to enhance our NNs to accept PVT corner parameters as inputs, so that the NNs can be trained once for all modes and corners, and to evaluate the resulting cost versus speed and accuracy trade-off.
Acknowledgement
This research was sponsored in part by a grant from the Software and Hardware Foundations (SHF) program of the National Science Foundation. The authors would also like to thank Soheil Nazar Shahsavani and Mahdi Nazemi (of the University of Southern California) for helpful discussions.
References
 [1] (201503) Optimal choice of FinFET devices for energy minimization in deeplyscaled technologies. In International Symposium on Quality Electronic Design (ISQED), Vol. , pp. 234–238. External Links: Document, ISSN Cited by: §IVA.
 [2] (2008) A current source model for CMOS logic cells considering multiple input switching and stack effect. In Design, Automation and Test in Europe (DATE), Vol. , pp. 568–573. External Links: Document, ISSN 15301591 Cited by: §I, Fig. 1, §II, §IIIB.
 [3] (2008) Current source based standard cell model for accurate signal integrity and timing analysis. Design, Automation and Test in Europe (DATE), pp. 574–579. External Links: Document Cited by: §I.
 [4] (200006) A survey of design techniques for systemlevel dynamic power management. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 8 (3), pp. 299–316. External Links: Document, ISSN Cited by: §I.

[5]
(2016)
A progressive batching LBFGS method for machine learning
. In International Conference on Machine Learning (ICML), Cited by: §IIIB.  [6] (Website) External Links: Link Cited by: §I.
 [7] (2003) Blade and razor: cell and interconnect delay analysis using currentbased models. In Design Automation Conference (DAC), Vol. , pp. 386–389. External Links: Document, ISSN Cited by: §I, §II.
 [8] (2001) Approximation with artificial neural networks. Master’s Thesis, Faculty of Sciences, Eötvös Loránd University, HungaryFaculty of Sciences, Eötvös Loránd University, Hungary. Cited by: §IIIA.
 [9] (2014) Semianalytical current source modeling of FinFET devices operating in near/subthreshold regime with independent gate control and considering process variation. In Asia and South Pacific Design Automation Conference (ASPDAC), pp. 167–172. Cited by: §I.
 [10] (200811) Statistical modeling of metalgate workfunction variability in emerging device technologies and implications for circuit design. In International Conference on ComputerAided Design (ICCAD), Vol. , pp. 270–277. External Links: Document, ISSN 10923152 Cited by: §IVA.
 [11] (200701) A currentbased method for short circuit power calculation under noisy input waveforms. In Asia and South Pacific Design Automation Conference (ASPDAC), Vol. , pp. 774–779. External Links: Document, ISSN Cited by: §I, §II.
 [12] (2006) Statistical logic cell delay analysis using a currentbased model. In Design Automation Conference (DAC), pp. 253–256. Cited by: §I, Fig. 1, §II.
 [13] (2008) Statistical waveform and current source based standard cell models for accurate timing analysis. In Design Automation Conference (DAC), pp. 227–230. Cited by: §I.
 [14] (2016) Deep learning. MIT Press. Note: http://www.deeplearningbook.org Cited by: §IIIB.
 [15] (2005) Current based delay models: a must for nanometer timing. Cadence Live Conference (CDNLive). Cited by: §I.
 [16] (2010) Efficient representation, stratification, and compression of variational CSM library waveforms using robust principle component analysis. In Design, Automation and Test in Europe (DATE), pp. 1285–1290. External Links: ISBN 9783981080162 Cited by: §II.
 [17] (2016-06) Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR). Cited by: §III-A.
 [18] (Website) External Links: Link. Cited by: §II.
 [19] (2018-10) Using machine learning to predict path-based slack from graph-based timing analysis. In International Conference on Computer Design (ICCD), pp. 603–612. Cited by: §I.
 [20] (2011-09) GPUs and the future of parallel computing. IEEE Micro 31 (5), pp. 7–17. Cited by: §I.
 [21] (2004-11) A robust cell-level crosstalk delay change analysis. In International Conference on Computer-Aided Design (ICCAD), pp. 147–154. Cited by: §I.
 [22] (2012) Current source modeling for power and timing analysis at different supply voltages. In Design, Automation and Test in Europe (DATE), pp. 923–928. Cited by: §I.
 [23] (2011) On optimization methods for deep learning. In International Conference on Machine Learning (ICML), pp. 265–272. Cited by: §III-B.
 [24] (2010-02) Process-variation effect, metal-gate work-function fluctuation, and random-dopant fluctuation in emerging CMOS technologies. IEEE Transactions on Electron Devices 57 (2), pp. 437–447. Cited by: §IV-A.
 [25] (1989-08) On the limited memory BFGS method for large scale optimization. Mathematical Programming 45 (1), pp. 503–528. Cited by: §III-B.
 [26] (2009-06) Comprehensive analysis of variability sources of FinFET characteristics. In Symposium on VLSI Technology, pp. 118–119. Cited by: §IV-A.
 [27] (2017) When and why are deep networks better than shallow ones? In AAAI Conference on Artificial Intelligence, pp. 2343–2349. Cited by: §III-A.
 [28] (2011-01) Accurate timing and noise analysis of combinational and sequential logic cells using current source modeling. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 19 (1), pp. 92–103. Cited by: §I.
 [29] (2008-03) Scalable parallel programming with CUDA. Queue 6 (2), pp. 40–53. Cited by: §III-A2.
 [30] (2006-08) Thermal modeling, analysis, and management in VLSI circuits: principles and methods. Proceedings of the IEEE 94 (8), pp. 1487–1501. Cited by: §I.
 [31] (2011-11) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §IV-C, §IV.
 [32] Predictive Technology Model from Arizona State University. Note: http://ptm.asu.edu/. Accessed: 2019-05-20. Cited by: §IV-A.
 [33] (2009) Large-scale deep unsupervised learning using graphics processors. In International Conference on Machine Learning (ICML), pp. 873–880. Cited by: §III-A2, §III-A.
 [34] (2016) An overview of gradient descent optimization algorithms. arXiv abs/1609.04747. Cited by: §III-B.
 [35] (2016) Practical statistical static timing analysis with current source models. In Design Automation Conference (DAC), pp. 113:1–113:6. Cited by: §III-B.
 [36] (Website) External Links: Link. Cited by: §I.
 [37] (2015-06) Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), pp. 1–9. Cited by: §III-A.
 [38] (2018-03) A mathematical theory of deep convolutional neural networks for feature extraction. IEEE Transactions on Information Theory 64 (3), pp. 1845–1866. Cited by: §III-A.
 [39] (2009-12) Physical model of the impact of metal grain work function variability on emerging dual metal gate MOSFETs and its implication for SRAM reliability. In International Electron Devices Meeting (IEDM), pp. 1–4. Cited by: §IV-A.
 [40] (2016-04) Analysis of 7/8-nm Bulk-Si FinFET technologies for 6T-SRAM scaling. IEEE Transactions on Electron Devices 63 (4), pp. 1502–1507. Cited by: §IV-A.