The down-scaling of transistor geometries has drastically increased the severity of short-channel effects and process-voltage-temperature (PVT) variations. Consequently, application-specific integrated circuit (ASIC) design-flow techniques such as multi-corner multi-mode (MCMM) and parametric on-chip variation (POCV) depend on increasingly complex analysis, transformation, and verification iterations to ensure the ASIC functions correctly and meets design demands related to performance, power, and signal integrity. In these methods, the design is tested across different PVT corners and operating modes, such as low-power (LP) and high-performance (HP). Accurate simulation, such as timing analysis during placement, clock network synthesis, and routing, is crucial: it lowers the number of design iterations, speeds up convergence, and plays a major role in the turnaround time of complex designs such as systems-on-chip (SoCs).
SPICE simulations are accurate but very slow for timing, power, and thermal analysis and optimization of modern ASIC designs with billions or trillions of transistors [30, 4]. Therefore, higher levels of circuit abstraction using approximation have been used to speed up simulation steps. Abstraction models are generally based on look-up tables (LUTs), closed-form formulations, factors, or combinations thereof. The traditional models, namely the nonlinear delay model (NLDM), nonlinear power model (NLPM), effective current source model (ECSM), and composite current source model (CCSM), use LUTs to store delay, noise, or power as nonlinear functions of physical, structural, and environmental parameters, and depend on voltage modeling more than current modeling. We refer to the NLDM, ECSM, and CCSM models as voltage-LUT (V-LUT) models throughout this paper. The V-LUT models are intuitively better choices than simple closed-form formulations of nonlinear functions; however, they tend to become increasingly inaccurate in capturing signal integrity and short-channel effects as technologies scale down.
Alternatively, current source models (CSMs) [7, 15, 21, 13, 2, 22, 28, 12, 11] use voltage-dependent current sources and possibly voltage-dependent capacitances to model logic cells. In addition to higher accuracy, another advantage of CSMs over V-LUT models is the ability to take arbitrary input signal waveforms and simulate realistic output waveforms.
The number of CSM component values that must be stored in memory grows exponentially with the number of inputs and internal nodes of the logic cell. For example, 6-dimensional LUTs are required to model a 3-input NAND gate (NAND3). While V-LUT models fit in smaller/faster memories such as the L1 cache, the relatively larger tables of CSM-LUT only fit into bigger/slower memories such as DRAM. A fundamental idea for shortening simulation time is therefore to replace some of this memorization with computation, aiming for optimal space/time efficiency.
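The exponential growth can be sketched with a few lines of arithmetic. The grid resolution (20 points per dimension) and entry size (4-byte floats) below are illustrative assumptions, not the characterization setup of this work, so the numbers will not exactly reproduce Table I:

```python
def lut_bytes(num_dims, points_per_dim=20, bytes_per_entry=4):
    """Storage for one CSM component LUT: one float per grid point."""
    return points_per_dim ** num_dims * bytes_per_entry

# Each extra pair of voltage dimensions multiplies the table size by
# points_per_dim**2 (here 400x).
for cell, dims in [("INV", 2), ("NAND2", 4), ("NAND3", 6), ("XOR2", 8)]:
    print(f"{cell}: {lut_bytes(dims):,} bytes")
```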
In prior work, a Semi-Analytical CSM (SA-CSM) was presented that uses small LUTs combined with nonlinear analytical equations to simultaneously achieve high modeling accuracy and space/time efficiency. However, developing analytical equations for complex circuits is a tedious process.
In this work, we propose CSM-NN, a circuit simulation framework that fully replaces LUTs with neural networks (NNs). This eliminates the long memory access latency of LUTs and hence significantly shortens simulation time, especially when CSM-NN computations can exploit the parallelism offered by graphical processing units (GPUs).
The major contributions of our work are as follows:
We developed a framework for simulating nonlinear behavior of complex integrated circuits using optimized NN structures as well as training and inference algorithms, according to the underlying CPU or GPU computational capabilities.
Our framework is scalable and technology-independent, i.e., it can efficiently handle increasingly complex technologies with high PVT variations while maintaining accuracy and improving simulation latency.
II Background

In this section, we briefly review the basics of CSM and the latency issues related to CSM-LUT memory access.
Each logic gate can be modeled using a voltage-dependent current source together with (Miller and output) capacitance components. The values of these components can be characterized using HSPICE simulations. The CSM components of a logic cell can be stored in LUTs and used for noise, timing, and power analysis of VLSI circuits [2, 11, 12, 16]. Fig. 1 illustrates CSMs for single-input (INV) and multi-input (NAND2) logic cells.
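To make the model concrete, the sketch below integrates the output-node equation of such a CSM, C_out · dVout/dt = −I(Vin, Vout), with forward Euler. The current function, capacitance value, and input step are all hypothetical toys standing in for characterized data:

```python
import numpy as np

def simulate_gate(i_out, c_out, vin_of_t, vdd=1.0, dt=1e-12, steps=400):
    """Forward-Euler integration of C_out * dVout/dt = -I(Vin, Vout)."""
    vout = np.empty(steps)
    vout[0] = vdd                      # output starts high (input low)
    for k in range(1, steps):
        i = i_out(vin_of_t(k * dt), vout[k - 1])
        vout[k] = max(vout[k - 1] - i * dt / c_out, 0.0)  # clamp at ground
    return vout

# Hypothetical pull-down current source and a step input at t = 50 ps.
toy_current = lambda vin, vout: 1e-4 * max(vin - 0.3, 0.0) * (vout > 0.0)
step_input = lambda t: 1.0 if t > 50e-12 else 0.0

v = simulate_gate(toy_current, c_out=1e-15, vin_of_t=step_input)
```

In CSM-LUT the `i_out` lookup is a multi-dimensional table query; in CSM-NN it becomes an NN inference.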
Given the large number of simulation runs needed during the ASIC design and verification flow, and the correspondingly long memory retrieval times, it is desirable to keep the number of dimensions and the size of the LUTs small. Table I lists the sizes of CSM LUTs for a simple library of basic gates.
The size of CSM-LUTs grows exponentially with logic cell complexity (cf. Table I). As an example, the NOR2 LUTs are 200 times larger than those for INV, and the XOR2 LUTs are 20,000 times larger than the NOR2 ones. Note that practical research or industrial standard cell libraries may contain many logic cells of various sizes and complexities, some more complex than the simple cells in Table I.
Table I. Size of CSM LUTs for basic logic cells

| Cell | LUT dimensions | Size |
| --- | --- | --- |
| INV | 2 | 1.6 KB |
| NAND2 | 4 | 320 KB |
| NOR2 | 4 | 320 KB |
| AOI | 6 | 48 MB |
| NAND3 | 6 | 48 MB |
| NOR3 | 6 | 48 MB |
| XOR2 | 8 | 6.4 GB |
Comparing the memory hierarchy of the Intel Broadwell micro-architecture in Table II with the sizes in Table I confirms that CSM LUTs cannot fit in any of the caches; they must be stored in main memory (DRAM) and written into the cache in parts. DRAM access latency is about two orders of magnitude higher than that of the L1 cache. This difference explains the much longer simulation latencies of CSM-LUT compared to V-LUT.
In the following two sections, we present how our CSM-NN eliminates the need for LUTs, and instead utilizes NNs to compute the CSM data.
Table II. Intel Broadwell micro-architecture memory hierarchy and Intel Xeon Processor E5-2699 v4 specifications

| Memory | Size (KByte) | Latency (clock cycles) |
| --- | --- | --- |
| L1 data cache | 32 | 4–5 |
| L3 cache | 20,480 | 38–42 |

| Intel Xeon Processor E5-2699 v4 | |
| --- | --- |
| Base frequency | 2.2 GHz |
| Single precision | 774.4 GFLOPS |
| Double precision | 1548.8 GFLOPS |
III CSM-NN Framework
This section describes our CSM-NN framework, including the NN architecture and the optimization algorithm used for training.
III-A NN Architecture and Computation
To avoid the large LUTs and long query latencies of CSM-LUT, CSM-NN embeds parametric nonlinear models, trained as fully-connected NNs, to represent the nonlinear CSM functions.
We believe CSM-NN can benefit from the following ML developments: (1) the evolution of novel ML algorithms can be utilized to improve the accuracy and efficiency of CSM-NN; and, more importantly, (2) the exponential increase in computational capabilities, especially with recent advances in GPU design, significantly improves the performance of CSM-NN.
CSM-NN substitutes memory retrieval with computation; it is therefore necessary to analyze and optimize the number, structure, and latency of the operations required by CSM-NN on different hardware platforms.
There are two steps in CSM-NN: (1) simulation, using a feed-forward pass that computes the model output from the trained parameters and input values, and (2) back-propagation, which adjusts the model parameters based on the error, i.e., the difference between the expected values from the training data and the model's estimated output. Since training is done only once, the cost of back-propagation is not a concern. Our objective is to improve circuit simulation time, so we focus mainly on the inference process, i.e., optimizing the computation of the feed-forward pass.
To choose the best NN architecture for CSM-NN, we note that the number of hidden layers and the number of neurons per hidden layer determine the total number of parameters of the input-output function and the flexibility of the model. One option is to increase the number of hidden layers beyond one (making the model deeper) instead of increasing the number of neurons in a single layer (making the layer wider). In deep neural networks (DNNs), the sequence of nonlinear activation layers gives the input-output dependency a higher degree of nonlinearity and more flexibility. Although open questions remain about the theoretical foundations of DNNs, the common belief is that multiple layers generalize better, since they learn intermediate features between the raw input data and the high-level output [27, 38]. As an example, thanks to the availability of data and computational resources in recent years, state-of-the-art solutions to challenging ML problems, such as image classification in computer vision, use models with hundreds of layers. Shallow networks, on the other hand, do not generalize as well but are very powerful at memorization. Moreover, deeper models require more data and time for training, as well as more computational resources for the feed-forward pass.
In conclusion, despite the recent emergence of DNN solutions and their potential to improve the accuracy of complex timing, noise, and power analysis, we do not consider DNNs a feasible choice for the CSM-NN architecture.
In the mathematical theory of artificial neural networks (ANNs), the universal approximation theorem affirms that a single-hidden-layer NN can approximate continuous functions with a finite number of neurons, under mild assumptions on the nonlinear activation function and given sufficient training data. Consequently, if a shallow wide network is trained with every possible input value, it can eventually memorize the corresponding outputs. The following characteristics of our problem further suggest that shallow wide networks with one hidden layer are the more plausible solution:
There are no discontinuities in the CSM component values.
While in many practical applications training data is limited or expensive to generate, in CSM-NN it is straightforward to generate training data with HSPICE simulations during the characterization process.
The number of inputs to the neural network is relatively small, even for complex logic cells and when PVT parameters are considered (Table I). This implies that we are modeling a low-dimensional function.
Based on these features, and considering the impact on the inference step during circuit simulation, CSM-NN adopts a simple NN architecture with a single hidden layer to model the nonlinear behavior of the CSM components. The architecture and input-output function are shown in Fig. 2 and Eq. 1.
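Eq. 1 is not reproduced in this text, so the sketch below assumes the standard single-hidden-layer regression form, y = w2 · tanh(w1 x + b1) + b2, with d inputs (node voltages), h hidden neurons, and one scalar output (one CSM component):

```python
import numpy as np

def csm_nn_forward(x, w1, b1, w2, b2):
    """Single-hidden-layer feed-forward pass (assumed form of Eq. 1):
    y = w2 . tanh(w1 @ x + b1) + b2."""
    return w2 @ np.tanh(w1 @ x + b1) + b2

# Example shapes: 3 node voltages in, 8 hidden neurons, scalar output.
d, h = 3, 8
rng = np.random.default_rng(0)
y = csm_nn_forward(rng.uniform(size=d),
                   rng.normal(size=(h, d)), rng.normal(size=h),
                   rng.normal(size=h), 0.0)
```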
The number of MUL operations in the feed-forward pass equals the number of model parameters, as calculated in Eq. 2. Importantly, there are no dependencies among the MUL operations within a layer, so they can be fully parallelized.
Using the notation of Eq. 1, the hidden layer requires per-neuron summations, which can also be fully parallelized. Computing the output requires a final summation over the hidden-layer values, which can be efficiently parallelized using a tree structure. The total number of ADD operations and the latency of the tree-structured summations are calculated in Eq. 3 and Eq. 4.
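Since Eqs. 2–4 are not reproduced here, the counts below are a rough sketch under standard assumptions (one MUL per weight, one ADD per accumulated term, log-depth ADD trees) for a d-input, h-neuron single hidden layer:

```python
from math import ceil, log2

def feedforward_ops(d, h):
    """Approximate MUL/ADD counts and tree-reduction latency for one output
    of a d-input, h-neuron single-hidden-layer NN (sketch of Eqs. 2-4)."""
    muls = h * d + h                 # all weight multiplies; mutually independent
    adds = h * d + h                 # per-neuron dot-product sums + output sum
    # Latency in ADD steps: log-depth tree per neuron, then over the hidden layer.
    depth = ceil(log2(d + 1)) + ceil(log2(h + 1))
    return muls, adds, depth
```

For example, with d = 4 and h = 8 the 40 multiplies all run in parallel, and the summation tree is only 7 ADD steps deep.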
CSM-NN accounts for the availability of resources when applying parallelization. NNs can be trained and deployed on two hardware platforms, namely CPUs and GPUs. The evolution of GPUs and CPUs in terms of floating-point operations per second (FLOPS) is shown in Fig. 3.
There are two phases of CSM-NN simulation computation when using CPUs: first, the weights of the NNs are loaded from the memory; and second, MUL and ADD operations are performed by arithmetic logic units (ALUs). As later described in Section IV, the number of CSM-NN parameters is sufficiently small. Therefore, they can fit into the cache (L1) of a CPU, and are accessible by the ALU in the order of a few CPU clock cycles.
The computational capabilities of GPUs have increased dramatically in the past decade, making them a good hardware platform for NN computation.
There are two levels of parallel processing units in a GPU: several multiprocessors (MPs), each containing several stream processors (SPs, also referred to as cores) that run the actual computation. Each core is equipped with ADD and MUL arithmetic units and dedicated register files. By implementing a trained NN (with fixed parameters) on a GPU, the weights of each operation can be stored in register files, so no retrieval from memory is required. We will show in Section IV that the NNs of our CSM-NN framework fit into a typical GPU. As an example, the hardware specifications of an NVIDIA GPU with CUDA cores are listed in Table III.
Table III. Hardware specifications of the NVIDIA GPU used in this work

| Streaming multiprocessors (SMs) | 56 |
| --- | --- |
| 32-bit FP CUDA cores (per SM / total) | 64 / 3584 |
| 64-bit FP CUDA cores (per SM / total) | 32 / 1792 |
| Register file per SM | 256 KB |
| Shared memory per SM | 96 KB |
| Register file per CUDA core | 4 KB |
| Total L1 cache | 64 KB |
| Base clock frequency | 1328 MHz |
| Single precision | 9519 GFLOPS |
It is worth noting that LUT-based models, such as CSM-LUT and the V-LUT models, depend only on memory queries, so GPUs do not improve their simulation time. Therefore, given the stronger parallelization capabilities of GPUs over CPUs, the speed advantage of CSM-NN over CSM-LUT and V-LUT grows when running on GPUs.
III-B Training Process
We adopted L-BFGS as the optimization technique for training the NNs of our CSM-NN framework, for the following reasons. Several gradient-descent-based optimization algorithms, such as stochastic gradient descent (SGD), Nesterov, Adagrad, and ADAM, are candidates for training neural regression models. SGD and its derivatives, such as ADAM, are by far the most popular algorithms for optimizing NNs. Their advantages over other techniques include parallelization, fast computation, and the use of mini-batch training for better generalization, especially in DNNs. However, these methods work well only if their training hyper-parameters are tuned appropriately. Quasi-Newton methods such as Broyden-Fletcher-Goldfarb-Shanno (BFGS), on the other hand, can be orders of magnitude faster than SGD. They measure the curvature of the objective function to select the length and direction of each step. The main shortcoming of BFGS is its high computation and memory cost for computing the inverse of the Hessian matrix on large datasets. Limited-memory BFGS (L-BFGS) is a quasi-Newton method that approximates the BFGS algorithm using a limited amount of memory.
Experimental results for low-dimensional problems show that L-BFGS produces highly competitive, and sometimes superior, models compared to SGD methods. Another important advantage of L-BFGS is that it requires tuning zero hyper-parameters (and only a few in advanced modified versions). For example, unlike SGD, the learning rate (step size) of L-BFGS is tuned internally. We also note that although several mini-batch versions of L-BFGS have recently been suggested in the literature, L-BFGS is generally a batch algorithm, so no batch-size adjustment is required. Considering these properties, we chose L-BFGS as the optimization technique for training the NNs in CSM-NN.
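Since the implementation is based on Scikit-learn (Section IV), L-BFGS training amounts to selecting `solver="lbfgs"` in `MLPRegressor`; no learning rate or batch size needs tuning. The synthetic target below is a stand-in for characterized CSM data, not the paper's actual dataset:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for characterization data: 500 (v1, v2) -> component value.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 2))
y = np.sin(3.0 * X[:, 0]) * X[:, 1]

# solver="lbfgs" selects L-BFGS; only the architecture itself is specified.
model = MLPRegressor(hidden_layer_sizes=(20,), activation="tanh",
                     solver="lbfgs", max_iter=2000, random_state=0)
model.fit(X, y)
```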
The common approach in supervised learning is to verify the generalization of the trained model using a validation (test) dataset that is completely separate from the training dataset; this guards against over-fitting. We therefore randomly select samples from the characterization data to test the accuracy of the model.
It is important to note that while the accuracy of the NNs in predicting CSM component values matters, accuracy should ultimately be measured by the quality of the output signal waveforms. Even measuring the propagation delay of a gate is not sufficient to confirm the accuracy of a CSM simulator. Therefore, similar to [2, 35], we use waveform similarity as the figure of merit for the accuracy of our CSM simulations, defined as the mean of the absolute difference between precise HSPICE and CSM-NN simulations, relative to the supply voltage of the technology, as shown in Eq. 5.
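A minimal sketch of this metric (assuming the textual definition above faithfully reflects Eq. 5, which is not reproduced in this text):

```python
import numpy as np

def waveform_error(v_ref, v_est, vdd):
    """Mean absolute difference between two waveforms, normalized to the
    supply voltage (assumed form of the Eq. 5 similarity metric)."""
    v_ref = np.asarray(v_ref, dtype=float)
    v_est = np.asarray(v_est, dtype=float)
    return float(np.mean(np.abs(v_ref - v_est)) / vdd)
```

A value of 0.02 then corresponds to the "within 2%" figure reported in Section IV.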
III-C CSM-NN Flow
Technology information and transistor-level standard cell libraries are provided by semiconductor manufacturers and design houses. Each cell in the standard library must be characterized separately for every PVT corner and mode setting. The number of MCMM settings depends on the technology and product design policy. The characterization process is usually very time intensive and can be performed at different resolutions: higher resolutions yield higher accuracy but require longer characterization times. Note that more data requires more memory in CSM-LUT and possibly a longer training process in CSM-NN. Choosing an appropriate resolution is therefore an important step in both flows. While our results in Section IV are technology specific, they suggest a range of acceptable characterization resolutions. Up to this point of the flow, the CSM-NN steps coincide with those of CSM-LUT.
The next step is to train the NNs, one per CSM component of a logic cell in a specific PVT corner (e.g., fast-fast and high temperature (FFHT)). The inputs of each NN are the voltages of the terminal and internal nodes, and the target output is the value of the CSM component at those voltage points.
The training data collected through characterization is first preprocessed and then used for training. As explained in Section III-A, a wider network can give a more accurate model but requires more computation. Hence, we need to find an appropriate layer size: we choose the smallest number of neurons such that the network passes a pre-defined accuracy threshold in terms of waveform similarity.
In the following section, we will show that this optimal set of NN parameters can fit into the cache (L1) of a typical CPU or the register files of a typical GPU. To simulate a circuit in a specific MCMM setup, the corresponding NN models of all logic cells in the standard library are loaded.
IV Experiments and Simulation Results
We implemented the simulator and the flow of our CSM-NN framework in Python. Our implementation is technology independent: it can characterize any given combinational circuit netlist and create NN models with a flexible, configurable setup. The NN implementation and training are based on the Scikit-learn package.
The CPU and GPU devices introduced in Table II and Table III are used to compare the two platforms; both products were introduced in the same year (2016) and their current retail prices are of the same order (about 5,000 USD). In the following, we discuss our experiments, including challenges specific to our problem setup.
IV-A Selected Technologies
To better evaluate CSM-NN, including its technology-independent characteristics, we performed our experiments on both MOSFET (16nm) and FinFET (20nm) device technologies from the Predictive Technology Model (PTM) packages. Two device types, namely low-standby-power (LP) and high-performance (HP), are used in our experiments.
As technology scales down, a growing number of physical and fitting parameters are needed to model PVT variations. However, as pointed out in [40, 10, 26, 39], only a few of them are dominant, i.e., simulation models that account for the dominant parameters while ignoring the rest still provide sufficiently high accuracy. Following these studies, we considered the most important process variation factors when defining a limited number of process corners. Since no process variation distribution information is available for the PTM technologies, we followed the same approach as prior work that studied the same devices to define the PVT corners.
All distributions except temperature are modeled as normal (Gaussian) and reported by their mean and standard deviation. The typical temperature is taken as 27°C and the highest temperature (variation) as 125°C. The distributions of the process variation parameters and the process corners defined for our experiments are provided in Table IV.
Table IV. PVT variation distributions and PVT variation in pre-defined corners
IV-B Characterization Resolution

The resolution of the characterization process is a key factor in the accuracy of both CSM-LUT and CSM-NN simulations. While more data points increase the accuracy of both simulators, they come at the cost of a longer characterization process, larger tables in CSM-LUT, and longer training time in CSM-NN. We therefore evaluate our CSM-NN framework under different resolutions. The results can also serve as a baseline suggestion for other technologies.
It should be mentioned that CSM components exhibit different sensitivity levels to different voltage-node variables. For example, a component may be more sensitive to one node voltage than another in INVX1, and can therefore be characterized with a lower resolution along the less sensitive dimension. Moreover, the appropriate characterization resolution for one CSM component need not be the same as for another: the range of values spanned during a single transition can differ by orders of magnitude between components. The resolution can also vary across the range of a voltage-node variable, e.g., higher resolution for the noisy parts of the waveform (with higher frequencies of change) and lower resolution for the smooth parts.
However, for the sake of simplicity, we use the same resolution for all voltage-node variables. Since the units differ across dimensions, we define three resolution setups, as explained in Table V. Based on preliminary results, the normal setup was found to be an appropriate resolution, and the remaining experiments use this setup.
Table V. Characterization resolution setups: S (soft), N (normal), and C (coarse)
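Characterization with a uniform per-dimension resolution can be sketched as a Cartesian grid of node voltages. The supply voltage and step size below are hypothetical placeholders, since the actual values of the Table V setups are not reproduced in this text:

```python
import itertools
import numpy as np

def characterization_grid(n_nodes, vdd, step):
    """All node-voltage combinations from 0 to Vdd with a uniform step
    (the same resolution is used for every voltage-node variable)."""
    axis = np.arange(0.0, vdd + step / 2.0, step)
    return list(itertools.product(axis, repeat=n_nodes))

# Hypothetical Vdd and step; each point becomes one HSPICE sample.
points = characterization_grid(n_nodes=2, vdd=0.85, step=0.05)
```

Halving the step roughly doubles the points per dimension, so the total number of HSPICE samples grows as (points per dimension) raised to the number of nodes.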
IV-C Preprocessing and Loss Function Modification
Mean squared error (MSE, also referred to as L2-norm error) is a commonly used regression loss function: the average of the squared distances between targets and predicted values. A regularization term can also be added to the loss to prevent overfitting by shrinking the model parameters. The values of the CSM components vary over a very large range. For example, in an INV, the DC current differs by orders of magnitude between the case where both transistors are on and the case where one is off and the cell is merely leaking. Because MSE is a function of absolute error, errors in small-scale values would matter far less than errors in large-scale values. To address this, we log-transform the output so that relative error is used in the regression loss, as shown in Eq. 6. One issue with this adjustment is that some values are negative, which complicates the log transform; we resolve it with a simple shift of the data toward positive values, subtracting the overall minimum from all data points.
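A minimal sketch of this shift-and-log preprocessing (the exact form of Eq. 6 is not reproduced in this text, and the small `eps` offset is an assumption added to keep the minimum point finite):

```python
import numpy as np

def log_transform(y, eps=1e-18):
    """Shift targets to be positive, then take the log, so the regression
    loss acts on relative rather than absolute error (sketch of Eq. 6)."""
    y = np.asarray(y, dtype=float)
    y_min = y.min()
    return np.log(y - y_min + eps), y_min

def inverse_log_transform(z, y_min, eps=1e-18):
    """Undo the shift-and-log transform to recover component values."""
    return np.exp(z) + y_min - eps

# Component values spanning many decades, including negatives.
y = np.array([-2.0e-6, 3.0e-9, 5.0e-4, 0.5])
z, y_min = log_transform(y)
```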
Normalizing the data in regression problems helps the solver converge faster and improves numerical stability. Normalization of inputs and outputs is typically implemented inside the solver, as is the case in the Scikit-learn package used in our implementation.
IV-D NN Size and Training for Logic Cells
To select the size of the hidden layer for each model, we repeated the training process over a range of candidate neuron counts. Preliminary experiments showed that the tanh nonlinearity provides better outcomes than other activation functions such as sigmoid and ReLU. As mentioned in Section III-B, no hyper-parameter tuning, e.g., of the learning rate or mini-batch size, is required for L-BFGS optimization.
The total number of generated data points is 500 per gate. We trained the NN with 90% of this data (using 5-fold cross-validation, i.e., 360 points for training and 90 for validation) and then tested on the remaining 10%. The split between training, validation, and test datasets was done at random.
Next, we applied a few noisy input samples to the cell and measured the waveform similarity error. The smallest hidden-layer size that met the accuracy threshold was chosen as the CSM-NN architecture for that logic cell in the specific MCMM setup. The complete architecture choices for INV and NAND2 are given in Table VI for different MCMM setups.
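The size-selection step above can be sketched as a simple search over candidate widths. The candidate widths, the synthetic data, and the MSE-based threshold below are illustrative assumptions (the paper's threshold is expressed in waveform similarity):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def smallest_passing_width(X_tr, y_tr, X_val, y_val, mse_threshold,
                           widths=(5, 10, 20, 40, 80)):
    """Return the smallest hidden-layer width whose validation MSE meets a
    (hypothetical) accuracy threshold, mirroring the selection step above."""
    for h in widths:
        m = MLPRegressor(hidden_layer_sizes=(h,), activation="tanh",
                         solver="lbfgs", max_iter=2000, random_state=0)
        m.fit(X_tr, y_tr)
        if np.mean((m.predict(X_val) - y_val) ** 2) < mse_threshold:
            return h
    return None  # no candidate passed; widen the search

rng = np.random.default_rng(1)
X = rng.uniform(size=(300, 2))
y = 0.4 * X[:, 0] + 0.6 * X[:, 1]          # easy synthetic target
w = smallest_passing_width(X[:250], y[:250], X[250:], y[250:], 1e-3)
```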
IV-E Circuit Simulation
In this work we evaluated our CSM-NN framework by simulating a full-adder circuit (schematic shown in Fig. 4).
For a fair comparison, the HSPICE characterization setup is the same for both CSM-NN and CSM-LUT. We measured accuracy by comparing the output waveforms of HSPICE, as the baseline, with those of the CSM-NN simulations. The CPU and GPU devices used in our experiments are introduced in Table II and Table III, respectively. CSM-LUT is assumed to run on the CPU platform, as it does not benefit from GPU parallelization. The required computation resources and latencies are calculated using the equations in Section III-A. The results confirm that CSM-NN output waveforms match those of HSPICE in terms of propagation delay, with error values limited to 0.1%. To further confirm the high accuracy of CSM-NN, we compared its waveform similarity to HSPICE; as listed in Table VII, the error is limited to 2%.
V Conclusions and Future Work
CSM-NN, a scalable, technology-independent circuit simulation framework, is proposed. CSM-NN aims to address the efficiency concerns of existing tools that depend on querying lookup tables stored in memory. Given the underlying CPU and GPU parallel processing capabilities, our framework replaces memorization with computation, utilizing a set of optimized NN structures and training and inference steps. The simulation latency of CSM-NN was evaluated in multiple MOSFET and FinFET technologies based on predictive technology models, across various PVT corners and modes. The results confirm that CSM-NN improves simulation speed over a CSM-LUT baseline on CPU platforms, and improves it further by exploiting the parallelization capabilities of GPUs. CSM-NN also provides high accuracy, keeping the waveform similarity error small compared to HSPICE. We believe the application of CSM-NN in future simulation tools, such as those for sign-off and MCMM analysis and optimization of advanced VLSI circuits, can significantly improve simulation accuracy and speed.
As part of our future work, we plan to investigate CSM-NN on industrial circuits using accurate foundry technology information including PVT variations. We also plan to enhance our NNs to account for PVT corner parameters as inputs, to be able to train NNs once for all modes and corners and evaluate the cost vs speed and accuracy trade-off.
This research was sponsored in part by a grant from the Software and Hardware Foundations (SHF) program of the National Science Foundation. The authors would also like to thank Soheil Nazar Shahsavani and Mahdi Nazemi (of the University of Southern California) for helpful discussions.
- (2015) Optimal choice of FinFET devices for energy minimization in deeply-scaled technologies. In International Symposium on Quality Electronic Design (ISQED), pp. 234–238.
- (2008) A current source model for CMOS logic cells considering multiple input switching and stack effect. In Design, Automation and Test in Europe (DATE), pp. 568–573.
- (2008) Current source based standard cell model for accurate signal integrity and timing analysis. In Design, Automation and Test in Europe (DATE), pp. 574–579.
- (2000) A survey of design techniques for system-level dynamic power management. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 8(3), pp. 299–316.
- A progressive batching L-BFGS method for machine learning. In International Conference on Machine Learning (ICML).
- (Website).
- (2003) Blade and razor: cell and interconnect delay analysis using current-based models. In Design Automation Conference (DAC), pp. 386–389.
- (2001) Approximation with artificial neural networks. Master's Thesis, Faculty of Sciences, Eötvös Loránd University, Hungary.
- (2014) Semi-analytical current source modeling of FinFET devices operating in near/sub-threshold regime with independent gate control and considering process variation. In Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 167–172.
- (2008) Statistical modeling of metal-gate work-function variability in emerging device technologies and implications for circuit design. In International Conference on Computer-Aided Design (ICCAD), pp. 270–277.
- (2007) A current-based method for short circuit power calculation under noisy input waveforms. In Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 774–779.
- (2006) Statistical logic cell delay analysis using a current-based model. In Design Automation Conference (DAC), pp. 253–256.
- (2008) Statistical waveform and current source based standard cell models for accurate timing analysis. In Design Automation Conference (DAC), pp. 227–230.
- (2016) Deep learning. MIT Press. http://www.deeplearningbook.org
- (2005) Current based delay models: a must for nanometer timing. Cadence Live Conference (CDNLive).
- (2010) Efficient representation, stratification, and compression of variational CSM library waveforms using robust principle component analysis. In Design, Automation and Test in Europe (DATE), pp. 1285–1290.
- Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR).
- (Website).
- (2018) Using machine learning to predict path-based slack from graph-based timing analysis. In International Conference on Computer Design (ICCD), pp. 603–612.
- (2011) GPUs and the future of parallel computing. IEEE Micro 31(5), pp. 7–17.
- (2004) A robust cell-level crosstalk delay change analysis. In International Conference on Computer-Aided Design (ICCAD), pp. 147–154.
- (2012) Current source modeling for power and timing analysis at different supply voltages. In Design, Automation and Test in Europe (DATE), pp. 923–928.
- (2011) On optimization methods for deep learning. In International Conference on Machine Learning (ICML), pp. 265–272.
- (2010) Process-variation effect, metal-gate work-function fluctuation, and random-dopant fluctuation in emerging CMOS technologies. IEEE Transactions on Electron Devices 57(2), pp. 437–447.
- (1989) On the limited memory BFGS method for large scale optimization. Mathematical Programming 45(1), pp. 503–528.
- (2009) Comprehensive analysis of variability sources of FinFET characteristics. In Symposium on VLSI Technology, pp. 118–119.
- When and why are deep networks better than shallow ones? In AAAI Conference on Artificial Intelligence, pp. 2343–2349.
- (2011) Accurate timing and noise analysis of combinational and sequential logic cells using current source modeling. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 19(1), pp. 92–103.
- (2008) Scalable parallel programming with CUDA. Queue 6(2), pp. 40–53.
- (2006) Thermal modeling, analysis, and management in VLSI circuits: principles and methods. Proceedings of the IEEE 94(8), pp. 1487–1501.
- (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
- Predictive Technology Model, Arizona State University. http://ptm.asu.edu/ (accessed 2019-05-20).
- Large-scale deep unsupervised learning using graphics processors. In International Conference on Machine Learning (ICML), pp. 873–880.
- (2016) An overview of gradient descent optimization algorithms. arXiv:1609.04747.
- (2016) Practical statistical static timing analysis with current source models. In Design Automation Conference (DAC), pp. 113:1–113:6.
- (Website).
- (2015) Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), pp. 1–9.
- (2018) IEEE Transactions on Information Theory 64(3), pp. 1845–1866.
- (2009) Physical model of the impact of metal grain work function variability on emerging dual metal gate MOSFETs and its implication for SRAM reliability. In International Electron Devices Meeting (IEDM), pp. 1–4.
- (2016) Analysis of 7/8-nm bulk-Si FinFET technologies for 6T-SRAM scaling. IEEE Transactions on Electron Devices 63(4), pp. 1502–1507.