1. Introduction
The need for computing power is steadily increasing across all computing domains, and has been rapidly accelerating recently, in part due to the emergence of machine learning, bioinformatics, and Internet-of-Things applications. With conventional device technology scaling slowing down and many traditional approaches to improving computing power reaching their limits, the focus is now increasingly on heterogeneous computing with application-specific circuits and specialized processors [1, 2].

Vector-by-matrix multiplication (VMM) is one of the most common operations in many computing applications, and the development of efficient hardware for it is therefore of the utmost importance. (We will use the VMM acronym to refer to both vector-by-matrix multiplication and vector-by-matrix multipliers.) Indeed, VMM is a core computation in virtually any neuromorphic network [3] and in many signal processing algorithms [4, 5]. For example, VMM is by far the most critical operation of deep-learning convolutional classifiers [6, 7]. Theoretical studies have shown that sub-8-bit precision is typically sufficient for inference [8], which is why the most recent high-performance graphics processors support 8-bit fixed-point arithmetic [9]. Low precision is also adequate for lossy compression algorithms, e.g. those based on the discrete cosine transform, which rely heavily on VMM operations [10]. In another recent study, dense low-precision VMM accelerators were proposed to dramatically improve the performance and energy efficiency of linear sparse-system solvers, which may take months of processing time on conventional modern supercomputers [11]. In such solvers, the solution is first approximated with dense low-precision VMM accelerators, and then iteratively refined using traditional digital processors.

The most promising implementations of low- to medium-precision VMMs are arguably based on analog and mixed-signal circuits [12]. In a current-mode implementation, the multi-bit inputs are encoded as analog voltages/currents, or as digital voltage pulses [13], which are applied to one set of (e.g., row) electrodes of an array of adjustable-conductance crosspoint devices, such as memristors [7, 13] or floating-gate memories [14, 5, 15], while the VMM outputs are represented by the currents flowing into the column electrodes. The main drawback of this approach is its energy-hungry and area-demanding peripheral circuits, which, e.g., rely on large static currents to provide an accurate virtual ground in the memristor implementation.
In principle, the switched-capacitor approach does not have this deficiency, since the computation is performed only by moving charge between capacitors [16]. Unfortunately, the benefits of such potentially extremely energy-efficient computation are largely negated at the interface, since reading the output, or cascading multiple VMMs, requires energy-hungry ADCs and/or analog buffers. A perhaps even more serious drawback, significantly impacting density (and, as a result, other performance metrics), is the lack of adjustable capacitors. Instead, each crosspoint element is typically implemented with a set of fixed binary-weighted capacitors coupled to a number of switches (transistors), which are controlled by the digitally stored weight.
In this paper we propose to perform vector-by-matrix multiplication in the time domain, combining the configurability and high density of the current-mode implementation with the energy efficiency of the switched-capacitor VMM, while avoiding the costly I/O conversion of the latter. Our approach draws inspiration from prior work on time-domain computing [17, 18, 19, 20, 21], but differs in several important aspects. The main difference with respect to Refs. [21, 19, 20] is that our approach allows precise four-quadrant VMM with analog inputs and weights. Unlike the work presented in Ref. [17], there is no weight-dependent scaling factor in the time-encoded outputs, making it possible to chain multipliers together to implement functional large-scale circuits entirely in the time domain. Inputs were encoded in the duration of digital pulses in recent work [13]. However, that approach shares many of the problems of current-mode VMMs, most importantly its costly I/O conversion circuits. Moreover, due to the resistive synapses considered in Ref. [13], the output line voltage must be pinned for accurate integration, which results in large static power consumption. Finally, in this paper we present preliminary, post-layout performance results for a simple though representative circuit. Our estimates are based on the layout of the circuit in a 55 nm process with embedded NOR flash memory floating-gate technology, which is very promising for the proposed time-domain computing.

2. Time-Domain Vector-by-Matrix Multiplier
2.1. Single-Quadrant Dot-Product Operation
Let us first focus on implementing an N-element time-domain dot-product (weighted-sum) operation of the form

(1)    g = (1/N) Σ_{i=1..N} w_i x_i,

with nonnegative inputs x_i and output g, and with weights w_i in the range [0, 1]. Note that, for the convenience of chaining multiple operations, the output range is similar to that of the input due to the 1/N normalization.
The i-th input x_i and the dot-product output g are time-encoded, respectively, with digital voltages V_i(t) and V_out(t), such that:

(2)    V_i(t) = 0 for t < T − x_i, and V_i(t) = V_DD for t ≥ T − x_i;

(3)    V_out(t) = 0 for t < 2T − g, and V_out(t) = V_DD for t ≥ 2T − g.

Here, the value of x_i represents a multi-bit (analog) input, defined within the time window [0, T], while the value g represents a multi-bit (analog) output, observed within the time window [T, 2T]. With such definitions, the maximum values of the inputs and the output are always equal to T and their minimum values are 0, while V_i(t) = V_DD for t ≥ T and V_out(t) = V_DD for t ≥ 2T.
The time-encoded voltage inputs are applied to the array's row electrodes, which connect to the control inputs of the adjustable current sources, i.e., the gate terminals of transistors (Fig. 1a). A V_i = 0 input is assumed to turn off the current source (transistor). On the other hand, application of the voltage V_DD initiates a constant current I_i, specific to the programmed value of the i-th current source, flowing into the column electrode. This current charges the capacitor C, which comprises the capacitance of the column (drain) line and an intentionally added external capacitor. When the capacitor voltage reaches the threshold V_th, the output of the digital buffer switches from 0 to V_DD at time t = 2T − g, which is effectively the time-encoded output of the dot-product operation.
Assuming, for simplicity, a negligible dependence of the currents injected by the current sources on the capacitor voltage (e.g., achieved by biasing the crosspoint transistors in the subthreshold regime), the charging dynamics of the capacitor are given by the dot product of the currents I_i θ(t − t_i), where θ is the Heaviside function and t_i = T − x_i is the turn-on time of the i-th source, with their corresponding time intervals, i.e.

(4)    V_C(t) = (1/C) Σ_{i=1..N} I_i (t − t_i) θ(t − t_i).
Before deriving the expression for the output g, let us note two important conditions imposed by our assumptions on the minimum and maximum values of the output. First, an additional (weight-dependent) bias current I_bias is added to the sum in Eq. 4, with its turn-on time always t = T, to make the output equal to 0 for the smallest possible value of the dot product, represented by x_i = 0 for all inputs and a threshold crossing at t = 2T. Second, the largest possible output g = T, which corresponds to the largest values of the inputs x_i = T and of the current sources I_i = I_max, is ensured by

(5)    C V_th = N I_max T.
Using Eq. 5 in Eq. 4 and noting that θ(t − t_i) is always 1 for t ≥ T, the relation between the time-encoded output and the inputs is described by Eq. 1 if we assume that the currents are

(6)    I_i = I_max w_i,

and the bias current is

(7)    I_bias = I_max Σ_{i=1..N} (1 − w_i).
2.2. Four-Quadrant Multiplier
The extension of the proposed time-domain dot-product computation to a time-domain VMM is achieved by utilizing an array of current-source elements and performing multiple weighted-sum operations in parallel (Fig. 1a). The implementation of a four-quadrant multiplier, in which inputs, outputs, and weights can be of both polarities, is shown in Figure 1c. In this differential-style implementation, dedicated wires are utilized for the time-encoded positive and negative inputs/outputs, while each weight is represented by four current sources. The magnitude of the computed output is encoded by the delay, just as discussed for the single-quadrant dot-product operation, while the output's sign is explicitly defined by the corresponding wire of a pair. To multiply an input by a positive weight, the two current sources connecting like-polarity wires are set to the same value according to Eq. 7, with the other current-source pair turned off, while it is the opposite for multiplication by negative weights. Note that with such an implementation, in the most general case, the input/output values are represented by time-encoded voltage pulses on both the positive and negative wires of a pair, with, e.g., g+ > g− corresponding to a positive output (Fig. 1d), and a negative (or zero, when g+ = g−) output otherwise.
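As a sanity check on the differential scheme, the following sketch (our own illustration; it models only the rail splitting, not the pulse timing) shows how signed inputs and weights map onto the four current sources per weight, and how the signed result is recovered as the difference between the positive- and negative-rail outputs.

```python
def four_quadrant_vmm_element(x, w):
    """Differential model of the four-quadrant scheme: each signed value
    is split across a (positive, negative) wire pair, each weight drives
    four current sources, and the signed result is the difference between
    the two single-quadrant rail outputs."""
    N = len(x)
    xp = [max(v, 0.0) for v in x]; xn = [max(-v, 0.0) for v in x]
    wp = [max(v, 0.0) for v in w]; wn = [max(-v, 0.0) for v in w]
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b)) / N
    g_pos = dot(xp, wp) + dot(xn, wn)   # charge driving the positive wire
    g_neg = dot(xp, wn) + dot(xn, wp)   # charge driving the negative wire
    return g_pos - g_neg

# agrees with the direct signed dot product (-1*0.5 + 0.5*-0.25) / 2
print(four_quadrant_vmm_element([0.5, -0.25], [-1.0, 0.5]))  # -0.3125
```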
The output of the proposed time-domain VMM is always computed within a fixed, T-long window. Its maximum and minimum values are independent of the particular set of utilized weights and always correspond to T and 0, which is different from prior proposals [17]. This property of our approach is convenient for implementing large-scale circuits entirely in the time domain. For example, VMM outputs can be supplied directly to another VMM block or to some other time-domain circuitry, e.g. one implementing Race Logic [22].
3. Case Study: Perceptron Network
3.1. Design Methodology
The design methodology for implementing larger circuits is demonstrated on the example of a specific circuit comprised of two VMMs and a nonlinear ("rectify-linear") function block, which is representative of a multilayer perceptron network (Fig. 2a, b). Figure 2c shows the gate-level implementation of the VMM and rectify-linear circuits, which is suitable for pipelined operation. Specifically, the thresholding (digital buffer) is implemented with an SR latch. The rectify-linear functionality is realized with just one AND gate, which takes its inputs from the two latches serving a differential pair. The AND gate generates a voltage pulse whose duration encodes positive outputs from the first VMM (see, e.g., Fig. 1d), or stays at zero voltage for negative ones.
It is important to note that in the considered implementation of the rectify-linear function, the inputs to the second VMM are encoded not in the rising-edge time of the voltage pulse, i.e. t = 2T − g, but rather in its duration. Specifically, in this case the input is encoded by a g-long voltage pulse in phase I (the first T-long time interval), and an always T-long voltage pulse during phase II (the second T-long time interval). Such pulse-duration encoding is a more general case of the scheme presented in Section 2.1. Indeed, in our approach, each product term in the dot-product computation is contributed by the total charge injected into an output capacitor by one current source, which in turn is proportional to the total time that the current source is on. In the scheme discussed in Section 2.1, a voltage pulse always ends at the close of the corresponding time window, and thus encoding in the rising-edge time of a pulse is equivalent to encoding in the pulse duration.
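A minimal model of the AND-gate rectification (our own illustration, with assumed names, and assuming the negative-rail latch output is inverted at the gate input): each latch fires at t = 2T − g on its rail, and the gate output is high only between the two rising edges, so the output pulse duration equals max(g+ − g−, 0).

```python
def relu_pulse_width(g_plus, g_minus, T=1.0):
    """Time-domain rectify-linear: the AND gate is high while the
    positive-rail latch has already fired (t >= 2T - g_plus) and the
    negative-rail latch has not yet (t < 2T - g_minus)."""
    t_p = 2.0 * T - g_plus      # positive-rail rising edge
    t_n = 2.0 * T - g_minus     # negative-rail rising edge
    return max(t_n - t_p, 0.0)  # pulse overlap = max(g+ - g-, 0)

print(relu_pulse_width(0.75, 0.25))  # 0.5: positive output passes through
print(relu_pulse_width(0.25, 0.75))  # 0.0: negative output is clipped
```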
Additional pass gates, one per output line, are controlled by RESET signals (Fig. 2c) and are used to precharge the output capacitors before starting a new computation. Controlled by the SET signal, the output OR gate is used to decouple the computations in two adjacent VMMs. Specifically, the OR gate and SET signal allow generating phase II's T-long pulses applied to the second VMM and, at the same time, precharging and starting a new phase I computation in the first VMM. Using appropriate periodic synchronous SET and RESET signals, pipelined operation is established with a period equal to T plus the time needed to precharge the output capacitor (Fig. 2d).
Though the slope of the activation function is equal to one in the considered implementation, it can be easily controlled by appropriate scaling of the VMM weights (in either one or both VMMs). Also, because of the strictly positive inputs, in principle only a two-quadrant multiplier is needed for the second layer, which is easily implemented by removing all input wires carrying negative values in the four-quadrant design (Fig. 1c).

3.2. Embedded NOR Flash Memory Implementation
We have designed a two-layer perceptron network, based on two four-quadrant multipliers, in a 55 nm CMOS process with modified embedded ESF3 NOR flash memory technology [23]. In this technology, the erase gate lines in the memory cell matrix were rerouted (Fig. 2c) to enable precise individual tuning of the floating-gate (FG) cells' conductances. Details on the redesigned structure, the static and dynamic I-V characteristics, the analog retention, and the noise of the floating-gate transistors, as well as the results of high-precision tuning experiments, can be found in Ref. [15].
The network, which features 10 inputs and 10 hidden-layer / output neurons, is implemented with two identical arrays of supercells, CMOS circuits for the pipelined VMM operation and the rectify-linear transfer function as described in the previous subsection, as well as CMOS circuitry for programming and erasure of the FG cells (Fig. 3). During operation, all FG transistors are biased in the subthreshold regime. The input voltages are applied to the control gate lines, while the output currents are supplied by the drain lines. Because the FG transistors are N-type, the output lines are not discharged to ground by the RESET signal, but rather charged to V_DD, i.e. above the threshold voltage of the SR latch. In this case, the VMM operation is performed by sinking currents via the adjustable current sources based on FG memory cells.
4. Design Tradeoffs and Performance Estimates
Understanding the true potential of the proposed time-domain computing would obviously require choosing optimal operating conditions and carefully tuning the CMOS circuit parameters. Furthermore, the optimal solution will differ depending on the specific optimization goals and constraints, such as VMM size and precision of operation. Here, we discuss the important trade-offs and factors at play in the optimization process, focusing specifically on computing precision. We then provide preliminary estimates for area, performance, and energy efficiency.
4.1. Precision
There are a number of factors which may limit computing precision. The weight precision is affected by the tuning accuracy, the drift of the analog memory state, and drain current fluctuations due to the intrinsic noise of the cell. These issues can be further aggravated by variations in ambient temperature. It has previously been shown that, at least in small-scale current-mode VMM circuits based on similar 55 nm flash technology, even without any optimization, all these factors combined may allow up to 8-bit effective precision for the majority of the weights [15]. We expect that, just as for current-mode VMM circuits, the temperature sensitivity will be improved by the differential design and by the use of higher drain currents, which is generally desired for an optimal design.
Our analysis shows that, for the considered memory technology, the main challenge for VMM precision is the non-negligible dependence of the FG transistors' subthreshold currents on the drain voltage (Fig. 4), due to drain-induced barrier lowering (DIBL). To cope with this problem, we minimized the relative output error (denoted Error below). In general, Error depends on the precharged drain voltage, the voltage swing on the drain line, the control and select gate voltages, and the maximum drain current utilized for weight encoding. In our initial study, we fixed the latch threshold voltage, which cannot be too small due to subthreshold operation, but otherwise has weak impact on Error. Furthermore, for simplicity, we assumed a drain voltage swing that cannot be too low because of the static and short-circuit leakages in CMOS gates (see below), and the standard CMOS logic supply voltage of the 55 nm process. We found that the drain current is especially sensitive to the select gate voltage, with a distinct optimum (Fig. 4a). This is apparently due to the shorter effective channel length, and hence more severe DIBL, at higher select gate voltages, and a voltage-divider effect at lower ones. Furthermore, the drain-voltage dependence is smallest at higher drain currents of about 1 μA (Fig. 4a), though it is naturally bounded by the upper limit of subthreshold conduction (Fig. 4b). At such optimal conditions, Error is small enough to ensure at least 5 bits of computing precision (Fig. 4a).
As with all FG-based analog computing, the majority of process variations, a typical concern for any analog or mixed-signal circuit, can be efficiently compensated by adjusting the currents of the FG devices, provided that such variations can be properly characterized. This, e.g., includes threshold-voltage mismatches (which can be up to 20 mV rms for the implemented SR latch, according to our Monte Carlo simulations). The only input-dependent error, which cannot be easily compensated, is due to variations in the slope of the transfer characteristics of the SR latch gates. However, this does not seem to be a serious issue for our design, given that the threshold for the drain voltage is always crossed at the same voltage slew rate. Also, note that variations in the subthreshold slope are not important for our circuit because of the digital input voltages.
VMM precision can also be impacted by factors similar to those of the switched-capacitor approach, including leakage via OFF-state FG devices, channel charge injection from the pass transistors during the RESET phase, and capacitive coupling of the drain lines. Fortunately, the dominant coupling, between the D and CG lines, is input-independent and can again be compensated by adjusting the weights.
4.2. Latency, Energy, and Area
For a given VMM size N, the latency (i.e. 2T) is decreased by using higher currents and/or reducing the drain voltage swing. The obvious limitations to both are the degradation in precision, as discussed in the previous section, and the intrinsic parasitics of the array, most importantly the drain line capacitance. Our post-layout analysis shows that, for the implemented circuit at the optimal operating conditions, the latency per single bit of computing precision is on the nanosecond scale and roughly doubles with each additional bit, e.g., reaching tens of nanoseconds for a 6-bit VMM. These estimates are somewhat pessimistic because of the conservative choice of the external capacitance per input, which ensures <0.5% voltage drop on the drain line due to capacitive coupling. (Note that in the implemented VMM circuit the intrinsic delay does not scale with the array size because of the utilized bootstrapping technique.)
The energy per operation comprises a dynamic component, from charging/discharging the control gate and drain lines as well as the external output capacitors, and a static component, including VDD-to-ground and short-circuit leakages in the digital logic. Naturally, the CMOS leakage currents are suppressed exponentially by increasing the drain voltage swing, and can be further reduced by lowering the CMOS transistor currents (i.e., increasing the length-to-width ratio), while still keeping the propagation delay negligible compared to T. The increase in the drain swing, however, has a negative impact on VMM precision (Fig. 4b) and on the dynamic energy. Determining the optimal value of the drain swing, and by how much the CMOS transistor currents can be reduced without negatively impacting precision, is important future research. Estimates based on the implemented layout (with a rather suboptimal value of the external capacitance) show that the total energy is about 5.44 pJ per operation of the implemented 10x10 VMM, with the static energy contributing a substantial fraction of the total budget. The energy efficiency improves significantly for larger VMMs, e.g. for N = 100, due to the reduced contribution of the static energy component. It becomes even larger for N = 1000, at which point it is completely dominated by the dynamic energy related to charging/discharging the external capacitors.
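The scaling argument can be summarized with a toy energy model (our own illustration; the constants are placeholders, not the paper's extracted numbers): the dynamic energy is paid per multiply-accumulate, while the static leakage is paid per output line, so the efficiency of an N x N VMM rises toward the dynamic-only limit as N grows.

```python
def tops_per_joule(N, e_mac=1e-14, e_static_line=5e-13):
    """Toy model of an N x N time-domain VMM: dynamic energy scales with
    the N*N multiply-accumulates, static leakage energy with the N output
    lines. Constants are placeholders chosen purely for illustration."""
    ops = 2.0 * N * N                        # one multiply and one add per cell
    energy = N * N * e_mac + N * e_static_line
    return ops / energy / 1e12               # TOps/J

for n in (10, 100, 1000):
    print(n, tops_per_joule(n))              # efficiency grows with N
```

With these placeholder constants the efficiency climbs toward the asymptote 2/e_mac as the per-operation static share, proportional to 1/N, vanishes.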
The area breakdown by circuit component was accurately evaluated from the layout (Fig. 3). Clearly, because of the rather small implemented VMMs, the peripheral circuitry dominates, with one neuron block occupying 1.5x more area than the whole supercell array. However, with larger and more practical array sizes (e.g. N = 1000), the area is completely dominated by the memory array and the external capacitors, which together occupy almost all of the total area of the conservative design.
In some cases, e.g. convolutional layers in deep neural networks, the same matrix of weights (kernels) is utilized repeatedly to perform a large number of multiplications. To increase density, VMM operations are then performed using a time-division-multiplexing scheme, which necessitates storing temporary results and, for our approach, performing conversion between digital and time-domain representations. Fortunately, the conversion circuitry for the proposed VMM can be very efficient due to the digital nature of the time-encoded input/output signals. We have designed such circuitry, in which the input conversion is performed with a shared counter and a simple comparator-latch that creates the time-modulated pulse, while the pulse-encoded outputs are converted to digital signals with the help of a shared counter and a multi-bit register. Figure 5 summarizes the energy and area for a time-domain multiplier based on the conservative design, in particular showing that the overhead of the I/O conversion circuitry drops quickly and becomes negligible as the VMM size increases.

With a more advanced design, which will require more detailed simulations, the external capacitor can be significantly scaled down or eliminated completely. (For example, the capacitive coupling can be efficiently suppressed by adding dummy input lines and using differential input signaling.) In this case, the latency and energy will be limited by the intrinsic parasitics of the memory cell array and can be, e.g., below 2 ns and 1 fJ per operation, respectively, for a 6-bit VMM. Moreover, the energy and latency are expected to improve with the scaling of CMOS technology (decreasing proportionally to the feature size).
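The shared-counter I/O conversion described above admits a one-line model per direction (our own illustration, with a hypothetical clock period t_clk): the comparator-latch fires when the free-running counter reaches the input code, and the output register simply latches the counter value at the moment of the output edge.

```python
def code_to_edge(code, t_clk=1.0):
    """Digital-to-time conversion: the comparator-latch fires when the
    shared counter reaches `code`, producing an edge at t = code * t_clk."""
    return code * t_clk

def edge_to_code(t_edge, t_clk=1.0):
    """Time-to-digital conversion: the multi-bit register latches the
    shared counter value at the moment of the output edge."""
    return int(t_edge // t_clk)

# a 6-bit code survives the round trip through the time domain
print(edge_to_code(code_to_edge(37)))  # 37
```

Because both directions share one counter across all lines, the per-line hardware is just a comparator-latch or a register, which is why the conversion overhead shrinks relative to the array as the VMM size grows.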
The proposed approach compares very favourably with previously reported work, such as a current-based 180 nm FG/CMOS VMM with measured 5,670 GOps/J [14], a current-based 180 nm CMOS 3-bit VMM with estimated 6,390 GOps/J [12], a switched-capacitor 40 nm CMOS 3-bit VMM with measured 7,700 GOps/J [16], a memristive 22 nm 4-bit VMM with estimated 60,000 GOps/J [7], and a ReRAM-based 14 nm 8-bit VMM with estimated 181.8 TOps/J [13]. Note that the most impressive reported energy-efficiency numbers do not account for the costly I/O conversion specific to those designs.
5. Summary
We have proposed a novel time-domain approach for performing vector-by-matrix computation, and then presented a design methodology for implementing larger-scale circuits, based on the proposed multiplier, entirely in the time domain. As a case study, we have designed a simple multilayer perceptron network, comprising two layers of four-quadrant vector-by-matrix multipliers, in a 55 nm process with embedded NOR flash memory technology. In our performance study, we have focused on a detailed characterization of the most important factor limiting precision, and then discussed the key trade-offs and performance metrics. The post-layout estimates for the conservative design, which also includes the I/O circuitry for converting between digital and time-domain representations, show tens of TOps/J energy efficiency at several bits of computing precision. A much higher energy efficiency, exceeding the POps/J threshold, can potentially be achieved by using a more aggressive CMOS technology and an advanced design, though this opportunity requires further investigation.
References
 [1] Norman P. Jouppi and Cliff Young et al. In-datacenter performance analysis of a tensor processing unit. arXiv preprint arXiv:1704.04760, 2017.
 [2] Ikuo Magaki and Moein Khazraee et al. ASIC clouds: Specializing the datacenter. Proc. ISCA'16, pages 178-190, June 2016.
 [3] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521:436-444, 2015.
 [4] Manuel de la Guia Solaz and Richard Conway. Razor based programmable truncated multiply and accumulate, energy-reduction for efficient digital signal processing. IEEE TVLSI, 23:189-193, 2014.
 [5] Tyson S. Hall and Christopher M. Twigg et al. Large-scale field-programmable analog arrays for analog signal processing. IEEE TCAS I, 52(11):2298-2307, 2005.
 [6] Bert Moons and Roel Uytterhoeven et al. Envision: A 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28 nm FDSOI. Proc. ISSCC'17, pages 246-257, Feb. 2017.
 [7] Miao Hu and John Paul Strachan et al. Dot-product engine for neuromorphic computing: Programming 1T1M crossbar to accelerate matrix-vector multiplication. Proc. DAC'16, pages 1-6, June 2016.
 [8] Itay Hubara and Matthieu Courbariaux et al. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, Sep. 2016.
 [9] NVIDIA. NVIDIA P40. http://images.nvidia.com/content/pdf/tesla/184427TeslaP40DatasheetNVFinalLetterWeb.pdf.
 [10] A. M. Raid and W. M. Khedr et al. JPEG image compression using discrete cosine transform - a survey. arXiv preprint arXiv:1405.6147, May 2014.
 [11] Isaac Richter and Kamil Pas et al. Memristive accelerator for extreme scale linear solvers. Proc. GOMACTech'15, Mar. 2015.
 [12] Jonathan Binas and Daniel Neil et al. Precise deep neural network computation on imprecise low-power analog hardware. arXiv preprint arXiv:1606.07786, June 2016.
 [13] Matthew J. Marinella and Sapan Agarwal et al. Multiscale co-design analysis of energy, latency, area, and accuracy of a ReRAM analog neural training accelerator. arXiv preprint arXiv:1707.09952, 2017.
 [14] Farnood Merrikh-Bayat and Xinjie Guo et al. Sub-1-us, sub-20-nJ pattern classification in a mixed-signal circuit based on embedded 180-nm floating-gate memory cell arrays. arXiv preprint arXiv:1610.02091, Oct. 2016.
 [15] Xinjie Guo and Farnood M. Bayat et al. Temperature-insensitive analog vector-by-matrix multiplier based on 55 nm NOR flash memory cells. Proc. CICC'17, Apr. 2017.
 [16] Edward H. Lee and S. Simon Wong. A 2.5 GHz 7.7 TOPS/W switched-capacitor matrix multiplier with co-designed local memory in 40 nm. Proc. ISSCC'16, pages 418-419, Jan. 2016.
 [17] Vishnu Ravinuthula and Vaibhav Garg et al. Time-mode circuits for analog computation. Int. J. Cir. Theory App., 37:631-659, 2009.
 [18] Wolfgang Maass and Christopher M. Bishop. Pulsed Neural Networks. MIT Press, 1st edition, 2001.
 [19] Takashi Tohara and Haichao Liang et al. Silicon nanodisk array with a fin field-effect transistor for time-domain weighted sum calculation toward massively parallel spiking neural networks. Applied Physics Express, 9, 2016. Art. 034201.
 [20] Takashi Morie and Haichao Liang et al. Spike-based time-domain weighted-sum calculation using nanodevices for low power operation. Proc. IEEE Nano'16, pages 390-392, Aug. 2016.
 [21] Daisuke Miyashita and Shouhei Kousai et al. A neuromorphic chip optimized for deep learning and CMOS technology with time-domain analog and digital mixed-signal processing. IEEE JSSC, in press, 2017.
 [22] Advait Madhavan, Timothy Sherwood, and Dmitri Strukov. A 4-mm 180-nm-CMOS 15-giga-cell-updates-per-second DNA sequence alignment engine based on asynchronous race conditions. Proc. CICC'17, Apr. 2017.
 [23] SST. SST's ESF3 technology. http://www.sst.com/technology/superflashtechnology.aspx.