1 Introduction
Deep neural networks (DNNs) are built of layers that primarily perform dot-product operations between activations and weights. These basic operations are at the core of DNNs that achieve state-of-the-art results in different domains [hannun2014deep, girshick2014rich, silver2016mastering]. Yet, DNNs comprise abundant computations; for example, state-of-the-art convolutional neural networks (CNNs) may require billions of multiply-and-accumulate (MAC) operations to classify a single image [hannun2014deep, sze2017efficient]. Their great potential and computational burden have been fertile ground for the research and development of efficient DNN hardware accelerators over the last decade [reuther2019survey, jouppi2017datacenter, chen2014dadiannao].

The control flow of DNNs is mostly predictable, yet computations are still executed inefficiently on the underlying hardware. For example, DNNs may consist of many zero-valued activations and weights [albericio2016cnvlutin]. During inference, a layer output is usually followed by a ReLU activation function, which clamps negative activation values to zero [nair2010rectified]. In addition, static pruning techniques push the limits of model sparsity by zeroing out insignificant weights [han2015learning, li2016pruning]. Zeros can also be found at finer granularities [lascorz2019shapeshifter]; a quantized 8-bit DNN has many values that can be effectively represented by the 4 least-significant bits (LSBs) alone. This unstructured sparsity can be leveraged to increase efficiency, thereby improving performance and reducing energy. Until now, DNN accelerators have handled such inefficiencies with compressed encodings [gondimalla2019sparten, parashar2017scnn, han2016eie], output zero-value prediction [akhlaghi2018snapea, song2018prediction, shomron2019thanks], input zero-value skipping [sharify2019laconic, kim2017zena, zhang2016cambricon, albericio2016cnvlutin], and bit-serial schemes [sharify2019laconic, judd2016stripes].

In this paper, we introduce non-blocking simultaneous multithreading (NB-SMT), a new approach to tackling sparsity and increasing hardware efficiency. Conceptually, NB-SMT is based on the well-known SMT technique used to concurrently execute multiple instruction flows on shared resources [yamamoto1994performance, yamamoto1995increasing, tullsen1995simultaneous, eggers1997simultaneous]. In the same manner that SMT keeps several hardware threads to increase the utilization of hardware resources, we propose maintaining a number of “DNN threads” that run in parallel so as to increase the utilization of DNN hardware resources.
Conventional SMT dispatches instructions to an execution unit opportunistically: if an instruction's dependencies are met and the resources it needs are available, it is executed; otherwise, it waits in a reservation station. NB-SMT avoids this online scheduling by “squeezing” two (or more) threads together into the shared resource (e.g., an execution unit) by temporarily reducing their numerical precision. By doing so, we (1) leverage DNN tolerance to reduced numerical precision, thereby enabling non-blocking operation; (2) do not break the systematic operation of DNNs, thereby enabling implementation of SMT in dataflow architectures, which are popular as DNN accelerators; and (3) achieve a speedup that is directly proportional to the number of threads.
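The squeezing policy can be sketched as a toy model (our own illustrative Python sketch, not the exact hardware scheme; the function name, the "thin"-thread condition, and the choice of which thread to truncate are all our assumptions). With unsigned 8-bit operands, a thread that is idle (a zero operand) or whose operands both fit in 4 bits leaves enough of the shared 8b-8b unit for its sibling to execute exactly; on a collision, one thread's operands are truncated to their 4 MSBs:

```python
def squeeze_mac(a0, w0, a1, w1):
    """Squeeze two MAC threads into one shared 8b-8b unit (toy model).

    Operands are unsigned 8-bit values. If either thread is idle (a zero
    operand) or "thin" (both operands fit in their 4 LSBs), both products
    are computed exactly. On a collision -- both threads need the full
    unit -- thread 1's operands are truncated to their 4 MSBs, trading
    temporary precision for non-blocking dispatch. Illustrative policy.
    """
    def thin(a, w):
        return a < 16 and w < 16  # both operands representable by 4 LSBs

    if 0 in (a0, w0, a1, w1) or thin(a0, w0) or thin(a1, w1):
        return a0 * w0, a1 * w1                 # both products exact
    msb4 = lambda x: x & 0xF0                   # keep only the 4 MSBs
    return a0 * w0, msb4(a1) * msb4(w1)         # thread 1 approximated
```

Note that no thread ever stalls: every call produces two products, either exact or reduced-precision, which is what makes the speedup proportional to the thread count.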
NB-SMT may be implemented in different DNN accelerator architectures. In this paper, we demonstrate 2-threaded and 4-threaded NB-SMT as an extension to an 8-bit output-stationary (OS) systolic array (SA) for matrix multiplication [kung1979systolic, shomron2019smt, gupta2015deep], which we name SySMT. Compared with the conventional OS-SA, a 2-threaded SySMT achieves a 2× speedup with 33% energy reduction and less than 1% accuracy degradation on state-of-the-art DNNs, at a 1.4× area increase. With 4 threads, we observe that some layers contribute more errors to inference than others when executed with NB-SMT. We therefore trade speedup for accuracy by decreasing the number of running threads in selected layers. Given a 1% accuracy degradation cap, a 4-threaded SySMT delivers, for example, a 3.4× speedup with 37% energy reduction and a 2.5× area increase on 40%-pruned ResNet-18, compared with the conventional OS-SA.
Our contributions in this paper are as follows:

We introduce the concept of non-blocking simultaneous multithreading (NB-SMT), which increases DNN hardware utilization by exploiting DNN tolerance to reduced numerical precision, data-width variability, and unstructured sparsity. By not blocking any thread, NB-SMT achieves a speedup that is directly proportional to the number of threads.

We demonstrate the applicability of NB-SMT using SySMT, an NB-SMT-enabled output-stationary systolic array. We describe different resource-sharing strategies, in which SySMT employs both MAC-unit and output-register sharing.

We evaluate a 2-threaded and a 4-threaded SySMT in terms of speedup, area, power, energy, and model accuracy with various state-of-the-art CNN models on the ImageNet dataset.
The rest of this paper is organized as follows: Section 2 describes the rationale behind NB-SMT, Section LABEL:sec:the_idea presents the basic principles of NB-SMT, Section LABEL:sec:sa demonstrates NB-SMT as an extension to an output-stationary systolic array (SySMT), Section LABEL:sec:eval evaluates the impact of NB-SMT on the SySMT implementation as well as on model accuracy, Section LABEL:sec:related_work discusses the applicability of NB-SMT to other accelerators and reviews related work, and Section LABEL:sec:conclusions concludes.
2 Motivation
The CPU instruction pipeline faces many challenges in achieving efficient execution. These inefficiencies, also known as hazards, originate from the application's dynamic execution flow and from the generality of the architecture (i.e., general-purpose). DNNs, on the other hand, work in a systematic, layer-by-layer fashion, with mostly MAC operations taking place during inference, making their control and data flow deterministic: which and how many computations will be conducted, what the model's memory footprint is, where weights are stored, and where activations will be stored during execution can all be deduced prior to execution (neglecting special cases such as conditional DNNs). Yet, DNNs still exhibit inefficiencies when considering the actual values that propagate through the layers.
Sparsity. DNNs comprise zero-valued activations and weights [nikolic2019characterizing]. Zero-valued activations are produced dynamically during inference, due, among other things, to the popular use of the ReLU activation function, which clamps negative values to zero [nair2010rectified]. Weights, on the other hand, are static during inference, and in most cases not many of them are zero-valued when the network is trained only with a loss function. However, training the network with L1 regularization or pruning it, for example, can substantially reduce the number of parameters (i.e., increase the number of zero-valued weights) with a negligible decrease in model accuracy [li2016pruning]. For example, 60% of ResNet-50 parameters can be discarded [liu2018rethinking] by iteratively trimming small weights and retraining the model in an unstructured manner [han2015learning].

Partial sparsity. Zeros can also be observed when looking within the numerical representation. DNN tensors usually follow a bell-shaped distribution, such as Gaussian or Laplace [banner2019post]. Therefore, when considering a quantized DNN, some values will only be represented by a portion of the LSBs, leaving the most-significant bits (MSBs) equal to zero [lascorz2019shapeshifter]. Throughout this paper we use 8-bit model representations, so by “partial sparsity” we refer to numbers that can be represented solely by 4 bits.

Unstructured sparsity. Activation sparsity is unstructured by nature, as zero-valued activations may be scattered without any confined structure. Moreover, the values themselves are input-dependent, and thereby dynamic. Weights, on the other hand, are static during inference and can therefore be pruned in either an unstructured or a structured manner. A general rule of thumb is that unstructured pruning techniques achieve a better ratio of parameter reduction to accuracy degradation than structured techniques do. Indeed, with unstructured pruning, the algorithm is free to cancel parameters at weight granularity, whereas structured pruning algorithms are constrained to removing parameters at larger granularities, such as channels or filters [wen2016learning]. The downside of unstructured pruning, however, is that it is not easily exploited by hardware [liu2018rethinking].
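A single step of the unstructured magnitude-based pruning discussed above can be sketched as follows (a minimal illustration in the spirit of [han2015learning]; the function name and threshold rule are ours, and real pipelines interleave such pruning steps with retraining):

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights, regardless
    of their position in the tensor (i.e., unstructured pruning).

    `weights` is a flat list of floats; `sparsity` is the target
    fraction of weights to zero out. Illustrative single-shot step only.
    """
    k = int(sparsity * len(weights))
    if k == 0:
        return list(weights)
    # Magnitude threshold: the k-th smallest absolute value.
    thresh = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= thresh else w for w in weights]
```

Because the surviving nonzero weights land at arbitrary positions, the resulting tensor has exactly the scattered, unconfined structure that makes this form of sparsity hard for dense hardware to exploit.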
These unstructured sparse inputs cause spontaneous underutilization of the MAC units. From a hardware perspective, a MAC unit with one of its inputs equal to zero is practically idle, and an 8b-8b MAC unit with an effective input data-width of 4 bits is only partially utilized. Figure LABEL:fig:mac_util presents the average MAC utilization of five popular CNN models. We observe that, on average, 60% of MAC operations result in idle MAC units, since one of their inputs is zero-valued; 20% of MAC operations only partially utilize the MAC units, since one or both of their inputs are effectively represented with 4 bits; and a mere 10% of MAC operations fully utilize the MAC units. To increase hardware utilization, we propose non-blocking simultaneous multithreading (NB-SMT), which exploits the unstructured sparsity of both activations and weights, as well as DNN tolerance to numerical precision reduction.
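The idle/partial/full accounting above can be sketched as follows. This is our own illustrative classification over streams of unsigned 8-bit operand pairs (the function name and the `< 16` threshold for 4-bit representability are our assumptions), not the exact measurement methodology behind the figure:

```python
def mac_utilization(acts, weights):
    """Classify each 8-bit MAC operand pair as idle, partial, or full.

    idle:    at least one operand is zero, so the MAC is a no-op
    partial: not idle, but at least one operand fits in its 4 LSBs
             (unsigned value < 16), so the 8b unit is only partly used
    full:    both operands need all 8 bits

    Returns the three fractions, which sum to 1. Illustrative only.
    """
    idle = partial = full = 0
    for a, w in zip(acts, weights):
        if a == 0 or w == 0:
            idle += 1
        elif a < 16 or w < 16:
            partial += 1
        else:
            full += 1
    n = len(acts)
    return idle / n, partial / n, full / n
```

Running such a classifier over the activation and weight operands of a convolutional layer is one way to reproduce the kind of idle/partial/full breakdown the figure reports.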