Non-Blocking Simultaneous Multithreading: Embracing the Resiliency of Deep Neural Networks

by Gil Shomron, et al.

Deep neural networks (DNNs) are known for their inability to fully utilize the underlying hardware resources, because the hardware is susceptible to sparse activations and weights. Even at finer granularities, many of the non-zero values contain zero-valued bits that may cause inefficiencies when executed on hardware. Inspired by conventional CPU simultaneous multithreading (SMT), which increases computer resource utilization by sharing resources across several threads, we propose non-blocking SMT (NB-SMT) designated for DNN accelerators. Like conventional SMT, NB-SMT shares hardware resources among several execution flows. Yet, unlike SMT, NB-SMT is non-blocking, as it handles structural hazards by exploiting the algorithmic resiliency of DNNs. Instead of opportunistically dispatching instructions while they wait in a reservation station for available hardware, NB-SMT temporarily reduces the computation precision to accommodate all threads at once, enabling a non-blocking operation. We demonstrate NB-SMT applicability using SySMT, an NB-SMT-enabled output-stationary systolic array (OS-SA). Compared with a conventional OS-SA, a 2-threaded SySMT consumes 1.4x the area and delivers a 2x speedup with 33% energy savings and less than 1% accuracy degradation on ImageNet. A 4-threaded SySMT consumes 2.5x the area and delivers, for example, a 3.4x speedup and 39% energy savings with ResNet-18.




1 Introduction

Deep neural networks (DNNs) are built of layers that primarily perform dot product operations between activations and weights. These basic operations are at the core of DNNs that achieve state-of-the-art results in different domains [hannun2014deep, girshick2014rich, silver2016mastering]. Yet, DNNs comprise abundant computations; for example, state-of-the-art convolutional neural networks (CNNs) may require billions of multiply-and-accumulate (MAC) operations to classify a single image [hannun2014deep, sze2017efficient]. Their great potential and computational burden have been a fertile ground for research and development of efficient DNN hardware accelerators over the last decade [reuther2019survey, jouppi2017datacenter, chen2014dadiannao].

The control flow of DNNs is mostly predictable, yet computations are still executed inefficiently on the underlying hardware. For example, DNNs may contain many zero-valued activations and weights [albericio2016cnvlutin]. During inference, a layer output is usually followed by a ReLU activation function, which clamps negative activation values to zero [nair2010rectified]. In addition, static pruning techniques push the limits of model sparsity by zeroing out insignificant weights [han2015learning, li2016pruning]. Zeros can also be found at finer granularities [lascorz2019shapeshifter]; a quantized 8-bit DNN has many values that can be effectively represented by only their 4 least-significant bits (LSBs). This unstructured sparsity can be leveraged to increase efficiency, thereby improving performance and reducing energy. Until now, DNN accelerators have handled such inefficiencies with compressed encodings [gondimalla2019sparten, parashar2017scnn, han2016eie], output zero-value prediction [akhlaghi2018snapea, song2018prediction, shomron2019thanks], input zero-value skipping [sharify2019laconic, kim2017zena, zhang2016cambricon, albericio2016cnvlutin], and bit-serial schemes [sharify2019laconic, judd2016stripes].

In this paper, we introduce non-blocking simultaneous multithreading (NB-SMT), a new approach to tackle sparsity and increase hardware efficiency. Conceptually, NB-SMT is based on the well-known SMT used to concurrently execute multiple instruction flows on shared resources [yamamoto1994performance, yamamoto1995increasing, tullsen1995simultaneous, eggers1997simultaneous]. In the same manner that SMT keeps several hardware threads to increase utilization of hardware resources, we propose maintaining a number of “DNN threads” that run in parallel so as to increase utilization of DNN hardware resources.

Conventional SMT dispatches instructions to an execution unit in an opportunistic manner. That is, if an instruction's dependencies are met and the resources it needs are available, it will be executed; otherwise, the instruction will wait in a reservation station. NB-SMT avoids this online scheduling by “squeezing” two (or more) threads together into the shared resource (e.g., execution unit) by temporarily reducing their numerical precision. By doing so, we (1) leverage DNN tolerance to reduced numerical precision, thereby enabling a non-blocking operation; (2) do not break the systematic operation of DNNs, thereby enabling implementation of SMT in dataflow architectures, which are popular as DNN accelerators; and (3) achieve a speedup that is directly proportional to the number of threads.
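The "squeezing" idea can be illustrated with a minimal Python sketch of a 2-threaded shared MAC. Everything named here is hypothetical: `nbsmt_mac_2t`, `reduce_to_4bit_msbs`, and in particular the reduction policy (round an unsigned 8-bit activation to its 4 MSBs only when both threads need the multiplier) are assumptions made for illustration, not a description of the hardware presented in the paper.

```python
def reduce_to_4bit_msbs(x: int) -> int:
    """Approximate an unsigned 8-bit value with 4 bits (assumed policy)."""
    if x < 16:
        # The value already fits in the 4 LSBs, so no precision is lost.
        return x
    # Round to the nearest multiple of 16 and saturate at the largest 4-bit MSB value.
    msbs = min((x + 8) >> 4, 15)
    return msbs << 4


def nbsmt_mac_2t(acc, threads):
    """One step of a hypothetical 2-threaded MAC sharing a single multiplier.

    acc     -- list with one partial sum per thread
    threads -- two (activation, weight) pairs; activations are unsigned 8-bit
    """
    (x0, w0), (x1, w1) = threads
    needs_mul0 = x0 != 0 and w0 != 0
    needs_mul1 = x1 != 0 and w1 != 0
    if needs_mul0 and needs_mul1:
        # Structural hazard: instead of stalling one thread, both threads
        # proceed with reduced-precision activations.
        x0, x1 = reduce_to_4bit_msbs(x0), reduce_to_4bit_msbs(x1)
    # If at most one thread needs the multiplier, it keeps full precision.
    return [acc[0] + x0 * w0, acc[1] + x1 * w1]
```

For instance, `nbsmt_mac_2t([0, 0], [(200, 3), (0, 5)])` keeps thread 0 at full precision because thread 1 contributes a zero product, whereas `[(200, 3), (100, 2)]` forces both activations to be rounded (to 208 and 96 under the assumed policy) so that neither thread blocks.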

NB-SMT may be implemented in different DNN accelerator architectures. In this paper, we demonstrate 2-threaded and 4-threaded NB-SMT as an extension to an 8-bit output-stationary (OS) systolic array (SA) for matrix multiplication [kung1979systolic, shomron2019smt, gupta2015deep], which we named SySMT. Compared with the conventional OS-SA, a 2-threaded SySMT achieves a 2× speedup with 33% energy reduction and less than 1% accuracy degradation of state-of-the-art DNNs, at the cost of a 1.4× area increase. As for 4 threads, we observe that some layers contribute more errors to inference than others when executed with NB-SMT. Therefore, we trade speedup for accuracy by decreasing the number of running threads in selected layers. Given a 1% accuracy degradation cap, a 4-threaded SySMT delivers, for example, a 3.4× speedup with 37% energy reduction and a 2.5× area increase with 40%-pruned ResNet-18, compared with the conventional OS-SA.

Our contributions in this paper are as follows:

  • We introduce the concept of non-blocking simultaneous multithreading (NB-SMT), which increases DNN hardware utilization by exploiting DNN tolerance to reduced numerical precision, data-width variability, and unstructured sparsity. By not blocking any thread, NB-SMT achieves a speedup that is directly proportional to the number of threads.

  • We demonstrate NB-SMT applicability using SySMT, which is an NB-SMT-enabled output-stationary systolic array. We describe different resource sharing strategies in which SySMT employs both MAC unit and output register sharing.

  • We evaluate a 2-threaded and a 4-threaded SySMT in terms of speedup, area, power, energy, and model accuracy with various state-of-the-art CNN models and the ImageNet dataset.

The rest of this paper is organized as follows: Section 2 describes the rationale behind NB-SMT. The subsequent sections, in order, present the basic principles of NB-SMT, demonstrate NB-SMT as an extension to an output-stationary systolic array (SySMT), evaluate the impact of NB-SMT on the SySMT implementation as well as on model accuracy, discuss the applicability of NB-SMT in other accelerators and review related work, and conclude.

2 Motivation

The CPU instruction pipeline faces many challenges in achieving efficient execution. These inefficiencies, also known as hazards, originate from the application’s dynamic execution flow and from the generality of the architecture (i.e., general-purpose). DNNs, on the other hand, work in a systematic, layer-by-layer fashion, with mostly MAC operations taking place during inference, making their control and data flow deterministic: which and how many computations will be conducted, what the model’s memory footprint is, where the weights are stored, and where activations will be stored during execution can all be deduced prior to execution (neglecting special cases such as conditional DNNs). Yet, DNNs still exhibit inefficiencies when considering the actual values that propagate through the layers.

Sparsity. DNNs comprise zero-valued activations and weights [nikolic2019characterizing]. Zero-valued activations are produced dynamically during inference, due, among other things, to the popular use of the ReLU activation function, which clamps negative values to zero [nair2010rectified]. On the other hand, weights are static during inference, and in most cases, not many of them are zero-valued when trained only with a loss function. However, training the network with L1 regularization or pruning the network, for example, can substantially reduce the number of parameters (i.e., increase the number of zero-valued weights) with negligible decrease in model accuracy [li2016pruning]. For example, 60% of ResNet-50 parameters can be discarded [liu2018rethinking] by iteratively trimming small weights and retraining the model in an unstructured manner [han2015learning].
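As a rough illustration of "trimming small weights", the following Python sketch performs a single unstructured magnitude-pruning step on a flat list of weights. The function name `magnitude_prune` and its `sparsity` parameter are hypothetical, and the retraining that would follow each step in an iterative scheme is deliberately left out.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by |w|, smallest first; the first n_prune are zeroed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned
```

Because the selection is made per weight rather than per channel or filter, the resulting zeros land wherever the small weights happen to be, which is what makes this kind of sparsity unstructured.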

Partial sparsity. Zeros can also be observed when looking within the numerical representation. DNN tensors usually follow a bell-shaped distribution, such as Gaussian or Laplace [banner2019post]. Therefore, when considering a quantized DNN, some values will only be represented by a portion of the LSBs, leaving the most-significant bits (MSBs) equal to zero [lascorz2019shapeshifter]. Throughout this paper we use 8-bit model representations, so by “partial sparsity” we refer to those numbers that can be represented solely by 4 bits.
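The definition of partial sparsity can be made concrete with a short Python sketch. It assumes unsigned 8-bit values (for signed weights one would presumably test the magnitude instead), and the helper names `effective_width` and `partial_sparsity_ratio` are hypothetical.

```python
def effective_width(x: int) -> int:
    """Number of bits needed to represent an unsigned 8-bit value."""
    if not 0 <= x <= 255:
        raise ValueError("expected an unsigned 8-bit value")
    return x.bit_length()


def partial_sparsity_ratio(values):
    """Fraction of the non-zero values whose four MSBs are all zero."""
    nonzero = [v for v in values if v != 0]
    if not nonzero:
        return 0.0
    narrow = sum(1 for v in nonzero if effective_width(v) <= 4)
    return narrow / len(nonzero)
```

In this sketch, 15 (binary 00001111) counts as partially sparse while 16 (binary 00010000) does not, since the latter needs a fifth bit.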

Unstructured sparsity. Activation sparsity is unstructured by nature, as the zero-valued activations may be scattered without any confined structure. Moreover, the values themselves are input-dependent, and thereby dynamic. Weights, on the other hand, are static during inference and therefore can be pruned either in an unstructured or structured manner. A general rule of thumb is that unstructured pruning techniques achieve a better parameter reduction to accuracy reduction ratio than do structured techniques. Indeed, with unstructured pruning, the algorithm has the freedom to cancel parameters in weight granularity, whereas structured pruning algorithms are constrained to remove parameters in larger granularity, such as channels or filters [wen2016learning]. The downside of unstructured pruning is, however, that it is not easily exploited by hardware [liu2018rethinking].

The unstructured sparse inputs cause spontaneous underutilization of the MAC units. From a hardware perspective, a MAC unit with one of its inputs equal to zero is practically idle, and an 8b-8b MAC unit with an effective input data-width of 4 bits is only partially utilized. Figure LABEL:fig:mac_util presents the average MAC utilization of five popular CNN models. We observe that, on average, 60% of MAC operations result in idle MAC units, since one of their inputs is zero-valued; 20% of MAC operations partially utilize the MAC units, since one or both of their inputs are effectively represented with 4 bits; and a mere 10% of MAC operations fully utilize the MAC units. To increase hardware utilization, we propose non-blocking simultaneous multithreading (NB-SMT), which exploits both the unstructured sparsity of the activations and weights and DNN tolerance to numerical precision reduction.
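The three utilization categories described above can be sketched as a simple classifier over (activation, weight) pairs. This is only an illustrative Python sketch: the names `mac_state` and `utilization_breakdown` are hypothetical, and treating a signed value as "effectively 4 bits" when its magnitude is below 16 is an assumption rather than the paper's exact measurement methodology.

```python
from collections import Counter


def mac_state(x: int, w: int) -> str:
    """Classify one 8b-8b MAC operation by how much of the unit it uses."""
    if x == 0 or w == 0:
        return "idle"      # the product is zero, so the multiplier does no useful work
    if abs(x) < 16 or abs(w) < 16:
        return "partial"   # at least one input fits in 4 bits (assumed criterion)
    return "full"


def utilization_breakdown(pairs):
    """Return the fraction of idle, partial, and full MAC operations."""
    counts = Counter(mac_state(x, w) for x, w in pairs)
    total = sum(counts.values())
    if total == 0:
        return {"idle": 0.0, "partial": 0.0, "full": 0.0}
    return {s: counts[s] / total for s in ("idle", "partial", "full")}
```

Feeding the activation and weight pairs of a layer through `utilization_breakdown` would yield a per-layer split of the kind summarized in the paragraph above.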