
Energy-efficient DNN Inference on Approximate Accelerators Through Formal Property Exploration

by   Ourania Spantidi, et al.
Southern Illinois University

Deep Neural Networks (DNNs) are being heavily utilized in modern applications and are putting energy-constrained devices to the test. To bypass high energy consumption issues, approximate computing has been employed in DNN accelerators to balance the accuracy-energy reduction trade-off. However, the approximation-induced accuracy loss can be very high and drastically degrade the performance of the DNN. Therefore, there is a need for a fine-grain mechanism that would assign specific DNN operations to approximation in order to maintain acceptable DNN accuracy, while also achieving low energy consumption. In this paper, we present an automated framework for weight-to-approximation mapping enabling formal property exploration for approximate DNN accelerators. At the MAC unit level, our experimental evaluation surpassed already energy-efficient mappings by more than 2× in terms of energy gains, while also supporting significantly more fine-grain control over the introduced approximation.





I Introduction

Deep Neural Networks (DNNs) are consistently pushing the computing limitations of modern embedded devices. Current applications impose strict accuracy requirements, making DNNs essential components in multiple continuously advancing domains [jouppi2017datacenter] and also resulting in deeper and more complex implementations. The computing requirements of state-of-the-art DNNs have increased so much that a single inference requires billions of multiply and accumulate (MAC) operations. As embedded devices are resource-constrained (i.e., limited computing and power capabilities), they integrate hardware accelerators, which comprise large amounts of MAC units, to balance out the accuracy/throughput requirements (e.g., 4K MAC units in the Google Edge TPU [cass2019taking] and 6K MAC units in the Samsung embedded NPU [park20219]). However, such a high number of MAC units operating in parallel and performing billions of operations per second results in elevated energy requirements, power consumption, and thermal bottlenecks [amrouch2020npu].

Approximate computing [han2013approximate] has recently emerged as the dominant paradigm that trades quality loss for energy and performance gains. A great amount of DNN operations can tolerate some degree of approximation [mrazek2019alwann, tasoulas2020weight], and since the majority of DNN inference is spent on convolution and matrix multiplication operations, the design of approximate MAC units has attracted significant interest. Particularly, the majority of research has been focused on the design of approximate multipliers [axddnnsurvey2022], as they are the most complex components of the MAC units and dominate energy consumption inside the unit. However, such multipliers are not application specific. They have been designed with static approximation and, once deployed, the generated circuits cannot adapt to input changes, causing serious integration problems.

To balance this energy gain/accuracy trade-off dilemma, approximate accelerators have been proposed following two main architectural designs: (i) tile-based MAC units, and (ii) MAC units with reconfigurable approximate multipliers. Regarding the first method, instead of having a mesh of MAC units, the DNN accelerator is organized as a mesh of tiles. Each tile hosts a combination of exact and multiple static approximate multipliers [mrazek2019alwann]. Then, based on the DNN and the system constraints, an analysis is performed to map the DNN layers on the appropriate multiplier (exact or approximate), while the rest are power-gated. Since each layer of a DNN has different error sensitivity, the layer-to-static multiplier mapping is not a trivial problem; the design space is so big that exhaustive exploration is prohibitive [mrazek2019alwann]. Additionally, such a coarse-grain layer-based mapping of approximation limits the energy gains [tasoulas2020weight]. The second method is more fine-grain and flexible, and focuses on the design of reconfigurable approximate multipliers [tasoulas2020weight, spantidi2021positive]. Such multipliers sacrifice area to support multiple approximation modes with different introduced errors and are able to change the level of introduced error at run-time, based on the weight values of each layer, to offer more fine-grain control. However, the main problem regarding their utilization lies in how to decide which approximation mode of the multiplier will be triggered for the different weight values in order to keep DNN accuracy within specific thresholds. Overall, the difficulty lies in the fact that there is no systematic approach to explore different weight-to-approximation mappings, as previous works are based on hybrid methods [zervakis2020design, mrazek2019alwann] and require constant manual tuning [tasoulas2020weight, spantidi2021positive]. Additionally, they target only the average accuracy of a dataset, which can be misleading in some cases, as we show in Section III. In Section V we will also show that the energy gains achieved by state-of-the-art related works at the DNN accelerator level are fairly suboptimal due to inefficient exploitation of the underlying hardware.

To create more efficient mappings, while automating and enabling the systematic exploration of the introduced error, we can express the properties of the approximate accelerators using Signal Temporal Logic (STL) [HoxhaDF17sttt], a specification formalism to express system properties. Once an STL formula is expressed, a robustness analysis [bartocci2018specification] is used to evaluate in which cases the properties hold true. This way, we can check the system robustness under different STL expressions which allows systematic exploration, but is not scalable as each expression captures a small part of the exploration. Parametric Signal Temporal Logic (PSTL) [asarin2011parametric] extends STL by replacing threshold constants with parameters that get to be inferred.

In this paper, we address the aforementioned problems regarding weight-to-approximation mapping for energy-efficient DNN inference on approximate accelerators. Particularly, we present a unified framework that utilizes PSTL to express accuracy constraints and find weight-to-approximation mappings for the weights of a DNN on approximate multipliers, so that the constraints are satisfied and the energy consumption is minimized. The contributions of the proposed framework are:

  • we propose a novel framework which, based on appropriately constructed formal expressions and input formulation, utilizes robustness metrics to achieve energy-efficient approximation mappings;

  • we systematically explore the properties of approximate accelerators when multipliers with different introduced errors are employed, in correlation with their utilization under specific accuracy thresholds;

  • we show that although the state of the art satisfies tight but coarse constraints, it fails to satisfy more fine-grain ones. To the best of our knowledge, this is the first work that allows fine-grain exploration of approximate accelerator error properties under multiple accuracy constraints, while avoiding manual tuning and retraining or weight tuning; and

  • we evaluate the resulting mapping combinations against state-of-the-art mapping methodologies and compare our energy gain findings when considering both fine- and coarse-grain accuracy requirements.

II Related work

Approximate computing has been heavily utilized on DNN inference [axddnnsurvey2022]. Many works present mapping methodologies that balance out the computation accuracy-power consumption trade-off, and recent research has focused on the design and utilization of approximate multipliers on DNN inference.

The work in [vasicek2019automated] deals with the automated design of application-specific approximate circuits using Cartesian Genetic Programming (CGP), employing a weighted mean error distance metric to guide the circuit approximation process. In [ansari2019improving] the authors employ designs of CGP-based multipliers to achieve energy and area savings. The authors in [sarwar2018energy] utilize the notion of computation sharing and propose a compact and energy-efficient Multiplier-less Artificial Neuron. However, all works in [vasicek2019automated, ansari2019improving, sarwar2018energy] require DNN retraining to recover some of the accuracy loss, which cannot always be applied [convar:dac2021].

To avoid retraining, the work in [mrazek2019alwann] employs a layer-based (coarse) approximation method where each static multiplier is taken from [mrazek2017evoapprox8b] and can be assigned to a distinct layer. The resulting multiplier selection problem is solved by a multi-objective optimization algorithm; however, the applied mapping is layer-based, which leaves room for improvement. The work in [mrazek2020libraries] extends the library in [mrazek2017evoapprox8b] using CGP optimization to generate approximate multipliers, achieving significant energy gains with minimal accuracy loss for less complex networks. In [hanif2019cann], the authors present a compensation module that reduces energy consumption, but the additional accumulation row in the MAC array increases the computational latency. Similarly, in [convar:dac2021] an additional MAC array column is used to predict and compensate the error of the approximate multiplications. The work in [guo2020reconfigurable] proposes a reconfigurable approximate multiplier for quantized CNN applications that enables the reusability of resources for multiplications of different precisions. In [zervakis2020design] the authors generate approximate multipliers with the capability of run-time reconfigurable accuracy. The work in [tasoulas2020weight] uses [zervakis2020design] to generate low-variance approximate reconfigurable multipliers, and also proposes a fine-grain weight-oriented mapping methodology to achieve high energy gains with small accuracy losses. Additionally, [spantidi2021positive] presents a dynamically configurable approximate multiplier that comprises an exact, a positive error, and a negative error mode, and proposes a filter-oriented approximation method that achieves high energy gains under different accuracy constraints, but the exploration time becomes unmanageable for larger DNNs and datasets.

A canonic sign digit-based approximation methodology for representing the filter weights of pre-trained Convolutional DNNs is presented in [riaz2020caxcnn]. The work in [hammad9cnn] proposes an architecture comprising a pre-processing precision controller at the system level and approximate multipliers of varying precisions on the Processing Element (PE) level. The authors in [park2021design] utilize different approximate multipliers in an interleaved way to reduce the energy consumption of MAC-oriented signal processing algorithms. However, both of the works [hammad9cnn, park2021design] target 16-bit inference, while modern DNN accelerators are mainly using 8-bit precision [jouppi2017datacenter]. Towards the optimization of DNN accelerators, the work in [zhang2022full] presents a full-stack accelerator search technique which improves the performance per thermal design power ratio. The work in [kosaian2021boosting] transforms convolutional and fully-connected DNN layers to achieve higher performance in terms of FLOPs/sec.

When compared to related works, our proposed approach differentiates in the following points: (1) given a reconfigurable multiplier, we systematically produce approximation mappings through the exploration of the error properties of approximate accelerators, (2) we employ multiple accuracy constraints for a given DNN and dataset, (3) our proposed framework can receive any trained and quantized DNN as input and does not require retraining.

III Motivation

This section contains a motivational example that shows the necessity of automatic and fine-grain exploration of the error properties of approximate accelerators in order to generate efficient mappings. Overall, we focus our analysis on state-of-the-art approximate multipliers and mapping methodologies [tasoulas2020weight, mrazek2019alwann, spantidi2021positive, zervakis2020design]. Specifically, the works in [tasoulas2020weight, spantidi2021positive, zervakis2020design] propose reconfigurable multiplier designs that comprise three modes of operation, each one introducing varying levels of error and energy gains. These works also present mapping methodologies; specifically, the works in [mrazek2019alwann, zervakis2020design] present layer-wise approximation mappings where each layer is entirely mapped to a different multiplier or multiplier mode, respectively. To combat the minimal energy gains achieved by layer-wise approaches, the works in [tasoulas2020weight, spantidi2021positive] propose fine-grain weight-oriented methodologies to decide which approximation modes of the respective reconfigurable multiplier will be used for each weight value in each layer of the DNN.

We argue that the existing mapping approaches are inadequate for the following reasons: (i) they may result in resource under-utilization due to biased decisions; (ii) they target only average accuracy, which can be misleading; and (iii) they ignore big accuracy drops on specific dataset batches.

Resource under-utilization due to biased decisions: With the term utilization we refer to the number of times each distinct multiplier mode is used in a mapping. The methods in [tasoulas2020weight, spantidi2021positive] both perform weight-based mapping of approximation by employing the concepts of layer significance and weight magnitude. For instance, the authors in [tasoulas2020weight] presented LVRM (Low-Variance Reconfigurable Multiplier), an approximate reconfigurable multiplier that supports three operation modes: (i) $M_0$, which triggers exact multiplications; (ii) $M_1$, which introduces low error and small energy gains compared to $M_0$; and (iii) $M_2$, which introduces greater error with larger energy gains than $M_1$. The method in [tasoulas2020weight] initially tries to identify which layers are more resilient to error, and maps their weights entirely to the most aggressive approximate mode $M_2$. Then, the weights of the remaining layers are distributed across the three modes based on experimentally derived ranges around the value zero, which requires manual tuning as each layer has a different weight distribution. Even though this method produces considerable energy savings, it is not scalable and results in resource under-utilization. By mapping complete layers to $M_2$, [tasoulas2020weight] introduces significant error and makes the DNN more sensitive to any further approximation. Thus, the utilization of the $M_1$ mode is reduced. As an example, for ResNet20 on the CIFAR-10 dataset and a given accuracy drop threshold, their methodology produces a mapping in which one mode receives only a very small share of the total multiplications, while another receives the vast majority. Thus, one of the approximate multiplier modes is barely utilized. Considering the sub-linear relation between induced error and energy reduction in approximate multipliers [mrazek2017evoapprox8b, ZervakisTVLSI2019], we argue that approximating multiple layers with $M_1$ (i.e., moderate approximation) will deliver an energy reduction closer to linear and thus, potentially higher energy reduction than the one achieved by approximating only a few layers with $M_2$.

Fig. 1: (a) ResNet44 on CIFAR-100 with the multipliers in [mrazek2017evoapprox8b] and method in [mrazek2019alwann] and (b) ResNet56 on CIFAR-100 with the multiplier and method in [spantidi2021positive].

Targeting only average accuracy can be misleading: Works that propose approximate multipliers and mapping methodologies [tasoulas2020weight, mrazek2019alwann, sarwar2018energy, spantidi2021positive, zervakis2020design], evaluate their findings on the average accuracy of a DNN over the target dataset. However, the introduction of error in computations does not affect all the batches of the dataset equally, creating, in many cases, great variations in the achieved accuracy. Such behavior is not acceptable when quality of service requirements need to be achieved over the entire dataset [dokhanchi2018evaluating]. Figure 1(a) shows the inference accuracy differences of the ALWANN method [mrazek2019alwann] against the baseline (exact computations without error) for ResNet44 over the entire CIFAR-100 test dataset. Particularly, we split the 10,000 images into 100 equal batches and we show the accuracy differences for each one of them. Even though ALWANN had an overall accuracy drop of only 1% over the entire dataset, when we take a deeper look into the achieved accuracy per batch, we can see cases where the accuracy drops as low as 10% when compared to the exact operation (e.g., batch 40). Additionally, looking at the batches in which ALWANN achieved lower accuracy, over 20% of them have an accuracy drop of more than 5%, which is significant considering the strict constraint of 1%. Similar behavior has been observed from the mappings produced by [tasoulas2020weight]. Satisfying such fine-grain requirements could be posed in the form of queries. For example, “For the dataset batches that perform worse than the exact behavior, we want no more than 20% of the cases to drop more than 5% when we introduce approximation.” In Section IV we show how to express formally such queries with PSTL and perform the corresponding fine-grain exploration.
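The per-batch view used in Figure 1 can be reproduced with a few lines of code. The following is a minimal sketch, assuming the predicted labels of the exact and the approximate accelerator are already available as arrays; the function name `per_batch_accuracy_drop` and its arguments are hypothetical and not part of any cited framework.

```python
import numpy as np

def per_batch_accuracy_drop(labels, exact_pred, approx_pred, num_batches=100):
    """Return the accuracy drop (exact minus approximate, in percentage
    points) for each of `num_batches` consecutive batches of the dataset."""
    labels = np.asarray(labels)
    correct_exact = np.asarray(exact_pred) == labels
    correct_approx = np.asarray(approx_pred) == labels
    # Split the per-image correctness flags into consecutive batches.
    exact_batches = np.array_split(correct_exact, num_batches)
    approx_batches = np.array_split(correct_approx, num_batches)
    return np.array([100.0 * (e.mean() - a.mean())
                     for e, a in zip(exact_batches, approx_batches)])

if __name__ == "__main__":
    # Toy data: the second batch is fully misclassified by the approximate run.
    drops = per_batch_accuracy_drop([0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 0, 0],
                                    num_batches=2)
    print("per-batch drop:", drops, "average drop:", drops.mean())
```

In the toy run, the average drop hides the fact that one batch loses all of its accuracy, which is the effect the per-batch analysis is meant to expose.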

Ignoring big accuracy drops on specific batches: Introducing approximation can also have an additional effect. Even though the variation of the achieved accuracy can be within specific values, there might be specific batches with very low accuracy. Figure 1(b) shows the inference accuracy difference of the method in [spantidi2021positive] against the baseline (exact computations without error) for ResNet56 over the entire CIFAR-100 test dataset (10,000 images split into 100 equal batches). Even though [spantidi2021positive] achieves an overall accuracy drop of 1% and satisfies the query presented in the previous observation, we can see that for batch 47 the accuracy drop is 16%. Having such large drops during the inference phase could in some cases not be deemed acceptable, even if it is only for a small number of batches. For instance, such applications could be image recognition tasks in autonomous vehicles, where a steady stream of data is provided through different sensors (RADAR, LIDAR, etc) [dokhanchi2018evaluating, samal2020attention]. In such cases, it is important to study the accuracy for each block of this data stream instead of evaluating an NN over the final average accuracy. Therefore, even more complex queries are needed to capture this behavior. For instance: “For the dataset batches that perform worse than the exact behavior, we want no more than 20% of the cases to drop more than 5% and no case whatsoever to drop more than 15% when we introduce approximation.” Again, in Section IV we show how to express formally such queries with PSTL and perform the corresponding fine-grain exploration.

The analysis presented above shows the necessity to express complex queries in a formal way and explore solutions systematically without manual tuning. In that way, we will be able to produce flexible and scalable mappings, provide a more fine-grain control of the introduced error on the overall dataset, and support different levels of quality of service.

IV Methodology

In this paper, we present a methodology to systematically map DNN weights to approximation, under multiple and distinct objectives, using PSTL. By employing PSTL, we show that it is possible to describe more intricate properties of accelerators that comprise approximate multipliers. We additionally show that PSTL allows us to formulate and solve optimization problems with respect to energy gains. The user provides an approximate multiplier, a DNN already trained and quantized to 8 bits, and a dataset. The system we target in this work is a DNN accelerator comprising MAC units that utilize the given approximate multiplier. The output of the system is a single trajectory that captures the accuracy behavior of the given DNN for each batch of the given dataset. Specifically, the trajectory captures the accuracy drop of the utilized approximate multiplier against the exact multiplier for each respective dataset batch. By considering this accuracy drop per batch for a given DNN as a trajectory, we can investigate more specific and fine-grain properties of the overall system and acquire more knowledge on the impact of approximate accelerators. Having defined the output trajectory of the accelerator, we can express a property query using PSTL. An example of such a query is “For a given accelerator employing a reconfigurable multiplier and a given accuracy drop threshold, what is the maximum energy gain we can achieve without violating the accuracy requirement?”. After expressing a PSTL query, the mapping exploration phase, also called the parameter mining phase, is triggered. Initially, the weights of the DNN are randomly assigned to approximation modes of the given reconfigurable multiplier. The output accuracy trajectory is then analyzed for its robustness, which is then fed to a stochastic optimizer that decides on the next approximation mapping for each DNN layer. Essentially, the stochastic optimizer correlates the robustness value with per-layer approximation, and tries to find operating conditions that satisfy the defined PSTL query. The aim of the stochastic optimization step is to push the system’s behavior to the constraint boundaries that are set through the PSTL queries. Once the exploration phase is completed, we build a Pareto-front of mined parameters where the PSTL query is guaranteed to be satisfied.

IV-A Expressing system properties via Signal Temporal Logic

As mentioned above, state-of-the-art mapping methodologies require manual tuning. For example, specific layers need to be selected based on their error resilience, and then further exploration is needed on individual weight value ranges [tasoulas2020weight, spantidi2021positive]. These methods mostly rely on experimental observations without systematic exploration. How can queries like the ones described in Section III be posed? How can we exploit such queries to infer system properties? In order to support such fine-grain exploration and automate the mapping procedure, we utilize STL [maler2004monitoring, HoxhaDF17sttt] and specifically we build queries through PSTL [asarin2011parametric], considering the accuracy of all the incoming batches over time as the output trajectory. From this point onward, we refer to such trajectories as signals.

STL is a specification formalism used to express the properties of a given system in a compact way. To define the quantitative semantics of STL over arbitrary predicates, robustness is utilized as a quantitative measure on a given STL formula $\varphi$. Robustness indicates how far the signal is from satisfying or violating the defined STL specification [HoxhaDF17sttt]. An example of an STL expression is: “Does the accuracy signal always remain above 98% accuracy?”. The syntax of STL comprises multiple Boolean operators but in this work, we only utilize the conjunction $\wedge$. Additional operators can be defined as syntactic abbreviations. In this work, we will be using the “always” operator $\Box_I \varphi$ [HoxhaDF17sttt], meaning that “always during the interval $I$, $\varphi$ should be true”. We consider a single fixed interval $I$, and therefore $I$ can be dropped from the notation. These STL operators have been defined and used in literature to express temporal properties [maler2004monitoring, hoxha2018mining]. Moreover, we extend “always” to a more relaxed operator $\Box^{X\%}_I$, where instead of demanding the specification to hold over the entire interval $I$, we consider that $\Box^{X\%}_I \varphi$ is true if $\varphi$ holds for just $X\%$ of the signal values over the interval $I$.
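To make the notion of robustness concrete, the sketch below evaluates the example property “the accuracy signal always remains above a threshold” on a discrete per-batch signal. The robustness of “always” is taken as the worst-case margin, which is the standard quantitative semantics. The quantitative reading of the relaxed operator (the margin of the k-th best sample, with k covering X% of the samples) is an assumption made here for illustration, since the text only defines when the relaxed operator is true. Both function names are hypothetical.

```python
import math

def always_robustness(signal, threshold):
    """Robustness of 'always (signal >= threshold)': the smallest margin.
    A non-negative value means the property holds for every sample."""
    return min(s - threshold for s in signal)

def relaxed_always_robustness(signal, threshold, x_percent):
    """Assumed robustness for the relaxed 'always': the margin of the k-th
    best sample, where k covers x_percent of the samples. It is non-negative
    exactly when at least k samples satisfy the predicate."""
    margins = sorted((s - threshold for s in signal), reverse=True)
    k = max(1, math.ceil(len(margins) * x_percent / 100.0))
    return margins[k - 1]

if __name__ == "__main__":
    accuracy = [99.0, 98.5, 97.0, 99.5]   # toy accuracy per batch, in percent
    print(always_robustness(accuracy, 98.0))              # negative: violated
    print(relaxed_always_robustness(accuracy, 98.0, 75))  # positive: satisfied
```

With the relaxed operator set to 100%, the second function reduces to the first, which is a useful sanity check on the assumed semantics.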

Using STL, it is possible to judge whether or not a signal satisfies a specified system property $\varphi$. However, it is also possible to explore more elaborate system properties that a signal satisfies. Specifically, consider the aforementioned example of the STL query “Does the accuracy signal always remain above 98% accuracy?”. This query is essentially reduced to a “yes or no” problem: the signal either remains over 98% over the entire interval or not. Instead of pre-defining the 98% accuracy value, we can leave it as a parameter to be mined. Therefore, the query can be rephrased as “Which is the lowest accuracy value that the signal always satisfies?”. Such questions can be posed through the formal language PSTL [asarin2011parametric]. PSTL, an extension of STL, is a formal language used to express questions when we need to further explore the properties of a system instead of just determining whether an STL specification is satisfied or not. Parameter mining is the procedure of determining parameter values for PSTL formulas for which the specification is falsified. Therefore, parameter mining answers the question of which parameter ranges cause falsification [HoxhaDF17sttt]. Through the robustness metric, the parameter mining problem can be posed as an optimization problem [HoxhaDF17sttt].
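For a single threshold parameter, mining reduces to locating the value at which the robustness changes sign. The sketch below does this by bisection for the property “always (accuracy >= c)”, whose robustness decreases monotonically in c. This is only a toy illustration of posing mining as an optimization over robustness, not the stochastic procedure used in the framework; the function name and the search bounds are assumptions.

```python
def mine_threshold(signal, lo=0.0, hi=100.0, tol=1e-6):
    """Find the boundary value c at which 'always (signal >= c)' switches
    from satisfied to falsified, assuming it holds at lo and fails at hi."""
    def robustness(c):
        return min(s - c for s in signal)

    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if robustness(mid) >= 0:
            lo = mid      # still satisfied: the boundary is above mid
        else:
            hi = mid      # falsified: the boundary is below mid
    return lo

if __name__ == "__main__":
    accuracy = [99.0, 98.5, 97.0]   # toy accuracy per batch, in percent
    print(mine_threshold(accuracy))  # converges to the minimum of the signal
```

Thresholds above the returned value falsify the property and thresholds below it satisfy it, so the result marks the edge of the falsifying parameter range.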

IV-B Parametric Signal Temporal Logic Queries

The employment of PSTL can assist in the exploration of DNN weight-to-approximation mappings with respect to maximizing a given parameter, which in our case is the energy savings of the approximate accelerator. In this section, we show how we build incremental queries to capture energy gain values while keeping inference accuracy within multiple specific constraints, some of them stricter than others. Considering equal batches of our dataset as the stream of input data, we build the following initial query:

  IQ1. What is the maximum achieved energy gain $\theta$ such that, when applying approximation, any accuracy drop from the baseline is no more than $\delta_1$ for $X\%$ of the time?

In this initial query IQ1, the values $\delta_1$ and $X$ are defined by the user. For instance, in the motivation example presented in Figure 1(a) and described in Section III, $\delta_1 = 5\%$ and $X = 80\%$. The baseline is the achieved accuracy without applying any approximation (i.e., exact computations). The parameter $\theta$ is at this stage unknown to the user and is left to be mined (Section IV-C). Note that the value $\delta_1$ refers to the accuracy difference per batch of each target DNN. In other words, we want to make sure that for all the input batches in which approximation results in an accuracy drop, this accuracy drop is no more than $\delta_1$ for $X\%$ of these batches.

With the query IQ1, we are able to impose fine-grain constraints across all batches of the dataset regarding the variation of the accuracy drop. However, as we showed in Section III, there are cases where, even though the variation of the achieved accuracy is within specific values, there might be specific batches with very low accuracy. To that end, we extend IQ1 as follows:

  IQ2. What is the maximum achieved energy gain $\theta$ such that, when applying approximation, any accuracy drop from the baseline is no more than $\delta_1$ for $X\%$ of the time, and no more than $\delta_2$ at any time?

Again in this case, the values of $X$, $\delta_1$ and $\delta_2$ are defined by the user based on the needs of the application.

Finally, since many related works on approximate reconfigurable multipliers take into consideration the average accuracy of each target DNN [mrazek2019alwann, tasoulas2020weight, spantidi2021positive], we add it to query IQ2 to capture both fine-grain and coarse-grain accuracy information simultaneously. Therefore:

  IQ3. What is the maximum achieved energy gain $\theta$ such that, when applying approximation, any accuracy drop from the baseline is no more than $\delta_1$ for $X\%$ of the time, no more than $\delta_2$ at any time, and the average accuracy drop is below $\delta_{avg}$?

Once again, the values of $X$, $\delta_1$, $\delta_2$, and this time also $\delta_{avg}$, are defined by the user. In summary, in this section we used the initial query IQ1 to build the more elaborate query IQ3 in an attempt to profile more intricate accuracy properties for a given DNN. In our evaluation (Section V), we present the experimental results of different versions of the aforementioned queries, with different threshold values.
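The accuracy side of the three queries can be checked directly on a per-batch accuracy-drop signal. The sketch below is a Boolean check only (it does not mine the energy gain), and it follows the reading given above, in which the X% requirement is evaluated over the batches that actually lose accuracy. The function name and the argument names `delta1`, `delta2`, `delta_avg` are illustrative labels for the user-defined thresholds, not identifiers from the original framework.

```python
def check_queries(drops, delta1, x_percent, delta2, delta_avg):
    """Return (iq1, iq2, iq3) for a list of per-batch accuracy drops
    (exact minus approximate accuracy, in percentage points)."""
    # IQ1: among batches that lose accuracy, at least x_percent of them
    # must lose no more than delta1.
    degraded = [d for d in drops if d > 0]
    if degraded:
        within = sum(1 for d in degraded if d <= delta1)
        iq1 = 100.0 * within / len(degraded) >= x_percent
    else:
        iq1 = True
    # IQ2: additionally, no batch may lose more than delta2.
    iq2 = iq1 and all(d <= delta2 for d in drops)
    # IQ3: additionally, the average drop over all batches is below delta_avg.
    iq3 = iq2 and (sum(drops) / len(drops) < delta_avg)
    return iq1, iq2, iq3

if __name__ == "__main__":
    drops = [1, 2, 3, 4, 16, 0, -1, 2, 1, 1]   # toy signal with one outlier
    print(check_queries(drops, delta1=5, x_percent=80, delta2=15, delta_avg=2.0))
```

The toy signal satisfies IQ1 but not IQ2, mirroring the situation described for Figure 1(b), where a single batch with a large drop is invisible to the percentage-based constraint.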

IV-C Weight-to-approximation mapping

In our methodology, we follow a layer-oriented approach to map specified multiplications to approximation. To that end, we utilize a stochastic optimizer under multiple accuracy constraints (e.g., query IQ3), that iteratively (i) takes as input the observed accuracy per batch and the estimated output energy of the system, and (ii) produces as output a signal that describes the mapping of the DNN weights of each layer to the approximate modes of the multiplier.

To formulate the problem and without loss of generality, we assume an accelerator whose MAC units are composed of reconfigurable approximate multipliers. Each reconfigurable approximate multiplier supports three operation modes: (i) $M_0$, which corresponds to the exact operation; (ii) $M_1$, which introduces small error with small energy gains when compared to $M_0$; and (iii) $M_2$, which aggressively introduces greater error and achieves larger energy gains than $M_1$. Even though our method can be used for any number of approximation modes, we select three because previous research works have shown that the area overhead is not big and the control logic remains simple [tasoulas2020weight, spantidi2021positive]. Supposing that a DNN consists of $L$ layers ($l = 1, \dots, L$), the outputs of the stochastic optimizer are two signals $s_1$ and $s_2$, where each element in both of them is a percentage value. Each element $s_1[l]$ represents the percentage of all the multiplications of layer $l$ that are chosen to be mapped on the approximate mode $M_1$. Similarly, $s_2[l]$ is the percentage of all the multiplications of layer $l$ mapped on the approximate mode $M_2$. Therefore, for any layer $l$, the percentage of the total layer’s multiplications that will be mapped to mode $M_0$ will be $100\% - s_1[l] - s_2[l]$. Overall, the main goal of the stochastic optimizer is to observe the behavior of the DNN, in terms of accuracy and energy consumption, correlate it with the values of $s_1$ and $s_2$, and finally find appropriate values iteratively for the two vectors such that the constraints are satisfied.

Fig. 2: The weight distribution for all layers on different networks (ShuffleNet, ResNet20, GoogLeNet, MobileNet) on different datasets (CIFAR-10, CIFAR-100, GTSRB, ImageNet). 8-bit quantization in [0,255] is used.

To guide the optimizer regarding which weight values would be most likely assigned to modes and , we assign the different approximate modes to ranges around the median value of the weights for each DNN layer. We base this decision on the fact that, in most cases, the weight values of each layer are gathered around a central value and feature low dispersion, as shown in [spantidi2021positive]. This way, the more aggressive and modes would be utilized more frequently, maximizing the potential for higher energy gains. For completeness, Figure 2 shows the weight distribution of all layers for four different networks (ShuffleNet, ResNet20, GoogLeNet, and MobileNet) on four different datasets (CIFAR-10, CIFAR-100, GTSRB, and ImageNet). Our analysis shows that the vast majority of layers follow this principle, having either a very distinct peak or a more flattened behavior, and no layer exhibited a distribution with multiple peaks. To accommodate fine-grain mapping, we follow the approach described in [tasoulas2020weight], where a 2-bit signal is used to select the multiplier mode, and a control unit that comprises four 8-bit comparators, two AND gates, and an OR gate activates multiplier modes based on weight values (i.e., selected ranges). Overall, the hardware needed to support such fine-grain weight mapping to different multiplier modes using ranges results in a minimal area overhead of less than 3% [tasoulas2020weight]. Also, the number of control units is equal to the number of MAC array rows and thus increases only linearly, while the MAC array size increases quadratically.
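The range-based mode selection can be illustrated with a short behavioral sketch in Python. It is not the circuit from [tasoulas2020weight]: the function name, the range bounds, the integer mode encoding (0 for the exact mode, 2 for the most aggressive one), and the priority given to the more aggressive range are all assumptions made for illustration.

```python
def select_mode(weight, aggr_lo, aggr_hi, mild_lo, mild_hi):
    """Pick a multiplier mode for one 8-bit weight (hypothetical encoding).

    Returns 2 (most aggressive), 1 (less aggressive), or 0 (exact).
    Each range test mirrors two comparators feeding one AND gate.
    """
    in_aggressive = (weight >= aggr_lo) and (weight <= aggr_hi)
    in_mild = (weight >= mild_lo) and (weight <= mild_hi)
    if in_aggressive:      # assumption: the aggressive range has priority
        return 2
    if in_mild:
        return 1
    return 0               # weights outside both ranges stay exact


# Example with made-up ranges around a layer median of 128.
modes = [select_mode(w, 120, 136, 100, 156) for w in (128, 105, 10)]
```

In this sketch, weights near the assumed median are sent to the most aggressive mode, weights in the surrounding band to the less aggressive mode, and outliers to the exact mode.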

Regarding the values assigned to and , Figure 3 shows an example of the weight distribution for four different layers of ResNet20 and how the proposed mapping is applied. The dark red area represents the weight values that are mapped to (), while the lighter red area represents the weight values that are mapped to (). The rest of the values that are not included in either colored patches are mapped to . By modifying the values of and the stochastic optimizer widens or narrows down the colored areas to find the best solution that satisfies the required constraints.
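As a rough sketch of how a pair of per-layer percentages could be turned into weight ranges around the median, the Python snippet below selects rank-centered bands from a layer's sorted weights. The helper names are hypothetical, and assigning the inner band to the most aggressive mode is an assumption, not something stated by the source.

```python
def centered_range(sorted_weights, fraction):
    """Return (low, high) of the rank-centered band holding `fraction` of the weights.

    Returns None when the requested fraction selects no weight at all.
    """
    n = len(sorted_weights)
    k = min(round(fraction * n), n)   # number of weights inside the band
    if k <= 0:
        return None
    start = (n - k) // 2              # center the band on the median rank
    return sorted_weights[start], sorted_weights[start + k - 1]


def layer_ranges(weights, frac_inner, frac_outer):
    """Inner band (assumed: most aggressive mode) and enclosing outer band."""
    ordered = sorted(weights)
    inner = centered_range(ordered, frac_inner)
    outer = centered_range(ordered, frac_inner + frac_outer)
    return inner, outer


# Toy layer with one weight per 8-bit value; 25% inner band, 25% outer band.
inner, outer = layer_ranges(range(256), 0.25, 0.25)
```

Because quantized weights repeat, the share of multiplications that actually falls inside each value range can deviate from the requested percentage; widening or narrowing the fractions corresponds to growing or shrinking the colored areas in Figure 3.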

Fig. 3: Mapping example for ResNet20 on CIFAR-10. Darker area indicates the mapping range of mode, lighter area refers to mapping and uncolored area refers to mapping. 8-bit quantization in [0,255] is used.
Fig. 4: Optimization steps for mapping DNN weights to approximate modes.

When the exploration is triggered on a query , initially the and signals contain random values (Figure 4). The output signal of the accelerator for the executed DNN, in terms of accuracy per batch, is then analyzed for its robustness; the stochastic optimizer correlates the robustness value with the per-layer approximation and alters the impact of the approximation through modifications on and . In each optimization iteration, the stochastic optimizer aims to gradually minimize the analyzed robustness of the system based on its output. Therefore, the goal of the optimizer is to eventually tweak the and signals in a way that satisfies all the given constraints described in the query (e.g., IQ1-IQ3). Regarding the utilized stochastic optimizer, we employ the Expected Robustness Guided Monte Carlo (ERGMC) algorithm, which is based on simulated annealing and is presented in [abbas2014robustness]. We set the number of control points to be equal to the number of convolutional layers of each target DNN, and distribute them evenly. Overall, the stochastic optimizer aims to push the system’s behavior as close as possible to the specified constraint boundaries. The parameter mining phase is completed after a predefined number of tests. We then generate a Pareto-front in the parameter space, resulting from all the conducted tests. What we consider to be the final output of this phase is the mapping that corresponds to the maximum found value of the parameter .
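To give a feel for the kind of search involved, the following Python sketch shows a plain simulated-annealing loop over one per-layer vector with entries in [0, 1]. It is a generic illustration and not the ERGMC algorithm of [abbas2014robustness]; the cost function, step size, cooling schedule, and all names are assumptions.

```python
import math
import random


def anneal(cost, n_layers, n_tests=50, step=0.1, t0=1.0, seed=0):
    """Generic simulated annealing over a vector in [0, 1]^n_layers.

    `cost` is any user-supplied function to minimize (hypothetical here).
    """
    rng = random.Random(seed)
    current = [rng.random() for _ in range(n_layers)]  # random initial signal
    current_cost = cost(current)
    best, best_cost = list(current), current_cost
    for i in range(1, n_tests):
        temp = t0 * (1.0 - i / n_tests)                # linear cooling, stays > 0
        # Perturb every entry and clamp it back into [0, 1].
        cand = [min(1.0, max(0.0, v + rng.uniform(-step, step))) for v in current]
        cand_cost = cost(cand)
        delta = cand_cost - current_cost
        # Always accept improvements; accept worse moves with Metropolis probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = cand, cand_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
    return best, best_cost


# Toy cost standing in for a robustness-derived objective.
toy_cost = lambda v: sum((x - 0.3) ** 2 for x in v)
best_vec, best_val = anneal(toy_cost, n_layers=4)
```

In the framework described above, each cost evaluation would correspond to one test, i.e., running inference with the candidate mapping and analyzing the resulting per-batch accuracy signal.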

Fig. 5: A parameter mining example on GoogLeNet and CIFAR-100, showcasing different approximation mappings selected by the stochastic optimizer.

Figure 5 shows an example of the parameter mining process on the GoogLeNet network on the CIFAR-100 dataset. We utilized LVRM [tasoulas2020weight] as the approximate multiplier which comprises three modes of operation namely , , and . This example is based on the query IQ3 () presented in Section IV-B, where , , , and . Additionally, we set the maximum number of tests for the stochastic optimizer to . For each run, Figure 5 shows the achieved accuracy for each batch of the dataset (output signal), the corresponding utilization of the approximate modes across all the layers of the DNN, and the constraints of the query that were satisfied. In the very first run of the parameter mining phase all weights are assigned to an approximate mode randomly. By the fifth run of the parameter mining phase (Figure 5(a)), of the weights are mapped on , to and the rest to , since the stochastic optimizer correlates energy gains to and values, pushing for more energy savings through increased utilization. The robustness for this output signal is negative as it falsifies all constraints. Thus, the optimizer modifies the values of and to correlate the robustness value with the per layer introduced approximation. By letting the optimizer run for a few more tests, Figure 5(b) shows the output signal at run #20 and the corresponding information about the utilization of the approximate modes and the satisfaction of the constraints. Since the initial high utilization of resulted in very low robustness, the optimizer tried to utilize the mode more to make the robustness value higher, as it introduces less error. At this run, the utilization of is 49%, of 21% and of 30%. This mapping satisfies two of the constraints but still not all of them, therefore the robustness of the system is still negative. 
By letting the optimizer continue, in the last run #50 (Figure 5(c)), we see that all constraints are satisfied and the optimizer managed to find a balance for the utilization of the approximate modes and thus, the robustness of the system is now positive. Overall, the robustness value is an indication of how “far” or “near” the signal is from satisfying the initially set accuracy requirements. The robustness is evaluated on the accuracy signal based on the PSTL requirements, and is then utilized by the stochastic optimizer to select the next approximation mappings, which lead to parameter ranges.
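The notion of robustness as a signed distance from the constraint boundaries can be sketched in Python as follows. This is an illustrative quantitative semantics, not the exact PSTL robustness computed by the toolbox used in this work: treating the conjunction as a minimum and the "for a percentage of the batches" constraint as the k-th largest margin are assumptions, as are all function and parameter names.

```python
def robustness(drops, eps, percent, max_drop, avg_thr):
    """Signed margin of a per-batch accuracy-drop signal (all values in %).

    drops    : accuracy drop of each batch w.r.t. the exact baseline
    eps      : per-batch drop bound that must hold for `percent`% of batches
    percent  : integer percentage of batches that must respect `eps`
    max_drop : per-batch drop bound that must hold for every batch
    avg_thr  : bound on the average drop
    Positive result: all constraints hold; negative: at least one is violated.
    """
    n = len(drops)
    # "At any time": the worst batch decides.
    always_margin = min(max_drop - d for d in drops)
    # "For percent% of the batches": the k-th best batch decides (assumption).
    k = (percent * n + 99) // 100            # integer ceil(percent * n / 100)
    margins = sorted((eps - d for d in drops), reverse=True)
    frac_margin = margins[k - 1] if k > 0 else float("inf")
    # Average-drop constraint.
    avg_margin = avg_thr - sum(drops) / n
    # Conjunction of the three constraints: the smallest margin decides.
    return min(always_margin, frac_margin, avg_margin)


ok = robustness([1, 2, 4, 10], eps=3, percent=50, max_drop=15, avg_thr=5)
bad = robustness([1, 2, 4, 20], eps=3, percent=50, max_drop=15, avg_thr=5)
```

With these toy numbers, the first signal yields a positive value because every constraint holds with some margin, while the second is negative because one batch exceeds the per-batch bound.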

V Evaluation

We aim to showcase the strengths of the proposed method in terms of 1) efficient mappings with higher energy savings than previous methods; 2) efficient utilization of all approximate modes; 3) automatic and scalable fine-grain exploration. Regarding the energy consumption, the MAC units are described in Verilog RTL, synthesized using Synopsys Design Compiler and mapped to a 7nm technology library. Mentor QuestaSim is used for post-synthesis timing simulations and the switching activity of the MAC units is captured through 1 million randomly generated inputs, which is then fed to Synopsys PrimeTime to calculate power consumption. Additionally, we bridge  [HoxhaDF17sttt], a toolbox for temporal logic falsification, with the TensorFlow machine learning library, in which we overrode the convolution layers and replaced the exact multiplications with the respective approximate ones [mrazek2019alwann]. Finally, we use the stochastic optimizer in to solve the PSTL query and acquire the approximation mappings.

We consider seven different PSTL queries and evaluate our findings against the mapping methods presented in LVRM [tasoulas2020weight] and ALWANN [mrazek2019alwann] across seven DNNs quantized to 8 bits: GoogLeNet [szegedy2015going], MobileNetv2 [sandler2018mobilenetv2], ResNet20 [he2016deep], ResNet32 [he2016deep], ResNet44 [he2016deep], ResNet56 [he2016deep], and ShuffleNet [zhang2018shufflenet]. All the aforementioned DNNs were trained on three datasets: CIFAR-10, CIFAR-100, and GTSRB, each of which considers input image size. We additionally evaluate our method on the ImageNet dataset ( image size) on the following four DNNs: InceptionV3 [szegedy2016rethinking] (rescales images to ), MobileNet [howard2017mobilenets], NasNet [zoph2018learning], and VGG16 [simonyan2014very]. We also utilized the method presented in [spantidi2021positive]; however, due to its in-depth per-filter search, the exploration time for the ImageNet dataset was extremely high. We use 25% of each dataset during the optimization phase. We provide more information about execution time in Section V-D.

V-A Considered PSTL queries

What is the maximum achieved energy gain during inference such that:
Q1: any per batch accuracy drop is less than 3% for 40% of the batches, and the per batch accuracy drop is less than 15% at any time, and the average accuracy drop is less than .
Q2: any per batch accuracy drop is less than 3% for 60% of the batches, and the per batch accuracy drop is less than 15% at any time, and the average accuracy drop is less than .
Q3: any per batch accuracy drop is less than 3% for 80% of the batches, and the per batch accuracy drop is less than 15% at any time, and the average accuracy drop is less than .
Q4: any per batch accuracy drop is less than 5% for 40% of the batches, and the per batch accuracy drop is less than 15% at any time, and the average accuracy drop is less than .
Q5: any per batch accuracy drop is less than 5% for 60% of the batches, and the per batch accuracy drop is less than 15% at any time, and the average accuracy drop is less than .
Q6: any per batch accuracy drop is less than 5% for 80% of the batches, and the per batch accuracy drop is less than 15% at any time, and the average accuracy drop is less than .
Q7: the average accuracy drop is less than .
TABLE I: All considered queries expressed in PSTL with their respective description, where .

Our initial motivation was to express intricate properties of systems through PSTL and find energy-efficient approximation mappings given any approximate reconfigurable multiplier. In Section IV-B, we presented the queries we considered to describe said properties. In our evaluation, we use variations of the initially presented queries to build the Queries Q1-Q7 as shown in Table I. We constructed queries that aim to capture different levels of requirements, with some queries being more relaxed than others. Note that, as mentioned in Section IV-B, there are some user-defined variables: , , and . For all of the considered queries in this evaluation, is set to be 15%. This value was selected based on the assumption that such a big accuracy drop should be the maximum drop per batch that can be tolerated. The queries shown in Table I are split into three main parts.

Strict fine-grain constraints (Q1-Q3): For the first three queries Q1-Q3, we set the maximum acceptable accuracy drop per batch from the baseline to be 3%, and we require this to hold for 40%, 60%, and 80% of the batches, respectively. Requiring this specification to hold for 40% of the batches (Q1) is a less aggressive approach compared to requiring it to hold for 80% of the batches (Q3). Overall, these three queries are the strictest considered in this evaluation.

Relaxed fine-grain constraints (Q4-Q6): The next three queries Q4-Q6 in Table I are a more relaxed version of the previously analyzed Q1-Q3 queries. We vary the required percentage of batches in the same way, i.e., 40%, 60%, and 80%, but this time the per batch acceptable accuracy drop threshold is larger and set to 5%. With queries Q4-Q6, we wanted to allow more room for aggressive approximation by slightly increasing the acceptable accuracy drop per batch. Similar to the previous triad of queries, strictness increases gradually as we move from query Q4 to Q6, caused by the increasing percentage of batches.

No fine-grain constraints (Q7): Q7 is the last query considered in this evaluation and is the most relaxed among all. We did not impose any constraints per batch and we only set the coarse-grain requirement of to hold. We included this query in our evaluation since it is the same requirement enforced by previous works, considering only average accuracy drop [tasoulas2020weight, mrazek2019alwann, sarwar2018energy, spantidi2021positive, zervakis2020design].

For all seven queries, we consider three different cases of : and . The different combinations of , and values provide a diverse coverage in terms of requirement strictness on DNN accuracy. Note that, the proposed methodology can be scaled to any number of iterations and batch size. The following results are based on a number of 100 iterations and a batch size of 100 to stay consistent with the examples shown in Sections III and IV-C. We evaluated this work for smaller batch sizes with similar behavior. However, for smaller batch sizes the and values would require appropriate alterations.

V-B Comparison with a weight-oriented mapping methodology

Fig. 6: Utilization of approximate modes, per layer, used by LVRM [tasoulas2020weight] (top) compared to our mapping (bottom) for ResNet20, CIFAR-10 dataset, Q7, and 1% accuracy drop threshold.
CIFAR-10 Q1 Q2 Q3 Q4 Q5 Q6 Q7
GoogLeNet 2% 1%, 2%
MobileNetv2 1%, 2%
ResNet20 2%
ResNet32 2% 1%, 2% 2% 2%
ResNet44 2% 2%
ResNet56 1%, 2% 1%, 2%
ShuffleNet 2% 2% 2% 1%, 2% 2% 2%
CIFAR-100 Q1 Q2 Q3 Q4 Q5 Q6 Q7
GoogLeNet 2%
MobileNetv2 2%
ResNet20 2% 1%, 2%
ResNet32 2% 1%, 2% 2
ResNet44 2%
ResNet56 2% 2%
ShuffleNet 2% 1%, 2% 2% 1%, 2%
GTSRB Q1 Q2 Q3 Q4 Q5 Q6 Q7
MobileNetv2 1%, 2% 2%
ResNet20 1%, 2% 1%, 2%
ResNet32 1%, 2%
ResNet44 2% 1%, 2%
ResNet56 1%, 2% 1%, 2%
ShuffleNet 1%, 2%
ImageNet Q1 Q2 Q3 Q4 Q5 Q6 Q7
Inceptionv3 2%
Mobilenet 1%, 2% 1%, 2%
VGG16 2% 2%
TABLE II: The queries LVRM [tasoulas2020weight] satisfies for all datasets and DNNs

In this section, we present the benefits of our mapping approach against the weight-oriented mapping methodology presented in [tasoulas2020weight]. As mentioned earlier, [tasoulas2020weight] presents LVRM, an approximate reconfigurable multiplier that supports three operation modes, namely LVRM0, LVRM1, and LVRM2. LVRM0 is the exact operation, LVRM1 is the least aggressive approximate mode, and LVRM2 is the most aggressive one. LVRM additionally presents a four-step methodology that maps DNN weights to different multiplier modes based on the layer sensitivity. For the evaluation, we:

  • Generated weight-to-mode mapping solutions utilizing the LVRM multiplier by following their four-step methodology, which requires a single constraint regarding the average accuracy achieved over the dataset. Thus, we set .

  • Used the same approximate reconfigurable multiplier (LVRM) with our proposed mapping methodology, performed our proposed exploration, and generated different mappings per DNN, dataset and value.

Fig. 7: Energy gains across all DNNs for (a) CIFAR-10, (b) CIFAR-100, (c) GTSRB, and (d) ImageNet over [tasoulas2020weight]. Q7 is the only constraint that [tasoulas2020weight] satisfies across all DNNs and datasets.

In our motivation (Section III) we mentioned that the mapping methodology used in [tasoulas2020weight] is fairly aggressive and initially aims to map entire layers to the most aggressive approximate mode of the multiplier . As a result, the other approximate mode is underutilized. To further showcase this issue, Figure 6 depicts the utilization of the approximate modes across all layers of ResNet20 on CIFAR-10 for the mapping produced by [tasoulas2020weight] and for our approach, for Q7 and an accuracy drop threshold of 1%. It is evident that the mapping method in [tasoulas2020weight] severely under-utilizes M1, leading to suboptimal solutions and an inability to adequately benefit from this mode in terms of energy reduction. In contrast, our mapping achieved a more balanced utilization across the three multiplier modes, this time utilizing the mode the most. Additionally, by assigning more operations to the mode, the exact mode is utilized less when compared to the utilization by [tasoulas2020weight]. This analysis validates our motivation that [tasoulas2020weight] loses efficient solutions due to biased decisions.

First, we evaluated which queries out of the seven total shown in Table I were satisfied. Our method produced mapping solutions that satisfied all queries for all DNNs and datasets. Table II shows which queries the mapping methodology in [tasoulas2020weight] satisfied. Table II does not quantitatively compare our method with LVRM. Rather, it demonstrates that although LVRM satisfies tight coarse constraints (Q7), it fails to satisfy finer ones (i.e., Q1-Q6). LVRM [tasoulas2020weight] produces only one mapping that satisfies a general accuracy constraint threshold. Solutions towards more fine-grain optimizations are not supported, since the search procedure this method employs makes it infeasible to support more complex constraints.

Note that, the solution produced by LVRM is only one final mapping. We wanted to observe how demanding the more fine-grain accuracy requirements can be with a method that does not consider them at all. We use the following notation: 1. the symbol ✗ indicates that the mapping did not satisfy the query for any of the three thresholds regarding average accuracy ; 2. the symbol ✔ indicates that the mapping satisfied the query for all three thresholds; and 3. in any other case, we list the threshold values under which the query was satisfied. Table II shows that the mapping methodology in [tasoulas2020weight] satisfies Q7 for all DNNs, datasets, and accuracy drop thresholds. This is expected as Q7 formally expresses the general required constraint that the average accuracy drop under approximation should be within specific values. However, it fails to satisfy Q2, Q3, and Q6 almost completely and especially for Imagenet, which is the most difficult examined dataset. These three queries require fine-grain exploration, which the method in [tasoulas2020weight] does not support, deeming it inappropriate for applications that require specific quality of service. Thus, it validates our motivation for supporting systematic fine-grain exploration.
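The check-and-notation scheme described above can be sketched in Python as follows. The query parameters are taken from Table I; the function names and the set of average-drop thresholds passed in are assumptions, and the Boolean checks are a simplification of the PSTL queries rather than their formal semantics.

```python
# (per-batch bound in %, required percentage of batches); Q7 has no per-batch part.
QUERIES = {
    "Q1": (3, 40), "Q2": (3, 60), "Q3": (3, 80),
    "Q4": (5, 40), "Q5": (5, 60), "Q6": (5, 80),
    "Q7": None,
}


def satisfies(drops, query, avg_thr, max_drop=15):
    """True if the per-batch accuracy drops (in %) meet the given query."""
    avg_ok = sum(drops) / len(drops) < avg_thr
    if QUERIES[query] is None:
        return avg_ok
    eps, percent = QUERIES[query]
    within = sum(1 for d in drops if d < eps)
    frac_ok = within * 100 >= percent * len(drops)
    always_ok = all(d < max_drop for d in drops)
    return avg_ok and frac_ok and always_ok


def table_cell(drops, query, thresholds):
    """Render one table cell: a cross, a check mark, or the satisfied thresholds."""
    passed = [t for t in thresholds if satisfies(drops, query, t)]
    if not passed:
        return "✗"
    if len(passed) == len(thresholds):
        return "✔"
    return ", ".join(f"{t}%" for t in passed)


# Toy signal of five batches evaluated against two illustrative thresholds.
cell = table_cell([1, 1, 1, 1, 4.5], "Q3", thresholds=(1, 2))
```

Since a fixed mapping produces a single accuracy signal, every cell in a row is derived from the same `drops` sequence; only the query and the thresholds change.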

Second, we compared the energy savings between the different methods. In particular, Figure 7 shows the energy gains of our mapping for each query over the solution found following the methodology in [tasoulas2020weight] for all DNNs, datasets, and average accuracy drop thresholds. For the CIFAR-10 dataset (Figure 7(a)), the gains of our method over LVRM [tasoulas2020weight] are overall lower than the ones achieved for the other datasets. This is attributed to the fact that CIFAR-10 is an “easy” dataset that comprises only 10 classes. Therefore, LVRM can achieve energy-efficient mappings that leave little room for improvement; however, the proposed method can exploit this tight margin to produce even better solutions in terms of energy gains. Correspondingly, the slightly more difficult GTSRB dataset (Figure 7(c)), which comprises 43 classes, shows bigger gains over LVRM than CIFAR-10, and the even more difficult 100-class CIFAR-100 dataset (Figure 7(b)) shows bigger gains still. The hardest dataset evaluated in this work is ImageNet, which comprises 1000 classes and shows the highest gains over LVRM overall. This is attributed to the fact that LVRM mainly targets mapping entire convolutional layers to the most aggressive multiplier mode, which can lead to very pessimistic solutions. For instance, if mapping the two most error-resilient layers to the most aggressive multiplier mode violates the accuracy threshold, only the most error-resilient layer will be mapped to this mode entirely. For the rest of the layers, only ranges of weights are examined for approximation mappings, which can provide very low final energy gains.

Based on the findings presented above, our methodology allows us to find mappings under multiple fine-grain constraints while also being more energy-efficient than the state of the art. Additionally, due to the formalization and the usage of PSTL, it is easy to create new queries and automate the search process (increased scalability), avoiding manual tuning.

V-C Comparison with a layer-oriented mapping methodology

We compare our proposed methodology against ALWANN [mrazek2019alwann], a layer-oriented method where a different static approximate multiplier from the EvoApprox library [mrazek2017evoapprox8b] is mapped to a DNN layer, leading to a different architecture for each distinct DNN. The authors consider a heterogeneous platform of set tiles, and employ a multi-objective genetic algorithm to map multipliers to each tile. Even though ALWANN does not propose a reconfigurable multiplier, the tile-based architecture allows the existence of multiple approximate multipliers at the same time. In our experimental evaluation, we consider the number of multipliers per tile to be 3. For the evaluation, we:

  • Generated layer-to-approximation mapping solutions following the procedure of ALWANN to select the multipliers. ALWANN also requires a constraint regarding the average accuracy achieved over the dataset, and thus we set for each DNN.

  • Used the same approximate multipliers selected by ALWANN under our proposed mapping framework, performed the proposed exploration and generated different mappings per DNN, dataset and value.

CIFAR-10 Q1 Q2 Q3 Q4 Q5 Q6 Q7
GoogLeNet 2% 2% 2%
ResNet20 1%, 2%
ResNet32 1%, 2% 2% 1%, 2% 1%, 2%
ResNet44 1%, 2% 2% 2%
ResNet56 2% 2%
ShuffleNet 2% 2%
CIFAR-100 Q1 Q2 Q3 Q4 Q5 Q6 Q7
GoogLeNet 2% 2%
MobileNetv2 1%, 2% 2% 1%, 2% 2%
ResNet20 1%, 2% 1%, 2%
ResNet32 1%, 2%
ResNet44 1%, 2% 1%, 2% 1%, 2% 1%, 2%
ResNet56 2% 2% 2%
ShuffleNet 2% 1%, 2%
GTSRB Q1 Q2 Q3 Q4 Q5 Q6 Q7
ResNet32 2% 2%
ResNet44 2%
ShuffleNet 2% 2% 2% 2%
ImageNet Q1 Q2 Q3 Q4 Q5 Q6 Q7
Inceptionv3 1%, 2%
NasNet 2%
VGG16 1%, 2% 1%, 2% 2%
TABLE III: The queries ALWANN [mrazek2019alwann] satisfies for all datasets and DNNs

First, we evaluated which queries out of the seven total shown in Table I were satisfied by ALWANN. As mentioned earlier, our method produced mapping solutions that satisfied all queries for all DNNs and datasets. Table III shows which queries the ALWANN method satisfied. Again, we can see that ALWANN satisfies Q7 for all DNNs, datasets, and accuracy drop thresholds. This is expected because, as described previously, Q7 formally expresses the required constraint that the average accuracy drop should be within specific values. However, it fails to satisfy Q3 and Q6 almost completely, especially for ImageNet. In this case again, the solution produced by ALWANN is only one final mapping, and does not change across the queries. Solutions towards more fine-grain optimizations are not supported, since the search procedure this method employs makes it infeasible to support more complex constraints. At this point it is important to mention that since ALWANN follows a layer-based approach, the selected multipliers introduce smaller error with smaller energy reduction to satisfy the average accuracy thresholds. Consequently, ALWANN satisfies many of the Q1 queries, but at the cost of lower energy gains.

Fig. 8: Energy gains across all DNNs for (a) CIFAR-10, (b) CIFAR-100, (c) GTSRB, and (d) Imagenet over [mrazek2019alwann]. Q7 is the only constraint that [mrazek2019alwann] satisfies across all DNNs and datasets.

Figure 8 shows the energy gains of our proposed mapping methodology for each query over the solution found by ALWANN for all DNNs, datasets, and average accuracy drop thresholds. The proposed method produced mappings that achieved significantly higher gains in energy, since ALWANN is a layer-to-multiplier mapping approach while the proposed method follows a fine-grain weight mapping approach. Specifically for the ImageNet dataset (Figure 8 (d)), the multipliers selected by ALWANN were some of the least aggressive ones available [mrazek2017evoapprox8b] to satisfy the average accuracy constraints.

V-D Cost-effective Analysis

All methods in our evaluation avoid retraining; therefore, the execution cost lies in the number of inferences required by each methodology. For instance, both methods in [tasoulas2020weight, spantidi2021positive] heavily depend on the examination of the error tolerance of each distinct DNN layer (i.e., layer sensitivity). Consequently, these methods are guaranteed to run at least as many inferences as the number of convolutional layers in a DNN.

When compared to the ImageNet dataset, the CIFAR-10, CIFAR-100 and GTSRB datasets comprise images that are significantly smaller. They are resized to , while the images in ImageNet are rescaled to ( for Inceptionv3). Therefore, inference runs significantly faster for the CIFAR-10, CIFAR-100 and GTSRB datasets. Additionally, since a larger image size leads to a larger number of total multiplications that need to be computed on inference, step-based methods [tasoulas2020weight, spantidi2021positive] suffer in execution time on the ImageNet dataset. This fact, coupled with the vast number of convolutional layers in DNNs such as Inceptionv3 or NasNet, makes it infeasible to reach a solution in an acceptable amount of time. For instance, for the VGG16 model on the ImageNet dataset, the work in [spantidi2021positive] required more than one day to finish a simple evaluation of approximate multipliers on a Xeon Gold 6230 processor running at 2.10GHz. We deemed it not scalable for larger datasets and thus did not include it in our evaluation.

To ensure the proposed framework will always produce mapping solutions in acceptable timeframes, we pre-define a set number of optimization iterations. Specifically, we define the number of iterations for the optimization to be 50 for the CIFAR-10, CIFAR-100 and GTSRB datasets, and 100 for the ImageNet dataset. The proposed framework found the solution for a single query 45% faster on average compared to the complete 4-step exploration of [tasoulas2020weight]. This speed boost compared to [tasoulas2020weight] allowed us to include more queries in our evaluation, providing greater insight into the impact of fine-grain accuracy constraints and approximate multipliers on the energy gains of a DNN accelerator. The inclusion of ERGMC and the robustness calculation in this work adds negligible time overhead.

VI Conclusion

In this work, we present a unified framework that uses formal properties of approximate DNN accelerators that support reconfigurable approximate multiplications, thus enabling fine-grain optimizations. We utilize specification formalisms to express DNN properties and employ stochastic optimization to find approximate mappings that satisfy accuracy thresholds. We conducted an in-depth evaluation across multiple DNNs, datasets and accuracy requirements to obtain a better grasp of how strict constraints can still yield energy savings. When compared to other mapping methodologies on the same multipliers, our framework achieves even more than ×2 the energy gains without violating the defined accuracy thresholds.