Multi-objective Recurrent Neural Networks Optimization for the Edge – a Quantization-based Approach

The compression of deep learning models is of fundamental importance in deploying such models to edge devices. Incorporating the hardware model and application constraints during compression maximizes the benefits but ties the result to one specific case. Therefore, the compression needs to be automated. Searching for the optimal compression method parameters is considered an optimization problem. This article introduces a Multi-Objective Hardware-Aware Quantization (MOHAQ) method, which considers both hardware efficiency and inference error as objectives for mixed-precision quantization. The proposed method makes the evaluation of candidate solutions in a large search space feasible by relying on two steps. First, post-training quantization is applied for fast solution evaluation. Second, we propose a search technique named "beacon-based search" that retrains only selected solutions in the search space and uses them as beacons to estimate the effect of retraining on other solutions. To evaluate the optimization potential, we chose a speech recognition model using the TIMIT dataset. The model is based on the Simple Recurrent Unit (SRU) due to its considerable speedup over other recurrent units. We applied our method to run on two platforms: SiLago and Bitfusion. Experimental evaluations showed that SRU can be compressed up to 8x by post-training quantization without any significant increase in the error and up to 12x with only a 1.5 percentage point increase in error. On SiLago, the inference-only search found solutions that achieve 80% and 64% of the maximum possible speedup and energy saving, respectively, with a 0.5 percentage point increase in the error. On Bitfusion, with a constraint of a small SRAM size, beacon-based search reduced the error gain of inference-only search by 4 percentage points and increased the largest reachable speedup to 47x compared to the Bitfusion baseline.




1 Introduction

Realizing Deep Learning (DL) models on embedded platforms has become vital to bring intelligence to edge devices found in homes, cars, and wearable devices. However, DL models’ appetite for memory and computational power is in direct conflict with the limited resources of embedded platforms. This conflict led to research that combines methods to shrink the models’ compute and memory requirements without affecting their accuracy Rezk et al. (2020). Models can be compressed by pruning neural network weights Dai et al. (2017), quantizing the weights and activations to use fewer bits Courbariaux et al. (2015); Rybalkin et al. (2018), and many other techniques Rezk et al. (2020).

The effect of the model’s compression can be improved by using a hardware model of the target platform to guide the compression. Different Neural Network (NN) layers have different effects on energy savings, latency improvement, and accuracy degradation Yang et al. (2016). Selecting different optimization parameters, such as the pruning ratio or the precision, for different layers can maximize accuracy and hardware performance. Such methods are called hardware-aware optimizations. These optimization methods result in a specific solution that depends on a particular model, hardware platform, and accuracy constraint. Changes in the hardware platform or the application constraints would make this solution no longer optimal. Thus, the NN compression must be automated to adapt to changes in the platform and the constraints.

The compression of the NN model to run on embedded platforms can be considered an optimization problem. Some prior work on pruning and quantization has considered optimization algorithms to select pruning/quantization parameters. Some researchers have tried to maximize the accuracy while considering platform-related metrics such as latency or memory size as constraints Nahshan et al. (2019); Yang et al. (2018b); Yao et al. (2017). In contrast, another group of researchers considered accuracy as the constraint and selected a platform performance metric as the main objective Yang et al. (2016). In this work, we treat the optimization problem as a Multi-Objective Optimization Problem (MOOP) where both the accuracy and the hardware efficiency are considered objectives. First, it is non-trivial to decide whether accuracy or performance should be the objective of the optimization. Thus, considering the problem as multi-objective is more appropriate. Second, MOOP has some advantages over single-objective optimization, such as giving a set of solutions that meet the objectives in different ways Savic (2002). The embedded system designer can then trade off between the alternative solutions. The role of the optimization algorithm is only to generate the most efficient solutions and not to select one solution.

However, applying multi-objective search to select NN model compression configurations is not straightforward. Many compression methods require retraining to compensate for accuracy loss, and that cannot be done while evaluating candidate solutions in a large search space. That is why, in most of the recent work, the compression parameters are selected during training and the search ends with one solution. To make the multi-objective search possible, we rely on two aspects. First, we use post-training quantization as the compression method. Post-training quantization has recently become an increasingly reliable compression method Zhao et al. (2019); Banner et al. (2019); Nahshan et al. (2019); Nagel et al. (2020). Thus, evaluating one candidate solution only requires running the inference of the NN model. Second, we cannot guarantee that post-training quantization will provide high-accuracy solutions under all constraint scenarios. Thus, we propose a novel method called "beacon-based search" to support the inference-only search with retraining. In beacon-based search, only a few solutions (beacons) are selected for retraining, and they are used to guide other possible solutions without retraining all evaluated solutions.

This article applies the proposed Multi-Objective Hardware-Aware Quantization (MOHAQ) method to a recurrent neural network model used for speech recognition. Recurrent Neural Networks (RNNs) are NNs designed for applications with sequential inputs or outputs Donahue et al. (2014). RNNs recognize the temporal relationship between input/output sequences by adding feedback to Feed-Forward (FF) neural networks. The model used in the experiments uses Simple Recurrent Units (SRU) instead of Long Short Term Memory (LSTM) because the SRU can be parallelized over multiple time steps and is faster in training and inference Lei et al. (2018). We selected speech recognition as an important RNN application and the TIMIT dataset due to its popularity in speech recognition research. Also, the Pytorch-kaldi project provides good software support for speech recognition using the TIMIT dataset Ravanelli et al. (2019). To demonstrate the flexibility of the method in supporting different applications and hardware platforms, we applied our method to two hardware architectures with varying constraints. The first is the SiLago architecture Hemani et al. (2017), and the second is the Bitfusion architecture Sharma et al. (2017). These two architectures were chosen because they support the varying precision required in our experiments.

The contributions of this work are summarized as:

  • We enable the fast evaluation of candidate solutions for multi-objective optimization for bit-width selection for quantized NN layers by using post-training quantization. We show how to account for combined objectives like the model error and hardware efficiency.

  • We propose a method called beacon-based search to predict the retraining effect on candidate solutions without retraining all the evaluated solutions in the search space.

  • We demonstrated the flexibility of the proposed method by applying it to two hardware architectures, SiLago and Bitfusion.

  • To the best of our knowledge, this is the first work analyzing the quantization effect on SRU units. Considering SRU as an optimized version of the LSTM, it has not been investigated before if low precision quantization can be applied to SRU.

This article is organized as follows. Section 2 gives a brief explanation of the neural network layers, optimization methods, and hardware platforms used in this work. Then, Section 3 discusses the related research work. Later, in Section 4, we explain the quantization of the SRU-model and the multi-objective-based search used for the hardware-aware per-layer bit selection. In Section 5, we describe our experiments and results to assess the proposed method. Afterward, in Section 6, we discuss our results. Finally, in Section 7, we conclude the article.

2 Background

In this section, we discuss the essential components, methods, and platforms used in this work. We first explain the RNN layers related to this work. Then, we explain how NN models are quantized. Afterwards, we discuss the multi-objective optimization techniques. Finally, we cover the details of the hardware architectures used in this article.

2.1 RNN Model Components

RNNs recognize the temporal relation between data sequences by adding recurrent layers to the NN model. The recurrent layer adds feedback from previous time steps and adds memory cells to mimic the human memory. Here we cover the most popular recurrent layer, called Long Short Term Memory (LSTM) Hochreiter and Schmidhuber (1997), and the Simple Recurrent Unit (SRU) Lei et al. (2018). SRU is an alternative to LSTM that has been proposed to reduce the computational complexity of LSTM. Then we briefly cover the other layers required in the RNN model used in the experiments of this paper.

2.1.1 Long Short Term Memory (LSTM)

LSTM is composed of four major computational blocks (Figure 1). These blocks are Matrix to Vector multiplications (MV) between the input and output feedback vectors and the weight matrices, followed by bias addition and the application of a non-linear function. Three of these blocks compute the forget, input, and output gates. These gates decide which information should be forgotten, which information should be renewed, and which information should be in the output vector, respectively. The fourth computation block is used to compute the memory state vector values. For example, the forget gate output is computed as:

$$f_t = \sigma\left(W_f \, [x_t, h_{t-1}] + b_f\right),$$

where $x_t$ is the input vector, $h_{t-1}$ is the hidden state output vector, $W_f$ is the weight matrix, $b_f$ is the bias vector, and $\sigma$ is the sigmoid function. The number of computations and parameters for LSTM are shown in Table 1.

Figure 1: Long Short Term Memory (LSTM) Rezk et al. (2020).

2.1.2 Simple Recurrent Unit (SRU)

The Simple Recurrent Unit (SRU) is designed to make the recurrent unit easy to parallelize Lei et al. (2018). Most of the LSTM computations are in the form of matrix to vector multiplications. Thus, their parallelization is of great value. However, these computations rely on the previous time-step output and the previous time-step state vectors, and therefore they are not easy to parallelize over time steps. The SRU overcomes this problem by removing $h_{t-1}$ and $c_{t-1}$ from all matrix to vector multiplications; $c_{t-1}$ is used only in element-wise operations. The SRU is composed of two gates (forget and update gates) and a memory state (Figure 2). It has three matrix to vector multiplication blocks. For example, the forget gate vector is computed as:

$$f_t = \sigma\left(W_f \, x_t + v_f \odot c_{t-1} + b_f\right),$$

where $x_t$ is the input vector, $c_{t-1}$ is the old state vector, $W_f$ is the input weights matrix, $v_f$ is the recurrent weights vector, $b_f$ is the bias vector, and $\sigma$ is the sigmoid function. The number of operations and parameters for an SRU is shown in Table 1.

Figure 2: Simple Recurrent Unit (SRU) Rezk et al. (2020)
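The parallelization argument can be illustrated with a minimal NumPy sketch. It assumes the SRU formulation of Lei et al. (2018) with equal input and hidden sizes; the function and argument names (`sru_forward`, `W`, `W_f`, `W_r`, `v_f`, `v_r`) are illustrative and are not taken from the implementation used in this article. The three MV blocks depend only on the input, so they are computed for all time steps at once, and the loop over time contains only element-wise operations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_forward(x, W, W_f, W_r, v_f, v_r, b_f, b_r):
    """Run one SRU layer over a sequence.

    x: (T, n) inputs; W, W_f, W_r: (n, n) weight matrices;
    v_f, v_r, b_f, b_r: (n,) vectors. Returns h with shape (T, n).
    """
    T, n = x.shape
    # The three MV blocks use only x, so every time step is handled
    # by three matrix-matrix products outside the recurrent loop.
    u = x @ W.T
    uf = x @ W_f.T + b_f
    ur = x @ W_r.T + b_r
    c = np.zeros(n)
    h = np.empty_like(u)
    for t in range(T):
        # Only element-wise operations depend on the previous state c.
        f = sigmoid(uf[t] + v_f * c)      # forget gate
        r = sigmoid(ur[t] + v_r * c)      # second gate, also reads c_{t-1}
        c = f * c + (1.0 - f) * u[t]      # memory state update
        h[t] = r * c + (1.0 - r) * x[t]   # highway connection to the input
    return h
```

With all weights set to zero, both gates equal 0.5 and the state stays at zero, so the output reduces to half of the input, which is a quick sanity check for the sketch.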

2.2 Bidirectional RNN layer

In a Bidirectional RNN layer, the input is fed into the layer both from past to future and from future to past. Consequently, the recurrent layer is duplicated, which is equivalent to two recurrent layers working simultaneously, each processing the input in a different temporal direction. Obtaining data from the past and the future helps the network to understand the context better. This concept can be applied to different recurrent layer types, giving Bi-LSTM Li and Shen (2017), Bi-GRU Vukotic et al. (2016), and Bi-SRU. The number of operations and parameters for a Bi-SRU is shown in Table 1.

2.2.1 Projection layers

The projection layer is an extra layer added before or after the recurrent layer Sak et al. (2014). A projection layer is similar to a Fully-Connected (FC) layer. The projection layer is added to allow an increase in the number of hidden cells while keeping the total number of parameters low. The projection layer has a number of units $p$ less than the number of recurrent hidden cells $n$. The number of weight computations in the recurrent layer will be dominated by a multiple of $pn$ and not $n^2$. As $p < n$, $n$ can increase without increasing the size of the model dramatically.

RNN layer Number of Operations Number of Parameters
MAC Element-wise Nonlinear Weights Biases
Table 1: Number of operations and parameters in LSTM, SRU, and Bi-SRU. $m$ is the input vector size, and $n$ is the hidden vector size.

2.3 Quantization

Quantization reduces the number of bits used in neural network operations. It is possible to quantize the neural network weights only or the activations as well. The precision can change from 32-bit floating point to 16-bit fixed point, which usually does not affect the model’s accuracy. Therefore, many neural network accelerators use 16-bit fixed-point precision instead of floating point Chen et al. (2017); Ankit et al. (2019); Wang et al. (2018). Alternatively, the model can be quantized to integer precision to make it more feasible to deploy NN models on embedded platforms Fang et al. (2020). Integer quantization can be anything from 8 bits down to 1 bit. Low precision integer accelerators have been proposed to achieve efficiency in terms of speedup, energy, and area savings Google ([Accessed on: October. 2021]); Judd et al. (2016); Sharma et al. (2017); Lee et al. (2019). However, integer quantization can cause a high degradation in accuracy. Thus, in many cases retraining is required to minimize this degradation, or the model is trained using quantization-aware training from the beginning. Recently, there has been a growing interest in post-training quantization. Post-training quantization quantizes the pre-trained model parameters without any further retraining epochs after quantization. That is useful if the training data or the training platform is unavailable at deployment time. Most post-training quantization methods target the outlier values that consume the allowed precision and cause accuracy loss. Clipping the outliers to narrow the data range can overcome the problem, and several techniques are used for selecting clipping thresholds Nahshan et al. (2019); Zhao et al. (2019). Alternatively, Outlier Channel Splitting (OCS) is a method that duplicates the channels with the outlier values and then halves the output values or their outgoing weights to preserve functional correctness Zhao et al. (2019). However, this channel duplication increases the size of the model. In this work, we use clipping during quantization, and the clipping thresholds are selected using the Minimum Mean Square Error (MMSE) method Sung et al. (2015).
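A minimal sketch of MMSE-based clipping is given here for illustration. It assumes symmetric linear quantization and a simple grid search over candidate thresholds; the function names and the `num_candidates` parameter are hypothetical and do not describe the implementation used in this article.

```python
import numpy as np

def quantize_symmetric(w, bits, clip):
    """Quantize w to a signed integer grid covering [-clip, clip], then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = clip / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def mmse_clip_threshold(w, bits, num_candidates=100):
    """Return the candidate threshold with the smallest mean squared quantization error."""
    max_abs = float(np.max(np.abs(w)))
    if max_abs == 0.0:
        return 0.0  # an all-zero tensor needs no clipping
    # Candidate thresholds from a small fraction of the range up to no clipping at all.
    candidates = np.linspace(max_abs / num_candidates, max_abs, num_candidates)
    errors = [np.mean((w - quantize_symmetric(w, bits, c)) ** 2) for c in candidates]
    return float(candidates[int(np.argmin(errors))])
```

For a weight tensor with a few large outliers, the selected threshold is typically well below the maximum absolute value: the clipped outliers contribute some error, but the remaining values are represented on a much finer grid.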

2.4 Optimization using Genetic Algorithms

Optimization is essential to various problems in engineering and economics whenever decision-making is required Chong and Zak (2013). It works on finding the best choice from multiple alternatives. The search for the best is guided by an objective function and restricted by defined constraints. In many problems, there exists more than one objective. Those objectives can conflict, so that enhancing one of them requires worsening the others. Single-Objective Optimization (SOOP) tries to find a single best solution Savic (2002). This solution corresponds to the minimum or maximum value of a single objective function that groups all the objectives into one. This type of optimization does not provide the designer with a collection of alternative solutions that trade different objectives against one another. On the other hand, Multi-Objective Optimization (MOOP) provides a set of solutions known as a Pareto set (front) for conflicting objectives. A Pareto set is a set of non-dominated solutions: no solution in the set is dominated by any other solution. A solution a dominates a solution b if a is better than b in at least one objective and no worse than b in any other objective.
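The dominance relation can be made concrete with a short sketch. It assumes that every objective is minimized and that a solution is represented as a tuple of objective values; the example values are made up for illustration and are not results from this article.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (error, cycles) pairs for five candidate solutions.
candidates = [(0.17, 100), (0.18, 60), (0.19, 80), (0.25, 40), (0.17, 120)]
front = pareto_front(candidates)
```

Here `(0.19, 80)` is dominated by `(0.18, 60)` and `(0.17, 120)` is dominated by `(0.17, 100)`, so the front keeps the remaining three points, each of which offers a different trade-off.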

Genetic Algorithms (GAs) are popular algorithms for both single and multi-objective optimization Chahar et al. (2021). GAs are inspired by natural selection, where the fittest survive. A GA is a population-based algorithm. Each population is composed of some candidate solutions (individuals). Each individual is composed of a number of variables called chromosomes. A GA is an iterative algorithm that evaluates the fitness values of the solutions in one population and generates a new population (offspring) from the old population until a criterion is met or the defined number of populations is completed. The new population is formed by selecting pairs of good solutions (based on their fitness values) and applying crossover to them. Crossover composes a new individual by mixing parts of the parent individuals. Then mutation is applied to the offspring individual to change some of its chromosomes. The selection, crossover, and mutation operations are repeated until a new population is complete. Also, an encoding function is used to map the values of the solution variables into another representation that can be used for genetic operations such as crossover and mutation.

Modifications have been applied to GAs to work on multi-objective problems. These modifications mostly concern the fitness function assignment, while the rest of the algorithm is similar to the original GA Chahar et al. (2021). There are various multi-objective GAs, such as NPGA Horn et al. (1994), NSGA Srinivas and Deb (2000), and NSGA-II Deb et al. (2002). NSGA-II (Nondominated Sorting Genetic Algorithm II) is the enhanced version of NSGA. NSGA-II is a well-known fast multi-objective GA Yusoff et al. (2011). In this work, we have used NSGA-II as our multi-objective search method. The NSGA-II implementation is provided by PYMOO (a Python library for optimization) and is based on the NSGA-II paper Deb et al. (2002). NSGA-II is similar to a general GA, but the mating and the survival selection are modified PYMOO ([Accessed on: October. 2021]). NSGA-II selects the individuals front-wise. Since not all individuals of a front may be allowed to survive, it might be required to split a front. The crowding distance (Manhattan distance in the objective space) is used for selection when a front is split. Extreme points are assigned a crowding distance of infinity to keep them in all generations. The selection method used is binary tournament mating selection, where individuals are compared first by rank and then by crowding distance. The popularity of NSGA-II relies on three factors: the fast non-dominated sorting approach, the simple crowded comparison operator, and the fast crowded distance estimation procedure Yusoff et al. (2011).
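The crowding distance and the crowded comparison can be sketched in plain Python as follows. This is an illustrative sketch of the procedure described in Deb et al. (2002), not the PYMOO code; the function names are hypothetical, and each individual in the comparison is represented as a `(rank, crowding_distance)` pair.

```python
import math

def crowding_distance(front):
    """Crowding distance of each point in one front (list of objective tuples)."""
    n = len(front)
    if n == 0:
        return []
    dist = [0.0] * n
    for k in range(len(front[0])):
        order = sorted(range(n), key=lambda i: front[i][k])
        lo, hi = front[order[0]][k], front[order[-1]][k]
        # Extreme points get an infinite distance so they always survive.
        dist[order[0]] = dist[order[-1]] = math.inf
        if hi == lo:
            continue
        for j in range(1, n - 1):
            # Normalized gap between the two neighbours along objective k.
            gap = front[order[j + 1]][k] - front[order[j - 1]][k]
            dist[order[j]] += gap / (hi - lo)
    return dist

def crowded_better(a, b):
    """Binary tournament rule: lower rank wins, ties go to the larger distance."""
    return a[0] < b[0] or (a[0] == b[0] and a[1] > b[1])
```

Preferring a larger crowding distance among individuals of the same rank favours solutions in sparsely populated regions of the front, which keeps the surviving set spread over the trade-off curve.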

2.5 Architectures Under Study

This section gives a brief presentation of the two architectures we have applied our methods to. The first is SiLago architecture, and the second is Bitfusion. In SiLago architecture, the low precision support is a new feature under construction. Thus, we introduce the low precision support idea in SiLago and the expected speedup and energy saving.

2.5.1 SiLago Architecture

The architecture is a customized Coarse Grain Reconfigurable Architecture (CGRA) built upon two types of fabrics. The first fabric is called Dynamically Reconfigurable Resource Array (DRRA) and is optimized for dense linear algebra and streaming applications. The second fabric is called Distributed Memory Architecture (DiMArch) and is used as variable size streaming scratchpad memory. Figure 3 shows a SiLago design example.

Figure 3: Fragment of DRRA and DiMArch CGRA fabrics and their components.
  • DRRA:

    All the components of the DRRA (computation, storage, interconnect, control, and address generation) are customized for streaming applications and further updated to support NN operations. DRRA has a unique, extensive parallel distributed local control scheme and interconnect network Shami and Hemani (2009); Farahini et al. (2014). Each cell in the DRRA comprises a Register File (RF), a sequencer (SEQ), a data processing unit (DPU), and a switchbox. Each DPU and RF in DRRA outputs to a bus that straddles two columns on each side (right and left) to create an overlapping 5-column span. The two busses connect via a switchbox to create a circuit-switched network on chip (NoC). The data on the selected output bus are propagated via the switchbox to the inputs of the DPU and RF Shami and Hemani (2009). This interconnect allows the DPUs and RFs of different cells to be chained together to create larger and more complex data paths. The RFs and NoCs in the DRRA use 16-bit words. The DPU inside each cell can be customized for NN computations. For NN use, the DPU employs a special computation unit called the Non-Linear Arithmetic Unit (NACU) Baccelli et al. (2020). NACU was originally developed to operate on a specific bit-width, decided at design time.

    For this work, we have updated the design to be able to support three different types of low precision operations, 1x 16-bit, 2x 8-bit, or 4x 4-bit. This is done by modifying the existing multiplier and accumulator inside the NACU to use Vedic multiplication Tiwari et al. (2008). The 16-bit multiplier is split into 16 4-bit multiplications. Depending on the type of operation, the multiplier can be reconfigured, and a different number of these smaller multipliers is used. Their results are then added to produce the final result.

  • DiMArch

    The computational fabric is complemented by a memory fabric called DiMArch. DiMArch provides a matching parallel distributed streaming scratchpad memory. It is composed principally of SRAM macros Tajammul et al. (2016), coupled with an address generation unit (AGU). The SRAM macros, along with the AGUs, are connected with each other by two NoCs. One NoC is a circuit-switched high-bandwidth data NoC, and the second is a packet-switched control and configuration NoC. The massively parallel interconnect between the two fabrics in Figure 3 ensures that the computational parallelism is matched by the parallelism of access to the scratchpad memory. The cells of the two CGRAs can be dynamically clustered to morph into custom datapaths, shown as private execution partitions in Figure 3, see Tajammul et al. (2013). This has been exploited in Jafri et al. (2017) to create neural network accelerators.

  • Energy and timing estimations

    Table 2 presents the energy consumption and the speedup of the arithmetic operations in the SiLago platform. The energy consumption is based on post-layout simulations of the reconfigurable multiplier and accumulator (MAC). The multiplier and accumulator were synthesized using a 28nm technology node. The energy consumption for the SRAM access was based on SRAM macro-generated tables in the same 28nm technology. The speedup is calculated as operations per clock cycle. The MAC unit can be reconfigured to calculate one 16-bit, two 8-bit, or four 4-bit MACs in every cycle.

    |                              | 16x16 | 8x8   | 4x4   |
    |------------------------------|-------|-------|-------|
    | MAC speedup                  | 1x    | 2x    | 4x    |
    | MAC energy cost ()           | 1.666 | 0.542 | 0.153 |
    | Loading 1-bit energy cost () | 0.08  | 0.08  | 0.08  |

    Table 2: The speedup and energy consumed by different types of low precision operations on SiLago architecture.
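As an illustration of how the figures in Table 2 can feed a hardware model, the sketch below estimates the speedup over an all-16-bit baseline and the energy of a mixed-precision configuration. It rests on several simplifying assumptions that are not taken from this article: each layer uses one precision for both operands, every weight is loaded from SRAM exactly once, and the energies are expressed in whatever unit Table 2 uses. The function name and the layer tuple layout are hypothetical.

```python
# Values copied from Table 2 (energy in the table's unit).
MAC_SPEEDUP = {16: 1, 8: 2, 4: 4}
MAC_ENERGY = {16: 1.666, 8: 0.542, 4: 0.153}
LOAD_ENERGY_PER_BIT = 0.08

def silago_estimate(layers):
    """Estimate (speedup, energy) for a list of (num_macs, num_weights, bits) layers.

    Assumption: MAC cycles dominate the run time and each weight is loaded once.
    """
    baseline_cycles = sum(macs for macs, _, _ in layers)
    cycles = sum(macs / MAC_SPEEDUP[bits] for macs, _, bits in layers)
    energy = sum(
        macs * MAC_ENERGY[bits] + weights * bits * LOAD_ENERGY_PER_BIT
        for macs, weights, bits in layers
    )
    return baseline_cycles / cycles, energy

# Two hypothetical layers of equal size, one kept at 16-bit and one at 4-bit.
speedup, energy = silago_estimate([(1000, 1000, 16), (1000, 1000, 4)])
```

In this example the 4-bit layer needs a quarter of the cycles of the 16-bit layer, so the overall speedup is 2000 / 1250 = 1.6x, which shows why the precision of the largest layers dominates the achievable speedup.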

2.5.2 Bitfusion Architecture

Bitfusion is a variable precision architecture designed to support variations in the precision of quantized neural networks Sharma et al. (2017). It is composed of a 2-D systolic array of so-called Fused Processing Elements (Fused-PEs). Each Fused-PE is composed of 16 individual Bit-Bricks, each of which is designed to do 1-bit or 2-bit MAC operations. By grouping Bit-Bricks in one Fused-PE, higher precision operations are supported. The highest parallelism rate of one Fused-PE is 16x when the two operands are 1-bit or 2-bit, and no parallelism is achieved with two 8-bit operands. To support 16-bit operations, the Fused-PE is used for four cycles. Thus, the speedup of using 2-bit over 16-bit operations is 64x.
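The numbers in this paragraph follow from counting how many 2-bit Bit-Bricks one multiplication occupies. The sketch below reproduces them; the treatment of operands with different bit-widths is an assumption extrapolated from the brick-grouping idea, and the function names are illustrative.

```python
import math

BRICKS_PER_FUSED_PE = 16

def bitfusion_ops_per_cycle(act_bits, weight_bits):
    """Multiplications per cycle in one Fused-PE for the given operand widths."""
    # Each operand is split into 2-bit slices; every pair of slices needs one brick.
    bricks = math.ceil(act_bits / 2) * math.ceil(weight_bits / 2)
    return BRICKS_PER_FUSED_PE / bricks

def bitfusion_speedup_over_16bit(act_bits, weight_bits):
    """Speedup relative to 16-bit by 16-bit operation."""
    return bitfusion_ops_per_cycle(act_bits, weight_bits) / bitfusion_ops_per_cycle(16, 16)
```

A 16-bit by 16-bit multiplication needs 64 bricks, which gives 0.25 operations per cycle (one operation in four cycles), while a 2-bit by 2-bit multiplication needs a single brick, which gives 16 operations per cycle and the 64x ratio quoted above.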

This section explained the NN model components, compression methods, and optimization methods required to understand before reading this article. Next, we present and compare the literature related to this article and highlight the main differences between them.

3 Related Work

In this section, we go through the related work to this article. First, we discuss the research done on the compression of SRU-based models. Then, we review the research relevant to the optimization of compressed neural networks.

3.1 Simple Recurrent Unit (SRU) compression

Shangguan et al. used two SRU layers as a decoder in a speech recognition model Shangguan et al. (2019). They managed to prune 30% of the SRU layers without a noticeable increase in the error. The pruning was applied during training to ensure low error Zhu and Gupta (2018). To the best of our knowledge, no work has applied quantization, or any compression method other than pruning, to SRU models.

| Work | Compression | Objectives | Constraints | Hardware-aware |
|------|-------------|------------|-------------|----------------|
| Yang et al. (2016) | Pruning | Energy | Accuracy | Energy model Yang et al. (2016) |
| Yang et al. (2018b), Netadapt | Pruning | Accuracy | Resource budget | Empirical measurements |
| Yang et al. (2018a) | Pruning | Accuracy | Energy | Energy model Yang et al. (2016) |
| Yao et al. (2017), Deepiot | Pruning | Accuracy | Size | Memory information |
| Rizakis et al. (2018) | Pruning | Accuracy | Latency | Roofline model |
| Wang et al. (2019), HAQ | Quantization | Accuracy | Resource budget | Platform model |
| Nahshan et al. (2019), LAPQ | Quantization | Loss | - | - |
| Cai et al. (2020), ZeroQ | Quantization | | Size | - |
| This work, MOHAQ | Quantization | Error, speedup, energy | On-chip memory size | Energy model Yang et al. (2016); speedup estimation |

Table 3: Comparison of literature work on the optimization of NN models.

3.2 Optimization of Neural Networks Compression

The compression of neural network models has been treated as an optimization problem to select the degree of compression for each layer/channel. In many cases, feedback from a hardware platform or hardware model is used during the optimization (hardware-aware compression). We have summarized the work done on the optimization of NN models for compression in Table 3. The choice of constraints and objectives has varied among different papers.

Energy-aware pruning Yang et al. (2016) is a pruning method that minimizes energy consumption on a given platform. The platform model was used to guide the pruning process by indicating which layer, when pruned, would lead to more energy saving. The pruning process stops when a predefined accuracy constraint has been hit. Netadapt Yang et al. (2018b) eliminated the need for platform models by using direct empirical measurements. Nevertheless, pruning in Netadapt is constrained by a resource budget such as latency, memory size, and energy consumption. In both methods, pruning starts from pre-trained models, and fine-tuning is applied to retain accuracy. Similarly, an energy-constrained compression method Yang et al. (2018a) used pruning guided by energy constraints. Energy results are predicted from a mathematical model of a TPU-like systolic array architecture. However, this compression method trains the model from the beginning. On the other hand, DeepIOT Yao et al. (2017) obtains the memory size information from the target platform to compute the required compression ratio as a constraint. In another work, optimization variables are selected based on time constraints Rizakis et al. (2018), where roofline models are used for calculating the maximum achievable performance for different pruning configurations.

HAQ (Hardware-aware Quantization) used reinforcement learning to select bit-widths for weights and activations to quantize a model during training while considering hardware constraints Wang et al. (2019). LAPQ Nahshan et al. (2019) and ZeroQ Cai et al. (2020) applied optimization algorithms on quantized NN models but without any hardware model guidance. Loss Aware Post-training Quantization (LAPQ) is a layer-wise iterative optimization algorithm to calculate the optimum quantization step for clipping Nahshan et al. (2019). The authors showed that there is a relation between the quantization step and the cross-entropy loss. Small changes in the quantization step have a drastic effect on the accuracy. LAPQ managed to quantize ImageNet models to 4-bit with a slight decrease in accuracy. In Zero-shot Quantization (ZeroQ), Cai et al. proposed a data-free quantization method Cai et al. (2020). Their approach uses multi-objective optimization to select the precision of different layers in the model. The two objectives are the memory size and the total quantization sensitivity, where they define an equation to measure the sensitivity of each layer for different precisions. The authors assumed that the sensitivity of each layer to a specific precision is independent of the other layers’ precisions. This assumption simplifies the computation of the overall sensitivity for different quantization configurations in the search space.

None of the discussed work has applied hardware-aware multi-objective optimization to the problem of NN compression. In this work, we use quantization as a compression method and target hardware models to guide the compression. We allow both the model error/accuracy and hardware efficiency metrics (speedup and energy consumption) to be objectives. We use the hardware on-chip memory size as a constraint to avoid high-cost off-chip communication. The details of our proposed method are explained in the next section.

4 Method

This section discusses the details of our proposed method for the Multi-Objective Hardware-Aware Quantization (MOHAQ) of the SRU-model for speech recognition. First, we explain how we apply the post-training quantization on the SRU. Next, we present the optimization algorithm used to select the layers’ precisions guided by the hardware model. Then, we explain how to enable retraining in a multi-objective search for setting mixed-precision quantization configurations. Finally, we discuss how we use the hardware models for guidance during the compression optimization.

4.1 Post-training quantization of SRU model

The Simple Recurrent Unit (SRU) was initially designed to overcome the parallelization difficulty in LSTM and other recurrent units. In those units, outputs from the previous time-step are used in the matrix-to-vector (MV) operations of the current time-step. This property makes it impossible to fully parallelize the MV operations over multiple time steps. In SRU, the outputs from the previous time-step are excluded from MV computations and only used in element-wise computations.

Another side effect of excluding the recurrent inputs from MV operations is that the number of weights used in the recurrent operations decreases significantly. Thus, it becomes possible to also exclude the recurrent part from low-precision quantization. We apply low-precision quantization to the weights and activations used in MV operations only. Other weights are kept in a 16-bit fixed-point format. By doing so, we achieve our goal of reducing the model’s overall size while keeping its error rate low. In our work, the same model contains both 16-bit fixed-point and integer precisions in different layers. We explain here how quantization is applied to the weights and activations during inference and how we move from fixed-point to integer operations and back.

  • Weights integer quantization with clipping: We applied integer linear quantization on the weight matrices. We used the Minimum Mean Square Error (MMSE) method to determine the clipping threshold Sung et al. (2015). We used the implementation for CNN quantization provided by the OCS paper on GitHub Zhao et al. (2019) as a base for our implementation. We then modified the implementation to work with SRU units and to support varying precision per layer. The ranges of the quantized values are $[-128:127]$, $[-8:7]$, and $[-2:1]$ for 8-bit, 4-bit, and 2-bit, respectively.

  • Weights 16-bit fixed-point quantization: It is used for the recurrent weights, bias vectors, and weight matrices that might be chosen to have 16-bit. Depending on the range of data, we compute the minimum number of bits required for the integer part. The rest of the 16-bits are used as a sign bit and the approximated fraction part.

  • Activation integer quantization with clipping: Integer quantization of activations is similar to weights. However, since we cannot compute the range of vectors required for clipping threshold computation, we calculate the expected ranges. To compute expected ranges, we first use a portion of the validation data sequences. The predicted range of a given vector is calculated as the median value of the ranges recorded while running the validation sequences. In our experiments, 70 sequences were enough to compute the expected ranges.

  • Activation re-quantization to 16-bit fixed-point: The activations are quantized to 16 bits in the same way as the weights. If a vector is the output of an integer operation, we found it necessary to re-quantize the values into fixed-point by dividing them by a scale value. The scale value is computed to return the vector to the range it would have had if quantization had not been applied. The range of the non-quantized data is computed using a portion of the validation data sequences while using the original model weights and activations, i.e., with quantization turned off.
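The quantization steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function names (`quantize_int`, `mmse_clip`, `to_fixed16`) are hypothetical, the clipping threshold is found by a simple grid search over candidate thresholds, and inputs are assumed to contain at least one non-zero value.

```python
def quantize_int(values, bits, clip):
    """Linear integer quantization with a symmetric clipping threshold."""
    qmax = 2 ** (bits - 1) - 1           # 127, 7, 1 for 8-, 4-, 2-bit
    qmin = -(2 ** (bits - 1))            # -128, -8, -2
    scale = clip / qmax                  # real value of one integer step
    q = [max(qmin, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integers back to real values (also the re-quantization scale step)."""
    return [x * scale for x in q]


def mmse_clip(values, bits, candidates=100):
    """Assumed grid search: pick the threshold with the lowest mean squared error."""
    max_abs = max(abs(v) for v in values)
    best_clip, best_err = max_abs, float("inf")
    for i in range(1, candidates + 1):
        clip = max_abs * i / candidates
        q, scale = quantize_int(values, bits, clip)
        restored = dequantize(q, scale)
        err = sum((v - r) ** 2 for v, r in zip(values, restored)) / len(values)
        if err < best_err:
            best_clip, best_err = clip, err
    return best_clip


def to_fixed16(values):
    """16-bit fixed point: 1 sign bit, minimum integer bits, rest fraction bits."""
    max_abs = max(abs(v) for v in values)
    int_bits = 0
    while (1 << int_bits) <= max_abs:    # smallest int_bits with 2**int_bits > max_abs
        int_bits += 1
    frac_bits = 15 - int_bits
    step = 2.0 ** -frac_bits
    lo, hi = -(1 << 15), (1 << 15) - 1
    fixed = [max(lo, min(hi, round(v / step))) * step for v in values]
    return fixed, int_bits, frac_bits
```

For activations, the same `quantize_int` call would be used with a threshold derived from the expected ranges recorded on validation sequences rather than from the data itself.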

4.2 Multi-objective quantization of neural network models

During the compression/quantization of NN models, we have two types of objectives. The first one is the NN model performance metric, such as the accuracy or the error. The second type of objective is related to the efficiency of the hardware platform, such as memory size, speedup, energy consumption, and area. Treating the problem as a multi-objective problem provides the designer with various solutions that offer different trade-offs. The embedded system designer can then decide which solution is a suitable trade-off for the running application. We have used a Genetic Algorithm (GA) as it is one of the efficient multi-objective search optimizers. The multi-objective GA we used is NSGA-II, as provided by the Pymoo Python library Deb et al. (2002); Blank and Deb (2020). As mentioned in Section 2, NSGA-II is a popular GA that supports more than one objective. NSGA-II has shown better convergence and a better spread of solutions near the actual Pareto-optimal front for many difficult test problems Srinivas and Deb (2000). Thus, NSGA-II appears to be a good candidate for our experiments.

Most of the automated hardware-aware compression/quantization work uses a single objective and provides a single answer. The reason is that many compression methods require training or retraining to compensate for the accuracy loss caused by compression. The selection of the compression/quantization configuration is made iteratively during training epochs. Turning the problem into a multi-objective problem would make it necessary to retrain all the evaluated solutions in the search space, which would be infeasible. Our approach to tackling this problem relies on two observations. The first is that researchers are progressively enhancing post-training compression/quantization techniques. Thus, it is possible to evaluate candidate solutions by running inference only, without any retraining. That is also useful when the training data is not available. The second observation is that a model retrained using the variables of one candidate solution can be used as a retrained model for neighboring candidate solutions in the search space. We use this observation in a method we call “beacon-based search” (further explained in Section 4.3). So, if the inference-only search fails to find solutions with acceptable accuracy, the beacon-based search can be applied. Nevertheless, if the designer wants, the beacon-based search can be applied from the beginning. The complete search framework is shown in Figure 4.

Figure 4: Steps of the Multi-Objective Hardware-Aware Quantization (MOHAQ) method. The designer inputs the pre-trained model parameters, the objective equations of the hardware platform, and any hardware constraints if they exist. Then the designer can apply inference-only search or beacon-based search. The beacon-based search is the search method that requires retraining the model using the variables of some candidate solutions. The output of the search is a Pareto set of optimal solutions. If the inference-only search results in solutions with unacceptable accuracy levels, the designer can repeat the search using the beacon-based method.

Our problem is to select the precision for weights and activations per layer. A candidate solution has a number of variables that equals twice the number of layers, because each layer requires one precision for the weights and one for the activations. The possible precisions covered in this work are 2-, 4-, and 8-bit integer and 16-bit fixed point. We skipped precisions like 3, 5, and 6 bits because they are not frequently found in hardware platforms. However, the method is generic, and other precisions can be included. The variables of the candidate solutions are encoded into a genetic algorithm representation. We use the discrete values 1, 2, 3, and 4 for the solution variables: 2-bit is encoded as 1, 4-bit as 2, 8-bit as 3, and 16-bit as 4. In addition to encoding and decoding, we select the number of generations, define fitness functions and constraints, and then keep the library default configuration for the rest of the GA steps, such as crossover, mutation, and selection. For each objective, we define a fitness function. All the objectives have to be either minimization or maximization objectives. Since Pymoo by default treats objectives as minimization objectives, we change the maximization objectives into minimization objectives by negating them.
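The encoding described above can be illustrated with a short sketch. The gene values (1 to 4) and their bit widths come from the text; the interleaved layout (weight gene followed by activation gene for each layer) and the function names are assumptions made for this example only.

```python
# Gene value -> bit width, as described in the text.
GENE_TO_BITS = {1: 2, 2: 4, 3: 8, 4: 16}
BITS_TO_GENE = {bits: gene for gene, bits in GENE_TO_BITS.items()}


def decode(genes):
    """Turn a flat gene vector into per-layer (weight_bits, activation_bits) pairs.

    Assumed layout: [w0, a0, w1, a1, ...], two genes per layer.
    """
    if len(genes) % 2 != 0:
        raise ValueError("expected two genes (weight, activation) per layer")
    bits = [GENE_TO_BITS[g] for g in genes]
    return list(zip(bits[0::2], bits[1::2]))


def encode(layer_precisions):
    """Inverse of decode: per-layer (weight_bits, activation_bits) -> genes."""
    genes = []
    for w_bits, a_bits in layer_precisions:
        genes += [BITS_TO_GENE[w_bits], BITS_TO_GENE[a_bits]]
    return genes


def to_minimization(error, speedup):
    """Objective vector for a minimizing optimizer: speedup is negated."""
    return [error, -speedup]
```

With 8 layers this gives 16 genes with four possible values each, so an exhaustive search would have to visit `4 ** 16` configurations, which is why a genetic search is used instead.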

Initially, the search relies on post-training quantization (inference-only search). The evaluation of candidate solutions does not incorporate any training, and thus it is possible to carry out the search within a reasonable time. To evaluate the error objective of one solution, we run the inference of the model and use the inference error as the error objective to be minimized. Another way to speed up the search is to define a feasibility area: if a solution falls outside this area, it is directly excluded from the solution pool and from further search. We have used this to exclude solutions whose error rate is more than 8 percentage points higher than the baseline, as such solutions were deemed irrelevant.

To evaluate a quantized model, data from an “unseen” validation set would be preferable. However, besides the testing set, which we should not use for this purpose, no such “unseen” data exist. Thus, we are forced to use the validation set suggested for this database, even though it has already been used while training the model and has therefore already been “seen” by the model. Still, this should not influence our results too strongly, as the validation data have only been used to set hyperparameters during training. However, in our experiments, we have noticed that the gap between the validation error and the testing error varies significantly among the Pareto optimal solutions when we use the full validation set. To mitigate this variation (which can lead to solutions being ordered differently on the testing set than on the validation set), we split the validation set into smaller subsets (in this case, four subsets) and take the maximum error among the validation subset errors. This method has led to a better correspondence between validation errors and testing errors, and we will further explore and analyze it in the future.
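A small sketch may make the error objective and the feasibility check concrete. It assumes that one error value has already been measured per validation subset; the helper names and the contiguous split are illustrative choices, not taken from the paper.

```python
def split_into_subsets(items, k=4):
    """Split a list into k contiguous, nearly equal subsets."""
    size, extra = divmod(len(items), k)
    subsets, start = [], 0
    for i in range(k):
        end = start + size + (1 if i < extra else 0)
        subsets.append(items[start:end])
        start = end
    return subsets


def error_objective(subset_errors):
    """Error objective: the worst (maximum) error over the validation subsets."""
    return max(subset_errors)


def is_feasible(error, baseline_error, margin=8.0):
    """Keep a solution only if its error is within `margin` percentage points
    of the baseline error (8 p.p. is the margin stated in the text)."""
    return error - baseline_error <= margin
```

Taking the maximum is a pessimistic estimate: a configuration is scored by its worst subset, so it is only ranked well if it behaves well on every subset.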

It should also be noted that the evaluations of candidate solutions in one generation are not related to the other candidate solutions. Therefore, it is possible to parallelize the search over the candidate solutions and distribute the computation over multiple GPUs Rezk et al. (2014) with linear speedup.

4.3 Beacon-based Search (Retraining-based Multi-objective Quantization)

So far, we have been focusing on using post-training quantization during the search for the optimal quantization configuration to achieve a speedy evaluation of candidate solutions. However, at high compression ratios, we sometimes find that post-training quantization does not achieve acceptable accuracy levels. The accuracy can be improved by retraining the model using the quantized parameters. Still, as retraining is a very time-consuming process, it is infeasible to do for all candidate solutions. Therefore, we have developed a beacon-based approach that only retrains a small set of solutions, our beacons. Then we let “neighboring” solutions share the retrained model (beacon) as a basis for their quantization instead of the original pretrained model.

We want to define the neighborhood where we get a similar retraining effect (improved accuracy) for all solutions in the neighborhood when using the neighborhood beacon. A natural measure of closeness is to define a distance in the parameter space. That is, for each parameter, we compare the $\log_2$ of the precision values and then sum over all parameters. During our experiments, we have found that the precision of the weights is more important than the precision of the activations when finding neighbor solutions that can use the same retrained parameters. Thus, we only used the weight precisions in the distance computation for this paper. The distance between a solution and a beacon can thus be defined as:

$$d(s, b) = \sum_{k=1}^{N} \left| \log_2 p_k^{s} - \log_2 p_k^{b} \right|$$

where $d(s, b)$ is the distance between the solution $s$ and the beacon $b$, $N$ is the number of layers in the model, and $p_k^{s}$ and $p_k^{b}$ are the weight precisions of layer $k$ in $s$ and $b$, respectively.
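As a rough illustration of this distance, the sketch below sums the absolute differences of the base-2 logarithms of the per-layer weight precisions. The use of $\log_2$ and of an absolute difference is an assumption of this sketch; it is consistent with the percentages quoted for the thresholds later in this section (a maximum distance of 24 for 8 layers). The function name is hypothetical.

```python
import math


def beacon_distance(solution_weight_bits, beacon_weight_bits):
    """Sum over layers of |log2(solution bits) - log2(beacon bits)|.

    Only weight precisions are compared; activation precisions are ignored.
    """
    if len(solution_weight_bits) != len(beacon_weight_bits):
        raise ValueError("both configurations must cover the same layers")
    return sum(
        abs(math.log2(s) - math.log2(b))
        for s, b in zip(solution_weight_bits, beacon_weight_bits)
    )
```

Working in the log domain means that moving one step along the supported precisions (2, 4, 8, 16) always costs 1, regardless of which two neighboring precisions are involved.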

To validate the assumption that solutions in the neighborhood of a beacon behave consistently after applying quantization on the beacon parameters, we have placed beacons in the search space and calculated the accuracy (actually the word error rate) for neighbor solutions in the search space using both the original model parameters and the beacon parameters. In Figure 5 we show the neighborhood for one of these beacons (the others show similar behavior). Each point in the plot is an evaluated solution. The x-axis shows the increase in the error of a solution when quantizing the original model parameters compared to the non-quantized baseline error. The y-axis shows the decrease in error when quantizing the beacon model compared to the quantized original model. To exemplify, we have marked one solution with a star in Figure 5. For this solution, the baseline model error rate is 16.2%, and applying post-training quantization on this solution using the baseline model parameters gives an error rate of 24.2%. Applying post-training quantization on the same solution using the beacon parameters gives an error rate of 18.8%. Thus, we find this particular solution at 8 ($24.2 - 16.2$) on the x-axis and 5.4 ($24.2 - 18.8$) on the y-axis. From the figure, we see a close-to-linear relationship between the increase in the error using the baseline model parameters and the decrease in the error using the beacon parameters. We conclude that there is no need to retrain the model for all evaluated solutions during the search. It is sufficient to set up a small set of beacons to take the accuracy gain from retraining into account during the search.

Figure 5: A plot of the error improvement obtained using the parameters of one retrained model (beacon). Each point corresponds to a neighbor solution of the retrained model solution. The x-axis corresponds to the increase over the baseline model error when evaluating solutions using the baseline model parameters. The y-axis corresponds to the decrease in error when evaluating the same solutions using the beacon model parameters instead.

To retrain the model, we used a BinaryConnect approach Courbariaux et al. (2015), where quantized weights are used only during the forward and backward propagation, while full-precision weights are used for the parameter update step. Therefore, the retrained model always keeps floating-point parameters, which can then be used as a basis for various other quantization configurations.

We use the term beacon due to the similarity to its use in swarm robot search, where the simple robots do not have the communication capabilities needed for the swarm to fulfill its task, and some of the robots are assigned to be fixed communication beacons for the others Tan and Zheng (2013). Similarly, when our search reaches an area with no beacons, one solution is turned into a beacon by retraining the model using this solution's variables. In the end, we get a solution set that accounts for the retraining effect on the candidate solutions. Later on, when the designer selects a solution from the Pareto optimal set, the designer can use the beacon parameters directly or retrain the model using the selected solution's variables.

As mentioned before, we define a feasibility area to speed up the search, and if a solution falls outside this area, it is directly excluded from the solution pool and from further search. However, one effect of enabling retraining is that solutions that were outside the feasibility area before retraining can become feasible. Thus, when enabling retraining, one should define an enlarged “beacon-feasible” area so that such solutions are not excluded too early. In addition, it is possible to add more constraints to this area based on the designer’s experience to decrease the number of created beacons, such as thresholds for other objectives and not allowing low-error solutions to be retrained. Low-error solutions may not benefit much from retraining and can consume a lot of retraining time.

The steps of our beacon-based search are as follows. If the solution is in the beacon-feasible area, we compute the distance between the solution and all existing beacons using the distance equation defined earlier. If the nearest beacon is farther than a predefined threshold, this solution is converted into a beacon by retraining the model using its variables. After retraining, the solution is added to the beacon list, and by definition, the nearest beacon for this solution is the solution itself. Finally, the error objective is re-evaluated using the nearest beacon's model parameters.
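The steps described in this paragraph can be sketched as a single evaluation routine. This is a simplified illustration, not the paper's algorithm listing: the callbacks `in_beacon_area`, `retrain`, and `evaluate` are hypothetical placeholders, the distance assumes a log2-based comparison of weight precisions, and solutions outside the beacon-feasible area are assumed to be evaluated with the original pretrained parameters (passed as `None`).

```python
import math


def weight_distance(bits_a, bits_b):
    """Assumed distance: sum of |log2 a - log2 b| over the layers' weight bits."""
    return sum(abs(math.log2(a) - math.log2(b)) for a, b in zip(bits_a, bits_b))


def evaluate_with_beacons(solution, beacons, threshold,
                          in_beacon_area, retrain, evaluate):
    """Evaluate one candidate solution, creating a beacon when none is close.

    solution: list of (weight_bits, activation_bits) per layer.
    beacons:  list of (weight_bits, retrained_params); extended in place.
    """
    if not in_beacon_area(solution):
        return evaluate(solution, None)          # original parameters

    weights = [w for w, _ in solution]
    nearest = min(beacons,
                  key=lambda b: weight_distance(weights, b[0]),
                  default=None)
    if nearest is None or weight_distance(weights, nearest[0]) > threshold:
        # No beacon is close enough: retrain and register this solution.
        nearest = (weights, retrain(solution))
        beacons.append(nearest)
    # Re-evaluate the error objective with the nearest beacon's parameters.
    return evaluate(solution, nearest[1])
```

Because the beacon list grows only when a solution is far from all existing beacons, the number of expensive `retrain` calls is bounded by how many threshold-sized neighborhoods the search actually visits.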


The threshold value is important in controlling how many beacons we use (and, therefore, the amount of time needed to retrain models). If it is too high, it can limit the benefit of retraining in decreasing the model error. A suitable value depends on the model size (how many parameters we use in the distance calculation) and the supported precisions. For the experiments below, we have a model of 8 layers. We found that a threshold of 6 resulted in one beacon, while a threshold of 5 resulted in three beacons. These thresholds are approximately 25% and 21% of the maximum possible distance, but they were found in an exploratory fashion. How to optimally set this threshold is, however, not further explored in this paper.

4.4 Hardware-aware Optimization

In this work, the hardware model is an input given to the optimization algorithm in the form of objective functions. We have selected two architectures to apply our methods on. These two architectures are selected as they support varying precision operations, and thus applying quantization optimization becomes feasible. For these two architectures we do not have implementations for the RNN modules and thus we cannot get empirical measurements during evaluations. Instead, we have defined objective functions in a simple way that mainly focuses on the effect of decreasing the precision of the NN operations. The hardware platform constraints and objectives serve as a proof of concept that shows how models can be compressed differently in different scenarios.

Neural network models are characterized by large memory requirements. Deploying the models in their original form results in frequent data transfers from and to the off-chip memory. Thus, the implementation would be memory-bound, and many studies have been performed to increase the reuse of local data and minimize off-chip memory accesses. On the other hand, the success of NN compression has made it possible to squeeze the whole NN model into the on-chip memory and turn NN applications into compute-bound applications. So, in our experiments, we use the platform SRAM size as an optimization constraint and not as an objective. First, having the NN model size less than the SRAM size achieves the ultimate goal of compression by avoiding the expensive loading of weights from the off-chip memory. Second, compressing the model further would no longer be beneficial from the memory point of view. It can still be beneficial for energy consumption and computation speedup, which are accounted for as optimization objectives. Next, we show the details of the energy and speedup objective equations.

4.4.1 Energy Estimation

In our experiments, we use the energy estimation model developed by the Eyeriss project team Yang et al. (2016). In this model, the total consumed energy is computed by adding the total energy required for computation and the total energy required for data movement. Since the majority of the computation in NN models is in the form of MAC operations, the total energy required for computation is the number of MAC operations multiplied by the energy cost of one MAC operation. The total energy consumed by data movement is computed by multiplying the cost of a one-bit transfer by the number of transferred bits. For a hardware architecture that has a hierarchy of multiple memory levels, such as Eyeriss, different energy costs are used for each level of data loading. In our case, we have only one memory level, which is the SRAM.

The final equation we use is as follows:

$$E = N_{bits} \cdot E_{SRAM} + \sum_{p \in P} E_p \cdot N_p$$

where $E$ is the overall energy consumed, $N_{bits}$ is the number of bits in the model, and $E_{SRAM}$ is the energy cost of loading one bit from the SRAM. $P$ is the set of supported precisions, $E_p$ is the energy cost of one MAC operation using the precision $p$, and $N_p$ is the number of MAC operations using the precision $p$.
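The energy model described above (one data-movement term plus one MAC term per precision) can be written as a short function. This is a sketch under the assumption that every model bit is loaded from the SRAM once; the function and parameter names are hypothetical, and the cost values in the usage example are made-up placeholders rather than measurements from any platform.

```python
def total_energy(model_bits, sram_cost_per_bit,
                 macs_per_precision, mac_cost_per_precision):
    """Data-movement energy plus computation energy.

    macs_per_precision:     {precision: number of MAC operations}
    mac_cost_per_precision: {precision: energy cost of one MAC operation}
    """
    data_movement = model_bits * sram_cost_per_bit
    computation = sum(
        mac_cost_per_precision[p] * n for p, n in macs_per_precision.items()
    )
    return data_movement + computation


# Usage with placeholder costs (arbitrary units, for illustration only).
example = total_energy(
    model_bits=1000,
    sram_cost_per_bit=2.0,
    macs_per_precision={4: 10, 8: 5},
    mac_cost_per_precision={4: 1.0, 8: 3.0},
)
```

Lowering the precision of a layer reduces both terms at once: the layer contributes fewer bits to the data-movement term, and its MAC operations move to a cheaper per-operation cost.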

4.4.2 Speedup Estimation

Since we adopt compute-bound implementations in this work, we rely solely on the speedup gained in the MAC operations as an approximation of the expected speedup gained by different quantization configurations. Currently, we do not have a real implementation of the model under study on the two architectures (SiLago and Bitfusion). Having an implementation would enrich the speedup equation with more details, such as the tile size needed to compute the proportion of loading time to computation time. However, in this paper, the hardware model is an input. What we show is how we generate different sets of solutions, and account for retraining if required, for different hardware models and constraints. In both architectures, the highest supported precision is 16-bit fixed point. Thus, we define the speedup objective as the speedup over 16-bit operations. The speedup objective is computed by multiplying the number of MAC operations done in a given precision by the speedup of this precision and summing over all supported precisions using the following formula:


$$S = \frac{1}{N_T} \sum_{p \in P} S_p \cdot N_p \tag{4}$$

where $S$ is the overall speedup, and $P$ is the set of supported precisions. For example, an architecture that supports mixed-precision with 4 and 8 bits has a set $P = \{4 \times 4, 4 \times 8, 8 \times 4, 8 \times 8\}$ (weight bits $\times$ activation bits), with $|P| = 4$. If the same architecture does not support mixed-precision, $P = \{4 \times 4, 8 \times 8\}$, with $|P| = 2$. Furthermore, $S_p$ is the speedup gained using precision $p$ over the highest precision supported by the given architecture, $N_p$ is the number of MAC (Multiply-Accumulate) operations using the precision $p$, and $N_T$ is the total number of MAC operations in the model.
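The speedup objective described above is a MAC-weighted average of per-precision speedups. The sketch below assumes exactly that reading (sum of per-precision speedup times MAC count, divided by the total MAC count); the function name is hypothetical and the per-precision speedups in the example are placeholders, not values for SiLago or Bitfusion.

```python
def overall_speedup(macs_per_precision, speedup_per_precision):
    """MAC-weighted average speedup over the highest supported precision.

    macs_per_precision:    {precision label: number of MAC operations}
    speedup_per_precision: {precision label: speedup of one MAC at that precision}
    """
    total_macs = sum(macs_per_precision.values())
    weighted = sum(
        speedup_per_precision[p] * n for p, n in macs_per_precision.items()
    )
    return weighted / total_macs


# Usage with placeholder speedups relative to 16-bit operations.
example_speedup = overall_speedup(
    macs_per_precision={16: 100, 8: 100, 4: 200},
    speedup_per_precision={16: 1.0, 8: 2.0, 4: 4.0},
)
```

The precision labels are only dictionary keys, so the same function works whether a precision is a single bit width or a weight/activation combination.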

5 Evaluation and Experiments

This section applies the Multi-Objective Hardware-Aware Quantization (MOHAQ) method using the NSGA-II genetic algorithm to an SRU model using the TIMIT dataset for speech recognition. As we mentioned in the introduction, the SRU model was selected due to the ease of SRU parallelization and because it is faster in training and inference. Also, we found that the Pytorch-Kaldi project offers good software support for SRU-based models for speech recognition using the TIMIT dataset Ravanelli et al. (2019). TIMIT is a dataset composed of recordings of 630 different speakers covering eight different American English dialects, where each speaker reads up to 10 sentences Garofolo et al. (1993). In Section 5.1 we show the components of the model used in our experiments.

We designed three experiments to evaluate our MOHAQ method. In all three experiments, the search should give a set of Pareto optimal solutions. Each solution specifies the weight and activation precision of each layer in the model. In the first experiment, we evaluate the capabilities of post-training quantization on the SRU model without any hardware consideration. In the latter two experiments, we use two hardware models, SiLago and Bitfusion. SiLago does not support precisions lower than 4-bit, so inference-only search was enough for the example model, as the compression ratio did not exceed 8x. On the other hand, Bitfusion supports 2-bit operations and hence supports solutions with high compression ratios. Therefore, this example architecture gave us an opportunity to test the search method at a high compression ratio by setting the memory constraint to 2 MB (10.6x compression ratio). We first apply the inference-only search, and then we use the beacon-based search and examine the quality of the solution set.

| | $L_0$ | $Pr_1$ | $L_1$ | $Pr_2$ | $L_2$ | $Pr_3$ | $L_3$ | FC | Total |
|---|---|---|---|---|---|---|---|---|---|
| Input vector size (m) | 23 | 1100 | 256 | 1100 | 256 | 1100 | 256 | 1100 | - |
| Number of hidden cells (n) | 550 | 256 | 550 | 256 | 550 | 256 | 550 | 1904 | - |
| MAC operations | 75900 | 281600 | 844800 | 281600 | 844800 | 281600 | 844800 | 2094400 | 5549500 |
| Element-wise operations | 15400 | - | 15400 | - | 15400 | - | 15400 | - | 61600 |
| Non-linear operations | 2200 | - | 2200 | - | 2200 | - | 2200 | 1904 | 10704 |
| Matrices weights | 75900 | 281600 | 844800 | 281600 | 844800 | 281600 | 844800 | 2094400 | 5549500 |
| Vectors weights | 4400 | - | 4400 | - | 4400 | - | 4400 | - | 17600 |

Table 4: The breakdown of the model used in the experiments. For each layer, we give the input vector size and the output vector size. Then, we compute the number of weights and operations for both MAC operations and other operations. We apply the formulas in Table 1. Each SRU layer is denoted as $L_x$, and each projection layer is denoted as $Pr_x$, where x is the layer index and FC is the fully connected layer.

5.1 Example SRU Model

In our experiments, we use a speech recognition model from the Pytorch-Kaldi project Ravanelli et al. (2019). Pytorch-Kaldi is a project that develops hybrid speech recognition systems using state-of-the-art DNN/RNN. Pytorch is used for the NN acoustic model. The Kaldi toolkit is used for feature extraction, label computation, and decoding Povey et al. (2011). In our experiments, the feature extraction is done using logarithmic Mel-filter bank coefficients (FBANK). The labels required for the acoustic model training come from a procedure of forced alignment between the context-dependent phone state sequence and the speech features. Then, the Pytorch NN module takes the feature vector as input and generates an output vector. The output vector is a set of posterior probabilities over the phone states. The Kaldi decoder uses this vector to compute the final Word-Error-Rate (WER). We used the TIMIT dataset Garofolo et al. (1993) and trained the model for 24 epochs as set in the Pytorch-Kaldi default configurations. Figure 6a shows the model used in our experiments. The NN model is composed of 4 Bi-SRU layers with 3 projection layers in between. An FC layer is used after the SRU layers, and its output is applied to a Softmax layer Ravanelli et al. (2019).

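Figure 6: (a) The SRU model used in the experiments. (b) The percentage of weights required by each type of layer.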

In Table 4, we show the breakdown of the SRU model used in the experiments. In the first two rows, we show the input vector size and the number of hidden cells. Then, in the middle three rows, we show the number of operations corresponding to each layer for MAC, element-wise, and non-linear operations. Finally, in the last two rows, we show the number of weights for the matrices used in the MAC operations and the vectors used in the element-wise operations. Considering that one MAC operation is equivalent to two element-wise operations, the number of operations and weights not involved in the matrix-to-vector multiplications is less than 1% of the total number of operations and weights. Also, in Figure 6b, we show the percentage of weights required by each type of layer. These types are the SRU-gate matrices, the projection-layer matrices, the FC matrix, and the SRU vectors. The total size of the model is the total size of the weights.
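The MAC counts in Table 4 can be reproduced from the layer shapes with a few lines of arithmetic. The formulas used here are assumptions inferred from the table values, since Table 1 is not reproduced in this section: a bidirectional SRU layer is taken to use three m-by-n weight matrices per direction, and projection and FC layers one m-by-n matrix each.

```python
# (input size m, hidden cells n) per layer, taken from Table 4.
SRU_LAYERS = [(23, 550), (256, 550), (256, 550), (256, 550)]
PROJECTION_LAYERS = [(1100, 256)] * 3
FC_LAYER = (1100, 1904)


def bi_sru_macs(m, n):
    """Assumed formula: 3 weight matrices of size m x n, in 2 directions."""
    return 3 * m * n * 2


def dense_macs(m, n):
    """One m x n matrix-vector product."""
    return m * n


total_macs = (
    sum(bi_sru_macs(m, n) for m, n in SRU_LAYERS)
    + sum(dense_macs(m, n) for m, n in PROJECTION_LAYERS)
    + dense_macs(*FC_LAYER)
)
```

The projection layers take 1100 inputs because each Bi-SRU layer concatenates the outputs of its two directions (2 x 550).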

5.2 Multi-objective search to minimize two objectives: WER and memory size

Our first experiment does a multi-objective search to minimize two objectives: $WER_V$ and memory size. $WER_V$ is the error rate evaluated using the validation set of the TIMIT dataset. No hardware model is used in this experiment, in order to explore the general compression of the model before any hardware platform is involved. In the search space, we have 4.3 billion possible solutions ($4^{16}$), as each solution has 16 variables, and each variable has four possible values. The genetic algorithm used is NSGA-II. After some initial experiments, we found that 60 generations were sufficient to get a stable solution for all tested objectives. Each generation has ten individuals except the initial generation, which has 40 individuals. So, 630 solutions have been evaluated during the search. During the search, solutions with a high error rate are considered infeasible. The search output is a Pareto optimal set of solutions that shows the embedded system designer a trade-off between the model size and the error rate. Figure 7 shows a plot of the Pareto optimal set, and Table 5 shows the details of each solution in the set. Each row in the table corresponds to a solution in the set. We report the precision of each layer's weights and activations for each solution, followed by the solution's $WER_V$, compression ratio, and testing error $WER_T$. The first row is for the base model that is not quantized. The base model testing WER is 17.2%.
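To make the notion of a Pareto optimal set concrete, the sketch below filters a list of (error, memory size) points down to the non-dominated ones and reproduces the search-size arithmetic from the paragraph above. It is a plain quadratic-time filter for illustration, not the non-dominated sorting that NSGA-II performs internally, and the example points are placeholders rather than results from the paper.

```python
def pareto_front(points):
    """Return the non-dominated points when every objective is minimized.

    A point q dominates p if q is no worse in all objectives and strictly
    better in at least one.
    """
    front = []
    for p in points:
        dominated = any(
            all(qi <= pi for qi, pi in zip(q, p))
            and any(qi < pi for qi, pi in zip(q, p))
            for q in points
        )
        if not dominated:
            front.append(p)
    return front


# Placeholder (error %, size in MB) points for illustration only.
candidates = [(16.1, 2.7), (16.4, 2.6), (17.0, 2.7), (18.7, 1.4)]
front = pareto_front(candidates)

search_space_size = 4 ** 16        # 16 variables, 4 values each
evaluated = 40 + 59 * 10           # initial generation + 59 generations of 10
```

Comparing `evaluated` with `search_space_size` shows how small a fraction of the configurations the genetic search actually needs to evaluate.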

Table 5 shows that the model can be compressed to 8x without any increase in the error. The designer can compress the model to 12x with only a 1.5 p.p. (percentage point) increase in the error and to 15.6x with a 1.9 p.p. increase in the error. In most of the solutions, 4 bits and 2 bits have been used extensively for the weights. The activation precision has been kept between 8-bit and 16-bit in most of the solutions. However, 4-bit and 2-bit activations have been used in a few layers. It is also observed that some solutions have an error rate better than the baseline model. It has been shown that quantization has a regularization effect during training Rybalkin et al. (2018). Therefore, we think the improved error is a result of the quantization error introducing noise that reduces some of the over-fitting effect during inference.

In Table 5, we expected $WER_T$ to be higher than $WER_V$, but we also hoped to see that the relative order of the solutions is kept for both $WER_V$ and $WER_T$. We tried to minimize the gap between the validation error and the testing error as explained in Section 4. In all the solutions in the solution set, the difference between the two errors is small. However, if we look at the solutions sorted by their $WER_V$ value, we find that the corresponding $WER_T$ values are not perfectly sorted. S7 and S12 look like outliers. The reason is that S7 had a better $WER_T$ than expected, and S12 had a worse $WER_T$ than the surrounding solutions. Still, in both cases, the variation was in the range of 0.3 p.p., and we believe that such small variations are to be expected, as there is no guarantee that the errors on two different datasets are the same.

Figure 7: Pareto optimal set for two objectives: validation error ($WER_V$) and memory size. High-error solutions are considered infeasible.

To get a better understanding of how successful our post-training quantization of the SRU model is, we look at previous work on post-training quantization. Since researchers have found that 16-bit and 8-bit quantization do not significantly affect accuracy Chen et al. (2016); Zhao et al. (2019), we will focus on 4-bit quantization (8x compression). CNN ImageNet models have been used for quantization experiments in most of the work we have seen. The accuracy drop due to 4-bit post-training quantization of CNN models has varied in these papers as follows: LAPQ Nahshan et al. (2019) (6.1 to 9.4 p.p.), ACIQ Banner et al. (2019) (0.4 to 10.8 p.p.), OCS Zhao et al. (2019) (more than 5 p.p.), and ZeroQ Cai et al. (2020) (1.6 p.p.), where ZeroQ applied mixed precision to reach the 8x compression ratio. Also, on RNN models for language translation, the BLEU score decreased by 1.2 Aji and Heafield (2019). Comparing this to the mixed-precision post-training quantization of the SRU model, our search found solutions that use a mix of 2, 4, and 8 bits. Those solutions achieve compression ratios between 8x and 9x with an error increase that varies between 0 p.p. and 0.3 p.p. With an error increase of 1.5 p.p., the compression ratio reaches 12x. Thus, we conclude that the error increase we get is lower than in most of the other studies, while our experiments reach higher compression ratios.

Sol. FC
Baseline 32/32 32/32 32/32 32/32 32/32 32/32 32/32 32/32 16.2% 1x 17.2%
S1 8/16 4/16 4/4 2/16 4/8 4/8 4/16 4/16 16.1% 8.1x 16.9%
S2 4/16 2/8 4/4 2/16 4/8 4/8 4/16 4/8 16.4% 8.4x 17.0%
S3 4/16 4/16 2/4 2/16 4/16 4/16 4/16 4/16 16.5% 8.9x 17.4%
S4 4/16 2/8 2/4 2/16 4/8 4/8 4/16 4/8 16.7% 9.1x 17.5%
S5 8/16 2/8 2/4 2/16 4/16 2/8 4/16 4/8 16.9% 9.4x 17.7%
S6 4/16 8/16 2/4 4/16 4/16 4/16 4/16 2/16 17.2% 9.9x 18.2%
S7 8/16 2/2 2/4 2/16 2/16 4/16 4/16 4/8 17.3% 10.0x 17.9%
S8 8/16 2/16 2/4 2/16 4/16 4/16 4/16 2/16 17.4% 11.4x 18.2%
S9 4/16 2/8 2/4 2/16 4/16 4/16 4/16 2/16 17.4% 11.4x 18.3%
S10 4/16 2/16 2/4 2/16 4/8 4/16 4/16 2/8 17.6% 11.6x 18.3%
S11 4/16 2/4 2/4 2/16 2/16 8/16 4/16 2/8 17.8% 12.0x 18.4%
S12 4/16 2/8 2/4 2/16 2/16 8/16 4/16 2/8 17.8% 12.0x 18.7%
S13 4/16 2/8 2/4 2/16 2/16 4/16 4/16 2/8 17.9% 13.0x 18.5%
S14 4/16 2/8 2/4 2/16 2/16 2/16 4/16 2/8 18.0% 13.6x 18.6%
S15 4/16 2/2 2/16 2/16 2/8 2/16 2/16 2/8 18.7% 15.6x 19.1%
Table 5: The Pareto set of solutions obtained by applying NSGA-II to minimize two objectives: the validation WER and the memory size in MB. The validation WER is the error rate evaluated on the validation set. The WER of each solution on the testing set is also listed in the table. Each layer and each projection layer has its own column, identified by the layer index x. For each layer, the number of bits used is written as W/A, where W is the number of bits for weights and A is the number of bits for activations. The compression ratio is given relative to the base model, which is the 32/32 implementation in the first row. The base model testing WER is 17.2%.

5.3 Multi-objective Quantization on the SiLago architecture

In the second experiment, we apply the Multi-Objective Hardware-Aware Quantization (MOHAQ) method to the SiLago architecture using the inference-only search. The SiLago architecture supports varying precisions between layers. However, within each layer, the weights and the activations must use the same precision. Thus, the number of variables in a solution is 8, rather than the 16 used in the previous experiment. The precisions supported on SiLago are 16, 8, and 4 bits, as explained in Section 2.5.1. Since the highest precision supported on SiLago is 16-bit fixed point, the baseline model is a 16-bit full implementation, and we compute the speedup gained by low precision relative to 16-bit. The speedup is computed using Equation 4. We have also evaluated the energy consumed by the MAC operation at different precisions. Table 2 shows the speedup gained per MAC operation when using 8-bit or 4-bit compared to 16-bit, and the energy consumed by 16-bit, 8-bit, and 4-bit operations. To compute the overall energy expected for each solution, we use the energy model proposed in energy-aware pruning Yang et al. (2016) and described in Section 4.4.

As discussed in Section 4.4, all weights should be stored in the on-chip memory. Thus, it is crucial to add the SRAM size as a constraint on the model size during the search. Even though we do not know the exact size of the SRAM, we have to establish a limit. Given that the highest possible compression ratio on SiLago is 8x, which corresponds to 2.65 MB for the experimental model, we chose 6 MB (3.5x compression ratio) as a reasonable memory size constraint that leaves room for a reasonably sized Pareto set.
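The arithmetic behind this constraint can be checked with a short sketch. It assumes that the compression ratio is simply the full-precision model size divided by the compressed size; the function and constant names are illustrative and are not taken from the paper's code.

```python
# Figures quoted in the text for the experimental model.
SIZE_AT_MAX_RATIO_MB = 2.65   # model size at the highest SiLago compression ratio
MAX_RATIO = 8.0               # all layers at 4-bit
SRAM_LIMIT_MB = 6.0           # memory constraint chosen for the search


def full_precision_size_mb(compressed_mb: float, ratio: float) -> float:
    """Size of the uncompressed model implied by a compressed size and its ratio."""
    return compressed_mb * ratio


def min_compression_ratio(full_mb: float, limit_mb: float) -> float:
    """Smallest compression ratio that still fits the memory limit."""
    return full_mb / limit_mb


def fits_in_sram(size_mb: float, limit_mb: float = SRAM_LIMIT_MB) -> bool:
    """Memory constraint used to reject candidate solutions (assumed inclusive)."""
    return size_mb <= limit_mb


full_mb = full_precision_size_mb(SIZE_AT_MAX_RATIO_MB, MAX_RATIO)
required_ratio = min_compression_ratio(full_mb, SRAM_LIMIT_MB)
print(f"full model: {full_mb:.1f} MB, required ratio: {required_ratio:.1f}x")
```

Under these assumptions the implied full-precision size is about 21.2 MB, and a 6 MB limit corresponds to the 3.5x ratio quoted above.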

So, in this case, the multi-objective search has three objectives: WER, speedup, and energy consumption. Both WER and energy consumption are objectives to be minimized, while speedup is an objective to be maximized. We use the negative of the speedup as an objective instead of the speedup as the GA will minimize all the objectives.
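As an illustration of how a maximized objective is folded into a minimization-only search, the following sketch negates the speedup and applies a plain Pareto-dominance filter. It is not the NSGA-II implementation used in the experiments; the function names and the fourth candidate are hypothetical, while the first three candidates use values reported in Table 6.

```python
from typing import List, Tuple

# (WER in %, negated speedup, energy in µJ); every entry is minimized.
Objectives = Tuple[float, float, float]


def to_objectives(wer: float, speedup: float, energy_uj: float) -> Objectives:
    """Negate the speedup so that all three objectives are minimized."""
    return (wer, -speedup, energy_uj)


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b everywhere and strictly better somewhere."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Objectives]) -> List[Objectives]:
    """Keep only the points that no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]


candidates = [
    to_objectives(16.2, 2.6, 5.8),   # S1 in Table 6
    to_objectives(16.3, 2.9, 5.2),   # S2 in Table 6
    to_objectives(19.0, 3.9, 2.6),   # S7 in Table 6
    to_objectives(17.0, 2.5, 6.0),   # hypothetical candidate, dominated by S1
]
front = pareto_front(candidates)
print(len(front), "non-dominated solutions")
```

Negating one objective keeps the dominance test uniform, which is why a single comparison routine can serve WER, speedup, and energy at once.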

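Sol. | layers and projection layers in model order, ending with FC (W/A) | validation WER | compression ratio | speedup | energy | testing WER
Base 32/32 32/32 32/32 32/32 32/32 32/32 32/32 32/32 16.2% 1x - - 17.2%
Base (16-bit) 16/16 16/16 16/16 16/16 16/16 16/16 16/16 16/16 16.2% 2x 1x 16.4 µJ 17.2%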
S1 16/16 4/4 8/8 8/8 4/4 16/16 4/4 8/8 16.2% 4.5x 2.6x 5.8 µJ 17.1%
S2 16/16 4/4 4/4 8/8 4/4 16/16 4/4 8/8 16.3% 4.9x 2.9x 5.2 µJ 17.2%
S3 8/8 4/4 4/4 4/4 4/4 4/4 4/4 8/8 16.8% 5.7x 3.2x 4.2 µJ 17.4%
S4 4/4 4/4 4/4 4/4 4/4 4/4 4/4 8/8 17.3% 5.8x 3.2x 4.1 µJ 17.7%
S5 8/8 8/8 4/4 4/4 8/8 4/4 4/4 4/4 18.6% 6.6x 3.5x 3.5 µJ 19.4%
S6 8/8 8/8 4/4 16/16 4/4 4/4 4/4 4/4 18.6% 6.6x 3.7x 3.6 µJ 19.0%
S7 4/4 4/4 4/4 4/4 4/4 4/4 4/4 4/4 19.0% 8x 3.9x 2.6 µJ 19.8%
Table 6: The Pareto set of solutions obtained by applying NSGA-II to three objectives: the validation WER, the speedup, and the energy consumption. The validation WER is the error rate evaluated on the validation set. The speedup and energy consumption are computed using the SiLago architecture model. The WER of each solution on the testing set is also listed in the table. Each layer and each projection layer has its own column, identified by the layer index x. For each layer, the number of bits used is written as W/A, where W is the number of bits for weights and A is the number of bits for activations. The compression ratio is listed after the validation WER. The 16/16 row is the base model that can run on SiLago using a 16-bit full implementation.
Figure 8: Pareto optimal set for three objectives, validation error, speedup, and energy consumption, on the SiLago architecture model. High-error solutions are considered infeasible.

Since the search space is smaller than in experiment one, we only need to run the search for 15 generations. Each generation has ten individual solutions, except the first generation, which has 40. The whole search therefore evaluates 180 solutions out of 6561 possible solutions ($3^8$). Solutions with a high error rate were considered infeasible. Figure 8 shows the Pareto optimal set, and Table 6 shows the details of each solution and the testing WER. To judge the quality of the speedup and energy consumption, we compare the solutions against the best-performing solution possible on SiLago, which uses 4-bit for all layers. This solution reaches a 3.9x speedup and the lowest energy consumption of 2.6 µJ (a 6.3x improvement over the base solution). The search found solutions that achieve 74% of the maximum speedup and 51% of the maximum energy saving without any increase in the error. If the designer accepts a 0.5 p.p. (percentage point) increase in the error, we can achieve 81% of the maximum speedup and 64% of the maximum energy saving. Reaching the maximum possible performance costs a 2.6 p.p. increase in the error.
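The search budget quoted above follows from simple counting, sketched below. The helper names are illustrative; the only inputs are the numbers stated in this section (three supported precisions, eight per-layer variables, 15 generations, and the speedups of S2 and the all-4-bit solution in Table 6).

```python
def evaluated_solutions(generations: int, first_gen_size: int, gen_size: int) -> int:
    """Total candidates evaluated when only the first generation is larger."""
    return first_gen_size + (generations - 1) * gen_size


def search_space_size(precision_options: int, variables: int) -> int:
    """Number of distinct precision assignments."""
    return precision_options ** variables


budget = evaluated_solutions(generations=15, first_gen_size=40, gen_size=10)
space = search_space_size(precision_options=3, variables=8)
coverage = budget / space

# Share of the maximum speedup reached by S2 (2.9x) relative to all 4-bit (3.9x).
s2_share = 2.9 / 3.9

print(f"{budget} of {space} solutions evaluated ({coverage:.1%})")
print(f"S2 reaches {s2_share:.0%} of the maximum speedup")
```

The search therefore visits under 3% of the space, which is what makes inference-only evaluation of every candidate affordable.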

5.4 Multi-objective Quantization on the Bitfusion architecture

In the third experiment, we apply the Multi-Objective Hardware-Aware Quantization (MOHAQ) method to the Bitfusion architecture using two objectives. The first is the WER and the second is the speedup. We first apply the inference-only search, and then we apply the beacon-based search to enhance the solution set. The Bitfusion architecture is introduced in Section 2.5.2. We use Equation 4 to compute the expected speedup for different solutions. The genetic algorithm used is NSGA-II, and it runs for 60 generations, as in experiment 1. Each generation has ten individuals except the initial generation, which has 40 individuals. The whole search evaluates 630 solutions out of 4.3 billion possible solutions ($4^{16}$). We consider high-error solutions (more than 24%) infeasible to limit the search to solutions with low error rates. In the speedup equation, we assume that all the weights have to be in the SRAM and that the application is compute-bound. Thus, memory size has to be a constraint in the search. In this experiment, we constrain the memory size to be less than 2 MB, which is equivalent to 9.4% of the original model size. We chose this tight constraint so that the search ends up with high-error solutions, which the beacon-based search can then improve in the next step.
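The two constraints and the quoted figures of this experiment can be sketched as follows. The full model size is an assumption derived from Section 5.3, where an 8x compression ratio corresponds to 2.65 MB; the search-space size assumes four precisions (2, 4, 8, and 16 bits) for each of the 16 weight and activation variables, as inferred from Table 7. The function name and the example inputs are hypothetical.

```python
MAX_VALIDATION_WER = 24.0      # percent; higher error is treated as infeasible
MAX_MEMORY_MB = 2.0            # SRAM constraint used in this experiment
FULL_MODEL_MB = 8 * 2.65       # assumption: 8x compression corresponds to 2.65 MB


def is_feasible(validation_wer: float, memory_mb: float) -> bool:
    """A candidate must satisfy both the error and the memory constraint."""
    return validation_wer <= MAX_VALIDATION_WER and memory_mb < MAX_MEMORY_MB


search_space = 4 ** 16                  # four precisions, 16 variables
evaluated = 40 + (60 - 1) * 10          # 40 initial individuals, then 10 per generation
memory_fraction = MAX_MEMORY_MB / FULL_MODEL_MB

print(f"{evaluated} of {search_space:,} solutions evaluated")
print(f"memory limit is {memory_fraction:.1%} of the full model")
print(is_feasible(23.7, 1.9), is_feasible(24.5, 1.9), is_feasible(20.0, 2.5))
```

The memory sizes passed to `is_feasible` here are made up for illustration; only the thresholds come from the text.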

Figure 9 shows the Pareto optimal set, and Table 7 shows the detailed solutions. Since the Pareto set has 26 solutions, we omit the details of some solutions from the table. In this case, we see solutions with high error rates that might not be acceptable. Thus, we applied the beacon-based search introduced in Section 4.3. We used a distance threshold of 6 between candidate solutions and beacons, which is reasonable when compared to the number of layers (8). By the end of the search, we found that one beacon had been created. We repeated the experiment, either decreasing the threshold or manually selecting beacons to increase the number of beacons, and obtained a similar set of solutions. Thus, one beacon is enough for our model. For deeper models, more beacons would be required.
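The role of the threshold can be illustrated with a small sketch. The distance metric here is an assumption made only for illustration (the sum over layers of the absolute difference between the log2 weight bit-widths); Section 4.3 defines the metric actually used. The function names are hypothetical, and the two weight vectors are the weight precisions of S1 and S26 in Table 7.

```python
import math
from typing import List, Optional, Sequence


def distance(solution: Sequence[int], beacon: Sequence[int]) -> float:
    """Assumed metric: L1 distance between log2 weight bit-widths per layer."""
    return sum(abs(math.log2(s) - math.log2(b)) for s, b in zip(solution, beacon))


def nearest_beacon(solution: Sequence[int],
                   beacons: List[Sequence[int]],
                   threshold: float) -> Optional[Sequence[int]]:
    """Return the closest beacon within the threshold, or None if there is none.

    None signals that no retrained model is close enough, so a new beacon
    would have to be created (retrained) for this region of the search space.
    """
    if not beacons:
        return None
    best = min(beacons, key=lambda b: distance(solution, b))
    return best if distance(solution, best) <= threshold else None


s1_weights = [8, 2, 2, 4, 4, 4, 4, 2]    # weight bits of S1 in Table 7
s26_weights = [8, 2, 2, 2, 4, 2, 2, 2]   # weight bits of S26 in Table 7

beacons = [s26_weights]
print(distance(s1_weights, s26_weights))
print(nearest_beacon(s1_weights, beacons, threshold=6) is not None)
print(nearest_beacon(s1_weights, beacons, threshold=2) is not None)
```

With this assumed metric, lowering the threshold leaves more candidates without a nearby beacon, which matches the observation above that a smaller threshold increases the number of beacons.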

Figure 10 and Table 8 depict the newly generated Pareto optimal set with solution details. As in Table 7, we omit some solution details to keep the tables relatively concise. Comparing the two Pareto sets, the testing error in the first set reached 24.2% to achieve a 40.7x speedup. The new set reached the same speedup with a testing error of 20%. The new set also contains solutions with higher speedups, up to 47.1x with a testing error of 20.7%. One drawback of retraining is that the validation data is reused for retraining, so the model gains more knowledge about it. Thus, for some solutions, the gap between the validation and testing error rates might be higher than in the inference-only search.

Figure 9: Pareto optimal set for two objectives, validation error and speedup, on the Bitfusion architecture model. High-error solutions (more than 24%) are considered infeasible. A memory constraint of 2 MB is applied.
Figure 10: Comparing the two Pareto sets generated by the inference-only search (in red) and the beacon-based search (in green). The retrained solution used as a beacon is the orange star.
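Sol. | layers and projection layers in model order, ending with FC (W/A) | validation WER | compression ratio | speedup | testing WER
Base 32/32 32/32 32/32 32/32 32/32 32/32 32/32 32/32 16.2% 1x - 17.2%
Base (16-bit) 16/16 16/16 16/16 16/16 16/16 16/16 16/16 16/16 16.2% 2x 1x 17.2%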
S1 8/16 2/2 2/16 4/8 4/8 4/16 4/4 2/8 17.4% 11.6x 14.6x 18.1%
S2 4/16 2/2 2/16 4/8 4/8 4/16 4/4 2/8 17.6% 11.6x 14.6x 18.4%
S10 8/16 2/2 2/2 2/4 4/8 2/8 4/2 2/8 18.9% 13.6x 27.2x 19.4%
S14 4/16 2/2 2/2 2/8 2/4 2/8 4/2 2/8 19.7% 13.6x 30.0x 20.2%
S18 4/16 2/2 2/2 2/4 2/2 2/16 4/2 2/8 20.6% 13.6x 35.2x 21.3%
S19 4/8 2/2 2/2 2/4 2/2 2/16 4/2 2/8 20.8% 13.3x 35.2x 21.5%
S22 4/16 2/2 2/2 2/4 4/8 2/8 2/2 2/4 22.9% 13.3x 37.9x 23.1%
S25 4/16 2/2 2/2 2/2 4/8 2/8 2/2 2/4 23.5% 13.6x 39.5x 24.0%
S26 8/16 2/2 2/2 2/2 4/4 2/8 2/2 2/4 23.7% 13.3x 40.7x 24.2%
Table 7: The Pareto set of solutions found by the inference-only search, applying the NSGA-II algorithm to two objectives: the validation WER and the speedup. The validation WER is the error rate evaluated on the validation set. The speedup is computed using the Bitfusion architecture model. The WER of each solution on the testing set is also listed in the table. Each layer and each projection layer has its own column, identified by the layer index x. For each layer, the number of bits used is written as W/A, where W is the number of bits for weights and A is the number of bits for activations. The compression ratio is listed after the validation WER. The 16/16 row is the base model that can run on Bitfusion using a 16-bit full implementation.
Sol. | layers and projection layers in model order, ending with FC (W/A) | validation WER | compression ratio | speedup | testing WER
Base 32/32 32/32 32/32 32/32 32/32 32/32 32/32 32/32 16.2% 1x - 17.2%
Base (16-bit) 16/16 16/16 16/16 16/16 16/16 16/16 16/16 16/16 16.2% 2x 1x 17.2%
S1 8/16 4/2 4/8 2/4 4/16 2/16 2/2 2/8 17.1% 11.4x 21.0x 18.1%
S2 8/8 4/2 2/8 4/4 4/16 2/16 2/2 2/8 17.3% 12.3x 21.4x 18.5%
S8 16/8 8/4 2/2 2/4 4/4 2/16 2/2 2/4 18.2% 11.3x 35.9x 19.3%
S14 16/8 4/2 2/2 2/2 4/4 2/16 2/2 2/4 19.0% 12.2x 38.7x 19.5%
S15 8/8 2/4 2/2 2/4 2/4 2/4 2/2 2/4 19.1% 15.2x 40.7x 20.0%
S19 4/16 2/4 2/2 2/4 2/2 2/4 2/2 2/4 20.2% 15.6x 45.5x 21.2%
S20 4/16 2/2 2/2 2/4 2/2 2/4 2/2 2/4 20.4% 15.6x 47.1x 20.7%
Table 8: The Pareto set of solutions found by the beacon-based search, applying NSGA-II to two objectives: the validation WER and the speedup. The validation WER is the error rate evaluated on the validation set. The speedup is computed using the Bitfusion architecture model. The WER of each solution on the testing set is also listed in the table. Each layer and each projection layer has its own column, identified by the layer index x. For each layer, the number of bits used is written as W/A, where W is the number of bits for weights and A is the number of bits for activations. The compression ratio is listed after the validation WER. The 16/16 row is the base model that can run on Bitfusion using a 16-bit full implementation.

6 Discussion and Limitations

The main focus of this paper is to open up a research direction for hardware-aware multi-objective compression of neural networks. There exist many network models and a large number of compression techniques. In this work, we principally focus on quantization and its application to SRU models for several reasons. Quantization is one of the vital compression methods and can be used alone or with other compression methods. Therefore, we consider enabling the MOOP on quantization to be of great benefit. Also, SRU is a promising recurrent layer that allows powerful hardware parallelization. One reason for using SRU is to examine to what extent it can be quantized, as this remains under-investigated. Since the SRU can be considered an optimized version of the LSTM, we wanted to investigate the effect of quantization on it. Our experiments showed that, by excluding the recurrent vectors and biases from quantization, the SRU can be quantized to high compression ratios without a severe effect on the model error rate. Thus, the SRU combines the speedup of high parallelization with the model size reduction of compression. The second reason is that running experiments using SRU is much faster than with any other RNN model. Thus, we had a better opportunity to run multiple trials to explore our methodology.

In this article, we further claim that the automation of hardware-aware compression is essential to meet changes in the application and hardware architecture. To support this claim and show that our proposed method is hardware-agnostic, we have applied our search method to two different hardware architectures, SiLago and Bitfusion. Those architectures were selected because of the varying precisions they support. In Sections 5.3 and 5.4 we have shown two different sets of solutions. These findings show that compression can be done in different ways depending on the target platform. The differences between the speedup values that appear in Tables 6 and 8 for SiLago and Bitfusion might suggest that Bitfusion is faster than SiLago. This work does not investigate that difference in speed, since we compare the optimized solutions on a given architecture to the baseline running on the same architecture.

For our method to be entirely generic, it needs to support variations in the NN model, compression method, and hardware platform. In this work, we have applied the method to two different architectures: a CGRA (SiLago) and a systolic array architecture (Bitfusion). Concerning the variations in the NN model, we have applied our method to one model, but post-training quantization has been applied successfully to several models in the literature. Since our method mainly relies on the success of post-training quantization, we believe it is generic enough to be applied to many NN models. However, the beacon-based search needs to be applied to more models of several depths to investigate whether a generic equation for the threshold selection can be defined. The aspect that needs more work is the variation in the compression method. The possibility of applying post-training versions of other compression methods should be investigated. Also, the idea of the beacon-based search needs to be examined on other compression techniques to see whether it can be applied directly, modified, or replaced by another method that accounts for the retraining effect on different compression configurations within a reasonable time.

7 Conclusion and Future Work

Compression of neural network applications contributes substantially to their efficient realization on edge devices. The compression of the model can be customized by involving the hardware model and application constraints to reinforce the compression benefits. As a result of this customization, compression automation is required to meet variations in the hardware and the application constraints. Thus, selecting the compression configuration, such as the per-layer pruning percentage or bit-width, is considered an optimization problem.

This article proposes a Multi-Objective Hardware-Aware Quantization (MOHAQ) method and applies it to a Simple Recurrent Unit (SRU)-based RNN model for speech recognition. In our approach, both hardware efficiency and the error rate are treated as objectives during quantization, and the designer has the freedom to choose between varying Pareto alternatives. We relied on post-training quantization to enable the evaluation of candidate solutions during the search within a feasible time (inference-only search). We then reduce the quantization error by retraining with a novel method called beacon-based search. The beacon-based search uses a few retrained models to guide the search instead of retraining the model for every evaluated solution.

We have shown that the SRU, as an optimized version of the LSTM, can be post-training quantized with negligible or small increases in the error rate. Also, we have applied the multi-objective search to quantize the SRU model to run on two architectures, SiLago and Bitfusion. We found a different solution set for each platform to meet the changing constraints. On SiLago, using inference-only search, we have found a set of solutions that reach from a high percentage up to the full maximum possible performance, with error rate increases ranging from 0 to 2.6 percentage points. On Bitfusion, assuming a small SRAM size, we have shown the search results using both inference-only search and beacon-based search. We have shown how our beacon-based search decreases the error and finds better-performing solutions with lower error rates. The beacon-based search matched the highest speedup of the inference-only search with an error rate that is 4.2 percentage points lower. Also, the highest speedup achieved by the beacon-based search is 47.1x, compared to the 40.7x achieved by the inference-only search, and yet at a lower error rate.

This work introduces a way to account for retraining within a multi-objective search, and we have done so for quantization. Next, we want to apply this to other compression techniques such as deltaRNN and restructured matrices. We also want to know how to apply our method to a hybrid of compression and quantization techniques rather than only one. In addition, we want to run experiments on more hybrid NN models, such as models with convolution and recurrent layers.

8 Acknowledgements

This research is part of the CERES research program funded by the ELLIIT strategic research initiative funded by the Swedish government and Vinnova FFI project SHARPEN, under grant agreement no. 2018-05001.

The authors would also like to acknowledge the contribution of Tiago Fernandes Cortinhal in setting up the Python libraries and Yu Yang for the thoughtful discussions about the SiLago architecture.