1 Introduction
Realizing Deep Learning (DL) models on embedded platforms has become vital to bring intelligence to edge devices found in homes, cars, and wearables. However, DL models’ appetite for memory and computational power is in direct conflict with the limited resources of embedded platforms. This conflict has led to research that combines methods to shrink models’ compute and memory requirements without affecting their accuracy Rezk et al. (2020). Models can be compressed by pruning neural network weights Dai et al. (2017), by quantizing the weights and activations to use fewer bits Courbariaux et al. (2015); Rybalkin et al. (2018), and by many other techniques Rezk et al. (2020).
The effect of model compression can be improved by incorporating guidance from a model of the target hardware platform during compression. Different Neural Network (NN) layers have different effects on energy savings, latency improvement, and accuracy degradation Yang et al. (2016). Selecting different optimization parameters, such as the pruning ratio or precision, for different layers can maximize accuracy and hardware performance. Such methods are called hardware-aware optimizations. These optimization methods result in a specific solution that depends on a particular model, hardware platform, and accuracy constraint. Changes in the hardware platform or the application constraints would render this solution no longer optimal. Thus, it is necessary to automate NN compression so that it adapts to changes in the platform and the constraints.
The compression of an NN model to run on embedded platforms can be considered an optimization problem. Some prior work on pruning and quantization has used optimization algorithms to select pruning/quantization parameters. Some researchers have tried to maximize accuracy while treating platform-related metrics such as latency or memory size as constraints Nahshan et al. (2019); Yang et al. (2018b); Yao et al. (2017). In contrast, another group of researchers treated accuracy as the constraint and selected a platform performance metric as the main objective Yang et al. (2016). In this work, we treat the optimization problem as a Multi-Objective Optimization Problem (MOOP) where both accuracy and hardware efficiency are objectives. First, it is non-trivial to decide whether accuracy or performance should be the objective of the optimization; considering the problem as multi-objective is therefore more appropriate. Also, MOOP has advantages over single-objective optimization, such as yielding a set of solutions that meet the objectives in different ways Savic (2002). The embedded system designer can then trade off the different alternative solutions against one another. The role of the optimization algorithm is only to generate the most efficient solutions, not to select one solution.
However, applying a multi-objective search to select NN model compression configurations is not straightforward. Many compression methods require retraining to compensate for accuracy loss, and that cannot be done while evaluating candidate solutions in a large search space. That is why, in most recent work, the compression parameters have been selected during training, ending with a single solution. To make the multi-objective search possible, we rely on two aspects. First, we use post-training quantization as the compression method. Post-training quantization has become an increasingly reliable compression method Zhao et al. (2019); Banner et al. (2019); Nahshan et al. (2019); Nagel et al. (2020). Thus, evaluating one candidate solution only requires running inference on the NN model. Second, we know that we cannot guarantee that post-training quantization will provide high-accuracy solutions under all constraint scenarios. Thus, we propose a novel method called "beacon-based search" to supplement the inference-only search with retraining. In beacon-based search, only a few solutions (beacons) are selected for retraining, and they are used to guide other candidate solutions without retraining all evaluated solutions.
This article applies the proposed Multi-Objective Hardware-Aware Quantization (MOHAQ) method to a recurrent neural network model used for speech recognition. Recurrent Neural Networks (RNNs) are NNs designed for applications with sequential data inputs or outputs Donahue et al. (2014). RNNs recognize the temporal relationship between input/output sequences by adding feedback to Feed-Forward (FF) neural networks. The model used in the experiments uses Simple Recurrent Units (SRU) instead of Long Short Term Memory (LSTM) due to the SRU's ability to be parallelized over multiple time steps and its faster training and inference Lei et al. (2018). We selected speech recognition as an important RNN application and the TIMIT dataset due to its popularity in speech recognition research. Also, we found good software support for speech recognition using the TIMIT dataset in the PyTorch-Kaldi project Ravanelli et al. (2019). To demonstrate the flexibility of the method in supporting different applications and hardware platforms, we applied it to two hardware architectures with varying constraints. The first is the SiLago architecture Hemani et al. (2017), and the second is the Bitfusion architecture Sharma et al. (2017). These two architectures were chosen because they support the varying precisions required in our experiments. The contributions of this work are summarized as:

We enable fast evaluation of candidate solutions in a multi-objective optimization for bit-width selection for quantized NN layers by using post-training quantization. We show how to take into account combined objectives such as model error and hardware efficiency.

We propose a method called beacon-based search to predict the retraining effect on candidate solutions without retraining all the evaluated solutions in the search space.

We demonstrate the flexibility of the proposed method by applying it to two hardware architectures, SiLago and Bitfusion.

To the best of our knowledge, this is the first work to analyze the effect of quantization on SRU units. Although the SRU can be considered an optimized version of the LSTM, whether low-precision quantization can be applied to it had not been investigated before.
This article is organized as follows. Section 2 gives a brief explanation of the neural network layers, optimization methods, and hardware platforms used in this work. Then, Section 3 discusses the related research. In Section 4, we explain the quantization of the SRU model and the multi-objective search used for hardware-aware per-layer bit selection. In Section 5, we describe our experiments and the results that assess the proposed method. Afterward, in Section 6, we discuss our results. Finally, in Section 7, we conclude the article.
2 Background
In this section, we discuss the essential components, methods, and platforms used in this work. We first explain the RNN layers relevant to this work. Then, we explain how NN models are quantized. Afterwards, we discuss multi-objective optimization techniques. Finally, we cover the details of the hardware architectures used in this article.
2.1 RNN Model Components
RNNs recognize the temporal relation between data sequences by adding recurrent layers to the NN model. The recurrent layer adds feedback from previous time steps and adds memory cells to mimic human memory. Here we cover the most popular recurrent layer, the Long Short Term Memory (LSTM) Hochreiter and Schmidhuber (1997), and the Simple Recurrent Unit (SRU) Lei et al. (2018). The SRU is an alternative to the LSTM that has been proposed to improve the LSTM's computational complexity. We then briefly cover additional layers required by the RNN model used in this paper's experiments.
2.1.1 Long Short Term Memory (LSTM)
LSTM is composed of four major computational blocks (Figure 1). These blocks are Matrix-to-Vector (MV) multiplications between the input and output-feedback vectors and the weight matrices, followed by bias addition and the application of a non-linear function. Three of these blocks compute the forget, input, and output gates. These gates decide which information should be forgotten, which information should be renewed, and which information should appear in the output vector, respectively. The fourth computational block computes the memory state vector values. For example, the forget gate output is computed as:

f_t = σ(W_f · [x_t, h_{t-1}] + b_f)    (1)

where x_t is the input vector, h_{t-1} is the hidden state output vector, W_f is the weight matrix, b_f is the bias vector, and σ is the sigmoid function. The number of computations and parameters for LSTM are shown in Table 1.
2.1.2 Simple Recurrent Unit (SRU)
The Simple Recurrent Unit (SRU) is designed to make the recurrent unit easy to parallelize Lei et al. (2018). Most of the LSTM computations are matrix-to-vector multiplications, so parallelizing them is of great value. However, these computations rely on the previous time step's output and state vectors, and are therefore hard to parallelize across time steps. The SRU overcomes this problem by removing the previous output h_{t-1} and state c_{t-1} from all matrix-to-vector multiplications; c_{t-1} is used only in element-wise operations. The SRU is composed of two gates (forget and update gates) and a memory state (Figure 2). It has three matrix-to-vector multiplication blocks. For example, the forget gate vector is computed as:

f_t = σ(W_f x_t + v_f ⊙ c_{t-1} + b_f)    (2)

where x_t is the input vector, c_{t-1} is the old state vector, W_f is the input weight matrix, v_f is the recurrent weight vector, b_f is the bias vector, and σ is the sigmoid function. The number of operations and parameters for an SRU is shown in Table 1.
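The parallelization argument above can be made concrete with a minimal NumPy sketch of an SRU layer. This is an illustration following the structure described here (the variable names and the tanh output activation are our assumptions, not the implementation used in this article): the three MV products depend only on the input, so they are batched over all time steps, and only cheap element-wise work remains sequential.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_layer(x, W, Wf, Wr, vf, vr, bf, br):
    """One SRU layer. x: (T, d); all matrices: (d, d) for simplicity."""
    # The three matrix-to-vector products involve only the input x,
    # so they can be batched over all T time steps at once.
    xw, xf, xr = x @ W.T, x @ Wf.T, x @ Wr.T
    c = np.zeros(W.shape[0])
    h = np.empty_like(xw)
    for t in range(x.shape[0]):            # only element-wise work is sequential
        f = sigmoid(xf[t] + vf * c + bf)   # forget gate, cf. Eq. (2)
        r = sigmoid(xr[t] + vr * c + br)   # update (highway) gate
        c = f * c + (1.0 - f) * xw[t]      # memory state
        h[t] = r * np.tanh(c) + (1.0 - r) * x[t]  # assumes d_in == d_hidden
    return h

# hypothetical shapes: 5 time steps, 4 hidden units
rng = np.random.default_rng(0)
out = sru_layer(rng.standard_normal((5, 4)),
                *(rng.standard_normal((4, 4)) for _ in range(3)),
                *(rng.standard_normal(4) for _ in range(4)))
```

Note that only the four element-wise lines inside the loop depend on the previous time step, which is why the SRU trains and infers faster than the LSTM on parallel hardware.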
2.2 Bidirectional RNN layer
In a Bidirectional RNN layer, the input is fed into the layer both from past to future and from future to past. Consequently, the recurrent layer is duplicated, which is equivalent to two recurrent layers working simultaneously, each processing the input in a different temporal direction. Obtaining data from both the past and the future helps the network understand the context better. This concept can be applied to different recurrent layer types, such as BiLSTM Li and Shen (2017), BiGRU Vukotic et al. (2016), and BiSRU. The number of operations and parameters for a BiSRU is shown in Table 1.
2.2.1 Projection layers
The projection layer is an extra layer added before or after the recurrent layer Sak et al. (2014). A projection layer is similar to a Fully-Connected (FC) layer. It is added to allow an increase in the number of hidden cells while keeping the total number of parameters low. The projection layer has a number of units p that is less than the number of recurrent hidden cells h. The number of weights and computations in the recurrent layer is then dominated by a multiple of p rather than of h. Since p < h, h can be increased without increasing the size of the model dramatically.
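The saving can be illustrated with the standard parameter counts for an LSTM with a recurrent projection, as in Sak et al. (2014). The layer sizes below are hypothetical, and the exact formulas in Table 1 may differ; this is a sketch of the effect, not the article's counts.

```python
def lstm_params(d_in, h):
    # four gate blocks, each with input weights, recurrent weights, and a bias
    return 4 * (h * d_in + h * h + h)

def lstmp_params(d_in, h, p):
    # recurrent feedback is taken from the p-unit projection (p < h),
    # plus the projection matrix itself (standard LSTMP formulation)
    return 4 * (h * d_in + h * p + h) + h * p

full = lstm_params(d_in=40, h=1024)               # ~4.36 M parameters
projected = lstmp_params(d_in=40, h=1024, p=256)  # ~1.48 M parameters
```

With these hypothetical sizes the projection cuts the parameter count roughly threefold while keeping the same number of hidden cells.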
RNN layer | Number of Operations              | Number of Parameters
          | MAC | Element-wise | Non-linear   | Weights | Biases
LSTM      |     |              |              |         |
SRU       |     |              |              |         |
BiSRU     |     |              |              |         |
Table 1: The number of operations and parameters for the LSTM, SRU, and BiSRU layers.
2.3 Quantization
Quantization reduces the number of bits used in neural network operations. It is possible to quantize only the neural network weights, or the activations as well. The precision can be reduced from 32-bit floating point to 16-bit fixed point, which usually does not affect the model's accuracy. Therefore, many neural network accelerators use 16-bit fixed-point precision instead of floating point Chen et al. (2017); Ankit et al. (2019); Wang et al. (2018). Alternatively, quantization to integer precision makes it more feasible to deploy NN models on embedded platforms Fang et al. (2020). Integer quantization can use anything from 8 bits down to 1 bit. Low-precision integer accelerators have been proposed to achieve efficiency in terms of speedup, energy, and area savings Google ([Accessed on: October 2021]); Judd et al. (2016); Sharma et al. (2017); Lee et al. (2019). However, integer quantization can cause high accuracy degradation. Thus, in many cases retraining is required to minimize this degradation, or the model is trained using quantization-aware training from the beginning. Recently, there has been growing interest in post-training quantization, which quantizes the pretrained model parameters without any further retraining epochs. This is useful when the training data or the training platform itself is unavailable at deployment time. Most post-training quantization methods address the outlier values that consume the available precision and cause accuracy loss. Clipping the outliers to narrow the data range can overcome the problem, and several techniques exist for selecting clipping thresholds Nahshan et al. (2019); Zhao et al. (2019). Alternatively, Outlier Channel Splitting (OCS) duplicates the channels containing outlier values and then halves their output values or outgoing weights to preserve functional correctness Zhao et al. (2019); however, this channel duplication increases the size of the model. In this work, we use clipping during quantization, and the clipping thresholds are selected using the Minimum Mean Square Error (MMSE) method Sung et al. (2015).
2.4 Optimization using Genetic Algorithms
Optimization is essential to various problems in engineering and economics wherever decision-making is required Chong and Zak (2013). It finds the best choice among multiple alternatives. The search is guided by an objective function and restricted by defined constraints. Many problems have more than one objective, and these objectives can conflict, so that improving one requires worsening the others. Single-Objective Optimization (SOOP) tries to find a single best solution Savic (2002). This solution corresponds to the minimum or maximum value of a single objective function that groups all the objectives into one. This type of optimization does not provide the designer with a collection of alternative solutions that trade the different objectives against one another. On the other hand, Multi-Objective Optimization (MOOP) provides a set of solutions for conflicting objectives known as a Pareto set (front). The Pareto set is a non-dominated set of solutions, none of which is dominated by any other solution. A solution dominates another solution if it improves at least one objective without making any other objective worse.
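The dominance relation just defined can be stated in a few lines of Python. The sketch below (with hypothetical objective values, assuming minimization of both objectives) extracts the Pareto front from a set of candidate solutions:

```python
def dominates(a, b):
    """a dominates b if it is no worse in every objective (minimization)
    and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(solutions):
    """Keep only the non-dominated solutions."""
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o is not s)]

# hypothetical (error, energy) pairs: (0.10, 5.0) is dominated by (0.08, 4.0)
front = pareto_front([(0.08, 4.0), (0.10, 5.0), (0.05, 9.0), (0.12, 2.0)])
```

The three surviving points each trade error against energy in a different way, which is exactly the set of alternatives MOOP hands to the designer.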
Genetic Algorithms (GAs) are popular algorithms for both single- and multi-objective optimization Chahar et al. (2021). GAs are inspired by natural selection, where survival is for the fittest. A GA is a population-based algorithm. Each population is composed of candidate solutions (individuals), and each individual is composed of a number of variables called chromosomes. The GA is an iterative algorithm that evaluates the fitness values of one population's solutions and generates a new population (offspring) from the old one until a criterion is met or the defined number of populations is completed. A new population is formed by selecting pairs of good solutions (based on their fitness values) and applying crossover to them. Crossover composes a new individual by mixing parts of the parent individuals. Mutation is then applied to the offspring individual to change some of its chromosomes. The selection, crossover, and mutation operations are repeated until the new population is complete. An encoding function maps the solution variables into a representation suitable for genetic operations such as crossover and mutation.
Modifications have been applied to GAs to make them work on multi-objective problems. These modifications mostly concern the fitness assignment, while the rest of the algorithm is similar to the original GA Chahar et al. (2021). There are various multi-objective GAs, such as NPGA Horn et al. (1994), NSGA Srinivas and Deb (2000), and NSGA-II Deb et al. (2002). NSGA-II (Non-dominated Sorting Genetic Algorithm II) is the enhanced version of NSGA and a well-known fast multi-objective GA Yusoff et al. (2011). In this work, we have used NSGA-II as our multi-objective search method. The NSGA-II implementation is provided by PYMOO (a Python library for optimization) and is based on the NSGA-II paper Deb et al. (2002). NSGA-II is similar to a general GA, but the mating and survival selection are modified PYMOO ([Accessed on: October 2021]). NSGA-II selects individuals frontwise. Since it might not be possible to find enough individuals allowed to survive, it might be necessary to split a front. The crowding distance (Manhattan distance in the objective space) is used for selection when splitting a front. Extreme points are assigned a crowding distance of infinity to keep them in all generations. The selection method is binary tournament mating selection, where individuals are compared first by rank and then by crowding distance. NSGA-II's popularity relies on three factors: the fast non-dominated sorting approach, the simple crowded comparison operator, and the fast crowding distance estimation procedure Yusoff et al. (2011).
2.5 Architectures Under Study
This section gives a brief presentation of the two architectures to which we have applied our methods. The first is the SiLago architecture, and the second is Bitfusion. In the SiLago architecture, low-precision support is a new feature under development. Thus, we introduce the idea of low-precision support in SiLago and the expected speedup and energy savings.
2.5.1 SiLago Architecture
The SiLago architecture is a customized Coarse-Grain Reconfigurable Architecture (CGRA) built upon two types of fabric. The first fabric, called the Dynamically Reconfigurable Resource Array (DRRA), is optimized for dense linear algebra and streaming applications. The second fabric, called the Distributed Memory Architecture (DiMArch), is used as a variable-size streaming scratchpad memory. Figure 3 shows a SiLago design example.

DRRA:
All the components of the DRRA (computation, storage, interconnect, control, and address generation) are customized for streaming applications and have been further updated to support NN operations. The DRRA has a unique, extensively parallel, distributed local control scheme and interconnect network Shami and Hemani (2009); Farahini et al. (2014). Each cell in the DRRA comprises a Register File (RF), a sequencer (SEQ), a Data Processing Unit (DPU), and a switchbox. Each DPU and RF in the DRRA outputs to a bus that straddles two columns on each side (right and left) to create an overlapping five-column span. The two buses connect via a switchbox to create a circuit-switched Network-on-Chip (NoC). The data on the selected output bus are propagated via the switchbox to the inputs of the DPUs and RFs Shami and Hemani (2009). This interconnect allows the DPUs and RFs of different cells to be chained together to create larger and more complex datapaths. The RFs and NoCs in the DRRA use 16-bit words. The DPU inside each cell can be customized for NN computations; for NN use, the DPU employs a special computation unit called the Non-Linear Arithmetic Unit (NACU) Baccelli et al. (2020). The NACU was originally developed to operate on a specific bit-width decided at design time.
For this work, we have updated the design to support three different types of low-precision operation: 1x 16-bit, 2x 8-bit, or 4x 4-bit. This is done by modifying the existing multiplier and accumulator inside the NACU to use Vedic multiplication Tiwari et al. (2008). The 16-bit multiplier is split into 16 4-bit multiplications. Depending on the type of operation, the multiplier can be reconfigured to use a different number of these smaller multipliers; their results are then added to produce the final result.

DiMArch:
The computational fabric is complemented by a memory fabric called DiMArch, which provides a matching parallel, distributed streaming scratchpad memory. It is composed principally of SRAM macros Tajammul et al. (2016), each coupled with an Address Generation Unit (AGU). The SRAM macros, along with the AGUs, are connected to each other by two NoCs: a circuit-switched high-bandwidth data NoC and a packet-switched control and configuration NoC. The massively parallel interconnect between the two fabrics in Figure 3 ensures that the computational parallelism is matched by the parallelism of scratchpad memory access. The cells of the two CGRAs can be dynamically clustered to morph into custom datapaths, shown as private execution partitions in Figure 3; see Tajammul et al. (2013). This has been exploited in Jafri et al. (2017) to create neural network accelerators.

Energy and timing estimations:
Table 2 presents the energy consumption and speedup of the arithmetic operations on the SiLago platform. The energy consumption is based on post-layout simulations of the reconfigurable multiplier and accumulator (MAC), synthesized using a 28 nm technology node. The energy consumption for SRAM access was based on SRAM macro-generated tables in the same 28 nm technology. The speedup is calculated as operations per clock cycle: the MAC unit can be reconfigured to calculate one 16-bit, two 8-bit, or four 4-bit MACs every cycle.
Operation                     | 16x16 | 8x8   | 4x4
MAC speedup                   | 1x    | 2x    | 4x
MAC energy cost ()            | 1.666 | 0.542 | 0.153
Loading 1-bit energy cost ()  | 0.08  |       |
Table 2: The speedup and energy consumed by different types of low-precision operations on the SiLago architecture.
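To show how the Table 2 numbers could feed a per-layer energy estimate, here is a sketch. The cost model (MAC energy plus loading each weight bit once) and the layer sizes are our illustrative assumptions, not the platform model used later in the article:

```python
# per-MAC energy and per-bit loading cost taken from Table 2 (table's units)
MAC_ENERGY = {16: 1.666, 8: 0.542, 4: 0.153}
LOAD_ENERGY_PER_BIT = 0.08

def layer_cost(n_macs, n_weights, bits):
    """Rough energy estimate for one layer at a uniform precision:
    MAC energy plus loading every weight bit once (simplified model)."""
    return n_macs * MAC_ENERGY[bits] + n_weights * bits * LOAD_ENERGY_PER_BIT

# hypothetical layer: 1 M MACs, 250 k weights
e16 = layer_cost(1_000_000, 250_000, 16)
e4 = layer_cost(1_000_000, 250_000, 4)  # 4-bit shrinks both terms sharply
```

Under this toy model, moving the layer from 16-bit to 4-bit cuts both the compute and the weight-loading terms, which is why per-layer precision selection is worth searching over.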
2.5.2 Bitfusion Architecture
Bitfusion is a variable-precision architecture designed to support precision variation in quantized neural networks Sharma et al. (2017). It is composed of a 2D systolic array of Fused Processing Elements (Fused-PEs). Each Fused-PE is composed of 16 individual BitBricks, each of which is designed to perform 1-bit or 2-bit MAC operations. By grouping the BitBricks in one Fused-PE, higher-precision operations are supported. The highest parallelism rate of one Fused-PE is 16x, achieved when both operands are 1-bit or 2-bit; no parallelism is achieved with two 8-bit operands. To support 16-bit operations, the Fused-PE is used for four cycles. Thus, the speedup of using 2-bit over 16-bit operations is 64x.
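The throughput figures above follow from how many BitBricks a multiply occupies. The sketch below reproduces that arithmetic (our reading of the Fusion rule, stated here as an assumption for low precisions):

```python
def fused_pe_ops_per_cycle(bits_a, bits_b):
    """Throughput of one Fused-PE (16 BitBricks, each a 2-bit MAC).
    An n x m-bit multiply (n, m <= 8) occupies (n/2)*(m/2) BitBricks;
    16-bit operands occupy the whole Fused-PE for four cycles."""
    if max(bits_a, bits_b) <= 8:
        bricks = max(1, bits_a // 2) * max(1, bits_b // 2)
        return 16 // bricks
    return 1 / 4

assert fused_pe_ops_per_cycle(2, 2) == 16  # 16x parallelism at 1- or 2-bit
assert fused_pe_ops_per_cycle(8, 8) == 1   # no parallelism with two 8-bit operands
assert fused_pe_ops_per_cycle(2, 2) / fused_pe_ops_per_cycle(16, 16) == 64
```

The last assertion recovers the 64x speedup of 2-bit over 16-bit operations quoted above.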
This section explained the NN model components, compression methods, and optimization methods one needs to understand before reading the rest of this article. Next, we present and compare the literature related to this article and highlight the main differences.
3 Related Work
In this section, we review the work related to this article. First, we discuss the research done on the compression of SRU-based models. Then, we review the research relevant to the optimization of compressed neural networks.
3.1 Simple Recurrent Unit (SRU) compression
Shangguan et al. used two SRU layers as a decoder in a speech recognition model Shangguan et al. (2019). They managed to prune the SRU layers by 30% without a noticeable increase in error. The pruning was applied during training to ensure a low error Zhu and Gupta (2018). To the best of our knowledge, no work applies quantization, or any compression method other than pruning, to SRU models.
Work                                        | Compression  | Objectives            | Constraints     | Hardware-aware
Yang et al. Yang et al. (2016)              | Pruning      | Energy                | Accuracy        | Energy model Yang et al. (2016)
Yang et al. Yang et al. (2018b), Netadapt   | Pruning      | Accuracy              | Resource budget | Empirical measurements
Yang et al. Yang et al. (2018a)             | Pruning      | Accuracy              | Energy          | Energy model Yang et al. (2016)
Yao et al. Yao et al. (2017), Deepiot       | Pruning      | Accuracy              | Size            | Memory information
Rizakis et al. Rizakis et al. (2018)        | Pruning      | Accuracy              | Latency         | Roofline model
Wang et al. Wang et al. (2019), HAQ         | Quantization | Accuracy              | Resource budget | Platform model
Nahshan et al. Nahshan et al. (2019), LAPQ  | Quantization | Loss                  | -               | -
Cai et al. Cai et al. (2020), ZeroQ         | Quantization | Sensitivity, Size     | -               | -
This work, MOHAQ                            | Quantization | Error, Speedup, Energy| Size            | Platform model
Table 3: Summary of related work on the optimization of neural network compression.
3.2 Optimization of Neural Networks Compression
The compression of neural network models has been treated as an optimization problem in which the degree of compression is selected for each layer/channel. In many cases, feedback from a hardware platform or hardware model is used during the optimization (hardware-aware compression). We have summarized the work done on the optimization of NN model compression in Table 3. The choice of constraints and objectives has varied among the different papers.
Energy-aware pruning Yang et al. (2016) is a pruning method that minimizes energy consumption on a given platform. A platform model was used to guide the pruning process by indicating which layer, when pruned, would lead to the greatest energy saving. The pruning process stops when a predefined accuracy constraint is hit. Netadapt Yang et al. (2018b) eliminated the need for platform models by using direct empirical measurements. Nevertheless, pruning in Netadapt is constrained by a resource budget such as latency, memory size, or energy consumption. In both methods, pruning starts from a pretrained model, and fine-tuning is applied to retain accuracy. Similarly, an energy-constrained compression method Yang et al. (2018a) used pruning guided by energy constraints, with energy results predicted by a mathematical model of a TPU-like systolic array architecture; however, this compression method trains the model from scratch. On the other hand, DeepIOT Yao et al. (2017) obtains the memory size information from the target platform to compute the required compression ratio as a constraint. In other work, the optimization variables are selected based on time constraints Rizakis et al. (2018), where roofline models are used to calculate the maximum achievable performance for different pruning configurations.
HAQ (Hardware-Aware Quantization) used reinforcement learning to select the bit-widths of weights and activations for quantizing a model during training while considering hardware constraints Wang et al. (2019). LAPQ Nahshan et al. (2019) and ZeroQ Cai et al. (2020) applied optimization algorithms to quantized NN models, but without any hardware model guidance. Loss-Aware Post-training Quantization (LAPQ) is a layer-wise iterative optimization algorithm that calculates the optimal quantization step for clipping Nahshan et al. (2019). The authors showed that there is a relation between the quantization step and the cross-entropy loss, and that small changes in the quantization step have a drastic effect on the accuracy. LAPQ managed to quantize ImageNet models to 4 bits with only a slight decrease in accuracy. In Zero-shot Quantization (ZeroQ), Cai et al. proposed a data-free quantization method Cai et al. (2020). Their approach uses multi-objective optimization to select the precision of the different layers in the model. The two objectives are the memory size and the total quantization sensitivity, where an equation measures the sensitivity of each layer at different precisions. The authors assumed that the sensitivity of each layer to a specific precision is independent of the other layers' precisions; this assumption simplifies the computation of the overall sensitivity for the different quantization configurations in the search space.
None of the discussed work has applied hardware-aware multi-objective optimization to the problem of NN compression. In this work, we use quantization as the compression method and target hardware models to guide the compression. We allow both the model error/accuracy and hardware efficiency metrics (speedup and energy consumption) to be objectives. We use the hardware on-chip memory size as a constraint to avoid high-cost off-chip communication. The details of our proposed method are explained in the next section.
4 Method
This section discusses the details of our proposed Multi-Objective Hardware-Aware Quantization (MOHAQ) method for the SRU model for speech recognition. First, we explain how we apply post-training quantization to the SRU. Next, we present the optimization algorithm used to select the layers' precisions, guided by the hardware model. Then, we explain how to enable retraining in a multi-objective search for setting mixed-precision quantization configurations. Finally, we discuss how we use the hardware models for guidance during the compression optimization.
4.1 Posttraining quantization of SRU model
The Simple Recurrent Unit (SRU) was initially designed to overcome the parallelization difficulty in the LSTM and other recurrent units, where outputs from previous time steps are used in the current time step's operations. This property makes it impossible to fully parallelize the MV operations over multiple time steps. In the SRU, the outputs from the previous time step are excluded from the MV computations and are only used in element-wise computations.
A side effect of excluding the recurrent inputs from the MV operations is that the number of weights used in the recurrent operations decreases significantly. Thus, it becomes possible to also exclude the recurrent part from low-precision quantization. We apply low-precision quantization only to the weights and activations used in MV operations; the other weights are kept in a 16-bit fixed-point format. By doing so, we reduce the model's overall size while keeping its error rate low. Our models thus mix 16-bit fixed-point and integer precisions across layers. We explain here how quantization is applied to the weights and activations during inference and how we convert between fixed-point and integer operations.

Weights integer quantization with clipping: We applied linear integer quantization to the weight matrices and used the Minimum Mean Square Error (MMSE) method to determine the clipping threshold Sung et al. (2015). We used the CNN quantization implementation provided on GitHub with the OCS paper Zhao et al. (2019) as a base for our implementation, which we modified to work with SRU units and to support varying precision per layer. The ranges of the quantized values are [-127:127], [-7:7], and [-1:1] for 8-bit, 4-bit, and 2-bit, respectively.

Weights 16-bit fixed-point quantization: This is used for the recurrent weights, bias vectors, and any weight matrices chosen to be 16-bit. Depending on the range of the data, we compute the minimum number of bits required for the integer part; the rest of the 16 bits are used for the sign bit and the approximated fractional part.

Activation integer quantization with clipping: Integer quantization of activations is similar to that of weights. However, since we cannot compute the exact ranges of the vectors required for the clipping threshold computation, we estimate expected ranges. To do so, we first run a portion of the validation data sequences; the predicted range of a given vector is the median of the ranges recorded while running these sequences. In our experiments, 70 sequences were enough to compute the expected ranges.

Activation requantization to 16-bit fixed point: The activations are quantized to 16 bits in the same way as the weights. If a vector is the output of an integer operation, we found it necessary to requantize its values to fixed point by dividing them by a scale value. The scale value is computed to return the vector to the range it would have had if quantization had not been applied. The range of the non-quantized data is computed using a portion of the validation data sequences while using the original model weights and activations, i.e., with quantization turned off.
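The quantization steps above can be sketched in NumPy. This is an illustration with hypothetical helper names (the article's actual implementation builds on the OCS code base): linear quantization with a clipping threshold chosen by a simple MMSE grid search, plus the median-based expected activation range.

```python
import numpy as np

def quantize_clipped(w, threshold, n_bits):
    """Linear quantization with clipping, e.g. range [-127:127] for 8-bit.
    Returns the dequantized values so the error can be measured directly."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = threshold / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def mmse_threshold(w, n_bits, n_steps=100):
    """Grid-search the clipping threshold that minimizes the MSE
    between the original and the clipped-quantized tensor."""
    max_abs = np.abs(w).max()
    grid = np.linspace(max_abs / n_steps, max_abs, n_steps)
    errs = [np.mean((w - quantize_clipped(w, t, n_bits)) ** 2) for t in grid]
    return float(grid[int(np.argmin(errs))])

def expected_range(calib_vectors):
    """Expected activation range: median of the max-abs values recorded
    while running the calibration (validation) sequences."""
    return float(np.median([np.abs(v).max() for v in calib_vectors]))
```

With an outlier-heavy weight tensor, the MMSE threshold lands well below the raw maximum, which is exactly the clipping behavior the post-training methods above rely on.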
4.2 Multiobjective quantization of neural network models
During the compression/quantization of NN models, we have two types of objectives. The first is an NN model performance metric, such as accuracy or error. The second relates to the efficiency of the hardware platform, such as memory size, speedup, energy consumption, or area. Treating the problem as a multi-objective problem provides the designer with a variety of solutions offering different trade-offs, from which the designer can choose the one best suited to the running application. We have used a Genetic Algorithm (GA), as GAs are efficient multi-objective search optimizers. The multi-objective GA we used is NSGA-II, provided by the Pymoo Python library Deb et al. (2002); Blank and Deb (2020). As mentioned in Section 2, NSGA-II is a popular GA that supports more than one objective. NSGA-II has shown better convergence and a better spread of solutions near the true Pareto-optimal front for many difficult test problems Srinivas and Deb (2000). Thus, NSGA-II is a good candidate for our experiments.
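The crowding distance that NSGA-II uses for survival selection (described in Section 2.4) is simple to compute. The sketch below follows the standard formulation, assuming a front of numeric objective vectors; it is an illustration, not Pymoo's internal code:

```python
import math

def crowding_distance(front):
    """NSGA-II crowding distance: for each objective, sort the front and add
    the normalized gap between each solution's neighbors; extreme points get
    infinity so they survive every generation."""
    n, m = len(front), len(front[0])
    dist = [0.0] * n
    for k in range(m):
        order = sorted(range(n), key=lambda i: front[i][k])
        dist[order[0]] = dist[order[-1]] = math.inf   # keep extreme points
        span = front[order[-1]][k] - front[order[0]][k] or 1.0
        for a in range(1, n - 1):
            i = order[a]
            if dist[i] != math.inf:
                dist[i] += (front[order[a + 1]][k] - front[order[a - 1]][k]) / span
    return dist

d = crowding_distance([(0.0, 3.0), (1.0, 2.0), (2.0, 1.0), (3.0, 0.0)])
```

Solutions in sparse regions of the objective space receive larger distances, so the search preserves a well-spread Pareto front rather than clustering.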
Most automated hardware-aware compression/quantization work uses a single objective and provides a single answer. The reason is that many compression methods require training or retraining to compensate for the accuracy loss caused by compression, and the compression/quantization configuration is selected iteratively during training epochs. Turning the problem into a multi-objective one would then require retraining every evaluated solution in the search space, which is infeasible. Our approach to this problem relies on two observations. The first is that researchers are progressively enhancing post-training compression/quantization techniques, so candidate solutions can be evaluated by running inference only, without any retraining. This is also useful when the training data is not available. The second observation is that a model retrained with one candidate solution's variables can serve as a retrained model for neighboring candidate solutions in the search space. We use this observation in a method we call "beacon-based search" (further explained in Section 4.3). If the inference-only search fails to find solutions with acceptable accuracy, the beacon-based search can be applied; alternatively, the designer can apply the beacon-based search from the beginning. The complete search framework is shown in Figure 4.
Our problem is to select the precision of the weights and activations per layer. A candidate solution therefore has twice as many variables as the model has layers: each layer requires one precision for the weights and one for the activations. The precisions covered in this work are 2-, 4-, and 8-bit integer and 16-bit fixed point. We skipped precisions like 3, 5, and 6 bits because they are rarely supported by hardware platforms; the method is generic, however, and other precisions can be included. The candidate solutions' variables are encoded into a genetic algorithm representation using the discrete values 1, 2, 3, and 4: 2-bit is encoded as 1, 4-bit as 2, 8-bit as 3, and 16-bit as 4. In addition to encoding and decoding, we select the number of generations and define the fitness functions and constraints, keeping the library's default configuration for the remaining GA steps such as crossover, mutation, and selection. For each objective, we define a fitness function. All objectives have to be either minimization or maximization objectives; since Pymoo by default treats objectives as minimization objectives, we turn maximization objectives into minimization objectives by negating them.
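The encoding, decoding, and objective negation described above can be sketched in a few lines of Python (the function names are ours, for illustration):

```python
# Gene values 1..4 map to the supported precisions 2, 4, 8, 16 bits.
PRECISION = {1: 2, 2: 4, 3: 8, 4: 16}

def decode(solution):
    """A solution has 2 genes per layer: (weight bits, activation bits)."""
    n = len(solution) // 2
    return [(PRECISION[solution[2 * i]], PRECISION[solution[2 * i + 1]])
            for i in range(n)]

def as_minimization(objectives, maximize_mask):
    """Pymoo minimizes every objective, so maximization objectives
    (e.g. speedup) are negated before being handed to NSGA-II."""
    return [-v if m else v for v, m in zip(objectives, maximize_mask)]

# 8-layer model -> 16 genes; e.g. 4-bit weights with 16-bit activations:
genes = [2, 4] * 8
layer_precisions = decode(genes)
```

The decoded pairs are then used to quantize each layer before the inference-only evaluation.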
Initially, the search relies on post-training quantization (inference-only search). Evaluating candidate solutions does not involve any training, so the search can be carried out within a reasonable time. To evaluate a solution's error objective, we run inference with the quantized model and use the resulting error as the objective to minimize. Another way to speed up the search is to define a feasibility area: if a solution falls outside this area, it is directly excluded from the solution pool and from further search. We used this to exclude solutions whose error rate is more than 8 percentage points above the baseline, as such solutions were deemed irrelevant.
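The feasibility check is a one-liner; here is a sketch using the baseline validation WER of 16.2% reported later in Table 5 (the function name is ours):

```python
BASELINE_WER = 16.2  # validation WER of the unquantized model (percent)

def is_feasible(wer, margin_pp=8.0):
    """Solutions more than 8 percentage points above the baseline error
    are excluded from the pool and from further search."""
    return wer <= BASELINE_WER + margin_pp
```

With this margin, anything above a 24.2% validation WER is discarded without entering the Pareto ranking.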
To evaluate a quantized model, data from an "unseen" validation set would be preferable. However, besides the testing set, which we must not use for this purpose, no such "unseen" data exist. Thus, we are forced to use the validation set suggested for this database, even though it has already been used during training and has therefore been "seen" by the model. Still, this should not influence our results too strongly, as the validation data was only used to set hyperparameters during training. In our experiments, however, we noticed that the gap between the validation error and the testing error varies significantly among the Pareto optimal solutions when the full validation set is used. To mitigate this variation (which can cause solutions to change order on the testing set compared to the validation set), we split the validation set into smaller subsets (in this case, four) and take the maximum error among the validation subset errors. This method led to a better correspondence between validation and testing errors, and we will explore and analyze it further in the future.
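The subset-max trick above amounts to the following sketch (the round-robin split is our assumption; the paper does not specify how the subsets are formed):

```python
def split_validation(sequences, k=4):
    """Split the validation sequences into k roughly equal subsets
    (round-robin split assumed here)."""
    return [sequences[i::k] for i in range(k)]

def subset_max_error(errors_per_subset):
    """Error objective: the maximum WER over the validation subsets,
    rather than the WER on the full set. Penalizing the worst subset
    reduced the rank disagreement between validation and test errors."""
    return max(errors_per_subset)
```

Taking the maximum makes the objective pessimistic, so a solution that happens to do well on one slice of the validation data cannot dominate on that basis alone.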
It should also be noted that the evaluation of a candidate solution in one generation is independent of the other candidate solutions. Therefore, it is possible to parallelize the search over the candidate solutions and distribute the computation over multiple GPUs Rezk et al. (2014) with linear speedup.
4.3 Beacon-based Search (Retraining-based Multi-objective Quantization)
So far, we have focused on using post-training quantization during the search for an optimal quantization configuration, to achieve speedy evaluation of candidate solutions. However, at high compression ratios, we sometimes find that post-training quantization does not achieve acceptable accuracy levels. The accuracy can be improved by retraining the model using the quantized parameters; still, as retraining is very time-consuming, it is infeasible to do for all candidate solutions. Therefore, we have developed a beacon-based approach that retrains only a small set of solutions, our beacons. "Neighboring" solutions then share the retrained model (beacon) as a basis for their quantization instead of the original pretrained model.
We want to define a neighborhood within which all solutions get a similar retraining effect (improved accuracy) when using the neighborhood's beacon. A natural measure of closeness is a distance in the parameter space: for each parameter, we take the absolute difference of the encoded precision values and then sum over all parameters. During our experiments, we found that the precision of the weights matters more than the precision of the activations when finding neighbor solutions that can share the same retrained parameters. Thus, in this paper, we use only the weight precisions in the distance computation. The distance between a solution $s$ and a beacon $b$ can thus be defined as:

$$D(s, b) = \sum_{k=1}^{L} \left| W^{s}_{k} - W^{b}_{k} \right|$$

where $D(s, b)$ is the distance between $s$ and $b$, $L$ is the number of layers in the model, and $W^{s}_{k}$ and $W^{b}_{k}$ are the encoded precision values of layer $k$'s weights in $s$ and $b$, respectively.
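In code, the distance over the weight-precision genes only is a single sum (a sketch; the function name is ours):

```python
def beacon_distance(sol_weight_prec, beacon_weight_prec):
    """Distance between a candidate solution and a beacon, summed over the
    encoded weight precisions only (activation precisions are ignored)."""
    assert len(sol_weight_prec) == len(beacon_weight_prec)
    return sum(abs(s - b) for s, b in zip(sol_weight_prec, beacon_weight_prec))
```

With 8 layers and encoded precisions 1 to 4, the maximum possible distance is 8 * 3 = 24.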
To validate the assumption that solutions in the neighborhood of a beacon behave consistently after applying quantization with the beacon's parameters, we placed beacons in the search space and calculated the accuracy (actually the word error rate) of neighbor solutions using both the original model parameters and the beacon parameters. In Figure 5, we show the neighborhood of one of these beacons (the others show similar behavior). Each point in the plot is an evaluated solution. The x-axis shows the increase in error of a solution when quantizing the original model parameters, compared to the non-quantized baseline error. The y-axis shows the decrease in error when quantizing the beacon model, compared to the original quantized model. To exemplify, we have marked one solution with a star in Figure 5. For this solution, the baseline model error rate is 16.2%, and applying post-training quantization to this solution using the baseline model parameters gives an error rate of 24.2%. Applying post-training quantization to the same solution using the beacon parameters gives an error rate of 18.8%. Thus, we find this particular solution at 8 (24.2 - 16.2) on the x-axis and 5.4 (24.2 - 18.8) on the y-axis. From the figure, we see a close to linear relationship between the increase in error using the baseline model parameters and the decrease in error using the beacon parameters. We conclude that there is no need to retrain the model for every evaluated solution during the search; it is sufficient to set up a small set of beacons to take the retraining accuracy gain into account.
To retrain the model, we used a BinaryConnect approach Courbariaux et al. (2015), where quantized weights are used during forward and backward propagation only, while full-precision weights are used for the parameter-update step. Therefore, the retrained model always retains floating-point parameters, which can then be used as a basis for various other quantization configurations.
We use the term beacon due to the similarity to its use in swarm-robot search, where the simple robots do not have the communication capabilities needed for the swarm to fulfill its task, and some of the robots are assigned to be fixed communication beacons for the others Tan and Zheng (2013). Similarly, when our search reaches an area with no beacons, one solution is turned into a beacon by retraining the model using that solution's variables. By the end, we obtain a solution set that accounts for the retraining effect on the candidate solutions. Later, when the designer selects a solution from the Pareto optimal set, the designer can use the beacon parameters directly or retrain the model using the selected solution's parameters.
As mentioned before, we define a feasibility area to speed up the search; if a solution falls outside this area, it is directly excluded from the solution pool and further search. One effect of enabling retraining, however, is that solutions that were outside the feasibility area before retraining may become feasible again. Thus, when enabling retraining, one should define an enlarged "beacon-feasible" area so as not to exclude such solutions too early. In addition, it is possible to add more constraints to this area based on the designer's experience, to decrease the number of created beacons, such as thresholds on other objectives or not allowing low-error solutions to be retrained; low-error solutions may not benefit much from retraining and can consume a lot of retraining time.
The steps of our beacon-based search are shown in Algorithm 1. If a solution is in the beacon-feasible area, we compute the distance between the solution and all existing beacons using the distance equation defined earlier. If the nearest beacon is farther away than a predefined threshold, the solution is converted into a beacon by retraining the model using its variables. After retraining, the solution is added to the beacon list, and, by definition, the nearest beacon to this solution is the solution itself. Finally, the error objective is re-evaluated using the nearest beacon's model parameters.
The threshold value is important in controlling how many beacons are used (and, therefore, the time needed to retrain models). If it is too high, it can limit the benefit of retraining in decreasing the model error. The suitable value depends on the model size (how many parameters enter the distance calculation) and the supported precisions. For the experiments below, we have a model of 8 layers. We found that a threshold of 6 resulted in one beacon, while a threshold of 5 produced three beacons. These thresholds correspond to approximately 25% and 21% of the maximum possible distance, but were found in an exploratory fashion. How to optimally set this threshold is not further explored in this paper.
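The control flow of the beacon-based evaluation can be sketched as follows. This is our reading of the steps described above, with `retrain` and `evaluate` as hypothetical callbacks standing in for the real (expensive) training and inference:

```python
def evaluate_with_beacons(solution, beacons, retrain, evaluate,
                          in_beacon_feasible_area, threshold=6):
    """One evaluation step of the beacon-based search (sketch).
    `solution` is the gene list (weight/activation precisions interleaved);
    `beacons` is a mutable list of (weight_genes, retrained_params)."""
    if not in_beacon_feasible_area(solution):
        return float("inf")  # excluded from the pool and further search
    weights = solution[::2]  # distance uses weight-precision genes only
    dists = [sum(abs(a - b) for a, b in zip(weights, bw))
             for bw, _ in beacons]
    if not beacons or min(dists) > threshold:
        params = retrain(solution)           # solution becomes a new beacon
        beacons.append((weights, params))
        nearest = params                     # nearest beacon is itself
    else:
        nearest = beacons[dists.index(min(dists))][1]
    # re-evaluate the error objective with the nearest beacon's parameters
    return evaluate(solution, nearest)
```

With 8 layers and precisions encoded 1 to 4, the maximum distance is 24, so a threshold of 6 (25%) kept the beacon count at one in our reading of the experiments.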
4.4 Hardware-aware Optimization
In this work, the hardware model is an input given to the optimization algorithm in the form of objective functions. We have selected two architectures to apply our methods to. These two architectures were selected because they support operations of varying precision, which makes quantization optimization feasible. For these two architectures, we do not have implementations of the RNN modules, so we cannot obtain empirical measurements during evaluations. Instead, we have defined objective functions in a simple way that focuses mainly on the effect of decreasing the precision of the NN operations. The hardware platform constraints and objectives serve as a proof of concept, showing how models can be compressed differently in different scenarios.
Neural network models are characterized by large memory requirements. Deploying the models in their original form results in frequent data loading from and to off-chip memory. The implementation is then memory-bound, and many studies have aimed to increase the reuse of local data and minimize off-chip memory traffic. On the other hand, the success of NN compression has made it possible to squeeze the whole NN model into the on-chip memory, turning NN applications into compute-bound applications. So, in our experiments, we use the platform SRAM size as an optimization constraint and not an objective. First, keeping the NN model size below the SRAM size achieves the ultimate goal of compression by avoiding the expensive loading of weights from off-chip memory. Second, compressing the model further is no longer beneficial from the memory point of view, though it can still benefit energy consumption and computation speedup, which are accounted for as optimization objectives. Next, we detail the equations of the energy and speedup objectives.
4.4.1 Energy Estimation
In our experiments, we use the energy estimation model developed by the Eyeriss project team Yang et al. (2016). In this model, the total consumed energy is the sum of the energy required for computation and the energy required for data movement. Since the majority of the computation in NN models consists of MAC operations, the total computation energy is the number of MAC operations multiplied by the energy cost of one MAC operation. The total data-movement energy is the cost of a one-bit transfer multiplied by the number of transferred bits. For a hardware architecture with a hierarchy of multiple memory levels, such as Eyeriss, a different energy cost is used for each level of data loading; in our case, there is only one memory level, the SRAM.
The final equation we use is as follows:

$$E = N_b \cdot E_{SRAM} + \sum_{p \in P} E^{MAC}_{p} \cdot N^{MAC}_{p} \qquad (3)$$

where $E$ is the overall energy consumed, $N_b$ is the number of bits in the model, and $E_{SRAM}$ is the energy cost of loading one bit from the SRAM. $P$ is the set of supported precisions, $E^{MAC}_{p}$ is the energy cost of one MAC operation using precision $p$, and $N^{MAC}_{p}$ is the number of MAC operations using precision $p$.
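Equation (3) translates directly into a small helper (a sketch; parameter names are ours):

```python
def total_energy(n_bits, e_sram_bit, macs_per_precision, e_mac):
    """Eyeriss-style energy estimate with a single SRAM memory level:
    data-movement energy plus MAC energy summed over the precisions.

    n_bits             -- total number of bits in the (quantized) model
    e_sram_bit         -- energy cost of loading one bit from SRAM
    macs_per_precision -- {precision: number of MAC ops at that precision}
    e_mac              -- {precision: energy cost of one MAC op}
    """
    movement = n_bits * e_sram_bit
    compute = sum(e_mac[p] * n for p, n in macs_per_precision.items())
    return movement + compute
```

The per-precision MAC energy costs for SiLago are the values reported in Table 2.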
4.4.2 Speedup Estimation
Since we adopt compute-bound implementations in this work, we rely solely on the speedup of the MAC computations as an approximation of the expected speedup from different quantization configurations. Currently, we do not have a real implementation of the model under study on the two architectures (SiLago and Bitfusion). Having an implementation would enrich the speedup equation with more details, such as the tile size, to compute the proportion of loading time to computation time. In this paper, however, the hardware model is an input: what we show is how we generate different sets of solutions, and account for retraining if required, for different hardware models and constraints. In both architectures, the highest supported precision is the 16-bit fixed point, so we define the speedup objective as the speedup over 16-bit operations. It is computed by multiplying the number of MAC operations performed at a given precision by the speedup of that precision and summing over all supported precisions, using the following formula:
$$S = \sum_{p \in P} S_p \cdot \frac{N^{MAC}_{p}}{N^{MAC}_{total}} \qquad (4)$$

where $S$ is the overall speedup and $P$ is the set of supported precisions. For example, an architecture that supports mixed precision with 4 and 8 bits has the set $P = \{(4,4), (4,8), (8,8)\}$, with $|P| = 3$; if the same architecture does not support mixed precision, $P = \{(4,4), (8,8)\}$, with $|P| = 2$. Furthermore, $S_p$ is the speedup gained using precision $p$ over the highest precision supported by the given architecture, $N^{MAC}_{p}$ is the number of MAC (Multiply-Accumulate) operations using precision $p$, and $N^{MAC}_{total}$ is the total number of MAC operations in the model.
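Equation (4) is a MAC-share-weighted sum of per-precision speedups; a sketch (parameter names are ours):

```python
def overall_speedup(macs_per_precision, speedup_per_precision, total_macs):
    """Speedup over the all-16-bit baseline: each precision's speedup is
    weighted by that precision's share of the model's MAC operations.

    macs_per_precision    -- {precision: number of MAC ops at that precision}
    speedup_per_precision -- {precision: per-MAC speedup over 16-bit}
    total_macs            -- total number of MAC ops in the model
    """
    return sum(speedup_per_precision[p] * macs_per_precision[p] / total_macs
               for p in macs_per_precision)
```

For instance, if half of the MACs run at a precision with a 7x per-operation speedup and the rest stay at 16-bit (1x), the objective evaluates to 4x.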
5 Evaluation and Experiments
This section applies the Multi-Objective Hardware-Aware Quantization (MOHAQ) method, using the NSGA-II genetic algorithm, to an SRU model trained on the TIMIT dataset for speech recognition. As mentioned in the introduction, the SRU model was selected because SRU is easy to parallelize and is faster in training and inference. Also, the PyTorch-Kaldi project provides good software support for SRU-based speech recognition models on the TIMIT dataset Ravanelli et al. (2019). TIMIT is a dataset composed of recordings of 630 different speakers of six different American English dialects, where each speaker reads up to 10 sentences Garofolo et al. (1993). In Section 5.1, we describe the model used in our experiments.
We designed three experiments to evaluate the MOHAQ method. In all three, the search produces a set of Pareto optimal solutions, where each solution specifies the precision of the weights and activations of each layer. In the first experiment, we evaluate the capabilities of post-training quantization on the SRU model without any hardware consideration. In the latter two experiments, we use two hardware models, SiLago and Bitfusion. SiLago does not support precisions below 4 bits, so the inference-only search was sufficient for the example model as the compression ratio did not exceed 8x. Bitfusion, on the other hand, supports 2-bit operations and hence high-compression-ratio solutions. This example architecture therefore gave us the opportunity to test the search method at a high compression ratio by setting the memory constraint to 2 MB (a 10.6x compression ratio). We first apply the inference-only search, and then the beacon-based search, and examine the quality of the solution set.
Layer  SRU1  Proj1  SRU2  Proj2  SRU3  Proj3  SRU4  FC  Total
Input vector size (m)  23  1100  256  1100  256  1100  256  1100   
Number of hidden cells (n)  550  256  550  256  550  256  550  1904   
MAC operations  75900  281600  844800  281600  844800  281600  844800  2094400  5549500
Element-wise operations  15400    15400    15400    15400    88000
Non-linear operations  2200    2200    2200    2200  1904  10704
Matrix weights  75900  281600  844800  281600  844800  281600  844800  2094400  5549500
Vector weights  4400    4400    4400    4400    17600
5.1 Example SRU Model
In our experiments, we use a speech recognition model from the PyTorch-Kaldi project Ravanelli et al. (2019). PyTorch-Kaldi is a project that develops hybrid speech recognition systems using state-of-the-art DNN/RNN models. PyTorch is used for the NN acoustic model, and the Kaldi toolkit is used for feature extraction, label computation, and decoding Povey et al. (2011). In our experiments, feature extraction uses logarithmic Mel-filter bank coefficients (FBANK). The labels required for acoustic model training come from a forced-alignment procedure between the context-dependent phone state sequence and the speech features. The PyTorch NN module then takes the feature vector as input and generates an output vector of posterior probabilities over the phone states. The Kaldi decoder uses this vector to compute the final Word Error Rate (WER). We used the TIMIT dataset Garofolo et al. (1993) and trained the model for 24 epochs, as set in the PyTorch-Kaldi default configuration. Figure 6a shows the model used in our experiments. The NN model is composed of 4 Bi-SRU layers with 3 projection layers in between; an FC layer follows the SRU layers, and the output is fed to a Softmax layer Ravanelli et al. (2019). In Table 4, we show the breakdown of the SRU model used in the experiments. The first two rows show the input vector size and the number of hidden cells. The middle three rows show the number of MAC, element-wise, and non-linear operations in each layer. The last two rows show the number of weights in the matrices used in the MAC operations and in the vectors used in the element-wise operations. Considering that one MAC operation is equivalent to two element-wise operations, the operations and weights not involved in matrix-to-vector multiplications amount to less than 1% of the totals. Figure 6b shows the percentage of weights required by each type of layer: SRU-gate matrices, projection-layer matrices, the FC matrix, and the SRU vectors. The total size of the model is the total size of its weights.
5.2 Multi-objective search to minimize two objectives: WER and memory size
Our first experiment performs a multi-objective search to minimize two objectives: WER_val and memory size. WER_val is the error rate evaluated on the validation set of the TIMIT dataset. No hardware model is used in this experiment, to explore the general compression of the model before any hardware platform is involved. The search space contains 4.3 billion possible solutions (4^16), as each solution has 16 variables and each variable has four possible values. The genetic algorithm used is NSGA-II. After some initial experiments, we found that 60 generations were sufficient to obtain stable solutions for all tested objectives. Each generation has ten individuals, except the initial generation, which has 40; in total, 630 solutions were evaluated during the search. During the search, solutions with a high error rate are considered infeasible. The search output is a Pareto optimal set of solutions that presents the trade-off between model size and error rate to the embedded system designer. Figure 7 plots the Pareto optimal set, and Table 5 details each solution in the set. Each row in the table corresponds to one solution; we report the precision of each layer's weights and activations, followed by the solution's WER_val, compression ratio, and testing error WER_test. The first row is the base model, which is not quantized; its testing WER is 17.2%.
Table 5 shows that the model can be compressed by 8x without any increase in error. The designer can compress the model by 12x with only a 1.5 p.p. (percentage point) increase in error, and by 15.6x with a 1.9 p.p. increase. In most solutions, 4 and 2 bits are used extensively for the weights, while the activation precision is mostly kept between 8 and 16 bits; 4-bit and 2-bit activations appear only in a few layers. It is also observed that some solutions have an error rate better than the baseline model. Quantization has been shown to have a regularization effect during training Rybalkin et al. (2018); we therefore believe the improved error results from the quantization noise reducing some of the overfitting during inference.
In Table 5, we expected WER_test to be higher than WER_val, but we also hoped that the relative order between solutions would be preserved for both WER_val and WER_test. We tried to minimize the gap between the validation and testing errors as explained in Section 4, and for all solutions in the solution set, the difference between the two errors is small. However, if we sort the solutions by WER_val, the corresponding WER_test values are not perfectly sorted: S7 and S12 appear as outliers, because S7 had a better WER_test than expected and S12 a worse one than the surrounding solutions. Still, in both cases the variation is within about 0.3 p.p., and we believe such small variations are to be expected, as there is no guarantee that two different datasets yield identical errors.
To get a better understanding of how successful our post-training quantization of the SRU model is, we compare with previous work on post-training quantization. Since researchers have found that 16-bit and 8-bit quantization do not significantly affect accuracy Chen et al. (2016); Zhao et al. (2019), we focus on 4-bit quantization (8x compression). CNN ImageNet models have been used for the quantization experiments in most of the work we have seen. The accuracy drop due to 4-bit post-training quantization on CNN models varies across these papers as follows: LAPQ Nahshan et al. (2019) (6.1 to 9.4 p.p.), ACIQ Banner et al. (2019) (0.4 to 10.8 p.p.), OCS Chen et al. (2016) (more than 5 p.p.), and ZeroQ Cai et al. (2020) (1.6 p.p., using mixed precision to reach the 8x compression ratio). Also, on RNN models for language translation, the BLEU score decreased by 1.2 Aji and Heafield (2019). By comparison, for mixed-precision post-training quantization of the SRU model, our search found solutions that use a mix of 2, 4, and 8 bits and achieve compression ratios between 8x and 9x with an error increase between 0 p.p. and 0.3 p.p.; with an error increase of 1.5 p.p., the compression ratio reaches 12x. We conclude that the error increase we obtain is lower than in most of the other studies, at higher compression ratios.

Sol.  SRU1  Proj1  SRU2  Proj2  SRU3  Proj3  SRU4  FC  WER_val  CR  WER_test
Baseline  32/32  32/32  32/32  32/32  32/32  32/32  32/32  32/32  16.2%  1x  17.2% 
S1  8/16  4/16  4/4  2/16  4/8  4/8  4/16  4/16  16.1%  8.1x  16.9% 
S2  4/16  2/8  4/4  2/16  4/8  4/8  4/16  4/8  16.4%  8.4x  17.0% 
S3  4/16  4/16  2/4  2/16  4/16  4/16  4/16  4/16  16.5%  8.9x  17.4% 
S4  4/16  2/8  2/4  2/16  4/8  4/8  4/16  4/8  16.7%  9.1x  17.5% 
S5  8/16  2/8  2/4  2/16  4/16  2/8  4/16  4/8  16.9%  9.4x  17.7% 
S6  4/16  8/16  2/4  4/16  4/16  4/16  4/16  2/16  17.2%  9.9x  18.2% 
S7  8/16  2/2  2/4  2/16  2/16  4/16  4/16  4/8  17.3%  10.0x  17.9% 
S8  8/16  2/16  2/4  2/16  4/16  4/16  4/16  2/16  17.4%  11.4x  18.2% 
S9  4/16  2/8  2/4  2/16  4/16  4/16  4/16  2/16  17.4%  11.4x  18.3% 
S10  4/16  2/16  2/4  2/16  4/8  4/16  4/16  2/8  17.6%  11.6x  18.3% 
S11  4/16  2/4  2/4  2/16  2/16  8/16  4/16  2/8  17.8%  12.0x  18.4% 
S12  4/16  2/8  2/4  2/16  2/16  8/16  4/16  2/8  17.8%  12.0x  18.7% 
S13  4/16  2/8  2/4  2/16  2/16  4/16  4/16  2/8  17.9%  13.0x  18.5% 
S14  4/16  2/8  2/4  2/16  2/16  2/16  4/16  2/8  18.0%  13.6x  18.6% 
S15  4/16  2/2  2/16  2/16  2/8  2/16  2/16  2/8  18.7%  15.6x  19.1% 
5.3 Multi-objective Quantization on the SiLago architecture
In the second experiment, we apply the Multi-Objective Hardware-Aware Quantization (MOHAQ) method to the SiLago architecture using the inference-only search. The SiLago architecture supports varying precisions between layers; however, within each layer, the weights and activations must use the same precision. Thus, a solution has 8 variables instead of the 16 used in the previous experiment. The precisions supported on SiLago are 16, 8, and 4 bits, as explained in Section 2.5.1. Since the highest precision supported on SiLago is the 16-bit fixed point, the baseline model is a full 16-bit implementation, and the speedup gained by lower precisions is computed relative to 16-bit using Equation 4. We have also evaluated the energy consumed by the MAC operation at the different precisions: Table 2 shows the per-MAC speedup of 8-bit and 4-bit operations compared to 16-bit, and the energy consumed by 16-, 8-, and 4-bit operations. To compute the expected overall energy of each solution, we use the energy model proposed by energy-aware pruning Yang et al. (2016) and described in Section 4.4.
As discussed in Section 4.4, all weights should be stored in the on-chip memory, so it is crucial to add the SRAM size as a constraint on the model size during the search. Even though we do not know the exact SRAM size, we have to set a limit. Given that the highest possible compression ratio on SiLago is 8x, which corresponds to 2.65 MB for the experimental model, we chose 6 MB (a 3.5x compression ratio) as a reasonable memory constraint, leaving room for a properly sized Pareto set.
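The memory constraint reduces to a size computation over the weight counts of Table 4. A sketch, under the assumption that the element-wise vectors stay at 16 bits (the paper does not state how vector weights are counted here):

```python
# Matrix weight counts per layer, from Table 4 (SRU1..SRU4, Proj1..Proj3, FC).
MATRIX_WEIGHTS = [75_900, 281_600, 844_800, 281_600,
                  844_800, 281_600, 844_800, 2_094_400]
VECTOR_WEIGHTS = 17_600  # element-wise vectors; 16-bit assumed below

def model_size_mb(bits_per_layer, vector_bits=16):
    """Quantized model size in MB for a per-layer weight-precision list."""
    total_bits = sum(w * b for w, b in zip(MATRIX_WEIGHTS, bits_per_layer))
    total_bits += VECTOR_WEIGHTS * vector_bits
    return total_bits / 8 / 2**20

def fits_sram(bits_per_layer, sram_mb=6.0):
    """SRAM constraint used for the SiLago experiment."""
    return model_size_mb(bits_per_layer) <= sram_mb
```

An all-4-bit assignment lands around 2.7 MB, comfortably inside the 6 MB budget, while an all-16-bit assignment does not fit.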
So, in this case, the multi-objective search has three objectives: WER, speedup, and energy consumption. WER and energy consumption are objectives to be minimized, while speedup is to be maximized; since the GA minimizes all objectives, we use the negative of the speedup as the objective.
Sol.  SRU1  Proj1  SRU2  Proj2  SRU3  Proj3  SRU4  FC  WER_val  CR  Speedup  Energy  WER_test
Base  32/32  32/32  32/32  32/32  32/32  32/32  32/32  32/32  16.2%  1x      17.2% 
16/16  16/16  16/16  16/16  16/16  16/16  16/16  16/16  16.2%  2x  1x  16.4 µJ  17.2%  
S1  16/16  4/4  8/8  8/8  4/4  16/16  4/4  8/8  16.2%  4.5x  2.6x  5.8 µJ  17.1% 
S2  16/16  4/4  4/4  8/8  4/4  16/16  4/4  8/8  16.3%  4.9x  2.9x  5.2 µJ  17.2% 
S3  8/8  4/4  4/4  4/4  4/4  4/4  4/4  8/8  16.8%  5.7x  3.2x  4.2 µJ  17.4% 
S4  4/4  4/4  4/4  4/4  4/4  4/4  4/4  8/8  17.3%  5.8x  3.2x  4.1 µJ  17.7% 
S5  8/8  8/8  4/4  4/4  8/8  4/4  4/4  4/4  18.6%  6.6x  3.5x  3.5 µJ  19.4% 
S6  8/8  8/8  4/4  16/16  4/4  4/4  4/4  4/4  18.6%  6.6x  3.7x  3.6 µJ  19.0% 
S7  4/4  4/4  4/4  4/4  4/4  4/4  4/4  4/4  19.0%  8x  3.9x  2.6 µJ  19.8% 
Since the search space is smaller than in experiment one, we only need to run the search for 15 generations. Each generation has ten individual solutions, except the first, which has 40. The whole search therefore evaluates 180 solutions out of 6561 possible solutions (3^8). Solutions with a high error rate were considered infeasible. Figure 8 shows the Pareto optimal set, and Table 6 details each solution and its testing WER. To judge the speedup and energy-consumption quality, we compare the solutions against the best-performing solution possible on SiLago, which uses 4-bit for all layers. This solution reaches a 3.9x speedup and the lowest energy consumption, 2.6 µJ (a 6.3x improvement over the base solution). The search found solutions that achieve 74% of the maximum speedup and 51% of the maximum energy saving without any increase in error. If the designer accepts a 0.5 p.p. (percentage point) increase in error, 81% of the maximum speedup and 64% of the maximum energy saving can be achieved. Reaching the maximum possible performance comes with a 2.6 p.p. increase in error.
5.4 Multi-objective Quantization on the Bitfusion architecture
In the third experiment, we apply the Multi-Objective Hardware-Aware Quantization (MOHAQ) method to the Bitfusion architecture using two objectives: the WER and the speedup. We first apply the inference-only search and then the beacon-based search to enhance the solution set. The Bitfusion architecture is introduced in Section 2.5.2. We use Equation 4 to compute the expected speedup of different solutions. The genetic algorithm used is NSGA-II, run for 60 generations as in experiment 1. Each generation has ten individuals, except the initial generation, which has 40; the whole search evaluates 630 solutions out of 4.3 billion possible solutions (4^16). We consider high-error solutions (more than 24%) infeasible to limit the search to low-error solutions only. In the speedup equation, we assume that all weights have to be in the SRAM and that the application is compute-bound; memory size therefore has to be a constraint in the search. In this experiment, we constrain the memory size to less than 2 MB, equivalent to 9.4% of the original model size, which lets the search reach high-error solutions and thus exercise the beacon-based search in the next step.
Figure 9 shows the Pareto optimal set, and Table 7 shows the detailed solutions. Since there are 26 solutions in the Pareto set, we omit some solutions' details from the table. In this case, we see solutions with error rates that may be unacceptably high. Thus, we applied the beacon-based search introduced in Section 4.3, using a distance threshold of 6 between candidate solutions and beacons; this threshold is reasonable compared to the number of layers (8). By the end of the search, one beacon had been created. We repeated the experiment with a lower threshold, and with manually selected beacons to increase their number, and obtained a similar set of solutions; thus, one beacon is enough for our model. For deeper models, more beacons would be required.
Figure 10 and Table 8 depict the newly generated Pareto optimal set with solution details. As in Table 7, we omit some solution details to keep the table relatively concise. Comparing the two Pareto sets, the testing error in the first set reached 24.2% to achieve a 40.7x speedup, whereas the new set reaches the same speedup level with a testing error of 20%. The new set also contains solutions with higher speedups, up to 47.1x at a testing error of 20.7%. One drawback of retraining is that the validation data is reused during retraining, so the model gains more knowledge about it. Thus, for some solutions, the gap between the validation error and the testing error might be higher than in the inference-only search.
Sol. | L1 | L2 | L3 | L4 | L5 | L6 | L7 | FC | Val. WER | CR | Speedup | Test WER
Base | 32/32 | 32/32 | 32/32 | 32/32 | 32/32 | 32/32 | 32/32 | 32/32 | 16.2% | 1x | – | 17.2%
16-bit | 16/16 | 16/16 | 16/16 | 16/16 | 16/16 | 16/16 | 16/16 | 16/16 | 16.2% | 2x | 1x | 17.2%
S1 | 8/16 | 2/2 | 2/16 | 4/8 | 4/8 | 4/16 | 4/4 | 2/8 | 17.4% | 11.6x | 14.6x | 18.1%
S2 | 4/16 | 2/2 | 2/16 | 4/8 | 4/8 | 4/16 | 4/4 | 2/8 | 17.6% | 11.6x | 14.6x | 18.4%
S3–S9 | – | – | – | – | – | – | – | – | – | – | – | –
S10 | 8/16 | 2/2 | 2/2 | 2/4 | 4/8 | 2/8 | 4/2 | 2/8 | 18.9% | 13.6x | 27.2x | 19.4%
S11–S13 | – | – | – | – | – | – | – | – | – | – | – | –
S14 | 4/16 | 2/2 | 2/2 | 2/8 | 2/4 | 2/8 | 4/2 | 2/8 | 19.7% | 13.6x | 30.0x | 20.2%
S15–S17 | – | – | – | – | – | – | – | – | – | – | – | –
S18 | 4/16 | 2/2 | 2/2 | 2/4 | 2/2 | 2/16 | 4/2 | 2/8 | 20.6% | 13.6x | 35.2x | 21.3%
S19 | 4/8 | 2/2 | 2/2 | 2/4 | 2/2 | 2/16 | 4/2 | 2/8 | 20.8% | 13.3x | 35.2x | 21.5%
S20–S21 | – | – | – | – | – | – | – | – | – | – | – | –
S22 | 4/16 | 2/2 | 2/2 | 2/4 | 4/8 | 2/8 | 2/2 | 2/4 | 22.9% | 13.3x | 37.9x | 23.1%
S23–S24 | – | – | – | – | – | – | – | – | – | – | – | –
S25 | 4/16 | 2/2 | 2/2 | 2/2 | 4/8 | 2/8 | 2/2 | 2/4 | 23.5% | 13.6x | 39.5x | 24.0%
S26 | 8/16 | 2/2 | 2/2 | 2/2 | 4/4 | 2/8 | 2/2 | 2/4 | 23.7% | 13.3x | 40.7x | 24.2%
Sol. | L1 | L2 | L3 | L4 | L5 | L6 | L7 | FC | Val. WER | CR | Speedup | Test WER
Base | 32/32 | 32/32 | 32/32 | 32/32 | 32/32 | 32/32 | 32/32 | 32/32 | 16.2% | 1x | – | 17.2%
16-bit | 16/16 | 16/16 | 16/16 | 16/16 | 16/16 | 16/16 | 16/16 | 16/16 | 16.2% | 2x | 1x | 17.2%
S1 | 8/16 | 4/2 | 4/8 | 2/4 | 4/16 | 2/16 | 2/2 | 2/8 | 17.1% | 11.4x | 21.0x | 18.1%
S2 | 8/8 | 4/2 | 2/8 | 4/4 | 4/16 | 2/16 | 2/2 | 2/8 | 17.3% | 12.3x | 21.4x | 18.5%
S3–S7 | – | – | – | – | – | – | – | – | – | – | – | –
S8 | 16/8 | 8/4 | 2/2 | 2/4 | 4/4 | 2/16 | 2/2 | 2/4 | 18.2% | 11.3x | 35.9x | 19.3%
S9–S13 | – | – | – | – | – | – | – | – | – | – | – | –
S14 | 16/8 | 4/2 | 2/2 | 2/2 | 4/4 | 2/16 | 2/2 | 2/4 | 19.0% | 12.2x | 38.7x | 19.5%
S15 | 8/8 | 2/4 | 2/2 | 2/4 | 2/4 | 2/4 | 2/2 | 2/4 | 19.1% | 15.2x | 40.7x | 20.0%
S16–S18 | – | – | – | – | – | – | – | – | – | – | – | –
S19 | 4/16 | 2/4 | 2/2 | 2/4 | 2/2 | 2/4 | 2/2 | 2/4 | 20.2% | 15.6x | 45.5x | 21.2%
S20 | 4/16 | 2/2 | 2/2 | 2/4 | 2/2 | 2/4 | 2/2 | 2/4 | 20.4% | 15.6x | 47.1x | 20.7%
6 Discussion and Limitations
The main focus of this paper is to open up a research direction for hardware-aware multi-objective compression of neural networks. There exist many network models and a large number of compression techniques. In this work, we focus principally on quantization and on its application to SRU models, for several reasons. Quantization is a vital compression method that can be used alone or combined with other compression methods; therefore, we consider enabling MOOP on quantization to be of great benefit. Also, the SRU is a promising recurrent layer that allows powerful hardware parallelization. One reason for using the SRU is to examine to what extent it can be quantized, as this remains under-investigated. Since the SRU can be considered an optimized version of the LSTM, we wanted to investigate the effect of quantization on it. Our experiments showed that by excluding the recurrent vectors and biases from quantization, the SRU can be quantized to high compression ratios without a harsh effect on the model error rate. Thus, the SRU combines the benefits of high parallelization speedup and strong model-size reduction. The second reason is that running experiments using the SRU is much faster than with any other RNN model, which gave us a better opportunity to run multiple trials to explore our methodology.
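The exclusion rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not our actual quantizer: the parameter names (`"bias"`, `"recurrent"`) are hypothetical, and symmetric uniform rounding is used as a stand-in for whatever post-training scheme a given deployment applies.

```python
def quantize_symmetric(values, num_bits):
    """Symmetric uniform post-training quantization.

    Maps floats onto integers in [-(2**(b-1) - 1), 2**(b-1) - 1] using a
    single scale derived from the largest magnitude, then de-quantizes.
    """
    qmax = 2 ** (num_bits - 1) - 1
    scale = (max(abs(v) for v in values) / qmax) or 1.0  # guard all-zero input
    return [round(v / scale) * scale for v in values]

def quantize_model(params, bits_per_layer):
    """Quantize weight matrices only; recurrent vectors and biases stay
    in full precision, mirroring the exclusion we found necessary for
    the SRU (name-based filtering here is illustrative)."""
    out = {}
    for name, values in params.items():
        if "bias" in name or "recurrent" in name:
            out[name] = values  # excluded from quantization
        else:
            out[name] = quantize_symmetric(values, bits_per_layer[name])
    return out

params = {
    "rnn.weight": [0.52, -1.0, 0.25, 0.9],
    "rnn.recurrent_vector": [0.3, -0.7],
    "rnn.bias": [0.01, -0.02],
}
quantized = quantize_model(params, {"rnn.weight": 4})
```

Since the recurrent vectors and biases are small relative to the weight matrices, keeping them in full precision costs little memory while avoiding the error blow-up that quantizing them would cause.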
In this article, we further claim that the automation of hardware-aware compression is essential to meet changes in the application and hardware architecture. To support this claim and show that our proposed method is hardware-agnostic, we have applied our search method to two different hardware architectures, SiLago and Bitfusion. These architectures were selected particularly because of the varying precisions they support. In Sections 5.3 and 5.4, we have shown two different sets of solutions. These findings show that compression can be done in different ways depending on the target platform. The differences between the speedup values in Tables 6 and 8 for SiLago and Bitfusion might imply that Bitfusion is faster than SiLago. This difference in speed is not what this work investigates, since we compare the optimized solutions on a given architecture to the baseline running on the same architecture.
For our method to be entirely generic, it needs to support variations in the NN model, the compression method, and the hardware platform. In this work, we have applied the method to two different architectures: a CGRA (SiLago) and a systolic array architecture (Bitfusion). Concerning variations in the NN model, we have applied our method to one model, but post-training quantization has been applied successfully to several models in the literature. Since our method mainly relies on the success of post-training quantization, we believe it is generic enough to be applied to many NN models. However, the beacon-based search needs to be applied to more models of varying depths to investigate whether a generic equation for the threshold selection can be defined. The aspect that needs more work is variation in the compression method. The possibility of applying post-training versions of other compression methods should be investigated. Also, the beacon-based search needs to be examined on other compression techniques to see whether it can be directly applied, modified, or replaced by another method that captures the effect of retraining on different compression configurations within a reasonable time.
7 Conclusion and Future Work
Compression of neural network applications contributes significantly to their efficient realization on edge devices. The compression of a model can be customized by involving the hardware model and application constraints to reinforce the compression benefits. As a result of this customization, automating the compression has become necessary to meet variations in the hardware and the application constraints. Thus, the selection of the compression configuration, such as per-layer pruning percentages or bit-widths, is treated as an optimization problem.
This article proposes the Multi-Objective Hardware-Aware Quantization (MOHAQ) method and applies it to a Simple Recurrent Unit (SRU)-based RNN model for speech recognition. In our approach, both hardware efficiency and the error rate are treated as objectives during quantization, and the designer has the freedom to choose between the resulting Pareto alternatives. We relied on post-training quantization to enable the evaluation of candidate solutions during the search within a feasible time (inference-only search). We then reduce the quantization error by retraining, using a novel method called beacon-based search, which uses a few retrained models to guide the search instead of retraining the model for every evaluated solution.
We have shown that the SRU unit, as an optimized version of the LSTM, can be quantized post-training with a negligible or small increase in the error rate. We have also applied the multi-objective search to quantize the SRU model to run on two architectures, SiLago and Bitfusion, and found a different solution set for each platform to meet its constraints. On SiLago, using inference-only search, we found a set of solutions that, with error-rate increases ranging from 0 to 2.6 percentage points, achieve from a high percentage up to the full maximum possible performance. On Bitfusion, assuming a small SRAM size, we have shown the search results using both inference-only search and beacon-based search, and how the beacon-based search decreases the error and finds better-performing solutions with lower error rates. The beacon-based search matched the highest speedup found by the inference-only search with a 4.2 percentage point lower error rate. Also, the highest speedup achieved by the beacon-based search is 47.1x, compared to the 40.7x achieved by the inference-only search, and at a lower error rate.
This work introduces an approach to multi-objective search in which retraining is taken into account; we have shown how to consider the retraining effect for quantization. Next, we want to apply this to other compression techniques such as DeltaRNN and structured matrices. We also want to investigate how to apply our method to a hybrid of compression techniques rather than only one. In addition, we want to run experiments on more hybrid NN models, such as models that combine convolutional and recurrent layers.
8 Acknowledgements
This research is part of the CERES research program funded by the ELLIIT strategic research initiative funded by the Swedish government, and by the Vinnova FFI project SHARPEN, under grant agreement no. 2018-05001.
The authors would also like to acknowledge the contributions of Tiago Fernandes Cortinhal in setting up the Python libraries and of Yu Yang in the thoughtful discussions about the SiLago architecture.
References

Aji and Heafield (2019)
Aji, A. F., Heafield, K., 2019. Neural machine translation with 4-bit precision and beyond. CoRR abs/1909.06091. URL: http://arxiv.org/abs/1909.06091

Ankit et al. (2019)
Ankit, A., Hajj, I. E., Chalamalasetti, S. R., Ndu, G., Foltin, M., Williams, R. S., Faraboschi, P., Hwu, W., Strachan, J. P., Roy, K., Milojicic, D. S., 2019. PUMA: A programmable ultra-efficient memristor-based accelerator for machine learning inference. CoRR abs/1901.10351. URL: http://arxiv.org/abs/1901.10351

Baccelli et al. (2020)
Baccelli, G., Stathis, D., Hemani, A., Martina, M., 2020. NACU: A Non-Linear Arithmetic Unit for Neural Networks. In: Design Automation Conference (DAC). IEEE, pp. 1–6. DOI: 10.1109/DAC18072.2020.9218549

Banner et al. (2019)
Banner, R., Nahshan, Y., Soudry, D., 2019. Post training 4-bit quantization of convolutional networks for rapid-deployment. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (Eds.), Advances in Neural Information Processing Systems 32. Curran Associates, Inc., pp. 7950–7958.
Blank and Deb (2020)
Blank, J., Deb, K., 2020. Pymoo: Multi-objective optimization in Python. IEEE Access 8, 89497–89509.

Cai et al. (2020)
Cai, Y., Yao, Z., Dong, Z., Gholami, A., Mahoney, M. W., Keutzer, K., 2020. ZeroQ: A novel zero shot quantization framework.

Chahar et al. (2021)
Chahar, V., Katoch, S., Chauhan, S., 2021. A review on genetic algorithm: Past, present, and future. Multimedia Tools and Applications 80. DOI: 10.1007/s11042-020-10139-6

Chen et al. (2016)
Chen, T., Goodfellow, I., Shlens, J., 2016. Net2Net: Accelerating learning via knowledge transfer. In: International Conference on Learning Representations. URL: http://arxiv.org/abs/1511.05641

Chen et al. (2017)
Chen, Y.-H., Krishna, T., Emer, J. S., Sze, V., 2017. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits 52 (1), 127–138. DOI: 10.1109/JSSC.2016.2616357

Chong and Zak (2013)
Chong, E. K. P., Zak, S. H., 2013. An Introduction to Optimization, Fourth Edition. John Wiley and Sons, Ltd.

Courbariaux et al. (2015)
Courbariaux, M., Bengio, Y., David, J., 2015. BinaryConnect: Training deep neural networks with binary weights during propagations. CoRR abs/1511.00363.

Dai et al. (2017)
Dai, X., Yin, H., Jha, N. K., 2017. NeST: A neural network synthesis tool based on a grow-and-prune paradigm. CoRR abs/1711.02017.

Deb et al. (2002)
Deb, K., Pratap, A., Agarwal, S., Meyarivan, T., 2002. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6 (2), 182–197. DOI: 10.1109/4235.996017

Donahue et al. (2014)
Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T., 2014. Long-term recurrent convolutional networks for visual recognition and description. CoRR abs/1411.4389.

Fang et al. (2020)
Fang, J., Shafiee, A., Abdel-Aziz, H., Thorsley, D., Georgiadis, G., Hassoun, J., 2020. Post-training piecewise linear quantization for deep neural networks. CoRR abs/2002.00104. URL: https://arxiv.org/abs/2002.00104

Farahini et al. (2014)
Farahini, N., Hemani, A., Sohofi, H., Jafri, S. M., Tajammul, M. A., Paul, K., 2014. Parallel distributed scalable runtime address generation scheme for a coarse grain reconfigurable computation and storage fabric. Microprocessors and Microsystems 38 (8), 788–802. DOI: 10.1016/j.micpro.2014.05.009

Garofolo et al. (1993)
Garofolo, J., Lamel, L., Fisher, W., Fiscus, J., Pallett, D., 1993. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon Technical Report N 93, 27403.

Google (2021)
Google, 2021. Edge TPU. https://cloud.google.com/edge-tpu/. [Accessed: October 2021]

Hemani et al. (2017)
Hemani, A., Farahini, N., Jafri, S. M. A. H., Sohofi, H., Li, S., Paul, K., 2017. The SiLago Solution: Architecture and Design Methods for a Heterogeneous Dark Silicon Aware Coarse Grain Reconfigurable Fabric. Springer International Publishing, Cham, pp. 47–94. DOI: 10.1007/978-3-319-31596-6_3

Hochreiter and Schmidhuber (1997)
Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural Computation 9 (8), 1735–1780. DOI: 10.1162/neco.1997.9.8.1735

Horn et al. (1994)
Horn, J., Nafpliotis, N., Goldberg, D., 1994. A niched Pareto genetic algorithm for multiobjective optimization. In: Proceedings of the First IEEE Conference on Evolutionary Computation. IEEE World Congress on Computational Intelligence. pp. 82–87 vol. 1. DOI: 10.1109/ICEC.1994.350037

Jafri et al. (2017)
Jafri, S. M. A. H., Hemani, A., Paul, K., Abbas, N., 2017. MOCHA: Morphable Locality and Compression Aware Architecture for Convolutional Neural Networks. In: International Parallel and Distributed Processing Symposium (IPDPS). pp. 276–286. DOI: 10.1109/IPDPS.2017.59

Judd et al. (2016)
Judd, P., Albericio, J., Hetherington, T., Aamodt, T. M., Moshovos, A., 2016. Stripes: Bit-serial deep neural network computing. In: 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). pp. 1–12. DOI: 10.1109/MICRO.2016.7783722

Lee et al. (2019)
Lee, J., Kim, C., Kang, S., Shin, D., Kim, S., Yoo, H.-J., 2019. UNPU: An energy-efficient deep neural network accelerator with fully variable weight bit precision. IEEE Journal of Solid-State Circuits 54 (1), 173–185. DOI: 10.1109/JSSC.2018.2865489

Lei et al. (2018)
Lei, T., Zhang, Y., Wang, S. I., Dai, H., Artzi, Y., 2018. Simple recurrent units for highly parallelizable recurrence. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, pp. 4470–4481. DOI: 10.18653/v1/D18-1477

Li and Shen (2017)
Li, J., Shen, Y., 2017. Image describing based on bidirectional LSTM and improved sequence sampling. In: 2017 IEEE 2nd International Conference on Big Data Analysis (ICBDA). pp. 735–739. DOI: 10.1109/ICBDA.2017.8078733

Nagel et al. (2020)
Nagel, M., Amjad, R. A., van Baalen, M., Louizos, C., Blankevoort, T., 2020. Up or down? Adaptive rounding for post-training quantization.

Nahshan et al. (2019)
Nahshan, Y., Chmiel, B., Baskin, C., Zheltonozhskii, E., Banner, R., Bronstein, A. M., Mendelson, A., 2019. Loss aware post-training quantization.

Povey et al. (2011)
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., Vesely, K., 2011. The Kaldi speech recognition toolkit. In: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society. IEEE Catalog No.: CFP11SRW-USB.

PYMOO (2021)
PYMOO, 2021. NSGA-II: Non-dominated Sorting Genetic Algorithm. https://pymoo.org/algorithms/moo/nsga2.html. [Accessed: October 2021]

Ravanelli et al. (2019)
Ravanelli, M., Parcollet, T., Bengio, Y., 2019. The PyTorch-Kaldi speech recognition toolkit.

Rezk et al. (2014)
Rezk, N. M., Alkabani, Y., Bedor, H., Hammad, S., 2014. A distributed genetic algorithm for swarm robots obstacle avoidance. In: 2014 9th International Conference on Computer Engineering Systems (ICCES). pp. 170–174. DOI: 10.1109/ICCES.2014.7030951

Rezk et al. (2020)
Rezk, N. M., Purnaprajna, M., Nordström, T., Ul-Abdin, Z., 2020. Recurrent neural networks: An embedded computing perspective. IEEE Access 8, 57967–57996. DOI: 10.1109/ACCESS.2020.2982416

Rizakis et al. (2018)
Rizakis, M., Venieris, S. I., Kouris, A., Bouganis, C., 2018. Approximate FPGA-based LSTMs under computation time constraints. CoRR abs/1801.02190. URL: http://arxiv.org/abs/1801.02190

Rybalkin et al. (2018)
Rybalkin, V., Pappalardo, A., Ghaffar, M. M., Gambardella, G., Wehn, N., Blott, M., 2018. FINN-L: Library extensions and design trade-off analysis for variable precision LSTM networks on FPGAs. CoRR abs/1807.04093.

Sak et al. (2014)
Sak, H., Senior, A., Beaufays, F., 2014. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In: Fifteenth Annual Conference of the International Speech Communication Association.

Savic (2002)
Savic, D., 2002. Single-objective vs. multiobjective optimisation for integrated decision support. Proceedings of the First Biennial Meeting of the International Environmental Modelling and Software Society 1, 7–12.

Shami and Hemani (2009)
Shami, M. A., Hemani, A., 2009. Partially reconfigurable interconnection network for dynamically reprogrammable resource array. In: International Conference on ASIC. IEEE, pp. 122–125. DOI: 10.1109/ASICON.2009.5351593

Shangguan et al. (2019)
Shangguan, Y., Li, J., Qiao, L., Alvarez, R., McGraw, I., 2019. Optimizing speech recognition for the edge. CoRR abs/1909.12408. URL: http://arxiv.org/abs/1909.12408

Sharma et al. (2017)
Sharma, H., Park, J., Suda, N., Lai, L., Chau, B., Kim, J. K., Chandra, V., Esmaeilzadeh, H., 2017. Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural networks. CoRR abs/1712.01507. URL: http://arxiv.org/abs/1712.01507

Srinivas and Deb (2000)
Srinivas, N., Deb, K., 2000. Multiobjective function optimization using nondominated sorting genetic algorithms 2.

Sung et al. (2015)
Sung, W., Shin, S., Hwang, K., 2015. Resiliency of deep neural networks under quantization. CoRR abs/1511.06488. URL: http://arxiv.org/abs/1511.06488

Tajammul et al. (2016)
Tajammul, M. A., Jafri, S. M., Hemani, A., Ellervee, P., 2016. TransMem: A memory architecture to support dynamic remapping and parallelism in low power high performance CGRAs. In: International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS). IEEE, pp. 92–99. DOI: 10.1109/PATMOS.2016.7833431

Tajammul et al. (2013)
Tajammul, M. A., Jafri, S. M. A. H., Hemani, A., Plosila, J., Tenhunen, H., 2013. Private configuration environments (PCE) for efficient reconfiguration, in CGRAs. In: International Conference on Application-Specific Systems, Architectures and Processors. IEEE, pp. 227–236. DOI: 10.1109/ASAP.2013.6567579

Tan and Zheng (2013)
Tan, Y., Zheng, Z.-Y., 2013. Research advance in swarm robotics. Defence Technology 9 (1), 18–39. URL: https://www.sciencedirect.com/science/article/pii/S221491471300024X. DOI: 10.1016/j.dt.2013.03.001

Tiwari et al. (2008)
Tiwari, H. D., Gankhuyag, G., Kim, C. M., Cho, Y. B., 2008. Multiplier design based on ancient Indian Vedic Mathematics. In: International SoC Design Conference. Vol. 2. IEEE, pp. II-65–II-68. DOI: 10.1109/SOCDC.2008.4815685

Vukotic et al. (2016)
Vukotic, V., Raymond, C., Gravier, G., 2016. A step beyond local observations with a dialog aware bidirectional GRU network for spoken language understanding. In: Interspeech. San Francisco, United States.

Wang et al. (2019)
Wang, K., Liu, Z., Lin, Y., Lin, J., Han, S., 2019. HAQ: Hardware-aware automated quantization with mixed precision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Wang et al. (2018)
Wang, S., Li, Z., Ding, C., Yuan, B., Wang, Y., Qiu, Q., Liang, Y., 2018. C-LSTM: Enabling efficient LSTM using structured compression techniques on FPGAs. CoRR abs/1803.06305.

Yang et al. (2018a)
Yang, H., Zhu, Y., Liu, J., 2018a. End-to-end learning of energy-constrained deep neural networks. CoRR abs/1806.04321. URL: http://arxiv.org/abs/1806.04321

Yang et al. (2016)
Yang, T., Chen, Y., Sze, V., 2016. Designing energy-efficient convolutional neural networks using energy-aware pruning. CoRR abs/1611.05128. URL: http://arxiv.org/abs/1611.05128

Yang et al. (2018b)
Yang, T., Howard, A. G., Chen, B., Zhang, X., Go, A., Sze, V., Adam, H., 2018b. NetAdapt: Platform-aware neural network adaptation for mobile applications. CoRR abs/1804.03230. URL: http://arxiv.org/abs/1804.03230

Yao et al. (2017)
Yao, S., Zhao, Y., Zhang, A., Su, L., Abdelzaher, T. F., 2017. Compressing deep neural network structures for sensing systems with a compressor-critic framework. CoRR abs/1706.01215. URL: http://arxiv.org/abs/1706.01215

Yusoff et al. (2011)
Yusoff, Y., Ngadiman, M. S., Zain, A. M., 2011. Overview of NSGA-II for optimizing machining process parameters. Procedia Engineering 15, 3978–3983. CEIS 2011. DOI: 10.1016/j.proeng.2011.08.745
Zhao et al. (2019)
Zhao, R., Hu, Y., Dotzel, J., Sa, C. D., Zhang, Z., 2019. Improving neural network quantization without retraining using outlier channel splitting. CoRR abs/1901.09504. URL: http://arxiv.org/abs/1901.09504

Zhu and Gupta (2018)
Zhu, M., Gupta, S., 2018. To prune, or not to prune: Exploring the efficacy of pruning for model compression. In: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 – May 3, 2018, Workshop Track Proceedings. URL: https://openreview.net/forum?id=Sy1iIDkPM