I Introduction
Recently, a great many studies have proposed low-complexity, high-performance neural network (NN) training and inference methods, along with their hardware architectures. The main objective is to reduce the huge computation and memory storage/access required by popular modern NN models. Research teams from academia and industry alike have presented numerous efficient acceleration solutions for convolutional NN (CNN) training and inference based on FPGAs, ASICs, and GPUs. In these methods, an optimal NN model is usually first trained with single-precision floating-point (FP32) or half-precision floating-point (FP16, bfloat16) arithmetic, and then techniques such as pruning [1], quantization, and compression are applied. Among the quantization-based precision reduction methods, recent works have mostly focused on reducing NN inference complexity. For example, in [2], an FP32-trained NN model was quantized into a model with fixed-point weights that achieves inference accuracy similar to that of the original FP32 model.
In contrast to the many works that focus on reduced-precision NN inference, several other studies have addressed reduced-precision NN training. For instance, MPT [3] trains CNNs using FP16 arithmetic and additionally proposes a loss-scaling method to preserve small gradients. In [4], the precision of weights and activations was scaled down to 4 bits with a small loss of accuracy. DoReFa-Net [5] trained NNs with 1-bit weights and 2-bit activations while reducing the gradient precision to 6 bits. FloatSD [6] trained CNNs using special floating-point weights with only two nonzero digits, along with 8-bit activations and gradients, and achieved very small degradation in the accuracy of the trained NN models. Finally, [7] trains CNNs with 8-bit floating-point numbers (FP8) and proposes chunk-based accumulation and floating-point stochastic rounding to reduce the arithmetic precision for additions from 32 bits to 16 bits.
All of the above methods face certain problems: most were demonstrated on only a few selected models, specifically CNNs; some suffer significant performance degradation; and some may require FP32 accumulation.
The main objective of this paper is to extend the application of reduced-precision methods to recurrent NNs (RNNs) while mitigating the aforementioned problems. FloatSD is a reduced-precision CNN training method that has been demonstrated to provide efficient training for CNNs as large as ResNet-50 [6]. In this work, we chose long short-term memory (LSTM) networks [8] as the target for low-complexity training by FloatSD. Unlike CNNs, LSTMs do not adopt convolutional layers and require only fully-connected (FC) layers consisting of simple matrix multiplications. In addition, the vanishing and exploding gradient problems make LSTMs very sensitive to quantization errors. These two features make the reduced-precision training of LSTMs more challenging than that of CNNs.
We first modified the training equations and weight update procedure of LSTM so that they become compatible with the FloatSD method. Then, a suite of LSTM models and datasets was used to demonstrate the applicability of the FloatSD technique to efficient LSTM training with no degradation in trained model accuracy. Finally, we also designed and validated a FloatSD processing element circuit suitable for low-power LSTM training acceleration.
To sum up, the contributions of this paper are:

Adaptation of the FloatSD method for LSTM training.

Modification of the LSTM equations for lower computation complexity.

Precision settings for input layers, output layers, and master copy weights.

A low-complexity LSTM training scheme integrating the above techniques.

Efficient LSTM inference acceleration hardware.
II Background
II-A Long Short-Term Memory
LSTM is one of the most popular types of RNNs. It performs very well on sequence-related tasks and is widely used in areas such as speech recognition and natural language processing (NLP). Like traditional RNNs, an LSTM takes a sequence input $x = (x_1, \dots, x_T)$ and generates a sequence output $h = (h_1, \dots, h_T)$, which is computed by the following equations:
$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \tag{1}$$
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \tag{2}$$
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \tag{3}$$
$$g_t = \tanh(W_g [h_{t-1}, x_t] + b_g) \tag{4}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t \tag{5}$$
$$h_t = o_t \odot \tanh(c_t) \tag{6}$$
where the forget gate $f_t$ selects the information to be removed from the neuron's memory at time step $t$; the input gate $i_t$ selects the information to be written into the neuron's memory at time step $t$; the output gate $o_t$ controls the amount of information stored in the cell state that contributes to the output; the cell gate $g_t$ represents the new information at time step $t$; the cell state $c_t$ represents the content of the neuron's memory; and $h_t$ represents the output of the cell at time step $t$. It is also fed back to the neuron at time step $t+1$. $W$ and $b$ represent the weights and biases, respectively. Finally, $\sigma$ and $\odot$ denote the sigmoid function and the element-wise product, respectively. The architecture of an LSTM neuron is shown in Fig. 1.
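The recurrence in (1)-(6) can be sketched in NumPy as follows. This is a minimal full-precision illustration rather than the implementation used in this work; the function name `lstm_step`, the concatenated `[h, x]` weight layout, and the toy sizes are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step; each W[k] has shape (hidden, hidden + input)."""
    z = np.concatenate([h_prev, x_t])      # previous output and current input
    f = sigmoid(W["f"] @ z + b["f"])       # forget gate, Eq. (1)
    i = sigmoid(W["i"] @ z + b["i"])       # input gate, Eq. (2)
    o = sigmoid(W["o"] @ z + b["o"])       # output gate, Eq. (3)
    g = np.tanh(W["g"] @ z + b["g"])       # cell gate, Eq. (4)
    c = f * c_prev + i * g                 # cell state, Eq. (5)
    h = o * np.tanh(c)                     # cell output, Eq. (6)
    return h, c

# Toy run over a random 5-step sequence (sizes are arbitrary).
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = {k: rng.normal(scale=0.1, size=(n_hid, n_hid + n_in)) for k in "fiog"}
b = {k: np.zeros(n_hid) for k in "fiog"}
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x_t, h, c, W, b)
```

The four matrix products feed (1)-(4), while (5) and (6) contain only element-wise products; this split is what the later quantization steps exploit.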
Despite its excellent performance in certain areas, the complex neuron processing makes LSTM more complicated than CNNs and traditional RNNs, requiring substantially more memory and computation. In addition, LSTM layers are fully connected, so training may demand a large memory I/O bandwidth.
Table I: Possible values of a three-digit SD group
Positive  +4, +2, +1
Zero  0
Negative  -1, -2, -4
II-B FloatSD Number Representation
Floating-point signed digit (FloatSD) [6] was designed for weight representation in low-complexity CNN training and inference. The structure of the FloatSD representation is shown in Fig. 2. It consists of several signed-digit (SD) groups and an exponent field. It is known that the complexity of multiplying two numbers depends on the number of nonzero digits in the multiplier. To achieve low multiplication complexity, each SD group allows no more than one nonzero digit. For a $K$-digit SD group, there are $2K+1$ possible values. For example, in the three-digit case, each group assumes one of seven possible values: $+4$, $+2$, $+1$, $0$, $-1$, $-2$, $-4$, as shown in Table I. Of these $2K+1$ values, the zero value contributes $K$ zero digits and each of the other $2K$ values contributes $K-1$. That is, assuming all group values are equally likely, the probability of a digit in an SD number with $K$-digit groups being 0 is
$$P_0 = \frac{K + 2K(K-1)}{K(2K+1)} = \frac{2K-1}{2K+1}.$$
In the case of $K = 3$, this probability is $5/7 \approx 71.4\%$, which is even higher than that of the canonical signed digit (CSD) representation (about $2/3$). This leads to the relatively lower multiplication complexity of the FloatSD number representation.
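The zero-digit probability can be checked by brute-force enumeration. The sketch below assumes that every valid group value is equally likely; the helper name `zero_digit_fraction` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def zero_digit_fraction(K):
    """Count valid K-digit SD groups and the fraction of their digits that are zero."""
    groups = [d for d in product((-1, 0, 1), repeat=K)
              if sum(x != 0 for x in d) <= 1]        # at most one nonzero digit
    zeros = sum(d.count(0) for d in groups)
    return len(groups), Fraction(zeros, K * len(groups))

count, p_zero = zero_digit_fraction(3)   # seven values, 5/7 of the digits are zero
```

For a three-digit group the enumeration yields seven values and a zero-digit fraction of 5/7, slightly above the roughly 2/3 typical of CSD.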
Note that although it has a low number of nonzero digits, the FloatSD representation cannot cover all possible binary numbers. For example, a 3-digit group can represent only seven different values, instead of the eight values that three bits represent in binary. However, neural networks are known to be tolerant to numerical inaccuracy and sometimes even benefit from it, e.g., deliberate noise injection [9]. To further reduce the complexity, one can adopt only a few, not all, digit groups of the FloatSD weights for inference and backward propagation, as shown in Fig. 3.
III Low-Complexity LSTM Training
III-A Weight Representation
In this work, we chose the FloatSD8 number format for LSTM weight representation. FloatSD8 consists of a 3-bit exponent field, a 3-digit most significant group (MSG), and a 2-digit second group. This representation has 7 possible values ($+4$, $+2$, $+1$, $0$, $-1$, $-2$, $-4$) in the MSG and 5 possible values ($+2$, $+1$, $0$, $-1$, $-2$) in the second group, leading to 35 combinations. However, out of the 35 combinations, only 31 distinct values exist, making 5 bits enough for encoding these two groups. With a 3-bit exponent field and a 5-bit SD group mantissa, the FloatSD8 format requires only 8 bits to represent a neural network weight.
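The count of 31 distinct values can be reproduced by enumeration. The sketch assumes that the two groups occupy adjacent digit positions, so that the MSG digits carry weights 16, 8, and 4 relative to the weights 2 and 1 of the second group; this positional layout is an assumption of the example.

```python
# MSG values (+-4, +-2, +-1, 0) shifted up by two digit positions (assumed layout).
msg_values = [4 * d for d in (-4, -2, -1, 0, 1, 2, 4)]
second_values = [-2, -1, 0, 1, 2]

pairs = [(m, s) for m in msg_values for s in second_values]   # 35 combinations
mantissas = sorted({m + s for m, s in pairs})                  # distinct values only
duplicates = len(pairs) - len(mantissas)
```

Under this layout four sums coincide (for example, 4 - 2 and 0 + 2 both give 2), which leaves 31 distinct mantissas and explains why 5 bits suffice.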
III-B Weight Update Mechanism and Master Copy
Instead of using the FloatSD weight master copy and the Single Trigger Update (STU) method [6], we stored the master copy of the LSTM weights during training in the conventional floating-point (FP) number format and adopted the traditional weight update mechanism. After each update, the master copy weights are quantized to FloatSD8 for the next iteration. This change in the master copy format and update mechanism allows us to easily control the master copy precision without modifying the FloatSD8 format. In addition, weight initialization is crucial to the final outcome of neural network training. By choosing the FP format for the weight master copy, we can adopt common weight initialization methods without modification.
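The update-then-quantize loop described above can be sketched as follows. This is an illustration under stated assumptions, not the training code of this work: the exponent bias of 12 used to build `GRID`, the nearest-value rounding in `quantize_floatsd8`, and the plain SGD step in `sgd_step` are all hypothetical choices.

```python
import numpy as np

# Assumed FloatSD8 value grid: 31 mantissas x 8 exponents (the bias of 12 is hypothetical).
MANT = sorted({4 * m + s for m in (-4, -2, -1, 0, 1, 2, 4) for s in (-2, -1, 0, 1, 2)})
GRID = np.array(sorted({m * 2.0 ** (e - 12) for m in MANT for e in range(8)}))

def quantize_floatsd8(w):
    """Round each element to the nearest value in GRID, saturating at both ends."""
    w = np.asarray(w, dtype=np.float64)
    idx = np.clip(np.searchsorted(GRID, w), 1, len(GRID) - 1)
    lo, hi = GRID[idx - 1], GRID[idx]
    return np.where(w - lo <= hi - w, lo, hi)

def sgd_step(master, grad, lr=0.1):
    """Update the FP master copy, then derive the FloatSD8 weights for the next iteration."""
    master = (master - lr * grad).astype(np.float32)
    return master, quantize_floatsd8(master)

master = np.array([0.30, -0.07, 0.011], dtype=np.float32)
grad = np.array([0.5, -0.2, 0.1], dtype=np.float32)
master, w_q = sgd_step(master, grad)   # w_q is what forward/backward passes would use
```

Small updates accumulate in `master` even when they are too small to change `w_q`, which is the reason for keeping a higher-precision master copy.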
III-C Sigmoid Function Quantization
In (1)-(4), the multiplications are computed between FloatSD8 weights and FP inputs, which is very efficient since a FloatSD8 weight generates only two partial products. However, in (5) and (6), the element-wise multiplications are computed between two FP numbers, which is inefficient because FP numbers generally involve quite a few partial products. To this end, the forget gate $f_t$, the input gate $i_t$, and the output gate $o_t$ are further quantized to the FloatSD8 representation. The multiplications in (5) and (6) then become multiplications between a FloatSD8 number and an FP number, the same form as in (1)-(4). Direct FloatSD8 quantization of the sigmoid function leads to an unbalanced quantization error distribution between positive and negative inputs, as shown in Fig. 4. This is caused by the logarithmic-linear nature of the FloatSD representation.
Therefore, we decompose the quantization operation into two regions:
$$\sigma_q(x) = Q(\sigma(x)), \quad x \le 0 \tag{7}$$
$$\sigma_q(x) = 1 - Q(1 - \sigma(x)), \quad x > 0 \tag{8}$$
where $Q(\cdot)$ denotes the FloatSD8 quantization function. The quantized sigmoid function and its unquantized counterpart are plotted in Fig. 5. Note that in Eq. (8), the output may need two FloatSD8 numbers to represent.
In an actual implementation, the sigmoid function and the FloatSD quantization can be merged and realized by a lookup table (LUT). The extra multiplications and additions caused by two FloatSD numbers representing one quantized sigmoid output can be handled by the specially designed multiply-and-accumulate (MAC) circuit. Moreover, because a quantized sigmoid output has only 42 possible values when the input is non-positive, the depth of the LUT can be reduced, significantly lowering the memory requirement.
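The two-region quantization of (7) and (8) can be sketched as below. The value grid and the nearest-value quantizer `Q` are assumptions of the example (the exponent bias in particular is hypothetical), and the positive branch uses the identity $1 - \sigma(x) = \sigma(-x)$.

```python
import numpy as np

# Assumed FloatSD8 value grid; the exponent bias of 12 is hypothetical.
MANT = sorted({4 * m + s for m in (-4, -2, -1, 0, 1, 2, 4) for s in (-2, -1, 0, 1, 2)})
GRID = np.array(sorted({m * 2.0 ** (e - 12) for m in MANT for e in range(8)}))

def Q(v):
    """Nearest-value quantizer over the assumed grid."""
    v = np.asarray(v, dtype=np.float64)
    idx = np.clip(np.searchsorted(GRID, v), 1, len(GRID) - 1)
    lo, hi = GRID[idx - 1], GRID[idx]
    return np.where(v - lo <= hi - v, lo, hi)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_q(x):
    """Q(sigma(x)) for x <= 0 and 1 - Q(sigma(-x)) for x > 0."""
    x = np.asarray(x, dtype=np.float64)
    q = Q(sigmoid(-np.abs(x)))           # always quantize the small-valued side
    return np.where(x <= 0, q, 1.0 - q)

xs = np.array([0.5, 1.0, 3.0])
pos, neg = sigmoid_q(xs), sigmoid_q(-xs)   # mirrored outputs
```

Because both branches quantize the same quantity, the quantization error for an input x mirrors that for -x, which removes the imbalance seen with direct quantization.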
III-D Other Low-Complexity Considerations
The FloatSD8 representation is used for the LSTM network weights. However, LSTM training and inference computation involves more than just network weights. In this work, forward neuron activations, backward neuron activations, and all gradients were quantized to 8-bit FP numbers (FP8) with a 1-bit sign, 5-bit exponent, and 2-bit mantissa [7]. Note that although quantization by stochastic rounding has been shown to provide better training performance, all of the above quantization adopted regular rounding in consideration of hardware design complexity.
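A 1-5-2 FP8 quantizer with regular (round-to-nearest) rounding can be sketched in NumPy by exploiting the fact that FP16 has the same 5-bit exponent, so FP8 corresponds to the upper 8 bits of an FP16 word. The function name `quantize_fp8_152` is hypothetical; the sketch breaks ties to even, passes through an intermediate FP16 cast (which can introduce double rounding for FP32 inputs), and does not treat NaN payloads specially.

```python
import numpy as np

def quantize_fp8_152(x):
    """Round values to FP8 (1 sign, 5 exponent, 2 mantissa bits), nearest with ties to even."""
    h = np.atleast_1d(np.asarray(x, dtype=np.float16))   # FP16 shares the exponent width
    bits = h.view(np.uint16).astype(np.uint32)
    lsb = (bits >> 8) & 1                                # lowest mantissa bit that is kept
    bits = (bits + 0x7F + lsb) & 0xFF00                  # round, then clear the 8 dropped bits
    return bits.astype(np.uint16).view(np.float16).astype(np.float32)

out = quantize_fp8_152([1.0, 1.1, 1.2, 1.125, 1.375, -3.3])
```

With two mantissa bits, values in [1, 2) fall on a grid of spacing 0.25, so 1.2 rounds to 1.25 while 1.1 rounds to 1.0.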
III-E Summary
The precision settings of the proposed low-complexity LSTM training scheme are summarized in Table II. By quantizing these variables, LSTM training can benefit from not only low hardware complexity but also low memory access bandwidth.
Table II: Precision settings of the proposed training scheme
w  g  a  m  s
FloatSD8  FP8  FP8  FP32  FloatSD8
w: Weights, g: Gradients, a: Activations, m: Master copy of weights, s: Quantized sigmoid function output
IV Simulation and Discussion
IV-A Platform, Datasets, and Models
PyTorch and QPyTorch [10] were used as the frameworks to study the proposed method. Four commonly used NLP datasets, UDPOS [11], SNLI [12], Multi30K [13], and WikiText-2 [14], were used in the simulations. All networks were trained via the proposed FloatSD8 training method with the same network architectures, hyperparameters, and preprocessing as the baseline implementation using IEEE single-precision floating-point (FP32) arithmetic. The one exception is that the loss-scaling technique [3] was adopted in the low-complexity training method to keep the back-propagated error magnitude within a small interval. In the following, we briefly introduce each dataset, the corresponding model, and the hyperparameters used.
UDPOS
UDPOS comprises 254,830 words and 16,622 sentences taken from five genres of web media, with sentences annotated using universal dependency relations. For the UDPOS simulation, we adopted a model consisting of a word embedding layer, a two-layer bidirectional LSTM, and a fully-connected output layer. The model was trained via the ADAM [15] optimizer with a single loss-scaling factor of 1024.
SNLI
The Stanford Natural Language Inference (SNLI) dataset is a collection of 570k human-written English sentence pairs, manually labeled for balanced classification with the natural language inference (NLI) labels entailment, contradiction, or neutral. For the SNLI simulation, we adopted a model that consists of a word embedding layer, a fully-connected projection layer, a single-layer bidirectional LSTM, and a sequence of four fully-connected layers. The model was trained via the ADAM optimizer with a single loss-scaling factor of 1024.
Multi30K
Multi30K consists of 29,000 training examples and 1,014 development examples, each containing an English source sentence, a German translation by humans, and a corresponding image. The dataset is intended for the multimodal translation task of translating an English sentence describing an image into German. For the Multi30K simulation, we adopted a model with an encoder and a decoder. The encoder is made up of a word embedding layer and a single-layer LSTM, and the decoder consists of a word embedding layer, a single-layer LSTM, and a fully-connected output layer. The model was trained via the ADAM optimizer with a single loss-scaling factor of 1024.
WikiText-2
WikiText-2 is part of the WikiText language modeling dataset, a collection of over 100 million tokens extracted from articles on Wikipedia. This dataset is larger than the previous three datasets, providing insight into the effectiveness of the proposed training scheme on a large dataset. For the WikiText-2 simulation, we adopted a model with a word embedding layer, a two-layer LSTM encoder, and a fully-connected output decoder. The model was trained via the SGD optimizer with a single loss-scaling factor of 1024.
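All four models above use a single loss-scaling factor of 1024. The effect of such scaling on small gradients can be illustrated with NumPy's native FP16 type, which has the same 5-bit exponent as the FP8 format used here; the gradient value is hypothetical and FP16 serves only as a convenient stand-in.

```python
import numpy as np

SCALE = 1024.0                       # loss-scaling factor used in the simulations
true_grad = 1e-8                     # hypothetical small gradient value

unscaled = np.float16(true_grad)                  # below the FP16 subnormal range
scaled = np.float16(true_grad * SCALE)            # scaled loss yields a scaled gradient
recovered = np.float32(scaled) / np.float32(SCALE)   # unscale in higher precision
rel_error = abs(float(recovered) - true_grad) / true_grad
```

Without scaling the gradient flushes to zero, whereas the scaled value survives the low-precision format and is recovered with a small relative error once it is divided by the same factor.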
Table III: Hyperparameter settings and parameter counts
Dataset  Epochs  Batch size  Parameters

UDPOS  50  64  0.64M 
SNLI  30  128  4.23M 
Multi30K  30  128  15.27M 
WikiText2  50  64  84.98M 
IV-B Simulation Results and Discussion
The hyperparameter settings and parameter counts for the four datasets are summarized in Table III. The simulated performance curves during FloatSD8 training and FP32 training on the four datasets are shown in Fig. 6. Compared with the FP32-trained models, the proposed FloatSD8 training scheme achieves similar or even better performance on the UDPOS, SNLI, and Multi30K tasks. On the WikiText-2 task, however, the degradation in perplexity caused by the proposed method is quite obvious. The simulated results of the FP32-trained and FloatSD8-trained LSTMs are summarized in the second and third columns of Table IV.
Activation Precision of the First and Last Layers
In NN training, the first and last layers are often excluded from quantization due to their sensitivity. The previous simulations quantized the forward and backward activations of the first and last layers to FP8. This may make no difference on relatively small datasets; however, on a larger dataset such as WikiText-2, the poor precision in these two layers may significantly impact training performance. To gain more insight into the cause of the performance degradation on the WikiText-2 task, we experimented with various activation precision settings using the same model architecture and hyperparameters. The results are summarized in Table V. Note that the first layer in the table refers to the outputs of the embedding layer, since the inputs of the embedding layer are just indices. From the results, we can conclude that the last layer's activation precision is more important than the first layer's. Also, the setting with FP8 first-layer activations, FP16 last-layer activations, and FP8 activations in the other layers is sufficient to provide results similar to the FP32 baseline. Note that in this way we relaxed only the output layer activation precision to FP16, while keeping the weights and all other activations in 8-bit precision. All the multiplications in the LSTM were still between FloatSD8 and FP8 numbers, except that the output layer activations were not further quantized to FP8.
Table IV: Results of FP32 training and FloatSD8 training
Dataset (metric)  FP32 baseline  FloatSD8  FloatSD8 with FP16 master copy of weights
UDPOS (accuracy, %)  89.05  89.09  89.13
SNLI (accuracy, %)  79.28  79.32  79.24
Multi30K (perplexity)  37.02  36.87  37.26
WikiText-2 (perplexity)  87.83  98.94  91.06
Table V: WikiText-2 perplexity under different activation precision settings
First layer  Last layer  Other layers  Perplexity

FP8  FP8  FP8  98.94 
FP16  FP16  FP16  88.92 
FP8  FP16  FP8  89.87 
FP16  FP8  FP8  99.81 
FP16  FP16  FP8  89.59 
Precision of the Master Copy Weights
The master copy weights used in the previous experiments were in the FP32 format. If the precision of the master copy can be reduced, both memory and complexity can be saved. We therefore further experimented with the four datasets using FP16 master copy weights during training. Note that the simulations were done without any other change in model architecture or hyperparameters, apart from changing the FP32 master copy to FP16 precision. The simulated results for the four datasets are summarized in the fourth column of Table IV. Compared with their FP32 counterparts, the results using the FP16 master copy show quite similar performance. The largest degradation between all-FP32 arithmetic and FloatSD8/FP8 with FP16 output activations (see Table IV) and an FP16 master copy appears on the WikiText-2 dataset, and is only about 3.7% in perplexity.
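The quoted degradation of about 3.7% follows directly from the WikiText-2 row of Table IV, as the short check below shows; the variable names are illustrative only.

```python
fp32_ppl = 87.83        # WikiText-2 perplexity, FP32 baseline (Table IV)
floatsd8_ppl = 91.06    # FloatSD8 training with FP16 master copy (Table IV)

# Relative increase in perplexity, in percent.
degradation = (floatsd8_ppl - fp32_ppl) / fp32_ppl * 100
```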
In conclusion, the modified FloatSD8 training scheme, i.e., FloatSD8 weights, an FP16 master copy, FP8 gradients, FP8 forward and backward activations (except for FP16 last-layer outputs), and FloatSD8 sigmoid function quantization, achieves low-complexity training and inference across different LSTM models, with negligible degradation compared to the baselines trained in FP32 arithmetic. The precision settings of the modified training scheme are summarized in Table VI.
Table VI: Precision settings of the modified training scheme
w  g  o  a  m  s
FloatSD8  FP8  FP16  FP8  FP16  FloatSD8
w: Weights, g: Gradients, o: Activations of the last layer output, a: Activations of other layers, m: Master copy of weights, s: Outputs of the sigmoid function
IV-C Complexity Analysis
FloatSD8 represents a network weight with no more than two nonzero digits and an exponent field. Consequently, multiplication involving a FloatSD8 weight can be implemented by addition of two partial products. As the neuron activations are quantized to the FP8 format, the forward pass multiplication is done by addition of two partial products generated from the FP8 multiplicand, and requires only FP16 additions. In the backward pass, FP16 additions also suffice because the backward neuron activations are also quantized to the FP8 format. The weight update process is implemented by addition of the FP16 master copy weight and the FP8 gradient, which can also be realized by FP16 addition. In conclusion, FP16 accumulation is sufficient for all operations in LSTM model training and inference.
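The two-partial-product multiplication can be sketched with shifts and a single addition. In this illustration an integer `x` stands in for the significand of an FP8 activation, and the decomposition of the weight into `msg`, `second`, and `exp` fields (with the MSG two digit positions above the second group) is an assumption of the example, as is the function name.

```python
def floatsd_multiply(x, msg, second, exp):
    """Compute x * w for w = (4*msg + second) * 2**exp with one shifted copy of x per group."""
    def partial(value):
        # value is 0 or a signed power of two, so the product is a shift and a sign flip
        if value == 0:
            return 0
        shifted = x << (abs(value).bit_length() - 1)
        return shifted if value > 0 else -shifted

    total = partial(4 * msg) + partial(second)   # two partial products, one addition
    return total * 2.0 ** exp                    # exponent applied as a final scaling

product = floatsd_multiply(11, msg=2, second=-1, exp=-3)   # 11 * (8 - 1) / 8
```

No general multiplier array is needed: each SD group contributes at most one shifted copy of the multiplicand.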
V Hardware Implementation
Based on the FloatSD8 weight representation, FP8 input activations, FloatSD8 quantization of the sigmoid function, and FP16 accumulation, we designed a low-complexity LSTM inference accelerator circuit that leverages the low-precision variable representation of the proposed method.
V-A Processing Element
The LSTM processing element (PE) is the core circuit of the proposed hardware accelerator. The PE computes matrix multiplications between FP8 inputs and FloatSD8 weights. The architecture of the PE is illustrated in Fig. 7. Since the input size is influenced by the varying input sequence length, the LSTM PE adopts an output-stationary design and accumulates the product sum generated by the current batch of inputs/weights in the partial sum register. The FloatSD8 MAC takes four FP8 inputs, four FloatSD8 weights, and the previous result or the bias as input data; computes the multiplications between inputs and weights; and then accumulates all products and the previous result or the bias. Taking advantage of 8-bit inputs/weights, the FloatSD8 MAC simultaneously handles four pairs of inputs and weights using the same I/O bandwidth as an FP32 MAC.
The block diagram of the proposed five-stage pipelined FloatSD8 MAC is depicted in Fig. 8. In the first stage, the FloatSD8 weights are decoded; the partial product generator then generates partial products between four pairs of inputs and weights, and the max exponent detector finds the largest exponent among all partial products. In the second stage, the partial products are aligned by their respective shifters. In the third stage, the aligned partial products are added by Wallace-tree carry-save adders. Finally, in the fourth and fifth pipeline stages, the result is rounded and normalized to the FP16 format. Note that the FloatSD8 MAC takes the previous result as one input, so the PE would have to wait five cycles before computing another outcome, leading to low throughput and low hardware utilization. To overcome this problem, batch workloads are adopted in our design together with the partial sum registers. With a batch size larger than five, the hardware utilization reaches 100%.
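The effect of batching on MAC utilization can be illustrated with a small cycle-level model. This is an idealized sketch, not a model of the actual circuit: it assumes one issue slot per cycle, a fixed five-cycle dependency between consecutive accumulations of the same batch element, and oldest-ready-first scheduling; `simulate_utilization` is a hypothetical name.

```python
def simulate_utilization(batch_size, latency=5, macs_per_chain=100):
    """Fraction of cycles in which the MAC issues, with interleaved accumulation chains."""
    ready = [0] * batch_size                  # cycle at which each chain may issue again
    remaining = [macs_per_chain] * batch_size
    cycle = issued = 0
    while any(remaining):
        candidates = [k for k in range(batch_size)
                      if remaining[k] and ready[k] <= cycle]
        if candidates:
            k = min(candidates, key=lambda j: ready[j])   # oldest-ready chain first
            ready[k] = cycle + latency        # its next MAC must wait for this result
            remaining[k] -= 1
            issued += 1
        cycle += 1
    return issued / cycle

single = simulate_utilization(1)   # about 1/5: the MAC idles while waiting on itself
full = simulate_utilization(8)     # enough independent chains to fill the pipeline
```

In this simplified model the pipeline is already full once the batch size reaches the pipeline depth, which is consistent with the statement above for batch sizes larger than five.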
V-B LSTM Unit
The architecture of the LSTM inference circuit is shown in Fig. 9. It consists of four PEs, LUTs for the sigmoid and tanh functions, memory for the cell state, and two FloatSD8 MACs. To compute the whole LSTM operation, the inputs and weights are first sent to the four PEs to calculate the matrix multiplications in (1)-(4). After the matrix multiplications are completed, the outputs of the PEs are sent to the LUTs to obtain the results of the four gates. A FloatSD8 MAC then computes the cell state according to (5). As mentioned before, each output of the sigmoid function LUT consists of two FloatSD8 numbers, so the computation of the cell state is a multiply-accumulate operation between four FloatSD8 numbers and four FP8 numbers, which is exactly what a FloatSD8 MAC can handle. Finally, the cell state is sent to a tanh LUT, and a FloatSD8 MAC then calculates the output according to (6).
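The mapping of the cell-state update (5) onto one four-input MAC operation can be sketched as follows. The helper names and the numeric values are hypothetical; the pair returned by `split_sigmoid_q` mirrors the two-region quantization of (7) and (8), where a gate value for a positive input is expressed as 1 minus a quantized value.

```python
def split_sigmoid_q(q, positive_input):
    """Return two numbers whose sum equals the quantized gate value."""
    return (1.0, -q) if positive_input else (q, 0.0)

def cell_state_mac(f_pair, i_pair, c_prev, g):
    """c_t = f_t * c_{t-1} + i_t * g_t evaluated as one 4-input multiply-accumulate."""
    weights = [f_pair[0], f_pair[1], i_pair[0], i_pair[1]]   # four FloatSD8 numbers
    inputs = [c_prev, c_prev, g, g]                          # four FP8 numbers
    return sum(w * x for w, x in zip(weights, inputs))

f_pair = split_sigmoid_q(0.125, positive_input=True)    # forget gate 0.875 = 1 - 0.125
i_pair = split_sigmoid_q(0.25, positive_input=False)    # input gate 0.25
c_t = cell_state_mac(f_pair, i_pair, c_prev=0.5, g=-0.75)
```

Each gate contributes two weight operands and each of the cell state and cell gate is used twice, which yields the four weight-input pairs that one FloatSD8 MAC accepts.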
V-C Area and Power Comparison
In order to demonstrate the effectiveness of the proposed FloatSD8 method and its associated circuit, we also designed an FP32 MAC that takes four pairs of inputs and weights as input data. The FP32 MAC was properly pipelined to run at the same speed as the FloatSD8 MAC. The two MACs were then synthesized in Synopsys Design Compiler using a 40 nm CMOS process, and Synopsys PrimeTime PX was used for accurate power estimation. Table VII lists the estimated area and power consumption of the two synthesized MAC circuits. When running at 400 MHz, the FloatSD8 MAC is about 7.66X smaller in die area and consumes 5.75X less power than the FP32 MAC running at the same speed. The significant savings in circuit area and power consumption validate the effectiveness of the proposed FloatSD8 design for LSTM applications.
Table VII: Area and power of the synthesized MAC circuits
Process  Type  Period  Area  Power

40nm CMOS  FP32  2.5ns  26661  2.920 
40nm CMOS  FloatSD8  2.5ns  3479  0.508 
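The area and power ratios quoted in the text follow from the two rows of Table VII. The check below uses only the tabulated numbers; since the ratios are unit-free, no assumption about the area and power units is needed.

```python
fp32 = {"area": 26661, "power": 2.920}       # FP32 MAC row of Table VII
floatsd8 = {"area": 3479, "power": 0.508}    # FloatSD8 MAC row of Table VII

area_ratio = fp32["area"] / floatsd8["area"]      # how many times smaller the FloatSD8 MAC is
power_ratio = fp32["power"] / floatsd8["power"]   # how many times less power it consumes
```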
VI Conclusions
In this paper, we applied the novel FloatSD8 weight representation to LSTM training and inference. To fully leverage the low complexity of the FloatSD-based multiplier, the sigmoid function in the LSTM operation is cascaded with FloatSD8 quantization. With this modification, all multiplications in the LSTM network can be executed by an FP8-FloatSD8 multiplier using only two partial products. To further reduce the computational cost as well as the storage and I/O access cost in LSTM training, we used 8-bit FP8 gradients and activations, 16-bit accumulation, and FP16 master copy weights. Simulations of four different LSTM applications indicate that the proposed FloatSD8-based training method achieves almost the same, and in some cases better, performance compared to the FP32 baselines. To more convincingly verify the advantage of the proposed method, we designed an LSTM inference acceleration circuit for the proposed FloatSD technology. Our FloatSD8 MAC design outperforms its FP32 counterpart in die area and power consumption by 7.66X and 5.75X, respectively. Finally, based on this work and the previous FloatSD work on CNNs [6], we believe that the FloatSD8 representation is suitable for NN training across different domains and model architectures.
In the future, we plan to train more types of NNs using the FloatSD technology to broaden its scope of application. Meanwhile, a general-purpose, high-performance FloatSD8-based NN training/inference accelerator SoC has been taped out and will be tested soon.
References
 [1] S. Han, J. Pool, J. Tran, and W. J. Dally, "Learning both weights and connections for efficient neural network," in Proc. Int. Conf. Neural Inf. Process. Syst. (NIPS), pp. 1135–1143, December 2015.
 [2] D. D. Lin, S. Talathi, and V. S. Annapureddy, "Fixed point quantization of deep convolutional networks," arXiv:1511.06393, November 2015.
 [3] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, et al., "Mixed precision training," arXiv:1710.03740, October 2017.
 [4] J. Choi, Z. Wang, S. Venkataramani, P. I.-J. Chuang, V. Srinivasan, and K. Gopalakrishnan, "PACT: Parameterized clipping activation for quantized neural networks," arXiv:1805.06085, May 2018.
 [5] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, "DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients," arXiv:1606.06160, June 2016.
 [6] P.-C. Lin, M.-K. Sun, C. Kung, and T.-D. Chiueh, "FloatSD: A new weight representation and associated update method for efficient convolutional neural network training," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, June 2019.
 [7] N. Wang, J. Choi, D. Brand, C.-Y. Chen, and K. Gopalakrishnan, "Training deep neural networks with 8-bit floating point numbers," in Advances in Neural Information Processing Systems 31, 2018.
 [8] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
 [9] Y. Grandvalet, S. Canu, and S. Boucheron, "Noise injection: Theoretical prospects," Neural Computation, vol. 9, no. 5, pp. 1093–1108, 1997.
 [10] T. Zhang, Z. Lin, G. Yang, and C. D. Sa, "QPyTorch: A low-precision arithmetic simulation framework," arXiv:1910.04540, October 2019.
 [11] N. Silveira, T. Dozat, M.-C. de Marneffe, S. Bowman, M. Connor, J. Bauer, et al., "A gold standard dependency corpus for English," in Proc. of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pp. 2897–2904, Reykjavik, Iceland, May 2014.
 [12] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning, "A large annotated corpus for learning natural language inference," in Proc. of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, 2015.
 [13] D. Elliott, S. Frank, K. Sima'an, and L. Specia, "Multi30K: Multilingual English-German image descriptions," arXiv:1605.00459, May 2016.
 [14] S. Merity, C. Xiong, J. Bradbury, and R. Socher, "Pointer sentinel mixture models," arXiv:1609.07843, September 2016.
 [15] D. P. Kingma and J. L. Ba, "Adam: A method for stochastic optimization," in Proc. of International Conference on Learning Representations (ICLR), San Diego, CA, USA, May 2015.