Recently, a great many studies have proposed low-complexity, high-performance neural network (NN) training and inference methods as well as their hardware architectures. The main objective is to reduce the huge computation and memory storage/access required by popular modern NN models. Research teams from academia and industry alike have presented numerous efficient acceleration solutions for convolutional NN (CNN) training and inference, based on FPGAs, ASICs, and GPUs. In these methods, an optimal NN model is usually first obtained by training with single-precision floating-point (FP32) or half-precision floating-point (FP16, bfloat16) arithmetic, and then techniques such as pruning, quantization, and compression are applied. Among the quantization-based precision reduction methods, recent works have mostly focused on reducing NN inference complexity. For example, Lin et al. quantized an FP32-trained NN model into a model with fixed-point weights that achieves inference accuracy similar to that of the original FP32 NN model.
In contrast to the many works on reduced-precision NN inference, several other studies have addressed reduced-precision NN training. For instance, MPT trains CNNs using FP16 arithmetic and additionally proposes a loss-scaling method to preserve small gradients. In PACT, the precision of weights and activations was scaled down to 4 bits with a small loss of accuracy. DoReFa-Net trained NNs with 1-bit weights and 2-bit activations, while reducing the gradient precision down to 6 bits. FloatSD trained CNNs with special floating-point weights having only two nonzero digits, along with 8-bit activations and gradients, and achieved very small degradation in the accuracy of the trained NN models. Finally, Wang et al. train CNNs with 8-bit floating-point numbers (FP8) and propose chunk-based accumulation and floating-point stochastic rounding to reduce the arithmetic precision of additions from 32 bits to 16 bits.
All of the above methods have limitations: most were demonstrated on only a few selected models, specifically CNNs; some suffer significant performance degradation; and some require FP32 accumulation.
The main objective of this paper is to extend the application of reduced-precision methods to recurrent NNs (RNNs) while mitigating the aforementioned problems. FloatSD is a reduced-precision CNN training method that has been demonstrated to provide efficient training for CNNs as large as ResNet-50. We chose long short-term memory networks (LSTMs) as the target for low-complexity training by FloatSD. Unlike CNNs, LSTMs do not adopt convolutional layers and require only fully-connected (FC) layers consisting of simple matrix multiplications. In addition, the vanishing and exploding gradient problems make LSTMs very sensitive to quantization errors. These two features make the reduced-precision training of LSTMs more challenging than that of CNNs.
We first modified the training equations and weight update procedure of LSTM so that they become compatible with the FloatSD method. Then, a suite of LSTM models and datasets was used to demonstrate the applicability of the FloatSD technique to efficient LSTM training with no degradation in trained model accuracy. Finally, we also designed and validated a FloatSD processing element circuit suitable for low-power LSTM training acceleration.
To sum up, the contributions of this paper are:
- Adaptation of the FloatSD method for LSTM training.
- Modification of the LSTM equations for lower computation complexity.
- Precision settings for input layers, output layers, and master copy weights.
- A low-complexity LSTM training scheme integrating the above techniques.
- An efficient LSTM inference acceleration hardware.
II-A Long Short-Term Memory
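LSTM is one of the most popular types of RNNs. It performs very well on sequence-related tasks and is widely used in areas such as speech recognition and natural language processing (NLP). Like traditional RNNs, an LSTM takes a sequence input $x = (x_1, \ldots, x_T)$ and generates a sequence output $h = (h_1, \ldots, h_T)$, which is computed by the following equations:

\begin{align}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) \tag{1} \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) \tag{2} \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) \tag{3} \\
g_t &= \tanh(W_g [h_{t-1}, x_t] + b_g) \tag{4} \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \tag{5} \\
h_t &= o_t \odot \tanh(c_t) \tag{6}
\end{align}

where the forget gate $f_t$ selects the information to be removed from the neuron's memory at time step $t$; the input gate $i_t$ selects the information to be written into the neuron's memory at time step $t$; the output gate $o_t$ controls the amount of information stored in the cell state contributing to the output; the cell gate $g_t$ represents the new information at time step $t$; the cell state $c_t$ represents the content of the neuron's memory; and $h_t$ represents the output of the cell at time step $t$. It is also fed back to the neuron at time step $t+1$. $W$ and $b$ represent the weights and biases, respectively, and $[h_{t-1}, x_t]$ denotes the concatenation of the previous output and the current input. Finally, $\sigma$ and $\odot$ denote the sigmoid function and the element-wise product, respectively. The architecture of an LSTM neuron is shown in Fig. 1.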
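As a concrete illustration of one LSTM time step, the following is a minimal pure-Python sketch of the standard formulation: four gate pre-activations, followed by the cell-state and output updates. The function names (`lstm_step`, `matvec`), the dictionary layout of the weights, and the choice of one weight matrix per gate acting on the concatenation of the previous output and the current input are assumptions made for this example, not details taken from the paper.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step on Python lists.

    W[k] has shape (hidden, hidden + input) and b[k] has length hidden,
    for k in 'f' (forget), 'i' (input), 'o' (output), 'g' (cell gate).
    This layout is an assumption of the sketch.
    """
    z = h_prev + x_t  # list concatenation [h_{t-1}, x_t]
    pre = {k: [a + bias for a, bias in zip(matvec(W[k], z), b[k])]
           for k in "fiog"}
    f = [sigmoid(v) for v in pre["f"]]
    i = [sigmoid(v) for v in pre["i"]]
    o = [sigmoid(v) for v in pre["o"]]
    g = [math.tanh(v) for v in pre["g"]]
    # Cell state: keep part of the old memory, write part of the new information.
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    # Output: gated, squashed cell state.
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c
```

Only the four matrix-vector products involve network weights; the remaining multiplications are element-wise products between gate outputs and activations.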
Despite its excellent performance in certain areas, the complex neuron processing of LSTM makes it more complicated than CNNs and traditional RNNs, requiring substantially more memory and computation. In addition, LSTM is fully-connected, so its training may require a large memory IO bandwidth.
II-B FloatSD Number Representation
Floating-point signed digit (FloatSD) was designed for weight representation in low-complexity CNN training and inference. The structure of the FloatSD representation is shown in Fig. 2. It consists of several signed-digit (SD) groups and an exponent field. The complexity of multiplying two numbers depends on the number of non-zero digits in the multiplier. To achieve low multiplication complexity, each SD group allows no more than one non-zero digit. A $K$-digit SD group therefore has $2K+1$ possible values. For example, in the three-digit case, each group assumes one of seven possible values: $-4$, $-2$, $-1$, $0$, $+1$, $+2$, $+4$, as shown in Table I. That is, assuming all group values are equally likely, the probability of a digit in an SD number with $K$-digit groups being 0 is

$$P(\text{digit} = 0) = \frac{2K-1}{2K+1}. \tag{7}$$

In the case of $K = 3$, this probability is $5/7 \approx 71.4\%$, which is even higher than that of the canonical signed digit (CSD) representation (about $66.7\%$). This leads to the relatively low multiplication complexity of the FloatSD number representation.
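The zero-digit probability can be checked by direct enumeration. The sketch below lists every digit pattern of a $K$-digit SD group with at most one non-zero digit and counts the zero digits, assuming all patterns are equally likely; the function names are hypothetical and introduced only for this example.

```python
from fractions import Fraction

def sd_group_patterns(k):
    """All digit patterns of a k-digit SD group with at most one non-zero digit."""
    patterns = [(0,) * k]
    for pos in range(k):
        for digit in (1, -1):
            p = [0] * k
            p[pos] = digit
            patterns.append(tuple(p))
    return patterns

def zero_digit_fraction(k):
    """Fraction of zero digits over all (equally likely) patterns of the group."""
    patterns = sd_group_patterns(k)
    zeros = sum(p.count(0) for p in patterns)
    return Fraction(zeros, k * len(patterns))

print(len(sd_group_patterns(3)))   # 7 patterns for a three-digit group
print(zero_digit_fraction(3))      # 5/7
```

For every group size the enumeration agrees with the closed form $(2K-1)/(2K+1)$: the all-zero pattern contributes $K$ zeros and each of the $2K$ non-zero patterns contributes $K-1$.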
Note that, although it has a low number of non-zero digits, the FloatSD representation cannot cover all possible binary numbers. For example, a 3-digit group can represent only seven different values, instead of the eight representable by three bits in the binary representation. However, neural networks are known to tolerate numerical inaccuracy and sometimes even benefit from it, e.g., through deliberate noise injection. To further reduce the complexity, one can adopt only a few, not all, digit groups of the FloatSD weights for inference and backward propagation, as shown in Fig. 3.
III Low-Complexity LSTM Training
III-A Weight Representation
In this work, we chose the FloatSD8 number format for LSTM weight representation. FloatSD8 consists of a 3-bit exponent field, a 3-digit most significant group (MSG), and a 2-digit second group. This representation has 7 possible values ($-4$, $-2$, $-1$, $0$, $+1$, $+2$, $+4$) in the MSG and 5 possible values ($-2$, $-1$, $0$, $+1$, $+2$) in the second group, leading to 35 combinations. However, some combinations encode the same value, so only 31 of the 35 combinations are distinct, making 5 bits enough for encoding these two groups. With a 3-bit exponent field and a 5-bit SD group mantissa, the FloatSD8 format requires only 8 bits to represent a neural network weight.
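The count of 31 distinct mantissas can be reproduced by enumeration. The sketch below assumes that the 3-digit MSG sits directly above the 2-digit second group, so that the MSG value carries a weight of 4 relative to the second group; the function name is hypothetical.

```python
MSG_VALUES = (-4, -2, -1, 0, 1, 2, 4)   # 3-digit group, at most one non-zero digit
SECOND_VALUES = (-2, -1, 0, 1, 2)       # 2-digit group, at most one non-zero digit

def floatsd8_mantissas():
    """Return all (MSG, second) combinations and the distinct mantissa values."""
    combos = [(m, s) for m in MSG_VALUES for s in SECOND_VALUES]
    # The MSG is two digit positions above the second group, hence the factor 4.
    values = sorted({4 * m + s for m, s in combos})
    return combos, values

combos, values = floatsd8_mantissas()
print(len(combos), len(values))  # 35 31
```

The four redundant combinations come from overlaps such as $4 \cdot 1 - 2 = 4 \cdot 0 + 2$. Since $31 \le 2^5$, the two groups fit in a 5-bit code.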
III-B Weight Update Mechanism and Master Copy
Instead of using the FloatSD weight master copy and the Single Trigger Update (STU) method, we stored the master copy of the LSTM weights during training in the conventional floating-point (FP) number format and adopted the traditional weight update mechanism. After each update, the master copy weights are quantized to FloatSD8 for the next iteration. This change in the format of the master copy and in the update mechanism allows us to easily control the master copy precision without modifying the FloatSD8 format. In addition, weight initialization is crucial to the final outcome of neural network training. By choosing the FP format for the weight master copy, we can adopt common weight initialization methods without modification.
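The update-then-quantize loop can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: the exponent offset `EXP_OFFSET`, the round-to-nearest quantizer, and the plain SGD step are all hypothetical choices made to keep the example small.

```python
MSG_VALUES = (-4, -2, -1, 0, 1, 2, 4)   # most significant group
SECOND_VALUES = (-2, -1, 0, 1, 2)       # second group
EXP_OFFSET = 11                          # hypothetical exponent offset, not from the paper

# Every value representable as (4*MSG + second) * 2**(e - EXP_OFFSET), e = 0..7.
GRID = sorted({(4 * m + s) * 2.0 ** (e - EXP_OFFSET)
               for m in MSG_VALUES for s in SECOND_VALUES for e in range(8)})

def quantize_floatsd8(w):
    """Nearest representable value (assumed rounding rule)."""
    return min(GRID, key=lambda v: abs(v - w))

def update_step(master, grads, lr):
    """Update the FP master copy, then derive the FloatSD8 weights from it."""
    new_master = [w - lr * g for w, g in zip(master, grads)]
    quantized = [quantize_floatsd8(w) for w in new_master]
    return new_master, quantized

master = [0.5]
for step in range(20):
    master, weights = update_step(master, [1.0], lr=0.001)
print(master[0], weights[0])
```

In this toy run the quantized weight stays at 0.5 for the first steps while the master copy keeps accumulating the small updates, and it moves to the next representable value (0.46875 on this assumed grid) only once enough updates have accumulated.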
III-C Sigmoid Function Quantization
In (1)–(4), the multiplications are computed between FloatSD8 weights and FP inputs, which is efficient since a FloatSD8 weight generates only two partial products. However, in (5) and (6), the element-wise multiplications are computed between two FP numbers, which is inefficient because FP numbers generally involve quite a few partial products. To this end, the forget gate $f_t$, the input gate $i_t$, and the output gate $o_t$ are further quantized to the FloatSD8 representation. The multiplications in (5) and (6) then become multiplications between a FloatSD8 number and an FP number, the same format as in (1)–(4). Direct FloatSD8 quantization of the sigmoid function leads to an unbalanced quantization error distribution between positive and negative inputs, as shown in Fig. 4. This is caused by the logarithmic linear nature of the FloatSD representation.
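Therefore, we decompose the quantization operation into two regions by

$$\sigma_Q(x) = \begin{cases} Q(\sigma(x)), & x \le 0, \\ 1 - Q(\sigma(-x)), & x > 0, \end{cases} \tag{8}$$

where $Q(\cdot)$ denotes the FloatSD8 quantization function. The quantized sigmoid function and its un-quantized counterpart are plotted in Fig. 5. Note that for positive inputs in Eq. (8), the output may need two FloatSD8 numbers to represent, namely $1$ and $-Q(\sigma(-x))$.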
In actual implementation, the sigmoid function and the FloatSD quantization can be merged and realized by a lookup table (LUT). The extra multiplications and additions from two FloatSD numbers representing one quantized sigmoid output can be handled by the specially designed multiply and accumulate (MAC) circuit. Moreover, because there are only 42 possible values in a quantized sigmoid output when the input is non-positive, the depth of the LUT can be reduced, significantly lowering the memory requirement.
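The two-region quantization can be sketched in a few lines. The example relies on the identity $\sigma(x) = 1 - \sigma(-x)$: non-positive inputs are quantized directly, and positive inputs are expressed as one minus the quantized sigmoid of the negated input. The exponent offset `EXP_OFFSET`, the round-to-nearest rule, and the function names are assumptions of this sketch, so the number of distinct outputs it produces need not match the LUT described above.

```python
import math

MSG_VALUES = (-4, -2, -1, 0, 1, 2, 4)
SECOND_VALUES = (-2, -1, 0, 1, 2)
EXP_OFFSET = 11  # hypothetical exponent offset, not from the paper

GRID = sorted({(4 * m + s) * 2.0 ** (e - EXP_OFFSET)
               for m in MSG_VALUES for s in SECOND_VALUES for e in range(8)})

def quantize_floatsd8(v):
    """Nearest representable value (assumed rounding rule)."""
    return min(GRID, key=lambda g: abs(g - v))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def quantized_sigmoid(x):
    """Return the quantized sigmoid as a tuple of FloatSD8 terms to be summed.

    Non-positive inputs need one term; positive inputs need two terms,
    the constant 1 and the negated quantized sigmoid of -x.
    """
    if x <= 0:
        return (quantize_floatsd8(sigmoid(x)),)
    return (1.0, -quantize_floatsd8(sigmoid(-x)))

for x in (-2.0, 2.0):
    terms = quantized_sigmoid(x)
    print(x, terms, sum(terms) - sigmoid(x))
```

By construction, the quantization error at $x$ and at $-x$ has the same magnitude and opposite sign, which removes the imbalance between the two halves of the curve.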
III-D Other Low-Complexity Considerations
The FloatSD8 representation is used for the LSTM network weights. However, LSTM training and inference computation involves more than just network weights. In this work, forward neuron activations, backward neuron activations, and all gradients were quantized to 8-bit FP numbers (FP8) with a 1-bit sign, a 5-bit exponent, and a 2-bit mantissa. Note that although quantization by stochastic rounding has been shown to provide better training performance, all of the above quantization steps adopted regular rounding in consideration of hardware design complexity.
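To make the 1-5-2 format concrete, the following sketch rounds a Python float to the nearest value representable with a 5-bit exponent and a 2-bit mantissa. It assumes an IEEE-like layout (exponent bias of 15, subnormal numbers, the top exponent code reserved) and saturation at the largest finite value; these details and the function name are assumptions, since the text specifies only the bit widths and the use of regular rounding.

```python
import math

def quantize_fp8_152(x, exp_bits=5, man_bits=2):
    """Round x to the nearest 1-5-2 floating-point value (ties to even)."""
    if x == 0.0:
        return 0.0
    bias = 2 ** (exp_bits - 1) - 1           # 15 (assumed IEEE-like bias)
    min_exp = 1 - bias                       # exponent of the smallest normal number
    max_exp = (2 ** exp_bits - 2) - bias     # top exponent code assumed reserved
    sign = -1.0 if x < 0 else 1.0
    a = abs(x)
    _, e = math.frexp(a)                     # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, min_exp)                # clamp so small values use the subnormal step
    step = 2.0 ** (exp - man_bits)           # spacing of representable values near a
    q = round(a / step) * step               # Python's round() rounds ties to even
    max_val = (2.0 - 2.0 ** -man_bits) * 2.0 ** max_exp
    return sign * min(q, max_val)            # saturate instead of overflowing

print(quantize_fp8_152(1.1))         # 1.0
print(quantize_fp8_152(1.2))         # 1.25
print(quantize_fp8_152(2.0 ** -20))  # 0.0
```

With only two mantissa bits, values between 1 and 2 are spaced 0.25 apart, and magnitudes far below the smallest subnormal round to zero.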
The precision settings of the proposed low-complexity LSTM training scheme are summarized in Table II. By quantizing these variables, LSTM training can benefit from not only low hardware complexity but also low memory access bandwidth.
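Table II: Precision settings of the proposed low-complexity LSTM training scheme.
|Variable|Precision|
|Weights|FloatSD8|
|Gradients|FP8|
|Activations|FP8|
|Master copy of weights|FP32|
|Quantized sigmoid function output|FloatSD8|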
IV Simulation and Discussion
IV-A Platform, Datasets, and Models
were used in the simulations. All networks were trained via the proposed FloatSD8 training method with the same network architectures, hyperparameters, and preprocessing as the baseline implementation using IEEE single-precision floating-point (FP32) arithmetic. The one exception is that the loss-scaling technique was adopted in the low-complexity training method to keep the magnitude of the back-propagated errors within a small interval. In the following, we briefly introduce each dataset, the corresponding model, and the hyperparameters used.
UDPOS comprises 254,830 words and 16,622 sentences taken from five genres of web media, with sentences annotated using universal dependency relations. For the UDPOS simulation, we adopted a model consisting of a word embedding layer, a two-layer bidirectional LSTM, and a fully-connected output layer. The model was trained via the ADAM optimizer with a single loss-scaling factor of 1024.
The Stanford Natural Language Inference (SNLI) dataset is a collection of 570k human-written English sentence pairs, manually labeled for balanced classification with the natural language inference (NLI) labels entailment, contradiction, and neutral. For the SNLI simulation, we adopted a model that consists of a word embedding layer, a fully-connected projection layer, a single-layer bidirectional LSTM, and a sequence of four fully-connected layers. The model was trained via the ADAM optimizer with a single loss-scaling factor of 1024.
Multi30K consists of 29,000 training examples and 1,014 development examples, each containing an English source sentence, a human German translation, and a corresponding image. The dataset is intended for the multimodal translation task of translating an English sentence describing an image into German. For the Multi30K simulation, we adopted a model with an encoder and a decoder. The encoder is made up of a word embedding layer and a single-layer LSTM, and the decoder consists of a word embedding layer, a single-layer LSTM, and a fully-connected output layer. The model was trained via the ADAM optimizer with a single loss-scaling factor of 1024.
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from articles in Wikipedia; WikiText-2 is its smaller subset, with about 2 million training tokens. This dataset is larger than the previous three datasets, providing insight into the effectiveness of the proposed training scheme on a larger dataset. For the WikiText-2 simulation, we adopted a model with a word embedding layer, a two-layer LSTM encoder, and a fully-connected output decoder. The model was trained via the SGD optimizer with a single loss-scaling factor of 1024.
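The role of the scaling factor of 1024 used in all four setups can be illustrated with a toy back-propagation chain. The sketch below uses a deliberately crude low-precision model in which magnitudes below half of an assumed smallest representable value underflow to zero; the constant `SMALLEST`, the helper names, and the example derivatives are hypothetical.

```python
SMALLEST = 2.0 ** -16  # assumed smallest positive magnitude of the low-precision format

def flush(x):
    """Crude low-precision model: very small magnitudes underflow to zero."""
    return 0.0 if abs(x) < SMALLEST / 2 else x

def backprop(loss_grad, local_derivs, scale=1.0):
    """Propagate an error through a chain of layers with optional loss scaling."""
    err = flush(loss_grad * scale)       # scale once, at the loss
    for d in local_derivs:               # chain rule, with underflow after every layer
        err = flush(err * d)
    return err / scale                   # unscale before the weight update

derivs = [2.0 ** -4] * 3
print(backprop(2.0 ** -8, derivs))              # 0.0, the gradient is lost
print(backprop(2.0 ** -8, derivs, 1024.0))      # 2**-20, recovered after unscaling
```

Because back-propagation is linear in the loss gradient, scaling the loss shifts every intermediate error into the representable range without changing the final gradient after unscaling.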
IV-B Simulation Results and Discussion
The hyperparameter settings and parameter counts of the models for the four datasets are summarized in Table III. The simulated performance curves during FloatSD8 training and FP32 training on the four datasets are shown in Fig. 6. Compared with FP32 training, the proposed FloatSD8 training scheme clearly achieves similar or even better performance on the UDPOS, SNLI, and Multi30K tasks. However, on the WikiText-2 task, the degradation in perplexity caused by the proposed method is quite obvious. The simulated results of the FloatSD8-trained and FP32-trained LSTMs are summarized in the second and third columns of Table IV.
Activation Precision of the First and Last Layers
In NN training, the first and last NN layers are often excluded from quantization due to their sensitivity. The previous simulations quantized the forward and backward activations of the first and the last layers to FP8. This may make no difference on relatively small datasets; on a larger dataset such as WikiText-2, however, the poor precision in these two layers may significantly impact the training performance. To gain more insight into the cause of the performance degradation on the WikiText-2 task, we experimented with various activation precision settings using the same model architecture and hyperparameters. The results are summarized in Table V. Note that the first layer in the table refers to the outputs of the embedding layer, since the inputs of the embedding layer are just indices. From the results, we can conclude that the activation precision of the last layer is more important than that of the first layer. Also, the setting with FP8 first-layer activations, FP16 last-layer activations, and FP8 activations in the other layers is sufficient to provide results similar to the FP32 baseline. Note that in this way we relaxed only the output layer activation precision to FP16, while keeping the weights and all other activations in 8-bit precision. All the multiplications in the LSTM were still between FloatSD8 and FP8 numbers, except that the output layer activations were not further quantized to FP8.
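Table IV: Results of FP32 training and FloatSD8 training on the four datasets, reported as accuracy (%) for UDPOS and SNLI and as perplexity for Multi30K and WikiText-2. The fourth column lists the FloatSD8 results with an FP16 master copy of weights.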
Table V: WikiText-2 perplexity for different activation precision settings.
|First layer|Last layer|Other layers|Perplexity|
Precision of the Master Copy Weights
The master copy weights used in the previous experiments were in the FP32 format. Reducing the precision of the master copy would save both memory and complexity. We therefore further experimented with the four datasets using FP16 master copy weights during training. Note that the simulations were done without any other change in model architecture or hyperparameters, apart from changing the FP32 master copy to FP16 precision. The simulated results for the four datasets are summarized in the fourth column of Table IV. Compared to their FP32 counterparts, the results using the FP16 master copy show quite similar performance. The largest degradation between using all-FP32 arithmetic and using FloatSD8/FP8 with FP16 output activations (see Table IV) and an FP16 master copy appears on the WikiText-2 dataset, and amounts to only about 3.7% in perplexity.
In conclusion, the modified FloatSD8 training scheme (FloatSD8 weights, FP16 master copy, FP8 gradients, FP8 forward and backward activations except for the FP16 outputs of the last layer, and FloatSD8 quantization of the sigmoid function) achieves low-complexity training and inference across different LSTM models, with negligible degradation compared to the baselines trained in FP32 arithmetic. The precision settings of the modified training scheme are summarized in Table VI.
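Table VI: Precision settings of the modified training scheme.
|Variable|Precision|
|Weights|FloatSD8|
|Gradients|FP8|
|Activations of the last layer output|FP16|
|Activations of other layers|FP8|
|Master copy of weights|FP16|
|Outputs of the sigmoid function|FloatSD8|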
IV-C Complexity Analysis
FloatSD8 represents a network weight with no more than two non-zero digits and an exponent field. Consequently, multiplication involving a FloatSD8 weight can be implemented by addition of two partial products. As the neuron activations are quantized to the FP8 format, the forward pass multiplication is done by addition of two partial products generated from the FP8 multiplicand, and requires only FP16 additions. In the backward pass, FP16 additions also suffice because the backward neuron activations are also quantized to the FP8 format. The weight update process is implemented by addition of the FP16 master copy weight and the FP8 gradient, which can also be realized by FP16 addition. In conclusion, FP16 accumulation is sufficient for all operations in LSTM model training and inference.
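The two-partial-product multiplication can be illustrated on integers. In the sketch below, each SD group is described by a hypothetical `(sign, position)` pair, where `sign` is $-1$, $0$, or $+1$ and `position` is the index of the non-zero digit within the group; the activation is an integer and the exponent is non-negative so that plain shifts suffice. These simplifications are assumptions of the example, not the datapath of the paper.

```python
def partial_product(a, sign, shift):
    """One partial product: the activation shifted and optionally negated."""
    if sign == 0:
        return 0
    return (a << shift) if sign > 0 else -(a << shift)

def floatsd8_multiply(a, msg, second, exponent=0):
    """Multiply integer a by the weight (4*MSG + second) * 2**exponent.

    msg and second are (sign, position) pairs; the MSG sits two digit
    positions above the second group, hence the extra shift of 2.
    """
    (m_sign, m_pos), (s_sign, s_pos) = msg, second
    pp1 = partial_product(a, m_sign, m_pos + 2 + exponent)
    pp2 = partial_product(a, s_sign, s_pos + exponent)
    return pp1 + pp2  # a single addition replaces a full multiplier array

# Weight mantissa 4*(+4) + (-1) = 15: digits (+1 at position 2) and (-1 at position 0).
print(floatsd8_multiply(13, (1, 2), (-1, 0)))  # 195
```

However many bits the activation has, the product is formed from at most two shifted copies of it, which is why a narrow adder suffices for accumulation.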
V Hardware Implementation
Based on the FloatSD8 weight representation, FP8 input activation, sigmoid function FloatSD8 quantization, and FP16 accumulation, we designed a low-complexity LSTM inference accelerator circuit that aims to leverage the low precision variable representation of the proposed method.
V-A Processing Element
The LSTM processing element (PE) is the core circuit of the proposed hardware accelerator. The PE computes matrix multiplications between FP8 inputs and FloatSD8 weights. The architecture of the PE is illustrated in Fig. 7. Since the input size is influenced by the varying input sequence length, the LSTM PE adopts an output-stationary design and accumulates the product sum generated by the current batch of inputs/weights in the partial sum register. The FloatSD8 MAC takes four FP8 inputs, four FloatSD8 weights, and the previous result or the bias as input data; computes the multiplications between inputs and weights; and then accumulates all products with the previous result or the bias. Taking advantage of the 8-bit inputs/weights, the FloatSD8 MAC simultaneously handles four pairs of inputs and weights using the same IO bandwidth as an FP32 MAC.
The block diagram of the proposed five-stage pipelined FloatSD8 MAC is depicted in Fig. 8. In the first stage, FloatSD8 weights are decoded; the partial product generator then generates partial products between four pairs of inputs and weights, and the max exponent detector finds the largest exponent among all partial products. In the second stage, the partial products are aligned by respective shifters. In the third stage, aligned partial products are added by Wallace-tree type carry-save adders. Finally, in the fourth and fifth pipeline stages, the result is rounded and normalized to the FP16 format. Note that the FloatSD8 MAC takes the previous result as one input, so the PE would have to wait for five cycles before computing another outcome, leading to low throughput and low hardware utilization. To overcome this problem, batch workloads are adopted in our design with the partial sum registers. With the batch size larger than five, the hardware utilization would reach 100%.
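The effect of batching on utilization can be seen in a toy cycle-level model. The sketch assumes that each batch element owns one partial-sum register, that elements are served round-robin, and that a result becomes available exactly `latency` cycles after issue; the function name and these timing details are assumptions, and the real circuit may need additional margin.

```python
def mac_utilization(batch_size, latency=5, cycles=1000):
    """Fraction of cycles in which the pipelined MAC can accept a new operation."""
    ready_at = [0] * batch_size   # cycle at which each partial sum is available again
    issued = 0
    nxt = 0                       # round-robin pointer over the batch elements
    for cycle in range(cycles):
        if ready_at[nxt] <= cycle:
            ready_at[nxt] = cycle + latency   # result returns after the pipeline depth
            nxt = (nxt + 1) % batch_size
            issued += 1
        # otherwise the MAC stalls, waiting for the previous result of this element
    return issued / cycles

for b in (1, 2, 8):
    print(b, mac_utilization(b))
```

In this idealized model, utilization grows in proportion to the number of independent accumulations until the pipeline is full, after which the MAC accepts a new operation every cycle.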
V-B LSTM Unit
The architecture of the LSTM inference circuit is shown in Fig. 9. It consists of four PEs, LUTs for the sigmoid and tanh functions, memory for the cell state, and two FloatSD8 MACs. To compute the whole LSTM operation, the inputs and weights are first sent to the four PEs for the matrix multiplications in (1)–(4). After the matrix multiplications are completed, the outputs of the PEs are sent to the LUTs to obtain the results of the four gates. A FloatSD8 MAC then computes the cell state according to (5). As mentioned before, each output of the sigmoid function LUT consists of two FloatSD8 numbers, so the computation of the cell state is a multiply-accumulate operation between four FloatSD8 numbers and four FP8 numbers, which is exactly what a FloatSD8 MAC can handle. Finally, the cell state is sent to a tanh LUT, and then a FloatSD8 MAC calculates the output according to (6).
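Written out, the cell-state computation takes the following form. Here $f_t$ and $i_t$ are the quantized forget and input gate outputs, $g_t$ is the cell gate, and $c_{t-1}$ is the previous cell state; the superscripts $(1)$ and $(2)$ are notation introduced only for this illustration, denoting the two FloatSD8 numbers whose sum represents one quantized sigmoid output.

```latex
c_t = f_t^{(1)} \odot c_{t-1} + f_t^{(2)} \odot c_{t-1}
    + i_t^{(1)} \odot g_t + i_t^{(2)} \odot g_t
```

Each of the four terms is a product of one FloatSD8 number and one FP8 number, so a single pass through a four-input FloatSD8 MAC produces one element of the cell state.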
V-C Area and Power Comparison
To demonstrate the effectiveness of the proposed FloatSD8 method and its associated circuit, we also designed an FP32 MAC that takes four pairs of inputs and weights as input data. The FP32 MAC was properly pipelined to run at the same speed as the FloatSD8 MAC. These two MACs were then synthesized with Synopsys Design Compiler using a 40nm CMOS process. Moreover, Synopsys PrimeTime PX was used for accurate power estimation. Table VII lists the estimated area and power consumption of the two synthesized MAC circuits. When running at 400MHz, the FloatSD8 MAC is about 7.66X smaller in die area and consumes 5.75X less power than the FP32 MAC running at the same speed. The significant saving in circuit area and power consumption validates the effectiveness of the proposed FloatSD8 design for LSTM applications.
In this paper, we applied the novel FloatSD8 weight representation to LSTM training and inference. To fully leverage the low complexity of the FloatSD-based multiplier, the sigmoid function in the LSTM operation is cascaded with FloatSD8 quantization. With this modification, all multiplications in the LSTM network can be executed by an FP8-FloatSD8 multiplier using only two partial products. To further reduce the computational cost and the storage and IO access cost of LSTM training, we used 8-bit FP8 gradients and activations, 16-bit accumulation, and 16-bit master copy weights. Simulations of four different LSTM applications indicate that the proposed FloatSD8-based training method achieves almost the same, and in some cases better, performance compared to the FP32 baselines. To verify the advantage of the proposed method more convincingly, we designed an LSTM inference acceleration circuit for the proposed FloatSD technology. Our FloatSD8 MAC design outperforms its FP32 counterpart in both die area and power consumption, by 7.66X and 5.75X, respectively. Finally, based on this work and the previous FloatSD work on CNNs, we believe that the FloatSD8 representation is suitable for NN training across different domains and model architectures.
In the future, we plan to train more types of NNs using the FloatSD technology to broaden its scope of application. Meanwhile, a general-purpose high-performance FloatSD8-based NN training/inference accelerator SoC has been taped out and will be tested soon.
- S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights and connections for efficient neural network,” in Proc. Int. Conf. Neural Inf. Process. Syst. (NIPS), pp. 1135–1143, December 2015.
- D. D. Lin, S. Talathi, and V. S. Annapureddy, “Fixed point quantization of deep convolutional networks,” arXiv:1511.06393, November 2015.
- P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, et al., “Mixed precision training,” arXiv:1710.03740, October 2017.
- J. Choi, Z. Wang, S. Venkataramani, P. I.-J. Chuang, V. Srinivasan, and K. Gopalakrishnan, “PACT: Parameterized clipping activation for quantized neural networks,” arXiv:1805.06085, May 2018.
- S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients,” arXiv:1606.06160, June 2016.
- P.-C. Lin, M.-K. Sun, C. Kung, and T.-D. Chiueh, “FloatSD: A new weight representation and associated update method for efficient convolutional neural network training,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, June 2019.
- N. Wang, J. Choi, D. Brand, C.-Y. Chen, and K. Gopalakrishnan, “Training deep neural networks with 8-bit floating point numbers,” in Advances in Neural Information Processing Systems 31, 2018.
- S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
- Y. Grandvalet, S. Canu, and S. Boucheron, “Noise injection: Theoretical prospects,” Neural Computation, vol. 9, no. 5, pp. 1093–1108, 1997.
- T. Zhang, Z. Lin, G. Yang, and C. D. Sa, “QPyTorch: A low-precision arithmetic simulation framework,” arXiv:1910.04540, October 2019.
- N. Silveira, T. Dozat, M.-C. de Marneffe, S. Bowman, M. Connor, J. Bauer, et al., “A gold standard dependency corpus for English,” in Proc. of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pp. 2897–2904, Reykjavik, Iceland, May 2014.
- S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning, “A large annotated corpus for learning natural language inference,” in Proc. of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, 2015.
- D. Elliott, S. Frank, K. Sima’an, and L. Specia, “Multi30k: Multilingual English-German image descriptions,” arXiv:1605.00459, May 2016.
- S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel mixture models,” arXiv:1609.07843, September 2016.
- D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in Proc. of International Conference on Learning Representations (ICLR), San Diego, CA, USA, May 2015.