Deep neural networks (DNNs) are becoming attractive solutions for many edge AI applications and have made remarkable progress in various areas such as computer vision, natural language processing, health care, autonomous driving, and surveillance. Meanwhile, with the increasing size and complexity of neural networks, training and deploying a DNN with a large number of parameters and complex data transmission on small, power-constrained edge devices, such as smartphones and wearable devices, becomes increasingly challenging (hao2021enabling; han2015deep; zhang2017machine). In this work, we focus on three primary challenges: ultra-low memory training, ultra-low bitwidth quantization, and ultra-low latency acceleration, and discuss our solutions for each of them.
First, there is an increasing demand for on-device machine learning model training, to preserve data privacy, enable model personalization and lifelong learning, and improve energy efficiency by avoiding massive data transmission to the cloud (7979979; wang2019e2). However, model training has a much larger memory requirement than inference, posing additional challenges for on-device training, since edge devices are usually equipped with limited memory capacity. Therefore, ultra-low memory training methods must be explored to enable on-device training. To this end, we present an end-to-end low-precision tensorized neural network training framework with orders-of-magnitude memory reduction (zhang2021fpga). The rank-adaptive tensorized training method employs a Bayesian approach for automatic tensor rank determination and model compression during training.
Second, to implement DNNs on memory-constrained edge devices, pruning and quantization are promising techniques to reduce the number of weights and the data bitwidth in DNN models, with the extreme case of quantizing the weights down to binary/ternary representations (han2015deep; li2016ternary; Courbariaux-binary). These methods can dramatically reduce the network size as well as the number of multiplications during model execution. Given the tight memory and computing resource budget on the edge, ultra-low bitwidth quantization methods are especially attractive. However, ultra-low bitwidth quantization can easily cause significant degradation of model accuracy, making such aggressive quantization challenging. To address these challenges, we present VecQ, achieving state-of-the-art accuracy under the same compression ratio (gong2020vecq).
Third, for efficient DNN deployment on edge devices, FPGAs are becoming attractive platforms compared with CPUs, GPUs, and digital signal processors (DSPs) (obtract; zhang2018dnnbuilder; qiu2016going).
FPGAs provide the flexibility to be configured as domain-specific architectures that meet various implementation requirements, such as ultra-low latency on edge devices.
In addition, modern SoC FPGAs integrate low-power processors and sufficient interfaces to support the sensors widely used in Internet-of-Things (IoT) applications.
We present T-DLA, the first instruction-based ternarized low-latency deep learning accelerator, with high performance, low resource utilization, and high flexibility for different DNN models (chen2019tdla).
The remainder of this paper is organized as follows. Section 2 introduces our low-memory rank-adaptive on-device training framework; Section 3 introduces our low-bitwidth DNN quantization solution; Section 4 introduces our low-latency DNN accelerator design. In Section 5 we demonstrate the effectiveness of our proposed methods, followed by the conclusions and future work in Section 6.
2. Ultra-Low Memory Training
The large number of model parameters consumes massive computing and memory resources, which prevents direct training of neural networks on edge devices. A promising technique for reducing model parameters is low-rank tensor decomposition (kolda2009tensor; oseledets2011tensor). This method has achieved great success in post-training compression and fixed-rank training (zhou2019tensor; calvi2019tucker; yin2020compressing; tjandra2017compressing; lebedev2014speeding; novikov2015tensorizing; garipov2016ultimate). However, several fundamental issues need to be addressed in on-device one-shot training:
Firstly, a rank-adaptive training framework is needed to avoid combinatorial search of tensor ranks and multiple training runs.
Secondly, hardware-friendly tensor algorithms should be developed to facilitate their implementation on edge devices.
In this section, we summarize our recent work on the algorithm (hawkins2019bayesian; hawkins2020towards) and hardware (zhang2021fpga) levels to address these challenges.
2.1. Bayesian Tensorized Training Models
2.1.1. Low-rank tensor representation.
In many cases we can describe a neural network with far fewer parameters via low-rank tensors. Consider a weight matrix $\mathbf{W} \in \mathbb{R}^{M \times N}$ as an example (other parameters such as convolutional filters and embedding tables can be handled similarly). We can first fold $\mathbf{W}$ into a high-dimensional tensor $\boldsymbol{\mathcal{W}}$ of size $n_1 \times n_2 \times \cdots \times n_d$, where $\prod_{k=1}^{d} n_k = MN$. Then, we can describe the tensor with some low-rank tensor factors $\{\boldsymbol{\mathcal{G}}_k\}$. This can be done with various low-rank tensor decomposition formats, as shown in Fig. 1 (hawkins2020towards). For large fully connected layers and embedding tables, the tensor-train matrix (TTM) format turns out to be highly effective (hawkins2020towards). In the TTM format, $n_k = m_k l_k$, and each $\boldsymbol{\mathcal{G}}_k \in \mathbb{R}^{r_{k-1} \times m_k \times l_k \times r_k}$ is an order-4 TTM core. The rank vector $(r_0, r_1, \dots, r_d)$ with $r_0 = r_d = 1$ determines the model complexity. With low-rank tensors, one may reduce the number of model parameters from an exponential function of $d$ to a linear one.
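As a concrete sketch of the parameter savings, the snippet below folds a hypothetical 512×784 weight matrix into three order-4 TTM cores and counts parameters. The factor shapes and ranks are illustrative choices, not the paper's exact settings.

```python
import numpy as np

# Illustrative fold of a 512x784 weight matrix into a tensor-train matrix
# (TTM). The factorizations and ranks below are example choices, not the
# paper's exact settings.
m_dims = [8, 8, 8]       # 8*8*8   = 512 output features
l_dims = [7, 14, 8]      # 7*14*8  = 784 input features
ranks  = [1, 4, 4, 1]    # r_0 = r_d = 1; internal ranks control compression

# Each order-4 TTM core G_k has shape (r_{k-1}, m_k, l_k, r_k).
rng = np.random.default_rng(0)
cores = [rng.standard_normal((ranks[k], m_dims[k], l_dims[k], ranks[k + 1]))
         for k in range(3)]

# Contract the cores back into a full 512x784 matrix to verify shapes.
full = np.einsum('aijb,bklc,cmnd->ikmjln', cores[0], cores[1], cores[2])
W = full.reshape(512, 784)

dense_params = 512 * 784                 # 401408
ttm_params = sum(c.size for c in cores)  # 224 + 1792 + 256 = 2272
print(dense_params, ttm_params)
```

Here the TTM form stores 2,272 values instead of 401,408; shrinking the internal ranks compresses it further, which is exactly what the automatic rank determination exploits.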
2.1.2. Bayesian Tensorized End-to-End Training.
Despite the high compression ratio via tensor methods, determining the tensor rank in advance is very hard (hillar2013most). This is further complicated by the nonlinear forward model in neural networks, which has prevented tensorized one-shot on-device training in previous works. We have developed two Bayesian models to address this issue:
Stein Variational Inference for TTM Format. In (hawkins2019bayesian), we considered the TTM format. We model each slice of the TTM cores with a zero-mean Gaussian prior density, and further control the variance by two tunable Gamma hyper-priors to enforce low tensor ranks. The actual tensor rank is decided jointly by the training data and the rank-controlling hyper-parameters: starting from an initial rank parameter, we can learn a smaller actual rank, leading to further model compression in the training process. This method uses Stein variational inference (liu2016stein) to compute the posterior density for small- or medium-size neural networks.
Scalable SVI for One-Shot Tensorized Training. In (hawkins2020towards), we have developed a more generic and efficient Bayesian model for tensorized training. This work can handle CP, Tucker, TT and TTM formats. It uses Gaussian priors to model low-rank tensor factors, and uses Half-Cauchy or Log-Uniform hyper-priors to control tensor ranks. We have improved the stochastic variational inference (SVI) (hoffman2013stochastic) by two steps. Firstly, we simplify the posterior density of rank-controlling hyper-parameters to a Delta function to avoid gradient explosion. Secondly, we use a hybrid numerical/analytical update rule inside SVI. This highly scalable method can perform one-shot training of very large-scale neural networks with billions of model parameters.
2.1.3. Performance Summary.
Our first method (hawkins2019bayesian) has been tested on a two-layer fully connected neural network, a 6-layer CNN, and a 110-layer residual neural network. Our work has produced substantially more compact neural networks directly from training with little or no accuracy loss.
Our recent work (hawkins2020towards) has been tested on a practical CNN, a large-scale NLP model (khrulkov2019tensorized), and an extremely large deep learning recommendation model (DLRM) (naumov2019deep) from Facebook. Orders-of-magnitude parameter reduction has been achieved in the training process. As shown in Table 1, training the DLRM with a standard method involves billions of variables. Our proposed method trains far fewer variables due to low-rank tensorization, and further reduces the model parameters during training due to the automatic rank determination, yielding an orders-of-magnitude overall parameter reduction.
2.2. One-Shot On-Device Tensorized Training
To demonstrate on-device training, we have developed a low-precision tensorized training algorithm and its FPGA prototype (zhang2021fpga).
2.2.1. Low-Precision Tensorized Training.
In this case, the training loss function includes two parts: the cross-entropy loss of a neural network classifier dependent on the TTM factors, and a regularization term caused by the Gaussian priors of the TTM factors as well as the Log-Uniform hyper-priors for the rank-controlling parameters. In the training process, both the TTM factors and the rank-controlling parameters are computed. To reduce the training cost on hardware, a low-precision tensorized training algorithm is developed based on the following key ideas:
We use BinaryConnect (courbariaux2015binaryconnect) to compute low-precision TTM factors. BinaryConnect keeps the real values of all low-precision parameters in a buffer. In each iteration, the gradients are accumulated in the buffer, and the low-precision parameters are updated by quantizing the buffer. To handle the non-differentiable quantization function in the training process, we use the straight-through estimator (STE) (bengio2013estimating) to approximate its gradient.
We use different precisions for different variables in the training process: 4 bits to represent the TTM factors, 8 bits for the activations and biases, and 16 bits for the gradients.
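The buffered-update scheme above can be sketched as follows. The toy quadratic loss and the 4-bit uniform quantizer are illustrative stand-ins for the actual tensorized training objective; only the update pattern (full-precision buffer, quantized forward weights, STE gradient) mirrors the method.

```python
import numpy as np

def quantize(w, bits=4):
    """Uniform symmetric quantization to the given bitwidth (illustrative)."""
    levels = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / levels + 1e-12
    return np.round(w / scale) * scale

# BinaryConnect-style update: keep a full-precision buffer, run the model
# with the quantized copy, and (via the straight-through estimator) apply
# the gradient computed at the quantized weights directly to the buffer.
np.random.seed(0)
buffer = np.random.randn(4)            # full-precision "shadow" weights
for _ in range(10):
    w_q = quantize(buffer, bits=4)     # low-precision weights used in the model
    grad = 2 * (w_q - 1.0)             # toy gradient of ||w_q - 1||^2
    buffer -= 0.1 * grad               # STE: quantizer treated as identity
print(quantize(buffer, bits=4))        # quantized weights drift toward 1.0
```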
2.2.2. On-FPGA Training.
To demonstrate our training algorithms on edge devices, we have implemented an FPGA accelerator as shown in Fig. 2 for the low-precision tensorized training framework.
Since our low-rank tensorization can greatly reduce the training variables, all model parameters may be stored in the on-chip BRAM. The data samples, activations, and gradients are stored in the off-chip DRAM during the training process.
The forward and backward propagations are run on the FPGA programmable logic. The TTM factors and rank-controlling parameters are updated on the embedded ARM core.
Three processing elements (PEs) are designed for the forward and backward propagation. PE1 and PE2 are shared by the forward and backward propagations, and they handle tensor contractions. PE1 is used for a two-index tensor contraction which contains the last dimension of two tensor variables. In contrast, PE2 performs a tensor contraction along a single dimension that is not the last. PE3 computes the outer products in a backward propagation.
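The three contraction patterns can be sketched in NumPy; the shapes are illustrative, and the FPGA PEs implement these as streaming pipelines rather than dense library calls.

```python
import numpy as np

# Software sketch of the three contraction patterns handled by the PEs.
A = np.random.randn(6, 5, 4)

# PE1-style: two-index contraction over the LAST dimension of both operands.
B = np.random.randn(7, 4)
pe1 = np.einsum('ijk,lk->ijl', A, B)    # -> shape (6, 5, 7)

# PE2-style: contraction along a single dimension that is not the last.
C = np.random.randn(5, 3)
pe2 = np.einsum('ijk,jm->imk', A, C)    # -> shape (6, 3, 4)

# PE3-style: outer product, as used in the backward propagation.
u, v = np.random.randn(4), np.random.randn(3)
pe3 = np.outer(u, v)                    # -> shape (4, 3)
print(pe1.shape, pe2.shape, pe3.shape)
```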
3. Ultra-low Bitwidth Quantization
Neural network quantization employs low precision (bitwidth) data for efficient model execution. Especially, ultra-low bitwidth quantization leads to much less memory usage, lower complexity of the multiply-accumulate operations, and higher efficiency of model execution, making it an appealing technology for enabling AI at edge devices. However, aggressively lowering the data bitwidth (e.g., lower than 4-bit) is very challenging:
It can easily result in large accuracy degradation (cong2019dac; chen2019tdla; gysel2016hardware), requiring a careful balance between the computing efficiency and the final model accuracy.
Minimizing the quantization loss, i.e., the L2 distance between the original and the quantized values, is an appealing method (han2015deep; ENN2017; TSQ2018; cheng2019uL2Q; li2016ternary; leng2018extremely) but has major drawbacks, such as easily falling into local optima and neglecting the distribution and correlations of the weights (gong2020vecq).
To address such challenges and achieve high-accuracy ultra-low bitwidth quantization, we have proposed a quantization method, namely VecQ (gong2020vecq), with a novel vectorized loss function
and an open-sourced training flow. VecQ can quantize model weights to any bitwidth from 1 to 16 bits and shows exceptional performance, especially under ultra-low bitwidths, e.g., ternary values.
Vectorized Loss Function. We organize the weights within one DNN layer into a vector of dimension $N$, where $N$ is the number of weights, and denote the original floating-point weight vector by $\mathbf{w}_f$ and the quantized weight vector by $\mathbf{w}_q$. Typically, there is a scaling factor $\alpha$ such that $\mathbf{w}_q = \alpha \mathbf{e}_q$, where each element in $\mathbf{e}_q$ is a low-bitwidth representation. Based on the vector representations, we define the quantization angle $\theta$ between $\mathbf{w}_f$ and $\mathbf{w}_q$. Figure 3 shows an example when $N = 3$. The objective is to find the optimal $\alpha$ and $\mathbf{e}_q$ such that $\mathbf{w}_q$ is as close to $\mathbf{w}_f$ as possible.
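The quantization angle can be illustrated with a tiny numeric example. The ternary pattern, the 0.4 threshold, and the variable names below are toy choices for illustration, not VecQ's actual selection rule.

```python
import numpy as np

# Toy computation of the quantization angle between the full-precision
# weight vector w_f and a quantized w_q = alpha * e_q.
w_f = np.array([0.8, -0.3, 0.5])
e_q = np.sign(w_f) * (np.abs(w_f) > 0.4)          # -> [1, 0, 1] (toy pattern)
alpha = np.dot(w_f, e_q) / np.dot(e_q, e_q)       # least-squares scale
w_q = alpha * e_q

cos_theta = np.dot(w_f, w_q) / (np.linalg.norm(w_f) * np.linalg.norm(w_q))
theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))  # quantization angle
print(np.degrees(theta))                          # small angle = good match
```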
We propose the vectorized loss $J_v$ to describe the quantization loss, and we minimize $J_v$ during the quantization. $J_v$ is defined as the summation of the orientation loss $J_o$ and the modulus loss $J_m$ as follows: $J_v = J_o + J_m$.
The orientation loss $J_o$ describes the angle between the two vectors, while the modulus loss $J_m$ describes the squared distance between $\mathbf{w}_f$ and $\mathbf{w}_q$. Notably, by minimizing $J_v$, we usually achieve a lower quantization loss compared with directly minimizing the L2 distance. More details can be found in the VecQ paper (gong2020vecq).
Vectorized Loss Minimization. We minimize $J_v$ in two steps, namely steering and driving, as shown in Figure 4. First, the steering step minimizes the orientation loss $J_o$ to find the best $\mathbf{e}_q$, since $J_o$ is independent of $\alpha$. Second, the driving step minimizes the modulus loss $J_m$ to find the best scaling factor $\alpha$. For a convolution layer, all the weights within the same layer share one scaling factor; for a depth-wise convolution layer, each kernel has its own scaling factor to better represent its smaller number of weights. We also quantize the activations to fixed-point values during training to further reduce memory utilization.
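A minimal sketch of the steering/driving idea for ternary weights follows. The threshold sweep in the steering step and the function name are simplifications invented for illustration; VecQ's actual solver differs, but the two-phase structure (angle first, scale second) is the same.

```python
import numpy as np

def ternary_quantize(w, n_thresholds=50):
    """Two-step sketch of the steering/driving idea for ternary weights.

    Steering: pick the ternary pattern e_q with the smallest angle to w
    (largest cosine similarity) via a simple threshold sweep.
    Driving: choose the scale alpha minimizing ||w - alpha * e_q||^2,
    which has the closed form alpha = <w, e_q> / <e_q, e_q>.
    """
    best_cos, best_e = -1.0, None
    for t in np.linspace(0.05, 1.0, n_thresholds) * np.max(np.abs(w)):
        e = np.sign(w) * (np.abs(w) >= t)   # candidate ternary pattern
        if not e.any():
            continue
        cos = np.dot(w, e) / (np.linalg.norm(w) * np.linalg.norm(e))
        if cos > best_cos:
            best_cos, best_e = cos, e
    alpha = np.dot(w, best_e) / np.dot(best_e, best_e)  # driving step
    return alpha, best_e

np.random.seed(1)
w = np.random.randn(128)
alpha, e_q = ternary_quantize(w)
print(alpha, np.unique(e_q))
```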
Training Flow Integration. In the forward propagation, we quantize the floating-point weights $\mathbf{w}_f$ to $\mathbf{w}_q$, and then use $\mathbf{w}_q$ to compute the output activations, which are also quantized into fixed point. In the backward propagation, the gradients are calculated using $\mathbf{w}_q$ to update $\mathbf{w}_f$.
4. Ultra-low Latency Acceleration
The effectiveness of dedicated FPGA accelerators for DNN models has been widely demonstrated (qiu2016going; chen2019clouddnn). However, ultra-low latency accelerators for edge devices with an extremely limited resource budget still require careful design considerations.
Benefiting from our quantization solution VecQ for ultra-low bitwidth, we have proposed T-DLA, a light-weight ternarized accelerator overlay under strict resource constraint, to achieve ultra-low latency on edge devices (chen2019tdla). The key features of T-DLA include:
An optimized and expressive single instruction multiple data (SIMD) instruction set.
A novel memory sub-system supporting effective data access of the computation modules.
An efficient execution pipeline with low-latency computation modules.
Notes for Table 2 — OP: operation code; FS: input feature size; SAM/SAL: source address most/least significant byte; DAM/DAL: destination address most/least significant byte; KS: kernel size; CC: in/out/activation/pooling selection.
SIMD Instruction Set. To support task scheduling for various DNN models, the instruction set of T-DLA is designed to be simple yet expressive enough for a large variety of DNNs. Each instruction is a 64-bit word (8 bytes) with the format shown in Table 2. The payloads of the different bytes are generated according to the layer configurations.
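A hypothetical sketch of packing such an instruction word is shown below. The one-byte fields follow the abbreviations listed under Table 2, but the byte order and field widths chosen here are assumptions for illustration only; the real layout is defined by Table 2.

```python
# Hypothetical packing of a 64-bit T-DLA instruction word. The field
# names follow Table 2's abbreviations, but the byte ORDER here is an
# assumption for illustration only.
FIELDS = ["OP", "FS", "SAM", "SAL", "DAM", "DAL", "KS", "CC"]

def pack_instruction(**fields):
    word = 0
    for i, name in enumerate(FIELDS):
        value = fields.get(name, 0)
        assert 0 <= value <= 0xFF, f"{name} must fit in one byte"
        word |= value << (8 * (7 - i))  # assumed: OP in most significant byte
    return word

instr = pack_instruction(OP=0x01, FS=32, KS=3)
print(f"{instr:016x}")
```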
Memory Sub-system. The memory subsystem contains two levels of storage to provide low-latency data fetching to the computation units: a simple input buffer and a variable-length line buffer. The simple input buffer is a BRAM buffer for temporary input feature storage; the variable line buffer serves for efficient data streaming into the ternary computation array, as shown in Figure 5. It is designed to support a variable kernel size and a variable buffer depth, which are specified by the instruction to reduce the data transmission latency caused by fixed hardware paths. Once configured by the instruction, it provides a full window of data to the computation array in each clock cycle.
Execution Pipeline and Computation Modules. T-DLA has four major computation modules: 1) a ternary computation array, 2) a set of adder trees, 3) activation and scaling modules, and 4) pooling modules.
Ternary computation array. With our ternarized model training via VecQ, the weights are represented by 2 bits using two's complement encoding, so that the multiplications in the convolution layers are simplified to selection and inversion logic. Benefiting from such simplified logic, we can achieve parallelism along the input channel, the output channel, and the kernel dimensions. The computation array is constructed from a configurable number of computation units, which can process the same number of input data simultaneously. The maximum numbers of input and output channels that can be processed by the computation array and the maximum allowable kernel size are pre-defined; these parameters and the length of the line buffer are all configurable and can be determined based on the on-chip resource availability.
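The select/invert simplification can be illustrated in software: with ternary weights, each "multiplication" degenerates into passing the input through, negating it, or skipping it.

```python
# With ternary weights in {-1, 0, +1}, each multiply in a MAC reduces to
# select / invert / skip -- the logic the ternary computation array uses.
def ternary_mac(inputs, weights):
    acc = 0
    for x, w in zip(inputs, weights):
        if w == 1:        # select: pass the input through
            acc += x
        elif w == -1:     # invert: two's-complement negation
            acc -= x
        # w == 0 contributes nothing
    return acc

x = [3, -2, 5, 7]
w = [1, -1, 0, 1]
print(ternary_mac(x, w))  # 3 + 2 + 0 + 7 = 12
```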
Adder tree. Since the computation array is built using only LUTs and FFs, we use DSPs to construct the adder trees. We take advantage of the SIMD mode of the DSPs, where the internal carry propagation between segments is blocked to ensure independent operation. Therefore, we split the 48-bit input of a DSP into four 12-bit independent accumulation channels, so that a single DSP can perform addition for 8 pieces of input data and provide 4 outputs. Benefiting from the SIMD mode, the DSPs can provide outputs in every clock cycle once the internal register pipelines are filled. Furthermore, the clock frequency of the DSPs is configured to be higher than that of the other logic with the help of input/output asynchronous FIFOs, which further reduces the processing latency.
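A software model of the lane-splitting trick is shown below. In hardware the DSP blocks inter-lane carries; this model simply assumes each 12-bit lane sum does not overflow, so a single wide addition behaves like four independent narrow ones.

```python
# Software model of the DSP SIMD trick: one 48-bit accumulator treated as
# four independent 12-bit lanes. Assumes each lane sum fits in 12 bits.
LANE_BITS, LANES = 12, 4
MASK = (1 << LANE_BITS) - 1

def simd_add(a_lanes, b_lanes):
    a = sum(v << (LANE_BITS * i) for i, v in enumerate(a_lanes))
    b = sum(v << (LANE_BITS * i) for i, v in enumerate(b_lanes))
    s = a + b  # one 48-bit addition performs four 12-bit additions
    return [(s >> (LANE_BITS * i)) & MASK for i in range(LANES)]

print(simd_add([100, 200, 300, 400], [5, 6, 7, 8]))  # [105, 206, 307, 408]
```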
5. Experimental Results
In this section we demonstrate the effectiveness of our methods, including the ultra-low memory training framework, the ultra-low bitwidth VecQ quantization, and the ultra-low latency T-DLA design. Notably, all these works are open-sourced.
Table 3. The on-device ultra-low memory training method on the Fashion MNIST dataset.

| Method | Training accuracy | Testing accuracy | Model parameters | Memory in bits | Memory reduction |
|---|---|---|---|---|---|
| Floating, w/o prior | 92.54% | 88.03% | 31.4 | | |
| Fixed, w/o prior | 88.31% | 86.67% | 243 | | |
| Floating, w/ prior | 90.17% | 87.88% | | | |
| Fixed, w/ prior | 85.45% | 84.86% | 292 | | |
5.1. On-device Training
We implement our low-precision rank-adaptive tensorized training on an Avnet Ultra96-V2 FPGA board and use it to train a two-layer neural network for a classification task on the Fashion MNIST dataset. There are 512 neurons in the hidden layer, and the weight matrices of both layers are folded into high-dimensional tensors. We use the PyTorch and TensorLy modules to implement our training algorithm on the embedded processor, and set the FPGA clock rate to 100 MHz. We compare the training methods with and without the low-rank TT priors. As shown in Table 3, our method achieves substantial memory reduction for the model parameters compared with standard non-tensorized training. This on-FPGA training achieves both speedup and energy reduction compared with training on an embedded CPU.
5.2. Ultra-low Bitwidth Quantization
We use MNIST, Cifar10, and ImageNet to evaluate the ultra-low bitwidth quantization, VecQ. The evaluated DNN models include Lenet-5, Cifarnet, a VGG-like network (hperf), and Resnet-18.
| | Lenet-5 | Cifarnet | VGG-like | Resnet-18 |
|---|---|---|---|---|
| Param. Total (M) | 0.43 | 0.279 | 5.35 | 11.69 |
| Param. Conv (M) | 0.025 | 0.258 | 1.114 | 11.177 |
| Resource Utilization (%) | 79 / 47.47 / 68.93 / 91.82 |
|---|---|
| Clock Frequency Logic / Adder (MHz) | 125 / 250 |
| Peak Performance (GOPS) | 400 |
| Dataset | Design | Model | Acc. (%) | F., W. (bits) | fps | Platform |
|---|---|---|---|---|---|---|
| CIFAR 10 | (hperf) | VGG-like | 86.71 | 8, 2 | 27043 | VC709 |
| CIFAR 10 | (accbnn) | VGG-like | 88.68 | 1, 1 | 168 | Zedboard |

Xeon: Xeon E5-2630 v3; 1080Ti: Nvidia 1080Ti.
5.2.1. Classification accuracy
The classification accuracy on different datasets is shown in Table 4. For simplicity, we only show the top-1 accuracy. Compared to the floating-point models (Floating in the table), the classification accuracy using ternary weights and quantized scalars and activations shows negligible degradation. VecQ also achieves superior accuracy compared to recent works (efficient; li2016ternary), in which only the weights are ternarized but not the scalars and activations. Our proposed method shows better accuracy for Resnet-18 on the ImageNet dataset. These results demonstrate the scalability and stability of VecQ, especially in aggressive low-bitwidth quantization scenarios.
5.2.2. Model size reduction
VecQ also greatly reduces the memory footprint (Mem. Reduc.) as shown in Table 5. A ternary weight occupies only 2 bits, whereas the original floating-point representation requires 32 bits. As shown in Table 5, for convolution layers, VecQ compresses the parameters nearly to the theoretical limit (almost 16× reduction). We quantize the last FC layer to 12 bits to maintain accuracy, so networks with fewer or no FC layers have higher compression ratios, such as Cifarnet and Resnet-18. Specifically, VecQ reduces the size of Resnet-18 by up to 92.93% (14.14×) relative to floating point.
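A back-of-the-envelope check of this compression ratio, under the simplifying assumption that all non-convolution parameters of Resnet-18 sit in the 12-bit FC layer (which is why the estimate lands near, but not exactly at, the reported 14.14×):

```python
# Rough Resnet-18 compression estimate: 2-bit ternary conv weights,
# 12-bit last FC layer, 32-bit floating-point baseline. Assumes all
# non-conv parameters belong to the 12-bit FC layer (a simplification).
conv_params = 11.177e6               # Table 5: conv parameters, 2 bits each
fc_params = 11.69e6 - 11.177e6       # remaining parameters, 12 bits each
orig_bits = (conv_params + fc_params) * 32
quant_bits = conv_params * 2 + fc_params * 12
ratio = orig_bits / quant_bits
print(round(ratio, 2))               # roughly 13x
```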
5.3. Ultra-low Latency Acceleration
We use the models quantized by VecQ to evaluate our T-DLA accelerator design in terms of accuracy and frames per second (fps). The measurements of the original models are taken on a server with two Intel Xeon E5-2630 v3 CPUs and one Nvidia 1080 Ti GPU. T-DLA is implemented on a Xilinx Zedboard FPGA, which is suitable for edge applications with very limited logic resources. It has an on-chip dual-core ARM Cortex-A9, and provides 53.2K LUTs, 106.4K FFs, 140 BRAM blocks of 36Kb each, and 220 DSPs. The Vivado Design Suite 2019.2 is used for system implementation.
5.3.1. Hardware Resource and Processing Latency Evaluation
We choose an accelerator configuration that fully utilizes the given resources, shown in Table 6, together with the execution latency of the different models under this configuration. We only show the most important configuration parameters, including the parallelism factors and the quantized bitwidth of the activations. As can be seen in Table 6, T-DLA with a customized configuration uses almost all of the resources, especially the DSPs. The targeted FPGA can support up to 250 MHz for the DSPs, which is twice the frequency of the other logic, benefiting from the ternary computation array and the independent clock design of the adder trees.
5.3.2. Performance Comparison
We compare T-DLA in terms of accuracy and fps with existing designs, either using the same DNN model or the same dataset. The results are shown in Table 7. For the MNIST dataset, the design in (finn) shows higher fps because the DNN model it uses is simpler and the ZC706 platform has far more resources than ours. However, our implementation on the Zedboard achieves a comparable fps (62051) to a design with 3-bit weights (impl16) on the ZC706 platform (70000). On the CIFAR10 dataset, our design shows a dominating accuracy advantage among all the VGG-like models. On the ImageNet dataset, we directly compare our results with the floating-point version: T-DLA shows longer execution latency than the GPU but outperforms the CPU.
6. Conclusions and Future Work
In this paper, we summarized our recent efforts toward efficient on-device AI development, covering both training and inference, and focused on three major challenges of edge AI. First, we presented on-device training with ultra-low memory usage via a novel rank-adaptive tensorized neural network model, which offers orders-of-magnitude memory reduction during training. Second, we introduced VecQ, a novel quantization method that supports ultra-low bitwidth quantization with negligible accuracy degradation. Third, we presented T-DLA, an ultra-low latency DNN accelerator design for ternarized DNNs achieving state-of-the-art performance. Building on these results, we expect further research breakthroughs to boost the development and deployment of edge AI.