1. Introduction
Deep neural networks (DNNs) have become attractive solutions for many edge AI applications and have made remarkable progress in areas such as computer vision, natural language processing, health care, autonomous driving, and surveillance. Meanwhile, as the size and complexity of neural networks grow, training and deploying a DNN with a large number of parameters and complex data transmission on small, power-constrained edge devices, such as smartphones and wearable devices, becomes increasingly challenging (hao2021enabling; han2015deep; zhang2017machine). In this work, we focus on three primary challenges: ultra-low memory training, ultra-low bitwidth quantization, and ultra-low latency acceleration, and discuss our solutions for each of them.
First, there is an increasing demand for on-device machine learning model training, to preserve data privacy, enable model personalization and lifelong learning, and improve energy efficiency by avoiding massive data transmission to the cloud (7979979; wang2019e2). However, model training has a much larger memory requirement than inference, posing additional challenges for on-device training, since edge devices are usually equipped with limited memory capacity. Therefore, ultra-low memory training methods must be explored to enable on-device training. To this end, we present an end-to-end low-precision tensorized neural network training framework with orders-of-magnitude memory reduction (zhang2021fpga). The rank-adaptive tensorized training method employs a Bayesian approach for automatic tensor rank determination and model compression during training.
Second, to implement DNNs on memory-constrained edge devices, pruning and quantization are promising techniques to reduce the number of weights and the data bitwidth in DNN models, with the extreme case of quantizing the weights down to binary/ternary representations (han2015deep; li2016ternary; Courbariauxbinary). These methods can dramatically reduce the network size as well as the number of multiplications during model execution. Given the tight memory and computing resource budget on the edge, ultra-low bitwidth quantization methods are especially attractive. However, ultra-low bitwidth quantization can easily cause significant degradation of the model accuracy, making such aggressive quantization challenging. To address this challenge, we present a novel ternary weight quantization method built on a vectorized loss function, achieving state-of-the-art accuracy under the same compression ratio (gong2020vecq).
Third, for efficient DNN deployment on edge devices, FPGAs are attractive platforms compared with CPUs, GPUs, and digital signal processors (DSPs) (obtract; zhang2018dnnbuilder; qiu2016going). FPGAs provide the flexibility to be configured as domain-specific architectures that can meet various implementation requirements, such as ultra-low latency, on edge devices. In addition, modern SoC FPGAs integrate low-power processors and sufficient interfaces to support widely used sensors for Internet-of-Things (IoT) applications. We present the first instruction-based ternarized low-latency deep learning accelerator with high performance, low resource utilization, and high flexibility across DNN models (chen2019tdla).
The remainder of this paper is organized as follows. Section 2 introduces our low-memory rank-adaptive on-device training framework; Section 3 introduces our low-bitwidth DNN quantization solution; Section 4 introduces our low-latency DNN accelerator design. In Section 5 we demonstrate the effectiveness of our proposed methods, followed by conclusions and future work in Section 6.
2. Ultra-Low Memory Training
The large number of model parameters consumes massive computing and memory resources, which prevents direct training of neural networks on edge devices. A promising technique for reducing model parameters is low-rank tensor decomposition (kolda2009tensor; oseledets2011tensor). This method has achieved great success in post-training compression and fixed-rank training (zhou2019tensor; calvi2019tucker; yin2020compressing; tjandra2017compressing; lebedev2014speeding; novikov2015tensorizing; garipov2016ultimate). However, several fundamental issues need to be addressed for on-device one-shot training:

- Firstly, a rank-adaptive training framework is needed to avoid a combinatorial search over tensor ranks and multiple training runs.
- Secondly, hardware-friendly tensor algorithms should be developed to facilitate their implementation on edge devices.

In this section, we summarize our recent work at the algorithm (hawkins2019bayesian; hawkins2020towards) and hardware (zhang2021fpga) levels to address these challenges.
2.1. Bayesian Tensorized Training Models
2.1.1. Low-rank tensor representation.
In many cases we can describe a neural network with far fewer parameters via low-rank tensors. Consider a weight matrix W ∈ ℝ^{M×N} as an example (other parameters, such as convolutional filters and embedding tables, can be handled similarly). We first fold W into a high-dimensional tensor of size m_1 × n_1 × ⋯ × m_d × n_d, where M = ∏_{i=1}^{d} m_i and N = ∏_{i=1}^{d} n_i. Then, we describe the tensor with low-rank tensor factors. This can be done in various low-rank tensor decomposition formats, as shown in Fig. 1 (hawkins2020towards), where G^{(i)} denotes the associated tensor factors. For large fully connected layers and embedding tables, the tensor-train matrix (TTM) format turns out to be highly effective (hawkins2020towards). In the TTM format, each G^{(i)} ∈ ℝ^{r_i × m_i × n_i × r_{i+1}} is an order-4 TTM core. The vector (r_1, r_2, …, r_{d+1}) with r_1 = r_{d+1} = 1 holds the tensor ranks that determine the model complexity. With low-rank tensors, one may reduce the number of model parameters from an exponential function of d to a linear one.
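As a quick sanity check of these savings, the following sketch counts the parameters of a TTM factorization against the dense layer it replaces. The folding sizes and ranks below are illustrative choices, not values from the paper.

```python
from math import prod

def ttm_param_count(ms, ns, ranks):
    """Parameters in a TTM factorization whose cores have shape
    (r_i, m_i, n_i, r_{i+1}); ranks has length d+1 with ranks[0] = ranks[-1] = 1."""
    d = len(ms)
    assert len(ranks) == d + 1 and ranks[0] == ranks[-1] == 1
    return sum(ranks[i] * ms[i] * ns[i] * ranks[i + 1] for i in range(d))

# A 1024 x 1024 dense layer folded with m = n = (4, 8, 8, 4):
ms = ns = (4, 8, 8, 4)
dense = prod(ms) * prod(ns)                     # 1,048,576 parameters
ttm = ttm_param_count(ms, ns, (1, 8, 8, 8, 1))  # 8,448 parameters
print(dense, ttm, dense // ttm)                 # roughly a 124x reduction
```

Lowering the internal ranks shrinks the factor count further, which is exactly what the automatic rank determination below exploits during training.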
2.1.2. Bayesian Tensorized End-to-End Training.
Despite the high compression ratios of tensor methods, determining the tensor rank in advance is very hard (hillar2013most). This is further complicated by the nonlinear forward model of neural networks, which has prevented tensorized one-shot on-device training in previous works. We have developed two Bayesian models to address this issue:

- Stein Variational Inference for the TTM Format. In (hawkins2019bayesian), we considered the TTM format. We model each slice of a TTM core with a zero-mean Gaussian prior density. We further control the variance with two tunable Gamma hyperpriors to enforce low tensor ranks. The actual tensor rank is decided jointly by the training data and the rank-controlling hyperparameters. Starting from an initial rank parameter, we can learn the actual rank, leading to further model compression during training. This method uses Stein variational inference (liu2016stein) to compute the posterior density for small or medium-size neural networks.
- Scalable SVI for One-Shot Tensorized Training. In (hawkins2020towards), we developed a more generic and efficient Bayesian model for tensorized training. This work can handle the CP, Tucker, TT, and TTM formats. It uses Gaussian priors to model the low-rank tensor factors, and Half-Cauchy or Log-Uniform hyperpriors to control tensor ranks. We improved stochastic variational inference (SVI) (hoffman2013stochastic) in two ways. Firstly, we simplify the posterior density of the rank-controlling hyperparameters to a Delta function to avoid gradient explosion. Secondly, we use a hybrid numerical/analytical update rule inside SVI. This highly scalable method can perform one-shot training of very large-scale neural networks with billions of model parameters.
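The effect of the rank-controlling hyperpriors is to drive the "variance" of unneeded rank components toward zero so they can be pruned during training. The toy below is only a stand-in for that idea, not the SVI algorithm itself: it reads an analogous per-rank energy signal off the singular values of a noisy low-rank matrix and thresholds it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic weight matrix with true rank 3, observed with small noise.
U, V = rng.standard_normal((64, 3)), rng.standard_normal((3, 64))
W = U @ V + 0.01 * rng.standard_normal((64, 64))

# In the Bayesian model, Half-Cauchy / Log-Uniform hyperpriors shrink the
# rank-controlling variances of redundant components toward zero. As a
# stand-in, we use the normalized per-component energy of W and count the
# components that survive a shrinkage threshold.
s = np.linalg.svd(W, compute_uv=False)
lam = s**2 / s.max()**2          # normalized per-rank "energy"
rank = int(np.sum(lam > 1e-3))   # surviving components
print(rank)                      # -> 3: the redundant ranks are pruned
```

In the actual method this pruning happens jointly with training, so the final model is compressed without a separate search over candidate ranks.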
2.1.3. Performance Summary.


- Our first method (hawkins2019bayesian) has been tested on a two-layer fully connected neural network, a 6-layer CNN, and a 110-layer residual neural network. It produces substantially more compact neural networks directly from training with little or no accuracy loss.

- Our recent work (hawkins2020towards) has been tested on a practical CNN, a large-scale NLP model (khrulkov2019tensorized), and an extremely large deep learning recommendation model (DLRM) (naumov2019deep) from Facebook. Orders-of-magnitude parameter reduction has been achieved in the training process. As shown in Table 1, training the DLRM with a standard method involves 4.25B variables. Our proposed method trains only 2.36M variables thanks to low-rank tensorization, and it further reduces the model parameters to 164K during training thanks to automatic rank determination. The overall parameter reduction ratio in the training process is about 25,900×.
             | standard | tensorization | rank-adaptive training
# parameters | 4.25B    | 2.36M         | 164K
compression  | N/A      | ≈1,800×       | ≈25,900×
Table 1. Number of DLRM training variables.
2.2. One-Shot On-Device Tensorized Training
To demonstrate on-device training, we have developed a low-precision tensorized training algorithm and its FPGA prototype (zhang2021fpga).
2.2.1. Low-Precision Tensorized Training.
We consider the maximum a posteriori probability (MAP) estimate of the Bayesian model in (hawkins2020towards). In this case, the training loss function includes two parts: the cross-entropy loss of a neural network classifier that depends on the TTM factors, and a regularization term induced by the Gaussian priors of the TTM factors as well as the Log-Uniform hyperpriors of the rank-controlling parameters. During training, both the TTM factors and the rank-controlling parameters are computed. To reduce the training cost on hardware, a low-precision tensorized training algorithm is developed based on the following key ideas:

- We use BinaryConnect (courbariaux2015binaryconnect) to compute low-precision TTM factors. BinaryConnect keeps the real values of all low-precision parameters in a buffer. In each iteration, the gradients are accumulated in the buffer, and the low-precision parameters are updated by quantizing the buffer. To handle the non-differentiable quantization function during training, we use the straight-through estimator (STE) (bengio2013estimating) to approximate its gradient.

- We use different precisions for different variables in the training process. Specifically, we use 4 bits for the TT factors, 8 bits for the activations and biases, and 16 bits for the gradients.
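The BinaryConnect-plus-STE loop can be sketched in a few lines. This is an illustrative NumPy model, not the FPGA implementation: the uniform quantizer and the toy loss are placeholders, and the STE appears as applying the gradient of the quantized weights directly to the full-precision buffer.

```python
import numpy as np

def quantize(x, bits):
    """Uniform symmetric fixed-point quantizer (illustrative scheme)."""
    scale = np.max(np.abs(x)) / (2 ** (bits - 1) - 1) + 1e-12
    return np.round(x / scale) * scale

# BinaryConnect-style update: the real-valued buffer w_buf accumulates
# gradients; the low-precision weights are re-quantized from it each step.
rng = np.random.default_rng(0)
w_buf = rng.standard_normal(8) * 0.1   # full-precision shadow weights
lr = 0.1
for step in range(3):
    w_q = quantize(w_buf, bits=4)      # 4-bit weights used in fwd/bwd
    grad = 2 * w_q                     # toy loss L = sum(w_q**2)
    # STE: d(quantize)/dw is taken as 1, so the gradient computed with
    # the quantized weights updates the full-precision buffer directly.
    w_buf -= lr * grad
print(quantize(w_buf, bits=4))
```

Only the quantized weights ever participate in forward/backward compute, which is what lets the hardware datapath stay at 4/8/16-bit precision.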
2.2.2. On-FPGA Training.
To demonstrate our training algorithms on edge devices, we have implemented an FPGA accelerator, shown in Fig. 2, for the low-precision tensorized training framework.


- Since our low-rank tensorization greatly reduces the number of training variables, all model parameters can be stored in on-chip BRAM. The data samples, activations, and gradients are stored in off-chip DRAM during training.

- The forward and backward propagations run on the FPGA programmable logic. The TTM factors and rank-controlling parameters are updated on the embedded ARM core.

- Three processing elements (PEs) are designed for the forward and backward propagation. PE1 and PE2 are shared by the forward and backward propagations and handle tensor contractions. PE1 is used for a two-index tensor contraction that involves the last dimension of two tensor variables; PE2 performs a tensor contraction along a single dimension that is not the last. PE3 computes the outer products in the backward propagation.
3. Ultra-Low Bitwidth Quantization
Neural network quantization employs low-precision (low-bitwidth) data for efficient model execution. In particular, ultra-low bitwidth quantization leads to much lower memory usage, lower complexity of the multiply-accumulate operations, and higher efficiency of model execution, making it an appealing technology for enabling AI on edge devices. However, aggressively lowering the data bitwidth (e.g., below 4 bits) is very challenging:


- It can easily cause large accuracy degradation (cong2019dac; chen2019tdla; gysel2016hardware), requiring a careful balance between computing efficiency and final model accuracy.

- Minimizing the quantization loss, i.e., the L2 distance between the original and the quantized values, is an appealing approach (han2015deep; ENN2017; TSQ2018; cheng2019uL2Q; li2016ternary; leng2018extremely) but has major drawbacks, such as easily falling into local optima and neglecting the distribution and correlations of the weights (gong2020vecq).
To address these challenges and achieve high-accuracy ultra-low bitwidth quantization, we proposed a quantization method, namely VecQ (gong2020vecq), with a novel vectorized loss function and an open-sourced training flow. VecQ can quantize model weights to anywhere from 1-bit to 16-bit representations and shows exceptional performance especially at ultra-low bitwidths, e.g., ternary values.
Vectorized Loss Function. We organize the weights within one DNN layer into a vector, where the vector length is the number of weights; we denote the original floating-point weight vector by w_f and the quantized weight vector by w_q. Typically, there is a scaling factor α such that w_f ≈ α·w_q, where each element of w_q is a low-bitwidth representation. Based on the vector representations, we define the quantization angle θ between w_f and w_q; Figure 3 shows an example. The objective is to find the optimal α and w_q such that α·w_q is as close to w_f as possible.
We propose the vectorized loss J_v to describe the quantization loss, and we minimize J_v during quantization. J_v is defined as the summation of the orientation loss J_o and the modulus loss J_m as follows:
(1)  J_v = J_o + J_m
The orientation loss J_o describes the angle between the two vectors, while the modulus loss J_m describes the squared distance between w_f and α·w_q. Notably, by minimizing J_v, we usually achieve a lower quantization loss compared with directly minimizing the L2 distance. More details can be found in the VecQ paper (gong2020vecq).
Vectorized Loss Minimization. We minimize J_v in two steps, namely steering and driving, as shown in Figure 4. First, the steering step minimizes the orientation loss to find the best w_q, since the quantization angle is independent of α. Second, the driving step minimizes the modulus loss to find the best scaling factor α. For a convolution layer, all the weights within the same layer share one scaling factor; for a depthwise convolution layer, each kernel has its own scaling factor to better represent its smaller number of weights. We also quantize the activations to fixed-point values during training to further reduce memory utilization.
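For ternary weights, the two steps above admit a compact sketch. The code below is our own illustration rather than the exact VecQ procedure: steering is approximated by sweeping a magnitude threshold to maximize the cosine of the quantization angle, and driving uses the least-squares optimal scale.

```python
import numpy as np

def vecq_ternary(w):
    """Two-step ternarization in the spirit of VecQ (illustrative sketch).
    Steering: pick the ternary vector with the smallest angle to w by
    sweeping the magnitude threshold. Driving: closed-form scale alpha
    minimizing ||w - alpha * w_q||^2."""
    a = np.abs(w)
    best, best_cos = None, -1.0
    for t in np.unique(a):                 # candidate thresholds
        wq = np.sign(w) * (a >= t)         # entries in {-1, 0, +1}
        cos = w @ wq / (np.linalg.norm(w) * np.linalg.norm(wq))
        if cos > best_cos:
            best, best_cos = wq, cos
    alpha = (w @ best) / (best @ best)     # driving step
    return alpha, best

w = np.array([0.9, -0.85, 0.1, -0.05, 0.8])
alpha, wq = vecq_ternary(w)
print(alpha, wq)   # large-magnitude weights -> +/-1, small ones -> 0
```

Because the angle does not depend on α, the threshold search and the scale fit decouple cleanly, which is the essence of the steering/driving split.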
Training Flow Integration.
We integrate our VecQ solution into the TensorFlow and PyTorch DNN training frameworks. For each layer, in the forward propagation, we first quantize the weights from w_f to α·w_q, and then use the quantized weights to compute the output activations, which are also quantized to fixed point. In the backward propagation, the gradients are likewise calculated with the quantized weights and used to update w_f.
4. Ultra-Low Latency Acceleration
The effectiveness of dedicated FPGA accelerators for DNN models has been widely demonstrated (qiu2016going; chen2019clouddnn). However, ultra-low latency accelerators for edge devices with an extremely limited resource budget still require careful design considerations.
Benefiting from our ultra-low bitwidth quantization solution VecQ, we proposed T-DLA, a lightweight ternarized accelerator overlay under strict resource constraints that achieves ultra-low latency on edge devices (chen2019tdla). The key features of T-DLA include:

- An optimized and expressive single instruction multiple data (SIMD) instruction set.
- A novel memory subsystem supporting effective data access for the computation modules.
- An efficient execution pipeline with low-latency computation modules.
Byte Idx | 7  | 6  | 5   | 4   | 3   | 2   | 1  | 0
Load     | OP | FS | SAM | SAL | DAM | DAL | KS | CC
Table 2. Load instruction encoding.
Notes: OP: operation code; FS: input feature size; SAM/SAL: source address most/least significant byte; DAM/DAL: destination address most/least significant byte; KS: kernel size; CC: in/out/activation/pooling selection.
SIMD Instruction Set. To support task scheduling for various DNN models, the instruction set of T-DLA is designed to be simple yet expressive enough for a large variety of DNNs. Each instruction is a 64-bit word (8 bytes) with the format shown in Table 2. The payloads of the different bytes are generated according to the layer configurations.
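The byte layout of Table 2 can be modeled as straightforward bit packing. The field values below are made up for illustration; only the byte positions follow the table.

```python
def pack_load(op, fs, sam, sal, dam, dal, ks, cc):
    """Pack one 64-bit T-DLA-style Load instruction.
    Byte layout follows Table 2 (byte 7 = OP down to byte 0 = CC);
    the concrete field values used below are hypothetical."""
    fields = [op, fs, sam, sal, dam, dal, ks, cc]
    assert all(0 <= f <= 0xFF for f in fields)   # each field is one byte
    word = 0
    for f in fields:          # OP ends up in the most significant byte
        word = (word << 8) | f
    return word

inst = pack_load(op=0x01, fs=32, sam=0x00, sal=0x40,
                 dam=0x00, dal=0x80, ks=3, cc=0x05)
print(hex(inst))   # -> 0x120004000800305
```

Keeping every field byte-aligned is what lets the hardware decode an instruction with simple wire slicing instead of shifters.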
Memory Subsystem. The memory subsystem contains two levels of storage to provide low-latency data fetching to the computation units: a simple input buffer and a variable-length line buffer. The simple input buffer is a BRAM buffer for temporary input feature storage; the variable-length line buffer serves the efficient streaming of data into the ternary computation array, as shown in Figure 5. It is designed to support a variable kernel size and a variable buffer depth, both specified by the instruction, to reduce the data transmission latency caused by fixed hardware paths. Once configured by the instruction, it provides a new set of data to the computation array every clock cycle.
Execution Pipeline and Computation Modules. T-DLA has four major computation modules: 1) a ternary computation array, 2) a set of adder trees, 3) activation and scaling modules, and 4) pooling modules.
Ternary computation array. With our ternarized model training via VecQ, the weights are represented by 2 bits using two's complement encoding, so that the multiplications in the convolution layers are simplified to selection and inversion logic. Benefiting from this simplified logic, we can exploit parallelism along the input channels, the output channels, and the kernel dimensions. The computation array is constructed from parallel computation units that process the corresponding number of input data simultaneously; the maximum numbers of input and output channels and the maximum allowable kernel size that the array can process are predefined. These array dimensions and the length of the line buffer are all configurable and can be determined based on the on-chip resource availability.
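The select/invert simplification is easy to see in software. The sketch below is a behavioral model of one ternary multiply-accumulate, not the RTL: with weights in {-1, 0, +1}, every "multiplication" reduces to dropping, passing, or negating the input.

```python
def ternary_mac(x, w):
    """Ternary multiply-accumulate: with w_i in {-1, 0, +1} the product
    x_i * w_i reduces to select (w_i == 0 drops the term) and invert
    (w_i == -1 negates it) -- the logic T-DLA builds from LUTs instead
    of multipliers. Behavioral model for illustration."""
    acc = 0
    for xi, wi in zip(x, w):
        if wi == 1:
            acc += xi          # select
        elif wi == -1:
            acc -= xi          # select + invert
        # wi == 0: term dropped, no datapath activity needed
    return acc

print(ternary_mac([3, -2, 5, 7], [1, -1, 0, -1]))   # 3 + 2 + 0 - 7 = -2
```

Because no multiplier is needed, each unit costs only a few LUTs, which is what makes the wide three-way parallelism affordable on a small FPGA.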
Adder tree. Since the computation array is built only from LUTs and FFs, we use DSPs to construct the adder trees. We take advantage of the SIMD mode of the DSPs, in which the internal carry propagation between segments is blocked to ensure independent operation. We therefore split the 48-bit input of a DSP into four 12-bit independent accumulation channels, so that a single DSP can perform additions for 8 pieces of input data and provide 4 outputs. Benefiting from the SIMD mode, the DSPs provide outputs every clock cycle once the internal register pipeline is filled. Furthermore, the clock frequency of the DSPs is configured to be higher than that of the other logic with the help of input/output asynchronous FIFOs, which further reduces the processing latency.
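The four-lane split can be modeled as one wide addition whose carries never cross 12-bit lane boundaries. This is a behavioral sketch of the SIMD-mode arithmetic, not a timing-accurate DSP model.

```python
LANES, LANE_BITS = 4, 12
MASK = (1 << LANE_BITS) - 1

def simd_add48(acc, vals):
    """Model of a DSP in SIMD mode: one 48-bit addition acts as four
    independent 12-bit accumulations because inter-segment carry
    propagation is blocked. acc is the packed 48-bit state."""
    assert len(vals) == LANES
    out = 0
    for i, v in enumerate(vals):
        lane = (acc >> (i * LANE_BITS)) & MASK
        lane = (lane + v) & MASK       # carry never reaches lane i+1
        out |= lane << (i * LANE_BITS)
    return out

def unpack(acc):
    return [(acc >> (i * LANE_BITS)) & MASK for i in range(LANES)]

acc = 0
for vals in [[100, 200, 300, 400], [5, 6, 7, 8]]:
    acc = simd_add48(acc, vals)
print(unpack(acc))   # [105, 206, 307, 408]
```

One physical adder thus accumulates four partial sums per cycle, which is why the DSP budget rather than the LUT budget sets the adder-tree throughput.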
Other modules. The ReLU activation module, the linear scaling module, and the max pooling module are all designed to complete in a single clock cycle to reduce the depth of the execution pipeline.
5. Experimental Results
In this section, we demonstrate the effectiveness of our methods, including the ultra-low memory training framework, the ultra-low bitwidth VecQ quantization, and the ultra-low latency T-DLA design. Notably, all these works are open-sourced.
Method              | Training acc. | Testing acc. | Model parameters | Memory in bits | Memory reduction
Vanilla             | 95.75%        | 89.27%       | —                | —              | N/A
Floating, w/o prior | 92.54%        | 88.03%       | —                | —              | 31.4
Fixed, w/o prior    | 88.31%        | 86.67%       | —                | —              | 243
Floating, w/ prior  | 90.17%        | 87.88%       | —                | —              | —
Fixed, w/ prior     | 85.45%        | 84.86%       | —                | —              | 292
Table 3. The on-device ultra-low memory training method on the Fashion-MNIST dataset.
5.1. On-Device Training
We implement our low-precision rank-adaptive tensorized training on an Avnet Ultra96-V2 FPGA board and use it to train a two-layer neural network for a classification task on the Fashion-MNIST dataset. There are 512 neurons in the hidden layer, and the weight matrix of each layer is folded into a higher-order tensor for tensorized training. We use the PyTorch and TensorLy modules to implement our training algorithm on the embedded processor, and we set the FPGA clock rate to 100 MHz. We compare the training methods with and without the low-rank TT priors. As shown in Table 3, our method achieves a substantial memory reduction for the model parameters compared with standard non-tensorized training. This on-FPGA training achieves a significant speedup and energy reduction compared with training on an embedded CPU.
5.2. Ultra-Low Bitwidth Quantization
Dataset               | MNIST  | CIFAR10  | CIFAR10  | ImageNet
Model                 | Lenet5 | Cifarnet | VGG-like | Resnet18
Floating              | 99.41  | 80.54    | 93.49    | 69.60
IJCNN'17 (efficient)  | 98.33  | —        | 87.89    | —
NIPS'16 (li2016ternary) | 99.35 | —       | 92.56    | 61.8
Ours                  | 99.5   | 78.7     | 92.94    | 68.23
Table 4. Top-1 classification accuracy (%).
We use MNIST, CIFAR10, and ImageNet to evaluate the ultra-low bitwidth quantization, VecQ. The evaluated DNN models include Lenet5, Cifarnet, a VGG-like network (hperf), and Resnet18.

Model            | Lenet5 | Cifarnet | VGG-like | Resnet18
Param. Total (M) | 0.43   | 0.279    | 5.35     | 11.69
Param. Conv (M)  | 0.025  | 0.258    | 1.114    | 11.177
Floating (MB)    | 1.644  | 1.065    | 20.408   | 44.594
Ours (MB)        | 0.393  | 0.081    | 4.284    | 3.154
Mem. Reduc. (%)  | 76.09  | 92.39    | 79.01    | 92.93
Table 5. Model size reduction with VecQ.
Configuration Parameters                |
Resource Utilization LUT/FF/BRAM/DSP (%) | 79 / 47.47 / 68.93 / 91.82
Clock Frequency Logic / Adder (MHz)      | 125 / 250
Peak Performance (GOPS)                  | 400
DNN Model    | Lenet5 | Cifarnet | VGG-like | Resnet18
Latency (ms) | 0.016  | 0.063    | 2.12     | 48.8
Table 6. T-DLA configuration, resource utilization, and latency.
Dataset  | Design          | Model    | Acc. (%) | F., W. (bits) | fps     | Platform
MNIST    | (finn)          | MFC-max  | 97.69    | 1, 1          | 6238000 | ZC706
MNIST    | (impl16)        | Lenet5   | —        | 8, 3          | 70000   | ZC706
MNIST    | Ours            | Lenet5   | 99.5     | 8, 2          | 62051.1 | Zedboard
CIFAR10  | (finn)          | VGG-like | 80.1     | 24, 1         | 21900   | ZC706
CIFAR10  | (fcfree)        | VGG-like | 81.8     | 1, 1          | 420     | Zedboard
CIFAR10  | (hperf)         | VGG-like | 86.71    | 8, 2          | 27043   | VC709
CIFAR10  | (accbnn)        | VGG-like | 88.68    | 1, 1          | 168     | Zedboard
CIFAR10  | Ours            | VGG-like | 89.08    | 8, 2          | 457     | Zedboard
ImageNet | (li2016ternary) | Resnet18 | 65.44    | FP32, FP32    | 1.545   | Xeon¹
ImageNet | (li2016ternary) | Resnet18 | 65.44    | FP32, FP32    | 387.597 | 1080Ti²
ImageNet | Ours            | Resnet18 | 68.23    | 8, 2          | 20.48   | Zedboard
¹ Xeon: Xeon E5-2630 v3; ² 1080Ti: Nvidia 1080 Ti
Table 7. Comparison with existing designs.
5.2.1. Classification accuracy
The classification accuracies on the different datasets are shown in Table 4. For simplicity, we only show the top-1 accuracy. Compared with the floating-point models ("Floating" in the table), the classification accuracy using ternary weights and quantized scalars and activations shows negligible degradation. VecQ also achieves superior accuracy compared with recent works (efficient; li2016ternary), in which only the weights are ternarized but not the scalars and activations. Our proposed method shows better accuracy for Resnet18 on the ImageNet dataset. These results demonstrate the scalability and stability of VecQ, especially in aggressive low-bitwidth quantization scenarios.
5.2.2. Model size reduction
VecQ also greatly reduces the memory footprint (Mem. Reduc.), as shown in Table 5. A ternary weight occupies only 2 bits, whereas the original floating-point value requires 32 bits. As shown in Table 5, for convolution layers, VecQ compresses the parameters nearly to the theoretical limit (almost 16× reduction). We quantize the last FC layer to 12 bits to maintain accuracy, so networks with fewer or no FC layers, such as Cifarnet and Resnet18, have a higher compression ratio. Specifically, VecQ reduces the size of the floating-point Resnet18 by up to 92.93% (14.14×).
5.3. Ultra-Low Latency Acceleration
We use the models quantized by VecQ to evaluate our T-DLA accelerator design in terms of accuracy and frames per second (fps). The measurements of the original models are taken on a server with two Intel Xeon E5-2630 v3 CPUs and one Nvidia 1080 Ti GPU. T-DLA is implemented on a Xilinx Zedboard FPGA, which is suitable for edge applications with very limited logic resources. It has an on-chip dual-core ARM Cortex-A9, 53.2K LUTs, 106.4K FFs, 140 BRAM blocks of 36Kb each, and 220 DSPs. The Vivado Design Suite 2019.2 is used for system implementation.
5.3.1. Hardware Resource and Processing Latency Evaluation
We choose an accelerator configuration that fully utilizes the given resources, shown in Table 6, together with the execution latency of the different models under this configuration. We only show the most important configuration parameters, including the computation-array dimensions and the quantized bitwidth of the activations. As can be seen in Table 6, T-DLA with a customized configuration uses almost all the resources, especially the DSPs. The targeted FPGA can support up to 250 MHz for the DSPs, which is twice the frequency of the other logic, benefiting from the ternary computation array and the independent clock design of the adder trees.
5.3.2. Performance Comparison
We compare T-DLA in terms of accuracy and fps with existing designs that use either the same DNN model or the same dataset. The results are shown in Table 7. For the MNIST dataset, the design in (finn) shows a higher fps because the DNN model it uses is simpler and the ZC706 platform has far more resources than ours. However, our implementation on the Zedboard reaches a comparable fps (62051) to a design with 3-bit weights on the ZC706 platform (impl16) (70000). On the CIFAR10 dataset, our design shows a dominant accuracy advantage among all the VGG-like models. On the ImageNet dataset, we directly compare our results with the floating-point version. T-DLA shows a longer execution latency than the GPU but outperforms the CPU by about 13×.
6. Conclusions
In this paper, we summarized our recent efforts toward efficient on-device AI, covering both training and inference. We focused on three major challenges of edge AI development. First, we presented on-device training with ultra-low memory usage by proposing a novel rank-adaptive tensorized neural network model, which offers orders-of-magnitude memory reduction during training. Second, we introduced VecQ, a novel quantization method that supports ultra-low bitwidth quantization with negligible accuracy degradation. Third, we presented T-DLA, an ultra-low latency DNN accelerator design for ternarized DNNs that achieves state-of-the-art performance. Building on these results, we expect more research breakthroughs to boost the development and deployment of edge AI.