3U-EdgeAI: Ultra-Low Memory Training, Ultra-Low Bitwidth Quantization, and Ultra-Low Latency Acceleration

The deep neural network (DNN) based AI applications on the edge require both low-cost computing platforms and high-quality services. However, the limited memory, computing resources, and power budget of the edge devices constrain the effectiveness of the DNN algorithms. Developing edge-oriented AI algorithms and implementations (e.g., accelerators) is challenging. In this paper, we summarize our recent efforts for efficient on-device AI development from three aspects, including both training and inference. First, we present on-device training with ultra-low memory usage. We propose a novel rank-adaptive tensor-based tensorized neural network model, which offers orders-of-magnitude memory reduction during training. Second, we introduce an ultra-low bitwidth quantization method for DNN model compression, achieving the state-of-the-art accuracy under the same compression ratio. Third, we introduce an ultra-low latency DNN accelerator design, practicing the software/hardware co-design methodology. This paper emphasizes the importance and efficacy of training, quantization and accelerator design, and calls for more research breakthroughs in the area for AI on the edge.




1. Introduction

Deep neural networks (DNNs) are becoming attractive solutions for many edge AI applications and have made remarkable progress in various areas such as computer vision, natural language processing, health care, autonomous driving, and surveillance. Meanwhile, as the size and complexity of neural networks increase, training and deploying a DNN with a large number of parameters and complex data transmission on small and power-constrained edge devices, such as smart phones and wearable devices, becomes increasingly challenging (hao2021enabling; han2015deep; zhang2017machine). In this work, we focus on three primary challenges: ultra-low memory training, ultra-low bitwidth quantization, and ultra-low latency acceleration, and discuss our solutions for each of them.

First, there is an increasing demand for on-device machine learning model training, to preserve data privacy, enable model personalization and lifelong learning, and improve energy efficiency by avoiding massive data transmission to the cloud (7979979; wang2019e2). However, model training has a much larger memory requirement than inference, which poses additional challenges for on-device training because edge devices are usually equipped with limited memory capacity. Therefore, ultra-low memory training methods must be explored to enable on-device training. To this end, we present an end-to-end low-precision tensorized neural network training framework with orders-of-magnitude memory reduction (zhang2021fpga). The rank-adaptive tensorized training method employs a Bayesian method for automatic tensor rank determination and model compression in the training process.

Second, to implement DNNs on memory-constrained edge devices, pruning and quantization are promising ways to reduce the number of weights and the data bitwidth in DNN models, with an extreme case that quantizes the weights down to binary/ternary representations (han2015deep; li2016ternary; Courbariaux-binary). These methods can dramatically reduce the network size as well as the number of multiplications during the execution of the model. Given the tight memory and computing resource budget on the edge, ultra-low bitwidth quantization methods are especially attractive. However, ultra-low bitwidth quantization can easily cause significant degradation of the model accuracy, which makes such aggressive quantization challenging. To address these challenges, we present a novel ternary weight quantization method built on a vectorized loss function, achieving the state-of-the-art accuracy under the same compression ratio (gong2020vecq).

Third, for efficient DNN deployment on edge devices, FPGAs are becoming attractive platforms compared with CPUs, GPUs, and digital signal processors (DSPs) (obtract; zhang2018dnnbuilder; qiu2016going). FPGAs provide the flexibility to be configured as domain-specific architectures that can meet various implementation requirements, such as ultra-low latency on edge devices. In addition, modern SoC FPGAs integrate low-power processors and sufficient interfaces to support widely used sensors for Internet-of-things (IoT) applications. We present the first instruction-based ternarized low-latency deep learning accelerator with high performance, low resource utilization, and high flexibility for different DNN models (chen2019tdla).

The remainder of this paper is organized as follows. Section 2 introduces our low-memory rank-adaptive on-device training framework; Section 3 introduces our low-bitwidth DNN quantization solution; Section 4 introduces our low-latency DNN accelerator design. In Section 5 we demonstrate the effectiveness of our proposed methods, followed by the conclusions and future work in Section 6.

2. Ultra-Low Memory Training

The large number of model parameters consumes massive computing and memory resources, which prevents direct training of neural networks on edge devices. A promising technique for reducing model parameters is low-rank tensor decomposition (kolda2009tensor; oseledets2011tensor). This method has achieved great success in post-training compression and fixed-rank training (zhou2019tensor; calvi2019tucker; yin2020compressing; tjandra2017compressing; lebedev2014speeding; novikov2015tensorizing; garipov2016ultimate). However, several fundamental issues need to be addressed in on-device one-shot training:

  • Firstly, a rank-adaptive training framework is needed to avoid combinatorial search of tensor ranks and multiple training runs.

  • Secondly, hardware-friendly tensor algorithms should be developed to facilitate their implementation on edge devices.

In this section, we summarize our recent work on the algorithm (hawkins2019bayesian; hawkins2020towards) and hardware (zhang2021fpga) levels to address these challenges.

Figure 1. (a): An order-3 tensor. (b) and (c): CP and Tucker representations, respectively. (d): TT representation, where the gray lines and squares indicate a slice of the TT core by fixing its mode index. This figure is reproduced from (hawkins2020towards).

2.1. Bayesian Tensorized Training Models

2.1.1. Low-rank tensor representation.

In many cases we can describe a neural network with far fewer parameters via low-rank tensors. Consider a weight matrix $\mathbf{W} \in \mathbb{R}^{M \times N}$ as an example (other parameters such as convolutional filters and embedding tables can be handled similarly). We can first fold $\mathbf{W}$ into a high-dimensional tensor $\mathcal{A}$ of size $M_1 \times \cdots \times M_d \times N_1 \times \cdots \times N_d$, where $M = \prod_{n=1}^{d} M_n$ and $N = \prod_{n=1}^{d} N_n$. Then, we can describe $\mathcal{A}$ with some low-rank tensor factors $\Phi$. This can be done with various low-rank tensor decomposition formats, as shown in Fig. 1 (hawkins2020towards); in each format, $\Phi$ denotes the associated tensor factors. For large fully connected layers and embedding tables, the tensor-train matrix (TTM) format turns out to be highly effective (hawkins2020towards). In the TTM format, $\Phi = \{\mathcal{G}_n\}_{n=1}^{d}$, and each $\mathcal{G}_n \in \mathbb{R}^{R_{n-1} \times M_n \times N_n \times R_n}$ is an order-4 TTM core. The vector $\mathbf{R} = (R_0, R_1, \ldots, R_d)$ with $R_0 = R_d = 1$ collects the tensor ranks that determine the model complexity. With low-rank tensors, one may reduce the number of model parameters from an exponential function of $d$ to a linear one.
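To make the parameter saving concrete, the following is a minimal NumPy sketch of the TTM format, assuming each core has shape (previous rank, row mode, column mode, next rank) with the first and last ranks equal to 1. The layer size (784 x 512), the mode sizes, and the ranks are hypothetical values chosen for illustration and are not taken from the paper; the helper names are likewise our own.

```python
import numpy as np

def make_ttm_cores(m_dims, n_dims, ranks, seed=0):
    """Random TTM cores; core n has shape (R_{n-1}, M_n, N_n, R_n)."""
    rng = np.random.default_rng(seed)
    return [rng.standard_normal((ranks[i], m, n, ranks[i + 1]))
            for i, (m, n) in enumerate(zip(m_dims, n_dims))]

def ttm_to_matrix(cores):
    """Contract the cores back into the full (prod M_n) x (prod N_n) matrix."""
    full = cores[0]                                  # shape (1, M_1, N_1, R_1)
    for core in cores[1:]:
        # Sum over the shared rank index, keeping row modes and column modes.
        full = np.einsum("amnr,rpqs->ampnqs", full, core)
        a, m, p, n, q, s = full.shape
        # Group the row modes together and the column modes together.
        full = full.reshape(a, m * p, n * q, s)
    return full[0, :, :, 0]

# Hypothetical folding of a 784 x 512 layer (illustrative values only).
m_dims, n_dims = (7, 4, 4, 7), (4, 4, 8, 4)
ranks = (1, 8, 8, 8, 1)

cores = make_ttm_cores(m_dims, n_dims, ranks)
W = ttm_to_matrix(cores)
n_ttm = sum(core.size for core in cores)             # parameters actually stored
n_dense = W.size                                      # parameters of the dense matrix
print(W.shape, n_ttm, n_dense, round(n_dense / n_ttm, 1))
```

In training, only the cores would be stored and updated; the dense matrix is rebuilt here purely to check the shapes and to compare parameter counts.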

2.1.2. Bayesian Tensorized End-to-End Training.

Despite the high compression ratio via tensor methods, determining the tensor rank in advance is very hard (hillar2013most). This is further complicated by the nonlinear forward model in neural networks, which has prevented tensorized one-shot on-device training in previous works. We have developed two Bayesian models to address this issue:

  • Stein Variational Inference for TTM Format. In (hawkins2019bayesian), we have considered the TTM format. We model each slice of the TTM cores with a zero-mean Gaussian prior density. We further control the variance by two tunable Gamma hyper-priors to enforce low tensor ranks. The actual tensor rank is decided jointly by the training data and the rank-controlling hyper-parameters. Starting from an initial rank parameter, we can learn an actual rank that is no larger than the initial one, leading to further model compression in the training process. This method uses Stein variational inference (liu2016stein) to compute the posterior density for small- or medium-size neural networks.

  • Scalable SVI for One-Shot Tensorized Training. In (hawkins2020towards), we have developed a more generic and efficient Bayesian model for tensorized training. This work can handle the CP, Tucker, TT, and TTM formats. It uses Gaussian priors to model low-rank tensor factors, and uses Half-Cauchy or Log-Uniform hyper-priors to control tensor ranks. We have improved stochastic variational inference (SVI) (hoffman2013stochastic) in two steps. Firstly, we simplify the posterior density of rank-controlling hyper-parameters to a Delta function to avoid gradient explosion. Secondly, we use a hybrid numerical/analytical update rule inside SVI. This highly scalable method can perform one-shot training of very large-scale neural networks with billions of model parameters.

2.1.3. Performance Summary.

  • Our first method (hawkins2019bayesian) has been tested on a two-layer fully connected neural network, a 6-layer CNN, and a 110-layer residual neural network. It has produced more compact neural networks directly from training with little or no accuracy loss.

  • Our recent work (hawkins2020towards) has been tested on a practical CNN, a large-scale NLP model (khrulkov2019tensorized), and an extremely large deep learning recommendation model (DLRM) (naumov2019deep) from Facebook. Orders-of-magnitude parameter reduction has been achieved in the training process. As shown in Table 1, training the DLRM with a standard method involves 4.25B variables. Our proposed method trains only 2.36M variables due to low-rank tensorization, and it further reduces the model parameters to 164K in the training process due to the automatic rank determination. The overall parameter reduction ratio in the training process is about $2.6 \times 10^4$ (4.25B / 164K).

                standard    tensorization    rank-adaptive training
# parameters    4.25B       2.36M            164K
compression     N/A         ≈1,800×          ≈25,900×
Table 1. Performance of our tensorized training (hawkins2020towards) on the Facebook DLRM model.
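The compression ratios implied by the parameter counts in Table 1 can be checked with a few lines of arithmetic. The snippet below only divides the three counts reported in the table; the variable names are ours.

```python
# Parameter counts as reported in Table 1.
standard = 4.25e9        # standard (non-tensorized) training
tensorized = 2.36e6      # after low-rank tensorization
rank_adaptive = 164e3    # after automatic rank determination

ratio_tensorized = standard / tensorized
ratio_rank_adaptive = standard / rank_adaptive

print(f"tensorization:          {ratio_tensorized:,.0f}x fewer parameters")
print(f"rank-adaptive training: {ratio_rank_adaptive:,.0f}x fewer parameters")
```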

2.2. One-Shot On-Device Tensorized Training

To demonstrate on-device training, we have developed a low-precision tensorized training algorithm and its FPGA prototype (zhang2021fpga).

2.2.1. Low-Precision Tensorized Training.

We consider the maximum a posteriori probability (MAP) estimate of the Bayesian model. In this case, the training loss function includes two parts: the cross-entropy loss of a neural network classifier that depends on the TTM factors, and a regularization term caused by the Gaussian priors of the TTM factors as well as the Log-Uniform hyper-priors for the rank-controlling parameters. In the training process, both the TTM factors and the rank-controlling parameters are computed. To reduce the training cost on hardware, a low-precision tensorized training algorithm is developed based on the following key ideas:

  • We use BinaryConnect (courbariaux2015binaryconnect) to compute low-precision TTM factors. BinaryConnect keeps the real values of all low-precision parameters in a buffer. In each iteration, the gradients are accumulated in the buffer, and the low-precision parameters are updated by quantizing the buffer. To handle the non-differentiable quantization function in the training process, we use the straight-through estimator (STE) (bengio2013estimating) to approximate its gradient.

  • We use different precisions for different variables in the training process. Specifically, we use 4 bits to represent TT factors, 8 bits for activations and bias, and 16 bits for the gradients.
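The two ideas above can be illustrated with a minimal NumPy sketch of a BinaryConnect-style update with the straight-through estimator. It is a toy model under stated assumptions, not the authors' implementation: the uniform symmetric quantizer, the fixed `scale`, the learning rate, and the quadratic toy loss are all hypothetical choices; only the 4-bit width for the factors comes from the text.

```python
import numpy as np

def quantize(x, bits, scale):
    """Uniform symmetric quantizer: round onto a signed `bits`-bit grid."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def binaryconnect_step(buffer, grad_wrt_quantized, lr, bits=4, scale=0.125):
    """One update: accumulate the gradient in the real-valued buffer,
    then re-quantize the buffer to obtain the low-precision parameters.

    Straight-through estimator: the gradient computed with the quantized
    parameters is applied to the buffer as if quantize() were the identity.
    """
    buffer = buffer - lr * grad_wrt_quantized
    return buffer, quantize(buffer, bits, scale)

# Toy problem: minimize 0.5 * ||q - target||^2 over 4-bit parameters q.
target = np.array([0.5, -0.3, 0.9])
buffer = np.zeros(3)                      # real-valued copy of the parameters
q = quantize(buffer, 4, 0.125)            # low-precision parameters used in the forward pass
for _ in range(50):
    grad = q - target                     # gradient evaluated at the quantized parameters
    buffer, q = binaryconnect_step(buffer, grad, lr=0.5)
print(q)
```

The real-valued buffer lets many small gradient steps add up until they move a parameter to the next quantization level, which a direct update of the 4-bit values could not do.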

2.2.2. On-FPGA Training.

To demonstrate our training algorithms on edge devices, we have implemented an FPGA accelerator as shown in Fig. 2 for the low-precision tensorized training framework.

  • Since our low-rank tensorization can greatly reduce the training variables, all model parameters may be stored in the on-chip BRAM. The data samples, activations, and gradients are stored in the off-chip DRAM during the training process.

  • The forward and backward propagations are run on the FPGA programmable logic. The TTM factors and rank-controlling parameters are updated on the embedded ARM core.

  • Three processing elements (PEs) are designed for the forward and backward propagation. PE1 and PE2 are shared by the forward and backward propagations, and they handle tensor contractions. PE1 is used for a two-index tensor contraction which contains the last dimension of two tensor variables. In contrast, PE2 performs a tensor contraction along a single dimension that is not the last. PE3 computes the outer products in a backward propagation.

Figure 2. Our FPGA accelerator for the end-to-end tensorized training. Reproduced from (zhang2021fpga).

3. Ultra-low Bitwidth Quantization

Neural network quantization employs low-precision (low-bitwidth) data for efficient model execution. In particular, ultra-low bitwidth quantization leads to much lower memory usage, lower complexity of the multiply-accumulate operations, and higher efficiency of model execution, making it an appealing technology for enabling AI on edge devices. However, aggressively lowering the data bitwidth (e.g., below 4 bits) is very challenging:

  • It can easily result in large accuracy degradation (cong2019dac; chen2019tdla; gysel2016hardware), requiring a careful balance between the computing efficiency and the final model accuracy.

  • Minimizing the quantization loss, i.e., the L2 distance between the original and the quantized values, is an appealing method (han2015deep; ENN2017; TSQ2018; cheng2019uL2Q; li2016ternary; leng2018extremely) but has major drawbacks, such as easily falling into local optima and neglecting the distribution and correlations of the weights (gong2020vecq).

To address such challenges and achieve high-accuracy ultra-low bitwidth quantization, we have proposed a quantization method, namely VecQ (gong2020vecq), with a novel vectorized loss function and an open-sourced training flow. VecQ can quantize the model weights to anywhere from 1 bit to 16 bits and shows exceptional performance especially at ultra-low bitwidths, e.g., ternary values.

Figure 3. The quantization angle between the floating point weight vector and the quantized weight vector.

Vectorized Loss Function. We organize the weights within one DNN layer into a vector in $\mathbb{R}^{n}$, where $n$ is the number of weights, and denote the original floating point weight vector by $\mathbf{w}_f$ and the quantized weight vector by $\mathbf{w}_q$. Typically, there is a scaling factor $\alpha > 0$ such that $\mathbf{w}_q = \alpha \mathbf{v}$, where each element in $\mathbf{v}$ is a low-bitwidth representation. Based on the vector representations, we define the quantization angle between $\mathbf{w}_f$ and $\mathbf{w}_q$, denoted by $\theta$. Figure 3 shows an example. The objective is to find the optimal $\alpha$ and $\mathbf{v}$ such that $\mathbf{w}_q$ is as close to $\mathbf{w}_f$ as possible.

We propose the vectorized loss to describe the quantization loss, denoted by $J_v$, and we minimize $J_v$ during the quantization. $J_v$ is defined as the summation of the orientation loss, denoted by $J_o$, and the modulus loss, denoted by $J_m$, as follows:

$$J_v = J_o + J_m, \qquad J_o = 1 - \cos\theta = 1 - \frac{\mathbf{v}^{\top}\mathbf{w}_f}{\|\mathbf{v}\|_2 \, \|\mathbf{w}_f\|_2}, \qquad J_m = \|\mathbf{w}_f - \alpha \mathbf{v}\|_2^2 .$$

The orientation loss describes the angle between the two vectors, while the modulus loss describes the squared distance between $\mathbf{w}_f$ and $\mathbf{w}_q$. Notably, by minimizing $J_v$, we usually achieve a lower quantization loss compared with directly minimizing the L2 distance. More details can be found in the VecQ paper (gong2020vecq).
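As a rough illustration, the vectorized loss can be evaluated as follows. The sketch assumes the orientation loss takes the form one minus the cosine of the quantization angle and the modulus loss is the squared Euclidean distance between the floating point weights and the scaled low-bitwidth vector; the exact definitions are given in the VecQ paper, and the function and argument names here are ours.

```python
import numpy as np

def vectorized_loss(w_f, v, alpha):
    """Return (orientation loss, modulus loss, their sum).

    w_f:   floating point weight vector
    v:     low-bitwidth vector (e.g. entries in {-1, 0, +1})
    alpha: positive scaling factor, so the quantized weights are alpha * v
    Assumed forms: orientation = 1 - cos(angle), modulus = ||w_f - alpha*v||^2.
    """
    w_f = np.asarray(w_f, dtype=float)
    w_q = alpha * np.asarray(v, dtype=float)
    cos_theta = np.dot(w_f, w_q) / (np.linalg.norm(w_f) * np.linalg.norm(w_q))
    orientation = 1.0 - cos_theta
    modulus = float(np.sum((w_f - w_q) ** 2))
    return orientation, modulus, orientation + modulus

# The orientation term does not change when only alpha changes.
w_f = np.array([0.9, -0.8, 0.05, 0.1])
v = np.array([1.0, -1.0, 0.0, 0.0])
print(vectorized_loss(w_f, v, alpha=0.5))
print(vectorized_loss(w_f, v, alpha=0.85))
```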

Vectorized Loss Minimization. We minimize the vectorized loss in two steps, namely steering and driving, as shown in Figure 4. First, the steering step minimizes the orientation loss to find the best low-bitwidth vector, since the orientation loss is independent of the scaling factor. Second, the driving step minimizes the modulus loss to find the best scaling factor. For a convolution layer, all the weights within the same layer share the same scaling factor; for a depth-wise convolution layer, each kernel has its own scaling factor to better represent its smaller number of weights. We quantize the activations to fixed point values during the training to further reduce the memory utilization.
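The two steps can be sketched for the ternary case as follows. This is not VecQ's actual solver: the steering step below is a plain exhaustive search over how many of the largest-magnitude weights to keep, swapped in because it is easy to verify, and the driving step is the closed-form least-squares scaling factor. The function names are ours, and the sketch assumes the weight vector is not all zeros.

```python
import numpy as np

def steer_ternary(w_f):
    """Steering: choose v in {-1, 0, +1}^n that maximizes cos(angle to w_f).

    For a fixed number k of non-zeros, the best v keeps the k largest |w_i|
    with matching signs, giving cos = sum(top-k |w_i|) / (sqrt(k) * ||w_f||).
    Trying every k therefore finds the best ternary direction.
    """
    w_f = np.asarray(w_f, dtype=float)
    mag = np.abs(w_f)
    order = np.argsort(-mag)                       # largest magnitudes first
    prefix = np.cumsum(mag[order])                 # sum of the k largest magnitudes
    k = np.arange(1, len(w_f) + 1)
    best_k = int(np.argmax(prefix / np.sqrt(k))) + 1
    v = np.zeros_like(w_f)
    keep = order[:best_k]
    v[keep] = np.sign(w_f[keep])
    return v

def drive(w_f, v):
    """Driving: closed-form alpha that minimizes ||w_f - alpha * v||^2."""
    w_f = np.asarray(w_f, dtype=float)
    return float(np.dot(w_f, v) / np.dot(v, v))

w_f = np.array([0.9, -0.8, 0.05, 0.1])
v = steer_ternary(w_f)
alpha = drive(w_f, v)
print(v, alpha)
```

Because the direction is fixed before the scale is chosen, the scale can be solved in closed form, which is the point of separating the two steps.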

Figure 4. The data quantization flow of VecQ, reproduced from VecQ (gong2020vecq).

Training Flow Integration. We integrate our VecQ solution into the TensorFlow and PyTorch DNN training frameworks. For each layer, in the forward propagation, we first quantize the floating point weights to their low-bitwidth form, and then use the quantized weights to compute the output activations, which are also quantized into fixed point. In the backward propagation, the gradients are also calculated using the quantized weights and are then used to update the floating point weights.

4. Ultra-low Latency Acceleration

The effectiveness of dedicated FPGA accelerators for DNN models has been widely demonstrated (qiu2016going; chen2019clouddnn). However, ultra-low latency accelerators for edge devices with an extremely limited resource budget still require careful design considerations.

Benefiting from our quantization solution VecQ for ultra-low bitwidth, we have proposed T-DLA, a light-weight ternarized accelerator overlay under strict resource constraint, to achieve ultra-low latency on edge devices (chen2019tdla). The key features of T-DLA include:

  • An optimized and expressive single instruction multiple data (SIMD) instruction set.

  • A novel memory sub-system supporting effective data access of the computation modules.

  • An efficient execution pipeline with low-latency computation modules.

Byte Idx    7    6    5    4    3    2    1    0

OP: operation code; FS: input feature size
SAM/SAL: source address most/least significant byte
DAM/DAL: destination address most/least significant byte
KS: kernel size; CC: in/out/activation/pooling selection

Table 2. The format of the instruction word in T-DLA.

SIMD Instruction Set. To support task scheduling for various DNN models, the instruction set of T-DLA is designed to be simple yet expressive enough for a large variety of DNNs. Each instruction is a 64-bit word (8 bytes) with the format shown in Table 2. The payloads of the different bytes are generated according to the layer configurations.
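A 64-bit instruction word made of eight one-byte fields can be packed and unpacked as in the sketch below. The assignment of fields to byte positions is an assumption made for illustration (it simply follows the order of the legend of Table 2, from byte 7 down to byte 0), as is the choice of one byte per field; the actual T-DLA encoding may differ.

```python
# Assumed field order, from byte 7 (most significant) down to byte 0.
# This layout is hypothetical and not confirmed by the paper.
FIELDS = ("OP", "FS", "SAM", "SAL", "DAM", "DAL", "KS", "CC")

def encode(**fields):
    """Pack one-byte fields into a 64-bit instruction word."""
    word = 0
    for name in FIELDS:
        value = fields.get(name, 0)
        if not 0 <= value <= 0xFF:
            raise ValueError(f"{name} does not fit in one byte: {value}")
        word = (word << 8) | value
    return word

def decode(word):
    """Unpack a 64-bit instruction word into its one-byte fields."""
    return {name: (word >> (8 * (7 - i))) & 0xFF
            for i, name in enumerate(FIELDS)}

# Example: a layer with 32 x 32 input features and a 3 x 3 kernel.
word = encode(OP=1, FS=32, SAM=0x12, SAL=0x34, DAM=0x56, DAL=0x78, KS=3, CC=5)
print(f"{word:#018x}")
print(decode(word))
```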

Memory Sub-system. The memory sub-system contains two levels of storage to provide low-latency data fetching to the computation units: a simple input buffer and a variable-length line buffer. The simple input buffer is a BRAM buffer for temporary input feature storage; the variable-length line buffer streams data efficiently into the ternary computation array, as shown in Figure 5. It is designed to support a variable kernel size and a variable buffer depth, both of which are specified by the instruction, to reduce the data transmission latency caused by fixed hardware paths. Once configured by the instruction, it provides output data to the computation array every clock cycle.

Figure 5. The variable-length line buffer. The kernel size and the depth of the buffers are both configurable.
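The behavior of a configurable line buffer can be modeled in software as below. This is a generic sliding-window line buffer written for illustration, not the T-DLA hardware: the class name, the `line_width` parameter (standing in for the configurable buffer depth), and the one-pixel-per-call interface are assumptions.

```python
from collections import deque

class LineBuffer:
    """Software model of a line buffer with configurable kernel size and depth."""

    def __init__(self, kernel_size, line_width):
        self.k = kernel_size
        self.width = line_width
        # Keep the last (k - 1) full rows plus k pixels of the newest row.
        self.buf = deque(maxlen=(kernel_size - 1) * line_width + kernel_size)
        self.count = 0

    def push(self, pixel):
        """Shift in one pixel (one call per clock cycle in this model).

        Returns a k x k window once the buffer is full and the newest pixel
        is at least k - 1 columns into its row; otherwise returns None.
        """
        self.buf.append(pixel)
        self.count += 1
        col = (self.count - 1) % self.width
        if len(self.buf) < self.buf.maxlen or col < self.k - 1:
            return None
        data = list(self.buf)
        return [data[r * self.width: r * self.width + self.k]
                for r in range(self.k)]

# Stream a 4 x 4 image (pixels 0..15 in raster order) through a 3 x 3 buffer.
lb = LineBuffer(kernel_size=3, line_width=4)
windows = [w for w in (lb.push(p) for p in range(16)) if w is not None]
for w in windows:
    print(w)
```

Changing `kernel_size` and `line_width` at construction time plays the role of the per-instruction configuration described above.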

Execution Pipeline and Computation Modules. T-DLA has four major computation modules: 1) a ternary computation array, 2) a set of adder trees, 3) activation and scaling modules, and 4) pooling modules.

Ternary computation array. With our ternarized model training via VecQ, the weights are represented by 2 bits using two's complement encoding, so that the multiplication in the convolution layer is simplified to selection and inversion logic. Benefiting from such simplified logic, we can achieve parallelism along the input channel, the output channel, and the kernel dimensions. The computation array is constructed from computation units that process their input data simultaneously. Its size is set by the maximum numbers of input and output channels that the array can process and by the pre-defined maximum allowable kernel size. These values and the length of the line buffer are all configurable and can be determined based on the on-chip resource availability.
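The reduction of a ternary multiplication to selection and inversion can be sketched as follows, assuming the usual 2-bit two's complement codes (0b00 for 0, 0b01 for +1, 0b11 for -1). The function names are ours, and treating the unused code 0b10 as zero is an assumption of this sketch.

```python
def ternary_mul(activation, w_bits):
    """Multiply by a ternary weight given as a 2-bit two's complement code.

    0b00 -> 0, 0b01 -> +1, 0b11 -> -1. The low bit selects whether the
    activation passes through; the high (sign) bit selects inversion.
    The unused code 0b10 is treated as zero here (an assumption).
    """
    nonzero = w_bits & 0b01
    negate = (w_bits >> 1) & 0b01
    if not nonzero:
        return 0
    return -activation if negate else activation

def ternary_dot(activations, weight_codes):
    """Dot product built only from select, negate, and add operations."""
    return sum(ternary_mul(a, w) for a, w in zip(activations, weight_codes))

print(ternary_dot([3, -2, 7], [0b01, 0b11, 0b00]))
```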

Adder tree. Since the computation array is built using only LUTs and FFs, we use DSPs to construct the adder trees. We take advantage of the SIMD mode of the DSPs, where the internal carry propagation between segments is blocked to ensure independent operation. We therefore split the 48-bit input of a DSP into four independent 12-bit accumulation channels, so that a single DSP can add 8 pieces of input data and provide 4 outputs. Benefiting from the SIMD mode, the DSPs can provide outputs in every clock cycle once the internal register lines are filled. Furthermore, the clock frequency of the DSPs is configured to be higher than that of the other logic parts with the help of input/output asynchronous FIFOs, which further reduces the processing latency.
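The effect of blocking the carry between segments can be modeled in software. The sketch below emulates a 48-bit adder split into four 12-bit lanes; it is a behavioral model written for illustration, not a description of the DSP primitive's interface, and the helper names are ours.

```python
LANES, LANE_BITS = 4, 12
LANE_MASK = (1 << LANE_BITS) - 1

def pack(values):
    """Pack four signed 12-bit values into one 48-bit word (lane 0 lowest)."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (i * LANE_BITS)
    return word

def unpack(word):
    """Split a 48-bit word into four signed 12-bit values."""
    out = []
    for i in range(LANES):
        raw = (word >> (i * LANE_BITS)) & LANE_MASK
        is_negative = raw & (1 << (LANE_BITS - 1))
        out.append(raw - (1 << LANE_BITS) if is_negative else raw)
    return out

def simd_add(word_a, word_b):
    """Lane-wise addition: each 12-bit lane wraps on its own, and no carry
    crosses a lane boundary."""
    result = 0
    for i in range(LANES):
        shift = i * LANE_BITS
        lane = ((word_a >> shift) + (word_b >> shift)) & LANE_MASK
        result |= lane << shift
    return result

a = [100, -5, 2047, -2048]
b = [23, -7, 1, -1]
sums = unpack(simd_add(pack(a), pack(b)))
print(sums)
```

Eight inputs (two packed words) produce four outputs per addition, matching the 8-in, 4-out behavior described above; lanes that overflow 12 bits wrap around without disturbing their neighbors.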

Other modules. The ReLU activation module, the linear scaling module, and the max pooling module are all designed to process in a single clock cycle to reduce the depth of the execution pipeline.

5. Experimental Results

In this section we demonstrate the effectiveness of our methods, including the ultra-low memory training framework, the ultra-low bitwidth VecQ quantization, and the ultra-low latency T-DLA design. Notably, all these works are open-sourced.

Method                 Training accuracy   Testing accuracy   Model parameters   Memory in bits   Memory reduction
Vanilla                95.75%              89.27%                                                 N/A
Floating, w/o prior    92.54%              88.03%                                                 31.4×
Fixed, w/o prior       88.31%              86.67%                                                 243×
Floating, w/ prior     90.17%              87.88%
Fixed, w/ prior        85.45%              84.86%                                                 292×
Table 3. The on-device ultra-low memory training method on the Fashion MNIST dataset.

5.1. On-device Training

We implement our low-precision rank-adaptive tensorized training on an Avnet Ultra96-V2 FPGA board and use it to train a two-layer neural network for a classification task on the FashionMNIST dataset. There are 512 neurons in the hidden layer, and the weight matrix of each of the two layers is folded into a higher-order tensor. We use the PyTorch and TensorLy modules to implement our training algorithm on the embedded processor. For the FPGA, we set the clock rate to 100 MHz. We compare the training methods with and without the low-rank TT priors. As shown in Table 3, our method achieves up to 292× memory reduction for the model parameters compared with the standard non-tensorized training. This on-FPGA training also achieves a speedup and an energy reduction compared with training on an embedded CPU.

5.2. Ultra-low Bitwidth Quantization

Dataset                    MNIST      CIFAR10     CIFAR10     ImageNet
Model                      Lenet-5    Cifarnet    VGG-like    Resnet-18
Floating                   99.41      80.54       93.49       69.60
IJCNN'17 (efficient)       98.33      -           87.89       -
NIPS'16 (li2016ternary)    99.35      -           92.56       61.8
Ours                       99.5       78.7        92.94       68.23
Table 4. VecQ: top-1 classification accuracy (%)

We use MNIST, CIFAR10, and ImageNet to evaluate the ultra-low bitwidth quantization method, VecQ. The evaluated DNN models include Lenet-5, Cifarnet, a VGG-like network (hperf), and Resnet-18.

Model               Lenet-5    Cifarnet    VGG-like    Resnet-18
Param. Total (M)    0.43       0.279       5.35        11.69
Param. Conv (M)     0.025      0.258       1.114       11.177
Floating (MB)       1.644      1.065       20.408      44.594
Ours (MB)           0.393      0.081       4.284       3.154
Mem. Reduc. (%)     76.09      92.39       79.01       92.93
Table 5. VecQ: model size reduction

Configuration Parameters
Resource Utilization (%)                 79 / 47.47 / 68.93 / 91.82
Clock Frequency Logic / Adder (MHz)      125 / 250
Peak Performance (GOPS)                  400

DNN Model       Lenet-5    Cifarnet    VGG-like    Resnet-18
Latency (ms)    0.016      0.063       2.12        48.8
Table 6. T-DLA resource and performance

Dataset     Design             Model        Acc. (%)    F., W. (bits)    fps         Platform
MNIST       (finn)             MFC-max      97.69       1, 1             6238000     ZC706
MNIST       (impl16)           Lenet-5      -           8, 3             70000       ZC706
MNIST       Ours               Lenet-5      99.5        8, 2             62051.1     Zedboard
CIFAR10     (finn)             VGG-like     80.1        24, 1            21900       ZC706
CIFAR10     (fcfree)           VGG-like     81.8        1, 1             420         Zedboard
CIFAR10     (hperf)            VGG-like     86.71       8, 2             27043       VC709
CIFAR10     (accbnn)           VGG-like     88.68       1, 1             168         Zedboard
CIFAR10     Ours               VGG-like     89.08       8, 2             457         Zedboard
ImageNet    (li2016ternary)    Resnet-18    65.44       FP32, FP32       1.545       Xeon [1]
ImageNet    (li2016ternary)    Resnet-18    65.44       FP32, FP32       387.597     1080Ti [2]
ImageNet    Ours               Resnet-18    68.23       8, 2             20.48       Zedboard
[1] Xeon: Xeon E5-2630 v3; [2] 1080Ti: Nvidia 1080Ti

Table 7. T-DLA: Comparison with the state-of-the-art implementations.

5.2.1. Classification accuracy

The classification accuracies on the different datasets are shown in Table 4. For simplicity, we only show the top-1 accuracy. Compared with the floating point models (Floating in the table), the classification accuracy using ternary weights and quantized scalars and activations shows negligible degradation. VecQ also achieves superior accuracy compared with the recent works (efficient; li2016ternary), in which only the weights are ternarized but not the scalars and activations. Our proposed method shows better accuracy for Resnet-18 on the ImageNet dataset. This result demonstrates the scalability and stability of VecQ, especially in aggressive low-bitwidth quantization scenarios.

5.2.2. Model size reduction

VecQ also greatly reduces the memory footprint (Mem. Reduc.) as shown in Table 5. A ternary weight occupies only 2 bits, whereas the original floating point weight requires 32 bits. As shown in Table 5, for convolution layers, VecQ compresses the parameters nearly to the theoretical limit (almost 16× reduction). We quantize the last FC layer to 12 bits to maintain accuracy, so the networks with fewer or no FC layers, such as Cifarnet and Resnet-18, have higher compression ratios. Specifically, VecQ reduces the size of the floating point Resnet-18 by up to 92.93% (14.14×).
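The figures quoted above follow directly from the bit widths and from the model sizes in Table 5, as the short calculation below shows. The sizes are copied from Table 5; the dictionary and function names are ours.

```python
BITS_FLOAT, BITS_TERNARY = 32, 2
theoretical_limit = BITS_FLOAT / BITS_TERNARY    # best case for ternary weights

# (floating point size in MB, VecQ size in MB), copied from Table 5.
sizes_mb = {
    "Lenet-5": (1.644, 0.393),
    "Cifarnet": (1.065, 0.081),
    "VGG-like": (20.408, 4.284),
    "Resnet-18": (44.594, 3.154),
}

def reduction_percent(floating_mb, quantized_mb):
    """Memory reduction relative to the floating point model, in percent."""
    return 100.0 * (1.0 - quantized_mb / floating_mb)

print(f"theoretical limit: {theoretical_limit:.0f}x")
for model, (fl, q) in sizes_mb.items():
    print(f"{model}: {reduction_percent(fl, q):.2f}% smaller, {fl / q:.2f}x")
```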

5.3. Ultra-low Latency Acceleration

We use the models quantized by VecQ to evaluate our T-DLA accelerator design in terms of accuracy and frames per second (fps). The measurements of the original models are taken on a server with two Intel Xeon E5-2630 v3 CPUs and one Nvidia 1080 Ti GPU. T-DLA is implemented on a Xilinx Zedboard FPGA, which has very limited logic resources and is therefore representative of edge applications. It has an on-chip dual-core ARM Cortex A9, 53.2K LUTs, 106.4K FFs, 140 BRAM blocks of 36Kb each, and 220 DSPs. Vivado System Design Suite 2019.2 is used for system implementation.

5.3.1. Hardware Resource and Processing Latency Evaluation

We choose an accelerator configuration that fully utilizes the given resources, shown in Table 6 together with the execution latency of the different models under this configuration. We only show the most important configuration parameters, including the quantized bitwidth of the activations. As can be seen in Table 6, T-DLA with a customized configuration can use almost all the resources, especially the DSPs. The targeted FPGA can support up to 250 MHz for the DSPs, which is twice the frequency of the other logic (125 MHz), benefiting from the ternary computation array and the independent clock design of the adder trees.

5.3.2. Performance Comparison

We compare T-DLA in terms of accuracy and fps with existing designs that use either the same DNN model or the same dataset. The results are shown in Table 7. For the MNIST dataset, the design in (finn) shows higher fps because the DNN model it uses is simpler and the ZC706 platform has more resources than ours. However, our implementation on the Zedboard has an fps (62051) comparable to that of a design (impl16) with 3-bit weights on the ZC706 platform (70000). On the CIFAR10 dataset, our design achieves the highest accuracy among all the VGG-like models. On the ImageNet dataset, we directly compare our results with the floating point version. T-DLA shows longer execution latency than the GPU but outperforms the CPU by about 13.3× in fps (20.48 vs. 1.545).

6. Conclusions

In this paper, we summarized our recent efforts for efficient on-device AI development, including both training and inference. We focused on three major challenges of edge AI development. First, we presented on-device training with ultra-low memory usage by proposing a novel rank-adaptive tensorized neural network model, which offers orders-of-magnitude memory reduction during training. Second, we introduced VecQ, a novel quantization method that supports ultra-low bitwidth quantization with negligible accuracy degradation. Third, we presented T-DLA, an ultra-low latency DNN accelerator design for ternarized DNNs that achieves state-of-the-art performance. Building on the achievements in this paper, we expect more research breakthroughs to boost the development and deployment of edge AI.