Tiny Transfer Learning: Towards Memory-Efficient On-Device Learning

07/22/2020 ∙ by Han Cai, et al.

We present Tiny-Transfer-Learning (TinyTL), an efficient on-device learning method for adapting pre-trained models to newly collected data on edge devices. Different from conventional transfer learning methods that fine-tune the full network or the last layer, TinyTL freezes the weights of the feature extractor while only learning the biases, and thus does not require storing the intermediate activations, which are the major memory bottleneck for on-device learning. To maintain the adaptation capacity without updating the weights, TinyTL introduces memory-efficient lite residual modules that refine the feature extractor by learning small residual feature maps in the middle of the network. Besides, instead of using the same feature extractor for all tasks, TinyTL adapts the architecture of the feature extractor to fit different target datasets while fixing the weights: TinyTL pre-trains a large super-net that contains many weight-shared sub-nets that can operate individually; each target dataset selects the sub-net that best matches it. This backpropagation-free discrete sub-net selection incurs no memory overhead. Extensive experiments show that TinyTL can reduce the training memory cost by an order of magnitude (up to 13.3×) without sacrificing accuracy compared to fine-tuning the full network.


1 Introduction

Intelligent edge devices with rich sensors (e.g., billions of mobile phones and IoT devices [1]) have become ubiquitous in our daily lives. These devices keep collecting new and sensitive data through their sensors every day, while being expected to provide high-quality and customized services without sacrificing privacy [2]. This requires AI systems to be able to continually adapt pre-trained models to the newly collected data without leaking them to the cloud (i.e., on-device learning).

[1] https://www.statista.com/statistics/330695/number-of-smartphone-users-worldwide/
[2] https://ec.europa.eu/info/law/law-topic/data-protection_en

While plenty of efficient inference techniques have significantly reduced the parameter size and the computation FLOPs Howard et al. (2017); Sandler et al. (2018); Zhang et al. (2018); Han et al. (2015, 2016); Tan et al. (2019); Howard et al. (2019); Wu et al. (2019); Cai et al. (2019a, 2020), the size of the intermediate activations required by back-propagation causes a huge training memory footprint (Figure 1 left), making it difficult to train on edge devices.

First, edge devices are memory-constrained. For example, a Raspberry Pi 1 Model A only has 256MB of memory, which is sufficient for inference. However, as shown in Figure 1 (left, red line), the memory footprint of the training phase can easily exceed this limit, even when using a lightweight neural network architecture (MobileNetV2 Sandler et al. (2018)). Furthermore, the memory is shared by various on-device applications (e.g., other deep learning models) and the operating system. A single application may only be allocated a small fraction of the total memory, which makes this challenge even more critical.

Second, edge devices are energy-constrained. Under the 45nm CMOS technology Han et al. (2015), a 32bit off-chip DRAM access consumes 640 pJ, which is two orders of magnitude larger than a 32bit on-chip SRAM access (5 pJ) or a 32bit float multiplication (3.7 pJ). The large memory footprint required by training cannot fit into the limited on-chip SRAM. For instance, a TPU Jouppi et al. (2017) only has 28MB of SRAM, far smaller than the training memory footprint of MobileNetV2, even with a batch size of 1 (Figure 1 left). This results in many costly DRAM accesses and thereby consumes a lot of energy, draining the battery of edge devices.

Figure 1: Left: The memory footprint required by training grows linearly w.r.t. the batch size and soon exceeds the limit of edge devices. Right: Memory cost comparison between ResNet-50 and MobileNetV2-1.4 under batch size 8. Recent advances in efficient model design only reduce the size of parameters, but activation size, the main bottleneck for training, does not improve much.

In this work, we propose Tiny-Transfer-Learning (TinyTL) to address these challenges. By analyzing the memory footprint during the backward pass, we notice that the intermediate activations (the main bottleneck) are only involved in updating the weights; updating the biases does not require them (Eq. 2). Inspired by this finding, we propose to freeze the weights of the pre-trained feature extractor to reduce the memory footprint (Figure 2b). To compensate for the capacity loss due to freezing the weights while keeping the memory overhead small, we introduce lite residual learning, which improves the model capacity by learning lite residual modules that refine the intermediate feature maps of the pre-trained feature extractor (Figure 2c). Meanwhile, it aggressively shrinks the resolution and width dimensions of the lite residual modules to keep the memory overhead small. We also empirically find that different transfer datasets require very different feature extractors, especially when the weights are frozen (Figure 3). Therefore, we introduce feature extractor adaptation, which updates the architecture of the feature extractor while fixing the weights to fit different target datasets (Figure 2d). Concretely, we select different sub-nets from a large pre-trained super-net. Different from conventional approaches that fix the architecture and update the weights in the continuous optimization space, our approach optimizes the feature extractor in the discrete space, which does not require any back-propagation and thus does not incur additional memory overhead. Extensive experiments on transfer learning datasets demonstrate that TinyTL achieves the same level of (or even higher) accuracy than fine-tuning the full network while reducing the training memory footprint by up to 13.3×. Our contributions can be summarized as follows:


  • We propose TinyTL, a novel transfer learning method for memory-efficient on-device learning. To the best of our knowledge, this is the first work that tackles this challenging but critical problem.

  • We systematically analyze the memory bottleneck of training and find that the heavy memory cost comes from updating the weights, not the biases (assuming ReLU activation).

  • We propose two novel techniques (lite residual learning and feature extractor adaptation) to improve the model capacity while freezing the weights with little memory overhead.

  • Extensive experiments on transfer learning tasks show that our method is highly memory-efficient and effective. It reduces the training memory footprint by up to 13.3×, making it possible to learn on memory-constrained edge devices (e.g., Raspberry Pi) without sacrificing accuracy.

2 Related Work

Efficient Inference Techniques.

Improving the inference efficiency of deep neural networks on resource-constrained edge devices has recently drawn extensive attention. Starting from Han et al. (2015); Gong et al. (2014); Han et al. (2016); Denton et al. (2014); Vanhoucke et al. (2011), one line of research focuses on compressing pre-trained neural networks, including i) network pruning that removes less-important units Han et al. (2015); Frankle and Carbin (2019) or channels Liu et al. (2017); He et al. (2017); ii) network quantization that reduces the bitwidth of parameters Han et al. (2016); Courbariaux et al. (2015) or activations Jacob et al. (2018); Wang et al. (2019). However, these techniques cannot handle the training phase, as they rely on a well-trained model on the target task as the starting point.

Another line of research focuses on lightweight neural architectures via either manual design Iandola et al. (2016); Howard et al. (2017); Sandler et al. (2018); Zhang et al. (2018); Huang et al. (2018) or neural architecture search Tan et al. (2019); Cai et al. (2019b); Wu et al. (2019); Cai et al. (2018). These lightweight neural networks provide highly competitive accuracy Tan and Le (2019); Cai et al. (2020) while significantly improving inference efficiency. However, they do not solve the key bottleneck of training memory efficiency: the training memory is dominated by activations, not parameters. For example, Figure 1 (right) shows the cost comparison between ResNet-50 and MobileNetV2-1.4. In terms of parameter size, MobileNetV2-1.4 is 4.3× smaller than ResNet-50. However, in terms of training activation size, MobileNetV2-1.4 is almost the same as ResNet-50 (only 1.1× smaller), leading to little memory footprint reduction.

Training Memory Cost Reduction.

Researchers have been seeking ways to reduce the training memory footprint. One typical approach is to re-compute discarded activations during the backward pass Gruslys et al. (2016); Chen et al. (2016). This approach reduces memory usage at the cost of a large computation overhead, and is thus not preferred for edge devices. Layer-wise training Greff et al. (2016) can also reduce the memory footprint compared to end-to-end training; however, it cannot achieve the same level of accuracy as end-to-end training. Another representative approach is activation pruning Liu et al. (2019b), which builds a dynamic sparse computation graph to prune activations during training. Similarly, Wang et al. (2018) proposes to reduce the bitwidth of training activations by introducing new reduced-precision floating-point formats. Different from these techniques, which prune or quantize existing networks with a given architecture, we adapt the architecture to different datasets; our method is orthogonal to these techniques.
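As a concrete (and hedged) illustration of the re-computation idea discussed above, the PyTorch sketch below uses the library's built-in checkpointing utility; the toy model and segment count are arbitrary choices for demonstration, not part of TinyTL.

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# Toy model: eight Linear+ReLU stages.
model = torch.nn.Sequential(
    *[torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU()) for _ in range(8)]
)
x = torch.randn(4, 256, requires_grad=True)

# Split the network into 4 segments: only segment-boundary activations are kept;
# the rest are discarded in the forward pass and re-computed during backward,
# trading extra computation for a smaller memory footprint.
out = checkpoint_sequential(model, 4, x)
out.sum().backward()
```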

Transfer Learning.

Neural networks pre-trained on large-scale datasets (e.g., ImageNet Deng et al. (2009)) are widely used as fixed feature extractors for transfer learning, where only the last layer needs to be fine-tuned Chatfield et al. (2014); Donahue et al. (2014); Gan et al. (2015); Sharif Razavian et al. (2014). This approach does not require storing the intermediate activations of the feature extractor, and thus is memory-efficient. However, its capacity is limited, resulting in poor accuracy, especially on datasets Maji et al. (2013); Krause et al. (2013) whose distribution is far from ImageNet (e.g., only 45.9% Aircraft top1 accuracy achieved by Inception-V3 Mudrakarta et al. (2019a)). Alternatively, fine-tuning the full network can achieve better accuracy Kornblith et al. (2019); Cui et al. (2018), but it requires a vast memory footprint and hence is not friendly for training on edge devices. Recently, Mudrakarta et al. (2019b) proposed to reduce the number of trainable parameters by only updating the parameters of the batch normalization (BN) layers Ioffe and Szegedy (2015). Unfortunately, parameter-efficiency does not translate to memory-efficiency: it still requires a large amount of memory (e.g., 326MB under batch size 8) to store the input activations of the BN layers (Table 1). Additionally, the accuracy of this approach is still much worse than fine-tuning the full network (70.7% vs. 85.5%; Table 1). One can also partially fine-tune some layers, but how many layers to select remains ad hoc. This paper provides a systematic approach to adapt the feature extractor to different datasets and uses lite residual learning to save memory.

3 Method

3.1 Understanding the Memory Footprint of Back-propagation

Without loss of generality, we consider a neural network $\mathcal{M}$ that consists of a sequence of layers:

$\mathcal{M}(\cdot) = \mathcal{F}_{w_n}(\mathcal{F}_{w_{n-1}}(\cdots \mathcal{F}_{w_1}(\cdot) \cdots)),$   (1)

where $w_i$ denotes the parameters of the $i$-th layer. Let $a_i$ and $a_{i+1}$ be the input and output activations of the $i$-th layer, respectively, and $\mathcal{L}$ be the loss. In the backward pass, given $\partial \mathcal{L} / \partial a_{i+1}$, there are two goals for the $i$-th layer: computing $\partial \mathcal{L} / \partial a_i$ and $\partial \mathcal{L} / \partial w_i$.

Assuming the $i$-th layer is a linear layer whose forward process is given as $a_{i+1} = a_i W + b$, its backward process under batch size 1 is

$\dfrac{\partial \mathcal{L}}{\partial a_i} = \dfrac{\partial \mathcal{L}}{\partial a_{i+1}} \dfrac{\partial a_{i+1}}{\partial a_i} = \dfrac{\partial \mathcal{L}}{\partial a_{i+1}} W^{T}, \qquad \dfrac{\partial \mathcal{L}}{\partial W} = a_i^{T} \dfrac{\partial \mathcal{L}}{\partial a_{i+1}}, \qquad \dfrac{\partial \mathcal{L}}{\partial b} = \dfrac{\partial \mathcal{L}}{\partial a_{i+1}}.$   (2)

According to Eq. (2), the intermediate activations (i.e., $\{a_i\}$) that dominate the memory footprint are only required to compute the gradient of the weights (i.e., $\partial \mathcal{L} / \partial W$), not the bias. If we only update the bias, training memory can be greatly saved. This property also applies to convolution layers and normalization layers (e.g., batch normalization Ioffe and Szegedy (2015), group normalization Wu and He (2018), etc.) since they can be considered special types of linear layers.

Regarding non-linear activation layers (e.g., ReLU, sigmoid, h-swish; their detailed forward and backward processes are provided in Appendix D), sigmoid and h-swish require storing the input $a_i$ to compute $\partial \mathcal{L} / \partial a_i$, hence they are not memory-efficient. Activation layers built upon them, such as tanh and swish Ramachandran et al. (2018), are consequently not memory-efficient either. In contrast, ReLU and other ReLU-styled activation layers (e.g., LeakyReLU Xu et al. (2015)) only require storing a binary mask indicating whether each value is smaller than 0, which is 32× smaller than storing $a_i$.
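To make this concrete, here is a minimal PyTorch sketch (an illustration under our reading of Eq. (2), not the authors' implementation) of a linear layer whose weight is frozen: since $\partial \mathcal{L} / \partial b$ only needs the output gradient, the input activation is never saved; only the weight, which resides in memory anyway, is kept for the backward pass.

```python
import torch

class BiasOnlyLinear(torch.autograd.Function):
    """Linear layer y = x W^T + b with a frozen weight: following Eq. (2),
    dL/db = dL/dy, so the input activation x never needs to be saved."""

    @staticmethod
    def forward(ctx, x, weight, bias):
        # Only the (already resident) weight is kept for backward; x is NOT saved.
        ctx.save_for_backward(weight)
        return x @ weight.t() + bias

    @staticmethod
    def backward(ctx, grad_out):
        (weight,) = ctx.saved_tensors
        grad_x = grad_out @ weight        # dL/dx = dL/dy * W (propagate to earlier layers)
        grad_bias = grad_out.sum(dim=0)   # dL/db = dL/dy, summed over the batch
        return grad_x, None, grad_bias    # frozen weight -> no weight gradient


# Usage: only the bias receives a gradient, and no input activation is stored.
x = torch.randn(8, 128)
w = torch.randn(64, 128)                  # frozen weight
b = torch.zeros(64, requires_grad=True)
BiasOnlyLinear.apply(x, w, b).sum().backward()
print(b.grad.shape)                       # torch.Size([64])
```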

3.2 Tiny Transfer Learning

Based on the memory footprint analysis, one possible solution for reducing the memory cost is to freeze the weights of the pre-trained feature extractor while only updating the biases (Figure 2b). However, only updating biases has limited adaptation capacity. In this section, we explore two optimization techniques to improve the model capacity without updating the weights: i) lite residual modules that refine the intermediate feature maps (Figure 2c); ii) feature extractor adaptation that enables specialized feature extractors to best match different transfer datasets (Figure 2d).

Figure 2: TinyTL overview (“C” denotes the width and “R” denotes the resolution). Conventional transfer learning fixes the architecture of the feature extractor and relies on fine-tuning the weights to adapt the model (Fig. a), which requires a large amount of activation memory (in blue) for back-propagation. TinyTL reduces the memory usage by fixing the weights while: (Fig. b) only fine-tuning the biases; (Fig. c) exploiting lite residual learning to compensate for the capacity loss, using group convolution and avoiding the inverted bottleneck to achieve high arithmetic intensity and a small memory footprint; (Fig. d) adapting the feature extractor architecture to different downstream tasks, which can specialize a small feature extractor for an easy dataset (Flowers) and a large feature extractor for a difficult dataset (Aircraft). Their weights are shared from the same super-net, which is also parameter-efficient.

3.2.1 Lite Residual Learning

Formally, a layer with frozen weights $W$ and learnable bias $b$ can be represented as:

$a_{i+1} = \mathcal{F}_{W}(a_i) + b.$   (3)

To improve the model capacity while keeping a small memory footprint, we propose to add a lite residual module that generates a residual feature map to refine the output:

$a_{i+1} = \mathcal{F}_{W}(a_i) + b + \mathcal{F}_{w_r}(a_i'),$   (4)

where $a_i' = \mathrm{reduce}(a_i)$ is the reduced activation and $w_r$ denotes the parameters of the lite residual module. According to Eq. (2), learning these lite residual modules only requires storing the reduced activations $\{a_i'\}$ rather than the full activations $\{a_i\}$.

Implementation (Figure 2c).

We apply Eq. (4) to mobile inverted bottleneck blocks (MB-block) Sandler et al. (2018). The key principle is to keep the activation small. Following this principle, we explore two design dimensions to reduce the activation size:


  • Width. The widely-used inverted bottleneck requires a huge number of channels (6×) to compensate for the small capacity of a depthwise convolution, which is parameter-efficient but highly activation-inefficient. Even worse, converting 1× channels to 6× channels back and forth requires two 1×1 projection layers, which doubles the total activation to 12×. Depthwise convolution also has very low arithmetic intensity (its OPs/Byte is less than 4% of a 1×1 convolution's OPs/Byte with 256 channels), and is thus highly memory-inefficient with little data reuse. To address these limitations, our lite residual module employs group convolution (g=2), which has 300× higher arithmetic intensity than depthwise convolution, providing a good trade-off between FLOPs and memory. This also removes the 1×1 projection layers, further reducing the total channel number.

  • Resolution. The activation size grows quadratically with the resolution. Therefore, we aggressively shrink the resolution in the lite residual module by employing an average pooling layer to downsample the input feature map. The output of the lite residual module is then upsampled to match the size of the main branch's output feature map via bilinear upsampling. Combining the resolution and width optimizations, the activation of our lite residual module is much smaller than that of the inverted bottleneck (a minimal sketch follows this list).
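The following is a minimal, hedged PyTorch sketch of such a lite residual branch attached to a frozen block. It assumes equal input/output channel counts, a g=2 group convolution with kernel size 5 (as stated in Section 4), a 2× average-pool downsample, and bilinear upsampling; the channel width and downsample factor are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LiteResidual(nn.Module):
    """Lite residual branch (Eq. 4): runs on a downsampled input and adds its
    bilinearly upsampled output to the frozen main branch's output."""

    def __init__(self, channels, kernel_size=5, groups=2, downsample=2):
        super().__init__()
        self.downsample = downsample
        self.conv = nn.Conv2d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=groups, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, block_in, main_out):
        # reduce(a_i): aggressively shrink the resolution so that the stored
        # activation of this branch stays small.
        r = F.avg_pool2d(block_in, self.downsample)
        r = F.relu(self.bn(self.conv(r)))
        # Upsample the residual feature map back to the main branch's resolution.
        r = F.interpolate(r, size=main_out.shape[-2:], mode='bilinear',
                          align_corners=False)
        return main_out + r


# Example: refine the output of a frozen block with 64 input/output channels.
lite = LiteResidual(channels=64)
x = torch.randn(2, 64, 32, 32)
main_out = torch.randn(2, 64, 32, 32)   # stands in for the frozen block's output
print(lite(x, main_out).shape)          # torch.Size([2, 64, 32, 32])
```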

Figure 3: Transfer learning performances of various ImageNet pre-trained models with the last linear layer trained. The relative accuracy order between different pre-trained models changes significantly among ImageNet and the transfer learning datasets. For example, our specialized feature extractors (red) consistently achieve the best results on transfer datasets, though having weaker ImageNet accuracy. It suggests that we need to adapt the feature extractor to fit different transfer datasets instead of using the same one for all datasets.

3.2.2 Feature Extractor Adaptation

Motivation.

Conventional transfer learning chooses the feature extractor according to its pre-training accuracy (e.g., ImageNet accuracy) and uses the same one for all transfer tasks Cui et al. (2018); Mudrakarta et al. (2019b). However, we find this approach sub-optimal: different target tasks may need very different feature extractors, and high pre-training accuracy does not guarantee good transferability of the pre-trained weights. This is especially critical in our case, where the weights are frozen.

Figure 3 shows the top1 accuracy of various widely used ImageNet pre-trained models on three transfer datasets by only learning the last layer, which reflects the transferability of their pre-trained weights. The relative order between different pre-trained models is not consistent with their ImageNet accuracy on all three datasets. This result indicates that the ImageNet accuracy is not a good proxy for transferability. Besides, we also find that the same pre-trained model can have very different rankings on different tasks. For instance, Inception-V3 gives poor accuracy on Flowers but provides top results on the other two datasets. Therefore, we need to specialize the feature extractor to best match the target dataset.

Implementation (Figure 2d).

Motivated by these observations, we propose to adapt the feature extractor for different transfer tasks. This is achieved by allowing a set of candidate weight operations for each layer instead of using a single fixed weight operation $\mathcal{F}_{W}$:

$a_{i+1} = \mathcal{F}_{W_k}(a_i) + b + \mathcal{F}_{w_r}(a_i'), \qquad \mathcal{F}_{W_k} \in \{\mathcal{F}_{W_1}, \mathcal{F}_{W_2}, \cdots\}.$   (5)

It forms a discrete optimization space, allowing us to adapt the feature extractor for different target datasets without updating the weights. The detailed training flow is described as follows:

  • [leftmargin=*]

  • Pre-training. The size of all possible weight operation combinations is exponentially large w.r.t. the depth, making it computationally impossible to pre-train all of them independently. Therefore, we employ the weight sharing technique Cai et al. (2019b); Liu et al. (2019a); Yu et al. (2020) to reduce the pre-training cost, where a single super-net is jointly optimized on the pre-training dataset (e.g., ImageNet) to support all possible sub-nets (i.e., different combinations of weight operations). Different sub-nets can operate independently by selecting different parts from the super-net. For example, centered weights in the full convolution kernels are taken to form smaller convolution kernels; blocks are skipped to form a sub-net with a lower depth; channels are skipped to reduce the width of a convolution operation. In our experiments, we use ImageNet as the pre-training dataset. We employ progressive shrinking Cai et al. (2020); Yu et al. (2020) for training the super-net, using the same training setting suggested by Cai et al. (2020).

  • Fine-tuning the super-net. We fine-tune the pre-trained super-net on the target transfer dataset with the weights of the main branches (i.e., MB-blocks) frozen and the other parameters (i.e., biases, lite residual modules, classification head) updated via gradient descent. In this phase, we randomly sample one sub-net in each training step.

  • Discrete operation search. Based on the fine-tuned super-net, we collect 450 [sub-net, accuracy] pairs on the validation set (20% randomly sampled training data) and train an accuracy predictor (details are provided in Appendix E) using the collected data Cai et al. (2020). We employ evolutionary search Guo et al. (2019) based on the accuracy predictor to find the sub-net (i.e., the combination of weight operations) that best matches the target transfer dataset. No back-propagation on the super-net is required in this step, so it incurs no additional memory overhead (a sketch of the search loop is given after this list).

  • Final fine-tuning. Finally, we fine-tune the searched model with the weights of the main branches frozen and the other parameters updated (i.e., biases, lite residual modules, classification head), using the full training set to get the final results.
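For illustration only, below is a hedged Python sketch of the predictor-guided evolutionary search used in the discrete operation search step; the population size, number of generations, and mutation-only refill strategy are assumptions rather than the paper's exact hyper-parameters (the paper follows Guo et al. (2019)).

```python
import random

def evolutionary_search(predictor, sample_subnet, mutate,
                        population_size=100, generations=30):
    """Predictor-guided evolutionary search over sub-net encodings.

    `predictor(encoding)` returns a predicted accuracy, `sample_subnet()` draws
    a random encoding, and `mutate(encoding)` perturbs one. Scoring uses only
    the accuracy predictor, so no back-propagation on the super-net is needed."""
    population = [sample_subnet() for _ in range(population_size)]
    for _ in range(generations):
        ranked = sorted(population, key=predictor, reverse=True)
        parents = ranked[:population_size // 2]            # keep the best half
        children = [mutate(random.choice(parents))         # refill by mutation
                    for _ in range(population_size - len(parents))]
        population = parents + children
    return max(population, key=predictor)
```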

4 Experiments

Following the common practice Cui et al. (2018); Mudrakarta et al. (2019b); Kornblith et al. (2019), we evaluate our TinyTL on three benchmark datasets including Cars Krause et al. (2013), Flowers Nilsback and Zisserman (2008), and Aircraft Maji et al. (2013), using ImageNet Deng et al. (2009) as the pre-training dataset.

Figure 4: Results under different resolutions. At the same level of accuracy, TinyTL provides an order of magnitude training memory saving compared to fine-tuning the full ResNet-50, making it possible to learn on-device on a Raspberry Pi 1.
Model Architecture.

We build the super-net using the MobileNetV2 design space Tan et al. (2019); Cai et al. (2019b), which contains five stages with gradually decreasing resolution, where each stage consists of a sequence of MB-blocks. At the stage level, it supports different depths (i.e., 2, 3, 4). At the block level, it supports different kernel sizes (i.e., 3, 5, 7) and different width expansion ratios (i.e., 3, 4, 6); the detailed architecture of the super-net is provided in Appendix C. For each MB-block, we insert a lite residual module as described in Section 3.2.1 and Figure 2 (c), with group number 2 and kernel size 5. We use the ReLU activation since it is more memory-efficient according to Section 3.1.
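For reference, a compact Python sketch of this design space and a random sub-net sampler is shown below; the dictionary layout and field names are illustrative assumptions, not the released code. The `sample_subnet` helper is the kind of callable the search sketch in Section 3.2.2 assumes.

```python
import random

# Sketch of the super-net design space described above; names are illustrative.
DESIGN_SPACE = {
    "num_stages": 5,
    "depth_choices": [2, 3, 4],          # MB-blocks per stage
    "kernel_size_choices": [3, 5, 7],    # per MB-block
    "expand_ratio_choices": [3, 4, 6],   # per MB-block
    "lite_residual": {"groups": 2, "kernel_size": 5},
}

def sample_subnet():
    """Draw one random sub-net encoding from the design space."""
    depths = [random.choice(DESIGN_SPACE["depth_choices"])
              for _ in range(DESIGN_SPACE["num_stages"])]
    blocks = [{"kernel_size": random.choice(DESIGN_SPACE["kernel_size_choices"]),
               "expand_ratio": random.choice(DESIGN_SPACE["expand_ratio_choices"])}
              for depth in depths for _ in range(depth)]
    return {"depths": depths, "blocks": blocks}
```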

Training Details.

We freeze the weights of the feature extractor while allowing the biases to be updated during transfer learning. Both lite residual learning (LiteResidual; Section 3.2.1) and feature extractor adaptation (FeatureAdapt; Section 3.2.2) are applied in our experiments. For fine-tuning the pre-trained super-net, we use the Adam optimizer Kingma and Ba (2014) with an initial learning rate of 4e-3 following the cosine learning rate decay Loshchilov and Hutter (2016). The model is trained on 80% randomly sampled training data for 50 epochs. For fine-tuning the searched model, we use the same training setting but on the full training data. Additionally, we apply 8-bit weight quantization Han et al. (2016) on the frozen weights to reduce the parameter size, which causes a negligible accuracy drop in our experiments. For all compared methods, we also assume 8-bit weight quantization is applied, if eligible, when calculating their training memory footprint.
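Below is a hedged PyTorch sketch of this fine-tuning setup. The parameter-name filters ('lite_residual', 'classifier') are hypothetical, and the 8-bit quantizer shown is a generic per-tensor symmetric scheme rather than necessarily the exact one used in the paper.

```python
import torch

def make_transfer_optimizer(model, lr=4e-3, epochs=50):
    """Freeze main-branch weights; train only biases, lite residual modules,
    and the classification head with Adam + cosine decay (Section 4)."""
    for name, p in model.named_parameters():
        p.requires_grad = ('bias' in name or 'lite_residual' in name
                           or 'classifier' in name)
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler

def quantize_frozen_weight_8bit(w: torch.Tensor):
    """Generic per-tensor symmetric 8-bit quantization of a frozen weight."""
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale  # recover with q.float() * scale
```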

Group | Method | Train. Mem. on Flowers (Batch Size = 8) | Reduce Rate | Cars Top1 (%) | Flowers Top1 (%) | Aircraft Top1 (%)
Last | Inception-V3 Mudrakarta et al. (2019b) | 94MB | 1.0× | 55.0 | 84.5 | 45.9
Last | ResNet-50 | 76MB | 1.2× | 51.6 | 92.4 | 41.5
TinyTL | FeatureAdapt (FA) | 41MB | 2.3× | 60.7 | 94.2 | 51.5
BN+Last | ResNet-50 | 391MB | 1.0× | 80.1 | 95.4 | 72.2
BN+Last | Inception-V3 Mudrakarta et al. (2019b) | 326MB | 1.2× | 81.0 | 90.4 | 70.7
BN+Last | MobileNetV2 | 224MB | 1.7× | 77.5 | 95.0 | 65.5
TinyTL | LiteResidual (R=320) | 70MB | 5.6× | 89.4 | 96.9 | 81.5
TinyTL | LiteResidual (R=224) | 40MB | 9.8× | 85.5 | 96.2 | 75.6
Full | Inception-V3 Cui et al. (2018) | 850MB | 1.0× | 91.3 | 96.3 | 85.5
Full | ResNet-50 Kornblith et al. (2019) | 802MB | 1.1× | 91.7 | 97.5 | 86.6
Full | MobileNetV2-1.4 Kornblith et al. (2019) | 644MB | 1.3× | 91.8 | 97.5 | 86.8
Full | NASNet-A Kornblith et al. (2019) | 566MB | 1.5× | 88.5 | 96.8 | 72.8
TinyTL | FA + LiteResidual (R=320) | 92MB | 9.2× | 92.3 | 97.6 | 85.4
TinyTL | FA + LiteResidual (R=256) | 64MB | 13.3× | 91.6 | 97.5 | 84.0
Table 1: Comparison with conventional transfer learning methods (“Reduce Rate” is relative to the first row of each group; some baseline results are our re-implementations). “R” denotes the input image size. TinyTL reduces the training memory by 13.3× without sacrificing accuracy compared to fine-tuning the full Inception-V3.
Main Results.

Table 1 reports the comparison between TinyTL and previous transfer learning methods, which are divided into three groups: i) fine-tuning the last linear layer Chatfield et al. (2014); Donahue et al. (2014); Sharif Razavian et al. (2014) (referred to as Last); ii) fine-tuning the BN layers and the last linear layer Mudrakarta et al. (2019a) (referred to as BN+Last); iii) fine-tuning the full network Kornblith et al. (2019); Cui et al. (2018) (referred to as Full).

In the first group, we only apply FeatureAdapt to adapt the feature extractor while only training the parameters of the last linear layer, similar to Last. Compared to Last+Inception-V3, our model reduces the training memory cost by 2.3× while improving the top1 accuracy by 5.7% on Cars, 9.7% on Flowers, and 5.6% on Aircraft. This shows that our specialized feature extractors fit different transfer datasets better than these fixed feature extractors. In the second group, we only apply LiteResidual to refine the intermediate feature maps, using Proxyless-Mobile Cai et al. (2019b) as the feature extractor. Compared to BN+Last with ResNet-50, our model improves the training memory efficiency by 9.8× while providing consistently better accuracy (5.4% higher on Cars, 0.8% higher on Flowers, and 3.4% higher on Aircraft). By increasing the input image size from 224 to 320, we can further increase the accuracy improvement from 5.4% to 9.3% on Cars, from 0.8% to 1.5% on Flowers, and from 3.4% to 9.3% on Aircraft, which shows that learning lite residual modules and biases is not only more memory-efficient but also more effective than BN+Last. In the third group, we apply both FeatureAdapt and LiteResidual. Compared to Full+Inception-V3, TinyTL achieves the same level of accuracy while providing a 13.3× training memory saving, reducing the training memory from 850MB to 64MB (the same level as only learning the last linear layer).

Figure 4 demonstrates the results under different input resolutions. With similar accuracy, TinyTL provides an order of magnitude memory reduction (11.5× on Cars, 12.5× on Flowers, and 9.5× on Aircraft) compared to fine-tuning the full ResNet-50. Remarkably, it moves the training memory cost from the out-of-memory region (red) to the feasible region (green) on a Raspberry Pi 1, making it possible to learn on-device without sacrificing accuracy.

4.1 Ablation Studies and Discussions

Figure 5: Compared with dynamic activation pruning Liu et al. (2019b), TinyTL saves the memory more effectively.
Comparison with Dynamic Activation Pruning.

The comparison between TinyTL and dynamic activation pruning Liu et al. (2019b) is summarized in Figure 5. TinyTL is more effective because it re-designs the transfer learning architecture (lite residual module, feature extractor adaptation) rather than pruning an existing architecture. The transfer accuracy of activation pruning drops quickly when the pruning ratio increases beyond 50% (only 2× memory saving). In contrast, TinyTL achieves a much higher memory reduction without loss of accuracy.

Effectiveness of LiteResidual.

Figure 6 (left) shows the results of TinyTL with and without LiteResidual (only bias) on Aircraft, where we can observe significant accuracy drops (up to 7.4%) if disabling the lite residual modules.

Pre-trained Weight Matters, Not Only Architecture.

Figure 6 (middle) reports the performance of TinyTL when the searched feature extractor is re-trained on ImageNet (only arch). The re-trained feature extractor cannot reach the same accuracy as keeping both the pre-trained weights and the architecture. This suggests that not only does the architecture of the feature extractor matter; the pre-trained weights also contribute substantially to the final performance.

Figure 6: Left & Middle: Ablation Studies of TinyTL on Aircraft. Right: TinyTL reduces both the parameter size and the activation size, providing a more balanced cost composition than previous efficient inference techniques that focus on reducing the parameter size.
Figure 7: TinyTL can adapt the feature extractor’s architecture to different transfer datasets.
Adapt the Feature Extractor to Different Transfer Datasets.

Figure 7 reports the details of the transfer learning datasets and the corresponding feature extractors specialized for these datasets in TinyTL. For an easier dataset such as Flowers (fewer #classes, fewer #training images), TinyTL chooses a smaller feature extractor (fewer blocks, fewer parameters, less computation). For a more difficult dataset like Cars (more #classes, more #training images), TinyTL chooses a larger feature extractor (more blocks, more parameters, more computation).

Figure 8: On-device training cost on Flowers. Achieving the same accuracy, TinyTL requires a 10× smaller memory cost and an 18× smaller computation cost compared to fine-tuning the full MobileNetV2-1.4 Kornblith et al. (2019).
Cost Details.

As shown in Figure 6 (right), TinyTL reduces both the parameter size and the activation size, instead of only reducing the parameter size as previous efficient inference methods did, and hence provides a more balanced cost composition. The activation size reported here is the peak activation size across the three on-device phases (Section 3.2.2): fine-tuning the super-net, discrete operation search, and final fine-tuning. Concretely, for each layer, we compute the size of already-stored activations (required by back-propagation), the size of already-stored binary masks (required by ReLU layers), and the size of buffers (required by the forward process). The peak value of their sum across all layers is taken as the peak activation size.
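As a hedged illustration of this accounting, the following Python sketch computes the peak activation size from hypothetical per-layer records; the field names are assumptions introduced only for this example.

```python
def peak_activation_size(layers):
    """Accounting sketch for the peak activation size (in bytes) described
    above: activations saved for back-propagation + binary ReLU masks +
    temporary forward buffers, with the peak taken across layers."""
    saved = 0   # running total of tensors that must stay alive for backward
    peak = 0
    for layer in layers:
        saved += layer["saved_activation_bytes"]   # inputs kept for weight/bias grads
        saved += layer["relu_mask_bytes"]          # 1 bit per element for ReLU layers
        current = saved + layer["forward_buffer_bytes"]
        peak = max(peak, current)
    return peak


# Toy usage with two hypothetical layers:
print(peak_activation_size([
    {"saved_activation_bytes": 0, "relu_mask_bytes": 4096, "forward_buffer_bytes": 1 << 20},
    {"saved_activation_bytes": 1 << 18, "relu_mask_bytes": 0, "forward_buffer_bytes": 1 << 19},
]))
```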

The on-device training cost is summarized in Figure 8. TinyTL reduces the training memory by 10× and the training computation by 18×, achieving the same accuracy as fine-tuning the full MobileNetV2-1.4. The peak memory cost of TinyTL under resolution 256 is 64MB, while the total MAC is 491T. In contrast, fine-tuning the full network requires 644MB, and the total MAC is 8,919T (20,000 steps with batch size 256 Kornblith et al. (2019); we report the memory cost under batch size 8 for consistency, which does not change the reduction ratio). TinyTL is thus not only much more memory-efficient but also much more computation-efficient.

5 Conclusion

We proposed Tiny-Transfer-Learning (TinyTL) for memory-efficient on-device learning that aims to adapt pre-trained models to newly collected data on edge devices. Unlike previous transfer learning methods that fix the architecture and fine-tune the weights to fit different target datasets, TinyTL fixes the weights while adapting the architecture of the feature extractor and learning memory-efficient lite residual modules and biases to fit different target datasets. Extensive experiments on benchmark transfer learning datasets consistently show the effectiveness and memory-efficiency of TinyTL, paving the way for efficient on-device machine learning.

References

  • H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang (2018) Efficient architecture search by network transformation. In AAAI, Cited by: §2.
  • H. Cai, C. Gan, T. Wang, Z. Zhang, and S. Han (2020) Once for all: train one network and specialize it for efficient deployment. In ICLR, External Links: Link Cited by: §1, §2, 1st item, 3rd item.
  • H. Cai, J. Lin, Y. Lin, Z. Liu, K. Wang, T. Wang, L. Zhu, and S. Han (2019a) AutoML for architecting efficient and specialized neural networks. IEEE Micro. Cited by: §1.
  • H. Cai, L. Zhu, and S. Han (2019b) ProxylessNAS: direct neural architecture search on target task and hardware. In ICLR, External Links: Link Cited by: §2, 1st item, §4, §4.
  • K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman (2014) Return of the devil in the details: delving deep into convolutional nets. In BMVC, Cited by: §2, §4.
  • T. Chen, B. Xu, C. Zhang, and C. Guestrin (2016) Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174. Cited by: §2.
  • M. Courbariaux, Y. Bengio, and J. David (2015) Binaryconnect: training deep neural networks with binary weights during propagations. In NeurIPS, Cited by: §2.
  • Y. Cui, Y. Song, C. Sun, A. Howard, and S. Belongie (2018) Large scale fine-grained categorization and domain-specific transfer learning. In CVPR, Cited by: §2, §3.2.2, §4, Table 1, §4.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, Cited by: §2, §4.
  • E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In NeurIPS, Cited by: §2.
  • J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell (2014) Decaf: a deep convolutional activation feature for generic visual recognition. In ICML, Cited by: §2, §4.
  • J. Frankle and M. Carbin (2019) The lottery ticket hypothesis: finding sparse, trainable neural networks. In ICLR, Cited by: §2.
  • C. Gan, N. Wang, Y. Yang, D. Yeung, and A. G. Hauptmann (2015) Devnet: a deep event network for multimedia event detection and evidence recounting. In CVPR, pp. 2568–2577. Cited by: §2.
  • Y. Gong, L. Liu, M. Yang, and L. Bourdev (2014) Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115. Cited by: §2.
  • K. Greff, R. K. Srivastava, and J. Schmidhuber (2016) Highway and residual networks learn unrolled iterative estimation. arXiv preprint arXiv:1612.07771. Cited by: §2.
  • A. Gruslys, R. Munos, I. Danihelka, M. Lanctot, and A. Graves (2016) Memory-efficient backpropagation through time. In NeurIPS, Cited by: §2.
  • Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei, and J. Sun (2019) Single path one-shot neural architecture search with uniform sampling. arXiv preprint arXiv:1904.00420. Cited by: 3rd item.
  • S. Han, H. Mao, and W. J. Dally (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. In ICLR, Cited by: §1, §2, §4.
  • S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. In NeurIPS, Cited by: §1, §1, §2.
  • Y. He, X. Zhang, and J. Sun (2017) Channel pruning for accelerating very deep neural networks. In ICCV, Cited by: §2.
  • A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: Table 2, §1, §2.
  • A. Howard, M. Sandler, G. Chu, L. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al. (2019) Searching for mobilenetv3. In ICCV, Cited by: Table 3, §1.
  • G. Huang, S. Liu, L. Van der Maaten, and K. Q. Weinberger (2018) Condensenet: an efficient densenet using learned group convolutions. In CVPR, Cited by: §2.
  • F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer (2016) SqueezeNet: alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360. Cited by: §2.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, Cited by: §2, §3.1.
  • B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In CVPR, Cited by: §2.
  • N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al. (2017) In-datacenter performance analysis of a tensor processing unit. In ISCA, Cited by: §1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.
  • S. Kornblith, J. Shlens, and Q. V. Le (2019) Do better imagenet models transfer better?. In CVPR, Cited by: §2, Figure 8, §4, §4.1, Table 1, §4.
  • J. Krause, M. Stark, J. Deng, and L. Fei-Fei (2013) 3d object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Cited by: §2, §4.
  • H. Liu, K. Simonyan, and Y. Yang (2019a) DARTS: differentiable architecture search. In ICLR, Cited by: 1st item.
  • L. Liu, L. Deng, X. Hu, M. Zhu, G. Li, Y. Ding, and Y. Xie (2019b) Dynamic sparse graph for efficient deep learning. In ICLR, Cited by: §2, Figure 5, §4.1.
  • Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang (2017) Learning efficient convolutional networks through network slimming. In ICCV, Cited by: §2.
  • I. Loshchilov and F. Hutter (2016) SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983. Cited by: §4.
  • S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi (2013) Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151. Cited by: §2, §4.
  • P. K. Mudrakarta, M. Sandler, A. Zhmoginov, and A. Howard (2019a) K for the price of 1: parameter efficient multi-task and transfer learning. In ICLR, Cited by: §2, §4.
  • P. K. Mudrakarta, M. Sandler, A. Zhmoginov, and A. Howard (2019b) K for the price of 1: parameter-efficient multi-task and transfer learning. In ICLR, Cited by: §2, §3.2.2, Table 1, §4.
  • M. Nilsback and A. Zisserman (2008) Automated flower classification over a large number of classes. In Sixth Indian Conference on Computer Vision, Graphics & Image Processing, Cited by: §4.
  • P. Ramachandran, B. Zoph, and Q. V. Le (2018) Searching for activation functions. In ICLR Workshop, Cited by: §3.1.
  • M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. In CVPR, Cited by: Figure 9, Table 2, §1, §1, §2, §3.2.1.
  • A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson (2014) CNN features off-the-shelf: an astounding baseline for recognition. In CVPR Workshops, Cited by: §2, §4.
  • M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le (2019) Mnasnet: platform-aware neural architecture search for mobile. In CVPR, Cited by: §1, §2, §4.
  • M. Tan and Q. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In ICML, Cited by: §2.
  • V. Vanhoucke, A. Senior, and M. Z. Mao (2011) Improving the speed of neural networks on cpus. In NeurIPS Deep Learning and Unsupervised Feature Learning Workshop, Cited by: §2.
  • K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han (2019) HAQ: hardware-aware automated quantization. In CVPR, Cited by: §2.
  • N. Wang, J. Choi, D. Brand, C. Chen, and K. Gopalakrishnan (2018) Training deep neural networks with 8-bit floating point numbers. In NeurIPS, Cited by: §2.
  • B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer (2019) FBNet: hardware-aware efficient convnet design via differentiable neural architecture search. In CVPR, Cited by: §1, §2.
  • Y. Wu and K. He (2018) Group normalization. In ECCV, Cited by: §3.1.
  • B. Xu, N. Wang, T. Chen, and M. Li (2015) Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853. Cited by: §3.1.
  • J. Yu, P. Jin, H. Liu, G. Bender, P. Kindermans, M. Tan, T. Huang, X. Song, R. Pang, and Q. Le (2020) Bignas: scaling up neural architecture search with big single-stage models. arXiv preprint arXiv:2003.11142. Cited by: 1st item.
  • X. Zhang, X. Zhou, M. Lin, and J. Sun (2018) Shufflenet: an extremely efficient convolutional neural network for mobile devices. In CVPR, Cited by: §1, §2.

Appendix A Detailed Architectures of Specialized Feature Extractors

Figure 9: Detailed architectures of the feature extractors on different transfer datasets. “LR” denotes the lite residual module (Section 3.2.1), while “MB4” denotes the mobile inverted bottleneck block Sandler et al. (2018) with expansion ratio 4 and kernel size 7. TinyTL adapts a higher-capacity feature extractor for a harder task (Cars).

Appendix B Details of the On-device Training Cost

The detailed training cost of the on-device learning phases is described as follows:


  • Fine-tuning the super-net. We fine-tune the pre-trained super-net under resolution 224. The peak memory cost of this phase is 64MB, which is reached when the largest sub-net is sampled. Regarding the computation cost, the average MAC (forward & backward) of sampled sub-nets is (802M + 2535M) / 2 = 1668.5M per sample, where 802M is the training MAC of the smallest sub-net and 2535M is the training MAC of the largest sub-net. (In the super-net fine-tuning phase, the training MAC of a sampled sub-net is roughly 2× its inference MAC, rather than 3×, since we do not need to update the weights of the main branches.) Therefore, the total MAC of this phase is 1668.5M × 2040 × 0.8 × 50 ≈ 136T (27.7% of 491T) on Flowers, where 2040 is the number of total training samples, 0.8 means the super-net is fine-tuned on 80% of the training samples (the remaining 20% is reserved for search), and 50 is the number of training epochs.

  • Discrete operation search. As discussed in Appendix E, the memory overhead and computation overhead of the accuracy predictor are negligible. The primary memory cost and computation cost of this phase come from collecting the 450 [sub-net, accuracy] pairs required to train the accuracy predictor. This only involves the forward processes of sampled sub-nets, and no back-propagation is required; therefore, the memory overhead of this phase is negligible compared to the super-net fine-tuning phase. The average MAC (forward only) of sampled sub-nets is (352M + 1179M) / 2 = 765.5M per sample, where 352M is the inference MAC of the smallest sub-net and 1179M is the inference MAC of the largest sub-net. Therefore, the total MAC of this phase is 765.5M × 2040 × 0.2 × 450 ≈ 141T (28.7% of 491T) on Flowers, where 2040 is the number of total training samples, 0.2 means the validation set consists of 20% of the training samples, and 450 is the number of measured sub-nets.

  • Final fine-tuning. To achieve the same accuracy as fine-tuning the full MobileNetV2-1.4, we use a resolution of 256. The memory cost of this phase is 63.9MB, and the total MAC is 2100M × 2040 × 1.0 × 50 ≈ 214T (43.6% of 491T) on Flowers, where 2100M is the per-sample training MAC, 2040 is the number of total training samples, 1.0 means the full training set is used, and 50 is the number of training epochs.

Appendix C Detailed Architecture of the Super-Net

Columns: Input, Operator, #Out, Stride, LiteResidual Stride, Kernel Size, Expand Ratio, Repeat. Operator sequence: Conv2d stem, SepConv block, eleven MB-LiteResidual block groups, Conv2d, Avg-pool, Linear.
Table 2: Detailed architecture of the super-net using the MobileNetV2 design space with lite residual modules (Section 3.2.1). “SepConv” denotes the separable convolution block Howard et al. (2017) that consists of a depthwise-separable convolution layer and a convolution layer. “MB-LiteResidual” denotes the mobile inverted bottleneck block Sandler et al. (2018) with a lite residual module (described in Section 3.2.1).

Appendix D Memory Footprint of Non-Linear Activation Layers

Layer Type | Forward | Backward | Memory Cost
ReLU | $a_{i+1} = \max(a_i, 0)$ | $\partial\mathcal{L}/\partial a_i = \partial\mathcal{L}/\partial a_{i+1} \circ \mathbf{1}_{a_i > 0}$ | $|a_i|$ bits
sigmoid | $a_{i+1} = \sigma(a_i)$ | $\partial\mathcal{L}/\partial a_i = \partial\mathcal{L}/\partial a_{i+1} \circ \sigma(a_i) \circ (1 - \sigma(a_i))$ | $32|a_i|$ bits
h-swish Howard et al. (2019) | $a_{i+1} = a_i \circ \mathrm{ReLU6}(a_i + 3)/6$ | $\partial\mathcal{L}/\partial a_i = \partial\mathcal{L}/\partial a_{i+1} \circ h'(a_i)$, a piecewise function of $a_i$ | $32|a_i|$ bits
Table 3: Detailed forward and backward processes of non-linear activation layers. $|a_i|$ denotes the number of elements of $a_i$; “$\circ$” denotes the element-wise product; $\mathbf{1}_{a_i > 0}$ equals 1 if $a_i > 0$ and 0 otherwise; $\sigma(\cdot)$ denotes the sigmoid function.

Appendix E Details of the Accuracy Predictor

The accuracy predictor is a three-layer feed-forward neural network with a hidden dimension of 400 and ReLU as the activation function for each layer. It takes the one-hot encoding of the sub-net’s architecture as the input and outputs the predicted accuracy of the given sub-net. The inference MAC of this accuracy predictor is only 0.37M, which is 3-4 orders of magnitude smaller than the inference MAC of the CNN classification models. The memory footprint of this accuracy predictor is only 5KB. Therefore, both the computation overhead and the memory overhead of the accuracy predictor are negligible.
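For concreteness, a hedged PyTorch sketch of such a predictor is shown below; the encoding dimension and the exact interpretation of "three-layer" (here, three linear layers) are assumptions.

```python
import torch
import torch.nn as nn

class AccuracyPredictor(nn.Module):
    """Sketch of the accuracy predictor described above: a small feed-forward
    network with hidden dimension 400 and ReLU activations, mapping a one-hot
    architecture encoding to a predicted accuracy."""

    def __init__(self, encoding_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(encoding_dim, 400), nn.ReLU(),
            nn.Linear(400, 400), nn.ReLU(),
            nn.Linear(400, 1),
        )

    def forward(self, one_hot_arch: torch.Tensor) -> torch.Tensor:
        return self.net(one_hot_arch).squeeze(-1)


# Example: predict the accuracy of a batch of 4 encoded sub-nets (encoding_dim assumed 128).
predictor = AccuracyPredictor(encoding_dim=128)
print(predictor(torch.rand(4, 128)).shape)   # torch.Size([4])
```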