Accelerating Training using Tensor Decomposition

09/10/2019, by Mostafa Elhoushi, et al.

Tensor decomposition is one of the well-known approaches to reduce the latency and the number of parameters of a pre-trained model. In this paper, however, we propose an approach that uses tensor decomposition to reduce the time needed to train a model from scratch. In our approach, we train the model from scratch (i.e., with randomly initialized weights) in its original architecture for a small number of epochs, then decompose the model, and then continue training the decomposed model until the end. An optional step in our approach converts the decomposed architecture back to the original architecture. We present results of using this approach on both the CIFAR10 and Imagenet datasets, and show that there can be up to a 2x speedup in training time with an accuracy drop of up to 1.5%. Our training acceleration approach is independent of hardware and is expected to yield similar speedups on both CPU and GPU platforms.


Introduction

While deep learning has obtained high accuracy in computer vision, natural language understanding, and many other fields, one of its main challenges is the extensive computation cost required to train its models. A typical neural network architecture may take up to 2 weeks to train on the Imagenet dataset. Using distributed training, a ResNet50 model can train in 6 minutes, but at the financial and energy cost of 1024 GPUs [15]. The energy required to train the average deep learning model is equivalent to burning fossil fuel that releases around 78,000 pounds of carbon, which is more than half of a car's output during its lifetime [32]. The financial and time costs of training deep learning models are making it increasingly difficult for small- and medium-sized companies and research labs to explore various architectures and hyperparameters.

While there has been extensive development to meet the increasing computation cost of training, by improving the hardware design of GPUs - as well as other forms of specialized hardware - and by advancements in distributed training using large numbers of servers and GPUs, there has been less progress in reducing the computation cost of training itself. [29] estimated that from 2012 to 2018 the computation required for deep learning research increased by a factor of roughly 300,000.

In this paper we propose a hardware-independent method to reduce the computation cost of training using tensor decomposition. A large body of research has addressed compressing pre-trained models using tensor decomposition. However, to the best of our knowledge, this paper is the first to propose using tensor decomposition during training to reduce the computation cost and training time.

In this paper, we will first present related work in reducing the computation cost and latency of models during inference, followed by related work in reducing training time. Then, we explain tensor decomposition, and one of its specific methods - Tucker decomposition - that we use in our solution. We then present our proposed solution to decompose the model during training, followed by the results and performance on the CIFAR10 and Imagenet datasets.

Related Work

Inference Acceleration

This section covers related work in reducing CNN model size and speeding up training and/or inference, which falls into several categories. The first is searching for or designing architectures that have fewer parameters, and hence lower computation latency, while maintaining reasonable prediction accuracy. Proposed models under this category are SqueezeNet [13], MobileNets [11], and MobileNetV2 [28]. Such methods usually result in a considerable drop in accuracy: SqueezeNet has Top1/Top5 accuracies of 58.1%/80.4%, MobileNets have 69.6%/89.1%, and MobileNetV2 has 71.8%/90.4%.

The second category is replacing a portion or all of the convolution operations in a model with operations that require less computation time or fewer parameters. Binarized Neural Networks [3], XNOR-Net [27], Bi-Real Net [21], and ABCNet [19] replaced the multiplications in the convolution operation with logical XNOR operations. Such XNOR-based models usually use regular convolutions during training and therefore do not speed up training, although they speed up inference and reduce parameter size. [2] proposed Octave Convolution, a spatially less-redundant variant of regular convolution. [35] proposed a pixel-wise shift operation to replace a portion of convolution filters.

The third category is quantizing the parameters of regular convolution operations from a 32-bit floating point representation to a smaller number of bits, such as 8-bit integers. [31] used a vector-quantization method that compresses ResNet models by up to 20x. However, most quantization methods require training using regular convolution till the end before quantizing, and therefore do not speed up training, although they speed up inference.

The fourth category is pruning: removing a portion of the convolution filters in each layer, usually based on some heuristic. [10] and [17] published some of the earlier works on pruning. [12] proposed using the information entropy of filters as the criterion to prune filters. [7] proposed an end-to-end pruning method using trainable masks to decide which convolution filters should be pruned. [25] used Taylor expansion to estimate the contribution of each filter, and hence prune the filters that contribute less to the model accuracy. While most literature on pruning shows that training a pruned model from scratch gives lower accuracy than pruning a pretrained model to the same pruned architecture, some researchers have reported otherwise [22]. Nevertheless, although training a pruned architecture is usually faster than training the original model, obtaining the pruned architecture requires training the full model till the end in the first place, while our proposed solution only requires training the first 10 or 30 epochs before compressing.

The fifth category is tensor decomposition: separating a regular convolution operation into multiple smaller convolutions whose combined number of parameters or combined latency is less than that of the original operation. This is explained in the next subsection.

Tensor Decomposition

Figure 1: Singular Value Decomposition: (a) exact decomposition; (b) approximate decomposition.
Figure 2: Illustration of different types of decompositions: (a) regular convolution; (b) spatial decomposition; (c) channel decomposition; (d) depthwise decomposition; (e) Tucker decomposition. Note that the ranks of the decomposition are always chosen to be less than the original channel dimensions in order to achieve a reduction in the number of parameters.

Tensor decomposition is based on a concept of linear algebra known as Singular Value Decomposition (SVD), which states that any matrix $A$, whose dimensions are $m \times n$ (with $m \geq n$, without loss of generality), can be expressed as:

$A = U S V^{T}$    (1)

As shown in Figure 1, the dimensions of $U$, $S$, and $V^{T}$ are $m \times r$, $r \times r$, and $r \times n$ respectively. $U$ and $V$ are unitary matrices, and $S$ is a diagonal matrix. Each value along the diagonal of $S$ represents the "importance" of the corresponding column of $U$ and row of $V^{T}$. The SVD algorithm specifies how to calculate the values of the $U$, $S$, and $V$ matrices so that this equality holds. $r$ is known as the rank of the matrix $A$. Most of the time, the rank of a matrix is $r = n$. If one or more rows or columns of the matrix are linearly dependent on other rows or columns, then the rank is $r < n$. However, in the context of neural networks, where the values of the weight matrices are updated during training, this condition is unlikely to happen.

In order to decompose $A$ into terms that have a smaller number of parameters, we need to set the rank of the decomposition, $\hat{r}$, to be less than the rank $r$ of the matrix $A$:

$A \approx \hat{U} \hat{S} \hat{V}^{T}, \quad \hat{r} < r$    (2)

This results in an approximation. In the extreme case of choosing a decomposition rank $\hat{r} = 1$, the matrix can be represented with $m + 1 + n$ parameters, compared to the $m \times n$ parameters of the original matrix. For large values of $m$ and $n$, $m + 1 + n \ll m \times n$, and hence the storage of the decomposed representation and the FLOPs of matrix operations on the decomposed representation are much lower.
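As a concrete illustration of this saving, the sketch below (a minimal NumPy example; the matrix size and rank are arbitrary choices) forms a truncated-SVD approximation as in Equation (2) and compares the parameter counts:

```python
import numpy as np

m, n, r_hat = 512, 256, 16        # illustrative sizes, not values from the paper
A = np.random.randn(m, n)

# Full SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the r_hat largest singular values (Equation 2).
A_approx = U[:, :r_hat] @ np.diag(s[:r_hat]) @ Vt[:r_hat, :]

original_params = m * n
decomposed_params = m * r_hat + r_hat + r_hat * n
rel_error = np.linalg.norm(A - A_approx) / np.linalg.norm(A)
print(f"parameters: {original_params} -> {decomposed_params}, relative error {rel_error:.3f}")
```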

The rank has to be selected carefully in order to balance the approximation error introduced against the compression obtained. Different methods of rank selection are explained in the following sub-sub-section.

In the context of neural networks, tensor decomposition extends SVD to the 4-dimensional weight tensors of convolution operators, which have dimensions $C_{out} \times C_{in} \times K_h \times K_w$, where $C_{out}$ is the number of channels of the output of the operator, $C_{in}$ is the number of channels of the input, $K_h$ is the height of each filter, and $K_w$ is the width of each filter. There are various types of tensor decompositions, as illustrated in Figure 2: spatial decomposition [18], channel decomposition [36], depthwise decomposition [9], and Tucker decomposition [34]. The mathematical expression of each decomposition type and its derivation from SVD are out of the scope of this paper, but they can be found in the reference for each decomposition method. This paper uses Tucker decomposition, which is explained in more detail in the Proposed Method section.
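For reference, this dimension ordering matches how PyTorch stores convolution weights, as the small check below shows (the layer sizes are arbitrary):

```python
import torch.nn as nn

# PyTorch stores convolution weights as (C_out, C_in, K_h, K_w),
# i.e. the 4-dimensional tensor that tensor decomposition factorizes.
conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)
print(conv.weight.shape)   # torch.Size([128, 64, 3, 3])
```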

Rank Selection

Some researchers have used time-consuming trial-and-error to select the optimal rank of decomposition for each convolution layer in a network, by analyzing the final accuracy of the model. [5] used alternating least squares. [23] proposed a data-driven one-shot decision using empirical Bayes. In this paper, we used variational Bayesian matrix factorization (VBMF) [26].
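Reproducing the analytic VBMF formula of [26] is beyond the scope of this note; as a simple stand-in, the sketch below picks ranks for the output-channel and input-channel modes of a convolution kernel by thresholding the singular-value energy of the corresponding unfoldings. The energy threshold and layer sizes are illustrative assumptions, not the paper's method:

```python
import torch
import torch.nn as nn

def unfolding_rank(weight: torch.Tensor, mode: int, energy: float = 0.9) -> int:
    """Smallest rank of the mode-`mode` unfolding that keeps `energy` of the
    squared singular-value spectrum. A crude stand-in for VBMF rank selection."""
    w = weight.detach()
    unfolded = w.transpose(0, mode).reshape(w.shape[mode], -1)  # mode-wise unfolding
    s = torch.linalg.svdvals(unfolded)
    cum = torch.cumsum(s ** 2, dim=0) / torch.sum(s ** 2)
    return int(torch.searchsorted(cum, torch.tensor(energy)).item()) + 1

conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)
r_out = unfolding_rank(conv.weight, mode=0)   # rank for the output-channel mode
r_in = unfolding_rank(conv.weight, mode=1)    # rank for the input-channel mode
print(r_out, r_in)
```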

Training Acceleration

To speed up the training of neural networks, the main research efforts in industry gear towards designing and enhancing hardware platforms. Vector units in CPUs enable them to perform arithmetic operations on a group of bytes in parallel; however, training on CPUs is only practical for small datasets such as MNIST. NVIDIA's Graphical Processing Units (GPUs) are the most popular hardware platforms used for training. The core principle of GPUs is the existence of hundreds or thousands of cores that perform computation on data in parallel. Other hardware platforms, such as Google's Tensor Processing Unit (TPU) and Huawei's Neural Processing Unit (NPU), depend on large dedicated matrix multipliers. Other accelerated systems based on custom FPGA and ASIC designs are also explored.

Another main method to accelerate training is distributed training: training over multiple GPUs on the same workstation, or over multiple workstations, each with one or more GPUs. Distributed training methods can be classified into model parallelism and data parallelism methods. Frameworks and libraries have provided support for distributed training [30] to automate the process of distributing workload and accumulating results. A detailed survey of parallel and distributed deep learning is presented in [1]. A key research area in distributed training is optimizing communication between the various workstations [20].

Our proposed method of accelerating training is hardware independent, i.e., it does not require a specific hardware design. It can build upon the speedups from faster hardware designs and distributed approaches.

Other hardware-independent approaches in the literature include [33], which presented a method to speed up training by only updating a portion of the weights during each backpropagation pass. However, its results are only shown for the basic MNIST dataset. [4] presented an approach to accelerate training by starting with downsampled kernels and input images for a certain number of epochs, before upscaling to the original input image size and kernel size.

Proposed Method

We use an end-to-end tensor decomposition scheme similar to that proposed by [16], which in turn uses the Tucker decomposition algorithm proposed by [34] and the rank determined by a global analytic solution of variational Bayesian matrix factorization (VBMF) [26]:

  1. Initialize Model: We start with a model architecture from scratch (i.e., initialized with random weights).

  2. Initial Training: We then train the weights of the model for a certain number of epochs, e.g., 10 epochs.

  3. Decompose: Then, we decompose the model and its weights using Tucker decomposition and VBMF rank selection. The decomposed model has a smaller number of weights than the original model, and hence lower training (and inference) time; a sudden drop in accuracy is expected at this point.

  4. Continue Training: Then, we continue updating those decomposed weights till the end of training.

  5. Reconstruction [Optional]: A certain number of epochs (e.g., 10) before the end of training, we reconstruct the original architecture by combining the weight matrices of each set of decomposed convolution operations. The accuracy at this point does not change, as this reconstruction step is lossless.

  6. Fine Tuning [Optional]: Train the reconstructed model for a few more epochs. A minimal code sketch of the overall schedule is shown below.
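Below is a minimal, runnable sketch of this schedule on a toy model with random data. The toy architecture, the random batches, and the fixed ranks (half the channels, with factors obtained by a simple HOSVD-style truncation) are illustrative assumptions; the paper instead selects ranks per layer with VBMF and derives the factors as described by [34]. The three-convolution structure of each decomposed layer follows Equations (7)-(9) below.

```python
import torch
import torch.nn as nn

def tucker2_decompose_conv(conv: nn.Conv2d, r_in: int, r_out: int) -> nn.Sequential:
    """Replace a KxK convolution by 1x1 -> KxK -> 1x1 convolutions (Eq. 7-9).
    Factors come from truncated SVDs of the kernel unfoldings (an HOSVD-style
    initialization, a simplification of the method used in the paper)."""
    W = conv.weight.data                      # (C_out, C_in, K_h, K_w)
    c_out, c_in, kh, kw = W.shape
    U0, _, _ = torch.linalg.svd(W.reshape(c_out, -1), full_matrices=False)
    U_out = U0[:, :r_out]                     # output-channel factor (C_out x R_out)
    U1, _, _ = torch.linalg.svd(W.transpose(0, 1).reshape(c_in, -1), full_matrices=False)
    U_in = U1[:, :r_in]                       # input-channel factor (C_in x R_in)
    # Core tensor: contract W with both factors -> (R_out, R_in, K_h, K_w)
    core = torch.einsum("oihw,or,is->rshw", W, U_out, U_in)

    first = nn.Conv2d(c_in, r_in, kernel_size=1, bias=False)          # Eq. (7)
    mid = nn.Conv2d(r_in, r_out, kernel_size=(kh, kw), stride=conv.stride,
                    padding=conv.padding, bias=False)                 # Eq. (8)
    last = nn.Conv2d(r_out, c_out, kernel_size=1,
                     bias=conv.bias is not None)                      # Eq. (9)
    first.weight.data = U_in.t().reshape(r_in, c_in, 1, 1)
    mid.weight.data = core
    last.weight.data = U_out.reshape(c_out, r_out, 1, 1)
    if conv.bias is not None:
        last.bias.data = conv.bias.data
    return nn.Sequential(first, mid, last)

def decompose_model(model: nn.Module) -> None:
    """Replace every KxK convolution in place. Fixed ranks (half the channels)
    stand in for the per-layer VBMF rank selection used in the paper."""
    for name, module in model.named_children():
        if isinstance(module, nn.Conv2d) and module.kernel_size != (1, 1):
            r_in = max(1, module.in_channels // 2)
            r_out = max(1, module.out_channels // 2)
            setattr(model, name, tucker2_decompose_conv(module, r_in, r_out))
        else:
            decompose_model(module)

# Steps 1-4 of the proposed schedule on a toy model and random batches.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
decompose_at = 3                                   # e.g. the 10th epoch in the paper

for epoch in range(6):
    if epoch == decompose_at:
        decompose_model(model)                     # step 3: decompose mid-training
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # new parameter set
    x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    n_params = sum(p.numel() for p in model.parameters())
    print(f"epoch {epoch}: loss {loss.item():.3f}, params {n_params}")
```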

Tucker decomposition is illustrated in Figure 2(e). To explain Tucker decomposition, we first express a regular convolution operation with weight tensor $W$ of dimensions $C_{out} \times C_{in} \times K_h \times K_w$, acting on an input tensor $X$ with dimensions $C_{in} \times H \times W$ to produce an output tensor $Y$ with dimensions $C_{out} \times H' \times W'$:

$Y_{t,h',w'} = \sum_{c=1}^{C_{in}} \sum_{i=1}^{K_h} \sum_{j=1}^{K_w} W_{t,c,i,j} \, X_{c,\,(h'-1)S+i-P,\,(w'-1)S+j-P}$    (3)

where:

$H' = \lfloor (H - K_h + 2P)/S \rfloor + 1$    (4)
$W' = \lfloor (W - K_w + 2P)/S \rfloor + 1$    (5)

where $S$ is the stride and $P$ is the padding.

Tucker decomposition converts this convolution operation into 3 consecutive convolutions:

$Z_{s,h,w} = \sum_{c=1}^{C_{in}} U^{in}_{c,s} \, X_{c,h,w}$    (7)

$Z'_{r,h',w'} = \sum_{s=1}^{R_{in}} \sum_{i=1}^{K_h} \sum_{j=1}^{K_w} G_{r,s,i,j} \, Z_{s,\,(h'-1)S+i-P,\,(w'-1)S+j-P}$    (8)

$Y_{t,h',w'} = \sum_{r=1}^{R_{out}} U^{out}_{t,r} \, Z'_{r,h',w'}$    (9)

The first and third convolutions are pointwise (1x1) convolutions, while the second convolution is a regular spatial convolution whose input and output channels are reduced to $R_{in}$ and $R_{out}$ respectively. The Tucker decomposition method described in [34] derives the equations to deduce the core tensor $G$ and the factor matrices $U^{in}$ and $U^{out}$ from $W$. The compression ratio of the decomposition is expressed as:

$\text{compression ratio} = \dfrac{C_{out} C_{in} K_h K_w}{C_{in} R_{in} + R_{in} R_{out} K_h K_w + R_{out} C_{out}}$    (10)

and the speedup as:

$\text{speedup} = \dfrac{C_{out} C_{in} K_h K_w H' W'}{C_{in} R_{in} H W + R_{in} R_{out} K_h K_w H' W' + R_{out} C_{out} H' W'}$    (11)

Because the heights and widths of the input and output tensors enter the numerator and denominator of the speedup equation, the speedup in training time is lower than the compression ratio.
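As a quick numerical illustration of Equations (10) and (11), the snippet below computes both ratios for an example layer; the layer dimensions and ranks are arbitrary choices, not values from the paper:

```python
# Compression ratio (Eq. 10) and theoretical speedup (Eq. 11) for one example layer.
c_out, c_in, kh, kw = 256, 256, 3, 3      # original convolution
r_out, r_in = 64, 64                      # Tucker-2 ranks (illustrative)
h, w, h_out, w_out = 32, 32, 16, 16       # input / output spatial sizes (e.g. stride 2)

orig_params = c_out * c_in * kh * kw
dec_params = c_in * r_in + r_in * r_out * kh * kw + r_out * c_out
compression = orig_params / dec_params

orig_macs = c_out * c_in * kh * kw * h_out * w_out
dec_macs = (c_in * r_in * h * w
            + r_in * r_out * kh * kw * h_out * w_out
            + r_out * c_out * h_out * w_out)
speedup = orig_macs / dec_macs

print(f"compression {compression:.1f}x, theoretical speedup {speedup:.1f}x")
```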

The values of $R_{in}$ and $R_{out}$ are determined by the rank selection method, which in our case is VBMF; a detailed explanation of the method is out of the scope of this paper.

It is noteworthy that, unlike other compression methods such as pruning and distillation, tensor decomposition can be reversed to recover the original architecture without a change in accuracy in a straightforward manner: by simply performing matrix multiplication of the decomposed matrices. The last 2 steps in the process are only included to show that the overall training process can recover the original architecture, in case one would like to use it - for whatever reason - rather than the smaller decomposed architecture.
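To make this reversibility concrete, the sketch below builds a random three-convolution block with the 1x1 -> KxK -> 1x1 layout of Equations (7)-(9), contracts its three weight tensors back into a single KxK kernel, and checks that the reconstructed convolution gives the same output; all sizes are arbitrary illustrations:

```python
import torch
import torch.nn as nn

c_in, c_out, r_in, r_out, k = 16, 32, 4, 8, 3       # illustrative sizes
w1 = torch.randn(r_in, c_in, 1, 1)                  # pointwise, C_in  -> R_in
w2 = torch.randn(r_out, r_in, k, k)                 # spatial,   R_in  -> R_out
w3 = torch.randn(c_out, r_out, 1, 1)                # pointwise, R_out -> C_out

# Contract the three factors back into one (C_out, C_in, K, K) kernel.
w_full = torch.einsum("or,rshw,si->oihw", w3[:, :, 0, 0], w2, w1[:, :, 0, 0])

decomposed = nn.Sequential(nn.Conv2d(c_in, r_in, 1, bias=False),
                           nn.Conv2d(r_in, r_out, k, padding=1, bias=False),
                           nn.Conv2d(r_out, c_out, 1, bias=False))
decomposed[0].weight.data, decomposed[1].weight.data, decomposed[2].weight.data = w1, w2, w3

reconstructed = nn.Conv2d(c_in, c_out, k, padding=1, bias=False)
reconstructed.weight.data = w_full

x = torch.randn(1, c_in, 8, 8)
print(torch.allclose(decomposed(x), reconstructed(x), rtol=1e-3, atol=1e-3))  # True
```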

Experiments

We have tested our approach by training VGG19, DenseNet40, and ResNet56 on the CIFAR10 dataset, and ResNet50 on the Imagenet dataset. When training on CIFAR10, the batch size was 128, and the learning rate was initialized at 0.1, reduced to 0.01 at the 100th epoch, and to 0.001 at the 150th epoch. When training on Imagenet, the batch size was 256 and the learning rate was set to 0.1.
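For reference, the CIFAR10 learning-rate schedule described above corresponds to a standard step decay, sketched below with PyTorch's MultiStepLR; the placeholder model and the omission of momentum and weight decay (which the paper does not state) are assumptions:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)   # placeholder for VGG19 / DenseNet40 / ResNet56
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# lr: 0.1, dropped to 0.01 at epoch 100 and to 0.001 at epoch 150 (batch size 128).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)

for epoch in range(200):
    # ... one training epoch over CIFAR10 with batch size 128 ...
    scheduler.step()
```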

In addition to training the original model from scratch, training the decomposed model from an early epoch, and training the reconstructed model from a late epoch, we also decomposed the original model after it completed training and fine-tuned it for an additional number of epochs. We did this to compare the accuracy and number of parameters of a model decomposed early during training versus a model decomposed after it completed training.

The results are shown in Tables 1, 2, 3, and 4 and Figures 3, 4, 5, and 6. In those tables, "Dec." is an abbreviation for decomposed, and "Rec." for reconstructed.

Results

For VGG19 on CIFAR10, we notice from Table 1 that there was more than a 2x speedup in training time and a reduction in parameter size, but a drop of almost 2% in accuracy when decomposing at the 10th epoch. From Figure 3 we notice a sudden drop in accuracy when the decomposition happens; however, that drop is compensated for after less than 1000 seconds. When decomposing at later epochs, there was a general trend of a smaller accuracy drop in return for less model compression. It may be that in later epochs EVBMF detects more noise in the weight values - as the weights try to fit the training data with higher accuracy and cover more corner cases - and hence selects a higher rank for decomposition. Surprisingly, the scenarios of reconstructing at the 190th epoch and of decomposing after complete training did not result in a higher best accuracy than decomposing at the 10th epoch without reconstruction.

For DenseNet40 on CIFAR10, we notice a similar drop in accuracy as in VGG19, but a smaller training speedup, despite a considerable reduction in the number of parameters. This is expected from the compression and speedup ratios expressed in Equations 10 and 11. The results for DenseNet40 also show that both accuracy and model compression are higher when decomposing during training than when decomposing after training.

On the other hand, as shown in Table 3, decomposing ResNet56 resulted in an increase in accuracy, but less than a 10% reduction in training time. This can be interpreted as the EVBMF algorithm selecting high ranks for most layers in the ResNet56 model, a point that should be analyzed further in future work.

For the Imagenet dataset, the drop in accuracy was less than 0.2% for ResNet50, but the reduction in training time was negligible. Furthermore, decomposing at the 30th epoch resulted in better accuracy than decomposing after complete training.

Figure 3: Training progress of VGG19 model on CIFAR10 dataset on NVIDIA Tesla K40c GPU for 200 epochs with batch size 128.
Model                  | Best Acc. | Final Acc. | Params | Training Time
-----------------------|-----------|------------|--------|-------------------
Original               | 93.55%    | 93.56%     |        | 4.77 hr
Dec. @ 10              | 91.69%    | 91.39%     |        | 2.32 hr
Dec. @ 20              | 92.10%    | 91.89%     |        | 2.75 hr
Dec. @ 30              | 91.85%    | 91.78%     |        | 2.95 hr
Dec. @ 40              | 92.57%    | 92.43%     |        | 3.04 hr
Dec. @ 50              | 92.51%    | 92.49%     |        | 2.41 hr
Dec. @ 10, Rec. @ 190  | 91.69%    | 91.53%     |        | 2.45 hr
Original then Dec.     | 91.52%    | 91.36%     |        | 4.77 hr + 0.49 hr
Table 1: Performance and size of VGG19 with different scenarios of training on an NVIDIA Tesla K40c GPU on the CIFAR10 dataset. The epoch at which decomposition or reconstruction happens is indicated. The total number of epochs for all scenarios is 200, except for the last case, where decomposition happens after the 200th epoch and the model is fine-tuned for another 40 epochs.
Figure 4: Training progress of DenseNet40 model on CIFAR10 dataset on NVIDIA Tesla P100 GPU for 160 epochs with batch size 128.
Model                  | Best Acc. | Final Acc. | Params | Training Time
-----------------------|-----------|------------|--------|--------------------
Original               | 94.00%    | 93.78%     |        | 11.13 hr
Dec. @ 20              | 92.00%    | 91.82%     |        | 8.62 hr
Dec. @ 20, Rec. @ 150  | 92.00%    | 91.96%     |        | 8.83 hr
Original then Dec.     | 91.49%    | 91.46%     |        | 11.13 hr + 2.17 hr
Table 2: Performance and size of DenseNet40 with different scenarios of training and decomposition on an NVIDIA Tesla P100 GPU on the CIFAR10 dataset. The total number of epochs for all scenarios is 160, except for the last case, where decomposition happens after the 160th epoch and the model is fine-tuned for another 40 epochs.
Figure 5: Training progress of ResNet56 model on CIFAR10 dataset on NVIDIA Tesla K40c GPU for 200 epochs with batch size 128.
Model                  | Best Acc. | Final Acc. | Params | Training Time
-----------------------|-----------|------------|--------|-------------------
Original               | 91.83%    | 91.69%     |        | 4.39 hr
Dec. @ 10              | 92.16%    | 91.97%     |        | 4.06 hr
Dec. @ 20              | 92.27%    | 92.07%     |        | 4.10 hr
Dec. @ 30              | 91.66%    | 91.51%     |        | 4.15 hr
Dec. @ 40              | 91.67%    | 91.50%     |        | 4.14 hr
Dec. @ 50              | 91.65%    | 91.15%     |        | 4.15 hr
Dec. @ 10, Rec. @ 190  | 92.16%    | 91.92%     |        | 4.07 hr
Original then Dec.     | 92.32%    | 92.22%     |        | 4.39 hr + 0.85 hr
Table 3: Performance and size of ResNet56 with different scenarios of training on an NVIDIA Tesla K40c GPU on the CIFAR10 dataset. The epoch at which decomposition or reconstruction happens is indicated. The total number of epochs for all scenarios is 200, except for the last case, where decomposition happens after the 200th epoch and the model is fine-tuned for another 40 epochs.
Figure 6: Training progress of ResNet50 model on the Imagenet dataset on an NVIDIA Tesla V100 GPU for 90 epochs with batch size of 256: (a) Top5 accuracy; (b) Top1 accuracy.
Model                  | Best Top1 | Best Top5 | Params | Training Time
-----------------------|-----------|-----------|--------|---------------------
Original               | 75.65%    | 92.85%    |        | 185.4 hr
Dec. @ 30              | 75.34%    | 92.68%    |        | 179.2 hr
Original then Dec.     | 69.26%    | 89.38%    |        | 185.4 hr + 20.56 hr
Table 4: Performance and size of ResNet50 with different scenarios of training on an NVIDIA Tesla K40c GPU on the Imagenet dataset. The epoch at which decomposition or reconstruction happens is indicated. The total number of epochs for all scenarios is 90, except for the last case, where decomposition happens after the 90th epoch and the model is fine-tuned for another 20 epochs.

Conclusion and Future Work

In this paper we have shown that, to compress a model using tensor decomposition, we do not have to wait until training ends. We have shown that decomposing at the 10th or 20th epoch of training results in accuracy close to - and sometimes higher than - that of the original model trained till the end.

We have also shown that, in all of the cases on the CIFAR10 dataset, the size of a model decomposed after 10 or 20 epochs of training is smaller than that of the model decomposed after complete training. Moreover, we have shown - for CIFAR10 - that training a decomposed model for the VGG and DenseNet architectures results in considerably faster training: about 2x for VGG19 and about 1.3x for DenseNet40 (see Tables 1 and 2). However, the speedup obtained for the ResNet architecture was negligible.

For future work, there is a need to explore ways to reduce the accuracy drop of our "decomposition-in-training" approach for some models, and to increase the training speedup for other architectures, especially ResNet.

Acknowledgments

We would like to thank Jacob Gildenblat [8] and Ruihang Du [6] for providing open-sourced code for Tucker decomposition using PyTorch. We also thank Yerlan Idelbayev for providing open-sourced code and model files to reproduce the accuracies of the original ResNet [14], DenseNet, and VGG [24] papers on CIFAR10.

References

  • [1] T. Ben-Nun and T. Hoefler (2019) Demystifying parallel and distributed deep learning: an in-depth concurrency analysis. ACM Computing Surveys 52(4), pp. 65:1–65:43.
  • [2] Y. Chen, H. Fang, B. Xu, Z. Yan, Y. Kalantidis, M. Rohrbach, S. Yan, and J. Feng (2019) Drop an octave: reducing spatial redundancy in convolutional neural networks with octave convolution. CoRR abs/1904.05049.
  • [3] M. Courbariaux and Y. Bengio (2016) BinaryNet: training deep neural networks with weights and activations constrained to +1 or -1. CoRR abs/1602.02830.
  • [4] P. P. B. de Gusmão, G. Francini, S. Lepsøy, and E. Magli (2016) Fast training of convolutional neural networks via kernel rescaling. arXiv:1610.03623.
  • [5] E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS'14, pp. 1269–1277.
  • [6] R. Du (2018) Decompose-CNN. https://github.com/larry0123du/Decompose-CNN. Accessed: 2019-09-05.
  • [7] S. Elkerdawy, H. Zhang, and N. Ray (2019) Lightweight monocular depth estimation model by joint end-to-end filter pruning. CoRR abs/1905.05212.
  • [8] J. Gildenblat (2018) PyTorch tensor decompositions. https://github.com/jacobgil/pytorch-tensor-decompositions. Accessed: 2019-09-05.
  • [9] J. Guo, Y. Li, W. Lin, Y. Chen, and J. Li (2018) Network decoupling: from regular to depthwise separable convolutions. In BMVC 2018, p. 248.
  • [10] B. Hassibi and D. G. Stork (1993) Second order derivatives for network pruning: optimal brain surgeon. In Advances in Neural Information Processing Systems 5, pp. 164–171.
  • [11] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861.
  • [12] C. Hur and S. Kang (2019) Entropy-based pruning method for convolutional neural networks. The Journal of Supercomputing 75(6), pp. 2950–2963.
  • [13] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. CoRR abs/1602.07360.
  • [14] Y. Idelbayev (2018) Proper ResNet implementation for CIFAR10/CIFAR100 in PyTorch. https://github.com/akamaster/pytorch_resnet_cifar10. Accessed: 2019-05-23.
  • [15] X. Jia, S. Song, W. He, Y. Wang, H. Rong, F. Zhou, L. Xie, Z. Guo, Y. Yang, L. Yu, T. Chen, G. Hu, S. Shi, and X. Chu (2018) Highly scalable deep learning training system with mixed-precision: training ImageNet in four minutes. arXiv abs/1807.11205.
  • [16] Y. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin (2016) Compression of deep convolutional neural networks for fast and low power mobile applications. In ICLR 2016.
  • [17] Y. LeCun, J. S. Denker, and S. A. Solla (1990) Optimal brain damage. In Advances in Neural Information Processing Systems 2, pp. 598–605.
  • [18] S. Lin, R. Ji, C. Chen, D. Tao, and J. Luo (2018) Holistic CNN compression via low-rank decomposition with knowledge transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1.
  • [19] X. Lin, C. Zhao, and W. Pan (2017) Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems 30, pp. 345–353.
  • [20] Y. Lin, S. Han, H. Mao, Y. Wang, and B. Dally (2018) Deep gradient compression: reducing the communication bandwidth for distributed training. In ICLR.
  • [21] Z. Liu, B. Wu, W. Luo, X. Yang, W. Liu, and K. Cheng (2018) Bi-Real Net: enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm. CoRR abs/1808.00278.
  • [22] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell (2019) Rethinking the value of network pruning. In ICLR.
  • [23] D. J. C. MacKay (1991) Bayesian interpolation. Neural Computation 4, pp. 415–447.
  • [24] E. Mingjie (2018) Network slimming (PyTorch). https://github.com/Eric-mingjie/network-slimming. Accessed: 2019-09-05.
  • [25] P. Molchanov, A. Mallya, S. Tyree, I. Frosio, and J. Kautz (2019) Importance estimation for neural network pruning. In CVPR.
  • [26] S. Nakajima, R. Tomioka, M. Sugiyama, and S. D. Babacan (2012) Perfect dimensionality recovery by variational Bayesian PCA. In Advances in Neural Information Processing Systems 25, pp. 971–979.
  • [27] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi (2016) XNOR-Net: ImageNet classification using binary convolutional neural networks. CoRR abs/1603.05279.
  • [28] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
  • [29] R. Schwartz, J. Dodge, N. A. Smith, and O. Etzioni (2019) Green AI. CoRR abs/1907.10597.
  • [30] A. Sergeev and M. D. Balso (2018) Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799.
  • [31] P. Stock, A. Joulin, R. Gribonval, B. Graham, and H. Jégou (2019) And the bit goes down: revisiting the quantization of neural networks. arXiv:1907.05686.
  • [32] E. Strubell, A. Ganesh, and A. McCallum (2019) Energy and policy considerations for deep learning in NLP. In ACL.
  • [33] X. Sun, X. Ren, S. Ma, and H. Wang (2017) MeProp: sparsified back propagation for accelerated deep learning with reduced overfitting. In Proceedings of the 34th International Conference on Machine Learning, PMLR 70, pp. 3299–3308.
  • [34] L. R. Tucker (1966) Some mathematical notes on three-mode factor analysis. Psychometrika 31(3), pp. 279–311.
  • [35] B. Wu, A. Wan, X. Yue, P. Jin, S. Zhao, N. Golmant, A. Gholaminejad, J. Gonzalez, and K. Keutzer (2018) Shift: a zero FLOP, zero parameter alternative to spatial convolutions. In CVPR 2018, pp. 9127–9135.
  • [36] X. Zhang, J. Zou, K. He, and J. Sun (2016) Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(10), pp. 1943–1955.