Introduction
While deep learning has achieved high accuracy in computer vision, natural language understanding, and many other fields, one of its main challenges is the extensive computation cost required to train its models. A typical neural network architecture may take up to 2 weeks to train on the ImageNet dataset. Using distributed training, a ResNet50 model can be trained in 6 minutes, but at the financial and energy cost of 1024 GPUs [15]. The energy required to train the average deep learning model is equivalent to burning fossil fuel that releases around 78,000 pounds of carbon, which is more than half of a car's output during its lifetime [32]. The financial and time costs of training deep learning models are making it increasingly difficult for small and medium-sized companies and research labs to explore various architectures and hyperparameters.
While there has been extensive development to meet the increasing computation cost of training, by improving the hardware design of GPUs as well as other forms of specialized hardware, and by advances in distributed training using large numbers of servers and GPUs, there has been less progress in reducing the computation cost of training itself. [29] estimated that from 2012 to 2018 the computations required for deep learning research grew by an estimated 300,000×.
In this paper we propose a hardware-independent method to reduce the computation cost of training using tensor decomposition. Much research has been conducted on compressing pretrained models using tensor decomposition. However, to the best of our knowledge, this paper is the first to propose using tensor decomposition during training to reduce the computation cost and training time.
In this paper, we will first present related work in reducing the computation cost and latency of models during inference, followed by related work in reducing training time. Then, we will explain tensor decomposition, and the specific method, Tucker decomposition, that we use in our solution. We then present our proposed solution to decompose the model during training, followed by the results and performance on the CIFAR10 and ImageNet datasets.
Related Work
Inference Acceleration
This section covers related work in reducing CNN model size and speeding up training and/or inference, which falls into several categories. The first is searching for or designing architectures that have fewer parameters and hence lower computation latency while maintaining reasonable prediction accuracy. Proposed models in this category are SqueezeNet [13], MobileNets [11], and MobileNetV2 [28]. Such methods usually come with a considerable drop in accuracy: SqueezeNet has Top-1/Top-5 accuracies of 58.1%/80.4%, MobileNets have 69.6%/89.1%, and MobileNetV2 has 71.8%/90.4%.
The second category is replacing a portion or all of the convolution operations in a model with operations that require less computation time or fewer parameters. Binarized Neural Networks [3], XNOR-Net [27], Bi-Real Net [21], and ABC-Net [19] replaced the multiplications in the convolution operation with logical XNOR operations. Such XNOR-based models usually use regular convolutions during training and therefore do not speed up training, although they speed up inference and reduce parameter size. [2] proposed Octave Convolution, a spatially less-redundant variant of regular convolution. [35] proposed a pixel-wise shift operation to replace a portion of convolution filters.

The third category is to quantize the parameters of regular convolution operations from 32-bit floating-point representation to a smaller number of bits, such as 8-bit integers. [31] used a vector-quantization method that compresses ResNet models by up to 20×. However, most quantization methods require training with regular convolutions until the end before quantizing, and therefore do not speed up training, although they speed up inference.
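To make the quantization idea concrete, the snippet below sketches basic post-training affine quantization of a weight matrix to 8-bit integers and back; this is a generic illustration, not the vector-quantization method of [31], and the matrix size is an arbitrary example value:

```python
import numpy as np

rng = np.random.default_rng(3)
w = rng.normal(size=(64, 32))  # stands in for a trained weight matrix

# Affine quantization: map [w.min(), w.max()] onto the 256 uint8 levels.
scale = (w.max() - w.min()) / 255.0
zero = w.min()
q = np.round((w - zero) / scale).astype(np.uint8)

# Dequantize for use at inference; rounding error is bounded by scale / 2.
w_hat = q.astype(np.float64) * scale + zero
max_err = np.abs(w - w_hat).max()
```

Note that the quantized model stores one byte per weight instead of four, but the quantization happens only after full-precision training, which is why this family of methods accelerates inference but not training.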
The fourth category is pruning: removing a portion of the convolution filters in each layer, usually based on some heuristic. [10] and [17] published some of the earlier works on pruning. [12] proposed using the information entropy of filters as the criterion for pruning. [7] proposed an end-to-end pruning method using trainable masks to decide which convolution filters should be pruned. [25] used Taylor expansion to estimate the contribution of each filter, and hence prune the filters that contribute least to the model accuracy. While most of the pruning literature shows that training a pruned model from scratch yields lower accuracy than pruning a pretrained model down to the same pruned architecture, some researchers have reported otherwise [22]. Nevertheless, although training a pruned architecture is usually faster than training the original model, obtaining the pruned architecture requires training the full model to the end, while our proposed solution only requires training the first 10 or 30 epochs before compressing.
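As a point of contrast with our approach, the simplest filter-pruning heuristic can be sketched in a few lines; this is a generic magnitude (L1-norm) criterion, not any specific cited method, and the layer dimensions are made-up example values:

```python
import numpy as np

rng = np.random.default_rng(7)
# A conv layer's weights: 64 filters, each spanning 32 input channels, 3x3.
weights = rng.normal(size=(64, 32, 3, 3))

# Rank filters by L1 norm and keep the top 48 of 64; the pruned layer
# must then be fine-tuned (or retrained) to recover accuracy.
keep = 48
l1 = np.abs(weights).reshape(64, -1).sum(axis=1)
kept_idx = np.sort(np.argsort(l1)[-keep:])
pruned = weights[kept_idx]
```

The key limitation noted above applies regardless of the criterion: the full model is trained to completion before the smaller architecture is extracted.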
The fifth category is tensor decomposition: separating a regular convolution operation into multiple smaller convolutions whose combined number of parameters or combined latency is less than that of the original operation. This is explained in the next subsection.
Tensor Decomposition
Tensor decomposition is based on a concept of linear algebra known as Singular Value Decomposition (SVD), which states that any matrix $A$, whose dimensions are $N \times M$ with $N \geq M$ without loss of generality, can be expressed as:

$$A = U S V^{T} \quad (1)$$

As shown in Figure 1, the dimensions of $U$, $S$, and $V^{T}$ are $N \times R$, $R \times R$, and $R \times M$.
$U$ and $V$ are referred to as the unitary matrices, and $S$ is a diagonal matrix. Each value along the diagonal of $S$ represents the "importance" of the corresponding column of $U$ and row of $V^{T}$. The SVD algorithm specifies how to calculate the values of the $U$, $S$, and $V$ matrices so that this equality holds. $R$ is known as the rank of the matrix $A$. Most of the time, the rank of a matrix is $R = M$. If one or more rows or columns of the matrix are linearly dependent on other rows or columns, then the rank is $R < M$. However, in the context of neural networks, where the values of the weight matrices are updated during training, this condition is unlikely to happen.

In order to decompose $A$ into terms that have a smaller number of parameters, we need to set the rank $R'$ of the transformation to be less than the rank $R$ of the matrix $A$:
$$A \approx U' S' V'^{T} \quad (2)$$

This results in an approximation. In the extreme case of choosing the rank of decomposition $R' = 1$, the matrix can be represented with $N + 1 + M$ parameters, compared to the $N \times M$ parameters of the original matrix. For large values of $N$ and $M$, $N + M + 1 \ll N \times M$, and hence the storage of the decomposed representation and the FLOPs of matrix operations on it are much lower.
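The truncated-SVD approximation of Equation (2) and its storage saving can be illustrated in a few lines of NumPy; the matrix sizes and rank below are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(42)
N, M, R_prime = 100, 80, 10  # matrix dimensions and a chosen rank R' < M

# A stands in for a weight matrix of a neural network layer.
A = rng.normal(size=(N, M))

# Full SVD: A = U @ diag(s) @ Vt, with singular values s sorted descending.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the R' most "important" components (truncated SVD).
A_approx = U[:, :R_prime] @ np.diag(s[:R_prime]) @ Vt[:R_prime, :]

# Storage: N*R' + R' + R'*M values instead of N*M.
full_params = N * M                                 # 8000
trunc_params = N * R_prime + R_prime + R_prime * M  # 1810

# By the Eckart-Young theorem, the Frobenius-norm error of the rank-R'
# approximation equals the energy in the discarded singular values.
err = np.linalg.norm(A - A_approx)
tail = np.sqrt(np.sum(s[R_prime:] ** 2))
```

A random matrix like this one has no low-rank structure, so the error is large; the premise of tensor-decomposition methods is that trained weight matrices are much closer to low rank, so a small $R'$ discards mostly noise.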
The rank has to be selected in an optimal manner in order to balance the approximation error introduced against the compression obtained. Different methods of rank selection are explained in the Rank Selection subsection below.
In the context of neural networks, tensor decomposition extends SVD to the 4-dimensional weight tensors of convolution operators, which have dimensions $T \times S \times K_h \times K_w$, where $T$ is the number of channels of the output of the operator, $S$ is the number of channels of the input, $K_h$ is the height of each filter, and $K_w$ is the width of each filter. There are various types of tensor decomposition, as illustrated in Figure 2: spatial decomposition [18], channel decomposition [36], depthwise decomposition [9], and Tucker decomposition [34]. The mathematical expression of each decomposition type and its derivation from SVD is out of the scope of this paper, but they can be found in the reference for each decomposition method. This paper uses Tucker decomposition, which is explained in more detail in the Proposed Method section.
Rank Selection
Some researchers have used time-consuming trial and error to select the optimal rank of decomposition of each convolution layer in a network, by analyzing the final accuracy of the model. [5] used alternating least squares. [23] proposed a data-driven one-shot decision using empirical Bayes. In this paper, we use variational Bayesian matrix factorization (VBMF) [26].
Training Acceleration
To speed up the training of neural networks, the main research efforts in industry are geared towards designing and enhancing hardware platforms. Vector units in CPUs enable them to perform arithmetic operations on groups of bytes in parallel. However, CPUs are only practical for training on small datasets such as MNIST. NVIDIA's Graphics Processing Units (GPUs) are the most popular hardware platform used for training. The core principle of GPUs is the existence of hundreds or thousands of cores that perform computation on data in parallel. Other hardware platforms, such as Google's Tensor Processing Unit (TPU) and Huawei's Neural Processing Unit (NPU), depend on large dedicated matrix multipliers. Accelerated systems based on custom FPGA and ASIC designs are also being explored.
Another main method to accelerate training is distributed training: training over multiple GPUs on the same workstation, or over multiple workstations, each with one or more GPUs. Distributed training methods can be classified into model parallelism and data parallelism methods. Frameworks and libraries provide support for distributed training [30] to automate the process of distributing the workload and accumulating results. A detailed survey of parallel and distributed deep learning is presented in [1]. A key research area in distributed training is optimizing the communication between the various workstations [20].

Our proposed method of accelerating training is hardware independent, i.e., it does not require a specific hardware design. It can build upon the speedups provided by faster hardware designs and distributed approaches.
Other hardware-independent approaches in the literature include [33], which presented a method to speed up training by only updating a portion of the weights during each backpropagation pass. However, its results are only shown for the basic MNIST dataset.
[4] presented an approach to accelerate training by starting with downsampled kernels and input images for a certain number of epochs, before upscaling to the original input image size and kernel size.

Proposed Method
We use an end-to-end tensor decomposition scheme similar to that proposed by [16], which in turn uses the Tucker decomposition algorithm proposed by [34] and the rank determined by a global analytic solution of variational Bayesian matrix factorization (VBMF) [26]:

1. Initialize Model: We start with a model architecture from scratch (i.e., initialized with random weights).

2. Initial Training: We then train the weights of the model for a certain number of epochs, e.g., 10 epochs.

3. Decompose: Then, we decompose the model and its weights using Tucker decomposition and VBMF rank selection. The decomposed model has a smaller number of weights than the original model, and hence a lower training (and inference) time; a sudden drop in accuracy is expected at that point.

4. Continue Training: Then, we continue updating those decomposed weights until the end of training.

5. Reconstruction [Optional]: A certain number of epochs (e.g., 10) before the end of training, we reconstruct the original architecture by combining the weight matrices of each set of decomposed convolution operations. The accuracy at this point does not change, as this reconstruction step is lossless.

6. Fine-Tuning [Optional]: Train the reconstructed model for a few more epochs.
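The scheme above can be sketched end to end on a toy problem. The snippet below is a minimal NumPy sketch, not the paper's actual implementation: it uses a single linear layer, with truncated SVD standing in for Tucker decomposition and a fixed rank standing in for VBMF rank selection; all sizes, learning rates, and step counts are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task y = x @ W_true with a low-rank ground truth, so that
# a rank-4 decomposition loses very little accuracy.
n, d_in, d_out, rank = 256, 32, 16, 4
W_true = rng.normal(size=(d_in, rank)) @ rng.normal(size=(rank, d_out))
X = rng.normal(size=(n, d_in))
Y = X @ W_true

def mse(P):
    return float(np.mean((X @ P - Y) ** 2))

# Steps 1-2: initialize the full weight matrix and train it for a while.
W = 0.01 * rng.normal(size=(d_in, d_out))
for _ in range(300):
    W -= 0.05 * (2 * X.T @ (X @ W - Y) / n)

# Step 3: decompose. Truncated SVD plays the role of Tucker decomposition,
# and the fixed `rank` plays the role of VBMF rank selection; A and B
# together hold fewer parameters than W.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :rank] * np.sqrt(s[:rank])          # d_in x rank
B = np.sqrt(s[:rank])[:, None] * Vt[:rank]   # rank x d_out

# Step 4: continue training the (fewer) decomposed weights.
for _ in range(100):
    R = X @ A @ B - Y
    A -= 0.005 * (2 * X.T @ R @ B.T / n)
    B -= 0.005 * (2 * A.T @ X.T @ R / n)

# Step 5: lossless reconstruction of the original weight shape.
W_rec = A @ B
```

In the real pipeline, the decomposition is applied layer-wise to 4-D convolution kernels and training continues by backpropagating through the three resulting convolutions; step 5 is the same merge-by-multiplication idea applied per layer.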
Tucker decomposition is illustrated in Figure 1(e). To explain Tucker decomposition, we first express a regular convolution operation with weight tensor $\mathcal{K}$ of dimensions $T \times S \times K_h \times K_w$, acting on an input tensor $X$ of dimensions $S \times H \times W$ to produce an output tensor $Y$ of dimensions $T \times H' \times W'$:

$$Y_{t,h',w'} = \sum_{s=1}^{S} \sum_{i=1}^{K_h} \sum_{j=1}^{K_w} \mathcal{K}_{t,s,i,j} \, X_{s,\,(h'-1)\Delta+i-P,\,(w'-1)\Delta+j-P} \quad (3)$$

where:

$$H' = \frac{H - K_h + 2P}{\Delta} + 1 \quad (4)$$

$$W' = \frac{W - K_w + 2P}{\Delta} + 1 \quad (5)$$

where $\Delta$ is the stride and $P$ is the padding.
Tucker decomposition converts this convolution operation into 3 consecutive convolutions:

$$Z_{r,h,w} = \sum_{s=1}^{S} U^{(s)}_{s,r} \, X_{s,h,w} \quad (7)$$

$$Z'_{r',h',w'} = \sum_{r=1}^{R_s} \sum_{i=1}^{K_h} \sum_{j=1}^{K_w} \mathcal{C}_{r',r,i,j} \, Z_{r,\,(h'-1)\Delta+i-P,\,(w'-1)\Delta+j-P} \quad (8)$$

$$Y_{t,h',w'} = \sum_{r'=1}^{R_t} U^{(t)}_{t,r'} \, Z'_{r',h',w'} \quad (9)$$
The first and third convolutions are pointwise ($1 \times 1$) convolutions, while the second convolution is a regular spatial convolution with input and output channels reduced to $R_s$ and $R_t$ respectively. The Tucker decomposition method described in [34] derives the equations to deduce the weights $U^{(s)}$, $\mathcal{C}$, and $U^{(t)}$ from $\mathcal{K}$. The compression ratio of the decomposition is expressed as:
$$\text{compression} = \frac{T \, S \, K_h K_w}{S R_s + R_s R_t K_h K_w + R_t T} \quad (10)$$

and the theoretical speedup as

$$\text{speedup} = \frac{T \, S \, K_h K_w \, H' W'}{S R_s H W + R_s R_t K_h K_w H' W' + R_t T H' W'} \quad (11)$$
Because the heights and widths of the input and output tensors enter the numerator and denominator of the speedup equation, the speedup in training time is lower than the compression ratio.
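The gap between the two ratios can be computed for a hypothetical layer; all dimensions below are made-up example values, not taken from the paper's models:

```python
def tucker2_ratios(T, S, Kh, Kw, Rt, Rs, H, W, Hp, Wp):
    """Compression ratio (Eq. 10) and theoretical FLOP speedup (Eq. 11)
    of a Tucker-decomposed convolution, following the formulas above."""
    orig_params = T * S * Kh * Kw
    dec_params = S * Rs + Rs * Rt * Kh * Kw + Rt * T
    compression = orig_params / dec_params

    # The first 1x1 convolution runs at the input resolution (H x W);
    # the core and last convolutions run at the output resolution (Hp x Wp).
    orig_flops = T * S * Kh * Kw * Hp * Wp
    dec_flops = S * Rs * H * W + Rs * Rt * Kh * Kw * Hp * Wp + Rt * T * Hp * Wp
    return compression, orig_flops / dec_flops

# Example: a 3x3 convolution with 32 -> 64 channels, ranks (Rs, Rt) = (8, 16).
# With stride 2 (32x32 input down to 16x16 output), the first pointwise
# convolution still runs at full input resolution, so the speedup is
# lower than the compression ratio.
c, sp = tucker2_ratios(T=64, S=32, Kh=3, Kw=3, Rt=16, Rs=8,
                       H=32, W=32, Hp=16, Wp=16)
```

With stride 1 (and identical input/output resolution) the two ratios coincide; strided convolutions and pooling are what pull the realized speedup below the compression ratio.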
The values of $R_s$ and $R_t$ are determined by the rank selection method, which in our case is VBMF. Explaining that method is out of the scope of this paper.
It is noteworthy that, unlike other compression methods such as pruning and distillation, tensor decomposition can be reversed to retain the original architecture without a change in accuracy in a straightforward manner: by simply performing matrix multiplication of the decomposed matrices. The last 2 steps in the process are only included to show that the overall training process can retain the original architecture, in case someone would like to use it, for some reason, rather than the smaller decomposed architecture.
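Both directions, decomposing a 4-D kernel and merging it back, can be sketched with a truncated HOSVD along the two channel modes, which is one standard way to obtain a Tucker-2 factorization; the dimensions and ranks below are made-up example values, and library implementations typically refine the factors further with higher-order orthogonal iteration:

```python
import numpy as np

rng = np.random.default_rng(1)
T, S, Kh, Kw = 64, 32, 3, 3   # output channels, input channels, kernel size
Rt, Rs = 16, 8                # target ranks (chosen by VBMF in the paper)

# Build a synthetic kernel that is exactly low-rank in the channel modes,
# so a rank-(Rt, Rs) decomposition can reconstruct it almost perfectly.
core_true = rng.normal(size=(Rt, Rs, Kh, Kw))
Ut_true = np.linalg.qr(rng.normal(size=(T, Rt)))[0]
Us_true = np.linalg.qr(rng.normal(size=(S, Rs)))[0]
K = np.einsum('tr,sq,rqhw->tshw', Ut_true, Us_true, core_true)

def leading_left_vectors(mat, r):
    """Top-r left singular vectors of a matrix unfolding."""
    return np.linalg.svd(mat, full_matrices=False)[0][:, :r]

# Tucker-2 via truncated HOSVD: factor matrices from the mode-0 (output
# channel) and mode-1 (input channel) unfoldings, then project the core.
Ut = leading_left_vectors(K.reshape(T, -1), Rt)                        # T x Rt
Us = leading_left_vectors(K.transpose(1, 0, 2, 3).reshape(S, -1), Rs)  # S x Rs
core = np.einsum('tr,sq,tshw->rqhw', Ut, Us, K)                        # Rt x Rs x Kh x Kw

# Us, core, Ut are the weights of the three convolutions in Eqs. (7)-(9):
# 1x1 (S -> Rs), Kh x Kw (Rs -> Rt), and 1x1 (Rt -> T).

# The lossless "reconstruction" step is a matrix multiplication per mode.
K_hat = np.einsum('tr,sq,rqhw->tshw', Ut, Us, core)
rel_err = np.linalg.norm(K - K_hat) / np.linalg.norm(K)
```

Merging the three small convolutions back into one kernel is exactly the `K_hat` computation: it reproduces what the decomposed layer computes, value for value, which is why the reconstruction step does not change accuracy.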
Experiments
We tested our approach by training VGG19, DenseNet40, and ResNet56 on the CIFAR10 dataset, and ResNet50 on the ImageNet dataset. When training on CIFAR10, the batch size was 128, and the learning rate was initialized at 0.1, reduced to 0.01 at the 100th epoch, and to 0.001 at the 150th epoch. When training on ImageNet, the batch size was 256 and the learning rate was set to 0.1.
In addition to training from scratch the original model, as well as training the decomposed model at an early epoch, and the reconstructed model at a late epoch, we also decomposed the original model after it completed training, and finetuned for an additional number of epochs. We did this to compare the accuracy and number of parameters of a model decomposed early during training versus a model decomposed after it completed training.
The results are shown in Tables 1, 2, 3 and 4 and Figures 3, 4, 5, and 6. In those tables, “Dec.” is abbreviation for decomposed, and “Rec.” is abbreviation for reconstructed.
Results
For VGG19 on CIFAR10, we notice from Table 1 that there was a more than 2× speedup in training time and a compression in parameter size, but a drop of almost 2% in accuracy, when decomposing at the 10th epoch. From Figure 3 we notice a sudden drop in accuracy when decomposition happens; however, that drop is compensated for in less than 1000 seconds. When decomposing at later epochs, there was a general trend of a decreasing accuracy drop in return for less model compression. It appears that in later epochs the EVBMF detects more noise in the weight values, as the weights try to fit the training data with higher accuracy and cover more corner cases, and hence selects a higher rank for decomposition. Surprisingly, the scenarios of reconstructing at the 190th epoch, and of decomposing after complete training, did not result in a higher best accuracy than decomposing at the 10th epoch without reconstruction.
For DenseNet40 on CIFAR10, we notice a similar drop in accuracy as for VGG19, but less training speedup, despite a substantial reduction in the number of parameters. This is expected from the compression and speedup ratios expressed in Equations 10 and 11. The results for DenseNet40 also show that both accuracy and model compression are higher when decomposing during training than when decomposing after training.
On the other hand, as shown in Table 3, decomposing ResNet56 resulted in an increase in accuracy, but less than a 10% reduction in training time. This can be interpreted as the EVBMF algorithm selecting high ranks for most layers in the ResNet56 model, a point that should be analyzed further in future work.
For the ImageNet dataset, the drop in accuracy was less than 0.2% for ResNet50, but the reduction in training time was negligible. Furthermore, decomposing at the 30th epoch resulted in better accuracy than decomposing after complete training.
Table 1: VGG19 on CIFAR10.

| Model | Best Acc. | Final Acc. | Params | Training Time |
| --- | --- | --- | --- | --- |
| Original | 93.55% | 93.56% | — | 4.77 hr |
| Dec. @ 10 | 91.69% | 91.39% | — | 2.32 hr |
| Dec. @ 20 | 92.10% | 91.89% | — | 2.75 hr |
| Dec. @ 30 | 91.85% | 91.78% | — | 2.95 hr |
| Dec. @ 40 | 92.57% | 92.43% | — | 3.04 hr |
| Dec. @ 50 | 92.51% | 92.49% | — | 2.41 hr |
| Dec. @ 10, Rec. @ 190 | 91.69% | 91.53% | — | 2.45 hr |
| Dec. after training | 91.52% | 91.36% | — | — |

Table 2: DenseNet40 on CIFAR10.

| Model | Best Acc. | Final Acc. | Params | Training Time |
| --- | --- | --- | --- | --- |
| Original | 94.00% | 93.78% | — | 11.13 hr |
| Dec. @ 20 | 92.00% | 91.82% | — | 8.62 hr |
| Dec. @ 20, Rec. | 92.00% | 91.96% | — | 8.83 hr |
| Dec. after training | 91.49% | 91.46% | — | — |

Table 3: ResNet56 on CIFAR10.

| Model | Best Acc. | Final Acc. | Params | Training Time |
| --- | --- | --- | --- | --- |
| Original | 91.83% | 91.69% | — | 4.39 hr |
| Dec. @ 10 | 92.16% | 91.97% | — | 4.06 hr |
| Dec. @ 20 | 92.27% | 92.07% | — | 4.10 hr |
| Dec. @ 30 | 91.66% | 91.51% | — | 4.15 hr |
| Dec. @ 40 | 91.67% | 91.50% | — | 4.14 hr |
| Dec. @ 50 | 91.65% | 91.15% | — | 4.15 hr |
| Dec. @ 10, Rec. @ 190 | 92.16% | 91.92% | — | 4.07 hr |
| Dec. after training | 92.32% | 92.22% | — | — |

Table 4: ResNet50 on ImageNet.

| Model | Top-1 Best Acc. | Top-5 Best Acc. | Params | Training Time |
| --- | --- | --- | --- | --- |
| Original | 75.65% | 92.85% | — | 185.4 hr |
| Dec. @ 30 | 75.34% | 92.68% | — | 179.2 hr |
| Dec. after training | 69.26% | 89.38% | — | — |

Conclusion and Future Work
In this paper we have shown that to compress a model using tensor decomposition, we do not have to wait until training ends. We have shown that decomposing at the 10th or 20th epoch of training results in accuracy close to, and sometimes higher than, that of the original model trained to the end.
We have also shown that, in all of the cases on the CIFAR10 dataset, the size of a model decomposed after 10 or 20 epochs of training is smaller than that of the model decomposed after complete training. Moreover, we have shown, for CIFAR10, that training a decomposed model for the VGG and DenseNet architectures results in considerably faster training: more than 2× for VGG19 and about 1.3× for DenseNet40. However, the speedup obtained for the ResNet architecture was negligible.
For future work, there is a need to explore ways to reduce the accuracy drop of our "decomposition-in-training" approach for some models, and to increase the training speedup for other architectures, especially ResNet.
Acknowledgments
We would like to thank Jacob Gildenblat [8] and Ruihang Du [6] for providing open-sourced code for Tucker decomposition using PyTorch. We also thank Yerlan Idelbayev for providing open-sourced code and model files to reproduce the accuracies of the original ResNet [14], DenseNet, and VGG papers [24] on CIFAR10.

References
 [1] (2019-08) Demystifying parallel and distributed deep learning: an in-depth concurrency analysis. ACM Comput. Surv. 52 (4), pp. 65:1–65:43. External Links: ISSN 0360-0300, Link, Document Cited by: Training Acceleration.
 [2] (2019) Drop an octave: reducing spatial redundancy in convolutional neural networks with octave convolution. CoRR abs/1904.05049. Cited by: Inference Acceleration.
 [3] (2016) BinaryNet: training deep neural networks with weights and activations constrained to +1 or -1. CoRR abs/1602.02830. External Links: Link, 1602.02830 Cited by: Inference Acceleration.
 [4] (2016) Fast training of convolutional neural networks via kernel rescaling. External Links: 1610.03623 Cited by: Training Acceleration.
 [5] (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In Proceedings of the 27th International Conference on Neural Information Processing Systems  Volume 1, NIPS’14, Cambridge, MA, USA, pp. 1269–1277. External Links: Link Cited by: Rank Selection.
 [6] (2018) Decompose-CNN. Note: https://github.com/larry0123du/Decompose-CNN Accessed: 2019-09-05 Cited by: Acknowledgments.
 [7] (2019) Lightweight monocular depth estimation model by joint endtoend filter pruning. CoRR abs/1905.05212. External Links: Link, 1905.05212 Cited by: Inference Acceleration.
 [8] (2018) PyTorch tensor decompositions. Note: https://github.com/jacobgil/pytorch-tensor-decompositions Accessed: 2019-09-05 Cited by: Acknowledgments.
 [9] (2018) Network decoupling: from regular to depthwise separable convolutions. In British Machine Vision Conference 2018, BMVC 2018, Northumbria University, Newcastle, UK, September 3-6, 2018, pp. 248. External Links: Link Cited by: Tensor Decomposition.
 [10] (1993) Second order derivatives for network pruning: optimal brain surgeon. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles (Eds.), pp. 164–171. External Links: Link Cited by: Inference Acceleration.
 [11] (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861. External Links: Link, 1704.04861 Cited by: Inference Acceleration.
 [12] (20190601) Entropybased pruning method for convolutional neural networks. The Journal of Supercomputing 75 (6), pp. 2950–2963. External Links: ISSN 15730484, Document, Link Cited by: Inference Acceleration.
 [13] (2016) SqueezeNet: alexnetlevel accuracy with 50x fewer parameters and <1mb model size. CoRR abs/1602.07360. External Links: Link, 1602.07360 Cited by: Inference Acceleration.
 [14] (2018) Proper resnet implementation for cifar10/cifar100 in pytorch. Note: https://github.com/akamaster/pytorch_resnet_cifar10 Accessed: 2019-05-23 Cited by: Acknowledgments.
 [15] (2018) Highly scalable deep learning training system with mixedprecision: training imagenet in four minutes. ArXiv abs/1807.11205. Cited by: Introduction.
 [16] (2016) Compression of deep convolutional neural networks for fast and low power mobile applications. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, External Links: Link Cited by: Proposed Method.
 [17] (1990) Optimal brain damage. In Advances in Neural Information Processing Systems 2, D. S. Touretzky (Ed.), pp. 598–605. External Links: Link Cited by: Inference Acceleration.
 [18] (2018) Holistic cnn compression via lowrank decomposition with knowledge transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence (), pp. 1–1. External Links: Document, ISSN 01628828 Cited by: Tensor Decomposition.
 [19] (2017) Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 345–353. External Links: Link Cited by: Inference Acceleration.
 [20] (2018) Deep gradient compression: reducing the communication bandwidth for distributed training. In International Conference on Learning Representations, External Links: Link Cited by: Training Acceleration.
 [21] (2018) Bireal net: enhancing the performance of 1bit cnns with improved representational capability and advanced training algorithm. CoRR abs/1808.00278. External Links: Link, 1808.00278 Cited by: Inference Acceleration.
 [22] (2019) Rethinking the value of network pruning. In ICLR, Cited by: Inference Acceleration.

 [23] (1991) Bayesian interpolation. Neural Computation 4, pp. 415–447. Cited by: Rank Selection.
 [24] (2018) Network slimming (pytorch). Note: https://github.com/Eric-mingjie/network-slimming Accessed: 2019-09-05 Cited by: Acknowledgments.

 [25] (2019-06) Importance estimation for neural network pruning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Inference Acceleration.
 [26] (2012) Perfect dimensionality recovery by variational bayesian pca. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), pp. 971–979. External Links: Link Cited by: Rank Selection, Proposed Method.
 [27] (2016) XNORnet: imagenet classification using binary convolutional neural networks. CoRR abs/1603.05279. External Links: Link, 1603.05279 Cited by: Inference Acceleration.
 [28] (201806) MobileNetV2: inverted residuals and linear bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vol. , pp. 4510–4520. External Links: Document, ISSN 25757075 Cited by: Inference Acceleration.
 [29] (2019) Green AI. CoRR abs/1907.10597. External Links: Link, 1907.10597 Cited by: Introduction.

 [30] (2018) Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799. Cited by: Training Acceleration.
 [31] (2019) And the bit goes down: revisiting the quantization of neural networks. External Links: 1907.05686 Cited by: Inference Acceleration.
 [32] (2019) Energy and policy considerations for deep learning in nlp. In ACL, Cited by: Introduction.

 [33] (2017) MeProp: sparsified back propagation for accelerated deep learning with reduced overfitting. In Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 3299–3308. Cited by: Training Acceleration.
 [34] (1966-09-01) Some mathematical notes on three-mode factor analysis. Psychometrika 31 (3), pp. 279–311. External Links: ISSN 1860-0980, Document, Link Cited by: Tensor Decomposition, Proposed Method, Proposed Method.
 [35] (2017) Shift: a zero flop, zero parameter alternative to spatial convolutions. CVPR 2018, pp. 9127–9135. Cited by: Inference Acceleration.
 [36] (201610) Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (10), pp. 1943–1955. External Links: ISSN 21609292, Link, Document Cited by: Tensor Decomposition.