While deep learning has achieved high accuracy in computer vision, natural language understanding, and many other fields, one of its main challenges is the extensive computation cost required to train its models. A typical neural network architecture may take up to 2 weeks to train on the ImageNet dataset. Using distributed training, a ResNet50 model can train in 6 minutes, but at the financial and energy cost of 1024 GPUs. The energy required to train the average deep learning model is equivalent to burning fossil fuel that releases around 78,000 pounds of carbon, which is more than half of a car's output during its lifetime. The financial and time costs of training deep learning models are making it increasingly difficult for small- and medium-sized companies and research labs to explore various architectures and hyperparameters.
While there has been extensive development to meet the increasing computation cost of training, through improved hardware design of GPUs and other forms of specialized hardware and through advances in distributed training over large numbers of servers and GPUs, there has been less progress in reducing the computation cost of training itself. It has been estimated that from 2012 to 2018 the computations required for deep learning research increased by an estimated 300,000 times.
In this paper we propose a hardware-independent method to reduce the computation cost of training using tensor decomposition. Considerable research has been done on compressing pre-trained models using tensor decomposition. However, to the best of our knowledge, this paper is the first to propose using tensor decomposition during training to reduce the computation cost and training time.
In this paper, we first present related work on reducing the computation cost and latency of models during inference, followed by related work on reducing training time. Then, we explain tensor decomposition and one of its specific methods, Tucker decomposition, which we use in our solution. We then present our proposed solution to decompose the model during training, followed by the results and performance on the CIFAR10 and ImageNet datasets.
This section covers related work on reducing CNN model size and speeding up training and/or inference, which falls into several categories. The first is searching for or designing architectures that have fewer parameters and hence lower computation latency while maintaining reasonable prediction accuracy. Proposed models under this category are SqueezeNet, MobileNets, and MobileNetV2. Such methods usually result in a considerable drop in accuracy: SqueezeNet has Top1/Top5 accuracies of 58.1%/80.4%, MobileNets has 69.6%/89.1%, and MobileNetV2 has 71.8%/90.4%.
The second category is replacing a portion or all of the convolution operations in a model with operations that require less computation time or fewer parameters. Binarized Neural Networks, XNOR-Net, Bi-Real Net, and ABCNet replaced multiplication in the convolution operation with the logical XNOR operation. Such XNOR-based models usually use regular convolutions in training and therefore do not speed up training, although they speed up inference and reduce parameter size. Octave Convolution has been proposed as a spatially less-redundant variant of regular convolution, and a pixel-wise shift operation has been proposed to replace a portion of convolution filters.
The third category is to quantize the parameters of regular convolution operations from a 32-bit floating point representation to a smaller number of bits, such as 8-bit integers. A vector-quantization method has been shown to be efficient in compressing ResNet models by up to 20x. However, most quantization methods require training with regular convolution until the end before quantizing, and therefore do not speed up training, although they speed up inference.
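To make the idea of reducing 32-bit floating point weights to 8-bit integers concrete, the following is a minimal sketch of uniform (affine) quantization in NumPy. It is not the vector-quantization method cited above; the function names `quantize_uint8` and `dequantize` and the random weight tensor are assumptions made for illustration only.

```python
import numpy as np


def quantize_uint8(weights):
    """Map float weights onto 256 evenly spaced levels stored as uint8."""
    w_min = float(weights.min())
    w_max = float(weights.max())
    scale = (w_max - w_min) / 255.0 or 1.0  # avoid a zero scale for constant tensors
    q = np.round((weights - w_min) / scale).astype(np.uint8)
    return q, scale, w_min


def dequantize(q, scale, w_min):
    """Recover approximate float32 weights from the 8-bit codes."""
    return (q.astype(np.float32) * scale + w_min).astype(np.float32)


# Hypothetical convolution weights: 64 filters, 32 input channels, 3x3 kernels.
rng = np.random.default_rng(0)
weights = rng.standard_normal((64, 32, 3, 3)).astype(np.float32)

q, scale, w_min = quantize_uint8(weights)
restored = dequantize(q, scale, w_min)

print("storage ratio:", weights.nbytes / q.nbytes)
print("max abs error:", float(np.abs(weights - restored).max()))
```

In this sketch the rounding error of each weight is bounded by half a quantization step. Note that the quantization is applied to weights that already exist, which is consistent with the observation above that such methods mainly benefit inference.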
The fourth category is pruning: removing a portion of the convolution filters in each layer, usually based on some heuristic. Optimal Brain Damage and Optimal Brain Surgeon are among the earlier works on pruning. Later works proposed using the information entropy of filters as the criterion for pruning, an end-to-end pruning method that uses trainable masks to decide which convolution filters should be pruned, and a Taylor expansion to estimate the contribution of each filter so that the filters contributing less to the model accuracy are pruned. While most literature on pruning shows that training a pruned model from scratch yields lower accuracy than pruning a pretrained model to the same pruned architecture, some researchers have reported otherwise. Nevertheless, although training a pruned architecture is usually faster than training the original model, obtaining the pruned architecture requires first training the full model to completion, whereas our proposed solution only requires training for the first 10 or 30 epochs before compressing.
The fifth category is tensor decomposition: separating a regular convolution operation into multiple smaller convolutions whose combined number of parameters or combined latency is less than that of the original operation. This is explained in the next subsection.
Tensor decomposition is based on a concept of linear algebra known as Singular Value Decomposition (SVD), which states that any matrix $W$ of dimensions $m \times n$ (with $m \ge n$, without loss of generality) can be expressed as:

$$W = U \Sigma V^T$$

As shown in Figure 1, the dimensions of $U$, $\Sigma$, and $V^T$ are $m \times r$, $r \times r$, and $r \times n$.

$\Sigma$ is a diagonal matrix of singular values, while $U$ and $V$ have orthonormal columns. Each value along the diagonal of $\Sigma$ represents the "importance" of the corresponding column of $U$ and row of $V^T$. The SVD algorithm specifies how to calculate the values of the $U$, $\Sigma$, and $V$ matrices so that this equality holds. $r$ is known as the rank of the matrix $W$. Most of the time, the rank of a matrix is $r = \min(m, n)$. If one or more rows or columns of the matrix are linearly dependent on other rows or columns in the matrix, then the rank is $r < \min(m, n)$. However, in the context of neural networks, where the values of the weight matrices are updated during training, this condition is unlikely to happen.

In order to decompose $W$ into terms that have fewer parameters, we need to set the rank of the transformation, $r'$, to be less than the rank $r$ of the matrix $W$:

$$W \approx U' \Sigma' V'^T$$

where $U'$, $\Sigma'$, and $V'^T$ have dimensions $m \times r'$, $r' \times r'$, and $r' \times n$.

This results in an approximation. In the extreme case of choosing the rank of decomposition $r' = 1$, the matrix can be represented with $m + n + 1$ parameters, compared to the $m \times n$ parameters of the original matrix. For large values of $m$ and $n$, $m + n + 1 \ll m \times n$, and hence the storage of the decomposed representation and the FLOPs of a matrix operation on the decomposed representation are much lower.
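The trade-off between rank and parameter count can be illustrated with a short NumPy sketch. Here a matrix with `m` rows and `n` columns is factored by SVD and truncated to a chosen rank; the helper names `truncated_svd` and `factor_param_count`, the matrix size, and the rank of 32 are assumptions chosen only for illustration.

```python
import numpy as np


def truncated_svd(matrix, rank):
    """Return the rank-`rank` factors (U', singular values, V'^T) of `matrix`."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :rank], s[:rank], vt[:rank, :]


def factor_param_count(m, n, rank):
    """Parameters stored by U' (m x rank), the singular values (rank) and V'^T (rank x n)."""
    return m * rank + rank + rank * n


# Hypothetical dense weight matrix with random entries.
rng = np.random.default_rng(0)
m, n = 512, 256
w = rng.standard_normal((m, n))

u_r, s_r, vt_r = truncated_svd(w, rank=32)
w_approx = (u_r * s_r) @ vt_r  # equivalent to U' @ diag(s') @ V'^T

rel_error = np.linalg.norm(w - w_approx) / np.linalg.norm(w)
print("original parameters:", m * n)
print("rank-32 parameters: ", factor_param_count(m, n, 32))
print("rank-1 parameters:  ", factor_param_count(m, n, 1))
print("relative approximation error:", rel_error)
```

A random matrix such as this one has no low-rank structure, so the approximation error at rank 32 is large; the point of the sketch is only the parameter count and the mechanics of truncation.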
The rank has to be selected in an optimal manner in order to balance between the approximation error introduced and the compression obtained. Different methods of rank selection are explained in the following sub-sub-section.
In the context of neural networks, tensor decomposition extends SVD to the 4-dimensional weight tensors of convolution operators, which have dimensions $C_{out} \times C_{in} \times k_h \times k_w$, where $C_{out}$ is the number of channels of the output of the operator, $C_{in}$ is the number of channels of the input image, $k_h$ is the height of each filter, and $k_w$ is the width of each filter. There are various types of tensor decomposition, as illustrated in Figure 2: spatial decomposition, channel decomposition, depthwise decomposition, and Tucker decomposition. The mathematical expression of each decomposition type and its derivation from SVD are out of the scope of this paper, but they can be found in the reference for each decomposition method. This paper uses Tucker decomposition, which is explained in more detail in Section Proposed Method.
Some researchers have used time-consuming trial-and-error to select the optimal rank of decomposition of each convolution layer in a network, by analyzing the final accuracy of the model. Others used alternating least squares, or proposed a data-driven one-shot decision using empirical Bayes. In this paper, we use variational Bayesian matrix factorization (VBMF).
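To give a feel for what a data-driven rank selection does, the sketch below picks the smallest rank that retains a given fraction of the squared singular values. This is a deliberately simple stand-in and is not VBMF, which is the method used in this paper; the function name `rank_by_energy` and the 99% threshold are assumptions made for illustration.

```python
import numpy as np


def rank_by_energy(matrix, energy=0.99):
    """Smallest rank whose singular values keep `energy` of the squared spectrum."""
    s = np.linalg.svd(matrix, compute_uv=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    rank = int(np.searchsorted(cumulative, energy)) + 1
    return min(rank, len(s))  # guard against floating point round-off at energy=1.0


# Hypothetical matrix: an exactly rank-8 signal plus a small amount of noise.
rng = np.random.default_rng(0)
signal = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 48))
noisy = signal + 0.01 * rng.standard_normal((64, 48))

selected_rank = rank_by_energy(noisy, energy=0.99)
print("selected rank:", selected_rank)
```

Like any rank selection rule, this one separates the dominant part of the spectrum from what it treats as noise; the rule used to draw that line is what differs between this heuristic and Bayesian approaches.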
To speed up the training of neural networks, the main research efforts in industry are geared towards designing and enhancing hardware platforms. Vector units in CPUs enable them to perform arithmetic operations on a group of bytes in parallel. However, CPUs are only practical for training on small datasets such as MNIST. NVIDIA's Graphics Processing Units (GPUs) are the most popular hardware platforms used for training. The core principle of GPUs is the existence of hundreds or thousands of cores that perform computation on data in parallel. Other hardware platforms, such as Google's Tensor Processing Unit (TPU) and Huawei's Neural Processing Unit (NPU), depend on large dedicated matrix multipliers. Other accelerated systems based on custom FPGA and ASIC designs are also being explored.
Another main method to accelerate training is distributed training: training over multiple GPUs on the same workstation, or training over multiple workstations, with each workstation having one or more GPUs. Distributed training methods can be classified into model parallelism methods and data parallelism methods. Frameworks and libraries have provided support for distributed training to automate the process of distributing the workload and accumulating results. A detailed survey of parallel and distributed deep learning is presented in "Demystifying parallel and distributed deep learning: an in-depth concurrency analysis" (2019). A key area of research in distributed training is optimizing communication between the various workstations.
Our proposed method for accelerating training is hardware independent, i.e., it does not require a specific hardware design. It can build upon the speedups obtained from faster hardware designs and distributed approaches.
Other hardware-independent approaches in the literature include meProp, which speeds up training by only updating a portion of the weights during each backpropagation pass. However, its results are only shown for the basic MNIST dataset. Another approach, kernel rescaling, accelerates training by starting with downsampled kernels and input images for a certain number of epochs, before upscaling to the original input image size and kernel size.
We use an end-to-end tensor decomposition scheme similar to that proposed by Kim et al. (2016), which in turn uses the Tucker decomposition algorithm proposed by Tucker (1966), with the rank determined by a global analytic solution of variational Bayesian matrix factorization (VBMF):
Initialize Model: We start with a model architecture from scratch (i.e., initialized with random weights).
Initial Training: We then train the weights of the model for a certain number of epochs, e.g., 10 epochs.
Decompose: Then, we decompose the model and its weights using Tucker decomposition and VBMF rank selection. The decomposed model has fewer weights than the original model, and hence lower training and inference time. A sudden drop in accuracy is expected at that point.
Continue Training: Then, we continue updating those decomposed weights till the end of training.
Reconstruction [Optional]: A certain number of epochs (e.g., 10) before the end of training, we reconstruct the original architecture by combining the weight matrices of each set of decomposed convolution operations. The accuracy does not change at this point, as this reconstruction step is lossless.
Fine Tuning [Optional]: Train the reconstructed model for a few more epochs.
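The steps above can be summarized as a schedule over epochs. The following sketch only captures the bookkeeping of when each step happens; it does not train or decompose anything. The function names, the 0-based epoch indexing, and the total of 200 epochs in the usage example are assumptions made for illustration.

```python
def training_phase(epoch, total_epochs, decompose_at=10, reconstruct_before_end=None):
    """Return which phase of the schedule a (0-based) epoch belongs to."""
    if epoch < decompose_at:
        return "train_original"
    if reconstruct_before_end is not None and epoch >= total_epochs - reconstruct_before_end:
        return "fine_tune_reconstructed"
    return "train_decomposed"


def schedule_events(total_epochs, decompose_at=10, reconstruct_before_end=None):
    """Map an epoch index to the one-off action performed just before that epoch starts."""
    events = {0: "initialize", decompose_at: "decompose"}
    if reconstruct_before_end is not None:
        # Optional steps: rebuild the original architecture, then fine tune it.
        events[total_epochs - reconstruct_before_end] = "reconstruct"
    return events


# Hypothetical run: 200 epochs, decompose after 10, reconstruct 10 epochs before the end.
events = schedule_events(total_epochs=200, decompose_at=10, reconstruct_before_end=10)
for epoch in sorted(events):
    print(epoch, events[epoch], "->", training_phase(epoch, 200, 10, 10))
```

Leaving `reconstruct_before_end` as `None` corresponds to skipping the two optional steps, in which case the decomposed model is trained until the end.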
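Tucker decomposition is illustrated in Figure 1(e). To explain Tucker decomposition, we first express a regular convolution operation with weight $K$ of dimensions $C_{out} \times C_{in} \times k_h \times k_w$, acting on an input tensor $X$ of dimensions $C_{in} \times H \times W$ to produce an output tensor $Y$ of dimensions $C_{out} \times H' \times W'$:

$$Y(t, h', w') = \sum_{s=1}^{C_{in}} \sum_{i=1}^{k_h} \sum_{j=1}^{k_w} K(t, s, i, j)\, X(s, h_i, w_j), \qquad h_i = (h' - 1)\Delta + i - P, \quad w_j = (w' - 1)\Delta + j - P$$

where $\Delta$ is the stride and $P$ is the padding.

Tucker decomposition converts this convolution operation into 3 consecutive convolutions:

$$Z(r_3, h, w) = \sum_{s=1}^{C_{in}} U^{(3)}(s, r_3)\, X(s, h, w)$$

$$Z'(r_4, h', w') = \sum_{r_3=1}^{R_3} \sum_{i=1}^{k_h} \sum_{j=1}^{k_w} C(r_4, r_3, i, j)\, Z(r_3, h_i, w_j)$$

$$Y(t, h', w') = \sum_{r_4=1}^{R_4} U^{(4)}(t, r_4)\, Z'(r_4, h', w')$$

The first and third convolutions are pointwise convolutions, while the second convolution is a regular spatial convolution with input channels and output channels reduced to $R_3$ and $R_4$ respectively. The Tucker decomposition method described by Tucker (1966) derives the equations to deduce the weights $U^{(3)}$, $C$, and $U^{(4)}$ from $K$. The compression ratio of the decomposition is expressed as:

$$\frac{C_{in}\, C_{out}\, k_h\, k_w}{C_{in} R_3 + R_3 R_4\, k_h\, k_w + R_4\, C_{out}}$$

and the speedup ratio as:

$$\frac{C_{in}\, C_{out}\, k_h\, k_w\, H' W'}{C_{in} R_3\, H W + R_3 R_4\, k_h\, k_w\, H' W' + R_4\, C_{out}\, H' W'}$$

Because the speedup equation incorporates the height and width of the input and output tensors in its numerator and denominator, the speedup in training time is lower than the compression ratio.

The values of $R_3$ and $R_4$ are determined by the rank selection method, which in our case is VBMF. Explaining that method is out of the scope of this paper.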
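As a worked example of how the compression ratio and the speedup ratio can differ, the sketch below counts the parameters and multiply-accumulate operations of one convolution layer before and after splitting it into a pointwise convolution, a reduced spatial convolution, and a second pointwise convolution. The function names, argument names (`r3` and `r4` for the reduced input and output channel counts), and the layer sizes are assumptions made for illustration, and the operation count ignores biases and any framework overhead.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def tucker2_ratios(c_in, c_out, k_h, k_w, r3, r4, h, w, stride=1, padding=0):
    """Return (compression ratio, speedup ratio) for one decomposed convolution layer."""
    h_out = conv_output_size(h, k_h, stride, padding)
    w_out = conv_output_size(w, k_w, stride, padding)

    params_original = c_in * c_out * k_h * k_w
    params_decomposed = c_in * r3 + r3 * r4 * k_h * k_w + r4 * c_out

    macs_original = params_original * h_out * w_out
    macs_decomposed = (
        c_in * r3 * h * w                      # first pointwise conv runs on the input grid
        + r3 * r4 * k_h * k_w * h_out * w_out  # reduced spatial conv
        + r4 * c_out * h_out * w_out           # last pointwise conv
    )
    return params_original / params_decomposed, macs_original / macs_decomposed


# Hypothetical 3x3 layer with 256 input and output channels on a 32x32 input.
comp_s1, speed_s1 = tucker2_ratios(256, 256, 3, 3, 64, 64, 32, 32, stride=1, padding=1)
comp_s2, speed_s2 = tucker2_ratios(256, 256, 3, 3, 64, 64, 32, 32, stride=2, padding=1)
print("stride 1: compression", comp_s1, "speedup", speed_s1)
print("stride 2: compression", comp_s2, "speedup", speed_s2)
```

In this count, the two ratios coincide when the input and output grids have the same size. When the output grid is smaller than the input grid, as with stride 2 here, the first pointwise convolution still runs over the full input grid, so the speedup falls below the compression ratio.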
It is noteworthy that, unlike other compression methods such as pruning and distillation, tensor decomposition can be reversed to recover the original architecture without a change in accuracy in a straightforward manner: by simply performing matrix multiplication of the decomposed matrices. The last 2 steps in the process only show that the overall training process can recover the original architecture, in case someone would like to use it, for whatever reason, rather than the smaller decomposed architecture.
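The reversibility described above can be checked numerically. The sketch below factors a convolution kernel along its two channel dimensions using a truncated higher-order SVD, which is used here as a simple stand-in for computing Tucker factors and may differ from the algorithm in the cited work. It then multiplies the factors back into a single kernel and confirms that the merged kernel computes the same output as the chain of three convolutions. All function names and tensor sizes are assumptions made for illustration, and the convolution is restricted to stride 1 without padding.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv2d(x, kernel):
    """Stride-1, unpadded convolution: x is (C_in, H, W), kernel is (C_out, C_in, k_h, k_w)."""
    windows = sliding_window_view(x, kernel.shape[2:], axis=(1, 2))  # (C_in, H', W', k_h, k_w)
    return np.einsum("tsij,shwij->thw", kernel, windows)


def mode_factor(kernel, mode, rank):
    """Leading `rank` left singular vectors of the unfolding along `mode`."""
    unfolded = np.moveaxis(kernel, mode, 0).reshape(kernel.shape[mode], -1)
    u, _, _ = np.linalg.svd(unfolded, full_matrices=False)
    return u[:, :rank]


def tucker2(kernel, r_out, r_in):
    """Factor a (C_out, C_in, k_h, k_w) kernel along its two channel modes."""
    u_out = mode_factor(kernel, 0, r_out)  # (C_out, r_out)
    u_in = mode_factor(kernel, 1, r_in)    # (C_in, r_in)
    core = np.einsum("tsij,ta,sb->abij", kernel, u_out, u_in)  # (r_out, r_in, k_h, k_w)
    return u_out, core, u_in


def decomposed_forward(x, u_out, core, u_in):
    """Apply the three consecutive convolutions of the decomposed layer."""
    z = np.einsum("sb,shw->bhw", u_in, x)       # pointwise conv: C_in -> r_in
    z = conv2d(z, core)                          # spatial conv:   r_in -> r_out
    return np.einsum("ta,ahw->thw", u_out, z)   # pointwise conv: r_out -> C_out


def reconstruct(u_out, core, u_in):
    """Multiply the factors back into one (C_out, C_in, k_h, k_w) kernel."""
    return np.einsum("abij,ta,sb->tsij", core, u_out, u_in)


rng = np.random.default_rng(0)
kernel = rng.standard_normal((16, 8, 3, 3))
x = rng.standard_normal((8, 10, 10))

u_out, core, u_in = tucker2(kernel, r_out=6, r_in=4)
merged = reconstruct(u_out, core, u_in)
same_output = np.allclose(decomposed_forward(x, u_out, core, u_in), conv2d(x, merged))
print("merged kernel shape:", merged.shape)
print("merged kernel matches the decomposed layer:", same_output)
```

The approximation error is introduced when the ranks are truncated at decomposition time; merging the factors afterwards is exact up to floating point round-off, which is why the reconstruction step does not change the accuracy.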
We have tested our approach by training VGG19, DenseNet40, and ResNet56 on the CIFAR10 dataset, and ResNet50 on the ImageNet dataset. When training on CIFAR10, the batch size was 128, and the learning rate was initialized at 0.1, reduced to 0.01 at the 100th epoch, and to 0.001 at the 150th epoch. When training on ImageNet, the batch size was 256 and the learning rate was set to 0.1.
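The CIFAR10 learning rate schedule described above can be written as a small step function. This is a sketch only: the function name is hypothetical, and treating the 100th and 150th epochs as 0-based indices 100 and 150 is an assumption.

```python
def cifar10_learning_rate(epoch):
    """Step schedule: 0.1, then 0.01 from epoch 100, then 0.001 from epoch 150."""
    if epoch < 100:
        return 0.1
    if epoch < 150:
        return 0.01
    return 0.001


for epoch in (0, 99, 100, 149, 150, 199):
    print(epoch, cifar10_learning_rate(epoch))
```

In PyTorch, a comparable schedule could be expressed with `torch.optim.lr_scheduler.MultiStepLR` using milestones of 100 and 150 and a decay factor of 0.1, although the exact configuration used for the experiments is not stated here.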
In addition to training the original model from scratch, training the decomposed model from an early epoch, and training the reconstructed model from a late epoch, we also decomposed the original model after it completed training and fine-tuned it for an additional number of epochs. We did this to compare the accuracy and number of parameters of a model decomposed early during training against those of a model decomposed after it completed training.
For VGG19 on CIFAR10, Table 1 shows that decomposing at the 10th epoch sped up training and reduced the parameter size, at the cost of an accuracy drop of almost 2%. Figure 3 shows a sudden drop in accuracy when decomposition happens; however, that drop is recovered in less than 1000 seconds. When decomposing at later epochs, the general trend was a smaller accuracy drop in return for less model size compression. A possible explanation is that in later epochs EVBMF detects more noise in the weight values, as the weights try to fit the training data with higher accuracy and cover more corner cases, and hence selects a higher rank for decomposition. Surprisingly, neither reconstructing at the 190th epoch nor decomposing after complete training resulted in a higher best accuracy than decomposing at the 10th epoch without reconstruction.
For DenseNet40 on CIFAR10, we notice a drop in accuracy similar to that of VGG19, but a smaller training speedup, despite the reduction in the number of parameters. This is expected from the compression and speedup ratios expressed in Equations 10 and 11. The results for DenseNet40 also show that both accuracy and model compression are higher when decomposing during training than when decomposing after training.
On the other hand, as shown in Table 3, decomposing ResNet56 resulted in an increase in accuracy, but less than a 10% reduction in training time. One interpretation is that the EVBMF algorithm selected high ranks for most layers in the ResNet56 model. This point should be investigated further in future work.
For the ImageNet dataset, the drop in accuracy was less than 0.2% for ResNet50, but the reduction in training time was negligible. Furthermore, decomposing at the 30th epoch resulted in better accuracy than decomposing after complete training.
|Dec. @ 10||91.69%||91.39%||2.32 hr|
|Dec. @ 20||92.10%||91.89%||2.75 hr|
|Dec. @ 30||91.85%||91.78%||2.95 hr|
|Dec. @ 40||92.57%||92.43%||3.04 hr|
|Dec. @ 50||92.51%||92.49%||2.41 hr|
|Dec. @ 20||92.00%||91.82%||8.62 hr|
|Dec. @ 10||92.16%||91.97%||4.06 hr|
|Dec. @ 20||92.27%||92.07%||4.10 hr|
|Dec. @ 30||91.66%||91.51%||4.15 hr|
|Dec. @ 40||91.67%||91.50%||4.14 hr|
|Dec. @ 50||91.65%||91.15%||4.15 hr|
|Model||Best Accuracy||Params||Training Time|
|Dec. @ 30||75.34%||92.68%||179.2 hr|
Conclusion and Future Work
In this paper we have shown that, to compress a model using tensor decomposition, we do not have to wait until training ends. We have shown that decomposing at the 10th or 20th epoch of training results in accuracy close to, and sometimes higher than, that of the original model trained until the end.
We have also shown that in all of the cases on the CIFAR10 dataset, the size of a model decomposed after 10 or 20 epochs of training is smaller than that of the model decomposed after complete training. Moreover, we have shown, for CIFAR10, that training a decomposed model results in considerably faster training for the VGG19 and DenseNet40 architectures. However, the speedup obtained for the ResNet architecture was negligible.
For future work, there is a need to explore ways to reduce the drop in accuracy of our "decomposition-in-training" approach for some models, and to increase the training speedup for other architectures, especially ResNet.
We thank the authors of the Decompose-CNN and PyTorch tensor decompositions repositories for providing open-sourced code for Tucker decomposition using PyTorch. We also thank Yerlan Idelbayev for providing open-sourced code and model files to reproduce the results of the original ResNet, DenseNet, and VGG papers on CIFAR10.
- (2019-08) Demystifying parallel and distributed deep learning: an in-depth concurrency analysis. ACM Comput. Surv. 52 (4), pp. 65:1–65:43.
- (2019) Drop an octave: reducing spatial redundancy in convolutional neural networks with octave convolution. CoRR abs/1904.05049.
- (2016) BinaryNet: training deep neural networks with weights and activations constrained to +1 or -1. CoRR abs/1602.02830.
- (2016) Fast training of convolutional neural networks via kernel rescaling.
- (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS'14, Cambridge, MA, USA, pp. 1269–1277.
- (2018) Decompose-cnn. Note: https://github.com/larry0123du/Decompose-CNN (accessed 2019-09-05).
- (2019) Lightweight monocular depth estimation model by joint end-to-end filter pruning. CoRR abs/1905.05212.
- (2018) PyTorch tensor decompositions. Note: https://github.com/jacobgil/pytorch-tensor-decompositions (accessed 2019-09-05).
- (2018) Network decoupling: from regular to depthwise separable convolutions. In British Machine Vision Conference 2018, BMVC 2018, Northumbria University, Newcastle, UK, September 3-6, 2018, pp. 248.
- (1993) Second order derivatives for network pruning: optimal brain surgeon. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles (Eds.), pp. 164–171.
- (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861.
- (2019-06-01) Entropy-based pruning method for convolutional neural networks. The Journal of Supercomputing 75 (6), pp. 2950–2963.
- (2016) SqueezeNet: alexnet-level accuracy with 50x fewer parameters and <1mb model size. CoRR abs/1602.07360.
- (2018) Proper resnet implementation for cifar10/cifar100 in pytorch. Note: https://github.com/akamaster/pytorch_resnet_cifar10 (accessed 2019-05-23).
- (2018) Highly scalable deep learning training system with mixed-precision: training imagenet in four minutes. ArXiv abs/1807.11205.
- (2016) Compression of deep convolutional neural networks for fast and low power mobile applications. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
- (1990) Optimal brain damage. In Advances in Neural Information Processing Systems 2, D. S. Touretzky (Ed.), pp. 598–605.
- (2018) Holistic cnn compression via low-rank decomposition with knowledge transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1.
- (2017) Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 345–353.
- (2018) Deep gradient compression: reducing the communication bandwidth for distributed training. In International Conference on Learning Representations.
- (2018) Bi-real net: enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm. CoRR abs/1808.00278.
- (2019) Rethinking the value of network pruning. In ICLR.
- Bayesian interpolation. Neural Computation 4, pp. 415–447.
- (2018) Network slimming (pytorch). Note: https://github.com/Eric-mingjie/network-slimming (accessed 2019-09-05).
- Importance estimation for neural network pruning. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2012) Perfect dimensionality recovery by variational bayesian pca. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), pp. 971–979.
- (2016) XNOR-net: imagenet classification using binary convolutional neural networks. CoRR abs/1603.05279.
- (2018-06) MobileNetV2: inverted residuals and linear bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
- (2019) Green AI. CoRR abs/1907.10597.
- Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799.
- (2019) And the bit goes down: revisiting the quantization of neural networks.
- (2019) Energy and policy considerations for deep learning in nlp. In ACL.
- MeProp: sparsified back propagation for accelerated deep learning with reduced overfitting. Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 3299–3308.
- (1966-09-01) Some mathematical notes on three-mode factor analysis. Psychometrika 31 (3), pp. 279–311.
- (2017) Shift: a zero flop, zero parameter alternative to spatial convolutions. CVPR 2018, pp. 9127–9135.
- (2016-10) Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (10), pp. 1943–1955.