I Introduction
In the past few years, deep neural networks (DNNs) [1] have achieved great success in machine learning, especially convolutional neural networks (CNNs), with representative networks such as AlexNet [2], VGGNet [3], GoogleNet [4], ResNet [5], and DenseNet [6]. More recently, three-dimensional convolutional neural networks (3DCNNs) [7, 8] have been applied to many recognition tasks on spatiotemporal data from videos [9, 10, 11, 12, 13, 14, 15, 16], pure 3D data from depth cameras [17], and stacked utterances from speech data [18]. However, these high-dimensional 3DCNN architectures, particularly the long-term temporal convolutions with large 3D convolutional kernels [15] and the very deep 3D ResNets [16], aggravate the problem of inflated DNN sizes [19]. Even worse, to the best of our knowledge, no existing work compresses 3DCNNs to meet the miniaturization requirements of constrained environments such as embedded devices.
Fortunately, there is research on the compression of other neural networks, which provides an opportunity for compressing 3DCNNs. According to the survey of Cheng et al. [19], compression methods fall into four categories: 1) compact architecture; 2) weight sharing or quantization; 3) sparsifying or pruning; and 4) matrix/tensor decomposition or low-rank factorization. Among these methods, compact architectures obtain only a limited compression ratio, and quantization and pruning always need to pretrain the corresponding uncompressed models; only decomposition methods afford the so-called in situ training [20], which directly yields a trained model from scratch with sufficient compression performance. Therefore, in this work, we focus on compressing 3DCNNs by applying decomposition methods.
Among decomposition methods, SVD is the most widely employed matrix decomposition for compressing DNNs. For instance, Zhang et al. [21, 22] split one convolutional kernel into two sub-kernels, and Shim et al. [23] compress the last softmax layer of large-vocabulary neural networks. However, the matrix viewpoint may not be enough to completely eliminate the inherent redundancy in DNNs [24]. Hence, from the tensor viewpoint, a higher compression ratio can be approached by reshaping the weight matrices into tensors, termed tensorizing [25]. Nevertheless, traditional tensor decomposition methods such as CP [26] and Tucker [27] inevitably suffer from the curse of dimensionality, because their kernel tensors still contribute exponentially to the space complexity [28].
Tensor network decomposition methods [29], including hierarchical Tucker [30, 31], tensor train (TT) [32], and tensor chain [33, 34], can completely avoid the curse of dimensionality by representing a tensor as linked small factor tensors with restricted orders. Among them, TT decomposition is the most concise format, so many compression applications are based on it. Novikov et al. [25] first used TT decomposition to compress the weight matrices of fully connected (FC) layers in DNNs. Since then, Huang et al. [35] and Su et al. [36] have extended the applications to CNNs with TT-decomposed FC layers. Only Garipov et al. [37] apply the TT format to convolutional layers, by first viewing the kernel as a 4th-order tensor, then reshaping the tensor to a matrix, and finally matching the matrix to the $d$th-order tensorizing TT approach [25]. In the domain of recurrent neural networks (RNNs), Tjandra et al. [38, 39] use the TT format to compress all matrices within different kinds of gate structures, and Yang et al. [40] further test TT-RNNs on larger models and datasets, achieving extremely high compression ratios with an accuracy improvement rather than a loss.
From the above recent practices on TT-based compression, we observe that: 1) all methods except that of Garipov et al. [37] rely on the tensorizing TT approach for weight matrices, including FC layers and RNN gate units; 2) all works except that of Yang et al. [40] suffer some accuracy loss; 3) although it is important to keep the TT format low-rank [41, 42], how to select suitable TT ranks for given tensor sizes, especially for training DNNs, has not been addressed yet. Based on these observations and facing the specific 3D convolutional kernels in 3DCNNs, in this work we first study a tensorizing method to compress a 5th-order 3D convolutional kernel tensor into the TT format as a $(d+1)$th-order tensor. Secondly, we consider how the accuracy loss can be avoided for large 3DCNNs. Thirdly, we provide a general rule for deciding the TT ranks of a specific tensor of given size. Finally, building on the fundamental methods reported by Novikov et al. [25] and Garipov et al. [37], we propose a tensorizing approach to compress 3D convolutional kernels based on the TT decomposition. Last but not least, inspired by Yang et al. [40], we conduct multiple experiments on the VIVA challenge [43] and UCF11 [44] datasets to give empirical evidence that compressed TT-3DCNNs with appropriate TT ranks can gain accuracy rather than lose it if their original uncompressed counterparts are redundant.
We list the main contributions of this work as follows:

To the best of our knowledge, we are the first to utilize the TT format to compress convolutional kernels in 3DCNNs. This method provides a direct in situ training approach, without pretraining or elaborate design, to compress large-scale 3DCNNs for application scenarios with limited storage space.

We give a general principle for selecting TT ranks for a tensor of fixed size, supported on two grounds. One is a theoretical analysis based on the relationship between TT and hierarchical Tucker, which reveals the source of TT ranks; the other is experimental verification.

We empirically show that the accuracy loss from compression can be avoided in redundant DNNs by exploiting the inherent regularization of TT decomposition. Under this accuracy constraint, a very high compression ratio can still be obtained.

We achieve a state-of-the-art result on the VIVA challenge dataset (81.83%), which significantly outperforms previous work.
The rest of the paper is organized as follows. Section II first introduces fundamental knowledge of the TT format including tensorizing for matrices, then proposes the tensorizing for 3D convolutional kernels and discusses the selection of TT ranks. Section III presents the elaborate contrast experiments based on VIVA challenge and UCF11 datasets to verify that the accuracy loss can be avoided when compressing a redundant 3DCNN model based on the TT decomposition. Section IV further discusses some possible internal mechanisms for lossless TT networks. Section V concludes this work.
II Tensor Train Decomposition for 3DCNNs
In this section, we first introduce basic knowledge of the TT format. Then we propose the tensorizing method for compressing 3D convolutional kernels. Finally, we investigate the principle for selecting TT ranks. For convenience, we use bold lowercase letters for vectors (e.g., $\boldsymbol{a}$), bold uppercase letters for matrices (e.g., $\boldsymbol{A}$), and calligraphic bold uppercase letters for tensors (e.g., $\boldsymbol{\mathcal{A}}$).
II-A Tensor Train Format
II-A1 Basic TT Format
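According to [32], the basic TT format of a $d$th-order tensor $\boldsymbol{\mathcal{A}} \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_d}$ can be represented entry-wise as
$$\boldsymbol{\mathcal{A}}(j_1, j_2, \ldots, j_d) = \boldsymbol{G}_1[j_1]\,\boldsymbol{G}_2[j_2] \cdots \boldsymbol{G}_d[j_d], \tag{1}$$
where $j_k$ ($k = 1, 2, \ldots, d$) is the $k$th index of the entry in tensor $\boldsymbol{\mathcal{A}}$, the serial product on the right side of the equation multiplies the core matrices that yield the entry, each factor matrix $\boldsymbol{G}_k[j_k]$ has size $r_{k-1} \times r_k$, and $r_0 = r_d = 1$. There are in total $d+1$ values $r_k$, which are collectively called the TT ranks. Additionally, all $\boldsymbol{G}_k[j_k]$ corresponding to the same mode $n_k$ can be stacked into a 3rd-order core tensor $\boldsymbol{\mathcal{G}}_k \in \mathbb{R}^{r_{k-1} \times n_k \times r_k}$. The TT format of $\boldsymbol{\mathcal{A}}$ can also be represented as [45]
$$\boldsymbol{\mathcal{A}} = \boldsymbol{\mathcal{G}}_1 \times^1 \boldsymbol{\mathcal{G}}_2 \times^1 \cdots \times^1 \boldsymbol{\mathcal{G}}_d, \tag{2}$$
where $\times^1$ is the mode-$(N, 1)$ contracted product, which means that just one pair of equal modes in an $N$th-order tensor and an $M$th-order tensor is contracted to produce a new $(N+M-2)$th-order tensor.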
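The entry-wise form of Equation (1) can be illustrated with a minimal pure-Python sketch. The function name `tt_entry`, the nested-list storage of the core matrices, and the 0-based indices are assumptions made only for this illustration; they are not taken from the paper or from any library.

```python
def tt_entry(cores, index):
    """Evaluate one tensor entry from TT cores, following Equation (1).

    cores[k][j] is the r_{k-1} x r_k factor matrix (nested lists) for
    index value j of mode k. The boundary ranks are r_0 = r_d = 1.
    """
    row = [1.0]  # running 1 x r_k row vector, starting with r_0 = 1
    for core, j in zip(cores, index):
        matrix = core[j]
        n_cols = len(matrix[0])
        # Multiply the running row vector by the selected factor matrix.
        row = [sum(row[a] * matrix[a][b] for a in range(len(row)))
               for b in range(n_cols)]
    return row[0]  # r_d = 1, so a single scalar remains


# Rank-1 example: A(j1, j2, j3) = u[j1] * v[j2] * w[j3], all TT ranks equal to 1.
u, v, w = [1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0]
rank1_cores = [[[[x]] for x in vec] for vec in (u, v, w)]

# Rank-2 example for a 2 x 2 matrix: G1[j1] is 1 x 2 and G2[j2] is 2 x 1.
rank2_cores = [
    [[[1.0, 0.0]], [[0.0, 1.0]]],
    [[[2.0], [3.0]], [[4.0], [5.0]]],
]

print(tt_entry(rank1_cores, (1, 2, 0)))
print(tt_entry(rank2_cores, (1, 1)))
```

Each entry costs only a chain of small matrix products, so the full tensor never has to be materialized.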
Suppose that the maximum value of all modes is $n$, and the maximal rank is $r$. It is easy to work out that the space complexity of a $d$th-order tensor $\boldsymbol{\mathcal{A}}$ can be reduced from $\mathcal{O}(n^d)$ to $\mathcal{O}(dnr^2)$. Consequently, for fixed $n$ and $r$, the compression ratio grows exponentially as the order $d$ increases linearly. This means that the more complex a data structure is, the higher the compression ratio we can obtain.
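As a worked illustration of this complexity argument, the sketch below counts the stored values of a full tensor and of its TT cores. The helper names and the example sizes (six modes of size 8, interior ranks 4) are hypothetical and chosen only to make the arithmetic concrete.

```python
from math import prod


def full_size(modes):
    """Number of entries of the uncompressed tensor: n_1 * n_2 * ... * n_d."""
    return prod(modes)


def tt_size(modes, ranks):
    """Number of values stored by the TT cores.

    ranks = [r_0, r_1, ..., r_d] with r_0 = r_d = 1; core k holds
    r_{k-1} * n_k * r_k values.
    """
    assert len(ranks) == len(modes) + 1
    assert ranks[0] == 1 and ranks[-1] == 1
    return sum(ranks[k] * n * ranks[k + 1] for k, n in enumerate(modes))


modes = [8] * 6                   # hypothetical 6th-order tensor, n = 8
ranks = [1, 4, 4, 4, 4, 4, 1]     # hypothetical TT ranks, r = 4

print(full_size(modes))                          # 8**6 entries
print(tt_size(modes, ranks))                     # bounded by d * n * r**2
print(full_size(modes) / tt_size(modes, ranks))  # compression ratio
```

Adding one more mode of size 8 multiplies the full size by 8 but adds only one more core to the TT count, which is the exponential-versus-linear gap described above.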
II-A2 Tensorizing and TT for Matrices
It is meaningless to directly use Equation (2) to decompose a matrix as a 2nd-order tensor, because such a naive approach makes the TT decomposition degenerate into an ordinary low-rank matrix decomposition. From the analysis of space complexity above, a significant compression ratio can be obtained if the original matrix is reshaped into a tensor of higher order. This idea is the so-called tensorizing proposed by Novikov et al. [25], which makes it possible to use TT for matrices, not only in the field of DNNs.
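In detail, consider a large matrix $\boldsymbol{W} \in \mathbb{R}^{M \times N}$ whose mode sizes can each be factorized into $d$ integers as
$$M = \prod_{k=1}^{d} m_k, \qquad N = \prod_{k=1}^{d} n_k.$$
Then a $d$th-order tensor $\boldsymbol{\mathcal{W}} \in \mathbb{R}^{m_1 n_1 \times m_2 n_2 \times \cdots \times m_d n_d}$ can be constructed by two bijections that map the original matrix mode $M$ or $N$ to the tensor modes $m_k$ or $n_k$ ($k = 1, 2, \ldots, d$), separately. The corresponding relationship between the original matrix and the reshaped tensor can be represented as
$$\boldsymbol{W}(i, j) = \boldsymbol{\mathcal{W}}\big((\mu_1(i), \nu_1(j)), (\mu_2(i), \nu_2(j)), \ldots, (\mu_d(i), \nu_d(j))\big),$$
where $i \in \{1, \ldots, M\}$, $j \in \{1, \ldots, N\}$, and $\mu(i) = (\mu_1(i), \ldots, \mu_d(i))$ and $\nu(j) = (\nu_1(j), \ldots, \nu_d(j))$ are the two bijections.
After tensorizing, one can use Equation (1) to rewrite the tensor into its TT format as
$$\boldsymbol{W}(i, j) = \boldsymbol{G}_1[\mu_1(i), \nu_1(j)]\,\boldsymbol{G}_2[\mu_2(i), \nu_2(j)] \cdots \boldsymbol{G}_d[\mu_d(i), \nu_d(j)]. \tag{3}$$
The space complexity is thereby reduced from $\mathcal{O}(MN)$ to $\mathcal{O}(dmnr^2)$, where $m$ and $n$ are the maximal $m_k$ and $n_k$ ($k = 1, 2, \ldots, d$), respectively. The visualized structure of TT for a matrix is illustrated in Figure 1(a).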
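The two bijections used in tensorizing can be sketched in a few lines. The text does not fix a particular bijection, so the row-major mixed-radix split below, the 0-based indexing, and all function names are assumptions chosen for illustration only.

```python
def split_index(i, factors):
    """Mixed-radix (row-major) split of a flat index into one digit per factor."""
    digits = []
    for f in reversed(factors):
        digits.append(i % f)
        i //= f
    return tuple(reversed(digits))


def merge_index(digits, factors):
    """Inverse of split_index: rebuild the flat index from its digits."""
    i = 0
    for digit, f in zip(digits, factors):
        i = i * f + digit
    return i


def tensorized_index(i, j, m_factors, n_factors):
    """Map matrix entry (i, j) to the d combined indices of the reshaped tensor.

    Mode k of the reshaped tensor has size m_k * n_k and pairs the k-th
    row digit with the k-th column digit.
    """
    mu = split_index(i, m_factors)
    nu = split_index(j, n_factors)
    return tuple(a * n + b for a, b, n in zip(mu, nu, n_factors))


m_factors, n_factors = (4, 4), (2, 3)   # hypothetical M = 16, N = 6
print(split_index(13, m_factors))
print(tensorized_index(13, 4, m_factors, n_factors))
```

Any other pair of bijections would serve equally well; the mixed-radix choice is convenient because it coincides with an ordinary row-major reshape.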
II-B TT for 3D Convolutional Kernels
A normal 2D convolutional kernel could always be regarded as a 4thorder tensor , which can be reshaped to a thorder tensor by referring to tensorizing for matrices, where means the edge length of convolutional filter, and denote the input and the output channels, respectively [37]. For TT format, such reshaping approach is more efficient than the naive Equation (1), because the value of or is usually much larger than , and a tensor with more balanced shape can usually get less errors [25]. Based on Equation (3), Garipov et al. [37] give the TT format for convolutional kernels
(4)  
where or , , and . The visualized structure of TT for convolutional kernel is illustrated in Figure 1(b).
Inspired by [37], we intend to propose a similar method to reconstruct a 3D convolutional kernel to a thorder tensor with relatively balanced size, and then utilize the TT format on this tensor. However, a 3D convolutional kernel is a 5thorder tensor which has a convolutional window with 3 sizes () rather than regular hexahedron in most cases. Thus, in order to utilize Equation (4), we should make a mapping to transfer the entry from to a new 4thorder tensor with the constraint .
First, let’s ignore and , stretch to a 3rdorder tensor with the constraint . Then, suppose the value of each index begins at 0, and we have
where , , , , , are indices of the modes ,, , , , , respectively, with . Second, fold to a 4thorder tensor with the constraint , we have
where are indices of the modes , with . Combining the above two steps, the mapping from to should include 3 bijections: , , and , which let
(5)  
and
Finally, as the same as Equation (4), by further reshaping to , we can get the TT format for 3D convolutional kernels like
(6)  
The corresponding compression ratio can be calculated as
(7) 
where () denotes the TT ranks . Note that because is a thorder tensor. It is easy to see that the compression ratio depends on the TT ranks significantly. Let be the maximum value of , we obtain that
(8) 
The visualized structure of TT for 3D convolutional kernel is illustrated in Figure 1(c). Moreover, for easily programming and directly using convolutional operation , Algorithm 1 below shows how to design the structure of a 3D convolutional kernel and compute the convolutional output with input data. Note that we calculate the values of and as close as possible to ensure the data ranges of and are sufficient.
II-C The Selection of TT Ranks
II-C1 Theoretical Foundation
According to Equation (7), compression ability depends on the value of order (), modes ( and ) and TT ranks (). Further, from Equation (8), one can make the value of modes to be as equal as possible so that the best compression ratio may be picked up for tensorizing. Hence, the rest to be considered carefully is the value of TT ranks. However, as mentioned in Section I, there is still a lack of verified principles for the selection of TT ranks to represent a tensor with given sizes, although Novikov et al. [25] have summarized some phenomena when TT ranks growing. In fact, it is widely accepted that the TT format is a special form of the hierarchical Tucker format [31, 45, 46, 47], so it is possible to find the theoretical foundation of TT ranks from researching the details of hierarchical Tucker.
In order to explain the hierarchical Tucker format, we first introduce the concept of modes matricization [31]. Consider a thorder tensor with the set of indices of modes which can be split into two subsets and ( ). If the set can also be split into , we can conclude [48]
where and means the Kronecker product. Similarly, sizes of are serial products of elements in and its complementary set , separately, and so does . Such form like is called modes matricization of tensor . Furthermore, given , , and as bases of the column spaces of , , and , respectively, we have
(9) 
where , and is called transfer matrix. Note that , , and are the respective ranks of , , and .
It can be easily observed that only two subsets are produced each time Equation (9) is applied. Thus, a specific hierarchical Tucker format corresponds to a binary tree that repeatedly divides the original set of all modes until every leaf node is a singleton set containing one mode. Such a binary tree is called a dimension tree, and there exists a special form called the degenerate dimension tree, which splits off only one mode each time, as shown in Figure 2.
According to Lemma 5.2 in [31] which proves that the hierarchical Tucker corresponding to the degenerate dimension tree yields the TT format. For the thorder tensor in Figure 2, each entry of the th () TT core tensor can be represented like
where , , , , , and are indices of the modes , respectively. Note that the implicit is defined based on Equation (9) in
where the transfer matrix is the initial form of tensor .
Trace back to Equation (2), any TT rank of a thorder tensor is actually the which comes from the matrix rank of with the base of column space . For example, a 4thorder tensor whose serial modes matricizations based on the degenrate dimension tree are , , and , then the TT rank values should be , , and if we suppose that these three matrices are all full rank. However, higher values of TT ranks always cause ripples for numerical computation. In practice of DNNs, because of the considerable redundancy in weight matrices [24], we can just consider the truncated TT format by selecting a comparatively small rank value to replace all original ranks, e.g., set and for the above tensor .
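The link between full-rank matricizations and TT ranks can be made concrete with a short sketch. It assumes the standard bound that the matricization separating the first $k$ modes from the remaining ones has rank at most the smaller of its two dimensions; the function names, the example mode sizes, and the cap value are hypothetical.

```python
from math import prod


def full_tt_ranks(modes):
    """Upper bounds on the interior TT ranks r_1, ..., r_{d-1}.

    r_k is bounded by the rank of the (n_1...n_k) x (n_{k+1}...n_d)
    matricization, which is at most the smaller of its two dimensions.
    """
    d = len(modes)
    return [min(prod(modes[:k]), prod(modes[k:])) for k in range(1, d)]


def truncated_tt_ranks(modes, cap):
    """Truncated TT ranks: replace every full rank above `cap` by `cap`."""
    return [min(r, cap) for r in full_tt_ranks(modes)]


modes = (4, 8, 8, 4)                  # hypothetical 4th-order tensor
print(full_tt_ranks(modes))           # full-rank bounds
print(truncated_tt_ranks(modes, 16))  # truncated format with cap 16
```

The middle rank dominates the storage cost, which is why truncating to a comparatively small common value is attractive for redundant weight tensors.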
II-C2 Pragmatic Verification
In order to verify the effectiveness of truncated TT format for training DNNs, we make two brief tests based on CIFAR10 dataset [49] to examine whether it is enough to just select a relatively small matrix rank value from any . Each test is executed through a CNN and its TTCNN form which compresses all but convolutional kernels based on Equation (4). The detailed network architectures are shown in Table I. Furthermore, we force all the TT ranks to be the same value with range from 2 to 100 for easily observing the relationship between the accuracy and the rank value.
–  CNN_1  CNN_2  

Layers  Shape  TT  Shape  TT 
Input  –  –  
Conv 1.1  –  –  
Conv 1.2  
Max Pooling 1  –  –  
Conv 2.1  –  
Conv 2.2  
Max Pooling 2  –  –  
Conv 3.1  
Conv 3.2  
Average Pooling  –  –  
Linear  128  –  256  – 
Output  10  –  10  – 
According to the definitions of TT format in Table I, the theoretical value of truncated TT ranks should be 9, 16 or 32 in CNN_1, and 9 or 16 in CNN_2, respectively. Because occupies the most proportion of all feasible values of original in both CNN_1 and CNN_2, 16 could be the most suitable rank value for truncated TT format with the premise that all values of TT ranks are the same. Moreover, the test results illustrated in Figure 3 show that the accuracy improvements have slowed down when passing the point with rank value of 16 in both two TTCNNs.
In a nutshell, to design a truncated TT format for one layer in DNNs, we deem that for an arbitrary tensorizing thorder weight or convolutional kernel tensor , one can select an appropriate rank or which is the full rank of or , respectively. Approximately, in practice it is usually better to fine tune the th TT rank to be the full rank of . At last, it needs to be stressed that all the tests and experiments in this paper are implemented using the t3f library [50].
III Experiments
In this section, based on the VIVA challenge dataset, we first design five different 3DCNN models and their TT formats by continuously enlarging the network scale, to observe how the accuracy loss vanishes as the network redundancy increases. Second, a carefully designed 3DCNN model and its TT format are trained on the UCF11 dataset for further verification.
III-A VIVA Challenge Dataset
III-A1 Dataset Description
VIVA challenge dataset [43] is comprised of 885 video sequences including 19 dynamic hand gestures without any official split of training and validation set, and each video sequence contains two consistent channels which are intensity and depth. The resolution of videos is , and each video clip has about 32 frames. The dataset describes the hand gesture based humancomputer interaction system in cars for both driver and copilot under varying illumination environment.
Although it is beneficial to train DNNs with the two natural channels sampled from a depth camera, the number of samples is still very limited for a dataset with 19 classes. Besides, every frame in the video is best treated as a grayscale image with one band, although the originally extracted raw frame has three bands, because the intensity data inherently has one band from infrared sampling and the depth data is linearly mapped from distance to a gray value rather than to a color space.
III-A2 Preprocessing and Augmentation
It is clear that VIVA challenge dataset has characteristics of multiple channels, small sample size, large resolution, short video clip length, and no split of training or validation set. Thus, some data preprocessing and augmentation must be considered to prevent overfitting and improve performance.
Inspired by [11], our data preprocessing includes four main steps below which are also vividly illustrated in Figure 4.

Extract frame sequences from video clips and normalize them to a length of 32 through the so-called nearest neighbor interpolation (NNI) [11]. In detail, we uniformly delete several frames from sequences longer than 32, and repeat adjacent frames in those shorter than 32. Moreover, we shrink every frame image to as width height by bicubic interpolation.
Generate Sobel gradient images from intensity frames to be an extra channel of frame sequences.

Create new frame sequences by reversing the order of frames through utilizing the time symmetry of some dynamic gestures, e.g., the gesture of swiping from left to right with right hand can be reversed to be a new gesture of swiping from right to left with right hand. But this step is not suitable for a part of frame sequences that have no trait of time symmetry.

Produce new frame sequences from both the original and the reversed ones by vertically mirroring every frame, which exploits the symmetry of hands. After this step, the augmented sample size is 3076.
Besides, to simplify the choice of batch size, we pad the whole sample set from 3076 to 3100 by randomly copying 24 samples that are not time symmetrical. Finally, every sample is a stacked frame sequence with three channels: intensity, depth, and Sobel gradient. We call a single stacked frame sequence with one channel a volume, e.g., intensity volume, depth volume, or Sobel gradient volume. To assess training results objectively, we randomly split the entire 3100 samples into a training set with 2500 samples and a validation set with 600 samples for the next learning stage based on approximate k-fold cross validation, and we ensure that both sets contain enough samples from each of the 19 classes.
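The temporal normalization and time-reversal steps described above can be sketched as follows. The exact rounding rule of NNI is not given in the text, so the centre-of-bin rule used here is an assumption, as are the function names; frames are represented by plain list elements.

```python
def nni_indices(num_frames, target=32):
    """Source frame index for each of `target` output frames.

    Longer sequences are uniformly subsampled and shorter ones repeat
    adjacent frames. The centre-of-bin rounding is an assumed convention.
    """
    return [min(num_frames - 1, int((t + 0.5) * num_frames / target))
            for t in range(target)]


def normalize_length(frames, target=32):
    """Resample a frame sequence to exactly `target` frames."""
    return [frames[k] for k in nni_indices(len(frames), target)]


def time_reversed(frames):
    """Reverse the frame order, usable only for time-symmetric gestures."""
    return frames[::-1]


short_clip = list(range(16))   # stand-in for a 16-frame sequence
long_clip = list(range(40))    # stand-in for a 40-frame sequence
print(normalize_length(short_clip))
print(normalize_length(long_clip))
```

A sequence that already has 32 frames is returned unchanged, so the same routine can be applied to every clip regardless of its original length.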
Before feeding the training volumes into 3DCNNs, some data augmentation methods must be applied to further avoid overfitting. We have observed that frame images in the same channel share noteworthy common features that differ clearly from those of other channels. Figure 4 gives an intuitive sense that the frame images in the intensity volume are natural and bright, those in the depth volume are dark, and those in the Sobel gradient volume are clean-cut. These features are also reflected in the average gray histograms of the different channels shown in Figure 5: subfigure (a) shows that the histogram of the intensity frame image is relatively equalized, subfigure (b) indicates that most pixel values in the depth frame image are close to black, and subfigure (c) reveals that the most remarkable characteristic of Sobel gradient frame images is the large number of edges drawn in pure white.
Based on the histogram analysis above, we make traditional affine transformation for all channels and diverse approaches for different channels. In detail, we first apply affine transformation for all frame images with randomly resizing, horizontal translating of pixels, and vertical translating of pixels. Note that these affine transformation parameters are the same to images in each volume. Then we simultaneously apply random contrast adjustment, random bright increasing, and random pixels drop out for intensity, depth, and Sobel gradient volumes, respectively, as follows.

Apply random contrast adjustment in the range of for intensity frame images.

Increase the brightness randomly from for depth frame images.

Drop out random pixels (set their values to be 0) for Sobel gradient frame images.
It should be emphasized that data augmentation is applied only to volumes in the training set, and only to part of them: half of the training volumes and all of the validation volumes are fed directly into the networks without any augmentation.
III-A3 Design of Networks
In order to study the relationship between the compression performance and the network scale, we design five different 3DCNN models of increasing scale on the VIVA challenge dataset, as shown in Figure 6. From 3DCNN_VIVA_1 to 3DCNN_VIVA_5 in Figure 6, we gradually enlarge the convolutional kernel size and the number of output channels; the concrete numbers of parameters are listed in Table II. To prevent the possible vanishing gradient problem, we do not increase the network depth blindly. ResNet-based 3DCNNs [16] are not considered because we are not sure whether the TT format is applicable to this specific architecture. Instead, our 3DCNNs are carefully designed to allow easy comparison between the original uncompressed networks and their TT-compressed counterparts.
Table II: Numbers of parameters of the uncompressed 3DCNNs
Networks  Convolutional Parameters  Whole Parameters
3DCNN_VIVA_1  70152  2303240
3DCNN_VIVA_2  206520  13585592
3DCNN_VIVA_3  2750400  16032704
3DCNN_VIVA_4  3905472  73121216
3DCNN_VIVA_5  4790208  84809664

Table III: Numbers of parameters of the TT-compressed 3DCNNs
Networks  Convolutional Parameters  Whole Parameters
3DCNN_VIVA_1  14880  194544
3DCNN_VIVA_2  14558  285886
3DCNN_VIVA_3  132880  459168
3DCNN_VIVA_4  228976  1364080
3DCNN_VIVA_5  245360  448660
According to the network architectures designed above and Equation (6), the corresponding TT-compressed network architectures are given in Figure 7. We select comparatively small modes matricization ranks for the truncated TT decomposition. For instance, the tensorizing format of Conv2 layer in 3DCNN_VIVA_3 from Figure 7 is which yields a 4th-order tensor with the size of , then we let 16 to be the value of all TT ranks. We also list the concrete numbers of parameters in Table III. Furthermore, it is important to make the keeping probability (dropout parameter) of the TT-compressed networks higher than that of the uncompressed ones, because the TT FC layers have far fewer neurons than the original FC layers. Finally, all the activation functions omitted in both Figure 6 and Figure 7 are ReLU.
III-A4 Learning and Results
Because there is no official split of training and validation sets for the VIVA challenge dataset, as mentioned previously in this section, we randomly split the 3100 samples into 2500 training samples and 600 validation samples. To carry out k-fold cross validation, we train each network in Figure 6 and Figure 7 six times in total with non-repeating split schemes, which guarantees sufficient randomness. Overall performance is then assessed by collecting the results across all training runs.
We train all networks in Figure 6 and Figure 7 for 100 epochs in total, based on TensorFlow. On account of the varying scales of the networks, the initial learning rate is set to 0.01 for 3DCNN_VIVA_1, 3DCNN_VIVA_2 and 3DCNN_VIVA_3, and to 0.005 for 3DCNN_VIVA_4 and 3DCNN_VIVA_5. The learning rate decreases by a factor of 0.1 after every 30 epochs and the momentum is set to 0.9. Due to the limitation of our GPU resources, the mini-batch size is chosen separately for each network: 100 for 3DCNN_VIVA_1 and 3DCNN_VIVA_2, 50 for 3DCNN_VIVA_3, and 25 for 3DCNN_VIVA_4 and 3DCNN_VIVA_5. Besides, we use stochastic gradient descent (SGD) to optimize our networks, and the loss function is cross entropy with the softmax layer.
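The step schedule described above can be written as a one-line function. This is a minimal sketch: the function name `step_lr` and the integer-division form are assumptions for illustration, since the text only states that the rate drops by a factor of 0.1 after every 30 epochs.

```python
def step_lr(initial_lr, epoch, drop=0.1, every=30):
    """Learning rate at a given 0-based epoch under step decay."""
    return initial_lr * drop ** (epoch // every)


# Initial rates from the text: 0.01 for the three smaller networks,
# 0.005 for the two larger ones.
for lr0 in (0.01, 0.005):
    print([step_lr(lr0, e) for e in (0, 29, 30, 60, 90, 99)])
```

With 100 epochs the rate is reduced three times, at epochs 30, 60, and 90.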
The results of these experiments are shown in Table IV, which also lists the storage requirements of the networks. One can easily observe that the accuracy degradation decreases as the scale of each network increases, until it vanishes. In addition, a higher compression ratio can be obtained more easily for redundant networks. Further detailed discussions are given in the next section.
–  3DCNN_VIVA_1  3DCNN_VIVA_2  3DCNN_VIVA_3  3DCNN_VIVA_4  3DCNN_VIVA_5
–  Original  TT  Original  TT  Original  TT  Original  TT  Original  TT
Accuracy (%)
Degeneration (pp)  6.86  5.84  3.46  0.58  0.36
Storage (MB)  27  4  156  5  189  11  839  19  972  8
Compression ratio  6.75  31.20  17.18  44.16  121.50
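The compression ratios in the last row follow directly from the storage row as original storage divided by TT storage. The short check below recomputes them from the reported megabyte values; the dictionary layout is just a convenient assumption.

```python
# (original MB, TT MB) for each network, copied from the storage row.
storage_mb = {
    "3DCNN_VIVA_1": (27, 4),
    "3DCNN_VIVA_2": (156, 5),
    "3DCNN_VIVA_3": (189, 11),
    "3DCNN_VIVA_4": (839, 19),
    "3DCNN_VIVA_5": (972, 8),
}

# Compression ratio = uncompressed storage / TT-compressed storage.
ratios = {name: round(original / tt, 2)
          for name, (original, tt) in storage_mb.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

The ratio is not monotonic in network size because 3DCNN_VIVA_3 keeps a comparatively large TT model (11 MB), while 3DCNN_VIVA_5 compresses to only 8 MB.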
III-B UCF11 Dataset
III-B1 Dataset Description
The purpose of our experiments in this subsection is only to further confirm that the TT format of redundant 3DCNNs can avoid accuracy degradation. Thus, we choose the small-scale UCF11 dataset [44], which is collected from YouTube. The dataset contains 1600 video clips in 11 classes, again without any official split. The resolution of each video is 320 × 240. Although its scale is small, this dataset is still challenging because there are large variations in camera motion, cluttered background, illumination conditions, etc.
III-B2 Preprocessing and Augmentation
For UCF11, NNI is not suitable for normalizing frame sequences to the same length, because the number of original frames varies widely, as shown in Figure 8. Instead, the so-called random clipping [15, 51], which samples a consecutive frame sequence of fixed length, can be used to organize the input data before feeding them into networks. Thus, to save the training time spent sampling frame sequences from the original video clips, we perform the following data preprocessing in advance.

Fix the length of the frame sequences fed into networks to 50, according to the distribution of frame counts in Figure 8(b).

Extract every frame from the video clips and shrink it to 80 × 60 (width × height) by bicubic interpolation. Then we stack these frame images in sequence and save them to disk. Besides, for shorter video clips we repeat the last frame image until 50 frames are obtained. Note that one stacked frame sequence is a 4th-order tensor rather than a 3rd-order tensor, because every frame is an RGB image with three bands.

Calculate the Farnebäck optical flow [52] images from the original RGB images at the size of 320 × 240. Then shrink these optical flow images to 80 × 60 and stack them in sequence in the same way as the RGB frames. Likewise, shorter sequences are padded to 50 frames.
Finally, every sample consists of two stacked frame sequences: one is composed of R, G, and B volumes, and the other has two optical flow volumes, because one Farnebäck optical flow image has two bands. As with VIVA, there is no predefined split, so we randomly split the 1600 samples into a training set with 1280 samples and a validation set with 320 samples to apply k-fold cross validation in the next learning stage. We randomly select 29 samples from each class and put them into the validation set to ensure that the class distribution is balanced in both training and validation sets.
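The handling of clip length described in the steps above, padding short clips by repeating the last frame and sampling a consecutive window from longer ones, can be sketched as follows. The function name, the default arguments, and the use of Python's `random` module are assumptions for illustration; frames are represented by plain list elements.

```python
import random


def fixed_length_clip(frames, length=50, rng=random):
    """Return exactly `length` frames from a frame sequence.

    Longer sequences: sample a random consecutive window (random clipping).
    Shorter sequences: repeat the last frame until `length` frames exist.
    """
    if len(frames) >= length:
        start = rng.randrange(len(frames) - length + 1)
        return frames[start:start + length]
    return frames + [frames[-1]] * (length - len(frames))


short_clip = list(range(30))   # stand-in for a 30-frame video
long_clip = list(range(80))    # stand-in for an 80-frame video

padded = fixed_length_clip(short_clip)
window = fixed_length_clip(long_clip, rng=random.Random(0))
print(len(padded), padded[-3:])
print(len(window), window[0], window[-1])
```

Passing a seeded `random.Random` instance makes the sampled window reproducible, which helps when comparing an uncompressed network and its TT counterpart on identical inputs.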
Some data augmentation methods should also be applied to prevent overfitting and improve performance. In detail, we first apply affine transformation with the same parameters as those on VIVA challenge dataset for both RGB and optical flow volumes, then use random contrast adjustment () and random saturation adjustment () for RGB volumes only. We abandon the drop out technique due to the possible unconvergence.
III-B3 Design of Networks
As one optical flow image is produced from two contiguous RGB frames in the temporal dimension, we cannot directly stack RGB volumes and optical flow volumes into one sample as we did on the VIVA challenge dataset. Hence, inspired by the two-stream architecture [51], we design a similar two-stream network with two sub-3DCNNs that handle the RGB volumes and the optical flow volumes, respectively.
The basic uncompressed network is shown in Figure 9 and the corresponding TT network is illustrated in Figure 10, where all the activation functions are ReLU. Note that there are two cross-entropy losses, following the RGB stream and the optical flow stream respectively, in both the uncompressed network and the TT-compressed one. This means that training is separated into two streams. However, we design a final output that combines the two stream outputs to obtain the final classification result. Concretely, let
and to denote the RGB stream and the optical flow stream respectively, the output of two softmax layers could be and separately, where denotes the label vector and denotes the output observation vector. Rather than the Hadamard product chosen by Molchanov et al. [11], we select the mean vector which may collect higher probability values likewhere is the mean vector which denotes the final classification result.
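The difference between averaging the two softmax outputs and taking their Hadamard (element-wise) product can be seen in a small sketch. The function names and the example probability vectors are hypothetical; only the two fusion rules themselves come from the text.

```python
def mean_fusion(p_rgb, p_flow):
    """Element-wise mean of the two stream outputs (the rule selected here)."""
    return [(a + b) / 2 for a, b in zip(p_rgb, p_flow)]


def hadamard_fusion(p_rgb, p_flow):
    """Element-wise product of the two stream outputs, as used in [11]."""
    return [a * b for a, b in zip(p_rgb, p_flow)]


def argmax(values):
    """Index of the largest entry, i.e. the predicted class."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical softmax outputs of the RGB and optical flow streams.
p_rgb = [0.75, 0.25, 0.0]
p_flow = [0.25, 0.5, 0.25]

fused_mean = mean_fusion(p_rgb, p_flow)
fused_prod = hadamard_fusion(p_rgb, p_flow)
print(fused_mean, argmax(fused_mean))
print(fused_prod, argmax(fused_prod))
```

The mean of two probability vectors is again a probability vector, whereas the product generally is not and suppresses any class that either stream scores near zero.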
III-B4 Learning and Results
With the split of training and validation sets described above, we train both the uncompressed network in Figure 9 and the TT-compressed network in Figure 10 five times in total to implement k-fold cross validation. The initial learning rate is 0.003 for the uncompressed network and 0.001 for the TT network. The total number of training epochs is 100, and the learning rate decreases by a factor of 0.1 after every 30 epochs. We also use the SGD optimizer with the momentum coefficient set to 0.9. The batch size is 20 for both networks.
The accuracy results of uncompressed network and TT network from kfold cross validation are with 1012 MB storage and
with 14.6 MB storage, respectively. Although there presents a little degeneration (0.52%), it can still be deemed that the accuracy loss is wiped out in TT network when considering the standard deviation and possible random error.
IV Discussions
Overall, our experiments have shown that 3DCNNs for a specific task can be compressed losslessly with TT decomposition owing to their sufficient redundancy. In this section, we further explain some detailed experimental phenomena and discuss the significance of using TT to compress 3DCNNs.
IV-A Regularization of TT Decomposition
It is observed that the TT format brings a certain level of regularization to DNNs. This is why we set a higher keeping probability for dropout in the TT networks, e.g., the network architectures in Figure 7 and Figure 10. However, abandoning dropout completely is also inadvisable, because even dropout with a keeping probability of 0.9 can significantly reduce overfitting [15].
We vary the dropout keeping probability of 3DCNN_VIVA_3 in TT format to observe the regularization effect of TT decomposition in the FC layers. The keeping probability in the uncompressed network is set to 0.5, as illustrated in Figure 6, whereas the keeping probability in the TT network is varied from 0.5 to 1.0. The test result is shown in Figure 11. The network underfits when the keeping probability is below 0.7 and overfits when it exceeds 0.7. A keeping probability of 0.7 is therefore the best configuration among those tested, and we adopt this value, as can be seen in Figure 7.
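To make the role of the keeping probability concrete, the following is a minimal NumPy sketch of inverted dropout. It is an illustration under assumptions rather than the implementation used in our experiments: the function name `dropout`, the rescaling by `1 / keep_prob`, and the example input are hypothetical.

```python
import numpy as np


def dropout(x, keep_prob, rng):
    """Inverted dropout: keep each unit with probability keep_prob, then rescale."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must lie in (0, 1]")
    mask = rng.random(x.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return x * mask / keep_prob


rng = np.random.default_rng(0)
activations = np.ones(100_000)

for keep_prob in (0.5, 0.7, 0.9, 1.0):
    out = dropout(activations, keep_prob, rng)
    kept = np.count_nonzero(out) / out.size
    print(f"keep_prob={keep_prob}: kept fraction={kept:.3f}, mean={out.mean():.3f}")
```

A higher keeping probability drops fewer units and therefore applies weaker regularization, which is consistent with using a larger value when the TT format already regularizes the network.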
IV-B Redundancy of 3DCNNs
Our experiments suggest that 3DCNNs have a higher level of redundancy than conventional 2DCNNs, which permits us to develop a lossless compression method based on TT decomposition. However, it is still hard to give a quantitative evaluation that demarcates the boundary of redundancy. Here we qualitatively discuss the redundancy and lossless TT-3DCNNs in five respects.
IV-B1 Layer Coupling
It is necessary to point out that there is coupling between the convolutional layers and the FC layers that affects the performance of TT-CNNs. Novikov’s group [25, 37] shows that compressing the FC part alone yields a sufficient compression ratio with little accuracy loss, whereas compressing both the FC and convolutional parts can hardly avoid degeneration. Thus, it might seem that compressing the convolutional part is unnecessary. However, we find that the coupling between the convolutional layers and the FC layers makes it worthwhile to compress the convolutional kernels.
In detail, for the network 3DCNN_VIVA_1, we separately compress the FC part, the convolutional part, and both parts. The results in Table V indicate that compressing either part alone already causes considerable accuracy loss, while compressing both parts does not produce substantially more degeneration. In short, compressing the whole network, especially for 3DCNNs, is not much worse than compressing only one part (Conv or FC). Therefore, a discussion of the redundancy of 3DCNNs should focus on the entire network.
Compressing Part  Base  FC  Convolution  Both
Accuracy (%)
Degeneration (pp)  -  5.64  6.03  6.86
IV-B2 Large Convolutional Kernel Size
A stack of two 3×3 convolutional kernels has fewer parameters than a single 5×5 convolutional kernel but an equivalent receptive field, as realized in VGGNet [3]. Furthermore, the principle of using 3×3 kernels is widely adopted nowadays [5, 6, 53]. However, 3×3 or 3×3×3 convolutional kernels are unfriendly to TT decomposition according to Algorithm 1: the product of the kernel modes should not be too small, so that the resulting shape stays relatively balanced, which is helpful for defining the corresponding TT format.
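The parameter comparison at the start of this paragraph can be checked with a short calculation. This is a sketch under assumptions: the helper names are hypothetical, biases are ignored, stride 1 is assumed, and the channel count of 64 is chosen only for illustration.

```python
from math import prod


def conv_weights(kernel, c_in, c_out):
    """Number of weights in one convolutional layer (biases ignored)."""
    return prod(kernel) * c_in * c_out


def receptive_field(kernel_sizes):
    """Receptive field along one axis of a stack of stride-1 convolutions."""
    return 1 + sum(k - 1 for k in kernel_sizes)


channels = 64  # hypothetical width, identical for input and output

single_2d = conv_weights((5, 5), channels, channels)
stacked_2d = 2 * conv_weights((3, 3), channels, channels)
single_3d = conv_weights((5, 5, 5), channels, channels)
stacked_3d = 2 * conv_weights((3, 3, 3), channels, channels)

print("receptive field of two stacked size-3 kernels:", receptive_field([3, 3]))
print("2D: one 5x5 =", single_2d, "| two 3x3 =", stacked_2d)
print("3D: one 5x5x5 =", single_3d, "| two 3x3x3 =", stacked_3d)
```

The saving from stacking small kernels is larger in 3D (54 versus 125 weights per channel pair) than in 2D (18 versus 25), which is why giving up small kernels in favor of a TT-friendly shape is a real trade-off.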
From network 3DCNN_VIVA_1 to network 3DCNN_VIVA_3, the kernel sizes and channel numbers increase gradually without deepening the network, and the accuracy degeneration decreases accordingly. In contrast, we try redesigning 3DCNN_VIVA_2 with 3×3 kernels, i.e., replacing each of Conv1, Conv2, and Conv3 with a stack of two layers with smaller kernels. In doing so, the degeneration increases even though the performance of both the uncompressed and TT networks improves. The degradation may be mainly caused by the small kernels, whose mode product is too small to utilize Equation (6). This is another reason for us to abandon 3×3 or 3×3×3 convolutional kernels, especially for data shaped as an irregular cuboid. Nevertheless, how to compress 3×3×3 convolutional kernels in TT format without degeneration deserves further study.
IV-B3 Number of Channels
In CNNs, more channels represent more possible feature combinations, which can improve network performance. The TT-format convolution in Equation (4) and Equation (6) allows us to design wider CNNs under a fixed storage budget. The performance of 3DCNN_VIVA_3 compared with 3DCNN_VIVA_2 supports this: the former has a higher ratio of convolutional parameters, as shown in Table II.
Furthermore, more channels contribute to more balanced TT shapes, which may bring a higher compression ratio, particularly for FC layers. Comparing 3DCNN_VIVA_4 with 3DCNN_VIVA_5 in Table IV, the latter achieves a higher compression ratio with comparable performance, simply by enlarging the final output channels from 256 to 384. It is evident in Figure 7 that the TT shapes of the FC layers in 3DCNN_VIVA_5 are more balanced than the corresponding shapes in 3DCNN_VIVA_4.
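The effect of balanced shapes on the parameter count of a TT-format FC layer can be illustrated numerically. The sketch below uses the standard TT-matrix parameter count of [25], in which the k-th core has shape (r_{k-1}, m_k, n_k, r_k). The helper name `tt_matrix_params` and all mode sizes and ranks are hypothetical and are not the shapes of our networks.

```python
from math import prod


def tt_matrix_params(row_modes, col_modes, ranks):
    """Total number of entries in TT-matrix cores of shape (r_prev, m_k, n_k, r_next)."""
    if len(row_modes) != len(col_modes) or len(ranks) != len(row_modes) + 1:
        raise ValueError("need one row/column mode per core and d + 1 ranks")
    if ranks[0] != 1 or ranks[-1] != 1:
        raise ValueError("boundary TT ranks must be 1")
    return sum(
        ranks[k] * row_modes[k] * col_modes[k] * ranks[k + 1]
        for k in range(len(row_modes))
    )


ranks = (1, 4, 4, 4, 1)  # hypothetical TT ranks

# Two factorizations of the same 4096 x 256 weight matrix.
balanced = tt_matrix_params((8, 8, 8, 8), (4, 4, 4, 4), ranks)
unbalanced = tt_matrix_params((64, 8, 4, 2), (32, 4, 2, 1), ranks)
dense = prod((8, 8, 8, 8)) * prod((4, 4, 4, 4))

print("dense:", dense)
print("balanced TT:", balanced, f"({dense / balanced:.1f}x)")
print("unbalanced TT:", unbalanced, f"({dense / unbalanced:.1f}x)")
```

With identical ranks, the unbalanced factorization spends most of its parameters on one oversized core, so the balanced shape compresses the same matrix far more.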
IV-B4 Scale of Entire Networks
Regarding the network scale, we find that a larger network may be easier to compress. Comparing network 3DCNN_VIVA_4 and network 3DCNN_VIVA_5 in Table IV, the latter obtains better accuracy with even lower storage cost. We believe that larger-scale 3DCNNs contain more redundancy. Moreover, in general, a sparse 3DCNN can achieve better performance than a dense one of the same network scale [54]. This rule can be perceived in Figure 12, which illustrates the variations of parameter amount and accuracy degeneration from 3DCNN_VIVA_1 to 3DCNN_VIVA_5 according to Tables II, III, and IV. Therefore, we suggest designing large and sparse 3DCNNs rather than tiny and dense ones.
IV-B5 Data Distribution
In a macro sense, network redundancy is strongly correlated with the data distribution. A more challenging dataset may need a more redundant network for lossless compression. Hence, arbitrarily judging whether a network is redundant or not seems inadvisable. However, researchers can enlarge the size of the convolutional kernels and the number of channels step by step until their requirements are satisfied.
We previously used a picoflexx ToF depth camera [55] to build a simple static hand gesture dataset, which is shown in Figure 13. In detail, we segmented the foreground hand gesture and removed the background based on depth and infrared data to produce binary images of the hand profile. We then employed the normal CNN named convfc and the corresponding TT-CNN named TTconvTTfc in [37] for the static gesture recognition task. We find that there is no accuracy loss on our gesture dataset, in spite of the degeneration of around 2% on CIFAR-10. The detailed experiments on these two networks are illustrated in Figure 14.
IV-C Initial Learning Rate
Sometimes, DNNs in TT format need a lower initial learning rate than the corresponding uncompressed networks. For example, if we set the initial learning rate to 0.01 for 3DCNN_VIVA_4, the network in TT format will hardly converge. Similarly, if we set the initial learning rate to 0.003 for the two-stream TT network in Figure 10, the RGB stream will not converge either.
Therefore, as mentioned in Section III, we set the initial learning rates of the network in Figure 9 and the network in Figure 10 to 0.003 and 0.001, respectively. Figure 15 illustrates the learning curves on the validation set of the UCF11 dataset for a single training run. By locating the intersection point of the loss and accuracy curves, it is easy to see that the network in TT format can converge even faster with a lower initial learning rate.
IV-D Core Significance of TT Format
The core significance of using TT decomposition to compress neural networks is that one can easily and directly construct a large network that consumes only tiny storage, using the so-called in situ training approach [20], without any delicate design or slow pretraining.
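How a large weight matrix can live in tiny storage is illustrated by the following NumPy sketch of the TT-matrix form of [25]. Only the small cores are stored and trained, and the dense matrix is expanded here solely to show the size gap. The function name `tt_matrix_to_dense` and the mode sizes and ranks are hypothetical, and this is not the code used in our experiments.

```python
from math import prod

import numpy as np


def tt_matrix_to_dense(cores):
    """Expand TT-matrix cores of shape (r_prev, m_k, n_k, r_next) to a dense (M, N) matrix."""
    full = cores[0]
    for core in cores[1:]:
        # Contract the trailing rank axis of `full` with the leading rank axis of `core`.
        full = np.tensordot(full, core, axes=1)
    # Boundary ranks equal 1, so drop them: shape becomes (m_1, n_1, ..., m_d, n_d).
    full = full[0, ..., 0]
    d = len(cores)
    # Group all row modes first, then all column modes.
    full = full.transpose(list(range(0, 2 * d, 2)) + list(range(1, 2 * d, 2)))
    rows = prod(core.shape[1] for core in cores)
    cols = prod(core.shape[2] for core in cores)
    return full.reshape(rows, cols)


rng = np.random.default_rng(0)
row_modes, col_modes, ranks = (4, 4, 4), (2, 4, 2), (1, 3, 3, 1)  # hypothetical
cores = [
    rng.standard_normal((ranks[k], row_modes[k], col_modes[k], ranks[k + 1]))
    for k in range(len(row_modes))
]

dense = tt_matrix_to_dense(cores)
stored = sum(core.size for core in cores)
print("dense shape:", dense.shape, "| stored entries:", stored, "| dense entries:", dense.size)
```

In in situ training, the cores themselves are the trainable parameters from the start, so the dense matrix never needs to be stored or pretrained.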
In contrast, compact architectures require engineers to understand neural networks well enough to work out an ingenious design, such as dilated convolution [56] and deformable convolution [57]. Using a compact architecture alone can hardly achieve a very high compression ratio. Furthermore, quantization, pruning, or joint compression based on both always needs pretraining of the uncompressed network, e.g., the so-called deep compression needs several rounds of training across three stages in total [58].
All in all, TT decomposition affords us a simple and convenient approach to compress DNNs with a high compression ratio. Although previous studies [25, 35, 36, 37, 38, 39] show that the accuracy loss is hard to avoid, our study demonstrates that 3DCNNs with sufficient redundancy can realize lossless compression based on TT decomposition.
IV-E Comparison with the State-of-the-Art on VIVA
According to the website of the VIVA challenge dataset (http://cvrr.ucsd.edu/vivachallenge/index.php/hands/handgestures/) and the survey [59], our result produced by 3DCNN_VIVA_5 in the TT format outperforms the other existing methods. Details are shown in Table VI.
Method  Modality  Accuracy (%) 

Two 3DCNNs: LRN + HRN [11]  RGB + Depth  77.5 
HOG + HOG [43]  RGB + Depth  64.5 
HON4D [60]  Depth  58.7 
Dense Trajectories [61]  RGB + Depth  54 
TT3DCNN  RGB + Depth  81.83 
V Conclusions
This paper introduces a compression method for convolutional kernels in 3DCNNs based on TT decomposition. How to select suitable truncated TT ranks is analyzed and demonstrated in both theory and practice. Our experiments on the VIVA challenge and UCF11 datasets verify that 3DCNNs with sufficient redundancy can be compressed in the TT format without accuracy loss. Moreover, fully exploiting redundant design in 3DCNNs, e.g., larger convolutional kernel sizes, more channels, and a larger overall scale, can result in better performance, including a higher compression ratio and lower degeneration. Our result on the VIVA challenge dataset reaches 81.83%, which outperforms all the other existing methods. We believe that TT decomposition is a promising approach to compressing large-scale 3DCNNs.
Acknowledgment
The work was supported partially by National Science Foundation of China (No. 61876215, 61603209), National Basic Research Program of China (973 Program, Grant No. 2015CB057406), and Independent Research Plan of Tsinghua University (20151080467).
References

 [1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
 [2] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS), 2012, pp. 1097–1105.
 [3] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in 3rd International Conference on Learning Representations (ICLR), 2015.

 [4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
 [5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
 [6] G. Huang, Z. Liu, L. v. Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2261–2269.
 [7] S. Ji, W. Xu, M. Yang, and K. Yu, “3D convolutional neural networks for human action recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2013.
 [8] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3D convolutional networks,” in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4489–4497.
 [9] L. Zhang, G. Zhu, P. Shen, and J. Song, “Learning spatiotemporal features using 3DCNN and convolutional LSTM for gesture recognition,” in IEEE International Conference on Computer Vision Workshop (ICCVW), 2017, pp. 3120–3128.
 [10] G. Zhu, L. Zhang, P. Shen, and J. Song, “Multimodal gesture recognition using 3D convolution and convolutional LSTM,” IEEE Access, vol. 5, pp. 4517–4524, 2017.
 [11] P. Molchanov, S. Gupta, K. Kim, and J. Kautz, “Hand gesture recognition with 3D convolutional neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2015, pp. 1–7.
 [12] P. Molchanov, S. Gupta, K. Kim, and K. Pulli, “Multisensor system for driver’s handgesture recognition,” in 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), 2015, pp. 1–8.
 [13] P. Molchanov, X. Yang, S. Gupta, K. Kim, S. Tyree, and J. Kautz, “Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4207–4215.
 [14] N. C. Camgoz, S. Hadfield, O. Koller, and R. Bowden, “Using convolutional 3D neural networks for userindependent continuous gesture recognition,” in 23rd International Conference on Pattern Recognition (ICPR), 2016, pp. 49–54.
 [15] G. Varol, I. Laptev, and C. Schmid, “Longterm temporal convolutions for action recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, pp. 1510–1517, 2018.
 [16] K. Hara, H. Kataoka, and Y. Satoh, “Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6546–6555.

 [17] L. Ge, H. Liang, J. Yuan, and D. Thalmann, “3D convolutional neural networks for efficient and robust hand pose estimation from single depth images,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5679–5688.
 [18] A. Torfi, N. Nasrabadi, and J. Dawson, “Text-independent speaker verification using 3D convolutional neural networks,” ArXiv Preprint arXiv:1705.09422v7, 2018.
 [19] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, “Model compression and acceleration for deep neural networks: The principles, progress, and challenges,” IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 126–136, 2018.
 [20] F. Alibart, E. Zamanidoost, and D. B. Strukov, “Pattern classification by memristive crossbar circuits using ex situ and in situ training,” Nature Communications, vol. 4, no. 2072, pp. 131–140, 2013.
 [21] X. Zhang, J. Zou, X. Ming, K. He, and J. Sun, “Efficient and accurate approximations of nonlinear convolutional networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1984–1992.
 [22] X. Zhang, J. Zou, K. He, and J. Sun, “Accelerating very deep convolutional networks for classification and detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 1943–1955, 2016.
 [23] K. Shim, M. Lee, I. Choi, Y. Boo, and W. Sung, “SVDSoftmax: Fast softmax approximation on large vocabulary neural networks,” in Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS), 2017, pp. 5463–5473.
 [24] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. d. Freitas, “Predicting parameters in deep learning,” in Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS), 2013, pp. 2148–2156.
 [25] A. Novikov, D. Podoprikhin, A. Osokin, and D. P. Vetrov, “Tensorizing neural networks,” in Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS), 2015, pp. 442–450.
 [26] J. D. Caroll and J. J. Chang, “Analysis of individual differences in multidimensional scaling via nway generalization of EckartYoung decomposition,” Psychometrika, vol. 35, no. 3, pp. 283–319, 1970.
 [27] L. R. Tucker, “Some mathematical notes on threemode factor analysis,” Psychometrika, vol. 31, no. 3, pp. 279–311, 1966.
 [28] A. Cichocki, D. Mandic, L. De Lathauwer, G. Zhou, Q. Zhao, and C. Caiafa, “Tensor decompositions for signal processing applications: From twoway to multiway component analysis,” IEEE Signal Processing Magazine, vol. 32, no. 2, pp. 145–163, 2015.
 [29] A. Cichocki, “Tensor networks for dimensionality reduction, big data and deep learning,” in Advances in Data Analysis with Computational Intelligence Methods, ser. Studies in Computational Intelligence. Springer International Publishing AG, 2018, vol. 738, pp. 3–49.
 [30] W. Hackbusch and S. Kühn, “A new scheme for the tensor representation,” Journal of Fourier Analysis and Applications, vol. 15, no. 5, pp. 706–722, 2009.

 [31] L. Grasedyck, “Hierarchical singular value decomposition of tensors,” SIAM Journal on Matrix Analysis and Applications, vol. 31, no. 4, pp. 2029–2054, 2010.
 [32] I. V. Oseledets, “Tensor-train decomposition,” SIAM Journal on Scientific Computing, vol. 33, no. 5, pp. 2295–2317, 2011.
 [33] B. N. Khoromskij, “O(d log N)-quantics approximation of N-d tensors in high-dimensional numerical modeling,” Constructive Approximation, vol. 34, no. 2, pp. 257–280, 2011.
 [34] Q. Zhao, M. Sugiyama, L. Yuan, and A. Cichocki, “Learning efficient tensor representations with ring structure networks,” in 6th International Conference on Learning Representations (ICLR), 2018.
 [35] H. Huang, L. Ni, K. Wang, Y. Wang, and H. Yu, “A highly parallel and energy efficient threedimensional multilayer CMOSRRAM accelerator for tensorized neural network,” IEEE Transactions on Nanotechnology, vol. 17, no. 4, pp. 645–656, 2018.
 [36] J. Su, J. Li, B. Bhattacharjee, and F. Huang, “Tensorized spectrum preserving compression for neural networks,” ArXiv Preprint arXiv:1805.10352v2, 2018.
 [37] T. Garipov, D. Podoprikhin, A. Novikov, and D. Vetrov, “Ultimate tensorization: compressing convolutional and FC layers alike,” ArXiv Preprint arXiv:1611.03214v1, 2016.
 [38] A. Tjandra, S. Sakti, and S. Nakamura, “Compressing recurrent neural network with tensor train,” in International Joint Conference on Neural Networks (IJCNN), 2017, pp. 4451–4458.
 [39] ——, “Tensor decomposition for compressing recurrent neural network,” arXiv preprint arXiv:1802.10410v2, 2018.
 [40] Y. Yang, D. Krompass, and V. Tresp, “Tensortrain recurrent neural networks for video classification,” in Proceedings of the 34th International Conference on Machine Learning (ICML), vol. 70, 2017, pp. 3891–3900.
 [41] N. Lee and A. Cichocki, “Big data matrix singular value decomposition based on lowrank tensor train decomposition,” in Advances in Neural Networks – International Symposium on Neural Networks (ISNN), 2014, pp. 121–130.
 [42] J. A. Bengua, H. N. Phien, H. D. Tuan, and M. N. Do, “Efficient tensor completion for color image and video recovery: Lowrank tensor train,” IEEE Transactions on Image Processing, vol. 26, no. 5, pp. 2466–2479, 2017.
 [43] E. OhnBar and M. M. Trivedi, “Hand gesture recognition in real time for automotive interfaces: A multimodal visionbased approach and evaluations,” IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 6, pp. 2368–2377, 2014.
 [44] J. Liu, J. Luo, and M. Shah, “Recognizing realistic actions from videos “in the wild”,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 1996–2003.
 [45] N. Lee and A. Cichocki, “Regularized computation of approximate pseudoinverse of large matrices using lowrank tensor train decompositions,” SIAM Journal on Matrix Analysis and Applications, vol. 37, no. 2, pp. 598–623, 2016.
 [46] L. Grasedyck and W. Hackbusch, “An introduction to hierarchical (H-) rank and TT-rank of tensors with examples,” Computational Methods in Applied Mathematics, vol. 11, no. 3, pp. 291–304, 2011.
 [47] V. Khrulkov, A. Novikov, and I. Oseledets, “Expressive power of recurrent neural networks,” in 6th International Conference on Learning Representations (ICLR), 2018.

 [48] D. Kressner and C. Tobler, “Preconditioned low-rank methods for high-dimensional elliptic PDE eigenvalue problems,” Computational Methods in Applied Mathematics, vol. 11, no. 3, pp. 363–381, 2011.
 [49] A. Krizhevsky, “Learning multiple layers of features from tiny images,” 2009.
 [50] A. Novikov, P. Izmailov, V. Khrulkov, M. Figurnov, and I. Oseledets, “Tensor train decomposition on tensorflow (t3f),” arXiv preprint arXiv:1801.01928, 2018.
 [51] K. Simonyan and A. Zisserman, “Twostream convolutional networks for action recognition in videos,” in Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS), 2014, pp. 568–576.
 [52] G. Farnebäck, “Twoframe motion estimation based on polynomial expansion,” in Proceedings of the 13th Scandinavian conference on Image analysis (SCIA), 2003, pp. 363–370.
 [53] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2818–2826.
 [54] M. H. Zhu and S. Gupta, “To prune, or not to prune: Exploring the efficacy of pruning for model compression,” in 6th International Conference on Learning Representations (ICLR), 2018.
 [55] D. Presnov, M. Lambers, and A. Kolb, “Robust range camera pose estimation for mobile online scene reconstruction,” IEEE Sensors Journal, vol. 18, no. 7, pp. 2903–2915, 2018.
 [56] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell, “Understanding convolution for semantic segmentation,” in IEEE Winter Conference on Applications of Computer Vision (WACV), 2018, pp. 1451–1460.
 [57] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, “Deformable convolutional networks,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 764–773.
 [58] S. Han, H. Mao, and W. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” in 4th International Conference on Learning Representations (ICLR), 2016.
 [59] M. AsadiAghbolaghi, A. Clapés, M. Bellantonio, H. J. Escalante, V. PonceLópez, X. Baró, I. Guyon, S. Kasaei, and S. Escalera, “A survey on deep learning based approaches for action and gesture recognition in image sequences,” in 12th IEEE International Conference on Automatic Face Gesture Recognition (FG), 2017.
 [60] O. Oreifej and Z. Liu, “HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 716–723.
 [61] H. Wang, A. Kläser, C. Schmid, and C.L. Liu, “Dense trajectories and motion boundary descriptors for action recognition,” International Journal of Computer Vision, vol. 103, no. 1, pp. 60–79, 2013.