1 Introduction
Modern deep learning has celebrated tremendous success in the area of automatic feature extraction from data with a grid-like structure, such as images. This success can be largely attributed to the convolutional neural network architecture [10], specifically 2D convolutional neural networks (CNNs). These networks are successful due to the principles of sparse connectivity, parameter sharing and invariance to translation in the input space [2]. Loosely said, 2D CNNs efficiently find class-discriminating local features independent of where they appear in the input space.

Since video is essentially a sequence of images/frames, 2D CNNs can be and are used to extract features from the individual frames of the sequence [6]. However, the drawback of this method is that the temporal information between frames is discarded. Temporal information is important when we want to perform tasks on video such as gesture, action and emotion recognition or classification. One possible way to simulate the use of time is to stack a recurrent layer after the convolutional layers [15]. But correlated spatiotemporal features will not be learnt, because spatial and temporal features are explicitly learned in separate regions of the network. To solve this problem, [1] proposed to expand the 2D convolution into a 3D convolution, essentially treating time as a third dimension. Ref. [5] used these 3D convolutions to build a 3D CNN for action recognition without using any recurrent layers. It is important to notice that the principles that govern 2D CNNs also govern 3D CNNs. Translation invariance in time is useful because the precise beginning and ending of an action are typically ill-defined [14].

Even though 3D CNNs have been shown to work for different kinds of tasks on video data, they remain difficult to train. There are roughly three main issues with 3D CNNs. First, they are parameter-expensive, requiring an abundance of GPU memory. Second, they are data-hungry, requiring much more training data compared to their 2D counterparts. And third, the increase in free parameters leads to a larger search space. As a result, these models can be unstable and take a longer time to train.
Existing literature tries to solve these problems by essentially avoiding the use of 3D convolutions completely. The most common method is the factorization of the 3D convolution into a 2D convolution followed by a 1D convolution, either at the layer level [11, 13] or at the network level [12, 8, 7].
1.1 Contribution
We propose a simple and novel method to structure the way 3D kernels are learned during training. This method is based on the idea that nearby frames change very little in appearance. Each 3D convolutional kernel is represented as one 2D kernel with a set of transformation parameters. The 3D kernel is then constructed by sequentially applying a spatial transformation [4] directly inside the kernel, allowing spatial manipulation of the 2D kernel values. We achieve the following benefits:

- a reduction in the size of the search space by imposing a sequential prior on the kernel values;
- a reduction in the number of parameters in the 3D convolutional kernel;
- efficient learning from fewer videos.
2 Related Work
Previously, an entire 3D convolutional neural network was factorized into separate spatial and temporal layers, called factorized spatiotemporal convolutional networks [12]. This was achieved by decomposing a stack of 3D convolutional layers into a stack of spatial 2D convolutional layers followed by a temporal 1D convolutional layer. Ref. [13] followed in this line of research by factorizing the individual 3D convolutional filters into separate spatial and temporal components, called R(2+1)D blocks. Both methods managed to separate the temporal component from the spatial one: one on the network level [12] and one on the layer level [13]. To our knowledge, our approach provides the first instance of a temporal factorization at the single-kernel level. In effect, we applied the concept of the spatial transformer network [4] to the 3D convolutional kernel to obtain a factorization along the temporal dimension.
3 Methods
The proposed method uses fewer parameters compared to regular 3D convolutions, and it imposes a strong sequential dependency on the relationship between temporal kernel slices. In theory our method should allow efficient feature extraction from video data, using fewer parameters and less data. The method is explained in Section 3.1. We demonstrate the performance of our method on a variant of the classic MNIST dataset [9], which we call VideoMNIST. The details of this dataset are explained in Section 3.2. As models we implement 3D and 3DTT variants of LeNet-5: LeNet5-3D and LeNet5-3DTTN respectively (see Section 3.3). Training and inference details are explained in Section 3.4.
3.1 Temporal factorization of the 3D convolutional kernel
Consider a 3D convolutional layer consisting of 3D kernels. We focus on the inner workings of a single kernel $\mathcal{K} \in \mathbb{R}^{d_t \times d_w \times d_h}$, where $d_t$, $d_w$ and $d_h$ refer to the temporal resolution, width and height of the kernel respectively. Without loss of generality, we will assume that the input has a single channel.

If we slice $\mathcal{K}$ along the temporal dimension, we end up with $d_t$ 2D kernels $K_1, \dots, K_{d_t}$. Let us refer to the temporal slice at time $t$ as $K_t$. Instead of learning the entire $\mathcal{K}$ directly, we only learn $K_1$ and the transformation parameters. We factorize $\mathcal{K}$ such that $K_{t+1}$ with $t \in \{1, \dots, d_t - 1\}$ depends indirectly on $K_t$ via the transformation function $\mathcal{T}$, where $\theta_t$ are its learnable parameters. For every pair of consecutive slices we have

$$K_{t+1} = \mathcal{T}_{\theta_t}(K_t), \qquad t = 1, \dots, d_t - 1. \tag{1}$$
$\theta_t$ can be further restricted to only contain affine transformation parameters: scaling $s$, rotation $\alpha$, translation in the horizontal direction $t_x$ and translation in the vertical direction $t_y$. This yields:

$$A_{\theta_t} = \begin{bmatrix} s\cos\alpha & -s\sin\alpha & t_x \\ s\sin\alpha & s\cos\alpha & t_y \end{bmatrix} \tag{2}$$
In that case, $\theta_t$ has only four free parameters. We can additionally add the restriction that there is only one shared transformation per kernel, that is, $\theta_t = \theta$ for all $t$. This results in just four parameters per kernel. Essentially, given $K_1$, $\mathcal{T}_{\theta}$ modifies $K_t$ to become $K_{t+1}$, sequentially building the 3D kernel from $K_1$. This way we impose a strong sequential relationship between the slices along the temporal dimension of our kernel.
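The restricted four-parameter transformation (scaling, rotation and the two translations) can be illustrated with a small helper. This is a minimal numpy sketch of the standard 2×3 affine matrix; the function name is ours, not from the paper:

```python
import numpy as np

def affine_matrix(s, alpha, tx, ty):
    """2x3 affine matrix built from the four transformation
    parameters: scale s, rotation alpha, translations tx, ty."""
    return np.array([
        [s * np.cos(alpha), -s * np.sin(alpha), tx],
        [s * np.sin(alpha),  s * np.cos(alpha), ty],
    ])

# With s=1 and alpha=tx=ty=0 the transformation is the identity,
# so every kernel slice would equal the first slice.
A = affine_matrix(1.0, 0.0, 0.0, 0.0)
```
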
The nonlinear transformation $\mathcal{T}_{\theta}$ is applied in two stages. First, $\theta$ is transformed into a sampling grid $G$ that matches the shape of the input feature map, plus an explicit dimension for each spatial dimension. Here $U$ is the input feature map (the current slice) and $V$ is the output feature map (the next slice). We should think of this transformation as an explicit spatial mapping of the output feature map into the input feature space. Each coordinate from the input space is split into separate $x^s_i$ and $y^s_i$ components, calculated as

$$\begin{pmatrix} x^s_i \\ y^s_i \end{pmatrix} = A_{\theta} \begin{pmatrix} x^t_i \\ y^t_i \\ 1 \end{pmatrix} \tag{3}$$
Now that we have the sampling grid $G$ we can obtain a spatially transformed output feature map $V$ from our input feature map $U$. To interpolate the values of our new temporal kernel slice we use bilinear interpolation. For one particular pixel $i$ in the output map we compute

$$V_i = \sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm} \max(0, 1 - |x^s_i - m|) \max(0, 1 - |y^s_i - n|) \tag{4}$$
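The two-stage sampling described above can be sketched in plain numpy. This is an illustration of the idea, not the authors' implementation: it assumes pixel coordinates centered on the slice, so that rotation and scaling act about the kernel center, and uses the bilinear kernel of Eq. (4):

```python
import numpy as np

def transform_slice(K, s, alpha, tx, ty):
    """One application of T_theta: map each target coordinate into the
    source slice (Eq. 3) and bilinearly interpolate its value (Eq. 4)."""
    h, w = K.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    out = np.zeros_like(K, dtype=float)
    for i in range(h):
        for j in range(w):
            # Affine mapping about the kernel center (Eq. 3).
            xt, yt = j - cx, i - cy
            xs = s * np.cos(alpha) * xt - s * np.sin(alpha) * yt + tx + cx
            ys = s * np.sin(alpha) * xt + s * np.cos(alpha) * yt + ty + cy
            # Bilinear interpolation kernel (Eq. 4).
            for n in range(h):
                for m in range(w):
                    out[i, j] += K[n, m] * max(0.0, 1 - abs(xs - m)) \
                                         * max(0.0, 1 - abs(ys - n))
    return out

def build_3dtt_kernel(K1, theta, d_t):
    """Sequentially build all d_t temporal slices from K1 (Eq. 1),
    with a single shared theta per kernel."""
    slices = [np.asarray(K1, dtype=float)]
    for _ in range(d_t - 1):
        slices.append(transform_slice(slices[-1], *theta))
    return np.stack(slices)
```

With the identity parameters ($s=1$, $\alpha=t_x=t_y=0$) every slice equals $K_1$; any other setting makes each slice a transformed copy of its predecessor.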
Given that our method transforms temporal kernel slices, we refer to 3D kernels composed with our method as 3DTT kernels. Convolutional networks that use 3DTT kernels instead of regular 3D kernels are referred to as 3DTT convolutional networks or 3DTTNs.
3.2 VideoMNIST
In order to test our method we constructed a dataset, referred to as VideoMNIST, in which each class has a different appearance and dynamic behavior. VideoMNIST is a novel variant of the popular MNIST dataset. It contains 70000 sequences, each sequence containing 30 frames showing an affine transformation applied to a single original digit moving within the frame. The class-specific affine transformations are restricted to scale, rotation and x, y translations; see Table 1. We maintain the same train-validation-test split as in the original MNIST dataset. To make the problem more difficult and reliant on both spatial and motion cues, the classes contain random variations of their specific transformation. For the translating classes, the initial direction (left or right) and the initial velocity at which the digit travels per frame are varied. For the circular-path class, the direction of rotation and the size of the radius of the circular path are varied. In addition, we also allow the digits to go partially out of frame or almost vanish. We also made sure that there are overlapping movements between classes, such as rotation or translation in the same direction. Finally, some classes can appear visually similar because of the transformation. In Figure 2 one example of each class is illustrated.
Table 1: Class-specific transformations in VideoMNIST.

digit | transformation description | parameter(s)
------|----------------------------------------------------|----------------------
0 | moves horizontally | $t_x$
1 | moves vertically | $t_y$
2 | scales down and then up | $s$
3 | rotates clockwise | $\alpha$
4 | rotates counterclockwise | $\alpha$
5 | moves along a circular path | $t_x$, $t_y$
6 | scales up while rotating clockwise | $s$, $\alpha$
7 | moves horizontally while rotating counterclockwise | $t_x$, $\alpha$
8 | rotates clockwise and then counterclockwise | $\alpha$
9 | random rotation and horizontal+vertical movements | $\alpha$, $t_x$, $t_y$
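As an illustration of how such sequences can be generated, here is a hypothetical sketch of the class-0 transformation (horizontal movement). The frame size, velocity and placement are assumed values for illustration, not taken from the paper:

```python
import numpy as np

def make_sequence(digit_img, n_frames=30, velocity=1, direction=1):
    """Sketch of a class-0 style VideoMNIST sequence: the digit moves
    horizontally by `velocity` pixels per frame. The digit is placed on
    a larger blank frame so it can move (and possibly leave the frame
    partially, as in the dataset description)."""
    dh, dw = digit_img.shape
    H = W = 64                      # assumed frame size (not given in the text)
    frames = np.zeros((n_frames, H, W), dtype=digit_img.dtype)
    top = (H - dh) // 2
    left0 = (W - dw) // 2
    for t in range(n_frames):
        left = left0 + direction * velocity * t
        lo, hi = max(left, 0), min(left + dw, W)
        if lo < hi:                 # digit may go partially out of frame
            frames[t, top:top + dh, lo:hi] = digit_img[:, lo - left:hi - left]
    return frames
```

The other classes would follow the same pattern with per-frame scaling and rotation instead of (or in addition to) translation.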
3.3 Model architectures
We use the LeNet-5 architecture [9] as it is a good starting point for training a model based on a variant of the MNIST dataset. The original 2D convolutions are replaced with regular 3D convolutions and 3DTT convolutions for the LeNet5-3D and the LeNet5-3DTTN respectively. The number of filters in each convolutional layer can vary, since during experimentation we noticed that we can achieve better performance by either increasing or reducing the number of filters in the convolutional layers for both LeNet5-3D and LeNet5-3DTTN. LeNet5-3D serves as the baseline model.
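The parameter savings implied by Section 3.1 can be made concrete with simple bookkeeping. This is an illustrative count (bias terms omitted, hypothetical layer sizes), not the exact architecture of the paper:

```python
def conv3d_params(d_t, d_w, d_h, c_in, n_filters):
    """Free parameters of a regular 3D convolutional layer (no bias)."""
    return n_filters * c_in * d_t * d_w * d_h

def conv3dtt_params(d_t, d_w, d_h, c_in, n_filters, shared_theta=True):
    """3DTT layer: one 2D slice plus 4 affine parameters per kernel
    (or 4 per temporal step when theta is not shared)."""
    theta = 4 if shared_theta else 4 * (d_t - 1)
    return n_filters * c_in * (d_w * d_h + theta)

# Example: 3x3x3 kernels, 1 input channel, 6 filters.
# Regular 3D: 6 * 27 = 162 parameters; 3DTT: 6 * (9 + 4) = 78 parameters.
```

The gap widens with the temporal resolution $d_t$, since the regular 3D kernel grows linearly in $d_t$ while the shared-theta 3DTT kernel does not.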
3.4 Training and inference
3.4.1 Training
All models are optimized using SGD with momentum. Depending on the model, the starting learning rate can vary. The models are trained for a fixed number of epochs, where periodically the learning rate decreases exponentially if the validation accuracy has not improved; we noticed that this provides a good time window for the models to converge. Generally a fixed batch size is used, unless we are training on only a handful of videos, in which case a smaller batch size is used. The LeNet5-3D model weights, as well as those of all fully connected layers, are initialized following a Kaiming-Uniform scheme [3] (the default setting in PyTorch).
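The learning-rate schedule described above can be sketched as follows; the check interval and decay factor are illustrative assumptions, since the text does not fix them:

```python
def decay_lr(lr, best_acc, val_acc, epoch, check_every=5, factor=0.9):
    """Every `check_every` epochs, decay the learning rate
    exponentially if validation accuracy has not improved.
    `check_every` and `factor` are illustrative values only."""
    if epoch % check_every == 0 and val_acc <= best_acc:
        lr *= factor
    return lr
```
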
LeNet5-3DTTN initializes $\theta$ with weights sampled from a Gaussian distribution (experimentally this gave the best results; however, there was very little difference between the different types of initialization). In our main experiments we use the affine parameterization of $\theta$ with a fixed initialization of $s$, $\alpha$, $t_x$ and $t_y$.
3.4.2 Replication of video selection
Each model is run 30 times (runs) with the same initialization parameters but with different randomly initialized weights for the convolution and fully connected layers. The training data is randomly selected across different runs by using a seed. The seed assures that the same videos are chosen again when we execute the same run with a different model or when we use different initialization parameters. This way we can compare only the difference between the model architecture and parameters, without confounding our results with video variance. Given that we experiment with very few videos, we make sure that the classes are represented equally in the randomly selected training data.
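The seeded, class-balanced selection can be sketched as follows. This is a hypothetical helper using numpy's seeded generator; the names and exact sampling scheme are ours:

```python
import numpy as np

def select_videos(labels, n_videos, seed, n_classes=10):
    """Seeded, class-balanced selection of training videos: the same
    seed always yields the same subset, so different models can be
    compared on identical training data."""
    rng = np.random.default_rng(seed)
    per_class = n_videos // n_classes
    chosen = []
    for c in range(n_classes):
        idx = np.flatnonzero(labels == c)
        chosen.extend(rng.choice(idx, size=per_class, replace=False))
    return np.sort(np.array(chosen))
```

Because the generator is re-seeded identically for every run, two models trained on, say, 20 videos see exactly the same 20 videos.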
3.4.3 Inference
Model selection is based on the accuracy on the validation split. The 30 models in the run with the highest average accuracy are run against the test split. Each run is essentially the same model using the same hyperparameters, but with different randomized weight initializations for the convolutional and fully connected layers. In the end, the test results of the 30 models are averaged and the standard error of the results is calculated. The final results can be seen in Figure 3.
3.5 Setup
To test if our method can outperform conventional 3D convolutions on very few datapoints, we train each separate model on a different number of training videos. This way we can test how efficient our method is. The total number of videos is varied from low to high: 10, 20, 30, 40, 50, 100, 500, 1000, 2000, 5000. Model selection happens for each number of videos separately: the models trained on 10 videos are different from the models trained on 20 videos. After model selection based on the validation split, the models from the best run perform inference on the test split.
Table 3: Average test accuracy, standard error (SE) and number of parameters per number of training videos.

videos | LeNet5-3D acc. | SE | #params | LeNet5-3DTTN acc. | SE | #params
-------|----------------|--------|---------|-------------------|--------|--------
10 | 0.374 | 0.0122 | 375668 | 0.477 | 0.0139 | 444966
20 | 0.498 | 0.0131 | 496430 | 0.588 | 0.0108 | 348612
30 | 0.625 | 0.0107 | 496430 | 0.668 | 0.0091 | 444966
40 | 0.671 | 0.0095 | 496430 | 0.721 | 0.0079 | 348612
50 | 0.750 | 0.0083 | 496430 | 0.773 | 0.0067 | 444966
100 | 0.837 | 0.0059 | 496430 | 0.829 | 0.0073 | 348612
500 | 0.960 | 0.0025 | 496430 | 0.925 | 0.0058 | 222940
1000 | 0.976 | 0.0014 | 496430 | 0.968 | 0.0020 | 254058
2000 | 0.988 | 0.0007 | 496430 | 0.972 | 0.0022 | 222940
5000 | 0.994 | 0.0003 | 496430 | 0.988 | 0.0010 | 222940
4 Results
In Table 3 and Figure 3 we can see that our method significantly outperforms the conventional 3D convolution in the low data regime. However, when we have ample training data, the conventional 3D convolution outperforms our method, as is to be expected. It is worth mentioning that, in general, our method uses fewer parameters and still achieves reasonable results in all settings.
5 Conclusion
We propose a novel factorization method for 3D convolutional kernels. Our method factorizes the 3D kernel along the temporal dimension and provides a way to learn the 3D kernel through transformations of a 2D kernel, thereby greatly reducing the number of parameters needed. We demonstrate that our method significantly outperforms the conventional 3D convolution in the low data regime (10 to 50 training videos), yielding 0.58 vs. 0.65 average accuracy for the LeNet5-3D and the LeNet5-3DTTN respectively. Additionally, our model achieves competitive results in the high data regime (100 to 5000 training videos), with 0.95 vs. 0.94 on average for the LeNet5-3D and the LeNet5-3DTTN respectively, using up to 55% fewer parameters. Hence, 3DTTNs provide a useful building block when estimating models for video processing in the low data regime. In future work, we will explore in which real-world problem settings 3DTTNs outperform their non-factorized counterparts.
References
 [1] (2011) Sequential deep learning for human action recognition. In International workshop on human behavior understanding, pp. 29–39. Cited by: §1.
 [2] (2016) Deep learning. Vol. 1, MIT Press. Cited by: §1.

 [3] (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034. Cited by: §3.4.1.
 [4] (2015) Spatial transformer networks. In Advances in neural information processing systems, pp. 2017–2025. Cited by: §1.1, §2.
 [5] (2012) 3D convolutional neural networks for human action recognition. IEEE transactions on pattern analysis and machine intelligence 35 (1), pp. 221–231. Cited by: §1.

 [6] (2014) Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1725–1732. Cited by: §1.
 [7] (2016) Segmental spatiotemporal cnns for fine-grained action segmentation. In European Conference on Computer Vision, pp. 36–52. Cited by: §1.
 [8] (2016) Temporal convolutional networks: a unified approach to action segmentation. In European Conference on Computer Vision, pp. 47–54. Cited by: §1.
 [9] (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §3.3, §3.
 [10] (1989) Generalization and network design strategies. In Connectionism in perspective, Cited by: §1.
 [11] (2017) Learning spatiotemporal representation with pseudo-3D residual networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5533–5541. Cited by: §1.
 [12] (2015) Human action recognition using factorized spatiotemporal convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 4597–4605. Cited by: §1, §2.
 [13] (2018) A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 6450–6459. Cited by: §1, §2.
 [14] (2017) Long-term temporal convolutions for action recognition. IEEE transactions on pattern analysis and machine intelligence 40 (6), pp. 1510–1517. Cited by: §1.
 [15] (2015) Beyond short snippets: deep networks for video classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4694–4702. Cited by: §1.