Modern deep learning has achieved tremendous success in automatic feature extraction from data with a grid-like structure, such as images. This success can largely be attributed to the convolutional neural network architecture, specifically 2D convolutional neural networks (CNNs). These networks are successful due to the principles of sparse connectivity, parameter sharing and invariance to translation in the input space. Loosely speaking, 2D CNNs efficiently find class-discriminating local features independent of where they appear in the input space. Since video is essentially a sequence of images (frames), 2D CNNs can be, and are, used to extract features from the individual frames of the sequence. The drawback of this method is that the temporal information between frames is discarded. Temporal information is important for video tasks such as gesture, action and emotion recognition or classification. One possible way to incorporate time is to stack a recurrent layer after the convolutional layers. However, correlated spatiotemporal features will not be learned, because spatial and temporal features are explicitly learned in separate regions of the network.

To solve this problem, the 2D convolution was expanded into a 3D convolution, essentially treating time as a third dimension. These 3D convolutions were subsequently used to build a 3D CNN for action recognition without any recurrent layers. It is important to notice that the principles that govern 2D CNNs also govern 3D CNNs. Translation invariance in time is useful because the precise beginning and ending of an action are typically ill-defined.

Even though 3D CNNs have been shown to work for different kinds of tasks on video data, they remain difficult to train. There are roughly three main issues with 3D CNNs. First, they are parameter-expensive, requiring an abundance of GPU memory. Second, they are data-hungry, requiring much more training data than their 2D counterparts. Third, the increase in free parameters leads to a larger search space. As a result, these models can be unstable and take longer to train. Existing literature tries to solve these problems by essentially avoiding the use of 3D convolutions completely. The most common method is the factorization of the 3D convolution into a 2D convolution followed by a 1D convolution at the layer level [11, 13] or at the network level [12, 8, 7].
We propose a simple and novel method to structure the way 3D kernels are learned during training. This method is based on the idea that nearby frames change very little in appearance. Each 3D convolutional kernel is represented as one 2D kernel with a set of transformation parameters. The 3D kernel is then constructed by sequentially applying a spatial transformation [4] directly inside the kernel, allowing spatial manipulation of the 2D kernel values. We achieve the following benefits:
- a reduction in the size of the search space by imposing a sequential prior on the kernel values;
- a reduction in the number of parameters in the 3D convolutional kernel;
- efficient learning from fewer videos.
2 Related Work
Previously, an entire 3D convolutional neural network was factorized into separate spatial and temporal layers, called factorized spatio-temporal convolutional networks [12]. This was achieved by decomposing a stack of 3D convolutional layers into a stack of spatial 2D convolutional layers followed by a temporal 1D convolutional layer. Ref. [13] continued this line of research by factorizing the individual 3D convolutional filters into separate spatial and temporal components, called R(2+1)D blocks. Both methods separate the temporal component from the spatial one: one at the network level [12] and one at the layer level [13]. To our knowledge, our approach provides the first instance of a temporal factorization at the level of a single kernel. In effect, we apply the concept of the spatial transformer network [4] to the 3D convolutional kernel to obtain a factorization along the temporal dimension.
The proposed method uses fewer parameters than regular 3D convolutions and imposes a strong sequential dependency between temporal kernel slices. In theory, our method should allow efficient feature extraction from video data, using fewer parameters and less data. The method is explained in Section 3.1. We demonstrate the performance of our method on a variant of the classic MNIST dataset [9], which we call Video-MNIST. The details of this dataset are explained in Section 3.2. As models we implement 3D and 3DTTN variants of LeNet-5: LeNet-5-3D and LeNet-5-3DTTN respectively (see Section 3.3). Training and inference details are explained in Section 3.4.
3.1 Temporal factorization of the 3D convolutional kernel
Consider a 3D convolutional layer consisting of 3D kernels. We focus on the inner workings of a single kernel $K \in \mathbb{R}^{T \times W \times H}$, where $T$, $W$ and $H$ refer to the temporal resolution, width and height of the kernel respectively. Without loss of generality, we assume that the input has a single channel.

If we slice $K$ along the temporal dimension, we end up with $T$ 2D kernels $K_t \in \mathbb{R}^{W \times H}$. Let us refer to the temporal slice at $t = 0$ as $K_0$. Instead of learning the entire $K$ directly, we only learn $K_0$ and $\theta$. We factorize $K$ such that $K_t$ with $t > 0$ depends indirectly on $K_0$ via $K_{t-1}$, where $\theta_t$ with $1 \le t < T$ are the learnable parameters of the transformation function $\mathcal{T}$. For every pair of consecutive slices we have

$$K_t = \mathcal{T}_{\theta_t}(K_{t-1}).$$

$\theta_t$ can be further restricted to only contain affine transformation parameters: scaling $s_t$, rotation $r_t$, translation in the horizontal direction $\tau^x_t$ and translation in the vertical direction $\tau^y_t$. This yields:

$$\theta_t = \begin{bmatrix} s_t \cos r_t & -s_t \sin r_t & \tau^x_t \\ s_t \sin r_t & s_t \cos r_t & \tau^y_t \end{bmatrix}.$$

In that case, $K$ has only $WH + 4(T-1)$ free parameters. We can additionally add the restriction that there is only one shared transformation per kernel, that is, $\theta_t = \theta$ for $1 \le t < T$. This results in just $WH + 4$ parameters. Essentially, given $\theta_t$, $\mathcal{T}$ modifies $K_{t-1}$ to become $K_t$, sequentially building the 3D kernel from $K_0$. This way we impose a strong sequential relationship between the slices along the temporal dimension of our kernel.
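As a rough illustration of the parameter counts discussed above, the following sketch compares a full 3D kernel with the factorized form, assuming one learned 2D slice plus four affine parameters (scale, rotation and two translations) per transformation. The function names and the particular 2x3 matrix layout are assumptions made for this example and are not taken from the original implementation.

```python
import math


def affine_matrix(scale, rotation, shift_x, shift_y):
    """Build a 2x3 affine matrix (assumed layout) from four parameters.

    rotation is given in radians.
    """
    return [
        [scale * math.cos(rotation), -scale * math.sin(rotation), shift_x],
        [scale * math.sin(rotation), scale * math.cos(rotation), shift_y],
    ]


def count_parameters(T, W, H, shared=False):
    """Return (full, factorized) parameter counts for a single-channel kernel.

    full:        every value of the T x W x H kernel is learned.
    factorized:  one W x H slice plus 4 affine parameters per transformation;
                 T - 1 transformations, or a single one if shared=True.
    """
    full = T * W * H
    n_transforms = 1 if shared else T - 1
    factorized = W * H + 4 * n_transforms
    return full, factorized


# Example with a hypothetical 5 x 5 x 5 kernel.
print(count_parameters(5, 5, 5))
print(count_parameters(5, 5, 5, shared=True))
```

With these assumptions the saving grows with the temporal extent of the kernel, since each additional slice costs four parameters instead of a full 2D slice.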
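The nonlinear transformation $\mathcal{T}$ is applied in two stages, following the spatial transformer formulation [4]. First, the transformation parameters $\theta_t$ are transformed into a sampling grid $G$ that matches the shape of the input feature map, plus an explicit dimension for each spatial dimension, $G \in \mathbb{R}^{2 \times W \times H}$. Here the previous slice $K_{t-1}$ is the input feature map and the new slice $K_t$ is the output feature map. We should think of this transformation as an explicit spatial mapping of $\theta_t$ into the input feature space. Each sampling coordinate in the input space is split into separate components $x^s_{ij}$ and $y^s_{ij}$, with $1 \le i \le W$ and $1 \le j \le H$, and calculated from the corresponding output coordinate $(x^o_{ij}, y^o_{ij})$ as

$$\begin{pmatrix} x^s_{ij} \\ y^s_{ij} \end{pmatrix} = \theta_t \begin{pmatrix} x^o_{ij} \\ y^o_{ij} \\ 1 \end{pmatrix}.$$

Now that we have the sampling grid $G$, we can obtain a spatially transformed output feature map $K_t$ from our input feature map $K_{t-1}$. To interpolate the values of our new temporal kernel slice we use bilinear interpolation. For one particular pixel coordinate $(i, j)$ in the output map we compute

$$K_t(i, j) = \sum_{n=1}^{W} \sum_{m=1}^{H} K_{t-1}(n, m) \, \max\left(0, 1 - |x^s_{ij} - n|\right) \max\left(0, 1 - |y^s_{ij} - m|\right).$$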
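The two-stage transformation described above can be sketched in plain Python as follows. This is a minimal illustration rather than the original implementation: it assumes a 2x3 affine matrix per step, normalized coordinates in [-1, 1] as in spatial transformer networks, and zero padding outside the slice. The names `transform_slice` and `build_3dtt_kernel` are hypothetical.

```python
import math


def _to_normalized(index, size):
    """Map a pixel index to a coordinate in [-1, 1] (assumed convention)."""
    return 2.0 * index / (size - 1) - 1.0 if size > 1 else 0.0


def _to_pixel(coord, size):
    """Map a normalized coordinate back to a (fractional) pixel position."""
    return (coord + 1.0) * (size - 1) / 2.0


def transform_slice(prev, theta):
    """Resample a 2D slice (list of rows) under a 2x3 affine matrix."""
    height, width = len(prev), len(prev[0])
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            # Stage 1: sampling grid, i.e. where in the input to read from.
            xo = _to_normalized(j, width)
            yo = _to_normalized(i, height)
            xs = theta[0][0] * xo + theta[0][1] * yo + theta[0][2]
            ys = theta[1][0] * xo + theta[1][1] * yo + theta[1][2]
            px, py = _to_pixel(xs, width), _to_pixel(ys, height)
            # Stage 2: bilinear interpolation over the four nearest pixels.
            x0, y0 = math.floor(px), math.floor(py)
            value = 0.0
            for n in (y0, y0 + 1):
                for m in (x0, x0 + 1):
                    if 0 <= n < height and 0 <= m < width:
                        weight = max(0.0, 1 - abs(px - m)) * max(0.0, 1 - abs(py - n))
                        value += prev[n][m] * weight
            out[i][j] = value
    return out


def build_3dtt_kernel(first_slice, thetas):
    """Build a list of temporal slices by transforming each previous slice."""
    slices = [first_slice]
    for theta in thetas:
        slices.append(transform_slice(slices[-1], theta))
    return slices


# Example: a 3 x 3 slice expanded to three temporal slices.
k0 = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
kernel = build_3dtt_kernel(k0, [identity, identity])
print(len(kernel))
```

In a trainable version the same computation would be expressed with differentiable tensor operations, so that gradients flow to both the first slice and the transformation parameters.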
Given that our method transforms temporal kernel slices, we refer to 3D kernels composed with our method as 3DTT kernels. Convolutional networks that use 3DTT kernels instead of regular 3D kernels are referred to as 3DTT convolutional networks or 3DTTNs.
In order to test our method we constructed a dataset, referred to as Video-MNIST, in which each class has a different appearance and dynamic behavior. Video-MNIST is a novel variant of the popular MNIST dataset. It contains 70000 sequences of 30 frames each; every sequence shows an affine transformation applied to a single original digit moving within the frame. The class-specific affine transformations are restricted to scale, rotation and x, y translations; see Table 1. We maintain the same train-validation-test split as in the original MNIST dataset. To make the problem more difficult and reliant on both spatial and motion cues, several classes contain random variations of their specific transformation. For some classes, the initial direction (left or right) and the initial velocity at which the digit travels per frame are varied. In class 5, the direction of rotation and the radius of the circular path are varied. In addition, we allow the digits to go partially out of frame or almost vanish. We also made sure that there are overlapping movements between classes, such as rotation or translation in the same direction. Finally, some classes can appear visually similar because of the transformation. One example of each class is illustrated in Figure 2.
| Class | Transformation |
|---|---|
| 2 | scales down and then up |
| 5 | moves along a circular path |
| 6 | scales up while rotating clockwise |
| 7 | moves horizontally while rotating counter-clockwise |
| 8 | rotates clockwise and then counter-clockwise |
| 9 | random rotation and horizontal + vertical movements |
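To make the idea of a class-specific transformation concrete, the sketch below generates a per-frame scale factor for a sequence that scales down and then up, as in class 2. The function name, the linear schedule and the scale bounds are all hypothetical choices for illustration; the source does not specify how the actual sequences were generated.

```python
def scale_schedule(n_frames=30, s_min=0.5, s_max=1.0):
    """Per-frame scale that shrinks linearly to about s_min and grows back.

    s_min and s_max are illustrative values, not taken from the dataset.
    """
    half = (n_frames - 1) / 2
    return [
        s_min + (s_max - s_min) * abs(t - half) / half
        for t in range(n_frames)
    ]


# One scale factor per frame of a 30-frame sequence.
schedule = scale_schedule()
print(len(schedule))
```

Each scale factor would then be applied to the original digit to render the corresponding frame.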
3.3 Model architectures
We use the LeNet-5 architecture [9], as it is a good starting point for training a model on a variant of the MNIST dataset. The original 2D convolutions are replaced with regular 3D convolutions for LeNet-5-3D and with 3DTT convolutions for LeNet-5-3DTTN. The number of filters in each convolutional layer can vary: during experimentation we noticed that, for both LeNet-5-3D and LeNet-5-3DTTN, better performance can be achieved by either increasing or reducing the number of filters in the convolutional layers. LeNet-5-3D serves as the baseline model.
3.4 Training and inference
All models are optimized using SGD with momentum. Depending on the model, the starting learning rate varies. The models are trained for a fixed number of epochs; at regular epoch intervals the learning rate is decreased exponentially if the validation accuracy has not improved. We noticed that this number of epochs provides a good time window for the models to converge. The same batch size is used in general, except when training on the smallest number of videos, in which case a different batch size is used. The initial weights of the LeNet-5-3D model, as well as those of all fully connected layers, follow a Kaiming-uniform scheme [3], which is the default setting in PyTorch. LeNet-5-3DTTN initializes its learned 2D kernel slices with weights sampled from a Gaussian distribution; experimentally this gave the best results, although there was very little difference between different types of initialization. In our main experiments we use the affine parameterization of the transformation, with a fixed initialization of its parameters.
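The learning-rate rule described above can be sketched as follows. The starting rate, decay factor and check interval are placeholder values, since the actual settings are not given here, and the interpretation that "not improved" refers to the epochs since the previous check is an assumption. The function name is hypothetical.

```python
def learning_rate_history(val_accuracies, lr0=0.01, gamma=0.5, check_every=5):
    """Return the learning rate in effect after each epoch.

    Every `check_every` epochs the rate is multiplied by `gamma` if the
    validation accuracy did not reach a new best since the previous check.
    All default values are illustrative placeholders.
    """
    lr = lr0
    best = float("-inf")
    improved = False
    history = []
    for epoch, accuracy in enumerate(val_accuracies, start=1):
        if accuracy > best:
            best = accuracy
            improved = True
        if epoch % check_every == 0:
            if not improved:
                lr *= gamma
            improved = False  # start a new observation window
        history.append(lr)
    return history


# Accuracy improves early and then plateaus, which triggers one decay.
print(learning_rate_history([0.5, 0.6, 0.6, 0.6], check_every=2))
```

In PyTorch, a comparable effect could be approximated with a plateau-based scheduler, though the exact rule used in the experiments may differ.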
3.4.2 Replication of video selection
Each model is run 30 times (runs) with the same initialization parameters but with different randomly initialized weights for the convolutional and fully connected layers. The training data is randomly selected across different runs by using a seed. The seed ensures that the same videos are chosen again when we execute the same run with a different model or with different initialization parameters. This way we can compare only the difference between model architectures and parameters without confounding our results with video variance. Given that we experiment with very few videos, we make sure that the classes are represented equally in the randomly selected training data.
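A seeded, class-balanced selection of training videos along these lines might look like the sketch below. It assumes that labels are available as a flat list and that the requested number of videos is a multiple of the number of classes; the function name and signature are hypothetical.

```python
import random
from collections import defaultdict


def select_balanced(labels, n_videos, seed):
    """Pick n_videos indices with equal class representation.

    The same seed always yields the same selection, so different models
    can be trained on identical videos within a run.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    classes = sorted(by_class)
    if n_videos % len(classes) != 0:
        raise ValueError("n_videos must be a multiple of the number of classes")
    per_class = n_videos // len(classes)

    chosen = []
    for label in classes:
        chosen.extend(rng.sample(by_class[label], per_class))
    return sorted(chosen)


# Toy example: 1000 videos over 10 classes, 20 selected for run number 3.
toy_labels = [i % 10 for i in range(1000)]
print(select_balanced(toy_labels, n_videos=20, seed=3))
```

Using the run index as the seed keeps the selection identical across models while still varying it between runs.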
Model selection is based on the accuracy on the validation split. The 30 models in the run with the highest average accuracy are evaluated on the test split. Each run is essentially the same model using the same hyperparameters but with different randomized weight initializations for the convolutional and fully connected layers. In the end, the test results of the 30 models are averaged and the standard error of the results is calculated. The final results can be seen in Figure 3.
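The averaging step can be illustrated with a short helper that computes the mean and the standard error of the mean over a list of test accuracies. This sketch assumes the standard error is based on the sample standard deviation (with Bessel's correction); the accuracies in the example are made up.

```python
import math


def mean_and_standard_error(values):
    """Return the mean and the standard error of the mean of `values`."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    # Sample variance with Bessel's correction (n - 1 in the denominator).
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)


# Made-up test accuracies from three runs.
mean, sem = mean_and_standard_error([0.6, 0.7, 0.8])
print(round(mean, 3), round(sem, 3))
```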
To test whether our method can outperform conventional 3D convolutions on very few data points, we train each model on different numbers of training videos. This way we can test how data-efficient our method is. The total number of videos is varied from low to high: 10, 20, 30, 40, 50, 100, 500, 1000, 2000, 5000. Model selection is performed separately for each number of videos, so the models trained on 10 videos are different from the models trained on 20 videos. After model selection based on the validation split, the models from the best run perform inference on the test split.
In Table 3 and Figure 3 we can see that our method significantly outperforms the conventional 3D convolution in the low data regime. However, when ample training data is available, the conventional 3D convolution outperforms our method, as is to be expected. It is worth mentioning that, in general, our method uses fewer parameters and still achieves reasonable results in all settings.
We propose a novel factorization method for 3D convolutional kernels. Our method factorizes the 3D kernel along the temporal dimension and provides a way to learn the 3D kernel through transformations of a 2D kernel, thereby greatly reducing the number of parameters needed. We demonstrate that our method significantly outperforms the conventional 3D convolution in the low data regime, yielding 0.58 vs. 0.65 on average for LeNet-5-3D and LeNet-5-3DTTN respectively. Additionally, our model achieves competitive results in the high data regime, with 0.95 vs. 0.94 on average for LeNet-5-3D and LeNet-5-3DTTN respectively, while using fewer parameters. Hence, 3DTTNs provide a useful building block when estimating models for video processing in the low data regime. In future work, we will explore in which real-world problem settings 3DTTNs outperform their nonfactorized counterparts.
-  (2011) Sequential deep learning for human action recognition. In International workshop on human behavior understanding, pp. 29–39. Cited by: §1.
-  (2016) Deep learning. Vol. 1, MIT Press. Cited by: §1.
-  (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034. Cited by: §3.4.1.
-  (2015) Spatial transformer networks. In Advances in neural information processing systems, pp. 2017–2025. Cited by: §1.1, §2.
-  (2012) 3D convolutional neural networks for human action recognition. IEEE transactions on pattern analysis and machine intelligence 35 (1), pp. 221–231. Cited by: §1.
-  (2014) Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1725–1732. Cited by: §1.
-  (2016) Segmental spatiotemporal cnns for fine-grained action segmentation. In European Conference on Computer Vision, pp. 36–52. Cited by: §1.
-  (2016) Temporal convolutional networks: a unified approach to action segmentation. In European Conference on Computer Vision, pp. 47–54. Cited by: §1.
-  (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §3.3, §3.
-  (1989) Generalization and network design strategies. In Connectionism in perspective, Cited by: §1.
-  (2017) Learning spatio-temporal representation with pseudo-3d residual networks. In proceedings of the IEEE International Conference on Computer Vision, pp. 5533–5541. Cited by: §1.
-  (2015) Human action recognition using factorized spatio-temporal convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 4597–4605. Cited by: §1, §2.
-  (2018) A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 6450–6459. Cited by: §1, §2.
-  (2017) Long-term temporal convolutions for action recognition. IEEE transactions on pattern analysis and machine intelligence 40 (6), pp. 1510–1517. Cited by: §1.
-  (2015) Beyond short snippets: deep networks for video classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4694–4702. Cited by: §1.