Understanding the inner workings of ConvNets is important when they are used to make actionable decisions or when humans have to make actionable decisions based on lower-level decisions from decision support systems. Three-dimensional convolutional networks (3DConvNets) Baccouche et al. (2011) are a natural extension of 2DConvNets and have been investigated for the processing of spatiotemporal data, e.g., action recognition in video Ji et al. (2012); Tran et al. (2015); Varol et al. (2017); Carreira and Zisserman (2017). While some of these models achieve state-of-the-art results in video action recognition benchmarks, the temporal aspect of their inner workings remains difficult to interpret. Given that a 3DConv filter is simply the 3D extension of the 2DConv filter, one could apply visualization methods that are suitable for 2DConvNets to the individual slices along the temporal axis of a 3DConv filter Carreira and Zisserman (2017); Anders et al. (2019); Yang et al. (2018). This approach is effective to the degree that we can gain insight into what the model learns in terms of spatial features. However, these methods do not provide a meaningful insight into the temporal dynamics that the model takes into consideration.
Erhan et al. (2009) visualized important features in an arbitrary layer of a DNN by optimizing a randomly initialized input such that the activation of the chosen neuron in a layer is maximized. More recent work Simonyan et al. (2013); Olah et al. (2017) extends this method, resulting in beautiful visualizations of not only neurons but entire channels, layers, and class representations. One big obstacle to this approach is that the resulting visualizations can be difficult to recognize and are subject to interpretation. Applying these optimization-based methods to 3DConvs will likely exacerbate these problems given that the search space for optimization is at least one order of magnitude larger (depending on the number of frames in the video).
Gradient-based methods reveal which input region causes high activations in specific network components Simonyan et al. (2013); Montavon et al. (2017); Zhou et al. (2016); Selvaraju et al. (2017). Applying this method to a 3DConv results in a saliency sequence that can be applied to the input video to reveal spatial components that cause a channel to have a high output. Concrete examples can be seen in Anders et al. (2019), where gradient-based analysis Simonyan et al. (2013) and deep Taylor decomposition and layerwise relevance propagation Montavon et al. (2017) are used on a 3DConvNet to analyze model predictions. Yang et al. (2018) explain model predictions by adapting CAM Zhou et al. (2016) and Grad-CAM Selvaraju et al. (2017) for their 3DConvNet. While these methods return an accurate spatial representation of which input components are responsible for high channel output, they are restricted to the spatial domain.
3TConvNets, in contrast, are 3DConvNets where the convolutional filter is factorized into a 2D spatial filter and corresponding temporal parameters that transform the 2D filter into a 3D filter. The 3TConvNet learns explicit temporal affine transformations whose values can be plotted in an interpretable graph. This alternative approach to learning a 3D filter offers a completely novel way of visualizing and understanding the temporal dynamics learned by a 3DConvNet. In addition, 3TConvs bring the best of both worlds: because the 3D filter is built from a 2D filter and a set of transformation parameters, many visualization methods can be used as-is or in combination with the transformation parameters. Using 3TConvs gives us a much stronger grasp on the interpretation of spatiotemporal features because the temporal features are directly interpretable regardless of the visualization methods used. The goal of this paper is not to improve benchmarks, but to demonstrate that 3TConv can increase human understanding of automated video analyses by providing visual representations of motion features. Thus it contributes to explainable AI and enables a more informed evaluation of the societal implications of AI use cases such as video surveillance.
2 Related work
Our method bears a resemblance to other types of convolutions that have previously factorized spatiotemporal processing into separate spatial and temporal components and to convolutions that incorporate affine transformations. Jaderberg et al. (2015) apply similar affine transformations to features extracted from 2D filters. Our method differs in the sense that it is used to sequentially build a 3D filter, while in Jaderberg et al. (2015) the affine transformations are used to obtain spatially robust feature representations in 2DConvNets. Group equivariant convolutions Cohen and Welling (2016) make use of symmetry groups containing reflection and rotation to extend translational spatial symmetry. The goal is to learn less redundant convolutional filters in the spatial domain. Our method is not the first to apply the concept of separating the spatial and temporal dimensions in 3DConvNets. Tran et al. (2018) factorize the individual 3D convolutional filters into separate spatial and temporal components called R(2+1)D blocks. This creates two specific learning phases: a spatial feature learning phase and a temporal feature learning phase guided by one weight per temporal dimension. In our method, we also separate the spatial and temporal components; however, we use four affine parameters instead of one weight. Also, we impose the restriction that the slices in the filter are dependent on one another, thereby avoiding the need for two distinct phases. Instead, spatiotemporal features are learned jointly.
3.1 Background and notation
In this section we introduce some basic notation to describe 3D convolutional layers. We denote the tensor of convolutional filter weights as $K \in \mathbb{R}^{C \times T \times W \times H}$, where $C$, $T$, $W$ and $H$ refer to the number of channels, the temporal depth, the width and the height of the filter respectively. Slicing along the temporal dimension, we end up with $T$ 2D filters $K_0, \dots, K_{T-1}$. The slice at time lag $t$ is denoted as $K_t$. Ignoring the bias term, the forward pass of a 3DConv layer is given by
$$Y = \sigma(K * X),$$
where $X$ is the input tensor, $Y$ is the output tensor and $\sigma$ is an activation function. In standard 3DConvNets, each 2D filter $K_t$ is learned independently. This parameterization is arguably unparsimonious as many features of the input stream vary smoothly. In the next section, we will introduce an alternative parameterization in which the $(t+1)$-th filter is the result of a differentiable transformation of the $t$-th filter.
If the input stream has a smooth time dependency, it is reasonable to assume that the filters required to analyze the $(t+1)$-th time lag are similar to the filters required to analyze the $t$-th lag. (Footnote 1: This is an extension of our earlier preliminary method, Anonymous (2019).) We can formalize this insight by parameterizing the $(t+1)$-th filter as a smooth transform of its predecessor, i.e.,
$$K_{t+1} = \mathcal{T}(K_t; \theta_t),$$
where $\theta_t$ is a set of transformation parameters. We denote a filter parameterized in this way as a 3TConv filter. Essentially, this is a way to impose a strong sequential relationship between the slices along the temporal dimension of our filter. While regular 3DConvNets learn the entire $K$ directly, 3TConvNets only learn $K_0$ and $\theta$.
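The sequential construction described above, in which each temporal slice is obtained by transforming its predecessor, can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' implementation: the names `build_3t_filter` and `shift_rows` are hypothetical, and the transform used here is a toy integer shift standing in for the learned transformation.

```python
def build_3t_filter(k0, thetas, transform):
    """Return the slices [K_0, K_1, ..., K_{T-1}] of one filter.

    Only k0 and thetas would be trainable; every later slice is derived
    from its predecessor via K_{t+1} = transform(K_t, theta_t).
    """
    slices = [k0]
    for theta in thetas:
        slices.append(transform(slices[-1], theta))
    return slices


def shift_rows(k, shift):
    """Toy stand-in transform (an assumption for illustration only):
    circularly shift every row of a 2D slice to the right by `shift` pixels."""
    out = []
    for row in k:
        n = shift % len(row)
        out.append(row[-n:] + row[:-n] if n else list(row))
    return out


k0 = [[1, 0, 0],
      [0, 1, 0]]
filter_slices = build_3t_filter(k0, thetas=[1, 1], transform=shift_rows)
```

With two transformation parameters the result has three slices, so the temporal depth is always one more than the number of transformations.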
3.3 Affine 3TConv
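One of the main sources of temporal variability in video streams is the optical flow due to the motion of the camera and of the background. These movements induce a constant optical flow that can be modeled as a global affine transformation of the frames. This suggests an affine parameterization of the transformation function in terms of translation, rotation and scaling parameters. Specifically, for every pair of consecutive slices $K_t$ and $K_{t+1}$ we have
$$\theta_t = \begin{pmatrix} s_t \cos\gamma_t & -s_t \sin\gamma_t & \Delta x_t \\ s_t \sin\gamma_t & s_t \cos\gamma_t & \Delta y_t \end{pmatrix},$$
where $s_t$ is the scale, $\gamma_t$ the rotation and $(\Delta x_t, \Delta y_t)$ the translation.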
Compared to a regular 3D filter, which has $T \times W \times H$ parameters, the 3T filter only has $W \times H + 4(T-1)$ trainable parameters.
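To make the parameter saving concrete, the following sketch counts trainable parameters for a single filter channel. It assumes that a 3T filter learns one $W \times H$ spatial slice plus four affine parameters (scale, rotation and two translations) for each of the $T-1$ transitions between slices; the function names are hypothetical.

```python
def count_3d_params(depth, width, height):
    """Parameters of a regular 3D filter channel: every slice is learned."""
    return depth * width * height


def count_3t_params(depth, width, height, params_per_transition=4):
    """Parameters of a 3T filter channel under the stated assumption:
    one learned 2D slice plus four affine parameters per transition."""
    return width * height + params_per_transition * (depth - 1)


regular = count_3d_params(depth=3, width=7, height=7)
factorized = count_3t_params(depth=3, width=7, height=7)
saving = 1.0 - factorized / regular
```

For a 3 x 7 x 7 filter this gives 147 versus 57 parameters per channel. The saving grows with temporal depth, because each extra slice costs four parameters instead of a full $W \times H$ slice.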
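The nonlinear transformation is applied in two stages. First, $\theta_t$ is transformed into a sampling grid $G$ that matches the shape of the input feature map, plus an explicit dimension for each spatial dimension, $G \in \mathbb{R}^{H \times W \times 2}$. Here $K_t$ is the input feature map and $K_{t+1}$ is the output feature map. We should think of this transformation as an explicit spatial mapping of $\theta_t$ into the input feature space. Each coordinate $(x_j, y_i)$ from the input space is split into a separate $x$ component $G^x_{i,j}$ and $y$ component $G^y_{i,j}$, calculated as
$$\begin{pmatrix} G^x_{i,j} \\ G^y_{i,j} \end{pmatrix} = \theta_t \begin{pmatrix} x_j \\ y_i \\ 1 \end{pmatrix},$$
where $\theta_t$ is the $2 \times 3$ affine matrix. Now that we have the sampling grid $G$, we can obtain a spatially transformed output feature map $K_{t+1}$ from our input feature map $K_t$. To interpolate the values of our new temporal filter slice we use bilinear interpolation. For one particular pixel coordinate $(i, j)$ in the output map we compute
$$K_{t+1}(i, j) = \sum_{n=1}^{H} \sum_{m=1}^{W} K_t(n, m) \, \max(0, 1 - |G^x_{i,j} - m|) \, \max(0, 1 - |G^y_{i,j} - n|),$$
with the grid coordinates expressed in pixel units.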
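The two stages described above (building a sampling grid from the affine parameters, then bilinear interpolation) can be sketched in plain Python for a single-channel slice. This is an illustrative sketch under stated assumptions, not the authors' PyTorch code: coordinates are assumed to be normalized to [-1, 1] with the corner pixels at the extremes, the rotation is given in radians, samples falling outside the slice contribute zero, the slice must be at least 2 x 2, and all function names are hypothetical.

```python
import math


def affine_matrix(s, gamma, dx, dy):
    """2x3 affine matrix from scale, rotation (radians) and translation."""
    c, r = s * math.cos(gamma), s * math.sin(gamma)
    return [[c, -r, dx], [r, c, dy]]


def sampling_grid(theta, height, width):
    """Stage 1: map each output pixel to a source location (normalized coords)."""
    grid = []
    for i in range(height):
        y = -1.0 + 2.0 * i / (height - 1)
        row = []
        for j in range(width):
            x = -1.0 + 2.0 * j / (width - 1)
            xs = theta[0][0] * x + theta[0][1] * y + theta[0][2]
            ys = theta[1][0] * x + theta[1][1] * y + theta[1][2]
            row.append((xs, ys))
        grid.append(row)
    return grid


def bilinear_sample(src, grid):
    """Stage 2: bilinearly interpolate src at the grid locations."""
    height, width = len(src), len(src[0])
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            xs, ys = grid[i][j]
            # Convert normalized coordinates back to pixel units.
            px = (xs + 1.0) * (width - 1) / 2.0
            py = (ys + 1.0) * (height - 1) / 2.0
            x0, y0 = math.floor(px), math.floor(py)
            total = 0.0
            for n in (y0, y0 + 1):
                for m in (x0, x0 + 1):
                    if 0 <= n < height and 0 <= m < width:
                        w = max(0.0, 1.0 - abs(px - m)) * max(0.0, 1.0 - abs(py - n))
                        total += w * src[n][m]
            out[i][j] = total
    return out


def next_slice(k_t, s, gamma, dx, dy):
    """Derive the next temporal slice from k_t and four affine parameters."""
    theta = affine_matrix(s, gamma, dx, dy)
    return bilinear_sample(k_t, sampling_grid(theta, len(k_t), len(k_t[0])))


k_t = [[1.0, 2.0, 3.0],
       [4.0, 5.0, 6.0],
       [7.0, 8.0, 9.0]]
identity = next_slice(k_t, s=1.0, gamma=0.0, dx=0.0, dy=0.0)
shifted = next_slice(k_t, s=1.0, gamma=0.0, dx=1.0, dy=0.0)
```

With the identity parameters the slice is reproduced unchanged. In this normalized convention a translation of 1.0 on a 3-pixel-wide slice moves the sampling location by exactly one pixel, so `shifted` contains the original values moved one column to the left with zeros entering on the right.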
3.4 Explaining temporal dynamics with 3TConv
Temporal parameters containing the scale $s$, rotation $\gamma$ and translation parameters $\Delta x$ and $\Delta y$ are obtained from the convolutional layers of a trained 3TConvNet. When a 3TConvNet is first initialized, the parameters are set to the identity mapping: $s = 1$, $\gamma = 0$, and $\Delta x = \Delta y = 0$. After training the model, parameter values can be interpreted as follows. The parameter $s$ is the scaling factor relative to 1. The larger $s$, the bigger the resulting transformation. Applied to an image, this has the effect of zooming in or out. The parameter $\gamma$ is the rotation measured in degrees, where a positive value of $\gamma$ indicates a counter-clockwise rotation. In the resulting plots we multiply $\gamma$ by $-1$ such that a positive value indicates a clockwise rotation. The translation parameters $\Delta x$ and $\Delta y$ in their raw form indicate what percentage of the image has translated. To obtain the amount of translation in pixel units we need to multiply by the width and height of the image: $\Delta x \cdot W$ and $\Delta y \cdot H$.
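The conversions described above can be collected in one small helper. This is a hedged sketch: the function name and dictionary keys are hypothetical, the raw translation values are assumed to be fractions of the image size, and the rotation is assumed to be given in degrees with positive values meaning counter-clockwise, as stated in the text.

```python
def interpret_parameters(s, gamma_deg, dx, dy, width, height):
    """Convert raw temporal parameters into plot-ready, human-readable values."""
    return {
        # Scale is read relative to the identity value of 1.
        "scale_change": s - 1.0,
        # Sign flipped so that a positive value means clockwise rotation.
        "clockwise_rotation_deg": -gamma_deg,
        # Fractions of the image size converted to pixel units.
        "shift_x_px": dx * width,
        "shift_y_px": dy * height,
    }


at_init = interpret_parameters(1.0, 0.0, 0.0, 0.0, width=224, height=224)
learned = interpret_parameters(1.25, 10.0, 0.5, -0.25, width=100, height=200)
```

At initialization every interpreted value is zero, so any non-zero value in a plot reflects something the channel learned during training.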
In the following experiments we will demonstrate how 3TConvs can be used for explainability. Modified versions of ResNet18 He et al. (2016) and GoogLeNet Szegedy et al. (2015) are used. In both architectures the 2DConv filters are replaced by 3DConv and 3TConv filters. The networks are trained on the Jester dataset Materzynska et al. (2019) to classify 27 different hand gestures and on the UCF101 dataset Soomro et al. (2012) to classify 101 human activities in various scenarios. Implementation was done in PyTorch Paszke et al. (2019), and for the experiments using transfer learning, model weights are obtained from pretrained GoogLeNet and ResNet18 models from the torchvision model zoo. Details about data pre-processing and model training, as well as a link to the codebase, can be found in the Supplementary Materials.
4.1 Performance comparison
Given that the goal of our paper is to demonstrate the explanation capabilities of the 3TConv, we only briefly address the performance difference between 3DConvNets and 3TConvNets on hand-gesture and activity recognition. The performance comparison is shown in Table 1. On UCF101, pretrained 3TConvNets can outperform pretrained 3DConvNets using 33% to 53% fewer parameters. However, on the Jester dataset 3DConvNets outperform 3TConvNets.
|Jester val acc||# params||UCF101 val acc||# params|
4.2 Applying 2D visualization methods to 3DConv
Next, we applied the gradient-based method from Simonyan et al. (2013) and the activation-maximization method from Olah et al. (2017) to visualize the Conv1 layer filters directly. The goal is to demonstrate how 2DConv visualization methods translate to the 3DConv-GoogLeNet domain, as a baseline against which to compare the 3TConv method. Figure 2 depicts the visualizations.
The gradient-based method results indicate that the network is picking up spatial components that are both recognizable to humans and that are relevant to classification. That is, areas of the hand are causing specific channels to return high-activation responses. While the method provides an unambiguous indication of the importance of visual components, it does not provide a structure to identify what temporal features are being learned.
As mentioned in Section 1, the application of activation-maximization can lead to visual patterns that are difficult to interpret. Figure 2 indeed shows that it is difficult to relate the patterns observed in the images to visual aspects of objects in the real world. With no visible continuation between frames, other than the low-frequency patterns themselves, interpretation from a temporal perspective becomes increasingly challenging. The visualizations of the Conv1 filters show that some filters remain unchanged while others exhibit minor color variations. This is likely the result of the small learning rate used to fine-tune the network. Given that the model trains and achieves very reasonable results, we hypothesize that in this model, the spatial features alone are sufficient for the model to make predictions. In any case, no insight can be gained into the temporal qualities that the model may extract from the data.
4.3 Using temporal transformations and 2D visualization methods
To compare the results of 3TConv with 3DConv, the visualization methods in Section 4.2 are applied to the pretrained 3TConv-GoogLeNet trained on the Jester dataset. To make use of the transformation parameters, both the gradient-based and activation-maximization methods were adapted to only produce an image and not an entire video. (Footnote 2: Note that this strategy can also be applied to 3DConvNets; however, given that 3DConvs do not have explicit temporal parameters, no insight into the temporal mechanics can be derived.) In the gradient-based method, only the frame causing the highest activation is shown. In the activation-maximization method, a single frame is optimized and copies of it are stacked to create a video input. Results are shown in Figure 3.
Similar to the results for the 3DConv-GoogLeNet, the gradient results for 3TConv-GoogLeNet show that recognizable components in the input are responsible for the high activations in the channels. The visualizations for activation-maximization remain uninterpretable; however, we do notice that the images exhibit more structure compared to the ones in Figure 2. This is likely the result of optimizing for a single image rather than optimizing a video; optimizing for a single image is an easier task because an image contains fewer free parameters than a video. For the indicated channels in the Conv1 visualizations, we can see transitions that correspond with the temporal parameters. For the channels in each example, we can plot the temporal parameters directly and gain insight into what that particular channel has learned. For example, for inception3b we can directly read from the graphs that channel 141 learned the motion of zooming out while moving up-left and rotating counterclockwise. This provides us with a novel way to explain model behavior from the temporal perspective, independent of the visualization methods used.
4.4 Analysis of temporal parameters across models and datasets
In the previous section, the temporal parameters were analyzed on a per-channel level. The temporal parameters can also be used to understand the global decision-making process of the model in interpretable and quantifiable model statistics. In Figure 4, the distributions of the learned temporal parameters of 3TConv-ResNet18 and 3TConv-GoogLeNet trained on the Jester and UCF101 datasets are visualized.
|pret. 3TConv-GoogLeNet||pret. 3TConv-ResNet18|
|28.0 / 46.14||6.24 / 7.55||8.46 / 16.1||3.57 / 5.3|
|11.11 / 13.77||2.38 / 2.67||3.26 / 5.44||1.65 / 2.55|
|7.58 / 10.27||1.47 / 1.48||1.87 / 3.16||1.49 / 2.07|
|8.14 / 10.75||1.71 / 1.65||1.99 / 3.5||1.51 / 2.21|
Means and standard deviations of estimated parameters for pretrained models ().
Results show that models trained on the Jester dataset develop temporal parameters that vary more in value and range compared to the parameters of the UCF101 dataset.
4.5 Results for tabula rasa models
Finally, to quantify the benefits of using pretraining, we compare the results of the previous section with those of tabula rasa models that were trained from scratch.
|Jester val acc||UCF101 val acc|
Table 3 compares the classification accuracies for 3TConvNets that were either pretrained or trained from scratch. Pretrained models massively outperform tabula rasa models, showing that transfer learning has a significant impact on 3TConv performance.
|40.14 / 67.09||26.86 / 34.69||42.37 / 51.92||14.85 / 19.47|
|30.12 / 47.2||23.16 / 32.11||29.31 / 38.26||13.12 / 19.32|
|19.88 / 32.05||13.72 / 17.86||18.21 / 22.85||7.06 / 8.95|
|19.81 / 31.97||13.32 / 18.5||17.94 / 22.59||7.23 / 10.22|
Figure 5 and Table 4 show the parameters estimated for tabula rasa models that were trained from scratch. A comparison between these results and those of the previous section shows that the difference in results for the Jester and UCF101 datasets becomes more pronounced for tabula rasa models. This is to be expected since models trained from scratch are not endowed with good visual features and likely develop a larger repertoire of temporal parameters to compensate for this. It seems that, on the Jester dataset, the model needs to develop a larger variety of temporal parameters to do classification. This suggests that classification on the Jester dataset is more dependent on affine motion transformations than classification on the UCF101 dataset. This is not surprising since many of the classes in UCF101 are not dependent on unique motion patterns at all. For example, the SkyDiving, Skiing, and Skijet classes can be classified based on spatial features alone. In contrast, the classes in the Jester dataset are strongly dependent on what kind of motion the person performs with their hand.
The comparison with pretrained models further reveals that the dependency of the models on the temporal parameters decreases strongly across models and datasets when initialization with pretrained weights is provided. This implies that when the model can extract good spatial features, the dependency on learning a broad range of temporal parameters decreases. This suggests that tabula rasa 3TConv models develop a greater dependency on temporal parameters. This interesting phenomenon warrants further investigation.
Until now, methods for explaining temporal features learned by 3DConvNets in terms of simple and interpretable parameters were not available. In this paper, we make an attempt to bridge the gap between the ever-increasing demand for deep learning models to be interpretable and the availability of methods that allow us to interpret said models. We introduced the 3TConv as an interpretable alternative to the regular 3DConv. By training different models on different datasets and analyzing the results, we achieve novel insight into what temporal aspects drive model classification. It is also demonstrated that 3TConvNets can make use of pretrained 2DConv weights in order to boost classification performance, in some cases surpassing the performance of traditional pretrained 3DConvs while containing up to 50% fewer parameters.
This paper contributes to a more informed evaluation of the potential societal consequences of deep learning approaches to video analysis by proposing a method to achieve clarity about the nature of information processing in 3TConvNets. An important application of the type of research reported here is the usage of video classification for surveillance. This can have both positive and negative societal implications, depending on the purposes for which video classification is used, and on the effectiveness of the oversight mechanisms that aim to uphold conformity to the regulations, such as the GDPR in the EU. For a proper assessment of such implications, it is vital that human understanding of the computational processing involved is possible. The current research contributes to the explainability of 3D feature analysis by providing visual representations of what is being learned, in a format that is intuitively interpretable by humans. Therefore, it will be valuable to stakeholders that are interested in applying automated surveillance, as well as to stakeholders that are concerned about the potential infringement of human rights via such techniques.
We would like to thank Erdi Çallı and Elsbeth van Dam for the helpful discussions and general support.
- Anders et al. (2019). Understanding patch-based learning of video data by explaining predictions. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pp. 297–309.
- Anonymous (2019). Anonymized title.
- Baccouche et al. (2011). Sequential deep learning for human action recognition. In International workshop on human behavior understanding, pp. 29–39.
- Carreira and Zisserman (2017). Quo vadis, action recognition? a new model and the kinetics dataset. In , pp. 6299–6308.
- Cohen and Welling (2016). Group equivariant convolutional networks. In International conference on machine learning, pp. 2990–2999.
- Erhan et al. (2009). Visualizing higher-layer features of a deep network. University of Montreal 1341 (3), pp. 1.
- He et al. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
- Jaderberg et al. (2015). Spatial transformer networks. In Advances in neural information processing systems, pp. 2017–2025.
- Ji et al. (2012). 3D convolutional neural networks for human action recognition. IEEE transactions on pattern analysis and machine intelligence 35 (1), pp. 221–231.
- Materzynska et al. (2019). The jester dataset: a large-scale video dataset of human gestures. In Proceedings of the IEEE International Conference on Computer Vision Workshops.
- Montavon et al. (2017). Explaining nonlinear classification decisions with deep taylor decomposition. Pattern Recognition 65, pp. 211–222.
- Olah et al. (2017). Feature visualization. Distill 2 (11), pp. e7.
- Paszke et al. (2019). PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems.
- Selvaraju et al. (2017). Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pp. 618–626.
- Simonyan et al. (2013). Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
- Soomro et al. (2012). UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
- Szegedy et al. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition.
- Tran et al. (2015). Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 4489–4497.
- Tran et al. (2018). A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 6450–6459.
- Varol et al. (2017). Long-term temporal convolutions for action recognition. IEEE transactions on pattern analysis and machine intelligence 40 (6), pp. 1510–1517.
- Yang et al. (2018). Visual explanations from deep 3d convolutional neural networks for alzheimer’s disease classification. In AMIA Annual Symposium Proceedings, Vol. 2018, pp. 1571.
- Zhou et al. (2016). Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2921–2929.